EURASIP Journal on Bioinformatics and Systems Biology

Information Theoretic Methods for Bioinformatics

Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu

Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2007 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief: I. Tabus, Tampere University of Technology, Finland

Associate Editors: Jaakko Astola, Finland; Junior Barrera, Brazil; Michael L. Bittner, USA; Michael R. Brent, USA; Yidong Chen, USA; Paul Dan Cristea, Romania; Aniruddha Datta, USA; Bart De Moor, Belgium; Edward R. Dougherty, USA; J. Garcia-Frias, USA; Debashis Ghosh, USA; John Goutsias, USA; Roderic Guigó, Spain; Yufei Huang, USA; Seungchan Kim, USA; John Quackenbush, USA; Jorma Rissanen, Finland; Stéphane Robin, France; Paola Sebastiani, USA; Erchin Serpedin, USA; Ilya Shmulevich, USA; A. H. Tewfik, USA; Sabine Van Huffel, Belgium; Z. Jane Wang, Canada; Yue Wang, USA

Contents

Information Theoretic Methods for Bioinformatics, Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu, Volume 2007, Article ID 79128, 2 pages

Compressing Proteomes: The Relevance of Medium Range Correlations, Dario Benedetto, Emanuele Caglioti, and Claudia Chica Volume 2007, Article ID 60723, 8 pages

A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification, Chris Hemmerich and Sun Kim Volume 2007, Article ID 87356, 9 pages

Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates, Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama, and Wojciech Szpankowski Volume 2007, Article ID 14741, 11 pages

Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information, Arvind Rao, Alfred O. Hero III, David J. States, and James Douglas Engel Volume 2007, Article ID 13853, 13 pages

Splitting the BLOSUM Score into Numbers of Biological Significance, Francesco Fabris, Andrea Sgarro, and Alessandro Tossi Volume 2007, Article ID 31450, 18 pages

Aligning Sequences by Minimum Description Length, John S. Conery, Volume 2007, Article ID 72936, 14 pages

MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress, Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin, and Andrew S. Torres, Volume 2007, Article ID 43670, 16 pages

Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria, Haruo Suzuki, Rintaro Saito, and Masaru Tomita Volume 2007, Article ID 61374, 7 pages

Information-Theoretic Inference of Large Transcriptional Regulatory Networks, Patrick E. Meyer, Kevin Kontos, Frédéric Lafitte, and Gianluca Bontempi, Volume 2007, Article ID 79879, 9 pages

NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks, Petri Kontkanen, Hannes Wettig, and Petri Myllymäki, Volume 2007, Article ID 90947, 11 pages

Editorial Information Theoretic Methods for Bioinformatics

Jorma Rissanen,1,2 Peter Grünwald,3 Jukka Heikkonen,4 Petri Myllymäki,2,5 Teemu Roos,2,5 and Juho Rousu5

1 Computer Learning Research Center, University of London, Royal Holloway TW20 0EX, UK 2 Helsinki Institute for Information Technology, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland 3 Centrum voor Wiskunde en Informatica (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands 4 Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9203, 02015 HUT, Finland 5 Department of Computer Science, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland

Received 24 December 2007; Accepted 24 December 2007

Copyright © 2007 Jorma Rissanen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The ever-ongoing growth in the amount of biological data, the development of genome-wide measurement technologies, and the gradual, inevitable shift in molecular biology from the study of individual genes to the systems view; all these factors contribute to the need to study biological systems by statistical and computational means. In this task, we are facing a dual challenge: on the one hand, biological systems and hence their models are inherently complex, and on the other hand, the measurement data, while being genome-wide, are typically scarce in terms of sample sizes (the “large p, small n” problem) and noisy.

This means that the traditional statistical approach, where the model is viewed as a distorted image of something called a true distribution which the statisticians are trying to estimate, is poorly justified. This lack of rationality is particularly striking when one tries to learn the structure of the data by testing for the truth of a hypothesis in a collection where none of them is true. Similarly, the Bayesian approaches that require prior knowledge, which is either nonexistent or vague and difficult to express in terms of a distribution for the parameters, are subject to modeling assumptions which may bias the results in an unintended manner.

It was the editors’ intent and hope to encourage applications of techniques for model fitting influenced by information theory, originally created for communication theory but more recently expanded to cover algorithmic information theory and applicable to statistical modeling. In this view, the objective in modeling is to learn structures and properties in data by simply fitting models without requiring any of them to be “true”. The performance is not measured by any distance to the nonexisting “truth” but in terms of the probability they assign to the data, which is equivalent to the codelength with which the data can be encoded, taking advantage of the regular features the model prescribes to the data. This task requires information and coding theoretic means. Similarly, the frequently used distance measures like the Kullback-Leibler divergence and the mutual information express mean codelength differences.

D. Benedetto et al. study correlations and compressibility of proteome sequences. They identify dependencies at the range of 10 to 100 amino acids. The source of such dependencies is not entirely clear. One contributing factor in the case of interprotein dependencies is likely to be sequence duplication. The dependencies can be exploited in compression of proteome sequences. Furthermore, they seem to have a role in evolutionary and structural analysis of proteomes.

C. M. Hemmerich and S. Kim also use information theory for studying the correlations in protein sequences. They base their method on computing the mutual information of nonadjacent residues lying at a fixed distance d apart, where the distance is varied from zero to a fixed upper bound. The mutual information vector formed by these statistics is used to train a nearest-neighbor classifier to predict membership in protein families, with results indicating that the correlations between nonadjacent residues are predictive of protein family.

H. M. Aktulga et al. detect statistically dependent genomic sequences. Their paper addresses two applications. First, they identify different parts of a gene (maize zmSRp32) that are mutually dependent without appealing to the usual assumption that dependencies are revealed by a considerable amount of exact matches. It is discovered that dependencies exist between the 5′ untranslated region and its alternatively spliced exons. As a second application, they discover short tandem repeats which are useful in, for instance, genetic profiling. In both cases, the used techniques are based on mutual information.

The objective in the paper by A. Rao et al. is to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. Their methodology is based on the concept of directed information, a variant of mutual information introduced originally in the 1970s. It is shown that directed information can be successfully used for selecting motifs that discriminate between tissue-specific and nonspecific LREs. In particular, the performance of directed information is better than that of mutual information.

F. Fabris et al. present an in-depth study of BLOSUM (block substitution matrix) scores. They propose a decomposition of the BLOSUM score into three components: the mutual information of two compared sequences, the divergence of observed amino acid co-occurrence frequencies from the probabilities in the substitution matrix, and the background frequency divergence measuring the stochastic distance of the observed amino acid frequencies from the marginals in the substitution matrix. The authors show how the result of the decomposition, called BLOSpectrum, can be used to analyze questions about the correctness of the chosen BLOSUM matrix, the degree of typicality of compared sequences or their alignment, and the presence of weak or concealed correlations in alignments with low BLOSUM scores.

The paper by J. Conery presents a new framework for biological sequence alignment that is based on describing pairs of sequences by simple regular expressions. These regular expressions are given in terms of right-linear grammars, and the best grammar is found by use of the MDL principle. Essentially, when two sequences contain similar substrings, this similarity can be exploited to describe the sequences with fewer bits. The precise codelengths are determined with a substitution matrix that provides conditional probabilities for the event that a particular symbol is replaced by another particular symbol. One advantage of such a grammar-based approach is that gaps are not needed to align sequences of varying length. The author experimentally compares the alignments found by his method with those found by CLUSTALW. In a second experiment, he measures the accuracy of his method on pairwise alignments taken from the BAliBASE benchmark.

S. C. Evans et al. explore miRNA sequences based on MDLcompress, an MDL-based grammar inference algorithm that is an extension of the optimal symbol compression ratio (OSCR) algorithm published earlier. Using MDLcompress, they analyze the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Their results suggest that MDLcompress outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity.

The partially redundant third position of codons (protein-coding nucleotide triplets) tends to have a strongly biased distribution. The amount of bias is known to be correlated with G+C (guanine-cytosine) composition in the genome. In their paper, H. Suzuki et al. quantify the correlation of G+C composition with synonymous codon usage bias, where the bias is measured by the entropy of the third codon position. They show that the correlation depends on various genomic features and varies among different species. This raises several interesting questions about the different evolutionary forces causing the codon usage bias.

The paper by P. E. Meyer et al. tackles the challenging problem of inferring large gene regulatory networks using information theory. Their MRNET method extends the maximum relevance/minimum redundancy (MRMR) feature selection technique to networks by formulating the network inference problem as a series of input/output supervised gene selection procedures. Empirical results are competitive with the state-of-the-art methods.

P. Kontkanen et al. study the problem of computing the normalized maximum likelihood (NML) universal model for Bayesian networks, which are important tools for modeling discrete data in biological applications. The most advanced MDL method for model selection between such networks is based on comparing the NML distributions for each network under consideration, but the naive computation of these distributions requires exponential time with respect to the given data sample size. Utilizing certain computational tricks, and building on earlier work with multinomial and Naive Bayes models, the authors show how the computation can be performed efficiently for tree-structured Bayesian networks.

ACKNOWLEDGMENTS

We thank the Editor-in-Chief for the opportunity to prepare this special issue, and the staff of Hindawi for their assistance. The greatest credit is of course to the authors, who submitted contributions of the highest quality. We also thank the reviewers who have had a crucial role in the selection and editing of the ten papers appearing in the special issue.

Jorma Rissanen
Peter Grünwald
Jukka Heikkonen
Petri Myllymäki
Teemu Roos
Juho Rousu

Research Article Compressing Proteomes: The Relevance of Medium Range Correlations

Dario Benedetto,1 Emanuele Caglioti,1 and Claudia Chica2

1 Dipartimento di Matematica, Università di Roma “La Sapienza”, Piazzale Aldo Moro 5, 00185 Roma, Italy
2 Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany

Received 14 January 2007; Revised 28 May 2007; Accepted 10 September 2007

Recommended by Teemu Roos

We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at a short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to seize the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause for this redundancy is related to the evolutionary origin of proteomes and protein sequences.

Copyright © 2007 Dario Benedetto et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Protein sequences have been considered for a long time as nearly random or highly complex sequences, from the informational content point of view. The main reason for this is the local complexity of amino acid composition, that is, the type and number of amino acids found in a sequence segment, especially inside the globular domains [1]. This complexity could be related to the so called randomness of coding sequences in DNA, already pointed out in a pioneering work [2] and explained by evolutionary models [3]. Studies on protein sequence compression show that proteins behave as sequences of independent characters and have a very low compressibility, around 1% [4]. The ordered set of protein sequences belonging to one organism, the proteome, was also considered to be not compressible due to this little Markov dependency [5]. Improvements are obtained by [6, 7]. However, later studies [8–10] suggest that proteomes contain different sources of regularities, and can be compressed to rates around 30%. For a relevant discussion on the validity of these results see Cao et al. [7].

In this work, we focus on the statistical study of proteome sequences, using the concept of entropy brought into information theory by Shannon [11]. The Shannon entropy is related to the amount of information of a sequence emitted by a certain source. The entropy h of a sequence is the limit of the average amount of information per character, when the length of the sequence tends to infinity. In particular, for a finite sequence of length L, the informational content in bits is approximately Lh and so Lh is the minimum length in bits of any sequence that contains the same information. In this way Lh provides a theoretical lower bound for the sequence's compression. A compression algorithm is intended to code a sequence into a shorter one, from which it is possible to obtain unequivocally the former. In practise, one cannot compress at a rate equal to the Shannon entropy for the given sequence. Nonetheless, it is possible to approximate such a limit, using an efficient compression algorithm.

Statistical compression algorithms achieve their goal by assigning shorter code words to the most probable characters; their efficiency depends on the accuracy of the model used to estimate each character's probability. Models try to take advantage of the correlations between characters considering, for example, how the preceding characters, that is, the character's context, determine the probability of the next one, as in the prediction by partial matching (PPM) scheme [12].

Most successful algorithms for proteome compression are based on the identification of duplicated sequences or repeats. The compress protein (CP) algorithm [5], for example, considers that duplicated sequences in proteomes are similar but not identical because of mutation and evolutionary divergence. CP uses a modified PPM that includes the probability of amino acid substitutions when estimating each residue probability. The ProtComp algorithm [8] optimises the use of approximate repeats by updating the amino acid substitution matrix as the repeated similar blocks appear along the sequence.

Table 1: Proteome sequences.

Abbreviation   Organism                   Proteome length   Number of proteins
Mj             Methanococcus jannaschii   448 779           1680
Hi             Haemophilus influenzae     509 519           1657
Vc             Vibrio cholerae            870 500           2988
Ec             Escherichia coli           1 578 496         5339
Sc             Saccharomyces cerevisiae   2 900 352         5835
Dm             Drosophila melanogaster    5 818 330         11 592
Ce             Caenorhabditis elegans     6 874 562         17 456
Hs             Homo sapiens               3 295 751         5733

The context-tree weighting (CTW) [13] is another context-based method that has been applied for biological sequence compression. In [6] the authors present a CTW-based algorithm that predicts the probability of a character by weighting the importance of short and long contexts, considering as well the occurrence of approximate repeats or palindromes in those contexts. The XM [7] is a statistical algorithm which combines, via a Bayesian average, the probability of an amino acid calculated on a local scale with the probability of that same residue being part of a duplicated region of the proteome.

Nonstatistical approaches, based on the Burrows-Wheeler transform (BWT) [9], have also been used for identifying overlapping and distant repeats in proteomes, and efficiently use them in compression. Even simpler models, that rely on a block code representation of the protein sequences [10], have proved to be successful in some cases.

All the algorithms commented above put into evidence the existence and importance of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic and prokaryotic proteomes. Firstly, we analyse the correlation function of the whole sequences and find evidence of medium range correlations, between amino acids located 100 residues apart. Then we calculate the amino acid correlations considering the protein boundaries and identify the role of the intra/interprotein scale in determining the medium range correlations. Furthermore, we generate groups of amino acids using their pair correlations at distance 100, that reveal the structural meaning of the medium range correlations. Using the results of proteome correlations, we propose a statistical model for the distribution of amino acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii (bacteria), Saccharomyces cerevisiae (eukarya) and Homo sapiens (eukarya), and we estimate their compression rate to compare our results against previous works.

The sources of nonrandomness studied fall into two scales: the medium range correlations between amino acids of the same and neighboring sequences, at distances of order 100, and the short range Markovian correlations between the contiguous residues up to distance 10. Previous studies [9] show that proteomes present repeated subsequences at very long distances (50–300). In this article, we do not consider these long-range correlations of the order of the proteome length. Protein length range correlations are in agreement with the process of sequence duplication, as it has been previously suggested for long-range correlations [9]; in addition to that, we show that they also contain information about the three-dimensional structure of the proteins. Short range correlations might instead relate to the local constraints on amino acid distribution due to secondary structure requirements.

2. RESULTS AND DISCUSSION

For our statistical analysis, we used the proteomes of 4 prokaryotic and 4 eukaryotic organisms shown in Table 1. They were retrieved from the database of the Integr8 web portal [14], with exception of the Hi, Mj, Sc, and Hs proteomes that were obtained from the protein corpus in [15], for the sake of comparison of our compression rate results with previous studies on the same proteomes. The proteomes are not complete (in particular the version of Hs in the protein corpus) but they represent a natural set of proteins where the redundancy has a biological meaning. It is important to remark that the sequence of the proteins in the proteome files of the Integr8 database is not the natural one. Those files are not useful for our analysis. Nevertheless, using the additional information available in the database, it is possible to order the proteins as they are found in the chromosomes. The proteome files of the protein corpus do not present this problem, but the sequence of the proteins is not available. Therefore, for the analysis shown in Table 2 and in Figure 2, we have used the version of Hi, Mj, Sc in the Integr8 database. For the same reason, the data for Hs is missing in Table 2 since the protein order is not obtainable at the Integr8 site.

2.1. Correlations

As a first approximation to the general trends in residue distribution, we study the cooccurrence of amino acids. More precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a appear at distance k along the whole sequence

C^k = \frac{1}{20} \sum_a C^k_{aa},   (1)

with

C^k_{aa} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = a) - f_a^2,   (2)

where N is the sequence length, \chi(\sigma_i = a) is the characteristic function of finding residue a at position i, and f_a is the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that, for a distance k, the number of pairs of equal amino acid is more frequent than expected due to their frequency in the proteome.
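For concreteness, the correlation function of (1)-(2) can be computed directly from residue counts. The following minimal Python sketch is our own illustration, not code from the paper; the function name and the handling of the one-letter alphabet are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def correlation(proteome, k):
    """C^k of Equation (1): average over residues a of
    C^k_aa = P(sigma_i = a, sigma_{i+k} = a) - f_a^2 (Equation (2))."""
    n = len(proteome)
    freq = Counter(proteome)
    f = {a: freq[a] / n for a in AMINO_ACIDS}        # relative frequencies f_a
    pairs = Counter(zip(proteome, proteome[k:]))     # residue pairs at distance k
    total_pairs = n - k
    c = 0.0
    for a in AMINO_ACIDS:
        p_aa = pairs[(a, a)] / total_pairs           # empirical P(sigma_i = a, sigma_{i+k} = a)
        c += p_aa - f[a] ** 2                        # C^k_aa
    return c / 20.0

# Example: correlation at medium range (k = 100) for a concatenated proteome string
# print(correlation(proteome_string, 100))
```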

The resulting correlation function for the 8 proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones. Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k bigger than 800 or 1000, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute proteins, the medium range correlations are lost, both in eukaryotes and prokaryotes.

Figure 1 (plot of the correlation function C(k) against the distance k, from 100 to 1000): Correlation function for the 8 proteomes. Notice that the function remains positive for distances up to 1000 and that eukaryotic proteomes (continuous lines) tend to present higher values.

The medium range correlations imply that, in proteomes, the amino acid distribution of neighboring proteins tends to be more similar than that of distant ones. This fact can be related to the process of duplication, recognised as the dominant force in the evolution of protein function [16]. As protein repeats have been related to duplication at different scales (genome, gene, or exon) [17], it is possible that the amino acid patterns responsible for the observed medium range correlation have the same evolutionary origin.

Due to the correlation definition used, the medium range correlations could be caused either by pairs of amino acids belonging to the same protein, or to different ones. Therefore, we split the nonlocal correlation into two groups and analyse them separately: interprotein correlations (between 2 contiguous proteins) and intraprotein correlations (inside the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves of the same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins: first half with first half (corr−−) and second half with first half (corr+−).

Table 2: Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−−) is higher than that of not corresponding halves (+−).

Proteome   Intraprot corr   Interprot corr−−   Interprot corr+−
Mj         0.271914         0.050381           0.050231
Hi         0.265803         0.045588           0.039246
Vc         0.256386         0.063712           0.041780
Ec         0.271597         0.080064           0.069980
Sc         0.270560         0.032501           0.018606
Dm         0.295940         0.095722           0.056176
Ce         0.288071         0.122692           0.077690

These correlations are defined as follows. Let N_p be the number of proteins, let \rho^{-}_i(a) and \rho^{+}_i(a) be the relative frequency of the residue a in the first and the second half of the ith protein, respectively, and let \rho(a) be the corresponding mean value. We define

\sigma^{\pm\pm}_{i,j} = \frac{1}{20} \sum_a \bigl(\rho^{\pm}_i(a) - \rho(a)\bigr)\bigl(\rho^{\pm}_j(a) - \rho(a)\bigr),   (3)

for instance,

\sigma^{+-}_{i,j} = \frac{1}{20} \sum_a \bigl(\rho^{+}_i(a) - \rho(a)\bigr)\bigl(\rho^{-}_j(a) - \rho(a)\bigr).   (4)

We also define

\sigma^{+}_i = \sigma^{++}_{i,i}, \qquad \sigma^{-}_i = \sigma^{--}_{i,i}.   (5)

The intraprotein correlation is

C_{\mathrm{intra}} = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{\sigma^{-+}_{i,i}}{\sqrt{\sigma^{-}_i \sigma^{+}_i}}.   (6)

The two interprotein correlations are

C^{--}_{\mathrm{inter}} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma^{--}_{i,i+1}}{\sqrt{\sigma^{-}_i \sigma^{-}_{i+1}}}, \qquad
C^{+-}_{\mathrm{inter}} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma^{+-}_{i,i+1}}{\sqrt{\sigma^{+}_i \sigma^{-}_{i+1}}}.   (7)

The correlation values in Table 2 have the same trend for all the proteomes: intraprotein correlation is always higher than interprotein correlation.
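The half-protein statistics of (3)-(7) can be sketched as follows. This is our own illustration under stated assumptions: the mean profile \rho(a) is taken as the average over all half-protein profiles, and the normalisation uses the square roots shown in (6)-(7); none of the names come from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def half_profiles(protein):
    """Relative residue frequencies of the first (-) and second (+) halves of one protein."""
    mid = len(protein) // 2
    out = []
    for half in (protein[:mid], protein[mid:]):
        counts = np.array([half.count(a) for a in AMINO_ACIDS], float)
        out.append(counts / max(len(half), 1))
    return out

def intra_correlation(proteins):
    """C_intra of Equation (6), averaged over all proteins of a proteome."""
    profiles = [half_profiles(p) for p in proteins]
    minus = np.array([m for m, _ in profiles])
    plus = np.array([q for _, q in profiles])
    rho = (minus.mean(axis=0) + plus.mean(axis=0)) / 2.0   # mean profile rho(a) (assumption)
    dm, dp = minus - rho, plus - rho
    cov = (dm * dp).mean(axis=1)                           # sigma_i^{-+}
    var_m = (dm * dm).mean(axis=1)                         # sigma_i^{-}
    var_p = (dp * dp).mean(axis=1)                         # sigma_i^{+}
    return float(np.mean(cov / np.sqrt(var_m * var_p)))
```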

The correlations defined by means of \sigma^{\pm\pm}_{i,j} are different from the traditional correlation C^k_{aa}, which is the correlation of the symbol a at distance k, where k is the number of residues: here we have calculated the correlation function of the frequencies of the amino acids at the distance of one protein. In Figure 2, we also analyse how the interprotein correlations between matching and nonmatching protein halves vary with the number k of proteins separating the two halves. We compare

C^{--}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma^{--}_{i,i+k}}{\sqrt{\sigma^{-}_i \sigma^{-}_{i+k}}}, \qquad
C^{+-}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma^{+-}_{i,i+k}}{\sqrt{\sigma^{+}_i \sigma^{-}_{i+k}}}.   (8)

Figure 2 (plot of the correlation C(k) against the distance k in number of proteins, from 0 to 30, for the S. cerevisiae inter-protein corr−− and corr+−): Correlation function, at distance of k proteins, between amino acids belonging to corresponding (corr−−) and noncorresponding (corr+−) halves; S. cerevisiae proteome. Correlation between corresponding halves is higher, suggesting that structural requirements modulate the evolution of protein sequences, by maintaining certain amino acid patterns.

As an extension of the results in Table 2, we find that the correlation between matching halves is kept higher than that of noncorresponding halves along the proteome. Analogous results to Table 2 and Figure 2 hold for second-second and first-second halves.

Gene duplication can explain both the existence and order dependence of interprotein correlation, but it is not enough to justify why intraprotein correlations remain high, because high interprotein correlations can also appear in a low intraprotein correlations context. Indeed, the presence of intraprotein correlations indicates a nonrandom distribution of amino acids at a protein length scale. This nonrandomness can be related to segmental duplication, that is, duplication of segments inside the same protein; likewise, it can reflect the maintenance of amino acid patterns during the protein divergence that follows gene duplication as a consequence of the structural constraints imposed upon protein sequences.

As an example, extensive searches of protein databases [18] reveal the high frequency of tandemly repeated sequences of approximately 50 amino acids, ARM and HEAT, in eukaryotic proteins. Moreover, those repeats present a core of strongly conserved hydrophobic residues even when the other residues start to differ at several other positions.

The evidence obtained from the correlation analysis does not allow to clarify the nature of the structural constraints measured: do they reflect the modular repetition of secondary structure elements, caused by duplication or, perhaps, they depend on the conservation of higher order tertiary structure units like domains? We try to address this question by defining amino acid groups as explained in the next section.

2.2. Grouping of amino acids

In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids defined by an a priori classification. The Shannon entropy was then estimated from the entropies of the blocks of n-characters. The authors did not find enough evidence to support the existence of short range correlations between the amino acids of protein sequences.

Conversely, given the above evidence of medium range correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20 amino acids. We calculate C^k_{ab}, the correlation between all amino acid pairs ab at distance k, in the same way we calculate C^k_{aa} in the previous section:

C^k_{ab} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = b) - f_a f_b.   (9)

A quick look at the resulting 20 × 20 matrix for k = 100 (Figure 3), which presumably includes both intraprotein and interprotein correlation, puts in evidence that the signs of the matrix elements, and thus the positive and negative correlations, are not distributed randomly among residues but, instead, in a grouped fashion: some amino acids present positive or negative correlations with the same subset of residues. Then, we construct groups of amino acids in such a way that they maximise the positive medium range correlation; in practical terms it means that amino acids which are more likely to appear at distances of order 100 would be grouped together.

For a given partition of the set of amino acids in N_g groups, we calculate the sum of the correlation function between any pair of residues ab belonging to a same group. More precisely, groups are obtained by maximising the following quantity:

F(G) = \sum_{i=1}^{N_g} \sum_{a,b \in g_i} \sum_{k=1}^{200} C^k_{ab},   (10)

which is a function of a partition G of the amino acids in N_g disjoint sets g_i. Due to the huge number of possible choices for the groups, we maximise this value using a simulated annealing algorithm. This is a Monte Carlo algorithm used for optimisation [19]. For a given partition G, we construct a new partition G′ by choosing at random a residue and changing its group. If F(G′) > F(G), the algorithm accepts the new partition. Iterating this procedure we would reach a local maximum which may not be the absolute maximum. In order to avoid being trapped in a local maximum, the algorithm accepts, with a small probability P, a new partition G′ for which F(G′) ≤ F(G). The value of this probability P slowly decreases to zero as the number of iterations increases in such a way that the convergence of the algorithm to the absolute maximum of F is guaranteed. The number and the structure of the groups chosen have the highest value of F(G) and represent an equilibrated partition of the 20 amino acids, that is, groups with only one element are not accepted.
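The simulated annealing search described above can be sketched as follows. This is our own minimal implementation, not the authors' code: the partition score corresponds to F(G) in (10) and is assumed to be supplied as a dictionary of summed correlations Σ_k C^k_ab, and the acceptance rule is a standard Metropolis-style choice, whereas the paper only states that worse partitions are accepted with a slowly decreasing probability P.

```python
import math
import random

def partition_score(groups, summed_corr):
    """F(G) of Equation (10): summed correlations over all residue pairs within each group."""
    return sum(summed_corr[a][b] for g in groups for a in g for b in g)

def anneal_groups(residues, summed_corr, n_groups=4, steps=200000, t0=1.0):
    """Search for a partition of the 20 residues that maximises F(G)."""
    groups = [set(residues[i::n_groups]) for i in range(n_groups)]   # arbitrary starting partition
    current = partition_score(groups, summed_corr)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9        # temperature slowly decreasing towards zero
        a = random.choice(residues)
        src = next(g for g in groups if a in g)
        dst = random.choice(groups)
        if dst is src or len(src) == 1:                # keep the partition equilibrated (no group emptied)
            continue
        src.remove(a)
        dst.add(a)
        score = partition_score(groups, summed_corr)
        if score >= current or random.random() < math.exp((score - current) / temp):
            current = score                            # accept the new partition
        else:
            dst.remove(a)
            src.add(a)                                 # reject the move and restore the old partition
    return groups, current
```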

Figure 3 (20 × 20 matrix of pair correlations, with the residues ordered VLIMFWNQHKRDEGASTCYP along both axes): Correlation between the 20 amino acids for Hi. Positive (black) and negative (grey) correlations determine amino acid groups.

The idea behind our grouping scheme is to simplify the amino acid pattern mining by taking advantage of their synonymous relationships. It is well known that mutations between amino acids sharing geometrical and/or physico-chemical properties are the basis of neutral evolution at a molecular level [20]; this fact also explains why there is not a one-to-one relationship between protein sequences and structures [21]. Moreover, structurally neighboring residues have been found to distribute differentially (proximally/distally) in the protein sequences, depending on their physico-chemical properties [22].

Indeed, the groups defined from the pair correlations at a medium range (Table 3) almost correspond with the natural classification based on their physico-chemical properties: hydrophobic, polar, charged, small, and ambiguous. In particular, the fact that hydrophobic amino acids group together allows us to think that the correlation function is gathering some of the three-dimensional information contained in the protein sequence, more precisely tertiary structure information, as hydrophobic interactions are considered the driving forces of the protein folding process [23].

Table 3: Groups of amino acids determined by maximisation of the positive medium range correlation. Amino acids that are more likely to appear at 200 residues distance are grouped together.

Proteome   Groups
Hi         LIFWSY / VMGATP / NQHKRDEC
Mj         LIFWNSY / VMQHGATCP / KRDE
Sc         LIMFWCY / NQHSTP / KRDE / VGA
Hs         VLIMFWNY / HSTC / QKDE / RGAP

Therefore, the reason why intraprotein correlations remain high is not only related to the repetition of secondary structure units, but is also the conservation of the amino acids responsible for the protein tertiary structure.

Beside this, it is important to notice that, even if the amino acid usage in eukaryotes and prokaryotes is very similar [24], the amino acid correlations are not, as they collect part of the structural information contained in the sequences. The number of groups is also different: 3 for H. influenzae and M. jannaschii, 4 for S. cerevisiae and H. sapiens. This could indicate a higher interchangeability of residues in some proteomes, but further analysis is needed to confirm this hypothesis.

2.3. Sequence entropy estimation

In order to quantify the capability that a statistical model has to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding compressor [25]. We estimate the compression rate of such a compressor with the sequence entropy

S = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_i(\sigma_i),   (11)

using the model to calculate the probability p_i(\sigma_i) of character \sigma_i at position i. The better is the model, the lower is the estimated value of the sequence entropy. We construct three models to estimate the probability of each character, considering the previous ones and taking into account both short and medium range correlations. For each model, we find parameters that minimise the sequence entropy. The S_min value obtained is taken as an estimate of the compression rate of a running arithmetic codification [25] of the proteomes and is used to compare our results with other compression algorithms (Table 4).

Previous works on protein sequence compression like [5] are based on short range Markovian models. In those models, the probability of each amino acid is calculated as a function of the context in which it appears, considering the frequency with which this amino acid happens to be after the l previous residues.

Table 4: Compression rate in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.

Algorithm                                  Hi      Mj      Sc      Hs
One-character entropy                      4.155   4.068   4.165   4.133
CP, Nevill-Manning and Witten 1999 [5]     4.143   4.051   4.146   4.112
lza-CTW, Matsumoto et al. 2000 [6]         4.118   4.028   3.951   3.920
ProtComp, Cao et al. 2007 [7]              4.108   4.008   3.938   3.824
XM, Cao et al. 2007 [7]                    4.102   4.000   3.885   3.786
Model 1*                                   4.111   4.017   3.963   3.978
Model 2*                                   4.102   4.005   3.948   3.933
Model 3*                                   4.100   4.002   3.945   3.931
ProtComp, Hategan and Tabus 2004 [8]†      2.330   3.910   3.440   3.910
BWT/SCP, Adjeroh and Nan 2006 [9]†         2.546   2.273   3.111   3.435
* Estimation. † Results obtained with a different set of proteomes.

Following this idea, we start our statistical description of proteome sequences taking into account the information given by the neighboring residues, using a variation of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider the contexts up to a length N_c (number of contexts) that precede it, that is, the substrings \sigma_{i-k} \cdots \sigma_{i-1} for k = 0, ..., N_c. For any character a, we count the number F^i_k(a) of previous occurrences of the substring \sigma_{i-k} \cdots \sigma_{i-1} a. The conditional frequency of finding character a after the context \sigma_{i-k} \cdots \sigma_{i-1} is obtained dividing by the sum over all amino acids b at position i:

\frac{F^i_k(a)}{\sum_b F^i_k(b)}.   (12)

Our model 1 predicts the probability of character a at position i with

Model 1:  p_i(a) = \frac{1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)}.   (13)

We remark that the main difference between our short range approach and CTW is that we give a weight to the different contexts, while in [6] a weight is given to their corresponding conditional probabilities. We find that the most informative positions were the previous 8; this length is in qualitative agreement with the results found in [6]. Model 1 in Table 4 indicates the results obtained considering only the short range correlations for N_c = 8.

The model depends on the parameters \lambda_k that are optimised, using standard algorithms for minimisation, in order to achieve the best estimate of the compression rate. This “entropy minimisation” stage is very time expensive. In a real compression procedure, those parameters should be specified and therefore would contribute to the estimated entropy. In our case this contribution is negligible.

The short range correlations support the existence of periodic patterns in protein sequences. They can be caused by the alternation of alpha-beta secondary structure units, as argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models denote that local conformations of protein structures have different sequence specificity [29].

The intra/interprotein correlations identified in previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take into account these fluctuations in our second model (model 2 in Table 4):

Model 2:  p_i(a) = \frac{1 + \mu R^i_L(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \mu R^i_L(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)}.   (14)

Here we added

R^i_L(a) = \frac{i}{L} \bigl(\text{number of } a \text{ in } \sigma_{i-L} \cdots \sigma_{i-1}\bigr).   (15)

This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position i − L. The factor i/L guarantees that \sum_a R^i_L(a) = i, so that it increases with i in the same way as the other terms of the sum (e.g., \sum_a F^i_0(a) = i). The parameter \mu is optimised as \lambda_k. The optimal values for L found during the entropy minimisation stage are 190 for Hi, 163 for Mj, 105 for Sc, and 115 for Hs.

Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability of the residue to belong to a certain group and then the conditional probability of the residue once the group is given:

Model 3:  p_i(a) = \frac{1 + \mu G^i_L(g_a) f^i(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \mu G^i_L(g_b) f^i(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)},   (16)

where g_a is the group of a, f^i(a) is the relative frequency of a in its group, as measured up to the position i − 1, and

G^i_L(g) = \text{number of amino acids of the group } g \text{ in } \sigma_{i-L} \cdots \sigma_{i-1}.   (17)

For this model, the optimal values of the parameter L are 129 for Hi, 94 for Mj, 77 for Sc, and 100 for Hs.
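To make the estimation procedure concrete, here is a minimal Python sketch of the per-character entropy of (11) under a Model 1 style predictor (13). It is our own illustration with hypothetical names: the context weights λ_k are fixed to 1 instead of being optimised as in the paper, and the sequence is assumed to use only the 20 standard one-letter residue codes.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def model1_entropy(sequence, n_contexts=8, lam=1.0):
    """S of Equation (11) with p_i(a) from Equation (13), using equal context weights lambda_k = lam."""
    counts = defaultdict(int)       # occurrences, so far, of each (context + character) substring
    total_bits = 0.0
    for i, observed in enumerate(sequence):
        scores = {}
        for b in AMINO_ACIDS:
            s = 1.0                                          # the "+1" term of Equation (13)
            for k in range(min(n_contexts, i) + 1):
                s += lam * counts[sequence[i - k:i] + b]     # lambda_k * F_k^i(b)
            scores[b] = s
        p = scores[observed] / sum(scores.values())          # p_i(sigma_i)
        total_bits -= math.log2(p)
        for k in range(min(n_contexts, i) + 1):              # update the context counts with sigma_i
            counts[sequence[i - k:i] + observed] += 1
    return total_bits / len(sequence)                        # estimated bits per character
```

Replacing the fixed weights by optimised λ_k, and adding the μR^i_L(a) or μG^i_L(g_a)f^i(a) terms of (14) and (16), reproduces the structure of models 2 and 3.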

As one can see in Table 4, the capability of our statistical model to represent the nonrandom information contained in proteomes is comparable to those models that consider repeated amino acid patterns at both short and medium scale [6, 7].

The improvement in the performance of models 2 and 3 is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid frequencies at a protein length range. This demonstrates that both correlation types are informative and that the statistical significance of repetitions at those scales is enough to model the amino acid probabilities.

The compression rate achieved when the medium range correlations are modelled with the frequency of amino acid groups (model 3) is almost equivalent to the compression rate of model 2. From a biological perspective it indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might be coming from the three-dimensional structure constraints.

According to our results, there is an important difference in the compressibility rates of the eukaryotic and prokaryotic proteomes which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than those of H. influenzae and M. jannaschii; correspondingly, the correlation functions of Sc and Hs remain positive for longer distances than Hi and Mj. This additional redundancy could be related to the presence, in eukaryotic proteomes, of paralogous proteins with very similar distribution of synonymous amino acids, but different function. There is evidence suggesting that paralogous genes have been recruited during evolution of different metabolic pathways and are related to the organism adaptability to environmental changes [16]. On the other hand, the lower compressibility of the Hi and Mj proteomes is in agreement with the reduction of prokaryotic genome size as an adaptation to fast metabolic rates [30, 31].

3. CONCLUSIONS

In this article, we show that the correlation function gathers evolutionary and structural information of proteomes. Even if proteins are highly complex sequences, at a proteome scale, it is possible to identify correlations between characters at short and medium ranges. It confirms that protein sequences are not completely random, indeed they present repeated amino acid patterns at those two scales. The alternation of secondary structure units can determine the local redundancy. This was already known and generally modelled using Markov models. In our opinion, sequence duplication is a reasonable explanation for the interprotein correlation. However, it does not account for the intraprotein correlations; this can instead be related to the maintenance of the amino acid patterns responsible for the three-dimensional structure, as the segregation between hydrophobic and polar amino acids indicates. More elaborately, the sampling of the space of structures during proteome evolution is determined by the duplication processes but it is highly constrained by the structural and functional requirements that protein sequences have to meet inside a living system.

Prokaryotic proteomes show lower correlation values, especially for distances under 100 residues, and a smaller compressibility than eukaryotic proteomes. These characteristics point at a higher redundancy of eukaryotic proteome sequences, and suggest that the increase of proteome size does not imply de novo generation of protein sequences, with completely different amino acid distribution.

ACKNOWLEDGMENTS

The authors would like to thank Toby Gibson for reading and commenting the manuscript and the reviewers for their constructive criticism that helped to improve the quality of the paper.

REFERENCES

[1] J. C. Wootton, “Non-globular domains in protein sequences: automated segmentation using complexity measures,” Computers & Chemistry, vol. 18, no. 3, pp. 269–285, 1994.
[2] B. E. Blaisdell, “A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences,” Journal of Molecular Evolution, vol. 19, no. 2, pp. 122–133, 1983.
[3] Y. Almirantis and A. Provata, “An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome,” BioEssays, vol. 23, no. 7, pp. 647–656, 2001.
[4] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, “Information content of protein sequences,” Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[5] C. G. Nevill-Manning and I. H. Witten, “Protein is incompressible,” in Proceedings of the Data Compression Conference (DCC ’99), pp. 257–266, Snowbird, Utah, USA, March 1999.
[6] T. Matsumoto, K. Sadakane, and H. Imai, “Biological sequence compression algorithms,” Genome Informatics, vol. 11, pp. 43–52, 2000.
[7] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, “A simple statistical algorithm for biological sequence compression,” in Proceedings of the Data Compression Conference (DCC ’07), pp. 43–52, Snowbird, Utah, USA, March 2007.
[8] A. Hategan and I. Tabus, “Protein is compressible,” in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG ’04), pp. 192–195, Espoo, Finland, June 2004.
[9] D. Adjeroh and F. Nan, “On compressibility of protein sequences,” in Proceedings of the Data Compression Conference (DCC ’06), pp. 422–434, Snowbird, Utah, USA, March 2006.
[10] G. Sampath, “A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae,” in Proceedings of the IEEE Bioinformatics Conference (CSB ’03), pp. 287–293, Stanford, Calif, USA, August 2003.

[11] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, 1948.
[12] J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, “The context-tree weighting method: basic properties,” IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, 1995.
[14] Integr8 web portal, ftp://ftp.ebi.ac.uk/pub/databases/integr8/, 2006.
[15] J. Abel, “The data compression resource on the internet,” http://www.datacompression.info/, 2005.
[16] C. A. Orengo and J. M. Thornton, “Protein families and their evolution—a structural perspective,” Annual Review of Biochemistry, vol. 74, pp. 867–900, 2005.
[17] J. Heringa, “The evolution and recognition of protein sequence repeats,” Computers & Chemistry, vol. 18, no. 3, pp. 233–243, 1994.
[18] M. A. Andrade, C. Petosa, S. I. O'Donoghue, C. W. Müller, and P. Bork, “Comparison of ARM and HEAT protein repeats,” Journal of Molecular Biology, vol. 309, no. 1, pp. 1–18, 2001.
[19] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[20] L. A. Mirny and E. I. Shakhnovich, “Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function,” Journal of Molecular Biology, vol. 291, no. 1, pp. 177–196, 1999.
[21] M. A. Huynen, P. F. Stadler, and W. Fontana, “Smoothness within ruggedness: the role of neutrality in adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 1, pp. 397–401, 1996.
[22] S. Karlin, “Statistical signals in bioinformatics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 38, pp. 13355–13362, 2005.
[23] K. A. Dill, “Dominant forces in protein folding,” Biochemistry, vol. 29, no. 31, pp. 7133–7155, 1990.
[24] B. Rost, “Did evolution leap to create the protein universe?” Current Opinion in Structural Biology, vol. 12, no. 3, pp. 409–416, 2002.
[25] J. Rissanen and G. G. Langdon Jr., “Arithmetic coding,” IBM Journal of Research and Development, vol. 23, no. 2, pp. 149–162, 1979.
[26] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, “Microbial gene identification using interpolated Markov models,” Nucleic Acids Research, vol. 26, no. 2, pp. 544–548, 1998.
[27] V. P. Turutina, A. A. Laskin, N. A. Kudryashov, K. G. Skryabin, and E. V. Korotkov, “Identification of latent periodicity in amino acid sequences of protein families,” Biochemistry (Moscow), vol. 71, no. 1, pp. 18–31, 2006.
[28] E. V. Korotkov and M. A. Korotkova, “Enlarged similarity of nucleic acid sequences,” DNA Research, vol. 3, no. 3, pp. 157–164, 1996.
[29] A. C. Camproux and P. Tufféry, “Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity,” Biochimica et Biophysica Acta, vol. 1724, no. 3, pp. 394–403, 2005.
[30] S. D. Bentley and J. Parkhill, “Comparative genomic structure of prokaryotes,” Annual Review of Genetics, vol. 38, pp. 771–791, 2004.
[31] J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork, “Prediction of effective genome size in metagenomic samples,” Genome Biology, vol. 8, no. 1, p. R10, 2007.

Research Article A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

Chris Hemmerich1 and Sun Kim2

1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN 47405-3700, USA
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408-3912, USA

Received 28 February 2007; Revised 22 June 2007; Accepted 31 July 2007

Recommended by Juho Rousu

We investigate methods of estimating residue correlation within protein sequences. We begin by using mutual information (MI) of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in protein family classification experiments, the modeling power of MIV was shown to be significantly better than the classic MI method, reaching the level where proteins can be classified without alignment information.

Copyright © 2007 C. Hemmerich and S. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION tein structure. To investigate this question, we used the fam- ily and sequence alignment information from Pfam-A [4]. To A protein can be viewed as a string composed from the 20- model sequences, we defined and used the mutual informa- symbol amino acid alphabet or, alternatively, as the sum of tion vector (MIV) where each entry represents the MI estima- their structural properties, for example, residue-specific in- tion for amino acid pairs separated by a particular distance in teractions or hydropathy (hydrophilic/hydrophobic) interac- the primary structure. We studied two different properties of tions. Protein sequences contain sufficient information to sequences: amino acid identity and hydropathy. construct secondary and tertiary protein structures. Most In this paper, we report three important findings. methods for predicting protein structure rely on primary se- (1) MI scores for the majority of 1000 real protein se- quence information by matching sequences representing un- quences sampled from Pfam are statistically significant known structures to those with known structures. Thus, re- (as defined by a P value cutoff of .05) as compared to searchers have investigated the correlation of amino acids random sequences of the same character composition, within and across protein sequences [1–3]. Despite all this, in see Section 4.1. terms of character strings, proteins can be regarded as slightly (2) MIV has significantly better modeling power of pro- edited random strings [1]. teins than MI, as demonstrated in the protein sequence Previous research has shown that residue correlation can classification experiment, see Section 5.2. provide biological insight, but that MI calculations for pro- (3) The best classification results are provided by MIVs tein sequences require careful adjustment for sampling er- containing scores generated from both the amino acid rors. An information-theoretic analysis of amino acid con- alphabet and the hydropathy alphabet, see Section 5.2. tact potential pairings with a treatment of sampling biases has shown that the amount of amino acid pairing informa- In Section 2, we briefly summarize the concept of MI tion is small, but statistically significant [2]. Another recent and a method for normalizing MI content. In Section 3,we study by Martin et al. [3] showed that normalized mutual in- formally define the MIV and its use in characterizing pro- formation can be used to search for coevolving residues. tein sequences. In Section 4, we test whether MI scores for From the literature surveyed, it was not clear what signif- protein sequences sampled from the Pfam database are sta- icance the correlation of amino acid pairings holds for pro- tistically significant compared to random sequences of the 2 EURASIP Journal on Bioinformatics and Systems Biology same residue composition. We test the ability of MIV to clas- From the entropy equations above, we derive the MI sify sequences from the Pfam database in Section 5, and in equation for a protein sequence X = (x1, ..., xN ): Section 6, we examine correlation with MIVs and further in-       P(x , x ) vestigate the effects of alphabet size in terms of information = i j MI P xi, xj log2 ,(4) P(xi)P(xj ) theory. We conclude with a discussion of the results and their i∈ΣA j∈ΣA implications. where the pair probability P(xi, xj ) is the frequency of two residues being adjacent in the sequence. 2. MUTUAL INFORMATION (MI) CONTENT We use MI content to estimate correlation in protein se- 2.2. 
Normalization by joint entropy quences to gain insight into the prediction of secondary and tertiary structures. Measuring correlation between residues Since MI(X, Y) represents a reduction in H(X)orH(Y), the is problematic because sequence elements are symbolic vari- value of MI(X, Y) can be altered significantly by the entropy ables that lack a natural ordering or underlying metric [5]. in X and Y. The MI score we calculate for a sequence is also ff Residues can be ordered in certain properties such as hy- a ected by the entropy in that sequence. Martin et al. [3]pro- dropathy, charge, and molecular weight. Weiss and Herzel [6] pose a method of normalizing the MI score of a sequence analyzed several such correlation functions. using the joint entropy of a sequence. The joint entropy, or H(X, Y), can be defined as MI is a measure of correlation from information theory       [7] based on entropy, which is a function of the probability =− H(X, Y) P xi, xj log2P xi, xj (5) distribution of residues. We can estimate entropy by count- i∈ΣA j∈ΣA ing residue frequencies. Entropy is maximal when all residues appear with the same frequency. MI is calculated by system- and is related to MI(X, Y) by the equation atically extracting pairs of residues from a sequence and cal- MI(X, Y) = H(X)+H(Y) − H(X, Y). (6) culating the distribution of pair frequencies weighted by the frequencies of the residues composing the pairs. The complete equation for our normalized MI measure- By defining a pair as adjacent residues in the protein se- ment is quence, MI estimates the correlation between the identities MI(X, Y) of adjacent residues. We later define pairs using nonadjacent    H(X, Y)       residues, and physical properties rather than residue identi- ∈Σ ∈Σ P x , x log P x , x /P x P x =− i A j A  i j  2  i j  i j ties. . i∈ΣA j∈ΣA P xi, xj log2P xi, xj MI has been proven useful in multiple studies of bio- (7) logical sequences. It has been used to predict coding regions in DNA [8], and has been used to detect coevolving residue 3. MUTUAL INFORMATION VECTOR (MIV) pairs in protein multiple sequence alignments [3]. We calculate the MI of a sequence to characterize the struc- 2.1. Mutual information ture of the resulting protein. The structure is affected by dif- ferent types of interactions, and we can modify our meth- The entropy of a random variable X, H(X), represents the ods to consider different biological properties of a protein se- uncertainty of the value of X. H(X) is 0 when the identity of quence. To improve our characterization, we combine these X is known, and H(X) is maximal when all possible values different methods to create of vector of MI scores. of X are equally likely. The mutual information of two vari- Using the flexibility of MI and existing knowledge of pro- ables MI(X, Y) represents the reduction in uncertainty of X tein structures, we investigate several methods for generating given Y,andconversely,MI(Y, X) represents the reduction MI scores from a protein sequence. We can calculate the pair in uncertainty of Y given X: probability P(xi, xj ) using any relationship that is defined for ∈ Σ MI(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X). (1) all amino acid identities i, j A. In particular, we examine distance between residue pairings, different types of residue- | When X and Y are independent, H(X Y) simplifies to residue interactions, classical and normalized MI scores, and H(X), so MI(X, Y) is 0. 
The upper bound of MI(X, Y) is the three methods of interpreting gap symbols in Pfam align- lesser of H(X)andH(Y), representing complete correlation ments. between X and Y: H(X | Y) = H(Y | X) = 0. (2) 3.1. Distance MI vectors We can measure the entropy of a protein sequence S as Protein exists as a folded structure, allowing nonadjacent      residues to interact. Furthermore, these interactions help to =− H(S) P xi log2P xi ,(3)determine that structure. For this reason, we use MIV to ∈Σ i A characterize nonadjacent interactions. Our calculation of MI where ΣA is the alphabet of amino acid residues and P(xi)is for adjacent pairs of residues is a specific case of a more gen- the marginal probability of residue i.InSection 3.3, we dis- eral relationship, separation by exactly d residues in the se- cuss several methods for estimating this probability. quence. C. Hemmerich and S. Kim 3
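To make the preceding definitions concrete, the following minimal Python sketch estimates the adjacent-pair mutual information of a single sequence and its joint-entropy-normalized variant, as in (4)-(7). It is an illustrative reimplementation rather than the authors' code; marginal probabilities are the residue frequencies of the sequence being scored, which is the paper's default estimate.

```python
from collections import Counter
from math import log2

def mi_adjacent(seq, normalize=False):
    """MI between adjacent residues of one sequence, optionally divided
    by the joint entropy H(X, Y) as in the normalization of eq. (7)."""
    pairs = list(zip(seq, seq[1:]))               # adjacent residue pairs
    pair_counts = Counter(pairs)
    res_counts = Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)

    mi, h_joint = 0.0, 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n_pairs                        # pair probability P(x_i, x_j)
        p_a, p_b = res_counts[a] / n_res, res_counts[b] / n_res
        mi += p_ab * log2(p_ab / (p_a * p_b))     # eq. (4)
        h_joint -= p_ab * log2(p_ab)              # eq. (5)
    return mi / h_joint if normalize else mi

if __name__ == "__main__":
    s = "DEIPCPFCGC"                              # short example, also used in Table 1
    print("MI(0)          :", round(mi_adjacent(s), 5))
    print("MI(0)/H(X, Y)  :", round(mi_adjacent(s, normalize=True), 5))
```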

Table 1: MI(3)—residue pairings of distance 3 for the sequence Our second method is to use a common prior probability DEIPCPFCGC. distribution for all sequences. Since all of our sequences are (1) DEIPCPFCGC (4) DEIPCPFCGC part of the Pfam database, we use residue frequencies calcu- lated from Pfam as our prior. In our results, we refer to this (2) DEIPCPFCGC (5) DEIPCPFCGC method as the Pfam prior. The large sample size allows the (3) DEIPCPFCGC (6) DEIPCPFCGC frequency to more accurately estimate the probability. How- ever, since Pfam contains sequences from many organisms, Table 2: Amino acid partition primarily based on hydropathy. the probability distribution is less accurate. Hydropathy Amino acids Hydrophobic: C,I,M,F,W,Y,V,L 3.4. Interpreting gap symbols Hydrophilic: R,N,D,E,Q,H,K,S,T,P,A,G The Pfam sequence alignments contain gap information, which presents a challenge for our MIV calculations. The Definition 1. For a sequence S = (s1, ..., sN ), mutual infor- gap character does not represent a physical element of the mation of distance d, MI(d) is defined as sequence, but it does provide information on how to view     the sequence and compare it to others. Because of this con-     P x , x tradiction, we compared three strategies for processing gap = d i  j  MI(d) Pd xi, xj log2 . (8) characters in the alignments. P xi P xj i∈ΣA j∈ΣA

The pair probabilities, Pd(xi, xj ), are calculated using all The strict method combinations of positions sm and sn in sequence S such that This method removes all gap symbols from a sequence be- m +(d +1)= n, n ≤ N. (9) fore performing any calculations, operating on the protein sequence rather than an alignment. A sequence of length N will contain N − (d +1)pairs. The literal method Table 1 shows how to extract pairs of distance 3 from the sequence DEIPCPFCGC. Gaps are a proven tool in creating alignments between re- lated sequences and searching for relationships between se- Definition 2. The mutual information vector of length k for quences. This method expands the sequence alphabet to in- asequenceX,MIV(X), is defined as a vector of k entries, k clude the gap symbol. For Σ we define and use a new alpha- MI(0), ...,MI(k − 1). A bet:

3.2. Sequence alphabets Σ = Σ ∪{−} A A . (10) The alphabet chosen to represent the protein sequence has Σ Σ Σ two effects on our calculations. First, by defining the alpha- MI is then calculated for A . H is transformed to G using bet, we also define the type of residue interactions we are the same method. measuring. By using the full amino acid alphabet, we are only able to find correlations based on residue-specific inter- The hybrid method actions. If we instead use an alphabet based on hydropathy, we make correlations based on hydrophilic/hydrophobic in- This method is a compromise of the previous two methods. teractions. Second, altering the size of our alphabet has a sig- Gap symbols are excluded from the sequence alphabet when nificant effect on our MI calculations. This effect is discussed calculating MI. Occurrences of the gap symbol are still con- in Section 6.2. sidered when calculating the total number of symbols. For a In our study, we used two different alphabets: a set of 20 sequence containing one or more gap symbols, amino acids residues, ΣA, and a hydropathy-based alphabet,  Σ H , derived from grammar complexity and syntactic struc- Pi < 1. (11) ∈Σ ture of protein sequences [9] (see Table 2 for mapping ΣA to i A ΣH ). Pairs containing any gap symbols are also excluded, so for a 3.3. Estimating residue marginal probabilities gapped sequence,  To calculate the MIV for a sequence, we estimate the Pij < 1. (12) marginal probabilities for the characters in the sequence al- i,j∈ΣA phabet. The simplest method is to use residue frequencies from the sequence being scored. This is our default method. TheseadjustmentsresultinanegativeMIscoreforsome Unfortunately, the quality of the estimation suffers from the sequences, unlike classical MI where a minimum score of 0 short length of protein sequences. represents independent variables. 4 EURASIP Journal on Bioinformatics and Systems Biology
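As a sketch of Definition 1 and Definition 2, the code below computes MI(d) from pairs separated by exactly d residues (offset d + 1, following eq. (9)) and stacks the first k scores into an MIV, optionally after translating the sequence into the two-letter hydropathy alphabet of Table 2. Gap handling and the Pfam prior are omitted; the example sequence and the symbols chosen for the two hydropathy classes are placeholders, not values from the paper.

```python
from collections import Counter
from math import log2

HYDROPHOBIC = set("CIMFWYVL")                     # Table 2 partition for ΣH

def to_hydropathy(seq):
    """Translate ΣA into a two-letter hydropathy alphabet ('O'/'I' are arbitrary labels)."""
    return "".join("O" if a in HYDROPHOBIC else "I" for a in seq)

def mi_d(seq, d):
    """MI(d) of Definition 1: mutual information of pairs separated by d residues."""
    pairs = list(zip(seq, seq[d + 1:]))           # N - (d + 1) pairs, eq. (9)
    if not pairs:
        return 0.0
    pair_counts, res_counts = Counter(pairs), Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)
    return sum((c / n_pairs) * log2((c / n_pairs) /
               ((res_counts[a] / n_res) * (res_counts[b] / n_res)))
               for (a, b), c in pair_counts.items())

def miv(seq, k=20, hydropathy=False):
    """MIV of Definition 2: the vector (MI(0), ..., MI(k - 1))."""
    s = to_hydropathy(seq) if hydropathy else seq
    return [mi_d(s, d) for d in range(k)]

if __name__ == "__main__":
    s = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"   # toy globin-like fragment
    print("MIV over ΣA:", [round(x, 3) for x in miv(s, k=5)])
    print("MIV over ΣH:", [round(x, 3) for x in miv(s, k=5, hydropathy=True)])
```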

Table 3: Example MIVs calculated for four sequences from Pfam. All methods used literal gap interpretation.

        Globin MI(d)        Ferrochelatase MI(d)   DUF629 MI(d)          Big 2 MI(d)
d       ΣA        ΣH        ΣA        ΣH           ΣA        ΣH          ΣA        ΣH
0       1.34081   0.42600   0.95240   0.13820      0.70611   0.04752     1.26794   0.21026
1       1.20553   0.23740   0.93240   0.03837      0.63171   0.00856     0.92824   0.05522
2       1.07361   0.12164   0.90004   0.02497      0.63330   0.00367     0.95326   0.07424
3       0.92912   0.02704   0.87380   0.03133      0.66955   0.00575     0.99630   0.04962
4       0.97230   0.00380   0.90400   0.02153      0.62328   0.00587     1.00100   0.08373
5       0.91082   0.00392   0.78479   0.02944      0.68383   0.00674     0.98737   0.03664
6       0.90658   0.01581   0.81559   0.00588      0.63120   0.00782     1.06852   0.05216
7       0.87965   0.02435   0.91757   0.00822      0.67433   0.00172     1.04627   0.12002
8       0.83376   0.01860   0.87615   0.01247      0.63719   0.00495     1.00784   0.05221
9       0.88404   0.01000   0.90823   0.00721      0.61597   0.00411     0.97119   0.04002
10      0.88685   0.01353   0.89673   0.00611      0.60790   0.00718     1.02660   0.02240
11      0.90792   0.01719   0.94314   0.02195      0.66750   0.00867     0.92858   0.02261
12      0.95955   0.00231   0.87247   0.01027      0.64879   0.00805     0.98879   0.03156
13      0.88584   0.01387   0.85914   0.00733      0.66959   0.00607     1.09997   0.04766
14      0.93670   0.01490   0.88250   0.00335      0.66033   0.00106     1.06989   0.01286
15      0.86407   0.02052   0.94592   0.00548      0.62171   0.01363     1.27002   0.06204
16      0.89004   0.04024   0.92664   0.01398      0.63445   0.00314     1.05699   0.03154
17      0.91409   0.01706   0.80241   0.00108      0.67801   0.00536     1.06677   0.02136
18      0.89522   0.01691   0.85366   0.00719      0.65903   0.00898     1.05439   0.03310
19      0.92742   0.03319   0.90928   0.01334      0.70176   0.00151     1.17621   0.01902

3.5. MIV examples In theory, a random string contains no correlation be- tween characters. So, we expect a “slightly edited random Table 3 shows eight examples of MIVs calculated from the string” to exhibit little correlation. In practice, noninfinite Pfam database. A sequence was taken from four random random strings usually have a nonzero MI score. This over- families, and the MIV was calculated using the literal gap estimation of MI in finite sequences is a factor of the length method for both ΣH and ΣA. All scores are in bits. The scores of the string, alphabet size, and frequency of the characters generated from ΣA are significantly larger than those from that make up the string. We investigated the significance of ΣH . We investigate this observation further in Sections 4.1 this error for our calculations and methods for reducing or and 6.2. correcting for the error. To confirm the significance of our MI scores, we used 3.6. MIV concatenation a permutation-based technique. We compared known cod- ing sequences to random sequences in order to generate a The previous sections have introduced several methods for P value signifying the chance that our observed MI score scoring sequences that can be used to generate MIVs. Just or higher would be obtained from a random sequence of aswecombinedMIscorestocreateMIV,wecanfurther residues. Since MI scores are dependent on sequence length ffl concatenate MIVs. Any number of vectors calculated by any and residue frequency, we used the shu e command from methods can be concatenated in any order. However, for two the HMMER package to conserve these parameters in our vectors to be comparable, they must be the same length, and random sequences. must agree on the feature stored at every index. We sampled 1000 sequences from our subset of Pfam- A. A simple random sample was performed without replace- Definition 3. Any two MIVs, MIV j (A)andMIVk(B), can be ment from all sequences between 100 and 1000 residues in concatenated to form MIVj+k(C). length. We calculated MI(0) for each sequence sampled. We then generated 10 000 shuffled versions of each sequence and calculated MI(0) for each. 4. ANALYSIS OF CORRELATION IN We used three scoring methods to calculate MI(0): PROTEIN SEQUENCES (1) ΣA with literal gap interpretation, In [1], Weiss states that “protein sequences can be regarded (2) Σ normalized by joint entropy with literal gap inter- as slightly edited random strings.” This presents a significant A pretation, challenge for successfully classifying protein sequences based on MI. (3) ΣH with literal gap interpretation. C. Hemmerich and S. Kim 5


Mean of MI(0) for0 shu .1 Sequence length (residue count) 0 100 200 300 400 500 600 700 800 900 1000 ΣA literal Sequence length (residue count) ΣA literal, normalized ΣH literal ΣA literal Σ Figure 1: Mean MI(0) of shuffled sequences. A literal, normalized ΣH literal Figure 2: Normalized MI(0) of shuffled sequences. In all three cases, the MI(0) score for a shuffled se- quence of infinite length would be 0; therefore, the calculated scores represent the error introduced by sample-size effects. this experiment for MI(1), MI(5), MI(10), and MI(15) and Figure 1, mean MI(0) of shuffled sequences, shows the aver- summarized the results in Table 4. age shuffled sequence scores (i.e., sampling error) in bits for These results suggest that despite the low MI content of each method. This figure shows that, as expected, the sam- protein sequences, we are able to detect significant MI in a pling error tends to decrease as the sequence length increases. majority of our sampled sequences at MI(0). The number of significant sequences decreases for MI(d) as d increases. The 4.1. Significance of MI(0) for protein sequences results for the classic MI method are significantly affected by sampling error. Normalization by joint entropy reduces this To compare the amount of error, in each method we nor- error slightly for most sequences, and using ΣH is a much malized the mean MI(0) scores from Figure 1 by dividing the more effective correction. mean MI(0) score by the MI(0) score of the sequence used to ffl generate the shu es. This ratio estimates the amount of the 5. MEASURING MIV PERFORMANCE THROUGH ff sequence MI(0) score attributed to sample-size e ects. PROTEIN CLASSIFICATION Figure 2, normalized MI(0) of shuffled sequences, com- pares the effectiveness of our two corrective methods in min- We used sequence classification to evaluate the ability of MI imizing the sample-size effects. This figure shows that nor- to characterize protein sequences and to test our hypothe- malization by joint entropy is not as effective as Figure 1 sug- sis that MIV characterizes a protein sequence better MI. As gests. Despite a large reduction in bits, in most cases, the por- such,ourobjectiveistomeasurethedifference in accuracy tion of the score attributed to sampling effects shows only a between the methods, rather than to reach a specific classifi- minor improvement. ΣH still shows a significant reduction in cation accuracy. sample-size effects for most sequences. We used the Pfam-A dataset to carry out this compar- Figures 1 and 2 provide insight into trends for the three ison. The families contained in the Pfam database vary in methods, but do not answer our question of whether or not sequence count and sequence length. We removed all fami- the MI scores are significant. For a given sequence S,weesti- lies containing any sequence of less than 100 residues due to mated the P value as complications with calculating MI for small strings. We also x limited our study to families with more than 10 sequences P = , (13) N and less than or equal to 200 sequences. After filtering Pfam- A based on our requirements, we were left with 2392 families where N is the number of random shuffles and x is the num- to consider in the experiment. ber of shuffles whose MI(0) was greater than or equal to Sequence similarity is the most widely used method of MI(0) for S. For this experiment, we choose a significance family classification. BLAST [10] is a popular tool incor- cutoff of .05. 
For a sequence to be labeled significant, no more porating this method. Our method differs significantly, in than 50 of the 10 000 shuffled versions may have an MI(0) that classification is based on a vector of numerical features, score equal or larger than the original sequence. We repeated rather than the protein’s residue sequence. 6 EURASIP Journal on Bioinformatics and Systems Biology
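The shuffle test just described can be sketched as follows: the MI(0) of a sequence is compared against MI(0) values of residue-shuffled copies, and the P value is estimated as x/N per eq. (13). Python's random.shuffle stands in here for the HMMER shuffle command, and far fewer than 10 000 shuffles are used to keep the example fast; both are simplifications, and the input sequence is a stand-in rather than a Pfam entry.

```python
import random
from collections import Counter
from math import log2

def mi0(seq):
    """Adjacent-pair mutual information MI(0) in bits."""
    pairs = list(zip(seq, seq[1:]))
    pair_counts, res_counts = Counter(pairs), Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)
    return sum((c / n_pairs) * log2((c / n_pairs) /
               ((res_counts[a] / n_res) * (res_counts[b] / n_res)))
               for (a, b), c in pair_counts.items())

def shuffle_p_value(seq, n_shuffles=1000, seed=0):
    """P value of eq. (13): fraction of shuffled copies with MI(0) >= the original."""
    rng = random.Random(seed)
    observed = mi0(seq)
    chars = list(seq)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)                        # preserves length and residue composition
        if mi0("".join(chars)) >= observed:
            exceed += 1
    return exceed / n_shuffles

if __name__ == "__main__":
    s = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # stand-in
    p = shuffle_p_value(s)
    print("MI(0) =", round(mi0(s), 4), " P =", p, " significant at .05:", p < 0.05)
```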

ff Table 4: Sequence significance calculated for significance cuto .05. as MIV20. The results for these experiments are summarized in Table 5, classification Results for MI(0) and MIV20. Number of significant sequences (of 1000) Scoring method All MIV20 methods were more accurate than their MI(0) MI(0) MI(1) MI(5) MI(10) MI(15) counterparts. The best method was ΣH with hybrid gap scor- Literal-ΣA 762 630 277 103 54 ing with a mean accuracy of 85.14%. The eight best perform- Normalized ing methods used Σ , with the best method based on Σ hav- 777 657 309 106 60 H A literal-ΣA ing a mean accuracy of only 66.69%. Another important ob-

Literal-ΣH 894 783 368 162 117 servation is that strict gap interpretation performs poorly in sequence classification. The best strict method had a mean accuracy of 29.96%—much lower than the other gap meth- Classification of feature vectors is a well-studied prob- ods. lem with many available strategies. A good introduction to Our final classification attempts were made using con- many methods is available in [11], and the method chosen catenations of previously generated MIV20 scores. We eval- can significantly affect performance. Since the focus of this uated all combinations of methods. The five combinations experiment is to compare methods of calculating MIV, we most accurate at classification are shown in Table 6. The best only used the well-established and versatile nearest neighbor method combinations are over 90% accurate, with the best Σ classifier in conjunction with Euclidean distance [12]. being 90.99%. The classification power of H with hybrid gap interpretation is demonstrated, as this method appears 5.1. Classification implementation in all five results. Surprisingly, two strict scoring methods ap- pear in the top 5, despite their poor performance when used For classification, we used the WEKA package [11]. WEKA alone. uses the instance based 1 (IB1) algorithm [13] to imple- Based on our results, we made the following observa- ment nearest neighbor classification. This is an instance- tions. based learning algorithm derived from the nearest neighbor (1) The correlation of non-adjacent pairs as measured ffi pattern classifier and is more e cient than the naive imple- by MIV is significant. Classification based on every mentation. method improved significantly for MIV compared to ff The results of this method can di er from the classic MI(0). The highest accuracy achieved for MI(0) was nearest neighbor classifier in that the range of each attribute 26.73% and for MIV it was 85.14% (see Table 5). is normalized. This normalization ensures that each attribute (2) Normalized MI had an insignificant effect on scores gen- contributes equally to the calculation of the Euclidean dis- erated from Σ . Both methods reduce the sample-size tance. As shown in Table 3, MI scores calculated from Σ H A error in estimating entropy and MI for sequences. A have a larger magnitude than those calculated from Σ . This H possible explanation for the lack of further improve- normalization allows the two alphabets to be used together. ment through normalization is that ΣH is a more ef- fective corrective measure than normalization. We ex- 5.2. Sequence classification with MIV plore this possibility further in Section 6.2,werewe consider entropy for both alphabets. In this experiment, we explore the effectiveness of classifica- (3) For the most accurate methods, using the Pfam prior de- tions made using the correlation measurements outlined in creased accuracy. Despite our concerns about using the Section 3. frequency of a short sequence to estimate the marginal Each experiment was performed on a random sample of residue probabilities, the results show that these es- 50 families from our subset of the Pfam database. We then timations better characterize the sequences than the used leave-one-out cross-validation [14]totesteachofour Pfam prior probability distribution. However, four of classification methods on the chosen families. 
the five best combinations contain a method utilizing In leave-one-out validation, the sequences from all 50 the Pfam prior, showing that the two methods for esti- families are placed in a training pool. In turn, each sequence mating marginal probabilities are complimentary. is extracted from this pool and the remaining sequences are used to build a classification model. The extracted sequence (4) As with sequence-based classification, introducing gaps is then classified using this model. If the sequence is placed improves accuracy. For all methods, removing gap char- in the correct family, the classification is counted as a suc- acters with the strict method drastically reduced accu- cess. Accuracy for each method is measured as racy. Despite this, two of the five best combinations in- cluded a strict scoring method. no. of correct classifications (5) The best scoring concatenated MIVs included both al- . (14) Σ no. of classification attempts phabets. The inclusion of A is significant—all eight nonstrict ΣH methods scored better than any ΣA We repeated this process 100 times, using a new sampling method (see Table 5). The inclusion shows that ΣA of 50 families from Pfam each time. Results are reported for provides information not included in the ΣH and each method as the mean accuracy of these repetitions. For strengthens our assertion that the different alphabets each of the 24 combinations of scoring options outlined in characterize different forces affecting protein struc- Section 3, we evaluated classification based on MI(0), as well ture. C. Hemmerich and S. Kim 7
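A compact stand-in for the classification procedure described above: each attribute is rescaled to [0, 1] (mirroring the range normalization noted for IB1), each vector is held out in turn, and accuracy is the fraction of correct 1-nearest-neighbor assignments under Euclidean distance, as in eq. (14). This is not the WEKA-based setup used in the paper; the MIVs below are randomly generated placeholders for real Pfam families.

```python
import numpy as np

def loocv_1nn_accuracy(vectors, labels):
    """Leave-one-out 1-NN accuracy with per-attribute range normalization."""
    X = np.asarray(vectors, dtype=float)
    y = np.asarray(labels)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                          # avoid division by zero
    Xn = (X - X.min(axis=0)) / span                # normalize each attribute to [0, 1]

    correct = 0
    for i in range(len(Xn)):
        d = np.linalg.norm(Xn - Xn[i], axis=1)     # Euclidean distances
        d[i] = np.inf                              # exclude the held-out sequence
        correct += y[d.argmin()] == y[i]
    return correct / len(Xn)                       # eq. (14)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in data: 3 "families", 20 sequences each, 20-entry MIVs per sequence.
    centers = rng.normal(size=(3, 20))
    vecs = np.vstack([c + 0.3 * rng.normal(size=(20, 20)) for c in centers])
    labs = np.repeat([0, 1, 2], 20)
    print("LOOCV 1-NN accuracy:", loocv_1nn_accuracy(vecs, labs))
```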

Table 5: Classification results for MI(0) and MIV20 methods. SD represents the standard deviation of the experiment accuracies.

MIV20 rank   Method   MI(0) accuracy: Mean, SD   MIV20 accuracy: Mean, SD

1 Hybrid-ΣH 26.73% 2.59 85.14% 2.06

2 Normalized hybrid-ΣH 26.20% 4.16 85.01% 2.19

3 Literal-ΣH 22.92% 3.41 79.51% 2.79

4 Normalized literal-ΣH 23.45% 3.88 78.86% 2.79

5 Normalized hybrid-ΣH w/Pfam prior 26.31% 3.95 77.21% 2.94

6 Literal-ΣH w/Pfam prior 22.73% 4.90 76.89% 2.91

7 Normalized literal-ΣH w/Pfam prior 22.45% 4.89 76.29% 2.96

8 Hybrid-ΣH w/Pfam prior 22.81% 2.97 71.57% 3.15

9 Normalized literal-ΣA 17.76% 3.21 66.69% 4.14

10 Hybrid-ΣA 17.16% 3.06 64.09% 4.36

11 Normalized literal-ΣA w/Pfam prior 19.60% 3.67 63.39% 4.05

12 Literal-ΣA 16.36% 2.84 61.97% 4.32

13 Literal-ΣA w/Pfam prior 19.95% 2.84 61.82% 4.12

14 Hybrid-ΣA w/Pfam prior 23.09% 3.36 58.07% 4.28

15 Normalized hybrid-ΣA 18.10% 3.08 41.76% 4.59

16 Normalized hybrid-ΣA w/Pfam prior 23.32% 3.65 40.46% 4.04

17 Strict-ΣH w/Pfam prior 12.97% 2.85 29.96% 3.89

18 Normalized strict-ΣH w/Pfam prior 13.01% 2.72 29.81% 3.87

19 Normalized strict-ΣA w/Pfam prior 19.77% 3.52 29.73% 3.93

20 Normalized strict-ΣA 18.27% 2.92 29.20% 3.65

21 Strict-ΣH 11.22% 2.33 29.09% 3.60

22 Normalized strict-ΣH 11.15% 2.52 28.85% 3.58

23 Strict-ΣA w/Pfam prior 19.25% 3.38 28.44% 3.91

24 Strict-ΣA 16.27% 2.75 25.80% 3.60

Table 6: Top-scoring combinations of MIV methods. All combinations of two MIV methods were tested; the five most accurate combinations are shown. SD represents the standard deviation of the experiment accuracies.

Rank First method Second method Mean accuracy SD

1 Hybrid-ΣH Normalized hybrid-ΣA w/Pfam prior 90.99% 1.44

2 Hybrid-ΣH Normalized strict-ΣA w/Pfam prior 90.66% 1.47

3 Hybrid-ΣH Literal-ΣA w/Pfam prior 90.30% 1.48

4 Hybrid-ΣH Literal-ΣA 90.24% 1.73

5 Hybrid-ΣH Strict-ΣA w/Pfam prior 90.08% 1.57

6. FURTHER MIV ANALYSIS The results strengthen our observations from the classifi- cation experiment. Methods that performed well in classifi- In this section, we examine the results of our different meth- cation exhibit less redundancy between MIV indexes. In par- ods of calculating MIVs for Pfam sequences. We first use cor- ticular, the advantage of methods using ΣH is clear. In each relation within the MIV as a metric to compare several of our case, correlation decreases as the distance between indexes scoring methods. We then take a closer look at the effect of increases. For short distances, ΣA methods exhibit this to a reducing our alphabet size when translating from ΣA to ΣH . lesser degree; however, after index 10, the scores are highly correlated. 6.1. Correlation within MIVs 6.2. Effect of alphabets We calculated MIVs for 120 276 Pfam sequences using each of our methods and measured the correlation within each Not all intraprotein interactions are residue specific. Cline method using Pearson’s correlation. The results of this anal- [2] explored information attributed to hydropathy, charge, ysis are presented in Figure 3. Each method is represented by disulfide bonding, and burial. Hydropathy, an alphabet com- a20× 20 grid containing each pairing of entries within that posed of two symbols, was found to contain half as much in- MIV. formation as the 20-element amino acid alphabet. However, 8 EURASIP Journal on Bioinformatics and Systems Biology

[Figure 3 panels: (a) Literal-ΣA, Normalized literal-ΣA, Hybrid-ΣA, Normalized hybrid-ΣA; (b) Literal-ΣH, Normalized literal-ΣH, Hybrid-ΣH, Normalized hybrid-ΣH. Each panel is a 20 × 20 grid of correlations between MIV entries, with a color scale from 0.2 to 0.8.]

Figure 3: Pearson’s correlation analysis of scoring methods. Note the reduced correlation in the methods based on ΣH , which all performed very well in classification tests. with only two symbols, the alphabet should be more resistant Table 7: Comparison of measured entropy to expected entropy val- to the underestimation of entropy and overestimation of MI ues for 1000 amino acid sequences. Each sequence is 100 residues caused by finite sequence effects [15]. long and was generated by a Bernoulli scheme. For this method, a protein sequence is translated using Alphabet Theoretical Mean measured Alphabet the process given in Section 3.2. It is important to remem- size entropy entropy ber that the scores generated for entropy and MI are actually Σ estimates based on finite samples. Because of the reduced al- A 20 4.322 4.178 ΣH 2 0.971 0.964 phabet size of ΣH , we expected to see increased accuracy in entropy and MI estimations.To confirm this, we examined the effects of converting random sequences of 100 residues (a length representative of those found in the Pfam database) Σ bution. The positions remain independent, so the expected into H . MI remains 0. We generated each sequence from a Bernoulli scheme. Table 7 shows the measured and expected entropies for Each position in the sequences is selected independently of both alphabets. The entropy for ΣA is underestimated by any residues selected before it, and all selections are made .144, and the entropy for Σ is underestimated by only randomly from a uniform distribution. Therefore, for every H .007. The effect of ΣH on MI estimation is much more pro- position in the sequence, all residues are equally likely to oc- nounced. Figure 4 shows the dramatic overestimation of MI cur. in ΣA and high standard deviation around the mean. The By sampling residues from a uniform distribution, the overestimation of MI for Σ is negligible in comparison. Bernoulli scheme maximizes entropy for the alphabet size H (N): 7. CONCLUSIONS 1 H =−log . (15) 2 N We have shown that residue correlation information can be Since all positions are independent of others, MI is 0. used to characterize protein sequences. To model sequences, Knowing the theoretical values of both entropy and MI, we we defined and used the mutual information vector (MIV) can compare the calculated estimates for a finite sequence to where each entry represents the mutual information content the theoretical values to determine the magnitude of finite between two amino acids for the corresponding distance. We sequence effects. have shown that MIV of proteins is significantly different We estimated entropy and MI for each of these sequences from random sequences of the same character composition and then translated the sequences to ΣH . The translated when the distance between residues is considered. Furthermore, sequences are no longer Bernoulli sequences because the we have shown that the MIV values of proteins are significant residue partitioning is not equal—eight residues fall into one enough to determine the family membership of a protein se- category and twelve into the other. Therefore, we estimated quence with an accuracy of over 90%. What we have shown is the entropy for the new alphabet using this probability distri- simply that the MIV score of a protein is significant enough C. Hemmerich and S. Kim 9

for family classification; it is not a practical alternative to similarity-based family classification methods.

There are a number of interesting questions to be answered. In particular, it is not clear how to interpret a vector of mutual information values. It would also be interesting to study the effect of distance in computing mutual information in relation to protein structures, especially in terms of secondary structures. In our experiment (see Table 4), we have observed that normalized MIV scores exhibit more information content than nonnormalized MIV scores. However, in the classification task, normalized MIV scores did not always achieve better classification accuracy than nonnormalized MIV scores. We hope to investigate this issue in the future.

[Figure 4: Comparison of MI overestimation in protein sequences generated from Bernoulli schemes for gap distances from 0 to 19 residues; mean MIV for ΣA and for ΣH plotted against residue distance d. The full residue alphabet greatly overestimates this amount; reducing the alphabet to two symbols approximates the theoretical value of 0.]

ACKNOWLEDGMENTS

This work is partially supported by NSF DBI-0237901 and Indiana Genomics Initiatives (INGEN). The authors also thank the Center for Genomics and Bioinformatics for the use of computational resources.

REFERENCES

[1] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[2] M. S. Cline, K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers Jr., and D. Haussler, "Information-theoretic dissection of pairwise contact potentials," Proteins: Structure, Function and Genetics, vol. 49, no. 1, pp. 7–14, 2002.
[3] L. C. Martin, G. B. Gloor, S. D. Dunn, and L. M. Wahl, "Using information theory to search for co-evolving residues in proteins," Bioinformatics, vol. 21, no. 22, pp. 4116–4124, 2005.
[4] A. Bateman, L. Coin, R. Durbin, et al., "The Pfam protein families database," Nucleic Acids Research, vol. 32, Database issue, pp. D138–D141, 2004.
[5] W. R. Atchley, W. Terhalle, and A. Dress, "Positional dependence, cliques, and predictive motifs in the bHLH protein domain," Journal of Molecular Evolution, vol. 48, no. 5, pp. 501–516, 1999.
[6] O. Weiss and H. Herzel, "Correlations in protein sequences and property codes," Journal of Theoretical Biology, vol. 190, no. 4, pp. 341–353, 1998.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[8] I. Grosse, H. Herzel, S. V. Buldyrev, and H. E. Stanley, "Species independence of mutual information in coding and noncoding DNA," Physical Review E, vol. 61, no. 5, pp. 5624–5629, 2000.
[9] M. A. Jiménez-Montaño, "On the syntactic structure of protein sequences and the concept of grammar complexity," Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 641–659, 1984.
[10] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[12] T. M. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[13] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 2, pp. 1137–1145, Montréal, Québec, Canada, August 1995.
[15] H. Herzel, A. O. Schmitt, and W. Ebeling, "Finite sample effects in sequence analysis," Chaos, Solitons & Fractals, vol. 4, no. 1, pp. 97–113, 1994.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 14741, 11 pages doi:10.1155/2007/14741

Research Article Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Hasan Metin Aktulga,1 Ioannis Kontoyiannis,2 L. Alex Lyznik,3 Lukasz Szpankowski,4 Ananth Y. Grama,1 and Wojciech Szpankowski1

1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
2 Department of Informatics, Athens University of Economics & Business, Patission 76, 10434 Athens, Greece
3 Pioneer Hi-Bred International, Johnston, IA, USA
4 Bioinformatics Program, University of California, San Diego, CA 92093, USA

Received 26 February 2007; Accepted 25 September 2007

Recommended by Petri Myllymaki¨

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5 untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI’s combined DNA index system (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats—an application of importance in genetic profiling.

Copyright © 2007 Hasan Metin Aktulga et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION tivated, we propose to develop precise and reliable method- ologies for quantifying and identifying such dependencies, Questions of quantification, representation, and description based on the information-theoretic notion of mutual infor- of the overall flow of information in biosystems are of cen- mation. tral importance in the life sciences. In this paper, we de- Biomolecules store information in the form of monomer velop statistical tools based on information-theoretic ideas, strings such as deoxyribonucleotides, ribonucleotides, and and demonstrate their use in identifying informative parts amino acids. As a result of numerous genome and protein in biomolecules. Specifically, our goal is to detect statistically sequencing efforts, vast amounts of sequence data is now dependent segments of biosequences, hoping to reveal po- available for computational analysis. While basic tools such tentially important biological phenomena. It is well known as BLAST provide powerful computational engines for iden- [1–3] that various parts of biomolecules, such as DNA, RNA, tification of conserved sequence motifs, they are less suitable and proteins, are significantly (statistically) correlated. For- for detecting potential hidden correlations without experi- mal measures and techniques for quantifying these correla- mental precedence (higher-order substitutions). tions are topics of current investigation. The biological im- The application of analytic methods for finding regions plications of these correlations are deep, and they themselves of statistical dependence through mutual information has remain unresolved. For example, statistical dependencies be- been illustrated through a comparative analysis of the 5 un- tween exons carrying protein coding sequences and noncod- translated regions of DNA coding sequences [4]. It has been ing introns may indicate the existence of as-yet unknown er- known that eukaryotic translational initiation requires the ror correction mechanisms or structural scaffolds. Thus mo- consensus sequence around the start codon defined as the 2 EURASIP Journal on Bioinformatics and Systems Biology

Kozak’s motif [5]. By screening at least 500 sequences, an and introns may justify additional search for still unknown unexpected correlation between positions −2and−1 of the factors affecting RNA processing. Kozak’s sequence was observed, thus implying a novel trans- The complexity and importance of the RNA processing lational initiation signal for eukaryotic genes. This pattern system is emphasized by the largely unexplained mechanisms was discovered using mutual information, and not detected of alternative splicing, which provide a source of substantial by analyzing single-nucleotide conservation. In other rele- diversity in gene products. The same sequence may be recog- vant work, neighbor-dependent substitution matrices were nized as an exon or an intron, depending on a broader con- applied to estimate the average mutual information con- text of splicing reactions. The information that is required tent of the core promoter regions from five different organ- for the selection of a particular segment of RNA molecules is isms [6, 7]. Such comparative analyses verified the impor- very likely embedded into either exons or introns, or both. tance of TATA-boxes and transcriptional initiation. A similar Again, it seems that the splicing outcome is determined methodology elucidated patterns of sequence conservation by structural information carried by RNA molecules them- at the 3 untranslated regions of orthologous genes from hu- selves, unless the fundamental dogma of biology (the unidi- man, mouse, and rat [8], making them potential rectional flow of information from DNA to proteins) is to be targets for experimental verification of hidden functional sig- questioned. nals. Finally, the constant evolution of genomes introduces In a different kind of application, statistical dependence certain polymorphisms, such as tandem repeats, which are an techniques find important applications in the analysis of gene important component of genetic profiling applications. We expression data. Typically, the basic underlying assumption also study these forms of statistical dependencies in biologi- in such analyses is that genes expressed similarly under di- cal sequences using mutual information. vergent conditions share functional domains of biological ac- In Section 2 we develop some theoretical background, tivity. Establishing dependency or potential relationships be- and we derive a threshold function for testing statistical sig- tween sets of genes from their expression profiles holds the nificance. This function admits a dual interpretation either key to the identification of novel functional elements. Statis- as the classical log-likelihood ratio from hypothesis testing, tical approaches to estimation of mutual information from or as the “empirical mutual information.” gene expression datasets have been investigated in [1]. Section 3 contains our experimental results. In Section Protein engineering is another important area where sta- 3.1 we present our empirical findings for the problem of de- tistical dependency tools are utilized. Reliable predictions of tecting statistical dependency between different parts in a protein secondary structures based on long-range depen- DNA sequence. Extensive numerical experiments were car- dencies may enhance functional characterizations of pro- ried out on certain regions of the maize zmSRp32 gene [11], teins [9]. 
Since secondary structures are determined by both which is functionally homologous to the human ASF/SF2 al- short- and long-range interactions between single amino ternative splicing factor. The efficiency of the empirical mu- acids, the application of comparative statistical tools based tual information in this context is demonstrated. Moreover, on consensus sequence algorithms or short amino acid se- our findings suggest the existence of a biological connection quences centered on the prediction sites is far from optimal. between the 5 untranslated region in zmSRp32 and its alter- Analyses that incorporate mutual information estimates may natively spliced exons. provide more accurate predictions. Finally, in Section 3.2, we show how the empirical mu- In this work we focus on developing reliable and pre- tual information can be utilized in the difficult problem of cise information-theoretic methods for determining whether searching DNA sequences for short tandem repeats (STRs), two biosequences are likely to be statistically dependent. Our an important task in genetic profiling. We extend the simple main goal is to develop efficient algorithmic tools that can hypothesis test of the previous sections to a methodology for be easily applied to large data sets, mainly—though not testing a DNA string against different “probe” sequences, in exclusively—as a rigorous exploratory tool. In fact, as dis- ordertodetectSTRsbothaccuratelyandefficiently. Experi- cussed in detail below, our findings are not the final word on mental results on DNA sequences from the FBI’s combined the experiments we performed, but, rather, the first step in DNA index system (CODIS) are presented, showing that the the process of identifying segments of interest. Another moti- empirical mutual information can be a powerful tool in this vating factor for this project, which is more closely related to context as well. ideas from information theory, is the question of determin- ing whether there are error correction mechanisms built into large molecules, as argued by Battail; see [10] and the ref- 2. THEORETICAL BACKGROUND erences therein. We choose to work with protein coding ex- ons and noncoding introns. While exons are well-conserved In this section, we outline the theoretical basis for the mu- parts of DNA, introns have much greater variability. They tual information estimators we will later apply to biological are dispersed on strings of biopolymers and still they have sequences. to be precisely identified in order to produce biologically rel- Suppose we have two strings of unequal lengths, evant information. It seems that there is no external source of information but the structure of RNA molecules them- n = X1 X1, X2, ..., Xn, selves to generate functional templates for protein synthesis. (1) M = Determining potential mutual relationships between exons Y1 Y1, Y2, Y3, ..., YM, Hasan Metin Aktulga et al. 3

 where M ≥ n, taking values in a common finite alphabet A. ilarly, let P(x)andqj (y) denote the empirical distributions + −1 In most of our experiments, M is significantly larger than of Xn and Y j n , respectively. We define the empirical (per- ≈ ≈ 1 j n; typical values of interest are n 80 and M 300.  n j+n−1 Our main goal is to determine whether or not there is some symbol) mutual information Ij (n)betweenX1 and Yj form of statistical dependence between them. Specifically, by applying (2) to the empirical instead of the true distribu- n tions, so that we assume that the string X1 consists of independent and identically distributed (i.i.d.) random variables Xi with com-  p (x, y)  =  j mon distribution P(x)onA, and that the random vari- Ij (n) pj (x, y)log   . (3) ∈ p(x)qj(y) ables Yi are also i.i.d. with a possibly different distribution x,y A Q(y). Let {W(y | x)} be a family of conditional distribu- →∞ tions, or “channel,” with the property that, when the in- The law of large numbers implies that as n ,wehave p(x)→P(x), q (y)→Q(x), and p (x, y) converges to the true put distribution is P, the output has distribution Q, that is, j j ∈ | = ff joint distribution of X, Y. x AP(x)W(y x) Q(y)forally.Wewishtodi erentiate n between the following two scenarios: Clearly, this implies that in scenario (i), where X1 and n M n  → →∞ (i) independence: X1 and Y1 are independent, Y1 are independent, Ij (n) 0, for any fixed j,asn .On n ∈  (ii) dependence: First X1 is generated, then an index J the other hand, in scenario (ii), IJ (n)convergestoI(X; Y) > { − } J+n−1 1, 2, ..., M n+1 is chosen in an arbitrary way, and YJ 0 where the two random variables X, Yare such that X has is generated as the output of the discrete memoryless channel distribution P and the conditional distribution of Y given n = = | W with input X1 , that is, for each j 1, 2, ..., n, the condi- X x is W(y x). n | tional distribution of Yj+J−1 given X1 is W(y Xj ). Finally, In passing we should point out there are other methods the rest of the Yi’s are generated i.i.d. according to Q.(To of checking statistical (in)dependence, for instance, random- avoid the trivial case where both scenarios are identical, we ization or permutation tests discussed in [13, 14]. assume that the rows of W are not all equal to Q so that in n J+n−1 the second scenario X1 and YJ are actually not indepen- 2.1. An independence test based on dent.) mutual information It is important at this point to note that although nei- ther of these two cases is biologically realistic as a descrip- We propose to use the following simple test for detecting de- n M tion of the elements in a genomic sequence, it turns out that pendence between X1 and Y1 . Choose and fix a threshold  this set of assumptions provides a good operational starting θ>0, and compute the empirical mutual information Ij (n) n j+n−1 point: the experimental results reported in Section 3 clearly between X1 and each contiguous substring Yj of length indicate that, in practice, the resulting statistical methods ob- M  n from Y1 .IfIj (n) is larger than θ for some j, declare that tained under the present assumptions can provide accurate n j+n−1 and biologically relevant information. Of course, the natu- the strings X1 and Yj are dependent; otherwise, declare ral next step in any application is the careful examination of that they are independent. the corresponding findings, either through purely biological Before examining the issue of selecting the value of the considerations or further testing. 
threshold θ, we note that this statistic is identical to the To distinguish between (i) and (ii), we look at every pos- (normalized) log-likelihood ratio between the above two hy- sible alignment of Xn with Y M, and we estimate the mutual potheses. To see this, observe that expanding the definition 1 1   information between them. Recall that for two random vari- of pj (x, y)inIj (n), we can simply rewrite ables X, Y with marginal distributions P(x), Q(y), respec-  n p (x, y) tively, and joint distribution V(x, y), the mutual information  = 1 I j ( ) { − }( , )log Ij n (Xi,Yj+i 1) x y   between X and Y is defined as x,y∈A n i=1 p(x)qj(y)  (4) V(x, y) n  p (x, y) I(X; Y) = V(x, y)log . (2) = 1 I j {(X Y − )}(x, y)log , ∈ P(x)Q(y) i, j+i 1   x,y A n i=1x,y∈A p(x)qj(y)

Recall also that I(X; Y) is always nonnegative, and it equals I where the indicator function { − }(x, y)equals1if zero if and only if X and Y are independent. The loga- (Xi,Yj+i 1) (X Y − ) = (x, y) and it is equal to zero otherwise. Then, rithms above and throughout the paper are taken to base 2, i, j+i 1   log = log , so that I(X; Y) can be interpreted as the number n  2 1 pj Xi, Yj+i−1 of bits of information that each of these two random vari-  =     Ij (n) log   n = p Xi qj Yj+i−1 ables carries about the other (cf. [12]). i 1     n (5) In order to distinguish between the two scenarios above, = p X Y − = 1  i 1 j  i,  j+i 1  n log n , we compute the empirical mutual information between X1   − M n i=1 p Xi qj Yj+i 1 and each contiguous substring of Y1 of length n:foreach j = 1, 2, ..., M − n +1,let p (x, y) denote the joint j which is exactly the normalized logarithm of the ratio be- n j+n−1  n  empirical distribution of (X1 , Yj ), that is, let pj (x, y) tween the joint empirical likelihood i=1 pj (Xi, Yj+i−1)of be the proportion of the n positions in (X1, Yj ), (X2, the two strings,  and the product of their empirical marginal n  n  Yj+1), ...,(Xn, Yj+n−1) where (Xi, Yj+i−1)equals(x, y). Sim- likelihoods i=1 p(Xi)][ i=1 qj (Yj+i−1) . 4 EURASIP Journal on Bioinformatics and Systems Biology
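A sketch of the test of Section 2.1 follows: the per-symbol empirical mutual information of eq. (3) is evaluated for every alignment of the shorter string against a window of the longer one, and alignments whose score exceeds a threshold are flagged. The sequences here are synthetic, and the numerical threshold is only a placeholder; how θ should actually be chosen is discussed in Section 2.2.

```python
from collections import Counter
from math import log2

def empirical_mi(x, y_window):
    """Per-symbol empirical mutual information of eq. (3) between two equal-length strings."""
    n = len(x)
    p_joint = Counter(zip(x, y_window))
    p_x, p_y = Counter(x), Counter(y_window)
    return sum((c / n) * log2((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), c in p_joint.items())

def dependency_graph(x, y):
    """I_hat_j(n) for every alignment j of x against a length-n window of y (Section 2.1)."""
    n = len(x)
    return [empirical_mi(x, y[j:j + n]) for j in range(len(y) - n + 1)]

if __name__ == "__main__":
    import random
    rng = random.Random(1)
    x = "".join(rng.choice("ACGT") for _ in range(80))
    # Synthetic y: random background with a noisy copy of x embedded at position 100.
    noisy = "".join(b if rng.random() < 0.7 else rng.choice("ACGT") for b in x)
    y = ("".join(rng.choice("ACGT") for _ in range(100)) + noisy
         + "".join(rng.choice("ACGT") for _ in range(120)))
    graph = dependency_graph(x, y)
    theta = 0.10                                   # placeholder threshold; see Section 2.2
    print("best alignment j =", max(range(len(graph)), key=graph.__getitem__))
    print("alignments above theta:", [j for j, v in enumerate(graph) if v > theta][:10])
```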

2.2. Probabilities of error I = I(X; Y) of the mutual information, but, as we show be- low, the rate of this convergence is slower than the 1/n rate There are two kinds of errors this test can make: declaring → of scenario√ (i): here,√I(n) I with probability one, but only at that two strings are dependent when they are not, and vice rate 1/ n, in that n [I(n) − I] converges in distribution to versa. The actual probabilities of these two types of errors a Gaussian  depend on the distribution of the statistic Ij (n). Since this √ D   distribution is independent of j,wetakej = 1 and write n I(n) − I −→ T∼N 0, σ2 , (10) I(n) for the normalized log-likelihood ratio I (n). The next 1 where the resulting variance σ2 is given by two subsections present some classical asymptotics for  ( ) I1 n .   W(Y | X) σ2 = Var log Scenario (i): independence Q(Y)    W(y | x) 2 (11) We already noted that in this case I(n)convergestozeroas = p(x)W(y | x) log − I . ( ) n→∞, and below we shall see that this convergence takes x,y∈A Q y place at a rate of approximately 1/n.Specifically,I(n) →0 with probability one, and a standard application of the mul- An outline of the proof of (10) is given below; for another tivariate central limit theorem for the joint empirical distri- derivation see [19].  Therefore, for any fixed threshold θ0 and large n,wecan (12) estimate the probability of error as where the last approximation sign indicates equality to first order in the exponent. Thus, despite the fact that I(n)con- P = Pr{declare dependence | independent strings} e,1 verges at different speeds in the two scenarios, both error = Pr I(n) >θ| independent strings (7) probabilities Pe,1 and Pe,2 decay exponentially with the sam- ≈ Pr Z>(2 ln 2)θn , ple size n. To see why (10) holds it is convenient to use the alterna- where Z is as before. Therefore, for large n the error proba- tive expression for I(n)givenin(5). Using this, and recalling 2  bility Pe,1 decays like the tail of the χ distribution function, that I(n) = I1(n), we obtain      √ √ 1 n p X , Y ≈ − γ k,(θ ln 2)n − = 1  i  i  − Pe,1 1 ,(8)n[I(n) I] n log   I . (13) Γ(k) n i=1 p Xi q1 Yi where k = (|A|−1)2/2, and Γ, γ denote the Gamma function Since the empirical distributions converge to the correspond- and the incomplete Gamma function, respectively. Although ing true distributions, for large n it is straightforward to jus- this is fairly implicit, we know that the tail of the χ2 distribu- tify the approximation tion decays like e−x/2 as x→∞; therefore,      √ n | − ≈ √1 1 P Xi W Yi Xi − ≈ − n I(n) I log I . Pe,1 exp (θln2)n ,(9) n n i=1 P Xi Q Yi (14) where this approximation is to first-order in the exponent. The fact that this indeed converges in distribution to a Scenario (ii): dependence N(0, σ2), as n→∞, easily follows from the central limit the- orem, upon noting that the mean of the logarithm in (14) In this case, the asymptotic behavior of the test statistic I(n) equals I and its variance is σ2. is somewhat different. Suppose as before that the random n variables X1 are i.i.d. with distribution P, and that the con- Discussion n | ditional distribution of each Yi given X1 is W(Y Xi), for some fixed family of conditional distributions W(y | x); this From the above analysis it follows that in order for both n makes the random variables Y1 i.i.d. with distribution Q. 
probabilities of error to decay to zero for large n (so that we We mentioned in the last section that under the sec- rule out false positives as well as making sure that no depen- ond scenario, I(n) converges to the true underlying value dent segments are overlooked) the threshold θ needs to be Hasan Metin Aktulga et al. 5

[Figure 1 schematic: DNA structure of zmSRp32, showing exons, introns, the 5′ untranslated region (5′ UTR) and 3′ UTR, the protein coding sequence between the start and stop codons, and pre-mRNA processing into alternative mRNA structures (an alternative intron and alternative exons); nucleotide positions 178, 268, 369, 3243, 3688, 3800, 3884, and 4254 are marked.]

Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5 and 3 untranslated regions (white boxes). RNA transcripts (pre-mRNA) are processed to yield mRNA molecules used as templates for protein synthesis. Alternative pre-mRNA splicing generates different mRNA templates from the same transcripts, by selecting either alternative exons or alternative introns. The regions discussed in the text are identified by indices corresponding to the nucleotide position in the original DNA sequence. strictly between 0 and I = I(X; Y). For that, we need to have in alternative processing (splicing) of pre-mRNA transcripts. some prior information about the value of I, that is, of the Then we show how the same methodology can be easily level of dependence we are looking for. If the value of I were adapted to the problem of identifying tandem repeats. We actually known and a fixed threshold θ ∈ (0, I)waschosen present experimental results on DNA sequences from the independent of n, then both probabilities of error would de- FBI’s combined DNA index system (CODIS), which clearly cay exponentially fast, but with typically very different expo- indicate that the empirical mutual information can be a pow- nents: erful tool for this computationally intensive task.

$P_{e,1} \approx \exp\{-(\theta \ln 2)\, n\}, \qquad P_{e,2} \approx \exp\{-\tfrac{(I-\theta)^2}{2\sigma^2}\, n\};$  (15)


Figure 2: Estimated mutual information between the exon located between bases 1–369 and each contiguous subsequence of length 369 in the intron between bases 3243–4220. The estimates were computed both for the original sequences in the standard four-letter alphabet {A, C, G, T} (shown in (a)), as well as for the corresponding transformed sequences for the two-letter purine/pyrimidine grouping {AG, CT} (shown in (b)).

to (7), by setting the probability of false positives equal to 0.001; it is represented by a (red) straight horizontal line in the figures.

In order to "amplify" the effects of regions of potential dependency in various segments of the zmSRp32 gene, we computed the mutual information estimates Î_j on the original strings over the regular four-letter alphabet {A, C, G, T}, as well as on transformed versions of the strings where pairs of letters were grouped together, using either the Watson-Crick pair {AT, CG} or the purine-pyrimidine pair {AG, CT}. In our results we observed that such groupings are often helpful in identifying dependency; this is clearly illustrated by the estimates shown in Figures 2 and 3. Sometimes the {AT, CG} pair produces better results, while in other cases the purine-pyrimidine pair finds new dependencies.

Figure 2 strongly suggests that there is significant dependence between the bases in positions 1–369 and certain substrings of the bases in positions 3243–4220. While the 1–369 region contains the 5' untranslated sequences, an intron, and the first protein coding exon, the 3243–4220 sequence encodes an intron that undergoes alternative splicing. After narrowing down the mutual information calculations to the 5' untranslated region (5'UTR) in positions 1–78 and the 5'UTR intron in positions 78–268, we found that the initially identified dependency was still present; see Figure 3. A close inspection of the resulting mutual information graphs indicates that the dependency is restricted to the alternative exons embedded into the intron sequences, in positions 3688–3800 and 3884–4254.

These findings suggest that there might be a deeper connection between the 5'UTR DNA sequences and the DNA sequences that undergo alternative splicing. The UTRs are multifunctional genetic elements that control gene expression by determining mRNA stability and efficiency of mRNA translation. Like in the zmSRp32 maize gene, they can provide multiple alternatively spliced variants for more complex regulation of mRNA translation [20]. They also contain a number of regulatory motifs that may affect many aspects of mRNA metabolism. Our observations can therefore be interpreted as suggesting that the maize zmSRp32 5'UTR contains information that could be utilized in the process of alternative splicing, yet another important aspect of mRNA metabolism. The fact that the value of the empirical mutual information between the 5'UTR and the DNA sequences that encode alternatively spliced elements is significantly greater than zero clearly points in that direction. Further experimental work could be carried out to verify the existence, and further explore the meaning, of these newly identified statistical dependencies.

We should note that there are many other sequence matching techniques, the most popular of which is probably the celebrated BLAST algorithm. BLAST's working principles are very different from those underlying our method. As a first step, BLAST searches a database of biological sequences for various small words found in the query string. It identifies sequences that are candidates for potential matches, and thus eliminates a huge portion of the database containing sequences unrelated to the query. In the second step, small word matches in every candidate sequence are extended by means of a Smith-Waterman-type local alignment algorithm. Finally, these extended local alignments are combined with some scoring schemes, and the highest scoring alignments obtained are returned. Therefore, BLAST requires a considerable fraction of exact matches to find sequences related to each other.

However, our approach does not enforce any such requirements. For example, if two sequences do not have any exact matches at all, but the characters in one sequence are a characterwise encoding of the ones in the other sequence, then BLAST would fail to produce any significant matches (without corresponding substitution matrices), while our algorithm would detect a high degree of dependency. This is illustrated by the results in the following section, where the presence of certain repetitive patterns in Y_1^M is revealed through matching it to a "probe sequence" X_1^n which does not contain the repetitive pattern, but is "statistically similar" to the pattern sought.
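As a rough illustration of the alphabet groupings described above, the following Python sketch (not part of the original study) recodes DNA strings into the two-letter purine/pyrimidine and Watson-Crick alphabets and computes a plug-in estimate of the empirical mutual information between two aligned windows. The grouping tables, helper names, and placeholder sequences are our own assumptions; the paper's actual estimator, window bookkeeping, and the significance threshold of (7) are not reproduced here.

```python
from collections import Counter
from math import log2

# Illustrative two-letter groupings for the transformed sequences.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}   # {AG, CT}
WATSON_CRICK      = {"A": "W", "T": "W", "C": "S", "G": "S"}   # {AT, CG}

def recode(seq, grouping=None):
    """Map a DNA string onto a reduced alphabet; None keeps {A,C,G,T}."""
    return seq if grouping is None else "".join(grouping[b] for b in seq)

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) in bits from two aligned, equal-length strings."""
    assert len(x) == len(y)
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        pxy = c / n
        mi += pxy * log2(pxy / ((px[a] / n) * (py[b] / n)))
    return mi

# Example: compare MI of two aligned windows on the raw alphabet and on the
# purine/pyrimidine grouping (placeholder strings, not real zmSRp32 data).
exon   = "ATGGCGTACGTTAGC" * 10
intron = "ATGGCATACGTTAGC" * 10
print(empirical_mi(exon, intron))
print(empirical_mi(recode(exon, PURINE_PYRIMIDINE), recode(intron, PURINE_PYRIMIDINE)))
```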


Figure 3: Dependency graph of Î_j versus j for the zmSRp32 gene, using different alphabet groupings: in (a) and (b), we plot the estimated mutual information between the exon found between bases 1–78 and each subsequence of length 78 in the intron located between bases 3243–4220. Plot (a) shows estimates over the original four-letter alphabet {A, C, G, T}, and (b) shows the corresponding estimates over the Watson-Crick pairs {AT, CG}. Similarly, plots (c) and (d) contain the estimated mutual information between the intron located in bases 79–268 and all corresponding subsequences of the intron between bases 3243–4220. Plot (c) shows estimates over the original alphabet, and plot (d) over the two-letter purine/pyrimidine grouping {AG, CT}. Plots (e) and (f) show the estimated mutual information between the 5' untranslated region and all corresponding subsequences of the intron between bases 3243–4220, for the four-letter alphabet (in (e)), and for the two-letter purine/pyrimidine grouping {AG, CT} (in (f)).

3.2. Application to tandem repeats

Here we further explore the utility of the mutual information statistic, and we examine its performance on the problem of detecting short tandem repeats (STRs) in genomic sequences. STRs, usually found in noncoding regions, are made of back-to-back repetitions of a sequence which is at least two bases long and generally shorter than 15 bases. The period of an STR is defined as the length of the repetition sequence in it. Owing to their short lengths, STRs survive mutations well, and can easily be amplified using PCR without producing erroneous data. Although there are many well-identified STRs in the human genome, interestingly, the number of repetitions at any specific locus varies significantly among individuals, that is, they are polymorphic DNA fragments. These properties make STRs suitable tools for determining genetic profiles, and have become a prevalent method in forensic investigations. Long repetitive sequences have also been observed in genomic sequences, but have not gained as much attention since they cannot survive environmental degradation and do not produce high quality data from PCR analysis.

Several algorithms have been proposed for detecting STRs in long DNA strings with no prior knowledge about the size and the pattern of repetition. These algorithms are mostly based on pattern matching, and they all have high time-complexity. Finding short repetitions in a long sequence is a challenging problem. When the query string is a DNA segment that contains many insertions, deletions, or substitutions due to mutations, the problem becomes even harder. Exact- and approximate-pattern matching algorithms need to be modified to account for these mutations, and this renders them complex and inefficient. To overcome these limitations, we propose a statistical approach using an adaptation of the method described in the previous sections.

In the United States, the FBI has decided on 13 loci to be used as the basis for genetic profile analysis, and they continue to be the standard in this area. To demonstrate how our approach can be used for STR detection, we chose to use sequences from the FBI's combined DNA index system (CODIS): the SE33 locus contained in the GenBank sequence V00481, and the VWA locus contained in the GenBank sequence M25858. The periods of STRs found in CODIS typically range from 2 to bases, and do not exhibit enough variability to demonstrate how our approach would perform under divergent conditions. For this reason, we used the V00481 sequence as is, but on M25858 we artificially introduced an STR with period 11, by substituting bases 2821–2920 (where we know that there are no other repeating sequences) with 9 tandem repeats of ACTTTGCCTAT. We have also introduced base substitutions, deletions, and insertions on our artificial STR to imitate mutations.

Let Y_1^M = (Y_1, Y_2, ..., Y_M) denote the DNA sequence in which we are looking for STRs. The gist of our approach is simply to choose a periodic probe sequence of length n, say, X_1^n = (X_1, X_2, ..., X_n) (typically much shorter than Y_1^M), and then to calculate the empirical mutual information Î_j = Î_j(n) between X_1^n and each of its possible alignments with Y_1^M. In order to detect the presence of STRs, the values of the empirical mutual information in regions where STRs do appear should be significantly larger than zero, where "significantly" means larger than the corresponding estimates in ordinary DNA fragments containing no STRs. Obviously, the results will depend heavily on the exact form of the probe sequence. Therefore, it is critical to decide on the method for selecting: (a) the length, and (b) the exact contents of X_1^n. The length of X_1^n is crucial; if it is too short, then X_1^n itself is likely to appear often in Y_1^M, producing many large values of the empirical mutual information and making it hard to distinguish between STRs and ordinary sequences. Moreover, in that case there is little hope that the analysis of the previous section (which was carried out for long sequences X_1^n) will provide useful estimates for the probability of error. If, on the other hand, X_1^n is too long, then any alignment of the probe X_1^n with Y_1^M will likely also contain too many irrelevant base pairs. This will produce negligibly small mutual information estimates, again making it impossible to detect STRs. These considerations are illustrated by the results in Figure 4.

As for the contents of the probe sequence X_1^n, the best choice would be to take a segment X_1^n containing an exact match to an STR present in Y_1^M. But in most of the interesting applications, this is of course unavailable to us. A "second best" choice might be a sequence X_1^n that contains a segment of the same "pattern" as the STR present in Y_1^M, where we say that two sequences have the same pattern if each one can be obtained from the other via a permutation of the letters in the alphabet (cf. [21, 22]). For example, TCTA and GTGC have the same pattern, whereas TCTA and CTAT do not (although they do have the same empirical distribution). For example, if X_1^n contains the exact same pattern as the periodic part of the STR to be detected, and another probe has the same pattern as X_1^n, then, a priori, either choice should be equally effective at detecting the STR under consideration; see Figure 5. (This observation also shows that a single probe X_1^n may in fact be appropriate for locating more than a single STR, e.g., STRs with the same pattern as X_1^n, as in Figure 5, or with the same period, as in Figure 4.) The problem with this choice is, again, that the exact patterns of STRs present in a DNA sequence are not available to us in advance, and we cannot expect all STRs in a given sequence to be of the same pattern.

Even though both of the above choices for X_1^n are usually not practically feasible, if the sequence Y_1^M is relatively short and contains a single STR whose contents are known, then either choice would produce high-quality data, from which the STR contained in Y_1^M can easily be detected; see Figure 5 for an illustration.

In practice, in addition to the fact that the contents of STRs are not known in advance, there is also the issue that in a long DNA sequence there are often many different STRs, and a unique probe will not match all of them exactly. But since STRs usually have a period between 2 and 15 bases, we can actually run our method for all possible choices of repetition sequences, and detect all STRs in the given query sequence Y_1^M. The number of possible probes X_1^n can be drastically reduced by observing that (1) we only need one repeating sequence of each possible pattern, and (2) it suffices to only consider repetition patterns whose period is prime.
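The probe-based scan described in this section can be sketched as follows. This is a minimal illustration under our own assumptions: a plug-in mutual information estimate, a toy query sequence, and an arbitrary probe unit; the helper names and parameter values are not the authors'. The same_pattern helper also illustrates the permutation-based notion of two short sequences sharing the same "pattern".

```python
from collections import Counter
from math import log2
from itertools import permutations

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) in bits from two aligned strings."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def same_pattern(u, v, alphabet="ACGT"):
    """True if v can be obtained from u by permuting the letters of the alphabet,
    e.g. TCTA and GTGC have the same pattern, while TCTA and CTAT do not."""
    if len(u) != len(v):
        return False
    return any(u.translate(str.maketrans(alphabet, "".join(p))) == v
               for p in permutations(alphabet))

def scan_probe(probe_unit, y, n):
    """Slide a periodic probe (repetitions of probe_unit, truncated to length n)
    along y and return the empirical MI at every alignment."""
    probe = (probe_unit * (n // len(probe_unit) + 1))[:n]
    return [empirical_mi(probe, y[j:j + n]) for j in range(len(y) - n + 1)]

# Usage sketch: peaks in the returned list flag candidate STR positions.
y = "ACGT" * 50 + "AAAG" * 12 + "ACGT" * 50        # toy sequence with an AAAG repeat
scores = scan_probe("AGGT", y, n=60)                # probe with the same period as AAAG
print(max(range(len(scores)), key=scores.__getitem__))   # index of the largest peak
print(same_pattern("TCTA", "GTGC"), same_pattern("TCTA", "CTAT"))
```

The probe here shares only the period, not the contents, of the embedded repeat, mirroring the observation (illustrated in Figure 4) that the period of the probe matters more than its exact letters.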


Figure 4: Dependency graph of the GenBank sequence Y_1^M = V00481, for a probe sequence X_1^n which is a repetition of AGGT, of length (a) 12, or (b) 60. The sequence Y_1^M contains STRs that are repetitions of the pattern AAAG, in the following regions: (i) there is a repetition of AAAG between bases 62–108; (ii) AAAG is intervened by AG and AAGG until base 138; (iii) again between 138–294 there are repetitions of AAAG, some of which are modified by insertions and substitutions. In (a) our probe is too short, and it is almost impossible to distinguish the SE33 locus from the rest. However, in (b) the SE33 locus is singled out by the two big peaks in the mutual information estimates; the shorter peak between the two larger ones is due to the interventions described above. Note that the STRs were identified by a probe sequence that was a repetition of a pattern different from that of the repeating part of the STRs themselves, but of the same period.


Figure 5: Dependency graph of the VWA locus contained in GenBank sequence M25858 for a probe sequence X_1^n with n = 12, which is a repetition of (a) TCTA, an exactly matching probe, (b) GTGC, a completely different probe, but of the exact same "pattern". In both cases, we have chosen X_1^n to be long enough to suppress unrelated information. Note that the results in (a) and (b) are almost identical. The VWA locus contains an STR of TCTA between positions 44–123. This STR is apparent in both dependency graphs by forming a periodic curve with high correlation.

Note that in view of the earlier discussion and the results shown in Figure 4, the period of the repeating part of X_1^n is likely to be more important than the actual contents. For example, if we were to apply our method for finding STRs in Y_1^M with a probe X_1^n whose period is 5 bases long, then many STRs with a period that is a multiple of 5 should peak in the dependency chart, thus allowing us to detect their approximate positions in Y_1^M. Clearly, probes that consist of very short repeats, such as AAA..., should be avoided. The importance of choosing an X_1^n with the correct period is illustrated in Figure 6.

The results in Figures 4, 5, and 6 clearly indicate that the proposed methodology is very effective at detecting the presence of STRs, although at first glance it may appear that it cannot provide precise information about their start-end positions and their repeat sequences. But this final task can easily be accomplished by reevaluating Y_1^M near the peak in the dependency graph, for example, by feeding the relevant parts separately into one of the standard string matching-based tandem repeat algorithms. Thus, our method can serve as an initial filtering step which, combined with an exact pattern matching algorithm, provides a very accurate and efficient method for the identification of STRs.

In terms of its practical implementation, note that our approach has a linear running time O(M), where M is the length of Y_1^M. The empirical mutual information of course needs to be evaluated for every possible alignment of Y_1^M and X_1^n, with each such calculation done in O(n) steps, where n is the length of X_1^n. But n is typically no longer than a few hundred bases, and, at least to first order, it can be considered constant. Also, repeating this process for all possible repeat


Figure 6: In these charts we use the modified GenBank sequence M25858, which contains the VWA locus in CODIS between positions 1683–1762 and the artificial STR introduced by us at 2821–2920. The repeat sequence of the VWA locus is TCTA, and the repeat sequence of the artificial STR is ACTTTGCCTAT. In (a), the probe X_1^n has length n = 88 and consists of repetitions of AGGT. Here the repeating sequence of the VWA locus (which has period 4) is clearly indicated by the peak, whereas the artificial tandem repeat (which has period 11) does not show up in the results. The small peak around position 2100 is due to a very noisy STR again with a 4-base period. In (b), the probe X_1^n again has length n = 88, and it consists of repetitions of CATAGTTCGGA. This produces the opposite result: the artificial STR is clearly identified, but there is no indication of the STR present at the VWA locus.

periods does not affect the complexity of our method by much, since the number of such periods is quite small and can also be considered to be constant. And, as mentioned above, choosing probes X_1^n only containing repeating segments with a prime period further improves the running time of our method.

We, therefore, conclude that (a) the empirical mutual information appears in this case to be a very effective tool for detecting STRs; and (b) selecting the length and repetition period of the probe sequence X_1^n is crucial for identifying tandem repeats accurately.

4. CONCLUSIONS

Biological information is stored in the form of monomer strings composed of conserved biomolecular sequences. According to Manfred Eigen, "The differentiable characteristic of living systems is information. Information assures the controlled reproduction of all constituents, thereby ensuring conservation of viability." Hoping to reveal novel, potentially important biological phenomena, we employ information-theoretic tools, especially the notion of mutual information, to detect statistically dependent segments of biosequences. The biological implications of the existence of such correlations are deep, and they themselves remain unresolved. The proposed approach may provide a powerful key to fundamental advances in understanding and quantifying biological information.

This work addresses two specific applications based on the proposed tools. From the experimental analysis carried out on regions of the maize zmSRp32 gene, our findings suggest the existence of a biological connection between the 5' untranslated region in zmSRp32 and its alternatively spliced exons, potentially indicating the presence of novel alternative splicing mechanisms or structural scaffolds. Secondly, through extensive analysis of CODIS data, we show that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling studies.

ACKNOWLEDGMENTS

This research was supported in part by the NSF Grants CCF-0513636 and DMS-0503742, and the NIH Grant R01 GM068959-01.

REFERENCES

[1] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, "The mutual information: detecting and evaluating dependencies between variables," Bioinformatics, vol. 18, supplement 2, pp. S231–S240, 2002.
[2] Z. Dawy, B. Goebel, J. Hagenauer, C. Andreoli, T. Meitinger, and J. C. Mueller, "Gene mapping and marker clustering using Shannon's mutual information," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 1, pp. 47–56, 2006.
[3] E. Segal, Y. Fondufe-Mittendorf, L. Chen, et al., "A genomic code for nucleosome positioning," Nature, vol. 442, no. 7104, pp. 772–778, 2006.
[4] Y. Osada, R. Saito, and M. Tomita, "Comparative analysis of base correlations in 5' untranslated regions of various species," Gene, vol. 375, no. 1-2, pp. 80–86, 2006.
[5] M. Kozak, "Initiation of translation in prokaryotes and eukaryotes," Gene, vol. 234, no. 2, pp. 187–208, 1999.
[6] D. A. Reddy and C. K. Mitra, "Comparative analysis of transcription start sites using mutual information," Genomics, Proteomics and Bioinformatics, vol. 4, no. 3, pp. 189–195, 2006.
[7] D. A. Reddy, B. V. L. S. Prasad, and C. K. Mitra, "Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices," Computational Biology and Chemistry, vol. 30, no. 1, pp. 58–62, 2006.

[8] S. A. Shabalina, A. Y. Ogurtsov, I. B. Rogozin, E. V. Koonin, and D. J. Lipman, "Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals," Nucleic Acids Research, vol. 32, no. 5, pp. 1774–1782, 2004.
[9] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, "Exploiting the past and the future in protein secondary structure prediction," Bioinformatics, vol. 15, no. 11, pp. 937–946, 1999.
[10] G. Battail, "Should genetics get an information-theoretic education? Genomes as error-correcting codes," IEEE Engineering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 34–45, 2006.
[11] H. Gao, W. J. Gordon-Kamm, and L. A. Lyznik, "ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced," Gene, vol. 339, no. 1-2, pp. 25–37, 2004.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[13] P. I. Good, Resampling Methods, Birkhäuser, Boston, Mass, USA, 2005.
[14] B. Manly, Randomization, Bootstrap and Monte Carlo Methods in Biology, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1977.
[15] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, Springer, New York, NY, USA, 3rd edition, 2005.
[16] M. J. Schervish, Theory of Statistics, Springer, New York, NY, USA, 1995.
[17] J. Hagenauer, Z. Dawy, B. Göbel, P. Hanus, and J. Mueller, "Genomic analysis using methods from information theory," in Proceedings of IEEE Information Theory Workshop (ITW '04), pp. 55–59, San Antonio, Tex, USA, October 2004.
[18] B. Goebel, Z. Dawy, J. Hagenauer, and J. C. Mueller, "An approximation to the distribution of finite sample size mutual information estimates," in Proceedings of IEEE International Conference on Communications (ICC '05), vol. 2, pp. 1102–1106, Seoul, Korea, May 2005.
[19] M. Hutter, "Distribution of mutual information," in Advances in Neural Information Processing Systems 14, pp. 399–406, MIT Press, Cambridge, Mass, USA, 2002.
[20] T. A. Hughes, "Regulation of gene expression by alternative untranslated regions," Trends in Genetics, vol. 22, no. 3, pp. 119–122, 2006.
[21] J. Åberg, Yu. M. Shtarkov, and B. J. M. Smeets, "Multialphabet coding with separate alphabet description," in Proceedings of the International Conference on Compression and Complexity of Sequences, pp. 56–65, Positano, Italy, June 1997.
[22] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 2954–2964, 2006.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 13853, 13 pages doi:10.1155/2007/13853

Research Article Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information

Arvind Rao,1 Alfred O. Hero III,1 David J. States,2 and James Douglas Engel3

1 Departments of Electrical Engineering and Computer Science and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA 2 Departments of Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA 3 Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, MI 48109, USA

Received 1 March 2007; Revised 23 June 2007; Accepted 17 September 2007

Recommended by Teemu Roos

Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Coupled with the complexity underlying tissue-specific gene expression, there are several motifs that are putatively responsible for expression in a certain cell type. This has important implications in understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. Firstly, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity of any sequence of interest. Such analysis yields several novel interesting motifs that merit further experimental characterization. Furthermore, this approach leads to a principled framework for the prospective examination of any chosen motif to be a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable the large-scale investigation for the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.

Copyright © 2007 Arvind Rao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION transcriptional start site (TSS). The basal transcriptional ma- chinery at the promoter coupled with the transcription fac- tor complexes at these distal, long-range regulatory elements Understanding the mechanisms underlying regulation of (LREs) are collectively involved in directing tissue-specific tissue-specific gene expression remains a challenging ques- expression of genes. tion. While all mature cells in the body have a complete copy One of the current challenges in the post-genomic era of the human genome, each cell type only expresses those is the principled discovery of such LREs genome-wide. Re- genes it needs to carry out its assigned task. This includes cently, there has been a community-wide effort (http:// genes required for basic cellular maintenance (often called www.genome.gov/ENCODE) to find all regulatory elements “housekeeping genes”) and those genes whose function is in 1% of the human genome. The examination of the dis- specific to the particular tissue type that the cell belongs to. covered elements would reveal characteristics typical of most Gene expression by a way of transcription is the process of enhancers which would aid their principled discovery and generation of messenger RNA (mRNA) from the DNA tem- examination on a genome-wide scale. Some characteristics plate representing the gene. It is the intermediate step before of experimentally identified distal regulatory elements [1, 2] the generation of functional protein from messenger RNA. are as follows. During gene expression (see Figure 1), transcription factor (TF) proteins are recruited at the proximal promoter of the (i) Noncoding elements: distal regulatory elements are gene as well as at sequence elements (enhancers/silencers) noncoding and can either be intronic or intergenic re- which can lie several hundreds of kilobases from the gene’s gionsonthegenome.Hence,previousmodelsforgene 2 EURASIP Journal on Bioinformatics and Systems Biology

TF complex TATA box Another practical reason for the examination of pro- Distal RNA pol. II TSS enhancer moters is that their locations (and genomic sequences) are more clearly delineated on genome databases (like ffi Distal Promoter UCSC or Ensembl). Su cient data (http://symatlas (proximal) enhancer Exon Intron .gnf.org) on the expression of genes is also publicly available for analysis. Sequence motif discovery is set Figure 1: Schematic of transcriptional regulation. Sequence motifs up as a feature extraction problem from these tissue- at the promoter and the distal regulatory elements together confer specific promoter sequences. Subsequently, a support specificity of gene expression via TF binding. vector machine (SVM) classifier is used to classify new promoters into specific and nonspecific categories based on the identified sequence features (motifs). Us- ing the SVM classifier algorithm, 90% of tissue-specific finding [3] are not directly applicable. With over 98% genes are correctly classified based upon their up- of the annotated genome being noncoding, the pre- stream promoter region sequences alone. cise localization of regulatory elements that underlie (ii) Known long range regulatory elements (LRE) motifs: tissue-specific gene expression is a challenging prob- to analyze the motifs in LRE elements, we examine lem. the results of the above approach on the Enhancer (ii) Distance/orientation independent: an enhancer can Browser dataset (http://enhancer.lbl.gov) which has act from variable genomic distances (hundreds of kilo- results of expression of ultraconserved genomic ele- bases) to regulate gene expression in conjunction with ments in transgenic mice [8]. An examination of these the proximal promoter, possibly via a looping mecha- ultraconserved enhancers is useful for the extraction nism [4]. These enhancers can lie upstream or down- of discriminatory motifs to distinguish the regulatory stream of the actual gene along the genomic locus. elements from the nonregulatory (neutral) ones. Here (iii) Promoter dependent: since the action at a distance of the results indicate that up to 95% of the sequences can these elements involves the recruitment of TFs that di- be correctly classified using these identified motifs. rect tissue-specific gene expression, the promoter that We note that some of the identified motifs might not be tran- they interact with is critical. scription factor binding motifs, and would need to be func- Although there are instances where a gene harbors tissue- tionally characterized. This is an advantage of our method- specific activity at the promoter itself, the role of long-range instead of constraining ourselves to the degeneracy present elements (LREs) remains of interest, for example, for a de- in TF databases (like TRANSFAC/JASPAR), we look for all tailed understanding of their regulatory role in gene expres- sequences of a fixed length. sion during biological processes like organ development and disease progression [5]. We seek to develop computational 2. CONTRIBUTIONS strategies to find novel LREs genome-wide that govern tissue specific expression for any gene of interest. A common ap- Using microarray gene expression data, [9, 10] proposes an proach for their discovery is the use of motif-based sequence approach to assign genes into tissue-specific and nonspecific signatures. Any sequence element can then be scanned for categories using an entropy criterion. 
Variation in expression such a signature and its tissue specificity can be ascertained and its divergence from ubiquitous expression (uniform dis- [6]. tribution across all tissue types) is used to make this assign- Thus, our primary question in this regard is that is there ment. Based on such assignment, several features like CpG a discriminating sequence property of LRE elements that de- island density, frequency of transcription factor motif occur- termines tissue-specific gene expression—more particularly, rence, can be examined to potentially discriminate these two are there any sequence motifs in known regulatory elements groups. Other work has explored the existence of key mo- that can aid discovery of new elements [7]. To answer this, we tifs (transcription factor binding sites) in the promoters of examine known tissue-specific regulatory elements (promot- tissue-specific genes (see [11, 12]). Based on the successes ers and enhancers) for motifs that discriminate them from reported in these methods, it is expected that a principled a background set of neutral elements (such as housekeeping examination and characterization of every sequence motif gene promoters). For this study, the datasets are derived from identified to be discriminatory might lead to improved in- the following sources. sight into the biology of gene regulation. For example, such a strategy might lead to the discovery of newer TFBS motifs, (i) Promoters of tissue-specific genes: before the widespread as well as those underlying epigenetic phenomena. discovery of long-range regulatory elements (LREs), it For the purpose of identifying discriminative motifs from was hypothesized that promoters governed gene ex- the training data (tissue-specific promoters or LREs), our ap- pression alone. There is substantial evidence for the proach is as follows. binding of tissue-specific transcription factors at the promoters of expressed genes. This suggests that in (i) Variable selection: firstly, sequence motifs that dis- spite of newer information implicating the role of criminate between tissue-specific and non-specific el- LREs, promoters also have interesting motifs that gov- ements are discovered. In machine learning, this is ern tissue-specific expression. a feature selection problem with features being the Arvind Rao et al. 3

counts of sequence motifs in the training sequences. Examine sequences Without loss of generality, six-nucleotide motifs (hex- (promoters/enhancers) amers) are used as motif features. This is based on from Tissue Expression Atlas the observation that most transcription factor binding Training data motifs have a 5-6 nucleotide core sequence with de- generacy at the ends of the motif. A similar setup has Tissue-specific Neutral sequences been introduced in [13–15]. The motif search space sequences 6 = is, therefore, a 4 4096-dimensional one. The pre- Parse sequences to obtain relative counts sented approach, however, does not depend on mo- Preprocess tif length and can be scaled according to biological knowledge. For variable (motif) selection, a novel fea- Build co-occurrence ture selection approach (based on an information the- matrices for training data oretic quantity called directed information (DI)) is pro- posed. The improved performance of this criterion over using mutual information for motif selection is Feature (motif) selection (DI/MI) and classification (SVM) also demonstrated. (ii) Classifier design: after discovering discriminating mo- tifs using the above DI step, an SVM classifier that Biological interpretation separates the samples between the two classes (specific of top ranking motifs and nonspecific) from this motif space is constructed. Figure 2: An overview of the proposed approach. Each of the steps Apart from this novel feature selection approach, several are outlined in the following sections. questions pertaining to bioinformatics methodology can be potentially answered using this framework—some of these areasfollows. most common approach is to look for TFBS motifs that are (i) Are there common motifs underlying tissue-specific statistically over-represented in the promoters of the coex- expression that are identified from tissue-specific pro- pressed genes based on a background (binomial or Poisson) moters and enhancers? In this paper, an examina- distribution of motif occurrence genomewide. tion of motifs (from promoters and enhancers) cor- In this work, the problem of motif discovery is set up as responding to brain-specific expression is done to ad- follows. Using two annotated groups of genes, tissue-specific dress this question. (“ts”) and nontissue-specific (“nts”), hexamer motifs that (ii) Do these motifs correspond to known motifs (tran- best discriminate these two classes are found. The goal would scription factor binding sites)? We show that several be to make this set of motifs as small as possible, that is, to motifs are indeed consensus sites for transcription fac- achieve maximal class partitioning with the smallest feature tor binding, although their real role can only be iden- subset. tified in conjunction with experimental evidence. Several metrics have been proposed to find features with (iii) Is it possible to relate the motif information from the maximal class label association. From information theory, sequence and expression perspectives to understand mutual information is a popular choice [18]. This is a sym- regulatory mechanisms? This question is addressed in metric association metric and does not resolve the direc- Section 11.3. tion of dependency (i.e., if features depend on the class la- (iv) How useful are these motifs in predicting new tissue- bel or vice versa). It is important to find features that induce specific regulatory elements? This is partly explained the class label. 
Feature selection from data implies selection from the results of SVM classification. (control) of a feature subset that maximally captures the un- This work differs from that in [13, 14], in several aspects. derlying character (class label) of the data. There is no con- We present the DI-based feature selection procedure as part trol over the label (a purely observational characterization). of an overall unified framework to answer several questions With this motivation, a new metric for discriminative in bioinformatics, not limited to finding discriminating mo- hexamer subset selection, termed “directed information” tifs between two classes of sequences. Particularly, one of (DI), is proposed. Based on the selected features, a classifier the advantages is the ability to examine any particular mo- is used to classify sequences to tissue-specific or nontissue- tif as a potential discriminator between two classes. Also, specific categories. The performance of this DI-based feature this work accounts for the notion of tissue-specificity of selection metric is subsequently evaluated in the context of promoters/enhancers (in line with more recent work in [8– the SVM classifier. 10, 16, 17]). Also, this framework enables the principled in- tegration of various data sources to address the above ques- 4. OVERALL METHODOLOGY tions. These are clarified in Section 11. The overall schematic of the proposed procedure is outlined 3. RATIONALE in Figure 2. The main approaches to finding common motifs driving Below we present our approach to find promoter-specific tissue-specificgeneregulationaresummarizedin[1, 2]. The or enhancer-specific motifs. 4 EURASIP Journal on Bioinformatics and Systems Biology

5. MOTIF ACQUISITION Table 1: The “motif frequency matrix” for a set of gene promoters. The first column is their ENSEMBL gene identifiers and the other 4 5.1. Promoter motifs columns are the motifs. A cell entry denotes the number of times a given motif occurs in the upstream (−2000 to +1000 bp from TSS) 5.1.1. Microarray analysis region of each corresponding gene.

Raw microarray data is available from the Novartis Foun- Ensembl Gene ID AAAAAA AAAAAG AAAAAT AAAACA dation (GNF) [http://symatlas.gnf.org]. Data is normal- ENSG00000155366 0 0 1 4 ized using RMA from the bioconductor packages for R ENSG000001780892 6 5 5 6 [http://cran.r-project.org]. Following normalization, repli- ENSG00000189171 1 2 1 0 cate samples are averaged together. Only 25 tissue types ENSG00000168664 6 3 8 0 are used in our analysis including: adrenal gland, amygdala, brain, caudate nucleus, cerebellum, corpus callosum, cortex, ENSG00000160917 4 1 4 2 dorsal root ganglion, heart, HUVEC, kidney, liver, lung, pan- ENSG00000163655 2 4 0 1 creas, pituitary, placenta, salivary, spinal cord, spleen, testis, ENSG000001228844 8 6 10 7 thalamus, thymus, thyroid, trachea, and uterus. ENSG00000176749 0 0 0 0 In this context, the notion of tissue specificity of a gene ENSG00000006451 5 2 2 1 needs clarification. Suppose there are N genes, g1, g2, ..., gN , and T tissue types (in GNF: T = 25), we construct an × = N T tissue specificity matrix: M [0]N×T .Foreachgene individually. This results in two hexamer-gene cooccurrence gi,1 ≤ i ≤ N,letgi,[0.5T] = median(gi,k), for all k ∈ 1, 2, ..., matrices—one for the “ts” class (dimension Ntrain,+1 × 1000) T; gi,k being the expression level of gene i in tissue k.Define and the other for the “nts” class (dimension Ntrain,−1 × 1000). each entry Mi,k as Here Ntrain,+1 and Ntrain,−1 are the number of positive training ⎧ ⎨ and negative training samples, respectively. 1ifgi,k ≥ 2gi,[0.5T], = The input to the feature selection procedure is a gene Mi,k ⎩ (1) 0 otherwise. promoter-motif frequency table (Table 1). The genes relevant to each class are identified from tissue microarray analysis, = T ≤ following steps in Section 5.1.1 and the frequency table is Now consider the N-dimensional vector mi k=1Mi,k,1 i ≤ N, that is, summing all the columns of each row. The built by parsing the gene promoters for the presence of each 6 interquartile range of m can be used for “ts”/“nts” assign- of the 4 = 4096 possible hexamers. ment. Gene indices i that are in quartile 1 (= 3) are labeled as “ts,” and those in quartile 4 (= 22) are labeled as “nts.” 5.2. LRE motifs With this approach, a total of 1924 probes represent- ing 1817 genes were classified as tissue-specific, while 2006 To analyze long range elements which confer tissue-specific probes representing 2273 genes were classified as nontissue- expression, the Mouse Enhancer database (http://enhancer specific. In this work, genes which are either heart-specific or .lbl.gov) is examined. This database has a list of experi- brain-specific are considered. From the tissue-specific genes mentally validated ultraconserved elements which have been obtained from the above approach, 45 brain-specific gene tested for tissue specific expression in transgenic mice [8], promoters and 118 heart-specific gene promoters are ob- and can be searched for a list of all elements which have tained. As mentioned in Section 2, one of the objectives is expression in a tissue of interest. In this work, we consider to find motifs that are responsible for brain/heart specific expression in tissues relating to the developing brain. Ac- expression and also correlate them with binding profiles of cording to the experimental protocol, the various regions are known transcription factor binding motifs. cloned upstream of a heat shock protein promoter (hsp68- lacz), thereby not adhering to the idea of promoter specificity 5.1.2. 
Sequence analysis in tissue-specific expression. Though this is of concern in that there is loss of some gene-specific information, we work Genes (“ts” or “nts”) associated with candidate probes are with this data since we are more interested in tissue expres- identified using the Ensembl Ensmart [http://www.ensembl sion and also due to a paucity of public promoter-dependent .org] tool. For each gene, sequence from 2000 bp upstream enhancer data. and 1000 bp down-stream upto the start of the first exon rel- This database also has a collection of ultraconserved el- ative to their reported TSS is extracted from the Ensembl ements that do not have any transgenic expression in vivo. Genome Database (Release 37). The relative counts of each This is used as the neutral/background set of data which cor- of the 46 hexamers are computed within each gene promoter responds to the “nts” (nontissue-specific class) for feature se- sequence of the two categories (“ts” and “nts”)—using the lection and classifier design. “seqinr” library in the R environment. A t-test is performed As in the above (promoter) case, these sequences (sev- between the relative counts of each hexamer between the two enty four enhancers for brain-specific expression) are parsed expression categories (“ts” and “nts”) and the top 1000 sig- for the absolute counts of the 4096 hexamers, a cooccurrence  = nificant hexamers (H = H1, H2, ..., H1000) are obtained. The matrix (Ntrain,+1 74) is built and then t-test P-values are   =    relative counts of these hexamers is recomputed for each gene used to find the top 1000 hexamers (H H1, H2, ..., H1000) Arvind Rao et al. 5 that are maximally different between the two classes (brain- X2 specific and brain-nonspecific). Y The next three sections clarify the preprocessing, feature selection, and classifier design steps to mine these cooccur- X1 X2 rence matrices for hexamer motifs that are strongly associ- ated with the class label. We note that though this work is il- lustrated using two class labels, the approach can be extended in a straightforward way to the multiclass problem.

6. PREPROCESSING

From the above, Ntrain,+1 × 1000 and Ntrain,−1 × 1000 di- mensional cooccurrence matrices are available for the tissue- specific and nonspecific data, both for the promoter and enhancer sequences. Before proceeding to the feature (hex- X1 amer motif) selection step, the counts of the M = 1000 hexamers in each training sample need to be normalized Figure 3: Causal feature discovery for two class discrimination, to account for variable sequence lengths. In the cooccur- adapted from [20]. Here the variables X1 and X2 discriminate Y, the class label. rence matrix, let gci,k represent the absolute count of the kth hexamer, k ∈ 1, 2, ..., M, in the ith gene. Then, for = each gene gi, the quantile labeled matrix has Xi,k l if of performance is the amount of information flow from the ≤ = gci,[((l−1)/K)M] gci,k

This hypothesis test is done for each of the 1000 mo- butionally between the positive and negative training tifs, in order to select the top d motifs based on DI value, samples. The top 1000 of these hexamers are cho- which is then used for classifier training subsequently. This sen for further analysis. This step is only necessary leads to a need for multiple-testing correction. Because the to reduce the computational complexity of the over- Bonferroni correction is extremely stringent in such settings, all procedure—computing the DI between each of the the Benjamini-Hochberg procedure [32], which has a higher 4096 hexamers and the class label is relatively expen- false positive rate but a lower false negative rate, is used in sive. this work. (5) For the top K = 1000 hexamers which are most significantly different between the positive and nega- N → N N N 9. SUPPORT VECTOR MACHINES tive training examples, I(Xk Y )andI(Xk ; Y )re- veal the degree of association for each of the k ∈ From the top d features identified from the ranked list (1, 2, ..., K) hexamers. The entropy terms in the di- of features having high DI with the class label, a sup- rected information and mutual information expres- port vector machine classifier in these d dimensions is de- sions are found using a higher-order entropy estima- signed. An SVM is a hyperplane classifier which operates tor. Using the procedure of Section 7, the raw DI val- by finding a maximum margin linear hyperplane to sepa- ues are converted into their normalized versions. Since rate two different classes of data in high-dimensional (D> the goal is to maximize I(Xk→Y), we can rank the DI d) space. The training data has N(= Ntrain,+1 + Ntrain,−1) values in descending order. ∈ Rd ∈ pairs (x1, y1), (x2, y2), ...,(xN , yN ), with xi and yi (6) The significance of the DI estimate is obtained based {−1, +1}. on the bootstrapping methodology. For every hex- An SVM is a maximum margin hyperplane classifier in a amer, a P = 0.05 significance with respect to its nonlinearly extended high-dimensional space. For extending bootstrapped null distribution yields potentially dis- the dimensions from d to D>d, a radial basis kernel is used. criminative hexamers between the two classes. The The objective is to minimize β in the hyperplane {x : Benjamini-Hochberg procedure is used for multiple- = T } T ≥ − ∀ ≥ f (x) x β + β0 ,subjecttoyi(xi β + β0) 1 ξi i, ξi testing correction. Ranking the significant hexamers ≤ 0, ξi constant [33]. by decreasing DI value yields features that can be used for classifier (SVM) training. 10. SUMMARY OF OVERALL APPROACH (7) Train the support vector machine (SVM) classifier on the top d features from the ranked DI list(s). For com- Our proposed approach is as follows. Here, the term “se- parison with the MI-based technique, we use the hex- quence” can pertain to either tissue-specific promoters or amers which have the top d (normalized) MI values. LRE sequences, obtained from the GNF SymAtlas and En- The accuracy of the trained classifier is plotted as a sembl databases or the Enhancer Browser. function of the number of features (d), after ten-fold (1) The sequence is parsed to obtain the relative counts/ cross-validation. As we gradually consider higher d,we frequencies of occurrence of the hexamer in that se- move down the ranked list. In the plots below, the mis- quence and to build the hexamer-sequence frequency classification fraction is reported instead. A fraction of matrix. 
The “seqinr” package in R is used for this pur- 0.1 corresponds to 10% misclassification. pose. This is done for all the sequences in the specific Note. An important point concerns the training of the SVM (class “+1”) and nonspecific (class “−1”) categories. classifier with the top d features selected using DI or MI (step = ThematrixthushasN Ntrain,+1 + Ntrain,−1 rows and (7) above). Since the feature selection step is decoupled from 46 = 4096 columns. the classification step, it is preferred that the top d motifs are (2) The obtained hexamer-sequence frequency matrix is consistently ranked high among multiple draws of the data, preprocessed by assigning quantile labels for each hex- so as to warrant their inclusion in the classifier. However, amer within the ith sequence. A hexamer-sequence this does not yield expected results on this data set. Briefly, matrix is thus obtained where the (i, j)th entry has the a kendall rank correlation coefficient [34]wascomputedbe- quantile label of the jth hexamer in the ith sequence. tween the rankings of the motifs between multiple data draws This is done for all the N training sequences consisting (by sampling a subset of the entire dataset), for both MI- of examples from the −1 and +1 class labels. and DI-based feature-selection. It is observed that this co- (3) Thus, two submatrices corresponding to the two class efficient is very low in both MI and DI, indicating a highly labels are built. One matrix contains the hexamer- variable ranking. This is likely due to the high variability in sequence quantile labels for the positive training ex- data distribution across these multiple draws (due to limited amples and the other matrix is for the negative training number of data points), as well as the sensitivity of the data- examples. dependent entropy estimation procedure to the range of the (4) To select hexamers that are most different between the samples in the draw. To circumvent this problem of inconsis- positive and negative training examples, a t-test is per- tency in rank of motifs, a median DI/MI value is computed formed for each hexamer, between the “ts” and “nts” across these various draws and the top d features based on the groups. Ranking the corresponding t-test P-values median DI/MI value across these draws are picked for SVM yields those hexamers that are most different distri- training [20]. 8 EURASIP Journal on Bioinformatics and Systems Biology
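To make the pipeline summarized in Section 10 concrete, the sketch below (our illustration, not the authors' code) quantile-labels a hexamer-count matrix, ranks features, keeps the top d, and reports 10-fold cross-validated accuracy of an RBF-kernel SVM. Because the paper's directed-information estimator is not reproduced here, scikit-learn's mutual information score is used purely as a stand-in ranking criterion; the function names, parameter values, and random toy data are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def quantile_labels(counts, n_bins=4):
    """Replace each row's hexamer counts by per-row quantile labels (1..n_bins),
    mirroring the preprocessing of Section 6."""
    labels = np.empty_like(counts, dtype=int)
    for i, row in enumerate(counts):
        edges = np.quantile(row, np.linspace(0, 1, n_bins + 1)[1:-1])
        labels[i] = np.digitize(row, edges) + 1
    return labels

def rank_and_classify(X_counts, y, d=200, seed=0):
    """Rank hexamer features and report 10-fold CV accuracy of an RBF-kernel SVM
    trained on the top-d features.  Mutual information is used here only as a
    stand-in score; the paper ranks features by directed information instead."""
    Xq = quantile_labels(X_counts)
    scores = mutual_info_classif(Xq, y, discrete_features=True, random_state=seed)
    top = np.argsort(scores)[::-1][:d]
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    return cross_val_score(clf, Xq[:, top], y, cv=10).mean()

# Toy usage with random counts standing in for the hexamer-frequency matrix.
rng = np.random.default_rng(0)
X_counts = rng.poisson(3, size=(120, 1000))   # 120 promoters x 1000 hexamers
y = rng.integers(0, 2, size=120)              # "ts" (+1) vs "nts" (0) labels
print(rank_and_classify(X_counts, y, d=50))
```

Decoupling the ranking step from the classifier in this way also makes it easy to vary d and plot misclassification against the number of top-ranked features, as done in the figures that follow.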

11. RESULTS GC hkg prom GC brain prom 0.8 0.8 0.7 0.7 11.1. Tissue specific promoters 0.6 0.6 0.5 0.5 We use DI to find hexamers that discriminate brain-specific 0.4 0.4 and heart-specific expression from neutral sequences. The 0.3 0.3 negative training sets are sequences that are not brain or 0.2 0.2 heart-specific, respectively. Results using the MI and DI (a) (b) methods are given below (see Figures 5 and 7). The plots × 2 indicate the SVM cross-validated misclassification accuracy 10 4 10 (ideally 0) for the data as the number of features using the metric (DI or MI) is gradually increased. We can see that for 3 8 6 any given classification accuracy, the number of features us- 2 ing DI is less than the corresponding number of features us- 4 Frequency Frequency ing MI. This translates into a lower misclassification rate for 1 2 DI-based feature selection. We also observe that as the num- 0 0 ber of features d is increased, the performance of MI is the 0.3 0.4 0.5 0.6 0.7 0.3 0.4 0.5 0.6 same as DI. This is expected since, as we gather more fea- GC hkg prom GC brain prom tures using MI or DI, the differences in MI versus DI ranking are compensated. (c) (d) An important point needs to be clarified here. There Figure 4: GC sequence composition for brain-specific promoters is a possibility of sequence composition bias in the tissue- and housekeeping (hkg) promoters. specific and neutral sequences used during training. This has been reported in recent work [15]. To avoid detecting GC rich sequences as hexamer features, it is necessary to confirm that there is no significant GC-composition bias between the 0.35 specific and neutral sets in each of the case studies. This is demonstrated in Figures 4, 6,and8. In each case, it is ob- 0.3 served that the mean GC-composition is almost same for the specific versus neutral set. However, in such studies, it is nec- 0.25 essary to select for sequences that do not exhibit such bias. In Figures 6 and 8, even the distribution of GC-composition 0.2 is similar among the samples. For Figure 4, even though the 0.15 distributions are slightly different, the box plots indicate sim- ilarity in mean GC-content. 0.1 Next, some of the motifs that discriminate between tissue-specific and nonspecific categories for the brain pro- Misclassification rate (fraction) 0.05 moter, heart promoter, and brain enhancer cases, respec- tively, are listed in Table 2. Additionally, if the genes en- 0 coding for these TFs are expressed in the correspond- 0 50 100 150 200 ∗ ing tissue [35], a ( ) sign is appended. In some cases, Number of top ranking features used for classification the hexamer motifs match the consensus sequences of known transcription factors (TFs). This suggests a poten- MI tial role for that particular TF in regulating expression DI of tissue-specific genes. This matching of hexamer motifs Figure 5: Misclassification accuracy for the MI versus DI case with TFBS consensus sites is done using the MAPPER en- (brain promoter set). Accuracy of classification is ∼0.9, that is, 93%. gine (http://bio.chip.org/mapper). It is to be noted that a hexamer-TFBS match does not necessarily imply the func- tional role of the TF in the corresponding tissue (brain or heart). However, such information would be useful to guide doi: 10.1155/2007/13853), we have reported only a few due focused experiments to confirm their role in vivo (using tech- to space constraints. niques such as chromatin immunoprecipitation). 
In the context of the heart-specific genes, we con- As is clear from the above results, there are several sider the cardiac troponin gene (cTNT, ENSEMBL: other motifs which are novel or correspond to nonconsen- ENSG00000118194), which is present in the heart promoter sus motifs of known transcription factors. Hence, each of set. An examination of the high DI motifs for the heart- the identified hexamers merit experimental investigation. specific set yields motifs with the GATA consensus site, as Also, though we identify as many as 200 hexamers in this well as matches with the MEF2 transcription factor. It has work (please see Supplementary Material available online at been established earlier that GATA-4, MEF2 are indeed Arvind Rao et al. 9

GC hkg prom GC heart prom Table 2: Comparison of high ranking motifs (by DI) across differ- 0.8 0.8 ent data sets. The (∗) sign indicates tissue-specific expression of the 0.7 0.7 corresponding TF gene. 0.6 0.6 Brain promoters Heart promoters Brain enhancers 0.5 0.5 Ahr-ARNT (∗)Pax2HNF-4(∗) ∗ ∗ 0.4 0.4 Tcf11-MafG ( ) Tcf11-MafG ( )Nkx2 ∗ ∗ 0.3 0.3 c-ETS ( )XBP1()AML1 FREAC-4 Sox-17 (∗)c-ETS(∗) (a) (b) T3R-alpha1 FREAC-4 Elk1 (∗) 2 ×10 ∗ 4 30 GATA( ) 25 3 20 2 15 10 versus number of features in the MI and DI scenarios reveal Frequency 1 Frequency 5 the superior performance of the DI-based hexamer selection 0 0 compared to MI (see Figure 9). 0.3 0.4 0.5 0.6 0.7 0.3 0.4 0.5 0.6 0.7 In this case, the enhancer sequences are ultraconserved, GC hkg prom GC heart prom thus obtained after alignment across multiple species. The examination of these sequences identified motifs that are (c) (d) potentially selected for regulatory function across evolu- Figure 6: GC sequence composition for heart-specific promoters tionary distances. Using alignment as a prefiltering strat- and housekeeping (hkg) promoters. egy helps remove bias conferred by sequence elements that arise via random mutation but might be over-represented. This is permitted in programs like Toucan [12] and rVISTA 0.35 (http://rvista.dcode.org). As in the previous case, some of the top ranking motifs ∗ 0.3 from this dataset are also shown in Table 2. The ( ) signed TFs indicate that some of these discovered motifs indeed 0.25 have documented high expression in the brain. The occur- rence of such tissue-specific transcription factor motifs in 0.2 these regulatory elements gives credence to the discovered motifs. For example, ELK-1 is involved in neuronal differ- 0.15 entiation [38]. Also, some motifs matching consensus sites of TEF1 and ETS1 are common to the brain-enhancer and 0.1 brain-promoter set. Though this is interesting, an experi-

Misclassification rate (fraction) ment to confirm the enrichment of such transcription fac- 0.05 tors in the population of brain-specific regulatory sequences is necessary. 0 0 50 100 150 200 Number of top ranking features used for classification 11.3. Quantifying sequence-based TF influence

MI A very interesting question emerges from the above pre- DI sented results. What if one is interested in a motif that is not present in the above ranked hexamer list for a particu- Figure 7: Misclassification accuracy for the MI versus DI case (heart lar tissue-specific set? As an example, consider the case for promoter set). MyoD, a transcription factor which is expressed in muscle and has an activity in heart-specific genes too [39]. In fact, a variant of its consensus motif CATTTG is indeed in the top involved in transcriptional activation of this gene [36]and ranking hexamer list. The DI-based framework further per- the results have been confirmed by ChIP [37]. mits investigation of the directional association of the canon- ical MyoD motif (CACCTG) for the discrimination of heart- 11.2. Enhancer DB specific genes versus housekeeping genes. This is shown in Figure 10.Asisobserved,MyoD has a significant directional Additionally, all the brain-specific regulatory elements pro- influence on the heart-specific versus neutral sequence class filed in the mouse Enhancer Browser database (http:// label. This, in conjunction with the expression level char- enhancer.lbl.gov) are examined for discriminating motifs. acteristics of MyoD, indicates that the motif CACCTG is Figure 8 shows that the two classes have similar GC- potentially relevant to make the distinction between heart- composition. Again, the plot of misclassification accuracy specific and neutral sequences. 10 EURASIP Journal on Bioinformatics and Systems Biology

11.3. Quantifying sequence-based TF influence

A very interesting question emerges from the above presented results. What if one is interested in a motif that is not present in the above ranked hexamer list for a particular tissue-specific set? As an example, consider MyoD, a transcription factor which is expressed in muscle and has an activity in heart-specific genes too [39]. In fact, a variant of its consensus motif CATTTG is indeed in the top ranking hexamer list. The DI-based framework further permits investigation of the directional association of the canonical MyoD motif (CACCTG) with the discrimination of heart-specific genes versus housekeeping genes. This is shown in Figure 10. As is observed, MyoD has a significant directional influence on the heart-specific versus neutral sequence class label. This, in conjunction with the expression-level characteristics of MyoD, indicates that the motif CACCTG is potentially relevant for distinguishing heart-specific from neutral sequences.

Figure 8: GC sequence composition for brain-specific enhancers and neutral noncoding regions (GC-content histograms, panels (a)-(d)).

Figure 10: Cumulative distribution function for bootstrapped I(MyoD motif: CACCTG→Y); Y is the class label (heart-specific versus housekeeping). True I(CACCTG→Y) = 0.4977 (empirical CDF of the null distribution, plotted against the DI of MyoD→heart-specific promoters).

Figure 9: Misclassification accuracy for the MI versus DI case (brain enhancer set); misclassification rate (fraction) versus the number of top ranking features used for classification.

Another theme picks up on something quite traditionally done in bioinformatics research: finding key TF regulators underlying tissue-specific expression. Two major questions emerge from this theme.

(1) Which putative regulatory TFs underlie the tissue-specific expression of a group of genes?
(2) For the TFs found using tools like TOUCAN [12], can we examine the degree of influence that the particular TF motif has in directing tissue-specific expression?

To address the first question, we examine the TFs revealed by DI/MI motif selection and compare these to the TFs discovered from TOUCAN [12], underlying the expression of genes expressed on day e14.5 in the degenerating mesonephros and nephric duct (TS22). This set has about 43 genes (including Gata2). These genes are available in the Supplementary Material.

Using TOUCAN, the set of module TFs is combinations of the following TFs: E47, HNF3B, HNF1, RREB1, HFH3, CREBP1, VMYB, GFI1. These were obtained by aligning the promoters of these 43 genes (-2000 bp upstream to +200 bp from the TSS), and looking for over-represented TF motifs based on the TRANSFAC/JASPAR databases. Using the DI-based motif selection, a set of 200 hexamers is found that discriminate these 43 gene promoter sequences from the background housekeeping promoter set. They map to the consensus sites of several known TFs, such as (identified from http://bio.chip.org/mapper) Nkx, Max1, c-ETS, FREAC4, Ahr-ARNT, CREBP2, E2F, HNF3A/B, NFATc, Pax2, LEF1, Max1, SP1, Tef1, Tcf11-MafG; many of which are expressed in the developing kidney (http://www.expasy.org). Moreover, we observe that the TFs that are common between the TOUCAN results and the DI-based approach (FREAC4, Max1, HNF3a/b, HNF1, SP1, CREBP, RREB1, HFH3) are mostly kidney-specific. Thus, we believe that this observation makes a case for performing all (possibly degenerate) TF motif searches from TRANSFAC, and filtering them based on tissue-specific expression subsequently. Such a strategy yields several more TF candidates for testing and validation of biological function.
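The hexamer features used throughout can be obtained by simple counting. The sketch below is an assumption about that preprocessing, not the authors' pipeline: it builds the promoter-by-hexamer count matrix over which the DI/MI rankings are computed, with `sequences` an assumed list of uppercase DNA strings.

from itertools import product
import numpy as np

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]
INDEX = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_counts(sequences):
    """Count occurrences of all 4^6 hexamers in each sequence with a sliding window."""
    X = np.zeros((len(sequences), len(HEXAMERS)), dtype=int)
    for row, seq in enumerate(sequences):
        for start in range(len(seq) - 5):
            word = seq[start:start + 6]
            if word in INDEX:                           # windows containing N or other symbols are skipped
                X[row, INDEX[word]] += 1
    return X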

For the second question, we examine the following scenario. The Gata3 gene is observed to be expressed in the developing ureteric bud (UB) during kidney development. To find UB-specific TF regulators, conserved TF modules can be examined in the promoters of UB-specific genes. These experimentally annotated UB-specific genes are obtained from the Mouse Genome Informatics database at http://www.informatics.jax.org. Several programs are used for such analysis, like Genomatix [11] or Toucan [12]. Using Toucan, the promoters of the various UB-specific genes are aligned to discover related modules. The top-ranking module in Toucan contains AHR-ARNT, Hox13, Pax2, Tal1alpha-E47, and Oct1. Again, the power of these motifs to discriminate UB-specific and nonspecific genes, based on DI, can be investigated.

For this purpose, we check whether the Pax2 binding motif (GTTCC [40]) indeed induces kidney-specific expression by looking for the strength of DI between the GTTCC motif and the class label (+1) indicating UB expression (see Figure 11). This once again adds to the computational evidence for the true role of Pax2 in directing ureteric bud specific expression [40]. The main implication here is that, from sequence data, there is strong evidence for the Pax2 motif being a useful feature for UB-specific genes. This is especially relevant given the documented role of Pax2 (see [41]) in directing ureteric-bud expression of the Gata3 gene, one of the key modulators of kidney morphogenesis. Both the MyoD and Pax2 studies indicate the relevance of principled data integration using expression [35, 42] and sequence modalities.

Figure 11: Cumulative distribution function for bootstrapped I(Pax2 motif: GTTCC→Y); Y is the class label (UB/non-UB). True I(GTTCC→Y) = 0.9792 (empirical CDF).
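The empirical CDFs of Figures 10 and 11 summarize a null distribution obtained by resampling. A minimal sketch of this idea follows; it is not the authors' estimator, and a plug-in mutual information between the binary motif-occurrence indicator and the class label is used here only as a stand-in for the directed information statistic.

import numpy as np

def plugin_mi(a, b):
    """Plug-in mutual information (bits) between two binary arrays."""
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log2(p_uv / (np.mean(a == u) * np.mean(b == v)))
    return mi

def null_cdf(motif_present, label, n_boot=1000, seed=0):
    """Observed statistic, sorted null values from label permutations, and the CDF value at the observed statistic."""
    rng = np.random.default_rng(seed)
    observed = plugin_mi(motif_present, label)
    null = np.sort([plugin_mi(motif_present, rng.permutation(label)) for _ in range(n_boot)])
    return observed, null, np.searchsorted(null, observed) / n_boot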

11.4. Observations

With regard to the feature selection and classification results, in both studies (enhancers and promoters), we observe that about 100 hexamers are enough to discriminate the tissue-specific from the neutral sequences. Furthermore, some sequence features of these motifs at the promoter/enhancer emerge.

(i) There is higher sequence variability at the promoter since it has to act in concert with LREs of different tissue types during gene regulation.
(ii) Since the enhancer/LRE acts with the promoter to confer expression in only one tissue type, these sequences are more specific and hence their mining identifies motifs that are probably more indicative of tissue-specific expression.

We, however, reiterate that the enhancer dataset that we study uses the hsp68-lacZ as the promoter driven by the ultraconserved elements. Hence there is no promoter specificity in this context. Though this is a disadvantage and might not reveal all key motifs, it is the best that can be done in the absence of any other comprehensive repository.

The second aspect of the presented results highlights two important points. Firstly, the identified motifs have a strong predictive value as suggested by the cross-validation results as well as Table 2. Moreover, DI provides a principled methodology to investigate any given motif for tissue-specificity as well as for identifying expression-level relationships between the TFs and their target genes (Section 11.3).

12. CONCLUSIONS

In this work, a framework for the identification of hexamer motifs to discriminate between two kinds of sequences (tissue-specific promoters or regulatory elements versus nonspecific elements) is presented. For this feature selection problem, a new metric, the "directed information" (DI), is proposed. In conjunction with a support vector machine classifier, this method was shown to outperform the state-of-the-art method employing undirected mutual information. We also find that only a subset of the discriminating motifs correlate with known transcription factor motifs; hence the other motifs might be potentially related to nonconsensus TF binding or to underlying epigenetic phenomena governing tissue-specific gene expression. The superior performance of the directed-information-based variable selection suggests its utility for more general learning problems. As per the initial motivation, the discovery of these motifs can aid in the prospective discovery of other tissue-specific regulatory regions.

We have also examined the applicability of DI to prospectively resolve the functional role of any TF motif in a biological process, integrating other sources (literature, expression data, module searches).

13. FUTURE WORK

Several opportunities for future work exist within this proposed framework. Multiple sequence alignment of promoter/regulatory sequences across species would be a useful preprocessing step to reduce false detection of discriminatory motifs. The hexamers can also be identified based on other metrics exploiting distributional divergence between the samples of the "+1" and "-1" classes. Furthermore, there is a need for consistent high-dimensional entropy estimators within the small sample regime. A very interesting direction is the formulation of a stepwise hexamer selection algorithm, using the directed information for maximal relevance selection and mutual information for minimizing between-hexamer redundancy [18]. This analysis is beyond the scope of this work, but an implementation is available from the authors for further investigation. (The source code of the analysis tools in R 2.0 and MATLAB 6.1 is available on request.)
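A minimal sketch of the stepwise selection idea mentioned above, in the spirit of the max-relevance/min-redundancy criterion of [18], is given below; `relevance` and `redundancy` are assumed callables (for example, directed information from a hexamer to the class label and mutual information between two hexamers), not functions defined in this paper.

def stepwise_selection(n_features, relevance, redundancy, n_select=100):
    """Greedily pick features maximizing relevance minus average redundancy with those already chosen."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def criterion(j):
            if not selected:
                return relevance(j)
            return relevance(j) - sum(redundancy(j, k) for k in selected) / len(selected)
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected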
ACKNOWLEDGMENTS

The authors gratefully acknowledge the support of the NIH under Award 5R01-GM028896-21 for J. D. Engel. They would like to thank Professor Sandeep Pradhan and Mr. Ramji Venkataramanan for useful discussions on directed information. They are extremely grateful to Professor Erik Learned-Miller and Dr. Damian Fermin for sharing their code for high-dimensional entropy estimation and ENSEMBL sequence extraction, respectively. They also thank the anonymous reviewers and the corresponding editor for helping them improve the quality of the manuscript through insightful comments and suggestions. The material in this paper was presented in part at the IEEE Statistical Signal Processing Workshop 2007 (SSP07).

REFERENCES

[1] K. D. MacIsaac and E. Fraenkel, "Practical strategies for discovering regulatory DNA sequence motifs," PLoS Computational Biology, vol. 2, no. 4, p. e36, 2006.
[2] G. Kreiman, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes," Nucleic Acids Research, vol. 32, no. 9, pp. 2889-2900, 2004.
[3] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," Journal of Molecular Biology, vol. 268, no. 1, pp. 78-94, 1997.
[4] Q. Li, G. Barkess, and H. Qian, "Chromatin looping and the probability of transcription," Trends in Genetics, vol. 22, no. 4, pp. 197-202, 2006.
[5] D. A. Kleinjan and V. van Heyningen, "Long-range control of gene expression: emerging mechanisms and disruption in disease," The American Journal of Human Genetics, vol. 76, no. 1, pp. 8-32, 2005.
[6] L. A. Pennacchio, G. G. Loots, M. A. Nobrega, and I. Ovcharenko, "Predicting tissue-specific enhancers in the human genome," Genome Research, vol. 17, no. 2, pp. 201-211, 2007.
[7] D. C. King, J. Taylor, L. Elnitski, F. Chiaromonte, W. Miller, and R. C. Hardison, "Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences," Genome Research, vol. 15, no. 8, pp. 1051-1060, 2005.
[8] L. A. Pennacchio, N. Ahituv, A. M. Moses, et al., "In vivo enhancer analysis of human conserved non-coding sequences," Nature, vol. 444, no. 7118, pp. 499-502, 2006.
[9] K. Kadota, J. Ye, Y. Nakai, T. Terada, and K. Shimizu, "ROKU: a novel method for identification of tissue-specific genes," BMC Bioinformatics, vol. 7, p. 294, 2006.
[10] J. Schug, W.-P. Schuller, C. Kappen, J. M. Salbaum, M. Bucan, and C. J. Stoeckert Jr., "Promoter features related to tissue specificity as measured by Shannon entropy," Genome Biology, vol. 6, no. 4, p. R33, 2005.
[11] T. Werner, "Regulatory networks: linking microarray data to systems biology," Mechanisms of Ageing and Development, vol. 128, no. 1, pp. 168-172, 2007.
[12] S. Aerts, P. Van Loo, G. Thijs, et al., "TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis," Nucleic Acids Research, vol. 33, Web Server Issue, pp. W393-W396, 2005.
[13] B. Y. Chan and D. Kibler, "Using hexamers to predict cis-regulatory motifs in Drosophila," BMC Bioinformatics, vol. 6, p. 262, 2005.
[14] G. B. Hutchinson, "The prediction of vertebrate promoter regions using differential hexamer frequency analysis," Computer Applications in the Biosciences, vol. 12, no. 5, pp. 391-398, 1996.
[15] P. Sumazin, G. Chen, N. Hata, A. D. Smith, T. Zhang, and M. Q. Zhang, "DWE: discriminating word enumerator," Bioinformatics, vol. 21, no. 1, pp. 31-38, 2005.
[16] G. Lakshmanan, K. H. Lieuw, K.-C. Lim, et al., "Localization of distant urogenital system-, central nervous system-, and endocardium-specific transcriptional regulatory elements in the GATA-3 locus," Molecular and Cellular Biology, vol. 19, no. 2, pp. 1558-1568, 1999.
[17] M. Khandekar, N. Suzuki, J. Lewton, M. Yamamoto, and J. D. Engel, "Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system," Molecular and Cellular Biology, vol. 24, no. 23, pp. 10263-10276, 2004.
[18] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[19] Proceedings of NIPS 2006 Workshop on Causality Feature Selection, http://research.ihost.com/cws2006/.
[20] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[21] H. Marko, "The bidirectional communication theory—a generalization of information theory," IEEE Transactions on Communications, vol. COM-21, no. 12, pp. 1345-1351, 1973.
[22] J. Massey, "Causality, feedback and directed information," in Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '90), pp. 303-305, Waikiki, Hawaii, USA, November 1990.
[23] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Transactions on Information Theory, vol. 53, no. 6, pp. 2154-2179, 2007.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[25] E. G. Miller, "A new class of entropy estimators for multidimensional densities," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, pp. 297-300, Hong Kong, April 2003.
[26] R. M. Willett and R. D. Nowak, "Complexity-regularized multiresolution density estimation," in Proceedings of the International Symposium on Information Theory (ISIT '04), pp. 303-305, Chicago, Ill, USA, June-July 2004.
[27] I. Nemenman, F. Shafee, and W. Bialek, "Entropy and inference, revisited," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., MIT Press, Cambridge, Mass, USA, 2002.
[28] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[29] H. Joe, "Relative entropy measures of multivariate dependence," Journal of the American Statistical Association, vol. 84, no. 405, pp. 157-164, 1989.
[30] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1994.
[31] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, Springer Series in Statistics, Springer, New York, NY, USA, 1997.
[32] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B, vol. 57, no. 1, pp. 289-300, 1995.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[34] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81-93, 1938.
[35] NCBI PubMed URL, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
[36] A. M. Murphy, W. R. Thompson, L. F. Peng, and L. Jones II, "Regulation of the rat cardiac troponin I gene by the transcription factor GATA-4," Biochemical Journal, vol. 322, part 2, pp. 393-401, 1997.
[37] A. Azakie, J. R. Fineman, and Y. He, "Myocardial transcription factors are modulated during pathologic cardiac hypertrophy in vivo," The Journal of Thoracic and Cardiovascular Surgery, vol. 132, no. 6, pp. 1262-1271.e4, 2006.
[38] P. Vanhoutte, J. L. Nissen, B. Brugg, et al., "Opposing roles of Elk-1 and its brain-specific isoform, short Elk-1, in nerve growth factor-induced PC12 differentiation," Journal of Biological Chemistry, vol. 276, no. 7, pp. 5189-5196, 2001.
[39] E. N. Olson, "Regulation of muscle transcription by the MyoD family: the heart of the matter," Circulation Research, vol. 72, no. 1, pp. 1-6, 1993.
[40] G. R. Dressler and E. C. Douglass, "Pax-2 is a DNA-binding protein expressed in embryonic kidney and Wilms tumor," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 4, pp. 1179-1183, 1992.
[41] D. Grote, A. Souabni, M. Busslinger, and M. Bouchard, "Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney," Development, vol. 133, no. 1, pp. 53-61, 2006.
[42] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Inference of biologically relevant gene influence networks using the directed information criterion," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 2, pp. 1028-1031, Toulouse, France, May 2006.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 31450, 18 pages
doi:10.1155/2007/31450

Research Article

Splitting the BLOSUM Score into Numbers of Biological Significance

Francesco Fabris,1,2 Andrea Sgarro,1,2 and Alessandro Tossi3

1 Dipartimento di Matematica e Informatica, Università degli Studi di Trieste, via Valerio 12b, 34127 Trieste, Italy
2 Centro di Biomedicina Molecolare, AREA Science Park, Strada Statale 14, Basovizza, 34012 Trieste, Italy
3 Dipartimento di Biochimica, Biofisica e Chimica delle Macromolecole, Università degli Studi di Trieste, via Licio Giorgieri 1, 34127 Trieste, Italy

Received 2 October 2006; Accepted 30 March 2007

Recommended by Juho Rousu

Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM score, which was split into three components termed as the BLOSUM spectrum (or BLOSpectrum). These relate respectively to the sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (typicality of the amino acid probability distribution in each sequence), and to the target frequency divergence (compliance of the amino acid variations between the two sequences to the protein model implicit in the BLOCKS database). This treatment sharpens the protein sequence comparison, providing a rationale for the biological significance of the obtained score, and helps to identify weakly related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the evolutionary divergence associated with the two sequences, or indicate if a compositionally adjusted matrix could perform better.

Copyright © 2007 Francesco Fabris et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Substitution matrices have been in use since the introduction of the Needleman and Wunsch algorithm [1], and are referred to, either implicitly or explicitly, in several other papers from the seventies, McLachlan [2], Sankoff [3], Sellers [4], Waterman et al. [5], Dayhoff et al. [6]. These are the conceptual tools at the basis of several methods for attributing a similarity score to two aligned protein sequences. Any amino acid substitution matrix, which is a 20 * 20 table, has a scoring method that is implicitly associated with a set of target frequencies p(i, j) [7, 8], pertaining to the pair i, j of amino acids that are paired in the alignment. An important approach to obtaining the score associated with the paired amino acids i, j was that suggested by Dayhoff et al. [6], who developed a stochastic model of protein evolution called PAM (points of accepted mutations). In this model, the frequencies m(i, j) indicate the probability of change from one amino acid i to another amino acid j, in homologous protein sequences with at least 85% identity, during short-term evolution. The matrix M, relating each amino acid to each of the other 19, with an evolutionary distance of 1, would have entries m(i, j) close to 1 on the main diagonal (i = j) and close to 0 out of the main diagonal (i ≠ j). An M^k matrix, which estimates the expected probability of changes at a distance of k evolutionary units, is then obtained by multiplying the M matrix by itself k times. Each M^k matrix is then associated to the scoring matrix PAM-k, whose entries are obtained on the basis of the log ratio

s(i, j) = \log \frac{m_k(i, j)}{p(i)\,p(j)},    (1)

where p(i) and p(j) are the observed frequencies of the amino acids.

S. Henikoff and J. G. Henikoff introduced the BLOck SUbstitution Matrix (BLOSUM) [9]. While the scoring method is always based on a log odds ratio, as seems natural in any kind of substitution matrix [7], the method for deriving the target frequencies is quite different from PAM; one needs to evaluate the joint target frequencies p(i, j) of finding the amino acids i and j paired in alignments among homologous proteins with a controlled rate of percent identity. This joint probability is compared with p(i)p(j), the product of the background frequencies of amino acids i and j, derived from the amino acid probability distribution P = {p_1, p_2, ..., p_20}.
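A minimal sketch of (1), under the assumption that the 1-PAM transition matrix M (rows summing to one) and the background frequencies p are available as numpy arrays in a fixed amino acid order, is the following; it only illustrates the formula as written above.

import numpy as np

def pam_scores(M, p, k):
    """PAM-k log-odds scores following (1): s(i, j) = log[m_k(i, j) / (p(i) p(j))]."""
    Mk = np.linalg.matrix_power(M, k)      # expected change probabilities after k evolutionary units
    return np.log2(Mk / np.outer(p, p))    # zero-probability entries yield -inf and would be clipped in practice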

The target and background frequencies are tied by the equality p(i) = \sum_{j=1}^{20} p(i, j), so that the background probability distribution is the marginal of the joint target frequencies [10]. The product p(i)p(j) reflects the likelihood of the independence setting, namely that the amino acids i and j are paired by pure chance. If p(i, j) > p(i)p(j), then the presence of i stochastically induces the presence of j, and vice versa (i and j are "attractive"), while if p(i, j) < p(i)p(j), the presence of one tends to exclude the other (i and j are "repulsive"). The score of the pair i, j is the log odds ratio

s(i, j) = \log \frac{p(i, j)}{p(i)\,p(j)},    (2)

being positive when i and j are attractive, and negative when the opposite occurs. The i, j entry of the BLOSUM matrix is the score of the pair i, j (or j, i, which is the same since the sequences are not ordered; for a different approach see Yu et al. [11]) multiplied by a suitable scale factor (4 for BLOSUM-35 and BLOSUM-40, 3 for BLOSUM-50, and 2 for the remaining). The value so obtained is then rounded to the nearest integer, and the (unscaled) global score of two sequences X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n of length n is given by summing up the scores relative to each position

S(X, Y) = \sum_{h=1}^{n} s(x_h, y_h) = \sum_{i,j} n(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (3)

where n(i, j) is the number of occurrences of the pair i, j inside the aligned sequences. This equation weighs the log ratio associated to the i, j entry of the BLOSUM matrix with the occurrences of the pair i, j, and seems intuitive following a heuristic approach, as any reasonable substitution matrix is implicitly of this form [7]. In order to compute the necessary target and background frequencies p(i, j) and p(i)p(j), S. Henikoff and J. G. Henikoff used the database BLOCKS (http://blocks.fhcrc.org/index.html), which contains sets of proteins with a controlled maximum rate of percent identity "θ" that defines the BLOSUM matrix, so that BLOSUM-62 refers to θ = 62%, and so forth.

Scoring substitution matrices, such as PAM or BLOSUM, are used in modern web tools (BLAST, PSI-BLAST, and others) for performing database searches; the search is accomplished by finding all sequences that, when compared to a given query sequence, sum up a score over a certain threshold. The aim is usually that of discovering biological correlation among different sequences, often belonging to different organisms, which may be associated with a similar biological function. In most cases, this correlation is quite evident when proteins are associated with genes that have duplicated, or organisms that have diverged from one another relatively recently, and leads to high values of the BLOSUM (or PAM) score. But in some cases, a relevant biological correlation may be obscured by phenomena that reduce the score, making it difficult to capture. Those that limit the efficiency of the scoring method in finding concealed or weakly correlated sequences are well documented in the literature, the most relevant being:

(1) Gaps: insertions or deletions (of one or more residues) in one or both the aligned sequences cause loss of synchronization, significantly decreasing the score;

We have set out to inspect, in more depth and by use of mathematical tools, what the BLOSUM score really measures from a biological point of view; the aim was to split the score into components, the BLOSpectrum, that provide insight on the above described phenomena and other biological information regarding the compared sequences, once the alignment has been made using the classical methods (BLAST, FASTA, etc.). We do not propose an alternative alignment algorithm or a method for increasing the performance of the available ones; nor do we suggest new methods for inserting gaps so as to maximize the score (see, e.g., [14, 15]). Ours is simply a diagnostic tool to reveal the following:

(1) if, for an available algorithm, the chosen scoring matrix is correct;
(2) whether the aligned sequences are typical protein sequences or not;
(3) whether the alignment itself is typical with respect to the BLOCKS database; and
(4) the possible presence of a weak or concealed correlation also for alignments resulting in a relatively low BLOSUM score, that might otherwise be neglected.

The method is associated with the use of a BLOSUM matrix that has been developed within the context of local (ungapped) alignment statistics [7, 8, 11]. To allow a critical evaluation of our method, we furnish an online software package that provides values for each component of the BLOSpectrum for two aligned sequences (http://bioinf.dimi.uniud.it/software/software/blosumapplet). Providing a rationale about the biological significance of an obtained score sharpens the comparison of weakly related sequences, and can reveal that comparable scores actually conceal completely different biological relationships. Furthermore, our decomposition helps in selecting the matrix that is correctly tailored for the actual evolutionary divergence associated to the two sequences one is going to compare, or in deciding if a compositionally adjusted matrix might not perform better. Although we have used the BLOSUM scoring method for our analyses, since it is the most widely used by web tools measuring protein similarities, our decomposition is applicable, in principle, to any scoring matrix in the form of (3), and confirms that the usefulness of this type of matrix has a solid mathematical justification.

2. METHODS

2.1. Mathematical analysis of the BLOSUM score

The BLOSUM score (3) can be analyzed from a mathematical perspective using well-known tools developed by Shannon in his seminal paper that laid the foundation for Information Theory [16, 17]. The first of these is the Mutual Information I(X, Y) (or relative entropy) between two random variables X and Y,

I(X, Y) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (4)

where p(i, j), p(i), p(j) are, respectively, the joint probability distribution and the marginals associated to the random variables X and Y. We can adapt (4) to the comparison of two sequences if we interpret p(i, j) as the relative frequency of finding amino acids i and j paired in the X and Y sequences, and p(i) (p(j)) of finding amino acid i (j) in sequence X (Y). Following this approach, in a biological setting, mutual information (MI) becomes a measure of the stochastic correlation between two sequences. It can be shown (see the appendix) that I(X, Y) ≤ log 20 ≈ 4.3219. The second tool is the informational divergence D(P//Q) between two probability distributions P = {p_1, p_2, ..., p_K} and Q = {q_1, q_2, ..., q_K} [18], where

D(P//Q) = \sum_{i=1}^{K} p(i) \log \frac{p(i)}{q(i)}.    (5)

The informational divergence (ID) can be interpreted as a measure of the nonsymmetrical "distance" between two probability distributions. A more detailed mathematical treatment of the properties associated with MI and ID is provided in the appendix. Here, we simply indicate that ID and MI are nonnegative quantities, and that they are tied by the formula

I(X, Y) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)} = D(P_{XY} // P_X P_Y) ≥ 0,    (6)

so that MI is really a special kind of ID, that measures the "distance" between the joint probability distribution P_XY and the product P_X P_Y of the two marginals P_X and P_Y.

Given two amino acid sequences, X and Y, the corresponding BLOSUM (unscaled) normalized score S_N(X, Y), measured in bits, is computed as

S_N(X, Y) = \frac{1}{n} \sum_{h=1}^{n} s(x_h, y_h) = \sum_{i,j} f(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (7)

where f(i, j) = n(i, j)/n is the relative frequency of the pair i, j observed on the aligned sequences X and Y. Because one usually deals with sequences that could have remarkably different lengths, we report the normalized per-residue score to permit a coherent comparison. It is important to stress the fact that while f(i, j) is the observed frequency pertaining to the sequences under inspection, the target frequencies p(i, j), together with the background marginals p(i) and p(j), pertain to the database BLOCKS. In a sense, they constitute "the model" of the typical behaviour of a protein, since p(i) or p(j) is in fact the "typical" probability distribution of amino acids as observed in most proteins, while p(i, j) is the "typical" probability of finding the amino acids i and j positionally paired in two protein sequences with a percent identity depending on θ. From an evolutionary point of view, we can say that if p(i, j) is greater than in the case of independence, then it is very likely that i and j are biologically correlated.

Equation (7) is in fact quite similar to (4), which specifies mutual information, the only difference being the use of f(i, j) instead of p(i, j) as the multiplying factor for the logarithmic term, so that the normalized score is a kind of "mixed" mutual information. As a matter of fact, we can define

I(A, B) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)}    (8)

as the mutual information, or relative entropy, of the target and background frequencies associated to the database BLOCKS, or to any other protein model used to find the target frequencies. Here A and B are dummy random variables taken to have generated the data of the database. The quantity I(A, B) was in effect used by Altschul in the case of PAM matrices [7], and by S. Henikoff and J. G. Henikoff [9] for the BLOSUM matrices, and in both cases it can be interpreted as the average exchange of information associated with a pair of aligned amino acids of the data bank, or as the expected average score associated to pairs of amino acids, when they are put into correspondence in alignments that adhere to the protein model over which the matrices are computed. From the perspective of an aligning method, we can state that I(A, B) measures the average information available for each position in order to distinguish the alignment from chance, so that the higher its value, the shorter the fragments whose alignment can be distinguished from chance [7]. Equation (6) (or (A.4) in the appendix) ensures also that this average score is always greater than or equal to zero.

On the other hand, if we compute the expected score when two amino acids i and j are picked at random in an independence setting model, given as

E(A, B) = \sum_{i,j} p(i)\,p(j) \log \frac{p(i, j)}{p(i)\,p(j)} = -D(P_X P_Y // P_{XY}) ≤ 0,    (9)

the classical assumptions made in constructing a scoring matrix [7] require that this expected score is lower than or equal to zero. Note that all these quantities pertain to the database BLOCKS (in the case of BLOSUM), that is, to the particular "protein model" used.
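The quantities in (4), (5), and (7) are straightforward to compute. The sketch below is only an illustration of the formulas, in bits, with `pair_freq` the observed pair-frequency matrix f(i, j), and `target` and `background` the BLOCKS frequencies p(i, j) and p(i), all assumed to be available as numpy arrays in a fixed amino acid order.

import numpy as np

def divergence(P, Q):
    """Informational divergence D(P//Q) of (5); Q is assumed positive on the support of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

def mutual_information(joint):
    """Mutual information I(X, Y) of (4): divergence of the joint from the product of its marginals."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return divergence(joint, np.outer(px, py))

def normalized_score(pair_freq, target, background):
    """Per-residue normalized score S_N(X, Y) of (7)."""
    mask = pair_freq > 0
    logodds = np.log2(target[mask] / np.outer(background, background)[mask])
    return float(np.sum(pair_freq[mask] * logodds))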

To solely evaluate the stochastic similarity between two sequences X and Y, the identity

I(X, Y) = \sum_{i,j} f(i, j) \log \frac{f(i, j)}{f_X(i)\,f_Y(j)},    (10)

which measures the degree of stochastic dependence between the protein sequences, would suffice (here f_X(i) = n(i)/n and f_Y(j) = n(j)/n are the relative frequencies of amino acid i observed in sequence X and amino acid j observed in sequence Y). But this is not so interesting from the biological point of view, as one has to take into account the possibility that, even if similar from the stochastic point of view, two sequences are far from being an example of a typical protein-to-protein matching (or evolutionary transition). In other words, we need to inspect this stochastic similarity under the "lens" of the protein model used in the BLOCKS database (or by the PAM model, for that matter).

Subjecting the (unscaled) normalized score S_N(X, Y) of (7) to simple mathematical manipulations (see the appendix for details), we can split S_N(X, Y) into the following terms:

S_N(X, Y) = I(X, Y) - D(F_XY // P_AB) + D(F_X // P_A) + D(F_Y // P_B).    (11)

Here, F_XY is the joint frequency distribution of the amino acid pairs in the sequences (observed target frequencies), while F_X and F_Y are, respectively, the distributions of the amino acids inside X and Y (observed background frequencies). P_AB instead is the joint probability distribution associated to the BLOCKS database, and is the vector of target frequencies. Note also that P_A = P_B = P are the probability distributions of the amino acids inside the same database BLOCKS, that is, the database background frequencies; they are equal as a consequence of the symmetry of the BLOSUM matrix entries, since p(i, j) = p(j, i). We define the set {I(X, Y), D(F_XY//P_AB), D(F_X//P), D(F_Y//P)} to be the BLOSUM spectrum of the aligned sequences (or BLOSpectrum). Notice that (11) holds also when the BLOSUM matrix is compositionally adjusted following the approach described in Yu et al. [11], that is, when the background frequencies are different (P_A ≠ P_B).

The terms constituting the BLOSpectrum have a different order of magnitude, as D(F_X//P) and D(F_Y//P) act with a cardinality of 20, when compared to the joint divergences I(X, Y) and D(F_XY//P_AB), that act on probability distributions whose cardinality is 20 * 20 = 400. From a practical point of view, this means that the contribution of I(X, Y) and D(F_XY//P_AB) to the score is expected to be roughly double that of D(F_X//P) and D(F_Y//P). Actually, under the hypothesis of a Bernoullian process (i.e., stationary and memoryless), we have D(P^2//Q^2) = 2D(P//Q) [18] (as in our case 20^2 = 400), and the sum of the two terms D(F_X//P) + D(F_Y//P) compensates the order of magnitude of the joint divergences.

Finally, it should be recalled that the score actually obtained by using the BLOSUM matrices, whose entries are multiplied by the constant c and rounded to the nearest integer, is an approximation of the exact score S_N(X, Y) of (11), once it has been scaled. The difference is usually quite small (about 2-3% if the score is high), but it becomes more and more significant as the score approaches zero.

2.2. Taking gaps into account

An important consideration regarding our mathematical analysis is that it does not formally take gaps into account. From a mathematical perspective, the only way to account correctly for gaps would be to use a 21 * 21 scoring matrix, in which the gap is treated as equivalent to a 21st amino acid, so that pairs of the form (i, -) or (-, j), where the symbol "-" represents the gap, are also contemplated; but from a biological perspective this might not be acceptable, since a gap is not a real component of a sequence. We can nevertheless extend our analysis to a gapped score if we admit the independence between each gap and any residue paired with it. Biologically, independence may be questionable, and would need to be determined case by case, as each gap is due to a chance deletion or insertion event subsequently acted on by natural selection (which may be neutral or positive). Moreover, there is no certainty as to the correct positioning of a gap in any given alignment, as it is introduced a posteriori as the product of an alignment algorithm that takes the two sequences X and Y, and tries to minimize (by an exact procedure, or by a heuristic approach) the number of changes, insertions, or deletions that allow to transform X into Y (or vice versa). In practice, we consider quite reasonable the idea that gaps in a given position should imply a degree of independence as to which amino acids might occur there in related proteins; this is accepted also in PSI-BLAST [19]. The consequence of assuming independence is that p(-, j) = p(-)p(j) leads to a null contribution of the corresponding score, since s(-, j) = log[p(-, j)/(p(-)p(j))] = 0 (see (3)), so that for gapped sequences, we simply assign a score equal to zero whenever an amino acid is paired with a gap. Note that this does not mean that we reduce a gapped alignment to an ungapped one, but that we simply ignore the gap and the corresponding residue, since the pair is not affecting the BLOSpectrum, due to its zero contribution to the score. Moreover, it is conceivable that for distant sequence correlations, the use of different algorithms, or of different gap penalty schemes for any given algorithm, could result in a different pattern of gaps and consequently in different sequence alignments, each with a corresponding BLOSpectrum. In this case, the likelihood of each alignment might be tested by exploiting the BLOSpectrum, that might be quite different even if the numerical scores have approximately the same value; this can help identify the most appropriate one.
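A minimal sketch of the decomposition (11) for two aligned sequences of equal length follows; gap-containing pairs are skipped, as discussed in Section 2.2. The amino acid ordering of `target` (20 x 20) and `background` (length 20) is an assumption, and divergence() and mutual_information() are the helper functions sketched above.

import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV"
POS = {a: i for i, a in enumerate(ALPHABET)}

def blospectrum(x, y, target, background):
    """Return the four BLOSpectrum terms of (11) and their algebraic sum S_N."""
    pair_freq = np.zeros((20, 20))
    for a, b in zip(x, y):
        if a in POS and b in POS:                       # pairs involving a gap contribute zero and are ignored
            pair_freq[POS[a], POS[b]] += 1
    pair_freq /= pair_freq.sum()
    fx, fy = pair_freq.sum(axis=1), pair_freq.sum(axis=0)
    terms = {
        "I(X,Y)": mutual_information(pair_freq),
        "D(F_XY//P_AB)": divergence(pair_freq, target),
        "D(F_X//P)": divergence(fx, background),
        "D(F_Y//P)": divergence(fy, background),
    }
    terms["S_N"] = (terms["I(X,Y)"] - terms["D(F_XY//P_AB)"]
                    + terms["D(F_X//P)"] + terms["D(F_Y//P)"])
    return terms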

3. RESULTS AND DISCUSSION

3.1. Meaning and biological implications of the BLOSpectrum terms

Let us now analyze the meaning of the terms in (11).

(i) The mutual information I(X, Y) is the sequence convergence, which measures the degree of stochastic dependence (or stochastic correlation) between the aligned sequences X and Y; the greater its value, the more statistically correlated are the two. It is highly correlated with, but not identical to, the percent identity of the alignment, as it also includes the propensity of finding certain amino acids paired, even if different. This term enhances the overall BLOSUM score, since it is taken with the plus sign.

(ii) The target frequency divergence D(F_XY//P_AB) measures the difference between the "observed" target frequencies and the target frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between F_XY and P_AB, that is, the distance between the mode in which amino acids are paired in the X and Y sequences and inside the "protein model" implicit in the BLOCKS database. When the vector of observed frequencies F_XY is "far" from the vector of target frequencies P_AB exhibited by the protein model, then the divergence is high, so that starting from X we obtain a Y (or vice versa) that is not the one we would expect on the basis of the target frequencies of the database; in other words, the amino acids are paired following relative frequencies that are not the standard ones. The term D(F_XY//P_AB) is a penalty factor in (11), since it is taken with the minus sign.

(iii) The background frequency divergence D(F_X//P_A) (or D(F_Y//P_B)) of the sequence X (or Y) measures the difference between the "observed" background frequencies and the background frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between the observed frequencies F_X (or F_Y) and the vector P = P_A = P_B of background frequencies of the amino acids inside the database BLOCKS. The greater is its value, the more different are the observed frequencies from the background frequencies exhibited by a typical protein sequence. This term enhances the score, since it is taken with the plus sign.

Note that the quantities that constitute the decomposition of the BLOSUM score are not independent of one another. For example, D(F_XY//P_AB) ≈ 0 implies low values for D(F//P) also. This is because when F_XY → P_AB (or D(F_XY//P_AB) → 0; see the appendix), then also the observed marginals F_X and F_Y are forced to approach the background marginal, that is, F_X → P and F_Y → P, which implies D(F//P) → 0. This is a consequence of the tie between a joint probability distribution and its marginals [10]. For the same reason, if D(F//P) ≫ 0, then D(F_XY//P_AB) will also be large, although the opposite is not necessarily the case. This leads to (at least partially) a compensation of the effects, due to the minus sign of the target frequency divergence, so that -D(F_XY//P_AB) + D(F_X//P_A) + D(F_Y//P_B) has a small value. This implies that a significant BLOSUM score can be obtained only when the aligned sequences are statistically correlated, that is, when I(X, Y) has a high value. Since when performing an alignment we are mainly interested in positive or almost positive global scores, it is a straightforward consequence that only alignments characterized by remarkable values of I(X, Y) will emerge.

There are therefore essentially three cases of biological interest, which we can now analyze in terms of the correspondence between mathematical and biological meaning of the terms.

Case 1. The joint observed frequencies F_XY are typical, that is, they are very close to the target frequencies, F_XY ≈ P_AB. In this case, D(F_XY//P_AB) ≈ 0 and also D(F//P) ≈ 0. (Recall that the concept of "typicality" always refers to the adherence of the various probability distributions to that of the protein model associated to the database BLOCKS.)

Case 2. The joint observed frequencies F_XY are not typical (F_XY ≠ P_AB), but the marginals are typical (F_X ≈ P, F_Y ≈ P). In this case, D(F_XY//P_AB) ≫ 0, but D(F//P) ≈ 0.

Case 3. Both the joint observed F_XY and the marginals F_X, F_Y are not typical, that is, F_XY ≠ P_AB, F_X ≠ P, F_Y ≠ P. In this case, D(F_XY//P_AB) ≫ 0, but also D(F//P) ≫ 0.

Case 1 is straightforward; two similar protein sequences with a typical background amino acid distribution, and amino acids paired in a way that complies with the protein model implicit in BLOCKS, result in a high score. This is frequently the case for two firmly correlated sequences, belonging to the same family of proteins with standard amino acid content, associated with organisms that diverged only recently.
Case 2 is rather more interesting; the amino acid distribution is close to the background distribution (these are "typical" protein sequences) but the score is highly penalized as the observed joint frequencies are different from the target frequencies implicit in the BLOCKS database. This can have different causes. For example, the chosen BLOSUM matrix may be incorrectly matched to the evolutionary distance of the sequences, or the sequences may have diverged under a nonstandard evolutionary process. For high-scoring alignments involving unrelated sequences, the target frequency divergence D(F_XY//P_AB) will tend to be low, due to the second theorem of Karlin and Altschul [8], when the target frequencies associated to the scoring matrix in use are the correct ones for the aligned sequences being analyzed. (Note that in general, choosing the θ parameter associated with the smallest D(F_XY//P_AB) is different from choosing the minimum E-value associated with different θ parameters; recall that E = m * n * 2^{-S}, where S is the score and m and n are the sequence lengths.) This is because any set of target frequencies in any particular amino acid substitution matrix, such as BLOSUM-θ, is tailored to a particular degree of evolutionary divergence between the sequences, generally measured by relative entropy (8) [7], and related with the controlled maximum rate θ of percent identity. So a low D(F_XY//P_AB) ≈ 0 is evidence that the BLOSUM-θ matrix we are using is the correct one, as a precise consequence of a mathematical theorem, while conversely for positive (or almost positive) scoring alignments with large target frequency divergence, the sequences may be related at a different evolutionary distance than that of the substitution matrix in use. Trying several scoring matrices until "something interesting" is found is a common practice in protein sequence alignment [20]. In our case, scanning the θ range could thus lead to a significant decrease in D(F_XY//P_AB), as detected in the BLOSpectrum, and improve the score [7, 12, 13], taking it back to Case 1. This could in turn result in a better capacity to discriminate weakly correlated sequences from those correlated by chance. If, on the other hand, tuning θ does not greatly affect D(F_XY//P_AB), and we are comparing typical sequences (low background frequency divergence) with an appropriate θ parameter, the large target frequency divergence indicates that some nonstandard evolutionary process (regarding the substitution of amino acids) is at work. This cannot adequately be captured by the standard BLOCKS database and BLOSUM substitution matrices. Under these circumstances, Case 2 can never lead to high scores, due to the penalization of the target frequency divergence. We are here likely in the grey area of weakly correlated sequences with a very old common ancestor, or of portions of proteins with strong structural properties that do not require the conservation of the entire sequence. Note that unfortunately we are not able to assess the statistical significance when our method finds a suspected concealed correlation; however, the method still gives us useful information that helps guide our judgment on the possible existence of such correlation, which needs to be further investigated in depth, exploiting other biological information such as 3D structure and biological function.

Case 3 accounts for the situation in which we have two nontypical sequences, with high values of both target and background frequency divergence. This applies, for example, to some families of antimicrobial peptides, that are unusually rich in certain amino acids (such as Pro and Arg, Gly, or Trp residues). This means that the high penalty arising from the subtracted D(F_XY//P_AB) is (at least partially) compensated by the positive D(F_X//P_A) and D(F_Y//P_B), and the global score does not collapse to negative values, even if it is usually low. In effect, the background frequency divergence acts as a compensation factor that prevents excessive penalties for those sequences which, even though related by nonstandard amino acid substitutions, also have a nontypical background distribution of the amino acids inside the sequences themselves. In other words, the nontypicality of F_XY is (at least in part) forced by the anomalous background frequencies of the amino acids. This compensation is welcome, since it avoids missing biologically related sequences pertaining to nontypical protein families, and mathematically corroborates the robustness of the BLOSUM scoring method.

The problem of evaluating the best method for scoring nonstandard sequences has been recently tackled by Yu et al. [11, 21], who showed that standard substitution matrices are not truly appropriate in this case, and developed a method for obtaining compositionally adjusted matrices. In general, when background frequencies differ markedly from those implicit in the substitution matrix (i.e., the background frequency divergence is high) is one case when using a standard matrix is nonoptimal. Another is when the background frequencies vary, and the scale factor λ = (log(p(i, j)/p(i)p(j)))/s(i, j) appropriate for normalizing nominal scores varies as well [8]. If the real λ is lower than the "standard" one, then the uncorrected nominal score can appear much too high [19, 22]. Our approach offers a different perspective to the problem, that is, the possibility of gaining insight about biological sequence correlation directly from the BLOSUM score. Moreover, the background frequency divergence components of the BLOSpectrum indicate whether compositionally adjusted matrices could be useful in the case under inspection. Since [21] illustrates three "criteria for invoking compositional adjustment" (length ratio, compositional distance, and compositional angle), we suggest that the occurrence of "Case 3" in the BLOSUM spectrum could be thought of as an additional fourth criterion. The background divergence of the BLOSpectrum decomposition offers a further rationale to confirm the effectiveness of the procedure proposed by Yu et al., since a large background divergence D(F//P) forces the target frequency divergence D(F_XY//P_AB) to be unnaturally large; compositionally adjusted matrices, which minimize background frequency divergence, tend to remove this effect, leaving it free to assume the value associated to the (correct degree of evolutionary) divergence between the sequences under inspection.

As a consequence of the three cases discussed above, we can suggest the following procedure for analyzing the score obtained from an alignment between two given sequences of the same length, or resulting from a BLAST or FASTA (gapped or ungapped) database search.

Scoring analysis procedure

(1) Given the two sequences, evaluate the components of (11) by inserting the sequences in the available software to obtain the BLOSpectrum (http://bioinf.dimi.uniud.it/software/software/blosumapplet).
(2) Evaluate the target frequency divergence D(F_XY//P_AB) for each θ.
(3) Choose the θ value that minimizes D(F_XY//P_AB).
(4) Determine if the alignment falls in Cases 1, 2, or 3 as described.
(5) If the alignment falls in Case 1, we have two strictly correlated proteins.
(6) If, even after tuning θ, the alignment falls in Case 2 (D(F_XY//P_AB) is high, but D(F//P) is low), then we may have a concealed or weak correlation between the sequences.
(7) If the alignment falls in Case 3 (both D(F_XY//P_AB) and D(F//P) are high), we may have correlated sequences belonging to a nontypical family. In this case, the use of compositionally adjusted matrices may provide a sharper score [11, 21].
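A minimal sketch of this procedure is given below, assuming a dictionary mapping each available θ to its BLOCKS target and background frequency arrays, and reusing the blospectrum() helper sketched in Section 2; the case assignment uses the rule-of-thumb thresholds summarized in Table 1 below.

def analyze_alignment(x, y, blocks_frequencies):
    """Pick the theta minimizing the target frequency divergence and assign Case 1, 2, or 3."""
    best_theta, best = None, None
    for theta, (target, background) in blocks_frequencies.items():
        spec = blospectrum(x, y, target, background)
        if best is None or spec["D(F_XY//P_AB)"] < best["D(F_XY//P_AB)"]:
            best_theta, best = theta, spec
    d_joint = best["D(F_XY//P_AB)"]
    d_back = max(best["D(F_X//P)"], best["D(F_Y//P)"])
    if d_joint < 1.1:
        case = 1        # joint frequencies close to the BLOCKS target frequencies
    elif d_back < 0.3:
        case = 2        # typical composition but atypical pairing: possible weak or concealed correlation
    else:
        case = 3        # atypical composition as well: compositional adjustment may help
    return best_theta, case, best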

In analyzing the parameters that compose the BLOSpectrum, so as to decide among Cases 1, 2, and 3, we find it useful to use an indicative, if somewhat arbitrary, set of guidelines, as summarized in Table 1. We assign a range of values for each parameter (tag L = Low, tag M = Medium, tag H = High). These values have been derived from a "rule of thumb" approach when analyzing the results of the experiments described in the following sections; but obviously they need to be tuned as soon as new experimental evidence will be available.

Table 1: Rule of thumb guidelines to decide among low (L), medium (M), and high (H) values of the parameters.
Parameter          L       M         H
I(X, Y)            <0.9    0.9-1.1   >1.1
D(F_XY//P_AB)      <1.1    1.1-1.5   >1.5
D(F//P)            <0.3    0.3-0.7   >0.7

The final consideration is that, when comparing biologically related sequences, one has to choose the correct scoring matrix, if necessary by means of a compositional adjustment. If, as a result, background and target frequency divergences have low values, the mutual information or sequence convergence I(X, Y) remains as the effective parameter that measures protein similarity. If, after considering the above possibilities, one still observes a residual persistence of the target frequency divergence, then two weakly correlated sequences are presumably identified, that derived from a common remote ancestor after several events of substitution.

3.2. Practical implementation of the method

As stated in the Introduction, we recall that the analysis based on the BLOSpectrum evaluation is not aimed at increasing the performance of available alignment algorithms, nor at suggesting new methods for inserting gaps so as to maximize the score. The BLOSpectrum only gives added information of biological and operative interest, but only once two sequences have already been aligned using current algorithms, such as BLAST, BLAST2, FASTA, or others. The ultimate biological goal of the method is that of revealing the possible presence of a weak or concealed correlation for alignments resulting in a relatively low BLOSUM score, that might otherwise be neglected. Another operative merit is that the knowledge of the target frequency divergence helps identify the best scoring matrix, that is, the one tailored for the correct evolutionary distance.

In order to perform automatic computation of the four terms of (11), we have developed the software BLOSpectrum, freely available at http://bioinf.dimi.uniud.it/software/software/blosumapplet. Given two sequences with the same length, with or without gaps, the software derives the vectors F_X, F_Y, and F_XY by computing the relative frequencies f(i) = n(i)/n, f(j) = n(j)/n, and f(i, j) = n(i, j)/n, that is, the relative frequency of amino acid i observed in sequence X, of amino acid j observed in sequence Y, and the relative frequency of the pair i, j. The vectors P_AB = {p(i, j)} and P = {p(i)}, needed to decompose the score, are those derived from the BLOCKS database and used by S. Henikoff and J. G. Henikoff [9] to extract the score entries of the 20 * 20 BLOSUM matrices (35, 40, 50, 62, 80, 100); they have been kindly provided by these authors on request. The software computes also the exact BLOSUM normalized score, that is, the algebraic sum of the four terms, together with the rough BLOSUM score, directly obtained by summing up the integer values of the BLOSUM-θ matrix. As already observed in Section 2.2, the pairs containing a gap, such as (-, j) or (i, -), are not considered in the computation, since their contribution to the score is zero when one assumes the independence between a gap and the paired amino acid.

There are essentially two ways for employing the BLOSpectrum. The first one is that of performing a BLAST or FASTA search inside a database, given a query sequence. The result is a set of h possible matches, ordered by score, in which the query sequence and the corresponding match are paired for a length that is respectively n_1, n_2, ..., n_h. The user can extract all matches of interest within the output set and compare them with the query sequence by using the BLOSpectrum software. The second one is that of comparing two assigned sequences with a program such as BLAST2, so as to find the best gapped alignment. Also in this case we can use the BLOSpectrum on the two portions of the query sequences that are paired by BLAST2 and that have the same length n. It is obvious that the next step would be that of integrating the BLOSpectrum tool inside a widely used database search engine.

Even if the correct way for using the BLOSpectrum software is that of supplying it with two sequences of the same length, derived from preceding queries of BLAST, BLAST2, FASTA, or others, the BLOSpectrum applet accepts also two sequences of different length n and m > n; in this case the program merely computes the scores associated to all possible alignments of n over m, showing the highest one, but it does not insert gaps.

3.3. Biological examples

To illustrate the behavior of the BLOSpectrum under the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank http://www.expasy.uniprot.org (see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related, even if aligning with rather modest scores.

The first set contains sequences from the related Hepatocyte nuclear factor 4α (HNF4-α), Hepatocyte nuclear factor 6 (HNF6), and GAT binding protein 1 (globin transcription factor 1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.

The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree constructed according to the multiple alignment for all members of this family [23] is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes.

Table 2: The three sets of protein families used in testing the BLOSpectrum. The UniProt ID is furnished (with the sequence length). For the defensins and Pro-rich peptides, only the mature peptide sequences were used in alignments. In the following tables, sequences are indicated by the corresponding numbers 1–4.

Family (sequences 1–4)

First set
  HNF4-α: 1 P41235 (465), H. sapiens; 2 P49698 (465), Mus musculus; 3 P22449 (465), Rattus norv.
  HNF6: 1 Q9UBC0 (465), H. sapiens; 2 O08755 (465), Mus musculus; 3 P70512 (465), Rattus norv.
  GAT1: 1 P15976 (413), H. sapiens; 2 P17679 (413), Mus musculus; 3 P43429 (413), Rattus norv.
Second set
  Serine proteases: P07477 (247), H. sapiens trypsin; P17538 (263), H. sapiens chymotrypsin; Q9UNI1 (258), H. sapiens elastase 1; P00775 (259), Streptomyces griseus trypsin; P35049 (248), Fusarium oxysporum trypsin
  Hemoglobins: P02232 (92), Vicia faba leghemoglobin I; S06134 (92), P. chilensis hemoglobin I
  Transposons: A26491 (41), D. mauritiana mariner transposon; NP493808 (41), C. elegans transposon TC1
  Beta defensins: 1 BD01 (36), H. sapiens; 2 BD02 (41), H. sapiens; 3 BD03 (39), H. sapiens; 4 BD04 (50), H. sapiens
Third set
  Pro/Arg-rich peptides: 1 BCT5 (43), bovin; 2 BCT7 (59), bovin; 3 PR39PRC (42), pig; 4 PF (82), pig

3.3. Biological examples

To illustrate the behavior of the BLOSpectrum under the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank http://www.expasy.uniprot.org (see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related, even if aligning with rather modest scores.

The first set contains sequences from the related Hepatocyte nuclear factor 4α (HNF4-α), Hepatocyte nuclear factor 6 (HNF6), and GAT binding protein 1 (globin transcription factor 1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.

The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree, constructed according to the multiple alignment for all members of this family [23], is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes. Another example pertaining to weakly correlated sequences that show distant relationships is the one originally used by Altschul [7] to compare PAM-250 with PAM-120 matrices, that is, the 92-residue Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I, characterized by a very poor percent identity (about 15%), with pairs of identical amino acid residues that are spread fairly evenly along the alignment. A further example considers the sequences associated to the Drosophila mauritiana mariner transposon and the Caenorhabditis elegans transposon TC1, with a length of 41 residues, used by S. Henikoff and J. G. Henikoff [9] to test the performance of their BLOSUM scoring matrices. The last example derives from human beta defensins. This family of host defense peptides has arisen by gene duplication followed by rapid divergence driven by positive selection, a common occurrence in proteins involved in immunity [24]. They are characterized by the presence of six highly conserved cysteine residues, which determines folding to a conserved tertiary structure, while the rest of the sequence seems to have been relatively free of structural constraints during evolution [25, 26]. Even if clearly related, these peptides have a percentage sequence identity less than 40%. All these families represent the case of nonstandard target frequencies, while the amino acid frequency distribution does not appear, at first sight, to be too abnormal. The sequence comparison scores are modest at best, even though members are known to be biologically correlated.

The third set contains sequences that are expected to fall in Case 3. These are members of the Bactenecins family of linear antimicrobial peptides, with an unusually high content of Pro and Arg residues, and an identity of about 35% [27], representing sequences with a highly atypical amino acid frequency distribution.

If we analyze the alignments inside all these sets of protein families, we effectively find examples for each of the three cases illustrated in the preceding section. The alignments of human and mouse HNF4-α sequences (as illustrated in Table 3), and the BLOSpectrum of HNF4-α, HNF6, and GAT1 sequence comparisons (see Figure 1), are clear examples of Case 1, with high correlation between all respective couples of sequences and a target frequency divergence that is strongly sensitive to the BLOSUM-θ parameter, so we stop the scoring procedure at step 5.

Table 3: BLOSUM decomposition for intrafamily alignments for proteins of the first set.

HNF4-α human versus HNF4-α mouse

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  3.939  0.929  0.050  0.057  3.118  2833  95.9
80   3.939  1.297  0.046  0.053  2.741  2537  95.9
62   3.939  1.582  0.046  0.052  2.456  2330  95.9
50   3.939  1.861  0.043  0.050  2.171  3003  95.9
40   3.939  2.226  0.039  0.047  1.800  3381  95.9
35   3.939  2.414  0.036  0.044  1.605  2982  95.9

HNF4-α (BLOSUM-100)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  3.955  0.930  0.050  0.056  3.132  2846  96.3
2–3  4.141  1.008  0.057  0.056  3.246  2952  99.5

[Figure 1 (bar charts), first set: HNF4-α human vs. HNF4-α mouse, HNF6 human vs. HNF6 mouse, and GAT1 human vs. GAT1 mouse, at BLOSUM-100; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 1: BLOSpectrum for sequences of the first set.

For example, the HNF4-α alignment has a target frequency divergence that varies from 2.41 to 0.93 when passing from BLOSUM-35 (a matrix tailored for a wrong evolutionary distance) to BLOSUM-100 (the matrix tailored for a correct evolutionary distance), so that minimizing the frequency divergence (rows in italic) helps identify the best θ parameter for comparing the analyzed sequences; it corresponds to θ = 100, coherent with the high percent identity (86–96%). In this case, the compensation factor D(FX//P) + D(FY//P) corresponding to background frequency divergence is almost zero, since observed background and target frequencies are very near to those implicit in the BLOCKS database, leading to the conclusion that these are typical sequences that correspond closely to the protein model associated with BLOCKS. The global (normalized) score is high (3.12 in the HNF4-α example), due to a high degree of stochastic similarity (I(X, Y) ≈ 3.94), which is not greatly penalized. Other members of the HNF4-α, HNF6, or GAT1 families behave similarly (see Figure 1).

The situation changes considerably when we compute the BLOSUM decomposition for the different examples listed for the second set, for example, comparing human trypsin, elastase, and chymotrypsin to one another, or comparing these enzymes in distantly related species, such as human, Streptomyces griseus (a bacterium), and Fusarium oxysporum (a fungus). Following the Scoring Procedure, and starting with ungapped alignments, we have a case of high target frequency divergence, with a low level of background frequency divergence, corresponding to the situation outlined in step 6.

[Figure 2 (bar charts), second set, ungapped and gapped panels: chymotrypsin human vs. S. griseus trypsin; Vicia faba leghemoglobin I vs. Paracaudina chilensis hemoglobin I; D. mauritiana mariner transposon vs. C. elegans transposon TC1; BD01 human vs. BD02 human; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 2: BLOSpectrum for (ungapped and gapped) sequences of the second set.

However, as soon as we use gapped alignments, we observe a remarkable increment in the score, due to a reduced penalization factor associated to target frequency divergence (see Figure 2, first column, and Table 4). This is the obvious case when the bad matching is a consequence of deletions and/or insertions that occurred during evolution, which is resolved once gaps are introduced, so that the sequence comparison falls into Case 1.

A different situation occurs when aligning Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I. D(FXY//PAB) minimization (step 3) leads to a narrower spread of values (2.48–2.07) when passing from BLOSUM-100 to BLOSUM-35, with a minimum (2.05) at θ = 40, which is consequently the best parameter to compare the sequences. The global score (0.24) is rather low, despite these sequences being clearly evolutionarily related. In fact, the BLOSpectrum shows that the stochastic correlation I(X, Y) is quite high (1.84), but is killed by the heavy penalty derived from the negative contribution of D(FXY//PAB), while the compensation factors due to background frequency divergence are less significant (0.25 and 0.19, resp.), as the sequences are typical proteins under the BLOCKS model. Furthermore, extending the size of the alignment or including gaps does not significantly alter the spectrum (see Table 5 and Figure 2, second column), so we leave the Scoring Procedure at step 6; we simply have weakly related sequences.

The Drosophila mauritiana and Caenorhabditis elegans transposons provide a similar example, with only a weak minimization for θ = 62 (D(FXY//PAB) = 2.80). The other BLOSpectrum components are respectively I(X, Y) = 2.34, D(FX//P) = 0.53, and D(FY//P) = 0.72. The sequences thus have a high stochastic correlation, but the target frequencies are rather atypical, so that the divergence entirely kills the contribution derived from mutual information, and if the score is weakly positive (0.79) it is only due to the terms associated to background frequency divergence. In fact, the biological relationship of these atypical sequence fragments is effectively captured only due to the presence of this compensation factor.

Table 4: BLOSUM decomposition for ungapped and gapped serine proteases.

Serine proteases

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity

human chymotrypsin versus Streptomyces griseus trypsin (ungapped)
100  1.014  2.023  0.134  0.132  −0.742  −398  11.5
80   1.014  1.739  0.141  0.137  −0.446  −230  11.5
62   1.014  1.570  0.146  0.145  −0.264  −121  11.5
50   1.014  1.437  0.134  0.141  −0.147  −120  11.5
40   1.014  1.321  0.132  0.138  −0.035  −42   11.5
35   1.014  1.305  0.136  0.145  −0.008  −7    11.5

human chymotrypsin versus Streptomyces griseus trypsin (gapped)
100  1.645  1.213  0.164  0.156  0.753  326  35.9
80   1.645  1.138  0.170  0.164  0.842  382  35.9
62   1.645  1.149  0.178  0.171  0.845  416  35.9
50   1.645  1.176  0.171  0.159  0.800  557  35.9
40   1.645  1.270  0.170  0.158  0.703  640  35.9
35   1.645  1.346  0.177  0.163  0.640  584  35.9

In this case, a gapped alignment including a wider portion of the sequences actually reduces the background frequency divergences to remarkably lower values (0.237 and 0.226), neutralizing the compensation (see Table 6 and Figure 2, third column).

In both the preceding examples, we are in the situation where the parameter θ of the substitution matrix is appropriate for the sequence divergence of the sequences in question, the background frequency divergence is small, but the target frequency divergence is still large: this is a signal that we are dealing with weakly related sequences, characterized by several events of substitution that occurred during evolution. It is usually difficult to capture these weakly related sequences using standard scoring matrices, such as BLOSUM or PAM, since the common ancestor could be very old. As a matter of fact, this difficulty was used to respectively test the PAM-250 versus PAM-120 matrices (Altschul [7], hemoglobins) and the BLOSUM-62 versus PAM-160 matrices (S. Henikoff and J. G. Henikoff [9], transposons). Here, we cannot remove the cause of mismatching and we leave the Scoring Procedure at step 6.

The last example from this group derives from human beta defensins, and even if these sequences are known to be evolutionarily related, some couples actually show a negative normalized score (1–4, 2–3, 2–4, see Table 7 and Figure 2, last column), suggesting that they are not. In fact, a normal BLOSUM-62 BLAST search using the human beta defensin 1 sequence picks up several homologues from other mammalian species, whereas those with the three paralogous human sequences are below the cutoff score. BLOSpectrum analysis reveals a high stochastic correlation I(X, Y) (2.00–3.03), neutralized by an even higher penalty factor due to the target frequency divergence (3.28–3.56), partly compensated by the substantial background frequency divergences (0.54–0.79), and with little effect of the BLOSUM-θ parameter, or of introducing gaps. These are fairly typical proteins, whose score is heavily penalized by a remarkable target frequency divergence. Only the compensation factor induced by background frequency divergence can, in some cases, sustain the score over positive values, allowing the identification of a biological correlation that would otherwise have been lost.

The third set of sequences are Pro/Arg-rich antimicrobial peptides of the Bactenecins family, with about 35% identity [27, 28]. The obtained scores are clearly positive, despite the poor stochastic correlation (0.40–0.60, see Table 8 and Figure 3). The penalty factor due to target frequency divergence is remarkably high in this case (4.15–4.49) and should drag the score to quite negative values, but the compensation factor due to background frequency divergence is even greater and fully compensates it. We thus leave the scoring procedure at step 7. This is the typical case of poorly conserved sequences with singular key structural aspects that are however highly preserved (cf. the pattern of proline and arginine residues). As the background frequencies FX and FY are far from the standard background P associated with the BLOCKS database, the evaluation of a more realistic score for these sequences passes through the use of a compositionally adjusted BLOSUM matrix [11]. Such matrices are built in such a way as to reduce background frequency divergence, so as to eliminate the portion of target divergence that is induced by it. In this way, the residual target divergence accounts only for effective evolutionary divergence between sequences.

As a final example, we obtained BLOSUM spectra also for sequences from obviously uncorrelated families. The results are reported in Table 9 and Figure 4.

Table 5: BLOSUM decomposition for ungapped and gapped hemoglobins.

P02232: 49 SAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKE 115

++ + S ++ AHA +V ++ + +L + L H V H+ + + L++ ++

S06134: 61 ASQLRSSRQMQAHAIRVSSIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQA 127

P02232: 116 ASGDKWSEELSAAWEVAYDGLATAI 140

G ++E+ AW A+

S06134: 128 ELGSDFNEKTRDAWAKAFSIVQAVL 152

Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (ungapped)

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  1.839  2.478  0.264  0.207  −0.166  −31  15.2
80   1.839  2.240  0.264  0.199  0.063   12   15.2
62   1.839  2.128  0.260  0.192  0.163   35   15.2
50   1.839  2.077  0.255  0.185  0.203   54   15.2
40   1.839  2.051  0.255  0.194  0.237   83   15.2
35   1.839  2.070  0.263  0.202  0.235   82   15.2

Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (gapped)
100  1.597  1.962  0.166  0.172  −0.026  −10  18.1
80   1.597  1.759  0.161  0.163  0.162   40   18.1
62   1.597  1.661  0.154  0.153  0.243   65   18.1
50   1.597  1.618  0.145  0.145  0.268   104  18.1
40   1.597  1.606  0.145  0.155  0.291   152  18.1
35   1.597  1.623  0.154  0.163  0.283   148  18.1

P02232: 2 FTEKQEALVNSSSQLFKQNPSNYSVLFYTIILQKAPTAKAMFSFLK--DSAGVVDSPKLGAHAEKVF 68

T Q+ +V + +N +++ + I P+A+ F + ++ + S ++ AHA +V

S06134: 12 LTLAQKKIVRKTWHQLMRNKTSFVTDVFIRIFAYDPSAQNKFPQMAGMSASQLRSSRQMQAHAIRVS 78

P02232: 69 GMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKEASGDKWSEELSAAWEVAY 135

++ + +L + L H V H+ + + L++ ++ G ++E+ AW A+

S06134: 79 SIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQAELGSDFNEKTRDAWAKAF 145

In these cases we generally obtain a poor stochastic correlation I(X, Y) and a high value for the penalty factor D(FXY//PAB), leading to a globally negative score, which is not compensated by the background frequency divergences. Note that in two cases, a mildly positive score could suggest a distant relationship. Analysis of the BLOSpectrum helps in evaluating this possibility. The PF12 versus GAT1 alignment is simply a case of overcompensation for a nontypical sequence (the background frequency divergence for one of the sequences is very high). In the second case, however, the I(X, Y) value for the BD04 versus GAT1 human alignment is surprisingly quite high, suggesting that a closer look might be appropriate.

4. CONCLUSIONS

A standard use of scoring substitution matrices, such as BLOSUM-θ, is often insufficient for discovering concealed correlations between weakly related sequences. Among other causes, this can derive from (i) the introduction of gaps during evolution, (ii) the use of a BLOSUM-θ matrix tailored for a different evolutionary distance than that pertaining to the aligned sequences, and/or (iii) the use of standard matrices for comparison of proteins with nonstandard background frequency distributions of amino acids.

Table 6: BLOSUM decomposition for ungapped and gapped transposons.

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIEH 283

+F DN P HT+ VR + + +L + SPDL P +

A26491: 245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDY 285

Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (ungapped)

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  2.339  2.926  0.740  0.531  0.685  55   34.1
80   2.339  2.849  0.733  0.531  0.754  60   34.1
62   2.339  2.800  0.724  0.526  0.789  67   34.1
50   2.339  2.831  0.721  0.516  0.746  90   34.1
40   2.339  2.935  0.716  0.509  0.630  104  34.1
35   2.339  2.969  0.714  0.505  0.590  92   34.1

Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (gapped)
100  1.991  2.244  0.244  0.243  0.235  40   25.0
80   1.991  2.110  0.246  0.234  0.362  67   25.0
62   1.991  2.021  0.245  0.227  0.443  91   25.0
50   1.991  2.009  0.237  0.226  0.445  123  25.0
40   1.991  2.043  0.227  0.228  0.404  152  25.0
35   1.991  2.066  0.226  0.229  0.381  144  25.0

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIE-HLWEELERRLGGIRASNAD 301

+F DN P HT+ VR + + +L + SPDL P + HL+ + L R + +

A26491: 245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDYHLFASMGHALAEQRFDSYE 304

NP_493808: 302 AKFNQLENAWKAIPMSVIHKLIDSMPRRCQAVIDANG 338

+ L++A +I +PR++++G

A2649: 305 SVKKWLDEWFAAKDDEFYWRGIHKLPERWEKCVASDG 341

All these well-known effects can be better evidenced and quantified by the decomposition of the BLOSUM score (BLOSpectrum) according to (11). This equation highlights the core of the biological correlation measured by the BLOSUM score, that is, the mutual information I(X, Y), or sequence convergence. If gaps are taken into account (such as in BLAST), the correct θ parameter is chosen with the help of the BLOSpectrum, and the background frequencies of the sequences are near to the standard ones, then the global score is given by the sequence convergence plus a residual penalization factor due to target frequency divergence. This residual value implicitly takes into account that numerous substitution events may have occurred during sequence evolution, and so is a coherent measure of the biological relationship and distance between the sequences. If the background frequencies of the sequences are not standard, then we have shown that the BLOSUM scoring method has an in-built capacity to correct for anomalies in amino acid distributions, using background frequency divergence as a compensation factor. One can also choose to compositionally adjust the matrix, so as to reduce the compensation factor together with the component of target frequency divergence that is induced by a bad background frequency distribution. This systematic method is illustrated in the scoring analysis procedure of Section 2.

Our decomposition becomes important when we consider sequences for which the BLOSUM score indicates a weak or no correlation. A critical evaluation of the BLOSpectrum components can help corroborate or identify an underlying biological correlation and whether the matrices being used are the most appropriate ones for measuring it. In other words, when considering the grey area of BLOSUM scores with a marginal significance, it could help to decide whether an evolutionary relationship actually exists.

Table 7: The BLOSUM terms for beta defensins.

BD01 human versus BD02 human
BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  3.030  3.566  0.564  0.618  0.646  45   41.6
80   3.030  3.453  0.568  0.623  0.768  58   41.6
62   3.030  3.438  0.604  0.652  0.849  65   41.6
50   3.030  3.418  0.615  0.663  0.891  99   41.6
40   3.030  3.378  0.577  0.626  0.855  129  41.6
35   3.030  3.320  0.539  0.588  0.837  120  41.6

Human beta defensins (BLOSUM-35)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  2.731  3.325  0.539  0.751  0.697   101  30.5
1–4  2.532  3.658  0.539  0.728  0.141   22   16.6
2–3  2.009  3.466  0.794  0.616  −0.045  −10  10.2
2–4  2.334  3.522  0.609  0.568  −0.009  0    12.1
3–4  2.122  3.286  0.794  0.655  0.286   44   20.5

Table 8: The BLOSUM terms for Pro/Arg-rich peptides.

BCT5 bovin versus BCT7 bovin

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  0.424  4.935  2.329  2.460  0.279  28  34.8
80   0.424  4.724  2.317  2.449  0.467  42  34.8
62   0.424  4.637  2.301  2.430  0.518  37  34.8
50   0.424  4.533  2.264  2.389  0.544  68  34.8
40   0.424  4.407  2.221  2.338  0.576  97  34.8
35   0.424  4.368  2.199  2.301  0.556  98  34.8

Pro/Arg-rich peptides (BLOSUM-35)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  0.516  4.434  2.095  2.205  0.382  63   30.9
1–4  0.446  4.491  2.199  2.488  0.643  110  39.5
2–3  0.584  4.156  2.095  2.257  0.780  133  47.6
2–4  0.406  4.350  2.256  2.251  0.563  134  37.2
3–4  0.609  4.260  2.095  2.347  0.792  132  45.2

We provide online software at http://bioinf.dimi.uniud.it/software/software/blosumapplet which integrates a BLOSpectrum histogram with the score obtained by a classical BLAST engine working on two input sequences, allowing an immediate visual analysis of the score components. The systematic use of the BLOSpectrum parameters to permit a more sensitive filtering of scores inside a BLAST or similar engine could be the logical next operative step. We have provided several biological examples indicating the potential of our method, but it is clear that it needs massive biological experimentation to completely test its effective usefulness.

APPENDIX

Proof of (11). By multiplying inside the log function of (7) by f(i, j)/f(i, j) and by f(i)f(j)/(f(i)f(j)) and rearranging the terms, we obtain

S_N(X, Y) = \sum_{i,j} f(i,j) \log \frac{p(i,j)\, f(i,j)\, f(i)\, f(j)}{p(i)\, p(j)\, f(i,j)\, f(i)\, f(j)}
          = \sum_{i,j} f(i,j) \log \frac{f(i,j)}{f(i) f(j)} - \sum_{i,j} f(i,j) \log \frac{f(i,j)}{p(i,j)} + \sum_{i,j} f(i,j) \log \frac{f(i) f(j)}{p(i) p(j)}
          = I(X, Y) - D(F_{XY}//P_{AB}) + \sum_{i,j} f(i,j) \log \frac{f(i)}{p(i)} + \sum_{i,j} f(i,j) \log \frac{f(j)}{p(j)}
          = I(X, Y) - D(F_{XY}//P_{AB}) + D(F_X//P_A) + D(F_Y//P_B).          (A.1)

[Figure 3 (bar charts), third set: BCT5 bovin vs. BCT7 bovin, BCT5 bovin vs. PR39PRC pig, and BCT7 bovin vs. PR39PRC pig, at BLOSUM-35; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 3: BLOSpectrum for sequences of the third set.

Table 9: Some examples of BLOSUM-35 terms for sequences belonging to noncorrelated families.

BLOSUM-35
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
HNF4-α human versus HNF6 human
1-1  0.578  0.986  0.036  0.205  −0.165  −312  5.37
HNF4-α human versus GAT1 human
1-1  0.712  1.033  0.038  0.193  −0.088  −144  8.71
HNF6 human versus GAT1 human
1-1  0.622  1.122  0.230  0.193  −0.076  −143  8.47
BD04 human versus BCT7 bovin
4–2  1.010  3.887  0.460  2.220  −0.195  −36   10.0
PF12 pig versus GAT1 human
4–1  0.686  3.486  2.182  0.709  0.091   24    18.2
BD04 human versus GAT1 human
4–1  2.243  3.033  0.460  0.465  0.136   25    12.0

A fuller understanding of the mathematical tools used in Section 2 requires some definitions and mathematical properties pertaining to ID and MI; they are summarized as follows.

Let us start by considering some probability distributions [10] over an alphabet A with K symbols, for example P = {p1, p2, ..., pK}, Q = {q1, q2, ..., qK}, and so on. In our context, K = 20, as there are 20 amino acids, and the alphabet letters correspond to the 1-letter amino acid standard coding (D = Asp, E = Glu, W = Trp, etc.). If we imagine the space of all possible K-dimensional probability distributions, it is natural to ask what is the "distance" from P to Q (or vice versa). The most popular (pseudo-)distance is the informational divergence D(P//Q),

D(P//Q) \triangleq \sum_{i=1}^{K} p(i) \log \frac{p(i)}{q(i)},          (A.2)

introduced by Kullback in 1954 in the context of statistics [29]; here p(i) ≥ 0 and q(i) > 0.
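For readers who wish to experiment with these quantities, the following small sketch (ours, not part of the BLOSpectrum software) computes the informational divergence of (A.2) and the divergence of a joint distribution from the product of its marginals (the mutual information introduced below), from plain Python dictionaries; distributions are assumed to be normalized, with q(i) > 0 wherever p(i) > 0.

```python
import math

def divergence(p, q, base=2.0):
    """Informational divergence D(P//Q) of (A.2); p and q map symbols to probabilities."""
    return sum(pi * math.log(pi / q[i], base) for i, pi in p.items() if pi > 0)

def mutual_information(p_xy, base=2.0):
    """D(P_XY // P_X P_Y): divergence of the joint distribution from the product of
    its marginals, obtained by summing the joint over the other variable."""
    p_x, p_y = {}, {}
    for (i, j), pij in p_xy.items():
        p_x[i] = p_x.get(i, 0.0) + pij
        p_y[j] = p_y.get(j, 0.0) + pij
    product = {(i, j): p_x[i] * p_y[j] for (i, j) in p_xy}
    return divergence(p_xy, product, base)
```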

[Figure 4 (bar charts), noncorrelated sequences: HNF4 human vs. HNF6 human, HNF4 human vs. GAT1 human, HNF6 human vs. GAT1 human, BD04 human vs. BCT7 bovin, PF12 pig vs. GAT1 human, and BD04 human vs. GAT1 human, at BLOSUM-35; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 4: BLOSpectrum for noncorrelated sequences.

It is easy to verify [18] that the informational divergence (ID) is nonnegative, and it is equal to 0 if and only if P is coincident with Q (P ≡ Q). Furthermore, the ID is not boundable, since D(P//Q) → +∞ if an i exists such that q(i) → 0. All this can be summarized in the following way:

0 \le D(P//Q) \le +\infty \quad (= 0 \text{ when } P \equiv Q; \; = +\infty \text{ when there exists } i \text{ such that } q(i) = 0).          (A.3)

Note that the ID is the sum of positive and negative terms, and the fact that the average is always greater than zero is not obvious (it is a consequence of the convexity property of the logarithm). Since D(P//Q) = 0 if and only if P ≡ Q, this allows us to interpret the ID as a measure of (pseudo)distance between probability distributions. It is only "pseudo" (from the mathematical point of view) since the concept of "distance" is well defined in mathematics, and requires also symmetry between the variables and the validity of the so-called triangular inequality. But the ID lacks both these last two properties, since, in general, D(P//Q) ≠ D(Q//P) (it is asymmetric) and, if R is a third probability distribution, we are not sure that D(P//R) + D(R//Q) is greater than D(P//Q) (the triangular inequality does not hold). We underline that such a distance is not symmetric (and so the order in which P and Q are specified does matter), that is, it is a distance "from" rather than a distance "between."

Suppose now that PX = {pX(1), pX(2), ..., pX(K)} and PY = {pY(1), pY(2), ..., pY(K)} are the probability distributions associated to the (random) variables X and Y, which take their values in the same alphabet A. Here, pX(i) = Pr{X = i} means the probability that the variable X assumes the value i. In our framework, X and Y are two protein sequences of the same length n, and pX(2) = Pr{X = 2} = 0.09 (e.g.) is interpreted as the relative frequency of the second amino acid of the alphabet A; so, the overall occurrence of the 2nd amino acid in sequence X is equal to 0.09n. In this context, we can introduce also a joint probability distribution associated to the sequences, PXY = {pXY(i, j), i, j ∈ A} = Pr{X = i, Y = j, i, j ∈ A}, where pXY(i, j) corresponds to the relative frequency of finding the amino acids i, j paired in a certain position of the alignment between X and Y. It is well known that Σ_{i,j} pXY(i, j) = 1 (PXY is a probability distribution) and that the sum of the joint probabilities over one variable gives the marginal of the other variable, Σ_j pXY(i, j) = pX(i). For example, given that the ninth and the fifth amino acid in the alphabet are Arginine and Leucine, respectively, pXY(9, 5) = pXY(Arg, Leu) = 0.01 means that the relative frequency of finding Arg in X paired with Leu in Y is equal to 0.01. In practice, we avoid the use of the subscripts, and use the simpler notation p(i) and p(i, j) instead of pX(i) and pXY(i, j).

Since the condition of independence between two variables (protein sequences) X and Y is fixed by the formula pXY(i, j) = pX(i)pY(j) (for each pair i, j ∈ A), then, once assigned a certain PXY, it could be interesting to attempt to evaluate the distance of PXY from the condition of independence between the variables. Making use of the ID (A.2), we need to evaluate the quantity D(PXY//PX PY), that is, the stochastic distance between the joint PXY and the product of the marginals PX PY. If we have independence, then PXY ≡ PX PY, and the divergence equals zero. On the contrary, if it appears that X and Y are tied by a certain degree of dependence, this can be measured by

D(P_{XY}//P_X P_Y) = \sum_{i,j} p(i,j) \log \frac{p(i,j)}{p(i) p(j)} \triangleq I(X, Y) \ge 0.          (A.4)

This quantity is also called the mutual information (or relative entropy) I(X, Y) between the random variables (the protein sequences, in our setting) X and Y. It is symmetric in its variables (I(X, Y) = I(Y, X)) and is always nonnegative, since it is an informational divergence. Note also that MI is upper bounded by the logarithm of the alphabet cardinality, that is, I(X, Y) ≤ log 20 [18]. Moreover, since it equals zero if and only if the joint probability distribution coincides with the product of the marginals, that is, when we have independence between the two variables, we can interpret the mutual information (MI) as a measure of stochastic dependence between X and Y. From another point of view, we can say that independence is equivalent to the situation in which the variables X and Y do not exchange information. So, the meaning of I(X, Y) can be read also as the degree of dependence between the variables, or as the average information exchanged between the same variables. Mutual information is one of the pillars of Shannon information theory, and was introduced in the seminal paper by Shannon [16, 17].

ACKNOWLEDGMENTS

The authors thank Jorja Henikoff, who provided the matrices of joint probability distributions associated to the database BLOCKS, and an anonymous referee of a previous version of this paper, who made several key remarks. This work has been supported by the Italian Ministry of Research, PRIN 2003, FIRB 2003 Grants, by the Istituto Nazionale di Alta Matematica (INdAM), 2003 Grant, and by the Regione Friuli Venezia Giulia (2005 Grants).

REFERENCES

[1] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[2] A. D. McLachlan, "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551," Journal of Molecular Biology, vol. 61, no. 2, pp. 409–424, 1971.
[3] D. Sankoff, "Matching sequences under deletion-insertion constraints," Proceedings of the National Academy of Sciences of the United States of America, vol. 69, no. 1, pp. 4–6, 1972.
[4] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974.
[5] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some biological sequence metrics," Advances in Mathematics, vol. 20, no. 3, pp. 367–387, 1976.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed., vol. 5, supplement 3, pp. 345–352, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[7] S. F. Altschul, "Amino acid substitution matrices from an information theoretic perspective," Journal of Molecular Biology, vol. 219, no. 3, pp. 555–565, 1991.
[8] S. Karlin and S. F. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 6, pp. 2264–2268, 1990.
[9] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[10] W. Feller, An Introduction to Probability and Its Applications, John Wiley & Sons, New York, NY, USA, 1968.
[11] Y.-K. Yu, J. C. Wootton, and S. F. Altschul, "The compositional adjustment of amino acid substitution matrices," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15688–15693, 2003.
[12] S. F. Altschul, "A protein alignment scoring system sensitive at all evolutionary distances," Journal of Molecular Evolution, vol. 36, no. 3, pp. 290–300, 1993.
[13] D. J. States, W. Gish, and S. F. Altschul, "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices," Methods, vol. 3, no. 1, pp. 66–70, 1991.
[14] S. R. Sunyaev, G. A. Bogopolsky, N. V. Oleynikova, P. K. Vlasov, A. V. Finkelstein, and M. A. Roytberg, "From analysis of protein structural alignments toward a novel approach to align protein sequences," Proteins: Structure, Function, and Bioinformatics, vol. 54, no. 3, pp. 569–582, 2004.

[15] M. A. Zachariah, G. E. Crooks, S. R. Holbrook, and S. E. Brenner, "A generalized affine gap model significantly improves protein sequence alignment accuracy," Proteins: Structure, Function, and Bioinformatics, vol. 58, no. 2, pp. 329–338, 2005.
[16] C. E. Shannon, "A mathematical theory of communication—part I," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[17] C. E. Shannon, "A mathematical theory of communication—part II," Bell System Technical Journal, vol. 27, pp. 623–656, 1948.
[18] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, NY, USA, 1981.
[19] A. A. Schäffer, L. Aravind, T. L. Madden, et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Research, vol. 29, no. 14, pp. 2994–3005, 2001.
[20] F. Frommlet, A. Futschik, and M. Bogdan, "On the significance of sequence alignments when using multiple scoring matrices," Bioinformatics, vol. 20, no. 6, pp. 881–887, 2004.
[21] S. F. Altschul, J. C. Wootton, E. M. Gertz, et al., "Protein database searches using compositionally adjusted substitution matrices," FEBS Journal, vol. 272, no. 20, pp. 5101–5109, 2005.
[22] A. A. Schäffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul, "IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, vol. 15, no. 12, pp. 1000–1011, 1999.
[23] W. R. Rypniewski, A. Perrakis, C. E. Vorgias, and K. S. Wilson, "Evolutionary divergence and conservation of trypsin," Protein Engineering, vol. 7, no. 1, pp. 57–64, 1994.
[24] A. L. Hughes, "Evolutionary diversification of the mammalian defensins," Cellular and Molecular Life Sciences, vol. 56, no. 1-2, pp. 94–103, 1999.
[25] F. Bauer, K. Schweimer, E. Klüver, et al., "Structure determination of human and murine β-defensins reveals structural conservation in the absence of significant sequence similarity," Protein Science, vol. 10, no. 12, pp. 2470–2479, 2001.
[26] A. Tossi and L. Sandri, "Molecular diversity in gene-encoded, cationic antimicrobial polypeptides," Current Pharmaceutical Design, vol. 8, no. 9, pp. 743–761, 2002.
[27] R. Gennaro, M. Zanetti, M. Benincasa, E. Podda, and M. Miani, "Pro-rich antimicrobial peptides from animals: structure, biological functions and mechanism of action," Current Pharmaceutical Design, vol. 8, no. 9, pp. 763–778, 2002.
[28] M. E. Selsted, M. J. Novotny, W. L. Morris, Y.-Q. Tang, W. Smith, and J. S. Cullor, "Indolicidin, a novel bactericidal tridecapeptide amide from neutrophils," Journal of Biological Chemistry, vol. 267, no. 7, pp. 4292–4295, 1992.
[29] S. Kullback, Information Theory and Statistics, Dover, Mineola, NY, USA, 1997.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 72936, 14 pages
doi:10.1155/2007/72936

Research Article Aligning Sequences by Minimum Description Length

John S. Conery

Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA

Received 26 February 2007; Revised 6 August 2007; Accepted 16 November 2007

Recommended by Peter Grünwald

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.

Copyright © 2007 John S. Conery. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Sequence alignment is a fundamental operation in bioinformatics, used in a wide variety of applications ranging from genome assembly, which requires exact or nearly exact matches between ends of small fragments of DNA sequences [1], to homology search in sequence databases, which involves pairwise local alignment of DNA or protein sequences [2], to phylogenetic inference and studies of protein structure and function, which depend on multiple global alignments of protein sequences [3–5].

These diverse applications all use the same basic definition of alignment: a character in one sequence corresponds either to a character from the other sequence or to a "gap" character that represents a space in the middle of the other sequence. Alignment is often described informally as a process of writing a set of sequences in such a way that matching characters are displayed within the same column, and gaps are inserted in strings in order to maximize the similarity across all columns. More formally, alignments can be defined by a matrix M, where Mij is 1 if character i of one sequence is aligned with character j of the other sequence, or in some cases, Mij is a probability, for example, the posterior probability of aligning letters i and j [6].

This paper introduces a new framework for describing the similarities and differences in a set of sequences. The idea is to construct a special-purpose grammar for the strings that represent the sequences. If there are segments in each input sequence that are similar to corresponding segments in the other sequences, the grammar will have a single rule that directly generates the characters for these segments.

An alignment algorithm based on this new framework will consider different sets of rules to include in the grammar it produces. The focus of this paper is on the use of minimum description length (MDL) [7] as the basis of the alignment algorithm. The MDL principle argues that the best alignment will be the one described by the shortest grammar, where the length of a grammar is measured in terms of the number of bits needed to encode it.

The key idea is to use conditional probabilities to encode letters in aligned regions. If a grammar has a rule that aligns letter x in one sequence with letter y in another sequence, the encoding of the rule will be based on p(y | x), and if the alignment is accurate, the resulting encoding is shorter than the one that encodes x and y separately in an unaligned region. But there is a tradeoff: adding a new rule to a grammar requires adding new symbols for the rule structure, and the number of bits required to encode these symbols adds to the total size of the encoded grammar. The alignment algorithm must determine the net benefit of each potential aligned region and choose the set of aligned regions that provides the overall shortest encoding.

MDL has been used to infer grammars for large collections of natural language sentences [8] and to search for recurring patterns in protein and DNA sequences [9]. These applications of MDL are examples of machine learning, where the system uses the data as a training set and the goal is to infer a general description that can be applied to other data. The goal of the sequence alignment algorithm presented here is simply to find the best description for the data at hand; there is no attempt to create a general grammar that may apply to other sequences.

Grammars have been used previously to describe the structure of biological sequences [10–12], and regular expressions are a well-known technique for describing patterns that define families of proteins [13]. But as with previous work on MDL and grammars, these other applications use grammars and regular expressions to describe general patterns that may be found in sequences beyond those used to define the pattern, whereas for alignment the goal is to find a grammar that describes only the input data.

Grammars have the potential to describe a wide variety of relationships among sequences. For example, a top level rule might specify several different ways to partition the sequences into smaller groups, and then specify separate alignments for each group. In this case, the top level rules are effectively a representation of a phylogenetic tree that shows the evolutionary history of the sequences. This paper focuses on one very restricted type of grammar that is capable of describing only the simplest correspondence between sequences. The algorithm presented here assumes that only two sequences are being aligned, and that the goal is to describe similarity over the entire length of both input sequences, that is, the algorithm is for pairwise global alignment. For this application, the simplest type of formal grammar—a right linear grammar—is sufficient to describe the alignment. Since every right linear grammar has an equivalent regular expression, and because regular expressions are simpler to explain (and are more commonly used in bioinformatics), the remainder of this paper will use regular expression syntax when discussing grammars for a pair of sequences.

Current alignment algorithms are highly sensitive to the choice of gap parameters [14–17]; for example, Reese and Pearson showed that the choice of gap penalties can influence the score for alignments made during a database search by an order of magnitude [18]. One of the advantages of the grammar-based framework is that gaps are not needed to align sequences of varying length. Instead, the parts of regular expressions that correspond to regions of unaligned positions will have a different number of characters from each input sequence.

Previous work using information theory in sequence alignment has been within the general framework of a Needleman-Wunsch global alignment or Smith-Waterman local alignment. Allison et al. [19] used minimum message length to consider the cost of different sequences of edit operations in global alignment of DNA; Schmidt [20] studied the information content of gapped and ungapped alignments, and Aynechi and Kuntz [21] used information theory to study the distribution of gap sizes. The work described here takes a different approach altogether, since gap characters are not used to make the alignments.

Regular expression alignments are similar to the alignments produced by DIALIGN [22, 23], a program that creates consistent sets of ungapped local alignments. The main differences are that fragments in DIALIGN are defined by a Smith-Waterman alignment based on finding a locally optimal score and including neighboring letters until the score drops below a threshold, and DIALIGN uses a minimum length parameter to exclude short random matches. The method presented in this paper uses the MDL criterion to find the ends of aligned regions—if adding a pair of letters is less costly than leaving the letters in a variable region, then the letters are included in the aligned region.

Other methods that consider only ungapped local alignments are also similar to regular expression alignments. Schneider [24] used information theory as the basis of a multiple alignment algorithm for small ungapped DNA sequences and successfully applied it to binding sites. More recently, Krasnogor and Pelta [25] described a method for evaluating the similarity of pairs of proteins, but their analysis describes a global similarity metric without actually aligning the substrings responsible for the similarity.

The next section of this paper provides some background information on sequence alignment and explains in more detail how a regular expression can be used to capture the essential information about the similarity in a set of sequences. The details of the MDL encoding for sequence letters and other symbols found in expressions are given in Section 3. Results of two sets of experiments designed to test the method are presented in Section 4.

The regular expression alignment method described in this paper has been implemented in a program named realign. The source code, which is written in C++ and has been tested on OS/X and Linux systems, is freely available under an open source license and can be downloaded from the project web site [26].

2. ALIGNMENTS AND REGULAR EXPRESSIONS

One of the main applications of sequence alignment is comparison of protein sequences. The inputs to the algorithm are sets of strings, where each letter corresponds to one of the 20 amino acids found in proteins. The goal of the alignment is to identify regions in each of the input sequences that are parts of the same structural or functional elements or are descended from a common ancestor.

Figure 1(b) shows the evolution of fragments of three hypothetical proteins starting from a 9-nucleotide DNA sequence. The labels below the leaves of the tree are the amino acids corresponding to the DNA sequences at the leaves. The only change along the left branch is a single substitution which changes the first amino acid from P to T, and an alignment algorithm should have no problem finding the correspondences between the two short sequences (Figure 1(c)).

The sequence on the right branch of the tree is the result of a mutation that inserted six nucleotides in the middle of the original sequence. In order to align the resulting sequence with one of its shorter cousins, a standard alignment algorithm inserts a gap, represented by a sequence of one or more dashes, to mark where it thinks the insertion occurred.


Figure 1: (a) The genetic code specifies how triplets of DNA letters (known as “codons”) are translated into single amino acids when a cell manufactures a protein sequence from a gene. (b) A tree showing the evolution of a short DNA sequence. Labels below the leaves are the corresponding amino acid sequences. (c) Alignment of the two shorter sequences. (d) and (e) Two ways to align the longer sequence with one of the shorter ones.

This alignment is complicated by the fact that the insertion occurred in the middle of a codon; the single CCC that corresponded to a P in the ancestral sequence is now part of two codons, CCT and TTC. Figures 1(d) and 1(e) show two different ways of doing the alignment; the difference between the two is the placement of the gap, which can go either before or after the middle P of the short sequence.

A key parameter in the alignment of protein sequences is the choice of a substitution matrix, a 20 × 20 array S in which Si,j is a score for aligning amino acid i with amino acid j. The PAM matrices [27] were created by analyzing hand alignments of a carefully chosen set of sequences that were known to be descending from a common ancestor. PAM matrices are identified by a number that indicates the degree to which sequences have changed; a unit of "1 PAM" is roughly the amount of sequence divergence that can be expected in 10 million years [28], so the PAM20 matrix could be used to align a set of sequences where the common ancestor lived around 200 million years ago. Other common substitution matrices are the BLOSUM family [29] and the Gonnet matrix [30].

Substitution matrices give higher scores to pairs of letters that are expected to be found in alignments, and lower (negative) scores to pairings that are rare. For example, the PAM100 matrix has positive scores on the main diagonal, to use when aligning letters with themselves; the highest score is 12, for the pair W/W, since tryptophan (W) is highly conserved. Smaller positive scores are for letters that frequently substitute for one another; for example, leucine (L) and isoleucine (I) are both hydrophobic and the matrix entry for the pair I/L is 1. Histidine (H) is hydrophilic, and the matrix entry for I/H is −4. The pair P/L has a score of −4 and the pair P/S has a score of 0, so an algorithm using PAM100 would prefer the alignment shown in Figure 1(e).

Regular expressions are widely used for pattern matching, where the expression describes the general form of a string and an application can test whether a given string matches the pattern. To see how a regular expression is an alternative to a standard gap-based alignment, consider the following pattern, which describes the two sequences in Figures 1(d) and 1(e):

P(P | LFS)P.          (1)

Here the vertical bar means "or" and the parentheses are used to mark the ends of the alternatives. The pattern described by this expression is the set of strings that start with a P, then have either another P or the string LFS, and end in a P. In this example, the letters enclosed in parentheses correspond to a variable region: the pattern simply says "these letters are not aligned" and no attempt is made to say why they are not aligned or what the source of the difference is. The regular expression is an abstract description, covering both the alignments of Figures 1(d) and 1(e) (and a third, biologically less plausible, alignment in which the top string would be P–P–P).

For a more realistic example, consider the two sequence fragments in Figure 2(a), which are from the beginning of two of the protein sequences used to test the alignment application. Substrings of 15 characters near the front of each sequence are similar to each other. A regular expression that describes this similarity would have three groups, showing letters before and after the region of similarity as well as the region itself (Figure 2(b)).
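Pattern matching of this kind can be checked with any ordinary regular expression engine. The following minimal sketch tests the pattern from (1) against the two peptide fragments, which we assume here to be PPP and PLFSP as implied by the pattern; it is only an illustration, not part of the realign tool.

```python
import re

# The pattern from (1): a P, then either P or LFS (the variable region), then a P.
pattern = re.compile(r"^P(P|LFS)P$")

print(bool(pattern.match("PPP")))    # the shorter fragment   -> True
print(bool(pattern.match("PLFSP")))  # the fragment after the insertion -> True
print(bool(pattern.match("PLP")))    # a string outside the pattern     -> False
```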


Figure 2: (a) Strings from the start of two of the amino acid sequences used to test the alignment algorithm. The substrings in blue are similar to the corresponding substring in the other sequence. (b) A regular expression that makes explicit the boundaries of the region of similarity. (c) The canonical form representation of the regular expression. The canonical form has the same groupings of letters, but displays the letters in a different order and uses marker symbols instead of parentheses to specify group boundaries. A # means the sequence segments are blocks, where the ith letter from one sequence has been aligned with the ith letter in the other sequence. A > designates the start of a variable region of unaligned letters.

Any pair of sequences can be described by a regular expression of this form. The expression consists of a series of segments, written one after another, where each segment has two substrings separated by the vertical bar. But this standard notation introduces a problem: how does one distinguish segments describing aligned characters from segments for unaligned characters? The following convention solves the problem of distinguishing between the types of segments and reduces the number of symbols to a minimum. In a canonical form sequence expression,

(i) each open parenthesis is replaced with a symbol that specifies the type of the segment that starts at that location: an aligned segment starts with #, an unaligned segment starts with >;
(ii) the vertical bar separating the two parts of a segment is replaced by the symbol used at the start of the segment; thus if the segment starts with #, the two parts of the segment are separated by a second #;
(iii) the closing parenthesis marking the end of a segment can simply be deleted since it is redundant (every closing parenthesis is either followed by an opening parenthesis or comes at the end of the expression);
(iv) to make an expression easier to read, it is displayed by starting a new line for each # or >, with the understanding that "white space" breaking the expression into new lines is for formatting purposes only and is not part of the expression itself.

The canonical form of the expression describing the alignment of the initial parts of the two example genes is shown in Figure 2(c).

In the literature on sequence alignment, an ungapped local alignment is often referred to as a block. In the canonical form sequence expression, a block corresponds to a pair of lines starting with #; pairs of lines starting with > are called variable regions. Note that the substrings in blocks always have the same number of sequence letters, and always have at least one letter. Substrings in variable regions can have any number of sequence letters, and one of the strings can have zero letters. Since # and > define the boundaries of blocks, they are referred to as marker symbols.

Sequence expressions can easily be extended to describe a multiple alignment of n > 2 sequences. Each segment in an expression would have n substrings separated by vertical bars, and the corresponding canonical form would have n lines in each block and in each variable region. The MDL code length function and the alignment algorithm in the following section assume there are only two sequences; possible extensions for multiple alignment will be discussed in the final section.

3. ALIGNMENT USING MINIMUM DESCRIPTION LENGTH

It is easy to see that there is at least one canonical form sequence expression for every pair of sequences: simply create a single variable region, writing the string for each complete sequence to the right of a > symbol. This default expression is the null hypothesis that the sequences have nothing in common.
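To make the representation concrete, here is a small illustrative sketch (not the realign implementation) of how a canonical form expression might be held in memory and printed. A segment is a pair of substrings tagged as a block ('#') or a variable region ('>'); the default expression of the previous paragraph is a single variable region holding both complete sequences, and the example below uses the fragments of Figure 2.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    kind: str    # '#' for a block (aligned), '>' for a variable region (unaligned)
    top: str     # substring from the first sequence
    bottom: str  # substring from the second sequence

def default_expression(x: str, y: str) -> List[Segment]:
    """The null hypothesis: one variable region containing both whole sequences."""
    return [Segment('>', x, y)]

def canonical_form(expr: List[Segment]) -> str:
    """Render an expression with one line per substring, each preceded by its marker."""
    lines = []
    for seg in expr:
        lines.append(seg.kind + seg.top)
        lines.append(seg.kind + seg.bottom)
    return "\n".join(lines)

# Example: an aligned block surrounded by two variable regions (Figure 2).
expr = [Segment('>', "MNNNNYIF", "MNSYKP"),
        Segment('#', "ENENPILYNTNEGEE", "ENENPVLYNYKEDEE"),
        Segment('>', "NRSS", "SSHI")]
print(canonical_form(expr))
```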

The receiver recovers the original sequence data by expand- ing the expression to generate every sequence that matches the expression. 2markers 6markers 27 letters 27 letters A “communication protocol” that specifies the type of in- formation contained in a message and the order in which the pieces of the message are transmitted is an essential part of the encoding. The representation of a sequence expression Figure 3: Schematic representation of an expression rewriting op- begins with a preamble that contains information about the eration. A canonical form expression with a single variable region structure of the expression and the encoding of alignment is transformed into a new expression with two variable regions sur- parameters. rounding a block. The number of sequence letters does not change, A canonical form sequence expression is an alternating but four new marker symbols are added to specify the boundaries series of blocks and variable regions, where the marker sym- of the block. bols (# and >) inserted into the input sequences identify the boundaries between segments. The communication proto- col allows the transmitter to simplify the expression as it is compressed by putting a single bit in the preamble to spec- the locations of the start of the block (one in each input se- ify the type of the first segment. Then the only thing that is quence) and two > symbols mark the end of the block. As a required is a single type of symbol to specify the locations of special case, the block might be at the beginning or end of the remaining markers. For the example sequences shown in the expression; if so only two new # markers are added to the Figure 2, the expression can be transformed into the follow- expression. ing string: Since the alignment algorithm uses the minimum de- > MNNNNYIF.MNSYKP.ENENPILYNTNEGEE. scription length principle to search for the simplest expres- (2) sion, this transformation appears to be a step in the wrong ENENPVLYNYKEDEE.NRSS.SSHI direction because the complexity of the expression, in terms Here the >, represented by a single bit, indicates the type of the number of symbols used, has increased. The key point of the first region. The periods identify the locations of the is that MDL operates at the level of the encoding of the ex- markers. Since the regions alternate between # and >, the re- pression, that is, it prefers the expression that can be encoded ceiver infers the first period that represents another >, the in the fewest number of bits. As will be shown in this section, next two periods are #, and so on. blocks of similar sequence letters have shorter encodings. If The key parameter in every alignment is the substitution the number of bits saved by placing similar letters in a block matrix used to define joint probabilities for each letter pair is greater than the cost of encoding the symbols that mark the and single (marginal) probabilities for each individual letter. ends of the block, the transformed expression is more com- If the transmitter and receiver agree beforehand to restrict pact. the set of substitution matrices to a set of n commonly used The code length function that assigns a number of bits matrices, each matrix can be assigned an integer ID and the to each symbol in a canonical form sequence expression has preamble simply contains a single integer encoded in log n three components: 2 bits to identify the matrix. 
If an arbitrary matrix is allowed, (i) a protocol that defines the general structure of an ex- the protocol would have to include a representation for the pression and the representation of alignment parame- substitution matrix. ters; The rest of the information contained in the pream- (ii) a method for assigning a number of bits to each letter ble depends on the method used to represent the marker from the set of input sequences; symbols. Three different methods are presented below in (iii) a method for determining the number of bits to use Section 3.3, and each uses a different combination of param- for the marker symbols that identify the boundaries eters; for example, the indexed representation requires the between blocks and variable regions. transmitter to send the length of the longest sequence, and the tagged representation requires the transmitter to send the 3.1. Communication protocol number of bits used in the encoding of marker symbols. For numeric parameters, the transmitter can simply encode the A common exercise in information theory is to imagine that parameter in the fewest number of bits and include the en- a compressed data set is going to be sent to a receiver in coding as part of the preamble. A standard technique for rep- binary form, and the receiver needs to recover the original resenting a number that can be encoded in k bits is to send k data. This exercise ensures that all the necessary information 0s, a 1, and then the k bits that encode the number itself. is present in the compressed data—if the receiver cannot re- In general a regular expression can be expanded into construct the original data, it may be because essential infor- more than just the original sequence strings. For example, mation was not encoded by the compression algorithm. In suppose the two input strings are AB and CD, and the regular the case of the MDL alignment algorithm, the idea is to com- expression representing their alignment is of the form press a set of sequences by creating a representation of a reg- | | ular expression that describes the structure of the sequences. (A C)(B D). (3) 6 EURASIP Journal on Bioinformatics and Systems Biology
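The following toy sketch (the helper name is hypothetical) enumerates every string matched by a segment-wise expression such as the one in (3); it makes explicit why the protocol must fix the order of substrings within each segment.

```python
from itertools import product

def expansions(segments):
    """All strings matched by a segment-wise expression.

    `segments` is a list of segments, each a tuple of the substrings
    separated by the vertical bar, e.g. [("A", "C"), ("B", "D")].
    """
    return ["".join(choice) for choice in product(*segments)]

# (A|C)(B|D) matches four strings, not only the two inputs AB and CD.
print(expansions([("A", "C"), ("B", "D")]))       # ['AB', 'AD', 'CB', 'CD']
```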

A receiver can expand this expression into the two original Table 1: Cost (in bits) of aligning pairs of letters. Sx,y is the score input strings, but the expression also matches AD and CB. for letters x and y in the PAM100 substitution matrix. c(x)+c(y) Thus the protocol needs a method for telling the receiver is the sum of the costs of the two letters, which is incurred when how to link together the substrings from different segments thelettersareinavariableregion.c(x)+c(y | x) is the cost of the same letters when they are aligned in a block. The benefit of align- so that it will reconstruct AB and CD but not AD or CB. ff One solution would be to encode sequence IDs with the ing two letters is the di erence between the unaligned cost and the aligned cost: a positive benefit results from aligning similar letters, substrings so the receiver correctly pieces together a sequence a negative benefit from aligning dissimilar letters. using a consistent set of IDs. But if a simple convention is followed, the receiver can infer the sequence IDs from the xy Sx,y c(x)+c(y) c(x)+c(y | x)benefit(y, x) order in which the sequences are transmitted. For canonical WW 12 6.36 + 6.36 6.36 + 0.44 5.92 form sequence expressions, the protocol requires that every II 63.65 + 3.65 3.65 + 1.25 2.40 region has exactly two strings, and that within a region, the LL 63.09 + 3.09 3.09 + 0.72 2.37 strings need to be given in the same order each time. ML 34.97 + 3.09 4.97 + 2.26 0.83 LI 13.09 + 3.65 3.09 + 3.66 −0.01 3.2. Encoding sequence letters LQ −23.09 + 5.02 3.09 + 6.09 −1.07 The standard technique used in information theory of en- LC −63.09 + 5.78 3.09 + 9.38 −3.60 coding symbols according to their probability distribution can be used to encode sequence letters. If a letter x occurs − with probability p(x) the encoding of x requires log2 p(x) bits. When x and y are the same letter, or similar according to The probability distribution for letters is based on the the substitution matrix being used, the cost using the condi- tional probability will be lower. For any two letters x and y, substitution matrix being used for the alignment. Scores in ff a substitution matrix are log odds ratios of the form the benefit of aligning y with x is the di erence between the cost of placing the two letters in a variable region versus their 1 p(x, y) cost in a block: s(x, y) = log (4)     λ p(x)p(y) benefit(y, x) = c(x)+c(y) − c(x)+c(y | x) = − | (5) where p(x, y) is the joint probability of observing x aligned c(y) c(y x). with y, p(x)andp(y) are the background probabilities of x In general, there is a positive benefit for pairs of letters and y,andλ is a scaling factor [31]. The realign program that have positive scores in a substitution matrix. On the uses a program named lambda [32] as a preprocessor that other hand, a negative benefit is incurred when an algorithm takes an arbitrary substitution matrix as input, solves for λ, tries to align two dissimilar letters. Table 1 shows a few exam- and saves a table of background probabilities for each single ples of pairs of letters, the cost of placing them unaligned in letter and joint probabilities for each letter pair. a variable region, and the benefit gained from aligning them The number of bits used to encode a letter in a canoni- in a block. cal sequence expression depends on whether the letter is in a block or in a variable region. For a letter x in a variable 3.3. 
Encoding marker symbols region the encoding is straightforward: simply use the back- ground probability of x according to the transformed substi- Three different methods for encoding of the marker symbols tution matrix. that identify the boundaries between blocks and variable re- For a block, the encoding considers pairs of letters x and gions are illustrated in Figure 4. All three methods are based y that occur in the same relative position in the block. The on the transformation in which the # and > symbols have number of bits to encode the letter x in one sequence is based been replaced by periods. The difference between the three on p(x), the same as in a variable region, but for the letter y methods is in the representation of each marker and the ad- in the other sequence, the conditional probability p(y | x)is ditional information included in the preamble. used to reflect the fact that x and y are aligned. Since by def- | = inition p(y x) p(x, y)/p(x), the substitution matrix pro- 3.3.1. Indexed representation vides the necessary information to compute the conditional probabilities. The indexed representation for marker symbols is based on To summarize, the cost, in bits, of encoding letters in a the observation that it is not necessary to include the marker canonical form sequence expression is defined as follows: symbols themselves, but only their locations in each string. If an expression has m segments, the transmitter can construct (i) for a letter x in a variable region or in the first line a table of (m − 1) entries for each string. The number of bits of a block, the code length is a function of p(x), the for each table entry depends on n, the length of the corre- marginal probability of observing x:c(x)=−log p(x); 2 sponding input sequence. Using this technique, the preamble (ii) for a letter y in the second line of a block, the code of a message is constructed as follows: length is a function of p(y | x), the conditional prob- ability of seeing y in this location given character x in (i) order the input sequences so the longest sequence is =− | the same position in the first line: c(y, x) log2 p(y x). the first one in the message; John S. Conery 7

Figure 4, panel (d): q(x, y) = (1 − γ) × p(x, y); the p(x, y) sum to 1, the q(x, y) sum to 1 − γ, and the marker probability is q(·) = γ.

Figure 4: The items in blue correspond to information added to a string to specify the locations of marker symbols. (a) Indexed representation. The preamble contains two tables of m − 1 numbers to specify the locations of the m marker symbols (the first marker is always at the front of the string) in each sequence. Each table entry has k = log2 n bits to specify a location in a string of length n. (b) Tagged representation. A one-bit tag added to each symbol identifies the symbol class (letter or marker), and is followed by the bits that represent the symbol itself. (c) Scaled representation. The number of bits for each symbol x is simply −log2 q(x), where q(x) is the probability of the symbol based on a distribution that includes the probability of a marker. (d) Given a probability γ for marker symbols, the joint probabilities for the letter pairs are scaled by 1.0 − γ so the sum of probabilities over all symbols is 1.0.

(ii) use one bit to specify the type of the first segment mines the individual probability for each letter and the joint (which will be the same for both sequences); probability for each letter pair.   (iii) use log2s bits to specify which one of the s substi- tution matrices was used to encode letters and letter 3.3.2. Tagged representation pairs; (iv) use 2log n + 1 bits to specify n, the length of the first 2 There are two drawbacks to the indexed representation. The input sequence. This number also allows the receiver first is that the number of bits used to represent a marker to determine k = log n, the number of bits required to 2 grows (albeit very slowly) with the length of the input se- represent a single marker table entry; quences. That means one might get a different alignment for (v) the next 2log m + 1 bits specify m, the number of 2 the same two substrings of sequence letters in different con- marker symbols in each sequence; texts; if the substrings are embedded in longer sequences, (vi) create a table of size mk bits for the locations of the the number of bits per marker will increase, and the align- m markers in the first sequence, followed by another ment algorithm might decide on a different placement for table of the same size for the markers of the second the markers in the middle of the substrings. sequence. The second disadvantage is that in many cases marker Following the preamble, the body of the message simply symbols identify the locations of insertions and deletions, consists of the encoding of the letters defined in the previous which are evolutionary events. The number of bits used to section. Since the receiver knows the length of the first se- represent a marker should correspond to the likelihood of an quence, there is no need to include an end-of-string marker insertion or deletion, but not the length of the sequence. If after the first sequence. This location becomes a de facto anything, longer sequences are more likely to have had inser- marker for the start of the second sequence. tions or deletions, so the number of bits representing those Figure 4(a) shows how the start of the two example se- events should be lower, not higher. quences would be encoded with the indexed representation. The tagged representation addresses these problems by The numbers in blue are indices between 0 and the length of defining a prefix code for markers and embedding the marker the longer of the two sequences. codes in the appropriate locations within each sequence The advantage of this representation is that no additional string. This method requires the user to specify a value for a parameters are required to align a pair of sequences: the only new parameter, named α, the number of bits required to rep- alignment parameter is the substitution matrix, which deter- resent a marker. Each symbol in the expression is preceded by 8 EURASIP Journal on Bioinformatics and Systems Biology a one-bit tag that identifies the type of symbol, for example, scaled probabilities are lower than the original probabilities, azeroforamarkerandaoneforasequenceletter.Following the scaled costs of single letters are higher, and some letter the tag is the representation of the symbol itself: α bits for pairs that had a negative benefit according to the original markers, and c(x)bitsforaletterx using the cost function probabilities will now have a positive benefit. For example, defined in the previous section. 
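Under stated simplifying assumptions (background probabilities only, ignoring the conditional coding used for the second line of a block, and counting the class tag separately from α), a minimal sketch of the body cost of a tagged message looks like the following; the function name and toy probabilities are illustrative, not the realign implementation.

```python
import math

def tagged_body_cost(symbols, p, alpha):
    """Bits for the body of a message under the tagged representation.

    `symbols` is the expression as a sequence of letters and marker symbols
    ('#' or '>'); `p` maps each letter to its background probability;
    `alpha` is the user-chosen marker cost in bits.  Every symbol carries a
    one-bit class tag; the preamble is not counted here.
    """
    total = 0.0
    for s in symbols:
        total += 1                                # the class tag
        if s in "#>":
            total += alpha                        # marker
        else:
            total += -math.log2(p[s])             # c(x) = -log2 p(x)
    return total

# Toy usage with a hypothetical uniform background over four letters.
p = {letter: 0.25 for letter in "ACGT"}
print(tagged_body_cost(">AC>GT#TTA#TTA", p, alpha=4.0))
```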
in the PAM matrices, letter pairs with scores of 0 or higher The preamble of a message based on the tagged repre- have a positive benefit using unscaled probabilities, but when sentation is much simpler: it only contains the single bit des- scaled with 1 − γ = 0.75 pairs of slightly dissimilar amino ignating whether the first segment is a block or a variable acids with scores of −1 have a positive benefit. region, the substitution matrix ID, and the value of α.The tagged representation of the alignment of the example se- 3.4. Example quences is shown in Figure 4(b). Two different alignments of the sequences of Figure 2 are 3.3.3. Scaled representation shown in Figure 5. The alignments were made using the scaled representation with the PAM20 substitution matrix The additional bits attached to each symbol in the tagged and γ = 0.02. The code length for the null hypothesis— representation result in a rather awkward code from an in- a single variable region containing all letters from the two formation theoretic point of view, where the number of bits productions—is 240.279 bits. The code length of the expres- used to represent a symbol should depend on the probability sion with two variable regions and one block is 224.728 bits. of observing that symbol. The cost of the expression with the block is less because In order to define the number of bits for each symbol s the net benefit from using conditional probabilities to com- −log ( ), where is either a sequence letter or a marker as 2q s s pute the costs of the aligned letters (129.508 − 91.381 = symbol, one can scale each element in the joint probability − 38.127 bits) outweighs the cost of introducing four marker matrix by a constant factor 1 γ (where 0 <γ<1) and then symbols (4 × 5.644 = 22.576 bits) for the boundaries of the define the number of bits in the representation of a marker as =− block. α log2(γ)(Figure 4(d)). Now the body of the message is simply the representation of each symbol, encoded according to the modified probability matrix (see also Figure 4(c)): 4. EXPERIMENTAL RESULTS =− c(x) log2q(x), To evaluate the feasibility of aligning pairs of sequences by finding the minimum cost sequence expression, a simple c(y | x) =−log q(y | x), (6) 2 graph search algorithm was developed and implemented in a · =− c( ) log2(γ). program named realign. The algorithm creates a directed acyclic graph where nodes represent candidate blocks de- The preamble of a message encoded with the scaled represen- fined by equal-length substrings from each input sequence. tation is the same as the preamble for a tag-based message, Weights assigned to nodes represent the cost in bits of the except that the additional parameter is γ instead of α. corresponding block, and weights on edges connecting two Since the probability of each single letter is the marginal nodes are defined by the cost of a variable region for the probability summed over a row of the joint probability ma- characters between the two blocks. The minimum cost path trix, and each matrix entry was multiplied by a constant scale through the graph corresponds to the optimal alignment. factor, the single-letter probabilities are also scaled by this In one set of experiments, alignments produced by same amount:  realign were compared to pairwise alignments generated q(x) = (1 − γ)p(x, y) by CLUSTALW [33], one of the most widely used alignment y programs. In a second experiment, realign was used to  (7) align pairs of sequences from the BaliBase benchmark suite = (1 − γ) p(x, y) = (1 − γ)p(x). 
[34].
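A toy sketch of the scaling step, mirroring equations (6) and (7) with a hypothetical two-letter alphabet, also checks numerically that the conditional probabilities used inside blocks are unchanged, which is argued formally below.

```python
import math

def scale_matrix(p_joint, gamma):
    """Scale a joint letter-pair distribution by (1 - gamma).

    Returns the scaled joint probabilities q(x, y), the scaled marginals
    q(x), and the marker cost alpha = -log2(gamma).
    """
    q_joint = {pair: (1 - gamma) * p for pair, p in p_joint.items()}
    letters = {x for x, _ in p_joint}
    q_marg = {x: sum(q_joint[(x, y)] for y in letters) for x in letters}
    return q_joint, q_marg, -math.log2(gamma)

# Hypothetical two-letter joint distribution, just to exercise the identity.
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
q_joint, q_marg, alpha = scale_matrix(p, gamma=0.02)
p_marg = {x: sum(p[(x, y)] for y in "AB") for x in "AB"}
# q(y | x) equals p(y | x): the scale factors cancel.
assert abs(q_joint[("A", "B")] / q_marg["A"] - p[("A", "B")] / p_marg["A"]) < 1e-12
print(alpha)   # about 5.644 bits, the per-marker cost used in the Figure 5 example
```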

But note that conditional probabilities are not affected by 4.1. Plasmodium orthologs the scaling since the scale factors cancel out: An important concept in evolutionary biology is homology, q(x, y) (1 − γ)p(x, y) | = = defined to be similarity that derives from common ancestry. q(y x) − q(x) (1 γ)p(x) In molecular genetics, two genes in different organisms are (8) p(x, y) said to be orthologs if they are both derived from a single gene = = p(y | x). p(x) in the most recent common ancestor. In genome-scale computational experiments, a simple Recall from Section 3.2 that a pair of letters will be included strategy known as “reciprocal best hit” is often used to iden- in a block if there is a positive benefit from aligning them, tify pairs of orthologous genes. For each gene a from organ- that is, if c(y) − c(y | x) > 0. In the scaled representation, ism A,doaBLASTsearch[2] to find the gene b from or- this calculation compares a cost based on a scaled probabil- ganism B that is most similar to a. If a search in the other ity with a cost defined by an unscaled probability. Since the direction, using BLAST to find the gene most similar to b in John S. Conery 9

Figure 5 annotations. (a) Letters that will form the block, costed as c(x) + c(y): 129.508 bits; cost of the null hypothesis: 228.99 + 2α = 240.279 bits. (b) The same letters costed as c(x) + c(y | x): 91.381 bits; cost of the expression with one block: 64.272 + 91.381 + 35.211 + 6α = 224.728 bits.

Figure 5: Cost of alternative expressions for the example sequences using the PAM20 substitution matrix and γ = 0.02. The cost for each marker symbol is α = −log2 γ = 5.644 bits. (a) The cost for the null hypothesis is the sum of all the individual letter costs plus the cost of the two marker symbols. (b) When the letters in blue are aligned with one another, the costs of the letters in the second sequence are computed with conditional probabilities. This reduces the cost of the letters in the block by 129.508 − 91.381 = 38.127 bits. The transformed grammar has four additional markers, but the reduction in cost afforded by using the block outweighs the cost of the new markers (4 × 5.644 = 22.576 bits) so the expression with one block has a lower overall cost.
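As a quick check of the accounting in Figure 5, the following lines (all values copied from the caption and text) reproduce the comparison between the saving from conditional coding and the cost of the four extra markers.

```python
# Worked check of the Figure 5 numbers (values copied from the figure).
unaligned_bits = 129.508     # c(x) + c(y) for the letters that form the block
aligned_bits = 91.381        # c(x) + c(y | x) for the same letters
alpha = 5.644                # -log2(0.02), cost of one marker symbol

saving = unaligned_bits - aligned_bits   # 38.127 bits gained by the block
marker_cost = 4 * alpha                  # 22.576 bits for four new markers
assert saving > marker_cost              # the expression with the block wins
print(saving, marker_cost)
```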

(c) Proportion of each type of column (untrimmed / trimmed CLUSTALW alignments):
Aligned by both: 0.473 / 0.469
Aligned by neither: 0.147 / 0.258
CLUSTALW only: 0.38 / 0.267
Realign only: <0.001 / 0.006

Figure 6: Alignment of sequences MAL7P1.11 and Pv087705 from ApiDB [35]. (a) Comparison of CLUSTALW alignment (top two lines of text) and the regular expression alignment (bottom two lines). Background colors indicate whether the two algorithms agree. Green: columns aligned by both algorithms; blue: letters not aligned by both algorithms; white: letters aligned by CLUSTALW but appearing in variable regions in the regular expression; red: letters aligned in the regular expression but not by CLUSTALW. (b) Same as (a), but comparing the trimmed CLUSTALW alignment with regular expression alignment. The middle row of two lines shows the result of the alignment trimming algorithm; an asterisk identifies a column from the CLUSTALW alignment that was removed by “gap expansion.” (c) Proportion of each type of column averaged over all 3909 alignments.

organism A, reveals that a is most similar to b, then a and b ing BLAST to search for reciprocal best hits. Since P. falci- are most likely orthologs. parum diverged from P. vivax approximately 200 MYA [36], Once pairs of genes are identified as reciprocal best hits, all the alignments used the PAM20 substitution matrix. The a more detailed comparison is done using a global alignment realign alignments were made using the scaled representa- algorithm such as CLUSTALW [33]. To see how well the reg- tion for marker symbols with γ = 0.02 since insertion and ular expression-based alignment algorithm performs on real deletion events are relatively rare at this short evolutionary sequences, a series of alignments of orthologous genes made time scale. with realign were compared to the CLUSTALW alignments Figure 6 shows a detailed comparison of the alignments of the same genes. The complete set of genes from Plasmod- for one pair of genes (MAL7P1.11 and Pv087705). The top ium falciparum, the parasite that causes malaria, and a close two lines in Figure 6(a) are the alignment produced by relative known as Plasmodium vivax were downloaded from CLUSTALW, and the bottom two are the regular expression ApiDB, the model organism database for this family of or- alignment. To make it easier to compare the alignments, the ganisms [35]. A set of 3909 orthologs were identified by us- marker symbols have been deleted, and the letters in variable 10 EURASIP Journal on Bioinformatics and Systems Biology regions printed in italics to distinguish them from letters Sequences in BAliBASE are organized in a collection of in blocks. The four background colors indicate the level of different test sets. The sets were designed to provide differ- agreement between the two alignments: a pair can be aligned ent challenges to multiple alignment programs, for example, by both programs, aligned by neither, or aligned by one but all sequences in a test are equally distant, or sequences are in not the other. two distinct subgroups. Sequences in each set have known 3D Researchers often apply an “alignment trimming” algo- structures, and each test set was manually curated to iden- rithm to the output of an alignment algorithm to identify tify conserved core blocks within each multiple alignment. suspect columns in an alignment [37]. An example of a sus- The accuracy of an alignment algorithm can be assessed by pect column is the one shown in Figure 1 where an inser- comparing how it aligns amino acids in the core blocks. The tion occurred in the middle of a codon. Figure 6(b) shows comparisons reported here were made by aligning all pairs of the alignment of the Plasmodium genes after an alignment sequences in each test set. trimming operation [38] was applied to the CLUSTALW align- Figure 7 illustrates how the choice of a substitution ma- ments. The middle two lines in this figure show the results trix affects the accuracy of an alignment. The blocks in of the trimming application: an X indicates a letter that was Figure 7(b) are from an alignment based on PAM20, and the left in the alignment, and a  indicates a position that was blocks in Figure 7(c) are from the same pair of sequences originally aligned but has now been converted to a gap. In aligned with PAM250. 
Letters shown in blue are accurate this example, the alignment trimming algorithm agreed with pairings of letters in core blocks in the reference alignment, the regular expression alignment: columns that were previ- and letters in red are misaligned—either they are placed in ously shown as aligned (white background color) are now variable regions, or if they are in blocks, they are aligned with unaligned (blue). the wrong letter from the other sequence (e.g., the letters in Over all the 3909 pairs of sequences, the two alignment the block marked with (2)). The overall accuracy is higher for methods agreed on 62% of the letters (top two rows of the PAM250 alignment, which is not surprising since these Figure 6(c)). The disagreement was almost entirely due to the two sequences are only about 40% identical, and sequences fact that in 38% of the columns, the regular expression align- with this low level of similarity have probably diverged for ment was more conservative and placed characters in an un- much more than 200MY. aligned region when CLUSTALW aligned those same letters. The block marked with a (3) in Figure 7 is an example There are very few instances where realign put letters in of how a less strict substitution matrix leads to longer blocks. an aligned block and CLUSTALW did not. Applying the align- The letter pair Q and G are dissimilar in PAM20, and the block ment trimming algorithm increases the level of agreement: ends at this letter pair. But with PAM250, there is a slight ben- approximately one fourth of the columns originally consid- efit to aligning Q with G (c(G|Q)


Figure 7: Portions of alignments of sequences 1aho and 1bmr from the BAliBASE alignment benchmark (Release 3) [34]. (a) The reference alignment from BAliBASE. Letters in core blocks are highlighted in blue. (b) Alignment from realign, using PAM20 and γ = 0.2. (c) Same as (b) but using PAM250. In (b) and (c) lines starting with % are comments that show the degree of similarity of corresponding letters in the preceding block: identical (=), similar (+), or dissimilar (−). Sequence letters in blue are correctly aligned core blocks. Red letters are core block columns that should have been aligned but were left in variable regions. The circled numbers highlight changes in the alignment (see text).

of bits needed to encode a set of sequence expressions and encode the shortest sequence expression. Figure 8(b) shows the accuracy of the alignments. To make sure the alignment a plot of the change in compression as a function of γ,where algorithm had enough data to work with, the alignments there is a peak in the range 0.07 ≤ γ ≤ 1.0. Superimposed were done on the longest set of sequences in BAliBASE. There on this graph is a plot of the accuracy of the best alignment, are eight sequences in this test set (BB12007), ranging in also as a function of γ. The peak in this plot is an accuracy of length from 994 to 1084 letters, with a mean length of 1020 69%, at γ = 0.05. letters. 28 pairwise alignments were created, using all possi- The most accurate alignments, with a mean accuracy of ble pairs of sequences from the set. 80%, were created using the tagged representation and very Figure 8(a) shows that the number of bits required to small values of α between 1.25 and 1.75 bits (including the represent an alignment increases as γ increases. There is a tag bit). To obtain a comparable ratio between the cost of a very slight decrease in cost near γ = 0.02. At smaller values marker symbol and sequence letter in the scaled representa- − of γ the cost of representing a marker symbol ( log2γ)istoo tion γ would have to be around 0.25. But because the scaled high for the algorithm to include any blocks. Near γ = 0.02, representation requires the algorithm to compare letter prob- a few blocks are found and the overall cost is lowered. But abilities scaled by 1 − γ with unscaled conditional probabil- as γ increases, the cost of the sequence letters increases, since ities, the accuracy deteriorates with higher values of γ. This they are scaled by a factor of 1 − γ. There are typically far distortion might be the reason the peak in the accuracy curve more letter symbols than marker symbols in a sequence ex- does not correspond more closely to the peak in the compres- pression, and the increase in the size of each letter outweighs sion curve in Figure 8(b). any gain from a shorter representation for marker symbols. One could argue that for a given value of γ,itisnot 5. SUMMARY AND FUTURE WORK the total size of a sequence expression that is important, but rather the amount of compression that results from that This paper has shown that regular expressions provide use- value of γ, where compression is the difference in the number ful descriptions of alignments of pairs of sequences. The ex- of bits required to encode the null hypothesis (that the se- pressions are simple concatenations of alternating blocks and quences have nothing in common) and the number of bits to variable regions, where blocks are equal-length substrings 12 EURASIP Journal on Bioinformatics and Systems Biology


Figure 8: The effect of the scaling parameter γ on alignments of pairs of sequences from BAliBASE [34] test set BB12007. There are eight sequences in the set; the data points are based on averages over all (8 × 7)/2 = 28 pairs of sequences. (a) Mean cost (in bits) of alignments as a function of γ. (b) Mean compression (the difference between the cost of the null hypothesis and the lowest cost alignment for each pair of sequences) is indicated by open circles. The mean accuracy of the alignments (proportion of core blocks correctly aligned) is indicated by closed circles (scale shown on the right axis).

from each input sequence and variable regions are strings of the input sequences diverged. The substitution matrix is the unaligned characters. basis for computing the probability of aligning pairs of let- Alignment via regular expressions is an application of in- ters, and generally reflects the probability that one of the let- formation theory: a hypothetical sender constructs a regular ters changed via point mutation into the other letter. Marker expression that describes the sequences, compresses the ex- symbols typically denote block boundaries that are the result pression by encoding blocks with conditional probabilities, of insertion or deletion mutations, and for very diverse se- and transmits the encoded expression to a receiver, who can quences a smaller number of bits per marker reflect a higher recover the original sequences by generating every string that probability of an insertion or deletion. matches the expression. The only parameter that is required An alignment algorithm based on this approach can be is a substitution matrix, which sets the background proba- seen as a process that begins with a default null hypothesis bilities for unaligned letters and the conditional probabilities that the sequences are unrelated, represented by an expres- for pairs of aligned letters. For greater flexibility, an optional sion that has all characters in a single unaligned region. The second parameter specifies the number of bits to use for the algorithm searches for candidate blocks, consisting of equal- marker symbols that denote block boundaries. This informa- length substrings from each input sequence, and checks to tion theoretic framework does not use gaps to align variable- see if the encoding of an expression that includes a block is length sequences—instead a global alignment of sequences shorter than the encoding without the block. The tradeoff of differentlengthwillhaveatleastonevariableregionwith that must be taken into account is that blocks of similar let- adifferent number of letters from the input sequences—and ters will have denser encodings due to the use of conditional thus finesses issues associated with gap penalties. probabilities, but adding a block means increasing the num- Accurate alignment of biological sequences needs to take ber of marker symbols that denote the edges of blocks. into account the amount of time the sequences have been A comparison of this new method with CLUSTALW, changing since they diverged from their most recent com- a widely used standard for sequence alignment, shows mon ancestor. The two parameters that affect the encod- that the regular expression alignments generally agree with ing of regular expressions—the choice of substitution matrix CLUSTALW on regions included in blocks in the regular ex- and the number of bits to use for marker symbols—are re- pression. Approximately, three quarters of the characters left lated to the two main types of mutations that can occur since unaligned in a regular expression are aligned by CLUSTALW, John S. Conery 13 but that number drops to one half if the CLUSTALW align- ing PROSITE or other predefined collections of patterns is ments are treated with an “alignment trimming” algorithm that blocks can be encoded in fewer bits. Where the pattern to remove ambiguous regions. 
A more detailed case-by-case specifies one of a small set of k letters, only log2k bits are analysis would be required to determine if the remaining un- required to encode one of these letters, assuming they are aligned characters should remain unaligned (i.e., alignment equally probable in this context. In particular, constants in trimming should be more ambitious) or if they need to be the pattern require zero bits, since the receiver knows these aligned (i.e., the regular expression approach is not aligning letters as soon as the pattern is specified. A second benefit is some characters that should be aligned). that PROSITE blocks allow the expression to describe small A second set of experiments compared the output of the amounts of variability in the length of a region without in- regular expression method with known reference alignments troducing a new variable region. Of course these benefits are from the BAliBASE alignment benchmark. Since the bench- offset by the additional complexity of an encoding that allows mark is designed to test multiple alignment algorithms, and for rule names and parameter delimiters. it is generally accepted that multiple alignment is more ac- As the last example shows, regular expressions and gram- curate than simple pairwise alignment [28], it is not possible mars are very flexible, with many different rule structures to say whether the regular expression approach is as accurate able to describe the same set of sequences. The different rule as recent multiple alignment methods, but the overall accu- structures convey different information about the strings racy of over 80% for sequences with 20% to 40% identity is generated by the grammars, and the goal will be to see if min- encouraging. imum description length encoding of these alternative struc- One direction for future research is to try to automati- tures and selection of the shortest encoding accurately pro- cally determine, for each substitution matrix, the best value vides the best description of the relationships between the for α or γ, the parameters that determine the number of bits sequences. per marker symbol. Based on extensive investigation (e.g., ff [39]) of di erent combinations of substitution matrix and ACKNOWLEDGMENTS other parameters BLAST, CLUSTALW, and other applications set default values for gap penalties based on the choice of sub- The anonymous reviewers made several valuable comments. stitution matrix. A similar analysis, perhaps based on inser- The indexed representation for marker symbols was sug- tion and deletion mutation rates, might be used to match a gested by one of the reviewers, and the scaled representation substitution matrix with a setting of α or γ for regular ex- is due to Peter Grunwald.¨ The author gratefully acknowl- pression alignments. edges support by grants from the National Science Foun- A second direction for future research is to expand the dation (MCB-0342431), the National Institutes of Health method to perform multiple alignment of more than two (5R01RR020833-02), and E.T.S. Walton Visitors Award from sequences. One approach would be to use pairwise local Science Foundation Ireland. alignments produced by realign as “anchors” for DIALIGN [22, 23], a progressive multiple alignment program that joins REFERENCES consistent sets of ungapped local alignments into a com- plete multiple alignment. A different approach would align [1] E. W. 
Myers, “The fragment assembly string graph,” Bioinfor- all the sequences at the same time, using sum-of-pairs or matics, vol. 21, suppl. 2, pp. ii79–ii85, 2005. some other method to average conditional costs based on [2]S.F.Altschul,T.L.Madden,A.A.Schaffer, et al., “Gapped each of the n × (n − 1)/2 pairs of sequences. BLAST and PSI-BLAST: a new generation of protein database A third direction for future research is to extend the search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. canonical sequence expressions or the equivalent grammar 3389–3402, 1997. to include other forms of descriptions of regions of similarity. [3] A. J. Phillips, “Homology assessment and molecular sequence OneideaistousePROSITEblocks[40] as “subroutines” that alignment,” Journal of Biomedical Informatics, vol. 39, no. 1, can be embedded in blocks. For example, PROSITE block pp. 18–33, 2006. PS00007 is [RK]-x(2, 3)-[DE]-x(2, 3)-Y, using a notation sim- [4] J. O. Wrabl and N. V. Grishin, “Gaps in structurally similar ilar to a regular expression where a string in brackets means proteins: towards improvement of multiple sequence align- ment,” Proteins, vol. 54, no. 1, pp. 71–87, 2004. “any one of these letters” and x(2, 3) means “any sequence between 2 and 3 letters long.” A string that matches this pat- [5] K. Sjolander,¨ “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics, vol. 20, tern, RDIKDPEY, occurs in one of the Plasmodium sequences no. 2, pp. 170–179, 2004. discussed in Section 4.1. A block for the region containing [6]B.-J.M.Webb,J.S.Liu,andC.E.Lawrence,“BALSA:Bayesian this pattern might include a reference to the PROSITE block, algorithm for local sequence alignment,” Nucleic Acids Re- for example, instead of search, vol. 30, no. 5, pp. 1268–1277, 2002. #DLLRDIKDPEYSYT (9) [7] J. Rissanen, “Modelling by the shortest data description,” Au- tomatica, vol. 14, no. 5, pp. 465–471, 1978. the block would be something like [8] P. Grunwald,¨ “A minimum description length approach to #DLL ps00007 (R, DIK, D, PE) SYT, (10) grammar inference,” in Connectionist, Statistical, and Sym- bolic Approaches to Learning for Natural Language Processing, where the arguments to the procedure call are pieces of vol. 1040 of Lecture Notes in Computer Science, pp. 203–216, the sequence to plug in to the pattern. A benefit from us- Springer, Berlin, Germany, 1996. 14 EURASIP Journal on Bioinformatics and Systems Biology

[9] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen, “Pattern dis- [28] D. W. Mount, Bioinformatics: Sequence and Genome Analysis, covery in biosequences,” in International Conference on Gram- Cold Spring Harbor Laboratory Press, New York, NY, USA, mar Inference (ICGI ’98),V.HonavarandG.Slutski,Eds., 2nd edition, 2004. vol. 1433 of Lecture Notes in Artificial Intelligence, pp. 257–270, [29] S. Henikoff andJ.G.Henikoff, “Amino acid substitution Springer, Ames, Iowa, USA, 1998. matrices from protein blocks,” Proceedings of the National [10] L. Cai, R. L. Malmberg, and Y. Wu, “Stochastic modeling Academy of Sciences of the United States of America, vol. 89, of RNA pseudoknotted structures: a grammatical approach,” no. 22, pp. 10915–10919, 1992. Bioinformatics, vol. 19, suppl. 1, pp. i66–i73, 2003. [30] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive [11] D. B. Searls, “The computational linguistics of biological matching of the entire protein sequence database,” Science, sequences,” in Artificial Intelligence and Molecular Biology, vol. 256, no. 5062, pp. 1443–1445, 1992. pp. 47–120, American Association for Artificial Intelligence, [31] S. Karlin and S. F. Altschul, “Methods for assessing the statis- Menlo Park, Calif, USA, 1993. tical significance of molecular sequence features by using gen- [12] D. Bsearls, “Linguistic approaches to biological sequences,” eral scoring schemes,” Proceedings of the National Academy of Computer Applications in the Biosciences, vol. 13, no. 4, pp. Sciences of the United States of America, vol. 87, no. 6, pp. 2264– 333–344, 1997. 2268, 1990. [13] A. Bairoch, “PROSITE: a dictionary of sites and patterns in [32] S. R. Eddy, “Where did the BLOSUM62 alignment score ma- proteins,” Nucleic Acids Research, vol. 20, pp. 2013–2018, 1992. trix come from?” Nature Biotechnology,vol.22,no.8,pp. [14] M. Vingron and M. S. Waterman, “Sequence alignment and 1035–1036, 2004. penalty choice. Review of concepts, case studies and implica- [33] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL tions,” Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, W: improving the sensitivity of progressive multiple sequence 1994. alignment through sequence weighting, position-specific gap [15] S. Henikoff, “Scores for sequence searches and alignments,” penalties and weight matrix choice,” Nucleic Acids Research, Current Opinion in Structural Biology, vol. 6, no. 3, pp. 353– vol. 22, no. 22, pp. 4673–4680, 1994. 360, 1996. [34] J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive [16] G. Giribet and W. C. Wheeler, “On gaps,” Molecular Phyloge- comparison of multiple sequence alignment programs,” Nu- netics and Evolution, vol. 13, no. 1, pp. 132–143, 1999. cleic Acids Research, vol. 27, no. 13, pp. 2682–2690, 1999. [17] Y. Nozaki and M. Bellgard, “Statistical evaluation and compar- [35] C. Aurrecoechea, M. Heiges, H. Wang, et al., “ApiDB: inte- ison of a pairwise alignment algorithm that a priori assigns the grated resources for the apicomplexan bioinformatics resource number of gaps rather than employing gap penalties,” Bioin- center,” Nucleic Acids Research, vol. 35, pp. D427–D430, 2007. formatics, vol. 21, no. 8, pp. 1421–1428, 2005. [36] R. Carter, “Speculations on the origins of Plasmodium vivax [18] J. T. Reese and W. R. Pearson, “Empirical determination of ef- malaria,” Trends in Parasitology, vol. 19, no. 5, pp. 214–219, fective gap penalties for sequence comparison,” Bioinformatics, 2003. vol. 18, no. 11, pp. 1500–1507, 2002. [37] M. Cline, R. Hughey, and K. 
Karplus, “Predicting reliable re- [19] L. Allison, C. S. Wallace, and C. N. Yee, “Finite-state models in gions in protein sequence alignments,” Bioinformatics, vol. 18, the alignment of macromolecules,” JournalofMolecularEvo- no. 2, pp. 306–314, 2002. lution, vol. 35, no. 1, pp. 77–89, 1992. [38] J. S. Conery and M. Lynch, “Nucleotide substitutions and the [20] J. P. Schmidt, “An information theoretic view of gapped and evolution of duplicate genes,” in Proceedings of the 6th Pacific other alignments,” in Proceedings of the 3rd Pacific Symposium Symposium on Biocomputing (PSB ’01), pp. 167–178, Big Is- on Biocomputing (PSB ’98), pp. 561–572, Maui, Hawaii, USA, land of Hawaii, Hawaii, USA, January 2001. January 1998. [39] W. R. Pearson, “Comparison of methods for searching protein [21] T. Aynechi and I. D. Kuntz, “An information theoretic ap- sequence databases,” Protein Science, vol. 4, no. 6, pp. 1145– proach to macromolecular modeling: I. Sequence alignments,” 1160, 1995. Biophysical Journal, vol. 89, no. 5, pp. 2998–3007, 2005. [40] N. Hulo, A. Bairoch, V. Bulliard, et al., “The PROSITE [22] B. Morgenstern, “DIALIGN 2: improvement of the segment- database,” Nucleic Acids Research, vol. 34, pp. D227–D230, to-segment approach to multiple sequence alignment,” Bioin- 2006. formatics, vol. 15, no. 3, pp. 211–218, 1999. [23] M. Brudno, M. Chapman, B. Gottgens,¨ S. Batzoglou, and B. Morgenstern, “Fast and sensitive multiple alignment of large genomic sequences,” BMC Bioinformatics, vol. 4, p. 66, 2003. [24] T. D. Schneider, “Information content of individual genetic sequences,” Journal of Theoretical Biology, vol. 189, no. 4, pp. 427–441, 1997. [25] N. Krasnogor and D. A. Pelta, “Measuring the similarity of protein structures by means of the universal similarity metric,” Bioinformatics, vol. 20, no. 7, pp. 1015–1021, 2004. [26] J. S. Conery, “Realign: grammar-based sequence alignment,” University of Oregon, http://teleost.cs.uoregon.edu/realign. [27] M. O. Dayhoff,R.M.Schwartz,andB.C.Orcutt,“Amodelof evolutionary change in proteins,” in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345–352, Washington, DC, USA, 1978. Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 43670, 16 pages doi:10.1155/2007/43670

Research Article MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress

Scott C. Evans,1 Antonis Kourtidis,2 T. Stephen Markham,1 Jonathan Miller,3 Douglas S. Conklin,2 and Andrew S. Torres1

1 GE Global Research, One Research Circle, Niskayuna, NY 12309, USA 2 Gen*NY*Sis Center for Excellence in Cancer Genomics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144, USA 3 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA

Received 1 March 2007; Revised 12 June 2007; Accepted 23 June 2007

Recommended by Peter Grünwald

We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequence without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.

Copyright © 2007 General Electric Company. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION though it is believed that all information about a miRNA’s targets is encoded in its sequence, attempts to identify targets The discovery of RNA interference (RNAi) [1]andcertain by informatics methods have met with limited success, and of its endogenous mediators, the microRNAs (miRNAs), has the requirements on a target site for a miRNA to regulate a catalyzed a revolution in biology and medicine [2, 3]. MiR- cognate mRNA are not fully understood. To date, over 500 NAs are transcribed as long (∼1000 nt) “pri-miRNAs,” cut distinct miRNAs have been discovered in humans, and esti- into small (∼70 nt) stem-loop “precursors,” exported into mates of the total number of human miRNAs range well into the cytoplasm of cells, and processed into short (∼20 nt) the thousands. Complex algorithms to predict which specific single-stranded RNAs, which interact with multiple proteins genes these miRNAs regulate often yield dozens or hundreds to form a superstructure known as the RNA-induced silenc- of distinct potential targets for each miRNA [4–6]. Because ing complex (RISC). The RISC binds to sequences in the of the technical difficulty of testing, all potential targets of a 3untranslated region (3UTR) of mature messenger RNA single miRNA, there are few, if any, miRNAs whose activities (mRNA) that are partially complementary to the miRNA. have been thoroughly characterized in mammalian cells. This Binding of the RISC to a target mRNA induces inhibition problem is of singular importance because of evidence sug- of protein translation by either (i) inducing cleavage of the gesting links between miRNA expression and human disease, mRNA or (ii) blocking translation of the mRNA. MiRNAs for example chronic lymphocytic leukemia and lung cancer therefore represent a nonclassical mechanism for regulation [7, 8]; however, the genes affected by these changes in miRNA of gene expression. expression remain unknown. MiRNAs can be potent mediators of gene expression, and MiRNA genes themselves were opaque to standard in- this fact has lead to large-scale searches for the full com- formatics methods for decades in part because they are plement of miRNAs and the genes that they regulate. Al- primarily localized to regions of the genome that do not 2 EURASIP Journal on Bioinformatics and Systems Biology

Figure 1 flowchart: Start with initial sequence → Check descendents for best SCR grammar rule → λ < 1? Gain > Gmin? → Yes: Update codebook, array (repeat); No: Encode, done.

Figure 1 example: input sequence GAAGTGCAGTGAAGTGCAGTGTCAGTGCT, segmented as GA AGTG CAGTGAAGTG CAGTGTC AGTG CT. Candidate phrases: length 10, GAAGTGCAGT, locations 1, 11, 2 repeats; length 4, AGTG, locations 3, 8, 13, 18, 24, 5 repeats (best OSCR phrase). The accompanying plot shows SCR for a length-2 symbol repeated L/2 times and for a maximum-length symbol repeated 2 times, as a function of symbol length.

Figure 1: The OSCR algorithm. Phrases that recursively contribute most to sequence compression are added to the model first. The motif AGTG is the first selected and added to OSCR’s MDL model. A longest match algorithm would not call out this motif. code for protein. Informatics techniques designed to iden- grammar-based codes do not achieve the compression of tify protein-coding sequences, transcription factors, or other DNACompress [19](see[20] for a comparison and addi- known classes of sequence did not resolve the distinctive sig- tional approach using dynamic programming), the structure natures of miRNA hairpin loops or their target sites in the of these algorithms is attractive for identifying biologically 3UTRs of protein-coding genes. In this sense, apart from meaningful phrases. The compression achieved by our algo- comparative genomics, sequence analysis methods tend to be rithm exceeds that of DNA Sequitur while retaining a two- best at identifying classes of sequence whose biological signif- part code that highlights biologically significant phrases. Dif- icance is already known. ferences between MDLcompress and GREEDY will be dis- Minimum description length (MDL) principles [9]of- cussed later. The deep recursion of our approach combined fer a general approach to de novo identification of biologi- with its two-part coding makes our algorithm uniquely able cally meaningful sequence information with a minimum of to identify biologically meaningful sequence de novo with a assumptions, biases, or prejudices. Their advantage is that minimal set of assumptions. In processing a gene transcript, they address explicitly the cost capability for data analysis we selectively identify sequences that are (i) short but oc- without over fitting. The challenge of incorporating MDL cur frequently (e.g., codons, each 3 nucleotides) and (ii) se- into sequence analysis lies in (a) quantification of appropri- quences that are relatively long but occur only a small num- ate model costs and (b) tractable computation of model in- ber of times (e.g., miRNA target sites, each ∼20 nucleotides ference. A grammar inference algorithm that infers a two- or more). An example is shown in Figure 1, where given part minimum description length code was introduced in the input sequence shown, OSCR highlights the short motif [10], applied to the problem of information security in [11] AGTG that occurs five times, over a longer sequence that oc- and to miRNA target detection in [12]. This optimal symbol curs only twice. Other model inference strategies would by- compression ratio (OSCR) algorithm produces “meaningful pass by this short motif. models” in an MDL sense while achieving a combination of In this paper, we describe initial results of miRNA anal- model and data whose descriptive size together represents an ysis using OSCR and introduce improvements to OSCR that estimate of the Kolmogorov complexity of the dataset [13]. reduce execution time and enhance its capacity to iden- We anticipate that this capacity for capturing the regularity tify biologically meaningful sequence. These modifications, of a data set within compact, meaningful models will have some of which were first introduced in [21], retain the deep wide application to DNA sequence analysis. recursion of the original algorithm but exploit novel data MDL principles were successfully applied to segment structures that make more efficient use of time and mem- DNA into coding, noncoding, and other regions in [14]. 
ory by gathering phrase statistics in a single pass and subse- The normalized maximum likelihood model (an MDL al- quently selecting multiple codebook phrases. Our data struc- gorithm) [15] was used to derive a regression that also ture incorporates candidate phrase frequency information achieves near state-of-the-art compression. Further MDL- and pointers identifying location of candidate phrases in related approaches include the “greedy offline”—GREEDY— the sequence, enabling efficient computation. MDL model algorithm [16] and DNA Sequitur [17, 18]. While these inference refinement is achieved by improving heuristics, Scott C. Evans et al. 3

Figure 2 contents: three nested set models for the 128-bit string. The set of all 128-bit strings has 2^128 ≈ 3.4 × 10^38 elements; the set of 128-bit strings with 64 ones has about 2^124 elements; the set of 128-bit strings alternating 1 and 0 has two elements, 1010···10 and 0101···01.

Figure 2: Two-part representations of a 128-bit string. As the length of the model increases, the size of the set including the target string decreases. harnessing redundancies associated with palindrome data, Asdiscussedin[22], an MDL decomposition of a binary and taking advantage of local sequence similarity. Since it string x considering finite set models can be separated into now employs a suite of heuristics and MDL compression two parts, methods, including but not limited to the original symbol =+ | | compression ratio (SCR) measure, we refer to this improved Kϕ(x) K(S) + log2 S ,(2) algorithm as MDLcompress, reflecting its ability to apply MDL principles to infer grammar models through multiple where again Kϕ(x) is the Kolmogorov complexity for string x heuristics. on universal computer ϕ. S represents a finite set of which x We hypothesized that MDL models could discover bio- is a typical (equally likely) element. The minimum possible logically meaningful phrases within genes, and after sum- sum of descriptive cost for set S (the model cost encompass- marizing briefly our previous work with OSCR, we present ing all regularity in the string) and the log of the sets cardi- here the outcome of an MDLcompress analysis of 144 genes nality (the required cost to enumerate the equally likely set overexpressed in the breast cancer cell line, BT474. Our algo- elements) correspond to an MDL two-part description for rithm has identified novel motifs including potential miRNA string x, a model portion that describes all redundancy in the binding sites that are being considered for in vitro validation string, and a data portion that uses the model to define the studies. We further introduce a “bits per nucleotide” MDL specific string. Figure 2 shows how these concepts are mani- weighting from MDLcompress models and their inherent bi- fest in three two-part representations of the 128 binary string ··· ologically meaningful phrases. Using this weighting, “suscep- 101010 10. In this representation, the model is defined in tible” areas of sequence can be identified where an SNP dis- English language text that defines a set, and the log2 of the proportionately affects MDL cost, indicating an atypical and number of elements in the defined set is the data portion potentially pathological change in genomic information con- of the description. One representation would be to identify tent. this string by an index of all possible 128-bit strings. This in- volves a very small model description, but a data description of 128 bits, so no compression of descriptive cost is achieved. 2. MINIMUM DESCRIPTION LENGTH (MDL) A second possibility is to use additional model description to PRINCIPLES AND KOLMOGOROV COMPLEXITY restrict the set size to contain only strings with equal num- MDL is deeply related to Kolmogorov complexity, a measure ber of ones and zeros, which reduces the cardinality of the set of descriptive complexity contained in an object. It refers to by a few bits. A more promising approach will use still more the minimum length l of a program such that a universal model description to identify the set of alternating pattern of computer can generate a specific sequence [13]. Kolmogorov ones and zeros that could contain only two strings. 
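A small sketch, assuming the three finite-set models of Figure 2, computes only the data term log2 |S| of each two-part description; the model cost K(S) would be added to each before comparing totals.

```python
import math

# Data term log2|S| for the three finite-set models of the string "10" * 64.
n = 128
data_bits = {
    "all 128-bit strings": float(n),
    "128-bit strings with 64 ones": math.log2(math.comb(128, 64)),
    "128-bit strings alternating 1 and 0": math.log2(2),
}
for model, bits in data_bits.items():
    print(f"{model}: log2|S| = {bits:.1f} bits")   # about 128.0, 124.2, 1.0
```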
Among complexity can be described as follows, where ϕ represents a all possible two-part descriptions of this string the combina- universal computer, p represents a program, and x represents tion that minimizes the two-part descriptive cost is the MDL a string: description. ff This example points out a major di erence between Shannon entropy and Kolmogorov complexity. The first- Kϕ(x) = min l(p) . (1) ϕ(p)=x order empirical entropy of the string 101010 ···10 is very 4 EURASIP Journal on Bioinformatics and Systems Biology

minimum description length code and an estimate of the algorithmic minimum sufficient statistic [10, 11]. OSCR produces "meaningful models" in an MDL sense, while achieving a combination of model plus data whose descriptive size together estimate the Kolmogorov complexity of the data set. OSCR's capability for capturing the regularity of a data set into compact, meaningful models has wide application for sequence analysis. The deep recursion of our approach combined with its two-part coding nature makes our algorithm uniquely able to identify meaningful sequences without limiting assumptions.

The entropy of a distribution of symbols defines the average per symbol compression bound in bits per symbol for a prefix free code. Huffman coding and other strategies can

∗ produce an instantaneous code approaching the entropy in k K(x) n k the limit of infinite message length when the distribution is (bits) known. In the absence of knowledge of the model, one way Figure 3: This figure shows the Kolmogorov structure function. As to proceed is to measure the empirical entropy of the string. the model size (k) is allowed to increase, the size of the set (n) in- However, empirical entropy is a function of the partition and cluding string x with an equally likely probability decreases. k∗ in- depends on what substrings are grouped together to be con- dicates the value of the Kolmogorov minimum sufficient statistic. sidered symbols. Our goal is to optimize the partition (the number of symbols, their length, and distribution) of a string such that the compression bound for an instantaneous code, high, since the numbers of ones and zeros are equal. How- (the total number of encoded symbols R time entropy Hs) ever, intuitively the regularity of the string makes it seem plus the codebook size is minimized. We define the approx- strange to call it random. By considering the model cost, as imate model descriptive cost M to be the sum of the lengths well as the data costs of a string, MDL theory provides a for- of unique symbols, and total descriptive cost Dp as follows: mal methodology that justifies objectively classifying a string as something other than a member of the set of all 128 bit M ≡ li, Dp ≡ M + R · Hs. (4) binary. These concepts can be extended beyond the class of i models that can be constructed using finite sets to all com- While not exact (symbol delimiting “comma costs” are ig- putable functions [22]. nored in the model, while possible redundancy advantages The size of the model (the number of bits allocated to are not considered either), these definitions provide an ap- spelling out the members of set S) is related to the Kol- proximate means of breaking out MDL costs on a per symbol mogorov structure function, (see [23]). defines the small- basis. The analysis that follows can easily be adapted to other est set, S, that can be described in at most k bits and contains model cost assumptions. a given string x of length n, n | = | | 2.1. Symbol compression ratio k x n min log2 S . (3) p:l(p)

Figure 4: SCR versus symbol length (bits) for a 1024-bit string, plotted for various numbers of repeats (10, 20, 40, and 60).

Figure 5: OSCR example. Statistics for the string x = "a rose is a rose is a rose" (26 characters; terminal frequencies a: 3, space: 7, r: 3, o: 3, s: 5, e: 3, i: 2). SCR is computed from phrase length l and frequency r; for example, l = 2, r = 3 gives R = 26 − 3 = 23 and SCR = 1.023; l = 6, r = 3 gives R = 26 − 3(5) = 11 and SCR = 0.5; l = 7, r = 2 gives R = 26 − 2(6) = 14 and SCR = 0.7143.
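Before formalizing the symbol compression ratio, it may help to see the cost model of (4) in action. The sketch below is our own illustration, not code from the paper: it scores a candidate partition by the simplified M + R·H_s of (4), ignoring the comma costs noted above, for the Figure 5 string with and without the phrase "a rose" in the codebook.

```python
import math
from collections import Counter

def dp_cost(symbols):
    """Total descriptive cost D_p = M + R * H_s for a given partition (eq. (4)).
    M is the summed length of the unique symbols; H_s is the empirical entropy
    of the symbol stream; R is the number of encoded symbols."""
    counts = Counter(symbols)
    R = len(symbols)
    H_s = -sum((c / R) * math.log2(c / R) for c in counts.values())
    M = sum(len(s) for s in counts)   # model cost: spell out each unique symbol
    return M + R * H_s

text = "a rose is a rose is a rose"

# Partition 1: individual characters only (no model phrases).
chars = list(text)

# Partition 2: treat "a rose" as a single symbol, characters elsewhere.
phrase = "a rose"
parts = text.split(phrase)
with_phrase = []
for i, chunk in enumerate(parts):
    with_phrase.extend(chunk)
    if i < len(parts) - 1:
        with_phrase.append(phrase)

print("characters only :", round(dp_cost(chars), 1), "bits")
print("with 'a rose'   :", round(dp_cost(with_phrase), 1), "bits")
```

With the phrase in the codebook the symbol stream is shorter but the model is larger; whether D_p drops is exactly the trade-off the SCR heuristic below is designed to capture.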

…enables a per-symbol formulation for D_p and results in a conservative approximation for R log2(R) over the likely range of R. The per-symbol descriptive cost can now be formulated:

d_i = r_i (log2 R − log2 r_i) + l_i.  (7)

Thus, we have a heuristic that conservatively estimates the descriptive cost of any possible symbol in a string, considering both model and data (entropy) costs. A measure of the compression ratio for a particular symbol is simply the descriptive length of the string divided by the length of the string "covered" by this symbol. We define the symbol compression ratio (SCR) as

λ_i = d_i / L_i = (r_i (log2 R − log2 r_i) + l_i) / (l_i r_i).  (8)

This heuristic describes the "compression work" a candidate symbol will perform in a possible partition of a string. Examining SCR in Figure 4, it is clear that a good symbol compression ratio arises, in general, when symbols are long and repeated often. But clearly, selection of some symbols as part of the partition is preferred to others. Figure 4 shows how the symbol compression ratio varies with the length of symbols and the number of repetitions for a 1024-bit string.

3. OSCR ALGORITHM

The optimal symbol compression ratio (OSCR) algorithm forms a partition of string S into symbols that have the best symbol compression ratio (SCR) among the possible symbols contained in S. The algorithm is as follows.

(1) Starting with an initial alphabet, form a list of substrings contained in S, possibly with user-defined constraints on minimum frequency and/or maximum length, and note the frequency of each substring.
(2) Calculate the SCR for all substrings. Select the substring from this set with the smallest SCR and add it to the model M.
(3) Replace all occurrences of the newly added substring with a unique character.
(4) Repeat steps 1 through 3 until no suitable substrings are found.
(5) When a full partition has been constructed, use Huffman coding or another coding strategy to encode the distribution, p, of symbols.

The following comments apply.

(1) This algorithm progressively adds to the code space the symbols that do the most compression "work" among all the candidates. Replacement of these symbols leftmost-first will alter the frequency of the remaining symbols.
(2) A less exhaustive search for the optimal SCR candidate is possible by concentrating on the tree branches that dominate the string or by searching only certain phrase sizes.
(3) The initial alphabet of terminals is user supplied.

3.1. Example

Consider the phrase "a rose is a rose is a rose" with ASCII characters as the initial alphabet. The initial tree statistics and λ calculations provide the metrics shown in Figure 5. The numbers across the top indicate the frequency of each symbol, while the numbers along the left indicate the frequency of phrases. Here we see that the initial string consists of seven terminals {a, ' ', r, o, s, e, i}. Expanding the tree with substrings beginning with the terminal a shows that there are 3 occurrences of the substrings

"a", "a ", "a r", "a ro", "a ros", "a rose",  (9)

but only 2 occurrences of longer substrings, for each of which the λ values consequently increase, leaving the phrase {a rose} as the candidate with the smallest λ. Here we see the unique nature of the λ heuristic, which does not choose necessarily

Grammar Model (set) tic, and thus has more stringent separation of model and data costs and more specific model cost calcula- S1 a rose a rose f (S ) = 1 S1 1 tions resulting in greater specificity. S2 is S1 S2 is S1 f (S2) = 2 S S1S2S2 (3) As described in [21] and will be discussed in later sections, the computational architecture of MDLcom- Equally likely musings: press differs from the suffix tree with counts architec- ⎧ ⎫ ⎧ ⎫ ⎪ ⎪ ⎪ ⎪ ⎨S1S2S2⎬ ⎨ a rose is a rose is a rose ⎬ ture of GREEDY. Specifically, MDLcompress gathers TypicalSet = S S S = is a rose a rose is a rose statistics in a single pass and then updates the data ⎩⎪ 2 1 2⎭⎪ ⎩⎪ ⎭⎪ S2S2S1 is a rose is a rose a rose structure and statistics after selecting each phrase as ffi Figure 6: OSCR grammar example model summary. opposed GREEDY’s practice of reforming the su x tree with counts data structure at each iteration. Another comparable grammar-based code is Sequitur, a linear time grammar inference algorithm [17, 18]. In this pa- the most frequently repeating symbol, or the longest match per, we show MDLcompress to exceed Sequitur’s ability to but rather a combination of length and redundancy. A sec- compress. However, it does not match Sequitur’s linear run ond iteration of the algorithm produces the model described time performance. in Figure 6. Our grammar rules enable the construction of a typical set of strings where each phrase has frequency shown the model block of Figure 6. One can think of MDL prin- 4. MIRNA TARGET DETECTION USING OSCR ciples applied in this way as analogous to the problem of finding an optimal compression code for a given dataset x with the added constraint that the descriptive cost of the In [12], we described our initial application of the OSCR al- codebook must also be considered. Thus, the cost of send- gorithm to the identification of miRNA target sites. We se- lected a family of genes from Drosophila (fruit fly) that con- ing “priors” (a codebook or other modeling information) is considered in the total descriptive cost in addition to tain in their 3 UTRs conserved sequence structures previ- the descriptive cost of the final compressed data given the ously described by Lai [24]. These authors observed that model. a highly-conserved 8-nucleotide sequence motif, known as = 5 cUGUGAUa 3; antisense = 5 uAU- The challenge of incorporating MDL in sequence analy- a K-box (sense CACAg) and located in the 3UTRs of Brd and bHLH gene sis lies in the quantification of appropriate model costs and families, exhibited strong complementarity to several fly tractable computation of model inference. Hence, OSCR has miRNAs, among them miR-11. These motifs exhibited a role been improved and optimized through additional heuristics in posttranscriptional regulation that was at the time unex- and a streamlined architecture and renamed MDLcompress, plained. which will be described in detail in later sections. MDLcom- The OSCR algorithm constructed a phrasebook consist- press forms an estimate of the strings algorithmic minimum ing of nine motifs, listed in Figure 7 (top) to optimally par- sufficient statistic by adding bits to the model until no ad- tition the adjacent set of sequences, in which the motifs ditional compression can be realized. MDLcompress retains are color coded. The OSCR algorithm correctly identified the deep recursion of the original algorithm but improve the most redundant antisense sequence (AUCACA) from the speed and memory use through novel data structures that several examples it was presented. 
allow gathering of phrase statistics in a single pass and subse- The input data for this analysis consists of 19 sequences, quent selection of multiple codebook phrases with minimal each 18 nucleotides in length (Figure 7). From these se- computation. quences, OSCR generated a model consisting of grammar MDLcompress and OSCR are not alone in the grammar “variables” through that map to individual nucleotides inference domain. GREEDY, developed by Apostolico and S1 S4 (grammar “terminals”), the variable that maps to the nu- Lonardi [16], is similar to MDLcompress and OSCR, but dif- S5 cleotide sequence, AUCACA, and four shorter motifs – . fer in three major areas. S6 S9 The phrase S5 turns out to be a putative target of several dif- (1) MDLcompress is deeply recursive in that the algorithm ferent miRNAs, including miR-2a, miR-2b, miR-6, miR-13a, does not remove phrases from consideration for com- miR-13b, and miR-11. OSCR identified as S9 a2nucleotide pression after they have been added to the model. The sequence (5 GU 3) that is located immediately downstream “loss of compressibility” inherent in adding a phrase of the K-box motif. The new consensus sequence would read to the model was one of the motivations of developing 5 AUCACAGU 3 and has a greater degree of homology the SCR heuristic—preventing a “too greedy” absorp- to miR-6 and miR-11 than to other D. melanogaster miR- tion of phases from preventing optimal total compres- NAs. In vivo studies performed subsequent to the original sion. With MDLcompress, since we look in the model Lai paper demonstrated the specificity of miR-11 activity as well for phrases to compress, we find that generally on the Bob-A,B,C, E(spl)ma, E(spl)m4, and E(spl)md genes the total compression heuristic at each phase gives the [25]. best performance as will be discussed later. In a separate analysis, we applied OSCR to the sequence (2) MDLcompress was designed with the express intent of of an individual fruit fly gene transcript, BobA (accession estimating the algorithmic minimum sufficient statis- NM 080348; Figure 7, bottom). Only the BobA transcript Scott C. Evans et al. 7

OSCR analysis of Brd family and bHlH repressor GGUCACAUCACAGAUACU • Motif: AUCACA first phrase added S1 G CUCGUCAUCACAGUUGGA CGAUUAAUCACAAUGAGU • GUU second phrase added S2 U UCCUCGAUCACAGUUGGA • CU, AU, and GU also called out GGUGCUAUCACAAUGUUU S3 C UGUUUUAUCACAAUAUCU AUUAGUAUCACAUCAACA S4 A AAAUGUAUCACAAUUUUU GUUGAUAUCACAAAUGUA BobA gene from Drosophila melanogaster with S5 AUCACA AAGACUAUCACACUUGGU K-box and GY-box motifs highlighted. the UACAAAAUCACAGCUGAA S6 GUU AGGAACAUCACAUCAUAU BobA gene is potentially regulated by miR-11 AGAACUAUCACAGGAACA (K-box specificity) and miR-7 (GY-box S7 CU UUAGUUAUCACAUGAACU AGUUAUAUCACAGUUGAA specificity). For clarity of exposition, stop and S8 AU CAGGCCAUCACACGGGAG start codons underlined in red. UGCCCUAUCACAGACUUA S9 GU UGGGCUAUCACAGAUGCG GUUGCCAUCACAGUUGGG

1 aacaguucuccauccgagcagaucauaaguaaccaaccugcaaaauguucaccgaaaccg 61 cucuuguuuccaacuucaauggagugacagagaagaaaucucuuaccggcgccuccacca 121 accugaagaagcugcugaagaccaucaagaaggucuucaagaacuccaagccuucgaagg 181 agauuccgauccccaacaucaucuacucuugcaauacugaggaggagcaccagaauuggc 241 ucaacgaacaacuggaggccauggcaauccaucuucacugaguucuucugggacaucccc 301 cuccaucgaguaucugugaugugacccgaucaaaaggucuauaaaucggcacuccggcuu 361 uaauauccaacugugaugacgagaacacaagacugacugacuugugugccuuggagguga 421 caaaguucgucgccucugccaacuguacauaucaaacuagcugcuaaaaugucuucaauu 481 augcuuuaauguagucuaaguuaguauuaucauugucuuccauuaguuuaagaaaaucau 541 ugucuuccauguuuguuuguuaggguaaaaaaaacuagcuuaagaauaaaaaucccucgc 601 ggaaagaaaacaau

Figure 7: Motif analysis of 19 sequences each of which is believed to contain a single target site for miR-11 from fruit fly. (Top) OSCR adds the variable S5 to its MDL codebook, the K-box motif, which has been shown to be a miRNA target site for miR-11. (Bottom) Full sequence of BobA gene transcript with K-box and GY box motifs underlined in blue text. The K-box motif (CUGUGAUG) is a target site for miR-11 and the GY-box motif (UGUCUUCCAU) is a target site for miR-7.
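A minimal, self-contained sketch of this kind of phrase selection is given below. It is our illustration rather than the published OSCR implementation: it scores every repeated substring of a concatenated input with the λ heuristic of (8) and reports the minimizer. The input uses three of the sequences listed in Figure 7, for which the winning phrase is the shared AUCACA core of the K-box antisense motif.

```python
import math
from collections import defaultdict

def best_phrase(text, min_len=2, max_len=12):
    """One OSCR-style iteration: score every repeated substring with the
    SCR heuristic lambda = (r*(log2 R - log2 r) + l) / (l*r) from eq. (8),
    approximating R as the symbol count after substitution, and return the
    candidate with the smallest value."""
    occurrences = defaultdict(list)
    for l in range(min_len, max_len + 1):
        for i in range(len(text) - l + 1):
            occurrences[text[i:i + l]].append(i)

    best, best_lam = None, float("inf")
    for phrase, idx in occurrences.items():
        r, l = len(idx), len(phrase)
        if r < 2:
            continue  # unique substrings cannot do any compression work
        R = len(text) - r * (l - 1)   # length after replacing each occurrence
        lam = (r * (math.log2(R) - math.log2(r)) + l) / (l * r)
        if lam < best_lam:
            best, best_lam = phrase, lam
    return best, best_lam

# Three of the 18-nucleotide sequences from Figure 7, joined with a separator.
seqs = ["GGUCACAUCACAGAUACU", "CGAUUAAUCACAAUGAGU", "GGUGCUAUCACAAUGUUU"]
print(best_phrase("#".join(seqs)))
```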

itself entered this second analysis, which was performed 5. MDLcompress independently of the multisequence analysis described in the paragraph above. The sense sequence of BobA is displayed The new MDLcompress algorithmic tool retains the fun- in Figure 2 with the 5UTR indicated in green; the 237 nu- damental element of OSCR—deeply—recursive heuristic- cleotides (79 codons) of the coding sequence in red; and based grammar inference, while trading computational com- the 3UTR in blue. OSCR identified the underlined motifs, plexity for space complexity to decrease execution time. The (cugugaug) and (ugucuuccau). These two motifs turn out compression and hence the ability of the algorithm to iden- not only to be conserved among multiple Drosophila sub- tify specific motifs (which we hypothesize to be of potential species, but also to be targets of two distinct miRNAs: the K- biological significance) have been enhanced by new heuris- box motif (cugugaug) is a target of miR-11 and the GY-box tics and an architecture that searches not only the sequence (ugucuuccau) a target of miR-7. Although we did not per- but also the model for candidate phrases. The performance form OSCR analysis on any additional genes, this motif had has been improved by gathering statistics about potential been identified previously in several 3UTRs, including those code words in a single pass and forming and maintaining of BobA, E(spl)m3, E(spl)m4, E(spl)m5, and Tom [23, 24]. simple matrix structures to simplify heuristic calculations. The BobA gene is particularly sensitive to miR-7. Mutants Additional gains in compression are achieved by tuning the of the BobA gene with base-pair disrupting substitutions at algorithm to take advantage of sequence-specific features both sites of interaction with miR-7 yielded nearly complete such as palindromes, regions of local similarity, and SNPs. loss of miR-7 activity [25] both in vivo and in vitro. These observations are consistent with studies from [26, 27] that 5.1. Improved SCR heuristic reveal specific sequence-matching requirements for effective miRNA activity in vitro. MDLcompress uses steepest-descent stochastic-gradient In summary, the OSCR algorithm identified (i) a methods to infer grammar-based models based upon phrases previously-known 8-nucleotide sequence motif in 19 differ- that maximize compression. It estimates an algorithmic min- ent sequence and (ii) in an entirely independent analysis, imum sufficient statistic via a highly recursive algorithm identified 2 sequence motifs, the K-box and GY-box, within that identifies those motifs enabling maximal compression. the BobA gene transcript. We now describe innovative re- A critical innovation in the OSCR algorithm was the use of finements to our MDL-based DNA compression algorithm a heuristic, the symbol compression ratio (SCR), to select with the goal of improved identification and analysis of bio- phrases. A measure of the compression ratio for a particular logically meaningful sequence—particularly miRNA targets symbol is simply the descriptive length of the string divided related to breast cancer. by number of symbols—grammar variables and terminals 8 EURASIP Journal on Bioinformatics and Systems Biology encoded by this symbol in the phrasebook. We previously de- fined the SCR for a candidate phrase i as 12 − di ri log2(R) log2 ri + li 10 λi = = (10) Li liri 8 for a phrase of length ,repeated times in a string of total 6 li ri SCR length L,withR denoting the total number of symbols in the 4 candidate partition. 
The numerator in (10) consists of the MDL descriptive cost of the phrase if it is added to the model and encoded, while the denominator is an estimate of the unencoded descriptive cost of the candidate phrase. This heuristic encapsulates the net gain in compression per symbol that a candidate phrase would contribute if it were added to the model.

While (10) represents a general heuristic for determining the partition of a sequence that provides the best compression, important effects are not taken into account by this measure. For example, adding new symbols to a partition increases the coding costs of other symbols by a small amount. Furthermore, for any given length and frequency, certain symbols ought to be preferred over others because of probability distribution effects. Thus, we desire an SCR heuristic that more accurately estimates the potential symbol compression of any candidate phrase.

To this end, we can separate the costs accounted for in (10) into three parameters: (i) entropy costs (costs to represent the new phrase in the encoded string); (ii) model costs (costs to add the new phrase to the model); and (iii) previous costs (costs to represent the substring in the string previously). The SCR of [10, 11, 28] breaks these costs down as follows:

C_h = R_i · log2(R / R_i),  (11)
C_m = l_i,   C_p = l_i R_i,  (12)

where R is the length of the string after substitution, l_i is the length of the code phrase, L is the length of the model, and R_i is the frequency of the code phrase in the string. An improved version of this heuristic, SCR 2006, provides a more accurate description of the compression work by eliminating some of the simplifying assumptions made earlier. Entropy costs (11) remain unchanged. However, increased accuracy can be achieved by more specific costs for the model and previous costs.

For previous costs we consider the sum of the costs of the substrings that comprise the candidate phrase,

C_p = Σ_{j=1}^{l_i} R_i · log2(R / r_j),  (13)

where R is the total number of symbols without the formation of the candidate phrase and r_j is the frequency of the jth symbol in the candidate phrase. Model costs require a method for not only spelling out the candidate phrase but also the cost of encoding the length of the phrase to be described. We estimate this cost as

C_m = M(l_i) + Σ_{j=1}^{l_i} log2(R / r_j),  (14)

where M(L) is the shortest prefix encoding of the phrase length. In this way we achieve both a practical method for spelling out the model for implementation and an online method for determining model costs that relies only on known information. Since new symbols will add to the cost of other symbols simply by increasing the number of symbols in the alphabet, we specify an additional cost that reflects the change in costs of the substrings that are not covered by the candidate phrase. The effect is estimated by

C_o = (R − R_i) · log2((L + 2) / (L + 1)).  (15)

This provides a new, more accurate heuristic:

SCR 2006 = (C_m + C_h + C_o) / C_p.  (16)

Figure 8 shows a plot of SCR 2006 versus length and number of repeats for a specific sequence, where the first phrase of a given length and number of repeats is selected. Notice that the lowest-SCR phrase is primarily a function of the number of repeats and the length, but it also includes some variation due to other effects. Thus, we have improved the SCR heuristic to yield a better choice of phrase to add at each iteration.

Figure 8: Symbol compression ratio (vertical axis) as a function of phrase length and number of occurrences (horizontal axes) for the first phrase encountered of a given length and frequency. The variation indicates our improved heuristic is providing benefit by considering the descriptive cost of specific phrases based on the grammar variables and terminals contained in the phrase, not just length and number of occurrences.

5.2. Additional heuristics

In addition to SCR, two alternative heuristics are evaluated to determine the best phrase for MDL learning: longest match

Input sequence 120 Pease porridge hot, TC pease porridge cold, 110 pease porridge in the pot, 100 nine days old. 90 Some like it hot, some like it cold, 80 some like it in the pot, 70 nine days old. 1234567 Total compression model inference S1 pease porridge peasS5porridgS5 S2 some like it S6somS5likS5it S3 in the pot, nine days old. in thS5pS7S6ninS5days old. S4 cold, S5 e S6 S7 ot, S S1hS7S6S1S4S6S1S3S6S2hS7S2S4S2S3 Longest match model inference S1 in the pot, nine days old. S2 , pease porridge S3 some like it S pease porridge hot, S2cold, S2S1S3hot, S3cold, S2S1

Figure 9: MDLcompress model-inferred grammar for the input sequence “pease porridge” using total compression (TC) and the longest match (LM) heuristics. Both the SCR and TC heuristics achieve the same total compression and both exceed the performance of LM. Subsequent iterations enable MDLcompress to identify phrases, yielding further compression of the TC grammar model.

(LM) and total compression (TC). Both of these heuristics 2000 leverage the gains described above by considering the entropy 1800 of specific variables and terminals when selecting candidate phrases. In LM, the longest phrase is selected for substitution, 1600 even if only repeated once. This heuristic can be useful when 1400 it is anticipated that the importance of a codeword is propor- 1200 tional to its length. MDLcompress can apply LM to greater 1000 advantage than other compression techniques because of its deep recursion—when a long phrase is added to codebook, 800 its subphrases, rather than being disqualified, remain poten- 600 tial candidates for subsequent phrases. For example, if the 400 longest phrase merely repeats the second longest phrase three times, MDLcompress will nevertheless identify both phrases. 200 In TC, the phrase that leads to maximum compression 0 at the current iteration is chosen. This “greedy” process does 0 102030405060708090 not necessarily increase the SCR, and may lead to the elim- ination of smaller phrases from the codebook. MDLcom- Model cost press, as explained above, helps temper this misbehavior by Description cost including the model in the search space of future iterations. Total cost Because of this “deep recursion” phrases in both the model Figure 10: The compression characteristic of MDLcompress using and data portions of the sequence are considered as candi- the hybrid heuristics longest match, followed by total compress after date codewords at each iteration-MDLcompress yields im- the longest match heuristic ceases to provide compression. proved performance over the GREEDY algorithm [16]. As with all MDL criteria, the best heuristics for a given sequence is the approach that best compresses the data. The TC gain is the improvement in compression achieved by selecting a candidate phrase and can be derived from the SCR heuris- that we search the model as well as remaining sequence for tic by removing the normalization factor. Examples of MDL- candidate phrases, reducing the need for and benefit from compress operating under different heuristics or combina- the SCR heuristic. By comparison, SEQUITUR [17]forms tions of heuristics are shown in Figures 9 and 10.Underour a grammar of 13 rules consisting of 74 symbols. Thus, us- improved architecture, the best compression seems to usu- ing MDLcompress TC we achieve better compression with a ally be achieved in TC mode, which we attribute to the fact grammar model of approximately half the size. 10 EURASIP Journal on Bioinformatics and Systems Biology
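The difference between the two selection rules can be summarized in a few lines. The sketch below is our paraphrase with made-up helper names, not the MDLcompress implementation, whose bookkeeping is richer: LM ranks candidates purely by length, while TC ranks them by an un-normalized, approximate compression gain (characters removed minus a rough model cost for adding the phrase).

```python
import math
from collections import defaultdict

def candidates(text, min_len=2, max_len=20):
    # Count every substring up to max_len; keep only phrases that repeat.
    occ = defaultdict(int)
    for l in range(min_len, max_len + 1):
        for i in range(len(text) - l + 1):
            occ[text[i:i + l]] += 1
    return {p: r for p, r in occ.items() if r >= 2}

def pick_lm(cands):
    # Longest match: take the longest repeated phrase, ignoring frequency.
    return max(cands, key=len)

def pick_tc(cands, text_len):
    # Total compression: take the phrase with the largest estimated net saving.
    def gain(p):
        r, l = cands[p], len(p)
        return r * (l - 1) - (l + math.log2(text_len))
    return max(cands, key=gain)

text = "pease porridge hot, pease porridge cold, pease porridge in the pot"
c = candidates(text)
print("LM pick:", repr(pick_lm(c)))
print("TC pick:", repr(pick_tc(c, len(text))))
```

On this input the two rules pick different phrases, which is the behavior Figure 9 illustrates on the full "pease porridge" rhyme.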

Figure 11 content (index box and phrase array for the string "a rose is a rose is a rose"): initially, phraseArray(1) = {index: 1, length: 6, verboselength: 6, chararray: 'a rose', startindices: [1 11 21], frequency: 3} and phraseArray(2) = {index: 1, length: 10, verboselength: 10, chararray: 'a rose is', startindices: [1 11], frequency: 2}. After "a rose" is added to the model, the string becomes "S1 is S1 is S1" and the entries update to startindices [1 6 11] with length 1, and startindices [1 6] with length 5, respectively. The phrase array has all the information necessary to update the other candidates after each phrase is added to the model.

Figure 11: The data structures used in MDLcompress allow constant-time selection and replacement of candidate phrases. The top of the figure shows the initial index matrix and phrase array. After adding "a rose" to the model, MDLcompress can generate the new index box and phrase array, shown in the bottom half, in constant time.
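A toy version of this bookkeeping can be written with an ordinary dictionary playing the role of the index box plus phrase array. This is our simplification: the class name `PhraseStats` and the helper are illustrative, and the real implementation uses the sparse l_max-by-L matrix described in the text rather than a hash table.

```python
from dataclasses import dataclass, field

@dataclass
class PhraseStats:
    chararray: str
    startindices: list = field(default_factory=list)

    @property
    def frequency(self):
        return len(self.startindices)

def gather_stats(text, max_len):
    """Single pass over the input: record every substring up to max_len
    together with all of its start positions (1-based, as in Figure 11)."""
    table = {}
    for l in range(1, max_len + 1):
        for i in range(len(text) - l + 1):
            sub = text[i:i + l]
            table.setdefault(sub, PhraseStats(sub)).startindices.append(i + 1)
    # Keep only candidates that repeat; unique substrings cannot compress anything.
    return {s: p for s, p in table.items() if p.frequency >= 2}

stats = gather_stats("a rose is a rose is a rose", max_len=10)
print(stats["a rose"].startindices)   # [1, 11, 21], matching Figure 11
```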

5.3. Data structures During the phrase selection part of each iteration, MDL- compress only has to search through phrase array, calculat- A second improvement of MDLcompress over OSCR is the ing the heuristic for each entry. Once a phrase is selected, improvement to execution time to allow analysis of much the matrix is used to identify overlapping phrases, which will longer input strings, such as DNA sequences. This is achieved have their frequency reduced by the substitution of a new through trading off memory usage and runtime by using ma- symbol for the selected substring. While there may be many trix data structures to store enough information about each phrases in the array that are updated, only local sections of candidate phrase to calculate the heuristic and update the the matrix are altered, so overall only a small percentage of data structures of all remaining candidate phrases. This al- the data structure is updated. This technique is what allows lows us to maintain the fundamental advantage of OSCR MDLcompress to execute efficiently even with long input se- andalgorithmssuchasGREEDY[16] that compression is quences, such as DNA. performed based upon the global structure of the sequence, rather than by the phrases that happen to be processed first, 5.4. Performance bounds as in schemes such as Sequitur, DNA Sequitur, and Lempel- Ziv. We also maintain an advantage over the GREEDY algo- The execution of MDLcompress is divided into two parts: the rithm by including phrases added to our MDL model and the single pass to gather statistics about each phrase and the sub- model space itself in our recursive search space. sequent iterations of phrase selection and replacement. Since During the initial pass of the input, MDLcompress gener- simple matrix operations are used to perform phrase selec- ates an lmax by L matrix, where entry Mi,j represents the sub- tion and replacement, the first pass of statistics gathering al- string of length i beginning at index j. This is a sparse matrix most entirely dominates both the memory requirements and with entries only at locations that represent candidates for runtime. the model. Thus, substrings with no repeats and substrings For strings with input length, L, and maximum phrase that only ever appear as part of a longer substring are repre- length, lmax, the memory requirements of the first pass are sented with a 0. Matrix locations with positive entries repre- bounded by the product L ∗ lmax and subsequent passes re- sent the index into an array with many more details for that quire less memory as phrases are replaced by (new) indi- specific substring. In the example in Figure 11,“arose”ap- vidual symbols. Since the user can define a constraint on pears three times in the input. In each location of the matrix lmax, memory use can be restricted to as little as O(L), and corresponding to this substring is a 1, and the first element in will never exceed O(L2). On platforms with limited memory the phrase array has the length, frequency, and starting index where long phrases are expected to exist, the LM heuristic for all occurrences of the substring. A similar element exists can be used in a simple preprocessing pass to identify and for “a rose is” but not exist for “a rose” since that only appears replace any phrases longer than the system can handle in as a substring of the first candidate. the standard matrix described above. Because MDLcompress Scott C. Evans et al. 11

Table 1: Compression results (bits/nucleotide).

Gene         DNACompress   Sequitur   DNASequitur   MDLcompress
HUMDYSTROP   1.91          2.34       2.2           1.95
HUMGHCSA     1.03          1.86       1.74          1.49
HUMHBB       1.79          2.20       2.05          1.92
HUMHDABCD    1.80          2.26       2.12          1.92
HUMPRTB      1.82          2.22       2.14          1.92
CHNTXX       1.61          2.24       2.12          1.95

inspects the model when searching for subsequent phrases, compress model and taking account of the frequency of the this technique has minimal negative effect on overall com- phrase and its reverse-complement in motif selection. pression. The runtime of the first pass depends directly on L, lmax, 7. POST PROCESSING average phrase length lavg,andaveragenumberofrepeats of selected phrases, ravg. The unclear relationship between After the MDLcompress model has been created, two meth- lmax, lavg, ravg, and L makes deriving guaranteed performance ods possibilities for further compression are the following. bounds difficult. As a simple upper bound, we can note that ffi ∗ (1) Regions of Local similarity: it is sometimes most e - the product lavg ravg must be less than L, and the maximum cient to define a phrase as a concatenation of multiple phrase length must be less than L/2, yielding a performance 3 shorter and adjacent phrases already in the codebook. bound of O(L ). In practice, a memory constraint limits lmax ∗ (2) Single nucleotide polymorphisms (SNPs): it is some- to a constant independent of L,andlavg ravg was approxi- time most efficient to define a phrase as a single nu- mately constant and much smaller than L. Thus, the practical cleotide alteration to another phrase already in the performance bound was O(L). codebook. The runtime of the second part of the algorithm, selec- tion and replacement of compressible phrases, is simply the sum of the time to identify the best phrase and to update 8. COMPARISON TO OTHER GRAMMAR-BASED the matrices for the next iteration, multiplied by the number CODES 2 of iterations. An upper bound on these is O(L ), but again We compare MDLcompress with the state of the art in practical performance is much better. In this DNA applica- grammar-based compression: DNA Sequitur [18]. DNA Se- tion where 144 genes were analyzed, the number of candi- quitur improves the Sequitur algorithm by enabling it to har- ff date phrases, the average number of a ected phrases, and the ness advantages of palindromes and by considering other number of iterations all were independent of input length, grammar-based encoding techniques as discussed in [20]. and the selection and replacement phase ran in constant Results are summarized in Table 1. time. While compression is ultimately the best measure of al- gorithm’s capacity to approximate Kolmogorov complexity, 5.5. Enhancements for DNA compression an additional feature of grammar-based codes is their two- part encoding, which separates the meaningful model from When a symbol sequence is already known to be DNA, sev- the data elements—an advantage we will discuss in more eral “priors” can be incorporated into the model inference detail later. The results above make use of the total com- algorithm that may lead to improved compression perfor- pression heuristic and harness the advantage of consider- mance. These assumptions relate to types of structure that ing palindromes. Although we exceeded the compression of are typical of naturally occurring DNA sequence. By tuning DNA Sequitur, DNACompress still achieves better compres- our algorithm to efficiently code for these mechanisms, we sion; however it does not yield the two-part grammar code are essentially incorporating these priors into our model in- that identifies biologically significant phrases, which we will ference algorithm “by hand.” We consider these assumptions discuss next in the context of breast-cancer-related genes. 
to be small and within the “big O” constant inherent in trans- lating between universal computers. 9. IDENTIFICATION OF MIRNA TARGETS USING MDLCOMPRESS 6. REVERSE-COMPLEMENT MATCHES As shown in Figure 7, MDL algorithms can be used to identify miRNA target sites. We have also tested MDL- As in DNA Sequitur, the search for and grammar encod- compress for the ability to identify miRNA target sites in ing of reverse-complement matches is readily implemented known disease-related genes. The general approach is to an- by adding the reverse-complement of a phrase to the MDL- alyze mRNA transcripts to identify short sequences that are 12 EURASIP Journal on Bioinformatics and Systems Biology

MDLcompress & LATS2: sequence elements in long 3’UTR LOCUS NM 014572 Definition homo sapiens LATS, large tumor suppressor, homolog 2 (Drosophila) (LATS2), mRNA.

5’UTR CDS 3’UTR

MDLcompress (of 3’UTR ) output sequences Sequence Position in 3’UTR 1) aaaaaaaaaaaa 433, 445 2) agcacttatt 262, 362 3) aaacaggac 155, 172

Figure 12: Validation of MDLcompress performance. MDL compress identifies miRNA-372 and 373 target motif (AGCACTTATT) in LATS2 tumor suppressor gene as second phrase.

repeated and localized to the 3UTR. Comparative genomics ing SSEARCH [37] to detect possible sequence similarities to can be applied to increase our confidence that MDL phrases known miRNAs. Finally, genes containing these phrases were in fact represent candidate miRNA target sites, even if there targeted with shRNA constructs in an ErbB2-positive breast are no known cognate miRNAs that will bind to that site. cancer cell line (BT474), as well as in normal mammary As a test, we sought to determine if MDLcompress would epithelial cells (HMEC), in order to identify their poten- have identified the miRNA binding site in the 3UTR of the tial role in breast tumorigenicity. One MDLcompress phrase, tumor suppressor gene, LATS2. A recent study, which used a AGAUCAAGAUC, found in the 3UTR of the splicing fac- function-based approach to miRNA target site identification, tor arginine/serine-rich 7 (SFRS7) gene (a) was highly con- determined that LATS2 is regulated by miRNAs 372 and 373 served, (b) resulted in miRBase matches to a small number of [29]. Increased expression of these miRNAs led to down reg- miRNAs that fulfill the minimum requirements of putative ulation of LATS2 and to tumorigenesis. The miRNA 372 and miRNA targets [32] (Figures 13(a) and 13(b)) in vitro data 373 target sequence (AGCACTTATT) is located in the 3UTR implicate this gene in breast cancer progression. More specif- of LATS2 mRNA and is repeated twice but was not identified ically,downregulationofSFRS7byshRNAsinBT474cells with computation-based miRNA target identification tech- yielded a significant decrease in the proliferation marker ala- niques. Using the 3UTR of LATS2 mRNA as an input, three marBlue (Biosource), but not in normal mammary epithelial code words were added to the MDLcompress model, using cells (HMEC) (Figure 13(b)). In this experiment, cells were longest match mode as shown in Figure 12, the polyA tail, transiently transfected with miRNA-based-structure shRNA the miRNA 372 and 373 target sequence (AGCACTTATT), constructs [38] targeting the coding sequence of SFRS7, by and a third phrase (AAACAGGAC) which we do not iden- using a lipid-based reagent (FuGENE 6, Roche). A plasmid tify with any particular biological function at this time. This construct expressing green fluorescent protein (MSCV-GFP) shows that analyzing genes of interest a priori with MDL- was cotransfected to the cells to normalize transfection effi- compress can produce highly relevant sequence motifs. ciency [3]. shRNAs against the firefly luciferase gene was used Since miRNAs regulate genes important for tumorigene- as negative control. Although regulation by the specific miR- sis and MDLcompress is able to identify these targets, it fol- NAs identified in our bioinformatics analysis still requires lows that MDLcompress could be used to directly identify validation, these results suggest the possible differential regu- genes that are important for tumorigenesis. To test this, we lation of this gene in breast cancer by a miRNA and that this used a target rich set of 144 genes known to have increased gene is significant in cell proliferation, underscoring the po- expression patterns in ErbB2-positive breast cancer [30, 31] tential for OSCR to identify sequence of biological interest. and compressed each gene mRNA sequence with MDLcom- press running in longest match mode. A total of 93 phrases 10. ANALYSIS OF SINGLE NUCLEOTIDE were added to MDLcompress codebooks resulting in com- POLYMORPHISMS pression of these genes. 
Of these phrases, 25 were found ex- clusively in the 3UTRs of these genes. Since miRNAs interact By definition, mutation of an essential nucleotide within a more frequently with the 3UTRs of mRNAs [32], we focused given miRNA’s target sequence within an mRNA is expected our analysis on these phrases, shown in Table 2. to have a strong effect on the activity of the given miRNA The 25 3UTR phrases were run through BLAST [33] on the target. If a nucleotide that is required for interac- searches of a database of 3UTRs [34, 35] to determine tion of a miRNA with the mRNA is altered, the miRNA may level of conservation in human and other genomes. The cease to regulate that target, thereby enhancing expression phrases were also run against the miRBase database [36]us- of the mRNA and the protein it encodes. Alternatively, a Scott C. Evans et al. 13
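The 3'UTR filter itself is straightforward; the sketch below is our own illustration with made-up coordinates, not the annotation pipeline used in the study. It keeps a phrase only when every occurrence lies at or beyond the start of the annotated 3'UTR.

```python
def utr3_exclusive(phrases, utr3_start):
    """Keep phrases whose every occurrence falls inside the 3'UTR.
    `phrases` maps a phrase to its list of start positions (1-based);
    `utr3_start` is the first position of the annotated 3'UTR."""
    return {p: pos for p, pos in phrases.items()
            if all(s >= utr3_start for s in pos)}

# Illustrative example only: positions and the 3'UTR boundary are hypothetical.
phrases = {
    "agatcaagatc": [1010, 1091],   # both hits downstream of the assumed boundary
    "ctccctcctc":  [240, 1035],    # one hit upstream, so it is filtered out
}
print(utr3_exclusive(phrases, utr3_start=900))
```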

Table 2: 3UTR MDLcompress phrases from 144 ErbB2-positive-related gene mRNA sequence.

Accession number Number of repeats Length Phrase Locations NM 000442 2 13 tttctcttttcct 2835, 3091 NM 004265 2 10 tcagggaggg 2274, 2667 NM 004265 2 10 ccccccagct 2954, 3021 NM 004265 2 10 gcagaggcag 2255, 3051 NM 005324 2 12 ttttatttataa 1292, 1802 NM 005324 2 10 cagtttcctt 997, 1991 NM 005324 2 9 tttataata 627, 1055 NM 005930 2 11 tatttcaattt 2903, 2932 NM 005930 2 11 tatttttgctc 2733, 3809 NM 005930 2 10 gacaaatgtg 3064, 3250 NM 005930 2 10 cttttttttc 3425, 3689 NM 005930 2 10 ttggaacact 3750, 3787 NM 006148 2 13 gtgtgtgagtgtg 1951, 3654 NM 006148 2 12 ccccagtctcca 647, 1651 NM 006148 2 11 acttcttggtt 1067, 1290 NM 006148 2 11 cctcctgccca 1186, 1503 NM 006148 2 11 ccccatctctg 2147, 2302 NM 006148 2 11 ggaagcacagc 1545, 2447 NM 006148 2 11 tgtgggtgggg 2014, 2776 NM 006148 2 11 cctttctggcc 2812, 3759 NM 006148 2 10 ctccctcctc 1035, 1408 NM 006148 2 10 cagctaccgg 525, 1591 NM 006148 2 10 tcccctcccc 1464, 1828 NM 006148 2 10 gtggaggaag 2159, 2267 NM 006276 2 11 agatcaagatc 1010, 1091

Figure 13 content: (a) the OSCR phrase AGAUCAAGAUC in the SFRS7 3'UTR aligned against miR-218 from three species (hsa-miR-218 UGUACCAAUCUAGUUCGUGUU; rno-miR-218 UGUACCAAUCUAGUUCGUGUU; xtr-miR-218 UGUACCAAUCUAGUUCGUGUU). (b) Bar chart comparing proliferation of BT474 and HMEC cells transfected with a luciferase shRNA control versus an SFRS7 shRNA.

Figure 13: A miRNA target site relevant to breast cancer is identified by OSCR. (a) Proposed interaction between miRNAs (human, rat, frog) and OSCR phrase. (b) Down regulation of the SFRS7 by RNAi specifically inhibits the proliferation of breast cancer cell line BT474 and not normal cells. These miRNAs may be implicated in breast cancer. single-nucleotide change to a target of one miRNA may yield results in an “illegitimate” interaction of miRNA 1 and 206 a target sequence for a distinct miRNA. A report published in with the myostatin mRNA [39]. Mutations that yield such 2006 demonstrated this SNP effect in a mammal. The study interactions between mutant mRNA and miRNAs are called found that Texel sheep, which are known for their meatiness, “Texel-like.” The authors performed a preliminary analysis possess a mutation in the 3UTR of the myostatin gene that of known human SNPs and their potential for perturbing 14 EURASIP Journal on Bioinformatics and Systems Biology

Figure 14(a) content: schematic of the overlap between the SNP500 database (500 genes) and the MDLcompress-analyzed BT474 overexpression set (144 genes); 13 genes fall in the intersection.

Name    Accession    MDL sequence   Position            SNP
ESR1    NM 000125    GATATGTTTA     4023, 5325          4029 T→C
PTGS2   NM 000963    CAAAATGC       2179, 2717, 3097    3103 G→A
EGFR    NM 005228    TTTTACTTC      4233, 4967          4975 C→T

(b)

Figure 14: MDLcompress directly identifies putative miRNA target sequences that may be implicated in breast cancer. (a) Schematic of overlap between SNP500 database and potential miRNA sequences identified by MDLcompress in the test set. (b) Potential miRNA sites identified by MDLcompress with disease-related polymorphisms identified by SNP analysis. These miRNA targets may be implicated in breast cancer. binding sites of predicted miRNAs and identified 2490 Texel- MDLcompress cost per nucleotide-based of PGTS2 with SNP 3 like mutations and 483 mutations that potentially result in loss of miRNA binding. We performed a similar analysis on the 144 overexpressed 2.5 SNP g a gene mRNA sequences from the BT474 breast cancer cell line [30, 31] to identify which of these genes possess disease- 2 related Texel-like mutations. By cross-referencing with the SNP500 database [40],SNPswerefoundin13ofthe144 1.5 overexpressed gene mRNA sequences from the BT474 breast cancer cell line, all in the 3UTR region. The initial compari- 1 son of the 93 MDLcompress code words from the 144 genes taaaacttccttttaaatcaaaatgccaaatttattaaggtggtggagcc discussed previously did not match with any SNP phrases. We then relaxed the strict constraint that a phrase must lead 0.5 to compression at every step and asked MDLcompress in longest match to identify the top 10 candidates in each gene 0 mRNA sequence that would most likely lead to compression. 2700 2710 2720 2730 2740 2750 Strikingly, 3 of these genes-ESR-1, PGTS2, and EGFR-have Figure 15: Cost per nucleotide for PTGS2. The blue curve identifies SNPs in the set of the first 10 code word candidates identified cost per nucleotide of the original sequence based upon an MDL- by MDLcompress when run on each these genes respective compress model developed using the total compression heuristic mRNA sequence (Figure 14). These three sequences were se- and the first 15 phrases to be selected. The cost per nucleotide under lected out of the 13 because they fulfill the criteria we used the SNP g → a isshowninred. for Figure 13(a), that based on sequence analysis (similarity to miRNA sequences and intra- and inter- species sequence conservation); they are putative miRNA targets. single nucleotide typically yields a very small change in de- These motifs are localized to the 3UTR and have not scriptive cost, in most cases less than a bit; however, the SNP been predicted to interact with any known miRNAs in the in the phrase shown in Figure 15 yields a change in descrip- literature. Although further validation studies are required, tive cost on the order of 4 bits, suggesting that this phrase these observations suggest that MDLcompress may be capa- is in fact meaningful. Future work will elaborate on this po- ble of directly identifying potential miRNA target sequences tential relationship between meaningful phrases identified by with roles in breast cancer. MDLcompress and disease, and explore the capability of us- Our hypothesis regarding the significance of MDL ing MDLcompress models to predict sites where SNPs are es- phrases that are added to the MDLcompress model motivates pecially likely to cause pathology. search of these phrases for SNPs related to cancer. As shown in Figure 10, an SNP identified in PTGS2 gene [40] colo- 11. 
CONCLUSIONS calizes with the MDLcompress-identified phrase caaaatgc in the 3UTR of PTGS2 and yields a disproportionate change MDLcompress yields compression of DNA sequences that is in the descriptive cost of the sequence under the MDLcom- superior to any other existing grammar-based coding algo- press model generated for the original sequence. Altering a rithm. It enables automatic detection of model granularity, Scott C. Evans et al. 15 leading to identification of interesting variable-length motifs. [5] B. P.Lewis, C. B. Burge, and D. P.Bartel, “Conserved seed pair- These motifs include miRNA target sequences that may play ing, often flanked by adenosines, indicates that thousands of a role in the development of disease, including breast cancer, human genes are microRNA targets,” Cell, vol. 120, no. 1, pp. introducing a novel method of identifying microRNA targets 15–20, 2005. without specifying the sequence (or, in particular, seed) of [6] V. Rusinov, V. Baev, I. N. Minkov, and M. Tabler, “MicroIn- the microRNA that is supposed to bind them. Additionally, spector: a web tool for detection of miRNA binding sites in an RNA sequence,” Nucleic Acids Research, vol. 33, web server we have used our algorithm here to study SNPs found in issue, pp. W696–W700, 2005. overexpressed genes in the breast cancer cell line BT474, and [7] G. A. Calin, C.-G. Liu, C. Sevignani, et al., “MicroRNA pro- we identified 3 SNPs that may alter the ability of microRNAs filing reveals distinct signatures in B cell chronic lymphocytic to target their sequence neighborhood. leukemias,” Proceedings of the National Academy of Sciences of In future work, MDL specificity will be improved the United States of America, vol. 101, no. 32, pp. 11755–11760, through windowing and segmentation, concepts described 2004. in Figure 4. Running MDLcompress on consecutive windows [8] A. Esquela-Kerscher and F. J. Slack, “Oncomirs—microRNAs of sequence will enable the detection of change points, such with a role in cancer,” Nature Reviews Cancer, vol. 6, no. 4, pp. as the transition from noncoding to coding sequence, and 259–269, 2006. permit the use of multiple codebooks, enhancing specificity [9] P. Grunwald,¨ I. J. Myung, and M. Pitt, Eds., Advances in Mini- for each region of a gene. For example, the optimal MDL mum Description Length: Theory and Applications, MIT Press, codebook for a coding region is unlikely to be the same as Cambridge, Mass, USA, 2005. [10] S. C. Evans, Kolmogorov complexity estimation and application that for a 3UTR. Applying the same model over an entire ff for information system security, Ph.D. dissertation, Rensselaer gene reduces the e ectiveness of the MDL compression algo- Polytechnic Institute, Troy, NY, USA, 2003. rithm in identifying biologically significant motifs. This im- [11] S. C. Evans, B. Barnett, S. F. Bush, and G. J. Saulnier, “Mini- provement of MDLcompress to detect and take advantage of mum description length principles for detection and classifi- change points will enable the detection of nonadjacent re- cation of FTP exploits,” in Proceedings of IEEE Military Com- gions of the genome that are similar. The execution time of munications Conference (MILCOM ’04), vol. 1, pp. 473–479, MDLcompress will be further reduced by means of a novel Monterey, Calif, USA, October-November 2004. data structure that augments a suffix tree with counts and [12] S. C. Evans, A. Torres, and J. 
Miller, “MicroRNA target mo- pointers, enabling deep recursion of model inference without tif detection using OSCR,” Tech. Rep. GRC223, GE Research, intractable computation. With this structure, when a phrase Niskayuna, NY, USA, 2006. is selected for the MDLcompress codebook, simple opera- [13] M. Li and P. Vitanyi,´ Introduction to Kolmogorov Complexity tions can update the structure to facilitate selection of the and Applications, Springer, New York, NY, USA, 1997. ffi [14] W. Szpankowski, W. Ren, and L. Szpankowski, “An opti- next phrase by leveraging known information. The su x- mal DNA segmentation based on the MDL principle,” Inter- tree with counts and pointers architecture will enable near- national Journal of Bioinformatics Research and Applications, linear time processing of the windowed segments. vol. 1, no. 1, pp. 3–17, 2005. [15] I. Tobus, G. Korodi, and J. Rissanen, “DNA sequence com- ACKNOWLEDGMENTS pression using the normalized maximum likelihood model for discrete regression,” in Proceedings of Data Compression Con- This work was funded by the U.S. Army Medical Research ference (DCC ’03), pp. 253–262, Snowbird, Utah, USA, March Acquisition Activity, 820 Chandler Street, Fort Detrick, DM 2003. [16] A. Apostolico and S. Lonardi, “Some theory and practice of 217-5014 in Grants W81XWH-0-1-0501 (to SE and AT) and ff W8IWXH-04-1-0474 (to DSC). The content and informa- greedy o -line textual substitution,” in Proceedings of Data Compression Conference (DCC ’98), pp. 119–128, Snowbird, tion do not necessarily reflect the position or policy of the ffi Utah, USA, March 1998. government and no o cial endorsement should be inferred. [17] C. G. Nevill-Manning and I. H. Witten, “Identifying hierarchi- cal structure in sequences: a linear-time algorithm,” Journal of REFERENCES Artificial Intelligence Research, vol. 7, pp. 67–82, 1997. [18] N. Cherniavsky and R. Lander, “Grammar-based compres- [1]A.Fire,S.Xu,M.K.Montgomery,S.A.Kostas,S.E.Driver, sion of DNA sequences,” in DIMACS Working Group on The and C. C. Mello, “Potent and specific genetic interference Burrows—Wheeler Transform, Piscataway, NJ, USA, August by double-stranded RNA in caenorhabditis elegans,” Nature, 2004. vol. 391, no. 6669, pp. 806–811, 1998. [19] X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: fast and [2] G. J. Hannon and J. J. Rossi, “Unlocking the potential of effective DNA sequence compression,” Bioinformatics, vol. 18, the human genome with RNA interference,” Nature, vol. 431, no. 12, pp. 1696–1698, 2002. no. 7006, pp. 371–378, 2004. [20] B. Behzadi and F. Le Fessant, “DNA compression chal- [3] A. Kourtidis, C. Eifert, and D. S. Conklin, “RNAi applications lenge revisited: a dynamic programming approach,” in The in target validation,” in Systems Biology, Applications and Per- 16th Annual Symposium on Combinatorial Pattern Matching spectives, P. Bringmann, E. C. Butcher, G. Parry, and B. Weiss, (CPM ’05), vol. 3537 of Lecture Notes in Computer Science,pp. Eds., vol. 61 of Ernst Schering Foundation Symposium Proceed- 190–200, Jeju Island, Korea, 2005. ings, pp. 1–21, Springer, New York, NY, USA, 2007. [21] S. C. Evans, T. S. Markham, A. Torres, A. Kourtidis, and D. [4] B. P. Lewis, I.-H. Shih, M. W. Jones-Rhoades, D. P. Bartel, and Conklin, “An improved minimum description length learn- C. B. Burge, “Prediction of mammalian microRNA targets,” ing algorithm for nucleotide sequence analysis,” in Proceed- Cell, vol. 115, no. 7, pp. 787–798, 2003. 
ings of IEEE 40th Asilomar Conference on Signals, Systems and 16 EURASIP Journal on Bioinformatics and Systems Biology

Computers (ACSSC ’06), pp. 1843–1850, Pacific Grove, Calif, USA, October-November 2006. [22] P. Gacs,J.T.Tromp,andP.M.B.Vit´ anyi,´ “Algorithmic statis- tics,” IEEE Transactions on Information Theory, vol. 47, no. 6, pp. 2443–2463, 2001. [23] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991. [24] E. C. Lai, “MicroRNAs are complementary to 3 UTR se- quence motifs that mediate negative post-transcriptional reg- ulation,” Nature Genetics, vol. 30, no. 4, pp. 363–364, 2002. [25] E. C. Lai, B. Tam, and G. M. Rubin, “Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K- box-class microRNAs,” Genes & Development,vol.19,no.9, pp. 1067–1080, 2005. [26] J. G. Doench and P. A. Sharp, “Specificity of microRNA target selection in translational repression,” Genes & Development, vol. 18, no. 5, pp. 504–511, 2004. [27] J. Brennecke, A. Stark, R. B. Russell, and S. M. Cohen, “Prin- ciples of microRNA-target recognition,” PLoS Biology, vol. 3, no. 3, p. e85, 2005. [28] S. C. Evans, G. J. Saulnier, and S. F. Bush, “A new universal two part code for estimation of string kolmogorov complexity and algorithmic minimum sufficient statistic,” in DIMACS Work- shop on Complexity and Inference, Piscataway, NJ, USA, June 2003. [29] P.M. Voorhoeve, C. le Sage, M. Schrier, et al., “A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in tes- ticular germ cell tumors,” Cell, vol. 124, no. 6, pp. 1169–1181, 2006. [30] A. Mackay, C. Jones, T. Dexter, et al., “cDNA microarray anal- ysis of genes associated with ERBB2 (HER2/neu) overexpres- sion in human mammary luminal epithelial cells,” Oncogene, vol. 22, no. 17, pp. 2680–2688, 2003. [31] F. Bertucci, N. Borie, C. Ginestier, et al., “Identification and validation of an ERBB2 gene expression signature in breast ,” Oncogene, vol. 23, no. 14, pp. 2564–2575, 2004. [32] L. P.Lim, N. C. Lau, P.Garrett-Engele, et al., “Microarray anal- ysis shows that some microRNAs downregulate large numbers of target mRNAs,” Nature, vol. 433, no. 7027, pp. 769–773, 2005. [33] S. F. Altschul, T. L. Madden, A. A. Scha¨ffer, et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997. [34] F. Mignone, G. Grillo, F. Licciulli, et al., “UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untrans- lated regions of eukaryotic mRNAs,” Nucleic Acids Research, vol. 33, database issue, pp. D141–D146, 2005. [35] http://microrna.sanger.ac.uk/sequences/index.shtml. [36] S. Griffiths-Jones, R. J. Grocock, S. van Dongen, A. Bateman, and A. J. Enright, “miRBase: microRNA sequences, targets and gene nomenclature,” Nucleic Acids Research, vol. 34, database issue, pp. D140–D144, 2006. [37] X. Huang, R. C. Hardison, and W. Miller, “A space-efficient algorithm for local similarities,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 373–381, 1990. [38] P. J. Paddison, J. M. Silva, D. S. Conklin, et al., “A resource for large-scale RNA-interference-based screens in mammals,” Nature, vol. 428, no. 6981, pp. 427–431, 2004. [39] A. Clop, F. Marcq, H. Takeda, et al., “A mutation creating a po- tential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep,” Nature Genetics, vol. 38, no. 7, pp. 813–818, 2006. [40] http://snp500cancer.nci.nih.gov/. 
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 61374, 7 pages doi:10.1155/2007/61374

Research Article Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria

Haruo Suzuki, Rintaro Saito, and Masaru Tomita

Institute for Advanced Biosciences, Keio University, Yamagata 997-0017, Japan

Received 31 January 2007; Accepted 4 June 2007

Recommended by Teemu Roos

G + C composition at the third codon position (GC3) is widely reported to be correlated with synonymous codon usage bias. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we applied Shannon entropy from information theory to measure the degree of GC3 bias and that of synonymous codon usage bias of each gene. The strength of the correlation of GC3 with synonymous codon usage bias, quantified by a correlation coefficient, varied widely among bacterial genomes, ranging from −0.07 to 0.95. Previous analyses suggesting that the relationship between GC3 and synonymous codon usage bias is independent of species are thus inconsistent with the more detailed analyses obtained here for individual species.

Copyright © 2007 Haruo Suzuki et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most amino acids can be encoded by more than one codon (i.e., a triplet of nucleotides); such codons are described as being synonymous and usually differ by one nucleotide in the third position. In many organisms, alternative synonymous codons are not used with equal frequency. Various factors have been proposed to contribute to synonymous codon usage bias, including G + C composition, replication strand bias, and translational selection [1]. Here, we focus on the contribution of G + C composition to synonymous codon usage bias.

G + C composition has been widely reported to be correlated with synonymous codon usage bias [2–11]. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. It would be useful to be able to quantify the strength of the correlation of G + C composition with synonymous codon usage bias in such a way that the estimates could be compared among genomes.

Different methods have been used to analyse the relationships between G + C composition and synonymous codon usage. Multivariate analysis methods, such as correspondence analysis [5–7] and principal component analysis [8], have been widely used to construct measures accounting for the largest fractions of the total variation in synonymous codon usage among genes. Carbone et al. [2, 3] used the codon adaptation index as a "universal" measure of dominating codon usage bias. The measures obtained by these methods can be interpreted as having different features (e.g., G + C composition bias, replication strand bias, and translationally selected codon bias), depending on the gene groups analyzed. Therefore, these methods would be useful for exploratory data analysis but not for the analysis of interest here. By contrast, measures such as the "effective number of codons" [10] and Shannon entropy from information theory [11] are well defined; these measures can be regarded as representing the degree of deviation from equal usage of synonymous codons, independently of the genes analyzed. Previous analyses of the relationships between G + C composition and synonymous codon usage bias using these measures have had two problems. First, these measures of synonymous codon usage bias have failed to take into account all three aspects of amino acid usage (i.e., the number of different amino acids, their relative frequency, and their codon degeneracy), and therefore are affected by amino acid usage bias, which may mask the effects directly linked to synonymous codon usage bias. Second, previous analyses have compared the "degree" of synonymous codon usage bias with G + C content [defined as (G + C)/(A + T + G + C)], and have therefore yielded a nonlinear U-shaped relationship (a gene with a very low or very high G + C content has a high degree of synonymous codon usage bias) [9–11]; it is thus difficult to quantify the nonlinear relationship.

To overcome the first of these problems, we use the "weighted sum of relative entropy" (Ew) as a measure of synonymous codon usage bias [12]. This measure takes into account all three aspects of amino acid usage enumerated above, and indeed is little affected by amino acid usage biases.

To overcome the second problem, we compare the degree of synonymous codon usage bias (Ew) with the degree of G + C content bias (entropy) instead of simply the G + C content; this step can provide a linear relationship. The strength of the linear relationship can be easily quantified by using a correlation coefficient.

The approach of quantifying the strength of the correlation of G + C composition with synonymous codon usage bias by using the entropy and correlation coefficient is applied to bacterial species for which whole genome sequences are available.

2. MATERIALS AND METHODS

2.1. Software

All analyses were conducted by using G-language genome analysis environment software [13], available at http://www.g-language.org. Graphs such as the histogram and scatter plot were generated in the R statistical computing environment [14], available at http://www.r-project.org.

2.2. Sequences

We tested data from 371 bacterial genomes (see Additional Table 1 for a comprehensive list, available online at http://www2.bioinfo.ttck.keio.ac.jp/genome/haruo/BSB ST1.pdf). Complete genomes in GenBank format [15] were downloaded from the NCBI repository site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). Protein coding sequences containing letters other than A, C, G, or T and those containing amino acids with residues less than their degree of codon degeneracy were discarded. From each coding sequence, start and stop codons were excluded.

2.3. Analyses

2.3.1. Measure of the degree of synonymous codon usage bias

The relative frequency of the jth synonymous codon for the ith amino acid (Rij) is defined as the ratio of the number of occurrences of a codon to the sum of all synonymous codons:

R_{ij} = \frac{n_{ij}}{\sum_{j=1}^{k_i} n_{ij}},   (1)

where n_{ij} is the number of occurrences of the jth codon for the ith amino acid, and k_i is the degree of codon degeneracy for the ith amino acid.

The degree of bias in synonymous codon usage of the ith amino acid (Hi) was quantified with a measure of uncertainty (entropy) in Shannon's information theory [16]:

H_i = -\sum_{j=1}^{k_i} R_{ij} \log_2 R_{ij},   (2)

H_i can take values from 0 (maximum bias, where only one codon is used and all other synonyms are not present) to a maximum value H_{i\,max} = -k_i ((1/k_i) \log_2 (1/k_i)) = \log_2 k_i (no bias, where alternative synonymous codons are used with equal frequency; that is, for every j, R_{ij} = 1/k_i).

The relative entropy of the ith amino acid (Ei) is defined as the ratio of the observed entropy to the maximum possible in the amino acid:

E_i = \frac{H_i}{H_{i\,max}} = \frac{H_i}{\log_2 k_i},   (3)

E_i ranges from 0 (maximum bias when H_i = 0) to 1 (no bias when H_i = \log_2 k_i).

To obtain an estimate of the overall bias in synonymous codon usage of a gene, we combined estimates of the bias from different amino acids, as follows. First, to take account of the difference in the degree of codon degeneracy (ki) between different amino acids, we used the relative entropy (Ei) instead of the entropy (Hi) as an estimate of the bias of each amino acid. Second, to take account of the difference in relative frequency between different amino acids in the protein, we calculated the sum of the relative entropy of each amino acid weighted by its relative frequency in the protein. The measure of synonymous codon usage bias, designated as the "weighted sum of relative entropy" (Ew) [12], is given by

E_w = \sum_{i=1}^{s} w_i E_i,   (4)

where s is the number of different amino acid species in the protein and w_i is the relative frequency of the ith amino acid in the protein as a weighting factor. Ew ranges from 0 (maximum bias) to 1 (no bias).

2.3.2. Measure of the degree of G + C composition bias

The entropy was calculated to quantify the degree of bias in G + C composition at the first, second, and third codon positions of a gene (HGC1, HGC2, and HGC3, resp.),

H_p = -p \log_2 p - (1 - p) \log_2 (1 - p),   (5)

where p is the G + C content (defined as (G + C)/(A + T + G + C)) at the first, second, or third codon positions in the nucleotide sequence (GC1, GC2, or GC3).

The entropy (H) for G + C composition (and for usage of two-fold degenerate codons; coding for asparagine, aspartic acid, cysteine, glutamic acid, glutamine, histidine, lysine, phenylalanine, or tyrosine) with values p and 1 − p is plotted in Figure 1 as a function of p.
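As a concrete illustration of equations (1)–(5), the following Python sketch computes Ew and HGC3 for a single coding sequence. It is only a minimal sketch, not the authors' G-language implementation: the standard genetic-code table, the treatment of the single-codon amino acids Met and Trp (assigned Ei = 1 here), and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

# Standard genetic code: DNA codons -> one-letter amino acids, '*' = stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16*i + 4*j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# k_i: codon degeneracy of each amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

def weighted_sum_of_relative_entropy(cds):
    """E_w (eq. (4)) of one coding sequence; start/stop codons assumed removed,
    and the sequence assumed to contain only A, C, G, T (as in Section 2.2)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    by_aa = defaultdict(Counter)                  # n_ij grouped by amino acid i
    for codon in codons:
        aa = CODON_TABLE[codon]
        if aa != "*":
            by_aa[aa][codon] += 1
    total = sum(sum(counts.values()) for counts in by_aa.values())
    ew = 0.0
    for aa, counts in by_aa.items():
        n_i = sum(counts.values())
        r_ij = [n / n_i for n in counts.values()]          # eq. (1)
        h_i = -sum(p * log2(p) for p in r_ij)              # eq. (2)
        k_i = DEGENERACY[aa]
        # Met and Trp (k_i = 1) carry no synonymous choice; treating them as
        # unbiased (E_i = 1) is an assumption of this sketch.
        e_i = h_i / log2(k_i) if k_i > 1 else 1.0          # eq. (3)
        ew += (n_i / total) * e_i                          # w_i * E_i
    return ew

def gc3_entropy(cds):
    """H_GC3 (eq. (5)): entropy of the G + C content at third codon positions."""
    third = cds[2::3]
    p = sum(base in "GC" for base in third) / len(third)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)
```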

Figure 1: Entropy (H, in bits) of G + C composition and usage of two-fold degenerate codons with values p and 1 − p, plotted as a function of p.

2.3.3. Estimation of the correlation of G + C composition with synonymous codon usage bias

Spearman's rank correlation coefficient (r) was calculated to quantify the strength of the correlation between G + C composition bias (HGC1, HGC2, and HGC3) and synonymous codon usage bias (Ew),

r = \frac{\sum_{g=1}^{m} (x_g - \bar{x})(y_g - \bar{y})}{\sqrt{\sum_{g=1}^{m} (x_g - \bar{x})^2} \sqrt{\sum_{g=1}^{m} (y_g - \bar{y})^2}}, \qquad \bar{x} = \frac{1}{m}\sum_{g=1}^{m} x_g, \quad \bar{y} = \frac{1}{m}\sum_{g=1}^{m} y_g,   (6)

where x_g is the rank of the x-axis value (HGC1, HGC2, or HGC3) for the gth gene, y_g is the rank of the y-axis value (Ew) for the gth gene, and m is the number of genes in the genome. The r value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).

3. RESULTS

3.1. Correlation of G + C composition with synonymous codon usage bias (r value)

We investigated the correlation between the degree of G + C composition bias (HGC1, HGC2, and HGC3) and that of synonymous codon usage bias (Ew) within each genome.

Figure 2 shows scatter plots of Ew plotted against HGC1, HGC2, and HGC3 with Geobacter metallireducens GS-15 genes and with Saccharophagus degradans 2–40 genes as examples and the Spearman's rank correlation coefficient (r) calculated from each plot. In G. metallireducens, the value of Ew was much better correlated with HGC3 (Figure 2(c)) than with HGC1 (Figure 2(a)) or HGC2 (Figure 2(b)), indicating that GC3 contributed more to synonymous codon usage bias than GC1 and GC2. In S. degradans, the value of Ew was not correlated with HGC1 (Figure 2(d)), HGC2 (Figure 2(e)), or HGC3 (Figure 2(f)), indicating that neither GC1, nor GC2, nor GC3 contributed to synonymous codon usage bias.

To compare the contributions of GC1, GC2, and GC3 to synonymous codon usage bias, we produced pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 genomes (Figure 3). In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC1 (x-axis) (Figure 3(a)), 362 points (97.6% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC1 in most of the genomes analyzed. In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(b)), 367 points (98.9% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC2 in most genomes analyzed. In the scatter plot of the r values of HGC1 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(c)), the scatter plot displays a diffuse distribution of points: 186 points (50.1% of the total) are on the upper left of the line y = x, indicating that the relative contributions of GC1 and GC2 to synonymous codon usage bias varied widely from genome to genome.

We constructed histograms showing the distribution of r values of HGC1, HGC2, and HGC3 with Ew for 371 bacterial genomes (Figure 4). The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) were distributed evenly between positive and negative values, whereas those of HGC3 (Figure 4(c)) were distributed towards positive values. The ranges [minimum, maximum] of the r values of HGC1, HGC2, and HGC3 were [−0.51, 0.46], [−0.28, 0.39], and [−0.07, 0.95], respectively. The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) exhibited a monomodal distribution, whereas those of HGC3 (Figure 4(c)) exhibited a multimodal distribution.

3.2. Correlation of r value with genomic features

To investigate whether the correlation of GC3 with synonymous codon usage bias (the r value of HGC3 versus Ew) was related to species characteristics, we compared the r values with genomic features such as genomic G + C content and tRNA gene copy number. Among the 371 genomes analyzed here, genomic G + C content ranged from 23% to 73% and tRNA gene copy number varied from 28 to 145.

We constructed scatter plots of the r values of HGC3 with Ew plotted against genomic G + C content and tRNA gene copy number for 371 genomes (Figure 5). The relationship between the r value of HGC3 and the tRNA gene copy number was unclear (Figure 5(b)). In contrast, the r values of HGC3 tended to be high in G + C-poor or G + C-rich genomes, revealing a nonlinear relationship between the r value of HGC3 and genomic G + C content (Figure 5(a)).
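The per-genome correlation of equation (6) can be sketched as follows, reusing the hypothetical helper functions from the earlier sketch; this is an illustrative outline, not the authors' pipeline, and the function name is an assumption.

```python
# Given the coding sequences of one genome, estimate the r value of HGC3 vs Ew.
from scipy.stats import spearmanr

def gc3_vs_codon_bias_correlation(coding_sequences):
    h_gc3 = [gc3_entropy(cds) for cds in coding_sequences]
    e_w = [weighted_sum_of_relative_entropy(cds) for cds in coding_sequences]
    r, _pvalue = spearmanr(h_gc3, e_w)   # Spearman's rank correlation, eq. (6)
    return r
```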

0.9 0.9

0.8 0.8

0.7 0.7 w w E E 0.6 0.6

0.5 0.5

0.4 0.4

0.6 0.7 0.8 0.9 1 0.85 0.9 0.95 1

HGC1, r = 0.25 HGC2, r =−0.01 (a) (b)

0.9 0.9 0.8

0.7 0.8 w w E E 0.6 0.7 0.5

0.4 0.6

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.88 0.92 0.96 1

HGC3, r = 0.95 HGC1, r = 0.06 (c) (d)

0.9 0.9

0.8 0.8 w w E E

0.7 0.7

0.6 0.6

0.86 0.9 0.94 0.98 0.85 0.9 0.95 1

HGC2, r =−0.08 HGC3, r =−0.07 (e) (f)

Figure 2: Scatter plots of Ew plotted against (a) HGC1,(b)HGC2, and (C) HGC3 for Geobacter metallireducens GS-15 genes and against (d) HGC1,(e)HGC2, and (f) HGC3 for Saccharophagus degradans 2–40 genes. The extent of the correlation between HGC1, HGC2,andHGC3 and Ew is represented by Spearman’s rank correlation coefficient (r).

The highest r value of HGC3 (0.95) was found in G. metallireducens, with a genomic G + C content of 60% (Figure 2(c)). The lowest r value of HGC3 (−0.07) was found in S. degradans, with a genomic G + C content of 46% (Figure 2(f)). The mean and standard deviation of the r values of HGC3 for G + C-poor bacteria (with genomic G + C contents less than 40%) were 0.58 and 0.12, respectively. The corresponding values for G + C-rich bacteria (with genomic G + C contents greater than 60%)

were 0.86 and 0.04. Thus, the r values of HGC3 for G + C-poor bacteria tended to be lower than those for G + C-rich bacteria.

Figure 3: Pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 bacterial genomes. Comparison of the correlation with Ew of (a) HGC3 and HGC1, (b) HGC3 and HGC2, and (c) HGC1 and HGC2.

Figure 4: Histograms of the distribution of r values of (a) HGC1, (b) HGC2, and (c) HGC3 with Ew for 371 bacterial genomes (number of genomes against r).

4. DISCUSSION

Other investigators have reported that G + C composition is correlated with synonymous codon usage bias in many organisms. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we quantified the strength of the correlation of G + C composition bias (HGC1, HGC2, and HGC3) with synonymous codon usage bias (Ew) by using a correlation coefficient (r). This approach allowed us to quantitatively compare the strength of this correlation among different genomes.

Figure 5: Scatter plots of the r values of HGC3 with Ew plotted against (a) genomic G + C content and (b) tRNA gene number for 371 bacterial genomes.

In a previous analysis of the relationships between G + C composition and synonymous codon usage bias, Wan et al. [9] stated that "GC3 was the most important factor in codon bias among GC, GC1, GC2, and GC3." This is quantitatively supported by the pairwise comparison of the r values of HGC1, HGC2, and HGC3 (Figure 3). However, the statement by Wan et al. that "GC3 is the key factor driving synonymous codon usage and that this mechanism is independent of species" differs from our conclusion that the strength of the correlation of GC3 with synonymous codon usage bias (the r value of HGC3) varies widely among species (Figure 4(c)). This discordance appears to have arisen because Wan et al. combined the genes from different genomes into a single dataset for their analysis. This analysis of combined data from different genomes masks the presence of genomes in which the correlation of GC3 with synonymous codon usage bias is negligible (such as that of S. degradans; Figure 2(f)); the results are thus inconsistent with those of the more detailed analyses obtained here for individual genomes.

Three factors, G + C composition, replication strand bias, and translational selection, are well documented to shape synonymous codon usage bias [1].

First, in bacteria with extreme genomic G + C compositions (either G + C-rich or A + T-rich), synonymous codon usage could be dominated by strong mutational bias (toward G + C or A + T) [17, 18]. The data in Figure 5(a) indicate that, although genomic G + C content was nonlinearly correlated with the r value of HGC3, there are some exceptions; for example, Nanoarchaeum equitans Kin4-M and Mycoplasma genitalium G37 had identical genomic G + C contents of 32% but very different r values of HGC3 (0.34 and 0.87, resp.), and Thermococcus kodakarensis KOD1 had a genomic G + C content of around 50% but a high r value of HGC3 (0.86). The existence of the outliers suggests that, although mutational biases have a major influence on the correlation of GC3 with synonymous codon usage bias, other evolutionary factors may play a part. For example, horizontal gene transfer among bacteria with different genomic G + C content can contribute to intragenomic variation in G + C content [19, 20].

Second, the spirochaete Borrelia burgdorferi exhibits a strong base usage skew between leading and lagging strands of replication (generally inferred as reflecting strand-specific mutational bias): genes on the leading strand tend to preferentially use G- or T-ending codons [21]. The r values of HGC3 for genes on the leading and lagging strands are similar (0.65 and 0.63, resp.). This suggests that strand bias has little influence on the correlation of GC3 with synonymous codon usage bias in B. burgdorferi.

Third, in bacteria with more tRNA genes, synonymous codon usage could be subject to stronger translational selection [22]. Figure 5(b) shows that tRNA gene copy number was not correlated with the r value of HGC3. This suggests that translational selection has little influence on the correlation of GC3 with synonymous codon usage bias. Sharp et al. [22] showed that the S value as a measure of translationally selected codon usage bias is highly correlated with tRNA gene copy number but is not correlated with genomic G + C content. Thus, the r value of HGC3 can be used as a measure complementary to the S value.

The most accepted hypothesis for the unequal usage of synonymous codons in bacterial genomes is that the unequal usage is the result of a very complex balance among different evolutionary forces (mutation and selection) [23]. The combined use of the r value and other methods (e.g., the S value) will improve our understanding of the relative contributions of different evolutionary forces to synonymous codon usage bias.

ABBREVIATIONS

A: Adenine
T: Thymine
G: Guanine
C: Cytosine
GC1: G + C content at the first codon position
GC2: G + C content at the second codon position
GC3: G + C content at the third codon position
HGC1: Entropy of GC1
HGC2: Entropy of GC2
HGC3: Entropy of GC3
Ew: Weighted sum of relative entropy
r: Spearman's rank correlation coefficient

ACKNOWLEDGMENTS

The authors thank Dr Kazuharu Arakawa (Institute for Advanced Biosciences, Keio University) for his technical advice on the G-language genome analysis environment, and Kunihiro Baba (Faculty of Policy Management, Keio University) for his technical advice on the R statistical computing environment. This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan Grant-in-Aid for the 21st Century Centre of Excellence (COE) Program entitled "Understanding and Control of Life via Systems Biology" (Keio University).

REFERENCES

[1] M. D. Ermolaeva, "Synonymous codon usage in bacteria," Current Issues in Molecular Biology, vol. 3, no. 4, pp. 91–97, 2001.
[2] A. Carbone, F. Kepes, and A. Zinovyev, "Codon bias signatures, organization of microorganisms in codon space, and lifestyle," Molecular Biology and Evolution, vol. 22, no. 3, pp. 547–561, 2005.
[3] A. Carbone, A. Zinovyev, and F. Képès, "Codon adaptation index as a measure of dominating codon bias," Bioinformatics, vol. 19, no. 16, pp. 2005–2015, 2003.
[4] R. D. Knight, S. J. Freeland, and L. F. Landweber, "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes," Genome Biology, vol. 2, no. 4, pp. research0010.1–research0010.13, 2001.
[5] J. R. Lobry and A. Necşulea, "Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes," Gene, vol. 385, pp. 128–136, 2006.
[6] D. J. Lynn, G. A. C. Singer, and D. A. Hickey, "Synonymous codon usage is subject to selection in thermophilic bacteria," Nucleic Acids Research, vol. 30, no. 19, pp. 4272–4277, 2002.
[7] G. A. C. Singer and D. A. Hickey, "Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content," Gene, vol. 317, no. 1-2, pp. 39–47, 2003.
[8] H. Suzuki, R. Saito, and M. Tomita, "A problem in multivariate analysis of codon usage data and a possible solution," FEBS Letters, vol. 579, no. 28, pp. 6499–6504, 2005.
[9] X.-F. Wan, D. Xu, A. Kleinhofs, and J. Zhou, "Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes," BMC Evolutionary Biology, vol. 4, p. 19, 2004.
[10] F. Wright, "The 'effective number of codons' used in a gene," Gene, vol. 87, no. 1, pp. 23–29, 1990.
[11] B. Zeeberg, "Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes," Genome Research, vol. 12, no. 6, pp. 944–955, 2002.
[12] H. Suzuki, R. Saito, and M. Tomita, "The 'weighted sum of relative entropy': a new index for synonymous codon usage bias," Gene, vol. 335, no. 1-2, pp. 19–23, 2004.
[13] K. Arakawa, K. Mori, K. Ikeda, T. Matsuzaki, Y. Kobayashi, and M. Tomita, "G-language genome analysis environment: a workbench for nucleotide sequence data mining," Bioinformatics, vol. 19, no. 2, pp. 305–306, 2003.
[14] R Development Core Team, R: a language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2006.
[15] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, "GenBank," Nucleic Acids Research, vol. 35, supplement 1, pp. D21–D25, 2007.
[16] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[17] A. Muto and S. Osawa, "The guanine and cytosine content of genomic DNA and bacterial evolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 1, pp. 166–169, 1987.
[18] N. Sueoka, "On the genetic basis of variation and heterogeneity of DNA base composition," Proceedings of the National Academy of Sciences of the United States of America, vol. 48, no. 4, pp. 582–592, 1962.
[19] S. Garcia-Vallve, A. Romeu, and J. Palau, "Horizontal gene transfer in bacterial and archaeal complete genomes," Genome Research, vol. 10, no. 11, pp. 1719–1725, 2000.
[20] R. J. Grocock and P. M. Sharp, "Synonymous codon usage in Pseudomonas aeruginosa PA01," Gene, vol. 289, no. 1-2, pp. 131–139, 2002.
[21] J. O. McInerney, "Replicational and transcriptional selection on codon usage in Borrelia burgdorferi," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 18, pp. 10698–10703, 1998.
[22] P. M. Sharp, E. Bailes, R. J. Grocock, J. F. Peden, and R. E. Sockett, "Variation in the strength of selected codon usage bias among bacteria," Nucleic Acids Research, vol. 33, no. 4, pp. 1141–1153, 2005.
[23] P. M. Sharp, M. Stenico, J. F. Peden, and A. T. Lloyd, "Codon usage: mutational bias, translational selection, or both?" Biochemical Society Transactions, vol. 21, no. 4, pp. 835–841, 1993.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 79879, 9 pages doi:10.1155/2007/79879

Research Article Information-Theoretic Inference of Large Transcriptional Regulatory Networks

Patrick E. Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi

ULB Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, 1050 Brussels, Belgium

Received 26 January 2007; Accepted 12 May 2007

Recommended by Juho Rousu

The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting among the least redundant variables the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.

Copyright © 2007 Patrick E. Meyer et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Two important issues in computational biology are the extent to which it is possible to model transcriptional interactions by large networks of interacting elements and how these interactions can be effectively learned from measured expression data [1]. The reverse engineering of transcriptional regulatory networks (TRNs) from expression data alone is far from trivial because of the combinatorial nature of the problem and the poor information content of the data [1]. An additional problem is that by focusing only on transcript data, the inferred network should not be considered as a biochemical regulatory network but as a gene-to-gene network, where many physical connections between macromolecules might be hidden by shortcuts.

In spite of these evident limitations, the bioinformatics community made important advances in this domain over the last few years. Examples are methods like Boolean networks, Bayesian networks, and Association networks [2].

This paper will focus on information-theoretic approaches [3–6] which typically rely on the estimation of mutual information from expression data in order to measure the statistical dependence between variables (the terms "variable" and "feature" are used interchangeably in this paper). Such methods have recently held the attention of the bioinformatics community for the inference of very large networks [4–6].

The adoption of mutual information in probabilistic model design can be traced back to the Chow-Liu tree algorithm [3] and its extensions proposed by [7, 8]. Later, [9, 10] suggested to improve network inference by using another information-theoretic quantity, namely multi-information.

This paper introduces an original information-theoretic method, called MRNET, inspired by a recently proposed feature selection technique, the maximum relevance/minimum redundancy (MRMR) algorithm [11, 12]. This algorithm has been used with success in supervised classification problems to select a set of nonredundant genes which are explicative of the targeted phenotype [12, 13]. The MRMR selection strategy consists in selecting a set of variables that has a high mutual information with the target variable (maximum relevance) and at the same time are mutually maximally independent (minimum redundancy between relevant variables). The advantage of this approach is that redundancy among selected variables is avoided and that the trade-off between relevance and redundancy is properly taken into account.

Our proposed MRNET strategy, preliminarily sketched in [14], consists of (i) formulating the network inference problem as a series of input/output supervised gene selection procedures, where one gene at the time plays the role of the target output, and (ii) adopting the MRMR principle to perform the gene selection for each supervised gene selection procedure.

The paper benchmarks MRNET against three state-of-the-art information-theoretic network inference methods, namely relevance networks (RELNET), CLR, and ARACNE. The comparison relies on thirty artificial microarray datasets synthesized by two public-domain generators. The extensive simulation setting allows us to study the effect of the number of samples, the number of genes, and the noise intensity on the inferred network accuracy. Also, the sensitivity of the performance to two alternative entropy estimators is assessed.

The outline of the paper is as follows. Section 2 reviews the state-of-the-art network inference techniques based on information theory. Section 3 introduces our original approach based on MRMR. The experimental framework and the results obtained on artificially generated datasets are presented in Sections 4 and 5, respectively. Section 6 concludes the paper.

2. INFORMATION-THEORETIC NETWORK INFERENCE: STATE OF THE ART

This section reviews some state-of-the-art methods for network inference which are based on information-theoretic notions.

These methods require at first the computation of the mutual information matrix (MIM), a square matrix whose i, j element

MIM_{ij} = I(X_i; X_j) = \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)},   (1)

is the mutual information between X_i and X_j, where X_i ∈ X, i = 1, ..., n, is a discrete random variable denoting the expression level of the ith gene.

2.1. Chow-Liu tree

The Chow and Liu approach consists in finding the maximum spanning tree (MST) of a complete graph, where the weights of the edges are the mutual information quantities between the connected nodes [3]. The construction of the [...]

2.2. Relevance networks (RELNET)

[...] threshold I_0. The complexity of the method is O(n^2) since all pairwise interactions are considered.

Note that this method is prone to infer false positives in the case of indirect interactions between genes. For example, if gene X_1 regulates both gene X_2 and gene X_3, a high mutual information between the pairs {X_1, X_2}, {X_1, X_3}, and {X_2, X_3} would be present. As a consequence, the algorithm would infer an edge between X_2 and X_3 although these two genes interact only through gene X_1.

2.3. CLR algorithm

The CLR algorithm [6] is an extension of RELNET. This algorithm computes the mutual information (MI) for each pair of genes and derives a score related to the empirical distribution of these MI values. In particular, instead of considering the information I(X_i; X_j) between genes X_i and X_j, it takes into account the score z_{ij} = \sqrt{z_i^2 + z_j^2}, where

z_i = \max\left(0, \frac{I(X_i; X_j) - \mu_i}{\sigma_i}\right),   (2)

and μ_i and σ_i are, respectively, the mean and the standard deviation of the empirical distribution of the mutual information values I(X_i, X_k), k = 1, ..., n. The CLR algorithm was successfully applied to decipher the E. coli TRN [6]. Note that, like RELNET, CLR demands an O(n^2) cost to infer the network from a given MIM.

2.4. ARACNE

The algorithm for the reconstruction of accurate cellular networks (ARACNE) [5] is based on the data processing inequality [16]. This inequality states that if gene X_1 interacts with gene X_3 through gene X_2, then

I(X_1; X_3) \leq \min\left( I(X_1; X_2),\, I(X_2; X_3) \right).   (3)

The ARACNE procedure starts by assigning to each pair of nodes a weight equal to their mutual information. Then, as in RELNET, all edges for which I(X_i; X_j) [...]
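To make the matrix-based formulation concrete, the sketch below builds a mutual information matrix from discretized expression data with a simple plug-in (empirical) estimator and then derives CLR-style scores in the spirit of (2). This is only an illustrative sketch under our own assumptions (plug-in estimator, histogram discretization, inclusion of the zero diagonal in the background statistics, function names); it is not the implementation used in the paper.

```python
import numpy as np

def plugin_mutual_information(x, y, n_bins=10):
    """Empirical (plug-in) MI, in nats, between two discretized expression profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def mutual_information_matrix(data):
    """MIM of eq. (1) over an (n_samples, n_genes) expression matrix."""
    n_genes = data.shape[1]
    mim = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            mim[i, j] = mim[j, i] = plugin_mutual_information(data[:, i], data[:, j])
    return mim

def clr_scores(mim):
    """CLR-style combined z-scores derived from a MIM, cf. eq. (2)."""
    mu = mim.mean(axis=1, keepdims=True)
    sigma = mim.std(axis=1, keepdims=True) + 1e-12
    z = np.maximum(0.0, (mim - mu) / sigma)   # z_i: row-wise background z-score
    return np.sqrt(z ** 2 + z.T ** 2)         # combine the two directions into z_ij
```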

Figure 1: An artificial microarray dataset is generated from an original network. The inferred network can then be compared to this true network. (Blocks shown in the figure: network and data generator; original network; artificial dataset; entropy estimator; mutual information matrix; inference method; inferred network; validation procedure; precision-recall curves and F-scores.)

3. OUR PROPOSAL: MINIMUM REDUNDANCY NETWORKS (MRNET)

We propose to infer a network using the maximum relevance/minimum redundancy (MRMR) feature selection method. The idea consists in performing a series of supervised MRMR gene selection procedures, where each gene in turn plays the role of the target output.

The MRMR method has been introduced in [11, 12] together with a best-first search strategy for performing filter selection in supervised learning problems. Consider a supervised learning task, where the output is denoted by Y and V is the set of input variables. The method ranks the set of inputs according to a score that is the difference between the mutual information with the output variable Y (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e., the most informative variables to the target Y) should be well ranked, whereas indirect interactions (i.e., the ones with redundant information with the direct ones) should be badly ranked by the method.

The greedy search starts by selecting the variable X_i having the highest mutual information to the target Y. The second selected variable X_j will be the one with a high information I(X_j; Y) to the target and at the same time a low information I(X_j; X_i) to the previously selected variable. In the following steps, given a set S of selected variables, the criterion updates S by choosing the variable

X_j^{MRMR} = \arg\max_{X_j \in V \setminus S} \left( u_j - r_j \right),   (4)

that maximizes the score

s_j = u_j - r_j,   (5)

where u_j is a relevance term and r_j is a redundancy term. More precisely,

u_j = I(X_j; Y),   (6)

is the mutual information of X_j with the target variable Y, and

r_j = \frac{1}{|S|} \sum_{X_k \in S} I(X_j; X_k),   (7)

measures the average redundancy of X_j to each already selected variable X_k ∈ S. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. It has been shown in [12] that the MRMR criterion is an optimal "pairwise" approximation of the conditional mutual information between any two genes X_j and Y given the set S of selected variables, I(X_j; Y | S).

The MRNET approach consists in repeating this selection procedure for each target gene by putting Y = X_i and V = X \ {X_i}, i = 1, ..., n, where X is the set of the expression levels of all genes. For each pair {X_i, X_j}, MRMR returns two (not necessarily equal) scores s_i and s_j according to (5). The score of the pair {X_i, X_j} is then computed by taking the maximum of s_i and s_j. A specific network can then be inferred by deleting all the edges whose score lies below a given threshold I_0 (as in RELNET, CLR, and ARACNE). Thus, the algorithm infers an edge between X_i and X_j either when X_i is a well-ranked predictor of X_j (s_i > I_0) or when X_j is a well-ranked predictor of X_i (s_j > I_0).

An effective implementation of the MRMR best-first search is available in [17]. This implementation demands an O(f × n) complexity for selecting f features using a best-first search strategy. It follows that MRNET has an O(f × n^2) complexity since the feature selection step is repeated for each of the n genes. In other terms, the complexity ranges between O(n^2) and O(n^3) according to the value of f. Note that the lower the f value, the lower the number of incoming edges per node to infer and consequently the lower the resulting complexity.

Note that since mutual information is a symmetric measure, it is not possible to derive the direction of the edge from its weight. This limitation is common to all the methods presented so far. However, this information could be provided by edge orientation algorithms (e.g., IC) commonly used in Bayesian networks [7].
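The following minimal sketch illustrates the per-target MRMR ranking and the symmetric MRNET scoring described above, starting from a precomputed MIM (for example, the one from the earlier sketch). The stopping rule (ranking all n − 1 candidates), the thresholding helper, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mrmr_scores(mim, target):
    """Greedy MRMR ranking of all candidate genes against one target gene.

    Returns, for every gene j != target, the score s_j = u_j - r_j
    (eqs. (4)-(7)) attained at the step at which j was selected."""
    n = mim.shape[0]
    candidates = [j for j in range(n) if j != target]
    selected, scores = [], np.full(n, -np.inf)
    while candidates:
        best_j, best_s = None, -np.inf
        for j in candidates:
            u_j = mim[j, target]                                   # relevance, eq. (6)
            r_j = np.mean(mim[j, selected]) if selected else 0.0   # redundancy, eq. (7)
            s_j = u_j - r_j                                        # eq. (5)
            if s_j > best_s:
                best_j, best_s = j, s_j
        scores[best_j] = best_s
        selected.append(best_j)
        candidates.remove(best_j)
    return scores

def mrnet(mim, threshold):
    """Symmetric MRNET adjacency: keep edge (i, j) if max(s_i, s_j) > threshold."""
    n = mim.shape[0]
    pair_scores = np.full((n, n), -np.inf)
    for target in range(n):
        s = mrmr_scores(mim, target)
        for j in range(n):
            if j != target:
                pair_scores[target, j] = pair_scores[j, target] = max(
                    pair_scores[target, j], s[j])
    return pair_scores > threshold
```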

4. EXPERIMENTS

The experimental framework consists of four steps (see Figure 1): the artificial network and data generation, the computation of the mutual information matrix, the inference of the network, and the validation of the results. This section details each step of the approach.

4.1. Network and data generation

In order to assess the results returned by our algorithm and compare it to other methods, we created a set of benchmarks on the basis of artificially generated microarray datasets. In spite of the evident limitations of using synthetic data, this makes possible a quantitative assessment of the accuracy, thanks to the availability of the true network underlying the microarray dataset (see Figure 1).

We used two different generators of artificial gene expression data: the data generator described in [18] (hereafter referred to as the sRogers generator) and the SynTReN generator [19]. The two generators, whose implementations are freely available on the World Wide Web, are sketched in the following paragraphs.

sRogers generator

The sRogers generator produces the topology of the genetic network according to an approximate power-law distribution on the number of regulatory connections out of each gene. The normal steady state of the system is evaluated by integrating a system of differential equations. The generator offers the possibility to obtain 2k different measures (k wild type and k knock out experiments). These measures can be replicated R times, yielding a total of N = 2kR samples. After the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.

SynTReN generator

The SynTReN generator generates a network topology by selecting subnetworks from E. coli and S. cerevisiae source networks. Then, transition functions and their parameters are assigned to the edges in the network. Eventually, mRNA expression levels for the genes in the network are obtained by simulating equations based on Michaelis-Menten and Hill kinetics under different conditions. As for the previous generator, after the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.

Generation

The two generators were used to synthesize thirty datasets. Table 1 reports for each dataset the number n of genes, the number N of samples, and the Gaussian noise intensity (expressed as a percentage of the signal variance).

4.2. Mutual information matrix estimation

In order to benchmark MRNET versus RELNET, CLR, and ARACNE, the same MIM is used for the four inference approaches. Several estimators of mutual information have been proposed in literature [5, 6, 20, 21]. Here, we test the Miller-Madow entropy estimator [20] and a parametric Gaussian density estimator. Since the Miller-Madow method requires quantized values, we pretreated the data with the equal-sized intervals algorithm [22], where the size l = √N. The parametric Gaussian estimator is directly computed by I(X_i, X_j) = (1/2) log(σ_{ii} σ_{jj} / |C|), where |C| is the determinant of the covariance matrix. Note that the complexity of both estimators is O(N), where N is the number of samples. This means that since the whole MIM cost is O(N × n^2), the MIM computation could be the bottleneck of the whole network inference procedure for a large number of samples (N ≫ n). We deem, however, that at the current state of the technology, this should not be considered as a major issue since the number of samples is typically much smaller than the number of measured features.

4.3. Validation

A network inference problem can be seen as a binary decision problem, where the inference algorithm plays the role of a classifier: for each pair of nodes, the algorithm either adds an edge or does not. Each pair of nodes is thus assigned a positive label (an edge) or a negative one (no edge).

A positive label (an edge) predicted by the algorithm is considered as a true positive (TP) or as a false positive (FP) depending on the presence or not of the corresponding edge in the underlying true network, respectively. Analogously, a negative label is considered as a true negative (TN) or a false negative (FN) depending on whether the corresponding edge is present or not in the underlying true network, respectively. The decision made by the algorithm can be summarized by a confusion matrix (see Table 2).

It is generally recommended [23] to use receiver operator characteristic (ROC) curves when evaluating binary decision problems in order to avoid effects related to the chosen threshold. However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution, as typically encountered in TRN inference because of sparseness.

To tackle this problem, precision-recall (PR) curves have been cited as an alternative to ROC curves [24]. Let the precision quantity

p = \frac{TP}{TP + FP},   (8)

measure the fraction of real edges among the ones classified as positive and the recall quantity

r = \frac{TP}{TP + FN},   (9)

also known as true positive rate, denote the fraction of real edges that are correctly inferred. These quantities depend on the threshold chosen to return a binary decision. The PR curve is a diagram which plots the precision (p) versus recall (r) for different values of the threshold on a two-dimensional coordinate system.
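A small sketch of the validation step of equations (8)–(9): it turns a symmetric score matrix and the known true adjacency matrix into precision-recall points across thresholds. The array layout and the function name are assumptions for illustration, not part of the paper.

```python
import numpy as np

def precision_recall_curve(scores, truth):
    """Precision and recall (eqs. (8)-(9)) for a range of thresholds.

    `scores` and `truth` are symmetric (n, n) arrays; `truth` is the adjacency
    matrix of the generating (true) network and is assumed to contain at least
    one edge. Each undirected pair is counted once."""
    iu = np.triu_indices_from(scores, k=1)
    s, t = scores[iu], truth[iu].astype(bool)
    points = []
    for threshold in np.unique(s):
        predicted = s >= threshold
        tp = np.sum(predicted & t)
        fp = np.sum(predicted & ~t)
        fn = np.sum(~predicted & t)
        if tp + fp == 0:
            continue
        points.append((tp / (tp + fn), tp / (tp + fp)))   # (recall, precision)
    return points
```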

Table 1: Datasets with n the number of genes and N the number of samples.

Dataset  Generator  Topology        n     N     Noise
RN1      sRogers    Power-law tail  700   700   0%
RN2      sRogers    Power-law tail  700   700   5%
RN3      sRogers    Power-law tail  700   700   10%
RN4      sRogers    Power-law tail  700   700   20%
RN5      sRogers    Power-law tail  700   700   30%
RS1      sRogers    Power-law tail  700   100   0%
RS2      sRogers    Power-law tail  700   300   0%
RS3      sRogers    Power-law tail  700   500   0%
RS4      sRogers    Power-law tail  700   800   0%
RS5      sRogers    Power-law tail  700   1000  0%
RV1      sRogers    Power-law tail  100   700   0%
RV2      sRogers    Power-law tail  300   700   0%
RV3      sRogers    Power-law tail  500   700   0%
RV4      sRogers    Power-law tail  700   700   0%
RV5      sRogers    Power-law tail  1000  700   0%
SN1      SynTReN    S. cerevisiae   400   400   0%
SN2      SynTReN    S. cerevisiae   400   400   5%
SN3      SynTReN    S. cerevisiae   400   400   10%
SN4      SynTReN    S. cerevisiae   400   400   20%
SN5      SynTReN    S. cerevisiae   400   400   30%
SS1      SynTReN    S. cerevisiae   400   100   0%
SS2      SynTReN    S. cerevisiae   400   200   0%
SS3      SynTReN    S. cerevisiae   400   300   0%
SS4      SynTReN    S. cerevisiae   400   400   0%
SS5      SynTReN    S. cerevisiae   400   500   0%
SV1      SynTReN    S. cerevisiae   100   400   0%
SV2      SynTReN    S. cerevisiae   200   400   0%
SV3      SynTReN    S. cerevisiae   300   400   0%
SV4      SynTReN    S. cerevisiae   400   400   0%
SV5      SynTReN    S. cerevisiae   500   400   0%

Table 2: Confusion matrix.

                      Actual positive   Actual negative
Inferred positive     TP                FP
Inferred negative     FN                TN

Note that a compact representation of the PR diagram is returned by the maximum of the F-score quantity

F = \frac{2pr}{r + p},   (10)

which is a weighted harmonic average of precision and recall. The following section will present the results by means of PR curves and F-scores.

Also, in order to assess the significance of the results, a McNemar test can be performed. The McNemar test [25] states that if two algorithms A and B have the same error rate, then

P\left( \frac{\left( |N_{AB} - N_{BA}| - 1 \right)^2}{N_{AB} + N_{BA}} > 3.841459 \right) < 0.05,   (11)

where N_{AB} is the number of incorrect edges of the network inferred from algorithm A that are correct in the network inferred from algorithm B, and N_{BA} is the counterpart.
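The summary F-score of (10) and the significance check of (11) can be sketched as follows, reusing the (recall, precision) points of the earlier validation sketch; this toy code and its function names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def max_f_score(pr_points):
    """Maximum F-score (eq. (10)) over the (recall, precision) points of a PR curve."""
    return max(2 * p * r / (p + r) for r, p in pr_points if p + r > 0)

def mcnemar_differs(pred_a, pred_b, truth):
    """McNemar test (eq. (11)) on two inferred edge sets against the true network.

    Returns True when the 5% critical value of the chi-square distribution with
    one degree of freedom (3.841459) is exceeded, i.e. when the error rates of
    the two inference algorithms differ significantly."""
    iu = np.triu_indices_from(truth, k=1)
    a_wrong = pred_a[iu] != truth[iu]
    b_wrong = pred_b[iu] != truth[iu]
    n_ab = np.sum(a_wrong & ~b_wrong)     # A wrong where B is right
    n_ba = np.sum(~a_wrong & b_wrong)     # B wrong where A is right
    if n_ab + n_ba == 0:
        return False
    statistic = (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)
    return statistic > 3.841459
```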

5. RESULTS AND DISCUSSION

A thorough comparison would require the display of the PR-curves (Figure 2) for each dataset. For reason of space, we decided to summarize the PR-curve information by the maximum F-score in Table 3. Note that for each dataset, the accuracy of the best methods (i.e., those whose score is not significantly lower than the highest one according to the McNemar test) is typed in boldface.

Figure 2: PR-curves (precision versus recall) for the RS3 dataset using the Miller-Madow estimator, for MRNET, ARACNE, CLR, and RELNET. The curves are obtained by varying the rejection/acceptation threshold.

We may summarize the results as follows.

Accuracy sensitivity to the number of variables. The number of variables ranges from 100 to 1000 for the datasets RV1, RV2, RV3, RV4, and RV5, and from 100 to 500 for the datasets SV1, SV2, SV3, SV4, and SV5. Figure 3 shows that the accuracy and the number of variables of the network are weakly negatively correlated. This appears to be true independently of the inference method and of the MI estimator.

Figure 3: Influence of the number of variables on accuracy (SynTReN SV datasets, Miller-Madow estimator; F-score against number of genes at 400 samples).

Accuracy sensitivity to the number of samples. The number of samples ranges from 100 to 1000 for the datasets RS1, RS2, RS3, RS4, and RS5, and from 100 to 500 for the datasets SS1, SS2, SS3, SS4, and SS5. Figure 4 shows how the accuracy is strongly and positively correlated to the number of samples.

Figure 4: Influence of the number of samples on accuracy (sRogers RS datasets, Gaussian estimator; F-score against number of samples at 700 genes).

Accuracy sensitivity to the noise intensity. The intensity of noise ranges from 0% to 30% for the datasets RN1, RN2, RN3, RN4, and RN5, and for the datasets SN1, SN2, SN3, SN4, and SN5. The performance of the methods using the Miller-Madow entropy estimator decreases significantly with the increasing noise, whereas the Gaussian estimator appears to be more robust (see Figure 5).

Accuracy sensitivity to the MI estimator. We can observe in Figure 6 that the Gaussian parametric estimator gives better results than the Miller-Madow estimator. This is particularly evident with the sRogers datasets.

Accuracy sensitivity to the data generator. The SynTReN generator produces datasets for which the inference task appears to be harder, as shown in Table 3.

Accuracy of the inference methods. Table 3 supports the following three considerations: (i) MRNET is competitive with the other approaches, (ii) ARACNE outperforms the other approaches when the Gaussian estimator is used, and (iii) MRNET and CLR are the two best techniques when the nonparametric Miller-Madow estimator is used.

Table 3: Maximum F-scores for each inference method using two different mutual information estimators. The best methods (those having a score not significantly weaker than the best score, i.e., P-value < .05) are typed in boldface. Average performances on SynTReN and sRogers datasets are reported, respectively, in the S-AVG and R-AVG lines.

          Miller-Madow                        Gaussian
          RELNET  CLR   ARACNE  MRNET    RELNET  CLR   ARACNE  MRNET
SN1       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SN2       0.23    0.26  0.29    0.29     0.21    0.25  0.31    0.25
SN3       0.23    0.25  0.24    0.26     0.21    0.25  0.31    0.26
SN4       0.22    0.24  0.26    0.26     0.21    0.25  0.28    0.26
SN5       0.21    0.23  0.24    0.24     0.2     0.25  0.27    0.24
SS1       0.21    0.22  0.22    0.23     0.19    0.24  0.24    0.23
SS2       0.21    0.24  0.28    0.29     0.2     0.24  0.27    0.25
SS3       0.21    0.24  0.27    0.28     0.2     0.24  0.28    0.25
SS4       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SS5       0.22    0.24  0.28    0.29     0.21    0.24  0.3     0.26
SV1       0.32    0.36  0.41    0.39     0.3     0.4   0.44    0.38
SV2       0.25    0.28  0.35    0.33     0.25    0.35  0.36    0.32
SV3       0.21    0.24  0.3     0.28     0.21    0.28  0.3     0.27
SV4       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SV5       0.24    0.23  0.29    0.29     0.22    0.24  0.31    0.26
S-AVG     0.23    0.25  0.28    0.28     0.21    0.26  0.30    0.27
RN1       0.59    0.65  0.6     0.61     0.89    0.87  0.92    0.93
RN2       0.5     0.57  0.5     0.49     0.89    0.87  0.92    0.92
RN3       0.5     0.55  0.5     0.52     0.89    0.87  0.92    0.92
RN4       0.46    0.51  0.47    0.47     0.89    0.87  0.92    0.91
RN5       0.42    0.46  0.41    0.4      0.88    0.86  0.91    0.91
RS1       0.1     0.11  0.09    0.1      0.19    0.19  0.19    0.18
RS2       0.35    0.32  0.31    0.31     0.45    0.44  0.47    0.46
RS3       0.38    0.32  0.36    0.38     0.58    0.56  0.6     0.6
RS4       0.47    0.54  0.47    0.5      0.75    0.75  0.8     0.79
RS5       0.58    0.68  0.6     0.64     0.9     0.86  0.93    0.93
RV1       0.52    0.38  0.46    0.46     0.72    0.75  0.72    0.72
RV2       0.49    0.53  0.49    0.53     0.71    0.71  0.71    0.71
RV3       0.45    0.5   0.45    0.48     0.69    0.69  0.71    0.71
RV4       0.47    0.51  0.48    0.48     0.69    0.7   0.74    0.72
RV5       0.47    0.52  0.47    0.48     0.7     0.68  0.74    0.73
R-AVG     0.45    0.48  0.44    0.46     0.72    0.71  0.74    0.74
Tot-AVG   0.34    0.36  0.36    0.37     0.47    0.49  0.52    0.51

5.1. Feature selection techniques in network inference

As shown experimentally in the previous section, MRNET is competitive with the state-of-the-art techniques. Furthermore, MRNET benefits from some additional properties which are common to all the feature selection strategies for network inference [26, 27], as follows.

(1) Feature selection algorithms can often deal with thousands of variables in a reasonable amount of time. This makes inference scalable to large networks.

(2) Feature selection algorithms may be easily made parallel, since each of the n selection tasks is independent.

(3) Feature selection algorithms may be made faster by a priori knowledge. For example, knowing the list of regulator genes of an organism improves the selection speed and the inference quality by limiting the search space of the feature selection step to this small list of genes. The knowledge of existing edges can also improve the inference. For example, in a sequential selection process, as in the forward selection used with MRMR, the next variable is selected given the already selected features. As a result, the performance of the selection can be strongly improved by conditioning on known relationships.

However, there is a disadvantage in using a feature selection technique for network inference. The objective of feature selection is selecting, among a set of input variables, the ones that will lead to the best predictive model. It has been

proved in [28] that the minimum set that achieves optimal classification accuracy under certain general conditions is the Markov blanket of a target variable. The Markov blanket of a target variable is composed of the variable's parents, the variable's children, and the variable's children's parents [7]. The latter are indirect relationships. In other words, these variables have a conditional mutual information to the target variable Y higher than their mutual information. Let us consider the following example. Let Y and X_i be independent random variables, and X_j = X_i + Y (see Figure 7). Since the variables are independent, I(X_i; Y) = 0, and the conditional mutual information is higher than the mutual information, that is, I(X_i; Y | X_j) > 0. It follows that X_i has some information to Y given X_j but no information to Y taken alone. This behavior is colloquially referred to as the explaining-away effect in the Bayesian network literature [7]. Selecting variables, like X_i, that take part in indirect interactions reduces the accuracy of the network inference task. However, since MRMR relies only on pairwise interactions, it does not take into account the gain in information due to conditioning. In our example, the MRMR algorithm, after having selected X_j, computes the score s_i = I(X_i; Y) − I(X_i; X_j), where I(X_i; Y) = 0 and I(X_i; X_j) > 0. This score is negative and X_i is likely to be badly ranked. As a result, the MRMR feature selection criterion is less exposed to this shortcoming of most feature selection techniques while sharing their interesting properties. Further experiments will focus on this aspect.

Figure 5: Influence of the noise on MRNET accuracy for the two MIM estimators, empirical and Gaussian (sRogers RN datasets; 700 genes, 700 samples).

Figure 6: Influence of the MI estimator on MRNET accuracy for the two MIM estimators, empirical and Gaussian (sRogers RS datasets; MRNET, 700 genes; F-score against number of samples).

Figure 7: Example of indirect relationship between X_i and Y (through X_j).
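The explaining-away example above can be checked numerically. The toy script below (our own illustration, not from the paper) takes X_i and Y as independent fair binary variables and X_j = X_i + Y, and reproduces I(X_i; Y) = 0, I(X_i; Y | X_j) > 0, and a negative MRMR score s_i.

```python
import itertools
from math import log2

# Joint distribution of (Xi, Y, Xj): Xi and Y are independent fair coins, Xj = Xi + Y.
states = [(xi, y, xi + y) for xi, y in itertools.product((0, 1), repeat=2)]
p = {s: 0.25 for s in states}

def marginal(indices):
    m = {}
    for s, q in p.items():
        key = tuple(s[i] for i in indices)
        m[key] = m.get(key, 0.0) + q
    return m

def entropy(margin):
    return -sum(q * log2(q) for q in margin.values() if q > 0)

def mi(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(marginal([a])) + entropy(marginal([b])) - entropy(marginal([a, b]))

def cond_mi(a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return (entropy(marginal([a, c])) + entropy(marginal([b, c]))
            - entropy(marginal([a, b, c])) - entropy(marginal([c])))

XI, Y, XJ = 0, 1, 2
print(mi(XI, Y))               # 0.0: Xi alone carries no information about Y
print(cond_mi(XI, Y, XJ))      # 0.5: but it does once Xj is known (explaining away)
print(mi(XI, Y) - mi(XI, XJ))  # -0.5: the MRMR score s_i is negative, so Xi ranks badly
```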

6. CONCLUSION AND FUTURE WORK

A new network inference method, MRNET, has been proposed. This method relies on an effective method of information-theoretic feature selection called MRMR. Similarly to other network inference methods, MRNET relies on pairwise interactions between genes, making possible the inference of large networks (up to several thousands of genes). Another advantage of MRNET, which could be exploited in future work, is its ability to benefit explicitly from a priori knowledge.

MRNET was compared experimentally to three state-of-the-art information-theoretic network inference methods, namely RELNET, CLR, and ARACNE, on thirty inference tasks. The microarray datasets were generated artificially with two different generators in order to effectively assess their inference power. Also, two different mutual information estimation methods were used. The experimental results showed that MRNET is competitive with the benchmarked information-theoretic methods.

Future work will focus on three main axes: (i) the assessment of additional mutual information estimators, (ii) the validation of the techniques on the basis of real microarray data, and (iii) a theoretical analysis of which conditions should be met for MRNET to reconstruct the true network.

ACKNOWLEDGMENT

This work was partially supported by the Communauté Française de Belgique under ARC Grant no. 04/09-307.

REFERENCES

[1] E. P. van Someren, L. F. A. Wessels, E. Backer, and M. J. T. Reinders, "Genetic network modeling," Pharmacogenomics, vol. 3, no. 4, pp. 507–525, 2002.
[2] T. S. Gardner and J. J. Faith, "Reverse-engineering transcription control networks," Physics of Life Reviews, vol. 2, no. 1, pp. 65–88, 2005.
[3] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[4] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," Pacific Symposium on Biocomputing, pp. 418–429, 2000.
[5] A. A. Margolin, I. Nemenman, K. Basso, et al., "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context," BMC Bioinformatics, vol. 7, supplement 1, p. S7, 2006.
[6] J. J. Faith, B. Hayete, J. T. Thaden, et al., "Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles," PLoS Biology, vol. 5, no. 1, p. e8, 2007.
[7] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[8] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu, "Learning Bayesian networks from data: an information-theory based approach," Artificial Intelligence, vol. 137, no. 1-2, pp. 43–90, 2002.
[9] E. Schneidman, S. Still, M. J. Berry II, and W. Bialek, "Network information and connected correlations," Physical Review Letters, vol. 91, no. 23, Article ID 238701, 4 pages, 2003.
[10] I. Nemenman, "Multivariate dependence, and genetic network inference," Tech. Rep. NSF-KITP-04-54, KITP, UCSB, Santa Barbara, Calif, USA, 2004.
[11] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[12] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.
[13] P. E. Meyer and G. Bontempi, "On the use of variable complementarity for feature selection in cancer classification," in Applications of Evolutionary Computing: EvoWorkshops, F. Rothlauf, J. Branke, S. Cagnoni, et al., Eds., vol. 3907 of Lecture Notes in Computer Science, pp. 91–102, Springer, Berlin, Germany, 2006.
[14] P. E. Meyer, K. Kontos, and G. Bontempi, "Biological network inference using redundancy analysis," in Proceedings of the 1st International Conference on Bioinformatics Research and Development (BIRD '07), pp. 916–927, Berlin, Germany, March 2007.
[15] A. J. Butte, P. Tamayo, D. Slonim, T. R. Golub, and I. S. Kohane, "Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 22, pp. 12182–12186, 2000.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1990.
[17] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, no. 2, pp. 197–213, 2002.
[18] S. Rogers and M. Girolami, "A Bayesian regression approach to the inference of regulatory networks from gene expression data," Bioinformatics, vol. 21, no. 14, pp. 3131–3137, 2005.
[19] T. van den Bulcke, K. van Leemput, B. Naudts, et al., "SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms," BMC Bioinformatics, vol. 7, p. 43, 2006.
[20] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[21] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. van der Meulen, "Nonparametric entropy estimation: an overview," Journal of Statistics, vol. 6, no. 1, pp. 17–39, 1997.
[22] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proceedings of the 12th International Conference on Machine Learning (ML '95), pp. 194–202, Lake Tahoe, Calif, USA, July 1995.
[23] F. J. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 445–453, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[24] J. Bockhorst and M. Craven, "Markov networks for detecting overlapping elements in sequence data," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds., pp. 193–200, MIT Press, Cambridge, Mass, USA, 2005.
[25] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[26] K. B. Hwang, J. W. Lee, S.-W. Chung, and B.-T. Zhang, "Construction of large-scale Bayesian networks by local to global search," in Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI '02), pp. 375–384, Tokyo, Japan, August 2002.
[27] I. Tsamardinos, C. Aliferis, and A. Statnikov, "Algorithms for large scale Markov blanket discovery," in Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference (FLAIRS '03), pp. 376–381, St. Augustine, Fla, USA, May 2003.
[28] I. Tsamardinos and C. Aliferis, "Towards principled feature selection: relevancy, filters and wrappers," in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AI&Stats '03), Key West, Fla, USA, January 2003.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 90947, 11 pages doi:10.1155/2007/90947

Research Article NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks

Petri Kontkanen, Hannes Wettig, and Petri Myllymäki

Complex Systems Computation Group (CoSCo), Helsinki Institute for Information Technology (HIIT), P.O. Box 68 (Department of Computer Science), FIN-00014 University of Helsinki, Finland

Received 1 March 2007; Accepted 30 July 2007

Recommended by Peter Grünwald

Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.

Copyright © 2007 Petri Kontkanen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data. Typical examples of this kind of problem are DNA sequence compression [1], microarray data clustering [2–4], and modeling of genetic networks [5]. The minimum description length (MDL) principle developed in the series of papers [6–8] is a well-founded, general framework for performing model class selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, that is, to find a description or code of it, such that this description uses less symbols than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can be equated with finding regularities in data. Consequently, we can say that the more we are able to compress the data, the more we have learned about them.

MDL model class selection is based on a quantity called stochastic complexity (SC), which is the description length of a given data relative to a model class. The stochastic complexity is defined via the normalized maximum likelihood (NML) distribution [8, 9]. For multinomial (discrete) data, this definition involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is called the regret or parametric complexity, and it can be interpreted as the amount of complexity of the model class. If the data is continuous, the sum is replaced by the corresponding integral.

The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model class selection and related tasks. It was originally [8, 10] formulated as the unique solution to a minimax problem presented in [9], which implied that NML is the minimax optimal universal model. Later [11], it was shown that NML is also the solution to a related problem involving expected regret. See Section 2 and [10–13] for more discussion on the theoretical properties of the NML.

Typical bioinformatic problems involve large discrete datasets. In order to apply NML for these tasks, one needs to develop suitable NML computation methods, since the normalizing sum or integral in the definition of NML is typically difficult to compute directly. In this paper, we present algorithms for efficient computation of NML for both one- and multidimensional discrete data. The model families used in the paper are so-called Bayesian networks (see, e.g., [14]) of varying complexity. A Bayesian network is a graphical representation of a joint distribution. The structure of the graph corresponds to certain conditional independence assumptions. Note that despite the name, having Bayesian network models does not necessarily imply using Bayesian statistics, and the information-theoretic approach of this paper cannot be considered Bayesian.

The problem of computing NML for discrete data has been studied before.
In [15], a linear-time algorithm for the one-dimensional multinomial case was derived. A more complex case involving a multidimensional model family, called naive Bayes, was discussed in [16]. Both these cases are also reviewed in this paper.

The paper is structured as follows. In Section 2, we discuss the basic properties of the MDL principle and the NML distribution. In Section 3, we instantiate the NML distribution for the multinomial case and present a linear-time computation algorithm. The topic of Section 4 is the naive Bayes model family. NML computation for an extension of naive Bayes, the so-called Bayesian forests, is discussed in Section 5. Finally, Section 6 gives some concluding remarks.

2. PROPERTIES OF THE MDL PRINCIPLE AND THE NML MODEL

The MDL principle has several desirable properties. Firstly, it automatically protects against overfitting in the model class selection process. Secondly, this statistical framework does not, unlike most other frameworks, assume that there exists some underlying "true" model. The model class is only used as a technical device for constructing an efficient code for describing the data. MDL is also closely related to Bayesian inference, but there are some fundamental differences, the most important being that MDL does not need any prior distribution; it only uses the data at hand. For more discussion on the theoretical motivations behind the MDL principle see, for example, [8, 10–13, 17].

The MDL model class selection is based on minimization of the stochastic complexity. In the following, we give the definition of the stochastic complexity and then proceed by discussing its theoretical properties.

2.1. Model classes and families

Let x^n = (x_1, ..., x_n) be a data sample of n outcomes, where each outcome x_j is an element of some space of observations X. The n-fold Cartesian product X × ··· × X is denoted by X^n, so that x^n ∈ X^n. Consider a set Θ ⊆ R^d, where d is a positive integer. A class of parametric distributions indexed by the elements of Θ is called a model class. That is, a model class M is defined as

  M = { P(· | θ) : θ ∈ Θ },   (1)

and the set Θ is called the parameter space.

Consider a set Φ ⊆ R^e, where e is a positive integer. Define a set F by

  F = { M(ϕ) : ϕ ∈ Φ }.   (2)

The set F is called a model family, and each of the elements M(ϕ) is a model class. The associated parameter space is denoted by Θ_ϕ. The model class selection problem can now be defined as a process of finding the parameter vector ϕ which is optimal according to some predetermined criteria. In Sections 3–5, we discuss three specific model families, which will make these definitions more concrete.

2.2. The NML distribution

One of the most theoretically and intuitively appealing model class selection criteria is the stochastic complexity. Denote first the maximum likelihood estimate of data x^n for a given model class M(ϕ) by θ̂(x^n, M(ϕ)), that is, θ̂(x^n, M(ϕ)) = arg max_{θ ∈ Θ_ϕ} {P(x^n | θ)}. The normalized maximum likelihood (NML) distribution [9] is now defined as

  P_NML(x^n | M(ϕ)) = P(x^n | θ̂(x^n, M(ϕ))) / C(M(ϕ), n),   (3)

where the normalizing term C(M(ϕ), n) in the case of discrete data is given by

  C(M(ϕ), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n, M(ϕ))),   (4)

and the sum goes over the space of data samples of size n. If the data is continuous, the sum is replaced by the corresponding integral.

The stochastic complexity of the data x^n, given a model class M(ϕ), is defined via the NML distribution as

  SC(x^n | M(ϕ)) = −log P_NML(x^n | M(ϕ)) = −log P(x^n | θ̂(x^n, M(ϕ))) + log C(M(ϕ), n),   (5)

and the term log C(M(ϕ), n) is called the (minimax) regret or parametric complexity. The regret can be interpreted as measuring the logarithm of the number of essentially different (distinguishable) distributions in the model class. Intuitively, if two distributions assign high likelihood to the same data samples, they do not contribute much to the overall complexity of the model class, and the distributions should not be counted as different for the purposes of statistical inference. See [18] for more discussion on this topic.
The NML distribution (3) has several important theoretical optimality properties. The first is that NML provides a unique solution to the minimax problem

  min_{P̂} max_{x^n} log [ P(x^n | θ̂(x^n, M(ϕ))) / P̂(x^n | M(ϕ)) ],   (6)

as posed in [9]. The minimizing P̂ is the NML distribution, and the minimax regret

  log P(x^n | θ̂(x^n, M(ϕ))) − log P̂(x^n | M(ϕ))   (7)

is given by the parametric complexity log C(M(ϕ), n). This means that the NML distribution is the minimax optimal universal model. The term universal model in this context means that the NML distribution represents (or mimics) the behavior of all the distributions in the model class M(ϕ). Note that the NML distribution itself does not have to belong to the model class, and typically it does not.

A related property of NML involving expected regret was proven in [11]. This property states that NML is also a unique solution to

  max_g min_q E_g log [ P(x^n | θ̂(x^n, M(ϕ))) / q(x^n | M(ϕ)) ],   (8)

where the expectation is taken over x^n with respect to g and the minimizing distribution q equals g. Also the maximin expected regret is thus given by log C(M(ϕ), n).

3. NML FOR MULTINOMIAL MODELS

In the case of discrete data, the simplest model family is the multinomial. The data are assumed to be one-dimensional and to have only a finite set of possible values. Although simple, the multinomial model family has practical applications. For example, in [19] multinomial NML was used for histogram density estimation, and the density estimation problem was regarded as a model class selection task.

3.1. The model family

Assume that our problem domain consists of a single discrete random variable X with K values, and that our data x^n = (x_1, ..., x_n) is multinomially distributed. The space of observations X is now the set {1, 2, ..., K}. The corresponding model family F_MN is defined by

  F_MN = { M(ϕ) : ϕ ∈ Φ_MN },   (9)

where Φ_MN = {1, 2, 3, ...}. Since the parameter vector ϕ is in this case a single integer K, we denote the multinomial model classes by M(K) and define

  M(K) = { P(· | θ) : θ ∈ Θ_K },   (10)

where Θ_K is the simplex-shaped parameter space

  Θ_K = { (π_1, ..., π_K) : π_k ≥ 0, π_1 + ··· + π_K = 1 },   (11)

with π_k = P(X = k), k = 1, ..., K.

Assume the data points x_j are independent and identically distributed (i.i.d.). The NML distribution (3) for the model class M(K) is now given by (see, e.g., [16, 20])

  P_NML(x^n | M(K)) = Π_{k=1}^{K} (h_k/n)^{h_k} / C(M(K), n),   (12)

where h_k is the frequency (number of occurrences) of value k in x^n, and

  C(M(K), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n, M(K)))   (13)
             = Σ_{h_1+···+h_K=n} n!/(h_1! ··· h_K!) Π_{k=1}^{K} (h_k/n)^{h_k}.   (14)

To make the notation more compact and consistent in this section and the following sections, C(M(K), n) is from now on denoted by C_MN(K, n).

It is clear that the maximum likelihood term in (12) can be computed in linear time by simply sweeping through the data once and counting the frequencies h_k. However, the normalizing sum C_MN(K, n) (and thus also the parametric complexity log C_MN(K, n)) involves a sum over an exponential number of terms. Consequently, the time complexity of computing the multinomial NML is dominated by (14).
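As a concrete illustration, the following sketch (ours; not from the paper, names are illustrative) evaluates C_MN(K, n) both directly from definition (13), by enumerating all K^n sequences, and from the frequency-based form (14); for small K and n the two agree, while the exponential growth of the first form is already visible.

```python
from itertools import product
from math import factorial, prod

def cmn_by_definition(K, n):
    """C_MN(K, n) from (13): sum the maximized likelihood over all K^n sequences."""
    total = 0.0
    for seq in product(range(K), repeat=n):
        counts = [seq.count(k) for k in range(K)]
        total += prod((h / n) ** h for h in counts if h > 0)
    return total

def compositions(total, parts):
    """All vectors of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for h in range(total + 1):
        for rest in compositions(total - h, parts - 1):
            yield (h,) + rest

def cmn_by_frequencies(K, n):
    """C_MN(K, n) from (14): group the sequences by their frequency vector."""
    total = 0.0
    for h in compositions(n, K):
        coeff = factorial(n) / prod(factorial(hk) for hk in h)
        total += coeff * prod((hk / n) ** hk for hk in h if hk > 0)
    return total

print(cmn_by_definition(2, 3), cmn_by_frequencies(2, 3))  # both print approx 2.8889
```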
3.2. The quadratic-time algorithm

In [16, 20], a recursion formula for removing the exponentiality of C_MN(K, n) was presented. This formula is given by

  C_MN(K, n) = Σ_{r_1+r_2=n} n!/(r_1! r_2!) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_MN(K*, r_1) · C_MN(K − K*, r_2),   (15)

which holds for all K* = 1, ..., K − 1. A straightforward algorithm based on this formula was then used to compute C_MN(K, n) in time O(n^2 log K). See [16, 20] for more details. Note that in [21, 22] the quadratic-time algorithm was improved to O(n log n log K) by writing (15) as a convolution-type sum and then using the fast Fourier transform algorithm. However, the relevance of this result is unclear due to severe numerical instability problems it easily produces in practice.

3.3. The linear-time algorithm

Although the previous algorithms have succeeded in removing the exponentiality of the computation of the multinomial NML, they are still superlinear with respect to n. In [15], a linear-time algorithm based on the mathematical technique of generating functions was derived for the problem.

The starting point of the derivation is the generating function B defined by

  B(z) = 1/(1 − T(z)) = Σ_{n≥0} (n^n/n!) z^n,   (16)

where T is the so-called Cayley's tree function [23, 24]. It is easy to prove (see [15, 25]) that the function B^K generates the sequence ((n^n/n!) C_MN(K, n))_{n=0}^{∞}, that is,

  B^K(z) = Σ_{n≥0} ( Σ_{h_1+···+h_K=n} n!/(h_1! ··· h_K!) Π_{k=1}^{K} (h_k/n)^{h_k} ) · (n^n/n!) · z^n
         = Σ_{n≥0} (n^n/n!) · C_MN(K, n) · z^n,   (17)

which by using the tree function T can be written as

  B^K(z) = 1/(1 − T(z))^K.   (18)

The properties of the tree function T can be used to prove the following theorem.

Theorem 1. The C_MN(K, n) terms satisfy the recurrence

  C_MN(K + 2, n) = C_MN(K + 1, n) + (n/K) · C_MN(K, n).   (19)

Proof. See the appendix.

It is now straightforward to write a linear-time algorithm for computing the multinomial NML P_NML(x^n | M(K)) based on Theorem 1. The process is described in Algorithm 1. The time complexity of the algorithm is clearly O(n + K), which is a major improvement over the previous methods. The algorithm is also very easy to implement and does not suffer from any numerical instability problems.

3.4. Approximating the multinomial NML

In practice, it is often not necessary to compute the exact value of C_MN(K, n). A very general and powerful mathematical technique called singularity analysis [26] can be used to derive an accurate, constant-time approximation for the multinomial regret. The idea of singularity analysis is to use the analytical properties of the generating function in question by studying its singularities, which then leads to the asymptotic form for the coefficients. See [25, 26] for details.

For the multinomial case, the singularity analysis approximation was first derived in [25] in the context of memoryless sources, and later [20] re-introduced in the MDL framework. The approximation is given by

  log C_MN(K, n) = (K − 1)/2 · log(n/2) + log( √π / Γ(K/2) )
                   + ( √2 · K · Γ(K/2) ) / ( 3 · Γ(K/2 − 1/2) ) · 1/√n
                   + ( (3 + K(K − 2)(2K + 1))/36 − ( Γ²(K/2) · K² ) / ( 9 · Γ²(K/2 − 1/2) ) ) · 1/n
                   + O(1/n^{3/2}).   (20)

Since the error term of (20) goes down with the rate O(1/n^{3/2}), the approximation converges very rapidly. In [20], the accuracy of (20) and two other approximations (Rissanen's asymptotic expansion [8] and the Bayesian information criterion (BIC) [27]) was tested empirically. The results show that (20) is significantly better than the other approximations and accurate already with very small sample sizes. See [20] for more details.
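A direct transcription of approximation (20) is short; the sketch below (ours, for illustration only, with natural logarithms) evaluates it with standard gamma-function routines.

```python
from math import exp, lgamma, log, pi, sqrt

def log_cmn_approx(K, n):
    """Singularity-analysis approximation (20) of log C_MN(K, n), natural log, K >= 2."""
    lg_a = lgamma(K / 2.0)          # log Gamma(K/2)
    lg_b = lgamma(K / 2.0 - 0.5)    # log Gamma(K/2 - 1/2)
    term0 = (K - 1) / 2.0 * log(n / 2.0) + 0.5 * log(pi) - lg_a
    term1 = (sqrt(2.0) * K * exp(lg_a - lg_b)) / 3.0 / sqrt(n)
    term2 = ((3.0 + K * (K - 2) * (2 * K + 1)) / 36.0
             - exp(2.0 * (lg_a - lg_b)) * K * K / 9.0) / n
    return term0 + term1 + term2

# Example: approximate regret of a 10-valued multinomial with 1000 samples.
print(log_cmn_approx(10, 1000))
```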
4. NML FOR THE NAIVE BAYES MODEL

The one-dimensional case discussed in the previous section is not adequate for many real-world situations, where data are typically multidimensional, involving complex dependencies between the domain variables. In [16], a quadratic-time algorithm for computing the NML for a specific multivariate model family, usually called the naive Bayes, was derived. This model family has been very successful in practice in mixture modeling [28], clustering of data [16], case-based reasoning [29], classification [30, 31], and data visualization [32].

4.1. The model family

Let us assume that our problem domain consists of m primary variables X_1, ..., X_m and a special variable X_0, which can be one of the variables in our original problem domain or it can be latent. Assume that the variable X_i has K_i values and that the extra variable X_0 has K_0 values. The data x^n = (x_1, ..., x_n) consist of observations of the form x_j = (x_j0, x_j1, ..., x_jm) ∈ X, where

  X = {1, 2, ..., K_0} × {1, 2, ..., K_1} × ··· × {1, 2, ..., K_m}.   (21)

The naive Bayes model family F_NB is defined by

  F_NB = { M(ϕ) : ϕ ∈ Φ_NB },   (22)

with Φ_NB = {1, 2, 3, ...}^{m+1}. The corresponding model classes are denoted by M(K_0, K_1, ..., K_m):

  M(K_0, K_1, ..., K_m) = { P_NB(· | θ) : θ ∈ Θ_{K_0, K_1, ..., K_m} }.   (23)

The basic naive Bayes assumption is that given the value of the special variable, the primary variables are independent. We have consequently

  P_NB(X_0 = x_0, X_1 = x_1, ..., X_m = x_m | θ) = P(X_0 = x_0 | θ) · Π_{i=1}^{m} P(X_i = x_i | X_0 = x_0, θ).   (24)

Furthermore, we assume that the distribution of P(X_0 | θ) is multinomial with parameters (π_1, ..., π_{K_0}), and each P(X_i | X_0 = k, θ) is multinomial with parameters (σ_{ik1}, ..., σ_{ikK_i}). The whole parameter space is then

  Θ_{K_0, K_1, ..., K_m} = { (π_1, ..., π_{K_0}, σ_{111}, ..., σ_{11K_1}, ..., σ_{mK_01}, ..., σ_{mK_0K_m}) :
      π_k ≥ 0, σ_{ikl} ≥ 0, π_1 + ··· + π_{K_0} = 1,
      σ_{ik1} + ··· + σ_{ikK_i} = 1, i = 1, ..., m, k = 1, ..., K_0 },   (25)

and the parameters are defined by π_k = P(X_0 = k), σ_{ikl} = P(X_i = l | X_0 = k).

Assuming i.i.d., the NML distribution for the naive Bayes can now be written as (see [16])

  P_NML(x^n | M(K_0, K_1, ..., K_m)) = Π_{k=1}^{K_0} [ (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_{ikl}/h_k)^{f_{ikl}} ] / C(M(K_0, K_1, ..., K_m), n),   (26)

where h_k is the number of times X_0 has value k in x^n, f_{ikl} is the number of times X_i has value l when the special variable has value k, and C(M(K_0, K_1, ..., K_m), n) is given by (see [16])

  C(M(K_0, K_1, ..., K_m), n) = Σ_{h_1+···+h_{K_0}=n} n!/(h_1! ··· h_{K_0}!) Π_{k=1}^{K_0} (h_k/n)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k).   (27)

To simplify notations, from now on we write C(M(K_0, K_1, ..., K_m), n) in an abbreviated form C_NB(K_0, n).

1: Count the frequencies h_1, ..., h_K from the data x^n
2: Compute the likelihood P(x^n | θ̂(x^n, M(K))) = Π_{k=1}^{K} (h_k/n)^{h_k}
3: Set C_MN(1, n) = 1
4: Compute C_MN(2, n) = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2}
5: for k = 1 to K − 2 do
6:   Compute C_MN(k + 2, n) = C_MN(k + 1, n) + (n/k) · C_MN(k, n)
7: end for
8: Output P_NML(x^n | M(K)) = P(x^n | θ̂(x^n, M(K))) / C_MN(K, n)

Algorithm 1: The linear-time algorithm for computing P_NML(x^n | M(K)).
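The listing translates directly into code. The sketch below (our illustration; function and variable names are ours) follows Algorithm 1 step by step: it counts frequencies, evaluates the maximized likelihood, initializes C_MN(1, n) and C_MN(2, n), and then applies recurrence (19).

```python
from collections import Counter
from math import comb

def multinomial_nml(data, K):
    """P_NML(x^n | M(K)) for a sequence `data` of symbols from {1, ..., K} (Algorithm 1)."""
    n = len(data)
    counts = Counter(data)

    # Step 2: maximized likelihood, prod_k (h_k / n)^{h_k}.
    likelihood = 1.0
    for h in counts.values():
        likelihood *= (h / n) ** h

    # Steps 3-4: C_MN(1, n) = 1 and C_MN(2, n) by the explicit binomial sum.
    c_prev = 1.0                                   # C_MN(1, n)
    c_curr = sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                 for r in range(n + 1))            # C_MN(2, n)
    if K == 1:
        return likelihood / c_prev

    # Steps 5-7: recurrence C_MN(k + 2, n) = C_MN(k + 1, n) + (n / k) * C_MN(k, n).
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev

    # Step 8: normalize.
    return likelihood / c_curr

print(multinomial_nml([1, 1, 2, 3, 1, 2], K=3))
```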

4.2. The quadratic-time algorithm

It turns out [16] that the recursive formula (15) can be generalized to the naive Bayes model family case.

Theorem 2. The terms C_NB(K_0, n) satisfy the recurrence

  C_NB(K_0, n) = Σ_{r_1+r_2=n} n!/(r_1! r_2!) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_NB(K*, r_1) · C_NB(K_0 − K*, r_2),   (28)

where K* = 1, ..., K_0 − 1.

Proof. See the appendix.

In many practical applications of the naive Bayes, the quantity K_0 is unknown. Its value is typically determined as a part of the model class selection process. Consequently, it is necessary to compute NML for model classes M(K_0, K_1, ..., K_m), where K_0 has a range of values, say, K_0 = 1, ..., K_max. The process of computing NML for this case is described in Algorithm 2. The time complexity of the algorithm is O(n^2 · K_max). If the value of K_0 is fixed, the time complexity drops to O(n^2 · log K_0). See [16] for more details.

5. NML FOR BAYESIAN FORESTS

The naive Bayes model discussed in the previous section has been successfully applied in various domains. In this section we consider tree-structured Bayesian networks, which include the naive Bayes model as a special case but can also represent more complex dependencies.

5.1. The model family

As before, we assume m variables X_1, ..., X_m with given value cardinalities K_1, ..., K_m. Since the goal here is to model the joint probability distribution of the m variables, there is no need to mark a special variable. We assume a data matrix x^n = (x_ji) ∈ X^n, 1 ≤ j ≤ n, 1 ≤ i ≤ m, as given.

A Bayesian network structure G encodes independence assumptions so that if each variable X_i is represented as a node in the network, then the joint probability distribution factorizes into a product of local probability distributions, one for each node, conditioned on its parent set. We define a Bayesian forest to be a Bayesian network structure G on the node set X_1, ..., X_m which assigns at most one parent X_pa(i) to any node X_i. Consequently, a Bayesian tree is a connected Bayesian forest, and a Bayesian forest breaks down into component trees, that is, connected subgraphs. The root of each such component tree lacks a parent, in which case we write pa(i) = ∅.

The parent set of a node X_i thus reduces to a single value pa(i) ∈ {1, ..., i − 1, i + 1, ..., m, ∅}. Let further ch(i) denote the set of children of node X_i in G and ch(∅) denote the "children of none," that is, the roots of the component trees of G.

The corresponding model family F_BF can be indexed by the network structure G and the corresponding attribute value counts K_1, ..., K_m:

  F_BF = { M(ϕ) : ϕ ∈ Φ_BF },   (29)

with Φ_BF = {1, ..., |G|} × {1, 2, 3, ...}^m, where G is associated with an integer according to some enumeration of all Bayesian forests on (X_1, ..., X_m). As the K_i are assumed fixed, we can abbreviate the corresponding model classes by M(G) := M(G, K_1, ..., K_m).

Given a forest model class M(G), we index each model by a parameter vector θ in the corresponding parameter space Θ_G:

  Θ_G = { θ = (θ_ikl) : θ_ikl ≥ 0, Σ_l θ_ikl = 1, i = 1, ..., m, k = 1, ..., K_pa(i), l = 1, ..., K_i },   (30)

where we define K_∅ := 1 in order to unify notation for root and non-root nodes. Each such θ_ikl defines a probability

  θ_ikl = P(X_i = l | X_pa(i) = k, M(G), θ),   (31)

where we interpret X_∅ = 1 as a null condition.

The joint probability that a model M = (G, θ) assigns to a data vector x = (x_1, ..., x_m) becomes

  P(x | M(G), θ) = Π_{i=1}^{m} P(X_i = x_i | X_pa(i) = x_pa(i), M(G), θ) = Π_{i=1}^{m} θ_{i, x_pa(i), x_i}.   (32)
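To make the factorization (32) concrete, here is a small sketch (ours; the example forest, parameters, and names are purely illustrative) that evaluates the joint probability of one data vector under a given forest structure and parameter assignment.

```python
# Forest over three variables: X1 -> X2, X1 -> X3 (X1 is a component tree root).
# pa[i] gives the parent of node i, or None for a root (the null condition X_empty = 1).
pa = {1: None, 2: 1, 3: 1}

# theta[i][k][l] = P(X_i = l | X_pa(i) = k); roots use the single dummy parent value 1.
theta = {
    1: {1: {1: 0.6, 2: 0.4}},
    2: {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}},
    3: {1: {1: 0.5, 2: 0.5}, 2: {1: 0.7, 2: 0.3}},
}

def joint_probability(x, pa, theta):
    """Evaluate P(x | M(G), theta) = prod_i theta_{i, x_pa(i), x_i}, as in (32)."""
    p = 1.0
    for i, value in x.items():
        parent_value = 1 if pa[i] is None else x[pa[i]]
        p *= theta[i][parent_value][value]
    return p

print(joint_probability({1: 1, 2: 2, 3: 1}, pa, theta))  # 0.6 * 0.1 * 0.5 = 0.03
```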

1: Compute C_MN(k, j) for k = 1, ..., V_max, j = 0, ..., n, where V_max = max{K_1, ..., K_m}
2: for K_0 = 1 to K_max do
3:   Count the frequencies h_1, ..., h_{K_0}, f_{ik1}, ..., f_{ikK_i} for i = 1, ..., m, k = 1, ..., K_0 from the data x^n
4:   Compute the likelihood: P(x^n | θ̂(x^n, M(K_0, K_1, ..., K_m))) = Π_{k=1}^{K_0} [ (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_{ikl}/h_k)^{f_{ikl}} ]
5:   Set C_NB(K_0, 0) = 1
6:   if K_0 = 1 then
7:     Compute C_NB(1, j) = Π_{i=1}^{m} C_MN(K_i, j) for j = 1, ..., n
8:   else
9:     Compute C_NB(K_0, j) = Σ_{r_1+r_2=j} (j!/(r_1! r_2!)) (r_1/j)^{r_1} (r_2/j)^{r_2} · C_NB(1, r_1) · C_NB(K_0 − 1, r_2) for j = 1, ..., n
10:  end if
11:  Output P_NML(x^n | M(K_0, K_1, ..., K_m)) = P(x^n | θ̂(x^n, M(K_0, K_1, ..., K_m))) / C_NB(K_0, n)
12: end for

Algorithm 2: The algorithm for computing P_NML(x^n | M(K_0, K_1, ..., K_m)) for K_0 = 1, ..., K_max.
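The following sketch (ours; it assumes Algorithm 2 as reconstructed above and uses our own function names) implements the normalizer part of Algorithm 2: it tabulates C_MN(k, j) with recurrence (19) and then builds C_NB(K_0, j) with the recursion of line 9.

```python
from math import comb

def cmn_table(k_max, n):
    """C_MN(k, j) for k = 1..k_max, j = 0..n, via recurrence (19); table[k - 1][j] == C_MN(k, j)."""
    table = [[1.0] * (n + 1)]                       # C_MN(1, j) = 1 for all j
    c2 = [1.0] + [sum(comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
                      for r in range(j + 1)) for j in range(1, n + 1)]
    table.append(c2)                                # C_MN(2, j)
    for k in range(1, k_max - 1):                   # C_MN(k + 2, j) from (19)
        table.append([table[k][j] + (j / k) * table[k - 1][j] for j in range(n + 1)])
    return table

def cnb(K0, Ks, n):
    """C_NB(K0, n) for primary variable cardinalities Ks = [K_1, ..., K_m]."""
    cmn = cmn_table(max(Ks), n)
    prev = [1.0] * (n + 1)                          # C_NB(1, j) = prod_i C_MN(K_i, j)
    for Ki in Ks:
        prev = [prev[j] * cmn[Ki - 1][j] for j in range(n + 1)]
    base = prev[:]                                  # keep C_NB(1, .) for the recursion
    for _ in range(2, K0 + 1):                      # build C_NB(k, .) from C_NB(k - 1, .)
        curr = [1.0] * (n + 1)
        for j in range(1, n + 1):
            curr[j] = sum(comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
                          * base[r] * prev[j - r] for r in range(j + 1))
        prev = curr
    return prev[n]

print(cnb(K0=2, Ks=[2, 3], n=10))
```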

For a sample x^n = (x_ji) of n vectors x_j, we define the corresponding frequencies as

  f_ikl := |{ j : x_ji = l ∧ x_j,pa(i) = k }|,
  f_il := |{ j : x_ji = l }| = Σ_{k=1}^{K_pa(i)} f_ikl.   (33)

By definition, for any component tree root X_i, we have f_il = f_i1l. The probability assigned to a sample x^n can then be written as

  P(x^n | M(G), θ) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} θ_ikl^{f_ikl},   (34)

which is maximized at

  θ̂_ikl(x^n, M(G)) = f_ikl / f_pa(i),k,   (35)

where we define f_∅,1 := n. The maximum data likelihood thereby is

  P(x^n | θ̂(x^n), M(G)) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} ( f_ikl / f_pa(i),k )^{f_ikl}.   (36)
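The maximized likelihood (36) only needs the frequency counts. A minimal sketch (ours; the structure and data are toy examples) is:

```python
from collections import Counter

def forest_max_likelihood(data, pa):
    """P(x^n | theta_hat, M(G)) from (36); `data` is a list of dicts {node: value}."""
    n = len(data)
    f_ikl = Counter()        # (i, parent value k, child value l) counts
    f_parent = Counter()     # (i, parent value k) counts, i.e. f_pa(i),k
    for row in data:
        for i, l in row.items():
            k = 1 if pa[i] is None else row[pa[i]]   # roots: dummy parent value 1
            f_ikl[(i, k, l)] += 1
            f_parent[(i, k)] += 1                     # equals n for every root i
    likelihood = 1.0
    for (i, k, l), count in f_ikl.items():
        likelihood *= (count / f_parent[(i, k)]) ** count
    return likelihood

pa = {1: None, 2: 1}                                   # chain X1 -> X2
data = [{1: 1, 2: 1}, {1: 1, 2: 2}, {1: 2, 2: 2}]
print(forest_max_likelihood(data, pa))
```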

5.2. The algorithm

The goal is to calculate the NML distribution P_NML(x^n | M(G)) defined in (3). This consists of calculating the maximum data likelihood (36) and the normalizing term C(M(G), n) given in (4). The former involves frequency counting, one sweep through the data, and multiplication of the appropriate values. This can be done in time O(n + Σ_i K_i K_pa(i)). The latter involves a sum exponential in n, which clearly makes it the computational bottleneck of the algorithm.

Our approach is to break up the normalizing sum in (4) into terms corresponding to subtrees with given frequencies in either their root or its parent. We then calculate the complete sum by sweeping through the graph once, bottom-up. Let us now introduce some necessary notation.

Let G be a given Bayesian forest. Then for any node X_i denote the subtree rooting in X_i by G_sub(i) and the forest built up by all descendants of X_i by G_dsc(i). The corresponding data domains are X_sub(i) and X_dsc(i), respectively. Denote the sum over all n-instantiations of a subtree by

  C_i(M(G), n) := Σ_{x^n_sub(i) ∈ X^n_sub(i)} P( x^n_sub(i) | θ̂(x^n_sub(i)), M(G_sub(i)) ),   (37)

and for any vector x^n_i ∈ X^n_i with frequencies f_i = (f_i1, ..., f_iK_i), we define

  C_i(M(G), n | f_i) := Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (38)

to be the corresponding sum with fixed root instantiation, summing only over the attribute space spanned by the descendants of X_i.

Note that we use f_i on the left-hand side, and x^n_i on the right-hand side of the definition. This needs to be justified. Interestingly, while the terms in the sum depend on the ordering of x^n_i, the sum itself depends on x^n_i only through its frequencies f_i. To see this, pick any two representatives x^n_i and x̃^n_i of f_i and find, for example after lexicographical ordering of the elements, that

  { (x^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) } = { (x̃^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) }.   (39)

Next, we need to define corresponding sums over X_sub(i) with the frequencies at the subtree root parent X_pa(i) given.

For any f_pa(i) ∼ x^n_pa(i) ∈ X^n_pa(i) define

  L_i(M(G), n | f_pa(i)) := Σ_{x^n_sub(i) ∈ X^n_sub(i)} P( x^n_sub(i) | x^n_pa(i), θ̂(x^n_sub(i), x^n_pa(i)), M(G_sub(i)) ).   (40)

Again, this is well defined since any other representative x̃^n_pa(i) of f_pa(i) yields summing the same terms modulo their ordering.

After having introduced this notation, we now briefly outline the algorithm and in the following subsections give a more detailed description of the steps involved. As stated before, we go through G bottom-up. At each inner node X_i, we receive L_j(M(G), n | f_i) from each child X_j, j ∈ ch(i). Correspondingly, we are required to send L_i(M(G), n | f_pa(i)) up to the parent X_pa(i). At each component tree root X_i, we then calculate the sum C_i(M(G), n) for the whole connectivity component and then combine these sums to get the normalizer C(M(G), n) for the complete forest G.

5.2.1. Leaves

For a leaf node X_i we can calculate the L_i(M(G), n | f_pa(i)) without listing its own frequencies f_i. As in (27), f_pa(i) splits the n data vectors into K_pa(i) subsets of sizes f_pa(i),1, ..., f_pa(i),K_pa(i), and each of them can be modeled independently as a multinomial; we have

  L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k).   (41)

The terms C_MN(K_i, n') (for n' = 0, ..., n) can be precalculated using recurrence (19) as in Algorithm 1.

5.2.2. Inner nodes

For inner nodes X_i we divide the task into two steps. First, we collect the child messages L_j(M(G), n | f_i) sent by each child X_j ∈ ch(i) into partial sums C_i(M(G), n | f_i) over X_dsc(i), and then "lift" these to sums L_i(M(G), n | f_pa(i)) over X_sub(i), which are the messages to the parent.

The first step is simple. Given an instantiation x^n_i at X_i or, equivalently, the corresponding frequencies f_i, the subtrees rooting in the children ch(i) of X_i become independent of each other. Thus we have

  C_i(M(G), n | f_i)
    = Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (42)
    = Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) × Π_{j∈ch(i)} P( x^n_dsc(i)|sub(j) | x^n_i, θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (43)
    = P( x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) × ( Π_{j∈ch(i)} Σ_{x^n_sub(j) ∈ X^n_sub(j)} P( x^n_sub(j) | x^n_i, θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) )   (44)
    = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i),   (45)

where x^n_dsc(i)|sub(j) is the restriction of x^n_dsc(i) to columns corresponding to nodes in G_sub(j). We have used (38) for (42), (32) for (43) and (44), and finally (36) and (40) for (45).

Now we need to calculate the outgoing messages L_i(M(G), n | f_pa(i)) from the incoming messages we have just combined into C_i(M(G), n | f_i). This is the most demanding part of the algorithm, for we need to list all possible conditional frequencies, of which there are O(n^{K_i K_pa(i) − 1}) many, the −1 being due to the sum-to-n constraint. For fixed i, we arrange the conditional frequencies f_ikl into a matrix F = (f_ikl) and define its marginals

  ρ(F) := ( Σ_k f_ik1, ..., Σ_k f_ikK_i ),
  γ(F) := ( Σ_l f_i1l, ..., Σ_l f_iK_pa(i)l )   (46)

to be the vectors obtained by summing the rows of F and the columns of F, respectively. Each such matrix then corresponds to a term C_i(M(G), n | ρ(F)) and a term L_i(M(G), n | γ(F)). Formally, we have

  L_i(M(G), n | f_pa(i)) = Σ_{F : γ(F) = f_pa(i)} C_i(M(G), n | ρ(F)).   (47)

5.2.3. Component tree roots

For a component tree root X_i ∈ ch(∅) we do not need to pass any message upward. All we need is the complete sum over the component tree

  C_i(M(G), n) = Σ_{f_i} n!/( f_i1! ··· f_iK_i! ) · C_i(M(G), n | f_i),   (48)

where the C_i(M(G), n | f_i) are calculated from (45). The summation goes over all nonnegative integer vectors f_i summing to n. The above is trivially true since we sum over all instantiations x^n_i of X_i and group like terms, corresponding to the same frequency vector f_i, while keeping track of their respective count, namely n!/( f_i1! ··· f_iK_i! ).

5.2.4. The algorithm

For the complete forest G we simply multiply the sums over its tree components.
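The "lift" step (47) is essentially an enumeration of contingency tables: all K_i × K_pa(i) nonnegative integer matrices whose column sums match a given parent frequency vector. The sketch below (ours; a direct, unoptimized illustration of (46)–(47) with hypothetical inputs) enumerates such matrices and accumulates C_i-values indexed by the row-sum vector ρ(F).

```python
from itertools import product

def compositions(total, parts):
    """All tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def lift(c_i, f_parent, K_child):
    """L_i(. | f_pa(i)) from (47): sum C_i(. | rho(F)) over matrices F with gamma(F) = f_pa(i).

    `c_i` maps a child frequency vector f_i (tuple of length K_child) to C_i(M(G), n | f_i).
    `f_parent` is the parent frequency vector gamma(F); column k of F sums to f_parent[k]."""
    total = 0.0
    columns = [list(compositions(fk, K_child)) for fk in f_parent]
    for F in product(*columns):                      # F[k][l] = f_ikl
        rho = tuple(sum(col[l] for col in F) for l in range(K_child))
        total += c_i[rho]
    return total

# Toy usage: binary child and parent, n = 2, with made-up C_i values for each f_i.
c_i = {(2, 0): 1.0, (1, 1): 0.5, (0, 2): 1.0}
print(lift(c_i, f_parent=(1, 1), K_child=2))  # -> 3.0
```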

1: Count all frequencies f_ikl and f_il from the data x^n
2: Compute P(x^n | M(G)) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} ( f_ikl / f_pa(i),k )^{f_ikl}
3: for k = 1, ..., K_max := max_{i : X_i is a leaf} {K_i} and n' = 0, ..., n do
4:   Compute C_MN(k, n') as in Algorithm 1
5: end for
6: for each node X_i in some bottom-up order do
7:   if X_i is a leaf then
8:     for each frequency vector f_pa(i) of X_pa(i) do
9:       Compute L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k)
10:    end for
11:  else if X_i is an inner node then
12:    for each frequency vector f_i of X_i do
13:      Compute C_i(M(G), n | f_i) = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
14:    end for
15:    initialize L_i ≡ 0
16:    for each nonnegative K_i × K_pa(i) integer matrix F with entries summing to n do
17:      L_i(M(G), n | γ(F)) += C_i(M(G), n | ρ(F))
18:    end for
19:  else if X_i is a component tree root then
20:    Compute C_i(M(G), n) = Σ_{f_i} n!/( f_i1! ··· f_iK_i! ) Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
21:  end if
22: end for
23: Compute C(M(G), n) = Π_{i∈ch(∅)} C_i(M(G), n)
24: Output P_NML(x^n | M(G)) = P(x^n | M(G)) / C(M(G), n)

Algorithm 3: The algorithm for computing P_NML(x^n | M(G)) for a Bayesian forest G.
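As a runnable illustration of the bottom-up sweep, the sketch below (ours; restricted, for brevity, to forests of depth at most one, so the inner-node lift step of lines 15–18 is not needed) combines the leaf step (41), the root step of line 20, and the final product over component trees. For a single root with m leaf children it reproduces the naive Bayes normalizer C_NB(K_0, n).

```python
from math import comb, factorial, prod

def compositions(total, parts):
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def cmn(K, n):
    """C_MN(K, n) via recurrence (19), as in Algorithm 1."""
    if n == 0 or K == 1:
        return 1.0
    prev, curr = 1.0, sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                          for r in range(n + 1))
    for k in range(1, K - 1):
        prev, curr = curr, curr + (n / k) * prev
    return curr

def forest_normalizer(roots, n):
    """C(M(G), n) for a forest of depth <= 1.

    `roots` is a list of (K_root, [K_child1, K_child2, ...]) pairs, one per component tree."""
    total = 1.0
    for K_root, child_Ks in roots:
        c_i = 0.0
        for f in compositions(n, K_root):                       # root frequency vectors
            coeff = factorial(n) / prod(factorial(fk) for fk in f)
            likelihood = prod((fk / n) ** fk for fk in f if fk > 0)
            messages = prod(cmn(Kj, fk) for Kj in child_Ks for fk in f)   # leaf step (41)
            c_i += coeff * likelihood * messages                 # root step, line 20
        total *= c_i                                             # product over roots
    return total

# Single root with two leaf children: matches C_NB(2, 10) for K_1 = 2, K_2 = 3.
print(forest_normalizer([(2, [2, 3])], n=10))
```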

Since these are independent of each other, in analogy to (42)–(45) we have

  C(M(G), n) = Π_{i∈ch(∅)} C_i(M(G), n).   (49)

Algorithm 3 collects all the above into a pseudocode. The time complexity of this algorithm is O(n^{K_i K_pa(i) − 1}) for each inner node, O(n(n + K_i)) for each leaf, and O(n^{K_i − 1}) for a component tree root of G. When all m […] here is polynomial as well in the sample size n as in the graph size m. For attributes with relatively few values, the polynomial is time tolerable.

6. CONCLUSION

The normalized maximum likelihood (NML) offers a universal […]

The methods presented are especially suitable for problems in bioinformatics, which typically involve multidimensional discrete datasets. Furthermore, unlike the Bayesian methods, information-theoretic approaches such as ours do not require a prior for the model parameters. This is the most important aspect, as constructing a reasonable parameter prior is a notoriously difficult problem, particularly in bioinformatical domains involving novel types of data with little background knowledge. All in all, information theory has been found to offer a natural and successful theoretical framework for biological applications in general, which makes NML an appealing choice for bioinformatics.

In the future, our plan is to extend the current work to more complex cases such as general Bayesian networks, which would allow the use of NML in even more involved modeling tasks. Another natural area of future work is to apply the methods of this paper to practical tasks involving large discrete databases and compare the results to other approaches, such as those based on Bayesian statistics.

APPENDIX

PROOFS OF THEOREMS

In this section, we provide detailed proofs of the two theorems presented in the paper.

Proof of Theorem 1 (multinomial recursion)

We start by proving the following lemma.

Lemma 3. For the tree function T(z) we have

  z T'(z) = T(z) / (1 − T(z)).   (A.1)

Proof. A basic property of the tree function is the functional equation T(z) = z e^{T(z)} (see, e.g., [23]). Differentiating this equation yields

  T'(z) = e^{T(z)} + T(z) T'(z),
  z T'(z) (1 − T(z)) = z e^{T(z)} = T(z),   (A.2)

from which (A.1) follows.

Now we can proceed to the proof of the theorem. We start by multiplying and differentiating (17) as follows:

  z · d/dz Σ_{n≥0} (n^n/n!) C_MN(K, n) z^n = z · Σ_{n≥1} n (n^n/n!) C_MN(K, n) z^{n−1}   (A.3)
    = Σ_{n≥0} n (n^n/n!) C_MN(K, n) z^n.   (A.4)

On the other hand, by manipulating (18) in the same way, we get

  z · d/dz ( 1/(1 − T(z))^K ) = ( z K / (1 − T(z))^{K+1} ) · T'(z)   (A.5)
    = ( K / (1 − T(z))^{K+1} ) · ( T(z) / (1 − T(z)) )   (A.6)
    = K ( 1/(1 − T(z))^{K+2} − 1/(1 − T(z))^{K+1} )   (A.7)
    = K ( Σ_{n≥0} (n^n/n!) C_MN(K + 2, n) z^n − Σ_{n≥0} (n^n/n!) C_MN(K + 1, n) z^n ),   (A.8)

where (A.6) follows from Lemma 3. Comparing the coefficients of z^n in (A.4) and (A.8), we get

  n · C_MN(K, n) = K · ( C_MN(K + 2, n) − C_MN(K + 1, n) ),   (A.9)

from which the theorem follows.

Proof of Theorem 2 (naive Bayes recursion)

We have

  C_NB(K_0, n) = Σ_{h_1+···+h_{K_0}=n} n!/(h_1! ··· h_{K_0}!) Π_{k=1}^{K_0} (h_k/n)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{h_1+···+h_{K_0}=n} (n!/n^n) Π_{k=1}^{K_0} ( h_k^{h_k} / h_k! ) Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2}
      × Σ_{h_1+···+h_{K*}=r_1} ( r_1!/(h_1! ··· h_{K*}!) ) Π_{k=1}^{K*} (h_k/r_1)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
      × Σ_{h_{K*+1}+···+h_{K_0}=r_2} ( r_2!/(h_{K*+1}! ··· h_{K_0}!) ) Π_{k=K*+1}^{K_0} (h_k/r_2)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_NB(K*, r_1) · C_NB(K_0 − K*, r_2),   (A.10)

and the proof follows.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers and Jorma Rissanen for useful comments. This work was supported in part by the Academy of Finland under the project Civi and by the Finnish Funding Agency for Technology and Innovation under the projects Kukot and PMMA. In addition, this work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

REFERENCES

[1] G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Transactions on Information Systems, vol. 23, no. 1, pp. 3–34, 2005.
[2] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and B. Brown, "Clustering methods for the analysis of DNA microarray data," Tech. Rep., Department of Health Research and Policy, Stanford, Calif, USA, 1999.
[3] W. Pan, J. Lin, and C. T. Le, "Model-based cluster analysis of microarray gene-expression data," Genome Biology, vol. 3, no. 2, pp. 1–8, 2002.
[4] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.
[5] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, "Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks," in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[6] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[7] J. Rissanen, "Stochastic complexity," Journal of the Royal Statistical Society, Series B, vol. 49, no. 3, pp. 223–239, 1987, with discussions, pp. 223–265.
[8] J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[9] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[10] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
[11] J. Rissanen, "Strong optimality of the normalized ML models as universal codes and information in data," IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1712–1717, 2001.
[12] P. Grünwald, The Minimum Description Length Principle, The MIT Press, Cambridge, Mass, USA, 2007.
[13] J. Rissanen, Information and Complexity in Statistical Modeling, Springer, New York, NY, USA, 2007.
[14] D. Heckerman, "A tutorial on learning with Bayesian networks," Tech. Rep. MSR-TR-95-06, Microsoft Research, Advanced Technology Division, One Microsoft Way, Redmond, Wash, USA, 98052, 1996.
[15] P. Kontkanen and P. Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, vol. 103, no. 6, pp. 227–233, 2007.
[16] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri, "An MDL framework for data clustering," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., The MIT Press, Cambridge, Mass, USA, 2006.
[17] Q. Xie and A. R. Barron, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 431–445, 2000.
[18] V. Balasubramanian, "MDL, Bayesian inference, and the geometry of the space of probability distributions," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., pp. 81–98, The MIT Press, Cambridge, Mass, USA, 2006.
[19] P. Kontkanen and P. Myllymäki, "MDL histogram density estimation," in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), San Juan, Puerto Rico, USA, March 2007.
[20] P. Kontkanen, W. Buntine, P. Myllymäki, J. Rissanen, and H. Tirri, "Efficient computation of stochastic complexity," in Proceedings of the 9th International Conference on Artificial Intelligence and Statistics, C. Bishop and B. Frey, Eds., pp. 233–238, Society for Artificial Intelligence and Statistics, Key West, Fla, USA, January 2003.
[21] M. Koivisto, "Sum-product algorithms for the analysis of genetic risks," Tech. Rep. A-2004-1, Department of Computer Science, University of Helsinki, Helsinki, Finland, 2004.
[22] P. Kontkanen and P. Myllymäki, "A fast normalized maximum likelihood algorithm for multinomial data," in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), Edinburgh, Scotland, August 2005.
[23] D. E. Knuth and B. Pittel, "A recurrence related to trees," Proceedings of the American Mathematical Society, vol. 105, no. 2, pp. 335–349, 1989.
[24] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W function," Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.
[25] W. Szpankowski, Average Case Analysis of Algorithms on Sequences, John Wiley & Sons, New York, NY, USA, 2001.
[26] P. Flajolet and A. M. Odlyzko, "Singularity analysis of generating functions," SIAM Journal on Discrete Mathematics, vol. 3, no. 2, pp. 216–240, 1990.
[27] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[28] P. Kontkanen, P. Myllymäki, and H. Tirri, "Constructing Bayesian finite mixture models by the EM algorithm," Tech. Rep. NC-TR-97-003, ESPRIT Working Group on Neural and Computational Learning (NeuroCOLT), Helsinki, Finland, 1997.
[29] P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "On Bayesian case matching," in Proceedings of the 4th European Workshop on Advances in Case-Based Reasoning (EWCBR '98), B. Smyth and P. Cunningham, Eds., vol. 1488 of Lecture Notes in Computer Science, pp. 13–24, Springer, Dublin, Ireland, September 1998.
[30] P. Grünwald, P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "Minimum encoding approaches for predictive modeling," in Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI '98), G. Cooper and S. Moral, Eds., pp. 183–192, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[31] P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P. Grünwald, "On predictive distributions and Bayesian networks," Statistics and Computing, vol. 10, no. 1, pp. 39–54, 2000.

[32] P. Kontkanen, J. Lahtinen, P. Myllymäki, T. Silander, and H. Tirri, "Supervised model-based visualization of high-dimensional data," Intelligent Data Analysis, vol. 4, no. 3-4, pp. 213–227, 2000.
[33] M. Dyer, R. Kannan, and J. Mount, "Sampling contingency tables," Random Structures and Algorithms, vol. 10, no. 4, pp. 487–506, 1997.