EURASIP Journal on Bioinformatics and Systems Biology

Information Theoretic Methods for Bioinformatics

Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu

Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2007 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief: I. Tabus, Tampere University of Technology, Finland

Associate Editors: Jaakko Astola, Finland; Junior Barrera, Brazil; Michael L. Bittner, USA; Michael R. Brent, USA; Yidong Chen, USA; Paul Dan Cristea, Romania; Aniruddha Datta, USA; Bart De Moor, Belgium; Edward R. Dougherty, USA; J. Garcia-Frias, USA; Debashis Ghosh, USA; John Goutsias, USA; Roderic Guigó, Spain; Yufei Huang, USA; Seungchan Kim, USA; John Quackenbush, USA; Jorma Rissanen, Finland; Stéphane Robin, France; Paola Sebastiani, USA; Erchin Serpedin, USA; Ilya Shmulevich, USA; A. H. Tewfik, USA; Sabine Van Huffel, Belgium; Z. Jane Wang, Canada; Yue Wang, USA

Contents

Information Theoretic Methods for Bioinformatics, Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu, Volume 2007, Article ID 79128, 2 pages

Compressing Proteomes: The Relevance of Medium Range Correlations, Dario Benedetto, Emanuele Caglioti, and Claudia Chica Volume 2007, Article ID 60723, 8 pages

A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification, Chris Hemmerich and Sun Kim Volume 2007, Article ID 87356, 9 pages

Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates, Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama, and Wojciech Szpankowski Volume 2007, Article ID 14741, 11 pages

Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information, Arvind Rao, Alfred O. Hero III, David J. States, and James Douglas Engel Volume 2007, Article ID 13853, 13 pages

Splitting the BLOSUM Score into Numbers of Biological Significance, Francesco Fabris, Andrea Sgarro, and Alessandro Tossi Volume 2007, Article ID 31450, 18 pages

Aligning Sequences by Minimum Description Length, John S. Conery, Volume 2007, Article ID 72936, 14 pages

MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress, Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin, and Andrew S. Torres, Volume 2007, Article ID 43670, 16 pages

Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria, Haruo Suzuki, Rintaro Saito, and Masaru Tomita Volume 2007, Article ID 61374, 7 pages

Information-Theoretic Inference of Large Transcriptional Regulatory Networks, Patrick E. Meyer, Kevin Kontos, Frédéric Lafitte, and Gianluca Bontempi, Volume 2007, Article ID 79879, 9 pages

NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks, Petri Kontkanen, Hannes Wettig, and Petri Myllymäki, Volume 2007, Article ID 90947, 11 pages

Editorial Information Theoretic Methods for Bioinformatics

Jorma Rissanen,1,2 Peter Grünwald,3 Jukka Heikkonen,4 Petri Myllymäki,2,5 Teemu Roos,2,5 and Juho Rousu5

1 Computer Learning Research Center, University of London, Royal Holloway TW20 0EX, UK 2 Helsinki Institute for Information Technology, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland 3 Centrum voor Wiskunde en Informatica (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands 4 Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9203, 02015 HUT, Finland 5 Department of Computer Science, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland

Received 24 December 2007; Accepted 24 December 2007

Copyright © 2007 Jorma Rissanen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The ever-ongoing growth in the amount of biological data, the development of genome-wide measurement technologies, and the gradual, inevitable shift in molecular biology from the study of individual genes to the systems view; all these factors contribute to the need to study biological systems by statistical and computational means. In this task, we are facing a dual challenge: on the one hand, biological systems and hence their models are inherently complex, and on the other hand, the measurement data, while being genome-wide, are typically scarce in terms of sample sizes (the “large p, small n” problem) and noisy.

This means that the traditional statistical approach, where the model is viewed as a distorted image of something called a true distribution which the statisticians are trying to estimate, is poorly justified. This lack of rationality is particularly striking when one tries to learn the structure of the data by testing for the truth of a hypothesis in a collection where none of them is true. Similarly, the Bayesian approaches that require prior knowledge, which is either nonexistent or vague and difficult to express in terms of a distribution for the parameters, are subject to modeling assumptions which may bias the results in an unintended manner.

It was the editors’ intent and hope to encourage applications of techniques for model fitting influenced by information theory, originally created for communication theory but more recently expanded to cover algorithmic information theory and applicable to statistical modeling. In this view, the objective in modeling is to learn structures and properties in data by simply fitting models without requiring any of them to be “true”. The performance is not measured by any distance to the nonexisting “truth” but in terms of the probability they assign to the data, which is equivalent to the codelength with which the data can be encoded, taking advantage of the regular features the model prescribes to the data. This task requires information and coding theoretic means. Similarly, the frequently used distance measures like the Kullback-Leibler divergence and the mutual information express mean codelength differences.

D. Benedetto et al. study correlations and compressibility of proteome sequences. They identify dependencies at the range of 10 to 100 amino acids. The source of such dependencies is not entirely clear. One contributing factor in the case of interprotein dependencies is likely to be sequence duplication. The dependencies can be exploited in compression of proteome sequences. Furthermore, they seem to have a role in evolutionary and structural analysis of proteomes.

C. M. Hemmerich and S. Kim also use information theory for studying the correlations in protein sequences. They base their method on computing the mutual information of nonadjacent residues lying at a fixed distance d apart, where the distance is varied from zero to a fixed upper bound. The mutual information vector formed by these statistics is used to train a nearest-neighbor classifier to predict membership in protein families, with results indicating that the correlations between nonadjacent residues are predictive of protein family.

H. M. Aktulga et al. detect statistically dependent genomic sequences. Their paper addresses two applications. First, they identify different parts of a gene (maize zmSRp32) that are mutually dependent without appealing to the usual assumption that dependencies are revealed by a considerable amount of exact matches. It is discovered that dependencies exist between the 5′ untranslated region and its alternatively spliced exons. As a second application, they discover short tandem repeats which are useful in, for instance, genetic profiling. In both cases, the used techniques are based on mutual information.

The objective in the paper by A. Rao et al. is to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. Their methodology is based on the concept of directed information, a variant of mutual information introduced originally in the 1970s. It is shown that directed information can be successfully used for selecting motifs that discriminate between tissue-specific and nonspecific LREs. In particular, the performance of directed information is better than that of mutual information.

F. Fabris et al. present an in-depth study of BLOSUM (block substitution matrix) scores. They propose a decomposition of the BLOSUM score into three components: the mutual information of two compared sequences, the divergence of observed amino acid co-occurrence frequencies from the probabilities in the substitution matrix, and the background frequency divergence measuring the stochastic distance of the observed amino acid frequencies from the marginals in the substitution matrix. The authors show how the result of the decomposition, called BLOSpectrum, can be used to analyze questions about the correctness of the chosen BLOSUM matrix, the degree of typicality of compared sequences or their alignment, and the presence of weak or concealed correlations in alignments with low BLOSUM scores.

The paper by J. Conery presents a new framework for biological sequence alignment that is based on describing pairs of sequences by simple regular expressions. These regular expressions are given in terms of right-linear grammars, and the best grammar is found by use of the MDL principle. Essentially, when two sequences contain similar substrings, this similarity can be exploited to describe the sequences with fewer bits. The precise codelengths are determined with a substitution matrix that provides conditional probabilities for the event that a particular symbol is replaced by another particular symbol. One advantage of such a grammar-based approach is that gaps are not needed to align sequences of varying length. The author experimentally compares the alignments found by his method with those found by CLUSTALW. In a second experiment, he measures the accuracy of his method on pairwise alignments taken from the BAliBASE benchmark.

S. C. Evans et al. explore miRNA sequences based on MDLcompress, an MDL-based grammar inference algorithm that is an extension of the optimal symbol compression ratio (OSCR) algorithm published earlier. Using MDLcompress, they analyze the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Their results suggest that MDLcompress outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity.

The partially redundant third position of codons (protein-coding nucleotide triplets) tends to have a strongly biased distribution. The amount of bias is known to be correlated with G+C (guanine-cytosine) composition in the genome. In their paper, H. Suzuki et al. quantify the correlation of G+C composition with synonymous codon usage bias, where the bias is measured by the entropy of the third codon position. They show that the correlation depends on various genomic features and varies among different species. This raises several interesting questions about the different evolutionary forces causing the codon usage bias.

The paper by P. E. Meyer et al. tackles the challenging problem of inferring large gene regulatory networks using information theory. Their MRNET method extends the maximum relevance/minimum redundancy (MRMR) feature selection technique to networks by formulating the network inference problem as a series of input/output supervised gene selection procedures. Empirical results are competitive with the state-of-the-art methods.

P. Kontkanen et al. study the problem of computing the normalized maximum likelihood (NML) universal model for Bayesian networks, which are important tools for modeling discrete data in biological applications. The most advanced MDL method for model selection between such networks is based on comparing the NML distributions for each network under consideration, but the naive computation of these distributions requires exponential time with respect to the given data sample size. Utilizing certain computational tricks, and building on earlier work with multinomial and Naive Bayes models, the authors show how the computation can be performed efficiently for tree-structured Bayesian networks.

ACKNOWLEDGMENTS

We thank the Editor-in-Chief for the opportunity to prepare this special issue, and the staff of Hindawi for their assistance. The greatest credit is of course to the authors, who submitted contributions of the highest quality. We also thank the reviewers who have had a crucial role in the selection and editing of the ten papers appearing in the special issue.

Jorma Rissanen
Peter Grünwald
Jukka Heikkonen
Petri Myllymäki
Teemu Roos
Juho Rousu

Research Article Compressing Proteomes: The Relevance of Medium Range Correlations

Dario Benedetto,1 Emanuele Caglioti,1 and Claudia Chica2

1 Dipartimento di Matematica, Università di Roma “La Sapienza”, Piazzale Aldo Moro 5, 00185 Roma, Italy
2 Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany

Received 14 January 2007; Revised 28 May 2007; Accepted 10 September 2007

Recommended by Teemu Roos

We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at a short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to seize the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause for this redundancy is related to the evolutionary origin of proteomes and protein sequences.

Copyright © 2007 Dario Benedetto et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Protein sequences have been considered for a long time as nearly random or highly complex sequences, from the informational content point of view. The main reason for this is the local complexity of amino acid composition, that is, the type and number of amino acids found in a sequence segment, especially inside the globular domains [1]. This complexity could be related to the so called randomness of coding sequences in DNA, already pointed out in a pioneering work [2] and explained by evolutionary models [3]. Studies on protein sequence compression show that proteins behave as sequences of independent characters and have a very low compressibility, around 1% [4]. The ordered set of protein sequences belonging to one organism, the proteome, was also considered to be not compressible due to this little Markov dependency [5]. Improvements are obtained by [6, 7]. However, later studies [8–10] suggest that proteomes contain different sources of regularities, and can be compressed to rates around 30%. For a relevant discussion on the validity of these results see Cao et al. [7].

In this work, we focus on the statistical study of proteome sequences, using the concept of entropy brought into information theory by Shannon [11]. The Shannon entropy is related to the amount of information of a sequence emitted by a certain source. The entropy h of a sequence is the limit of the average amount of information per character, when the length of the sequence tends to infinity. In particular, for a finite sequence of length L, the informational content in bits is approximately Lh and so Lh is the minimum length in bits of any sequence that contains the same information. In this way Lh provides a theoretical lower bound for the sequence's compression. A compression algorithm is intended to code a sequence into a shorter one, from which it is possible to obtain unequivocally the former. In practise, one cannot compress at a rate equal to the Shannon entropy for the given sequence. Nonetheless, it is possible to approximate such a limit, using an efficient compression algorithm.

Statistical compression algorithms achieve their goal by assigning shorter code words to the most probable characters; their efficiency depends on the accuracy of the model used to estimate each character's probability. Models try to take advantage of the correlations between characters considering, for example, how the preceding characters, that is, the character's context, determine the probability of the next one, as in the prediction by partial matching (PPM) scheme [12].

Most successful algorithms for proteome compression are based on the identification of duplicated sequences or repeats. The compress protein (CP) algorithm [5], for example, considers that duplicated sequences in proteomes are similar but not identical because of mutation and evolutionary divergence. CP uses a modified PPM that includes the probability of amino acid substitutions when estimating each residue probability. The ProtComp algorithm [8] optimises the use of approximate repeats by updating the amino acid substitution matrix as the repeated similar blocks appear along the sequence.

Table 1: Proteome sequences.

Abbreviation   Organism                   Proteome length   Number of proteins
Mj             Methanococcus jannaschii   448 779           1680
Hi             Haemophilus influenzae     509 519           1657
Vc             Vibrio cholerae            870 500           2988
Ec             Escherichia coli           1 578 496         5339
Sc             Saccharomyces cerevisiae   2 900 352         5835
Dm             Drosophila melanogaster    5 818 330         11 592
Ce             Caenorhabditis elegans     6 874 562         17 456
Hs             Homo sapiens               3 295 751         5733

The context-tree weighting (CTW) [13] is another context-based method that has been applied for biological sequence compression. In [6] the authors present a CTW-based algorithm that predicts the probability of a character by weighting the importance of short and long contexts, considering as well the occurrence of approximate repeats or palindromes in those contexts. The XM [7] is a statistical algorithm which combines, via a Bayesian average, the probability of an amino acid calculated on a local scale with the probability of that same residue being part of a duplicated region of the proteome.

Nonstatistical approaches, based on the Burrows-Wheeler transform (BWT) [9], have also been used for identifying overlapping and distant repeats in proteomes, and efficiently use them in compression. Even simpler models, that rely on a block code representation of the protein sequences [10], have proved to be successful in some cases.

All the algorithms commented above put into evidence the existence and importance of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic and prokaryotic proteomes. Firstly, we analyse the correlation function of the whole sequences and find evidence of medium range correlations, between amino acids located 100 residues apart. Then we calculate the amino acid correlations considering the protein boundaries and identify the role of the intra/interprotein scale in determining the medium range correlations. Furthermore, we generate groups of amino acids using their pair correlations at distance 100, that reveal the structural meaning of the medium range correlations. Using the results of proteome correlations, we propose a statistical model for the distribution of amino acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii (bacteria), Saccharomyces cerevisiae (eukarya) and Homo sapiens (eukarya), and we estimate their compression rate to compare our results against previous works.

The sources of nonrandomness studied fall into two scales: the medium range correlations between amino acids of the same and neighboring sequences, at distances of order 100, and the short range Markovian correlations between the contiguous residues up to distance 10. Previous studies [9] show that proteomes present repeated subsequences at very long distances (50–300). In this article, we do not consider these long-range correlations of the order of the proteome length. Protein length range correlations are in agreement with the process of sequence duplication, as it has been previously suggested for long-range correlations [9]; in addition to that, we show that they also contain information about the three-dimensional structure of the proteins. Short range correlations might instead relate to the local constraints on amino acid distribution due to secondary structure requirements.

2. RESULTS AND DISCUSSION

For our statistical analysis, we used the proteomes of 4 prokaryotic and 4 eukaryotic organisms shown in Table 1. They were retrieved from the database of the Integr8 web portal [14], with exception of the Hi, Mj, Sc, and Hs proteomes that were obtained from the protein corpus in [15], for the sake of comparison of our compression rate results with previous studies on the same proteomes. The proteomes are not complete (in particular the version of Hs in the protein corpus) but they represent a natural set of proteins where the redundancy has a biological meaning. It is important to remark that the sequence of the proteins in the proteome files of the Integr8 database is not the natural one. Those files are not useful for our analysis. Nevertheless, using the additional information available in the database, it is possible to order the proteins as they are found in the chromosomes. The proteome files of the protein corpus do not present this problem, but the sequence of the proteins is not available. Therefore, for the analysis shown in Table 2 and in Figure 2, we have used the version of Hi, Mj, Sc in the Integr8 database. For the same reason, the data for Hs is missing in Table 2 since the protein order is not obtainable at the Integr8 site.

2.1. Correlations

As a first approximation to the general trends in residue distribution, we study the cooccurrence of amino acids. More precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a appear at distance k along the whole sequence

C^k = \frac{1}{20} \sum_a C^k_{aa},   (1)

with

C^k_{aa} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = a) - f_a^2,   (2)

where N is the sequence length, \chi(\sigma_i = a) is the characteristic function of finding residue a at position i, and f_a is the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that, for a distance k, the number of pairs of equal amino acid is more frequent than expected due to their frequency in the proteome.
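For concreteness, the correlation function of (1)-(2) can be computed directly from residue counts. The following minimal Python sketch is our own illustration, not code from the paper; the function name and the handling of the one-letter alphabet are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def correlation(proteome, k):
    """C^k of Equation (1): average over residues a of
    C^k_aa = P(sigma_i = a, sigma_{i+k} = a) - f_a^2 (Equation (2))."""
    n = len(proteome)
    freq = Counter(proteome)
    f = {a: freq[a] / n for a in AMINO_ACIDS}        # relative frequencies f_a
    pairs = Counter(zip(proteome, proteome[k:]))     # residue pairs at distance k
    total_pairs = n - k
    c = 0.0
    for a in AMINO_ACIDS:
        p_aa = pairs[(a, a)] / total_pairs           # empirical P(sigma_i = a, sigma_{i+k} = a)
        c += p_aa - f[a] ** 2                        # C^k_aa
    return c / 20.0

# Example: correlation at medium range (k = 100) for a concatenated proteome string
# print(correlation(proteome_string, 100))
```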

The resulting correlation function for the 8 proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones. Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k bigger than 800 or 1000, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute proteins, the medium range correlations are lost, both in eukaryotes and prokaryotes.

Figure 1 (plot of the correlation function C(k) against the distance k, from 100 to 1000): Correlation function for the 8 proteomes. Notice that the function remains positive for distances up to 1000 and that eukaryotic proteomes (continuous lines) tend to present higher values.

The medium range correlations imply that, in proteomes, the amino acid distribution of neighboring proteins tends to be more similar than that of distant ones. This fact can be related to the process of duplication, recognised as the dominant force in the evolution of protein function [16]. As protein repeats have been related to duplication at different scales (genome, gene, or exon) [17], it is possible that the amino acid patterns responsible for the observed medium range correlation have the same evolutionary origin.

Due to the correlation definition used, the medium range correlations could be caused either by pairs of amino acids belonging to the same protein, or to different ones. Therefore, we split the nonlocal correlation into two groups and analyse them separately: interprotein correlations (between 2 contiguous proteins) and intraprotein correlations (inside the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves of the same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins: first half with first half (corr−−) and second half with first half (corr+−).

Table 2: Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−−) is higher than that of not corresponding halves (+−).

Proteome   Intraprot corr   Interprot corr−−   Interprot corr+−
Mj         0.271914         0.050381           0.050231
Hi         0.265803         0.045588           0.039246
Vc         0.256386         0.063712           0.041780
Ec         0.271597         0.080064           0.069980
Sc         0.270560         0.032501           0.018606
Dm         0.295940         0.095722           0.056176
Ce         0.288071         0.122692           0.077690

These correlations are defined as follows. Let N_p be the number of proteins, let \rho^{-}_i(a) and \rho^{+}_i(a) be the relative frequency of the residue a in the first and the second half of the ith protein, respectively, and let \rho(a) be the corresponding mean value. We define

\sigma^{\pm\pm}_{i,j} = \frac{1}{20} \sum_a \bigl(\rho^{\pm}_i(a) - \rho(a)\bigr)\bigl(\rho^{\pm}_j(a) - \rho(a)\bigr),   (3)

for instance,

\sigma^{+-}_{i,j} = \frac{1}{20} \sum_a \bigl(\rho^{+}_i(a) - \rho(a)\bigr)\bigl(\rho^{-}_j(a) - \rho(a)\bigr).   (4)

We also define

\sigma^{+}_i = \sigma^{++}_{i,i}, \qquad \sigma^{-}_i = \sigma^{--}_{i,i}.   (5)

The intraprotein correlation is

C_{\mathrm{intra}} = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{\sigma^{-+}_{i,i}}{\sqrt{\sigma^{-}_i \sigma^{+}_i}}.   (6)

The two interprotein correlations are

C^{--}_{\mathrm{inter}} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma^{--}_{i,i+1}}{\sqrt{\sigma^{-}_i \sigma^{-}_{i+1}}}, \qquad
C^{+-}_{\mathrm{inter}} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma^{+-}_{i,i+1}}{\sqrt{\sigma^{+}_i \sigma^{-}_{i+1}}}.   (7)

The correlation values in Table 2 have the same trend for all the proteomes: intraprotein correlation is always higher than interprotein correlation.
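The half-protein statistics of (3)-(7) can be sketched as follows. This is our own illustration under stated assumptions: the mean profile \rho(a) is taken as the average over all half-protein profiles, and the normalisation uses the square roots shown in (6)-(7); none of the names come from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def half_profiles(protein):
    """Relative residue frequencies of the first (-) and second (+) halves of one protein."""
    mid = len(protein) // 2
    out = []
    for half in (protein[:mid], protein[mid:]):
        counts = np.array([half.count(a) for a in AMINO_ACIDS], float)
        out.append(counts / max(len(half), 1))
    return out

def intra_correlation(proteins):
    """C_intra of Equation (6), averaged over all proteins of a proteome."""
    profiles = [half_profiles(p) for p in proteins]
    minus = np.array([m for m, _ in profiles])
    plus = np.array([q for _, q in profiles])
    rho = (minus.mean(axis=0) + plus.mean(axis=0)) / 2.0   # mean profile rho(a) (assumption)
    dm, dp = minus - rho, plus - rho
    cov = (dm * dp).mean(axis=1)                           # sigma_i^{-+}
    var_m = (dm * dm).mean(axis=1)                         # sigma_i^{-}
    var_p = (dp * dp).mean(axis=1)                         # sigma_i^{+}
    return float(np.mean(cov / np.sqrt(var_m * var_p)))
```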

The correlations defined by means of \sigma^{\pm\pm}_{i,j} are different from the traditional correlation C^k_{aa}, which is the correlation of the symbol a at distance k, where k is the number of residues: here we have calculated the correlation function of the frequencies of the amino acids at the distance of one protein. In Figure 2, we also analyse how the interprotein correlations between matching and nonmatching protein halves vary with the number k of proteins separating the two halves. We compare

C^{--}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma^{--}_{i,i+k}}{\sqrt{\sigma^{-}_i \sigma^{-}_{i+k}}}, \qquad
C^{+-}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma^{+-}_{i,i+k}}{\sqrt{\sigma^{+}_i \sigma^{-}_{i+k}}}.   (8)

Figure 2 (plot of the correlation C(k) against the distance k in number of proteins, from 0 to 30, for the S. cerevisiae inter-protein corr−− and corr+−): Correlation function, at distance of k proteins, between amino acids belonging to corresponding (corr−−) and noncorresponding (corr+−) halves; S. cerevisiae proteome. Correlation between corresponding halves is higher, suggesting that structural requirements modulate the evolution of protein sequences, by maintaining certain amino acid patterns.

As an extension of the results in Table 2, we find that the correlation between matching halves is kept higher than that of noncorresponding halves along the proteome. Analogous results to Table 2 and Figure 2 hold for second-second and first-second halves.

Gene duplication can explain both the existence and order dependence of interprotein correlation, but it is not enough to justify why intraprotein correlations remain high, because high interprotein correlations can also appear in a low intraprotein correlations context. Indeed, the presence of intraprotein correlations indicates a nonrandom distribution of amino acids at a protein length scale. This nonrandomness can be related to segmental duplication, that is, duplication of segments inside the same protein; likewise, it can reflect the maintenance of amino acid patterns during the protein divergence that follows gene duplication as a consequence of the structural constraints imposed upon protein sequences.

As an example, extensive searches of protein databases [18] reveal the high frequency of tandemly repeated sequences of approximately 50 amino acids, ARM and HEAT, in eukaryotic proteins. Moreover, those repeats present a core of strongly conserved hydrophobic residues even when the other residues start to differ at several other positions.

The evidence obtained from the correlation analysis does not allow to clarify the nature of the structural constraints measured: do they reflect the modular repetition of secondary structure elements, caused by duplication or, perhaps, they depend on the conservation of higher order tertiary structure units like domains? We try to address this question by defining amino acid groups as explained in the next section.

2.2. Grouping of amino acids

In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids defined by an a priori classification. The Shannon entropy was then estimated from the entropies of the blocks of n-characters. The authors did not find enough evidence to support the existence of short range correlations between the amino acids of protein sequences.

Conversely, given the above evidence of medium range correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20 amino acids. We calculate C^k_{ab}, the correlation between all amino acid pairs ab at distance k, in the same way we calculate C^k_{aa} in the previous section:

C^k_{ab} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = b) - f_a f_b.   (9)

A quick look at the resulting 20 × 20 matrix for k = 100 (Figure 3), which presumably includes both intraprotein and interprotein correlation, puts in evidence that the signs of the matrix elements, and thus the positive and negative correlations, are not distributed randomly among residues but, instead, in a grouped fashion: some amino acids present positive or negative correlations with the same subset of residues. Then, we construct groups of amino acids in such a way that they maximise the positive medium range correlation; in practical terms it means that amino acids which are more likely to appear at distances of order 100 would be grouped together.

For a given partition of the set of amino acids in N_g groups, we calculate the sum of the correlation function between any pair of residues ab belonging to a same group. More precisely, groups are obtained by maximising the following quantity:

F(G) = \sum_{i=1}^{N_g} \sum_{a,b \in g_i} \sum_{k=1}^{200} C^k_{ab},   (10)

which is a function of a partition G of the amino acids in N_g disjoint sets g_i. Due to the huge number of possible choices for the groups, we maximise this value using a simulated annealing algorithm. This is a Monte Carlo algorithm used for optimisation [19]. For a given partition G, we construct a new partition G′ by choosing at random a residue and changing its group. If F(G′) > F(G), the algorithm accepts the new partition. Iterating this procedure we would reach a local maximum which may not be the absolute maximum. In order to avoid being trapped in a local maximum, the algorithm accepts, with a small probability P, a new partition G′ for which F(G′) ≤ F(G). The value of this probability P slowly decreases to zero as the number of iterations increases in such a way that the convergence of the algorithm to the absolute maximum of F is guaranteed. The number and the structure of the groups chosen have the highest value of F(G) and represent an equilibrated partition of the 20 amino acids, that is, groups with only one element are not accepted.
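The simulated annealing search described above can be sketched as follows. This is our own minimal implementation, not the authors' code: the partition score corresponds to F(G) in (10) and is assumed to be supplied as a dictionary of summed correlations Σ_k C^k_ab, and the acceptance rule is a standard Metropolis-style choice, whereas the paper only states that worse partitions are accepted with a slowly decreasing probability P.

```python
import math
import random

def partition_score(groups, summed_corr):
    """F(G) of Equation (10): summed correlations over all residue pairs within each group."""
    return sum(summed_corr[a][b] for g in groups for a in g for b in g)

def anneal_groups(residues, summed_corr, n_groups=4, steps=200000, t0=1.0):
    """Search for a partition of the 20 residues that maximises F(G)."""
    groups = [set(residues[i::n_groups]) for i in range(n_groups)]   # arbitrary starting partition
    current = partition_score(groups, summed_corr)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9        # temperature slowly decreasing towards zero
        a = random.choice(residues)
        src = next(g for g in groups if a in g)
        dst = random.choice(groups)
        if dst is src or len(src) == 1:                # keep the partition equilibrated (no group emptied)
            continue
        src.remove(a)
        dst.add(a)
        score = partition_score(groups, summed_corr)
        if score >= current or random.random() < math.exp((score - current) / temp):
            current = score                            # accept the new partition
        else:
            dst.remove(a)
            src.add(a)                                 # reject the move and restore the old partition
    return groups, current
```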

Figure 3 (20 × 20 matrix of pair correlations, with the residues ordered VLIMFWNQHKRDEGASTCYP along both axes): Correlation between the 20 amino acids for Hi. Positive (black) and negative (grey) correlations determine amino acid groups.

The idea behind our grouping scheme is to simplify the amino acid pattern mining by taking advantage of their synonymous relationships. It is well known that mutations between amino acids sharing geometrical and/or physico-chemical properties are the basis of neutral evolution at a molecular level [20]; this fact also explains why there is not a one-to-one relationship between protein sequences and structures [21]. Moreover, structurally neighboring residues have been found to distribute differentially (proximally/distally) in the protein sequences, depending on their physico-chemical properties [22].

Indeed, the groups defined from the pair correlations at a medium range (Table 3) almost correspond with the natural classification based on their physico-chemical properties: hydrophobic, polar, charged, small, and ambiguous. In particular, the fact that hydrophobic amino acids group together allows us to think that the correlation function is gathering some of the three-dimensional information contained in the protein sequence, more precisely tertiary structure information, as hydrophobic interactions are considered the driving forces of the protein folding process [23].

Table 3: Groups of amino acids determined by maximisation of the positive medium range correlation. Amino acids that are more likely to appear at 200 residues distance are grouped together.

Proteome   Groups
Hi         LIFWSY / VMGATP / NQHKRDEC
Mj         LIFWNSY / VMQHGATCP / KRDE
Sc         LIMFWCY / NQHSTP / KRDE / VGA
Hs         VLIMFWNY / HSTC / QKDE / RGAP

Therefore, the reason why intraprotein correlations remain high is not only related to the repetition of secondary structure units, but is also the conservation of the amino acids responsible for the protein tertiary structure.

Beside this, it is important to notice that, even if the amino acid usage in eukaryotes and prokaryotes is very similar [24], the amino acid correlations are not, as they collect part of the structural information contained in the sequences. The number of groups is also different: 3 for H. influenzae and M. jannaschii, 4 for S. cerevisiae and H. sapiens. This could indicate a higher interchangeability of residues in some proteomes, but further analysis is needed to confirm this hypothesis.

2.3. Sequence entropy estimation

In order to quantify the capability that a statistical model has to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding compressor [25]. We estimate the compression rate of such a compressor with the sequence entropy

S = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_i(\sigma_i),   (11)

using the model to calculate the probability p_i(\sigma_i) of character \sigma_i at position i. The better is the model, the lower is the estimated value of the sequence entropy. We construct three models to estimate the probability of each character, considering the previous ones and taking into account both short and medium range correlations. For each model, we find parameters that minimise the sequence entropy. The S_min value obtained is taken as an estimate of the compression rate of a running arithmetic codification [25] of the proteomes and is used to compare our results with other compression algorithms (Table 4).

Previous works on protein sequence compression like [5] are based on short range Markovian models. In those models, the probability of each amino acid is calculated as a function of the context in which it appears, considering the frequency with which this amino acid happens to be after the l previous residues.

Table 4: Compression rate in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.

Algorithm                                  Hi      Mj      Sc      Hs
One-character entropy                      4.155   4.068   4.165   4.133
CP, Nevill-Manning and Witten 1999 [5]     4.143   4.051   4.146   4.112
lza-CTW, Matsumoto et al. 2000 [6]         4.118   4.028   3.951   3.920
ProtComp, Cao et al. 2007 [7]              4.108   4.008   3.938   3.824
XM, Cao et al. 2007 [7]                    4.102   4.000   3.885   3.786
Model 1*                                   4.111   4.017   3.963   3.978
Model 2*                                   4.102   4.005   3.948   3.933
Model 3*                                   4.100   4.002   3.945   3.931
ProtComp, Hategan and Tabus 2004 [8]†      2.330   3.910   3.440   3.910
BWT/SCP, Adjeroh and Nan 2006 [9]†         2.546   2.273   3.111   3.435
* Estimation. † Results obtained with a different set of proteomes.

Following this idea, we start our statistical description of proteome sequences taking into account the information given by the neighboring residues, using a variation of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider the contexts up to a length N_c (number of contexts) that precede it, that is, the substrings \sigma_{i-k} \cdots \sigma_{i-1} for k = 0, ..., N_c. For any character a, we count the number F^i_k(a) of previous occurrences of the substring \sigma_{i-k} \cdots \sigma_{i-1} a. The conditional frequency of finding character a after the context \sigma_{i-k} \cdots \sigma_{i-1} is obtained dividing by the sum over all amino acids b at position i:

\frac{F^i_k(a)}{\sum_b F^i_k(b)}.   (12)

Our model 1 predicts the probability of character a at position i with

Model 1:  p_i(a) = \frac{1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)}.   (13)

We remark that the main difference between our short range approach and CTW is that we give a weight to the different contexts, while in [6] a weight is given to their corresponding conditional probabilities. We find that the most informative positions were the previous 8; this length is in qualitative agreement with the results found in [6]. Model 1 in Table 4 indicates the results obtained considering only the short range correlations for N_c = 8.

The model depends on the parameters \lambda_k that are optimised, using standard algorithms for minimisation, in order to achieve the best estimate of the compression rate. This “entropy minimisation” stage is very time expensive. In a real compression procedure, those parameters should be specified and therefore would contribute to the estimated entropy. In our case this contribution is negligible.

The short range correlations support the existence of periodic patterns in protein sequences. They can be caused by the alternation of alpha-beta secondary structure units, as argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models denote that local conformations of protein structures have different sequence specificity [29].

The intra/interprotein correlations identified in previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take into account these fluctuations in our second model (model 2 in Table 4):

Model 2:  p_i(a) = \frac{1 + \mu R^i_L(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \mu R^i_L(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)}.   (14)

Here we added

R^i_L(a) = \frac{i}{L} \bigl(\text{number of } a \text{ in } \sigma_{i-L} \cdots \sigma_{i-1}\bigr).   (15)

This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position i − L. The factor i/L guarantees that \sum_a R^i_L(a) = i, so that it increases with i in the same way as the other terms of the sum (e.g., \sum_a F^i_0(a) = i). The parameter \mu is optimised as \lambda_k. The optimal values for L found during the entropy minimisation stage are 190 for Hi, 163 for Mj, 105 for Sc, and 115 for Hs.

Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability of the residue to belong to a certain group and then the conditional probability of the residue once the group is given:

Model 3:  p_i(a) = \frac{1 + \mu G^i_L(g_a) f^i(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \bigl(1 + \mu G^i_L(g_b) f^i(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b)\bigr)},   (16)

where g_a is the group of a, f^i(a) is the relative frequency of a in its group, as measured up to the position i − 1, and

G^i_L(g) = \text{number of amino acids of the group } g \text{ in } \sigma_{i-L} \cdots \sigma_{i-1}.   (17)

For this model, the optimal values of the parameter L are 129 for Hi, 94 for Mj, 77 for Sc, and 100 for Hs.
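To make the estimation procedure concrete, here is a minimal Python sketch of the per-character entropy of (11) under a Model 1 style predictor (13). It is our own illustration with hypothetical names: the context weights λ_k are fixed to 1 instead of being optimised as in the paper, and the sequence is assumed to use only the 20 standard one-letter residue codes.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def model1_entropy(sequence, n_contexts=8, lam=1.0):
    """S of Equation (11) with p_i(a) from Equation (13), using equal context weights lambda_k = lam."""
    counts = defaultdict(int)       # occurrences, so far, of each (context + character) substring
    total_bits = 0.0
    for i, observed in enumerate(sequence):
        scores = {}
        for b in AMINO_ACIDS:
            s = 1.0                                          # the "+1" term of Equation (13)
            for k in range(min(n_contexts, i) + 1):
                s += lam * counts[sequence[i - k:i] + b]     # lambda_k * F_k^i(b)
            scores[b] = s
        p = scores[observed] / sum(scores.values())          # p_i(sigma_i)
        total_bits -= math.log2(p)
        for k in range(min(n_contexts, i) + 1):              # update the context counts with sigma_i
            counts[sequence[i - k:i] + observed] += 1
    return total_bits / len(sequence)                        # estimated bits per character
```

Replacing the fixed weights by optimised λ_k, and adding the μR^i_L(a) or μG^i_L(g_a)f^i(a) terms of (14) and (16), reproduces the structure of models 2 and 3.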

As one can see in Table 4, the capability of our statistical model to represent the nonrandom information contained in proteomes is comparable to those models that consider repeated amino acid patterns at both short and medium scale [6, 7].

The improvement in the performance of models 2 and 3 is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid frequencies at a protein length range. This demonstrates that both correlation types are informative and that the statistical significance of repetitions at those scales is enough to model the amino acid probabilities.

The compression rate achieved when the medium range correlations are modelled with the frequency of amino acid groups (model 3) is almost equivalent to the compression rate of model 2. From a biological perspective it indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might be coming from the three-dimensional structure constraints.

According to our results, there is an important difference in the compressibility rates of the eukaryotic and prokaryotic proteomes which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than those of H. influenzae and M. jannaschii; correspondingly, the correlation functions of Sc and Hs remain positive for longer distances than Hi and Mj. This additional redundancy could be related to the presence, in eukaryotic proteomes, of paralogous proteins with very similar distribution of synonymous amino acids, but different function. There is evidence suggesting that paralogous genes have been recruited during evolution of different metabolic pathways and are related to the organism adaptability to environmental changes [16]. On the other hand, the lower compressibility of the Hi and Mj proteomes is in agreement with the reduction of prokaryotic genome size as an adaptation to fast metabolic rates [30, 31].

3. CONCLUSIONS

In this article, we show that the correlation function gathers evolutionary and structural information of proteomes. Even if proteins are highly complex sequences, at a proteome scale, it is possible to identify correlations between characters at short and medium ranges. It confirms that protein sequences are not completely random, indeed they present repeated amino acid patterns at those two scales. The alternation of secondary structure units can determine the local redundancy. This was already known and generally modelled using Markov models. In our opinion, sequence duplication is a reasonable explanation for the interprotein correlation. However, it does not account for the intraprotein correlations; this can instead be related to the maintenance of the amino acid patterns responsible for the three-dimensional structure, as the segregation between hydrophobic and polar amino acids indicates. More elaborately, the sampling of the space of structures during proteome evolution is determined by the duplication processes but it is highly constrained by the structural and functional requirements that protein sequences have to meet inside a living system.

Prokaryotic proteomes show lower correlation values, especially for distances under 100 residues, and a smaller compressibility than eukaryotic proteomes. These characteristics point at a higher redundancy of eukaryotic proteome sequences, and suggest that the increase of proteome size does not imply de novo generation of protein sequences, with completely different amino acid distribution.

ACKNOWLEDGMENTS

The authors would like to thank Toby Gibson for reading and commenting the manuscript and the reviewers for their constructive criticism that helped to improve the quality of the paper.

REFERENCES

[1] J. C. Wootton, “Non-globular domains in protein sequences: automated segmentation using complexity measures,” Computers & Chemistry, vol. 18, no. 3, pp. 269–285, 1994.
[2] B. E. Blaisdell, “A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences,” Journal of Molecular Evolution, vol. 19, no. 2, pp. 122–133, 1983.
[3] Y. Almirantis and A. Provata, “An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome,” BioEssays, vol. 23, no. 7, pp. 647–656, 2001.
[4] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, “Information content of protein sequences,” Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[5] C. G. Nevill-Manning and I. H. Witten, “Protein is incompressible,” in Proceedings of the Data Compression Conference (DCC ’99), pp. 257–266, Snowbird, Utah, USA, March 1999.
[6] T. Matsumoto, K. Sadakane, and H. Imai, “Biological sequence compression algorithms,” Genome Informatics, vol. 11, pp. 43–52, 2000.
[7] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, “A simple statistical algorithm for biological sequence compression,” in Proceedings of the Data Compression Conference (DCC ’07), pp. 43–52, Snowbird, Utah, USA, March 2007.
[8] A. Hategan and I. Tabus, “Protein is compressible,” in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG ’04), pp. 192–195, Espoo, Finland, June 2004.
[9] D. Adjeroh and F. Nan, “On compressibility of protein sequences,” in Proceedings of the Data Compression Conference (DCC ’06), pp. 422–434, Snowbird, Utah, USA, March 2006.
[10] G. Sampath, “A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae,” in Proceedings of the IEEE Bioinformatics Conference (CSB ’03), pp. 287–293, Stanford, Calif, USA, August 2003.

[11] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, 1948.
[12] J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, “The context-tree weighting method: basic properties,” IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, 1995.
[14] Integr8 web portal, ftp://ftp.ebi.ac.uk/pub/databases/integr8/, 2006.
[15] J. Abel, “The data compression resource on the internet,” http://www.datacompression.info/, 2005.
[16] C. A. Orengo and J. M. Thornton, “Protein families and their evolution—a structural perspective,” Annual Review of Biochemistry, vol. 74, pp. 867–900, 2005.
[17] J. Heringa, “The evolution and recognition of protein sequence repeats,” Computers & Chemistry, vol. 18, no. 3, pp. 233–243, 1994.
[18] M. A. Andrade, C. Petosa, S. I. O'Donoghue, C. W. Müller, and P. Bork, “Comparison of ARM and HEAT protein repeats,” Journal of Molecular Biology, vol. 309, no. 1, pp. 1–18, 2001.
[19] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[20] L. A. Mirny and E. I. Shakhnovich, “Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function,” Journal of Molecular Biology, vol. 291, no. 1, pp. 177–196, 1999.
[21] M. A. Huynen, P. F. Stadler, and W. Fontana, “Smoothness within ruggedness: the role of neutrality in adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 1, pp. 397–401, 1996.
[22] S. Karlin, “Statistical signals in bioinformatics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 38, pp. 13355–13362, 2005.
[23] K. A. Dill, “Dominant forces in protein folding,” Biochemistry, vol. 29, no. 31, pp. 7133–7155, 1990.
[24] B. Rost, “Did evolution leap to create the protein universe?” Current Opinion in Structural Biology, vol. 12, no. 3, pp. 409–416, 2002.
[25] J. Rissanen and G. G. Langdon Jr., “Arithmetic coding,” IBM Journal of Research and Development, vol. 23, no. 2, pp. 149–162, 1979.
[26] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, “Microbial gene identification using interpolated Markov models,” Nucleic Acids Research, vol. 26, no. 2, pp. 544–548, 1998.
[27] V. P. Turutina, A. A. Laskin, N. A. Kudryashov, K. G. Skryabin, and E. V. Korotkov, “Identification of latent periodicity in amino acid sequences of protein families,” Biochemistry (Moscow), vol. 71, no. 1, pp. 18–31, 2006.
[28] E. V. Korotkov and M. A. Korotkova, “Enlarged similarity of nucleic acid sequences,” DNA Research, vol. 3, no. 3, pp. 157–164, 1996.
[29] A. C. Camproux and P. Tufféry, “Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity,” Biochimica et Biophysica Acta, vol. 1724, no. 3, pp. 394–403, 2005.
[30] S. D. Bentley and J. Parkhill, “Comparative genomic structure of prokaryotes,” Annual Review of Genetics, vol. 38, pp. 771–791, 2004.
[31] J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork, “Prediction of effective genome size in metagenomic samples,” Genome Biology, vol. 8, no. 1, p. R10, 2007.

Research Article A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

Chris Hemmerich1 and Sun Kim2

1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN 47405-3700, USA
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408-3912, USA

Received 28 February 2007; Revised 22 June 2007; Accepted 31 July 2007

Recommended by Juho Rousu

We investigate methods of estimating residue correlation within protein sequences. We begin by using mutual information (MI) of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in protein family classification experiments, the modeling power of MIV was shown to be significantly better than the classic MI method, reaching the level where proteins can be classified without alignment information.

Copyright © 2007 C. Hemmerich and S. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION tein structure. To investigate this question, we used the fam- ily and sequence alignment information from Pfam-A [4]. To A protein can be viewed as a string composed from the 20- model sequences, we defined and used the mutual informa- symbol amino acid alphabet or, alternatively, as the sum of tion vector (MIV) where each entry represents the MI estima- their structural properties, for example, residue-specific in- tion for amino acid pairs separated by a particular distance in teractions or hydropathy (hydrophilic/hydrophobic) interac- the primary structure. We studied two different properties of tions. Protein sequences contain sufficient information to sequences: amino acid identity and hydropathy. construct secondary and tertiary protein structures. Most In this paper, we report three important findings. methods for predicting protein structure rely on primary se- (1) MI scores for the majority of 1000 real protein se- quence information by matching sequences representing un- quences sampled from Pfam are statistically significant known structures to those with known structures. Thus, re- (as defined by a P value cutoff of .05) as compared to searchers have investigated the correlation of amino acids random sequences of the same character composition, within and across protein sequences [1–3]. Despite all this, in see Section 4.1. terms of character strings, proteins can be regarded as slightly (2) MIV has significantly better modeling power of pro- edited random strings [1]. teins than MI, as demonstrated in the protein sequence Previous research has shown that residue correlation can classification experiment, see Section 5.2. provide biological insight, but that MI calculations for pro- (3) The best classification results are provided by MIVs tein sequences require careful adjustment for sampling er- containing scores generated from both the amino acid rors. An information-theoretic analysis of amino acid con- alphabet and the hydropathy alphabet, see Section 5.2. tact potential pairings with a treatment of sampling biases has shown that the amount of amino acid pairing informa- In Section 2, we briefly summarize the concept of MI tion is small, but statistically significant [2]. Another recent and a method for normalizing MI content. In Section 3,we study by Martin et al. [3] showed that normalized mutual in- formally define the MIV and its use in characterizing pro- formation can be used to search for coevolving residues. tein sequences. In Section 4, we test whether MI scores for From the literature surveyed, it was not clear what signif- protein sequences sampled from the Pfam database are sta- icance the correlation of amino acid pairings holds for pro- tistically significant compared to random sequences of the 2 EURASIP Journal on Bioinformatics and Systems Biology same residue composition. We test the ability of MIV to clas- From the entropy equations above, we derive the MI sify sequences from the Pfam database in Section 5, and in equation for a protein sequence X = (x1, ..., xN ): Section 6, we examine correlation with MIVs and further in-       P(x , x ) vestigate the effects of alphabet size in terms of information = i j MI P xi, xj log2 ,(4) P(xi)P(xj ) theory. We conclude with a discussion of the results and their i∈ΣA j∈ΣA implications. where the pair probability P(xi, xj ) is the frequency of two residues being adjacent in the sequence. 2. MUTUAL INFORMATION (MI) CONTENT We use MI content to estimate correlation in protein se- 2.2. 
Normalization by joint entropy quences to gain insight into the prediction of secondary and tertiary structures. Measuring correlation between residues Since MI(X, Y) represents a reduction in H(X)orH(Y), the is problematic because sequence elements are symbolic vari- value of MI(X, Y) can be altered significantly by the entropy ables that lack a natural ordering or underlying metric [5]. in X and Y. The MI score we calculate for a sequence is also ff Residues can be ordered in certain properties such as hy- a ected by the entropy in that sequence. Martin et al. [3]pro- dropathy, charge, and molecular weight. Weiss and Herzel [6] pose a method of normalizing the MI score of a sequence analyzed several such correlation functions. using the joint entropy of a sequence. The joint entropy, or H(X, Y), can be defined as MI is a measure of correlation from information theory       [7] based on entropy, which is a function of the probability =− H(X, Y) P xi, xj log2P xi, xj (5) distribution of residues. We can estimate entropy by count- i∈ΣA j∈ΣA ing residue frequencies. Entropy is maximal when all residues appear with the same frequency. MI is calculated by system- and is related to MI(X, Y) by the equation atically extracting pairs of residues from a sequence and cal- MI(X, Y) = H(X)+H(Y) − H(X, Y). (6) culating the distribution of pair frequencies weighted by the frequencies of the residues composing the pairs. The complete equation for our normalized MI measure- By defining a pair as adjacent residues in the protein se- ment is quence, MI estimates the correlation between the identities MI(X, Y) of adjacent residues. We later define pairs using nonadjacent    H(X, Y)       residues, and physical properties rather than residue identi- ∈Σ ∈Σ P x , x log P x , x /P x P x =− i A j A  i j  2  i j  i j ties. . i∈ΣA j∈ΣA P xi, xj log2P xi, xj MI has been proven useful in multiple studies of bio- (7) logical sequences. It has been used to predict coding regions in DNA [8], and has been used to detect coevolving residue 3. MUTUAL INFORMATION VECTOR (MIV) pairs in protein multiple sequence alignments [3]. We calculate the MI of a sequence to characterize the struc- 2.1. Mutual information ture of the resulting protein. The structure is affected by dif- ferent types of interactions, and we can modify our meth- The entropy of a random variable X, H(X), represents the ods to consider different biological properties of a protein se- uncertainty of the value of X. H(X) is 0 when the identity of quence. To improve our characterization, we combine these X is known, and H(X) is maximal when all possible values different methods to create of vector of MI scores. of X are equally likely. The mutual information of two vari- Using the flexibility of MI and existing knowledge of pro- ables MI(X, Y) represents the reduction in uncertainty of X tein structures, we investigate several methods for generating given Y,andconversely,MI(Y, X) represents the reduction MI scores from a protein sequence. We can calculate the pair in uncertainty of Y given X: probability P(xi, xj ) using any relationship that is defined for ∈ Σ MI(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X). (1) all amino acid identities i, j A. In particular, we examine distance between residue pairings, different types of residue- | When X and Y are independent, H(X Y) simplifies to residue interactions, classical and normalized MI scores, and H(X), so MI(X, Y) is 0. 
The upper bound of MI(X, Y) is the three methods of interpreting gap symbols in Pfam align- lesser of H(X)andH(Y), representing complete correlation ments. between X and Y: H(X | Y) = H(Y | X) = 0. (2) 3.1. Distance MI vectors We can measure the entropy of a protein sequence S as Protein exists as a folded structure, allowing nonadjacent      residues to interact. Furthermore, these interactions help to =− H(S) P xi log2P xi ,(3)determine that structure. For this reason, we use MIV to ∈Σ i A characterize nonadjacent interactions. Our calculation of MI where ΣA is the alphabet of amino acid residues and P(xi)is for adjacent pairs of residues is a specific case of a more gen- the marginal probability of residue i.InSection 3.3, we dis- eral relationship, separation by exactly d residues in the se- cuss several methods for estimating this probability. quence. C. Hemmerich and S. Kim 3
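To make the preceding definitions concrete, the following minimal Python sketch estimates the adjacent-pair mutual information of a single sequence and its joint-entropy-normalized variant, as in (4)-(7). It is an illustrative reimplementation rather than the authors' code; marginal probabilities are the residue frequencies of the sequence being scored, which is the paper's default estimate.

```python
from collections import Counter
from math import log2

def mi_adjacent(seq, normalize=False):
    """MI between adjacent residues of one sequence, optionally divided
    by the joint entropy H(X, Y) as in the normalization of eq. (7)."""
    pairs = list(zip(seq, seq[1:]))               # adjacent residue pairs
    pair_counts = Counter(pairs)
    res_counts = Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)

    mi, h_joint = 0.0, 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n_pairs                        # pair probability P(x_i, x_j)
        p_a, p_b = res_counts[a] / n_res, res_counts[b] / n_res
        mi += p_ab * log2(p_ab / (p_a * p_b))     # eq. (4)
        h_joint -= p_ab * log2(p_ab)              # eq. (5)
    return mi / h_joint if normalize else mi

if __name__ == "__main__":
    s = "DEIPCPFCGC"                              # short example, also used in Table 1
    print("MI(0)          :", round(mi_adjacent(s), 5))
    print("MI(0)/H(X, Y)  :", round(mi_adjacent(s, normalize=True), 5))
```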

Table 1: MI(3)—residue pairings of distance 3 for the sequence Our second method is to use a common prior probability DEIPCPFCGC. distribution for all sequences. Since all of our sequences are (1) DEIPCPFCGC (4) DEIPCPFCGC part of the Pfam database, we use residue frequencies calcu- lated from Pfam as our prior. In our results, we refer to this (2) DEIPCPFCGC (5) DEIPCPFCGC method as the Pfam prior. The large sample size allows the (3) DEIPCPFCGC (6) DEIPCPFCGC frequency to more accurately estimate the probability. How- ever, since Pfam contains sequences from many organisms, Table 2: Amino acid partition primarily based on hydropathy. the probability distribution is less accurate. Hydropathy Amino acids Hydrophobic: C,I,M,F,W,Y,V,L 3.4. Interpreting gap symbols Hydrophilic: R,N,D,E,Q,H,K,S,T,P,A,G The Pfam sequence alignments contain gap information, which presents a challenge for our MIV calculations. The Definition 1. For a sequence S = (s1, ..., sN ), mutual infor- gap character does not represent a physical element of the mation of distance d, MI(d) is defined as sequence, but it does provide information on how to view     the sequence and compare it to others. Because of this con-     P x , x tradiction, we compared three strategies for processing gap = d i  j  MI(d) Pd xi, xj log2 . (8) characters in the alignments. P xi P xj i∈ΣA j∈ΣA

The pair probabilities, Pd(xi, xj ), are calculated using all The strict method combinations of positions sm and sn in sequence S such that This method removes all gap symbols from a sequence be- m +(d +1)= n, n ≤ N. (9) fore performing any calculations, operating on the protein sequence rather than an alignment. A sequence of length N will contain N − (d +1)pairs. The literal method Table 1 shows how to extract pairs of distance 3 from the sequence DEIPCPFCGC. Gaps are a proven tool in creating alignments between re- lated sequences and searching for relationships between se- Definition 2. The mutual information vector of length k for quences. This method expands the sequence alphabet to in- asequenceX,MIV(X), is defined as a vector of k entries, k clude the gap symbol. For Σ we define and use a new alpha- MI(0), ...,MI(k − 1). A bet:

3.2. Sequence alphabets Σ = Σ ∪{−} A A . (10) The alphabet chosen to represent the protein sequence has Σ Σ Σ two effects on our calculations. First, by defining the alpha- MI is then calculated for A . H is transformed to G using bet, we also define the type of residue interactions we are the same method. measuring. By using the full amino acid alphabet, we are only able to find correlations based on residue-specific inter- The hybrid method actions. If we instead use an alphabet based on hydropathy, we make correlations based on hydrophilic/hydrophobic in- This method is a compromise of the previous two methods. teractions. Second, altering the size of our alphabet has a sig- Gap symbols are excluded from the sequence alphabet when nificant effect on our MI calculations. This effect is discussed calculating MI. Occurrences of the gap symbol are still con- in Section 6.2. sidered when calculating the total number of symbols. For a In our study, we used two different alphabets: a set of 20 sequence containing one or more gap symbols, amino acids residues, ΣA, and a hydropathy-based alphabet,  Σ H , derived from grammar complexity and syntactic struc- Pi < 1. (11) ∈Σ ture of protein sequences [9] (see Table 2 for mapping ΣA to i A ΣH ). Pairs containing any gap symbols are also excluded, so for a 3.3. Estimating residue marginal probabilities gapped sequence,  To calculate the MIV for a sequence, we estimate the Pij < 1. (12) marginal probabilities for the characters in the sequence al- i,j∈ΣA phabet. The simplest method is to use residue frequencies from the sequence being scored. This is our default method. TheseadjustmentsresultinanegativeMIscoreforsome Unfortunately, the quality of the estimation suffers from the sequences, unlike classical MI where a minimum score of 0 short length of protein sequences. represents independent variables. 4 EURASIP Journal on Bioinformatics and Systems Biology
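As a sketch of Definition 1 and Definition 2, the code below computes MI(d) from pairs separated by exactly d residues (offset d + 1, following eq. (9)) and stacks the first k scores into an MIV, optionally after translating the sequence into the two-letter hydropathy alphabet of Table 2. Gap handling and the Pfam prior are omitted; the example sequence and the symbols chosen for the two hydropathy classes are placeholders, not values from the paper.

```python
from collections import Counter
from math import log2

HYDROPHOBIC = set("CIMFWYVL")                     # Table 2 partition for ΣH

def to_hydropathy(seq):
    """Translate ΣA into a two-letter hydropathy alphabet ('O'/'I' are arbitrary labels)."""
    return "".join("O" if a in HYDROPHOBIC else "I" for a in seq)

def mi_d(seq, d):
    """MI(d) of Definition 1: mutual information of pairs separated by d residues."""
    pairs = list(zip(seq, seq[d + 1:]))           # N - (d + 1) pairs, eq. (9)
    if not pairs:
        return 0.0
    pair_counts, res_counts = Counter(pairs), Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)
    return sum((c / n_pairs) * log2((c / n_pairs) /
               ((res_counts[a] / n_res) * (res_counts[b] / n_res)))
               for (a, b), c in pair_counts.items())

def miv(seq, k=20, hydropathy=False):
    """MIV of Definition 2: the vector (MI(0), ..., MI(k - 1))."""
    s = to_hydropathy(seq) if hydropathy else seq
    return [mi_d(s, d) for d in range(k)]

if __name__ == "__main__":
    s = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"   # toy globin-like fragment
    print("MIV over ΣA:", [round(x, 3) for x in miv(s, k=5)])
    print("MIV over ΣH:", [round(x, 3) for x in miv(s, k=5, hydropathy=True)])
```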

Table 3: Example MIVs calculated for four sequences from Pfam. All methods used literal gap interpretation.

        Globin MI(d)        Ferrochelatase MI(d)   DUF629 MI(d)          Big 2 MI(d)
d       ΣA        ΣH        ΣA        ΣH           ΣA        ΣH          ΣA        ΣH
0       1.34081   0.42600   0.95240   0.13820      0.70611   0.04752     1.26794   0.21026
1       1.20553   0.23740   0.93240   0.03837      0.63171   0.00856     0.92824   0.05522
2       1.07361   0.12164   0.90004   0.02497      0.63330   0.00367     0.95326   0.07424
3       0.92912   0.02704   0.87380   0.03133      0.66955   0.00575     0.99630   0.04962
4       0.97230   0.00380   0.90400   0.02153      0.62328   0.00587     1.00100   0.08373
5       0.91082   0.00392   0.78479   0.02944      0.68383   0.00674     0.98737   0.03664
6       0.90658   0.01581   0.81559   0.00588      0.63120   0.00782     1.06852   0.05216
7       0.87965   0.02435   0.91757   0.00822      0.67433   0.00172     1.04627   0.12002
8       0.83376   0.01860   0.87615   0.01247      0.63719   0.00495     1.00784   0.05221
9       0.88404   0.01000   0.90823   0.00721      0.61597   0.00411     0.97119   0.04002
10      0.88685   0.01353   0.89673   0.00611      0.60790   0.00718     1.02660   0.02240
11      0.90792   0.01719   0.94314   0.02195      0.66750   0.00867     0.92858   0.02261
12      0.95955   0.00231   0.87247   0.01027      0.64879   0.00805     0.98879   0.03156
13      0.88584   0.01387   0.85914   0.00733      0.66959   0.00607     1.09997   0.04766
14      0.93670   0.01490   0.88250   0.00335      0.66033   0.00106     1.06989   0.01286
15      0.86407   0.02052   0.94592   0.00548      0.62171   0.01363     1.27002   0.06204
16      0.89004   0.04024   0.92664   0.01398      0.63445   0.00314     1.05699   0.03154
17      0.91409   0.01706   0.80241   0.00108      0.67801   0.00536     1.06677   0.02136
18      0.89522   0.01691   0.85366   0.00719      0.65903   0.00898     1.05439   0.03310
19      0.92742   0.03319   0.90928   0.01334      0.70176   0.00151     1.17621   0.01902

3.5. MIV examples In theory, a random string contains no correlation be- tween characters. So, we expect a “slightly edited random Table 3 shows eight examples of MIVs calculated from the string” to exhibit little correlation. In practice, noninfinite Pfam database. A sequence was taken from four random random strings usually have a nonzero MI score. This over- families, and the MIV was calculated using the literal gap estimation of MI in finite sequences is a factor of the length method for both ΣH and ΣA. All scores are in bits. The scores of the string, alphabet size, and frequency of the characters generated from ΣA are significantly larger than those from that make up the string. We investigated the significance of ΣH . We investigate this observation further in Sections 4.1 this error for our calculations and methods for reducing or and 6.2. correcting for the error. To confirm the significance of our MI scores, we used 3.6. MIV concatenation a permutation-based technique. We compared known cod- ing sequences to random sequences in order to generate a The previous sections have introduced several methods for P value signifying the chance that our observed MI score scoring sequences that can be used to generate MIVs. Just or higher would be obtained from a random sequence of aswecombinedMIscorestocreateMIV,wecanfurther residues. Since MI scores are dependent on sequence length ffl concatenate MIVs. Any number of vectors calculated by any and residue frequency, we used the shu e command from methods can be concatenated in any order. However, for two the HMMER package to conserve these parameters in our vectors to be comparable, they must be the same length, and random sequences. must agree on the feature stored at every index. We sampled 1000 sequences from our subset of Pfam- A. A simple random sample was performed without replace- Definition 3. Any two MIVs, MIV j (A)andMIVk(B), can be ment from all sequences between 100 and 1000 residues in concatenated to form MIVj+k(C). length. We calculated MI(0) for each sequence sampled. We then generated 10 000 shuffled versions of each sequence and calculated MI(0) for each. 4. ANALYSIS OF CORRELATION IN We used three scoring methods to calculate MI(0): PROTEIN SEQUENCES (1) ΣA with literal gap interpretation, In [1], Weiss states that “protein sequences can be regarded (2) Σ normalized by joint entropy with literal gap inter- as slightly edited random strings.” This presents a significant A pretation, challenge for successfully classifying protein sequences based on MI. (3) ΣH with literal gap interpretation. C. Hemmerich and S. Kim 5


Mean of MI(0) for0 shu .1 Sequence length (residue count) 0 100 200 300 400 500 600 700 800 900 1000 ΣA literal Sequence length (residue count) ΣA literal, normalized ΣH literal ΣA literal Σ Figure 1: Mean MI(0) of shuffled sequences. A literal, normalized ΣH literal Figure 2: Normalized MI(0) of shuffled sequences. In all three cases, the MI(0) score for a shuffled se- quence of infinite length would be 0; therefore, the calculated scores represent the error introduced by sample-size effects. this experiment for MI(1), MI(5), MI(10), and MI(15) and Figure 1, mean MI(0) of shuffled sequences, shows the aver- summarized the results in Table 4. age shuffled sequence scores (i.e., sampling error) in bits for These results suggest that despite the low MI content of each method. This figure shows that, as expected, the sam- protein sequences, we are able to detect significant MI in a pling error tends to decrease as the sequence length increases. majority of our sampled sequences at MI(0). The number of significant sequences decreases for MI(d) as d increases. The 4.1. Significance of MI(0) for protein sequences results for the classic MI method are significantly affected by sampling error. Normalization by joint entropy reduces this To compare the amount of error, in each method we nor- error slightly for most sequences, and using ΣH is a much malized the mean MI(0) scores from Figure 1 by dividing the more effective correction. mean MI(0) score by the MI(0) score of the sequence used to ffl generate the shu es. This ratio estimates the amount of the 5. MEASURING MIV PERFORMANCE THROUGH ff sequence MI(0) score attributed to sample-size e ects. PROTEIN CLASSIFICATION Figure 2, normalized MI(0) of shuffled sequences, com- pares the effectiveness of our two corrective methods in min- We used sequence classification to evaluate the ability of MI imizing the sample-size effects. This figure shows that nor- to characterize protein sequences and to test our hypothe- malization by joint entropy is not as effective as Figure 1 sug- sis that MIV characterizes a protein sequence better MI. As gests. Despite a large reduction in bits, in most cases, the por- such,ourobjectiveistomeasurethedifference in accuracy tion of the score attributed to sampling effects shows only a between the methods, rather than to reach a specific classifi- minor improvement. ΣH still shows a significant reduction in cation accuracy. sample-size effects for most sequences. We used the Pfam-A dataset to carry out this compar- Figures 1 and 2 provide insight into trends for the three ison. The families contained in the Pfam database vary in methods, but do not answer our question of whether or not sequence count and sequence length. We removed all fami- the MI scores are significant. For a given sequence S,weesti- lies containing any sequence of less than 100 residues due to mated the P value as complications with calculating MI for small strings. We also x limited our study to families with more than 10 sequences P = , (13) N and less than or equal to 200 sequences. After filtering Pfam- A based on our requirements, we were left with 2392 families where N is the number of random shuffles and x is the num- to consider in the experiment. ber of shuffles whose MI(0) was greater than or equal to Sequence similarity is the most widely used method of MI(0) for S. For this experiment, we choose a significance family classification. BLAST [10] is a popular tool incor- cutoff of .05. 
For a sequence to be labeled significant, no more porating this method. Our method differs significantly, in than 50 of the 10 000 shuffled versions may have an MI(0) that classification is based on a vector of numerical features, score equal or larger than the original sequence. We repeated rather than the protein’s residue sequence. 6 EURASIP Journal on Bioinformatics and Systems Biology
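The shuffle test just described can be sketched as follows: the MI(0) of a sequence is compared against MI(0) values of residue-shuffled copies, and the P value is estimated as x/N per eq. (13). Python's random.shuffle stands in here for the HMMER shuffle command, and far fewer than 10 000 shuffles are used to keep the example fast; both are simplifications, and the input sequence is a stand-in rather than a Pfam entry.

```python
import random
from collections import Counter
from math import log2

def mi0(seq):
    """Adjacent-pair mutual information MI(0) in bits."""
    pairs = list(zip(seq, seq[1:]))
    pair_counts, res_counts = Counter(pairs), Counter(seq)
    n_pairs, n_res = len(pairs), len(seq)
    return sum((c / n_pairs) * log2((c / n_pairs) /
               ((res_counts[a] / n_res) * (res_counts[b] / n_res)))
               for (a, b), c in pair_counts.items())

def shuffle_p_value(seq, n_shuffles=1000, seed=0):
    """P value of eq. (13): fraction of shuffled copies with MI(0) >= the original."""
    rng = random.Random(seed)
    observed = mi0(seq)
    chars = list(seq)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)                        # preserves length and residue composition
        if mi0("".join(chars)) >= observed:
            exceed += 1
    return exceed / n_shuffles

if __name__ == "__main__":
    s = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"  # stand-in
    p = shuffle_p_value(s)
    print("MI(0) =", round(mi0(s), 4), " P =", p, " significant at .05:", p < 0.05)
```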

ff Table 4: Sequence significance calculated for significance cuto .05. as MIV20. The results for these experiments are summarized in Table 5, classification Results for MI(0) and MIV20. Number of significant sequences (of 1000) Scoring method All MIV20 methods were more accurate than their MI(0) MI(0) MI(1) MI(5) MI(10) MI(15) counterparts. The best method was ΣH with hybrid gap scor- Literal-ΣA 762 630 277 103 54 ing with a mean accuracy of 85.14%. The eight best perform- Normalized ing methods used Σ , with the best method based on Σ hav- 777 657 309 106 60 H A literal-ΣA ing a mean accuracy of only 66.69%. Another important ob-

Literal-ΣH 894 783 368 162 117 servation is that strict gap interpretation performs poorly in sequence classification. The best strict method had a mean accuracy of 29.96%—much lower than the other gap meth- Classification of feature vectors is a well-studied prob- ods. lem with many available strategies. A good introduction to Our final classification attempts were made using con- many methods is available in [11], and the method chosen catenations of previously generated MIV20 scores. We eval- can significantly affect performance. Since the focus of this uated all combinations of methods. The five combinations experiment is to compare methods of calculating MIV, we most accurate at classification are shown in Table 6. The best only used the well-established and versatile nearest neighbor method combinations are over 90% accurate, with the best Σ classifier in conjunction with Euclidean distance [12]. being 90.99%. The classification power of H with hybrid gap interpretation is demonstrated, as this method appears 5.1. Classification implementation in all five results. Surprisingly, two strict scoring methods ap- pear in the top 5, despite their poor performance when used For classification, we used the WEKA package [11]. WEKA alone. uses the instance based 1 (IB1) algorithm [13] to imple- Based on our results, we made the following observa- ment nearest neighbor classification. This is an instance- tions. based learning algorithm derived from the nearest neighbor (1) The correlation of non-adjacent pairs as measured ffi pattern classifier and is more e cient than the naive imple- by MIV is significant. Classification based on every mentation. method improved significantly for MIV compared to ff The results of this method can di er from the classic MI(0). The highest accuracy achieved for MI(0) was nearest neighbor classifier in that the range of each attribute 26.73% and for MIV it was 85.14% (see Table 5). is normalized. This normalization ensures that each attribute (2) Normalized MI had an insignificant effect on scores gen- contributes equally to the calculation of the Euclidean dis- erated from Σ . Both methods reduce the sample-size tance. As shown in Table 3, MI scores calculated from Σ H A error in estimating entropy and MI for sequences. A have a larger magnitude than those calculated from Σ . This H possible explanation for the lack of further improve- normalization allows the two alphabets to be used together. ment through normalization is that ΣH is a more ef- fective corrective measure than normalization. We ex- 5.2. Sequence classification with MIV plore this possibility further in Section 6.2,werewe consider entropy for both alphabets. In this experiment, we explore the effectiveness of classifica- (3) For the most accurate methods, using the Pfam prior de- tions made using the correlation measurements outlined in creased accuracy. Despite our concerns about using the Section 3. frequency of a short sequence to estimate the marginal Each experiment was performed on a random sample of residue probabilities, the results show that these es- 50 families from our subset of the Pfam database. We then timations better characterize the sequences than the used leave-one-out cross-validation [14]totesteachofour Pfam prior probability distribution. However, four of classification methods on the chosen families. 
the five best combinations contain a method utilizing In leave-one-out validation, the sequences from all 50 the Pfam prior, showing that the two methods for esti- families are placed in a training pool. In turn, each sequence mating marginal probabilities are complimentary. is extracted from this pool and the remaining sequences are used to build a classification model. The extracted sequence (4) As with sequence-based classification, introducing gaps is then classified using this model. If the sequence is placed improves accuracy. For all methods, removing gap char- in the correct family, the classification is counted as a suc- acters with the strict method drastically reduced accu- cess. Accuracy for each method is measured as racy. Despite this, two of the five best combinations in- cluded a strict scoring method. no. of correct classifications (5) The best scoring concatenated MIVs included both al- . (14) Σ no. of classification attempts phabets. The inclusion of A is significant—all eight nonstrict ΣH methods scored better than any ΣA We repeated this process 100 times, using a new sampling method (see Table 5). The inclusion shows that ΣA of 50 families from Pfam each time. Results are reported for provides information not included in the ΣH and each method as the mean accuracy of these repetitions. For strengthens our assertion that the different alphabets each of the 24 combinations of scoring options outlined in characterize different forces affecting protein struc- Section 3, we evaluated classification based on MI(0), as well ture. C. Hemmerich and S. Kim 7
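A compact stand-in for the classification procedure described above: each attribute is rescaled to [0, 1] (mirroring the range normalization noted for IB1), each vector is held out in turn, and accuracy is the fraction of correct 1-nearest-neighbor assignments under Euclidean distance, as in eq. (14). This is not the WEKA-based setup used in the paper; the MIVs below are randomly generated placeholders for real Pfam families.

```python
import numpy as np

def loocv_1nn_accuracy(vectors, labels):
    """Leave-one-out 1-NN accuracy with per-attribute range normalization."""
    X = np.asarray(vectors, dtype=float)
    y = np.asarray(labels)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                          # avoid division by zero
    Xn = (X - X.min(axis=0)) / span                # normalize each attribute to [0, 1]

    correct = 0
    for i in range(len(Xn)):
        d = np.linalg.norm(Xn - Xn[i], axis=1)     # Euclidean distances
        d[i] = np.inf                              # exclude the held-out sequence
        correct += y[d.argmin()] == y[i]
    return correct / len(Xn)                       # eq. (14)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in data: 3 "families", 20 sequences each, 20-entry MIVs per sequence.
    centers = rng.normal(size=(3, 20))
    vecs = np.vstack([c + 0.3 * rng.normal(size=(20, 20)) for c in centers])
    labs = np.repeat([0, 1, 2], 20)
    print("LOOCV 1-NN accuracy:", loocv_1nn_accuracy(vecs, labs))
```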

Table 5: Classification results for MI(0) and MIV20 methods. SD represents the standard deviation of the experiment accuracies.

MIV20 rank   Method   MI(0) accuracy: Mean, SD   MIV20 accuracy: Mean, SD

1 Hybrid-ΣH 26.73% 2.59 85.14% 2.06

2 Normalized hybrid-ΣH 26.20% 4.16 85.01% 2.19

3 Literal-ΣH 22.92% 3.41 79.51% 2.79

4 Normalized literal-ΣH 23.45% 3.88 78.86% 2.79

5 Normalized hybrid-ΣH w/Pfam prior 26.31% 3.95 77.21% 2.94

6 Literal-ΣH w/Pfam prior 22.73% 4.90 76.89% 2.91

7 Normalized literal-ΣH w/Pfam prior 22.45% 4.89 76.29% 2.96

8 Hybrid-ΣH w/Pfam prior 22.81% 2.97 71.57% 3.15

9 Normalized literal-ΣA 17.76% 3.21 66.69% 4.14

10 Hybrid-ΣA 17.16% 3.06 64.09% 4.36

11 Normalized literal-ΣA w/Pfam prior 19.60% 3.67 63.39% 4.05

12 Literal-ΣA 16.36% 2.84 61.97% 4.32

13 Literal-ΣA w/Pfam prior 19.95% 2.84 61.82% 4.12

14 Hybrid-ΣA w/Pfam prior 23.09% 3.36 58.07% 4.28

15 Normalized hybrid-ΣA 18.10% 3.08 41.76% 4.59

16 Normalized hybrid-ΣA w/Pfam prior 23.32% 3.65 40.46% 4.04

17 Strict-ΣH w/Pfam prior 12.97% 2.85 29.96% 3.89

18 Normalized strict-ΣH w/Pfam prior 13.01% 2.72 29.81% 3.87

19 Normalized strict-ΣA w/Pfam prior 19.77% 3.52 29.73% 3.93

20 Normalized strict-ΣA 18.27% 2.92 29.20% 3.65

21 Strict-ΣH 11.22% 2.33 29.09% 3.60

22 Normalized strict-ΣH 11.15% 2.52 28.85% 3.58

23 Strict-ΣA w/Pfam prior 19.25% 3.38 28.44% 3.91

24 Strict-ΣA 16.27% 2.75 25.80% 3.60

Table 6: Top-scoring combinations of MIV methods. All combinations of two MIV methods were tested; the five most accurate combinations are shown. SD represents the standard deviation of the experiment accuracies.

Rank First method Second method Mean accuracy SD

1 Hybrid-ΣH Normalized hybrid-ΣA w/Pfam prior 90.99% 1.44

2 Hybrid-ΣH Normalized strict-ΣA w/Pfam prior 90.66% 1.47

3 Hybrid-ΣH Literal-ΣA w/Pfam prior 90.30% 1.48

4 Hybrid-ΣH Literal-ΣA 90.24% 1.73

5 Hybrid-ΣH Strict-ΣA w/Pfam prior 90.08% 1.57

6. FURTHER MIV ANALYSIS The results strengthen our observations from the classifi- cation experiment. Methods that performed well in classifi- In this section, we examine the results of our different meth- cation exhibit less redundancy between MIV indexes. In par- ods of calculating MIVs for Pfam sequences. We first use cor- ticular, the advantage of methods using ΣH is clear. In each relation within the MIV as a metric to compare several of our case, correlation decreases as the distance between indexes scoring methods. We then take a closer look at the effect of increases. For short distances, ΣA methods exhibit this to a reducing our alphabet size when translating from ΣA to ΣH . lesser degree; however, after index 10, the scores are highly correlated. 6.1. Correlation within MIVs 6.2. Effect of alphabets We calculated MIVs for 120 276 Pfam sequences using each of our methods and measured the correlation within each Not all intraprotein interactions are residue specific. Cline method using Pearson’s correlation. The results of this anal- [2] explored information attributed to hydropathy, charge, ysis are presented in Figure 3. Each method is represented by disulfide bonding, and burial. Hydropathy, an alphabet com- a20× 20 grid containing each pairing of entries within that posed of two symbols, was found to contain half as much in- MIV. formation as the 20-element amino acid alphabet. However, 8 EURASIP Journal on Bioinformatics and Systems Biology

[Figure 3 panels: (a) Literal-ΣA, Normalized literal-ΣA, Hybrid-ΣA, Normalized hybrid-ΣA; (b) Literal-ΣH, Normalized literal-ΣH, Hybrid-ΣH, Normalized hybrid-ΣH. Each panel is a 20 × 20 grid of correlations between MIV entries, with a color scale from 0.2 to 0.8.]

Figure 3: Pearson’s correlation analysis of scoring methods. Note the reduced correlation in the methods based on ΣH , which all performed very well in classification tests. with only two symbols, the alphabet should be more resistant Table 7: Comparison of measured entropy to expected entropy val- to the underestimation of entropy and overestimation of MI ues for 1000 amino acid sequences. Each sequence is 100 residues caused by finite sequence effects [15]. long and was generated by a Bernoulli scheme. For this method, a protein sequence is translated using Alphabet Theoretical Mean measured Alphabet the process given in Section 3.2. It is important to remem- size entropy entropy ber that the scores generated for entropy and MI are actually Σ estimates based on finite samples. Because of the reduced al- A 20 4.322 4.178 ΣH 2 0.971 0.964 phabet size of ΣH , we expected to see increased accuracy in entropy and MI estimations.To confirm this, we examined the effects of converting random sequences of 100 residues (a length representative of those found in the Pfam database) Σ bution. The positions remain independent, so the expected into H . MI remains 0. We generated each sequence from a Bernoulli scheme. Table 7 shows the measured and expected entropies for Each position in the sequences is selected independently of both alphabets. The entropy for ΣA is underestimated by any residues selected before it, and all selections are made .144, and the entropy for Σ is underestimated by only randomly from a uniform distribution. Therefore, for every H .007. The effect of ΣH on MI estimation is much more pro- position in the sequence, all residues are equally likely to oc- nounced. Figure 4 shows the dramatic overestimation of MI cur. in ΣA and high standard deviation around the mean. The By sampling residues from a uniform distribution, the overestimation of MI for Σ is negligible in comparison. Bernoulli scheme maximizes entropy for the alphabet size H (N): 7. CONCLUSIONS 1 H =−log . (15) 2 N We have shown that residue correlation information can be Since all positions are independent of others, MI is 0. used to characterize protein sequences. To model sequences, Knowing the theoretical values of both entropy and MI, we we defined and used the mutual information vector (MIV) can compare the calculated estimates for a finite sequence to where each entry represents the mutual information content the theoretical values to determine the magnitude of finite between two amino acids for the corresponding distance. We sequence effects. have shown that MIV of proteins is significantly different We estimated entropy and MI for each of these sequences from random sequences of the same character composition and then translated the sequences to ΣH . The translated when the distance between residues is considered. Furthermore, sequences are no longer Bernoulli sequences because the we have shown that the MIV values of proteins are significant residue partitioning is not equal—eight residues fall into one enough to determine the family membership of a protein se- category and twelve into the other. Therefore, we estimated quence with an accuracy of over 90%. What we have shown is the entropy for the new alphabet using this probability distri- simply that the MIV score of a protein is significant enough C. Hemmerich and S. Kim 9

for family classification; it is not a practical alternative to similarity-based family classification methods.

There are a number of interesting questions to be answered. In particular, it is not clear how to interpret a vector of mutual information values. It would also be interesting to study the effect of distance in computing mutual information in relation to protein structures, especially in terms of secondary structures. In our experiment (see Table 4), we have observed that normalized MIV scores exhibit more information content than nonnormalized MIV scores. However, in the classification task, normalized MIV scores did not always achieve better classification accuracy than nonnormalized MIV scores. We hope to investigate this issue in the future.

[Figure 4: Comparison of MI overestimation in protein sequences generated from Bernoulli schemes for gap distances from 0 to 19 residues; mean MIV for ΣA and for ΣH plotted against residue distance d. The full residue alphabet greatly overestimates this amount; reducing the alphabet to two symbols approximates the theoretical value of 0.]

ACKNOWLEDGMENTS

This work is partially supported by NSF DBI-0237901 and Indiana Genomics Initiatives (INGEN). The authors also thank the Center for Genomics and Bioinformatics for the use of computational resources.

REFERENCES

[1] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[2] M. S. Cline, K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers Jr., and D. Haussler, "Information-theoretic dissection of pairwise contact potentials," Proteins: Structure, Function and Genetics, vol. 49, no. 1, pp. 7–14, 2002.
[3] L. C. Martin, G. B. Gloor, S. D. Dunn, and L. M. Wahl, "Using information theory to search for co-evolving residues in proteins," Bioinformatics, vol. 21, no. 22, pp. 4116–4124, 2005.
[4] A. Bateman, L. Coin, R. Durbin, et al., "The Pfam protein families database," Nucleic Acids Research, vol. 32, Database issue, pp. D138–D141, 2004.
[5] W. R. Atchley, W. Terhalle, and A. Dress, "Positional dependence, cliques, and predictive motifs in the bHLH protein domain," Journal of Molecular Evolution, vol. 48, no. 5, pp. 501–516, 1999.
[6] O. Weiss and H. Herzel, "Correlations in protein sequences and property codes," Journal of Theoretical Biology, vol. 190, no. 4, pp. 341–353, 1998.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[8] I. Grosse, H. Herzel, S. V. Buldyrev, and H. E. Stanley, "Species independence of mutual information in coding and noncoding DNA," Physical Review E, vol. 61, no. 5, pp. 5624–5629, 2000.
[9] M. A. Jiménez-Montaño, "On the syntactic structure of protein sequences and the concept of grammar complexity," Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 641–659, 1984.
[10] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[12] T. M. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[13] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 2, pp. 1137–1145, Montréal, Québec, Canada, August 1995.
[15] H. Herzel, A. O. Schmitt, and W. Ebeling, "Finite sample effects in sequence analysis," Chaos, Solitons & Fractals, vol. 4, no. 1, pp. 97–113, 1994.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 14741, 11 pages doi:10.1155/2007/14741

Research Article Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Hasan Metin Aktulga,1 Ioannis Kontoyiannis,2 L. Alex Lyznik,3 Lukasz Szpankowski,4 Ananth Y. Grama,1 and Wojciech Szpankowski1

1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
2 Department of Informatics, Athens University of Economics & Business, Patission 76, 10434 Athens, Greece
3 Pioneer Hi-Bred International, Johnston, IA, USA
4 Bioinformatics Program, University of California, San Diego, CA 92093, USA

Received 26 February 2007; Accepted 25 September 2007

Recommended by Petri Myllymaki¨

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5 untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI’s combined DNA index system (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats—an application of importance in genetic profiling.

Copyright © 2007 Hasan Metin Aktulga et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION tivated, we propose to develop precise and reliable method- ologies for quantifying and identifying such dependencies, Questions of quantification, representation, and description based on the information-theoretic notion of mutual infor- of the overall flow of information in biosystems are of cen- mation. tral importance in the life sciences. In this paper, we de- Biomolecules store information in the form of monomer velop statistical tools based on information-theoretic ideas, strings such as deoxyribonucleotides, ribonucleotides, and and demonstrate their use in identifying informative parts amino acids. As a result of numerous genome and protein in biomolecules. Specifically, our goal is to detect statistically sequencing efforts, vast amounts of sequence data is now dependent segments of biosequences, hoping to reveal po- available for computational analysis. While basic tools such tentially important biological phenomena. It is well known as BLAST provide powerful computational engines for iden- [1–3] that various parts of biomolecules, such as DNA, RNA, tification of conserved sequence motifs, they are less suitable and proteins, are significantly (statistically) correlated. For- for detecting potential hidden correlations without experi- mal measures and techniques for quantifying these correla- mental precedence (higher-order substitutions). tions are topics of current investigation. The biological im- The application of analytic methods for finding regions plications of these correlations are deep, and they themselves of statistical dependence through mutual information has remain unresolved. For example, statistical dependencies be- been illustrated through a comparative analysis of the 5 un- tween exons carrying protein coding sequences and noncod- translated regions of DNA coding sequences [4]. It has been ing introns may indicate the existence of as-yet unknown er- known that eukaryotic translational initiation requires the ror correction mechanisms or structural scaffolds. Thus mo- consensus sequence around the start codon defined as the 2 EURASIP Journal on Bioinformatics and Systems Biology

Kozak’s motif [5]. By screening at least 500 sequences, an and introns may justify additional search for still unknown unexpected correlation between positions −2and−1 of the factors affecting RNA processing. Kozak’s sequence was observed, thus implying a novel trans- The complexity and importance of the RNA processing lational initiation signal for eukaryotic genes. This pattern system is emphasized by the largely unexplained mechanisms was discovered using mutual information, and not detected of alternative splicing, which provide a source of substantial by analyzing single-nucleotide conservation. In other rele- diversity in gene products. The same sequence may be recog- vant work, neighbor-dependent substitution matrices were nized as an exon or an intron, depending on a broader con- applied to estimate the average mutual information con- text of splicing reactions. The information that is required tent of the core promoter regions from five different organ- for the selection of a particular segment of RNA molecules is isms [6, 7]. Such comparative analyses verified the impor- very likely embedded into either exons or introns, or both. tance of TATA-boxes and transcriptional initiation. A similar Again, it seems that the splicing outcome is determined methodology elucidated patterns of sequence conservation by structural information carried by RNA molecules them- at the 3 untranslated regions of orthologous genes from hu- selves, unless the fundamental dogma of biology (the unidi- man, mouse, and rat [8], making them potential rectional flow of information from DNA to proteins) is to be targets for experimental verification of hidden functional sig- questioned. nals. Finally, the constant evolution of genomes introduces In a different kind of application, statistical dependence certain polymorphisms, such as tandem repeats, which are an techniques find important applications in the analysis of gene important component of genetic profiling applications. We expression data. Typically, the basic underlying assumption also study these forms of statistical dependencies in biologi- in such analyses is that genes expressed similarly under di- cal sequences using mutual information. vergent conditions share functional domains of biological ac- In Section 2 we develop some theoretical background, tivity. Establishing dependency or potential relationships be- and we derive a threshold function for testing statistical sig- tween sets of genes from their expression profiles holds the nificance. This function admits a dual interpretation either key to the identification of novel functional elements. Statis- as the classical log-likelihood ratio from hypothesis testing, tical approaches to estimation of mutual information from or as the “empirical mutual information.” gene expression datasets have been investigated in [1]. Section 3 contains our experimental results. In Section Protein engineering is another important area where sta- 3.1 we present our empirical findings for the problem of de- tistical dependency tools are utilized. Reliable predictions of tecting statistical dependency between different parts in a protein secondary structures based on long-range depen- DNA sequence. Extensive numerical experiments were car- dencies may enhance functional characterizations of pro- ried out on certain regions of the maize zmSRp32 gene [11], teins [9]. 
Since secondary structures are determined by both which is functionally homologous to the human ASF/SF2 al- short- and long-range interactions between single amino ternative splicing factor. The efficiency of the empirical mu- acids, the application of comparative statistical tools based tual information in this context is demonstrated. Moreover, on consensus sequence algorithms or short amino acid se- our findings suggest the existence of a biological connection quences centered on the prediction sites is far from optimal. between the 5 untranslated region in zmSRp32 and its alter- Analyses that incorporate mutual information estimates may natively spliced exons. provide more accurate predictions. Finally, in Section 3.2, we show how the empirical mu- In this work we focus on developing reliable and pre- tual information can be utilized in the difficult problem of cise information-theoretic methods for determining whether searching DNA sequences for short tandem repeats (STRs), two biosequences are likely to be statistically dependent. Our an important task in genetic profiling. We extend the simple main goal is to develop efficient algorithmic tools that can hypothesis test of the previous sections to a methodology for be easily applied to large data sets, mainly—though not testing a DNA string against different “probe” sequences, in exclusively—as a rigorous exploratory tool. In fact, as dis- ordertodetectSTRsbothaccuratelyandefficiently. Experi- cussed in detail below, our findings are not the final word on mental results on DNA sequences from the FBI’s combined the experiments we performed, but, rather, the first step in DNA index system (CODIS) are presented, showing that the the process of identifying segments of interest. Another moti- empirical mutual information can be a powerful tool in this vating factor for this project, which is more closely related to context as well. ideas from information theory, is the question of determin- ing whether there are error correction mechanisms built into large molecules, as argued by Battail; see [10] and the ref- 2. THEORETICAL BACKGROUND erences therein. We choose to work with protein coding ex- ons and noncoding introns. While exons are well-conserved In this section, we outline the theoretical basis for the mu- parts of DNA, introns have much greater variability. They tual information estimators we will later apply to biological are dispersed on strings of biopolymers and still they have sequences. to be precisely identified in order to produce biologically rel- Suppose we have two strings of unequal lengths, evant information. It seems that there is no external source of information but the structure of RNA molecules them- n = X1 X1, X2, ..., Xn, selves to generate functional templates for protein synthesis. (1) M = Determining potential mutual relationships between exons Y1 Y1, Y2, Y3, ..., YM, Hasan Metin Aktulga et al. 3

 where M ≥ n, taking values in a common finite alphabet A. ilarly, let P(x)andqj (y) denote the empirical distributions + −1 In most of our experiments, M is significantly larger than of Xn and Y j n , respectively. We define the empirical (per- ≈ ≈ 1 j n; typical values of interest are n 80 and M 300.  n j+n−1 Our main goal is to determine whether or not there is some symbol) mutual information Ij (n)betweenX1 and Yj form of statistical dependence between them. Specifically, by applying (2) to the empirical instead of the true distribu- n tions, so that we assume that the string X1 consists of independent and identically distributed (i.i.d.) random variables Xi with com-  p (x, y)  =  j mon distribution P(x)onA, and that the random vari- Ij (n) pj (x, y)log   . (3) ∈ p(x)qj(y) ables Yi are also i.i.d. with a possibly different distribution x,y A Q(y). Let {W(y | x)} be a family of conditional distribu- →∞ tions, or “channel,” with the property that, when the in- The law of large numbers implies that as n ,wehave p(x)→P(x), q (y)→Q(x), and p (x, y) converges to the true put distribution is P, the output has distribution Q, that is, j j ∈ | = ff joint distribution of X, Y. x AP(x)W(y x) Q(y)forally.Wewishtodi erentiate n between the following two scenarios: Clearly, this implies that in scenario (i), where X1 and n M n  → →∞ (i) independence: X1 and Y1 are independent, Y1 are independent, Ij (n) 0, for any fixed j,asn .On n ∈  (ii) dependence: First X1 is generated, then an index J the other hand, in scenario (ii), IJ (n)convergestoI(X; Y) > { − } J+n−1 1, 2, ..., M n+1 is chosen in an arbitrary way, and YJ 0 where the two random variables X, Yare such that X has is generated as the output of the discrete memoryless channel distribution P and the conditional distribution of Y given n = = | W with input X1 , that is, for each j 1, 2, ..., n, the condi- X x is W(y x). n | tional distribution of Yj+J−1 given X1 is W(y Xj ). Finally, In passing we should point out there are other methods the rest of the Yi’s are generated i.i.d. according to Q.(To of checking statistical (in)dependence, for instance, random- avoid the trivial case where both scenarios are identical, we ization or permutation tests discussed in [13, 14]. assume that the rows of W are not all equal to Q so that in n J+n−1 the second scenario X1 and YJ are actually not indepen- 2.1. An independence test based on dent.) mutual information It is important at this point to note that although nei- ther of these two cases is biologically realistic as a descrip- We propose to use the following simple test for detecting de- n M tion of the elements in a genomic sequence, it turns out that pendence between X1 and Y1 . Choose and fix a threshold  this set of assumptions provides a good operational starting θ>0, and compute the empirical mutual information Ij (n) n j+n−1 point: the experimental results reported in Section 3 clearly between X1 and each contiguous substring Yj of length indicate that, in practice, the resulting statistical methods ob- M  n from Y1 .IfIj (n) is larger than θ for some j, declare that tained under the present assumptions can provide accurate n j+n−1 and biologically relevant information. Of course, the natu- the strings X1 and Yj are dependent; otherwise, declare ral next step in any application is the careful examination of that they are independent. the corresponding findings, either through purely biological Before examining the issue of selecting the value of the considerations or further testing. 
threshold θ, we note that this statistic is identical to the To distinguish between (i) and (ii), we look at every pos- (normalized) log-likelihood ratio between the above two hy- sible alignment of Xn with Y M, and we estimate the mutual potheses. To see this, observe that expanding the definition 1 1   information between them. Recall that for two random vari- of pj (x, y)inIj (n), we can simply rewrite ables X, Y with marginal distributions P(x), Q(y), respec-  n p (x, y) tively, and joint distribution V(x, y), the mutual information  = 1 I j ( ) { − }( , )log Ij n (Xi,Yj+i 1) x y   between X and Y is defined as x,y∈A n i=1 p(x)qj(y)  (4) V(x, y) n  p (x, y) I(X; Y) = V(x, y)log . (2) = 1 I j {(X Y − )}(x, y)log , ∈ P(x)Q(y) i, j+i 1   x,y A n i=1x,y∈A p(x)qj(y)

Recall also that I(X; Y) is always nonnegative, and it equals I where the indicator function { − }(x, y)equals1if zero if and only if X and Y are independent. The loga- (Xi,Yj+i 1) (X Y − ) = (x, y) and it is equal to zero otherwise. Then, rithms above and throughout the paper are taken to base 2, i, j+i 1   log = log , so that I(X; Y) can be interpreted as the number n  2 1 pj Xi, Yj+i−1 of bits of information that each of these two random vari-  =     Ij (n) log   n = p Xi qj Yj+i−1 ables carries about the other (cf. [12]). i 1     n (5) In order to distinguish between the two scenarios above, = p X Y − = 1  i 1 j  i,  j+i 1  n log n , we compute the empirical mutual information between X1   − M n i=1 p Xi qj Yj+i 1 and each contiguous substring of Y1 of length n:foreach j = 1, 2, ..., M − n +1,let p (x, y) denote the joint j which is exactly the normalized logarithm of the ratio be- n j+n−1  n  empirical distribution of (X1 , Yj ), that is, let pj (x, y) tween the joint empirical likelihood i=1 pj (Xi, Yj+i−1)of be the proportion of the n positions in (X1, Yj ), (X2, the two strings,  and the product of their empirical marginal n  n  Yj+1), ...,(Xn, Yj+n−1) where (Xi, Yj+i−1)equals(x, y). Sim- likelihoods i=1 p(Xi)][ i=1 qj (Yj+i−1) . 4 EURASIP Journal on Bioinformatics and Systems Biology
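A sketch of the test of Section 2.1 follows: the per-symbol empirical mutual information of eq. (3) is evaluated for every alignment of the shorter string against a window of the longer one, and alignments whose score exceeds a threshold are flagged. The sequences here are synthetic, and the numerical threshold is only a placeholder; how θ should actually be chosen is discussed in Section 2.2.

```python
from collections import Counter
from math import log2

def empirical_mi(x, y_window):
    """Per-symbol empirical mutual information of eq. (3) between two equal-length strings."""
    n = len(x)
    p_joint = Counter(zip(x, y_window))
    p_x, p_y = Counter(x), Counter(y_window)
    return sum((c / n) * log2((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), c in p_joint.items())

def dependency_graph(x, y):
    """I_hat_j(n) for every alignment j of x against a length-n window of y (Section 2.1)."""
    n = len(x)
    return [empirical_mi(x, y[j:j + n]) for j in range(len(y) - n + 1)]

if __name__ == "__main__":
    import random
    rng = random.Random(1)
    x = "".join(rng.choice("ACGT") for _ in range(80))
    # Synthetic y: random background with a noisy copy of x embedded at position 100.
    noisy = "".join(b if rng.random() < 0.7 else rng.choice("ACGT") for b in x)
    y = ("".join(rng.choice("ACGT") for _ in range(100)) + noisy
         + "".join(rng.choice("ACGT") for _ in range(120)))
    graph = dependency_graph(x, y)
    theta = 0.10                                   # placeholder threshold; see Section 2.2
    print("best alignment j =", max(range(len(graph)), key=graph.__getitem__))
    print("alignments above theta:", [j for j, v in enumerate(graph) if v > theta][:10])
```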

2.2. Probabilities of error I = I(X; Y) of the mutual information, but, as we show be- low, the rate of this convergence is slower than the 1/n rate There are two kinds of errors this test can make: declaring → of scenario√ (i): here,√I(n) I with probability one, but only at that two strings are dependent when they are not, and vice rate 1/ n, in that n [I(n) − I] converges in distribution to versa. The actual probabilities of these two types of errors a Gaussian  depend on the distribution of the statistic Ij (n). Since this √ D   distribution is independent of j,wetakej = 1 and write n I(n) − I −→ T∼N 0, σ2 , (10) I(n) for the normalized log-likelihood ratio I (n). The next 1 where the resulting variance σ2 is given by two subsections present some classical asymptotics for  ( ) I1 n .   W(Y | X) σ2 = Var log Scenario (i): independence Q(Y)    W(y | x) 2 (11) We already noted that in this case I(n)convergestozeroas = p(x)W(y | x) log − I . ( ) n→∞, and below we shall see that this convergence takes x,y∈A Q y place at a rate of approximately 1/n.Specifically,I(n) →0 with probability one, and a standard application of the mul- An outline of the proof of (10) is given below; for another tivariate central limit theorem for the joint empirical distri- derivation see [19].  Therefore, for any fixed threshold θ0 and large n,wecan (12) estimate the probability of error as where the last approximation sign indicates equality to first order in the exponent. Thus, despite the fact that I(n)con- P = Pr{declare dependence | independent strings} e,1 verges at different speeds in the two scenarios, both error = Pr I(n) >θ| independent strings (7) probabilities Pe,1 and Pe,2 decay exponentially with the sam- ≈ Pr Z>(2 ln 2)θn , ple size n. To see why (10) holds it is convenient to use the alterna- where Z is as before. Therefore, for large n the error proba- tive expression for I(n)givenin(5). Using this, and recalling 2  bility Pe,1 decays like the tail of the χ distribution function, that I(n) = I1(n), we obtain      √ √ 1 n p X , Y ≈ − γ k,(θ ln 2)n − = 1  i  i  − Pe,1 1 ,(8)n[I(n) I] n log   I . (13) Γ(k) n i=1 p Xi q1 Yi where k = (|A|−1)2/2, and Γ, γ denote the Gamma function Since the empirical distributions converge to the correspond- and the incomplete Gamma function, respectively. Although ing true distributions, for large n it is straightforward to jus- this is fairly implicit, we know that the tail of the χ2 distribu- tify the approximation tion decays like e−x/2 as x→∞; therefore,      √ n | − ≈ √1 1 P Xi W Yi Xi − ≈ − n I(n) I log I . Pe,1 exp (θln2)n ,(9) n n i=1 P Xi Q Yi (14) where this approximation is to first-order in the exponent. The fact that this indeed converges in distribution to a Scenario (ii): dependence N(0, σ2), as n→∞, easily follows from the central limit the- orem, upon noting that the mean of the logarithm in (14) In this case, the asymptotic behavior of the test statistic I(n) equals I and its variance is σ2. is somewhat different. Suppose as before that the random n variables X1 are i.i.d. with distribution P, and that the con- Discussion n | ditional distribution of each Yi given X1 is W(Y Xi), for some fixed family of conditional distributions W(y | x); this From the above analysis it follows that in order for both n makes the random variables Y1 i.i.d. with distribution Q. 
probabilities of error to decay to zero for large n (so that we We mentioned in the last section that under the sec- rule out false positives as well as making sure that no depen- ond scenario, I(n) converges to the true underlying value dent segments are overlooked) the threshold θ needs to be Hasan Metin Aktulga et al. 5

[Figure 1 schematic: DNA structure of zmSRp32, showing exons, introns, the 5′ untranslated region (5′ UTR) and 3′ UTR, the protein coding sequence between the start and stop codons, and pre-mRNA processing into alternative mRNA structures (an alternative intron and alternative exons); nucleotide positions 178, 268, 369, 3243, 3688, 3800, 3884, and 4254 are marked.]

Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5 and 3 untranslated regions (white boxes). RNA transcripts (pre-mRNA) are processed to yield mRNA molecules used as templates for protein synthesis. Alternative pre-mRNA splicing generates different mRNA templates from the same transcripts, by selecting either alternative exons or alternative introns. The regions discussed in the text are identified by indices corresponding to the nucleotide position in the original DNA sequence. strictly between 0 and I = I(X; Y). For that, we need to have in alternative processing (splicing) of pre-mRNA transcripts. some prior information about the value of I, that is, of the Then we show how the same methodology can be easily level of dependence we are looking for. If the value of I were adapted to the problem of identifying tandem repeats. We actually known and a fixed threshold θ ∈ (0, I)waschosen present experimental results on DNA sequences from the independent of n, then both probabilities of error would de- FBI’s combined DNA index system (CODIS), which clearly cay exponentially fast, but with typically very different expo- indicate that the empirical mutual information can be a pow- nents: erful tool for this computationally intensive task.

$P_{e,1} \approx \exp\{-(\theta \ln 2)\, n\}, \qquad P_{e,2} \approx \exp\{-\tfrac{(I-\theta)^2}{2\sigma^2}\, n\};$  (15)


Figure 2: Estimated mutual information between the exon located between bases 1–369 and each contiguous subsequence of length 369 in the intron between bases 3243–4220. The estimates were computed both for the original sequences in the standard four-letter alphabet {A, C, G, T} (shown in (a)), as well as for the corresponding transformed sequences for the two-letter purine/pyrimidine grouping {AG, CT} (shown in (b)).

to (7), by setting the probability of false positives equal to 0.001; it is represented by a (red) straight horizontal line in the figures.

In order to "amplify" the effects of regions of potential dependency in various segments of the zmSRp32 gene, we computed the mutual information estimates Î_j on the original strings over the regular four-letter alphabet {A, C, G, T}, as well as on transformed versions of the strings where pairs of letters were grouped together, using either the Watson-Crick pair {AT, CG} or the purine-pyrimidine pair {AG, CT}. In our results we observed that such groupings are often helpful in identifying dependency; this is clearly illustrated by the estimates shown in Figures 2 and 3. Sometimes the {AT, CG} pair produces better results, while in other cases the purine-pyrimidine pair finds new dependencies.

Figure 2 strongly suggests that there is significant dependence between the bases in positions 1–369 and certain substrings of the bases in positions 3243–4220. While the 1–369 region contains the 5' untranslated sequences, an intron, and the first protein coding exon, the 3243–4220 sequence encodes an intron that undergoes alternative splicing. After narrowing down the mutual information calculations to the 5' untranslated region (5'UTR) in positions 1–78 and the 5'UTR intron in positions 78–268, we found that the initially identified dependency was still present; see Figure 3. A close inspection of the resulting mutual information graphs indicates that the dependency is restricted to the alternative exons embedded into the intron sequences, in positions 3688–3800 and 3884–4254.

These findings suggest that there might be a deeper connection between the 5'UTR DNA sequences and the DNA sequences that undergo alternative splicing. The UTRs are multifunctional genetic elements that control gene expression by determining mRNA stability and efficiency of mRNA translation. Like in the zmSRp32 maize gene, they can provide multiple alternatively spliced variants for more complex regulation of mRNA translation [20]. They also contain a number of regulatory motifs that may affect many aspects of mRNA metabolism. Our observations can therefore be interpreted as suggesting that the maize zmSRp32 5'UTR contains information that could be utilized in the process of alternative splicing, yet another important aspect of mRNA metabolism. The fact that the value of the empirical mutual information between the 5'UTR and the DNA sequences that encode alternatively spliced elements is significantly greater than zero clearly points in that direction. Further experimental work could be carried out to verify the existence, and further explore the meaning, of these newly identified statistical dependencies.

We should note that there are many other sequence matching techniques, the most popular of which is probably the celebrated BLAST algorithm. BLAST's working principles are very different from those underlying our method. As a first step, BLAST searches a database of biological sequences for various small words found in the query string. It identifies sequences that are candidates for potential matches, and thus eliminates a huge portion of the database containing sequences unrelated to the query. In the second step, small word matches in every candidate sequence are extended by means of a Smith-Waterman-type local alignment algorithm. Finally, these extended local alignments are combined with some scoring schemes, and the highest scoring alignments obtained are returned. Therefore, BLAST requires a considerable fraction of exact matches to find sequences related to each other.

However, our approach does not enforce any such requirements. For example, if two sequences do not have any exact matches at all, but the characters in one sequence are a characterwise encoding of the ones in the other sequence, then BLAST would fail to produce any significant matches (without corresponding substitution matrices), while our algorithm would detect a high degree of dependency. This is illustrated by the results in the following section, where the presence of certain repetitive patterns in Y_1^M is revealed through matching it to a "probe sequence" X_1^n which does not contain the repetitive pattern, but is "statistically similar" to the pattern sought.
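As a rough illustration of the alphabet groupings described above, the following Python sketch (not part of the original study) recodes DNA strings into the two-letter purine/pyrimidine and Watson-Crick alphabets and computes a plug-in estimate of the empirical mutual information between two aligned windows. The grouping tables, helper names, and placeholder sequences are our own assumptions; the paper's actual estimator, window bookkeeping, and the significance threshold of (7) are not reproduced here.

```python
from collections import Counter
from math import log2

# Illustrative two-letter groupings for the transformed sequences.
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}   # {AG, CT}
WATSON_CRICK      = {"A": "W", "T": "W", "C": "S", "G": "S"}   # {AT, CG}

def recode(seq, grouping=None):
    """Map a DNA string onto a reduced alphabet; None keeps {A,C,G,T}."""
    return seq if grouping is None else "".join(grouping[b] for b in seq)

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) in bits from two aligned, equal-length strings."""
    assert len(x) == len(y)
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        pxy = c / n
        mi += pxy * log2(pxy / ((px[a] / n) * (py[b] / n)))
    return mi

# Example: compare MI of two aligned windows on the raw alphabet and on the
# purine/pyrimidine grouping (placeholder strings, not real zmSRp32 data).
exon   = "ATGGCGTACGTTAGC" * 10
intron = "ATGGCATACGTTAGC" * 10
print(empirical_mi(exon, intron))
print(empirical_mi(recode(exon, PURINE_PYRIMIDINE), recode(intron, PURINE_PYRIMIDINE)))
```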


Figure 3: Dependency graph of Î_j versus j for the zmSRp32 gene, using different alphabet groupings: in (a) and (b), we plot the estimated mutual information between the exon found between bases 1–78 and each subsequence of length 78 in the intron located between bases 3243–4220. Plot (a) shows estimates over the original four-letter alphabet {A, C, G, T}, and (b) shows the corresponding estimates over the Watson-Crick pairs {AT, CG}. Similarly, plots (c) and (d) contain the estimated mutual information between the intron located in bases 79–268 and all corresponding subsequences of the intron between bases 3243–4220. Plot (c) shows estimates over the original alphabet, and plot (d) over the two-letter purine/pyrimidine grouping {AG, CT}. Plots (e) and (f) show the estimated mutual information between the 5' untranslated region and all corresponding subsequences of the intron between bases 3243–4220, for the four-letter alphabet (in (e)), and for the two-letter purine/pyrimidine grouping {AG, CT} (in (f)).

3.2. Application to tandem repeats

Here we further explore the utility of the mutual information statistic, and we examine its performance on the problem of detecting short tandem repeats (STRs) in genomic sequences. STRs, usually found in noncoding regions, are made of back-to-back repetitions of a sequence which is at least two bases long and generally shorter than 15 bases. The period of an STR is defined as the length of the repetition sequence in it. Owing to their short lengths, STRs survive mutations well, and can easily be amplified using PCR without producing erroneous data. Although there are many well-identified STRs in the human genome, interestingly, the number of repetitions at any specific locus varies significantly among individuals, that is, they are polymorphic DNA fragments. These properties make STRs suitable tools for determining genetic profiles, and have become a prevalent method in forensic investigations. Long repetitive sequences have also been observed in genomic sequences, but have not gained as much attention since they cannot survive environmental degradation and do not produce high quality data from PCR analysis.

Several algorithms have been proposed for detecting STRs in long DNA strings with no prior knowledge about the size and the pattern of repetition. These algorithms are mostly based on pattern matching, and they all have high time-complexity. Finding short repetitions in a long sequence is a challenging problem. When the query string is a DNA segment that contains many insertions, deletions, or substitutions due to mutations, the problem becomes even harder. Exact- and approximate-pattern matching algorithms need to be modified to account for these mutations, and this renders them complex and inefficient. To overcome these limitations, we propose a statistical approach using an adaptation of the method described in the previous sections.

In the United States, the FBI has decided on 13 loci to be used as the basis for genetic profile analysis, and they continue to be the standard in this area. To demonstrate how our approach can be used for STR detection, we chose to use sequences from the FBI's combined DNA index system (CODIS): the SE33 locus contained in the GenBank sequence V00481, and the VWA locus contained in the GenBank sequence M25858. The periods of STRs found in CODIS typically range from 2 to bases, and do not exhibit enough variability to demonstrate how our approach would perform under divergent conditions. For this reason, we used the V00481 sequence as is, but on M25858 we artificially introduced an STR with period 11, by substituting bases 2821–2920 (where we know that there are no other repeating sequences) with 9 tandem repeats of ACTTTGCCTAT. We have also introduced base substitutions, deletions, and insertions on our artificial STR to imitate mutations.

Let Y_1^M = (Y_1, Y_2, ..., Y_M) denote the DNA sequence in which we are looking for STRs. The gist of our approach is simply to choose a periodic probe sequence of length n, say, X_1^n = (X_1, X_2, ..., X_n) (typically much shorter than Y_1^M), and then to calculate the empirical mutual information Î_j = Î_j(n) between X_1^n and each of its possible alignments with Y_1^M. In order to detect the presence of STRs, the values of the empirical mutual information in regions where STRs do appear should be significantly larger than zero, where "significantly" means larger than the corresponding estimates in ordinary DNA fragments containing no STRs. Obviously, the results will depend heavily on the exact form of the probe sequence. Therefore, it is critical to decide on the method for selecting: (a) the length, and (b) the exact contents of X_1^n. The length of X_1^n is crucial; if it is too short, then X_1^n itself is likely to appear often in Y_1^M, producing many large values of the empirical mutual information and making it hard to distinguish between STRs and ordinary sequences. Moreover, in that case there is little hope that the analysis of the previous section (which was carried out for long sequences X_1^n) will provide useful estimates for the probability of error. If, on the other hand, X_1^n is too long, then any alignment of the probe X_1^n with Y_1^M will likely also contain too many irrelevant base pairs. This will produce negligibly small mutual information estimates, again making it impossible to detect STRs. These considerations are illustrated by the results in Figure 4.

As for the contents of the probe sequence X_1^n, the best choice would be to take a segment X_1^n containing an exact match to an STR present in Y_1^M. But in most of the interesting applications, this is of course unavailable to us. A "second best" choice might be a sequence X_1^n that contains a segment of the same "pattern" as the STR present in Y_1^M, where we say that two sequences have the same pattern if each one can be obtained from the other via a permutation of the letters in the alphabet (cf. [21, 22]). For example, TCTA and GTGC have the same pattern, whereas TCTA and CTAT do not (although they do have the same empirical distribution). For example, if X_1^n contains the exact same pattern as the periodic part of the STR to be detected, and another probe has the same pattern as X_1^n, then, a priori, either choice should be equally effective at detecting the STR under consideration; see Figure 5. (This observation also shows that a single probe X_1^n may in fact be appropriate for locating more than a single STR, e.g., STRs with the same pattern as X_1^n, as in Figure 5, or with the same period, as in Figure 4.) The problem with this choice is, again, that the exact patterns of STRs present in a DNA sequence are not available to us in advance, and we cannot expect all STRs in a given sequence to be of the same pattern.

Even though both of the above choices for X_1^n are usually not practically feasible, if the sequence Y_1^M is relatively short and contains a single STR whose contents are known, then either choice would produce high-quality data, from which the STR contained in Y_1^M can easily be detected; see Figure 5 for an illustration.

In practice, in addition to the fact that the contents of STRs are not known in advance, there is also the issue that in a long DNA sequence there are often many different STRs, and a unique probe will not match all of them exactly. But since STRs usually have a period between 2 and 15 bases, we can actually run our method for all possible choices of repetition sequences, and detect all STRs in the given query sequence Y_1^M. The number of possible probes X_1^n can be drastically reduced by observing that (1) we only need one repeating sequence of each possible pattern, and (2) it suffices to only consider repetition patterns whose period is prime.
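The probe-based scan described in this section can be sketched as follows. This is a minimal illustration under our own assumptions: a plug-in mutual information estimate, a toy query sequence, and an arbitrary probe unit; the helper names and parameter values are not the authors'. The same_pattern helper also illustrates the permutation-based notion of two short sequences sharing the same "pattern".

```python
from collections import Counter
from math import log2
from itertools import permutations

def empirical_mi(x, y):
    """Plug-in estimate of I(X;Y) in bits from two aligned strings."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def same_pattern(u, v, alphabet="ACGT"):
    """True if v can be obtained from u by permuting the letters of the alphabet,
    e.g. TCTA and GTGC have the same pattern, while TCTA and CTAT do not."""
    if len(u) != len(v):
        return False
    return any(u.translate(str.maketrans(alphabet, "".join(p))) == v
               for p in permutations(alphabet))

def scan_probe(probe_unit, y, n):
    """Slide a periodic probe (repetitions of probe_unit, truncated to length n)
    along y and return the empirical MI at every alignment."""
    probe = (probe_unit * (n // len(probe_unit) + 1))[:n]
    return [empirical_mi(probe, y[j:j + n]) for j in range(len(y) - n + 1)]

# Usage sketch: peaks in the returned list flag candidate STR positions.
y = "ACGT" * 50 + "AAAG" * 12 + "ACGT" * 50        # toy sequence with an AAAG repeat
scores = scan_probe("AGGT", y, n=60)                # probe with the same period as AAAG
print(max(range(len(scores)), key=scores.__getitem__))   # index of the largest peak
print(same_pattern("TCTA", "GTGC"), same_pattern("TCTA", "CTAT"))
```

The probe here shares only the period, not the contents, of the embedded repeat, mirroring the observation (illustrated in Figure 4) that the period of the probe matters more than its exact letters.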


Figure 4: Dependency graph of the GenBank sequence Y_1^M = V00481, for a probe sequence X_1^n which is a repetition of AGGT, of length (a) 12, or (b) 60. The sequence Y_1^M contains STRs that are repetitions of the pattern AAAG, in the following regions: (i) there is a repetition of AAAG between bases 62–108; (ii) AAAG is intervened by AG and AAGG until base 138; (iii) again between 138–294 there are repetitions of AAAG, some of which are modified by insertions and substitutions. In (a) our probe is too short, and it is almost impossible to distinguish the SE33 locus from the rest. However, in (b) the SE33 locus is singled out by the two big peaks in the mutual information estimates; the shorter peak between the two larger ones is due to the interventions described above. Note that the STRs were identified by a probe sequence that was a repetition of a pattern different from that of the repeating part of the STRs themselves, but of the same period.


Figure 5: Dependency graph of the VWA locus contained in GenBank sequence M25858 for a probe sequence X_1^n with n = 12, which is a repetition of (a) TCTA, an exactly matching probe, (b) GTGC, a completely different probe, but of the exact same "pattern". In both cases, we have chosen X_1^n to be long enough to suppress unrelated information. Note that the results in (a) and (b) are almost identical. The VWA locus contains an STR of TCTA between positions 44–123. This STR is apparent in both dependency graphs by forming a periodic curve with high correlation.

Note that in view of the earlier discussion and the results shown in Figure 4, the period of the repeating part of X_1^n is likely to be more important than the actual contents. For example, if we were to apply our method for finding STRs in Y_1^M with a probe X_1^n whose period is 5 bases long, then many STRs with a period that is a multiple of 5 should peak in the dependency chart, thus allowing us to detect their approximate positions in Y_1^M. Clearly, probes that consist of very short repeats, such as AAA..., should be avoided. The importance of choosing an X_1^n with the correct period is illustrated in Figure 6.

The results in Figures 4, 5, and 6 clearly indicate that the proposed methodology is very effective at detecting the presence of STRs, although at first glance it may appear that it cannot provide precise information about their start-end positions and their repeat sequences. But this final task can easily be accomplished by reevaluating Y_1^M near the peak in the dependency graph, for example, by feeding the relevant parts separately into one of the standard string matching-based tandem repeat algorithms. Thus, our method can serve as an initial filtering step which, combined with an exact pattern matching algorithm, provides a very accurate and efficient method for the identification of STRs.

In terms of its practical implementation, note that our approach has a linear running time O(M), where M is the length of Y_1^M. The empirical mutual information of course needs to be evaluated for every possible alignment of Y_1^M and X_1^n, with each such calculation done in O(n) steps, where n is the length of X_1^n. But n is typically no longer than a few hundred bases, and, at least to first order, it can be considered constant. Also, repeating this process for all possible repeat


Figure 6: In these charts we use the modified GenBank sequence M25858, which contains the VWA locus in CODIS between positions 1683–1762 and the artificial STR introduced by us at 2821–2920. The repeat sequence of the VWA locus is TCTA, and the repeat sequence of the artificial STR is ACTTTGCCTAT. In (a), the probe X_1^n has length n = 88 and consists of repetitions of AGGT. Here the repeating sequence of the VWA locus (which has period 4) is clearly indicated by the peak, whereas the artificial tandem repeat (which has period 11) does not show up in the results. The small peak around position 2100 is due to a very noisy STR again with a 4-base period. In (b), the probe X_1^n again has length n = 88, and it consists of repetitions of CATAGTTCGGA. This produces the opposite result: the artificial STR is clearly identified, but there is no indication of the STR present at the VWA locus.

periods does not affect the complexity of our method by much, since the number of such periods is quite small and can also be considered to be constant. And, as mentioned above, choosing probes X_1^n only containing repeating segments with a prime period further improves the running time of our method.

We, therefore, conclude that (a) the empirical mutual information appears in this case to be a very effective tool for detecting STRs; and (b) selecting the length and repetition period of the probe sequence X_1^n is crucial for identifying tandem repeats accurately.

4. CONCLUSIONS

Biological information is stored in the form of monomer strings composed of conserved biomolecular sequences. According to Manfred Eigen, "The differentiable characteristic of living systems is information. Information assures the controlled reproduction of all constituents, thereby ensuring conservation of viability." Hoping to reveal novel, potentially important biological phenomena, we employ information-theoretic tools, especially the notion of mutual information, to detect statistically dependent segments of biosequences. The biological implications of the existence of such correlations are deep, and they themselves remain unresolved. The proposed approach may provide a powerful key to fundamental advances in understanding and quantifying biological information.

This work addresses two specific applications based on the proposed tools. From the experimental analysis carried out on regions of the maize zmSRp32 gene, our findings suggest the existence of a biological connection between the 5' untranslated region in zmSRp32 and its alternatively spliced exons, potentially indicating the presence of novel alternative splicing mechanisms or structural scaffolds. Secondly, through extensive analysis of CODIS data, we show that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling studies.

ACKNOWLEDGMENTS

This research was supported in part by the NSF Grants CCF-0513636 and DMS-0503742, and the NIH Grant R01 GM068959-01.

REFERENCES

[1] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, "The mutual information: detecting and evaluating dependencies between variables," Bioinformatics, vol. 18, supplement 2, pp. S231–S240, 2002.
[2] Z. Dawy, B. Goebel, J. Hagenauer, C. Andreoli, T. Meitinger, and J. C. Mueller, "Gene mapping and marker clustering using Shannon's mutual information," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 1, pp. 47–56, 2006.
[3] E. Segal, Y. Fondufe-Mittendorf, L. Chen, et al., "A genomic code for nucleosome positioning," Nature, vol. 442, no. 7104, pp. 772–778, 2006.
[4] Y. Osada, R. Saito, and M. Tomita, "Comparative analysis of base correlations in 5' untranslated regions of various species," Gene, vol. 375, no. 1-2, pp. 80–86, 2006.
[5] M. Kozak, "Initiation of translation in prokaryotes and eukaryotes," Gene, vol. 234, no. 2, pp. 187–208, 1999.
[6] D. A. Reddy and C. K. Mitra, "Comparative analysis of transcription start sites using mutual information," Genomics, Proteomics and Bioinformatics, vol. 4, no. 3, pp. 189–195, 2006.
[7] D. A. Reddy, B. V. L. S. Prasad, and C. K. Mitra, "Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices," Computational Biology and Chemistry, vol. 30, no. 1, pp. 58–62, 2006.

[8] S. A. Shabalina, A. Y. Ogurtsov, I. B. Rogozin, E. V. Koonin, and D. J. Lipman, "Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals," Nucleic Acids Research, vol. 32, no. 5, pp. 1774–1782, 2004.
[9] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, "Exploiting the past and the future in protein secondary structure prediction," Bioinformatics, vol. 15, no. 11, pp. 937–946, 1999.
[10] G. Battail, "Should genetics get an information-theoretic education? Genomes as error-correcting codes," IEEE Engineering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 34–45, 2006.
[11] H. Gao, W. J. Gordon-Kamm, and L. A. Lyznik, "ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced," Gene, vol. 339, no. 1-2, pp. 25–37, 2004.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[13] P. I. Good, Resampling Methods, Birkhäuser, Boston, Mass, USA, 2005.
[14] B. Manly, Randomization, Bootstrap and Monte Carlo Methods in Biology, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1977.
[15] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, Springer, New York, NY, USA, 3rd edition, 2005.
[16] M. J. Schervish, Theory of Statistics, Springer, New York, NY, USA, 1995.
[17] J. Hagenauer, Z. Dawy, B. Göbel, P. Hanus, and J. Mueller, "Genomic analysis using methods from information theory," in Proceedings of IEEE Information Theory Workshop (ITW '04), pp. 55–59, San Antonio, Tex, USA, October 2004.
[18] B. Goebel, Z. Dawy, J. Hagenauer, and J. C. Mueller, "An approximation to the distribution of finite sample size mutual information estimates," in Proceedings of IEEE International Conference on Communications (ICC '05), vol. 2, pp. 1102–1106, Seoul, Korea, May 2005.
[19] M. Hutter, "Distribution of mutual information," in Advances in Neural Information Processing Systems 14, pp. 399–406, MIT Press, Cambridge, Mass, USA, 2002.
[20] T. A. Hughes, "Regulation of gene expression by alternative untranslated regions," Trends in Genetics, vol. 22, no. 3, pp. 119–122, 2006.
[21] J. Åberg, Yu. M. Shtarkov, and B. J. M. Smeets, "Multialphabet coding with separate alphabet description," in Proceedings of the International Conference on Compression and Complexity of Sequences, pp. 56–65, Positano, Italy, June 1997.
[22] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 2954–2964, 2006.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 13853, 13 pages doi:10.1155/2007/13853

Research Article Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information

Arvind Rao,1 Alfred O. Hero III,1 David J. States,2 and James Douglas Engel3

1 Departments of Electrical Engineering and Computer Science and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA 2 Departments of Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA 3 Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, MI 48109, USA

Received 1 March 2007; Revised 23 June 2007; Accepted 17 September 2007

Recommended by Teemu Roos

Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to discovery of novel motifs (including transcription factor sites) with previously uncharacterized function in gene expression. Coupled with the complexity underlying tissue-specific gene expression, there are several motifs that are putatively responsible for expression in a certain cell type. This has important implications in understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs (not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There are two main contributions of this work. Firstly, we propose the use of directed information for such classification-constrained motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity of any sequence of interest. Such analysis yields several novel interesting motifs that merit further experimental characterization. Furthermore, this approach leads to a principled framework for the prospective examination of any chosen motif to be a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize that the discovery of these motifs would enable the large-scale investigation for the tissue-specific regulatory role of any conserved sequence element identified from genome-wide studies.

Copyright © 2007 Arvind Rao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION transcriptional start site (TSS). The basal transcriptional ma- chinery at the promoter coupled with the transcription fac- tor complexes at these distal, long-range regulatory elements Understanding the mechanisms underlying regulation of (LREs) are collectively involved in directing tissue-specific tissue-specific gene expression remains a challenging ques- expression of genes. tion. While all mature cells in the body have a complete copy One of the current challenges in the post-genomic era of the human genome, each cell type only expresses those is the principled discovery of such LREs genome-wide. Re- genes it needs to carry out its assigned task. This includes cently, there has been a community-wide effort (http:// genes required for basic cellular maintenance (often called www.genome.gov/ENCODE) to find all regulatory elements “housekeeping genes”) and those genes whose function is in 1% of the human genome. The examination of the dis- specific to the particular tissue type that the cell belongs to. covered elements would reveal characteristics typical of most Gene expression by a way of transcription is the process of enhancers which would aid their principled discovery and generation of messenger RNA (mRNA) from the DNA tem- examination on a genome-wide scale. Some characteristics plate representing the gene. It is the intermediate step before of experimentally identified distal regulatory elements [1, 2] the generation of functional protein from messenger RNA. are as follows. During gene expression (see Figure 1), transcription factor (TF) proteins are recruited at the proximal promoter of the (i) Noncoding elements: distal regulatory elements are gene as well as at sequence elements (enhancers/silencers) noncoding and can either be intronic or intergenic re- which can lie several hundreds of kilobases from the gene’s gionsonthegenome.Hence,previousmodelsforgene 2 EURASIP Journal on Bioinformatics and Systems Biology

TF complex TATA box Another practical reason for the examination of pro- Distal RNA pol. II TSS enhancer moters is that their locations (and genomic sequences) are more clearly delineated on genome databases (like ffi Distal Promoter UCSC or Ensembl). Su cient data (http://symatlas (proximal) enhancer Exon Intron .gnf.org) on the expression of genes is also publicly available for analysis. Sequence motif discovery is set Figure 1: Schematic of transcriptional regulation. Sequence motifs up as a feature extraction problem from these tissue- at the promoter and the distal regulatory elements together confer specific promoter sequences. Subsequently, a support specificity of gene expression via TF binding. vector machine (SVM) classifier is used to classify new promoters into specific and nonspecific categories based on the identified sequence features (motifs). Us- ing the SVM classifier algorithm, 90% of tissue-specific finding [3] are not directly applicable. With over 98% genes are correctly classified based upon their up- of the annotated genome being noncoding, the pre- stream promoter region sequences alone. cise localization of regulatory elements that underlie (ii) Known long range regulatory elements (LRE) motifs: tissue-specific gene expression is a challenging prob- to analyze the motifs in LRE elements, we examine lem. the results of the above approach on the Enhancer (ii) Distance/orientation independent: an enhancer can Browser dataset (http://enhancer.lbl.gov) which has act from variable genomic distances (hundreds of kilo- results of expression of ultraconserved genomic ele- bases) to regulate gene expression in conjunction with ments in transgenic mice [8]. An examination of these the proximal promoter, possibly via a looping mecha- ultraconserved enhancers is useful for the extraction nism [4]. These enhancers can lie upstream or down- of discriminatory motifs to distinguish the regulatory stream of the actual gene along the genomic locus. elements from the nonregulatory (neutral) ones. Here (iii) Promoter dependent: since the action at a distance of the results indicate that up to 95% of the sequences can these elements involves the recruitment of TFs that di- be correctly classified using these identified motifs. rect tissue-specific gene expression, the promoter that We note that some of the identified motifs might not be tran- they interact with is critical. scription factor binding motifs, and would need to be func- Although there are instances where a gene harbors tissue- tionally characterized. This is an advantage of our method- specific activity at the promoter itself, the role of long-range instead of constraining ourselves to the degeneracy present elements (LREs) remains of interest, for example, for a de- in TF databases (like TRANSFAC/JASPAR), we look for all tailed understanding of their regulatory role in gene expres- sequences of a fixed length. sion during biological processes like organ development and disease progression [5]. We seek to develop computational 2. CONTRIBUTIONS strategies to find novel LREs genome-wide that govern tissue specific expression for any gene of interest. A common ap- Using microarray gene expression data, [9, 10] proposes an proach for their discovery is the use of motif-based sequence approach to assign genes into tissue-specific and nonspecific signatures. Any sequence element can then be scanned for categories using an entropy criterion. 
Variation in expression such a signature and its tissue specificity can be ascertained and its divergence from ubiquitous expression (uniform dis- [6]. tribution across all tissue types) is used to make this assign- Thus, our primary question in this regard is that is there ment. Based on such assignment, several features like CpG a discriminating sequence property of LRE elements that de- island density, frequency of transcription factor motif occur- termines tissue-specific gene expression—more particularly, rence, can be examined to potentially discriminate these two are there any sequence motifs in known regulatory elements groups. Other work has explored the existence of key mo- that can aid discovery of new elements [7]. To answer this, we tifs (transcription factor binding sites) in the promoters of examine known tissue-specific regulatory elements (promot- tissue-specific genes (see [11, 12]). Based on the successes ers and enhancers) for motifs that discriminate them from reported in these methods, it is expected that a principled a background set of neutral elements (such as housekeeping examination and characterization of every sequence motif gene promoters). For this study, the datasets are derived from identified to be discriminatory might lead to improved in- the following sources. sight into the biology of gene regulation. For example, such a strategy might lead to the discovery of newer TFBS motifs, (i) Promoters of tissue-specific genes: before the widespread as well as those underlying epigenetic phenomena. discovery of long-range regulatory elements (LREs), it For the purpose of identifying discriminative motifs from was hypothesized that promoters governed gene ex- the training data (tissue-specific promoters or LREs), our ap- pression alone. There is substantial evidence for the proach is as follows. binding of tissue-specific transcription factors at the promoters of expressed genes. This suggests that in (i) Variable selection: firstly, sequence motifs that dis- spite of newer information implicating the role of criminate between tissue-specific and non-specific el- LREs, promoters also have interesting motifs that gov- ements are discovered. In machine learning, this is ern tissue-specific expression. a feature selection problem with features being the Arvind Rao et al. 3

counts of sequence motifs in the training sequences. Examine sequences Without loss of generality, six-nucleotide motifs (hex- (promoters/enhancers) amers) are used as motif features. This is based on from Tissue Expression Atlas the observation that most transcription factor binding Training data motifs have a 5-6 nucleotide core sequence with de- generacy at the ends of the motif. A similar setup has Tissue-specific Neutral sequences been introduced in [13–15]. The motif search space sequences 6 = is, therefore, a 4 4096-dimensional one. The pre- Parse sequences to obtain relative counts sented approach, however, does not depend on mo- Preprocess tif length and can be scaled according to biological knowledge. For variable (motif) selection, a novel fea- Build co-occurrence ture selection approach (based on an information the- matrices for training data oretic quantity called directed information (DI)) is pro- posed. The improved performance of this criterion over using mutual information for motif selection is Feature (motif) selection (DI/MI) and classification (SVM) also demonstrated. (ii) Classifier design: after discovering discriminating mo- tifs using the above DI step, an SVM classifier that Biological interpretation separates the samples between the two classes (specific of top ranking motifs and nonspecific) from this motif space is constructed. Figure 2: An overview of the proposed approach. Each of the steps Apart from this novel feature selection approach, several are outlined in the following sections. questions pertaining to bioinformatics methodology can be potentially answered using this framework—some of these areasfollows. most common approach is to look for TFBS motifs that are (i) Are there common motifs underlying tissue-specific statistically over-represented in the promoters of the coex- expression that are identified from tissue-specific pro- pressed genes based on a background (binomial or Poisson) moters and enhancers? In this paper, an examina- distribution of motif occurrence genomewide. tion of motifs (from promoters and enhancers) cor- In this work, the problem of motif discovery is set up as responding to brain-specific expression is done to ad- follows. Using two annotated groups of genes, tissue-specific dress this question. (“ts”) and nontissue-specific (“nts”), hexamer motifs that (ii) Do these motifs correspond to known motifs (tran- best discriminate these two classes are found. The goal would scription factor binding sites)? We show that several be to make this set of motifs as small as possible, that is, to motifs are indeed consensus sites for transcription fac- achieve maximal class partitioning with the smallest feature tor binding, although their real role can only be iden- subset. tified in conjunction with experimental evidence. Several metrics have been proposed to find features with (iii) Is it possible to relate the motif information from the maximal class label association. From information theory, sequence and expression perspectives to understand mutual information is a popular choice [18]. This is a sym- regulatory mechanisms? This question is addressed in metric association metric and does not resolve the direc- Section 11.3. tion of dependency (i.e., if features depend on the class la- (iv) How useful are these motifs in predicting new tissue- bel or vice versa). It is important to find features that induce specific regulatory elements? This is partly explained the class label. 
Feature selection from data implies selection from the results of SVM classification. (control) of a feature subset that maximally captures the un- This work differs from that in [13, 14], in several aspects. derlying character (class label) of the data. There is no con- We present the DI-based feature selection procedure as part trol over the label (a purely observational characterization). of an overall unified framework to answer several questions With this motivation, a new metric for discriminative in bioinformatics, not limited to finding discriminating mo- hexamer subset selection, termed “directed information” tifs between two classes of sequences. Particularly, one of (DI), is proposed. Based on the selected features, a classifier the advantages is the ability to examine any particular mo- is used to classify sequences to tissue-specific or nontissue- tif as a potential discriminator between two classes. Also, specific categories. The performance of this DI-based feature this work accounts for the notion of tissue-specificity of selection metric is subsequently evaluated in the context of promoters/enhancers (in line with more recent work in [8– the SVM classifier. 10, 16, 17]). Also, this framework enables the principled in- tegration of various data sources to address the above ques- 4. OVERALL METHODOLOGY tions. These are clarified in Section 11. The overall schematic of the proposed procedure is outlined 3. RATIONALE in Figure 2. The main approaches to finding common motifs driving Below we present our approach to find promoter-specific tissue-specificgeneregulationaresummarizedin[1, 2]. The or enhancer-specific motifs. 4 EURASIP Journal on Bioinformatics and Systems Biology

5. MOTIF ACQUISITION Table 1: The “motif frequency matrix” for a set of gene promoters. The first column is their ENSEMBL gene identifiers and the other 4 5.1. Promoter motifs columns are the motifs. A cell entry denotes the number of times a given motif occurs in the upstream (−2000 to +1000 bp from TSS) 5.1.1. Microarray analysis region of each corresponding gene.

Raw microarray data is available from the Novartis Foun- Ensembl Gene ID AAAAAA AAAAAG AAAAAT AAAACA dation (GNF) [http://symatlas.gnf.org]. Data is normal- ENSG00000155366 0 0 1 4 ized using RMA from the bioconductor packages for R ENSG000001780892 6 5 5 6 [http://cran.r-project.org]. Following normalization, repli- ENSG00000189171 1 2 1 0 cate samples are averaged together. Only 25 tissue types ENSG00000168664 6 3 8 0 are used in our analysis including: adrenal gland, amygdala, brain, caudate nucleus, cerebellum, corpus callosum, cortex, ENSG00000160917 4 1 4 2 dorsal root ganglion, heart, HUVEC, kidney, liver, lung, pan- ENSG00000163655 2 4 0 1 creas, pituitary, placenta, salivary, spinal cord, spleen, testis, ENSG000001228844 8 6 10 7 thalamus, thymus, thyroid, trachea, and uterus. ENSG00000176749 0 0 0 0 In this context, the notion of tissue specificity of a gene ENSG00000006451 5 2 2 1 needs clarification. Suppose there are N genes, g1, g2, ..., gN , and T tissue types (in GNF: T = 25), we construct an × = N T tissue specificity matrix: M [0]N×T .Foreachgene individually. This results in two hexamer-gene cooccurrence gi,1 ≤ i ≤ N,letgi,[0.5T] = median(gi,k), for all k ∈ 1, 2, ..., matrices—one for the “ts” class (dimension Ntrain,+1 × 1000) T; gi,k being the expression level of gene i in tissue k.Define and the other for the “nts” class (dimension Ntrain,−1 × 1000). each entry Mi,k as Here Ntrain,+1 and Ntrain,−1 are the number of positive training ⎧ ⎨ and negative training samples, respectively. 1ifgi,k ≥ 2gi,[0.5T], = The input to the feature selection procedure is a gene Mi,k ⎩ (1) 0 otherwise. promoter-motif frequency table (Table 1). The genes relevant to each class are identified from tissue microarray analysis, = T ≤ following steps in Section 5.1.1 and the frequency table is Now consider the N-dimensional vector mi k=1Mi,k,1 i ≤ N, that is, summing all the columns of each row. The built by parsing the gene promoters for the presence of each 6 interquartile range of m can be used for “ts”/“nts” assign- of the 4 = 4096 possible hexamers. ment. Gene indices i that are in quartile 1 (= 3) are labeled as “ts,” and those in quartile 4 (= 22) are labeled as “nts.” 5.2. LRE motifs With this approach, a total of 1924 probes represent- ing 1817 genes were classified as tissue-specific, while 2006 To analyze long range elements which confer tissue-specific probes representing 2273 genes were classified as nontissue- expression, the Mouse Enhancer database (http://enhancer specific. In this work, genes which are either heart-specific or .lbl.gov) is examined. This database has a list of experi- brain-specific are considered. From the tissue-specific genes mentally validated ultraconserved elements which have been obtained from the above approach, 45 brain-specific gene tested for tissue specific expression in transgenic mice [8], promoters and 118 heart-specific gene promoters are ob- and can be searched for a list of all elements which have tained. As mentioned in Section 2, one of the objectives is expression in a tissue of interest. In this work, we consider to find motifs that are responsible for brain/heart specific expression in tissues relating to the developing brain. Ac- expression and also correlate them with binding profiles of cording to the experimental protocol, the various regions are known transcription factor binding motifs. cloned upstream of a heat shock protein promoter (hsp68- lacz), thereby not adhering to the idea of promoter specificity 5.1.2. 
Sequence analysis in tissue-specific expression. Though this is of concern in that there is loss of some gene-specific information, we work Genes (“ts” or “nts”) associated with candidate probes are with this data since we are more interested in tissue expres- identified using the Ensembl Ensmart [http://www.ensembl sion and also due to a paucity of public promoter-dependent .org] tool. For each gene, sequence from 2000 bp upstream enhancer data. and 1000 bp down-stream upto the start of the first exon rel- This database also has a collection of ultraconserved el- ative to their reported TSS is extracted from the Ensembl ements that do not have any transgenic expression in vivo. Genome Database (Release 37). The relative counts of each This is used as the neutral/background set of data which cor- of the 46 hexamers are computed within each gene promoter responds to the “nts” (nontissue-specific class) for feature se- sequence of the two categories (“ts” and “nts”)—using the lection and classifier design. “seqinr” library in the R environment. A t-test is performed As in the above (promoter) case, these sequences (sev- between the relative counts of each hexamer between the two enty four enhancers for brain-specific expression) are parsed expression categories (“ts” and “nts”) and the top 1000 sig- for the absolute counts of the 4096 hexamers, a cooccurrence  = nificant hexamers (H = H1, H2, ..., H1000) are obtained. The matrix (Ntrain,+1 74) is built and then t-test P-values are   =    relative counts of these hexamers is recomputed for each gene used to find the top 1000 hexamers (H H1, H2, ..., H1000) Arvind Rao et al. 5 that are maximally different between the two classes (brain- X2 specific and brain-nonspecific). Y The next three sections clarify the preprocessing, feature selection, and classifier design steps to mine these cooccur- X1 X2 rence matrices for hexamer motifs that are strongly associ- ated with the class label. We note that though this work is il- lustrated using two class labels, the approach can be extended in a straightforward way to the multiclass problem.

6. PREPROCESSING

From the above, Ntrain,+1 × 1000 and Ntrain,−1 × 1000 di- mensional cooccurrence matrices are available for the tissue- specific and nonspecific data, both for the promoter and enhancer sequences. Before proceeding to the feature (hex- X1 amer motif) selection step, the counts of the M = 1000 hexamers in each training sample need to be normalized Figure 3: Causal feature discovery for two class discrimination, to account for variable sequence lengths. In the cooccur- adapted from [20]. Here the variables X1 and X2 discriminate Y, the class label. rence matrix, let gci,k represent the absolute count of the kth hexamer, k ∈ 1, 2, ..., M, in the ith gene. Then, for = each gene gi, the quantile labeled matrix has Xi,k l if of performance is the amount of information flow from the ≤ = gci,[((l−1)/K)M] gci,k

This hypothesis test is done for each of the 1000 mo- butionally between the positive and negative training tifs, in order to select the top d motifs based on DI value, samples. The top 1000 of these hexamers are cho- which is then used for classifier training subsequently. This sen for further analysis. This step is only necessary leads to a need for multiple-testing correction. Because the to reduce the computational complexity of the over- Bonferroni correction is extremely stringent in such settings, all procedure—computing the DI between each of the the Benjamini-Hochberg procedure [32], which has a higher 4096 hexamers and the class label is relatively expen- false positive rate but a lower false negative rate, is used in sive. this work. (5) For the top K = 1000 hexamers which are most significantly different between the positive and nega- N → N N N 9. SUPPORT VECTOR MACHINES tive training examples, I(Xk Y )andI(Xk ; Y )re- veal the degree of association for each of the k ∈ From the top d features identified from the ranked list (1, 2, ..., K) hexamers. The entropy terms in the di- of features having high DI with the class label, a sup- rected information and mutual information expres- port vector machine classifier in these d dimensions is de- sions are found using a higher-order entropy estima- signed. An SVM is a hyperplane classifier which operates tor. Using the procedure of Section 7, the raw DI val- by finding a maximum margin linear hyperplane to sepa- ues are converted into their normalized versions. Since rate two different classes of data in high-dimensional (D> the goal is to maximize I(Xk→Y), we can rank the DI d) space. The training data has N(= Ntrain,+1 + Ntrain,−1) values in descending order. ∈ Rd ∈ pairs (x1, y1), (x2, y2), ...,(xN , yN ), with xi and yi (6) The significance of the DI estimate is obtained based {−1, +1}. on the bootstrapping methodology. For every hex- An SVM is a maximum margin hyperplane classifier in a amer, a P = 0.05 significance with respect to its nonlinearly extended high-dimensional space. For extending bootstrapped null distribution yields potentially dis- the dimensions from d to D>d, a radial basis kernel is used. criminative hexamers between the two classes. The The objective is to minimize β in the hyperplane {x : Benjamini-Hochberg procedure is used for multiple- = T } T ≥ − ∀ ≥ f (x) x β + β0 ,subjecttoyi(xi β + β0) 1 ξi i, ξi testing correction. Ranking the significant hexamers ≤ 0, ξi constant [33]. by decreasing DI value yields features that can be used for classifier (SVM) training. 10. SUMMARY OF OVERALL APPROACH (7) Train the support vector machine (SVM) classifier on the top d features from the ranked DI list(s). For com- Our proposed approach is as follows. Here, the term “se- parison with the MI-based technique, we use the hex- quence” can pertain to either tissue-specific promoters or amers which have the top d (normalized) MI values. LRE sequences, obtained from the GNF SymAtlas and En- The accuracy of the trained classifier is plotted as a sembl databases or the Enhancer Browser. function of the number of features (d), after ten-fold (1) The sequence is parsed to obtain the relative counts/ cross-validation. As we gradually consider higher d,we frequencies of occurrence of the hexamer in that se- move down the ranked list. In the plots below, the mis- quence and to build the hexamer-sequence frequency classification fraction is reported instead. A fraction of matrix. 
The “seqinr” package in R is used for this pur- 0.1 corresponds to 10% misclassification. pose. This is done for all the sequences in the specific Note. An important point concerns the training of the SVM (class “+1”) and nonspecific (class “−1”) categories. classifier with the top d features selected using DI or MI (step = ThematrixthushasN Ntrain,+1 + Ntrain,−1 rows and (7) above). Since the feature selection step is decoupled from 46 = 4096 columns. the classification step, it is preferred that the top d motifs are (2) The obtained hexamer-sequence frequency matrix is consistently ranked high among multiple draws of the data, preprocessed by assigning quantile labels for each hex- so as to warrant their inclusion in the classifier. However, amer within the ith sequence. A hexamer-sequence this does not yield expected results on this data set. Briefly, matrix is thus obtained where the (i, j)th entry has the a kendall rank correlation coefficient [34]wascomputedbe- quantile label of the jth hexamer in the ith sequence. tween the rankings of the motifs between multiple data draws This is done for all the N training sequences consisting (by sampling a subset of the entire dataset), for both MI- of examples from the −1 and +1 class labels. and DI-based feature-selection. It is observed that this co- (3) Thus, two submatrices corresponding to the two class efficient is very low in both MI and DI, indicating a highly labels are built. One matrix contains the hexamer- variable ranking. This is likely due to the high variability in sequence quantile labels for the positive training ex- data distribution across these multiple draws (due to limited amples and the other matrix is for the negative training number of data points), as well as the sensitivity of the data- examples. dependent entropy estimation procedure to the range of the (4) To select hexamers that are most different between the samples in the draw. To circumvent this problem of inconsis- positive and negative training examples, a t-test is per- tency in rank of motifs, a median DI/MI value is computed formed for each hexamer, between the “ts” and “nts” across these various draws and the top d features based on the groups. Ranking the corresponding t-test P-values median DI/MI value across these draws are picked for SVM yields those hexamers that are most different distri- training [20]. 8 EURASIP Journal on Bioinformatics and Systems Biology
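To make the pipeline summarized in Section 10 concrete, the sketch below (our illustration, not the authors' code) quantile-labels a hexamer-count matrix, ranks features, keeps the top d, and reports 10-fold cross-validated accuracy of an RBF-kernel SVM. Because the paper's directed-information estimator is not reproduced here, scikit-learn's mutual information score is used purely as a stand-in ranking criterion; the function names, parameter values, and random toy data are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def quantile_labels(counts, n_bins=4):
    """Replace each row's hexamer counts by per-row quantile labels (1..n_bins),
    mirroring the preprocessing of Section 6."""
    labels = np.empty_like(counts, dtype=int)
    for i, row in enumerate(counts):
        edges = np.quantile(row, np.linspace(0, 1, n_bins + 1)[1:-1])
        labels[i] = np.digitize(row, edges) + 1
    return labels

def rank_and_classify(X_counts, y, d=200, seed=0):
    """Rank hexamer features and report 10-fold CV accuracy of an RBF-kernel SVM
    trained on the top-d features.  Mutual information is used here only as a
    stand-in score; the paper ranks features by directed information instead."""
    Xq = quantile_labels(X_counts)
    scores = mutual_info_classif(Xq, y, discrete_features=True, random_state=seed)
    top = np.argsort(scores)[::-1][:d]
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    return cross_val_score(clf, Xq[:, top], y, cv=10).mean()

# Toy usage with random counts standing in for the hexamer-frequency matrix.
rng = np.random.default_rng(0)
X_counts = rng.poisson(3, size=(120, 1000))   # 120 promoters x 1000 hexamers
y = rng.integers(0, 2, size=120)              # "ts" (+1) vs "nts" (0) labels
print(rank_and_classify(X_counts, y, d=50))
```

Decoupling the ranking step from the classifier in this way also makes it easy to vary d and plot misclassification against the number of top-ranked features, as done in the figures that follow.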

11. RESULTS GC hkg prom GC brain prom 0.8 0.8 0.7 0.7 11.1. Tissue specific promoters 0.6 0.6 0.5 0.5 We use DI to find hexamers that discriminate brain-specific 0.4 0.4 and heart-specific expression from neutral sequences. The 0.3 0.3 negative training sets are sequences that are not brain or 0.2 0.2 heart-specific, respectively. Results using the MI and DI (a) (b) methods are given below (see Figures 5 and 7). The plots × 2 indicate the SVM cross-validated misclassification accuracy 10 4 10 (ideally 0) for the data as the number of features using the metric (DI or MI) is gradually increased. We can see that for 3 8 6 any given classification accuracy, the number of features us- 2 ing DI is less than the corresponding number of features us- 4 Frequency Frequency ing MI. This translates into a lower misclassification rate for 1 2 DI-based feature selection. We also observe that as the num- 0 0 ber of features d is increased, the performance of MI is the 0.3 0.4 0.5 0.6 0.7 0.3 0.4 0.5 0.6 same as DI. This is expected since, as we gather more fea- GC hkg prom GC brain prom tures using MI or DI, the differences in MI versus DI ranking are compensated. (c) (d) An important point needs to be clarified here. There Figure 4: GC sequence composition for brain-specific promoters is a possibility of sequence composition bias in the tissue- and housekeeping (hkg) promoters. specific and neutral sequences used during training. This has been reported in recent work [15]. To avoid detecting GC rich sequences as hexamer features, it is necessary to confirm that there is no significant GC-composition bias between the 0.35 specific and neutral sets in each of the case studies. This is demonstrated in Figures 4, 6,and8. In each case, it is ob- 0.3 served that the mean GC-composition is almost same for the specific versus neutral set. However, in such studies, it is nec- 0.25 essary to select for sequences that do not exhibit such bias. In Figures 6 and 8, even the distribution of GC-composition 0.2 is similar among the samples. For Figure 4, even though the 0.15 distributions are slightly different, the box plots indicate sim- ilarity in mean GC-content. 0.1 Next, some of the motifs that discriminate between tissue-specific and nonspecific categories for the brain pro- Misclassification rate (fraction) 0.05 moter, heart promoter, and brain enhancer cases, respec- tively, are listed in Table 2. Additionally, if the genes en- 0 coding for these TFs are expressed in the correspond- 0 50 100 150 200 ∗ ing tissue [35], a ( ) sign is appended. In some cases, Number of top ranking features used for classification the hexamer motifs match the consensus sequences of known transcription factors (TFs). This suggests a poten- MI tial role for that particular TF in regulating expression DI of tissue-specific genes. This matching of hexamer motifs Figure 5: Misclassification accuracy for the MI versus DI case with TFBS consensus sites is done using the MAPPER en- (brain promoter set). Accuracy of classification is ∼0.9, that is, 93%. gine (http://bio.chip.org/mapper). It is to be noted that a hexamer-TFBS match does not necessarily imply the func- tional role of the TF in the corresponding tissue (brain or heart). However, such information would be useful to guide doi: 10.1155/2007/13853), we have reported only a few due focused experiments to confirm their role in vivo (using tech- to space constraints. niques such as chromatin immunoprecipitation). 
In the context of the heart-specific genes, we con- As is clear from the above results, there are several sider the cardiac troponin gene (cTNT, ENSEMBL: other motifs which are novel or correspond to nonconsen- ENSG00000118194), which is present in the heart promoter sus motifs of known transcription factors. Hence, each of set. An examination of the high DI motifs for the heart- the identified hexamers merit experimental investigation. specific set yields motifs with the GATA consensus site, as Also, though we identify as many as 200 hexamers in this well as matches with the MEF2 transcription factor. It has work (please see Supplementary Material available online at been established earlier that GATA-4, MEF2 are indeed Arvind Rao et al. 9

GC hkg prom GC heart prom Table 2: Comparison of high ranking motifs (by DI) across differ- 0.8 0.8 ent data sets. The (∗) sign indicates tissue-specific expression of the 0.7 0.7 corresponding TF gene. 0.6 0.6 Brain promoters Heart promoters Brain enhancers 0.5 0.5 Ahr-ARNT (∗)Pax2HNF-4(∗) ∗ ∗ 0.4 0.4 Tcf11-MafG ( ) Tcf11-MafG ( )Nkx2 ∗ ∗ 0.3 0.3 c-ETS ( )XBP1()AML1 FREAC-4 Sox-17 (∗)c-ETS(∗) (a) (b) T3R-alpha1 FREAC-4 Elk1 (∗) 2 ×10 ∗ 4 30 GATA( ) 25 3 20 2 15 10 versus number of features in the MI and DI scenarios reveal Frequency 1 Frequency 5 the superior performance of the DI-based hexamer selection 0 0 compared to MI (see Figure 9). 0.3 0.4 0.5 0.6 0.7 0.3 0.4 0.5 0.6 0.7 In this case, the enhancer sequences are ultraconserved, GC hkg prom GC heart prom thus obtained after alignment across multiple species. The examination of these sequences identified motifs that are (c) (d) potentially selected for regulatory function across evolu- Figure 6: GC sequence composition for heart-specific promoters tionary distances. Using alignment as a prefiltering strat- and housekeeping (hkg) promoters. egy helps remove bias conferred by sequence elements that arise via random mutation but might be over-represented. This is permitted in programs like Toucan [12] and rVISTA 0.35 (http://rvista.dcode.org). As in the previous case, some of the top ranking motifs ∗ 0.3 from this dataset are also shown in Table 2. The ( ) signed TFs indicate that some of these discovered motifs indeed 0.25 have documented high expression in the brain. The occur- rence of such tissue-specific transcription factor motifs in 0.2 these regulatory elements gives credence to the discovered motifs. For example, ELK-1 is involved in neuronal differ- 0.15 entiation [38]. Also, some motifs matching consensus sites of TEF1 and ETS1 are common to the brain-enhancer and 0.1 brain-promoter set. Though this is interesting, an experi-

Misclassification rate (fraction) ment to confirm the enrichment of such transcription fac- 0.05 tors in the population of brain-specific regulatory sequences is necessary. 0 0 50 100 150 200 Number of top ranking features used for classification 11.3. Quantifying sequence-based TF influence

MI A very interesting question emerges from the above pre- DI sented results. What if one is interested in a motif that is not present in the above ranked hexamer list for a particu- Figure 7: Misclassification accuracy for the MI versus DI case (heart lar tissue-specific set? As an example, consider the case for promoter set). MyoD, a transcription factor which is expressed in muscle and has an activity in heart-specific genes too [39]. In fact, a variant of its consensus motif CATTTG is indeed in the top involved in transcriptional activation of this gene [36]and ranking hexamer list. The DI-based framework further per- the results have been confirmed by ChIP [37]. mits investigation of the directional association of the canon- ical MyoD motif (CACCTG) for the discrimination of heart- 11.2. Enhancer DB specific genes versus housekeeping genes. This is shown in Figure 10.Asisobserved,MyoD has a significant directional Additionally, all the brain-specific regulatory elements pro- influence on the heart-specific versus neutral sequence class filed in the mouse Enhancer Browser database (http:// label. This, in conjunction with the expression level char- enhancer.lbl.gov) are examined for discriminating motifs. acteristics of MyoD, indicates that the motif CACCTG is Figure 8 shows that the two classes have similar GC- potentially relevant to make the distinction between heart- composition. Again, the plot of misclassification accuracy specific and neutral sequences. 10 EURASIP Journal on Bioinformatics and Systems Biology

11.3. Quantifying sequence-based TF influence

A very interesting question emerges from the above presented results. What if one is interested in a motif that is not present in the above ranked hexamer list for a particular tissue-specific set? As an example, consider MyoD, a transcription factor which is expressed in muscle and has an activity in heart-specific genes too [39]. In fact, a variant of its consensus motif CATTTG is indeed in the top ranking hexamer list. The DI-based framework further permits investigation of the directional association of the canonical MyoD motif (CACCTG) with the discrimination of heart-specific genes versus housekeeping genes. This is shown in Figure 10. As is observed, MyoD has a significant directional influence on the heart-specific versus neutral sequence class label. This, in conjunction with the expression-level characteristics of MyoD, indicates that the motif CACCTG is potentially relevant for distinguishing heart-specific from neutral sequences.

Figure 8: GC sequence composition for brain-specific enhancers and neutral noncoding regions (GC-content histograms, panels (a)-(d)).

Figure 10: Cumulative distribution function for bootstrapped I(MyoD motif: CACCTG→Y); Y is the class label (heart-specific versus housekeeping). True I(CACCTG→Y) = 0.4977 (empirical CDF of the null distribution, plotted against the DI of MyoD→heart-specific promoters).

Figure 9: Misclassification accuracy for the MI versus DI case (brain enhancer set); misclassification rate (fraction) versus the number of top ranking features used for classification.

Another theme picks up on something quite traditionally done in bioinformatics research: finding key TF regulators underlying tissue-specific expression. Two major questions emerge from this theme.

(1) Which putative regulatory TFs underlie the tissue-specific expression of a group of genes?
(2) For the TFs found using tools like TOUCAN [12], can we examine the degree of influence that the particular TF motif has in directing tissue-specific expression?

To address the first question, we examine the TFs revealed by DI/MI motif selection and compare these to the TFs discovered from TOUCAN [12], underlying the expression of genes expressed on day e14.5 in the degenerating mesonephros and nephric duct (TS22). This set has about 43 genes (including Gata2). These genes are available in the Supplementary Material.

Using TOUCAN, the set of module TFs is combinations of the following TFs: E47, HNF3B, HNF1, RREB1, HFH3, CREBP1, VMYB, GFI1. These were obtained by aligning the promoters of these 43 genes (-2000 bp upstream to +200 bp from the TSS), and looking for over-represented TF motifs based on the TRANSFAC/JASPAR databases. Using the DI-based motif selection, a set of 200 hexamers is found that discriminate these 43 gene promoter sequences from the background housekeeping promoter set. They map to the consensus sites of several known TFs, such as (identified from http://bio.chip.org/mapper) Nkx, Max1, c-ETS, FREAC4, Ahr-ARNT, CREBP2, E2F, HNF3A/B, NFATc, Pax2, LEF1, Max1, SP1, Tef1, Tcf11-MafG; many of which are expressed in the developing kidney (http://www.expasy.org). Moreover, we observe that the TFs that are common between the TOUCAN results and the DI-based approach (FREAC4, Max1, HNF3a/b, HNF1, SP1, CREBP, RREB1, HFH3) are mostly kidney-specific. Thus, we believe that this observation makes a case for performing all (possibly degenerate) TF motif searches from TRANSFAC, and filtering them based on tissue-specific expression subsequently. Such a strategy yields several more TF candidates for testing and validation of biological function.
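The hexamer features used throughout can be obtained by simple counting. The sketch below is an assumption about that preprocessing, not the authors' pipeline: it builds the promoter-by-hexamer count matrix over which the DI/MI rankings are computed, with `sequences` an assumed list of uppercase DNA strings.

from itertools import product
import numpy as np

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]
INDEX = {h: i for i, h in enumerate(HEXAMERS)}

def hexamer_counts(sequences):
    """Count occurrences of all 4^6 hexamers in each sequence with a sliding window."""
    X = np.zeros((len(sequences), len(HEXAMERS)), dtype=int)
    for row, seq in enumerate(sequences):
        for start in range(len(seq) - 5):
            word = seq[start:start + 6]
            if word in INDEX:                           # windows containing N or other symbols are skipped
                X[row, INDEX[word]] += 1
    return X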

For the second question, we examine the following scenario. The Gata3 gene is observed to be expressed in the developing ureteric bud (UB) during kidney development. To find UB-specific TF regulators, conserved TF modules can be examined in the promoters of UB-specific genes. These experimentally annotated UB-specific genes are obtained from the Mouse Genome Informatics database at http://www.informatics.jax.org. Several programs are used for such analysis, like Genomatix [11] or Toucan [12]. Using Toucan, the promoters of the various UB-specific genes are aligned to discover related modules. The top-ranking module in Toucan contains AHR-ARNT, Hox13, Pax2, Tal1alpha-E47, and Oct1. Again, the power of these motifs to discriminate UB-specific and nonspecific genes, based on DI, can be investigated.

For this purpose, we check whether the Pax2 binding motif (GTTCC [40]) indeed induces kidney-specific expression by looking for the strength of DI between the GTTCC motif and the class label (+1) indicating UB expression (see Figure 11). This once again adds to the computational evidence for the true role of Pax2 in directing ureteric bud specific expression [40]. The main implication here is that, from sequence data, there is strong evidence for the Pax2 motif being a useful feature for UB-specific genes. This is especially relevant given the documented role of Pax2 (see [41]) in directing ureteric-bud expression of the Gata3 gene, one of the key modulators of kidney morphogenesis. Both the MyoD and Pax2 studies indicate the relevance of principled data integration using expression [35, 42] and sequence modalities.

Figure 11: Cumulative distribution function for bootstrapped I(Pax2 motif: GTTCC→Y); Y is the class label (UB/non-UB). True I(GTTCC→Y) = 0.9792 (empirical CDF).
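The empirical CDFs of Figures 10 and 11 summarize a null distribution obtained by resampling. A minimal sketch of this idea follows; it is not the authors' estimator, and a plug-in mutual information between the binary motif-occurrence indicator and the class label is used here only as a stand-in for the directed information statistic.

import numpy as np

def plugin_mi(a, b):
    """Plug-in mutual information (bits) between two binary arrays."""
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((a == u) & (b == v))
            if p_uv > 0:
                mi += p_uv * np.log2(p_uv / (np.mean(a == u) * np.mean(b == v)))
    return mi

def null_cdf(motif_present, label, n_boot=1000, seed=0):
    """Observed statistic, sorted null values from label permutations, and the CDF value at the observed statistic."""
    rng = np.random.default_rng(seed)
    observed = plugin_mi(motif_present, label)
    null = np.sort([plugin_mi(motif_present, rng.permutation(label)) for _ in range(n_boot)])
    return observed, null, np.searchsorted(null, observed) / n_boot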

11.4. Observations

With regard to the feature selection and classification results, in both studies (enhancers and promoters), we observe that about 100 hexamers are enough to discriminate the tissue-specific from the neutral sequences. Furthermore, some sequence features of these motifs at the promoter/enhancer emerge.

(i) There is higher sequence variability at the promoter since it has to act in concert with LREs of different tissue types during gene regulation.
(ii) Since the enhancer/LRE acts with the promoter to confer expression in only one tissue type, these sequences are more specific and hence their mining identifies motifs that are probably more indicative of tissue-specific expression.

We, however, reiterate that the enhancer dataset that we study uses the hsp68-lacZ as the promoter driven by the ultraconserved elements. Hence there is no promoter specificity in this context. Though this is a disadvantage and might not reveal all key motifs, it is the best that can be done in the absence of any other comprehensive repository.

The second aspect of the presented results highlights two important points. Firstly, the identified motifs have a strong predictive value as suggested by the cross-validation results as well as Table 2. Moreover, DI provides a principled methodology to investigate any given motif for tissue-specificity as well as for identifying expression-level relationships between the TFs and their target genes (Section 11.3).

12. CONCLUSIONS

In this work, a framework for the identification of hexamer motifs to discriminate between two kinds of sequences (tissue-specific promoters or regulatory elements versus nonspecific elements) is presented. For this feature selection problem, a new metric, the "directed information" (DI), is proposed. In conjunction with a support vector machine classifier, this method was shown to outperform the state-of-the-art method employing undirected mutual information. We also find that only a subset of the discriminating motifs correlate with known transcription factor motifs; hence the other motifs might be potentially related to nonconsensus TF binding or to underlying epigenetic phenomena governing tissue-specific gene expression. The superior performance of the directed-information-based variable selection suggests its utility for more general learning problems. As per the initial motivation, the discovery of these motifs can aid in the prospective discovery of other tissue-specific regulatory regions.

We have also examined the applicability of DI to prospectively resolve the functional role of any TF motif in a biological process, integrating other sources (literature, expression data, module searches).

13. FUTURE WORK

Several opportunities for future work exist within this proposed framework. Multiple sequence alignment of promoter/regulatory sequences across species would be a useful preprocessing step to reduce false detection of discriminatory motifs. The hexamers can also be identified based on other metrics exploiting distributional divergence between the samples of the "+1" and "-1" classes. Furthermore, there is a need for consistent high-dimensional entropy estimators within the small sample regime. A very interesting direction is the formulation of a stepwise hexamer selection algorithm, using the directed information for maximal relevance selection and mutual information for minimizing between-hexamer redundancy [18]. This analysis is beyond the scope of this work, but an implementation is available from the authors for further investigation. (The source code of the analysis tools in R 2.0 and MATLAB 6.1 is available on request.)
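A minimal sketch of the stepwise selection idea mentioned above, in the spirit of the max-relevance/min-redundancy criterion of [18], is given below; `relevance` and `redundancy` are assumed callables (for example, directed information from a hexamer to the class label and mutual information between two hexamers), not functions defined in this paper.

def stepwise_selection(n_features, relevance, redundancy, n_select=100):
    """Greedily pick features maximizing relevance minus average redundancy with those already chosen."""
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def criterion(j):
            if not selected:
                return relevance(j)
            return relevance(j) - sum(redundancy(j, k) for k in selected) / len(selected)
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected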
ACKNOWLEDGMENTS

The authors gratefully acknowledge the support of the NIH under Award 5R01-GM028896-21 for J. D. Engel. They would like to thank Professor Sandeep Pradhan and Mr. Ramji Venkataramanan for useful discussions on directed information. They are extremely grateful to Professor Erik Learned-Miller and Dr. Damian Fermin for sharing their code for high-dimensional entropy estimation and ENSEMBL sequence extraction, respectively. They also thank the anonymous reviewers and the corresponding editor for helping them improve the quality of the manuscript through insightful comments and suggestions. The material in this paper was presented in part at the IEEE Statistical Signal Processing Workshop 2007 (SSP07).

REFERENCES

[1] K. D. MacIsaac and E. Fraenkel, "Practical strategies for discovering regulatory DNA sequence motifs," PLoS Computational Biology, vol. 2, no. 4, p. e36, 2006.
[2] G. Kreiman, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes," Nucleic Acids Research, vol. 32, no. 9, pp. 2889-2900, 2004.
[3] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," Journal of Molecular Biology, vol. 268, no. 1, pp. 78-94, 1997.
[4] Q. Li, G. Barkess, and H. Qian, "Chromatin looping and the probability of transcription," Trends in Genetics, vol. 22, no. 4, pp. 197-202, 2006.
[5] D. A. Kleinjan and V. van Heyningen, "Long-range control of gene expression: emerging mechanisms and disruption in disease," The American Journal of Human Genetics, vol. 76, no. 1, pp. 8-32, 2005.
[6] L. A. Pennacchio, G. G. Loots, M. A. Nobrega, and I. Ovcharenko, "Predicting tissue-specific enhancers in the human genome," Genome Research, vol. 17, no. 2, pp. 201-211, 2007.
[7] D. C. King, J. Taylor, L. Elnitski, F. Chiaromonte, W. Miller, and R. C. Hardison, "Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences," Genome Research, vol. 15, no. 8, pp. 1051-1060, 2005.
[8] L. A. Pennacchio, N. Ahituv, A. M. Moses, et al., "In vivo enhancer analysis of human conserved non-coding sequences," Nature, vol. 444, no. 7118, pp. 499-502, 2006.
[9] K. Kadota, J. Ye, Y. Nakai, T. Terada, and K. Shimizu, "ROKU: a novel method for identification of tissue-specific genes," BMC Bioinformatics, vol. 7, p. 294, 2006.
[10] J. Schug, W.-P. Schuller, C. Kappen, J. M. Salbaum, M. Bucan, and C. J. Stoeckert Jr., "Promoter features related to tissue specificity as measured by Shannon entropy," Genome Biology, vol. 6, no. 4, p. R33, 2005.
[11] T. Werner, "Regulatory networks: linking microarray data to systems biology," Mechanisms of Ageing and Development, vol. 128, no. 1, pp. 168-172, 2007.
[12] S. Aerts, P. Van Loo, G. Thijs, et al., "TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis," Nucleic Acids Research, vol. 33, Web Server Issue, pp. W393-W396, 2005.
[13] B. Y. Chan and D. Kibler, "Using hexamers to predict cis-regulatory motifs in Drosophila," BMC Bioinformatics, vol. 6, p. 262, 2005.
[14] G. B. Hutchinson, "The prediction of vertebrate promoter regions using differential hexamer frequency analysis," Computer Applications in the Biosciences, vol. 12, no. 5, pp. 391-398, 1996.
[15] P. Sumazin, G. Chen, N. Hata, A. D. Smith, T. Zhang, and M. Q. Zhang, "DWE: discriminating word enumerator," Bioinformatics, vol. 21, no. 1, pp. 31-38, 2005.
[16] G. Lakshmanan, K. H. Lieuw, K.-C. Lim, et al., "Localization of distant urogenital system-, central nervous system-, and endocardium-specific transcriptional regulatory elements in the GATA-3 locus," Molecular and Cellular Biology, vol. 19, no. 2, pp. 1558-1568, 1999.
[17] M. Khandekar, N. Suzuki, J. Lewton, M. Yamamoto, and J. D. Engel, "Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system," Molecular and Cellular Biology, vol. 24, no. 23, pp. 10263-10276, 2004.
[18] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[19] Proceedings of NIPS 2006 Workshop on Causality Feature Selection, http://research.ihost.com/cws2006/.
[20] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[21] H. Marko, "The bidirectional communication theory—a generalization of information theory," IEEE Transactions on Communications, vol. COM-21, no. 12, pp. 1345-1351, 1973.
[22] J. Massey, "Causality, feedback and directed information," in Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '90), pp. 303-305, Waikiki, Hawaii, USA, November 1990.
[23] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Transactions on Information Theory, vol. 53, no. 6, pp. 2154-2179, 2007.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[25] E. G. Miller, "A new class of entropy estimators for multidimensional densities," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, pp. 297-300, Hong Kong, April 2003.
[26] R. M. Willett and R. D. Nowak, "Complexity-regularized multiresolution density estimation," in Proceedings of the International Symposium on Information Theory (ISIT '04), pp. 303-305, Chicago, Ill, USA, June-July 2004.
[27] I. Nemenman, F. Shafee, and W. Bialek, "Entropy and inference, revisited," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., MIT Press, Cambridge, Mass, USA, 2002.
[28] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191-1253, 2003.
[29] H. Joe, "Relative entropy measures of multivariate dependence," Journal of the American Statistical Association, vol. 84, no. 405, pp. 157-164, 1989.
[30] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1994.
[31] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, Springer Series in Statistics, Springer, New York, NY, USA, 1997.
[32] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B, vol. 57, no. 1, pp. 289-300, 1995.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[34] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81-93, 1938.
[35] NCBI PubMed URL, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
[36] A. M. Murphy, W. R. Thompson, L. F. Peng, and L. Jones II, "Regulation of the rat cardiac troponin I gene by the transcription factor GATA-4," Biochemical Journal, vol. 322, part 2, pp. 393-401, 1997.
[37] A. Azakie, J. R. Fineman, and Y. He, "Myocardial transcription factors are modulated during pathologic cardiac hypertrophy in vivo," The Journal of Thoracic and Cardiovascular Surgery, vol. 132, no. 6, pp. 1262-1271.e4, 2006.
[38] P. Vanhoutte, J. L. Nissen, B. Brugg, et al., "Opposing roles of Elk-1 and its brain-specific isoform, short Elk-1, in nerve growth factor-induced PC12 differentiation," Journal of Biological Chemistry, vol. 276, no. 7, pp. 5189-5196, 2001.
[39] E. N. Olson, "Regulation of muscle transcription by the MyoD family: the heart of the matter," Circulation Research, vol. 72, no. 1, pp. 1-6, 1993.
[40] G. R. Dressler and E. C. Douglass, "Pax-2 is a DNA-binding protein expressed in embryonic kidney and Wilms tumor," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 4, pp. 1179-1183, 1992.
[41] D. Grote, A. Souabni, M. Busslinger, and M. Bouchard, "Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney," Development, vol. 133, no. 1, pp. 53-61, 2006.
[42] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Inference of biologically relevant gene influence networks using the directed information criterion," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 2, pp. 1028-1031, Toulouse, France, May 2006.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 31450, 18 pages
doi:10.1155/2007/31450

Research Article

Splitting the BLOSUM Score into Numbers of Biological Significance

Francesco Fabris,1,2 Andrea Sgarro,1,2 and Alessandro Tossi3

1 Dipartimento di Matematica e Informatica, Università degli Studi di Trieste, via Valerio 12b, 34127 Trieste, Italy
2 Centro di Biomedicina Molecolare, AREA Science Park, Strada Statale 14, Basovizza, 34012 Trieste, Italy
3 Dipartimento di Biochimica, Biofisica e Chimica delle Macromolecole, Università degli Studi di Trieste, via Licio Giorgieri 1, 34127 Trieste, Italy

Received 2 October 2006; Accepted 30 March 2007

Recommended by Juho Rousu

Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM score, which was split into three components termed as the BLOSUM spectrum (or BLOSpectrum). These relate respectively to the sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (typicality of the amino acid probability distribution in each sequence), and to the target frequency divergence (compliance of the amino acid variations between the two sequences to the protein model implicit in the BLOCKS database). This treatment sharpens the protein sequence comparison, providing a rationale for the biological significance of the obtained score, and helps to identify weakly related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the evolutionary divergence associated with the two sequences, or indicate if a compositionally adjusted matrix could perform better.

Copyright © 2007 Francesco Fabris et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Substitution matrices have been in use since the introduction of the Needleman and Wunsch algorithm [1], and are referred to, either implicitly or explicitly, in several other papers from the seventies, McLachlan [2], Sankoff [3], Sellers [4], Waterman et al. [5], Dayhoff et al. [6]. These are the conceptual tools at the basis of several methods for attributing a similarity score to two aligned protein sequences. Any amino acid substitution matrix, which is a 20 * 20 table, has a scoring method that is implicitly associated with a set of target frequencies p(i, j) [7, 8], pertaining to the pair i, j of amino acids that are paired in the alignment. An important approach to obtaining the score associated with the paired amino acids i, j was that suggested by Dayhoff et al. [6], who developed a stochastic model of protein evolution called PAM (points of accepted mutations). In this model, the frequencies m(i, j) indicate the probability of change from one amino acid i to another amino acid j, in homologous protein sequences with at least 85% identity, during short-term evolution. The matrix M, relating each amino acid to each of the other 19, with an evolutionary distance of 1, would have entries m(i, j) close to 1 on the main diagonal (i = j) and close to 0 out of the main diagonal (i ≠ j). An M^k matrix, which estimates the expected probability of changes at a distance of k evolutionary units, is then obtained by multiplying the M matrix by itself k times. Each M^k matrix is then associated to the scoring matrix PAM-k, whose entries are obtained on the basis of the log ratio

s(i, j) = \log \frac{m_k(i, j)}{p(i)\,p(j)},    (1)

where p(i) and p(j) are the observed frequencies of the amino acids.

S. Henikoff and J. G. Henikoff introduced the BLOck SUbstitution Matrix (BLOSUM) [9]. While the scoring method is always based on a log odds ratio, as seems natural in any kind of substitution matrix [7], the method for deriving the target frequencies is quite different from PAM; one needs to evaluate the joint target frequencies p(i, j) of finding the amino acids i and j paired in alignments among homologous proteins with a controlled rate of percent identity. This joint probability is compared with p(i)p(j), the product of the background frequencies of amino acids i and j, derived from the amino acid probability distribution P = {p_1, p_2, ..., p_20}.
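A minimal sketch of (1), under the assumption that the 1-PAM transition matrix M (rows summing to one) and the background frequencies p are available as numpy arrays in a fixed amino acid order, is the following; it only illustrates the formula as written above.

import numpy as np

def pam_scores(M, p, k):
    """PAM-k log-odds scores following (1): s(i, j) = log[m_k(i, j) / (p(i) p(j))]."""
    Mk = np.linalg.matrix_power(M, k)      # expected change probabilities after k evolutionary units
    return np.log2(Mk / np.outer(p, p))    # zero-probability entries yield -inf and would be clipped in practice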

The target and background frequencies are tied by the equality p(i) = \sum_{j=1}^{20} p(i, j), so that the background probability distribution is the marginal of the joint target frequencies [10]. The product p(i)p(j) reflects the likelihood of the independence setting, namely that the amino acids i and j are paired by pure chance. If p(i, j) > p(i)p(j), then the presence of i stochastically induces the presence of j, and vice versa (i and j are "attractive"), while if p(i, j) < p(i)p(j), the presence of one tends to exclude the other (i and j are "repulsive"). The score of the pair i, j is the log odds ratio

s(i, j) = \log \frac{p(i, j)}{p(i)\,p(j)},    (2)

being positive when i and j are attractive, and negative when the opposite occurs. The i, j entry of the BLOSUM matrix is the score of the pair i, j (or j, i, which is the same since the sequences are not ordered; for a different approach see Yu et al. [11]) multiplied by a suitable scale factor (4 for BLOSUM-35 and BLOSUM-40, 3 for BLOSUM-50, and 2 for the remaining). The value so obtained is then rounded to the nearest integer, and the (unscaled) global score of two sequences X = x_1, x_2, ..., x_n and Y = y_1, y_2, ..., y_n of length n is given by summing up the scores relative to each position

S(X, Y) = \sum_{h=1}^{n} s(x_h, y_h) = \sum_{i,j} n(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (3)

where n(i, j) is the number of occurrences of the pair i, j inside the aligned sequences. This equation weighs the log ratio associated to the i, j entry of the BLOSUM matrix with the occurrences of the pair i, j, and seems intuitive following a heuristic approach, as any reasonable substitution matrix is implicitly of this form [7]. In order to compute the necessary target and background frequencies p(i, j) and p(i)p(j), S. Henikoff and J. G. Henikoff used the database BLOCKS (http://blocks.fhcrc.org/index.html), which contains sets of proteins with a controlled maximum rate of percent identity "θ" that defines the BLOSUM matrix, so that BLOSUM-62 refers to θ = 62%, and so forth.

Scoring substitution matrices, such as PAM or BLOSUM, are used in modern web tools (BLAST, PSI-BLAST, and others) for performing database searches; the search is accomplished by finding all sequences that, when compared to a given query sequence, sum up a score over a certain threshold. The aim is usually that of discovering biological correlation among different sequences, often belonging to different organisms, which may be associated with a similar biological function. In most cases, this correlation is quite evident when proteins are associated with genes that have duplicated, or organisms that have diverged from one another relatively recently, and leads to high values of the BLOSUM (or PAM) score. But in some cases, a relevant biological correlation may be obscured by phenomena that reduce the score, making it difficult to capture. Those that limit the efficiency of the scoring method in finding concealed or weakly correlated sequences are well documented in the literature, the most relevant being:

(1) Gaps: insertions or deletions (of one or more residues) in one or both the aligned sequences cause loss of synchronization, significantly decreasing the score;

We have set out to inspect, in more depth and by use of mathematical tools, what the BLOSUM score really measures from a biological point of view; the aim was to split the score into components, the BLOSpectrum, that provide insight on the above described phenomena and other biological information regarding the compared sequences, once the alignment has been made using the classical methods (BLAST, FASTA, etc.). We do not propose an alternative alignment algorithm or a method for increasing the performance of the available ones; nor do we suggest new methods for inserting gaps so as to maximize the score (see, e.g., [14, 15]). Ours is simply a diagnostic tool to reveal the following:

(1) if, for an available algorithm, the chosen scoring matrix is correct;
(2) whether the aligned sequences are typical protein sequences or not;
(3) whether the alignment itself is typical with respect to the BLOCKS database; and
(4) the possible presence of a weak or concealed correlation also for alignments resulting in a relatively low BLOSUM score, that might otherwise be neglected.

The method is associated with the use of a BLOSUM matrix that has been developed within the context of local (ungapped) alignment statistics [7, 8, 11]. To allow a critical evaluation of our method, we furnish an online software package that provides values for each component of the BLOSpectrum for two aligned sequences (http://bioinf.dimi.uniud.it/software/software/blosumapplet). Providing a rationale about the biological significance of an obtained score sharpens the comparison of weakly related sequences, and can reveal that comparable scores actually conceal completely different biological relationships. Furthermore, our decomposition helps in selecting the matrix that is correctly tailored for the actual evolutionary divergence associated to the two sequences one is going to compare, or in deciding if a compositionally adjusted matrix might not perform better. Although we have used the BLOSUM scoring method for our analyses, since it is the most widely used by web tools measuring protein similarities, our decomposition is applicable, in principle, to any scoring matrix in the form of (3), and confirms that the usefulness of this type of matrix has a solid mathematical justification.

2. METHODS

2.1. Mathematical analysis of the BLOSUM score

The BLOSUM score (3) can be analyzed from a mathematical perspective using well-known tools developed by Shannon in his seminal paper that laid the foundation for Information Theory [16, 17]. The first of these is the Mutual Information I(X, Y) (or relative entropy) between two random variables X and Y,

I(X, Y) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (4)

where p(i, j), p(i), p(j) are, respectively, the joint probability distribution and the marginals associated to the random variables X and Y. We can adapt (4) to the comparison of two sequences if we interpret p(i, j) as the relative frequency of finding amino acids i and j paired in the X and Y sequences, and p(i) (p(j)) of finding amino acid i (j) in sequence X (Y). Following this approach, in a biological setting, mutual information (MI) becomes a measure of the stochastic correlation between two sequences. It can be shown (see the appendix) that I(X, Y) ≤ log 20 ≈ 4.3219. The second tool is the informational divergence D(P//Q) between two probability distributions P = {p_1, p_2, ..., p_K} and Q = {q_1, q_2, ..., q_K} [18], where

D(P//Q) = \sum_{i=1}^{K} p(i) \log \frac{p(i)}{q(i)}.    (5)

The informational divergence (ID) can be interpreted as a measure of the nonsymmetrical "distance" between two probability distributions. A more detailed mathematical treatment of the properties associated with MI and ID is provided in the appendix. Here, we simply indicate that ID and MI are nonnegative quantities, and that they are tied by the formula

I(X, Y) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)} = D(P_{XY} // P_X P_Y) ≥ 0,    (6)

so that MI is really a special kind of ID, that measures the "distance" between the joint probability distribution P_XY and the product P_X P_Y of the two marginals P_X and P_Y.

Given two amino acid sequences, X and Y, the corresponding BLOSUM (unscaled) normalized score S_N(X, Y), measured in bits, is computed as

S_N(X, Y) = \frac{1}{n} \sum_{h=1}^{n} s(x_h, y_h) = \sum_{i,j} f(i, j) \log \frac{p(i, j)}{p(i)\,p(j)},    (7)

where f(i, j) = n(i, j)/n is the relative frequency of the pair i, j observed on the aligned sequences X and Y. Because one usually deals with sequences that could have remarkably different lengths, we report the normalized per-residue score to permit a coherent comparison. It is important to stress the fact that while f(i, j) is the observed frequency pertaining to the sequences under inspection, the target frequencies p(i, j), together with the background marginals p(i) and p(j), pertain to the database BLOCKS. In a sense, they constitute "the model" of the typical behaviour of a protein, since p(i) or p(j) is in fact the "typical" probability distribution of amino acids as observed in most proteins, while p(i, j) is the "typical" probability of finding the amino acids i and j positionally paired in two protein sequences with a percent identity depending on θ. From an evolutionary point of view, we can say that if p(i, j) is greater than in the case of independence, then it is very likely that i and j are biologically correlated.

Equation (7) is in fact quite similar to (4), which specifies mutual information, the only difference being the use of f(i, j) instead of p(i, j) as the multiplying factor for the logarithmic term, so that the normalized score is a kind of "mixed" mutual information. As a matter of fact, we can define

I(A, B) = \sum_{i,j} p(i, j) \log \frac{p(i, j)}{p(i)\,p(j)}    (8)

as the mutual information, or relative entropy, of the target and background frequencies associated to the database BLOCKS, or to any other protein model used to find the target frequencies. Here A and B are dummy random variables taken to have generated the data of the database. The quantity I(A, B) was in effect used by Altschul in the case of PAM matrices [7], and by S. Henikoff and J. G. Henikoff [9] for the BLOSUM matrices, and in both cases it can be interpreted as the average exchange of information associated with a pair of aligned amino acids of the data bank, or as the expected average score associated to pairs of amino acids, when they are put into correspondence in alignments that adhere to the protein model over which the matrices are computed. From the perspective of an aligning method, we can state that I(A, B) measures the average information available for each position in order to distinguish the alignment from chance, so that the higher its value, the shorter the fragments whose alignment can be distinguished from chance [7]. Equation (6) (or (A.4) in the appendix) ensures also that this average score is always greater than or equal to zero.

On the other hand, if we compute the expected score when two amino acids i and j are picked at random in an independence setting model, given as

E(A, B) = \sum_{i,j} p(i)\,p(j) \log \frac{p(i, j)}{p(i)\,p(j)} = -D(P_X P_Y // P_{XY}) ≤ 0,    (9)

the classical assumptions made in constructing a scoring matrix [7] require that this expected score is lower than or equal to zero. Note that all these quantities pertain to the database BLOCKS (in the case of BLOSUM), that is, to the particular "protein model" used.
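The quantities in (4), (5), and (7) are straightforward to compute. The sketch below is only an illustration of the formulas, in bits, with `pair_freq` the observed pair-frequency matrix f(i, j), and `target` and `background` the BLOCKS frequencies p(i, j) and p(i), all assumed to be available as numpy arrays in a fixed amino acid order.

import numpy as np

def divergence(P, Q):
    """Informational divergence D(P//Q) of (5); Q is assumed positive on the support of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

def mutual_information(joint):
    """Mutual information I(X, Y) of (4): divergence of the joint from the product of its marginals."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return divergence(joint, np.outer(px, py))

def normalized_score(pair_freq, target, background):
    """Per-residue normalized score S_N(X, Y) of (7)."""
    mask = pair_freq > 0
    logodds = np.log2(target[mask] / np.outer(background, background)[mask])
    return float(np.sum(pair_freq[mask] * logodds))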

To solely evaluate the stochastic similarity between two sequences X and Y, the identity

I(X, Y) = \sum_{i,j} f(i, j) \log \frac{f(i, j)}{f_X(i)\,f_Y(j)},    (10)

which measures the degree of stochastic dependence between the protein sequences, would suffice (here f_X(i) = n(i)/n and f_Y(j) = n(j)/n are the relative frequencies of amino acid i observed in sequence X and amino acid j observed in sequence Y). But this is not so interesting from the biological point of view, as one has to take into account the possibility that, even if similar from the stochastic point of view, two sequences are far from being an example of a typical protein-to-protein matching (or evolutionary transition). In other words, we need to inspect this stochastic similarity under the "lens" of the protein model used in the BLOCKS database (or by the PAM model, for that matter).

Subjecting the (unscaled) normalized score S_N(X, Y) of (7) to simple mathematical manipulations (see the appendix for details), we can split S_N(X, Y) into the following terms:

S_N(X, Y) = I(X, Y) - D(F_XY // P_AB) + D(F_X // P_A) + D(F_Y // P_B).    (11)

Here, F_XY is the joint frequency distribution of the amino acid pairs in the sequences (observed target frequencies), while F_X and F_Y are, respectively, the distributions of the amino acids inside X and Y (observed background frequencies). P_AB instead is the joint probability distribution associated to the BLOCKS database, and is the vector of target frequencies. Note also that P_A = P_B = P are the probability distributions of the amino acids inside the same database BLOCKS, that is, the database background frequencies; they are equal as a consequence of the symmetry of the BLOSUM matrix entries, since p(i, j) = p(j, i). We define the set {I(X, Y), D(F_XY//P_AB), D(F_X//P), D(F_Y//P)} to be the BLOSUM spectrum of the aligned sequences (or BLOSpectrum). Notice that (11) holds also when the BLOSUM matrix is compositionally adjusted following the approach described in Yu et al. [11], that is, when the background frequencies are different (P_A ≠ P_B).

The terms constituting the BLOSpectrum have a different order of magnitude, as D(F_X//P) and D(F_Y//P) act with a cardinality of 20, when compared to the joint divergences I(X, Y) and D(F_XY//P_AB), that act on probability distributions whose cardinality is 20 * 20 = 400. From a practical point of view, this means that the contribution of I(X, Y) and D(F_XY//P_AB) to the score is expected to be roughly double that of D(F_X//P) and D(F_Y//P). Actually, under the hypothesis of a Bernoullian process (i.e., stationary and memoryless), we have D(P^2//Q^2) = 2D(P//Q) [18] (as in our case 20^2 = 400), and the sum of the two terms D(F_X//P) + D(F_Y//P) compensates the order of magnitude of the joint divergences.

Finally, it should be recalled that the score actually obtained by using the BLOSUM matrices, whose entries are multiplied by the constant c and rounded to the nearest integer, is an approximation of the exact score S_N(X, Y) of (11), once it has been scaled. The difference is usually quite small (about 2-3% if the score is high), but it becomes more and more significant as the score approaches zero.

2.2. Taking gaps into account

An important consideration regarding our mathematical analysis is that it does not formally take gaps into account. From a mathematical perspective, the only way to account correctly for gaps would be to use a 21 * 21 scoring matrix, in which the gap is treated as equivalent to a 21st amino acid, so that pairs of the form (i, -) or (-, j), where the symbol "-" represents the gap, are also contemplated; but from a biological perspective this might not be acceptable, since a gap is not a real component of a sequence. We can nevertheless extend our analysis to a gapped score if we admit the independence between each gap and any residue paired with it. Biologically, independence may be questionable, and would need to be determined case by case, as each gap is due to a chance deletion or insertion event subsequently acted on by natural selection (which may be neutral or positive). Moreover, there is no certainty as to the correct positioning of a gap in any given alignment, as it is introduced a posteriori as the product of an alignment algorithm that takes the two sequences X and Y, and tries to minimize (by an exact procedure, or by a heuristic approach) the number of changes, insertions, or deletions that allow to transform X into Y (or vice versa). In practice, we consider quite reasonable the idea that gaps in a given position should imply a degree of independence as to which amino acids might occur there in related proteins; this is accepted also in PSI-BLAST [19]. The consequence of assuming independence is that p(-, j) = p(-)p(j) leads to a null contribution of the corresponding score, since s(-, j) = log[p(-, j)/(p(-)p(j))] = 0 (see (3)), so that for gapped sequences, we simply assign a score equal to zero whenever an amino acid is paired with a gap. Note that this does not mean that we reduce a gapped alignment to an ungapped one, but that we simply ignore the gap and the corresponding residue, since the pair is not affecting the BLOSpectrum, due to its zero contribution to the score. Moreover, it is conceivable that for distant sequence correlations, the use of different algorithms, or of different gap penalty schemes for any given algorithm, could result in a different pattern of gaps and consequently in different sequence alignments, each with a corresponding BLOSpectrum. In this case, the likelihood of each alignment might be tested by exploiting the BLOSpectrum, that might be quite different even if the numerical scores have approximately the same value; this can help identify the most appropriate one.
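A minimal sketch of the decomposition (11) for two aligned sequences of equal length follows; gap-containing pairs are skipped, as discussed in Section 2.2. The amino acid ordering of `target` (20 x 20) and `background` (length 20) is an assumption, and divergence() and mutual_information() are the helper functions sketched above.

import numpy as np

ALPHABET = "ARNDCQEGHILKMFPSTWYV"
POS = {a: i for i, a in enumerate(ALPHABET)}

def blospectrum(x, y, target, background):
    """Return the four BLOSpectrum terms of (11) and their algebraic sum S_N."""
    pair_freq = np.zeros((20, 20))
    for a, b in zip(x, y):
        if a in POS and b in POS:                       # pairs involving a gap contribute zero and are ignored
            pair_freq[POS[a], POS[b]] += 1
    pair_freq /= pair_freq.sum()
    fx, fy = pair_freq.sum(axis=1), pair_freq.sum(axis=0)
    terms = {
        "I(X,Y)": mutual_information(pair_freq),
        "D(F_XY//P_AB)": divergence(pair_freq, target),
        "D(F_X//P)": divergence(fx, background),
        "D(F_Y//P)": divergence(fy, background),
    }
    terms["S_N"] = (terms["I(X,Y)"] - terms["D(F_XY//P_AB)"]
                    + terms["D(F_X//P)"] + terms["D(F_Y//P)"])
    return terms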

3. RESULTS AND DISCUSSION

3.1. Meaning and biological implications of the BLOSpectrum terms

Let us now analyze the meaning of the terms in (11).

(i) The mutual information I(X, Y) is the sequence convergence, which measures the degree of stochastic dependence (or stochastic correlation) between the aligned sequences X and Y; the greater its value, the more statistically correlated are the two. It is highly correlated with, but not identical to, the percent identity of the alignment, as it also includes the propensity of finding certain amino acids paired, even if different. This term enhances the overall BLOSUM score, since it is taken with the plus sign.

(ii) The target frequency divergence D(F_XY//P_AB) measures the difference between the "observed" target frequencies and the target frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between F_XY and P_AB, that is, the distance between the mode in which amino acids are paired in the X and Y sequences and inside the "protein model" implicit in the BLOCKS database. When the vector of observed frequencies F_XY is "far" from the vector of target frequencies P_AB exhibited by the protein model, then the divergence is high, so that starting from X we obtain a Y (or vice versa) that is not the one we would expect on the basis of the target frequencies of the database; in other words, the amino acids are paired following relative frequencies that are not the standard ones. The term D(F_XY//P_AB) is a penalty factor in (11), since it is taken with the minus sign.

(iii) The background frequency divergence D(F_X//P_A) (or D(F_Y//P_B)) of the sequence X (or Y) measures the difference between the "observed" background frequencies and the background frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between the observed frequencies F_X (or F_Y) and the vector P = P_A = P_B of background frequencies of the amino acids inside the database BLOCKS. The greater is its value, the more different are the observed frequencies from the background frequencies exhibited by a typical protein sequence. This term enhances the score, since it is taken with the plus sign.

Note that the quantities that constitute the decomposition of the BLOSUM score are not independent of one another. For example, D(F_XY//P_AB) ≈ 0 implies low values for D(F//P) also. This is because when F_XY → P_AB (or D(F_XY//P_AB) → 0; see the appendix), then also the observed marginals F_X and F_Y are forced to approach the background marginal, that is, F_X → P and F_Y → P, which implies D(F//P) → 0. This is a consequence of the tie between a joint probability distribution and its marginals [10]. For the same reason, if D(F//P) ≫ 0, then D(F_XY//P_AB) will also be large, although the opposite is not necessarily the case. This leads to (at least partially) a compensation of the effects, due to the minus sign of the target frequency divergence, so that -D(F_XY//P_AB) + D(F_X//P_A) + D(F_Y//P_B) has a small value. This implies that a significant BLOSUM score can be obtained only when the aligned sequences are statistically correlated, that is, when I(X, Y) has a high value. Since when performing an alignment we are mainly interested in positive or almost positive global scores, it is a straightforward consequence that only alignments characterized by remarkable values of I(X, Y) will emerge.

There are therefore essentially three cases of biological interest, which we can now analyze in terms of the correspondence between mathematical and biological meaning of the terms.

Case 1. The joint observed frequencies F_XY are typical, that is, they are very close to the target frequencies, F_XY ≈ P_AB. In this case, D(F_XY//P_AB) ≈ 0 and also D(F//P) ≈ 0. (Recall that the concept of "typicality" always refers to the adherence of the various probability distributions to that of the protein model associated to the database BLOCKS.)

Case 2. The joint observed frequencies F_XY are not typical (F_XY ≠ P_AB), but the marginals are typical (F_X ≈ P, F_Y ≈ P). In this case, D(F_XY//P_AB) ≫ 0, but D(F//P) ≈ 0.

Case 3. Both the joint observed F_XY and the marginals F_X, F_Y are not typical, that is, F_XY ≠ P_AB, F_X ≠ P, F_Y ≠ P. In this case, D(F_XY//P_AB) ≫ 0, but also D(F//P) ≫ 0.

Case 1 is straightforward; two similar protein sequences with a typical background amino acid distribution, and amino acids paired in a way that complies with the protein model implicit in BLOCKS, result in a high score. This is frequently the case for two firmly correlated sequences, belonging to the same family of proteins with standard amino acid content, associated with organisms that diverged only recently.
Case 2 is rather more interesting; the amino acid distribution is close to the background distribution (these are "typical" protein sequences) but the score is highly penalized as the observed joint frequencies are different from the target frequencies implicit in the BLOCKS database. This can have different causes. For example, the chosen BLOSUM matrix may be incorrectly matched to the evolutionary distance of the sequences, or the sequences may have diverged under a nonstandard evolutionary process. For high-scoring alignments involving unrelated sequences, the target frequency divergence D(F_XY//P_AB) will tend to be low, due to the second theorem of Karlin and Altschul [8], when the target frequencies associated to the scoring matrix in use are the correct ones for the aligned sequences being analyzed. (Note that in general, choosing the θ parameter associated with the smallest D(F_XY//P_AB) is different from choosing the minimum E-value associated with different θ parameters; recall that E = m * n * 2^{-S}, where S is the score and m and n are the sequence lengths.) This is because any set of target frequencies in any particular amino acid substitution matrix, such as BLOSUM-θ, is tailored to a particular degree of evolutionary divergence between the sequences, generally measured by relative entropy (8) [7], and related with the controlled maximum rate θ of percent identity. So a low D(F_XY//P_AB) ≈ 0 is evidence that the BLOSUM-θ matrix we are using is the correct one, as a precise consequence of a mathematical theorem, while conversely for positive (or almost positive) scoring alignments with large target frequency divergence, the sequences may be related at a different evolutionary distance than that of the substitution matrix in use. Trying several scoring matrices until "something interesting" is found is a common practice in protein sequence alignment [20]. In our case, scanning the θ range could thus lead to a significant decrease in D(F_XY//P_AB), as detected in the BLOSpectrum, and improve the score [7, 12, 13], taking it back to Case 1. This could in turn result in a better capacity to discriminate weakly correlated sequences from those correlated by chance. If, on the other hand, tuning θ does not greatly affect D(F_XY//P_AB), and we are comparing typical sequences (low background frequency divergence) with an appropriate θ parameter, the large target frequency divergence indicates that some nonstandard evolutionary process (regarding the substitution of amino acids) is at work. This cannot adequately be captured by the standard BLOCKS database and BLOSUM substitution matrices. Under these circumstances, Case 2 can never lead to high scores, due to the penalization of the target frequency divergence. We are here likely in the grey area of weakly correlated sequences with a very old common ancestor, or of portions of proteins with strong structural properties that do not require the conservation of the entire sequence. Note that unfortunately we are not able to assess the statistical significance when our method finds a suspected concealed correlation; however, the method still gives us useful information that helps guide our judgment on the possible existence of such correlation, which needs to be further investigated in depth, exploiting other biological information such as 3D structure and biological function.

Case 3 accounts for the situation in which we have two nontypical sequences, with high values of both target and background frequency divergence. This applies, for example, to some families of antimicrobial peptides, that are unusually rich in certain amino acids (such as Pro and Arg, Gly, or Trp residues). This means that the high penalty arising from the subtracted D(F_XY//P_AB) is (at least partially) compensated by the positive D(F_X//P_A) and D(F_Y//P_B), and the global score does not collapse to negative values, even if it is usually low. In effect, the background frequency divergence acts as a compensation factor that prevents excessive penalties for those sequences which, even though related by nonstandard amino acid substitutions, also have a nontypical background distribution of the amino acids inside the sequences themselves. In other words, the nontypicality of F_XY is (at least in part) forced by the anomalous background frequencies of the amino acids. This compensation is welcome, since it avoids missing biologically related sequences pertaining to nontypical protein families, and mathematically corroborates the robustness of the BLOSUM scoring method.

The problem of evaluating the best method for scoring nonstandard sequences has been recently tackled by Yu et al. [11, 21], who showed that standard substitution matrices are not truly appropriate in this case, and developed a method for obtaining compositionally adjusted matrices. In general, when background frequencies differ markedly from those implicit in the substitution matrix (i.e., the background frequency divergence is high) is one case when using a standard matrix is nonoptimal. Another is when the background frequencies vary, and the scale factor λ = (log(p(i, j)/p(i)p(j)))/s(i, j) appropriate for normalizing nominal scores varies as well [8]. If the real λ is lower than the "standard" one, then the uncorrected nominal score can appear much too high [19, 22]. Our approach offers a different perspective to the problem, that is, the possibility of gaining insight about biological sequence correlation directly from the BLOSUM score. Moreover, the background frequency divergence components of the BLOSpectrum indicate whether compositionally adjusted matrices could be useful in the case under inspection. Since [21] illustrates three "criteria for invoking compositional adjustment" (length ratio, compositional distance, and compositional angle), we suggest that the occurrence of "Case 3" in the BLOSUM spectrum could be thought of as an additional fourth criterion. The background divergence of the BLOSpectrum decomposition offers a further rationale to confirm the effectiveness of the procedure proposed by Yu et al., since a large background divergence D(F//P) forces the target frequency divergence D(F_XY//P_AB) to be unnaturally large; compositionally adjusted matrices, which minimize background frequency divergence, tend to remove this effect, leaving it free to assume the value associated to the (correct degree of evolutionary) divergence between the sequences under inspection.

As a consequence of the three cases discussed above, we can suggest the following procedure for analyzing the score obtained from an alignment between two given sequences of the same length, or resulting from a BLAST or FASTA (gapped or ungapped) database search.

Scoring analysis procedure

(1) Given the two sequences, evaluate the components of (11) by inserting the sequences in the available software to obtain the BLOSpectrum (http://bioinf.dimi.uniud.it/software/software/blosumapplet).
(2) Evaluate the target frequency divergence D(F_XY//P_AB) for each θ.
(3) Choose the θ value that minimizes D(F_XY//P_AB).
(4) Determine if the alignment falls in Cases 1, 2, or 3 as described.
(5) If the alignment falls in Case 1, we have two strictly correlated proteins.
(6) If, even after tuning θ, the alignment falls in Case 2 (D(F_XY//P_AB) is high, but D(F//P) is low), then we may have a concealed or weak correlation between the sequences.
(7) If the alignment falls in Case 3 (both D(F_XY//P_AB) and D(F//P) are high), we may have correlated sequences belonging to a nontypical family. In this case, the use of compositionally adjusted matrices may provide a sharper score [11, 21].
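A minimal sketch of this procedure is given below, assuming a dictionary mapping each available θ to its BLOCKS target and background frequency arrays, and reusing the blospectrum() helper sketched in Section 2; the case assignment uses the rule-of-thumb thresholds summarized in Table 1 below.

def analyze_alignment(x, y, blocks_frequencies):
    """Pick the theta minimizing the target frequency divergence and assign Case 1, 2, or 3."""
    best_theta, best = None, None
    for theta, (target, background) in blocks_frequencies.items():
        spec = blospectrum(x, y, target, background)
        if best is None or spec["D(F_XY//P_AB)"] < best["D(F_XY//P_AB)"]:
            best_theta, best = theta, spec
    d_joint = best["D(F_XY//P_AB)"]
    d_back = max(best["D(F_X//P)"], best["D(F_Y//P)"])
    if d_joint < 1.1:
        case = 1        # joint frequencies close to the BLOCKS target frequencies
    elif d_back < 0.3:
        case = 2        # typical composition but atypical pairing: possible weak or concealed correlation
    else:
        case = 3        # atypical composition as well: compositional adjustment may help
    return best_theta, case, best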

In analyzing the parameters that compose the BLOSpectrum, so as to decide among Cases 1, 2, and 3, we find it useful to use an indicative, if somewhat arbitrary, set of guidelines, as summarized in Table 1. We assign a range of values for each parameter (tag L = Low, tag M = Medium, tag H = High). These values have been derived from a "rule of thumb" approach when analyzing the results of the experiments described in the following sections; but obviously they need to be tuned as soon as new experimental evidence will be available.

Table 1: Rule of thumb guidelines to decide among low (L), medium (M), and high (H) values of the parameters.
Parameter          L       M         H
I(X, Y)            <0.9    0.9-1.1   >1.1
D(F_XY//P_AB)      <1.1    1.1-1.5   >1.5
D(F//P)            <0.3    0.3-0.7   >0.7

The final consideration is that, when comparing biologically related sequences, one has to choose the correct scoring matrix, if necessary by means of a compositional adjustment. If, as a result, background and target frequency divergences have low values, the mutual information or sequence convergence I(X, Y) remains as the effective parameter that measures protein similarity. If, after considering the above possibilities, one still observes a residual persistence of the target frequency divergence, then two weakly correlated sequences are presumably identified, that derived from a common remote ancestor after several events of substitution.

3.2. Practical implementation of the method

As stated in the Introduction, we recall that the analysis based on the BLOSpectrum evaluation is not aimed at increasing the performance of available alignment algorithms, nor at suggesting new methods for inserting gaps so as to maximize the score. The BLOSpectrum only gives added information of biological and operative interest, but only once two sequences have already been aligned using current algorithms, such as BLAST, BLAST2, FASTA, or others. The ultimate biological goal of the method is that of revealing the possible presence of a weak or concealed correlation for alignments resulting in a relatively low BLOSUM score, that might otherwise be neglected. Another operative merit is that the knowledge of the target frequency divergence helps identify the best scoring matrix, that is, the one tailored for the correct evolutionary distance.

In order to perform automatic computation of the four terms of (11), we have developed the software BLOSpectrum, freely available at http://bioinf.dimi.uniud.it/software/software/blosumapplet. Given two sequences with the same length, with or without gaps, the software derives the vectors F_X, F_Y, and F_XY by computing the relative frequencies f(i) = n(i)/n, f(j) = n(j)/n, and f(i, j) = n(i, j)/n, that is, the relative frequency of amino acid i observed in sequence X, of amino acid j observed in sequence Y, and the relative frequency of the pair i, j. The vectors P_AB = {p(i, j)} and P = {p(i)}, needed to decompose the score, are those derived from the BLOCKS database and used by S. Henikoff and J. G. Henikoff [9] to extract the score entries of the 20 * 20 BLOSUM matrices (35, 40, 50, 62, 80, 100); they have been kindly provided by these authors on request. The software computes also the exact BLOSUM normalized score, that is, the algebraic sum of the four terms, together with the rough BLOSUM score, directly obtained by summing up the integer values of the BLOSUM-θ matrix. As already observed in Section 2.2, the pairs containing a gap, such as (-, j) or (i, -), are not considered in the computation, since their contribution to the score is zero when one assumes the independence between a gap and the paired amino acid.

There are essentially two ways for employing the BLOSpectrum. The first one is that of performing a BLAST or FASTA search inside a database, given a query sequence. The result is a set of h possible matches, ordered by score, in which the query sequence and the corresponding match are paired for a length that is respectively n_1, n_2, ..., n_h. The user can extract all matches of interest within the output set and compare them with the query sequence by using the BLOSpectrum software. The second one is that of comparing two assigned sequences with a program such as BLAST2, so as to find the best gapped alignment. Also in this case we can use the BLOSpectrum on the two portions of the query sequences that are paired by BLAST2 and that have the same length n. It is obvious that the next step would be that of integrating the BLOSpectrum tool inside a widely used database search engine.

Even if the correct way for using the BLOSpectrum software is that of supplying it with two sequences of the same length, derived from preceding queries of BLAST, BLAST2, FASTA, or others, the BLOSpectrum applet accepts also two sequences of different length n and m > n; in this case the program merely computes the scores associated to all possible alignments of n over m, showing the highest one, but it does not insert gaps.

3.3. Biological examples

To illustrate the behavior of the BLOSpectrum under the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank http://www.expasy.uniprot.org (see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related, even if aligning with rather modest scores.

The first set contains sequences from the related Hepatocyte nuclear factor 4α (HNF4-α), Hepatocyte nuclear factor 6 (HNF6), and GAT binding protein 1 (globin transcription factor 1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.

The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree constructed according to the multiple alignment for all members of this family [23] is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes.

Table 2: The three sets of protein families used in testing the BLOSpectrum. The UniProt ID is furnished (with the sequence length). For the defensins and Pro-rich peptides, only the mature peptide sequences were used in alignments. In the following tables, sequences are indicated by the corresponding numbers 1–4.

Family (sequences 1–4)

First set
  HNF4-α: 1 P41235 (465), H. sapiens; 2 P49698 (465), Mus musculus; 3 P22449 (465), Rattus norv.
  HNF6: 1 Q9UBC0 (465), H. sapiens; 2 O08755 (465), Mus musculus; 3 P70512 (465), Rattus norv.
  GAT1: 1 P15976 (413), H. sapiens; 2 P17679 (413), Mus musculus; 3 P43429 (413), Rattus norv.
Second set
  Serine proteases: P07477 (247), H. sapiens trypsin; P17538 (263), H. sapiens chymotrypsin; Q9UNI1 (258), H. sapiens elastase 1; P00775 (259), Streptomyces griseus trypsin; P35049 (248), Fusarium oxysporum trypsin
  Hemoglobins: P02232 (92), Vicia faba leghemoglobin I; S06134 (92), P. chilensis hemoglobin I
  Transposons: A26491 (41), D. mauritiana mariner transposon; NP493808 (41), C. elegans transposon TC1
  Beta defensins: 1 BD01 (36), H. sapiens; 2 BD02 (41), H. sapiens; 3 BD03 (39), H. sapiens; 4 BD04 (50), H. sapiens
Third set
  Pro/Arg-rich peptides: 1 BCT5 (43), bovin; 2 BCT7 (59), bovin; 3 PR39PRC (42), pig; 4 PF (82), pig

3.3. Biological examples

To illustrate the behavior of the BLOSpectrum under the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank http://www.expasy.uniprot.org (see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related, even if aligning with rather modest scores.

The first set contains sequences from the related Hepatocyte nuclear factor 4α (HNF4-α), Hepatocyte nuclear factor 6 (HNF6), and GAT binding protein 1 (globin transcription factor 1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.

The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree, constructed according to the multiple alignment for all members of this family [23], is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes. Another example pertaining to weakly correlated sequences that show distant relationships is the one originally used by Altschul [7] to compare PAM-250 with PAM-120 matrices, that is, the 92-residue Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I, characterized by a very poor percent identity (about 15%), with pairs of identical amino acid residues that are spread fairly evenly along the alignment. A further example considers the sequences associated to the Drosophila mauritiana mariner transposon and the Caenorhabditis elegans transposon TC1, with a length of 41 residues, used by S. Henikoff and J. G. Henikoff [9] to test the performance of their BLOSUM scoring matrices. The last example derives from human beta defensins. This family of host defense peptides has arisen by gene duplication followed by rapid divergence driven by positive selection, a common occurrence in proteins involved in immunity [24]. They are characterized by the presence of six highly conserved cysteine residues, which determines folding to a conserved tertiary structure, while the rest of the sequence seems to have been relatively free of structural constraints during evolution [25, 26]. Even if clearly related, these peptides have a percentage sequence identity less than 40%. All these families represent the case of nonstandard target frequencies, while the amino acid frequency distribution does not appear, at first sight, to be too abnormal. The sequence comparison scores are modest at best, even though members are known to be biologically correlated.

The third set contains sequences that are expected to fall in Case 3. These are members of the Bactenecins family of linear antimicrobial peptides, with an unusually high content of Pro and Arg residues, and an identity of about 35% [27], representing sequences with a highly atypical amino acid frequency distribution.

If we analyze the alignments inside all these sets of protein families, we effectively find examples for each of the three cases illustrated in the preceding section. The alignments of human and mouse HNF4-α sequences (as illustrated in Table 3), and the BLOSpectrum of HNF4-α, HNF6, and GAT1 sequence comparisons (see Figure 1), are clear examples of Case 1, with high correlation between all respective couples of sequences and a target frequency divergence that is strongly sensitive to the BLOSUM-θ parameter, so we stop the scoring procedure at step 5.

Table 3: BLOSUM decomposition for intrafamily alignments for proteins of the first set.

HNF4-α human versus HNF4-α mouse

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  3.939  0.929  0.050  0.057  3.118  2833  95.9
80   3.939  1.297  0.046  0.053  2.741  2537  95.9
62   3.939  1.582  0.046  0.052  2.456  2330  95.9
50   3.939  1.861  0.043  0.050  2.171  3003  95.9
40   3.939  2.226  0.039  0.047  1.800  3381  95.9
35   3.939  2.414  0.036  0.044  1.605  2982  95.9

HNF4-α (BLOSUM-100)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  3.955  0.930  0.050  0.056  3.132  2846  96.3
2–3  4.141  1.008  0.057  0.056  3.246  2952  99.5

[Figure 1 (bar charts), first set: HNF4-α human vs. HNF4-α mouse, HNF6 human vs. HNF6 mouse, and GAT1 human vs. GAT1 mouse, at BLOSUM-100; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 1: BLOSpectrum for sequences of the first set.

For example, the HNF4-α alignment has a target frequency divergence that varies from 2.41 to 0.93 when passing from BLOSUM-35 (a matrix tailored for a wrong evolutionary distance) to BLOSUM-100 (the matrix tailored for a correct evolutionary distance), so that minimizing the frequency divergence (rows in italic) helps identify the best θ parameter for comparing the analyzed sequences; it corresponds to θ = 100, coherent with the high percent identity (86–96%). In this case, the compensation factor D(FX//P) + D(FY//P) corresponding to background frequency divergence is almost zero, since observed background and target frequencies are very near to those implicit in the BLOCKS database, leading to the conclusion that these are typical sequences that correspond closely to the protein model associated with BLOCKS. The global (normalized) score is high (3.12 in the HNF4-α example), due to a high degree of stochastic similarity (I(X, Y) ≈ 3.94), which is not greatly penalized. Other members of the HNF4-α, HNF6, or GAT1 families behave similarly (see Figure 1).

The situation changes considerably when we compute the BLOSUM decomposition for the different examples listed for the second set, for example, comparing human trypsin, elastase, and chymotrypsin to one another, or comparing these enzymes in distantly related species, such as human, Streptomyces griseus (a bacterium), and Fusarium oxysporum (a fungus). Following the Scoring Procedure, and starting with ungapped alignments, we have a case of high target frequency divergence, with a low level of background frequency divergence, corresponding to the situation outlined in step 6.

[Figure 2 (bar charts), second set, ungapped and gapped panels: chymotrypsin human vs. S. griseus trypsin; Vicia faba leghemoglobin I vs. Paracaudina chilensis hemoglobin I; D. mauritiana mariner transposon vs. C. elegans transposon TC1; BD01 human vs. BD02 human; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 2: BLOSpectrum for (ungapped and gapped) sequences of the second set.

However, as soon as we use gapped alignments, we observe a remarkable increment in the score, due to a reduced penalization factor associated to target frequency divergence (see Figure 2, first column, and Table 4). This is the obvious case when the bad matching is a consequence of deletions and/or insertions that occurred during evolution, which is resolved once gaps are introduced, so that the sequence comparison falls into Case 1.

A different situation occurs when aligning Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I. D(FXY//PAB) minimization (step 3) leads to a narrower spread of values (2.48–2.07) when passing from BLOSUM-100 to BLOSUM-35, with a minimum (2.05) at θ = 40, which is consequently the best parameter to compare the sequences. The global score (0.24) is rather low, despite these sequences being clearly evolutionarily related. In fact, the BLOSpectrum shows that the stochastic correlation I(X, Y) is quite high (1.84), but is killed by the heavy penalty derived from the negative contribution of D(FXY//PAB), while the compensation factors due to background frequency divergence are less significant (0.25 and 0.19, resp.), as the sequences are typical proteins under the BLOCKS model. Furthermore, extending the size of the alignment or including gaps does not significantly alter the spectrum (see Table 5 and Figure 2, second column), so we leave the Scoring Procedure at step 6; we simply have weakly related sequences.

The Drosophila mauritiana and Caenorhabditis elegans transposons provide a similar example, with only a weak minimization for θ = 62 (D(FXY//PAB) = 2.80). The other BLOSpectrum components are respectively I(X, Y) = 2.34, D(FX//P) = 0.53, and D(FY//P) = 0.72. The sequences thus have a high stochastic correlation, but the target frequencies are rather atypical, so that the divergence entirely kills the contribution derived from mutual information, and if the score is weakly positive (0.79) it is only due to the terms associated to background frequency divergence. In fact, the biological relationship of these atypical sequence fragments is effectively captured only due to the presence of this compensation factor.

Table 4: BLOSUM decomposition for ungapped and gapped serine proteases.

Serine proteases

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity

human chymotrypsin versus Streptomyces griseus trypsin (ungapped)
100  1.014  2.023  0.134  0.132  −0.742  −398  11.5
80   1.014  1.739  0.141  0.137  −0.446  −230  11.5
62   1.014  1.570  0.146  0.145  −0.264  −121  11.5
50   1.014  1.437  0.134  0.141  −0.147  −120  11.5
40   1.014  1.321  0.132  0.138  −0.035  −42   11.5
35   1.014  1.305  0.136  0.145  −0.008  −7    11.5

human chymotrypsin versus Streptomyces griseus trypsin (gapped)
100  1.645  1.213  0.164  0.156  0.753  326  35.9
80   1.645  1.138  0.170  0.164  0.842  382  35.9
62   1.645  1.149  0.178  0.171  0.845  416  35.9
50   1.645  1.176  0.171  0.159  0.800  557  35.9
40   1.645  1.270  0.170  0.158  0.703  640  35.9
35   1.645  1.346  0.177  0.163  0.640  584  35.9

In this case, a gapped alignment including a wider portion of the sequences actually reduces the background frequency divergences to remarkably lower values (0.237 and 0.226), neutralizing the compensation (see Table 6 and Figure 2, third column).

In both the preceding examples, we are in the situation where the parameter θ of the substitution matrix is appropriate for the sequence divergence of the sequences in question, the background frequency divergence is small, but the target frequency divergence is still large: this is a signal that we are dealing with weakly related sequences, characterized by several events of substitution that occurred during evolution. It is usually difficult to capture these weakly related sequences using standard scoring matrices, such as BLOSUM or PAM, since the common ancestor could be very old. As a matter of fact, this difficulty was used to respectively test the PAM-250 versus PAM-120 matrices (Altschul [7], hemoglobins) and the BLOSUM-62 versus PAM-160 matrices (S. Henikoff and J. G. Henikoff [9], transposons). Here, we cannot remove the cause of mismatching and we leave the Scoring Procedure at step 6.

The last example from this group derives from human beta defensins, and even if these sequences are known to be evolutionarily related, some couples actually show a negative normalized score (1–4, 2–3, 2–4, see Table 7 and Figure 2, last column), suggesting that they are not. In fact, a normal BLOSUM-62 BLAST search using the human beta defensin 1 sequence picks up several homologues from other mammalian species, whereas those with the three paralogous human sequences are below the cutoff score. BLOSpectrum analysis reveals a high stochastic correlation I(X, Y) (2.00–3.03), neutralized by an even higher penalty factor due to the target frequency divergence (3.28–3.56), partly compensated by the substantial background frequency divergences (0.54–0.79), and with little effect of the BLOSUM-θ parameter, or of introducing gaps. These are fairly typical proteins, whose score is heavily penalized by a remarkable target frequency divergence. Only the compensation factor induced by background frequency divergence can, in some cases, sustain the score over positive values, allowing the identification of a biological correlation that would otherwise have been lost.

The third set of sequences are Pro/Arg-rich antimicrobial peptides of the Bactenecins family, with about 35% identity [27, 28]. The obtained scores are clearly positive, despite the poor stochastic correlation (0.40–0.60, see Table 8 and Figure 3). The penalty factor due to target frequency divergence is remarkably high in this case (4.15–4.49) and should drag the score to quite negative values, but the compensation factor due to background frequency divergence is even greater and fully compensates it. We thus leave the scoring procedure at step 7. This is the typical case of poorly conserved sequences with singular key structural aspects that are however highly preserved (cf. the pattern of proline and arginine residues). As the background frequencies FX and FY are far from the standard background P associated with the BLOCKS database, the evaluation of a more realistic score for these sequences passes through the use of a compositionally adjusted BLOSUM matrix [11]. Such matrices are built in such a way as to reduce background frequency divergence, so as to eliminate the portion of target divergence that is induced by it. In this way, the residual target divergence accounts only for effective evolutionary divergence between sequences.

As a final example, we obtained BLOSUM spectra also for sequences from obviously uncorrelated families. The results are reported in Table 9 and Figure 4.

Table 5: BLOSUM decomposition for ungapped and gapped hemoglobins.

P02232: 49 SAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKE 115

++ + S ++ AHA +V ++ + +L + L H V H+ + + L++ ++

S06134: 61 ASQLRSSRQMQAHAIRVSSIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQA 127

P02232: 116 ASGDKWSEELSAAWEVAYDGLATAI 140

G ++E+ AW A+

S06134: 128 ELGSDFNEKTRDAWAKAFSIVQAVL 152

Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (ungapped)

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  1.839  2.478  0.264  0.207  −0.166  −31  15.2
80   1.839  2.240  0.264  0.199  0.063   12   15.2
62   1.839  2.128  0.260  0.192  0.163   35   15.2
50   1.839  2.077  0.255  0.185  0.203   54   15.2
40   1.839  2.051  0.255  0.194  0.237   83   15.2
35   1.839  2.070  0.263  0.202  0.235   82   15.2

Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (gapped)
100  1.597  1.962  0.166  0.172  −0.026  −10  18.1
80   1.597  1.759  0.161  0.163  0.162   40   18.1
62   1.597  1.661  0.154  0.153  0.243   65   18.1
50   1.597  1.618  0.145  0.145  0.268   104  18.1
40   1.597  1.606  0.145  0.155  0.291   152  18.1
35   1.597  1.623  0.154  0.163  0.283   148  18.1

P02232: 2 FTEKQEALVNSSSQLFKQNPSNYSVLFYTIILQKAPTAKAMFSFLK--DSAGVVDSPKLGAHAEKVF 68

T Q+ +V + +N +++ + I P+A+ F + ++ + S ++ AHA +V

S06134: 12 LTLAQKKIVRKTWHQLMRNKTSFVTDVFIRIFAYDPSAQNKFPQMAGMSASQLRSSRQMQAHAIRVS 78

P02232: 69 GMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKEASGDKWSEELSAAWEVAY 135

++ + +L + L H V H+ + + L++ ++ G ++E+ AW A+

S06134: 79 SIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQAELGSDFNEKTRDAWAKAF 145

In these cases we generally obtain a poor stochastic correlation I(X, Y) and a high value for the penalty factor D(FXY//PAB), leading to a globally negative score, which is not compensated by the background frequency divergences. Note that in two cases, a mildly positive score could suggest a distant relationship. Analysis of the BLOSpectrum helps in evaluating this possibility. The PF12 versus GAT1 alignment is simply a case of overcompensation for a nontypical sequence (the background frequency divergence for one of the sequences is very high). In the second case, however, the I(X, Y) value for the BD04 versus GAT1 human alignment is surprisingly quite high, suggesting that a closer look might be appropriate.

4. CONCLUSIONS

A standard use of scoring substitution matrices, such as BLOSUM-θ, is often insufficient for discovering concealed correlations between weakly related sequences. Among other causes, this can derive from (i) the introduction of gaps during evolution, (ii) the use of a BLOSUM-θ matrix tailored for a different evolutionary distance than that pertaining to the aligned sequences, and/or (iii) the use of standard matrices for comparison of proteins with nonstandard background frequency distributions of amino acids.

Table 6: BLOSUM decomposition for ungapped and gapped transposons.

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIEH 283

+F DN P HT+ VR + + +L + SPDL P +

A26491: 245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDY 285

Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (ungapped)

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  2.339  2.926  0.740  0.531  0.685  55   34.1
80   2.339  2.849  0.733  0.531  0.754  60   34.1
62   2.339  2.800  0.724  0.526  0.789  67   34.1
50   2.339  2.831  0.721  0.516  0.746  90   34.1
40   2.339  2.935  0.716  0.509  0.630  104  34.1
35   2.339  2.969  0.714  0.505  0.590  92   34.1

Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (gapped)
100  1.991  2.244  0.244  0.243  0.235  40   25.0
80   1.991  2.110  0.246  0.234  0.362  67   25.0
62   1.991  2.021  0.245  0.227  0.443  91   25.0
50   1.991  2.009  0.237  0.226  0.445  123  25.0
40   1.991  2.043  0.227  0.228  0.404  152  25.0
35   1.991  2.066  0.226  0.229  0.381  144  25.0

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIE-HLWEELERRLGGIRASNAD 301

+F DN P HT+ VR + + +L + SPDL P + HL+ + L R + +

A26491: 245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDYHLFASMGHALAEQRFDSYE 304

NP_493808: 302 AKFNQLENAWKAIPMSVIHKLIDSMPRRCQAVIDANG 338

+ L++A +I +PR++++G

A2649: 305 SVKKWLDEWFAAKDDEFYWRGIHKLPERWEKCVASDG 341

All these well-known effects can be better evidenced and quantified by the decomposition of the BLOSUM score (BLOSpectrum) according to (11). This equation highlights the core of the biological correlation measured by the BLOSUM score, that is, the mutual information I(X, Y), or sequence convergence. If gaps are taken into account (such as in BLAST), the correct θ parameter is chosen with the help of the BLOSpectrum, and the background frequencies of the sequences are near to the standard ones, then the global score is given by the sequence convergence plus a residual penalization factor due to target frequency divergence. This residual value implicitly takes into account that numerous substitution events may have occurred during sequence evolution, and so is a coherent measure of the biological relationship and distance between the sequences. If the background frequencies of the sequences are not standard, then we have shown that the BLOSUM scoring method has an in-built capacity to correct for anomalies in amino acid distributions, using background frequency divergence as a compensation factor. One can also choose to compositionally adjust the matrix, so as to reduce the compensation factor together with the component of target frequency divergence that is induced by a bad background frequency distribution. This systematic method is illustrated in the scoring analysis procedure of Section 2.

Our decomposition becomes important when we consider sequences for which the BLOSUM score indicates a weak or no correlation. A critical evaluation of the BLOSpectrum components can help corroborate or identify an underlying biological correlation and whether the matrices being used are the most appropriate ones for measuring it. In other words, when considering the grey area of BLOSUM scores with a marginal significance, it could help to decide whether an evolutionary relationship actually exists.

Table 7: The BLOSUM terms for beta defensins.

BD01 human versus BD02 human
BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  3.030  3.566  0.564  0.618  0.646  45   41.6
80   3.030  3.453  0.568  0.623  0.768  58   41.6
62   3.030  3.438  0.604  0.652  0.849  65   41.6
50   3.030  3.418  0.615  0.663  0.891  99   41.6
40   3.030  3.378  0.577  0.626  0.855  129  41.6
35   3.030  3.320  0.539  0.588  0.837  120  41.6

Human beta defensins (BLOSUM-35)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  2.731  3.325  0.539  0.751  0.697   101  30.5
1–4  2.532  3.658  0.539  0.728  0.141   22   16.6
2–3  2.009  3.466  0.794  0.616  −0.045  −10  10.2
2–4  2.334  3.522  0.609  0.568  −0.009  0    12.1
3–4  2.122  3.286  0.794  0.655  0.286   44   20.5

Table 8: The BLOSUM terms for Pro/Arg-rich peptides.

BCT5 bovin versus BCT7 bovin

BLOSUM  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
100  0.424  4.935  2.329  2.460  0.279  28  34.8
80   0.424  4.724  2.317  2.449  0.467  42  34.8
62   0.424  4.637  2.301  2.430  0.518  37  34.8
50   0.424  4.533  2.264  2.389  0.544  68  34.8
40   0.424  4.407  2.221  2.338  0.576  97  34.8
35   0.424  4.368  2.199  2.301  0.556  98  34.8

Pro/Arg-rich peptides (BLOSUM-35)
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
1–3  0.516  4.434  2.095  2.205  0.382  63   30.9
1–4  0.446  4.491  2.199  2.488  0.643  110  39.5
2–3  0.584  4.156  2.095  2.257  0.780  133  47.6
2–4  0.406  4.350  2.256  2.251  0.563  134  37.2
3–4  0.609  4.260  2.095  2.347  0.792  132  45.2

We provide online software at http://bioinf.dimi.uniud.it/software/software/blosumapplet which integrates a BLOSpectrum histogram with the score obtained by a classical BLAST engine working on two input sequences, allowing an immediate visual analysis of the score components. The systematic use of the BLOSpectrum parameters to permit a more sensitive filtering of scores inside a BLAST or similar engine could be the logical next operative step. We have provided several biological examples indicating the potential of our method, but it is clear that it needs massive biological experimentation to completely test its effective usefulness.

APPENDIX

Proof of (11). By multiplying inside the log function of (7) by f(i, j)/f(i, j) and by f(i)f(j)/(f(i)f(j)) and rearranging the terms, we obtain

S_N(X, Y) = \sum_{i,j} f(i,j) \log \frac{p(i,j)\, f(i,j)\, f(i)\, f(j)}{p(i)\, p(j)\, f(i,j)\, f(i)\, f(j)}
          = \sum_{i,j} f(i,j) \log \frac{f(i,j)}{f(i) f(j)} - \sum_{i,j} f(i,j) \log \frac{f(i,j)}{p(i,j)} + \sum_{i,j} f(i,j) \log \frac{f(i) f(j)}{p(i) p(j)}
          = I(X, Y) - D(F_{XY}//P_{AB}) + \sum_{i,j} f(i,j) \log \frac{f(i)}{p(i)} + \sum_{i,j} f(i,j) \log \frac{f(j)}{p(j)}
          = I(X, Y) - D(F_{XY}//P_{AB}) + D(F_X//P_A) + D(F_Y//P_B).          (A.1)

[Figure 3 (bar charts), third set: BCT5 bovin vs. BCT7 bovin, BCT5 bovin vs. PR39PRC pig, and BCT7 bovin vs. PR39PRC pig, at BLOSUM-35; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 3: BLOSpectrum for sequences of the third set.

Table 9: Some examples of BLOSUM-35 terms for sequences belonging to noncorrelated families.

BLOSUM-35
Sequences  I(X, Y)  D(FXY//PAB)  D(FX//P)  D(FY//P)  SN(X, Y)  Score  % Identity
HNF4-α human versus HNF6 human
1-1  0.578  0.986  0.036  0.205  −0.165  −312  5.37
HNF4-α human versus GAT1 human
1-1  0.712  1.033  0.038  0.193  −0.088  −144  8.71
HNF6 human versus GAT1 human
1-1  0.622  1.122  0.230  0.193  −0.076  −143  8.47
BD04 human versus BCT7 bovin
4–2  1.010  3.887  0.460  2.220  −0.195  −36   10.0
PF12 pig versus GAT1 human
4–1  0.686  3.486  2.182  0.709  0.091   24    18.2
BD04 human versus GAT1 human
4–1  2.243  3.033  0.460  0.465  0.136   25    12.0

A fuller understanding of the mathematical tools used in Section 2 requires some definitions and mathematical properties pertaining to ID and MI; they are summarized as follows.

Let us start by considering some probability distributions [10] over an alphabet A with K symbols, for example P = {p1, p2, ..., pK}, Q = {q1, q2, ..., qK}, and so on. In our context, K = 20, as there are 20 amino acids, and the alphabet letters correspond to the 1-letter amino acid standard coding (D = Asp, E = Glu, W = Trp, etc.). If we imagine the space of all possible K-dimensional probability distributions, it is natural to ask what is the "distance" from P to Q (or vice versa). The most popular (pseudo-)distance is the informational divergence D(P//Q),

D(P//Q) \triangleq \sum_{i=1}^{K} p(i) \log \frac{p(i)}{q(i)},          (A.2)

introduced by Kullback in 1954 in the context of statistics [29]; here p(i) ≥ 0 and q(i) > 0.
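For readers who wish to experiment with these quantities, the following small sketch (ours, not part of the BLOSpectrum software) computes the informational divergence of (A.2) and the divergence of a joint distribution from the product of its marginals (the mutual information introduced below), from plain Python dictionaries; distributions are assumed to be normalized, with q(i) > 0 wherever p(i) > 0.

```python
import math

def divergence(p, q, base=2.0):
    """Informational divergence D(P//Q) of (A.2); p and q map symbols to probabilities."""
    return sum(pi * math.log(pi / q[i], base) for i, pi in p.items() if pi > 0)

def mutual_information(p_xy, base=2.0):
    """D(P_XY // P_X P_Y): divergence of the joint distribution from the product of
    its marginals, obtained by summing the joint over the other variable."""
    p_x, p_y = {}, {}
    for (i, j), pij in p_xy.items():
        p_x[i] = p_x.get(i, 0.0) + pij
        p_y[j] = p_y.get(j, 0.0) + pij
    product = {(i, j): p_x[i] * p_y[j] for (i, j) in p_xy}
    return divergence(p_xy, product, base)
```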

[Figure 4 (bar charts), noncorrelated sequences: HNF4 human vs. HNF6 human, HNF4 human vs. GAT1 human, HNF6 human vs. GAT1 human, BD04 human vs. BCT7 bovin, PF12 pig vs. GAT1 human, and BD04 human vs. GAT1 human, at BLOSUM-35; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]

Figure 4: BLOSpectrum for noncorrelated sequences.

It is easy to verify [18] that the informational divergence (ID) is nonnegative, and it is equal to 0 if and only if P is coincident with Q (P ≡ Q). Furthermore, the ID is not boundable, since D(P//Q) → +∞ if an i exists such that q(i) → 0. All this can be summarized in the following way:

0 \le D(P//Q) \le +\infty \quad (= 0 \text{ when } P \equiv Q; \; = +\infty \text{ when there exists } i \text{ such that } q(i) = 0).          (A.3)

Note that the ID is the sum of positive and negative terms, and the fact that the average is always greater than zero is not obvious (it is a consequence of the convexity property of the logarithm). Since D(P//Q) = 0 if and only if P ≡ Q, this allows us to interpret the ID as a measure of (pseudo)distance between probability distributions. It is only "pseudo" (from the mathematical point of view) since the concept of "distance" is well defined in mathematics, and requires also symmetry between the variables and the validity of the so-called triangular inequality. But the ID lacks both these last two properties, since, in general, D(P//Q) ≠ D(Q//P) (it is asymmetric) and, if R is a third probability distribution, we are not sure that D(P//R) + D(R//Q) is greater than D(P//Q) (the triangular inequality does not hold). We underline that such a distance is not symmetric (and so the order in which P and Q are specified does matter), that is, it is a distance "from" rather than a distance "between."

Suppose now that PX = {pX(1), pX(2), ..., pX(K)} and PY = {pY(1), pY(2), ..., pY(K)} are the probability distributions associated to the (random) variables X and Y, which take their values in the same alphabet A. Here, pX(i) = Pr{X = i} means the probability that the variable X assumes the value i. In our framework, X and Y are two protein sequences of the same length n, and pX(2) = Pr{X = 2} = 0.09 (e.g.) is interpreted as the relative frequency of the second amino acid of the alphabet A; so, the overall occurrence of the 2nd amino acid in sequence X is equal to 0.09n. In this context, we can introduce also a joint probability distribution associated to the sequences, PXY = {pXY(i, j), i, j ∈ A} = Pr{X = i, Y = j, i, j ∈ A}, where pXY(i, j) corresponds to the relative frequency of finding the amino acids i, j paired in a certain position of the alignment between X and Y. It is well known that Σ_{i,j} pXY(i, j) = 1 (PXY is a probability distribution) and that the sum of the joint probabilities over one variable gives the marginal of the other variable, Σ_j pXY(i, j) = pX(i). For example, given that the ninth and the fifth amino acid in the alphabet are Arginine and Leucine, respectively, pXY(9, 5) = pXY(Arg, Leu) = 0.01 means that the relative frequency of finding Arg in X paired with Leu in Y is equal to 0.01. In practice, we avoid the use of the subscripts, and use the simpler notation p(i) and p(i, j) instead of pX(i) and pXY(i, j).

Since the condition of independence between two variables (protein sequences) X and Y is fixed by the formula pXY(i, j) = pX(i)pY(j) (for each pair i, j ∈ A), then, once assigned a certain PXY, it could be interesting to attempt to evaluate the distance of PXY from the condition of independence between the variables. Making use of the ID (A.2), we need to evaluate the quantity D(PXY//PX PY), that is, the stochastic distance between the joint PXY and the product of the marginals PX PY. If we have independence, then PXY ≡ PX PY, and the divergence equals zero. On the contrary, if it appears that X and Y are tied by a certain degree of dependence, this can be measured by

D(P_{XY}//P_X P_Y) = \sum_{i,j} p(i,j) \log \frac{p(i,j)}{p(i) p(j)} \triangleq I(X, Y) \ge 0.          (A.4)

This quantity is also called the mutual information (or relative entropy) I(X, Y) between the random variables (the protein sequences, in our setting) X and Y. It is symmetric in its variables (I(X, Y) = I(Y, X)) and is always nonnegative, since it is an informational divergence. Note also that MI is upper bounded by the logarithm of the alphabet cardinality, that is, I(X, Y) ≤ log 20 [18]. Moreover, since it equals zero if and only if the joint probability distribution coincides with the product of the marginals, that is, when we have independence between the two variables, we can interpret the mutual information (MI) as a measure of stochastic dependence between X and Y. From another point of view, we can say that independence is equivalent to the situation in which the variables X and Y do not exchange information. So, the meaning of I(X, Y) can be read also as the degree of dependence between the variables, or as the average information exchanged between the same variables. Mutual information is one of the pillars of Shannon information theory, and was introduced in the seminal paper by Shannon [16, 17].

ACKNOWLEDGMENTS

The authors thank Jorja Henikoff, who provided the matrices of joint probability distributions associated to the database BLOCKS, and an anonymous referee of a previous version of this paper, who made several key remarks. This work has been supported by the Italian Ministry of Research, PRIN 2003, FIRB 2003 Grants, by the Istituto Nazionale di Alta Matematica (INdAM), 2003 Grant, and by the Regione Friuli Venezia Giulia (2005 Grants).

REFERENCES

[1] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[2] A. D. McLachlan, "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551," Journal of Molecular Biology, vol. 61, no. 2, pp. 409–424, 1971.
[3] D. Sankoff, "Matching sequences under deletion-insertion constraints," Proceedings of the National Academy of Sciences of the United States of America, vol. 69, no. 1, pp. 4–6, 1972.
[4] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974.
[5] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some biological sequence metrics," Advances in Mathematics, vol. 20, no. 3, pp. 367–387, 1976.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed., vol. 5, supplement 3, pp. 345–352, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[7] S. F. Altschul, "Amino acid substitution matrices from an information theoretic perspective," Journal of Molecular Biology, vol. 219, no. 3, pp. 555–565, 1991.
[8] S. Karlin and S. F. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 6, pp. 2264–2268, 1990.
[9] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[10] W. Feller, An Introduction to Probability and Its Applications, John Wiley & Sons, New York, NY, USA, 1968.
[11] Y.-K. Yu, J. C. Wootton, and S. F. Altschul, "The compositional adjustment of amino acid substitution matrices," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15688–15693, 2003.
[12] S. F. Altschul, "A protein alignment scoring system sensitive at all evolutionary distances," Journal of Molecular Evolution, vol. 36, no. 3, pp. 290–300, 1993.
[13] D. J. States, W. Gish, and S. F. Altschul, "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices," Methods, vol. 3, no. 1, pp. 66–70, 1991.
[14] S. R. Sunyaev, G. A. Bogopolsky, N. V. Oleynikova, P. K. Vlasov, A. V. Finkelstein, and M. A. Roytberg, "From analysis of protein structural alignments toward a novel approach to align protein sequences," Proteins: Structure, Function, and Bioinformatics, vol. 54, no. 3, pp. 569–582, 2004.

[15] M. A. Zachariah, G. E. Crooks, S. R. Holbrook, and S. E. Brenner, "A generalized affine gap model significantly improves protein sequence alignment accuracy," Proteins: Structure, Function, and Bioinformatics, vol. 58, no. 2, pp. 329–338, 2005.
[16] C. E. Shannon, "A mathematical theory of communication—part I," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[17] C. E. Shannon, "A mathematical theory of communication—part II," Bell System Technical Journal, vol. 27, pp. 623–656, 1948.
[18] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, NY, USA, 1981.
[19] A. A. Schäffer, L. Aravind, T. L. Madden, et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Research, vol. 29, no. 14, pp. 2994–3005, 2001.
[20] F. Frommlet, A. Futschik, and M. Bogdan, "On the significance of sequence alignments when using multiple scoring matrices," Bioinformatics, vol. 20, no. 6, pp. 881–887, 2004.
[21] S. F. Altschul, J. C. Wootton, E. M. Gertz, et al., "Protein database searches using compositionally adjusted substitution matrices," FEBS Journal, vol. 272, no. 20, pp. 5101–5109, 2005.
[22] A. A. Schäffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul, "IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, vol. 15, no. 12, pp. 1000–1011, 1999.
[23] W. R. Rypniewski, A. Perrakis, C. E. Vorgias, and K. S. Wilson, "Evolutionary divergence and conservation of trypsin," Protein Engineering, vol. 7, no. 1, pp. 57–64, 1994.
[24] A. L. Hughes, "Evolutionary diversification of the mammalian defensins," Cellular and Molecular Life Sciences, vol. 56, no. 1-2, pp. 94–103, 1999.
[25] F. Bauer, K. Schweimer, E. Klüver, et al., "Structure determination of human and murine β-defensins reveals structural conservation in the absence of significant sequence similarity," Protein Science, vol. 10, no. 12, pp. 2470–2479, 2001.
[26] A. Tossi and L. Sandri, "Molecular diversity in gene-encoded, cationic antimicrobial polypeptides," Current Pharmaceutical Design, vol. 8, no. 9, pp. 743–761, 2002.
[27] R. Gennaro, M. Zanetti, M. Benincasa, E. Podda, and M. Miani, "Pro-rich antimicrobial peptides from animals: structure, biological functions and mechanism of action," Current Pharmaceutical Design, vol. 8, no. 9, pp. 763–778, 2002.
[28] M. E. Selsted, M. J. Novotny, W. L. Morris, Y.-Q. Tang, W. Smith, and J. S. Cullor, "Indolicidin, a novel bactericidal tridecapeptide amide from neutrophils," Journal of Biological Chemistry, vol. 267, no. 7, pp. 4292–4295, 1992.
[29] S. Kullback, Information Theory and Statistics, Dover, Mineola, NY, USA, 1997.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 72936, 14 pages
doi:10.1155/2007/72936

Research Article Aligning Sequences by Minimum Description Length

John S. Conery

Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA

Received 26 February 2007; Revised 6 August 2007; Accepted 16 November 2007

Recommended by Peter Grünwald

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.

Copyright © 2007 John S. Conery. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Sequence alignment is a fundamental operation in bioinformatics, used in a wide variety of applications ranging from genome assembly, which requires exact or nearly exact matches between ends of small fragments of DNA sequences [1], to homology search in sequence databases, which involves pairwise local alignment of DNA or protein sequences [2], to phylogenetic inference and studies of protein structure and function, which depend on multiple global alignments of protein sequences [3–5].

These diverse applications all use the same basic definition of alignment: a character in one sequence corresponds either to a character from the other sequence or to a "gap" character that represents a space in the middle of the other sequence. Alignment is often described informally as a process of writing a set of sequences in such a way that matching characters are displayed within the same column, and gaps are inserted in strings in order to maximize the similarity across all columns. More formally, alignments can be defined by a matrix M, where Mij is 1 if character i of one sequence is aligned with character j of the other sequence, or in some cases, Mij is a probability, for example, the posterior probability of aligning letters i and j [6].

This paper introduces a new framework for describing the similarities and differences in a set of sequences. The idea is to construct a special-purpose grammar for the strings that represent the sequences. If there are segments in each input sequence that are similar to corresponding segments in the other sequences, the grammar will have a single rule that directly generates the characters for these segments.

An alignment algorithm based on this new framework will consider different sets of rules to include in the grammar it produces. The focus of this paper is on the use of minimum description length (MDL) [7] as the basis of the alignment algorithm. The MDL principle argues that the best alignment will be the one described by the shortest grammar, where the length of a grammar is measured in terms of the number of bits needed to encode it.

The key idea is to use conditional probabilities to encode letters in aligned regions. If a grammar has a rule that aligns letter x in one sequence with letter y in another sequence, the encoding of the rule will be based on p(y | x), and if the alignment is accurate, the resulting encoding is shorter than the one that encodes x and y separately in an unaligned region. But there is a tradeoff: adding a new rule to a grammar requires adding new symbols for the rule structure, and the number of bits required to encode these symbols adds to the total size of the encoded grammar. The alignment algorithm must determine the net benefit of each potential aligned region and choose the set of aligned regions that provides the overall shortest encoding.

MDL has been used to infer grammars for large collections of natural language sentences [8] and to search for recurring patterns in protein and DNA sequences [9]. These applications of MDL are examples of machine learning, where the system uses the data as a training set and the goal is to infer a general description that can be applied to other data. The goal of the sequence alignment algorithm presented here is simply to find the best description for the data at hand; there is no attempt to create a general grammar that may apply to other sequences.

Grammars have been used previously to describe the structure of biological sequences [10–12], and regular expressions are a well-known technique for describing patterns that define families of proteins [13]. But as with previous work on MDL and grammars, these other applications use grammars and regular expressions to describe general patterns that may be found in sequences beyond those used to define the pattern, whereas for alignment the goal is to find a grammar that describes only the input data.

Grammars have the potential to describe a wide variety of relationships among sequences. For example, a top level rule might specify several different ways to partition the sequences into smaller groups, and then specify separate alignments for each group. In this case, the top level rules are effectively a representation of a phylogenetic tree that shows the evolutionary history of the sequences. This paper focuses on one very restricted type of grammar that is capable of describing only the simplest correspondence between sequences. The algorithm presented here assumes that only two sequences are being aligned, and that the goal is to describe similarity over the entire length of both input sequences, that is, the algorithm is for pairwise global alignment. For this application, the simplest type of formal grammar—a right linear grammar—is sufficient to describe the alignment. Since every right linear grammar has an equivalent regular expression, and because regular expressions are simpler to explain (and are more commonly used in bioinformatics), the remainder of this paper will use regular expression syntax when discussing grammars for a pair of sequences.

Current alignment algorithms are highly sensitive to the choice of gap parameters [14–17]; for example, Reese and Pearson showed that the choice of gap penalties can influence the score for alignments made during a database search by an order of magnitude [18]. One of the advantages of the grammar-based framework is that gaps are not needed to align sequences of varying length. Instead, the parts of regular expressions that correspond to regions of unaligned positions will have a different number of characters from each input sequence.

Previous work using information theory in sequence alignment has been within the general framework of a Needleman-Wunsch global alignment or Smith-Waterman local alignment. Allison et al. [19] used minimum message length to consider the cost of different sequences of edit operations in global alignment of DNA; Schmidt [20] studied the information content of gapped and ungapped alignments, and Aynechi and Kuntz [21] used information theory to study the distribution of gap sizes. The work described here takes a different approach altogether, since gap characters are not used to make the alignments.

Regular expression alignments are similar to the alignments produced by DIALIGN [22, 23], a program that creates consistent sets of ungapped local alignments. The main differences are that fragments in DIALIGN are defined by a Smith-Waterman alignment based on finding a locally optimal score and including neighboring letters until the score drops below a threshold, and DIALIGN uses a minimum length parameter to exclude short random matches. The method presented in this paper uses the MDL criterion to find the ends of aligned regions—if adding a pair of letters is less costly than leaving the letters in a variable region, then the letters are included in the aligned region.

Other methods that consider only ungapped local alignments are also similar to regular expression alignments. Schneider [24] used information theory as the basis of a multiple alignment algorithm for small ungapped DNA sequences and successfully applied it to binding sites. More recently, Krasnogor and Pelta [25] described a method for evaluating the similarity of pairs of proteins, but their analysis describes a global similarity metric without actually aligning the substrings responsible for the similarity.

The next section of this paper provides some background information on sequence alignment and explains in more detail how a regular expression can be used to capture the essential information about the similarity in a set of sequences. The details of the MDL encoding for sequence letters and other symbols found in expressions are given in Section 3. Results of two sets of experiments designed to test the method are presented in Section 4.

The regular expression alignment method described in this paper has been implemented in a program named realign. The source code, which is written in C++ and has been tested on OS/X and Linux systems, is freely available under an open source license and can be downloaded from the project web site [26].

2. ALIGNMENTS AND REGULAR EXPRESSIONS

One of the main applications of sequence alignment is comparison of protein sequences. The inputs to the algorithm are sets of strings, where each letter corresponds to one of the 20 amino acids found in proteins. The goal of the alignment is to identify regions in each of the input sequences that are parts of the same structural or functional elements or are descended from a common ancestor.

Figure 1(b) shows the evolution of fragments of three hypothetical proteins starting from a 9-nucleotide DNA sequence. The labels below the leaves of the tree are the amino acids corresponding to the DNA sequences at the leaves. The only change along the left branch is a single substitution which changes the first amino acid from P to T, and an alignment algorithm should have no problem finding the correspondences between the two short sequences (Figure 1(c)).

The sequence on the right branch of the tree is the result of a mutation that inserted six nucleotides in the middle of the original sequence. In order to align the resulting sequence with one of its shorter cousins, a standard alignment algorithm inserts a gap, represented by a sequence of one or more dashes, to mark where it thinks the insertion occurred.


Figure 1: (a) The genetic code specifies how triplets of DNA letters (known as “codons”) are translated into single amino acids when a cell manufactures a protein sequence from a gene. (b) A tree showing the evolution of a short DNA sequence. Labels below the leaves are the corresponding amino acid sequences. (c) Alignment of the two shorter sequences. (d) and (e) Two ways to align the longer sequence with one of the shorter ones.

This alignment is complicated by the fact that the insertion occurred in the middle of a codon; the single CCC that corresponded to a P in the ancestral sequence is now part of two codons, CCT and TTC. Figures 1(d) and 1(e) show two different ways of doing the alignment; the difference between the two is the placement of the gap, which can go either before or after the middle P of the short sequence.

A key parameter in the alignment of protein sequences is the choice of a substitution matrix, a 20 × 20 array S in which Si,j is a score for aligning amino acid i with amino acid j. The PAM matrices [27] were created by analyzing hand alignments of a carefully chosen set of sequences that were known to be descending from a common ancestor. PAM matrices are identified by a number that indicates the degree to which sequences have changed; a unit of "1 PAM" is roughly the amount of sequence divergence that can be expected in 10 million years [28], so the PAM20 matrix could be used to align a set of sequences where the common ancestor lived around 200 million years ago. Other common substitution matrices are the BLOSUM family [29] and the Gonnet matrix [30].

Substitution matrices give higher scores to pairs of letters that are expected to be found in alignments, and lower (negative) scores to pairings that are rare. For example, the PAM100 matrix has positive scores on the main diagonal, to use when aligning letters with themselves; the highest score is 12, for the pair W/W, since tryptophan (W) is highly conserved. Smaller positive scores are for letters that frequently substitute for one another; for example, leucine (L) and isoleucine (I) are both hydrophobic and the matrix entry for the pair I/L is 1. Histidine (H) is hydrophilic, and the matrix entry for I/H is −4. The pair P/L has a score of −4 and the pair P/S has a score of 0, so an algorithm using PAM100 would prefer the alignment shown in Figure 1(e).

Regular expressions are widely used for pattern matching, where the expression describes the general form of a string and an application can test whether a given string matches the pattern. To see how a regular expression is an alternative to a standard gap-based alignment, consider the following pattern, which describes the two sequences in Figures 1(d) and 1(e):

P(P | LFS)P.          (1)

Here the vertical bar means "or" and the parentheses are used to mark the ends of the alternatives. The pattern described by this expression is the set of strings that start with a P, then have either another P or the string LFS, and end in a P. In this example, the letters enclosed in parentheses correspond to a variable region: the pattern simply says "these letters are not aligned" and no attempt is made to say why they are not aligned or what the source of the difference is. The regular expression is an abstract description, covering both the alignments of Figures 1(d) and 1(e) (and a third, biologically less plausible, alignment in which the top string would be P–P–P).

For a more realistic example, consider the two sequence fragments in Figure 2(a), which are from the beginning of two of the protein sequences used to test the alignment application. Substrings of 15 characters near the front of each sequence are similar to each other. A regular expression that describes this similarity would have three groups, showing letters before and after the region of similarity as well as the region itself (Figure 2(b)).
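Pattern matching of this kind can be checked with any ordinary regular expression engine. The following minimal sketch tests the pattern from (1) against the two peptide fragments, which we assume here to be PPP and PLFSP as implied by the pattern; it is only an illustration, not part of the realign tool.

```python
import re

# The pattern from (1): a P, then either P or LFS (the variable region), then a P.
pattern = re.compile(r"^P(P|LFS)P$")

print(bool(pattern.match("PPP")))    # the shorter fragment   -> True
print(bool(pattern.match("PLFSP")))  # the fragment after the insertion -> True
print(bool(pattern.match("PLP")))    # a string outside the pattern     -> False
```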


Figure 2: (a) Strings from the start of two of the amino acid sequences used to test the alignment algorithm. The substrings in blue are similar to the corresponding substring in the other sequence. (b) A regular expression that makes explicit the boundaries of the region of similarity. (c) The canonical form representation of the regular expression. The canonical form has the same groupings of letters, but displays the letters in a different order and uses marker symbols instead of parentheses to specify group boundaries. A # means the sequence segments are blocks, where the ith letter from one sequence has been aligned with the ith letter in the other sequence. A > designates the start of a variable region of unaligned letters.

Any pair of sequences can be described by a regular expression of this form. The expression consists of a series of segments, written one after another, where each segment has two substrings separated by the vertical bar. But this standard notation introduces a problem: how does one distinguish segments describing aligned characters from segments for unaligned characters? The following convention solves the problem of distinguishing between the types of segments and reduces the number of symbols to a minimum. In a canonical form sequence expression,

(i) each open parenthesis is replaced with a symbol that specifies the type of the segment that starts at that location: an aligned segment starts with #, an unaligned segment starts with >;
(ii) the vertical bar separating the two parts of a segment is replaced by the symbol used at the start of the segment; thus if the segment starts with #, the two parts of the segment are separated by a second #;
(iii) the closing parenthesis marking the end of a segment can simply be deleted since it is redundant (every closing parenthesis is either followed by an opening parenthesis or comes at the end of the expression);
(iv) to make an expression easier to read, it is displayed by starting a new line for each # or >, with the understanding that "white space" breaking the expression into new lines is for formatting purposes only and is not part of the expression itself.

The canonical form of the expression describing the alignment of the initial parts of the two example genes is shown in Figure 2(c).

In the literature on sequence alignment, an ungapped local alignment is often referred to as a block. In the canonical form sequence expression, a block corresponds to a pair of lines starting with #; pairs of lines starting with > are called variable regions. Note that the substrings in blocks always have the same number of sequence letters, and always have at least one letter. Substrings in variable regions can have any number of sequence letters, and one of the strings can have zero letters. Since # and > define the boundaries of blocks, they are referred to as marker symbols.

Sequence expressions can easily be extended to describe a multiple alignment of n > 2 sequences. Each segment in an expression would have n substrings separated by vertical bars, and the corresponding canonical form would have n lines in each block and in each variable region. The MDL code length function and the alignment algorithm in the following section assume there are only two sequences; possible extensions for multiple alignment will be discussed in the final section.

3. ALIGNMENT USING MINIMUM DESCRIPTION LENGTH

It is easy to see that there is at least one canonical form sequence expression for every pair of sequences: simply create a single variable region, writing the string for each complete sequence to the right of a > symbol. This default expression is the null hypothesis that the sequences have nothing in common.
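To make the representation concrete, here is a small illustrative sketch (not the realign implementation) of how a canonical form expression might be held in memory and printed. A segment is a pair of substrings tagged as a block ('#') or a variable region ('>'); the default expression of the previous paragraph is a single variable region holding both complete sequences, and the example below uses the fragments of Figure 2.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    kind: str    # '#' for a block (aligned), '>' for a variable region (unaligned)
    top: str     # substring from the first sequence
    bottom: str  # substring from the second sequence

def default_expression(x: str, y: str) -> List[Segment]:
    """The null hypothesis: one variable region containing both whole sequences."""
    return [Segment('>', x, y)]

def canonical_form(expr: List[Segment]) -> str:
    """Render an expression with one line per substring, each preceded by its marker."""
    lines = []
    for seg in expr:
        lines.append(seg.kind + seg.top)
        lines.append(seg.kind + seg.bottom)
    return "\n".join(lines)

# Example: an aligned block surrounded by two variable regions (Figure 2).
expr = [Segment('>', "MNNNNYIF", "MNSYKP"),
        Segment('#', "ENENPILYNTNEGEE", "ENENPVLYNYKEDEE"),
        Segment('>', "NRSS", "SSHI")]
print(canonical_form(expr))
```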

The receiver recovers the original sequence data by expand- ing the expression to generate every sequence that matches the expression. 2markers 6markers 27 letters 27 letters A “communication protocol” that specifies the type of in- formation contained in a message and the order in which the pieces of the message are transmitted is an essential part of the encoding. The representation of a sequence expression Figure 3: Schematic representation of an expression rewriting op- begins with a preamble that contains information about the eration. A canonical form expression with a single variable region structure of the expression and the encoding of alignment is transformed into a new expression with two variable regions sur- parameters. rounding a block. The number of sequence letters does not change, A canonical form sequence expression is an alternating but four new marker symbols are added to specify the boundaries series of blocks and variable regions, where the marker sym- of the block. bols (# and >) inserted into the input sequences identify the boundaries between segments. The communication proto- col allows the transmitter to simplify the expression as it is compressed by putting a single bit in the preamble to spec- the locations of the start of the block (one in each input se- ify the type of the first segment. Then the only thing that is quence) and two > symbols mark the end of the block. As a required is a single type of symbol to specify the locations of special case, the block might be at the beginning or end of the remaining markers. For the example sequences shown in the expression; if so only two new # markers are added to the Figure 2, the expression can be transformed into the follow- expression. ing string: Since the alignment algorithm uses the minimum de- > MNNNNYIF.MNSYKP.ENENPILYNTNEGEE. scription length principle to search for the simplest expres- (2) sion, this transformation appears to be a step in the wrong ENENPVLYNYKEDEE.NRSS.SSHI direction because the complexity of the expression, in terms Here the >, represented by a single bit, indicates the type of the number of symbols used, has increased. The key point of the first region. The periods identify the locations of the is that MDL operates at the level of the encoding of the ex- markers. Since the regions alternate between # and >, the re- pression, that is, it prefers the expression that can be encoded ceiver infers the first period that represents another >, the in the fewest number of bits. As will be shown in this section, next two periods are #, and so on. blocks of similar sequence letters have shorter encodings. If The key parameter in every alignment is the substitution the number of bits saved by placing similar letters in a block matrix used to define joint probabilities for each letter pair is greater than the cost of encoding the symbols that mark the and single (marginal) probabilities for each individual letter. ends of the block, the transformed expression is more com- If the transmitter and receiver agree beforehand to restrict pact. the set of substitution matrices to a set of n commonly used The code length function that assigns a number of bits matrices, each matrix can be assigned an integer ID and the to each symbol in a canonical form sequence expression has preamble simply contains a single integer encoded in log n three components: 2 bits to identify the matrix. 
If an arbitrary matrix is allowed, (i) a protocol that defines the general structure of an ex- the protocol would have to include a representation for the pression and the representation of alignment parame- substitution matrix. ters; The rest of the information contained in the pream- (ii) a method for assigning a number of bits to each letter ble depends on the method used to represent the marker from the set of input sequences; symbols. Three different methods are presented below in (iii) a method for determining the number of bits to use Section 3.3, and each uses a different combination of param- for the marker symbols that identify the boundaries eters; for example, the indexed representation requires the between blocks and variable regions. transmitter to send the length of the longest sequence, and the tagged representation requires the transmitter to send the 3.1. Communication protocol number of bits used in the encoding of marker symbols. For numeric parameters, the transmitter can simply encode the A common exercise in information theory is to imagine that parameter in the fewest number of bits and include the en- a compressed data set is going to be sent to a receiver in coding as part of the preamble. A standard technique for rep- binary form, and the receiver needs to recover the original resenting a number that can be encoded in k bits is to send k data. This exercise ensures that all the necessary information 0s, a 1, and then the k bits that encode the number itself. is present in the compressed data—if the receiver cannot re- In general a regular expression can be expanded into construct the original data, it may be because essential infor- more than just the original sequence strings. For example, mation was not encoded by the compression algorithm. In suppose the two input strings are AB and CD, and the regular the case of the MDL alignment algorithm, the idea is to com- expression representing their alignment is of the form press a set of sequences by creating a representation of a reg- | | ular expression that describes the structure of the sequences. (A C)(B D). (3) 6 EURASIP Journal on Bioinformatics and Systems Biology
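The following toy sketch (the helper name is hypothetical) enumerates every string matched by a segment-wise expression such as the one in (3); it makes explicit why the protocol must fix the order of substrings within each segment.

```python
from itertools import product

def expansions(segments):
    """All strings matched by a segment-wise expression.

    `segments` is a list of segments, each a tuple of the substrings
    separated by the vertical bar, e.g. [("A", "C"), ("B", "D")].
    """
    return ["".join(choice) for choice in product(*segments)]

# (A|C)(B|D) matches four strings, not only the two inputs AB and CD.
print(expansions([("A", "C"), ("B", "D")]))       # ['AB', 'AD', 'CB', 'CD']
```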

A receiver can expand this expression into the two original Table 1: Cost (in bits) of aligning pairs of letters. Sx,y is the score input strings, but the expression also matches AD and CB. for letters x and y in the PAM100 substitution matrix. c(x)+c(y) Thus the protocol needs a method for telling the receiver is the sum of the costs of the two letters, which is incurred when how to link together the substrings from different segments thelettersareinavariableregion.c(x)+c(y | x) is the cost of the same letters when they are aligned in a block. The benefit of align- so that it will reconstruct AB and CD but not AD or CB. ff One solution would be to encode sequence IDs with the ing two letters is the di erence between the unaligned cost and the aligned cost: a positive benefit results from aligning similar letters, substrings so the receiver correctly pieces together a sequence a negative benefit from aligning dissimilar letters. using a consistent set of IDs. But if a simple convention is followed, the receiver can infer the sequence IDs from the xy Sx,y c(x)+c(y) c(x)+c(y | x)benefit(y, x) order in which the sequences are transmitted. For canonical WW 12 6.36 + 6.36 6.36 + 0.44 5.92 form sequence expressions, the protocol requires that every II 63.65 + 3.65 3.65 + 1.25 2.40 region has exactly two strings, and that within a region, the LL 63.09 + 3.09 3.09 + 0.72 2.37 strings need to be given in the same order each time. ML 34.97 + 3.09 4.97 + 2.26 0.83 LI 13.09 + 3.65 3.09 + 3.66 −0.01 3.2. Encoding sequence letters LQ −23.09 + 5.02 3.09 + 6.09 −1.07 The standard technique used in information theory of en- LC −63.09 + 5.78 3.09 + 9.38 −3.60 coding symbols according to their probability distribution can be used to encode sequence letters. If a letter x occurs − with probability p(x) the encoding of x requires log2 p(x) bits. When x and y are the same letter, or similar according to The probability distribution for letters is based on the the substitution matrix being used, the cost using the condi- tional probability will be lower. For any two letters x and y, substitution matrix being used for the alignment. Scores in ff a substitution matrix are log odds ratios of the form the benefit of aligning y with x is the di erence between the cost of placing the two letters in a variable region versus their 1 p(x, y) cost in a block: s(x, y) = log (4)     λ p(x)p(y) benefit(y, x) = c(x)+c(y) − c(x)+c(y | x) = − | (5) where p(x, y) is the joint probability of observing x aligned c(y) c(y x). with y, p(x)andp(y) are the background probabilities of x In general, there is a positive benefit for pairs of letters and y,andλ is a scaling factor [31]. The realign program that have positive scores in a substitution matrix. On the uses a program named lambda [32] as a preprocessor that other hand, a negative benefit is incurred when an algorithm takes an arbitrary substitution matrix as input, solves for λ, tries to align two dissimilar letters. Table 1 shows a few exam- and saves a table of background probabilities for each single ples of pairs of letters, the cost of placing them unaligned in letter and joint probabilities for each letter pair. a variable region, and the benefit gained from aligning them The number of bits used to encode a letter in a canoni- in a block. cal sequence expression depends on whether the letter is in a block or in a variable region. For a letter x in a variable 3.3. 
Encoding marker symbols region the encoding is straightforward: simply use the back- ground probability of x according to the transformed substi- Three different methods for encoding of the marker symbols tution matrix. that identify the boundaries between blocks and variable re- For a block, the encoding considers pairs of letters x and gions are illustrated in Figure 4. All three methods are based y that occur in the same relative position in the block. The on the transformation in which the # and > symbols have number of bits to encode the letter x in one sequence is based been replaced by periods. The difference between the three on p(x), the same as in a variable region, but for the letter y methods is in the representation of each marker and the ad- in the other sequence, the conditional probability p(y | x)is ditional information included in the preamble. used to reflect the fact that x and y are aligned. Since by def- | = inition p(y x) p(x, y)/p(x), the substitution matrix pro- 3.3.1. Indexed representation vides the necessary information to compute the conditional probabilities. The indexed representation for marker symbols is based on To summarize, the cost, in bits, of encoding letters in a the observation that it is not necessary to include the marker canonical form sequence expression is defined as follows: symbols themselves, but only their locations in each string. If an expression has m segments, the transmitter can construct (i) for a letter x in a variable region or in the first line a table of (m − 1) entries for each string. The number of bits of a block, the code length is a function of p(x), the for each table entry depends on n, the length of the corre- marginal probability of observing x:c(x)=−log p(x); 2 sponding input sequence. Using this technique, the preamble (ii) for a letter y in the second line of a block, the code of a message is constructed as follows: length is a function of p(y | x), the conditional prob- ability of seeing y in this location given character x in (i) order the input sequences so the longest sequence is =− | the same position in the first line: c(y, x) log2 p(y x). the first one in the message; John S. Conery 7

Figure 4, panel (d): q(x, y) = (1 − γ) × p(x, y); the p(x, y) sum to 1, the q(x, y) sum to 1 − γ, and the marker probability is q(·) = γ.

Figure 4: The items in blue correspond to information added to a string to specify the locations of marker symbols. (a) Indexed representation. The preamble contains two tables of m − 1 numbers to specify the locations of the m marker symbols (the first marker is always at the front of the string) in each sequence. Each table entry has k = log2 n bits to specify a location in a string of length n. (b) Tagged representation. A one-bit tag added to each symbol identifies the symbol class (letter or marker), and is followed by the bits that represent the symbol itself. (c) Scaled representation. The number of bits for each symbol x is simply −log2 q(x), where q(x) is the probability of the symbol based on a distribution that includes the probability of a marker. (d) Given a probability γ for marker symbols, the joint probabilities for the letter pairs are scaled by 1.0 − γ so the sum of probabilities over all symbols is 1.0.

(ii) use one bit to specify the type of the first segment mines the individual probability for each letter and the joint (which will be the same for both sequences); probability for each letter pair.   (iii) use log2s bits to specify which one of the s substi- tution matrices was used to encode letters and letter 3.3.2. Tagged representation pairs; (iv) use 2log n + 1 bits to specify n, the length of the first 2 There are two drawbacks to the indexed representation. The input sequence. This number also allows the receiver first is that the number of bits used to represent a marker to determine k = log n, the number of bits required to 2 grows (albeit very slowly) with the length of the input se- represent a single marker table entry; quences. That means one might get a different alignment for (v) the next 2log m + 1 bits specify m, the number of 2 the same two substrings of sequence letters in different con- marker symbols in each sequence; texts; if the substrings are embedded in longer sequences, (vi) create a table of size mk bits for the locations of the the number of bits per marker will increase, and the align- m markers in the first sequence, followed by another ment algorithm might decide on a different placement for table of the same size for the markers of the second the markers in the middle of the substrings. sequence. The second disadvantage is that in many cases marker Following the preamble, the body of the message simply symbols identify the locations of insertions and deletions, consists of the encoding of the letters defined in the previous which are evolutionary events. The number of bits used to section. Since the receiver knows the length of the first se- represent a marker should correspond to the likelihood of an quence, there is no need to include an end-of-string marker insertion or deletion, but not the length of the sequence. If after the first sequence. This location becomes a de facto anything, longer sequences are more likely to have had inser- marker for the start of the second sequence. tions or deletions, so the number of bits representing those Figure 4(a) shows how the start of the two example se- events should be lower, not higher. quences would be encoded with the indexed representation. The tagged representation addresses these problems by The numbers in blue are indices between 0 and the length of defining a prefix code for markers and embedding the marker the longer of the two sequences. codes in the appropriate locations within each sequence The advantage of this representation is that no additional string. This method requires the user to specify a value for a parameters are required to align a pair of sequences: the only new parameter, named α, the number of bits required to rep- alignment parameter is the substitution matrix, which deter- resent a marker. Each symbol in the expression is preceded by 8 EURASIP Journal on Bioinformatics and Systems Biology a one-bit tag that identifies the type of symbol, for example, scaled probabilities are lower than the original probabilities, azeroforamarkerandaoneforasequenceletter.Following the scaled costs of single letters are higher, and some letter the tag is the representation of the symbol itself: α bits for pairs that had a negative benefit according to the original markers, and c(x)bitsforaletterx using the cost function probabilities will now have a positive benefit. For example, defined in the previous section. 
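Under stated simplifying assumptions (background probabilities only, ignoring the conditional coding used for the second line of a block, and counting the class tag separately from α), a minimal sketch of the body cost of a tagged message looks like the following; the function name and toy probabilities are illustrative, not the realign implementation.

```python
import math

def tagged_body_cost(symbols, p, alpha):
    """Bits for the body of a message under the tagged representation.

    `symbols` is the expression as a sequence of letters and marker symbols
    ('#' or '>'); `p` maps each letter to its background probability;
    `alpha` is the user-chosen marker cost in bits.  Every symbol carries a
    one-bit class tag; the preamble is not counted here.
    """
    total = 0.0
    for s in symbols:
        total += 1                                # the class tag
        if s in "#>":
            total += alpha                        # marker
        else:
            total += -math.log2(p[s])             # c(x) = -log2 p(x)
    return total

# Toy usage with a hypothetical uniform background over four letters.
p = {letter: 0.25 for letter in "ACGT"}
print(tagged_body_cost(">AC>GT#TTA#TTA", p, alpha=4.0))
```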
in the PAM matrices, letter pairs with scores of 0 or higher The preamble of a message based on the tagged repre- have a positive benefit using unscaled probabilities, but when sentation is much simpler: it only contains the single bit des- scaled with 1 − γ = 0.75 pairs of slightly dissimilar amino ignating whether the first segment is a block or a variable acids with scores of −1 have a positive benefit. region, the substitution matrix ID, and the value of α.The tagged representation of the alignment of the example se- 3.4. Example quences is shown in Figure 4(b). Two different alignments of the sequences of Figure 2 are 3.3.3. Scaled representation shown in Figure 5. The alignments were made using the scaled representation with the PAM20 substitution matrix The additional bits attached to each symbol in the tagged and γ = 0.02. The code length for the null hypothesis— representation result in a rather awkward code from an in- a single variable region containing all letters from the two formation theoretic point of view, where the number of bits productions—is 240.279 bits. The code length of the expres- used to represent a symbol should depend on the probability sion with two variable regions and one block is 224.728 bits. of observing that symbol. The cost of the expression with the block is less because In order to define the number of bits for each symbol s the net benefit from using conditional probabilities to com- −log ( ), where is either a sequence letter or a marker as 2q s s pute the costs of the aligned letters (129.508 − 91.381 = symbol, one can scale each element in the joint probability − 38.127 bits) outweighs the cost of introducing four marker matrix by a constant factor 1 γ (where 0 <γ<1) and then symbols (4 × 5.644 = 22.576 bits) for the boundaries of the define the number of bits in the representation of a marker as =− block. α log2(γ)(Figure 4(d)). Now the body of the message is simply the representation of each symbol, encoded according to the modified probability matrix (see also Figure 4(c)): 4. EXPERIMENTAL RESULTS =− c(x) log2q(x), To evaluate the feasibility of aligning pairs of sequences by finding the minimum cost sequence expression, a simple c(y | x) =−log q(y | x), (6) 2 graph search algorithm was developed and implemented in a · =− c( ) log2(γ). program named realign. The algorithm creates a directed acyclic graph where nodes represent candidate blocks de- The preamble of a message encoded with the scaled represen- fined by equal-length substrings from each input sequence. tation is the same as the preamble for a tag-based message, Weights assigned to nodes represent the cost in bits of the except that the additional parameter is γ instead of α. corresponding block, and weights on edges connecting two Since the probability of each single letter is the marginal nodes are defined by the cost of a variable region for the probability summed over a row of the joint probability ma- characters between the two blocks. The minimum cost path trix, and each matrix entry was multiplied by a constant scale through the graph corresponds to the optimal alignment. factor, the single-letter probabilities are also scaled by this In one set of experiments, alignments produced by same amount:  realign were compared to pairwise alignments generated q(x) = (1 − γ)p(x, y) by CLUSTALW [33], one of the most widely used alignment y programs. In a second experiment, realign was used to  (7) align pairs of sequences from the BaliBase benchmark suite = (1 − γ) p(x, y) = (1 − γ)p(x). 
[34].
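A toy sketch of the scaling step, mirroring equations (6) and (7) with a hypothetical two-letter alphabet, also checks numerically that the conditional probabilities used inside blocks are unchanged, which is argued formally below.

```python
import math

def scale_matrix(p_joint, gamma):
    """Scale a joint letter-pair distribution by (1 - gamma).

    Returns the scaled joint probabilities q(x, y), the scaled marginals
    q(x), and the marker cost alpha = -log2(gamma).
    """
    q_joint = {pair: (1 - gamma) * p for pair, p in p_joint.items()}
    letters = {x for x, _ in p_joint}
    q_marg = {x: sum(q_joint[(x, y)] for y in letters) for x in letters}
    return q_joint, q_marg, -math.log2(gamma)

# Hypothetical two-letter joint distribution, just to exercise the identity.
p = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
q_joint, q_marg, alpha = scale_matrix(p, gamma=0.02)
p_marg = {x: sum(p[(x, y)] for y in "AB") for x in "AB"}
# q(y | x) equals p(y | x): the scale factors cancel.
assert abs(q_joint[("A", "B")] / q_marg["A"] - p[("A", "B")] / p_marg["A"]) < 1e-12
print(alpha)   # about 5.644 bits, the per-marker cost used in the Figure 5 example
```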

But note that conditional probabilities are not affected by 4.1. Plasmodium orthologs the scaling since the scale factors cancel out: An important concept in evolutionary biology is homology, q(x, y) (1 − γ)p(x, y) | = = defined to be similarity that derives from common ancestry. q(y x) − q(x) (1 γ)p(x) In molecular genetics, two genes in different organisms are (8) p(x, y) said to be orthologs if they are both derived from a single gene = = p(y | x). p(x) in the most recent common ancestor. In genome-scale computational experiments, a simple Recall from Section 3.2 that a pair of letters will be included strategy known as “reciprocal best hit” is often used to iden- in a block if there is a positive benefit from aligning them, tify pairs of orthologous genes. For each gene a from organ- that is, if c(y) − c(y | x) > 0. In the scaled representation, ism A,doaBLASTsearch[2] to find the gene b from or- this calculation compares a cost based on a scaled probabil- ganism B that is most similar to a. If a search in the other ity with a cost defined by an unscaled probability. Since the direction, using BLAST to find the gene most similar to b in John S. Conery 9

Figure 5 annotations. (a) Letters that will form the block, costed as c(x) + c(y): 129.508 bits; cost of the null hypothesis: 228.99 + 2α = 240.279 bits. (b) The same letters costed as c(x) + c(y | x): 91.381 bits; cost of the expression with one block: 64.272 + 91.381 + 35.211 + 6α = 224.728 bits.

Figure 5: Cost of alternative expressions for the example sequences using the PAM20 substitution matrix and γ = 0.02. The cost for each marker symbol is α = −log2 γ = 5.644 bits. (a) The cost for the null hypothesis is the sum of all the individual letter costs plus the cost of the two marker symbols. (b) When the letters in blue are aligned with one another, the costs of the letters in the second sequence are computed with conditional probabilities. This reduces the cost of the letters in the block by 129.508 − 91.381 = 38.127 bits. The transformed grammar has four additional markers, but the reduction in cost afforded by using the block outweighs the cost of the new markers (4 × 5.644 = 22.576 bits) so the expression with one block has a lower overall cost.
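As a quick check of the accounting in Figure 5, the following lines (all values copied from the caption and text) reproduce the comparison between the saving from conditional coding and the cost of the four extra markers.

```python
# Worked check of the Figure 5 numbers (values copied from the figure).
unaligned_bits = 129.508     # c(x) + c(y) for the letters that form the block
aligned_bits = 91.381        # c(x) + c(y | x) for the same letters
alpha = 5.644                # -log2(0.02), cost of one marker symbol

saving = unaligned_bits - aligned_bits   # 38.127 bits gained by the block
marker_cost = 4 * alpha                  # 22.576 bits for four new markers
assert saving > marker_cost              # the expression with the block wins
print(saving, marker_cost)
```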

(c) Proportion of each type of column (untrimmed / trimmed CLUSTALW alignments):
Aligned by both: 0.473 / 0.469
Aligned by neither: 0.147 / 0.258
CLUSTALW only: 0.38 / 0.267
Realign only: <0.001 / 0.006

Figure 6: Alignment of sequences MAL7P1.11 and Pv087705 from ApiDB [35]. (a) Comparison of CLUSTALW alignment (top two lines of text) and the regular expression alignment (bottom two lines). Background colors indicate whether the two algorithms agree. Green: columns aligned by both algorithms; blue: letters not aligned by both algorithms; white: letters aligned by CLUSTALW but appearing in variable regions in the regular expression; red: letters aligned in the regular expression but not by CLUSTALW. (b) Same as (a), but comparing the trimmed CLUSTALW alignment with regular expression alignment. The middle row of two lines shows the result of the alignment trimming algorithm; an asterisk identifies a column from the CLUSTALW alignment that was removed by “gap expansion.” (c) Proportion of each type of column averaged over all 3909 alignments.

organism A, reveals that a is most similar to b, then a and b ing BLAST to search for reciprocal best hits. Since P. falci- are most likely orthologs. parum diverged from P. vivax approximately 200 MYA [36], Once pairs of genes are identified as reciprocal best hits, all the alignments used the PAM20 substitution matrix. The a more detailed comparison is done using a global alignment realign alignments were made using the scaled representa- algorithm such as CLUSTALW [33]. To see how well the reg- tion for marker symbols with γ = 0.02 since insertion and ular expression-based alignment algorithm performs on real deletion events are relatively rare at this short evolutionary sequences, a series of alignments of orthologous genes made time scale. with realign were compared to the CLUSTALW alignments Figure 6 shows a detailed comparison of the alignments of the same genes. The complete set of genes from Plasmod- for one pair of genes (MAL7P1.11 and Pv087705). The top ium falciparum, the parasite that causes malaria, and a close two lines in Figure 6(a) are the alignment produced by relative known as Plasmodium vivax were downloaded from CLUSTALW, and the bottom two are the regular expression ApiDB, the model organism database for this family of or- alignment. To make it easier to compare the alignments, the ganisms [35]. A set of 3909 orthologs were identified by us- marker symbols have been deleted, and the letters in variable 10 EURASIP Journal on Bioinformatics and Systems Biology regions printed in italics to distinguish them from letters Sequences in BAliBASE are organized in a collection of in blocks. The four background colors indicate the level of different test sets. The sets were designed to provide differ- agreement between the two alignments: a pair can be aligned ent challenges to multiple alignment programs, for example, by both programs, aligned by neither, or aligned by one but all sequences in a test are equally distant, or sequences are in not the other. two distinct subgroups. Sequences in each set have known 3D Researchers often apply an “alignment trimming” algo- structures, and each test set was manually curated to iden- rithm to the output of an alignment algorithm to identify tify conserved core blocks within each multiple alignment. suspect columns in an alignment [37]. An example of a sus- The accuracy of an alignment algorithm can be assessed by pect column is the one shown in Figure 1 where an inser- comparing how it aligns amino acids in the core blocks. The tion occurred in the middle of a codon. Figure 6(b) shows comparisons reported here were made by aligning all pairs of the alignment of the Plasmodium genes after an alignment sequences in each test set. trimming operation [38] was applied to the CLUSTALW align- Figure 7 illustrates how the choice of a substitution ma- ments. The middle two lines in this figure show the results trix affects the accuracy of an alignment. The blocks in of the trimming application: an X indicates a letter that was Figure 7(b) are from an alignment based on PAM20, and the left in the alignment, and a  indicates a position that was blocks in Figure 7(c) are from the same pair of sequences originally aligned but has now been converted to a gap. In aligned with PAM250. 
Letters shown in blue are accurate this example, the alignment trimming algorithm agreed with pairings of letters in core blocks in the reference alignment, the regular expression alignment: columns that were previ- and letters in red are misaligned—either they are placed in ously shown as aligned (white background color) are now variable regions, or if they are in blocks, they are aligned with unaligned (blue). the wrong letter from the other sequence (e.g., the letters in Over all the 3909 pairs of sequences, the two alignment the block marked with (2)). The overall accuracy is higher for methods agreed on 62% of the letters (top two rows of the PAM250 alignment, which is not surprising since these Figure 6(c)). The disagreement was almost entirely due to the two sequences are only about 40% identical, and sequences fact that in 38% of the columns, the regular expression align- with this low level of similarity have probably diverged for ment was more conservative and placed characters in an un- much more than 200MY. aligned region when CLUSTALW aligned those same letters. The block marked with a (3) in Figure 7 is an example There are very few instances where realign put letters in of how a less strict substitution matrix leads to longer blocks. an aligned block and CLUSTALW did not. Applying the align- The letter pair Q and G are dissimilar in PAM20, and the block ment trimming algorithm increases the level of agreement: ends at this letter pair. But with PAM250, there is a slight ben- approximately one fourth of the columns originally consid- efit to aligning Q with G (c(G|Q)


Figure 7: Portions of alignments of sequences 1aho and 1bmr from the BAliBASE alignment benchmark (Release 3) [34]. (a) The reference alignment from BAliBASE. Letters in core blocks are highlighted in blue. (b) Alignment from realign, using PAM20 and γ = 0.2. (c) Same as (b) but using PAM250. In (b) and (c) lines starting with % are comments that show the degree of similarity of corresponding letters in the preceding block: identical (=), similar (+), or dissimilar (−). Sequence letters in blue are correctly aligned core blocks. Red letters are core block columns that should have been aligned but were left in variable regions. The circled numbers highlight changes in the alignment (see text).

of bits needed to encode a set of sequence expressions and encode the shortest sequence expression. Figure 8(b) shows the accuracy of the alignments. To make sure the alignment a plot of the change in compression as a function of γ,where algorithm had enough data to work with, the alignments there is a peak in the range 0.07 ≤ γ ≤ 1.0. Superimposed were done on the longest set of sequences in BAliBASE. There on this graph is a plot of the accuracy of the best alignment, are eight sequences in this test set (BB12007), ranging in also as a function of γ. The peak in this plot is an accuracy of length from 994 to 1084 letters, with a mean length of 1020 69%, at γ = 0.05. letters. 28 pairwise alignments were created, using all possi- The most accurate alignments, with a mean accuracy of ble pairs of sequences from the set. 80%, were created using the tagged representation and very Figure 8(a) shows that the number of bits required to small values of α between 1.25 and 1.75 bits (including the represent an alignment increases as γ increases. There is a tag bit). To obtain a comparable ratio between the cost of a very slight decrease in cost near γ = 0.02. At smaller values marker symbol and sequence letter in the scaled representa- − of γ the cost of representing a marker symbol ( log2γ)istoo tion γ would have to be around 0.25. But because the scaled high for the algorithm to include any blocks. Near γ = 0.02, representation requires the algorithm to compare letter prob- a few blocks are found and the overall cost is lowered. But abilities scaled by 1 − γ with unscaled conditional probabil- as γ increases, the cost of the sequence letters increases, since ities, the accuracy deteriorates with higher values of γ. This they are scaled by a factor of 1 − γ. There are typically far distortion might be the reason the peak in the accuracy curve more letter symbols than marker symbols in a sequence ex- does not correspond more closely to the peak in the compres- pression, and the increase in the size of each letter outweighs sion curve in Figure 8(b). any gain from a shorter representation for marker symbols. One could argue that for a given value of γ,itisnot 5. SUMMARY AND FUTURE WORK the total size of a sequence expression that is important, but rather the amount of compression that results from that This paper has shown that regular expressions provide use- value of γ, where compression is the difference in the number ful descriptions of alignments of pairs of sequences. The ex- of bits required to encode the null hypothesis (that the se- pressions are simple concatenations of alternating blocks and quences have nothing in common) and the number of bits to variable regions, where blocks are equal-length substrings 12 EURASIP Journal on Bioinformatics and Systems Biology


Figure 8: The effect of the scaling parameter γ on alignments of pairs of sequences from BAliBASE [34] test set BB12007. There are eight sequences in the set; the data points are based on averages over all (8 × 7)/2 = 28 pairs of sequences. (a) Mean cost (in bits) of alignments as a function of γ. (b) Mean compression (the difference between the cost of the null hypothesis and the lowest cost alignment for each pair of sequences) is indicated by open circles. The mean accuracy of the alignments (proportion of core blocks correctly aligned) is indicated by closed circles (scale shown on the right axis).

from each input sequence and variable regions are strings of the input sequences diverged. The substitution matrix is the unaligned characters. basis for computing the probability of aligning pairs of let- Alignment via regular expressions is an application of in- ters, and generally reflects the probability that one of the let- formation theory: a hypothetical sender constructs a regular ters changed via point mutation into the other letter. Marker expression that describes the sequences, compresses the ex- symbols typically denote block boundaries that are the result pression by encoding blocks with conditional probabilities, of insertion or deletion mutations, and for very diverse se- and transmits the encoded expression to a receiver, who can quences a smaller number of bits per marker reflect a higher recover the original sequences by generating every string that probability of an insertion or deletion. matches the expression. The only parameter that is required An alignment algorithm based on this approach can be is a substitution matrix, which sets the background proba- seen as a process that begins with a default null hypothesis bilities for unaligned letters and the conditional probabilities that the sequences are unrelated, represented by an expres- for pairs of aligned letters. For greater flexibility, an optional sion that has all characters in a single unaligned region. The second parameter specifies the number of bits to use for the algorithm searches for candidate blocks, consisting of equal- marker symbols that denote block boundaries. This informa- length substrings from each input sequence, and checks to tion theoretic framework does not use gaps to align variable- see if the encoding of an expression that includes a block is length sequences—instead a global alignment of sequences shorter than the encoding without the block. The tradeoff of differentlengthwillhaveatleastonevariableregionwith that must be taken into account is that blocks of similar let- adifferent number of letters from the input sequences—and ters will have denser encodings due to the use of conditional thus finesses issues associated with gap penalties. probabilities, but adding a block means increasing the num- Accurate alignment of biological sequences needs to take ber of marker symbols that denote the edges of blocks. into account the amount of time the sequences have been A comparison of this new method with CLUSTALW, changing since they diverged from their most recent com- a widely used standard for sequence alignment, shows mon ancestor. The two parameters that affect the encod- that the regular expression alignments generally agree with ing of regular expressions—the choice of substitution matrix CLUSTALW on regions included in blocks in the regular ex- and the number of bits to use for marker symbols—are re- pression. Approximately, three quarters of the characters left lated to the two main types of mutations that can occur since unaligned in a regular expression are aligned by CLUSTALW, John S. Conery 13 but that number drops to one half if the CLUSTALW align- ing PROSITE or other predefined collections of patterns is ments are treated with an “alignment trimming” algorithm that blocks can be encoded in fewer bits. Where the pattern to remove ambiguous regions. 
A more detailed case-by-case specifies one of a small set of k letters, only log2k bits are analysis would be required to determine if the remaining un- required to encode one of these letters, assuming they are aligned characters should remain unaligned (i.e., alignment equally probable in this context. In particular, constants in trimming should be more ambitious) or if they need to be the pattern require zero bits, since the receiver knows these aligned (i.e., the regular expression approach is not aligning letters as soon as the pattern is specified. A second benefit is some characters that should be aligned). that PROSITE blocks allow the expression to describe small A second set of experiments compared the output of the amounts of variability in the length of a region without in- regular expression method with known reference alignments troducing a new variable region. Of course these benefits are from the BAliBASE alignment benchmark. Since the bench- offset by the additional complexity of an encoding that allows mark is designed to test multiple alignment algorithms, and for rule names and parameter delimiters. it is generally accepted that multiple alignment is more ac- As the last example shows, regular expressions and gram- curate than simple pairwise alignment [28], it is not possible mars are very flexible, with many different rule structures to say whether the regular expression approach is as accurate able to describe the same set of sequences. The different rule as recent multiple alignment methods, but the overall accu- structures convey different information about the strings racy of over 80% for sequences with 20% to 40% identity is generated by the grammars, and the goal will be to see if min- encouraging. imum description length encoding of these alternative struc- One direction for future research is to try to automati- tures and selection of the shortest encoding accurately pro- cally determine, for each substitution matrix, the best value vides the best description of the relationships between the for α or γ, the parameters that determine the number of bits sequences. per marker symbol. Based on extensive investigation (e.g., ff [39]) of di erent combinations of substitution matrix and ACKNOWLEDGMENTS other parameters BLAST, CLUSTALW, and other applications set default values for gap penalties based on the choice of sub- The anonymous reviewers made several valuable comments. stitution matrix. A similar analysis, perhaps based on inser- The indexed representation for marker symbols was sug- tion and deletion mutation rates, might be used to match a gested by one of the reviewers, and the scaled representation substitution matrix with a setting of α or γ for regular ex- is due to Peter Grunwald.¨ The author gratefully acknowl- pression alignments. edges support by grants from the National Science Foun- A second direction for future research is to expand the dation (MCB-0342431), the National Institutes of Health method to perform multiple alignment of more than two (5R01RR020833-02), and E.T.S. Walton Visitors Award from sequences. One approach would be to use pairwise local Science Foundation Ireland. alignments produced by realign as “anchors” for DIALIGN [22, 23], a progressive multiple alignment program that joins REFERENCES consistent sets of ungapped local alignments into a com- plete multiple alignment. A different approach would align [1] E. W. 
Myers, “The fragment assembly string graph,” Bioinfor- all the sequences at the same time, using sum-of-pairs or matics, vol. 21, suppl. 2, pp. ii79–ii85, 2005. some other method to average conditional costs based on [2]S.F.Altschul,T.L.Madden,A.A.Schaffer, et al., “Gapped each of the n × (n − 1)/2 pairs of sequences. BLAST and PSI-BLAST: a new generation of protein database A third direction for future research is to extend the search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. canonical sequence expressions or the equivalent grammar 3389–3402, 1997. to include other forms of descriptions of regions of similarity. [3] A. J. Phillips, “Homology assessment and molecular sequence OneideaistousePROSITEblocks[40] as “subroutines” that alignment,” Journal of Biomedical Informatics, vol. 39, no. 1, can be embedded in blocks. For example, PROSITE block pp. 18–33, 2006. PS00007 is [RK]-x(2, 3)-[DE]-x(2, 3)-Y, using a notation sim- [4] J. O. Wrabl and N. V. Grishin, “Gaps in structurally similar ilar to a regular expression where a string in brackets means proteins: towards improvement of multiple sequence align- ment,” Proteins, vol. 54, no. 1, pp. 71–87, 2004. “any one of these letters” and x(2, 3) means “any sequence between 2 and 3 letters long.” A string that matches this pat- [5] K. Sjolander,¨ “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics, vol. 20, tern, RDIKDPEY, occurs in one of the Plasmodium sequences no. 2, pp. 170–179, 2004. discussed in Section 4.1. A block for the region containing [6]B.-J.M.Webb,J.S.Liu,andC.E.Lawrence,“BALSA:Bayesian this pattern might include a reference to the PROSITE block, algorithm for local sequence alignment,” Nucleic Acids Re- for example, instead of search, vol. 30, no. 5, pp. 1268–1277, 2002. #DLLRDIKDPEYSYT (9) [7] J. Rissanen, “Modelling by the shortest data description,” Au- tomatica, vol. 14, no. 5, pp. 465–471, 1978. the block would be something like [8] P. Grunwald,¨ “A minimum description length approach to #DLL ps00007 (R, DIK, D, PE) SYT, (10) grammar inference,” in Connectionist, Statistical, and Sym- bolic Approaches to Learning for Natural Language Processing, where the arguments to the procedure call are pieces of vol. 1040 of Lecture Notes in Computer Science, pp. 203–216, the sequence to plug in to the pattern. A benefit from us- Springer, Berlin, Germany, 1996. 14 EURASIP Journal on Bioinformatics and Systems Biology

[9] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen, “Pattern dis- [28] D. W. Mount, Bioinformatics: Sequence and Genome Analysis, covery in biosequences,” in International Conference on Gram- Cold Spring Harbor Laboratory Press, New York, NY, USA, mar Inference (ICGI ’98),V.HonavarandG.Slutski,Eds., 2nd edition, 2004. vol. 1433 of Lecture Notes in Artificial Intelligence, pp. 257–270, [29] S. Henikoff andJ.G.Henikoff, “Amino acid substitution Springer, Ames, Iowa, USA, 1998. matrices from protein blocks,” Proceedings of the National [10] L. Cai, R. L. Malmberg, and Y. Wu, “Stochastic modeling Academy of Sciences of the United States of America, vol. 89, of RNA pseudoknotted structures: a grammatical approach,” no. 22, pp. 10915–10919, 1992. Bioinformatics, vol. 19, suppl. 1, pp. i66–i73, 2003. [30] G. H. Gonnet, M. A. Cohen, and S. A. Benner, “Exhaustive [11] D. B. Searls, “The computational linguistics of biological matching of the entire protein sequence database,” Science, sequences,” in Artificial Intelligence and Molecular Biology, vol. 256, no. 5062, pp. 1443–1445, 1992. pp. 47–120, American Association for Artificial Intelligence, [31] S. Karlin and S. F. Altschul, “Methods for assessing the statis- Menlo Park, Calif, USA, 1993. tical significance of molecular sequence features by using gen- [12] D. Bsearls, “Linguistic approaches to biological sequences,” eral scoring schemes,” Proceedings of the National Academy of Computer Applications in the Biosciences, vol. 13, no. 4, pp. Sciences of the United States of America, vol. 87, no. 6, pp. 2264– 333–344, 1997. 2268, 1990. [13] A. Bairoch, “PROSITE: a dictionary of sites and patterns in [32] S. R. Eddy, “Where did the BLOSUM62 alignment score ma- proteins,” Nucleic Acids Research, vol. 20, pp. 2013–2018, 1992. trix come from?” Nature Biotechnology,vol.22,no.8,pp. [14] M. Vingron and M. S. Waterman, “Sequence alignment and 1035–1036, 2004. penalty choice. Review of concepts, case studies and implica- [33] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL tions,” Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, W: improving the sensitivity of progressive multiple sequence 1994. alignment through sequence weighting, position-specific gap [15] S. Henikoff, “Scores for sequence searches and alignments,” penalties and weight matrix choice,” Nucleic Acids Research, Current Opinion in Structural Biology, vol. 6, no. 3, pp. 353– vol. 22, no. 22, pp. 4673–4680, 1994. 360, 1996. [34] J. D. Thompson, F. Plewniak, and O. Poch, “A comprehensive [16] G. Giribet and W. C. Wheeler, “On gaps,” Molecular Phyloge- comparison of multiple sequence alignment programs,” Nu- netics and Evolution, vol. 13, no. 1, pp. 132–143, 1999. cleic Acids Research, vol. 27, no. 13, pp. 2682–2690, 1999. [17] Y. Nozaki and M. Bellgard, “Statistical evaluation and compar- [35] C. Aurrecoechea, M. Heiges, H. Wang, et al., “ApiDB: inte- ison of a pairwise alignment algorithm that a priori assigns the grated resources for the apicomplexan bioinformatics resource number of gaps rather than employing gap penalties,” Bioin- center,” Nucleic Acids Research, vol. 35, pp. D427–D430, 2007. formatics, vol. 21, no. 8, pp. 1421–1428, 2005. [36] R. Carter, “Speculations on the origins of Plasmodium vivax [18] J. T. Reese and W. R. Pearson, “Empirical determination of ef- malaria,” Trends in Parasitology, vol. 19, no. 5, pp. 214–219, fective gap penalties for sequence comparison,” Bioinformatics, 2003. vol. 18, no. 11, pp. 1500–1507, 2002. [37] M. Cline, R. Hughey, and K. 
Karplus, “Predicting reliable re- [19] L. Allison, C. S. Wallace, and C. N. Yee, “Finite-state models in gions in protein sequence alignments,” Bioinformatics, vol. 18, the alignment of macromolecules,” JournalofMolecularEvo- no. 2, pp. 306–314, 2002. lution, vol. 35, no. 1, pp. 77–89, 1992. [38] J. S. Conery and M. Lynch, “Nucleotide substitutions and the [20] J. P. Schmidt, “An information theoretic view of gapped and evolution of duplicate genes,” in Proceedings of the 6th Pacific other alignments,” in Proceedings of the 3rd Pacific Symposium Symposium on Biocomputing (PSB ’01), pp. 167–178, Big Is- on Biocomputing (PSB ’98), pp. 561–572, Maui, Hawaii, USA, land of Hawaii, Hawaii, USA, January 2001. January 1998. [39] W. R. Pearson, “Comparison of methods for searching protein [21] T. Aynechi and I. D. Kuntz, “An information theoretic ap- sequence databases,” Protein Science, vol. 4, no. 6, pp. 1145– proach to macromolecular modeling: I. Sequence alignments,” 1160, 1995. Biophysical Journal, vol. 89, no. 5, pp. 2998–3007, 2005. [40] N. Hulo, A. Bairoch, V. Bulliard, et al., “The PROSITE [22] B. Morgenstern, “DIALIGN 2: improvement of the segment- database,” Nucleic Acids Research, vol. 34, pp. D227–D230, to-segment approach to multiple sequence alignment,” Bioin- 2006. formatics, vol. 15, no. 3, pp. 211–218, 1999. [23] M. Brudno, M. Chapman, B. Gottgens,¨ S. Batzoglou, and B. Morgenstern, “Fast and sensitive multiple alignment of large genomic sequences,” BMC Bioinformatics, vol. 4, p. 66, 2003. [24] T. D. Schneider, “Information content of individual genetic sequences,” Journal of Theoretical Biology, vol. 189, no. 4, pp. 427–441, 1997. [25] N. Krasnogor and D. A. Pelta, “Measuring the similarity of protein structures by means of the universal similarity metric,” Bioinformatics, vol. 20, no. 7, pp. 1015–1021, 2004. [26] J. S. Conery, “Realign: grammar-based sequence alignment,” University of Oregon, http://teleost.cs.uoregon.edu/realign. [27] M. O. Dayhoff,R.M.Schwartz,andB.C.Orcutt,“Amodelof evolutionary change in proteins,” in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345–352, Washington, DC, USA, 1978. Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 43670, 16 pages doi:10.1155/2007/43670

Research Article MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress

Scott C. Evans,1 Antonis Kourtidis,2 T. Stephen Markham,1 Jonathan Miller,3 Douglas S. Conklin,2 and Andrew S. Torres1

1 GE Global Research, One Research Circle, Niskayuna, NY 12309, USA 2 Gen*NY*Sis Center for Excellence in Cancer Genomics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144, USA 3 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA

Received 1 March 2007; Revised 12 June 2007; Accepted 23 June 2007

Recommended by Peter Grünwald

We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequence without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.

Copyright © 2007 General Electric Company. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION though it is believed that all information about a miRNA’s targets is encoded in its sequence, attempts to identify targets The discovery of RNA interference (RNAi) [1]andcertain by informatics methods have met with limited success, and of its endogenous mediators, the microRNAs (miRNAs), has the requirements on a target site for a miRNA to regulate a catalyzed a revolution in biology and medicine [2, 3]. MiR- cognate mRNA are not fully understood. To date, over 500 NAs are transcribed as long (∼1000 nt) “pri-miRNAs,” cut distinct miRNAs have been discovered in humans, and esti- into small (∼70 nt) stem-loop “precursors,” exported into mates of the total number of human miRNAs range well into the cytoplasm of cells, and processed into short (∼20 nt) the thousands. Complex algorithms to predict which specific single-stranded RNAs, which interact with multiple proteins genes these miRNAs regulate often yield dozens or hundreds to form a superstructure known as the RNA-induced silenc- of distinct potential targets for each miRNA [4–6]. Because ing complex (RISC). The RISC binds to sequences in the of the technical difficulty of testing, all potential targets of a 3untranslated region (3UTR) of mature messenger RNA single miRNA, there are few, if any, miRNAs whose activities (mRNA) that are partially complementary to the miRNA. have been thoroughly characterized in mammalian cells. This Binding of the RISC to a target mRNA induces inhibition problem is of singular importance because of evidence sug- of protein translation by either (i) inducing cleavage of the gesting links between miRNA expression and human disease, mRNA or (ii) blocking translation of the mRNA. MiRNAs for example chronic lymphocytic leukemia and lung cancer therefore represent a nonclassical mechanism for regulation [7, 8]; however, the genes affected by these changes in miRNA of gene expression. expression remain unknown. MiRNAs can be potent mediators of gene expression, and MiRNA genes themselves were opaque to standard in- this fact has lead to large-scale searches for the full com- formatics methods for decades in part because they are plement of miRNAs and the genes that they regulate. Al- primarily localized to regions of the genome that do not 2 EURASIP Journal on Bioinformatics and Systems Biology

Figure 1 flowchart: Start with initial sequence → Check descendents for best SCR grammar rule → λ < 1? Gain > Gmin? → Yes: Update codebook, array (repeat); No: Encode, done.

Figure 1 example: input sequence GAAGTGCAGTGAAGTGCAGTGTCAGTGCT, segmented as GA AGTG CAGTGAAGTG CAGTGTC AGTG CT. Candidate phrases: length 10, GAAGTGCAGT, locations 1, 11, 2 repeats; length 4, AGTG, locations 3, 8, 13, 18, 24, 5 repeats (best OSCR phrase). The accompanying plot shows SCR for a length-2 symbol repeated L/2 times and for a maximum-length symbol repeated 2 times, as a function of symbol length.

Figure 1: The OSCR algorithm. Phrases that recursively contribute most to sequence compression are added to the model first. The motif AGTG is the first selected and added to OSCR’s MDL model. A longest match algorithm would not call out this motif. code for protein. Informatics techniques designed to iden- grammar-based codes do not achieve the compression of tify protein-coding sequences, transcription factors, or other DNACompress [19](see[20] for a comparison and addi- known classes of sequence did not resolve the distinctive sig- tional approach using dynamic programming), the structure natures of miRNA hairpin loops or their target sites in the of these algorithms is attractive for identifying biologically 3UTRs of protein-coding genes. In this sense, apart from meaningful phrases. The compression achieved by our algo- comparative genomics, sequence analysis methods tend to be rithm exceeds that of DNA Sequitur while retaining a two- best at identifying classes of sequence whose biological signif- part code that highlights biologically significant phrases. Dif- icance is already known. ferences between MDLcompress and GREEDY will be dis- Minimum description length (MDL) principles [9]of- cussed later. The deep recursion of our approach combined fer a general approach to de novo identification of biologi- with its two-part coding makes our algorithm uniquely able cally meaningful sequence information with a minimum of to identify biologically meaningful sequence de novo with a assumptions, biases, or prejudices. Their advantage is that minimal set of assumptions. In processing a gene transcript, they address explicitly the cost capability for data analysis we selectively identify sequences that are (i) short but oc- without over fitting. The challenge of incorporating MDL cur frequently (e.g., codons, each 3 nucleotides) and (ii) se- into sequence analysis lies in (a) quantification of appropri- quences that are relatively long but occur only a small num- ate model costs and (b) tractable computation of model in- ber of times (e.g., miRNA target sites, each ∼20 nucleotides ference. A grammar inference algorithm that infers a two- or more). An example is shown in Figure 1, where given part minimum description length code was introduced in the input sequence shown, OSCR highlights the short motif [10], applied to the problem of information security in [11] AGTG that occurs five times, over a longer sequence that oc- and to miRNA target detection in [12]. This optimal symbol curs only twice. Other model inference strategies would by- compression ratio (OSCR) algorithm produces “meaningful pass by this short motif. models” in an MDL sense while achieving a combination of In this paper, we describe initial results of miRNA anal- model and data whose descriptive size together represents an ysis using OSCR and introduce improvements to OSCR that estimate of the Kolmogorov complexity of the dataset [13]. reduce execution time and enhance its capacity to iden- We anticipate that this capacity for capturing the regularity tify biologically meaningful sequence. These modifications, of a data set within compact, meaningful models will have some of which were first introduced in [21], retain the deep wide application to DNA sequence analysis. recursion of the original algorithm but exploit novel data MDL principles were successfully applied to segment structures that make more efficient use of time and mem- DNA into coding, noncoding, and other regions in [14]. 
ory by gathering phrase statistics in a single pass and subse- The normalized maximum likelihood model (an MDL al- quently selecting multiple codebook phrases. Our data struc- gorithm) [15] was used to derive a regression that also ture incorporates candidate phrase frequency information achieves near state-of-the-art compression. Further MDL- and pointers identifying location of candidate phrases in related approaches include the “greedy offline”—GREEDY— the sequence, enabling efficient computation. MDL model algorithm [16] and DNA Sequitur [17, 18]. While these inference refinement is achieved by improving heuristics, Scott C. Evans et al. 3

Figure 2 contents: three nested set models for the 128-bit string. The set of all 128-bit strings has 2^128 ≈ 3.4 × 10^38 elements; the set of 128-bit strings with 64 ones has about 2^124 elements; the set of 128-bit strings alternating 1 and 0 has two elements, 1010···10 and 0101···01.

Figure 2: Two-part representations of a 128-bit string. As the length of the model increases, the size of the set including the target string decreases. harnessing redundancies associated with palindrome data, Asdiscussedin[22], an MDL decomposition of a binary and taking advantage of local sequence similarity. Since it string x considering finite set models can be separated into now employs a suite of heuristics and MDL compression two parts, methods, including but not limited to the original symbol =+ | | compression ratio (SCR) measure, we refer to this improved Kϕ(x) K(S) + log2 S ,(2) algorithm as MDLcompress, reflecting its ability to apply MDL principles to infer grammar models through multiple where again Kϕ(x) is the Kolmogorov complexity for string x heuristics. on universal computer ϕ. S represents a finite set of which x We hypothesized that MDL models could discover bio- is a typical (equally likely) element. The minimum possible logically meaningful phrases within genes, and after sum- sum of descriptive cost for set S (the model cost encompass- marizing briefly our previous work with OSCR, we present ing all regularity in the string) and the log of the sets cardi- here the outcome of an MDLcompress analysis of 144 genes nality (the required cost to enumerate the equally likely set overexpressed in the breast cancer cell line, BT474. Our algo- elements) correspond to an MDL two-part description for rithm has identified novel motifs including potential miRNA string x, a model portion that describes all redundancy in the binding sites that are being considered for in vitro validation string, and a data portion that uses the model to define the studies. We further introduce a “bits per nucleotide” MDL specific string. Figure 2 shows how these concepts are mani- weighting from MDLcompress models and their inherent bi- fest in three two-part representations of the 128 binary string ··· ologically meaningful phrases. Using this weighting, “suscep- 101010 10. In this representation, the model is defined in tible” areas of sequence can be identified where an SNP dis- English language text that defines a set, and the log2 of the proportionately affects MDL cost, indicating an atypical and number of elements in the defined set is the data portion potentially pathological change in genomic information con- of the description. One representation would be to identify tent. this string by an index of all possible 128-bit strings. This in- volves a very small model description, but a data description of 128 bits, so no compression of descriptive cost is achieved. 2. MINIMUM DESCRIPTION LENGTH (MDL) A second possibility is to use additional model description to PRINCIPLES AND KOLMOGOROV COMPLEXITY restrict the set size to contain only strings with equal num- MDL is deeply related to Kolmogorov complexity, a measure ber of ones and zeros, which reduces the cardinality of the set of descriptive complexity contained in an object. It refers to by a few bits. A more promising approach will use still more the minimum length l of a program such that a universal model description to identify the set of alternating pattern of computer can generate a specific sequence [13]. Kolmogorov ones and zeros that could contain only two strings. 
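A small sketch, assuming the three finite-set models of Figure 2, computes only the data term log2 |S| of each two-part description; the model cost K(S) would be added to each before comparing totals.

```python
import math

# Data term log2|S| for the three finite-set models of the string "10" * 64.
n = 128
data_bits = {
    "all 128-bit strings": float(n),
    "128-bit strings with 64 ones": math.log2(math.comb(128, 64)),
    "128-bit strings alternating 1 and 0": math.log2(2),
}
for model, bits in data_bits.items():
    print(f"{model}: log2|S| = {bits:.1f} bits")   # about 128.0, 124.2, 1.0
```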
Among complexity can be described as follows, where ϕ represents a all possible two-part descriptions of this string the combina- universal computer, p represents a program, and x represents tion that minimizes the two-part descriptive cost is the MDL a string: description. ff This example points out a major di erence between Shannon entropy and Kolmogorov complexity. The first- Kϕ(x) = min l(p) . (1) ϕ(p)=x order empirical entropy of the string 101010 ···10 is very 4 EURASIP Journal on Bioinformatics and Systems Biology

minimum description length code and an estimate of the algorithmic minimum sufficient statistic [10, 11]. OSCR produces "meaningful models" in an MDL sense, while achieving a combination of model plus data whose descriptive size together estimate the Kolmogorov complexity of the data set. OSCR's capability for capturing the regularity of a data set into compact, meaningful models has wide application for sequence analysis. The deep recursion of our approach combined with its two-part coding nature makes our algorithm uniquely able to identify meaningful sequences without limiting assumptions.

The entropy of a distribution of symbols defines the average per symbol compression bound in bits per symbol for a prefix free code. Huffman coding and other strategies can

∗ produce an instantaneous code approaching the entropy in k K(x) n k the limit of infinite message length when the distribution is (bits) known. In the absence of knowledge of the model, one way Figure 3: This figure shows the Kolmogorov structure function. As to proceed is to measure the empirical entropy of the string. the model size (k) is allowed to increase, the size of the set (n) in- However, empirical entropy is a function of the partition and cluding string x with an equally likely probability decreases. k∗ in- depends on what substrings are grouped together to be con- dicates the value of the Kolmogorov minimum sufficient statistic. sidered symbols. Our goal is to optimize the partition (the number of symbols, their length, and distribution) of a string such that the compression bound for an instantaneous code, high, since the numbers of ones and zeros are equal. How- (the total number of encoded symbols R time entropy Hs) ever, intuitively the regularity of the string makes it seem plus the codebook size is minimized. We define the approx- strange to call it random. By considering the model cost, as imate model descriptive cost M to be the sum of the lengths well as the data costs of a string, MDL theory provides a for- of unique symbols, and total descriptive cost Dp as follows: mal methodology that justifies objectively classifying a string as something other than a member of the set of all 128 bit M ≡ li, Dp ≡ M + R · Hs. (4) binary. These concepts can be extended beyond the class of i models that can be constructed using finite sets to all com- While not exact (symbol delimiting “comma costs” are ig- putable functions [22]. nored in the model, while possible redundancy advantages The size of the model (the number of bits allocated to are not considered either), these definitions provide an ap- spelling out the members of set S) is related to the Kol- proximate means of breaking out MDL costs on a per symbol mogorov structure function, (see [23]). defines the small- basis. The analysis that follows can easily be adapted to other est set, S, that can be described in at most k bits and contains model cost assumptions. a given string x of length n, n | = | | 2.1. Symbol compression ratio k x n min log2 S . (3) p:l(p)

Figure 4: SCR versus symbol length (bits) for a 1024-bit string, plotted for various numbers of repeats (10, 20, 40, and 60).

Figure 5: OSCR example. Statistics for the string x = "a rose is a rose is a rose" (26 characters; terminal frequencies a: 3, space: 7, r: 3, o: 3, s: 5, e: 3, i: 2). SCR is computed from phrase length l and frequency r; for example, l = 2, r = 3 gives R = 26 − 3 = 23 and SCR = 1.023; l = 6, r = 3 gives R = 26 − 3(5) = 11 and SCR = 0.5; l = 7, r = 2 gives R = 26 − 2(6) = 14 and SCR = 0.7143.
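Before formalizing the symbol compression ratio, it may help to see the cost model of (4) in action. The sketch below is our own illustration, not code from the paper: it scores a candidate partition by the simplified M + R·H_s of (4), ignoring the comma costs noted above, for the Figure 5 string with and without the phrase "a rose" in the codebook.

```python
import math
from collections import Counter

def dp_cost(symbols):
    """Total descriptive cost D_p = M + R * H_s for a given partition (eq. (4)).
    M is the summed length of the unique symbols; H_s is the empirical entropy
    of the symbol stream; R is the number of encoded symbols."""
    counts = Counter(symbols)
    R = len(symbols)
    H_s = -sum((c / R) * math.log2(c / R) for c in counts.values())
    M = sum(len(s) for s in counts)   # model cost: spell out each unique symbol
    return M + R * H_s

text = "a rose is a rose is a rose"

# Partition 1: individual characters only (no model phrases).
chars = list(text)

# Partition 2: treat "a rose" as a single symbol, characters elsewhere.
phrase = "a rose"
parts = text.split(phrase)
with_phrase = []
for i, chunk in enumerate(parts):
    with_phrase.extend(chunk)
    if i < len(parts) - 1:
        with_phrase.append(phrase)

print("characters only :", round(dp_cost(chars), 1), "bits")
print("with 'a rose'   :", round(dp_cost(with_phrase), 1), "bits")
```

With the phrase in the codebook the symbol stream is shorter but the model is larger; whether D_p drops is exactly the trade-off the SCR heuristic below is designed to capture.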

…enables a per-symbol formulation for D_p and results in a conservative approximation for R log2(R) over the likely range of R. The per-symbol descriptive cost can now be formulated:

d_i = r_i (log2 R − log2 r_i) + l_i.  (7)

Thus, we have a heuristic that conservatively estimates the descriptive cost of any possible symbol in a string, considering both model and data (entropy) costs. A measure of the compression ratio for a particular symbol is simply the descriptive length of the string divided by the length of the string "covered" by this symbol. We define the symbol compression ratio (SCR) as

λ_i = d_i / L_i = (r_i (log2 R − log2 r_i) + l_i) / (l_i r_i).  (8)

This heuristic describes the "compression work" a candidate symbol will perform in a possible partition of a string. Examining SCR in Figure 4, it is clear that a good symbol compression ratio arises, in general, when symbols are long and repeated often. But clearly, selection of some symbols as part of the partition is preferred to others. Figure 4 shows how the symbol compression ratio varies with the length of symbols and the number of repetitions for a 1024-bit string.

3. OSCR ALGORITHM

The optimal symbol compression ratio (OSCR) algorithm forms a partition of string S into symbols that have the best symbol compression ratio (SCR) among the possible symbols contained in S. The algorithm is as follows.

(1) Starting with an initial alphabet, form a list of substrings contained in S, possibly with user-defined constraints on minimum frequency and/or maximum length, and note the frequency of each substring.
(2) Calculate the SCR for all substrings. Select the substring from this set with the smallest SCR and add it to the model M.
(3) Replace all occurrences of the newly added substring with a unique character.
(4) Repeat steps 1 through 3 until no suitable substrings are found.
(5) When a full partition has been constructed, use Huffman coding or another coding strategy to encode the distribution, p, of symbols.

The following comments apply.

(1) This algorithm progressively adds to the code space the symbols that do the most compression "work" among all the candidates. Replacement of these symbols leftmost-first will alter the frequency of the remaining symbols.
(2) A less exhaustive search for the optimal SCR candidate is possible by concentrating on the tree branches that dominate the string or by searching only certain phrase sizes.
(3) The initial alphabet of terminals is user supplied.

3.1. Example

Consider the phrase "a rose is a rose is a rose" with ASCII characters as the initial alphabet. The initial tree statistics and λ calculations provide the metrics shown in Figure 5. The numbers across the top indicate the frequency of each symbol, while the numbers along the left indicate the frequency of phrases. Here we see that the initial string consists of seven terminals {a, ' ', r, o, s, e, i}. Expanding the tree with substrings beginning with the terminal a shows that there are 3 occurrences of the substrings

"a", "a ", "a r", "a ro", "a ros", "a rose",  (9)

but only 2 occurrences of longer substrings, for each of which the λ values consequently increase, leaving the phrase {a rose} as the candidate with the smallest λ. Here we see the unique nature of the λ heuristic, which does not choose necessarily

Grammar Model (set) tic, and thus has more stringent separation of model and data costs and more specific model cost calcula- S1 a rose a rose f (S ) = 1 S1 1 tions resulting in greater specificity. S2 is S1 S2 is S1 f (S2) = 2 S S1S2S2 (3) As described in [21] and will be discussed in later sections, the computational architecture of MDLcom- Equally likely musings: press differs from the suffix tree with counts architec- ⎧ ⎫ ⎧ ⎫ ⎪ ⎪ ⎪ ⎪ ⎨S1S2S2⎬ ⎨ a rose is a rose is a rose ⎬ ture of GREEDY. Specifically, MDLcompress gathers TypicalSet = S S S = is a rose a rose is a rose statistics in a single pass and then updates the data ⎩⎪ 2 1 2⎭⎪ ⎩⎪ ⎭⎪ S2S2S1 is a rose is a rose a rose structure and statistics after selecting each phrase as ffi Figure 6: OSCR grammar example model summary. opposed GREEDY’s practice of reforming the su x tree with counts data structure at each iteration. Another comparable grammar-based code is Sequitur, a linear time grammar inference algorithm [17, 18]. In this pa- the most frequently repeating symbol, or the longest match per, we show MDLcompress to exceed Sequitur’s ability to but rather a combination of length and redundancy. A sec- compress. However, it does not match Sequitur’s linear run ond iteration of the algorithm produces the model described time performance. in Figure 6. Our grammar rules enable the construction of a typical set of strings where each phrase has frequency shown the model block of Figure 6. One can think of MDL prin- 4. MIRNA TARGET DETECTION USING OSCR ciples applied in this way as analogous to the problem of finding an optimal compression code for a given dataset x with the added constraint that the descriptive cost of the In [12], we described our initial application of the OSCR al- codebook must also be considered. Thus, the cost of send- gorithm to the identification of miRNA target sites. We se- lected a family of genes from Drosophila (fruit fly) that con- ing “priors” (a codebook or other modeling information) is considered in the total descriptive cost in addition to tain in their 3 UTRs conserved sequence structures previ- the descriptive cost of the final compressed data given the ously described by Lai [24]. These authors observed that model. a highly-conserved 8-nucleotide sequence motif, known as = 5 cUGUGAUa 3; antisense = 5 uAU- The challenge of incorporating MDL in sequence analy- a K-box (sense CACAg) and located in the 3UTRs of Brd and bHLH gene sis lies in the quantification of appropriate model costs and families, exhibited strong complementarity to several fly tractable computation of model inference. Hence, OSCR has miRNAs, among them miR-11. These motifs exhibited a role been improved and optimized through additional heuristics in posttranscriptional regulation that was at the time unex- and a streamlined architecture and renamed MDLcompress, plained. which will be described in detail in later sections. MDLcom- The OSCR algorithm constructed a phrasebook consist- press forms an estimate of the strings algorithmic minimum ing of nine motifs, listed in Figure 7 (top) to optimally par- sufficient statistic by adding bits to the model until no ad- tition the adjacent set of sequences, in which the motifs ditional compression can be realized. MDLcompress retains are color coded. The OSCR algorithm correctly identified the deep recursion of the original algorithm but improve the most redundant antisense sequence (AUCACA) from the speed and memory use through novel data structures that several examples it was presented. 
allow gathering of phrase statistics in a single pass and subse- The input data for this analysis consists of 19 sequences, quent selection of multiple codebook phrases with minimal each 18 nucleotides in length (Figure 7). From these se- computation. quences, OSCR generated a model consisting of grammar MDLcompress and OSCR are not alone in the grammar “variables” through that map to individual nucleotides inference domain. GREEDY, developed by Apostolico and S1 S4 (grammar “terminals”), the variable that maps to the nu- Lonardi [16], is similar to MDLcompress and OSCR, but dif- S5 cleotide sequence, AUCACA, and four shorter motifs – . fer in three major areas. S6 S9 The phrase S5 turns out to be a putative target of several dif- (1) MDLcompress is deeply recursive in that the algorithm ferent miRNAs, including miR-2a, miR-2b, miR-6, miR-13a, does not remove phrases from consideration for com- miR-13b, and miR-11. OSCR identified as S9 a2nucleotide pression after they have been added to the model. The sequence (5 GU 3) that is located immediately downstream “loss of compressibility” inherent in adding a phrase of the K-box motif. The new consensus sequence would read to the model was one of the motivations of developing 5 AUCACAGU 3 and has a greater degree of homology the SCR heuristic—preventing a “too greedy” absorp- to miR-6 and miR-11 than to other D. melanogaster miR- tion of phases from preventing optimal total compres- NAs. In vivo studies performed subsequent to the original sion. With MDLcompress, since we look in the model Lai paper demonstrated the specificity of miR-11 activity as well for phrases to compress, we find that generally on the Bob-A,B,C, E(spl)ma, E(spl)m4, and E(spl)md genes the total compression heuristic at each phase gives the [25]. best performance as will be discussed later. In a separate analysis, we applied OSCR to the sequence (2) MDLcompress was designed with the express intent of of an individual fruit fly gene transcript, BobA (accession estimating the algorithmic minimum sufficient statis- NM 080348; Figure 7, bottom). Only the BobA transcript Scott C. Evans et al. 7

OSCR analysis of Brd family and bHlH repressor GGUCACAUCACAGAUACU • Motif: AUCACA first phrase added S1 G CUCGUCAUCACAGUUGGA CGAUUAAUCACAAUGAGU • GUU second phrase added S2 U UCCUCGAUCACAGUUGGA • CU, AU, and GU also called out GGUGCUAUCACAAUGUUU S3 C UGUUUUAUCACAAUAUCU AUUAGUAUCACAUCAACA S4 A AAAUGUAUCACAAUUUUU GUUGAUAUCACAAAUGUA BobA gene from Drosophila melanogaster with S5 AUCACA AAGACUAUCACACUUGGU K-box and GY-box motifs highlighted. the UACAAAAUCACAGCUGAA S6 GUU AGGAACAUCACAUCAUAU BobA gene is potentially regulated by miR-11 AGAACUAUCACAGGAACA (K-box specificity) and miR-7 (GY-box S7 CU UUAGUUAUCACAUGAACU AGUUAUAUCACAGUUGAA specificity). For clarity of exposition, stop and S8 AU CAGGCCAUCACACGGGAG start codons underlined in red. UGCCCUAUCACAGACUUA S9 GU UGGGCUAUCACAGAUGCG GUUGCCAUCACAGUUGGG

1 aacaguucuccauccgagcagaucauaaguaaccaaccugcaaaauguucaccgaaaccg 61 cucuuguuuccaacuucaauggagugacagagaagaaaucucuuaccggcgccuccacca 121 accugaagaagcugcugaagaccaucaagaaggucuucaagaacuccaagccuucgaagg 181 agauuccgauccccaacaucaucuacucuugcaauacugaggaggagcaccagaauuggc 241 ucaacgaacaacuggaggccauggcaauccaucuucacugaguucuucugggacaucccc 301 cuccaucgaguaucugugaugugacccgaucaaaaggucuauaaaucggcacuccggcuu 361 uaauauccaacugugaugacgagaacacaagacugacugacuugugugccuuggagguga 421 caaaguucgucgccucugccaacuguacauaucaaacuagcugcuaaaaugucuucaauu 481 augcuuuaauguagucuaaguuaguauuaucauugucuuccauuaguuuaagaaaaucau 541 ugucuuccauguuuguuuguuaggguaaaaaaaacuagcuuaagaauaaaaaucccucgc 601 ggaaagaaaacaau

Figure 7: Motif analysis of 19 sequences each of which is believed to contain a single target site for miR-11 from fruit fly. (Top) OSCR adds the variable S5 to its MDL codebook, the K-box motif, which has been shown to be a miRNA target site for miR-11. (Bottom) Full sequence of BobA gene transcript with K-box and GY box motifs underlined in blue text. The K-box motif (CUGUGAUG) is a target site for miR-11 and the GY-box motif (UGUCUUCCAU) is a target site for miR-7.
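A minimal, self-contained sketch of this kind of phrase selection is given below. It is our illustration rather than the published OSCR implementation: it scores every repeated substring of a concatenated input with the λ heuristic of (8) and reports the minimizer. The input uses three of the sequences listed in Figure 7, for which the winning phrase is the shared AUCACA core of the K-box antisense motif.

```python
import math
from collections import defaultdict

def best_phrase(text, min_len=2, max_len=12):
    """One OSCR-style iteration: score every repeated substring with the
    SCR heuristic lambda = (r*(log2 R - log2 r) + l) / (l*r) from eq. (8),
    approximating R as the symbol count after substitution, and return the
    candidate with the smallest value."""
    occurrences = defaultdict(list)
    for l in range(min_len, max_len + 1):
        for i in range(len(text) - l + 1):
            occurrences[text[i:i + l]].append(i)

    best, best_lam = None, float("inf")
    for phrase, idx in occurrences.items():
        r, l = len(idx), len(phrase)
        if r < 2:
            continue  # unique substrings cannot do any compression work
        R = len(text) - r * (l - 1)   # length after replacing each occurrence
        lam = (r * (math.log2(R) - math.log2(r)) + l) / (l * r)
        if lam < best_lam:
            best, best_lam = phrase, lam
    return best, best_lam

# Three of the 18-nucleotide sequences from Figure 7, joined with a separator.
seqs = ["GGUCACAUCACAGAUACU", "CGAUUAAUCACAAUGAGU", "GGUGCUAUCACAAUGUUU"]
print(best_phrase("#".join(seqs)))
```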

itself entered this second analysis, which was performed 5. MDLcompress independently of the multisequence analysis described in the paragraph above. The sense sequence of BobA is displayed The new MDLcompress algorithmic tool retains the fun- in Figure 2 with the 5UTR indicated in green; the 237 nu- damental element of OSCR—deeply—recursive heuristic- cleotides (79 codons) of the coding sequence in red; and based grammar inference, while trading computational com- the 3UTR in blue. OSCR identified the underlined motifs, plexity for space complexity to decrease execution time. The (cugugaug) and (ugucuuccau). These two motifs turn out compression and hence the ability of the algorithm to iden- not only to be conserved among multiple Drosophila sub- tify specific motifs (which we hypothesize to be of potential species, but also to be targets of two distinct miRNAs: the K- biological significance) have been enhanced by new heuris- box motif (cugugaug) is a target of miR-11 and the GY-box tics and an architecture that searches not only the sequence (ugucuuccau) a target of miR-7. Although we did not per- but also the model for candidate phrases. The performance form OSCR analysis on any additional genes, this motif had has been improved by gathering statistics about potential been identified previously in several 3UTRs, including those code words in a single pass and forming and maintaining of BobA, E(spl)m3, E(spl)m4, E(spl)m5, and Tom [23, 24]. simple matrix structures to simplify heuristic calculations. The BobA gene is particularly sensitive to miR-7. Mutants Additional gains in compression are achieved by tuning the of the BobA gene with base-pair disrupting substitutions at algorithm to take advantage of sequence-specific features both sites of interaction with miR-7 yielded nearly complete such as palindromes, regions of local similarity, and SNPs. loss of miR-7 activity [25] both in vivo and in vitro. These observations are consistent with studies from [26, 27] that 5.1. Improved SCR heuristic reveal specific sequence-matching requirements for effective miRNA activity in vitro. MDLcompress uses steepest-descent stochastic-gradient In summary, the OSCR algorithm identified (i) a methods to infer grammar-based models based upon phrases previously-known 8-nucleotide sequence motif in 19 differ- that maximize compression. It estimates an algorithmic min- ent sequence and (ii) in an entirely independent analysis, imum sufficient statistic via a highly recursive algorithm identified 2 sequence motifs, the K-box and GY-box, within that identifies those motifs enabling maximal compression. the BobA gene transcript. We now describe innovative re- A critical innovation in the OSCR algorithm was the use of finements to our MDL-based DNA compression algorithm a heuristic, the symbol compression ratio (SCR), to select with the goal of improved identification and analysis of bio- phrases. A measure of the compression ratio for a particular logically meaningful sequence—particularly miRNA targets symbol is simply the descriptive length of the string divided related to breast cancer. by number of symbols—grammar variables and terminals 8 EURASIP Journal on Bioinformatics and Systems Biology encoded by this symbol in the phrasebook. We previously de- fined the SCR for a candidate phrase i as 12 − di ri log2(R) log2 ri + li 10 λi = = (10) Li liri 8 for a phrase of length ,repeated times in a string of total 6 li ri SCR length L,withR denoting the total number of symbols in the 4 candidate partition. 
The numerator in (10) consists of the MDL descriptive cost of the phrase if it is added to the model and encoded, while the denominator is an estimate of the unencoded descriptive cost of the candidate phrase. This heuristic encapsulates the net gain in compression per symbol that a candidate phrase would contribute if it were added to the model.

While (10) represents a general heuristic for determining the partition of a sequence that provides the best compression, important effects are not taken into account by this measure. For example, adding new symbols to a partition increases the coding costs of other symbols by a small amount. Furthermore, for any given length and frequency, certain symbols ought to be preferred over others because of probability distribution effects. Thus, we desire an SCR heuristic that more accurately estimates the potential symbol compression of any candidate phrase.

To this end, we can separate the costs accounted for in (10) into three parameters: (i) entropy costs (costs to represent the new phrase in the encoded string); (ii) model costs (costs to add the new phrase to the model); and (iii) previous costs (costs to represent the substring in the string previously). The SCR of [10, 11, 28] breaks these costs down as follows:

C_h = R_i · log2(R / R_i),  (11)
C_m = l_i,   C_p = l_i R_i,  (12)

where R is the length of the string after substitution, l_i is the length of the code phrase, L is the length of the model, and R_i is the frequency of the code phrase in the string. An improved version of this heuristic, SCR 2006, provides a more accurate description of the compression work by eliminating some of the simplifying assumptions made earlier. Entropy costs (11) remain unchanged. However, increased accuracy can be achieved by more specific costs for the model and previous costs.

For previous costs we consider the sum of the costs of the substrings that comprise the candidate phrase,

C_p = Σ_{j=1}^{l_i} R_i · log2(R / r_j),  (13)

where R is the total number of symbols without the formation of the candidate phrase and r_j is the frequency of the jth symbol in the candidate phrase. Model costs require a method for not only spelling out the candidate phrase but also the cost of encoding the length of the phrase to be described. We estimate this cost as

C_m = M(l_i) + Σ_{j=1}^{l_i} log2(R / r_j),  (14)

where M(L) is the shortest prefix encoding of the phrase length. In this way we achieve both a practical method for spelling out the model for implementation and an online method for determining model costs that relies only on known information. Since new symbols will add to the cost of other symbols simply by increasing the number of symbols in the alphabet, we specify an additional cost that reflects the change in costs of the substrings that are not covered by the candidate phrase. The effect is estimated by

C_o = (R − R_i) · log2((L + 2) / (L + 1)).  (15)

This provides a new, more accurate heuristic:

SCR 2006 = (C_m + C_h + C_o) / C_p.  (16)

Figure 8 shows a plot of SCR 2006 versus length and number of repeats for a specific sequence, where the first phrase of a given length and number of repeats is selected. Notice that the lowest-SCR phrase is primarily a function of the number of repeats and the length, but it also includes some variation due to other effects. Thus, we have improved the SCR heuristic to yield a better choice of phrase to add at each iteration.

Figure 8: Symbol compression ratio (vertical axis) as a function of phrase length and number of occurrences (horizontal axes) for the first phrase encountered of a given length and frequency. The variation indicates our improved heuristic is providing benefit by considering the descriptive cost of specific phrases based on the grammar variables and terminals contained in the phrase, not just length and number of occurrences.

5.2. Additional heuristics

In addition to SCR, two alternative heuristics are evaluated to determine the best phrase for MDL learning: longest match

Input sequence 120 Pease porridge hot, TC pease porridge cold, 110 pease porridge in the pot, 100 nine days old. 90 Some like it hot, some like it cold, 80 some like it in the pot, 70 nine days old. 1234567 Total compression model inference S1 pease porridge peasS5porridgS5 S2 some like it S6somS5likS5it S3 in the pot, nine days old. in thS5pS7S6ninS5days old. S4 cold, S5 e S6 S7 ot, S S1hS7S6S1S4S6S1S3S6S2hS7S2S4S2S3 Longest match model inference S1 in the pot, nine days old. S2 , pease porridge S3 some like it S pease porridge hot, S2cold, S2S1S3hot, S3cold, S2S1

Figure 9: MDLcompress model-inferred grammar for the input sequence “pease porridge” using total compression (TC) and the longest match (LM) heuristics. Both the SCR and TC heuristics achieve the same total compression and both exceed the performance of LM. Subsequent iterations enable MDLcompress to identify phrases, yielding further compression of the TC grammar model.

(LM) and total compression (TC). Both of these heuristics 2000 leverage the gains described above by considering the entropy 1800 of specific variables and terminals when selecting candidate phrases. In LM, the longest phrase is selected for substitution, 1600 even if only repeated once. This heuristic can be useful when 1400 it is anticipated that the importance of a codeword is propor- 1200 tional to its length. MDLcompress can apply LM to greater 1000 advantage than other compression techniques because of its deep recursion—when a long phrase is added to codebook, 800 its subphrases, rather than being disqualified, remain poten- 600 tial candidates for subsequent phrases. For example, if the 400 longest phrase merely repeats the second longest phrase three times, MDLcompress will nevertheless identify both phrases. 200 In TC, the phrase that leads to maximum compression 0 at the current iteration is chosen. This “greedy” process does 0 102030405060708090 not necessarily increase the SCR, and may lead to the elim- ination of smaller phrases from the codebook. MDLcom- Model cost press, as explained above, helps temper this misbehavior by Description cost including the model in the search space of future iterations. Total cost Because of this “deep recursion” phrases in both the model Figure 10: The compression characteristic of MDLcompress using and data portions of the sequence are considered as candi- the hybrid heuristics longest match, followed by total compress after date codewords at each iteration-MDLcompress yields im- the longest match heuristic ceases to provide compression. proved performance over the GREEDY algorithm [16]. As with all MDL criteria, the best heuristics for a given sequence is the approach that best compresses the data. The TC gain is the improvement in compression achieved by selecting a candidate phrase and can be derived from the SCR heuris- that we search the model as well as remaining sequence for tic by removing the normalization factor. Examples of MDL- candidate phrases, reducing the need for and benefit from compress operating under different heuristics or combina- the SCR heuristic. By comparison, SEQUITUR [17]forms tions of heuristics are shown in Figures 9 and 10.Underour a grammar of 13 rules consisting of 74 symbols. Thus, us- improved architecture, the best compression seems to usu- ing MDLcompress TC we achieve better compression with a ally be achieved in TC mode, which we attribute to the fact grammar model of approximately half the size. 10 EURASIP Journal on Bioinformatics and Systems Biology
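The difference between the two selection rules can be summarized in a few lines. The sketch below is our paraphrase with made-up helper names, not the MDLcompress implementation, whose bookkeeping is richer: LM ranks candidates purely by length, while TC ranks them by an un-normalized, approximate compression gain (characters removed minus a rough model cost for adding the phrase).

```python
import math
from collections import defaultdict

def candidates(text, min_len=2, max_len=20):
    # Count every substring up to max_len; keep only phrases that repeat.
    occ = defaultdict(int)
    for l in range(min_len, max_len + 1):
        for i in range(len(text) - l + 1):
            occ[text[i:i + l]] += 1
    return {p: r for p, r in occ.items() if r >= 2}

def pick_lm(cands):
    # Longest match: take the longest repeated phrase, ignoring frequency.
    return max(cands, key=len)

def pick_tc(cands, text_len):
    # Total compression: take the phrase with the largest estimated net saving.
    def gain(p):
        r, l = cands[p], len(p)
        return r * (l - 1) - (l + math.log2(text_len))
    return max(cands, key=gain)

text = "pease porridge hot, pease porridge cold, pease porridge in the pot"
c = candidates(text)
print("LM pick:", repr(pick_lm(c)))
print("TC pick:", repr(pick_tc(c, len(text))))
```

On this input the two rules pick different phrases, which is the behavior Figure 9 illustrates on the full "pease porridge" rhyme.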

Figure 11 content (index box and phrase array for the string "a rose is a rose is a rose"): initially, phraseArray(1) = {index: 1, length: 6, verboselength: 6, chararray: 'a rose', startindices: [1 11 21], frequency: 3} and phraseArray(2) = {index: 1, length: 10, verboselength: 10, chararray: 'a rose is', startindices: [1 11], frequency: 2}. After "a rose" is added to the model, the string becomes "S1 is S1 is S1" and the entries update to startindices [1 6 11] with length 1, and startindices [1 6] with length 5, respectively. The phrase array has all the information necessary to update the other candidates after each phrase is added to the model.

Figure 11: The data structures used in MDLcompress allow constant-time selection and replacement of candidate phrases. The top of the figure shows the initial index matrix and phrase array. After adding "a rose" to the model, MDLcompress can generate the new index box and phrase array, shown in the bottom half, in constant time.
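A toy version of this bookkeeping can be written with an ordinary dictionary playing the role of the index box plus phrase array. This is our simplification: the class name `PhraseStats` and the helper are illustrative, and the real implementation uses the sparse l_max-by-L matrix described in the text rather than a hash table.

```python
from dataclasses import dataclass, field

@dataclass
class PhraseStats:
    chararray: str
    startindices: list = field(default_factory=list)

    @property
    def frequency(self):
        return len(self.startindices)

def gather_stats(text, max_len):
    """Single pass over the input: record every substring up to max_len
    together with all of its start positions (1-based, as in Figure 11)."""
    table = {}
    for l in range(1, max_len + 1):
        for i in range(len(text) - l + 1):
            sub = text[i:i + l]
            table.setdefault(sub, PhraseStats(sub)).startindices.append(i + 1)
    # Keep only candidates that repeat; unique substrings cannot compress anything.
    return {s: p for s, p in table.items() if p.frequency >= 2}

stats = gather_stats("a rose is a rose is a rose", max_len=10)
print(stats["a rose"].startindices)   # [1, 11, 21], matching Figure 11
```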

5.3. Data structures During the phrase selection part of each iteration, MDL- compress only has to search through phrase array, calculat- A second improvement of MDLcompress over OSCR is the ing the heuristic for each entry. Once a phrase is selected, improvement to execution time to allow analysis of much the matrix is used to identify overlapping phrases, which will longer input strings, such as DNA sequences. This is achieved have their frequency reduced by the substitution of a new through trading off memory usage and runtime by using ma- symbol for the selected substring. While there may be many trix data structures to store enough information about each phrases in the array that are updated, only local sections of candidate phrase to calculate the heuristic and update the the matrix are altered, so overall only a small percentage of data structures of all remaining candidate phrases. This al- the data structure is updated. This technique is what allows lows us to maintain the fundamental advantage of OSCR MDLcompress to execute efficiently even with long input se- andalgorithmssuchasGREEDY[16] that compression is quences, such as DNA. performed based upon the global structure of the sequence, rather than by the phrases that happen to be processed first, 5.4. Performance bounds as in schemes such as Sequitur, DNA Sequitur, and Lempel- Ziv. We also maintain an advantage over the GREEDY algo- The execution of MDLcompress is divided into two parts: the rithm by including phrases added to our MDL model and the single pass to gather statistics about each phrase and the sub- model space itself in our recursive search space. sequent iterations of phrase selection and replacement. Since During the initial pass of the input, MDLcompress gener- simple matrix operations are used to perform phrase selec- ates an lmax by L matrix, where entry Mi,j represents the sub- tion and replacement, the first pass of statistics gathering al- string of length i beginning at index j. This is a sparse matrix most entirely dominates both the memory requirements and with entries only at locations that represent candidates for runtime. the model. Thus, substrings with no repeats and substrings For strings with input length, L, and maximum phrase that only ever appear as part of a longer substring are repre- length, lmax, the memory requirements of the first pass are sented with a 0. Matrix locations with positive entries repre- bounded by the product L ∗ lmax and subsequent passes re- sent the index into an array with many more details for that quire less memory as phrases are replaced by (new) indi- specific substring. In the example in Figure 11,“arose”ap- vidual symbols. Since the user can define a constraint on pears three times in the input. In each location of the matrix lmax, memory use can be restricted to as little as O(L), and corresponding to this substring is a 1, and the first element in will never exceed O(L2). On platforms with limited memory the phrase array has the length, frequency, and starting index where long phrases are expected to exist, the LM heuristic for all occurrences of the substring. A similar element exists can be used in a simple preprocessing pass to identify and for “a rose is” but not exist for “a rose” since that only appears replace any phrases longer than the system can handle in as a substring of the first candidate. the standard matrix described above. Because MDLcompress Scott C. Evans et al. 11

Table 1: Compression results (bits/nucleotide).

Gene         DNACompress   Sequitur   DNASequitur   MDLcompress
HUMDYSTROP   1.91          2.34       2.2           1.95
HUMGHCSA     1.03          1.86       1.74          1.49
HUMHBB       1.79          2.20       2.05          1.92
HUMHDABCD    1.80          2.26       2.12          1.92
HUMPRTB      1.82          2.22       2.14          1.92
CHNTXX       1.61          2.24       2.12          1.95

inspects the model when searching for subsequent phrases, compress model and taking account of the frequency of the this technique has minimal negative effect on overall com- phrase and its reverse-complement in motif selection. pression. The runtime of the first pass depends directly on L, lmax, 7. POST PROCESSING average phrase length lavg,andaveragenumberofrepeats of selected phrases, ravg. The unclear relationship between After the MDLcompress model has been created, two meth- lmax, lavg, ravg, and L makes deriving guaranteed performance ods possibilities for further compression are the following. bounds difficult. As a simple upper bound, we can note that ffi ∗ (1) Regions of Local similarity: it is sometimes most e - the product lavg ravg must be less than L, and the maximum cient to define a phrase as a concatenation of multiple phrase length must be less than L/2, yielding a performance 3 shorter and adjacent phrases already in the codebook. bound of O(L ). In practice, a memory constraint limits lmax ∗ (2) Single nucleotide polymorphisms (SNPs): it is some- to a constant independent of L,andlavg ravg was approxi- time most efficient to define a phrase as a single nu- mately constant and much smaller than L. Thus, the practical cleotide alteration to another phrase already in the performance bound was O(L). codebook. The runtime of the second part of the algorithm, selec- tion and replacement of compressible phrases, is simply the sum of the time to identify the best phrase and to update 8. COMPARISON TO OTHER GRAMMAR-BASED the matrices for the next iteration, multiplied by the number CODES 2 of iterations. An upper bound on these is O(L ), but again We compare MDLcompress with the state of the art in practical performance is much better. In this DNA applica- grammar-based compression: DNA Sequitur [18]. DNA Se- tion where 144 genes were analyzed, the number of candi- quitur improves the Sequitur algorithm by enabling it to har- ff date phrases, the average number of a ected phrases, and the ness advantages of palindromes and by considering other number of iterations all were independent of input length, grammar-based encoding techniques as discussed in [20]. and the selection and replacement phase ran in constant Results are summarized in Table 1. time. While compression is ultimately the best measure of al- gorithm’s capacity to approximate Kolmogorov complexity, 5.5. Enhancements for DNA compression an additional feature of grammar-based codes is their two- part encoding, which separates the meaningful model from When a symbol sequence is already known to be DNA, sev- the data elements—an advantage we will discuss in more eral “priors” can be incorporated into the model inference detail later. The results above make use of the total com- algorithm that may lead to improved compression perfor- pression heuristic and harness the advantage of consider- mance. These assumptions relate to types of structure that ing palindromes. Although we exceeded the compression of are typical of naturally occurring DNA sequence. By tuning DNA Sequitur, DNACompress still achieves better compres- our algorithm to efficiently code for these mechanisms, we sion; however it does not yield the two-part grammar code are essentially incorporating these priors into our model in- that identifies biologically significant phrases, which we will ference algorithm “by hand.” We consider these assumptions discuss next in the context of breast-cancer-related genes. 
to be small and within the “big O” constant inherent in trans- lating between universal computers. 9. IDENTIFICATION OF MIRNA TARGETS USING MDLCOMPRESS 6. REVERSE-COMPLEMENT MATCHES As shown in Figure 7, MDL algorithms can be used to identify miRNA target sites. We have also tested MDL- As in DNA Sequitur, the search for and grammar encod- compress for the ability to identify miRNA target sites in ing of reverse-complement matches is readily implemented known disease-related genes. The general approach is to an- by adding the reverse-complement of a phrase to the MDL- alyze mRNA transcripts to identify short sequences that are 12 EURASIP Journal on Bioinformatics and Systems Biology

MDLcompress & LATS2: sequence elements in long 3’UTR LOCUS NM 014572 Definition homo sapiens LATS, large tumor suppressor, homolog 2 (Drosophila) (LATS2), mRNA.

5’UTR CDS 3’UTR

MDLcompress (of 3’UTR ) output sequences Sequence Position in 3’UTR 1) aaaaaaaaaaaa 433, 445 2) agcacttatt 262, 362 3) aaacaggac 155, 172

Figure 12: Validation of MDLcompress performance. MDL compress identifies miRNA-372 and 373 target motif (AGCACTTATT) in LATS2 tumor suppressor gene as second phrase.

repeated and localized to the 3UTR. Comparative genomics ing SSEARCH [37] to detect possible sequence similarities to can be applied to increase our confidence that MDL phrases known miRNAs. Finally, genes containing these phrases were in fact represent candidate miRNA target sites, even if there targeted with shRNA constructs in an ErbB2-positive breast are no known cognate miRNAs that will bind to that site. cancer cell line (BT474), as well as in normal mammary As a test, we sought to determine if MDLcompress would epithelial cells (HMEC), in order to identify their poten- have identified the miRNA binding site in the 3UTR of the tial role in breast tumorigenicity. One MDLcompress phrase, tumor suppressor gene, LATS2. A recent study, which used a AGAUCAAGAUC, found in the 3UTR of the splicing fac- function-based approach to miRNA target site identification, tor arginine/serine-rich 7 (SFRS7) gene (a) was highly con- determined that LATS2 is regulated by miRNAs 372 and 373 served, (b) resulted in miRBase matches to a small number of [29]. Increased expression of these miRNAs led to down reg- miRNAs that fulfill the minimum requirements of putative ulation of LATS2 and to tumorigenesis. The miRNA 372 and miRNA targets [32] (Figures 13(a) and 13(b)) in vitro data 373 target sequence (AGCACTTATT) is located in the 3UTR implicate this gene in breast cancer progression. More specif- of LATS2 mRNA and is repeated twice but was not identified ically,downregulationofSFRS7byshRNAsinBT474cells with computation-based miRNA target identification tech- yielded a significant decrease in the proliferation marker ala- niques. Using the 3UTR of LATS2 mRNA as an input, three marBlue (Biosource), but not in normal mammary epithelial code words were added to the MDLcompress model, using cells (HMEC) (Figure 13(b)). In this experiment, cells were longest match mode as shown in Figure 12, the polyA tail, transiently transfected with miRNA-based-structure shRNA the miRNA 372 and 373 target sequence (AGCACTTATT), constructs [38] targeting the coding sequence of SFRS7, by and a third phrase (AAACAGGAC) which we do not iden- using a lipid-based reagent (FuGENE 6, Roche). A plasmid tify with any particular biological function at this time. This construct expressing green fluorescent protein (MSCV-GFP) shows that analyzing genes of interest a priori with MDL- was cotransfected to the cells to normalize transfection effi- compress can produce highly relevant sequence motifs. ciency [3]. shRNAs against the firefly luciferase gene was used Since miRNAs regulate genes important for tumorigene- as negative control. Although regulation by the specific miR- sis and MDLcompress is able to identify these targets, it fol- NAs identified in our bioinformatics analysis still requires lows that MDLcompress could be used to directly identify validation, these results suggest the possible differential regu- genes that are important for tumorigenesis. To test this, we lation of this gene in breast cancer by a miRNA and that this used a target rich set of 144 genes known to have increased gene is significant in cell proliferation, underscoring the po- expression patterns in ErbB2-positive breast cancer [30, 31] tential for OSCR to identify sequence of biological interest. and compressed each gene mRNA sequence with MDLcom- press running in longest match mode. A total of 93 phrases 10. ANALYSIS OF SINGLE NUCLEOTIDE were added to MDLcompress codebooks resulting in com- POLYMORPHISMS pression of these genes. 
Of these phrases, 25 were found ex- clusively in the 3UTRs of these genes. Since miRNAs interact By definition, mutation of an essential nucleotide within a more frequently with the 3UTRs of mRNAs [32], we focused given miRNA’s target sequence within an mRNA is expected our analysis on these phrases, shown in Table 2. to have a strong effect on the activity of the given miRNA The 25 3UTR phrases were run through BLAST [33] on the target. If a nucleotide that is required for interac- searches of a database of 3UTRs [34, 35] to determine tion of a miRNA with the mRNA is altered, the miRNA may level of conservation in human and other genomes. The cease to regulate that target, thereby enhancing expression phrases were also run against the miRBase database [36]us- of the mRNA and the protein it encodes. Alternatively, a Scott C. Evans et al. 13
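The 3'UTR filter itself is straightforward; the sketch below is our own illustration with made-up coordinates, not the annotation pipeline used in the study. It keeps a phrase only when every occurrence lies at or beyond the start of the annotated 3'UTR.

```python
def utr3_exclusive(phrases, utr3_start):
    """Keep phrases whose every occurrence falls inside the 3'UTR.
    `phrases` maps a phrase to its list of start positions (1-based);
    `utr3_start` is the first position of the annotated 3'UTR."""
    return {p: pos for p, pos in phrases.items()
            if all(s >= utr3_start for s in pos)}

# Illustrative example only: positions and the 3'UTR boundary are hypothetical.
phrases = {
    "agatcaagatc": [1010, 1091],   # both hits downstream of the assumed boundary
    "ctccctcctc":  [240, 1035],    # one hit upstream, so it is filtered out
}
print(utr3_exclusive(phrases, utr3_start=900))
```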

Table 2: 3UTR MDLcompress phrases from 144 ErbB2-positive-related gene mRNA sequence.

Accession number Number of repeats Length Phrase Locations NM 000442 2 13 tttctcttttcct 2835, 3091 NM 004265 2 10 tcagggaggg 2274, 2667 NM 004265 2 10 ccccccagct 2954, 3021 NM 004265 2 10 gcagaggcag 2255, 3051 NM 005324 2 12 ttttatttataa 1292, 1802 NM 005324 2 10 cagtttcctt 997, 1991 NM 005324 2 9 tttataata 627, 1055 NM 005930 2 11 tatttcaattt 2903, 2932 NM 005930 2 11 tatttttgctc 2733, 3809 NM 005930 2 10 gacaaatgtg 3064, 3250 NM 005930 2 10 cttttttttc 3425, 3689 NM 005930 2 10 ttggaacact 3750, 3787 NM 006148 2 13 gtgtgtgagtgtg 1951, 3654 NM 006148 2 12 ccccagtctcca 647, 1651 NM 006148 2 11 acttcttggtt 1067, 1290 NM 006148 2 11 cctcctgccca 1186, 1503 NM 006148 2 11 ccccatctctg 2147, 2302 NM 006148 2 11 ggaagcacagc 1545, 2447 NM 006148 2 11 tgtgggtgggg 2014, 2776 NM 006148 2 11 cctttctggcc 2812, 3759 NM 006148 2 10 ctccctcctc 1035, 1408 NM 006148 2 10 cagctaccgg 525, 1591 NM 006148 2 10 tcccctcccc 1464, 1828 NM 006148 2 10 gtggaggaag 2159, 2267 NM 006276 2 11 agatcaagatc 1010, 1091

Figure 13 content: (a) the OSCR phrase AGAUCAAGAUC in the SFRS7 3'UTR aligned against miR-218 from three species (hsa-miR-218 UGUACCAAUCUAGUUCGUGUU; rno-miR-218 UGUACCAAUCUAGUUCGUGUU; xtr-miR-218 UGUACCAAUCUAGUUCGUGUU). (b) Bar chart comparing proliferation of BT474 and HMEC cells transfected with a luciferase shRNA control versus an SFRS7 shRNA.

Figure 13: A miRNA target site relevant to breast cancer is identified by OSCR. (a) Proposed interaction between miRNAs (human, rat, frog) and OSCR phrase. (b) Down regulation of the SFRS7 by RNAi specifically inhibits the proliferation of breast cancer cell line BT474 and not normal cells. These miRNAs may be implicated in breast cancer. single-nucleotide change to a target of one miRNA may yield results in an “illegitimate” interaction of miRNA 1 and 206 a target sequence for a distinct miRNA. A report published in with the myostatin mRNA [39]. Mutations that yield such 2006 demonstrated this SNP effect in a mammal. The study interactions between mutant mRNA and miRNAs are called found that Texel sheep, which are known for their meatiness, “Texel-like.” The authors performed a preliminary analysis possess a mutation in the 3UTR of the myostatin gene that of known human SNPs and their potential for perturbing 14 EURASIP Journal on Bioinformatics and Systems Biology

Figure 14(a) content: schematic of the overlap between the SNP500 database (500 genes) and the MDLcompress-analyzed BT474 overexpression set (144 genes); 13 genes fall in the intersection.

Name    Accession    MDL sequence   Position            SNP
ESR1    NM 000125    GATATGTTTA     4023, 5325          4029 T→C
PTGS2   NM 000963    CAAAATGC       2179, 2717, 3097    3103 G→A
EGFR    NM 005228    TTTTACTTC      4233, 4967          4975 C→T

(b)

Figure 14: MDLcompress directly identifies putative miRNA target sequences that may be implicated in breast cancer. (a) Schematic of overlap between SNP500 database and potential miRNA sequences identified by MDLcompress in the test set. (b) Potential miRNA sites identified by MDLcompress with disease-related polymorphisms identified by SNP analysis. These miRNA targets may be implicated in breast cancer. binding sites of predicted miRNAs and identified 2490 Texel- MDLcompress cost per nucleotide-based of PGTS2 with SNP 3 like mutations and 483 mutations that potentially result in loss of miRNA binding. We performed a similar analysis on the 144 overexpressed 2.5 SNP g a gene mRNA sequences from the BT474 breast cancer cell line [30, 31] to identify which of these genes possess disease- 2 related Texel-like mutations. By cross-referencing with the SNP500 database [40],SNPswerefoundin13ofthe144 1.5 overexpressed gene mRNA sequences from the BT474 breast cancer cell line, all in the 3UTR region. The initial compari- 1 son of the 93 MDLcompress code words from the 144 genes taaaacttccttttaaatcaaaatgccaaatttattaaggtggtggagcc discussed previously did not match with any SNP phrases. We then relaxed the strict constraint that a phrase must lead 0.5 to compression at every step and asked MDLcompress in longest match to identify the top 10 candidates in each gene 0 mRNA sequence that would most likely lead to compression. 2700 2710 2720 2730 2740 2750 Strikingly, 3 of these genes-ESR-1, PGTS2, and EGFR-have Figure 15: Cost per nucleotide for PTGS2. The blue curve identifies SNPs in the set of the first 10 code word candidates identified cost per nucleotide of the original sequence based upon an MDL- by MDLcompress when run on each these genes respective compress model developed using the total compression heuristic mRNA sequence (Figure 14). These three sequences were se- and the first 15 phrases to be selected. The cost per nucleotide under lected out of the 13 because they fulfill the criteria we used the SNP g → a isshowninred. for Figure 13(a), that based on sequence analysis (similarity to miRNA sequences and intra- and inter- species sequence conservation); they are putative miRNA targets. single nucleotide typically yields a very small change in de- These motifs are localized to the 3UTR and have not scriptive cost, in most cases less than a bit; however, the SNP been predicted to interact with any known miRNAs in the in the phrase shown in Figure 15 yields a change in descrip- literature. Although further validation studies are required, tive cost on the order of 4 bits, suggesting that this phrase these observations suggest that MDLcompress may be capa- is in fact meaningful. Future work will elaborate on this po- ble of directly identifying potential miRNA target sequences tential relationship between meaningful phrases identified by with roles in breast cancer. MDLcompress and disease, and explore the capability of us- Our hypothesis regarding the significance of MDL ing MDLcompress models to predict sites where SNPs are es- phrases that are added to the MDLcompress model motivates pecially likely to cause pathology. search of these phrases for SNPs related to cancer. As shown in Figure 10, an SNP identified in PTGS2 gene [40] colo- 11. 
CONCLUSIONS calizes with the MDLcompress-identified phrase caaaatgc in the 3UTR of PTGS2 and yields a disproportionate change MDLcompress yields compression of DNA sequences that is in the descriptive cost of the sequence under the MDLcom- superior to any other existing grammar-based coding algo- press model generated for the original sequence. Altering a rithm. It enables automatic detection of model granularity, Scott C. Evans et al. 15 leading to identification of interesting variable-length motifs. [5] B. P.Lewis, C. B. Burge, and D. P.Bartel, “Conserved seed pair- These motifs include miRNA target sequences that may play ing, often flanked by adenosines, indicates that thousands of a role in the development of disease, including breast cancer, human genes are microRNA targets,” Cell, vol. 120, no. 1, pp. introducing a novel method of identifying microRNA targets 15–20, 2005. without specifying the sequence (or, in particular, seed) of [6] V. Rusinov, V. Baev, I. N. Minkov, and M. Tabler, “MicroIn- the microRNA that is supposed to bind them. Additionally, spector: a web tool for detection of miRNA binding sites in an RNA sequence,” Nucleic Acids Research, vol. 33, web server we have used our algorithm here to study SNPs found in issue, pp. W696–W700, 2005. overexpressed genes in the breast cancer cell line BT474, and [7] G. A. Calin, C.-G. Liu, C. Sevignani, et al., “MicroRNA pro- we identified 3 SNPs that may alter the ability of microRNAs filing reveals distinct signatures in B cell chronic lymphocytic to target their sequence neighborhood. leukemias,” Proceedings of the National Academy of Sciences of In future work, MDL specificity will be improved the United States of America, vol. 101, no. 32, pp. 11755–11760, through windowing and segmentation, concepts described 2004. in Figure 4. Running MDLcompress on consecutive windows [8] A. Esquela-Kerscher and F. J. Slack, “Oncomirs—microRNAs of sequence will enable the detection of change points, such with a role in cancer,” Nature Reviews Cancer, vol. 6, no. 4, pp. as the transition from noncoding to coding sequence, and 259–269, 2006. permit the use of multiple codebooks, enhancing specificity [9] P. Grunwald,¨ I. J. Myung, and M. Pitt, Eds., Advances in Mini- for each region of a gene. For example, the optimal MDL mum Description Length: Theory and Applications, MIT Press, codebook for a coding region is unlikely to be the same as Cambridge, Mass, USA, 2005. [10] S. C. Evans, Kolmogorov complexity estimation and application that for a 3UTR. Applying the same model over an entire ff for information system security, Ph.D. dissertation, Rensselaer gene reduces the e ectiveness of the MDL compression algo- Polytechnic Institute, Troy, NY, USA, 2003. rithm in identifying biologically significant motifs. This im- [11] S. C. Evans, B. Barnett, S. F. Bush, and G. J. Saulnier, “Mini- provement of MDLcompress to detect and take advantage of mum description length principles for detection and classifi- change points will enable the detection of nonadjacent re- cation of FTP exploits,” in Proceedings of IEEE Military Com- gions of the genome that are similar. The execution time of munications Conference (MILCOM ’04), vol. 1, pp. 473–479, MDLcompress will be further reduced by means of a novel Monterey, Calif, USA, October-November 2004. data structure that augments a suffix tree with counts and [12] S. C. Evans, A. Torres, and J. 
Miller, “MicroRNA target mo- pointers, enabling deep recursion of model inference without tif detection using OSCR,” Tech. Rep. GRC223, GE Research, intractable computation. With this structure, when a phrase Niskayuna, NY, USA, 2006. is selected for the MDLcompress codebook, simple opera- [13] M. Li and P. Vitanyi,´ Introduction to Kolmogorov Complexity tions can update the structure to facilitate selection of the and Applications, Springer, New York, NY, USA, 1997. ffi [14] W. Szpankowski, W. Ren, and L. Szpankowski, “An opti- next phrase by leveraging known information. The su x- mal DNA segmentation based on the MDL principle,” Inter- tree with counts and pointers architecture will enable near- national Journal of Bioinformatics Research and Applications, linear time processing of the windowed segments. vol. 1, no. 1, pp. 3–17, 2005. [15] I. Tobus, G. Korodi, and J. Rissanen, “DNA sequence com- ACKNOWLEDGMENTS pression using the normalized maximum likelihood model for discrete regression,” in Proceedings of Data Compression Con- This work was funded by the U.S. Army Medical Research ference (DCC ’03), pp. 253–262, Snowbird, Utah, USA, March Acquisition Activity, 820 Chandler Street, Fort Detrick, DM 2003. [16] A. Apostolico and S. Lonardi, “Some theory and practice of 217-5014 in Grants W81XWH-0-1-0501 (to SE and AT) and ff W8IWXH-04-1-0474 (to DSC). The content and informa- greedy o -line textual substitution,” in Proceedings of Data Compression Conference (DCC ’98), pp. 119–128, Snowbird, tion do not necessarily reflect the position or policy of the ffi Utah, USA, March 1998. government and no o cial endorsement should be inferred. [17] C. G. Nevill-Manning and I. H. Witten, “Identifying hierarchi- cal structure in sequences: a linear-time algorithm,” Journal of REFERENCES Artificial Intelligence Research, vol. 7, pp. 67–82, 1997. [18] N. Cherniavsky and R. Lander, “Grammar-based compres- [1]A.Fire,S.Xu,M.K.Montgomery,S.A.Kostas,S.E.Driver, sion of DNA sequences,” in DIMACS Working Group on The and C. C. Mello, “Potent and specific genetic interference Burrows—Wheeler Transform, Piscataway, NJ, USA, August by double-stranded RNA in caenorhabditis elegans,” Nature, 2004. vol. 391, no. 6669, pp. 806–811, 1998. [19] X. Chen, M. Li, B. Ma, and J. Tromp, “DNACompress: fast and [2] G. J. Hannon and J. J. Rossi, “Unlocking the potential of effective DNA sequence compression,” Bioinformatics, vol. 18, the human genome with RNA interference,” Nature, vol. 431, no. 12, pp. 1696–1698, 2002. no. 7006, pp. 371–378, 2004. [20] B. Behzadi and F. Le Fessant, “DNA compression chal- [3] A. Kourtidis, C. Eifert, and D. S. Conklin, “RNAi applications lenge revisited: a dynamic programming approach,” in The in target validation,” in Systems Biology, Applications and Per- 16th Annual Symposium on Combinatorial Pattern Matching spectives, P. Bringmann, E. C. Butcher, G. Parry, and B. Weiss, (CPM ’05), vol. 3537 of Lecture Notes in Computer Science,pp. Eds., vol. 61 of Ernst Schering Foundation Symposium Proceed- 190–200, Jeju Island, Korea, 2005. ings, pp. 1–21, Springer, New York, NY, USA, 2007. [21] S. C. Evans, T. S. Markham, A. Torres, A. Kourtidis, and D. [4] B. P. Lewis, I.-H. Shih, M. W. Jones-Rhoades, D. P. Bartel, and Conklin, “An improved minimum description length learn- C. B. Burge, “Prediction of mammalian microRNA targets,” ing algorithm for nucleotide sequence analysis,” in Proceed- Cell, vol. 115, no. 7, pp. 787–798, 2003. 
ings of IEEE 40th Asilomar Conference on Signals, Systems and 16 EURASIP Journal on Bioinformatics and Systems Biology

Computers (ACSSC ’06), pp. 1843–1850, Pacific Grove, Calif, USA, October-November 2006. [22] P. Gacs,J.T.Tromp,andP.M.B.Vit´ anyi,´ “Algorithmic statis- tics,” IEEE Transactions on Information Theory, vol. 47, no. 6, pp. 2443–2463, 2001. [23] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991. [24] E. C. Lai, “MicroRNAs are complementary to 3 UTR se- quence motifs that mediate negative post-transcriptional reg- ulation,” Nature Genetics, vol. 30, no. 4, pp. 363–364, 2002. [25] E. C. Lai, B. Tam, and G. M. Rubin, “Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K- box-class microRNAs,” Genes & Development,vol.19,no.9, pp. 1067–1080, 2005. [26] J. G. Doench and P. A. Sharp, “Specificity of microRNA target selection in translational repression,” Genes & Development, vol. 18, no. 5, pp. 504–511, 2004. [27] J. Brennecke, A. Stark, R. B. Russell, and S. M. Cohen, “Prin- ciples of microRNA-target recognition,” PLoS Biology, vol. 3, no. 3, p. e85, 2005. [28] S. C. Evans, G. J. Saulnier, and S. F. Bush, “A new universal two part code for estimation of string kolmogorov complexity and algorithmic minimum sufficient statistic,” in DIMACS Work- shop on Complexity and Inference, Piscataway, NJ, USA, June 2003. [29] P.M. Voorhoeve, C. le Sage, M. Schrier, et al., “A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in tes- ticular germ cell tumors,” Cell, vol. 124, no. 6, pp. 1169–1181, 2006. [30] A. Mackay, C. Jones, T. Dexter, et al., “cDNA microarray anal- ysis of genes associated with ERBB2 (HER2/neu) overexpres- sion in human mammary luminal epithelial cells,” Oncogene, vol. 22, no. 17, pp. 2680–2688, 2003. [31] F. Bertucci, N. Borie, C. Ginestier, et al., “Identification and validation of an ERBB2 gene expression signature in breast ,” Oncogene, vol. 23, no. 14, pp. 2564–2575, 2004. [32] L. P.Lim, N. C. Lau, P.Garrett-Engele, et al., “Microarray anal- ysis shows that some microRNAs downregulate large numbers of target mRNAs,” Nature, vol. 433, no. 7027, pp. 769–773, 2005. [33] S. F. Altschul, T. L. Madden, A. A. Scha¨ffer, et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997. [34] F. Mignone, G. Grillo, F. Licciulli, et al., “UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untrans- lated regions of eukaryotic mRNAs,” Nucleic Acids Research, vol. 33, database issue, pp. D141–D146, 2005. [35] http://microrna.sanger.ac.uk/sequences/index.shtml. [36] S. Griffiths-Jones, R. J. Grocock, S. van Dongen, A. Bateman, and A. J. Enright, “miRBase: microRNA sequences, targets and gene nomenclature,” Nucleic Acids Research, vol. 34, database issue, pp. D140–D144, 2006. [37] X. Huang, R. C. Hardison, and W. Miller, “A space-efficient algorithm for local similarities,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 373–381, 1990. [38] P. J. Paddison, J. M. Silva, D. S. Conklin, et al., “A resource for large-scale RNA-interference-based screens in mammals,” Nature, vol. 428, no. 6981, pp. 427–431, 2004. [39] A. Clop, F. Marcq, H. Takeda, et al., “A mutation creating a po- tential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep,” Nature Genetics, vol. 38, no. 7, pp. 813–818, 2006. [40] http://snp500cancer.nci.nih.gov/. 
Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 61374, 7 pages doi:10.1155/2007/61374

Research Article Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria

Haruo Suzuki, Rintaro Saito, and Masaru Tomita

Institute for Advanced Biosciences, Keio University, Yamagata 997-0017, Japan

Received 31 January 2007; Accepted 4 June 2007

Recommended by Teemu Roos

G + C composition at the third codon position (GC3) is widely reported to be correlated with synonymous codon usage bias. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we applied Shannon entropy from information theory to measure the degree of GC3 bias and that of synonymous codon usage bias of each gene. The strength of the correlation of GC3 with synonymous codon usage bias, quantified by a correlation coefficient, varied widely among bacterial genomes, ranging from −0.07 to 0.95. Previous analyses suggesting that the relationship between GC3 and synonymous codon usage bias is independent of species are thus inconsistent with the more detailed analyses obtained here for individual species.

Copyright © 2007 Haruo Suzuki et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most amino acids can be encoded by more than one codon (i.e., a triplet of nucleotides); such codons are described as being synonymous and usually differ by one nucleotide in the third position. In many organisms, alternative synonymous codons are not used with equal frequency. Various factors have been proposed to contribute to synonymous codon usage bias, including G + C composition, replication strand bias, and translational selection [1]. Here, we focus on the contribution of G + C composition to synonymous codon usage bias.

G + C composition has been widely reported to be correlated with synonymous codon usage bias [2–11]. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. It would be useful to be able to quantify the strength of the correlation of G + C composition with synonymous codon usage bias in such a way that the estimates could be compared among genomes.

Different methods have been used to analyse the relationships between G + C composition and synonymous codon usage. Multivariate analysis methods, such as correspondence analysis [5–7] and principal component analysis [8], have been widely used to construct measures accounting for the largest fractions of the total variation in synonymous codon usage among genes. Carbone et al. [2, 3] used the codon adaptation index as a "universal" measure of dominating codon usage bias. The measures obtained by these methods can be interpreted as having different features (e.g., G + C composition bias, replication strand bias, and translationally selected codon bias), depending on the gene groups analyzed. Therefore, these methods would be useful for exploratory data analysis but not for the analysis of interest here. By contrast, measures such as the "effective number of codons" [10] and Shannon entropy from information theory [11] are well defined; these measures can be regarded as representing the degree of deviation from equal usage of synonymous codons, independently of the genes analyzed. Previous analyses of the relationships between G + C composition and synonymous codon usage bias using these measures have had two problems. First, these measures of synonymous codon usage bias have failed to take into account all three aspects of amino acid usage (i.e., the number of different amino acids, their relative frequency, and their codon degeneracy), and therefore are affected by amino acid usage bias, which may mask the effects directly linked to synonymous codon usage bias. Second, previous analyses have compared the "degree" of synonymous codon usage bias with G + C content [defined as (G + C)/(A + T + G + C)], and have therefore yielded a nonlinear U-shaped relationship (a gene with a very low or very high G + C content has a high degree of synonymous codon usage bias) [9–11]; it is thus difficult to quantify the nonlinear relationship.

To overcome the first of these problems, we use the "weighted sum of relative entropy" (Ew) as a measure of synonymous codon usage bias [12]. This measure takes into account all three aspects of amino acid usage enumerated above, and indeed is little affected by amino acid usage biases.

To overcome the second problem, we compare the degree of synonymous codon usage bias (Ew) with the degree of G + C content bias (entropy) instead of simply the G + C content; this step can provide a linear relationship. The strength of the linear relationship can be easily quantified by using a correlation coefficient.

The approach of quantifying the strength of the correlation of G + C composition with synonymous codon usage bias by using the entropy and correlation coefficient is applied to bacterial species for which whole genome sequences are available.

2. MATERIALS AND METHODS

2.1. Software

All analyses were conducted by using G-language genome analysis environment software [13], available at http://www.g-language.org. Graphs such as the histogram and scatter plot were generated in the R statistical computing environment [14], available at http://www.r-project.org.

2.2. Sequences

We tested data from 371 bacterial genomes (see Additional Table 1 for a comprehensive list, available online at http://www2.bioinfo.ttck.keio.ac.jp/genome/haruo/BSB ST1.pdf). Complete genomes in GenBank format [15] were downloaded from the NCBI repository site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). Protein coding sequences containing letters other than A, C, G, or T and those containing amino acids with residues less than their degree of codon degeneracy were discarded. From each coding sequence, start and stop codons were excluded.

2.3. Analyses

2.3.1. Measure of the degree of synonymous codon usage bias

The relative frequency of the jth synonymous codon for the ith amino acid (Rij) is defined as the ratio of the number of occurrences of a codon to the sum of all synonymous codons:

R_{ij} = \frac{n_{ij}}{\sum_{j=1}^{k_i} n_{ij}},   (1)

where n_{ij} is the number of occurrences of the jth codon for the ith amino acid, and k_i is the degree of codon degeneracy for the ith amino acid.

The degree of bias in synonymous codon usage of the ith amino acid (Hi) was quantified with a measure of uncertainty (entropy) in Shannon's information theory [16]:

H_i = -\sum_{j=1}^{k_i} R_{ij} \log_2 R_{ij},   (2)

H_i can take values from 0 (maximum bias, where only one codon is used and all other synonyms are not present) to a maximum value H_{i\,max} = -k_i ((1/k_i) \log_2 (1/k_i)) = \log_2 k_i (no bias, where alternative synonymous codons are used with equal frequency; that is, for every j, R_{ij} = 1/k_i).

The relative entropy of the ith amino acid (Ei) is defined as the ratio of the observed entropy to the maximum possible in the amino acid:

E_i = \frac{H_i}{H_{i\,max}} = \frac{H_i}{\log_2 k_i},   (3)

E_i ranges from 0 (maximum bias when H_i = 0) to 1 (no bias when H_i = \log_2 k_i).

To obtain an estimate of the overall bias in synonymous codon usage of a gene, we combined estimates of the bias from different amino acids, as follows. First, to take account of the difference in the degree of codon degeneracy (ki) between different amino acids, we used the relative entropy (Ei) instead of the entropy (Hi) as an estimate of the bias of each amino acid. Second, to take account of the difference in relative frequency between different amino acids in the protein, we calculated the sum of the relative entropy of each amino acid weighted by its relative frequency in the protein. The measure of synonymous codon usage bias, designated as the "weighted sum of relative entropy" (Ew) [12], is given by

E_w = \sum_{i=1}^{s} w_i E_i,   (4)

where s is the number of different amino acid species in the protein and w_i is the relative frequency of the ith amino acid in the protein as a weighting factor. Ew ranges from 0 (maximum bias) to 1 (no bias).

2.3.2. Measure of the degree of G + C composition bias

The entropy was calculated to quantify the degree of bias in G + C composition at the first, second, and third codon positions of a gene (HGC1, HGC2, and HGC3, resp.),

H_p = -p \log_2 p - (1 - p) \log_2 (1 - p),   (5)

where p is the G + C content (defined as (G + C)/(A + T + G + C)) at the first, second, or third codon positions in the nucleotide sequence (GC1, GC2, or GC3).

The entropy (H) for G + C composition (and for usage of two-fold degenerate codons; coding for asparagine, aspartic acid, cysteine, glutamic acid, glutamine, histidine, lysine, phenylalanine, or tyrosine) with values p and 1 − p is plotted in Figure 1 as a function of p.
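As a concrete illustration of equations (1)–(5), the following Python sketch computes Ew and HGC3 for a single coding sequence. It is only a minimal sketch, not the authors' G-language implementation: the standard genetic-code table, the treatment of the single-codon amino acids Met and Trp (assigned Ei = 1 here), and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

# Standard genetic code: DNA codons -> one-letter amino acids, '*' = stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16*i + 4*j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# k_i: codon degeneracy of each amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

def weighted_sum_of_relative_entropy(cds):
    """E_w (eq. (4)) of one coding sequence; start/stop codons assumed removed,
    and the sequence assumed to contain only A, C, G, T (as in Section 2.2)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    by_aa = defaultdict(Counter)                  # n_ij grouped by amino acid i
    for codon in codons:
        aa = CODON_TABLE[codon]
        if aa != "*":
            by_aa[aa][codon] += 1
    total = sum(sum(counts.values()) for counts in by_aa.values())
    ew = 0.0
    for aa, counts in by_aa.items():
        n_i = sum(counts.values())
        r_ij = [n / n_i for n in counts.values()]          # eq. (1)
        h_i = -sum(p * log2(p) for p in r_ij)              # eq. (2)
        k_i = DEGENERACY[aa]
        # Met and Trp (k_i = 1) carry no synonymous choice; treating them as
        # unbiased (E_i = 1) is an assumption of this sketch.
        e_i = h_i / log2(k_i) if k_i > 1 else 1.0          # eq. (3)
        ew += (n_i / total) * e_i                          # w_i * E_i
    return ew

def gc3_entropy(cds):
    """H_GC3 (eq. (5)): entropy of the G + C content at third codon positions."""
    third = cds[2::3]
    p = sum(base in "GC" for base in third) / len(third)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)
```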

Figure 1: Entropy (H, in bits) of G + C composition and usage of two-fold degenerate codons with values p and 1 − p, plotted as a function of p.

2.3.3. Estimation of the correlation of G + C composition with synonymous codon usage bias

Spearman's rank correlation coefficient (r) was calculated to quantify the strength of the correlation between G + C composition bias (HGC1, HGC2, and HGC3) and synonymous codon usage bias (Ew),

r = \frac{\sum_{g=1}^{m} (x_g - \bar{x})(y_g - \bar{y})}{\sqrt{\sum_{g=1}^{m} (x_g - \bar{x})^2} \sqrt{\sum_{g=1}^{m} (y_g - \bar{y})^2}}, \qquad \bar{x} = \frac{1}{m}\sum_{g=1}^{m} x_g, \quad \bar{y} = \frac{1}{m}\sum_{g=1}^{m} y_g,   (6)

where x_g is the rank of the x-axis value (HGC1, HGC2, or HGC3) for the gth gene, y_g is the rank of the y-axis value (Ew) for the gth gene, and m is the number of genes in the genome. The r value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).

3. RESULTS

3.1. Correlation of G + C composition with synonymous codon usage bias (r value)

We investigated the correlation between the degree of G + C composition bias (HGC1, HGC2, and HGC3) and that of synonymous codon usage bias (Ew) within each genome.

Figure 2 shows scatter plots of Ew plotted against HGC1, HGC2, and HGC3 with Geobacter metallireducens GS-15 genes and with Saccharophagus degradans 2–40 genes as examples and the Spearman's rank correlation coefficient (r) calculated from each plot. In G. metallireducens, the value of Ew was much better correlated with HGC3 (Figure 2(c)) than with HGC1 (Figure 2(a)) or HGC2 (Figure 2(b)), indicating that GC3 contributed more to synonymous codon usage bias than GC1 and GC2. In S. degradans, the value of Ew was not correlated with HGC1 (Figure 2(d)), HGC2 (Figure 2(e)), or HGC3 (Figure 2(f)), indicating that neither GC1, nor GC2, nor GC3 contributed to synonymous codon usage bias.

To compare the contributions of GC1, GC2, and GC3 to synonymous codon usage bias, we produced pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 genomes (Figure 3). In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC1 (x-axis) (Figure 3(a)), 362 points (97.6% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC1 in most of the genomes analyzed. In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(b)), 367 points (98.9% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC2 in most genomes analyzed. In the scatter plot of the r values of HGC1 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(c)), the scatter plot displays a diffuse distribution of points: 186 points (50.1% of the total) are on the upper left of the line y = x, indicating that the relative contributions of GC1 and GC2 to synonymous codon usage bias varied widely from genome to genome.

We constructed histograms showing the distribution of r values of HGC1, HGC2, and HGC3 with Ew for 371 bacterial genomes (Figure 4). The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) were distributed evenly between positive and negative values, whereas those of HGC3 (Figure 4(c)) were distributed towards positive values. The ranges [minimum, maximum] of the r values of HGC1, HGC2, and HGC3 were [−0.51, 0.46], [−0.28, 0.39], and [−0.07, 0.95], respectively. The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) exhibited a monomodal distribution, whereas those of HGC3 (Figure 4(c)) exhibited a multimodal distribution.

3.2. Correlation of r value with genomic features

To investigate whether the correlation of GC3 with synonymous codon usage bias (the r value of HGC3 versus Ew) was related to species characteristics, we compared the r values with genomic features such as genomic G + C content and tRNA gene copy number. Among the 371 genomes analyzed here, genomic G + C content ranged from 23% to 73% and tRNA gene copy number varied from 28 to 145.

We constructed scatter plots of the r values of HGC3 with Ew plotted against genomic G + C content and tRNA gene copy number for 371 genomes (Figure 5). The relationship between the r value of HGC3 and the tRNA gene copy number was unclear (Figure 5(b)). In contrast, the r values of HGC3 tended to be high in G + C-poor or G + C-rich genomes, revealing a nonlinear relationship between the r value of HGC3 and genomic G + C content (Figure 5(a)).
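The per-genome correlation of equation (6) can be sketched as follows, reusing the hypothetical helper functions from the earlier sketch; this is an illustrative outline, not the authors' pipeline, and the function name is an assumption.

```python
# Given the coding sequences of one genome, estimate the r value of HGC3 vs Ew.
from scipy.stats import spearmanr

def gc3_vs_codon_bias_correlation(coding_sequences):
    h_gc3 = [gc3_entropy(cds) for cds in coding_sequences]
    e_w = [weighted_sum_of_relative_entropy(cds) for cds in coding_sequences]
    r, _pvalue = spearmanr(h_gc3, e_w)   # Spearman's rank correlation, eq. (6)
    return r
```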

0.9 0.9

0.8 0.8

0.7 0.7 w w E E 0.6 0.6

0.5 0.5

0.4 0.4

0.6 0.7 0.8 0.9 1 0.85 0.9 0.95 1

HGC1, r = 0.25 HGC2, r =−0.01 (a) (b)

0.9 0.9 0.8

0.7 0.8 w w E E 0.6 0.7 0.5

0.4 0.6

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.88 0.92 0.96 1

HGC3, r = 0.95 HGC1, r = 0.06 (c) (d)

0.9 0.9

0.8 0.8 w w E E

0.7 0.7

0.6 0.6

0.86 0.9 0.94 0.98 0.85 0.9 0.95 1

HGC2, r =−0.08 HGC3, r =−0.07 (e) (f)

Figure 2: Scatter plots of Ew plotted against (a) HGC1,(b)HGC2, and (C) HGC3 for Geobacter metallireducens GS-15 genes and against (d) HGC1,(e)HGC2, and (f) HGC3 for Saccharophagus degradans 2–40 genes. The extent of the correlation between HGC1, HGC2,andHGC3 and Ew is represented by Spearman’s rank correlation coefficient (r).

The highest r value of HGC3 (0.95) was found in G. metallireducens, with a genomic G + C content of 60% (Figure 2(c)). The lowest r value of HGC3 (−0.07) was found in S. degradans, with a genomic G + C content of 46% (Figure 2(f)). The mean and standard deviation of the r values of HGC3 for G + C-poor bacteria (with genomic G + C contents less than 40%) were 0.58 and 0.12, respectively. The corresponding values for G + C-rich bacteria (with genomic G + C contents greater than 60%)

were 0.86 and 0.04. Thus, the r values of HGC3 for G + C-poor bacteria tended to be lower than those for G + C-rich bacteria.

Figure 3: Pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 bacterial genomes. Comparison of the correlation with Ew of (a) HGC3 and HGC1, (b) HGC3 and HGC2, and (c) HGC1 and HGC2.

Figure 4: Histograms of the distribution of r values of (a) HGC1, (b) HGC2, and (c) HGC3 with Ew for 371 bacterial genomes (number of genomes against r).

4. DISCUSSION

Other investigators have reported that G + C composition is correlated with synonymous codon usage bias in many organisms. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we quantified the strength of the correlation of G + C composition bias (HGC1, HGC2, and HGC3) with synonymous codon usage bias (Ew) by using a correlation coefficient (r). This approach allowed us to quantitatively compare the strength of this correlation among different genomes.

Figure 5: Scatter plots of the r values of HGC3 with Ew plotted against (a) genomic G + C content and (b) tRNA gene number for 371 bacterial genomes.

In a previous analysis of the relationships between G + C composition and synonymous codon usage bias, Wan et al. [9] stated that "GC3 was the most important factor in codon bias among GC, GC1, GC2, and GC3." This is quantitatively supported by the pairwise comparison of the r values of HGC1, HGC2, and HGC3 (Figure 3). However, the statement by Wan et al. that "GC3 is the key factor driving synonymous codon usage and that this mechanism is independent of species" differs from our conclusion that the strength of the correlation of GC3 with synonymous codon usage bias (the r value of HGC3) varies widely among species (Figure 4(c)). This discordance appears to have arisen because Wan et al. combined the genes from different genomes into a single dataset for their analysis. This analysis of combined data from different genomes masks the presence of genomes in which the correlation of GC3 with synonymous codon usage bias is negligible (such as that of S. degradans; Figure 2(f)); the results are thus inconsistent with those of the more detailed analyses obtained here for individual genomes.

Three factors, G + C composition, replication strand bias, and translational selection, are well documented to shape synonymous codon usage bias [1].

First, in bacteria with extreme genomic G + C compositions (either G + C-rich or A + T-rich), synonymous codon usage could be dominated by strong mutational bias (toward G + C or A + T) [17, 18]. The data in Figure 5(a) indicate that, although genomic G + C content was nonlinearly correlated with the r value of HGC3, there are some exceptions; for example, Nanoarchaeum equitans Kin4-M and Mycoplasma genitalium G37 had identical genomic G + C contents of 32% but very different r values of HGC3 (0.34 and 0.87, resp.), and Thermococcus kodakarensis KOD1 had a genomic G + C content of around 50% but a high r value of HGC3 (0.86). The existence of the outliers suggests that, although mutational biases have a major influence on the correlation of GC3 with synonymous codon usage bias, other evolutionary factors may play a part. For example, horizontal gene transfer among bacteria with different genomic G + C content can contribute to intragenomic variation in G + C content [19, 20].

Second, the spirochaete Borrelia burgdorferi exhibits a strong base usage skew between leading and lagging strands of replication (generally inferred as reflecting strand-specific mutational bias): genes on the leading strand tend to preferentially use G- or T-ending codons [21]. The r values of HGC3 for genes on the leading and lagging strands are similar (0.65 and 0.63, resp.). This suggests that strand bias has little influence on the correlation of GC3 with synonymous codon usage bias in B. burgdorferi.

Third, in bacteria with more tRNA genes, synonymous codon usage could be subject to stronger translational selection [22]. Figure 5(b) shows that tRNA gene copy number was not correlated with the r value of HGC3. This suggests that translational selection has little influence on the correlation of GC3 with synonymous codon usage bias. Sharp et al. [22] showed that the S value as a measure of translationally selected codon usage bias is highly correlated with tRNA gene copy number but is not correlated with genomic G + C content. Thus, the r value of HGC3 can be used as a measure complementary to the S value.

The most accepted hypothesis for the unequal usage of synonymous codons in bacterial genomes is that the unequal usage is the result of a very complex balance among different evolutionary forces (mutation and selection) [23]. The combined use of the r value and other methods (e.g., the S value) will improve our understanding of the relative contributions of different evolutionary forces to synonymous codon usage bias.

ABBREVIATIONS

A: Adenine
T: Thymine
G: Guanine
C: Cytosine
GC1: G + C content at the first codon position
GC2: G + C content at the second codon position
GC3: G + C content at the third codon position
HGC1: Entropy of GC1
HGC2: Entropy of GC2
HGC3: Entropy of GC3
Ew: Weighted sum of relative entropy
r: Spearman's rank correlation coefficient

ACKNOWLEDGMENTS

The authors thank Dr Kazuharu Arakawa (Institute for Advanced Biosciences, Keio University) for his technical advice on the G-language genome analysis environment, and Kunihiro Baba (Faculty of Policy Management, Keio University) for his technical advice on the R statistical computing environment. This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan Grant-in-Aid for the 21st Century Centre of Excellence (COE) Program entitled "Understanding and Control of Life via Systems Biology" (Keio University).

REFERENCES

[1] M. D. Ermolaeva, "Synonymous codon usage in bacteria," Current Issues in Molecular Biology, vol. 3, no. 4, pp. 91–97, 2001.
[2] A. Carbone, F. Kepes, and A. Zinovyev, "Codon bias signatures, organization of microorganisms in codon space, and lifestyle," Molecular Biology and Evolution, vol. 22, no. 3, pp. 547–561, 2005.
[3] A. Carbone, A. Zinovyev, and F. Képès, "Codon adaptation index as a measure of dominating codon bias," Bioinformatics, vol. 19, no. 16, pp. 2005–2015, 2003.
[4] R. D. Knight, S. J. Freeland, and L. F. Landweber, "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes," Genome Biology, vol. 2, no. 4, pp. research0010.1–research0010.13, 2001.
[5] J. R. Lobry and A. Necşulea, "Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes," Gene, vol. 385, pp. 128–136, 2006.
[6] D. J. Lynn, G. A. C. Singer, and D. A. Hickey, "Synonymous codon usage is subject to selection in thermophilic bacteria," Nucleic Acids Research, vol. 30, no. 19, pp. 4272–4277, 2002.
[7] G. A. C. Singer and D. A. Hickey, "Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content," Gene, vol. 317, no. 1-2, pp. 39–47, 2003.
[8] H. Suzuki, R. Saito, and M. Tomita, "A problem in multivariate analysis of codon usage data and a possible solution," FEBS Letters, vol. 579, no. 28, pp. 6499–6504, 2005.
[9] X.-F. Wan, D. Xu, A. Kleinhofs, and J. Zhou, "Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes," BMC Evolutionary Biology, vol. 4, p. 19, 2004.
[10] F. Wright, "The 'effective number of codons' used in a gene," Gene, vol. 87, no. 1, pp. 23–29, 1990.
[11] B. Zeeberg, "Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes," Genome Research, vol. 12, no. 6, pp. 944–955, 2002.
[12] H. Suzuki, R. Saito, and M. Tomita, "The 'weighted sum of relative entropy': a new index for synonymous codon usage bias," Gene, vol. 335, no. 1-2, pp. 19–23, 2004.
[13] K. Arakawa, K. Mori, K. Ikeda, T. Matsuzaki, Y. Kobayashi, and M. Tomita, "G-language genome analysis environment: a workbench for nucleotide sequence data mining," Bioinformatics, vol. 19, no. 2, pp. 305–306, 2003.
[14] R Development Core Team, R: a language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2006.
[15] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, "GenBank," Nucleic Acids Research, vol. 35, supplement 1, pp. D21–D25, 2007.
[16] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[17] A. Muto and S. Osawa, "The guanine and cytosine content of genomic DNA and bacterial evolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 1, pp. 166–169, 1987.
[18] N. Sueoka, "On the genetic basis of variation and heterogeneity of DNA base composition," Proceedings of the National Academy of Sciences of the United States of America, vol. 48, no. 4, pp. 582–592, 1962.
[19] S. Garcia-Vallve, A. Romeu, and J. Palau, "Horizontal gene transfer in bacterial and archaeal complete genomes," Genome Research, vol. 10, no. 11, pp. 1719–1725, 2000.
[20] R. J. Grocock and P. M. Sharp, "Synonymous codon usage in Pseudomonas aeruginosa PA01," Gene, vol. 289, no. 1-2, pp. 131–139, 2002.
[21] J. O. McInerney, "Replicational and transcriptional selection on codon usage in Borrelia burgdorferi," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 18, pp. 10698–10703, 1998.
[22] P. M. Sharp, E. Bailes, R. J. Grocock, J. F. Peden, and R. E. Sockett, "Variation in the strength of selected codon usage bias among bacteria," Nucleic Acids Research, vol. 33, no. 4, pp. 1141–1153, 2005.
[23] P. M. Sharp, M. Stenico, J. F. Peden, and A. T. Lloyd, "Codon usage: mutational bias, translational selection, or both?" Biochemical Society Transactions, vol. 21, no. 4, pp. 835–841, 1993.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 79879, 9 pages doi:10.1155/2007/79879

Research Article Information-Theoretic Inference of Large Transcriptional Regulatory Networks

Patrick E. Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi

ULB Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, 1050 Brussels, Belgium

Received 26 January 2007; Accepted 12 May 2007

Recommended by Juho Rousu

The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting among the least redundant variables the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.

Copyright © 2007 Patrick E. Meyer et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Two important issues in computational biology are the extent to which it is possible to model transcriptional interactions by large networks of interacting elements and how these interactions can be effectively learned from measured expression data [1]. The reverse engineering of transcriptional regulatory networks (TRNs) from expression data alone is far from trivial because of the combinatorial nature of the problem and the poor information content of the data [1]. An additional problem is that by focusing only on transcript data, the inferred network should not be considered as a biochemical regulatory network but as a gene-to-gene network, where many physical connections between macromolecules might be hidden by shortcuts.

In spite of these evident limitations, the bioinformatics community made important advances in this domain over the last few years. Examples are methods like Boolean networks, Bayesian networks, and Association networks [2].

This paper will focus on information-theoretic approaches [3–6] which typically rely on the estimation of mutual information from expression data in order to measure the statistical dependence between variables (the terms "variable" and "feature" are used interchangeably in this paper). Such methods have recently held the attention of the bioinformatics community for the inference of very large networks [4–6].

The adoption of mutual information in probabilistic model design can be traced back to the Chow-Liu tree algorithm [3] and its extensions proposed by [7, 8]. Later, [9, 10] suggested to improve network inference by using another information-theoretic quantity, namely multi-information.

This paper introduces an original information-theoretic method, called MRNET, inspired by a recently proposed feature selection technique, the maximum relevance/minimum redundancy (MRMR) algorithm [11, 12]. This algorithm has been used with success in supervised classification problems to select a set of nonredundant genes which are explicative of the targeted phenotype [12, 13]. The MRMR selection strategy consists in selecting a set of variables that has a high mutual information with the target variable (maximum relevance) and at the same time are mutually maximally independent (minimum redundancy between relevant variables). The advantage of this approach is that redundancy among selected variables is avoided and that the trade-off between relevance and redundancy is properly taken into account.

Our proposed MRNET strategy, preliminarily sketched in [14], consists of (i) formulating the network inference problem as a series of input/output supervised gene selection procedures, where one gene at the time plays the role of the target output, and (ii) adopting the MRMR principle to perform the gene selection for each supervised gene selection procedure.

The paper benchmarks MRNET against three state-of-the-art information-theoretic network inference methods, namely relevance networks (RELNET), CLR, and ARACNE. The comparison relies on thirty artificial microarray datasets synthesized by two public-domain generators. The extensive simulation setting allows us to study the effect of the number of samples, the number of genes, and the noise intensity on the inferred network accuracy. Also, the sensitivity of the performance to two alternative entropy estimators is assessed.

The outline of the paper is as follows. Section 2 reviews the state-of-the-art network inference techniques based on information theory. Section 3 introduces our original approach based on MRMR. The experimental framework and the results obtained on artificially generated datasets are presented in Sections 4 and 5, respectively. Section 6 concludes the paper.

2. INFORMATION-THEORETIC NETWORK INFERENCE: STATE OF THE ART

This section reviews some state-of-the-art methods for network inference which are based on information-theoretic notions.

These methods require at first the computation of the mutual information matrix (MIM), a square matrix whose i, j element

MIM_{ij} = I(X_i; X_j) = \sum_{x_i \in \mathcal{X}} \sum_{x_j \in \mathcal{X}} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)},   (1)

is the mutual information between X_i and X_j, where X_i ∈ X, i = 1, ..., n, is a discrete random variable denoting the expression level of the ith gene.

2.1. Chow-Liu tree

The Chow and Liu approach consists in finding the maximum spanning tree (MST) of a complete graph, where the weights of the edges are the mutual information quantities between the connected nodes [3]. The construction of the [...]

2.2. Relevance networks (RELNET)

[...] threshold I_0. The complexity of the method is O(n^2) since all pairwise interactions are considered.

Note that this method is prone to infer false positives in the case of indirect interactions between genes. For example, if gene X_1 regulates both gene X_2 and gene X_3, a high mutual information between the pairs {X_1, X_2}, {X_1, X_3}, and {X_2, X_3} would be present. As a consequence, the algorithm would infer an edge between X_2 and X_3 although these two genes interact only through gene X_1.

2.3. CLR algorithm

The CLR algorithm [6] is an extension of RELNET. This algorithm computes the mutual information (MI) for each pair of genes and derives a score related to the empirical distribution of these MI values. In particular, instead of considering the information I(X_i; X_j) between genes X_i and X_j, it takes into account the score z_{ij} = \sqrt{z_i^2 + z_j^2}, where

z_i = \max\left(0, \frac{I(X_i; X_j) - \mu_i}{\sigma_i}\right),   (2)

and μ_i and σ_i are, respectively, the mean and the standard deviation of the empirical distribution of the mutual information values I(X_i, X_k), k = 1, ..., n. The CLR algorithm was successfully applied to decipher the E. coli TRN [6]. Note that, like RELNET, CLR demands an O(n^2) cost to infer the network from a given MIM.

2.4. ARACNE

The algorithm for the reconstruction of accurate cellular networks (ARACNE) [5] is based on the data processing inequality [16]. This inequality states that if gene X_1 interacts with gene X_3 through gene X_2, then

I(X_1; X_3) \leq \min\left( I(X_1; X_2),\, I(X_2; X_3) \right).   (3)

The ARACNE procedure starts by assigning to each pair of nodes a weight equal to their mutual information. Then, as in RELNET, all edges for which I(X_i; X_j) [...]
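To make the matrix-based formulation concrete, the sketch below builds a mutual information matrix from discretized expression data with a simple plug-in (empirical) estimator and then derives CLR-style scores in the spirit of (2). This is only an illustrative sketch under our own assumptions (plug-in estimator, histogram discretization, inclusion of the zero diagonal in the background statistics, function names); it is not the implementation used in the paper.

```python
import numpy as np

def plugin_mutual_information(x, y, n_bins=10):
    """Empirical (plug-in) MI, in nats, between two discretized expression profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def mutual_information_matrix(data):
    """MIM of eq. (1) over an (n_samples, n_genes) expression matrix."""
    n_genes = data.shape[1]
    mim = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            mim[i, j] = mim[j, i] = plugin_mutual_information(data[:, i], data[:, j])
    return mim

def clr_scores(mim):
    """CLR-style combined z-scores derived from a MIM, cf. eq. (2)."""
    mu = mim.mean(axis=1, keepdims=True)
    sigma = mim.std(axis=1, keepdims=True) + 1e-12
    z = np.maximum(0.0, (mim - mu) / sigma)   # z_i: row-wise background z-score
    return np.sqrt(z ** 2 + z.T ** 2)         # combine the two directions into z_ij
```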

Figure 1: An artificial microarray dataset is generated from an original network. The inferred network can then be compared to this true network. (Blocks shown in the figure: network and data generator; original network; artificial dataset; entropy estimator; mutual information matrix; inference method; inferred network; validation procedure; precision-recall curves and F-scores.)

3. OUR PROPOSAL: MINIMUM REDUNDANCY NETWORKS (MRNET)

We propose to infer a network using the maximum relevance/minimum redundancy (MRMR) feature selection method. The idea consists in performing a series of supervised MRMR gene selection procedures, where each gene in turn plays the role of the target output.

The MRMR method has been introduced in [11, 12] together with a best-first search strategy for performing filter selection in supervised learning problems. Consider a supervised learning task, where the output is denoted by Y and V is the set of input variables. The method ranks the set of inputs according to a score that is the difference between the mutual information with the output variable Y (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e., the most informative variables to the target Y) should be well ranked, whereas indirect interactions (i.e., the ones with redundant information with the direct ones) should be badly ranked by the method.

The greedy search starts by selecting the variable X_i having the highest mutual information to the target Y. The second selected variable X_j will be the one with a high information I(X_j; Y) to the target and at the same time a low information I(X_j; X_i) to the previously selected variable. In the following steps, given a set S of selected variables, the criterion updates S by choosing the variable

X_j^{MRMR} = \arg\max_{X_j \in V \setminus S} \left( u_j - r_j \right),   (4)

that maximizes the score

s_j = u_j - r_j,   (5)

where u_j is a relevance term and r_j is a redundancy term. More precisely,

u_j = I(X_j; Y),   (6)

is the mutual information of X_j with the target variable Y, and

r_j = \frac{1}{|S|} \sum_{X_k \in S} I(X_j; X_k),   (7)

measures the average redundancy of X_j to each already selected variable X_k ∈ S. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. It has been shown in [12] that the MRMR criterion is an optimal "pairwise" approximation of the conditional mutual information between any two genes X_j and Y given the set S of selected variables, I(X_j; Y | S).

The MRNET approach consists in repeating this selection procedure for each target gene by putting Y = X_i and V = X \ {X_i}, i = 1, ..., n, where X is the set of the expression levels of all genes. For each pair {X_i, X_j}, MRMR returns two (not necessarily equal) scores s_i and s_j according to (5). The score of the pair {X_i, X_j} is then computed by taking the maximum of s_i and s_j. A specific network can then be inferred by deleting all the edges whose score lies below a given threshold I_0 (as in RELNET, CLR, and ARACNE). Thus, the algorithm infers an edge between X_i and X_j either when X_i is a well-ranked predictor of X_j (s_i > I_0) or when X_j is a well-ranked predictor of X_i (s_j > I_0).

An effective implementation of the MRMR best-first search is available in [17]. This implementation demands an O(f × n) complexity for selecting f features using a best-first search strategy. It follows that MRNET has an O(f × n^2) complexity since the feature selection step is repeated for each of the n genes. In other terms, the complexity ranges between O(n^2) and O(n^3) according to the value of f. Note that the lower the f value, the lower the number of incoming edges per node to infer and consequently the lower the resulting complexity.

Note that since mutual information is a symmetric measure, it is not possible to derive the direction of the edge from its weight. This limitation is common to all the methods presented so far. However, this information could be provided by edge orientation algorithms (e.g., IC) commonly used in Bayesian networks [7].
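The following minimal sketch illustrates the per-target MRMR ranking and the symmetric MRNET scoring described above, starting from a precomputed MIM (for example, the one from the earlier sketch). The stopping rule (ranking all n − 1 candidates), the thresholding helper, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mrmr_scores(mim, target):
    """Greedy MRMR ranking of all candidate genes against one target gene.

    Returns, for every gene j != target, the score s_j = u_j - r_j
    (eqs. (4)-(7)) attained at the step at which j was selected."""
    n = mim.shape[0]
    candidates = [j for j in range(n) if j != target]
    selected, scores = [], np.full(n, -np.inf)
    while candidates:
        best_j, best_s = None, -np.inf
        for j in candidates:
            u_j = mim[j, target]                                   # relevance, eq. (6)
            r_j = np.mean(mim[j, selected]) if selected else 0.0   # redundancy, eq. (7)
            s_j = u_j - r_j                                        # eq. (5)
            if s_j > best_s:
                best_j, best_s = j, s_j
        scores[best_j] = best_s
        selected.append(best_j)
        candidates.remove(best_j)
    return scores

def mrnet(mim, threshold):
    """Symmetric MRNET adjacency: keep edge (i, j) if max(s_i, s_j) > threshold."""
    n = mim.shape[0]
    pair_scores = np.full((n, n), -np.inf)
    for target in range(n):
        s = mrmr_scores(mim, target)
        for j in range(n):
            if j != target:
                pair_scores[target, j] = pair_scores[j, target] = max(
                    pair_scores[target, j], s[j])
    return pair_scores > threshold
```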

4. EXPERIMENTS

The experimental framework consists of four steps (see Figure 1): the artificial network and data generation, the computation of the mutual information matrix, the inference of the network, and the validation of the results. This section details each step of the approach.

4.1. Network and data generation

In order to assess the results returned by our algorithm and compare it to other methods, we created a set of benchmarks on the basis of artificially generated microarray datasets. In spite of the evident limitations of using synthetic data, this makes possible a quantitative assessment of the accuracy, thanks to the availability of the true network underlying the microarray dataset (see Figure 1).

We used two different generators of artificial gene expression data: the data generator described in [18] (hereafter referred to as the sRogers generator) and the SynTReN generator [19]. The two generators, whose implementations are freely available on the World Wide Web, are sketched in the following paragraphs.

sRogers generator

The sRogers generator produces the topology of the genetic network according to an approximate power-law distribution on the number of regulatory connections out of each gene. The normal steady state of the system is evaluated by integrating a system of differential equations. The generator offers the possibility to obtain 2k different measures (k wild type and k knock out experiments). These measures can be replicated R times, yielding a total of N = 2kR samples. After the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.

SynTReN generator

The SynTReN generator generates a network topology by selecting subnetworks from E. coli and S. cerevisiae source networks. Then, transition functions and their parameters are assigned to the edges in the network. Eventually, mRNA expression levels for the genes in the network are obtained by simulating equations based on Michaelis-Menten and Hill kinetics under different conditions. As for the previous generator, after the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.

Generation

The two generators were used to synthesize thirty datasets. Table 1 reports for each dataset the number n of genes, the number N of samples, and the Gaussian noise intensity (expressed as a percentage of the signal variance).

4.2. Mutual information matrix estimation

In order to benchmark MRNET versus RELNET, CLR, and ARACNE, the same MIM is used for the four inference approaches. Several estimators of mutual information have been proposed in literature [5, 6, 20, 21]. Here, we test the Miller-Madow entropy estimator [20] and a parametric Gaussian density estimator. Since the Miller-Madow method requires quantized values, we pretreated the data with the equal-sized intervals algorithm [22], where the size l = √N. The parametric Gaussian estimator is directly computed by I(X_i, X_j) = (1/2) log(σ_{ii} σ_{jj} / |C|), where |C| is the determinant of the covariance matrix. Note that the complexity of both estimators is O(N), where N is the number of samples. This means that since the whole MIM cost is O(N × n^2), the MIM computation could be the bottleneck of the whole network inference procedure for a large number of samples (N ≫ n). We deem, however, that at the current state of the technology, this should not be considered as a major issue since the number of samples is typically much smaller than the number of measured features.

4.3. Validation

A network inference problem can be seen as a binary decision problem, where the inference algorithm plays the role of a classifier: for each pair of nodes, the algorithm either adds an edge or does not. Each pair of nodes is thus assigned a positive label (an edge) or a negative one (no edge).

A positive label (an edge) predicted by the algorithm is considered as a true positive (TP) or as a false positive (FP) depending on the presence or not of the corresponding edge in the underlying true network, respectively. Analogously, a negative label is considered as a true negative (TN) or a false negative (FN) depending on whether the corresponding edge is present or not in the underlying true network, respectively. The decision made by the algorithm can be summarized by a confusion matrix (see Table 2).

It is generally recommended [23] to use receiver operator characteristic (ROC) curves when evaluating binary decision problems in order to avoid effects related to the chosen threshold. However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution, as typically encountered in TRN inference because of sparseness.

To tackle this problem, precision-recall (PR) curves have been cited as an alternative to ROC curves [24]. Let the precision quantity

p = \frac{TP}{TP + FP},   (8)

measure the fraction of real edges among the ones classified as positive and the recall quantity

r = \frac{TP}{TP + FN},   (9)

also known as true positive rate, denote the fraction of real edges that are correctly inferred. These quantities depend on the threshold chosen to return a binary decision. The PR curve is a diagram which plots the precision (p) versus recall (r) for different values of the threshold on a two-dimensional coordinate system.
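A small sketch of the validation step of equations (8)–(9): it turns a symmetric score matrix and the known true adjacency matrix into precision-recall points across thresholds. The array layout and the function name are assumptions for illustration, not part of the paper.

```python
import numpy as np

def precision_recall_curve(scores, truth):
    """Precision and recall (eqs. (8)-(9)) for a range of thresholds.

    `scores` and `truth` are symmetric (n, n) arrays; `truth` is the adjacency
    matrix of the generating (true) network and is assumed to contain at least
    one edge. Each undirected pair is counted once."""
    iu = np.triu_indices_from(scores, k=1)
    s, t = scores[iu], truth[iu].astype(bool)
    points = []
    for threshold in np.unique(s):
        predicted = s >= threshold
        tp = np.sum(predicted & t)
        fp = np.sum(predicted & ~t)
        fn = np.sum(~predicted & t)
        if tp + fp == 0:
            continue
        points.append((tp / (tp + fn), tp / (tp + fp)))   # (recall, precision)
    return points
```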

Table 1: Datasets with n the number of genes and N the number of samples.

Dataset  Generator  Topology        n     N     Noise
RN1      sRogers    Power-law tail  700   700   0%
RN2      sRogers    Power-law tail  700   700   5%
RN3      sRogers    Power-law tail  700   700   10%
RN4      sRogers    Power-law tail  700   700   20%
RN5      sRogers    Power-law tail  700   700   30%
RS1      sRogers    Power-law tail  700   100   0%
RS2      sRogers    Power-law tail  700   300   0%
RS3      sRogers    Power-law tail  700   500   0%
RS4      sRogers    Power-law tail  700   800   0%
RS5      sRogers    Power-law tail  700   1000  0%
RV1      sRogers    Power-law tail  100   700   0%
RV2      sRogers    Power-law tail  300   700   0%
RV3      sRogers    Power-law tail  500   700   0%
RV4      sRogers    Power-law tail  700   700   0%
RV5      sRogers    Power-law tail  1000  700   0%
SN1      SynTReN    S. cerevisiae   400   400   0%
SN2      SynTReN    S. cerevisiae   400   400   5%
SN3      SynTReN    S. cerevisiae   400   400   10%
SN4      SynTReN    S. cerevisiae   400   400   20%
SN5      SynTReN    S. cerevisiae   400   400   30%
SS1      SynTReN    S. cerevisiae   400   100   0%
SS2      SynTReN    S. cerevisiae   400   200   0%
SS3      SynTReN    S. cerevisiae   400   300   0%
SS4      SynTReN    S. cerevisiae   400   400   0%
SS5      SynTReN    S. cerevisiae   400   500   0%
SV1      SynTReN    S. cerevisiae   100   400   0%
SV2      SynTReN    S. cerevisiae   200   400   0%
SV3      SynTReN    S. cerevisiae   300   400   0%
SV4      SynTReN    S. cerevisiae   400   400   0%
SV5      SynTReN    S. cerevisiae   500   400   0%

Table 2: Confusion matrix.

                      Actual positive   Actual negative
Inferred positive     TP                FP
Inferred negative     FN                TN

Note that a compact representation of the PR diagram is returned by the maximum of the F-score quantity

F = \frac{2pr}{r + p},   (10)

which is a weighted harmonic average of precision and recall. The following section will present the results by means of PR curves and F-scores.

Also, in order to assess the significance of the results, a McNemar test can be performed. The McNemar test [25] states that if two algorithms A and B have the same error rate, then

P\left( \frac{\left( |N_{AB} - N_{BA}| - 1 \right)^2}{N_{AB} + N_{BA}} > 3.841459 \right) < 0.05,   (11)

where N_{AB} is the number of incorrect edges of the network inferred from algorithm A that are correct in the network inferred from algorithm B, and N_{BA} is the counterpart.
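The summary F-score of (10) and the significance check of (11) can be sketched as follows, reusing the (recall, precision) points of the earlier validation sketch; this toy code and its function names are our own assumptions, not the paper's implementation.

```python
import numpy as np

def max_f_score(pr_points):
    """Maximum F-score (eq. (10)) over the (recall, precision) points of a PR curve."""
    return max(2 * p * r / (p + r) for r, p in pr_points if p + r > 0)

def mcnemar_differs(pred_a, pred_b, truth):
    """McNemar test (eq. (11)) on two inferred edge sets against the true network.

    Returns True when the 5% critical value of the chi-square distribution with
    one degree of freedom (3.841459) is exceeded, i.e. when the error rates of
    the two inference algorithms differ significantly."""
    iu = np.triu_indices_from(truth, k=1)
    a_wrong = pred_a[iu] != truth[iu]
    b_wrong = pred_b[iu] != truth[iu]
    n_ab = np.sum(a_wrong & ~b_wrong)     # A wrong where B is right
    n_ba = np.sum(~a_wrong & b_wrong)     # B wrong where A is right
    if n_ab + n_ba == 0:
        return False
    statistic = (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)
    return statistic > 3.841459
```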

5. RESULTS AND DISCUSSION

A thorough comparison would require the display of the PR-curves (Figure 2) for each dataset. For reason of space, we decided to summarize the PR-curve information by the maximum F-score in Table 3. Note that for each dataset, the accuracy of the best methods (i.e., those whose score is not significantly lower than the highest one according to the McNemar test) is typed in boldface.

Figure 2: PR-curves (precision versus recall) for the RS3 dataset using the Miller-Madow estimator, for MRNET, ARACNE, CLR, and RELNET. The curves are obtained by varying the rejection/acceptation threshold.

We may summarize the results as follows.

Accuracy sensitivity to the number of variables. The number of variables ranges from 100 to 1000 for the datasets RV1, RV2, RV3, RV4, and RV5, and from 100 to 500 for the datasets SV1, SV2, SV3, SV4, and SV5. Figure 3 shows that the accuracy and the number of variables of the network are weakly negatively correlated. This appears to be true independently of the inference method and of the MI estimator.

Figure 3: Influence of the number of variables on accuracy (SynTReN SV datasets, Miller-Madow estimator; F-score against number of genes at 400 samples).

Accuracy sensitivity to the number of samples. The number of samples ranges from 100 to 1000 for the datasets RS1, RS2, RS3, RS4, and RS5, and from 100 to 500 for the datasets SS1, SS2, SS3, SS4, and SS5. Figure 4 shows how the accuracy is strongly and positively correlated to the number of samples.

Figure 4: Influence of the number of samples on accuracy (sRogers RS datasets, Gaussian estimator; F-score against number of samples at 700 genes).

Accuracy sensitivity to the noise intensity. The intensity of noise ranges from 0% to 30% for the datasets RN1, RN2, RN3, RN4, and RN5, and for the datasets SN1, SN2, SN3, SN4, and SN5. The performance of the methods using the Miller-Madow entropy estimator decreases significantly with the increasing noise, whereas the Gaussian estimator appears to be more robust (see Figure 5).

Accuracy sensitivity to the MI estimator. We can observe in Figure 6 that the Gaussian parametric estimator gives better results than the Miller-Madow estimator. This is particularly evident with the sRogers datasets.

Accuracy sensitivity to the data generator. The SynTReN generator produces datasets for which the inference task appears to be harder, as shown in Table 3.

Accuracy of the inference methods. Table 3 supports the following three considerations: (i) MRNET is competitive with the other approaches, (ii) ARACNE outperforms the other approaches when the Gaussian estimator is used, and (iii) MRNET and CLR are the two best techniques when the nonparametric Miller-Madow estimator is used.

Table 3: Maximum F-scores for each inference method using two different mutual information estimators. The best methods (those having a score not significantly weaker than the best score, i.e., P-value < .05) are typed in boldface. Average performances on SynTReN and sRogers datasets are reported, respectively, in the S-AVG and R-AVG lines.

          Miller-Madow                        Gaussian
          RELNET  CLR   ARACNE  MRNET    RELNET  CLR   ARACNE  MRNET
SN1       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SN2       0.23    0.26  0.29    0.29     0.21    0.25  0.31    0.25
SN3       0.23    0.25  0.24    0.26     0.21    0.25  0.31    0.26
SN4       0.22    0.24  0.26    0.26     0.21    0.25  0.28    0.26
SN5       0.21    0.23  0.24    0.24     0.2     0.25  0.27    0.24
SS1       0.21    0.22  0.22    0.23     0.19    0.24  0.24    0.23
SS2       0.21    0.24  0.28    0.29     0.2     0.24  0.27    0.25
SS3       0.21    0.24  0.27    0.28     0.2     0.24  0.28    0.25
SS4       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SS5       0.22    0.24  0.28    0.29     0.21    0.24  0.3     0.26
SV1       0.32    0.36  0.41    0.39     0.3     0.4   0.44    0.38
SV2       0.25    0.28  0.35    0.33     0.25    0.35  0.36    0.32
SV3       0.21    0.24  0.3     0.28     0.21    0.28  0.3     0.27
SV4       0.22    0.24  0.27    0.27     0.21    0.24  0.3     0.26
SV5       0.24    0.23  0.29    0.29     0.22    0.24  0.31    0.26
S-AVG     0.23    0.25  0.28    0.28     0.21    0.26  0.30    0.27
RN1       0.59    0.65  0.6     0.61     0.89    0.87  0.92    0.93
RN2       0.5     0.57  0.5     0.49     0.89    0.87  0.92    0.92
RN3       0.5     0.55  0.5     0.52     0.89    0.87  0.92    0.92
RN4       0.46    0.51  0.47    0.47     0.89    0.87  0.92    0.91
RN5       0.42    0.46  0.41    0.4      0.88    0.86  0.91    0.91
RS1       0.1     0.11  0.09    0.1      0.19    0.19  0.19    0.18
RS2       0.35    0.32  0.31    0.31     0.45    0.44  0.47    0.46
RS3       0.38    0.32  0.36    0.38     0.58    0.56  0.6     0.6
RS4       0.47    0.54  0.47    0.5      0.75    0.75  0.8     0.79
RS5       0.58    0.68  0.6     0.64     0.9     0.86  0.93    0.93
RV1       0.52    0.38  0.46    0.46     0.72    0.75  0.72    0.72
RV2       0.49    0.53  0.49    0.53     0.71    0.71  0.71    0.71
RV3       0.45    0.5   0.45    0.48     0.69    0.69  0.71    0.71
RV4       0.47    0.51  0.48    0.48     0.69    0.7   0.74    0.72
RV5       0.47    0.52  0.47    0.48     0.7     0.68  0.74    0.73
R-AVG     0.45    0.48  0.44    0.46     0.72    0.71  0.74    0.74
Tot-AVG   0.34    0.36  0.36    0.37     0.47    0.49  0.52    0.51

5.1. Feature selection techniques in network inference

As shown experimentally in the previous section, MRNET is competitive with the state-of-the-art techniques. Furthermore, MRNET benefits from some additional properties which are common to all the feature selection strategies for network inference [26, 27], as follows.

(1) Feature selection algorithms can often deal with thousands of variables in a reasonable amount of time. This makes inference scalable to large networks.

(2) Feature selection algorithms may be easily made parallel, since each of the n selection tasks is independent.

(3) Feature selection algorithms may be made faster by a priori knowledge. For example, knowing the list of regulator genes of an organism improves the selection speed and the inference quality by limiting the search space of the feature selection step to this small list of genes. The knowledge of existing edges can also improve the inference. For example, in a sequential selection process, as in the forward selection used with MRMR, the next variable is selected given the already selected features. As a result, the performance of the selection can be strongly improved by conditioning on known relationships.

However, there is a disadvantage in using a feature selection technique for network inference. The objective of feature selection is selecting, among a set of input variables, the ones that will lead to the best predictive model. It has been

proved in [28] that the minimum set that achieves optimal classification accuracy under certain general conditions is the Markov blanket of a target variable. The Markov blanket of a target variable is composed of the variable's parents, the variable's children, and the variable's children's parents [7]. The latter are indirect relationships. In other words, these variables have a conditional mutual information to the target variable Y higher than their mutual information. Let us consider the following example. Let Y and X_i be independent random variables, and X_j = X_i + Y (see Figure 7). Since the variables are independent, I(X_i; Y) = 0, and the conditional mutual information is higher than the mutual information, that is, I(X_i; Y | X_j) > 0. It follows that X_i has some information to Y given X_j but no information to Y taken alone. This behavior is colloquially referred to as the explaining-away effect in the Bayesian network literature [7]. Selecting variables, like X_i, that take part in indirect interactions reduces the accuracy of the network inference task. However, since MRMR relies only on pairwise interactions, it does not take into account the gain in information due to conditioning. In our example, the MRMR algorithm, after having selected X_j, computes the score s_i = I(X_i; Y) − I(X_i; X_j), where I(X_i; Y) = 0 and I(X_i; X_j) > 0. This score is negative and X_i is likely to be badly ranked. As a result, the MRMR feature selection criterion is less exposed to this shortcoming of most feature selection techniques while sharing their interesting properties. Further experiments will focus on this aspect.

Figure 5: Influence of the noise on MRNET accuracy for the two MIM estimators, empirical and Gaussian (sRogers RN datasets; 700 genes, 700 samples).

Figure 6: Influence of the MI estimator on MRNET accuracy for the two MIM estimators, empirical and Gaussian (sRogers RS datasets; MRNET, 700 genes; F-score against number of samples).

Figure 7: Example of indirect relationship between X_i and Y (through X_j).
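The explaining-away example above can be checked numerically. The toy script below (our own illustration, not from the paper) takes X_i and Y as independent fair binary variables and X_j = X_i + Y, and reproduces I(X_i; Y) = 0, I(X_i; Y | X_j) > 0, and a negative MRMR score s_i.

```python
import itertools
from math import log2

# Joint distribution of (Xi, Y, Xj): Xi and Y are independent fair coins, Xj = Xi + Y.
states = [(xi, y, xi + y) for xi, y in itertools.product((0, 1), repeat=2)]
p = {s: 0.25 for s in states}

def marginal(indices):
    m = {}
    for s, q in p.items():
        key = tuple(s[i] for i in indices)
        m[key] = m.get(key, 0.0) + q
    return m

def entropy(margin):
    return -sum(q * log2(q) for q in margin.values() if q > 0)

def mi(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(marginal([a])) + entropy(marginal([b])) - entropy(marginal([a, b]))

def cond_mi(a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return (entropy(marginal([a, c])) + entropy(marginal([b, c]))
            - entropy(marginal([a, b, c])) - entropy(marginal([c])))

XI, Y, XJ = 0, 1, 2
print(mi(XI, Y))               # 0.0: Xi alone carries no information about Y
print(cond_mi(XI, Y, XJ))      # 0.5: but it does once Xj is known (explaining away)
print(mi(XI, Y) - mi(XI, XJ))  # -0.5: the MRMR score s_i is negative, so Xi ranks badly
```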

6. CONCLUSION AND FUTURE WORK

A new network inference method, MRNET, has been proposed. This method relies on an effective method of information-theoretic feature selection called MRMR. Similarly to other network inference methods, MRNET relies on pairwise interactions between genes, making possible the inference of large networks (up to several thousands of genes). Another advantage of MRNET, which could be exploited in future work, is its ability to benefit explicitly from a priori knowledge.

MRNET was compared experimentally to three state-of-the-art information-theoretic network inference methods, namely RELNET, CLR, and ARACNE, on thirty inference tasks. The microarray datasets were generated artificially with two different generators in order to effectively assess their inference power. Also, two different mutual information estimation methods were used. The experimental results showed that MRNET is competitive with the benchmarked information-theoretic methods.

Future work will focus on three main axes: (i) the assessment of additional mutual information estimators, (ii) the validation of the techniques on the basis of real microarray data, and (iii) a theoretical analysis of which conditions should be met for MRNET to reconstruct the true network.

ACKNOWLEDGMENT

This work was partially supported by the Communauté Française de Belgique under ARC Grant no. 04/09-307.

REFERENCES

[1] E. P. van Someren, L. F. A. Wessels, E. Backer, and M. J. T. Reinders, "Genetic network modeling," Pharmacogenomics, vol. 3, no. 4, pp. 507–525, 2002.
[2] T. S. Gardner and J. J. Faith, "Reverse-engineering transcription control networks," Physics of Life Reviews, vol. 2, no. 1, pp. 65–88, 2005.
[3] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[4] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," Pacific Symposium on Biocomputing, pp. 418–429, 2000.
[5] A. A. Margolin, I. Nemenman, K. Basso, et al., "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context," BMC Bioinformatics, vol. 7, supplement 1, p. S7, 2006.
[6] J. J. Faith, B. Hayete, J. T. Thaden, et al., "Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles," PLoS Biology, vol. 5, no. 1, p. e8, 2007.
[7] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[8] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu, "Learning Bayesian networks from data: an information-theory based approach," Artificial Intelligence, vol. 137, no. 1-2, pp. 43–90, 2002.
[9] E. Schneidman, S. Still, M. J. Berry II, and W. Bialek, "Network information and connected correlations," Physical Review Letters, vol. 91, no. 23, Article ID 238701, 4 pages, 2003.
[10] I. Nemenman, "Multivariate dependence, and genetic network inference," Tech. Rep. NSF-KITP-04-54, KITP, UCSB, Santa Barbara, Calif, USA, 2004.
[11] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[12] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.
[13] P. E. Meyer and G. Bontempi, "On the use of variable complementarity for feature selection in cancer classification," in Applications of Evolutionary Computing: EvoWorkshops, F. Rothlauf, J. Branke, S. Cagnoni, et al., Eds., vol. 3907 of Lecture Notes in Computer Science, pp. 91–102, Springer, Berlin, Germany, 2006.
[14] P. E. Meyer, K. Kontos, and G. Bontempi, "Biological network inference using redundancy analysis," in Proceedings of the 1st International Conference on Bioinformatics Research and Development (BIRD '07), pp. 916–927, Berlin, Germany, March 2007.
[15] A. J. Butte, P. Tamayo, D. Slonim, T. R. Golub, and I. S. Kohane, "Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 22, pp. 12182–12186, 2000.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1990.
[17] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, no. 2, pp. 197–213, 2002.
[18] S. Rogers and M. Girolami, "A Bayesian regression approach to the inference of regulatory networks from gene expression data," Bioinformatics, vol. 21, no. 14, pp. 3131–3137, 2005.
[19] T. van den Bulcke, K. van Leemput, B. Naudts, et al., "SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms," BMC Bioinformatics, vol. 7, p. 43, 2006.
[20] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[21] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. van der Meulen, "Nonparametric entropy estimation: an overview," Journal of Statistics, vol. 6, no. 1, pp. 17–39, 1997.
[22] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proceedings of the 12th International Conference on Machine Learning (ML '95), pp. 194–202, Lake Tahoe, Calif, USA, July 1995.
[23] F. J. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 445–453, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[24] J. Bockhorst and M. Craven, "Markov networks for detecting overlapping elements in sequence data," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds., pp. 193–200, MIT Press, Cambridge, Mass, USA, 2005.
[25] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[26] K. B. Hwang, J. W. Lee, S.-W. Chung, and B.-T. Zhang, "Construction of large-scale Bayesian networks by local to global search," in Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI '02), pp. 375–384, Tokyo, Japan, August 2002.
[27] I. Tsamardinos, C. Aliferis, and A. Statnikov, "Algorithms for large scale Markov blanket discovery," in Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference (FLAIRS '03), pp. 376–381, St. Augustine, Fla, USA, May 2003.
[28] I. Tsamardinos and C. Aliferis, "Towards principled feature selection: relevancy, filters and wrappers," in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AI&Stats '03), Key West, Fla, USA, January 2003.

Hindawi Publishing Corporation EURASIP Journal on Bioinformatics and Systems Biology Volume 2007, Article ID 90947, 11 pages doi:10.1155/2007/90947

Research Article NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks

Petri Kontkanen, Hannes Wettig, and Petri Myllymäki

Complex Systems Computation Group (CoSCo), Helsinki Institute for Information Technology (HIIT), P.O. Box 68 (Department of Computer Science), FIN-00014 University of Helsinki, Finland

Received 1 March 2007; Accepted 30 July 2007

Recommended by Peter Grünwald

Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks.

Copyright © 2007 Petri Kontkanen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data. Typical examples of this kind of problem are DNA sequence compression [1], microarray data clustering [2–4], and modeling of genetic networks [5]. The minimum description length (MDL) principle developed in the series of papers [6–8] is a well-founded, general framework for performing model class selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, that is, to find a description or code of it, such that this description uses less symbols than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can be equated with finding regularities in data. Consequently, we can say that the more we are able to compress the data, the more we have learned about them.

MDL model class selection is based on a quantity called stochastic complexity (SC), which is the description length of a given data relative to a model class. The stochastic complexity is defined via the normalized maximum likelihood (NML) distribution [8, 9]. For multinomial (discrete) data, this definition involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is called the regret or parametric complexity, and it can be interpreted as the amount of complexity of the model class. If the data is continuous, the sum is replaced by the corresponding integral.

The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model class selection and related tasks. It was originally [8, 10] formulated as the unique solution to a minimax problem presented in [9], which implied that NML is the minimax optimal universal model. Later [11], it was shown that NML is also the solution to a related problem involving expected regret. See Section 2 and [10–13] for more discussion on the theoretical properties of the NML.

Typical bioinformatic problems involve large discrete datasets. In order to apply NML for these tasks, one needs to develop suitable NML computation methods, since the normalizing sum or integral in the definition of NML is typically difficult to compute directly. In this paper, we present algorithms for efficient computation of NML for both one- and multidimensional discrete data. The model families used in the paper are so-called Bayesian networks (see, e.g., [14]) of varying complexity. A Bayesian network is a graphical representation of a joint distribution. The structure of the graph corresponds to certain conditional independence assumptions. Note that despite the name, having Bayesian network models does not necessarily imply using Bayesian statistics, and the information-theoretic approach of this paper cannot be considered Bayesian.

The problem of computing NML for discrete data has been studied before.
In [15], a linear-time algorithm for the one-dimensional multinomial case was derived. A more complex case involving a multidimensional model family, called naive Bayes, was discussed in [16]. Both these cases are also reviewed in this paper.

The paper is structured as follows. In Section 2, we discuss the basic properties of the MDL principle and the NML distribution. In Section 3, we instantiate the NML distribution for the multinomial case and present a linear-time computation algorithm. The topic of Section 4 is the naive Bayes model family. NML computation for an extension of naive Bayes, the so-called Bayesian forests, is discussed in Section 5. Finally, Section 6 gives some concluding remarks.

2. PROPERTIES OF THE MDL PRINCIPLE AND THE NML MODEL

The MDL principle has several desirable properties. Firstly, it automatically protects against overfitting in the model class selection process. Secondly, this statistical framework does not, unlike most other frameworks, assume that there exists some underlying "true" model. The model class is only used as a technical device for constructing an efficient code for describing the data. MDL is also closely related to Bayesian inference, but there are some fundamental differences, the most important being that MDL does not need any prior distribution; it only uses the data at hand. For more discussion on the theoretical motivations behind the MDL principle see, for example, [8, 10–13, 17].

The MDL model class selection is based on minimization of the stochastic complexity. In the following, we give the definition of the stochastic complexity and then proceed by discussing its theoretical properties.

2.1. Model classes and families

Let x^n = (x_1, ..., x_n) be a data sample of n outcomes, where each outcome x_j is an element of some space of observations X. The n-fold Cartesian product X × ··· × X is denoted by X^n, so that x^n ∈ X^n. Consider a set Θ ⊆ R^d, where d is a positive integer. A class of parametric distributions indexed by the elements of Θ is called a model class. That is, a model class M is defined as

  M = { P(· | θ) : θ ∈ Θ },   (1)

and the set Θ is called the parameter space.

Consider a set Φ ⊆ R^e, where e is a positive integer. Define a set F by

  F = { M(ϕ) : ϕ ∈ Φ }.   (2)

The set F is called a model family, and each of the elements M(ϕ) is a model class. The associated parameter space is denoted by Θ_ϕ. The model class selection problem can now be defined as a process of finding the parameter vector ϕ which is optimal according to some predetermined criteria. In Sections 3–5, we discuss three specific model families, which will make these definitions more concrete.

2.2. The NML distribution

One of the most theoretically and intuitively appealing model class selection criteria is the stochastic complexity. Denote first the maximum likelihood estimate of data x^n for a given model class M(ϕ) by θ̂(x^n, M(ϕ)), that is, θ̂(x^n, M(ϕ)) = arg max_{θ ∈ Θ_ϕ} {P(x^n | θ)}. The normalized maximum likelihood (NML) distribution [9] is now defined as

  P_NML(x^n | M(ϕ)) = P(x^n | θ̂(x^n, M(ϕ))) / C(M(ϕ), n),   (3)

where the normalizing term C(M(ϕ), n) in the case of discrete data is given by

  C(M(ϕ), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n, M(ϕ))),   (4)

and the sum goes over the space of data samples of size n. If the data is continuous, the sum is replaced by the corresponding integral.

The stochastic complexity of the data x^n, given a model class M(ϕ), is defined via the NML distribution as

  SC(x^n | M(ϕ)) = −log P_NML(x^n | M(ϕ)) = −log P(x^n | θ̂(x^n, M(ϕ))) + log C(M(ϕ), n),   (5)

and the term log C(M(ϕ), n) is called the (minimax) regret or parametric complexity. The regret can be interpreted as measuring the logarithm of the number of essentially different (distinguishable) distributions in the model class. Intuitively, if two distributions assign high likelihood to the same data samples, they do not contribute much to the overall complexity of the model class, and the distributions should not be counted as different for the purposes of statistical inference. See [18] for more discussion on this topic.
The NML distribution (3) has several important theoretical optimality properties. The first is that NML provides a unique solution to the minimax problem

  min_{P̂} max_{x^n} log [ P(x^n | θ̂(x^n, M(ϕ))) / P̂(x^n | M(ϕ)) ],   (6)

as posed in [9]. The minimizing P̂ is the NML distribution, and the minimax regret

  log P(x^n | θ̂(x^n, M(ϕ))) − log P̂(x^n | M(ϕ))   (7)

is given by the parametric complexity log C(M(ϕ), n). This means that the NML distribution is the minimax optimal universal model. The term universal model in this context means that the NML distribution represents (or mimics) the behavior of all the distributions in the model class M(ϕ). Note that the NML distribution itself does not have to belong to the model class, and typically it does not.

A related property of NML involving expected regret was proven in [11]. This property states that NML is also a unique solution to

  max_g min_q E_g log [ P(x^n | θ̂(x^n, M(ϕ))) / q(x^n | M(ϕ)) ],   (8)

where the expectation is taken over x^n with respect to g and the minimizing distribution q equals g. Also the maximin expected regret is thus given by log C(M(ϕ), n).

3. NML FOR MULTINOMIAL MODELS

In the case of discrete data, the simplest model family is the multinomial. The data are assumed to be one-dimensional and to have only a finite set of possible values. Although simple, the multinomial model family has practical applications. For example, in [19] multinomial NML was used for histogram density estimation, and the density estimation problem was regarded as a model class selection task.

3.1. The model family

Assume that our problem domain consists of a single discrete random variable X with K values, and that our data x^n = (x_1, ..., x_n) is multinomially distributed. The space of observations X is now the set {1, 2, ..., K}. The corresponding model family F_MN is defined by

  F_MN = { M(ϕ) : ϕ ∈ Φ_MN },   (9)

where Φ_MN = {1, 2, 3, ...}. Since the parameter vector ϕ is in this case a single integer K, we denote the multinomial model classes by M(K) and define

  M(K) = { P(· | θ) : θ ∈ Θ_K },   (10)

where Θ_K is the simplex-shaped parameter space

  Θ_K = { (π_1, ..., π_K) : π_k ≥ 0, π_1 + ··· + π_K = 1 },   (11)

with π_k = P(X = k), k = 1, ..., K.

Assume the data points x_j are independent and identically distributed (i.i.d.). The NML distribution (3) for the model class M(K) is now given by (see, e.g., [16, 20])

  P_NML(x^n | M(K)) = Π_{k=1}^{K} (h_k/n)^{h_k} / C(M(K), n),   (12)

where h_k is the frequency (number of occurrences) of value k in x^n, and

  C(M(K), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n, M(K)))   (13)
             = Σ_{h_1+···+h_K=n} n!/(h_1! ··· h_K!) Π_{k=1}^{K} (h_k/n)^{h_k}.   (14)

To make the notation more compact and consistent in this section and the following sections, C(M(K), n) is from now on denoted by C_MN(K, n).

It is clear that the maximum likelihood term in (12) can be computed in linear time by simply sweeping through the data once and counting the frequencies h_k. However, the normalizing sum C_MN(K, n) (and thus also the parametric complexity log C_MN(K, n)) involves a sum over an exponential number of terms. Consequently, the time complexity of computing the multinomial NML is dominated by (14).
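As a concrete illustration, the following sketch (ours; not from the paper, names are illustrative) evaluates C_MN(K, n) both directly from definition (13), by enumerating all K^n sequences, and from the frequency-based form (14); for small K and n the two agree, while the exponential growth of the first form is already visible.

```python
from itertools import product
from math import factorial, prod

def cmn_by_definition(K, n):
    """C_MN(K, n) from (13): sum the maximized likelihood over all K^n sequences."""
    total = 0.0
    for seq in product(range(K), repeat=n):
        counts = [seq.count(k) for k in range(K)]
        total += prod((h / n) ** h for h in counts if h > 0)
    return total

def compositions(total, parts):
    """All vectors of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for h in range(total + 1):
        for rest in compositions(total - h, parts - 1):
            yield (h,) + rest

def cmn_by_frequencies(K, n):
    """C_MN(K, n) from (14): group the sequences by their frequency vector."""
    total = 0.0
    for h in compositions(n, K):
        coeff = factorial(n) / prod(factorial(hk) for hk in h)
        total += coeff * prod((hk / n) ** hk for hk in h if hk > 0)
    return total

print(cmn_by_definition(2, 3), cmn_by_frequencies(2, 3))  # both print approx 2.8889
```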
3.2. The quadratic-time algorithm

In [16, 20], a recursion formula for removing the exponentiality of C_MN(K, n) was presented. This formula is given by

  C_MN(K, n) = Σ_{r_1+r_2=n} n!/(r_1! r_2!) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_MN(K*, r_1) · C_MN(K − K*, r_2),   (15)

which holds for all K* = 1, ..., K − 1. A straightforward algorithm based on this formula was then used to compute C_MN(K, n) in time O(n^2 log K). See [16, 20] for more details. Note that in [21, 22] the quadratic-time algorithm was improved to O(n log n log K) by writing (15) as a convolution-type sum and then using the fast Fourier transform algorithm. However, the relevance of this result is unclear due to severe numerical instability problems it easily produces in practice.

3.3. The linear-time algorithm

Although the previous algorithms have succeeded in removing the exponentiality of the computation of the multinomial NML, they are still superlinear with respect to n. In [15], a linear-time algorithm based on the mathematical technique of generating functions was derived for the problem.

The starting point of the derivation is the generating function B defined by

  B(z) = 1/(1 − T(z)) = Σ_{n≥0} (n^n/n!) z^n,   (16)

where T is the so-called Cayley's tree function [23, 24]. It is easy to prove (see [15, 25]) that the function B^K generates the sequence ((n^n/n!) C_MN(K, n))_{n=0}^{∞}, that is,

  B^K(z) = Σ_{n≥0} ( Σ_{h_1+···+h_K=n} n!/(h_1! ··· h_K!) Π_{k=1}^{K} (h_k/n)^{h_k} ) · (n^n/n!) · z^n
         = Σ_{n≥0} (n^n/n!) · C_MN(K, n) · z^n,   (17)

which by using the tree function T can be written as

  B^K(z) = 1/(1 − T(z))^K.   (18)

The properties of the tree function T can be used to prove the following theorem.

Theorem 1. The C_MN(K, n) terms satisfy the recurrence

  C_MN(K + 2, n) = C_MN(K + 1, n) + (n/K) · C_MN(K, n).   (19)

Proof. See the appendix.

It is now straightforward to write a linear-time algorithm for computing the multinomial NML P_NML(x^n | M(K)) based on Theorem 1. The process is described in Algorithm 1. The time complexity of the algorithm is clearly O(n + K), which is a major improvement over the previous methods. The algorithm is also very easy to implement and does not suffer from any numerical instability problems.

3.4. Approximating the multinomial NML

In practice, it is often not necessary to compute the exact value of C_MN(K, n). A very general and powerful mathematical technique called singularity analysis [26] can be used to derive an accurate, constant-time approximation for the multinomial regret. The idea of singularity analysis is to use the analytical properties of the generating function in question by studying its singularities, which then leads to the asymptotic form for the coefficients. See [25, 26] for details.

For the multinomial case, the singularity analysis approximation was first derived in [25] in the context of memoryless sources, and later [20] re-introduced in the MDL framework. The approximation is given by

  log C_MN(K, n) = (K − 1)/2 · log(n/2) + log( √π / Γ(K/2) )
                   + ( √2 · K · Γ(K/2) ) / ( 3 · Γ(K/2 − 1/2) ) · 1/√n
                   + ( (3 + K(K − 2)(2K + 1))/36 − ( Γ²(K/2) · K² ) / ( 9 · Γ²(K/2 − 1/2) ) ) · 1/n
                   + O(1/n^{3/2}).   (20)

Since the error term of (20) goes down with the rate O(1/n^{3/2}), the approximation converges very rapidly. In [20], the accuracy of (20) and two other approximations (Rissanen's asymptotic expansion [8] and the Bayesian information criterion (BIC) [27]) was tested empirically. The results show that (20) is significantly better than the other approximations and accurate already with very small sample sizes. See [20] for more details.
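A direct transcription of approximation (20) is short; the sketch below (ours, for illustration only, with natural logarithms) evaluates it with standard gamma-function routines.

```python
from math import exp, lgamma, log, pi, sqrt

def log_cmn_approx(K, n):
    """Singularity-analysis approximation (20) of log C_MN(K, n), natural log, K >= 2."""
    lg_a = lgamma(K / 2.0)          # log Gamma(K/2)
    lg_b = lgamma(K / 2.0 - 0.5)    # log Gamma(K/2 - 1/2)
    term0 = (K - 1) / 2.0 * log(n / 2.0) + 0.5 * log(pi) - lg_a
    term1 = (sqrt(2.0) * K * exp(lg_a - lg_b)) / 3.0 / sqrt(n)
    term2 = ((3.0 + K * (K - 2) * (2 * K + 1)) / 36.0
             - exp(2.0 * (lg_a - lg_b)) * K * K / 9.0) / n
    return term0 + term1 + term2

# Example: approximate regret of a 10-valued multinomial with 1000 samples.
print(log_cmn_approx(10, 1000))
```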
4. NML FOR THE NAIVE BAYES MODEL

The one-dimensional case discussed in the previous section is not adequate for many real-world situations, where data are typically multidimensional, involving complex dependencies between the domain variables. In [16], a quadratic-time algorithm for computing the NML for a specific multivariate model family, usually called the naive Bayes, was derived. This model family has been very successful in practice in mixture modeling [28], clustering of data [16], case-based reasoning [29], classification [30, 31], and data visualization [32].

4.1. The model family

Let us assume that our problem domain consists of m primary variables X_1, ..., X_m and a special variable X_0, which can be one of the variables in our original problem domain or it can be latent. Assume that the variable X_i has K_i values and that the extra variable X_0 has K_0 values. The data x^n = (x_1, ..., x_n) consist of observations of the form x_j = (x_j0, x_j1, ..., x_jm) ∈ X, where

  X = {1, 2, ..., K_0} × {1, 2, ..., K_1} × ··· × {1, 2, ..., K_m}.   (21)

The naive Bayes model family F_NB is defined by

  F_NB = { M(ϕ) : ϕ ∈ Φ_NB },   (22)

with Φ_NB = {1, 2, 3, ...}^{m+1}. The corresponding model classes are denoted by M(K_0, K_1, ..., K_m):

  M(K_0, K_1, ..., K_m) = { P_NB(· | θ) : θ ∈ Θ_{K_0, K_1, ..., K_m} }.   (23)

The basic naive Bayes assumption is that given the value of the special variable, the primary variables are independent. We have consequently

  P_NB(X_0 = x_0, X_1 = x_1, ..., X_m = x_m | θ) = P(X_0 = x_0 | θ) · Π_{i=1}^{m} P(X_i = x_i | X_0 = x_0, θ).   (24)

Furthermore, we assume that the distribution of P(X_0 | θ) is multinomial with parameters (π_1, ..., π_{K_0}), and each P(X_i | X_0 = k, θ) is multinomial with parameters (σ_{ik1}, ..., σ_{ikK_i}). The whole parameter space is then

  Θ_{K_0, K_1, ..., K_m} = { (π_1, ..., π_{K_0}, σ_{111}, ..., σ_{11K_1}, ..., σ_{mK_01}, ..., σ_{mK_0K_m}) :
      π_k ≥ 0, σ_{ikl} ≥ 0, π_1 + ··· + π_{K_0} = 1,
      σ_{ik1} + ··· + σ_{ikK_i} = 1, i = 1, ..., m, k = 1, ..., K_0 },   (25)

and the parameters are defined by π_k = P(X_0 = k), σ_{ikl} = P(X_i = l | X_0 = k).

Assuming i.i.d., the NML distribution for the naive Bayes can now be written as (see [16])

  P_NML(x^n | M(K_0, K_1, ..., K_m)) = Π_{k=1}^{K_0} [ (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_{ikl}/h_k)^{f_{ikl}} ] / C(M(K_0, K_1, ..., K_m), n),   (26)

where h_k is the number of times X_0 has value k in x^n, f_{ikl} is the number of times X_i has value l when the special variable has value k, and C(M(K_0, K_1, ..., K_m), n) is given by (see [16])

  C(M(K_0, K_1, ..., K_m), n) = Σ_{h_1+···+h_{K_0}=n} n!/(h_1! ··· h_{K_0}!) Π_{k=1}^{K_0} (h_k/n)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k).   (27)

To simplify notations, from now on we write C(M(K_0, K_1, ..., K_m), n) in an abbreviated form C_NB(K_0, n).

1: Count the frequencies h_1, ..., h_K from the data x^n
2: Compute the likelihood P(x^n | θ̂(x^n, M(K))) = Π_{k=1}^{K} (h_k/n)^{h_k}
3: Set C_MN(1, n) = 1
4: Compute C_MN(2, n) = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2}
5: for k = 1 to K − 2 do
6:   Compute C_MN(k + 2, n) = C_MN(k + 1, n) + (n/k) · C_MN(k, n)
7: end for
8: Output P_NML(x^n | M(K)) = P(x^n | θ̂(x^n, M(K))) / C_MN(K, n)

Algorithm 1: The linear-time algorithm for computing P_NML(x^n | M(K)).
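The listing translates directly into code. The sketch below (our illustration; function and variable names are ours) follows Algorithm 1 step by step: it counts frequencies, evaluates the maximized likelihood, initializes C_MN(1, n) and C_MN(2, n), and then applies recurrence (19).

```python
from collections import Counter
from math import comb

def multinomial_nml(data, K):
    """P_NML(x^n | M(K)) for a sequence `data` of symbols from {1, ..., K} (Algorithm 1)."""
    n = len(data)
    counts = Counter(data)

    # Step 2: maximized likelihood, prod_k (h_k / n)^{h_k}.
    likelihood = 1.0
    for h in counts.values():
        likelihood *= (h / n) ** h

    # Steps 3-4: C_MN(1, n) = 1 and C_MN(2, n) by the explicit binomial sum.
    c_prev = 1.0                                   # C_MN(1, n)
    c_curr = sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                 for r in range(n + 1))            # C_MN(2, n)
    if K == 1:
        return likelihood / c_prev

    # Steps 5-7: recurrence C_MN(k + 2, n) = C_MN(k + 1, n) + (n / k) * C_MN(k, n).
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev

    # Step 8: normalize.
    return likelihood / c_curr

print(multinomial_nml([1, 1, 2, 3, 1, 2], K=3))
```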

4.2. The quadratic-time algorithm

It turns out [16] that the recursive formula (15) can be generalized to the naive Bayes model family case.

Theorem 2. The terms C_NB(K_0, n) satisfy the recurrence

  C_NB(K_0, n) = Σ_{r_1+r_2=n} n!/(r_1! r_2!) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_NB(K*, r_1) · C_NB(K_0 − K*, r_2),   (28)

where K* = 1, ..., K_0 − 1.

Proof. See the appendix.

In many practical applications of the naive Bayes, the quantity K_0 is unknown. Its value is typically determined as a part of the model class selection process. Consequently, it is necessary to compute NML for model classes M(K_0, K_1, ..., K_m), where K_0 has a range of values, say, K_0 = 1, ..., K_max. The process of computing NML for this case is described in Algorithm 2. The time complexity of the algorithm is O(n^2 · K_max). If the value of K_0 is fixed, the time complexity drops to O(n^2 · log K_0). See [16] for more details.

5. NML FOR BAYESIAN FORESTS

The naive Bayes model discussed in the previous section has been successfully applied in various domains. In this section we consider tree-structured Bayesian networks, which include the naive Bayes model as a special case but can also represent more complex dependencies.

5.1. The model family

As before, we assume m variables X_1, ..., X_m with given value cardinalities K_1, ..., K_m. Since the goal here is to model the joint probability distribution of the m variables, there is no need to mark a special variable. We assume a data matrix x^n = (x_ji) ∈ X^n, 1 ≤ j ≤ n, 1 ≤ i ≤ m, as given.

A Bayesian network structure G encodes independence assumptions so that if each variable X_i is represented as a node in the network, then the joint probability distribution factorizes into a product of local probability distributions, one for each node, conditioned on its parent set. We define a Bayesian forest to be a Bayesian network structure G on the node set X_1, ..., X_m which assigns at most one parent X_pa(i) to any node X_i. Consequently, a Bayesian tree is a connected Bayesian forest, and a Bayesian forest breaks down into component trees, that is, connected subgraphs. The root of each such component tree lacks a parent, in which case we write pa(i) = ∅.

The parent set of a node X_i thus reduces to a single value pa(i) ∈ {1, ..., i − 1, i + 1, ..., m, ∅}. Let further ch(i) denote the set of children of node X_i in G and ch(∅) denote the "children of none," that is, the roots of the component trees of G.

The corresponding model family F_BF can be indexed by the network structure G and the corresponding attribute value counts K_1, ..., K_m:

  F_BF = { M(ϕ) : ϕ ∈ Φ_BF },   (29)

with Φ_BF = {1, ..., |G|} × {1, 2, 3, ...}^m, where G is associated with an integer according to some enumeration of all Bayesian forests on (X_1, ..., X_m). As the K_i are assumed fixed, we can abbreviate the corresponding model classes by M(G) := M(G, K_1, ..., K_m).

Given a forest model class M(G), we index each model by a parameter vector θ in the corresponding parameter space Θ_G:

  Θ_G = { θ = (θ_ikl) : θ_ikl ≥ 0, Σ_l θ_ikl = 1, i = 1, ..., m, k = 1, ..., K_pa(i), l = 1, ..., K_i },   (30)

where we define K_∅ := 1 in order to unify notation for root and non-root nodes. Each such θ_ikl defines a probability

  θ_ikl = P(X_i = l | X_pa(i) = k, M(G), θ),   (31)

where we interpret X_∅ = 1 as a null condition.

The joint probability that a model M = (G, θ) assigns to a data vector x = (x_1, ..., x_m) becomes

  P(x | M(G), θ) = Π_{i=1}^{m} P(X_i = x_i | X_pa(i) = x_pa(i), M(G), θ) = Π_{i=1}^{m} θ_{i, x_pa(i), x_i}.   (32)
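To make the factorization (32) concrete, here is a small sketch (ours; the example forest, parameters, and names are purely illustrative) that evaluates the joint probability of one data vector under a given forest structure and parameter assignment.

```python
# Forest over three variables: X1 -> X2, X1 -> X3 (X1 is a component tree root).
# pa[i] gives the parent of node i, or None for a root (the null condition X_empty = 1).
pa = {1: None, 2: 1, 3: 1}

# theta[i][k][l] = P(X_i = l | X_pa(i) = k); roots use the single dummy parent value 1.
theta = {
    1: {1: {1: 0.6, 2: 0.4}},
    2: {1: {1: 0.9, 2: 0.1}, 2: {1: 0.2, 2: 0.8}},
    3: {1: {1: 0.5, 2: 0.5}, 2: {1: 0.7, 2: 0.3}},
}

def joint_probability(x, pa, theta):
    """Evaluate P(x | M(G), theta) = prod_i theta_{i, x_pa(i), x_i}, as in (32)."""
    p = 1.0
    for i, value in x.items():
        parent_value = 1 if pa[i] is None else x[pa[i]]
        p *= theta[i][parent_value][value]
    return p

print(joint_probability({1: 1, 2: 2, 3: 1}, pa, theta))  # 0.6 * 0.1 * 0.5 = 0.03
```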

1: Compute C_MN(k, j) for k = 1, ..., V_max, j = 0, ..., n, where V_max = max{K_1, ..., K_m}
2: for K_0 = 1 to K_max do
3:   Count the frequencies h_1, ..., h_{K_0}, f_{ik1}, ..., f_{ikK_i} for i = 1, ..., m, k = 1, ..., K_0 from the data x^n
4:   Compute the likelihood: P(x^n | θ̂(x^n, M(K_0, K_1, ..., K_m))) = Π_{k=1}^{K_0} [ (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_{ikl}/h_k)^{f_{ikl}} ]
5:   Set C_NB(K_0, 0) = 1
6:   if K_0 = 1 then
7:     Compute C_NB(1, j) = Π_{i=1}^{m} C_MN(K_i, j) for j = 1, ..., n
8:   else
9:     Compute C_NB(K_0, j) = Σ_{r_1+r_2=j} (j!/(r_1! r_2!)) (r_1/j)^{r_1} (r_2/j)^{r_2} · C_NB(1, r_1) · C_NB(K_0 − 1, r_2) for j = 1, ..., n
10:  end if
11:  Output P_NML(x^n | M(K_0, K_1, ..., K_m)) = P(x^n | θ̂(x^n, M(K_0, K_1, ..., K_m))) / C_NB(K_0, n)
12: end for

Algorithm 2: The algorithm for computing P_NML(x^n | M(K_0, K_1, ..., K_m)) for K_0 = 1, ..., K_max.
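The following sketch (ours; it assumes Algorithm 2 as reconstructed above and uses our own function names) implements the normalizer part of Algorithm 2: it tabulates C_MN(k, j) with recurrence (19) and then builds C_NB(K_0, j) with the recursion of line 9.

```python
from math import comb

def cmn_table(k_max, n):
    """C_MN(k, j) for k = 1..k_max, j = 0..n, via recurrence (19); table[k - 1][j] == C_MN(k, j)."""
    table = [[1.0] * (n + 1)]                       # C_MN(1, j) = 1 for all j
    c2 = [1.0] + [sum(comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
                      for r in range(j + 1)) for j in range(1, n + 1)]
    table.append(c2)                                # C_MN(2, j)
    for k in range(1, k_max - 1):                   # C_MN(k + 2, j) from (19)
        table.append([table[k][j] + (j / k) * table[k - 1][j] for j in range(n + 1)])
    return table

def cnb(K0, Ks, n):
    """C_NB(K0, n) for primary variable cardinalities Ks = [K_1, ..., K_m]."""
    cmn = cmn_table(max(Ks), n)
    prev = [1.0] * (n + 1)                          # C_NB(1, j) = prod_i C_MN(K_i, j)
    for Ki in Ks:
        prev = [prev[j] * cmn[Ki - 1][j] for j in range(n + 1)]
    base = prev[:]                                  # keep C_NB(1, .) for the recursion
    for _ in range(2, K0 + 1):                      # build C_NB(k, .) from C_NB(k - 1, .)
        curr = [1.0] * (n + 1)
        for j in range(1, n + 1):
            curr[j] = sum(comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
                          * base[r] * prev[j - r] for r in range(j + 1))
        prev = curr
    return prev[n]

print(cnb(K0=2, Ks=[2, 3], n=10))
```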

For a sample x^n = (x_ji) of n vectors x_j, we define the corresponding frequencies as

  f_ikl := |{ j : x_ji = l ∧ x_j,pa(i) = k }|,
  f_il := |{ j : x_ji = l }| = Σ_{k=1}^{K_pa(i)} f_ikl.   (33)

By definition, for any component tree root X_i, we have f_il = f_i1l. The probability assigned to a sample x^n can then be written as

  P(x^n | M(G), θ) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} θ_ikl^{f_ikl},   (34)

which is maximized at

  θ̂_ikl(x^n, M(G)) = f_ikl / f_pa(i),k,   (35)

where we define f_∅,1 := n. The maximum data likelihood thereby is

  P(x^n | θ̂(x^n), M(G)) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} ( f_ikl / f_pa(i),k )^{f_ikl}.   (36)
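The maximized likelihood (36) only needs the frequency counts. A minimal sketch (ours; the structure and data are toy examples) is:

```python
from collections import Counter

def forest_max_likelihood(data, pa):
    """P(x^n | theta_hat, M(G)) from (36); `data` is a list of dicts {node: value}."""
    n = len(data)
    f_ikl = Counter()        # (i, parent value k, child value l) counts
    f_parent = Counter()     # (i, parent value k) counts, i.e. f_pa(i),k
    for row in data:
        for i, l in row.items():
            k = 1 if pa[i] is None else row[pa[i]]   # roots: dummy parent value 1
            f_ikl[(i, k, l)] += 1
            f_parent[(i, k)] += 1                     # equals n for every root i
    likelihood = 1.0
    for (i, k, l), count in f_ikl.items():
        likelihood *= (count / f_parent[(i, k)]) ** count
    return likelihood

pa = {1: None, 2: 1}                                   # chain X1 -> X2
data = [{1: 1, 2: 1}, {1: 1, 2: 2}, {1: 2, 2: 2}]
print(forest_max_likelihood(data, pa))
```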

5.2. The algorithm

The goal is to calculate the NML distribution P_NML(x^n | M(G)) defined in (3). This consists of calculating the maximum data likelihood (36) and the normalizing term C(M(G), n) given in (4). The former involves frequency counting, one sweep through the data, and multiplication of the appropriate values. This can be done in time O(n + Σ_i K_i K_pa(i)). The latter involves a sum exponential in n, which clearly makes it the computational bottleneck of the algorithm.

Our approach is to break up the normalizing sum in (4) into terms corresponding to subtrees with given frequencies in either their root or its parent. We then calculate the complete sum by sweeping through the graph once, bottom-up. Let us now introduce some necessary notation.

Let G be a given Bayesian forest. Then for any node X_i denote the subtree rooting in X_i by G_sub(i) and the forest built up by all descendants of X_i by G_dsc(i). The corresponding data domains are X_sub(i) and X_dsc(i), respectively. Denote the sum over all n-instantiations of a subtree by

  C_i(M(G), n) := Σ_{x^n_sub(i) ∈ X^n_sub(i)} P( x^n_sub(i) | θ̂(x^n_sub(i)), M(G_sub(i)) ),   (37)

and for any vector x^n_i ∈ X^n_i with frequencies f_i = (f_i1, ..., f_iK_i), we define

  C_i(M(G), n | f_i) := Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (38)

to be the corresponding sum with fixed root instantiation, summing only over the attribute space spanned by the descendants of X_i.

Note that we use f_i on the left-hand side, and x^n_i on the right-hand side of the definition. This needs to be justified. Interestingly, while the terms in the sum depend on the ordering of x^n_i, the sum itself depends on x^n_i only through its frequencies f_i. To see this, pick any two representatives x^n_i and x̃^n_i of f_i and find, for example after lexicographical ordering of the elements, that

  { (x^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) } = { (x̃^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) }.   (39)

Next, we need to define corresponding sums over X_sub(i) with the frequencies at the subtree root parent X_pa(i) given.

For any f_pa(i) ∼ x^n_pa(i) ∈ X^n_pa(i) define

  L_i(M(G), n | f_pa(i)) := Σ_{x^n_sub(i) ∈ X^n_sub(i)} P( x^n_sub(i) | x^n_pa(i), θ̂(x^n_sub(i), x^n_pa(i)), M(G_sub(i)) ).   (40)

Again, this is well defined since any other representative x̃^n_pa(i) of f_pa(i) yields summing the same terms modulo their ordering.

After having introduced this notation, we now briefly outline the algorithm and in the following subsections give a more detailed description of the steps involved. As stated before, we go through G bottom-up. At each inner node X_i, we receive L_j(M(G), n | f_i) from each child X_j, j ∈ ch(i). Correspondingly, we are required to send L_i(M(G), n | f_pa(i)) up to the parent X_pa(i). At each component tree root X_i, we then calculate the sum C_i(M(G), n) for the whole connectivity component and then combine these sums to get the normalizer C(M(G), n) for the complete forest G.

5.2.1. Leaves

For a leaf node X_i we can calculate the L_i(M(G), n | f_pa(i)) without listing its own frequencies f_i. As in (27), f_pa(i) splits the n data vectors into K_pa(i) subsets of sizes f_pa(i),1, ..., f_pa(i),K_pa(i), and each of them can be modeled independently as a multinomial; we have

  L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k).   (41)

The terms C_MN(K_i, n') (for n' = 0, ..., n) can be precalculated using recurrence (19) as in Algorithm 1.

5.2.2. Inner nodes

For inner nodes X_i we divide the task into two steps. First, we collect the child messages L_j(M(G), n | f_i) sent by each child X_j ∈ ch(i) into partial sums C_i(M(G), n | f_i) over X_dsc(i), and then "lift" these to sums L_i(M(G), n | f_pa(i)) over X_sub(i), which are the messages to the parent.

The first step is simple. Given an instantiation x^n_i at X_i or, equivalently, the corresponding frequencies f_i, the subtrees rooting in the children ch(i) of X_i become independent of each other. Thus we have

  C_i(M(G), n | f_i)
    = Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (42)
    = Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P( x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) × Π_{j∈ch(i)} P( x^n_dsc(i)|sub(j) | x^n_i, θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) )   (43)
    = P( x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) × ( Π_{j∈ch(i)} Σ_{x^n_sub(j) ∈ X^n_sub(j)} P( x^n_sub(j) | x^n_i, θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)) ) )   (44)
    = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i),   (45)

where x^n_dsc(i)|sub(j) is the restriction of x^n_dsc(i) to columns corresponding to nodes in G_sub(j). We have used (38) for (42), (32) for (43) and (44), and finally (36) and (40) for (45).

Now we need to calculate the outgoing messages L_i(M(G), n | f_pa(i)) from the incoming messages we have just combined into C_i(M(G), n | f_i). This is the most demanding part of the algorithm, for we need to list all possible conditional frequencies, of which there are O(n^{K_i K_pa(i) − 1}) many, the −1 being due to the sum-to-n constraint. For fixed i, we arrange the conditional frequencies f_ikl into a matrix F = (f_ikl) and define its marginals

  ρ(F) := ( Σ_k f_ik1, ..., Σ_k f_ikK_i ),
  γ(F) := ( Σ_l f_i1l, ..., Σ_l f_iK_pa(i)l )   (46)

to be the vectors obtained by summing the rows of F and the columns of F, respectively. Each such matrix then corresponds to a term C_i(M(G), n | ρ(F)) and a term L_i(M(G), n | γ(F)). Formally, we have

  L_i(M(G), n | f_pa(i)) = Σ_{F : γ(F) = f_pa(i)} C_i(M(G), n | ρ(F)).   (47)

5.2.3. Component tree roots

For a component tree root X_i ∈ ch(∅) we do not need to pass any message upward. All we need is the complete sum over the component tree

  C_i(M(G), n) = Σ_{f_i} n!/( f_i1! ··· f_iK_i! ) · C_i(M(G), n | f_i),   (48)

where the C_i(M(G), n | f_i) are calculated from (45). The summation goes over all nonnegative integer vectors f_i summing to n. The above is trivially true since we sum over all instantiations x^n_i of X_i and group like terms, corresponding to the same frequency vector f_i, while keeping track of their respective count, namely n!/( f_i1! ··· f_iK_i! ).

5.2.4. The algorithm

For the complete forest G we simply multiply the sums over its tree components.
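The "lift" step (47) is essentially an enumeration of contingency tables: all K_i × K_pa(i) nonnegative integer matrices whose column sums match a given parent frequency vector. The sketch below (ours; a direct, unoptimized illustration of (46)–(47) with hypothetical inputs) enumerates such matrices and accumulates C_i-values indexed by the row-sum vector ρ(F).

```python
from itertools import product

def compositions(total, parts):
    """All tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def lift(c_i, f_parent, K_child):
    """L_i(. | f_pa(i)) from (47): sum C_i(. | rho(F)) over matrices F with gamma(F) = f_pa(i).

    `c_i` maps a child frequency vector f_i (tuple of length K_child) to C_i(M(G), n | f_i).
    `f_parent` is the parent frequency vector gamma(F); column k of F sums to f_parent[k]."""
    total = 0.0
    columns = [list(compositions(fk, K_child)) for fk in f_parent]
    for F in product(*columns):                      # F[k][l] = f_ikl
        rho = tuple(sum(col[l] for col in F) for l in range(K_child))
        total += c_i[rho]
    return total

# Toy usage: binary child and parent, n = 2, with made-up C_i values for each f_i.
c_i = {(2, 0): 1.0, (1, 1): 0.5, (0, 2): 1.0}
print(lift(c_i, f_parent=(1, 1), K_child=2))  # -> 3.0
```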

1: Count all frequencies f_ikl and f_il from the data x^n
2: Compute P(x^n | M(G)) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} ( f_ikl / f_pa(i),k )^{f_ikl}
3: for k = 1, ..., K_max := max_{i : X_i is a leaf} {K_i} and n' = 0, ..., n do
4:   Compute C_MN(k, n') as in Algorithm 1
5: end for
6: for each node X_i in some bottom-up order do
7:   if X_i is a leaf then
8:     for each frequency vector f_pa(i) of X_pa(i) do
9:       Compute L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k)
10:    end for
11:  else if X_i is an inner node then
12:    for each frequency vector f_i of X_i do
13:      Compute C_i(M(G), n | f_i) = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
14:    end for
15:    initialize L_i ≡ 0
16:    for each nonnegative K_i × K_pa(i) integer matrix F with entries summing to n do
17:      L_i(M(G), n | γ(F)) += C_i(M(G), n | ρ(F))
18:    end for
19:  else if X_i is a component tree root then
20:    Compute C_i(M(G), n) = Σ_{f_i} n!/( f_i1! ··· f_iK_i! ) Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
21:  end if
22: end for
23: Compute C(M(G), n) = Π_{i∈ch(∅)} C_i(M(G), n)
24: Output P_NML(x^n | M(G)) = P(x^n | M(G)) / C(M(G), n)

Algorithm 3: The algorithm for computing P_NML(x^n | M(G)) for a Bayesian forest G.
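As a runnable illustration of the bottom-up sweep, the sketch below (ours; restricted, for brevity, to forests of depth at most one, so the inner-node lift step of lines 15–18 is not needed) combines the leaf step (41), the root step of line 20, and the final product over component trees. For a single root with m leaf children it reproduces the naive Bayes normalizer C_NB(K_0, n).

```python
from math import comb, factorial, prod

def compositions(total, parts):
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def cmn(K, n):
    """C_MN(K, n) via recurrence (19), as in Algorithm 1."""
    if n == 0 or K == 1:
        return 1.0
    prev, curr = 1.0, sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                          for r in range(n + 1))
    for k in range(1, K - 1):
        prev, curr = curr, curr + (n / k) * prev
    return curr

def forest_normalizer(roots, n):
    """C(M(G), n) for a forest of depth <= 1.

    `roots` is a list of (K_root, [K_child1, K_child2, ...]) pairs, one per component tree."""
    total = 1.0
    for K_root, child_Ks in roots:
        c_i = 0.0
        for f in compositions(n, K_root):                       # root frequency vectors
            coeff = factorial(n) / prod(factorial(fk) for fk in f)
            likelihood = prod((fk / n) ** fk for fk in f if fk > 0)
            messages = prod(cmn(Kj, fk) for Kj in child_Ks for fk in f)   # leaf step (41)
            c_i += coeff * likelihood * messages                 # root step, line 20
        total *= c_i                                             # product over roots
    return total

# Single root with two leaf children: matches C_NB(2, 10) for K_1 = 2, K_2 = 3.
print(forest_normalizer([(2, [2, 3])], n=10))
```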

Since these are independent of each other, in analogy to (42)–(45) we have

  C(M(G), n) = Π_{i∈ch(∅)} C_i(M(G), n).   (49)

Algorithm 3 collects all the above into a pseudocode. The time complexity of this algorithm is O(n^{K_i K_pa(i) − 1}) for each inner node, O(n(n + K_i)) for each leaf, and O(n^{K_i − 1}) for a component tree root of G. When all m […] here is polynomial as well in the sample size n as in the graph size m. For attributes with relatively few values, the polynomial is time tolerable.

6. CONCLUSION

The normalized maximum likelihood (NML) offers a universal […]

The methods presented are especially suitable for problems in bioinformatics, which typically involve multidimensional discrete datasets. Furthermore, unlike the Bayesian methods, information-theoretic approaches such as ours do not require a prior for the model parameters. This is the most important aspect, as constructing a reasonable parameter prior is a notoriously difficult problem, particularly in bioinformatical domains involving novel types of data with little background knowledge. All in all, information theory has been found to offer a natural and successful theoretical framework for biological applications in general, which makes NML an appealing choice for bioinformatics.

In the future, our plan is to extend the current work to more complex cases such as general Bayesian networks, which would allow the use of NML in even more involved modeling tasks. Another natural area of future work is to apply the methods of this paper to practical tasks involving large discrete databases and compare the results to other approaches, such as those based on Bayesian statistics.

APPENDIX

PROOFS OF THEOREMS

In this section, we provide detailed proofs of the two theorems presented in the paper.

Proof of Theorem 1 (multinomial recursion)

We start by proving the following lemma.

Lemma 3. For the tree function T(z) we have

  z T'(z) = T(z) / (1 − T(z)).   (A.1)

Proof. A basic property of the tree function is the functional equation T(z) = z e^{T(z)} (see, e.g., [23]). Differentiating this equation yields

  T'(z) = e^{T(z)} + T(z) T'(z),
  z T'(z) (1 − T(z)) = z e^{T(z)} = T(z),   (A.2)

from which (A.1) follows.

Now we can proceed to the proof of the theorem. We start by multiplying and differentiating (17) as follows:

  z · d/dz Σ_{n≥0} (n^n/n!) C_MN(K, n) z^n = z · Σ_{n≥1} n (n^n/n!) C_MN(K, n) z^{n−1}   (A.3)
    = Σ_{n≥0} n (n^n/n!) C_MN(K, n) z^n.   (A.4)

On the other hand, by manipulating (18) in the same way, we get

  z · d/dz ( 1/(1 − T(z))^K ) = ( z K / (1 − T(z))^{K+1} ) · T'(z)   (A.5)
    = ( K / (1 − T(z))^{K+1} ) · ( T(z) / (1 − T(z)) )   (A.6)
    = K ( 1/(1 − T(z))^{K+2} − 1/(1 − T(z))^{K+1} )   (A.7)
    = K ( Σ_{n≥0} (n^n/n!) C_MN(K + 2, n) z^n − Σ_{n≥0} (n^n/n!) C_MN(K + 1, n) z^n ),   (A.8)

where (A.6) follows from Lemma 3. Comparing the coefficients of z^n in (A.4) and (A.8), we get

  n · C_MN(K, n) = K · ( C_MN(K + 2, n) − C_MN(K + 1, n) ),   (A.9)

from which the theorem follows.

Proof of Theorem 2 (naive Bayes recursion)

We have

  C_NB(K_0, n) = Σ_{h_1+···+h_{K_0}=n} n!/(h_1! ··· h_{K_0}!) Π_{k=1}^{K_0} (h_k/n)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{h_1+···+h_{K_0}=n} (n!/n^n) Π_{k=1}^{K_0} ( h_k^{h_k} / h_k! ) Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2}
      × Σ_{h_1+···+h_{K*}=r_1} ( r_1!/(h_1! ··· h_{K*}!) ) Π_{k=1}^{K*} (h_k/r_1)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
      × Σ_{h_{K*+1}+···+h_{K_0}=r_2} ( r_2!/(h_{K*+1}! ··· h_{K_0}!) ) Π_{k=K*+1}^{K_0} (h_k/r_2)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k)
    = Σ_{r_1+r_2=n} (n!/(r_1! r_2!)) (r_1/n)^{r_1} (r_2/n)^{r_2} · C_NB(K*, r_1) · C_NB(K_0 − K*, r_2),   (A.10)

and the proof follows.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers and Jorma Rissanen for useful comments. This work was supported in part by the Academy of Finland under the project Civi and by the Finnish Funding Agency for Technology and Innovation under the projects Kukot and PMMA. In addition, this work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

REFERENCES

[1] G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Transactions on Information Systems, vol. 23, no. 1, pp. 3–34, 2005.
[2] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and B. Brown, "Clustering methods for the analysis of DNA microarray data," Tech. Rep., Department of Health Research and Policy, Stanford, Calif, USA, 1999.
[3] W. Pan, J. Lin, and C. T. Le, "Model-based cluster analysis of microarray gene-expression data," Genome Biology, vol. 3, no. 2, pp. 1–8, 2002.
[4] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.
[5] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, "Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks," in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[6] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[7] J. Rissanen, "Stochastic complexity," Journal of the Royal Statistical Society, Series B, vol. 49, no. 3, pp. 223–239, 1987, with discussions, pp. 223–265.
[8] J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[9] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[10] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
[11] J. Rissanen, "Strong optimality of the normalized ML models as universal codes and information in data," IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1712–1717, 2001.
[12] P. Grünwald, The Minimum Description Length Principle, The MIT Press, Cambridge, Mass, USA, 2007.
[13] J. Rissanen, Information and Complexity in Statistical Modeling, Springer, New York, NY, USA, 2007.
[14] D. Heckerman, "A tutorial on learning with Bayesian networks," Tech. Rep. MSR-TR-95-06, Microsoft Research, Advanced Technology Division, One Microsoft Way, Redmond, Wash, USA, 98052, 1996.
[15] P. Kontkanen and P. Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, vol. 103, no. 6, pp. 227–233, 2007.
[16] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri, "An MDL framework for data clustering," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., The MIT Press, Cambridge, Mass, USA, 2006.
[17] Q. Xie and A. R. Barron, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 431–445, 2000.
[18] V. Balasubramanian, "MDL, Bayesian inference, and the geometry of the space of probability distributions," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., pp. 81–98, The MIT Press, Cambridge, Mass, USA, 2006.
[19] P. Kontkanen and P. Myllymäki, "MDL histogram density estimation," in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), San Juan, Puerto Rico, USA, March 2007.
[20] P. Kontkanen, W. Buntine, P. Myllymäki, J. Rissanen, and H. Tirri, "Efficient computation of stochastic complexity," in Proceedings of the 9th International Conference on Artificial Intelligence and Statistics, C. Bishop and B. Frey, Eds., pp. 233–238, Society for Artificial Intelligence and Statistics, Key West, Fla, USA, January 2003.
[21] M. Koivisto, "Sum-product algorithms for the analysis of genetic risks," Tech. Rep. A-2004-1, Department of Computer Science, University of Helsinki, Helsinki, Finland, 2004.
[22] P. Kontkanen and P. Myllymäki, "A fast normalized maximum likelihood algorithm for multinomial data," in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), Edinburgh, Scotland, August 2005.
[23] D. E. Knuth and B. Pittel, "A recurrence related to trees," Proceedings of the American Mathematical Society, vol. 105, no. 2, pp. 335–349, 1989.
[24] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W function," Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.
[25] W. Szpankowski, Average Case Analysis of Algorithms on Sequences, John Wiley & Sons, New York, NY, USA, 2001.
[26] P. Flajolet and A. M. Odlyzko, "Singularity analysis of generating functions," SIAM Journal on Discrete Mathematics, vol. 3, no. 2, pp. 216–240, 1990.
[27] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[28] P. Kontkanen, P. Myllymäki, and H. Tirri, "Constructing Bayesian finite mixture models by the EM algorithm," Tech. Rep. NC-TR-97-003, ESPRIT Working Group on Neural and Computational Learning (NeuroCOLT), Helsinki, Finland, 1997.
[29] P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "On Bayesian case matching," in Proceedings of the 4th European Workshop on Advances in Case-Based Reasoning (EWCBR '98), B. Smyth and P. Cunningham, Eds., vol. 1488 of Lecture Notes in Computer Science, pp. 13–24, Springer, Dublin, Ireland, September 1998.
[30] P. Grünwald, P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "Minimum encoding approaches for predictive modeling," in Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI '98), G. Cooper and S. Moral, Eds., pp. 183–192, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[31] P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P. Grünwald, "On predictive distributions and Bayesian networks," Statistics and Computing, vol. 10, no. 1, pp. 39–54, 2000.

[32] P. Kontkanen, J. Lahtinen, P. Myllymäki, T. Silander, and H. Tirri, "Supervised model-based visualization of high-dimensional data," Intelligent Data Analysis, vol. 4, no. 3-4, pp. 213–227, 2000.
[33] M. Dyer, R. Kannan, and J. Mount, "Sampling contingency tables," Random Structures and Algorithms, vol. 10, no. 4, pp. 487–506, 1997.