COMPACT HANDBOOK of Computational Biology
Edited by
Andrzej K. Konopka
M. James C. Crabbe

MARCEL DEKKER
NEW YORK

Copyright © 2004 by Marcel Dekker

Although great care has been taken to provide accurate and current information, neither the author(s) nor the publisher, nor anyone else associated with this publication, shall be liable for any loss, damage, or liability directly or indirectly caused or alleged to be caused by this book. The material contained herein is not intended to provide specific advice or recommendations for any specific situation.

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

ISBN: 0-8247-0982-9

Headquarters
Marcel Dekker, 270 Madison Avenue, New York, NY 10016, U.S.A.
tel: 212–696–9000; fax: 212–685–4540

Distribution and Customer Service
Marcel Dekker, Cimarron Road, Monticello, New York 12701, U.S.A.
tel: 800–228–1160; fax: 845–796–1772

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright © 2004 by Marcel Dekker. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA

Foreword

Computational biology is a relatively new but already mature field of academic research. In this handbook we focused on the following three goals:

1. Outlining pivotal general methodologies that will guide research for the years to come.
2. Providing a survey of specific algorithms, which have been successfully applied in molecular biology, genomics, structural biology and bioinformatics.
3. Describing and explaining away multiple misconceptions that could jeopardize science and software development of the future.

The handbook is written for mature readers with a great deal of rigorous experience with either doing scientific research or writing scientific software. However, no specific background in computer science, statistics, or biology is required to understand most of the chapters.

In order to accommodate the readers who wish to become professional computational biologists in the future, we also provide appendices that contain educationally sound glossaries of terms and descriptions of major sequence analysis algorithms. This can also be of help to executives in charge of industrial bioinformatics as well as to academic teachers who plan their courses in bioinformatics, computational biology, or one of the “*omics” (genomics, proteomics, and their variants). As a matter of fact, we believe that this volume should be suitable as a senior undergraduate or graduate-level textbook of computational biology within departments that pertain to any of the life sciences. Selected chapters could also be used as teaching materials for students of computer science and bioengineering.

This handbook is a true celebration of computational biology. It has been in preparation for a very long time with multiple updates and exchanges of chapters. However, the quality and potential usefulness of the contributions make us believe that preparing the ultimate final version was time very well spent.

Our thanks go to all the authors for their masterful contributions and devotion to the mission of science.
We would also like to extend our most sincere thanks to our publishing editor Anita Lekhwani and the outstanding production editor Barbara Methieu. Their almost infinite patience and encouragement during inordinately long periods of collecting updates and final versions of the chapters greatly contributed to the timely completion of this project.

Andrzej K. Konopka and M. James C. Crabbe

Contents

Foreword
Contributors

1. Introduction: Computational Biology in 14 Brief Paragraphs
   Andrzej K. Konopka and M. James C. Crabbe

Part I. Biomolecular Sequence Analysis

2. Introduction to Pragmatic Analysis of Nucleic Acid Sequences
   Andrzej K. Konopka
3. Motifs in Sequences: Localization and Extraction
   Maxime Crochemore and Marie-France Sagot
4. Protein Sequence Analysis and Prediction of Secondary Structural Features
   Jaap Heringa

Part II. Biopolymer Structure Calculation and Prediction

5. Discrete Models of Biopolymers
   Peter Schuster and Peter F. Stadler
6. Protein Structure Folding and Prediction
   William R. Taylor
7. DNA-Protein Interactions: Target Prediction
   Akinori Sarai and Hidetoshi Kono

Part III. Genome Analysis and Evolution

8. Methods of Computational Genomics: An Overview
   Frederique Lisacek
9. Computational Aspects of Comparative Genomics
   Wojciech Makalowski and Izabela Makalowska
10. Computational Methods for Studying the Evolution of the Mitochondrial Genome
   Cecilia Saccone and Graziano Pesole

Appendix 1: Computational Biology: Annotated Glossary of Terms
   Compiled by Andrzej K. Konopka, Philipp Bucher, M. James C. Crabbe, Maxime Crochemore, Jaap Heringa, Frederique Lisacek, Izabela Makalowska, Wojciech Makalowski, Graziano Pesole, Cecilia Saccone, Marie-France Sagot, Akinori Sarai, Peter Stadler, and William R. Taylor
Appendix 2: A Dictionary of Programs for Sequence Analysis
   Andrzej K. Konopka and Jaap Heringa

Contributors

Philipp Bucher
   Bioinformatics, Institut Suisse de Recherches Experimentales sur le Cancer (ISREC), Epalinges s/Lausanne, Switzerland
M. James C. Crabbe
   University of Reading, Whiteknights, Reading, England
Maxime Crochemore
   Institut Gaspard-Monge, University of Marne-la-Vallée, Marne-la-Vallée, France
Jaap Heringa
   Centre for Integrative Bioinformatics VU (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, Free University, Amsterdam, The Netherlands, and MRC National Institute for Medical Research, London, England
Hidetoshi Kono
   University of Pennsylvania, Philadelphia, Pennsylvania, U.S.A.
Andrzej K. Konopka
   BioLingua Research Inc., Gaithersburg, Maryland, U.S.A.
Frederique Lisacek
   Génome and Informatique, Evry, France and Geneva Bioinformatics (GeneBio), Geneva, Switzerland
Izabela Makalowska
   Pennsylvania State University, University Park, PA, U.S.A.
Wojciech Makalowski
   Pennsylvania State University, University Park, PA, U.S.A.
Graziano Pesole
   University of Milan, Milan, Italy
Cecilia Saccone
   University of Bari, Bari, Italy
Marie-France Sagot
   INRIA, Laboratoire de Biométrie et Biologie Évolutive, University Claude Bernard, Lyon, France
Akinori Sarai
   RIKEN Institute, Tsukuba Life Science Center, Tsukuba, Japan
Peter Schuster
   University of Vienna, Vienna, Austria
Peter F. Stadler
   University of Leipzig, Leipzig, Germany
William R. Taylor
   National Institute for Medical Research, London, England

1
Introduction: Computational Biology in 14 Brief Paragraphs

Andrzej K. Konopka
BioLingua Research Inc., Gaithersburg, Maryland, U.S.A.

M. James C. Crabbe
University of Reading, Whiteknights, Reading, England

1. Computational biology has been a marvelous experience for at least three generations of the brightest scientists of the second half of the twentieth century and continues to be so in the twenty-first century. Simply speaking, computational biology is the science of biology done with the use of computers. Because computers require some specialized knowledge of the cultural and technical infrastructure in which they can be used, computational biology is significantly motivated (or even inspired) by computer science and its engineering variant known as information technology. On the other hand, the kind of data and data structures that can be processed by computers provide practical constraints on the selection of computational biology research topics. For instance, it is easier to analyze sequences of biopolymers with string processing techniques and statistics than to infer unknown biological functions from the unknown three-dimensional structures of the same biopolymers by using image processing tools. Similarly, it is advisable to study systems of chemical reactions (including metabolic pathways) in terms of rates and fluxes rather than in terms of colors and smells of substrates and products.

2. Computational biology is genuinely interdisciplinary. It draws on facts and methods from fields of science as diverse as logic, algebra, chemical kinetics, thermodynamics, statistical mechanics, statistics, linguistics, cryptology, molecular biology, evolutionary biology, genetics, embryology, structural biology, chemistry, and even telecommunications engineering. However, most of its methods are a result of a new scientific culture in which computers serve as an extension of human memory and the ability to manipulate symbols. The methodological distinctiveness of computational biology from other modes of computer-assisted biology (such as bioinformatics or theoretical biology in our simplified classification) is indicated