BIOINFORMATICS and MOLECULAR EVOLUTION BAMA01 09/03/2009 16:31 Page Ii BAMA01 09/03/2009 16:31 Page Iii

BAMA01 09/03/2009 16:31 Page ii BAMA01 09/03/2009 16:31 Page i BIOINFORMATICS AND MOLECULAR EVOLUTION BAMA01 09/03/2009 16:31 Page ii BAMA01 09/03/2009 16:31 Page iii Bioinformatics and Molecular Evolution Paul G. Higgs and Teresa K. Attwood BAMA01 09/03/2009 16:31 Page iv © 2005 by Blackwell Science Ltd a Blackwell Publishing company BLACKWELL PUBLISHING 350 Main Street, Malden, MA 02148-5020, USA 108 Cowley Road, Oxford OX4 1JF, UK 550 Swanston Street, Carlton, Victoria 3053, Australia The right of Paul G. Higgs and Teresa K. Attwood to be identified as the Authors of this Work has been asserted in accordance with the UK Copyright, Designs, and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs, and Patents Act 1988, without the prior permission of the publisher. First published 2005 by Blackwell Science Ltd Library of Congress Cataloging-in-Publication Data Higgs, Paul G. Bioinformatics and molecular evolution / Paul G. Higgs and Teresa K. Attwood. p. ; cm. Includes bibliographical references and index. ISBN 1–4051–0683–2 (pbk. : alk. paper) 1. Molecular evolution—Mathematical models. 2. Bioinformatics. [DNLM: 1. Computational Biology. 2. Evolution, Molecular. 3. Genetics. QU 26.5 H637b 2005] I. Attwood, Teresa K. II. Title. QH371.3.M37H54 2005 572.8—dc22 2004021066 A catalogue record for this title is available from the British Library. 1 Set in 9 /2 /12pt Photina by Graphicraft Limited, Hong Kong Printed and bound in the United Kingdom by TJ International, Padstow, Cornwall The publisher’s policy is to use permanent paper from mills that operate a sustainable forestry policy, and which has been manufactured from pulp processed using acid-free and elementary chlorine-free practices. Furthermore, the publisher ensures that the text paper and cover board used have met acceptable environmental accreditation standards. For further information on Blackwell Publishing, visit our website: www.blackwellpublishing.com BAMA01 09/03/2009 16:31 Page v Short Contents Preface x Chapter plan xiii 1 Introduction: The revolution in biological information 1 2 Nucleic acids, proteins, and amino acids 12 3 Molecular evolution and population genetics 37 4 Models of sequence evolution 58 5 Information resources for genes and proteins 81 6 Sequence alignment algorithms 119 7 Searching sequence databases 139 8 Phylogenetic methods 158 9 Patterns in protein families 195 10 Probabilistic methods and machine learning 227 11 Further topics in molecular evolution and phylogenetics 257 12 Genome evolution 283 13 DNA Microarrays and the ’omes 313 Mathematical appendix 343 List of Web addresses 355 Glossary 357 Index 363 v BAMA01 09/03/2009 16:31 Page vi BAMA01 09/03/2009 16:31 Page vii Full Contents Preface x 3 MOLECULAR EVOLUTION AND POPULATION GENETICS 37 1 INTRODUCTION: THE 3.1 What is evolution? 37 REVOLUTION IN BIOLOGICAL 3.2 Mutations 39 INFORMATION 1 3.3 Sequence variation within and 1.1 Data explosions 1 between species 40 1.2 Genomics and high-throughput 3.4 Genealogical trees and coalescence 44 techniques 5 3.5 The spread of new mutations 46 1.3 What is bioinformatics? 6 3.6 Neutral evolution and adaptation 49 1.4 The relationship between population Box 3.1 The influence of selection on genetics, molecular evolution, and the fixation probability 50 bioinformatics 7 Box 3.2 A deterministic theory for Summary l References l Problems 10 the spread of mutations 51 Summary l References l Problems 54 2 NUCLEIC ACIDS, PROTEINS, AND AMINO ACIDS 12 4 MODELS OF SEQUENCE 2.1 Nucleic acid structure 12 EVOLUTION 58 2.2 Protein structure 14 4.1 Models of nucleic acid sequence 2.3 The central dogma 16 evolution 58 2.4 Physico-chemical properties of the Box 4.1 Solution of the Jukes–Cantor amino acids and their importance in model 61 protein folding 22 4.2 The PAM model of protein sequence Box 2.1 Polymerase chain reaction evolution 65 (PCR) 23 Box 4.2 PAM distances 70 2.5 Visualization of amino acid properties 4.3 Log-odds scoring matrices for amino using principal component acids 71 analysis 25 Summary l References l Problems l 2.6 Clustering amino acids according to Self-test Molecular evolution 76 their properties 28 Box 2.2 Principal component analysis 5 INFORMATION RESOURCES in more detail 29 FOR GENES AND PROTEINS 81 Summary l References l Self-test 5.1 Why build a database? 81 Biological background 34 5.2 Database file formats 82 vii BAMA01 09/03/2009 16:31 Page viii 5.3 Nucleic acid sequence databases 83 8.6 The maximum-likelihood criterion 173 5.4 Protein sequence databases 89 Box 8.2 Calculating the likelihood of 5.5 Protein family databases 95 the data on a given tree 174 5.6 Composite protein pattern 8.7 The parsimony criterion 177 databases 108 8.8 Other methods related to maximum 5.7 Protein structure databases 111 likelihood 179 5.8 Other types of biological database 113 Box 8.3 Calculating posterior Summary l References 115 probabilities 182 Summary l References l Problems l 6 SEQUENCE ALIGNMENT Self-test Phylogenetic methods 185 ALGORITHMS 119 6.1 What is an algorithm? 119 9 PATTERNS IN PROTEIN 6.2 Pairwise sequence alignment – FAMILIES 195 The problem 121 9.1 Going beyond pairwise alignment 6.3 Pairwise sequence alignment – methods for database searches 195 Dynamic programming methods 123 9.2 Regular expressions 197 6.4 The effect of scoring parameters on 9.3 Fingerprints 200 the alignment 127 9.4 Profiles and PSSMs 205 6.5 Multiple sequence alignment 130 9.5 Biological applications – G Summary l References l Problems 136 protein-coupled receptors 208 Summary l References l Problems l 7 SEARCHING SEQUENCE Self-test Protein families and databases 216 DATABASES 139 7.1 Similarity search tools 139 7.2 Alignment statistics (in theory) 147 10 PROBABILISTIC Box 7.1 Extreme value distributions 151 METHODS AND MACHINE LEARNING 227 Box 7.2 Derivation of the extreme 10.1 Using machine learning for value distribution in the pattern recognition in word-matching example 152 bioinformatics 227 7.3 Alignment statistics (in practice) 153 10.2 Probabilistic models of sequences – Summary l References l Problems l Basic ingredients 228 Self-test Alignments and database Box 10.1 Dirichlet prior distributions 232 155 searching 10.3 Introducing hidden Markov models 234 Box 10.2 The Viterbi algorithm 238 8 PHYLOGENETIC METHODS 158 8.1 Understanding phylogenetic trees 158 Box 10.3 The forward and backward 8.2 Choosing sequences 161 algorithms 239 8.3 Distance matrices and clustering 10.4 Profile hidden Markov models 241 methods 162 10.5 Neural networks 244 Box 8.1 Calculation of distances in the Box 10.4 The back-propagation neighbor-joining method 167 algorithm 249 8.4 Bootstrapping 169 10.6 Neural networks and protein 8.5 Tree optimization criteria and tree secondary structure prediction 250 search methods 171 Summary l References l Problems 253 viii BAMA01 09/03/2009 16:31 Page ix 11 FURTHER TOPICS IN Box 13.1 Examples from the Gene MOLECULAR EVOLUTION AND Ontology 335 PHYLOGENETICS 257 Summary l References l Self-test 337 11.1 RNA structure and evolution 257 11.2 Fitting evolutionary models to sequence data 266 MATHEMATICAL APPENDIX 343 11.3 Applications of molecular M.1 Exponentials and logarithms 343 phylogenetics 272 M.2 Factorials 344 Summary l References 279 M.3 Summations 344 M.4 Products 345 12 GENOME EVOLUTION 283 M.5 Permutations and combinations 345 12.1 Prokaryotic genomes 283 M.6 Differentiation 346 Box 12.1 Web resources for bacterial M.7 Integration 347 genomes 284 M.8 Differential equations 347 12.2 Organellar genomes 298 M.9 Binomial distributions 348 Summary l References 309 M.10 Normal distributions 348 M.11 Poisson distributions 350 13 DNA MICROARRAYS AND M.12 Chi-squared distributions 351 313 THE ’OMES M.13 Gamma functions and gamma 13.1 ’Omes and ’omics 313 distributions 352 13.2 How do microarrays work? 314 Problems l Self-test 353 13.3 Normalization of microarray data 316 13.4 Patterns in microarray data 319 13.5 Proteomics 325 List of Web addresses 355 13.6 Information management for Glossary 357 the ’omes 330 Index 363 ix BAMA01 09/03/2009 16:31 Page x Preface RATIONALE search them. Algorithms for sequence alignment, identification of conserved motifs in protein families, Degree programs in bioinformatics at Masters or and pattern-recognition methods using hidden undergraduate level are becoming increasingly Markov models and neural networks are discussed common and single courses in bioinformatics and in detail. A full chapter on data analysis for micro- computational biology are finding their way into arrays and proteomics is included. many types of degrees in biological sciences. This Evolutionary chapters of the book begin with a brief book satisfies a need for a textbook that explains the introduction to population genetics and the study scientific concepts and computational methods of of sequence variation within and between popula- bioinformatics and how they relate to current tions, and move on to the description of the evolu- research in biological sciences. The content is based tion of DNA and protein sequences. Phylogenetic on material taught by both authors on the MSc pro- methods are covered in detail, and examples are gram in bioinformatics at the University of Man- given of application of these methods to biological chester, UK, and on an upper level undergraduate questions.

BIOINFORMATICS and MOLECULAR EVOLUTION BAMA01 09/03/2009 16:31 Page Ii BAMA01 09/03/2009 16:31 Page Iii

ELIXIR Poster Numbers: P El001 - 037 Application Posters: P El034 - 037

The Uniprot Knowledgebase

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective

The Uniprot Knowledgebase BLAST

ISB Newsletter

Embnet.News Volume 4 Nr

GOBLET Annual General Meeting

The Role of Pattern Databases in Sequence Analysis

Lorenzo Cerutti March 2011

SEB/GOBLET Bioinformatics Workshop

ISB Newsletter September 2013 12

The Uniprot Knowledgebase the Uniprot Knowledgebase Uniprotkb Flat-File Format