New Approaches to Facilitate Genome Analysis

New Approaches to Facilitate Genome Analysis Philip Scordis Sequence Analysis Group Biomolecular Structure and Modelling Unit Department of Biochemistry and Molecular Biology University College London A thesis submitted to the University of London for the degree of Doctor of Philosophy September 2000 ProQuest Number: 10013268 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. uest. ProQuest 10013268 Published by ProQuest LLC(2016). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. Microform Edition © ProQuest LLC. ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, Ml 48106-1346 Abstract In this era of concerted genome sequencing efforts, biological sequence information is abun dant. With many prokaryotic and simple eukaryotic genomes completed, and with the genomes of more complex organisms nearing completion, the bioinformatics community, those charged with the interpretation of these data, are becoming concerned with the efficacy of current anal ysis tools. One step towards a more complete understanding of biology at the molecular level is the unambiguous functional assignment of every newly sequenced protein. The sheer scale of this problem precludes the conventional process of biochemically determining function for every example. Rather we must rely on demonstrating similarity to previously characterised proteins via computational methods, which can then be used to infer homology and hence structural and functional relationships. Our ability to do this with any measure of reliabil ity unfortunately diminishes as the pools of experimentally determined sequence data become muddied with sequences that are themselves characterised with "in silico" annotation. Part of the problem stems from the complexity of modelling biology in general, and of evo lution in particular. For example, once similarity has been identified between sequences, in order to assign a common function it is important to identify whether the inferred homologous relationship has an orthologous or paralogous origin, which currently cannot be done compu tationally. The modularity of proteins also poses problems for automatic annotation, as similar domains may occur in proteins with very different functions. Once accepted into the sequence databases, incorrect functional assignments become available for mass propagation and the consequences of incorporating those errors in further "in silico" experiments are potentially catastrophic. One solution to this problem is to collate families of proteins with demonstrable homologous relationships, derive a pattern that represents the essence of those relationships, and use this as a signature to trawl for similarity in the sequence databases. This approach not only provides a more sensitive model of evolution, but also allows annotation from all members of the family to contribute to any assignments made. This thesis describes the development of a new search method (FingerPRlNTScan) that ex ploits the familial models in the PRINTS database to provide more powerful diagnosis of evo lutionary relationships. FingerPRlNTScan is both selective and sensitive, allowing both precise identification of super-family, family and sub-family relationships, and the detection of more distant ones. Illustrations of the diagnostic performance of the method are given with respect to the haemoglobin and transfer RNA synthetase families, and whole genome data. FingerPRlNTScan has become widely used in the biological community, e.g. as the primary search interface to PRINTS via a dedicated web site at the university of Manchester, and as one of the search components of InterPro at the European Bioinformatics Institute (EBl). Fur thermore, it is currently responsible for facilitating the use of PRINTS in a number of signif icant annotation roles, such as the automatic annotation of TrEMBL at the EBl, and as part of the computational suite used to annotate the Drosophila melanogaster genome at Celera Genomics. Acknowledgements I thank my supervisor Terri Attwood for all of the support, some of the criticism and all the time and effort she had to expend to get me through this. I am grateful to my industrial supervisor Darren Flower and I acknowledge my sponsors Astra Chamwood for funding this research. I must also thank my present employers for their patience. Thanks also go to my good friends and colleagues at UCL and Manchester: Denise Henriques, Duncan Milbum, Adrian Shepherd, Jane Mabey, Julian Selley, Will Wright, who have been with me most of the way. And some who were only around for part of the ride but who certainly helped keep me sane, Karen Eilbeck, Crispin Miller, Harriet Watkin and Andrea Edwards. Heartfelt thanks must go to Vicki Kitchener, Jennifer and Michael Scordis, because without their continual support and optimism I would not have survived. Contents Abstract 2 Acknowledgements 3 1 Introduction 19 1.1 Biosequences ........................................................................................... 20 1.2 Databases ................................................................................................. 24 1.3 Sequence A n a ly sis ................................................................................. 24 2 Primary databases 27 2.1 Biological Sequences .............................................................................. 28 2.2 Nucleic Acid Databases............................................................................. 29 2.2.1 Other nucleic acid sequence resources ...................................... 31 2.3 Protein Sequence Databases ................................................................... 32 2.3.1 P S D ............................................................................................... 32 2.3.2 SWISS-PROT ............................................................................ 33 2.3.3 Sum m ary ..................................................................................... 35 2.4 Pairwise Sequence Analysis ................................................................. 37 2.4.1 Sequence similarity ..................................................................... 38 CONTENTS 5 2.4.1.1 Identity .......................................................................... 38 2.4.1.2 Amino acid side chain similarities .......................... 41 2.4.2 Scoring the alig n m en t ................................................................ 45 2.4.2.1 Counting and scoring identity .................................... 45 2.4.2.2 Counting and scoring similarity ............................. 48 2.4.3 Identifying the optimal alignm ent .............................................. 52 2.4.3.1 Global alignment algorithms .................................... 54 2.4.3.2 Local alignment algorithm s ....................................... 58 2.4.4 Assessing the significance of the alignm ent .............................. 61 2.4.4.1 Scoring .......................................................................... 61 2.4.4.2 Probability v a lu e s ....................................................... 63 2.4.4.3 Adjusting probabilities for database searches. 66 2.5 Problems of pairwise sequence analysis ................................................. 67 3 Secondary Databases 71 3.1 Gene families............................................................................................ 72 3.2 The Multiple Sequence A lignm ent .......................................................... 75 3.2.1 Creating a multiple sequence alignment ...................................... 76 3.3 Multiple Sequence Analysis ................................................................... 78 3.3.1 Scanning a sequence ................................................................... 80 3.3.1.1 The sliding w in d o w .................................................... 80 3.3.2 The M o tif ...................................................................................... 81 3.3.3 Selecting a m o tif ......................................................................... 82 3.3.4 Encoding a m o tif .......................................................................... 83 CONTENTS 6 3.3.4.1 The regular expression ................................................. 83 3.3.4.2 The frequency matrix .................................................... 85 3.3.4.3 The ‘profile’ motif ......................................................... 86 3.3.5 Motif Scoring ................................................................................ 88 3.3.5.1 Scoring a sequence - the frequency m a trix 90 3.3.5.2 Scoring a sequence - the ‘profile’ m otif ..................... 91 3.3.6 Multiple Motifs .......................................................................... 92 3.3.7 Whole A lignm ents ....................................................................... 93 3.3.7.1 Profile Methodology .................................................... 93 3.3.7.2 Profile-HMMs ............................................................. 94 3.3.8 Sum m ary ...................................................................................... 95 3.4 Pattern and Family

New Approaches to Facilitate Genome Analysis

ELIXIR Poster Numbers: P El001 - 037 Application Posters: P El034 - 037

The Uniprot Knowledgebase

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective

The Uniprot Knowledgebase BLAST

ISB Newsletter

Embnet.News Volume 4 Nr

GOBLET Annual General Meeting

The Role of Pattern Databases in Sequence Analysis

BIOINFORMATICS and MOLECULAR EVOLUTION BAMA01 09/03/2009 16:31 Page Ii BAMA01 09/03/2009 16:31 Page Iii

Lorenzo Cerutti March 2011

SEB/GOBLET Bioinformatics Workshop

ISB Newsletter September 2013 12