Automated Annotation of Protein Sequences, This Is Vital and Similar Approaches Will Be Implemented for Other Domains in the Future

An Environment for Consistent Sequence Annotation and its Application to Transmembrane Proteins Steffen Möller Selwyn College Cambridge, United Kingdom June, 2001 The dissertation is submitted for the degree of Doctor of Philosophy. An Environment for Consistent Sequence Annotation and its Application to Transmembrane Proteins Steffen Möller Summary This thesis describes my research leading to the development of a novel software environment to combine multiple tools for an automated prediction of the properties of protein sequences. The test case of the implementation of the tool lies on transmembrane proteins. Part of the environment is a conflict-resolution mechanism and respective rules for its application. This contributes to an improvement of the automated sequence annotation of transmembrane proteins. The integrated tools include several membrane prediction methods. These are combined to provide an integrated method for both signal peptide prediction and membrane spanning region prediction. A database was created to describe the correlation between individual InterPro entries and transmembrane annotation. This led to the development of specialised predictors, constrained to individual protein families, and set the basis for an automated discovery of constraints for transmembrane topologies. To facilitate an evaluation of membrane protein prediction, a collection of biochemically well-characterised transmembrane protein sequences was created. As a novelty, raw data from those experiments where added from which a protein’s topology was elucidated. This was applied for analysing aberrations of predictions from the experimental results. The thesis closes with a novel application of the developed techniques to find determinants for the coupling of G protein-coupled receptors to G proteins and thereby facilitates a functional characterisation of these transmembrane receptors. 2 Publications The following publications resulted from the research described in this thesis. S. Möller, U. Leser, W. Fleischmann, R. Apweiler; "EDITtoTrEMBL: A distributed approach to high-quality automated protein sequence annotation." In: Proceedings of the German Conference on Bioinformatics (GCB'98), Zimmermann O., Schomburg D. (eds.); Köln, Germany (1998). W. Fleischmann, S. Möller, A. Gateau, R. Apweiler; "A novel method for automatic and reliable functional annotation." In: Proceedings of the German Conference on Bioinformatics (GCB'98), Zimmermann O., Schomburg D. (eds.); Köln, Germany (1998). R. Apweiler, C. O'Donovan, M.J. Martin, W. Fleischmann, H. Hermjakob, S. Möller, S. Contrino; "SWISS-PROT and its computer-annotated supplement TrEMBL: How to produce high quality automatic annotation." In: Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (SCI '98 / ISAS '98), Callaos N., Holmes L., Osers R. (eds.), Orlando, FL, USA; 4:184-191 (1998). S. Möller, U. Leser, W. Fleischmann, R. Apweiler "EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation." Bioinformatics 15 (3):219-27 (1999). W. Fleischmann, S. Möller, A. Gateau, R. Apweiler "A novel method for automatic functional annotation of proteins." Bioinformatics 15 (3):228-33 (1999). S. Möller, M. Schroeder "Conflict-resolution for the automated annotation of transmembrane proteins." In: Proceedings of the AISB 2000 in Birmingham, UK S. Möller, M. Schroeder, R. Apweiler ”Conflict-resolution for the automated annotation of transmembrane proteins.” Computers and Chemistry 26 (1):41-49 (2001). S. Möller, E. V. Kriventseva, R. Apweiler "A collection of well characterised integral membrane proteins." Bioinformatics 16 (12):1159-1160 (2000). S. Möller, M. D. R. Croning, R. Apweiler "Evaluation of methods for the prediction of membrane spanning regions in proteins." Bioinformatics, 17 (7) 646-653 (2001) S. Möller, J. Vilo, M. D. R. Croning ” Prediction of the coupling specificity of GPCRs to their G proteins.” In: Int. Sys. Mol. Biol., Copenhagen, 2001, appeared as Bioinformatics 17 (Supplement 1) S174-S181. Özgün Babur, Steffen Möller and Rolf Apweiler "TransMotif - Linking sequence motifs with transmembrane annotation." in preparation 3 Michael D. R. Croning, Steffen Möller, James Stalker, Arne Stubenau, Alex Kasprzyk, Ewan Birney, Teresa K. Attwood “Sequence mining reveals the full extent of the human 7tm receptor families.” submitted Posters that were presented on international conferences are shown in the appendix. For references to the location of source code to the programs, the collection of membrane proteins or the predictor or the GPCR coupling please contact Steffen Möller per email: [email protected]. On the Internet the page http://www.ebi.ac.uk/~moeller also provides respective links. 4 Contents PUBLICATIONS...............................................................................................................................................3 CONTENTS.......................................................................................................................................................5 LIST OF FIGURES...........................................................................................................................................9 LIST OF TABLES...........................................................................................................................................12 ABBREVIATIONS..........................................................................................................................................14 PREFACE........................................................................................................................................................15 AUDIENCE........................................................................................................................................................16 ACKNOWLEDGEMENTS........................................................................................................................................16 II.INTRODUCTION.......................................................................................................................................19 A.GENOMIC SEQUENCING...................................................................................................................................19 1.The Quest..............................................................................................................................................19 2.Biological background.........................................................................................................................19 3.Genomic Sequencing............................................................................................................................22 B.BIOINFORMATICS AND MOLECULAR BIOLOGY DATABASES...................................................................................22 1.Motivation.............................................................................................................................................22 2.Current Applications of Protein Sequence Databases.........................................................................25 C.NEED FOR AUTOMATED ANNOTATION.................................................................................................................26 1.Background...........................................................................................................................................26 III.AUTOMATED ANNOTATION OF PEPTIDE SEQUENCES.............................................................31 A.DATA FLOW OF PROTEIN ANNOTATION...............................................................................................................31 1.Submission of nucleotide sequence to EMBL.......................................................................................31 2.Manual protein annotation...................................................................................................................33 B.CONCEPT OF AUTOMATED PROTEIN ANNOTATION.................................................................................................33 1.Direct transfer by sequence similarity..................................................................................................34 2.Indirect annotation by protein domains (InterPro)..............................................................................35 3.Abstract sequence annotation (Algorithms).........................................................................................35 4.Future sources......................................................................................................................................36 5.Performance of automated annotation.................................................................................................36 C.EVIDENCES FOR INFORMATION..........................................................................................................................38 5 D.HOW A BIOLOGIST WOULD USE COMPUTER ALGORITHMS TO ANNOTATE A PROTEIN SEQUENCE.....................................38 1.Selection of programs...........................................................................................................................39 2.Integration of results............................................................................................................................40 E.POTENTIAL ALTERNATIVES TO A NEW DEVELOPMENT............................................................................................41

Automated Annotation of Protein Sequences, This Is Vital and Similar Approaches Will Be Implemented for Other Domains in the Future

Applied Category Theory for Genomics – an Initiative

Gene Prediction: the End of the Beginning Comment Colin Semple

Meeting Review: Bioinformatics and Medicine – from Molecules To

Chuanxiong Rhizoma Compound on HIF-VEGF Pathway and Cerebral Ischemia-Reperfusion Injury’S Biological Network Based on Systematic Pharmacology

The Ethos and Effects of Data-Sharing Rules: Examining The

Computational Exploration of the Medium-Chain Dehydrogenase / Reductase Superfamily

Contcenter for Genomic Regul

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective

Biolayout Paper Fig3 V3

Computational Biology: Plus C'est La Même Chose, Plus Ça Change

A Detailed Genome-Wide Reconstruction of Mouse Metabolism Based on Human Recon 1

Integration of Text Mining with Systems Biology Provides New Insight Into the Pathogenesis of Diabetic Neuropathy