Thesis Reference
Thesis

Development of methods for the automated annotation of prokaryotic proteomes

GATTIKER, Alexandre

Abstract

Annotation attached to genomic data is a considerable aid to understanding gene function. Methods have been developed to increase the efficiency, scope, reliability and consistency of the annotation of prokaryotic proteomes within the Swiss-Prot knowledgebase. Layering automated methods on top of the classical annotation process brings no reduction in quality; on the contrary, it makes the data more consistent and reproducible. The complementarity between the expertise of annotators and automatic methods was maximized by using a system of manually maintained annotation rules. Tools were created to maximize the throughput of the experts working on the knowledgebase, and an automatic system enables the annotation of particular subsets of prokaryotic proteomes. The system was subsequently extended to also handle proteins of eukaryotic origin and to partially annotate complex proteins on the basis of their domains and functional sites.

Reference

GATTIKER, Alexandre. Development of methods for the automated annotation of prokaryotic proteomes. Doctoral thesis: Univ. Genève, 2005, no. Sc. 3616
URN: urn:nbn:ch:unige-7263
DOI: 10.13097/archive-ouverte/unige:726
Available at: http://archive-ouverte.unige.ch/unige:726
Disclaimer: layout of this document may differ from the published version.

UNIVERSITÉ DE GENÈVE

Faculty of Medicine, Department of Structural Biology and Bioinformatics: Professor Amos Bairoch
Faculty of Sciences, Department of Computer Science: Professor Ron D. Appel

Development of methods for the automated annotation of prokaryotic proteomes

THESIS presented to the Faculty of Sciences of the Université de Genève to obtain the degree of Docteur ès sciences, specialization in bioinformatics, by ALEXANDRE GATTIKER of Küsnacht (ZH)

Thesis No. 3616
GENEVA 2005

The Faculty of Sciences, on the recommendation of Messrs. A. BAIROCH, adjunct professor and thesis director (Department of Structural Biology and Bioinformatics), R. D. APPEL, associate professor and thesis co-director (Department of Computer Science), C. NOTREDAME, doctor (Swiss Institute of Bioinformatics – Biological Information Modeling – Lausanne, Switzerland), J. SCHRENZEL, substitute adjunct professor (Faculty of Medicine – Department of Internal Medicine), and A. DANCHIN, doctor (Institut Pasteur – Génétique des génomes bactériens – Paris, France), authorizes the printing of this thesis, without expressing an opinion on the propositions stated therein.

Geneva, 30 March 2005
The Dean, Pierre SPIERER

The particular mode of publication of a thesis forces me to take the credit for what is truly a collaborative accomplishment. My colleagues have not only helped achieve this work, but have simply made it possible through their impressive commitment and dedication. Curators proposed and tested the systems and developed and maintained the automated annotation rules: Karine Michoud, Catherine Rivoire, Elisabeth Coudert, Tania Lima, Andrea Auchincloss, Virginie Le Saux, Nicolas Hulo, Christian Sigrist, Claudia Vitorello, Petra Langendyk-Genevaux and Silvia Braconi Quintaje. Bioinformaticians co-developed the supporting systems: Edouard de Castro, Xavier Martin, Corinne Lachaize, Isabelle Phan and Brigitte Boeckmann. Systems engineers have been regularly cleaning and feeding the beasts: Ivan Ivanyi, Karin Sonesson and Salvo Pæsano.
To achieve the developments reported herein, I have also had the pleasure of fruitfully collaborating with Marco Pagni, Cédric Notredame, Henning Hermjakob, Paul Kersey, Peter McLaren, Lorna Morris, Alan Horne, Claire O’Donovan, María-Jesús Martín, Simon Penel, Volker Flegel, Laurent Falquet and Christian Iseli. I am especially indebted to my thesis advisor, Amos Bairoch, who conceived the project, invested himself unrelentingly and supported me throughout. I would also like to thank my jurors, Ron Appel, Antoine Danchin, Cédric Notredame and Jacques Schrenzel, for doing me the honor of agreeing to appraise my work and for their useful comments and suggestions. A special mention of gratitude is dedicated to Elisabeth Gasteiger, for deploying treasures of patience and collaborative commitment from my very first day, and to Tania Lima, Lina Yip and Andrea Auchincloss for extremely relevant criticism and feedback on this thesis.

A INTRODUCTION
  A1 WHAT BACTERIAL GENOMES REVEAL
  A2 FROM GENOME SEQUENCES TO PROTEIN FUNCTIONS
  A3 BIOLOGICAL DATABASES
  A4 COMPUTATIONAL GENE PREDICTION
  A5 COMPUTATIONAL SEQUENCE ANALYSIS
  A6 INFERENCE OF PROTEIN ANNOTATION
  A7 ISSUES IN AUTOMATIC ANNOTATION
B IMPROVING THE STRUCTURE AND CONTENT OF THE SWISS-PROT PROTEIN KNOWLEDGEBASE
  B1 SWISSKNIFE
  B2 SWISS-PROT FORMAT CHANGES
  B3 WORK ON PROTEOMES
C INTEGRATING SEQUENCE ANALYSIS METHODS
  C1 FINDING MOTIFS
  C2 DETECTING SEQUENCE SIMILARITY: BLAST
  C3 INTEGRATING ANALYSIS TOOLS
D DEFINING AND CURATING AUTOMATED ANNOTATION RULES
  D1 RULE FORMAT AND REPOSITORY
  D2 THE CHALLENGES OF COMPLEX FAMILIES
  D3 GENOMIC COHERENCE
E DEVELOPING AN AUTOMATIC IDENTIFICATION AND ANNOTATION PIPELINE
  E1 CALCULATING AND STORING LARGE NUMBERS OF RESULTS
  E2 DETECTING AND ANNOTATING PROTEINS BASED ON RULES
  E3 THE ANABELLE PIPELINE
F MAKING THE DATA AVAILABLE
  F1 IMPROVING ACCESS TO SWISS-PROT ON EXPASY
  F2 GENOME PROXIMITY VIEWER
  F3 HAMAP WEBSITE
G RESULTS AND DISCUSSION
  G1 A HIGH-QUALITY AUTOMATED ANNOTATION SYSTEM
  G2 AUTOMATED ANNOTATION RULES
  G3 HIGH THROUGHPUT OF AUTOMATED ANNOTATION
  G4 NO LOSS OF QUALITY
  G5 A PREDICTOR TUNED TOWARD FUNCTIONAL ANNOTATION
H CONCLUSION
I REFERENCES
J APPENDICES: ARTICLES AND UNIRULES DEFINITION

Summary

A large number of complete prokaryotic genomes have been sequenced. This knowledge is bringing about considerable advances in the understanding of the processes of life, the evolution of species, and the mechanisms of pathogenicity and resistance of major disease agents. To be useful, genomic sequences should be annotated: the positions of genes should be predicted, and gene products should be assigned functions based on experimental evidence or, most frequently, on computational predictions. Genes are usually predicted and annotated by genome sequencing teams, and the data are deposited in public databases. The annotation of genes describes at a very general level the function of the protein product, but not that of its domains and regions. In the absence of standards, the annotation provided is often inconsistent, irreproducible, or wrong, stating either more or less than the obtainable evidence supports. Most importantly, the annotation is almost never updated and quickly becomes obsolete. The large amount of data to be processed and the pressure on cost efficiency usually preclude extensive curation of genomic data, so automatic annotation methods have been developed. The Swiss-Prot protein knowledgebase has for twenty years been a reference protein sequence database with a high level of annotation and extensive integration with other databases. Swiss-Prot cannot absorb all of the protein sequences arising from genomic data, and the Universal Protein knowledgebase (UniProtKB) exists for that purpose.
As the manually curated section of UniProtKB, the function