Université d'Evry Val d'Essonne
École Doctorale “des Génomes aux Organismes”

THÈSE pour obtenir le titre de Docteur en Bioinformatique, Biologie Structurale et Génomique

Présentée par Adam Alexander Thil SMITH

Exploitation automatisée des contextes métabolique et génomique pour l'annotation fonctionnelle des génomes prokaryotes

Automatically exploiting Genomic and Metabolic Contexts to aid the Functional Annotation of Prokaryote Genomes

Soutenance prévue le 03 février 2012 devant le jury composé de :

Présidente : Florence d'ALCHE-BUC
Rapportrice : Marie-France SAGOT
Rapporteur : Antoine DANCHIN
Examinateur : Alain VIARI
Examinateur : Christos OUZOUNIS
Tuteur de thèse : David VALLENET
Directrice de thèse : Claudine MEDIGUE

Travail réalisé au sein du Laboratoire d'Analyses Bioinformatiques pour la Génomique et le Métabolisme, CEA/DSV/IG/CNS, UMR 8030 “génomique métabolique”

Nota bene: thèse rédigée en anglais

Table of Contents

I. Preamble...... 11
I.A. Thanks...... 11
I.B. Notes...... 12
I.C. Summary...... 12
I.D. Résumé...... 13
I.E. List of abbreviations...... 17

II. Introduction...... 19

III. Microbial Genomes and their Analysis...... 21
III.A. Genome sequencing and resources...... 21
III.A.1. History and evolution of biological molecule sequencing...... 21
III.A.1.a. Birth of sequencing...... 21
III.A.1.b. The Human Genome Project...... 23
III.A.1.c. Sequencing technologies today...... 24
III.A.1.d. Sequencing technologies for the near future...... 26
III.A.2. Ongoing projects and resources...... 26
III.A.2.a. Genome projects...... 26
III.A.2.b. Metagenomics and metagenome projects...... 28
III.A.2.c. Biological sequence resources...... 31
III.B. Annotation strategies and platforms...... 32
III.B.1. The different levels of genome annotation...... 32
III.B.2. Annotation platforms: ultimate annotation tools...... 33
III.B.3. Noteworthy prokaryote genome annotation platforms...... 36
III.B.4. MicroScope Platform Overview...... 37

IV. Functional annotation...... 39
IV.A. Gene/Protein function ontologies...... 40
IV.B. Key concepts in functional annotation...... 42
IV.B.1. Phylogenetic links...... 43
IV.B.2. Conserved sequences, domains, and gene/protein families...... 46
IV.B.3. Functional dependence...... 47
IV.B.4. The three types of gene functional annotation...... 48
IV.C. Annotation transfer on the basis of homology...... 50
IV.C.1. Protocol...... 50
IV.C.2. Homology-determining methods...... 51
IV.C.2.a. De novo sequence clustering...... 51
IV.C.2.b. Semi-supervised sequence similarity-based aggregating...... 53
IV.C.2.c. Semi-supervised domain-based aggregating...... 54
IV.C.2.d. 3D structure-based methods...... 55
IV.C.3. Limits...... 56
IV.D. Annotation inference on the basis of functional dependence...... 59

IV.D.1. Steps...... 59
IV.D.2. Genomic Context-based sources...... 60
IV.D.2.a. Prokaryote genome organisation...... 60
IV.D.2.b. Definition of genomic context...... 61
IV.D.2.c. Gene clusters and syntenies...... 61
IV.D.2.d. Rosetta stone (gene fusion/fission)...... 66
IV.D.2.e. Shared regulatory sites...... 66
IV.D.2.f. Phylogenetic profiles...... 66
IV.D.3. Experimental data-based sources...... 67
IV.D.3.a. Co-expression / co-regulation...... 68
IV.D.3.b. Co-citation...... 68
IV.D.3.c. Protein-protein interaction...... 68
IV.D.3.d. Other data sources...... 68
IV.D.4. Multiple-context-based annotation methods...... 69
IV.D.4.a. Annotation platforms...... 70
IV.D.4.b. The STRING...... 70
IV.D.4.c. Predictome...... 72
IV.D.4.d. ProLinks...... 72
IV.D.4.e. GeneMANIA...... 72

V. Metabolism...... 73
V.A. Key Actors...... 73
V.A.1. Compounds...... 73
V.A.2. Reactions...... 74
V.A.3. Enzymes...... 75
V.A.4. Metabolic pathways...... 78
V.A.5. Metabolic Resources...... 79
V.A.5.a. Compounds...... 79
V.A.5.b. Reactions and Enzymes...... 79
V.A.5.c. Metabolic pathways...... 80
V.B. Dealing with Metabolic Pathways...... 82
V.B.1. Metabolic pathway representations...... 82
V.B.1.a. Representing pathways as networks...... 82
V.B.1.b. Metabolic network scope...... 85
V.B.1.c. Metabolic network edge building...... 85
V.B.2. Metabolic pathway reconstructions...... 86
V.B.2.a. Protocols...... 86
V.B.2.b. Reaction gaps...... 87
V.C. Biocatalysis applications...... 87
V.C.1. Enzyme promiscuity...... 87
V.C.2. Industrial applications...... 88

VI. Orphan enzymes and how to adopt them...... 91
VI.A. Definitions...... 91
VI.B. Adoption status...... 92
VI.B.1.a. EC numbers over the years...... 93
VI.B.1.b. Orphan activity counts...... 95

VI.C. Methods for finding candidate genes...... 96
VI.C.1.a. Annotation platforms...... 96
VI.C.1.b. Exploitation of STRING data...... 96
VI.C.1.c. FRECs...... 97
VI.C.1.d. ADOMETA...... 97
VI.C.1.e. Pathway Hole Filler...... 99
VI.C.1.f. Miscellaneous...... 100

VII. Development: The CanOE strategy...... 101
VII.A. Overview...... 101
VII.B. Article...... 101
VII.C. Used concepts and tools...... 103
VII.C.1. The CCCPart algorithm...... 103
VII.C.2. OrthoMCL...... 104
VII.C.3. Metabolic Data used in CanOE...... 105
VII.C.3.a. “Main KEGG Reaction” Schema...... 106
VII.C.3.b. “Main KEGG RPAIR” Schema...... 107
VII.C.3.c. “Main KEGG RPAIR-to-Reaction” Schema...... 108
VII.C.3.d. “KEGG MAP” Schema...... 108
VII.C.3.e. KEGG afterword...... 108
VII.C.3.f. “MetaCyc” Schema...... 108
VII.D. MicroScope & CanOE data models...... 110
VII.D.1. MicroScope PkGDB...... 110
VII.D.1.a. Primary Data...... 110
VII.D.1.b. Core MicroScope data...... 110
VII.D.1.c. Predicted annotations...... 112
VII.D.2. CanOE database...... 112
VII.D.2.a. Primary data: Genes and gene data...... 113
VII.D.2.b. Primary data: Metabolic Network Reactions...... 114
VII.D.2.c. Primary data: Gene-reaction associations...... 115
VII.D.2.d. Primary data: gene and reaction graphs...... 115
VII.D.2.e. Secondary data: Metabolons...... 116
VII.D.2.f. Tertiary data: Gene families...... 118
VII.D.2.g. Tertiary data: Known, Potential and Inferred associations...... 119
VII.D.2.h. Tertiary data: Family-Reaction Association scores and ranks...... 119
VII.D.3. Proposed PkGDB & CANOEDB improvements...... 119
VII.E. Benchmarking...... 122
VII.E.1. Validating the MinPathLength prior...... 122
VII.E.1.a. Protocols...... 122
VII.E.1.b. Results...... 123
VII.E.2. Validating the entire approach...... 124
VII.E.2.a. Protocols...... 124
VII.E.2.b. Results...... 126
VII.E.3. Validating the integration over n only...... 134
VII.E.3.a. Protocol...... 134
VII.E.3.b. Results...... 135
VII.F. Web Interface overview...... 136

VII.F.1. The MaGe web interface...... 136
VII.F.2. The CanOE web interface...... 137
VII.F.2.a. CanOE main page...... 137
VII.F.2.b. Organism-centric view...... 138
VII.F.2.c. Metabolon Viewer...... 140
VII.F.2.d. Orphan reaction pages...... 141
VII.F.2.e. Search result page and object-specific pages...... 142
VII.G. CanOE uses...... 143
VII.H. Discussion and perspectives...... 143
VII.H.1. CanOE general improvement leads...... 143
VII.H.2. Improving the metabolic graph...... 144
VII.H.3. New Enzymatic Activity Survey...... 145
VII.H.4. Various CanOE improvements...... 146
VII.H.4.a. Annotation transfer by Context Similarity...... 146
VII.H.4.b. More functional dependence evidence...... 148
VII.H.4.c. Non-metabolic genes and functional subsystems...... 148
VII.H.4.d. Gene families, similarity matrices...... 149
VII.H.4.e. Aligning multiple graphs at once...... 150
VII.H.5. Metabolon & CanOE use cases...... 151
VII.H.6. Enzymatic promiscuity...... 152
VII.I. Strategy conclusion...... 155

VIII. Development: The BKACE project...... 157
VIII.A. Motivation and Objectives...... 157
VIII.B. Article...... 159
VIII.C. Methods and results...... 161
VIII.C.1. Multiple Sequence Alignment...... 162
VIII.C.2. Clustering on conserved genomic context...... 162
VIII.C.3. Clustering on sequence similarity...... 167
VIII.C.4. Phylogeny-based clustering...... 167
VIII.C.5. Clustering on key amino acids...... 168
VIII.C.6. Cluster ensembles...... 170
VIII.C.6.a. Protocols...... 170
VIII.C.6.b. Results...... 173
VIII.C.7. Genomic Context Gene Families...... 176
VIII.C.8. Inference of possible BKACE substrates...... 176
VIII.C.9. Experimental Protocol...... 178
VIII.C.10. Statistical representations and analyses...... 179
VIII.C.10.a. Repeatability analysis...... 179
VIII.C.10.b. Activity preprocessing...... 180
VIII.C.10.c. Substrate tree and clustering...... 182
VIII.C.10.d. Graphical representation of results...... 183
VIII.C.10.e. Statistical validation of clustering...... 185
VIII.D. Conclusion...... 189
VIII.E. Discussion and perspectives...... 189
VIII.E.1. From in vitro to new metabolic pathways...... 189
VIII.E.2. High-throughput Screening limitations...... 192

VIII.E.3. Genomic Context Clustering...... 192
VIII.E.4. Why a BKACE activity?...... 193
VIII.F. Conclusion...... 193

IX. Other works...... 195
IX.A. The Carnoulès Acid Mine Drainage project...... 195
IX.A.1. Introduction...... 195
IX.A.2. Work provided...... 195
IX.B. The Bradyrhizobium project...... 196
IX.B.1. Introduction...... 196
IX.B.2. Work provided...... 196

X. Global discussion & perspectives...... 199
X.A. Overview...... 199
X.B. Discussion...... 199
X.B.1. The scope of each project...... 199
X.B.2. CanOE & BKACE are complementary...... 200
X.B.3. Reaction dependency...... 201
X.B.4. Reaction definition...... 202
X.B.5. Consequences on knowledge and use of metabolism...... 204
X.B.6. CanOE & BKACE feed metabolic analyses...... 204
X.B.7. Microme...... 205
X.B.8. Integration with experimental validation...... 205

XI. Annexes...... 207
XI.A. Spectral clustering...... 207
XI.B. Factorial analysis...... 207
XI.C. Some basics in graph theory...... 208
XI.C.1. Graphs and graph elements...... 208
XI.C.2. Matrices of interest...... 209
XI.C.3. Paths, walks, breadth and depth-first searches...... 209
XI.C.4. Graph structures of interest...... 210

XII. Bibliography...... 211

List of Illustrations

Illustration III.1: General structure of a platform...... 35
Illustration IV.1: Prokaryote transcription and translation...... 39
Illustration IV.2: An example of the hierarchical graph structure linking Gene Ontology terms...... 42
Illustration IV.3: Homology, orthology, paralogy over three related species...... 45
Illustration IV.4: The three main types of functional annotation...... 49
Illustration IV.5: Using functional dependence to annotate a gene...... 60
Illustration IV.6: Gene clusters and syntenies...... 65
Illustration IV.7: Phylogenetic profile principle...... 67
Illustration V.1: Example reaction for pyruvate...... 74
Illustration V.2: Experimental determination of Vmax...... 77
Illustration V.3: The Enzyme Commission number principle...... 80
Illustration V.4: The allantoin degradation pathway in KEGG and MetaCyc...... 84
Illustration VI.1: EC numbers over time...... 94
Illustration VI.2: ADOMETA General procedure...... 99
Illustration VII.1: The CCCPart algorithm...... 104
Illustration VII.2: Summary Description of Metabolic Network Schemas...... 109
Illustration VII.3: The PkGDB Genomic Object Model...... 112
Illustration VII.4: Schema of the CanOE Genomic Object Data model...... 113
Illustration VII.5: Schema of the CanOE Metabolic Network Model...... 114
Illustration VII.6: Schema of the CanOE Primary association tables...... 115
Illustration VII.7: Schema of the CanOE Metabolon data model...... 117
Illustration VII.8: Schema of the CanOE Metabolon integration data...... 118
Illustration VII.9: A proposed PkGDB Genomic Object Model...... 121
Illustration VII.10: The MinPathLength prior...... 123
Illustration VII.11: Annotation recovery in CanOE benchmarking...... 126
Illustration VII.12: VerticalOrpheny1 element recovery details...... 127
Illustration VII.13: VerticalOrpheny1 findable element recovery details...... 128
Illustration VII.14: VerticalOrpheny1 gene-level benchmarking results...... 129
Illustration VII.15: VerticalOrpheny1 family-level benchmarking results for Coverage score...... 130
Illustration VII.16: VerticalOrpheny1 family-level benchmarking results...... 131
Illustration VII.17: Findable VerticalOrpheny1 family-level benchmarking Coverage results...... 132
Illustration VII.18: VerticalOrpheny1 findable family-level benchmarking results...... 133
Illustration VII.19: VerticalOrpheny2 benchmarking results...... 135
Illustration VII.20: CanOE main web page...... 138
Illustration VII.21: CanOE metabolon list...... 139
Illustration VII.22: CanOE metabolon viewer...... 140
Illustration VII.23: CanOE orphan reaction pages...... 141
Illustration VII.24: CanOE search result page...... 142
Illustration VII.25: Schema of NEAS objects and their relationships...... 146
Illustration VII.26: Annotation Transfer by Context Similarity...... 147
Illustration VII.27: Annotation correction using context-based methods...... 152
Illustration VIII.1: The lysine fermentation pathway...... 157
Illustration VIII.2: The generic beta-keto acid cleavage reaction...... 158
Illustration VIII.3: The BKACE clustering pipeline...... 161

Illustration VIII.4: The Genomic Context similarity measure...... 164
Illustration VIII.5: Edge weight and vertex degree distributions in the Genomic Context graph...... 165
Illustration VIII.6: Genomic Context graph with clusters...... 166
Illustration VIII.7: ASMC pipeline...... 169
Illustration VIII.8: Agreement clustering tree...... 173
Illustration VIII.9: MegaClustering Agreement Matrix...... 175
Illustration VIII.10: Lysate quantity repeatability...... 179
Illustration VIII.11: Activity repeatability...... 180
Illustration VIII.12: Substrate colour code chosen for BKACE graphical representations...... 184
Illustration VIII.13: Real-valued activities (with cut-off) grouped by MegaCluster...... 184
Illustration VIII.14: Binary activities organised by phylogenetic and substrate trees...... 185
Illustration VIII.15: Fisher tests on binary activities versus MegaClustering...... 187
Illustration VIII.16: Correspondence between activity-based clustering and MegaClustering...... 188
Illustration VIII.17: The BKACE ketoadipate degradation pathway...... 190
Illustration VIII.18: The BKACE carnitine degradation pathway...... 191


I. Preamble

I.A. Thanks

First and foremost, I would like to thank Florence d'ALCHE-BUC for having graciously accepted to stand as president of my thesis jury. I am also very grateful to Antoine DANCHIN and Marie-France SAGOT for taking on the roles of thesis manuscript reviewers. I thank Christos OUZOUNIS and Alain VIARI for participating in my jury alongside them. Finally, I wish to take my hat off to Claudine MEDIGUE for having taken me on for this thesis project, and to David VALLENET for having tutored me and put up with me over these past 3 years!

I also wish to present my fond thanks to all the members of the LABGeM team, past and present, with whom I have lived, worked and joked, and who made the difficult moments more fun to get through. Work-wise, I am especially indebted to Damien MORNICO and François LEFEVRE for helping me with Java- and MetaCyc-related topics, to Karine BASTARD for her invaluable interactions all along the BKACE project, to Zoé ROUY for keeping me in check, and to David ROCHE for having accepted to look after Fudge every time a holiday became necessary!

I would also like to thank some members from other teams at the Genoscope with whom I enjoyed working: Valérie BARBE, whose good moods and help with genome sequencing were essential to this manuscript; Véronique DE BERARDINIS, Marcel SALANOUBAT, Annette KREIMEYER, Alain PERRET and Christine PELLE, the “first floor” biochemists without whom none of the experimental work presented in this manuscript would have been possible.

I am grateful to Muriel TESSI, secretary to a whole third-floor-full of computational biologists and computer scientists - no easy task! - who was always there with a smile to help out for all my Genoscope-related administrative tasks. You are missed here!

Though it is not common to do so, I wish to thank the people of the administrative scholarship service, and of the École Doctorale “des Génomes aux Organismes”, of the University of Evry. Indeed, having seen through the life experiences of others how complicated signing up to a university for a thesis can be, I am truly grateful to these people, and especially to Florence HAMON, for having made the administrative aspects of a thesis so comparatively simple at the University of Evry. May other universities follow your example!

Obviously, I do wish to thank my parents for their support and counsel throughout my thesis, which they even helped proofread, no meagre task! Last but not least, I am very grateful to Tiffany SMITH (PhD!), my wonderful wife (wedded during our theses, indeed!), for her care, support and love during these three eventful years, without whom all of this would have been meaningless. Thank you!!!


I.B. Notes

Terms written in bold font are those of particular relevance to their host paragraph, and are designed as eye-catchers for quick reading. Terms written in italics, where italics are not otherwise required, indicate emphasis.

Articles inserted into this manuscript do not count towards page numbering; furthermore, their bibliographies are kept separate.

All figures concerning data were established in August 2011, unless otherwise indicated.

I.C. Summary

The projects I have participated in during my thesis all concern the functional annotation of prokaryote genomes with the help of bioinformatics tools. Indeed, the masses of data produced by high-throughput sequencing cannot be processed manually as was done in the past. Instead, the computational use of mathematical and statistical models has become an absolute necessity for storing and handling the data, for extracting useful knowledge from it, and for transferring that knowledge to newly-sequenced genomes.

The general objective of my thesis work was to develop new bioinformatics methods for the (semi-)automated functional annotation of prokaryote genomes. In this manuscript, I first present the current scientific context around this objective, from genome sequencing to metabolism. I then present the two main projects I have worked on, that were both dedicated to the functional annotation of prokaryote organisms.

The first of these involved the development of the CanOE strategy, which focuses on finding candidate genes for sequence-orphan enzymes, integrating results across all available prokaryote genomes, and making these propositions available to the scientific community. I developed a generic method for building metabolic networks using publicly available metabolic databases, as well as a programmatic wrapper for the CCCPart algorithm, which I use to locate genomic metabolons in prokaryote genomes. These metabolons serve as a basis for finding candidate genes for the sequence-orphan enzymes, and results are considered across over 1,000 genomes from the MicroScope bioinformatics platform in order to evaluate confidence scores. Benchmarking experiments and manual bioanalysis of several results revealed that the strategy was indeed useful in several ways. I finally worked on integrating this bioinformatics tool into the MicroScope platform in order to make it available to the scientific community.


In a second project, called the “BKACE project”, I computationally helped explore the functional space of a recently and partially characterised enzyme family (DUF849), which led to the creation of novel functional annotations for its member genes. By combining sequence- and context-based information, I divided the family into a priori iso-functional sub-families, and, using chemical substructure searches, proposed potential substrates for their enzymes. Our collaborators carried out a high-throughput biochemical assay of over 20 substrates across almost 200 DUF849 proteins; I then worked on analysing the data and graphically representing it. All in all, many novel enzymatic activities were demonstrated to be catalysed by the sub-families of DUF849, and we hope to be able to adapt our analysis strategy to other gene families.

I also participated in other projects, using my knowledge of multivariate statistical methods on metabolic data sets that my co-workers were analysing. These analyses were intended to be useful in generating high-level views of the metabolic capacities of the genomes under study, which were instrumental in perfecting their functional annotations. Several of these projects led to publications, of which two are included here as annexes.

Finally, I tie all these works together under the banner of my thesis subject, discussing what contributions to the field they may bring, in which ways they can be combined, and some perspectives of improvements that could be made to them.

I.D. Résumé

Nota bene : Ce résumé correspond approximativement à une traduction du chapitre d'introduction (Chapitre II) plutôt que du résumé anglais, puisque celui-ci se devait d'être plus conséquent, à la demande de l'école doctorale.

Tandis que les technologies de séquençage deviennent plus puissantes, rapides et moins onéreuses, la validation expérimentale de l'ensemble des fonctions des gènes nouvellement séquencés est passée d'extrêmement coûteuse à tout simplement inimaginable. Depuis plus de vingt ans, le domaine de la bioinformatique s'est développé, en proposant des outils capables de guider les expérimentations ou de les remplacer. La technique de base la plus largement utilisée repose sur l'identification de liens évolutifs entre gènes en analysant la similarité entre leurs séquences (au niveau des nucléotides ou des acides aminés), qui peuvent alors être utilisés pour transférer des annotations

fonctionnelles des uns vers les autres. Cette technique a été incroyablement utile pour propager les connaissances établies sur un petit nombre d'organismes modèles (c'est-à-dire fortement étudiés) vers des organismes nouvellement séquencés. Cependant, il a été montré que ce genre de transfert était source d'erreurs [1].

D'autres techniques ont été développées pour contourner ces inconvénients. L'utilisation d'informations contextuelles, plutôt que les séquences brutes, pour la prédiction de fonctions de gènes est l'un des axes principaux de recherche en ce sens. Le sujet de ma thèse s'inscrit directement dans cette problématique. Plus précisément, ma thèse était dédiée au développement d'outils ou de stratégies bioinformatiques exploitant de l'information de contextes génomiques et/ou métaboliques afin de générer des annotations fonctionnelles potentielles. Le projet principal de ma thèse avait un objectif plus ciblé encore.

En effet, environ 27% des activités enzymatiques définies par le « International Union of Biochemistry and Molecular Biology » (IUBMB) sont encore aujourd'hui des activités orphelines de séquence, c'est-à-dire que malgré une connaissance biochimique de leur existence, aucun gène codant ni protéine n'a été identifié comme acteur de leur catalyse. Ceci empêche nécessairement l'utilisation des méthodes à base de similarité de séquence pour leur trouver des gènes candidats. De plus, le prix de la réalisation de l'ensemble des tests biochimiques nécessaires à l'identification d'un gène codant parmi tous les gènes de fonction inconnue dans un seul organisme serait rédhibitoire. Il est donc impératif d'utiliser des méthodes bioinformatiques afin de réduire le champ des possibles en trouvant un nombre limité de gènes candidats pour ces activités enzymatiques orphelines.

Certains travaux ont déjà été entrepris afin de résoudre (au moins partiellement) le problème des activités orphelines [2,3]. Cependant, bien peu de méthodes capables de proposer des gènes candidats pour les activités orphelines ont été conçues. Le projet principal de ma thèse consistait à développer une stratégie capable de répondre à ce besoin. Le premier travail que j'ai réalisé est la stratégie CanOE (fishing Candidate genes for Orphan Enzymes), qui a été mise en place au sein de la plate-forme bioinformatique MicroScope [4]. J'ai développé une méthodologie de construction générique de réseaux métaboliques à partir de bases de données publiques, et adapté l'algorithme CCCPart [5] afin de localiser des métabolons génomiques dans les génomes prokaryotes de MicroScope. Ces métabolons servent de base pour la proposition de gènes candidats pour les activités enzymatiques orphelines de séquence. Ces résultats sont alors intégrés

sur plus de 1000 génomes afin d'établir des scores de confiance. Une procédure de validation informatique, ainsi que la bioanalyse manuelle des résultats, ont montré que CanOE était utile à plus d'un égard. Finalement, j'ai travaillé à l'intégration de cet outil parmi ceux de la plateforme MicroScope afin de rendre les résultats accessibles à la communauté scientifique.

Il existe un problème miroir à celui des activités orphelines de séquence : celui des protéines de fonction inconnue. De nombreux travaux ont été entrepris pour tenter d'associer des annotations fonctionnelles plus ou moins précises à ces protéines et à leurs gènes codants (c.à.d. des méthodes réalisant du transfert d'annotation sur base de similarité de séquence et des approches plus complexes comme l'utilisation experte de plate-formes d'annotation telles que MicroScope [6,4]). Un projet collaboratif appelé « projet BKACE » (pour « beta-keto acid cleavage activity ») a été initié suite à la découverte, au Genoscope, d'une nouvelle activité enzymatique pour quelques membres d'une famille de gènes (dite « DUF849 ») de fonction jusqu'alors inconnue. L'objectif de ce projet était l'exploration de l'espace fonctionnel de cette famille. Mon rôle dans ce projet a été de développer une méthodologie exploitant des informations contextuelles et de séquence afin de 1) découvrir des fonctions enzymatiques potentielles au sein de la famille et de 2) réduire le nombre de protéines à tester en séparant la famille en sous-familles a priori iso-fonctionnelles. Ces deux points ont servi à guider les expérimentations biochimiques menées par nos collaborateurs, qui ont testé au final plus de 20 substrats contre presque 200 protéines. J'ai réalisé l'analyse statistique de ces résultats et les ai préparés pour représentation graphique. Au final, plusieurs nouvelles activités enzymatiques ont pu être associées à des sous-familles de DUF849, et nous espérons pouvoir adapter la stratégie développée ici à d'autres familles enzymatiques.

J'ai également participé à d'autres projets, où j'ai appliqué mes connaissances en méthodes d'analyse statistique multivariée à l'étude de données métaboliques sur lesquelles travaillaient mes collègues. Ces analyses ont servi à générer des visions abstraites de haut niveau des capacités métaboliques des génomes étudiés, utiles à leur annotation fonctionnelle manuelle. Plusieurs de ces projets ont mené à des publications, et deux de celles-ci sont données en annexe.


Le premier projet visait particulièrement à trouver des gènes candidats pour des activités enzymatiques orphelines de séquence ; le second à aider à l'exploration de l'espace fonctionnel d'une famille de gènes à la fonction inconnue ; les projets secondaires à établir des connaissances utiles à l'annotation manuelle. En somme, les travaux effectués pendant cette thèse avaient pour objectif l'utilisation et la création de méthodologies bioinformatiques exploitant des informations contextuelles afin de générer des annotations fonctionnelles potentielles entre des gènes issus de génomes prokaryotes et des activités enzymatiques.

Dans ce manuscrit, je présente en premier lieu la source principale des données exploitées par la bioinformatique : le séquençage de génomes et leur analyse, ainsi que quelques projets de séquençage d'intérêt en cours. Je présente alors le domaine de l'annotation fonctionnelle de génomes prokaryotes, les plate-formes qui existent pour ce faire, et les stratégies couramment utilisées (Chapitre III). Je détaille ce qu'est effectivement l'annotation fonctionnelle, ses méthodes, et en propose une classification (Chapitre IV). Je présente alors le métabolisme des prokaryotes, puisque dans ces travaux, il s'agit surtout de trouver des fonctions métaboliques aux gènes (Chapitre V). Je conclus mon état de l'art par le détail du concept d'activité enzymatique orpheline de séquence, et les solutions existantes pour y remédier (Chapitre VI). Je détaille ensuite mes travaux (le projet CanOE dans le Chapitre VII, le projet BKACE dans le Chapitre VIII, et des projets secondaires dans le Chapitre IX). Je développe des perspectives globales dans le Chapitre X. Les annexes et la bibliographie figurent dans les Chapitres XI et XII.


I.E. List of abbreviations

ASMC: Active Site Modelling and Clustering, a clustering method based on identifying key amino acids in enzyme active sites
ATP: Adenosine Tri-Phosphate, a common energetic substrate heavily used in cell metabolism
BAC: Bacterial Artificial Chromosome
BKACE: Beta-keto acid cleaving enzyme/enzymatic activity
BLAST: Basic Local Alignment Search Tool, a bioinformatics tool for locating similar regions between protein or nucleic sequences
bp: base pair(s)
CA: Correspondence Analysis, a type of factorial analysis
CanOE: Fishing Candidate Genes for Orphan Enzymes
CDS/fCDS: Coding Sequence / fragment of a CDS
CoA: Coenzyme A, a common cofactor in enzymatic reactions
COG: Cluster of Orthologous Genes
ChEBI: Chemical Entities of Biological Interest database, hosted by the EBI (see below)
DNA: Deoxyribonucleic acid, generic name for the helicoidal, double-strand assembly of deoxyribonucleotides that forms a cell's genome
DDBJ: DNA Data Bank of Japan
EBI: European Bioinformatics Institute
EC number: Enzyme Commission number, describing an enzymatic activity
ENA: European Nucleotide Archive
EU: European Union
FA: Factorial Analysis
GO: Genomic Object, an object modelling an annotation of a stretch of DNA in the MicroScope platform
GOA: Genomic Object Annotation, an object I propose in my alternative MicroScope data model
GO term: Gene Ontology term
HGT: horizontal gene transfer
HMFA: Hierarchical Multiple Factorial Analysis
ID: generic abbreviation for “identifier”; SQL tables may have “ID” (uppercase) in their name, and SQL table columns may have “id” (lowercase) in theirs
IUBMB: International Union of Biochemistry and Molecular Biology
IUPAC: International Union of Pure and Applied Chemistry
KEGG: Kyoto Encyclopedia of Genes and Genomes, actually a more metabolism-centred bioinformatics resource
LABGeM: Laboratoire d'Analyses Bioinformatiques de Génomique et du Métabolisme, a Genoscope team
LABIS: Laboratoire d'Analyses Bioinformatiques des Séquences, a Genoscope team (now merged with LABGeM)
LCAB: Laboratoire de Clonage et de criblage des Activités de Bioconversion, a Genoscope team
LCOB: Laboratoire de Chimie Organique et Biocatalyse, a Genoscope team
LGBM: Laboratoire de Génomique et de Biochimie du Métabolisme, a Genoscope team
LDA: Linear Discriminant Analysis


MaGe: Magnifying Genomes, the graphical interface of the MicroScope platform
Mb/Gb: megabases / gigabases, units of quantities of deoxyribonucleotides in a sequence
MCA: Multiple Correspondence Analysis
MFA: Multiple Factorial Analysis
MPL: Minimum Path Length
MSA: Multiple Sequence Alignment
NCBI: National Center for Biotechnology Information (USA)
NGS: Next Generation Sequencing
PkGDB: Prokaryotic Genome Database, the data management system of the MicroScope platform
PLoS: Public Library of Science
PPI: protein-protein interaction (network, data...)
RNA: Ribonucleic acid, generic name for a single-strand assembly of ribonucleotides
RPAIR: Reaction Pair, a KEGG concept for modelling chemical group transfers in metabolic reactions
PHF: Pathway Hole Filler tool; PHF-GC refers to PHF - Genomic Context, a more recent version of PHF
PIR: Protein Information Resource
PubMed: published articles and citations for biomedical literature from MEDLINE, life science journals, and online books
SIB: Swiss Institute of Bioinformatics
SNAP: Similarity-Neighborhood Approach for finding functionally related genes
SQL: Structured Query Language
STRING: Search Tool for the Retrieval of Interacting Genes
SVM: Support Vector Machine, a family of statistical supervised learning methodologies
UK: United Kingdom
USA: United States of America
WIT: the “What Is There” integrated system for high-throughput genome sequence analysis and metabolic reconstruction


II. Introduction

As sequencing technology becomes more powerful, cheaper and more accessible, experimentally validating the functions of all newly sequenced genes has gradually gone from extremely resource-consuming to downright unimaginable. Over the past two decades, bioinformatics (also known as computational biology1) tools have been developed in order to help guide experiments, or to replace them altogether. The most widely used technique involves detecting evolutionary relationships between genes by exploiting similarity between their sequences, and on this basis, transferring the functional annotation of one to the other. Though incredibly useful for propagating knowledge from well-studied organisms to newly sequenced ones, tools based on this technique have been shown on one hand to be error-prone, and on the other hand to be approaching a usefulness threshold, as most of the widespread conserved genes are now thought to be known [1]. Other techniques have been developed in order to resolve or circumvent these drawbacks. The use of contextual information (rather than sequence information) is one of the pursued avenues of research in this direction. The subject of my thesis was designed to participate in this research. More specifically, it was dedicated to the development of bioinformatics tools or strategies using genomic and metabolic contextual information in order to assist with the functional annotation of prokaryote genomes. It did, however, have a more precise goal than general functional annotation.

As of today, roughly 27% of all enzymatic activities documented by the International Union of Biochemistry and Molecular Biology (IUBMB) are sequence-orphan activities (orphan enzymes for short), meaning that despite biochemical knowledge of the activity, no coding gene sequence nor protein sequence has ever been established for the catalysing enzyme. This obviously proscribes the use of all sequence-based transfer techniques mentioned above. Extensively carrying out wet-lab surveys for identifying, for each sequence-orphan activity, coding genes amongst all un-annotated genes would also be prohibitively resource-consuming. Bioinformatics methods are thus a prerequisite to help guide experiments and reduce the gene and function spaces to explore.

However, although some high-level work has been done to address the orphan enzyme problem [2,3], relatively few methods have been designed to propose candidate genes for orphan enzymes. The main focus of my thesis was to develop a methodology capable of doing precisely this. The first work presented in this manuscript is thus the CanOE strategy (finding Candidate genes for Orphan Enzymes), which we put together as a local solution to this problem.

1 These terms can currently be considered interchangeable, though efforts to define each separately are under way.

There exists a mirror problem to that of orphan enzymes: that of proteins of unknown function. Much more work has been done on trying to associate more or less detailed functions with these proteins, from the previously-mentioned simple homology-inferring methods to the use of annotation platform tools such as those in MicroScope [6,4]. Stemming from the initial discovery of a novel enzymatic activity associated with a protein family of previously unknown function, a collaborative project called the BKACE project (for beta-keto acid cleaving enzyme) was initiated at the Genoscope. This project had as objective the exploration of the functional space of this family of proteins of unknown function. I worked on this project, using contextual information to attempt to discover alternate protein functions, and to help guide the biochemical assays.

To summarise, the tasks in this thesis were carried out with the objective of creating new bioinformatics methodologies that exploit contextual information to generate potential functional associations between genes from prokaryote genomes and enzymatic activities. The first project specifically aimed to help find candidate genes for orphan enzymes; the second to help explore the functional space of a newly discovered enzyme family.

In this manuscript, I shall first go over the main source of data which bioinformatics tools process: genome sequencing and analysis, along with several noteworthy illustrative projects, before presenting the field of prokaryote genome annotation, its platforms and strategies (Chapter III). I shall detail what functional annotation is, and shall propose a new classification of annotation methodologies (Chapter IV). It will then be necessary to discuss prokaryote metabolism, as metabolic reactions are one of the types of function a gene may be annotated with (Chapter V). I shall conclude my state of the art by presenting the concept of orphan enzymes and the solutions that have been proposed to find parent genes for them (Chapter VI). I shall then be able to present and discuss the projects I have worked on during this thesis (CanOE in Chapter VII, BKACE in Chapter VIII, and other smaller projects in Chapter IX). Finally, a general discussion will attempt to place all of these works under the banner of my general thesis subject and will mention future developments and perspectives on them (Chapter X). Annexes and bibliography will be consigned to Chapters XI and XII, respectively.


III. Microbial Genomes and their Analysis

The availability of whole genome sequences for an increasing number of organisms has, amongst other things, spurred the development of the heterogeneous field of bioinformatics, along with our comprehension of Life itself. As is often the case in science, though, it has generated at least as many questions as it has answered, ensuring that bioinformatics remains an important emergent scientific domain.

In this chapter, I shall give an overview of how whole genome sequences came about, before giving a few perspectives in the shape of noteworthy sequencing projects that are currently underway or planned for the near future. I shall also point out a few sequence data resources that have become central to bioinformatics today. A word will be said about metagenomics, as one of my side projects involved metagenome analysis. Finally, I shall present the tools which are used to transform all this sequence data into usable, biologically-relevant information: annotation platforms.

III.A. Genome sequencing and resources

III.A.1. History and evolution of biological molecule sequencing

III.A.1.a. Birth of sequencing

As early as the late 19th century, much scientific attention was focused on determining the composition and structure of biological molecules. Proteins in particular were under close scrutiny, as they were hypothesised to be associated with the basic functions of life, or even to be the physical support of heredity.

Determining protein sequences first became possible after the development of an experimental protocol based on protein fragment end group identification. This approach was imagined in Frederick Sanger's laboratory in 1949 [7–10], and led to the discovery of the amino-acid sequence of insulin (a peptide hormone with two subunits totalling 51 amino acids in humans) after 10 years of effort. P. Edman's protocol, developed in 1950 and based on a stepwise protein degradation approach, and A. Maxam and W. Gilbert's protocol, developed in 1972, were also very popular [11]. In 1953, Watson and Crick discovered the sequence-like molecular structure of nucleic acids [12–14] and postulated the “sequence hypothesis”, which stated that a) the biological specificity of a nucleic acid segment is expressed solely by its sequence in nucleotide bases (adenine, cytosine, thymine and guanine), that b) nucleic acid sequences are a simple code representing the amino acid sequence of a particular protein, and finally c) that these proteins are the building blocks the organism requires to exist and live.

Evidence supporting this informational rather than biochemical view was uncovered in the following decades2. Sanger was influenced by Crick's ideas, and orientated his research towards sequencing nucleic acids directly rather than proteins. The year 1977 saw the birth of both the Sanger and Maxam-Gilbert DNA sequencing methods. The latter method depended on cutting radioactive DNA molecules (maximum length: 250 nucleotides) at a random base of a given type (not exactly one cut type per nucleotide type, but combinations thereof); electrophoresis of the fragments of different lengths obtained for cuts of each of the 4 types led to a profile that could be interpreted in order to reconstitute the original sequence.

The Sanger method was based on synthesising DNA in the presence of a mix of deoxyribonucleotides (dNTP) and dideoxyribonucleotides (ddNTP), the latter blocking the polymerase reaction. DNA synthesis primers, and later the ddNTPs themselves, were radioactively marked in order to render the synthesised DNA fragments (“reads”) easily detectable. Electrophoresis of the fragments obtained using separate mixtures of each of the 4 types of ddNTP generated interpretable profiles for reconstituting the final read sequence, which at the time was roughly 250 nucleotides long. The longer sequence length, along with the lower toxicity of the reactants used and its ease of use, led to the Sanger method finally being preferred over the Maxam-Gilbert method.
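To make the gel-reading principle concrete, here is a deliberately minimal sketch of how a four-lane chain-termination gel can be interpreted (the function and data are invented for illustration, not taken from any real base-calling software): each ddNTP reaction produces fragments ending at every position of one base type, so pooling the four lanes' fragment lengths and sorting them by migration order spells out the sequence.

```python
# Toy illustration of reading a sequence off a classic four-lane Sanger gel.
# Each lane lists the lengths of the chain-termination fragments produced
# with one ddNTP type; sorting the pooled lengths (i.e. by gel migration
# order, shortest fragments migrating furthest) reconstructs the sequence.

def read_sanger_gel(lanes):
    """lanes: dict mapping base -> list of fragment lengths (nucleotides)."""
    pooled = sorted((length, base)
                    for base, lengths in lanes.items()
                    for length in lengths)
    return "".join(base for _, base in pooled)

# Fragments that a gel of the sequence "GATTACA" would ideally produce:
lanes = {"G": [1], "A": [2, 5, 7], "T": [3, 4], "C": [6]}
print(read_sanger_gel(lanes))  # -> GATTACA
```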

Since then, the Sanger method has been refined and automated. Two main developments are of note. The first is the use of fluorochromes rather than radioactive elements to mark the primers or the ddNTPs; these can be excited with a laser and read by an automatic optical apparatus. This allowed results to be processed automatically, rather than relying on manual inspection of an electrophoresis gel, and led to the development of the first partially-automated DNA sequencing protocol at L. Hood's laboratory in 1986 [15], followed by the marketing of the first gel-based automated sequencer, the ABI 370 by Applied Biosystems, in 1987.

The second development appeared at the end of the 1990's, and involved the replacement of gel electrophoresis by capillary electrophoresis, allowing much faster sequencing and higher throughput. The data produced by these sequencing techniques remained humanly analysable, and protocols derived from them are still used today.

2 Such as proof of the collinearity between DNA and protein amino-acid mutation positions, in 1964.


Obviously, it was not possible to sequence an entire genome in one go using Sanger sequencing. Furthermore, multiple copies of the genomic DNA were required (for multiple fragments). In order to address these two problems, DNA clonal banks were created. To create a bank, the target genome was broken up into large, ~200 kb fragments, and each fragment was inserted into a vector, creating Bacterial Artificial Chromosomes (BACs). Each BAC was then cultivated in pure colonies. These BACs generated sufficient quantities of target genome DNA fragments to be broken up in turn and inserted into “multicopy” plasmids. These plasmids could carry up to 5 kb of the target genome sequence, and a single host cell could harbour one or two hundred copies of one. This provided enough genetic material for the final sequencing step by a Sanger method. The reads thus obtained then have to be assembled into a final sequence. This burden is placed on the assembly software, which relies on overlapping sequences, on established genetic maps of the target genome if available, and on BAC extremity sequencing. Finally, a finishing stage can be undertaken, consisting of targeted sequencing of previously “missed” or low-quality parts of the target genome.
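The reliance of assembly software on overlapping sequences can be illustrated with a toy greedy assembler, sketched below under strong simplifying assumptions (error-free reads, a single strand, no repeats; real assemblers use far more sophisticated graph-based algorithms, and all names here are invented):

```python
# Toy greedy assembly: repeatedly merge the two reads sharing the longest
# suffix/prefix overlap until no sufficiently long overlap remains.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:           # no overlaps left: reads form separate contigs
            break
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["TTACAGG", "GATTACA", "CAGGCAT"]))  # -> ['GATTACAGGCAT']
```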

III.A.1.b. The Human Genome Project

A major driving force behind the development of more advanced, cheaper and faster DNA sequencing was the Human Genome Project. This scientific objective was officially launched in 1990 with the cartography of the human genome. However, the sheer volume of base pairs to be sequenced (approximately 3×10⁹) was beyond the reach of previous techniques, requiring experimental and computational improvements. Two competing efforts took up the challenge. The International Human Genome Sequencing Consortium was a publicly-funded effort coordinating laboratories worldwide, which used a BAC genetic map approach to separate the human genome into over 20,000 more manageable “bites”. The total project cost roughly 3 billion US dollars. The Genoscope was part of this endeavour, and completed the sequence of human chromosome 14. Craig Venter, at the time president of Celera Genomics, led a privately-funded effort based on “shotgun sequencing”, claiming that a protocol using fewer BACs and relying on powerful sequence assembly software to piece together many more reads could outperform the older protocols. Over the years, the much slower public project published several high-quality draft versions of parts of the human genome; building on these drafts, Venter finally managed to publish his own, lower-quality version of the complete human genome first [16]. The public project finished mere weeks later [17,18]. The speed and relative cost (a tenth of that of the public effort) led computer-heavy approaches such as Venter's to become the most popular in subsequent developments, and they are still used today, even though they require a more costly finishing phase to ensure good final sequence quality.

III.A.1.c. Sequencing technologies today

Over the past ten years, the sequencing landscape has evolved into a fast-paced race towards data. Since 2004, new sequencing technologies that are not based on Sanger's protocol have appeared. For example, the necessity of creating libraries of bacterial clonal colonies has been reduced in some sequencing technologies by the use of PCR amplification within the sequencers. Three broad families of techniques have emerged: those based on DNA synthesis, those based on hybridisation, and those based on single-molecule sequencing, though the latter are much more recent and only starting to be commercialised.

DNA synthesis-based sequencing techniques: Pyrosequencing relies on the step-by-step synthesis of a complementary DNA strand using “washes” of dNTP of known types. When a dNTP is successfully incorporated into the new strand, the liberation of pyrophosphate triggers a chemical cascade that emits light, which can then be detected by a photosensor. Illumina sequencing is a stop-and-go synthesis-based sequencing technique. In each cycle, a mix of modified dNTPs is added, each type carrying a fluorochrome of a specific colour, and all blocking the polymerase reaction. Under laser stimulation, the last added dNTP can be “read” thanks to its colour. Then, a chemical reaction removes the fluorochrome and unblocks the new DNA strand, which is then ready for another cycle.
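The read-out principle shared by these synthesis-based techniques can be made concrete with a toy decoder for a pyrosequencing-style signal (invented data and function names; actual instruments apply extensive signal correction): since each wash carries a single known dNTP type, the light intensity measured at each wash directly gives the length of the homopolymer run incorporated.

```python
# Toy pyrosequencing decoder: dNTP washes follow a fixed cyclic order, and
# the light emitted at each wash is proportional to how many identical
# bases were incorporated in a row (zero light = no incorporation).
from itertools import cycle

def decode_flows(flow_order, light_signal):
    """flow_order: e.g. "TACG"; light_signal: one intensity per wash."""
    sequence = []
    for base, intensity in zip(cycle(flow_order), light_signal):
        sequence.append(base * round(intensity))
    return "".join(sequence)

# Washes T, A, C, G, T, A, C, G with measured intensities:
print(decode_flows("TACG", [1, 0, 2, 1, 0, 2, 0, 1]))  # -> TCCGAAG
```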

Hybridisation-based sequencing techniques: SOLiD sequencing (Sequencing by Oligo Ligation and Detection) is based on hybridising DNA fragments with pre-prepared oligonucleotides.

Currently available sequencers of each of the previous techniques are compared in the table below:


Technique        Make                 Sequencer                           Read length (b)   Output            Run time   Error rate   Cost per base
Sanger           Applied Biosystems   ABI 3730XL                          700               96 reads/run      2 hours    Medium       High
Pyrosequencing   Roche                454 Genome Sequencer FLX Titanium   500               500 Mb/run        8 hours    High         Medium
Illumina         Illumina             Solexa HiSeq 2000                   100               250 Gb/flowcell   10 days    Low          Low
SOLiD            Applied Biosystems   5500xl series                       60                20 Gb/day         -          Low          Low

b: bases (nucleotides)

All these techniques make up what is called “second generation sequencing” or “next generation sequencing” (NGS). NGS has rendered sequencing much more accessible to small laboratories, and new ways of using sequencing have emerged (such as RNA discovery, PCR amplicon analysis, expression level measurement (“RNA-seq”), metagenomics...) [19]. In 2009, the cost of sequencing a human genome was evaluated at $100,000, 3,000 times less than Venter's first supposed success, bringing within reach previously unimaginable projects such as the 1,000 human genomes project (see part III.A.2.a). The increase in accessibility, in the number of sequencing techniques and use cases, and in sheer data volume has led to the development of a plethora of ways of dealing with the data, with each lab basically coming up with specific software designed to meet its needs. This has obviously rendered an already complex domain increasingly confusing.

With sequencing technology available to all laboratories, it is thought that major centralised sequencing centres have outlived their prime mission and that they, too, should evolve. New missions could include comprehensive development of sequence analysis software, establishment of data formats to ease data and competence transfer, and dispensing training courses in good sequencing practices and sequence analysis. This will hopefully help to harmonise the sequencing landscape and avoid misconceptions or false ideas [20]. In this social era, discussion forums such as SEQanswers [http://seqanswers.com/] will probably be informational hubs for sequencing-related questions.

Perhaps the “$1,000 genome” is no longer such an unimaginable goal [21], though the cost of sequence data storage and analysis is, for its part, on the rise [22].

III.A.1.d. Sequencing technologies for the near future

Several new technologies (also called “third generation technologies”) have emerged and are soon to be commercially available.

Pacific Biosciences Single Molecule Real Time (PacBio SMRT) sequencing relies on the synthesis of the complementary strand of a single DNA molecule in a tiny well. Individual types of dNTPs carry fluorophores that are stimulated by laser as they are added to the DNA, after which they break off. It is thus possible to follow the addition of successive dNTPs in real time, and thus determine the sequence. The laser can be switched on or off, leading to strobe sequencing, particularly useful for reading longer and gapped sequences, as switching the laser off periodically reduces damage done to the DNA.

Ion Torrent sequencing also relies on a single complementary DNA strand being synthesised. The well is flushed in turn by different types of dNTP. When a dNTP is added to the strand, the release of hydrogen ions (H+) is detected, allowing determination of the sequence.

PacBio SMRT is capable of quickly generating large quantities of relatively long reads (3 Mb per hour, 1,000 bp reads). It is, however, currently plagued by low read quality. The Ion Torrent sequencer, for its part, is relatively small and portable, with a high turnover and read quality (<1% error rate), making it useful for quick desktop sequencing, but not for large genome sequencing projects (reads of 100-200 bp). As an example use case, in 2011, the speed and ease of Ion Torrent technology helped a team from the Lille Pasteur Institute to sequence, in less than 3 days, the genome of a pathogenic Escherichia coli strain that had caused widespread food poisoning across Europe.

Even more technologies are emerging, and the interested reader is encouraged to keep an eye out for these new developments, though not all may prove successful.

III.A.2. Ongoing projects and resources

III.A.2.a. Genome projects

As sequencing technology and sequence assembly become faster, more reliable and cheaper, genome sequencing projects will become increasingly ambitious. Below, I describe several projects that plan to revolutionise bioinformatics and its applications in different ways.


Genomic Encyclopaedia for Bacteria and Archaea: There are currently over 2,000 bacterial and archaeal genomes available in public sequence data banks. However, sequenced genomes show heavy phylogenetic bias towards laboratory-cultivable organisms (estimated to represent less than 10% of existing microbes thanks to metagenomic analyses, see part III.A.2.b) [23,24]. Though sequencing more genomes from diversified phyla is not expected to result in many more biological discoveries due to “diminishing returns”, it is argued that they are necessary for a) opening usual molecular biology study approaches to new organisms, b) obtaining a broader view of the tree of life, c) countering previously described sampling biases, d) assessing taxon diversity in this era of popular interest in biodiversity [24], e) improving gene/protein family detection (see part IV.C.2), f) winkling out novel biological discoveries, and more.

To this end, leading researchers have instigated projects like the Genomic Encyclopaedia for Bacteria and Archaea (GEBA) and the USA National Science Foundation project “Assembling the Tree of Life”. The GEBA project is an international effort started in 2008 that now covers roughly 100 recently-sequenced microbial genomes (estimated from a PubMed search of articles following the GEBA publication guidelines). The added value of increased and less biased phylogenetic coverage has already been shown [25].

The GEBA project is followed up by an interesting educational program created by the Joint Genome Institute (JGI) of the Department of Energy (DOE) of the USA, entitled “Interpret a Genome” [http://www.jgi.doe.gov/education/interpretagenome.html]. This program proposes that sequenced GEBA genomes be annotated in the context of biochemistry and bioinformatics undergraduate courses in universities across the world, and several genomes have already been annotated over past yearly sessions.

The 1,000 Genomes Project: The 1,000 Genomes Project [http://www.1000genomes.org/], launched in 2008, aims to completely sequence (with good coverage) the genomes of a large number of humans in order to establish the first comprehensive map of human genetic variability (sufficient for 95% coverage of all alleles3 present in over 1% of the global population). This will obviously be of great scientific value to multiple domains of medicine, and should lead to interesting discoveries in the field of population genomics (which studies the evolution of genetic variations over multiple generations). It should also be able to complement results obtained from

3 An allele is one of several variants of a gene (i.e. differing by one or several nucleotide polymorphisms).

human metagenomics projects by shedding further light on the relationships between a human host genome and its harboured microbial communities (see next section III.A.2.b). With the development of cheaper, faster sequencing methods, this project is becoming less of a challenge and more of a research milestone. Indeed, pilot sub-projects have already been completed, leading to interesting results validating the approach [26].

The 10,000 Genome Project: The Beijing Genomics Institute (China) has organised the 10,000 Genome Project, which is similar to the Genomic Encyclopaedia of Bacteria and Archaea project, but specifically targets microbes present in conventional environments, extreme environments, and human body samples, all geographically located in China. It will thus call on traditional genomics and metagenomics (see III.A.2.b). Archaea, bacteria, fungi, algae and viruses will be studied in the hope of uncovering biological particularities with industrial applications. A description of this project can be found at [http://www.genomics.cn/en/research.php?type=show&id=498].

Project to Annotate 1,000 Genomes: In 2003, the Fellowship for the Interpretation of Genomes (FIG) [www.thefig.info] initiated the Project to Annotate 1,000 Genomes, whose primary objective was to develop methodologies for accurate and high-throughput annotation of genomes from all domains of life, while still maintaining expert curation as an active part of the process in order to guarantee the necessary precision and error correction. Though this is not strictly speaking a “genome” project (being dedicated to annotation, not sequencing), it is still of relevance in the context of this thesis. It namely formed the heart of the collaborative effort out of which the SEED was born [www.theseed.org] (see section III.B.3).

As announced, science has advanced from sequencing and studying single genomes to studying multiple genomes at once, which I shall now briefly describe.

III.A.2.b. Metagenomics and metagenome projects

Metagenome sequencing is an even younger science than single-genome sequencing, and has opened several unexpected doors into the biology of micro-organism communities. Here, I shall briefly define what metagenome sequencing is, along with its generic protocol, before presenting

other projects of interest in this expanding domain that hold high hopes for future bioinformatics developments.

A major fraction of all living micro-organisms currently evades biologists' attempts to cultivate them in isolated, laboratory-controlled conditions (“clonal cultures”). Studying their inner workings, metabolism and interactions, let alone sequencing their genomes, is thus rendered far more complex than for easily cultivated species. However, early sequencing attempts on non-clonal samples showed that biodiversity had been grossly underestimated by lab cultures [27], as many environments, such as soil, digestive tracts and sea water, absolutely teem with life.

Modern molecular biology techniques and micro-organism genomics have now opened a door into the workings of these evasive organisms. It is possible to sequence and study the genomes of many different, non-isolated organisms from a single environmental sample, a process referred to as metagenomic analysis [28]. This allows the study of micro-organisms without the biases introduced by laboratory limits [27]. The scaling up of quantities of sequenced DNA, and the mixture of possibly quite similar species, pose new computational problems for analysis, and have spawned a fervent era of bioinformatics research [29].

Metagenomics studies all follow the same general procedure. The first step is, obviously, to retrieve a sample whose metagenome is to be sequenced. This usually involves environmental sampling, where experimental protocols must be chosen and followed scrupulously to ensure that there is enough genetic material to detect all of the species in the sample (the number and distribution of which should be estimated), and that it is not contaminated by unwanted organisms (e.g. eukaryotes, small multicellular organisms, human cells...). The metagenome can then be constructed by extracting the DNA from the sample, generating reads of this DNA, assembling the reads into contigs (stretches of high-confidence DNA sequence), and then into scaffolds (even longer sequences supposed to belong to a single organism). One can attempt to use codon usage or sequence homology (see IV.B.1) to assign scaffolds to already-known or related taxa (identified by Operational Taxonomic Units, the species-level distinctions used in microbiology), a process known as “binning”; a minimal sketch of this step is given below. Once ready, a metagenome can be annotated just as a set of single genomes might be, using automated methods and/or manual expertise (see III.B). Several analyses are then available: evaluating biodiversity by estimating the number of species present and their distribution; reconstructing the metabolism of each species and highlighting their ecological interactions; establishing the phylogeny

of the species present, etc.
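As an illustration of the binning step, below is a minimal, purely didactic Python sketch: each scaffold is assigned to the reference taxon with the nearest tetranucleotide frequency profile. The reference sequences, taxon names and Euclidean distance are simplifying assumptions on my part; real binning tools use richer compositional signatures and more sophisticated classifiers.

    # A minimal sketch of composition-based binning. Reference scaffolds of
    # known origin are assumed to be available; all names are hypothetical.
    from collections import Counter
    from itertools import product
    from math import sqrt

    KMERS = ["".join(p) for p in product("ACGT", repeat=4)]

    def kmer_profile(seq, k=4):
        """Normalised k-mer frequency vector of a DNA sequence."""
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(counts[km] for km in KMERS) or 1
        return [counts[km] / total for km in KMERS]

    def distance(p, q):
        """Euclidean distance between two k-mer profiles."""
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def bin_scaffold(scaffold, references):
        """Assign a scaffold to the taxon whose reference profile is nearest."""
        profile = kmer_profile(scaffold)
        return min(references, key=lambda taxon: distance(profile, references[taxon]))

    refs = {"taxon_A": kmer_profile("ATGCGC" * 200),
            "taxon_B": kmer_profile("ATTTTA" * 200)}
    print(bin_scaffold("ATGCGCATGCGC" * 100, refs))  # taxon_A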

Many projects exist that aim to take advantage of what metagenomics has to offer in order to extend our comprehension of the living world that surrounds us. I shall present a select few below.

The Human Microbiome Project: It has long been known that a human body does not live in isolation with respect to other organisms, but is actually host to a plethora of living entities that co-exist symbiotically4 with it; for example, microbial cells in the human gastrointestinal tract far outnumber the cells of the human body itself [30,31]. With the advent of metagenomics it has become possible to study these complex populations. The Human Microbiome Project [https://commonfund.nih.gov/hmp/] intends to elucidate the workings of the complex microbial populations that inhabit human skin, gut, and other mucosal cavities, hopefully discovering how they interact with their host, and how perturbations of human microbiomes may result in diseases [http://en.wikipedia.org/wiki/Human_microbiome_project]. The MetaHIT project is a European subsection of this effort that focuses on gut metagenomes, and in which the Genoscope is currently participating.

Ocean sequencing expeditions: Genome research centres worldwide have collaborated in two recent ventures into the little-explored biodiversity of the ocean. Both projects involved scientific teams manning an adapted sea vessel, sailing it around the world, sampling sea water at various depths, isolating the micro-organisms thus captured, and performing basic biochemical experiments while waiting for the samples to be shipped off to sequencing centres at the next port of call. The first of these is the Global Ocean Sampling Expedition [http://www.jcvi.org/cms/research/projects/gos/], formed by the J. Craig Venter Institute (USA); the second is the Tara Oceans Expedition [http://oceans.taraexpeditions.org/], a European initiative in which the Genoscope is participating, focusing primarily on marine plankton.

It is hoped that these endeavours will help complete our understanding of marine ecosystems, of biodiversity, and of evolution in general. They should also contribute to establishing a marine cartography of geological and physico-chemical properties, as well as to discovering biomarkers of marine pollution. They might also offer new opportunities for biotechnologically harnessing new agronomic, energetic, medical and cosmetic resources.

4 The relationship can thus be commensal, mutual, parasitic, or any intermediate of these.


All these projects would be of little use without a system for exchanging sequence information and data, hence the birth of biological sequence resources.

III.A.2.c. Biological sequence resources

Gene and protein sequences, along with their known functions, have been consigned to specialised, publicly-accessible databases and databanks, allowing researchers to retrieve at will any information they require for their work.

INSDs: One of the main sequence resources is the set of International Nucleotide Sequence Databases (INSDs), of which specific mirrors are the NCBI's GenBank (USA), the DDBJ (Japan), and the EBI's ENA (Europe). The INSDs are maintained collectively by the International Nucleotide Sequence Database Collaboration [http://www.insdc.org/], so as to contain precisely the same data for a given release. They collect all known publicly-available nucleotide sequences (DNA & RNA), to which annotations of several different types are attached (genes, coding sequences, functional roles, located domains, bibliographic references, exon/intron mapping...). The INSDs are designed to be primary resources for nucleotide sequence data.

RefSeq: The RefSeq collection [http://www.ncbi.nlm.nih.gov/RefSeq/] is maintained by the NCBI. It collects non-redundant DNA, RNA and protein sequences that are annotated in much more detail (generally of medical interest) than what is contained in the INSDs.

UniProt: UniProt [http://www.uniprot.org/] is maintained by the UniProt Consortium, formed in 2002 by the pooling of resources and expertise between the EBI, PIR (Protein Information Resource) and SIB (Swiss Institute of Bioinformatics). It is dedicated to maintaining protein sequence data and associated annotations of various sorts (functional roles, ontology annotations, sequence features, bibliographical references, links to other, more specialised databases...). It contains several modules with different objectives:

• The UniProt KnowledgeBase (UniProtKB) comprises two sections, previously known as Swiss-Prot (a database of protein sequences with manually-curated functional annotations) and TrEMBL (a database of protein sequences translated from gene sequences, with mainly computationally-predicted annotations).

• UniRef clusters protein sequences into families in order to speed up sequence similarity searches.


• UniParc is an archive of all past and present protein sequences.

The latest UniProt developments are presented in [32].

An introduction to the differences between these various resources can be found at [http://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP].

Though many bioinformatics resources also propose research-orientated sequence databanks and/or databases, the implicit requirement of referencing corresponding entries in the main resources presented here is generally respected. Actually associating functional data with genome or protein sequences is called functional annotation, and is discussed in chapter IV.

Many genome- and metagenome-sequencing projects exist today that promise riches of genomic sequence data. However, knowing genome sequences without knowing the functions they encode is of little use for understanding the dynamics of life. Genomes and genes must thus be functionally annotated to be of any use.

III.B. Annotation strategies and platforms

III.B.1. The different levels of genome annotation

Genome annotation is the process of assigning biological roles to nucleotide spans in the genome. Three levels of genome annotation exist, relating to three levels of complexity [33,34].

Syntactic annotation: this level deals with locating genome stretches of particular interest (typically Coding Sequences, “CDSs”, which are the portions of a gene's DNA that actually encode the amino acid sequence of a polypeptide or protein). This means identifying which stretches actually correspond to true coding genes (a process referred to as gene calling), which ones are dedicated to transcription regulation, etc.
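To make gene calling concrete, here is a toy Python sketch locating open reading frames (ORFs) on one strand; real gene callers use statistical models of coding potential rather than this naive start-to-stop scan, and the minimum length used here is an arbitrary assumption.

    # Naive single-strand ORF detection: scan each reading frame for an ATG
    # start and the next in-frame stop codon.
    STOPS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_codons=30):
        """Yield (start, end, frame) of ATG-initiated ORFs ending on a stop."""
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        yield (start, i + 3, frame)
                    start = None

    for orf in find_orfs("ATG" + "GCA" * 40 + "TAA"):
        print(orf)  # (0, 126, 0)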

Functional annotation: this level assigns functions to genes of interest detected during syntactic annotation, and includes (but is not limited to) metabolic activities. As said, functions can belong to biological processes described at varying levels of detail. Three levels can be

distinguished: a) molecular function, capturing the biochemical or structural role of the encoded protein or ribozyme; b) cellular function, which describes the gene product's role in a higher-level cellular process, such as a metabolic pathway for an enzyme-coding gene; c) phenotypic function, which describes the function of the gene product at the systemic level, taking into account organism-wide effects stemming from gene modifications. Free text or, preferably, functional ontologies are suited to describing the molecular functions of genes. Some relevant details of functional annotation are developed in part IV.A.

Relational annotation: this is the level that describes the relationships between all the objects and functions previously found. It is focused on building contextual representations of previous knowledge, such as inserting an enzymatic activity into a metabolic pathway (see part V.A), highlighting a gene's position in a gene expression network, or establishing evolutionary relationships across multiple gene families. Such annotation can be exploited to improve upon itself, as we shall see in chapter IV.

III.B.2. Annotation platforms: ultimate annotation tools

Before bioinformatics became a scientific field in its own right, all three levels of genome annotation had to be done sequentially by the manual and painstaking work of geneticists, biochemists and molecular biologists, exploiting available experimental evidence.

Today, under the pressure of the sequence data deluge, many resources and tools have been developed to ease and speed up genome annotation. The foremost of these developments is the possibility of computationally storing and representing a genome, its located genes and their associated functions. Experimental data has since been used to populate databases implementing various data models. Bioinformatics tools have been developed to predict annotations on the basis of sequence data. Gene calling, function prediction, and relational annotation can now all be carried out by automated programs to some extent. However, manual expertise continues to be required to evaluate, compare and combine the results of these predictive methods into clear, correct annotations.

Annotation platforms are collections of bioinformatics data, models, tools and interfaces made accessible to the scientific community in order to help bioanalysts create new or improved annotations by taking advantage of existing syntactic, functional and relational annotations [33].

33 / 229 Microbial Genomes and their Analysis

Each platform necessarily includes three components:

• Data management system: large quantities of biological data, such as nucleic acid or amino acid sequences, but also gene/protein annotations, organism phenotypes, cross-references to other bioinformatics resources, and pre-calculated results from heavy bioinformatics methods, obviously require proper management in order to be stored, traced, exchanged and accessed. Flat files, file formats, and databases are at the heart of this component.

• Production system: at the most basic computational level, a genome is little more than a file describing its sequence in letters. Actually parsing this raw data into the data management system and transforming it into something useful for scientists is the objective of the production system, which can be more or less automated. The system generally takes the form of a pipeline that streams data from the initial genome sequence to the syntactic annotation methods, upon the results of which the functional and relational annotation methods can build.

• Visualisation system and analysis tools: a genome that has been parsed by the production system into the data management system must be made accessible to the scientists who intend to use it. Programs or web interfaces are thus conceived to interrogate the data management system and analyse the results in a way that is transparent for users. Some interfaces allow bioanalyst users to enrich or improve the data contained within the underlying data management system. A toy sketch of how these three components fit together follows Illustration III.1 below.


Illustration III.1: General structure of a bioinformatics platform Publicly-accessible databases (such as the selection illustrated on the left) feed the data management system with primary data. The production pipeline sequentially transforms the data into higher-level, secondary data (from genome sequences to gene sequences to annotations to bio-processes). The visualisation and analysis tools allow platform users to interrogate the primary and secondary data (full red arrows), and possibly to generate novel data that is re-injected into the platform (empty red arrows).
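As a purely illustrative sketch of how these three components articulate, the following Python fragment chains a placeholder production pipeline over an in-memory stand-in for a data management system; all names are hypothetical and both annotation steps are deliberately trivial.

    # Toy platform skeleton: a data management system fed by a production
    # pipeline, whose contents a visualisation layer could then query.
    class DataManagementSystem:
        """Stands in for a relational database of genomes and annotations."""
        def __init__(self):
            self.genomes, self.annotations = {}, {}

    def syntactic_step(sequence):
        # Placeholder gene caller: pretend the whole sequence is one gene.
        return [{"start": 0, "end": len(sequence), "label": "gene_1"}]

    def functional_step(genes):
        # Placeholder annotator: attach a dummy function to each called gene.
        return {g["label"]: "hypothetical protein" for g in genes}

    def production_pipeline(dms, name, sequence):
        """Stream a raw sequence through syntactic then functional annotation."""
        dms.genomes[name] = sequence
        dms.annotations[name] = functional_step(syntactic_step(sequence))

    dms = DataManagementSystem()
    production_pipeline(dms, "my_genome", "ATGAAATGA")
    print(dms.annotations["my_genome"])  # {'gene_1': 'hypothetical protein'}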

Many available bioinformatics resources contain one, or combine several, of these components. The Generic Model Organism Database (GMOD) project [www.gmod.org] is a widely-used collection of freely-available software tools encompassing data management (via flat files or relational databases using Chado [35] or BioSQL [www.biosql.org]) and visualisation/analysis tools such as the ARTEMIS viewer developed at the Sanger Institute [36]. Listing all such resources is not pertinent to this thesis; however, I shall cite several full-blown prokaryote annotation platforms of note, to serve as landscape references.


III.B.3. Noteworthy prokaryote genome annotation platforms

A large number of such platforms exist and are currently used to annotate prokaryote genomes, each with its own specificities. Listed below is a selection of these:

• Integrated Microbial Genomes (IMG) is a comparative genomics annotation platform dedicated to annotating prokaryote genomes, hosted by the United States Department of Energy [37].

• ERGO is a professional (and thus fee-requiring) comparative genomic platform boasting high organism and function coverage [38].

• The SEED is an annotation platform maintained by the Fellowship for the Interpretation of Genomes, centred on a gene-family-wide annotation approach, with manually-defined functions grouped into biological subsystems [39].

• AGMIAL is an INRA-hosted system for bacterial genome annotation which strives to offer a modular approach to annotation [40].

• GenDB is a bioinformatics platform maintained by the Center for Biotechnology of Bielefeld University (Germany). It is dedicated to the automatic and expert annotation of prokaryote genomes, and is particularly rich in ontology sources (see part IV.A for details on functional ontologies).

• The former TIGR institute hosted a prokaryote genome expert annotation platform. When the institute merged with others, the platform was split into two child platforms maintained by different teams. The Institute for Genome Sciences (Maryland, USA) hosts one of these, called the “Annotation Engine”, to which researchers may submit prokaryote genomes in FASTA format [http://ae.igs.umaryland.edu/cgi/index.cgi]. The J. Craig Venter Institute provides the “Annotation Service”, as well as its consultable Comprehensive Microbial Resource (CMR) website. Both platforms use MANATEE, a freely-available web-based tool for the manual annotation and analysis of data produced by a production pipeline.

• MicroScope is a comparative genomics annotation platform dedicated to annotating prokaryote genomes and is hosted by the French National Sequencing Center, the Genoscope [4]. MicroScope's web-based visualisation system is called MaGe (Magnifying


Genomes) [6] and its data management system is called PkGDB (Prokaryote Genome DataBase). It contains a local installation of a collection of species-specific metabolic pathways called MicroCyc. Further details on the inner workings of MicroScope relevant to the works in this thesis are given in the following section, as well as in parts VII.D.1 and VII.F.1.

III.B.4. MicroScope Platform Overview

As for all bioinformatics platforms, there are three major components in the MicroScope platform:

• Data management system: the relational database system “PkGDB” (Prokaryote Genome DataBase) stores all of MicroScope's sequence and non-sequence data in object- or relation-specific database tables5. This includes data from public databases and pre-calculated results from the production system, including MicroCyc, a pathway genome database resource initially based on MetaCyc. This component is detailed in part VII.D.1.

• Production system: the script pipeline is handled by a Java-based automated controller, and regularly re-runs a large set of bioinformatics methods on MicroScope genomes.

• Visualisation system: the web interface “MaGe” (Magnifying Genomes), which allows users to access all PkGDB data via interactive graphical representations and data tables. This component is detailed in part VII.F.1.

One of the main particularities of the MicroScope platform at its time of conception was the inclusion of pre-calculated syntons (i.e. groups of genes conserved across two genomes). MicroScope also allows users to annotate genes in a relatively controlled way and conserves annotation history for further reference, ensuring that human expertise continues to improve the data. Users are encouraged to cite articles by their PubMed IDs when assigning an annotation, ensuring that experimental evidence may be shared across annotations.

Armed with these annotation resources, tools and interfaces, bioanalysts can look to discovering the functions of prokaryote genes. Actually establishing an accurate way of defining gene function is,

5 Terminology remark: the entire PkGDB “database system” contains multiple child “databases” that are each dedicated to specific data. One of these is called “pkgdb” itself, and contains the core MicroScope tables. I shall strive to make distinctions clear in this manuscript.

however, a non-trivial subject. In the following section, I shall thus briefly review what gene function is, and ways of defining it. Furthermore, a plethora of methods exist that aim to uncover the functions of genes from newly-sequenced genomes, adding to the complexity of functional annotation. I shall present some of the core concepts behind these methods, before proposing a classification of automatic annotation methods to help understand their diversity.


IV. Functional annotation

DNA sequences can be read and copied into RNA sequences by RNA polymerases and associated enzymes, a process known as transcription. Some of these RNAs, called mRNAs (messenger RNAs), are then translated into proteins by special molecules, the ribosomes, using the three-base-to-one-amino-acid codon correspondence known as the genetic code. Each protein is thus created as a sequence of amino acids that then folds (either spontaneously or aided by other biomolecules) into a complex three-dimensional structure. Some proteins remain as single structures, others are assembled into multi-protein complexes. A cell's genome thus encodes all the information necessary to continue its own life [41].

Illustration IV.1: Prokaryote transcription and translation In prokaryotes, transcription and translation can occur at the same time. 1) Electron Micrograph of a strand of DNA being transcribed by RNA polymerases, generating messenger RNA (mRNA) that itself is translated into polypeptides (not visible) by ribosomes or polyribosomes. 2) A schematic representation of this process. Reproduced following course material by Steven M. Carr.


These synthesised proteins can fulfil many roles in a cell: structural proteins make up the matrix which holds the cell together; transporters allow the passage of specific molecules through cell membranes; signalling proteins pass chemical messages across the cell; chaperones help other proteins fold into their correct 3D structures; enzymes carry out the chemical transformations of life; etc. An estimated 30% of any cell's dry matter is protein. Even the non-proteic constituents of a cell require preparation or transformation by proteins. Proteins and protein functions are thus central to cellular life.

Merriam-Webster's online dictionary [http://www.merriam-webster.com/dictionary/] describes function as “any of a group of related actions contributing to a larger action, especially the normal and specific contribution of a bodily part to the economy of a living organism”. Gene functions are all the various roles each gene product (be it RNA or protein) can fulfil in vivo, at diverse abstraction levels, from specific molecular role to cell-wide high-level processes [42].

Scientists from many disciplines (molecular biology, genetics, biochemistry, medicine... to name but a few) are likely to generate functional annotations for genes and proteins during the course of their work. Different types of annotation, different professional origins and personal preferences shape the way they build their annotations, leading to wide variations in terminology that hamper information storage and retrieval from computerised systems [43]. Several efforts have been undertaken in order to develop a widely-used, precise and computationally meaningful ontology (i.e. a controlled vocabulary) to solve this problem.

IV.A. Gene/Protein function ontologies

Several functional ontologies in relatively common use are based on one of the first efforts, presented in [44].

MultiFun: The first ontology based on the initial works of Riley et al. is MultiFun [45], an ontology designed primarily for use in Escherichia coli. Main gene product function categories include Metabolism, Information Transfer, Regulation, Transport, and Cell Structure. The current MultiFun classification is available from [http://genprotec.mbl.edu/files/MultiFun.html].

MIPS FunCat ontology: this ontology is described as a hierarchically-structured, organism-independent protein function classification scheme applicable to all domains of life, though it was designed specifically with yeast in mind [46]. In order to retain a minimal level of descriptive quality, FunCat specifies protein activities only down to a broad functional level (e.g. biosynthesis

of glutamine), rather than a specific role (such as glutamine synthetase). FunCat terms are thus more process- and pathway-related than activity-related.

JCVI ontologies: The J. Craig Venter Institute (JCVI) uses two types of ontologies. The JCVI roles (previously TIGR roles) are a two-level functional classification system of protein cellular functions based on Riley's works [47], assigned on the basis of proteins belonging to JCVI protein families (TIGRFAMs). The JCVI Genome Properties are organism-level hierarchical descriptions of taxonomic, phenotypic, and predicted traits [48].

UniProt: The protein sequence database uses a keyword-based controlled vocabulary for rapid annotation and retrieval of the functional data associated with its protein sequences. The keyword categories are Biological process, Cellular component, Coding sequence diversity, Developmental stage, Disease, Domain, Ligand, Molecular function, Post-translational modification and Technical term [49].

Gene Ontology: One of the most widely-used systems for describing gene functions (not limited to metabolic functions) can be viewed as a generalisation of all previous attempts. Gene Ontology (GO) terms [43] are organised hierarchically, each level providing additional functional detail. However, unlike FunCat, multiple inheritance and multiple types of term relationships are allowed, giving this scheme much more flexibility, at the cost of increased complexity. Belonging to a structured ontology, GO terms are much more informative and easier to handle computationally than, for example, Enzyme Commission (EC) numbers (for details on this strictly metabolic annotation terminology, see section V.A.5.b). They are organised into three main directed acyclic graphs: molecular function, biological process, and cellular component. The current trend in bioinformatics is to annotate genes with GO terms, and many automated prediction methods now output GO terms. An example set of GO terms is represented below using Cytoscape [www.cytoscape.org]:

41 / 229 Functional annotation

Illustration IV.2: An example of the hierarchical graph structure linking Gene Ontology terms. Each Gene Ontology term can inherit from one or several other terms. Here, protein biosynthesis is a descendant of the head term “biological_process” via two different paths.
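To illustrate how such a directed acyclic graph is handled computationally, here is a small Python sketch collecting all ancestors of a term; the term names and parent links are illustrative, loosely following Illustration IV.2, and are not an excerpt of the real ontology.

    # Upward traversal of a toy GO-like DAG with multiple inheritance.
    GO_PARENTS = {
        "protein biosynthesis": ["macromolecule biosynthesis", "cellular protein metabolism"],
        "macromolecule biosynthesis": ["biosynthesis"],
        "cellular protein metabolism": ["protein metabolism"],
        "biosynthesis": ["metabolism"],
        "protein metabolism": ["metabolism"],
        "metabolism": ["biological_process"],
        "biological_process": [],
    }

    def ancestors(term, dag=GO_PARENTS):
        """All terms reachable from `term` by following is-a links upwards."""
        found, stack = set(), [term]
        while stack:
            for parent in dag[stack.pop()]:
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return found

    # "protein biosynthesis" reaches "biological_process" via two distinct paths.
    print(sorted(ancestors("protein biosynthesis")))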

Ontologies like GO describe gene and protein functions at various levels of detail, but are not adapted to linking them to the actual chemical transformations carried out by enzymes, and hence are of limited use when studying an organism's metabolism in fine detail. Details on metabolism-specific alternatives are given in part V.A.5.

Having defined the terminology with which genes can be annotated, it remains to be seen how annotations are actually assigned to a given gene. Many annotation protocols actually require retracing the evolutionary history of several proteins. I shall thus detail how protein evolution can be of importance to functional annotation, before discussing a classification of the different types of annotation that exist.

IV.B. Key concepts in functional annotation

Many mechanisms are known today that can account for the accumulation of point mutations or larger mutations in genetic material over the course of many generations, and most are common knowledge. Analysing the sequences of several genes in the light of these mechanisms can shed light upon the evolutionary relationships between them, as described in the following section.


IV.B.1. Phylogenetic links

Homology, orthology, and paralogy: Historically, homology was used by comparative biologists to describe an evolutionary link between organs or traits of different species. Structural and functional similarities between organs allowed scientists to uncover these links. Two organs from two species thought to originate from an ancestral organ in a common ancestor of these species were said to be homologous (e.g. the flipper of the seal and the hand of the human). Similar organs or traits that evolved separately to fulfil the same function are referred to as analogous (e.g. wing of bird versus wing of bat).

These notions have been transferred to genetics. Two genes (or gene products) of different organisms are said to be homologous (i.e. they are homologs/homologues; for details concerning spelling, see the references towards the end of this section) when sufficient evidence has been found to assert that both genes have evolved from a common ancestral gene. Analogous genes have similar molecular functions but have evolved separately. The typical working hypothesis is that homologous genes encode proteins with identical or at least related functions. Put simply, two genes that have diverged over evolution by only a few base pairs lead to proteins differing in only a few amino acids, and thus to functions with very little difference. Obviously, the non-linear links between gene sequence, protein sequence, protein 3D structure and activity cloud these relationships somewhat.

Different types of gene homology can be observed, corresponding to different evolutionary paths which led to different selective pressures being applied to the genes. These different pressures limit the possibilities for the evolution of these genes. Thus, actually inferring the conservation of low- or high-level gene function from homologous genes requires identifying the specific type of homology involved.

A speciation event is a complex event (usually including many mutations accumulated over several generations) that leads to the creation of two new species from a single ancestral species. Due to the common ancestry, most of the genes of both species have common ancestral genes in the ancestral species. Genes that are separated from their common ancestral gene only by speciation events are known as orthologs (orthologues). Orthologous genes are assumed to have been subjected to the same selective pressures in both lineages, ensuring conservation of the genes' functions.

A gene duplication event leads to the creation of two copies of the same gene in the genome of a

given species, which can then evolve under lightened selective pressure. These genes are known as paralogs (paralogues), and this lesser selective pressure is not considered to conserve gene function, though paralog functions can be similar (e.g. enzymes with different substrate specificities, see part V.C.1).

Both gene duplications and speciation events can line the evolutionary paths of a pair of genes, making it difficult to classify them as orthologues or paralogues. Different cases arise depending on whether a given speciation took place before the duplication, or after. A duplication pre-dating a species split is considered “ancient” enough that the genes it gave rise to have diverged in function; these genes are called out-paralogs (a terminology inspired from phylogenetics). Genes born from a post-speciation duplication are considered recent enough not to have diverged functionally; they are called in-paralogs, and can be considered as orthologs (they are sometimes referred to as co-orthologs) [50]. When considering more than two genomes (and thus more than one speciation event), in- and out-paralogy becomes dependent on the chosen reference speciation [51]. In all cases, the number of generations since the duplication event should be considered, in order to evaluate how “recent” or “ancient” it really is. These concepts are illustrated in the figure on the following page [Illustration IV.3: Homology, orthology, paralogy over three related species], and schematised in the sketch below.
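The following schematic Python sketch restates this classification, assuming that the event type and relative age of a gene pair's last common ancestor are already known (in practice, inferring them requires reconciling gene and species trees); the ages used are arbitrary illustrative values.

    # Homology typing of a gene pair relative to a reference speciation.
    # Ages are expressed as time before present, so larger means older.
    def classify_pair(lca_event, lca_age, ref_speciation_age):
        """Classify a homologous gene pair given its LCA event and age."""
        if lca_event == "speciation":
            return "orthologs"
        # A duplication older than the reference speciation has had time to
        # diverge in function; a younger duplication has not.
        if lca_age > ref_speciation_age:
            return "out-paralogs"
        return "in-paralogs (co-orthologs)"

    # Echoing Illustration IV.3: a duplication may postdate one speciation
    # (S1, age 3) while predating a more recent one (S2, age 1).
    print(classify_pair("duplication", lca_age=2, ref_speciation_age=3))  # in-paralogs
    print(classify_pair("duplication", lca_age=2, ref_speciation_age=1))  # out-paralogs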

Further complicating evolutionary history is the possibility of Horizontal Gene Transfer (HGT). This phenomenon occurs when a stretch of DNA (potentially containing one or several genes) is transferred from an organism of one species to an organism of another species. HGT mostly concerns prokaryotes and single-celled eukaryotes, though some evidence of HGT involving higher eukaryotes has been reported, e.g. [52]. Homologs suspected or proven to have been acquired by HGT during evolution are referred to as xenologs. Xenologous gene displacement occurs when a xenolog gene is acquired in a genome and the original gene is lost, effectively being replaced.


Illustration IV.3: Homology, orthology, paralogy over three related species Homology: The genes vA, xA, xA', yA, yA' are all homologs as their evolutionary history can be traced back to a common ancestor gene (uA). In- and out-paralogy: Gene pairs (xA, xA'), (yA, yA') are reciprocally paralogs (as well as the ancestral (wA, wA')). Actually specifying in/out-paralogy depends on the chosen speciation event. For instance, genes xA and yA' are in-paralogs with respect to speciation S1 (duplication post-speciation), but are out- paralogs with respect to speciation S2 (duplication pre-speciation). Group of orthologs: In the case illustrated here, it is not possible to partition the genes into groups of orthologs and in-paralogs with respect to the last speciation event (S2). Indeed, vA is orthologous to all other genes, but they do not form a group because every other pair is out-paralogous with respect to speciation S2.

Many of the terms defined here were first introduced into genetics in the seminal paper [54]. They have further been discussed, along with their spelling (e.g. homologue versus homolog), in many papers and comments, the most amusing of which may be the series [55–57]. [52] also provides (amongst other things) an interesting review of the question.


IV.B.2. Conserved sequences, domains, and gene/protein families

One type of evidence of homology between genes or gene products is that of conserved sequence. Indeed, given that nucleotide and amino acid sequences evolve over time, similar sequences can be thought to belong to genes or gene products of common origin. Sequence similarity is often analysed in order to detect homologs across two or more organisms, and is the historical approach at the basis of bioinformatics today (more details in section IV.C.2). It is, however, prone to over-interpretation, leading to false homology detections [53]. Others have also pointed out that purely sequence-based approaches are reaching an informational threshold [1].

A slightly more function-orientated homology signal is that of conserved protein domains. The modular nature of protein structures became experimentally apparent in the early 1970s: X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy revealed that proteins contain structurally distinct parts, later shown to be units of protein folding, function, and evolution. As the 3D elucidation of protein structure is a relatively long and complicated experiment to perform, bioinformatics was called in to help. The alignment of multiple sequences containing similar domains showed that the amino acid sequence of the domains was generally much more conserved than that of the rest of the proteins. The definition of protein domains has since grown to include stretches of conserved sequence (even non-contiguous stretches) that may or may not have a basis in particular 3D structures. Today, the crossover between manually-curated, experimental 3D structure-defined domains and automated sequence alignment-defined domains is such that some care must be taken in distinguishing their origin(s).
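As a toy illustration of this observation, the Python sketch below scores each column of a small, invented multiple sequence alignment by the fraction of sequences sharing the majority residue; runs of high-scoring columns are the kind of signal from which domains can be suggested by sequence alone.

    # Per-column conservation of a gapped multiple sequence alignment.
    def column_conservation(alignment):
        """Fraction of sequences sharing the majority residue, per column."""
        scores = []
        for column in zip(*alignment):
            residues = [r for r in column if r != "-"]
            top = max(residues.count(r) for r in set(residues)) if residues else 0
            scores.append(top / len(alignment))
        return scores

    msa = ["MKVLAGHE--AT",
           "MKVLSGHEQWAT",
           "MKVLAGHEKYAT"]
    print([round(s, 2) for s in column_conservation(msa)])
    # The flanking stretches are highly conserved; the gapped middle is not.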

Conserved domains, given their relationship to gene product function, can be a strong marker of function-conserving evolutionary relationships, hence their use in determining homologous gene/protein families.

A loose way of finding groups of genes/proteins with homology relationships is establishing gene/protein families. Gene families can be defined in two ways. The first is based on identifying domains within the sequences or structures; proteins with similar domains or domain architectures can be expected to be related functionally, and form domain-based protein/gene families. The other requires establishing evolutionary relationships between proteins (or their coding genes) in order to infer shared function. Once these links have been untangled, it is often of interest to

regroup genes into clusters sharing a given history and hopefully being relatively iso-functional (i.e. sharing a same function). These clusters are called (sequence-based) protein/gene families and are generally based on orthologs, plus added in-paralogs (“recent” paralogs) that are considered not to have had time to diverge significantly, while taking care to exclude out-paralogs (“ancient” paralogs) that have probably diverged.

If domains are units of folding, evolution, and function, then the entire 3D structure of a protein can also be relevant to these notions. As such, comparing 3D structures can shed light on otherwise-invisible evolutionary relationships (or their absence). It is thus possible to generate protein families based on 3D structure conservation, rather than on sequence feature conservation.

Families are not the only way of exploiting evolutionary clues. A less direct way involves the use of contextual information across many different genes in order to capture higher-level evolutionary pressures, as presented below.

IV.B.3. Functional dependence

Genes whose products participate in a same biological process are likely to be subjected to common evolutionary, regulatory and physical constraints. Indeed, if a complete process contributes to the fitness of an organism (an evaluation of how adapted it is to its environment and how likely it is to produce offspring), then the loss of one or more parts of the process is as detrimental as the loss of the whole process. Also, the entire process can evolve, rather than just parts of it. Such an evolutionary pressure can be identified using several sources, such as gene clustering, syntenies, phylogenetic profiles and gene co-expression (definitions and detection methods will be given in part IV.D.2 and later). Functional dependency is also referred to as “guilt-by-association” [54].

All these methods rely on identifying clues from information between or across genes (i.e. contextual information), rather than within the target gene to be annotated. They are thus referred to as context-based methods, as opposed to sequence similarity-based methods, though obviously knowledge of the sequences of all concerned genes is, at some step or other, required. For this reason, I will always consider that the full genome sequence is available when describing these methods in the rest of this manuscript.


IV.B.4. The three types of gene functional annotation

We have seen that, by taking evolutionary history into account, the functions of some genes can provide clues to the functions of others. I propose that annotation methods be classified into three main types: experimental validation, homology-based transfer, and functional dependency-based inference. Experimental validations can demonstrate the function of a gene with high confidence, and are precious starting points for methods of the two other types. As suggested in [55], the latter two are bioinformatics methods: one is based on the transfer of existing annotations between genes of different organisms on the basis of detected homology; the other infers possible annotations from the annotations of other genes of the same organism on the basis of detected functional dependency.

An annotation is experimentally validated when a scientific wet-lab experiment (be it in vivo or in vitro) demonstrates the function of the gene or genes in the studied organism. Many different protocols exist that can help build these demonstrations, such as gene invalidation (knockout), mutagenesis, atom tracing, expression cloning... This kind of result is usually published in an article which then serves as a reference for the annotation. For example, the MicroScope platform specifically allows bioanalysts to flag genes with citations of the papers that form the basis of their annotation.

The three types of annotation are illustrated on the following page [Illustration IV.4: The three main types of functional annotation].


Illustration IV.4: The three main types of functional annotation Experimental validation (1) allows the high-confidence annotation of a gene with a given function. Lower confidence annotations can be generated by homology-based transfer (2) and functional dependency- based inference (3), which both rely on the comparison of gene or protein features (sequence, physico- chemical properties, genomic context...). Said features concern genes in the same or in different organisms for the former, and genes of the same organism for the latter.

As already said, the experimental association of a gene/protein sequence with a specific activity is tedious and costly; hence the utility of a) transferring previous annotations to new genes on the basis of detected homology, or b) using them for context-based inference. I shall now present these two approaches.


IV.C. Annotation transfer on the basis of homology

IV.C.1. Protocol

As described, the functions of homologous genes are thought to be near-identical (for orthologs), similar (for recent paralogs), or vaguely similar (for ancient paralogs). Gene homology can be inferred upon analysis of gene data.

The field of bioinformatics was born with the development of a method for establishing gene homology on the basis of gene or protein sequence. Indeed, genes evolve through the accumulation of mutations in their sequence, so comparing gene sequences should allow one to reconstruct gene evolutionary relationships as a phylogenetic tree. Identifying paralog cases requires comparing multiple sequences from various species. To avoid problems posed by the redundancy of the genetic code, one can also directly use amino acid sequences, which additionally are more closely related to the protein's function.

Once homology has been established between a pair of genes, this new knowledge can provide support for the functional annotation of one gene, using previous information about the function of the other. This is referred to as “annotation transfer”.

With the exponentially increasing quantity of sequence data available to the scientific community, traditional experimental annotation of gene function for all genomes has become infeasible, spurring on the development of bioinformatics methods for gene annotation. Many sequence-based approaches have been devised in order to establish homology relationships to support annotation transfer. Those based on phylogenetic tree construction and analysis are generally regarded as too computationally intensive, and often require manual expertise, making them impossible to use on a large scale. As already described, using simplifications or proxies has thus become standard. Some methods are based on establishing gene/protein families by comparing the sequences of all known genes/proteins and clustering the most similar; others look for conserved amino-acid domains that can be linked to elements of protein function; some use a more information-theoretical approach and count amino-acid motifs... Even predicting protein physico-chemical properties, sub-cellular localisation, or secondary or tertiary structures from sequence gives insight into how two gene (products) can be related. In all cases, it boils down to analysing the similarity between sets of sequence-derived features describing two or more genes [56]. In essence, “sequence-similarity based methods” are all homology-determining methods.


IV.C.2. Homology-determining methods

In bioinformatics, the historically-used evidence of homology is sequence (nucleotide or protein) similarity (indeed, “homology” and “similarity” are often used synonymously, which is a malapropism). Similarity can be established using sequence alignment programs such as BLAST [57], FASTA [58] or the Smith-Waterman algorithm [59]. Slightly stronger than a good alignment hit is the notion of the bidirectional best hit (BBH)6, wherein each gene achieves its highest alignment score with the other, considering all other alignments with genes from both genomes. More recent developments go beyond simple sequence alignment, including PSI-BLAST [60] and HMM-based methods, e.g. [61–63].
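A minimal Python sketch of BBH detection is given below, assuming all-versus-all BLAST results are available in both directions as (query, subject, bit-score) triples; gene names and scores are invented for illustration.

    # Bidirectional best hits between two genomes from raw hit triples.
    def best_hits(hits):
        """Map each query to its highest-scoring subject."""
        best = {}
        for query, subject, score in hits:
            if query not in best or score > best[query][1]:
                best[query] = (subject, score)
        return {q: s for q, (s, _) in best.items()}

    def bidirectional_best_hits(a_vs_b, b_vs_a):
        """Pairs (a, b) where a's best hit is b and b's best hit is a."""
        best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
        return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]

    a_vs_b = [("geneA1", "geneB1", 250.0), ("geneA1", "geneB2", 90.0),
              ("geneA2", "geneB2", 180.0)]
    b_vs_a = [("geneB1", "geneA1", 245.0), ("geneB2", "geneA1", 95.0)]
    print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('geneA1', 'geneB1')]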

In practice, a minimum BLAST-established sequence identity of 35-40% over a good match length is required between two prokaryote protein sequences for homology to be inferred by this method, though in the presence of many other clues supporting identical function, this number has been known to go as low as 10-15%. The application of sequence similarity-based annotation transfer to whole genomes has helped the annotation of little-studied organisms, e.g. in [64,65], as well as speeding up the annotation of all newly sequenced organisms.
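This rule of thumb reduces to a trivial filter, sketched below; the 40% identity and 80% alignment-coverage defaults are illustrative choices, not canonical values.

    # Decide whether an alignment supports annotation transfer.
    def transferable(identity_pct, aln_len, query_len,
                     min_identity=40.0, min_coverage=0.8):
        """Apply minimum identity and query-coverage thresholds to a hit."""
        return identity_pct >= min_identity and aln_len / query_len >= min_coverage

    print(transferable(52.0, 310, 350))  # True
    print(transferable(22.0, 310, 350))  # False: identity below threshold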

Functions can also be transferred between members of hopefully iso-functional families, or to members newly added to a family. David VALLENET and I distinguish three approaches to building families, a classification detailed in this chapter: de novo sequence clustering, semi-supervised sequence similarity-based aggregating, and semi-supervised domain-based aggregating.

Here, rather than present all available procedures that can be used to build gene/protein families and transfer annotations on that basis (which is beyond the scope of this thesis), I shall present a select few that are linked to the works in this thesis, either by their intervention in the CanOE strategy (chapter VII), the BKACE project (chapter VIII), or simply because of their use in the MicroScope platform.

IV.C.2.a. De novo sequence clustering

These methods rely on sequence similarity alone to cluster genes or proteins together into new families. The addition of novel sequences would ideally require recalculating all families, though sub-optimal incremental addition procedures can be used.

6 a.k.a. reciprocal best hits, or variants thereof.


InParaNOId: The InParaNOId algorithm [50] was designed to detect ortholog and co-ortholog proteins between two eukaryote genomes, more specifically with the then nearly-completed human genome in view. The algorithm uses all-versus-all protein BLAST results (keeping only alignments whose bit-score and overlap are over a minimum threshold) to detect mutual best hits that are considered as seed ortholog pairs. One out-group genome can be included in order to refine this detection. Seed ortholog pairs are then augmented with co-orthologs (in-paralogs) by including same-genome proteins whose similarity to one of the proteins of the seed pair is at least as great as the similarity between the pair. These similarity scores are used to derive “confidence scores” for co-ortholog addition. Overlapping protein assignments to seed pairs are resolved using several tailored rules, leading to the definition of high-confidence gene/protein families. Functions can then be reliably transferred across members of a family.
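The core of the augmentation step can be sketched as follows; this is a strong simplification of the published algorithm (overlap resolution, confidence scores and the out-group test are omitted), and all protein names and scores are hypothetical.

    # Grow a family from a seed ortholog pair by recruiting same-genome
    # proteins at least as similar to their seed as the two seeds are to
    # each other. Scores are symmetric similarities (e.g. BLAST bit-scores).
    def augment_seed_pair(seed_a, seed_b, seed_score, sims_a, sims_b):
        """sims_a maps genome-A proteins to their similarity with seed_a;
        sims_b does the same for genome B and seed_b."""
        family = {seed_a, seed_b}
        family.update(p for p, s in sims_a.items() if s >= seed_score)
        family.update(p for p, s in sims_b.items() if s >= seed_score)
        return family

    sims_a = {"a2": 300.0, "a3": 120.0}  # a2 is closer to a1 than a1 is to b1
    print(augment_seed_pair("a1", "b1", 250.0, sims_a, {}))
    # {'a1', 'b1', 'a2'} (set order may vary)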

InParaNOId was, to my knowledge, the first algorithm to specifically distinguish orthologs from the various types of paralogs in order to improve family construction.

Tribe-MCL: This algorithm uses all-versus-all protein BLAST results from multiple organisms [66]. Unlike InParaNOId, any number of genomes can be used. The -log10 of the e-values are parsed into a similarity matrix, which is symmetrised. This matrix is then passed to the Markov Clustering algorithm (MCL) developed by Stijn van Dongen [67], which quickly and efficiently partitions7 the proteins into clusters, i.e. gene/protein families.
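Below is a compact Python (numpy) sketch of the expansion/inflation iteration at the heart of MCL, applied to a small symmetrised similarity matrix; the pruning and convergence handling of the real mcl implementation are simplified away.

    # Minimal Markov Clustering on a symmetric, non-negative similarity matrix.
    import numpy as np

    def mcl(sim, inflation=2.0, iterations=100):
        m = sim / sim.sum(axis=0)          # column-normalise: random-walk matrix
        for _ in range(iterations):
            m = m @ m                      # expansion: take two walk steps
            m = m ** inflation             # inflation: boost strong transitions
            m = m / m.sum(axis=0)          # re-normalise columns
        clusters = set()
        for row in range(m.shape[0]):      # non-empty rows define the clusters
            members = tuple(int(i) for i in np.nonzero(m[row] > 1e-6)[0])
            if members:
                clusters.add(members)
        return clusters

    # Two obvious groups, {0,1,2} and {3,4}, joined only by one weak link.
    sim = np.array([[1, 1, 1, 0, 0],
                    [1, 1, 1, 0, 0],
                    [1, 1, 1, 0.1, 0],
                    [0, 0, 0.1, 1, 1],
                    [0, 0, 0, 1, 1]], dtype=float)
    print(mcl(sim))  # {(0, 1, 2), (3, 4)} (set order may vary)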

Tribe-MCL is a precursor to OrthoMCL, described below.

OrthoMCL: The OrthoMCL algorithm [68] also uses BLAST results between proteins of different genomes and the MCL algorithm. As with Tribe-MCL, any number of genomes can be used. Ortholog/paralog relationships are inferred between proteins from a pair of genomes, similar in idea to InParaNOId, though with many more out-group genomes. A normalisation procedure is included to account approximately for phylogenetic proximity between genome pairs. The resulting data is exploited to build a similarity graph that is then passed to the MCL algorithm. Additional practical details for the (Ortho)MCL algorithm are given in section VII.C.2, as a variant of it is used

7 Technically, it is possible for the algorithm to return clusters that overlap, thus not leading to a strict partition; however this is extremely rare in practice, and the algorithm wrapper removes any detected overlap [van Dongen, http://micans.org/mcl/man/mclfaq.html#olapintro].

as part of the CanOE strategy. OrthoMCL or a variant of it is also used in the IMG platform to define IMG ortholog groups [69].

CORRIE: the Correspondence Indicator Estimation procedure is a bioinformatics method which defines functional classes (in this case, EC numbers, see section V.A.5.b) to which it assigns proteins with known annotations, effectively generating a priori iso-functional protein families. For new, un-annotated proteins (or proteins to re-annotate), correspondence indicators (CIs), measuring the strength of the association of each protein with each functional class, are calculated based on the sequence similarity between the protein and the protein members of the functional class. Finally, different Bayesian approaches are proposed to select, using the protein's CIs, which functional class it is most likely to belong to. This procedure is described in [70,71].
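The idea can be loosely sketched as follows, using the best similarity to any class member as a stand-in correspondence indicator, and a simple argmax in place of the Bayesian decision rules of [70,71]; all names and values are hypothetical.

    # Assign an unannotated protein to its best-supported functional class.
    def correspondence_indicator(protein_sims, class_members):
        """Strength of association between a protein and one class."""
        return max((protein_sims.get(m, 0.0) for m in class_members), default=0.0)

    def assign_class(protein_sims, classes):
        """Pick the functional class (e.g. an EC number) with the highest CI."""
        cis = {ec: correspondence_indicator(protein_sims, members)
               for ec, members in classes.items()}
        return max(cis, key=cis.get), cis

    classes = {"EC 1.1.1.1": ["P1", "P2"], "EC 2.7.1.1": ["P3"]}
    sims = {"P2": 310.0, "P3": 85.0}  # similarities of the query to known proteins
    print(assign_class(sims, classes))
    # ('EC 1.1.1.1', {'EC 1.1.1.1': 310.0, 'EC 2.7.1.1': 85.0})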

IV.C.2.b. Semi-supervised sequence similarity-based aggregating

These methods rely on manual curation at some step or other in order to validate the gene/protein families that are established by sequence similarity, multiple sequence alignments and other data. Adding novel sequences to a family is relatively easy.

COG: The Clusters of Orthologous Groups of proteins (COG) database [72] is an early and still widely-used effort to classify proteins from bacterial, archaeal and eukaryote genomes into hopefully iso-functional groups of common ancestry. Using all-versus-all protein BLAST results (masking certain sequence features considered of no interest for ortholog identification), it identifies paralogs and “collapses” them into a single representative sequence, before identifying triangles of pairwise best hits between proteins of different genomes. Such triangles are iteratively merged along shared sides into COGs. COGs are then manually analysed to reveal spurious classifications due to multi-domain proteins or other evolutionary phenomena. The final COGs are thus generated by a semi-automatic procedure and benefit from an expert curation step. No cut-offs are applied to the BLAST results, thus allowing COGs to be based on both high- and low-similarity triplets, as long as they are consistent amongst themselves. This allows COGs to capture both fast- and slow-evolving families. Adding novel sequences to a COG family is relatively trivial.
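A schematic Python sketch of the triangle-merging idea follows; for brevity it omits the paralog-collapsing step and does not enforce that a triangle spans three distinct genomes (encoded here as identifier prefixes), so it should be read as an outline rather than the actual COG procedure.

    # Merge triangles of mutual best hits that share a side into clusters.
    from itertools import combinations

    def cog_like_clusters(best_hit_pairs):
        edges = set(best_hit_pairs)
        nodes = {p for e in edges for p in e}
        # Triangles: three proteins whose three pairs are all best hits.
        triangles = [frozenset(t) for t in combinations(sorted(nodes), 3)
                     if all(frozenset(pair) in edges for pair in combinations(t, 2))]
        clusters = [set(t) for t in triangles]
        merged = True
        while merged:                       # merge clusters sharing >= 2 proteins
            merged = False
            for c1, c2 in combinations(clusters, 2):
                if len(c1 & c2) >= 2:
                    c1 |= c2
                    clusters.remove(c2)
                    merged = True
                    break
        return clusters

    pairs = [frozenset(p) for p in [("A:1", "B:1"), ("B:1", "C:1"), ("A:1", "C:1"),
                                    ("A:1", "D:1"), ("B:1", "D:1")]]
    print(cog_like_clusters(pairs))  # [{'A:1', 'B:1', 'C:1', 'D:1'}]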

HAMAP: The HAMAP (High-quality Automated and Manual Annotation of microbial


Proteomes) project was created to allow partially-automatic procedures to help propagate Swiss-Prot (now UniProt-SwissProt) annotations with high confidence to members of a same gene/protein family, in the hope of keeping up with the continuous flow of new sequence data that expert manual analysis could not deal with [73,74]. Rather than using sequence similarity to aggregate genes or proteins, families were defined on a case-by-case basis by manual expertise. Already well-known (sub-)families were researched and used to create family-specific sets of rules that could be used to transfer annotations from actual family members to new members. These rules are based on multiple sequence alignments that are used to isolate 1) complete protein sequence profiles or meta-motifs, 2) localised sequence motifs or patterns, or 3) specific sites. A rule can combine all of these to ensure highly specific detection of proteins that belong to the same family. In this regard, HAMAP could also be classified as a semi-supervised domain-based aggregating method; adding new sequences is simple.

FIGfam: FIGfams are another collection of protein families, maintained by the Fellowship for the Interpretation of Genomes (FIG) [75]. FIGfam resembles HAMAP in that families are built upon expert manual curation, though functional annotations are rooted in the SEED system and thus extend beyond mere sequence similarity. FIGfam proposes an automated procedure for determining the membership of newly added proteins. Family membership is manually evaluated from the SEED [39] data using sequence similarity and genomic context.

IV.C.2.c. Semi-supervised domain-based aggregating

These methods rely on initial manually-defined families and/or manual curation of located conserved domains. All proteins containing a given domain (or set of domains) are assigned to a family, making the addition of novel sequences relatively easy.

Pfam: The Protein Family (Pfam) database is based on tracking the presence of various conserved domains within protein sequences [76]. The presence of a given domain, or of a set of domains (known as a “domain architecture”), is used as the basis for defining protein families. Domains are located on the basis of high-confidence “seed” multiple sequence alignments, which are used to build an HMM profile representing the domain, which is in turn used to agglomerate additional, more distantly related proteins into what becomes a protein family. Today, two types of Pfam

families exist: Pfam-A families are manually-curated, whereas Pfam-B families are automatically generated and have not yet been validated. Domains that currently have no assigned function are referred to as “Domains of Unknown Function” (DUFs) and represented roughly 3,000 (22%) of all Pfam families in 2010 [77]. Pfam families are widely used and were exploited in this thesis during the BKACE project (Chapter VIII).
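At its simplest, defining families by domain architecture reduces to grouping proteins by their ordered tuple of detected domains, as in the Python sketch below; the proteins are hypothetical, the accessions merely illustrative, and real domain assignments would come from HMM scans against the Pfam library.

    # Group proteins sharing the same ordered domain architecture.
    from collections import defaultdict

    def families_by_architecture(domain_hits):
        """Map each domain architecture (ordered tuple) to its proteins."""
        families = defaultdict(list)
        for protein, domains in domain_hits.items():
            families[tuple(domains)].append(protein)
        return dict(families)

    hits = {"prot1": ["PF00171"], "prot2": ["PF00171"],
            "prot3": ["PF00171", "PF00465"]}
    print(families_by_architecture(hits))
    # {('PF00171',): ['prot1', 'prot2'], ('PF00171', 'PF00465'): ['prot3']}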

PRIAM: This approach is dedicated to identifying enzyme-coding genes and their enzymatic activities by using rules combining enzyme position-specific “profiles” built from collections of known enzyme-coding genes [78]. These profiles are analogous to protein domains, though their positions are not necessarily contiguous. PRIAM thus assigns functions to new genes on the basis of domain-detected homology.

InterPro: InterPro [79,80] was designed as an integrative repository for several protein-family-defining efforts (see table below). It includes documentation on families, domains and functional sites as detected by resources such as: PROSITE [81], PRINTS [82], Pfam [76,77], Blocks [83], ProDom [84], Gene3D [85], SMART [86], SUPERFAMILY [87], PIRSF [88], TIGRFAM [89], and PANTHER [90].

MicroScope platform: all of the gene/protein family resources discussed above have been integrated into the MicroScope platform, either directly or via InterPro.

IV.C.2.d. 3D structure-based methods

Another bioinformatics development is the use of 3D protein modelling to derive protein structural similarities or to carry out substrate docking experiments. These methods deal much more directly with the elements responsible for protein function (i.e. the 3D distribution of amino acids with various physico-chemical properties) and thus hold great promise for future annotation, though they obviously remain dependent on sequence for structure elucidation. Unless otherwise stated, the approaches below define protein families on the basis of 3D structure, and these families can be used for functional transfer.


The Protein Data Bank Resource: Before discussing 3D-based methods, it is necessary to cite the Protein Data Bank. The PDB is a freely-accessible archive of the 3D structures of large biomolecules (including proteins and nucleic acids), maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) [www.pdb.org] [91]. PDB data is available via web interfaces and can also be downloaded in the PDB file format, which describes the 3D coordinates of the atoms in a structure. Such data is obviously essential to 3D structure-based homology methods.

SCOP: the Structural Classification of Proteins classifies proteins into folds, superfamilies, and families. Families and superfamilies are based on common evolutionary origin as revealed by sequence similarity and by structure, respectively. Folds are built on conserved secondary structures, and can additionally be regrouped into classes.

CATH: CATH [92,93] is a manually-curated database of hierarchically-classified protein domain 3D structures. Structures are classified into Class, Architecture, Topology and Homologous family. To build the classification, proteins having good 3D models in the PDB are grouped according to sequence similarity; domains are isolated along their sequences and then aligned within each group in order to establish protein architecture. CATH is more focused on sequence and structure similarity than SCOP. The Gene3D resource [94] is based on CATH domains.

ASMC: The Active Sites Modeling and Clustering (ASMC) method [95], developed at the Genoscope in the LABIS team, is capable of grouping modelled proteins from a given family into (hopefully) iso-functional sub-families by analysing key amino acids in their predicted active site pockets (see section VIII.C.5 for more details).

Others: the authors of [96] propose an integrated strategy, exploiting many bioinformatics resources and tools, in order to derive functional predictions from sequence and 3D structure.

IV.C.3. Limits

There are several limits to homology-based annotation transfer methods.


For a start, despite the increasing number of new genes and genomes being sequenced, a decreasing fraction of these are considered representatives of novel “islands” of sequence space [1]. In this respect, homology-based methods will hit an asymptotic threshold of sequence diversity from which to transfer functions, though much work remains before the known sequence space is correctly annotated, which is in itself another limit.

Beyond simple divergence as suggested by paralogy, homology does not necessarily imply iso-functionality, given that protein functions can be multiple and can diverge very quickly with few mutations. This is a painful thorn in the side of homology-based annotation transfer approaches, a fact that has been demonstrated and underlined in many works, such as [97–100]. Indeed, for some years now, scientists have had a hard time debunking the overly-simplistic view of “one gene - one protein - one function”. However, evidence has accumulated showing that many proteins can actually fulfil not one, but many different roles in the cell. Some enzymes can catalyse sets of similar metabolic reactions, or even several completely different reactions; signalling proteins have been known to have structural roles; etc. This phenomenon is called functional promiscuity (see also section V.C.1). Many types of functional promiscuity exist, and several classifications thereof have been proposed [101–106]. Common to all, however, is the notion that some protein functions have been selected by evolution as the “primary” function of the protein, whereas the more numerous adventitious “secondary” functions have sprung up by chance, either within the same active site as the primary function, or in other sites anywhere on the protein. This reservoir of secondary functions is thought to be part of the great engine of evolution, allowing new functions to be tested without disrupting the cell's inner workings, and serving as a functional recruitment base when needed [107,108]. The phenomenon leading to this robustness is known as canalisation [109,110].

Homology-based annotation transfer methods are also subject to phylogenetic bias, as the compared feature sets tend to be most similar when the host organisms are themselves closely related. Normalisation procedures, such as the protocol presented in OrthoMCL [68], can limit this effect.

Finally, the plethora of methods for inferring homology and transferring annotations on that basis (of which only a small selection has been presented here) can deter any newcomer to the field. Compounding this diversity, the evaluation of method performance is still difficult and seldom carried out, which does not make choosing a given method any easier. To my knowledge, the best attempt at comparing multiple family-building methods can be found in the excellent paper [51].


However fine the sequence-derived homology detection approaches become, such methods obviously cannot help when searching for candidate genes for orphan reactions, as no known gene sequences correspond to the target activity. Nor can they help when an entire protein family bears no existing functional annotations, as none can be transferred within it. Hence the need for so-called “sequence-independent” context-based methods.


IV.D. Annotation inference on the basis of functional dependence

Functional inference is carried out between genes of the same target organism, and has only become feasible with the advent of whole genome sequencing [111].

Three steps are required before inferring functions from functional dependence: 1) dependence detection for all genes, 2) gene context extraction for a target gene, and 3) functional context construction for the target gene. This functional context can then be used to 4) infer possible functions for the target gene [112,113].

IV.D.1. Steps

Dependence detection: Functional inference is based on the precept that genes working together in the same biological process are subject to common evolutionary, regulatory and physical constraints. Measures of functional dependency between genes attempt to capture these constraints, and their known evidence sources are listed in the following section.

Gene context extraction: Functional dependency between genes can usually be represented as weighted networks, wherein genes are nodes and functional dependency is captured in edge weights (binary or real). The neighbourhood of a target gene is a set of genes that are “close” to it in the network (e.g. at most k edges away). The actual context used in the next step is defined in some methods as simple neighbourhoods, in others as gene clusters based on network connectivity.

Functional context construction: Functional context can broadly be defined as the set of all functional data available for all genes from the gene context of a gene of interest. While gene context allows linking functionally dependent genes together, functional context reconstructs an image of the way the involved genes all participate in a same biological process. When annotating a gene of interest, bioanalysts manually reconstruct the surrounding gene and functional contexts in order to guess the target gene's possible activity.

Function inference: Computationally, the function of a target gene (or a set of more or less likely functions) can be inferred from its functional context either directly (e.g. using the 'majority rule', transferring the highest-level most common annotation in the context to the target) or indirectly (e.g. finding reaction gaps, using machine learning predictors, using functional similarity measures...). In the second case, it may be necessary to determine beforehand which possible functions can be inferred.
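As a minimal sketch of steps 2 to 4, here is a toy Python implementation using a one-edge neighbourhood above a weight threshold and the 'majority rule'; the network, weights and annotations are entirely hypothetical:

from collections import Counter

# Weighted functional dependency network: gene -> {neighbour: weight}.
network = {"geneX": {"geneA": 0.9, "geneB": 0.7, "geneC": 0.2}}
# Known functional annotations for the other genes.
annotations = {"geneA": "histidine biosynthesis",
               "geneB": "histidine biosynthesis",
               "geneC": "DNA repair"}

def infer_function(gene, min_weight=0.5):
    # Step 2: gene context = neighbours above a confidence threshold.
    context = [g for g, w in network.get(gene, {}).items() if w >= min_weight]
    # Step 3: functional context = their known annotations.
    functions = Counter(annotations[g] for g in context if g in annotations)
    # Step 4: majority rule, proposing the most common annotation.
    return functions.most_common(1)[0][0] if functions else None

print(infer_function("geneX"))  # -> 'histidine biosynthesis'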

Illustration IV.5: Using functional dependence to annotate a gene
Functional inference using functional dependence follows several steps. First, functional dependencies within the studied genome are established. These dependencies are then used to define a neighbourhood of the gene. The functional context formed by the functions of the other genes in this neighbourhood is then used to propose possible functions for the original gene.

As mentioned above, several evidence sources can attest to the functional dependence of genes or of their corresponding proteins. Below, I detail those that belong to what is collectively termed “genomic context”, as well as those that are “experimental data-based”, along with the biological roots of these sources and a few example methods that exploit them. I then conclude this Chapter by describing methods that call upon multiple different sources.

IV.D.2. Genomic Context-based sources

IV.D.2.a. Prokaryote genome organisation

DNA is not organised identically in the cell across all domains of life. Prokaryote genomes (i.e. bacterial and archaeal) are organised into cytoplasmic chromosomes and plasmids, in contrast to eukaryote genomes, whose chromosomes are secured inside a nucleus. This has led to a prokaryote-specific genome organisation. Indeed, the absence of a nucleus and endoplasmic reticulum means that transcription and translation both take place in the same cellular compartment, as shown in Illustration IV.1: Prokaryote transcription and translation. Protein synthesis can thus be rendered more efficient by streamlining these two processes. Evolution has therefore favoured prokaryote genomes on which genes coding for proteins that participate in the same biological processes cluster together on the chromosome or on plasmids. Closely clustered genes can be transcribed and translated quickly, together, at the same time and place in the cell (the extreme examples being a) polycistronic mRNAs, i.e. mRNAs carrying several genes, and b) gene fusion, leading to one mRNA coding for two proteins). They can also share common transcription initiators and terminators, forming a structure called an “operon”, thus simplifying genetic regulation of the concerned bioprocess [114]. Eukaryote genomes, with their spatially/temporally decoupled and more complex transcription and translation, rely more heavily on regulatory processes and chromosomal 3D organisation, which do not shape the organisation of genetic material in the same way. For both types of organism, “regulons” are sets of genes all sharing a common regulatory mechanism without necessarily being co-localised on the genome. Locating such structures in a studied genome gives valuable insight into its workings.

IV.D.2.b. Definition of genomic context

Many types of evidence form what is termed the “genomic context” of a gene, and they are at the heart of comparative genomics [115]. The genomic context of a gene can be loosely regarded as the sum total of all data concerning the genome and the other genes it carries that are linked mechanistically or spatially to the gene of interest. The most obvious of such links is chromosomal proximity, for which we make the underlying assumption that genes that are “close” on a genome have products that may interact in some way.

Genomic context is more easily exploitable in prokaryotes (though not exclusive to them, as we shall see), as eukaryote genomes rely more on genetic regulation and genome 3D structure than on the structural organisation of genes. However, should cellular biology techniques one day evolve sufficiently to allow access to 3D representations of working eukaryote genomes, then perhaps “structural genomic context”-based methods will emerge. Progress has already been made in this direction by the characterisation of chromosomal “territories” in higher eukaryotes that appear closely linked to gene expression regulation [116].

IV.D.2.c. Gene clusters and syntenies

The foremost functional dependency evidence used in the study of prokaryote genomes is gene clustering, i.e. the tendency of prokaryote genes participating in the same biological process to cluster together on the chromosome, creating operon structures or regions bi-directionally transcribed from a central promoter. However, the definition of gene clusters in the bioinformatics literature varies widely, and can be confused with that of syntenies, which are groups of genes conserved across multiple genomes. Here, I shall attempt to separate definitions that involve a single genome from those that involve multiple genomes.


Single-genome gene runs: The simplest form of gene neighbourhood is the gene run. Gene runs are sequences of immediately contiguous genes on a genome that respect constraints that may vary between approaches. The most typical constraints are a) identical transcription direction (or not), and b) a maximum intergenic distance threshold. For example, the DOE IMG platform defines chromosomal cassettes as sequences of genes whose intergenic distance never exceeds 300 base pairs (bp), whatever the strand they are on [69], a single-genome variant of what is presented in [117,118]. The 300 base pair limit was proposed in [118] on the basis that, for the set of roughly 10,000 genes from ten genomes that the authors analysed, the mean intergenic gap was 91 bp, with a standard deviation of 136 bp. As is common in statistics, a cut-off was established by taking the mean plus two standard deviations, i.e. 91 + 2×136 = 363, rounded down for convenience to 300 bp.
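A toy Python sketch of this kind of gene run detection, under the 300 bp rule (gene names and coordinates are invented, and genes are assumed pre-sorted along a single replicon):

def gene_runs(genes, max_gap=300):
    """genes: list of (name, start, end) tuples sorted by start position.
    Returns runs of contiguous genes whose intergenic gap never exceeds max_gap."""
    runs, current = [], [genes[0]]
    for prev, gene in zip(genes, genes[1:]):
        gap = gene[1] - prev[2]  # intergenic distance
        if gap <= max_gap:
            current.append(gene)
        else:
            runs.append(current)
            current = [gene]
    runs.append(current)
    return runs

genes = [("g1", 100, 1000), ("g2", 1150, 2000), ("g3", 2900, 3500)]
print([[g[0] for g in run] for run in gene_runs(genes)])
# -> [['g1', 'g2'], ['g3']]   (gap g1-g2 = 150 bp; gap g2-g3 = 900 bp)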

Beyond the simple notion of gene run is that of the previously-described operon, a structural, transcriptional and functional unit. Many methods exist that attempt to predict operons from genomic sequences, using single-genome or multiple-genome data, functional clues, and experimental evidence [119–125].

Multiple-genome syntenies: Gene clusters can be conserved (as shown by gene-gene homology relationships) across several genomes, and are then called syntenies (Nota bene: the traditional meaning of the term “synteny” has, over the years, been superseded by a new meaning of the bioinformatics era, see [126]; I shall be using the latter, for want of a better term). Syntenies are even stronger evidence of functional dependence than gene runs, as high conservation indicates that the cluster structure is not a simple result of chance, but the result of evolutionary pressures. Gene syntenies are generally used in prokaryote comparative genomics, though they are also studied between eukaryote genomes, where they tend to be larger and serve to study chromosomal rearrangements over the course of evolution (e.g. [127–129]). Given the drift in the term's meaning, it is not surprising that gene synteny definitions and methods of localisation are highly variable. They differ in their consideration of gene direction and homology, in the number of genomes, in the number of allowed gene gaps... Below, I present some of the current works, and their definitions, that are relevant to this thesis.

Overbeek et al. [118,117] were the first to explicitly propose using gene clustering as an indicator of functional dependence, and developed the WIT server to allow users to consult the calculated results of their methodology. Their gene clusters were defined as “pairs of close bidirectional best hits” between two genomes. Genes were “close” if they belonged to the same “run”, defined as a sequence of genes on the chromosome along which the intergenic distance never exceeds 300 base pairs. Yanai et al. [130] confirmed the high-level predictive power of such gene clusters by showing that 80% of all ortholog families whose members were in a same gene cluster in one or several genomes coded enzymes participating in a same KEGG metabolic map (see section V.A.5 for details about KEGG).
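The bidirectional best hit (BBH) criterion at the heart of this definition can be sketched in a few lines of Python, given precomputed best-hit tables in both directions (the tables here are invented):

def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Each argument maps a gene to its single best hit in the other genome.
    Returns pairs (a, b) where each gene is the other's best hit."""
    return [(a, b) for a, b in best_a_to_b.items()
            if best_b_to_a.get(b) == a]

best_1_to_2 = {"a1": "b1", "a2": "b2", "a3": "b9"}
best_2_to_1 = {"b1": "a1", "b2": "a2", "b9": "a7"}
print(bidirectional_best_hits(best_1_to_2, best_2_to_1))
# -> [('a1', 'b1'), ('a2', 'b2')]

Combined with the gene runs sketched earlier, such BBH pairs yield the “pairs of close bidirectional best hits” of the Overbeek et al. definition.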

Kolesov and Frishman [131,132] propose the SNAP (Similarity-Neighbourhood Approach) method as a generalisation of the previous approach. Gene clusters are detected through the combined use of inter-genome sequence similarity and chromosomal co-localisation, basically allowing genome rearrangements to be tracked across multiple genomes and capturing the extended genomic context of a given gene. This idea was further extended in [133], presented in section VI.C.

For the IMG platform, a cassette of at least 2 genes that is conserved across at least 2 genomes (homology as defined in COG, Pfam or IMG-based families) is called a conserved chromosomal cassette, akin to a synteny [69]. A related definition is that of a conserved pair of neighbouring genes [134], which is a pair of adjacent genes whose orthologues in multiple genomes are not separated by more than a given number of intermediate genes.

A similar method is presented in [135]. In it, a suite of programs (including, but not limited to, BLAST and MCL) is used to establish gene homology. Conserved adjacent gene pairs (called positional orthologous genes, or POGs for short) are then isolated, before being accreted incrementally into full-blown “synteny blocks”. Results are stored in the SynteBase database and are made available via a Java front-end called SynteView.

Syntenies based on the concept of gene teams over multiple genomes (groups of genes that are never separated by more than a given threshold in any of the target genomes) were used in [136] and extended, for example, in [137]. A domain-centred (rather than gene-centred) version of this approach is proposed in [138].

The ADHoRe method (Automatic Detection of Homologous Regions) [139] locates regions where the order of homologous genes is conserved between 2 genomes. The I-ADHoRe method [140] compiles these 1-versus-1 syntenies into multi-genome syntenies.


More mathematically-precise synteny definitions exist. The GRIMM-synteny [141] and Cynteny [142] methods represent genomes as signed, numbered vectors that indicate the relative positions of genes and their transcription direction. The method searches for inversions, fusions/fissions and translocations in order to reconstruct whole-genome rearrangements between only 2 genomes. These rearrangements can serve as the basis for determining syntenies. However, these methods are not adapted to many-to-many homology relationships in the target genomes and cannot handle insertions/deletions of genes.

Graph-based synteny definitions exist as well, wherein genes and genomes are represented as graphs, with gene neighbouring and sequence similarity-based edges linking genes. Locating syntenies becomes a question of locating structures in this graph. For example, in the Kanehisa lab (that maintains the KEGG database, see section V.A.5), “correlated gene clusters” are identified using an adapted version of a graph alignment algorithm [143] that they developed with another objective in mind ([144], see section VI.C.1.c for more details). The algorithm searches for groups of co-localised genes in one genome that correspond, via best hits, to groups of co-localised genes in one or several other genomes. They proceed further by clustering their correlated gene clusters into “conserved correlated gene clusters”. They use all this information to construct KEGG ortholog families.

Another example is Cyntenator [145], which uses sequence similarity cliques to define the homology relationship (analogous to making families), and links cliques together when a significant fraction of the target genomes possess members of each clique as neighbouring genes transcribed in a same direction.

Finally, the works of [5,146] detail their own mathematical definition of a synteny (which they call a synton), along with the CCCPart algorithm that is described further in section VII.C.1. A mathematical generalisation of graph alignment algorithms can be found in [147]. From these works was born the synteny-locating algorithm used in MicroScope, one of the main features of the platform. Ultimately, a synteny is defined in MicroScope as: a group of genes in one organism with correspondences (bidirectional best hits, or at least 30% identity over 80% of the length of the shortest sequence) with a group of genes in another organism; the genes in each group may not have more than 5 intervening genes without correspondences, though intergenic distances and gene transcription frame/direction are ignored. The CCCPart algorithm also serves as a base for the CanOE strategy (see chapter VII).
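The MicroScope correspondence test between two genes can be sketched as follows (percent identity and alignment length are assumed to come from a precomputed BLAST-style comparison; this is an illustration of the stated thresholds, not the platform's actual code):

def corresponds(is_bbh, pct_identity, align_len, len_a, len_b):
    """MicroScope-style correspondence: bidirectional best hit, or at least
    30% identity over 80% of the length of the shortest sequence."""
    if is_bbh:
        return True
    shortest = min(len_a, len_b)
    return pct_identity >= 30.0 and align_len >= 0.8 * shortest

print(corresponds(False, 35.0, 260, 300, 320))  # True: 260 >= 0.8 * 300
print(corresponds(False, 25.0, 300, 300, 320))  # False: identity too low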


Illustration IV.6: Gene clusters and syntenies
All arrows represent genes. (A) In light green, a gene run based on a) same transcription direction (e.g. right border) and b) intergenic spacing less than or equal to 300 bp (e.g. left border). (B) In light green, an operon determined by the position of a transcription initiator sequence (left) and terminator sequence (right). (C) Gene homology is represented by red edges between genomes. In light green, one gene cluster (bidirectional best hit pair) according to the works of Overbeek. Notice the requirement of homology and gene runs, the allowance for gaps, and the lack of constraint on gene order. (D) In light green, a synteny without transcription direction or gene order constraints, allowing for a maximum of 1 consecutive gap (in gray).

All in all, many ways to determine syntenies exist, capturing various evolutionary pressures, and thus not necessarily producing overlapping results. Any of them, of course, can be used as a marker of functional dependence.


IV.D.2.d. Rosetta stone (gene fusion/fission)

In some cases, chromosomal proximity is taken to an extreme when genes actually fuse together. A triplet of one fused gene and corresponding separate homologous genes from another genome has been called a “Rosetta stone”, as it helps decipher the interaction between gene products [112,113]. Many other works have included fusion/fission events in comparative genomics efforts [148,149] and they are now considered as a staple of genomic context. AllFuse [150] and FusionDB [151] are databases dedicated to detecting and listing fusion/fission events, with the latter extending the concept to COG families rather than just genes. Fusion/fissions are singled out in the MicroScope platform as tentative targets of interest to annotate.

IV.D.2.e. Shared regulatory sites

Genes can also be grouped together based on shared regulatory sites. These sites are usually found within non-coding DNA regions upstream of the regulated genes. Automatic regulation pattern discovery and matching software exists, such as the Regulatory Sequence Analysis Tools suite [rsat.bigre.ulb.ac.be/rsat/] developed at the BiGRe (Bioinformatique des Génomes et des Réseaux) lab [152–154]. Actually using regulatory sites to group genes has been carried out semi-manually in works such as [155], and gene regulation databases include RegulonDB [119,156], SwissRegulon [157] and PRODORIC [158]. Such gene groupings should be particularly close to the definition of operons or über-operons, which are strong indicators of functional dependence.

IV.D.2.f. Phylogenetic profiles

Phylogenetic profiles (sometimes referred to as phylogenomic profiles) are vectors describing the presence/absence of a given gene family across many genomes. Similar phylogenetic profiles are evidence of functional dependence: two genes participating in the same biological process are more likely to be either both present, or both absent, in the same organisms, as the loss of either one would disrupt the process. The authors of [159] were the first to propose the use of gene phylogenetic profiles to measure gene dependence, and studied “exact-matching-but-1” profiles to highlight a proof-of-concept case study. Many variants have since been proposed, diverging in the detection of gene orthology, the use of binary or weighted vectors, and in profile similarity measures. Though phylogenetic profiles are still based on sequence similarity, the latter is not the source of transferred annotations, merely an integrated indicator of evolutionary constraint between genes.
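As an illustration, here is a small Python sketch comparing two binary profiles with the Jaccard index, one of many possible similarity measures (the profiles are invented, and mirror the “exact-matching-but-1” case):

def jaccard(profile_a, profile_b):
    """Profiles are binary presence/absence vectors over the same genome list."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

#             genomes:   1  2  3  4  5  6
gene1_profile =         [1, 1, 0, 1, 0, 1]
gene2_profile =         [1, 1, 0, 1, 0, 0]
print(jaccard(gene1_profile, gene2_profile))  # -> 0.75 (profiles differ in one genome)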


Illustration IV.7: Phylogenetic profile principle
On the left, two target genes from a target genome. Family-based homologs from other genomes are represented (one very partial match in genome 5 for gene family 1). The presence/absence across genomes is converted into a binary vector for each target gene. The two vectors are compared and a similarity metric is derived from them. This cross-genome similarity is then transferred to the target genes as a signal of functional dependence.

Phylogenetic profiles are not based on gene chromosomal proximity, and are thus genomic context indicators that can be, and have been, used in eukaryote genomes [160].

As seen, many genomic context-based functional dependency measures can be derived. Not all perform equally well when it comes to serving as a basis for protein function prediction. The best-performing of these are arguably the various versions of the gene cluster approach, as described in [161–163].

Having seen gene-based genomic context, we can now proceed to functional dependency detection methods that rely on experimental data sources.

IV.D.3. Experimental data-based sources

These sources require data other than genomic data to be established. Because of this requirement, they are not readily usable across a great number of genomes, and thus did not fit in with the objectives of my thesis. I shall therefore only briefly describe these sources here.


IV.D.3.a. Co-expression / co-regulation

In the same frame of mind as for shared regulatory sites, two genes presenting a similar expression profile across a set of experiments may share a common regulation mechanism, and thus be functionally related. Multiple co-regulated operons are called regulons.

IV.D.3.b. Co-citation

A literature-based measure of association between genes is the frequency at which both genes are cited together in the same scientific publications. Such meta-data analysis has found increasing popularity since its first uses in psychology [164,165]. Examples of using scientific literature to derive associations between genes can be found in [166,167] and in STRING (Search Tool for the Retrieval of Interacting Genes) [168].

IV.D.3.c. Protein-protein interaction

Many proteins act by interacting with other proteins, often by binding to them to form protein complexes. These physical events can be detected by wet-lab experiments such as two-hybrid techniques in yeast [169] and co-immuno-precipitation [170]. Protein-protein interaction (PPI) networks derived from such experiments have been extensively used as a different source of functional context for annotation inference [171,56]. More recently, aligning protein-protein interaction networks has been shown to improve protein orthology prediction, thus forming a better base for functional annotation transfer [172], such as with the IsoRank and IsoRank-N algorithms [173,174].

IV.D.3.d. Other data sources

Other information can be used to detect functional dependency between genes. Growth medium-specific gene essentiality data can give clues as to which metabolic pathways certain genes participate in [175]. Protein 3D structure can deliver protein-protein interaction predictions, ligand binding site predictions, or even broad fold similarities used to infer common metabolic pathways [176].


IV.D.4. Multiple-context-based annotation methods

Any of the previously cited data sources can be used individually to establish functional dependencies. However, as elsewhere in bioinformatics, the trend is to integrate several data sources in various ways in order to improve coverage and power. The first attempts at this were manual and correspond to traditional bioanalysis, as in [117,177]. The first computational approach was [113], in which yeast gene-gene dependency edges (established from phylogenetic profiles, experimental protein-protein interaction data, sequential metabolic step catalysis, Rosetta stone fusion/fission, and correlated mRNA expression data) are essentially summed together, with different confidence scores according to the different sources, in order to predict and score functional SwissProt keywords. The authors of [178] were the first to use the SVM machine learning technique to integrate gene expression data with phylogenetic profiles to predict a special functional classification in yeast. [179] used Markov Random Fields to combine physical interaction networks and functional descriptions from the Yeast Proteome Database in view of functional prediction. Joshi et al. [180] train an algorithm to estimate the probability that any two genes share a similar function, for several types of high-throughput data (two-hybrid screening, physical protein interactions, micro-array data, protein complexes) in yeast. Chen et al. [181] train a Bayesian algorithm on yeast to transfer GO terms between proteins showing high similarities across various data sources (protein-protein interactions, protein complexes, and gene expression data). Zhu et al. [182] derive protein features and train an SVM algorithm that does not depend on sequence similarity to predict protein functions. Ferrer et al. [163] generate scores between genes reflecting the likelihood that they participate in the same biological process, on the basis of genomic context information (phylogenetic profile similarity, conserved gene neighbourhood, gene clusters, and Rosetta Stone), which they use to derive gene groups, the relevance of which is estimated by considering their conservation across multiple, selected genomes.

Over the years, many bioinformatics methods have been developed by scientists of various origins (mathematicians, biologists, computer scientists...) in the race for the perfect protein function predictor. Presenting all of the numerous functional prediction methods that exist today is beyond the scope of this thesis. I shall thus only present a few here, and several more that can specifically answer the “orphan enzyme problem” in chapter VI. For more information on these methods, I recommend the following articles.


The review by Sharan et al. [56] presents several protein-protein interaction network-centred approaches, which the authors classify into two types: those based on propagating annotations through the network, and those based on identifying clusters in the network and propagating annotations within each cluster separately. The article by Erdin et al. [183] is a recent and well-written tour of protein functional prediction methods, discussing issues such as function description, and describing the main sequence-based, structure-based and network-based methodologies for functional annotation.

IV.D.4.a. Annotation platforms

All annotation platforms (such as MicroScope [4], IMG [37], the SEED [39], and ERGO [38]) formalise both sequence-based and context-based functional annotation approaches. Indeed, they offer computed results for a number of different methods concerning any target gene, such as sequence similarity, syntenies, gene clusters, phylogenetic profiles, gene/protein families, predicted domains, co-expression... The bioanalysts play the role of human integrators of this wealth of information, coercing increasingly precise annotations from the complex picture they can paint about their gene of interest.

Parts of the modus operandi of expert bioanalysts can be automated, easing their workload, focusing their attention on interesting cases, and ranking candidates. This is the true objective of all the automatic context-based methods conceived so far.

IV.D.4.b. The STRING

The STRING (Search Tool for the Retrieval of Interacting Genes) [184,162,149] is a database storing many different protein-protein dependency measures for protein pairs from over 1100 organisms. Each measure is transformed into a confidence score by the following benchmarking approach:

• a true interaction is defined as a pair of proteins known to participate in a same KEGG pathway

• a positive interaction is defined as a pair of proteins with a dependence measure above a certain threshold

• a true positive rate curve is established using these definitions, over all values of the given measure

• the confidence for a given measure value is the true positive rate at that value

Each confidence score can be viewed as a probability. Individual functional dependence confidence scores are combined into a single integrated confidence score by multiplying together the probabilities that each association does not predict a functional interaction, assuming statistical independence of the different data sources; the combined confidence is one minus this product.
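A sketch of this combination rule in Python (note that the published STRING scheme includes further corrections, e.g. for prior expectation, that are not reproduced here):

from functools import reduce

def combined_confidence(scores):
    """Combine per-source confidence scores assuming independence:
    P(interaction) = 1 - product over sources of (1 - score)."""
    return 1.0 - reduce(lambda acc, s: acc * (1.0 - s), scores, 1.0)

print(combined_confidence([0.6, 0.5]))       # -> 0.8
print(combined_confidence([0.6, 0.5, 0.9]))  # -> 0.98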

STRING protein-protein associations can then be used manually to infer functional context and ultimately, function, for a given target gene. A relatively large-scale manual analysis of STRING results in conjunction with COG families was presented in [185]. The interface's ease of use (web query tool or access to downloadable data) and the summarising of multiple functional dependency measures into one network are a boon to bioanalysts with precise questions in mind (see also section VI.C.1.b for some relevant case studies).

Furthermore, many function prediction tools are based on STRING data, and it has become a reference source of pre-calculated functional dependence information. For example, [186] combine STRING association scores and intergenic distances to train a neural network for predicting operon structures in Escherichia coli and Bacillus subtilis. [187] develop a “network-based hierarchical Bayesian auto-probit model” for exploiting STRING association scores in order to predict GO terms for unannotated proteins.

The PINTA website [188–190] proposes a battery of prioritisation algorithms to be run on differential gene expression data populating the protein nodes of a binary STRING network, in order to determine novel candidate genes, not for orphan enzymes, but rather for implication in diseases. The website is dedicated to eukaryotes (human, mouse, rat, worm and yeast). The prioritisation algorithms are Heat Kernel Ranking [189], Arnoldi Diffusion Ranking [188], Random Graph Walk, HITS with priors [191], and k-step Markov [191].

In [192], the authors propose an algorithm to use STRING association data with yeast phenotype- associated proteins to predict phenotypes for other yeast proteins.

In [193], the authors develop a strategy of finding conserved functional modules between STRING-based binary protein-protein association networks (using NetworkBLAST [56]) to reinforce OrthoMCL-inspired GO annotation transfer between Mycoplasma genitalium proteins.


IV.D.4.c. Predictome

In [194], the authors describe the Predictome database, which predicts and stores functional links between genes. Links are predicted using chromosomal proximity, phylogenetic profiles, and fusion/fission. Once again, having a compiled network of dependencies relieves some of the effort that bioanalysts must put into building a representative functional context for their target gene.

IV.D.4.d. ProLinks

The ProLinks database [134] stores gene functional dependence data concerning Rosetta stone fusion/fission, phylogenetic profiles, gene clusters (groups of genes on a genome whose probability of being in a same operon is a function of intergenic spacing), gene neighbours (gene pairs conserved across genomes without being separated by more than a certain number of genes), and literature co-citation. Each of these methods generates a gene interaction p-value. In order to be comparable, these p-values are scaled into probabilities by a COG-based benchmarking approach.

ProLinks dependencies are exploited by the “Pathway Hole Filler - Genomic Context” algorithm that is detailed in section VI.C.1.e.

IV.D.4.e. GeneMANIA

GeneMANIA [195,196] is a web server that dynamically interrogates a database comprising many gene-gene/protein-protein interaction networks of various sources (co-expression, physical interaction, genetic interaction, co-localisation, shared pathway, shared domains, etc.) with a query set of seed genes. Local functional association networks from each data source are combined using dynamically-calculated weights, and gene function labels are propagated across the integrated network and can be used as predictions for the seed genes. GeneMANIA currently supports 6 eukaryote organisms (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens, and Saccharomyces cerevisiae).

These annotation tools allow the study of the functions of prokaryote (and for some, eukaryote) genes. One category of functions is of particular interest to biologists, biochemists and biotechnologists: metabolic enzymatic activities. As the goal of this thesis is the development of methods for annotating genes with metabolic reactions, I shall now briefly present the basics of prokaryote metabolism.


V. Metabolism

V.A. Key Actors

V.A.1. Compounds

Chemical compounds are made up of several atoms of one or more chemical elements (e.g. carbon, oxygen, hydrogen...) that are maintained in a more or less rigid three-dimensional structure by chemical bonds. Some compounds are small, others very large with many atoms (e.g. DNA, or crystalline structures). Compounds are called metabolites when they correspond to small molecules or ions that are processed in biochemical (metabolic) reactions (see section V.A.2).

Compounds are identified primarily by chemical formulae that describe the number of atoms of each element that composes them, with various systems for representing various levels of substructure detail; the simplest of these is the general chemical formula. The three-dimensional structure itself can be described by a structural formula, using a variety of representation systems, the most common in biology being the skeletal formula, which assumes carbon-based molecular backbones with implicit hydrogenation. Attempts have been made to make chemical structure descriptions easily accessible to humans and computers. For example, SMILES (simplified molecular input line entry specification) can describe any molecule as a linear series of characters that represent successive atoms (hydrogens excepted) along the molecule's backbone, all cycles being “broken”. The InChI scheme (IUPAC International Chemical Identifier) also proposes linear text descriptions of molecules that deal more directly with the chemical structure. Additionally, InChI defines a standard for hashing (a computational approach to compressing data) in order to shorten the representation.

As an example, several different representations of a compound central to metabolism across all domains of life are illustrated in the following table.


Common name: Pyruvic acid
General formula: C3H4O3
Structural formula (hydrogen atoms hidden, carbon atoms implicit): [skeletal drawing]
Balls-and-sticks 3D representation: [3D drawing]
SMILES formula: O=C(C(=O)O)C
InChI formula (non-hashed): 1S/C3H4O3/c1-2(4)3(5)6/h1H3,(H,5,6)
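These text representations are directly computable. As a small illustrative sketch, assuming the open-source RDKit toolkit (compiled with InChI support) is installed; RDKit is not part of this work:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

# Parse pyruvic acid from the SMILES string given in the table above.
mol = Chem.MolFromSmiles("O=C(C(=O)O)C")
print(rdMolDescriptors.CalcMolFormula(mol))  # expected: C3H4O3
print(Chem.MolToInchi(mol))  # expected: InChI=1S/C3H4O3/c1-2(4)3(5)6/h1H3,(H,5,6)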

V.A.2. Reactions

Chemical reactions involve the transformation of one or more substrate compounds into one or more product compounds (collectively called reactants). This transformation can be described, at its simplest, by a stoichiometric formula which compiles the number of molecules of each involved compound and the reaction direction. More detailed descriptions exist that detail the chemical structures of the reactants and the mechanism of the transformation. Chemical reactions are in principle reversible (i.e. they can take place in both directions, from substrates to products or from products to substrates), but not all reactions nor directions are equal from an energetic point of view, and many are extremely unlikely to happen spontaneously. An example reaction is given in the following figure:

Illustration V.1: Example reaction for pyruvate
A metabolic reaction taken from MetaCyc [www.metacyc.org], illustrating the reversible transformation of pyruvate (the ionised form of pyruvic acid) into formate.


The speed v+ at which a chemical reaction transforms substrates into products (or v- for products into substrates) is dependent on a number of factors:

• reactant activities, which are closely linked to reactant concentrations for entities in aqueous solution

• reaction surface area, or the sum total of all reactant-reactant interface surfaces, of particular importance when reactants do not belong to a same phase (i.e. solid, aqueous, alcohol, gaseous...)

• pressure and temperature, which influence how often reactants can approach each other sufficiently for the transformation to take place

• presence of a catalyst

• activation energy EA, which is the minimum amount of energy that must be provided to the reactants for the reaction to initiate.

Given initial concentrations of reactants in aqueous solution, a chemical reaction will modify these concentrations over time until they reach asymptotic limits that correspond to chemical equilibrium, a sign that v+ and v- are equal. The equilibrium constant Keq is the ratio of the forward and reverse rate constants (not of the speeds themselves, which are equal at equilibrium).
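In standard notation, for a simple reversible transformation of a substrate S into a product P with forward and reverse rate constants k+ and k-, this reads (a textbook formulation):

\[ v_+ = k_+ [S], \qquad v_- = k_- [P], \qquad K_{eq} = \frac{k_+}{k_-} = \frac{[P]_{eq}}{[S]_{eq}} \]

At equilibrium, setting v+ = v- indeed yields the last equality.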

V.A.3. Enzymes

Many reactions are considered irreversible in typical laboratory operating conditions (25°C, 1 atmosphere of pressure), and many require specific conditions (e.g. solution acidity) or an energy source (e.g. heat) to take place. The use of catalysts, i.e. substances that can participate in the reaction but are not durably modified by it, can greatly increase the efficiency of a given reaction. Catalysts reduce the activation energy a reaction requires to initiate, effectively increasing the reaction speed and/or reducing exterior energy source requirements, without affecting the equilibrium constant. In inorganic chemistry, for example, the use of platinum as a catalyst greatly reduces the heat and pressure conditions required for organic molecule hydrogenation to take place.

Metabolic reactions are chemical reactions that take place in a living cell. High levels of heat and pressure are obviously incompatible with Life (even thermophile bacteria, which can withstand temperatures close to the boiling point of water, could not withstand the high temperatures required to hydrogenate unsaturated fatty acids, for example). Hence, biologically-available catalysts are of prime importance in the biochemistry of life.

The great majority of biocatalysts are enzymes, which are a particular type of protein. Some other biological molecules have catalytic properties, such as ribosomes, ribozymes, and other types of active RNAs. Unlike usual catalysts, however, enzymes may have particular behaviours with respect to the equilibrium constant. Indeed, if a given constant heavily favours one reaction direction, then conditions (such as sequential enzymatic reactions) may ensure that the enzyme effectively catalyses only this direction. Furthermore, enzymes often couple secondary reactions to their primary ones. Enzymatic coupling allows the transfer of chemical energy between reactions, e.g. using ATP hydrolysis to drive lipid synthesis reactions, or using the concentration-gradient-driven passage of protons through a membrane ATP synthase in order to regenerate ATP.

Enzymology, or the study of enzymes, generally assumes that the reactants and the studied enzyme are in aqueous solution, at conventional laboratory conditions compatible with Life. However, studying enzymatic activities is more complex than studying simpler chemical reactions, as the catalyst intervenes actively in the reaction mechanism and often couples one reaction to another, rendering the depiction of a molecular model difficult. The simplest of such models is the Michaelis-Menten model, which considers that the enzyme and its substrate must bind, in an easily reversible step, into an enzyme-substrate complex, which is then transformed into product by a more favourable second step. Closed-form mathematical descriptions of reactant and free-enzyme concentrations over time, as well as of reaction speeds, are derivable with this model, and several quantities of interest are used to characterise enzymatic reactions:

• Unlike inorganic catalysts, enzymes show saturation dynamics; indeed, each enzyme has only a limited number of active sites (usually one per reaction type, except for multimeric enzyme complexes). This means that the reaction rate approaches an asymptotic maximum, noted Vmax, as the concentration of substrate increases.

• The Michaelis constant KM is the substrate concentration at which the global reaction rate is at half its maximum Vmax; it is an inverse measure of the affinity of the enzyme for the substrate. The smaller the KM, the higher the affinity, and the faster Vmax is approached.

• Vmax can be expressed as the initial quantity of free enzyme times a constant. This constant is the turnover number, kcat, which is the maximum number of substrate molecules that one enzyme molecule can convert per second. The enzymatic efficiency, kcat / KM, is useful when comparing enzymes.
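These quantities are tied together by the classical Michaelis-Menten rate law (standard formulation, using the notations above):

\[ v = \frac{V_{max}\,[S]}{K_M + [S]}, \qquad V_{max} = k_{cat}\,[E]_0 \]

Setting [S] = KM indeed yields v = Vmax/2, consistent with the definition of the Michaelis constant.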


Illustration V.2: Experimental determination of Vmax
The maximum reaction rate of an enzyme that follows Michaelis-Menten reaction kinetics can be determined by plotting the reaction rate (i.e. the speed at which substrate is consumed or products are produced) of the enzyme against different concentrations of substrate. The reaction rate approaches an asymptotic maximum for increasing substrate concentrations. The Michaelis constant KM can also be read from the same plot, as shown here.

Protein activity is heavily dependent on 3D structure and amino acid sequence. Indeed, an enzyme intervenes in a chemical reaction thanks to several mechanisms, such as correctly positioning the substrates, distorting electron fields, forming temporary chemical bonds, or readily providing labile protons, all thanks to the disposition of its amino acid side chains (“residues”). Structural proteins stick together thanks to specific binding sites. Regulatory proteins interact with other proteins or molecules thanks to 3D domains. Actually, the entities coexisting within the bubbling cauldron of a cell are so numerous and varied that many different types of interactions can take place between them, leading to a phenomenon known as functional promiscuity (see section V.C.1). Thus, the relationship between coding gene and activity can be a many-to-many one. For example, some enzymes are made up of several identical (respectively, different) proteins, and are thus homopolymeric (respectively, heteropolymeric). Inversely, one activity may be catalysed by different proteins without evolutionary relationships, which are then called isozymes or isoenzymes.

All in all, this serves to show that any functional annotation should always be considered with doubt, as it can not only be wrong or biologically irrelevant, but also incomplete in many unexpected ways, especially since experimentally testing every protein for every function imaginable in each different type of cell is simply not feasible. This should always be kept in mind when creating functional annotations or when studying the activity of a given enzyme, even when its coding gene's context points rather clearly to a specific activity, such as a step out of a metabolic pathway.

Enzymes catalyse biochemical reactions, transforming environment substrates into molecules better adapted to the cell's requirements. A single substrate can be submitted to sequential transformations, gradually modifying or incorporating it. Such chains of transformations are known as metabolic pathways and are presented below.

V.A.4. Metabolic pathways

Metabolic reactions do not operate alone within the cell; indeed, the product of one reaction is often a substrate of another, allowing a cell to transform one metabolite into another in a series of simple steps. A metabolic pathway can be broadly defined as a set of linked metabolic reactions that participate in a same higher-level metabolic process, along with contextual and experimental data. However, the exact definition of what a metabolic pathway actually is remains debated, and some of the important points are discussed below.

Historically, metabolic pathways were described experimentally by biochemists in a given organism. Metabolic pathway definitions were thus organism-specific, and their reactions respected stoichiometric balance. Phenotypic data, gene regulation data, gene essentiality data and genomic data could also be used to describe environment-specific pathways, their regulation and their genomic localisation, as with the lactose operon [197]. Most importantly, certain compounds in a pathway were considered more important than others: “main compounds” captured the most important path through the metabolic reactions, not linking reactions via secondary compounds (often considered to come from an available cellular “pool”, as with energetic substrates). These paths were considered biologically relevant as they followed the flow of biochemically relevant atoms (such as carbon and nitrogen), and could thus be experimentally validated by atom-tracing experiments (e.g. [198]). Anabolic pathways concern the synthesis of compounds from others, usually at the expense of energetic substrates (e.g. DNA replication); catabolic pathways degrade compounds into simpler products and energetic substrates (e.g. glycolysis).


V.A.5. Metabolic Resources

V.A.5.a. Compounds

The Chemical Entities of Biological Interest (ChEBI) database is a publicly-accessible database dedicated to describing chemical atoms or compounds that are known to participate in the metabolism of living organisms. It is maintained by the European Bioinformatics Institute (EBI), located in the UK. Large molecules (such as biopolymers) are not included in ChEBI. The ontology used in ChEBI conforms to directives from both the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB). PubChem is another public database, maintained at the National Center for Biotechnology Information (NCBI) in the USA, that describes molecules, complexes and mixtures. Both of these resources allow retrieval of various chemical data, as well as keyword or chemical structure searches.

V.A.5.b. Reactions and Enzymes

The International Union of Biochemistry and Molecular Biology (IUBMB [www.iubmb.org]) provides community interactions and statutes on values, ethics and standards in biochemistry-related scientific domains. Amongst other things, it proposes the EC (Enzyme Commission) classification scheme [www.chem.qmul.ac.uk/iubmb/enzyme], which is a vocabulary specific to metabolic reactions that are catalysed by enzymes (i.e. enzymatic activities). Distinct activities are identified by a 4-digit number, each digit referring to an increasingly precise biochemical description. The first digit, for example, describes a “reaction group type” and can be any number from 1 to 6 (1: oxidoreductases, 2: transferases, 3: hydrolases, 4: lyases, 5: isomerases, and 6: ligases). The 4th digit generally refers to substrate specificity. EC numbers present several inconveniences, among which: sub-class digits do not correspond between classes; the sequence-structure-EC number relationship is not straightforward; and EC numbers are not designed to take multi-functional enzymes into account. All of these are well-known problems for computational biology [100,199,200,183].

Quite obviously, EC numbers cannot be used for describing non-metabolic functions.


Illustration V.3: The Enzyme Commission number principle
Distinct activities acknowledged by the IUBMB are identified by a 4-digit number. Each digit refers to an increasingly precise biochemical description. The first digit describes a “reaction class”. The 2nd describes a “reaction sub-class”. The 3rd gives an indication of the involved compounds. The 4th digit is arbitrary, and can be considered to refer to substrate specificity.

Dedicated databases exist that catalogue enzymes and associated information. Some sequence databanks generically track which genes/proteins are associated with which enzymatic activities, e.g. UniProt [201]. The ENZYME database catalogues EC numbers that are associated with proteins in UniProtKB/Swiss-Prot, guaranteeing high-quality annotations [202]. The BRENDA database [203,204] is dedicated to compiling experimental biochemical information for each EC number in each organism, such as the enzyme reaction constants presented previously in this section, as well as the bibliographic references that back them. Rhea [http://www.ebi.ac.uk/rhea/] is the EBI's academic metabolic reaction database. However, many databases describing metabolic reactions also put them into biochemical contexts: metabolic pathways.

V.A.5.c. Metabolic pathways

Several bioinformatics resources exist that formalise metabolites, metabolic reactions and metabolic pathways in their own way.

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is, despite its name, first and foremost an academic database of metabolic reactions and pathways, with large pan-organism metabolic maps and organism-specific projections [205]. A KEGG map is made up of KEGG Reactions and KEGG Compounds. KEGG Reactions generally correspond to EC numbers, though many KEGG reactions are specific instances of generic EC numbers. KEGG supports the KEGG RPAIR database, which contains ReactionPairs. These are, in essence, pairs of compounds involved in a given reaction, with a focus on the chemical groups that are modified or transferred between molecules during the reaction. The KEGG LIGAND database is an integrated database containing (amongst other things) KEGG Reactions, Compounds, and RPAIRs. Other more specific resources are available, such as KEGG Glycan and KEGG DRUG. To complement the large KEGG maps, a smaller-scale definition of metabolic pathways is implemented in the KEGG modules resource.

EcoCyc, MetaCyc and BioCyc: Peter D. Karp et al. developed the EcoCyc database in 1997 [206,207]. EcoCyc comprised an original database system storing experimentally-demonstrated information on the genes, enzymes, and associated metabolic reactions and pathways found within the Escherichia coli K-12 genome, as well as a graphical interface for accessing the data. Since then, the same procedure has been applied to the genomes of many well-studied prokaryote and eukaryote organisms, creating what have become known as organism-specific pathway/genome databases (PGDBs). The collection of databases with experimental results is called MetaCyc, and together with EcoCyc they form the “tier 1” PGDBs. Building on the success of EcoCyc and MetaCyc, other *Cyc databases were created, using bioinformatics predictions benefiting from the previous experimental demonstrations in other organisms. These predictions were established by the Pathway Tools software suite [208,209], and the generated PGDBs are “tier 2” or “tier 3” according to the level of manual verification and curation they have undergone since. The MicroCyc component of the MicroScope platform is an extended pathway genome resource stemming from a “local” installation of the MetaCyc database and software. It contains the PGDBs of all the MicroScope organisms.

Reactome: Reactome [210–212] is an open-source expert-curated database of human reactions and pathways with multiple inter-database cross links, with a focus on inferring orthology events in other higher eukaryote species.

Unipathway: Unipathway [http://www.grenoble.prabi.fr/obiwarehouse/unipathway] is an expert-curated database with a novel reaction-chain model. It also stores metadata about its hierarchically-organised pathway definitions. It is integrated with the UniProtKB resource [213].

ExPASy: ExPASy provides web access to the Roche Applied Science “Biochemical Pathways” wall charts [http://web.expasy.org/pathways/]. The portal allows searching by keywords or EC numbers, and returns clickable pathway images with links to the ENZYME database. There is, however, no underlying data model allowing the resource to be queried with respect to the pathways.


A survey of metabolic resources, classified into “resource families” such as MetaCyc, KEGG, Reactome and BiGG, can be found in [214], where the authors compare the data types present, metabolism coverage, and analysis tools. During this thesis, I worked almost exclusively with KEGG and MetaCyc, due to their availability in the MicroScope platform. See [Illustration V.4: The allantoin degradation pathway in KEGG and MetaCyc] in the following section for a graphical representation of a pathway from each.

V.B. Dealing with Metabolic Pathways

V.B.1. Metabolic pathway representations

V.B.1.a. Representing pathways as networks

With the advent of computational biology arose the need to represent metabolic pathways using abstract, mathematical models that computers could process. A common and humanly-readable way of representing one or several metabolic pathways is as networks (also known as graphs). Three types of metabolic pathway network representation exist:

Compound-centred network: network nodes are chemical compounds, and directed network (hyper-)edges are the reactions that can transform the compounds. The use of hyperedges (i.e. edges with multiple start and end vertices) for reactions modifying several metabolites at once is useful for capturing reaction reversibility, though it implies a rather daunting network complexity. Hyperedges can be broken down into “normal” edges, though in this case a single reaction corresponds to multiple edges (and the same edge can correspond to multiple reactions).

Reaction-centred network: network nodes are the reactions, with network edges added between two reactions when they share a common compound. This representation is the least human-readable, as the conceptual link from a compound of interest to the clique (see graph theory annex) of reactions it induces in this sort of graph is neither easy nor readily visible.

Bipartite network: both reactions and compounds are represented, by two different types of nodes, and edges indicate compound participation in the linked reactions. This representation is the most understandable, but the existence of two types of nodes increases algorithmic and computational complexity, as well as posing reversibility-tracking problems. KEGG and MetaCyc use this kind of representation, even if their underlying data models are more complex (see [Illustration V.4: The allantoin degradation pathway in KEGG and MetaCyc] for a graphical representation of each).

Petri network: a Petri network is a special kind of bipartite graph composed of two types of nodes, called places and transitions, linked together by directed arcs. Tokens can occupy the places, and are moved along the arcs when transitions fire. Petri networks were initially designed to describe chemical reactions, but are also suited to metabolic networks: in this context, places represent types of molecules, tokens are instances of these molecules, and transitions are the metabolic reactions that transform them. A seminal paper presenting the methodology is [215].
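To make the first three representations concrete, here is a minimal sketch (in Python, using the networkx library; it illustrates the concepts above and is not code from any cited resource) that builds each representation for an invented two-reaction fragment of glycolysis:

import networkx as nx

# Toy data: each reaction maps its substrates to its products.
reactions = {
    "R1": {"substrates": ["glucose"], "products": ["G6P"]},
    "R2": {"substrates": ["G6P"], "products": ["F6P"]},
}

# 1. Compound-centred: nodes are compounds; each (substrate, product) pair of
#    a reaction yields one directed edge (a hyperedge broken down into edges).
compound_net = nx.DiGraph()
for rid, r in reactions.items():
    for s in r["substrates"]:
        for p in r["products"]:
            compound_net.add_edge(s, p, reaction=rid)

# 2. Reaction-centred: nodes are reactions; an edge links two reactions that
#    share at least one compound.
reaction_net = nx.Graph()
for r1 in reactions:
    for r2 in reactions:
        if r1 < r2:
            c1 = set(reactions[r1]["substrates"]) | set(reactions[r1]["products"])
            c2 = set(reactions[r2]["substrates"]) | set(reactions[r2]["products"])
            if c1 & c2:
                reaction_net.add_edge(r1, r2)

# 3. Bipartite: compounds and reactions are both nodes; directed edges mark
#    consumption (compound -> reaction) and production (reaction -> compound).
bipartite_net = nx.DiGraph()
for rid, r in reactions.items():
    bipartite_net.add_node(rid, kind="reaction")
    for s in r["substrates"]:
        bipartite_net.add_node(s, kind="compound")
        bipartite_net.add_edge(s, rid)
    for p in r["products"]:
        bipartite_net.add_node(p, kind="compound")
        bipartite_net.add_edge(rid, p)

The bipartite variant is reused in the sketches that follow.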

Other ways of modelling metabolic networks exist, though it is not necessary to develop them all in this thesis. Note that it is possible to pass from one network representation to another when the necessary information is available, which is particularly useful for producing human-readable results. In the rest of this manuscript, we will use compound, reaction or bipartite compound/reaction graph representations. For each of these, several choices must be made on how to process metabolic information.



Illustration V.4: The allantoin degradation pathway in KEGG and MetaCyc
a) What biochemists consider as the anaerobic degradation of allantoin pathway is, in the KEGG database, part of a broader metabolic map called “Purine Metabolism” (ec00230). The allantoin degradation part is surrounded by a red box. b) A dedicated pathway object exists in the MetaCyc database to describe this metabolic process (PWY0-41).


V.B.1.b. Metabolic network scope

First of all is the scope: a) does the network represent metabolic data for one organism, or for all organisms? and b) does the network represent a single metabolic pathway, or the sum of all metabolic pathways? One network of note is the Global Metabolic Network, which represents all metabolic reaction connectivity across all known organisms.

V.B.1.c. Metabolic network edge building

The second important choice is that of which compounds are used to link reactions together. As previously mentioned, historical metabolic pathways define main compounds and secondary compounds. Linking reactions by ubiquitous compounds such as water (a common product) or adenosine triphosphate (ATP, a common chemical energy carrier) results in highly connected networks which are not necessarily relevant, be it biologically, computationally or graphically. For example, the glycolysis pathway describes the general degradation of glucose and the use of the liberated energy to produce ATP; its representation should follow the fate of the atoms of the most important pathway-relevant compound, namely glucose itself. It is natural to link the successive reactions that transform glucose; it would not be natural to link together all the reactions of the pathway that produce (or consume) ATP: for an energy carrier, this would be biologically irrelevant, even though, in situ, ATP produced by one enzyme could very well be consumed by another enzyme of the pathway. Conversely, ATP can be a main compound in certain metabolic pathways, such as purine metabolism, where it is a substrate for RNA- and DNA-synthesising processes.

Different computational biology protocols have been developed to address the problem of identifying main compounds in metabolic networks:

• connect reactions by all shared compounds

• use main compound definitions where they are available

• attempt to infer main/secondary compound assignments

The first case is trivial, leads to heavily connected networks, and generally requires some sort of filtering of the results obtained with them. The second case relies on main compound definitions that can be extracted from experimental metabolic pathway definitions. The last case relies on predictive methods such as:


Compound removal: a heavily-used (ubiquitous) compound should be either removed or at least disfavoured, as it is less likely to be pathway-relevant.

Computational atom tracing: attempt to match computationally the atoms from the structures of a reaction's substrates and products; use this knowledge over several successive reactions to decide which compounds are “main”.

Compound removal on the basis of degree is commonly used, e.g. in [216]. Using weighted networks that impose a heavier cost for passing through ubiquitous compounds has since been shown to outperform previous protocols, and has been used, for example, in [217]. Atom matching corresponds much more closely to the biological point of view on metabolic networks (i.e. it focuses on the fate of key atoms or molecular substructures) and has been extensively explored over the past decade [218–224].
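As a minimal sketch of the degree-based heuristic (assuming the bipartite network built earlier, with nodes tagged by a "kind" attribute; the threshold value is arbitrary):

import networkx as nx

def filter_ubiquitous_compounds(net, max_degree=10):
    """Return a copy of a bipartite metabolic network from which compound
    nodes connected to more than max_degree reactions have been removed."""
    hubs = [n for n, data in net.nodes(data=True)
            if data.get("kind") == "compound" and net.degree(n) > max_degree]
    filtered = net.copy()
    filtered.remove_nodes_from(hubs)
    return filtered

A weighted variant would, instead of deleting the hubs, assign each compound node a traversal cost proportional to its degree, so that path-finding algorithms avoid ubiquitous compounds without forbidding them outright.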

V.B.2. Metabolic pathway reconstructions

In order to build a comprehensive global view of an organism's metabolism, it has become commonplace to carry out a “genome-scale metabolic reconstruction”. This is a computational network model of the chemical transformations that can take place within a cell, extracted and extrapolated from the sum total of the metabolic knowledge with which its genome is annotated. Metabolic reconstructions can then be used to complete current metabolic knowledge, to confront observed phenotypes with current metabolic knowledge, to compare the metabolisms of different organisms [225], to make metabolic predictions, to help with metabolic engineering, to help understand the evolution of metabolism, etc. [226].

V.B.2.a. Protocols

A metabolic reconstruction starts with the extraction of all known metabolic reactions assigned to the genes of the target genome. After this, two strategies are possible.

The first strategy is the ab initio reconstruction of metabolic pathways from known activities. The latter serve as anchor points, between which additional reactions can be added so that all reactions form a connected component. The choice of additional reactions can be based on connected reactions from a global metabolic map of all known metabolic activities, or on inferring required chemical transformations. For example, Boyer et al. [221] reconstruct pathways on the basis of atom transfers. Faust et al. [227] walk a global metabolic network with nodes weighted to favour non-ubiquitous compounds, so as to connect multiple known reactions. A review of such methods can be found in [226,227].

The other strategy is reconstruction by homology, where pathway existence is extrapolated from present reactions and known pathways in other, phylogenetically close organisms. This is typically what the PathoLogic software of the Pathway Tools suite does [209]. Given an initial genome annotation file (such as a GenBank file), PathoLogic first establishes the list of all known metabolic reactions in the genome using annotated EC numbers and/or product descriptions, before assessing whether there is sufficient evidence for the existence of each MetaCyc pathway within the studied organism⁸. KEGG, however, offers no such top-down reconstruction; its organism-specific pathway maps are merely projections of current knowledge onto global maps.

In both cases, inconsistencies (mis-annotations, dead-end metabolites, etc.) can be checked for and manually curated, iteratively improving the network [228].

V.B.2.b. Reaction gaps

With a metabolic reconstruction, dead-end metabolites (i.e., metabolites in a metabolic network that are either only consumed or only produced) and incomplete metabolic pathways become apparent. Reactions that are missing but required to complete the map are known as reaction gaps, and are often considered as potential metabolic functions that remain to be uncovered in the genome of the organism. In methods such as [216], reaction gaps derived from metabolic reconstructions are specifically targeted for the creation of novel functional annotations (see section VI.C.1.d for more detail).
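Finding dead-end metabolites is straightforward on the directed bipartite representation introduced earlier; here is a hedged sketch (assuming, as before, compound-to-reaction edges for consumption and reaction-to-compound edges for production):

def dead_end_metabolites(net):
    # A compound is a dead end when it is produced but never consumed,
    # or consumed but never produced.
    dead_ends = []
    for n, data in net.nodes(data=True):
        if data.get("kind") != "compound":
            continue
        produced = net.in_degree(n) > 0   # made by at least one reaction
        consumed = net.out_degree(n) > 0  # feeds at least one reaction
        if produced != consumed:          # exactly one of the two holds
            dead_ends.append(n)
    return dead_ends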

V.C. Biocatalysis applications

V.C.1. Enzyme promiscuity

Previously-discussed protein promiscuity also applies to enzymes. Enzyme-specific promiscuity has generally been classified into two types:

• Substrate promiscuity describes the fact that some enzymes may catalyse the same chemical reaction but on multiple similar substrates; the opposite of this is high substrate specificity. Both of these cases have been observed by biochemists.

• Catalytic promiscuity refers to the ability of an enzyme to catalyse different chemical reactions. This is often observed for multi-functional enzymes or protein complexes that catalyse successive steps of a given metabolic pathway.

8 Arbitrarily, predicted pathways are declared “possibly” present when their completion (i.e., the ratio of the number of pathway reactions present to the total number of reactions in the pathway) lies in ]0%; 50%]; they are declared “probably” present when it lies in ]50%; 100%[.

Comparing relative enzyme efficiencies (kcat) can indicate which substrates/reactions are the most important for the cell (the most important ones are expected to be subject to more stringent selection, which should render the enzyme more efficient for them), though in vitro and in vivo conditions might differ enough to scramble the signal, and substrate preference can be context-dependent [229]. It is even thought that entire metabolic pathways could present as-yet-unknown alternative activities that would be relevant only under specific conditions; this is called “underground metabolism” [230] and is thought to be behind much of the “dark matter” of modern metabolomics [229].

V.C.2. Industrial applications

Using the chemical transformation properties of living organisms is an age-old practice in human culture. Be it fermenting grain into beer or grapes into wine, leavening bread, or treating leather with enzymes, humans have exploited organic substances throughout history, without knowing exactly what they were doing until the 19th century. As the biological sciences progressed, the origin of biologically catalysed transformations became apparent, and enzymes started to be isolated from their parent organisms with a view to scaled-up applications. However, the lack of biological knowledge and know-how, and the inherent complexity of life, constrained biochemical efforts. Since the 1970s, these hurdles have been overcome thanks to genetic engineering and molecular biology, opening new possibilities for industrial development [231].

The advantages and drawbacks of biochemical over chemical transformations are numerous and cannot all be listed here. Of particular note to the non-specialist are (amongst others):

Advantages: enzymes often function under conditions not harmful to life, and are thus much easier to manipulate; they are very efficient catalysts and can be used in small quantities; and they are entirely biodegradable, with little lasting impact on the environment, which is particularly interesting in this day and age of green technologies.

Drawbacks: enzymes can be fragile; they can be hard to isolate in large quantities; they can depend on co-factors that require regenerating (such as nicotinamide adenine dinucleotide, NAD+, a common enzyme cofactor involved in many redox reactions); their kinetics are often limited by substrate/product concentrations; and the full spectrum of possible enzymatic transformations is not yet known.


It is this last point in particular that the methods explored in this thesis might help to alleviate.

Our knowledge of prokaryote (or even eukaryote, for that matter) metabolism is still riddled with gaping holes. The work in this thesis aims to address at least part of this problem by helping scientists to comprehend the inner workings of prokaryote organisms, which, as pointed out, might give rise to interesting industrial applications. In the following chapter, I shall focus on one specific issue in our knowledge of metabolism: the orphan enzyme. I will present the extent of this problem, as well as several existing methods that strive to address it.



VI. Orphan enzymes and how to adopt them

VI.A. Definitions

Diverse biochemical experiments can reveal metabolic enzymatic activities in cultivated organisms. However, actually associating gene/protein sequences with these activities is a tedious and costly process (see chapter IV). As a consequence, such associations lag behind the discovery of new activities, leading to an accumulation of “metabolic knowledge holes” in our representations of metabolism.

Two types of “holes” are formalised in [232], depending on the state of current metabolic knowledge. On the one hand, when either the presence of an incomplete metabolic pathway, or the existence of dead-ends in a metabolic network reconstruction, suggests that a given organism should be able to catalyse a metabolic reaction (for the pathway to be complete), but the gene for the corresponding enzyme is unknown, this organism is said to have a “reaction gap”. On the other hand, if an organism is known (thanks to experimental evidence) to produce an enzyme catalysing a given reaction, but the gene for it is unknown, then this metabolic activity is called an “orphan reaction”⁹. Due to the gene-reaction correspondence transiting via an enzyme, these orphan reactions are sometimes loosely nicknamed “orphan enzymes” for short. So as to be as precise as possible, I will employ the term “orphan reaction” to designate a “local sequence-orphan enzymatic activity”, where “local” indicates that the activity is a sequence-orphan in the studied organism. The reasons for this distinction will become clear in the paragraph after next.

Establishing the list of organisms able or unable to catalyse a given reaction is not a trivial endeavour. Indeed, resources such as BRENDA [203] catalogue known activities for each organism, but not their absence; correctly establishing per-organism orphan enzymes would require both types of knowledge. For this reason, a common working hypothesis is that all organisms can potentially encode all known activities, until the perfect annotation of their genomes proves otherwise. Hence, orphan enzymes can be supposed to exist even in organisms for which there is no experimental evidence either way.

Building upon this hypothesis, it becomes possible to define phylogenetic scopes for orphan enzymes. An enzymatic activity that is orphan in a single organism is a local orphan reaction; if it is orphan across a given clade (i.e. across a single branch of the Tree of Life), it is a clade-specific orphan reaction; finally, if it is orphan across all sequenced genomes, it is a global orphan reaction. An example of a clade-specific orphan enzymatic activity is EC 1.1.1.10 (L-xylulose reductase), which has known coding genes in several eukaryotes (including various fungi, Homo sapiens, Mus musculus...); although this activity is known to be catalysed in Erwinia sp., no prokaryote genes encoding it are known: it is thus a prokaryote orphan reaction.

9 Not to be confused with “ORFans”, i.e. Open Reading Frames with no detected homology to any already-annotated ORFs.

In the rest of this work, I will work with the hypothesis that all organisms could catalyse all reactions, and will restrict myself to prokaryote orphans unless otherwise specified.

The main objective of methods aiming to solve the orphan enzyme problem is to find candidate genes for the target activity. Candidates should preferably be ranked according to some measure of plausibility.

The term “candidate genes” is also used for another bioinformatics problem: in genetic studies, scientists attempt to discover which genes participate in a given phenotype or disease on the basis of experimental evidence. As with the orphan enzyme problem, the objective is to identify these genes, and to evaluate and rank the likelihood of their involvement in a higher-order process.

VI.B. Adoption status

The high number of global sequence-orphan enzymatic activities, and their impact on the performance of automatic gene annotation methods, has led several scientists to call for a “global enzymatic genomics initiative” [233–236]. Several surveys of orphan enzymes have been undertaken, trying to establish exactly how many orphan enzymes there are, and whether they are genuine orphans or merely artefactual, owing to scientists failing to submit results to journals or to biological databases [2,237]. In [2], Karp shows that almost 18% of orphan enzymatic activities are probably artefactual (a fraction evaluated on a sample of 228 out of the roughly 1,500 orphan activities at the time).

These calls are probably part of the driving force that led to the recent efforts of the UniProt Consortium [3]. In parallel, the authors of [238] have set up the OrEnzA database, dedicated to listing, for the most recent UniProt release, all orphan enzymatic activities (defined across all organisms, be they bacteria, archaea or eukaryotes).

Here, I present a brief database study that David VALLENET and I carried out in order to establish the adoption status of the EC numbers used in UniProt and the MicroScope platform.

VI.B.1.a. EC numbers over the years

The number of orphan enzymes varies over the years due to several factors:

• new EC numbers are discovered

• EC numbers are modified or removed

• coding genes for EC numbers are discovered and are published directly

• bibliographic efforts associate coding genes to EC numbers in old publications

• annotations are modified, corrected or removed

Here, we established a snapshot of the evolution of EC number discovery and gene association, allowing us to represent how badly annotations lag behind the discovery of new activities.

Protocol: We wanted to evaluate, for each year since the beginning of molecular biology:

• the number of EC numbers discovered;

• the number of EC numbers successfully associated to a gene/protein sequence in a scientific publication.

The list of EC numbers and their year of creation are accessible in the ENZYME database. In order to evaluate the date of the first successful gene/protein assignment for a given EC number, we decided to keep the earliest publication giving an EC-sequence association, with the additional constraint of ignoring publications concerning more than 10 proteins (a heuristic for avoiding whole-genome publications that cannot be considered as biologically precise enough for a given enzymatic activity).

To obtain our primary data, we followed the following protocol:

• From the ENZYME database, extract the list of all current EC numbers (thus ignoring deleted numbers). We obtain an up-to-date list of EC numbers.

• From the UniProt KnowledgeBase, extract the list of all proteins associated to an EC number (be they from the SwissProt component or the TrEMBL component). We thus have an instantaneous “snapshot” of the annotation state of all EC numbers.

• From PubMed, extract the list and correspondences of all publications referencing proteins.


From this data, we extracted, per year, the number of ECs discovered and an estimate of the number of ECs assigned to a protein/gene sequence.
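A hedged sketch of this tally (the three input structures are hypothetical, pre-parsed versions of the ENZYME, UniProtKB and PubMed extractions described above):

def first_association_years(ec_created, protein_ecs, publications,
                            max_proteins=10):
    """ec_created: {ec: year of creation}; protein_ecs: {protein: set of ECs};
    publications: iterable of (year, set of referenced proteins)."""
    first_assoc = {}
    for year, proteins in publications:
        if len(proteins) > max_proteins:   # skip whole-genome publications
            continue
        for prot in proteins:
            for ec in protein_ecs.get(prot, ()):
                if ec in ec_created:       # ignore deleted EC numbers
                    if ec not in first_assoc or year < first_assoc[ec]:
                        first_assoc[ec] = year
    return first_assoc

# Per-year counts for the two curves of Illustration VI.1 can then be
# obtained with collections.Counter over ec_created.values() (discoveries)
# and over first_association_years(...).values() (first gene assignments).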

Results: Results are given in [Illustration VI.1: EC numbers over time]. Biochemists have actively discovered new chemical reactions catalysed by biological agents over the past century. Now-computerised scientific publications indicate the use of IUBMB-defined EC numbers since the mid-1930s. Other publications show that the discovery of the protein sequences of the responsible enzymes started timidly in the 1950s with the development of Sanger protein sequencing. In the 1970s, the development of Sanger DNA sequencing greatly boosted the number of gene sequences associated to EC-bearing enzymes, while the development of expression cloning stimulated the discovery of novel enzymatic activities. Over the past decade, despite the completion of the genomic sequence of one, then thousands of organisms, both the discovery of new activities and that of coding sequences have dropped considerably. More worryingly, this drop has not allowed sequence discovery to catch up with activity discovery. This explains the 1,186 enzymatic activities (28.6% of 4,150) that remain sequence orphans to this day.

Illustration VI.1: EC numbers over time
Evolution of the number of declared EC numbers and of the number of those with at least one gene in UniProt.


These estimations are not fixed over time, however. Indeed, as already said, activity descriptions evolve, and bibliographic computerisation efforts improve access to results dating back to the start of the century; UniProt is currently undertaking such an effort [3]. The differences between the counts from 2007 and today are not graphically visible, yet the total fraction of orphans has fallen from 35% to 27% in the intervening time, despite an increase in the EC number count (data not shown).

VI.B.1.b. Orphan activity counts

As a result of the previously represented history, a certain number of metabolic activities are considered globally orphan today. Establishing an up-to-date list of them is of prime importance for bioinformatics methods that focus on finding candidate genes for orphan enzymes.

Protocol: As before, the set of all EC numbers was extracted from ENZYME. All annotations were extracted from the UniProt KnowledgeBase, KEGG and MicroScope's database. Finally, the OrEnzA database was queried.

Results: The following results were obtained in autumn 2010:

Database        Number of EC numbers    Percentage of total
ENZYME          4,150                   100.00%

Number of global orphan EC numbers:
UniProtKB       1,186                   28.58%
- SwissProt     1,664                   40.10%
- TrEMBL        1,440                   34.70%
KEGG            1,938                   46.70%
MicroScope      2,038                   49.11%
OrEnzA          1,170                   28.19%
All             1,132                   27.28%

SwissProt has a higher level of global orphan EC numbers than TrEMBL, as the latter mostly (but not exclusively) contains bioinformatics predictions that have not yet been sufficiently verified for integration into the former.

MicroScope has the highest level of orphan EC numbers; however, this is to be expected, as it is limited to prokaryote organisms, while some EC numbers can be eukaryote-specific.


All of these resources, however, have a small number of non-orphan EC numbers specific to them (data not shown), likely due to differences in procedures and effort investment. This is underlined by the lower global orphan EC fraction obtained when considering the union of all the resources (27.3%).

The OrEnzA resource should have exactly the same count of orphans as UniProtKB, as it is derived from it using the same protocol presented above; the observed difference is due to a difference in releases.

VI.C. Methods for finding candidate genes

As already stated, homology-based annotation methods are incapable of solving the orphan enzyme problem, for lack of known sequences for the target activities; context-based methods are not so limited. A review of web-based tools that bioanalysts can use manually to find the most promising candidate genes for a target reaction (not necessarily an orphan) can be found in [239]. Below, I review several context-based methods specifically capable of proposing candidate genes for orphan enzymes.

VI.C.1.a. Annotation platforms

In section III.B.2, I pointed out how annotation platforms allow bioanalysts to act as human integrators of diverse sequence- and context-based information sources in order to brew up a plausible functional annotation of a target gene. A bioanalyst might thus be able to propose candidate genes for an orphan enzyme; one such success story can be found in [240]. However, bioinformatics aims (amongst other things) to develop increasingly automated methodologies, simplifying and speeding up manual analysis.

VI.C.1.b. Exploitation of STRING data

STRING is not an annotation platform, and it does not provide functional predictions based on its associations; it has nonetheless been manually exploited by its users to find candidate genes for missing enzymes in several showcases [241–243]. To my knowledge, however, no automated method exploits STRING data specifically in order to find candidates for orphan enzymes. At best, [157] use STRING data with their “elementary mode”-based pathway reconstruction method in order to extract lists of candidate genes for reaction gaps in the purine and pyrimidine metabolic pathways.


VI.C.1.c. FRECs

The authors of [144] were the first to combine metabolic context with genomic proximity in order to locate sets of co-localised genes coding enzymes that catalyse metabolically linked reactions, allowing gaps; they named these sets Functionally Related Enzyme Clusters (FRECs). Their method represents genomes as circular graphs with edges capturing gene co-localisation, ignoring transcription direction. Their KEGG-based metabolic network had reaction nodes linked by shared reactants, i.e. compounds that are products of one reaction and substrates of the other. Correspondences between the two graphs are derived from existing KEGG EC number annotations.

Their algorithm is heuristic, and searches for clusters of nodes in each graph that correspond via these annotations, allowing for gene and reaction gaps. It is similar to a hierarchical ascending clustering algorithm: starting from the set of all gene-reaction associations as initial clusters, the latter are progressively aggregated using a single-linkage algorithm that processes a distance matrix derived from two separate distances, one calculated on the gene graph and the other on the reaction graph. More specifically, if Ci and Cj are two clusters, with d_g(i,j) the distance between them in the gene graph and d_r(i,j) the distance between them in the reaction graph, then the integrated distance δ(i,j) is a binary value that equals 1 only when the minimum distance between all pairs of genes from Ci and Cj is less than or equal to 1 + Gap_gene, and the minimum distance between all pairs of reactions from Ci and Cj is less than or equal to 1 + Gap_reaction.
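A minimal sketch of this binary integrated distance (the cluster and distance structures are invented for illustration; d_gene and d_reaction are assumed to be precomputed shortest-path distance lookups):

def integrated_distance(cluster_i, cluster_j, d_gene, d_reaction,
                        gap_gene=1, gap_reaction=1):
    # delta(i,j) = 1 when both minimum inter-cluster distances fall
    # within their allowed gaps, 0 otherwise.
    min_dg = min(d_gene[g1][g2]
                 for g1 in cluster_i["genes"] for g2 in cluster_j["genes"])
    min_dr = min(d_reaction[r1][r2]
                 for r1 in cluster_i["reactions"] for r2 in cluster_j["reactions"])
    ok = (min_dg <= 1 + gap_gene) and (min_dr <= 1 + gap_reaction)
    return 1 if ok else 0

Single-linkage aggregation then simply merges any two clusters whose integrated distance equals 1, until no such pair remains.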

Conservation of located FRECs across genomes was evaluated manually until the slightly later work of [143], which used the same graph comparison method along with another algorithm in order to locate a) conserved syntenies between pairs of genomes and then b) conserved syntenies across all genomes. This served as a basis for establishing KEGG's ortholog families (KEGG KOs).

These works were continued in [244], where the authors integrate several functional dependency indicators into a single one using a supervised learning approach. They train a kernel canonical correlation analysis algorithm [245] to predict protein-protein associations from kernel similarities calculated on gene cluster and phylogenetic profile data. Candidates for missing enzymes in known metabolic pathways are manually selected amongst the network neighbours of genes present in the given pathways, preferring genes whose annotations already partially match the EC number of the target reaction.

VI.C.1.d. ADOMETA

Various methods have been developed in the Vitkup lab, the current showcase being ADOMETA (Adoption of Metabolic Genes for Orphan Activities), their orphan enzyme candidate gene finder, which exploits genomic and expression context information [246,216]. Its general procedure is given in [Illustration VI.2: ADOMETA General procedure]. ADOMETA uses organism-specific metabolic reconstructions to define EC-number, reaction-centred metabolic networks with degree-filtered compounds [Figure step 1]. Each reaction node is then “populated” with known enzyme-coding genes when available; unpopulated nodes are local orphan enzymes [Figure step 2]. The metabolic context of a target orphan is the set of all reactions connected to it by a path of maximum length k (k is typically set to 1, 2 or 3) [Figure step 3]. ADOMETA compiles several context-based functional dependency measures: co-expression profile similarity, phylogenetic profile similarity, gene fusion/fission score, ordered gene clustering score, and protein interaction data. ADOMETA then builds an integrated score using two approaches: “direct likelihood-ratio” integration, and integration using the ADABOOST [247] algorithm. These scores are then used to rank genes of unknown function for gene-less reaction nodes [Figure steps 4 and 5]. A candidate gene's functional dependency for a target gene-less reaction is a weighted combination of its functional dependencies with the genes populating the target reaction's neighbour nodes; any of the component dependency measures, as well as the two integrated measures, can be used. The ADOMETA web server gives access to all these scores and ranks candidate genes based on them. ADOMETA is the first resource, to my knowledge, that can effectively propose candidate genes for orphan enzymes using contextual information in an entirely automated approach.
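The final ranking step can be sketched as follows (a simplified reading of the procedure above, with invented data structures: dependency maps a gene pair to its integrated functional dependency score, and weights gives the weight of each neighbour gene):

def rank_candidates(candidates, neighbour_genes, dependency, weights):
    # A candidate's fitness is the weighted sum of its dependency scores
    # with the genes populating the orphan reaction's neighbour nodes.
    scores = {}
    for cand in candidates:
        scores[cand] = sum(weights[g] * dependency.get((cand, g), 0.0)
                           for g in neighbour_genes)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)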


Illustration VI.2: ADOMETA General procedure
From left to right: 1. For a target organism, retrieve the metabolic network and the list of genes. 2. Populate reaction nodes with genes according to gene annotations. Unassigned genes form the candidate gene list. 3. The local functional dependency network (metabolic context) for a target orphan enzyme is the set of reactions connected to the orphan activity in the metabolic network. The local functional dependency network for a candidate gene is extracted. A “fitness” score is derived for the candidate gene based on its functional dependency scores with the genes encoding reactions from the orphan reaction's metabolic context. 4. All candidates are scored, then ranked.

VI.C.1.e. Pathway Hole Filler

A naïve Bayesian network-based approach called the Pathway Hole Filler (PHF) [248] integrates genomic (BLAST data) and metabolic (reaction adjacency, pathway “directon”) information in order to improve homology-based functional transfer. However, PHF cannot work for global orphan enzymes. This shortcoming was addressed in [249,248], where the authors adapted their previous strategy to allow for context-based functional dependence. Firstly, they extracted protein-protein association data from ProLinks (gene neighbours, gene clusters, gene fusion, and phylogenetic profiles) in order to derive a binary functional dependence network. Secondly, they modified the strategy for proposing gene candidates for specific activities. In the case of orphan activities, not all genes from a genome were considered as candidates; rather, they were selected on the basis of three rules. Given a set of “seed” genes coding proteins that catalyse other reaction steps of the pathways containing the target orphan activity, candidate genes must be 1) in the same transcription directon as at least one of the seed genes, or 2) in the immediate network neighbourhood (as defined by the ProLinks functional associations rendered binary) of at least one seed gene, and 3) not encoders of proteins already known to participate in the same pathway (allowing existing annotations to be questioned). Finally, they extended their original Bayesian classifier with new nodes to take context-dependent information into account, though exactly how is not described in their paper [248]. These modifications allowed candidate genes for global orphan enzymes to be proposed, scored and ranked in all organisms with a PGDB. They validated their new approach by performing a 5- or 10-fold cross-validation on a restricted list¹⁰ of known reaction-coding genes from EcoCyc [207], as well as on a less well-curated PGDB, CauloCyc.
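The three selection rules translate directly into code; here is a hedged sketch with invented structures (directons maps each gene to its transcription directon, assoc_net is the binary ProLinks association network, e.g. a networkx Graph, and pathway_genes are the genes already assigned to the target pathway):

def select_candidates(genome_genes, seeds, directons, assoc_net, pathway_genes):
    candidates = set()
    for gene in genome_genes:
        same_directon = any(directons[gene] == directons[s] for s in seeds)  # rule 1
        neighbour = any(assoc_net.has_edge(gene, s) for s in seeds)          # rule 2
        if (same_directon or neighbour) and gene not in pathway_genes:       # rule 3
            candidates.add(gene)
    return candidates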

VI.C.1.f. Miscellaneous

A couple of other methodologies might be worth mentioning for the more algorithm-hungry readers. Yao et al. [250] use a k-nearest-neighbour clustering approach in a feature space integrating several more traditional dependency measures (expression correlation, chromosomal distance, gene clusters, and paralogy) into one using a chosen kernel. The obtained clusters are assigned functional terms from KEGG maps, COGs, or MultiFun using a voting scheme that allows for confidence estimation. Chen et al. [133] use an algorithm based on path-walking through a gene co-localisation/sequence similarity-based graph, reminiscent of SNAP, to derive a novel measure of functional dependency. Missing enzymes in KEGG pathways are proposed and ranked on the basis of their functional dependency with genes already participating in those pathways.

The small number of accessible automated methods for finding candidate genes for orphan enzymes is one of the driving forces behind our intent to develop the “finding Candidate Genes for Orphan Enzymes” (CanOE) strategy, as presented in the following chapter.

10 Namely, a list from which multi-functional enzyme-coding genes were removed.


VII. Development: The CanOE strategy

The CanOE (“finding Candidate Genes for Orphan Enzymes”) strategy is described in the scientific article (and its associated supplementary material) included hereafter. Some points are discussed in more detail here, particularly concerning the experiments and technical implementation that were a major part of my work and that are not put forward in the article. Finally, additional perspectives and considerations are discussed concerning the use of the strategy.

VII.A. Overview

As already stated, a large fraction (27% at the latest estimate) of all EC numbers refer to sequence-orphan enzymatic activities. The use of context-based inference is a necessity, as the usual homology-based annotation transfer methods cannot annotate genes with these activities. The aim of this work was to propose a new strategy, inspired by bioanalysts' manual modus operandi, that could use the genomic and metabolic context information contained in MicroScope's database to propose candidate genes for global (or at least prokaryote) orphan activities. This is done using a graph-based algorithm that locates groups of co-localised genes coding enzymes catalysing groups of reactions that are “close” in the global metabolic network, specifically allowing for gaps. Potential associations are proposed between gene and reaction gaps, and prioritised using a score integrating results over all organisms by a family-based approach. The high-ranking functional hypotheses generated can then be verified experimentally, for example by the LGBM (laboratoire de génomique et de biochimie du métabolisme) and LCAB (laboratoire de clonage et de criblage des activités de bioconversion) teams at the Genoscope, which have a working history of elucidating orphan enzyme puzzles [240,251,252]. The strategy must be accessible to all MicroScope users, preferentially via dedicated web interfaces.

VII.B. Article

Author Contributions and Acknowledgements: Alexander SMITH and David VALLENET designed the strategy. AATS implemented it, as well as the web interface. DV and Gregory SALVIGNOL adapted the web interface for its public deployment. DV and Eugeni BELDA directed the bioanalysis of the case study. Claudine MEDIGUE and Alain VIARI reviewed the manuscript. AV led the development of the CCCPart algorithm and graciously granted us permission to use it in this application. Damien MORNICO and François LEFEVRE assisted with parsing MetaCyc data into the MicroScope database. Marcel SALANOUBAT and Alain PERRET are involved in the tentative biochemical validation of the CanOE case study presented in the article.



VII.C. Used concepts and tools

The CanOE strategy uses few outside tools, but each is of particular importance.

VII.C.1. The CCCPart algorithm

The Common Connected Component Partitioner (CCCPart, or simply C3P) was developed in 2005 by Frederic BOYER and co-workers [5]; I summarise its general modus operandi here (see [Illustration VII.1: The CCCPart algorithm], borrowed from [5]).

Consider two unweighted, undirected graphs G1 and G2, whose nodes are linked (or not) by a set of binary relationships R. These two graphs can be summarised in a single graph, called the correspondence multigraph, whose “multi-nodes” represent distinct pairs of nodes from G1 and G2 that are linked in R, and which carries two types of edges: “G1 edges” between multi-nodes whose G1 vertices are linked by an edge in G1, and “G2 edges” between multi-nodes whose G2 vertices are linked by an edge in G2.

Connected components are, intuitively, groups of nodes in a given graph that can all be reached from one another by walking the graph. Common Connected Components (CCCs) extend this notion to multigraphs: they are sets of vertices such that every vertex is reachable from every other vertex through each type of edge taken separately.
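A naive way of computing CCCs is by partition refinement: repeatedly split each block of the current partition into connected components under each edge type, until the partition is stable. The following sketch illustrates the idea (it is not the published CCCPart algorithm, which is considerably more refined):

import networkx as nx

def common_connected_components(nodes, edges_g1, edges_g2):
    blocks = [set(nodes)]
    changed = True
    while changed:
        changed = False
        for edge_set in (edges_g1, edges_g2):
            new_blocks = []
            for block in blocks:
                g = nx.Graph()
                g.add_nodes_from(block)
                g.add_edges_from((u, v) for u, v in edge_set
                                 if u in block and v in block)
                components = list(nx.connected_components(g))
                if len(components) > 1:
                    changed = True   # the block was split: iterate again
                new_blocks.extend(components)
            blocks = new_blocks
    return blocks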


Illustration VII.1: The CCCPart algorithm
The metabolic network on the left and the gene graph on the bottom (with correspondences between the two shown by shared colours/hatches) are merged into a single graph, the multigraph, where each node corresponds to a (gene, reaction) pair, i.e. an annotation. Two types of edges can connect two multi-nodes: those corresponding to metabolic adjacency in the metabolic graph, and those corresponding to gene adjacency in the gene graph. For example, (R2, G5) is connected to (R7, G6) by a gene edge because G5 and G6 are adjacent. CCCPart creates a partition of the multigraph, where each cluster (Common Connected Component, CCC) corresponds to a component that is connected for each type of edge (e.g. (R2, G5), (R1, G4) and (R3, G3) are in the same CCC because (R2, G5) is connected to (R1, G4) and (R3, G3) by reaction edges, and (R1, G4) is connected to (R2, G5) and (R3, G3) by gene edges).

The mathematical definition and algorithmic treatment of the problem of locating the set of CCCs between two input graphs are given in [5]. The algorithm is fast and deterministic, unlike previous similar approaches [144,253]. Its algorithmic complexity is slightly worse than that of [254,255], but [5] show that the time-limiting step is not CCC calculation but multigraph construction. This has been greatly improved in subsequent versions of CCCPart [146,147,256].

CCCPart and its more recent versions have been used in several applications, such as locating syntons or finding conserved modules between protein-protein interaction networks [6,256]. Here, we decided to apply it to locating units of metabolic function on prokaryote genomes, also known as “metabolons”, as suggested by the authors. This required some adaptations, especially the recovery of metabolon gene and reaction gaps from the CCCs returned by the CCCPart algorithm. The most important details of this development are described in the paper's Supplementary Material.

VII.C.2. OrthoMCL

The gene/protein families for the multi-organism metabolon integration step were built by a home-made program heavily inspired by the OrthoMCL software [68], a well-known program that had already proven itself for protein clustering on sequence similarity [257].


The reasons we did not use the original program for CanOE were: a) the version of OrthoMCL available at the time (version 2, published in February 2008) did not allow us to use the protein similarity data already stored in the MicroScope database, requiring a complete recalculation of similarities, which was not tractable; b) that version was implemented in Perl, a scripting language none of the LABGeM team were comfortable with; and c) studies by Damien MORNICO and myself suggested that it had bugs invalidating some of its results. OrthoMCL has since been completely overhauled by its authors (version 5 was published in March 2011), but we will continue to use our own version until the MicroScope policy on explicitly integrating gene families into its annotation pipeline has been determined.

The principle of OrthoMCL is relatively straightforward: using all-against-all protein sequence similarity results from BLAST, it builds an n×n similarity matrix (where n is the number of proteins) summarising the similarities as scores. These scores are normalised following a procedure that deprecates similarities between phylogenetically close organisms and takes into account inferred ortholog/paralog relationships between proteins. This matrix is then submitted to MCL, the Markov Clustering algorithm devised by Stijn van Dongen [67], a now well-known and proven weighted graph-clustering algorithm that has since been applied to many different problems [258].

The MCL algorithm is based on the notion of random walks in a graph: two nodes should cluster together if the probability of randomly walking from one to the other is high. Simply put, the MCL algorithm iteratively updates positive edge weights (possibly creating or removing edges), favouring those that correspond to high random-walk probabilities during an “expansion” step, and penalising those with lower probabilities during an “inflation” step. The process (almost always) converges, and the final clusters can (almost always) be identified as distinct connected components in the final graph. Many indicators are available at program termination, allowing a user to manually assess how well the clustering performed (clustering agreements, sum of edge weights preserved/cut, number of singleton nodes, and many more).
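A bare-bones sketch of the expansion/inflation loop on a column-stochastic matrix (real implementations add pruning and convergence checks; the parameter values here are only illustrative):

import numpy as np

def mcl(adjacency, inflation=2.0, iterations=50, self_loops=1.0):
    # adjacency: square symmetric numpy array of non-negative edge weights.
    m = adjacency.astype(float) + self_loops * np.eye(len(adjacency))
    m /= m.sum(axis=0)            # make columns stochastic
    for _ in range(iterations):
        m = m @ m                 # expansion: spread random-walk probabilities
        m = m ** inflation        # inflation: strengthen strong edges
        m /= m.sum(axis=0)        # renormalise columns
    return m                      # clusters are read off as the connected
                                  # components of the non-zero structure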

VII.C.3. Metabolic Data used in CanOE

In the CanOE article, the only metabolic data used is that from the MetaCyc database; however, in other experiments, data from the KEGG database were used, and other sources could be. Indeed, CanOE's implementation surrounding CCCPart was designed to allow a certain degree of freedom regarding the source of metabolic data, i.e. metabolic database specificities are NOT hard-coded into the program. Rather, metabolic database data is parsed into a CanOE-specific, standardised MicroScope database prior to running, and is retrieved by CanOE using generic Structured Query Language (SQL¹¹) scripts.

In the article, we presented results calculated on a metabolic network built from MetaCyc data [259]. However, during the conception of the CanOE strategy, several different protocols were tested for creating different metabolic networks, referred to as “metabolic schemas”. These other schemas are briefly described below, along with the reasons behind our final choice of publishing only the MetaCyc schema (though the use of KEGG as a metabolic data source is suggested in the supplementary material).

The first metabolic networks I tested were based on KEGG data [260] stored locally in MicroScope. Reactions were KEGG reactions (removing multi-step reactions for which each step had a corresponding specific reaction). Gene-reaction associations were created based on gene-EC associations from MicroScope and EC-to-KEGG-reaction correspondences. One EC could correspond to many KEGG reactions and vice versa, posing spurious multiplicity problems, which were the primary impetus for the development of the metabolon gap-filtering step mentioned previously.

VII.C.3.a. “Main KEGG Reaction” Schema

The “main KEGG reaction” schema was a first attempt at translating data from KEGG's LIGAND database into a global network of reactions connected by edges when biologically relevant “main” compounds were shared (i.e. an important product of one reaction was an important substrate of the other). At the time, I had not found any way to extract the notion of “main compound” from the KEGG data, and resorted to an ad hoc approach already used in previous works (such as [261,216,262] and many others): some ubiquitous compounds (such as water, oxygen, carbon dioxide, protons...) were not used to generate edges, and others were eliminated based on the high number of edges their use would generate (compounds from KEGG's specialist GLYCAN database were not used either, though most had corresponding compounds in LIGAND). Also, KEGG reactions known to be multi-step reactions (each known step of which already had a corresponding KEGG reaction) were ignored. Finally, given that the assignment of KEGG reactions to MicroScope genes was based on EC number correspondences, I was forced to ignore KEGG reactions corresponding to more than 20 distinct EC numbers in order to avoid creating simply unbelievable metabolons.

11 For some additional detail on SQL, see part VII.D.1.

Metabolons obtained with this schema generally had a high gene-reaction multiplicity due to the EC-to-KEGG-reaction conversion, and showed many cases of biologically irrelevant reaction associations due to reaction-reaction edges created from non-main compounds that the above heuristic failed to recognise as such (e.g. acetyl-CoA).

KEGG Reaction-based networks can be hard to interpret without representing Compounds as nodes.

VII.C.3.b. “Main KEGG RPAIR” Schema

The “main KEGG RPAIR” schema was a later attempt at achieving a more relevant metabolic network (with fewer spurious edges) by exploiting not reaction descriptions but RPAIR knowledge [263]. RPAIRs (reactant pairs) are defined for each metabolic reaction as pairs of compounds between which chemical functional groups are transferred during the reaction, and are instrumental in computationally following atom trajectories through a metabolic pathway [264]. They are created by a chemical graph alignment algorithm and are manually curated to ensure biological relevance. RPAIRs are classified into different types, the most remarkable of which are “main” (i.e. describing the exchange of functional groups between “main” compounds), “trans” (i.e. describing the transfer of functional groups from a secondary metabolite to a main one) and “leave” (i.e. for a functional group that becomes an isolated compound, as in decarboxylation). RPAIRs have been used as a proxy for main compound definitions in order to refine KEGG metabolic knowledge in previous works [223,217]. They are, however, even more difficult to interpret in a network than Reactions, given the lack of Compound representation and the added complexity that several RPAIRs can belong to the same reaction.

In this first RPAIR-based metabolic schema, the metabolic network nodes were RPAIRs rather than Reactions, connected via edges corresponding to shared compounds. Only RPAIRs of type “main” were used, ensuring the use of “main” compounds in connecting the RPAIRs. As with the main KEGG reaction schema, I filtered out RPAIRs belonging to multi-step reactions, as well as those corresponding to too many EC numbers to be usable.

Despite the use of “main” RPAIRs only, the resulting network was rather dense, even after filtering. This is probably due to the high number of existing RPAIRs, and to the fact that many reactions have several RPAIRs, one or more of which can be “main”.


VII.C.3.c. “Main KEGG RPAIR-to-Reaction” Schema

In order to address the difficult interpretation of the RPAIR-based schema, I decided to try a Reaction-centred network with edges defined on the basis of RPAIRs. Put simply, the “main” RPAIR-based network presented in the previous schema was transformed by replacing groups of RPAIRs with their corresponding Reactions, while conserving connectivity. The resulting network was sparser than the KEGG RPAIR one, but denser than the KEGG Reaction one. Furthermore, it was difficult to establish whether this schema actually outperformed the previous two.
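The collapse itself is simple; here is a hedged sketch (rpair_to_reactions, mapping each RPAIR to the set of Reactions it belongs to, is an assumed precomputed lookup):

import networkx as nx

def collapse_rpairs_to_reactions(rpair_net, rpair_to_reactions):
    # Replace RPAIR nodes by their Reaction(s) while conserving the
    # connectivity of the "main" RPAIR network.
    reaction_net = nx.Graph()
    for rp1, rp2 in rpair_net.edges():
        for r1 in rpair_to_reactions[rp1]:
            for r2 in rpair_to_reactions[rp2]:
                if r1 != r2:   # drop self-links created by the collapse
                    reaction_net.add_edge(r1, r2)
    return reaction_net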

VII.C.3.d. “KEGG MAP” Schema

KEGG is a resource in constant development. During my thesis, KEGG released XML-formatted files for their KEGG Maps, each describing a large set of reactions belonging to a high-level metabolic process, such as glycolysis (ID: map00010), purine metabolism (ID: map00230) or the sum of all metabolic pathways (ID: map01100). Of particular interest here is the fact that reactions are linked in these maps by KEGG-defined main compounds. Unfortunately, due to 1) the large size of KEGG maps (each including several of what biochemists would usually consider metabolic pathways), 2) the existence of errors in the XML data, and 3) the possibility for the same compound to appear in several disconnected locations on the same map, we were unsatisfied with this schema. The generated metabolons still contained many spurious reaction-to-reaction edges, or missed metabolons altogether, and we were forced to abandon the use of this otherwise promising metabolic schema.

VII.C.3.e. KEGG afterword

KEGG was, and still is, an important source of computationally formatted metabolic data that is widely used in the field of bioinformatics, even though the resource itself was never designed nor funded as a public database [http://www.genome.jp/kegg/docs/plea.html]. Unfortunately, due to concurrent circumstantial pressures, the Kanehisa laboratories are no longer able to fund the maintenance and development of KEGG beyond that of KEGG MEDICUS, a medically orientated subset database. KEGG data shall remain web-accessible, but will only be available for download to paying subscribers, though the fee shall be cheaper for purely academic clients.

VII.C.3.f. “MetaCyc” Schema

After the disappointing results with the KEGG-based schemas, and thanks to the continued development and integration of MetaCyc data into MicroScope, we decided to turn to this metabolic data resource instead. In the “MetaCyc” metabolic schema, two reactions are linked by an edge when a product of one is a substrate of the other. This information is readily available in the MetaCyc data system, and has been parsed into the MicroScope “MicroCyc” database. To avoid the high-connectivity problems that are common when building metabolic networks (as was the case in the KEGG-based schemas), we limited such shared compounds to “main” compounds. In the original MetaCyc data, “main” compounds can be extracted relatively easily for a pair of reactions belonging to the same metabolic pathway. However, in order to link reactions between pathways, we relied on the pathway-level inter-pathway connections also available in the MetaCyc data. These connections allow the reactions of one pathway to be linked to a reaction of another pathway, and we considered “main” any product of one that was a substrate of the other. Note that we only considered “reciprocal” inter-pathway links, i.e. for reaction R1 in pathway P1 and reaction R2 in pathway P2 to be connected, it was necessary for P1 to link R1 to P2, and for P2 to link R2 to P1. We enforced this condition as there are still many “imprecisions” in the MetaCyc data that would otherwise lead to incorrect pathway connections.
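The reciprocity rule can be sketched as follows (links is an assumed pre-parsed collection of (pathway, reaction, target_pathway) triples extracted from the MetaCyc inter-pathway connections):

def reciprocal_reaction_pairs(links):
    # Yield (R1, R2) pairs whose pathways point at each other; the final
    # edge additionally requires a shared main compound (a product of one
    # being a substrate of the other), which is checked elsewhere.
    links = list(links)
    for (p1, r1, p2) in links:
        for (q1, r2, q2) in links:
            if q1 == p2 and q2 == p1:
                yield (r1, r2)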

Some numbers describing the presented metabolic network schemas are given in the table below:

Schema            NbCs    NbOKCs  NbRs    NbGlobOrphReacs  NbOKRs  NbEs     NbOKEs
KEGG_mainR        16,429  3,263   8,395   3,234            5,617   30,381   21,723
KEGG_mainRPAIR    16,429  3,488   12,460  8,771            4,603   43,919   14,166
KEGG_mainRPAIR_R  16,429  3,476   8,395   3,234            6,256   235,081  29,283
KEGG_map          16,429  2,419   8,395   3,234            4,495   13,898   12,081
MetaCyc           10,801  10,801  9,531   5,157            5,157   5,661    5,661

Illustration VII.2: Summary Description of Metabolic Network Schemas
Each row corresponds to a metabolic network schema, from top to bottom: KEGG_mainR (main KEGG reaction schema), KEGG_mainRPAIR (main KEGG RPAIR schema), KEGG_mainRPAIR_R (main KEGG RPAIR-to-Reaction schema), KEGG_map (KEGG map schema) and MetaCyc (MetaCyc schema). Columns from left to right: NbCs (number of compounds), NbOKCs (number of compounds having passed pre-processing), NbRs (number of reactions), NbGlobOrphReacs (number of global orphan reactions), NbOKRs (number of reactions having passed pre-processing), NbEs (number of edges), NbOKEs (number of edges having passed pre-processing).

The MetaCyc schema has the highest number of compounds kept after the pre-processing filters as, indeed, no filters based on compound usage or reaction degrees are used in it. However, it also has the sparsest network in terms of edges. Though this might reduce the number of alternate paths between any two reactions, we are assured, thanks to the network construction protocol, that all of these edges are of biological relevance, unlike in the KEGG-based schemas.

VII.D. MicroScope & CanOE data models

The CanOE strategy takes its data from, and stores its results in, MicroScope's PkGDB database system. It is thus necessary to present the latter in order to detail CanOE's implementation.

VII.D.1. MicroScope PkGDB

PkGDB is MicroScope's relational database system. Relational databases are a particular type of database wherein different types of data are modelled with various n-ary relationships between them [265]; they are particularly useful for storing large amounts of structured data. Actually accessing the data requires the use of a querying language: SQL (“Structured Query Language”) is the most widely used scripting language for constructing and querying relational databases. MySQL is an open-source relational database management system including both a database server and an SQL querying interface. PkGDB is a MySQL database system, and working with it requires knowledge of the way its data is modelled. Here, I shall present the databases and tables within PkGDB that are most relevant to my work and to the comprehension of the MicroScope platform as a whole.

VII.D.1.a. Primary Data

As explained in [6,4], MicroScope collects and compiles primary data from many different sources in order to propose the most complete vision possible of genomic sequence annotation. Such sources include UniProt, KEGG, MetaCyc, ENZYME, COG, InterPro, and others. Data from these sources are generally organised into specific PkGDB databases, and can be queried in conjunction with MicroScope-specific data.

VII.D.1.b. Core MicroScope data

The core MicroScope-specific data is contained in the “pkgdb” database. MicroScope is a platform dedicated to annotating prokaryote genomes (see chapter IV for details on annotation). The first tables of interest are thus those that describe the genomes contained within. The “Organism” table contains one row per prokaryote organism whose genome is available in MicroScope. Rows in the “Replicon” table describe, for each organism, the various DNA molecules that compose the organism's genome (chromosomes, plasmids, megaplasmids...). Finally, the “Sequence” table describes the DNA sequences available for the previously described replicons: indeed, organisms can be re-sequenced (or, more often, re-assembled and finished with more recent methods), rendering previous sequences obsolete. Additionally, this table tracks the public or private status of some of MicroScope's organism sequences (as some projects are submitted for private annotation), as well as the advancement of each sequence through MicroScope's automated annotation pipeline (syntactic or functional). Now that individual sequences have been defined, I can describe how individual “genes” are modelled in the database.

The atomic unit of the PkGDB database system is the Genomic Object (GO). The current implementation defines a Genomic Object as a short stretch of genome (generally a CDS, a falsely-predicted CDS, or ribosomal/transfer RNAs) that has been annotated (either by the automatic pipeline or by an expert bioanalyst). With each new annotation of a given stretch, a new Genomic Object is created, with a historical reference to the first Genomic Object for that given stretch (if any), ensuring that annotation history can be recovered if required. Annotation information contained in this table includes the Genomic Object type (CDS, fCDS12, rRNA, tRNA...), sequence frame (-3 to +3), start and stop positions, annotation status (automatic annotation finished, artefact, curated, in progress...), a reference label, gene names & synonyms if any, a description of the GO product, comments, any EC numbers or MetaCyc reactions in the case of a metabolic gene, and whether the GO is obsolete or not, amongst other things. Given the number of organisms, sequences, and the archiving of annotation history, this table is very large (32 columns, over 7.6×10⁶ rows, 4.7 gigabytes on the 30th of August 2011).

12 fCDS: “fragment of coding sequence”, typically a CDS that has been broken during evolution by one or several nucleotide mutations (i.e. a pseudogene).
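To fix ideas, a condensed sketch of such a Genomic Object table is given below; it shows only a subset of the 32 columns described above, and all column names and types are assumptions:

-- Condensed, hypothetical sketch of the Genomic Object table.
CREATE TABLE Genomic_Object (
    GO_id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    GO_ori_id   INT UNSIGNED NULL,       -- first GO created on this stretch, if any
    S_id        INT UNSIGNED NOT NULL,   -- owning Sequence
    GO_type     ENUM('CDS','fCDS','rRNA','tRNA') NOT NULL,
    GO_frame    TINYINT NOT NULL,        -- -3 to +3
    GO_begin    INT UNSIGNED NOT NULL,   -- start position
    GO_end      INT UNSIGNED NOT NULL,   -- stop position
    GO_status   VARCHAR(32) NOT NULL,    -- e.g. 'automatic', 'curated', 'artefact'
    GO_product  TEXT,                    -- description of the GO product
    GO_EC       VARCHAR(64) NULL,        -- EC number(s) for metabolic genes
    GO_obsolete TINYINT(1) NOT NULL DEFAULT 0,
    PRIMARY KEY (GO_id),
    KEY (S_id)
) ENGINE=MyISAM;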


Illustration VII.3: The PkGDB Genomic Object Model One Genomic Object (identified by a GO_id) corresponds to one annotation of a stretch of nucleotides. Re-annotating a previous stretch generates a new Genomic Object; previous versions are referenced by the 'GO_ori_id' field which points to the GO_id of the first Genomic Object having been created on the given stretch of DNA. Genomic Objects belong to a given Replicon sequence (one replicon can have several Sequences, but only a single one that is up-to-date). An Organism can contain several Replicons (chromosomes, plasmids, etc).

VII.D.1.c. Predicted annotations

The predicted annotations established by MicroScope's automatic pipeline are stored in separate tables, some of which reside in the “pkgdb” database. One notable exception concerns the tables containing BLAST and synteny data for all-versus-all protein pairs, which are stored in the “GO_CPD” database. Obviously, all-versus-all protein comparisons require a lot of storage (the GO_CPD PkGDB database currently takes up 4 terabytes, though it also contains data other than sequence similarities, such as phylogenetic profile data), and querying a single table containing all the data would be computationally infeasible. To address this problem, the data was split into multiple tables: basically, one table per genomic Sequence, containing all GO BLAST results against GOs from all other genomic Sequences, and tracking which BLAST hits are part of a synteny or not. Fusion/fission data is also stored in this manner.
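The per-Sequence split could look like the following sketch; the table and column names are invented for illustration, and only the general shape reflects the description above:

-- One such table would exist per genomic Sequence (here, sequence 42).
CREATE TABLE GO_CPD.BLAST_results_S42 (
    GO_id_query   INT UNSIGNED NOT NULL,          -- GO on sequence 42
    GO_id_subject INT UNSIGNED NOT NULL,          -- GO on any other sequence
    identity_pct  FLOAT NOT NULL,
    evalue        DOUBLE NOT NULL,
    in_synteny    TINYINT(1) NOT NULL DEFAULT 0,  -- is this hit part of a synteny?
    KEY (GO_id_query)
) ENGINE=MyISAM;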

VII.D.2. CanOE database

CanOE exploits gene and reaction data extracted from the MicroScope database. However, during its initial conception phases, it appeared that running CanOE would be computationally intensive, thus precluding regular atomic updates for each new annotation made using the platform.


It was thus necessary to ensure that CanOE would be run at regular, not-too-frequent intervals, and that the underlying data did not change in between. The CanOE database was designed to host a working copy of relevant MicroScope data, formatted for quick access, without losing references to “active” MicroScope data. Here, I shall present an ideal database structure for CanOE as I see it; its current implementation is not yet optimised, though it should be during the development of the deliverable version of CanOE. I might thus make references to the improvements to PkGDB and the CanOE data models that I propose in section VII.D.3.

VII.D.2.a. Primary data: Genes and gene data

The CanOE database contains a copy of the latest data from the MicroScope “pkgdb” database, a snapshot taken at a recorded point in time. The Organism, Replicon and Sequence tables are present, as in “pkgdb”, though only current public & private sequences are allowed, leading to a 1-to-1 relationship between replicon and sequence ids. The Genomic Object table is split in such a way as to optimise some queries. First, no annotation history is required, thus eliminating the need for a historical table. Second, a correspondence table tracking sequence and genomic object ids is stored, called Genomic_Object_ID (akin to the Genomic Object table proposed in my new data model in later section VII.D.3). Genomic_Object_Data contains all GO-specific data that is relevant to CanOE (akin to the Genomic Object Annotation table proposed in my new data model).

Illustration VII.4: Schema of the CanOE Genomic Object Data model The Genomic Object data model, in CanOE, is composed of Organism objects, Genomic Objects, corresponding Gene Vertices, and the Gene-to-gene Edges. Not all foreign key relationships are shown, for clarity.
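A sketch of this split follows; beyond the IDs named in the text, the column sets are assumptions:

CREATE TABLE Genomic_Object_ID (           -- correspondence table
    GO_ori_id INT UNSIGNED NOT NULL,
    S_id      INT UNSIGNED NOT NULL,
    PRIMARY KEY (GO_ori_id),
    KEY (S_id)
);

CREATE TABLE Genomic_Object_Data (         -- GO data relevant to CanOE
    GO_ori_id  INT UNSIGNED NOT NULL,      -- 1-to-1 with Genomic_Object_ID
    GO_begin   INT UNSIGNED NOT NULL,
    GO_end     INT UNSIGNED NOT NULL,
    GO_product TEXT,
    PRIMARY KEY (GO_ori_id)
);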


VII.D.2.b. Primary data: Metabolic Network Reactions

Metabolic data is saved using a different approach. The database was designed to store data parsed from MicroScope metabolic databases into a CanOE-specific data model. In CanOE metabolons, reactions are represented by vertices, and edges correspond to main compound sharing. The model was designed so that metabolic reactions could be extracted from any type of metabolic database (e.g. MetaCyc or KEGG). I call the parsed data for a given metabolic database, extracted following a given protocol, a “Metabolic Schema”. Table MNW_Schemas (MNW stands for Metabolic NetWork, and is the prefix of all CanOE metabolic data tables & IDs) describes the various available Metabolic NetWork schemas; table MNW_Reactions associates a distinct ID to any kind of metabolic reaction.

Illustration VII.5: Schema of the CanOE Metabolic Network Model The CanOE Metabolic Network Model is composed of metabolic Schemas, Compounds, Reactions and Reaction-to-reaction Edges. The Reaction_Info table contains data from the source metabolic database of each Reaction. Compound and Reaction degrees are stored in separate tables for the filtering phase. Not all foreign key relationships are shown for clarity.
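In condensed, hypothetical form, the two tables just described could look as follows (only the table names and the ID scheme come from the text):

CREATE TABLE MNW_Schemas (
    MNW_S_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    MNW_S_name VARCHAR(64) NOT NULL,       -- e.g. 'MetaCyc', 'KEGG_mainRPAIR'
    PRIMARY KEY (MNW_S_id)
);

CREATE TABLE MNW_Reactions (
    MNW_R_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    MNW_S_id  SMALLINT UNSIGNED NOT NULL,  -- owning Metabolic Schema
    source_id VARCHAR(64) NOT NULL,        -- reaction ID in the source database
    PRIMARY KEY (MNW_R_id),
    KEY (MNW_S_id)
);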


VII.D.2.c. Primary data: Gene-reaction associations

Thirdly, correspondences between GOs and metabolic reactions are stored in separate tables. Two types of association are kept. Metabolic data was initially stored in MicroScope using EC numbers (though MetaCyc reactions have since been added), so a GO_EC_CPD table was created in CanOE in order to store a snapshot of all (GO, EC) annotations for all GOs. Then, schema-specific SQL scripts (dependent or not on the GO_EC_CPD table) are executed in order to fill a GO_MNWR_CPD table that tracks correspondences between GOs and schema-specific reaction identifiers.

Illustration VII.6: Schema of the CanOE Primary association tables The primary association tables describe the initial knowledge about gene-reaction associations given to CanOE. Reaction occurrences are counted per organism and globally, in order to determine per-organism and MicroScope-wide orphan reactions.
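A schema-specific fill script could reduce to a single INSERT...SELECT of the following shape; the MNW_Reaction_ECs lookup table is a hypothetical stand-in for however a given schema maps EC numbers to its reaction IDs:

-- Hypothetical sketch: project (GO, EC) annotations onto schema reactions.
INSERT INTO GO_MNWR_CPD (GO_ori_id, MNW_R_id)
SELECT DISTINCT g.GO_ori_id, m.MNW_R_id
FROM GO_EC_CPD AS g
JOIN MNW_Reaction_ECs AS m ON m.EC_number = g.EC_number;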

VII.D.2.d. Primary data: gene and reaction graphs

In order to simplify and speed up the execution of the Metaboloniser algorithm, gene and reaction graphs are prepared beforehand and stored in the database (see [Illustration VII.4: Schema of the CanOE Genomic Object Data model]). The Gene_Vertices and Gene_Edges tables describe the gene graph, with indexed references to organism and sequence IDs to speed up queries. The MNW_Reaction_Edges table contains the reaction graph edges, and the already-existing MNW_Reactions table describes the reaction graph vertices (see [Illustration VII.5: Schema of the CanOE Metabolic Network Model]).
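For the gene graph, edge preparation could amount to linking consecutive genes along each sequence, as in the sketch below; the gene_rank position column is an assumption, and the actual adjacency rules may allow larger gaps:

-- Hypothetical sketch: one edge per pair of consecutive genes on a sequence.
INSERT INTO Gene_Edges (GO_ori_id_1, GO_ori_id_2, S_id)
SELECT a.GO_ori_id, b.GO_ori_id, a.S_id
FROM Gene_Vertices AS a
JOIN Gene_Vertices AS b
  ON b.S_id = a.S_id
 AND b.gene_rank = a.gene_rank + 1;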


VII.D.2.e. Secondary data: Metabolons

A metabolon is a graph containing two types of nodes (gene, reaction) and three types of edges (gene-reaction associations, gene-gene adjacency, reaction-reaction compound sharing). Furthermore, the gene-reaction associations are of several types: Known annotation, Potential association, and Inferred annotation. It was thus necessary to come up with a data model that could save all this information as efficiently as possible.

The first obvious step was a table listing all found metabolons in all organisms, containing several indexes for easy and rapid access. The Metabolon_List table compiles references for the organism and the sequence, as well as containing an automatically-incremented primary ID for each metabolon. This table can then be joined to the other tables describing the contents of each metabolon.

The gene-gene and reaction-reaction edges are already stored as primary data and do not need to be saved again. The “most defining” objects in a metabolon are, due to the way CCCPart works, the Known gene-reaction associations it contains; these, however, are also primary data, and are not an intuitive way of dealing with metabolons. I thought it simpler to save, for each metabolon, the list of vertices it contains; edges are then recoverable using these and the primary data. Tables Metabolon_Vertices_GOs and Metabolon_Vertices_Reactions store, for each metabolon ID, the GO_ori_id or MNW_R_id of its genes or reactions, respectively. They also store whether each vertex is primary (i.e. participates in a Known association) or secondary (gap). However, generated Potential gene-reaction associations also need to be stored, and MinPathLengths need to be calculated for them and for the Known associations. Thus, I created the Metabolon_Assocs table that stores, for each metabolon ID, the list of (GO_ori_id, MNW_R_id) pairs that it contains, with the MPL and its type (“Known”, “Potential”, or “Imaginary”, a type added for benchmarking purposes13).

13 Reminder: Inferred associations are created from Potential associations later on in the CanOE pipeline.


Illustration VII.7: Schema of the CanOE Metabolon data model Metabolons are described by their component vertices (Genomic Objects and Reactions), and by the gene-reaction edges they contain, be they Known or Potential (pre-integration) ones (gene-gene and reaction-reaction edges can be recovered from the primary data). A metabolon summary table is also available.
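A condensed sketch of the central Metabolon_Assocs table follows; the types and constraints are assumptions:

CREATE TABLE Metabolon_Assocs (
    Metabolon_id INT UNSIGNED NOT NULL,   -- references Metabolon_List
    GO_ori_id    INT UNSIGNED NOT NULL,   -- gene vertex
    MNW_R_id     INT UNSIGNED NOT NULL,   -- reaction vertex
    MPL          SMALLINT UNSIGNED NULL,  -- MinPathLength of the pair
    assoc_type   ENUM('Known','Potential','Imaginary') NOT NULL,
    PRIMARY KEY (Metabolon_id, GO_ori_id, MNW_R_id)
);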


VII.D.2.f. Tertiary data: Gene families

For the next steps of the CanOE strategy, the gene families established by our OrthoMCL-like protocol are required (see part VII.C.2). The results of the gene-family building algorithm are stored in the database in the MCL_clusters table, in the form of (GO_ori_id , CL_id) associations. Additional family information is stored in the MCL_CL_Data table, such as cluster size and family metabolic status as determined by the Gene Ontology-based protocol (described in the article's Supplementary Material).

Illustration VII.8: Schema of the CanOE Metabolon integration data The central table for the integration of CanOE results across organisms is the MCL_clusters table, describing gene assignments to gene families. Family-wide data (notably which families are metabolic or not) is stored in MCL_CL_Data. Metabolon-based gene-reaction associations are parsed into the Metaboloniser_KnownAssocs, Metaboloniser_PotAssocs, and Metaboloniser_InferredAssocs tables, and are integrated into family-reaction associations (with scores) described in table MCL_MNWR_CPD. The scores are then used to rank genes in each organism for each reaction, ranks that are stored in GO_MNWR_Ranks.


VII.D.2.g. Tertiary data: Known, Potential and Inferred associations

The previously-saved metabolon gene-reaction associations are processed during the CanOE pipeline into distinct tables according to their type. Known associations are saved as-is in table Metaboloniser_KnownAssocs. As described in the article, Potential associations (from table Metabolon_Assocs) are then processed into final Potential associations (in table Metaboloniser_PotAssocs) and into Inferred associations (in table Metaboloniser_InferredAssocs), taking care to delete Potential associations corresponding to non-metabolic gene families or to genes/reactions that are no longer gaps at the inference step. See [Illustration VII.8: Schema of the CanOE Metabolon integration data].
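The non-metabolic-family pruning step, for instance, can be expressed as a multi-table DELETE of the following shape; the is_metabolic flag column is an assumed name:

-- Hypothetical sketch: drop Potential associations whose gene belongs
-- to a family flagged as non-metabolic.
DELETE p
FROM Metaboloniser_PotAssocs AS p
JOIN MCL_clusters AS c ON c.GO_ori_id = p.GO_ori_id
JOIN MCL_CL_Data  AS d ON d.CL_id = c.CL_id
WHERE d.is_metabolic = 0;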

VII.D.2.h. Tertiary data: Family-Reaction Association scores and ranks

The previous family and gene-reaction association data is compiled into family-reaction association data and transformed into the R=>F and F=>R scores using the formulae described in the article. These scores are stored in table MCL_MNWR_CPD. Then, for each organism and each reaction, the list of candidate genes for that reaction within that organism is ranked according to each score, and these ranks are stored in table GO_MNWR_Ranks. See [Illustration VII.8: Schema of the CanOE Metabolon integration data].
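Conceptually, the per-organism, per-reaction ranking is a windowed RANK over a scored candidate list, as sketched below. Window functions require MySQL 8+, which postdates the system described here; the candidate_scores source and the score column name are assumptions:

-- Hypothetical sketch of rank computation for the R=>F score.
INSERT INTO GO_MNWR_Ranks (O_id, MNW_R_id, GO_ori_id, R2F_rank)
SELECT O_id, MNW_R_id, GO_ori_id,
       RANK() OVER (PARTITION BY O_id, MNW_R_id
                    ORDER BY R2F_score DESC) AS R2F_rank
FROM candidate_scores;  -- assumed intermediate view or table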

Taken together, the data stored in the previously described tables describes in full detail all the objects and results obtained in the CanOE strategy, and it is queried by the CanOE web interface presented in section VII.F.2. However, the data models found within PkGDB, including those of CanOE, do not always conform to the ideal versions presented here, and even the latter could bear improvement on some points. Having given the matter some thought, I have decided to lay my ideas down in writing here.

VII.D.3. Proposed PkGDB & CANOEDB improvements

MicroScope has grown over the years from a tool dedicated to the annotation of a single genome into a full-blown comparative genomics prokaryote annotation platform containing over 1,400 genomes. The production, database and visualisation systems (hardware and software) behind it have also evolved, and are still perfectly capable of handling the requirements of over 6,000 annotations per month14. However, given the exponential increase in sequenced genomes, the MicroScope platform is expected to hit its limits in the years to come. Efforts are being planned to address this problem, such as the recruitment of a database specialist.

Indeed, one of the most criticised points, on which improvements will have to focus, is the design of the database. Though I am in no way a database specialist, I have decided to transcribe here my thoughts on this problem, in the hope that they might speed up future work. In any case, they shall allow the reader to understand the discrepancies between the PkGDB database system, the implementations, and the ideals I discuss in this manuscript concerning my thesis projects.

First of all, some renaming efforts are required. Foremost, the whole database system should have a name distinct from those of its child databases. Furthermore, the “pkgdb” database should have both a development version and a production version. The development version would obviously be used by the development team for tests, limiting the risk of corrupting the production database and of blocking SQL access to it for users. Currently, there is only one database, which serves both development and production.

As pointed out previously, the database designs relevant to my projects are somewhat different from those found in the PkGDB database system; indeed, I propose an alternative data model for Genomic Objects. The new model has several objectives: it would be more efficient, especially space-wise, and also more intuitive, than the current one. The rationale behind it is basically to a) split Genomic Objects into Genomic Objects (GOs) and Genomic Object Annotations (GOAs), and b) keep historical annotations in a different table from current annotations.

Genomic Objects should refer in a fixed way to specific DNA stretches on a genome sequence (though start and stop positions can be modified, and objects can be declared obsolete). Genomic Object Annotations would then be associated to these objects. Indeed, a GO does not evolve much over time, though multiple GOAs can be created for a single GO. Since only one GOA is current at any time, this one could be kept for rapid access in a GOA table; historical GOAs could be transferred to a historical GOA table for future reference (especially since consulting the annotation history in MicroScope requires a specific action from the user). This would allow the main GOA table to be much smaller (far fewer rows, and no “GO_update” index column) without losing data accessibility15.

14 Average over year 2010. 15 The same strategy could be applied to the Sequence table, which would then become obsolete: the Replicon table itself could contain data for the latest Sequence, and past Sequences could be kept in an OldSequence table. This idea is not presented in the figure on the next page.
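A condensed sketch of the proposed model follows; the column subsets and types are assumptions, and the historical table would mirror the GOA table without the uniqueness constraint:

CREATE TABLE Genomic_Object (              -- fixed reference to a DNA stretch
    GO_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    S_id     INT UNSIGNED NOT NULL,
    GO_begin INT UNSIGNED NOT NULL,
    GO_end   INT UNSIGNED NOT NULL,
    obsolete TINYINT(1) NOT NULL DEFAULT 0,
    PRIMARY KEY (GO_id)
);

CREATE TABLE Genomic_Object_Annotations (  -- current GOAs only
    GOA_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    GO_id      INT UNSIGNED NOT NULL,
    S_id       INT UNSIGNED NOT NULL,      -- kept for quick reference
    GO_product TEXT,
    PRIMARY KEY (GOA_id),
    UNIQUE KEY (GO_id)                     -- at most one current GOA per GO
);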


Illustration VII.9: A proposed PkGDB Genomic Object Model One Genomic Object (identified by a 'GO_id') corresponds to one identified stretch of nucleotides of interest. Annotating this stretch generates a new Genomic Object Annotation (identified by a 'GOA_id'). Any previous Genomic Object Annotation of the same Genomic Object is moved to the Genomic_Object_OldAnnotations table for tracing reasons. As previously, Genomic Objects belong to a given Replicon sequence, which in turn belongs to one Organism. The 'S_id' field is kept in the Genomic Object Annotation tables for quick reference.

On a more technical note, relational databases offer the possibility of using “Foreign Keys”. These describe referential constraints between two tables, ensuring that rows in one table correctly refer to rows in another. For example, it should not be possible to create a GO for a non-existent Sequence. Another example is ensuring that if a referred-to row is deleted, all rows relating to it are also deleted. Basically, foreign keys enforce database coherence and integrity, and also provide a framework on which to build efficient indexes for linking data together. Currently, no foreign keys are used in PkGDB except in some specialised tables. However, it is unclear whether they could be used at all, as foreign keys are only available for “InnoDB” type tables, and most PkGDB tables are “MyISAM” (a choice motivated by performance issues; the interested reader can find a comparison of the two types at [http://www.kavoir.com/2009/09/mysql-engines-innodb-vs-myisam-a-comparison-of-pros-and-cons.html]). This choice may change if the previously-proposed new model improves performance.
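As an illustration of the kind of constraint discussed here (InnoDB only, and both tables must use that engine; the names are illustrative):

CREATE TABLE Genomic_Object (
    GO_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    S_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (GO_id),
    FOREIGN KEY (S_id) REFERENCES Sequence (S_id)
        ON DELETE CASCADE   -- deleting a Sequence removes its GOs
) ENGINE=InnoDB;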

In the same vein, a few tables are not designed correctly with respect to their primary keys. A primary key is a combination of table fields whose values ensure that each row is unique, and is often used in cross-referencing tables. When no such combination exists, it becomes necessary to create an additional column in order to establish a primary key. It can also be useful to do this when the primary key would otherwise be complex to reference completely from another table (i.e. it has multiple columns, each with multiple distinct values). Similarly, indexes help speed up table searches for specific field combinations. I have noticed a few tables where the creation of a dedicated primary key column was not justified given the other available columns (e.g. PkGDB GO_PRIAM_CPD), or where existing indexes were debatable or at least required documentation (e.g. MicroCyc Metacyc_Pathway). Reworking these instances would save database space and ensure additional database integrity.

Many of these improvements would be beneficial to the database. However, none have yet been started, as they would require profound adjustments across the MicroScope platform (in PkGDB itself, but also in MaGe and the production system). There are probably other possible improvements that I have not noticed and that require the intervention of a specialist to ensure that MicroScope can keep delivering a high-quality prokaryote genome annotation platform service.

VII.E. Benchmarking

In order to validate the CanOE strategy, I conducted several benchmarking experiments, of which only one is briefly described in the published article and its Supplementary Material. Details for all experiments are given in this section.

VII.E.1. Validating the MinPathLength prior

VII.E.1.a. Protocols

The first part of the CanOE strategy that requires some form of validation is our use of the “MinPathLength” (MPL) prior. As described in the paper, we hypothesise that the graph walk-based distances between correctly associated genes and reactions should be shorter than those between incorrectly associated genes and reactions. The gene-reaction and integrated family-reaction scores include an association weighting protocol that uses these distances to favour gene-reaction associations in which the gene and the reaction are “close” in the metabolon.

The idea behind the MPL (as described in the paper's supplementary material) is that genes are more likely to be roughly co-linear with their catalysed reactions in a metabolon than to be associated with them completely at random. This idea is illustrated in the figure below. To measure this MPL distance for Known or Potential associations, a path-walking algorithm finds all the shortest paths between the gene and the reaction. Only Known associations can be walked between genes and reactions. If the studied association is itself Known, it cannot be used to walk straight to the reaction, otherwise all Known MPLs would be worth 1 and would not capture local metabolon structure. In order to validate this prior, I conducted an experiment.

Illustration VII.10: The MinPathLength prior An example metabolon illustrating our MinPathlength prior. Two gap genes (g2 and g8) are potential candidates for reaction gap r3. Any bioanalyst would consider g2 to be the most likely candidate given the collinear structures in the metabolon. The MinPathLength for each candidate gene is given in pink: g2 is indeed the best candidate according to this measure. Please note that the MPL of gene g8 would not be different if genes g6, g7, g8 and g9 were reversed: the MPL captures local collinearity.

This experiment simply established the distribution of the MPL across all Known gene-reaction associations, in the hope of showing that the distribution is highly skewed towards small values.
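Formally, the weight given downstream to an association $a$ is simply the reciprocal of its minimum path length (the “1/MPL” prior referred to in the results below):

\[ w(a) = \frac{1}{\mathrm{MPL}(a)} \]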

VII.E.1.b. Results

The first experiment and its results are included in the Supplementary Material of the article. I show that the distribution does heavily favour short MPLs, though it does not exclude MPLs of up to 10 or more. The distribution itself could have been used to derive a more accurate MPL-weight prior (such as taking quantiles rather than 1/MPL), though I deemed this additional complexity unnecessary in the already complicated CanOE pipeline.

It might be possible to imagine a more rigorous way of verifying this prior. I did not, however, manage to come up with a way of doing this that I thought was implementable in the time left to me for my PhD studies.


VII.E.2. Validating the entire approach

VII.E.2.a. Protocols

The “VerticalOrpheny1” protocol was the first benchmarking protocol I came up with in order to test how well the CanOE method worked and how informative our integrated scores were. Since CanOE deals with identifying candidate genes for orphan reactions, it had to be tested on known cases, i.e. on Known gene-reaction associations. The general idea was simple: for a given set of reactions, separately render each one orphan (i.e. remove all Known gene-reaction associations concerning that reaction from all organisms), relaunch the CanOE pipeline, and evaluate how well it recovered the removed information in Potential associations and family-reaction scores.

This protocol is computationally heavy, as it involves reproducing the whole CanOE pipeline (primary graph building, metabolon locating, family building, score integration, ranking) for each and every reaction thus “orphaned”. Two choices helped reduce the experimental load. First, we hypothesised that rendering reactions orphan would at worst remove gene families from the CanOE data, but not perturb gene family construction; this allowed us to skip the family reconstruction step, using the families built during the normal CanOE run.

We also limited the number of reactions to “render orphan” to those that had at least one in-metabolon Known association in a highly-curated target organism. This typically led us to keeping approximately 340 reactions when using E. coli K-12 as a reference. Both these limits (no family recalculation, limited reactions) allowed us to perform the VerticalOrpheny1 benchmarking in a decent amount of time.

Processing the results (metabolons, gene/reaction vertices, gene-reaction associations, gene families, gene/family-reaction association scores) across the several hundred experimental CanOE runs was a complex task as no comparable benchmarking work was available in the literature. I shall present the points deemed important to benchmark below.

Element Recovery: Establishing which metabolons in the experimental runs corresponded to those of the original run was done using Known gene-reaction association sharing (i.e. if an experimental metabolon contained at least two Known gene-reaction associations in common with a metabolon in the original run, the experimental one was considered a “child” of the original one, thus allowing a one-to-many relationship in the case of metabolon breaking). Using these correspondences, it became possible to track which elements (i.e. metabolons, genes, reactions, associations etc.) were lost and which were recovered. In particular, it was possible to estimate the impact of multi-functional genes. Indeed, “orphaning” one reaction of a multifunctional gene leads to the loss of the association (as the gene is still associated with the other reaction, it cannot be a candidate for the orphaned one), but not of the gene, which can bias gene/family-reaction association scores. Results could be considered across all organisms, or only in the chosen reference organism. Please note that, unlike in the following section, association scores and ranks are not considered for this point.

Association Recovery: The objective of the VerticalOrpheny1 benchmarking was to evaluate how well removed Known associations were recovered as Potential associations. Any Potential association can be scored according to gene-reaction association scores or corresponding family-reaction association scores, as described in the paper. Then, for each organism, candidate genes associated with the target “orphaned” reaction can be ranked according to these scores. A cut-off on scores or ranks can be used; any Potential associations kept after cut-off are then positive hits. When Potential associations actually correspond to removed Known associations, these are true associations. It is thus possible to define True Positive, False Positive, and False Negative associations, for varying cut-off levels, that can be applied to gene-level or family-level scores or ranks. This is illustrated in the figure below. With these values, it is then possible to draw Precision/Recall curves, which are a handy way of graphically assessing how well a prediction method performs.


Illustration VII.11: Annotation recovery in CanOE benchmarking Reaction r4 is rendered orphan by deleting all Known gene-reaction associations involving it. In the presented metabolon, gene g4 catalysed reaction r4. In the new metabolon below, genes g2, g4 and g7 are candidates for r4, and g4 has also become candidate for reaction r3. When not applying any cut-off, amongst the created Potential associations, 1 is a True Positive hit, and 3 are False Positive hits. False Negative hits occur when a gene is not proposed as a candidate for the reaction it catalysed prior to the latter being rendered orphan (due to gene multi-functionality, enzyme subunits, or metabolon partial/total loss).
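For reference, the Precision and Recall values plotted in all the following figures follow the standard definitions over these True Positive (TP), False Positive (FP) and False Negative (FN) counts:

\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \]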

Findable orphaned reactions: As will be seen, VerticalOrpheny1 benchmarking results were rather catastrophic. Indeed, many metabolons are small, and rendering many reactions orphan leads to the loss of entire metabolons; reactions close to the extremities of metabolons are also easy to lose. Because of this, it seemed necessary to evaluate benchmarking results on reactions that were deemed “findable”, i.e. for which at least one Potential association was recovered (irrespective of it being correct or not). This allows the results to reflect only cases where the CanOE strategy could effectively come up with something, and together with the previous results gives insight into how much information CanOE might actually be missing.

VII.E.2.b. Results

Element recovery: The first figure below gives the distribution of metabolon loss/retrieval cases. Results are pooled across all experiments, for a total of almost 64,000 metabolons. As is rapidly observed, a whopping 50% of the metabolons that had originally contained a Known association involving the target reaction lost gene, reaction and association after the target reaction was orphaned. 11% lost gene and association (meaning the reaction was still present as a gap), and 21% lost reaction and association (meaning the gene was still present as a gap). Interestingly, 7% still contained both gene and reaction, but the association was not recovered as a Potential. This corresponds to cases of multi-functional genes (a gene with multiple functions, one of which is the target reaction, cannot be proposed as a candidate for the target reaction once the latter has been orphaned) or multimeric enzymes (a target reaction associated with multiple genes cannot have candidate genes proposed for it). Finally, only 10% of metabolons actually recovered removed gene-reaction associations as Potentials.

Illustration VII.12: VerticalOrpheny1 element recovery details Metabolon element recovery/loss details across a total of 63,963 metabolons. During VerticalOrpheny1 benchmarking, the target association can be lost from a metabolon in multiple ways: loss of the gene (“0g” instead of “1g”), loss of the reaction (“0r” instead of “1r”), or simply loss of the association itself (“0a” instead of “1a”). Several cases thus appear: complete loss “0g+0r=0a”, gene loss “0g+1r=0a”, reaction loss “1g+0r=0a”, association loss “1g+1r=0a” (which occurs for enzymes with subunits or multi-functional genes), and finally association recovery “1g+1r=1a”.


When only considering findable reactions (see below), things improve somewhat. Most importantly, metabolons containing retrieved gene-reaction associations now represent 30% of the total, which is now only about 16,000 metabolons. As expected, complete losses have decreased, now affecting 36% of metabolons (not 0%, as the definition of “findable” does not remove all metabolons with lost associations, only those whose target reaction had no recovered associations anywhere).

Illustration VII.13: VerticalOrpheny1 findable element recovery details Metabolon element recovery/loss details across a total of 16,082 metabolons. During VerticalOrpheny1 benchmarking, the target association can be lost from a metabolon in multiple ways: loss of the gene (“0g” instead of “1g”), loss of the reaction (“0r” instead of “1r”), or simply loss of the association itself (“0a” instead of “1a”). Several cases thus appear: complete loss “0g+0r=0a”, gene loss “0g+1r=0a”, reaction loss “1g+0r=0a”, association loss “1g+1r=0a” (which occurs for enzymes with subunits or multi-functional genes), and finally association recovery “1g+1r=1a”.

These results encourage us to use the “findable” filter for further results.

Gene-level recalls and ranks: As the figures below show, results are rather catastrophic. Indeed, the highest recall value is not even 20%; the best precision is 55% (for a recall of roughly 15%) obtained for the G2R_W score, at a minimum cut-off of 0.7. Clearly, gene-level scores and ranks are of little interest when considering the CanOE strategy as a whole.



Illustration VII.14: VerticalOrpheny1 gene-level benchmarking results Precision-Recall curves for a varying cut-off applied to numeric scores (A, B) or to the ranks thereof (C, D), for gene-to-reaction (A, C) and reaction-to-gene (B, D) scores. Most lenient cut-offs correspond to the highest Recall values.


Family-level recalls and ranks: The curves below are slightly better than the corresponding gene-level ones, especially for the R2F_W score (precision of 62%, recall of 15%, for a minimum score cut-off of roughly 0.3) and the R2F_W rank (precision of 61%, recall of 15%, for a maximum rank cut-off of 1). Still, values remain low, and the precision/recall trade-off for varying cut-off values is uninteresting.


Illustration VII.15: VerticalOrpheny1 family-level benchmarking results for Coverage score Precision-Recall curves for a varying cut-off applied to numeric Coverage score (A) or to the rank thereof (B).



Illustration VII.16: VerticalOrpheny1 family-level benchmarking results Precision-Recall curves for a varying cut-off applied to numeric scores (A, B) or to the ranks thereof (C, D), for family-to-reaction (A, C) and reaction-to-family (B, D) scores.


We shall thus now examine the corresponding benchmarking results for “findable” reactions only. These are given in the following figures.


Illustration VII.17: Findable VerticalOrpheny1 family-level benchmarking Coverage results Precision-Recall curves for a varying cut-off applied to numeric Coverage score (A) or to the rank thereof (B), when only considering findable reactions.



Illustration VII.18: VerticalOrpheny1 findable family-level benchmarking results Precision-Recall curves for a varying cut-off applied to numeric scores (A, B) or to the ranks thereof (C, D), for family-to-reaction (A, C) and reaction-to-family (B, D) scores, when only considering findable reactions.

These results are much more encouraging than the previous ones, illustrating that the CanOE strategy is relatively powerful at recovering gene-reaction associations, at least when metabolon structures allow it. Indeed, for the ranks based on the F=>R score, we obtain roughly 40% precision at 63% recall when we keep all results, which improves to 57% precision at 61% recall when keeping only the 1st and 2nd best-ranking candidate genes. Transposing to real-life cases, we can expect CanOE to be powerful for associations involving prokaryote orphan reactions that have at least several associations with a given gene family, though it very likely misses large portions of metabolism that just cannot be seen at current levels of genome annotation.

In conclusion, benchmarking results with the VerticalOrpheny1 protocol were relatively poor, and the bad results were for the most part due to metabolon loss caused by Known association removal (“metabolon breaking”). This encouraged us to develop another protocol, using a more favourable approach, that could still allow us to confidently conclude how informative our integrative scores were.

VII.E.3. Validating the integration over n only

VII.E.3.a. Protocol

The “VerticalOrpheny2” protocol is the one described in the article. It is similar to VerticalOrpheny1, except that a) it does not recalculate metabolons, and b) it does not use reactions from a reference organism, but rather a selection of reactions respecting the following rule: only select reactions that have at least one Known association in a metabolon that is “big enough” in one organism. A “big enough” metabolon is one that would not automatically be lost after removal of the Known association (i.e. it contains at least 3 reactions, 3 genes and 3 Known associations). The rationale behind this is that metabolons are the given information and their localisation is deterministic; what is important to test is how the whole strategy combines them into informative results. Skipping metabolon recalculation prevents metabolon loss16, thus ensuring that CanOE works with all Known information, less the actual orphaned reaction associations. Though heavily biased, this simpler approach allows us to evaluate the integration-over-multiple-genomes procedure.

VII.E.3.b. Results

The results obtained with this benchmarking approach are already discussed in the paper and are rather reassuring. Indeed, maximum recall is slightly more than 80% for all scores, for a precision of roughly 52%. Filtering out worse-ranked candidates can improve precision at the cost of recall; the best trade-off is obtained for the ranks based on the R=>F score, as shown in the article. Finally,

16 It is similar in this way to the “findable” condition in the VerticalOrpheny1 benchmarking.

we can observe that in all cases, precision and/or recall are higher for a given rank threshold on the integrated scores (F=>R and R=>F) than on the gene-level scores (G=>R and R=>G). This confirms the intuition that integrating over many organisms can improve results.


Illustration VII.19: VerticalOrpheny2 benchmarking results A: Precision-Recall curves using gene-to-reaction (G=>R) or family-to-reaction (F=>R) ranks. B: Precision-Recall curves using reaction-to-gene (R=>G) or reaction-to-family (R=>F) ranks.


VII.F. Web Interface overview

All the results of the CanOE strategy need to be made accessible to the bioanalyst users of the MicroScope platform if they are to be of any use. To this end, CanOE has been integrated into MicroScope's web interface, MaGe. I shall thus present MaGe, before detailing what the CanOE interface has become over the past months.

VII.F.1. The MaGe web interface

MaGe (Magnifying Genomes) is the web-based interface to the tools and data of the MicroScope platform [4]. It is mostly coded in PHP and uses AJAX-based technologies to create dynamically generated and interactive web pages, with asynchronous creation and submission of SQL queries for retrieving required data (i.e. queries can be sent on demand even after a page has been rendered by the web browser). Page structure itself is coded using CSS and HTML. Recent work by Gregory SALVIGNOL has led to the development of MaGe version 2.0, which uses increasing amounts of dynamic technologies and aims to meet common web page quality and coding standards (HTML 5.0, W3C standards...).

The interface is organised into 3 types of pages. The first of these is the Genome Browser, which represents the location of Genomic Objects across all 6 translation frames of a given DNA sequence chosen by the user. A homology/synteny viewer below allows rapid appreciation of the conservation of one or multiple gene sequences between the sequence of interest and other selected genomes (from related or distant organisms). Finally, a dynamic table lists interesting data for all currently visible GOs. With this page, a bioanalyst can rapidly navigate the genome, assess its conservation with others, and access basic annotations.

The second type of page is the Genomic Object Editor, which is accessible (amongst others) from the Genome Browser. It allows consultation of all annotation data (automatic predictions, various bioinformatic method results, previous manual annotations, GO sequence, annotation history...) for a given Genomic Object. This is also the interface that bioanalysts can use to create new annotations when they have sufficient editing rights.

The final type of page is the “template” tool page. A large number of bioinformatic tools are available, each coming with its own interface tailored to its needs. Proposed services include BLASTing, genome-wide descriptive statistics, specific annotation queries, specialised interfaces for genome projects with particular requirements, gene cart handling and queries, various comparative genomics and metabolic tools, and more. The diversity of approaches offered to the bioanalyst is one of MicroScope's strengths.

Additional pages (administration, user preferences, news, etc.) that do not directly interface with MicroScope data are also available.

VII.F.2. The CanOE web interface

Only a pre-release version of the web interface designed to make CanOE results available to the scientific community was ready at the time of publication of the CanOE article; it was thus decided that the final version would be prepared and published in the following MicroScope platform-wide publication. I designed this version on the basis of a previous draft version that I had developed in order to ease my exploration of CanOE results. David VALLENET and Gregory SALVIGNOL integrated it into MaGe. I shall present this work and its perspectives, as they are relevant to my thesis and the professional career I am planning.

The MicroScope platform is designed to help expert bioanalysts functionally annotate genes in prokaryotic genomes. The CanOE results thus have to be made accessible to them in the clearest possible manner, presented with precise biological questions in mind. Several “access points” have been imagined for these results, some of which have already been implemented in the beta version. The current implementation accounts for the choice of metabolic schema, though only the MetaCyc-based schema is currently used.

VII.F.2.a. CanOE main page

CanOE can be used as a stand-alone tool for the MicroScope platform at [www.genoscope.cns.fr/agc/microscope/metabolism/canoe.php]. It can thus be accessed, like any other tool, via a MaGe menu (though, until the beta version is properly put into production, the menu item will remain hidden from the platform users). The main page is basically an interface for requesting CanOE data. It allows users to select the metabolic schema, to consult a list of metabolons for a specific organism, to consult lists of local or global orphan reactions with candidates at different detail levels, or to search CanOE for genes or reactions. Available results are obviously limited to organisms for which the user has access rights. The results that can be obtained from this interface are described in the following sections.


Illustration VII.20: CanOE main web page CanOE's main page can be accessed from the “Metabolism” MicroScope tools menu. Most of the access points imagined for the CanOE tool are available from this page: listing metabolons per organism, consulting local or global orphan reactions at varying levels of detail, and searching through the results with keywords.

VII.F.2.b. Organism-centric view

Most bioanalysts work on the annotation of a single genome (or a group of related genomes). It is thus interesting for them to be able to quickly access the metabolons of their target organism(s).

With this focus in mind, three access points have been imagined:

• Genome viewer (imagined): create a graphical representation of metabolons along the genome, inspired by the current representation of syntenies along the genome. Genes belonging to a same metabolon would have illustrative copies of the same colour in a “metabolon track”, which could be clicked to open the metabolon viewer.

• Gene editor (imagined): the Gene Editor lists all the MicroScope data available for a given gene in various tables, including synteny membership. It would be trivial to add a Metabolon table to the list of available data for a given gene, especially as genes rarely belong to more than one metabolon17 (94.5% single metabolon, 4.7% two metabolons, 0.9% above).

• Metabolon List (implemented): Simply list all the metabolons detected within a target organism, allowing a user to scan through the structured metabolic knowledge located by the CanOE strategy.

The Metabolon List results page (accessible from the CanOE main page) separates metabolons into categories based on their reaction content: 1) metabolons with at least one global orphan reaction, 2) metabolons with at least one local orphan reaction and no global orphan reactions, 3) metabolons with at least one gap reaction but no orphan reactions, and 4) metabolons that are “complete” with respect to reactions. Each metabolon is described in terms of primary/gap genes and reactions, as well as associated metabolic pathways, and gives access to the MetabolonViewer page in order to represent it graphically. The other organism-centric access points will be implemented in the future.

Illustration VII.21: CanOE metabolon list The MicroScope user can access a list of metabolons for a specific organism to which he/she has access. Metabolons are described in terms of gene and reaction content, associated metabolic pathways, but foremost, by any local/global orphan reactions they may contain.

17 Indeed, a single Known gene-reaction association can only belong to a single metabolon. Genes belonging to multiple metabolons thus have multiple Known gene-reaction associations, each captured by a different metabolon.


VII.F.2.c. Metabolon Viewer

The foremost representation of metabolon data is obviously a graphical illustration of the metabolon itself. To this end, I implemented a Java package (based on the Jung package [http://jung.sourceforge.net/]) that draws a metabolon using CanOE results extracted from the MicroScope database. The current implementation displays genes as arrows and reactions as rectangles, with colour variations corresponding to specific flags (annotated/non-annotated gene, metabolic/non-metabolic gene, global/local/non orphan reaction), and edges with confidence-specific patterns (Known/Potential/Inferred). I intend to represent compounds in a further version, in order to increase the readability of the metabolic part. Ideally, the genes and reactions should be laid out automatically to best highlight the metabolon structure. I was not, however, able to find an algorithm for efficiently doing this low-priority task, and nodes currently must be manually displaced by the user. Alain VIARI has helped me find a promising algorithm that is implemented in Javascript rather than Java. I shall use it to replace the currently too-heavy MetabolonViewer applet when I have the time.

Illustration VII.22: CanOE metabolon viewer The MetabolonViewer generates a graphical representation of a metabolon, with genes, reactions, and all types of edges. Users can currently move each vertex manually in order to improve the graph's layout. Basic information is available by right-clicking on the vertices; full information about metabolon contents is given in the tables below the viewer itself.


The MetabolonViewer page contains additional information, such as the colour legend, the list of genes, reactions and metabolic pathways with details, the list of metabolon gene-reaction associations (Known, Potential and Inferred), as well as a list of any other metabolons which share genes with the current one (useful for overlapping metabolons or alternate pathway metabolons).

VII.F.2.d. Orphan reaction pages

Some bioanalysts might approach MicroScope in the hope of finding candidate genes for local or global orphan reactions. In order to assist them in this endeavour, we propose a web page listing all local or global orphan reactions with at least one candidate gene in at least one metabolon. The results can be consulted at three different detail levels: a summary of all orphan reactions with candidates, a family-level detail of candidate families for each reaction, or a gene-level detail of candidate genes for each reaction. According to the chosen level, results provide quick links to the data pages describing the reactions, families, genes, metabolons, and family-reaction associations involved.

Illustration VII.23: CanOE orphan reaction pages The orphan reaction pages can specify, at three levels of detail (summary, family, and gene), CanOE's candidate genes for either local or global orphan reactions, with dynamic links to other relevant pages.


VII.F.2.e. Search result page and object-specific pages

Searching for a gene returns results and links concerning genes, as well as families and metabolons containing those genes, for all metabolic schemas. Searching for a reaction returns the list of all CanOE reactions that contain the specified keyword in their name or equation, for all metabolic schemas.

Each object used in the CanOE data model (reaction, family, metabolon, family-reaction association) has its own description page (the metabolon page being the already-described MetabolonViewer page). These provide internal links to other associated objects and results, as well as external links for metabolic reactions and pathways (current implementation works for MetaCyc).

Illustration VII.24: CanOE search result page A user can search for reactions or genes using keywords. A reaction search locates reactions in all metabolic schemas with the specified keyword associated to them (EC numbers, compounds, part of reaction name...). A gene search locates genes and their containing families with GO_labels or gene names including the keyword. For each type of search, a table is returned with a list of hyperlinks to relevant CanOE objects. For security reasons, the gene search is not intended to be made publicly available.

Altogether, these web pages should provide sufficient access to CanOE results for bioanalysts to exploit them correctly in the manners discussed below. Documents describing the tool and tutorials for bioanalyst users of MicroScope are under preparation.


VII.G. CanOE uses

The CanOE strategy, once its development is complete, shall offer bioanalyst users of the MicroScope platform four services, graded in increasing usefulness and decreasing result volumes:

1. CanOE predicts metabolons, which are original objects that can be used to visualise and synthesise metabolic and genomic information, easing bioanalysis;

2. These metabolons serve as a basis for proposing Potential gene-reaction associations, with scores reflecting results integrated across several genomes and thus conservation;

3. Combining Known and Potential gene-reaction associations across sequence similarity-based families allows the creation of relatively high-confidence Inferred associations that bioanalysts may use to complete gene annotations;

4. Finally, metabolons can propose candidate genes even for sequence-orphan enzymatic activities, as association generation is not sequence-dependent.

All 60,00018 metabolons are available as support for point 1. Only a sub-selection of them (3,867) contain novel association propositions necessary to point 2. Only 1,125 of these metabolons contain Inferred associations (point 3). Finally, despite the effort put into the design of CanOE, a disappointing number of metabolons (597) contain one of the 78 sequence-orphan activities that have candidate genes (point 4). This last point is discussed later, in section VII.H.2.

Thankfully, the strategy is open-ended, in that many improvements can be made to it, which will hopefully increase the number of interesting predictions. Furthermore, several other use cases for metabolons can be imagined. Finally, it may one day be able to take enzymatic promiscuity into account. I shall discuss these points below, before concluding on the main project of my thesis.

VII.H. Discussion and perspectives

VII.H.1. CanOE general improvement leads

The VerticalOrpheny1 benchmarking protocol showed that CanOE can be powerful for orphan reactions that have been seen in a few metabolons. In this sense, we feel that it is possible to defend the “candidate gene finding for orphan enzymes” aspect of CanOE. Obviously, the small number of orphan reactions for which metabolons or candidate genes are available is disappointing. Furthermore, bioanalysis of the propositions leads one to consider many as false positives. We argue that if bioanalysis of these propositions allows the identification of even a few correct candidate genes for orphan enzymes, then that is better than nothing; furthermore, as pointed out in the previous conclusions, CanOE results are interesting to the bioanalyst for other, non-orphan-dependent reasons. We also argue that improvements to a) CanOE metabolic networks, b) CanOE gene family construction (or use of gene similarity), c) CanOE functional dependence indicators, and d) MicroScope annotations will alleviate this. Indeed, point a) will increase the number of available reactions and better connect them to each other (possibly even involving transport activities); point b) might help refine association scores; point c) will make larger, more process-related metabolons possible; point d) will allow CanOE to capture additional metabolic contexts, thus increasing the possibilities for hypothesis generation. All these improvements are discussed in the following sections.

18 Numbers are given for August, 2011

VII.H.2. Improving the metabolic graph

As pointed out in the article, a not-so-small fraction of orphan enzymatic activities are excluded from the global metabolic graph used in CanOE's Metaboloniser because they cannot be connected to other metabolic reactions when only “main” compounds are kept. In the MetaCyc schema presented in this thesis and in the article, this could be due to the absence of metabolic pathways including these orphan reactions, or because the containing pathways are too small or not linked to others in a way that allows the orphan reaction to be linked to reactions in other pathways. In the KEGG-based metabolic schemas presented in this manuscript, reaction disconnection could just as well be due to the absence of a parent metabolic pathway, or to overly-stringent or sub-optimal filtering conditions. In any case, disconnected orphan reactions cannot be included in a metabolic context, and de facto cannot be found in a metabolon. Improving the metabolic graph in any way would seem the best place to start in order to capture additional orphan reactions and, hopefully, more candidate genes.

The source metabolic databases are in a state of constant manual curation, MetaCyc with in-house curators and its user feedback, KEGG with the research projects piloted by Kanehisa labs, both being fuelled by the novel experimental discoveries described in published literature. The metabolic network can thus only improve in terms of coverage, though should connectivity be an issue (as was the case with the KEGG schemas), then improved protocols for defining “main” compounds might be necessary, using for example better KEGG Map data, RPAIR data, or even simple Reaction data.

Another possibility, discussed at more length in chapter X, would be the use of CanOE and BKACE-like (see chapter VIII) strategies in an iterative way, gradually exploring further the metabolic space of prokaryote organisms.

Finally, a bibliographical effort may be useful. I detail this approach in the following section.

VII.H.3. New Enzymatic Activity Survey

During this thesis, and more specifically during the genesis of CanOE, we pondered how the LABGeM could contribute to reducing the number of orphan enzymes in a way that would readily benefit the MicroScope platform. As has already been suggested (see section VI.B and [2]), a non-negligible fraction of orphan enzymatic activities may be orphans only because of insufficient efforts in the computerisation of knowledge already present in scientific articles. Also, the number of known activities continues to grow each year (several hundred EC numbers gained over the past six months), with the creation of candidate activities (validated or not), which further complicates the tracking of reaction bibliography. The recent UniProt effort [3] helped reduce the number of global orphan enzymes by a few percent, but many still seek parent genes.

In order to deal with these shortcomings, we imagined setting up an internal bibliographical tool at the Genoscope called the "New Enzymatic Activity Survey" (NEAS), which would allow us to follow the evolution of enzymatic activities (i.e. discovery, formalisation, modification, transfer/renaming, deletion...), their annotation status (i.e. how many genes, across how many organisms, are annotated with a given activity?), and also to create new, user-defined activities and protein-activity annotations backed by bibliographic evidence. The use of such a tool would lead to:

• a reduction in artefactual orphan activities

• the maintenance of an up-to-date, manually curated metabolic resource

• the maintenance of an up-to-date, manually curated bibliographical resource for:

◦ the discovery of novel activities

◦ their assignment to gene/protein sequences


Illustration VII.25: Schema of NEAS objects and their relationships
NEAS would be dedicated to tracking the associations between metabolic reactions (formalised by EC number, for example), genes/proteins, and their sequences and host organisms. All of these objects and associations would require references to scientific literature. By keeping such a resource up to date, NEAS would provide precise tracking of activities.
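To make the intended data model more concrete, a minimal sketch in Python follows (all class and field names are hypothetical, as NEAS was not implemented during this thesis); it shows how NEAS objects and their bibliography-backed associations could be represented:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Reference:
        pmid: str                 # pointer to the scientific article backing an object
        note: str = ""

    @dataclass
    class Activity:
        ec_number: str            # e.g. "1.1.1.1"; user-defined activities could use a local ID
        status: str               # "created", "modified", "transferred", "deleted", ...
        references: List[Reference] = field(default_factory=list)

    @dataclass
    class ProteinActivityAnnotation:
        protein_id: str           # gene/protein sequence identifier
        organism: str             # host organism
        activity: Activity        # the annotated enzymatic activity
        references: List[Reference] = field(default_factory=list)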

All in all, both automated and manual gene functional annotation methods would benefit from this kind of tool. It would also contribute to establishing descriptive statistics of the field over time, such as those described in section VI.B. However, due to time and resource constraints, this project was not initiated during my thesis, though it should see the light of day in the coming months. When it is ready, NEAS should be able to improve CanOE results by feeding it better coverage of prokaryote metabolism.

I shall now discuss the other improvements or ideas we have imagined that could be applied to the CanOE strategy.

VII.H.4. Various CanOE improvements

VII.H.4.a. Annotation transfer by Context Similarity

Our first musing concerned how similar metabolons could be used to transfer annotations between non-similar genes. This idea is illustrated in the figure below. The reasoning is thus: if a metabolon M1 shares many reactions with metabolon M2, and several of these reactions are catalysed by genes that have detectable sequence similarity between M1 and M2, should it be possible to increase the "score" of a potential association between a gene gap g2x and a reaction gap r4 in M2, given that a gene g1D has a Known association for r4 in M1, even though g2x and g1D share no detectable sequence similarity? Despite the lack of sequence similarity, the contextual evidence might convince a bioanalyst to consider gene g2x as a member of a yet-unknown family able to catalyse r4 (or perhaps some variant of it).

Illustration VII.26: Annotation Transfer by Context Similarity
Two metabolons in two related genomes G1 (genes g1A to g1H) and G2 (genes g2A to g2H). Both concern metabolic reactions r1 to r7. Functional annotations are shown as green curves. High sequence similarities are shown as large orange lines. Gene g1D catalyses reaction r4, and gene g2x is a candidate for the same reaction; however, gene g2x has no identifiable sequence similarity to gene g1D. Given this evidence, a bioanalyst might want to consider g2x as a valid candidate for r4, perhaps belonging to a yet-unknown family of enzymes. In an automatic procedure, it would be interesting to be able to reinforce the association score between g2x and r4 on the basis of that existing between g1D and r4, despite the absence of similarity: an annotation transfer on the basis solely of conserved genomic and metabolic contexts.

Actually defining a procedure for carrying out this kind of transfer is, obviously, a major challenge. I can currently imagine two ways of doing so. A first possibility would be to use the expertise of several bioanalysts to create a rule-based system capable of analysing two or more metabolons with respect to gene, family and reaction content, taking into account gene and reaction similarities, in order to make high-level inferences. A second possibility would be to use an approach akin to that of the Genomic Context Similarity presented in the next chapter, section VIII.C.2. This would require defining some sort of numerical measure of context similarity, be it in terms of genomic context or metabolic context, that could "replace" the absent sequence similarity between the genes.

Either way, this kind of transfer would be particularly useful for metabolons affected by xenologous gene displacement or promiscuous enzyme recruitment. However, it would probably generate a large number of false positive potential associations (or up-score them erroneously). It would also be of no use in the case of orphan enzymes: indeed, if g1D only had a potential association with r4, then there would not be enough evidence to hypothesise that either g1D or g2x catalyses r4. For all these reasons, this specific avenue was not explored further in this work.

VII.H.4.b. More functional dependence evidence

As has been suggested in the paper, additional functional dependence clues could be added to CanOE. The use of the current version of CCCPart imposes the transformation of these clues into binary gene-gene dependency edges. Clues that could be used include high phylogenetic profile similarity, regulation by the same molecule (requiring prediction of individual gene regulations), or high expression profile similarity (requiring the use of experimental data). Protocols for these might prove difficult to adjust, especially in selecting a value cut-off for generating binary edges. As gene neighbourhood has been shown to be the most informative of genomic context indicators [163], it would also be necessary to check that their addition does indeed increase CanOE coverage and usefulness. The main expected benefit would be the possibility of having metabolons span multiple locations in a genome, functionally dependent in a way that obviously goes beyond chromosomal proximity. It should thus be possible to capture higher-level metabolons, increasing opportunities for the generation of hypothetical gene-reaction associations.

VII.H.4.c. Non-metabolic genes and functional subsystems

Another kind of improvement to the CanOE strategy (one that would pair well with the previous) would be the inclusion of non-metabolic genes in metabolons. More specifically, it would be desirable to include non-metabolic genes whose products still participate, in one way or another, in metabolism. This brings the metabolon closer to the notion of "subsystem", which is central to the SEED platform [39] and is becoming so for MetaCyc [214]. For example, a urea degradation metabolon might benefit from the addition of a urea transporter that happens to be included in the same operon. Other examples might be the inclusion of genes responsible for regulatory processes, signalling cascades, the construction of protein chaperones, etc. This might be useful for capturing additional candidate genes, and for making metabolons more representative of the organism's working metabolism. The main hurdle to overcome is obviously encoding the participation of a given non-metabolic gene in a given metabolic pathway. The exact functions of non-metabolic genes are rarely well described, and are not encoded as easily as metabolic activities. Text searching through the MicroScope database for precise descriptions would probably not work. It might be possible to associate genes with specific compounds (not activities) in this way, though how to add these genes to a metabolic pathway remains an open question.

VII.H.4.d. Gene families, similarity matrices

The use of gene families is sometimes disputed in the world of bioinformatics, due to the risks of erroneous annotation transfer [53,266]. Indeed, not only is the definition of "gene/protein family" subject to many variations (some families are defined using whole-sequence similarity, others detect functionally conserved domains, see section IV.B.2), but actually establishing function on the basis of families is not direct and depends on how the families were defined (due to multiple-domain proteins and functional promiscuity, see section V.C.1). It may thus seem preferable to replace the OrthoMCL-based family definition with something else. It might even seem feasible to avoid defining families altogether. In such a case, a more domain-based approach could be selected, akin to EFICAz [267] or Pfam. Another possibility would be a more context-based approach, such as using IsoRank [173] with protein-protein interaction networks when available.

If one still wishes for the simplicity of whole-sequence similarity, then it might at least be interesting to conserve all similarity information, rather than lose some of it by cutting the sequence universe into non-overlapping families. At one point, I thought that it might be possible to directly use a similarity matrix H of size G*G (where G is the total number of genes) of gene-gene similarity scores (i.e. the matrix fed to the MCL algorithm). If gene-reaction association scores were recorded in an association matrix A of size G*R (where R is the total number of reactions), then a simple matrix product H*A would lead to "association diffusion" amongst the genes. Put simply, a gene would inherit associations to reactions owned by its neighbours, factored by its similarity with said neighbours. Several "rounds" of such multiplication (diffusion) would ensure that associations are shared by all genes with some sequence similarity, and correct associations would "accumulate" in clusters of genes. This would also allow genes to be associated with many reactions, with varying weights. Several pitfalls line this path, however. Firstly, it would be necessary to define prior association values for "Known" and "Potential" associations, as is currently the case, though these choices might affect results more profoundly. Secondly, some sort of normalisation method would be required to actually locate final scores that are indicative of a true association, a feat that is likely to be difficult as maximal association scores will be heavily influenced by localised clusters and clique structures. Finally, CanOE currently generates metabolons for over 200,000 genes. The corresponding H matrix, though very sparse, would still be difficult to handle computationally; even the A matrix would be. This lead still seems promising to us and, if further research is encouraging, will probably supersede the current family-based approach.
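As a concrete illustration, a minimal sketch of this diffusion in Python follows, using toy dense matrices and arbitrary prior values (an actual run over 200,000+ genes would require sparse matrix representations, and the crude renormalisation used here is exactly the open problem discussed above):

    import numpy as np

    G, R = 5, 3                     # toy numbers of genes and reactions
    H = np.random.rand(G, G)        # gene-gene similarity scores
    H = (H + H.T) / 2               # symmetrise
    A = np.zeros((G, R))            # gene-reaction association scores
    A[0, 1] = 1.0                   # prior for a "Known" association
    A[2, 0] = 0.5                   # prior for a "Potential" association

    for _ in range(3):              # a few rounds of "association diffusion"
        A = H @ A                   # genes inherit their neighbours' associations
        m = A.max()
        if m > 0:
            A /= m                  # crude renormalisation (an open problem)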

VII.H.4.e. Aligning multiple graphs at once

As has already been said, the CCCPart implementation developed by Yves-Pol DENIELOU has many algorithmic improvements over the previous version. One of the main new features is the computational tractability of aligning multiple input graphs in one go (other improvements include imposing node order constraints, or conservation quotas across multiple input graphs). It could, for example, be used to calculate conserved syntenies across several genomes at once. This could be exploited in order to establish metabolons that are conserved across several genomes at once. The added value would be the ability to skip the "integration over n organisms" step entirely, as it would already be built in. However, several hurdles would have to be overcome before doing this. First of all, the algorithm's output would be highly complex. Indeed, CCCs would be reported across subsets of input graphs (metabolons conserved across variable numbers of genomes), and they would certainly be overlapping (e.g. a three-genome metabolon could overlap with a larger two-genome metabolon). This would make dealing with the output very complicated. Secondly, the method would be sensitive to phylogenetic bias, in that heavily conserved metabolons would easily be found amongst groups of closely related genomes; defining genome sets on which to run it would probably have to be done manually. Once again, we stress that the CCCPart algorithm is not adapted to using weighted edges, be it within an input graph or between input graphs; this would be a problem for the gene-gene similarities, as they would have to be made binary, thus losing information, especially since determining a global similarity threshold across the whole gene space would not be biologically relevant. It might be possible to use other algorithms for this task, such as IsoRank or IsoRank-N [173,174], though they were not designed with different types of input graph (e.g. genes *and* reactions for metabolons) in mind. Finally, even if integration over n organisms is built per se into the metabolons, this does not determine how individual gene-reaction associations would be scored: an entirely new procedure would be required.

Another way of dropping gene family use from the CanOE strategy would be to integrate not genes via families, but entire metabolons. Ways of aligning metabolon graphs using algorithms such as IsoRank [173], [144], or even CCCPart itself [5], or of establishing tailored measures of "metabolon similarity" (e.g. counting the number of shared genes or reactions, factoring in gene similarities, etc.) could be imagined to regroup metabolons across many genomes. This would be interesting for comparative genomics, though the same problem of redefining a scoring system for this approach remains.

Other improvements can be imagined in the way CanOE data is exploited.

VII.H.5. Metabolon & CanOE use cases

Metabolons are a formalisation of high-level annotation information. Several uses for them could potentially be explored.

The first that springs to mind is their use in predicting operons. In their current shape, metabolons are indications of functional coherence within one genome, which is in line with the definition of an operon. Perhaps crossing predicted operons and metabolons could help extend the latter. Furthermore, as said in section IV.D.2.c, some operon-predicting algorithms require function, as well as sequence, to be conserved across several genomes; multi-genome metabolons (or alignments of several single-genome metabolons) could capture this kind of signal. Perhaps a simplification of this idea would be the use of multi-genome metabolons with loose constraints on their metabolic part in order to derive multi-genome syntenies.

Another interesting possibility would be to examine whether metabolons associate with other context-gathering units of function (gene runs, gene clusters, syntenies, operons, regulons...). If this were the case, it might be possible to learn more relaxed rules for defining metabolons, which would in turn be useful for predicting new metabolons or extending existing ones. Exploring this lead would obviously be complex and time-consuming, and could form the basis of a brand new thesis project.

Finally, metabolons could be used to correct functional annotations. In [268], the authors adapt their orphan enzyme candidate gene finder to a different problem: that of locating potential mis-annotations. Their idea is that if a gene G is annotated with a function belonging to bio-process P, but has low functional dependency with other genes whose annotated functions participate in P, then the annotation of G is probably false. Furthermore, if G has a higher functional dependency with a set of genes other than the one its annotation links it to, then the annotation of G is probably false as well. In the latter case, a new function for G can be proposed; this idea is illustrated in the figure below. Metabolons could be used similarly. Any gene gap that belongs to a metabolon and whose annotated product does not participate in the metabolon can be viewed as a potential mis-annotation. Our integration over n organisms approach can help, as annotations that are not supported by a metabolon are not considered; a bad annotation that has been transferred by simple homology will thus be thwarted. Potential associations across the family, however, can propose new metabolic functions for it.

Illustration VII.27: Annotation correction using context-based methods
The figure represents genes from a genome (green and blue circles) and their respective annotated functions (grey rectangles), as well as the links between the latter (orange lines with dots in the centre). In functional Context 1, the gene in green is annotated with one function, but has a low "goodness-of-fit" score for this position, as calculated from its functional dependencies with the other genes (purple lines). However, the same gene has a higher score for a different function in Context 2. The green gene can thus be considered as mis-annotated and, in this case, could be re-annotated with the function from Context 2.

VII.H.6. Enzymatic promiscuity

As I have already pointed out, I believe any new bioinformatics method dealing with functional annotation should in some way deal with functional promiscuity. Indeed, as has already been highlighted in [97,98,100], the relationship between sequence similarity - the most heavily-used proxy for functional similarity - and function is far from straightforward, and functional promiscuity is one of the mechanisms that clouds it.

The first version of CanOE does not deal with promiscuity either. Worse, some of its working hypotheses expect genes to be mono-functional (e.g., not proposing already-annotated genes as candidates for a reaction gap). Still, one might excuse CanOE, given that I was unable to find anything in the literature that approached this problem. We have, however, imagined some ways of dealing with functional promiscuity in the CanOE framework, and I wish to present them here on the off chance that they may one day be of some use to future developers.

The notion of "underground metabolism" [230] has already been raised: the idea that the genes encoding one metabolic pathway may actually be capable of catalysing parallel or even completely different metabolic pathways, either at negligible speeds, or only when the right conditions are met (such as the absence of the first substrate of the primary pathway). The CanOE metabolons can capture some of this, either by a) allowing genes to be associated with multiple reactions in the same metabolon, or b) allowing genes to be assigned to multiple metabolons. Case a) would capture parallel or similar pathways whose reactions share, at some point or other, identical main compounds. Case b) can capture different pathways, or those that do not share main compounds. Cases like these were observed within CanOE, especially for KEGG reaction-based metabolic schemas, due to the uncertainty of EC-to-KEGG reaction conversions (often allowing one gene to associate with a panoply of related KEGG reactions). However, most of these cases seemed anecdotal or spurious, and depended on genes that had already been annotated as multi-functional. Underground metabolism is currently expected to be largely unknown, so the CanOE strategy as is may be useful in filling in blanks, but will probably not help discover these cases directly. To be able to live up to this mission, CanOE would have to be improved in some manner beforehand.

The main point to address before improving CanOE in this direction is how exactly to model functional promiscuity. The current model allows for it by accepting multiple Known gene-reaction associations for the same gene. Additionally allowing Potential associations to be generated for genes with Known associations would further open CanOE up to functional promiscuity, but would result in creating many, many false positive Potential associations. One interesting alternative would be to model reaction-reaction similarities in a graph. These similarities would be designed to capture enzymatic promiscuity. For example, a reaction R could be proposed in a Potential association for a gene G that already has a Known association with reaction R', if and only if R and R' are two enzymatic reactions that can be catalysed by the same enzyme. More boldly, low-confidence Known associations could be automatically generated on the basis of these similarities, thus hypothesising that if an enzyme catalyses one of two reactions known to be catalysed by a single enzyme, then it probably also catalyses the other.

The simplest way that springs to mind of constructing such a similarity is to count the number of co-annotations (i.e. the number of times a gene has been annotated with both reaction R and reaction R') and to divide it by the number of individual annotations (i.e. the number of times a gene has been annotated with R or with R'). This approach would ensure that reactions known to be able to cohabit within the same enzyme are effectively given this possibility when designing novel Potential associations. Another possibility would be to base reaction similarities on EC number sharing. This is implicitly the case in [244], where candidate genes are favoured when they share the first three digits of the EC number corresponding to the target gap reaction. However, the power of the EC number system to capture reaction similarity has been disputed, either because of the accumulation of example cases where EC number similarity could not be traced to sequence similarities [269,100], or because of discrepancies with reaction mechanistic knowledge [270].
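For illustration, this co-annotation measure amounts to a Jaccard-like index over the sets of annotating genes, as in the following Python sketch (function and argument names are hypothetical):

    def reaction_similarity(genes_with_r: set, genes_with_r2: set) -> float:
        """Co-annotations divided by individual annotations for reactions R and R'."""
        both = len(genes_with_r & genes_with_r2)    # genes annotated with R and R'
        either = len(genes_with_r | genes_with_r2)  # genes annotated with R or R'
        return both / either if either else 0.0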

Reaction similarity should thus take the chemical reaction into account, preferably with knowledge of mechanistic steps. [271] built reaction-describing feature vectors based on reactant structural and physico-chemical properties observed to change during the reaction. They classified these features using factorial analysis or Kohonen self-organising maps and derived a reaction similarity measure from the resulting spaces. In [270,199], the authors propose a more mechanistically-orientated approach: a) individual mechanism step similarity between reactions is evaluated on physico-chemical properties, transformations or bond changes; b) the steps of two reactions are aligned using these step-to-step similarities and the Needleman-Wunsch algorithm; and c) a normalised score is derived from the alignment to represent reaction similarity. The drawback of mechanistically-precise approaches is that they are obviously limited to reactions for which the mechanism has been elucidated. The latter are few and far between, as state-of-the-art databases dedicated to this sort of information, such as the MACiE database [272], only cover 321 EC numbers (for a total of 335 distinct reaction mechanisms) at the time of writing [www.ebi.ac.uk/thornton-srv/databases/MACiE].
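Step b) of this approach is a standard global alignment; a minimal Needleman-Wunsch sketch over mechanism steps is given below (the step_sim function and the gap penalty are stand-ins for the physico-chemical step similarities of [270,199], which are not reproduced here):

    def align_mechanisms(steps1, steps2, step_sim, gap=-1.0):
        """Global alignment score between two lists of reaction mechanism steps."""
        n, m = len(steps1), len(steps2)
        F = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                F[i][j] = max(F[i - 1][j - 1] + step_sim(steps1[i - 1], steps2[j - 1]),
                              F[i - 1][j] + gap,      # gap in the second mechanism
                              F[i][j - 1] + gap)      # gap in the first mechanism
        return F[n][m]   # step c) would normalise this into a reaction similarity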

Work conducted by Syed Asad Rahman in the Thornton Group at the EBI, recently presented at ISMB/ECCB 2011, dealt with deriving reaction similarities on the basis of chemical structure transformations between substrates and products, a biologically-relevant approach that does not depend on reaction mechanism. This work is still in progress, though the first publications should be coming out soon. Once available, it should provide another way of measuring reaction similarity that is adapted to enzymatic promiscuity.

Once all similarities between reactions have been obtained, how to use them remains an open question. The methodologies hinted at above propose using some sort of binary rule, allowing or disallowing the generation of Potential associations on the basis of reaction similarity. Another possibility involves using the matrix-based integration approach discussed in section VII.H.4. Reaction similarities would be put into a matrix T of size R*R (R being the number of reactions). Diffusing gene-reaction association scores across matrix A on the basis of reaction similarity could then be done simply by taking the matrix product A*T. Iteratively setting An = H * An-1 * T would ensure that association information is diffused according to both gene and reaction similarities. It would arguably be even more complicated, however, to devise a normalisation scheme to extract conclusions out of a resulting An matrix.
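Extending the earlier diffusion sketch, the two-sided update could look as follows in Python (again with toy matrices; the normalisation issue is left open, as stated above):

    import numpy as np

    G, R = 5, 3
    H = np.eye(G) + 0.1 * np.random.rand(G, G)   # toy gene-gene similarities
    T = np.eye(R) + 0.1 * np.random.rand(R, R)   # toy reaction-reaction similarities
    A = np.zeros((G, R))
    A[0, 1] = 1.0                                # one prior Known association

    for _ in range(3):
        A = H @ A @ T    # An = H * An-1 * T: diffusion along genes and reactions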

All these leads and ideas are potential starting points for future improvements to the CanOE strategy, and it is my fond hope that someday they may help continue to carry research forwards in the domain of prokaryote genome annotation.

VII.I. Strategy conclusion

The CanOE strategy is a useful contribution to the domain of automatic gene function prediction for several reasons:

• it generates metabolons, useful units of metabolic function in a genome that can be readily exploited by bioanalysts to ease manual annotation;

• it can propose potential and inferred associations between un-annotated genes and reaction gaps, with scores integrated across multiple genomes, thus helping bioanalysts to focus on the most promising cases;

• it can even propose candidate genes for organism-orphan reactions (albeit in small numbers) without use of sequence similarity, which is of particular interest for global/prokaryote orphan reactions;

• it uses a computationally efficient and exact algorithm (with biological priors applied in result post-processing) to locate genomic metabolons in single prokaryote genomes;

• though it exploits metabolic context and only gene co-localisation as genomic context information, it should be adaptable to other types of useful contextual information, such as phylogenetic profiles and co-regulation (this will probably form the basis of a new doctoral or post-doctoral project at the LABGeM);

• unlike competitor methods, CanOE is the first - to our knowledge - to propose the computational integration of results across all available genomes, leading to the calculation of family-reaction association scores that we showed to be informative for bioanalysts in prioritising gene candidates for target reactions;

• competitor methods such as [144,244] do not propose their results to the scientific community via a web interface or downloadable data sets, or even downloadable programs or source code. Results of the CanOE strategy will be fully integrated into the MicroScope web-based prokaryote genome annotation platform (though source code, presenting no particular general interest and being MicroScope-specific, will not be made available).

• the CanOE strategy shall be presented to the scientific community via two scientific publications: one detailing the strategy, inserted above, and another describing the latest MicroScope developments. The MicroScope platform now offers users the possibility to directly annotate genes with MetaCyc reactions (as well as with the previously-used EC numbers), a development which should form an interesting synergy with CanOE.

• The CanOE strategy provides interesting development and use case perspectives that would be well worth exploring in the future.


VIII. Development: The BKACE project

The strategy imagined and implemented here will be presented in a collaborative paper bringing together the many Genoscope actors of the entire project, wherein the bioinformatics part will be relatively restricted. Full details of my work are therefore discussed here.

VIII.A. Motivation and Objectives

In 2007, a collaborative work between the LABGeM and LGBM teams at the Genoscope led to the discovery of the genes coding for three reaction steps in the lysine fermentation pathway that had previously evaded identification [240]. A comparative genomics approach using the MicroScope platform showed that a group of co-localised genes was conserved across several genomes; several of these genes encoded enzymes from the then-incomplete pathway, the others encoding hypothetical proteins. With this strong evidence of functional association, wet-lab experiments were carried out in order to elucidate the functions of the un-annotated genes, successfully assigning the three gene-less steps to members of three hypothetical gene families.

Illustration VIII.1: The lysine fermentation pathway
The enzymatic reactions shown are: L-lysine-2,3-aminomutase (1), beta-L-lysine-5,6-aminomutase (2), 3,5-diaminohexanoate dehydrogenase (3), 3-keto-5-aminohexanoate cleavage enzyme (BKACE activity) (4), 3-aminobutyryl-CoA ammonia lyase (5), acyl-CoA dehydrogenase (6), acetoacetate:butyrate CoA-transferase (7), acetoacetyl-CoA thiolase (8), phosphate acetyltransferase (9), and acetate kinase (10).


One of the involved gene families was classified in Pfam [76] as "DUF849", meaning that the family had been built on the presence of a conserved protein Domain of Unknown Function, and several members of this family were shown to catalyse the 3-keto-5-aminohexanoate cleavage step illustrated in [Illustration VIII.1: The lysine fermentation pathway] (figure borrowed from [240]). However, this function was not shared by ALL the members of this family, most of which came from organisms unable to ferment lysine. This, together with the high conservation of the protein domain, suggested that proteins from this family were able to catalyse a broad range of enzymatic activities, affecting different substrates through a similar chemical mechanism: enzymatic promiscuity at least at the family level (though promiscuity at the protein level, i.e. a same protein being able to catalyse several reactions, was not ruled out). The project was then extended, with the objectives of a) characterising the 3D structures of representative proteins of this family, b) elucidating the associated mechanism, and c) exploring the functional diversity of the family. The first and second points were accomplished by a collaboration between the Structural Microbiology team at the Institut Pasteur and two other Genoscope teams: the Laboratoire de Chimie Organique et Biocatalyse (LCOB) and the Laboratoire d'analyses bioinformatiques des séquences (LABIS) [273]19. Protein 3D structure modelling carried out in the LABIS showed that the active site of the enzyme was very probably conserved across the family. Together with the proposed mechanism, this supported the hypothesis of a set of similar activities acting on similar substrates, and led us to name the family "BKACE", for "beta-keto acid cleaving enzyme family".

Illustration VIII.2: The generic beta-keto acid cleavage reaction
Substrates are a beta-keto acid and an acetyl-CoA molecule. An acetyl group (carbons in red) is transferred to the acetyl-CoA's acetyl group, generating acetoacetate, while the CoA replaces it in the beta-keto acid, giving an acyl-CoA (conserved carbon chain in blue). We hypothesise that the variable part of the beta-keto acid (R) can contain many different functional groups: alkyl chains, alcohols, amines, unsaturated alkyl chains...

19 Though this study was limited to a single lysine degradation-related BKACE protein.


My work in this project addressed two main points in the exploration of the functional diversity of DUF849. The first concerned using existing bioinformatics tools to help divide DUF849 into subfamilies that would hopefully prove to be iso-functional (i.e. act upon a same set of similar substrates). This would reduce the protein space to explore from all 749 proteins of DUF849 to only several hundred (choosing, for example, 192 that would conveniently fit into two 96-well plates during wet-lab experiments). Furthermore, using bioinformatics tools, chemoinformatics tools and expert knowledge, we wanted to propose a set of plausible substrates that the proteins could act on.

The second point I worked on was the statistical analysis of the experimental data finally obtained by the LCAB team.

VIII.B. Article

A collaborative paper is currently being written in order to publish our findings. It is not yet advanced enough, however, for the draft to be inserted here. Instead, I insert the short submission I sent in order to present this work at the "Journées Ouvertes de la Bioinformatique et des Mathématiques" (JOBIM) conference in September 2010. The submission was accepted for a 5-minute flash oral presentation and a poster.

Author Contributions and Acknowledgements: Many different teams collaborated on this project: the Laboratoire de Génomique et de Biochimie du Métabolisme (LGBM), the Laboratoire d'Analyses Bioinformatiques de Génomique et du Métabolisme (LABGeM), the Structural Microbiology team, the Laboratoire de Chimie Organique et Biocatalyse (LCOB), the Laboratoire d'Analyses BIoinformatiques des Séquences (LABIS), and the Laboratoire de Clonage et de criblage des Activités de Bioconversion (LCAB). One can only acknowledge as a whole the work provided by all the team members involved in the BKACE project.


VIII.C. Methods and results

Dividing a protein family into subfamilies supposes the use of at least one type of protein similarity measure to calculate an optimal partition. Following the general trend in bioinformatics, our approach used several protein similarity data sources and integrated them together, hopefully reaching increased precision and coverage. Here, I shall present the methodologies for each individual similarity, as well as the integration method. The complete clustering pipeline is represented in the figure below. I shall then detail the experimental protocol used, as well as the statistical treatments I applied to its biochemical results, insisting on some points of the experimental design that motivated my choices.

Illustration VIII.3: The BKACE clustering pipeline
Primary sequence and structure data are extracted from public databases. Five main component BKACE clusterings are carried out: complete linkage, QuickTree, Sci-Phy, Genomic Context, and ASMC clusterings. These are then integrated using the cluster ensemble framework into a single "MegaClustering", the clusters of which are supposedly iso-functional. The Genomic Clustering is also used as a basis for finding coherent metabolic contexts from which to draw potential BKACE substrates.


VIII.C.1. Multiple Sequence Alignment

The study of gene or protein families usually starts with the examination of all involved nucleic or amino acid sequences. Tools such as BLAST determine optimal pairwise sequence (nucleotide or amino-acid) alignments. However, to build representations of several proteins and to study sequence evolution, it is necessary to align several proteins at once, producing what are called Multiple Sequence Alignments (MSAs). Specific algorithmic challenges arise for this sort of problem, and several tools have been created in answer, such as Clustal [274], MUSCLE [275], and the very recent Clustal Omega [276]. In this work, we used MAFFT [277]. MSAs can be used to establish phylogenies of the involved sequences by establishing protein-protein similarities that benefit from the family alignment, which from an evolutionary point of view should be more precise than simple 1-versus-1 sequence similarity. In practice, they can also be used to filter out proteins that do not align well with the rest, such as proteins with sequencing errors or fragmented proteins. Here, 739 BKACE proteins were initially aligned, but we filtered them down to the 725 that are used in the subsequent analyses.
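For reference, the alignment step can be reproduced along the following lines (a Python sketch assuming MAFFT is installed and using hypothetical file names; MAFFT writes the alignment to standard output, which Biopython can then parse):

    import subprocess
    from Bio import AlignIO

    with open("bkace_aligned.fasta", "w") as out:
        subprocess.run(["mafft", "--auto", "bkace_proteins.fasta"],
                       stdout=out, check=True)

    msa = AlignIO.read("bkace_aligned.fasta", "fasta")
    print(len(msa), "sequences of aligned length", msa.get_alignment_length())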

VIII.C.2. Clustering on conserved genomic context

We used an in-lab tool called the Syntonizer (developed by Laurent LABARRE and David VALLENET [6]) to locate syntenies (according to the MicroScope definition given at the end of section IV.D.2.c). It uses the CCCPart algorithm from [5] (see section VII.C.1), with edges based on discrete gene contiguity and binary gene homology relationships. The Syntonizer was used in establishing the genomic context similarity measure presented below.

We developed an original clustering approach specifically for the BKACE project. The general idea was to cluster together BKACE proteins that had similar genomic contexts. We decided to build a measure of genomic context similarity based on the conservation of gene neighbourhoods between BKACEs, identified using the previously described Syntonizer. I implemented the following protocol:

• For each and every BKACE-coding gene, extract from MicroScope all the neighbouring genes that lie entirely within a 10,000 base-pair distance (the average CDS length in E. coli K12 is 960 base pairs, and the average intergenic distance is 142 base pairs, meaning we can capture on average 9 genes on either side of the BKACE gene). This will be referred to as a BKACE neighbourhood.

• For each concerned organism pair (nota bene: several BKACEs can exist in the same organism, so self-pairs are allowed), extract all gene-to-gene sequence similarities as calculated by BLAST for all genes in all neighbourhoods (including the BKACE proteins themselves).

• Find all syntenies (i.e. groups of conserved genes) between all BKACE neighbourhoods using the Syntonizer with the following parameters:

◦ Gene similarity was established using BLAST alignments of the sequences of the encoded proteins. Homology was concluded between two bidirectional best-hit proteins that had at least 30% amino-acid identity over 80% of the length of the shortest protein20, which are the default settings for syntenies determined in MicroScope [6].

◦ Syntonizer gene gap parameter was set to 3, which is more restrictive than MicroScope base settings (which allow 5 gap genes), and only syntenies involving more than 3 gene-to-gene relationships were conserved.

• Drop all syntenies that do not include a BKACE protein as they probably don't concern the latter.

• Using each remaining synteny (which links a BKACE and part of its gene neighbourhood in one organism to another BKACE and its neighbourhood from the same or another organism), establish a genomic context similarity measure between the involved BKACE proteins.

We chose an intuitive and simple measure of genomic context similarity: the number of genes shared between two neighbourhoods in a synteny. Intuitively, the more genes are conserved between two organisms, the more similar the BKACE contexts should be, and hence the more likely their functions are to be similar. This approach might be improved by taking into account the actual similarity scores involved.

BLAST results are inherently asymmetrical due to the nature of the algorithm, i.e. the similarity of protein A to protein B is not necessarily identical to the similarity of B to A. Furthermore, syntenies are not necessarily symmetrical either, i.e. the number of conserved genes of neighbourhood A in neighbourhood B is not necessarily the same as the number of conserved genes of B in A, as there can be gene fusion/fission and duplication events. Four possible counts of conserved genes are thus available for each pair of BKACE proteins. To establish a single measure, we chose the minimum number of conserved genes from each version of a same synteny (neighbourhood A versus neighbourhood B, and B versus A) and summed these values over the two versions of the BLAST comparison (organism A versus organism B, and B versus A), which is equivalent to taking their average. This is illustrated in the figure below.

20 Nota bene: please note that not all pairs of BKACE proteins actually pass these criteria. Thus, two heavily dissimilar BKACE proteins cannot be part of a same Genomic Context in our protocol.

Illustration VIII.4: The Genomic Context similarity measure
Genes from a synteny between genomic sequences Seq1 and Seq2 are shown in green, with BKACE proteins in striking green. The Genomic Context Similarity measure is an approximation of the number of genes in synteny. The asymmetry of synteny results (due to fusions/fissions) is dealt with by taking the minimum number of involved genes from one genome. BLAST result asymmetry is dealt with by taking the sum of the previous for each possible setup (Seq1 versus Seq2, and Seq2 versus Seq1).
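The measure itself reduces to a few lines; the following Python sketch assumes the four conserved-gene counts have already been extracted from the Syntonizer output (all names are hypothetical):

    def gc_similarity(counts_blast_ab, counts_blast_ba):
        """Each argument is a pair (genes of A found in B, genes of B found in A)
        for one BLAST direction. Take the minimum within each synteny version,
        then sum over the two BLAST directions."""
        return min(counts_blast_ab) + min(counts_blast_ba)

    # e.g. gc_similarity((6, 5), (6, 6)) == 11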

The resulting counts were used to weight the edges of a graph wherein each node represents a BKACE gene neighbourhood. This graph required pre-processing in order to make its latent structure more apparent and to render it more robust to perturbations. Indeed, this kind of similarity graph is not transitive: if neighbourhood A is similar to neighbourhood B and B to C, A is not necessarily similar to C, or can be similar to C but via a different set of genes. To address these shortcomings, we iteratively removed edges that were a) too weak (minimum edge weight: 4) or b) connecting nodes with too low a degree (minimum node degree: 3), ensuring minimal levels of connection strength and connectivity, until convergence of the graph structure. Distributions are given in the following figure for illustrative purposes:

Illustration VIII.5: Edge weight and vertex degree distributions in the Genomic Context graph
The distributions of edge weights (A) and vertex degrees (B) are shown for the Genomic Context graph before (1) and after (2) the iterative filtering (panels A1, B1, A2, B2).
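The iterative pre-processing described above can be sketched as follows (in Python with networkx, assuming a weighted similarity graph; thresholds follow the text, and vertices that lose all their edges are kept as singletons):

    import networkx as nx

    def filter_context_graph(g: nx.Graph, min_weight=4, min_degree=3) -> nx.Graph:
        g = g.copy()
        while True:   # iterate until the graph structure converges
            weak = [(u, v) for u, v, w in g.edges(data="weight", default=0)
                    if w < min_weight]
            low = [e for n, d in dict(g.degree()).items()
                   if 0 < d < min_degree for e in list(g.edges(n))]
            if not weak and not low:
                return g
            g.remove_edges_from(weak + low)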

We then applied a recursive spectral clustering algorithm (a class of clustering algorithms to which the tutorial in [278] can be considered a practical introduction) to divide the graph into dense modules with sparse inter-connections. This yielded the desired BKACE clustering based on genomic context similarity, called the "GC clustering". Unfortunately, due to the pre-processing steps, many BKACEs were singletons in the graph and could thus not be clustered. The number of clusters was manually fixed to 32, covering 412 BKACE proteins (56.8% of all BKACE proteins). The Genomic Context graph is shown in the following figure, with vertices coloured according to cluster membership.

Illustration VIII.6: Genomic Context graph with clusters
Each vertex in the graph corresponds to one BKACE protein and its neighbouring genomic context. Each edge represents a similarity whose score passed the pre-processing filters. Vertices are coloured according to the final Genomic Context clusters obtained by spectral clustering of the graph.
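As an indication of how such a clustering can be reproduced, here is a minimal sketch using scikit-learn on the filtered graph's affinity matrix (the actual work used a recursive spectral variant; the cluster count was fixed manually, 32 following the text):

    import networkx as nx
    from sklearn.cluster import SpectralClustering

    def cluster_gc_graph(g: nx.Graph, n_clusters=32):
        nodes = list(g.nodes)
        W = nx.to_numpy_array(g, nodelist=nodes, weight="weight")  # affinity matrix
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed").fit_predict(W)
        return dict(zip(nodes, labels))   # BKACE neighbourhood -> cluster id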


VIII.C.3. Clustering on sequence similarity

Clustering proteins on the basis of sequence similarity has been heavily used over the years. Using 1-versus-1 protein-protein similarities calculated by BLAST alignments, one can represent the proteins in a similarity graph to which traditional graph-clustering algorithms can be applied. We filtered the BLAST results using manually-established thresholds on several criteria, after observing their distributions in the data: matches covering less than 75% of the length of the shortest protein were dropped; matches of less than 200 residues were dropped (the conserved domain on which DUF849 was built is roughly 200 residues long); matches with an e-value above 10^-50 were dropped. Gene similarity was then summarised by a common score: the negative log10 of the BLAST e-value.

The similarity matrix S was made symmetrical (by averaging the scores of A against B and of B against A) and transformed into a distance matrix D using the usual formula: D(x,y) = (max(S) - S(x,y)) / max(S). We then built a hierarchical tree of BKACE proteins from this distance matrix using the base complete linkage algorithm available in the R statistical scripting language. Clusters were finally determined traditionally, using a height cut-off chosen manually after observing the tree structure, which led to 88 clusters, of which 49 (representing 6.6% of the 739 BKACE proteins clustered this way) were singletons.

This clustering was called “SIM clustering” for simple similarity-based clustering, and had 39 non- singleton clusters covering 690 (95.2%) of all kept BKACEs.
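The SIM protocol can be summarised by the following sketch (the thesis used R's complete-linkage implementation; scipy is an equivalent here, and the similarity values and height cut-off are toy stand-ins):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    S = np.random.rand(10, 10)        # toy -log10(e-value) similarity scores
    S = (S + S.T) / 2                 # symmetrise by averaging both directions
    D = (S.max() - S) / S.max()       # D(x,y) = (max(S) - S(x,y)) / max(S)
    np.fill_diagonal(D, 0.0)
    tree = linkage(squareform(D), method="complete")
    clusters = fcluster(tree, t=0.8, criterion="distance")  # manual height cut-off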

VIII.C.4. Phylogeny-based clustering

We used the DUF849 MSA in two phylogeny-based methods for clustering.

First, we created a phylogenetic tree using QuickTree [279], generating bootstrap values for all nodes (which give an indication of the robustness of the tree structure at each node with regard to small perturbations in the sequences). I then proceeded to manually cut branches of the tree in order to determine the clusters, taking into account tree topology, cluster size and the bootstrap values, which is the traditional approach of the phylogenetician. In this "PHYLO" clustering, 71 non-singleton clusters were established, covering 701 BKACEs (96.7% of 725).

Secondly, we used a more advanced and automated approach called SCI-PHY (Subfamily Classification in Phylogenomics) [280]. SCI-PHY builds a hierarchical sequence tree, using Dirichlet mixture densities to construct sub-tree sequence profiles and using relative entropy as a measure of profile-to-profile or sequence-to-profile similarity. It does not use information on host species phylogeny. It then applies minimum description length principles from information theory to cut the tree into optimal subfamilies. Full details on the statistical protocols can be found in the method's publication and the references therein [280]. This clustering generated 46 non-singleton clusters covering 704 BKACEs (97.1% of all BKACEs).

VIII.C.5. Clustering on key amino acids

Within an enzymatic protein, several amino acids in the sequence are of particular importance. Indeed, beyond the large fraction of amino acids responsible for the secondary and tertiary structures of the protein, several can directly intervene in the mechanism of the enzymatic activity, for example by offering a proton donor group at the right place, or by deforming a substrate's electron cloud so as to facilitate group substitutions. Other amino acids are responsible for constraining the molecules involved in the reaction into spatial conformations more favourable to the mechanism. Identifying key amino acids is an active area of research in structural chemoinformatics, as they can help understand chemical mechanisms and their evolution.

In the BKACE project, the (now disbanded) Laboratoire d'Analyses BIoinformatiques des Séquences (LABIS) contributed, amongst other things, by using their in-lab method called ASMC (Active Site Modelling and Clustering) to identify important amino acids in the family [95]. The methodology is summarised in [Illustration VIII.7: ASMC pipeline] (as borrowed from [95]), and is described hereafter:

• retrieve the set of proteins of the target family.

• retrieve known 3D structures for proteins of the family, or of proteins similar to those of the family, from the Protein Data Bank (PDB) [91].

• use the homology modelling tool MODELLER [281] to establish 3D models for all proteins.

• use Fpocket [282] to detect the putative active site pocket in the known models.

• use MultiProt [283] to align all 3D models.

• retrieve spatially- (and not sequence-) aligned amino acids from the active site pocket and represent them as an MSA.

• create a hierarchical tree of proteins based on this MSA using the CobWeb algorithm [284] from the Weka software [285].

• use a log-likelihood statistical analysis approach [286] to identify sub-families and the key subfamily-segregating amino-acid positions within the MSA.

Illustration VIII.7: ASMC pipeline
The bioinformatics pipeline of the Active Site Modelling and Clustering method uses a set of initial protein sequences from a given family, as well as at least one 3D structure for one of the family's members. Cavity detection allows the retrieval of the 3D coordinates of the active site. Homology modelling ensures the retrieval of the amino acids involved in the active site for all proteins. A statistical analysis allows the isolation of key amino acids from the set, and the clustering separates the protein family into subfamilies based on these key residues.


The method results in a protein subfamily clustering, as well as knowledge of which amino acids are highly conserved in the active site and which ones discriminate between subfamilies. This clustering was called "ASMC-Cobweb" and had 31 non-singleton clusters, covering 672 BKACEs (92.7% of 725).

An additional clustering of ASMC active site residues was attempted using multiple correspondence analysis. This clustering was called "ASMC-MCA" and had 15 non-singleton clusters, covering all 725 BKACEs. I decided to keep both ASMC clusterings, as each was able to isolate different kinds of extreme cases that the other did not, as shown in a manual analysis by Karine BASTARD.

High-level clustering results are summarised in the following table:

Clustering      Coverage21      Non-Singleton Clusters
GC              562 (77.5%)     33
SIM             690 (95.2%)     39
PHYLO           701 (96.7%)     71
SCI-PHY         704 (97.1%)     46
ASMC-Cobweb     672 (92.7%)     31
ASMC-MCA        725 (100%)      15

21 Coverage is given only for non-singleton clusters.

VIII.C.6. Cluster ensembles

Historically, a first foray into BKACE functional space was conducted on the basis of an ad hoc rule-based clustering, integrating results from the previous PHYLO, GC and SIM clusterings. This approach was later abandoned due to its lack of rigour, but was useful in confirming, in a preliminary biochemical assay, that the approach was indeed promising. The framework that superseded it is known as cluster ensembles.

VIII.C.6.a. Protocols

Cluster ensembles is a statistical framework for integrating multiple clusterings/partitions of a same set of objects [287]. It is mostly used to combine multiple component clusterings so as to obtain one single integrated "consensus" clustering, and it can also establish hierarchical trees of clusterings to help compare them.

Our objective here was to obtain a final protein subfamily clustering based on the individual clusterings previously discussed. We used CLUE, an R package implemented by Hornik et al. [288].

Several different choices can be made in CLUE to determine how the integrated clustering is established from the component clusterings:

• component clusterings of a same set of objects can be given different weights, modifying their influence on the integrated clustering.

• the integrated clustering can be calculated using several methods that attempt to optimise a given summary statistic that measures how “close” the integrated clustering is to all the component clusterings according to a clustering similarity metric.

• the integrated clustering can be either “hard” (objects are attributed to a single cluster) or “soft” (objects have probability-like attributions to one or several clusters).

• finally, CLUE does not support missing values for object clusterings (i.e. unclustered objects in a component clustering). Two approaches are possible for adapting our component clusterings with missing values: 1) as missing values correspond, or can be considered to correspond, to singletons, they can be replaced by artificial singleton clusters; 2) as missing values reflect the inability of the method to cluster the given BKACE proteins, they can all be added to a single "dustbin" cluster.

Several tools are available to evaluate the quality of an integrated clustering. The clustering tree is a hierarchical classification of the component and integrated clusterings, based on the clustering similarity metric. The agreement matrix is an n*n matrix (where n is the number of objects clustered) in which each cell (i, j) counts the number of component clusterings that cluster objects i and j together (modulated, if necessary, by component weights). High agreement amongst component clusterings leads to many high and low values in the matrix, and few intermediate values. When the matrix is represented graphically using a colour scheme reflecting count values, and if the rows and columns are permuted properly, a block structure should be visible along the matrix diagonal when agreement is strong. The quality of an integrated clustering can thus be appreciated by visually evaluating this block structure after ordering the rows and columns according to the integrated clusters (themselves ordered by cluster size for best visual effect).
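The agreement matrix itself is straightforward to compute; a Python sketch follows (label vectors and weights are hypothetical inputs):

    import numpy as np

    def agreement_matrix(clusterings, weights=None):
        """clusterings: list of cluster label vectors, one per component,
        each of length n (the number of clustered objects)."""
        n = len(clusterings[0])
        weights = weights or [1.0] * len(clusterings)
        M = np.zeros((n, n))
        for labels, w in zip(clusterings, weights):
            labels = np.asarray(labels)
            M += w * (labels[:, None] == labels[None, :])   # same-cluster pairs
        return M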

I experimented with component cluster weights, cluster integration methods and their parameters, and with the singleton/dustbin approaches to missing value management. I also experimented with using a factorial analysis in order to integrate the clusterings (a multiple correspondence analysis, followed by a hierarchical clustering in the resulting factorial space). The resulting integrated clusterings were manually evaluated on the basis of the number of final clusters, the number of final singletons, the appearance of the agreement matrix, the clustering tree, etc.

All of the component clusterings used here were derived from trees, so it would seem natural to try to establish the tree behind the consensus clustering. Please note, however, that the consensus clustering is not based on a tree: it is the direct result of an optimisation algorithm using clustering metrics. It is thus not possible to obtain a hierarchical tree of the objects clustered in the consensus clustering. In order to derive a consensus tree, one would have to turn to the tree ensembles statistical framework (as noted in the discussion later).


VIII.C.6.b. Results

The agreements between the various component clusterings (in both their dustbin/singleton versions) and the possible consensus clusterings (i.e. Factorial analysis, Hard Manhattan/Hard Euclidean, weighted/unweighted versions) are illustrated in the following agreement tree:

Illustration VIII.8: Agreement clustering tree
The agreement tree is a hierarchical clustering tree based on the disagreements between component and consensus clusterings. Component clusterings have the "cl_" prefix, and are: GC (Genomic Context), ASC_WEB (ASMC with CobWeb), ASC_MCA (ASMC with factorial analysis), PHY (PHYLOgenetic), SIM (similarity-based) and SP (Sci-Phy), each for the different protocols for handling unclustered proteins (dustbin or singletons). Possible consensus clusterings have the "con_" prefix and are boxed in purple. The different consensus clustering settings are: MCA-M-W (factorial analysis), HM (hard manhattan), HE (hard euclidean), for weighted and unweighted components (TRUE/FALSE, respectively), and for both unclustered protein protocols (dustbin/singletons). The chosen MegaClustering corresponds to "con_HM_TRUE_singletons".


Using this tree and details on the various consensus clusterings, the final parameter choices were:

• SIM weight: 0.66; PHYLO weight: 0.33; SCI-PHY weight: 0.33. Justification: all three of these methods were expected to share a large part of common information; their weights were downsized to account for this and for their respective coverages.

• GC weight: 2.00. Justification: the GC weight was favoured as it was built from a unique information source, and to compensate for its low coverage.

• ASMC (CobWeb) weight: 0.67; ASMC (MCA) weight: 0.50. Justification: the ASMC clusterings are considered to be built on functionally-relevant amino acid residues, hence a relatively good weight for both clusterings.

• Cluster similarity measure: Manhattan. Justification: Manhattan distances are more stringent for integrated clustering construction than Euclidean distances, as they amplify differences.

• Integrated clustering type: Hard. Justification: we wanted each protein to be assigned to a single determined cluster, not probabilistically to several.

• Singleton handling: Singletons. Justification: empirically, the Dustbin approach tended to exaggerate false relationships between BKACEs whose only common point was that they weren't clustered by the same methods.


The chosen integrated clustering, affectionately called the MegaClustering, covered all 725 kept BKACE proteins and was composed of 32 MegaClusters, none of which were singletons. MegaCluster sizes varied from 3 to 130, with an average of 22.7 and a median of 10.5. Only 4 MegaClusters contained more than 40 proteins (sizes: 56, 73, 99, 130), though 14 contained fewer than 10. This choice is supported by a visual inspection of the agreement matrix below:

Illustration VIII.9: MegaClustering Agreement Matrix
The agreement matrix is a graphical representation of how well the different component clusterings agree, and how well this agreement is captured by the MegaClustering. For each pair (i,j) of clustered objects (here, BKACE proteins), the matrix value M(i,j) counts the number of distinct component clusterings that cluster objects i and j together. A colour scale from black to blue to red reflects the different value levels in this figure. Ordering the rows and columns according to MegaClusters of increasing size should, in the case of high agreement between component and Mega clusterings, reveal a diagonal block-like structure of higher values, as is the case here.
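For reference, the agreement matrix is straightforward to compute from the component clusterings' label vectors. The following Python sketch is illustrative only (the function name and data layout are my assumptions, not those of the actual pipeline):

    import numpy as np

    def agreement_matrix(clusterings):
        # clusterings: list of label vectors, one per component clustering,
        # each of length n (labels for the same n BKACE proteins, in the
        # same order; singletons simply carry unique labels).
        n = len(clusterings[0])
        M = np.zeros((n, n), dtype=int)
        for labels in clusterings:
            labels = np.asarray(labels)
            # add this clustering's boolean co-membership matrix
            M += (labels[:, None] == labels[None, :]).astype(int)
        return M

Rows and columns can then be reordered by MegaCluster size to reveal the block-diagonal structure described above.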


VIII.C.7. Genomic Context Gene Families

In order to assist bioanalysis of individual BKACE proteins, it was decided that proteins from BKACE gene neighbourhoods should be clustered into gene/protein families that would capture information at the MegaCluster level. This was done by applying stringent BLAST score cut-offs to render binary all protein-protein similarities between BKACE context proteins. A similarity graph was then constructed over all BKACE context proteins for BKACEs from the same MegaCluster. A simple single-linkage algorithm (i.e. a connected component detection algorithm) was used to isolate the families, which were then saved to the database. This very conservative approach should identify only the gene families best co-conserved with the BKACE family, and should be instrumental in extracting specific annotations for them.

A total of 5,544 genes coming from all 32 GC clusters were clustered by this protocol into 955 gene families.
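As a minimal sketch of this protocol (assuming, hypothetically, that BLAST results are available as scored protein pairs; the function name and cut-off handling are mine), the family construction reduces to connected-component detection on the binary similarity graph:

    import networkx as nx

    def context_gene_families(proteins, blast_hits, score_cutoff):
        # blast_hits: iterable of (protein_a, protein_b, score) tuples.
        # Similarities are rendered binary by a stringent score cut-off;
        # families are the connected components of the resulting graph
        # (i.e. single linkage).
        graph = nx.Graph()
        graph.add_nodes_from(proteins)  # keep unmatched proteins as singletons
        graph.add_edges_from((a, b) for a, b, score in blast_hits
                             if score >= score_cutoff)
        return [sorted(component) for component in nx.connected_components(graph)]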

VIII.C.8. Inference of possible BKACE substrates

In [240], a subset of BKACE proteins was shown to catalyse the transformation of 3-keto-5-aminohexanoate, and in [273] the study of the 3D structure of one of these proteins led to the proposition of a chemical mechanism that could be linked to key amino acids heavily conserved across DUF849. The working hypothesis behind the BKACE project was that the members of family DUF849 all catalysed similar reactions, alike in their chemical mechanisms but differing in their substrates. More specifically, DUF849 proteins were expected to catalyse the cleavage of a carbon-carbon bond of 3-ketoacids (beta-keto acids, hence the name BKACE for beta-keto acid cleaving enzyme) for a variety of substrates with varying “chemical decorations” (i.e. chemical functional groups with different physical, chemical and steric properties) that other, less conserved amino acids would interact with so as to define substrate specificity.

It was thus necessary to establish a list of potential BKACE substrates. The most obvious approach was to list all beta-keto acids known in today's biochemistry. Using chemoinformatics tools proposed by KEGG and PubChem, we performed substructure searches for beta-keto acid groups, which led to a long list that had to be manually pruned by our biochemists.
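The actual searches relied on the tools provided by KEGG and PubChem; purely as an illustration of the principle, an equivalent local substructure search could be sketched with RDKit (the SMARTS pattern below is my approximation of a beta-keto acid group):

    from rdkit import Chem

    # Approximate beta-keto acid (3-ketoacid) pattern: a keto carbonyl
    # two carbons away from a carboxylic acid (or carboxylate) group.
    BETA_KETO_ACID = Chem.MolFromSmarts("C(=O)CC(=O)[OX2H1,OX1-]")

    def is_beta_keto_acid(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None and mol.HasSubstructMatch(BETA_KETO_ACID)

    # e.g. acetoacetate, the common product of BKACE reactions:
    print(is_beta_keto_acid("CC(=O)CC(=O)[O-]"))  # True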

Not all of these molecules were commercially available. Some were successfully synthesised by the Laboratoire de chimie organique et biocatalyse (LCOB) at Genoscope, though others were not. The list of substrates tested is given below, along with their classification according to the stereochemical properties involved in substrate-enzyme interaction (as determined by preliminary substrate docking experiments), and some indications of protocol modifications (for future reference; see footnote 22). Chemical names have been simplified, and designate the substrate used in the biochemical assay, which does not necessarily correspond to the actual BKACE activity substrate due to protocol variations:

Name                 | Functional group            | Charge / hydrophobicity | Steric hindrance | Protocol

Negative charged groups:
ketoglutarate        | carboxyl                    | negative                | +    |
malonyl-CoA          | carboxyl                    | negative                | +    | reverse
ketoadipate          | carboxyl                    | negative                | ++   |
succinyl-CoA         | carboxyl                    | negative                | ++   | reverse

Partial charged groups:
acetoacetyl-CoA      | carbonyl                    | partial negative        | +    | reverse
hydroxybutyryl-CoA   | hydroxyl                    | partial negative        | +    | reverse

Hydrophobic groups:
benzoyl-CoA          | aromatic                    | hydrophobe              | ++   | reverse
crotonyl-CoA         | unsaturated aliphatic chain | hydrophobe              | 0    | reverse
propionyl-CoA        | aliphatic chain             | hydrophobe              | 0    | reverse
butyryl-CoA          | aliphatic chain             | hydrophobe              | +    | reverse
isobutyryl-CoA       | branched aliphatic chain    | hydrophobe              | +    | reverse
hexanoyl-CoA         | aliphatic chain             | hydrophobe              | +++  | reverse
decanoyl-CoA         | aliphatic chain             | hydrophobe              | ++++ | reverse
methylketopentanoate | aliphatic chain             | hydrophobe              | +    | ester
methylketohexanoate  | aliphatic chain             | hydrophobe              | ++   | ester
methylketoisocaproate| branched aliphatic chain    | hydrophobe              | ++   | ester
methyloxooctenoate   | unsaturated aliphatic chain | hydrophobe              | +++  | ester

Positive charged groups:
ketoaminohexanoate   | aliphatic + amine           | positive                | +    |
carnitine            | aliphatic + amine           | positive                | 0    | carnitine

“Mixed” groups:
aminooxohexanedioate | carboxyl + amine            | overall: neutral        | ++   |

22 Protocol modifications: “reverse”: reaction tested in the reverse direction, i.e. from the CoA-ligated product; “ester”: substrate available as an ester, requiring coupling with an esterase reaction to produce the substrate, reaction tested in the normal direction; “carnitine”: this substrate required a specific protocol, reaction tested in the normal direction.


I shall now briefly detail the experimental protocol used for testing these BKACE activities on these substrates.

VIII.C.9. Experimental Protocol

The details of the biochemical protocol(s) used to evaluate the activity levels of the BKACE proteins on the various proposed (and available) substrates will be described in the BKACE article and are not part of my expertise. However, I shall underline several important points concerning the assays that largely determined how I chose to conduct the statistical analyses of the results.

• Choice of BKACE proteins to test: many experimental factors influence the organisation and success rates of biochemical assays. The genes coding for BKACE proteins are generally GC-rich (i.e., have a high proportion of Guanine and Cytosine in their nucleic sequence), which in practice makes them much harder to amplify by PCR before transferring them to expression vectors (transformed Escherichia coli). Furthermore, some of the BKACE host organisms were available neither in the Genoscope prokaryote strain bank nor commercially. BKACE proteins to test were thus manually selected by the LCAB team while trying to cover each MegaCluster evenly (roughly 10% of each cluster). 171 were finally successfully cloned and expressed.

• Experimental design: all 171 chosen BKACE proteins were tested against all available substrates (16 in total). Crude total protein lysate was recovered from each vector expressing a different BKACE. The lysate was equally distributed across 4 wells for each substrate: two with the substrate and two without, making 2 repetitions of each condition. Furthermore, the entire experimental protocol was duplicated (these repetitions will be referred to as “repetition 1” and “repetition 2”).

• Activity estimation: activity was estimated using substrate-type-specific protocols across the 4 wells for each BKACE. These protocols had in common that they involved measuring the optical density (absorbance) of each well at a given wavelength over time, giving access to the concentration of a substrate or product during the experiment. The resulting curves were windowed and used to derive the initial reaction speed (which corresponds to the derivative of the curve at time 0, but which, in the case of BKACE activities, could be approximated by the slope of a linear function fitted to the windowed concentration curve). Finally, the difference in initial reaction speeds between substrate-containing and substrate-free wells was taken to represent the initial reaction speed for the tested BKACE enzyme.

Due to the use of total protein lysate, in which the proportion of protein corresponding to the BKACE protein is unknown (and, as shown by the biochemists, is NOT constant across BKACEs), the actual activity measures are NOT directly comparable between BKACE proteins. This issue will be discussed in the statistical analysis sections.
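As a sketch of this derivation (hypothetical function and parameter names; the actual windowing was substrate-type-specific), the initial speed is simply the slope of a linear fit over the windowed curve, and the activity is a difference of such slopes:

    import numpy as np

    def initial_speed(times, optical_density, window=10):
        # slope of a linear fit over the first `window` points of the
        # concentration curve, approximating its derivative at time 0
        slope, _intercept = np.polyfit(times[:window], optical_density[:window], deg=1)
        return slope

    def bkace_activity(times, od_with_substrate, od_without_substrate, window=10):
        # difference in initial speeds between substrate-containing and
        # substrate-free wells (replicate wells averaged beforehand)
        return (initial_speed(times, od_with_substrate, window)
                - initial_speed(times, od_without_substrate, window))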

VIII.C.10. Statistical representations and analyses

VIII.C.10.a. Repeatability analysis

The first step in validating the experiment is showing that it is repeatable and not overly subject to noise or exterior factors. To do this, I verified that measures were comparable from one repetition to the next. First, the repeatability of lysate quantities across repetitions is graphically represented in the figure below. The dots cluster around the diagonal, indicating good repeatability.

Illustration VIII.10: Lysate quantity repeatability
The quantity of total protein from extracted lysate is shown here, plotting values from repetition 1 against those of repetition 2 (arbitrary units). Even though it is not possible to link total protein quantity to BKACE enzyme quantity, the relatively good repeatability of extracted protein is reassuring, allowing us to work with the hypothesis of comparable quantities of enzyme between repetitions. Different dot shapes correspond to different activity-measuring protocols.


Next, enzymatic activity repeatability can be visualised by plotting activity measures for each (substrate, BKACE) pair from one repetition against measures from the other. Again, the dots follow the diagonal, showing high repeatability.

Illustration VIII.11: Activity repeatability
The measured activities are shown here, plotting values from repetition 1 against those of repetition 2 (arbitrary units). Each colour corresponds to one type of substrate. Negative activities are not shown. All points are heavily grouped around the diagonal, indicating that the activity measures are highly repeatable.

Taken together, these results reassured us that the measured biochemical activities were consistent across repetitions, allowing us to treat both repetitions equally. The next step in the analysis of these values is, of course, deciding which pre-treatments may be necessary.

VIII.C.10.b. Activity preprocessing

Biochemical assays are subject to many kinds of perturbations that add stochasticity to all measures. It is thus important to evaluate which measured values actually reflect the process being measured, and which correspond to noise. This is particularly important in the BKACE project because, as said, activity measures are not comparable between different BKACE proteins. Some way, however, had to be found to distinguish active from inactive proteins for each substrate. I imagined and implemented several strategies for dealing with this particular kind of noisy data. I shall briefly describe points of interest that are common to some or all of these strategies, focusing on the final choices made in the analysis for publication.

Manual cut-offs: the first, simplest idea is to ask the biochemists which activity measure values they consider high enough to be trustworthy, and which they consider too small to be distinguishable from noise. This boils down to establishing a list of manually-chosen, substrate-specific value cut-offs. My first analyses were based on manual cut-offs supplied by the LCAB team.

Distribution-based cut-offs: the next idea was to analyse the activity measure distribution for each substrate, and to base a cut-off on it. Manual definition of a cut-off on the basis of a distribution aimed at separating the numerous low values from the few high values, as only a minority of BKACE proteins were expected to act on any given substrate. I refined this procedure by fitting a mixture of normal distributions to the activity distribution using the “mixdist” R package. The idea behind this is simple: the measures we observe should arise from two parent distributions, one concerning the majority of proteins inactive on the given substrate, and thus centred on 0 or some small value; the other corresponding to the minority of proteins active on that substrate, but whose values cannot be compared, meaning that their parent distribution is a very wide normal distribution (see footnote 23) centred on the average measure value. Proteins were then declared active when their activity measure was above the value at which the inactive distribution becomes negligible (empirically, 10x lower) compared to the active one. This distribution-based cut-off procedure was the final approach kept for publication.
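The fits were done with the “mixdist” R package; the following scikit-learn sketch reproduces the same logic for illustration only (the grid resolution and function name are my choices):

    import numpy as np
    from scipy.stats import norm
    from sklearn.mixture import GaussianMixture

    def distribution_based_cutoff(values, ratio=10.0):
        # fit a two-component normal mixture: a narrow "inactive" component
        # near 0 and a wide "active" component around the mean measure
        x = np.asarray(values, dtype=float).reshape(-1, 1)
        gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(x)
        means = gm.means_.ravel()
        sds = np.sqrt(gm.covariances_.ravel())
        weights = gm.weights_
        inactive, active = (0, 1) if means[0] < means[1] else (1, 0)
        grid = np.linspace(x.min(), x.max(), 1000)
        dens_inactive = weights[inactive] * norm.pdf(grid, means[inactive], sds[inactive])
        dens_active = weights[active] * norm.pdf(grid, means[active], sds[active])
        # cut-off: smallest value where the inactive density becomes
        # negligible (ratio-fold lower) compared to the active one
        above = dens_active >= ratio * dens_inactive
        return grid[above].min() if above.any() else np.inf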

Real values versus binary values: once a cut-off has been applied, the activity measures can either be kept as is (i.e. zero or strictly positive real values) or rendered binary (0 or 1). Most of my first analyses were based on real-valued activities; however, since values were not comparable between BKACEs, we finally decided to use binary values, as whatever resolution was lost in this transformation could not correspond to any actual difference in activity levels, but only to differences in BKACE protein expression levels.

Normalisation: under the hypothesis that the fraction of total lysate protein corresponding to the BKACE protein was constant across BKACEs, “normalising” (dividing) measured activities by the total quantity of complete lysate was an obvious procedure for making results comparable.

23 It would have been interesting to try out a uniform distribution rather than an extremely flat normal distribution. However, estimating the borders of a uniform distribution would not have been straightforward, and I was unable to find an R package capable of estimating a mixture of distributions of different types (normal + uniform).


However, given that the fraction was NOT constant across BKACEs, it remained debatable whether to normalise or not. I thus tested both with and without normalisation. Finally, we decided that normalising was not useful, as it had very little influence in practice on numerical or binary activities.

Profiling: activity measures might not be comparable between BKACE proteins, but they are comparable across all substrates for a given BKACE. Hence the idea of analysing not the activity values themselves, but the fraction each observed activity represents of the total activity of its BKACE. Activity measures, expressed as percentages, thus become comparable across BKACEs. Unfortunately, this approach has several drawbacks:

• percentage analysis is not as straightforward as value analysis, especially for multivariate statistical methods such as ANOVAs and Factorial Analysis;

• activity percentages behave in a particular way with respect to total activity: for proteins with systematically low activity values (i.e. not responsive to any substrate), the small values that remain (even after applying cut-offs, if any) and that correspond to noise are over-represented;

• working with profiles based on binary values only worsens the previous problem;

• a profile corresponding to no values (when none passed the cut-off) is meaningless.

I experimented heavily with profiling, as it was the only way to render measures comparable across BKACEs. Finally, however, working with simple binary activities proved simpler, more intuitive and easier to analyse correctly than working with profiles. A minimal sketch of profile computation is given below.
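Purely for illustration (pandas-based; names are mine), profiling amounts to row-normalising the activity matrix, and the last drawback above materialises as a division by zero:

    import pandas as pd

    def activity_profiles(activities):
        # activities: DataFrame with rows = BKACE proteins, columns = substrates
        totals = activities.sum(axis=1)
        # BKACEs with no retained activity have a zero total: their
        # profiles come out as NaN, i.e. meaningless, as noted above.
        return activities.div(totals, axis=0) * 100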

Perspectives: I am relatively confident that a better approach based on distribution mixtures and fuzzy clustering could process this data with better resolution and power. It is too late, however, to develop a new analytical method for the upcoming publication. Once the next similar project is initiated, I intend to propose a collaboration between the Genoscope and the Laboratoire Statistiques et Génome, where I did my master's internship, and whose expertise includes dealing with this kind of data.

VIII.C.10.c. Substrate tree and clustering

One might wish to compare the pre-defined substrate typology to the observed results, in the hope of determining whether there is any link between activity profiles and substrate characteristics. It is possible to build a hierarchical tree of the substrates based on the similarities between their activities as measured across all tested BKACE proteins. To build such a tree, I carried out an agglomerative hierarchical clustering using Manhattan distances (in order to exaggerate differences) between binary substrate activity profiles, though other protocols could be used. This tree can also be cut in order to establish a clustering of substrates, which might then be interesting to compare to the substrate typology based on physico-chemical properties. Unfortunately, I did not have the time to deepen this kind of analysis, as it would have required an improved formalisation of the substrate typologies that I was not qualified to make. The tree itself, however, was useful for the graphical representations, as described below.
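A minimal SciPy sketch of this construction (the average-linkage criterion is an assumption of mine; the text above only fixes the Manhattan distance):

    import numpy as np
    from scipy.cluster.hierarchy import cut_tree, linkage
    from scipy.spatial.distance import pdist

    def substrate_tree(binary_profiles, n_clusters=None):
        # binary_profiles: substrates x BKACEs matrix of 0/1 activities
        distances = pdist(np.asarray(binary_profiles), metric="cityblock")
        tree = linkage(distances, method="average")
        # optionally cut the tree into a flat clustering of substrates
        clusters = None if n_clusters is None else cut_tree(tree, n_clusters=n_clusters).ravel()
        return tree, clusters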

VIII.C.10.d. Graphical representation of results

In order to appreciate how the BKACE proteins performed across substrates, and hopefully to show that the MegaClustering (or any component clustering of interest) indeed segregates proteins into iso-functional sub-families, it was necessary to conceive graphical representations that could account for the different factors. Several points relative to these representations are discussed here.

Substrate typology: substrate typology was determined a priori on the basis of substrate physico-chemical properties, as detailed in section VIII.C.8. Substrate typology can be represented by a colouring scheme. Activity-based substrate clustering could also be representative of substrate typology, and can be represented by a hierarchical tree of substrates, as described in the previous section.

BKACE phylogeny: BKACEs can be organised according to any phylogeny. Here we had access to the QuickTree-based phylogeny established on the basis of BKACE sequence similarity. We also had access to the ASMC tree, a sort of phylogeny based on the 3D positions of key active-site amino acids, though no figures using this tree will be shown here.

Component/Mega Clustering: BKACEs can be organised or coloured according to component/consensus cluster membership.

Real/Binary values: different real activity values for the same BKACE can be represented by stacked coloured bars. Binary activities can be represented by stacked bars or by a presence/absence matrix.

These different considerations led to various representations, of which the most informative and visually striking are detailed hereafter.

183 / 229 Development: The BKACE project

Illustration VIII.12: Substrate colour code chosen for BKACE graphical representations
Each colour corresponds to a single substrate. “Mixte” designates aminooxohexanedioate, and KAH designates ketoaminohexanoate, the original substrate from the lysine degradation pathway. This colour code is used for the substrates in all figures.

Real-valued activities (with cut-off) grouped by MegaCluster:

Illustration VIII.13: Real-valued activities (with cut-off) grouped by MegaCluster
Each colour corresponds to a single activity. Each bar corresponds to the stacked activities of one BKACE. BKACEs are horizontally grouped according to their MegaClusters. As activity measures are not comparable between BKACEs, the vertical axis can be considered to have an arbitrary unit.

Real-valued activities (after applying distribution-based cut-offs) for each BKACE are represented as stacked bars coloured according to substrate typology. BKACEs are horizontally organised according to MegaCluster membership. It is visually obvious that, in general, each MegaCluster has a coherent activity profile. This is a first step in validating the MegaClustering approach, as well as the working hypothesis: BKACEs can be organised into sub-families that catalyse mechanistically-similar reactions on different substrates.

Binary activities organised by phylogenetic and substrate trees:

Illustration VIII.14: Binary activities organised by phylogenetic and substrate trees
Each row in the matrix corresponds to a substrate, and each column to a BKACE protein. Non-white blocks indicate a positive activity. Substrates are organised according to a hierarchical tree built on binary activity similarities. BKACEs are organised according to the corresponding phylogenetic tree below. The leaves of the phylogenetic tree are coloured according to MegaClustering membership. “+” and “-” signs are qualitative indications of the quantity of protein extracted in the experiments: “-” values are to be interpreted with more care (as in these cases, the absence of an activity could correspond to the absence of the protein, rather than the absence of catalysis).

Binary activities (after applying distribution-based cut-offs) are represented by squares coloured according to substrate typology. Substrates are organised vertically according to the hierarchical tree based on activity profile similarity described in section VIII.C.10.c. BKACEs are organised horizontally according to phylogeny. The phylogenetic tree leaves are coloured according to MegaClustering. Once again, this graphical representation illustrates the coherence between an a priori grouping of BKACEs (here, the phylogenetic tree) and the observed activity profiles.

VIII.C.10.e. Statistical validation of clustering

Validating the MegaClustering in a rigorous, statistical manner with respect to the biochemical results is, at best, a hard task. Indeed, one must compare a partition of the proteins of the DUF849 family to the proteins' respective activities, or to some sort of clustering thereof. Several protocols could be imagined, each with its advantages and drawbacks.

The first idea I put into application, after several discussions with Alain VIARI, was to compare, for each separate activity, the partitioning of BKACE proteins according to a) the MegaClustering and b) the presence/absence of said activity in the proteins. Fisher's exact test is suited to this kind of comparison: it tests whether the modalities of two categorical variables (here, the different MegaClusters, and the presence/absence of the target activity) “attract” or “repel” each other, i.e. whether some combinations are seen more (respectively less) often than expected by chance alone. I thus conducted a Fisher test on each activity, and used the Benjamini-Hochberg correction for multiple testing. The results are summarised in the table below.
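These tests were run in R, whose fisher.test handles r x c contingency tables exactly. The Python sketch below only illustrates the workflow, substituting SciPy's chi-square contingency test, since SciPy's exact Fisher test is limited to 2 x 2 tables (all names are mine):

    import pandas as pd
    from scipy.stats import chi2_contingency
    from statsmodels.stats.multitest import multipletests

    def clustering_vs_activities(megacluster, binary_activities):
        # megacluster: Series of MegaCluster labels indexed by BKACE id;
        # binary_activities: DataFrame (BKACE x substrate) of 0/1 values
        crude = {}
        for substrate in binary_activities.columns:
            table = pd.crosstab(megacluster, binary_activities[substrate])
            crude[substrate] = chi2_contingency(table)[1]  # p-value
        crude = pd.Series(crude)
        # Benjamini-Hochberg correction across all substrates
        corrected = multipletests(crude.values, method="fdr_bh")[1]
        return pd.DataFrame({"crude": crude,
                             "corrected": pd.Series(corrected, index=crude.index)})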


Substrate          | Crude p-value | Sign. | Corrected p-value | Sign.
ketoglutarate      | 0.00E+000     | ***   | 0.00E+000         | ***
MalonylCoA         | 2.00E-004     | ***   | 4.00E-004         | ***
ketoadipate        | 0.00E+000     | ***   | 0.00E+000         | ***
succinylCoA        | 1.00E-004     | ***   | 1.00E-004         | ***
AcetoacetylCoA     | 5.95E-001     |       | 0.6278            |
HydroxybutyrylCoA  | 0.00E+000     | ***   | 0.00E+000         | ***
BenzoylCoA         | 0.0075        | **    | 0.0109            | *
crotonylCoA        | 3.00E-004     | ***   | 5.00E-004         | ***
PropionylCoA       | 0.00E+000     | ***   | 0.00E+000         | ***
ButyrylCoA         | 0.00E+000     | ***   | 0.00E+000         | ***
IsobutyrylCoA      | 0.0182        | *     | 0.0231            | *
hexanoylCoA        | 0.0018        | **    | 0.0033            | **
DecanoylCoA        | 0.0029        | **    | 0.005             | **
KAH                | 0.0464        | *     | 0.0551            | .
carnitine          | 0.00E+000     | ***   | 0.00E+000         | ***
Mixte              | 0.0169        | *     | 0.0229            | *
ketohexanoate      | 0.8338        |       | 0.8338            |
ketoisocaproate    | 0.0041        | **    | 0.0064            | **
methyloxooctenoate | 0.5612        |       | 0.6272            |

Illustration VIII.15: Fisher tests on binary activities versus MegaClustering
For each substrate, the results of the Fisher test are given before (“crude”) and after (“corrected”) Benjamini-Hochberg correction. Significance codes ('***': highly significant, '**': moderately significant, '*': significant, '.': not significant but possibly a trend) are shown for convenience.

Even after correcting for multiple tests, most substrates have their binary activities either strongly associated with or strongly dissociated from MegaClusters. This suggests that the MegaClustering does indeed capture some measure of substrate specificity amongst the BKACE proteins.

My second idea involved using factorial analysis to try to bridge the gap between multiple variables (the activities) and one categorical variable (the MegaClustering partition). A Principal Components Analysis on the real-valued activities (or a Multiple Correspondence Analysis on the binary values) could distribute the proteins into a multidimensional space, within which one could either 1) establish an activity-based clustering of proteins, to be compared to the MegaClustering, or 2) project the MegaClustering as an illustrative variable, and evaluate visually whether proteins of similar activities correspond to the MegaClusters. The latter is of little more use than the previous validations, as it also requires manual assessment of the results. I chose to cluster BKACE proteins using their binary activity profiles, using an MCA and a hierarchical clustering on a Manhattan distance in the factorial space. After manual inspection, the obtained tree could be cut cleanly into 9 clusters. I then attempted to apply Fisher's test to this clustering and the MegaClustering. Unfortunately, this proved intractable, and I resorted to an approximate Chi-Square test. The test concluded that there were highly significant attractions/repulsions between activity-based clusters and MegaClusters (p-value < 2.2e-16). Manual inspection of the contingency table shows this as well, though the segregation is not as exclusive as we might have hoped (i.e. each MegaCluster corresponding to one activity cluster and vice-versa). To illustrate the cross-links, I generated a graphical representation of the contingency table, shown below:

Illustration VIII.16: Correspondence between activity-based clustering and MegaClustering
The MegaClustering contained 32 clusters (cl_MEGA_1 to cl_MEGA_32); the binary activity-based clustering had 9 (act_1 to act_9). Correspondences between the two are represented here as a graph whose adjacency matrix corresponds to the contingency table between the two clusterings. This means that the thickness of an edge between a MegaCluster node and an activity cluster node is proportional to the number of BKACE proteins contained in both. Despite multiple correspondences between clusters of each type, the BKACE proteins are not distributed as one would expect by pure chance.
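For illustration (hypothetical names; the original figure was generated by other means), such a representation can be derived directly from the contingency table, with edge weights equal to the shared protein counts:

    import networkx as nx
    import pandas as pd

    def correspondence_graph(mega_labels, activity_labels):
        # bipartite graph whose weighted adjacency matrix is the
        # contingency table between the two clusterings
        table = pd.crosstab(mega_labels, activity_labels)
        graph = nx.Graph()
        for mega in table.index:
            for act in table.columns:
                count = int(table.loc[mega, act])
                if count > 0:
                    graph.add_edge(f"cl_MEGA_{mega}", f"act_{act}", weight=count)
        return graph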

Another idea, which I chose not to explore, would have been to use Linear Discriminant Analysis to see whether MegaClusters could be discriminated on the basis of real-valued activities. An issue with LDA is that it attempts to build linear predictors, based on the quantitative variables (activity measures), that maximally separate cluster barycentres. As already said, the real-valued activities are not really exploitable, due to the impossibility of comparing numeric results across BKACEs. Furthermore, such linear predictors would not correspond to latent biological concepts, and their interpretation would thus be near impossible. For these reasons, I did not carry out this particular analysis.

VIII.D. Conclusion

In this study, a collection of bioinformatics strategies and experimental procedures allowed the exploration of the functional space of a hitherto little-known enzyme family, DUF849. A comparative genomics study suggested a first enzymatic activity for a dozen DUF849 proteins: the 3-keto-5-aminohexanoate cleavage enzymatic activity (KCE) from the lysine degradation pathway, which was confirmed by biochemical assays. The experimental determination of the 3D structure of the protein opened the way to establishing the chemical mechanism of the KCE activity, and to its extrapolation to the generic BKACE activity for the rest of the family. In order to explore this avenue, protein members of DUF849 were clustered using both sequence- and context-based methods, in order to obtain sub-families that would hopefully be iso-functional with respect to substrate specificities. Potential candidate substrates for the BKACE activity were selected using chemical substructure searches. The sub-families were used to select BKACE proteins for biochemical testing. Activity profiles seemed to concord well with sub-family delimitation. All in all, the close collaboration between bioinformatics and experimental biochemistry has helped advance our understanding of DUF849, and has led to the discovery of several novel enzymatic activities of the BKACE type. This strategy should be applicable (with some modifications) to the study of other families of unknown function, or even of families of known function but suspected of functional promiscuity and/or underground metabolism.

I shall now discuss some of our results, in the light of work still in progress.

VIII.E. Discussion and perspectives

VIII.E.1. From in vitro to new metabolic pathways

The work on the BKACE project has led to the discovery of many novel enzymatic activities. To obtain additional information about these, we conducted a bioanalytical survey of the genomic contexts of BKACE proteins with confirmed activities. The rationale was that, if bioanalysis traditionally finds functional clues in the functions of co-localised genes, then conserved co-localised genes should be all the more informative. Three genomic and metabolic contexts were validated by manual bioanalysis: the confirmation of the original lysine degradation pathway (already presented in [240]), and the discovery of two new pathways concerning the beta-ketoadipate and dehydrocarnitine cleavage activities.

Ketoadipate: the ketoadipate degradation pathway in MetaCyc (ID: PWY-2361) is composed of two reactions. The first transfers a CoA to the ketoadipate. The second uses another CoA while breaking a carbon bond of the ketoadipyl-CoA, producing succinyl-CoA and acetyl-CoA. From a carbon-chain point of view, the ketoadipate BKACE activity summarises both these reactions in a single step, while using one less CoA.

Illustration VIII.17: The BKACE ketoadipate degradation pathway
1. Ketoadipate degradation operons in Acinetobacter baylyi and Ralstonia eutropha. 2. Ketoadipate degradation pathways. In Acinetobacter baylyi, ketoadipate is transformed into succinyl-CoA and acetyl-CoA via two reactions catalysed by the products of the genes catI, catJ and catF. In Ralstonia eutropha, a single DUF849 protein transforms ketoadipate into acetoacetate and succinyl-CoA.


Dehydrocarnitine: in MetaCyc, the carnitine degradation pathway variant II (ID: PWY-3602) involves the transformation of carnitine into dehydrocarnitine by a dehydrogenase; dehydrocarnitine is then processed into glycine betaine by two hypothetical decarboxylation reactions. With a DUF849 enzyme, these two reactions could be replaced by two others: the BKACE activity, and a thioesterase transforming the produced glycine betaine-CoA into glycine betaine. In Pseudomonas putida, a potential operon (genes with MicroScope labels PP0301 to PP0303) containing a carnitine dehydrogenase, a DUF849 member and a putative thioesterase was found. The biochemical assays later showed that the DUF849 member indeed catalysed the cleavage of dehydrocarnitine. The activities of all the gene products of this pathway have now been experimentally confirmed (data not shown).

Illustration VIII.18: The BKACE carnitine degradation pathway
Carnitine is transformed into dehydrocarnitine (a beta-keto acid), upon which the DUF849 protein acts to generate betaine-CoA and acetoacetate. The former is then processed by a thioesterase into betaine. The three genes encoding the enzymes of this carnitine degradation pathway were found in a single Pseudomonas putida operon.

Thus, our study led to the discovery of novel enzymatic activities, some of which could be assigned to new metabolic pathways. Associating these pathways with operons strongly supports their biological relevance to the host organisms.


VIII.E.2. High-throughput Screening limitations

The high-throughput screening protocol briefly described in this manuscript allows the simultaneous experimental validation of the presence or absence of many activities across a large number of proteins. It does, however, have several drawbacks, some of which have been pointed out previously: low comparability of results; imprecise characterisation of substrate specificity, as there is no access to traditional enzymology values (KM, Vmax...) that could better describe enzymatic efficiency; and the necessity of using an expression vector, which can bias the results. Any biochemist wishing to publish enzymology data would thus have to carry out separate characterisation experiments. This is not necessarily a problem, as the high-throughput screening was designed to be purely exploratory, essentially greatly reducing the number of characterisation experiments to perform.

An alternative for the evaluation of substrate specificity within BKACE subfamilies would be the use of homology modelling and 3D docking simulations to check which BKACE proteins can accept a given substrate into their active site. Karine BASTARD (from the previous LABIS team) has initiated such a study, and preliminary results have helped deepen our understanding of the interaction between substrates and BKACE proteins. Once again, combining bioinformatics analysis with experimental tests has proven beneficial.

VIII.E.3. Genomic Context Clustering

As the definitions of genomic context and the methods for dealing with it are varied, the ways of generating clusterings based on them are numerous as well, though (to our knowledge) not actively researched. It should thus be possible to come up with different bioinformatics protocols for generating genomic context clusterings.

Multi-genome syntons could be seen as a means of generating clusters of similar neighbourhoods [5]: each multi-genome synton would be a genomic context cluster, though some way of dealing with overlapping syntons would have to be found. Other, similar approaches are presented in [143,289]. In comparison, our approach is much simpler, requires less data, and is faster to run and interpret, even though it requires more manual work. This is consistent with our very “hands-on” modus operandi in this study. And considering that the GC clustering is only one component of the MegaClustering, as well as a crude basis for finding genomic contexts for target BKACE reactions, it might be considered overkill to develop a more complicated method. This choice will be up to future users of BKACE-inspired strategies.


VIII.E.4. Why a BKACE activity?

Biochemical reactions are of interest to many industrial applications (in pharmacology, cosmetics, agronomy, agrochemistry, the food industry...) as they often offer many advantages with respect to pure chemistry approaches; a panorama of the place of biochemistry in industry can be found in [231].

The BKACE reaction may not appear, at first sight, to be of any particular industrial interest. The chemical entities involved are relatively common, though they do not participate in any metabolic pathways that currently draw attention. The decarboxylation of beta-keto acids into shorter carboxylic acids is known to occur spontaneously in normal lab working conditions, so one may wonder at the use of an enzyme capable of generating these products. The advantage of a BKACE enzyme, particularly for its host, lies in the fact that rather than producing carbon dioxide and a carboxylate via the spontaneous reaction, it produces acetoacetate and a carboxylate activated by a Coenzyme-A group (see [Illustration VIII.1: The lysine fermentation pathway], page 157). Both of these are much more easily re-injected into the organism's central metabolism. A hypothetical fitness gain brought by possessing a BKACE protein is strongly supported in some organisms by the existence of genomic contexts surrounding its coding gene, the BKACE even replacing a two-step reaction by a single-step reaction in the case of ketoadipate. Likewise, BKACE might interest industry, should this kind of metabolic pathway optimisation be required. However, the objective of this study was not to revolutionise the biochemical industry, but to prove the worth of the presented investigation strategy, which can now be applied to other families of higher technological value.

VIII.F. Conclusion

A comparative genomics breakthrough, confirmed by biochemical assays, allowed researchers at the Genoscope to gain a foothold in the functional space of metabolic reactions catalysed by proteins of the BKACE family. Analysis of the corresponding conserved domain of unknown function (DUF849) and of the genomic contexts of said proteins suggested that a family of similar metabolic reactions was waiting to be discovered. The strategy described here was created in order to explore this diversity of reactions through a close association between bioinformatics-based hypotheses and high-throughput experimental verification. Several of the molecules proposed by the strategy were effectively found to be substrates of BKACE proteins, validating the proposed mechanisms (though not all are perfectly understood). The high-throughput approach, despite several drawbacks, was able to confirm or refute these activities within the set of tested proteins, thus narrowing down future, more biochemically precise tests. Finally, in both manual and statistical tests, the sub-families generated by bioinformatics analysis appeared to be relatively homogeneous in terms of metabolic functions, thus validating the entire strategy.

Thanks to this study, a set of beta-keto acid cleaving enzymes corresponding to previously-unknown enzymatic activities has been elucidated. We have thus validated our strategy on a previously unknown family. It should be possible to adapt and improve upon this strategy in order to explore the functional diversity of other unknown families for which some functional starting point is available. Furthermore, it should be possible to rework our vision of known families with known functions, refining our understanding of the mechanisms involved, and describing with increased precision the functional promiscuity of the enzymes. Indeed, as has already been pointed out, our knowledge of functional promiscuity is expected to be far from complete, due to the loss of interest in a protein once one of its functions has been discovered. It is our hope that this strategy will be of use in extending knowledge of enzyme families far and wide.


IX. Other works

IX.A. The Carnoulès Acid Mine Drainage project

IX.A.1. Introduction

As presented previously in this thesis, “metagenomics” encompasses techniques and analyses that allow the sequencing and study of multiple genomes at once. It is of particular interest when dealing with complex ecosystems containing microbial species for which pure cultures have not been obtained in vitro. This study focused on the genomes and metabolic capacities of the microbial ecosystem that has developed in a particular extreme environment: the acidic, sulphurous and arsenic-rich drainage waters of a mine at Carnoulès, known to pollute the Reigous stream (Gard, France). Previous studies underlined the role of this ecosystem in the cleansing of Reigous waters over the 1.5 km run to its confluence with the river Amous, where sulphur and arsenic levels drop to 5% of their initial values. The metagenomic analysis of the Carnoulès Acid Mine Drainage was designed to explore the genomes and metabolic capacities of the microbes responsible for this clearance. The article published on this project is available as an annex to this manuscript.

IX.A.2. Work provided

Damien MORNICO was assigned to work on the predicted reaction content of the Carnoulès supercontig bins. His objective was to compare the different bins in order to establish common features (with the idea that these might be linked to adaptations to life in the extreme Carnoulès conditions) and specific features (with the idea of establishing possible ecological relationships between different species), hopefully generating a revealing figure that would put these notions forward.

The metabolic reaction content of the various bins was predicted using a) the automated transfer procedure on the basis of sequence homology, and b) the domain-based predictors that are part of the main MicroScope pipeline. As we were dealing with a metagenome, whose assembly still consisted of unverified bins, we decided to analyse metabolic reaction absence/presence across the different bins. This would give us a high-level view of the metabolic capacities of the various bins, their specificities and their common points (and perhaps help confirm them as coming from separate organisms).

The high dimensionality of the data, as well as the specific objectives of this analysis, suggested the use of factorial analysis. I worked with Damien MORNICO on the analysis of his metabolic data. All figures and results are available in the paper itself.

IX.B. The Bradyrhizobium project

IX.B.1. Introduction

Bacteria of the Bradyrhizobium genus are common micro-organisms that live in soil and are able to colonise legume roots and stems, forming symbiotic nodules. The project presented here focused on the comparative analysis of the genomes of 9 Bradyrhizobium strains specific to legumes of the Aeschynomene genus, selected for the atypical characteristics of its symbionts. In particular, some of these strains have photosynthetic capacities; some are able to colonise plant stems as well as roots; and some enter plant symbiosis without using the well-known nod gene-dependent pathway. The general objective of the study was to reveal how Bradyrhizobia have adapted to their ecological niche, and how the specificities of the selected strains arose. The article published on this project is available as an annex to this manuscript.

IX.B.2. Work provided

The work provided for the Bradyrhizobium project closely resembles that of the Carnoulès project; indeed, the same kind of metabolic analysis is performed in both articles. In this case, however, the studied bacterial strains were not sampled from a whole ecosystem, but are closely-related strains from a given phylum. Furthermore, additional information was available for the strains: their specific abilities with respect to their symbiosis with host plants. The analysis thus focused on associating metabolic pathways with the latter. Finally, the metabolic information analysed was different: as we were using high-quality rebuilt genomes, this study analysed metabolic pathway completion values rather than simple reaction absence/presence. The pathway completion of a pathway P in organism O is the number of reactions of P that are known to be catalysed in O, divided by the total number of reactions in P. This measure can give an idea of the evolutionary pressures applied to an organism's metabolism (i.e. high completion for important pathways, low completion for pathways lost or in the process of being lost). Covariance of this measure between pathways across phylogenetically related organisms can be indicative of common pressures, especially in a high-level analysis, though caution must be observed when interpreting fully complete or totally absent pathways. Another problem could be the existence of pathway variants amongst the studied strains, which can break pathway completion covariance even though the same evolutionary pressure exists. Finally, when observing factorial planes generated by analysing percentages (such as pathway completions), one must not forget that the dot cloud is confined to a hyper-cube or a hyper-pyramid. All figures and results are available in the paper itself.
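In a notation of my own (not used in the original text), the completion measure reads:

    \mathrm{completion}(P, O) \;=\; \frac{\bigl|\{\, r \in R_P : r \text{ is known to be catalysed in } O \,\}\bigr|}{\lvert R_P \rvert}

where R_P denotes the set of reactions of pathway P.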



X. Global discussion & perspectives

X.A. Overview

Four projects that I worked on during my thesis were presented in this manuscript. The CanOE strategy includes a method for locating genomic metabolons, units of metabolic function in prokaryote genomes; these are then exploited in a novel, multi-organism way to generate hypothetical gene annotations with scores that help bioanalysts evaluate their worth, an approach that is particularly interesting when the metabolic reaction involved is a sequence-orphan one. The BKACE project focused on exploring the functional space of a protein family of previously unknown function, using bioinformatics methods to generate iso-functional subfamilies and potential alternative reaction substrates. These hypotheses were then tested in a high-throughput enzymology screen, an original development in itself by the Genoscope's LGBM, LCOB and LCAB teams. Finally, two different applications of factorial analysis to metabolic data concerning metagenomes or related genomes were used to explore the metabolic specificities and generalities of multiple prokaryote organisms at once.

X.B. Discussion

The developments presented in this thesis all deal with the functional annotation - more specifically, the metabolic annotation - of prokaryote genomes. They exploit contextual information, established by comparative genomics approaches, in order to propose metabolic activities for protein-encoding genes. These developments differ greatly in objectives, data and methods, and as such can be used in complementary ways.

X.B.1. The scope of each project

The factorial analysis carried out in the Carnoulès project allowed the study of the metabolic relationships between the micro-organisms of a community, while that of the Bradyrhizobia project shed light on the evolutionary specificities of micro-organisms from the same genus. These statistical analyses are, in essence, high-level explorations of the metabolic capacities of the provided genomes. They were designed not to discover novelty nor to generate annotations, but to provide a general picture of the workings of the studied organisms, a picture essential for any bioanalyst wishing to continue annotating his or her genomes.


The strategies implemented in the CanOE and BKACE projects were designed to generate annotations. CanOE can propose candidate genes for enzymatic activities in each prokaryote organism analysed, and can thus be seen as a "horizontal" annotation tool. It does, however, exploit gene family information, allowing it to propose candidate families for activities; the gene families are also instrumental in ranking candidate gene/family propositions. CanOE is thus also a "vertical" annotation tool, as it can cover multiple organisms at once. The BKACE project is a "vertical" annotation tool as well, as it studies a single target gene family at a time. It does include a "horizontal" component, as sub-family-level annotations were proposed using integrated local genomic context and protein 3D structure studies.

In summary, each of the projects I worked on in this thesis generated different-level views useful to genome annotation: high-level for the metabolic factorial analyses, multi-organism gene families for BKACE, and cross-organism annotation for CanOE.

X.B.2. CanOE & BKACE are complementary

As underlined in the previous section, the CanOE and BKACE projects are complementary in their approaches to genome annotation. In practice, a single hypothetical annotation generated by CanOE between a candidate gene/family and a metabolic activity (be it a global orphan, a local orphan, or neither, as in the case of annotation policing) can serve as a starting point for a brand new BKACE-like analysis, focused on the biochemical variations of the target activity. This will be especially important for families with evidence of being associated with several similar reactions. Due to its pragmatism-driven design, the family-generating algorithm of the CanOE approach might not be the best adapted to delimiting a candidate gene family for examination: it might propose several small families at once, one big low-specificity family with many sub-families, or anything between these extremes. It would then be necessary to use the CanOE families as starting points for the identification of more accurate families, by searching public databases (such as Pfam) for correspondences. In the alternative CanOE multi-genome integration procedure presented in section VII.H.4.d, families are not used at all; only a set of candidate genes with varying confidence scores would be retrieved for a target metabolic reaction. Again, it would then be necessary to search existing family databases for families encompassing these candidates in order to launch a BKACE-like study. It would theoretically be possible to base a BKACE-like study on a group of proteins that do not belong a priori to a given family, though this would be risky and would make the entire procedure more complex and doubtful. Indeed, several steps of the BKACE strategy presented here rely on the construction of a Multiple Sequence Alignment of the involved proteins (family pre-filtering, 3D structure homology modelling, phylogenetic tree building), the quality of which would not be guaranteed in the absence of a recognisable family/domain.

In summary, it should be possible to iteratively apply CanOE and BKACE-like studies to gradually cover one or several organisms' metabolism, at least when the metabolic reactions involved are already formalised in public metabolic databases.

X.B.3. Reaction dependency

Metabolic factorial analysis is obviously dependent on the metabolic knowledge of the studied organisms and on the way it is modelled. The presence/absence of reactions is established by inventorying each genome's reactions, which are then assigned to metabolic pathways. However, the final goal of metabolic factorial analysis is a very global, exploratory view of the data, and such a high-level representation necessarily distances itself somewhat from the underlying data model.

CanOE is dependent on pre-defined metabolic reactions. Indeed, it works with a global metabolic network that only includes reactions present in the selected metabolic source database. It is thus unable to propose gap reactions corresponding to previously unknown reactions; it is also unable to generate on its own the instances of a given generic reaction (e.g. KEGG reaction R07326 describes an alcohol:NAD+ oxidoreductase with a generic alcohol substrate; it has several instances that specify the given alcohol, such as R00754, which is ethanol:NAD+ oxidoreductase). This severely limits the novelty of CanOE propositions, and an interesting perspective would be the development of an automated procedure able to "invent" gap reactions as necessary. Such future work would undoubtedly be inspired by previous works on enzymatic mechanisms [272], "compound-based" metabolic pathway construction [218,221] and compound matching [290,222].

The BKACE strategy has the opposite limitation: it relies on the semi-manual identification of alternative biochemical reactions from a generic mechanism, and is thus limited to a specific group of similar reactions. This means that alternative gene functions (such as those generated by underground metabolism not related to substrate specificity, by multifunctional proteins, or by "evolutionary leap-frogging" events that drastically change the function of a protein) cannot be proposed by the BKACE strategy without a manual intervention requiring a massive gene-by-gene analysis effort that is not even guaranteed to deliver. Once again, it appears that using BKACE- and CanOE-like strategies in concert, bound together by manual expertise, might help circumvent the limitations of both.

As both are designed to generate novel gene-reaction annotations, one may wonder whether it would be more interesting, given an exclusive strategy choice and a target protein to annotate, to conduct a BKACE-like study of the protein's family, or a CanOE study of the proposed annotations for this protein. This obviously depends both on the requirements of each strategy and on one's objectives. If a putative function of the protein is already expected, a BKACE-like study is possible. CanOE does not require this kind of prior knowledge. It does, however, require that any target reaction be already defined; furthermore, if no metabolon containing the protein's coding gene is available, it will be powerless. If the conditions of each strategy are met, the objectives determine the choice. Indeed, if an in-depth study with experimental biochemistry is required, BKACE comes out first, all the more so if the target reaction(s) is (are) not yet modelled in the metabolic databases used by CanOE. All in all, CanOE is only useful in a small number of cases, concerning the annotation of proteins of unknown function in already-located metabolons that contain already-defined reactions, and its results remain purely hypothetical. These results can, however, be instrumental in guiding experimental assays and in reducing the functional and gene spaces they must explore.

To summarise, all the projects presented in this manuscript rely on prior metabolic knowledge from public databases: metabolic factorial analysis and CanOE rely on the annotation of a genome with metabolic reactions (which are assigned to pathways in their model), BKACE on at least one metabolic annotation of a protein from a gene family of interest. The BKACE strategy further requires the identification of the chemical mechanism of its target reaction. So far, few bioinformatics resources integrate chemical mechanisms into their data models.

X.B.4. Reaction definition

All the projects described in this thesis depend in some way on the metabolic data models they work with. The way reactions and pathways are formalised influences how reactions can be connected together in a metabolic network, how instances of generic reactions can be generated, how alternate substrates can be imagined... To our knowledge, however, metabolic databases rarely model reactions in a way that accounts for substrate promiscuity and chemical mechanisms. The best efforts that we know of are the MACiE database [272], which stores mechanisms for 321 EC numbers, and KEGG and MetaCyc, which contain generic reactions that can be linked to substrate-specific reaction instances, but without providing full detail of the different relationships nor full freedom for the creation of new instances.

Generic reactions and their compound-specific instances are not the only issue in representing metabolic knowledge in databases. One of the most widely-used reaction ontologies, the Enzyme Commission (EC) number scheme, has many drawbacks. To start with, the number of enzymatic reactions it describes lags behind that of other databases such as KEGG and MetaCyc. Also, the generic/specific aspect of a reaction is not addressed in a single, unified way: in practice, many proteins end up annotated with several EC numbers, and not always due to multi-functionality. EC numbers do not describe chemical mechanisms. Finally, the link between sequence similarity, structural similarity and reaction similarity is not straightforward, and is filled with exceptions at every level of the EC classification scheme [100]. These issues lead to a further one: establishing correspondences between EC numbers and other models of metabolic reactions. Preliminary work with KEGG schemas in CanOE showed that using EC numbers to infer gene-KEGG reaction annotations generated many spurious correspondences that impeded CanOE use. The use of the Gene Ontology (GO) is not a possibility either, as its terms are currently not adapted to describing metabolic reactions.

The Rhea database [http://www.ebi.ac.uk/rhea/] was created in response to several complaints about pre-existing metabolic reaction databases (such as KEGG, MetaCyc, and the list of EC numbers), concerning reaction equation balance, consistent compound references, and known biochemical reaction coverage. Rhea reactions can also be combined or split, giving greater freedom with respect to multi-step reactions. The UniPathway database [213] was designed to contain heavily curated metabolic data and proposes a highly-formalised hierarchical data model, allowing metabolic pathways to be made up of elementary “chains” of reactions, which would be particularly useful in designing pathway-based main compound edges for a metabolic network based on it.

In summary, existing metabolic reaction and pathway databases are not designed to handle reaction substrate promiscuity and chemical mechanisms with good coverage, though many recent efforts include innovations that might prove valuable in designing new metabolic networks for CanOE.

X.B.5. Consequences on knowledge and use of metabolism

Using strategies like metabolic factorial analysis, CanOE and BKACE should ensure that the level of general metabolic knowledge continues to increase. Indeed, they will complete our knowledge of the workings of specific metabolic pathways and open up new opportunities for biological experiments. Proposing new metabolic reactions or variants thereof will allow better pathway identification and completion, as well as deeper insight into substrate specificity-bound underground metabolism. Identifying and validating candidate genes for local or global orphans will create precedents that may be propagated to other genes through functional transfer on the basis of detected homology. These new annotations might then be included in new metabolons, allowing the generation of further novel annotation hypotheses for neighbouring genes. As is the case with functional and relational annotation in their entirety, annotation results will feed future annotations, pushing the boundaries of our metabolic knowledge ever further back.

As our knowledge of prokaryote metabolism progresses, so do the opportunities for exploiting it in biotechnology. Indeed, the discovery of new metabolic reaction-coding genes (for either previously orphan enzymes or little-known metabolic reactions) opens the door to selectively cloning them, an often necessary step in producing the enzyme in industrial quantities [231]. Likewise, using a metabolic activity in a synthetic biology application requires the prior identification of its coding gene so that it may be incorporated into the "building blocks" that are at the heart of this emerging discipline.

X.B.6. CanOE & BKACE feed metabolic analyses

The previously-described advance of general knowledge of prokaryote metabolism may prove central to exploring the metabolic capacities of complex microbial ecosystems, as in the Carnoulès project. Indeed, the metabolic factorial analyses are built on whatever functional annotations can be made within the partial - and sometimes mixed-up - genomes available in metagenomic studies. Better comprehension and knowledge of global prokaryote metabolism, coupled with increased opportunities for functional annotation transfer, should help fill in the blanks concerning metagenome metabolism for (supposedly) separate species. This should inform the factorial analyses, making them more informative and accurate, which in turn would help delimit more precisely where exactly the different organisms interact metabolically - obviously one of the key questions of metagenomic studies.

X.B.7. Microme

The Microme project [www.microme.eu] is currently being deployed in Europe in order to assemble a bioinformatics infrastructure dedicated to prokaryote genome analysis and metabolic pathway reconstruction. Metabolic models generated within this project are destined to be exploited in metabolism-based evolutionary analyses and in biotechnological applications. Microme would obviously benefit directly from the improved knowledge of prokaryote metabolism brought about by CanOE and BKACE. Metabolic reconstructions often contain metabolic pathways with one or more reactions lacking coding genes; in order not to "break" the reconstruction by creating dead-end metabolites, these reactions are hypothesised to be present in a "gap filling" step. Increased metabolic knowledge would help eliminate the need for such hypotheses.

X.B.8. Integration with experimental validation

Obviously, putative annotations generated by bioinformatics methods - especially when dealing with novel metabolic reactions or little-known families - should be confirmed or invalidated by wet-lab experiments. In the work presented here, we have strived to approach our biochemist collaborators at the Genoscope with a view to validating our predictions. Beyond this work, the necessity for direct integration of bioinformatics predictions and experimental testing is now fully recognised, and at least two dedicated projects are worth noting here.

The Computational Bridge to Experiments (COMBREX) project [www.combrex.org] is a USA-based, NIH-funded project that aims to build a hub to which computational prediction methods can be submitted and run, and where the resulting predictions are screened for reactions and/or gene families of particular interest. These predictions are then submitted to biochemistry teams (selected from among many so as to match the prediction type to each team's expertise) for experimental validation, along with a dedicated small grant to fund the experiments. To my knowledge, COMBREX is currently being tested on a pilot project launched in 2011 [291].

The Enzyme Function Initiative (EFI) [enzymefunction.org] is a smaller USA-based, NIH-funded effort that also aims to integrate gene annotation hypothesis generation with biological testing [292]. It is, however, more "biologically orientated" than COMBREX: a priori iso-functional families (not isolated genes) may be proposed as candidate catalysts of a given function, by manual expertise and/or by automated prediction methods; activity screening is carried out in vitro and in vivo; and enzyme/ligand 3D structures are determined when possible. EFI is even younger than COMBREX and is currently gauging community interest in order to decide whether the effort should be maintained. The objectives of the BKACE strategy are closer to those of EFI, while the CanOE strategy objectives are closer to those of COMBREX.

We can hope that both these efforts will last (or even merge), as they will help standardise the way computational predictions are made and handled, guaranteeing a higher and more consistent quality of gene annotations in public databases. The use of a grant-based incentive for the experimental validation of predictions of particular interest should help speed up the discovery process, though one may wonder how providers of annotation predictions will cope with the loss of their previous experimental collaborators, and vice versa. I also wonder whether the fact that both these efforts are USA-based will prove a hindrance for scientific teams (bioinformaticians and biologists alike) willing to participate in such an endeavour. Perhaps a European alternative will be necessary to federate efforts on this side of the Atlantic (and why not an Asian version as well, mirroring the current GenBank/DDBJ/EMBL triumvirate). The Microme project, for example, could be integrated as a part of (or a collaborator of) such a project.

XI. Annexes

XI.A. Spectral clustering

The term “spectral clustering” actually designates a family of clustering algorithms based on a common protocol. The data describing the objects to cluster are first transformed into a weighted, undirected graph using any metric (e.g. a k-nearest neighbours graph using Euclidean distance). The weighted n×n adjacency matrix of this graph is then transformed into one of the graph's “Laplacian” matrices (there are at least three variants of Laplacian matrix: “un-normalised”, “symmetric” and “random walk”). The Laplacian matrix is then decomposed into its eigenvectors, and the rows of an n×k matrix derived from the first k eigenvectors are clustered using a simple k-means algorithm. This clustering is a spectral clustering of the initial data points.

A good introduction to spectral clustering, with more detail, can be found in [278].
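
To make this protocol concrete, here is a minimal Python sketch (NumPy only); the choice of the symmetric normalised Laplacian and the toy k-means routine are illustrative assumptions, not a reference implementation.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # A toy k-means on the rows of X, sufficient for this sketch.
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centres[j] = X[labels == j].mean(axis=0)
        return labels

    def spectral_clustering(W, k):
        # W: symmetric, weighted n x n adjacency matrix of the graph.
        n = W.shape[0]
        d = W.sum(axis=1)                                   # weighted vertex degrees
        d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = np.eye(n) - d_is[:, None] * W * d_is[None, :]   # "symmetric" Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)                # eigen-decomposition, ascending eigenvalues
        U = eigvecs[:, :k]                                  # n x k matrix from the first k eigenvectors
        U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
        return kmeans(U, k)                                 # k-means on the rows of U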

XI.B. Factorial analysis

Factorial analysis designates a large family of statistical methods that aim to extract underlying information from large data sets. This information is gathered in the form of “factors” or “principal components”, which can be viewed as latent variables that summarise the data in one way or another. To put things graphically, the original data points can be imagined as a multi-dimensional cloud of points, through which factorial analysis tries to find the best planes onto which to project them, in order to capture as much of the cloud's variability as possible. It is like trying to find the best angle at which to photograph an object in order to make it as recognisable as possible (e.g. a lateral photograph of a fish).

Multiple factorial analysis methods exist, owing to the numerous data types, analysis objectives, and philosophies. Regarding the latter, two main approaches stand opposed: Anglo-saxon (AS-FA) versus French (F-FA) factorial analysis. The differences between the two are linked to the priority each gives to the objectives of the factorial analysis. Indeed, latent information extraction can serve several purposes:

• summarising: optimal low-dimensional data representation

• discovering: data exploration

• interpreting: explanation of the data

• modelling: confirmation of a mathematical theory of the underlying generative processes

AS-FA is more bent on explaining the data, whereas F-FA is more concerned with representing it. The fundamental algorithmic difference between them can be explained by this divergence. Indeed, AS-FA allows the found factors to be rotated freely in the factorial space in order to maximise the captured variability: it thus maximises explanation. F-FA, however, imposes that all factors be pairwise orthogonal, ensuring a comprehensible 2D or 3D graphical representation of the original data points projected into the factorial space, and enforcing the independence of the found factors.

In the works presented in this thesis, I have consistently applied F-FA, as 1) my training is specific to it, and 2) I have generally used FA either to ease clustering or to generate graphical representations, both of which F-FA is well suited to.

Different F-FA methods exist for treating data of different types, and I shall list here those that I used during my PhD. Principal Components Analysis (PCA) applies only to quantitative variables (though both quantitative and qualitative variables can be used as illustrative variables). Correspondence Analysis (CA) analyses the links between the modalities of two qualitative variables; its multivariate counterpart is Multiple Correspondence Analysis (MCA), which supports quantitative and qualitative illustrative variables. Multiple Factorial Analysis (MFA) is capable of analysing groups of variables; each group may contain variables of only one type (quantitative or qualitative), but different groups can be of different types. MFA can be powerful for extracting common underlying factors from multiple data sets. Hierarchical Multiple Factorial Analysis (HMFA) extends MFA by allowing a hierarchy to be defined on the variable groupings.

All these methods are described in detail in [293].
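
As a concrete illustration of the simplest of these methods, the following Python sketch (NumPy only, on purely illustrative random data) performs a PCA by centring the variables and projecting the points onto their first two principal axes:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))         # 50 objects described by 5 quantitative variables
    Xc = X - X.mean(axis=0)              # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T               # coordinates of the points on the first 2 factors
    explained = s ** 2 / (s ** 2).sum()  # share of total variability captured per factor
    print(explained[:2])                 # variability captured by the 2D representation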

XI.C. Some basics in graph theory

As a support for the notions handled at times in this manuscript, I present here some basic concepts from graph theory that may be useful to the reader.

XI.C.1. Graphs and graph elements

A graph is a mathematical structure generally used to model real-life phenomena in which certain objects present one-to-one (or even many-to-many) relationships. Each object is modelled as a vertex (also known as a node), and relationships are modelled as edges that connect the vertices.

Edges can be binary (i.e. either present or absent) or weighted (i.e. carrying a real-valued weight that measures the importance of the edge). They can also be undirected (i.e. with no specific direction) or directed (i.e. the relationship applies specifically from one node to another). Some special types of graph admit special types of edges: multiple edges connecting the same pair of vertices, single edges connecting more than two vertices (hyperedges), or edges connecting a vertex to itself (self-edges or self-loops).

XI.C.2. Matrices of interest

All of the matrices presented here are often used in graph theory-based analyses.

For a graph containing n vertices, its (weighted) adjacency matrix A is an n×n matrix in which each value Ai,j corresponds to the weight of the edge connecting vertex i to vertex j. Undirected graphs have symmetric adjacency matrices; directed graphs may have asymmetric ones. Self-loop edges appear on the matrix diagonal. The adjacency matrix is the most direct mathematical representation of a graph.

A vertex's degree is the number of edges that vertex possesses. When edges are directed, in- and out-degrees can be defined; when edges are weighted, degrees can also be weighted. Mathematically, degrees are often represented along the diagonal of the degree matrix D, an n×n matrix.

A distance matrix is also an n×n matrix; it contains, for each vertex pair (i,j), the length of the shortest path between i and j. Paths are defined in the following section.

A graph's Laplacian matrix is related to D and A, though its exact matricial definition depends on its type. Indeed, it can be either normalised or not (the un-normalised Laplacian being L = D - A), and the normalised version has several sub-versions.
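
The following small Python example (on an assumed toy graph of three vertices) builds the adjacency matrix A, the degree matrix D and the un-normalised Laplacian L = D - A described above:

    import numpy as np

    A = np.array([[0., 2., 1.],    # weighted adjacency matrix of an
                  [2., 0., 1.],    # undirected graph, hence symmetric
                  [1., 1., 0.]])
    D = np.diag(A.sum(axis=1))     # weighted degrees on the diagonal
    L = D - A                      # un-normalised Laplacian
    print(L.sum(axis=1))           # each row of a Laplacian sums to zero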

XI.C.3. Paths, walks, breadth and depth-first searches

A “path” is a sequence of vertices in a graph that can be reached by traversing the edges between them, in order. In most of the literature, the definition of “path” (often implicitly) requires that no vertex be traversed more than once; a sequence of vertices in which a vertex can be visited more than once is then referred to as a “walk”.

Establishing the shortest path between two vertices is common practice in graph theory, as it can give an idea of the topology of the graph. Two families of algorithms are generally used to determine such paths. Depth-first searches, starting from a first vertex, recursively move through the graph from the current vertex to one of its neighbours, until the target vertex is reached or a maximum path length has been attained. Breadth-first searches, starting from the same first vertex, instead consider all the neighbours of the current vertex; if none of them is the target, they explore all vertices at the next distance from the start, proceeding level by level until the target vertex is found or a maximum path length has been attained. In an unweighted graph, the first path along which a breadth-first search reaches the target is guaranteed to be a shortest one.
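
The following is a minimal breadth-first search sketch in Python (the adjacency-list representation of the graph is an assumption of this example); it returns the length of a shortest path between two vertices, or None when the target is unreachable:

    from collections import deque

    def bfs_shortest_path_length(adj, source, target):
        seen = {source}
        queue = deque([(source, 0)])     # pairs (vertex, distance from source)
        while queue:
            vertex, dist = queue.popleft()
            if vertex == target:
                return dist
            for neighbour in adj[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, dist + 1))
        return None                      # target not reachable from source

    adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
    print(bfs_shortest_path_length(adj, "a", "d"))   # prints 2 (a -> b -> d)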

XI.C.4. Graph structures of interest

Given its nature, the most interesting features of a graph generally involve its structure or topology. Vertices in a graph can form clusters, which can be described as groups of vertices that are more highly connected amongst themselves than with other vertices. A hub is a vertex connected to many other vertices that are themselves only sparsely connected to others, forming a star-like structure; in biology, hubs are generally seen as vertices of central importance. A clique is a special cluster in which every vertex is connected to every other vertex of the cluster. Finally, a connected component is a group of vertices in which each vertex can be reached from any other vertex along a path of some length.
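
As a final illustration, here is a short Python sketch (again on an assumed adjacency-list graph) that extracts connected components by repeated traversal, following the definition given above:

    def connected_components(adj):
        seen, components = set(), []
        for start in adj:
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:                 # depth-first traversal of one component
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend(adj[v])
            seen |= comp
            components.append(comp)
        return components

    adj = {"a": ["b"], "b": ["a"], "c": []}
    print(connected_components(adj))     # two components: {a, b} and {c}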

This concludes my brief tour of the basics of graph theory.

XII. Bibliography

1. Chubb D, Jefferys BR, Sternberg MJE, Kelley LA (2010) Sequencing delivers diminishing returns for homology detection: implications for mapping the protein universe. Bioinformatics 26: 2664–2671. doi:10.1093/bioinformatics/btq527

2. Pouliot Y, Karp PD (2007) A survey of orphan enzyme activities. BMC Bioinformatics 8: 244–244. doi:10.1186/1471-2105-8-244

3. The UniProt Consortium (2010) Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Research 39: D214–D219. doi:10.1093/nar/gkq1020

4. Vallenet D, Engelen S, Mornico D, Cruveiller S, Fleury L, et al. (2009) MicroScope: a platform for microbial genome annotation and comparative genomics. Database 2009: bap021. doi:10.1093/database/bap021

5. Boyer F, Morgat A, Labarre L, Pothier J, Viari A (2005) Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functional data. Bioinformatics 21: 4209–4215. doi:10.1093/bioinformatics/bti711

6. Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, et al. (2006) MaGe: a microbial genome annotation system supported by synteny results. Nucl. Acids Res. 34: 53–65. doi:10.1093/nar/gkj406

7. Sanger F, Tuppy H (1951) The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 49: 463–481.

8. Sanger F, Tuppy H (1951) The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates. Biochem. J. 49: 481–490.

9. Sanger F, Thompson EOP (1953) The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem. J. 53: 353–366.

10. Sanger F, Thompson EOP (1953) The amino-acid sequence in the glycyl chain of insulin. II. The investigation of peptides from enzymic hydrolysates. Biochem. J. 53: 366–374.

11. Maxam AM, Gilbert W (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74: 560–564.

12. Watson JD, Crick FH (1953) The structure of DNA. Cold Spring Harb. Symp. Quant. Biol. 18: 123–131.

13. Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171: 737–738.

14. Watson JD, Crick FH (1953) Genetical implications of the structure of deoxyribonucleic acid. Nature 171: 964–967.

15. Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, et al. (1986) Fluorescence detection in automated DNA sequence analysis. Nature 321: 674–679. doi:10.1038/321674a0

16. Venter J, (other authors hidden for clarity) (2001) The sequence of the human genome. Science (New York, N.Y.) 291: 1304–1351.

17. McPherson J, (other authors hidden for clarity) (2001) A physical map of the human genome. Nature 409: 934–941.

18. Lander E, (other authors hidden for clarity) (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.

19. Mardis ER (2008) The impact of next-generation sequencing technology on genetics. Trends Genet. 24: 133–141. doi:10.1016/j.tig.2007.12.007

20. McPherson JD (2009) Next-generation gap. Nat Meth 6: S2–S5. doi:10.1038/nmeth.f.268

21. Wolinsky H (2007) The thousand-dollar genome. EMBO Rep 8: 900–903. doi:10.1038/sj.embor.7401070

22. Sboner A, Mu XJ, Greenbaum D, Auerbach RK, Gerstein MB (2011) The real cost of sequencing: higher than you think! Genome Biol 12: 125. doi:10.1186/gb-2011-12-8-125

23. Hugenholtz P (2002) Exploring prokaryotic diversity in the genomic era. Genome Biol 3: REVIEWS0003.

24. Galperin MY, Koonin EV (2010) From complete genome sequence to 'complete' understanding? Trends in Biotechnology 28: 398–406. doi:10.1016/j.tibtech.2010.05.006

25. Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, et al. (2009) A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462: 1056–1060. doi:10.1038/nature08656

26. 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073. doi:10.1038/nature09534

27. Hugenholtz P, Goebel BM, Pace NR (1998) Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J. Bacteriol 180: 4765–4774.

28. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemistry & Biology 5: R245–R249. doi:10.1016/S1074-5521(98)90108-9

29. Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6: e1000667. doi:10.1371/journal.pcbi.1000667

30. Savage DC (1977) Microbial ecology of the gastrointestinal tract. Annu. Rev. Microbiol 31: 107–133. doi:10.1146/annurev.mi.31.100177.000543

31. Berg RD (1996) The indigenous gastrointestinal microflora. Trends Microbiol 4: 430–435.

32. Magrane M, UniProt Consortium (2011) UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011: bar009. doi:10.1093/database/bar009

33. Médigue C, Bocs S, Labarre L, Mathé C, Vallenet D (2002) L’annotation in silico des séquences génomiques - Bio-informatique (1). médecine/sciences 18: 14. doi:10.1051/medsci/2002182237

34. Gaudriault S, Vincent R (2009) Séquençage des Génomes Complets. Génomique. Mémento Sciences. De Boeck Université. p. 37–40.

35. Zhou P, Emmert D, Zhang P (2006) Using Chado to store genome annotation data. Curr Protoc Bioinformatics Chapter 9: Unit 9.6. doi:10.1002/0471250953.bi0906s12

36. Carver T, Berriman M, Tivey A, Patel C, Bohme U, et al. (2008) Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24: 2672– 2676. doi:10.1093/bioinformatics/btn529

37. Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, et al. (2010) The integrated microbial genomes system: an expanding comparative analysis resource. Nucleic Acids Res 38: D382–D390. doi:10.1093/nar/gkp887

38. Overbeek R, Larsen N, Walunas T, D’Souza M, Pusch G, et al. (2003) The ERGO genome analysis and discovery system. Nucleic Acids Res 31: 164–171.

39. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang H-Y, et al. (2005) The SEED: the Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes. Nucleic Acids Res 33: 5691–5702. doi:10.1093/nar/gki866

40. Bryson K, Loux V, Bossy R, Nicolas P, Chaillou S, et al. (2006) AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 34: 3533–3545. doi:10.1093/nar/gkl471

41. Deonier RC, Tavaré S, Waterman MS (2005) Computational genome analysis: an introduction. Springer.

42. Pal D, Eisenberg D (2005) Inference of protein function from protein structure. Structure 13: 121– 130. doi:10.1016/j.str.2004.10.015

43. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene Ontology: tool for the unification of biology. Nat Genet 25: 25–29. doi:10.1038/75556

44. Riley M (1993) Functions of the gene products of Escherichia coli. Microbiol. Rev. 57: 862–952.

45. Serres MH, Riley M (2000) MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb. Comp. Genomics 5: 205–222.

46. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, et al. (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32: 5539–5545. doi:10.1093/nar/gkh894

47. Tanenbaum DM, Goll J, Murphy S, Kumar P, Zafar N, et al. (2010) The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data. Stand Genomic Sci 2: 229–237. doi:10.4056/sigs.651139

48. Haft DH, Selengut JD, Brinkac LM, Zafar N, White O (2005) Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics 21: 293–306. doi:10.1093/bioinformatics/bti015

49. Magrane M, UniProt Consortium (2011) UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011. doi:10.1093/database/bar009

50. Remm M, Storm CEV, Sonnhammer ELL (2001) InParanoid: automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology 314: 1041–1052. doi:10.1006/jmbi.2000.5197

51. Altenhoff AM, Dessimoz C (2009) Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 5: e1000262. doi:10.1371/journal.pcbi.1000262

52. Bar D (2011) Evidence of Massive Horizontal Gene Transfer Between Humans and Plasmodium vivax. Nature Precedings. Available: http://precedings.nature.com/documents/5690/version/1/. Accessed 27 July 2011.

53. Schnoes AM, Brown SD, Dodevski I, Babbitt PC (2009) Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Comput Biol 5. doi:10.1371/journal.pcbi.1000605

54. Oliver S (2000) Proteomics: Guilt-by-association goes global. Nature 403: 601–603. doi:10.1038/35001165

55. Janga SC, Díaz-Mejía JJ, Moreno-Hagelsieb G (2011) Network-based function prediction and interactomics: The case for metabolic enzymes. Metabolic Engineering 13: 1–10. doi:16/j.ymben.2010.07.001

56. Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol. Syst. Biol 3: 88. doi:10.1038/msb4100129

57. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) BLAST: basic local alignment search tool. J. Mol. Biol 215: 403–410. doi:10.1006/jmbi.1990.9999

58. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85: 2444–2448.

59. Smith T, Waterman M (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5

60. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.

61. Karplus K, Barrett C, Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856.

62. Söding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21: 951–960. doi:10.1093/bioinformatics/bti125

63. Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33: W244–W248. doi:10.1093/nar/gki408

64. Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, et al. (1996) Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Current Biology 6: 279–291. doi:10.1016/S0960-9822(02)00478-5

65. Andrade MA, Sander C (1997) Bioinformatics: from genome data to biological knowledge. Current Opinion in Biotechnology 8: 675–683. doi:10.1016/S0958-1669(97)80118-8

66. Enright AJ, Van Dongen S, Ouzounis CA (2002) TRIBE-MCL: an efficient algorithm for large-scale detection of protein families. Nucl. Acids Res. 30: 1575–1584. doi:10.1093/nar/30.7.1575

67. van Dongen S (2000) Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht. Available: http://www.micans.org/mcl/.

68. Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes. Genome Research 13: 2178–2189. doi:10.1101/gr.1224503

69. Mavromatis K, Chu K, Ivanova N, Hooper SD, Markowitz VM, et al. (2009) Gene Context Analysis in the Integrated Microbial Genomes (IMG) Data Management System. PLoS ONE 4: e7979. doi:10.1371/journal.pone.0007979

70. Levy E, Ouzounis C, Gilks W, Audit B (2005) Probabilistic annotation of protein sequences based on functional classifications. BMC Bioinformatics 6: 302. doi:10.1186/1471-2105-6-302

71. Audit B, Levy ED, Gilks WR, Goldovsky L, Ouzounis CA (2007) CORRIE: enzyme sequence annotation with confidence estimates. BMC Bioinformatics 8 Suppl 4: S3. doi:10.1186/1471-2105-8-S4-S3

72. Tatusov RL, Galperin MY, Natale DA, Koonin EV (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28: 33–36. doi:10.1093/nar/28.1.33

73. Gattiker A, Michoud K, Rivoire C, Auchincloss AH, Coudert E, et al. (2003) Automated annotation of microbial proteomes in SWISS-PROT. Computational Biology and Chemistry 27: 49–58. doi:10.1016/S1476-9271(02)00094-4

74. Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, et al. (2009) HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Research 37: D471–D478. doi:10.1093/nar/gkn661

75. Meyer F, Overbeek R, Rodriguez A (2009) FIGfams: yet another set of protein families. Nucleic Acids Res 37: 6643–6654. doi:10.1093/nar/gkp698

76. Sonnhammer EL, Eddy SR, Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28: 405–420.

77. Bateman A, Coggill P, Finn RD (2010) DUFs: families in search of function. Acta Crystallogr Sect F Struct Biol Cryst Commun 66: 1148–1152. doi:10.1107/S1744309110001685

78. Claudel-Renard C, Chevalet C, Faraut T, Kahn D (2003) Enzyme-specific profiles for genome annotation: PRIAM. Nucl. Acids Res. 31: 6633–6639. doi:10.1093/nar/gkg847

79. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, et al. (2000) InterPro--an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145–1150.

80. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. (2009) InterPro: the integrative protein signature database. Nucleic Acids Research 37: D211–D215. doi:10.1093/nar/gkn785

81. Hofmann K, Bucher P, Falquet L, Bairoch A (1999) The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215–219.

82. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, et al. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28: 225–227.

83. Henikoff JG, Pietrokovski S, McCallum CM, Henikoff S (2000) Blocks-based methods for detecting protein homology. Electrophoresis 21: 1700–1706. doi:10.1002/(SICI)1522-2683(20000501)21:9<1700::AID-ELPS1700>3.0.CO;2-V

84. Corpet F, Servant F, Gouzy J, Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28: 267–269.

85. Yeats C, Lees J, Carter P, Sillitoe I, Orengo C (2011) The Gene3D Web Services: a platform for identifying, annotating and comparing structural domains in protein sequences. Nucleic Acids Res. 39: W546–550. doi:10.1093/nar/gkr438

86. Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, et al. (2006) SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34: D257–260. doi:10.1093/nar/gkj079

87. Wilson D, Madera M, Vogel C, Chothia C, Gough J (2007) The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 35: D308–313. doi:10.1093/nar/gkl910

88. Nikolskaya AN, Arighi CN, Huang H, Barker WC, Wu CH (2006) PIRSF family classification system for protein functional and evolutionary analysis. Evol. Bioinform. Online 2: 197–209.

89. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31: 371–373.

90. Mi H, Guo N, Kejariwal A, Thomas PD (2007) PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res. 35: D247–252. doi:10.1093/nar/gkl869

91. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Research 28: 235 –242. doi:10.1093/nar/28.1.235

92. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. (1997) CATH--a hierarchic classification of protein domain structures. Structure 5: 1093–1108.

93. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, et al. (2007) The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35: D291–297. doi:10.1093/nar/gkl959

94. Yeats C, Lees J, Reid A, Kellam P, Martin N, et al. (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36: D414–418. doi:10.1093/nar/gkm1019

95. de Melo-Minardi RC, Bastard K, Artiguenave F (2010) Identification of subfamily-specific sites based on active sites modeling and clustering. Bioinformatics 26: 3075–3082. doi:10.1093/bioinformatics/btq595

96. Pierri CL, Parisi G, Porcelli V (2010) Computational approaches for protein function prediction: a combined strategy from multiple sequence alignment to molecular docking-based virtual screening. Biochim. Biophys. Acta 1804: 1695–1712. doi:10.1016/j.bbapap.2010.04.008

97. Shah I, Hunter L (1997) Predicting enzyme function from sequence: a systematic appraisal. Proc Int Conf Intell Syst Mol Biol 5: 276–283.

98. Devos D, Valencia A (2000) Practical limits of function prediction. Proteins 41: 98–107.

99. Gerlt JA, Babbitt PC (2000) Can sequence determine function? Genome Biol 1: REVIEWS0005.

100. Babbitt PC (2003) Definitions of enzyme function for the structural genomics era. Current Opinion in Chemical Biology 7: 230–237. doi:10.1016/S1367-5931(03)00028-0

101. O’Brien PJ, Herschlag D (1999) Catalytic promiscuity and the evolution of new enzymatic activities. Chemistry & Biology 6: R91–R105. doi:10.1016/S1074-5521(99)80033-7

102. James LC, Tawfik DS (2001) Catalytic and binding poly-reactivities shared by two unrelated proteins: The potential role of promiscuity in enzyme evolution. Protein Sci. 10: 2600–2607. doi:10.1110/ps.14601

103. Copley SD (2003) Enzymes with extra talents: moonlighting functions and catalytic promiscuity. Current Opinion in Chemical Biology 7: 265–272. doi:10.1016/S1367-5931(03)00032-2

104. Bornscheuer UT, Kazlauskas RJ (2004) Catalytic Promiscuity in Biocatalysis: Using Old Enzymes to Form New Bonds and Follow New Pathways. Angewandte Chemie International Edition 43: 6032–6040. doi:10.1002/anie.200460416

105. Hult K, Berglund P (2007) Enzyme promiscuity: mechanism and applications. Trends Biotechnol. 25: 231–238. doi:10.1016/j.tibtech.2007.03.002

106. Nobeli I, Favia AD, Thornton JM (2009) Protein promiscuity and its implications for biotechnology. Nat. Biotechnol 27: 157–167. doi:10.1038/nbt1519

107. Jensen RA (1976) Enzyme recruitment in evolution of new function. Annu. Rev. Microbiol. 30: 409–425. doi:10.1146/annurev.mi.30.100176.002205

108. Khersonsky O, Tawfik DS (2010) Enzyme Promiscuity: A Mechanistic and Evolutionary Perspective. Annual Review of Biochemistry 79: 471–505. doi:10.1146/annurev-biochem-030409-143718

109. Urbach D, Moore JH (2011) Data mining and the evolution of biological complexity. BioData Mining 4: 7. doi:10.1186/1756-0381-4-7

110. Gibson G, Wagner G (2000) Canalization in evolutionary genetics: a stabilizing theory? Bioessays 22: 372–380. doi:10.1002/(SICI)1521-1878(200004)22:4<372::AID-BIES7>3.0.CO;2-J

111. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, et al. (1998) Predicting function: from genes to genomes and back. J. Mol. Biol 283: 707–725. doi:10.1006/jmbi.1998.2144

112. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, et al. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751–753.

113. Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402: 83–86. doi:10.1038/47048

114. Kozak M (1983) Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiol. Rev 47: 1–45.

115. Koonin EV, Galperin MY (1997) Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr. Opin. Genet. Dev. 7: 757–763.

116. Meaburn KJ, Misteli T (2007) Cell biology: Chromosome territories. Nature 445: 379–781. doi:10.1038/445379a

117. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N (1999) Use of contiguity on the chromosome to predict functional coupling. In Silico Biol. (Gedrukt) 1: 93–108.

118. Overbeek R (1999) The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences 96: 2896–2901. doi:10.1073/pnas.96.6.2896

119. Huerta AM, Salgado H, Thieffry D, Collado-Vides J (1998) RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucleic Acids Res 26: 55–59.

120. Chen X, Su Z, Dam P, Palenik B, Xu Y, et al. (2004) Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res. 32: 2147–2157. doi:10.1093/nar/gkh510

121. Price MN, Huang KH, Alm EJ, Arkin AP (2005) A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 33: 880–892. doi:10.1093/nar/gki232

122. Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M (2006) ODB: a database of operons accumulating known operons across multiple genomes. Nucleic Acids Res. 34: D358–362. doi:10.1093/nar/gkj037

123. Mao F, Dam P, Chou J, Olman V, Xu Y (2009) DOOR: a database for prokaryotic operons. Nucleic Acids Res. 37: D459–463. doi:10.1093/nar/gkn757

124. Dehal PS, Joachimiak MP, Price MN, Bates JT, Baumohl JK, et al. (2010) MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 38: D396–400. doi:10.1093/nar/gkp919

125. Chuang L-Y, Tsai J-H, Yang C-H (2010) PPO: Predictor for Prokaryotic Operons. Bioinformatics 26: 3127 –3128. doi:10.1093/bioinformatics/btq601

126. Passarge E, Horsthemke B, Farber RA (1999) Incorrect use of the term synteny. Nat Genet 23: 387. doi:10.1038/70486

127. Gilbert DG (2007) DroSpeGe: rapid access database for new Drosophila species genomes. Nucleic Acids Res. 35: D480–485. doi:10.1093/nar/gkl997

128. Novo M, Bigey F, Beyne E, Galeote V, Gavory F, et al. (2009) Eukaryote-to-eukaryote gene transfer events revealed by the genome sequence of the wine yeast Saccharomyces cerevisiae EC1118. Proc. Natl. Acad. Sci. U.S.A. 106: 16333–16338. doi:10.1073/pnas.0904673106

129. Dandekar T, Snel B, Huynen M, Bork P (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci 23: 324–328.

130. Yanai I, Mellor JC, DeLisi C (2002) Identifying functional links between genes using conserved chromosomal proximity. Trends Genet. 18: 176–179.

131. Kolesov G, Mewes H-W, Frishman D (2001) SNAPping up functionally related genes based on context information: a colinearity-free approach. Journal of Molecular Biology 311: 639–656. doi:10.1006/jmbi.2001.4701

132. Kolesov G, Mewes H-W, Frishman D (2002) SNAPper: gene order predicts gene function. Bioinformatics 18: 1017–1019.

133. Chen Y, Mao F, Li G, Xu Y (2011) Genome-wide discovery of missing genes in biological pathways of prokaryotes. BMC Bioinformatics 12 Suppl 1: S1. doi:10.1186/1471-2105-12-S1-S1

134. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, et al. (2004) Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol. 5: R35. doi:10.1186/gb-2004-5-5-r35

135. Lemoine F, Labedan B, Lespinet O (2008) SynteBase/SynteView: a tool to visualize gene order conservation in prokaryotic genomes. BMC Bioinformatics 9: 536. doi:10.1186/1471-2105-9-536

136. Luc N, Risler J-L, Bergeron A, Raffinot M (2003) Gene teams: a new formalization of gene clusters for comparative genomics. Comput Biol Chem 27: 59–67.

137. Kim S, Choi J-H, Yang J (2005) Gene teams with relaxed proximity constraint. Proc IEEE Comput Syst Bioinform Conf: 44–55.

138. Pasek S, Bergeron A, Risler J-L, Louis A, Ollivier E, et al. (2005) Identification of genomic features using microsyntenies of domains: domain teams. Genome Res. 15: 867–874. doi:10.1101/gr.3638405

139. Vandepoele K (2002) The Automatic Detection of Homologous Regions (ADHoRe) and Its Application to Microcolinearity Between Arabidopsis and Rice. Genome Research 12: 1792– 1801. doi:10.1101/gr.400202

140. Simillion C, Janssens K, Sterck L, Van de Peer Y (2008) i-ADHoRe 2.0: an improved tool to detect degenerated genomic homology using genomic profiles. Bioinformatics 24: 127–128. doi:10.1093/bioinformatics/btm449

141. Tesler G (2002) GRIMM: genome rearrangements web server. Bioinformatics 18: 492–493.

142. Sinha AU, Meller J (2007) Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 8: 82. doi:10.1186/1471-2105-8-82

143. Fujibuchi W, Ogata H, Matsuda H, Kanehisa M (2000) Automatic detection of conserved gene clusters in multiple genomes by graph comparison and P-quasi grouping. Nucleic Acids Res 28: 4029–4036.

144. Ogata H, Fujibuchi W, Goto S, Kanehisa M (2000) A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 28: 4021–4028.

145. Rödelsperger C, Dieterich C (2010) CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PLoS ONE 5: e8861. doi:10.1371/journal.pone.0008861

146. Denielou Y-P, Boyer F, Sagot M-F, Viari A (2008) Recovering isofunctional genes: a multiple genomes synteny-based approach. JOBIM 2008, Lille. Available: http://www2.lifl.fr/jobim2008/actes/jobim08-denielou.pdf. Accessed 6 January 2010.

147. Denielou Y-P (2010) Alignement Multiple de Données Génomiques et Post-Génomiques : Approches Algorithmiques. PhD thesis, Université de Grenoble. Available: http://hal.archives-ouvertes.fr/tel-00610419/. Accessed 16 December 2011.

148. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86–90. doi:10.1038/47056

149. Snel B, Bork P, Huynen M (2000) Genome evolution: gene fusion versus gene fission. Trends in Genetics 16: 9–11. doi:10.1016/S0168-9525(99)01924-1

150. Enright AJ, Ouzounis CA (2001) Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol. 2: RESEARCH0034.

151. Suhre K, Claverie J-M (2004) FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res 32: D273–D276. doi:10.1093/nar/gkh053

152. van Helden J (2003) Regulatory sequence analysis tools. Nucleic Acids Res. 31: 3593–3596.

153. Thomas-Chollier M, Sand O, Turatsinze J-V, Janky R, Defrance M, et al. (2008) RSAT: regulatory sequence analysis tools. Nucleic Acids Research 36: W119–W127. doi:10.1093/nar/gkn304

154. Thomas-Chollier M, Defrance M, Medina-Rivera A, Sand O, Herrmann C, et al. (2011) RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Res. 39: W86–91. doi:10.1093/nar/gkr377

155. Gelfand MS, Novichkov PS, Novichkova ES, Mironov AA (2000) Comparative analysis of regulatory patterns in bacterial genomes. Brief. Bioinformatics 1: 357–371.

156. Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muñiz-Rascado L, et al. (2011) RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res 39: D98–105. doi:10.1093/nar/gkq1110

157. Pachkov M, Erb I, Molina N, van Nimwegen E (2007) SwissRegulon: a database of genome-wide annotations of regulatory sites. Nucleic Acids Res. 35: D127–131. doi:10.1093/nar/gkl857

158. Grote A, Klein J, Retter I, Haddad I, Behling S, et al. (2009) PRODORIC (release 2009): a database and tool platform for the analysis of gene regulation in prokaryotes. Nucleic Acids Res. 37: D61–65. doi:10.1093/nar/gkn837

159. Pellegrini M (1999) Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proceedings of the National Academy of Sciences 96: 4285–4288. doi:10.1073/pnas.96.8.4285

160. Ruano-Rubio V, Poch O, Thompson JD (2009) Comparison of eukaryotic phylogenetic profiling approaches using species tree aware methods. BMC Bioinformatics 10: 383. doi:10.1186/1471-2105-10-383

161. Huynen M (2000) Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Research 10: 1204–1210. doi:10.1101/gr.10.8.1204

162. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, et al. (2003) STRING: a database of predicted functional associations between proteins. Nucl. Acids Res. 31: 258–261. doi:10.1093/nar/gkg034

163. Ferrer L, Dale JM, Karp PD (2010) A systematic study of genome context methods: calibration, normalization and combination. BMC Bioinformatics 11: 493. doi:10.1186/1471-2105-11-493

164. Crawford JW, Crawford S (1980) Research in psychiatry: a co-citation analysis. Am J Psychiatry 137: 52–55.

165. Estabrooks CA, Derksen L, Winther C, Lavis JN, Scott SD, et al. (2008) The intellectual structure and substance of the knowledge utilization field: a longitudinal author co-citation analysis, 1945 to 2004. Implement Sci 3: 49. doi:10.1186/1748-5908-3-49

166. Šarić J, Jensen LJ, Ouzounova R, Rojas I, Bork P (2006) Extraction of regulatory gene/protein networks from Medline. Bioinformatics 22: 645–650. doi:10.1093/bioinformatics/bti597

167. Hur J, Sullivan KA, Pande M, Hong Y, Sima AAF, et al. (2011) The identification of gene expression profiles associated with progression of human diabetic neuropathy. Brain 134: 3222–3235. doi:10.1093/brain/awr228

168. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucl. Acids Res. 33: D433–437. doi:10.1093/nar/gki005

169. Fields S, Song O (1989) A novel genetic system to detect protein-protein interactions. Nature 340: 245–246. doi:10.1038/340245a0

170. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637–643. doi:10.1038/nature04670

171. Cusick ME, Klitgord N, Vidal M, Hill DE (2005) Interactome: gateway into systems biology. Hum. Mol. Genet 14 Spec No. 2: R171–181. doi:10.1093/hmg/ddi335

172. Bandyopadhyay S, Sharan R, Ideker T (2006) Systematic identification of functional orthologs based on protein network comparison. Genome Res. 16: 428–435. doi:10.1101/gr.4526006

173. Singh R, Xu J, Berger B (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. U.S.A. 105: 12763–12768. doi:10.1073/pnas.0806627105

174. Liao C-S, Lu K, Baym M, Singh R, Berger B (2009) IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25: i253–258. doi:10.1093/bioinformatics/btp203

175. Gerdes S, Edwards R, Kubal M, Fonstein M, Stevens R, et al. (2006) Essential genes on metabolic maps. Current Opinion in Biotechnology 17: 448–456. doi:10.1016/j.copbio.2006.08.006

176. Zhang Y, Thiele I, Weekes D, Li Z, Jaroszewski L, et al. (2009) Three-Dimensional Structural View of the Central Metabolic Network of Thermotoga maritima. Science 325: 1544–1549. doi:10.1126/science.1174671

177. Osterman A, Overbeek R (2003) Missing genes in metabolic pathways: a comparative genomics approach. Current Opinion in Chemical Biology 7: 238–251. doi:10.1016/S1367-5931(03)00027-9

178. Pavlidis P, Weston J, Cai J, Grundy WN (2001) Gene functional classification from heterogeneous data. ACM Press. p. 249–255. Available: http://dl.acm.org/citation.cfm?id=369228. Accessed 18 September 2011.

179. Deng M, Tu Z, Sun F, Chen T (2004) Mapping gene ontology to proteins based on protein–protein interaction data. Bioinformatics 20: 895–902. doi:10.1093/bioinformatics/btg500

180. Joshi T, Chen Y, Becker JM, Alexandrov N, Xu D (2004) Genome-scale gene function prediction using multiple sources of high-throughput data in yeast Saccharomyces cerevisiae. OMICS 8: 322–333. doi:10.1089/omi.2004.8.322

181. Chen Y, Xu D (2005) Genome-scale protein function prediction in yeast Saccharomyces cerevisiae through integrating multiple sources of high-throughput data. Pac Symp Biocomput: 471–482.

182. Zhu F, Han LY, Chen X, Lin HH, Ong S, et al. (2008) Homology-free prediction of functional class of proteins and peptides by support vector machines. Curr. Protein Pept. Sci 9: 70–95.

183. Erdin S, Lisewski AM, Lichtarge O (2011) Protein function prediction: towards integration of similarity metrics. Current Opinion in Structural Biology 21: 180–188. doi:10.1016/j.sbi.2011.02.001

184. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, et al. (2011) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39: D561–568. doi:10.1093/nar/gkq973

185. Doerks T, von Mering C, Bork P (2004) Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic Acids Research 32: 6321–6326. doi:10.1093/nar/gkh973

186. Taboada B, Verde C, Merino E (2010) High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res 38: e130. doi:10.1093/nar/gkq254

187. Jiang X, Gold D, Kolaczyk ED (2010) Network-based Auto-probit Modeling for Protein Function Prediction. Biometrics. Available: http://www.ncbi.nlm.nih.gov/pubmed/21133881. Accessed 24 August 2011.

188. Nitsch D, Tranchevent L-C, Thienpont B, Thorrez L, Van Esch H, et al. (2009) Network Analysis of Differential Expression for the Identification of Disease-Causing Genes. PLoS ONE 4: e5526. doi:10.1371/journal.pone.0005526

189. Nitsch D, Goncalves J, Ojeda F, de Moor B, Moreau Y (2010) Candidate gene prioritization by network analysis of differential expression using machine learning approaches. BMC Bioinformatics 11: 460. doi:10.1186/1471-2105-11-460

190. Nitsch D, Tranchevent L-C, Gonçalves JP, Vogt JK, Madeira SC, et al. (2011) PINTA: a web server for network-based gene prioritization from expression data. Nucleic Acids Res 39: W334– W338. doi:10.1093/nar/gkr289

191. White S, Smyth P (2003) Algorithms for estimating relative importance in networks. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. KDD '03. New York, NY, USA: ACM. p. 266–275. Available: http://doi.acm.org/10.1145/956750.956782. Accessed 15 December 2011.

192. Hu L, Huang T, Liu X-J, Cai Y-D (2011) Predicting Protein Phenotypes Based on Protein-Protein Interaction Network. PLoS One 6. doi:10.1371/journal.pone.0017668

193. Gomez A, Cedano J, Amela I, Planas A, Pinol J, et al. (2011) Gene Ontology Function prediction in Mollicutes using Protein-Protein Association Networks. BMC Systems Biology 5: 49. doi:10.1186/1752-0509-5-49

194. Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C (2002) Predictome: a database of putative functional links between proteins. Nucleic Acids Res. 30: 306–309.

195. Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q (2008) GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9 Suppl 1: S4. doi:10.1186/gb-2008-9-s1-s4

196. Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, et al. (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 38: W214–220. doi:10.1093/nar/gkq537

197. Monod J (1942) Recherches sur la croissance des cultures bactériennes.

198. Nyc JF, Mitchell HK (1949) The use of isotopic carbon in a study of the metabolism of anthranilic acid in Neurospora. J. Biol. Chem 179: 783–787.

199. Almonacid DE, Yera ER, Mitchell JBO, Babbitt PC (2010) Quantitative Comparison of Catalytic Mechanisms and Overall Reactions in Convergently Evolved Enzymes: Implications for Classification of Enzyme Function. PLoS Comput Biol 6. doi:10.1371/journal.pcbi.1000700

200. Almonacid DE, Babbitt PC (2011) Toward mechanistic classification of enzyme functions. Curr Opin Chem Biol 15: 435–442. doi:10.1016/j.cbpa.2011.03.008

201. Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, et al. (2009) Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 10: 136. doi:10.1186/1471-2105-10-136

202. Bairoch A (2000) The ENZYME database in 2000. Nucl. Acids Res. 28: 304–305. doi:10.1093/nar/28.1.304

203. Chang A, Scheer M, Grote A, Schomburg I, Schomburg D (2009) BRENDA, AMENDA and FRENDA the enzyme information system: new content and tools in 2009. Nucleic Acids Res 37: D588–D592. doi:10.1093/nar/gkn820

204. Scheer M, Grote A, Chang A, Schomburg I, Munaretto C, et al. (2011) BRENDA, the enzyme information system in 2011. Nucleic Acids Res. 39: D670–676. doi:10.1093/nar/gkq1089

205. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355–360. doi:10.1093/nar/gkp896

206. Karp P, Riley M, Paley SM, Pellegrini-Toole A, Krummenacker M (1997) EcoCyc: Encyclopedia of Escherichia coli Genes and Metabolism. Nucleic Acids Research 25: 43–51. doi:10.1093/nar/25.1.43

207. Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, et al. (2010) EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Research 39: D583–D590. doi:10.1093/nar/gkq1143

208. Karp PD, Krummenacker M, Paley S, Wagg J (1999) Integrated pathway-genome databases and their role in drug discovery. Trends in Biotechnology 17: 275–281. doi:10.1016/S0167-7799(99)01316-5

209. Karp PD, Paley SM, Krummenacker M, Latendresse M, Dale JM, et al. (2010) Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief. Bioinformatics 11: 40–79. doi:10.1093/bib/bbp043

210. Vastrik I, D'Eustachio P, Schmidt E, Joshi-Tope G, Gopinath G, et al. (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 8: R39. doi:10.1186/gb-2007-8-3-r39

211. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, et al. (2009) Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37: D619–622. doi:10.1093/nar/gkn863

212. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, et al. (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39: D691–697. doi:10.1093/nar/gkq1018

213. Morgat A, Coissac E, Coudert E, Axelsen KB, Keller G, et al. (2011) UniPathway: a resource for the exploration and annotation of metabolic pathways. Nucleic Acids Research. Available: http://nar.oxfordjournals.org/content/early/2011/11/17/nar.gkr1023.abstract. Accessed 11 December 2011.

214. Karp PD, Caspi R (2011) A survey of metabolic databases emphasizing the MetaCyc family. Arch. Toxicol. 85: 1015–1033. doi:10.1007/s00204-011-0705-2

215. Reddy VN, Mavrovouniotis ML, Liebman MN (1993) Petri net representations in metabolic pathways. Proc Int Conf Intell Syst Mol Biol 1: 328–336.

216. Kharchenko P, Chen L, Freund Y, Vitkup D, Church G (2006) Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics 7: 177. doi:10.1186/1471-2105-7-177

217. Faust K, Croes D, van Helden J (2009) Metabolic pathfinding using RPAIR annotation. J. Mol. Biol 388: 390–414. doi:10.1016/j.jmb.2009.03.006

218. Arita M (2000) Metabolic reconstruction using shortest paths. Simulation Practice and Theory 8: 109–125. doi:10.1016/S0928-4869(00)00006-9

219. Arita M (2003) In Silico Atomic Tracing by Substrate-Product Relationships in Escherichia coli Intermediary Metabolism. Genome Research 13: 2455–2466. doi:10.1101/gr.1212003

220. Arita M (2004) The metabolic world of Escherichia coli is not small. Proc. Natl. Acad. Sci. U.S.A 101: 1543–1547. doi:10.1073/pnas.0306458101

221. Boyer F, Viari A (2003) Ab initio reconstruction of metabolic pathways. Bioinformatics 19 Suppl 2: ii26–34.

222. Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM (2009) Small Molecule Subgraph Detector (SMSD) toolkit. J Cheminform 1: 12. doi:10.1186/1758-2946-1-12

223. Mithani A, Preston GM, Hein J (2009) Rahnuma: hypergraph-based tool for metabolic pathway prediction and network comparison. Bioinformatics 25: 1831–1832. doi:10.1093/bioinformatics/btp269

224. Heath AP, Bennett GN, Kavraki LE (2010) Finding metabolic pathways using atom tracking. Bioinformatics 26: 1548–1555. doi:10.1093/bioinformatics/btq223

225. Vieira G, Sabarly V, Bourguignon P-Y, Durot M, Le Fèvre F, et al. (2011) Core and panmetabolism in Escherichia coli. J. Bacteriol 193: 1461–1472. doi:10.1128/JB.01192-10

226. Durot M, Bourguignon P-Y, Schachter V (2009) Genome-scale models of bacterial metabolism: reconstruction and applications. FEMS Microbiol Rev 33: 164–190. doi:10.1111/j.1574-6976.2008.00146.x

227. Faust K, Dupont P, Callut J, van Helden J (2010) Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26: 1211–1218. doi:10.1093/bioinformatics/btq105

228. Durot M, Le Fèvre F, de Berardinis V, Kreimeyer A, Vallenet D, et al. (2008) Iterative reconstruction of a global metabolic model of Acinetobacter baylyi ADP1 using high-throughput growth phenotype and gene essentiality data. BMC Syst Biol 2: 85. doi:10.1186/1752-0509-2-85

229. Fiehn O, Barupal DK, Kind T (2011) Extending Biochemical Databases by Metabolomic Surveys. Journal of Biological Chemistry 286: 23637–23643. doi:10.1074/jbc.R110.173617

230. D’Ari R, Casadesús J (1998) Underground metabolism. Bioessays 20: 181–186. doi:10.1002/(SICI)1521-1878(199802)20:2<181::AID-BIES10>3.0.CO;2-0

231. Zaparucha A, de Berardinis V, Weissenbach J (2011) Biocatalyse, bioconversion et biotechnologie blanche : des outils du vivant pour la chimie. La chimie prépare notre avenir (vol. 2): 43–50.

232. Orth JD, Palsson BØ (2010) Systematizing the generation of missing metabolic knowledge. Biotechnology and Bioengineering 107: 403–412. doi:10.1002/bit.22844

233. Karp PD (2004) Call for an enzyme genomics initiative. Genome Biol 5: 401. doi:10.1186/gb-2004-5-8-401

234. Roberts RJ (2004) Identifying Protein Function—A Call for Community Action. PLoS Biol 2. doi:10.1371/journal.pbio.0020042

235. Lespinet O, Labedan B (2006) Puzzling over orphan enzymes. Cell. Mol. Life Sci 63: 517–523. doi:10.1007/s00018-005-5520-6

236. Lespinet O, Labedan B (2006) Orphan enzymes could be an unexplored reservoir of new drug targets. Drug Discov. Today 11: 300–305. doi:10.1016/j.drudis.2006.02.002

237. Chen L, Vitkup D (2007) Distribution of orphan metabolic activities. Trends Biotechnol 25: 343–348. doi:10.1016/j.tibtech.2007.06.001

238. Lespinet O, Labedan B (2006) ORENZA: a web resource for studying ORphan ENZyme activities. BMC Bioinformatics 7: 436. doi:10.1186/1471-2105-7-436

239. Tranchevent L-C, Capdevila FB, Nitsch D, De Moor B, De Causmaecker P, et al. (2011) A guide to web tools to prioritize candidate genes. Brief. Bioinformatics 12: 22–32. doi:10.1093/bib/bbq007

240. Kreimeyer A, Perret A, Lechaplais C, Vallenet D, Médigue C, et al. (2007) Identification of the Last Unknown Genes in the Fermentation Pathway of Lysine. Journal of Biological Chemistry 282: 7191–7197. doi:10.1074/jbc.M609829200

241. Banci L, Bertini I, Ciofi-Baffoni S, Katsari E, Katsaros N, et al. (2005) A copper(I) protein possibly involved in the assembly of CuA center of bacterial cytochrome c oxidase. Proc. Natl. Acad. Sci. U.S.A 102: 3994–3999. doi:10.1073/pnas.0406150102

242. Ramazzina I, Folli C, Secchi A, Berni R, Percudani R (2006) Completing the uric acid degradation pathway through phylogenetic comparison of whole genomes. Nat. Chem. Biol 2: 144–148. doi:10.1038/nchembio768

243. Gaballa A, Newton GL, Antelmann H, Parsonage D, Upton H, et al. (2010) Biosynthesis and functions of bacillithiol, a major low-molecular-weight thiol in Bacilli. Proc. Natl. Acad. Sci. U.S.A 107: 6482–6486. doi:10.1073/pnas.1000928107

244. Yamanishi Y, Mihara H, Osaki M, Muramatsu H, Esaki N, et al. (2007) Prediction of missing enzyme genes in a bacterial metabolic network. FEBS Journal 274: 2262–2273. doi:10.1111/j.1742-4658.2007.05763.x

245. Akaho S (2001) A kernel method for canonical correlation analysis. In: Proceedings of the International Meeting of the Psychometric Society (IMPS2001). Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.4434.

246. Kharchenko P, Vitkup D, Church GM (2004) Filling gaps in a metabolic network using expression information. Bioinformatics 20 Suppl 1: i178–185. doi:10.1093/bioinformatics/bth930

247. Freund Y, Schapire RE (1997) A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55: 119–139. doi:10.1006/jcss.1997.1504

248. Green ML, Karp PD (2007) Using genome-context data to identify specific types of functional associations in pathway/genome databases. Bioinformatics 23: i205–211. doi:10.1093/bioinformatics/btm213

249. Green ML, Karp PD (2006) The outcomes of pathway database computations depend on pathway ontology. Nucleic Acids Research 34: 3687–3697. doi:10.1093/nar/gkl438

250. Yao Z, Ruzzo WL (2006) A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data. BMC Bioinformatics 7 Suppl 1: S11. doi:10.1186/1471-2105-7-S1-S11

251. Aghaie A, Lechaplais C, Sirven P, Tricot S, Besnard-Gonnet M, et al. (2008) New insights into the alternative D-glucarate degradation pathway. J. Biol. Chem 283: 15638–15646. doi:10.1074/jbc.M800487200

252. Fonknechten N, Perret A, Perchat N, Tricot S, Lechaplais C, et al. (2009) A conserved gene cluster rules anaerobic oxidative degradation of L-ornithine. J. Bacteriol 191: 3162–3167. doi:10.1128/JB.01777-08

253. Zheng Y, Szustakowski JD, Fortnow L, Roberts RJ, Kasif S (2002) Computational identification of operons in microbial genomes. Genome Res. 12: 1221–1230. doi:10.1101/gr.200602

254. Gai AT, Habib M, Paul C, Raffinot M (2003) Identifying Common Connected Components of Graphs. Available: http://hal-lirmm.ccsd.cnrs.fr/docs/00/26/94/40/PDF/D97.PDF. Accessed 15 Dec 2011.

255. Habib M, Paul C, Raffinot M (2004) Maximal Common Connected Sets of Interval Graphs. In: Sahinalp SC, Muthukrishnan S, Dogrusoz U, editors. Combinatorial Pattern Matching. Berlin, Heidelberg: Springer Berlin Heidelberg, Vol. 3109. p. 359–372. Available: http://www.springerlink.com/content/3q748wergyuf1201/. Accessed 15 Dec 2011.

256. Denielou Y-P, Sagot M-F, Boyer F, Viari A (2011) Bacterial syntenies: an exact approach with gene quorum. BMC Bioinformatics 12: 193. doi:10.1186/1471-2105-12-193

257. Chen F, Mackey AJ, Vermunt JK, Roos DS (2007) Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes. PLoS ONE 2: e383. doi:10.1371/journal.pone.0000383

258. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575–1584.

259. Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, et al. (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucl. Acids Res. 36: D623–631. doi:10.1093/nar/gkm900

260. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30.

261. van Helden J, Wernisch L, Gilbert D, Wodak SJ (2002) Graph-based analysis of metabolic networks. Ernst Schering Res. Found. Workshop: 245–274.

262. Chou C-H, Chang W-C, Chiu C-M, Huang C-C, Huang H-D (2009) FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Res 37: W129–134. doi:10.1093/nar/gkp264

263. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34: D354–357. doi:10.1093/nar/gkj102

264. Faust K (2010) Development, assessment and application of bioinformatics tools for the extraction of pathways from metabolic networks. PhD thesis, Université Libre de Bruxelles. Available: http://theses.ulb.ac.be/ETD-db/collection/available/ULBetd-11172010-110850/unrestricted/Karoline_Faust_thesis.pdf.

265. Codd EF (1998) A relational model of data for large shared data banks. 1970. MD Comput 15: 162–166.

266. Petrey D, Honig B (2009) Is protein classification necessary? Toward alternative approaches to function annotation. Curr. Opin. Struct. Biol 19: 363–368. doi:10.1016/j.sbi.2009.02.001

267. Tian W, Arakaki AK, Skolnick J (2004) EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic Acids Res 32: 6226–6239. doi:10.1093/nar/gkh956

268. Hsiao T-L, Revelles O, Chen L, Sauer U, Vitkup D (2010) Automatic policing of biochemical annotations using genomic correlations. Nat. Chem. Biol 6: 34–40. doi:10.1038/nchembio.266

269. Rost B (2002) Enzyme function less conserved than anticipated. J. Mol. Biol. 318: 595–608. doi:10.1016/S0022-2836(02)00016-5

270. O’Boyle NM, Holliday GL, Almonacid DE, Mitchell JBO (2007) Using reaction mechanism to measure enzyme similarity. J. Mol. Biol 368: 1484–1499. doi:10.1016/j.jmb.2007.02.065

271. Satoh H, Sacher O, Nakata T, Chen L, Gasteiger J, et al. (1998) Classification of Organic Reactions: Similarity of Reactions Based on Changes in the Electronic Features of Oxygen Atoms at the Reaction Sites. J. Chem. Inf. Comput. Sci. 38: 210–219. doi:10.1021/ci9701190

272. Holliday GL, Almonacid DE, Bartlett GJ, O’Boyle NM, Torrance JW, et al. (2007) MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms. Nucl. Acids Res. 35: D515–520. doi:10.1093/nar/gkl774

273. Bellinzoni M, Bastard K, Perret A, Zaparucha A, Perchat N, et al. (2011) 3-Keto-5-aminohexanoate cleavage enzyme: a common fold for an uncommon Claisen-type condensation. J. Biol. Chem. 286: 27399–27405. doi:10.1074/jbc.M111.253260

274. Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73: 237–244.

275. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res. 32: 1792–1797. doi:10.1093/nar/gkh340

276. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7: 539. doi:10.1038/msb.2011.75

277. Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl. Acids Res. 30: 3059–3066. doi:10.1093/nar/gkf436

278. von Luxburg U (2007) A tutorial on spectral clustering. Statistics and Computing 17: 395–416. doi:10.1007/s11222-007-9033-z

279. Howe K, Bateman A, Durbin R (2002) QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics 18: 1546–1547.

280. Brown DP, Krishnamurthy N, Sjölander K (2007) SCI-PHY: Automated Protein Subfamily Identification and Classification. PLoS Comput Biol 3: e160. doi:10.1371/journal.pcbi.0030160

281. Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, et al. (2007) Comparative protein structure modeling using MODELLER. Curr Protoc Protein Sci Chapter 2: Unit 2.9. doi:10.1002/0471140864.ps0209s50

282. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10: 168. doi:10.1186/1471-2105-10-168

283. Shatsky M, Nussinov R, Wolfson HJ (2004) A method for simultaneous alignment of multiple protein structures. Proteins 56: 143–156. doi:10.1002/prot.10628

284. Fisher DH (1987) COBWEB: Knowledge acquisition via incremental conceptual clustering. Machine Learning 2: 139–172. Available: http://www.springerlink.com/content/x8552ppn35245112/. Accessed 30 Apr 2010.

285. Holmes G, Donkin A, Witten IH (1994) WEKA: a machine learning workbench. Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, 1994. IEEE. p. 357–361. doi:10.1109/ANZIIS.1994.396988

286. Pei J, Cai W, Kinch LN, Grishin NV (2006) Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics 22: 164–171. doi:10.1093/bioinformatics/bti766

287. Strehl A, Ghosh J (2003) Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3: 583–617.

288. Hornik K (2005) A CLUE for Cluster Ensembles. Journal of Statistical Software 14(12). Available: http://www.jstatsoft.org/v14/i12/. Accessed 2 Aug 2010.

289. Snel B, Lehmann G, Bork P, Huynen MA (2000) STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucl. Acids Res. 28: 3442–3444. doi:10.1093/nar/28.18.3442

290. Hattori M, Okuno Y, Goto S, Kanehisa M (2003) Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc 125: 11853–11865. doi:10.1021/ja036030u

291. Roberts RJ, Chang Y-C, Hu Z, Rachlin JN, Anton BP, et al. (2010) COMBREX: a project to accelerate the functional annotation of prokaryotic genomes. Nucleic Acids Research 39: D11–D14. doi:10.1093/nar/gkq1168

292. Gerlt JA, Allen KN, Almo SC, Armstrong RN, Babbitt PC, et al. (2011) The Enzyme Function Initiative. Biochemistry. Available: http://dx.doi.org/10.1021/bi201312u.

293. Escofier B, Pagès J (2008) Analyses factorielles simples et multiples : Objectifs, méthodes et interprétation. 4th ed. Dunod.
