Genomic Exploration of the Hemiascomycetous Yeasts, 3rd Workshop on Algorithms in Bioinformatics, Moscow, 2008-10-08

David J. Sherman
U. Bordeaux, France
LaBRI CNRS & INRIA team “MAGNOME”

Comparative genomics
• Is certainly about comparison
• But is also about the genomes
  • Which ones do we sequence?
  • And what do we do after that?

A caricature
[Diagram: Data → Push button → Results. Labels: “A solved problem”; “The hard part” (twice), one biological and one algorithmic, each “otherwise not interesting”.]

Hemiascomycetous yeasts
Eukaryotic genomes
Small and compact
Experimental model
Biotechnological interest
• beer, wine, bread
• assimilate hydrocarbons, tannin extracts
• hormones and vaccines
Medical interest
Biodiversity
Systems

Understand mechanisms of molecular evolution
Genome redundancy
Ortho-/paralog divergence
Expansion and contraction of universal families
Tandem duplications
Block duplication and rearrangement
Conservation of synteny

Comparison of evolutionary range of Hemiascomycetes and Chordates
[Figure: two groups placed on a common scale running from 50 to 100. Scale: average % of amino-acid identity between complete set of orthologous proteins.
Hemiascomycetes: Yarrowia lipolytica, Debaryomyces hansenii, Kluyveromyces lactis, Candida glabrata, Saccharomyces uvarum, Saccharomyces paradoxus, Saccharomyces cerevisiae (the last three labeled Saccharomyces sensu stricto).
Chordates: Ciona intestinalis (Urochordates); Takifugu rubripes, Tetraodon nigroviridis (Fishes); Gallus gallus (Birds); Mus musculus, Homo sapiens (Mammals).]
Dujon (2006) Trends in Genetics 22: 375-387

Génolevures Sequencing Projects
Génolevures 1
• 13 species, partial 0.2-0.4X
• Souciet et al 2000 [21 papers] FEBS Letters 487
Génolevures 2
• 4 species complete 12X
• Dujon, Sherman et al 2004 Nature 430
• Sherman et al 2006 NAR 34
Génolevures 3
• 3 species complete 12X
• 2 species complete 7-12X
Génolevures 4
• 4 + 5 + 5 close species, NGS

[Figure: phylogenetic tree of eight yeast species annotated with evolutionary events. Annotations, in extraction order: whole genome duplication; post-duplication gene loss; expansion of sugar-utilisation genes; loss of active Ty5; post-duplication gene loss; loss of GAL genes; loss of sex; loss of all active type I retroposons; triplicated mating-type cassettes; HO endonuclease; loss of GAL genes; short centromeres; loss of class II transposons; degradation of HO; loss of HO; loss of GAL genes; non universal genetic code; expansion of gene families encoding lipases, extracellular proteases etc.; high rate of intron loss; loss of sex; expansion of gene families encoding lipases, extracellular proteases, allantoin and allantoate transporters etc. Retroelement labels: Ty4; Ty1 / 2; Ty5 and non-LTR retroposons; Ty3; Tca2.]

Species                     Nb of chrom.   Genome size (Mb)
Saccharomyces cerevisiae    16             12.1
Candida glabrata            13             12.3
Kluyveromyces waltii        8              10.7
Kluyveromyces lactis        6              10.6
Ashbya gossypii             7              9.2
Debaryomyces hansenii       7              12.2
Candida albicans            8              14.9
Yarrowia lipolytica         6              20.5

Dujon (2006) Trends in Genetics 22: 375-387

Genomic data for complete genomes
Complete genomes sequenced by the Génoscope
What is complete?
• Sequence subtelomere to subtelomere
• Fully assembled chromosomes
• Careful manual annotation
What can you do with a complete sequence?
• Track chromosomal rearrangements
• Analyze species- or clade-specific gain or loss
• Measure expansion and contraction of protein families
• Look for long-range correlations

What’s next?
And what do we do after that?
Genome Annotation
• Magus annotation system
• Simultaneous annotation of putative homologs
Classification into protein families
• Consensus ensemble clustering
Comparative maps
• Discovering synteny
• Identifying orthologs

Let’s avoid teleology
Genomes are thrown together from bits and pieces of things that worked, once
Genome annotation has to reflect reality and not expectations
It is hard, painstaking work
It is not fully automatic
Good tools help
R. Greaves

The Annotation Process
[Flowchart. Legend: GÉNOLEVURES technique; GÉNOLEVURES result; External technology; External data source. Nodes: Genomic DNA; Algorithmic sequence analysis; Predictive methods (twice); RNA genes and other elements; Gene models; Transcript sequencing; Simultaneous gene annotation; Protein-coding genes; Classification; Integration; Homolog groups; Systematic comparison and consensus; Curation updates; Curated genes; Complementary analyses; Annotated genome; Protein families.]

Magus
[Diagram: architecture of the Magus annotation system. Recoverable labels: In silico predictions; Alignments & search; Rule checker; Rules; U.I. components; DB; Web; Web users; Web Service Bus; Compute; Genome Browser; Genomes database; Computed results; KB; Fast browser database.]

The “big iron”
Production
Redundant, high disp.
• IBM, Dell
• x86_64
• Rocks + bio roll
Dinkum-thinkum
74 cores
4 Gbyte
• HMMER, NCBI BLAST, ClustalW, EMBOSS, Glimmer, Fasta, MrBayes, Phylip, T_Coffee, MPI-Blast, GROMACS, GenCore 6
Servers
• 3 web
• 1 database
• Mini-cluster
Storage
• 11 Tbyte RAID

Browsing a genome region
Viewing a Locus on a Genome
Validating a Gene Model
Annotating Homolog Groups
[Screenshots of the annotation interface]

Protein families
Multi-species groups of related proteins
Phylogenetic relationship → functional similarity
Diversity of in silico results
Need to calibrate or train methods for different phylogenetic groups
New algorithm for consensus clustering that is efficient in practice

What’s the goal?
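
In concrete terms, the consensus-clustering setting starts from several partitions of the same proteins, produced by different methods, that have to be compared. The sketch below is a minimal illustration under stated assumptions: it builds the confusion matrix of two partitions (shared-member counts for each pair of groups) and computes the Rand index over protein pairs. The Rand index is a textbook agreement measure swapped in for illustration, not the fusion/fission distance discussed in the talk; the function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def confusion_matrix(part_a, part_b):
    """Count shared proteins for each (group in part_a, group in part_b) pair.

    Each partition maps a protein identifier to a group label.
    Only proteins present in both partitions are counted.
    """
    shared = part_a.keys() & part_b.keys()
    return Counter((part_a[x], part_b[x]) for x in shared)


def rand_index(part_a, part_b):
    """Fraction of protein pairs on which the two partitions agree.

    A pair agrees when both partitions put the two proteins together,
    or both put them apart. This is an illustrative stand-in, not the
    distance used in the talk.
    """
    shared = sorted(part_a.keys() & part_b.keys())
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (part_a[u] == part_a[v]) == (part_b[u] == part_b[v])
        for u, v in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: two methods grouping the same five proteins.
blast = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
sw = {"p1": "X", "p2": "X", "p3": "X", "p4": "Y", "p5": "Z"}

cm = confusion_matrix(blast, sw)
ri = rand_index(blast, sw)
```

In this toy case group B is split across X and Y, which shows up as two nonzero cells in row B of the confusion matrix. Rows or columns with more than one nonzero cell are exactly the places where two predictions disagree and some reconciliation is needed.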
[Figure: Complete genomes → Blast and Smith-Waterman alignments, filtered by E-val threshold and homeomorphy → Partitions Π1, Π2, Π3, Π4, ..., Πn → Protein families]

Reconciling different in silico predictions
[Figure: Proteomes → Blast & SW sequence alignments → Homeomorphic and nonhomeomorphic alignments → several partitions]
Agreement between partitions
• Confusion matrix
• Distance between partitions, that is, a shortest path in a graph of fusions/fissions
NP-complete

Median partitions by consensus clustering
[Figure: Proteomes → Blast & SW sequence alignments → Homeomorphic and nonhomeomorphic alignments → Partitions Π1, Π2, Π3, Π4, ..., Πn]
Compute a median partition Π minimizing [formula lost in extraction] → consensus

Construction and algorithm
Define a similarity measure based on the confusion matrix
• FRel_i,j: encodes the components c_i
Select c_i in each R_k by MDC (min. disjoint cover)
• R_k: maximal conflict regions
• NP-complete
Efficient heuristic
• Relaxation: admit inexact cover (not all proteins are in families)
• Resolve conflicts by election + policy
For each component C
  for each c_i ∈ C compute S_i and D_i
  each p votes for c_i in order D_i ↑ and S_i ↓
  take the winning c_i in order so as to cover the most proteins p
[Figure labels: Conflict regions; Conflict graph; subgroups; family]

Correlated gain and loss in networks and metabolic pathways

Construct a PSSM for each family as follows
[Figure. Construction: Family GL2 fasta → PSI blast → PSSM. Validation: Proteomes GL2 → PSI blast → Comparison: TP, TN, FP, FN and worst E-val]
4384 families
• 4240 where FN = 0: FP med 0.0, avg 3.7, max 302; Ev med 6e-78, max 9e-6
• 144 where FN > 0: FP med 4.5, avg 33, max 307

Build a PSSM for each family and use it to improve gene prediction
[Figure: Family GL2 fasta → PSI blast → PSSM*; ORF translations → PSI blast → Candidates → filtering by per-family size and E-value criteria → Loci assigned to families]
*PSSM: position-specific scoring matrix for PSIBLAST

Comparison with KOGs
Project families on S. cerevisiae
Select intersection and compare
3625 proteins (~2500 families)
identities: 1901
split: 159 (4 GLS, 42 GLR, 113 GLC)
merge: 117 (6 GLS, 70 GLR, 79 GLC)
messy: 25 (2 GLS, 13 GLR, 23 GLC)

Comparison with KOGs
Comparison of GLR.3294 with PIRSF (UniProt) 017196 and 017667
Comparison of GLR.3292 with PIRSF 017297 and 016767

Comparative maps
Despite similarity in size, gene content, ecological niche, yeast genomes are highly rearranged.
In general, synteny is poorly conserved. In part:
• evolutionary distance
• artifact of WGD

Comparative maps
But, if we focus on the Saccharomycetaceae that did not undergo a recent whole genome duplication
Protoploid genomes
• homogeneous
• low redundancy
• less reshuffling
Syntenic homologs are orthologs

So, in conclusion
Comparative genomics works if you pay attention to the data
• High-quality, complete genomes
• Chosen from interesting phylogenetic groups
Building tools and analyses works if you have a plan
• Genome annotation
• Protein families and subgroups
• Syntenic blocks and common markers
Many opportunities for further work
http://genolevures.org/ ≡ http://cbi.labri.fr/Genolevures/

Acknowledgments and support
Bordeaux
• Macha Nikolski CNRS
• Tiphaine Martin CNRS
• Pascal Durrens CNRS
• David Sherman INRIA
• Géraldine Jean
• Hayssam Soueidan
• Nicolás Loira
• Adrien Goëffon
• Julie Bourbeillon
• Rodrigo Assar
Génolevures
• Jean-Luc Souciet
• Bernard Dujon
• Claude Gaillardin
• Christian Marck
• Eric Westhof
• Cécile Neuvéglise
• Cécile Fairhead
• André Goffeau
• Philippe Baret
• Ed Louis
• Mark Johnston
CNRS GDR 2354 Génolevures
CNRS UMR 5800 LaBRI
INRIA team MAGNOME