Genomic Exploration of the Hemiascomycetous Yeasts
3rd Workshop on Algorithms in Bioinformatics, Moscow, 2008-10-08
David J. Sherman, U. Bordeaux, France
LaBRI CNRS & INRIA team “MAGNOME”

Comparative genomics
Is certainly about comparison
But is also about the genomes
• Which ones do we sequence?
• And what do we do after that?

A caricature
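[Figure: two opposing views of the same pipeline]
Data: "the hard part" for one side, "a solved problem" for the other
Analysis step: "push button" for one side, "the hard part" for the other
Results: biological, otherwise not interesting; algorithmic, otherwise not interesting

Hemiascomycetous yeasts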
Eukaryotic genomes
Small and compact
Experimental model
Biotechnological interest
• beer, wine, bread
• assimilate hydrocarbons, tannin extracts
• hormones and vaccines
Medical interest
Biodiversity
Systems

Understand mechanisms of molecular evolution
Genome redundancy
Ortho-/para-log divergence
Expansion and contraction of universal families
Tandem duplications
Block duplication and rearrangement
Conservation of synteny

Comparison of evolutionary range of Hemiascomycetes and Chordates
[Figure: Hemiascomycetes and Chordates placed on a common scale running from 50 to 100]
Hemiascomycetes: Yarrowia lipolytica, Debaryomyces hansenii, Kluyveromyces lactis, Candida glabrata, Saccharomyces uvarum, Saccharomyces paradoxus, Saccharomyces cerevisiae (the last three: Saccharomyces sensu stricto)
Chordates: Urochordates (Ciona intestinalis), Fishes (Takifugu rubripes, Tetraodon nigroviridis), Birds (Gallus gallus), Mammals (Mus musculus, Homo sapiens)
Scale: average % of amino-acid identity between complete sets of orthologous proteins
Dujon (2006) Trends in Genetics 22: 375-387

Génolevures Sequencing Projects
Génolevures 1
• 13 species, partial 0.2-0.4X
• Souciet et al 2000 [21 papers] FEBS Letters 487
Génolevures 2
• 4 species complete 12X
• Dujon, Sherman et al 2004 Nature 430
• Sherman et al 2006 NAR 34
Génolevures 3
• 3 species complete 12X
• 2 species complete 7-12X
Génolevures 4
• 4 + 5 + 5 close species, NGS
[Figure: phylogenetic tree of sequenced yeasts annotated with evolutionary events]

Species                     Nb of chrom.   Genome size (Mb)
Saccharomyces cerevisiae    16             12.1
Candida glabrata            13             12.3
Kluyveromyces waltii        8              10.7
Kluyveromyces lactis        6              10.6
Ashbya gossypii             7              9.2
Debaryomyces hansenii       7              12.2
Candida albicans            8              14.9
Yarrowia lipolytica         6              20.5

Events annotated on the tree:
• whole genome duplication
• post-duplication gene loss
• expansion of sugar-utilisation genes
• loss of active Ty5
• loss of GAL genes
• loss of sex
• loss of all active type I retroposons
• triplicated mating-type cassettes
• HO endonuclease
• short centromeres
• loss of class II transposons
• degradation of HO
• Ty5 and non-LTR retroposons
• loss of HO
• non universal genetic code
• high rate of intron loss
• expansion of gene families encoding lipases, extracellular proteases etc...
• expansion of gene families encoding lipases, extracellular proteases, allantoin and allantoate transporters etc...
Transposon labels: Ty1 / 2, Ty3, Ty4, Tca2

Dujon (2006) Trends in Genetics 22: 375-387

Genomic data for complete genomes
Complete genomes sequenced by the Génoscope
What is complete?
• Sequence subtelomere to subtelomere
• Fully assembled chromosomes
• Careful manual annotation
What can you do with a complete sequence?
• Track chromosomal rearrangements
• Analyze species- or clade-specific gain or loss
• Measure expansion and contraction of protein families
• Look for long-range correlations

What’s next? And what do we do after that?
Genome Annotation
• Magus annotation system
• Simultaneous annotation of putative homologs
Classification into protein families
• Consensus ensemble clustering
Comparative maps
• Discovering synteny
• Identifying orthologs

Let’s avoid teleology
Genomes are thrown together from bits and pieces of things that worked, once
Genome annotation has to reflect reality and not expectations
• It is hard, painstaking work
• It is not fully automatic
• Good tools help
R. Greaves

The Annotation Process
[Figure: flowchart of the annotation process, from genomic DNA to annotated genome and protein families]
Legend: GÉNOLEVURES technique; GÉNOLEVURES result; external technology; external data source
Steps and products shown: genomic DNA; algorithmic sequence analysis; predictive methods; RNA genes and other elements; gene models; transcript sequencing; simultaneous gene annotation; protein-coding genes; classification; homolog groups; integration; systematic comparison and consensus; curation updates; curated genes; complementary analyses; annotated genome; protein families

Magus

The “big iron”
[Figure: Magus architecture. Labels shown: production; in silico predictions; alignments & search; rule checker; rules DB; Dinkum-thinkum; U.I. components; Web; Web Service Bus; users; genome browser; genomes database; results KB; fast browser database]
Servers: redundant, high disp.
• IBM, Dell
• x86_64
• Rocks + bio roll
• 3 web
• 1 database
• Mini-cluster
Compute: 74 cores, 4 Gbyte
• HMMER, NCBI BLAST, ClustalW, EMBOSS, Glimmer, Fasta, MrBayes, Phylip, T_Coffee, MPI-Blast, GROMACS
• GenCore 6
Storage
• 11 Tbyte RAID

Browsing a genome region
Viewing a Locus on a Genome
Validating a Gene Model
Annotating Homolog Groups

Protein families
Multi-species groups of related proteins
Phylogenetic relationship → functional similarity
• Diversity of in silico results
• Need to calibrate or train methods for different phylogenetic groups
• New algorithm for consensus clustering that is efficient in practice

What’s the goal?
[Figure: complete genomes → Blast and Smith-Waterman alignments, filtered by E-val threshold and homeomorphy → partitions ∏1, ∏2, ∏3, ∏4, …, ∏n → protein families]

Reconciling different in silico predictions
[Figure: proteomes → Blast & SW sequence alignments → homeomorphic and nonhomeomorphic alignments → several partitions]
Agreement between partitions
• Confusion matrix
• Distance between partitions, that is, a shortest path in a graph of fusions/fissions (NP-complete)

Median partitions by consensus clustering
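Agreement between two partitions can be tabulated as a confusion matrix. The following is a minimal sketch in Python, assuming each partition is given as a dictionary from protein identifier to cluster label; the function name, variable names and toy identifiers are illustrative and do not come from the talk.

```python
from collections import Counter

def confusion_matrix(part_a, part_b):
    """Cell (i, j) counts the proteins placed in cluster i of part_a
    and in cluster j of part_b. Proteins missing from either partition
    are ignored."""
    shared = part_a.keys() & part_b.keys()
    return Counter((part_a[p], part_b[p]) for p in shared)

# Toy data: hypothetical protein ids and cluster labels.
blast_partition = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
sw_partition = {"p1": "X", "p2": "X", "p3": "Y", "p4": "Z"}

matrix = confusion_matrix(blast_partition, sw_partition)
for (i, j), count in sorted(matrix.items()):
    print(i, j, count)
```

When every row and every column has a single nonzero cell, the two partitions agree on the shared proteins. Here cluster B is spread over Y and Z, which is the kind of fission the distance between partitions has to account for.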
[Figure: proteomes → Blast & SW sequence alignments → homeomorphic and nonhomeomorphic alignments → partitions ∏1, ∏2, ∏3, ∏4, …, ∏n → consensus partition ∏]
Compute a median partition ∏ minimizing the total distance to ∏1, …, ∏n

Construction and algorithm
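To make the median objective concrete, here is a small Python sketch under two loud assumptions: it replaces the fusion/fission distance with a simpler pair-counting distance (the number of protein pairs that one partition groups together and the other separates), and it only searches among the input partitions rather than over all possible partitions. It is therefore an illustration of "minimize the total distance to the inputs", not the construction described in the talk; all names are hypothetical.

```python
from itertools import combinations

def co_clustered_pairs(partition):
    """Set of unordered protein pairs placed in the same cluster."""
    clusters = {}
    for protein, label in partition.items():
        clusters.setdefault(label, []).append(protein)
    pairs = set()
    for members in clusters.values():
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs

def pair_distance(part_a, part_b):
    """Number of protein pairs on which the two partitions disagree
    (assumed stand-in for the fusion/fission distance)."""
    return len(co_clustered_pairs(part_a) ^ co_clustered_pairs(part_b))

def best_input_partition(partitions):
    """Index of the input partition with the smallest total distance
    to all inputs. Assumes all partitions cover the same proteins."""
    totals = [sum(pair_distance(p, q) for q in partitions) for p in partitions]
    return min(range(len(partitions)), key=totals.__getitem__)

# Toy data: three hypothetical partitions of four proteins.
pi_1 = {"a": 1, "b": 1, "c": 2, "d": 2}
pi_2 = {"a": 1, "b": 1, "c": 1, "d": 2}
pi_3 = {"a": 1, "b": 1, "c": 2, "d": 3}
median_index = best_input_partition([pi_1, pi_2, pi_3])
```

Restricting the search to the inputs keeps the sketch short; a true median may be a partition that none of the methods produced.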
Define a similarity measure based on the confusion matrix
• FRel_{i,j}: encodes the components c_i
Select c_i in each R_k by MDC (min. disjoint cover)
• R_k: maximal conflict regions
• NP-complete
• Efficient heuristic
Relaxation: admit inexact cover (not all proteins are in families)
Resolve conflicts by election + policy

For each component C
    for each c_i ∈ C, compute S_i and D_i
    each protein p votes for the c_i in order of D_i ↑ and S_i ↓
    take the winning c_i in order, so as to cover the most proteins p

[Figure labels: conflict regions; conflict graph; subgroups; family]

Correlated gain and loss in networks and metabolic pathways

Construct a PSSM for each family
[Figure: Construction: family → GL2 fasta → PSI blast → PSSM. Validation: proteomes (GL2) → PSI blast with the PSSM → comparison of TP, TN, FP, FN and worst E-val]
4384 families as follows
• 4240 where FN = 0: FP med 0.0, avg 3.7, max 302; Ev med 6e-78, max 9e-6
• 144 where FN > 0: FP med 4.5, avg 33, max 307

Build a PSSM for each family and use to improve gene prediction
[Figure: family → GL2 fasta → PSI blast → PSSM*; ORF translations → PSI blast with the PSSM → candidates → filtering by per-family size and E-value criteria → loci assigned to families]
*PSSM: position-specific scoring matrix for PSIBLAST

Comparison with KOGs
Project families on S. cerevisiae
Select intersection and compare
• 3625 proteins (~2500 families)
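One way to compare two classifications on their shared proteins is to link every family to every KOG it overlaps and label each connected group of links. The Python sketch below assumes the following definitions, which are not spelled out in the talk: one family facing one KOG is an identity, one family spread over several KOGs is a split, several families inside one KOG is a merge, and anything else is messy. Function names and identifiers are hypothetical.

```python
from collections import Counter, defaultdict

def classify_overlaps(families, kogs):
    """families, kogs: dicts mapping protein id -> group label.
    Returns a Counter of categories over connected components of the
    family/KOG overlap graph, restricted to shared proteins."""
    adjacency = defaultdict(set)
    for protein in families.keys() & kogs.keys():
        fam = ("family", families[protein])
        kog = ("kog", kogs[protein])
        adjacency[fam].add(kog)
        adjacency[kog].add(fam)

    seen, categories = set(), Counter()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        n_fam = sum(1 for kind, _ in component if kind == "family")
        n_kog = len(component) - n_fam
        if n_fam == 1 and n_kog == 1:
            categories["identity"] += 1
        elif n_fam == 1:
            categories["split"] += 1   # assumed: one family over several KOGs
        elif n_kog == 1:
            categories["merge"] += 1   # assumed: several families in one KOG
        else:
            categories["messy"] += 1
    return categories

# Toy data with hypothetical labels.
toy_families = {"p1": "fam_A", "p2": "fam_A", "p3": "fam_B",
                "p4": "fam_C", "p5": "fam_D", "p6": "fam_D"}
toy_kogs = {"p1": "kog_1", "p2": "kog_1", "p3": "kog_2",
            "p4": "kog_2", "p5": "kog_3", "p6": "kog_4"}
toy_counts = classify_overlaps(toy_families, toy_kogs)
```

Working on connected components rather than on single families keeps a chain of partial overlaps together, which is what the messy category is meant to capture.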
• identities: 1901
• split: 159 (4 GLS, 42 GLR, 113 GLC)
• merge: 117 (6 GLS, 70 GLR, 79 GLC)
• messy: 25 (2 GLS, 13 GLR, 23 GLC)

Comparison of GLR.3294 with PIRSF (UniProt) 017196 and 017667
Comparison of GLR.3292 with PIRSF 017297 and 016767

Comparative maps
Despite similarity in size, gene content, ecological niche, yeast genomes are highly rearranged.
In general, synteny is poorly conserved. In part:
• evolutionary distance
• artifact of WGD
But, if we focus on the Saccharomycetaceae that did not undergo a recent whole genome duplication
Protoploid genomes
• homogeneous
• low redundancy
• less reshuffling
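A simple way to look for conserved synteny between two such genomes is to keep the homolog pairs that have another homolog pair nearby on both chromosomes. The Python sketch below assumes gene order is available as lists per chromosome and that homology is given as a set of gene pairs; the window size, function name and gene identifiers are illustrative assumptions, not the method used in the talk.

```python
def syntenic_pairs(order_a, order_b, homologs, window=2):
    """order_a, order_b: dicts chromosome -> list of gene ids in order.
    homologs: set of (gene_a, gene_b) pairs, all present in the orders.
    A pair is kept when some other homolog pair lies within `window`
    genes of it on the same chromosome in both genomes."""
    pos_a = {g: (c, i) for c, genes in order_a.items() for i, g in enumerate(genes)}
    pos_b = {g: (c, i) for c, genes in order_b.items() for i, g in enumerate(genes)}

    def near(p, q):
        # Same chromosome, different gene, at most `window` positions apart.
        return p[0] == q[0] and 0 < abs(p[1] - q[1]) <= window

    kept = set()
    for gene_a, gene_b in homologs:
        for other_a, other_b in homologs:
            if near(pos_a[gene_a], pos_a[other_a]) and near(pos_b[gene_b], pos_b[other_b]):
                kept.add((gene_a, gene_b))
                break
    return kept

# Toy data: hypothetical chromosomes and genes.
genome_a = {"A1": ["a1", "a2", "a3", "a4"]}
genome_b = {"B1": ["b1", "b2", "b3"], "B2": ["b4"]}
toy_homologs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a1", "b4")}
toy_syntenic = syntenic_pairs(genome_a, genome_b, toy_homologs)
```

The quadratic loop is fine for a toy example; a real comparison would index genes by position instead of scanning all pairs.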
Syntenic homologs are orthologs

So, in conclusion
Comparative genomics works if you pay attention to the data
• High-quality, complete genomes
• Chosen from interesting phylogenetic groups
Building tools and analyses works if you have a plan
• Genome annotation
• Protein families and subgroups
• Syntenic blocks and common markers
Many opportunities for further work
http://genolevures.org/ ≡ http://cbi.labri.fr/Genolevures/

Acknowledgments and support
Bordeaux
• Macha Nikolski CNRS
• Tiphaine Martin CNRS
• Pascal Durrens CNRS
• David Sherman INRIA
• Géraldine Jean
• Hayssam Soueidan
• Nicolás Loira
• Adrien Goëffon
• Julie Bourbeillon
• Rodrigo Assar
Génolevures
• Jean-Luc Souciet
• Bernard Dujon
• Claude Gaillardin
• Christian Marck
• Eric Westhof
• Cécile Neuvéglise
• Cécile Fairhead
• André Goffeau
• Philippe Baret
• Ed Louis
• Mark Johnston
Support
• CNRS GDR 2354 Génolevures
• CNRS UMR 5800 LaBRI
• INRIA team MAGNOME