
Genomic Exploration of the Hemiascomycetous Yeasts
3rd Workshop on Algorithms in Moscow
2008-10-08

David J. Sherman
U. Bordeaux, France
LaBRI CNRS & INRIA team “MAGNOME”

Comparative genomics

Which ones do we sequence?
And what do we do after that?

Is certainly about comparison
But is also about the genomes

A caricature

Data: The hard part / A solved problem
Push button / The hard part
Results: Biological, otherwise not interesting / Algorithmic, otherwise not interesting

Hemiascomycetous yeasts

Eukaryotic genomes
Small and compact
Experimental model
Biotechnological interest
• beer, wine, bread
• assimilate hydrocarbons, tannin extracts
• hormones and vaccines
Medical interest
Biodiversity
Systems

Understand mechanisms of molecular evolution
Genome redundancy
Ortho-/paralog divergence
Expansion and contraction of universal families
Tandem duplications
Block duplication and rearrangement
Conservation of synteny

Comparison of evolutionary range of Hemiascomycetes and Chordates

Figure: Hemiascomycetes and Chordates placed side by side on a common scale running from 50 to 100.
Hemiascomycetes: Yarrowia lipolytica, Debaryomyces hansenii, Kluyveromyces lactis, Candida glabrata, Saccharomyces uvarum, Saccharomyces paradoxus, Saccharomyces cerevisiae; the Saccharomyces sensu stricto group is labelled.
Chordates: Urochordates (Ciona intestinalis), Fishes (Takifugu rubripes, Tetraodon nigroviridis), Birds (Gallus gallus), Mammals (Mus musculus, Homo sapiens).
Scale: average % of amino-acid identity between complete set of orthologous proteins
Dujon (2006) Trends in Genetics 22: 375-387

Génolevures Sequencing Projects

Génolevures 1
• 13 species, partial 0.2-0.4X
• Souciet et al 2000 [21 papers] FEBS Letters 487
Génolevures 2
• 4 species complete 12X
• Dujon, Sherman et al 2004 Nature 430
• Sherman et al 2006 NAR 34
Génolevures 3
• 3 species complete 12X
• 2 species complete 7-12X
Génolevures 4
• 4 + 5 + 5 close species, NGS

Species                     Nb of chrom.   Genome size (Mb)
Saccharomyces cerevisiae    16             12.1
Candida glabrata            13             12.3
Kluyveromyces waltii        8              10.7
Kluyveromyces lactis        6              10.6
Ashbya gossypii             7              9.2
Debaryomyces hansenii       7              12.2
Candida albicans            8              14.9
Yarrowia lipolytica         6              20.5

Events annotated on the figure:
• whole genome duplication
• post-duplication gene loss (shown twice)
• expansion of sugar-utilisation genes
• loss of active Ty5
• loss of GAL genes (shown three times)
• loss of sex (shown twice)
• loss of all active type I retroposons
• triplicated mating-type cassettes
• HO endonuclease
• short centromeres
• loss of class II transposons
• degradation of HO
• loss of HO
• non universal genetic code
• high rate of loss
• expansion of gene families encoding lipases, extracellular proteases etc...
• expansion of gene families encoding lipases, extracellular proteases, allantoin and allantoate transporters etc...
Transposable elements labelled: Ty4, Ty1 / 2, Ty3, Tca2, Ty5 and non-LTR retroposons

Dujon (2006) Trends in Genetics 22: 375-387

Genomic data for complete genomes

Complete genomes sequenced by the Génoscope

What is complete?
• Sequence subtelomere to subtelomere
• Fully assembled chromosomes
• Careful manual annotation
What can you do with a complete sequence?
• Track chromosomal rearrangements
• Analyze species- or clade-specific gain or loss
• Measure expansion and contraction of protein families
• Look for long-range correlations

What’s next?
And what do we do after that?
Genome Annotation
• Magus annotation system
• Simultaneous annotation of putative homologs
Classification into protein families
• Consensus ensemble clustering
Comparative maps
• Discovering synteny
• Identifying orthologs

Let’s avoid teleology

Genomes are thrown together from bits and pieces of things that worked, once
Genome annotation has to reflect reality and not expectations
It is hard, painstaking work
It is not fully automatic
Good tools help

R. Greaves

The Annotation Process

Figure: flowchart of the annotation process, starting from Genomic DNA.
Legend: GÉNOLEVURES technique; GÉNOLEVURES result; External technology; External data source
Boxes: Algorithmic sequence analysis; Predictive methods (two boxes); RNA genes and other elements; Gene models; Transcript sequencing; Simultaneous gene annotation; Protein-coding genes; Classification; Integration; Homolog groups; Systematic comparison and consensus; Curation updates; Curated genes; Complementary analyses; Annotated genome; Protein families; Magus

The “big iron”

Production
Redundant, high disp.
• IBM, Dell servers
• 3 web
• 1 database
• Mini-cluster
74 cores, 4 Gbyte
• x86_64 Rocks + bio roll
• HMMER, NCBI BLAST, ClustalW, EMBOSS, Glimmer, Fasta, MrBayes, Phylip, T_Coffee, MPI-Blast
Storage
• 11 Tbyte RAID
Figure labels: In silico predictions; Alignments & search; Rule checker; Rules; Dinkum-thinkum; U.I. components; DB; Web; Web Service Bus; users; Compute; Genome Browser; GRO; MACS; GenCore 6; Genomes database; results database; KB; Fast browser database

Browsing a genome region

Viewing a Locus on a Genome

Validating a Gene Model

Annotating Homolog Groups

Protein families

Multi-species groups of related proteins
Phylogenetic relationship → functional similarity

Diversity of in silico results
Need to calibrate or train methods for different phylogenetic groups
New algorithm for consensus clustering that is efficient in practice

What’s the goal?

Figure: Complete genomes → Blast (E-val threshold, homeomorphy) and Smith-Waterman (homeomorphy) → Partition ∏1, Partition ∏2, Partition ∏3, Partition ∏4, … Partition ∏n → Protein families

Reconciling different in silico predictions

Figure: Proteomes → Blast & SW sequence alignments → Homeomorphic and Nonhomeomorphic Alignments

Agreement between partitions
• Confusion matrix
• Distance between partitions, that is, a shortest path in a graph of fusions/fissions (NP-complete)

Median partitions by consensus clustering
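Consensus clustering starts from a way to measure agreement between partitions. The following is a minimal sketch of the confusion matrix named above, assuming each partition is stored as a dict from protein id to cluster label. The function names, the variables `blast` and `sw`, and the toy data are illustrative assumptions, not part of the Génolevures pipeline.

```python
from collections import Counter


def confusion_matrix(p1, p2):
    """Count shared proteins for every pair (cluster of p1, cluster of p2).

    p1 and p2 map protein id -> cluster label (assumed representation).
    Only proteins present in both partitions are counted.
    """
    shared = p1.keys() & p2.keys()
    return Counter((p1[x], p2[x]) for x in shared)


def conflicting_clusters(matrix):
    """Clusters of the first partition that overlap several clusters of the second."""
    overlaps = Counter(a for (a, _b) in matrix)
    return sorted(a for a, n in overlaps.items() if n > 1)


# Toy data: two partitions of the same four proteins.
blast = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
sw = {"p1": "X", "p2": "X", "p3": "X", "p4": "Y"}

m = confusion_matrix(blast, sw)
print(dict(m))
print(conflicting_clusters(m))
```

A cluster that occupies a single cell of the matrix is found identically by both methods; a cluster spread over several cells is where the two partitions disagree.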

Figure: Proteomes → Blast & SW sequence alignments → Homeomorphic and Nonhomeomorphic Alignments → Partition ∏1, Partition ∏2, Partition ∏3, Partition ∏4, … Partition ∏n → consensus

Compute a median partition ∏ minimizing …

Construction and algorithm

Define a similarity measure based on the confusion matrix
FRel i,j: encodes the components ci
Rk: maximal conflict regions
Select ci in each Rk by MDC (min. disjoint cover)
NP-complete
Efficient heuristic

Relaxation: admit inexact cover (Not all proteins are in families)
Resolve conflicts by election + policy
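One way to picture "election + policy" is the sketch below. It is not the authors' implementation. It assumes that each candidate component ci in a conflict region comes with two scores, a similarity S (higher is better) and a disagreement D (lower is better), whose exact definitions are not given on the slide. Each protein votes for its preferred candidate, and winners are then accepted greedily as long as they stay disjoint, so the cover may be inexact. The function `elect` and the toy region are hypothetical.

```python
def elect(candidates):
    """Pick disjoint candidate components in one conflict region by voting.

    candidates: dict name -> (frozenset of proteins, S, D).
    S and D are assumed scores: higher S and lower D are preferred.
    Returns (list of chosen names, set of covered proteins).
    """
    votes = {name: 0 for name in candidates}
    proteins = set().union(*(members for members, _s, _d in candidates.values()))

    # Each protein votes for one candidate that contains it:
    # lowest D first, then highest S, then name to break ties.
    for p in sorted(proteins):
        containing = [n for n, (members, _s, _d) in candidates.items() if p in members]
        best = min(containing, key=lambda n: (candidates[n][2], -candidates[n][1], n))
        votes[best] += 1

    # Policy (assumed): accept winners by decreasing vote count,
    # skipping any candidate that overlaps one already accepted.
    chosen, covered = [], set()
    for name in sorted(candidates, key=lambda n: (-votes[n], n)):
        members = candidates[name][0]
        if votes[name] > 0 and covered.isdisjoint(members):
            chosen.append(name)
            covered |= members
    return chosen, covered


# Toy conflict region with three overlapping candidates.
region = {
    "c1": (frozenset("abc"), 0.9, 1),
    "c2": (frozenset("cd"), 0.8, 2),
    "c3": (frozenset("de"), 0.7, 1),
}
chosen, covered = elect(region)
print(chosen, sorted(covered))
```

In this toy region `c2` receives no votes, so `c1` and `c3` are kept and every protein is covered; in general some proteins may remain uncovered, which matches the relaxation to an inexact cover.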

For each component C
  for each ci ∈ C compute Si and Di
  each p votes for ci in order Di ↑ and Si ↓
  take the winning ci in order so as to cover the most proteins p

Figure labels: Conflict regions; Conflict graph; subgroups; family

Correlated gain and loss in networks and metabolic pathways

Construct a PSSM for each family

4384 families as follows
• 4240 where FN = 0
  FP: med 0.0, avg 3.7, max 302
  Ev: med 6e-78, max 9e-6
• 144 where FN > 0
  FP: med 4.5, avg 33, max 307

Figure labels: Construction (Family; GL2 fasta; PSI blast; PSSM); Validation (Proteomes; GL2; PSI blast; Comparison: TP, TN, FP, FN and worst E-val)

Build a PSSM for each family and use to improve gene prediction
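The validation comparison (TP, TN, FP, FN and worst E-val) can be sketched as follows. This is an illustration under stated assumptions, not the original validation code. It assumes the search results for one family's PSSM are already parsed into a dict from protein id to E-value, and that "worst E-val" means the largest E-value among true positives. The function name `validate_family` and the toy data are hypothetical.

```python
def validate_family(family, hits, proteome_size):
    """Compare the hits of one family's PSSM with the known family members.

    family: set of protein ids known to belong to the family.
    hits: dict protein id -> E-value reported by the search (assumed pre-parsed).
    proteome_size: total number of proteins searched.
    """
    found = set(hits)
    true_pos = family & found
    tp = len(true_pos)
    fp = len(found - family)   # hits outside the family
    fn = len(family - found)   # family members that were missed
    tn = proteome_size - tp - fp - fn
    # Assumed meaning of "worst E-val": largest E-value among true positives.
    worst = max((hits[p] for p in true_pos), default=None)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "worst_evalue": worst}


# Toy data: a three-member family searched against ten proteins.
family = {"a", "b", "c"}
hits = {"a": 1e-80, "b": 3e-20, "x": 1e-4}
report = validate_family(family, hits, proteome_size=10)
print(report)
```

Families with FN = 0 are those whose PSSM recovers every known member; the FP counts then indicate how selective the PSSM is.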

Figure labels: ORF translations; Family; GL2 fasta; PSI blast; PSSM*; PSI blast; Candidates; filtering (per-family size and E-value criteria); Loci assigned to families
*PSSM: position-specific scoring matrix for PSIBLAST

Comparison with KOGs

Project families on S. cerevisiae
Select intersection and compare
3625 proteins (~2500 families)

identities: 1901
split: 159 (4 GLS, 42 GLR, 113 GLC)
merge: 117 (6 GLS, 70 GLR, 79 GLC)
messy: 25 (2 GLS, 13 GLR, 23 GLC)

Comparison with KOGs

Comparison of GLR.3294 with PIRSF (UniProt) 017196 and 017667

Comparison of GLR.3292 with PIRSF 017297 and 016767

Comparative maps

Despite similarity in size, gene content, ecological niche, yeast genomes are highly rearranged.
In general, synteny is poorly conserved. In part:
• evolutionary distance
• artifact of WGD

Comparative maps

But, if we focus on the Saccharomycetaceae that did not undergo a recent whole genome duplication
Protoploid genomes
• homogeneous
• low redundancy
• less reshuffling
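In genomes with low redundancy and less reshuffling, conserved gene neighbourhoods can support ortholog identification. The sketch below illustrates one simple notion of a syntenic homolog: a homolog pair is kept when at least one other homolog pair lies nearby in both genomes. This is a simplification for illustration, not the method used for the Génolevures comparative maps. It assumes a single chromosome per genome and a precomputed list of homolog pairs; the function name, the `window` parameter, and the toy gene orders are hypothetical.

```python
def syntenic_pairs(order_a, order_b, homologs, window=2):
    """Return homolog pairs supported by a neighbouring homolog pair.

    order_a, order_b: gene ids in chromosomal order (one chromosome each).
    homologs: list of (gene in A, gene in B) pairs, assumed precomputed.
    window: maximum distance, in genes, for two pairs to count as neighbours.
    """
    pos_a = {g: i for i, g in enumerate(order_a)}
    pos_b = {g: i for i, g in enumerate(order_b)}
    pairs = [(a, b) for a, b in homologs if a in pos_a and b in pos_b]

    result = []
    for a, b in pairs:
        supported = any(
            (a2, b2) != (a, b)
            and abs(pos_a[a2] - pos_a[a]) <= window
            and abs(pos_b[b2] - pos_b[b]) <= window
            for a2, b2 in pairs
        )
        if supported:
            result.append((a, b))
    return result


# Toy data: two adjacent pairs are conserved, two pairs are isolated.
order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b1", "b2", "b3", "b4", "b5", "b6"]
homologs = [("a1", "b1"), ("a2", "b2"), ("a3", "b6"), ("a6", "b3")]

kept = syntenic_pairs(order_a, order_b, homologs, window=1)
print(kept)
```

The isolated pairs are dropped because no neighbouring pair confirms them; a real analysis would also handle multiple chromosomes, gene orientation, and tandem duplicates.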

Syntenic homologs are orthologs

So, in conclusion

Comparative genomics works if you pay attention to the data
• High-quality, complete genomes
• Chosen from interesting phylogenetic groups
Building tools and analyses works if you have a plan
• Genome annotation
• Protein families and subgroups
• Syntenic blocks and common markers

Many opportunities for further work
http://genolevures.org/ ≡ http://cbi.labri.fr/Genolevures/

Acknowledgments and support

Bordeaux
• Macha Nikolski CNRS
• Tiphaine Martin CNRS
• Pascal Durrens CNRS
• David Sherman INRIA
• Géraldine Jean
• Hayssam Soueidan
• Nicolás Loira
• Adrien Goëffon
• Julie Bourbeillon
• Rodrigo Assar

Génolevures
• Jean-Luc Souciet
• Claude Gaillardin
• Christian Marck
• Eric Westhof
• Cécile Neuvéglise
• Cécile Fairhead
• André Goffeau
• Philippe Baret
• Ed Louis
• Mark Johnston

CNRS GDR 2354 Génolevures
CNRS UMR 5800 LaBRI
INRIA team MAGNOME