mGene: A Novel Discriminative Gene Finding System
Gabriele Schweikert
Max Planck Institutes T¨ubingen, Germany
Worm Genomics and Systems Biology Workshop, Cambridge, July 24, 2008
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 1 / 21 Introduction Friedrich Miescher Laboratory 574 Review TRENDS in Genetics Vol.21 No.10 October 2005 of the Max Planck Society
Key: Haemonchus contortus ! Ostertagia ostertagi Human parasite Teladorsagia circumcincta Domestic animal parasite Strongyloidea Necator americanus Nippostrongylus brasiliensis Model animal parasite Rhabditina (clade V) Ancylostoma caninum Ancylostoma ceylanicum Plant parasite Caenorhabditis briggsae ! Free living Caenorhabditis elegans ! Rhabditoidea Caenorhabditis remanei ! Caenorhabditis japonica ! Caenorhabditis sp. PB2081 ! Diplogasteromorpha Pristionchus pacificus ! 1998: C. elegans genome Strongyloides ratti Strongyloides stercoralis Panagrolaimomorpha completed Parastrongyloides trichosuri Globodera pallida Globodera rostochiensis Heterodera glycines Today: several more Rhabditida Heterodera schachtii Tylenchina (clade IV) Meloidogyne arenaria Meloidogyne chitwoodii genome projects finished / Meloidogyne hapla ! Chromadorea Tylenchomorpha Meloidogyne incognita underway Meloidogyne javanica Meloidogyne paranaensis Pratylenchus penetrans Pratylenchus vulnus Genomes have to be Rhadopholus similis Cephalobomorpha Zeldia punctata annotated! Ascaris suum Ascaridomorpha Ascaris lumbricoides Toxocara canis Spirurina (clade III) Brugia malayi ! Wuchereria bancrofti Spiruromorpha Onchocerca volvulus Dirofilaria immitis ~650–750 Mya Litomosoides sigmodontis
(other Chromadorea) Trichinella spiralis ! Mitreva, 2005 Dorylaimia (clade I) Trichuris muris Trichinellida Trichuris vulpis Longidoridae Xiphinema index
Gabriele Schweikert (MPI, T¨ubingen)Enoplia (clade II) mGene: A Novel Discriminative Gene Finding System July 24, 2008 2 / 21 SSU rRNA phylogeny Trophic mode Taxa studied
TRENDS in Genetics
Figure 1. Genome information across the phylum Nematoda. All species with either significant numbers of ESTs in public databases (O100) or genome projects are arranged phylogenetically based on small subunit (18S ribosomal RNA) (SSU) rRNA phylogeny [4]. Species with genome projects completed or underway are indicated by asterisks. Adapted with permission from Ref. [4].
genes) within nematodes has been sampled. Analysis of was already sampled [14]. However, more recent eco- bacterial genespace had shown that continued addition of system sampling of marine microbes has revealed the vast complete genomes yielded diminishing returns of novelty, genetic complexity present in such environments. Sequenc- suggesting that a large percentage of bacterial genespace ing of Sargasso Sea microbes yielded 148 previously
www.sciencedirect.com Friedrich Miescher Laboratory Genome Annotation of the Max Planck Society
Use of cDNA and EST sequences Creation of cDNA libraries; selection of random clones ⇒ High-copy-number mRNAs overrepresented ⇒ Low-copy-number mRNAs missed entirely ⇒ Mostly ESTs (single sequencing reads 500-700 nucleotides) Alignment of EST cDNA sequences against genome ⇒ Cis-alignment; mostly corect ⇒ Trans-alignment of homologous genes Typical cDNA sequencing project 20-40% of transcripts sequenced incorrectly or not at all ⇒ Use cDNA and EST alignments as labeled training set ⇒ Predict mRNA & gene products
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 3 / 21 Friedrich Miescher Laboratory Genome Annotation of the Max Planck Society
Use of cDNA and EST sequences Creation of cDNA libraries; selection of random clones ⇒ High-copy-number mRNAs overrepresented ⇒ Low-copy-number mRNAs missed entirely ⇒ Mostly ESTs (single sequencing reads 500-700 nucleotides) Alignment of EST cDNA sequences against genome ⇒ Cis-alignment; mostly corect ⇒ Trans-alignment of homologous genes Typical cDNA sequencing project 20-40% of transcripts sequenced incorrectly or not at all ⇒ Use cDNA and EST alignments as labeled training set ⇒ Predict mRNA & gene products
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 3 / 21 Friedrich Miescher Laboratory Genome Annotation of the Max Planck Society
Use of cDNA and EST sequences Creation of cDNA libraries; selection of random clones ⇒ High-copy-number mRNAs overrepresented ⇒ Low-copy-number mRNAs missed entirely ⇒ Mostly ESTs (single sequencing reads 500-700 nucleotides) Alignment of EST cDNA sequences against genome ⇒ Cis-alignment; mostly corect ⇒ Trans-alignment of homologous genes Typical cDNA sequencing project 20-40% of transcripts sequenced incorrectly or not at all ⇒ Use cDNA and EST alignments as labeled training set ⇒ Predict mRNA & gene products
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 3 / 21 Friedrich Miescher Laboratory Genome Annotation of the Max Planck Society
Use of cDNA and EST sequences Creation of cDNA libraries; selection of random clones ⇒ High-copy-number mRNAs overrepresented ⇒ Low-copy-number mRNAs missed entirely ⇒ Mostly ESTs (single sequencing reads 500-700 nucleotides) Alignment of EST cDNA sequences against genome ⇒ Cis-alignment; mostly corect ⇒ Trans-alignment of homologous genes Typical cDNA sequencing project 20-40% of transcripts sequenced incorrectly or not at all ⇒ Use cDNA and EST alignments as labeled training set ⇒ Predict mRNA & gene products
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 3 / 21 Why Yet Another Gene Finder? Friedrich Miescher Laboratory of the Max Planck Society
GENSCAN, Burge 1997 Twinscan, Korf 2001 Augustus, Stanke 2003 Contrast, Gross 2007 ...
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 4 / 21 Why Yet Another Gene Finder? Friedrich Miescher Laboratory ! of the Max Planck Society
GENSCAN, Burge 1997 Twinscan, Korf 2001 Augustus, Stanke 2003 Contrast, Gross 2007 REVIEWS ...
70
Fly Human CONTRAST 60
50 AUGUSTUS N-SCAN ect
orr 40 n ORFs c
w 30 o
% kn TWINSCAN
20 GENSCAN
10
0 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 Brent, 2008 Fly Fly2 GabrieleFlies Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 4 / 21 Drosophila melanogaster Drosophila pseudoobscura 10 more flies
Rat Chimpanzee Opossum Human Mouse Dog Mammals Homo sapiens Mus musculus Canis familiaris Rattus Pan Monodelphis norvegicus trogolytes domestica Fish Plant2 Worm2 Worm Plant Chicken Other Caenorhabditis Arabidopsis Gallus elegans thaliana gallus Fugu Oryza Caenorhabditis More worms rubripes sativa briggsae Figure 2 | The steadily increasing accuracy of de novo gene prediction algorithms. The graph shows the rise in the Nature Reviews | Genetics accuracy of de novo gene prediction programs since 1997 (when GENSCAN was introduced) and the dates on which genome sequences were first published. The measure of accuracy is ORF sensitivity — the fraction of known ORFs 24 62 Training data that are predicted exactly right, that is, yielding the correct protein. GENSCAN and AUGUSTUS use only the target 29 39 4 In de novo gene prediction, genome, TWINSCAN uses one informant and N-SCAN and CONSTRAST can use multiple informants. The graph it is a set of known gene reflects historical trends but is not a precise benchmarking of these programs on identical test data. structures with the corresponding genomic sequence (and alignments to informant genomes, if split encode amino acids28. Furthermore, although and GENE-ID32, respectively). The biggest difference available). Training data are the average similarity of functional orthologous between them was that TWINSCAN included models used in specializing the sequences is much higher than that of non-functional of conservation in splice sites and start and stop codons, probability model to fit the characteristics of a particular orthologous sequences, the two distributions overlap whereas SGP2 considered only the conservation in 28 genome. considerably . protein-coding regions. After training on known human By the year 2000, several groups had developed meth- genes with mouse alignments, the predictions of both Parse ods for combining information from mouse–human programs were still influenced more by the patterns in A segmentation of a string of letters together with a labelling alignments with models of the DNA sequences that the human DNA sequence than by the mouse alignments. of the segments. characterize splice donors and acceptors, start and stop For TWINSCAN, the primary effect of mouse–human codons and other biological features. The first programs alignments was to eliminate many of the false-positive Bayes’ rule to outperform GENSCAN by using mouse–human genes and exons predicted by GENSCAN: TWINSCAN A mathematical identity comparison were TWINSCAN29,30 and SGP2 (REF. 31). predicted 25,600 genes (versus approximately 45,000) (Pr(x|y)=Pr(y|x) Pr(x)/Pr(y)) that allows one to swap variables Their success resulted, in part, from using genome and 198,000 exons (versus approximately 315,000). For in a conditional probability alignments to modify the scoring schemes of success- comparison, current best estimates place the number of expression. ful single-genome de novo gene predictors (GENSCAN human protein-coding genes at 20,000–21,000 (REF. 33).
NATURE REVIEWS | GENETICS VOLUME 9 | JANUARY 2008 | 67 ©!2008!Nature Publishing Group! Friedrich Miescher Laboratory What’s the Difference? of the Max Planck Society Traditionally: Generative models (HMM) Learn complete generative model
DNA CGTATAAGCTTATAACCGATTAAGTATGTAGTCTGTTAAGTGTAGCATAGTAGAGAAGTAATAAACGTCAACC
Intergenic 5' UTR Exon Intron Exon 3' UTR Intergenic
New approach: Discriminative setting Vapnik 1998: Never solve a more general problem as an intermediate step Solve the classification problem directly. much easier!
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 5 / 21 Friedrich Miescher Laboratory What’s the Difference? of the Max Planck Society Traditionally: Generative models (HMM) Learn complete generative model
DNA CGTATAAGCTTATAACCGATTAAGTATGTAGTCTGTTAAGTGTAGCATAGTAGAGAAGTAATAAACGTCAACC
Intergenic 5' UTR Exon Intron Exon 3' UTR Intergenic
New approach: Discriminative setting Vapnik 1998: Never solve a more general problem as an intermediate step Solve the classification problem directly. much easier!
DNA CGTATAAGCTTATAACCGATTAAGTATGTAGTCTGTTAAGTGTAGCATAGTAGAGAAGTAATAAACGTCAACC
TRUE
WRONG
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 5 / 21 Our Approach: 2 Discriminative Steps Friedrich Miescher Laboratory of the Max Planck Society
Step 1: State-of-the-art SVM Classifiers Predict signals on DNA: Transcription start and cleavage, polyA, trans-splice sites Translation initiation sites and stop codons Donor and acceptor splice sites Recognize segment types: Exons Introns Intergenic
Step 2: Hidden Markov SVMs Combine Predictions to a valid gene structure
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 6 / 21 Our Approach: 2 Discriminative Steps Friedrich Miescher Laboratory of the Max Planck Society
Step 1: State-of-the-art SVM Classifiers Predict signals on DNA: Transcription start and cleavage, polyA, trans-splice sites Translation initiation sites and stop codons Donor and acceptor splice sites Recognize segment types: Exons Introns Intergenic
Step 2: Hidden Markov SVMs Combine Predictions to a valid gene structure
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 6 / 21 Step 1: Splice Site Recognition Friedrich Miescher Laboratory of the Max Planck Society
Worm Human Acc Don Acc Don Markov Chain auPRC(%) 92.1 90.0 16.2 26.0 SVM auPRC(%) 95.9 95.3 54.4 56.9
[Sonnenburg, Schweikert, Philips, Behr, R¨atsch,2007] Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 7 / 21 Step 1: Predictions in UCSC Browser Friedrich Miescher Laboratory of the Max Planck Society
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 8 / 21 Step 1: Predictions in UCSC Browser Friedrich Miescher Laboratory of the Max Planck Society
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 8 / 21 Step 2: Signal Integration Friedrich Miescher Laboratory of the Max Planck Society
position
True gene model
Wrong gene model
STEP 1: SVM Signal Predictions
tss
tis
acc
don
stop
transform SVM outputs with PLif sig fθ (xsig(pi))
STEP 2: Integration of Signals large margin: + + − − Ψθ(pi , qi ) − Ψθ(pi , qi )
score Ψθ(p, q)
Gabriele SchweikertPrediction (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 9 / 21 positions p 1052 1160 1165 1183 1280 1532 1722 1798 1989 2010 2210 2243 states q tss tis don acc don acc don acc don acc stop cleave Results: nGASP Competition Friedrich Miescher Laboratory of the Max Planck Society
Find the most accurate gene finder for annotation of new nematode genomes: Controlled competition conditions: 10% of the genome for training methods 10% of the genome for evaluation 4 Categories: Cat 1:Ab initio gene finders Cat 2: Dual/Multi-genome gene finders Cat 3: Gene finders that use EST/cDNA alignments Cat 4: Combining algorithms Evaluation on WS160 genes
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 10 / 21 Results: nGASP Competition Friedrich Miescher Laboratory of the Max Planck Society
Find the most accurate gene finder for annotation of new nematode genomes: Controlled competition conditions: 10% of the genome for training methods 10% of the genome for evaluation 4 Categories: Cat 1:Ab initio gene finders Cat 2: Dual/Multi-genome gene finders Cat 3: Gene finders that use EST/cDNA alignments Cat 4: Combining algorithms Evaluation on WS160 genes
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 10 / 21 Results: nGASP Competition Friedrich Miescher Laboratory of the Max Planck Society
Find the most accurate gene finder for annotation of new nematode genomes: Controlled competition conditions: 10% of the genome for training methods 10% of the genome for evaluation 4 Categories: Cat 1:Ab initio gene finders Cat 2: Dual/Multi-genome gene finders Cat 3: Gene finders that use EST/cDNA alignments Cat 4: Combining algorithms Evaluation on WS160 genes
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 10 / 21 Results: nGASP Competition Friedrich Miescher Laboratory of the Max Planck Society
Find the most accurate gene finder for annotation of new nematode genomes: Controlled competition conditions: 10% of the genome for training methods 10% of the genome for evaluation 4 Categories: Cat 1:Ab initio gene finders Cat 2: Dual/Multi-genome gene finders Cat 3: Gene finders that use EST/cDNA alignments Cat 4: Combining algorithms Evaluation on WS160 genes
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 10 / 21 Friedrich Miescher Laboratory nGASP Evaluation of the Max Planck Society
How to evaluated predictions?
True gene model
Prediction
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 11 / 21 Friedrich Miescher Laboratory nGASP Evaluation of the Max Planck Society
How to evaluated predictions?
True gene model
Prediction
Nucleotide evaluation:
True gene model
Prediction
Sensitivity = True Predicted/True Specificity = True Predicted/Predicted
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 11 / 21 Friedrich Miescher Laboratory nGASP Evaluation of the Max Planck Society
How to evaluated predictions?
True gene model
Prediction
Exon evaluation:
True gene model
Prediction
Sensitivity = True Predicted/True Specificity = True Predicted/Predicted
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 11 / 21 nGASP Cat. 1 Evaluation (prelim.) Friedrich Miescher Laboratory of the Max Planck Society
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 12 / 21 Friedrich Miescher Laboratory Results: mGene on Wormbase of the Max Planck Society
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 13 / 21 Friedrich Miescher Laboratory Results of the Max Planck Society
Very good on nucleotide and exon level Substantial improvements on transcript level Re-annotation of C. elegans in official wormbase annotation total: 19532 genes predicted 635 new genes 350 unconfirmed genes are not predicted Wetlab confirmations (preliminary results) Confirmed 55 of 103 newly predicted genes Confirmed only 4 of 50 annotated genes that were not predicted
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 14 / 21 Summary and Future Work Friedrich Miescher Laboratory of the Max Planck Society
High accuracy on constitutively spliced genes for nematodes Wet-lab experiments confirm good performance of mGene Re-annotation of other nematode genomes genomes Preliminary results on new information transfer methods improve performance on related organisms Integrate new experimental data New sequencing technologies Tiling arrays Computational challenge: predict alternative splicing
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 20 / 21 Acknowledgments Friedrich Miescher Laboratory of the Max Planck Society
Gunnar R¨atsch, Bernhard Sch¨olkopf, Detlef Weigel Georg Zeller, Alexander Zien, Cheng Soon Ong, S¨oren Sonnenburg, Petra Philips, Jonas Behr, Christian Widmer
Gabriele Schweikert (MPI, T¨ubingen) mGene: A Novel Discriminative Gene Finding System July 24, 2008 21 / 21