mGene: A Novel Discriminative Gene Finding System

Gabriele Schweikert

Max Planck Institutes, Tübingen, Germany

Worm Genomics and Systems Biology Workshop, Cambridge, July 24, 2008

Gabriele Schweikert (MPI, Tübingen), Friedrich Miescher Laboratory of the Max Planck Society

Introduction

1998: C. elegans genome completed
Today: several more genome projects finished / underway
Genomes have to be annotated!

[Figure: SSU rRNA phylogeny of the phylum Nematoda (clades I to V), listing trophic mode and taxa studied for each species. From Mitreva, 2005, TRENDS in Genetics Vol. 21 No. 10.]

Figure 1. Genome information across the phylum Nematoda. All species with either significant numbers of ESTs in public databases (>100) or genome projects are arranged phylogenetically based on small subunit (SSU; 18S ribosomal RNA) phylogeny [4]. Species with genome projects completed or underway are indicated by asterisks. Adapted with permission from Ref. [4].


Genome Annotation

Use of cDNA and EST sequences

Creation of cDNA libraries; selection of random clones
  ⇒ High-copy-number mRNAs overrepresented
  ⇒ Low-copy-number mRNAs missed entirely
  ⇒ Mostly ESTs (single sequencing reads, 500-700 nucleotides)
Alignment of EST and cDNA sequences against the genome
  ⇒ Cis-alignment; mostly correct
  ⇒ Trans-alignment of homologous genes
Typical cDNA sequencing project: 20-40% of transcripts sequenced incorrectly or not at all
  ⇒ Use cDNA and EST alignments as labeled training set
  ⇒ Predict mRNA & gene products
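One way such alignments could be turned into a labeled training set is sketched below. This is an illustration under assumptions, not mGene's actual pipeline: the function name `donor_examples`, the toy sequence, and the exon-block representation (0-based, half-open intervals on the forward strand) are all hypothetical. Introns implied by adjacent aligned exon blocks give positive donor sites; every other GT dinucleotide inside the aligned span is taken as a negative example.

```python
def donor_examples(genome, exon_blocks):
    """Label candidate donor sites using one spliced cDNA/EST alignment.

    exon_blocks: sorted (start, end) pairs, 0-based, half-open, forward strand.
    Returns (positives, negatives): positions of the 'G' of each GT candidate.
    """
    # An intron starts where an aligned exon block ends, so every block end
    # except the last one is a donor site confirmed by the alignment.
    confirmed = {end for (_, end) in exon_blocks[:-1]}
    span_start, span_end = exon_blocks[0][0], exon_blocks[-1][1]
    positives, negatives = [], []
    for pos in range(span_start, span_end - 1):
        if genome[pos:pos + 2] != "GT":
            continue  # only consensus GT dinucleotides are candidates
        (positives if pos in confirmed else negatives).append(pos)
    return positives, negatives


# Toy locus (hypothetical): exon, intron, exon, intron, exon.
genome = "ATGGCC" + "GTAAGTTCAG" + "GTCTG" + "GTAACAG" + "GCATTGA"
exon_blocks = [(0, 6), (16, 21), (28, 35)]
positives, negatives = donor_examples(genome, exon_blocks)
print("confirmed donors:", positives)
print("decoy GT sites:", negatives)
```

Acceptor sites could be labeled the same way from block starts and AG dinucleotides. Restricting negatives to the aligned span is a design choice in this sketch: outside it, an unlabeled GT may still belong to a gene that the cDNA library simply missed.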


Why Yet Another Gene Finder?

GENSCAN, Burge 1997
Twinscan, Korf 2001
Augustus, Stanke 2003
Contrast, Gross 2007
...


[Figure: accuracy of de novo gene finders (GENSCAN, TWINSCAN, N-SCAN, AUGUSTUS, CONTRAST) on fly and human, 1997 to 2007; y-axis: % known ORFs correct; a timeline below marks when genome sequences were first published. From Brent, 2008, Nature Reviews Genetics.]

Figure 2 | The steadily increasing accuracy of de novo gene prediction algorithms. The graph shows the rise in the accuracy of de novo gene prediction programs since 1997 (when GENSCAN was introduced) and the dates on which genome sequences were first published. The measure of accuracy is ORF sensitivity: the fraction of known ORFs that are predicted exactly right, that is, yielding the correct protein. GENSCAN and AUGUSTUS use only the target genome, TWINSCAN uses one informant, and N-SCAN and CONTRAST can use multiple informants. The graph reflects historical trends but is not a precise benchmarking of these programs on identical test data.

What’s the Difference?

Traditionally: Generative models (HMM)
  Learn complete generative model

DNA CGTATAAGCTTATAACCGATTAAGTATGTAGTCTGTTAAGTGTAGCATAGTAGAGAAGTAATAAACGTCAACC

Intergenic 5' UTR Exon Intron Exon 3' UTR Intergenic

New approach: Discriminative setting
  Vapnik 1998: Never solve a more general problem as an intermediate step
  Solve the classification problem directly: much easier!


[Figure: two candidate gene structures for the DNA sequence above, one labeled TRUE and one labeled WRONG]

Our Approach: 2 Discriminative Steps

Step 1: State-of-the-art SVM Classifiers
  Predict signals on DNA:
    Transcription start and cleavage, polyA, trans-splice sites
    Translation initiation sites and stop codons
    Donor and acceptor splice sites
  Recognize segment types:
    Exons
    Introns
    Intergenic

Step 2: Hidden Markov SVMs
  Combine predictions into a valid gene structure
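The slides do not say which kernel the Step 1 SVMs use. As an assumption, the sketch below shows a weighted degree kernel, a string kernel commonly used for signals such as splice sites: it counts k-mers that match at the same position in two equal-length sequence windows. The function name `wd_kernel` and the default degree are illustrative only.

```python
def wd_kernel(x, y, max_degree=3):
    """Weighted degree kernel between two equal-length DNA windows.

    Counts substrings of length 1..max_degree that match at the same
    position in both windows, weighted by degree.
    """
    if len(x) != len(y):
        raise ValueError("windows must have equal length")
    D = max_degree
    total = 0.0
    for d in range(1, D + 1):
        # Standard weighting: the per-match weight decreases with degree d.
        beta = 2.0 * (D - d + 1) / (D * (D + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += beta * matches
    return total


# Identical windows score highest; a single mismatch lowers the similarity.
print(wd_kernel("ACGT", "ACGT", max_degree=2))
print(wd_kernel("ACGT", "ACGA", max_degree=2))
```

A kernel of this kind would be plugged into a standard SVM trained on windows centered on candidate signal positions. Because matches are position-specific, it suits signals with a fixed location relative to the window center.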


Step 1: Splice Site Recognition

                          Worm           Human
                        Acc    Don     Acc    Don
Markov Chain auPRC(%)   92.1   90.0    16.2   26.0
SVM auPRC(%)            95.9   95.3    54.4   56.9
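For readers unfamiliar with the metric, auPRC is the area under the precision-recall curve. It is informative when true sites are rare among candidate sites. The sketch below uses the step-wise (average precision) estimate, which is one of several conventions; the slides do not state which one was used for the table, and the function name and toy scores are hypothetical.

```python
def au_prc(scores, labels):
    """Area under the precision-recall curve, step-wise estimate.

    scores: classifier outputs, higher means more likely a true site.
    labels: 1 for true sites, 0 for decoys (same order as scores).
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Each true site raises recall by 1/n_pos at the current precision.
            area += (true_positives / rank) / n_pos
    return area


# Two true sites ranked 1st and 3rd among four candidates.
print(au_prc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```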

[Sonnenburg, Schweikert, Philips, Behr, Rätsch, 2007]

Step 1: Predictions in UCSC Browser


Step 2: Signal Integration

[Figure: along the sequence position axis, a true gene model and a wrong gene model, with the Step 1 SVM signal prediction tracks (tss, tis, acc, don, stop) below]

STEP 1: SVM Signal Predictions
  transform SVM outputs with PLiFs (piecewise linear functions): $f^{\mathrm{sig}}_\theta(x_{\mathrm{sig}}(p_i))$

STEP 2: Integration of Signals
  large margin: $\Psi_\theta(p_i^+, q_i^+) - \Psi_\theta(p_i^-, q_i^-)$
  score: $\Psi_\theta(p, q)$
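The scoring idea can be sketched as follows. This is a simplified illustration, not mGene's implementation: it sums only PLiF-transformed signal scores along a labeling, whereas the full model would also have to enforce a valid state order and could include segment-type and length terms. The knot positions, PLiF values, SVM outputs and both paths are made-up toy numbers.

```python
from bisect import bisect_right


def plif(x, knots, values):
    """Piecewise linear function through the support points (knots[i], values[i])."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    j = bisect_right(knots, x)  # knots[j - 1] <= x < knots[j]
    t = (x - knots[j - 1]) / (knots[j] - knots[j - 1])
    return values[j - 1] + t * (values[j] - values[j - 1])


def path_score(path, svm_out, theta):
    """Score of a labeling: sum of PLiF-transformed SVM outputs.

    path: list of (position, signal) pairs, i.e. positions p with states q.
    svm_out[signal][position]: raw SVM output of the Step 1 classifier.
    theta[signal]: (knots, values) parameters of that signal's PLiF.
    """
    return sum(plif(svm_out[sig][pos], *theta[sig]) for pos, sig in path)


knots = [-1.0, 0.0, 1.0]
theta = {"don": (knots, [-2.0, 0.0, 1.0]), "acc": (knots, [-2.0, 0.0, 1.0])}
svm_out = {"don": {10: 0.5, 14: -0.5}, "acc": {30: 1.5, 33: -1.0}}

true_path = [(10, "don"), (30, "acc")]
wrong_path = [(14, "don"), (33, "acc")]

# Large-margin training would adjust the PLiF values in theta so that this
# difference is large for every wrong labeling of every training example.
margin = path_score(true_path, svm_out, theta) - path_score(wrong_path, svm_out, theta)
print(margin)
```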

Prediction
positions p   1052  1160  1165  1183  1280  1532  1722  1798  1989  2010  2210  2243
states q      tss   tis   don   acc   don   acc   don   acc   don   acc   stop  cleave

Results: nGASP Competition

Find the most accurate gene finder for annotation of new nematode genomes:
Controlled competition conditions:
  10% of the genome for training methods
  10% of the genome for evaluation
4 Categories:
  Cat 1: Ab initio gene finders
  Cat 2: Dual/Multi-genome gene finders
  Cat 3: Gene finders that use EST/cDNA alignments
  Cat 4: Combining algorithms
Evaluation on WS160 genes


nGASP Evaluation

How to evaluate predictions?

[Figure: a true gene model shown above a predicted gene model]


Nucleotide evaluation:

[Figure: true gene model and prediction compared nucleotide by nucleotide]

Sensitivity = True Predicted / True
Specificity = True Predicted / Predicted
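A minimal sketch of the nucleotide-level measures, assuming exons are given as 0-based, half-open (start, end) intervals on one strand. The function names and toy coordinates are hypothetical. Note that "specificity" as defined on the slide is what the machine learning literature usually calls precision.

```python
def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    true_nt = covered(true_exons)
    pred_nt = covered(predicted_exons)
    both = len(true_nt & pred_nt)  # nucleotides that are true and predicted
    return both / len(true_nt), both / len(pred_nt)


true_exons = [(100, 200), (300, 400)]
predicted_exons = [(120, 200), (300, 450)]
sensitivity, specificity = nucleotide_accuracy(true_exons, predicted_exons)
print(sensitivity, specificity)
```

In this toy case 180 of the 200 true exonic nucleotides are predicted (sensitivity 0.9), and 180 of the 230 predicted nucleotides are true (specificity of about 0.78).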


Exon evaluation:

[Figure: true gene model and prediction compared exon by exon]

Sensitivity = True Predicted / True
Specificity = True Predicted / Predicted
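At the exon level the same two ratios are computed over whole exons. The sketch below assumes the common convention that a predicted exon counts as correct only if both of its boundaries match a true exon exactly; the slide does not spell this out, and the function name and coordinates are hypothetical.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) pairs; only exact boundary matches count.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    exact = len(true_set & pred_set)  # exons with both boundaries correct
    return exact / len(true_set), exact / len(pred_set)


true_exons = [(100, 200), (300, 400)]
predicted_exons = [(120, 200), (300, 400), (500, 560)]
exon_sn, exon_sp = exon_accuracy(true_exons, predicted_exons)
print(exon_sn, exon_sp)
```

The first predicted exon overlaps a true exon by 80 nucleotides but has a wrong start, so it earns no credit here. This is why exon-level figures are typically lower than nucleotide-level ones for the same prediction.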

nGASP Cat. 1 Evaluation (prelim.)

Results: mGene on WormBase

Results

Very good on nucleotide and exon level
Substantial improvements on transcript level
Re-annotation of C. elegans in official WormBase annotation
  total: 19532 genes predicted
  635 new genes
  350 unconfirmed genes are not predicted
Wet-lab confirmations (preliminary results)
  Confirmed 55 of 103 newly predicted genes (53%)
  Confirmed only 4 of 50 annotated genes that were not predicted (8%)

Summary and Future Work

High accuracy on constitutively spliced genes for nematodes
Wet-lab experiments confirm good performance of mGene
Re-annotation of other nematode genomes
  Preliminary results: new information transfer methods improve performance on related organisms
Integrate new experimental data
  New sequencing technologies
  Tiling arrays
Computational challenge: predict alternative splicing

Acknowledgments

Gunnar Rätsch, Bernhard Schölkopf, Detlef Weigel
Georg Zeller, Alexander Zien, Cheng Soon Ong, Sören Sonnenburg, Petra Philips, Jonas Behr, Christian Widmer
