BIOINFORMATICS ORIGINAL PAPER
Vol. 23 no. 6 2007, pages 673–679
doi:10.1093/bioinformatics/btm009

Genome analysis

Identifying bacterial genes and endosymbiont DNA with Glimmer

Arthur L. Delcher1,*, Kirsten A. Bratke2, Edwin C. Powers3 and Steven L. Salzberg1
1Center for Bioinformatics & Computational Biology, University of Maryland, College Park, MD 20742, USA
2Smurfit Institute of Genetics, University of Dublin, Trinity College Dublin, Dublin 2, Ireland
3Department of Chemical & Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA

Received on August 3, 2006; revised on December 15, 2006; accepted on January 14, 2007
Advance Access publication January 19, 2007
Associate Editor: Alfonso Valencia

ABSTRACT

Motivation: The Glimmer gene-finding software has been successfully used for finding genes in bacteria, archæa and viruses representing hundreds of species. We describe several major changes to the Glimmer system, including improved methods for identifying both coding regions and start codons. We also describe a new module of Glimmer that can distinguish host and endosymbiont DNA. This module was developed in response to the discovery that eukaryotic genome sequencing projects sometimes inadvertently capture the DNA of intracellular bacteria living in the host.

Results: The new methods dramatically reduce the rate of false-positive predictions, while maintaining Glimmer’s 99% sensitivity rate at detecting genes in most species, and they find substantially more correct start sites, as measured by comparisons to known and well-curated genes. We show that our interpolated Markov model (IMM) DNA discriminator correctly separated 99% of the sequences in a recent genome project that produced a mixture of sequences from the bacterium Prochloron didemni and its sea squirt host, Lissoclinum patella.

Availability: Glimmer is OSI Certified Open Source and available at http://cbcb.umd.edu/software/glimmer

Contact: [email protected]

*To whom correspondence should be addressed.

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 INTRODUCTION

The genomes of bacteria, archæa and viruses are very gene-dense, with protein-coding regions typically comprising 90% or more of the DNA sequence. As a consequence, the accuracy of prokaryotic gene-finding programs depends primarily on identifying which of the six possible reading frames contains the true gene (Besemer and Borodovsky, 1999; Borodovsky and McIninch, 1993; Guo et al., 2003; Ouyang et al., 2004). The accuracy of gene finding systems in these species is very high as compared to eukaryotic gene finders; previous versions of the Glimmer system had a sensitivity of 99% or higher (Delcher et al., 1999; Salzberg et al., 1998).

However, there is still some room for improvement. First, the measurement of sensitivity relies on comparisons to well-annotated bacterial genomes, where the best we can do is to count how many ‘known’ genes are found by a gene finder. Genes are considered known if they have clear homology, as measured by amino-acid similarity, to genes in other species. This similarity often breaks down near the 5′ end of the transcript, which also tends to be the region where gene finders disagree on the precise position of the start codon. Thus, one area where gene finders might still improve is in prediction of start sites, as has been pointed out in previous studies (Besemer et al., 2001).

A second issue is false positives, i.e. gene predictions that do not correspond to genuine protein-coding genes. Because bacteria are so gene-dense, it is very difficult to say with confidence that any gene predicted to lie in an otherwise intergenic region is false. Fortunately, the growing number of sequenced genomes from closely related species does provide some help with this question, and indeed some microbial gene-finding systems rely on database searches to identify genes (Badger and Olsen, 1999; Frishman et al., 1998; Larsen and Krogh, 2003; Nielsen and Krogh, 2005). If a predicted protein is not conserved between closely related species, then evolutionary arguments can be made that the prediction is false. A greater source of false positives in earlier releases of Glimmer, though, came from predicting too many overlapping genes. Because truly overlapping genes are quite rare in bacterial genomes, the system should generally avoid such predictions. Here too, homology to other species can resolve the question of which gene is correct.

Our challenge was to reduce the false positive rate of Glimmer without sacrificing its high sensitivity (true positive) rate. The new Glimmer, release 3.0, achieves a dramatically lower false-positive rate, predicts many more start sites correctly, and maintains its high true positive rate. It does this through a new algorithm for scanning coding regions, a new start site detection module, and an overall architecture that for the first time integrates all gene predictions across an entire genome. In addition, a new automated training program produces substantially improved training sets, particularly on genomes with high GC-content.

We also introduce a new use for the interpolated Markov model (IMM) that is at the core of Glimmer. Recent large-scale sequencing projects of eukaryotic species have inadvertently captured the genomes of bacterial endosymbionts as a side effect of the overall project (Salzberg et al., 2005). When a eukaryotic species has an intracellular endosymbiont, as is true for many invertebrates including fruit flies, mosquitoes and nematodes, then a whole-genome shotgun sequencing project cannot avoid capturing some of the symbiont DNA. Consequently there is a need to identify and separate the DNA from the host and the symbiont, in order to assemble the two genomes separately (and correctly). Besides these eukaryotic genome projects, a growing number of bacterial sequencing projects are targeting endosymbionts that can only be grown inside their hosts, including Wolbachia pipientis (Wu et al., 2004) and Prochloron didemni (J. Ravel, personal communication). In these projects, despite investigators’ best efforts to isolate bacterial DNA, a considerable amount of eukaryotic host DNA remained in the sample and needed to be removed. Motivated by these problems, we developed a new algorithm in which the IMM within Glimmer is trained separately on host and endosymbiont DNA, and then turned into a classifier to separate the raw sequences. We report here on this new module and its successful use in two recent genome projects.

The Glimmer 3.0 package is distributed as OSI Certified Open Source software and is freely available at http://cbcb.umd.edu/software/glimmer

2 METHODS AND RESULTS

2.1 Reverse scoring

The IMM scoring algorithm in Glimmer computes the log-likelihood that a given interval on a DNA sequence was generated by a model of coding versus noncoding DNA (see Delcher et al., 1999; Salzberg et al., 1998). Glimmer 3.0 takes advantage of the flexibility of this algorithm by scoring all open reading frames (ORFs) in reverse, from the stop codon back toward the start codon, with the probability of each base conditioned on a context window on its 3′ side and the score of the ORF being the log-likelihood sum of the bases contained in the ORF. The score is computed incrementally as a cumulative sum at every codon position in a given ORF. In many cases, these scores show a marked peak in value, and furthermore this peak typically occurs near the correct start site (see Fig. 1 for an example). The advantage of scanning ORFs in reverse is that for nucleotides near the start site, the context window of the IMM is contained within the coding portion of the gene, which is the type of data on which it was trained. This results in a more precise cumulative score at nucleotides very close to the start site, compared to a context window on the opposite side which would intersect a non-coding region.

[Figure 1: plot of log-odds score (Y-axis, 0 to 140) against bp before stop codon (X-axis, 0 to 1200); legend: cumulative score, start sites.]

Fig. 1. Scoring an open reading frame from the stop codon backwards. The stop codon is at position 0 on the X-axis and the cumulative log-odds score is plotted as the solid line. Positions of possible start codons are indicated by vertical dashed lines. This ORF contains the fructose bis-P aldolase gene in Escherichia coli (EG14062) and the current Ecogene verified start site is at position 1050, near the peak score. This position is an update of the originally annotated start at position 1122.

As the figure shows, the cumulative IMM score steadily increases as we move away from the stop codon at the left until it reaches a peak and then begins to decline. The decrease occurs because the bases upstream of the gene start codon are non-coding and produce negative IMM log-odds scores. The figure also shows, as vertical dashed lines, the positions of all possible start codons in the ORF. We hypothesized that if we used the highest-scoring start codon in these plots, then Glimmer would find a higher percentage of true start sites. This is borne out in our experiments, described subsequently.

An important difference between this algorithm and earlier versions of Glimmer is that, unlike those versions, which had a strong bias in favor of longer ORFs, this algorithm chooses start sites based on their relative scores.

2.2 Ribosome binding sites

In previous versions of Glimmer, the ribosome binding site (RBS) was essentially ignored, even though it provides a strong signal for the
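The reverse scoring idea of Section 2.1 can be illustrated with a minimal Python sketch. It assumes that per-codon log-odds scores are already available from some coding-versus-noncoding model. The function names, the start-codon set and the toy scores are hypothetical and are not taken from the Glimmer source; in particular, the context-window conditioning of a real IMM is not modelled here.

```python
from itertools import accumulate

# Assumption: a typical bacterial start-codon set, chosen for illustration only.
START_CODONS = {"ATG", "GTG", "TTG"}


def reverse_cumulative_scores(codon_scores):
    """Cumulative sums taken from the codon next to the stop codon backwards."""
    return list(accumulate(reversed(codon_scores)))


def pick_start(orf, codon_scores):
    """Return (bp_before_stop, score) for the highest-scoring start codon.

    orf          -- coding-strand sequence, 5' to 3', ending just before the stop
    codon_scores -- one log-odds score per codon, in the same order as orf
                    (hypothetical stand-in for the output of a trained model)
    bp_before_stop is the distance from the stop codon to the first base of
    the chosen start codon. Returns None if the ORF has no start codon.
    """
    if len(orf) % 3 != 0 or len(orf) // 3 != len(codon_scores):
        raise ValueError("need whole codons and exactly one score per codon")
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    best = None
    cumulative = reverse_cumulative_scores(codon_scores)
    for steps_back, score in enumerate(cumulative, start=1):
        codon = codons[len(codons) - steps_back]
        # Only positions holding a start codon are candidate gene starts.
        if codon in START_CODONS and (best is None or score > best[1]):
            best = (3 * steps_back, score)
    return best


# Toy ORF: the two upstream codons score negatively, as non-coding bases would.
toy_orf = "ATGCCCATGAAAGGG"
toy_scores = [-2.0, -1.5, 1.0, 2.0, 1.5]
print(pick_start(toy_orf, toy_scores))  # (9, 4.5)
```

In this toy case the start codon 15 bp before the stop would give the longer ORF, but its cumulative score (1.0) is lower than that of the start codon 9 bp before the stop (4.5), so the sketch selects the latter, choosing by relative score instead of by length.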