Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Harvard University - DASH Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Kelley, David R., Bo Liu, Arthur L. Delcher, Mihai Pop, and Steven L. Salzberg. 2011. Gene prediction with glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Research 40(1): e9. Published Version doi:10.1093/nar/gkr1067 Accessed February 19, 2015 9:50:17 AM EST Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11248787 Terms of Use This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA (Article begins on next page) Published online 18 November 2011 Nucleic Acids Research, 2012, Vol. 40, No. 1 e9 doi:10.1093/nar/gkr1067 Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering David R. Kelley1,2,3,*, Bo Liu1, Arthur L. Delcher1, Mihai Pop1 and Steven L. Salzberg4 1Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, Department of Computer Science, 3115 Biomolecular Sciences Building 296, University of Maryland, College Park, MD 20742, 2Department of Stem Cell and Regenerative Biology, 7 Divinity Avenue, Harvard University, Cambridge, MA 02138, 3Broad Institute, 7 Cambridge Center, Cambridge, MA 02142 and 4McKusick-Nathans Institute of Genetic Medicine Johns Hopkins University School of Medicine Baltimore, MD, USA Received July 29, 2011; Revised September 19, 2011; Accepted October 28, 2011 ABSTRACT (1–3). They play an integral role in many ecosystems, including the human body where a typical individual Environmental shotgun sequencing (or carries 10–100 times more prokaryotic cells than metagenomics) is widely used to survey the human cells (4). The DNA sequences of these microorgan- communities of microbial organisms that live in isms provide us with important information about their many diverse ecosystems, such as the human identities, capabilities and evolution. Traditional methods body. Finding the protein-coding genes within the for obtaining these sequences have required scientists to sequences is an important step for assessing the select a single microbe of interest, isolate it in culture and functional capacity of a metagenome. In this work, sequence its genome to high coverage (5). we developed a metagenomics gene prediction Because many microbes cannot be isolated and grown system Glimmer-MG that achieves significantly in culture, researchers have increasingly turned to greater accuracy than previous systems via novel sequencing DNA directly from environmental samples, an approach often called ‘metagenomics’ (6,7). approaches to a number of important prediction Metagenomics is an effective tool for exploring natural subtasks. First, we introduce the use of phylogen- environments (e.g. acid mine drainage (8), ocean water etic classifications of the sequences to model par- (9), and soil (10)) and environments on and within the ameterization. We also cluster the sequences, human body (11). With the development of improved grouping together those that likely originated from sequencing technologies from companies such as 454 the same organism. Analogous to iterative schemes Life Sciences and Illumina, DNA sequence reads can be that are useful for whole genomes, we retrain our obtained at increasingly higher throughput and lower cost. models within each cluster on the initial gene pre- These transformative technologies have made it possible dictions before making final predictions. Finally, we to use metagenomics to simultaneously analyze the model both insertion/deletion and substitution genomes of entire communities of microbes in an sequencing errors using a different approach than ever-broadening collection of environments. Identifying the protein-coding genes in an organism’s previous software, allowing Glimmer-MG to change genome is a fundamental step in any genome sequence coding frame or pass through stop codons by pre- analysis, and metagenomics is no different. Whether the dicting an error. In a comparison among multiple assembly produces large contigs (8) or is highly frag- gene finding methods, Glimmer-MG makes the mented (9), many new and interesting genes can be ex- most sensitive and precise predictions on simulated tracted (12). The goal of some metagenomic experiments and real metagenomes for all read lengths and error is to compare microbial communities across environments rates tested. (13,14). In these cases, accurate gene prediction is critical to perform a functional comparison (15,16). It has long been known that sequences coding for INTRODUCTION proteins have statistical properties that differentiate Prokaryotes inhabit a diverse array of environmental them from non-coding sequences, which allows gene niches and account for most of the world’s biomass finding programs to identify open reading frames *To whom correspondence should be addressed. Tel: +1 301 405 3234; Email: [email protected] ß The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. e9 Nucleic Acids Research, 2012, Vol. 40, No. 1 PAGE 2 OF 12 (ORFs) that represent protein-coding genes (17,18,19). metagenomics pipeline that incorporates classification Sequence composition is the most important discrimina- and clustering of the sequences prior to gene prediction. tive feature, due primarily to the fact that the triplet In addition, we addressed the other metagenomics gene patterns of coding DNA differ from non-coding DNA. prediction challenges with novel and effective solutions. These patterns can be captured by Markov chain Glimmer-MG incorporates a thorough probabilistic models, which need to be trained on a set of ORFs from model for gene length and start/stop codon presence to the same species or a close relative. State of the art pro- aid prediction of truncated genes on short sequence frag- karyotic gene finding softwares typically achieve >99% ments. Glimmer-MG predicts insertion, deletion and stop sensitivity and high precision on finished genomes (20). codon-introducing substitution errors in order to more As environmental shotgun sequencing has become more closely track coding frames in raw error-prone sequences. prevalent, computational gene prediction approaches have Our implementation of these features produces the most adapted to the particular challenges of these data. Since sensitive and precise gene predictions on realistically the source organisms of the sequences are unknown, the simulated metagenomes. On a set of real 454 reads from foremost challenge is training the statistical gene feature the human gut microbiome (37), Glimmer-MG makes models. Following training, the gene finder must make many more gene predictions matching known proteins predictions on short sequence fragments that frequently than other programs. contain only part of a gene. Further complicating Glimmer-MG is freely available as open-source matters, metagenomic assemblies have many low-coverage software from www.cbcb.umd.edu/software/glimmer-mg contigs and unassembled singleton reads (21,22), in which under the Perl Artistic License (www.perl.com/pub/a/ sequencing errors are prevalent and create difficulties for language/misc/Artistic.html). gene prediction (23). In contrast, finished genomes have many fewer sequencing errors thanks to their deeper and more uniform coverage. MATERIALS AND METHODS Despite these challenges, a number of methods for Glimmer predicting genes in metagenomic sequences have been published, reporting varying degrees of success (24–28). Glimmer’s salient feature is its use of interpolated Markov All these previous methods incorporate sequence models (IMMs) for capturing gene composition (18). GC-content into their prediction algorithms to choose IMMs are variable-order Markov chain models that model parameters. GC-content is a simple way to maximize the model order for each specific oligonucleotide identify training genomes that are likely to be evolution- window based on the amount of training data available. arily related, and whose genes might have similar sequence IMMs then interpolate the nucleotide distributions composition. This task’s goal overlaps considerably with between the chosen order and one greater. Thus, IMMs that of computationally assigning sequences a phylogen- construct the most sophisticated composition model that etic classification, which implicitly identifies close relatives. the training data sequences support. To segment the Phylogenetic classification of metagenomic sequences is a sequence into coding and non-coding sequence, Glimmer well-studied problem for which much better statistics than uses a flexible ORF-based framework that incorporates GC-content have been developed (29–32). In this article, knowledge of how prokaryotic genes can overlap and we recommend using a more sophisticated classification upstream features of translation