Glimmer Arthur L

Total Page:16

File Type:pdf, Size:1020Kb

Glimmer Arthur L Vol. 23 no. 6 2007, pages 673–679 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm009 Genome analysis Identifying bacterial genes and endosymbiont DNA with Glimmer Arthur L. Delcher1,Ã, Kirsten A. Bratke2, Edwin C. Powers3 and Steven L. Salzberg1 1Center for Bioinformatics & Computational Biology, University of Maryland, College Park, MD 20742, USA 2 Smurfit Institute of Genetics, University of Dublin, Trinity College Dublin, Dublin 2, Ireland Downloaded from 3Department of Chemical & Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA Received on August 3, 2006; revised on December 15, 2006; accepted on January 14, 2007 Advance Access publication January 19, 2007 Associate Editor: Alfonso Valencia http://bioinformatics.oxfordjournals.org/ ABSTRACT However, there is still some room for improvement. First, Motivation: The Glimmer gene-finding software has been suc- the measurement of sensitivity relies on comparisons to well- cessfully used for finding genes in bacteria, archæa and viruses annotated bacterial genomes, where the best we can do is to representing hundreds of species. We describe several major count how many ‘known’ genes are found by a gene finder. changes to the Glimmer system, including improved methods Genes are considered known if they have clear homology, as for identifying both coding regions and start codons. We also measured by amino-acid similarity, to genes in other species. 0 describe a new module of Glimmer that can distinguish host and This similarity often breaks down near the 5 end of the endosymbiont DNA. This module was developed in response to transcript, which also tends to be the region where gene finders the discovery that eukaryotic genome sequencing projects some- disagree on the precise position of the start codon. Thus, one times inadvertently capture the DNA of intracellular bacteria living area where gene finders might still improve is in prediction in the host. of start sites, as has been pointed out in previous studies at Milton S. Eisenhower Library/ Johns Hopkins University on June 14, 2013 Results: The new methods dramatically reduce the rate of false- (Besemer et al., 2001). positive predictions, while maintaining Glimmer’s 99% sensitivity A second issue is false positives, i.e. gene predictions that do rate at detecting genes in most species, and they find substantially not correspond to genuine protein-coding genes. Because bac- more correct start sites, as measured by comparisons to known and teria are so gene-dense, it is very difficult to say with confidence well-curated genes. We show that our interpolated Markov model that any gene predicted to lie in an otherwise intergenic region is (IMM) DNA discriminator correctly separated 99% of the sequences false. Fortunately, the growing number of sequenced genomes in a recent genome project that produced a mixture of sequences from closely related species does provide some help with this from the bacterium Prochloron didemni and its sea squirt host, question, and indeed some microbial gene-finding systems rely Lissoclinum patella. on database searches to identify genes (Badger and Olsen, 1999; Availability: Glimmer is OSI Certified Open Source and available at Frishman et al., 1998; Larsen and Krogh, 2003; Nielsen and http://cbcb.umd.edu/software/glimmer Krogh, 2005). If a predicted protein is not conserved between Contact: [email protected] closely related species, then evolutionary arguments can be made that the prediction is false. A greater source of false positives in earlier releases of Glimmer, though, came from predicting too many overlapping genes. Because truly over- lapping genes are quite rare in bacterial genomes, the system should generally avoid such predictions. Here too, homology to 1 INTRODUCTION other species can resolve the question of which gene is correct. The genomes of bacteria, archæa and viruses are very gene- Our challenge was to reduce the false positive rate of Glimmer dense, with protein-coding regions typically comprising 90% without sacrificing its high sensitivity (true positive) rate. or more of the DNA sequence. As a consequence, the accuracy The new Glimmer, release 3.0, achieves a dramatically lower of prokaryotic gene-finding programs depends primarily on false-positive rate, predicts many more start sites correctly, and identifying which of the six possible reading frames contains maintains its high true positive rate. It does this through a new the true gene (Besemer and Borodovsky, 1999; Borodovsky algorithm for scanning coding regions, a new start site detec- and McIninch, 1993; Guo et al., 2003; Ouyang et al., 2004). tion module, and an overall architecture that for the first The accuracy of gene finding systems in these species is very time integrates all gene predictions across an entire genome. high as compared to eukaryotic gene finders; previous versions In addition, a new automated training program produces of the Glimmer system had a sensitivity of 99% or higher substantially improved training sets, particularly on genomes (Delcher et al., 1999; Salzberg et al., 1998). with high GC-content. We also introduce a new use for the interpolated Markov *To whom correspondence should be addressed. model (IMM) that is at the core of Glimmer. Recent large-scale ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. A.L.Delcher et al. Cumulative score 140 Start sites 120 100 80 60 Downloaded from Log-odds score 40 20 http://bioinformatics.oxfordjournals.org/ 0 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 Bp before stop codon Fig. 1. Scoring an open reading frame from the stop codon backwards. The stop codon is at position 0 on the X-axis and the cumulative log-odds score is plotted as the solid line. Positions of possible start codons are indicated by vertical dashed lines. This ORF contains the fructose bis-P aldolase gene in Escherichia coli (EG14062) and the current Ecogene verified start site is at position 1050, near the peak score. This position is an update of the originally annotated start at position 1122. sequencing projects of eukaryotic species have inadvertently see (Delcher et al., 1999; Salzberg et al., 1998). Glimmer 3.0 takes captured the genomes of bacterial endosymbionts as a side advantage of the flexibility of this algorithm by scoring all open reading effect of the overall project (Salzberg et al., 2005). When a frames (ORFs) in reverse, from the stop codon back toward the start at Milton S. Eisenhower Library/ Johns Hopkins University on June 14, 2013 codon, with the probability of each base conditioned on a context eukaryotic species has an intracellular endosymbiont, as is true 0 for many invertebrates including fruit flies, mosquitoes and window on its 3 side and the score of the ORF being the log-likelihood sum of the bases contained in the ORF. The score is computed nematodes, then a whole-genome shotgun sequencing project incrementally as a cumulative sum at every codon position in a given cannot avoid capturing some of the symbiont DNA. Conse- ORF. In many cases, these scores show a marked peak in value, and quently there is a need to identify and separate the DNA from furthermore this peak typically occurs near the correct start site (see the host and the symbiont, in order to assemble the two Fig. 1 for an example). The advantage of scanning ORFs in reverse is genomes separately (and correctly). Besides these eukaryotic that for nucleotides near the start site, the context window of the IMM genome projects, a growing number of bacterial sequencing is contained within the coding portion of the gene, which is the type of projects are targeting endosymbionts that can only be grown data on which it was trained. This results in a more precise cumulative inside their hosts, including Wolbachia pipientis (Wu et al., score at nucleotides very close to the start site, compared to a context 2004) and Prochloron didemni (J. Ravel, personal communica- window on the opposite side which would intersect a non-coding tion). In these projects, despite investigators’ best efforts to region. isolate bacterial DNA, a considerable amount of eukaryotic As the figure shows, the cumulative IMM score steadily increases as we move away from the stop codon at the left until it reaches a peak host DNA remained in the sample and needed to be removed. and then begins to decline. The decrease occurs because the bases Motivated by these problems, we developed a new algorithm in upstream of the gene start codon are non-coding and produce negative which the IMM within Glimmer is trained separately on host IMM log-odds scores. The figure also shows, as vertical dashed lines, and endosymbiont DNA, and then turned into a classifier to the positions of all possible start codons in the ORF. We hypothesized separate the raw sequences. We report here on this new module that if we used the highest-scoring start codon in these plots, then and its successful use in two recent genome projects. Glimmer would find a higher percentage of true start sites. This is The Glimmer 3.0 package is distributed as OSI Certified borne out in our experiments, described subsequently. Open Source software and is freely available at http:// An important difference between this algorithm and earlier versions cbcb.umd.edu/software/glimmer of Glimmer is that, unlike those versions, which had a strong bias in favor of longer ORFs, this algorithm chooses start sites based on their relative scores. 2 METHODS AND RESULTS 2.1 Reverse scoring 2.2 Ribosome binding sites The IMM scoring algorithm in Glimmer computes the log-likelihood In previous versions of Glimmer, the ribosome binding site (RBS) was that a given interval on a DNA sequence was generated by a model essentially ignored, even though it provides a strong signal for the of coding versus noncoding DNA.
Recommended publications
  • Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes
    ARTICLES https://doi.org/10.1038/s41564-019-0510-x Corrected: Author Correction Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes Simon Roux 1*, Mart Krupovic 2, Rebecca A. Daly3, Adair L. Borges4, Stephen Nayfach1, Frederik Schulz 1, Allison Sharrar5, Paula B. Matheus Carnevali 5, Jan-Fang Cheng1, Natalia N. Ivanova 1, Joseph Bondy-Denomy4,6, Kelly C. Wrighton3, Tanja Woyke 1, Axel Visel 1, Nikos C. Kyrpides1 and Emiley A. Eloe-Fadrosh 1* Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach lever- aging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem.
    [Show full text]
  • Microbial Gene Identification Using Interpolated Markov Models Steven L
    544–548 Nucleic Acids Research, 1998, Vol. 26, No. 2 1998 Oxford University Press Microbial gene identification using interpolated Markov models Steven L. Salzberg1,2,*, Arthur L. Delcher3, Simon Kasif4 and Owen White1 1The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA, 2Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA, 3Department of Computer Science, Loyola College in Maryland, Baltimore, MD 21210, USA and 4Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA Received September 10, 1997; Revised and Accepted November 11, 1997 ABSTRACT in new genomes still have no significant homology to known genes (1). For these genes, we must rely on computational methods of This paper describes a new system, GLIMMER, for scoring the coding region to identify the genes. The best-known finding genes in microbial genomes. In a series of tests program for this task is GeneMark (5), which uses a Markov chain on Haemophilus influenzae, Helicobacter pylori and model to score coding regions. GeneMark has been highly effective other complete microbial genomes, this system has and was used in the H.influenza and more recent genome projects. proven to be very accurate at locating virtually all the We have developed a new system, GLIMMER, that uses a technique genes in these sequences, outperforming previous called interpolated Markov models (IMMs) to find coding regions methods. A conservative estimate based on experiments in microbial sequences. IMMs are in principle more powerful than on H.pylori and H.influenzae is that the system finds Markov chains, and the computational experiments described below >97% of all genes.
    [Show full text]
  • Steven L. Salzberg
    Steven L. Salzberg McKusick-Nathans Institute of Genetic Medicine Johns Hopkins School of Medicine, MRB 459, 733 North Broadway, Baltimore, MD 20742 Phone: 410-614-6112 Email: [email protected] Education Ph.D. Computer Science 1989, Harvard University, Cambridge, MA M.Phil. 1984, M.S. 1982, Computer Science, Yale University, New Haven, CT B.A. cum laude English 1980, Yale University Research Areas: Genomics, bioinformatics, gene finding, genome assembly, sequence analysis. Academic and Professional Experience 2011-present Professor, Department of Medicine and the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University. Joint appointments as Professor in the Department of Biostatistics, Bloomberg School of Public Health, and in the Department of Computer Science, Whiting School of Engineering. 2012-present Director, Center for Computational Biology, Johns Hopkins University. 2005-2011 Director, Center for Bioinformatics and Computational Biology, University of Maryland Institute for Advanced Computer Studies 2005-2011 Horvitz Professor, Department of Computer Science, University of Maryland. (On leave of absence 2011-2012.) 1997-2005 Senior Director of Bioinformatics (2000-2005), Director of Bioinformatics (1998-2000), and Investigator (1997-2005), The Institute for Genomic Research (TIGR). 1999-2006 Research Professor, Departments of Computer Science and Biology, Johns Hopkins University 1989-1999 Associate Professor (1996-1999), Assistant Professor (1989-1996), Department of Computer Science, Johns Hopkins University. On leave 1997-99. 1988-1989 Associate in Research, Graduate School of Business Administration, Harvard University. Consultant to Ford Motor Co. of Europe and to N.V. Bekaert (Kortrijk, Belgium). 1985-1987 Research Scientist and Senior Knowledge Engineer, Applied Expert Systems, Inc., Cambridge, MA. Designed expert systems for financial services companies.
    [Show full text]
  • By Katherine E. L. Cox a Dissertation Submitted to Johns Hopkins University in Conformity with the Requirements for the Degree O
    EFFECTS OF CONJUGATIVE PLASMIDS ON BACTERIAL HOSTS, AND COEVOLUTION OF HOSTS AND PLASMIDS by Katherine E. L. Cox A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland March, 2017 Abstract As agents of horizontal gene transfer, plasmids play a role in the spread of antibiotic resistance and virulence factors, and can carry genetic material between widely divergent species. Typically, plasmids are considered primarily in terms of the genes they carry and the known functions of those genes, such as providing antibiotic resistance or new metabolic pathways. However, plasmids can have a variety of effects on their hosts beyond simply providing new functionality. Interactions between hosts and plasmids can modify host behavior, for example, increasing biofilm formation, modifying of host gene expression, or influencing virulence. We wished to investigate the mechanisms by which plasmids influence host behavior, as well as the evolution of host/plasmid relationships. In this work, we used multiple approaches to explore host/plasmid interactions. We examined a specific plasmid (R1) in great detail, generating a complete sequence and annotation for this plasmid and providing a brief review of the known gene products. We probed the influence of the plasmid-borne traJ gene (a regulator of plasmid transfer) on the neonatal-meningitis clinical isolate E. coli RS218 by measuring changes in the virulence and gene expression of RS218 carrying either wild-type or disrupted copies of traJ. We then examined the evolution of host/plasmid relationships by introducing plasmids into new hosts and coevolving them for 500 generations.
    [Show full text]
  • Improved Microbial Gene Identification with GLIMMER
    4636–4641 Nucleic Acids Research, 1999, Vol. 27, No. 23 © 1999 Oxford University Press Improved microbial gene identification with GLIMMER Arthur L. Delcher1,2,*, Douglas Harmon1,SimonKasif3,OwenWhite4 and Steven L. Salzberg4 1Department of Computer Science, Loyola College in Maryland, Baltimore, MD 21210, USA, 2Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA, 3Department of Electrical Engineering and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7053, USA and 4The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA Received July 7, 1999; Revised and Accepted October 17, 1999 ABSTRACT a generalization of Markov chain methods. GLIMMER 1.0 has been used as the gene finder for Borrelia burgdorferi (2), The GLIMMER system for microbial gene identification Treponema pallidum (3), Chlamydia trachomatis (4) and finds ~97–98% of all genes in a genome when Thermotoga maritima (5), and the software is in use at over compared with published annotation. This paper 100 laboratories and institutes. Below we describe the algorithm reports on two new results: (i) significant technical and performance results of GLIMMER 2.0, a gene finder that improvements to GLIMMER that improve its accuracy incorporates several technical improvements to the GLIMMER still further, and (ii) a comprehensive evaluation that 1.0 algorithm. As a result of these improvements, GLIMMER demonstrates that the accuracy of the system is 2.0 has slightly higher sensitivity than GLIMMER 1.0 and is likely to be higher than previously recognized. A much better at resolving overlapping gene calls. The latter significant proportion of the genes missed by the property is especially useful for genomes such as Deinococcus system appear to be hypothetical proteins whose radiodurans, which due to their high GC-content have existence is only supported by the predictions of numerous long open reading frames (ORFs) that can easily lead to other programs.
    [Show full text]
  • The Effect of Machine Learning Algorithms on Metagenomics Gene
    The Effect of Machine Learning Algorithms on Metagenomics Gene Prediction Amani Al-Ajlan Achraf El Allali Computer Science Department Computer Science Department College of Computer and Information Sciences College of Computer and Information Sciences King Saud University King Saud University Riyadh, Saudi Arabia Riyadh, Saudi Arabia [email protected] [email protected] analysis of metagenomics is used to discover new diseases and ABSTRACT help preserve human health. For example, studying microbes in our The development of next generation sequencing facilitates the environment can help us control diseases and introduce new study of metagenomics. Computational gene prediction aims to strategies for diagnosis. Recent advances in next generating find the location of genes in a given DNA sequence. Gene sequencing (NGS) technologies have facilitated and improved prediction in metagenomics is a challenging task because of the metagenomics research by generating high-throughput, low-cost short and fragmented nature of the data. Our previous framework sequencing data [6], [21], [29]. minimum redundancy maximum relevance - support vector Sequencing data taken directly from the environment are called machines (mRMR-SVM) produced promising results in metagenomes, and the study of these data is called metagenomics metagenomics gene prediction. In this paper, we review available [30]. Machine learning techniques play an important role in metagenomics gene prediction programs and study the effect of solving many metagenomics problems, such as gene prediction the machine learning approach on gene prediction by altering the taxonomic assignment, and comparative metagenomics analysis underlining machine learning algorithm in our previous [27]. Finding genes and identifying their functions is essential to framework. Overall, SVM produces the highest accuracy based on understanding the environment being studied and to annotating tests performed on a simulated dataset.
    [Show full text]
  • Genome Sequencing and Annotation
    An Introduction to Phage Whole- Genome Sequencing and Annotation Jason J. Gill Department of Animal Science Center for Phage Technology Texas A&M University “Phage genomics” • Not a narrowly-defined topic! – Whole-phage genome sequencing – Targeted phage metagenome sequencing – Metagenomics of viral consortia – Prophage mining/annotation Whole phage genome sequencing and annotation Mixed pool of phages Assembled contigs Sequencing and Contig A: 165,220 Phage 1 assembly Phage 2 Contig B: 44,355 Phage 3 Contig C: 42,500 Contig assignment End closure & contig completion (PCR) Annotation (PCR + Sanger sequencing) Phage 1 Phage 1 Contig A = Phage 2 Phage 2 Phage 2 Contig B = Phage 1 Contig C = Phage 3 Phage 3 Phage 3 • Phages can usually be mixed into a single index or pool if they are not similar to each other – Different hosts – Different morphotypes Shotgun sequencing Genomic DNA DNA is fragmented Fragmented DNA is sequenced Sequences are assembled The assembly recapitulates the original order of the genome The consensus sequence is produced from the assembly Sequencing technology summary Technology Read length Quality* Total yield Cost per base Pyrosequencing 400 – 600 bp Moderate Moderate Moderate (Ion Torrent) Illumina 50 – 350 bp High High Low PacBio 2 – 20 kbp Moderate Moderate Moderate Nanopore > 100 kbp ? Low Low-moderate Moderate * Can vary as sequencing chemistries and software improve Overlap-layout-consensus (OLC) assembly • The “classic” method of assembly – Used for assembling long-read data (e.g., Sanger, PacBio and Oxford
    [Show full text]
  • Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Harvard University - DASH Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Kelley, David R., Bo Liu, Arthur L. Delcher, Mihai Pop, and Steven L. Salzberg. 2011. Gene prediction with glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Research 40(1): e9. Published Version doi:10.1093/nar/gkr1067 Accessed February 19, 2015 9:50:17 AM EST Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11248787 Terms of Use This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA (Article begins on next page) Published online 18 November 2011 Nucleic Acids Research, 2012, Vol. 40, No. 1 e9 doi:10.1093/nar/gkr1067 Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering David R. Kelley1,2,3,*, Bo Liu1, Arthur L. Delcher1, Mihai Pop1 and Steven L. Salzberg4 1Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, Department of Computer Science, 3115 Biomolecular Sciences Building 296, University of Maryland, College Park, MD 20742, 2Department of Stem Cell and Regenerative Biology, 7 Divinity Avenue, Harvard University, Cambridge, MA 02138, 3Broad Institute, 7 Cambridge Center, Cambridge, MA 02142 and 4McKusick-Nathans Institute of Genetic Medicine Johns Hopkins University School of Medicine Baltimore, MD, USA Received July 29, 2011; Revised September 19, 2011; Accepted October 28, 2011 ABSTRACT (1–3).
    [Show full text]
  • MCB 1200/1201, Virus Hunting
    MCB 1200/1201, Virus Hunting: Join an international science program (Howard Hughes Medical Institutes (HHMI) Science Education Alliance (SEA) Phage Research Program) that seeks to identify new therapies for diseases. In a real research project, students will isolate viruses that infect non-pathogenic bacteria that are related to disease-causing strains. There are more of these viruses, known as bacteriophages or phages, than any other group of organisms on Earth--but relatively few of them have been isolated or characterized. Phage therapy is a way to treat bacterial infections without using antibiotics, so the phages isolated in the class could have real therapeutic uses. Applied Bioinformatics (MCB 1201): The goal of the course is for every single student to characterize a bacteriophage genome. This involves finding genes and annotating them, comparing the gene content of this phage to other known phages, and using sequence analysis to infer evolutionary relationships. Grading is based on: electronic notebook of gene annotations and evidence (20%), exams (30%), research paper (25%), poster (15%), and in-class presentations (“group meetings”, 10%). MCB 1201 Virus Hunting: Applied Bioinfo rmatics Semester: Spring M/W 1:30 - 4:30 Beach Hall, open lab times M/W 9 - 1:30, most Fridays Instructors: Dr. Johann Peter Gogarten; Office BPB 404, 486 - 4061; [email protected] Dr. Noah Reid; Office TLS 413a, 486 - 6963 ; [email protected] Course description and rationale This course is a unique classroom - based undergraduate research experience that is part of the Howard Hughes Medical Institutes Science Education Alliance Phage Research Program. It spans two terms (with MCB 1200, Phage Hunters) and culminates in a research symposium held at HHMI’s Janelia campus.
    [Show full text]
  • Machine Learning for Metagenomics: Methods and Tools
    Machine learning for metagenomics: methods and tools Hayssam Soueidan ∗ Macha Nikolski y March 9, 2016 Abstract Owing to the complexity and variability of metagenomic studies, modern machine learning approaches have seen increased usage to answer a variety of question encompassing the full range of metagenomic NGS data analysis. We review here the contribution of machine learning techniques for the field of metagenomics, by presenting known successful approaches in a unified framework. This review focuses on five important metagenomic problems: OTU-clustering, bin- ning, taxonomic profling and assignment, comparative metagenomics and gene prediction. For each of these problems, we identify the most prominent methods, summarize the machine learn- ing approaches used and put them into perspective of similar methods. We conclude our review looking further ahead at the challenge posed by the analysis of interactions within microbial communities and different environments, in a field one could call \integrative metagenomics". Keywords: metagenomics, machine learning, OTU-clustering, binning, taxonomic assignment, comparative metagenomics, gene prediction 1 Introduction While genomics is the research field relative to the study of the genome of any organism, metage- nomics is the term for the research that focuses on many genomes at the same time, as typical in some sections of environmental study. Metagenomics recognizes the need to develop computational methods that enable understanding the genetic composition and activities of communities of species so complex that they can only be sampled, never completely characterized. Analysis of microbial communities has been until recently a complicated task due to their high diversity and the fact that many of these organisms can not be cultivated.
    [Show full text]
  • Glimmer-MG Release Notes Version 0.1
    Glimmer-MG Release Notes Version 0.1 David R. Kelley, Art L. Delcher 17 May 2011 Copyright c 2011 University of Maryland Center for Bioinformatics & Computational Biology 1 Introduction This document describes Version 0.1 of the Glimmer-MG gene-finding software for metagenomic sequences. Users discovering problems or errors are encouraged to report them to [email protected] . Glimmer is a collection of programs for identifying genes in microbial DNA sequences. The system works by creating a variable-length Markov model from a training set of genes and then using that model to attempt to identify all genes in a given DNA sequence. The three versions of Glimmer are described in [SDKW98], [DHK+ 99], and [DBPS07]. Glimmer-MG is released as OSI Certified Open Source Software under the Artistic License. The license is contained in the file, LICENSE, in the distribution. 2 Installation Glimmer-MG software was written for the Linux software environment. The following instructions assume a Linux system. They also work under Mac OSX. To install Glimmer-MG, download the compressed tarfile glimmer-mg-1.0.1.tar.gz from the website. Then uncompress the file by typing tar xzf glimmer-mg.1.0.1.tar.gz A directory named glimmer-mg should result. In that directory is a script called install_glimmer.py which will download and install the prerequisite software and com- pile the Glimmer-MG code. More specifically, it will 1. Download, compile, and install Phymm Phymm is used to compute the phylogenetic classifications of sequences used by Glimmer-MG to parameterize gene prediction models.
    [Show full text]
  • Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly
    INFORMATION THEORY, GRAPH THEORY AND BAYESIAN STATISTICS BASED IMPROVED AND ROBUST METHODS IN GENOME ASSEMBLY A Dissertation by BILAL WAJID Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Chair of Committee, Erchin Serpedin Co-Chair of Committee, Hazem Nounou Committee Members, Aydin Karsilayan Byung Jun Yoon Mohamed Nounou Head of Department, Miroslav Begovic August 2015 Major Subject: Electrical Engineering Copyright 2015 Bilal Wajid ABSTRACT Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in computational biology. This dissertation highlights the significance of genome assembly as a research area, focuses on its need to remain accurate, provides details about the characteristics of the raw data, ex- amines some key metrics, emphasizes some tools and outlines the whole pipeline for next-generation sequencing. Currently, a major effort is being put towards the as- sembly of the genomes of all living organisms. Given the importance of comparative genome assembly, herein dissertation, the principle of Minimum Description Length (MDL) and its two variants, the Two-Part MDL and Sophisticated MDL, are ex- plored in identifying the optimal reference sequence for genome assembly. Thereafter, a Modular Approach to Reference Assisted Genome Assembly Pipeline, referred to as MARAGAP, is developed. MARAGAP uses the principle of Minimum Description Length (MDL) to determine the optimal reference sequence for the assembly. The optimal reference sequence is used as a template to infer inversions, insertions, dele- tions and Single Nucleotide Polymorphisms (SNPs) in the target genome.
    [Show full text]