First High-Quality Reference Genome of Amphicarpaea edgeworthii

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.22.306811; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Tingting Song1+, Mengyan Zhou2+, Yuying Yuan1, Jinqiu Yu1, Hua Cai1, Jiawei Li1, Yajun Chen1, Yan Bai2, Gang Zhou2 and Guowen Cui1*

1 Northeast Agricultural University, 150030, Harbin, China
2 Novogene Bioinformatics Institute, 100083 Beijing, China
+ These authors contributed equally
* Corresponding author: Guowen Cui, Northeast Agricultural University, No. 600 Changjiang Road, Harbin, 150030, China. Email: [email protected]

Abstract

Amphicarpaea edgeworthii, an annual twining herb, is a widely distributed species and an ideal model for studying complex flowering types and the evolutionary mechanisms of species. Herein, we generated a high-quality assembly of A. edgeworthii by using a combination of PacBio, 10× Genomics libraries, and Hi-C mapping technologies. The final 11 chromosome-level scaffolds covered 90.61% of the estimated genome (343.78 Mb), making this the first chromosome-scale genome assembly of an amphicarpic plant. These data will be beneficial for the discovery of genes that control major agronomic traits, will spur genetic improvement of and functional genetic studies in legumes, and will supply comparative genetic resources for other amphicarpic plants.

Introduction

In nature, the distribution of key resources required for plant growth is often uneven. Plants growing in unstable habitats, with limited supplies of mineral nutrients, water, or light, frequent soil disturbances, and large environmental fluctuations, undergo adaptive evolution to improve their survival (Jackson and Caldwell, 1993; Pearcy and Caldwell, 1994).

Some plant species that bear two or more heteromorphic flowers also bear heteromorphic fruits (seeds). Amphicarpy is a phenomenon in which a plant produces both aerial and subterranean flowers and simultaneously bears both aerial and subterranean fruits (seeds) on aerial and subterranean stems, respectively (Cheplick, 1987; Koontz et al., 2017; Schnee and Waller, 1986). This phenomenon is observed in at least 67 herbaceous species (31 in Fabaceae) in 39 genera and 13 families of angiosperms, as reported by Zhang et al. (2020a). Amphicarpy is an important part of plant adaptive evolution, in which angiosperms generally display a special type of fruiting pattern and different fruit (seed) types also exhibit various dormancy and morphological features. This fruiting mode is crucial for the ecological adaptation of plants that have evolved through natural selection because it reduces competition among siblings within the population, maintains and increases the population size in situ, and increases the adaptability and evolutionary plasticity of the species (Hidalgo et al., 2016; Sadeh et al., 2009).

Amphicarpaea edgeworthii, an annual twining herb, belongs to the Fabaceae, a large and economically valuable family of flowering plants (Zhang et al., 2017; Zhang et al., 2006).
In this plant species, three types of flowers (fruits) can grow on the same plant, and aerial chasmogamous flowers are produced only during summer (Zhang et al., 2006; Zhang et al., 2005). This species may offer an attractive model for examining the gene regulatory networks that control chasmogamous and cleistogamous flowering in plants. However, the mechanism of flower development in amphicarpic plants, particularly in legumes, is poorly understood. The present study was an attempt to improve our understanding of the reproductive biology and the underlying evolutionary mechanism of amphicarpic plants. We performed whole-genome sequencing of A. edgeworthii, which is the first plant in the legume family and also the first amphicarpic species subjected to sequencing. This reference genome provides a valuable foundation for further work on agronomics and molecular breeding in A. edgeworthii.

Results

A. edgeworthii has a diploid genome (2n = 2x = 22) (Supplementary figure 1). We estimated the genome size of A. edgeworthii to be 360.91 Mb using 17-mer analysis (Supplementary Table 1 and figure 2). We sequenced the genome of A. edgeworthii by using a combination of PacBio, Illumina, and 10× Genomics libraries, which yielded a 343.78-Mb assembly (contig N50 length = 1.44 Mb, scaffold N50 length = 2.4 Mb; Table 1 and Supplementary Tables 2, 3 and 4). Finally, we assembled a chromosome-level genome by using Hi-C technology.
We used a total of 5.27 million reads from Hi-C libraries and mapped approximately 90.61% of the assembled sequences to 11 pseudochromosomes, the longest scaffold being 32.05 Mb (Table 1 and figure 1). These results indicate that the A. edgeworthii genome was adequately covered by the assembly.

We evaluated the completeness of the genome assembly by mapping the Illumina paired-end reads to our assembly with the Burrows–Wheeler Aligner (BWA) (Li and Durbin, 2009), which gave a mapping rate of 98.70% and a coverage of 94.04% (Supplementary Table 5). Then, we used both the Core Eukaryotic Gene Mapping Approach (CEGMA) (Parra et al., 2007) and Benchmarking Universal Single-Copy Orthologs (BUSCO) (Simão et al., 2015) to assess the integrity of the assembly. In the CEGMA assessment, 238 (95.97%) of 248 core eukaryotic genes were assembled (Supplementary Table 6). Furthermore, 93.4% complete single-copy BUSCOs were detected, which indicates that the assembly is largely complete (Supplementary Table 7). Overall, the results indicate that a high-quality assembly was generated.

Repeat sequences comprise 51.28% of the assembled genome, with transposable elements (TEs) being the major component (Supplementary Table 8). Among the TEs, long terminal repeats (LTRs) were the major component (29.32%) (Supplementary Table 9). We combined de novo prediction, homology search, and mRNA-seq-assisted prediction to predict genes in the A. edgeworthii genome, and we obtained 28,372 protein-coding genes, 97.2% of which were annotated (Supplementary Tables 10 and 11). Additionally, we identified 2,260 non-coding RNAs, including 471 miRNAs, 701 transfer RNAs, 266 ribosomal RNAs, and 822 small nuclear RNAs (Supplementary Table 12).
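
The contig and scaffold N50 and N90 lengths quoted in the Results and in Table 1 follow the standard definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly size. The sketch below illustrates that definition only. The function name `nx_length` and the toy lengths are illustrative assumptions, not the authors' pipeline or data.

```python
def nx_length(lengths, fraction=0.5):
    """Return the Nx length of an assembly.

    The Nx length is the length L such that sequences of length >= L
    together cover at least `fraction` of the total assembly size
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence down until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical toy assembly (arbitrary units), total size 30.
toy_lengths = [10, 8, 6, 4, 2]
n50 = nx_length(toy_lengths, 0.5)  # 10 + 8 = 18 >= 15, so N50 is 8
n90 = nx_length(toy_lengths, 0.9)  # 10 + 8 + 6 + 4 = 28 >= 27, so N90 is 4
print(n50, n90)
```

Because the statistic depends only on the length distribution, super-scaffolding (for example with Hi-C) can raise the scaffold N50 sharply without changing the total assembly size.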

Table 1 Statistics of the A. edgeworthii genome assembly

Statistic                      Value
Total assembly size (Mb)       343.78
Total number of contigs        1,475
Total number of scaffolds      1,082
Contig N50 length (Mb)         1.44
Maximum contig length (Mb)     7.65
Maximum scaffold length (Mb)   32.05
Scaffold N50 length (Mb)       28.47
Scaffold N90 length (Mb)       23.07
GC content (%)                 32.04
Gene number                    28,372
Repeat content (%)             51.28

Discussion

In this study, we constructed a high-quality chromosome-level genome assembly for A. edgeworthii by combining long-read sequences from PacBio with highly accurate short reads from Illumina sequencing and using Hi-C technology for super-scaffolding. The assembly of A. edgeworthii adds to the growing genomic information on the Fabaceae family. As the first species in the genus Amphicarpaea of the Fabaceae family to be sequenced, A. edgeworthii exhibits a range of specific biological features, such as three types of flowers and three types of fruit (seed) borne simultaneously on the same plant. It occupies an important evolutionary position in the tree of life (Goodwillie et al., 2005; Silvertown et al., 2001). According to our analysis, these genomic data will serve as valuable resources for future genomic studies on amphicarpic plants and for molecular breeding of soybean. To date, no genomic data for amphicarpic plants have been available. Therefore, the sequencing of the entire A. edgeworthii genome will facilitate future investigations of the phylogenetic relationships of flowering (seed) plants. The availability of the A. edgeworthii genome sequence makes it feasible to investigate deep phylogenetic questions in angiosperms and to determine signatures of genome evolution and the genetic basis of interesting traits.

Methods

Plant material and genome survey

For genome sequencing, we collected fresh young leaves of A. edgeworthii plants growing in Heilongjiang Province (45.80°N, 126.53°E), China. Karyotype analysis revealed a karyotype of 2n = 2x = 22, with uniform and small chromosomes (Wolny et al., 2013) (Supplementary figure 1).
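
The 17-mer genome-size estimate reported in the Results is the kind of figure a k-mer genome survey produces. The text available here does not describe how the estimate was computed, so the following is only a minimal sketch of the commonly used relation, genome size ≈ total number of k-mers / depth of the main peak in the k-mer depth histogram. The function name, the `min_depth` error cutoff, and the toy histogram are all hypothetical and are not taken from the study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps depth -> number of distinct k-mers observed at that depth.
    Depths below `min_depth` are treated as sequencing-error k-mers and ignored
    (the cutoff is an assumption and is normally chosen from the histogram).
    Returns (estimated_size, peak_depth).
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth with the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Hypothetical toy histogram: low-depth error k-mers plus one peak at depth 20.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)  # 28000 k-mer occurrences / peak depth 20 = 1400.0
```

Real surveys usually also account for heterozygosity and repeat peaks, which this sketch deliberately leaves out.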