MOSGA 2: Comparative Genomics and Validation Tools

Total Page:16

File Type:pdf, Size:1020Kb

MOSGA 2: Comparative Genomics and Validation Tools bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. MOSGA 2: Comparative genomics and validation tools Roman Martin1, Hagen Dreßler1, Georges Hattab1, Thomas Hackl2, Matthias G. Fischer2, and Dominik Heider1, 1Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany 2Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany Due to the highly growing number of available genomic infor- which does not necessarily affect the annotation process but mation, the need for accessible and easy-to-use analysis tools is will hinder the submission process and may lead to wrong increasing. To facilitate eukaryotic genome annotations, we cre- conclusions. This study presents the new functionalities in ated MOSGA. In this work, we show how MOSGA 2 is devel- MOSGA 2, such as new workflows for comparative genomics oped by including several advanced analyses for genomic data. and the introduction of validation tools into the framework. Since the genomic data quality greatly impacts the annotation In brief, MOSGA 2 incorporates comparative genomic work- quality, we included multiple tools to validate and ensure high- flows derived from Hackl et al. (13) and validation tools from quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can Pirovano et al. (14). benefit from a broader genomic view by analyzing multiple ge- We initially developed MOSGA to facilitate genome anno- nomic data sets simultaneously. Further, we demonstrate the tation, including an easy-to-use web interface based on a new functionalities of MOSGA 2 by different use-cases and robust, scalable, modular, and reproducible platform. Be- practical examples. MOSGA 2 extends the already established sides new genome annotations, genome assemblies or mul- application to the quality control of the genomic data and inte- tiple genomes can be placed in a bigger context like com- grates and analyzes multiple genomes in a larger context, e.g., parative genomics. Therefore, we integrated several tools by phylogenetics. into MOSGA 2 to calculate and output the phylogenetic Genome annotation | Comparative Genomics | Phylogenetics | Quality control trees, the average nucleotide identity, the completeness of | Framework | Pipeline | Workflow the genomes, and the protein-coding gene comparison based Correspondence: [email protected] on previously performed protein-coding gene annotations. In addition, we incorporated tools to identify the occurrence of contamination and present a new scaffold-detection method Introduction for organellar sequences present in genome assemblies. The constantly increasing number of available whole- genome sequences, mainly derived by high-throughput se- Software changes quencing, provides more and more biological insights (1). In Phylogenetics. MOSGA 2 integrates a workflow based on particular, the sequencing of new bacterial, archaeal, and vi- nine established tools to preprocess and compute phyloge- ral genomes is now routine practice. It has accelerated dis- netic trees on multiple genomes. Preprocessing steps include coveries in microbial diversity and evolution, providing new selecting a database for genes, constructing and trimming insight into microbiome function, human health, and biogeo- multiple sequence alignment (MSA). chemical cycling (2–6). However, the generation of high- Therefore, we included BUSCO (15, 16) and EukCC (17) quality eukaryotic genomes has been hampered inDRAFT the past to identify single-copy genes in genomes for the phyloge- by sequencing costs and assembly complexity, limiting se- netic computation. The user has to choose from one of these quenced eukaryotes to those of medical or economical inter- tools or even combine both as potential phylogenetic mark- est (7). More recently, advances in Illumina high-throughput ers. MOSGA 2 concatenates the chosen gene marker se- sequencing and emerging long-read sequencing technologies quences into a new FASTA file. The user has to select a pro- developed by Pacific Biosciences and Oxford Nanopore have gram for the MSA construction, such as MAFFT (18, 19), also rendered the generation of genome assemblies of large ClustalW (20, 21), or MUSCLE (22). For an optional eukaryotes a routine task (8–10). These proceeding trends MSA trimming, we integrated trimAl (23) and ClipKIT (24). in data availability, advanced techniques, and improved af- Phylogenetic relationships are then reconstructed by maxi- fordability require more facilitated access for scientists to mum likelihood or distance algorithms using RAxML (25) or analyze these sequence data. To address the challenging is- FastME (26), respectively, and the resulting tree are visual- sue of correctly annotating new genomes, we developed the ized by ggtree (27) and ape (28). Tree rooting is optional and Modular Open-Source Genome Annotator (MOSGA) (11). can be defined by marking an uploaded genome assembly as While the number of available genomic data increases con- an outgroup. tinuously, the quality of these data is as important as their availability. Low-quality genome assemblies will introduce Average nucleotide identity. For whole-genome similarity errors (12) and, as such, affect the annotation quality. In addi- metrics, MOSGA 2 includes a comparative genomics work- tion, genome assemblies can be incomplete or contaminated, flow that calculates the Average Nucleotide Identity (ANI) Martin et al. | bioRχiv | July 30, 2021 | 1–6 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. across all genomes. To achieve this, we integrated Fas- prediction, MOSGA 2 offers BlobTools (33) to estimate the tANI (29), which compares the genomes against each other. taxonomical source of each scaffold. Such sources may be Further, the ANI values are represented as a heatmap. helpful to identify putative biological contaminants. Addi- tionally, MOSGA 2 includes NCBI’s VecScreen based on the Protein-coding genes comparison. MOSGA 2 integrates UniVec database to identify adapter sequences. a comparative genomics workflow that compares protein- coding genes from all uploaded genomes against each other. Organellar DNA scanner. MOSGA 2 is optimized for This comparison requires a previous annotation of protein- annotating nuclear DNA sequences from eukaryotic cells coding genes, which can be achieved via the annotation and is less suited for organellar DNA, such as mitochon- pipeline or by importing already annotated genomes as GBFF drial and plastid genomes. To identify organellar DNA (GenBank flat format) files. This allows, for example, in genome assemblies, MOSGA 2 combines information a comparison between different gene prediction tools or from GC-content, plastid and mitochondrial reference pro- between reference and experimental annotations. Techni- tein databases with RNA prediction algorithms such as bar- cally, MOSGA 2 extracts the protein-coding sequences and rnap and tRNAscan-SE 2.0 (34). At the end of each an- matches them against each other. Matches above a defined notation job, MOSGA 2 creates a relative ranking with the threshold will be binned back to a genome, and the aver- most likely scaffolds. The scoring is an arithmetic calculation age coding content similarities between the genomes are dis- based on the density of organellar-specific genes, the num- played as a heatmap. This analysis allows consistency checks bers of tRNAs, rRNAs, and the GC-content variance. To cre- for gene predictions across different genomes. This method ate the plastid and mitochondria reference proteins databases, compares the nucleotide sequence of protein-coding genes we clustered respective RefSeq (35) databases with MM- and has similarities with the concept of Average Amino Acids Seq2 (36) and only kept representative sequences of these Identity. clusters into our databases to remove redundancy. Genome completeness. The identification of single-copy Taxonomy search. In several MOSGA annotation jobs, we genes is a crucial step for phylogenetic analysis. Therefore, observed that multiple users did not select the best suitable we integrated BUSCO and EukCC into MOSGA 2. While gene-prediction models for the given data. Depending on the BUSCO’s data source is OrthoDB (31), EukCC relies on selection of gene prediction tools, this task could be chal- PANTHER (32). These tools can estimate the completeness lenging since, for example, the gene predictor Augustus (37) of assemblies, and MOSGA 2 integrates them for validation currently includes already 80 species-specific models. Identi- into the annotation and the comparative genomics workflow. fying the most suitable models requires knowledge or an ed- Genome completeness results for each genome are visualized ucated guess about each listed species and their relatedness. together in the comparative genomics workflow and the an- To support the user during this task, we implemented a tax- notation workflow separately.
Recommended publications
  • Integrative Annotation of 21,037 Human Genes Validated by Full-Length Cdna Clones
    PLoS BIOLOGY Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones Tadashi Imanishi1, Takeshi Itoh1,2, Yutaka Suzuki3,68, Claire O’Donovan4, Satoshi Fukuchi5, Kanako O. Koyanagi6, Roberto A. Barrero5, Takuro Tamura7,8, Yumi Yamaguchi-Kabata1, Motohiko Tanino1,7, Kei Yura9, Satoru Miyazaki5, Kazuho Ikeo5, Keiichi Homma5, Arek Kasprzyk4, Tetsuo Nishikawa10,11, Mika Hirakawa12, Jean Thierry-Mieg13,14, Danielle Thierry-Mieg13,14, Jennifer Ashurst15, Libin Jia16, Mitsuteru Nakao3, Michael A. Thomas17, Nicola Mulder4, Youla Karavidopoulou4, Lihua Jin5, Sangsoo Kim18, Tomohiro Yasuda11, Boris Lenhard19, Eric Eveno20,21, Yoshiyuki Suzuki5, Chisato Yamasaki1, Jun-ichi Takeda1, Craig Gough1,7, Phillip Hilton1,7, Yasuyuki Fujii1,7, Hiroaki Sakai1,7,22, Susumu Tanaka1,7, Clara Amid23, Matthew Bellgard24, Maria de Fatima Bonaldo25, Hidemasa Bono26, Susan K. Bromberg27, Anthony J. Brookes19, Elspeth Bruford28, Piero Carninci29, Claude Chelala20, Christine Couillault20,21, Sandro J. de Souza30, Marie-Anne Debily20, Marie-Dominique Devignes31, Inna Dubchak32, Toshinori Endo33, Anne Estreicher34, Eduardo Eyras15, Kaoru Fukami-Kobayashi35, Gopal R. Gopinath36, Esther Graudens20,21, Yoonsoo Hahn18, Michael Han23, Ze-Guang Han21,37, Kousuke Hanada5, Hideki Hanaoka1, Erimi Harada1,7, Katsuyuki Hashimoto38, Ursula Hinz34, Momoki Hirai39, Teruyoshi Hishiki40, Ian Hopkinson41,42, Sandrine Imbeaud20,21, Hidetoshi Inoko1,7,43, Alexander Kanapin4, Yayoi Kaneko1,7, Takeya Kasukawa26, Janet Kelso44, Paul Kersey4, Reiko Kikuno45, Kouichi
    [Show full text]
  • Download Or Browsing at Phylomedb [118]
    Gerdol et al. Genome Biology (2020) 21:275 https://doi.org/10.1186/s13059-020-02180-3 RESEARCH Open Access Massive gene presence-absence variation shapes an open pan-genome in the Mediterranean mussel Marco Gerdol1, Rebeca Moreira2, Fernando Cruz3, Jessica Gómez-Garrido3, Anna Vlasova4, Umberto Rosani5, Paola Venier5, Miguel A. Naranjo-Ortiz4,6, Maria Murgarella7, Samuele Greco1, Pablo Balseiro2,8, André Corvelo3,9, Leonor Frias3, Marta Gut3, Toni Gabaldón4,6,10,11, Alberto Pallavicini1,12, Carlos Canchaya7,13,14, Beatriz Novoa2, Tyler S. Alioto3,6, David Posada7,13,14* and Antonio Figueras2* * Correspondence: dposada@uvigo. es; [email protected] Abstract 7Department of Biochemistry, Genetics and Immunology, Background: The Mediterranean mussel Mytilus galloprovincialis is an ecologically University of Vigo, 36310 Vigo, and economically relevant edible marine bivalve, highly invasive and resilient to Spain biotic and abiotic stressors causing recurrent massive mortalities in other bivalves. 2Instituto de Investigaciones Marinas (IIM - CSIC), Eduardo Although these traits have been recently linked with the maintenance of a high Cabello, 6, 36208 Vigo, Spain genetic variation within natural populations, the factors underlying the evolutionary Full list of author information is success of this species remain unclear. available at the end of the article Results: Here, after the assembly of a 1.28-Gb reference genome and the resequencing of 14 individuals from two independent populations, we reveal a complex pan-genomic architecture in M. galloprovincialis, with a core set of 45,000 genes plus a strikingly high number of dispensable genes (20,000) subject to presence-absence variation, which may be entirely missing in several individuals.
    [Show full text]
  • Potential Evidence for Transgenerational Epigenetic Memory in Arabidopsis Thaliana Following Spaceflight ✉ Peipei Xu 1,2, Haiying Chen1,2, Jinbo Hu1 & Weiming Cai 1
    ARTICLE https://doi.org/10.1038/s42003-021-02342-4 OPEN Potential evidence for transgenerational epigenetic memory in Arabidopsis thaliana following spaceflight ✉ Peipei Xu 1,2, Haiying Chen1,2, Jinbo Hu1 & Weiming Cai 1 Plants grown in spaceflight exhibited differential methylation responses and this is important because plants are sessile, they are constantly exposed to a variety of environmental pres- sures and respond to them in many ways. We previously showed that the Arabidopsis genome exhibited lower methylation level after spaceflight for 60 h in orbit. Here, using the offspring of the seedlings grown in microgravity environment in the SJ-10 satellite for 11 days and 1234567890():,; returned to Earth, we systematically studied the potential effects of spaceflight on DNA methylation, transcriptome, and phenotype in the offspring. Whole-genome methylation analysis in the first generation of offspring (F1) showed that, although there was no significant difference in methylation level as had previously been observed in the parent plants, some residual imprints of DNA methylation differences were detected. Combined DNA methylation and RNA-sequencing analysis indicated that expression of many pathways, such as the abscisic acid-activated pathway, protein phosphorylation, and nitrate signaling pathway, etc. were enriched in the F1 population. As some phenotypic differences still existed in the F2 generation, it was suggested that these epigenetic DNA methylation modifications were partially retained, resulting in phenotypic differences in the offspring. Furthermore, some of the spaceflight-induced heritable differentially methylated regions (DMRs) were retained. Changes in epigenetic modifications caused by spaceflight affected the growth of two future seed generations. Altogether, our research is helpful in better understanding the adaptation mechanism of plants to the spaceflight environment.
    [Show full text]
  • BRAKER2: Automatic Eukaryotic Genome Annotation with Genemark-EP+ and AUGUSTUS Supported by a Protein Database
    bioRxiv preprint doi: https://doi.org/10.1101/2020.08.10.245134; this version posted August 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database Tomáš Brůna1,*, Katharina J. Hoff2,3*, Alexandre Lomsadze4, Mario Stanke2,3,^ and Mark Borodovsky4,5,^,§ 1 School of Biological Science, Georgia Tech, Atlanta, GA 30332, USA 2 Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany, 3 Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany, 4 Wallace H Coulter Department of Biomedical Engineering, Georgia Tech and Emory University, Atlanta, GA 30332, USA 5 School of Computational Science and Engineering, Georgia Tech, Atlanta, GA 30332, USA * joint first authors ^ joint last authors § corresponding author [email protected] Abstract Full automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 had combined self-training GeneMark-ET with AUGUSTUS to generate genes’ coordinates with support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins.
    [Show full text]
  • Gene Prediction Using the Self-Organizing Map: Automatic Generation of Multiple Gene Models Shaun Mahony*1, James O Mcinerney2, Terry J Smith1 and Aaron Golden3
    BMC Bioinformatics BioMed Central Software Open Access Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models Shaun Mahony*1, James O McInerney2, Terry J Smith1 and Aaron Golden3 Address: 1National Centre for Biomedical Engineering Science, NUI, Galway, Galway, Ireland, 2Bioinformatics and Pharmacogenomics Laboratory, NUI, Maynooth, Co. Kildare, Ireland and 3Department of Information Technology, NUI, Galway, Galway, Ireland Email: Shaun Mahony* - [email protected]; James O McInerney - [email protected]; Terry J Smith - [email protected]; Aaron Golden - [email protected] * Corresponding author Published: 05 March 2004 Received: 26 November 2003 Accepted: 05 March 2004 BMC Bioinformatics 2004, 5:23 This article is available from: http://www.biomedcentral.com/1471-2105/5/23 © 2004 Mahony et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Abstract Background: Many current gene prediction methods use only one model to represent protein- coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation. Results: This work explores a new approach to gene-prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential.
    [Show full text]
  • Natural Single-Nucleosome Epi-Polymorphisms in Yeast
    Natural Single-Nucleosome Epi-Polymorphisms in Yeast Muniyandi Nagarajan1., Jean-Baptiste Veyrieras1., Maud de Dieuleveult1,He´le`ne Bottin1, Steffen Fehrmann1, Anne-Laure Abraham1,Se´verine Croze2, Lars M. Steinmetz3, Xavier Gidrol4, Gae¨ l Yvert1* 1 Laboratoire de Biologie Mole´culaire de la Cellule, Ecole Normale Supe´rieure de Lyon, CNRS, Universite´ de Lyon, Lyon, France, 2 ProfileXpert, Bron, France, 3 Genome Biology, European Molecular Biology Laboratory, Heidelberg, Germany, 4 Laboratoire d’Exploration Fonctionnelle des Ge´nomes, Commissariat a` l’Energie Atomique, Direction des Sciences du Vivant, Institut de Radiobiologie Cellulaire et Mole´culaire, Evry, France Abstract Epigenomes commonly refer to the sequence of presence/absence of specific epigenetic marks along eukaryotic chromatin. Complete histone-borne epigenomes have now been described at single-nucleosome resolution from various organisms, tissues, developmental stages, or diseases, yet their intra-species natural variation has never been investigated. We describe here that the epigenomic sequence of histone H3 acetylation at Lysine 14 (H3K14ac) differs greatly between two unrelated strains of the yeast Saccharomyces cerevisiae. Using single-nucleosome chromatin immunoprecipitation and mapping, we interrogated 58,694 nucleosomes and found that 5,442 of them differed in their level of H3K14 acetylation, at a false discovery rate (FDR) of 0.0001. These Single Nucleosome Epi-Polymorphisms (SNEPs) were enriched at regulatory sites and conserved non-coding DNA sequences. Surprisingly, higher acetylation in one strain did not imply higher expression of the relevant gene. However, SNEPs were enriched in genes of high transcriptional variability and one SNEP was associated with the strength of gene activation upon stimulation. Our observations suggest a high level of inter-individual epigenomic variation in natural populations, with essential questions on the origin of this diversity and its relevance to gene x environment interactions.
    [Show full text]
  • Comparative Genomics
    17 Aug 2004 15:49 AR AR223-GG05-02.tex AR223-GG05-02.sgm LaTeX2e(2002/01/18) P1: IKH 10.1146/annurev.genom.5.061903.180057 Annu. Rev. Genomics Hum. Genet. 2004. 5:15–56 doi: 10.1146/annurev.genom.5.061903.180057 Copyright c 2004 by Annual Reviews. All rights reserved First published online as a Review in Advance on April 15, 2004 COMPARATIVE GENOMICS Webb Miller, Kateryna D. Makova, Anton Nekrutenko, and Ross C. Hardison The Center for Comparative Genomics and Bioinformatics, The Huck Institutes of Life Sciences, and the Departments of Biology, Computer Science and Engineering, and Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania; email: [email protected], [email protected], [email protected], [email protected] KeyWords whole-genome alignments, evolutionary rates, gene prediction, gene regulation, sequence conservation, Internet resources, genome browsers, genome databases, bioinformatics ■ Abstract The genomes from three mammals (human, mouse, and rat), two worms, and several yeasts have been sequenced, and more genomes will be completed in the near future for comparison with those of the major model organisms. Scientists have used various methods to align and compare the sequenced genomes to address critical issues in genome function and evolution. This review covers some of the major new insights about gene content, gene regulation, and the fraction of mammalian genomes that are under purifying selection and presumed functional. We review the evolution- ary processes that shape genomes, with particular attention to variation in rates within genomes and along different lineages. Internet resources for accessing and analyzing the treasure trove of sequence alignments and annotations are reviewed, and we dis- cuss critical problems to address in new bioinformatic developments in comparative genomics.
    [Show full text]
  • Gene Prediction in Metagenomic Sequencing Reads Phd
    Gene prediction in metagenomic sequencing reads PhD Thesis in partial fulfilment of the requirements for the degree “Doctor of Philosophy (PhD)” in the Molecular Biology Program at the Georg August University G¨ottingen, Faculty of Biology submitted by Katharina Jasmin Hoff born in Siegburg G¨ottingen 2009 Thesis Supervisor: Dr. Peter Meinicke Doctoral Committee: Prof. Dr. Burkhard Morgenstern (1st Referee) Prof. Dr. Wolfgang Liebl (2nd Referee) Prof. Dr. Lutz Walter I Affidavit I hereby insure that this PhD thesis has been written independently and with no other sources and aids than quoted. Katharina J. Hoff August, 2009 G¨ottingen, Germany II List of Publications Papers in Peer Reviewed Journals • K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, P. Meinicke Gene prediction in metagenomic fragments: a large scale machine learning approach BMC Bioinformatics 2008 9:217 • K. J. Hoff, T. Lingner, P. Meinicke, M. Tech Orphelia: predicting genes in metagenomic sequencing reads Nucleic Acids Research 2009 37:W101-W105 • K. J. Hoff. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009 10:520 Posters at International Conferences • K. J. Hoff, M. Tech, P. Meinicke Gene prediction in metagenomic DNA fragments with machine learning techniques Genomes 2008, Paris, France • K. J. Hoff, M. Tech, P. Meinicke Predicting genes on metagenomic pyrosequencing reads with machine learning tech- niques German Conference on Bioinformatics 2008, Dresden, Germany • K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, P. Meinicke Gene prediction in metagenomic DNA fragments Horizons in Molecular Biology 2008, G¨ottingen, Germany • K. J. Hoff, M. Tech, P.
    [Show full text]
  • Chromosome-Level Assembly of the Caenorhabditis Remanei
    Genetics: Early Online, published on February 28, 2020 as 10.1534/genetics.119.303018 1 Chromosome-level assembly of the Caenorhabditis remanei 2 genome reveals conserved patterns of nematode genome 3 organization 4 Anastasia A. Teterina*†, John H. Willis*, Patrick C. Phillips*1 5 * Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA; 6 † Center of Parasitology, Severtsov Institute of Ecology and Evolution RAS, Moscow, 7 117071, Russia 8 1 Corresponding author: Institute of Ecology and Evolution and Department of Biology, 9 5289 University of Oregon, Eugene, OR 97403. E-mail: [email protected] 1 Copyright 2020. 10 Abstract 11 The nematode Caenorhabditis elegans is one of the key model systems in biology, 12 including possessing the first fully assembled animal genome. Whereas C. elegans is a 13 self-reproducing hermaphrodite with fairly limited within-population variation, its relative 14 C. remanei is an outcrossing species with much more extensive genetic variation, 15 making it an ideal parallel model system for evolutionary genetic investigations. Here, 16 we greatly improve on previous assemblies by generating a chromosome-level 17 assembly of the entire C. remanei genome (124.8 Mb of total size) using long-read 18 sequencing and chromatin conformation capture data. Like other fully assembled 19 genomes in the genus, we find that the C. remanei genome displays a high degree of 20 synteny with C. elegans despite multiple within-chromosome rearrangements. Both 21 genomes have high gene density in central regions of chromosomes relative to 22 chromosome ends and the opposite pattern for the accumulation of repetitive elements.
    [Show full text]
  • Comparative Genomics Approaches to Study Organism Similarities and Differences
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Journal of Biomedical Informatics 35 (2002) 142–150 www.academicpress.com Methodological Review Comparative genomics approaches to study organism similarities and differences Liping Wei,a,b,* Yueyi Liu,c Inna Dubchak,d John Shon,a and John Parka a Nexus Genomics, Inc., 229 Polaris Ave., Suite 6, Mountain View, CA 94043, USA b Hang Zhou Center, Beijing Genomics Institute, Hangzhou, China c Stanford Medical Informatics, 251 Campus Dr. X215, Stanford University, Stanford CA 94305, USA d Genome Sciences Department, MS 84-171, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Received 7 June 2002 Abstract Comparative genomics is a large-scale, holistic approach that compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes. Comparative studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms. We discuss in detail the type of analyses that offer significant biological insights in the comparisons of (1) genome structure including overall genome statistics, repeats, genome rearrangement at both DNA and gene level, synteny, and breakpoints; (2) coding regions including gene content, protein content, orthologs, and paralogs; and (3) noncoding regions including the prediction of regulatory elements. We also briefly review the currently available computational tools in comparative genomics such as algorithms for genome-scale sequence alignment, gene identification, and nonhomology-based function prediction. Ó 2002 Elsevier Science (USA). All rights reserved. 1. Introduction genomes and the vast range of genome sizes.
    [Show full text]
  • Genetic and Physiological Data Implicating the New Human Gene G72 and the Gene for D-Amino Acid Oxidase in Schizophrenia
    Corrections BIOCHEMISTRY. For the article ‘‘Interaction of RNA polymerase with forked DNA: Evidence for two kinetically significant intermediates on the pathway to the final complex,’’ by Laura Tsujikawa, Oleg V. Tsodikov, and Pieter L. deHaseth, which ap- peared in number 6, March 19, 2002, of Proc. Natl. Acad. Sci. USA (99, 3493–3498; First Published March 12, 2002; 10.1073͞ pnas.062487299), the authors note the following concerning RNA polymerase (RNAP) concentrations. No correction was made for the fraction of RNAP (0.5) that is active in promoter binding. With this correction, the values of K1 and Kapp (but not Kf) would increase by about a factor of 2. The relative values would remain essentially unchanged. Also, the legends to Figs. 2, 3, and 5 contain errors pertaining to the symbols used for data obtained with and without heparin challenge, the duration of the challenge, and the concentration of added heparin. The figures and the corrected legends appear below. Fig. 3. Kinetics of complex formation. RNAP (65 nM) and wt forked DNA (1 nM) were incubated for various time intervals and then complex formation was determined immediately (Ϫheparin) or after a 2-min challenge with 100 ␮g͞ml heparin (ϩheparin). The Ϫheparin data (■) were fit (error-weighted) Ϫ1 with Eq. 8 with a2 ϭ 0(kaϪ ϭ 0.10 Ϯ 0.01 s ) and the ϩheparin data (Œ) with ϭ Ϯ Ϫ1 ϭ both single (kaϩ 0.036 0.004 s ; thin line) and double-exponential (ka1 Ϯ Ϫ1 ϭ Ϯ ϫ Ϫ4 Ϫ1 0.044 0.002 s ; ka2 (5 3) 10 s ; thick line) equations.
    [Show full text]
  • Proteogenomic Analysis and Global Discovery of Posttranslational
    Proteogenomic analysis and global discovery of PNAS PLUS posttranslational modifications in prokaryotes Ming-kun Yanga,1, Yao-hua Yanga,1, Zhuo Chena, Jia Zhanga, Yan Lina, Yan Wanga, Qian Xionga, Tao Lia,2, Feng Gea,2, Donald A. Bryantb, and Jin-dong Zhaoa aKey Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; and bDepartment of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802 Edited* by Jiayang Li, Chinese Academy of Sciences, Beijing, China, and approved November 13, 2014 (received for review July 6, 2014) We describe an integrated workflow for proteogenomic analysis and cyanobacterial cells adjust their cellular activities in response and global profiling of posttranslational modifications (PTMs) in to a wide range of environmental cues and stimuli. Recently, prokaryotes and use the model cyanobacterium Synechococcus sp. cyanobacteria have attracted great interest due to their crucial PCC 7002 (hereafter Synechococcus 7002) as a test case. We found roles in global carbon and nitrogen cycles and their ability to more than 20 different kinds of PTMs, and a holistic view of PTM produce clean and renewable biofuels such as hydrogen (14–16). events in this organism grown under different conditions was Synechococcus 7002 is a unicellular, marine cyanobacterium and obtained without specific enrichment strategies. Among 3,186 pre- a model organism for studying photosynthetic carbon fixation dicted protein-coding genes, 2,938 gene products (>92%) were and the development of biofuels (17, 18). However, whereas the identified. We also identified 118 previously unidentified proteins genome of Synechococcus 7002 is fully sequenced, it is annotated and corrected 38 predicted gene-coding regions in the Synecho- only by in silico methods (www.ncbi.nlm.nih.gov/), with a large coccus 7002 genome.
    [Show full text]