Gene Prediction Using the Self-Organizing Map: Automatic Generation of Multiple Gene Models Shaun Mahony*1, James O Mcinerney2, Terry J Smith1 and Aaron Golden3

Total Page:16

File Type:pdf, Size:1020Kb

Gene Prediction Using the Self-Organizing Map: Automatic Generation of Multiple Gene Models Shaun Mahony*1, James O Mcinerney2, Terry J Smith1 and Aaron Golden3 BMC Bioinformatics BioMed Central Software Open Access Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models Shaun Mahony*1, James O McInerney2, Terry J Smith1 and Aaron Golden3 Address: 1National Centre for Biomedical Engineering Science, NUI, Galway, Galway, Ireland, 2Bioinformatics and Pharmacogenomics Laboratory, NUI, Maynooth, Co. Kildare, Ireland and 3Department of Information Technology, NUI, Galway, Galway, Ireland Email: Shaun Mahony* - [email protected]; James O McInerney - [email protected]; Terry J Smith - [email protected]; Aaron Golden - [email protected] * Corresponding author Published: 05 March 2004 Received: 26 November 2003 Accepted: 05 March 2004 BMC Bioinformatics 2004, 5:23 This article is available from: http://www.biomedcentral.com/1471-2105/5/23 © 2004 Mahony et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Abstract Background: Many current gene prediction methods use only one model to represent protein- coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation. Results: This work explores a new approach to gene-prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential. Conclusions: While its raw accuracy rate can be less than other methods, RescueNet consistently identifies some genes that other methods do not, and should therefore be of interest to gene- prediction software developers and genome annotation teams alike. RescueNet is recommended for use in conjunction with, or as a complement to, other gene prediction methods. Background ceding bases in the sequence. These methods are Computational gene prediction methods have yet to undoubtedly very successful, with published sensitivity achieve perfect accuracy, even in the relatively simple rates between 90% and 99% for most prokaryotic prokaryotic genomes. Problems in gene prediction centre genomes. However, as the sensitivity rates of the methods on the fact that many protein families remain uncharac- rise, specificity generally tends to fall, and while the appli- terised. As a result, it seems that only approximately half cation of sophisticated post-processing rules can correct of an organism's genes can be confidently predicted on many false-positive predictions, no method has yet the basis of homology to other known genes [1-3], so ab achieved 100% accuracy. This is especially the case in the initio prediction methods are usually employed to identify more complex eukaryotic gene-finding problem, where many protein-coding regions of DNA. less than 80% of exons in anonymous genomic sequences are correctly predicted by current methods [2,6-8]. Currently, the most popular prokaryotic gene-prediction methods, such as GeneMark.hmm [4] and Glimmer2 [5], For the foreseeable future it does not seem that the exact are based on probabilistic Markov models that aim to pre- set of genes in any organism can be automatically pre- dict each base of a DNA sequence using a number of pre- dicted by any single computational method. In practice, Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/23 this has meant that the best predictions are to be found by training data would not be useful in the case of Markov- combining evidence from two or more independent based methods, as the data contained in each sub-cluster methods [3,9]. Genome annotation teams often compare may not allow for a good estimation of model parameters. the evidence offered by multiple gene-finders in order to However, generating more specific models for subtle pat- predict the gene complement of a given genome. Because terns found in the training set can only be advantageous if of the degree of 'manual' annotation that now takes place it can be done in a way that minimises loss of overall accu- in the major genome sequencing centres, a gene-predic- racy and produces no extra false-positive predictions. tion tool will be of practical use if it can exclusively predict genes that other gene-finders cannot. This paper aims to show how the Self-Organizing Map neural network algorithm can be used to automatically Many ab initio gene-prediction methods are based on sin- identify the major trends in oligonucleotide variation in a gle models of protein-coding regions and therefore make genome, and in doing so provide multiple gene models the implicit assumption that all protein-coding regions for use in gene prediction. It will be explained that this within a particular genome will share similar statistical approach is an effective solution to the problem of intra- properties. However, evidence has mounted that single genomic variation. Specific examples of genes predicted gene models of intrinsic coding measures are no longer only by this method are offered, thus demonstrating the fully satisfying [10,11]. The problem with single model usefulness of the approach in genome annotation. methods centres on the degree of oligonucleotide compo- sition variation that exists within most genomes. On the A further advantage of using the Self-Organizing Map for codon level, intra-genomic variation in codon bias has gene prediction is the ability of the algorithm to use com- long been correlated with expression level [12]. Counter- plex descriptors as measures of gene coding potential. We balancing the translational selection theory of codon bias demonstrate this ability using relative synonymous codon is the effect of mutational bias [13,14]. Many other, often usage (RSCU) as our measure of gene coding potential. more subtle levels of variation have been recognised over Unlike other gene coding measures, RSCU is not based on the years, with many disparate evolutionary pressures the absolute frequency of k-mers, but instead describes shown to be acting on codon usage bias [15]. For exam- the codon choice for each amino acid. Markov chains ple, strand-specific codon usage biases have often been based on the RSCU measure would have transition prob- recognised [16-20], leading to more general studies of cor- abilities that are conditioned on the underlying amino- relation between the location of the gene on the genome acids. Although theoretically possible [26], the practical and codon bias [21], and the more specific discovery of a computation of such Markov chains would give rise to A+T skewed bias near the replication terminus of bacterial major difficulties. Therefore, the ability of our approach genomes [22]. Other effects shown to shape codon usage to make use of a sophisticated gene coding descriptor such are gene length [23] and selection at the amino acid level as RSCU is a distinct advantage of our approach over [24]. It has also been suggested that content variation can Markov model based methods. occur at the exon level in eukaryotic genes, the possibility existing that some exons in a gene may have different Implementation codon usage patterns to others [1]. Given that some of the Coding measure above pressures on codon usage have only recently been In this study, relative synonymous codon usage (RSCU) discovered, it is likely that some more subtle patterns have vectors are used as the measure of protein-coding poten- yet to be recognised, and therefore it is difficult to predict tial for a given window of sequence. The RSCU value for a the level of compositional variation that will be present in codon 'i' is defined as: an anonymous genomic sequence. Obs RSCU = i ()1 The need for gene-finding methods that can overcome the i Exp problems presented by intra-genomic variation was recog- i nised and addressed in the case of prokaryotic genomes by where Obsi is the observed number of occurrences of GeneMark-Genesis [25], which derives two models for codon 'i', and Expi is the expected number of occurrences each genome according to typical and atypical codon of the same codon (based on the number of times the rel- usage clusters in that genome. This increase in the number evant amino acid is present in the gene and the number of of gene models led to an increase in accuracy of the Gen- synonymous alternatives to 'i', assuming a uniform choice eMark method. While Hayes & Borodovsky experimented of synonymous codons). In order to make the data more with a third ('highly-typical') codon usage cluster and an compatible with the mathematical methods used, the log associated model in some cases, they did not see the need of each RSCUi value is found so that the resulting value is to further sub-cluster the atypical codon usage set in order positive if the codon is used more than expected in that to make even more models. Overly sub-clustering the gene, and negative if the codon is used less than expected. Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/23 Values were capped at ± 10, and set to 0 in the case of the In this work, similarity between two vectors is measured non-occurrence of an amino acid in the sample. Taking by finding the cosine of the angle between them. A cosine the RSCU values for each of the codons with synonymous of 1 represents exactly similar vectors while a cosine of 0 alternatives (and ignoring the 3 stop codons and the Trp represents exactly dissimilar vectors.
Recommended publications
  • Integrative Annotation of 21,037 Human Genes Validated by Full-Length Cdna Clones
    PLoS BIOLOGY Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones Tadashi Imanishi1, Takeshi Itoh1,2, Yutaka Suzuki3,68, Claire O’Donovan4, Satoshi Fukuchi5, Kanako O. Koyanagi6, Roberto A. Barrero5, Takuro Tamura7,8, Yumi Yamaguchi-Kabata1, Motohiko Tanino1,7, Kei Yura9, Satoru Miyazaki5, Kazuho Ikeo5, Keiichi Homma5, Arek Kasprzyk4, Tetsuo Nishikawa10,11, Mika Hirakawa12, Jean Thierry-Mieg13,14, Danielle Thierry-Mieg13,14, Jennifer Ashurst15, Libin Jia16, Mitsuteru Nakao3, Michael A. Thomas17, Nicola Mulder4, Youla Karavidopoulou4, Lihua Jin5, Sangsoo Kim18, Tomohiro Yasuda11, Boris Lenhard19, Eric Eveno20,21, Yoshiyuki Suzuki5, Chisato Yamasaki1, Jun-ichi Takeda1, Craig Gough1,7, Phillip Hilton1,7, Yasuyuki Fujii1,7, Hiroaki Sakai1,7,22, Susumu Tanaka1,7, Clara Amid23, Matthew Bellgard24, Maria de Fatima Bonaldo25, Hidemasa Bono26, Susan K. Bromberg27, Anthony J. Brookes19, Elspeth Bruford28, Piero Carninci29, Claude Chelala20, Christine Couillault20,21, Sandro J. de Souza30, Marie-Anne Debily20, Marie-Dominique Devignes31, Inna Dubchak32, Toshinori Endo33, Anne Estreicher34, Eduardo Eyras15, Kaoru Fukami-Kobayashi35, Gopal R. Gopinath36, Esther Graudens20,21, Yoonsoo Hahn18, Michael Han23, Ze-Guang Han21,37, Kousuke Hanada5, Hideki Hanaoka1, Erimi Harada1,7, Katsuyuki Hashimoto38, Ursula Hinz34, Momoki Hirai39, Teruyoshi Hishiki40, Ian Hopkinson41,42, Sandrine Imbeaud20,21, Hidetoshi Inoko1,7,43, Alexander Kanapin4, Yayoi Kaneko1,7, Takeya Kasukawa26, Janet Kelso44, Paul Kersey4, Reiko Kikuno45, Kouichi
    [Show full text]
  • Download Or Browsing at Phylomedb [118]
    Gerdol et al. Genome Biology (2020) 21:275 https://doi.org/10.1186/s13059-020-02180-3 RESEARCH Open Access Massive gene presence-absence variation shapes an open pan-genome in the Mediterranean mussel Marco Gerdol1, Rebeca Moreira2, Fernando Cruz3, Jessica Gómez-Garrido3, Anna Vlasova4, Umberto Rosani5, Paola Venier5, Miguel A. Naranjo-Ortiz4,6, Maria Murgarella7, Samuele Greco1, Pablo Balseiro2,8, André Corvelo3,9, Leonor Frias3, Marta Gut3, Toni Gabaldón4,6,10,11, Alberto Pallavicini1,12, Carlos Canchaya7,13,14, Beatriz Novoa2, Tyler S. Alioto3,6, David Posada7,13,14* and Antonio Figueras2* * Correspondence: dposada@uvigo. es; [email protected] Abstract 7Department of Biochemistry, Genetics and Immunology, Background: The Mediterranean mussel Mytilus galloprovincialis is an ecologically University of Vigo, 36310 Vigo, and economically relevant edible marine bivalve, highly invasive and resilient to Spain biotic and abiotic stressors causing recurrent massive mortalities in other bivalves. 2Instituto de Investigaciones Marinas (IIM - CSIC), Eduardo Although these traits have been recently linked with the maintenance of a high Cabello, 6, 36208 Vigo, Spain genetic variation within natural populations, the factors underlying the evolutionary Full list of author information is success of this species remain unclear. available at the end of the article Results: Here, after the assembly of a 1.28-Gb reference genome and the resequencing of 14 individuals from two independent populations, we reveal a complex pan-genomic architecture in M. galloprovincialis, with a core set of 45,000 genes plus a strikingly high number of dispensable genes (20,000) subject to presence-absence variation, which may be entirely missing in several individuals.
    [Show full text]
  • Potential Evidence for Transgenerational Epigenetic Memory in Arabidopsis Thaliana Following Spaceflight ✉ Peipei Xu 1,2, Haiying Chen1,2, Jinbo Hu1 & Weiming Cai 1
    ARTICLE https://doi.org/10.1038/s42003-021-02342-4 OPEN Potential evidence for transgenerational epigenetic memory in Arabidopsis thaliana following spaceflight ✉ Peipei Xu 1,2, Haiying Chen1,2, Jinbo Hu1 & Weiming Cai 1 Plants grown in spaceflight exhibited differential methylation responses and this is important because plants are sessile, they are constantly exposed to a variety of environmental pres- sures and respond to them in many ways. We previously showed that the Arabidopsis genome exhibited lower methylation level after spaceflight for 60 h in orbit. Here, using the offspring of the seedlings grown in microgravity environment in the SJ-10 satellite for 11 days and 1234567890():,; returned to Earth, we systematically studied the potential effects of spaceflight on DNA methylation, transcriptome, and phenotype in the offspring. Whole-genome methylation analysis in the first generation of offspring (F1) showed that, although there was no significant difference in methylation level as had previously been observed in the parent plants, some residual imprints of DNA methylation differences were detected. Combined DNA methylation and RNA-sequencing analysis indicated that expression of many pathways, such as the abscisic acid-activated pathway, protein phosphorylation, and nitrate signaling pathway, etc. were enriched in the F1 population. As some phenotypic differences still existed in the F2 generation, it was suggested that these epigenetic DNA methylation modifications were partially retained, resulting in phenotypic differences in the offspring. Furthermore, some of the spaceflight-induced heritable differentially methylated regions (DMRs) were retained. Changes in epigenetic modifications caused by spaceflight affected the growth of two future seed generations. Altogether, our research is helpful in better understanding the adaptation mechanism of plants to the spaceflight environment.
    [Show full text]
  • BRAKER2: Automatic Eukaryotic Genome Annotation with Genemark-EP+ and AUGUSTUS Supported by a Protein Database
    bioRxiv preprint doi: https://doi.org/10.1101/2020.08.10.245134; this version posted August 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database Tomáš Brůna1,*, Katharina J. Hoff2,3*, Alexandre Lomsadze4, Mario Stanke2,3,^ and Mark Borodovsky4,5,^,§ 1 School of Biological Science, Georgia Tech, Atlanta, GA 30332, USA 2 Institute of Mathematics and Computer Science, University of Greifswald, 17489 Greifswald, Germany, 3 Center for Functional Genomics of Microbes, University of Greifswald, 17489 Greifswald, Germany, 4 Wallace H Coulter Department of Biomedical Engineering, Georgia Tech and Emory University, Atlanta, GA 30332, USA 5 School of Computational Science and Engineering, Georgia Tech, Atlanta, GA 30332, USA * joint first authors ^ joint last authors § corresponding author [email protected] Abstract Full automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 had combined self-training GeneMark-ET with AUGUSTUS to generate genes’ coordinates with support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins.
    [Show full text]
  • Natural Single-Nucleosome Epi-Polymorphisms in Yeast
    Natural Single-Nucleosome Epi-Polymorphisms in Yeast Muniyandi Nagarajan1., Jean-Baptiste Veyrieras1., Maud de Dieuleveult1,He´le`ne Bottin1, Steffen Fehrmann1, Anne-Laure Abraham1,Se´verine Croze2, Lars M. Steinmetz3, Xavier Gidrol4, Gae¨ l Yvert1* 1 Laboratoire de Biologie Mole´culaire de la Cellule, Ecole Normale Supe´rieure de Lyon, CNRS, Universite´ de Lyon, Lyon, France, 2 ProfileXpert, Bron, France, 3 Genome Biology, European Molecular Biology Laboratory, Heidelberg, Germany, 4 Laboratoire d’Exploration Fonctionnelle des Ge´nomes, Commissariat a` l’Energie Atomique, Direction des Sciences du Vivant, Institut de Radiobiologie Cellulaire et Mole´culaire, Evry, France Abstract Epigenomes commonly refer to the sequence of presence/absence of specific epigenetic marks along eukaryotic chromatin. Complete histone-borne epigenomes have now been described at single-nucleosome resolution from various organisms, tissues, developmental stages, or diseases, yet their intra-species natural variation has never been investigated. We describe here that the epigenomic sequence of histone H3 acetylation at Lysine 14 (H3K14ac) differs greatly between two unrelated strains of the yeast Saccharomyces cerevisiae. Using single-nucleosome chromatin immunoprecipitation and mapping, we interrogated 58,694 nucleosomes and found that 5,442 of them differed in their level of H3K14 acetylation, at a false discovery rate (FDR) of 0.0001. These Single Nucleosome Epi-Polymorphisms (SNEPs) were enriched at regulatory sites and conserved non-coding DNA sequences. Surprisingly, higher acetylation in one strain did not imply higher expression of the relevant gene. However, SNEPs were enriched in genes of high transcriptional variability and one SNEP was associated with the strength of gene activation upon stimulation. Our observations suggest a high level of inter-individual epigenomic variation in natural populations, with essential questions on the origin of this diversity and its relevance to gene x environment interactions.
    [Show full text]
  • Comparative Genomics
    17 Aug 2004 15:49 AR AR223-GG05-02.tex AR223-GG05-02.sgm LaTeX2e(2002/01/18) P1: IKH 10.1146/annurev.genom.5.061903.180057 Annu. Rev. Genomics Hum. Genet. 2004. 5:15–56 doi: 10.1146/annurev.genom.5.061903.180057 Copyright c 2004 by Annual Reviews. All rights reserved First published online as a Review in Advance on April 15, 2004 COMPARATIVE GENOMICS Webb Miller, Kateryna D. Makova, Anton Nekrutenko, and Ross C. Hardison The Center for Comparative Genomics and Bioinformatics, The Huck Institutes of Life Sciences, and the Departments of Biology, Computer Science and Engineering, and Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania; email: [email protected], [email protected], [email protected], [email protected] KeyWords whole-genome alignments, evolutionary rates, gene prediction, gene regulation, sequence conservation, Internet resources, genome browsers, genome databases, bioinformatics ■ Abstract The genomes from three mammals (human, mouse, and rat), two worms, and several yeasts have been sequenced, and more genomes will be completed in the near future for comparison with those of the major model organisms. Scientists have used various methods to align and compare the sequenced genomes to address critical issues in genome function and evolution. This review covers some of the major new insights about gene content, gene regulation, and the fraction of mammalian genomes that are under purifying selection and presumed functional. We review the evolution- ary processes that shape genomes, with particular attention to variation in rates within genomes and along different lineages. Internet resources for accessing and analyzing the treasure trove of sequence alignments and annotations are reviewed, and we dis- cuss critical problems to address in new bioinformatic developments in comparative genomics.
    [Show full text]
  • MOSGA 2: Comparative Genomics and Validation Tools
    bioRxiv preprint doi: https://doi.org/10.1101/2021.07.29.454382; this version posted July 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. MOSGA 2: Comparative genomics and validation tools Roman Martin1, Hagen Dreßler1, Georges Hattab1, Thomas Hackl2, Matthias G. Fischer2, and Dominik Heider1, 1Department of Mathematics and Computer Science, University of Marburg, 35032 Marburg, Germany 2Department of Biomolecular Mechanisms, Max Planck Institute for Medical Research, Heidelberg 69120, Germany Due to the highly growing number of available genomic infor- which does not necessarily affect the annotation process but mation, the need for accessible and easy-to-use analysis tools is will hinder the submission process and may lead to wrong increasing. To facilitate eukaryotic genome annotations, we cre- conclusions. This study presents the new functionalities in ated MOSGA. In this work, we show how MOSGA 2 is devel- MOSGA 2, such as new workflows for comparative genomics oped by including several advanced analyses for genomic data. and the introduction of validation tools into the framework. Since the genomic data quality greatly impacts the annotation In brief, MOSGA 2 incorporates comparative genomic work- quality, we included multiple tools to validate and ensure high- flows derived from Hackl et al. (13) and validation tools from quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can Pirovano et al. (14).
    [Show full text]
  • Gene Prediction in Metagenomic Sequencing Reads Phd
    Gene prediction in metagenomic sequencing reads PhD Thesis in partial fulfilment of the requirements for the degree “Doctor of Philosophy (PhD)” in the Molecular Biology Program at the Georg August University G¨ottingen, Faculty of Biology submitted by Katharina Jasmin Hoff born in Siegburg G¨ottingen 2009 Thesis Supervisor: Dr. Peter Meinicke Doctoral Committee: Prof. Dr. Burkhard Morgenstern (1st Referee) Prof. Dr. Wolfgang Liebl (2nd Referee) Prof. Dr. Lutz Walter I Affidavit I hereby insure that this PhD thesis has been written independently and with no other sources and aids than quoted. Katharina J. Hoff August, 2009 G¨ottingen, Germany II List of Publications Papers in Peer Reviewed Journals • K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, P. Meinicke Gene prediction in metagenomic fragments: a large scale machine learning approach BMC Bioinformatics 2008 9:217 • K. J. Hoff, T. Lingner, P. Meinicke, M. Tech Orphelia: predicting genes in metagenomic sequencing reads Nucleic Acids Research 2009 37:W101-W105 • K. J. Hoff. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009 10:520 Posters at International Conferences • K. J. Hoff, M. Tech, P. Meinicke Gene prediction in metagenomic DNA fragments with machine learning techniques Genomes 2008, Paris, France • K. J. Hoff, M. Tech, P. Meinicke Predicting genes on metagenomic pyrosequencing reads with machine learning tech- niques German Conference on Bioinformatics 2008, Dresden, Germany • K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, P. Meinicke Gene prediction in metagenomic DNA fragments Horizons in Molecular Biology 2008, G¨ottingen, Germany • K. J. Hoff, M. Tech, P.
    [Show full text]
  • Chromosome-Level Assembly of the Caenorhabditis Remanei
    Genetics: Early Online, published on February 28, 2020 as 10.1534/genetics.119.303018 1 Chromosome-level assembly of the Caenorhabditis remanei 2 genome reveals conserved patterns of nematode genome 3 organization 4 Anastasia A. Teterina*†, John H. Willis*, Patrick C. Phillips*1 5 * Institute of Ecology and Evolution, University of Oregon, Eugene, OR, 97403, USA; 6 † Center of Parasitology, Severtsov Institute of Ecology and Evolution RAS, Moscow, 7 117071, Russia 8 1 Corresponding author: Institute of Ecology and Evolution and Department of Biology, 9 5289 University of Oregon, Eugene, OR 97403. E-mail: [email protected] 1 Copyright 2020. 10 Abstract 11 The nematode Caenorhabditis elegans is one of the key model systems in biology, 12 including possessing the first fully assembled animal genome. Whereas C. elegans is a 13 self-reproducing hermaphrodite with fairly limited within-population variation, its relative 14 C. remanei is an outcrossing species with much more extensive genetic variation, 15 making it an ideal parallel model system for evolutionary genetic investigations. Here, 16 we greatly improve on previous assemblies by generating a chromosome-level 17 assembly of the entire C. remanei genome (124.8 Mb of total size) using long-read 18 sequencing and chromatin conformation capture data. Like other fully assembled 19 genomes in the genus, we find that the C. remanei genome displays a high degree of 20 synteny with C. elegans despite multiple within-chromosome rearrangements. Both 21 genomes have high gene density in central regions of chromosomes relative to 22 chromosome ends and the opposite pattern for the accumulation of repetitive elements.
    [Show full text]
  • Comparative Genomics Approaches to Study Organism Similarities and Differences
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Journal of Biomedical Informatics 35 (2002) 142–150 www.academicpress.com Methodological Review Comparative genomics approaches to study organism similarities and differences Liping Wei,a,b,* Yueyi Liu,c Inna Dubchak,d John Shon,a and John Parka a Nexus Genomics, Inc., 229 Polaris Ave., Suite 6, Mountain View, CA 94043, USA b Hang Zhou Center, Beijing Genomics Institute, Hangzhou, China c Stanford Medical Informatics, 251 Campus Dr. X215, Stanford University, Stanford CA 94305, USA d Genome Sciences Department, MS 84-171, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Received 7 June 2002 Abstract Comparative genomics is a large-scale, holistic approach that compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes. Comparative studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms. We discuss in detail the type of analyses that offer significant biological insights in the comparisons of (1) genome structure including overall genome statistics, repeats, genome rearrangement at both DNA and gene level, synteny, and breakpoints; (2) coding regions including gene content, protein content, orthologs, and paralogs; and (3) noncoding regions including the prediction of regulatory elements. We also briefly review the currently available computational tools in comparative genomics such as algorithms for genome-scale sequence alignment, gene identification, and nonhomology-based function prediction. Ó 2002 Elsevier Science (USA). All rights reserved. 1. Introduction genomes and the vast range of genome sizes.
    [Show full text]
  • Genetic and Physiological Data Implicating the New Human Gene G72 and the Gene for D-Amino Acid Oxidase in Schizophrenia
    Corrections BIOCHEMISTRY. For the article ‘‘Interaction of RNA polymerase with forked DNA: Evidence for two kinetically significant intermediates on the pathway to the final complex,’’ by Laura Tsujikawa, Oleg V. Tsodikov, and Pieter L. deHaseth, which ap- peared in number 6, March 19, 2002, of Proc. Natl. Acad. Sci. USA (99, 3493–3498; First Published March 12, 2002; 10.1073͞ pnas.062487299), the authors note the following concerning RNA polymerase (RNAP) concentrations. No correction was made for the fraction of RNAP (0.5) that is active in promoter binding. With this correction, the values of K1 and Kapp (but not Kf) would increase by about a factor of 2. The relative values would remain essentially unchanged. Also, the legends to Figs. 2, 3, and 5 contain errors pertaining to the symbols used for data obtained with and without heparin challenge, the duration of the challenge, and the concentration of added heparin. The figures and the corrected legends appear below. Fig. 3. Kinetics of complex formation. RNAP (65 nM) and wt forked DNA (1 nM) were incubated for various time intervals and then complex formation was determined immediately (Ϫheparin) or after a 2-min challenge with 100 ␮g͞ml heparin (ϩheparin). The Ϫheparin data (■) were fit (error-weighted) Ϫ1 with Eq. 8 with a2 ϭ 0(kaϪ ϭ 0.10 Ϯ 0.01 s ) and the ϩheparin data (Œ) with ϭ Ϯ Ϫ1 ϭ both single (kaϩ 0.036 0.004 s ; thin line) and double-exponential (ka1 Ϯ Ϫ1 ϭ Ϯ ϫ Ϫ4 Ϫ1 0.044 0.002 s ; ka2 (5 3) 10 s ; thick line) equations.
    [Show full text]
  • Proteogenomic Analysis and Global Discovery of Posttranslational
    Proteogenomic analysis and global discovery of PNAS PLUS posttranslational modifications in prokaryotes Ming-kun Yanga,1, Yao-hua Yanga,1, Zhuo Chena, Jia Zhanga, Yan Lina, Yan Wanga, Qian Xionga, Tao Lia,2, Feng Gea,2, Donald A. Bryantb, and Jin-dong Zhaoa aKey Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; and bDepartment of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802 Edited* by Jiayang Li, Chinese Academy of Sciences, Beijing, China, and approved November 13, 2014 (received for review July 6, 2014) We describe an integrated workflow for proteogenomic analysis and cyanobacterial cells adjust their cellular activities in response and global profiling of posttranslational modifications (PTMs) in to a wide range of environmental cues and stimuli. Recently, prokaryotes and use the model cyanobacterium Synechococcus sp. cyanobacteria have attracted great interest due to their crucial PCC 7002 (hereafter Synechococcus 7002) as a test case. We found roles in global carbon and nitrogen cycles and their ability to more than 20 different kinds of PTMs, and a holistic view of PTM produce clean and renewable biofuels such as hydrogen (14–16). events in this organism grown under different conditions was Synechococcus 7002 is a unicellular, marine cyanobacterium and obtained without specific enrichment strategies. Among 3,186 pre- a model organism for studying photosynthetic carbon fixation dicted protein-coding genes, 2,938 gene products (>92%) were and the development of biofuels (17, 18). However, whereas the identified. We also identified 118 previously unidentified proteins genome of Synechococcus 7002 is fully sequenced, it is annotated and corrected 38 predicted gene-coding regions in the Synecho- only by in silico methods (www.ncbi.nlm.nih.gov/), with a large coccus 7002 genome.
    [Show full text]