Rfam 14: expanded coverage of metagenomic, viral and microRNA families
Ioanna Kalvari 1, Eric P. Nawrocki 2, Nancy Ontiveros-Palacios 1, Joanna Argasinska 1, Kevin Lamkiewicz 3,4, Manja Marz 3,4, Sam Griffiths-Jones 5, Claire Toffano-Nioche 6, Daniel Gautheret 6, Zasha Weinberg 7, Elena Rivas 8, Sean R. Eddy 8,9,10, Robert D. Finn 1, Alex Bateman 1 and Anton I. Petrov 1,*

1 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
3 RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany
4 European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
5 Faculty of Biology, Medicine and Health, University of Manchester, Oxford Road, Manchester, M13 9PT, UK
6 Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
7 Bioinformatics Group, Department of Computer Science and Interdisciplinary Centre for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
8 Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
9 Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
10 John A. Paulson School of Engineering and Applied Science, Harvard University, Cambridge, MA 02138, USA

To cite this version: Ioanna Kalvari, Eric P. Nawrocki, Nancy Ontiveros-Palacios, Joanna Argasinska, Kevin Lamkiewicz, et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research, 2020, 10.1093/nar/gkaa1047. hal-03031715

HAL Id: hal-03031715, https://hal.archives-ouvertes.fr/hal-03031715. Submitted on 14 Dec 2020.

Published by Oxford University Press. DOI: 10.1093/nar/gkaa1047. PUBMED: 33211869.

Received September 15, 2020; Revised October 14, 2020; Editorial Decision October 15, 2020; Accepted October 21, 2020

* To whom correspondence should be addressed. Tel: +44 1223 492550; Fax: +44 1223 494468; Email: [email protected]

Present address: Joanna Argasinska, Stanford University; Department of Genetics; 3165 Porter Drive, Palo Alto, CA 94304, USA.

© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

INTRODUCTION

Rfam is the database of non-coding RNA (ncRNA) families (1), each represented by a multiple sequence alignment (known as the seed), a consensus secondary structure, and a covariance model to annotate non-coding RNAs in nucleotide datasets using the Infernal software (2). Rfam and Infernal are commonly used to annotate ncRNAs in newly sequenced genomes (3,4), and are the core components of the genome annotation pipelines in Ensembl (5), Ensembl Genomes (6), NCBI Prokaryotic and Eukaryotic Gene Annotation (7,8) and other resources. For example, PDBe uses Rfam to enable searching for RNA chains, such as tRNA or rRNA, in 3D structures (9), while RNAcentral uses Rfam to detect incomplete sequences, RNA type annotations errors, and provide other quality controls (10). The manually curated multiple sequence alignments from Rfam are also used for training and benchmarking new software, such as secondary structure prediction algorithms (11,12).

Here we describe community-driven improvements and updates that are available in Rfam 14 (releases 14.0–14.3), including new RNA families, new features on the Rfam website, and new tools for expert users to contribute families to the database. Rfam 14.3 contains 3444 families, representing a 28% increase since version 13.0. The majority of the new families have been generated in collaboration with other RNA resources and experts, in particular ZWD, miRBase and European Virus Bioinformatics Center (EVBC), supplemented by in-house literature curation. Following the transition from annotating a subset of ENA (13) to annotation of a collection of non-redundant and complete genomes in Rfam 13.0 (1), we have expanded the set of genomes in the Rfam sequence database (Rfamseq) by 76% in Rfam 14.0 to reflect the increase in number of genomes. Rfamseq now includes 14 772 genomes from all domains of life. Here, we also describe the newly-developed Rfam Cloud curation pipeline, which marks a further major step towards ensuring that Rfam remains a core open and sustainable resource for the whole RNA community.

NEW FAMILIES FROM ZWD

Rfam releases 14.1 and 14.3 included 253 new families from the Zasha Weinberg Database (ZWD) that were discovered by a systematic computational analysis of intergenic regions in Bacteria and metagenomic samples (14), as well as newly identified ribosomal leaders (r-leaders) (15). ZWD stores the original manually curated multiple sequence alignments produced in the Breaker and Weinberg groups over the last decade. ZWD is a git-based resource that currently includes 417 alignments and is available at https://bitbucket.org/zashaw/zashaweinbergdata. Examples of the new and updated ZWD-based families are shown in Figure 1.

Prior to version 14, Rfam had incorporated 108 families from ZWD, but in Rfam 14 a new process was developed to automate the […] with Rfam covariance models. Ninety eight percent of ZWD alignments have now been imported into Rfam except for seven families lacking covariation support as determined by R-scape (16) and 43 families marked 'Not for Rfam' in ZWD. The mapping between ZWD alignments and Rfam accessions is provided in Supplementary Information.

We also used CaCoFold (17) to identify the pre-existing ZWD-based Rfam families requiring an update using the new RNAcentral import mechanism. For example, the manA alignment from Rfam (RF01745) had only 29 statistically significant basepairs while a CaCoFold structure based on the corresponding alignment in ZWD contained 48 significant basepairs. A comparison between the Rfam and ZWD versions of the alignment revealed a mistake in the Rfam secondary structure introduced during manual import. As a result, the manA Rfam alignment was updated to its original version from ZWD (Figure 1A, right). We are systematically reviewing other ZWD-based families that may require an update and will release them in future Rfam versions (17).

NEW WORKFLOW FOR VIRAL RNA FAMILIES

Despite the small size (3400–41 000 nucleotides) of RNA virus genomes, they contain several conserved RNA structures vital for protection against exonucleases, genome diversification, and play a crucial role in various stages of their viral life cycle (18). Many viruses rely on RNAs to infect and replicate inside a host: for example, the cis-acting element of coronaviruses is essential for replication (18), while in Dengue viruses replication depends on RNA