Virus-Like Insertions with Sequence Signatures Similar to Those of Endogenous Nonretroviral RNA Viruses in the Human Genome
Shohei Kojima(a,1), Kohei Yoshikawa(b), Jumpei Ito(c), So Nakagawa(d), Nicholas F. Parrish(e), Masayuki Horie(a,f), Shuichi Kawano(b,2), and Keizo Tomonaga(a,g,h,2)

(a) Laboratory of RNA Viruses, Institute for Frontier Life and Medical Sciences, Kyoto University, Kyoto 606-8507, Japan; (b) Department of Computer and Network Engineering, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan; (c) Division of Systems Virology, Department of Infectious Disease Control, International Research Center for Infectious Diseases, Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan; (d) Department of Molecular Life Science, Tokai University School of Medicine, Isehara 259-1193, Japan; (e) Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Cluster for Pioneering Research, Yokohama 230-0045, Japan; (f) Hakubi Center for Advanced Research, Kyoto University, Kyoto 606-8507, Japan; (g) Laboratory of RNA Viruses, Graduate School of Biostudies, Kyoto University, Kyoto 606-8507, Japan; and (h) Department of Molecular Virology, Graduate School of Medicine, Kyoto University, Kyoto 606-8507, Japan

Edited by Harmit S. Malik, Fred Hutchinson Cancer Research Center, Seattle, WA, and approved December 23, 2020 (received for review May 27, 2020)

Understanding the genetics and taxonomy of ancient viruses will give us great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution. Endogenous viruses are remnants of ancient viral infections and are thought to retain the genetic characteristics of viruses from ancient times. In this study, we used machine learning of endogenous RNA virus sequence signatures to identify viruses in the human genome that have not been detected or are already extinct. Here, we show that the k-mer occurrence of ancient RNA viral sequences remains similar to that of extant RNA viral sequences and can be differentiated from that of other human genome sequences. Furthermore, using this characteristic, we screened RNA viral insertions in the human reference genome and found virus-like insertions with phylogenetic and evolutionary features indicative of an exogenous origin but lacking homology to previously identified sequences. Our analysis indicates that animal genomes still contain unknown virus-derived sequences and provides a glimpse into the diversity of the ancient virosphere.

endogenous RNA virus | human genome | paleovirology | machine learning

Significance

Ancient animals left diverse physical fossil records from which we can deduce that species with extraordinary features once populated our planet. By infecting germlines, some ancient viruses deposited genetic fossil records. However, inferring that a sequence is a viral fossil has so far required homology to circulating viruses. We developed a method to recognize viral fossils that do not closely resemble known viruses. Rather than homology, we detected sequence patterns of fossilized and modern RNA viruses that distinguish them from human sequences. Our results indicate that as-yet-undiscovered fossils from unknown viruses remain hidden in animal genomes. These relics of the ancient virosphere, including sequences reported here, will expand our knowledge about the diversity of ancient viruses and also our genomes.

Author contributions: S. Kojima, M.H., S. Kawano, and K.T. designed research; S. Kojima and S. Kawano performed research; S.N. contributed new reagents/analytic tools; K.Y., J.I., S.N., N.F.P., M.H., and K.T. analyzed data; and S. Kojima, N.F.P., S. Kawano, and K.T. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

1 Present address: Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Cluster for Pioneering Research, Yokohama 230-0045, Japan.

2 To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2010758118/-/DCSupplemental.

Published January 25, 2021. PNAS 2021 Vol. 118 No. 5 e2010758118, https://doi.org/10.1073/pnas.2010758118

Recent advances in metagenomic analysis have shown that viruses in nature are more diverse than previously thought, and many viruses with no sequence similarity to known viruses exist, yet undiscovered, in the biosphere. Detecting viral diversity and discovering new viruses can lead to a comprehensive understanding of the coexistence between viruses and organisms and provide effective tools with which to predict the emergence of novel viruses with epidemic or pandemic potential.

There is no reason to suspect that ancient viruses were less diverse than current viruses. Understanding the genetics and taxonomy of ancient viruses, including extinct viruses, will provide great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution and how we have coexisted with potential pathogens. However, much remains unknown about the diversity of ancient viruses.

The clue to the existence of ancient viruses is found in our genomes. Genome sequences called endogenous viruses are remnants of ancient viral infections in an organism's genome that are thought to retain the genetic characteristics of the viruses that prevailed in ancient times (1). In addition to retroviruses, which are well-recognized as endogenized relics, sequences from RNA viruses, called nonretroviral endogenous RNA virus elements (nrEVEs), have also been inserted into animal genomes (2–5). For example, endogenous bornavirus- and filovirus-like elements show detectable sequence similarity to their extant relatives and indicate that ancient viruses were directly linked to the evolution of current viral lineages (6–11). On the other hand, some nrEVEs fall into lineages distantly related to current viruses at the genus or family level (2, 5). These findings indicate that the detection of nrEVEs in animal genomes would provide a better understanding of past viral diversity.

Current methods used to identify nrEVEs depend heavily on pairwise sequence similarity to known viral sequences (12, 13). Therefore, our knowledge of ancient viruses is inevitably biased toward those that are relatively similar to known viruses. In particular, RNA viruses may lose similarity to extant viruses due to the rapid evolution of viral genomes, and even the ancestors of existing viruses may not be detected. Furthermore, it is possible that ancestors of yet-to-be-recognized extant viruses, or extinct viruses, have also been endogenized in animal genomes. Thus, a comprehensive analysis of nrEVEs in animal genomes would require a new detection method based on a defining feature of viruses that does not depend on pairwise similarity to known viruses.

Extant viruses have been found to share certain patterns in the occurrence of nucleic acid combinations of length k, called k-mers. The dinucleotide (k = 2) composition is generally uniform in an animal RNA virus family (14). Prokaryotic viral sequences have distinctive k-mer frequencies that distinguish them from the sequences of the host (15). k-mer occurrence in viral genomes is thought to be shaped by several selective constraints, such as codon usage bias, which buffers against error-prone replication, and the low-CG dinucleotide property that allows viruses to evade immune response (16, 17). These observations suggest the possibility that both ancient and modern viruses share defining k-mer signatures.

In this study, we employ machine learning of sequence signatures of ancient RNA viruses to search for nrEVEs without local sequence similarity to known viruses and demonstrate the presence of nrEVEs originating from an as-yet-unrecognized infectious agent in the human genome. Interestingly, we find that the k-mer frequencies of nrEVEs are more similar to those of current RNA viral sequences than to those of human genomic sequences. Furthermore, we discover not only previously unexplored ancient bornavirus-derived insertions but also a viral-like [...]

[...] sequences within each group lacked pairwise similarity to sequences in other groups (SI Appendix, Fig. S2). When bornavirus nucleoprotein (N)-derived nrEVEs were retained as test data, more than 75% of the test sequences were correctly classified (Fig. 1B). Consistently, we observed that 44 to 83% of the test data were correctly classified when using the other nrEVE groups as test sequences, with one exception: training performed without filovirus glycoprotein (GP)-derived nrEVEs. From these observations, we conclude that, regardless of their origin, nrEVEs share distinguishing sequence characteristics in almost all cases.

Similarity in the Sequence Characteristics of nrEVEs and RNA Viruses. The above result demonstrates the commonality in k-mer composition in nrEVE sequences. Because the genetic architecture of RNA viruses seems to be influenced by a number of constraints, such as immune pressure and error-prone replication, and has a pattern distinct from that of host species (16, 17), we next assessed whether the k-mer composition of nrEVEs is more similar to human coding sequences or to the coding genes in the single-strand, negative-sense RNA [(−)ssRNA] virus group, which includes bornaviruses and filoviruses.
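To make the notion of k-mer composition used above concrete, the following is a minimal Python sketch of how relative k-mer frequencies and the CG dinucleotide depletion mentioned earlier could be computed for a single sequence. It is an illustration only and not the authors' pipeline: the function names `kmer_frequencies` and `cpg_observed_expected`, the use of overlapping windows, and the handling of ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers in seq.

    Hypothetical helper for illustration. Overlapping windows are
    counted, U is treated as T, and windows containing any character
    other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper().replace("U", "T")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


def cpg_observed_expected(seq):
    """Observed/expected CG ratio, f(CG) / (f(C) * f(G)).

    Values well below 1 indicate CG depletion. Returns NaN when the
    sequence contains no C or no G.
    """
    mono = kmer_frequencies(seq, 1)
    di = kmer_frequencies(seq, 2)
    expected = mono["C"] * mono["G"]
    return di["CG"] / expected if expected > 0 else float("nan")
```

A frequency vector of this kind, computed for each sequence at a chosen k, is one plausible form of input to a classifier that separates virus-derived from host sequences; the choice of k and of the classifier is left open here because this section does not specify them.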