
Virus-like insertions with sequence signatures similar to those of endogenous nonretroviral RNA viruses in the human genome

Shohei Kojimaa,1, Kohei Yoshikawab, Jumpei Itoc, So Nakagawad, Nicholas F. Parrishe, Masayuki Horiea,f, Shuichi Kawanob,2, and Keizo Tomonagaa,g,h,2

aLaboratory of RNA Viruses, Institute for Frontier and Medical Sciences, Kyoto University, Kyoto 606-8507, Japan; bDepartment of Computer and Network Engineering, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan; cDivision of Systems , Department of Infectious Disease Control, International Research Center for Infectious Diseases, Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan; dDepartment of Molecular Life Science, Tokai University School of Medicine, Isehara 259-1193, Japan; eGenome Immunobiology RIKEN Hakubi Research Team, RIKEN Cluster for Pioneering Research, Yokohama 230-0045, Japan; fHakubi Center for Advanced Research, Kyoto University, Kyoto 606-8507, Japan; gLaboratory of RNA Viruses, Graduate School of Biostudies, Kyoto University, Kyoto 606-8507, Japan; and hDepartment of Molecular Virology, Graduate School of Medicine, Kyoto University, Kyoto 606-8507, Japan

GENETICS

Edited by Harmit S. Malik, Fred Hutchinson Cancer Research Center, Seattle, WA, and approved December 23, 2020 (received for review May 27, 2020)

Understanding the genetics and taxonomy of ancient viruses will give us great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution. Endogenous viruses are remnants of ancient viral infections and are thought to retain the genetic characteristics of viruses from ancient times. In this study, we used machine learning of endogenous RNA virus sequence signatures to identify viruses in the human genome that have not been detected or are already extinct. Here, we show that the k-mer occurrence of ancient RNA viral sequences remains similar to that of extant RNA viral sequences and can be differentiated from that of other human genome sequences. Furthermore, using this characteristic, we screened RNA viral insertions in the human reference genome and found virus-like insertions with phylogenetic and evolutionary features indicative of an exogenous origin but lacking homology to previously identified sequences. Our analysis indicates that animal genomes still contain unknown virus-derived sequences and provides a glimpse into the diversity of the ancient virosphere.

endogenous RNA virus | human genome | paleovirology | machine learning

Significance

Ancient [...] left diverse physical records from which we can deduce that species with extraordinary features once populated our planet. By infecting germlines, some ancient viruses deposited genetic fossil records. However, inferring that a sequence is a viral fossil has so far required homology to circulating viruses. We developed a method to recognize viral [...] that do not closely resemble known viruses. Rather than homology, we detected sequence patterns of fossilized and modern RNA viruses that distinguish them from human sequences. Our results indicate that as-yet-undiscovered fossils from unknown viruses remain hidden in animal genomes. These relics of the ancient virosphere, including sequences reported here, will expand our knowledge about the diversity of ancient viruses and also our genomes.

Recent advances in metagenomic analysis have shown that viruses in nature are more diverse than previously thought, and many viruses with no sequence similarity to known viruses exist, yet undiscovered, in the [...]. Detecting viral diversity and discovering new viruses can lead to a comprehensive understanding of the coexistence between viruses and [...] and provide effective tools with which to predict the emergence of novel viruses with epidemic or pandemic potential.

There is no reason to suspect that ancient viruses were less diverse than current viruses. Understanding the genetics and taxonomy of ancient viruses, including extinct viruses, will provide great insights into not only the origin and evolution of viruses but also how viral infections played roles in our evolution and how we have coexisted with potential [...]. However, much remains unknown about the diversity of ancient viruses.

The clue to the existence of ancient viruses is found in our genomes. Genome sequences called endogenous viruses are remnants of ancient viral infections in an [...]'s genome that are thought to retain the genetic characteristics of the viruses that prevailed in ancient times (1). In addition to retroviruses, which are well-recognized as endogenized relics, sequences from RNA viruses, called nonretroviral endogenous RNA virus elements (nrEVEs), have also been inserted into animal genomes (2–5). For example, endogenous bornavirus- and filovirus-like elements show detectable sequence similarity to their extant relatives, indicating that ancient viruses were directly linked to the evolution of current viral lineages (6–11). On the other hand, some nrEVEs fall into lineages distantly related to current viruses at the genus or family level (2, 5). These findings indicate that the detection of nrEVEs in [...] would provide a better understanding of past viral diversity.

Current methods used to identify nrEVEs depend heavily on pairwise sequence similarity to known viral sequences (12, 13). Therefore, our knowledge of ancient viruses is inevitably biased toward those that are relatively similar to known viruses. In particular, RNA viruses may lose similarity to extant viruses due to the rapid evolution of viral genomes, and even the ancestors of existing viruses may not be detected. Furthermore, it is possible that ancestors of yet-to-be-recognized extant viruses, or extinct viruses, have also been endogenized in animal genomes. Thus, a comprehensive analysis of nrEVEs in animal genomes would require a new detection method based on a defining feature of viruses that does not depend on pairwise similarity to known viruses.

Author contributions: S. Kojima, M.H., S. Kawano, and K.T. designed research; S. Kojima and S. Kawano performed research; S.N. contributed new reagents/analytic tools; K.Y., J.I., S.N., N.F.P., M.H., and K.T. analyzed data; and S. Kojima, N.F.P., S. Kawano, and K.T. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

1Present address: Genome Immunobiology RIKEN Hakubi Research Team, RIKEN Cluster for Pioneering Research, Yokohama 230-0045, Japan.

2To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2010758118/-/DCSupplemental.

Published January 25, 2021.

PNAS 2021 Vol. 118 No. 5 e2010758118 https://doi.org/10.1073/pnas.2010758118

Extant viruses have been found to share certain patterns in the occurrence of nucleotide combinations of length k, called k-mers. The dinucleotide (k = 2) composition is generally uniform in an animal RNA virus family (14). Prokaryotic viral sequences have distinctive k-mer frequencies that distinguish them from the sequences of the host (15). k-mer occurrence in viral genomes is thought to be shaped by several selective constraints, such as codon usage bias, which buffers against error-prone replication, and the low-CG dinucleotide property that allows viruses to evade immune response (16, 17). These observations suggest the possibility that both ancient and modern viruses share defining k-mer signatures.

In this study, we employ machine learning of sequence signatures of ancient RNA viruses to search for nrEVEs without local sequence similarity to known viruses and demonstrate the presence of nrEVEs originating from an as-yet-unrecognized infectious agent in the human genome. Interestingly, we find that the k-mer frequencies of nrEVEs are more similar to those of current RNA viral sequences than to those of human genomic sequences. Furthermore, we discover not only previously unexplored ancient bornavirus-derived insertions but also a viral-like insertion, named predicted viral insertion (PVI), in the human genome, which is not homologous to known viral sequences but has exogenously-derived features. We also show that the PVI-related sequences have independently invaded mammalian lineages, suggesting that an unknown virus-like agent was invading host genomes during the mammalian radiation. Our findings will open a window for exploring viral diversities and evolution and expand our view of the virosphere in ancient times.

Results

Machine Learning Distinguishes nrEVEs from Other Human Sequences. To uncover nrEVEs hidden in animal genomes that cannot be detected with conventional pairwise similarity searches, we first hypothesized that nrEVEs may have different nucleotide sequence compositions from those of “nonviral” sequences in the human genome. To examine this, we focused on the occurrence of k-mers of nrEVEs and evaluated whether a multiclass classifier constructed by a support vector machine (SVM), a supervised machine learning method, can distinguish nrEVEs from human sequences (Fig. 1A). To train the SVM, we used sequences of endogenous bornavirus- and filovirus-derived elements, which are the only reported mammalian nrEVEs. In addition, six different groups of human genome sequences, namely coding and noncoding exons, processed pseudogenes, introns, promoters, and intergenic regions, were employed for the training datasets. When k = 1 and 2 were used, the recall and precision scores of the SVM were low, while these scores were high and stable when we used k = 3, 4, and 5 (Fig. 1A and SI Appendix, Fig. S1). This demonstrates that the k-mer compositions of nrEVEs are different from those of other human sequences and that k-mers of 3 or longer are sufficient to accurately capture this distinction. To generate a dense k-mer matrix and avoid overfitting, we used k = 3 for further analyses.

We next examined whether nrEVEs share sequence characteristics defined by k-mer frequencies, regardless of the genes or virus families from which they originated. To this end, we evaluated the recall of nrEVE classification by the SVM as follows. We first divided the nrEVEs into two groups: bornaviral and filoviral nrEVEs. We used either nrEVE group and the six groups of human sequences as training data and constructed an SVM classifier. Then, we evaluated the recall of the SVM using the other group of nrEVEs as test data. The classifier trained on filoviral nrEVEs categorized more than half of the bornaviral nrEVEs correctly (Fig. 1B). Consistently, the SVM trained on bornaviral nrEVEs gave a recall score of more than 0.5. We next divided the nrEVEs into eight groups based on the viral genes from which they were derived. We trained SVMs, retaining one of the eight groups for test data and using the other seven as training data. Notably, sequences within each group lacked pairwise similarity to sequences in other groups (SI Appendix, Fig. S2). When bornavirus nucleoprotein (N)-derived nrEVEs were retained as test data, more than 75% of the test sequences were correctly classified (Fig. 1B). Consistently, we observed that 44 to 83% of the test data were correctly classified when using the other nrEVE groups as test sequences, with one exception: training performed without filovirus glycoprotein (GP)-derived nrEVEs. From these observations, we conclude that, regardless of their origin, nrEVEs share distinguishing sequence characteristics in almost all cases.

Similarity in the Sequence Characteristics of nrEVEs and RNA Viruses. The above result demonstrates the commonality in k-mer composition in nrEVE sequences. Because the genetic architecture of RNA viruses seems to be influenced by a number of constraints, such as immune pressure and error-prone replication, and has a pattern distinct from that of host species (16, 17), we next assessed whether the k-mer composition of nrEVEs is more similar to human coding sequences or to the coding sequences in the single-strand, negative-sense RNA [(−)ssRNA] virus group, which includes bornaviruses and filoviruses. As shown in Fig. 1C, hierarchical clustering by k-mer frequency formed one cluster composed of a majority of nrEVEs with some (−)ssRNA viral sequences when we used k = 3 (SI Appendix, Fig. S3A), demonstrating that the k-mer composition of nrEVEs is more similar to that of (−)ssRNA viruses than to that of human coding sequences. Manifold learning, a nonlinear dimensionality reduction method, based on k-mer frequencies from the same dataset showed that the majority of nrEVEs exhibited a similar but slightly distinct distribution compared with that of the cluster composed of viral coding sequences (SI Appendix, Fig. S3B), suggesting that the k-mer composition of nrEVEs is different from that of (−)ssRNA viruses but more similar when compared with that of human coding sequences. These results suggest that the k-mer frequency of nrEVEs still retains similarities to that of (−)ssRNA viral coding sequences, despite the long residence of nrEVEs for at most 80 million y as endogenous sequences within host genomes.

Genome-Wide Screen for nrEVEs Hidden in the Human Genome. To detect hidden nrEVEs originating from unknown RNA viruses, we applied the classifier constructed by the SVM to the reference human genome. The SVM can distinguish nrEVEs with substantial accuracy. However, as our approach is specifically designed to overcome the sparseness of “ground truth,” judging false positives is a challenge. We thus used the following three steps to extract sequences as candidate nrEVEs: 1) search for polyA tract (pA) and target site duplication (TSD) (pA-TSD), 2) detection of preintegration empty sites (PESs), and 3) removal of cellular pseudogenes (Fig. 2A).

Many nrEVEs in mammals share common sequence features, such as pA-TSDs, at the junctions of viral sequences with host [...], probably reflecting the mechanism of integration by autonomous retrotransposons such as long interspersed nuclear elements (LINEs) (2). Therefore, we first searched the human genome for pA-TSDs (SI Appendix, Methods) and detected more than 8 million pA-TSDs (Fig. 2B and SI Appendix, Fig. S4). Next, we assessed whether the sequences detected by the pA-TSD search were acquired by insertion. The existence of an orthologous genomic locus with no insertion (PES) in other species is evidence of evolutionary invasion. Initial manual inspections of some of the sequences revealed that most of the pA-TSDs were probably derived from stochastic occurrences because we could not find a PES. To exclude pA-TSDs not derived from a recent integration via the mechanism described above, we extracted only those for which PESs were detectable in a genome alignment of 14 mammals (SI Appendix, Methods). This led to the extraction of 5,578 pA-TSDs that harbored at least one PES (Fig. 2B and SI Appendix, Fig. S4).

Cellular processed pseudogenes share characteristic features of integration sites, such as pA and TSD, with nrEVEs. Indeed, 43% of the 5,578 pA-TSDs overlapped with known cellular processed pseudogenes, demonstrating the enrichment of insertions generated by retrotransposons (P < 0.001, permutation test). To remove these insertions, we used a cellular pseudogene database and BLASTn-based identification of unannotated cellular pseudogenes (SI Appendix, Methods). This step identified more than 80% of the 5,578 pA-TSDs as likely cellular pseudogenes (Fig. 2B). The remaining sequences that were not cellular pseudogenes (582 sequences) were then categorized by the SVM classifier to predict whether they have nrEVE features, and this yielded 100 elements with k-mer occurrences typical of nrEVEs (Fig. 2B, SI Appendix, Fig. S4, and Dataset S1). Finally, we manually curated these sequences to determine those most likely to be nrEVEs using a mammalian genome comparison, a sequence similarity search, and phylogenetic analysis.
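As an illustration of the sequence representation on which the classifier operates, the following minimal Python sketch computes a normalized k-mer frequency vector with k = 3 (64 features). It is not the authors' implementation: the function name `kmer_frequencies`, the normalization to relative frequencies, the single-strand counting, and the skipping of windows that contain non-ACGT characters are all assumptions made for this example. The procedure actually used is described in SI Appendix, Methods.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return relative k-mer frequencies in a fixed ACGT order (hypothetical helper).

    Assumptions for this sketch: only the given strand is counted, and
    windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    # Fixed feature order: AAA, AAC, AAG, ..., TTT for k = 3.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vec = kmer_frequencies("ATGCGATACGCTTGA", k=3)
print(len(vec))  # 64
```

Vectors of this kind, one per sequence, could then be passed to any multiclass SVM implementation as feature rows.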

Discovery of Additional EBLNs. Previous studies reported the presence of eight bornaviral nrEVEs (seven endogenous bornavirus-like nucleoprotein elements [EBLNs] and one endogenous bornavirus-like glycoprotein element [EBLG]) in the human genome (4). The SVM for the nrEVE search yielded five of eight previously reported bornaviral nrEVEs (Dataset S1), demonstrating that our approach captured most known nrEVEs. Our approach could not detect three elements. However, this is not surprising because our method was tuned to detect features typical of nrEVE insertions but had less sensitivity for nrEVEs that lack these characteristics. Of the three elements, one, hsEBLN-4, lacks a clear pA and TSDs (18), and the others, hsEBLN-5 and hsEBLG-5, are located in transposon-rich regions, which are difficult to align with other genomes in order to allow confident identification of PESs (SI Appendix, Fig. S5).

We next performed a BLASTx search using the sequences detected by the SVM classifier to search for novel nrEVEs similar to known viruses but below the threshold of detection in canonical BLAST-based surveys. As a result, we found that two sequences showed weak similarity to the nucleoproteins of orthobornaviruses and recently discovered bornaviruses of the genus Carbovirus (SI Appendix, Fig. S6A) (19). Suspecting that these sequences represent EBLNs that have not been previously identified, we investigated their phylogenetic relationships with extant bornaviruses and known human EBLNs (Fig. 2C). The nucleotide sequence we identified as hsEBLN-8 showed a phylogenetically close relationship with hsEBLN-7, while the other sequence (hereafter hsEBLN-9) clustered with carboviruses (Fig. 2C). Although hsEBLN-8 has high similarity to hsEBLN-7 (Fig. 2C and SI Appendix, Fig. S6B), hsEBLN-8 was not detected in previous reports (3, 4). The region of hsEBLN-8 harboring pairwise similarity to bornavirus was shortened by a deletion and a putative insertion (Fig. 2D and SI Appendix, Fig. S6 C and D). On the other hand, hsEBLN-9 did not cluster with extant orthobornaviruses, which were used as the query in tBLASTn-based nrEVE searches in previous reports (Fig. 2C). In addition, hsEBLN-9 has multiple putative frameshifts in the region where it harbors similarity to carboviruses (Fig. 2E). These obscured similarities likely resulted in these EBLNs being missed in previous surveys, while reconstructable pairwise similarity to known bornaviruses or EBLNs revealed that the two human sequences are previously unrecognized nrEVEs. This provides evidence that our

Fig. 1. k-mer frequency of nrEVEs is similar to that of extant RNA viruses and differentiates them from human sequences. (A) Confusion matrix of the SVM classifier using leave-one-out cross-validation. (B) nrEVEs share characteristic sequence similarities. nrEVEs were categorized based on the viral genes of those origins (bornavirus N, M, G, and L genes and filovirus NP, VP35, GP, and L genes). The SVM was trained using the indicated nrEVE categories. The numbers of nrEVEs used for training and test are shown (Left). (C) Hierarchical clusters of the human and vertebrate negative-strand RNA virus-coding sequences and nrEVEs. The k-mer frequency of these sequences was used for clustering with Ward’s method. The heatmap and dendrograms show the input k-mer frequency and the clustering process, respectively. The dendrogram leaf nodes of the human and virus coding sequences and nrEVEs are shown in gray, red, and blue, respectively. See SI Appendix, Fig. S1 for additional details.
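The recall and precision scores summarized by a confusion matrix such as the one in panel A can be derived as in the following minimal Python sketch. The counts and the three class labels are invented for illustration and are not data from this study; rows are assumed to hold true classes and columns predicted classes, and the function name `recall_precision` is hypothetical.

```python
def recall_precision(confusion, labels):
    """Per-class (recall, precision) from a square confusion matrix.

    confusion[i][j] is the number of class-i sequences predicted as class j
    (row = true class, column = predicted class; an assumption of this sketch).
    """
    scores = {}
    for i, label in enumerate(labels):
        true_positives = confusion[i][i]
        actual = sum(confusion[i])                     # all true class-i items
        predicted = sum(row[i] for row in confusion)   # all items called class i
        recall = true_positives / actual if actual else 0.0
        precision = true_positives / predicted if predicted else 0.0
        scores[label] = (recall, precision)
    return scores


# Invented counts, for illustration only.
labels = ["nrEVE", "coding_exon", "intergenic"]
confusion = [
    [8, 1, 1],
    [2, 16, 2],
    [0, 3, 17],
]
for label, (recall, precision) in recall_precision(confusion, labels).items():
    print(label, round(recall, 2), round(precision, 2))
```

In a leave-one-group-out design like that of panel B, only the recall of the held-out nrEVE group is informative, because every test sequence belongs to the nrEVE class.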

Fig. 2. nrEVE search identifies unreported endogenous bornavirus-like elements. (A) Schematic representation of the workflow to search for nrEVEs. We searched for nrEVE candidates in four steps: pA-TSD search, PES search, pseudogene and simple repeat removal, and classification using the SVM. (B) Numbers of nrEVE candidates in each step, and as a percentage of those from the preceding step, of the workflow shown in A. (C) Phylogenetic tree of human EBLNs and extant bornavirus N genes. The tree was constructed based on nucleotide sequences using the maximum-likelihood method with 1,000 bootstrap replicates. Bootstrap values greater than 70% are indicated. Gray and blue dots next to the leaf nodes represent previously reported and unreported EBLNs, respectively. The scale bar shows nucleotide substitutions per site. (D) Sequence alignment showing similarity across hsEBLN-8, hsEBLN-7, and BoDV-1. The underlined sequences are the regions detected by a tBLASTn search with the BoDV-1 N protein as a query. [...], deletion causing a frameshift mutation; [...], insertion causing a frameshift mutation; X, termination codon. (E) Protein sequence alignment showing similarity between hsEBLN-9 and JCPV. Putative regions affected by frameshift mutations are indicated by arrows.
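The first step of the workflow in A can be pictured with a deliberately simplified Python sketch: find a poly-A tract, take the bases immediately downstream as a candidate TSD, and look for an exact upstream copy within a bounded distance. This is a toy under stated assumptions, not the authors' search. The function name `find_pa_tsd` and the parameters `min_pa`, `tsd_len`, and `max_insert` are hypothetical, exact matching of a fixed-length TSD is assumed, and the real criteria are given in SI Appendix, Methods and may differ substantially.

```python
import re


def find_pa_tsd(seq, min_pa=10, tsd_len=8, max_insert=1000):
    """Toy pA-TSD search (hypothetical helper).

    Expected layout of an insertion: 5' TSD - inserted sequence - poly-A - 3' TSD.
    Returns a list of (upstream_tsd_start, pa_start, pa_end, tsd) tuples.
    """
    seq = seq.upper()
    hits = []
    for match in re.finditer(r"A{%d,}" % min_pa, seq):
        pa_start, pa_end = match.start(), match.end()
        tsd = seq[pa_end:pa_end + tsd_len]   # bases right after the poly-A tract
        if len(tsd) < tsd_len:
            continue
        window_start = max(0, pa_start - max_insert)
        # Nearest exact upstream copy lying entirely before the poly-A tract.
        pos = seq.rfind(tsd, window_start, pa_start)
        if pos != -1:
            hits.append((pos, pa_start, pa_end, tsd))
    return hits


flank = "GGCC"
tsd = "TCAGGCTT"
insert = "CGTACGTTAGCCGTACGGAT"
example = flank + tsd + insert + "A" * 12 + tsd + flank
print(find_pa_tsd(example))  # [(4, 32, 44, 'TCAGGCTT')]
```

Because short exact repeats near A-rich stretches arise by chance, a search of this kind is expected to produce many spurious hits, which is consistent with the need for the subsequent PES and pseudogene filters.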

method can detect nrEVEs with too-weak pairwise similarity to known viruses to be detected by other methods.

Detection of nrEVE-Like Insertions in the Human Genome. Beyond these two nrEVEs, no other sequences with any pairwise similarity to other existing viruses were detected. Manually judging the potential sources of the remaining candidates, we identified several insertions for which the sources were not clear (Dataset S1). Systematic identification of the sources of such orphan insertions is challenging; nevertheless, insights into the formation and distribution patterns in related species sometimes allow us to narrow down and specify possible sources. To highlight this approach, we selected one nrEVE-like insertion that we refer to as predicted viral insertion (Fig. 3A) and assessed whether the source of this sequence is an unknown virus.

PVI is ∼600 nt in length and has a clear pA and TSDs at the integration junctions (Fig. 3A), suggesting that PVI originated from an insertion of polyadenylated RNA by the machinery of a retrotransposon. An orthologous insertion site was found in the chimpanzee and marmoset genomes but was absent in the tarsier genome, suggesting that the insertion occurred at least 43 million y ago (MYA). We could not detect a clear, long open reading frame (ORF) in PVI (Fig. 3A). The search for known viral sequences related to PVI using BLAST failed to detect any similarity (E-value thresholds: 1e−5 for BLASTn and 1e−3 for BLASTx). To understand whether PVI is a cellular pseudogene, we first searched for human homologs. The BLASTn search with PVI as a query yielded 21 similar sequences; however, we could not find high similarity. The highest nucleotide identity score was 77% across 57% of the query. Moreover, the closest sequence appears to have been formed by an insertion in the same ancestral simian lineage as that for PVI (SI Appendix, Fig. S7A). To evaluate whether this nucleotide similarity is comparable to that of cellular pseudogenes formed similarly long ago, nucleotide identities between pseudogenes and their parental genes were calculated (SI Appendix, Methods). The percent identity of PVI with its most similar sequence was lower than most of the identities of cellular pseudogenes with their parental genes (Fig. 3B and SI Appendix, Fig. S7B). Related nrEVEs should show relatively high sequence divergence compared with that of pseudogenized sequences formed at the same time in evolution. This is because even the source sequences of the closest nrEVEs should already have had variations prior to integration due to the presence of quasispecies in exogenously replicating viruses. The observation of only weak identity between the closest PVIs thus suggests that these elements are of extrinsic origin.

Detection of PVI-Related Sequences in the Human Genome. To gain additional insights into the exogenous origin of PVIs, we next identified more diverse PVI-related sequences in the human genome. Based on iterative BLASTn and LASTz alignments, we found 83 additional sequences, resulting in a total of 105 PVI and PVI-related sequences, hereafter referred to as PVIRs, in the human genome (Fig. 3C). The lengths of these elements ranged from less than 100 nt to several kilobases (SI Appendix, Fig. S8A). Consistent with an extrinsic origin, three PVIRs fall within annotated PIWI-interacting RNA clusters, where the bornaviral nrEVEs are also enriched, more often than expected by chance alone (P < 0.01; SI Appendix, Fig. S8 B and C) (9). A dot plot analysis revealed that 20 PVIRs are tandemly arrayed as head-to-tail multimers, of which one unit is ∼1.5 kb (SI Appendix, Fig. S8D).

To address whether the identified PVIRs show features of insertion, such as PESs and/or pA-TSDs, we manually assessed the presence of PESs and insertion junction sequences. Orthology was clearly defined for 24 PVIRs. Most of these elements were simian- or primate-specific, while one element was conserved across the Euarchontoglires mammals (Fig. 3D). Notably, none of the PVIRs were orthologous to those in laurasiatherians, suggesting that they were acquired after the divergence of the Euarchontoglires and laurasiatherians (<96 MYA). For 33 PVIRs, insertion junctions were defined. Six elements had pA-TSDs, 15 elements lacked clear pA sequences and harbored only TSDs, and one insertion was potentially established due to template switching during mobilization of the LINE1 retrotransposon (Fig. 3 E and F and SI Appendix, Fig. S9). Eleven PVIRs had neither a clear pA nor a TSD. The diversity of the insertion junctions suggests that the putative source agent(s) of PVIRs might not have encoded an autonomous integrase and that several different host integration mechanisms might contribute to the formation of PVIRs. From the absence of orthologs in distant animal lineages and the presence of clear integration features, we concluded that human PVIRs are sequences acquired from an exogenous source.

Similar Sequences of PVI in Mammals. Independent insertions of similar sequences in species that do not carry human PVIR orthologs provide additional lines of evidence suggesting virus-like horizontal transfer and a foreign origin of PVIRs. Therefore, we next explored genomes other than the human genome. By similarity searches using human PVIRs as queries, we found PVIRs in primates, flying lemurs, rabbits, and laurasiatherian mammals (Fig. 3D). Notably, we did not detect any PVIRs in the genomes of other organisms, including other vertebrates, invertebrates, prokaryotes, [...], and viruses. To assess whether these nonhuman PVIR insertions were independently generated from human PVIRs, we analyzed their orthology. Genome comparisons clearly defined 21 out of 286 nonhuman PVIR insertions that occurred independently in multiple mammalian lineages (Figs. 3D and 4A and SI Appendix, Fig. S10).

Next, we analyzed phylogenetic relationships based on a multiple-sequence alignment of relatively long PVIRs (>800 nt) (Fig. 4B). These sequences were grouped into three clades designated clade 1 to clade 3, and clade 1 harbored a subclade, 1.1. Next, we classified all PVIRs based on their phylogenetic relationships (Fig. 4C and SI Appendix, Methods). We observed species specificity in PVIRs; clade 1 consisted of primate elements, with the exception of two rabbit elements, while clade 2 and clade 3 contained Euarchontoglires and laurasiatherian PVIRs, respectively. We found several events suggesting horizontal transfer of a putative source virus(es) of PVIRs among species. Subclade 1.1 PVIRs were observed in the human, tarsier, and aye-aye genomes but were absent in the bushbaby and mouse lemur genomes, suggesting that subclade 1.1 PVIRs entered the aye-aye genome independently from integrations into the tarsier and human genomes (Fig. 4C). Note that it is still possible that the source insertion of subclade 1.1 PVIRs occurred in a common primate ancestor, yet was independently deleted in the genomes of bushbaby and mouse lemurs or has otherwise become difficult to recognize in both these genomes. In another case, all of the clade 1 PVIRs were found in primate genomes, except for two PVIRs found in the rabbit genome (Fig. 4 C and D). These observations suggest that the source of PVIRs was either a horizontally transmissible transposon-like element or an infectious agent transmissible between mammalian lineages.

PVIRs Are Derived from an Exogenous Infectious Agent. We evaluated whether PVIRs are likely to be transposons or exogenous infectious agents based on their sequence divergence. A lineage of transposons is expected to be less divergent than a group of nrEVEs if they were generated at the same time, according to the following reasoning. The average sequence divergence of a transposon family should roughly reflect its age (20) because

Fig. 3. Discovery of an nrEVE-like sequence without similarity to known viruses. (A) PVI identified by the nrEVE-search workflow. Detected pA and TSDs are shown in red and blue, respectively. Genome alignment with the human genomic position chr15:26559515–26561852(+) is shown. (B) Histogram showing nucleotide identities between pseudogenes and parental genes. Pseudogenes whose integration ages were inferred to range from 67 to 43 MYA by multiple-genome alignment were used for analysis. The box plot shows the distribution of the nucleotide identities of pseudogenes with their parental genes, and the orange dotted line shows the nucleotide identity of the PVI with its most similar sequence in the human genome. (C) Genomic positions of 105 PVIRs in the human genome. The positions of the PVIRs are shown as red lines. (D) Estimated insertion dates of PVIRs. Arrows are shown between the estimated minimum insertion age and maximum fixation age. The numbers in red next to the arrows show the number of insertions found. The numbers in blue at the internal nodes indicate the divergence times taken from TimeTree. (E) Numbers of insertion junction features observed in PVIRs in the human genome. Elements with clear insertion junctions based on the multiple-genome alignment were used for analysis. (F) Insertion junction sequences of PVIRs in the human genome. pA and TSD sequences are shown in red and blue, respectively. The parentheses in the Human_46 TSD represent a sequence deletion. See SI Appendix, Fig. S9B for details and Dataset S2 for genomic positions.
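A statement such as "77% identity across 57% of the query" combines two quantities that can be read off a pairwise alignment. The following minimal Python sketch shows one common convention, in which identity is the number of matching columns divided by the alignment length (gaps included) and query coverage is the number of aligned query bases divided by the full query length. The function name `identity_and_coverage` and the example strings are hypothetical, and the exact definitions used for panel B are those in SI Appendix, Methods.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Percent identity and percent query coverage of a gapped pairwise alignment.

    Both aligned strings must have equal length, with '-' marking gaps.
    Conventions here are an assumption of this sketch.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    aligned_query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns if columns else 0.0
    coverage = 100.0 * aligned_query_bases / query_length if query_length else 0.0
    return identity, coverage


# Toy alignment: 8 matching columns out of 10; 9 of 18 query bases aligned.
print(identity_and_coverage("ACGT-ACGTA", "ACGTTACCTA", query_length=18))
# (80.0, 50.0)
```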


Fig. 4. Estimated source of PVI is a virus-like transmissible agent. (A) Examples of lineage-specific PVIR insertions. Genome alignments with the genomic positions of tarsier KE945601v1:222811–226239(+) and white rhinoceros JH767724:60344592–60348202(+), which show tarsier- and laurasiatherian-specific insertions, are shown in Left and Right, respectively. See SI Appendix for the genomic positions of other species. (B) Phylogenetic tree of PVIRs. (C) Classification of PVIRs. The elements were categorized into clades by constructing a phylogenetic tree using the maximum-likelihood method. (D) Phylogenetic tree of PVIRs with rabbit elements. The rabbit element assigned to clade 1 is shown with a red arrow. (E) Sequence divergences of PVIRs in clade 1.1 and transposons in the human genome. The external branch lengths were calculated from a phylogenetic tree constructed based on nucleotide sequences using the neighbor-joining method, and the distances of the sequences were calculated using the Kimura two-parameter model. The external branch length of PVIR clade 1.1 is shown with orange dotted lines. (A and C) Cladogram topologies were taken from TimeTree. (B and D) The tree was constructed based on nucleotide sequences using the maximum-likelihood method with 1,000 bootstrap replicates. Bootstrap values greater than 70% are indicated. The labels at the leaf nodes represent the element names designated in Dataset S2, and the scale bars show nucleotide substitutions per site.
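The Kimura two-parameter distance named in panel E corrects the observed proportions of transitions (P) and transversions (Q) between two aligned sequences for multiple substitutions at the same site: d = −(1/2) ln[(1 − 2P − Q) √(1 − 2Q)]. The minimal Python sketch below implements this standard formula for a single pair of sequences. The function name `k2p_distance` and the choice to skip columns containing gaps or ambiguous bases are assumptions of this example; the tree building and branch-length extraction used in the study are described in SI Appendix, Methods.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned nucleotide sequences.

    Columns with gaps or non-ACGT characters are skipped (an assumption here).
    math.log or math.sqrt raises ValueError when the sequences are too
    divergent for the correction to be defined.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A -> G) in ten compared sites.
print(round(k2p_distance("ACGTACGTAC", "GCGTACGTAC"), 4))  # 0.1116
```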

transposons can accumulate mutations only after mobilization in the germline. In contrast, virus genomes continuously acquire mutations or variations during their replication cycle in somatic cells, before endogenization. This preexisting divergence of nrEVE source sequences, in addition to mutations that accumulate after endogenization, should give rise to higher sequence divergence of nrEVEs than of transposons formed at the same age. To evaluate whether PVIRs have higher divergence than transposons, we calculated sequence diversity using external branch lengths (SI Appendix, Methods and Fig. S11A). We used subclade 1.1 for this analysis because this clade is the youngest according to orthology analysis, and its sequences are abundant in the human genome. None of the subclade 1.1 element integrations between human and tarsier appeared orthologous, suggesting that these elements expanded after the divergence of human and tarsier (<67 MYA). The average external branch length of the elements exceeds that of transposons expanded at a similar age (Fig. 4E and SI Appendix, Fig. S11B), suggesting that PVIRs had already acquired some mutations and existed as polymorphic sequences before their insertion. This observation supports the scenario in which PVIRs originated from an exogenous infectious source, such as a virus.

Discussion

This study uncovers virus-like insertions in the human genome that lack pairwise homology to known viruses. This was made possible by using a machine learning approach to detect k-mer–based signatures in sequences derived from ancient RNA viruses. We demonstrated that nrEVE sequences have specific signatures distinguishable from those of other human sequences and that the sequence features of ancient RNA viruses may retain similarity to those of extant RNA viruses. This approach opens a window for exploring ancient RNA virus sequences hidden in animal genomes. In addition, our findings show that the current knowledge of ancient virus diversity is still rudimentary and that as-yet-undiscovered sequences derived from unknown viruses, such as unidentified or extinct viruses, remain hidden even in very well-studied animal genomes.

In this study, we demonstrated that nrEVEs derived from nonhomologous genes share specific sequence similarities. Although we could not elucidate why nrEVEs have these similarities because of the limited interpretability of the SVM, our results suggest that the k-mer occurrence of nrEVEs reflects that of the ancient RNA viral genomes from which they were derived. It is known that the dinucleotide composition of animal RNA viruses is mostly a characteristic of virus families rather than of host species (14). The VirFinder, a k-mer–based tool used to predict prokaryotic viral contigs from metagenomic data, correctly predicts viral sequences with no pairwise similarity to the training data (15). These results support the possibility that RNA viruses, including ancient viruses, may have k-mer patterns shared across different taxonomic families. Mechanistically, the viral k-mer space is shaped by several known selective constraints, such as codon usage bias, which buffers against error-prone replication (16), and the low-CG dinucleotide property that allows viruses to evade immune response (17, 21, 22). The k-mer frequency of nrEVEs identified by BLAST similarity search showed a spatial distribution different from that of human coding sequences and more similar to that of RNA viral sequences, suggesting that similar evolutionary constraints acting on ancient and extant RNA viruses might have resulted in the unique sequence signature of nrEVEs. Further studies regarding the sequence similarity between nrEVEs and current RNA viruses will provide detailed views of the sequence signature and host interactions of ancient RNA viruses.

Recent metagenomic and metatranscriptomic analyses have been uncovering previously undiscovered viral fragments in humans and environmental samples, indicating that many unknown infectious agents could still be present in the biosphere (23). Similarly, animal genomes may contain virus-derived sequences originating from as-yet-unidentified or extinct viruses. Our analysis revealed previously unrecognized endogenous bornavirus-like elements in the human genome that had not been identified before. In contrast, the present analysis did not identify sequences similar to known RNA viruses other than bornaviruses, even with weak similarities such as those below the threshold of detection defined in the general BLAST search settings. It may still be premature to conclude from this analysis that bornaviruses are the only RNA viruses that can contribute sequences to the human genome. However, our results indicate that Bornaviridae are rare RNA viruses that have existed for hundreds of millions of years along with the evolution of primate lineages.

In this study, we successfully identified a virus-like insertion, PVI, which has an unknown origin, in the human genome. This finding strongly suggests that there are still many uncharacterized virus-like insertions in mammalian genomes. Furthermore, the independent integrations and ubiquitous presence of PVIRs in the primate and laurasiatherian lineages strongly suggest that the sequences have expanded similar to an , indicating that PVIRs arise from the integration of exogenous agents, such as viruses. Phylogenetic analysis revealed that some PVIRs were most likely acquired by cross-species transmission of the exogenous source element. Sequence diversity suggested that PVIRs are more variable than transposons, implying an exogenous life cycle for the agent. Despite this body of evidence, however, we cannot conclude definitively that the source of PVIRs was an ancient RNA virus; the possibility that it originated from a previously undescribed transposon-like element capable of cross-species transmission remains. Not all of these sequences have the typical features of RNA virus integration but instead exhibit diverse integration junction sequences. In addition, some PVIRs lack both recognizable pA and TSDs. This feature suggests that during the replication cycle, the source agent may have produced a DNA form that could be integrated into a double-strand DNA break (24–26). We also found tandemly arrayed integrations of PVIRs, which is not a typical feature of canonical nrEVEs; however, endogenous retroviruses and retroviroid-like sequences are known to form tandemly repeated DNA sequences in the host genome (27, 28). Furthermore, adeno-associated virus generates tandem viral DNA in infected cells (29). In addition to animal viruses, -specific , which are composed of circular single-stranded RNA molecules, are also known to produce tandem genome units in their rolling circle replication process (30). Thus, the structure of some PVIR integrations may be a clue to the replication mechanism of the unidentified elements producing PVIRs. Further investigations of the genomic structure, as well as the replication mechanism, of PVIRs may provide more clues to the origin of this putative viral element.

Although most nongenic sequences contributing to the great size of many animal genomes are suggested to be transposons and highly decayed repeat sequences, the form of a substantial fraction remains unclear due to a lack of similarity to characterized sequences. Such unknown sequences are often referred to as “genomic dark matter” (31). Our study suggests that unexplored virus-derived sequences may be a part of the evolutionary origins of such complex genomic sequences. Viral machinery coded in endogenous retroviruses and nrEVEs are frequently co-opted or repurposed for novel cellular functions. Therefore, unveiling hidden viral insertions in animal genomes will provide insight into the novelty of animal genomes driven by lateral transfer from viruses.
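The low-CG dinucleotide property mentioned in the Discussion is often quantified as an observed/expected CpG ratio: the frequency of the CG dinucleotide divided by the product of the C and G mononucleotide frequencies. The short Python sketch below illustrates that ratio. It is offered only as an assumption about how such a property could be measured; the source does not state which statistic was used, and the function name `cpg_obs_exp` is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a nucleotide sequence.

    Illustrative helper (not from the source): values well below 1 indicate
    CpG depletion relative to base composition.
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if n < 2 or c_count == 0 or g_count == 0:
        raise ValueError("sequence needs at least one C and one G")

    # Count overlapping dinucleotide windows that read CG.
    cg_count = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")

    observed = cg_count / (n - 1)                # CG dinucleotide frequency
    expected = (c_count / n) * (g_count / n)     # product of base frequencies
    return observed / expected
```

A sequence containing C and G but no CG dinucleotide, such as `"GGCC"`, gives a ratio of 0, which is the extreme case of the suppression discussed in the text.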

In summary, k-mer–based machine learning of ancient virus sequence signatures will open a window for exploring unappreciated ancient gene flow from currently unidentified viruses. Our findings will provide extensive insights into long-term virus evolution, animal genome organization, and virus–host interactions.

Materials and Methods

Sequence Data Preparation for Construction of the SVM. The genomic positions of protein-coding exons, noncoding exons, pseudogene exons, and introns were retrieved from the GENCODE human genome annotation (release 27). For pseudogenes, gene_type “processed_pseudogenes” was used. Promoter regions were defined as the regions 1 kb upstream (downstream for transcripts in the antisense direction) of the start sites. Intergenic regions were defined as genomic regions other than protein-coding exons, noncoding exons, pseudogene exons, introns, promoters, and known nrEVEs (hsEBLN-1 to hsEBLN-7 and hsEBLG-5). The sequences of these genomic regions were obtained from human genome assembly hg38 with repeat sequences masked by our criteria (SI Appendix, pA-TSD Search).

nrEVEs, which have pairwise similarity to known viruses, were searched by tBLASTn with the following option: E value = 1e-10. We used orthobornaviruses and filoviruses as search queries because they are the only RNA viruses related to nrEVEs in vertebrate genomes. Whole-genome shotgun sequences of vertebrates were used for the database. The tBLASTn search was performed on 7 November 2017. Because we searched for nrEVEs among whole-genome shotgun sequences, hits contained multiple nrEVE copies of the same nucleotide sequence. These redundant hits were removed, and only one of them was retained for analysis. The accession numbers of the query protein sequences are listed in SI Appendix.

Construction of the SVM. We used a nonlinear SVM with a kernel function (32). For the kernel function, the Gaussian kernel function was adopted. The tuning parameters for the SVM were selected by twofold cross-validation. These analyses were performed by the functions svm and tune.svm in the package e1071 in R statistical software (version 3.5.3).

Manifold Learning. We used t-distributed stochastic neighbor embedding (t-SNE) for manifold learning (33). This analysis was performed by the function TSNE in the package scikit-learn in Python (version 3.7.2).

Hierarchical Clustering. To cluster the contribution ratios of k-mer occurrences, we used hierarchical clustering with the complete linkage method. This analysis was performed by the function heatmap in R. To cluster k-mer frequencies of human ORFs, viral ORFs, and nrEVEs, we used hierarchical clustering with Ward’s method (see also the legend of SI Appendix, Fig. S3). To measure similarities between k-mer frequencies, we used Euclidean distance. This analysis was performed by the function clustermap in the package seaborn in Python.

Phylogenetic Classification of PVIRs. The lengths of PVIRs range from ∼100 nt to several kilobases. Therefore, it is impossible to generate a phylogenetic tree containing all PVIRs. To classify all PVIRs into clades, we investigated their phylogenetic relationships in a one-by-one manner. To this end, we made three alignments with relatively long elements to represent the three clades and then added one sequence to the alignment to evaluate the phylogenetic relationship of the added sequence. The addition of one element was performed using the MAFFT L-INS-i --add option. The generated alignments were checked manually, and then the trees were inferred by the maximum-likelihood method with the partial deletion option using MEGA X software (34). The Tamura three-parameter model with a discrete gamma distribution (+G) was used. The reliability of each internal branch was assessed by 100 bootstrap resamplings.

Data Availability. Codes and data used in this article are available at https://github.com/shohei-kojima/Kojima_et_al_2021_PNAS. For the list of parameter settings used for the pA-TSD search of the human genome, genomic positions and manual annotations of sequences categorized in the nrEVE-like group by our nrEVE-search workflow, and genomic positions and manual annotations of the PVIRs found, see SI Appendix and Datasets S1, S2, and S3.

ACKNOWLEDGMENTS. This work was supported in part by Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) JP17H04083, JP19K22530, JP20H00662, and JP20H05682 (all to K.T.); Ministry of Education, Culture, Sports, Science and Technology KAKENHI JP16H06429, JP16K21723, and JP16H06430 (all to K.T.), JP17H05823 (to S.N.), and JP19H04833 (to M.H.); JSPS Core-to-Core Program, Japan Agency for Medical Research and Development Grant JP19fm0208014 (to K.T.); and the Joint Usage/Research Center Program on inFront, Kyoto University.
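As a rough illustration of the quantities named in Materials and Methods (k-mer frequencies, Euclidean distance between them, and a Gaussian kernel), the following standard-library Python sketch computes each for short example sequences. It is not the authors' implementation, which used e1071 in R and seaborn and scikit-learn in Python. The function names, the default k = 3, and the kernel width `gamma` are hypothetical choices made only for this example.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer in seq, in lexicographic order.

    Windows containing other characters (e.g. N) are ignored. The value of k
    is an assumption for illustration.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2); gamma is hypothetical."""
    return math.exp(-gamma * euclidean_distance(u, v) ** 2)


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    profile_a = kmer_frequencies("ATGCGATACGCTTGA", k=2)
    profile_b = kmer_frequencies("ATGAAATTTAAATGA", k=2)
    print(euclidean_distance(profile_a, profile_b))
    print(gaussian_kernel(profile_a, profile_b, gamma=1.0))
```

In an SVM with this kernel, sequences whose k-mer profiles are close in Euclidean distance receive kernel values near 1, and the kernel width would be one of the tuning parameters chosen by cross-validation.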

1. C. Feschotte, C. Gilbert, Endogenous viruses: Insights into and impact on host biology. Nat. Rev. Genet. 13, 283–296 (2012).
2. M. Horie et al., Endogenous non-retroviral RNA virus elements in mammalian genomes. Nature 463, 84–87 (2010).
3. V. A. Belyi, A. J. Levine, A. M. Skalka, Unexpected inheritance: Multiple integrations of ancient bornavirus and Ebolavirus/Marburgvirus sequences in vertebrate genomes. PLoS Pathog. 6, e1001030 (2010).
4. A. Katzourakis, R. J. Gifford, Endogenous viral elements in animal genomes. PLoS Genet. 6, e1001191 (2010).
5. D. J. Taylor, R. W. Leach, J. Bruenn, Filoviruses are ancient and integrated into mammalian genomes. BMC Evol. Biol. 10, 193 (2010).
6. T. Kondoh et al., Putative endogenous filovirus VP35-like protein potentially functions as an IFN antagonist but not a polymerase cofactor. PLoS One 12, e0186450 (2017).
7. K. Fujino, M. Horie, T. Honda, D. K. Merriman, K. Tomonaga, Inhibition of Borna disease virus replication by an endogenous bornavirus-like element in the ground squirrel genome. Proc. Natl. Acad. Sci. U.S.A. 111, 13175–13180 (2014).
8. M. R. Edwards et al., Conservation of structure and immune antagonist functions of filoviral VP35 homologs present in microbat genomes. Cell Rep. 24, 861–872.e6 (2018).
9. N. F. Parrish et al., piRNAs derived from ancient viral processed pseudogenes as transgenerational sequence-specific immune memory in mammals. RNA 21, 1691–1703 (2015).
10. K. Sofuku, N. F. Parrish, T. Honda, K. Tomonaga, Transcription profiling demonstrates epigenetic control of non-retroviral RNA virus-derived elements in the human genome. Cell Rep. 12, 1548–1554 (2015).
11. Y. Kobayashi et al., Exaptation of bornavirus-like nucleoprotein elements in afrotherians. PLoS Pathog. 12, e1005785 (2016).
12. H. Kirsip, A. Abroi, Protein structure-guided hidden Markov models (HMMs) as a powerful method in the detection of ancestral endogenous viral elements. Viruses 11, 320 (2019).
13. K. Kryukov, M. T. Ueda, T. Imanishi, S. Nakagawa, Systematic survey of non-retroviral virus-like elements in eukaryotic genomes. Virus Res. 262, 30–36 (2019).
14. F. Di Giallonardo, T. E. Schlub, M. Shi, E. C. Holmes, Dinucleotide composition in animal RNA viruses is shaped more by virus family than by host species. J. Virol. 91, e02381-16 (2017).
15. J. Ren, N. A. Ahlgren, Y. Y. Lu, J. A. Fuhrman, F. Sun, VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69 (2017).
16. A. S. Lauring, A. Acevedo, S. B. Cooper, R. Andino, Codon usage determines the mutational robustness, evolutionary capacity, and of an RNA virus. Host Microbe 12, 623–632 (2012).
17. M. A. Takata et al., CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature 550, 124–127 (2017).
18. M. Horie, Y. Kobayashi, Y. Suzuki, K. Tomonaga, Comprehensive analysis of endogenous bornavirus-like elements in genomes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 368, 20120499 (2013).
19. T. H. Hyndman, C. M. Shilton, M. D. Stenglein, J. F. X. Wellehan, Jr, X. Wellehan, Divergent bornaviruses from Australian carpet pythons with neurological disease date the origin of extant Bornaviridae prior to the end-Cretaceous extinction. PLoS Pathog. 14, e1006881 (2018).
20. V. Kapitonov, J. Jurka, The age of Alu subfamilies. J. Mol. Evol. 42, 59–65 (1996).
21. B. D. Greenbaum, S. Cocco, A. J. Levine, R. Monasson, Quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses. Proc. Natl. Acad. Sci. U.S.A. 111, 5054–5059 (2014).
22. V. Odon et al., The role of ZAP and OAS3/RNAseL pathways in the attenuation of an RNA virus with elevated frequencies of CpG and UpA dinucleotides. Nucleic Acids Res. 47, 8061–8083 (2019).

23. Y.-Z. Zhang, Y.-M. Chen, W. Wang, X.-C. Qin, E. C. Holmes, Expanding the RNA virosphere by unbiased metagenomics. Annu. Rev. Virol. 6, 119–139 (2019).
24. J. K. Moore, J. E. Haber, Capture of retrotransposon DNA at the sites of chromosomal double-strand breaks. Nature 383, 644–646 (1996).
25. C. A. Bill, J. Summers, Genomic DNA double-strand breaks are targets for hepadnaviral DNA integration. Proc. Natl. Acad. Sci. U.S.A. 101, 11135–11140 (2004).
26. D. G. Miller, L. M. Petek, D. W. Russell, Adeno-associated virus vectors integrate at breakage sites. Nat. Genet. 36, 767–773 (2004).
27. J. A. Daròs, R. Flores, Identification of a retroviroid-like element from . Proc. Natl. Acad. Sci. U.S.A. 92, 6856–6860 (1995).
28. D. Gao, Y. Li, K. D. Kim, B. Abernathy, S. A. Jackson, Landscape and evolutionary dynamics of terminal repeat retrotransposons in miniature in plant genomes. Genome Biol. 17, 7 (2016).
29. B. C. Schnepp, R. L. Jensen, C.-L. Chen, P. R. Johnson, K. R. Clark, Characterization of adeno-associated virus genomes isolated from human tissues. J. Virol. 79, 14793–14803 (2005).
30. R. Flores et al., replication: Rolling-circles, enzymes and . Viruses 1, 317–334 (2009).
31. A. P. J. de Koning, W. Gu, T. A. Castoe, M. A. Batzer, D. D. Pollock, Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7, e1002384 (2011).
32. B. E. Boser, I. M. Guyon, V. N. Vapnik, “A training algorithm for optimal margin classifiers” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, D. H. Haussler, Ed. (Association for Computing Machinery, New York, NY, 1992), pp. 144–152.
33. L. Van Der Maaten, G. Hinton, Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2625 (2008).
34. S. Kumar, G. Stecher, M. Li, C. Knyaz, K. Tamura, MEGA X: Molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol. 35, 1547–1549 (2018).
