bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

A new lineage of segmented RNA viruses infecting

Darren J. Obbard*1 Mang Shi2+ Katherine E. Roberts3+ Ben Longdon3+ Alice B. Dennis4+

*Author for correspondence + Contributed equally

1. Institute of Evolutionary Biology, University of Edinburgh, Charlotte Auerbach Road, Edin- burgh, United Kingdom 2. Charles Perkins Center, The University of Sydney, NSW, Australia 3. Biosciences, College of Life & Environmental Sciences, University of Exeter, Penryn Campus, Penryn, Cornwall, United Kingdom

4 Department of Evolutionary Biology & Systematic Zoology, Institute of Biochemistry and Biol- ogy, University of Potsdam, 4476 Potsdam, Germany

Page | 1 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Abstract Metagenomic sequencing has revolutionised our knowledge of virus diversity, with new virus sequences being reported faster than ever before. However, virus discovery from meta- genomic sequencing usually depends on detectable homology: without a sufficiently close relative, so-called ‘dark’ virus sequences remain unrecognisable. An alternative approach is to use virus-identification methods that do not depend on detecting homology, such as virus recognition by host antiviral immunity. For example, virus-derived small RNAs have previously been used to propose ‘dark’ virus sequences associated with the Drosophilidae (Diptera). Here we combine published Drosophila data with a comprehensive search of transcriptomic sequences and selected meta-transcriptomic datasets to identify a completely new lineage of segmented positive-sense single-stranded RNA viruses that we provisionally refer to as the Quenyaviruses. Each of the five segments contains a single open reading frame, with most encoding proteins showing no detectable similarity to characterised viruses, and one sharing a small number of residues with the RNA-dependent RNA polymerases of single- and double- stranded RNA viruses. Using these sequences, we identify close relatives in approximately 20 , including , crustaceans, spiders and a myriapod. Using a more conserved sequence from the putative polymerase, we further identify relatives in meta-transcriptomic datasets from gut, gill, and lung tissues of vertebrates, reflecting infections of vertebrates or of their associated parasites. Our data illustrate the utility of small RNAs to detect viruses with limited sequence conservation, and provide robust evidence for a new deeply divergent and phylogenetically distinct RNA virus lineage. Key Words Metagenome, RNA virus, dark virus, , RNA interference 1 Introduction 2018). Together, these two sources of (meta- )genomic data have ‘filled in’ the tree of vi- Pioneered by studies of oceanic phage ruses at many levels. They have expanded (Breitbart et al. 2002), since the mid-2000s the host range of known viruses (e.g. metagenomic studies have identified thou- Galbraith et al. 2018), identified vast num- sands of new viruses (or virus-like sequences) bers of likely new species and genera—con- associated with bacteria, plants, animals, sequently provoking considerable debate on fungi, and single-celled eukaryotes (reviewed how we should go about virus in Greninger 2018, Obbard 2018, Shi et al. (Simmonds et al. 2017, King et al. 2018, 2018, Zhang et al. 2018). At the same time, Simmonds and Aiewsakun 2018)—and iden- routine high-throughput sequencing has pro- tified new lineages that may warrant recog- vided a rich resource for virus discovery nition at family level, including Chuviruses, among eukaryotic host genomes and tran- Yueviruses, Qinviruses, Zhaoviruses, scriptomes (e.g. Bekal et al. 2011, Longdon et Yanviruses and Weiviruses (Li et al. 2015, Shi al. 2015, Webster et al. 2015, François et al. et al. 2016). More importantly, these discov- 2016, Mushegian et al. 2016, Gilbert et al. eries have also started to impact upon our 2019). Indeed, a recent survey suggested understanding of virus evolution (Wolf et al. that, as of 2018, around 10% of the available 2018), emphasising the importance of ‘mod- picornavirus-like polymerase sequences ex- ular’ exchange (Koonin et al. 2015, Dolja and isted only as un-annotated transcripts within Koonin 2018) and suggesting surprisingly the transcriptomes of their hosts (Obbard long-term fidelity to host lineages, at least at

Page | 2 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

higher taxonomic levels (Geoghegan et al. RNA virus sequences are absent from DNA. 2017, Shi et al. 2018). The presence and absence of contigs across datasets can also provide useful clues as to Despite the successes of metagenomic virus the origin of a sequence. Specifically, se- discovery, there are clear limitations to the quences that are present in all individuals, or approach. First, ‘virus-like sequences’ from a in all populations, are more likely to repre- putative host need not equate to an active vi- sent genome integrations, sequences that al- ral infection of that species. They may repre- ways co-occur with recognisable viral frag- sent integrations into the host genome, in- ments may be segments that are not detect- fections of cellular parasites or other micro- able by homology, and sequences that co-oc- biota, infections of gut contents, or simply cur with non-host sequences are candidates contaminating nucleic acid (reviewed in to be viruses of the microbiota. Obbard 2018). Second, most metagenomic methods rely on similarity searches to iden- One of the most powerful ways to identify vi- tify virus sequences through inferred homol- ruses is to capitalise on the host’s own ability ogy. This limits the new discoveries to the rel- to recognise pathogens, for example by se- atives of known viruses. In the future, as sim- quencing the copious virus-derived small ilarity search algorithms become more sensi- RNAs generated by the antiviral RNAi re- tive (e.g. Kuchibhatla et al. 2014, Yutin et al. sponses of plants, fungi, nematodes and ar- 2018), this approach may be able to uncover thropods (Aguiar et al. 2015, Webster et al. all viruses—at least those that have common 2015). This not only demonstrates host ancestry with the references. However, this recognition of the sequences as viral in approach will probably still struggle to iden- origin, but also (if both strands of ssRNA vi- tify less conserved parts of the genome, es- ruses are present) demonstrates viral replica- pecially for segmented viruses and incom- tion, and can even identify the true host of plete assemblies. As a consequence, there the virus based on the length distribution and may be many viruses and virus fragments base composition of the small RNAs that cannot be seen through the lens of ho- (compare Webster et al. 2016, with Coyle et mology-based metagenomics, the so-called al. 2018). ‘dark’ viruses (Rinke et al. 2013, Using ribosome-depleted RNA and small- Krishnamurthy and Wang 2017, Knox et al. RNA metagenomic sequencing, Webster et al 2018). (2015) previously proposed approximately The ultimate solution to the shortcomings of 60 ‘dark’ virus sequences associated with metagenomic discovery is to isolate and ex- Drosophila. These comprised contigs of at perimentally characterise viruses. However, least one 1kbp that were present as RNA but the large number of uncharacterised virus- not DNA, contained a long open reading like sequences means that this is unlikely to frame, lacked identifiable homology with be an option in the foreseeable future. In- known viruses or cellular organisms, and stead, we can use other aspects of meta- were substantial sources of the 21nt small genomic data to corroborate evidence of a vi- RNAs that characterise Drosophila antiviral ral infection (reviewed in Obbard 2018). For RNAi. They included ‘Galbut virus’ example, metagenomic reads are more con- (KP714100, KP714099), which has since been sistent with an active infection if RNA is very shown to constitute two divergent segments abundant (several percent of the total), if of an -infecting Partitivirus (KP757930; strand biases reflect active replication (such Shi et al. 2018) and is the most common virus as the presence of the coding strand for neg- associated with Drosophila melanogaster in ative sense RNA viruses or DNA viruses), or if the wild (Webster et al. 2015); ‘Chaq virus’

Page | 3 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

(KP714088), which may be a satellite or an (pool I), and the UK (pools S and T). Ribosome optional segment of Galbut virus (Shi et al. depleted and double-stranded nuclease nor- 2018); and 56 unnamed ‘dark’ virus frag- malised libraries were sequenced using the Il- ments (KP757937-KP757993). Subsequent lumina platform, and assembled using Trinity discoveries have since allowed 26 of these (Grabherr et al. 2011). Small RNAs were se- previously dark sequences to be identified as quenced from the same RNA pools, and the segments or fragments of viruses that display characteristic Dicer-mediated viral small- detectable homology in other regions, in- RNA signature used to identify around 60 pu- cluding several pieces of Flavi-like and Ifla- tative ‘dark’ virus sequences that lacked de- like viruses (Shi et al. 2016, Shi et al. 2016) tectable sequence homology (Supporting Fig- and the missing segments of a Phasmavirus ures S1 and S2; sequences accessions (Matthew J. Ballinger, pers. com.) and Torrey KP757937-KP757993). Raw data are available Pines reovirus (Shi et al. 2018). under NCBI project accession PRJNA277921. For details, see Webster et al (2015). Here we combine data from Webster et al (2015) with a search of transcriptome assem- Here we took four approaches to identify se- blies and selected meta-transcriptomic da- quences related to these ‘dark’ viruses of tasets to identify six of the remaining ‘dark’ Drosophila, and to associate ‘dark’ fragments Drosophila virus sequences as segments of into viral genomes based on the co-occur- the founding members of a new lineage of rence of homologous sequences in other segmented positive-sense single-stranded taxa. First, we obtained the collated tran- (+ss)RNA viruses. The protein encoded by scriptome shotgun assemblies available from segment 5 of these viruses shares a small the European Nucleotide Archive number of conserved residues with the RNA (ftp://ftp.ebi.ac.uk/pub/data- dependent RNA polymerases of Picorna- bases/ena/tsa/public/; Most recently ac- viruses, Flaviviruses, Permutotetraviruses, cessed 9th August 2019) and inferred their Reoviruses, Totiviruses and Picrobirna- protein sequences for similarity searching by viruses, but is not substantially more similar translating all long open reading frames pre- or robustly supported as sister to any of sent in each contig. We used these to build a these lineages—suggesting that the new lin- database for Diamond (Buchfink et al. 2014), eage may warrant recognition as a new fam- and used Diamond ‘blastp’ to search the da- ily. We find at least one homologous segment tabase with the translated ‘dark’ virus se- in publicly-available transcriptomic data from quences identified from Drosophila. Second, each of 40 different species, including we downloaded the pre-built tsa_nt BLAST multiple arthropods and a small number of database provided by NCBI vertebrates, suggesting these viruses are as- (ftp://ftp.ncbi.nlm.nih.gov/blast/db/), and sociated with a diverse range of animal taxa. used tblastn (Camacho et al. 2009) to search this database for co-occurring homologous 2 Methods fragments with the same sequences. Third, we used diamond ‘blastx’ (Buchfink et al. 2.1 Association of ‘dark’ virus segments 2014) to search large-scale metagenomic as- from Drosophila semblies derived from various invertebrates Webster et al (2015) previously performed (Shi et al. 2016) and vertebrates (Shi et al. metagenomic virus discovery by RNA se- 2018). For sources of raw data see Support- quencing from a large pool of wild-collected ing File S1. Fourth, to identify missing frag- adult Drosophila (Drosophilidae; Diptera). In ments associated with Drosophila, we also brief, ca. 5000 flies were collected in 2010 re-queried translations of the raw unanno- from Kenya (denoted pools E and K), the USA tated meta-transcriptomic assemblies of

Page | 4 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Webster et al (2015) transcriptome annotation using blastn. Sub- (https://doi.org/10.1371/jour- sequent protein-level searches (blastp, E-val- nal.pbio.1002210.s002) using blastp ues < 10-10) revealed sequence similarity in (Camacho et al. 2009). Fragments with ho- four of the fragments to putative ‘dark’ virus mologous sequences that consistently co-oc- sequences from Drosophila (Dennis et al. In curred across multiple transcriptomic da- Revision). Here we used read counts to con- tasets were taken forward as candidate seg- firm the co-occurrence of homologous frag- ments of new viruses. ments across L. fabarum individuals, and to identify a fifth viral segment, which had not 2.2 Identification of related viral segments previously detected on the basis of the origi- from Lysiphlebus fabarum nal small RNA profile in Drosophila, on the Transcriptomic data were collected from basis of its co-occurrence across samples. To adults and larvae of the parasitoid wasp Lysi- generate a complete viral genome, we se- phlebus fabarum (Braconidae; Hymenop- lected a high-abundance larval dataset (ABD- tera) as part of an experimental evolution 118-118, SRA sample SAMN10024157, pro- study (Dennis et al. 2017, Dennis et al. In ject PRJNA290156), for re-assembly with Revision). Briefly, parasitoids were reared in Trinity (Grabherr et al. 2011). For this assem- different sublines of the aphid Aphis fabae, bly we subsampled the reads by 10 thou- each either possessing different strains of the sand-fold, as we have found that at very high defensive symbiotic bacterium Hamiltonella levels of coverage, read-depth normalisation defensa, or no H. defensa. Aphid hosts were allows low-frequency polymorphisms to dis- reared on broad bean plants (Vica faba) and rupt assemblies. parasitoids were collected after 11 (adults) or 2.3 Determination of the genomic strand 14 (larvae) generations of experimental se- from a related virus of lection. Poly-A enriched cDNA libraries were constructed using the Illumina TruSeq RNA Strand-specific RNA libraries can be used to kit (adults) or the Illumina TruSeq Stranded identify strand-biases in viral RNA, providing mRNA kit (larvae). Libraries were sequenced a clue as to the likely genomic strand of the in single-end, 100bp cycles on an Illumina virus and evidence for replication. Positive HiSeq2500 (sequence data available under sense single-stranded (+ssRNA) viruses tend NCBI PRJNA290156). Trimmed and quality fil- to be very strongly biased to positive sense tered reads were assembled de novo using reads, replicating double-stranded (dsRNA) Trinity (v2.4.0, for details see Dennis et al. In viruses are weakly biased toward positive- Revision), read-counts were quantified by sense reads, and replicating negative sense (- mapping to the reference using Bowtie2 ssRNA) viruses are weakly biased toward (Langmead and Salzberg 2012), and negative sense reads. This is because mRNA- uniquely-mapping read counts were ex- like expression products of replicating viruses tracted with eXpress (Roberts and Pachter have an abundance approaching that of the 2012). To assign taxonomic origin, the assem- genomic strand. Unfortunately, much RNA bled L. fabarum transcripts were used to sequencing is strand-agnostic (including that query the NCBI nr protein blast database from the Drosophila datasets of Webster et (blastx, e-values < 10-10). The subsequent dif- al (2015)) and the vast majority of Eukaryotic ferential expression analysis identified sev- transcriptomic datasets are sequenced from eral highly expressed fragments that were poly-A enriched RNA (such as that from Lysi- not present in the L. fabarum draft genome phlebus fabarum), which artificially enriches nor in transcripts from the host aphid (A. fa- for polyadenylated RNAs such as mRNA-like bae), and were not identified in the whole- expression products. We therefore sought

Page | 5 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

relatives in a strand-specific meta-tran- min; for a further 30x cycles). As a positive scriptomic dataset that had been prepared control for RT we used host Cytochrome Oxi- without poly-A enrichment. dase I amplified with LCO/HCO primers (Folmer et al. 1994) (94ºC 30 sec, 46ºC 1 min, For this purpose, we used a metagenomic da- 72ºC 1 min; for 5x cycles followed by; 94ºC 30 taset prepared as part of an ongoing study of sec, 50ºC 1min, 72ºC 1 min; for a further 35x British Lepidoptera (Longdon & Obbard, un- cycles). All PCR reactions were carried out in published). Briefly, between one and twelve duplicate using Taq DNA Polymerase and adults (total of 45) of each of 16 different ThermoPol Buffer (New England Biolabs). We species were collected from Penryn (Corn- used (RT negative) PCR to confirm that none wall, UK) and Buckfastleigh (Devon,UK) in of these segments were present as DNA. To July and September 2017 respectively. Total confirm the identity of the resulting PCR RNA was extracted from each individual us- products, positive samples were Sanger se- ing Trizol-Chloroform extractions according quenced from the reverse primer using to the manufacturer’s instructions, and a BigDye (Applied Biosystems) after treatment strand-specific library prepared from the with exonuclease I and shrimp alkaline phos- combined pool using an Illumina TruSeq phatase. stranded total RNA kit treating samples with Gold rRNA removal mix. This was sequenced 2.4 Inference of protein domain homology by the Exeter University Sequencing service Searches using blastp had previously been using the Illumina platform. The reads were unable to detect homology between the pu- assembled de novo using Trinity (Grabherr et tative ‘dark’ virus sequences of Drosophila al. 2011), and the resulting assemblies and known proteins (Webster et al. 2015). searched as protein using Diamond ‘blastp’ However, more sophisticated Hidden Mar- (Buchfink et al. 2014). kov Model approaches to similarity searching We then used an RT-PCR screen to confirm that use position-specific scoring matrix the identity of the host, and to confirm that (PSSM) profiles are known to be more sensi- the five putative segments co-occurred in the tive (Kuchibhatla et al. 2014). We therefore same individual. RNA was reverse-tran- aligned the putative viral proteins from Dro- scribed using GoScript reverse transcriptase sophila with their homologs from other tran- (Promega) with random hexamer primers, scriptomic datasets using MUSCLE (Edgar then diluted 1:10 with nuclease free water. 2004), and used these alignments to query PCRs to amplify short regions from the five PDB, Pfam-A (v.32), NCBI Conserved Domain viral segments (S1-S5) were carried out with (v.3.16) and TIGRFAMs (v.15.0) databases us- the following primers: S1F ing HHpred (Zimmermann et al. 2018). ATGCATCTCGTTCCTGACCA and S1R 2.5 Phylogenetic analysis GCCCCTTCAGACAGCTCTAA; S2F CACCAC- CAAGAACGGACAAG and S2R TGCCACCAC- To infer relationships among the new viruses, TCTAACCACAT; S3F AGCAATTCAACGAC- we aligned protein sequences using Mcoffee CACACC and S3R GATAGGGGACAGGG- from the T-coffee package (Wallace et al. CAGATC; S4F ATGAACGAGAGGTGCCTTCA 2006), and inferred relationships by maxi- and S4R CTCCATCACCTTGACATGCG; S5F mum likelihood using IQtree (Nguyen et al. TGCACTGTTCAGCTACCTCA and S5R 2014). For each of the segments available CCGTGTCGTTCGATGAAGTC, using a touch- from Drosophila, L. fabarum, Lepidoptera, down PCR cycle (95ºC 30 sec, 62ºC (-1ºC per and the other species, between 13 (Segment cycle) 30 sec, 72ºC 1 min; for 10x cycles fol- 3) and 41 (Segment 5) protein sequences lowed by; 95ºC 30 sec, 52ºC 30 sec, 72ºC 1

Page | 6 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

were aligned, depending on level of se- after the effective sample size for most of the quence conservation. Regions of low conser- parameters (including the topological ESS) vation at either end of the alignments were was in excess of 5000 and the potential scale selected by eye and removed. However, no reduction factor for these parameters less internal regions were trimmed, as trimming than 1.01. The exceptions were three param- leads to bias toward the guide tree and gives eters relating to the absolute evolutionary false confidence (Tan et al. 2015). The end- rate (tree scale) and the distribution of rates trimmed alignments were then used to infer across sites. These occasionally flipped be- phylogenetic relationships for each of the tween two solutions with identical likeli- segments using the LG protein substitution hoods, and had overall effective sample sizes matrix (Le and Gascuel 2008) with inferred of ca. 200. We do not believe this is likely to residue frequencies and a 4-category discre- compromise our conclusions regarding the tised gamma distribution of rates. uncertainty in tree topology. To illustrate the relative distance (and likely 3 Results unresolvable relationships) between the new viruses and previously described virus fami- 3.1 Four segments of a ‘dark’ virus associ- lies, we selected for phylogenetic analysis the ated with Drosophila and other arthropods RNA dependent RNA polymerase (RdRp) se- We hypothesised that although the putative quences from representatives of major line- ‘dark’ virus fragments proposed by Webster ages of +ssRNA viruses. We aligned a core et al (2015) on the basis of small-RNA profiles RdRp sequence of 215-513 residues for a to- (Supporting Figures S1 and S2) lacked detect- tal of 255 viruses, using two different meth- able homology with known viruses, their rel- ods; T-coffee ‘Expresso’ (Armougom et al. atives may be present—but unrecognised— 2006), which uses structural data to inform in transcriptome assemblies from other spe- the alignment, and T-coffee ‘accurate’, which cies. If so, we reasoned that the co-occur- combines structural data and protein pro- rence of homologous sequences across dif- files. Each of these alignments was used to ferent datasets could allow fragments from infer the phylogenetic relationship of these Drosophila to be associated into complete vi- clades by maximum likelihood, using IQtree rus genomes. Using similarity searches we in- as described above (Nguyen et al. 2014). As itially identified six fragments from Webster before, alignment ends were trimmed by et al (2015) that each consistently identified eye, but not internally (Tan et al. 2015). To homologs in several distantly related tran- examine the consequences of conditioning scriptomic datasets; those of the centipede on a specific alignment, we also inferred se- Lithobius forficatus (transcriptome GBKE; quence relationships using BALi-Phy NCBI project PRJNA198080 (Rehm et al. (Redelings 2014). BALi-Phy uses a Bayesian 2014)), the locust Locusta migratoria ma- MCMC sampler to jointly infer the alignment, nilensis (GDIO; PRJNA283919 (Zhang et al. the tree, and the substitution and indel 2015)), the leafhopper Clastoptera arizonana model parameters. Although computation- (GEDC; PRJNA303152 (Tassone et al. 2017)), ally expensive, this captures some of the un- the hematophagous bug Triatoma infestans certainty inherent in inferring homology dur- (GFMC; PRJNA304741 (Traverso et al. 2017)), ing alignment, and empirically BALi-Phy per- and two parasitoid wasps, Ceraphron spp. forms well with highly divergent sequences (GBVD; PRJNA252127 (Peters et al. 2017)) (Nute et al. 2018). We ran 22 simultaneous and Psyttalia concolor (GCDX; instances of BALi-Phy (totalling approxi- PRJNA262710). Motivated by this discovery mately 1.7 CPU years; Xeon E5-2620 v4 of four homologous sequence groups across @2.10GHz), analysing the combined output

Page | 7 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

these taxa, we performed a new search of ‘Nora’ virus (new in Armenian (Habayeb et al. the Webster et al (2015) data that identified 2006)) and ‘Galbūt’ virus (maybe in Lithua- two additional fragments. The eight Drosoph- nian (Webster et al. 2015)), with Kwí and Nai ila-associated sequences formed two groups being indicators of uncertainty (maybe, per- (four sequences from drosophilid pool E and haps) in JRR Tolkien’s invented language four from drosophilid pool I) encoding pro- Quenya (Wickmark 2019). teins that ranged between 40% and 60% 3.2 A related hymenopteran virus identifies amino acid identity (See supporting File S1 a fifth segment for accession numbers). In an unrelated expression study of the para- Subsequent searches later identified homo- sitoid wasp Lysiphlebus fabarum, Dennis et al logs in 14 other arthropod transcriptomes, (In Revision) identified four sequences show- including six from Hymenoptera, five from ing clear 1:1 homology with the segments of , two from Coleoptera, and one Kwi virus and Nai virus. These were again ca. each from Lepidoptera and Odonata (Sup- 1.5kb in length, and each encoded a single porting File S1). We also identified some seg- open reading frame (Figure 1). Each segment ments in a plant transcriptome (Jasminum had a poly-A tract at the 3’ end, suggesting sambac; PRJNA551353; SAMN12158026). either that the virus has poly-adenylated ge- However, as this dataset contained a large nome segments, or that these represent number of reads from the Jasmine whitefly poly-adenylated mRNA-like expression prod- Dialeurodes kirkaldyi, we think it unlikely that ucts. Strongly consistent with a viral origin, the plant is the true host. the sequences were present in some individ- Although none of the protein sequences uals but not others (Supporting Figure S3), al- from these fragments displayed significant ways co-occurred with correlated read num- blastp similarity to characterised proteins, bers (correlation coefficient >0.87; Support- the presence of the four clear homologs in ing Figure S3C), and could be extremely eight unrelated arthropod transcriptomes abundant—accounting for up to 40% of non- strongly supported an association between ribosomal reads and equating to 1 million- them. In addition, the similar length and sim- fold coverage of the virus in some wasps (Fig- ilar coding structure of the fragments across ure 1). species suggested that they comprise the ge- Based on the high abundance and the clear nomic sequences of a segmented virus (all pattern of co-occurrence, we searched for between 1.5 and 1.7 kbp, containing a single other wasp-associated contigs displaying the open reading frame; Figure 1). Finally, as ex- same properties, reasoning that these were pected for viruses of Drosophila, all segments likely to be additional segments of the same were sources of 21nt small RNAs from along virus. This search identified a candidate 5th the length of both strands of the virus, segment of ca. 2kbp, again containing a sin- demonstrating that the virus is recognised as gle open reading frame (Figure 1). We then a double-stranded target by Dicer-2 (Sup- sought homologs of this 5th segment in the porting Figures S1 and S2). We therefore data of Webster et al (2015) and in the public speculatively named these putative viruses transcriptomic datasets outlined above. As from drosophilid pools E and I as ‘Kwi virus’ expected, we were able to find a homolog in and ‘Nai virus’ respectively, and submitted almost every case, confirming co-occurrence them to GenBank (KY634875-KY634878; of the five putative viral segments across da- KY634871-KY634874; mentioned in Obbard tasets (Figure 1; supporting File S1; Nai virus 2018). Provisional names were chosen fol- NCBI accession MH937729, Kwi virus lowing the precedent set by Drosophila MH937728). The protein encoded by the

Page | 8 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

newly-identified segment 5 was substantially tive-sense (coding) strand, strongly suggest- more conserved than the other proteins, ing that this virus has a +ssRNA genome (Sup- with 64% amino-acid identity between Kwi porting File S2). The five segments appeared virus and Nai virus. We believe that it had complete at the 3’ end, possessing a poly-A most likely been missed from the putative tail and suggesting that the genomic +ssRNA ‘dark’ viruses of Webster et al (2015) because is polyadenylated (Figure 1). Four of the five of the relatively small number of reads pre- segments (excluding segment 2) possessed a sent in that dataset (10-100 fold coverage; conserved sequence of ca. 150nt at the 3’ Figure 1). Based on these segments, we used end, and a similar pattern (but not sequence) a re-assembly of a single larval Lysiphlebus was seen in the closely related segments fabarum dataset (sample ABD-118; Support- from the Ceraphron sp. transcriptome. How- ing Figure 3) to provide an improved assem- ever, we were unable to identify any 5’ pat- bly, which we provisionally named ‘Sina Vi- tern or motif shared among the segments. rus’, reflecting our increased confidence that An RT-PCR survey of the individual RNA the sequences are viral in origin (Sína is extractions used to create the metagenomic Quenya for known, certain, ascertained) and pool showed that all five segments co-occur submitted the sequences to Genbank under in a single Crocallis elinguaria individual (Ge- accession numbers MN264686-MN264690. ometridae; Lepidoptera), collected at lati- 3.3 A related Lepidopteran virus suggests tude 50.169, longitude -5.125 on +ssRNA as the genomic material 23/Jul/2017. RT-negative PCR showed that viral segments were not present in a DNA To determine whether these virus genomes form. are likely to be double-stranded RNA (dsRNA), positive sense single stranded 3.4 Related viruses are present in meta- (+ssRNA) or negative sense single-stranded (- genomic datasets from other animals ssRNA), we identified a related virus in a After identifying the complete (five segment) strand-specific metatranscriptomic dataset virus genomes in transcriptomic datasets that had been prepared without poly-A en- from 12 different arthropods, and incom- richment from several species of Lepidoptera plete genomes (between one and four seg- (Longdon & Obbard, unpublished). All 5 seg- ments) in a further 15 arthropod datasets ments were detected (Figure 1), and as was (Supporting File S1), we sought to capitalise the case for Kwi, Nai, and Sina viruses, seg- on recent metagenomic datasets to identify ments 1-4 were around 1.6kbp and segment related sequences in other animals (Shi et al. 5 around 2kbp in length, each containing a 2016, Shi et al. 2018). This search yielded single open reading frame (Figure 1). We complete (or near-complete) homologs of have provisionally named these sequences as segment 5 (the most conserved protein) in 18 ‘Nete virus’ (Netë is Quenya for another one, further datasets, including four from mixed one more) and submitted them to GenBank pools of insects, two from spiders, three from under accession numbers MN264681- crustaceans, seven from bony fish, and one MN264685. each from a toad (Dongxihu virus associated Overall, this virus accounted for 3% of the with Bufo gargarizans) and a lizard reads in the metagenomic pool, giving (Bawangfen virus associated with Calotes around 10 thousand-fold coverage of the ge- versicolor). Five of these pools also contained nome (Figure 1). An analysis of the strand homologs of segment 1 (the second most bias in the metagenomic sequencing found conserved protein), and one also contained that 99.8% of reads derived from the posi- segment 4 (the third most conserved pro- tein). These sequences have been submitted

Page | 9 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

to Genbank under accession identifiers we examined the deviation in dinucleotide MN371231-MN371254; See Supporting File S1 composition from that expected on the basis for details. of the base composition, as this is reported to be predictive of host lineages (Kapoor et The finding that these virus sequences can be al. 2010, but see Di Giallonardo et al. 2017). associated with both vertebrates and inver- However, we were unable to detect any clear tebrates may indicate that they are broadly pattern among viruses, either by inspection distributed across the metazoa (note the only of a PCA, or using a linear discriminant func- non-metazoan associated sequence came tion analysis. This may support a homoge- from a plant trascriptome contaminated with nous pool of true hosts, such as arthropods insects). However, metagenomic data alone but not vertebrates, but the short sequence cannot confirm this, as such datasets can in- length available (<2kbp) and small sample clude contamination from gut contents or size (32 sequences) means that such an anal- parasites of the supposed host taxon. We ysis probably lacks power. therefore explored four sources of evidence that could be used to corroborate the tar- Finally, we also analysed the phylogenetic re- geted taxon as the true host. First, we exam- lationships for all of the segments, as (except ined the viral read abundance, as very high for vectored viruses) transitions between abundance is unlikely for viruses of contami- vertebrate and invertebrate hosts are gener- nating organisms. Abundance ranged from ally rare (Longdon et al. 2015, Geoghegan et over 37,124 Reads Per Kilobase per Million al. 2017). This showed that, despite the ap- reads (RPKM; 40% of non-ribosomal RNA) for parent absence of contaminating inverte- Sina virus in one Lysiphlebus fabarum sam- brates, sequences from the toad (Dongxihu ple, to 0.16 RPKM (six read-pairs) for Zhang- virus) and the lizard (Bawangfen virus) both gezhuang virus from a metagenomic pool of sit among arthropod samples (segments 1 Branchiopoda, with a median of 16.9 RPKM and 5; Figure 2E), as do the several other se- (Supporting File S1). This strongly supports quences from fish. The analysis also identi- some of the arthropods (such as Lysiphlebus) fied a deeply divergent clade of four se- as true hosts, but does not support or refute quences from bony fish with no close rela- that the virus may infect vertebrates (e.g. tives in invertebrates that, if not contamina- RPKM as high as 834 for one Scorpaeni- tion, could in principle represent a clade of formes fish sample, but as low as 4.6 in Dro- vertebrate-infecting viruses (Figure 2E). Ac- sophila Nai virus, where infection could be in- cession numbers, alignments and tree files dependently confirmed by the presence of are provided in Supporting file S3. 21nt viral small RNAs). Second, for two high- 3.5 Segment 5 has similarity to viral RNA de- coverage low species-complexity vertebrate pendent RNA polymerases metagenomic pools (the Bufo gargarizans lung sample and Calotes versicolor gut sam- Having identified 1:1 homologs in multiple ple) we searched raw assemblies for Cyto- datasets, we were able to use the aligned chome Oxidase I sequences of contaminating protein sequences to perform a more sensi- invertebrates. This found that <0.5% of the tive homology search for conserved protein RNA from Bufo gargarizans (Dongxihu virus motifs using HHpred (Zimmermann et al. RPKM of 94.6) and <0.01% of RNA from Ca- 2018). This still identified no significant simi- lotes versicolor (Bawangfen virus RPKM of larity in the proteins encoded by segments 2 338.5) derived from contaminating inverte- to 4 (E-value >1), and only a weakly-sup- brates, strengthening the possibility that the ported ca. 110 amino acid region of the seg- vertebrate is the true host. Third, for Seg- ment 1 alignment with similarity to methyl- ment 5 (which was available for most taxa) transferase / mRNA capping enzymes (E-

Page | 10 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

value = 0.0019; see Supporting File S4). How- identity of only 7.6%) and the inferred rela- ever, in contrast to searches using blastp, the tionships among such distantly-related line- alignment of segment 5 displayed a more ages are unlikely to be reliable (Bhardwaj et strongly supported ca. 300 amino acid region al. 2012, Nute et al. 2018). In particular, alt- with similarity to the RNA dependent RNA hough current phylogenetic methods per- polymerase of Norwalk virus (E-value = form surprisingly well on simulated data with 2.2x10-33; see Supporting File S4). This se- identities as low 8-10%, this is only true when quence was approximately equally matched homology is known (i.e. the true alignment is to around 25 different reference structure or available; Bhardwaj et al. 2012). When the motifs, including RdRps from both +ssRNA vi- alignment has to be inferred, performance is ruses such as Picornavirales, Flavi-like vi- poor—even when the true substitution ruses, and Permutotetraviruses, and dsRNA model is the one being modelled (Nute et al. viruses such as Reoviruses, Picobirnaviruses, 2018). We therefore compared between and Totiviruses. Notably, this region of simi- trees that conditioned on each of two differ- larity included a very highly conserved GDD ent alignment methods (Mcoffee modes ‘ex- motif that is shared by many viral polymer- presso’ and ‘accurate’), and also co-inferred ases, supporting the idea that segment 5 en- the tree and the alignment using BALi-Phy codes the viral polymerase. (Redelings 2014). Accession numbers, align- ments and tree files are provided in Support- 3.6 ‘Quenyaviruses’ are highly divergent ing File S3. and may constitute a new family All methods found the Quenyavirus RdRps to The new virus lineage described here has a form a monophyletic clade, supporting their distinctive genome structure comprising four treatment as a natural group. Two of the 1.6kbp +ssRNA segments each encoding a methods placed the Quenyaviruses closer to single protein of unknown function, and one (some of) the Reo-like viruses than to others 2kbp +ssRNA segment encoding an RdRp. (Figure 3B and C). However, there was little The putative RdRp is substantially divergent consistency in the placement of the other from those of characterised +ssRNA and clades relative to each other. Moreover, the dsRNA virus families, to the extent that simi- Bayesian joint alignment/tree analysis gave larity cannot be detected using routine almost no posterior support to any of the ma- blastp. On this basis we propose the informal jor clades (Figure 3C; Supporting File S3). It is name ‘Quenyaviruses’, reflecting the naming notable that many deep divisions seen in our of the four founding members, and suggest three different approaches differ to those in that they may warrant consideration as a the tree inferred by Wolf et al (2018), using new unplaced family. maximum likelihood conditioned from an To explore their relationships with other RNA alignment in which sites with >50% gaps had viruses using an explicit phylogenetic analy- been deleted. We believe that this suggests sis, we selected a region of 215-513 amino the relationships among these lineages can- acid residues of the core RdRp region from 11 not currently be robustly inferred. Neverthe- representative Quenyaviruses and 244 other less, the uncertainty in the placement of the +ssRNA viruses, representing most major lin- Quenyaviruses emphasises their deep diver- eages. We excluded birnaviruses and per- gence from other taxonomically-recognised mutotetraviruses, which have a permuted virus clades. RdRp that cannot be straightforwardly aligned (Wolf et al. 2018). Phylogenetic infer- 4. Discussion ence is necessarily challenging with such high levels of divergence (mean pairwise protein

Page | 11 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Here we report the discovery of the Quenya- Drosophila virus sequences of Webster et al viruses, a new clade of segmented +ssRNA vi- (2015), 46% remain dark, 44% are now recog- ruses identifiable from multiple (meta-)tran- nisable as members of known virus lineages, scriptomic datasets, primarily of arthropods. and 10% represent a genuinely new diver- Four of these segments had initially been gent lineage (the Quenyaviruses)—albeit one identified as ‘dark’ viruses of Drosophila, for which a sensitive search can identify purely on the basis of the characteristic some evidence of homology. Second, how small-RNA signature created by the host an- many viruses are hiding in plain sight? Per- tiviral RNAi pathway (Webster et al. 2015). haps 10% of polymerase sequences from Pi- Now, by identifying a fifth segment encoding cornavirales are currently unannotated as a divergent RdRp, we show that they form a such within transcriptomic datasets (Obbard monophyletic clade that is only distantly re- 2018), and surveys of publicly available data lated to other +ssRNA viruses, and cannot be often identify multiple new viruses (e.g. confidently placed within a wider phylogeny. François et al. 2016, Gilbert et al. 2019). Some of the sequences we analyse here have As with other metagenomic studies of virus been in the public domain for more than 7 diversity, this work raises two important years, but without routine screening and an- questions. First, how well have we truly sam- notation (or submission of such sequences to pled the virosphere? Metagenomic studies databases) they not only remain unavailable often contain sequences lacking detectable for analysis, but also potentially ‘contami- homology, and it has been suggested that nate’ other analyses with misattributed taxo- these include many ‘dark’ viruses nomic information. Finally, our work also em- (Krishnamurthy and Wang 2017). This may phasises the ease with which new viruses can imply that many deeply-divergent viruses, or be identified, relative to the investment re- viruses lacking common ancestry with known quired to understand their biology. The families, remain to be discovered. Alterna- Quenyaviruses seem broadly distributed, if tively, many of the ‘dark’ sequences may be not common, but we have no knowledge of the less-conserved fragments of otherwise their host range, transmission routes, tissue easily-recognised virus lineages (e.g. François tropisms, or pathology. et al. 2018). Thus far, of the predicted ‘dark’

Acknowledgements We thank Christoph Vorburger for facilitating the re-use of data from Lysiphlebus, David Karlin for advice on the identification of protein domains, Benjamin Redelings for advice on the use of Bali-Phy, and all of the many researchers who provided their data to publicly available da- tabases. Funding Metagenomic sequencing of Drosophilidae was funded by a Wellcome Trust Research Career Development Fellowship to DJO (WT085064; http://www.wellcome.ac.uk/). Metagenomic sequencing of Lepidoptera was funded by a Sir Henry Dale Fellowship to BL, jointly funded by the Wellcome Trust and the Royal Society (Grant Number 109356/Z/15/Z http://www.well- come.ac.uk/). Work on Lysiphlebus was funded through SNSF Professorship nr. PP00P3_146341 and Sinergia grant nr. CRSII3_154396 to Prof. Christoph Vorburger.

Page | 12 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figures & Legends Figure 1: Virus segments and sequencing coverage Panels show the structure and fold-coverage for each of the five segments (columns), for each of the four founding Quenyaviruses (rows). Graphs represent fold-coverage on a log10 scale, with the structure of the segment annotated below to scale (dark: coding, pale: non-coding). Assembled contigs that terminated with a poly-A tract are denoted ‘AA’), and potentially in- complete open reading frames indicated with a jagged edge.

Page | 13 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 2: Phylogenetic trees for each of the viral segments Panels A-E show maximum-likelihood phylogenetic trees for segments 1-5, inferred from amino-acid sequences. Panel E shows the tree for the most conserved segment, which en- codes a putative RNA dependent RNA polymerase. Trees are mid-point rooted, and the scale bar represents 0.5 substitutions per site. The four viruses marked in bold are the founding members of the clade, those underlined come from nominally vertebrate metagenomic da- tasets, and species names in italic denote sequences from public transcriptomes. One, Jas- minum sambac (marked superscript 1), came from a plant transcriptome contaminated with the whitefly Dialeurodes kirkaldyi. Note that some aspects of tree topology appear to be con- sistent among segments, suggesting that reassortment may be limited. Sequence alignments and tree files are provided in Supporting File S3

Page | 14 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Figure 3: Relationship of the Quenyaviruses to other RNA viruses Unrooted phylogenetic trees showing the possible relationships between the RdRp (segment 5) of Quenyaviruses and RdRps of representatives from other groups of RNA viruses. Trees were in- ferred by maximum-likelihood using IQtree from alignments using TCoffee modes ‘expresso’ (A) and ‘accurate’ (B), or using a Bayesian approach (C) that co- infers the tree and alignment. None of the deep relationships had any support in the Bayesian analysis, although all of the major clades were recovered and many of the relationships between them are the same as those in (B). Sequence alignments are provided in Supporting File S3.

Page | 15 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supporting Figure S1: ‘Dark’ virus identification by small-RNA sequencing Points correspond to the contigs assembled by Webster et al (2015) that are sources of sub- stantial numbers of small RNAs, and thus candidates to be viruses (high viRNA:piRNA length ratio) or transposable elements (low viRNA:piRNA ratio). Those marked in black have high blastp-detectable sequence similarity to known viruses, and those marked in colour corre- spond to segments of Kwi and Nai virus. Many pale grey points in the top-right corner of the plot are the other unconfirmed siRNA ‘candidate’ viruses reported by Webster et al (2015).

Page | 16 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supporting Figure S2: Kwi virus small RNA size distribution The bar plots (left column) show the size distribution of reads mapping to each segment (rows 1-5) of Kwi virus. Bars are coloured according to the 5’ base (red U, yellow G, blue C and green A), numbers plotted above the x-axis show read counts mapping to the positive strand, and those below the x-axis mapping to the negative strand. Line plots (right column) show the genomic locations and numbers of the 21nt reads deriving from the positive (blue) and nega- tive (red) strands of the virus. Note that siRNA numbers reflect the apparent abundance of each segment in other hosts (Supporting Figure S3).

Page | 17 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supporting Figure S3: Co-occurrence of Sina virus segments across L. fabarum samples Panels show the virus read abundance for each segment (colours) from each of the adult sam- ples (A) and larval samples (B), and the correlation in read abundance between segments across all samples (C) on a scale of virus reads per kilobase per thousand total reads. Note that virus read numbers are highly correlated among segments (Panel C: correlation coeffi- cient >0.87), and that reads from segment 3 are always most abundant while those from seg- ment 5 are always least abundant (panel C). Note that Adult samples 1-3 were from the same experimental cage, as were 4-6.

Page | 18 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Supporting File S1: Virus details Excel table providing host species, NCBI project accessions, NCBI Samples, Read abundance and sequence accessions Supporting File S2: Strand bias in the sequencing reads from Lepidoptera Excel table giving the number of positive and negative sense forward-reads for each segment of Nete virus, with comparison ratios for high abundance viruses reported in Waldron et al (2018) and Medd et al (2018) Supporting File S3: Phylogenetic Analyses Compressed text files containing the alignments and tree files for Figures 2 and 3 Supporting File S4: Raw HHpred output Compressed text files containing the raw output from the HHpred analysis

Page | 19 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

References Aguiar, E., R. P. Olmo, S. Paro, F. V. Ferreira, I. J. D. de Faria, Y. M. H. Todjro, F. P. Lobo, E. G. Kroon, C. Meignin, D. Gatherer, J. L. Imler and J. T. Marques (2015). "Sequence-independent characterization of viruses based on the pattern of viral small RNAs produced by the host." Nucleic Acids Research 43(13): 6191-6206. Armougom, F., S. Moretti, O. Poirot, S. Audic, P. Dumas, B. Schaeli, V. Keduas and C. Notredame (2006). "Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee." Nucleic Acids Research 34(suppl_2): W604-W608. Bekal, S., L. L. Domier, T. L. Niblack and K. N. Lambert (2011). "Discovery and initial analysis of novel viral genomes in the soybean cyst nematode." Journal of General Virology 92(8): 1870-1879. Bhardwaj, G., K. D. Ko, Y. Hong, Z. Zhang, N. L. Ho, S. V. Chintapalli, L. A. Kline, M. Gotlin, D. N. Hartranft, M. E. Patterson, F. Dave, E. J. Smith, E. C. Holmes, R. L. Patterson and D. B. van Rossum (2012). "PHYRN: A Robust Method for Phylogenetic Analysis of Highly Divergent Sequences." PLOS ONE 7(4): e34261. Breitbart, M., P. Salamon, B. Andresen, J. M. Mahaffy, A. M. Segall, D. Mead, F. Azam and F. Rohwer (2002). "Genomic analysis of uncultured marine viral communities." Proceedings of the National Academy of Sciences 99(22): 14250-14255. Buchfink, B., C. Xie and D. H. Huson (2014). "Fast and sensitive protein alignment using DIAMOND." Nature Methods 12: 59. Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer and T. L. Madden (2009). "BLAST+: architecture and applications." BMC Bioinformatics 10(1): 421. Coyle, M. C., C. N. Elya, M. Bronski and M. B. Eisen (2018). "Entomophthovirus: An insect-derived iflavirus that infects a behavior manipulating fungal pathogen of dipterans." bioRxiv: 371526. Dennis, A. B., H. Käch and C. Vorburger (In Revision). "Dual RNA-seq in an aphid parasitoid reveals plastic and evolved adaptation." Dennis, A. B., V. Patel, K. M. Oliver and C. Vorburger (2017). "Parasitoid gene expression changes after adaptation to symbiont-protected hosts." Evolution 71(11): 2599-2617. Di Giallonardo, F., T. E. Schlub, M. Shi and E. C. Holmes (2017). "Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species." Journal of Virology 91(8): e02381-02316. Dolja, V. V. and E. V. Koonin (2018). "Metagenomics reshapes the concepts of RNA virus evolution by revealing extensive horizontal virus transfer." Virus research 244: 36-52. Edgar, R. C. (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Research 32(5): 1792-1797. Folmer, O., M. Black, W. Hoeh, R. Lutz and R. Vrijenhoek (1994). "DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates." Molecular marine biology and biotechnology 3(5): 294-299. François, S., D. Filloux, M. Frayssinet, P. Roumagnac, D. P. Martin, M. Ogliastro and R. Froissart (2018). "Increase in taxonomic assignment efficiency of viral reads in metagenomic studies." Virus research 244: 230-234. François, S., D. Filloux, P. Roumagnac, D. Bigot, P. Gayral, D. P. Martin, R. Froissart and M. Ogliastro (2016). "Discovery of parvovirus-related sequences in an unexpected broad range of animals." Scientific Reports 6: 30880. Galbraith, D. A., Z. L. Fuller, A. M. Ray, A. Brockmann, M. Frazier, M. W. Gikungu, J. F. I. Martinez, K. M. Kapheim, J. T. Kerby, S. D. Kocher, O. Losyev, E. Muli, H. M. Patch, C. Rosa, J. M. Sakamoto, S. Stanley, A. D. Vaudo and C. M. Grozinger (2018). "Investigating the viral ecology of global bee communities with high-throughput metagenomics." Scientific Reports 8(1): 8879. Geoghegan, J. L., S. Duchêne and E. C. Holmes (2017). "Comparative analysis estimates the relative frequencies of co-divergence and cross-species transmission within viral families." PLOS Pathogens 13(2): e1006215.

Page | 20 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Gilbert, K. B., E. E. Holcomb, R. L. Allscheid and J. C. Carrington (2019). "Hiding in plain sight: New virus genomes discovered via a systematic analysis of fungal public transcriptomes." PLOS ONE 14(7): e0219207. Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson, I. Amit, X. Adiconis, L. Fan, R. Raychowdhury, Q. Zeng, Z. Chen, E. Mauceli, N. Hacohen, A. Gnirke, N. Rhind, F. di Palma, B. W. Birren, C. Nusbaum, K. Lindblad-Toh, N. Friedman and A. Regev (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome." Nature Biotechnology 29: 644. Greninger, A. L. (2018). "A decade of RNA virus metagenomics is (not) enough." Virus Research 244: 218-229. Habayeb, M. S., S. K. Ekengren and D. Hultmark (2006). "Nora virus, a persistent virus in Drosophila, defines a new picorna-like virus family." Journal of General Virology 87(10): 3045-3051. Kapoor, A., P. Simmonds, W. I. Lipkin, S. Zaidi and E. Delwart (2010). "Use of nucleotide composition analysis to infer hosts for three novel picorna-like viruses." Journal of virology 84(19): 10322- 10328. King, A. M. Q., E. J. Lefkowitz, A. R. Mushegian, M. J. Adams, B. E. Dutilh, A. E. Gorbalenya, B. Harrach, R. L. Harrison, S. Junglen, N. J. Knowles, A. M. Kropinski, M. Krupovic, J. H. Kuhn, M. L. Nibert, L. Rubino, S. Sabanadzovic, H. Sanfaçon, S. G. Siddell, P. Simmonds, A. Varsani, F. M. Zerbini and A. J. Davison (2018). "Changes to taxonomy and the International Code of Virus Classification and Nomenclature ratified by the International Committee on Taxonomy of Viruses (2018)." Archives of Virology. Knox, M. A., K. R. Gedye and D. T. S. Hayman (2018). "The Challenges of Analysing Highly Diverse Picobirnavirus Sequence Data." Viruses 10(12): 685. Koonin, E. V., V. V. Dolja and M. Krupovic (2015). "Origins and evolution of viruses of eukaryotes: The ultimate modularity." Virology 479-480: 2-25. Krishnamurthy, S. R. and D. Wang (2017). "Origins and challenges of viral dark matter." Virus research 239: 136-142. Kuchibhatla, D. B., W. A. Sherman, B. Y. W. Chung, S. Cook, G. Schneider, B. Eisenhaber and D. G. Karlin (2014). "Powerful Sequence Similarity Search Methods and In-Depth Manual Analyses Can Identify Remote Homologs in Many Apparently “Orphan” Viral Proteins." Journal of Virology 88(1): 10-20. Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2." Nature Methods 9: 357. Le, S. Q. and O. Gascuel (2008). "An Improved General Amino Acid Replacement Matrix." Molecular Biology and Evolution 25(7): 1307-1320. Li, C. X., M. Shi, J. H. Tian, X. D. Lin, Y. J. Kang, L. J. Chen, X. C. Qin, J. G. Xu, E. C. Holmes and Y. Z. Zhang (2015). "Unprecedented genomic diversity of RNA viruses in arthropods reveals the ancestry of negative-sense RNA viruses." Elife 4. Longdon, B., G. G. R. Murray, W. J. Palmer, J. P. Day, D. J. Parker, J. J. Welch, D. J. Obbard and F. M. Jiggins (2015). "The evolution, diversity, and host associations of rhabdoviruses." Virus Evolution 1(1): 12. Medd, N. C., S. Fellous, F. M. Waldron, A. Xuéreb, M. Nakai, J. V. Cross and D. J. Obbard (2018). "The virome of Drosophila suzukii, an invasive pest of soft fruit." Virus Evolution 4(1): vey009-vey009. Mushegian, A., A. Shipunov and S. F. Elena (2016). "Changes in the composition of the RNA virome mark evolutionary transitions in green plants." BMC Biology 14(1): 68. Nguyen, L.-T., H. A. Schmidt, A. von Haeseler and B. Q. Minh (2014). "IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies." Molecular Biology and Evolution 32(1): 268-274. Nute, M., E. Saleh and T. Warnow (2018). "Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets." Systematic Biology 68(3): 396- 411.

Page | 21 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Obbard, D. J. (2018). "Expansion of the Metazoan Virosphere: Progress, Pitfalls, and Prospects. ." Current Opinion in Virology 31: 17-23. Peters, R. S., L. Krogmann, C. Mayer, A. Donath, S. Gunkel, K. Meusemann, A. Kozlov, L. Podsiadlowski, M. Petersen, R. Lanfear, P. A. Diez, J. Heraty, K. M. Kjer, S. Klopfstein, R. Meier, C. Polidori, T. Schmitt, S. Liu, X. Zhou, T. Wappler, J. Rust, B. Misof and O. Niehuis (2017). "Evolutionary History of the Hymenoptera." Current Biology 27(7): 1013-1018. Redelings, B. (2014). "Erasing Errors due to Alignment Ambiguity When Estimating Positive Selection." Molecular Biology and Evolution 31(8): 1979-1993. Rehm, P., K. Meusemann, J. Borner, B. Misof and T. Burmester (2014). "Phylogenetic position of Myriapoda revealed by 454 transcriptome sequencing." Molecular Phylogenetics and Evolution 77: 25-33. Rinke, C., P. Schwientek, A. Sczyrba, N. N. Ivanova, I. J. Anderson, J.-F. Cheng, A. Darling, S. Malfatti, B. K. Swan, E. A. Gies, J. A. Dodsworth, B. P. Hedlund, G. Tsiamis, S. M. Sievert, W.-T. Liu, J. A. Eisen, S. J. Hallam, N. C. Kyrpides, R. Stepanauskas, E. M. Rubin, P. Hugenholtz and T. Woyke (2013). "Insights into the phylogeny and coding potential of microbial dark matter." Nature 499: 431. Roberts, A. and L. Pachter (2012). "Streaming fragment assignment for real-time analysis of sequencing experiments." Nature Methods 10: 71. Shi, M., X. D. Lin, X. Chen, J. H. Tian, L. J. Chen, K. Li, W. Wang, J. S. Eden, J. J. Shen, L. Liu, E. C. Holmes and Y. Z. Zhang (2018). "The evolutionary history of vertebrate RNA viruses." Nature 556(7700): 197-+. Shi, M., X. D. Lin, J. H. Tian, L. J. Chen, X. Chen, C. X. Li, X. C. Qin, J. Li, J. P. Cao, J. S. Eden, J. Buchmann, W. Wang, J. G. Xu, E. C. Holmes and Y. Z. Zhang (2016). "Redefining the invertebrate RNA virosphere." Nature 540(7634): 539-+. Shi, M., X. D. Lin, N. Vasilakis, J. H. Tian, C. X. Li, L. J. Chen, G. Eastwood, X. N. Diao, M. H. Chen, X. Chen, X. C. Qin, S. G. Widen, T. G. Wood, R. B. Tesh, J. G. Xu, E. C. Holmes and Y. Z. Zhang (2016). "Divergent Viruses Discovered in Arthropods and Vertebrates Revise the Evolutionary History of the Flaviviridae and Related Viruses." Journal of Virology 90(2): 659-669. Shi, M., V. L. White, T. Schlub, J.-S. Eden, A. A. Hoffmann and E. C. Holmes (2018). "No detectable effect of Wolbachia wMel on the prevalence and abundance of the RNA virome of Drosophila melanogaster." Proceedings of the Royal Society B-Biological Sciences In Press. Shi, M., Y. Z. Zhang and E. C. Holmes (2018). "Meta-transcriptomics and the evolutionary biology of RNA viruses." Virus Research 243: 83-90. Simmonds, P., M. J. Adams, M. Benkő, M. Breitbart, J. R. Brister, E. B. Carstens, A. J. Davison, E. Delwart, A. E. Gorbalenya, B. Harrach, R. Hull, A. M. Q. King, E. V. Koonin, M. Krupovic, J. H. Kuhn, E. J. Lefkowitz, M. L. Nibert, R. Orton, M. J. Roossinck, S. Sabanadzovic, M. B. Sullivan, C. A. Suttle, R. B. Tesh, R. A. van der Vlugt, A. Varsani and F. M. Zerbini (2017). "Virus taxonomy in the age of metagenomics." Nature Reviews Microbiology 15: 161. Simmonds, P. and P. Aiewsakun (2018). "Virus classification – where do you draw the line?" Archives of Virology 163(8): 2037-2046. Tan, G., M. Muffato, C. Ledergerber, J. Herrero, N. Goldman, M. Gil and C. Dessimoz (2015). "Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single- Gene Phylogenetic Inference." Systematic Biology 64(5): 778-791. Tassone, E. E., C. C. Cowden and S. J. Castle (2017). "De novo transcriptome assemblies of four xylem sap-feeding insects." GigaScience 6(3). Traverso, L., A. Lavore, I. Sierra, V. Palacio, J. Martinez-Barnetche, J. M. Latorre-Estivalis, G. Mougabure-Cueto, F. Francini, M. G. Lorenzo, M. H. Rodríguez, S. Ons and R. V. Rivera-Pomar (2017). "Comparative and functional triatomine genomics reveals reductions and expansions in insecticide resistance-related gene families." PLOS Neglected Tropical Diseases 11(2): e0005313.

Page | 22 bioRxiv preprint doi: https://doi.org/10.1101/741645; this version posted December 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Waldron, F. M., G. N. Stone and D. J. Obbard (2018). "Metagenomic sequencing suggests a diversity of RNA interference-like responses to viruses across multicellular eukaryotes." PLOS Genetics 14(7): e1007533. Wallace, I. M., O. O'Sullivan, D. G. Higgins and C. Notredame (2006). "M-Coffee: combining multiple sequence alignment methods with T-Coffee." Nucleic Acids Research 34(6): 1692-1699. Webster, C. L., B. Longdon, S. H. Lewis and D. J. Obbard (2016). "Twenty-Five New Viruses Associated with the Drosophilidae (Diptera)." Evolutionary Bioinformatics 12(12): 13-25. Webster, C. L., F. M. Waldron, S. Robertson, D. Crowson, G. Ferrari, J. F. Quintana, J. M. Brouqui, E. H. Bayne, B. Longdon, A. H. Buck, B. P. Lazzaro, J. Akorli, P. R. Haddrill and D. J. Obbard (2015). "The Discovery, Distribution, and Evolution of Viruses Associated with Drosophila melanogaster." Plos Biology 13(7): 33. Wickmark, L. (2019). "Parf Edhellen: The collaborative dictionary dedicated to Tolkien's languages." Retrieved 15/07/2019, 2019, from http://www.elfdict.com. Wolf, Y. I., D. Kazlauskas, J. Iranzo, A. Lucía-Sanz, J. H. Kuhn, M. Krupovic, V. V. Dolja and E. V. Koonin (2018). "Origins and Evolution of the Global RNA Virome." mBio 9(6): e02329-02318. Yutin, N., D. Bäckström, T. J. G. Ettema, M. Krupovic and E. V. Koonin (2018). "Vast diversity of prokaryotic virus genomes encoding double jelly-roll major capsid proteins uncovered by genomic and metagenomic sequence analysis." Virology Journal 15(1): 67. Zhang, W., J. Chen, N. O. Keyhani, Z. Zhang, S. Li and Y. Xia (2015). "Comparative transcriptomic analysis of immune responses of the migratory locust, Locusta migratoria, to challenge by the fungal insect pathogen, Metarhizium acridum." BMC Genomics 16(1): 867. Zhang, Y. Z., M. Shi and E. C. Holmes (2018). "Using Metagenomics to Characterize an Expanding Virosphere." Cell 172(6): 1168-1172. Zimmermann, L., A. Stephens, S.-Z. Nam, D. Rau, J. Kübler, M. Lozajic, F. Gabler, J. Söding, A. N. Lupas and V. Alva (2018). "A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core." Journal of Molecular Biology 430(15): 2237-2243.

Page | 23