<<

Journal of General Virology

Polycipiviridae: a proposed new family of polycistronic picorna-like RNA --Manuscript Draft--

Manuscript Number: JGV-D-17-00319R1 Full Title: Polycipiviridae: a proposed new family of polycistronic picorna-like RNA viruses Article Type: Standard Section/Category: Insect Viruses - RNA Corresponding Author: Andrew Firth University of Cambridge -, UNITED KINGDOM First Author: Ingrida Olendraite Order of Authors: Ingrida Olendraite Nina I Lukhovitskaya Sanford D Porter Steven M Valles Andrew E Firth Abstract: Solenopsis invicta 2 is a single-stranded positive-sense picorna-like RNA virus with an unusual structure. The monopartite genome of approximately 11 kb contains four open reading frames in its 5′ one third, three of which encode proteins with homology to -like jelly-roll fold capsid proteins. These are followed by an intergenic region, and then a single long open reading frame that covers the 3′ two thirds of the genome. The polypeptide translation of this 3′ open reading frame contains motifs characteristic of picornavirus-like helicase, protease and RNA- dependent RNA polymerase domains. Inspection of public transcriptome shotgun assembly sequences revealed five related apparently nearly complete virus isolated from ant species and one from a dipteran insect. By high-throughput sequencing and in silico assembly of RNA isolated from Solenopsis invicta and four other ant species, followed by targeted Sanger sequencing, we obtained nearly complete genomes for four further viruses in the group. Four further sequences were obtained from a recent large-scale invertebrate virus study. The 15 sequences are highly divergent (pairwise amino acid identities as low as 17% in the non-structural polyprotein), but possess the same overall polycistronic genome structure distinct from all other characterized picorna-like viruses. Consequently we propose the formation of a new virus family, Polycipiviridae, to classify this clade of arthropod-infecting polycistronic picorna-like viruses. We further propose that this family be divided into three genera: Chipolycivirus (2 species), Hupolycivirus (2 species), and Sopolycivirus (11 species), with members of the latter infecting ants in at least three different subfamilies.

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation Response to Reviewer

Click here to access/download Response to Reviewer ResponseToReviews.pdf Manuscript Including References and Figures/Table legends Click here to download Manuscript Including References (Word document) MS_Polycipiviridae_IO.docx

1 Polycipiviridae: a proposed new family of polycistronic 2 picorna-like RNA viruses

3 4 Ingrida Olendraite1,a, Nina I. Lukhovitskaya1,a, Sanford D. Porter2, Steven M. Valles2, 5 Andrew E. Firth1,b 6 7 1 Division of Virology, Department of Pathology, University of Cambridge, Cambridge 8 CB2 1QP, U.K. 9 10 2 Center for Medical, Agricultural and Veterinary Entomology, USDA-ARS, 1600 SW 11 23rd Drive, Gainesville, FL 32608, U.S.A. 12 13 a These authors contributed equally to this work 14 b Correspondence – [email protected] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

1

33 Abstract

34 Solenopsis invicta virus 2 is a single-stranded positive-sense picorna-like RNA virus 35 with an unusual genome structure. The monopartite genome of approximately 11 kb 36 contains four open reading frames in its 5′ one third, three of which encode proteins with 37 homology to picornavirus-like jelly-roll fold capsid proteins. These are followed by an 38 intergenic region, and then a single long open reading frame that covers the 3′ two 39 thirds of the genome. The polypeptide translation of this 3′ open reading frame contains 40 motifs characteristic of picornavirus-like helicase, protease and RNA-dependent RNA 41 polymerase domains. Inspection of public transcriptome shotgun assembly sequences 42 revealed five related apparently nearly complete virus genomes isolated from ant 43 species and one from a dipteran insect. By high-throughput sequencing and in silico 44 assembly of RNA isolated from Solenopsis invicta and four other ant species, followed 45 by targeted Sanger sequencing, we obtained nearly complete genomes for four further 46 viruses in the group. Four further sequences were obtained from a recent large-scale 47 invertebrate virus study. The 15 sequences are highly divergent (pairwise amino acid 48 identities as low as 17% in the non-structural polyprotein), but possess the same overall 49 polycistronic genome structure distinct from all other characterized picorna-like viruses. 50 Consequently we propose the formation of a new virus family, Polycipiviridae, to classify 51 this clade of arthropod-infecting polycistronic picorna-like viruses. We further propose 52 that this family be divided into three genera: Chipolycivirus (2 species), Hupolycivirus (2 53 species), and Sopolycivirus (11 species), with members of the latter infecting ants in at 54 least three different subfamilies. 55 56 57 58 59 60

2

61 Introduction

62 The order currently comprises the families , , 63 , Picornaviridae and and the unassigned genera 64 and . Members of the order are characterized by (i) a 65 positive-sense RNA genome, usually with a 5′ covalently linked VPg ("virus protein, 66 genome linked") and a 3′ poly(A) tail, (ii) a polyprotein gene expression strategy with 67 cleavage mainly mediated by viral protease(s), (iii) a structural protein module 68 containing three jelly-roll capsid protein domains which form small non-enveloped 69 icosahedral virions with pseudo T = 3 symmetry, and (iv) a non-structural protein 70 module containing a superfamily III helicase (or NTPase), a 3C-like chymotrypsin-like 71 protease, and a superfamily I RNA-dependent RNA polymerase (RdRp) encoded 72 sequentially in that order [1]. In contrast, families and (for 73 example), although termed picorna-like (and informally grouped into a "picorna-like 74 superfamily") are excluded from the Picornavirales order for various reasons; for 75 example caliciviruses encode a single jelly-roll capsid protein and virions have true T = 76 3 symmetry, while have an unrelated capsid protein, filamentous virions, 77 and encode a superfamily II instead of superfamily III helicase. 78 79 Solenopsis invicta virus 2 (SINV-2) is a positive-sense single-stranded RNA virus that 80 infects the red imported fire ant, Solenopsis invicta Buren, an invasive species in 81 southern U.S.A. [2]. Replicating virus was detected in larval and adult stages of S. 82 invicta [3]. While field infection rates are rather low for S. invicta [4, 5], there are distinct 83 fitness costs for founding queens infected with the virus [6]. Colonies established from 84 SINV-2-infected queens produced less brood and had longer claustral periods. SINV-2 85 infection was also associated with significant up-regulation of global gene expression in 86 S. invicta queens, including immune response genes [6]. The monopartite genome was 87 originally found to contain three open reading frames (ORFs) in its 5′ one third and a 88 single long ORF in its 3′ two thirds, though we show here that there are actually four 89 consecutive 5′ ORFs, one having been overlooked previously as a result of a 90 sequencing error. The polypeptide encoded by the 3′ ORF contains motifs characteristic

3

91 of a superfamily III helicase, 3C-like chymotrypsin-related protease, and a superfamily I 92 RdRp [2]. A phylogenetic analysis of RdRp sequences placed SINV-2 within the 93 "picorna-like superfamily" but outside of established virus families [7]. Isometric particles 94 with a diameter of ~33 nm were found only in ants testing positive for SINV-2 by RT- 95 PCR [2]. 96 97 By searching the NCBI transcriptome shotgun assembly (TSA) database, we identified 98 six related sequences of which five were derived from ant samples. To further explore 99 the diversity and prevalence of this group of viruses, we performed high-throughput 100 sequencing of RNA derived from Solenopsis invicta and four U.K. ant species. This led 101 to the identification of four additional viruses. During the course of this work, 1445 new 102 viruses were identified via high-throughput sequencing [8], of which four exhibit similar 103 genome organizations to SINV-2. We performed a phylogenetic and comparative 104 genomic analysis of these SINV-2-like virus sequences. All have a characteristic 105 genome organization with four consecutive 5′ proximal ORFs and one long 3′ ORF. 106 ORF1, 3 and 4 are predicted to encode jelly-roll fold capsid proteins while ORF2 107 encodes a protein of unknown function. Ant-infecting members of the group apparently 108 have an additional 5′ ORF (ORF2b) overlapping the 5′ end of ORF2 and encoding a 109 small protein containing a predicted transmembrane domain. The 3′ ORF5 encodes 110 helicase, protease and RdRp domains, is presumed to encode a VPg and potentially 111 another protein between the helicase and protease, and likely encodes one or more 112 additional proteins upstream of the helicase. The unusual genome organization and 113 phylogenetic distinctness of this group of viruses suggests they should be classified into 114 a new virus family for which we propose the name Polycipiviridae (polycistronic 115 picorna-like viruses). The characteristic complement of picornavirus-like protein 116 domains suggests that the Polycipiviridae should be included within the Picornavirales 117 order. 118

4

119 Results

120 Identification of SINV-2-like viruses

121 The SINV-2 genome was first sequenced by Valles et al. [2]. Suspicious of a large 122 apparently non-coding gap following ORF1, we resequenced the SINV-2 genome. This 123 resulted in correction of a UGA stop codon to a UGG tryptophan codon, after which it 124 became apparent that the region following ORF1 is actually occupied by an open 125 reading frame. The reannotated SINV-2 ORFs are shown in Fig. 1. The new SINV-2 126 sequence has been deposited in GenBank under accession MF041813.1, and has 143 127 nucleotide differences relative to the previous SINV-2 sequence EF428566.1. 128 129 Further high-throughput sequencing of Solenopsis invicta RNA samples revealed the 130 presence of a new SINV-2-like virus, which we named Solenopsis invicta virus 4 (SINV- 131 4). A complete genome was obtained by Sanger sequencing and has been deposited in 132 GenBank with accession number MF041808.1. SINV-4 has the same genome 133 organization (Fig. 2), and 31% amino acid identity to SINV-2 in the ORF5 polypeptide. 134 135 Next we queried the NCBI Transcriptome Shotgun Assembly (TSA) database using 136 tblastn with the SINV-2 ORF5 polypeptide as query and "eukaryotes" as the target 137 taxonomic group. Amongst the top matching "hits", long sequences (>10 kb) were 138 identified, and their longest ORF extracted, translated and queried against NCBI 139 RefSeq virus genome sequences. Sequences whose reciprocal best match was SINV-2 140 were retained, while other sequences (typically with top matches to iflavirus or 141 dicistrovirus sequences) were discarded. This resulted in five long sequences with 142 reciprocal best matches to SINV-2. An additional long sequence was obtained by 143 merging the partly overlapping fragments LH935078.1 and LH935077.1. Five of the long 144 sequences derive from ant RNA-Seq libraries (order Hymenoptera, family Formicidae; 145 LA858223.1 and LA866448.1 from Monomorium pharaonis, LI526777.1 from Lasius 146 neglectus, LI719284.1 from Linepithema humile, and the combined LH935078.1 + 147 LH935077.1 from Formica exsecta). The sixth sequence, KA182589.1, derives from a 148 Chironomus riparius RNA-Seq library (order Diptera; family Chironomidea). Sequence

5

149 attributes are recorded in Table 1 and genome organizations are depicted in Fig. 2. A 150 number of shorter partial genome sequences were also identified, but are not discussed 151 further here. 152 153 Given that SINV-2-like viruses appeared to be most prevalent in ants, we obtained 154 samples from four different ant species (Lasius flavus, L. niger, L. neglectus and 155 Myrmica scabrinodis) collected in Cambridge, U.K., and performed two rounds of high- 156 throughput RNA sequencing. Contigs were assembled using Trinity [9, 10] and Velvet 157 [11], and SINV-2-like sequences identified with blastx. 158 159 In the first round, we sequenced L. niger and L. flavus RNA samples using both small- 160 RNA sequencing to enrich for 21–22 nt virus-derived RNAs that are expected to be 161 produced as a result of the insect RNA interference (RNAi) anti-viral defense pathway, 162 and standard RNA-Seq. In this case, good contig assemblies were not obtained for the 163 small-RNA samples and no further small-RNA sequencing was performed. For the long 164 RNA-Seq of L. niger, but not L. flavus, we identified one SINV-2-like sequence. In the 165 second round, we performed long RNA-Seq for RNA obtained from M. scabrinodis, L. 166 neglectus and a second L. flavus sample, and identified SINV-2-like sequences for M. 167 scabrinodis and L. neglectus. Morphological ant species identifications were confirmed 168 by blastn of cytochrome C oxidase subunit I (COI) sequences obtained from the 169 transcriptome assemblies against the NCBI non-redundant (nr) nucleotide database. 170 171 The three SINV-2-like sequences were verified by Sanger sequencing and extended 172 where possible by 5′ and 3′ RACE (see below). Sequence attributes are recorded in 173 Table 1 and genome organizations are depicted in Fig. 2. Hereafter we refer to these 174 viruses as Lasius niger virus 1 (LniV-1), Lasius neglectus virus 1 (LneV-1) and Myrmica 175 scabrinodis virus 1 (MsaV-1), with GenBank accession numbers MF041812.1, 176 MF041809.1 and MF041810.1, respectively. 177 178 Using Bowtie 2 [12], we mapped RNA-Seq reads (>30 nt) back to the assembled virus 179 genomes to assess coverage and variation. LniV-1, LneV-1 and MsaV-1 had mean

6

180 coverage values of 6.4, 457 and 1348, respectively. For the two viruses with high 181 coverage (Fig. 3), we identified single nucleotide polymorphisms (SNPs) present in 182 >10% of reads. In MsaV-1, we found seven SNPs in coding regions, all non- 183 synonymous and with frequencies just over 10%. In LneV-1, we found 87 SNPs of 184 which 62 were synonymous – 5/22 in ORF1, 3/3 in ORF2, 7/7 in ORF3, 8/8 in ORF4 185 and 39/47 in ORF5. 186 187 During the course of the above work, 1445 new RNA viruses were identified via high- 188 throughput sequencing of RNA samples obtained from a variety of invertebrates [8].The 189 authors identified the most similar previously published virus sequence for each new 190 virus, four of which were found to have highest similarity to SINV-2. These four 191 sequences derive not from ants but from a mixed sample of spiders (KX883688.1), a 192 mixed insect sample including members of the orders Odonata and Diptera 193 (KX883940.1), the crayfish Procambarus clarkia (KX884540.1), and a mixed insect 194 sample including members of the orders Lepidoptera, Diptera, Neuroptera and 195 Coleoptera (KX883910.1) with respectively 19%, 21%, 21% and 30% amino acid 196 sequence identity to SINV-2 in the ORF5 polypeptide. The sequences from spiders, 197 Odonata/Diptera and crayfish have low abundance (<0.1% of sample non-rRNA) 198 leading the authors to suggest the possibility that they may derive from parasites or gut 199 contents (which in these cases might include insects) instead of the target organisms. 200 Sequence attributes are recorded in Table 1 and genome organizations are depicted in 201 Fig. 2. 202 203 In summary, we identified a total of 15 SINV-2-like sequences (Table 1), comprising a 204 correction of the original SINV-2 sequence, SINV-4, six sequences from the NCBI TSA 205 database, three sequences from U.K. ant species, and four additional recently 206 published sequences. Fourteen of the sequences appear to represent complete or 207 nearly complete virus genomes. 208

7

209 Genome organization of SINV-2-like viruses

210 Each full-length sequence had five main ORFs (Fig. 2). The stop and start codons of 211 consecutive ORFs in the 5′ region have closely spaced (often overlapping) stop and 212 start codons (Table 1). We also identified a shorter sixth ORF (termed ORF2b) in some 213 members of the group (Fig. 2). ORF2b overlaps the 5′ end of ORF2 in the +1 frame 214 relative to ORF2. The Formica exsecta TSA contains an additional short ORF (ORF3a) 215 between ORFs 2 and 3. Hubei picorna-like virus 81 (accession number KX884540; [8]) 216 appears to be incomplete, lacking ORFs 1 and 2 and most of 3. Sequences with ORF2b 217 cluster phylogenetically (Fig. 4; see below) and all are ant-associated, except for 218 Shuangao insect virus 8 which was derived from an "insect mix" that was not reported 219 to contain members of the family Formicidae [8]. However, insects within this sample 220 may include predators of ants; thus the sequence could potentially have originated from 221 an ant (or other) host. 222 223 Most of the 15 sequences had some amount of the 5′ and 3′ untranslated regions 224 (UTRs) present, though, given the variability in UTR lengths, we assume that several 225 are incomplete. The 5′ UTR sequences were frequently >200 nt and ranged up to 366 nt 226 in length. Eight sequences extend to the poly(A) tail and may therefore be assumed to 227 be 3′-complete. For these sequences, the 3′ UTR lengths ranged from 385 to 479 nt, 228 excluding the poly(A) sequence. The length of the intergenic region between ORFs 4 229 and 5 ranged from 336 to 768 nt. To assess 5′-completeness we compared 5′ 230 nucleotide sequences between different species (Supplementary Fig. 1). A number of 231 polycipivirus sequences exhibit a predicted stable simple stem-loop structure close to 232 the 5′ end. In SINV-4, SINV-2 and the L. neglectus TSA, this is preceded by an AU-rich 233 tract with 5′-terminal UUU. This 5′ end similarity between these highly divergent 234 sequences suggests that this represents the true 5′ end of the genome. Relative to 235 these, the MsaV-1 and LneV-1 sequences appear to be missing approximately 5 and 23 236 nt respectively from their 5′ ends. For the LniV-1 sequence we were unable to obtain 237 complete 5′ or 3′ UTR sequences. 238

8

239 For each ORF, functions were predicted with HHpred [13] using the Pfam [14] and PDB 240 [15] databases. For most sequences, ORFs 1, 3 and 4 were predicted to encode 241 picornavirus-like capsid proteins (Fig. 2). For the most divergent sequences – Hubei 242 picorna-like virus 81 and 82 and the C. riparius TSA – ORF1 was predicted to encode a 243 picornavirus-like capsid protein, but for ORF4 this was only predicted for Hubei picorna- 244 like virus 81, and for ORF3 for none. However, for ORFs 3 and 4, HHpred [16, 17] found 245 significant sequence alignments (E-values < 0.0001; alignment lengths ranging from 246 112 to 245 amino acids) between an alignment of the 11 sequences from the SINV- 247 4/SINV-2 clade and a query alignment of the C. riparius TSA and Hubei picorna-like 248 virus 82, or the single Hubei picorna-like virus 81 sequence, suggesting that these 249 ORFs encode homologous proteins across all 14 sequences. For all sequences, ORF5 250 was predicted to encode helicase (Hel), protease (Pro) and RNA-dependent RNA 251 polymerase (RdRp) domains. ORFs 2 and 2b had no HHpred matches; however a 252 transmembrane domain was predicted using TMHMM [18] in the middle of all ORF2b 253 amino acid sequences. HHpred indicated homology between ORF2 of the SINV- 254 4/SINV-2 clade and ORF2 of Hubei picorna-like virus 81 (E-value < 10−8; alignment 255 length 126 amino acids), but homology to ORF2 of the C. riparius TSA/Hubei picorna- 256 like virus 82 clade was less certain (E-value = 0.029; alignment length 165 amino 257 acids). The Formica exsecta TSA ORF3a polypeptide also had no significant HHpred 258 matches. 259 260 Following Koonin and Dolja [19], we identified characteristic Hel, Pro and RdRp motifs 261 in the ORF5 polypeptide sequences. All sequences contained the three superfamily III 262 helicase motifs (Supplementary Fig. 2). The protease domain is less well conserved 263 across picorna-like viruses with only three very short characteristic motifs corresponding 264 to the catalytic triad – H, D and C (within a quite conserved GxCG motif) [19]. Positive- 265 sense RNA viruses, and in particular picorna-like viruses, usually have chymotrypsin- 266 like cysteine proteases. In the SINV-2-like sequences, however, the protease has a 267 serine (S) at the corresponding active site, GxSG (Supplementary Fig. 3). We were able 268 to find all eight conserved RdRp motifs (Supplementary Fig. 4), with the motifs most 269 closely matching superfamily I RdRps (a group which also includes the RdRps of

9

270 , potyviruses, sobemoviruses and nidoviruses; [19]). In two sequences 271 (the C. riparius TSA and Hubei picorna-like virus 82), the usually very well conserved 272 GDD in motif VI (also often called motif C) was replaced with ADD. 273 274 Phylogeny of SINV-2-like viruses

275 Based on the genome structure and the identified protein domains, SINV-2-like 276 sequences may be classified within the Picornavirales order. To test for monophyly, we 277 obtained amino acid sequences for the conserved "core" region of the RdRp from 278 viruses in the Picornavirales order from the supplementary material of ref. [7], appended 279 the equivalent region from the 15 SINV-2-like sequences with RdRp coverage, 280 realigned the sequences with MUSCLE [20] and generated a Bayesian Markov chain 281 Monte Carlo based phylogenetic tree using MrBayes [21]. The SINV-2-like sequences 282 form a monophyletic (albeit highly divergent) group (Fig. 5). 283 284 285 286 287 288 289 290 291 292 293 294 295 296

297 Discussion

298 We have identified and sequenced four new viruses from ant species. Together with 299 SINV-2, six sequences recovered from TSA databases, and four recent additions to

10

300 GenBank, these 15 sequences form a distinct group of arthropod-infecting viruses with 301 a characteristic genome organization. Members have a polyadenylated positive-sense 302 RNA genome that encodes three related picornavirus-like jelly-roll capsid domains, and 303 a non-structural polyprotein containing superfamily III helicase, 3C-like chymotrypsin- 304 related protease, and superfamily I RdRp domains, characteristic of members of the 305 order Picornavirales. Thus we propose that they should be classified into a new family, 306 Polycipiviridae (polycistronic picorna-like viruses), within the order Picornavirales. The 307 virion morphology (small icosahedral particles) observed for SINV-2 [2] is also 308 consistent with this placement. Like the dicistronic dicistroviruses (family Dicistroviridae) 309 and the bipartite cheraviruses, sadwaviruses and (family Secoviridae), the 310 polycipivirus coding sequences are split into separate non-structural and structural 311 protein modules. In contrast to these other groups, the polycipivirus structural protein 312 module is further split into separate 5′ ORFs rather than depending on polyprotein 313 expression of multiple jelly-roll domains from a single ORF. 314 315 Although polycipiviruses form a distinct clade, there is considerable diversity among the 316 15 available sequences (some of ORF5 pairwise amino acid identities as low as 17%), 317 indicating that the family should be split into a number of genera. We propose the 318 following groupings of the currently available sequences (see Fig. 4): Sopolycivirus 319 (from Solenopsis invicta polycipivirus) to include SINV-2, SINV-4 and nine related 320 sequences all of which contain ORF2b; Hupolycivirus (from Hubei picorna-like virus 81 321 polycipivirus) to include the two Hubei picorna-like virus 81 sequences; and 322 Chipolycivirus (from Chironomus riparius polycipivirus) to include the C. riparius TSA 323 and Hubei picorna-like virus 82, both of which contain the GDD to ADD substitution in 324 RdRp motif VI. 325 326 327 The polycipivirus 3C-like protease is unusual in that it contains a serine residue at its 328 active site (serine protease), whereas the majority of Picornavirales 3C-like proteases 329 have a cysteine residue at this location (cysteine protease). The only other currently 330 known exceptions are the sole member of the Picornavirales family Marnaviridae, and

11

331 some members of genus in the family Secoviridae [1, 22]. The "picorna-like" 332 , sobemoviruses and also have chymotrypsin-like serine 333 proteases, and cellular homologues have serine at the active site [23, 24]. 334 335 The most conserved protein of positive-sense RNA viruses is the RdRp. In 336 polycipiviruses, we were able to identify all eight signature motifs of the superfamily I 337 RdRp. Unusually, however, in two of the sequences the highly conserved GDD of motif 338 VI was replaced with ADD. Motif VI is responsible for magnesium ion coordination. The 339 first aspartate (D) is mainly responsible for the coordination and cannot be substituted. 340 There is however potential for flexibility in the third position and even more flexibility in 341 the first position (reviewed in [25]). Indeed, the glycine (G) has been experimentally 342 substituted with six different amino acids and with some substitutions the polymerase 343 remains active. Substitution with alanine (ADD) in tobacco vein mottling virus (family 344 Potyviridae), polio virus (family Picornaviridae) or hepatitis C virus (family ) 345 results in an in vitro RdRp activity 5 to 12% of wildtype activity [26–28], though when the 346 same mutation was introduced in encephalomyocarditis virus (family Picornaviridae) the 347 RdRp was inactive in vitro [29]. Positive-sense and double-stranded RNA viruses have 348 a strong preference for GDD, though nidoviruses (order ) and some, but not 349 all, hypoviruses (family Hypoviridae) have SDD at this site. On the other hand, non- 350 segmented negative-sense RNA viruses generally have GDN, whereas segmented 351 negative-sense RNA viruses have SDD, and the reverse transcriptases of 352 have MDD [30–33]. Unusually, members of the positive-sense RNA virus family 353 and the double-stranded RNA virus family have a 354 permuted polymerase domain with motif VI occurring between motifs III and IV; the 355 GDD sequence is still present in permutotetraviruses but is substituted with ADN in 356 birnaviruses and mutation of ADN to GDD results in an almost complete loss of activity 357 [34, 35]. In summary, therefore, even though the ADD in motif VI of two polycipivirus 358 sequences is a deviation from the expected GDD, it is still a plausible variation. The fact 359 that it was observed in two independent sequences (which also cluster phylogenetically 360 and are relatively distant from the other polycipivirus sequences; Fig. 4), indicates that 361 ADD is a real variant and not a result of sequencing error.

12

362 363 Where investigated, Picornavirales species have been found to harbour a VPg protein 364 covalently linked to the 5′ end of their genome that primes RNA synthesis during 365 genome replication. The Picornavirales VPg is typically <5 kDa and normally encoded 366 between Hel and Pro [1]. Due to their small size and lack of structural domains, 367 divergent VPg proteins are not easily recognizable by sequence homology, and we 368 were not able to definitively identify a VPg domain in polycipivirus sequences. 369 Nonetheless there is a large region of unassigned function between Hel and Pro that we 370 presume to encode a VPg, and likely (due to the size of the region) also another protein 371 of unknown function. A further one or two additional proteins are likely encoded 372 upstream of Hel in the ORF5 polyprotein. Due to the high divergence between 373 polycipiviruses and picorna-like viruses whose cleavage sites have been characterized, 374 we were unable to definitively predict the polyprotein cleavage sites. 375 376 As in other Picornavirales species, gene expression in polycipiviruses likely depends on 377 internal ribosome entry site (IRES)-mediated initiation. Consistent with this, the lengthy 378 5′ UTRs typically contained a number of AUG codons that would be expected to inhibit 379 5′ end-dependent scanning. Consequently, we suppose the 5′ UTR to contain an IRES 380 to direct ribosome initiation at the ORF1 start codon. Similarly we suppose the long 381 intergenic region between ORFs 4 and 5 to contain a second IRES to direct ribosome 382 initiation at the ORF5 start codon. A similar situation occurs in dicistroviruses (family 383 Dicistroviridae) where translation of structural and non-structural polyproteins are 384 directed by separate IRESes (though in that case the non-structural polyprotein ORF is 385 5′ proximal). Although we attempted to define potential IRES RNA structure in silico by 386 means of RNA-folding algorithms and inter-species comparisons, the results to date 387 have been inconclusive. The translation mechanism of the additional 5′ ORFs (ORFs 2 388 to 4 and, where present, 2b) remains uncertain; however close spacing of the stop and 389 start codons of consecutive ORFs suggests a ribosome reinitiation mechanism. 390 391 Additional work will be required to confirm the composition and structure of virus 392 particles, presence and sequence of a 5′-linked VPg protein, non-structural polyprotein

13

393 cleavage sites and products, and gene expression mechanisms of this unusual family of 394 arthropod-infecting picorna-like viruses. 395 396 Ten of the 11 members of the proposed Polycipiviridae genus Sopolycivirus appear to 397 infect ant species, while the 11th member (Shuangao insect virus 8) is apparently an 398 insect virus whose host has yet to be identified (Fig. 4). These viruses have been 399 isolated from ants across several continents and in four out of five ant species targeted 400 in this study. Moreover, they have been identified in three different ant subfamilies, and 401 three individual ant species (Solenopsis invicta, Monomorium pharaonis and Lasius 402 neglectus) were found to play host to more than one divergent viruses in the group, 403 suggesting a long evolutionary history between Sopolycivirus species and ants. 404 Although we cannot rule out other hosts, and our own sampling has been biased 405 towards discovering new ant-infecting members, it seems possible that genus 406 Sopolycivirus may be an ant-specific clade. 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423

14

424 425 426 427 428 429

430 Materials and methods

431 SINV-4 identification and sequencing

432 Adult worker Solenopsis invicta ants were collected from 46 nests in the eastern portion 433 of the state of Formosa, Argentina, and returned to quarantine in Gainesville, Florida, 434 United States. Total RNA was extracted from 10–15 live worker ants from each colony 435 using Trizol (Invitrogen) and the PureLink purification kit (Ambion). Total RNA (10 μg 436 per group) was submitted to GE Healthcare (Los Angeles, CA) for mRNA purification, 437 library preparation, and Illumina RNA sequencing (MiSeq). Sequences were aligned to 438 the S. invicta genome (GenBank accession AEAQ01000001.1) and non-matching 439 sequences were compared to the UniProt annotated Swiss-Prot protein sequence 440 database (download date 14 Nov 2014). The unmatched sequences were assembled 441 (Vector NTI, Invitrogen) and a unique sequence with significant similarity to SINV-2 was 442 identified by blastx analysis. RT-PCR with gene-specific oligonucleotide primers 443 revealed that this sequence was present in some colonies of S. invicta in the U.S.A. 444 Total RNA from U.S.A. S. invicta colonies containing the sequence was extracted and 445 used as template for cDNA synthesis, subsequent PCR, and RACE (5′ and 3′) to obtain 446 the entire genome sequence of this new virus, SINV-4, by Sanger sequencing. For 5′ 447 RACE, the 5′ RACE System for Rapid Amplification of cDNA Ends, Version 2.0 448 (Invitrogen, Carlsbad, CA) was used. cDNA was first synthesized with oligonucleotide 449 p1476 (5′-TGGAATTCCAGAATTTTCTAAGGTTCCCATATTAGT) followed by PCR with 450 gene specific primer p1474 (5′-TGAATTCCAGGTAACGCTTGAACCATTGGT) and the 451 Abridged Anchor Primer (Invitrogen, Carlsbad, CA). For 3′ RACE, the GeneRacer Kit 452 (Invitrogen, Carlsbad, CA) was used. cDNA was synthesized with the GeneRacer Oligo

15

453 dT primer. PCR was subsequently completed with the GeneRacer 3′ primer and gene 454 specific primer, p1536 (5′-ATGGCTGTTGCTGACATGTTATGCATTATGTT). 455 456 LniV-1, LneV-1 and MsaV-1 identification and sequencing

457 Adult L. niger, L. neglectus, M. scabrinodis and L. flavus ants were collected from 458 individual nests in Cambridge, U.K.; all ants were workers except the sequencing round 459 1 L. flavus sample for which queens were used. Total RNA was extracted from 10–20 460 adult workers or 5 queens using Trizol (Invitrogen) following the manufacturer’s 461 instructions. RNA was treated with DNase (Promega, RQ1 RNase-free DNase). For 462 standard RNA-Seq libraries, ribosomal RNA was depleted using the Ribo-Zero kit 463 (Illumina), and remaining RNA subjected to alkaline hydrolysis, followed by acrilamide 464 gel purification of RNA bands within the size range 35–50 nt for L. niger and 70–85 for 465 the other four species. For small-RNA sequencing, RNA bands within the size range 466 18–30 nt were acrylamide gel purified from total RNA after DNase treatment. All library 467 amplicons were constructed using a small RNA cloning strategy [36, 37] and sequenced 468 (single-end, 75 nt) using the NextSeq500 platform (Illumina) at the DNA Sequencing 469 Facility (Department of Biochemistry, University of Cambridge). High-throughput 470 sequencing data have been deposited in ArrayExpress 471 (http://www.ebi.ac.uk/arrayexpress) under the accession number E-MTAB-5781. RNA- 472 Seq reads were trimmed using the FASTX-Toolkit and assembled using the Trinity 473 (v2.3.2) and Velvet (1.2.10) de novo assemblers [10, 11]. Using blastx, assembled 474 contigs were compared to a database of polypeptide sequences derived from SINV-2, 475 SINV-4 and the SINV-2-like sequences identified in the NCBI TSA database. Contigs 476 which mapped to SINV-2-like proteins with e-value <10−6 and length >300 nt of coding 477 sequence were retained, manually joined where possible, and used to design primers 478 for Sanger sequencing. 479 480 Gaps were filled and the entire genomes sequenced by Sanger sequencing and 481 terminal sequences obtained by 5′ and 3′ RACE, as follows. Two μg of total ant RNA 482 was treated with Proteinase K then recovered by acid phenol/chloroform extraction and 483 used for 5′- and 3′-RACE using the SMARTer RACE 5′/3′ kit (Clontech) according to the

16

484 manufacturer's instructions. Gene specific primers for 5′ and 3′ RACE were located 485 within 1000 nt of the expected 5′ and 3′ ends of the virus genome. Resulting clones (12– 486 19 for each of MsaV-1 5′ and 3′, LneV-1 5′ and 3′ and LniV-1 5′, and 4 for LniV-1 3′) in 487 the pRACE vector were sequenced with M13 universal primers. The rest of the viral 488 genomes were PCR amplified as eight partially overlapping DNA fragments using 489 specific primers and the same cDNA templates as the ones generated for the 5′/3′ 490 RACE PCRs. These fragments were cloned into the pJET1.2 vector (Thermo Scientific) 491 and sequenced with pJET1.2 specific primers. 492 493 Computational analysis

494 Sequences were processed using EMBOSS [38] and analyzed using blast [39], HHpred 495 [13] (using the PDB and Pfam databases) and Phyre 2 [40] (analysis performed 496 between November 2016 and May 2017). Comparison of amino acid 497 sequences/alignments between different polycipivirus clades was performed using 498 HHpred in the align two sequences/alignments mode during 23–28 June 2017. Amino 499 acid sequences were aligned using MUSCLE v3.8.31 [20] and phylogenetic trees were 500 estimated using the Bayesian Markov chain Monte Carlo method implemented in 501 MrBayes v3.2.6 [21] sampling across the default set of fixed amino acid rate matrices, 502 with 10 million generations, discarding the first 25% as burn-in. Trees were visualized 503 using FigTree v1.4.3. Pairwise amino acid identities were calculated based on pairwise 504 ORF5 MUSCLE alignments. To calculate coverage and polymorphism frequencies, 505 Bowtie 2 (v2.3.0, 12), using default parameters, was used to map raw NextSeq 506 sequencing reads back to the LniV-1, LneV-1 and MsaV-1 genomes. 507 508 Accession numbers, vouchers, and new family submission

509 We have submitted a proposal to the International Committee on Taxonomy of Viruses 510 (ICTV) to name a new family Polycipiviridae containing genera Sopolycivirus, 511 Hupolycivirus and Chipolycivirus, in the order Picornavirales. Sequences for SINV-4, 512 SINV-2, LneV-1, MsaV-1 and LniV-1 have been submitted to NCBI with accession 513 numbers MF041808, MF041813, MF041809, MF041810 and MF041812. Voucher

17

514 specimens of the S. invicta ant species from which SINV-4 was originally sequenced 515 are retained at USDA-ARS, Gainesville, Florida, U.S.A. 516

517 Funding information

518 This work was supported by a Wellcome Trust grant [106207] and a European 519 Research Council (ERC) European Union's Horizon 2020 research and innovation 520 programme grant [646891] to A.E.F.

521

522 Acknowledgements

523 L.A. Calcaterra and the Fundación para el Estudio de Especies Invasivas (FuEDEI) in 524 Hurlingham, Argentina are thanked for their long-term and ongoing assistance with 525 collecting and studying fire ants in Argentina. We thank Phillip Buckham-Bonnett for 526 valuable assistance with ant localization, collection and identification in the U.K., and 527 John P. Carr for many helpful discussions and strategic assistance, and the Cambridge 528 University Botanic Garden for access for sample collection. 529

530 Conflicts of interest and Ethical statement

531 The authors declare that there is no conflict of interest. No specific permits were 532 required to collect field specimens because they did not occur in locations protected in 533 any way. The field collections did not involve endangered or protected species. 534 535 536 537 538 539 540 541 542

18

543 References

544 [1] Le Gall O, Christian P, Fauquet CM, King AMQ, Knowles NJ, et al. Picornavirales, a 545 proposed order of positive-sense single-stranded RNA viruses with a pseudo-T = 3 virion 546 architecture. Arch Virol 2008; 153: 715–727.

547 [2] Valles SM, Strong CA, Hashimoto Y. A new positive-strand RNA virus with unique 548 genome characteristics from the red imported fire ant, Solenopsis invicta. Virology 2007; 549 365: 457–463.

550 [3] Hashimoto Y, Valles SM. Infection characteristics of Solenopsis invicta virus 2 in the red 551 imported fire ant, Solenopsis invicta. J Invertebr Pathol 2008; 99: 136–140.

552 [4] Valles SM, Varone L, Ramírez L, Briano J. Multiplex detection of Solenopsis invicta 553 viruses -1, -2, and -3. J Virol Methods 2009; 162: 276–279.

554 [5] Valles SM, Oi DH, Porter SD. Seasonal variation and the co-occurrence of four 555 pathogens and a group of parasites among monogyne and polygyne fire ant colonies. 556 Biol Control 2010; 54: 342–348.

557 [6] Manfredini F, Shoemaker D, Grozinger CM. Dynamic changes in host-virus interactions 558 associated with colony founding and social environment in fire ant queens (Solenopsis 559 invicta). Ecol Evol 2016; 6: 233–244.

560 [7] Koonin EV, Wolf YI, Nagasaki K, Dolja VV. The Big Bang of picorna-like virus evolution 561 antedates the radiation of eukaryotic supergroups. Nat Rev Microbiol 2008; 6: 925–939.

562 [8] Shi M, Lin X-D, Tian J-H, Chen L-J, Chen X, et al. Redefining the invertebrate RNA 563 virosphere. Nature 2016; 540: 1–12.

564 [9] Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. Full-length 565 transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 566 2011; 29: 644–652.

567 [10] Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, et al. De novo 568 transcript sequence reconstruction from RNA-seq using the Trinity platform for reference 569 generation and analysis. Nat Protoc 2013; 8: 1494–1512.

570 [11] Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de

19

571 Bruijn graphs. Genome Res 2008; 18: 821–829.

572 [12] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 573 2012; 9: 357–359.

574 [13] Söding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology 575 detection and structure prediction. Nucleic Acids Res 2005; 33:W244-248.

576 [14] Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, et al. The Pfam protein families 577 database: Towards a more sustainable future. Nucleic Acids Res 2016; 44: D279–D285.

578 [15] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. The protein data bank. 579 Nucleic Acids Res 2000; 28: 235–242.

580 [16] Alva V, Nam S-Z, Söding J, Lupas AN. The MPI bioinformatics Toolkit as an integrative 581 platform for advanced protein sequence and structure analysis. Nucleic Acids Res 2016; 582 44: W410–W415.

583 [17] Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005; 584 21: 951–960.

585 [18] Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane 586 protein topology with a hidden Markov model: application to complete genomes. J Mol 587 Biol 2001; 305: 567–580.

588 [19] Koonin EV, Dolja VV. Evolution and taxonomy of positive-strand RNA viruses: 589 implications of comparative analysis of amino acid sequences. Crit Rev Biochem Mol Biol 590 1993; 28: 375–430.

591 [20] Edgar RC. MUSCLE: Multiple sequence alignment with high accuracy and high 592 throughput. Nucleic Acids Res 2004; 32: 1792–1797.

593 [21] Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, et al. MrBayes 3.2: 594 efficient Bayesian phylogenetic inference and model choice across a large model space. 595 Syst Biol 2012; 61: 539–542.

596 [22] Isogai M, Tatuto N, Ujiie C, Watanabe M, Yoshikawa N. Identification and 597 characterization of blueberry latent spherical virus, a new member of subgroup C in the 598 genus Nepovirus. Arch Virol 2012; 157: 297–303.

20

599 [23] Gorbalenya AE, Koonin EV., Blinov VM, Donchenko AP. Sobemovirus genome 600 appears to encode a serine protease related to cysteine proteases of picornaviruses. 601 FEBS Lett 1988; 236: 287–290.

602 [24] Gorbalenya AE, Donchenko AP, Blinov VM, Koonin EV. Cysteine proteases of 603 positive strand RNA viruses and chymotrypsin-like serine proteases. FEBS Lett 1989; 604 243: 103–114.

605 [25] O’Reilly EK, Kao CC. Analysis of RNA-dependent RNA polymerase structure and 606 function as guided by known polymerase structures and computer predictions of 607 secondary structure. Virology 1998; 303: 287–303.

608 [26] Jablonski SA, Luo M, Morrow CD. Enzymatic activity of poliovirus RNA polymerase 609 mutants with single amino acid changes in the conserved YGDD amino acid motif. J Virol 610 1991; 65: 4565–4572.

611 [27] Hong Y, Hunt AG. RNA polymerase activity catalyzed by a -encoded RNA- 612 dependent RNA polymerase. Virology 1996; 226: 146–151.

613 [28] Herian U, Lohmann V, Ko F, Bartenschlager R. Biochemical properties of Hepatitis C 614 virus NS5B RNA-dependent RNA polymerase and identification of amino acid sequence 615 motifs essential for enzymatic activity. J Virol 1997; 71: 8416–8428.

616 [29] Sankar S, Porter AG. Point mutations which drastically affect the polymerization activity 617 of encephalomyocarditis virus RNA-dependent RNA polymerase correspond to the active 618 site of Escherichia coli DNA polymerase I. J Biol Chem 1992; 267: 10168–10176.

619 [30] Biswas SK, Nayak DP. Mutational analysis of the conserved motifs of influenza A virus 620 polymerase basic protein 1. J Virol 1994; 68: 1819–1826.

621 [31] Sánchez AB, Torre JC De, Sa AB. Genetic and biochemical evidence for an oligomeric 622 structure of the functional L polymerase of the prototypic lymphocytic 623 choriomeningitis virus. J Virol 2005; 79: 7262–7268.

624 [32] Zamoto-Niikura A, Terasaki K, Ikegami T, Peters CJ, Makino S. Rift valley fever virus 625 L protein forms a biologically active oligomer. J Virol 2009; 83: 12779–12789.

626 [33] Te Velthuis AJW. Common and unique features of viral RNA-dependent polymerases. 627 Cell Mol Life Sci 2014; 71: 4403–4420.

21

628 [34] Letzel T, Mundt E, Gorbalenya AE. Evidence for functional significance of the permuted 629 C motif in Co2+-stimulated RNA-dependent RNA polymerase of infectious bursal disease 630 virus. J Gen Virol 2007; 88: 2824–2833.

631 [35] Gorbalenya AE, Pringle FM, Zeddam JL, Luke BT, Cameron CE, et al. The palm 632 subdomain-based active site is internally permuted in viral RNA-dependent RNA 633 polymerases of an ancient lineage. J Mol Biol 2002; 324: 47–62.

634 [36] Cook S, Chung BYW, Bass D, Moureau G, Tang S, et al. Novel virus discovery and 635 genome reconstruction from field RNA samples reveals highly divergent viruses in 636 dipteran hosts. PLoS One 2013; 8: 1–22.

637 [37] Guo H, Ingolia NT, Weissman JS, Bartel DP. Mammalian microRNAs predominantly 638 act to decrease target mRNA levels. Nature 2010; 466: 835–840.

639 [38] Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open 640 Software Suite. Trends Genet 2000; 16: 276–277.

641 [39] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search 642 tool. J Mol Biol 1990; 215: 403–410.

643 [40] Kelley LA, Sternberg MJE. Protein structure prediction on the Web: a case study using 644 the Phyre server. Nat Protoc 2009; 4: 363–371.

645

646

22

647 Figures legends

648 Figure 1. Genome map of SINV-2. The genome is represented by a black line. ORFs 649 are indicated in pale blue with vertical offsets indicating reading frame (−1, 0, +1) 650 relative to ORF5. IGR denotes the intergenic region.

651

652 Figure 2. Genome maps of proposed polycipiviruses. ORFs are represented by 653 coloured rectangles with vertical offsets indicating reading frames relative to the Hel- 654 Pro-RdRp ORF. Green rectangles indicate ORFs encoding jelly-roll capsid protein 655 domains as predicted by HHpred; hatched green lines indicate uncertain capsid-domain 656 identifications. Pink rectangles indicate a conserved overlapping ORF encoding a 657 transmembrane helix as predicted by TMHMM. A dark green rectangle in the F. exsecta 658 TSA map indicates an additional ORF, 3a. Replication protein domains, identified by 659 their conserved characteristic motifs, are indicated as Hel (helicase), Pro (protease) and 660 RdRp (RNA-dependent RNA polymerase). TSA indicates sequences obtained from the 661 NCBI Transcriptome Shotgun Assembly database. Two horizontal lines (on right side) 662 separate the three proposed genera.

663

664 Figure 3. RNA sequencing coverage of the MsaV-1 and LneV-1 genomes. The total 665 coverage at each nucleotide position is indicated in red (positive sense) or blue 666 (negative sense; note the different axis scale). Black lines show the mean coverage in a 667 1000-nt sliding window.

668

669 Figure 4. Phylogenetic tree of the proposed family Polycipiviridae. ORF5 amino 670 acid sequences were aligned with MUSCLE, and a Bayesian Markov chain Monte Carlo 671 based phylogenetic tree was produced with MrBayes. All posterior probabilities were 672 equal to 1.00. The tree was mid-point rooted and visualized with FigTree. TSA indicates 673 sequences obtained from the NCBI Transcriptome Shotgun Assembly database. 674 Proposed Polycipiviridae genera – Chipolycivirus, Hupolycivirus and Sopolycivirus – are 675 delineated with boxes. For sopolyciviruses, the host ant subfamilies are indicated at 676 right.

677

678

23

679 Figure 5. Phylogenetic tree for polycipiviruses and representative members of the 680 order Picornavirales. Core RdRp amino acid sequences from representative 681 Picornavirales viruses were obtained from Koonin et al. (2008) and combined with the 682 equivalent regions from the 15 sequences in Table 1 with RdRp coverage. Sequences 683 were aligned with MUSCLE, and a Bayesian Markov chain Monte Carlo based 684 phylogenetic tree was produced. Posterior probabilities are indicated for family root 685 nodes, and elsewhere if p < 1.00.

686 Abbreviations: ALSV – apple latent spherical virus; BBWV1 – broad bean wilt virus 1; 687 CPMV – cowpea mosaic virus; CRLV – cherry rasp leaf virus; CrPV – cricket paralysis 688 virus; DCV – Drosophila C virus; DWV – deformed wing virus; EMCV – 689 encephalomyocarditis virus; FMDV – foot-and-mouth disease virus; GFLV – grapevine 690 fanleaf virus; HaRNAV – Heterosigma akashiwo RNA virus; HAV – hepatitis A virus; 691 HplV-81 – Hubei picorna-like virus 81; HplV-82 – Hubei picorna-like virus 82; HRV1A – 692 human 1A; IFV – infectious flacherie virus; LneV-1 – Lasius neglectus virus 1; 693 LniV-1 – Lasius niger virus 1; MsaV-1 – Myrmica scabrinodis virus 1; NIMV – navel 694 orange infectious mottling virus; PnPV – Perina nuda picorna-like virus; PV – polio virus; 695 PYFV – parsnip yellow fleck virus; RTSV – rice tungro spherical virus; SDV – satsuma 696 dwarf virus; ShiV-8 – Shuangao insect virus 8; SINV-2 – Solenopsis invicta virus 2; 697 SINV-4 – Solenopsis invicta virus 4; TRSV – tobacco ringspot virus; TrV – Triatoma 698 virus; TSV – Taura syndrome virus.

699

700 Table legend

701 Table 1. Viruses in the proposed Polycipiviridae family.

702

24

Table Click here to download Table Table_1.docx

ORF coordinates f Accession Sequence 5’ UTR 3’ UTR Sequence name Host species or source number lengthd, nt lengthe ORF1 ORF2b ORF2 ORF3 ORF4 IGR ORF5 lengthe Start End Start End Start End Start End Start End length Start End

EF428566.1 SINV-2 Solenopsis invicta 11303h 301 302 1078 1079 1309 1075 1878 1871 2647 2644 3792 662 4455 10916 387 MF041813.1

M. pharaonis TSA LA858223.1 rev Monomorium pharaonis 11259h 255 256 984 985 1221 981 1763 1753 2547 2544 3692 658 4351 10812 447 LniV-1 MF041812.1 Lasius niger 11092 134 135 899 900 1157 896 1705 1695 2507 2507 3970 687 4658 - -

L. neglectus TSA LI526777.1 Lasius neglectus 11854h 366 367 1131 1132 1392 1128 1940 1930 2742 2742 4181 657 4839 11375 479

L. humile TSA LI719284.1 Linepithema humile 11245h 258 259 1011 1015 1305 1011 1817 1804 2664 2661 3872 385 4258 10860 385 M. pharaonis TSA LA866448.1 Monomorium pharaonis 12160h 230 231 1064 1069 1434 1065 1898 1891 2697 2702 4018 759 4778 11719 441 b Shuangao insect virus 8 KX883910.1 Insects mix (1) 12155 224 225 1046 1050 1415 1046 1882 1875 2693 2690 4033 768 4802 11755 400 SINV-4 MF041808.1 Solenopsis invicta 12077h 198 199 1065 1069 1434 1065 1898 1891 2697 2697 4034 734 4769 11662 415

MsaV-1 MF041810.1 Myrmica scabrinodis 11872h 228 229 1062 1111 1482 1059 1973 1970 2773 2776 4134 438 4573 11430 442 LneV-1 MF041809.1 Lasius neglectus 11821h 207 208 1053 1099 1422 1050 1946 1943 2782 2772 4259 336 4596 11402 419 LH935078.1 rev F. exsecta TSA Formica exsecta 11899 206 207 1046 1080 1388 1043 1876g 2079 2879 2882 4300 386 4687 11535 364 LH935077.1

Hubei picorna-like virus 81 KX884540.1 Procambarus clarkia 8116 ------84 81 1019 357 1377 8006 110

Hubei picorna-like virus 81 KX883940.1 Insects mix (2)c 10315 189 190 867 - - 857 1540 1537 2157 2154 3098 353 3452 10078 237 Ch. riparius TSA KA182589.1 Chironomus riparius 11462 365 366 1109 - - 1109 1627 1630 2418 2421 3413 384 3798 11144 318 a Hubei picorna-like virus 82 KX883688.1 Spiders mix 11081 279 280 1023 - - 1023 1751 1748 2722 2719 3723 442 4166 10939 142

a Spiders mix (Neoscona nautica, Parasteatoda tepidariorum, Plexippus setipes, Pirata sp., Araneae sp.). b Insects mix (1) (Abraxas tenuisuffusa, Hermetia illucens, Chrysopidae sp., Coleoptera sp., Psychoda alternata, Diptera sp., Stratiomyidae sp.).

c Insects mix (2) (Paramercion melanotum, Paracercion calamorum, Ceriagrion auranticum, Brachydiplax chalybea, Orthetrum albistylum, Pseudothemis zonata, Chironomus sp.). d Excluding polyA (if present). e UTRs are likely to be incomplete in some species. f ORF coordinates (including stop codon).

g The F. exsecta TSA sequence has an additional ORF (ORF3a; 1873 - 2082) between ORF2 and ORF3. h Sequence extends to 3’ polyA.

Figure 1 Click here to download Figure ORF2 ORF4 ORF2b Figure_1.pdf

5'− ORF5 −3'

ORF1 ORF3 IGR SolenopsisFigure 2 invicta virus 2 (EF428566) SolenopsisClick invicta here virusto download 4 (MF041808) Figure Figure_2.pdf Hel Pro RdRp Hel Pro RdRp

Solenopsis invicta virus 2 (MF041813) Myrmica scabrinodis virus 1 (MF041810) Hel Pro RdRp Hel Pro RdRp

Monomorium pharaonis TSA (LA858223 rev) Lasius neglectus virus 1 (MF041809) Hel Pro RdRp Hel Pro RdRp

Lasius niger virus 1 (MF041812) Formica exsecta TSA (LH935078 rev + LH935077) Hel Pro RdRp Hel Pro RdRp

Lasius neglectus TSA (LI526777) Hubei picorna−like virus 81 (KX884540) Hel Pro RdRp Hel Pro RdRp

Linepithema humile TSA (LI719284) Hubei picorna−like virus 81 (KX883940) Hel Pro RdRp Hel Pro RdRp

Monomorium pharaonis TSA (LA866448) Chironomus riparius TSA (KA182589) Hel Pro RdRp Hel Pro RdRp

Shuangao insect virus 8 (KX883910) Hubei picorna−like virus 82 (KX883688) Hel Pro RdRp Hel Pro RdRp

ORF2 ORF4 ORF2b Jelly−roll protein Transmembrane protein 5'− ORF5 −3' Capsid hit (not statistically significant) ORF1 ORF3 IGR Figure 3 Click here to download Figure Figure_3.pdf Figure 4 Click here to download Figure Figure_4.pdf

MF041808 Solenopsis invicta virus 4 Myrmicinae

KX883910 Shuangao insect virus 8

LA866448 Monomorium pharaonis TSA Myrmicinae

LH935078+LH935077 Formica exsecta TSA Formicinae MF041809 Lasius neglectus virus 1 Sopolycivirus MF041810 Myrmica scabrinodis virus 1 Myrmicinae

LI719284 Linepithema humile TSA Dolichoderinae

LI526777 Lasius neglectus TSA Formicinae MF041812 Lasius niger virus 1

LA858223 Monomorium pharaonis TSA Myrmicinae MF041813 Solenopsis invicta virus 2

Hupolycivirus KX883940 Hubei picorna-like virus 81

KX884540 Hubei picorna-like virus 81

KX883688 Hubei picorna-like virus 82 Chipolycivirus KA182589 Chironomus riparius TSA

0.2 Figure 5 Click here to download Figure Figure_5.pdf Polycipiviridae 0.2

Sopolycivirus

LA866448.1 TSA KX883910.1 ShiV-8

Hupolycivirus MF041808.1 SINV-4

KX884540.1 HplV-81

MF041810.1 MsaV-1

KX883940.1 HplV-81 MF041809.1 LneV-1

LH935077.1+LH935078.1 TSA LH935077.1+LH935078.1 LI526777.1 TSA MF041812.1 LniV-1 LI719284.1 TSA LA858223.1 TSA MF041813.1 SINV-2 Chipolycivirus KX883688.1 HplV-82 Iflaviridae KA182589.1 TSA 0.73

1.00 AB000906.1 IFV Picornaviridae AF323747.1 PnPV 0.69 1.00 0.63 AJ489744.2 DWV M81861.1 EMCV 0.65 0.72 0.65

AY593850.1 FMDV 1.00 1.00 0.98 0.56 M16248.1 HRV1A 0.98 AB009958.2 SDV AB022887.1 NIMV AJ430385.1 PV M20273.1 HAV AJ621357.1 CRLV AB030940.1 ALSV

M95497.1 RTSV X00206.1 CPMV D14066.1 PYFV D14066.1 D00915.1 GFLV AB084450.1 BBWV1 Marnaviridae U50869.1 TRSV AY337486.1 HaRNAVAF277675.1 TSV

AF014388.1 DCV AF218039.1 CrPV Dicistroviridae

AF178440.1 TrV Secoviridae Supplementary Material Files Click here to download Supplementary Material Files SupplementaryFigures_S1-S4.pdf

(a) (((((((((( )))))))))) MF041810 MsaV-1 AAUAAUUAUAUGGGGGCUACCCCCAUAUAUUUGAGUUCAUGCUACGGUGUGUUCCAAAC... MF041809 LneV-1 ------ACCCCCAUAUUUUUGAGUUCAUGUUACGAUGUGUUCCAAAU... ********** ************ **** *********** (b) ((((((((( ))))))))) MF041808 SINV-4 UUUAAAAAUAACCAUGUAGGGCCU-ACCCUACAUGUGCGUACAAAAGGGACGUAAUAUU... KX883910 ShiV-8 ------ACGUAGGGCCUACCCCUAUGUGUUGUUAU------UAUUAUC... LA866448 M.pha TSA ------GUAGGGCCUUACCCUGCACGUUGUUGU------UAUUAUU... ********* **** ** * ** *** (c) (((((((( )))))))) KX883910 ShiV-8 ACGUAGGGCCU-ACCCCUAUGUGUUGUUAUUAUUAUCUACGGAUUUUAAUACUUUA-UU... LA866448 M.pha TSA --GUAGGGCCUUA-CCCUGCACGUUGUUGUUAUUAUUUACGAAUUUUAACACUUUAAGU... ********* * **** ****** ******* **** ******* ****** * (d) ((((((( ))))))) EF428566 SINV-2 UUUUAAAUAGAUCGCAUAUACAGACUCCUGUAUAGGGAGGGCAUCAUUGGUCGGCAUGG...

(((((( (((( )))))))))) LI526777 L.neg TSA UUUAUUUAACUUAAUAAUAGUUUCUGCUUGAUUGGCAAUCAGCAGAUUUAACAUCACUA...

(((((((((( )))))))))) KA182589 C.rip TSA AAUAAAAGAACGCGUCCUCUUUCACGCGCGUUCUUUGAAUAGAAAUGGUAAUAUAGUAG...

Figure S1. Analysis of 5′ ends of polycipivirus sequences. (a) Alignment of 5′ ends of MF041810 (MsaV-1) and MF041809 (LneV-1). (b) Alignment of 5′ ends of MF041808 (SINV-4), KX883910 (Shuangao insect virus 8) and LA866448 (Monomorium pharaonis TSA). (c) Alignment of 5′ ends of KX883910 (Shuangao insect virus 8) and LA866448 (Monomorium pharaonis TSA). (d) Other polycipivirus sequences with possibly near-complete 5′ UTRs. We hypothesis that MF041808, EF428566 and LI526777 are within a few nucleotides of being 5′-complete, while MF041810, MF041809, KX883910 and LA866448 are missing just 5–30 nt of 5′ sequence. Predicted 5′-proximal stem-loops are highlighted in yellow with potential base-pairings indicated with parentheses. Substitutions from G:C or A:U, to G:U pairings are indicated in cyan. 5′-terminal A/U tracts are indicated in red.

A B C Hubei-82 MHIQFYSEPGAGKTDLS 47 YYGQ-KYMIIDE 32 DKGMLFNLNMVISNTNNPF Ch. riparius TSA FHVQFTSGPGYGKTDLS 45 YFAQ-NFGYLDE 32 DKGRMFQIRVLVSNTNNPY Hubei-81C FHVQLAGAPGIGKSSVL 31 FSSDVEKMIYDD 30 EKGTPFRVRVVGSATNVPY Hubei-81I FHVQLAGAPGIGKSSVL 31 FSSDVEKMIYDD 30 EKGTPFRVRVVGSATNVPY SINV-4 FHIQFVGDAGVGKSKLT 32 YQQQ-KFMIIDD 30 EKGIQLNSDILVSSTNVPY Shuangao-8 FHVQFVGKAGVGKSTMT 32 YQQQ-KFMIIDD 30 EKGIQLQSDILISSTNTPY Mo. pharaonis TSA1 FHIQFVGAAGVGKSKLT 32 YQQQ-KFMIIDD 30 EKGIQLNSDILISSTNTPY Fo. exsecta TSA FHVQFVGEPGIGKSTLT 32 YSGD-KVMIIDD 30 DKGQTLDSDILISSTNVPY LneV-1 FHVQFVGEPGIGKSTIT 32 YNGQ-TFMIIDD 30 DKGVHLDSDILISSTNIPY MsaV-1 FHVQFVGEPGIGKSTIT 32 YNGQ-TYMIIDD 30 DKGVHLESDILISSTNMPY Li. humile TSA FHVQLCGAAGIGKSTLT 32 YAGQ-KIMIVDD 30 DKGCQLSSEIMISSTNTPY La. neglectus TSA FHIQFVGRPGIGKSTIT 32 YAGQ-RIMVVDD 30 DKGVQLTSEVLLSSTNTAY LniV-1 FHVQFVGRPGIGKSTIT 32 YAGQ-KIMVVDD 30 DKGVQLTSEVLLSSTNTAY Mo. pharaonis TSA2 FHIQLVGKPGIGKSTLT 32 YAGQ-KIMIADD 30 DKGVQLTSEIFLSSTNTAY SINV-2 FHVQLVGRPGIGKSTLI 32 YAGQ-RIMIADD 30 DKGVQLTSEVFLSTTNTAY :*:*: . .* **: : : : *: :** : :. * ** .:

Figure S2. Conserved helicase motifs in polycipivirus sequences. Polycipivirus helicases have all three conserved sequence motifs of picorna-like superfamily III helicases (as per Koonin and Dolja, 1993). Numbers indicate the lengths of intervening amino acids sequences not shown in the figure. Abbreviations: Hubei-82 – Hubei picorna-like virus 82; Hubei-81C/81I – Hubei picorna- like virus 81 from crustacea (C) or insects (I). Shuangao-8 – Shuangao picorna-like virus 8. Mo. pharaonis TSAs sequences are denoted as TSA1 for LA866448 and TSA2 for LA858223. 1 2 3 Hubei-81C QLFVLNQHSLR 36 PGKDLVFINCRDM 72 GGVPGGTSGSPCIGVGGTQG-RDLMGI-- Hubei-81I QLFVLNQHSLR 36 PGKDLVFINCRDM 72 GGVPGGTSGSPCIGVGGTQG-RDLMGI-- SINV-4 QFLLVNRHMVE 33 PRSDAAIIFCRRF 69 GQTITGKSGAMLIVPNKASGHRNILGIQA Shuangao-8 QFLILNRHMVE 32 PKSDVAIVFCRQL 69 GETVCGKSGSMLIVPNKASGNKNILGIQA Mo. pharaonis TSA1 QFLILNRHMVE 32 PKSDVAIVFCRQL 69 GETVCGKSGSMLVVPNKISGHKNILGIQA Fo. exsecta TSA QFLIINAHTAD 32 PGNDLAIIFSRHL 68 GTTVAGKSGSMLVSPNKKPGHRNIVGIQA LneV-1 QFLIINAHTAD 32 PNNDLAIIFSRHL 68 GATIAGKSGSMLMIPSRKPGHRSIVGIQA MsaV-1 QFLIINSHTAD 32 PGNDLAIIFSRHL 68 GATVAGKSGSMLMIPSRKPGHRSIVGIQA Li. humile TSA QYLYVNKHVFL 33 DAGDLACIYSKSI 70 GTTVVGKSGSPVVYTRQISGSFAILGIQA La. neglectus TSA QYIILNKHVLK 33 PTGDLAIIYCRDL 67 GHTLGGRSGSTVLT--SLQSKPRIIGIQA LniV-1 QYIILNKHVLK 33 PNGDLAIIYCRDL 67 GHTIGGRSGSTVLT--AIQAKPRIIGIQA Mo. pharaonis TSA2 QYIILNKHVTK 30 PDGDLAIIWSRYV 67 GNTVLGNSGSTVVV--HHGSLTSILGIQA SINV-2 QFVLCNSHIFD 32 KDRDLAIIFSRLF 67 GSTVSGRSGSPVIA--QVNGLARLIGIQS Hubei-82 VFMVPN-HFLK 37 RGKDAALVALPKI 81 GDVRFGDSGSLVVHTNTKMQSHFLVGHMI Ch. riparius TSA IFMSCA-HTFA 36 DNKDIVLLYIEGF 66 VEVKPGDSGSLVMHDNPKIQNK-FIGL-I . * * . : . . * **: : ::*

Figure S3. Conserved protease motifs in polycipivirus sequences. Polycipivirus proteases contain sequence motifs of picorna-like proteases (as per Koonin and Dolja, 1993) including the catalytic triad of H, D and C/S. Unusually, polycipivirus sequences have S at the latter position whereas most members of the order Picornavirales have C. Virus name abbreviations are explained in the Figure S2 caption.

I II III IV Hubei-81C FHAKDELR 9 KTRAIVCVNMAYNMCLRKYFGPFFSVMHK 8 MVGINPESITSHIM 11 ALDLSKFDSHVT Hubei-81I FHAKDELR 9 KTRAIVCVNMAYNMCLRKYFGPFFSVMHK 8 MVGINPESITSHIM 11 ALDLSKFDSHVT SINV-4 DFPKDELR 16 KTRSVTCMSLDIILAWRRVTCDLFASLHR 8 APGMNPEGPDWGRL 11 DFDVSNWDGHMT Shuangao-8 DFPKDELR 16 KTRSVTCMSLDIILAWRRVTCDLIASLHR 8 APGMNPEGPDWGRL 11 DFDVSNWDGHMT Mo. pharaonis TSA1 DFPKDELR 16 KTRSVTCMSLDVILSWRRVTNDLFASLHR 8 APGMNPEGPDWGRL 11 DFDVSNWDGHMP Fo. exsecta TSA DFPKDELR 16 KTRSVTCMNMEFIFAWRRVTLDLFASLHR 8 GPGINPEGPDWTRL 11 DFDVSNWDGHMP LneV-1 DFPKDELR 16 KTRSVTCMNMEFIFSWRRVTLDLFASLHR 8 GPGINPEGPDWTRL 11 DFDVSNWDGHMP MsaV-1 DFPKDELR 16 KTRSVTCMNMEFIFAWRRVTLDLFASLHR 8 GPGINPEGPDWTRL 11 DFDVSNWDGHMP Li. humile TSA DFPKDELR 16 KTRSVTCMNMMYVLLWRELTLDMWAAFHR 8 CPGINPEGPEWSSA 11 DFDVSNWDGYFP La. neglectus TSA DFPKDELR 14 KTRSVTCMNVFYILAWRQLTLDFWASMHR 8 CPGINPEGPDWNNL 11 DFDVSNWDGFLR LniV-1 DFPKDELR 14 KTRSVTCMNVFYILAWRQLTLDFWASMHR 8 CPGINPEGPDWNNL 11 DFDVSNWDGFLR Mo. pharaonis TSA2 DFPKDELR 14 KTRSVTCMNVYYILAWRRYTLKFWSAMHR 8 CPGINPEGPEWNNL 11 DFDVSNWDGFLF SINV-2 DFPKDELR 14 RTRSVTCMNVYYILAWRRYTMRFWSAMHR 8 GPGINPEGPEWSAL 11 DFDVSNWDGFLF Hubei-82 EFIKKELV 8 KTRTVGTGNIIHQIVYNKCNKALQLLVKN 10 AMGLDLE-IHGNQI 11 DFDVKAWEASIN Ch. riparius TSA EFTKKELV 8 KTRTVATGNMIHQIVYNKLFKHLYILFKN 10 ALGVDPV-RHWDQI 11 DFDVKAWEEKVN . *.** .**:: .: : . : .:. *:: :*:. :: .

V VI VII VIII Hubei-81C 46 GMKSGAAIVAEINTLVHWIVFLAMYY 21 TLFLYGDDMICSI 45 EYQTFLKR 15 IDKSVINDL Hubei-81I 46 GMKSGAAIVAEINTLVHWIVFLAMYY 21 TLFLYGDDMICSI 45 DYQTFLKR 15 IDKTVINDL SINV-4 50 GLISGFPGTAETNTLAHILLFYYFYL 21 HAIFYGDDVQASI 41 DSQ-FLKS 13 LDINVVYDM Shuangao-8 50 GLISGFPGTAEVNTLAHLILFYYYYL 21 HSIFYGDDVQASI 41 DSQ-FLKS 13 LDISVVYDM Mo. pharaonis TSA1 50 GLISGFPGTAETNTLAHLILFYYYYL 21 HSIFYGDDVQASI 41 ESQ-FLKS 13 LDISVIYDM Fo. exsecta TSA 50 GLVSGFPGTAEINTLVHLILMYYFYL 21 SPIFYGDDVLISI 41 DCQ-FLKS 13 MDVSVCYDL LneV-1 50 GLISGFPGTAEVNTLVHLILMYYFYL 21 SPVFYGDDVIMSV 41 DCQ-FLKS 13 MDISVVYDL MsaV-1 50 GLISGFPGTAEVNTLVHLILMYYFYL 21 SPMFYGDDVIISI 41 DCQ-FLKS 13 MDISVVYDL Li. humile TSA 50 GMISGFPGTAELNSMAHLILIIYIYL 19 RCLIYGDDIILSF 41 QCQ-FLKS 13 MELSVAYDL La. neglectus TSA 48 GLVSGFPGTAEINTLSHWLLILYIYL 18 SVAIYGDDIILTF 41 KCT-FLKS 13 MDEEVLNNL LniV-1 48 GLVSGFPGTAEINTLSHWLLILYIYL 18 SVAIYGDDIILTF 41 KCT-FLKS 13 MDEEVLNNL Mo. pharaonis TSA2 50 GLVSGFPGTAEVNTLVHWLLYICIYL 18 SVAIYGDDIIITF 41 DCK-FLKS 13 MDMDVAHDL SINV-2 50 GIISGFPGTAEVNTLAHILLIVCIYL 18 SAILYGDDILLTI 41 QCQ-FLKS 13 LDLEVAYDL Hubei-82 56 GLLSGHPGTFIENTDVHIMLVYLIAR 21 KVVVAADDIVMAI 39 EIQ-FLKH 13 PLTSIIYQL Ch. riparius TSA 56 GLLSGHPGTLMENSEIHTMIMYLICF 20 RFILAADDIVIAI 39 EIQ-FLKQ 12 PNWDIIYQL *: ** . . *: * :: . .**: :. . *** : ::

Figure S4. Conserved RdRp motifs in polycipivirus sequences. Polycipivirus RdRps have all eight picorna-like superfamily I RdRp sequence motifs (as per Koonin and Dolja, 1993). Unusually, two sequences have a G to A substitution at the normally highly conserved GDD sequence in motif VI. Virus name abbreviations are explained in the Figure S2 caption. Manuscript with Track Changes

Click here to access/download Additional Material for Reviewer MS_Polycipiviridae_IO-TrackChanges.pdf GenBank submitted sequences

Click here to access/download Additional Material for Reviewer Submitted_GeneBank_sequences.docx HHalign inputs and results

Click here to access/download Additional Material for Reviewer HHpred_Inputs_and_Results.pdf