bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.031120; this version posted April 10, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Revisiting Orthotospovirus Phylogeny Using Whole

2 Genomic Data and a Hypothesis for their Geographic

3 Origin and Diversification

4

5 Anamarija Butković a, Rubén González a, Santiago F. Elenaa,b,*

6

7 aInstituto de Biología Integrativa de Sistemas (I2SysBio), Consejo Superior de

8 Investigaciones Científicas-Universitat de València, 46182 Paterna, Spain

9 bThe Santa Fe Institute, Santa Fe, NM 87501, USA

10

11 *Address correspondence to Santiago F. Elena, [email protected].

12

13

14

15 Running title: Orthotospovirus origin and phylogenomics

16


17 ABSTRACT

18 The family Tospoviridae, a member of the order Bunyavirales, comprises tri-

19 segmented negative-sense single-stranded RNA viruses that infect plants and are also

20 capable of replicating in their insect vectors in a persistent manner. The family is composed of a

21 single genus, Orthotospovirus, whose type species is Tomato spotted wilt virus

22 (TSWV). Previous studies assessing the phylogenetic relationships within this genus

23 were based upon partial genomic sequences, thus resulting in unresolved clades and a

24 poor assessment of the roles of recombination and genome shuffling during mixed

25 infections. Complete genomic data for most Orthotospovirus species are now available

26 at the NCBI genome database. In this study we have used 62 complete genomes from 20

27 species. Our study confirms the existence of four phylogroups (A to D), grouped in two

28 major clades (A-B and C-D), within the genus. We have estimated that the split between the

29 two major clades occurred ~3,100 years ago, shortly followed by the split between the A and B

30 phylogroups ~2,860 years ago. The split between the C and D phylogroups happened

31 more recently, ~1,465 years ago. Segment reassortment has been shown to be important

32 in the generation of novel viruses. Likewise, within-segment recombination events have

33 been involved in the origin of new viral species. Finally, phylogeographic analyses of

34 representative viruses suggest the Australasian ecozone as the possible origin of the

35 genus, followed by complex patterns of migration, with rapid global spread and numerous

36 reintroduction events.

37

38 IMPORTANCE

39 Members of the Orthotospovirus genus infect a large number of plant families, including

40 food crops and ornamentals, resulting in multimillion economic losses. Despite this

41 importance, phylogenetic relationships within the genus were established years ago based


42 on partial genomic sequences. A peculiarity of orthotospoviruses is their tri-segmented

43 negative sense genomes, which makes segment reassortment and within-segment

44 recombination, two forms of viral sex, potential evolutionary forces. Using full genomes

45 from all described orthotospovirus species, we revisited their phylogeny and confirmed

46 the existence of four major phylogroups with uneven geographic distribution. We have

47 also shown a pervasive role of sex in the origin of new viral species. Finally, using

48 Bayesian phylogeographic methods, we assessed the possible geographic origin and

49 historical dispersal of representative viruses from the different phylogroups.

50

51 KEYWORDS

52 Bunyavirales, phylogenomics, phylogeography, recombination, segmented viruses,

53 segment reassortment, selection, Tospoviridae, virus taxonomy

54


55 The order Bunyavirales is full of emerging pathogens that cause severe diseases in

56 humans (e.g., the Crimean-Congo and hantavirus hemorrhagic fevers, the La Crosse

57 encephalitis or the Rift Valley fever), farm animals (e.g., Nairobi sheep disease, ovine

58 and bovine Schmallenberg diseases, or Akabane ruminants’ disease) and agronomically

59 important crops (e.g., tomato spotted wilt, melon yellow spot or bud necrosis

60 diseases). Bunyavirales are all arboviruses, and are transmitted by hemato- or

61 phytophagous arthropods. After the last International Committee on Taxonomy of

62 Viruses (ICTV) reclassification (Adams et al., 2017), two families of plant-only infecting

63 bunyaviruses have been designated: the Fimoviridae and the Tospoviridae. Fimoviridae

64 contains a single genus, Emaravirus, and nine recognized species. Likewise, all

65 Tospoviridae species were classified within the Orthotospovirus genus.

66 Orthotospoviruses have spread around the world causing important losses in vegetable,

67 legume and ornamental crops (Gilbertson et al., 2015). Along with crops, they infect a

68 large variety of weed species; usually each virus species having a wide range of plant

69 hosts. As an example, only for Tomato spotted wilt virus (TSWV), the type member of

70 the genus, more than 900 species belonging to over 90 monocotyledonous and

71 dicotyledonous plants have been described as susceptible hosts (Pappu et al., 2009).

72 The genome structure of orthotospoviruses consists of three negative-sense single-

73 stranded RNAs (Baltimore’s class V; Fig. 1) that differ in size: segments L (large; ~8.9

74 kb), M (medium; ~4.8 kb) and S (small; ~2.9 kb). Only orthotospoviruses and

75 Tenuivirus (a plant-infecting genus of the Phenuiviridae family within the

76 Bunyavirales) use the ambisense strategy to produce their mRNAs (Hull, 2014). Segment

77 S encodes for the nucleoprotein (N) in negative-sense and for the protein NSS in positive-

78 sense. Protein NSS is involved in the suppression of plant RNA silencing (Hedil and

79 Kormelink, 2016). The segment M encodes for a precursor polyprotein of the


80 glycoproteins GN and GC in negative-sense. In positive-sense it encodes for the protein

81 NSM, which is responsible for the modification of plasmodesmata that enables the

82 infectious nucleocapsid units to pass through them, allowing intercellular movement

83 (Storms et al., 1995). Singh and Savithri (2015) showed that NSM also remodels the

84 endoplasmic reticulum networks to form vesicles to facilitate viral replication and

85 movement. The segment L encodes for the RNA-dependent RNA polymerase (RdRp) in

86 negative-sense.

87 Orthotospovirus particles are spherical with a diameter of 80 - 110 nm. The particles

88 are made of a lipid envelope that is spiked by two glycoproteins, GN and GC (Fig. 1).

89 These glycoproteins form heterodimers that compose oligomeric structures; the exact

90 conformation they adopt on the outside of the envelope has not yet been determined

91 experimentally. The cytoplasmic tails of the glycoproteins interact with the

92 nucleoproteins that are inside the envelope (Ribeiro et al., 2009). The nucleoproteins

93 encapsidate the three viral RNA segments forming the ribonucleoprotein complexes.

94 These complexes are associated with the RdRp. The number of ribonucleoprotein

95 complexes packed into each virion remains unknown but, to be viable, virions must

96 include at least one copy of each of the three genomic segments.

97 Orthotospovirus particles are transmitted between susceptible plants by thrips

98 vectors (Rotenberg et al., 2015). Up to fifteen species of common thrips have been

99 reported as facultative vectors (Rotenberg et al., 2015). All these vectors belong to the

100 Frankliniella and Thrips genera (Terebrantia suborder, Thysanoptera order, Thripinae

101 subfamily, Thripidae family). The transmission of orthotospoviruses by thrips has

102 several unusual characteristics. Only the first and second larval instars can acquire the

103 virus and this competence decreases with age. Although thrips can retain lifelong

104 infectivity, their ability to transmit the virus can be erratic (Hull, 2009). The ability to


105 retain the infectivity is due to the mode of transmission which is propagative, meaning

106 that the virus is able to replicate in its vector (Rotenberg et al., 2015; Fletcher et al., 2016).

107 As the virus replicates in the thrips, with some of the thrips (e.g., Frankliniella

108 occidentalis) being extremely polyphagous, there are chances that mixed infections of

109 orthotospoviruses occur within the insect and/or the plant. That could lead to

110 recombination, segment reassortment, and component capture events. The global spread

111 of thrips and orthotospoviruses could lead to global disease outbreaks (Gilbertson et al.,

112 2015). Recent studies have shown that plant hosts infected with orthotospoviruses can

113 affect the feeding behavior and even the survival of the vector (Sundaraj et al., 2014).

114 Specifically, Shalileh et al. (2016) showed how TSWV manipulated F. occidentalis via

115 the host plant nutrients to enhance virus transmission and spread.

116 In this work, we have reevaluated the phylogenetic status within the Orthotospovirus

117 genus based on available full genomes. Furthermore, to identify events of heterologous

118 genome reassortment during mixed infections as well as within-segment recombination

119 events, phylogenies were inferred for each one of the three segments and for each one of the

120 five proteins and incongruences among trees were evaluated. All the phylogenies were

121 inferred using Bayesian methods. Along with the phylogenies, other evolutionary

122 characteristics of the viruses were analyzed: selection pressures, recombination

123 events and phylogeographic distribution and dynamics. Based on these results, we have

124 drawn a plausible scenario for the geographic origin and subsequent spread and

125 diversification of Orthotospovirus.

126

127 RESULTS AND DISCUSSION

128 Pervasive evidence of recombination and segment reassortment. Prior

129 to any other study, we evaluated the role of recombination during the phylogenetic


130 diversification of the orthotospoviruses (File S1). The PHI test for the protein alignment

131 found evidence of recombination between the GNGC (P = 0.0053) and NSM (P = 0.0017)

132 genes in segment M; consistently, it also gave significant values for segment M as a whole

133 (P < 0.0001). One possible mechanism could be the physiological role of NSM,

134 responsible for viral movement and regulation of apoptosis. In different plant hosts there

135 are different proteins involved in binding to NSM (Knipe and Howley, 2013). Differences

136 in NSM structure could have a major impact in cell-to-cell transportation. It has also been

137 suggested that the NSM protein is an evolutionary adaptation of an insect-only infecting

138 bunyavirus to a plant host, where it determines symptomatology and host range

139 (Tsompana et al., 2004). Therefore, when new viruses spill into new hosts, often in mixed

140 infections with other well-adapted viruses, within-segment recombination or segment

141 reassortment events may give rise to new viruses with enhanced movement in the novel

142 host.

143 Furthermore, GNGC encodes for the precursor of the two glycoproteins GN and GC,

144 which is consistent with the GARD algorithm detecting a recombination breakpoint

145 between them. This suggests independent evolutionary histories for the two glycoproteins.

146 It is known that GN is responsible for attachment to epithelial cells in the midgut of thrips

147 as a mode of more effective transmission (Knipe and Howley, 2013). Recombination

148 between these two proteins can readily generate new viruses that would explore their

149 potential of transmission by novel thrips species. The exact breakpoint within GNGC

150 was estimated using GARD to lie between amino acid positions 768 and 769. This

151 partition was later used in all the BEAST analyses described in the following sections.
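
The partition at the estimated breakpoint can be illustrated with a minimal sketch. It assumes that positions 768 and 769 are 1-based columns of the GNGC protein alignment, and that the alignment is held as a plain dictionary of equal-length strings; the function name, the variable names and the toy sequences are hypothetical and are not taken from the analysis pipeline.

```python
def split_at_breakpoint(alignment, last_upstream_position=768):
    """Split every aligned sequence into two partitions.

    Positions are 1-based, so columns 1..768 go to the upstream
    partition and columns 769 onward go to the downstream partition.
    """
    upstream, downstream = {}, {}
    for name, sequence in alignment.items():
        upstream[name] = sequence[:last_upstream_position]
        downstream[name] = sequence[last_upstream_position:]
    return upstream, downstream


# Toy alignment (made-up residues, not real glycoprotein sequences).
toy_alignment = {
    "isolate_1": "M" * 768 + "K" * 10,
    "isolate_2": "A" * 768 + "R" * 10,
}
upstream_part, downstream_part = split_at_breakpoint(toy_alignment)
```

Each partition can then be given its own substitution and clock model, which is the purpose of introducing the breakpoint in the first place.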

152 Further supporting the recombinogenic nature of orthotospoviruses, the pairwise

153 comparison of the MCC trees obtained for the three segments (Fig. S1) using tanglegram


154 in Dendroscope revealed incongruence between all segments. The incongruence in tree

155 topologies was mainly driven by TSWV.
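
Topological incongruence between segment trees can also be quantified rather than inspected visually. The sketch below is not the tanglegram procedure of Dendroscope; it is a simpler, swapped-in comparison that lists the clades present in one rooted tree but absent from the other. Trees are written as nested tuples, and the two toy topologies are invented for illustration only; they are not the trees of Fig. S1.

```python
def clades(tree):
    """Return the set of clades of a rooted tree given as nested tuples.

    Each clade is a frozenset with the tip names below one internal node.
    """
    found = set()

    def visit(node):
        if isinstance(node, str):          # a tip
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)                    # record the internal node
        return tips

    visit(tree)
    return found


def conflicting_clades(tree_a, tree_b):
    """Clades found in exactly one of the two trees."""
    return clades(tree_a) ^ clades(tree_b)


# Hypothetical topologies for two segments (illustration only).
toy_tree_l = ((("TSWV", "GRSV"), "TCSV"), "INSV")
toy_tree_m = ((("GRSV", "TCSV"), "TSWV"), "INSV")
conflicts = conflicting_clades(toy_tree_l, toy_tree_m)
```

An empty result means the two trees share every clade; tips that appear in many conflicting clades are candidates for reassortment or recombination.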

156 In conclusion, recombination within or reassortment of M segment during mixed

157 infections may create new viruses with improved movement in more hosts or

158 transmission ability using new vectors.

159 Full genome phylogenetic reconstructions confirm the existence of four

160 phylogroups. Maximum clade credibility (MCC) Bayesian trees were generated from

161 62 full genomic sequences belonging to 20 different orthotospovirus species (Fig. 2). The

162 trees were built with the heterogeneous model configuration described in the Materials and

163 Methods section that introduces a partition between segments and, in the case of segment

164 M, it accounts for the existence of a recombination breakpoint within the glycoprotein

165 gene. The MCC tree obtained from the full genomes shows a significant clustering into four

166 different phylogroups A, B, C, and D. Pappu et al. (2009) defined phylogroups in terms

167 of geographical origin of the viruses. However, our full-genome analysis of a larger and

168 updated dataset does not support the notion of independent geographic origins for each

169 phylogroup. Instead, our phylogroups contain viruses from diverse origins. For instance,

170 the so-called American phylogroup of Pappu et al. (2009) is part of what we have defined here

171 as phylogroup A, which contains a mixture of viruses isolated from Asia, Europe, Australia,

172 and America. Likewise, the Eurasian phylogroup of Pappu et al. (2009) contains viruses from

173 what we have defined here as phylogroups C and D. Phylogroup D includes an isolate of

174 Capsicum chlorosis virus (CaCV) from Australia. We incorporated only one full-genome

175 sequence from Africa since other sequences from this region failed to meet the criteria

176 mentioned in the Materials and Methods section. It is worth mentioning that most

177 orthotospoviruses have a broad worldwide distribution, though many genomes have not

178 been fully sequenced and assembled after identification, thus limiting the number of full-


179 genomes available for our study. We expect that future accumulation of additional full-

180 genome sequences for some species would modify the number of clades in the phylogeny

181 shown in Fig. 2. In fact, it has been suggested using partial sequences that Groundnut

182 chlorotic fan-spot virus (GCFSV) and Groundnut yellow spot virus (GYSV) would likely

183 establish a different clade (Oliver and Whitfield, 2016).

184 In general, most nodes in the MCC tree had a high statistical support (0.95 < BPP ≤

185 1). However, a few nodes had lower support. The node with the lowest support (BPP =

186 0.52) corresponded to the divergence of TSWV strains 22 and 23; while other nodes

187 within the TSWV clade also had BPP < 0.70 support. Two other clades had intermediate

188 support values (BPP ≈ 0.70): the basal node in which phylogroups A and B split apart,

189 and the node leading to the speciation events between Groundnut ringspot virus (GRSV),

190 Tomato chlorotic spot virus (TCSV), TSWV, and Zucchini lethal chlorosis virus (ZLCV).

191 Rates of molecular evolution are highly heterogeneous among genes and

192 lineages. Estimated rates of molecular evolution ranged between 10⁻⁴ - 2×10⁻³

193 substitutions per site and per year for segment L, between 6×10⁻⁴ - 10⁻³ for segment M

194 and between 9×10⁻⁴ - 1.7×10⁻³ for segment S (color legends in Fig. S2).

195 Heterogeneity in rates of molecular evolution among viruses (and strains of the same

196 virus) has been observed for the different genes (Fig. 3). For the RdRp gene, the slowest

197 estimated rates of evolution corresponded to all the nodes, while the fastest corresponded

198 to branches separating the phylogroups A-B and C-D. This gene showed the slowest rate of

199 evolution amongst the five genes.

200 For gene NSM the fastest rates of molecular evolution corresponded to the basal

201 branches leading to phylogroups A-B and C-D, with branches leading to phylogroups A

202 and B. By contrast, the slowest evolutionary rates for NSM were estimated for the basal


203 branch leading to the phylogroup C, and to terminal branches leading to CaCV and

204 Mulberry vein banding virus (MVBaV).

205 Regarding the glycoprotein genes, for gene GN, we observed fast rates of molecular

206 evolution in the basal branches leading to clades A-B, C-D and phylogroups A, B, C and

207 D. The fastest evolutionary rates were for ZLCV, Melon severe mosaic virus (MSMV)

208 and Melon yellow spot virus (MYSV). The slowest evolving branches were for TSWV

209 and MVBaV. For gene GC, the fastest rate of molecular evolution corresponds to the branch

210 leading to the phylogroups C-D. By contrast, the slowest rates of molecular evolution

211 correspond to the basal branch leading to the yet to be named tospovirus isolated from

212 kiwi plants (hereafter referred to as Tospokiwi) and TSWV.

213 Overall, the NSS gene showed slow rates of evolution in most basal branches, except

214 those leading to the phylogroup A and B and also phylogroups C-D. MYSV showed

215 a faster evolutionary rate than the other viruses.

216 High rates of molecular evolution for the N gene were observed in the basal branches leading to

217 phylogroups A and D. The slowest rates observed for the N gene were in the majority of

218 basal branches except the tip branches leading to Bean necrotic mosaic virus (BeNMV)

219 and Impatiens necrotic spot virus (INSV).

220 These results suggest a pattern of variable rates of evolution along the diversification

221 of orthotospoviruses. Overall, accelerated rates of evolution have been observed in all

222 genes in the origin of phylogroups A-B and C-D, in some internal branches within

223 phylogroups leading to speciation events and in some terminal branches giving rise to

224 novel strains of a particular virus. Even for fast evolving lineages, differences among

225 genes exist. For example, MYSV showed very fast rates of molecular evolution in the

226 GN and NSS genes but a rather slow one in the other genes.


227 Estimation of divergence times. As shown in Fig. 2 for the full genome,

228 the time of the diversification between clades A-B and C-D was estimated to be 3099

229 years ago (ya) (95% HPD interval: 5192 - 1758 ya) (BPP = 1). Soon after,

230 ~2860 ya (95% HPD interval: 4772 - 1605 ya), the A-B clade diverged into

231 phylogroups A and B (BPP = 1). The C-D clade diverged into phylogroups C and D

232 (BPP = 1) around 1465 ya (95% HPD: 2421 - 854 ya).

233 In addition, MCC trees were also generated for the individual segments (Fig. S2) and

234 individual genes (Fig. 3), with estimates of divergence times generated from these partial

235 trees being consistent with those obtained for the full genomes. For example, in the case

236 of the three segments (Fig. S2), the estimated divergence of phylogroups A-B and C-D

237 from each other was in the range of 2786 - 875 ya, whereas the divergence between clades

238 A-B and C-D took place between 2621 - 616 ya. Genes N and NSM have more recent

239 divergence times, estimated about 121 (95% HPD: 548 - 36 ya) and 141 (95% HPD: 366

240 - 45 ya) between phylogroups A-B and C-D, respectively. The divergence time between the

241 four major phylogroups estimated for gene NSS is 538 ya (95% HPD: 2519 - 81 ya).

242 However, estimates obtained using the G gene place the divergence between A-B and C-

243 D clades much earlier in time, 2911 ya (95% HPD: 7010 - 1122 ya).

244 A phylogeographic analysis of the origin and dispersal of Orthotospovirus. For

245 these analyses, instead of using countries, or even continents, as geographic units, we

246 opted to use more biologically relevant units: the eight biogeographic realms, or

247 ecozones, into which the World Wide Fund for Nature (WWF) divides the world (Schulz,

248 2005). These ecozones consider the geological barriers that kept plants and animals

249 separated from other areas, having a shared evolutionary history. Seventy-three

250 sequences from a reduced dataset of all N proteins available have been used for these

251 analyses. The results are shown in Fig. 4. Australasia (Australia, New Guinea and


252 neighboring islands) has been identified as the most likely origin of orthotospoviruses

253 (RSPP = 0.4961), with the second highest supported origin being the Western Palearctic

254 (Western Europe and North Africa) (RSPP = 0.3929). However, this conclusion has to

255 be taken with some caution, since Pappu et al. (2009) and Oliver and Whitfield (2016)

256 had suggested that TSWV isolates were actually imported into Australia from Europe ca.

257 1915. Eurasia, as part of the

258 Palearctic ecozone, contains the largest number of different Orthotospovirus species,

259 giving more plausibility to the hypothesis of the Palearctic ecozone as the possible origin

260 of the genus.

261 From Australasia, viruses were introduced into the Western Palearctic and from there

262 into Indomalaya (South Asian Subcontinent and Southeast Asia). Afterwards these three

263 ecozones became established as dispersion centers to colonize the rest of the ecozones:

264 Neotropic (South America and the Caribbean), Nearctic (North America), Eastern

265 Palearctic (Eurasia) and Afrotropic (Sub-Saharan Africa) in that order, with numerous

266 reintroductions. It is interesting that the sequences that were introduced numerous times

267 in Eastern Palearctic and Afrotropic became later isolated. The BSSVS analysis shows

268 the strongest support for the dispersal route from the Indomalaya ecozone into the Eastern

269 Palearctic one (BF = 236.2).
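
For readers unfamiliar with how a Bayes factor for a dispersal route is usually obtained from BSSVS output, the following sketch compares the posterior odds with the prior odds that the route is switched on. It is only an illustration under stated assumptions: the function names are hypothetical, the posterior value of 0.98 is made up, and the prior inclusion probability follows a convention often used for symmetric models (a Poisson prior with mean ln 2 on the number of routes beyond the minimum K − 1). The manuscript does not state which prior was used, so the numbers below are not expected to reproduce the reported BF.

```python
import math


def prior_inclusion_probability(n_locations, poisson_mean=math.log(2)):
    """Assumed prior probability that a given symmetric route is switched on."""
    n_routes = n_locations * (n_locations - 1) / 2
    return (poisson_mean + n_locations - 1) / n_routes


def bayes_factor(posterior_prob, prior_prob):
    """Posterior odds divided by prior odds of a route being included."""
    if posterior_prob >= 1.0:
        return math.inf                      # route present in every sample
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


# Eight ecozones, as in the text; 0.98 is a hypothetical posterior
# frequency of the rate indicator for one route.
q = prior_inclusion_probability(8)
example_bf = bayes_factor(0.98, q)
```

A route whose indicator is switched on no more often than expected under the prior yields a Bayes factor close to 1, whereas well supported routes yield values far above it.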

270 We also tested the null hypothesis of a random distribution of geographic origins on

271 the MCC tree generated with the full genome (Fig. S3) using Treebreaker (Ansari and

272 Didelot, 2016). The original definition of phylogroups was based on the geographic

273 origin. Therefore, the alternative hypothesis would be that geographic origins must be

274 differentially distributed among the four phylogroups. The branches on the MCC tree

275 that successfully reject the hypothesis of random distribution are for phylogroups B, C,

276 D, and for viruses TSWV and INSV. Phylogroup B has sequences from America,


277 phylogroup C has sequences from Asia and Europe while the phylogroup D has sequences

278 mostly from Asia. TSWV is found all around the globe but the sequences we used in this

279 study predominantly came from Asia. In the case of INSV we have sequences from

280 Europe. We can see that all the sequences coming from Asia and Europe show a

281 nonrandom distribution of geographic origin in the MCC tree. Only two viruses coming

282 from America, BeNMV and Soybean vein necrosis virus (SVNV) show significant

283 support for a geographic correlation. Interestingly, this association must be entirely due

284 to the limited geographic distribution of the vector species (Rotenberg et al., 2015).

285 BeNMV and SVNV are transmitted by the thrips Neohydatothrips variabilis, which is

286 restricted to Central and North America. INSV came from Europe and is transmitted by

287 Frankliniella spp., which are found all around the world. Phylogroup C shows strong

288 correlation for viruses Polygonum ringspot virus (PolRSV) and Hippeastrum chlorotic

289 ringspot virus (HCRV), which come from the Palearctic ecozone and are transmitted by

290 Thrips spp. and Dictyothrips spp., mostly distributed across Eurasia. Phylogroup D

291 mostly contains viruses isolated from Asia that are transmitted by Thrips spp., which are

292 found everywhere.

293 TSWV geographic origin and pandemic dispersal. TSWV is the most represented

294 virus in our dataset and hence provides the richest information to perform

295 phylogeographic analyses for an individual virus. As in the previous section, we have

296 used the WWF ecozones. We have performed two independent phylogeographic analyses

297 of the N protein sequences from TSWV, one based on the entire 229-isolate dataset and

298 a second one based on a reduced dataset of 87 isolates that minimizes overrepresentation

299 of sequences from certain countries (see Materials and Methods). For both datasets, the

300 most likely origin of TSWV was within the Australasia ecozone (RSPP = 0.9426 for the

301 reduced dataset and RSPP = 0.9016 for the full dataset). However, this conclusion has to


302 be taken with some caution for the reasons discussed in the previous section. In any case,

303 the two data sets differ in the spread of isolates out of Australasia. The results for the

304 reduced dataset are shown in Fig. 5a. In the case of the reduced dataset TSWV isolates

305 emigrated from Australasia first into the Neotropic and then into the Nearctic and the

306 Western and Eastern Palearctic ecozones as recently as 1994 - 1999. Even more recently,

307 around 2007, a second introduction from the Nearctic into the Eastern Palearctic took

308 place. In 2015 several possible parallel introductions into the Afrotropic ecozones from

309 the Neotropic and Eastern Palearctic ones are inferred from the dataset, though the

310 strongest supported route was from the Palearctic (BF = 215.4). The sequences from the

311 Western Palearctic and Afrotropic became isolated after the introductions.

312 When looking into the phylogeographic diffusion patterns inferred from the full

313 dataset (Fig. 5b), Australasia still appears as the origin of TSWV pandemic as previously

314 mentioned. From there, the virus spread out into the Nearctic, Western Palearctic,

315 Neotropic, and Eastern Palearctic ecozones within less than ten years (between 1988 and

316 1996). From the Eastern Palearctic, Western Palearctic and Neotropic, isolates were

317 introduced back into the Australasia, Indomalaya (South Asian Subcontinent and

318 Southeast Asia) and Afrotropic ecozones in early 2000s (Fig. 5b). Subsequent

319 reintroductions into the Afrotropic originated from the Eastern Palearctic and Neotropic

320 ecozones (Fig. 5b). In the last decade, isolates from the Afrotropic have diffused into the

321 Neotropic and Indomalaya ecozones (Fig. 5b).

322 In both sequence datasets the Nearctic, Western Palearctic and Eastern Palearctic

323 ecozones show the largest number of viral lineages. The strongest supported routes in

324 this case were between the Neotropic and Australasia and the Eastern Palearctic and

325 Australasia ecozones (in both cases, BF = 226.4). Sequences introduced into the

326 Indomalaya ecozone became isolated after the introduction.


327 Variations of population sizes over time. Population sizes over time plots are a

328 useful tool to illustrate overall patterns of genetic diversification throughout time

329 (Rambaut et al., 2018), i.e. the number of lineages observed on a tree with respect to time.

330 If diversification happens in a constant steady-state manner through time, a straight line

331 is expected when plotting the number of lineages on a logarithmic scale as a function of

332 time. On the one hand, if the observed line lies above the null-hypothesis straight line, the

333 diversification rate increases with time. On the other hand, if the observed line falls below

334 the straight line, it can be concluded that diversification rate is decreasing with time. In

335 our case, lineages will be considered as different virus species.
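
The expectation described above can be sketched in a few lines. The example counts lineages through time from a list of branching times, fits a straight line to the logarithm of the lineage count, and returns the residuals that show whether the observed curve lies above or below that line. All function names and the example branching times are hypothetical; the times are constructed so that diversification is exactly constant at a rate of 0.5 per time unit.

```python
import math


def lineages_through_time(branching_times, n_start=1):
    """Return (time, lineage count) pairs; every branching event adds one lineage."""
    points = []
    n = n_start
    for t in sorted(branching_times):
        n += 1
        points.append((t, n))
    return points


def log_linear_fit(points):
    """Least-squares slope and intercept of ln(lineage count) against time."""
    xs = [t for t, _ in points]
    ys = [math.log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(points):
    """Observed ln(count) minus the fitted line; positive means above the line."""
    slope, intercept = log_linear_fit(points)
    return [math.log(n) - (intercept + slope * t) for t, n in points]


# Hypothetical constant-rate example: the k-th lineage appears at ln(k) / 0.5.
example_times = [math.log(k) / 0.5 for k in range(2, 21)]
example_points = lineages_through_time(example_times)
example_slope, example_intercept = log_linear_fit(example_points)
```

With real data the residuals are rarely zero; a run of positive residuals towards the present would correspond to the "line above the null expectation" case described in the text.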

336 In the Skyride plot from the TSWV analysis of the reduced dataset (87 sequences;

337 Fig. 6a), population size rises from 1995 until 2002 and declines slightly from 2003

338 until 2005. From 2005 until 2012.5 it rises slightly, after which it remains steady

339 until 2018 (Fig. 6a). The Skyride plot of the full TSWV sequence

340 set (229 sequences) shows a similar pattern, with the exception of a decline in population

341 size between 2005 and 2012.5 and a further decline after that until 2018 (Fig. 6b). For

342 both datasets, the estimated median effective population size was 600 individuals.

343 Positive selection analysis. Selection analyses were performed with the MEME

344 and aBSREL algorithms. Codons under positive selection were found in all the

345 protein coding sequences except N (File S2): position 26 (P = 0.05) of NSM, positions

346 468 (P = 0.01), 476 (P = 0.05), 479 (P = 0.05), and 490 (P = 0.01) of NSS, positions 4 (P

347 < 0.01) and 813 (P < 0.01) of GNGC, and positions 1066 (P = 0.03) and 1728 (P = 0.04)

348 of L (File S2 and Fig. S4) were all under significant positive selection. The algorithm

349 aBSREL indicated nodes under positive selection for all genes except N (File S2). These

350 findings suggest that beneficial mutations arose in the four proteins and spread

351 to fixation in the viral population, resulting in adaptation to new hosts and/or new vectors.

352 This change in the nucleic acid content and the worldwide spread of the thrips vectors,

353 especially F. occidentalis (Knipe and Howley, 2013), could lead to the spread of some

354 lineages carrying beneficial mutations for adaptation to new hosts, vectors and environments.

355 This adaptive evolution could explain the fast and worldwide spread of orthotospoviruses

356 during the 1960s and 1970s by the vector F. occidentalis (Knipe and Howley, 2013),

357 reflecting adaptation of the orthotospoviruses and subsequent diversification into new plant

358 hosts used by this very polyphagous vector.

359 The N protein does not show evidence of positive selection; instead, it shows many

360 instances of significant negative selection (File S2). This could be due to very stringent

361 structural constraints given its multifunctional nature. N is involved in transcription and

362 translation of viral RNA. It has been shown that single point mutants of Bunyamwera

363 virus (BUNV) were defective either for transcription but not translation or vice versa

364 (Eifan and Elliot, 2009; Walter et al., 2011). This observation suggests that N may have

365 two functional domains, one involved in transcription and the other in translation, each

366 interacting with different cellular cofactors. Since the small N protein forms the

367 ribonucleocapsid complex with viral RNAs and tightly interacts with the RdRp to initiate

368 RNA synthesis, protecting the nascent RNAs from degradation (Knipe and Howley,

369 2013), even a small structural change might disrupt its function, jeopardizing completion

370 of the viral replication cycle. However, considering the essential roles of other proteins that show

371 signatures of positive selection, this situation in N could also be due to a lack of

372 statistical power of the methods used.

373

374 CONCLUSIONS

375 The taxonomic status of the plant-infecting bunyaviruses was recently revisited by

376 the ICTV, which changed the status of tospoviruses from the genus level to the family level,

377 Tospoviridae. So far, the new family contains a single genus, Orthotospovirus.

378 However, Oliver and Whitfield (2016) have suggested that the family could be classified

379 into five distinct phylogenetic clades whose type members would be GYSV, IYSV,

380 SVNV, TSWV, and WSMoV. This classification is fully consistent with our finding of

381 four phylogroups: TSWV would be the representative species of phylogroup A,

382 SVNV of phylogroup B, IYSV would represent phylogroup C, and WSMoV phylogroup

383 D. No full genome sequences for the two viruses from the GYSV clade were available

384 at NCBI, and hence they are missing from our analyses. Additional full genome data for the two

385 GYSV clade members would be necessary to confirm whether they form an additional

386 phylogroup E, as well as to explore their phylogeographic history in relation to the other

387 phylogroups.

388 Based on partial genome sequences and a smaller number of species, Pappu et al.

389 (2009) classified orthotospoviruses according to their geographic origin into two

390 geogroups: Asian and American. Our analyses suggest a more complex picture in which

391 classification of orthotospovirus species cannot be based on geographic criteria.

392 Comparing this geographic classification with our genome-based phylogroups, we

393 observe that phylogroups A and B would be mainly of Asian origin while phylogroups C

394 and D would be mainly of American origin, though the geographic distinction is diffuse.

395 Genome reassortment, recombination and selection all played a role in the origin and

396 diversification of orthotospoviruses.

397 Based on the phylogeographic distribution of TSWV nucleocapsid

398 sequences and 73 orthotospovirus nucleocapsid sequences, we suggest that

399 orthotospoviruses originated in the Australasia ecozone and afterwards radiated into other ecozones

400 around the globe. The geographic distribution of vectors correlates well with the

401 geographic distribution of different Orthotospovirus species. Phylogroups B, C, D and

402 TSWV and INSV are transmitted by thrips that are specific for the Nearctic/Neotropic

403 and Palearctic regions and are the predominant vectors for each group.

404 This work is the first phylogenetic analysis done on the family Tospoviridae that used

405 well annotated full genome sequences. In line with the results from other studies, we

406 propose a hypothesis that the genus Orthotospovirus could reasonably be split into at least

407 four new genera.

408

409 MATERIAL AND METHODS

410 Viral species, sequence extraction and database curation. All available full

411 Orthotospovirus genomes were obtained from the NCBI Genome Database

412 (www.ncbi.nlm.nih.gov/genome; Brister et al., 2015) on May 10th, 2018. The genome

413 collection was further expanded for all orthotospoviruses that were defined as a species

414 with the same criterion used to search the NCBI Genome Database: the complete

415 genomes had to come from the same research paper and had to be sequenced from the

416 same extraction sample. Our working dataset is thus composed of the amino acid

417 sequences derived from 62 full viral genomes (the entire proteome) with well-defined

418 segments and genes belonging to 20 different Orthotospovirus species (File S1). All

419 nucleotide sequences were extracted by their accession number with the help of E-utilities

420 from Entrez Direct. It was confirmed by further inspection with Geneious Prime version

421 2019.0.3 (www.geneious.com) that each of the 62 genomes corresponded, indeed, to a

422 different Orthotospovirus.
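
As an illustration of accession-based retrieval, the sketch below builds a request to the NCBI E-utilities `efetch` endpoint, which is the web service that Entrez Direct wraps. The study itself used the Entrez Direct command-line tools, so this Python version is a stand-in rather than the authors' script; the function names and the placeholder accession strings are hypothetical, and the real accession numbers are those listed in File S1.

```python
import urllib.parse
import urllib.request

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="nuccore", rettype="fasta"):
    """Build an efetch URL that requests the given accessions as plain-text FASTA."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # several accessions may be sent in one request
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urllib.parse.urlencode(params)


def fetch_fasta(accessions):
    """Download the records (needs network access; not called in this sketch)."""
    with urllib.request.urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


# Placeholder identifiers, not real accession numbers.
example_url = efetch_url(["ACCESSION_1", "ACCESSION_2"])
```

For large batches, NCBI asks clients to limit their request rate, so a real script would add a short pause between calls.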

423 To examine the phylogeographic origins of TSWV and all orthotospoviruses we

424 collected 229 sequences for TSWV (including partial sequences) and 73 for all

425 orthotospoviruses after filtering for duplicates. In the case of all orthotospovirus N

426 proteins, we did not use the expanded set because the analysis did not converge, which could be

427 caused by problems with the priors or a low temporal signal (R2 = 0.13). We also created

428 a subset of TSWV sequences in which we tried to minimize the bias of overrepresentation

429 of certain countries and specific dates; we also removed partial sequences. The

430 subset was guided by CD-HIT (Fu et al., 2012), with which we clustered the sequences coming

431 from the same ecozone with a sequence identity cut-off of 95%, taking into consideration

432 only the ecozones that had the largest number of sequences. After removing the bias and

433 partial sequences in the TSWV set, 87 sequences were retained.
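
The per-ecozone redundancy reduction described above can be sketched as a greedy clustering. This is a simplified stand-in for CD-HIT, not its algorithm: it assumes the sequences are already aligned to equal length and scores identity position by position, whereas CD-HIT works on unaligned sequences with word filtering. The function names, the rule of keeping the first member of each cluster as its representative, and the toy sequences are all hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.95):
    """Cluster a dict of name -> aligned sequence; the first name in each cluster is its representative."""
    clusters = []
    # Process sequences with the most non-gap residues first, ties broken by name.
    order = sorted(seqs, key=lambda name: (-len(seqs[name].replace("-", "")), name))
    for name in order:
        for cluster in clusters:
            if identity(seqs[name], seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no representative was close enough
    return clusters


def representatives_by_ecozone(records, threshold=0.95):
    """records: iterable of (name, ecozone, aligned sequence); cluster within each ecozone."""
    by_zone = {}
    for name, zone, seq in records:
        by_zone.setdefault(zone, {})[name] = seq
    kept = []
    for zone in sorted(by_zone):
        kept.extend(cluster[0] for cluster in greedy_cluster(by_zone[zone], threshold))
    return kept


# Toy data: "b" is 95% identical to "a"; "c" is only 50% identical.
example_records = [
    ("a", "Nearctic", "A" * 20),
    ("b", "Nearctic", "A" * 19 + "C"),
    ("c", "Nearctic", "A" * 10 + "C" * 10),
    ("d", "Neotropic", "A" * 20),
]
example_kept = representatives_by_ecozone(example_records)
```

Clustering within each ecozone, rather than across the whole dataset, keeps identical sequences from different regions, which is what preserves the geographic signal.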

434 Alignment and assembly of segments and full genomes. All alignments used in

435 this study were generated using MAFFT version 7 (Kazutaka and Standley, 2013) and

436 visually inspected. We generated the following alignments. Firstly, the protein

437 sequences of the five viral genes (GNGC, N, NSM, NSS, and RdRp; Fig. 1) were aligned

438 individually. Secondly, to build the alignment of segments, we respected the order of

439 genes in NCBI records and combined NSM and GNGC genes into segment M and NSS and

440 N genes into segment S (Fig. 1). Thirdly, we generated the concatenated full viral

441 genomes by combining the three segments in the arbitrary L-M-S order. Segment and

442 full genome joining were performed with FaBox alignment joiner available online

443 (Villesen, 2007). The resulting globally aligned proteome was 5,401 amino acids long,

444 had 966 identical sites and a pairwise identity of 62.3%.
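
The joining step and the two summary statistics reported above can be sketched as follows. The study used the FaBox alignment joiner; this Python version only mirrors the idea. The function names are hypothetical, the toy alignments are invented, and the identity definitions used here (a column counts as identical only if it has no gaps; pairwise identity is averaged over all pairs and all columns) are assumptions that may differ from the software that produced the reported values.

```python
from itertools import combinations


def concatenate(alignments):
    """Join per-segment alignments (dicts of name -> aligned sequence) in the given order."""
    names = set(alignments[0])
    for aln in alignments[1:]:
        if set(aln) != names:
            raise ValueError("every alignment must contain the same sequence names")
    return {name: "".join(aln[name] for aln in alignments) for name in sorted(names)}


def identical_sites(aln):
    """Number of gap-free columns in which all sequences carry the same residue."""
    columns = zip(*aln.values())
    return sum(1 for col in columns if len(set(col)) == 1 and col[0] != "-")


def mean_pairwise_identity(aln):
    """Average fraction of matching columns over all pairs of sequences."""
    scores = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(list(aln.values()), 2)
    ]
    return sum(scores) / len(scores)


# Toy segments joined in the L-M-S order used in the text.
seg_l = {"v1": "MK", "v2": "MR"}
seg_m = {"v1": "AC", "v2": "AC"}
seg_s = {"v1": "G-", "v2": "GT"}
example_genome = concatenate([seg_l, seg_m, seg_s])
```

Checking that every segment alignment holds the same set of names before joining guards against silently concatenating segments from different isolates.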

445 Tests of phylogenetic congruence and detection of segment reassortment and

446 recombination events. To look for heterologous segregation of segments in the origin

447 of novel orthotospoviruses, segment trees were compared to one another (M-L, M-S and

448 S-L) using the tanglegram function in Dendroscope version 3 (Huson and Scornavacca,

449 2012). Recombination analysis within genes and segments was performed using the PHI

450 test for recombination implemented in SplitsTree4 (Huson and Bryant, 2006) on the

451 protein and codon sequences encoded by each gene and concatenated segment. The

452 genetic algorithm for recombination detection GARD (Kosakovsky Pond et al., 2006), as

453 implemented in the www.datamonkey.org server (Weaver et al., 2018), was also used to

454 detect recombination breakpoints in the codon aligned nucleotide sequences of the five

455 Orthotospovirus genes.

456 Selection of the best models of amino acid substitutions. ProtTest 3 (Darriba et

457 al., 2011) was used to assign the best fitting models of protein evolution for the

458 individually aligned protein sequences based upon the Bayesian Information Criterion

459 (BIC) minimal score. For the aligned TSWV N sequences used for the phylogeographic

460 analyses, we used the second best-scoring model, FLU+G, since the best-scoring one,

461 HIVb (Korber, 2000), was not implemented in BEAST version 1.10.4 (Suchard et al.,

462 2018).
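
The selection rule described above (rank models by BIC, then take the best-ranked model that the downstream program supports) can be sketched as follows. The BIC formula is the standard one; the log-likelihoods, parameter counts, alignment length and the third model name in the example are invented for illustration and are not ProtTest output from this study.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion; lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def pick_model(candidates, n_sites, supported):
    """Return (chosen model, full ranking).

    candidates: dict of model name -> (log-likelihood, number of free parameters).
    supported: set of model names available in the downstream software.
    """
    ranked = sorted(candidates, key=lambda m: bic(candidates[m][0], candidates[m][1], n_sites))
    for model in ranked:
        if model in supported:
            return model, ranked
    raise ValueError("no supported model among the candidates")


# Hypothetical scores: the best model is unavailable, so the runner-up is used.
example_candidates = {
    "HIVb": (-5000.0, 100),
    "FLU+G": (-5010.0, 101),
    "JTT+G": (-5100.0, 101),
}
example_choice, example_ranking = pick_model(
    example_candidates, n_sites=258, supported={"FLU+G", "JTT+G"}
)
```

Returning the full ranking alongside the chosen model makes it easy to report how far the substitute model is from the best-scoring one.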

463 Phylogenetic and phylogeographic trees reconstruction. The maximum

464 likelihood trees of the protein alignments for genes, segments and full genome sequences

465 were generated with IQ-Tree (Nguyen et al., 2015). In the maximum likelihood tree

466 reconstruction, we used the model assigned with ProtTest 3 and 1000 bootstrap replicates.

467 These trees were later used as input for the program TempEst (Rambaut et al., 2016)

468 where we examined the temporal signal and clock-likeness of our data before proceeding

469 to the Bayesian analysis. All the trees for genes, segments, full genome and the

470 phylogeographic analysis had a positive correlation coefficient and significant temporal

471 signal. Values for R2 ranged between 0.031 and 0.36, which corresponded again to the full

472 genome and gene NSS. The R2 values for the two datasets of TSWV and 73

473 orthotospovirus N genes were in the range 0.094 - 0.22. BEAST version 1.10.4 (Suchard

474 et al., 2018) on CIPRES Science Gateway (www.phylo.org; Miller et al., 2010) and its

475 associated program BEAUti were used for Bayesian analyses. All the genes, segments

476 and full genome sequences were analyzed under the best-fitting substitution model with

477 site heterogeneity modeled with a Gamma distribution (with four categories) and a

478 fraction of invariable sites, if needed. We used the uncorrelated relaxed clock where each

479 branch in the tree has its own evolutionary rate. We also used the time-aware GMRF

480 Skyride as a tree prior, because we did not have any data on the expansion or population

481 history and it was highly unlikely that a constant population size was maintained, a chain

482 length of 10^9, and dates of isolation assigned to each sequence. In the case of the GNGC

483 gene, we introduced a partition in the alignment, the first half containing the N-terminal

484 768 amino acids and the second consisting of the C-terminal 821 amino acids. This was

485 done because we found a breakpoint at this position using the program GARD. Each

486 partition had a heterogeneous amino acid substitution model assigned by ProtTest 3.

487 Segments M, S and full genome sequences were also partitioned following the end and

488 the beginning of each respective gene inside the segment. So, segment M was split into

489 three fragments corresponding to genes NSM, GN and GC, segment S was split into genes

490 NSS and N, while the full genomes were split into genes N, NSS, NSM, GN, GC, and RdRp.

491 The respective models selected by ProtTest 3 for each gene were assigned to each

492 partition since the substitution models were unlinked. We used the same parameters as

493 stated above but with unlinked clock models for each partition to account for different

494 rates among branches and with linked trees to obtain a single tree estimated for all the

495 data.
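
The temporal-signal check mentioned earlier in this subsection amounts to regressing root-to-tip distance on sampling date. The sketch below is a minimal version of that regression; it assumes the distances have already been measured from a chosen root, whereas TempEst can also search for the best-fitting root. The function name and the example dates and distances are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least squares of root-to-tip distance on sampling date.

    Returns (slope, x_intercept, r_squared): the slope estimates the
    substitution rate, the x-intercept the date of the root, and R squared
    the strength of the temporal signal.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, -intercept / slope, r_squared


# Hypothetical clock-like data: 0.001 substitutions/site/year since 1950.
example_dates = [1990.0, 2000.0, 2010.0, 2018.0]
example_distances = [0.001 * (d - 1950.0) for d in example_dates]
example_rate, example_root_date, example_r2 = root_to_tip_regression(
    example_dates, example_distances
)
```

The function needs at least two distinct dates and distances that are not all equal; a slope of zero or below would make the root-date estimate meaningless and is itself a sign of weak temporal signal.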

496 In the phylogeographic analysis of TSWV and 73 orthotospoviruses we focused only

497 on the N gene sequences because they greatly outnumbered the rest of the gene sequences

498 available in NCBI. A larger number of sequences gave us higher resolution and more

499 information from different biogeographic realms or ecozones. Assigning ecozones

500 (Schultz, 2005) instead of countries to sequences proved to be a better choice since some

501 countries were highly underrepresented. Furthermore, countries define political entities

502 while ecozones (Fig. 4 and Fig. 5) delimit areas where plants and animals were relatively

503 isolated for long periods of time and therefore share a common evolutionary history. We

504 modeled the sequences with phylogeographic diffusion in discrete space.

505 All the outputs were analyzed with Tracer 1.7.1 (Rambaut et al., 2018), in which

506 convergence and adequate effective sample size (ESS) were assessed. To generate the

507 maximum clade credibility (MCC) tree and remove the 10% burn-in, we used

508 TreeAnnotator 1.10.4. Bayesian posterior probabilities (BPP) were used to assess the

509 significance of each node in the MCC tree. FigTree 1.4.3

510 (tree.bio.ed.ac.uk/software/figtree) was used to visualize all trees and make them

511 publication ready. SpreaD3 was used to prepare the spatiotemporal history and visualize

512 it on a map (Bielejec et al., 2016).
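
The two post-processing ideas above, discarding a 10% burn-in and checking the effective sample size (ESS), can be sketched as follows. The ESS estimator shown (sum of autocorrelations truncated at the first non-positive lag) is a common textbook choice and is an assumption here; Tracer uses its own estimator, so values will not match exactly. Function names and the toy traces are hypothetical.

```python
def discard_burn_in(samples, fraction=0.10):
    """Drop the first `fraction` of an MCMC trace."""
    start = int(len(samples) * fraction)
    return samples[start:]


def effective_sample_size(samples):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations), truncated at the first non-positive one."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum((samples[i] - mean) * (samples[i + lag] - mean) for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


# Toy traces: one that alternates (no positive autocorrelation) and one that is "sticky".
alternating_trace = [0.0, 1.0] * 50
sticky_trace = [0.0] * 50 + [1.0] * 50
ess_alternating = effective_sample_size(alternating_trace)
ess_sticky = effective_sample_size(sticky_trace)
```

This implementation is quadratic in the trace length, which is fine for a sketch but too slow for long chains; thinned traces or an FFT-based autocorrelation would be the usual remedy.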

513 Assessing the distribution of geographic origins along the MCC phylogenetic

514 tree. Phylogeny-trait association was performed with the software package TreeBreaker

515 (Ansari and Didelot, 2016). The traits assigned were the countries of origin from File S1

516 for the complete genome MCC tree. The analysis was performed on a set of 90,001,000 trees

517 generated with BEAST after removing the 10% burn-in.

518 dN/dS-based selection analyses. For the analyses of selective constraints operating

519 on the different genes, nucleotide sequences were aligned with MUSCLE (Edgar, 2004)

520 as implemented in MEGA7 (Kumar et al., 2016) using the codon alignment option that

521 preserves open reading frames. Two different methods were used to analyze each codon

522 alignment and infer differences in selective pressures. The first one was the mixed

523 effects model of evolution (MEME), which uses a mixed-effects maximum likelihood

524 approach to test for individual sites that have been subject to episodic positive or

525 diversifying selection (Murrell et al., 2012). The second one was the adaptive branch-

526 site random effects likelihood (aBSREL) model, which uses branch-site models to

527 test for positive selection on a selected proportion of branches (Smith et al., 2015). In

528 both cases, a threshold of P = 0.05 was set for each analysis. These selection analyses were

529 conducted on the www.datamonkey.org server (Weaver et al., 2018).

530

531 SUPPLEMENTARY MATERIAL

532 Supplementary File S1 (Excel). Orthotospovirus genomic sequence and phylogeographic

533 data.

534 Supplementary File S2 (Excel). Results of per-site and branch-site selection analyses.

535 Supplementary Fig. S1 (PDF). Dendrograms of the genomic segments generated with

536 Dendroscope.

537 Supplementary Fig. S2 (PDF). MCC trees obtained for each genomic segment.

538 Supplementary Fig. S3 (PDF). Results from the analyses performed with TreeBreaker.

539 Supplementary Fig. S4 (PDF). Graphical representation of per-site selection analyses.

540 The abscissa shows the nucleotide site and the ordinate shows the likelihood ratio test (LRT)

541 statistic for episodic diversification. Significant sites are those with an LRT value ≥ 4.

542

543 AUTHOR CONTRIBUTIONS

544 AB and RG collected the data, performed all the analyses and drafted the manuscript.

545 SFE conceived the study, supervised its execution and wrote the manuscript.

546

547 ACKNOWLEDGEMENTS

548 This work was supported by Spain Agencia Estatal de Investigación – FEDER

549 grant BFU2015-65037-P to SFE. AB and RG were supported by Generalitat Valenciana

550 grant GRISOLIAP/2018/005 and by Spain Agencia Estatal de Investigación – FEDER

551 contract BES-2016-077078, respectively.

552

553 REFERENCES

554 Adams, M. J., Lefkowitz, E. J., King, A. M. Q., Harrach, B., Harrison, R. L., Knowles, N.

555 J., Kropinski, A. M., Krupovic, M., Kuhn, J. H., Mushegian, A. R., Nibert, M.,

556 Sabanadzovic, S., Sanfaçon, H., Siddell, S. G., Simmonds, P., Varsani, A., Zerbini,

557 F. M., Gorbalenya, A. E., Davison, A. J., 2017. Changes to taxonomy and the

558 International Code of Virus Classification and Nomenclature ratified by the

559 International Committee on Taxonomy of Viruses (2017). Arch. Virol. 162, 2505-

560 2538.

561 Ansari, A.M., Didelot, X., 2016. Bayesian inference of the evolution of a phenotype

562 distribution on a phylogenetic tree. Genetics 204, 89-98.

563 Bielejec, F., Baele, G., Vrancken, B., Suchard, M. A., Rambaut, A., Lemey, P., 2016.

564 SpreaD3: interactive visualisation of spatiotemporal history and trait evolutionary

565 processes. Mol. Biol. Evol. 33, 2167-2169.

566 Brister, J.R., Ako-Adjei, D., Bao, Y., Blinkova, O., 2015. NCBI viral genomes resource.

567 Nucl. Acids Res. 43, D571-D577.

568 Darriba, D., Taboada, G.L., Doallo, R., Posada, D., 2011. ProtTest 3: fast selection of

569 best-fit models of protein evolution. Bioinformatics 27, 1164-1165.

570 Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high

571 throughput. Nucleic Acids Res. 32, 1792-1797.

572 Eifan, S., Elliot R.M., 2009. Mutational analysis of the Bunyamwera orthobunyavirus

573 nucleocapsid protein gene. J. Virol. 83, 11307-11317.

574 Fletcher, S. J., Shrestha, A., Peters, J. R., Carroll, B. J., Srinivasan, R., Pappu, H. R.,

575 Mitter, N., 2016. The Tomato spotted wilt virus genome is processed differentially in

576 its plant host Arachis hypogaea and its thrips vector Frankliniella fusca. Front. Plant

577 Sci. 7, 1349.

578 Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W., 2012. CD-HIT: accelerated for clustering the

579 next-generation sequencing data. Bioinformatics 28, 3150-3152.

580 Gilbertson, R.L., Batuman, O., Webster, C.G., Adkins, S., 2015. Role of the insect

581 supervectors Bemisia tabaci and Frankliniella occidentalis in the emergence and

582 global spread of plant viruses. Annu. Rev. Virol. 2, 67-93.

583 Hedil, M., Kormelink, R., 2016. Viral RNA silencing suppression: the enigma of

584 Bunyavirus NSS proteins. Viruses 8, 208.

585 Hull, R., 2009. Comparative Plant Virology, second edition, Elsevier Academic Press,

586 San Diego CA, USA.

587 Huson, D.H., Bryant, D., 2006. Application of phylogenetic networks in evolutionary

588 studies. Mol. Biol. Evol. 23, 254-267.

589 Huson, D.H., Scornavacca, C., 2012. Dendroscope 3: An interactive tool for rooted

590 phylogenetic trees and networks. Syst. Biol. 61, 1061-1067.

591 Kazutaka, K., Standley, D.M., 2013. MAFFT multiple sequence alignment software

592 version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772-780.

593 Knipe, D.M., Howley, P.M., 2013. Fields Virology, sixth edition. Wolters

594 Kluwer/Lippincott Williams & Wilkins Health, Philadelphia PA, USA.

595 Korber, B., 2000. HIV signature and sequence variation analysis, in: Rodrigo, G.A.,

596 Learn, G.H. (Eds.), Computational Analysis of HIV Molecular Sequences. Kluwer

597 Academic Publishers, Dordrecht, Netherlands, pp. 55-72.

598 Kosakovsky Pond, S.L., Posada, D., Gravenor, M.B., Woelk, C.H., Frost, S.D., 2006.

599 Automated phylogenetic detection of recombination using a genetic algorithm. Mol.

600 Biol. Evol. 23, 1891-1901.

601 Kumar, S., Stecher, G., Tamura, K., 2016. MEGA7: molecular evolutionary genetics

602 analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33, 1870-1874.

603 Miller, M.A., Pfeiffer, W., Schwartz, T., 2010. Creating the CIPRES science gateway for

604 inference of large phylogenetic trees. Proceedings of the Gateway Computing

605 Environments Workshop (GCE), IEEE, New Orleans LA, USA, pp. 1-8.

606 Murrell, B., Wertheim, J.O., Moola, S., Weighill, T., Scheffler, K., Kosakovsky Pond,

607 S.L., 2012. Detecting individual sites subject to episodic diversifying selection. PLoS

608 Genet. 8, e1002764.

609 Nguyen, L.-T., Schmidt, H.A., von Haeseler, A., Minh, B.Q., 2015. IQ-TREE: A fast and

610 effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol.

611 Biol. Evol. 32, 268-274.

612 Oliver, J. E., Whitfield, A. E., 2016. The genus Tospovirus: emerging bunyaviruses that

613 threaten food security. Annu. Rev. Virol. 3, 101-124.

614 Pappu, H.R., Jones, R.A., Jain, R.K., 2009. Global status of tospovirus epidemics in

615 diverse cropping systems: successes achieved and challenges ahead. Virus Res. 141,

616 219-236.

617 Rambaut, A., Lam, T.T., Max Carvalho, L., Pybus, O.G., 2016. Exploring the temporal

618 structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus

619 Evol. 2, vew007.

620 Rambaut, A., Drummond, A.J., Xie, D., Baele, G., Suchard, M.A., 2018. Posterior

621 summarisation in Bayesian phylogenetics using Tracer 1.7. Syst. Biol. 67, 901-904.

622 Ribeiro, D., Borst, J. W., Goldbach, R., Kormelink, R., 2009. Tomato spotted wilt virus

623 nucleocapsid protein interacts with both viral glycoproteins GN and GC in planta.

624 Virology 383, 121-130.

625 Rotenberg, D., Jacobson, A.L., Schneweis, D.J., Whitfield, A.E., 2015. Thrips

626 transmission of tospoviruses. Curr. Opin. Virol. 15, 80-89.

627 Schultz, J., 2005. The Ecozones of the World, 2nd ed. Springer, New York, NY.

628 Shalileh, S., Ogada, P.A., Moualeu, D.P., Poehling, H.M., 2016. Manipulation of

629 Frankliniella occidentalis (Thysanoptera: Thripidae) by Tomato spotted wilt virus

630 (tospovirus) via the host plant nutrients to enhance its transmission and spread.

631 Environ. Entomol. 45, 1235-1242.

632 Singh, P., Savithri, H., 2015. GBNV encoded movement protein (NSM) remodels ER

633 network via C-terminal coiled coil domain. Virology 482, 133-146.

634 Smith, M.D., Wertheim, J.O., Weaver, S., Murrell, B., Scheffler, K., Kosakovsky Pond,

635 S.L., 2015. Less is more: an adaptive branch-site random effects model for efficient

636 detection of episodic diversifying selection. Mol. Biol. Evol. 32, 1342-1353.

637 Storms, M.M., Kormelink, R., Peters, D., Van Lent, J.W., Goldbach, R.W., 1995. The

638 nonstructural NSM protein of Tomato spotted wilt virus induces tubular structures in

639 plant and insect cells. Virology 214, 485-493.

640 Suchard, M.A., Lemey, P., Baele, G., Ayres, D.L., Drummond, A.J., Rambaut, A., 2018.

641 Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus

642 Evol. 4, vey016.

643 Sundaraj, S., Srinivasan, R., Culbreath, A.K., Riley, D.G., Pappu, H.R., 2014. Host plant

644 resistance against Tomato spotted wilt virus in peanut (Arachis hypogaea) and its

645 impact on susceptibility to the virus, virus population genetics, and vector feeding

646 behavior and survival. Phytopathology 104, 202-210.

647 Tsompana, M., Abad, J., Purugganan, M., Moyer, J.W., 2004. The molecular population

648 genetics of the Tomato spotted wilt virus (TSWV) genome: Evolutionary genetics of

649 TSWV. Mol. Ecol. 14, 53-66.

650 Villesen, P., 2007. FaBox: an online toolbox for FASTA sequences. Mol. Ecol. Notes 7,

651 965-968.

652 Walter, C.T., Costa Bento, D.F., Guerrero Alonso, A., Barr, J.N., 2011. Amino acid

653 changes within the Bunyamwera virus nucleocapsid protein differentially affect the

654 mRNA transcription and RNA replication activities of assembled ribonucleoprotein

655 templates. J. Gen. Virol. 92, 82-84.

656 Weaver, S., Shank, S.D., Spielman, S.J., Li, M., Muse, S.V., Kosakovsky Pond, S.L.,

657 2018. Datamonkey 2.0: a modern web application for characterizing selective and

658 other evolutionary processes. Mol. Biol. Evol. 35, 773-777.

659

FIG 1. Schematic representation of the Orthotospovirus tri-segmented ambisense

genome. The main function and length of each protein are indicated. The translation

sense of each ORF is indicated by arrows.

[Figure 1 image. Labels recovered from the diagram: segment L (∼8.9 kb) encodes RdRp (∼3050 aas; replication); segment M (∼4.8 kb) encodes NSM (∼337 aas; cell-to-cell movement) and GN/GC (∼1245 aas; glycoproteins); segment S (∼2.9 kb) encodes NSS (∼494 aas; silencing suppressor) and N (∼300 aas; nucleoprotein). Virion labels: lipid envelope, glycoproteins, (-)RNA, RdRp.]

660

FIG 2. MCC tree obtained from the concatenated full genome. Phylogroups A, B, C,

and D are indicated with different color boxes. Branch labels indicate BPPs.

[Figure 2 image. Tip labels recovered from the tree: MSMV, CSNV, TSWV1 to TSWV31, TCSV, GRSV, ZLCV1, ZLCV2, INSV1, INSV2, BeNMV, SVNV, HCRV1, HCRV2, PolRSV1 to PolRSV3, MYSV1 to MYSV3, PCSV, MVBaV, CaCV1 to CaCV5, WBNV, WSMoV, CCSV1, CCSV2, TZSV, Tospokiwi. Time axis: -3500 to 0.]


FIG 3. Six MCC trees of the 20 Orthotospovirus species used in this study. Each tree was obtained from the protein sequence of an individual gene. Phylogroups A, B, C, and D are indicated with boxes of different colors. Numbers above branches represent BPPs. Viral species with more than one isolate have been collapsed (triangles). Branch colors indicate differences in rates of molecular evolution according to the scale on the left. The scale below the trees represents years.

[Figure 3 image. Six panels (RdRp, NSM, GN, GC, NSS, N), each an MCC tree with phylogroups A, B, C, and D boxed, BPPs on the branches, and a color scale for the median substitution rate. Time axes: RdRp, -900 to 0; NSM, -150 to 0; GN and GC, -3000 to 0; NSS, -600 to 0; N, -125 to 0.]

Caption embedded in the figure image: Phylogenetic MCC trees for each of the tospovirus genes. Branches are colored according to the substitution rate of the gene and labeled with the posterior probability. Species are colored according to the tospovirus group to which they belong. Axes represent years from 2017.


FIG 4. Phylogeographic representation of the migration pattern of the 73 orthotospovirus species based on the analysis of the N gene. The colored regions represent different ecozones: Afrotropic in dark grey, Australasia in orange, Indomalayan in blue, Nearctic in green, Neotropical in purple, Eastern Palearctic in red, and Western Palearctic in yellow. Arrows show the direction of movement: continuous black lines represent movements in the current period, and dashed blue lines represent past movements.


FIG 5. Phylogeographic representation of the migration pattern inferred for TSWV based on the N gene. (A) Reduced dataset consisting of 87 sequences. (B) Large dataset composed of 229 sequences. Colors represent different ecozones and lines represent movements, as described in the legend of Fig. 4.


FIG 6. Plots of population size over time generated for TSWV. (A) Reduced dataset consisting of 87 sequences. (B) Large dataset composed of 229 sequences. The ordinate represents the number of lineages (on a logarithmic scale), and the abscissa represents time from the origin. Light blue lines represent the 95% HPD interval. The 95% HPD of the tree-model root height is 26-32 for the full sequence set and 26-39 for the reduced set.
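For readers unfamiliar with the 95% HPD intervals quoted in this legend, the following is a minimal Python sketch of how a highest posterior density interval can be estimated from a list of posterior samples: it returns the narrowest window of sorted samples that contains the requested fraction of them. The function name `hpd_interval` and the sample values are hypothetical and serve only as illustration; this is not the procedure or software used to produce the figure.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Return (low, high) of the narrowest interval holding `mass` of the samples.

    Hypothetical helper for illustration only; it assumes a unimodal
    posterior, for which the narrowest such interval approximates the HPD.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < mass <= 1:
        raise ValueError("mass must be in (0, 1]")
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples the interval must cover
    best_low, best_high = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    for i in range(1, n - k + 1):
        low, high = xs[i], xs[i + k - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Illustrative values only (not data from this study).
toy_samples = [1, 2, 3, 4, 100, 200, 300, 400]
print(hpd_interval(toy_samples, mass=0.5))
```

Unlike an equal-tailed interval, this window is free to shift toward the densest part of the sample, which is why HPD bounds can be asymmetric around the posterior median.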
