bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Title 2 Evolution of the PRD1-adenovirus lineage: a viral tree of life incongruent with the cellular 3 universal tree of life 4 5 Authors 6 Anthony C. Woo1,2*, Morgan Gaia1,2, Julien Guglielmini3, Violette Da Cunha1,2 and Patrick 7 Forterre1,2* 8 9 Affiliations 1 10 Unité de Biologie Moléculaire du Gène chez les Extrêmophiles (BMGE), Department of 11 Microbiology, Institut Pasteur, 25-28 Rue du Docteur Roux, 75015 Paris, France. 2 12 Department of Microbiology, CEA, CNRS, Université Paris-Sud, Université Paris-Saclay, 13 Institute for Integrative Biology of the Cell (I2BC), Bâtiment 21, Avenue de la Terrasse, 14 91190 Gif-sur-Yvette cedex, France. 3 15 HUB Bioinformatique et Biostatistique, C3BI, USR 3756 IP CNRS, Institut Pasteur, 25-28 16 Rue du Docteur Roux, 75015 Paris, France 17 *corresponding authors: 18 Anthony Woo: [email protected] 19 Patrick Forterre: [email protected] 20 21 Abstract 22 Double-stranded DNA of the PRD1-adenovirus lineage are characterized by 23 homologous major capsid proteins containing one or two β-barrel domains known as the jelly 24 roll folds. Most of them also share homologous packaging ATPases of the FtsK/HerA 25 superfamily P-loop ATPases. Remarkably, members of this lineage infect hosts from the three 26 domains of life, suggesting that viruses from this lineage could be very ancient and share a 27 common ancestor. Here we analyzed the evolutionary history of these cosmopolitan viruses 28 by inferring phylogenies based on single or concatenated genes. These viruses can be divided 29 into two supergroups infecting either eukaryotes or prokaryotes. The latter can be further 30 divided into two groups of bacterioviruses and one group of archaeoviruses. This viral tree is 31 thus incongruent with the cellular tree of life in which Archaea are closer to Eukarya and 32 more divergent from . We discuss various evolutionary scenarios that could explain 33 this paradox. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

34 35 Introduction 36 Studying origin and evolution is a challenging exercise, especially when 37 addressing early co-evolution with cellular domains. While cellular domains (Archaea, 38 Bacteria, and Eukarya) have been established based on ribosome sequences and more recently 39 single-copy genes, viral lineages (or supergroups) have been suggested based on the 40 conservation of their major capsid proteins (MCPs) structure (1–4). It has been proposed that 41 MCP can be considered as the “viral self” or the “viral hallmark protein” (1, 5, 6). Viruses can 42 be broadly divided into two types of lineages: domain-specific (identified in hosts from a 43 single domain) and cosmopolitan lineages (identified in hosts from more than one domain) (7, 44 8). To date, only viruses from the HK97 and the PRD1-adenovirus lineages, both 45 corresponding mostly to double-stranded (ds) DNA viruses, infect host from the three 46 domains of life (1, 3, 9, 10). The HK97 lineage mostly consists of archaeal and bacterial 47 viruses, whereas the viruses of the PRD1-adenovirus lineage are well represented in the 48 virosphere associated with all three domains (archaeoviruses, bacterioviruses, and 49 eukaryoviruses) and this lineage thus becomes an ideal subject to study the evolution of this 50 viral lineage in the context of the universal tree of life. 51 The PRD1-adenovirus lineage was initially defined by the conservation of a double- 52 jelly roll (DJR) fold in the MCP (11). This lineage is now characterized by either a single 53 MCP with the DJR fold (1, 3, 12) or two MCPs, each with a single-jelly roll fold (SJR) (13– 54 16). In both cases, the β-strands of their folds are oriented vertically relative to the capsid 55 surface. These viruses will be called hereinafter DJR and SJR DNA viruses, respectively. The 56 first attempt to determine the evolutionary relationships between DJR viruses was 57 consequently based on structural alignments and confirmed that all viruses with DJR fold 58 MCPs are indeed evolutionarily related (11). 59 The PRD1-adenovirus lineage encompasses different families as listed in Table 1. 60 Among eukaryoviruses, all members of the PRD1-adenovirus lineage are DJR DNA viruses. 61 Beside , this lineage includes all members of the Nucleo-Cytoplasmic Large 62 DNA Viruses (NCLDVs) assemblage (members of the families Mimiviridae, 63 Phycodnaviridae, Marseilleviridae, Iridoviridae, Ascoviridae, , Asfarviridae, and 64 related viruses), the Lavidaviridae (virophages related to viruses from the family of 65 Mimiviridae) and the Polintoviruses (12, 17). In prokaryotes, most members of the PRD1- 66 adenovirus lineage are also DJR DNA viruses, except Sphaerolipoviridae, which consists of 67 SJR DNA viruses grouping two subfamilies of archaeal viruses and one subfamily of bacterial bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

68 viruses (18, 19). The best-known archaeal and bacterial DJR viruses of the PRD1-adenovirus 69 lineage are small viruses, such as Turriviridae, Tectiviridae, and Corticoviridae, exemplified 70 by the Sulfolobus Turreted Icosahedral Virus (STIV), the bacteriophage PRD1, and the 71 Pseudoalteromonas virus PM2 infecting Sulfolobus, Escherichia coli, and Pseudoalteromonas, 72 respectively (12, 20). Most of these prokaryotic viruses are also known to integrate into hosts 73 (20, 21) or exist as free plasmids corresponding to defective viruses (22). In terms of 74 representation, the prokaryotic portion of the lineage was considered scarce compared to the 75 eukaryotic counterpart, but until recently Koonin and co-workers highlighted the unexpected 76 diversity of these viruses in archaeal and bacterial genomes and metagenomes (20). The 77 authors identified five groups of prokaryotic dsDNA viruses related to the PRD1-adenovirus 78 lineage based on detecting sequence similarities between their MCP: Odin (named after an 79 integrated element present in the metagenome-assembled of an Odinarchaeon), PM2 80 (with some members, the Autolykoviruses, abundant in marine microbial metagenomes(21)), 81 STIV (all the viruses infecting Archaea belong to this group, with the exception of the one 82 presumably infecting the Odinarchaeon), Bam35/Toil/Cellulomonas, and PRD1. An 83 additional group, FLiP (Flavobacterium-infecting, lipid-containing phage), composed of 84 rolling-circle ssDNA viruses with a homologous, yet very divergent MCP, has also been 85 identified (23). 86 In a phylogenetic tree based on the common core of the MCP published in 2013, the 87 viruses of the DJR PRD1-adenovirus were divided into three groups: one corresponding to 88 Adenoviruses, another grouping NCLDVs (represented by two members) with Lavidaviridae, 89 and the third one including viruses infecting Archaea and Bacteria (11). Furthermore, the 90 evolutionary relationships among the PRD1-adenovirus lineage were recently investigated 91 using pairwise sequence similarities between DJR viruses (20, 24). These analyses 92 demonstrated that there are evolutionary links between archaeoviruses and bacterioviruses as 93 well as eukaryoviruses and bacterioviruses. With the advent of metagenomics approaches, 94 many new viruses related to the PRD1-adenovirus lineage have been identified recently, 95 providing important insights into the evolution of viruses from this lineage. This strengthens 96 the call for a more thorough study to examine the evolutionary relationships among all MCPs 97 from the DJR viruses. 98 In addition to the MCP, most members of the PRD1-adenovirus lineage, with a few 99 exceptions, share a packaging ATPase (pATPase) of the FtsK/HerA superfamily P-loop 100 ATPases (Table 1) (3, 20). The MCP and pATPase were also part of the core genes defined in 101 our recent study on the evolution of the NCLDV assemblage (25), and concatenating the two bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

102 proteins proved useful in rooting the phylogenetic tree of NCLDVs with the Polintoviruses as 103 an outgroup, suggesting that such an approach could be extended to the whole PRD1- 104 adenovirus lineage. 105 Here we show that the concatenated MCP and pATPases genes of the viruses of the 106 PRD1-adenovirus lineage produces a robust viral phylogenetic tree characterized by a major 107 division between eukaryotic and prokaryotic viruses, in contradiction with the topology of the 108 universal cellular tree, which is characterized by the grouping of Archaea with eukaryotes. 109 Accordingly, the viral phylogenetic tree is not in congruence with the cellular phylogenetic 110 trees inferred using universal markers shared between the three cellular domains. We explore 111 various scenarios that could explain this discrepancy between the viral tree and the cellular 112 universal tree of life. 113 114 Results 115 The MCP and the pATPase genes were used as the hallmark proteins to identify viruses 116 from the PRD1-adenovirus lineage 117 We retrieved MCP and pATPase sequences using PSI-BLAST searches against the 118 GenBank non-redundant protein sequence database (nr), including sequences recovered from 119 proviruses (Materials and Methods). To validate the selected MCP and pATPase sequences in 120 our dataset, we generated protein models for all selected sequences and compared these 121 predicted structures to the PDB database. The DJR MCPs associated to groups and families 122 previously described indeed matched their corresponding structures in the public databases, 123 except for the putative MCPs identified from the Odin group, confirming the structural 124 conservation of the motif in the MCP of most of the viruses from the PRD1-adenovirus 125 lineage (fig. S1 and data file S1). The results also showed that MCPs from the SJR groups 126 were very divergent from those of the DJR groups. The MCP from Adenoviridae was unique 127 in exhibiting several additional structural elements (fig. S1). Our analysis also showed that all 128 pATPases share similar predicted structures including this time those from the SJR groups. 129 The only exception was the pATPase of the Adenoviridae, which had again, like in the MCP 130 analysis, the most distant structure compared to the other groups of DJR viruses (fig. S2 and 131 data file S1). 132 A recent study has used the MCP genes of the PRD1-adenovirus lineage to create 133 amino acid sequence similarity networks to analyze evolutionary relationships between 134 viruses from this lineage (20). We thus performed a similar analysis based on the pATPases 135 (fig. S3). Almost all viruses of the PRD1-adenovirus lineage were clustered together when bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

136 using the pATPases for the analysis, except for the Adenoviridae. This corroborates previous 137 observations based on amino-acid signatures and secondary structure prediction suggesting 138 that pATPases of the Adenoviridae were not specifically related to other pATPases of viruses 139 of the PRD1-adenovirus lineage but were derived from ATPases of the ABC superfamily (26), 140 indicating an exchange of pATPases during the evolution of this viral family. Surprisingly, 141 we found that the pATPases of the SJR DNA viruses did not form a distinct cluster, as 142 expected for an outgroup, but strongly clustered together with the pATPases of DJR 143 archaeoviruses and bacterioviruses. It was possible to distinguish two large assemblages in 144 the pATPase network, one corresponding to a core NCLDVs loosely associated to the 145 Asfarviridae on one side and the Poxviridae and the Polintoviruses on the other, and a second 146 one corresponding to archaeoviruses and bacterioviruses (including SJR viruses), loosely 147 associated to the Lavidaviridae. 148 149 Phylogenetic positions of the PRD1-adenovirus lineage viruses inferred from the single 150 protein genes 151 The result of the sequence similarity network suggested that the pATPase gene could 152 be a good marker for delineating the phylogeny of the viruses from the PRD1-adenovirus 153 lineage, in addition to the MCP that was used in recent studies (11, 20). Sequences of the 154 MCPs (excluding and both SJR viruses and the Adenoviridae) and pATPases were aligned 155 and trimmed (Materials and Methods). Phylogenetic analyses were performed separately on 156 the MCP and the pATPase using the resulting datasets within the maximum likelihood (ML) 157 framework. DJR eukaryoviruses were clearly separated from archaeoviruses and 158 bacterioviruses in both trees, in agreement with previous observations made using pairwise 159 sequence similarities or comparative structural analyses (11, 24). In particular, unlike the 160 result obtained with the sequence similarity network, members of the Lavidaviridae were 161 grouped with other eukaryoviruses in the pATPase tree. We thus decided to root the trees 162 between “prokaryotic” and “eukaryotic” DJR viruses. As a result, DJR eukaryoviruses and 163 viruses infecting Archaea or Bacteria formed two monophyletic clades. Moreover, we 164 recovered the monophyly of most previously defined assemblages at various taxonomic ranks 165 (groups, families or superfamilies). The few exceptions were the STIV, the Phycodnaviridae 166 and the Iridoviridae that were paraphyletic in both the MCP and pATPase trees, and the PM2 167 group which was paraphyletic in the pATPase tree (fig. S4 and S5). Mollivirus, which is 168 related to pandoraviruses, branched within the Phycodnaviridae and this corroborates 169 previous observations (25, 27). bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

170 In the MCP tree, the NCLDV formed a monophyletic sister group to the Lavidaviridae, 171 whereas in the pATPase tree, the Lavidaviridae branched at the base of the eukaryotic clade 172 and the NCLDV were not monophyletic as the Poxviridae were sister group to the 173 Polintoviruses. On the prokaryotic side, viruses of the STIV group were the sister group to 174 viruses of the PM2 group (except for the two Sulfolobus STIV), whereas the five other groups 175 of DJR bacterioviruses (the Bam35, the PRD1, the Toil, the Cellulomonas and the FLiP) 176 emerged at the base of this part of the tree (fig. S4). 177 In the pATPase tree, the Bam35, the PRD1, the Toil and the Cellulomonas groups 178 formed a monophyletic cluster that was sister group to all STIV (except two members of the 179 STIV group) whereas the PM2 group and the two STIV members branched at the base of the 180 tree, within SJR viruses of the Sphaerolipoviridae family (fig. S5). Remarkably, the SJR 181 pATPases did not form a monophyletic group well separated from DJR viruses, as expected if 182 DJR viruses originated from SJR viruses (28), but grouped with DJR viruses infecting 183 prokaryotes in agreement with the network analysis (fig. S3 and S5). To better address the 184 evolutionary relationship between the pATPases of SJR and DJR viruses, we performed 185 another phylogenetic analysis using only sequences of bacterial and archaeal viruses (fig. S6). 186 Interestingly, the pATPases of archaeal SJR DNA viruses branched within the clade of 187 archaeal DJR DNA viruses, whereas the pATPases of bacterial SJR viruses formed a separate 188 clade. It has been previously suggested that DJR viruses originated from the simpler SJR type 189 by ancestral gene duplication of one of the two SJR fold MCPs that possibly occurred before 190 the emergence of the Last Universal Common Ancestor (LUCA) (29, 30). If one assumes that 191 SJR viruses, defined by their very divergent MCPs, predated DJR viruses as previously 192 proposed (28), this suggests that they replaced later on their pATPases with those of DJR 193 viruses infecting the same hosts. Alternatively, SJR viruses might have emerged twice 194 independently from DJR viruses and the MCP genes of SJR viruses might have diverged 195 rapidly following their structural rearrangements. 196 To better compare the MCP and pATPase trees, we removed the sequences of the SJR 197 viruses whose MCP cannot be included in the MCP tree from the pATPase dataset and we 198 removed viruses of the FLiP group that have no detectable pATPases from the MCP dataset 199 (Fig. 1 and fig. S7 and S8). The removal of SJR did not change the topology of the pATPase 200 tree, except for some modifications in the relative branching of DJR viruses from the PM2 201 and the STIV groups. Surprisingly, the removal of the FLiP viruses from the MCP tree also 202 introduced some modifications in the relationships between the different groups of viruses 203 infecting eukaryotes, especially the Poxviridae and the Asfarviridae. The Poxviridae and the bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

204 Asfarviridae are however known to be unstable taxa, possibly due to their frequent long 205 branches. Nevertheless, most of the clades were still monophyletic and the bipartition 206 corresponding to eukaryoviruses and prokaryoviruses was still present. 207 208 Concatenation of the MCP and the pATPase genes highlights an unexpected 209 evolutionary scenario 210 The small differences observed between the MCP and the pATPase trees, the 211 uncertainties in the topologies due to the removal of some sequences, and the low support 212 values of most branches in these single trees were anticipated considering the divergence of 213 the analyzed proteins. However, these trees also exhibited some clear-cut congruence, such as 214 the eukaryotic/prokaryotic split, the monophyly of the NCLDVs other than the Poxviridae, 215 the monophyly of several previously defined groups of viruses infecting Archaea and Bacteria, 216 the grouping of the STIV and the PM2 on one side and the grouping of the PRD1, the Bam35, 217 the PRD1, the Toil, and the Cellulomonas on the other (Fig.1 and fig. S7 and S8). This 218 suggested that we could concatenate the MCP and the pATPase sequences to obtain a more 219 reliable phylogeny. We adopted the new version of transfer bootstrap in which the presence of 220 inferred branches in replications is measured using a gradual ‘transfer’ distance and the 221 resulting supports do not induce falsely supported branches (31). In the concatenated ML tree, 222 we recovered the division between DJR viruses infecting eukaryotes and those infecting 223 prokaryotes, as well as the monophyly of all previously defined groups of DJR viruses, except 224 for the STIV group. Interestingly, the STIV group was now divided into two subgroups 225 corresponding to DJR viruses infecting either Archaea or Bacteria and most of the branch 226 supports were above 0.9 (in the range of 0 to 1), indicating that the tree is very robust (Fig. 2). 227 Among the eukaryoviruses of the tree, the Polintoviruses were basal to all eukaryotic 228 groups whereas the NCLDV assemblage was sister group to the Lavidaviridae. The 229 Poxviridae were the first NCLDVs branching family, followed by the Asfarviridae. Notably, 230 we recovered the two major groups of NCLDVs that we previously identified based on 8 231 marker core genes (25), the MAPI (Marseilleviridae, Ascoviridae, Pitho-like viruses, and 232 Iridoviridae) and the PAM (Phycodnaviridae, Asfarviridae and the recently proposed order 233 “Megavirales”), except that the Asfarviridae were sister group to these two superclades 234 instead to be part of the PAM and the Phycodnaviridae were paraphyletic. Remarkably, we 235 also previously obtained Asfarviridae as sister group to other NCLDVs in a tree based on the 236 two large RNA polymerase subunits, using cellular sequences as an outgroup (25). We 237 previously noticed that the Poxviridae had exceptionally long branches and variable positions bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

238 in single-gene trees of NCLDV proteins. In particular, they tended to attract the long branches 239 of the Asfarviridae in our previous analyses. We thus removed both the Poxviridae and the 240 Asfarviridae from the dataset and we obtained a tree in which NCLDVs are no more sister 241 group to the Lavidaviridae (fig. S9), but the Polintoviruses, as previously proposed based on 242 phenotypic analyses (32). 243 Noticeably, two strongly supported clades of viruses infecting bacteria and one clade 244 of viruses infecting Archaea were recovered on the prokaryotic side of the tree (Fig. 2). The 245 bacterial members of the STIV formed a strongly supported monophyletic clade with bacterial 246 viruses/proviruses of the PM2 group that branched at the base of this clade and will be 247 hereinafter called the DJR bacterial cluster I. The second bacterial clade with the 248 Cellulomonas, the Bam, the Toil and the PRD1 (hereinafter called the DJR bacterial cluster 249 II) was similar to a clade first observed with the pATPase tree. The third clade included all 250 archaeal members of the STIV group (hereinafter called the DJR archaeal cluster). The DJR 251 archaeal cluster became sister group to the DJR bacterial cluster I when both the Poxviridae 252 and the Asfarviridae were removed from the analysis, with members the DJR bacterial cluster 253 II becoming paraphyletic at the base of the “prokaryotic clade” (fig. S9). 254 The monophyly of DJR archaeoviruses, previously observed with the complete MCP 255 tree was only weakly supported in the concatenated tree with or without the Poxviridae and 256 the Asfarviridae. To improve the resolution, we removed eukaryoviruses from the 257 concatenated tree. We again obtained the monophyly of the DJR bacterial clusters I, II, and 258 the archaeal cluster if we rooted the tree between Archaea and Bacteria (Fig. 3). On this 259 occasion, we obtained good support for the monophyly of Archaea. Interestingly, viruses 260 infecting Euryarchaea and those infecting Crenarchaea or Thaumarchaea were clearly 261 separated in the tree, suggesting that these viruses might have been present at the time of the 262 last common ancestor of Archaea and later on co-evolved with cellular hosts from this domain. 263 264 Discussion 265 The PRD1-adenovirus lineage is one of the two major viral lineages with members 266 infecting hosts from all three domains of life. It has long been thought that inferring 267 phylogeny of this lineage was out of reach because of the absence of significant sequence 268 similarities between their MCPs (11). This could be because the first viruses identified in this 269 lineage, PRD1 and Adenoviruses, are the most divergent viruses of this lineage. However, 270 many additional large groups of DNA viruses from the three domains of life were later on 271 progressively incorporated in this lineage, revealing previously undetectable connection at the bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

272 primary sequence level between their MCPs and pATPases. Here, we show that most 273 members of this lineage can be placed in a robust “universal tree” based on the concatenation 274 of these two proteins. A similar strategy has been recently adopted to produce a global 275 evolutionary history of RNA viruses based on the phylogeny of their RNA-dependent RNA 276 polymerases (33). Our results are supported by the fact that we recovered most groups 277 previously defined on different criteria, with the only exception of the STIV group (20). In 278 particular, the internal phylogeny of the NCLDVs is comparable to the one that we previously 279 obtained with eight-core genes (25). 280 The tree presented here will certainly change in the future with the discovery of new 281 viral groups belonging to this lineage and the discovery of new members of the existing 282 groups. Nevertheless, our findings suggest that the phylogenetic tree of the concatenated 283 MCP and pATPase can be used as a backbone to review current hypotheses about the 284 evolution of this lineage and propose new ones. 285 Interestingly, our in-depth phylogenetic analyses reveal a great divide between DJR 286 viruses infecting prokaryotes (Archaea and Bacteria) and those infecting eukaryotes. This 287 produces a “viral tree of life” strikingly different from the cellular tree based on universal 288 proteins in which Archaea and eukaryotes are closely related (34–37). Interestingly, the 289 topology of this viral tree of life corroborates phenotypic observations showing that the 290 mobilomes of Archaea and Bacteria are both very similar and strikingly different from the 291 eukaryotic mobilome (38). For instance, Caudovirales (head and tailed “phages”), conjugative 292 plasmids, casposons, or several families of IS elements are only present in Archaea and 293 Bacteria (39–44), whereas several families of RNA viruses, retroviruses, and certain families 294 of IS elements are specific to eukaryotes (38, 45). Similarly, defense systems are also 295 distributed according to the “eukaryote/prokaryote” divide, with the CRISPR-Cas system or 296 the toxin-antitoxin systems only present in Archaea and Bacteria whereas the RNAi system 297 only exists in eukaryotes (46, 47). 298 It has been suggested that all eukaryoviruses, including those from the PRD1- 299 adenovirus lineage, originated from a melting pot of bacteriophages (bacterioviruses) 300 infecting the bacterium at the origin of mitochondria (48). Regarding the origin of DJR 301 viruses, it was suggested that Polintoviruses evolved from a Tectiviridae and the 302 Polintoviruses subsequently became the ancestor of Lavidaviridae and NCLDVs (32). 303 However, in that scenario, DJR eukaryoviruses, especially Polintoviruses, should have been 304 more closely related to Tectiviridae than to other DJR bacterioviruses in the MCP/pATPase 305 DJR tree, which is not the case. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

306 Phylogenetic analyses of cellular and NCLDV RNA polymerases revealed that 307 NCLDVs had already diverged before the Last Eukaryotic Common Ancestor (LECA) (25). 308 Our present study indicates that the divergence between NCLDVs, Polintoviruses and 309 Lavidaviridae took place even earlier. This renders unlikely that the transformation of an 310 ancestral prokaryotic-like DJR virus into these three groups of eukaryoviruses had enough 311 time to occur during the evolutionary period between the first eukaryote with a proto- 312 mitochondrion and LECA. If they all originated from the mitochondrion, it would leave a 313 rather short lapse of time before LECA for so many various families to emerge. Notably, 314 among these three groups of eukaryoviruses, one group encompasses diverse large and giant 315 viruses infecting a range of host from unicellular algae to human beings. A more plausible 316 related hypothesis could be that DJR eukaryoviruses evolved from DJR viruses that infected 317 prokaryotes that entered into symbiosis with proto-eukaryotes much before the final 318 endosymbiosis that led to mitochondria, leaving more time for the evolution and 319 diversification of DJR eukaryoviruses. 320 If DJR eukaryoviruses originated indeed from bacterioviruses or archaeoviruses 321 (hereinafter called the “prokaryotic” ancestor hypothesis), the DJR universal tree could be 322 rooted with the DJR archaeal cluster (fig. S10), the DJR bacterial clusters (fig. S11 and S12) 323 or between the DJR archaeal cluster and the DJR bacterial cluster I (Fig. 4A and fig. S13). In 324 this scenario, according to our MCP/pATPase tree, DJR eukaryoviruses could have originated 325 from a third unknown group of DJR viruses, sister group to DJR bacterial cluster I. In any 326 case, it remains unclear in the “prokaryotic” ancestor hypothesis how a virus infecting a 327 prokaryotic symbiont could have acquired the capacity to infect the proto-eukaryotic host. 328 The presence of DJR viruses in the three domains has been often used as an argument 329 suggesting that these viruses were already present at the time of the Last Universal Common 330 Ancestor (LUCA) (1)(49)(50). In the “prokaryotic” ancestor hypothesis, the presence of DJR 331 viruses in LUCA favors rooting the whole DJR tree between archaeoviruses and 332 bacterioviruses, as in fig. S13. However, the branch between archaeoviruses and 333 bacterioviruses was very short in fig. S13, whereas the branch between archaeal and bacterial 334 proteins is always very long in the trees of universal proteins that were present in LUCA 335 (34)(51). In the “prokaryotic” ancestor hypothesis, it thus seems more likely that DJR viruses 336 originated after LUCA, either in proto-archaea or proto-bacteria and were later on transferred 337 from one domain to the other. 338 Alternatively, the presence of DJR viruses in LUCA would explain the clear-cut 339 separation of prokaryotic and eukaryotic DJR viruses in the whole DJR tree. However, bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

340 considering that viruses generally co-evolve with their hosts, it remains obscure why DJR 341 archaeoviruses are much more similar to DJR bacterioviruses than to DJR eukaryoviruses, 342 since Archaea and Eukarya are closely related in the universal tree of life. A plausible 343 explanation for this paradox could be that eukaryotic-like DJR viruses were first lost in the 344 stem lineage of Archaea and reintroduced later on in this domain from Bacteria (or proto- 345 bacteria) before the diversification of Archaea (Fig. 4B). The present absence of eukaryotic- 346 like DJR viruses in Archaea could be also due to a sampling bias, some of these archaeal 347 viruses being still hidden in some unexplored places around the globe. 348 The presence of DJR viruses in the three cellular domains thus raises challenging 349 questions about their origin and evolution. Here, we have shown that phylogenies based on 350 the concatenation of their MCP and pATPase not only can help to formulate more precisely 351 previous evolutionary hypotheses but also to propose new ones, such as the monophyly of 352 DJR archaeoviruses and to define new assemblages, such as DJR clusters I and II in Bacteria. 353 Future identification and isolation of new viral families in the PRD1-adenovirus lineage will 354 be essential to provide insights into various scenarios for the history of this lineage. 355 356 Materials and Methods 357 Selection of MCP/pATPase sequences 358 Representative MCP/pATPase sequences from different groups of the DJR viruses were used 359 as queries for PSI-BLAST (52) searches against the GenBank non-redundant protein 360 sequence database (nr). The query sequences are listed below: Group Name MCP pATPase Lavidaviridae Sputnik virophage YP_002122381 YP_002122364 Adenoviridae Frog adenovirus 1 NP_062443 NP_062434 STIV Sulfolobus turreted icosahedral virus 1 YP_025022 YP_025021 Bam35 Bacillus phage Bam35c NP_943764 NP_943760 PRD1 Enterobacteria phage PRD1 NP_040692 NP_040689 Toil Rhodococcus phage Toil ARK07697 ARK07695 PM2 Pseudoalteromonas phage PM2 NP_049903 NP_049900 FLiP Flavobacterium phage FLiP ASQ41214 - 361 362 The MCP/pATPase sequences of the NCLDVs were retrieved from a previous study that we 363 conducted (25). The Polintoviruses sequences were gathered from the Repbase collection (53) 364 (http://www.girinst.org/Repbase_Update.html): Polinton-2_NV, Polinton-1_DY, Polinton- bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

365 1_TC, Polinton-1_SP, Polinton-2_SP, Polinton-2_DR, Polinton-1_DR. Finally, the sequences 366 from the SJR group were recovered based on previously identified sequences (19). 367 368 Putative MCP/pATPase sequences were aligned with the query sequences for the examination 369 of the conserved structural elements using MAFFT (54). Prediction of the secondary structure 370 was performed using Phyre2 (55) and the predicted protein structures were visualized using 371 UCSF Chimera (56). The sequences used in this study are shown in the supplementary file. 372 After removing sequences with no significant matches or low confidence levels, we obtained 373 two different datasets of 145 and 128 sequences for the MCP and pATPase respectively. 374 375 Network analysis 376 After performing the structural protein prediction analysis, all-against-all blastp analyses were 377 performed on the refined MCP and pATPase datasets. The all-against-all integrases BlastP 378 results were grouped using the SiLiX (for SIngle LInkage Clustering of Sequences) package 379 v1.2.8 (http://lbbe.univ-lyon1.fr/ SiLiX)(57). This approach for the clustering of homologous 380 sequences is based on single transitive links with alignment coverage constraints. Several 381 different criteria can be used separately or in combination to infer homology (percentage of 382 identity, alignment score or E-value, alignment coverage). The MCP and pATPase sequences 383 were clustered independently by similarity using SiLiX with the expect threshold of 0.001 as 384 previously used for MCP analysis (20). The clustering results were analyzed and visualized 385 using the igraph package of the R programming language (https://igraph.org/). 386 387 Phylogenetic analysis 388 The alignments of the MCP sequences were performed using MAFFT v7.392 with the E-INS- 389 i algorithm (54), which can align sequences with several conserved motifs embedded in long 390 unalignable regions, whereas pATPase sequences were aligned using MAFFT with the L- 391 INS-i algorithm (54), which can align a set of sequences containing sequences flanking 392 around one alignable domain. Positions containing more than 30% of gaps were trimmed 393 using goalign v0.2.8 (https://github.com/evolbioinfo/goalign). 394 395 Maximum likelihood phylogenies 396 Single protein and concatenated protein phylogenies were conducted within the Maximum 397 Likelihood (ML) framework using IQ-TREE v1.6.3 (58). We first performed a model test 398 with the Bayesian Information Criterion (BIC) by including protein mixture models (59). For bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

399 mixture model analyses, we used the PMSF models (60). The support values were either 400 transfer bootstrap expectation (TBE) support (31) computed from 1000 bootstrap replicates in 401 the case of nonparametric bootstrap using gotree v0.3.0 402 (https://github.com/evolbioinfo/gotree), or from 1,000 replicates for SH-like approximation 403 likelihood ratio test (aLRT) (61) and ultrafast bootstrap approximation (UFBoot) (62). 404 405 Visualization 406 The phylogenetic trees were visualized with FigTree v1.4.3 407 (http://tree.bio.ed.ac.uk/software/figtree/) and iTOL (63). 408 409 Supplementary Materials 410 Fig. S1. Major capsid proteins of the PRD1-adenovirus lineage. 411 Fig. S2. Packaging ATPases of the PRD1-adenovirus lineage. 412 Fig. S3. Sequence similarity networks of the PRD1-adenovirus packaging ATPase proteins. 413 Fig. S4. Maximum likelihood (ML) phylogenetic tree of the major capsid protein gene of the 414 viruses from the PRD1-adenovirus lineage. 415 Fig. S5. Maximum likelihood (ML) phylogenetic tree of the packaging ATPase gene of the 416 viruses from the PRD1-adenovirus lineage. 417 Fig. S6. Maximum likelihood (ML) phylogenetic tree of the packaging ATPase gene of the 418 archaeoviruses and bacterioviruses from the PRD1-adenovirus lineage. 419 Fig. S7. Maximum likelihood (ML) phylogenetic tree of the packaging ATPase gene of the 420 double jelly-roll viruses from the PRD1-adenovirus lineage, excluding single jelly-roll viruses. 421 Fig. S8. Maximum likelihood (ML) phylogenetic tree of the major capsid protein gene of the 422 double jelly-roll viruses from the PRD1-adenovirus lineage, excluding the FLiP members. 423 Fig. S9. Maximum likelihood (ML) phylogenetic tree of the concatenated major capsid 424 protein and packaging ATPase genes, excluding the Poxviridae and the Asfarviridae. 425 Fig. S10. Maximum likelihood phylogenetic tree of the concatenated major capsid protein and 426 packaging ATPase genes, rooting with the DJR archaeal cluster. 427 Fig. S11. Maximum likelihood phylogenetic tree of the concatenated major capsid protein and 428 packaging ATPase genes, rooting with the DJR bacterial cluster II. 429 Fig. S12. Maximum likelihood phylogenetic tree of the concatenated major capsid protein and 430 packaging ATPase genes, rooting with the DJR cluster I. 431 Fig. S13. Maximum likelihood phylogenetic tree of the concatenated major capsid protein and 432 packaging ATPase genes, rooting between the DJR archaeal cluster and the DJR bacterial 433 cluster II. 434 Data file S1. Results of the predicted structures of the major capsid protein and the packaging 435 ATPase genes. 436 437 438 439 References 440 1. D. H. Bamford, Do viruses form lineages across different domains of life? Res. 441 Microbiol. (2003), , doi:10.1016/S0923-2508(03)00065-2. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

442 2. D. H. Bamford, J. M. Grimes, D. I. Stuart, What does structure tell us about virus 443 evolution? Curr. Opin. Struct. Biol. (2005), , doi:10.1016/j.sbi.2005.10.012. 444 3. N. G. A. Abrescia, D. H. Bamford, J. M. Grimes, D. I. Stuart, Structure Unifies the 445 Viral Universe. Annu. Rev. Biochem. 81, 795–822 (2012). 446 4. P. Forterre, M. Krupovic, D. Prangishvili, Cellular domains and viral lineages. Trends 447 Microbiol. (2014), , doi:10.1016/j.tim.2014.07.004. 448 5. M. Krupovic, D. H. Bamford, Order to the Viral Universe. J. Virol. 84, 12476–12479 449 (2010). 450 6. P. Forterre, The virocell concept and environmental microbiology. ISME J. (2013), , 451 doi:10.1038/ismej.2012.110. 452 7. J. Iranzo, M. Krupovic, E. V. Koonin, The Double-Stranded DNA Virosphere as a 453 Modular Hierarchical Network of Gene Sharing. MBio. 7, 1–21 (2016). 454 8. D. Prangishvili, D. H. Bamford, P. Forterre, J. Iranzo, E. V. Koonin, M. Krupovic, The 455 enigmatic archaeal virosphere. Nat. Rev. Microbiol. 15, 724–739 (2017). 456 9. R. W. Hendrix, Evolution: The long evolutionary reach of viruses. Curr. Biol. (1999), , 457 doi:10.1016/S0960-9822(00)80103-7. 458 10. M. L. Baker, W. Jiang, F. J. Rixon, W. Chiu, Common Ancestry of Herpesviruses and 459 Tailed DNA Bacteriophages. J. Virol. 79, 14967–14970 (2005). 460 11. J. Ravantti, D. Bamford, D. I. Stuart, Automatic comparison and classification of 461 protein structures. J. Struct. Biol. 183, 47–56 (2013). 462 12. C. San Martín, M. J. van Raaij, The so far farthest reaches of the double jelly roll 463 capsid protein fold. Virol. J. (2018), doi:10.1186/s12985-018-1097-1. 464 13. H. T. Jaalinoja, E. Roine, P. Laurinmaki, H. M. Kivela, D. H. Bamford, S. J. Butcher, 465 Structure and host-cell interaction of SH1, a membrane-containing, halophilic 466 euryarchaeal virus. Proc. Natl. Acad. Sci. (2008), doi:10.1073/pnas.0801758105. 467 14. S. T. Jaatinen, L. J. Happonen, P. Laurinmäki, S. J. Butcher, D. H. Bamford, 468 Biochemical and structural characterisation of membrane-containing icosahedral 469 dsDNA bacteriophages infecting thermophilic Thermus thermophilus. Virology (2008), 470 doi:10.1016/j.virol.2008.06.023. 471 15. S. T. Jaakkola, R. K. Penttinen, S. T. Vilen, M. Jalasvuori, G. Ronnholm, J. K. H. 472 Bamford, D. H. Bamford, H. M. Oksanen, Closely Related Archaeal Haloarcula 473 hispanica Icosahedral Viruses HHIV-2 and SH1 Have Nonhomologous Genes 474 Encoding Host Recognition Functions. J. Virol. (2012), doi:10.1128/JVI.06666-11. 475 16. Z. Zhang, Y. Liu, S. Wang, D. Yang, Y. Cheng, J. Hu, J. Chen, Y. Mei, P. Shen, D. H. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

476 Bamford, X. Chen, Temperate membrane-containing halophilic archaeal virus SNJ1 477 has a circular dsDNA genome identical to that of plasmid pHH205. Virology. 434, 478 233–241 (2012). 479 17. M. Krupovic, J. H. Kuhn, M. G. Fischer, A classification system for virophages and 480 satellite viruses. Arch. Virol. 161, 233–247 (2016). 481 18. D. Gil-Carton, S. T. Jaakkola, D. Charro, B. Peralta, D. Castaño-Díez, H. M. Oksanen, 482 D. H. Bamford, N. G. A. Abrescia, Insight into the assembly of viruses with vertical 483 single β-barrel major capsid proteins. Structure. 23, 1866–1877 (2015). 484 19. T. A. Demina, M. K. Pietilä, J. Svirskaitė, J. J. Ravantti, N. S. Atanasova, D. H. 485 Bamford, H. M. Oksanen, HCIV-1 and other tailless icosahedral internal membrane- 486 containing viruses of the family Sphaerolipoviridae. Viruses. 9 (2017), 487 doi:10.3390/v9020032. 488 20. N. Yutin, D. Bäckström, T. J. G. Ettema, M. Krupovic, E. V. Koonin, Vast diversity of 489 prokaryotic virus genomes encoding double jelly-roll major capsid proteins uncovered 490 by genomic and metagenomic sequence analysis. Virol. J. 15, 1–17 (2018). 491 21. K. M. Kauffman, F. A. Hussain, J. Yang, P. Arevalo, J. M. Brown, W. K. Chang, D. 492 Vaninsberghe, J. Elsherbini, R. S. Sharma, M. B. Cutler, L. Kelly, M. F. Polz, A major 493 lineage of non-tailed dsDNA viruses as unrecognized killers of marine bacteria. Nature 494 (2018), doi:10.1038/nature25474. 495 22. M. Gaudin, M. Krupovic, E. Marguet, E. Gauliard, V. Cvirkaite-Krupovic, E. Le Cam, 496 J. Oberto, P. Forterre, Extracellular membrane vesicles harbouring viral genomes. 497 Environ. Microbiol. 16, 1167–1175 (2014). 498 23. E. Laanto, S. Mäntynen, L. De Colibus, J. Marjakangas, A. Gillum, D. I. Stuart, J. J. 499 Ravantti, J. T. Huiskonen, L.-R. Sundberg, Virus found in a boreal lake links ssDNA 500 and dsDNA viruses. Proc. Natl. Acad. Sci. 114, 8378–8383 (2017). 501 24. R. Sinclair, J. Ravantti, D. H. Bamford, Nucleic and Amino Acid Sequences 502 Classification. J. Virol. 91, 1–13 (2017). 503 25. J. Guglielmini, A. Woo, M. Krupovic, P. Forterre, M. Gaia, Diversification of giant 504 and large eukaryotic dsDNA viruses predated the origin of modern eukaryotes. bioRxiv 505 (2018), doi:10.1101/455816. 506 26. A. Burroughs, L. Iyer, L. Aravind, Comparative genomics and evolutionary trajectories 507 of viral ATP dependent DNA-packaging systems. Genome Dyn. 3, 48–65 (2007). 508 27. N. Yutin, E. V. Koonin, Pandoraviruses are highly derived phycodnaviruses. Biol. 509 Direct (2013), doi:10.1186/1745-6150-8-25. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

510 28. E. V. Koonin, M. Krupovic, Polintons, virophages and transpovirons: a tangled web 511 linking viruses, transposons and immunity. Curr. Opin. Virol. (2017), , 512 doi:10.1016/j.coviro.2017.06.008. 513 29. M. Krupovič, D. H. Bamford, Archaeal proviruses TKV4 and MVV extend the PRD1- 514 adenovirus lineage to the phylum Euryarchaeota. Virology. 375, 292–300 (2008). 515 30. N. G. A. Abrescia, J. M. Grimes, H. M. Kivelä, R. Assenberg, G. C. Sutton, S. J. 516 Butcher, J. K. H. Bamford, D. H. Bamford, D. I. Stuart, Insights into Virus Evolution 517 and Membrane Biogenesis from the Structure of the Marine Lipid-Containing 518 Bacteriophage PM2. Mol. Cell. 31, 749–761 (2008). 519 31. F. Lemoine, J. B. Domelevo Entfellner, E. Wilkinson, D. Correia, M. Dávila Felipe, T. 520 De Oliveira, O. Gascuel, Renewing Felsenstein’s phylogenetic bootstrap in the era of 521 big data. Nature (2018), doi:10.1038/s41586-018-0043-0. 522 32. M. Krupovic, E. V. Koonin, Polintons: A hotbed of eukaryotic virus, transposon and 523 plasmid evolution. Nat. Rev. Microbiol. 13, 105–115 (2015). 524 33. Y. I. Wolf, D. Kazlauskas, J. Iranzo, A. Lucía-Sanz, J. H. Kuhn, M. Krupovic, V. V. 525 Dolja, E. V. Koonin, Origins and Evolution of the Global RNA Virome. MBio (2018), 526 doi:10.1128/mbio.02329-18. 527 34. V. Da Cunha, M. Gaia, D. Gadelle, A. Nasir, P. Forterre, Lokiarchaea are close 528 relatives of Euryarchaeota, not bridging the gap between prokaryotes and eukaryotes. 529 PLoS Genet. 13 (2017), doi:10.1371/journal.pgen.1006810. 530 35. V. Da Cunha, M. Gaia, A. Nasir, P. Forterre, Asgard archaea do not close the debate 531 about the universal tree of life topology. PLOS Genet. 14, e1007215 (2018). 532 36. A. Spang, J. H. Saw, S. L. Jørgensen, K. Zaremba-Niedzwiedzka, J. Martijn, A. E. 533 Lind, R. Van Eijk, C. Schleper, L. Guy, T. J. G. Ettema, Complex archaea that bridge 534 the gap between prokaryotes and eukaryotes. Nature (2015), doi:10.1038/nature14447. 535 37. A. Spang, L. Eme, J. H. Saw, E. F. Caceres, K. Zaremba-Niedzwiedzka, J. Lombard, L. 536 Guy, T. J. G. Ettema, Asgard archaea are the closest prokaryotic relatives of eukaryotes. 537 PLOS Genet. (2018), doi:10.1371/journal.pgen.1007080. 538 38. P. Forterre, The Common Ancestor of Archaea and Eukarya Was Not an Archaeon. 539 Archaea. 2013, 1–18 (2013). 540 39. M. Krupovic, D. Prangishvili, R. W. Hendrix, D. H. Bamford, Genomics of Bacterial 541 and Archaeal Viruses: Dynamics within the Prokaryotic Virosphere. Microbiol. Mol. 542 Biol. Rev. 75, 610–635 (2011). 543 40. M. Krupovič, P. Forterre, D. H. Bamford, Comparative Analysis of the Mosaic bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

544 Genomes of Tailed Archaeal Viruses and Proviruses Suggests Common Themes for 545 Virion Architecture and Assembly with Tailed Viruses of Bacteria. J. Mol. Biol. (2010), 546 doi:10.1016/j.jmb.2010.01.037. 547 41. M. K. Pietila, N. S. Atanasova, V. Manole, L. Liljeroos, S. J. Butcher, H. M. Oksanen, 548 D. H. Bamford, Virion Architecture Unifies Globally Distributed Pleolipoviruses 549 Infecting Halophilic Archaea. J. Virol. (2012), doi:10.1128/JVI.06915-11. 550 42. J. Filee, P. Siguier, M. Chandler, Insertion Sequence Diversity in Archaea. Microbiol. 551 Mol. Biol. Rev. 71, 121–157 (2007). 552 43. N. Soler, M. Gaudin, E. Marguet, P. Forterre, Plasmids, viruses and virus-like 553 membrane vesicles from Thermococcales. Biochem. Soc. Trans. (2011), 554 doi:10.1042/bst0390036. 555 44. M. Krupovic, M. Gonnet, W. Ben Hania, P. Forterre, G. Erauso, Insights into 556 Dynamics of Mobile Genetic Elements in Hyperthermophilic Environments from Five 557 New Thermococcus Plasmids. PLoS One (2013), doi:10.1371/journal.pone.0049044. 558 45. P. Forterre, A new fusion hypothesis for the origin of Eukarya: Better than previous 559 ones, but probably also wrong. Res. Microbiol. (2011), 560 doi:10.1016/j.resmic.2010.10.005. 561 46. R. Sorek, C. M. Lawrence, B. Wiedenheft, CRISPR-Mediated Adaptive Immune 562 Systems in Bacteria and Archaea. Annu. Rev. Biochem. (2013), doi:10.1146/annurev- 563 biochem-072911-172315. 564 47. K. S. Makarova, Y. I. Wolf, E. V. Koonin, Comparative genomics of defense systems 565 in archaea and bacteria. Nucleic Acids Res. (2013), doi:10.1093/nar/gkt157. 566 48. E. V. Koonin, T. G. Senkevich, V. V. Dolja, The ancient virus world and evolution of 567 cells. Biol. Direct (2006), doi:10.1186/1745-6150-1-29. 568 49. D. Prangishvili, P. Forterre, R. A. Garrett, Viruses of the Archaea: A unifying view. 569 Nat. Rev. Microbiol. (2006), , doi:10.1038/nrmicro1527. 570 50. P. Forterre, The origin of viruses and their possible roles in major evolutionary 571 transitions. Virus Res. (2006), , doi:10.1016/j.virusres.2006.01.010. 572 51. R. Catchpole, P. Forterre, Positively twisted: The complex evolutionary history of 573 Reverse Gyrase suggests a non-hyperthermophilic Last Universal Common Ancestor. 574 bioRxiv (2019), doi:10.1101/524215. 575 52. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. 576 Lipman, Gapped BLAST and PSI-BLAST: A new generation of protein database 577 search programs. Nucleic Acids Res. (1997), , doi:10.1093/nar/25.17.3389. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

578 53. J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, J. Walichiewicz, 579 Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 580 (2005), doi:10.1159/000084979. 581 54. K. Katoh, D. M. Standley, MAFFT multiple sequence alignment software version 7: 582 improvements in performance and usability. Mol. Biol. Evol. (2013), 583 doi:10.1093/molbev/mst010. 584 55. L. A. Kelley, S. Mezulis, C. M. Yates, M. N. Wass, M. J. E. Sternberg, The Phyre2 585 web portal for protein modeling, prediction and analysis. Nat. Protoc. (2015), 586 doi:10.1038/nprot.2015.053. 587 56. E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C. 588 Meng, T. E. Ferrin, UCSF Chimera - A visualization system for exploratory research 589 and analysis. J. Comput. Chem. (2004), doi:10.1002/jcc.20084. 590 57. V. Miele, S. Penel, L. Duret, Ultra-fast sequence clustering from similarity networks 591 with SiLiX. BMC Bioinformatics. 12 (2011), doi:10.1186/1471-2105-12-116. 592 58. L. T. Nguyen, H. A. Schmidt, A. Von Haeseler, B. Q. Minh, IQ-TREE: A fast and 593 effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. 594 Biol. Evol. (2015), doi:10.1093/molbev/msu300. 595 59. S. Kalyaanamoorthy, B. Q. Minh, T. K. F. Wong, A. Von Haeseler, L. S. Jermiin, 596 ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 597 (2017), doi:10.1038/nmeth.4285. 598 60. H. C. Wang, B. Q. Minh, E. Susko, A. J. Roger, Modeling Site Heterogeneity with 599 Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic 600 Estimation. Syst. Biol. (2018), doi:10.1093/sysbio/syx068. 601 61. S. Guindon, J. F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, O. Gascuel, New 602 algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the 603 performance of PhyML 3.0. Syst. Biol. (2010), doi:10.1093/sysbio/syq010. 604 62. B. Q. Minh, M. A. T. Nguyen, A. Von Haeseler, Ultrafast approximation for 605 phylogenetic bootstrap. Mol. Biol. Evol. (2013), doi:10.1093/molbev/mst024. 606 63. I. Letunic, P. Bork, Interactive Tree Of Life (iTOL): An online tool for phylogenetic 607 tree display and annotation. Bioinformatics (2007), doi:10.1093/bioinformatics/btl529. 608 609 Acknowledgments 610 General: This work used the computational and storage services (TARS cluster) provided by 611 the IT department at Institut Pasteur, Paris. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

612 Funding: This work was supported by the European Research Council (ERC) grant from the 613 European Union’s Seventh Framework Program (FP/2007-2013)/Project EVOMOBIL-ERC 614 Grant Agreement no. 340440. 615 Author contributions: A.C.W. and P.F. designed the study. A.C.W., J.G., and V.D.C. 616 performed bioinformatics experiments. A.C.W., J.G., M.G., V.D.C., and P.F. analyzed and 617 interpreted the results. A.C.W., M.G., V.D.C. and P.F. wrote the manuscript. 618 Competing interests: The authors declare that they have no competing interests. 619 Data and materials availability: All data needed to evaluate the conclusions in the paper are 620 present in the paper and/or the Supplementary Materials. Additional data related to this paper 621 may be requested from the authors. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figures and Tables

MCP pATPase Domain NCLDV DJR FtsK/HerA family Eukarya Polintonviruses DJR FtsK/HerA family Eukarya Lavidaviridae DJR FtsK/HerA family Eukarya Adenoviridae DJR ABC family Eukarya STIV DJR FtsK/HerA family Bacteria/Archaea Bam35 DJR FtsK/HerA family Bacteria PRD1 DJR FtsK/HerA family Bacteria Toil DJR FtsK/HerA family Bacteria Cellulomonas DJR FtsK/HerA family Bacteria PM2 DJR FtsK/HerA family Bacteria FLiP DJR - Bacteria Sphaerolipoviridae SJR FtsK/HerA family Bacteria/Archaea

Table 1. The PRD1-adenovirus lineage. This table summarizes the shared genes between different families of this lineage. Abbreviations: DJR, double jelly-roll; SJR: single jelly-roll; NCLDV, Nucleo-Cytoplasmic Large DNA Viruses; STIV, Sulfolobus turreted icosahedral virus; FLiP, Flavobacterium-infecting bacteriophage; MCP, major capsid protein; pATPase, packaging ATPase.

bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

AB Tree scale: 0.1 Tree scale: 0.1

STIV 100 84.2 100 PM2 78 92.4 98 STIV 97.1 STIV 99 92 99 91 97.1 PM2 86 PM2 / STIV 94.5 79.3 100 98.1 Toil

100 100 Cellulomonas 90 69.7 Toi l 100 81 99 PRD1 94 100 85.9 99.9 PRD1 100 90 Bam35 89.8 99.9 100 100 Cellulomonas 98 100 88.9 100 Polintoviruses 100 96.3 Bam35 99 Lavidaviridae 100 91 96.4 99.2 Lavidaviridae 94.5 100 Asfarviridae 99 Poxviridae 99.7 93 99 93 88 92 76.1 90 72 100 82.4 Phycodnaviridae / Mimiviridae 100 Polintoviruses 99 Asfarviridae 93 100 Poxviridae 99.9 98 100 Marseilleviridae 73.5 81.1 99.4 71.1 100 Ascoviridae 99.6 100 96 doviridae / Ascoviridae 100 98.3 85.1 95.6 100 89 99.2 Iridoviridae / Marseilleviridae 87.7 92 Phycodnaviridae / Mimiviridae 96.5 Fig. 1. Maximum likelihood (ML) single-protein trees of the two hallmark proteins of the viruses from the PRD1-adenovirus lineage. ML phylogenetic trees of (A) the major capsid protein (MCP) and (B) packaging ATPase (pATPase). The root of the ML phylogenetic tree was between the prokaryotic and eukaryotic members. The scale-bar indicates the average number of substitutions per site. Values on top and below branches represent support calculated by ultrafast bootstrap approximation (UFBoot; 1,000 replicates) and SH-like approximate likelihood ratio test (aLRT; 1,000 replicates), respectively. Only values superior to 70 are shown. The best-fit model for the MCP tree was LG + F + R4, which was chosen according to Bayesian Information Criterion (BIC) and the alignment has 103 sequences with 237 positions. The best-fit model for the pATPase tree was LG + R6, which was chosen according to Bayesian Information Criterion (BIC) and he alignment has 103 sequences with 171 positions. More detailed versions of the trees are shown in fig. S7 and S8. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Phycisphaerae bacterium SM23_30 0.99 Nitrospirae bacterium GWB2_47_37 Tree scale: 1 0.99 Magnetospirillum marisnigri 1.00 Magnetospirillum magneticum I

0.91 Magnetospirillum sp. ME-1 r 0.73 Pseudoalteromonas virus PM2 e

0.77 Burkholderia ubonensis t Kiloniella spongiae s

0.97 1.00 u Thioclava dalianensis l

Concatenated 0.88 Autolykiviridae_1.249.A c

Autolykiviridae_1.008.O 1.00 0.69 Autolykiviridae_1.020.O DJR bacterial DJR 1.00 Autolykiviridae_1.044.O Polintoviruses 0.91 Autolykiviridae_1.080.O Rhodococcus phage Toil 0.96 Streptomyces noursei ATCC 11455 0.99 0.77 Gluconobacter phage GC1 Poxviridae 1.00 Enterobacteria phage PR4 1.00 Enterobacteria phage PRD1 0.90 Enterobacteria phage PR3

0.90 I

Sulfobacillus thermosulfidooxidans I

1.00 Sulfobacillus acidophilus TPY r

STIV Brevibacillus sp. CF1 12 e 0.89 t Bacillus flexus s 0.92 Bacillus pumilus 1.00 u 0.99 Bacillus licheniformis l c PM2 0.99 Bacillus phage Bam35c 0.77 1.00 Bacillus phage GIL16c 0.81 Bacillus sp. AFS075034 DJR bacterial 1.00 Bacillus phage Wip1 Cellulomonas 0.77 Bacillus virus AP50 Sulfolobus turreted icosahedral virus 1 1.00 Sulfolobus turreted icosahedral virus 2 0.54 Candidatus Nitrososphaera evergladensis SR1 0.66 Candidatus Nitrososphaera gargensis Ga9.2 Bam35 0.89 Pyrobaculum oguniense TE7 Archaeal virus DEEP AND17J

0.99 1.00 Methanococcus voltae A3 r

Archaeoglobus veneficus e t 0.95 Toil 1.00 Archaeal virus DEEP JUN17B s 1.00

Archaeal virus DEEP JUN17A u l 1.00 Methanolacinia petrolearia c 0.99 Methanofollis ethanolicus PRD1 0.91 Thermococcus barophilus

Thermococcus kodakarensis archaeal DJR 1.00 0.89 Palaeococcus ferrophilus 1.00 Thermococcus guaymasensis 0.97 Thermococcus nautili Phycodnaviridae 1.00 pTN3 Polinton_2_NV

1.00 Polinton_1_DY Polinton_1_TC 0.95 Mimiviridae Polinton_2_DR 0.94 1.00 Polinton_1_DR 0.86 Polinton_1_SP 1.00 Polinton_2_SP Mimiviridae-related Cafeteriavirus-dependent mavirus 1.00 Yellowstone Lake virophage 7 1.00 Organic Lake virophage 0.95 Yellowstone Lake virophage 6 Mollivirus 0.91 Qinghai Lake virophage 0.75 Dishui lake virophage 1 0.69 Yellowstone Lake virophage 5 0.79 Sputnik virophage 1.00 Zamilon virus Iridoviridae II Melanoplus sanguinipes entomopoxvirus 0.92 1.00 Anomala cuprea entomopoxvirus 0.82 Choristoneura biennis entomopoxvirus 0.91 1.00 Amsacta moorei entomopoxvirus Iridoviridae I virus

1.00 Bovine papular stomatitis virus s

0.73 Variola virus e

0.51 Myxoma virus s

Kaumoebavirus u

Ascoviridae r 0.88 1.00 African swine fever virus i 0.64 Faustovirus v Trichoplusia ni ascovirus 2c o y

1.00 DJR Marseilleviridae Spodoptera frugiperda ascovirus 1a r

Invertebrate iridescent virus 6 a 1.00 0.97 Invertebrate iridovirus 25 k 0.97 Lausannevirus u 0.97 1.00 Marseillevirus marseillevirus e Asfarviridae 0.83 Melbournevirus 0.86 Frog virus 3 0.99 Red seabream iridovirus 0.88 0.99 Lymphocystis disease virus 1 Lavidaviridae Organic Lake phycodnavirus 1 1.00 Phaeocystis globosa virus 0.91 Hokovirus HKV1 0.91 Klosneuvirus KNV1 0.99 Acanthamoeba polyphaga moumouvirus 1.00 Acanthamoeba polyphaga mimivirus 0.84 1.00 Acanthamoeba castellanii mamavirus Emiliania huxleyi virus 202 0.76 Mollivirus sibericum 0.76 Heterosigma akashiwo virus 01 Acanthocystis turfacea Chlorella virus Can0610SP 0.85 1.00 Paramecium bursaria Chlorella virus NE-JV -1 0.95 Yellowstone lake phycodnavirus 2 1.00 Yellowstone lake phycodnavirus 3 1.00 Ostreococcus tauri virus OtV5 0.79 Micromonas pusilla virus 12T 0.53 Bathycoccus sp. RCC1 105 virus BpV2

Fig. 2. Maximum likelihood (ML) phylogenetic tree of the concatenated MCP and pATPase genes. The ML phylogenetic tree was rooted between the prokaryotic and eukaryotic members. The scale-bar indicates the average number of substitutions per site. Values on the branches represent transfer bootstrap expectation (TBE) support calculated by nonparametric bootstrap (1,000 replicates). The best-fit model was LG + F + R6, which was chosen according to Bayesian Information Criterion (BIC). The alignment has 107 sequences with 408 columns. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Tree scale: 1 Rhodococcus phage Toil 1.00 Streptomyces noursei ATCC 11455 0.75 Gluconobacter phage GC1 Concatenated 1.00 Enterobacteria phage PR4 1.00 Enterobacteria phage PRD1 STIV 0.44 0.91 Enterobacteria phage PR3 Sulfobacillus thermosulfidooxidans PM2 1.00 Sulfobacillus acidophilus TPY Brevibacillus sp. CF112 0.95 Cellulomonas Bacillus flexus 1.00 0.89 Bacillus pumilus Bam35 1.00 Bacillus licheniformis 1.00 Bacillus phage Bam35c Toil 1.00 Bacillus phage GIL16c 0.77

Bacillus sp. AFS075034 DJR bacterial cluster II PRD1 1.00 Bacillus phage Wip1 0.64 Bacillus virus AP50 Phycisphaerae bacterium SM23_30 0.59 Nitrospirae bacterium GWB2_47_37 1.00 Magnetospirillum marisnigri 1.00 Magnetospirillum magneticum 0.98 Magnetospirillum sp. ME-1 0.83 Pseudoalteromonas virus PM2 0.82 Burkholderia ubonensis Kiloniella spongiae 0.99 1.00 Thioclava dalianensis 0.99 Autolykiviridae_1.249.A Autolykiviridae_1.008.O 1.00 0.96 Autolykiviridae_1.020.O 1.00 Autolykiviridae_1.044.O 0.85 Autolykiviridae_1.080.O DJR bacterial cluster I cluster bacterial DJR Sulfolobus turreted icosahedral virus 1 1.00 Sulfolobus turreted icosahedral virus 2 Crenarchaea Candidatus Nitrososphaera evergladensis SR1 0.97 Thaumarchaea 0.67 Candidatus Nitrososphaera gargensis Ga9.2 Pyrobaculum oguniense TE7 0.97 Crenarchaea Archaeal virus DEEP AND17J 1.00 0.97 Methanococcus voltae A3 Archaeoglobus veneficus 0.93 1.00 Archaeal virus DEEP JUN17B 1.00 Archaeal virus DEEP JUN17A 0.95 Methanolacinia petrolearia Euryarchaea 0.99 Methanofollis ethanolicus 0.95 Thermococcus barophilus Thermococcus kodakarensis 1.00 0.88 Palaeococcus ferrophilus DJR ar cluster haeal 1.00 Thermococcus guaymasensis 0.96 Thermococcus nautili 1.00 pTN3 Fig. 3. Maximum likelihood (ML) phylogenetic tree of the concatenated MCP and pATPase genes of bacterioviruses and archaeoviruses. The ML phylogenetic tree was rooted with the DJR bacterial cluster II. The scale-bar indicates the average number of substitutions per site. Values on the branches represent transfer bootstrap expectation (TBE) support calculated by nonparametric bootstrap (1,000 replicates). The best-fit model was LG + R4, which was chosen according to Bayesian Information Criterion (BIC). The alignment has 49 sequences with 468 columns. bioRxiv preprint doi: https://doi.org/10.1101/741942; this version posted August 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

24

Fig. 4. Hypothetical scenarios for the evolution of viruses of the PRD1-adenovirus lineage. (A) “Prokaryotic” ancestor hypothesis. Depending on the rooting, different possible topologies of the double jelly-roll viruses of this lineage are depicted (see fig. S10, S11, S12 and S13). In any case, double jelly-roll (DJR) eukaryoviruses originated from viruses, which were closely related to DJR bacterial cluster. (B) The enigma of the PRD1-adenovirus lineage. The presence of viruses from the PRD1-adenovirus lineage across the three domains of life suggests that these viruses could be very ancient. In this scenario, modern DJR archaeoviruses are more closely related to DJR bacterioviruses. Ancient viruses which could infect archaea were lost after the divergence of Archaea and Eukarya. After that, modern DJR archaeoviruses were being transferred from DJR bacterioviruses before the diversification of Archaea. Another possibility is some of the viruses which are closely related to DJR eukaryoviruses have yet to be found.