bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Charting the diversity of

2 Uncultured of Archaea

3 and Bacteria

4 Coutinho, FH1*; Rodríguez-Valera F1

5 Author affiliations 6 1 – Evolutionary Genomics Group, Departamento de Produccíon Vegetal y Microbiología, 7 Universidad Miguel Hernández, Campus San Juan, San Juan, Alicante 03550, Spain 8 * Address correspondence to: [email protected]

9 Abstract 10 Viruses of Archaea and Bacteria are among the most abundant and diverse 11 biological entities on Earth. Unraveling their biodiversity has been challenging due to 12 methodological limitations. Recent advances in culture-independent techniques, such as 13 metagenomics, shed light on viral dark matter, revealing thousands of new viral genomes 14 at an unprecedented scale. However, these novel genomes have not been properly 15 classified and the evolutionary associations between them were not resolved. Here, we 16 performed phylogenomic analysis of nearly 200,000 viral genomic sequences to establish 17 GL-UVAB: Genomic Lineages of Uncultured Viruses of Archaea and Bacteria. GL-UVAB 18 yielded a 44-fold increase in the amount of classified genomes. The pan-genome content 19 of the identified lineages revealed their infection strategies, potential to modulate host 20 physiology and mechanisms to escape resistance systems. Furthermore, using GL-UVAB 21 for annotating metagenomes from multiple ecosystems revealed elusive habitat 22 distribution patterns of viral communities. These findings expand the understanding of the 23 diversity, evolution and ecology of viruses of prokaryotes.

24 Introduction

1 1 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

25 Grasping the biodiversity of viruses of Bacteria and Archaea has been a major 26 challenge within the field of virology. Limitations for viral cultivation and purification 27 associated with the absence of universal marker genes have been major drawbacks in the 28 effort to chart and classify the biodiversity of these viruses[1,2]. The taxonomic 29 classification system established for viruses of Bacteria and Archaea was originally based 30 on morphological traits, but genetic studies demonstrated that the major taxa established 31 through this approach are not monophyletic[3–5]. Thus, viral classification and taxonomy 32 has come to rely heavily on comparative genomics. This shift has led the International 33 Committee for the Taxonomy of Viruses (ICTV) to call for a scalable genome-based 34 classification systems that can also be applied to uncultured viruses for which no 35 phenotypic data is available[6]. 36 Phylogenomic trees and genomic similarity networks incorporate full genomic data 37 for comparison and clustering of viral genomes. Both phylogenomic and network based 38 approaches have showed promising results for reconstructing phylogenies and classifying 39 and identifying novel viral taxa[1,5,7]. These approaches circumvent the biases and 40 limitations associated with morphological data or the use of phylogenetic markers, and are 41 easily scalable to thousands of genomes[5,8]. Network methods rely on the identification 42 of orthologous groups shared among genomes, which can be problematic for viruses due 43 to the ratio in which their genes evolve. Additionally, the evolutionary associations among 44 genome clusters identified by network approaches are not explicitly resolved by these 45 methods[5,9]. Meanwhile, phylogenomic approaches provide trees in which the 46 associations among genomes are easily interpretable under an evolutionary perspective. 47 For these reasons, phylogenomic methods have been the standard approach for 48 reconstructing phylogenies of prokaryotic viruses[1,4,7,8,10–14]. Previous studies have 49 leveraged this method to investigate the genetic diversity of cultured viruses, but 50 overlooked the uncultured diversity. 51 Thousands of novel viral genomes were recently discovered through culture- 52 independent approaches, such as shotgun metagenomics, fosmid libraries, single- 53 sequencing and prophage mining[4,11,15–18]. These new datasets unraveled an 54 extensive biodiversity that had been overlooked by culture-based approaches. These 55 genomes have the potential to fill many of the gaps in our understanding of the diversity of 56 viruses of prokaryotes. Yet, achieving this goal requires that these genomes are properly 57 organized in a robust evolutionary framework. Here, we applied a phylogenomic approach

2 2 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

58 to chart the diversity of uncultured viruses of Bacteria and Archaea aiming to gain insights 59 on their genetic diversity, evolution and ecology.

60 Results 61 Phylogenomic reconstruction 62 A database of genomic sequences of prokaryotic viruses was compiled with isolated 63 dsDNA viruses from RefSeq and uncultured viruses that were discovered across multiple 64 ecosystems using approaches that bypassed culturing. This database amounted to 65 195,698 viral genomic sequences along with associated information of computational host 66 predictions and ecosystem source (Table S1). Following pre-filtering and redundancy 67 removal steps, a subset of 16,484 genomic sequences was selected for phylogenomic 68 reconstruction (Methods and Figure S1). An all-versus-all comparison of the protein 69 sequences encoded in this dataset was performed and used to calculate Dice distances 70 between genomes. Essentially, the Dice distances between a pair of genomes decreases 71 the more proteins that are shared between them and the higher their degree of identity. 72 Finally, the obtained matrix of Dice distances was used to construct a phylogenomic tree 73 through neighbor-joining (Figure 1). In addition, a benchmarking dataset containing a 74 subset of 2,069 dsDNA prokaryotic viral genomes from NCBI RefSeq was analyzed in 75 parallel for comparison and validation of results. The robustness of the tree topology was 76 evaluated through a sub-sampling approach. Nodes displayed high confidence values 77 (average 50% recovery), and 90% of all nodes were recovered at least once among the 78 re-sampled trees. These figures were obtained when reducing the data used to calculate 79 distances to approximately 64% of the amount used to establish the original tree, 80 demonstrating that tree topology is robust even in the presence of incomplete or 81 fragmented genomes, which might be the case for some of the uncultured viral genomes 82 used.

83 Clustering uncultured prokaryotic viruses into closely related lineages 84 A three-step approach was applied to categorize diversity into hierarchical levels of 85 increasing genomic relatedness: Level-1 (average Dice distances between genomes equal 86 or below 0.925, and number of representatives equal or above 50), Level-2 (average Dice 87 distances between genomes equal or below 0.825, and number of representatives equal 88 or above 10) and Level-3 (average Dice distances between genomes equal or below 0.5, 89 and number of representatives equal or above 3). These cutoffs were selected to establish

3 3 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

90 lineages at a degree of genomic similarity within the range observed for families, sub- 91 families and genera established by the ICTV (Table S4). At the first level, 10,357 genomic 92 sequences were assigned to 72 lineages (Figure 1). Of these, 68 of the Level-1 lineages 93 have at least one complete viral genome assigned to them. In addition, 50 of these Level-1 94 lineages had at least one reference viral genome assigned to them. At the second level, 95 10,356 sequences were assigned to 314 lineages, while at the third level, 6,971 96 sequences were assigned to 942 lineages. This three-level classification system was used 97 to establish the GL-UVAB. Apart from 6 cases, all of the Level-1 lineages are composed of 98 genomes assigned to a single taxonomic family as defined by the ICTV. Out of the 942 99 Level-3 lineages, 104 of them included genomes classified at the level of genus by the 100 ICTV. Of these, 74 lineages are consistent regarding the genus, meaning that all classified 101 genomes are assigned to the same genera. Meanwhile, only 6 genera were split among 102 more than one Level-3 lineage. Thus, the identified lineages of uncultured viruses are in 103 agreement with the ICTV established taxonomy. 104 Genome sequences that were not included in the phylogenomic reconstruction 105 were assigned to the lineage of their closest relatives as determined by the average amino 106 acid identity and percentage of shared genes. Following this lineage expansion step, a 107 total of 100,907 sequences were classified to at least one level (Table S1), which 108 represents a 44-fold improvement in the proportion of classified sequences compared to 109 the amount of RefSeq prokaryotic viral genomes classified by the NCBI taxonomy 110 database. The dataset that contributed most (~60%) of the classified sequences was from 111 a cross-ecosystem global analysis of metagenomes[16], followed by metagenomes from 112 the Tara and Malaspina expeditions [17] (~15%), global marine viromes[15] (~10%) and 113 prophages identified in bacterial genomes[18] (~10%) (Figure S3).

114 Targeted hosts and ecosystem sources of GL-UVAB lineages 115 GL-UVAB lineages differed regarding host prevalence (Figure 2A). Out of the 72 116 Level-1 lineages, 34 are predicted to infect a single host phylum, most often 117 Proteobacteria, Firmicutes or Actinobacteria, while 35 lineages are predicted to infect two 118 or more phyla. Level-3 lineages display the highest levels of host consistency. Among 119 Level-3 lineages with at least one annotated host, 99% of them are predicted to infect a 120 single phylum and 76% are predicted to infect a single genus. Lineages also differed 121 regarding the ecosystem sources from where their members were obtained (Figure 2B). 122 Nearly all lineages contained members obtained from multiple ecosystems but aquatic and

4 4 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

123 human-associated samples were consistently the main sources of genomes, highlighting 124 diversity within these habitats and the potential for discoveries in other under-explored 125 ecosystems. Trends of host and ecosystem prevalence observed for the expanded 126 lineages (Figure S4) were consistent with those obtained from the analysis of the original 127 dataset, corroborating the validity of these patterns. Applying the same lineage 128 identification criteria to the benchmarking dataset of RefSeq viral genomes identified 12 129 Level-1 lineages, 49 Level-2 lineages and 162 Level-3 lineages (Figure S2). Among these 130 Level-3 lineages, 145 (91%) are composed of genomes that infect within the same host 131 order. This observation further corroborates the validity of the associations between GL- 132 UVAB lineages and targeted hosts.

133 GL-UVAB lineages differ in habitat distribution and pan-genome content 134 The observed differences in host preference and ecosystem source among 135 lineages led us to investigate the applicability of GL-UVAB as a reference database for 136 deriving abundance profiles from metagenomes. We analyzed the abundances of 72 GL- 137 UVAB Level-1 lineages across metagenomes from marine, freshwater, soil and human gut 138 samples (Figure 3). Lineages 28, 21, and 36 were the most abundant in marine samples, 139 in agreement with the high prevalence of Cyanobacteria and Proteobacteria as hosts of 140 these lineages (Figure 2A). Meanwhile, the lineages 64 (which mostly infects 141 Proteobacteria of classes Gamma and Delta), 23 (Proteobacteria and Bacteroidetes), and 142 21 where the dominant groups in freshwater habitats. In temperate soil samples, the most 143 abundant lineages were 14 (Actinobacteria), 38 (Beta- and Gamma- Proteobacteria) and 144 62 (Gammaproteobacteria). Finally, human gut samples were dominated by lineages 64, 145 30 (Firmicutes and Bacteroidetes) and 10 (Firmicutes and Actinobacteria). 146 We inspected the pan-genome of the identified lineages by clustering their protein 147 encoding genes into orthologous groups (OGs). A total of 79,281 OGs containing at least 148 three proteins were identified. These OGs displayed a sparse distribution, i.e., were only 149 detected in a small fraction of genomes within lineages (Table S3). The most conserved 150 OGs encoded functions associated with nucleic acid metabolism and viral particle 151 assembly. Few OGs encoded putative auxiliary metabolic genes (AMGs), and those where 152 never shared by all the members of a lineage. A total of 1,755 promiscuous OGs, present 153 in the pan-genome of three or more Level-1 lineages were identified.

154 Discussion

5 5 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

155 Only a small fraction of prokaryotic viruses can be cultivated through currently 156 available laboratory techniques. This limitation has left many gaps in our understanding of 157 their biodiversity. The results presented here help to bridge these gaps by leveraging on a 158 large dataset of viral genomic sequences obtained without cultivation from multiple 159 ecosystems. Categorizing this new diversity into a robust evolutionary framework brings us 160 closer to properly grasping the extension of viral biodiversity. 161 The average dice distance cutoffs used for defining lineages were chosen to 162 classify as many genomes as possible while maintaining cohesiveness within lineages 163 regarding similarity between genomes, targeted hosts and taxonomic classification as 164 defined by the ICTV. These goals were achieved, as the GL-UVAB lineages are formed by 165 groups of closely related genomes (Table S4), which was reflected in their targeted hosts 166 (Figure 2A), pan-genome content (Table S2) and taxonomic classification (Table S5). 167 Using different cutoffs for lineage identification would have resulted in the distinct lineages 168 and consequently different patterns of host and source prevalence, pan-genome content, 169 and habitat distribution. Nevertheless, GL-UVAB was conceived to be an evolving system. 170 We encourage researchers to adapt the GL-UVAB approach to suit the needs of the 171 specific questions under investigation. For example, performing species-level clustering 172 genomes would require average Dice distance cutoffs even lower than those used to 173 delineate Level-3 lineages. 174 The re-sampling approach showed that the tree topology was robust even in the 175 presence of fragmented genomes. Still, incomplete genomes are more likely to be 176 misclassified by our pipeline. Yet, in the absence of complete genomes, these fragments 177 are the best available option for obtaining at least a provisional classification of uncultured 178 viruses. As the quality and amount of uncultured viral genomes discovered increases, new 179 data can be used to update and improve the GL-UVAB. 180 The consistency of the targeted hosts among lineages identified with our 181 phylogenomic approach suggests that the assignment to GL-UVAB lineages provides a 182 rough estimate of the hosts of uncultured viruses. This is of fundamental importance, 183 considering the growing diversity of viral genomes discovered from metagenomic datasets 184 for which no host information is initially available[19,20]. Host prevalence analysis 185 indicated that approximately half of the Level-1 lineages are capable of infecting more than 186 a single host phylum (Figure 2A). The ability to interact with the molecular machinery of 187 the host is a major driver of the evolution of prokaryotic viruses. Thus, closely related 188 genomes (that belong to the same lineages) likely have undergone similar evolutionary

6 6 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

189 pressures that ensure host infectivity, leading to the observed pattern of higher host 190 consistency among the lowest level of hierarchical classification, i.e., Level-3 lineages. 191 Meanwhile, the ability of some lineages to infect across multiple host phyla is likely an 192 indication of the high level of genomic plasticity of viruses that allows them to evolve to 193 infect new organisms that are not closely related to their original hosts. 194 Metagenomic studies of viral ecology strive to elucidate the habitat distribution 195 patterns of taxa across ecosystems. The abundance patterns observed for the GL-UVAB 196 lineages (Figure 3) are a reflection of their distinctive trends of host prevalence (Figure 197 2A). As expected, the GL-UVAB lineages that dominated at each ecosystem often targeted 198 taxa that are the most abundant at these habitats[21,22], e.g., lineages that target 199 Proteobacteria and Cyanobacteria at aquatic samples and lineages that target 200 Bacteroidetes and Firmicutes in the human gut. Although this observation might seem 201 obvious, it does not emerge when using cultured viral genomes for the taxonomic 202 annotation of metagenomes. Instead, the same taxa are often observed with similar 203 abundance patterns regardless of the ecosystem sampled. This occurs because 204 established taxa have no discernible host or ecosystem preferences and because much of 205 viral diversity is not encompassed by viral taxonomy[14,23,24]. Thus, the cohesiveness of 206 GL-UVAB lineages regarding phylogeny, host preference and ecology allows for 207 meaningful habitat-taxa associations to be observed. 208 A detailed investigation of the pan-genome content of the Level-1 lineage 21, 209 revealed some of the strategies applied by these viruses during infection. This lineage was 210 among the dominant group in both freshwater and marine samples and infects 211 Cyanobacteria and Proteobacteria. The pan-genome of lineage 21 includes OGs encoding 212 high-light inducible proteins, photosystem II D1 proteins and a transaldolase. These 213 proteins are involved in photosynthesis and carbon fixation pathways[25]. Therefore, the 214 success of this group across aquatic ecosystems might be linked to their capacity to use 215 such proteins as AMGs to modulate the metabolism of their Cyanobacterial hosts during 216 infection, redirecting it to the synthesis of building blocks to be used for the assembly of 217 novel viral particles[25]. 218 The promiscuous distribution observed for multiple OGs could be the result of the 219 positive selection of these genes following events of horizontal gene transfer. Indeed, 220 promiscuous OGs often encoded proteins that might confer advantages during infection. 221 Five of them encode thymidylate synthase, a protein involved in nucleotide synthesis. 222 Meanwhile, three promiscuous OGs encode the PhoH protein, which mediates

7 7 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

223 phosphorus acquisition in nutrient deprived conditions. These findings suggest a selective 224 pressure favoring the acquisition of genes that allow viruses to modulate host metabolism 225 towards the production of nucleic acids to be used for the synthesis of progeny DNA[25]. 226 Nevertheless, the sparse distribution of these and other auxiliary metabolic genes across 227 lineages suggest that such genes have been recently acquired by the members of each 228 lineage through horizontal gene transfer (HGT), rather than being present in the common 229 ancestor of the group. Multiple methylases were identified among promiscuous OGs. 230 Viruses use these proteins to protect their DNA from host restriction modification 231 systems[26]. Prokaryotes can acquire restriction modification systems through HGT[27], 232 and our data suggest that viruses also benefit from HGT by acquiring novel methylases 233 that allow them to escape these systems. Finally, lysins (e.g., peptidases and amidases) 234 were a common function among promiscuous OGs. This finding is surprising because 235 lysins are believed to be fine-tuned for the specific structure of host cell wall[28,29]. 236 Acquisition of novel lysins might help viruses to expand their host spectra or as a 237 mechanism to ensure infectivity following the emergence of resistance mutations that lead 238 to alterations in the structure of the host cell wall. 239 In conclusion, by analyzing thousands of uncultured viral genomes we were able to 240 categorize the diversity of these biological entities. This was achieved by identifying 241 lineages of uncultured viruses through a robust and scalable phylogenomic approach. 242 Analyzing host and source prevalence, pan-genome content and abundance in 243 metagenomes painted a more accurate picture of viral biodiversity across ecosystems. As 244 these uncultured viruses are isolated and their morphology and host spectra are 245 elucidated, they will be properly integrated into the ICTV classification system. Here we 246 provided an initial framework for their taxonomic classification as well as insights regarding 247 their ecology and genetic diversity. Future studies will continue to shed light on the viral 248 dark matter across our planet’s many ecosystems. Our work provides the initial steps for a 249 genome-based classification of these yet undiscovered evolutionary lineages, providing a 250 solid framework to investigate the biology of prokaryotic viruses.

251 Methods 252 Viral genome database 253 The NCBI RefSeq dataset was used as a starting set of reference viral genome 254 sequences. Host information for these sequences was retrieved from GenBank files, and 255 their taxonomic classification was obtained both from the NCBI Taxonomy database and

8 8 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

256 from the ICTV [30]. Additionally, genomic sequences were compiled from studies that used 257 high throughput approaches to obtain viral genomes through culture-independent analysis, 258 along with their associated host and ecosystem source information whenever available. 259 These genomes were obtained from: environmental metagenomes and 260 metaviromes[3,12,15–17,31,32], fosmid libraries of Mediterranean viruses[4,11], single 261 virus genomes[33] and prophages integrated into prokaryotic genomes[18]. This dataset 262 (henceforth referred to as Vir_DB_Nuc) contained a total of 195,698 viral genome 263 sequences (Table S1 and Supplementary file 1). Protein encoding genes (PEGs) were 264 predicted from Vir_DB_Nuc using the metagenomic mode of Prodigal[34], which identified 265 4,332,223 protein sequences (henceforth referred to as Vir_DB_Prot, Supplementary file 266 2). The Vir_DB_Prot dataset was queried against the NCBI-nr protein database using 267 Diamond[35] for taxonomic and functional annotation.

268 Sequence pre-filtering 269 Identifying viral sequences within metagenomic and metaviromic datasets can be 270 problematic. Because each study used different strategies to achieve that goal we pre- 271 filtered genome sequences from Vir_DB_Nuc to ensure that only bona fide prokaryotic 272 viral genomes were included in downstream analyses. First, the Vir_DB_Prot dataset was 273 queried against the prokaryotic virus orthologous groups (pVOGs)[36] protein database 274 using diamond[35] (more sensitive mode, BLOSUM45 matrix, identity ≥ 30%, bitscore ≥ 275 50, alignment length ≥ 30 amino acids and e-value ≤ 0.01). For each sequence, the 276 percentage of proteins mapped to the pVOGs database and the added viral quotient 277 (AVQ) were calculated (as the sum of the individual viral quotient of the best hit of each 278 protein). Hits to pVOGs that presented homology with proteins from Eukaryotic viruses 279 were not considered. Sequences with 20% or more of the proteins mapped to the pVOGs 280 database and with an AVQ equal to or greater than 5 were classified as bona fide 281 prokaryotic viral genomes. This initial round of recruitment yielded 26,610 genomic 282 sequences (Vir_DB_Nuc_R1). Next, proteins from the Vir_DB_Nuc_R1 dataset were used 283 as bait for a second recruitment round. The remaining protein sequences (which were not 284 recruited in the first round) were queried against Vir_DB_Nuc_R1 through diamond as 285 described above. Genomic sequences from which at least 20% of the derived proteins 286 mapped to a single genome from Vir_DB_Nuc_R1, yielding a minimum of three hits, were 287 recruited to Vir_DB_Nuc_R2 (78,295 genomic sequences). Finally, a last step of manual 288 curation was performed, which recruited mostly long contigs with high AVQ that did not

9 9 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

289 match the percentage criteria of the automatic recruiting steps due to their high number of 290 encoded proteins. This step recruited a total of 6,421 genomic sequences 291 (Vir_DB_Nuc_R3). 292 We benchmarked the accuracy of the automatic recruiting steps with two datasets. 293 First, a subset of Vir_DB_Nuc comprised of all the reference viral genomes was run 294 through the recruitment pipeline using the same criteria described above. None of the 295 7,036 eukaryotic viruses were recruited by the pipeline (i.e., 100% precision) and 2,136 296 out of 2,297 prokaryotic viruses were correctly recruited (i.e., 92.99% recall). We also 297 benchmarked the filtering pipeline with a dataset of 897 Gbp of genome sequence data 298 derived from the NCBI RefSeq prokaryote genomes spanning 880 genera from 35 phyla. 299 Sequences were split into fragments of 5, 10, 15, 20, 25, 50 and 100 Kbp to mimic 300 metagenomic scaffolds. Using the filtering criteria described above and the subsequent 301 length filtering for sequences longer than 30 Kbp would recruit only 109 sequences 302 (0.36%), all of which displayed homology to the prophage sequences described by Roux 303 et al. [18].

304 Phylogenomic reconstruction 305 Phylogenomic reconstruction was performed using a subset of genomes from 306 Vir_DB_Nuc that included all dsDNA RefSeq viral genomes annotated as complete for 307 which the host Domain was either Bacteria or Archaea and the uncultured bona fide 308 prokaryotic viruses from Vir_DB_Nuc_R3 with a length equal or greater than 30 Kbp or 309 annotated as complete and with a length equal or greater than 10 Kbp. These criteria were 310 established to ensure any incomplete genomes fragments encompassed a significant 311 fraction of the complete genome (considering the 47 Kbp median size of complete 312 genomes from RefSeq viruses of Bacteria and Archaea). Genome sequences were 313 clustered with CD-HIT[37] using a cut-off of 95% nucleotide identity and minimum 50% 314 coverage of the shorter sequence to remove redundant sequences. The non-redundant 315 dataset contained 16,484 viral genomic sequences that were used for phylogenomic 316 reconstruction (Vir_DB_Phy). Distances between genomes were calculated based on a 317 modified version of the Dice method[4]. First, an all-versus-all comparison of the PEGs 318 derived from the Vir_DB_Phy dataset was performed through diamond[35] (more sensitive 319 mode, identity ≥ 30%, bitscore ≥ 30, alignment length ≥ 30 amino acids and e-value ≤

320 0.01). Next, distances between genome sequences were calculated as follows: D AB = 1 - 321 (2*(AB) / (AA+BB)), where AB is the bitscore sum of all the valid hits of genome A against

10 10 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

322 genome B, while AA and BB are the bitscore sum of all the valid hits of genome A against 323 itself and of all the valid hits of genome B against itself, respectively. The more 324 homologous proteins are shared between genomes A and B, and the higher the degree of

325 identity between these homologous proteins, the closer to zero the value of DAB will be. 326 Non homologous proteins should produce no hits when comparing genomes A against B, 327 but will hit with themselves when comparing A against A and B against B. Therefore, when

328 estimating DAB, non homologous proteins are penalized, increasing the value of DAB. The 329 obtained Dice distances matrix was used as input to build phylogenetic tree through 330 Neighbor-Joining algorithm[38] implemented in the Phangorn package of R. The obtained 331 tree was midpoint rooted.

332 Tree topology validation by re-sampling 333 A re-sampling approach was applied to test the consistency of the tree topology. 334 First, 20% of the proteins encoded in the genomes used to build the tree were randomly 335 selected. Then, distances between genomes were re-calculated after excluding any hits 336 from the all-versus-all search in which either the query or subject sequences were selected 337 for exclusion, which removes approximately 36% of all of the original hits. Finally, the 338 obtained distance matrix was used to construct a new tree. This process was repeated 339 over one hundred iterations. Next, we measured the frequency in which the nodes from 340 the original tree were present in the re-sampled trees. 341 342 Lineage identification 343 Lineage identification was performed by parsing the phylogenetic tree to identify 344 monophyletic clades that matched the established criteria for maximum average Dice 345 distances between genomes, and for a minimum number of representatives. Lineages 346 were identified in three steps, aimed at capturing diversity into levels of increasing 347 genomic relatedness: Level-1 (average Dice distances between genomes equal or below 348 0.925, and number of representatives equal or above 50), Level-2 (average Dice distances 349 between genomes equal or below 0.825, and number of representatives equal or above 350 10) and Level-3 (average Dice distances between genomes equal or below 0.5, and 351 number of representatives equal or above 3). These cutoffs were selected based on the 352 degree of genomic similarity observed for taxa of prokaryotic viruses described by the 353 ICTV and NCBI Taxonomy databases (Table S4), with the goal of classifying as many 354 sequences as possible while maximizing consistency within clades regarding pan-genome

11 11 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

355 content and agreement with the currently accepted viral taxonomy. To trace the pan- 356 genomes of the identified lineages the proteins derived from 16,484 genomes used for 357 phylogenomic reconstruction were clustered into orthologous groups using the orthoMCL 358 algorithm[39] implemented in the Get_Homologues pipeline[40]. The MCL inflation factor 359 was set to 1 and all other parameters were set to default. 360 361 Lineage expansion by closest relative identification 362 Sequences that did not pass the initial length and redundancy filters to be included 363 in the phylogenomic tree were assigned to the lineages of their closest relatives. Closest 364 relatives were defined as the genome with highest percentage of matched protein 365 encoding genes (PEGs) as detected by Diamond searches. Potential ties were resolved by 366 choosing the closest relative with the highest average amino acid identity (AAI) value. A 367 minimum AAI of 50% and the percentage of matched PEGs of 50% was required for 368 closest relative assignments.

369 Lineage benchmarking and validation with RefSeq prokaryotic viruses 370 We sought to validate our strategy for lineage identification. To that end a dataset of 371 2,069 genome sequences of dsDNA viruses of Archaea and Bacteria from the NCBI 372 RefSeq database was used for benchmarking. The steps for distance calculation, tree 373 construction and lineage identification were performed exactly as described for the full 374 dataset. Next, we calculated statistics of distances between taxa established by ICTV and 375 NCBI (Table S4).

376 Lineage abundance in metaviromes and metagenomes 377 The abundances of Vir_DB_Nuc sequences were estimated in viral metagenomes 378 (viromes) from the following ecosystems: marine epipelagic samples[41], healthy human 379 gut[42], freshwater lakes[43], and because no large scale viromes of mesophilic soils were 380 available we used cellular metagenomes from this ecosystem[44,45]. Sequencing reads 381 from these metagenomes and metaviromes were retrieved from the European Nucleotide 382 Archive or NCBI Short Read Archive. Subsets of twenty million R1 reads from each 383 sample were mapped to Vir_DB_Nuc using Bowtie2[46] using the sensitive-local alignment 384 mode. Lineage abundances across samples were calculated by summing the relative 385 abundances of individual genomes according to their assigned lineages.

12 12 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

386 Acknowledgements: 387 FHC and FRV were supported by the grants “VIREVO” CGL2016-76273-P 388 [AEI/FEDER, EU], (co-funded with FEDER funds); Acciones de dinamización “REDES DE 389 EXCELENCIA” CONSOLIDER- CGL2015-71523-REDC from the Spanish Ministério de 390 Economía, Industria y Competitividad and PROMETEO II/2014/012 “AQUAMET” from 391 Generalitat Valenciana. FHC was supported by APOSTD/2018/186 Post-doctoral 392 fellowship from Generalitat Valenciana.

393 Author contributions: 394 FHC conceived, designed and performed experiments and analyzed the data. FHC and 395 FRV wrote the manuscript.

396 Figure legends: 397 Figure 1. Phylogenomic reconstruction of 16,484 viral genomic sequences reveals major 398 lineages of uncultured prokaryotic viruses. The tree was built through Neighbor-Joining 399 based on Dice distances calculated between viral genomic sequences from both NCBI 400 RefSeq and those reconstructed from metagenomes, fosmid libraries, single virus 401 genomes and prophages integrated into prokaryote genomes. Tree was midpoint rooted. 402 Branch lengths were omitted to better display tree topology. Each of the 72 Level-1 GL- 403 UVAB lineages are highlighted by black colored branches and with their defining nodes in 404 indicated by blue dots. Numeric identifiers for the lineages are displayed in the innermost 405 ring within gray strips. The outermost ring depicts the ICTV family level classification 406 assignments of RefSeq viral genomes that were included in the tree. Color codes for ICTV 407 families are as follows: (Green), (Red), (Yellow), 408 (Blue), (orange), Sphaerolipoviridae (Cyan), Rudiviridae 409 (Purple) and Tectiviridae (Pink). Fore reference, representatives of isolated viral genomes 410 are shown with their ICTV genus level classification shown in parentheses.

411 Figure 2. Prevalence of targeted host and ecosystem sources among Level-1 GL-UVAB 412 lineages assigned through phylogenomic reconstruction. A) Frequency of infected host 413 phyla across each of the 72 identified lineages. B) Frequency of ecosystem sources from 414 which viral sequences were obtained across each of the 72 identified lineages. For clarity, 415 only hosts and ecosystems with prevalence equal or above 1% are shown. Numbers in 416 parentheses indicate the total number of genomes assigned to each lineage.

13 13 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

417 Figure 3. Abundance patterns of GL-UVAB Level-1 lineages across habitats. Bar plots 418 display the average and standard errors of the relative abundances of GL-UVAB Level-1 419 lineages across metagenomes and metaviromes from marine, freshwater, human gut and 420 soil ecosystems.

421 Supplementary Figure legends: 422 Figure S1. Flowchart summarizing the methodology used to establish GL-UVAB. The 423 initial dataset of genomic sequences consisted of the NCBI RefSeq and viral genomic 424 sequences obtained through culturing independent approaches adding up to 195,698 425 genomic sequences from which 4,332,223 protein encoding genes were identified. After 426 the initial filtering, 16,484 sequences were selected for phylogenomic reconstruction. Dice 427 distances were calculated between this set and the resulting distance matrix was used for 428 phylogenomic reconstruction through Neighbour-Joining. The obtained tree was used to 429 identify lineages at three levels, based on average Dice distances within nodes: Level-1 430 (average Dice distances between genomes equal or below 0.925, and number of 431 representatives equal or above 50), Level-2 (average Dice distances between genomes 432 equal or below 0.825, and number of representatives equal or above 10) and Level-3 433 (average Dice distances between genomes equal or below 0.5, and number of 434 representatives equal or above 3). Lineage abundances were estimated in metagenomic 435 datasets by read mapping. Lineage pan-genomes were determined by identifying clusters 436 of orthologous genes. Finally, sequences that were not included in the original tree were 437 assigned to the lineages by closest relative identification (CRI). Closest relatives were 438 determined based on percentage of matched genes and average amino acid identity and 439 setting a minimum value of 50% for both these variables.

440 Figure S2. Phylogenomic reconstruction of 16,484 viral genomic sequences. The tree was 441 built through Neighbor-Joining based on Dice distances calculated between viral genomic 442 sequences from both NCBI RefSeq and those reconstructed from metagenomes, fosmid 443 libraries and prophages integrated into prokaryote genomes. The tree was midpoint 444 rooted. Nodes were collapsed if all the leaves belonged to the same Level-1 lineage to 445 better display higher-order associations between lineages.

14 14 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

446 Figure S3. Phylogenomic reconstruction of 2,069 genomes of dsDNA viruses of Archaea 447 and Bacteria from RefSeq. The tree was built through Neighbor-Joining based on Dice 448 distances calculated between complete prokaryotic viral genomes from NCBI RefSeq. The 449 tree was midpoint rooted. Branches are colored according to their family level taxonomic 450 classification, the inner ring displays classification of genomes into subfamilies and the 451 outer ring displays classifications into lineages identified for the benchmarking dataset.

452 Figure S4. Dataset sources of GL-UVAB genomes. Bar plots depict the abundance of 453 sequences from each original publication that made up the full dataset (Total), those that 454 were recruited during the pre-filtering step (Recruited) and those that were classified into 455 GL-UVAB lineages (Classified).

456 Figure S5. Prevalence of targeted host and ecosystem sources among Level-1 GL-UVAB 457 lineages assigned through phylogenomic reconstruction and lineage expansion by closest 458 relative identification. A) Frequency of infected host phyla across each of the 72 identified 459 lineages. B) Frequency of ecosystem sources from which viral sequences were obtained 460 across each of the 72 identified lineages. For clarity, only hosts and ecosystems with 461 prevalence within a lineage equal or above 1% are shown. Numbers in parentheses 462 indicate the total number of genomes assigned to each lineage.

463 Supplementary files: 464 File S1. Multifasta file containing the 195,698 viral genome sequences from Vir_DB_Nuc 465 analyzed in this study.

466 File S2. Multifasta file containing the 4,332,223 protein encoding gene sequences 467 predicted from Vir_DB_Nuc.

468 File S3. Newick format phylogenomic tree used to define GL-UVAB lineages. The tree was 469 constructed with the Neighbor-Joining algorithm based on Dice distances calculated 470 between 16,484 genomic sequences.

471 Table S1. Table containing detailed information of all the genomic sequences from 472 Vir_DB_Nuc analyzed in this study, including sequence identifier, NCBI access number,

15 15 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

473 original dataset, sequence length, number of identified PEGs, taxonomic classification 474 (NCBI and ICTV), ecosystem source and host taxonomic classification and affiliation to the 475 identified lineages at three hierarchical levels.

476 Table S2. Table describing the prevalence of OGs across GL-UVAB Level-1 lineages with 477 associated taxonomic and functional annotation. Only OGs detected in at least 3 members 478 of a lineage are shown.

479 Table S3. Statistics of the Dice distances observed between genomes of the taxa 480 established by NCBI/ICTV and the lineages established by GL-UVAB.

481 Table S4. Prevalence of targeted hosts, ecosystem and dataset sources, taxonomic 482 classification and completeness of genomes among the three levels of GL-UVAB lineages.

16 16 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

483 References

484 1. Rohwer F, Edwards R. The Phage Proteomic Tree : a Genome-Based Taxonomy for 485 Phage. J Bacteriol. 2002;184: 4529–4535. doi:10.1128/JB.184.16.4529 486 2. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, Lewis JA, et al. 487 Origins of highly mosaic mycobacteriophage genomes. Cell. 2003;113: 171–182. 488 doi:10.1016/S0092-8674(03)00233-2 489 3. Nishimura Y, Watai H, Honda T, Mihara T, Omae K, Roux S, et al. Environmental 490 Viral Genomes Shed New Light on Virus-Host Interactions in the Ocean. mSphere. 491 2017;2: 2:e00359-16. Available: https://doi.org/10.1128/mSphere.00359-16 492 4. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. Expanding the marine 493 virosphere using metagenomics. PLoS Genet. 2013;9: e1003987. 494 doi:10.1371/journal.pgen.1003987 495 5. Bolduc B, Jang H Bin, Doulcier G, You Z-Q, Roux S, Sullivan MB. vConTACT: an 496 iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. 497 PeerJ. 2017;5: e3243. doi:10.7717/peerj.3243 498 6. Simmonds P, Adams MJ, Benkő M, Breitbart M, Brister JR, Carstens EB, et al. 499 Consensus statement: Virus taxonomy in the age of metagenomics. Nat Rev 500 Microbiol. 2017; doi:10.1038/nrmicro.2016.177 501 7. Nishimura Y, Yoshida T, Kuronishi M, Uehara H, Ogata H, Goto S. ViPTree: the viral 502 proteomic tree server. Bioinformatics. 2017;33: 1–2. 503 doi:10.1093/bioinformatics/btx157 504 8. Meier-Kolthoff JP, Göker M. VICTOR: genome-based phylogeny and classification of 505 prokaryotic viruses. Bioinformatics. 2017;33: 3396–3404. 506 doi:10.1093/bioinformatics/btx440 507 9. Lima-Mendez G, Van Helden J, Toussaint A, Leplae R. Reticulate representation of 508 evolutionary and functional relationships between phage genomes. Mol Biol Evol. 509 2008;25: 762–777. doi:10.1093/molbev/msn023 510 10. Coutinho FH, Silveira CB, Gregoracci GB, Thompson CC, Edwards RA, Brussaard 511 CPD, et al. Marine viruses discovered via metagenomics shed light on viral 512 strategies throughout the oceans. Nat Commun. 2017;8. doi:10.1038/ncomms15955 513 11. Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. Genomes of 514 Abundant and Widespread Viruses from the Deep Ocean. MBio. 2016;7: e00805-16. 515 doi:10.1128/mBio.00805-16 516 12. López-Pérez M, Haro-Moreno JM, Gonzalez-Serrano R, Parras-Moltó M, Rodriguez- 517 Valera F. Genome diversity of marine phages recovered from Mediterranean

17 17 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

518 metagenomes: Size matters. PLoS Genet. Public Library of Science; 2017;13: 519 e1007018. 520 13. Adriaenssens EM, Edwards R, Nash JHE, Mahadevan P, Seto D, Ackermann HW, et 521 al. Integration of genomic and proteomic analyses in the classification of the 522 Siphoviridae family. Virology. Elsevier; 2015;477: 144–154. 523 doi:10.1016/j.virol.2014.10.016 524 14. Bellas CM, Anesio AM, Barker G. Analysis of virus genomes from glacial 525 environments reveals novel virus groups with unusual host interactions. Front 526 Microbiol. 2015;6: 1–14. doi:10.3389/fmicb.2015.00656 527 15. Coutinho FH, Silveira CB, Gregoracci GB, Edwards RA, Brussaard CPD, Dutilh BE, 528 et al. Marine viruses discovered through metagenomics shed light on viral strategies 529 throughout the oceans. Nat Commun. Nature Publishing Group; 2017;8: 1–12. 530 doi:10.1038/ncomms15955 531 16. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, 532 Mikhailova N, et al. Uncovering Earth’s virome. Nature. Nature Publishing Group; 533 2016;536: 425–430. doi:10.1038/nature19094 534 17. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics 535 and biogeochemical impacts of uncultivated globally abundant ocean viruses. 536 Nature. Nature Publishing Group; 2016;537: 589–693. doi:10.1101/053090 537 18. Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus-host 538 interactions resolved from publicly available microbial genomes. Elife. eLife 539 Sciences Publications Limited; 2015;4: e08490. doi:10.7554/eLife.08490 540 19. Aylward FO, Boeuf D, Mende DR, Wood-Charlson EM, Vislova A, Eppley JM, et al. 541 Diel cycling and long-term persistence of viruses in the ocean’s euphotic zone. Proc 542 Natl Acad Sci. 2017; 201714821. doi:10.1073/pnas.1714821114 543 20. Luo E, Aylward FO, Mende DR, Delong EF. Bacteriophage Distributions and 544 Temporal Variability in the Ocean’s Interior. MBio. 2017;8: 1–13. 545 21. Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, et al. 546 Structure and function of the global ocean microbiome. Science (80- ). 2015;348: 1– 547 10. doi:10.1126/science.1261359 548 22. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al. 549 Enterotypes of the human gut microbiome. Nature. 2011;473: 174–180. doi:Doi 550 10.1038/Nature09944 551 23. Aguirre de Cárcer D, López-Bueno A, Pearce DA, Alcamí A. Biodiversity and 552 distribution of polar freshwater DNA viruses. Sci Adv. 2015;1: e1400127. 553 doi:10.1126/sciadv.1400127

18 18 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

554 24. Zablocki O, van Zyl L, Adriaenssens EM, Rubagotti E, Tuffin M, Cary SC raig, et al. 555 High-level diversity of tailed phages, eukaryote-associated viruses, and - 556 like elements in the metaviromes of antarctic soils. Appl Environ Microbiol. 2014;80: 557 6888–6897. doi:10.1128/AEM.01525-14 558 25. Thompson LR, Zeng Q, Kelly L, Huang KH, Singer AU, Stubbe J, et al. Phage 559 auxiliary metabolic genes and the redirection of cyanobacterial host carbon 560 metabolism. Proc Natl Acad Sci. 2011;108: E757–E764. 561 doi:10.1073/pnas.1102164108 562 26. Samson JE, Magadán AH, Sabri M, Moineau S. Revenge of the phages: defeating 563 bacterial defences. Nat Rev Microbiol. Nature Publishing Group; 2013;11: 675–687. 564 doi:10.1038/nrmicro3096 565 27. Oliveira PH, Touchon M, Rocha EPC. The interplay of restriction-modification 566 systems with mobile genetic elements and their prokaryotic hosts. Nucleic Acids 567 Res. 2014;42: 10618–10631. doi:10.1093/nar/gku734 568 28. Oliveira H, Melo LDR, Santos SB, Nobrega FL, Ferreira EC, Cerca N, et al. 569 Molecular Aspects and Comparative Genomics of Bacteriophage Endolysins. J Virol. 570 2013;87: 4558–4570. doi:10.1128/JVI.03277-12 571 29. Fernández-Ruiz I, Coutinho FH, Rodriguez-Valera F. Thousands of Novel Endolysins 572 Discovered in Uncultured Phage Genomes. Front Microbiol. 2018;9: 1033. 573 doi:10.3389/fmicb.2018.01033 574 30. Krupovic M, Dutilh BE, Adriaenssens EM, Wittmann J, Vogensen FK, Sullivan MB, et 575 al. Taxonomy of prokaryotic viruses: update from the ICTV bacterial and archaeal 576 viruses subcommittee. Arch Virol. Springer Vienna; 2016;161: 1095–1099. 577 doi:10.1007/s00705-015-2728-0 578 31. Philosof A, Yutin N, Sharon I, Koonin E V, Shmona K. Novel Abundant Oceanic 579 Viruses of Uncultured Marine Group II Euryarchaeota Identified by Genome-Centric 580 Metagenomics. 2017; 581 32. Vik DR, Roux S, Brum JR, Bolduc B, Joanne B, Padilla CC, et al. Putative archaeal 582 viruses from the mesopelagic ocean. PeerJ. 2017; 1–26. doi:10.7717/peerj.3428 583 33. Martinez-Hernandez F, Fornas O, Lluesma Gomez M, Bolduc B, de la Cruz Peña 584 MJ, Martínez JM, et al. Single-virus genomics reveals hidden cosmopolitan and 585 abundant viruses. Nat Commun. 2017;8: 15892. doi:10.1038/ncomms15892 586 34. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: 587 prokaryotic gene recognition and translation initiation site identification. BMC 588 Bioinformatics. 2010;11: 119. doi:10.1186/1471-2105-11-119 589 35. Buchfink B, Xie C, Huson DH. Fast and Sensitive Protein Alignment using 590 DIAMOND. Nat Methods. 2015;12: 59–60. doi:10.1038/nmeth.3176

19 19 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

591 36. Grazziotin AL, Koonin E V., Kristensen DM. Prokaryotic Virus Orthologous Groups 592 (pVOGs): A resource for comparative genomics and protein family annotation. 593 Nucleic Acids Res. 2017;45: D491–D498. doi:10.1093/nar/gkw975 594 37. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: Accelerated for clustering the next- 595 generation sequencing data. Bioinformatics. 2012;28: 3150–3152. 596 doi:10.1093/bioinformatics/bts565 597 38. Saitou N NM. The neighbor-joining method: a new method for reconstructing 598 phylogenetic trees. Mol Biol Evol. 1987;4: 406–425. 599 39. Li L, Stoeckert Jr. CJ, Roos DS. OrthoMCL: identification of ortholog groups for 600 eukaryotic genomes. Genome Res. 2003;13: 2178–2189. doi:10.1101/gr.1224503 601 40. Contreras-Moreira B, Vinuesa P. GET_HOMOLOGUES, a versatile software 602 package for scalable and robust microbial pangenome analysis. Appl Environ 603 Microbiol. 2013;79: 7696–7701. doi:10.1128/AEM.02411-13 604 41. Brum JR, Ignacio-Espinoza JC, Roux S, Doulcier G, Acinas SG, Alberti A, et al. 605 Patterns and ecological drivers of ocean viral communities. Science. 2015;348: 606 1261498. doi:10.1126/science.1261498 607 42. Minot S, Bryson A. Rapid evolution of the human gut virome. Proc …. 2013;110: 608 12450–12455. 609 doi:10.1073/pnas.1300833110/-/DCSupplemental.www.pnas.org/cgi/doi/10.1073/pna 610 s.1300833110 611 43. Roux S, Chan LK, Egan R, Malmstrom RR, McMahon KD, Sullivan MB. 612 Ecogenomics of and their giant virus hosts assessed through time series 613 metagenomics. Nat Commun. Springer US; 2017;8. doi:10.1038/s41467-017-01086- 614 2 615 44. Rascovan N, Carbonetto B, Revale S, Reinert MD, Alvarez R, Godeas AM, et al. The 616 PAMPA datasets: a metagenomic survey of microbial communities in Argentinean 617 pampean soils. Microbiome. 2013;1: 21. doi:10.1186/2049-2618-1-21 618 45. Luo C, Rodriguez-R LM, Johnston ER, Wu L, Cheng L, Xue K, et al. Soil microbial 619 community responses to a decade of warming as revealed by comparative 620 metagenomics. Appl Environ Microbiol. 2014;80: 1777–1786. 621 doi:10.1128/AEM.03712-13 622 46. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat 623 Methods. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights 624 Reserved.; 2012;9: 357–9. doi:10.1038/nmeth.1923

20 20 Sulfolobus islandicus filamentous virus (Betalipothrixvirus)

Sulfolobus islandicus rod-shaped virus 2 (Rudivirus) Sulfolobus spindle-shaped virus 5 (Alphafusellovirus)

Halorubrum virus HRPV2 (Alphapleolipovirus)

Lactococcus virus bIL170 (Sk1virus)

Shigella virus PSf2 (T1virus)

Burkholderia virus phiE202 (P2virus)

Gordonia virus Hotorobo (Woesvirus)

Mycobacterium virus SWU1 (L5virus)

Mycobacterium virus Donovan (Fishburnevirus)

Gordonia virus OneUp (Smoothievirus)

Propionibacterium virus PAD20 (Pa6virus) Streptomyces virus Sujidade (Likavirus) Escherichia virus JenP2 (Nonagvirus) Mycobacterium virus Oline (Pg1virus)

11

12 Salmonella virus PRD1 () Caulobacter virus Swift (Phicbkvirus) Escherichia virus SSL2009aVibrio (Hk578virus) virus VC8 (Vp5virus) 13 10 14

15 9 8 16

7 6

Escherichia virus PA2 (Tl2011virus) 5 17 Salmonella virus 118970sal2 (T5virus) 4 18 Bacillus virus Andromeda (Andromedavirus) Campylobacter virus CP220 (Cp220virus) Escherichia virus Lambda (Lambdavirus) 19 3 20 2 Salmonella virus Maynard (Vi1virus) 1 Escherichia virus Ime09 (T4virus)

21

0 22 bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 23

24

25

71 Bacillus virus Staley (Slashvirus)

Salmonella virus Epsilon15 (Epsilon15virus) 26 Mycobacterium virus Bxz1 (Bxz1virus) 70

27

69 28 68

29 67

30 Listeria virus P100 (P100virus) Escherichia virus Septima11 (Phieco32virus) Bacillus virus Bp8pC (Agatevirus) Burkholderia virus Bcep22 (Bcep22virus) 66 31

65 32

64 33 63 34

Pseudomonas virus DL60 (Pbunavirus) Burkholderia virus BcepNY3 (Bcep78virus) 62 35 Mycobacterium virus Konstantine (Barnyardvirus) 61 60 59

36

58 Paenibacillus virus Sitara (Sitaravirus) 37

Vibrio virus VHML (Vhmlvirus) Klebsiella virus K11 (Kp32virus) Pseudomonas virus gh1 (T7virus) 38

39 57 40 Burkholderia virus phi1026b (E125virus) 41 56 Escherichia virus JES2013 (V5virus) 42 Pseudomonas virus PAKP1 (Pakpunavirus) Escherichia virus IME11 (G7cvirus) Lactobacillus virus Ld3 (C5virus)

43

55 44 54 45 46 47

48

49 53 Enterococcus virus FL1 (Phifelvirus) 52

51

50

Listeria virus LP37 (P70virus) Bacillus virus B103 (Phi29virus) Cellulophaga virus Cba181 (Cba181virus) Bacillus virus Pascal (Pagevirus) bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

100

75

Actinobacteria

Bacteroidetes

Crenarchaeota

Cyanobacteria 50 Euryarchaeota

Frequency Firmicutes

Fusobacteria

Proteobacteria

Verrucomicrobia

25

0

Lineage_1Lineage_2 (191)Lineage_3 (419) (868) Lineage_6Lineage_7 (329)Lineage_8 (101) (100) Lineage_0 (3575) Lineage_4Lineage_5 (7345) (2550) Lineage_9 (1306) Lineage_13Lineage_14 (237)Lineage_15 (229)Lineage_16 (412) (345)Lineage_18 (591)Lineage_20 (678) Lineage_23 (884)Lineage_26Lineage_27 (143) (399)Lineage_29Lineage_30 (737) (910) Lineage_37Lineage_38 (110)Lineage_39 (559)Lineage_40 (146)Lineage_41 (428) (551)Lineage_43Lineage_44 (358)Lineage_45 (894)Lineage_46 (183)Lineage_47 (123)Lineage_48 (406)Lineage_49 (709) (459)Lineage_52Lineage_53 (255) (653)Lineage_56Lineage_57 (398) (303)Lineage_59Lineage_60 (777)Lineage_61 (482) (231)Lineage_63Lineage_64 (129)Lineage_65 (361)Lineage_66 (239)Lineage_67 (290)Lineage_68 (655)Lineage_69 (797) (833) Lineage_10Lineage_12 (2185) (2408) Lineage_17 (2770)Lineage_19 (1197)Lineage_21Lineage_22 (4917) (1647)Lineage_24 (2869) Lineage_28 (2105) Lineage_31Lineage_32 (1308)Lineage_33 (1659)Lineage_34 (1383)Lineage_35 (1275)Lineage_36 (1666) (2752) Lineage_42 (1085) Lineage_51 (1565) Lineage_55 (4484) Lineage_58 (1141) Lineage_62 (1666) Lineage_70Lineage_71 (2336) (1157) Lineage bioRxiv preprint doi: https://doi.org/10.1101/480491; this version posted November 29, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

100

75

Air

Aquatic

Bioremediation

Biotransformation

Human

Invertebrates 50 Lab enrichment Frequency Mammals

Plants

Solid waste

Terrestrial

Wastewater

25

0

Lineage_1Lineage_2 (191)Lineage_3 (419) (868) Lineage_6Lineage_7 (329)Lineage_8 (101) (100) Lineage_0 (3575) Lineage_4Lineage_5 (7345) (2550) Lineage_9 (1306)Lineage_11 (412)Lineage_13Lineage_14 (237)Lineage_15 (229)Lineage_16 (412) (345)Lineage_18 (591)Lineage_20 (678) Lineage_23 (884)Lineage_25Lineage_26 (775)Lineage_27 (143) (399)Lineage_29Lineage_30 (737) (910) Lineage_37Lineage_38 (110)Lineage_39 (559)Lineage_40 (146)Lineage_41 (428) (551)Lineage_43Lineage_44 (358)Lineage_45 (894)Lineage_46 (183)Lineage_47 (123)Lineage_48 (406)Lineage_49 (709) (459) Lineage_52Lineage_53 (255)Lineage_54 (653) (726)Lineage_56Lineage_57 (398) (303)Lineage_59Lineage_60 (777)Lineage_61 (482) (231)Lineage_63Lineage_64 (129)Lineage_65 (361)Lineage_66 (239)Lineage_67 (290)Lineage_68 (655)Lineage_69 (797) (833) Lineage_10 (2185)Lineage_12 (2408) Lineage_17 (2770)Lineage_19 (1197)Lineage_21Lineage_22 (4917) (1647)Lineage_24 (2869) Lineage_28 (2105) Lineage_31Lineage_32 (1308)Lineage_33 (1659)Lineage_34 (1383)Lineage_35 (1275)Lineage_36 (1666) (2752) Lineage_42 (1085) Lineage_50Lineage_51 (2094) (1565) Lineage_55 (4484) Lineage_58 (1141) Lineage_62 (1666) Lineage_70Lineage_71 (2336) (1157) Lineage

Freshwater Gut Marine Soil

Lineage_71

Lineage_70

Lineage_69

Lineage_68

Lineage_67

Lineage_66

Lineage_65

Lineage_64

Lineage_63

Lineage_62

Lineage_61

Lineage_60

Lineage_59

Lineage_58

Lineage_57

Lineage_56

Lineage_55

Lineage_54

Lineage_53

Lineage_52

Lineage_51

Lineage_50

Lineage_49

Lineage_48

Lineage_47

Lineage_46

Lineage_45

Lineage_44

Lineage_43

Lineage_42

Lineage_41

Lineage_40

Lineage_39

Lineage_38

Lineage_37

Lineage_36

Lineage_35

Lineage_34 Lineage_ID

Lineage_33

Lineage_32

Lineage_31

Lineage_30

Lineage_29 Lineage_28

. Lineage_27

The copyright holder for this preprint (which

Lineage_26

Lineage_25

Lineage_24

Lineage_23

Lineage_22

Lineage_21

Lineage_20

Lineage_19

Lineage_18

CC-BY-NC-ND 4.0 International license Lineage_17

this version posted November 29, 2018. Lineage_16

;

Lineage_15

Lineage_14 Lineage_13

available under a Lineage_12

Lineage_11

Lineage_10

Lineage_9

Lineage_8

https://doi.org/10.1101/480491 Lineage_7

doi: Lineage_6

Lineage_5

Lineage_4

Lineage_3

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made who has granted bioRxiv a license to display the preprint in perpetuity. It is made was not certified by peer review) is the author/funder, Lineage_2

bioRxiv preprint

Lineage_1

Lineage_0

0 5

0 0

0 3

6 9

10 10

20 10

20 30

40 12 Abundance