<<

1 Viral dark matter and virus–host interactions resolved from 2 publicly available microbial 3 1† 2,3 4 1‡ 4 S Roux , SJ Hallam , T Woyke , MB Sullivan 5 6 Affiliations 7 1 and Evolutionary Biology, University of Arizona, USA 8 2 Department of and Immunology, University of British Columbia, Canada 9 3 Graduate Program in Bioinformatics, University of British Columbia, Canada 10 4 DOE Joint Institute, Walnut Creek, California, USA 11 † Current affiliation: Department of Microbiology, The Ohio State University, USA 12 ‡ Current affiliation: Department of Microbiology, The Ohio State University, USA; Department of 13 Civil, Environmental, and Geodetic Engineering, The Ohio State University, USA 14 15 Corresponding author 16 Correspondence to : MB Sullivan ([email protected] ). 17 18 Manuscript category: 19 Tools and resources 20 21 Major subject areas: 22 Ecology / Genomics & evolutionary biology 23 24 Impact statement: 25 From public microbial genomes, VirSorter revealed 12,498 viral genome sequences that expand the 26 map of the global virosphere and whose analyses improve understanding of viral taxonomy, evolution 27 and virus-host interactions. 28 29 Competing interest statement: 30 The authors declare no competing interest. 31 Abstract 32 33 The ecological importance of viruses is now widely recognized, yet our limited knowledge of viral 34 sequence space and virus–host interactions precludes accurate prediction of their roles and impacts. 35 Here we mined publicly available bacterial and archaeal genomic datasets to identify 12,498 36 high-confidence viral genomes linked to their microbial hosts. These data augment public datasets 10- 37 fold, provide first viral sequences for 13 new bacterial phyla including ecologically abundant phyla, 38 and help taxonomically identify 7—38% of ‘unknown’ sequence space in viromes. Genome- and 39 network-based classification was largely consistent with accepted viral taxonomy and suggested that (i) 40 264 new viral genera were identified (doubling known genera) and (ii) cross-taxon genomic 41 recombination is limited. Further analyses provided empirical data on extrachromosomal prophages 42 and co-infection prevalences, as well as evaluation of in silico virus–host linkage predictions. Together 43 these findings illustrate the value of mining viral signal from microbial genomes. 44 Introduction 45 Over the past two decades, our collective understanding of microbial diversity has been 46 profoundly expanded by cultivation-independent molecular methods (Pace, 1997; Whitman, Coleman 47 & Wiebe, 1998; Rappé & Giovannoni, 2003; DeLong, 2009; Hanson et al., 2012). It is now widely 48 recognized that interconnected microbial communities drive matter and energy transformations in 49 natural and engineered ecosystems (Falkowski, Fenchel & Delong, 2008), while also contributing to 50 health and disease states in multicellular hosts (Clemente et al., 2012). Concomitant with this changing 51 worldview is a growing awareness that viruses modulate microbial interaction networks and long-term 52 evolution with resulting feedbacks on ecosystem functions and services (Suttle, 2007; Rodriguez- 53 Valera et al., 2009; Forterre & Prangishvili, 2013; Hurwitz, Hallam & Sullivan, 2013; Brum et al., 54 2014; Brum & Sullivan, 2015). 55 However, our understanding of viral diversity and virus–host interactions remains a major 56 bottleneck in the development of predictive ecosystem models and unifying eco-evolutionary theories. 57 This is because the lack of a universal marker gene for viruses hinders environmental survey 58 capabilities, while the number of isolate viral genomes in databases remains limited: for comparison, 59 more than 25,000 bacterial and archaeal host genomes are available in NCBI RefSeq (January 2015), 60 whereas only 1,531 of their viruses were entirely sequenced and most (86%) of these derive from only 61 3 of 61 known host phyla (Roux et al., 2015). Thus, although advances in high-throughput sequencing 62 expand the bounds of viral sequence space, these datasets are dominated by uncharacterized sequences 63 (usually 60–95%), termed “viral dark matter” (Youle, Haynes & Rohwer, 2012; Reyes et al., 2012; 64 Mizuno et al., 2013; Brum & Sullivan, 2015). In the absence of closely related isolates, viral genes and 65 genomes remain unlinked to hosts, which greatly limits ecological and evolutionary inferences. 66 Alternatively, viral sequence space can be explored in a known host context by revealing 67 putative viral sequences hidden in microbial genomes. Such signal was first analyzed through 68 annotation of prophages—viral genomes integrated in microbial genomes. Numerous tools exist to 69 automatically detect prophages (Fouts, 2006; Lima-Mendez et al., 2008a; Zhou et al., 2011; Akhter, 70 Aziz & Edwards, 2012), so prophage diversity and abundance are relatively well-studied (Casjens, 71 2003; Canchaya, Fournous & Brüssow, 2004). Early estimations, when only a few hundred bacterial 72 genomes were available, suggested that prophages are common (62% of bacterial genomes tested 73 contained at least one), existing as intact and functional forms or in varying degrees of decay (Casjens, 74 2003). Given that tens of thousands more microbial genomes are now publicly available, it is expected 75 that many new prophages and other viral sequences remain to be discovered. 76 Further, other viral signals might be prevalent in modern microbial genomic datasets. First, certain 77 types of prophage do not integrate into the host genome. These “extrachromosomal prophages” (also 78 termed “plasmid prophage”) exist outside the microbial chromosome until induced to undergo lytic 79 replication. These have been known to occur for decades (e.g., coliphage P1, Sternberg & Austin, 80 1981), though their abundance in nature is unknown. Second, some phages can enter a “chronic” cycle, 81 in which they replicate in the cell outside of the host chromosome, and produce virions that are 82 extruded without killing their host (Abedon, 2009; Rakonjac et al., 2011). Third, a phage “carrier state” 83 has been observed, in which a lytic phage is maintained and multiplied within a cultivated host 84 population without measurable effect on cell growth (Bastías et al., 2010). This phenomenon is thought 85 to arise due to the presence of both resistant and sensitive cells that frequently transition between these 86 two states. Sometimes also termed “partial resistance”, such states that enable the coexistence of phage 87 and host in culture have now been observed in different systems (Vibrio, Escherichia coli, Salmonella, 88 Flavobacterium), and are linked to slight decreases in growth rate or cell concentration but no host cell 89 clearing as would be observed for “typical” lytic viruses (i.e., plaque formation), thus could go 90 unnoticed in a microbial cell culture (Fischer et al., 2004; Carey-Smith et al., 2006; Middelboe et al., 91 2009). All three of these lesser studied types of infection would result in the assembly of viral 92 sequences outside of the main host chromosome in a microbial genome sequencing project, and could 93 be a new type of viral signal in modern microbial genomic datasets due to deep sequencing and public 94 release of draft (i.e., not completely assembled) genomic sequences. 95 Finally, single amplified genome (SAG) datasets, sourced from anonymously sorted, amplified, 96 and sequenced cells, are especially valuable for accessing the vast majority of environmental microbes 97 that remain uncultivated in the lab (Rinke et al., 2013; Kashtan et al., 2014). Single-cell amplified 98 genomes can reveal viral sequences directly linked to uncultivated hosts (Yoon et al., 2011; Roux et al., 99 2014; Labonté et al., 2015). When combined with metagenomic sequences, these data provide 100 information on population dynamics, lineage-specific viral-induced mortality rates, relative ratios of 101 prophages and current lytic infections, as well as putative links between viral infection and host 102 metabolic state (Roux et al., 2014; Labonté et al., 2015). Thus as microbial genomic datasets evolve 103 from complete genomes to fragmented draft and single-cell genomes, new windows into viral diversity 104 and virus–host interactions are opened. 105 Here we applied a recently developed and automated virus discovery pipeline, VirSorter (Roux 106 et al., 2015), to mine the viral signal from 14,977 publicly available bacterial and archaeal genomic 107 datasets. This identified 12,498 high-confidence viral sequences with known hosts, ~10-fold more than 108 in the RefSeqVirus database, that we then used to expand our understanding of viral diversity and 109 virus–host interactions. 110 111 Results & Discussion 112 113 New viruses detected in public microbial genomic datasets with VirSorter 114 VirSorter is designed to predict bacterial and archaeal virus sequences in isolate or single-cell 115 draft genomes, as well as complete genomes (Roux et al., 2015). Briefly, VirSorter identifies viral 116 sequences through (i) statistical enrichment in viral gene content, using a reference database composed 117 of viral genomes of archaeal and bacterial viruses from RefSeq (hereafter named RefSeqABVir for 118 “RefSeq and Viruses”) and assembled from viral metagenomes (database “Viromes” 119 in VirSorter), or (ii) a combination of viral “hallmark” gene(s) that code for virion-related functions 120 such as major capsid proteins or terminases (Koonin, Senkevich & Dolja, 2006; Roux et al., 2014), and 121 at least one viral-like genomic feature: statistical depletion in genes with a hit in the PFAM database, 122 statistical enrichment in uncharacterized genes, short genes, or strand bias (i.e. consecutive genes 123 which tend to be coded on the same strand). 124 Applied to 14,977 publicly available microbial genomes (Figure 1–source data 1), VirSorter 125 identified 12,498 high-confidence viral sequences representing either long genome fragments (>10kb 126 when linear) or complete genomes (contigs detected as circular). These viral sequences were found in 127 5,492 of the microbial genomes (~30%). Simply scanning the identified viruses for novel hosts 128 extended the host range of common viral families to now include several recently described phyla like 129 Caldiserica (formerly known as candidate phylum OP5), Marinimicrobia (SAR406 also known as 130 Marine Group A), or Omnitrophica (OP3), in addition to other understudied groups such as 131 Poribacteria, Nitrospinae, Cloacimonetes (WWE1), and Chloroflexi-type SAR202 (Figure 1, Figure 1– 132 source data 2, Figure 1–source data 3). Uncovering the first viruses infecting these major microbial 133 groups is critical given that many candidate phyla are abundant in understudied ecosystems and play 134 substantial roles in coupled biogeochemical cycling (Wrighton et al., 2012; Wright, Konwar & Hallam, 135 2012; Kamke, Sczyrba & Ivanova, 2013; Castelle et al., 2013; Allers et al., 2013b; Rinke et al., 2013; 136 Emerson et al., 2015). 137 BLAST-based family-level affiliations suggested that 90.45% of these 12,498 sequences 138 correspond to Caudovirales, 6.82% to ssDNA viruses (predominantly Inoviridae family), and 2.73% 139 could not be confidently assigned (Figure 1–figure supplement 1). Among the unassigned group, 7 140 sequences lacked any hit to a viral reference genome. These 7 short (4.1kb) near-identical circular 141 contigs from Bacteroides draft genomes were detected as viral based on sequence similarity with 142 human gut viromes, but contained two genes associated with plasmid replication (Figure 1–figure 143 supplement 2A). This could suggest a plasmid origin, but the high and even coverage of these genomes 144 across several CsCl-purified viromes from different studies(Kim et al., 2011; Minot et al., 2012) 145 suggests that they are derived from encapsidated particles typical of viruses (Figure 1–figure 146 supplement 2B). If confirmed, these sequences would represent the first complete genomes for an 147 entirely new viral order. 148 149 264 new putative viral genera identified through genome-based network clustering 150 To better determine relationships between viral genomes and host range, we next built a 151 network based on shared gene content to quantify genetic relatedness between the 12,498 sequences 152 identified with VirSorter and the 1,240 taxonomically-curated genomes available in RefSeqABVir 153 (Figure 2–figure supplement 1 and see Methods). Despite the absence of a universal marker gene, a 154 long history of organizing viral sequence space through genome-to-genome comparison exists using 155 either gene content (Rohwer & Edwards, 2002; Lima-Mendez et al., 2008b) or nucleotide composition 156 (Sims et al., 2009; Labonté & Suttle, 2013). We used MCL (Markov Cluster Algorithm) based on the 157 number of shared genes between sequence pairs as it had been previously shown to accurately 158 recapitulate taxonomic relationships in the Caudovirales, which dominated our dataset (Enright, Van 159 Dongen & Ouzounis, 2002; Lima-Mendez et al., 2008b). 160 Most (99.3% of 12,498) sequences affiliated to one of 614 virus clusters (VCs), of which 535 161 contained at least one complete genome or large genomic fragment (>30kb), and approximately half 162 (271 of 535 VCs) included RefSeqABVir sequences (Figure 2A, Figure 2–source data 1). Those VCs 163 with RefSeqABVir sequences provided the opportunity to evaluate whether a VC corresponded to any 164 particular taxonomic level of ICTV classification. Of 43 RefSeq-curated viral genera, 27 have all their 165 sequences in the same VC, 12 were spread across two VCs, and 4 were spread across >2 VCs—these 166 latter genera included the Spouna-like viruses (3 VCs), N4-like viruses (4 VCs), Lambda-like viruses 167 (9 VCs), and Inoviruses (11 VCs). Consistent with previous applications of this method, VCs identified 168 in this analysis were thus approximately equivalent to a RefSeq-curated viral genus (Lima-Mendez et 169 al., 2008b). 170 Given this level of taxonomic resolution, and ignoring the 79 VCs that lacked large (>30kb) 171 genome sequences, we identified a total of 264 new candidate viral genera (i.e., 264 VCs with no 172 sequences from RefSeqABVir, Figure 2–source data 1). These 264 candidate genera were derived from 173 both understudied and well-studied hosts (e.g. Gammaproteobacteria and Bacilli, Figure 2B) and 174 included 5 of the 30 highest-membership VCs (Figure 2–source data 1), which confirms that our 175 knowledge of viral diversity is limited even in well-studied hosts and with prevalent viruses. 176 177 VirSorter curated dataset includes extrachromosomal genomes and improves virome affiliation 178 Of the 12,498 sequences, 5,232 were prophages (i.e. a viral genome integrated into a microbial 179 contig) and 1,756 were either complete (circularized) or large (>30kb) genome fragments assembled 180 outside of the host chromosome (i.e. no microbial gene was detected on the contig, Figure 3A, Figure 181 1–source data 1). 182 To estimate how often a large (>30kb) genome fragment could be an integrated prophage and 183 not capture the microbial gene content, we simulated the process for 22 different prophage-containing 184 bacterial genomes “sequenced” (in silico) at coverage of 5, 25, 50, 75 and 100x (see Material & 185 Methods). These analyses suggested that only 2.3% of large (>30kb) prophage-originating contigs 186 lacked any identifiable microbial genes. Thus, these latter 1,756 sequences must largely be 187 extrachromosomal sequences and so represent a unique data source for quantifying the prevalence of 188 under-studied viral infection modes including chronic infections, lytic viruses or extrachromosomal 189 prophages. 190 Although we identified no clear sequence-based marker for the first two infection types, we 191 could conservatively estimate the fraction of extrachromosomal prophages by identifying plasmid 192 partition genes (ParA and ParB, Davis, Martin & Austin, 1992; Saint Girons et al., 2000; see Figure 3– 193 figure supplement 1 for an example of a putative extrachromosomal prophage displaying ParA-ParB 194 genes). These genes were significantly more abundant in the 1,756 circular and large genome 195 fragments than in the rest of the dataset (13% vs 1%, respectively; poisson test p-value < 10-05). Thus, 196 at least 13% of these sequences appear to be bona fide extrachromosomal prophages, whereas the 197 others might be lytic viruses in “carrier” states, chronic infections, or extrachromosomal prophages 198 without detectable ParA/ParB genes. 199 Beyond this glimpse into under-studied viral infection modes, these new references genomes 200 are likely to help improve taxonomic affiliation for the ‘viral dark matter’ in viromes. To quantify this, 201 we added these sequences to the RefSeqABVir database and assigned taxonomy to predicted genes in 202 three large-scale virome datasets available. We found that the VirSorter curated dataset improved 203 affiliation by 32 and 40%, respectively, in the marine POV (Hurwitz & Sullivan, 2013) and TOV 204 (Brum et al., 2015) datasets, and more than doubled the number of affiliated genes in human gut 205 viromes (Minot et al., 2012, Figure 3B). This particularly strong improvement in the human gut virome 206 affiliation is presumably due to enterobacteria being abundant among current publicly available 207 microbial genomes. 208 Finally, both the detection of non-integrated viral genomes and the improved virome affiliation 209 suggest that the VirSorter curated dataset includes not only integrated prophage data, but also viruses 210 actively infecting these microbes (i.e. not integrated in the host chromosome and producing virions) 211 with under-studied infection modes. 212 213 Long-term evolutionary patterns of bacterial and archaeal virus genomes 214 Examination of the VCs network beyond classification revealed additional higher order 215 patterns. First, bacterial and archaeal viruses clustered separately in >99% of VCs; the exception 216 (VC_89) included a single and unique (Garrett et al., 2010) archaeal virus (Hyperthermophilic Archaeal 217 Virus 2, NC_014321) that clustered with 21 bacterial viruses, presumably due to poor archaeal virus 218 representation. Second, >95% of these VCs contained exclusively one nucleic acid type (e.g., DNA or 219 RNA, and dsDNA or ssDNA, Figure 2–figure supplement 1), although RNA viral representation is low 220 because only RefSeq-curated families Cystoviridae and Leviviridae were available (no RNA viruses 221 were detected with VirSorter, presumably because DNA-based datasets were analyzed). The 15 VCs 222 including both ssDNA and dsDNA viral genomes are either associated with archaeal viruses for which 223 composite ssDNA/dsDNA genomes were already described (2 VCs; Sencilo et al., 2012), or more 224 surprisingly with ssDNA Inoviridae, which clustered with Caudovirales in 13 VCs (Figure 2–source 225 data 1). For 9 of these 13 Inoviridae–Caudovirales VCs, some of the sequences were wrongly affiliated 226 due to genes shared by both viral families such as integrases, exonucleases, and replication-associated 227 proteins. Two other VCs corresponded to prophage sequences that include genes similar to Inoviridae 228 and Caudovirales and could actually be two different viruses integrated at the same genome location. 229 However, the 2 remaining mixed VCs (VC_128 and VC_215) include sequences displaying a mix of 230 Caudovirales and Inoviridae genes (VC_215 sequences also included Corticoviridae genes). We posit 231 that these might represent new composite genomes beyond the ones already described for archaea 232 viruses (Sencilo et al., 2012) and the recently discovered RNA–DNA chimeric viruses (Diemer & 233 Stedman, 2012; Roux et al., 2013; Krupovic et al., 2015). 234 We next evaluated the scale and range of viral co-infection, a phenomenon critical to viral 235 genome evolution and thought to blur this vertical gene inheritance signal used to classify genomes into 236 VCs. Indeed, the fact that super-infection of prophage-containing bacteria would provide genomic 237 proximity for gene acquisition via illegitimate recombination was posited more than a decade ago 238 (Mosig, 1998). However, viral co-infection rates remains unconstrained with the only data for natural 239 systems derived from a single large-scale single-cell genomic dataset where ~35% of infected cells 240 contained multiple viruses (Roux et al., 2014). Here, in the 5,492 microbial genomes with detectable 241 viral signal, nearly half (2,445) contained more than one detectable virus (Figure 4). Most (~82%) of 242 these co-infections involved multiple Caudovirales, as previously observed (Casjens, 2003), and likely 243 provides mechanism for viral gene exchange and may be more common in some phages displaying 244 rampant mosaicism (e.g., the Siphoviridae, Hendrix et al., 1999) than others. The second most 245 commonly observed co-infections (9%) occurred between ssDNA Inoviridae and dsDNA Caudovirales 246 (Figure 4). These genomes represented the mixed VCs from the network analyses and putative new 247 composite genomes described above. Mechanistically, Inoviridae might be more prone to such 248 co-infection due to their long infection cycle whereby they extrude their filamentous virions without 249 killing their host (Rakonjac et al., 2011), with a dsDNA replication stage (Salim et al., 2008) that could 250 increase genomic exchanges with co-infecting dsDNA viruses. 251 Together, these findings suggest genome-based network analyses could be used to identify novel 252 viruses, as well as to infer host domain (archaeal or bacterial, >99% accuracy) and nucleic acid type 253 (ssDNA or dsDNA, >95% accuracy). Evolutionarily, we posit that while co-infection by multiple 254 viruses appears common, the consistency of so many VCs with ICTV taxonomy suggests that most 255 phage genomes harbor a largely vertically inherited core gene set as detected for marine T4-like 256 populations (Ignacio-Espinoza & Sullivan, 2012; Marston et al., 2012; Deng et al., 2014) rather than 257 the rampant mosaicism paradigm largely derived from Siphoviridae genomes (Hendrix et al., 1999). 258 While data remain limited to a subset of the known microbial phyla, it might be that viral infection 259 modes influence the tempo of their genome evolution. Specifically, we posit that horizontal gene 260 transfer is more prevalent in phages that occupy host cells longer due to lysogenic or chronic infection 261 stages and/or infect densely-packed hosts (e.g., or clumped life stages) as these parameters 262 would increase the probability of co-infection. Perhaps then, at least for more highly lytic viral groups, 263 genome-based clustering approaches can now be leveraged for their taxonomic predictive value as 264 suggested over a decade ago (Rohwer & Edwards, 2002). 265 266 Global virus–host network is confirmed as modular 267 Beyond charting diversity and taxonomic affiliation of viral sequence space, the VirSorter 268 dataset provided a unique opportunity to explore virus–host interactions. Beyond the above-noted 269 expansion of viruses to novel hosts, we next examined these patterns on a global scale by constructing 270 a virus–host interaction network based on database-available taxa. When considering viral diversity at 271 the genus level, the network displays a modular topology (Figure 5 and Figure 5–figure supplement 1). 272 Such modularity in virus–host interaction networks suggests that hosts are specifically associated with 273 particular viruses (Weitz et al., 2012), probably reflecting long-term coevolution between microbial 274 hosts and their viruses. Such modular structure was expected, but not observed in previous virus–host 275 interaction network studies, likely due to the short phylogenetic distances between hosts evaluated in 276 available datasets (Flores et al., 2011). The modular network presented here derives from a dataset 277 spanning 18 phyla across bacterial and archaeal domains. These results confirmed the prediction that 278 “at macroevolutionary scales, host–phage interaction matrices should be typified by a modular 279 structure” (Flores et al., 2011), as also had been observed across 215 phage types against 286 host types 280 of unknown diversity (Flores, Valverde & Weitz, 2013). 281 282 283 Virus–host adaptation signals detected at the genome composition and codon usage level 284 Finally, given the number of virus-host linkages revealed by VirSorter, we evaluated the 285 adaptation of viral genome composition within the host milieu—an idea practiced in the literature with 286 limited genomic information (Pride et al., 2006; Carbone, 2008; Cardinale & Duffy, 2011). To this end, 287 we computed the distance between viral and microbial genomes in terms of mono-, di-, tri-, tetra- 288 nucleotide frequency and codon usage, and compared the distances between the virus and its host 289 versus non-hosts in the dataset. Every metric tested displayed a smaller distance between viruses and 290 their hosts than with non-host genomes, with tetranucleotide frequency (TNF) maximizing the host to 291 non-host distances (Figure 6). 292 Among dsDNA viruses, host correlated genome composition patterns were robust across 293 integrated prophages and extrachromosomal genomes (i.e., viral sequences assembled outside of the 294 main host chromosome, Figure 6A, Figure 6–figure supplement 1). Our expectations were that 295 prophages would be largely optimized towards the genome of their host, but that genome composition 296 of the extrachromosomal category would be less correlated. Particularly, as cyanophage host range 297 breadth scales with the number of tRNA genes encoded by the virus (Enav, Béjà & Mandel-Gutfreund, 298 2012), we expected that genome composition of viral genomes with many tRNA genes might have poor 299 correlation to that of their host genomes, assuming that the viral-encoded tRNA genes could 300 compensate for codon mismatches across hosts. However, these latter expectations were not met as 301 viral and host genome composition correlations were strong regardless of the number of viral-encoded 302 tRNA genes (Figure 6–figure supplement 1), which suggests that host-optimized viral genome 303 composition may be beneficial even when the virus encodes its own tRNA genes. 304 Among ssDNA viruses, nucleotide composition of viral genomes was also correlated to host 305 genomes, but less so than for dsDNA viruses (Figure 6A). This contrasts with a previous analysis of 306 500 phage genomes that did not detect any difference between dsDNA and ssDNA genomes 307 adaptations to their host's genome (Cardinale & Duffy, 2011). One ssDNA viral group, the 308 Microviridae, had a reduced signal for genome composition metrics except for codon usage where its 309 signal was comparable to that of the dsDNA viruses (Figure 6A). Although this could indicate a bias 310 linked to the small genome size of these viruses (around 5kb), dsDNA viruses’ genomes subsampled to 311 similar sizes displayed a minimal signal loss (Figure 6–figure supplement 2), which suggests other 312 mechanisms may be driving this lower genome composition adaptation in Microviridae. Another 313 ssDNA group, the Inoviridae had reduced genome composition and codon usage adaptation signals. 314 Again, because Inoviridae release virions without killing their hosts, it is possible that the virus is 315 exposed to host resources over a much longer time interval, lowering the selection pressure toward 316 transcription and translation speed and efficiency, which is the main mechanism thought to drive 317 genome composition and codon usage adaptation of viral genomes (Cardinale & Duffy, 2011). 318 Pragmatically, to assess whether this signal could be used to predict the host of a new virus, we 319 calculated the distance based on TNF vectors between each VirSorter curated dataset sequence and the 320 14,977 microbial genomes. The taxonomy of the microbial genome with the lowest distance to the viral 321 sequence (i.e. the predicted host) was then compared to the taxonomy of the actual host (i.e. the 322 genomic dataset in which the viral sequence was identified). When the host database included all host 323 genomes, this host prediction was 99% accurate at both the family and genus level for virus–host TNF 324 distances lower than 4.10-04, 88% / 51% (family / genus level) for TNF distances ranging between 4.10- 325 04 and 1.10-03, and 70% / 37% for distances greater than 10-03 (Table 1). When genomes from the actual 326 host species are excluded, the accuracy of host prediction drops slightly (95%, 83% / 30%, 67% / 30% 327 for the same distance ranges), and even more when all genomes from the host genus are excluded (70% 328 and 37% at the family level, no correct genus could be predicted in that case, and only one distance 329 lower than 4.10-04 was observed, Table 1). Hence, tetranucleotide frequency comparison provides a 330 promising in silico approach to link new viral genomes to hosts at different levels of accuracy within 331 the taxonomic hierarchy when the suitable host reference genome is available. 332 333 Dataset availability 334 As evidenced by the improvement in virome taxonomic affiliation (Figure 3B), VirSorter 335 curated dataset should represent a useful reference dataset for future virome studies. This dataset also 336 likely harbor novel biology beyond the global patterns of viral diversity and virus-host interactions 337 presented in this manuscript, to be revealed through analyses targeted toward specific viral or host 338 subgroups. To facilitate these follow-up studies, VirSorter curated dataset is made available through 339 two complementary websites: MetaVir and iVirus. MetaVir (project “VirSorter”, dataset “VirSorter 340 curated dataset”) provides an automatic annotation of each sequence, with multiple visualization tools 341 to explore and compare genome maps, as well as multiple ways of searching the data (by host, by 342 phage affiliation, by taxonomic or functional affiliation of predicted genes, etc) and extract a specific 343 subset of interest (these tools are under the tab “Contig maps”). Nucleotide sequences from the 344 VirSorter curated dataset are also hosted at iVirus, alongside the viral clusters annotation and network 345 (as cytoscape-ready text files), the virus–host matrix, and the complete list of viral sequence predictions 346 in the 14,977 archaeal and bacterial genomic datasets including the category 3 predictions that are not 347 in VirSorter curated dataset 348 (http://mirrors.iplantcollaborative.org/browse/iplant/home/shared/ivirus/VirSorter_curated_dataset). 349 Finally, a summary of the sequences and clusters is provided as Figure 1–source data 1 and Figure 2– 350 source data 1, and a Data Dryad package including all annotated genbank files from the VirSorter 351 curated dataset is available (http://dx.doi.org/10.5061/dryad.b8226). 352 353 Conclusion 354 While recent advances in high-throughput sequencing and viral metagenomics continue to 355 expand the bounds of viral sequence space (e.g. Reyes et al., 2012; Mizuno et al., 2013; Brum & 356 Sullivan, 2015), such viruses are typically unlinked to cognate hosts, severely limiting ecological and 357 evolutionary inferences. Concurrently, emerging methods provide new virus–host linkage capabilities, 358 but do not scale well with increasing dataset size and complexity (e.g. Andersson & Banfield, 2008; 359 Tadmor et al., 2011; Allers et al., 2013a; Deng et al., 2014). Here, the mining of publicly available 360 microbial genomic data proved to be a useful complement to these approaches as it enables the 361 exploration of host-linked viral diversity. The resulting viral sequences hidden in microbial genomes 362 represent a powerful dataset, increasing the number of known, host-linked viruses by an order of 363 magnitude, with analyses of these data elucidating viral dark matter in ocean and human gut viromes, 364 as well as augmenting our understanding of viral taxonomy, viral genome evolution, and virus–host 365 interactions on multiple fronts. While this current VirSorter dataset remains limited by the cultivation 366 bias inherent in the publicly-available complete and draft microbial genomes, such bias will 367 progressively be eliminated as SAGs are used to better map microbial dark matter (e.g., Rinke et al., 368 2013). Such a drastically improved map of the virosphere, together with advances in experimental 369 approaches and theory (Brum & Sullivan, 2015), will help reveal the eco-evolutionary forces shaping 370 virus–host interactions across diverse ecosystems and eventually shift our inference capability from 371 observation to prediction. 372 373 374 Material & Methods 375 376 Application of VirSorter to public bacterial and archaeal genomes 377 A total of 14,977 bacterial and archaeal genomes (complete and draft) included in RefSeq and 378 WGS databases (Pruitt et al., 2009) were downloaded from the NCBI ftp website in March 2014 379 (RefSeq release 64). The 264 new candidate phyla (“Microbial Dark Matter”) SAGs’ (Rinke et al., 380 2013) raw reads were downloaded from the JGI portal page and assembled with SPAdes Genome 381 Assembler (Bankevich et al., 2012) (default parameters). Finally, 127 SUP05 SAGs that we previously 382 analyzed manually were added to the cellular genome pool (Roux et al., 2014). This dataset included 383 4,240 complete genomes and 10,547 draft genomes (as there is no clear annotation of a genome as 384 'draft' or 'complete' at the NCBI, we identified as “draft” genomes all genome projects including more 385 than 5 different sequences, to avoid considering genomes split into different chromosomes or including 386 one or several plasmids as “draft”). 387 Genomes were processed with VirSorter (Roux et al., 2015) separately for each class (except for 388 , SUP05 SAGs, and the Microbial Dark Matter dataset that were all processed together), 389 first using the RefSeqABVir database, and then using the Viromes database, yielding 89,301 total 390 predicted viral sequences. Among these, 938 correspond to Enterobacteria phage PhiX174, which is 391 used for quality control during Illumina sequencing, and were thus discarded. 392 393 Selection of a relevant subset of viral sequences: the VirSorter dataset 394 We focused on a subset of the putative viral sequences extracted from RefSeq, WGS and the 395 Microbial Dark Matter and SUP05 SAGs (89,301 sequences), and targeted the active prophages and 396 lytic virus signatures. To this end, we discarded all predictions lacking a viral hallmark gene or a viral 397 gene enrichment (i.e., category 3 predictions, Roux et al., 2015), and all prophage detections displaying 398 viral gene enrichment only and lacking viral hallmark genes, as these are likely defective prophages for 399 which boundaries are difficult to predict in silico and that often include bacterial genes. We next 400 removed all linear sequences shorter than 10kb except for sequences detected with the 401 non-Caudovirales score where a threshold of 5kb was used, as these viruses can frequently have 402 genomes smaller than 10kb. We also discarded all circular contigs (which should represent complete 403 genomes) smaller than 3kb as these are likely short repeat regions (the smallest known genome for a 404 bacteria or archaea virus is ~5kb). The resulting 13,391 sequences were then manually curated to 405 remove false positives. These false positives corresponded to defective prophages (wherein most are 406 expected to be smaller than 10kb), plasmid-like sequences, GTA gene clusters, and low complexity 407 regions. In addition, this manual curation step allowed us to adjust the boundaries of some prophage 408 predictions and/or modify the prophage vs complete viral contig automatic prediction. Consequently, 409 892 sequences were discarded (false-positive rate of 6.7%), leaving 12,498 curated sequences. 410 Among these, 7,266 sequences were entirely viral (thus potentially represent lytic, chronic, or 411 extrachromosomal lysogenic infections assembled in draft genomes), and 5,232 were prophages 412 (viral-like regions detected within a cellular genome fragment). Among the sequences detected as 413 entirely viral, 6 were tagged in the NCBI database as bacteriophages, and 108 as plasmids. Viruses and 414 plasmids can be difficult to distinguish, as gene exchange is known to occur between these two types of 415 mobile genetic elements (Leplae, Lima-Mendez & Toussaint, 2010). Here, 84 out of these 108 416 “plasmid” sequences displayed conclusive evidence of a viral origin as they contained viral hallmark 417 genes (coding for terminase large subunits or major capsid proteins), and originated from draft 418 unpublished genomes, hence likely to have been named “plasmid” because they formed 419 extrachromosomal circular assembly (see for example sequence gi:383080718 available at RefSeq). 420 The 24 others were more ambiguous (highlighted in orange in Figure 1–source data 1) since the 421 automatic annotation from NCBI did not display any viral-like gene, yet these sequences all displayed 422 statistical viral-like gene enrichment, and as such were maintained in the VirSorter curated dataset. 423 Finally, one additional ambiguous sequence, considered as entirely viral by VirSorter, was 424 detected in the Caldiserica SAG (Caldiserica_bacterium_sp_JGI_0000059-M03_ID_3757). Even 425 though this sequence looks indeed like a complete Inoviridae genome, it displayed a high level of 426 similarity (99% identity) to the complete genome of Delftia acidovorans SPH-1 (gi:160361034, from 427 coordinates 2300885 to 2307389). Such high similarity with another virus is suspicious, as well as the 428 fact that the matching genome is Delftia, a bacterium known to contaminate some MDA reagents. This 429 sequence was maintained in the VirSorter dataset as there is no definite proof of the contamination, but 430 the existence of a Caldiserica-infecting Inoviridae should be considered as uncertain until further 431 evidence is available (and is displayed as such in Figure 1–source data 2). 432 433 Protein and virus clustering of the VirSorter curated dataset 434 The pool of 450,047 proteins predicted from the 12,498 viral sequences was clustered with all 435 proteins from RefSeq and the viral metagenomes (i.e., sequences from the Viromes database) with 436 MCL based on reciprocal best BLAST hit (threshold of 50 on score and 0.001 on e-value, Enright, Van 437 Dongen & Ouzounis, 2002). Most of these sequences (423,618) could be included in 22,460 protein 438 clusters (PCs). About a third (7,742) of these PCs also contained sequences from the RefSeqABVir 439 database, and the remainder formed new protein clusters. 440 This protein clustering was then used to cluster genomes as in Lima-Mendez et al. (Enright, Van 441 Dongen & Ouzounis, 2002; Lima-Mendez et al., 2008b). Briefly, the number of shared protein clusters 442 between each pair of sequences (either RefSeq or VirSorter) is computed and a significance value is 443 deduced by comparing it to an expected number of shared protein clusters (modeled with a 444 hypergeometric formula taking into account the number of genes of both sequences). 445 We used ICCC (Intracluster clustering coefficient, which estimates cluster homogeneity by 446 measuring around each node how many of its neighbors are part of the same cluster) to determine the 447 best inflation value (from 1.5 to 5 by 0.25 increments) and significance threshold (i.e., which minimum 448 significance was required to draw an edge between two sequences, from 1 to 50). As expected, the 449 number of virus clusters formed increased with inflation and with significance. ICCC was clearly 450 higher with the lowest threshold in significance (sig >=1), regardless of the inflation value used. For 451 the lowest significance threshold, ICCC increased with inflation, usually with a first small peak around 452 2 and plateau around 4. These different values of inflation did not have a major impact on the clustering 453 though, as 95 to 99% of pairs of sequences were clustered similarly using inflations values of 2.75, 3, 454 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, or 5. We eventually settled for the combination yielding the highest 455 ICCC: a significance threshold of 1 and inflation of 4 (Figure 2–figure supplement 2). 456 457 Taxonomic and functional affiliation of sequences and virus clusters 458 Taxonomic affiliation of sequences was based on hits to the RefSeqABVir database. Each 459 profile in the database was first affiliated based on the origin of its members, with a 75% majority rule: 460 at each taxonomic level, a profile is affiliated to a taxon if more than 75% of the profile sequences are 461 affiliated to this taxon. Then, for each of the 12,498 viral sequences identified by VirSorter, a set of 462 relevant hits was selected: (i) first the profile with the best hit across all genes along the sequence, and 463 (ii) the best hit from other genes with a score close to this “absolute” best hit in the sequence (>75% of 464 the score of the first best hit). The sequence was then affiliated to the Lowest Common Ancestor (LCA) 465 of this set of relevant hits. Hence, a predicted protein will only be affiliated if pointing toward 466 sequences or profiles typical of a viral group, and a sequence detected by VirSorter will only be 467 affiliated if its best hits are consistent. Functional affiliation for each protein cluster was based on the 468 comparison of its members (predicted proteins) with PFAM (v. 27, threshold of 50 on score). Virus 469 Clusters were affiliated based on its members affiliations if >75% were consistent. 470 For the detection of new genera in the Viral Clusters (VCs), we chose to ignoring the 79 VCs 471 that lacked large (>30kb) genome sequences. This 30kb threshold is conservative as it avoids 472 considering short genome fragments as new genera, but would also overlook small non-circular viral 473 genomes (such as some Tectiviridae). However, because the latter comprise a minority (~0.1% of 474 12,498 sequences) of the VirSorter dataset (Figure 2), we chose to retain the larger, more conservative 475 threshold. 476 The 7 short circular sequences from Bacteroidia only detected with the Viromes database (gi 477 319430465, 298484481, 329959038, 423221334, 423242675, 423298785, 345651594) were targeted 478 for further examination. Hits to PFAM domains could be found on two proteins: a relaxase (PF03432.9, 479 score ~170), and one replication initiator protein (PF01051.16, score ~80). Genome organization was 480 compared with Easyfig (Sullivan, Petty & Beatson, 2011) after aligning all genomes to the same 481 starting point (one base before the start of the Rep-domain protein). Recruitment plots of virome 482 contigs (extracted from Kim et al., 2011; Minot et al., 2012) were generated with ggplot2 and based on 483 blastn comparison. 484 485 Host range and co-infection 486 The virus–host network was assessed considering only VCs (Viral Clusters) with more than 10 487 sequences. Hosts were grouped at the class level. The modularity Q value of the virus–host matrix was 488 computed with the lp-BRIM module in R software (https://github.com/tpoisot/lp-brim). The virus–host 489 matrix had a modularity of 0.45. The same index computed from 99 randomly permuted matrices 490 ranged from 0.02 to 0.17, with an average of 0.07. 491 Co-infection was defined as the detection of several distinct viruses in one genome project (one 492 complete genome or one SAG). However, different viral contigs in a single draft genome could also 493 originate from a single viral genome mis-assembled in several different contigs. This will be especially 494 true for Caudovirales that are the most detected viruses as well as the ones with the largest genomes. 495 To limit the over-estimation of co-infection due to mis-assembled Caudovirales genomes, co-infection 496 was only considered in the cases where multiple copies of the large subunit of the terminase were 497 detected, because this gene is present in single copy in Caudovirales genomes, and usually detected 498 even in new viruses (due to a high level of sequence conservation). 499 500 Evaluation of virus–host genome adaptation 501 Relative frequencies of k-mers (mono-, di-, tri- and tetra-nucleotide) were computed with 502 Jellyfish (Marçais & Kingsford, 2011) for every VirSorter sequence and every bacterial and archaeal 503 genome initially mined. Mean absolute error (i.e. average of absolute differences) between k-mer 504 frequency vectors were then computed with an in-house perl script for each pair of VirSorter sequence 505 and cellular genome, and used as a distance metric between viruses and putative hosts. For each 506 VirSorter sequence, a set of distances that included its host (i.e., the genome with which the sequence 507 was initially associated) alongside 10 randomly selected sequences from the same genus, the same 508 family, and a different order than the host were factored into the distance distribution (Figure 6–figure 509 supplement 1). 510 Codon usage adaptation was evaluated with cusp and cai from the European Molecular Biology 511 Open Software Suite (EMBOSS, Rice, Longden & Bleasby, 2000). First, codon usage bias of each 512 bacterial and archaeal genome was computed. Then, the codon usage adaptation index (cai) was 513 calculated for each gene between VirSorter sequences and cellular genomes. The global distribution 514 displays the average (across genes) adaptation index for each VirSorter sequence and (as for the k-mer 515 distances) a subset of cellular genomes including its host and 10 randomly selected sequences from 516 respectively the same genus, the same family, and a different order than the actual host. Function- 517 specific codon usage bias was based on the gene-by-gene adaptation between each VirSorter sequence 518 and its host. 519 For each category studied, the distance between distribution of distances to host genome (in red 520 on Figure 5–figure supplement 1) and distribution of distances to non-host genomes (in purple on 521 Figure 5–figure supplement 1) was evaluated with a Kolmogorov-Smirnov (K-S) statistic. The codon 522 usage adaptation indexes for the different functional categories were compared to the “other functions” 523 values with a Wilcoxon signed-rank test to detect categories with statistically different averages. Both 524 statistics were computed with R software. 525 To evaluate the effect of small genome size on distance between k-mer frequencies, a sub- 526 sample of 1,000 Caudovirales was randomly taken at different sizes (from 2,000 to 20,000bp) and the 527 same procedure as for complete sequences was used to determine the distance between host and non- 528 host distributions of k-mer distances. Even though the signal was slightly less strong for shorter 529 fragments, this simulation indicates that genome size is not the only factor that could explain such low 530 viral–host genome adaptation for ssDNA viruses. 531 The prediction of the host taxonomy for each viral sequence was based on the microbial 532 genome with the lowest tetramer frequency distance to the viral sequence. A prediction was considered 533 as “correct” when this closest microbial genome taxonomy was the same as the original genome in 534 which the viral sequence was detected. This prediction was computed using (i) all microbial genomes 535 (ii) only genomes from a different species than the actual host (i.e. the genome in which the viral 536 sequence was originally detected), and (iii) only genomes from a different genus than the actual host. 537 538 Estimation of virome affiliation improvement and prophage assembly efficiency 539 Protein sequences predicted from the Pacific Ocean Viromes (POV, Hurwitz & Sullivan, 2013), 540 Tara Oceans Viromes (TOV, Brum et al., 2015) and human gut viromes (Minot et al., 2012) datasets 541 were compared to RefSeqABVir (Jan. 2014) using BLAST (blastp, threshold of 50 on bit score and 542 0.001 on e-value). Those proteins that did not affiliate at >50 bit score and <0.001 e-value thresholds 543 were considered ‘unclassified’ and then used as queries in a secondary BLAST (blastp with the same 544 thresholds) against the predicted proteins from the VirSorter curated dataset. Any unclassified proteins 545 matching the VirSorter dataset better were considered newly affiliated. 546 To evaluate the efficiency of prophage assembly, we simulated genome sequencing from 23 547 bacterial genomes with identified prophages (NC_000907 NC_000913 NC_000964 NC_002570 548 NC_002655 NC_002662 NC_002695 NC_002935 NC_003030 NC_003212 NC_003295 NC_003366 549 NC_003997 NC_004070 NC_004307 NC_004310 NC_004431 NC_004557 NC_004567 NC_004668 550 NC_004722 NC_005085 NC_005362). NeSSM (Jia et al., 2013) was used to simulated HiSeq Illumina 551 reads (100bp paired-ends) with a coverage of the prophage region varying between 5x, 25x, 50x, 75x 552 and 100x. Reads were then assembled with Idba_ud , and viral contigs were predicted with VirSorter 553 (Roux et al., 2015). On the 481 contigs larger than 30kb detected as viral by VirSorter, 11 were 554 considered as “entirely viral” even though these originated from integrated prophages, resulting in a 555 “false-positive” ratio of integrated prophages wrongly considered as extrachromosomal viral genomes 556 of 2.3% for contigs of 30 kb and more. As could be expected, this same “false-positive” ratio was 557 higher for smaller contigs (12.06% for contigs <20kb, and 22.81% for contigs <10kb), so that we 558 considered the origin of these small contigs as “undetermined”, since they may come from integrated 559 prophages or extrachromosomal genomes. 560 All scripts used in this study are available on the TMPL wiki as a zip package: 561 http://tmpl.arizona.edu/dokuwiki/doku.php?id=bioinformatics:scripts:vsb and on github: 562 http://github.com/simroux/virsorter-curated-dataset-scripts-package. 563 References 564 Abedon ST. 2009. Phage evolution and ecology. Advances in applied microbiology 67: 1–45. 565 Akhter S, Aziz RK, Edwards RA. 2012. PhiSpy: a novel algorithm for finding prophages in bacterial 566 genomes that combines similarity- and composition-based strategies. Nucleic acids research 40:1– 567 13. 568 Allers E, Moraru C, Duhaime MB, Beneze E, Solonenko N, Canosa JB, Amann R, Sullivan MB. 569 2013a. Single-cell and population level viral infection dynamics revealed by phageFISH, a method 570 to visualize intracellular and free viruses. Environmental Microbiology 15:2306–18. 571 Allers E, Wright JJ, Konwar KM, Howes CG, Beneze E, Hallam SJ, Sullivan MB. 2013b. Diversity and 572 population structure of Marine Group A bacteria in the Northeast subarctic Pacific Ocean. The 573 ISME journal 7:256–68. 574 Andersson AF, Banfield JF. 2008. Virus population dynamics and acquired virus resistance in natural 575 microbial communities. Science 320:1047–50. 576 Bankevich A, Nurk S, Antipov D, Gurevich A a, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, 577 Pham S, Prjibelski AD, Pyshkin A V, Sirotkin A V, Vyahhi N, Tesler G, Alekseyev M a, Pevzner P 578 a. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. 579 Journal of computational biology : a journal of computational molecular cell biology 19:455–77. 580 Bastías R, Higuera G, Sierralta W, Espejo RT. 2010. A new group of cosmopolitan bacteriophages 581 induce a carrier state in the pandemic strain of Vibrio parahaemolyticus. Environmental 582 microbiology 12:990–1000. 583 Brum J, Ignacio-Espinoza J, Roux S, Doulcier G, Acinas SG, Alberti A, Chaffron S, Cruaud C, de 584 Vargas C, Gasol JM, Gorsky G, Gregory AC, Ogata H, Pesant S, Poulos BT, Schwenck SM, 585 Speich S, Dimier C, Kandels-Lewis S, Picheral M, Searson S, Tara Oceans Coordinators, Bork P, 586 Bowler C, Sunagawa S, Wincker P, Karsenti E, Sullivan MB. 2015. Patterns and ecological drivers 587 of ocean viral communities. Science 348:1261498–1–10. 588 Brum JR, Jeffrey Morris J, Décima M, Stukel MR. 2014. Mortality in the oceans : Causes and 589 consequences. In: Association for the Sciences of Limnology and Oceanography ed. Eco-DAS IX. 590 16–48. 591 Brum JR, Sullivan MB. 2015. Rising to the challenge: accelerated pace of discovery transforms marine 592 virology. Nature reviews. Microbiology 13:147–59. 593 Canchaya C, Fournous G, Brüssow H. 2004. The impact of prophages on bacterial chromosomes. 594 Molecular Microbiology 53:9–18. 595 Carbone A. 2008. Codon bias is a major factor explaining phage evolution in translationally biased 596 hosts. Journal of molecular evolution 66:210–23. 597 Cardinale DJ, Duffy S. 2011. Single-stranded genomic architecture constrains optimal codon usage. 598 Bacteriophage 1:4:219–24. 599 Carey-Smith G V., Billington C, Cornelius AJ, Hudson JA, Heinemann J a. 2006. Isolation and 600 characterization of bacteriophages infecting Salmonella spp. FEMS Microbiology Letters 601 258:182–186. 602 Casjens S. 2003. Prophages and bacterial genomics: what have we learned so far? Molecular 603 Microbiology 49:277–300. 604 Castelle CJ, Hug LA, Wrighton KC, Thomas BC, Williams KH, Wu D, Tringe SG, Singer SW, Eisen 605 JA, Banfield J F. 2013. Extraordinary phylogenetic diversity and metabolic versatility in aquifer 606 sediment. Nature communications 4:2120. 607 Clemente JC, Ursell LK, Parfrey LW, Knight R. 2012. The Impact of the Gut on Human 608 Health: An Integrative View. Cell 148:1258–1270. 609 Davis MA, Martin KA, Austin SJ. 1992. Biochemical activities of the ParA partition protein of the P1 610 plasmid. Molecular Microbiology 6:1141–1147. 611 DeLong EF. 2009. The microbial ocean from genomes to biomes. Nature 459:200–6. 612 Deng L, Ignacio-Espinoza JC, Gregory A, Poulos BT, Weitz JS, Hugenholtz P, Sullivan MB. 2014. 613 Viral tagging reveals discrete populations in Synechococcus viral genome sequence space. Nature 614 513:242–45. 615 Diemer GS, Stedman KM. 2012. A novel virus genome discovered in an extreme environment suggests 616 recombination between unrelated groups of RNA and DNA viruses. Biology direct 7:13. 617 Emerson JB, Thomas BC, Alvarez W, Banfield JF. 2015. Metagenomic analysis of a high CO2 618 subsurface microbial community populated by chemolithoautotrophs and bacteria and archaea 619 from candidate phyla. Environmental microbiology. 620 Enav H, Béjà O, Mandel-Gutfreund Y. 2012. Cyanophage tRNAs may have a role in cross-infectivity 621 of oceanic Prochlorococcus and Synechococcus hosts. The ISME journal 6:619–28. 622 Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for large-scale detection of 623 protein families. Nucleic acids research 30:1575–84. 624 Falkowski PG, Fenchel T, Delong EF. 2008. The Microbial Engines That Drive Earth’s Biogeochemical 625 Cycles. Science 320:1034–9. 626 Fischer CR, Yoichi M, Unno H, Tanji Y. 2004. The coexistence of Escherichia coli serotype O157:H7 627 and its specific bacteriophage in continuous culture. FEMS Microbiology Letters 241:171–177. 628 Flores CO, Meyer JR, Valverde S, Farr L, Weitz JS. 2011. Statistical structure of host–phage 629 interactions. Proceedings of the National Academy of Sciences of the United States of America 630 108:E288–E297. 631 Flores CO, Valverde S, Weitz JS. 2013. Multi-scale structure and geographic drivers of cross-infection 632 within marine bacteria and phages. The ISME journal 7:520–32. 633 Forterre P, Prangishvili D. 2013. The major role of viruses in cellular evolution: facts and hypotheses. 634 Current opinion in virology 3:558–65. 635 Fouts DE. 2006. Phage_Finder: automated identification and classification of prophage regions in 636 complete bacterial genome sequences. Nucleic acids research 34:5839–51. 637 Garrett RA, Prangishvili D, Shah SA, Reuter M, Stetter KO, Peng X. 2010. Metagenomic analyses of 638 novel viruses and plasmids from a cultured environmental sample of hyperthermophilic 639 neutrophiles. Environmental microbiology 12:2918–30. 640 Saint Girons I, Bourhy P, Ottone C, Picardeau M, Yelton D, Hendrix RW, Glaser P, Charon N. 2000. 641 The LE1 Bacteriophage Replicates as a Plasmid within Leptospira biflexa : Construction of an L . 642 biflexa-Escherichia coli Shuttle Vector. Journal of Bacteriology 182:5700–5705. 643 Hanson CA, Fuhrman JA, Horner-Devine MC, Martiny JBH. 2012. Beyond biogeographic patterns: 644 processes shaping the microbial landscape. Nature reviews. Microbiology 10:497–506. 645 Hendrix RW, Smith MC, Burns RN, Ford ME, Hatfull GF. 1999. Evolutionary relationships among 646 diverse bacteriophages and prophages: all the world’s a phage. Proceedings of the National 647 Academy of Sciences of the United States of America 96:2192–2197. 648 Hurwitz BL, Hallam SJ, Sullivan MB. 2013. Metabolic reprogramming by viruses in the sunlit and 649 dark ocean. Genome biology 14:R123. 650 Hurwitz BL, Sullivan MB. 2013. The Pacific Ocean Virome (POV): a marine viral metagenomic 651 dataset and associated protein clusters for quantitative viral ecology. PLoS One 8:e57355. 652 Ignacio-Espinoza JC, Sullivan MB. 2012. Phylogenomics of T4 cyanophages: lateral gene transfer in 653 the “core” and origins of host genes. Environmental microbiology 14:2113–26. 654 Jia B, Xuan L, Cai K, Hu Z, Ma L, Wei C. 2013. NeSSM: a Next-generation Sequencing Simulator for 655 Metagenomics. PloS one 8:e75448. DOI: 10.1371/journal.pone.0075448. 656 Kamke J, Sczyrba A, Ivanova N. 2013. Single-cell genomics reveals complex carbohydrate degradation 657 patterns in poribacterial symbionts of marine sponges. The ISME journal 7:2287–300. 658 Kashtan N, Roggensack SE, Rodrigue S, Thompson JW, Biller SJ, Coe A, Ding H, Marttinen P, 659 Malmstrom RR, Stocker R, Follows MJ, Stepanauskas R, Chisholm SW, Biller J. 2014. Single-cell 660 genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science (New 661 York, N.Y.) 344:416–20. 662 Kim M-S, Park E-J, Roh SW, Bae J-W. 2011. Diversity and abundance of single-stranded DNA viruses 663 in human feces. Applied and environmental microbiology 77:8062–70. 664 Koonin E V, Senkevich TG, Dolja V V. 2006. The ancient Virus World and evolution of cells. Biology 665 direct 1:29. 666 Krupovic M, Zhi N, Li J, Hu G, Koonin E V, Wong S, Shevchenko S, Zhao K, Young NS. 2015. 667 Multiple layers of chimerism in a single-stranded DNA virus discovered by deep sequencing. 668 Genome biology and evolution. 669 Labonté JM, Swan BK, Poulos BT, Luo H, Koren S, Hallam SJ, Sullivan MB, Woyke T, Wommack 670 EK, Stepanauskas R. 2015. Single cell genomics-based analysis of virus-host interactions in 671 marine surface bacterioplankton. The ISME journal. 672 Labonté JM, Suttle CA. 2013. Previously unknown and highly divergent ssDNA viruses populate the 673 oceans. The ISME journal 7:2169–77. 674 Lawrence JG, Hatfull GF, Roger W, Hendrix RW. 2002. Imbroglios of Viral Taxonomy : Genetic 675 Exchange and Failings of Phenetic Approaches. Journal of Bacteriology 184:4891–4905. 676 Leplae R, Lima-Mendez G, Toussaint A. 2010. ACLAME: a CLAssification of Mobile genetic 677 Elements, update 2010. Nucleic acids research 38:D57–61. 678 Lima-Mendez G, Van Helden J, Toussaint A, Leplae R. 2008a. Prophinder: a computational tool for 679 prophage prediction in prokaryotic genomes. Bioinformatics 24:863–5. 680 Lima-Mendez G, Van Helden J, Toussaint A, Leplae R. 2008b. Reticulate representation of evolutionary 681 and functional relationships between phage genomes. Molecular biology and evolution 25:762–77. 682 Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences 683 of k-mers. Bioinformatics (Oxford, England) 27:764–70. 684 Marston MF, Pierciey FJ, Shepard A, Gearin G, Qi J, Yandava C, Schuster SC, Henn MR, Martiny JBH. 685 2012. Rapid diversification of coevolving marine Synechococcus and a virus. Proceedings of the 686 National Academy of Sciences of the United States of America 109:4544–9. 687 Middelboe M, Holmfeldt K, Riemann L, Nybroe O, Haaber J. 2009. Bacteriophages drive strain 688 diversification in a marine Flavobacterium : implications for phage resistance and physiological 689 properties. Environmental Microbiology 11:1971–1982. 690 Minot S, Grunberg S, Wu GD, Lewis JD, Bushman FD. 2012. Hypervariable loci in the human gut 691 virome. Proceedings of the National Academy of Sciences of the United States of America 692 109:3962–6. 693 Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. 2013. Expanding the marine virosphere using 694 metagenomics. PLoS genetics 9:e1003987. 695 Mosig G. 1998. Recombination And Recombination-Dependent Dna Replication In Bacteriophage T4. 696 Annual Review of Genetics 32:379–413. 697 Pace NR. 1997. A Molecular View of Microbial Diversity and the Biosphere. Science 276:734–40. 698 Peng Y, Leung HCM, Yiu SM, Chin FYL. 2012. IDBA-UD: a de novo assembler for single-cell and 699 metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420–1428. 700 Pride DT, Wassenaar TM, Ghose C, Blaser MJ. 2006. Evidence of host-virus co-evolution in 701 tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC genomics 7:1–13. 702 Pruitt KD, Tatusova T, Klimke W, Maglott DR. 2009. NCBI Reference Sequences : current status, 703 policy and new initiatives. Nucleic acids research 37:32–36. 704 Rakonjac J, Bennett NJ, Spagnuolo J, Gagic D, Russel M. 2011. Filamentous bacteriophage: biology, 705 phage display and nanotechnology applications. Current issues in molecular biology 13:51–76. 706 Rappé MS, Giovannoni SJ. 2003. The uncultured microbial majority. Annual review of microbiology 707 57:369–94. 708 Reyes A, Semenkovich NP, Whiteson K, Rohwer F, Gordon JI. 2012. Going viral: next-generation 709 sequencing applied to phage populations in the human gut. Nature reviews. Microbiology 10:607– 710 17. 711 Rice P, Longden I, Bleasby a. 2000. EMBOSS: the European Molecular Biology Open Software Suite. 712 Trends in genetics 16:276–7. 713 Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng J-F, Darling A, Malfatti S, Swan 714 BK, Gies E a, Dodsworth J a, Hedlund BP, Tsiamis G, Sievert SM, Liu W-T, Eisen J a, Hallam SJ, 715 Kyrpides NC, Stepanauskas R, Rubin EM, Hugenholtz P, Woyke T. 2013. Insights into the 716 phylogeny and coding potential of microbial dark matter. Nature 499:431–7. 717 Rodriguez-Valera F, Martin-Cuadrado A-B, Rodriguez-Brito B, Pasić L, Thingstad TF, Rohwer F, Mira 718 A. 2009. Explaining microbial population genomics through phage predation. Nature Reviews 719 Microbiology 7:828–36. 720 Rohwer F, Edwards R. 2002. The Phage Proteomic Tree : a Genome-Based Taxonomy for Phage. 721 184:4529–4535. 722 Roux S, Enault F, Bronner G, Vaulot D, Forterre P, Krupovic M. 2013. Chimeric viruses blur the 723 borders between the major groups of eukaryotic single-stranded DNA viruses. Nature 724 Communications 4:1–10. 725 Roux S, Enault F, Hurwitz BL, Sullivan MB. 2015. VirSorter: mining viral signal from microbial 726 genomic data. PeerJ 3:e985. DOI: 10.7717/peerj.985. 727 728 Roux S, Hawley AK, Torres Beltran M, Scofield M, Schwientek P, Stepanauskas R, Woyke T, Hallam 729 SJ, Sullivan MB. 2014. Ecology and evolution of viruses infecting uncultivated SUP05 bacteria as 730 revealed by single-cell- and meta- genomics. eLife 3:1–20. 731 Salim O, Skilton RJ, Lambden PR, Fane B a, Clarke IN. 2008. Behind the chlamydial cloak: the 732 replication cycle of chlamydiaphage Chp2, revealed. Virology 377:440–5. 733 Sencilo A, Paulin L, Kellner S, Helm M, Roine E. 2012. Related haloarchaeal pleomorphic viruses 734 contain different genome types. Nucleic acids research 40:5523–34. 735 Sims GE, Jun S-R, Wu G a, Kim S-H. 2009. Alignment-free genome comparison with feature 736 frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of 737 Sciences of the United States of America 106:2677–82. 738 Sternberg N, Austin S. 1981. The Maintenance of the P1 Plasmid Prophage. Plasmid 5:20–31. 739 Sullivan MJ, Petty NK, Beatson SA. 2011. Easyfig: a genome comparison visualizer. Bioinformatics 740 27:1009–10. 741 Suttle CA. 2007. --major players in the global ecosystem. Nature Reviews Microbiology 742 5:801–812. 743 Tadmor AD, Ottesen EA, Leadbetter JR, Phillips R. 2011. Probing Individual Environmental Bacteria 744 for Viruses by Using Microfluidic Digital PCR. Science 333:58–62. 745 Weitz JS, Poisot T, Meyer JR, Flores CO, Valverde S, Sullivan MB, Hochberg ME. 2012. Phage- 746 bacteria infection networks. Trends in microbiology 21:82–91. 747 Whitman WB, Coleman DC, Wiebe WJ. 1998. Prokaryotes : The unseen majority. Proceedings of the 748 National Academy of Sciences of the United States of America 95:6578–83. 749 Wright JJ, Konwar KM, Hallam SJ. 2012. of expanding oxygen minimum zones. 750 Nature reviews. Microbiology 10:381–94. 751 Wrighton K, Thomas B, Sharon I, Miller CS, Castelle CJ, VerBerkmoes NC, Wilkins MJ, Hettich RL, 752 Lipton MS, Williams KH, Long PE, Banfield JF. 2012. Fermentation, hydrogen, and sulfur 753 metabolism in multiple uncultivated bacterial phyla. Science 337:1661–1666. 754 Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, Wilson WH, Yang EC, Duffy S, 755 Bhattacharya D. 2011. Single-cell genomics reveals organismal interactions in uncultivated marine 756 protists. Science 332:714–7. 757 Youle M, Haynes M, Rohwer F. 2012. Scratching the Surface of Biology’s Dark Matter. In: Witzany G 758 ed. Viruses: Essential Agents of Life. Dordrecht: Springer Netherlands, 61–80. 759 Zhou Y, Liang Y, Lynch KH, Dennis JJ, Wishart DS. 2011. PHAST: a fast phage search tool. Nucleic 760 acids research 39:W347–52. 761 762 Acknowledgements 763 We thank Natalie Solonenko and Sheri Floge and TMPL members for comments on the manuscript. 764 This work was performed under the auspices of the Gordon and Betty Moore Foundation (#3790) 765 through grants awarded to M.B.S. and and the Natural Sciences and Engineering Research Council 766 (NSERC) of Canada, Canada Foundation for Innovation (CFI), the Canadian Institute for Advanced 767 Research (CIFAR) and the Tula Foundation funded Centre for Microbial Diversity and Evolution, G. 768 Unger Vetlesen and Ambrose Monell Foundation through grants awarded to S.J.H. The work conducted 769 by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is 770 supported under Contract No. DE-AC02-05CH11231. 771 772 Contributions 773 All authors took part in the initial setup of the study and in the preparation of the manuscript. SR and 774 MBS analyzed the data. 775 776 The authors declare no competing interest. All the data and scripts described in this manuscript are 777 publicly available. 778 Figures titles and legends: 779 Figure 1: Distribution of viral sequences from the VirSorter curated dataset across the bacterial 780 and archaeal phylogeny. For each bacteria or archaea phylum (or phylum-level group), corresponding 781 viruses in RefSeq (gray) and VirSorter curated dataset (red) are indicated with circles proportional to 782 the number of sequences available. Groups for which no viruses were available in RefSeq are 783 highlighted in black. 784 785 Figure 1–figure supplement 1: Viral diversity in the VirSorter dataset. The best BLAST hits of 786 predicted proteins along each sequence (i.e., within 75% of the best BLAST hit for this sequence) were 787 used in a Lowest Common Ancestor affiliation (here displayed at the family level). “Unclassified 788 Caudovirales” gathers viruses only affiliated to the Caudovirales level without confident affiliation to 789 the Myo-, Sipho- or Podoviridae. The number and percentage of sequences affiliated is indicated next 790 to each family. 791 792 Figure 1–figure supplement 2 : Genome map comparison (A) and recruitment plot (B) of 793 Bacteroidia virus sequences from a putative new order. Replication–associated, Relaxase and 794 hypothetical proteins are depicted in blue, orange, and gray respectively. The recruitment plot includes 795 two viromes from human feces samples from two different studies (Human gut assembly, Minot et al., 796 2012, and Human feces, Kim et al., 2011). Identity percentage is based on a blastn between virome 797 contigs and the reference genome. 798 799 Figure 2: Degree of novelty of viruses detected in VirSorter curated dataset. A. Viral clusters 800 (VCs) are considered as putative new genera when including at least one sequence larger than 30kb, 801 circular, or known to be a complete genome (from RefSeq). These putative genera were considered as 802 “new” when the VC did not include any RefSeq sequence, and “known” otherwise. B. The proportion 803 of new VCs (containing no RefSeqABVir), VCs with only one RefSeqABVir sequence, and VCs with 804 more than one RefSeqABVir sequence is displayed for host classes associated with more than 10 virl 805 sequences. Only “putative genera” VCs were considered (i.e. clusters containing a RefSeqABVir 806 genome, a circular sequence, or a sequence with more than 30 predicted genes). 807 808 Figure 2–figure supplement 1: Structure of viral sequence space sampled in VirSorter dataset. 809 Network of virus clusters (VCs) based on gene content comparison between viral genome sequences 810 from RefSeqABVir and VirSorter dataset. VCs including only VirSorter sequences, are highlighted 811 with a black outline. The size of nodes is proportional to the number of sequences in the cluster and the 812 color of the node corresponds to the BLAST-based affiliation (at the family level) of its members when 813 consistent (i.e. agreement between >75% of the cluster members, otherwise clusters are indicated as 814 “unaffiliated”). 815 816 Figure 2–figure supplement 2: Benchmarks used to determine the best value for inflation and 817 significance thresholds for virus clustering. For each pair of values (inflation and significance 818 threshold), the genome network was computed and its overall shape evaluated with ICCC (intra-cluster 819 clustering coefficient). The chosen values are highlighted in green in the table, and with a star on the 820 associated plot. 821 822 Figure 3: Extrachromosomal prophages in VirSorter curated dataset and improvement in virome 823 affiliation. A. The distribution of VirSorter curated dataset as “integrated” (i.e. prophages integrated in 824 the host chromosome), “extrachromosomal” (i.e. >30kb or circular sequences with no microbial genes) 825 or “undetermined” (<30kb linear with no microbial genes) is indicated for each host class with at least 826 five VirSorter curated dataset sequences. The number of sequences associated with each host class in 827 indicated above the histogram. B. Improvement in the proportion of affiliated genes from viromes with 828 VirSorter dataset. Predicted genes from the Pacific Ocean Viromes (Hurwitz & Sullivan, 2013), Tara 829 Ocean Viromes (Brum, Ignacio-Espinoza, Roux et al., 2015) and Human Gut Viromes (Minot et al., 830 2012) were compared to RefSeqVirus (May 2015) and the VirSorter dataset (BLASTp, threshold of 50 831 on bit score and 0.001 on e-value). Predicted proteins affiliated to VirSorter (in blue) did not display 832 any significant similarity to a RefSeq sequence. 833 834 Figure 3–figure supplement 1: Contig map of a putative new extrachromosomal prophage. Contig 835 Spirochaetia_gi_359585655 represent a complete genome (the contig was detected as circular) from a 836 new genus (affiliated to a VC with no RefSeqABVir sequence). Functional affiliation of predicted 837 genes is indicated on the map, with notably two genes (ParA/ParB) indicative of extrachromosomal 838 prophages, as well as two genes (in orange) affiliated to the ACR_tran efflux pump family, of which 839 some members are involved in antiobiotic resistance phenotypes. This contig belongs to the virus 840 cluster VC_61, composed of 35 new putative extrachromosomal prophages from different Spirochetes 841 genomes. 842 843 Figure 4: Scale and range of co–infection. A. Number of different viral sequences detected by host 844 genome. Numbers are based on the set of microbial genomes with at least one viral sequence detected 845 (5,492 genomes). B. Affiliation of viruses involved in multiple infections of the same host. Affiliations 846 are deduced from best BLAST hits alongside the viral sequences, as in Figure 1. Co–infections 847 involving dsDNA and ssDNA viruses are highlighted in bold. 848 849 Figure 5: Virus–host network between virus clusters and host classes (matrix visualization). A cell 850 in the matrix is colored when at least one virus from a virus cluster (VC, rows) was retrieved in a 851 genome from a host class (columns). This virus-host network is detected as significantly modular by lp- 852 Brim (Modularity Q = 0.45; the same index computed from 99 randomly permuted matrices ranged 853 from 0.02 to 0.17, with an average of 0.08). The different modules are highlighted in color, with inter- 854 module links in gray. Virus clusters are identified by their number and their family-level affiliation 855 (based on BLAST-based affiliation of the cluster members) is indicated next to each cluster when 856 available (virus clusters with inconsistent members affiliation are considered as “unclassified”, 857 affiliations are spread along the x-axis for spacing purpose). Host phylum and class are indicated for 858 each host column, with domains indicated above the corresponding hosts. 859 860 Figure 5–figure supplement 1: Virus–host network between virus clusters and host classes 861 (network visualization). An edge is displayed between a virus cluster (VC) and a host class when at 862 least one virus from this cluster was retrieved in a genome from the host class. This network is detected 863 as significantly modular by lp-Brim (Modularity Q = 0.45; the same index computed from 99 randomly 864 permuted matrices ranged from 0.02 to 0.17, with an average of 0.08). The different modules are 865 highlighted in color, with inter-module links in gray. Virus clusters are identified by their number and 866 their family-level affiliation (based on BLAST-based affiliation of the cluster members) is indicated 867 below each cluster when available (virus clusters with inconsistent members affiliation are considered 868 as “unclassified”). Host phylum and class are indicated for each host node, with phyla (when multiple 869 class from the same phylum are included in the network) and domains indicated above the 870 corresponding host nodes. 871 872 Figure 6: Adaptation of viral genome composition and codon usage to the host genome. K-S 873 distances between distributions of virus–host distances and virus–non-host distances for each metrics 874 (in color) and different subsets of the viral sequences (all sequences, by type, and by taxonomy). Only 875 families with more than 5 genomes are displayed (although it should be noted that the VirSorter dataset 876 includes only 6 Microviridae sequences). The number of sequences in each category is indicated in 877 brackets. Distributions used to compute distances are displayed in Figure 6–figure supplement 1. 878 879 Figure 6–figure supplement 1: (A) K-S distances between distributions of virus–host distances 880 and virus–non-host distances for each metrics (in color) and different subsets of the viral 881 sequences (based on the number of tRNA genes detected). The number of sequences in each 882 category is indicated below the number of tRNA. (B) Distribution of k-mer distances between viral 883 and cellular genomes and codon usage adaptation index for host, host genus, host family, and 884 non-host (different order) genomes. For each viral genome, the distance to the host is displayed, as 885 well as 10 randomly taken distances to genomes from each category and different subsets of the viral 886 sequences (by taxonomy on the left column, and by number of tRNA genes on the rigth column). 887 888 Figure 6–figure supplement 2: Distance between k-mer frequency vectors of virus genome 889 subsamples and host genomes for Caudovirales. Viral genomes (1,000) were randomly sub-sampled 890 at different sizes (from 2,000 to 20,000bp). Only Caudovirales genomes were selected for this 891 subsample analysis. For each size of k-mer, the result of a linear regression of distance between host or 892 non-host and viral subsample size is indicated. The same distances for the Microviridae and Inoviridae 893 (taken from Figure 6A) are indicated for comparison, and associated with the size of the reference 894 genome of each group (Enterobacteria phage phiX174 and Enterobacteria phage M13). For clarity's 895 sake, the almost-identical values for 2-mer, 3-mer and 4-mer for Microviridae are slightly horizontally 896 shifted. 897 All reference sequences Predicted Host order Host family Host genus Correct Ratio Correct Ratio Correct Ratio d < 4x10-04 98 97 98.98% 97 98.98% 97 98.98% 4x10-04 ≤ d < 1x10-03 10,173 9,361 92.02% 8,971 88.18% 5261 51.72% 1x10-03 ≤ d 2508 1872 74.64% 1757 70.06% 917 36.56% Host species excluded Predicted Host order Host family Host genus Correct Ratio Correct Ratio Correct Ratio d < 4x10-04 21 20 95.24% 20 95.24% 20 95.24% 4x10-04 ≤ d < 1x10-03 10,003 9,067 90.64% 8,372 83.69% 2992 29.91% 1x10-03 ≤ d 2755 1981 71.91% 1840 66.79% 818 29.69%

Host genus excluded Predicted Host order Host family Host genus Correct Ratio Correct Ratio Correct Ratio d < 4x10-04 1 0 0.00% 0 0.00% 0 0.00% 4x10-04 ≤ d < 1x10-03 9,085 7,303 80.39% 6,181 68.04% 0 0.00% 1x10-03 ≤ d 3693 1768 47.87% 1388 37.58% 0 0.00% 898 899 Table 1: Accuracy of host prediction based on distance (d) between tetranucleotide frequencies of 900 viral and microbial genomes. For each viral genome, the order, family and genus of its host were 901 predicted from the taxonomy of the closest microbial genome (based on the euclidean distance between 902 tetranucleotide frequency vectors) and compared to the order, family and genus of the actual host (i.e. 903 the taxonomy of the genome with which the virus was identified). These predictions were computed 904 with (i) all microbial genomes, (ii) excluding specifically all genomes from the host species, and (iii) 905 excluding all genomes from the host genus. Cases with over 75% of prediction accuracy are 906 highlighted in green. 907 908 Figure 1–source data 1: List of datasets mined for viral signal. Bacterial and archaeal genomes 909 searched with VirSorter for viral sequences originated from NCBI Refseq and WGS, as well as the 910 Microbial Dark Matter dataset (MDM, Rinke et al., 2013) and the SUP05 SAGs dataset (Roux et al., 911 2014). 912 913 Figure 1–source data 2: New virus–host associations detected in VirSorter sequences. The star (*) 914 marks the questionable detection of an Inoviridae genome in a Caldiserica SAG, which could originate 915 from another bacterium contaminating MDA reagents (see Methods). 916 917 Figure 1–source data 3: Summary table of VirSorter dataset sequences. All sequences currently 918 identified as plasmids on NCBI and which did not display any viral gene in the automatic annotation 919 from NCBI are gathered at the bottom of the table and highlighted in orange. “Detection tag” column 920 indicates how the sequence was detected as viral by VirSorter: “hallmark” for the presence of viral 921 hallmark gene(s), “refseq” for an enrichment in bacterial and archalea virus genes, “noncaudo” for an 922 enrichment in non-Caudovirales genes, and “vdb” for an enrichment in virome-like genes. 923 924 Figure 2–source data 1: Summary table of virus clusters (VCs). Cluster affiliation is based on the 925 combination of BLAST-based taxonomic affiliation of its members. For VCs with more than 10 926 proteins, those composed only of VirSorter sequences are highlighted in green and those with only one 927 sequence from RefSeqABVir are marked in blue. Cases where sequences affiliated to both ssDNA and 928 dsDNA viruses are clustered together are highlighted in red. “Detection tags” lists the different 929 detection tags for the cluster members, with “NCBI_RefSeq” for complete genomes from the RefSeq 930 database. These NCBI RefSeq sequences are counted as “complete” in the “type of sequences” colum.