Viral Dark Matter and Virus–Host Interactions Resolved from 1 Publicly
Total Page:16
File Type:pdf, Size:1020Kb
1 Viral dark matter and virus–host interactions resolved from 2 publicly available microbial genomes 3 1† 2,3 4 1‡ 4 S Roux , SJ Hallam , T Woyke , MB Sullivan 5 6 Affiliations 7 1 Ecology and Evolutionary Biology, University of Arizona, USA 8 2 Department of Microbiology and Immunology, University of British Columbia, Canada 9 3 Graduate Program in Bioinformatics, University of British Columbia, Canada 10 4 DOE Joint Genome Institute, Walnut Creek, California, USA 11 † Current affiliation: Department of Microbiology, The Ohio State University, USA 12 ‡ Current affiliation: Department of Microbiology, The Ohio State University, USA; Department of 13 Civil, Environmental, and Geodetic Engineering, The Ohio State University, USA 14 15 Corresponding author 16 Correspondence to : MB Sullivan ([email protected] ). 17 18 Manuscript category: 19 Tools and resources 20 21 Major subject areas: 22 Ecology / Genomics & evolutionary biology 23 24 Impact statement: 25 From public microbial genomes, VirSorter revealed 12,498 viral genome sequences that expand the 26 map of the global virosphere and whose analyses improve understanding of viral taxonomy, evolution 27 and virus-host interactions. 28 29 Competing interest statement: 30 The authors declare no competing interest. 31 Abstract 32 33 The ecological importance of viruses is now widely recognized, yet our limited knowledge of viral 34 sequence space and virus–host interactions precludes accurate prediction of their roles and impacts. 35 Here we mined publicly available bacterial and archaeal genomic datasets to identify 12,498 36 high-confidence viral genomes linked to their microbial hosts. These data augment public datasets 10- 37 fold, provide first viral sequences for 13 new bacterial phyla including ecologically abundant phyla, 38 and help taxonomically identify 7—38% of ‘unknown’ sequence space in viromes. Genome- and 39 network-based classification was largely consistent with accepted viral taxonomy and suggested that (i) 40 264 new viral genera were identified (doubling known genera) and (ii) cross-taxon genomic 41 recombination is limited. Further analyses provided empirical data on extrachromosomal prophages 42 and co-infection prevalences, as well as evaluation of in silico virus–host linkage predictions. Together 43 these findings illustrate the value of mining viral signal from microbial genomes. 44 Introduction 45 Over the past two decades, our collective understanding of microbial diversity has been 46 profoundly expanded by cultivation-independent molecular methods (Pace, 1997; Whitman, Coleman 47 & Wiebe, 1998; Rappé & Giovannoni, 2003; DeLong, 2009; Hanson et al., 2012). It is now widely 48 recognized that interconnected microbial communities drive matter and energy transformations in 49 natural and engineered ecosystems (Falkowski, Fenchel & Delong, 2008), while also contributing to 50 health and disease states in multicellular hosts (Clemente et al., 2012). Concomitant with this changing 51 worldview is a growing awareness that viruses modulate microbial interaction networks and long-term 52 evolution with resulting feedbacks on ecosystem functions and services (Suttle, 2007; Rodriguez- 53 Valera et al., 2009; Forterre & Prangishvili, 2013; Hurwitz, Hallam & Sullivan, 2013; Brum et al., 54 2014; Brum & Sullivan, 2015). 55 However, our understanding of viral diversity and virus–host interactions remains a major 56 bottleneck in the development of predictive ecosystem models and unifying eco-evolutionary theories. 57 This is because the lack of a universal marker gene for viruses hinders environmental survey 58 capabilities, while the number of isolate viral genomes in databases remains limited: for comparison, 59 more than 25,000 bacterial and archaeal host genomes are available in NCBI RefSeq (January 2015), 60 whereas only 1,531 of their viruses were entirely sequenced and most (86%) of these derive from only 61 3 of 61 known host phyla (Roux et al., 2015). Thus, although advances in high-throughput sequencing 62 expand the bounds of viral sequence space, these datasets are dominated by uncharacterized sequences 63 (usually 60–95%), termed “viral dark matter” (Youle, Haynes & Rohwer, 2012; Reyes et al., 2012; 64 Mizuno et al., 2013; Brum & Sullivan, 2015). In the absence of closely related isolates, viral genes and 65 genomes remain unlinked to hosts, which greatly limits ecological and evolutionary inferences. 66 Alternatively, viral sequence space can be explored in a known host context by revealing 67 putative viral sequences hidden in microbial genomes. Such signal was first analyzed through 68 annotation of prophages—viral genomes integrated in microbial genomes. Numerous tools exist to 69 automatically detect prophages (Fouts, 2006; Lima-Mendez et al., 2008a; Zhou et al., 2011; Akhter, 70 Aziz & Edwards, 2012), so prophage diversity and abundance are relatively well-studied (Casjens, 71 2003; Canchaya, Fournous & Brüssow, 2004). Early estimations, when only a few hundred bacterial 72 genomes were available, suggested that prophages are common (62% of bacterial genomes tested 73 contained at least one), existing as intact and functional forms or in varying degrees of decay (Casjens, 74 2003). Given that tens of thousands more microbial genomes are now publicly available, it is expected 75 that many new prophages and other viral sequences remain to be discovered. 76 Further, other viral signals might be prevalent in modern microbial genomic datasets. First, certain 77 types of prophage do not integrate into the host genome. These “extrachromosomal prophages” (also 78 termed “plasmid prophage”) exist outside the microbial chromosome until induced to undergo lytic 79 replication. These have been known to occur for decades (e.g., coliphage P1, Sternberg & Austin, 80 1981), though their abundance in nature is unknown. Second, some phages can enter a “chronic” cycle, 81 in which they replicate in the cell outside of the host chromosome, and produce virions that are 82 extruded without killing their host (Abedon, 2009; Rakonjac et al., 2011). Third, a phage “carrier state” 83 has been observed, in which a lytic phage is maintained and multiplied within a cultivated host 84 population without measurable effect on cell growth (Bastías et al., 2010). This phenomenon is thought 85 to arise due to the presence of both resistant and sensitive cells that frequently transition between these 86 two states. Sometimes also termed “partial resistance”, such states that enable the coexistence of phage 87 and host in culture have now been observed in different systems (Vibrio, Escherichia coli, Salmonella, 88 Flavobacterium), and are linked to slight decreases in growth rate or cell concentration but no host cell 89 clearing as would be observed for “typical” lytic viruses (i.e., plaque formation), thus could go 90 unnoticed in a microbial cell culture (Fischer et al., 2004; Carey-Smith et al., 2006; Middelboe et al., 91 2009). All three of these lesser studied types of infection would result in the assembly of viral 92 sequences outside of the main host chromosome in a microbial genome sequencing project, and could 93 be a new type of viral signal in modern microbial genomic datasets due to deep sequencing and public 94 release of draft (i.e., not completely assembled) genomic sequences. 95 Finally, single amplified genome (SAG) datasets, sourced from anonymously sorted, amplified, 96 and sequenced cells, are especially valuable for accessing the vast majority of environmental microbes 97 that remain uncultivated in the lab (Rinke et al., 2013; Kashtan et al., 2014). Single-cell amplified 98 genomes can reveal viral sequences directly linked to uncultivated hosts (Yoon et al., 2011; Roux et al., 99 2014; Labonté et al., 2015). When combined with metagenomic sequences, these data provide 100 information on population dynamics, lineage-specific viral-induced mortality rates, relative ratios of 101 prophages and current lytic infections, as well as putative links between viral infection and host 102 metabolic state (Roux et al., 2014; Labonté et al., 2015). Thus as microbial genomic datasets evolve 103 from complete genomes to fragmented draft and single-cell genomes, new windows into viral diversity 104 and virus–host interactions are opened. 105 Here we applied a recently developed and automated virus discovery pipeline, VirSorter (Roux 106 et al., 2015), to mine the viral signal from 14,977 publicly available bacterial and archaeal genomic 107 datasets. This identified 12,498 high-confidence viral sequences with known hosts, ~10-fold more than 108 in the RefSeqVirus database, that we then used to expand our understanding of viral diversity and 109 virus–host interactions. 110 111 Results & Discussion 112 113 New viruses detected in public microbial genomic datasets with VirSorter 114 VirSorter is designed to predict bacterial and archaeal virus sequences in isolate or single-cell 115 draft genomes, as well as complete genomes (Roux et al., 2015). Briefly, VirSorter identifies viral 116 sequences through (i) statistical enrichment in viral gene content, using a reference database composed 117 of viral genomes of archaeal and bacterial viruses from RefSeq (hereafter named RefSeqABVir for 118 “RefSeq Archaea and Bacteria Viruses”) and assembled from viral metagenomes (database “Viromes” 119 in VirSorter), or (ii) a combination of viral “hallmark” gene(s) that code for virion-related functions 120 such as major capsid proteins or terminases (Koonin, Senkevich & Dolja, 2006;