
Provirophages in the Bigelowiella genome bear testimony to past encounters with giant viruses

Guillaume Blanc (a,1,2), Lucie Gallot-Lavallée (a), and Florian Maumus (b,1,2)

(a) Laboratoire Information Génomique et Structurale, UMR7256 (Institut de Microbiologie de la Méditerranée FR3479) CNRS, Aix-Marseille Université, 13288 Marseille cedex 9, France; and (b) INRA, UR1164 Unité de Recherche Génomique-Info, Institut National de la Recherche Agronomique de Versailles-Grignon, 78026 Versailles, France

Edited by Peter Palese, Icahn School of Medicine at Mount Sinai, New York, NY, and approved July 24, 2015 (received for review April 1, 2015)

Virophages are recently discovered double-stranded DNA satellites that prey on giant viruses (nucleocytoplasmic large DNA viruses; NCLDVs), which are themselves parasites of unicellular eukaryotes. This coupled parasitism can result in the indirect control of eukaryotic mortality by virophages. However, the details of such tripartite relationships remain largely unexplored. We have discovered ∼300 predicted genes of putative virophage origin in the nuclear genome of the unicellular alga Bigelowiella natans. Physical clustering of these genes indicates that virophage genomes are integrated into the B. natans genome. Virophage inserts show high levels of similarity and synteny between each other, indicating that they are closely related. Virophage genes are transcribed not only in the sequenced B. natans strain but also in other Bigelowiella isolates, suggesting that transcriptionally active virophage inserts are widespread in Bigelowiella populations. Evidence that B. natans is also a host to NCLDV members is provided by the identification of NCLDV inserts in its genome. These putative large DNA viruses may be infected by B. natans virophages. We also identify four repeated elements sharing structural and genetic similarities with transpovirons (a class of mobile elements first discovered in giant viruses) that were probably independently inserted in the B. natans genome. We argue that endogenized provirophages may be beneficial to both the virophage and B. natans by (i) increasing the chances for the virophage to coinfect the host cell with an NCLDV prey and (ii) defending the host cell against fatal NCLDV infections.

virophage | nucleocytoplasmic large DNA virus | microbial community | endogenous virus | Maverick/polinton

Significance

Virophages are viruses that hijack the replication machinery of giant viruses for their own replication. Virophages negatively impact giant virus replication and improve the survival chances of eukaryotic cells infected by giant viruses. In this study, we identified segments of the Bigelowiella natans genome that originate from virophages and giant viruses, revealing genomic footprints of battles between these viral entities that occurred in this unicellular alga. Interestingly, genes of virophage origin are transcribed, suggesting that they are functional. We hypothesize that virophage integration may be beneficial to both the virophage and B. natans by increasing the chances for the virophage to coinfect the cell with a giant virus prey and by defending the host cell against fatal giant virus infections.

Author contributions: G.B. and F.M. designed research; G.B., L.G.-L., and F.M. performed research; G.B., L.G.-L., and F.M. analyzed data; and G.B. and F.M. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
See Commentary on page 11750.
1 G.B. and F.M. contributed equally to this work.
2 To whom correspondence may be addressed. Email: [email protected] or [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1506469112/-/DCSupplemental.

Sputnik was first described in 2008 as a new class of small icosahedral viruses with an ∼20-kb circular double-stranded DNA genome (1). Sputnik is a satellite virus, because its replication depends upon proteins produced by the nucleocytoplasmic large DNA virus [NCLDV; also giant virus or proposed order Megavirales (2)] Acanthamoeba polyphaga Mimivirus (APMV; Mimiviridae) and replicates in APMV viral factories. Sputnik was shown to inhibit replication of its host virus and thus acted as a parasite of that virus. In analogy to the term bacteriophage it was called a virophage, but this designation has been challenged (3). Three additional virophages infecting members of the Mimiviridae, namely Sputnik 2, Rio Negro, and Zamilon, were subsequently reported (4–6). Virophages that prey on giant viruses that infect heterotrophic nanoflagellates and microalgae have also been discovered, including Organic Lake virophage 1 [OLV1 (7)], Mavirus (8), and a virophage of the Phaeocystis globosa virus (PgVV) (9), yet the classification of the latter as a virophage sensu stricto is uncertain. In addition, complete or near-complete virophage genomes have been assembled from environmental DNA: Yellowstone Lake virophages 1–7 (YSLV1–7) and Ace Lake Mavirus (ALM) (10, 11). Overall, virophage genomes have similar sizes (∼18–28 kb) and low G+C content (∼27–39%) and are related to Sputnik by genetic and structural homologies (12). Among the 20–34 protein-coding sequences predicted in virophage genomes, the putative core gene set comprises six genes encoding the FtsK-HerA family DNA-packaging ATPase (ATPase), primase-superfamily 3 (S3) helicase, cysteine protease (PRO), and zinc-ribbon domain (ZnR) as well as major and minor capsid proteins (MCPs and mCPs, respectively) (12). In addition, genes encoding two different families of integrases have been identified in several virophages: A putative rve integrase was found in Mavirus and ALM (8, 10), whereas Sputnik encodes a putative tyrosine integrase (1). Among core genes, only PRO, ATPase, MCP, and mCP support the monophyly of virophages, whereas the remaining complement shows complex phylogenies suggestive of gene replacement (12).

Remarkably, phylogenetic analysis of the Mavirus rve integrase indicated that it is mostly related to homologs from eukaryotic mobile elements of the Maverick/polinton (MP) family (8). The polintons are widely distributed in diverse protists and animals and were initially classified as transposable elements (TEs) (13, 14). However, convincing arguments support the hypothesis that polintons encode capsid proteins and might be bona fide viruses (15). Because Mavirus was reported to display further synapomorphy with a putative MP from the slime mold Polysphondylium pallidum, it was hypothesized that MPs may have originated from ancient Mavirus relatives that would have acquired the capability of intragenomic transposition (8). However, this hypothesis was recently challenged by Yutin et al. (12). A critical prerequisite for such an evolutionary scenario is the integration of virophage DNA in the genome of a eukaryotic host that would permit vertical transmission and adaptation to an intracellular parasitic lifestyle. However, although Sputnik 2 was shown to integrate into the genome of its host (4), evidence of virophage insertions in eukaryotic genomes is lacking.

E5318–E5326 | PNAS | Published online August 24, 2015 | www.pnas.org/cgi/doi/10.1073/pnas.1506469112

The coinfection of host cells by NCLDVs and virophages has been shown to limit the production of NCLDV particles, accompanied by greater survival of the eukaryotic host (1, 5, 6, 8, 16). At the community level, these parasitic relationships result in complex interplays between virophages, NCLDVs, and eukaryotic hosts. As a result, virophages indirectly positively regulate the population size of eukaryotic hosts. At the global scale, these interactions may have significant impacts on biogeochemical cycles. For instance, in the marine environment, the co-occurrence of giant viruses and virophages in the context of algal blooms may influence the overall carbon flux as proposed for Antarctic lakes (7). Nevertheless, such tripartite community networks remain poorly explored except on theoretical grounds (17, 18).

The ecological prominence and diversity of virophages are largely unknown and await the isolation and sequencing of new specimens. Recently, we demonstrated the value of searching integrated viral DNA in the genomes of potential eukaryotic hosts to identify new members of the NCLDVs (19). Here we analyzed the nuclear genome assemblies of 1,153 fully sequenced eukaryotes and report the identification of integrated virophage elements in the genome of the chlorarachniophyte Bigelowiella natans (supergroup Rhizaria). This discovery led to the prediction that this alga is also the host of viruses that are members of the NCLDVs. In support of this prediction, we also identified inserts of likely NCLDV origin. We investigated the transcriptional activity of the B. natans genome using RNA-sequencing (seq) data and show that virophage-like genes are actively transcribed in different B. natans strains whereas NCLDV-like genes tend to be silent, albeit with notable exceptions. Finally, we identified repeated genetic elements that have structural and genetic similarities to transpovirons, a distinct class of mobile elements associated with giant viruses that were first discovered in members of the Mimiviridae (4). We discuss the biological relevance of integrated, actively transcribed virophages and propose a model for the mode of virophage–NCLDV coinfection. Altogether, our results contribute to the understanding of the genetic interactions occurring within microbial communities between eukaryotes, virophages, and NCLDVs.

Results

Integrated Virophage-Like Elements in the Algal Genome. A previous comparative analysis of fully sequenced virophage genomes revealed six core proteins or protein domains that are universally conserved, including S3 helicase, zinc-ribbon domain, major capsid protein, minor capsid protein, DNA-packaging ATPase, and a cysteine protease (12). The four latter proteins were shown to produce consistent monophyletic clades that contrasted virophages from polintons, a class of repeated elements related to virophages. We therefore used these proteins as markers of DNA inserts of putative virophage origin in eukaryotic genome sequences. In practice, we searched 1,153 predicted proteomes of protists, fungi, and basal metazoans for homologs of the four virophage markers using BLASTP. The proteome of B. natans was the only one to exhibit homologs for each of the four virophage core protein families. No homolog for any of the virophage markers was identified in the other proteomes using predefined family-specific score thresholds.

To better delineate the subset of B. natans proteins that have a potential virophage origin, we used a score plot approach. BLAST scores obtained between B. natans predicted proteins and their best virophage matches were plotted against the respective BLAST scores obtained between the B. natans predicted proteins and their best cellular matches (Fig. 1). Blue dots below the diagonal identify B. natans proteins that have higher similarity to a virophage protein than to a homolog in a cellular organism. Overall, 103 B. natans proteins had a match within a virophage proteome, of which 64 had a higher score with virophages than with cellular organisms. Furthermore, examination of the physical location of the virophage-like protein genes revealed that they tend to cluster in specific loci in the genome assembly, revealing large regions of possible virophage origin.

Fig. 1. Similarity plot of B. natans proteins against virophage/NCLDV and cellular best hits. Circles represent relative BLASTP scores of B. natans proteins aligned against their best cellular hits in the NR database (y axis) and their best viral hits among NCLDVs or virophages (x axis). When no cellular hit was recorded whereas a viral hit was obtained, the cellular score was set to zero. BLAST scores were normalized by dividing them by the score of the alignment of the query sequence against itself. Circles are colored according to the origin of the best overall scoring hit (yellow, cellular organisms; blue, virophages; green, NCLDVs).

A total of 38 virophage-like elements (VLEs) ranging from 100 base pairs (bp) to 33.3 kb were detected by nucleotide alignments in the B. natans genome assembly. Many VLEs correspond to truncated copies of larger elements, suggesting recurrent insertions followed by degradation. However, the number of VLEs may be misestimated because some of them lie at the end of contigs, suggesting that VLEs are difficult to assemble. None of the VLEs were located on the nucleomorph chromosomes, mitochondrial genome, or plastid genome. The cumulated size of VLE sequences reaches 327 kb. The VLEs were highly similar to each other (nucleotide identities averaged 91.3%), indicating that they belong to the same family of closely related elements. However, sequence conservation was occasionally interrupted by unique sequences containing one or more genes as shown in Fig. S1, revealing insertion or deletion events that occurred subsequent to their divergence. Some VLEs may be unable to produce viable virophages (i.e., unable to complete a full replication cycle) because important genes may have been lost following integration. Alternatively, the difference in gene content between VLEs may reflect the genetic diversity of virophages before their integration. Six of the identified VLEs contained terminal inverted repeats (TIRs) of 2.0–2.6 kb at their two extremities. TIRs were also described at the extremities of the PgVV virophage-like element [associated with the virus PgV-16T infecting P. globosa (9)] and polintons, and are common among poxviruses, chloroviruses, and asfarviruses (7). In B. natans, each TIR contained at least two putative ORFs. Other VLEs contained a single TIR copy at one of their extremities, most probably because these elements were truncated.

As shown in Fig. 2A, the G+C content of VLEs (36.4% on average) was markedly lower than the background G+C content of the host genome (44.9% on average). Such a difference in G+C content suggests that the VLEs have been acquired horizontally in the relatively recent past.

Virophage-Like Element Genes. Overall, 298 ORFs (>90 codons) were predicted out of the 38 VLEs and organized into 54 gene families (Dataset S1). The largest element of 33.3 kb was identified on scaffold 2 (positions 1,655,224–1,688,550; Fig. 2B) and contained 27 predicted ORFs representing 25 distinct gene families listed in Table 1. Functional annotation could be predicted for only 14 of the pan-VLE gene families. Furthermore, 39 gene

families did not have a significant match in sequence databases (BLASTP E value < 0.01). Such a small fraction of predicted function is a trademark of virophages that generally contain a majority of orphan genes (1, 7, 8). Also, the functions encoded by the B. natans VLEs are reminiscent of those of bona fide virophages. Homologs could be identified for each of the six core genes that are ubiquitously conserved in complete virophage genomes (12). Importantly, these include the major and minor capsid protein genes that are among the viral hallmark proteins distinguishing viruses from other types of mobile elements. In addition, the VLEs encode two key proteins involved in virion maturation, namely the protease and the packaging ATPase. The core virophage zinc-ribbon domain is fused with a GIY-YIG nuclease domain as in Mavirus and OLV; notably, the corresponding protein (Bn119_7) is encoded by only one of the VLEs on scaffold 119, suggesting that the other VLEs have lost the gene at some point in their evolution. Furthermore, the VLEs were found to encode an rve family integrase closely related to the one encoded in Mavirus and ALM. This protein might have catalyzed the integration of the original virophage-like DNA into the algal genome. The VLEs also contain five gene families homologous to sequences that have a more patchy distribution among virophages, including genes for lipase, ribonucleotide reductase small subunit, B family DNA polymerase, and two ORFs with unknown function.

Each TIR region encompasses two ORFs, resulting in two pairs of duplicated genes in complete VLEs. The outward duplicated ORFs within TIRs lack functional annotation. On the other hand, the inward duplicated ORFs encode Kelch-repeat domains most similar to eukaryotic and poxvirus homologs (Fig. S2A) that may be involved in protein–protein interaction. Interestingly, these ORFs terminate outside the TIR regions, so that the N termini differ between the two protein copies. One of the N termini (Bn2_31) encodes a domain with no match in sequence databases, whereas the N-terminal region of the other protein (Bn2_2) encodes a protein kinase (PK) domain that is indirectly related to those encoded in various giant viruses (Fig. S2B). Sequenced virophages have been shown to contain genes that are closely related to homologs from their respective giant virus hosts (1, 7, 8, 12). These apparent host-derived genes encode different repetitive proteins (distinct forms of collagen-like repeats in Sputnik and OLV, and FNIP repeats in Mavirus) that could be implicated in the interaction of the virophages with their giant virus hosts (12). The VLEs encode two gene families that are most similar to giant virus homologs (i.e., Bn2_18 and Bn2_28), but neither encodes repetitive proteins. Conversely, the Kelch domain proteins (i.e., Bn2_2 and Bn2_31), ankyrin domain protein (Bn161_7), and adhesin-like protein (Bn2_5) are proteins with repetitive domains likely implicated in the interaction of the virophages with their giant virus hosts, but they have no significant match in sequenced giant viruses.

Fig. 2. Schematic representation of the largest virophage-like element of B. natans. (A) G+C content curve in the region of the virophage-like element on scaffold 2. (B) Gene map of the linear virophage-like element. Genes displayed on the outer and inner rims of the outermost track are coded on the direct (Watson) and reverse (Crick) strand, respectively. Purple and green genes correspond to core virophage genes and genes with annotated functions, respectively. The middle track indicates the number of RNA-seq reads mapped to the regions (logarithmic scale). The inner track represents the G+C skew, with positive values in red and negative values in blue. The genome coordinates (in kb) are indicated in the innermost track of the map.

Origin of Virophage-Like Elements. We performed a phylogenetic analysis of the virophage-like proteins that had sufficient evolutionary conservation with homologs in sequence databases. For each protein family, the B. natans paralogs aggregated in a single clade containing short branches, further confirming the very close relationships between the different VLEs. The packaging ATPase and protease were previously shown to support the monophyly of virophages (12). The corresponding B. natans proteins cluster within the virophage clade (Fig. 3 A and B). In both these trees, the B. natans proteins group with Sputnik, OLV, and YSLVs, whereas Mavirus and ALM cluster in a sister clade. The phylogeny of the major capsid proteins is compatible with the scenario of a closer evolutionary proximity of the B. natans VLEs to OLV, YSLVs, and Sputnik than to the ALM-Mavirus clade (Fig. 3C). The same conclusion can be drawn from the phylogenetic reconstruction of the virophage lipase (Fig. S2C). Moreover, the fast-evolving minor capsid protein was conserved between Sputnik, OLV, Zamilon, YSLVs, and the B. natans elements, whereas the Mavirus and ALM homologs were comparatively too diverged to allow accurate phylogenetic reconstruction, further pointing to a more distant relationship with the latter (Fig. 3D). Altogether, these features suggest that the B. natans virophage elements share a common ancestor with Sputnik-OLV-YSLVs. However, the exact timing of speciation events within this subtree could not be established, owing to incongruences between the phylogenetic reconstructions.

The phylogenetic trees for the other B. natans virophage-like proteins disclosed more complex evolutionary scenarios. It was suggested that gene replacements and horizontal gene transfers might explain the entangled phylogenetic relationships between virophages, large DNA viruses, and certain classes of mobile elements related to virophages such as polintons (12). An elegant example is the ORFs encoding a superfamily 3 helicase domain that exists in all fully sequenced virophages (core gene) but fails to support monophyly of the latter, revealing multiple origins of the virophage genes (Fig. S2D). Nevertheless, the corresponding B. natans proteins cluster with the YSLV3 homolog, suggesting that

they originate from a common virophage ancestor. B. natans genes encoding rve family integrase (Fig. S2E), an unknown protein family represented by Bn2_27 (Fig. S2F), DNA polymerase (Fig. S2G), and GIY-YIG nuclease domains (Fig. S2H) exhibit preferential phylogenetic affinities, albeit with moderate bootstrap support, with eukaryotic homologs encoded by polinton-related elements. These mixed clades are nested within larger clades containing virophages and environmental sequences, suggesting that gene acquisitions or replacements most likely occurred in the polinton-like elements. In contrast, the phylogenetic trees of the unknown protein family represented by Bn2_18 (Fig. S2I), the ribonucleotide reductase small subunit (Fig. S2J), and the Kelch protein family (i.e., PK and Kelch domains have distinct origins; Fig. S2 A and B) support scenarios of gene acquisition from different sources (bacteria, eukaryotes, or dsDNA viruses). Thus, the VLE genes reveal a mosaic origin that is typical of bona fide virophages (1, 7, 8). Altogether, the structure, gene content, and phylogenetic affinities of VLEs provide substantial evidence that they represent remains of integrated virophage genomes.

Table 1. Virophage-like genes

ORF no.    Putative function                        RPKM    Percentile expression,* %   Best hit†

Scaffold 2 largest element
Bn2_1      Orphan protein                           0.2     3.9
Bn2_2      Kelch and kinase domain protein          16.7    74.5                        Dendroctonus ponderosae 478256302 (2e-32)
Bn2_3      Orphan protein                           6.3     55.1
Bn2_4      Unknown virophage protein                1.8     27.5                        Zamilon 563399747 (2e-04)
Bn2_5      Adhesin-like protein                     1.0     19.0                        Escherichia coli 693111543 (9e-13)
Bn2_7      Orphan protein                           0.9     16.7
Bn2_8      Orphan protein                           0.0     0.0
Bn2_9      DNA polymerase B                         0.9     17.2                        Mavirus 326439151 (2e-09)
Bn2_10     Orphan protein                           0.5     10.4
Bn2_11     S3 helicase                              0.8     15.1                        YSLV5 701905635 (2e-28)
Bn2_13     Orphan protein                           604.1   99.2
Bn2_14     Orphan protein                           125.7   95.6
Bn2_15     Orphan protein                           194.4   97.2
Bn2_17     Orphan protein                           21.8    78.7
Bn2_18     DnaJ domain protein                      24.4    80.4                        Ostreococcus lucimarinus virus 313843979 (2e-39)
Bn2_19     DNA-packaging ATPase                     14.2    71.8                        OLV 322510450 (7e-25)
Bn2_20     Orphan protein                           9.4     63.8
Bn2_21     Cysteine protease                        1.5     24.5                        OLV 322510453 (5e-08)
Bn2_22     Lipase                                   5.1     50.5                        Mavirus 326439161 (6e-04)
Bn2_23     Orphan protein                           4.2     46.1
Bn2_25     Minor capsid protein                     4.2     45.8                        OLV 322510454 (2e-11)
Bn2_26     Major capsid protein                     3.9     44.0                        OLV 322510455 (3e-16)
Bn2_27     Unknown protein                          1.4     24.0                        Guillardia theta 551643434 (4e-16)
Bn2_28     Ribonucleotide reductase small subunit   1.6     25.9                        virus 310831442 (3e-80)
Bn2_29     rve integrase                            2.6     35.3                        Dictyostelium fasciculatum 470248944 (3e-39)
Bn2_31     Kelch domain protein                     33.4    84.5                        Strongylocentrotus purpuratus 390342441 (6e-35)
Bn2_33     Orphan protein                           0.2     5.2

Other remarkable ORFs found in smaller elements
Bn119_7    ZnR and GIY-YIG domains                  3.6     42                          Phytophthora sojae 695382398 (3e-05)
Bn161_7    Ankyrin domain protein                   93.4    94                          Amoebophilus asiaticus 501449850 (3e-34)
Bn92_2     Unknown phage protein                    0.0     1                           Synechococcus phage 472343273 (1e-05)
Bn92_3     Ankyrin domain protein                   0.1     4                           Pseudogymnoascus pannorum 682412062 (3e-08)
Bn187_9    Unknown phage protein                    20.4    78                          Vibrio phage 510792797 (5e-06)
Bn84_33    Unknown virophage protein                3.0     39                          YSLV5 701905611 (2e-07)

*Percentile rank calculated over all B. natans genes.
†Species name and GenBank identifier (BLAST E value).

VLE Genes Are Transcribed. To investigate the transcriptional activity of the B. natans virophage-like elements, we analyzed a previously published RNA-seq dataset generated from cultivated B. natans cells (20). A total of 45.3 million Illumina paired-end reads were aligned onto the B. natans genome assembly (Dataset S2), of which 116,671 mapped within one of the virophage-like regions (Dataset S1). Two hundred seventy-eight out of the 302 predicted virophage-like ORFs had at least one read mapped to

it. Transcription levels were compared between B. natans genes by means of the RPKM metric (reads per kilobase per million mapped reads). Ninety-three virophage-like ORFs had RPKM values >5, which ranks them in the top half of the most-transcribed genes in B. natans, including 10 genes that figure in the top 10% (i.e., RPKM >50). Interestingly, these highly transcribed virophage-like ORFs encode one major and one minor capsid protein, one DNA-packaging ATPase, one rve integrase, two Kelch domain proteins, and three families of orphan proteins. Other virophage core genes are generally transcribed at low to moderate levels (i.e., the majority of them have RPKM values ranking between the 20th and 36th percentiles), yet 6 of the 10 ATPase gene copies have substantial transcription levels, as indicated by RPKM values ranking between the 50th and 90th percentiles. Interestingly, we identified transcript sequences closely related to the VLEs in assembled RNA-seq datasets generated from various B. natans isolates that were sequenced as part of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) (21) (Dataset S3). This suggests that the same virophage elements are present and transcribed in other B. natans strains, possibly because they were inherited vertically from a common ancestor.

Fig. 3. Maximum-likelihood phylogenetic trees of conserved virophage proteins. (A) Packaging ATPase. (B) Maturation protease. (C) Major capsid protein. (D) Minor capsid protein. Branches with bootstrap support (expected-likelihood weights) less than 50% were collapsed. Sequences marked with an asterisk were taken from Repbase (38). For other sequences, the species name abbreviation and GenBank accession number are indicated; env, marine metagenome. Species: Amsmo, Amsacta moorei entomopoxvirus “L”; Caebr, Caenorhabditis brenneri; Caere, Caenorhabditis remanei; Dicfa, Dictyostelium fasciculatum; Giala, Giardia lamblia; Melsa, Melanoplus sanguinipes entomopoxvirus; Salsp, Salpingoeca rosetta; Tetvi, Tetraselmis viridis virus; uncvi, uncultured virus. Taxa: Ea, Amoebozoa; El, Opisthokonta; u2, . Dark red, virophages; blue, (predicted) polintons and related elements; light red, NCLDV; gray, unassigned environmental sequences. The numbers of validated amino acid positions in cleaned alignments are 210 (A), 166 (B), 548 (C), and 408 (D).

This observation questions the biological significance of the expression of virophage-like genes. One possibility is that the integrations of virophage genomes were accidental events, and that the residual transcriptional activity reflects the fortuitous recognition of regulatory signals of the virophage genes by the cellular host transcription complex. According to this scenario, the expression of the virophage genes has no expected biological effect (neutral) and would disappear as the virophage sequences decay by accumulating random mutations. Alternatively, the observed transcriptional activity might reflect an adaptive strategy that benefits the cellular host population, the virophage, or both. Experimental evidence indicates that virophages have a positive effect on the host-cell population. Mavirus interferes with Cafeteria roenbergensis virus propagation and increases the survival of the host-cell population (8). Sputnik causes a 70% decrease in infective particles and a threefold decrease in cell lysis (1); it also delays or abolishes replication of (22). A model of population dynamics suggests that the presence of virophages reduces overall mortality of the host algal cell after a bloom (7). Hence, eukaryotes that are susceptible to infection by giant viruses will gain a selective advantage if they can stably associate themselves with virophages (8). An analogous hypothesis was briefly put forward by Katzourakis and Aswad (23) to explain the possible emergence of Maverick/polinton elements from hypothetical integrated virophage genomes in eukaryotes. Under this scenario, the hijacked virophage genes evolve under negative selection in the new eukaryotic genome environment.

Another possible explanation, which is not mutually exclusive with the former, relates to the adaptive strategy of virophages to increase the frequency of coinfection with a host virus. Different modes of virophage entry into the cell have been evaluated on theoretical grounds (18). However, these scenarios exclusively consider the simultaneous entry of the virophage and the host virus, either independently or in a paired mode [i.e., the virophage adheres to the virus before infection; see Taylor et al. (18)]. Indirect evidence supports the paired-entry mode for Sputnik (1, 22), whereas an independent-entry mode has been observed for Mavirus (8). The transcribed B. natans virophage elements bring out a third hypothetical mode that we have dubbed the delayed-entry mode (Fig. 4). According to this scenario, the virophage is expected to enter the cell first by an unknown mechanism; its DNA reaches the cell nucleus, integrates into the cellular genome, and remains latent until superinfection by a virus reactivates and rescues the virophage, allowing its replication in the virus factory. This hypothetical scenario gets support from the observation that the B. natans virophage elements encode integrases, which suggests that genome integration is an active process rather than an incidental event. To go further into the delayed-entry scenario, it is possible that the transcription of virophage genes leads to the production of sentinel proteins that are able to detect infection of the host cell by another virus and transduce a signal triggering virophage reactivation. There is an obvious advantage of the delayed-entry mode over the independent-entry mode when the simultaneous independent entry of both the virus and virophage is a rare event due to, for example, high dilution of the virus particles in the environment. Furthermore, the integrated virophage can be passed on to the next generations of host cells, contributing to its spread and multiplication. In the form of an integrated provirophage, the virophage can potentially wait for virus superinfection during long periods of time, whose length depends on the rate at which random mutations inactivate the endogenous virophage element.

Fig. 4. Scenarios of infection of B. natans cells by NCLDVs and virophages. Scenario (S)1 starts with B. natans cells devoid of VLE insertions. An NCLDV particle (yellow hexagon) or its DNA enters the cell and establishes viral factories (VFs) that produce new NCLDV particles. The infection causes cell lysis and death accompanied by release of NCLDV particles. Scenario 2 begins with B. natans cells carrying VLEs in the form of functional provirophages integrated in the nuclear genome (N). VLEs may have been produced by independent entry of a virophage followed by active DNA integration in the host nuclear genome (delayed-entry mode). Upon NCLDV infection, expressed VLE proteins may inhibit virus penetration or trigger reactivation and excision of the provirophage, which in turn inhibits NCLDV replication and takes advantage of the viral factories for its own replication. As a result, a limited number of NCLDV particles are created compared with S1, leading to increased rates of cell survival. Potentially, new virophages and a limited number of NCLDV particles are released in the environment through exocytosis or another unknown mechanism that does not kill the host cell.

Putative NCLDV Insertions. All virophages characterized so far have been shown to infect members of the Mimiviridae (NCLDVs). Thus, although no large DNA virus of B. natans has been identified so far, the discovery of virophage sequences in B. natans suggests that this alga is the prey of giant viruses that are themselves infected by virophages. DNA fragments that are relics of integrated NCLDV genomes have been discovered in various eukaryotic genomes (24–26), revealing footprints of past interaction between viruses and hosts (19). We therefore searched the B. natans genome for sequences related to NCLDVs using the BLAST score plot approach. As shown in Fig. 1, a total of 36 B. natans predicted proteins have better alignment scores with homologs in NCLDVs than in cellular organisms (Dataset S4). The closer relationship of these proteins with giant virus homologs was confirmed by phylogenetic reconstruction (for examples, see Fig. S3). This phylogenetic affinity may reflect horizontal gene transfer between viruses and eukaryotic hosts, the polarity of which (virus to host or host to virus) cannot always be determined with certainty. However, some proteins are homologous to typical NCLDV core genes, including major capsid protein and packaging ATPase (distantly related to the virophage ones), which have most likely been acquired by the B. natans host (Fig. S3 A and B). Interestingly, some NCLDV-like genes are physically clustered in discrete regions of the B. natans genome assembly (Fig. S4). For example, we identified a 165-kb region of scaffold 10 (position ∼340–505 kb) that is transcriptionally silent according to our analysis of the RNA-seq dataset (Dataset S4). Such a large untranscribed region is unique in the B. natans genome. This region contains 83 predicted genes, of which 9 have obvious NCLDV affinities, including genes for DNA-packaging ATPase, exonuclease, major capsid protein, and transcription factor. Of the remaining genes, only 7 have best matches in eukaryotes (albeit with low similarity) and 4 have best matches with homologs in bacteria and phages. The large majority of predicted genes (63/83 = 76%) are orphans, a characteristic shared with giant virus genomes. Thus, it is likely that this large DNA stretch is the remnant of an integrated NCLDV genome similar to those previously observed in various eukaryotic genomes (19, 24–27). Inserts of likely NCLDV origin identified in the moss genome were also reported as transcriptionally silent (19). These hypothesized viral inserts probably behave like neutrally evolving nonfunctional DNA. The G+C content of the NCLDV-like region (45.5%) is similar to the background G+C content of the B. natans genome (44.8%), which prevents precisely identifying the insert boundaries.

Outside of the large NCLDV-like insert, some B. natans genes with viral phylogenetic affinity were found to be transcribed (Dataset S4). Six genes have transcription levels among the top 50% of B. natans genes. These genes may have been inherited from the B. natans ancestor and captured by large DNA viruses from eukaryotic hosts that are closely related to B. natans. This scenario can explain the preferential affinity between B. natans proteins and NCLDV homologs, whereas the corresponding genes actually have a eukaryotic origin. Alternatively, the corresponding genes may have been recruited in the metabolism of a Bigelowiella ancestor subsequent to their acquisition from viral donors, perhaps now fulfilling new cellular functions. For example, B. natans has one copy for each of the four regular B family eukaryotic DNA polymerase catalytic subunits alpha, delta, epsilon, and zeta. However, the alga possesses two extra delta DNA polymerases that group in phylogenetic positions compatible with distinct viral origins (Fig. S3C). The two viral-like DNA polymerase catalytic subunits are transcribed (i.e., 43rd and 53rd percentiles) at higher levels than those of the four regular eukaryotic isoforms (25th–40th percentiles). Furthermore, they have a large number of introns (i.e., 19 and 30 introns, respectively), whereas viral genes are generally devoid of introns or, in rare cases, only contain a small number of them.

Blanc et al. PNAS | Published online August 24, 2015 | E5323 The two extra DNA polymerases have no identifiable orthologs Conclusion in other sequenced eukaryotes, except in the transcriptomes of The first discovered virophages have the form of small icosahe- other Bigelowiella isolates (Dataset S4). We can therefore not dral virion particles and have been shown to infect giant viruses. exclude a scenario in which the corresponding genes have been However, integrated virophages have been found more recently captured from viral donors and were progressively “eukaryo- in a Mimivirus genome (4). Here we show for the first time, to genized” through the accumulation of introns. our knowledge, that virophage genomes can also integrate in a eukaryotic genome. This finding led us to predict that B. natans Putative Insertions. Four of the six NCLDV-like might be the prey for NCLDVs. Additional integrated DNA DNA-packaging ATPase genes are carried by four closely re- fragments that most probably originate from NCLDV genomes – lated repeated genetic elements (18.8 21.8 kb in length) flanked provide data showing that B. natans or its recent ancestor had – by TIRs 300 900 bp in length (Fig. 5). These repeated elements physical contacts with NCLDV members. Furthermore, we also contain between 17 and 22 ORFs (Dataset S5), but some of the identified repeated genetic elements that resemble transpovirons original genes seem to have accumulated internal stop codons associated with Mimivirus. Thus, the B. natans genome appears and frameshifts, resulting in truncated . Very few to have recorded genetic footprints of molecular “battles” be- proteins encoded by the repeated elements exhibit detectable tween virophages, transpovirons, and giant viruses. B. natans similarity in public databases. They include a homolog of the belongs to the , a group of unicellular ma- Aureococcus anophagefferens virus protein AaV202, a homolog rine algae that acquired a by secondary endosymbiosis of the P. 
globosa virus virophage protein PgvV_00016, a homolog involving engulfment of a green alga by a eukaryotic heterotroph of the Mavirus virophage protein MV06, and a homolog of a host (28). They are typically mixotrophic, ingesting bacteria and hypothetical protein of the Batrachochytrium dendrobatidis. smaller protists as well as conducting . B. natans is Interestingly, two additional proteins are homologous to core the only species of the supergroup Rhizaria for which a complete transpoviron proteins, including the C-terminal superfamily I heli- genome sequence is available (29). During endosymbiosis, hun- case domain protein and C2H2 Zn-finger protein. Transpovirons dreds of genes of green origin have been transferred toward the form a distinct class of mobile genetic elements (6.5–7.5 kb) as- host genome in a process called endosymbiotic gene transfer sociated with Mimiviridae (4). The genes for helicase and C2H2 (29). The acquisition of DNA from giant viruses and trans- protein are adjacent in the B. natans repeated elements and povirons as well as from virophages through horizontal gene Mimivirus-associated transpovirons. Furthermore, Mimivirus-asso- transfer represents an additional component in the melting pot ciated transpovirons are flanked by TIRs (∼530 bp). Thus, the of genes composing the B. natans nuclear genome. B. natans repeated elements and transpovirons share structural One of the most intriguing findings of our analysis is that in- and genetic similarities, suggesting that the B. natans elements tegrated virophage genes are highly transcribed, suggesting that might belong to the transpoviron family. However, the putative they are biologically functional. We speculated on three poten- integrated B. natans transpovirons are substantially bigger in size tial adaptive scenarios to explain this observation. 
Certainly the relative to their Mimivirus counterparts, possibly due to the in- most interesting of them is the possibility that both the cellular corporation of foreign genes of diverse origins, including NCLDVs host and virophage take advantage of the integration strategy, by and virophages. Some of the integrated transpoviron genes show providing the cell with a defense mechanism against giant viruses evidence of transcription, including six genes that have transcrip- and providing the virophage with a mechanism to increase the tion levels in the B. natans top 50%. In addition, homologous rate of coinfection with a viral prey (i.e., delayed-entry mode S2, transcripts were identified in the transcriptome data of the other schematized in Fig. 4). From the perspective of the cellular host, Bigelowiella isolates (Dataset S6). it is tempting to speculate that integrated virophages can act as

Fig. 5. Organization of putative inserted transpoviron genomes. The position and strand orientation of each ORF (>90 codons) are indicated by an arrow. Homologous ORFs across transpoviron-like elements are indicated with the same color. ORFs with no homologs in the other transpovirons are shown in black. TIRs are represented by yellow semiarrows. Shaded areas between elements indicate regions of high nucleotide similarity.

E5324 | www.pnas.org/cgi/doi/10.1073/pnas.1506469112 Blanc et al.

molecular weapons against viral pathogens, conferring a sort of immunity transmissible from generation to generation. Sputnik and Zamilon strains have been shown to have a broad host spectrum among the Mimiviridae (5, 16), and Sputnik even affects the replication of Marseillevirus, which is distantly related to the Mimiviridae (22). By extrapolation, we can further hypothesize that the putative defense function triggered by integrated virophages can be efficient over a diversity of viral pathogens.

It is currently difficult to estimate the prevalence and biological significance of virophage insertion in eukaryotes. At first glance, the fact that only B. natans showed virophage genes out of 1,153 screened eukaryotic genomes might suggest that genomic integration of virophages is highly unusual. However, there is a historical bias among sequenced eukaryotes, which include a majority of model organisms, crop plants, fungi, animals, and pathogens (30), all of which are apparently not infected by giant viruses and, hence, virophages. As a result, the organisms tested here included relatively few potential hosts of giant viruses and virophages [i.e., known hosts are and microalgae (1, 5, 8); Dataset S6]. This bias could explain the apparent low prevalence of virophage insertions detected in our study. In contrast, giant viruses and virophages are readily identified in environmental metagenomes (7, 10, 11, 31), suggesting that these viral entities are common in natural ecosystems. Thus, the question of the prevalence and biological significance of virophage insertions can only be reasonably addressed when more genomes of putative hosts are sequenced.

Methods

Sequence Analysis. Annotations and sequences of 1,153 eukaryotic genome assemblies representing various protists, fungi, and basal metazoans were downloaded from GenBank. Dataset S6 lists the investigated species together with their GenBank accession numbers, including members of the Alveolata (103), Amoebozoa (35), Apusozoa (1), Chlorophyta (12), Choanoflagellida (2), Cryptophyta (1), Euglenozoa (49), Fonticula (1), Fornicata (6), Fungi (858), Haptophyta (1), Heterolobosea (2), Parabasalia (1), Metazoa (8), Opisthokonta (2), Rhizaria (3), Rhodophyta (4), Stramenopiles (63), and Streptophyta (2). We also downloaded the genome annotations of nine sequenced virophages, including Mavirus (GenBank accession no. GCF_000890715), Sputnik (GenBank accession no. EU606015), Zamilon (GenBank accession no. NC_022990), ALM (GenBank accession no. KC556923), OLV (GenBank accession no. HQ704801), YSLV1 (GenBank accession no. KC556924), YSLV2 (GenBank accession no. KC556925), YSLV3 (GenBank accession no. KC556926), and YSLV4 (GenBank accession no. KC556922).

For each virophage, the major capsid protein, minor capsid protein, DNA-packaging ATPase, and cysteine protease were aligned against the predicted eukaryotic proteomes using BLASTP. Experience showed that computational annotation of eukaryotic genomes can be inefficient in predicting genes of viral origin, because they have become pseudogenes, often resulting in truncation or in-frame stop codons and/or because they can have very distinct GC content relative to the host genome and no introns. Therefore, we also aligned the virophage markers against the translated products of ORFs (>90 codons) lying between predicted genes. Based on prior analysis of the distribution of BLASTP scores against the nonredundant (NR) database, we applied family-specific score thresholds to avoid false detection of remote homologs that are of cellular origin or nonhomologs with similar low-complexity sequences. The score threshold was set as the minimal BLASTP score between any two virophage proteins in the family. Scores obtained between any virophage proteins and cellular homologs were always lower. These score thresholds were 44.7 for proteases, 46.6 for ATPases, 41.2 for mCPs, and 45.1 for MCPs.

The genome assembly of B. natans CCMP2755 exhibited homologs for each of the four marker proteins with BLASTP E value <1E-5, except proteases, for which we used an E-value threshold of 1E-3, because this protein is less conserved among virophages. To identify additional candidate viral-like protein genes in B. natans, we performed a BLAST-score plot analysis as previously described in Maumus et al. (19). Briefly, the full complement of B. natans predicted proteins together with intergenic ORFs was used to probe the National Center for Biotechnology Information database using BLASTP (E value < 1E-5). For each protein query, the alignment scores with the best cellular hit and the best virophage or NCLDV hit were recorded. When no cellular hit was recorded whereas a viral hit was obtained, the cellular score was set to zero. BLAST scores were then normalized by the score of the alignment of the query sequence against itself (i.e., self-score), resulting in relative scores expressed in percent of self-score. Nonviral hit scores are plotted against viral hit scores in Fig. 1.

Identification and Delineation of Individual VLEs. The physical location of the virophage-like protein genes identified by BLASTP revealed that they tend to cluster in specific loci in the genome assembly, unveiling large regions of possible virophage origin. Six of these regions were bordered by long inverted repeats on each side, which coincide with sharp changes in GC content (e.g., Fig. 2). We made the assumption that the long inverted repeats mark the beginning and the end of VLEs. We extracted the nucleotide sequences of these putative complete VLEs from the genome assembly and used BLASTN to align the VLEs back to the genome assembly to identify additional truncated VLEs. Adjacent BLASTN matches that had an E value <1E-25 and a minimal length of 100 bp were assembled to identify a total of 38 VLEs up to 33.3 kb in length. Every candidate VLE was checked and validated manually.

Phylogenetic Analysis. Construction of adequate homologous protein sets for phylogenetic analysis was performed using the BLAST-EXPLORER website (32). Homologous proteins were aligned using MUSCLE (33), and amino acid positions in multiple alignments containing >30% gaps were removed. We used this criterion for alignment cleaning to keep coherence with the pioneering study of Yutin et al. (12), which produced a comprehensive phylogenetic study of virophage genes. Maximum-likelihood (ML) phylogenetic reconstruction was performed using the PhyML program (34). Before phylogenetic reconstruction, the best-fitting substitution model for each sequence dataset was determined using the ProtTest program (35). Sequences, alignments, ProtTest outputs, and phylogenetic trees are available in Dataset S7.

RNA-Seq Analysis. To analyze the transcriptional activity of the B. natans genome, we downloaded an RNA-seq dataset (ID MMETSP0045) from the CAMERA database (36). This dataset contained 61.6 million Illumina paired reads generated from polyadenylated RNA extracted from B. natans CCMP2755 cells grown in f/2-Si media for a month under a 12-h:12-h light:dark cycle at room temperature (20). Reads were aligned onto the reference genome sequence using Bowtie 2 (37) with default parameters. Due to the high nucleotide similarity between VLEs, 63% of the reads that mapped onto a VLE also had valid alignments in at least another VLE. For these cases the origin of the read is ambiguous, and we only picked one of the alignments at random for read-count purposes. In addition, we downloaded assembled RNA-seq datasets (contigs) of other Bigelowiella isolates publicly available in the CAMERA database: MMETSP1052 (B. natans CCMP623), MMETSP1054 (B. natans CCMP1259), MMETSP1055 (B. natans CCMP1258.1), MMETSP1358 (B. natans CCMP1242), and MMETSP1359 (Bigelowiella longifila CCMP242). These datasets were generated as part of the Marine Microbial Eukaryote Transcriptome Sequencing Project (21). Homologs of virophage-like encoded proteins were searched in the transcribed sequences using TBLASTN.

ACKNOWLEDGMENTS. We thank Yongjie Wang for providing the YSLV and ALM sequences ahead of publication. We thank Jean Michel Claverie and Deborah Byrne for critical reading of the manuscript. The Information Génomique et Structurale laboratory is partially supported by the CNRS and Aix-Marseille University. We acknowledge the use of the Paca-Bioinfo platform, supported by Infrastructures en Biologie Santé et Agronomie and France-Génomique (ANR-10-INBS-0009).
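The score normalization used for the BLAST-score plot (Methods, Sequence Analysis) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the example scores are hypothetical, and treating a missing viral hit as zero is an assumption made here for symmetry (the text only specifies that a missing cellular hit is set to zero).

```python
def relative_scores(self_score, best_cellular=None, best_viral=None):
    """Return (cellular_pct, viral_pct), each in percent of the self-score.

    self_score    -- BLAST score of the query aligned against itself
    best_cellular -- score of the best cellular hit, or None if no hit
    best_viral    -- score of the best virophage/NCLDV hit, or None if no hit
    """
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    # As in Methods: a missing cellular hit is scored as zero.
    cellular = 0.0 if best_cellular is None else best_cellular
    # Assumption of this sketch: a missing viral hit is also scored as zero.
    viral = 0.0 if best_viral is None else best_viral
    return 100.0 * cellular / self_score, 100.0 * viral / self_score


def has_viral_affinity(self_score, best_cellular=None, best_viral=None):
    """True when the normalized viral score exceeds the cellular one."""
    cellular_pct, viral_pct = relative_scores(self_score, best_cellular, best_viral)
    return viral_pct > cellular_pct


if __name__ == "__main__":
    # Hypothetical scores for one query protein.
    print(relative_scores(200.0, best_cellular=50.0, best_viral=120.0))
    print(has_viral_affinity(200.0, best_cellular=None, best_viral=120.0))
```

Under these assumptions, each query contributes one point to a plot of cellular versus viral relative scores, and queries falling on the viral side of the diagonal are the candidates examined further by phylogenetic reconstruction.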

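The alignment-cleaning criterion stated in Methods (Phylogenetic Analysis), removal of positions containing >30% gaps, can be sketched as follows. This is an illustrative Python sketch only: the function name, the `max_gap_fraction` parameter, and the toy alignment are hypothetical, and the input is assumed to be a list of equal-length aligned sequences with "-" as the gap character.

```python
def drop_gappy_columns(aligned, max_gap_fraction=0.30, gap_char="-"):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    aligned -- list of aligned sequences (strings of identical length)
    Returns a new list of sequences restricted to the retained columns.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(aligned)
    # Keep a column when its gap fraction is at most the threshold,
    # i.e., drop only columns with strictly more than 30% gaps.
    keep = [
        i for i in range(length)
        if sum(seq[i] == gap_char for seq in aligned) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


if __name__ == "__main__":
    toy = ["MK-LV", "MKALV", "M--LV", "MK-L-"]  # hypothetical alignment
    for row in drop_gappy_columns(toy):
        print(row)
```

In the toy alignment, only the third column (3 gaps out of 4 sequences) exceeds the threshold and is removed; columns with a single gap (25%) are retained.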
1. La Scola B, et al. (2008) The virophage as a unique parasite of the giant mimivirus. Nature 455(7209):100–104.
2. Colson P, et al. (2013) "Megavirales", a proposed new order for eukaryotic nucleocytoplasmic large DNA viruses. Arch Virol 158(12):2517–2521.
3. Krupovic M, Cvirkaite-Krupovic V (2011) Virophages or satellite viruses? Nat Rev Microbiol 9(11):762–763.
4. Desnues C, et al. (2012) Provirophages and transpovirons as the diverse mobilome of giant viruses. Proc Natl Acad Sci USA 109(44):18078–18083.
5. Gaia M, et al. (2014) Zamilon, a novel virophage with Mimiviridae host specificity. PLoS One 9(4):e94923.
6. Campos RK, et al. (2014) Samba virus: A novel mimivirus from a giant rain forest, the Brazilian Amazon. Virol J 11:95.
7. Yau S, et al. (2011) Virophage control of antarctic algal host–virus dynamics. Proc Natl Acad Sci USA 108(15):6163–6168.
8. Fischer MG, Suttle CA (2011) A virophage at the origin of large DNA transposons. Science 332(6026):231–234.
9. Santini S, et al. (2013) Genome of Phaeocystis globosa virus PgV-16T highlights the common ancestry of the largest known DNA viruses infecting eukaryotes. Proc Natl Acad Sci USA 110(26):10800–10805.
10. Zhou J, et al. (2013) Diversity of virophages in metagenomic data sets. J Virol 87(8):4225–4236.
11. Zhou J, et al. (2015) Three novel virophage genomes discovered from Yellowstone Lake metagenomes. J Virol 89(2):1278–1285.
12. Yutin N, Raoult D, Koonin EV (2013) Virophages, polintons, and transpovirons: A complex evolutionary network of diverse selfish genetic elements with different reproduction strategies. Virol J 10:158.
13. Kapitonov VV, Jurka J (2006) Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci USA 103(12):4540–4545.
14. Pritham EJ (2006) Genes in conflict: The biology of selfish genetic elements. Am J Hum Biol 18(5):727–728.
15. Krupovic M, Bamford DH, Koonin EV (2014) Conservation of major and minor jelly-roll capsid proteins in Polinton (Maverick) transposons suggests that they are bona fide viruses. Biol Direct 9:6.
16. Gaia M, et al. (2013) Broad spectrum of mimiviridae virophage allows its isolation using a mimivirus reporter. PLoS One 8(4):e61912.
17. Wodarz D (2013) Evolutionary dynamics of giant viruses and their virophages. Ecol Evol 3(7):2103–2115.
18. Taylor BP, Cortez MH, Weitz JS (2014) The virus of my virus is my friend: Ecological effects of virophage with alternative modes of coinfection. J Theor Biol 354:124–136.
19. Maumus F, Epert A, Nogué F, Blanc G (2014) Plant genomes enclose footprints of past infections by giant virus relatives. Nat Commun 5:4268.
20. Tanifuji G, Onodera NT, Moore CE, Archibald JM (2014) Reduced nuclear genomes maintain high gene transcription levels. Mol Biol Evol 31(3):625–635.
21. Keeling PJ, et al. (2014) The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): Illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing. PLoS Biol 12(6):e1001889.
22. Desnues C, Raoult D (2010) Inside the lifestyle of the virophage. Intervirology 53(5):293–303.
23. Katzourakis A, Aswad A (2014) The origins of giant viruses, virophages and their relatives in host genomes. BMC Biol 12:51.
24. Wang L, et al. (2014) Endogenous viral elements in algal genomes. Acta Oceanol Sin 33(2):102–107.
25. Filée J (2014) Multiple occurrences of giant virus core genes acquired by eukaryotic genomes: The visible part of the iceberg? Virology 466-467:53–59.
26. Sharma V, Colson P, Giorgi R, Pontarotti P, Raoult D (2014) DNA-dependent RNA polymerase detects hidden giant viruses in published databanks. Genome Biol Evol 6(7):1603–1610.
27. Delaroque N, Boland W (2008) The genome of the brown alga Ectocarpus siliculosus contains a series of viral DNA pieces, suggesting an ancient association with large dsDNA viruses. BMC Evol Biol 8:110.
28. Keeling PJ (2013) The number, speed, and impact of plastid endosymbioses in eukaryotic evolution. Annu Rev Plant Biol 64:583–607.
29. Curtis BA, et al. (2012) Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature 492(7427):59–65.
30. del Campo J, et al. (2014) The others: Our biased perspective of eukaryotic genomes. Trends Ecol Evol 29(5):252–259.
31. Yutin N, Kapitonov VV, Koonin EV (2015) A new family of hybrid virophages from an animal gut metagenome. Biol Direct 10:19.
32. Dereeper A, Audic S, Claverie J-M, Blanc G (2010) BLAST-EXPLORER helps you building datasets for phylogenetic analysis. BMC Evol Biol 10:8.
33. Edgar RC (2004) MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797.
34. Guindon S, et al. (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321.
35. Abascal F, Zardoya R, Posada D (2005) ProtTest: Selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105.
36. Sun S, et al. (2011) Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: The CAMERA resource. Nucleic Acids Res 39(Database issue):D546–D551.
37. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357–359.
38. Jurka J, et al. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110(1–4):462–467.

Supporting Information

Blanc et al. 10.1073/pnas.1506469112

Fig. S1. Schematic representation of the identified virophage-like elements. Homologous ORFs (>90 codons) are colored with the same color across virophage-like elements. TIRs are represented by yellow semiarrows. Shaded areas between virophage-like elements indicate regions of high nucleotide similarity.

Blanc et al. www.pnas.org/cgi/content/short/1506469112 1 of 8

[Fig. S2, tree panels: (A) KELCH domain; (B) PK domain; (C) Lipase; (D) Helicase; (E) Integrase; (F) Bn2_27 (FAM07); (G) DNA polymerase; (H) GIY-YIG nuclease domains; (I) Bn2_18 (FAM19); (J) ribonucleotide reductase small subunit.]

Fig. S2. Phylogenetic reconstructions of virophage-like proteins. (A) ML tree of KELCH domains; (B) ML tree of protein kinase (PK) domains contained in KELCH proteins; (C) ML tree of lipases; (D) ML tree of S3 helicase domains; (E) ML tree of rve integrases; (F) ML tree of Bn2_27 homologs; (G) ML tree of DNA polymerase type B family; (H) ML tree of GIY-YIG nuclease domains; (I) ML tree of Bn2_18 homologs; and (J) ML tree of ribonucleotide reductase small subunits. Virophage, NCLDV, prokaryotic, and eukaryotic sequence names are shown in brown, red, mauve, and blue, respectively. Sequences marked with an asterisk were taken from Repbase (38). For other sequences, species name abbreviation and GenBank accession number are indicated. For clarity, subtrees containing sequences originating from the same taxonomic clade were condensed (colored triangles). Statistical supports for branches in percent (approximate likelihood-ratio test) are shown beside nodes (only branch supports >50%). Branches with statistical support <50% were collapsed. The scale bars represent the number of substitutions per amino acid site. env, marine metagenome. Species: Aciph, Acinetobacter phage; Acypi, Acyrthosiphon pisum; Bacam, Bacillus amyloliquefaciens; Bacth, Bacillus thuringiensis; Capow, Capsaspora owczarzaki; Debha, Debaryomyces hansenii; Deigo, Deinococcus gobiensis; Dicfa, Dictyostelium fasciculatum; Ectsi, Ectocarpus siliculosus virus 1; Grama, Granulicella mallensis; Guith, Guillardia theta; Klula, Kluyveromyces lactis; Lackl, Lachancea kluyveri; Marse, Marseillevirus; Micpu, Micromonas pusilla virus PL1; Monbr, Monosiga brevicollis MX1; Parbu, Paramecium bursaria Chlorella virus NY2A; Physo, Phytophthora sojae; Popla, Polysphondylium pallidum PN500; Psyto, Psychroflexus torquis; Steph, Stenotrophomonas phage S1; Tlr, Tetrahymena thermophila; Trica, Tribolium castaneum; Tursa, Turicibacter sanguinis.
The numbers of validated amino acid positions in cleaned alignments are 340 (A), 237 (B), 233 (C), 333 (D), 304 (E), 93 (F), 472 (G), 92 (H), 158 (I), and 994 (J).
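
The collapsing rule stated in the legend (branches with support below 50% are collapsed) can be illustrated with a minimal sketch. This is not the authors' tooling: the `Node` class and the `collapse_low_support` function are hypothetical names introduced here, branch lengths are ignored, and support is assumed to be stored as a percentage on the node that the branch leads to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; `support` is the percent support of the
    branch leading to this node (None for leaves and the root)."""
    name: str = ""
    support: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node: Node, threshold: float = 50.0) -> Node:
    """Collapse internal branches whose support is below `threshold`.

    A collapsed node is removed and its children are attached directly
    to its parent, which turns the weak bifurcation into a polytomy.
    """
    new_children: List[Node] = []
    for child in node.children:
        # Process the subtree first so deeper weak branches are handled.
        collapse_low_support(child, threshold)
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < threshold:
            new_children.extend(child.children)  # dissolve the weak branch
        else:
            new_children.append(child)
    node.children = new_children
    return node


def leaf_names(node: Node) -> List[str]:
    """Return leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names
```

Collapsing changes only the topology around weakly supported nodes; the set and order of leaves is unchanged.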

[Fig. S3, panels A to C: phylogenetic trees. Scale bars: 0.5. B. natans sequences labeled in panels A and B: aug1.13 g5836, aug1.26 g9478, aug1.10 g4662, and scaffolds 6, 10, 24, and 87. B. natans sequences labeled in panel C: aug1.35 g11322 and aug1.43 g12888.]

Fig. S3. Phylogenetic reconstructions of NCLDV-like proteins. (A) ML tree of capsid proteins; (B) ML tree of DNA-packaging ATPases; and (C) ML tree of DNA polymerases (DNAPs). Same legend as Fig. S2. The numbers of validated amino acid positions in cleaned alignments are 370 (A), 229 (B), and 755 (C).

Fig. S4. B. natans scaffolds containing virophage- or NCLDV-like inserts. The blue graphs represent the number of transcriptome reads mapped onto contigs. GC content variation in 5-kb sliding windows along the contig sequence is depicted by a yellow-to-red color scale, with yellow indicating low GC content (minimum 30%) and red indicating high GC content (maximum 60%). The distance ruler is expressed in kilobases. Cyan rectangles represent virophage inserts; green circles represent NCLDV-like genes.
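
The GC-content track described in the legend can be approximated with a short sketch. It is illustrative only: the function names are hypothetical, the step between windows is an assumption (the legend gives only the 5-kb window size), and ambiguous bases such as N are excluded from the denominator by choice.

```python
from typing import List, Optional, Tuple


def gc_content_windows(
    seq: str, window: int = 5000, step: int = 5000
) -> List[Tuple[int, Optional[float]]]:
    """Return (window start, GC fraction) pairs along a sequence.

    `step` is an assumed parameter; the legend states only the window size.
    Windows with no A/C/G/T bases yield None.
    """
    seq = seq.upper()
    results: List[Tuple[int, Optional[float]]] = []
    # A sequence shorter than one window is reported as a single window.
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        chunk = seq[start:start + window]
        counted = sum(chunk.count(base) for base in "ACGT")
        if counted == 0:
            results.append((start, None))
            continue
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / counted))
    return results


def gc_to_scale(gc: float, low: float = 0.30, high: float = 0.60) -> float:
    """Map a GC fraction to [0, 1], where 0 is yellow (30% or lower)
    and 1 is red (60% or higher), matching the legend's color bounds."""
    clamped = min(max(gc, low), high)
    return (clamped - low) / (high - low)
```

Setting `step` smaller than `window` gives overlapping windows and a smoother track, at the cost of more values to plot.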

Other Supporting Information Files

Dataset S1 (XLSX)
Dataset S2 (XLSX)
Dataset S3 (XLSX)
Dataset S4 (XLSX)
Dataset S5 (XLSX)
Dataset S6 (XLSX)
Dataset S7 (GZ)
