
Genome of Phaeocystis globosa virus PgV-16T highlights the common ancestry of the largest known DNA viruses infecting eukaryotes

Sebastien Santini (a), Sandra Jeudy (a), Julia Bartoli (a), Olivier Poirot (a), Magali Lescot (a), Chantal Abergel (a), Valérie Barbe (b), K. Eric Wommack (c), Anna A. M. Noordeloos (d), Corina P. D. Brussaard (d, e, 1), and Jean-Michel Claverie (a, f, 1)

(a) Structural and Genomic Information Laboratory, Unité Mixte de Recherche 7256, Centre National de la Recherche Scientifique, Aix-Marseille Université, 13288 Marseille Cedex 9, France; (b) Commissariat à l’Energie Atomique–Institut de Génomique, 91057 Evry Cedex, France; (c) Department of Plant and Soil Sciences, University of Delaware, Newark, DE 19711; (d) Department of Biological Oceanography, Royal Netherlands Institute for Sea Research, NL-1790 AB Den Burg (Texel), The Netherlands; (e) Aquatic Microbiology, Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands; and (f) Service de Santé Publique et d’Information Médicale, Hôpital de la Timone, Assistance Publique–Hôpitaux de Marseille, FR-13385 Marseille, France

MICROBIOLOGY

Edited by James L. Van Etten, University of Nebraska, Lincoln, NE, and approved May 1, 2013 (received for review February 22, 2013)

Large dsDNA viruses are involved in the population control of many globally distributed species of eukaryotic phytoplankton and have a prominent role in bloom termination. The genus Phaeocystis (Haptophyta, Prymnesiophyceae) includes several high-biomass-forming phytoplankton species, such as Phaeocystis globosa, the blooms of which occur mostly in the coastal zone of the North Atlantic and the North Sea. Here, we report the 459,984-bp-long genome sequence of P. globosa virus strain PgV-16T, encoding 434 proteins and eight tRNAs and, thus, the largest fully sequenced genome to date among viruses infecting algae. Surprisingly, PgV-16T exhibits no phylogenetic affinity with other viruses infecting microalgae (e.g., phycodnaviruses), including those infecting Emiliania huxleyi, another ubiquitous bloom-forming haptophyte. Rather, PgV-16T belongs to an emerging clade (the Megaviridae) clustering the viruses endowed with the largest known genomes, including Megavirus, Mimivirus (both infecting acanthamoeba), and a virus infecting the marine microflagellate grazer Cafeteria roenbergensis. Seventy-five percent of the best matches of PgV-16T predicted proteins correspond to two viruses [Organic Lake phycodnavirus (OLPV)1 and OLPV2] from a hypersaline lake in Antarctica (Organic Lake), the hosts of which are unknown. As for OLPVs and other Megaviridae, the PgV-16T sequence data revealed the presence of a virophage-like genome. However, no virophage particle was detected in infected P. globosa cultures. The presence of many genes found only in Megaviridae in its genome and the presence of an associated virophage strongly suggest that PgV-16T shares a common ancestry with the largest known dsDNA viruses, the host range of which already encompasses the earliest diverging branches of domain Eukarya.

| core | | mobile element |

Author contributions: K.E.W., C.P.D.B., and J.-M.C. designed research; S.S., S.J., J.B., V.B., A.A.M.N., and C.P.D.B. performed research; S.S., S.J., O.P., M.L., C.A., and J.-M.C. analyzed data; and S.S., C.P.D.B., and J.-M.C. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. KC662249–KC662250).
(1) To whom correspondence may be addressed. E-mail: [email protected] or Jean-[email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1303251110/-/DCSupplemental.

The discovery of the giant Acanthamoeba polyphaga Mimivirus in freshwater amoeba and the deciphering of its 1.2-Mb genome sequence initiated a new independent phylogenetic lineage among eukaryotic dsDNA viruses (1, 2), as well as a new chapter in virology. Soon after, it was recognized (3) that the closest homologous sequences to many Mimivirus proteins occurred within metagenomic sequence data from the Global Ocean Survey expedition (4). However, the identity of these presumably abundant (5) Mimivirus relatives and the hosts they infect remained elusive. As of today, the only Mimivirus relatives of marine origin for which a complete genome sequence is available are a virus infecting Cafeteria roenbergensis (CroV) (6), a major microflagellate grazer, and Megavirus chilensis, isolated from near-shore sediments off the coast of central Chile (7). This latter virus is propagated on various species of Acanthamoeba in the laboratory, but its environmental host remains unknown. Remarkably, these viruses exhibit the two largest known genomes among marine viruses: 730 kb and 1.28 Mb for CroV and Megavirus chilensis, respectively. Other studies, targeting virus-specific genes [e.g., DNA polymerase B (8) or capsid proteins (9)], have suggested a close phylogenetic relationship between Mimivirus and several giant dsDNA viruses infecting various unicellular algae such as Pyramimonas orientalis (Chlorophyta, Prasinophyceae), Phaeocystis pouchetii (Haptophyta, Prymnesiophyceae), and Chrysochromulina ericina (Haptophyta, Prymnesiophyceae).

Phaeocystis, the host for Phaeocystis globosa virus (PgV)-16T, has a complex lifecycle that includes a unicellular, motile stage and a gelatinous stage that can produce multicellular colonies of several centimeters in diameter. P. globosa regularly dominates the phytoplankton community in the coastal zone of the North Atlantic and the North Sea, where its blooms may result in “stinking water” and the production of foam of mucilaginous material that deposits on beaches. The decline of natural blooms was shown to be accompanied by a considerable increase of viruses infecting P. globosa (10), and a mesocosm study showed that the abundance of P. globosa populations can indeed be controlled by viral lysis (11). The PgV-16T strain reported here belongs to group I of clonal virus isolates (together with PgV-12T and PgV-14T) obtained from Dutch coastal waters (southern North Sea) and partially characterized previously (12). It is a lytic virus that only infects Phaeocystis globosa, with a latent period of 8–12 h and a burst size of around 300–400 virions (12). Its icosahedral particle is about 150 nm in diameter, and its genome size was estimated at 470 kb (12). Group I viruses are the largest known viruses infecting P. globosa, but many others (classified in group II, such as PgV-03T, PgV-04T, PgV-10T) have been isolated that exhibit smaller virions (about 100 nm in diameter) and genome sizes (about 177 ± 3 kb), as well as a different host range (10, 12). Their genomes have not yet been sequenced.

In this study, we used in excess of 20,390,000 Illumina 101-bp reads (paired) for the de novo assembly of the PgV-16T genome at very high sequence coverage (average: 4,477 ± 1,624). Unexpectedly, the corresponding predicted proteome was found to be, by far, most similar to the one of two partially sequenced viruses
[Organic Lake phycodnavirus (OLPV)1 and OLPV2] infecting an unknown host from a hypersaline lake in Antarctica (Organic Lake) (13). In addition, a subset of data (6,310,000 reads) enabled the assembly of a virophage-like genome, reinforcing the similarity with OLPVs, which also appear to have associated virophages. The PgV-16T genome, furthermore, exhibits the duplication of two types of viral core genes, the packaging ATPase and the second-largest RNA polymerase subunit, genes that have not been reported to occur at more than a single copy in any dsDNA virus.

www.pnas.org/cgi/doi/10.1073/pnas.1303251110 | PNAS Early Edition

Results and Discussion

General Genome Features. The genome of PgV-16T was fully assembled as a linear DNA sequence of 459,984 bp, making it the third-largest described genome among marine viruses behind CroV and Megavirus. Its overall nucleotide composition (A+T content: 68%) is less A+T-biased than that of CroV and Megavirus (A+T: 77% and 75%, respectively), although it reaches 87 ± 8% A+T between genes. Using a conservative annotation protocol (SI Materials and Methods), we identified 434 putative protein-coding sequences with an average length of 320 aa (ranging from 45 to 3,739 aa) and eight tRNAs (three tRNA-Leu, two tRNA-Asn, one tRNA-Arg, one tRNA-Gln, one tRNA-Ile). With an average intergenic distance of 104 nt (summing to 42,717 noncoding base pairs), the PgV-16T genome reaches a coding density of 90.7%. Unexpectedly, the partially assembled genome sequences of two other PgV strains deposited directly in the GenBank database (PgV 12-T, accession no. HQ634147.1; PgV 14-T, accession no. HQ634144.1) were found to be nearly identical to PgV-16T in terms of gene content and percentage of identity (98.1% of identity; 449 and 444 genes). We searched the final genome sequence assembly for the presence of large (direct or inverted) repeated regions of >5 kb. No evidence of block duplication involving a string of several genes was found (Fig. S1), indicating that the large genome size of PgV is not the result of segmental duplications. A finer look at these data also showed that unchecked short-repeat expansions do not contribute to the large size of the PgV genome. As with previously characterized dsDNA viruses infecting unicellular eukaryotes, most of the genes within the PgV-16T genome were vertically inherited from their common ancestor, and recent horizontal gene transfers do not account for a significant fraction of the gene content. Finally, unlike the poxviruses, asfarviruses, or chloroviruses, there is no sequence similarity (direct or inverted) between the two extremities (telomeres) of the PgV genome.

Applying a conservative BlastP E-value cutoff of 10^-5 (and disregarding the matches with the nearly identical partial PgV-12T and PgV-14T genome sequences), 245 coding DNA sequences (CDSs) (56.5%) exhibited a similarity to protein sequences in the National Center for Biotechnology Information (NCBI) nonredundant (NR) databank (14). Among those, 178 (73%) had their closest homolog in a large dsDNA virus, 35 in bacteria, 25 in eukaryotes, and 2 in archaebacteria. Unexpectedly, among the viral best matches, 138 (77.5%) were found in OLPV1 and/or OLPV2 (the DNA polymerase sequences of PgV-16T and the OLPVs share about 48% identical residues) and the only four described genes of P. pouchetii virus (sharing 85% identical residues on average) (Fig. 1). Only four PgV-predicted proteins (PgV140, PgV143, PgV411, PgV423) were found to be most similar to their homolog in Emiliania huxleyi virus (EhV)86 (15) (or related strains), the only other previously sequenced giant viruses known to infect a haptophyte.

Fig. 1. Global database similarity of the PgV-16T predicted proteome. (A) Close to 50% of PgV predicted proteins do not exhibit a significant match (E < 10^-5) in the NCBI NR protein database, as is typical for viruses belonging to a newly explored lineage with no other sequenced representatives (2, 6). (B) Surprisingly, for 77% of the PgV predicted proteins exhibiting a viral protein as their best match, this homolog is found in the partially sequenced OLPV1 or -2. This finding indicates strong global phylogenetic affinities between PgV (infecting a haptophyte) and the OLPVs infecting an unknown host.

Identification of a Virophage-Like Sequence. Virophages are a class of virus that have now been found in association with giant viruses such as Mimivirus/Megavirus (16, 17), as well as with CroV (18) and OLPV (13). Virophages presumably use the characteristic replicative cytoplasmic virion factories of these viruses for their own replication (19). In addition to the single-scaffold PgV-16T genome sequence, the assembly process generated a 19,527-bp-long sequence scaffold, indicating the presence of an unsuspected virophage (denoted PgVV) associated with the PgV-16T particles of Phaeocystis globosa. However, no small viral particle (typically less than 75 nm in diameter) characteristic of previously described virophages was observed in infected P. globosa cells. Upon assembly, the PgVV genome sequence exhibited a much higher coverage (i.e., density of reads mapped to the sequence) than the PgV-16T genome (33,151 ± 14,153 vs. 4,477 ± 1,624), despite a similar G+C content (36% vs. 32%). This suggests that the PgVV genome was in a higher copy number (by a factor of ∼7.5) than the PgV-16T virus in the initial DNA sample. The central PgVV genome sequence is flanked by two 1-kb-long, nearly identical telomeric sequences complementary to each other (Fig. 2). Such a feature, not previously described for virophages, is common among poxviruses, chloroviruses, and asfarviruses and was also encountered in the N15 bacteriophage (20). In all these cases, this genome feature was associated with covalently closed hairpin ends (telomeres). The PgVV genome was predicted to encode 16 proteins (numbered pgvv01 to pgvv16), all of them transcribed from the same DNA strand (Fig. 2), another unique characteristic. However, as for the previously described virophages, most of these protein sequences (10/16; 62.5%) had no significant matches or were remotely similar to hypothetical proteins of unknown function (2/16; 12.5%). Three of them, including an endonuclease (pgvv02) and the predicted DNA primase/polymerase (pgvv04), have a homolog in the virophage infecting CroV (Mavirus) and the one associated with OLPVs (OLV1) (Fig. 2). Because no PgVV genes appeared to encode a protein even remotely similar to a capsid protein, we suspected that the PgVV genome could either be carried as a linear plasmid within the PgV-16T particle or could integrate within the PgV genome, as described previously for the Sputnik virophage (17). We explored this eventuality by generating a map of the “broken” paired-end reads perfectly matching on the PgVV genome on one end and on the PgV-16T genome on the other end (Fig. S2). A total of 796 paired-end reads were found to meet these criteria, suggesting a multiplicity of integration sites.
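The search for “broken” read pairs described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record layout (`MatePair`), the reference names, the 2,500-bp window, and the toy coordinates are all assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple


class MatePair(NamedTuple):
    """One mapped read pair (hypothetical record layout)."""
    ref1: str  # reference hit by mate 1, e.g. "PgV" or "PgVV"
    pos1: int  # 1-based start of mate 1 on ref1
    ref2: str  # reference hit by mate 2
    pos2: int  # 1-based start of mate 2 on ref2


def chimeric_pairs(pairs, virus="PgV", virophage="PgVV"):
    """Keep pairs with one mate on each genome.

    Returns (virus_position, virophage_position) tuples.
    """
    junctions = []
    for p in pairs:
        if p.ref1 == virus and p.ref2 == virophage:
            junctions.append((p.pos1, p.pos2))
        elif p.ref1 == virophage and p.ref2 == virus:
            junctions.append((p.pos2, p.pos1))
    return junctions


def hot_spots(junctions, window=2500):
    """Count chimeric pairs per fixed window along the virus genome.

    Each window is reported by its 1-based start coordinate,
    most populated windows first.
    """
    counts = Counter(
        ((virus_pos - 1) // window) * window + 1 for virus_pos, _ in junctions
    )
    return counts.most_common()


# Toy input (invented coordinates, for illustration only).
toy_pairs = [
    MatePair("PgV", 47600, "PgVV", 3300),
    MatePair("PgVV", 4100, "PgV", 49900),
    MatePair("PgV", 120000, "PgV", 120300),  # ordinary pair, ignored
    MatePair("PgV", 300010, "PgVV", 15000),
]
junctions = chimeric_pairs(toy_pairs)
ranked = hot_spots(junctions)
print(ranked)
```

Binning by fixed windows is only one way to look for clusters of junctions; a real analysis would also need to require perfect matches on both mates, as the text specifies, and to account for the insert size of the library.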

A possible integration “hot spot” defined by 28 read pairs was found between positions 47,500 and 50,000 in the PgV genome sequence (ORF no. 42), and positions 3,200 and 4,200 in the PgVV genome (ORF nos. 3 and 4). The presence of PgVV DNA packed in the PgV-16T virions was directly validated by PCR on DNA extracted from particles produced from an independent culture and purified by centrifugation on a cesium chloride gradient (SI Materials and Methods). Finally, pgvv14 is similar to a protein found in several copies in the PgV genome, supporting the possible role of virophages as gene shuttles between large DNA viruses.

Fig. 2. Map of the PgV-16T–associated virophage-like genome (positions 1 to 19,527). Predicted ORFs are indicated by boxes colored according to their annotation status: with a predicted function (green), with a significant (E < 10^-5) match in the NR database (yellow), and no match (white). The flanking arrows represent two 1-kb-long, nearly identical sequences complementary to each other.

Gene Content. Using BlastP with a conservative E-value threshold of 10^-5, we found 48 genes shared by Megavirus, Mimivirus, CroV, and PgV, the four largest fully sequenced viral genomes known (Fig. 3 A and B). Among those, only 18 correspond to genes defined previously as “core genes,” that is, genes found in a majority of large dsDNA viruses and presumed to encode functions essential for the replication of these viruses (Table S1) (2, 21). This finding suggests that these four viruses share these 48 genes as a result of a common ancestry, rather than functional requirements imposed on all large DNA viruses, in particular, those replicating in the cytoplasm of their host cell. Similar three-way comparisons with the Emiliania huxleyi virus (EhV) (having the fifth largest viral genome at 407 kb) (15) resulted in a very different picture, with only 14 genes in common, 11 of which are bona fide core genes (Fig. 3 C and D) encoding the fundamental components of the DNA replication and transcription machineries and a major capsid protein. Thus, even though EhV and PgV are two large DNA viruses infecting two plankton species belonging to the haptophytes, they do not show extensive evidence of common ancestry and clearly belong to different virus families.

Fig. 3. Fraction of protein-coding genes shared by the largest known viral genomes at increasing evolutionary distance. (A and B) Venn diagrams indicating the global proximity in gene content of the four large genomes of Mimivirus, Megavirus, CroV, and PgV. (C and D) Venn diagrams indicating that EhV and PgV exhibit a markedly different gene content (they only share “core” genes conserved in most large eukaryotic dsDNA viruses, irrespective of family; Table S1), even though they both infect a haptophyte. Genome labels in the panels: Mimivirus, 979 genes; Megavirus, 1,120 genes; CroV, 544 genes; PgV, 434 genes; EhV86, 472 genes.

The 48 genes (Table S1) shared between PgV and the three other largest sequenced viral genomes [but also found in the draft genome sequences of other Megavirus relatives (17) and, for most of them, within the partial OLPV1 and OLPV2 sequences] are devoted to essential functions such as deoxynucleotide synthesis (ribonucleotide reductase, deoxycytidine deaminase, thymidine kinase, thymidylate synthase, Nudix-like hydrolase); DNA replication [primase, ribonuclease H1, DNA polymerase B, PCNA clamp, three clamp loaders (replication factor C), NAD-dependent DNA ligase, topoisomerase IA and II]; DNA repair (Lambda-type exonuclease, MutS7, Holliday junction resolvase, XRN 5′–3′ exonuclease); DNA transcription (ribonuclease III, VV A18-like and VV D11-like helicases, TATA-box–binding protein, polyadenylate polymerase, transcription factors TFIIB-like, VLTF3-like, and TFIIS-like); protein synthesis and processing (mRNA-capping enzyme, initiation factor eIF4E-like, DNAJ, thiol oxidoreductase, cysteine protease); and virion structure (capsid protein and DNA packaging ATPase). However, seven of these genes uniquely conserved among these giant viruses are also predicted to encode proteins of unknown function (“hypothetical proteins”: PgV50, -58, -134, -183, -259, -260, -298) or were similar to enzymes performing seemingly insignificant roles such as a phospholipase (PgV117) or a methyltransferase (PgV234). Most curious among these well-conserved homologs was asparagine synthetase (>38% identical residues), which occurred in PgV-16T, Megavirus, and Mimivirus isolates, in OLPV1 and OLPV2, and Micromonas and Ostreococcus viruses. A phylogenetic tree built from these viral sequences was amazingly congruent with the global similarity of these viruses and clearly suggests that all known viral asparagine synthetase genes derived from a common ancestor that diverged from the cellular version quite a long time ago and were since vertically inherited within two distinct viral lineages (Fig. S3). The rationale behind the conserved occurrence of the asparagine synthetase gene in the genome of this subset of dsDNA viruses, or the role that asparagine might specifically play in their lifecycle, is unknown.

Of the 434 putative CDSs mapped in the PgV genome, only 150 (34.6%) could be assigned a putative function based on homology, a percentage typical of large eukaryotic viruses. This small fraction of known genes was nevertheless sufficient to draw a number of conclusions.

For instance, the PgV genome encodes a high number of RNA polymerase II subunit homologs (RPB1, RPB2, RPB5, RPB6, RPB7, RPB9, and RPB10), a complete set of transcription initiation (PgV113), elongation (PgV146, PgV158), and termination factors (PgV321), together with a polyA polymerase (PgV73), a TATA-box-like–binding protein (PgV48), and an mRNA-capping enzyme (PgV235). This impressive complement of transcription-related genes suggests that PgV most likely replicates in its host’s cytoplasm, even though its infectious cycle has not yet been studied at the cellular level.

In addition to the previously cited DNA manipulating enzymes common to the four largest known viral genomes, PgV possesses DNA-repair enzymes encountered in a more sporadic manner in a variety of other viruses: an ERCC4-type nuclease (PgV318, with homologs in OLPV and asfarviruses), an apurinic-apyrimidinic
exoDNase (PgV94, with a homolog in CroV), and a mismatch repair enzyme (PgV283) of a type (MutS8) uniquely found in OLPV1 and -2, Phaeocystis pouchetii virus, and Pyramimonas orientalis virus (22). The PgV genome is also rich in putative ribonucleases such as a bacterial ribonuclease HII-like (PgV121, with homologs in CroV and EhV), an RNase T-like (PgV123, with homologs found in many viruses and phages but very divergent in Megavirus and Mimivirus), and a PIN-domain containing RNase H (PgV303, with homologs in CroV and phages).

The PgV genome also contains a large number of putative endonucleases of the HNH-type (PgV60, -72, -243, -438) and GIY-YIG type (PgV69, -216, -359, -429), but none of them was found associated with an intein or a type I intron (although such introns cannot be reliably identified when they occur within a protein without a significant homolog in the databases). A single intein was identified within the coding region of the VV A18-like helicase (PgV190). The closest relative of the predicted mature PgV190 gene product was found in OLPV1, but this homolog exhibits no trace of intein. The intein was likely inserted into the PgV190-coding sequence after PgV diverged from the OLPV lineage. Two predicted transposases with a strong similarity (77% identical residues) were also identified (PgV137, -217), the sole viral homologs of which could be found in OLPV1 (also in two copies). The presence of several DNA methylases and type II restriction endonucleases (with their best viral homologs in CroV or Ostreococcus viruses) suggests that the PgV genome contains modified nucleotides (such as 5mC or 6mA), as already well documented for Paramecium bursaria Chlorella virus 1 (23).

Surprisingly, given the high A+T content of their genomes, both Megavirus and Mimivirus lack deoxyuridine 5′-triphosphate nucleotidohydrolase (dUTPase), an enzyme crucial to preventing the misincorporation of deoxyuridine instead of thymidine. This enzyme is present in most large eukaryotic DNA viruses (including CroV and EhV), as well as in phages. Thus, the most parsimonious scenario is that the dUTPase gene, present in the common ancestor of all these viruses, was uniquely lost in the branch leading to Megavirus and Mimivirus. The presence of a dUTPase homolog in PgV (PgV389, most similar to the one of OLPV1) adds support to this hypothesis.

Large dsDNA Virus Core Gene Duplication. After the first round of gene annotation, a number of redundancies were found within the subset of 150 PgV-16T genes with putative functional assignments. Multiple copies of genes encoding a similar function could arise in three ways: (i) vertical ancestral inheritance; (ii) acquisition of additional copies by horizontal transfer; and (iii) in situ gene duplication. Molecular phylogenetic approaches were of little help in deciphering these mechanisms, given the small number of available viral homologs for these genes. Therefore, we used the following simple heuristics to determine which mechanisms were most likely responsible for the duplications: when the gene copies were found to be most similar to each other (relative to a whole NR database search), they were classified as “recent duplication”; if they were separately found to be most similar to their homologs in OLPV (currently the closest known relative of PgV-16T), the duplication was classified as “vertically inherited”; all other cases were presumed to indicate an acquisition by horizontal gene transfer. Table 1 summarizes the results of this analysis, pointing out eight cases of probable duplication since the divergence of the OLPV and PgV lineages, including that of a transposase that might itself be involved in the duplication process. This analysis led us to unravel two unprecedented cases of “core gene” duplication within giant viruses. The first one involves the A32-like packaging ATPase (NCVOG0249) that is uniquely found in two tandem copies in PgV-16T (PgV64, PgV65), most likely the result of a recent event. The second case involves the DNA-directed RNA polymerase second largest subunit (PgV208, PgV228) (NCVOG0271). In this case, the duplication was most likely inherited from the PgV-OLPV common ancestor, because two copies of this RNA polymerase subunit are also present in OLPV. Phylogenetic analyses are consistent with these interpretations (Fig. S4 A and B). The packaging ATPase and the RNA polymerase are both “viral hallmark genes” encoding key functions that are shared by a diverse subset of viruses (24) and are usually found in a single copy per viral genome. However, nothing in these sequences suggests that one of the two copies is a defective, nonfunctioning gene.

Discovery of a Putative Mobile Element: PgV-MIGE. The above systematic search for genes individually duplicated in the PgV-16T genome led us to identify 12 copies of a PgV-specific protein (i.e., without significant similarity in the NR database) for which no function could be predicted. These repeated ORFans are evenly dispersed throughout the genome, as indicated by their gene numbers (9, 29, 45, 78, 139, 220, 273, 297, 322, 333, 370, 433), and are transcribed from both strands (six from each). These ORFans averaged 320 aa in length (range, 270 to 355 aa), with an average pairwise similarity of about 58% (range, 42–65%). The N-terminal halves of the predicted proteins exhibited lower conservation than the C-terminal halves. An exception was PgV333, the most divergent (and shortest) paralog, from which the last 50 residues were missing. In a multiple alignment of the 11 full-length homologs, 95 residues are strictly conserved among the 170 C-terminal positions

Table 1. Duplication events in the PgV-16T genome

Gene pair | ID aa, % | Predicted function | Origin of the duplication, comment
137, 217 | 77 | Transposase | Recent duplication
60, 72 | 76 | Homing endonuclease, HNH-type | Recent duplication
64–65 | 72 | Packaging ATPase (VV32-like) | Recent tandem duplication, usually a single copy core gene
96–97 | 57 | Phage tail collar domain family protein | Recent tandem duplication
142, 149 | 41 | Deoxynucleoside kinase (DNK) | Recent duplication, usually a single copy gene
208, 228 | 35 | DNA-dependent RNA Pol second largest subunit (Rpb2) | Vertically inherited, usually a single copy core gene (also duplicated in OLPV)
7–15 (16) | 35 (14) | Methyltransferase fkbm family | 7–15: recent duplication; 16: independent HGT
424–425 | 29 | 2OG-Fe(II) oxygenase family protein | Recent tandem duplication
34, 427 | 23 | DNA-J domain containing protein | One copy common to all Megaviridae, one (old) HGT
301, 414 | 17 | 2-Polyprenylphenol 6-hydroxylase | HGT, ubiquinone synthesis
118, 313 | 18 | Disulfide-isomerase | HGT, protein folding
240, 411 | 17 | NUDIX-like hydrolase | HGT, nucleotide metabolism

Multiple occurrences of genes encoding proteins with identical predicted functions are interpreted as horizontal gene transfer (HGT) or tandem duplications based on their percentage of identical residues (ID aa, %). A third copy of the fkbm methyltransferase (pgvv16) is only 14% identical (in parentheses) to the two other copies.
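The three-way heuristic behind the "Origin of the duplication" column of Table 1 can be written as a small decision function. This is a sketch under stated assumptions, not the authors' code: the `best_hit` mapping (the top-scoring match of each copy in a search that also includes the genome's own proteins), the source labels, and the subject identifiers in the toy data are all hypothetical.

```python
def classify_duplication(copy_a, copy_b, best_hit):
    """Classify a pair of same-function genes, following the rule in the text.

    best_hit maps a gene ID to (source, subject_id) for its top database
    match; source is a label such as "PgV", "OLPV" or "bacteria"
    (assumed labels, chosen for this example).
    """
    source_a, subject_a = best_hit[copy_a]
    source_b, subject_b = best_hit[copy_b]
    # Copies most similar to each other: duplicated within the PgV lineage.
    if subject_a == copy_b and subject_b == copy_a:
        return "recent duplication"
    # Each copy separately closest to an OLPV homolog: the duplication
    # predates the divergence of PgV and OLPV.
    if source_a == "OLPV" and source_b == "OLPV":
        return "vertically inherited"
    # All other cases are presumed to reflect horizontal gene transfer.
    return "horizontal gene transfer"


# Toy best-hit table. Gene pairs follow Table 1; the subject IDs are invented.
toy_best_hit = {
    "PgV64": ("PgV", "PgV65"),
    "PgV65": ("PgV", "PgV64"),
    "PgV208": ("OLPV", "olpv_rpb2_copy1"),
    "PgV228": ("OLPV", "olpv_rpb2_copy2"),
    "PgV118": ("bacteria", "hit_x"),
    "PgV313": ("eukaryote", "hit_y"),
}

for pair in [("PgV64", "PgV65"), ("PgV208", "PgV228"), ("PgV118", "PgV313")]:
    print(pair, classify_duplication(*pair, toy_best_hit))
```

The order of the tests matters: reciprocal best hits are checked first, so a pair is called "recent" even when OLPV homologs exist but score lower.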

(56%), suggestive of a constrained functional signature. However, submitting this multiple alignment to the sensitive 3D structure homology search program Fugue (25) failed to return any significant match or functional clue. To further analyze the primary structure of the DNA segments encompassing and flanking these multiple ORF copies, we focused on the eight most similar ones (average pairwise identity of 63%). A local alignment of their nucleotide sequences identified a highly conserved DNA segment of about 300 nt, beginning immediately upstream of the ORF N terminus (Fig. S5). The nucleotide conservation of this noncoding segment contrasts with the flanking N-terminal part of the ORFan coding region, which is not well conserved (Fig. S5). A huge diversity of mobile elements has been described in the three domains of life (26–28), and the list is still growing (29). However, the PgV-16T repeats do not resemble any of them. For instance, examining the sequences and putative secondary structures flanking the 5′ and 3′ boundaries of these repeated segments failed to reveal any direct or inverted repeats, or putative target sequences shared by the various repeated elements (Fig. S5). The conserved predicted protein associated with these repeats is also puzzling. To our knowledge, the coding regions associated with large mobile elements (even those using a rolling-circle replication process) always encode proteins bearing a recognizable similarity with DNA-processing enzymes (e.g., homing endonucleases, transposases, reverse transcriptase, etc.). This is not the case here. The presence of a highly conserved noncoding region within the repeat is also a remarkable feature (Fig. S6). We can only hypothesize that these two conserved elements are involved in the molecular mechanism leading to their duplication. Given their lack of resemblance to any previously described mobile genetic element, we named these PgV-specific repeats “PgV major interspersed genomic element” (PgV-MIGE).

Estimating the Actual Genome Sizes of OLPV1 and OLPV2, the Closest Relatives to PgV. The PgV-16T genome can be used to estimate the level of completion of the OLPV1 and OLPV2 partial genomes assembled from metagenomic data. Fig. S6 indicates that a total of 145 predicted proteins are shared between OLPV1 or OLPV2 and PgV. Given that OLPV1 and OLPV2 are much closer to each other (their shared predicted proteins are about 80% identical) than they are to PgV (OLPV-PgV orthologous proteins share 50% identical residues on average), it is reasonable to assume that most genes shared between PgV and OLPV1 or OLPV2 should also be present in both OLPVs. Following this hypothesis, we can now perform a simple statistical computation: out of the 145 “conserved genes,” only 121 are retrieved in the available OLPV2 draft genome. This might be taken to indicate that this genome is only 121/145 (83.4%) complete. Reciprocally, only 125 of these “conserved” genes occur in the draft OLPV1 genome, which, in turn, suggests that it is 125/145 (86%) complete. Applying these proportions, we can now estimate the total number of genes for these two viruses as 401/0.86 = 466 genes for OLPV1, and 326/0.834 = 390 genes for OLPV2. According to this extrapolation, OLPV1, OLPV2, and PgV-16T would possess numbers of genes falling within the variation range exhibited by close members from the same virus family, a result consistent with the fact that PgV proteins have OLPV proteins as their closest homologs.
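The completeness extrapolation for the two OLPV draft genomes is simple enough to check by hand. The sketch below reruns it with the counts quoted in the text; the function and variable names are illustrative only, and the fractions are rounded as in the text (0.86 and 0.834) before dividing.

```python
def completeness(conserved_found, conserved_expected):
    """Fraction of the expected conserved genes found in a draft genome."""
    return conserved_found / conserved_expected


def extrapolated_gene_count(annotated_genes, fraction_complete):
    """Scale the annotated gene count up to a full-genome estimate."""
    return annotated_genes / fraction_complete


CONSERVED_WITH_PGV = 145  # proteins shared between PgV and OLPV1 or OLPV2

olpv1_fraction = completeness(125, CONSERVED_WITH_PGV)
olpv2_fraction = completeness(121, CONSERVED_WITH_PGV)

# Round the fractions as the text does before extrapolating.
olpv1_estimate = extrapolated_gene_count(401, round(olpv1_fraction, 2))
olpv2_estimate = extrapolated_gene_count(326, round(olpv2_fraction, 3))

print(f"OLPV1: {olpv1_fraction:.1%} complete, about {olpv1_estimate:.1f} genes")
print(f"OLPV2: {olpv2_fraction:.1%} complete, about {olpv2_estimate:.1f} genes")
```

The results land within one gene of the 466 and 390 quoted in the text. The estimate rests on the assumption, stated above, that genes shared with PgV are a representative sample of each OLPV genome.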


Fig. 4. Phylogenetic position of PgV-16T among dsDNA viruses. Neighbor-joining (midpoint-rooted) tree computed from a multiple alignment of 49 viral DNA polymerase B sequences (405 ungapped sites) using the default options of the MAFFT server (30) and drawn using Mega (31). The tree was collapsed for bootstrap values of <50. The tree clearly suggests that the viruses with the largest known genomes form a well-supported cluster, irrespective of their hosts: phagotrophic protozoans (in red) or photosynthetic unicellular algae (in green). Viruses presently classified as “phycodnaviruses” (i.e., “algal viruses”) (green and blue) do not form a monophyletic lineage.

Conclusion

The genetic analysis of the complete genome of P. globosa virus PgV-16T revealed a number of specific features only found to be associated with the group of viruses exhibiting the largest and most complex viral genomes known to date, consistent with the molecular phylogeny of its DNA polymerase (Fig. 4). Although group I P. globosa-infecting viruses cluster with the Megaviridae (regrouping the viruses with the largest known genomes), group II PgVs belong to a very different clade yet apart from other algae-infecting viruses (Fig. S7). Most unexpected was the discovery of a 19,527-bp virophage genome-like DNA molecule that appears to be part of the DNA content of the PgV-16T virion. PgVV exists either as a linear plasmid-like molecule or as a randomly integrated occurring at multiple sites within the PgV-16T genome, as recently described for Sputnik 2, a virophage associated with the Lentille virus (17). The PgV-associated virophage genome shares only three homologous genes (including a primase, pgvv04) with the two previously described OLPV and Mavirus virophages. Because none of the other PgVV predicted gene products resembled a capsid protein, we hypothesize that PgVV cannot exist as a free virus particle, a situation akin to another mobile element associated with the various Mimiviridae mobile genetic elements, the transpovirons (17). Given its predicted linear genome, its lack of capsid-like protein, and its minimal resemblance with Sputnik, Mavirus, and OLV1, PgVV might represent the first example of a virophage/transpoviron intermediate mobile element.

In addition to its association with a virophage-like sequence, the PgV-16T genome (as well as draft genomes of PgV 12-T and PgV 14-T strains) shares many of the hallmark features of the largest known viral genomes such as a MutS7-type mismatch DNA-repair enzyme, a large complement of DNA-directed RNA polymerase subunits and transcription factors, a TATA-box-like–binding protein, an mRNA-capping enzyme, DNA topoisomerases of the two types (I and II), a poly-A polymerase, and the mysteriously conserved asparagine synthetase (Fig. S3). Thus, PgV-16T belongs to the recently proposed Megaviridae family (7, 32), within which DNA viruses exhibiting the largest known genomes appear to cluster, although they infect hosts distributed among a wide taxonomic range of unicellular eukaryotes, from unikonts to bikonts (33).

Materials and Methods

Phaeocystis globosa Culture and Virus Purification. Exponential axenic cultures of Phaeocystis globosa Pg-G (Groningen University Culture Collection, The Netherlands), cultured according to ref. 34 with 40 μmol quanta·m−2·s−1 illumination, were infected with axenic PgV-16T at a virus-to-algal-host ratio of 10, allowing a one-step virus growth cycle (Fig. S8). Upon full lysis of the host, viruses were concentrated by ultrafiltration on Vivaflow 200 (molecular mass cutoff, 30 kDa; Vivascience), after which the viral concentrate was clarified of cell debris by low-speed centrifugation at 7,500 × g for 30 min at 4 °C. The supernatant was decanted, and viral particles were concentrated by ultracentrifugation at 141,000 × g for 2 h at 8 °C using a TFF55.38 rotor. Viral pellets were resuspended in TE buffer (pH 8.0) and stored frozen until further use.

Genome Sequencing and Assembly. Around 1 μg of phenol-chloroform–extracted PgV-16T virus DNA was used to construct a paired-end library, with ∼280-bp insert size, as described by the manufacturer’s (Illumina) protocol. Around 60 million paired-end reads were produced on an Illumina HiSeq2000, corresponding to ∼6 Gb of raw nucleotide sequence. SI Materials and Methods contains further details and the bioinformatic procedures of genome assembly and annotation.

ACKNOWLEDGMENTS. We thank Harry Witte for practical support. We are glad to acknowledge the use of the electron microscopy platform of the Mediterranean Institute of Microbiology. This work was partially supported by Centre National de la Recherche Scientifique and a grant from the French National Genomic Center–Genoscope.

1. La Scola B, et al. (2003) A giant virus in amoebae. Science 299(5615):2033.
2. Raoult D, et al. (2004) The 1.2-megabase genome sequence of Mimivirus. Science 306(5700):1344–1350.
3. Ghedin E, Claverie JM (2005) Mimivirus relatives in the Sargasso sea. Virol J 2:62.
4. Yooseph S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5(3):e16.
5. Monier A, Claverie JM, Ogata H (2008) Taxonomic distribution of large DNA viruses in the sea. Genome Biol 9(7):R106.
6. Fischer MG, Allen MJ, Wilson WH, Suttle CA (2010) Giant virus with a remarkable complement of genes infects marine zooplankton. Proc Natl Acad Sci USA 107(45):19508–19513.
7. Arslan D, Legendre M, Seltzer V, Abergel C, Claverie JM (2011) Distant Mimivirus relative with a larger genome highlights the fundamental features of Megaviridae. Proc Natl Acad Sci USA 108(42):17486–17491.
8. Monier A, et al. (2008) Marine mimivirus relatives are probably large algal viruses. Virol J 5:12.
9. Larsen JB, Larsen A, Bratbak G, Sandaa RA (2008) Phylogenetic analysis of members of the Phycodnaviridae virus family, using amplified fragments of the major capsid protein gene. Appl Environ Microbiol 74(10):3048–3057.
10. Brussaard CPD, Short SM, Frederickson CM, Suttle CA (2004) Isolation and phylogenetic analysis of novel viruses infecting the phytoplankton Phaeocystis globosa (Prymnesiophyceae). Appl Environ Microbiol 70(6):3700–3705.
11. Brussaard CPD, Kuipers B, Veldhuis MJW (2005) A mesocosm study of Phaeocystis globosa population dynamics: I. Regulatory role of viruses in bloom control. Harmful Algae 4(5):859–874.
12. Baudoux AC, Brussaard CPD (2005) Characterization of different viruses infecting the marine harmful species Phaeocystis globosa. Virology 341(1):80–90.
13. Yau S, et al. (2011) Virophage control of antarctic algal host-virus dynamics. Proc Natl Acad Sci USA 108(15):6163–6168.
14. Sayers EW, et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40(Database issue):D13–D25.
15. Wilson WH, et al. (2005) Complete genome sequence and lytic phase transcription profile of a Coccolithovirus. Science 309(5737):1090–1092.
16. La Scola B, et al. (2008) The virophage as a unique parasite of the giant mimivirus. Nature 455(7209):100–104.
17. Desnues C, et al. (2012) Provirophages and transpovirons as the diverse mobilome of giant viruses. Proc Natl Acad Sci USA 109(44):18078–18083.
18. Fischer MG, Suttle CA (2011) A virophage at the origin of large DNA transposons. Science 332(6026):231–234.
19. Claverie JM, Abergel C (2009) Mimivirus and its virophage. Annu Rev Genet 43:49–66.
20. Rybchin VN, Svarchevsky AN (1999) The plasmid prophage N15: A linear DNA with covalently closed ends. Mol Microbiol 33(5):895–903.
21. Iyer LM, Aravind L, Koonin EV (2001) Common origin of four diverse families of large eukaryotic DNA viruses. J Virol 75(23):11720–11734.
22. Ogata H, et al. (2011) Two new subfamilies of DNA mismatch repair proteins (MutS) specifically abundant in the marine environment. ISME J 5(7):1143–1151.
23. Van Etten JL, Meints RH (1999) Giant viruses infecting algae. Annu Rev Microbiol 53:447–494.
24. Koonin EV, Yutin N (2010) Origin and evolution of eukaryotic large nucleo-cytoplasmic DNA viruses. Intervirology 53(5):284–292.
25. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310(1):243–257.
26. Kapitonov VV, Jurka J (2008) A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet 9(5):411–412, author reply 414.
27. Siguier P, Varani A, Perochon J, Chandler M (2012) Exploring bacterial insertion sequences with ISfinder: Objectives, uses, and future developments. Methods Mol Biol 859:91–103.
28. Filée J, Siguier P, Chandler M (2007) Insertion sequence diversity in archaea. Microbiol Mol Biol Rev 71(1):121–157.
29. Bao W, Jurka MG, Kapitonov VV, Jurka J (2009) New superfamilies of eukaryotic DNA transposons and their internal divisions. Mol Biol Evol 26(5):983–993.
30. Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 9(4):286–298.
31. Tamura K, et al. (2011) MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28(10):2731–2739.
32. Claverie JM, Abergel C (2013) Open questions about giant viruses. Adv Virus Res 85:25–56.
33. Cavalier-Smith T (2002) The phagotrophic origin of eukaryotes and phylogenetic classification of Protozoa. Int J Syst Evol Microbiol 52(Pt 2):297–354.
34. Baudoux AC, Brussaard CPD (2008) Influence of irradiance on virus-algal host interactions. J Phycol 44(4):902–908.

www.pnas.org/cgi/doi/10.1073/pnas.1303251110