<<

Exploring nucleo-cytoplasmic large DNA viruses in Tara microbial metagenomes Pascal Hingamp, Nigel Grimsley, Silvia G Acinas, Camille Clerissi, Lucie Subirana, Julie Poulain, Isabel Ferrera, Hugo Sarmento, Emilie Villar, Gipsi Lima-Mendez, et al.

To cite this version:

Pascal Hingamp, Nigel Grimsley, Silvia G Acinas, Camille Clerissi, Lucie Subirana, et al.. Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME Journal, Nature Publishing Group, 2013, 7 (9), pp.1678-1695. ￿10.1038/ismej.2013.59￿. ￿hal-01258223￿

HAL Id: hal-01258223 https://hal.archives-ouvertes.fr/hal-01258223 Submitted on 19 Jan 2016

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés.

Distributed under a Creative Commons Attribution| 4.0 International License The ISME Journal (2013) 7, 1678–1695 & 2013 International Society for Microbial All rights reserved 1751-7362/13 OPEN www.nature.com/ismej ORIGINAL ARTICLE Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes

Pascal Hingamp1,11, Nigel Grimsley2, Silvia G Acinas3, Camille Clerissi2, Lucie Subirana2, Julie Poulain4, Isabel Ferrera3,12, Hugo Sarmento3, Emilie Villar1, Gipsi Lima-Mendez5,6, Karoline Faust5,6, Shinichi Sunagawa7, Jean-Michel Claverie1, Herve´ Moreau2, Yves Desdevises2, Peer Bork7, Jeroen Raes5,6, Colomban de Vargas8, Eric Karsenti7, Stefanie Kandels-Lewis7, Olivier Jaillon4, Fabrice Not8, Ste´phane Pesant9, Patrick Wincker4 and Hiroyuki Ogata1,10,11 1CNRS, Aix-Marseille Universite´, Laboratoire Information Ge´nomique et Structurale (UMR 7256), Mediterranean Institute of Microbiology (FR 3479), Marseille, France; 2CNRS and Universite´ Pierre et Marie (Paris 06), UMR 7232, Observatoire Oce´anologique, Banyuls-sur-Mer, France; 3Department of Marine and Oceanography, Institute of Marine Science (ICM), CSIC, Passeig Marı´tim de la Barceloneta, Barcelona, Spain; 4CEA, Institut de Ge´nomique, Genoscope, Evry, France; 5Department of Structural Biology, VIB, Brussel, Belgium; 6Department of Applied Biological Sciences (DBIT), Vrije Universiteit Brussel, Brussels, Belgium; 7European Molecular Biology Laboratory, Heidelberg, Germany; 8CNRS, Universite´ Pierre et Marie Curie (Paris 06), UMR 7144, Station Biologique de Roscoff, Roscoff, France; 9MARUM—Center for Marine Environmental Sciences, Universita¨t Bremen, Bremen, Germany and 10Education Academy of Computational Life Sciences, Tokyo Institute of Technology, Tokyo, Japan

Nucleo-cytoplasmic large DNA viruses (NCLDVs) constitute a group of eukaryotic viruses that can have crucial ecological roles in the sea by accelerating the turnover of their unicellular hosts or by causing diseases in . To better characterize the diversity, abundance and biogeography of marine NCLDVs, we analyzed 17 metagenomes derived from microbial samples (0.2–1.6 lm size range) collected during the Tara Oceans Expedition. The sample set includes ecosystems under-represented in previous studies, such as the Arabian Sea oxygen minimum zone (OMZ) and Indian lagoons. By combining computationally derived relative abundance and direct prokaryote cell counts, the abundance of NCLDVs was found to be in the order of 104–105 genomes ml 1 for the samples from the photic zone and 102–103 genomes ml 1 for the OMZ. The Megaviridae and dominated the NCLDV populations in the metagenomes, although most of the reads classified in these families showed large divergence from known viral genomes. Our taxon co-occurrence analysis revealed a potential association between viruses of the Megaviridae family and related to . In support of this predicted association, we identified six cases of lateral gene transfer between Megaviridae and oomycetes. Our results suggest that marine NCLDVs probably outnumber eukaryotic organisms in the photic layer (per given water mass) and that metagenomic sequence analyses promise to shed new light on the biodiversity of marine viruses and their interactions with potential hosts. The ISME Journal (2013) 7, 1678–1695; doi:10.1038/ismej.2013.59; published online 11 April 2013 Subject Category: Microbial population and community ecology Keywords: eukaryotic viruses; marine NCLDVs; taxon co-occurrence; oomycetes

Introduction bacterial hosts (Suttle, 2007). However, little is known about the diversity, abundance and biogeo- Viruses are thought to be extremely abundant in the graphy of marine viruses infecting other cellular sea. Indeed, phages alone outnumber all other life organisms, in particular eukaryotes. Although less forms in seawater, reflecting the abundance of their numerous than bacteria, eukaryotes often represent the bulk of biomass and mediate important Correspondence: H Ogata, Education Academy of Computational biogeochemical and food web processes (Falkowski Life Sciences, Tokyo Institute of Technology, 2-12-1, Ookayama, Meguro-ku, Tokyo 152 8552, Japan. et al., 2004, Massana, 2011). E-mail: [email protected], [email protected] Nucleo-cytoplasmic large DNA viruses (NCLDVs; 11These authors contributed equally to this work. Iyer et al., 2006, Yutin and Koonin, 2012) constitute an 12 Current address: Departament de Gene`tica i de Microbiologia, apparently monophyletic group of eukaryotic viruses Universitat Auto`noma de Barcelona, Bellaterra ES-08193, Spain. Received 6 November 2012; revised 28 February 2013; accepted 6 with a large double-stranded DNA (dsDNA) genome March 2013; published online 11 April 2013 ranging from 100 kb up to 1.26 Mb. Their hosts show a NCLDVs in Tara Oceans metagenomes P Hingamp et al 1679 remarkably wide taxonomic spectrum from micro- The abundance of viruses (Ostreococcus tauri virus scopic unicellular eukaryotes to larger animals, (OtVs)) infecting the smallest free-living green alga including humans. Certain NCLDVs are known to O. tauri could vary from undetectable levels to over have important roles in marine ecosystems. For 104 viruses ml 1 depending on the season and the instance, virus (HaV) affects distance from the shore (Bellec et al., 2010). The the population dynamics of their unicellular algal abundance of EhVs could reach over 107 viruses ml 1 host, which forms seasonal harmful blooms in coastal in rapidly expanding host populations in mesocosm areas (Tomaru et al., 2004). Another well-known virus experiments simulating host blooms (Schroeder ( viruses (EhV)) controls the popula- et al., 2003, Pagarete et al., 2011). A typical observa- tion of the ubiquitous E. huxleyi,which tion in these studies was an episodic sudden increase can form vast oceanic blooms at temperate latitudes (4 several orders of magnitude) in virus concentra- and exerts complex influence on the carbon cycle tion. These studies focused on specific viral / (Pagarete et al., 2011). Other NCLDVs cause diseases strains and depended on the availability of host in fishes and can lead to economic damages in cultures for lysis evaluation or on relatively simple aquaculture industries (Kurita and Nakajima, 2012). community compositions amenable to FC analysis. NCLDVs include viruses with very large virion Currently, no direct method is available to assess the particles, which do not pass through 0.2-mm filters abundance of diverse NCLDVs in a complex micro- typically used in viral metagenomics to separate free bial assemblage dominated by an overwhelming viruses from other organisms (Van Etten, 2011). The amount of bacterial cells and phages. prototype of such large viruses, also referred to as To better understand the diversity and geographi- giruses (Claverie et al., 2006), is the -infecting cal distribution of marine NCLDVs, we analyzed a Acanthamoeba polyphaga Mimivirus with a 0.75-mm subset of metagenomic sequence data (0.2–1.6 mm virion particle and 1.18-Mb genome (Raoult et al., size fraction) generated by Tara Oceans, an interna- 2004). Since the discovery of the giant Mimivirus from tional multidisciplinary scientific program aiming fresh water samples, NCLDVs have become a subject to characterize ocean plankton diversity, the role of of broader interest. This has led to several conceptual these drifting in marine ecosystems breakthroughs in our understanding of the origin of and their response to environmental changes viruses and their links to the evolution of cellular (Karsenti et al., 2011). Samples were collected during organisms (Claverie, 2006; Forterre, 2006; Raoult and the first year of the expedition from the Strait of Forterre, 2008; Forterre, 2010; Legendre et al., 2012). Gibraltar, through the Mediterranean and Red Sea, ThesequencingoftheMimivirusgenomeprompted down to the middle of the Indian Ocean (Table 1). the discovery of many close homologs in environ- Some marine regions under-represented in previous mental sequence data (Lopez-Bueno et al., 2009; metagenomic studies are included in this sample set, Cantalupo et al., 2011). Most notably, Mimivirus gene such as those from the Arabian Sea oxygen minimum homologs were detected in the Global Ocean Sam- zone (OMZ) and Indian Ocean lagoons. Most prokar- pling (GOS) marine metagenomes (Ghedin and yotic cells and many large virus particles are Claverie, 2005; Monier et al., 2008a; Williamson expected to be captured within the 0.2–1.6 mmsize et al., 2008), suggesting Mimivirus relatives exist fraction used in the present metagenome study. Here in the sea. Soon afterwards, two giant viruses related we show that putative NCLDV sequences differ to Mimivirus were isolated from marine environ- substantially from known reference genomes, sug- ments. These are roenbergensis virus gesting a high diversity of giant marine viruses. The (CroV; 750 kb) infecting a major marine microflagellate concentration of NCLDV genomes in the samples was grazer (Fischer et al., 2010) and Megavirus chilensis estimated by factoring the metagenome data set with (1.26 Mb) infecting Acanthamoeba (Arslan et al., prokaryotic abundance determined by FC and micro- 2011). About 70 NCLDV genomes have been scopy on samples collected concurrently on Tara. sequenced so far, of which about 15 represent marine Finally, we tested the capacity of the taxon co- viruses (Pruitt et al., 2012). Thanks to this recent occurrence patterns (Chaffron et al., 2010, Steele accumulation of sequence data and analyses, the et al., 2011) present in our data set to provide hints visible portion of the NCLDV is fast about potential natural hosts for marine NCLDVs. expanding, and NCLDV abundance in the sea is increasingly being recognized. However, our knowl- edge of their biology is still limited, leaving such Materials and methods fundamental ecological parameters as their abundance and host taxonomic range to be determined. Sampling and DNA extraction Previous studies examined the abundance of At the end of March 2012, a 2.5-year circum-global specific species/groups of NCLDVs in marine envir- expedition was completed onboard Tara, an arctic onments using either laboratory culture of viral exploration schooner modified for global marine hosts or flow cytometry (FC). The concentration of research with innovative systems for multiscale HaVs infecting the H. akashiwo could sampling of planktonic communities. During the reach 104 viruses ml 1 in natural sea water during expedition, planktonic organisms ranging in size from the period of host blooms (Tomaru et al., 2004). viruses to fish larvae together with physico-chemical

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1680 Table 1 General description of the samples analyzed in this study

Name Station Region Marine Depth Locationa T Salinity Chl a Date and Sample identifiers number system type (m) (1C) (psu) (mg Chl time (UTC)a am 3)

3_S 3 Atlantic Ocean Open ocean SRF 36143.520’N NA NA NA 2009/09/13 10:40 TARA-Y200000001 (A6.1) 10128.250’W 4_S 4 Atlantic Ocean Open ocean SRF 36133.200’N NA NA NA 2009/09/15 10:15 TARA-Y200000002 (A11) 6134.010’W 6_S 6 Mediterranean Sea Enclosed sea SRF 36131.239’N 17.0 37.35 3.121 2009/09/21 14:49 TARA-Y200000003 (A32) 410.443’W 7_S 7 Mediterranean Sea Enclosed sea SRF 3712.321’N 23.8 37.48 0.075 2009/09/23 17:05 TARA-A200000113 1156.99’W 7_D 7 Mediterranean Sea Enclosed sea DCM 3712.321’N 17.8 37.09 0.296 2009/09/23 17:05 TARA-A200000159 (42 m) 1156.99’W 23_S 23 Mediterranean Sea Enclosed sea SRF 42110.462’N 17.1 38.22 0.036 2009/11/18 12:44 TARA-E500000066 17143.163’E 23_D 23 Mediterranean Sea Enclosed sea DCM 42110.462’N 16.0 38.30 0.119 2009/11/18 12:44 TARA-E500000081 (56 m) 17143.163’E 30_S 30 Mediterranean Sea Enclosed sea SRF 33155.077’N 20.4 39.42 0.025 2009/12/14 12:44 TARA-A100001568 32153.622’E 31_S 31 Red Sea Enclosed sea SRF 2718.100’N 25.0 39.91 0.005 2010/01/09 10:03 TARA-A100001568 34148.400’E 36_S 36 Arabian Sea Semi-enclosed sea SRF 20149.053’N 26.0 36.53 0.047 2010/03/12 10:36 TARA-Y100000022 63130.727’E 38_S 38 Arabian Sea Semi-enclosed sea SRF 1912.318’N 26.3 36.62 0.052 2010/03/15 03:45 TARA-Y100000288 64129.620’E 38_Z 38 Arabian Sea Semi-enclosed sea OMZ 1912.103’N 14.7 36.00 0.002 2010/03/16 06:14 TARA-Y100000294 (350 m) 64133.825’E 39_S 39 Arabian Sea Semi-enclosed sea SRF 18134.213’N 27.4 36.29 0.026 2010/03/18 09:56 TARA-Y100000029 66129.167’E 39_Z 39 Arabian Sea Semi-enclosed sea OMZ 18144.043’N 15.6 35.91 0.003 2010/03/20 08:17 TARA-Y100000031 (270 m) 66123.375’E 43_S 43 Indian Ocean Lagoon SRF 4139.582’N 30.0 34.49 0.075 2010/04/05 08:50 TARA-Y100000074 73129.128’E 46_S 46 Indian Ocean Lagoon SRF 0139.748’S 30.1 35.11 0.050 2010/04/15 02:40 TARA-Y100000100 7319.664’E 49_S 49 Indian Ocean Open ocean SRF 16148.497’S 28.3 34.49 0.024 2010/04/23 10:29 TARA-Y100000120 59130.257’E

Abbreviations: DCM, deep maximum; NA, not applicable; OMZ, oxyzen minimum zone; SRF, surface; UTC, Coordinated Universal Time. aLocations, date and time correspond to events for the collection of contextual physicochemical data. Events for water sampling could slightly differ from these values.

contextual data were collected from several depths at (hexadecyltrimethylammonium bromide) protocol 153 stations across the world oceans. Plankton were (Winnepenninckx et al., 1993): (i) the filters were collected from up to three depths: near the surface incubated at 60 1C for 1 h in a CTAB buffer (2% CTAB; (SRF; B5 m), at the depth of maximum chlorophyll 100 mM TrisHCl (pH ¼ 8); 20 mM EDTA; 1.4 M NaCl; a fluorescence (deep chlorophyll maximum, DCM; 0.2% b-mercaptoethanol; 0.1 mg ml 1 proteinase K; 20–200 m) and in the mesopelagic layer (MESO; 10 mM DTT (dithiothreitol), (ii) DNA was purified 200–1000 m) to capture deep oceanographic features, using an equal volume of chloroform/isoamylalcohol such as OMZs. As much as possible where sampling (24:1) and a 1-h-long RNase digestion step, and (iii) was shallower than 80 m, SRF and DCM samples DNA was precipitated with a 2/3 volume of isopro- were collected using a large peristaltic pump (A40, panol and washed with 1 ml of a EtOH/NH4Ac TECH-POMPES, Sens, France), whereas samples from solution (76% and 10 mM, respectively). Finally, the deeper DCM and MESO were collected using 12-l extracted DNA samples were dissolved in 100 mlof Niskin bottles mounted on a rosette equipped with laboratory grade water and stored at 20 1C until physico-chemical sensors. For samples analyzed in sequencing. On average, an approximate yield of this study, 100 liters of seawater from each depth were 1 mg ml 1 was obtained for each sample. first passed through 200- and 20-mm mesh filters to remove larger plankton, then gently passed in series through 1.6- and 0.22-mm filters (142 mm, GF/A Metagenomic sequence data glass microfiber pre-filter, Whatman, Maidstone, UK; All sequencing libraries were created using the and 142 mm, 0.22 mm Express PLUS Membrane, Roche-454 Rapid Library kit (Roche Applied Millipore, Billerica, MA, USA, respectively) using Science, Meylan, France). The input for nebuliza- a peristaltic pump (Masterflex, EW-77410-10, tion used 500 ng of extracted DNA. Each library Cole-Parmer International, Vernon Hills, IL, USA). was indexed to avoid cross-contamination and The filters were kept for 1 month at 20 1C on board sequenced on one-eighth to one-half of a GS-FLX Taraandthenat 80 1C in the laboratory until DNA Titanium plate (Meylan, France). Quality checking extraction. DNA was extracted using a modified CTAB of the reads was performed using the 454 standard

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1681 tools. 454-based pyrosequencing is known to gen- concentration), filtered onto a 0.2-mm polycarbonate erate artificial duplicates (Briggs et al., 2007). filter and kept frozen until processing. For the Therefore, for each set of reads generated from the enumeration of total prokaryotes, cells were stained same sample by the same 454 run, we identified and with DAPI and between 500 and 1000 DAPI-positive removed artificial duplicates using the 454 Repli- cells were counted manually in a minimum of 10 cate Filter software (Gomez-Alvarez et al., 2009) by microscope fields using an Olympus BX51TF epi- applying the following criteria: X5 identical starting fluorescence microscope (Olympus, Tokyo, Japan). nucleotides and X97% overall nucleotide sequence identity. This resulted in an overall reduction of the number of reads by 16%, ranging from 3% to 47% Enumeration of prokaryotes by FC depending on the sample. Metagenomic sequence For FC counts, three aliquots of 1 ml of seawater (pre- data generated from Tara Oceans are referred to as filtered through 200-mm mesh) were collected from Tara Oceans Project (TOP) metagenomes. The each depth. Samples were fixed immediately using sequence data analyzed in this study is based on a cold 25% glutaraldehyde (final concentration subset of TOP metagenomes (Table 2), which is 0.125%), left in the dark for 10 min at room referred to as TOP pyrosequences or, in the present temperature, subsequently flash-frozen and kept in study, simply as TOP data. The sequence data are liquid nitrogen on board, and then stored at 80 1C accessible from the Sequence Read Archive of the in the laboratory. Two sub-samples were taken for European Nucleotide Archive through the accession separate counts of heterotrophic prokaryotes and number ERA155562 and ERA155563. Additional phototrophic . For heterotrophic pro- sequence and annotation data are accessible from karyote determination, 400 ml of sample was added to http://www.igs.cnrs-mrs.fr/TaraOceans. a diluted SYTO-13 (Molecular Probes Inc., Eugene, The GOS metagenomic sequence reads (Rusch OR, USA) stock (10:1) at 2.5 mmol l 1 final concen- et al., 2007) were downloaded from CAMERA (Sun tration, left for about 10 min in the dark to complete et al., 2011). We used only the sequence data the staining and run in the flow cytometer. We used a recovered from the samples corresponding to the FacsCalibur (Becton and Dickinson, Franklin Lakes, size fraction between 0.1 and 0.8 mm (that is, 40 NJ, USA) flow cytometer equipped with a 15-mW samples corresponding to GS001 to GS051). Protein- Argon-ion laser (488 nm emission). At least 30 000 coding regions in the metagenomic sequences (TOP events were acquired for each subsample (usually and GOS) were identified using the FragGeneScan 90 000 events). Fluorescent beads (1 mm, Fluoresbrite software (Rho et al., 2010). carboxylate microspheres, Polysciences Inc., War- rington, PA, USA) were added at a known density as internal standards. The bead standard concentration Enumeration of prokaryotes by 4,6-diamidino-2- was determined by epifluorescence microscopy. phenylindole (DAPI) Heterotrophic prokaryotes were detected by their In all, 10 ml of seawater for SRF and DCM and signature in a plot of side scatter vs FL1 (green 90 ml for OMZ (pre-filtered through 20-mm mesh) fluorescence). In a red (FL3) –green (FL1) fluores- were fixed in paraformaldehyde (1.5% final cence plot, beads fall in one line, heterotrophic prokaryotes in another and noise in a third (respec- Table 2 Quality-controlled Tara Oceans pyrosequence data tively, with more FL3 than FL1). Picocyanobacteria fall in between noise and heterotrophic prokaryote. Sample Total size Number G þ C Average Number Average This method is based on del Giorgio et al. (1996) as name (bp) of reads (%) size (bp) of pre- ORF dicted size (aa) discussed in Gasol and del Giorgio (2000). For ORFs phototrophic picoplankton, we used the same pro- cedure as for heterotrophic prokaryote but without 3_S 21 533 646 63 994 37 336 65 656 99 addition of SYTO-13. Small eukaryotic were 4_S 52 953 075 140 754 38 376 149 018 108 identified in plots of side scatter vs FL3, and FL2 vs 6_S 36 129 806 95 255 48 379 98 996 111 FL3 (Olson et al., 1993), and excluded in the 7_S 98 750 180 332 049 38 297 335 408 90 7_D 279 389 388 1 117 888 37 250 1 013 853 81 enumeration of phototrophic prokaryotes. Data ana- 23_S 67 695 268 196 190 39 345 201 447 101 lysis was performed with the Paint-A-Gate software 23_D 83 539 478 239 447 38 349 246 948 102 (Becton and Dickinson). The abundance of prokar- 30_S 89 180 466 256 028 37 348 268 616 101 yotic cells was based on the enumerations of 31_S 245 463 121 614 743 39 399 660 949 114 heterotrophic and phototrophic prokaryotes. 36_S 245 945 064 737 506 39 333 757 448 100 38_S 214 253 370 601 110 39 356 631 351 103 38_Z 223 188 575 638 843 45 349 659 041 104 39_S 233 273 851 590 664 43 395 629 501 114 NCLDV classification 39_Z 249 558 778 679 589 46 367 708 056 108 Throughout this study, we used the NCLDV nomen- 43_S 167 515 516 529 506 37 316 545 641 93 46_S 251 310 870 648 425 41 388 689 641 112 clature derived from the common ancestor hypo- 49_S 222 417 021 680 573 43 327 696 974 98 thesis (Iyer et al., 2006) based on seven distantly related viral families: Megaviridae, Phycodnaviri- Abbreviation: ORF, open reading frame. dae, Marseilleviridae, Iridoviridae, Ascoviridae,

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1682 Asfarviridae and Poxviridae. Among theses, Mega- used to map the sequences in the reference trees iridae is a recently proposed family (Arslan et al., using the Bayesian option. This Pplacer approach was 2011), which includes Mimivirus, Mamavirus, used also for the phylogenetic analysis of the reads Megavirus, CroV and other marine viruses such as assigned to the Megaviridae and oomycetes taxo- Pyramimonas orientalis virus, Phaeocystis pouchetii nomic nodes. For the visualization of phylogenetic virus (PpV), Chrysochromulina ericina virus (CeV) trees, we used Archaeopteryx (Han and Zmasek, as well as Organic Lake Viruses (OLPV1, OLPV2) 2009), FigTree (http://tree.bio.ed.ac.uk/software/fig- (Ogata et al., 2011; Yau et al., 2011). Although the tree/) and MEGA version 5.1 (Tamura et al., 2011). order Megavirales was recently proposed to refer to the taxonomic classification of NCLDVs (Colson et al., 2012), we simply refer here to these viruses 2bLCA taxonomic annotation collectively as NCLDVs. Each 454 read 4100 bp in length was assigned a taxonomic classification using a dual BLAST (Altschul et al., 1997; Monier et al., 2008b) based Marker genes last common ancestor (2bLCA) approach somewhat Sixteen NCLDV marker genes were selected from the similar to the method applied by MEGAN (Huson 1445 clusters of NCLDV orthologs, represented in et al., 2007) but using an adaptive E-value threshold the NCVOG database (Yutin et al., 2009). These specific for each protein. For each 454 read, the best marker genes were selected based on their conserva- local alignment (high-scoring segment pair (HSP)) tion in nearly all known NCLDV genomes (four with known proteins was obtained by a first BLAST markers) or in a majority of viruses from the two (B1; BLASTx) against the UniProt database release major marine NCLDV families (Megaviridae and April 2011 (UniProt Consortium, 2012). Reads Phycodnaviridae; 12 markers), as well as on the without any HSPs at an E-value 10 5 were classi- observation that these genes typically occur only once p fied as ‘no hits’. For each read with at least one in their genomes if present (Supplementary Table S1). significant HSP, the subsequence of the UniProt For cellular organisms, we used 35 conserved genes subject fragment aligned in the best scoring B1 HSP normally encoded as a single copy in all the cellular wasusedasasecondBLAST(B2;BLASTp)query organisms (Raes et al., 2007). Profile-hidden Markov against the same UniProt database. All the B2 models (Eddy, 2008) derived from the sequence database hits with an E-value B1 HSP were recorded alignments of these marker genes were used to p and defined to constitute a set of close homologs for identify their homologs (E-value 10 3)inthetrans- p the read (denoted as set H). The taxonomic classifica- lated amino-acid sequence sets derived from metage- tions (Benson et al.,2012)ofthesetHwerethen nomic data. After identification of the marker gene reduced to their LCA, which was finally assigned to homologs, taxonomic assignment was performed the read as its taxonomic annotation. Reads were using the dual BLAST based last common ancestor annotated as ‘ambiguous’ if the set H contained (2bLCA) method described below in order to separate representatives from several domains of life. This these sequences in distinct NCLDV, Bacteria, Archaea 2bLCA protocol was applied to the metagenomic and bins. For each marker gene, we then reads as well as to the metagenomic marker gene obtained marker gene density in the metagenomes homologs (predicted protein sequences). For the latter (number of hits per Mbp). A normalization process for case, we used BLASTp for B1 (instead of BLASTx) the marker gene size was introduced by dividing the against a customized reference database (that is, a computed marker gene density by the length of the subset of UniProt) with enriched taxonomic annota- reference multiple sequence alignment of the profile- tions for NCLDVs. The use of two protein reference hidden Markov model. databases in this study merely reflects the period when the computation was performed. Phylogenetic mapping Phylogenetic mapping (Monier et al., 2008a) is a method to place and classify a new sequence (usually Read abundance per taxon a short environmental sequence) within a reference For each set of taxa at a given depth (here fifth level tree using a precompiled multiple sequence align- from the root) in the National Center for Biotechno- ment. In this study, we compiled a reference sequence logy Information (NCBI) taxonomic tree of life, set composed of 187 type B DNA polymerase (PolB) we estimated the relative read abundance of homologs and a reference sequence set composed of plankton representatives for each taxon in each 154 MutS homologs from diverse cellular organisms Tara Oceans sample (providing a samples taxa and viruses (Supplementary Figures S1 and S2). matrix). The relative read abundance of a specific Multiple sequence alignments and phylogenetic trees taxon for a specific sample was calculated as the were constructed using T-Coffee (Notredame et al., number of 454 metagenomic reads with a taxonomic 2000) and RAXML (Rokas, 2011). HMMALIGN was annotation at or below the taxon level divided by the used to align metagenomic sequences on the refer- total number of 454 reads in the sample. The ence alignments and Pplacer (Matsen et al., 2010) was resulting matrix composed of 712 taxa (rows) across

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1683 17 samples (columns) is provided (Supplementary Megaviridae proteome database contained all Files S1 and S2). 6678 publically available peptides for M. chilensis (1120 peptides), Megavirus courdo7 (1139 peptides), Co-occurrence analysis Acanthamoeba castellanii mamavirus (997 The 712 taxa 17 samples matrix from above was peptides), A. polyphaga mimivirus (972 peptides), first filtered to exclude taxa with o5 total reads, A. polyphaga mimivirus isolate M4 (756 peptides), reducing the matrix to 609 taxa. To normalize the Moumouvirus Monve (1150 peptides) and CroV BV- read counts with respect to varying sequencing PW1 (544 peptides). Because complete depth across samples, the number of reads in each proteomes were poorly represented in the Uni- cell of the matrix was divided by the total number of Ref100 database release December 2010 (Suzek reads for the corresponding column. In order to et al., 2007) which we intended to use for HGT detect putative taxon co-occurrences across the 17 detection, we enriched UniRef100 with oomycete samples, rank-based Spearman correlation coeffi- proteomes from the following publically available cients (r) were first computed between taxon pairs oomycete genome and transcriptome projects using the R ‘stats’ package ‘cor’ function (R (Supplementary Table S2): euteiches Development Core Team, 2011). Significance of each ESTs (161 384 open reading frames (ORFs)) (Gaulin r was tested by computing a two-sided P-value et al., 2008), arabidopsidis (asymptotic t approximation) using the R ‘stats’ (14 937 ORFs) (Baxter et al., 2010), ultimum package ‘cor.test’ function and controlled for multi- (14 224 peptides) (Levesque et al., 2010), as well as ple tests using false discovery rate (q-value) com- Hyaloperonospora parasitica (6452 peptides), Phy- puted by the tail area-based method of the R ‘fdrtool’ tophthora infestans (14 580 peptides), package (Strimmer, 2008). Taxon associations with ramorum (10 892 peptides), |r|40.7 and qo0.05 were reported with this first (13 995 peptides) and parasitica approach. Taxon co-occurrences/co-exclusions were (17 437 peptides) available from the Broad Institute also independently assessed by the method of Harvard and MIT ‘Saprolegnia and Phytophthora described by Faust et al. (2012). In this second and Sequencing Project’. Where peptides were not made more stringent approach, the two samples from available, nucleotide sequences were translated OMZ were excluded to reduce the detection of into ORFs 450 amino acids. To these 265 433 biome-specific patterns in species distributions. In non-redundant oomycete peptides, we added a addition, we excluded parent–child taxonomic none-oomycete proteome from Tha- relationships (for example, an association between lassiosira pseudonana (11 532 peptides), absent ‘Viruses’ and ‘Phycodnaviridae’) in this second from UniRef100 but publically available at the analysis. Briefly, taxon associations were measured NCBI. The 386 000 additional stramenopile peptides with Spearman’s correlation (denoted as r’) and were clustered (90% identity, 265 433 peptides) Kullback–Leibler distance on the input matrix. The before concatenation with UniRef100 to form the 1000 top- and 1000 bottom- ranking edges for each ‘UniRef100 þ ’ database. method were further evaluated according to Faust Potential HGTs between Megaviridae and cellular et al. (2012), which mitigates biases introduced by proteins were first approximated by reciprocal best data normalization. This method builds a null BLAST hits computed by a method similar to distribution of scores for each edge by permuting the one described by Ogata et al. (2006). Briefly, the corresponding taxon rows while keeping the rest the best cellular homolog in the UniRef100 þ of the matrix unchanged and then restores the stramenopiles database was first identified for each 5 compositional bias by renormalizing the matrix. We Megaviridae peptide (BLASTp, E-valuep10 ). ran 1000 rounds of permutation-renormalization for If this best cellular homolog obtained a best hit each edge and 1000 bootstraps of the matrix columns against a Megaviridae peptide in a second BLASTp to calculate the confidence intervals around the edge search against the UniRef100 þ stramenopiles þ score. The P-value for each measure was obtained Megaviridae database (excluding hits in the same from the Z-scores of the permuted null and bootstrap cellular taxonomic group at the first three NCBI confidence interval; they were combined (denoted as classification levels), they were considered a poten- P’-values) using a method conceived for non-inde- tial Megaviridae-cell HGT candidate. pendent tests (Brown, 1975) and corrected for multi- The six Megaviridae-oomycete HGT candidates ple testing using false discovery rate q-values revealed by reciprocal BLAST were then subjected (denoted as q’-values) according to Benjamini and to phylogenetic analysis. Homologs for the six Hochberg (1995). Taxon associations with q’o0.05 Megaviridae peptides were collected by keeping were reported with this second approach. representative sequences among all detected taxo- nomic groups using BLAST-EXPLORER (Dereeper et al., 2010). Alignments were built using MUSCLE Horizontal gene transfer (HGT) analysis (Edgar, 2004) and GBLOCKS (Talavera and To identify potential HGTs between Megaviridae Castresana, 2007) except for the following two cases. and oomycetes, comprehensive proteome databases For the putative fucosyltransferase AEJ34901, we for each taxon were assembled as follows. The used MAFFT/l-INS-i method (Katoh et al., 2005).

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1684 For the putative RNA methylase gi|311977703, we used CLUSTALW (Chenna et al., 2003) followed NCLDV by manual curation of the alignment. For these two cases, all alignment positions with 445% gaps were Bacteria removed before phylogenetic analysis. Phylogenetic trees were inferred using PhyML (Guindon Archaea and Gascuel, 2003) implemented in Phylogeny.fr (Dereeper et al., 2008) with 100 bootstrap replicates. Eucarya The generated trees were mid-point rooted. 10-2.5 10-2 10-1.5 10-1 10-0.5 10-0 100.5 Results Marker gene density (number of hits/Mbp) Figure 1 Metagenome-based relative abundance of NCLDV and General features of the metagenomes cellular genomes in the TOP data set. Seventeen TOP metagen- Samples in this study were collected as part of the omes (0.2–1.6 mm size fraction) were pooled and analyzed as a Tara Oceans expedition between 13 September 2009 single data set to generate this plot. Each dot in the plot represents the density of one of the marker genes used in this study and 23 April 2010. The 17 microbial samples (16 markers for NCLDVs and 35 markers for cellular genomes). analyzed are from the 13 sampling sites and The estimated abundance of NCLDVs genomes is slightly lower correspond to the size fraction between 0.2 and than that of Archaea genomes and amounts to approximately 3% 1.6 mm (Table 1). These samples were selected to of bacterial genomes. represent a broad range of biomes. Direct sequencing of extracted DNA by the GS-FLX Titanium 454

pyrosequencing technology yielded 2.8 billion bp 6% (8 million reads; Table 2), which correspond to 440% of the size of sequence data in total base pairs 4% produced by the previous GOS survey (Rusch et al., 2% 2007). Average G þ C % varied from 37% to 48% 0% across samples, and 8 358 544 ORFs (102 aa in average) were identified. These constitute the TOP data set analyzed in this study. 1E+7

1E+6 Abundance of NCLDVs ) -1 We used 16 NCLDV marker genes and 35 cellular 1E+5 marker genes to assess the abundance of genomes represented in the metagenomic data. These markers are usually encoded as single copy genes in their 1E+4 genomes, therefore their abundance in metagenomes Abundance (ml ␮ reflects the number of (haploid) genomes in the 1E+3 Prokaryote (FC, <200 m) NCLDV (estimated from Prok-FC) sequenced samples. The median density (hits per Prokaryote (Microscopy, 0.2-20 ␮m) Mbp) of the NCLDV marker genes in our whole NCLDV (estimated from Prok-Microscopy) metagenomic data set was found to be 0.019 1E+2 (Figure 1), which is lower than the marker gene 03 density for Archaea (0.028) and corresponds to 3% Sample of the density for Bacteria (0.64). The median Figure 2 NCLDV genome abundance in the TOP data set. density of the marker genes for eukaryotes was (a) Proportion of the average marker gene density for NCLDVs about half that of NCLDVs (0.008). The same method relative to that of prokaryotes (Bacteria and Archaea) for each of the 17 TOP metagenomes. (b) Experimentally measured prokar- applied to the GOS marine metagenomic data, yotic cell densities (gray circles; 16 samples by microscopy and 13 recovered from microbial samples (0.1–0.8 mm size samples by FC) were used to estimate the absolute abundances of fraction) collected along a transect from the North NCLDV genomes (black squares) by rescaling the metagenome- Atlantic to the Eastern Tropical Pacific, revealed that based relative abundances. ‘S’, ‘D’ and ‘Z’ in the sample names the marker gene density of NCLDVs (0.05) was as indicate the depths from which the samples were collected: ‘S’ for surface, ‘D’ for deep chlorophyll max and ‘Z’ for oxygen minimum high as 10% of Bacteria (0.47) (Supplementary zone. Figure S3). This ratio is higher than that for TOP samples likely reflecting the exclusion of large bacterial cells and the inclusion of small NCLDVs and microscopy on water samples collected onboard in the GOS 0.1–0.8 mm size fraction. Tara concomitantly with the metagenome samples, The computed abundance of NCLDV genomes to re-scale the relative NCLDV genome abundance relative to prokaryotic genomes varied from 0.2% to into absolute concentrations. FC analysis performed 5.6% across the 17 Tara samples (Figure 2a). We on 16 water samples (o200 mm size fraction) used prokaryotic cell abundances measured by FC showed that prokaryotic cell density varied from

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1685 5 6 1 2.5 10 to 3.5 10 cells ml (Figure 2b). Direct 6% cell count by microscopic analysis for 13 samples 6% (0.2–20 mm size fraction) provided comparable mea- sures varying from 4.0 105 to 2.2 106 cells ml 1. Prasinoviruses (588) Other Phycodnaviridae (99) 45% We observed no during our sampling, Megaviridae (467) and these measures fall within typical ranges of 36% Iridoviridae (75) prokaryotic cell density in the oceans (Suttle, 2005). NCLDV, unclassified (80) We used GF/A pre-filters (glass microfiber, 1.6 mm nominal pore size) to collect samples for the present 7% metagenomic sequencing as previous works indicate that the vast majority of prokaryotic cells (90–94%) 100% pass through GF/A filters (Lambert et al., 1993; 90% 80% Massana et al., 1998). By assuming that 90% of 70% prokaryotic cells observed by FC (o200 mm) or 60% microscopy (0.2–20 mm) could pass through the 50% 1.6-mm GF/A pre-filters, the absolute abundance of 40% NCLDV genomes ml 1 of sea water in the 0.2–1.6 mm 30%

Relative abundance 20% size fraction was estimated (Figure 2b). The 10% NCLDV genome abundance was found to vary from 0% 4 103 to 1.7 105 ml 1 with an average of 4.5 104 1 genomes ml for samples from photic zones (SRF Sample and DCM). Samples from OMZ showed reduced NCLDV abundances (7.7 102–2.3 103 NCLDV Figure 3 Metagenome-based relative abundance of NCLDV 1 families. (a) Representation of different viral groups in the whole genomes ml ). TOP metagenomic data set as measured by the NCLDV marker The detection of homologous sequences by a gene density. The number of marker reads taxonomically assigned marker gene depends on numerous factors such as to each viral group is shown in parentheses in the legend. its level of conservation and gene length, as well (b) Representation of different viral groups in the 17 TOP metagenomic samples. ‘S’, ‘D’ and ‘Z’ in the sample names as the taxonomic composition of the metagenomes indicate the depths from which the samples were collected: ‘S’ for being analyzed. We presumed that the use of multi- surface, ‘D’ for deep chlorophyll max and ‘Z’ for oxygen minimum ple genes with largely different enzymatic functions zone. In both (a) and (b), three reads and one read assigned to would increase the overall accuracy of our procedure. Asfarviridae and Poxviridae, respectively, were omitted for To estimate the effect of possible artifacts, we presentation purpose. repeated the above calculations after adding marker gene size normalization. This reduced the abundance estimates of NCLDV genomes by 38% compared family patterns was observed across depths (SRF, with calculations without gene size normalization DCM, OMZ for stations 7, 23, 38, 39). (Supplementary Figure S4). An independent classification using PolB phylo- genetic mapping analysis showed a globally similar taxonomic distribution of reads across different NCLDV lineages (Figure 4). Thanks to the recent Megaviridae and prasinoviruses are the most abundant expansion of available reference genomic sequences group of NCLDVs for Phycodnaviridae and Megaviridae families, In total, we identified 1309 NCLDV marker gene prasinoviruses can now clearly be recognized as homologs in the TOP metagenomes. Our BLAST- the most abundant group of marine phycodna- based taxonomic annotation (see Materials and viruses. Within the Megaviridae branches, the two methods) revealed two dominant NCLDV families largest amoeba-infecting viruses (Mimivirus and (Figure 3). Over half (52%) of them were attributable Megavirus) are rather under-represented (3.5% of to the Phycodnaviridae family, while 36% were Megaviridae), while most reads were assigned to most closely related to the Megaviridae family. other Megaviridae branches, leading to viruses These two families together represented nearly characterized by reduced genomes (from B300 to 90% of the detected NCLDV marker gene sequences. 730 kb). The hosts of the latter viruses are distrib- This result confirmed a previous observation on the uted widely in the classification of eukaryotes: relative abundance of these two families among C. roenbergensis (stramenopiles; ), NCLDVs in a survey of the GOS data set (Monier P. orientalis (; Chlorophyta; Prasino- et al., 2008a). At the same sampling locations phyceae), P. pouchetii (Haptophyceae; Phaeocys- (stations 7 and 23), prasinoviruses (infecting green tales) and Haptolina ericina (formerly C. ericina; algae of the Mamiellophyceae ) were found to Haptophyceae; Prymnesiales). Interestingly, many be relatively more abundant in DCM than in SRF metagenomic reads were assigned to relatively deep samples (2.4–8.3-folds in absolute abundance), con- branches. For example, 17 PolB-like reads were sistent with the photosynthetic activity of their assigned to the branch leading to the clade contain- hosts. No other notable difference in the virus ing three prasinoviruses (OsV5, MpV1, BpV1), and

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1686

1

1 Iridoviridae (6) 1 1 1 1

1 Ascoviridae (0)

4 Marseilleviruses (4) 3 6 22 7 39 2 18 9 14 Megaviridae (170) 26 18 6

2

4 6 2 17 Phycodnaviridae (79) 12 26

3 7

2 Poxviruses (3)

1

1 Asfarviridae/HcDNAV (2)

Figure 4 Phylogenetic positions of metagenomic reads closely related to NCLDV DNA polymerase sequences. An HMM search with a PolB profile detected 2028 PolB-like peptide sequences in the TOP metagenomes. Each of these peptides was placed within a large reference phylogenetic tree containing diverse viral and cellular homologs (Supplementary Figure S1) with the use of Pplacer. Of these peptides, 264 were mapped on the branches leading to NCLDV sequences and are shown in this figure. The numbers of mapped metagenomic reads are shown on the branches and are reflected by branch widths. This result is consistent with the preponderance of the Phycodnaviridae and Megaviridae families seen in our BLAST-based marker gene analysis. Only the NCLDV part of the reference tree is shown.

39 PolB-like reads were assigned to the basal branch in the TOP samples are highly diverse and only leading to four marine viruses (PpV, CeV, OLPV1 distantly related to known viruses, thus potentially and OLPV2). To illustrate metagenome sequence corresponding to viruses infecting different marine divergence with known viral sequences, we arbi- eukaryotes. trary classified the metagenomic NCLDV marker sequences as ‘known’ if they showed X80% amino- acid sequence identity to their closest homolog in Correlated abundance of MutS protein subfamilies with the databases and otherwise as ‘novel’ (or ‘unseen’). Megaviridae abundance A vast majority (73–99%) of the sequences turned Two recently identified subfamilies of DNA mis- out to be ‘novel’ when they were searched against match repair protein MutS are specific to a set of the UniProt sequence database (Figure 5). Similarly, viruses with large genomes (Ogata et al., 2011). The searches against the GOS sequence database MutS7 and/or MutS8 subfamilies are encoded in all revealed that large proportions (36–76%) of the the known members of the Megaviridae family and TOP marker gene homologs were ‘unseen’ in this in HcDNAV (356 kb); the latter virus infects the previous large-scale marine microbial survey. bloom-forming Heterocapsa circular- A fragment recruitment plot for the OLPV1 PolB isquama and appears to be related to the Asfarvir- protein sequence applied to PolB-like metagenomic idae family (Ogata et al., 2009). It has been suggested reads that best matched OLPVs (OLPV1 or OLPV2) that these hallmark genes of giant viruses are further showed a high level of richness among these required to maintain the integrity of viral genomes sequences (even within a single sample) and their with large sizes (mostly 4500 kb; Ogata et al., 2011). large divergence from the reference OLPV1 sequence These MutS genes are not included in our NCLDV (Supplementary Figure S5). Overall, these results marker gene set. Prompted by the observed high suggest that the majority of the NCLDVs represented abundance of sequences of possible Megaviridae

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1687 Known Novel using the whole set of the TOP metagenomic reads. 0% 20% 40% 60% 80% 100% We used two approaches for the detection of taxon associations: the first based on Spearman’s correla- Prasinoviruses tion across all samples (3696 associations, qo0.05), and the second combining Spearman’s correlation Other Phycodnaviridae with a Kullback–Leibler measure of dissimilarity on a reduced data set excluding two outlier OMZ Megaviridae samples (108 associations, q’o0.05). This resulted in the identification of a total of 3703 potential taxon Iridoviridae association pairs, of which 101 were supported by both methods (Supplementary Table S3). The dis- NCLDVs, unclassified crepancy between the two lists was due to the higher intrinsic stringency of the second method, as well as Seen Unseen to the specific photic-OMZ contrasts, which were 0% 20% 40% 60% 80% 100% only taken into account by the first method. Some of the inferred taxon associations simply reflected Prasinoviruses uncertainty in the taxonomic assignments, such as the associations between ‘Archaea; environmental Other Phycodnaviridae samples’ and ‘Archaea; Euryarchaeota; Marine Group II; environmental samples;’ (q ¼ 1.38 10 8, Megaviridae q’E0) or between environmental viruses and myo- viruses (q ¼ 3.8 10 5, q’ ¼ 9.4 10 3). These could Iridoviridae be explained by the taxonomic assignments of similar organisms into related but distinct taxo- NCLDVs, unclassified nomic nodes in the NCBI database. However, our analysis also revealed known biolo- Figure 5 Classification of NCLDV marker genes in the TOP data gical associations of lineages. For instance, a corre- based on the level of sequence similarity to database sequences. 3 7 Metagenomic reads showing X80% amino-acid sequence identity lated occurrence (q ¼ 1.33 10 , q’ ¼ 8.42 10 ) to database sequences were classified as ‘known (or seen)’, was detected between two distinct Bacteroidetes otherwise as ‘novel (or unseen)’. (a) BLAST result against UniProt. lineages (that is, Sphingobacteria and Cytophagia), (b) BLAST result against the GOS data. The large proportions of which are known to co-exist in seawater likely being ‘novel (and unseen)’ genes suggest current environmental surveys are far from reaching saturation and that diverse yet unknown attached to cells (Gomez-Pereira NCLDVs exist in the sea. et al., 2012). We also observed known virus–host pairs, such as a T4-like phage/ associa- tion (q ¼ 9.7 10 3) and an association between origin in the TOP data set, we screened our data for unclassified phycodnaviruses (mostly prasino- MutS7 and MutS8 homologs. In total, we identified 78 viruses) and a group of environmental prasinophytes reads similar to MutS (68 and 10 reads for MutS7 and (q ¼ 0.014). An example of co-excluding taxa was a MutS8, respectively) in 13 samples (Supplementary relationship between Prochlorococcus, existing in Figure S6a). If these MutS genes originate from the euphotic zone, and sulfur-oxidizing symbionts, a putative Megaviridae viruses detected by our marker lineage of g-Proteobacteria known to have an gene method, we expect to see a correlation in their important role in sulfur-oxidizing microbial commu- abundance across samples. We tested this hypothesis nities in deeper aphotic OMZs (q ¼ 0.011; Canfield and found a statistically significant correlation et al., 2010; Stewart et al., 2012). The latter case between the relative abundance of the Mut7/8 homo- appeared to simply reflect their non-overlapping logs and the Megaviridae marker gene density waters of residence. These known association exam- (R ¼ 0.725, P ¼ 9.90 10 4; Supplementary Figure ples served as controls, suggesting that the inferred S6b). A similar level of correlation was also found network might be mined usefully for putative novel in the GOS data set (R ¼ 0.647; P ¼ 6.55 10 6; associations (or segregations) of plankton organisms. Supplementary Figure S6c). This result suggests that Examples of positive and negative correlations the TOP reads assigned to the Megaviridae family between virus and cellular organism abundances are probably originate from viruses with a large genome as listed in Table 3. We have no simple explanation for found in known viruses of this family. some of the taxon pairs, such as the virus–cell mutual exclusions as well as the association of eukaryotic viruses with some bacteria (although the Oomycetes or their stramenopile relatives co-occur latter could be due to bacterial genes acquired by with marine Megaviridae HGT in a viral genome). However, the association To test whether the present data set might serve between the taxonomic node for ‘Megaviridae’ to identify potential hosts of marine NCLDVs, (NCBI taxonomy: Viruses; dsDNA viruses, no we assessed association of taxon occurrences RNA stage; Mimiviridae.) and the node for ‘oomy- (‘co-occurences’ and ‘co-exclusions’) across samples cetes’ (NCBI taxonomy: Eukaryota; stramenopiles;

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1688 Table 3 Examples of positive and negative viral-cell associations

Taxon 1 Taxon 2 r q r’ q’

Co-occurrence Viruses; dsDNA viruses, no RNA stage; Eukaryota; stramenopiles; Oomycetes 0.949 2.22E-05 0.939 1.7E-02 Mimiviridae Viruses; dsDNA viruses, no RNA stage; Bacteria; Tenericutes; Mollicutes; 0.883 1.44E-03 — — Iridoviridae; Lymphocystivirus; unclassified Mycoplasmataceae Lymphocystivirus Viruses; unclassified phages; environmental Bacteria; Cyanobacteria; environmental 0.864 2.92E-03 — — samples samples Viruses; dsDNA viruses, no RNA stage; Eukaryota; Alveolata; ; 0.861 3.26E-03 — — Caudovirales; Siphoviridae Aconoidasida; Piroplasmida Viruses; dsDNA viruses, no RNA stage; Bacteria; Proteobacteria; Gammaproteo- 0.853 4.20E-03 — — Herpesvirales; Herpesviridae; bacteria; Thiotrichales; Thiotrichaceae Gammaherpesvirinae Viruses; dsDNA viruses, no RNA stage; Bacteria; Proteobacteria; Gammaproteo- 0.838 6.30E-03 — — Phycodnaviridae bacteria; Alteromonadales; Alteromona- dales genera Viruses; dsRNA viruses; Reoviridae; Eukaryota; Metazoa; Chordata; Craniata 0.834 6.98E-03 — — Sedoreovirinae; Mimoreovirus Viruses; dsDNA viruses, no RNA stage; Bacteria; Chloroflexi; Thermomicrobiales; 0.830 7.61E-03 — — Herpesvirales; Herpesviridae; Thermomicrobiaceae; Thermomicrobium Gammaherpesvirinae Viruses; dsDNA viruses, no RNA stage; Bacteria; Proteobacteria; Magnetococcus 0.825 8.53E-03 — — Herpesvirales; Herpesviridae; Gammaherpesvirinae Viruses; dsDNA viruses, no RNA stage; Eukaryota; Viridiplantae; Chlorophyta; 0.821 9.36E-03 — — Phycodnaviridae; unclassified ; Mamiellales Phycodnaviridae Viruses; dsDNA viruses, no RNA stage; Bacteria; Acidobacteria; Solibacteres; 0.820 9.51E-03 — — Herpesvirales; Herpesviridae; Solibacterales; Solibacteraceae Gammaherpesvirinae Viruses; dsDNA viruses, no RNA stage; Bacteria; Proteobacteria; Deltaproteobac- 0.820 9.51E-03 — — Herpesvirales; Herpesviridae; teria; Desulfobacterales; Gammaherpesvirinae Desulfobacteraceae Viruses; dsDNA viruses, no RNA stage; Bacteria; Cyanobacteria; environmental 0.819 9.71E-03 — — Caudovirales; Myoviridae; T4-like viruses samples Viruses; dsDNA viruses, no RNA stage; Bacteria; Cyanobacteria; environmental 0.817 1.02E-02 — — Caudovirales; Podoviridae; samples Autographivirinae Viruses; dsDNA viruses, no RNA stage Eukaryota; Alveolata; Ciliophora; 0.803 1.36E-02 — — Intramacronucleata; Spirotrichea Viruses; dsDNA viruses, no RNA stage; Bacteria; Firmicutes; Clostridia; 0.802 1.38E-02 — — Caudovirales; Podoviridae; N4-like viruses Clostridiales; Peptococcaceae Viruses; dsDNA viruses, no RNA stage; Eukaryota; Alveolata; Apicomplexa; 0.802 1.39E-02 — — Caudovirales Aconoidasida; Piroplasmida Viruses; dsDNA viruses, no RNA stage; Bacteria; Proteobacteria; Alphaproteo- 0.801 1.39E-02 — — Viruses; dsDNA viruses, no RNA stage; bacteria; Rickettsiales; SAR11 cluster unclassified dsDNA viruses Viruses; dsDNA viruses, no RNA stage; Eukaryota; stramenopiles; Actinophryi- 0.801 1.39E-02 — — Phycodnaviridae; Phaeovirus dae; Actinophrys Viruses; dsDNA viruses, no RNA stage; Eukaryota; Viridiplantae; Chlorophyta; 0.800 1.42E-02 — — Phycodnaviridae; unclassified Prasinophyceae; environmental samples Phycodnaviridae Mutual exclusion Viruses; dsDNA viruses, no RNA stage; Eukaryota; Euglenozoa; Kinetoplastida; 0.742 3.32E-02 —0.804 1.72E-02 Caudovirales; Myoviridae; phiKZ-like Trypanosomatidae; Leishmania viruses Viruses; dsDNA viruses, no RNA stage; Bacteria; candidate division OP8; 0.751 2.95E-02 0.695 3.83E-02 Iridoviridae; Ranavirus environmental samples Viruses; dsDNA viruses, no RNA stage; Eukaryota; Rhodophyta; Bangiophyceae; —— 0.659 2.95E-02 Caudovirales; Myoviridae; phiKZ-like Cyanidiales; Cyanidiaceae viruses Viruses; dsDNA viruses, no RNA stage; Bacteria; Spirochaetes; Spirochaetales; —— 0.715 3.95E-02 Caudovirales; Myoviridae; phiKZ-like Spirochaetaceae viruses

Abbreviation: dsDNA, double-stranded DNA. Statistical significance of taxon associations was assessed by two methods. r (Spearman’s correlation coefficient) and q (false discovery rate) were calculated by the first method and r’ (Spearman’s correlation coefficient) and q’ (false discovery rate) were calculated by a more stringent second method. See Materials and methods for details.

oomycetes.) attracted our attention, as this does not statistically significant by both of the two methods correspond to a known virus–host relationship. The we used (r ¼ 0.95, q ¼ 2.2 10 5, r’ ¼ 0.94, association of these two taxonomic nodes, the q’ ¼ 0.018; Figure 6), albeit based on a modest highest we observed between virus and cells, was number of reads assigned to each of these taxonomic

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1689 0.76 oomycetes node (homologous to 12 different pro- teins; Supplementary Table S4). A much larger number of reads were, in fact, assigned to lower 0.20 ≥ taxonomic levels, such as 721 reads assigned to the Pairs with q-value 0.05 Mimivirus node (that is, ‘Viruses; dsDNA Pairs with q-value<0.05 viruses, no RNA stage; Mimiviridae; Mimivirus’). Pairs involving viruses (q<0.05) The fact that the majority of the 35 Megaviridae 0.10 reads corresponded to D5 family primases may be explained by their large gene sizes and usually high sequence conservation (for example, 2880 nt for the Mimivirus L207/L206), a similar observation having been made in a previous marine metagenomic study 0.05 (Monier et al., 2008b). Consistent with the relatively

Density high ranks of their taxonomic assignments, the reads for the Megaviridae and oomycetes nodes were found to show large divergence from reference protein sequences. The average BLASTx sequence 0.15 identity for the 35 reads against their closest Megaviridae protein sequences was 50% (ranging from 28% to 88%), and the average sequence identity for the 19 reads assigned to ‘oomycetes’ 0.00 was 58% (30–90%) against their closest known oomycete protein sequences. Their G þ C composi- tions were significantly different with each other p-value (35% for Megaviridae and 48% for oomycete reads, in average; t-test, P ¼ 8.5 10 4) and comparable with those of their respective reference genomes. We performed phylogenetic analyses of the 19 reads assigned to the oomycete taxonomic node in an attempt to obtain better taxonomic resolution. Despite their short sizes (B100 aa) and large evolutionary distances from database homologs, many of these reads appeared related to strameno- piles (12 out of 19 cases), including six cases showing distant yet specific relationships to known oomycete sequences (Supplementary Figures S7-1––S7-12). For the remaining seven reads, their phylogenetic posi-

Megaviridae (reads/Mbp) tions were rather poorly resolved and showed no coherent relationship to specific taxonomic groups (Supplementary Figures S7-13––S7-19). A similar analysis of the 31 reads (D5 family proteins) assigned to the Megaviridae node confirmed in most cases Oomycetes (reads/Mbp) their initial taxonomic annotation (Supplementary Figure S8), with some of them assigned close to the Figure 6 Taxon associations inferred from co-occurrence analy- sis. (a) Distribution of P-values for Spearman’s correlation root of the viral family. These reads are not closely coefficients for taxon associations observed in the TOP meta- related to the sequences from CroV (Megaviridae) genomic data. Colored (red and green) areas of the histogram and phaeoviruses (Phycodnaviridae), the only known represent taxon pairs showing statistically significant correla- NCLDVs parasitizing marine stramenopiles. Phylo- tions. The position of the P-value for the hypothetical positive association between the ‘Megaviridae’ and ‘oomycetes’ taxonomic genetic analysis was not performed for the four groups is indicated by a red triangle. (b) Correlated occurrence of Megaviridae reads similar to collagen-like proteins 454 reads taxonomically assigned to the ‘Megaviridae’ and the due to insufficient quality of sequence alignments. ‘oomycetes’ groups by the BLAST-based 2bLCA method. Each dot If this Megaviridae–stramenopile sympatry corresponds to one of the 17 TOP samples analyzed. Axes revealed by metagenomics reflected an intimate represent the density of these reads (number of reads per Mbp) for each of the ‘Megaviridae’ and the ‘oomycetes’ groups. biological interaction (for example, virus–host), we reasoned that an increased rate of genetic exchange might be observable between these organisms. nodes. Thirty-five reads were assigned to the Detection of HGTs between extant genomes of these Megaviridae node (31 reads similar to D5 family- organisms would thus provide strong independent predicted DNA helicase/primase sequences support for the predicted co-occurrence. We there- (De Silva et al., 2007); 4 reads similar to collagen- fore undertook a systematic screening of all publicly like proteins), while 19 reads were assigned to the available Megaviridae and cellular sequences for

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1690 96 Trichophyton tonsurans {; GI 326470413} 94 Trichophyton equinum {Ascomycota; GI 326478597} 100 Trichophyton rubrum {Ascomycota; GI 327306345} Trichophyton verrucosum {Ascomycota; GI 302653648} 85 84 Arthroderma benhamiae {Ascomycota; GI 302498435} 100 Arthroderma otae {Ascomycota; GI 296804730} 96 Arthroderma gypseum {Ascomycota; GI 315047476} Aspergillus flavus {Ascomycota; GI 238507261} 61 Chaetomium thermophilum {Ascomycota; GI 340915059}

99 Thielavia terrestris {Ascomycota; GI 367039659} 87 Chaetomium globosum {Ascomycota; GI 116203039} Fungi 94 Podospora anserina {Ascomycota; GI 171682812} 100 Glomerella graminicola {Ascomycota; GI 310795940} Nectria haematococca {Ascomycota; GI 302890433} 82 100 79 Trichoderma atroviride {Ascomycota; GI 358395341} 100 Trichoderma virens {Ascomycota; GI 58389308} 83 Aspergillus nidulans {Ascomycota; GI 67904666} Arthrobotrys oligospora {Ascomycota; GI 345564484} Piriformospora indica {; GI 353243842} Piriformospora indica {Basidiomycota; GI 353243842}

93 Saprolegnia parasitica {SPRG 19367} 100 Saprolegnia parasitica {SPRG 03105} Oomycetes Saprolegnia parasitica {PRG 03092} 100 Megavirus {GI 363539803; ORF mg1057}

100 Mimivirus {GI 311978223; ORF R811} Megaviridae Moumouvirus {GI 371945464; ORF R1082}

0.2 sustitutions/site Figure 7 Evidence of horizontal gene transfer between viruses and eukaryotes related to oomycetes. The displayed maximum likelihood tree was generated based on sequences of the Mimivirus hypothetical vWFA -containing protein (gi: 311978223) and its homologs using PhyML. The numbers on the branches indicate bootstrap percentages after 100 bootstrap sampling. The tree was mid-point rooted for visualization purpose. The grouping of the Megaviridae and oomycete sequences suggests a gene exchange between the lineage leading to Megaviridae and the lineage leading to oomycetes. Phylogenetic trees for the remaining five putative cases of horizontal gene transfers between these lineages are provided in the Supplementary Figure S9.

hints of potential HGTs. A first reciprocal BLAST best aquatic environments using electron microscopy hit search identified 31 candidate HGTs between (Bergh et al., 1989). Proctor and Fuhrman (1990) Megaviridae and cellular organisms (Supplementary then discovered that viruses were quantitatively Table S5). Surprisingly, the most frequent cellular important components of marine food webs through partner happened to be from the oomycete lineage the observation of numerous bacteria visibly (six genes). Phylogenetic tree inference provided infected by viruses. Ever since these pioneering further evidence that the six genes were likely bona works, a large body of research continuously fide HGTs (Figure 7 and Supplementary Figure S9). revealed the fascinating ecological and evolutionary These are a hypothetical protein with a von Will- functions of viruses, including NCLDVs in marine ebrand factor type A domain and an in-between ring environments (Wilson et al., 2005; Sullivan et al., fingers domain, a putative fatty acid hydroxylase, a 2006; Frada et al., 2008; Nagasaki, 2008; Moreau hypothetical protein of unknown function, a putative et al., 2010; Danovaro et al., 2011; Breitbart, 2012). phosphatidylinositol kinase, a putative fucosyltrans- The abundance of NCLDV genomes was found ferase and a putative RNA methylase (S-adenosyl-L- to be in the range from 4 103 to 1.7 105 geno- methionine-dependent methyltransferase). For four mes ml 1 for the TOP photic layer samples. Our of these six cases, the monophyletic grouping of the indirect metagenomic estimate of virus abundance Megaviridae and oomycete sequences was supported is likely to be affected in two opposite ways: by a very high bootstrap value (497%). overestimation, for instance, due to actively repli- cating viral genomic DNA in infected small eukar- yotic cells, and underestimation due to smaller or Discussion larger virion particles not being captured by our size fractionation or reduced efficiency of DNA extrac- In the late 1970s, Torrella and Morita (1979) tion for encapsidated genomes. In fact, a substantial revealed unexpected high viral concentrations in proportion of prasinovirus OtV particles (B120 nm

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1691 in diameter) cannot be retained on the 0.2-mm fraction of the unicellular organisms in a population membrane (Grimsley and Clerissi, data not shown). may be infected by viruses. The estimated relative Furthermore, underestimation was likely to be genome abundance of NCLDVs (3% and 10% of compounded by the fact that most NCLDV-infected bacteria in the TOP and GOS data sets, respectively) cells are 41.6 mm and thus were excluded from our suggests that such single virus genomics approaches size fraction. Filtration efficiency is another pitfall will be helpful in analyzing uncultivated marine of quantitative estimates. Size of retained microbes NCLDVs from size-fractioned natural water samples. may vary during pre- and retention filtration The predicted abundance of NCLDV genomes was (progressively excluding smaller infected cells and found to vary from 104 to 105 genomes ml 1 for most retaining smaller NCLDVs than the filter’s nominal of the TOP euphotic samples. Interestingly, the pore sizes), though we rarely encountered filter suggested variation in the abundance of NCLDVs clogging for the samples analyzed in this study. (at a high taxonomic level) across sampling sites Regarding our experimental measurements, we used makes a very sharp contrast with the known and well-established methods for prokaryotic cell counts more remarkable fluctuations (spanning more than (FC and epifluorescence microscopy), which distin- several orders of magnitudes) in the abundance of guish cells from many viruses, including marine specific viral species/strains measured in time series NCLDVs (Jacquet et al., 2002). Yet, we cannot monitoring (Tomaru et al., 2004). Moreover, our exclude the possibility of the existence of cell-sized phylogenetic (Figure 4) and fragment recruitment (and -shaped) marine viruses that could not be analyses (Supplementary Figure S5) indicated that discriminated from cells by these methods. Our numerous distinct genotypes exist (for the Megavir- metagenomic based ratio of NCLDVs to prokaryotes idae family and the prasinovirus clade) in the (o5%) then suggests that the resulting prokaryote analyzed samples (even within a single sample). It overestimation (due to contaminated large viruses) has been recently suggested (Rodriguez-Brito et al., could be 5% at most. Therefore, our estimate should 2010) that dominant phage and bacterial taxa in be considered a first approximation for genome microbial communities persist over time in stable abundance of core gene containing NCLDVs in the ecosystems but their populations fluctuate at the analyzed size fraction. An early metagenomic survey genotype/strain levels in a manner predictable by showed that only 0.02% of the total predicted the ‘killing-the-winner’ hypothesis (Winter et al., proteins from the GOS metagenomes corresponded 2010). Multiple and perpetual prey–predator inter- to Mimivirus homologs (Williamson et al., 2008). actions and functional redundancy across species/ Such a small proportion cannot be directly com- genotypes may lead to the apparent stability they pared with the higher genome abundance estimate observed in the community composition at high we obtained in this study (that is, 10% of bacterial taxonomic levels. A similar mechanism might be genomes in the GOS data), as gene abundance acting on marine NCLDV-host communities. The estimates are heavily dependent on genome diver- relatively stable NCLDV sequence abundance across sity and the availability of reference genomes. We geographically distant locations may be caused by consider that our marker gene-based approach is compensating local community changes at low rather suitable to quantify the abundance of NCLDV taxonomic levels, in which diverse NCLDV strains genomes, given the limited number of sequenced are involved in the control of specific eukaryotic NCLDV genomes and the large genomic diversity host populations. observed even within a single family of NCLDVs. Isolation of new viruses requires host cultures. The abundance of eukaryotic organisms (mainly Among known hosts of NCLDVs, of the unicellular) in marine microbial assemblages is Acanthamoeba genus have been the most efficient typically three orders of magnitude lower than that laboratory hosts to isolate new NCLDVs from aquatic of prokaryotes (Suttle, 2007; Massana, 2011). In the samples (Arslan et al., 2011, Boyer et al., 2009, La euphotic zone of the Sargasso Sea, phototrophic/ Scola et al., 2010, Thomas et al., 2011). Taxon heterotrophic nanoplankton (2–20 mm) and photo- association analysis on the TOP data set hinted at an trophic/heterotrophic microplankton (20–200 mm) unexpected sympatric association between Mega- were found to amount to only 0.3% of bacterial viridae and stramenopiles possibly distantly related abundance (Caron et al., 1995). Therefore, the to oomycetes. The two sets of reads involved in this predicted NCLDV genome abundance by the present correlation showed a clear difference in their G þ C study suggests that NCLDVs equal or even out- compositions. This rather suggests two distinct number eukaryotic organisms in the photic layer of source organisms for these reads. Yet, an alternative the sea. In other words, our suggested NCLDV/ scenario is that they originated from a single eukaryote ratio is not unlike the ratio of phage/ organism (a virus very recently acquiring cellular bacteria in seawater (Suttle, 2007). Whole-genome genes or a cellular organism with recently integrated amplification and sequencing of single microbial viral genomes). In this case, the taxonomic associa- cells/viruses is becoming a powerful tool in reveal- tion would not correspond to a direct observation of ing genomic contents of environmental uncultivated the co-occurring organisms but would be a by- microorganisms (Allen et al., 2011; Yoon et al., product of very recent genetic exchanges between 2011). These studies reveal that a substantial Megaviridae and oomycete relatives. However, there

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1692 is no known example of a lysogenic virus of the shed new light on the biodiversity of marine viruses Megaviridae family and recent research shows little and their interactions with potential hosts. Larger evidence for recent HGTs between marine NCLDVs sets of environmental sequence data from diverse and eukaryotes (Monier et al., 2007; Derelle et al., locations and different size fractions, such as those 2008; Moreira and Brochier-Armanet, 2008; Filee from remaining Tara Oceans samples, will be useful and Chandler, 2010). not only to test our ‘Megaviridae–stramenopile’ Oomycetes are filamentous eukaryotic microor- hypothesis but also to provide a larger picture of ganisms resembling fungi in many aspects of their NCLDV–eukaryote interactions. biology, but they form a totally distinct phylogenetic group within the stramenopile () super- group (Richards et al., 2011). Some of them are devastating crop , such as Phytophthora Acknowledgements infestans causing late blight of (Haas et al., We thank the coordinators and members of the Tara Oceans 2009), but others include pathogens of fishes and consortium (http://www.embl.de/tara-oceans/start/) for algae, such as the water Saprolegnia parasitica organizing sampling and data analysis. We thank the causing diseases in fishes (Kale and Tyler, 2011) and commitment of the following people and sponsors who Eurychasma dicksonii infecting marine made this singular expedition possible: CNRS, EMBL, (Grenville-Briggs et al., 2011). To our knowledge, Genoscope/CEA, VIB, Stazione Zoologica Anton Dohrn, there is no report of a giant virus infecting oomy- UNIMIB, ANR (projects POSEIDON/ANR-09-BLAN-0348, cetes. However, other stramenopile lineages include BIOMARKS/ANR-08-BDVA-003, PROMETHEUS/ANR-09- GENM-031, and TARA-GIRUS/ANR-09-PCS-GENM-218), C. roenbergensis (stramenopiles; Bicosoecida; Cafe- EU FP7 (MicroB3/No.287589), FWO, BIO5, Biosphere 2, teriaceae; Cafeteria) and brown algae (stramenopiles; agne`s b., the Veolia Environment Foundation, Region Phaeophyceae; ), which are hosts of Bretagne, World Courier, Illumina, Cap L’Orient, the EDF known NCLDVs (CroV and phaeoviruses). Yet, our Foundation EDF Diversiterre, FRB, the Prince Albert II de sequence analysis of the predicted Megaviridae Monaco Foundation, Etienne Bourgois, the Tara schooner reads indicated that they are not closely related to and its captain and crew. CC benefited from a doctoral the sequences from these viruses. The possible fellowship from the AXA Research Fund. Tara Oceans promiscuity of these two marine dwellers was would not exist without the continuous support of the further supported by the identification of several participating 23 institutes (see http://oceans.taraexpedi- putative HGTs between Megaviridae and oomycete tions.org). This article is contribution number 0003 of the Tara Oceans Expedition 2009–2012. genomes. Incidentally, some of the analyzed trees exhibited oomycete homologs near the Phycodnavir- idae clade (Supplementary Figure S8) and several fungal homologs adjacent to the Megaviridae/oomy- References cete clade (Figure 7 and Supplementary Figure S9-1). Multiple gene transfers have been described from Allen LZ, Ishoey T, Novotny MA, McLean JS, Lasken RS, fungi to oomycetes, and the suggestion was made Williamson SJ. (2011). Single virus genomics: a new that they contributed to the evolution of the tool for virus discovery. PLoS One 6: e17722. pathogenicity of oomycetes (Richards et al., 2011). Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, We found in the literature an intriguing coin- Miller W et al. (1997). Gapped BLAST and PSI-BLAST: cidence in the biogeography of Megaviridae and a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402. oomycetes. Megaviridae was identified as a domi- Arslan D, Legendre M, Seltzer V, Abergel C, Claverie JM. nant family of NCLDVs in a sample from a mangrove (2011). Distant Mimivirus relative with a larger forest (Monier et al., 2008a), while 20 years earlier genome highlights the fundamental features of Mega- marine oomycetes (for example, Phytophthora viridae. Proc Natl Acad Sci USA 108: 17486–17491. vesicula) were described as the major decomposers Baxter L, Tripathy S, Ishaque N, Boot N, Cabral A, Kemen of mangrove leaves (Newell et al., 1987). Taken E et al. (2010). Signatures of adaptation to obligate together, these observations lead us to hypothesize biotrophy in the Hyaloperonospora arabidopsidis that there is a yet unrecognized close interaction genome. Science 330: 1549–1551. between Megaviridae and stramenopiles (distantly Bellec L, Grimsley N, Derelle E, Moreau H, Desdevises Y. related to oomycetes), either as a direct virus/host (2010). Abundance, spatial distribution and genetic couple (Monier et al., 2009) or through co-infection diversity of Ostreococcus tauri viruses in two different of a common third partner (Ogata et al., 2006; Boyer environments. Environ Microbiol Rep 2: 313–321. Benjamini Y, Hochberg Y. (1995). Controlling the false et al., 2009). Limitations in the available genome discovery rate: a practical and powerful approach to data for marine stramenopiles and the scope of the multiple testing. J R Stat Soc Ser B 57: 289–300. present TOP data set, which targeted the girus/ Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell prokaryote size fraction, make it difficult to obtain J, Sayers EW. (2012). GenBank. Nucleic Acids Res 40: finer taxonomic resolutions for the potential eukar- D48–D53. yotic counterpart. Bergh O, Borsheim KY, Bratbak G, Heldal M. (1989). High The present work provides a proof of principle abundance of viruses found in aquatic environments. that metagenomic sequence analyses promise to Nature 340: 467–468.

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1693 Boyer M, Yutin N, Pagnier I, Barrassi L, Fournous G, Eddy SR. (2008). A probabilistic model of local sequence Espinosa L et al. (2009). Giant Marseillevirus high- alignment that simplifies statistical significance esti- lights the role of amoebae as a melting pot in mation. PLoS Comput Biol 4: e1000069. emergence of chimeric microorganisms. Proc Natl Edgar RC. (2004). MUSCLE: a multiple sequence align- Acad Sci USA 106: 21848–21853. ment method with reduced time and space complex- Breitbart M. (2012). Marine viruses: truth or dare. Ann Rev ity. BMC Bioinformatics 5: 113. Mar Sci 4: 425–448. Falkowski PG, Katz ME, Knoll AH, Quigg A, Raven JA, Briggs AW, Stenzel U, Johnson PL, Green RE, Kelso J, Schofield O et al. (2004). The evolution of modern Prufer K et al. (2007). Patterns of damage in genomic eukaryotic phytoplankton. Science 305: 354–360. DNA sequences from a Neandertal. Proc Natl Acad Sci Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, USA 104: 14616–14621. Raes J et al. (2012). Microbial co-occurrence relation- Brown MB. (1975). 400: a method for combining non- ships in the human microbiome. PLoS Comput Biol 8: independent, one-sided tests of significance. Bio- e1002606. metrics 31: 987–992. Filee J, Chandler M. (2010). Gene exchange and the origin Canfield DE, Stewart FJ, Thamdrup B, De Brabandere L, of giant viruses. Intervirology 53: 354–361. Dalsgaard T, Delong EF et al. (2010). A cryptic sulfur Fischer MG, Allen MJ, Wilson WH, Suttle CA. (2010). cycle in oxygen-minimum-zone waters off the Chilean Giant virus with a remarkable complement of genes coast. Science 330: 1375–1378. infects marine . Proc Natl Acad Sci USA Cantalupo PG, Calgua B, Zhao G, Hundesa A, Wier AD, 107: 19508–19513. Katz JP et al. (2011). Raw sewage harbors diverse viral Forterre P. (2006). Three RNA cells for ribosomal lineages populations. MBio 2: pii e00180–11. and three DNA viruses to replicate their genomes: a Caron DA, Dam HG, Kremer P, Lessard EJ, Madin LP, hypothesis for the origin of cellular domain. Proc Natl Malone TC et al. (1995). The contribution of micro- Acad Sci USA 103: 3669–3674. organisms to particulate carbon and nitrogen in Forterre P. (2010). Giant viruses: conflicts in revisiting the surface waters of the Sargasso Sea near Bermuda. virus concept. Intervirology 53: 362–378. Deep-Sea Res I 42: 943–972. Frada M, Probert I, Allen MJ, Wilson WH, de Vargas C. Chaffron S, Rehrauer H, Pernthaler J, von Mering C. (2008). The ‘Cheshire Cat’ escape strategy of the (2010). A global network of coexisting microbes from Emiliania huxleyi in response to viral environmental and whole-genome sequence data. infection. Proc Natl Acad Sci USA 105: 15944–15949. Genome Res 20: 947–959. Gasol JM, del Giorgio PA. (2000). Using flow cytometry for Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, counting natural planktonic bacteria and understand- Higgins DG et al. (2003). Multiple sequence alignment ing the structure of planktonic bacterial communities. with the Clustal series of programs. Nucleic Acids Res Scientia Marina 64: 197–224. 31: 3497–3500. Gaulin E, Madoui MA, Bottin A, Jacquet C, Mathe C, Claverie JM. (2006). Viruses take center stage in cellular Couloux A et al. (2008). Transcriptome of Aphano- evolution. Genome Biol 7: 110. myces euteiches: new oomycete putative pathogeni- Claverie JM, Ogata H, Audic S, Abergel C, Suhre K, city factors and metabolic pathways. PLoS One 3: Fournier PE. (2006). Mimivirus and the emerging e1723. concept of ‘‘giant’’ virus. Virus Res 117: 133–144. Ghedin E, Claverie JM. (2005). Mimivirus relatives in the Colson P, de Lamballerie X, Fournous G, Raoult D. (2012). Sargasso sea. Virol J 2: 62. Reclassification of giant viruses composing a fourth Gomez-Alvarez V, Teal TK, Schmidt TM. (2009). Systema- domain of life in the new order megavirales. Inter- tic artifacts in metagenomes from complex microbial virology 55: 321–332. communities. ISME J 3: 1314–1317. Danovaro R, Corinaldesi C, Dell’anno A, Fuhrman JA, Gomez-Pereira PR, Schuler M, Fuchs BM, Bennke C, Middelburg JJ, Noble RT et al. (2011). Marine viruses Teeling H, Waldmann J et al. (2012). Genomic content and global climate change. FEMS Microbiol Rev 35: of uncultured Bacteroidetes from contrasting oceanic 993–1034. provinces in the North Atlantic Ocean. Environ De Silva FS, Lewis W, Berglund P, Koonin EV, Moss B. Microbiol 14: 52–66. (2007). Poxvirus DNA primase. Proc Natl Acad Sci Grenville-Briggs L, Gachon CM, Strittmatter M, Sterck L, USA 104: 18724–18729. Kupper FC, van West P. (2011). A molecular insight del Giorgio PA, Bird DF, Prairie YT, Planas D. (1996). Flow into algal-oomycete warfare: cDNA analysis of Ecto- cytometric determination of bacterial abundance in carpus siliculosus infected with the basal oomycete lake plankton with the green nucleic acid stain Eurychasma dicksonii. PLoS One 6: e24500. SYTO 13. Limmnol Oceanorgr 41: 783–789. Guindon S, Gascuel O. (2003). A simple, fast, and accurate Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, algorithm to estimate large phylogenies by maximum Chevenet F et al. (2008). Phylogeny.fr: robust phylo- likelihood. Syst Biol 52: 696–704. genetic analysis for the non-specialist. Nucleic Acids Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, Res 36: W465–W469. Cano LM et al. (2009). Genome sequence and analysis Dereeper A, Audic S, Claverie JM, Blanc G. (2010). of the Irish potato famine Phytophthora BLAST-EXPLORER helps you building datasets for infestans. Nature 461: 393–398. phylogenetic analysis. BMC Evol Biol 10:8. Han MV, Zmasek CM. (2009). phyloXML: XML for Derelle E, Ferraz C, Escande ML, Eychenie S, Cooke R, evolutionary biology and comparative genomics. Piganeau G et al. (2008). Life-cycle and genome of BMC Bioinformatics 10: 356. OtV5, a large DNA virus of the pelagic marine Huson DH, Auch AF, Qi J, Schuster SC. (2007). MEGAN unicellular green alga Ostreococcus tauri. PLoS One analysis of metagenomic data. Genome Res 17: 3: e2250. 377–386.

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1694 Iyer LM, Balaji S, Koonin EV, Aravind L. (2006). Evolu- Moreira D, Brochier-Armanet C. (2008). Giant viruses, tionary genomics of nucleo-cytoplasmic large DNA giant chimeras: the multiple evolutionary histories of viruses. Virus Res 117: 156–184. Mimivirus genes. BMC Evol Biol 8: 12. Jacquet S, Heldal M, Iglesias-Rodriguez D, Larsen A, Nagasaki K. (2008). , , and their Wilson W, Bratbak G. (2002). Flow cytometric analysis viruses. J. Microbiol 46: 235–243. of an Emiliana huxleyi bloom terminated by viral Newell SY, Miller JD, Fell JW. (1987). Rapid and pervasive infection. Aquat Microb Ecol 27: 111–124. occupation of fallen mangrove leaves by a marine Kale SD, Tyler BM. (2011). Entry of oomycete and fungal zoosporic . Appl Environ Microbiol 53: effectors into and host cells. Cell 2464–2469. Microbiol 13: 1839–1848. Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: a Karsenti E, Acinas SG, Bork P, Bowler C, De Vargas C, novel method for fast and accurate multiple sequence Raes J et al. (2011). A holistic approach to marine eco- alignment. J Mol Biol 302: 205–217. systems biology. PLoS Biol 9: e1001177. Ogata H, La Scola B, Audic S, Renesto P, Blanc G, Robert C Katoh K, Kuma K, Toh H, Miyata T. (2005). MAFFT et al. (2006). Genome sequence of Rickettsia bellii version 5: improvement in accuracy of multiple illuminates the role of amoebae in gene exchanges sequence alignment. Nucleic Acids Res 33: 511–518. between intracellular pathogens. PLoS Genet 2: e76. Kurita J, Nakajima K. (2012). Megalocytiviruses. Viruses 4: Ogata H, Toyoda K, Tomaru Y, Nakayama N, Shirai Y, 521–538. Claverie JM et al. (2009). Remarkable sequence La Scola B, Campocasso A, N’Dong R, Fournous G, similarity between the dinoflagellate-infecting marine Barrassi L, Flaudrops C et al. (2010). Tentative girus and the terrestrial pathogen African swine fever characterization of new environmental giant viruses virus. J Virol 6: 178. by MALDI-TOF mass spectrometry. Intervirology 53: Ogata H, Ray J, Toyoda K, Sandaa RA, Nagasaki K, 344–353. Bratbak G et al. (2011). Two new subfamilies of Lambert DL, Taylor PN, Goulder R. (1993). Between-site DNA mismatch repair proteins (MutS) specifically comparison of freshwater by DNA abundant in the marine environment. ISME J 5: hybridization. Microb Ecol 26: 189–200. 1143–1151. Legendre M, Arslan D, Abergel C, Claverie JM. (2012). Olson RJ, Zettler ER, du Rand MD. (1993). Phytoplankton Genomics of Megavirus and the elusive fourth domain analysis using flow cytometry. In: Kemp PF, Sherr BF, of life. Commun Integr Biol 5: 102–106. Sherr EB, Cole JJ (eds) Handbook of Methods in Levesque CA, Brouwer H, Cano L, Hamilton JP, Holt C, Aquatic Microbial Ecology. Lewis Publishers: Boca Huitema E et al. (2010). Genome sequence of the Raton, FL, USA, pp 175–196. necrotrophic plant pathogen reveals Pagarete A, Le Corguille G, Tiwari B, Ogata H, de Vargas C, original pathogenicity mechanisms and effector reper- Wilson WH et al. (2011). Unveiling the transcriptional toire. Genome Biol 11: R73. features associated with coccolithovirus infection of Lopez-Bueno A, Tamames J, Velazquez D, Moya A, natural Emiliania huxleyi blooms. FEMS Microbiol Quesada A, Alcami A. (2009). High diversity of the Ecol 78: 555–564. viral community from an Antarctic lake. Science 326: Proctor LM, Fuhrman JA. (1990). Viral mortality of marine 858–861. bacteria and cyanobacteria. Nature 343: 60–62. Massana R, Taylor LT, Murray AE, Wu KY, Jeffrey WH, Pruitt KD, Tatusova T, Brown GR, Maglott DR. (2012). DeLong EF. (1998). Vertical distribution and temporal NCBI Reference Sequences (RefSeq): current status, variation of marine planktonic archaea in the Gerlache new features and genome annotation policy. Nucleic Strait, Antarctica, during early spring. Anglais 43: Acids Res 40: D130–D135. 607–617. R Development Core Team (2011). R: A Language and Massana R. (2011). Eukaryotic picoplankton in surface Environment for Statistical Computing. The R Foun- oceans. Annu Rev Microbiol 65: 91–110. dation for Statistical Computing: Vienna, Austria. Matsen FA, Kodner RB, Armbrust EV. (2010). Pplacer: Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P. linear time maximum-likelihood and Bayesian phylo- (2007). Prediction of effective genome size in meta- genetic placement of sequences onto a fixed reference genomic samples. Genome Biol 8: R10. tree. BMC Bioinformatics 11: 538. Raoult D, Audic S, Robert C, Abergel C, Renesto P, Ogata H Monier A, Claverie JM, Ogata H. (2007). Horizontal gene et al. (2004). The 1.2-megabase genome sequence of transfer and nucleotide compositional anomaly in Mimivirus. Science 306: 1344–1350. large DNA viruses. BMC Genomics 8: 456. Raoult D, Forterre P. (2008). Redefining viruses: lessons Monier A, Claverie JM, Ogata H. (2008a). Taxonomic from Mimivirus. Nat Rev Microbiol 6: 315–319. distribution of large DNA viruses in the sea. Genome Rho M, Tang H, Ye Y. (2010). FragGeneScan: predicting Biol 9: R106. genes in short and error-prone reads. Nucleic Acids Monier A, Larsen JB, Sandaa RA, Bratbak G, Claverie JM, Res 38: e191. Ogata H. (2008b). Marine mimivirus relatives are Richards TA, Soanes DM, Jones MD, Vasieva O, Leonard G, probably large algal viruses. J Virol 5: 12. Paszkiewicz K et al. (2011). Horizontal gene transfer Monier A, Pagarete A, de Vargas C, Allen MJ, Read B, facilitated the evolution of plant parasitic mechanisms Claverie JM et al. (2009). Horizontal gene transfer of an in the oomycetes. Proc Natl Acad Sci USA 108: entire metabolic pathway between a eukaryotic alga 15258–15263. and its DNA virus. Genome Res 19: 1441–1449. Rodriguez-Brito B, Li L, Wegley L, Furlan M, Angly F, Moreau H, Piganeau G, Desdevises Y, Cooke R, Derelle E, Breitbart M et al. (2010). Viral and microbial commu- Grimsley N. (2010). Marine prasinovirus genomes nity dynamics in four aquatic environments. ISME J 4: show low evolutionary divergence and acquisition of 739–751. protein metabolism genes by horizontal gene transfer. Rokas A. (2011). Phylogenetic analysis of protein J Virol 84: 12555–12563. sequence data using the Randomized Axelerated

The ISME Journal NCLDVs in Tara Oceans metagenomes P Hingamp et al 1695 Maximum Likelihood (RAXML) Program. Curr Protoc on a Heterosigma akashiwo (Raphidophyceae)bloomin Mol Biol Chapter 19: Unit19 11. Hiroshima Bay, Japan. Aquat Microb Ecol 34: 227–238. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Torrella F, Morita RY. (1979). Evidence by electron Williamson S, Yooseph S et al. (2007). The Sorcerer micrographs for a high incidence of bacteriophage II Global Ocean Sampling expedition: northwest particles in the waters of Yaquina Bay, oregon: Atlantic through eastern tropical Pacific. PLoS Biol ecological and taxonomical implications. Appl 5: e77. Environ Microbiol 37: 774–778. Schroeder DC, Oke J, Hall M, Malin G, Wilson WH. (2003). UniProt Consortium (2012). Reorganizing the protein Virus succession observed during an Emiliania hux- space at the Universal Protein Resource (UniProt). leyi bloom. Appl Environ Microbiol 69: 2484–2490. Nucleic Acids Res 40: D71–D75. Steele JA, Countway PD, Xia L, Vigil PD, Beman JM, Van Etten JL. (2011). Another really, really big virus. Kim DY et al. (2011). Marine bacterial, archaeal and Viruses 3: 32–46. protistan association networks reveal ecological Williamson SJ, Rusch DB, Yooseph S, Halpern AL, linkages. ISME J 5: 1414–1425. Heidelberg KB, Glass JI et al. (2008). The Sorcerer II Stewart FJ, Ulloa O, DeLong EF. (2012). Microbial Global Ocean Sampling Expedition: metagenomic metatranscriptomics in a permanent marine oxygen characterization of viruses within aquatic microbial minimum zone. Environ Microbiol 14: 23–40. samples. PLoS One 3: e1456. Strimmer K. (2008). fdrtool: a versatile R package for Wilson WH, Schroeder DC, Allen MJ, Holden MT, Parkhill estimating local and tail area-based false discovery J, Barrell BG et al. (2005). Complete genome sequence rates. Bioinformatics 24: 1461–1462. and lytic phase transcription profile of a Coccolitho- Sullivan MB, Lindell D, Lee JA, Thompson LR, Bielawski JP, virus. Science 309: 1090–1092. Chisholm SW. (2006). Prevalence and evolution of core Winnepenninckx B, Backeljau T, De Wachter R. (1993). photosystem II genes in marine cyanobacterial viruses Extraction of high molecular weight DNA from and their hosts. PLoS Biol 4: e234. molluscs. Trends Genet 9: 407. Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S et al. Winter C, Bouvier T, Weinbauer MG, Thingstad TF. (2010). (2011). Community cyberinfrastructure for Advanced Trade-offs between competition and defense specia- Microbial Ecology Research and Analysis: the CAM- lists among unicellular planktonic organisms: the ERA resource. Nucleic Acids Res 39: D546–D551. ‘killing the winner’ hypothesis revisited. Microbiol Suttle CA. (2005). Viruses in the sea. Nature 437: 356–361. Mol Biol Rev 74: 42–57. Suttle CA. (2007). Marine viruses—major players in the Yau S, Lauro FM, DeMaere MZ, Brown MV, Thomas T, global ecosystem. Nat Rev Microbiol 5: 801–812. Raftery MJ et al. (2011). Virophage control of antarctic Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. algal host-virus dynamics. Proc Natl Acad Sci USA (2007). UniRef: comprehensive and non-redundant 108: 6163–6168. UniProt reference clusters. Bioinformatics 23: Yoon HS, Price DC, Stepanauskas R, Rajah VD, Sieracki ME, 1282–1288. Wilson WH et al. (2011). Single-cell genomics reveals Talavera G, Castresana J. (2007). Improvement of phylo- organismal interactions in uncultivated marine . genies after removing divergent and ambiguously Science 332: 714–717. aligned blocks from protein sequence alignments. Syst Yutin N, Wolf YI, Raoult D, Koonin EV. (2009). Eukaryotic Biol 56: 564–577. large nucleo-cytoplasmic DNA viruses: clusters of Tamura K, Peterson D, Peterson N, Stecher G, Nei M, orthologous genes and reconstruction of viral genome Kumar S. (2011). MEGA5: Molecular Evolutionary evolution. Virol J 6: 223. Genetics Analysis using maximum likelihood, evolu- Yutin N, Koonin EV. (2012). Hidden evolutionary com- tionary distance, and maximum parsimony methods. plexity of nucleo-cytoplasmic large DNA viruses of Mol Biol Evol 28: 2731–2739. eukaryotes. Virol J 9: 161. Thomas V, Bertelli C, Collyn F, Casson N, Telenti A, Goesmann A et al. (2011). Lausannevirus, a giant amoebal virus encoding histone doublets. Environ This work is licensed under a Creative Microbiol 13: 1454–1466. Commons Attribution 3.0 Unported Tomaru Y, Tarutani K, Yamaguchi M, Nagasaki K. (2004). License. To view a copy of this license, visit http:// Quantitative and qualitative impacts of viral infection creativecommons.org/licenses/by/3.0/

Supplementary Information accompanies this paper on The ISME Journal website (http://www.nature.com/ismej)

The ISME Journal