VIRAL DIVERSITY AND DYNAMICS IN THE

OPEN OCEAN

By

Elaine Luo

H. B. Sc., University of Toronto, 2014

A dissertation submitted in partial fulfillment of requirements for the degree of

DOCTOR OF PHILOSOPHY

in

MARINE BIOLOGY

at the

UNIVERSITY OF HAWAI‘I AT MĀNOA

May 2020

Dissertation committee:

Edward DeLong, chairperson

Kyle Edwards

Grieg Steward

Edward Ruby

David Karl

Keywords: marine , open ocean, bacterioplankton, bacteriophages, virioplankton, sediment trap, marine snow, , long-read sequencing

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

DEDICATION

To my mom, born to teenage parents in rural China during the Cultural Revolution,

Who watched her younger sister die from malnutrition in her mother’s arms, Whose sunspots reflect her childhood, reading the fields and the chickens, carrying firewood to keep her three brothers warm, Who refused to apologize for birthing a daughter, when her husband and his family wanted a son, Who found creative ways to get me vaccinated while corruption lined pockets; was a life less than, in a country of a billion?

Whose burn scars lay tribute to the price she paid for marriage,

Who, speaking no English, determined to move to Canada, Who, with no education, learned the TOEFL and the GED, Who, as a single mom, worked full-time and put food on the table, Who, always a penny-pincher, spent $150 to frame my undergrad diploma,

Whose eyes sparkle when she mentions what I am doing now, reminding me of how lucky I am to be here, inspiring me to seize the days that she never had.

ACKNOWLEDGEMENTS

First and foremost, I thank Edward DeLong for being an exemplary advisor. Throughout my PhD, his curiosity, enthusiasm, and positive attitude created a healthy environment for personal growth where I felt encouraged to try, and to fail too. Witnessing Ed’s fairness, ethics, diplomacy, lab management, and open communication style taught me skills that I will continue to value beyond my PhD. I am deeply appreciative of the opportunity to learn and work in the DeLong lab.

I am grateful to committee members Kyle Edwards, Grieg Steward, Edward Ruby, and David Karl for their continuous contribution and support throughout my PhD. This dissertation was enriched by their inquisitive questions and various expertise: Kyle on ecological and statistical models, Grieg on viral literature and tangential flow filtration, Ned on hypothesis generation and experimental design, and Dave on biogeochemistry and background research on Station ALOHA. I value the opportunity to interact with such an inspirational group of scientists.

Thanks to all current and former DeLong lab members for creating a collaborative and exciting work environment during my time here: Frank Aylward, Dominique Boeuf, Jessica Bryant, Andrew Burger, Bethanie Edwards, John Eppley, Carla Gimpel, Andy Leu, Fuyan Li, Daniel Mende, Dan Olson, Kirsten Poff, Anna Romano, Paul Den Uyl, and Alice Vislova. In particular, I am deeply grateful to Anna for being an amazing mentor in the lab and beyond. I will always remember her inspiring and meticulous work ethic, and her encouragement of my personal growth exceeded any expectations of a colleague.

I am grateful to Frank for being such a stellar postdoctoral mentor during our work together on Ch. II. His curiosity and openness for discussion fostered an exciting and enriching work environment. I am grateful to Jessica Bryant and Oscar Sosa for their mentorship, and I thoroughly enjoyed our interactions both in and outside the office.

Special thanks to the HOT program and SCOPE-OPS team for oceanographic expeditions and sampling. Tara Clemente, Susan Curless, and Dan Sadler were exemplary Chief Scientists who ensured smooth operations on the oceanographic cruises I participated in. Thanks to Ryan Tabata, Tim Burrell, and Eric Shimabukuro for collecting samples that were vital to Ch. III and showing me the ropes on cruise operations. Many thanks to Tara Clemente, Blake Watkins, and Eric Grabowski for sediment trap deployment, collection, and biogeochemical analyses for samples used in Ch. IV.

I would like to thank collaborators John Beaulaurier at Oxford Nanopore Technologies, as well as Ashley Coenen, Daniel Muratore, and Steven Beckett in the Joshua Weitz lab, for exciting discussions on the study of marine viruses. I appreciate the opportunity to interact with and learn from their diverse expertise.

I would like to thank Chris Schvarcz for mentorship on epifluorescence microscopy and tangential flow filtration. His enthusiasm and willingness to offer help are deeply appreciated.

Thanks to Tina Carvalho at the UH Biological Electron Microscope Facility for guidance on transmission and scanning electron micrography. The micrographs that she helped produce look amazing.

Many thanks to the C-MORE o’hana for creating such an awesome work environment in the Hale and camaraderie during oceanographic cruises.

Special thanks to my undergraduate advisors James Thomson and Lauren Mullineaux for taking me under their wings. I would not have entered a research trajectory without their dedication to providing excellent undergraduate mentorship. I am deeply grateful for their commitment to offering funded undergraduate opportunities, which makes it possible for students from diverse backgrounds to pursue research.

Funding for this dissertation was generously provided by the Simons Foundation (SCOPE #329108 to Edward DeLong), Gordon and Betty Moore Foundation (GBMF 3777 to Edward DeLong), and the Natural Sciences and Engineering Research Council of Canada (PGSD3-487490-2016 to me).

ABSTRACT

In the open oceans that cover roughly 40% of our planet, viruses influence ecosystem dynamics, microbial diversity, and biogeochemical cycling. By infecting and killing cellular hosts, viruses transform organic matter from living cells into dissolved and particulate pools that fuel organic matter recycling and export. Despite their importance, viruses in the environment remain unexplored relative to other microbes, particularly in the under-sampled open oceans that might contain large reservoirs of novel viral diversity. For example, some fundamental questions remain open in this field: how many different viral populations coexist in the open ocean, what novel genes do they encode, and how do viral diversity and virus-host interactions vary from the surface to the deep ocean, or between free-living and particle-attached habitats? To address these questions, I explored the diversity and dynamics of viruses in the open ocean, from planktonic assemblages in the upper ocean to sinking particles in the deep sea. In this body of work, I used metagenomic approaches to study viruses sampled from Station ALOHA located in the North Pacific Subtropical Gyre.

From roughly 7 TB of sequencing data, I recovered over 17,000 viral population genomes, at least 9,000 of which were novel with respect to what has been studied before. I explored how viral diversity, viral reproductive strategies, and virus-host interactions varied across vertical gradients along the open ocean water column to better understand viral effects on ecosystem dynamics and biogeochemical cycling. The culmination of these projects reveals the diversity and dynamics of viruses that represent some of the most abundant yet understudied life forms in the ocean.

TABLE OF CONTENTS

Acknowledgements
Abstract
List of abbreviations and symbols

Chapter 1. Background and rationale
    Overview
    The open oceans
    Station ALOHA
    Importance of microbes in marine ecosystems
    Importance of viruses in marine ecosystems
    History of marine virus research
    Dissertation overview
    Figures
        Figure 1.1. Station ALOHA
        Figure 1.2. Epifluorescence micrograph of SYBR gold-stained seawater samples from Station ALOHA
        Figure 1.3. Transmission electron micrographs of putative bacteriophages at Station ALOHA
        Figure 1.4. Some viral reproductive strategies
        Figure 1.5. Viral reproductive strategies alternative to lysis and lysogeny
    References

Chapter 2. Bacteriophage distributions and temporal variability in the ocean’s interior
    Abstract
    Importance
    Introduction
    Results and discussion
        Novel ALOHA viral genomes, genome fragments, and AMGs
        Phage genotype distributions in the Station ALOHA time series depth profile
        Genomic trends in lytic vs. lysogenic viral life-history
        Ecology of surface and mesopelagic phage
    Conclusions
    Materials and methods
        Study site and sample collection
        Genome-centric approach
            Assembling ALOHA viral contigs
            Annotating ALOHA viral contigs
            Depth and temporal distributions
            Mapping to reference genomes
            Detection of novel phage genes and AMGs
        Gene-centric approach
            Phage and cell-associated gene catalogue assemblages
            Viral life history
    Funding
    Acknowledgements
    Figures
        Figure 2.1. Depth profiles of known and novel phage ALOHA viral contigs
        Figure 2.2. Coverage profiles of 129 ALOHA viral contigs through depth and time
        Figure 2.3. Cluster analysis of phage-specific versus cellular genes
        Figure 2.4. Temporal variability of assembled viral contigs increased with depth
        Figure 2.5. Depth profile of prophage marker proteins
    Supplementary figures
        Figure S2.1. Dataset includes 83 samples from 7 depths and 12 time points
        Figure S2.2. Bioinformatic workflow
        Figure S2.3. Genome maps of abundant phages found at different depths
        Figure S2.4. Depth profile of phage marker proteins
        Figure S2.5. Bubble plot of proportion of ALOHA gene catalogue genes hitting to four groups
    Supplementary tables
        Table S2.1. Assembly statistics for the initial individual assemblies
        Table S2.2. List of 10 out of 129 ALOHA viral contigs with hits to known phage
        Table S2.3. List of 37 novel genes in marine phage
        Table S2.4. List of 4 persistent ALOHA viral contigs
    References

Chapter 3. Double-stranded DNA virioplankton dynamics and reproductive strategies in the oligotrophic open ocean water column
    Abstract
    Introduction
    Methods
        Sample collection, extraction, and sequencing
        Viral-specific reassembly
        ALOHA 2.0 virus database curation
        Genomic completion
        Viral and prokaryotic contribution to total DNA
        Spatiotemporal distribution and abundance
        Characterizing cellular assemblages
        Viral population identification
        Archaeal virus identification
        Eukaryotic virus identification
        Crocosphaera putative phage identification
        Viral and prokaryotic diversity
        VC ratio
    Results and discussion
        Viral DNA contribution to total DNA
        Viral diversity hotspot at the base of the euphotic zone
        Depth-dependent patterns in temperate phage integration and induction
        VC ratio to confirm genomic temperate phage identification
        Environmental distribution of viral AMGs
        Spatiotemporal distributions: temporal persistence and depth distributions
        Depth distributions of archaeal viruses and their hosts
        Putative Crocosphaera phages
    Conclusion
    Acknowledgements
    Figures
        Figure 3.1. Depth profiles of time-averaged viral and prokaryotic contributions to total sequenced DNA in a. virus-enriched and b. cell-enriched size fractions
        Figure 3.2. α-diversity depth profiles of viral and prokaryotic assemblages
        Figure 3.3. Spatiotemporal distributions of annotated virus populations present in the virus-enriched size fraction
        Figure 3.4. Depth profiles of relative abundances of putative temperate phages
        Figure 3.5. Depth profiles of temporal variabilities of VC ratios of 1352 inferred temperate phages and other viral populations
        Figure 3.6. Distribution of viruses containing specific auxiliary metabolic genes in the water column
    Supplementary figures
        Figure S3.1. Bioinformatic workflow
        Figure S3.2. Hydrography depth profile time-series of samples
        Figure S3.3. Cut-off selection for identifying putative archaeal viruses
        Figure S3.4. Examples of viral genomes in the ALOHA 2.0 database
        Figure S3.5. Genomic structure of a putative Crocosphaera phage or phage parasite
        Figure S3.6. Size fraction distribution and genome sizes of putative temperate phage and other viruses
        Figure S3.7. Depth profiles of time-averaged proportion of putative temperate SAR11 phages
        Figure S3.8. Virus-host relative abundances for a. cyanophage and b. thaumarchaeal virus
        Figure S3.9. Spatiotemporal distributions of all virus populations present in the virus-enriched size fraction
        Figure S3.10. Temporal variability in viral relative abundances and correlation with environmental variables
    Supplementary tables
        Table S3.1. Sample information and associated metadata
        Table S3.2. Sequencing, initial assemblies, VIRSorter contigs, and read statistics for all samples
        Table S3.3. Information for 17 369 >10 kbp ALOHA 2.0 virus populations
        Table S3.4. Relative abundances of 17 369 ALOHA 2.0 virus populations in the virus-enriched fraction
        Table S3.5. Relative abundances of 17 369 ALOHA 2.0 virus populations in the cell-enriched fraction
        Table S3.6. Relative abundances of 1543 circular ALOHA 2.0 virus populations in the virus-enriched fraction
        Table S3.7. Relative abundances of 1543 circular ALOHA 2.0 virus populations in the cell-enriched fraction
        Table S3.8. Relative abundances of 2568 mOTUs in the cell-enriched fractions
        Table S3.9. Taxonomic assignments of ALOHA 2.0 virus proteins
        Table S3.10. List of novel viral PFAM domains
    References

Chapter 4. Diversity and origins of viruses on sinking particles in the deep ocean
    Abstract
    Introduction
    Methods
        Sample collection, extraction, and sequencing
        Virus-specific assembly
        Deep Trap Virus (DTV) database curation
        Genomic completion
        Identifying virus populations
        Recovering cellular metagenome-assembled genomes (MAGs)
        Temporal abundance and persistence
        Vertical transport and depth of origin
        Viral correlation with particulate carbon export flux
    Results and discussion
        Linking novel viruses to hosts using metagenome-assembled genomes (MAGs)
        Viral infection patterns in bathypelagic
        Temperate phages on sinking particles
        Interannual similarities of viruses and prokaryotes on sinking particles
        Evidence of vertical transport of viruses
        Viral correlation with particulate carbon export flux
    Conclusions
    Acknowledgements
    Figures
        Figure 4.1. Relative abundances through time of 68 annotated deep trap viruses identified by alignments to putative prophages and CRISPRs in MAGs
        Figure 4.2. Examples of different virus-host abundance patterns observed in deep trap viruses
        Figure 4.3. Coverage abundance profiles of 857 DTV populations through time
        Figure 4.4. Abundances through time of 21 deep trap viruses with depths of origin in the upper 500 m
        Figure 4.5. Viral presence in deep trap samples grouped by presumptive depth of origin
        Figure 4.6. Normalized abundances of annotated viruses that significantly correlate with carbon flux
    Supplementary figures
        Figure S4.1. Bioinformatic workflow
        Figure S4.2. Sample metadata plotted through the 3-year sampling period
        Figure S4.3. Histograms of the number of samples present for each virus and MAG population
        Figure S4.4. Spatiotemporal coverage abundance profiles of 21 viruses captured in sinking particles that we postulate originated from the upper 500 m
    Supplementary tables
        Table S4.1. Sequencing, assembly, and viral contig identification
        Table S4.2. Information on 857 deep trap viruses
        Table S4.3. Alignments between deep trap viruses and MAGs
        Table S4.4. Novel PFAM protein domains recovered from the deep trap viral database
        Table S4.5. Relative abundances of 857 deep trap viruses
        Table S4.6. Relative abundances of 129 cellular MAGs
        Table S4.7. Alignments between 21 deep trap viruses and the ALOHA 2.0 viral database
    References

Chapter 5. Summary, future directions, and questions moving forward
    Summary
    Future directions
    Questions moving forward
    Figures
        Figure 5.1. Long-read sequencing overcomes limitations in short-read sequencing
        Figure 5.2. Genome length distributions of raw sequenced long reads
        Figure 5.3. Viral genome recovery using long-read and short-read approaches
        Figure 5.4. Complete viral genomes recovered using long-read sequencing
        Figure 5.5. End repeat distributions along the genome
        Figure 5.6. Gene content differences between viruses with different replication strategies
        Figure 5.7. Life cycle of known viral parasites, phage-induced chromosomal islands (PICIs)
        Figure 5.8. Genome figures of putative viral parasites, phage-induced chromosomal islands (PICIs)
    References

LIST OF ABBREVIATIONS AND SYMBOLS

NPSG: North Pacific Subtropical Gyre

HOT: Hawaii Ocean Time-series

(Station) ALOHA: A Long-term Oligotrophic Habitat Assessment

NPP: net primary production

SEP: summer export pulse

ANI: average nucleotide identity

AAI: amino acid identity

CHAPTER 1. BACKGROUND AND RATIONALE

Overview

Viruses are currently considered the most abundant DNA-containing life forms in the oceans, and influence marine ecosystems in a variety of ways. This introduction provides background information on the open-ocean habitat and the viruses that live there. The first sections provide context to this dissertation by describing the open-ocean environment, its significance to global biogeochemical cycles, and our sampling site Station ALOHA. The middle sections highlight the importance of microbes to marine ecosystems, and in particular, viruses within these microbial communities. The latter sections offer a brief history of marine virus research and what we know so far about viral diversity in the oceans to provide rationale for this work. This introduction ends with a chapter overview to link this dissertation with the current state of the field.

The open oceans

The open oceans cover roughly 40% of our planet and represent one of the largest biomes on Earth (1). These environments harbor microbial communities that influence processes, such as primary production and export, that are critical to the habitability of our planet. This biome includes five major circulatory features that cover most of the ocean: the North/South Pacific Subtropical Gyres, the North/South Atlantic Subtropical Gyres, and the Indian Ocean Gyre. In these oligotrophic gyres, the sunlit layer is depleted in inorganic nutrients such as nitrogen and phosphorus that are vital to primary productivity (2). These low-nutrient environments were previously thought to be homogeneous oceanic deserts devoid of life (1,3), in part due to sparse observations resulting from their distance from human development. With increasing sampling from oceanographic expeditions, microscopy, cultivation-based analyses, and cultivation-independent DNA sequencing, we now know that the open oceans contain a diverse suite of microbial communities that fuel productivity, food webs, and biogeochemical cycling across the globe. Although biomass and productivity per volume of seawater are lower in the open oceans than in nutrient-rich coastal waters, the open oceans’ sheer size makes them formidable drivers of our planet’s biogeochemical cycles through productivity and export. Net primary productivity (NPP) in the open oceans is estimated at 42 gigatonnes of carbon per year (4), accounting for 65% of oceanic NPP and 34% of global NPP (2). In comparison with highly productive terrestrial systems, the NPP of the Pacific Ocean is estimated to equal that of all tropical rainforests (2). Productivity in the open oceans drives the biological carbon pump, in which photosynthesis fixes inorganic carbon into biomass that is then exported and sequestered in the deep sea (5). Oceanic production and export sequester 2 – 6 gigatonnes of carbon each year (6), playing a critical role in carbon cycles that influence the climate and habitability of our planet.
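As a rough consistency check on the NPP figures quoted in this section, the open-ocean estimate and its stated shares imply totals for oceanic and global NPP. The sketch below back-calculates those totals. The variable names are illustrative, and the implied totals are derived here from the quoted percentages rather than taken from the cited sources.

```python
# Back-of-envelope check using only the values quoted in the text.
# Variable names are illustrative; the implied totals are derived, not cited.
open_ocean_npp = 42.0        # Gt C per year, open-ocean NPP (4)
share_of_oceanic_npp = 0.65  # open oceans as a fraction of oceanic NPP (2)
share_of_global_npp = 0.34   # open oceans as a fraction of global NPP (2)

# If 42 Gt C is 65% of oceanic NPP, the oceanic total is 42 / 0.65.
implied_oceanic_npp = open_ocean_npp / share_of_oceanic_npp
# If 42 Gt C is 34% of global NPP, the global total is 42 / 0.34.
implied_global_npp = open_ocean_npp / share_of_global_npp

print(f"Implied oceanic NPP: {implied_oceanic_npp:.1f} Gt C per year")
print(f"Implied global NPP:  {implied_global_npp:.1f} Gt C per year")
```

Under these assumptions, the quoted shares correspond to roughly 65 and 124 gigatonnes of carbon per year for oceanic and global NPP, respectively.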

Station ALOHA

One of the most well-characterized locations in the open ocean is the area surrounding Station ALOHA (22°45’ N, 158° W) in the North Pacific Subtropical Gyre (NPSG) (Fig. 1.1). At 2 × 10⁷ km², the NPSG is considered the world’s largest contiguous biome, covering most of the Pacific Ocean north of the equator (7). Despite its size, this environment is under-sampled compared to more accessible coastal areas (1). To establish regular sampling in open-ocean environments, two major NSF-funded initiatives were established in 1988: the Bermuda Atlantic Time-series Study (BATS) in the North Atlantic Subtropical Gyre, and the Hawaii Ocean Time-series (HOT) in the NPSG (1,8). The study site of the HOT program was established at Station ALOHA, located roughly 100 km north of Oahu, Hawaii. To date, the 30-plus years of roughly monthly HOT cruises provide a wealth of physical, chemical, and biological data characterizing this environment. These data provide vital information for generating a contextual framework for this dissertation on studying viruses in the North Pacific Subtropical Gyre.

The HOT program enabled repeated measurements along the water column that captured physicochemical gradients characterizing the NPSG habitat. The NPSG is persistently stratified, displaying a relatively shallow mixed-layer depth of 20 – 120 m throughout the year (1). As a result of a gradient from high-light, low-nutrient surface waters to low-light, nutrient-rich mesopelagic waters, the deep chlorophyll maximum (DCM) is a feature generally observed around 100 – 125 m throughout the year (1). This feature represents a depth where primary producers become light-limited, and thus adapt to low-light conditions by increasing their chlorophyll content (9). Primary productivity decreases rapidly below the DCM, allowing for a build-up of inorganic nutrients and a relatively deep and permanent nutricline around 200 m (1). These depth-specific features are temporally stable, as seasonal changes in the upper water column are relatively weak (10). These features will be relevant to Chapters II and III, in which we explore virioplankton (free viral particles) and cell-associated viruses in the upper water column.

The HOT program has also enabled regular sampling of productivity and export. We now know that primary production is much higher than previously thought (1,3). Complementing productivity measurements, 150 m sediment traps collecting sinking particles revealed that roughly 5% of the primary productivity is exported from the euphotic zone (11). Starting in 1992, annual deployments of deep-moored sediment traps at 4000 m enabled direct measurement of particulate carbon, nitrogen, phosphorus, and silica exported and sequestered into bathypelagic depths (1). These 4000 m sediment traps revealed peaks in particle export in the late summer in some years, termed the summer export pulse (SEP) (12). These sediment trap samples will be relevant to Chapter IV, in which we explore viruses associated with sinking particles in the deep ocean.

Importance of microbes in marine ecosystems

Microscopic life forms – bacteria, archaea, single-celled eukaryotes, and viruses – are critical to biogeochemical cycles that sustain life on Earth (13). Microbes are the most abundant, diverse, and ancient life forms on the planet. Prokaryotes – comprising the domains Bacteria and Archaea – perform a myriad of metabolic functions, the diversity of which dwarfs that of all plants and animals (13). The ability to use a variety of substrates for energy and carbon needs enables prokaryotes to occupy otherwise uninhabitable niches. As a result of their metabolic diversity, prokaryotes contribute to the habitability of Earth. Cyanobacteria are thought to be responsible for oxygenic photosynthesis that allowed the evolution of organisms that use oxygen to respire organic carbon (14). Nitrogen fixation by prokaryotic diazotrophs transforms an otherwise biologically inaccessible form of nitrogen (N₂) into ammonium (NH₄⁺) that can be assimilated into biomass (15). Chemolithoautotrophic prokaryotes that can use deep-sea crustal fluids as an energy source transform barren ocean floors into diverse communities (16). These examples represent just the tip of the iceberg of how microbes sustain life on our planet. In short, we depend on microbes, and our planet would look very different today without them.

Prokaryotes are particularly important to the open oceans, where they are central to productivity, export, and transformation of organic compounds. Metabolic diversity allows prokaryotes to live off of a variety of electron donors and acceptors (13), and carbon sources such as dissolved organic carbon (17), that might otherwise be inaccessible to eukaryotic organisms. Oligotrophic waters tend to be dominated by small, planktonic cells with a high surface-area-to-volume ratio, a trait that is particularly advantageous in the uptake of rate-limiting nutrients (reviewed in (18,19)). While prokaryotic cells can range from 0.3 – 80 µm (20), many are <1 µm in the open oceans, particularly in the nutrient-limited sunlit ocean. At Station ALOHA, the 0.6 – 0.8 µm cyanobacterium Prochlorococcus accounts for more than half of surface prokaryotic assemblages (21), and 61 – 78% of primary production in the picoplankton size fraction, which accounts for the majority (68 – 83%) of total primary production (22). SAR11, ~0.4 µm heterotrophic bacterioplankton important for dissolved organic matter remineralization in oligotrophic environments (reviewed in (20)), accounts for >10% of the prokaryotic community throughout the water column (21). Major shifts in prokaryotic assemblages are observed at the light-limiting depths around the DCM. Below 100 – 125 m, a decline in nutrient demand relative to supply allows a build-up of inorganic nutrients in the aphotic ocean. Here, we observe increasing abundances of ~1 µm Thaumarchaeota, ammonia oxidizers critical to nitrogen cycling in the oceans (21). Compared to the photic ocean region, the aphotic ocean region appears to be enriched in prokaryotes with larger genome sizes (23), a feature typically associated with larger cells that are living on particles or in nutrient-replete environments (23–25). I will explore how these depth-related shifts in prokaryotic assemblages relate to viral populations along the water column in Chapters II and III.

Prokaryotic assemblages are also important to export processes in the oceans, serving both as a major source and a sink of particles. Planktonic cells can contribute to particle size and number through aggregation into sinking marine snow; particle-attached cells can contribute to particles through colonization, growth, and biofilm formation, as well as to particle degradation through the fragmentation (26) and respiration of particulate organic matter (reviewed in (27)). Although pulses of carbon export observed at Station ALOHA are thought to be induced by larger eukaryotic phytoplankton associated with nitrogen-fixing cyanobacteria (12), prokaryotes account for the majority (84%) of ribosomal RNA, a proxy for living cell abundance, found on sinking particles (28). These prokaryotic assemblages tend to differ from their planktonic counterparts and are enriched in large, “copiotrophic” bacteria that decompose sinking organic matter (28–30). Respiration of organic matter also decreases the particulate organic to inorganic carbon ratio, creating a ballasting effect that helps particles sink faster towards sequestration in the deep ocean (31). We will explore how these prokaryotic assemblages on sinking particles relate to the nature of particle-attached viruses in Chapter IV.

Importance of viruses in marine ecosystems

Viruses appear to be the most abundant of all microbes on the planet. Viruses are, on average, an order of magnitude more abundant than other organisms in the oceans (Fig. 1.2, (32)). Bacteriophages (phages) that infect and kill prokaryotic hosts are particularly abundant in this environment (Fig. 1.3). Viruses infect many key microbial groups, including primary producers such as Prochlorococcus (33) and abundant heterotrophic bacteria such as Pelagibacter (SAR11 clade), Puniceispirillum (SAR116), Roseobacter, and Alteromonas (34–37). At Station ALOHA, viruses are typically found on the order of 10⁷ mL⁻¹ at the surface, while cells are found on the order of 10⁵ to 10⁶ mL⁻¹ (38). Despite their small sizes, averaging roughly 50 nm (39), viral biomass in the oceans is second only to that of prokaryotes (40).
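The order-of-magnitude abundances quoted for Station ALOHA can be turned into an approximate virus-to-cell ratio. The sketch below is a simple illustration that assumes the quoted orders of magnitude can be treated as point values; the variable names are hypothetical and the resulting range is not a measured quantity.

```python
# Order-of-magnitude abundances quoted in the text for surface waters at
# Station ALOHA (38). Treating them as point values is an assumption made
# here for illustration only.
viruses_per_ml = 1e7
cells_per_ml_low = 1e5
cells_per_ml_high = 1e6

# The ratio is lowest when cells are most abundant, and highest when
# cells are least abundant.
ratio_min = viruses_per_ml / cells_per_ml_high
ratio_max = viruses_per_ml / cells_per_ml_low

print(f"Approximate virus-to-cell ratio: {ratio_min:.0f} to {ratio_max:.0f}")
```

The lower end of this range matches the statement that viruses are, on average, an order of magnitude more abundant than other organisms in the oceans.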

The abundance of viruses in the ocean gives them great potential to influence the marine ecosystem. Roughly 5 – 30% of cells are infected with viruses at any given time. Some of these viruses manipulate their hosts using auxiliary metabolic genes that alter cellular carbon, sulfur, and nitrogen metabolism during the viral replication cycle (41–43). Every day, viruses lyse roughly 4 – 50% of cellular production, recycling up to 25% of photosynthetically fixed carbon biomass into dissolved and particulate organic carbon pools (44–46). Initial estimates of viral decay rates range from 4.1 – 11.1% per hour (47), and viral inactivation could be caused by UV, temperature, particle adsorption, and ingestion by grazers (48,49).

As a result, virioplankton assemblages are highly dynamic, with turnover times estimated at 0.09 – 3.5 days (46,50,51). Viral lysis transforms an estimated 150 gigatonnes of carbon annually, roughly 20 times that from our use of fossil fuels and 25 times that of the ocean’s biological carbon pump (40,52). Viral lysis shunts organic matter away from living cells, where it could otherwise flow into higher trophic levels, and into dissolved and particulate organic matter pools (reviewed in (51)). This “viral shunt” is thought to provide substrates for respiration and to support heterotrophic bacterial metabolism in the oligotrophic oceans (45,53,54).
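The decay rates and turnover times quoted earlier in this section can be related with a simple back-of-the-envelope conversion. The sketch below assumes, purely for illustration, that a decay rate in percent per hour can be treated as a first-order loss rate whose reciprocal is a turnover time; this assumption and the function name are mine, not the method used in the cited studies.

```python
def turnover_time_days(decay_percent_per_hour):
    """Convert a decay rate (% per hour) into a turnover time in days.

    Assumes turnover time = 1 / rate, with the rate treated as a
    constant fractional loss per hour (an illustrative simplification).
    """
    rate_per_hour = decay_percent_per_hour / 100.0
    turnover_hours = 1.0 / rate_per_hour
    return turnover_hours / 24.0

# Range of initial decay-rate estimates quoted in the text (47).
for rate in (4.1, 11.1):
    print(f"{rate}% per hour -> {turnover_time_days(rate):.2f} days")
```

Under this assumption, both ends of the decay-rate range give turnover times of about a day or less, which falls inside the 0.09 – 3.5 day range reported for virioplankton assemblages.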

Another conceptual model of how viruses influence marine ecosystems is through the “viral shuttle” (reviewed in (51)). Through cell lysis, viruses can release sticky organic material from host cells, which is postulated to promote aggregation and sinking. Although this mechanism has been observed during viral lysis of some cultured eukaryotic phytoplankton (55–59), whether it occurs in more complex natural environments, particularly one dominated by prokaryotes such as Station ALOHA, remains an open question. This “viral shuttle” concept will be relevant to studying viruses on sinking particles in Chapter IV.

Many of the viral effects on marine ecosystems described above relate to a lytic reproductive strategy, in which a virus kills the cell by repackaging cellular material into the next generation of viral particles (Fig. 1.4). However, viruses are also known to reproduce through lysogeny (Fig. 1.4). Lysogeny occurs when a phage integrates into the genome of its host as a prophage and reproduces in tandem with host-genome replication (60). Lysogeny appears to be common in the oceans, since roughly half of marine bacterial genomes contain prophages (61). By integrating into host genomes, prophages can serve as vectors for horizontal gene transfer and confer novel functions to their cellular host. A conspicuous example is a phage encoding the cholera toxin that, when integrated as a prophage, converts the normally harmless Vibrio cholerae into one of humanity’s greatest blights (62). Prophages also have the flexibility to excise from host genomes and initiate lytic cycles through a process called induction (60). Phages that have this ability to integrate and excise are termed “temperate” (60). Multiple reproductive strategies allow temperate viruses to persist as prophages in unfavorable conditions and to initiate lytic cycles when conditions improve (Fig. 1.4, reviewed in (42)).

In addition to the lysis and lysogeny spectrum, viruses can reproduce through alternative mechanisms (Fig. 1.5, reviewed in (63)). Some viruses, after inserting their genomes into hosts, enter neither lysis nor lysogeny, but instead remain as episomes in the cell in a suspended state called pseudolysogeny (reviewed in (64)). Other viruses, such as filamentous phages, can cause chronic infections that continuously produce virion progeny but do not kill host cells (reviewed in (65)). At the population level, some temperate phages can continuously produce low levels of virion progeny without decimating host populations, a pattern attributed to abundant prophage populations with asynchronous, slow, and continual induction (66). Since various viral reproductive strategies affect cellular hosts in different ways, investigating how these strategies might be structured by environmental variables will further our understanding of viral effects on marine ecosystems. We will explore variability in viral reproductive strategies in Chapters II and III.

History of marine virus research

Marine virology is a relatively new field. The study of marine viruses has been historically hindered by their small size (~50 nm, Fig. 1.2) and by difficulties in cultivating both the viruses and their hosts in the laboratory. Although the idea of viruses first emerged in 1892, when <0.1 µm filtrate of tobacco infected with the tobacco mosaic disease was found to be infectious, reports of viruses in seawater did not emerge until 1923, as an explanation for bacterial mortality observed in coastal seawater (67). Their presence in marine systems remained enigmatic until the proposal of the microbial loop concept in 1974, which gave viruses contextual background by describing the role of microbes in organic matter recycling in the ocean (17). From 1979 onward, transmission electron micrographs revealed that viral particles were particularly abundant in aquatic environments (68–70) and actively infected abundant cyanobacteria and heterotrophic bacteria (71). Phages of marine bacteria had first been cultured in the laboratory earlier, in 1955 (72). Viral contribution to cellular mortality and biogeochemical cycling was explored using cultured isolates, mesocosms, and model-based estimations in the 1990s (45,53,73–75).

Although measuring viral contribution to mortality and nutrient cycling appeared to be feasible at this point, measuring viral diversity was challenging. Viruses lack universally conserved genes, such as the gene encoding 16S ribosomal RNA, which revolutionized the study of prokaryotic diversity (76). Additionally, the study of viral isolates in the laboratory was limited to those infecting a small subset of cultivable hosts (reviewed in (77)). To circumvent these challenges, researchers explored viral diversity in the field using pulsed field gel electrophoresis, which measured genome size distributions as a proxy for viral diversity (78,79).

Around the turn of the millennium, developments in DNA sequencing revolutionized the way viral diversity was studied in the oceans. The first complete genomes of marine phage isolates were sequenced around 1999 (34,80). Moving beyond this small subset of cultivable viral isolates, rapid developments in lowering sequencing cost and increasing throughput allowed the simultaneous recovery of many genomes from diverse microbial communities (81). Environmental shotgun sequencing, in which environmental DNA is collected in situ and sequenced, gave birth to the field of marine viral metagenomics in the early-to-mid 2000s. The direct sequencing of marine viruses revealed genomic fragments of hundreds of genotypes (“species”) in 200 L of surface seawater off the coast of California (82). Roughly 65% of these sequences were distinct compared to previously sequenced organisms. Similarly, environmental shotgun sequencing of marine sediments recovered viral metagenomes containing ~70% novel sequences, and led to the inference that a single sample could contain thousands of viral species (83). Hundreds of viral samples collected from four ocean regions led to the recovery of metagenomes with ~90% novel sequences, and suggested that the global marine virome could contain hundreds of thousands of species (84). Complementing the studies of free-living viruses described above, metagenomic surveys of cellular assemblages revealed an abundance of cyanophages, which account for up to 10% of all sequences sampled from the near-surface ocean (85). Additional marine viral metagenomic studies during the past decade have continued to uncover novel genetic diversity across global ocean basins (86–93). These studies indicated that marine viruses represent a highly diverse, abundant, and undersampled component of marine ecosystems.

Despite the rapidly expanding application of marine viral metagenomics, viral diversity in the NPSG remained largely unexplored. We did not know how many viral populations (“species”) inhabit Station ALOHA, nor their genetic novelty with respect to previously sequenced viral metagenomes from other environments. Although both DNA and RNA viruses are found in the oceans (94), this prokaryotic-dominated environment contains an abundance of double-stranded DNA phages prime for metagenomic sampling (Fig. 1.3). Previous research at Station ALOHA indicated that viral DNA accounts for over half of the DNA recovered from the <0.22 µm size fraction characteristic of phage-sized particles (95). Exploring this large reservoir of virus genetic diversity and environmental variability forms the foundation of my dissertation work.

Dissertation overview

Rationale. Given the dynamic nature of microbial communities, sporadic snapshots of microbial assemblages might be inadequate for studying viral diversity and dynamics in the oceans (42,96). At the start of my PhD, the most comprehensive viral metagenomic surveys sampled multiple locations across the oceans and provided valuable insight into the geographic variability of viral diversity (90,97). However, temporal and vertical variabilities, while important, remained underexplored aspects of metagenomic surveys investigating viruses in the oceans. Examining temporal variability could reveal seasonal and/or interannual patterns in viral diversity and virus-host interactions; examining vertical variability could identify environmental drivers of viral diversity, such as planktonic or particle-attached habitats, light, and variability in host abundance and production that can influence viral life-history strategies (reviewed in (42)). The existing infrastructure provided by monthly HOT cruises facilitates coupled sampling across both depth and time at Station ALOHA.

Objectives. In this dissertation, I explored the diversity of abundant viruses at Station ALOHA from planktonic assemblages in the upper water column and from particle-associated assemblages on deep-sea sinking particles. I created viral databases of genomes or genomic fragments representing populations of some of the most abundant viruses in the NPSG. Using these viral databases, I aimed to:

1. Explore how viral diversity and reproductive strategies vary throughout the water column to test the validity of theoretical models that have been proposed for marine viral reproduction, such as the “piggyback-the-winner” hypothesis, which predicts increased lysogeny with increasing host densities (98), or the “kill-the-winner” hypothesis, consistent with dynamics that predict increased lysis with increasing host densities (99);

2. Identify whether viruses are generally host- and habitat-specific, or if they can occupy multiple niches along the water column;

3. Characterize temporal variabilities and persistence of viral assemblages at monthly and interannual timescales to explore whether marine viruses observed in situ behave in a way that is consistent with proposed theoretical ecological models, such as the kill-the-winner hypothesis that predicts shuffling of abundant viral and host populations (99); and

4. Identify whether viruses contribute to particle export through vertical transport on sinking particles.

To address these objectives, I analyzed metagenomic samples captured from HOT cruises and deep-moored sediment traps to study both planktonic and particle-associated viruses in different habitats along the water column at Station ALOHA. In Chapters II and III, I explored objectives 1 – 3 by studying the spatiotemporal dynamics of planktonic viruses along the upper water column (5 – 1000 m) spanning a total of three years. In Chapter IV, I explored objectives 3 – 4 by studying the spatiotemporal dynamics of sinking particle-associated viruses captured at 4000 m over the course of three years. Below is an overview of these individual chapters.

Chapter II. This chapter represents the first metagenomic survey focused on bacterioplankton-associated viruses at Station ALOHA. To investigate temporal and spatial variability in planktonic assemblages, archived filtered seawater samples were collected from depth profiles spanning 25 – 1000 m in tandem with monthly HOT cruises from 2010 – 11. From these samples, my colleagues generated 83 metagenomes and explored prokaryotic assemblages across the water column (23). Complementing these efforts, I explored cell-associated viruses and generated the ALOHA viral database containing 129 genomes or genomic fragments of cell-associated viruses at Station ALOHA (100). Using these viral population genomes, I explored how viral assemblages co-varied with cellular assemblages, and how viral diversity, reproductive strategy, and temporal patterns varied with depth. The novel viral diversity captured and described in this chapter complements the current body of work on prokaryotic diversity in the NPSG (21,23,85).

Chapter III. This chapter builds on Chapter II by exploring both virioplankton and bacterioplankton-associated viruses at Station ALOHA. In November 2014, a DeLong laboratory Simons Foundation project, in collaboration with the HOT program, began monthly collection of both virioplankton and bacterioplankton samples from 5 – 4000 m for metagenomic analysis. This unprecedented sampling represented a unique opportunity to examine the spatiotemporal dynamics of viruses at Station ALOHA. Targeted sampling of virioplankton, in tandem with bacterioplankton, captured novel planktonic virion diversity and provided further insight into viral reproductive strategies. We chose to extract and sequence 374 metagenomes targeting the upper 500 m from 2014 – 16. From these samples, I recovered over 16 thousand viral populations (ALOHA 2.0 viral database) that catalogued the diversity of both free-living and cell-associated viruses in this environment (101). Using this ALOHA 2.0 viral database, I studied how virus and host assemblages, temporal variability, and viral reproductive and metabolic strategies varied through physicochemical gradients in the upper ocean. The novel viral diversity captured and described in this chapter builds on my preliminary work in Chapter II.

Chapter IV. This chapter represents one of the first metagenomic surveys of sinking particle-associated viruses at Station ALOHA. Since 1992, a deep-moored sediment trap has been deployed annually at 4000 m to collect sinking particles at Station ALOHA for biogeochemical analysis of export flux (12). To study microbial assemblages on sinking particles, sediment trap samples from 2014 – 16 were sequenced to generate 63 metagenomes spanning three years. From these samples, I curated a database of 857 deep trap viral populations and linked them to their cellular hosts recovered from the same samples. I explored how virus-host dynamics, viral diversity, and reproductive strategies varied on sinking particles collected across the three years, revealing differences from those observed in planktonic viruses. I also looked for evidence of vertical transport, using the ALOHA 2.0 viral dataset in Chapter III, to identify viruses that were exported from the upper ocean. The novel particle-attached viral diversity captured and described in this chapter complements my work on planktonic viruses in Chapters II and III.

Chapter V. This concluding chapter includes a synthesis of my dissertation work, an overview of background work that forms the foundation of the future directions I plan to pursue in my postdoctoral research, and a discussion of some key questions in this field moving forward. Metagenomic surveys to date are largely dependent on short-read sequencing and assembly methods to recover genomes. Short-read assemblies make it challenging to recover complete genomes from complex microbial assemblages, often resulting in genomic fragments that provide incomplete information about the organism. In the second section of this chapter, I describe a novel long-read sequencing approach from a collaborative project that allowed us to recover thousands of complete viral genomes from a single ocean sample (102). As evidenced by the sequencing revolution at the turn of this millennium, technological advances have preceded some major discoveries in this field, and this novel long-read sequencing technology could further unveil an ocean of unexplored microbial diversity.

Figures

Figure 1.1. Station ALOHA is located in the North Pacific Subtropical Gyre that covers most of the Pacific Ocean north of the equator.

Figure 1.2. Epifluorescence micrograph of SYBR gold-stained seawater samples from Station ALOHA. The SYBR gold-DNA stain causes viral-like particles (smaller dots in orange circle) and cell-like particles (larger dot in blue circle) to fluoresce, showing their approximate size and abundances.

Figure 1.3. Transmission electron micrographs of putative bacteriophages at Station ALOHA. Diverse morphotypes were observed including those resembling members of the families (a) Siphoviridae, (b) Myoviridae, (c) Podoviridae, and (d) Inoviridae. The black bar length indicates 10 nm.

Figure 1.4. Some viral reproductive strategies: Phages that enter the lytic cycle generate viral progeny and kill their hosts. Phages that lysogenize integrate their genome into the host genome as a prophage, leading to coexistence, during which time the phage replicates as an integral part of the host genome. Prophages can excise from the genome and enter the lytic cycle through a process called induction.

Figure 1.5. Viral reproductive strategies alternative to lysis and lysogeny: pseudolysogeny, chronic infection by filamentous phages, and chronic infection at the population level. Pseudolysogeny occurs when a phage neither integrates into the host genome as a prophage nor initiates a lytic cycle. Chronic infection by filamentous phages can result in continuous production of virion progenies without lysing host cells. Chronic infection by temperate phages that display asynchronous induction at the population level can result in continuous production of virion progenies without lysing entire host populations.

References

1. Karl DM, Lukas R. The Hawaii Ocean Time-series (HOT) program: Background, rationale and field implementation. Deep Res Part II. 1996;43(2–3):129–56.

2. Woodward FI. Global primary production. Curr Biol. 2007;17(8):269–73.

3. Berger WH. Global maps of ocean productivity. In: Productivity of the ocean: present and past. John Wiley and Sons. 1989. p. 429–55.

4. Martin JH, Knauer GA, Karl DM, Broenkow WW. VERTEX: carbon cycling in the northeast Pacific. Deep Res. 1987;34(2):267–85.

5. Volk T, Hoffert MI. Ocean carbon pumps. Geophys Monogr Ser. 1985;32:99–110.

6. Siegenthaler U, Sarmiento JL. Atmospheric carbon dioxide and the ocean. Nature. 1993;365(6442):119–25.

7. Sverdrup HU, Johnson MW, Fleming RH. The Oceans. Prentice-Hall Inc. 1946. 1087 p.

8. Michaels AF, Knap AH. Overview of the U.S. JGOFS Bermuda Atlantic Time-series study and the hydrostation S program. Deep Res Part II. 1996;43(2–3):157–98.

9. Anderson GC. Subsurface chlorophyll maximum in the Northeast Pacific ocean. Limnol Oceanogr. 1969;386–91.

10. Bingham FM, Lukas R. Seasonal cycles of temperature, salinity and dissolved oxygen observed in the Hawaii Ocean Time-series. Deep Res Part II. 1996;43(2–3):199–213.

11. Karl DM, Church MJ. Microbial oceanography and the Hawaii Ocean Time-series programme. Nat Rev Microbiol. 2014;12:1–15.

12. Karl DM, Church MJ, Dore JE, Letelier RM, Mahaffey C. Predictable and efficient carbon sequestration in the North Pacific Ocean supported by symbiotic nitrogen fixation. Proc Natl Acad Sci USA. 2012;109(6):1842–9.

13. Falkowski PG, Fenchel T, DeLong EF. The microbial engines that drive earth’s biogeochemical cycles. Science. 2008;320(5879):1034–9.

14. Soo RM, Hemp J, Parks DH, Fischer WW, Hugenholtz P. On the origin of oxygenic photosynthesis and Cyanobacteria. Science. 2017;355:1436–40.

15. Zehr JP, Kudela RM. Nitrogen cycle of the open ocean: from genes to ecosystems. Ann Rev Mar Sci. 2011;3:197–225.

16. Prieur D, Voytek M, Jeanthon C, Reysenbach AL. Deep-sea thermophilic prokaryotes. In: Thermophiles Biodiversity, Ecology, and Evolution. Springer. 2001. 226 p.

17. Pomeroy LR. The Ocean’s Food Web, A Changing Paradigm. Bioscience. 1974;25(9):499–504.

18. Raven JA. The twelfth Tansley Lecture. Small is beautiful: The picophytoplankton. Funct Ecol. 1998;12(4):503–13.

19. Cotner JB, Biddanda BA. Small players, large role: Microbial influence on biogeochemical processes in pelagic aquatic ecosystems. Ecosystems. 2002;5(2):105–21.

20. Giovannoni SJ. SAR11 bacteria: the most abundant plankton in the oceans. Ann Rev Mar Sci. 2017;9(1):231–55.

21. Mende DR, Boeuf D, DeLong EF. Persistent core populations shape the microbiome throughout the water column in the North Pacific Subtropical Gyre. Front Microbiol. 2019;10(October):1–12.

22. Rii YM, Karl DM, Church MJ. Temporal and vertical variability in picophytoplankton primary productivity in the North Pacific Subtropical Gyre. Mar Ecol Prog Ser. 2016;562:1–18.

23. Mende DR, Bryant JA, Aylward FO, Eppley JM, Nielsen T, Karl DM, et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat Microbiol. 2017;2(10):1367–73.

24. Allen LZ, Allen EE, Badger JH, McCrow JP, Paulsen IT, Elbourne LD, et al. Influence of nutrients and currents on the genomic composition of microbes across an upwelling mosaic. ISME J. 2012;6(7):1403–14.

25. Swan BK, Tupper B, Sczyrba A, Lauro FM, Martinez-Garcia M, González JM, et al. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proc Natl Acad Sci USA. 2013;110(28):11463–8.

26. Briggs NT, Dall’Olmo G, Claustre H. Major role of particle fragmentation in regulating biological sequestration of CO2 by the oceans. Science. 2020;367(6479):791–3.

27. Turner JT. Zooplankton fecal pellets, marine snow, phytodetritus and the ocean’s biological pump. Prog Oceanogr. 2015;130:205–48.

28. Boeuf D, Edwards BR, Eppley JM, Hu SK, Poff KE, Romano AE, et al. Biological composition and microbial dynamics of sinking particulate organic matter at abyssal depths in the oligotrophic open ocean. Proc Natl Acad Sci USA. 2019;116(24):11824–32.

29. Fontanez KM, Eppley JM, Samo TJ, Karl DM, DeLong EF. Microbial community structure and function on sinking particles in the North Pacific Subtropical Gyre. Front Microbiol. 2015;6:469.

30. Pelve EA, Fontanez KM, DeLong EF. Bacterial succession on sinking particles in the ocean’s interior. Front Microbiol. 2017;8:2669.

31. Berelson WM. Particle settling rates increase with depth in the ocean. Deep Res Part II. 2002;49:237–51.

32. Wigington CH, Sonderegger DL, Brussaard CPD, Buchan A, Finke JF, Fuhrman JA, et al. Re-examining the relationship between virus and microbial cell abundances in the global oceans. Nat Microbiol. 2016;1:15024.

33. Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, Chisholm SW. Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci USA. 2004;101(30):11013–8.

34. Rohwer F, Segall A, Steward G, Seguritan V, Breitbart M, Wolven F, et al. The complete genomic sequence of the marine phage Roseophage SIO1 shares homology with nonmarine phages. Limnol Oceanogr. 2000;45(2):408–18.

35. Garcia-Heredia I, Rodriguez-Valera F, Martin-Cuadrado A. Novel group of podovirus infecting the marine bacterium Alteromonas macleodii. Bacteriophage. 2013;3(2):e24766.

36. Zhao Y, Temperton B, Thrash JC, Schwalbach MS, Vergin KL, Landry ZC, et al. Abundant SAR11 viruses in the ocean. Nature. 2013;494:357–60.

37. Kang I, Oh H-M, Kang D, Cho J-C. Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans. Proc Natl Acad Sci USA. 2013;110(30):12343–8.

38. Brum JR. Concentration, production and turnover of viruses and dissolved DNA pools at Stn ALOHA, North Pacific Subtropical Gyre. Aquat Microb Ecol. 2005;41:103–13.

39. Brum JR, Schenck RO, Sullivan MB. Global morphological analysis of marine viruses shows minimal regional variation and dominance of non-tailed viruses. ISME J. 2013;7(9):1738–51.

40. Suttle CA. Viruses in the sea. Nature. 2005;437(7057):356–61.

41. Thompson LR, Zeng Q, Kelly L, Huang KH, Singer AU, Stubbe J, et al. Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc Natl Acad Sci USA. 2011;108(39):E757–64.

42. Breitbart M. Marine viruses: truth or dare. Ann Rev Mar Sci. 2012;4(1):425–48.

43. Hurwitz BL, U’Ren JM. Viral metabolic reprogramming in marine ecosystems. Curr Opin Microbiol. 2016;31:161–8.

44. Fuhrman JA. Marine viruses and their biogeochemical and ecological effects. Nature. 1999;399(6736):541–8.

45. Wilhelm SW, Suttle CA. Viruses and nutrient cycles in the sea. Bioscience. 1999;49(10):781–8.

46. Wommack KE, Colwell RR. Virioplankton: viruses in aquatic ecosystems. Microbiol Mol Biol Rev. 2000;64(1):69–114.

47. Noble RT, Fuhrman JA. Virus decay and its causes in coastal waters. Appl Environ Microbiol. 1997;63(1):77–83.

48. Suttle CA, Chen F. Mechanisms and rates of decay of marine viruses in seawater. Appl Environ Microbiol. 1992;58(11):3721–9.

49. Bongiorni L, Magagnini M, Armeni M, Noble R, Danovaro R. Viral production, decay rates, and life strategies along a trophic gradient in the North Adriatic Sea. Appl Environ Microbiol. 2005;71(11):6644–50.

50. Jacquet S, Miki T, Noble R, Peduzzi P, Wilhelm S. Viruses in aquatic ecosystems: Important advancements of the last 20 years and prospects for the future in the field of microbial oceanography and limnology. Adv Oceanogr Limnol. 2010;1(1):97–141.

51. Weinbauer MG. Ecology of prokaryotic viruses. FEMS Microbiol Rev. 2004;28(2):127–81.

52. Lara E, Boras JA, Gomes A, Borrull E, Teira E, Pernice MC, et al. Unveiling the role and life strategies of viruses from the surface to the dark ocean. Sci Adv. 2017;3(9):e1602565.

53. Gobler CJ, Hutchins DA, Fisher NS, Cosper EM, Sañudo-Wilhelmy SA. Release and bioavailability of C, N, P, Se, and Fe following viral lysis of a marine chrysophyte. Limnol Oceanogr. 1997;42(7):1492–504.

54. Middelboe M, Jørgensen NOG, Kroer N. Effects of viruses on nutrient turnover and growth efficiency of noninfected marine bacterioplankton. Appl Environ Microbiol. 1996;62(6):1991–7.

55. Proctor LM, Fuhrman JA. Roles of viral infection in organic particle flux. Mar Ecol Prog Ser. 1991;69:133–42.

56. Peduzzi P, Weinbauer MG. Effect of concentrating the virus-rich 2–200 nm size fraction of seawater on the formation of algal flocs (marine snow). Limnol Oceanogr. 1993;38(7):1562–5.

57. Shibata A, Kogure K, Koike I, Ohwada K. Formation of submicron colloidal particles from marine bacteria by viral infection. Mar Ecol Prog Ser. 1997;155:303–7.

58. Yamada Y, Tomaru Y, Fukuda H, Nagata T. Aggregate formation during the viral lysis of a marine diatom. Front Mar Sci. 2018;5:1–7.

59. Lawrence JE, Suttle CA. Effect of viral infection on sinking rates of Heterosigma akashiwo and its implications for bloom termination. Aquat Microb Ecol. 2004;37(1):1–7.

60. Echols H. Developmental pathways for the temperate phage: Lysis vs lysogeny. Annu Rev Genet. 1972;6(1):157–90.

61. Paul JH. Prophages in marine bacteria: dangerous molecular time bombs or the key to survival in the seas? ISME J. 2008;2(6):579–89.

62. Waldor MK, Mekalanos JJ. Lysogenic conversion by a filamentous phage encoding cholera toxin. Science. 1996;272(5270):1910–4.

63. Hobbs Z, Abedon ST. Diversity of phage infection types and associated terminology: the problem with “Lytic or lysogenic.” FEMS Microbiol Lett. 2016;363(7):1–8.

64. Cenens W, Makumi A, Mebrhatu MT, Lavigne R, Aertsen A. Phage–host interactions during pseudolysogeny. Bacteriophage. 2013;3(1):e25029.

65. Denhardt DT, Model P. The single-stranded DNA phages. Crit Rev Microbiol. 1975;4(2):161–223.

66. Morris RM, Cain KR, Hvorecny KL, Kollman JM. Lysogenic host–virus interactions in SAR11 marine bacteria. Nat Microbiol. 2020. doi:10.1038/s41564-020-0725-x.

67. Zobell CE. Marine microbiology; a monograph on hydrobacteriology. Chronica Botanica Company. 1946. 240 p.

68. Torrella F, Morita RY. Evidence by electron micrographs for a high incidence of bacteriophage particles in the waters of Yaquina Bay, Oregon: ecological and taxonomical implications. Appl Environ Microbiol. 1979;37(4):774–8.

69. Bergh Ø, Børsheim KY, Bratbak G, Heldal M. High abundance of viruses found in aquatic environments. Nature. 1989;340(6233):467–8.

70. Frank H, Moebus K. An electron microscopic study of bacteriophages from marine waters. Helgoländer Meeresunters. 1987;41(4):385–414.

71. Proctor LM, Fuhrman JA. Viral mortality of marine bacteria and cyanobacteria. Nature. 1990;343(6253):60–2.

72. Spencer R. A marine bacteriophage. Nature. 1955;175(4459):690–1.

73. Bratbak G, Egge JK, Heldal M. Viral mortality of the marine alga Emiliania huxleyi (Haptophyceae) and termination of algal blooms. Mar Ecol Prog Ser. 1993;93(1–2):39–48.

74. Fuhrman JA, Noble RT. Viruses and protists cause similar bacterial mortality in coastal seawater. Limnol Oceanogr. 1995;40(7):1236–42.

75. Suttle CA. The significance of viruses to mortality in aquatic microbial communities. Microb Ecol. 1994;28(2):237–43.

76. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci USA. 1985;82(20):6955–9.

77. Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol. 1985;39:321–46.

78. Wommack KE, Ravel J, Hill RT, Chun J, Colwell RR. Population dynamics of Chesapeake Bay virioplankton: total-community analysis by pulsed-field gel electrophoresis. Appl Environ Microbiol. 1999;65(1):231–40.

79. Steward GF, Montiel JL, Azam F. Genome size distributions indicate variability and similarities among marine viral assemblages from diverse environments. Limnol Oceanogr. 2000;45(8):1697–706.

80. Männistö RH, Kivelä HM, Paulin L, Bamford DH, Bamford JKH. The complete genome sequence of PM2, the first lipid-containing bacterial virus to be isolated. Virology. 1999;262(2):355–63.

81. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.

82. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, et al. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci USA. 2002;99(22):14250–5.

83. Breitbart M, Felts B, Kelley S, Mahaffy JM, Nulton J, Salamon P, et al. Diversity and population structure of a near-shore marine-sediment viral community. Proc R Soc B Biol Sci. 2004;271(1539):565–74.

84. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4(11):2121–31.

85. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311:496–503.

86. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. Expanding the marine virosphere using metagenomics. PLoS Genet. 2013;9(12):e1003987.

87. Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. Genomes of abundant and widespread viruses from the deep ocean. MBio. 2016;7(4):e00805-16.

88. Coutinho FH, Silveira CB, Gregoracci GB, Thompson CC, Edwards RA, Brussaard CPD, et al. Marine viruses discovered via metagenomics shed light on viral strategies throughout the oceans. Nat Commun. 2017;8:15955.

89. López-Pérez M, Haro-Moreno JM, Gonzalez-Serrano R, Parras-Moltó M, Rodriguez-Valera F. Genome diversity of marine phages recovered from Mediterranean metagenomes: Size matters. PLoS Genet. 2017;13(9):1–23.

90. Brum JR, Ignacio-Espinoza JC, Roux S, Doulcier G, Acinas SG, Alberti A, et al. Patterns and ecological drivers of ocean viral communities. Science. 2015;348(6237):1261498.

91. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and biogeochemical impacts of uncultivated globally abundant ocean viruses. Nature. 2016;537:689–93.

92. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536(7617):425–30.

93. Gregory AC, Zayed AA, Sunagawa S, Wincker P, Sullivan MB, Ferland J, et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell. 2019;177:1–15.

94. Steward GF, Culley AI, Mueller JA, Wood-Charlson EM, Belcaid M, Poisson G. Are we missing half of the viruses in the ocean? ISME J. 2013;7(3):672–9.

95. Brum JR, Steward GF, Karl DM. A novel method for the measurement of dissolved deoxyribonucleic acid in seawater. Limnol Oceanogr Methods. 2004;2:248–55.

96. Hewson I, Winget DM, Williamson KE, Fuhrman JA, Wommack KE. Viral and bacterial assemblage covariance in oligotrophic waters of the West Florida Shelf (Gulf of Mexico). J Mar Biol Assoc UK. 2006;86(03):591.

97. Hurwitz BL, Sullivan MB. The Pacific Ocean Virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS One. 2013;8(2):e57355.

98. Knowles B, Silveira CB, Bailey BA, Barott K, Cantu VA, Cobián-Güemes AG, et al. Lytic to temperate switching of viral communities. Nature. 2016;531(7595):533–7.

99. Thingstad TF, Lignell R. Theoretical models for the control of bacterial growth rate, abundance, diversity and carbon demand. Aquat Microb Ecol. 1997;13:19–27.

100. Luo E, Aylward FO, Mende DR, DeLong EF. Bacteriophage distributions and temporal variability in the ocean’s interior. MBio. 2017;8(6):e01903-17.

101. Luo E, Eppley JM, Romano AE, Mende DR, DeLong EF. Double-stranded DNA virioplankton dynamics and reproductive strategies in the oligotrophic open ocean water column. ISME J. 2020;14:1304–1315.

102. Beaulaurier J, Luo E, Eppley JM, Uyl P Den, Dai X, Burger A, et al. Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities. Genome Res. 2020;30(3):437–46.

CHAPTER 2. BACTERIOPHAGE DISTRIBUTIONS AND TEMPORAL VARIABILITY IN THE OCEAN’S INTERIOR

Published 2017 in mBio 8(6): e01903-17. DOI: 10.1128/mBio.01903-17

Authors: Elaine Luo, Frank Aylward, Daniel Mende, Edward DeLong

Abstract

Bacteriophages represent the most abundant DNA-containing entities in the oligotrophic ocean, yet how specific phage populations vary over time and space is not well understood. Here, we conducted a metagenomic time-series survey of double-stranded DNA phages throughout the water column in the North Pacific Subtropical Gyre, encompassing 1.5 years from 25–1000 m. Viral gene sequences were identified in assembled metagenomic samples, yielding 172,385 viral gene families. Viral marker gene distributions indicated that lysogeny was more prevalent at mesopelagic depths than in surface waters, consistent with results from prior prophage induction studies using mitomycin C. A total of 129 ALOHA viral genomes and genome fragments from 20–108 kbp were recovered, representing some of the most abundant phages in the water column. Phage genotypes displayed discrete structures. Most phages persisted throughout the time-series and displayed a strong depth structure that mirrored the stratified depth distributions of co-occurring bacterial taxa in the water column. Mesopelagic phages were distinct from surface water phages with respect to their diversity, gene content, putative life histories and temporal persistence, reflecting depth-dependent differences in host genomic architectures and phage reproductive strategies. The spatiotemporal distributions of the most abundant open ocean bacteriophages we report here provide new insight into viral temporal persistence, life-history, and virus-host-environment interactions throughout the open ocean water column.

Importance

The North Pacific Subtropical Gyre represents one of the largest biomes on the planet, where microbial communities are central mediators of ecosystem dynamics and global biogeochemical cycles. Critical members of these communities are the viruses of marine bacteria, which can alter microbial metabolism and significantly influence their hosts’ survival and productivity. To better understand these viral assemblages, we conducted genomic analyses of planktonic viruses over a seasonal cycle to ocean depths of 1000 m. We identified 172,385 different viral gene families and 129 unique virus genotypes in this open ocean setting. The spatiotemporal distributions of the most abundant open ocean viruses we report here provide new insights into viral temporal variability, life-history, and virus-host-environment interactions throughout the water column.

Introduction

Viruses are abundant biotic entities that play critical roles in aquatic environments. Some of the most common among these viruses in the open ocean are dsDNA bacteriophages (phages) that infect many abundant and biogeochemically important groups of bacterioplankton, such as Prochlorococcus, Synechococcus, and numerous heterotrophic bacterial species in common genera such as Roseobacter, Alteromonas, Pelagibacter, and Puniceispirillum (Wilson et al, 1993; Rohwer et al, 2000; Lindell et al, 2005; Zhao et al, 2013; Kang et al, 2013; Garcia-Heredia et al, 2013). Phages have been shown to kill hosts at rates of up to 20–40% of the total population per day, potentially strongly impacting bacterioplankton populations (Weinbauer, 2004; Weinbauer & Rassoulzadegan, 2004). In addition, carbon flux through phage biomass is estimated to be as high as 10% of carbon fixation in the ocean, playing a substantial role in the global carbon cycle (Puxty et al, 2016). Furthermore, phages can influence ocean biochemistry through microbial cell lysis, which leads to production of dissolved organic matter (DOM), and via auxiliary metabolic genes (AMGs) that alter the carbon, sulfur, and nitrogen metabolism of their hosts during the phage replication cycle (Thompson et al, 2011; Breitbart, 2012; Hurwitz, 2016). Advancing fundamental knowledge of marine phages is therefore an important step towards developing a deeper understanding of marine ecosystem dynamics.

Phages have critical roles in microbial ecology and biogeochemistry of the global ocean due to their tremendous abundance and diversity. While the genotypic diversity of marine phages has historically been difficult to ascertain, recent studies have provided new insights into viral genomic diversity in the oceans. Developments in high-throughput DNA sequencing have enabled the exploration of viral diversity in the environment at unprecedented scales (Mizuno et al, 2013; Brum et al, 2015; Roux et al, 2016; Mizuno et al, 2016). These frequent reports of large reservoirs of viral genetic diversity highlight the importance of further work using reference-independent metagenomic techniques for in situ characterization of marine phages.

The majority of published marine viral metagenomic surveys to date have focused on cataloguing the genomic diversity and geographic variability in surface water samples (Mizuno et al, 2013; Brum et al, 2015; Roux et al, 2016). The vertical and temporal distributions of environmental phage assemblages have received relatively less attention. To our knowledge, only two metagenomic studies have reported on phages recovered from deep-sea planktonic samples. The Pacific Ocean Virome dataset included 12 samples from the deep Pacific, revealing that aphotic zone viromes contained a unique set of AMGs that distinguish them from photic zone viromes (Hurwitz et al, 2015), while the structure of 99 genomic fragments of bathypelagic phages has also been reported from the Mediterranean Sea (Mizuno et al, 2016). These studies suggested that deep-ocean phages are largely novel and distinct from previously characterized surface phages, highlighting the need to explore the vast diversity of uncharacterized phages below the surface ocean. With respect to temporal variability, previous studies have focused on daily, weekly or annual scales in surface water (Waterbury & Valois, 1993; Pagarete et al, 2013; Chow et al, 2014; Goldsmith et al, 2015; Brum et al, 2015; Aylward et al, submitted for publication), but metagenomic studies that couple depth profiles with time-series sampling are needed to provide context for the spatiotemporal variations observed in marine phage dynamics.

Coupled studies of viral dynamics in both space and time at well-defined sample sites have the potential to provide the environmental context for interpreting broader patterns and consequences of viral diversity. Here, we present a metagenomic depth profile time-series of phages captured in cellular bacterioplankton fractions from depths of 25–1000 m over 1.5 years. We used two approaches to explore how phages vary through depth and time in the North Pacific Subtropical Gyre (NPSG), an oligotrophic habitat that is representative of the largest biome in the world (Karl & Church, 2014). Using a genome-centric approach, we describe genomic fragments of abundant phage populations at Station ALOHA. Through implementation of a multi-step re-assembly workflow, we reconstructed viral population genomes to describe how the diversity, distribution, and genetic repertoire of phages vary through depth and time. For the second approach, we used a gene-centric methodology similar to that previously reported for prokaryotic assemblages at Station ALOHA (Mende et al, 2017). For this approach we leveraged a non-redundant gene catalogue constructed from Station ALOHA to analyze the vertical distribution of phage genes and examine how the diversity and functional repertoire of phages vary across depth profiles. We also used marker genes to explore how viral life-history strategies shift through the ocean’s water column. Our analyses characterize viral gene distributions, genotypes, and temporal dynamics across a range of depths and provide important insight into the genomic diversity and dynamics of viral assemblages in the open ocean.

Results and discussion

In this study, we characterized viral genotypic diversity at Station ALOHA from 25–1000 m over 1.5 years using metagenomic data, employing both genome- and gene-centric approaches (Fig. S2.2). The genome-centric approach generated 129 viral contigs between 20 and 108 kbp in length, representing genomes or large genomic fragments of abundant phages at Station ALOHA. These were used to characterize the distributions of dominant viral populations and provide genomic insights into the ecology of specific phage groupings. In addition, the gene-centric approach, applied to all assembled Station ALOHA contigs (Mende et al, 2017), captured a wide range of viral diversity from 177,713 non-redundant viral genes, which facilitated broader quantitative analyses of phage gene distributions in space and time, providing insight into the relationships between viral life history, environmental variability, and host variability.

Novel ALOHA viral genomes, genome fragments, and AMGs. Our conservative viral genome assembly strategy yielded many novel viral populations different from previously characterized viruses (Fig. 2.1). Of the 129 ALOHA viral contigs, 79 shared little or no sequence homology with any known phages or previously sequenced viral metagenomes. Of the 50 Station ALOHA viral contigs having some database homologues, 10 were related to known phage genomes in RefSeq75 (all cyanophages, Table S2.2), while 40 shared homology to phages in one of three previously sequenced viral metagenomes. These data suggested that at least some of the most abundant phage groups we found were widespread across ocean basins.

Mesopelagic phages appeared to be under-sampled in current databases and were distinct from surface water phages. None of the ALOHA viral contigs from 770 m or deeper were similar to any previously reported phages in existing databases. Gene mapping revealed that some ALOHA viral contigs shared conserved genomic structure with known phages in reference databases, despite low amino acid similarity (Fig. S2.3a-e), supporting the existence of evolutionary constraints on gene order even among distantly related phages.

Among viral contigs of all lengths, we identified 37 genes, located on the 625 contigs that also carried phage structural genes, with functions not previously characterized in marine phages (Table S2.3). These included 19 putative AMGs that provide insight into how phages can manipulate host metabolism. Some of these genes have putative functions in antibiotic synthesis (carbamoyltransferase C-terminus, myo-inositol-1-phosphate synthase), antimicrobial resistance (dolichyl-phosphate-mannose-protein mannosyltransferase), antitoxin synthesis (antitoxin of toxin-antitoxin system), antigen synthesis (P83/100), transporters (sodium bile acid symporter), and superinfection immunity. These new phage-associated genes further expand our current knowledge of gene content in naturally occurring phages. Moreover, certain AMGs were only found in ALOHA viral contigs dominant at specific depths, such as myo-inositol-1-phosphate synthase (25 & 75 m), dolichyl-phosphate-mannose-protein mannosyltransferase (200 m), sodium bile acid symporter (200 m), P83/100 antigen proteins (200 m), and superinfection immunity protein (sporadic, 770 & 1000 m). The functional roles of these depth-specific AMGs provide new insight into phage-host interactions in the open ocean water column.

Phage genotype distributions in the Station ALOHA time series depth profile. Most ALOHA viral contigs reached peak abundances at a single depth and could be broadly categorized into one of five groups based on abundance profiles: a surface group dominating 25–75 m, a deep chlorophyll maximum (DCM) group at 125 m, a 200 m group, a deeper mesopelagic group from 500–1000 m, and a sporadic group of more temporally variable phages (Fig. 2.2). We found no evidence of eurybathic phages in our data. Instead, the vertical distribution of phage contigs appeared to reflect the depth-stratified distributions of their potential bacterial and archaeal hosts (Mende et al, 2017). Overall, these results suggest that many dominant phage groups at Station ALOHA may have narrow host ranges.

This depth structure in ALOHA viral contig distributions was also reflected in our gene-centric approach using the ALOHA gene catalogue (Mende et al, 2017). Comparison of sample clustering patterns based on phage- versus bacterioplankton-specific genes indicated that viral gene distributions mirrored community-wide trends of their potential bacterioplankton hosts. Depth clustering patterns for phage-specific and cellular (non-phage) genes were highly similar to those from the genome-centric approach above (Fig. 2.3). Phage and cell-associated gene distribution dendrograms were conserved across these depth clusters over time. This suggests that the overall pattern of viral assemblage gene distributions reflected that of the cellular host community.

The pronounced spatial differences along the depth gradient were accompanied by depth-stratified differences in temporal variability of ALOHA viral contigs, with persistent phages dominating surface waters and more episodic phages dominating mesopelagic depths. Most phages displayed no clear trends of seasonality or shuffling of dominant phage groups, with the exception of a small group of phages in the sporadic group (Fig. 2.2). In the persistent groups (surface, DCM, 200 m, mesopelagic), phage populations displayed remarkable temporal stability throughout the 1.5-year sampling period, similar to that previously reported for surface water viruses over shorter daily time scales (Aylward et al, in press). Additionally, four phages in our persistent group, captured from 2010-2011, displayed homology to phages found in a 2015 diel phage study at 15 m in the NPSG (Aylward et al, in press), demonstrating that these populations were consistently present over multiple years (Table S2.4). In contrast to the patterns observed in surface waters, phages in the mesopelagic ocean exhibited a more sporadic occurrence characteristic of boom-and-bust dynamics. Several ALOHA viral contigs were highly abundant at only a few time points but virtually absent at other times, as shown by the relative abundances of phages at 25 and 1000 m (Fig. 2.4a). To confirm that the greater temporal variability at 1000 m was not attributable to the lower number of viral contigs detected, we compared mean-normalized variances, which showed that mesopelagic phages were indeed more temporally variable than surface water phages (Fig. 2.4b).

Genomic trends in lytic vs. lysogenic viral life-history. The phage depth distributions were also reflected in encoded viral life history traits. Using prophage marker genes as indicators, we found that the genomic potential for lysogeny increased below the DCM, which hovers around 90–130 m throughout the year (Karl & Church, 2014). All three prophage markers (integrase, CI repressor, and excisionase) showed congruent proportional increases with depth below 125 m (Fig. 2.5a-c). Collectively, the data suggested that the proportion of phages capable of lysogeny was ~5 times higher in the mesopelagic than in the surface ocean. Furthermore, the average number of prophage markers per cellular genome increased below 125 m, suggesting that the average number of integrated prophages per cell also increased with depth (Fig. S2.4a-c). The copy number of integrases appeared suspiciously high relative to other prophage and phage markers, suggesting either contamination from mobile genomic elements or that these gene families are more highly conserved and therefore more easily detected using homology-based methods. Despite the possibility of false positives in using integrase as a prophage marker, the overall increased abundance of other prophage markers provides evidence for increased prevalence of lysogeny in the mesopelagic ocean. Lastly, the average copy number of phage markers per cellular genome decreased from 1–5 at 125 m to 0.45, 0.31, 0.52, and 0.04 at 1000 m for DNA polymerase, terminase, capsid, and tail, respectively (Fig. S2.4d-g). The strikingly similar decreases across all four phage markers suggest that the prevalence of actively replicating lytic phages within cells decreased greatly at mesopelagic depths. The potential inclusion of free phages adhered to particles in our samples, in addition to assumed intracellular phages, is unlikely to substantially affect these results because larger particles (>1.6 µm) were removed by pre-filtration before bacterioplankton collection. Taken together, these marker gene data suggest a shift from the euphotic to the mesopelagic zone toward increased lysogeny and overall decreased viral particle abundance per host.

Viral life-history strategies are important to consider for both phage and host ecology. For the host, a lytic cycle results in cell death and release of cellular material into the environment, while a lysogenic cycle does not immediately kill the host but imposes the cost of carrying and reproducing foreign genetic material. For the phage, a lytic cycle means a rapid increase in short-term fitness when a host is productive enough to support phage production, while lysogeny may represent an opportunity cost in reproduction but increased chances of survival. Characterizing which environments favor a given viral life-history strategy is important to our consideration of virus-host interactions and their resulting biogeochemical effects throughout the ocean’s interior.

Different theories exist to explain the ecological factors that may influence the proportion of lytic versus lysogenic phages (Wigington et al, 2016; Knowles et al, 2016). For example, the “Piggyback-the-Winner” hypothesis (Knowles et al, 2016), based on cell-particle abundance patterns, predicts that lysogeny should be more prevalent at higher host cell densities. Our data do not directly support this prediction, since we found evidence for more lysogenic phages in deep waters characterized by lower cell densities (data available at http://hahana.soest.hawaii.edu/hot/hot-dogs/), consistent with earlier results of Weinbauer et al. (2003).

Lysogeny might also be advantageous in the deep ocean due to low host productivity that may constrain lytic phage replication (Middelboe, 2000; Moebus, 1996). The results reported here are consistent with a prior prophage induction experiment that showed greater potential for lysogeny in deep versus surface waters in the Mediterranean and Baltic Seas (Weinbauer et al, 2003). Other field observations have found increased lysogeny in low-productivity environments: in oligotrophic rather than coastal oceans (Jiang & Paul, 1997; Weinbauer & Suttle, 1999) and in winter rather than spring (Williamson et al, 2002; Brum et al, 2016). Although this trend is not consistent across all studies (e.g., Laybourn-Parry et al, 2007; Payet & Suttle, 2013), productivity appears to be a major correlate of phage-host interactions in some environments. At Station ALOHA, productivity declines sharply past 125 m near the DCM (Karl & Church, 2014), coinciding with the sharp transition from lytic to lysogenic phage-host interaction in our marker gene profiles. Our study provides a genomic perspective and validation for induction experiments, as well as a molecular explanation for a higher proportion of lysogeny and decreased viral particles per host in the deep ocean. Our results also suggest that phage-induced mortality may be higher in more productive surface waters, shifting to more temperate phage-host interactions in the mesopelagic open ocean.

Ecology of surface and mesopelagic phage. Overall, the most abundant surface phage populations were well-represented in the 0.2–1.6 µm size fraction throughout our sampling period (Fig. 2.2; Fig. 2.4). This temporal persistence is intriguing, since dominant phages are expected to follow boom-and-bust dynamics in a negative density-dependent manner, according to some ecological models such as Kill-the-Winner or the fluctuating selection dynamics model (reviewed in Avrani et al, 2012). On the other hand, phage resistance mechanisms allow coastal marine Synechococcus populations to co-exist with abundant cyanophages over seasonal cycles (Waterbury & Valois, 1993), and persistent viral types have been observed in other viral time-series analyses (Marston & Sallee, 2003; Chow et al, 2013; Pagarete et al, 2013; Goldsmith et al, 2015). Moreover, the stability of phages in surface waters may reflect the overall stability of prokaryotic assemblages in the NPSG (Bryant et al, 2015). A combination of these factors might contribute to the temporal stability of many different host-phage pairs, particularly those persistent in ocean surface waters.

The most temporally variable phage groups in our analyses occurred in the mesopelagic, where specific phages dominate at certain time points only to disappear the next month (Fig. 2.2, 2.4). This sporadic nature corresponds to high temporal variability in known phage and bacterial genes observed in the mesopelagic ocean, particularly at 1000 m (Fig. S2.5a,b). Compared to the surface ocean, where sunlight drives consistently high productivity, productivity in the deep ocean is limited by the sporadic rain of organic material exported from the surface (Karl & Church, 2014). Temporally variable resources might select for particle-attached bacteria (DeLong et al, 2006) that grow quickly when resources become available. These bursts of growth can lead to temporally sporadic phage distributions through prophage propagation with fast-growing hosts, prophage induction, or particularly successful lytic infections that terminate host blooms.

In this ecological landscape, we found one novel gene of interest that is specific to a mesopelagic sporadic phage: a superinfection immunity gene encoding a membrane-attached protein that confers host immunity to other phages (Lu & Henning, 1989) and has not been previously observed in marine phages. During our sampling period, the associated contig (Fig. S2.3e) followed the distribution of Vibrio (Fig. S2.5b), which are known to be particle-attached bacteria found in the deep ocean at Station ALOHA (Fontanez et al, 2015). This putative lytic Vibriophage is remarkably abundant in an environment where lysogeny is favored. By conferring host immunity to competitor phages, this gene could contribute to the success of this putative lytic phage and the subsequent boom-and-bust dynamics of the phage and its particle-attached host in the mesopelagic ocean. It is worth noting here that such genomic characterization of phage populations in situ may be prone to generating false positives in functional capacity (Enault et al, 2017). Experimental verification of the function of encoded phage genes will more reliably elucidate the genomic capacity of previously uncharacterized mesopelagic phages and the resulting phage-host interactions.

Conclusions

In summary, our time series study in the NPSG leveraged both gene- and genome-centric approaches to provide insight into phage diversity, structure, and function across depth and time. We found that mesopelagic phages were distinct from surface phages, and were largely novel and underrepresented in current viral reference and metagenomic databases. With respect to depth variability, discrete phage populations displayed strong depth structure, similar to that of putative bacterial hosts. There were virtually no eurybathic phages, suggesting that most dominant phage groups were adapted to relatively narrow host ranges. We also found unique AMGs in mesopelagic phages suggestive of depth-specific adaptations to a more variable landscape of phage-host interactions in the aphotic zone. With respect to temporal variability, the most abundant phage groups were remarkably persistent, displaying little to no seasonality or observable shuffling of dominant groups. With respect to coupled variability through space and time, mesopelagic phages were more sporadic in their distribution. Considering viral life-history, we found five times more genes associated with lysogeny going down depth profiles, suggesting a sharp increase in lysogeny at and below the DCM. Our observations, in addition to other recent studies (Mizuno et al, 2013; Brum et al, 2015; Hurwitz et al, 2015; Roux et al, 2016; Mizuno et al, 2016), expand the realm of characterized phages from the surface to the mesopelagic ocean.

Materials and Methods

Study site and sample collection. Bacterioplankton samples in the 0.2–1.6 µm size fraction were collected from 7 depths (25, 75, 125, 200, 500, 770, and 1000 m) on 12 occasions in 2010-2011 at Station ALOHA (22°45’ N, 158° W) in the North Pacific Subtropical Gyre (NPSG). Detailed sample collection has been previously described (Mende et al, 2017). As the study site of the Hawaii Ocean Time-series (HOT) program, Station ALOHA is one of the most sampled open-ocean systems in the world, with well-characterized biogeochemical gradients that provide context to our work (Karl & Church, 2014; examples in Fig. S2.1).

Station ALOHA metagenomic assembly and gene catalogue. The methods for DNA extraction, library construction and sequencing with the Illumina NextSeq and MiSeq platforms of these bacterioplankton samples have been previously described in detail (Mende et al, 2017). Briefly, metagenomic reads from each sample were assembled using MIRA v4.9.5_2, providing the basis for our genome-centric analyses. In total, 40 million genes were predicted from all assemblies using Prodigal v2.6.0 (Hyatt et al, 2010) and clustered using CD-HIT v2.6 (Li et al, 2001) to generate 8,966,703 non-redundant gene clusters. These non-redundant genes are hereafter referred to as the ALOHA gene catalogue, providing the basis for our gene-centric analyses. The relative abundance of each non-redundant gene in each sample was calculated by mapping reads using the BWA-MEM algorithm v0.7.15 (Li, 2013) with default parameters, and then dividing the resulting coverage by the total coverage of all genes (Mende et al, 2017). Sequence data are available at NCBI under Bioproject no. PRJNA352737, and at https://imicrobe.us/project/view/263.
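The relative abundance calculation described above (per-gene coverage divided by the total coverage of all genes in a sample) can be sketched as follows. This is only a minimal illustration, not the pipeline used in this study: the function name, the dictionary layout, and the example coverage values are all hypothetical, and the per-gene coverages are assumed to have already been derived from the read mapping.

```python
def relative_abundance(gene_coverage):
    """Divide each gene's coverage by the summed coverage of all genes in one sample.

    gene_coverage: dict mapping gene identifier -> coverage (hypothetical layout).
    """
    total = sum(gene_coverage.values())
    if total == 0:
        raise ValueError("sample has no mapped coverage")
    return {gene: cov / total for gene, cov in gene_coverage.items()}


# Made-up coverage values for a single sample, for illustration only.
sample = {"gene_A": 30.0, "gene_B": 10.0, "gene_C": 0.0}
abund = relative_abundance(sample)
```

Because every sample is scaled by its own total, the resulting values sum to one within a sample and can be compared across samples of different sequencing depth.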

Genome-centric approach: Assembling ALOHA viral contigs. Station ALOHA metagenomic assemblies were used to identify viral contigs (Fig. S2.2) using VIRSorter v1.0.3 with a 3 kbp cut-off for improved recall (Roux et al, 2015). Sequence reads that were used to assemble these contigs were pooled across all samples and reassembled using default parameters in metaSPAdes (Bankevich et al, 2012). The resulting 104,732 contigs were validated through a second VIRSorter screen, in which 917 viral contigs 1.7–108 kbp in length from all VIRSorter categories were retained. Most cellular contamination was removed at this stage, as the estimated cellular genomic completeness was only 1.2%, with only 35 out of 16,296 genes mapping to single copy prokaryotic marker genes using the standard Anvi’o v.2.1.1 workflow (Rinke et al, 2013; Campbell et al, 2013; Eren et al, 2015). Of these 35 hits, 30 were restricted to recA and DNA helicase, which could be phage homologs of recombination and replication genes. One contig found to contain a bacterial ribosomal marker gene was removed from our analyses.

Given that 20 kbp represents the lower end of genome size in DNA phages from marine systems (Steward et al, 2000), we retained 142 contigs ≥20 kbp in size to focus on near-complete genomes or large genomic fragments. As a final quality-control step, we removed contigs that did not contain any genes with distant homology to known phage structural proteins (PFAM bit score >10 to terminase, portal, capsid, tail, base plate, spike, neck, head genes). This step was conservative and eliminated 13 putative phage genomes from our analyses, 5 of which displayed significant homology to previously sequenced phages (identification described below). The resulting contigs represent a high-confidence subset of total phage diversity. We checked again for cellular contamination using estimated cellular genomic completeness, which was reduced to 0.6%. Only 12 out of a total of 5,877 putative phage genes mapped to bacterial or archaeal marker genes. Eleven of these hits were restricted to recA and DNA helicase, and none were ribosomal cell marker genes. Overall, these quality-control steps successfully removed any detectable cellular genome contaminants, generating 129 high-confidence ALOHA viral contigs ranging between 20 and 108 kbp in length. Assembly and reassembly statistics are shown in Table S2.1.
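The two final retention criteria described above (length of at least 20 kbp, and at least one gene with a PFAM bit score above 10 to a phage structural protein) could be expressed as a simple filter. The sketch below is illustrative only: the `Contig` class, the keyword matching on annotation text, and the example contigs are assumptions introduced here, not the code or data used in this study.

```python
from dataclasses import dataclass, field

MIN_LENGTH_BP = 20_000   # contigs of at least 20 kbp are retained
MIN_BIT_SCORE = 10.0     # PFAM bit score must exceed 10
# Keyword matching on annotation text is an assumption made for this sketch.
STRUCTURAL_KEYWORDS = ("terminase", "portal", "capsid", "tail",
                       "base plate", "spike", "neck", "head")


@dataclass
class Contig:
    name: str
    length_bp: int
    # (PFAM annotation text, bit score) per annotated gene; hypothetical layout
    pfam_hits: list = field(default_factory=list)


def has_structural_gene(contig):
    """True if any gene hits a phage structural protein family above the cutoff."""
    return any(
        score > MIN_BIT_SCORE
        and any(keyword in annotation.lower() for keyword in STRUCTURAL_KEYWORDS)
        for annotation, score in contig.pfam_hits
    )


def high_confidence_contigs(contigs):
    """Keep contigs that pass both the length and structural-gene criteria."""
    return [c for c in contigs
            if c.length_bp >= MIN_LENGTH_BP and has_structural_gene(c)]


# Made-up contigs for illustration only.
contigs = [
    Contig("contig_1", 45_000, [("Phage terminase large subunit", 88.2)]),
    Contig("contig_2", 12_000, [("Phage capsid family", 40.1)]),   # too short
    Contig("contig_3", 30_000, [("ABC transporter", 55.0)]),       # no structural gene
]
kept = [c.name for c in high_confidence_contigs(contigs)]
```

Applying the length filter before the structural-gene filter mirrors the order of the text (142 contigs after the length cut, 129 after removing the 13 without structural gene hits), although the two conditions are independent and could be applied in either order.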

Genome-centric approach: Annotating ALOHA viral contigs. Contigs were annotated with a combined viral database built from predicted proteins in assembled sequences from phages in RefSeq v75 (O’Leary et al, 2016) and four viral metagenomes: uvMED (Mizuno et al, 2013), uvDeep (Mizuno et al, 2016), GOV (Roux et al, 2016), and EV (Paez-Espino et al, 2016). Genes were predicted using Prodigal v2.6.3 (Hyatt et al, 2010). Contigs were identified using LAST (Kiełbasa et al, 2011) with cutoffs of >50% of genes hit to a reference with >60% average amino acid identity. To calculate the proportion of genomes assigned to each category in a given sample, contigs were normalized by their relative abundance of nucleotides mapped.
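The identification cutoffs above (more than 50% of a contig's genes hitting a reference, with more than 60% average amino acid identity) can be illustrated with a short sketch. Several details here are assumptions rather than the method used in this study: the input layout, the convention of one identity value per hit gene, averaging identity over hit genes only, and breaking ties between qualifying references by gene fraction and then identity.

```python
from statistics import mean


def assign_reference(n_genes, hits_by_reference,
                     min_gene_fraction=0.5, min_mean_aai=60.0):
    """Return the best qualifying reference for one contig, or None.

    n_genes: number of predicted genes on the contig.
    hits_by_reference: dict mapping reference name -> list of percent amino
        acid identities, one per contig gene with a hit (hypothetical layout).
    """
    best = None
    for reference, identities in hits_by_reference.items():
        if not identities:
            continue
        fraction = len(identities) / n_genes
        average_identity = mean(identities)
        if fraction > min_gene_fraction and average_identity > min_mean_aai:
            key = (fraction, average_identity)
            if best is None or key > best[0]:
                best = (key, reference)
    return None if best is None else best[1]


# Made-up hits for a 10-gene contig, for illustration only.
hits = {
    "phage_X": [70.0] * 6,   # 60% of genes hit, 70% mean identity: qualifies
    "phage_Y": [90.0] * 4,   # only 40% of genes hit: fails the gene cutoff
}
assigned = assign_reference(10, hits)
unassigned = assign_reference(10, {"phage_Z": [55.0] * 8})  # identity too low
```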

Genome-centric approach: Depth and temporal distributions of ALOHA viral contigs. For reference-independent visualization of spatiotemporal distributions, we used BWA-MEM v0.7.15 (Li, 2013) to map all reads from each sample to ALOHA viral contigs and generated coverage profiles using Anvi’o v.2.1.1 (Eren et al, 2015). To prevent conserved phage genes from inflating coverage, only the second and third coverage value quartiles across each contig were used to generate mean coverage profiles. ALOHA viral contigs were manually binned by differential coverage into five distribution groups. To examine temporal variability within each depth bin, we used the proportion of nucleotides mapped as a metric for relative abundance and plotted the 13 most abundant phages through time within each depth bin. To examine whether temporal variability in phage assemblages changes with depth, we calculated the mean-normalized variance of the relative abundance of each ALOHA viral contig within all depth bins.
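Two of the summary statistics described above lend themselves to a brief sketch: a mean coverage restricted to the second and third quartiles of per-position coverage values, and a mean-normalized variance of relative abundance through time. The code below is a hedged illustration with made-up numbers. In particular, the text does not give the exact formula for the mean-normalized variance, so the definition used here (sample variance divided by the mean) is an assumption; dividing by the squared mean would be an equally plausible reading.

```python
from statistics import mean, variance


def interquartile_mean_coverage(per_base_coverage):
    """Mean of the coverage values lying in the second and third quartiles."""
    ordered = sorted(per_base_coverage)
    n = len(ordered)
    lower, upper = n // 4, n - n // 4   # drop the lowest and highest quarters
    return mean(ordered[lower:upper])


def mean_normalized_variance(abundances):
    """Sample variance across time points divided by the mean (assumed definition)."""
    average = mean(abundances)
    if average == 0:
        return 0.0
    return variance(abundances) / average


# Made-up per-position coverage with a highly covered (conserved) region.
coverage = [1, 2, 10, 10, 12, 12, 50, 500]
iq_mean = interquartile_mean_coverage(coverage)

# Made-up relative abundances over four time points with identical means.
persistent = [2.0, 2.0, 2.0, 2.0]
sporadic = [0.0, 0.0, 0.0, 8.0]
```

In this toy example the trimmed mean ignores the extreme values 50 and 500, and the two abundance series share the same mean but differ sharply in normalized variance, which is the contrast the metric is meant to capture between persistent and sporadic contigs.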

Genome-centric approach: Mapping to reference genomes. To examine the genomic structure of our ALOHA viral contigs, we generated genome maps of some of the most abundant or complete ALOHA viral contigs in each of five depth distribution groups. Phage genomes in the RefSeq v75 database that were most similar to each ALOHA viral contig were included to visualize structural similarities using the GenomeDiagram module v0.2 in Python (Pritchard et al, 2006).

Genome-centric approach: Detection of novel phage genes and AMGs. To detect AMGs, we predicted and annotated 16 286 genes from 917 reassembled viral contigs (all lengths) and compared these genes to existing databases of marine phage genes including AMGs (Hurwitz & Sullivan, 2013; Roux et al, 2016). Genes were considered novel AMGs if they were both annotated with a PFAM bit score >30 to a function not present in these databases and located on the subset of 625 contigs with viral structural genes (PFAM bit score >10 to terminase, portal, capsid, tail, base plate, spike, neck, or head).
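The two-part criterion above can be expressed as a simple filter. The sketch below is hypothetical: the annotation record layout, the keyword-based test for structural genes, and the function names are assumptions for illustration, and it returns only candidate novel genes; deciding which candidates are auxiliary metabolic genes still requires inspecting their functions.

```python
# Structural gene categories named in the text, matched here by substring (assumed rule).
STRUCTURAL_KEYWORDS = ("terminase", "portal", "capsid", "tail",
                       "base plate", "spike", "neck", "head")


def structural_contigs(annotations, min_bits=10.0):
    """Contigs carrying at least one gene with a PFAM hit (> min_bits) to a structural protein."""
    return {
        a["contig"]
        for a in annotations
        if a["bits"] > min_bits
        and any(word in a["pfam_desc"].lower() for word in STRUCTURAL_KEYWORDS)
    }


def candidate_novel_genes(annotations, known_pfams, min_bits=30.0):
    """Genes with a confident PFAM hit absent from known phage databases, on structural contigs."""
    viral = structural_contigs(annotations)
    return [
        a["gene"]
        for a in annotations
        if a["bits"] > min_bits
        and a["pfam"] not in known_pfams
        and a["contig"] in viral
    ]
```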

Gene-centric approach: Phage and cell-associated gene catalogue assemblages. We clustered samples using the ALOHA gene catalogue to examine how closely patterns of spatiotemporal diversity in phage assemblages mirrored those of broader bacterioplankton communities. We separated genes into two groups of interest: a phage group of 177,713 viral genes, and a cell-associated group encompassing all remaining genes in the ALOHA gene catalogue (8,788,990 in total). Viral genes were identified using the combined viral database described above with LAST at >90% amino acid identity. We removed 5,328 photosystem genes (PFAM bit score >30) from the phage group on the grounds that phage copies of this AMG (reviewed in Hurwitz and U’Ren, 2016) could not be accurately distinguished from bacterial-encoded copies (Lindell et al, 2004). This approach identified 172,385 phage genes used in subsequent analyses.

Independently for the phage and cell-associated groups, we used gene coverages to generate Bray-Curtis distance matrices between samples and then performed average-linkage hierarchical agglomerative clustering with the R vegan package (Oksanen et al, 2016). The resulting dendrograms were visualized using the R dendextend package (Galili, 2015).
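For readers unfamiliar with the distance measure, the Bray-Curtis dissimilarity between two samples with gene coverages u and v is the sum of |u_i - v_i| divided by the sum of (u_i + v_i). The analysis in this work used the R vegan package; the pure-Python sketch below only illustrates the distance calculation itself (function names are assumptions, and the average-linkage clustering step is not reimplemented here).

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length coverage vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero samples are treated as identical (assumed convention).
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities for a dict of sample name -> coverage vector."""
    names = sorted(samples)
    return {(a, b): bray_curtis(samples[a], samples[b]) for a in names for b in names}
```

A value of 0 indicates identical gene coverage profiles, and a value of 1 indicates samples that share no covered genes.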

To visualize the depth structure and distribution of known phages and bacteria, we identified 180,055 genes from the ALOHA gene catalogue with >60% amino acid identity to known phages and 3,295,154 bacterial genes with >60% amino acid identity to known bacteria in RefSeq v75. Using these gene coverages, we calculated relative abundances of three groups of well-characterized phages and their associated bacterial hosts for each sample.

Gene-centric approach: Viral life history. To examine whether the prevalence of lysogeny changes with depth, we identified prophage markers in the ALOHA gene catalogue and used their relative abundance as a proxy for the potential incidence of lysogeny. We also analyzed other sets of well-known phage structural proteins for normalization. We curated sets of annotated prophage and phage markers of interest from NCBI and identified 89 prophage integrases, 116 prophage CI repressors, 357 prophage excisionases, 1780 phage DNA polymerases, 139 phage terminases, 971 phage capsids, and 104 phage tail fiber proteins. Curated proteins were selected from either full phage genomes or uncultured marine phages from previously sequenced viral metagenomes. We used these protein sets to generate hidden Markov models for each marker using MUSCLE v3.8.31 alignment and HMMER v3.1 (Edgar, 2004; Eddy, 2011). We then used these models to identify marker proteins in the translated ALOHA gene catalogue with a domain bit score >50 across the whole sequence. The summed coverages of prophage markers (integrases, CI repressors, and excisionases) were normalized to the summed coverage of capsids as a proxy for the proportion of phage that had the potential to reproduce through lysogeny, based on the presence of prophage marker genes. We further normalized this proportion by dividing by the surface ocean mean value to identify relative changes in lysogeny with depth (data shown in Suppl. File 1).
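The normalization described above reduces to two ratios per sample. The sketch below is a minimal illustration under stated assumptions: the dictionary layout and function names are hypothetical, and 25 m is assumed to be the depth that defines the "surface ocean mean".

```python
def lysogeny_proxy(sample):
    """Summed prophage marker coverage divided by capsid coverage for one sample."""
    prophage = sample["integrase"] + sample["ci_repressor"] + sample["excisionase"]
    return prophage / sample["capsid"]


def fold_change_vs_surface(samples, surface_depth=25):
    """Lysogeny proxy of every sample divided by the mean proxy of surface samples.

    samples: dict of sample name -> dict with marker coverages and a "depth" key.
    surface_depth: depth in meters treated as surface (an assumption of this sketch).
    """
    ratios = {name: lysogeny_proxy(s) for name, s in samples.items()}
    surface = [ratios[name] for name, s in samples.items() if s["depth"] == surface_depth]
    surface_mean = sum(surface) / len(surface)
    return {name: ratio / surface_mean for name, ratio in ratios.items()}
```

Values above 1 indicate samples in which prophage markers are more common, relative to capsids, than in the surface mean.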

To examine depth profiles of the number and nature of active infections, we calculated the average copy number per cell genome of phage markers (DNA polymerases, terminases, capsids, tail fibers) and prophage markers (integrases, CI repressors, excisionases). Given that prophages in cellular genomes are predominantly non-replicative, the copy number of prophage markers was used to estimate the number of prophages per genome. The copy number of phage markers was used to estimate the total number of phages captured inside a bacterium. To generate copy number per cellular genome, coverages of prophage and phage markers were normalized to the average coverage of 10 universal single-copy bacterial marker genes, an approach known as mOTU profiling (Sunagawa et al, 2013; Mende et al, 2017).
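As a worked illustration of this normalization, copy number per cell genome is the marker coverage divided by the mean coverage of the single-copy marker genes in the same sample. The function name and inputs below are assumptions for the example.

```python
def copies_per_cell_genome(marker_coverage, single_copy_gene_coverages):
    """Estimate marker copies per cell genome equivalent in one sample.

    marker_coverage: summed coverage of one phage or prophage marker.
    single_copy_gene_coverages: coverages of the universal single-copy
    bacterial marker genes (10 in the text) in the same sample.
    """
    baseline = sum(single_copy_gene_coverages) / len(single_copy_gene_coverages)
    return marker_coverage / baseline
```

For example, a marker coverage of 5 against a single-copy baseline of 10 would correspond to 0.5 copies per cell genome, or roughly one copy in every second genome.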

Funding

This research was supported by grants from the Simons Foundation to E.F.D. (SCOPE 329108) and the Gordon and Betty Moore Foundation to E.F.D. (GBMF 3777). Partial support for D.M. was provided by the European Molecular Biology Organization (ALTF 721-2015), and partial support for E.L. by the Natural Sciences and Engineering Research Council of Canada (PGSD3-487490-2016). This work is a contribution of the Simons Collaboration on Ocean Processes and Ecology and the Center for Microbial Oceanography: Research and Education. The funders had no role in study design, sample collection, data analyses, or the decision to submit the work for publication.

Acknowledgements

We thank the Captain and crew of the R/V Kilo Moana and the Hawaii Ocean Time-series marine operations team for sample collection and oceanographic data acquisition. We thank Tsultrim Palden and Anna Romano for DNA library preparation and sequencing. We thank John Eppley and Torben Nielsen for generating initial assemblies and assistance with the ALOHA gene catalogue. We thank Elisha Wood-Charlson, Jessica Bryant, Daniel Olson, and Murat Eren for expert advice in data analyses.

Figures

[Figure 2.1. Stacked bar charts of proportion of hits (x-axis, 0.0 to 1.0) by depth (y-axis; 25, 75, 125, 200, 500, 770, and 1000 meters). Panel a, all draft genomes: known phage (10), other virome (40), novel phage (79). Panel b, known phage: Prochlorococcus phage (6), Synechococcus phage (1), Cyanophage (3). Panel c, other viromes: uvMED (9), GOV (11), EV (20).]

Figure 2.1. Depth profiles of known and novel phage ALOHA viral contigs assigned using four reference databases: known phage in the RefSeq75 protein database, and three previously sequenced viral metagenomes. Phage are identified with a cutoff of >50% genes with >60% average amino acid identity to a reference database genome. a) Proportion of contigs with hits to known phage in RefSeq75, other viromes, and novel phage. b) Subset of hits to RefSeq75. c) Subset of hits to other viromes: uvMED from the Mediterranean (Mizuno et al, 2013), Global Ocean Viromes from TARA Oceans (Roux et al, 2016), and Earth Viromes from human, terrestrial, and marine environments (Paez-Espino et al, 2016). The number of contigs in each category is shown in the legend. Proportion is normalized by total nucleotides mapped to each contig. Error bars show standard error, which is summed amongst groups in stacked bars.


Figure 2.2. Coverage profiles of 129 ALOHA viral contigs through depth and time. Each node on the top dendrogram and its associated column represents one contig, while each row represents one of 83 total samples. The height of the black column within a cell represents the mean coverage (second and third quartile) of a contig in a given sample. The full bar height of each sample layer is log-scaled to the maximum coverage in a given sample. Sample layers are ordered by depth from 25 to 1000 meters, shown on the right. Within each depth bin, samples are ordered by time from August 2010 to December 2011, shown on the left. Contigs are clustered by differential coverage, with manual binning into five groups on the bottom. Across the top, homology to known phage in RefSeq75 protein database and other viral metagenomes assigned based on >50% genes hitting at >60% average amino acid identity.

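[Figure 2.3. Paired dendrograms for phage and cell gene assemblages; samples are labeled by depth (25, 75, 125, 200, 500, 770, and 1000 m), and the horizontal axes show Bray-Curtis distance (0.0 to 0.8).]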

Figure 2.3. Cluster analysis of phage-specific versus cellular genes identified in the Station ALOHA time-series non-redundant gene catalogue. Each edge on the trees represents one sample, and corresponding samples are shown with connecting lines between the two dendrograms.

[Figure 2.4. Panel a: relative abundance (bubble area; legend 0.1, 0.2, 0.5) through time (2011 to 2012) of the most abundant ALOHA viral contigs at 25 meters and 1000 meters, each labeled with contig size and with asterisks marking contigs that contain prophage markers. Panel b: temporal variability (x-axis, 0.0 to 0.6) by depth (y-axis; 25, 75, 125, 200, 500, 770, and 1000 meters).]

Figure 2.4. Temporal variability of assembled viral contigs increased with depth. a) Relative abundance of the most abundant ALOHA viral contigs at 25 and 1000 meters. Each row represents one of the 13 most abundant ALOHA viral contigs for each depth bin, with size indicated in kilobase pairs (kbp). Asterisks represent contigs containing one or more prophage markers (PFAM bit score >30 to integrase, CI repressor, Cro, or excisionase). Relative abundance is scaled by area. b) Mean-normalized variance of relative abundance through time calculated for each contig.

[Figure 2.5. Panels a. integrase, b. CI repressor, c. excisionase; y-axis depth (25, 75, 125, 200, 500, 770, and 1000 meters), x-axis fold increase.]

Figure 2.5. Depth profiles of prophage marker proteins identified in the Station ALOHA time-series non-redundant gene catalogue (domain bit score >50) using hidden Markov models generated with manually curated sets of viral marker proteins from NCBI. Each circle represents a sample mean and each vertical bar represents a depth mean. Prophage markers shown are a) integrase, b) CI repressor, and c) excisionase. The fold increase of prophage markers with respect to the surface mean is calculated using coverage of marker proteins normalized to coverage of capsid proteins.

Supplementary Figures

[Figure S2.1. Panels a. potential temperature (°C), b. nitrate (μmol/kg), c. fluorescence (μg/L); y-axis depth (0 to 1000 meters), x-axis year (2011 to 2012).]

Figure S2.1. Dataset includes 83 samples from 7 depths and 12 time points from August 2010 to December 2011. This sampling captures in-situ assemblages along sharp biogeochemical gradients such as a) temperature, b) nutrients, and c) chlorophyll fluorescence showing the deep chlorophyll maximum at 90-130m.

[Figure S2.2. Workflow diagram. Recoverable labels: 6.3 billion reads assembled with MIRA into all contigs; genes predicted with Prodigal and clustered with CDhit into 8.9 million predicted gene clusters; LAST taxonomic annotation separating phage from cellular genes for the gene-centric depth structure, taxonomic profile, and functional annotation; HMMer used for prophage versus phage markers. Viral sequences extracted with VIRSorter from contigs >3kb; 26.1 million viral reads pooled and reassembled with SPAdes; 917 viral contigs identified with VIRSorter; 16286 genes predicted with Prodigal and annotated with PFAM for novel phage genes and phage structural genes; 20kb filter giving 129 ALOHA viral contigs; BWA mapping and Anvi’o coverage profiles for the genome-centric depth structure, temporal variability, and genomes of novel phage.]

Figure S2.2. Bioinformatic workflow from sequencing to gene- and genome-centric approaches. Analyses presented in main and supplementary figures are highlighted with black backgrounds. The original Station ALOHA gene catalogue that formed the basis for this work was reported in Mende et al, 2017.

[Figure S2.3 genome maps. Legend: genes color-coded as structural, terminase, regulatory, lysogeny, replication, other, novel AMG, or unknown; shading scaled from 0 to 100% amino acid identity. Genome labels by panel, in order of appearance, with ALOHA viral contig length in bp and any labeled gene in parentheses:
a. surface: draft genome 2 (56774), Prochlorococcus phage P-SSM2; draft genome 14 (66475; Myo-inositol-1-phosphate synthase), Synechococcus phage ACG-2014f; draft genome 17 (56191), Cyanophage NATL2A-133; draft genome 19 (54988), Synechococcus phage S-IOM18; draft genome 20 (52752), Pelagibacter phage HTVC008M.
b. DCM: draft genome 1 (62975; Sm-like domain), Cyanophage P-RSM6; draft genome 140 (70386; Carbamoyltransferase C-terminus); draft genome 13 (73503), Cyanophage P-TIM40; draft genome 22 (48810), Prochlorococcus phage P-SSM7; draft genome 75 (28100; Sm-like domain), Pelagibacter phage HTVC008M.
c. 200m: draft genome 7 (108784; Carbamoyltransferase C-terminus), Prochlorococcus phage P-SSM7; draft genome 15 (66035), Synechococcus phage S-CAM1; draft genome 15 (58845), Vibrio phage SHOU24; draft genome 18 (55142; Carbamoyltransferase C-terminus), Synechococcus phage ACG-2014f; draft genome 124 (48052; Dolichyl-phosphate-mannose-protein mannosyltransferase), Paenibacillus phage Fern.
d. deep: draft genome 29 (44533), Pelagibacter phage HTVC008M; draft genome 35 (41226), Cyanophage P-TIM40; draft genome 44 (35382), Vibrio phage VHML; draft genome 63 (31090), Synechococcus phage S-CAM1; draft genome 135 (21066).
e. sporadic: draft genome 8 (53754), Pseudomonas phage KPP25; draft genome 12 (81568), Pelagibacter phage HTVC008M; draft genome 30 (43574; meiotically up-regulated gene 113, HNH endonuclease), Clavibacter phage CMP1; draft genome 68 (29946), Pseudoalteromonas phage H105; draft genome 90 (24508), Vibrio phage VBP32.]

Figure S2.3. Genome maps of abundant phages found at different depths in the water column at Station ALOHA: a) surface, b) DCM, c) 200m, d) mesopelagic, and e) sporadic group shown in Fig. 2.2. ALOHA viral contigs are displayed on top, while reference genomes of the most closely related phage (most common hit at any amino acid identity) are displayed below. Arrows represent predicted genes, which are color-coded by function. Novel AMGs are annotated in red text. Blue shading between ALOHA viral contig and reference genome genes represents amino acid similarities of LAST hits. The start and end of genomes are displayed in number of base pairs. Some longer reference genomes have been truncated for clarity.

[Figure S2.4. Panels a. integrase, b. CI repressor, c. excisionase (prophage markers) and d. DNA polymerase, e. terminase, f. capsid, g. tail (phage markers); y-axis depth (25, 75, 125, 200, 500, 770, and 1000 meters), x-axis copies per cell genome.]

Figure S2.4. Depth profiles of phage marker proteins identified in the Station ALOHA non-redundant gene catalogue (domain bit score >50) using hidden Markov models generated with manually curated sets of proteins from NCBI. Each circle represents a sample mean and each vertical bar represents a depth mean. Depth profiles of prophage markers are shown in closed circles: a) integrase, b) CI repressor, and c) excisionase. Depth profiles of phage markers are shown in open circles: d) DNA polymerase, e) terminase, f) capsid, and g) tail fiber. The copy number of marker genes per cell genome equivalent is calculated using marker gene coverage normalized to the average coverage of 10 single-copy bacterial marker genes.

[Figure S2.5. Bubble plots by depth (y-axis; 25, 45, 75, 125, 200, 500, 770, and 1000 meters) and year (x-axis; 2011 to 2012). Panel a, phage: hits to all Cyanophage, Pelagibacter phage, Pseudomonas phage, and Vibrio phage; bubble area legend for proportion of genes 0.01, 0.02, 0.05. Panel b, bacteria: hits to Cyanobacteria, Pelagibacter, Pseudomonas, and Vibrio; bubble area legend for proportion of genes 0.1, 0.2, 0.5.]

Figure S2.5. Bubble plot of proportion of ALOHA gene catalogue genes hitting to four groups of a) known phage and b) associated bacteria. Hits assigned based on protein-protein LAST against RefSeq75 database at >60% amino acid identity. Proportion of genes is normalized by gene coverage and is scaled by area.

Supplementary Tables

Table S2.1. Assembly statistics for the initial individual assemblies for each sample, number of contigs >3kb in length used for VIRSorter runs, and number of VIRSorter identified contigs in all categories. Reads from these VIRSorter contigs were then pooled to generate the viral reassembly, which passed through additional quality-control steps to generate final ALOHA viral contigs.

Initial individual assemblies:

cruise  seq. run  depth (m)  all reads  all contigs  >3kb contigs  VIRSorter contigs  VIRSorter reads
HOT224  1  25  68795546  780250  8676  998  542961
HOT224  1  45  38277419  436530  8648  330  141926
HOT224  1  75  62434378  675786  7264  943  484334
HOT224  1  125  33557304  327780  8778  404  246891
HOT224  1  200  27557654  164297  1222  19  21808
HOT224  1  500  39143776  329910  7135  67  41532
HOT224  1  770  39044958  387925  10488  45  17836
HOT224  1  1000  50259388  496232  13460  135  115413
HOT225  1  25  35576272  434504  9755  513  182372
HOT225  1  45  58874846  720091  7768  333  114849
HOT225  1  75  50099672  579167  9528  1076  657787
HOT225  1  125  37740015  269950  2871  160  84479
HOT225  1  200  32974908  262614  2533  46  37615
HOT225  1  500  61257440  592008  15445  236  167380
HOT225  1  770  37462326  379836  10670  53  22013
HOT225  1  1000  37018306  348456  8147  58  25284
HOT226  1  25  38545192  246479  3661  138  68042
HOT226  1  45  38860404  418488  5588  265  135878
HOT226  1  75  37924780  242771  7469  192  69443
HOT226  1  200  29131958  151367  452  9  2514
HOT226  1  500  27763940  198519  1447  6  2712
HOT226  1  770  23025498  179291  1523  2  859
HOT226  1  1000  34099068  285546  2835  3  963
HOT227  1  25  31015782  160109  8453  67  27017
HOT227  1  45  54588576  333911  2452  113  27689
HOT227  1  75  20863908  104941  4773  41  14749
HOT227  1  125  58673416  563141  12251  566  561542
HOT227  1  770  29684950  248958  1637  9  3600
HOT229  1  25  15866300  151335  14446  1247  148135
HOT229  1  125  68900144  536390  4754  181  144104
HOT229  1  200  219892566  2598365  62364  534  1205969
HOT229  1  500  94020722  525644  8406  62  85366
HOT229  1  1000  81638699  743487  10987  33  275403
HOT229  2  25  46264924  448514  6346  445  280803
HOT229  2  500  94395654  907741  24310  304  162775
HOT229  2  770  110472810  1149978  34053  173  67922
HOT229  3  500  47246262  656552  85902  1535  174339
HOT231  1  25  2758328  21169  4442  31  1556
HOT231  1  75  65988470  617360  4499  443  218226
HOT231  1  125  1954188  5041  54  2  111
HOT231  1  200  53963094  420162  4373  123  145983
HOT231  1  500  63035510  490556  7473  249  216338
HOT231  1  770  62847914  545445  10316  54  20754
HOT231  1  1000  75332383  660830  9055  125  63468
HOT231  2  25  64708052  651261  5701  325  102905
HOT231  2  125  62858612  496745  6291  930  634978
HOT232  1  25  101135393  1045538  8273  443  308793
HOT232  1  75  66624258  677497  8651  792  286337
HOT232  1  125  79244750  741083  9391  306  375192
HOT232  1  200  179804296  2044277  40804  1075  1010794
HOT232  1  500  78403226  380410  3664  42  36786
HOT232  1  770  95739381  937124  27556  361  112580
HOT232  1  1000  71620066  645586  8183  39  47363
HOT232  2  500  91876072  843945  18793  175  137943
HOT232  3  500  43586734  581707  79231  857  123812
HOT233  1  25  1932956  16739  2971  23  1031
HOT233  1  75  60599405  488917  6189  616  116916
HOT233  1  1000  75101586  725780  14948  114  52105
HOT233  1c  125  54516104  505293  9647  0  0
HOT233  1c  200  53184084  410387  4045  236  229104
HOT233  1c  770  60375417  539000  12133  87  34297
HOT233  2  25  68079640  766887  10882  488  216266
HOT233  2c  500  62438198  575439  17005  72  47527
HOT234  1  75  60360120  631574  11038  1065  1515852
HOT234  1  200  71289593  532716  7729  578  867680
HOT234  1  500  66557784  570637  12360  148  95943
HOT234  1  770  59799400  497390  8776  187  81637
HOT234  1  1000  42757246  385592  3853  19  15480
HOT234  2  25  5813396  63467  13061  400  46490
HOT234  2  75  6142242  68922  12892  732  147041
HOT234  2  125  68391236  459730  5263  849  2578816
HOT234  2  200  7099726  51322  2508  79  31798
HOT234  3  25  69587030  732288  13601  973  872133
HOT236  1  25  54113650  477890  6200  483  103286
HOT236  1  75  112435135  1320009  8069  105  173282
HOT236  1  125  85336087  738209  5257  168  115681
HOT236  1  200  117076925  1025243  17511  327  285229
HOT236  1  500  154947358  751276  8176  208  785016
HOT236  1  1000  64887254  472905  6114  132  601922
HOT236  2  500  140048230  1171819  25399  1173  1567817
HOT236  2  770  128566144  1236465  32220  458  373421
HOT237  1  25  7543668  61651  8704  85  3896
HOT237  1  75  6314039  67564  11563  129  7214
HOT237  1  200  78611890  664892  5875  129  111294
HOT237  1  500  43359446  614893  97266  452  54946
HOT237  1  770  78187908  805648  22673  62  26423
HOT237  1  1000  129459766  1122956  25169  89  44163
HOT237  2  75  74007793  783100  9990  284  129515
HOT237  2  500  43186099  610238  95821  464  56750
HOT237  2  770  15275186  216918  26698  57  4197
HOT237  3  25  81138274  877747  11541  160  62565
HOT237  3  125  63309805  441548  5095  121  78069
HOT237  3  500  53091295  447697  8176  10  3692
HOT238  1  25  58927359  489829  3209  394  257492
HOT238  1  75  49096904  450292  6696  540  307569
HOT238  1c  125  96914858  815738  16870  1565  2188893
HOT238  1c  200  65735346  601297  7582  264  268898
HOT238  1  500  109166692  552739  5746  111  277537
HOT238  1  1000  102565462  362416  4177  22  35941
HOT238  2  500  101038902  935636  19761  689  473122
HOT238  2  770  126791415  1169739  27823  1672  516841
total  6309588541  57151033  1391529  34732  26073010

Pooled viral reassembly: all contigs 104732; VIRSorter contigs 917; >20kb filter 142; ALOHA viral contigs 129.

Table S2.2. The 10 of 129 ALOHA viral contigs with hits to known phage in the RefSeq75 database, defined as more than half of a contig's genes hitting a reference at an average amino acid identity (AAI) of >60%.

contig  RefSeq75 reference genome  proportion gene hits  AAI %
AVC002  Prochlorococcus_phage_P-SSM2  0.519480519  71.77225
AVC004  Prochlorococcus_phage_MED4-213  0.947368421  79.60166667
AVC031  Synechococcus_phage_ACG-2014j  0.526315789  61.353
AVC038  Cyanophage_P-TIM40  0.666666667  62.78642857
AVC045  Prochlorococcus_phage_P-SSM7  0.518518519  67.07642857
AVC047  Prochlorococcus_phage_P-SSP7  0.638297872  69.564
AVC059  Prochlorococcus_phage_P-SSP7  0.538461538  73.25047619
AVC086  Cyanophage_P-TIM40  0.885714286  81.10741935
AVC119  Prochlorococcus_phage_P-GSP1  0.619047619  69.28615385
AVC140  Cyanophage_P-RSM6  0.949494949  88.15542553

Table S2.3. List of 37 novel genes in marine phage, identified on 625 curated contigs carrying viral structural genes (PFAM bit score >10), that have not been reported in other previously sequenced viromes (Brum et al, 2015; Roux et al, 2016). The 19 putative novel auxiliary metabolic genes are noted with an asterisk (*).

PFAM  annotation
PF00147*  Fibrinogen beta and gamma chains, C-terminal globular domain
PF00782*  Dual specificity phosphatase, catalytic domain
PF01206*  Sulfurtransferase TusA
PF01658*  Myo-inositol-1-phosphate synthase
PF01758*  Sodium Bile acid symporter family
PF01812*  5-formyltetrahydrofolate cyclo-ligase family
PF01918  Alba
PF02597*  ThiS family
PF03602*  Conserved hypothetical protein 95
PF04892*  VanZ like family
PF05262*  Borrelia P83/100 protein
PF06855*  YozE SAM-like fold
PF07902  gp58-like protein
PF08279  HTH domain
PF09382  RQC domain
PF12385*  Papain-like cysteine protease AvrRpt2
PF13022  Helix-turn-helix of insertion element transposase
PF13231*  Dolichyl-phosphate-mannose-protein mannosyltransferase
PF13455  Meiotically up-regulated gene 113
PF14279  HNH endonuclease
PF14326  Domain of unknown function (DUF4384)
PF14373*  Superinfection immunity protein
PF15943*  Putative antitoxin of bacterial toxin-antitoxin system, YdaS/YdaT
PF16075  Domain of unknown function (DUF4815)
PF16190*  Ubiquitin-activating enzyme E1 FCCH domain
PF16243  Sm_like domain
PF16363*  GDP-mannose 4,6 dehydratase
PF16510  Phage P22-like portal protein
PF16724  T4-like virus Myoviridae tail sheath stabiliser
PF16778  Phage tail assembly chaperone protein
PF16786  Recombination enhancement, RecA-dependent nuclease
PF16790  Bacteriophage clamp loader A subunit
PF16805  Phage late-transcription coactivator
PF16861*  Carbamoyltransferase C-terminus
PF16868*  NMT1-like family
PF16945  Putative lactococcus lactis phage r1t holin
PF17212  Tail tubular protein

Table S2.4. List of 4 persistent ALOHA viral contigs with hits to surface phages in a 2015 dataset near Station ALOHA (Aylward et al. submitted), defined as more than half of a contig's genes hitting a reference at an average amino acid identity (AAI) of >60%.

contig  reference  proportion gene hits  AAI %
AVC003  VS12|size37975  0.739130434783  68.9752941176
AVC011  VS13|size35890  0.674418604651  75.8634482759
AVC017  VS13|size35890  0.608695652174  66.0454761905
AVC063  VS7|size43665  0.888888888889  62.0079166667

References

Avrani S, Schwartz DA, Lindell D. 2012. Virus-host swinging party in the oceans: Incorporating biological complexity into paradigms of antagonistic coexistence. Mob Genet Elements 2(2):88-95.

Brum JR, Hurwitz BL, Schofield O, Ducklow HW, Sullivan MB. 2015. Seasonal time bombs: dominant temperate viruses affect Southern Ocean microbial dynamics. ISME J 10(2):1-13.

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19(5):455-477.

Breitbart M. 2012. Marine viruses: truth or dare. Ann Rev Mar Sci 4(1): 425-448.

Brum JR, Schenck RO, Sullivan MB. 2013. Global morphological analysis of marine viruses shows minimal regional variation and dominance of non-tailed viruses. ISME J 7(9):1738-1751.

Brum JR, Sullivan MB. 2015. Rising to the challenge: accelerated pace of discovery transforms marine virology. Nature Rev Microbiol 13(3):147-159.

Bryant JA, Aylward FO, Eppley JM, Karl DM, Church MJ, DeLong EF. 2016. Wind and sunlight shape microbial diversity in surface waters of the North Pacific Subtropical Gyre. ISME J 10(6):1308-1322.

Campbell JH, O’Donoghue P, Campbell AG, Schwientek P, Sczyrba A, Woyke T, Söll D, Podar M. 2013. UGA is an additional glycine codon in uncultured SR1 bacteria from the human microbiota. Proc Natl Acad Sci USA 110(14):5540-5545.

Chevreux B. 2004. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res 14:1147-1159.

Chow CET, Kim DY, Sachdeva R, Caron DA, Fuhrman JA. 2014. Top-down controls on bacterial community structure: microbial network analysis of bacteria, T4-like viruses and protists. ISME J 8(4):816-829.

Delong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM. 2006. Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311:496-503.

Eddy SR. 2011. Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195.

Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.

Enault F, Briet A, Bouteille L, Roux S, Sullivan MB. 2017. Phages rarely encode antibiotic resistance genes: a cautionary tale for virome analyses. ISME J 11(1):237-247.

Eren MA, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO. 2015. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3:e1319.

Galili T. 2015. dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. 31(22):3718-3720.

Garcia-Heredia I, Rodriguez-Valera F. 2013. Novel group of podovirus infecting the marine bacterium Alteromonas macleodii. Bacteriophage 3(2):e24766.

Garza DR, Suttle CA. 1998. The effect of cyanophages on the mortality of Synechococcus spp. and selection for UV resistant viral communities. Microb Ecol 36: 281-292.

Goldsmith DB, Parsons RJ, Beyene D, Salamon P, Breitbart M. 2015. Deep sequencing of the viral phoH gene reveals temporal variation, depth-specific composition, and persistent dominance of the same viral phoH genes in the Sargasso Sea. PeerJ 3:e997.

Hartman PS, Eisenstark A. 1982. Alteration of bacteriophage attachment capacity by near-UV irradiation. J Virol 43:529-532.

Hurwitz BL, Sullivan MB. 2013. The Pacific Ocean Virome (POV): A marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS ONE 8(2):e57355.

Hurwitz BL, U’Ren JM. 2016. Viral metabolic reprogramming in marine ecosystems. Curr Opin Microbiol 31:161-168.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11(1):119.

Jover LF, Effler TC, Buchan A, Wilhelm SW, Weitz JS. 2014. The elemental composition of virus particles: implications for marine biogeochemical cycles. Nature Rev Microbiol 12(7):519–528.

Kang I, Oh HM, Kang D, Cho JC. 2013. Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans. Proc Natl Acad Sci USA 110(30):12343–12348.

Knowles B, Silveira CB, Bailey BA, Barott K, Cantu VA, Cobián-Güemes AG, et al. 2016. Lytic to temperate switching of viral communities. Nature 531(7595):533-537.

Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res 21(3):487-493.

Marston MF, Sallee JL. 2003. Genetic diversity and temporal variation in the cyanophage community infecting marine Synechococcus species in Rhode Island’s coastal waters. Appl Environ Microbiol 69(8):4639-4647.

Laybourn-Parry J, Marshall WA, Madan NJ. 2007. Viral dynamics and patterns of lysogeny in saline Antarctic lakes. Polar Biol 30:351-358.

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. e-pub ahead of print 26 May 2013, arXiv:1303.3997v2 [q-bio.GN].

Li W, Jaroszewski L, Godzik A. 2001. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics 17:282-283.

Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, Chisholm SW. 2004. Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci USA 101(30):11013-11018.

Lindell D, Jaffe JD, Johnson ZI, Church GM, Chisholm SW. 2005. Photosynthesis genes in marine viruses yield proteins during host infection. Nature 438(7064):86-89.

Lu MJ, Henning U. 1989. The immunity (imm) gene of Escherichia coli bacteriophage T4. J Virol 63(8):3472-3478.

Mende DR, Bryant JA, Aylward FO, Eppley JM, Nielsen TN, DeLong EF. 14 August 2017. Environmental drivers of a genomic transition zone in the ocean’s interior. Nat Microbiol doi:10.1038/s41564-017-0008-3.

Middelboe M. 2000. Bacterial growth rate and marine virus-host dynamics. Microb Ecol 40:114-124.

Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. 2013. Expanding the marine virosphere using metagenomics. PLoS Genet 9(12):e1003987.

Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. 2016. Genomes of abundant and widespread viruses from the deep ocean. mBio 7(4):e00805-16.

Moebus K. 1996. Marine bacteriophage reproduction under nutrient-limited growth of host bacteria: Investigations with six phage-host systems. Mar Ecol Prog Ser 144:1–12.

Mojica KDA, Brussaard CPD. 2014. Factors affecting virus dynamics and microbial host-virus interactions in marine environments. FEMS Microbiol Ecol 89(3):495–515.

O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44(D1):D733-D745.

Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O'Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H. 2016. vegan: Community Ecology Package. R package version 2.4-0. http://CRAN.R-project.org/package=vegan.

Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, Rubin E, Ivanova NN, Kyrpides NC. 2016. Uncovering Earth’s virome. Nature 536(7617):425-430.

Pagarete A, Johannessen T, Fuhrman JA, Thingstad TF, Sandaa RA. 2013. Strong seasonality and interannual recurrence in marine myovirus. Appl Environ Microbiol 79(20):6253-6259.

Payet JP, Suttle CA. 2013. To kill or not to kill: the balance between lytic and lysogenic viral infection is driven by trophic status. Limnol Oceanogr 58: 465-474.

Pritchard L, White JA, Birch PRJ, Toth IK. 2006. GenomeDiagram: a Python Package for the visualization of large-scale genomic data. Bioinformatics 22(5):616-617.

Puxty RJ, Millard AD, Evans DJ, Scanlan DJ. 2016. Viruses Inhibit CO2 Fixation in the Most Abundant Phototrophs on Earth. Curr Biol 26(12):1585-1589.

Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, et al. 2013. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499(7459):431-437.

Rohwer F, Seguritan V. 2000. The complete genomic sequence of the marine phage Roseophage SIO1 shares homology with nonmarine phages. Limnol Oceanogr 45(2):408-418.

Roux S, Enault F, Hurwitz BL, Sullivan MB. 2015. VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985.

Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. 2016. Ecogenomics and biogeochemical impacts of uncultivated globally abundant ocean viruses. Nature 537:689-693.

Steward GF, Montiel JL, Azam F. 2000. Genome size distributions indicate variability and similarities among marine viral assemblages from diverse environments. Limnol Oceanogr 45(8):1697-1706.

Suttle CA, Chan AM. 1994. Dynamics and distribution of cyanophages and their effect on marine Synechococcus spp. Appl Environ Microbiol 60:3167-3174.

Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger S, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, Rasmussen S, Brunak S, Pedersen O, Guarner F, de Vos WM, Wang J, Li J, Doré J, Ehrlich SD, Stamatakis A, Bork P. 2013. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods 10(12):1196-1199.

Thompson LR, Zeng Q, Kelly L, Huang KH, Singer AU, Stubbe J, Chisholm SW. 2011. Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc Natl Acad Sci USA 108(39):E757-E764.

Waterbury JB, Valois FW. 1993. Resistance to co-occurring phages enables marine Synechococcus to co-exist with cyanophages abundant in seawater. Appl Environ Microbiol 59:3393-3399.

Weinbauer MG, Suttle CA. 1999. Lysogeny and prophage induction in coastal and offshore bacterial communities. Aquat Microb Ecol 18:217-225.

Weinbauer MG, Brettar I, Hofle MG. 2003. Lysogeny and virus-induced mortality of bacterioplankton in surface, deep, and anoxic marine waters. Limnol Oceanogr 48:1457-1465.

Weinbauer MG. 2004. Ecology of prokaryotic viruses. FEMS Microbiol Rev 28:127-181.

Weinbauer MG, Rassoulzadegan F. 2004. Are viruses driving microbial diversification and diversity? Environ Microbiol 6:1-11.

Wigington CH, Sonderegger DL, Brussaard CPD, Buchan A, Finke JF, Fuhrman J, Lennon JT, Middelboe M, Suttle CA, Stock C, Wilson WH, Wommack KE, Wilhelm SW, Weitz JS. 2016. Re-examining the relationship between virus and microbial cell abundances in the global oceans. Nature Microbiol 1:1-21.

Wikner J, Vallino JJ, Steward GF, Smith DC, Azam F. 1993. Nucleic-acids from the host bacterium as a major source of nucleotides for 3 marine bacteriophages. FEMS Microbiol Ecol 12:237-248.

Williamson SJ, Houchin LA, McDaniel L, Paul JH. 2002. Seasonal variation in lysogeny as depicted by prophage induction in Tampa Bay, Florida. Appl Environ Microbiol 68:4307-4314.

Wilson WH, Joint IR, Carr NG, Mann NH. 1993. Isolation and molecular characterization of five marine cyanophages propagated on Synechococcus sp. strain WH7803. Appl Environ Microbiol 59(11):3736-3743.

Zhao Y, Temperton B, Thrash JC, Schwalbach MS, Vergin KL, Landry ZC, Ellisman M, Deerinck T, Sullivan MB, Giovannoni SJ. 2013. Abundant SAR11 viruses in the ocean. Nature 494(7437):357-360.

CHAPTER 3. DOUBLE-STRANDED DNA VIRIOPLANKTON DYNAMICS AND

REPRODUCTIVE STRATEGIES IN THE OPEN OCEAN WATER COLUMN

Published 2020 in ISME J 14:1304-1315. DOI: 10.1038/s41396-020-0604-8

Authors: Elaine Luo, John Eppley, Anna Romano, Daniel Mende, Edward DeLong

Abstract

Microbial communities are critical to ecosystem dynamics and biogeochemical cycling in the open oceans. Viruses are essential elements of these communities, influencing the productivity, diversity, and evolution of cellular hosts. To further explore the natural history and ecology of open-ocean viruses, we surveyed the spatiotemporal dynamics of double-stranded DNA (dsDNA) viruses in both virioplankton and bacterioplankton size fractions in the North Pacific Subtropical Gyre, one of the largest biomes on the planet. Assembly and clustering of viral genomes revealed a peak in virioplankton diversity at the base of the euphotic zone, where virus populations and host species richness both reached their maxima. Simultaneous characterization of both extracellular and intracellular viruses suggested depth-specific reproductive strategies. In particular, analyses indicated elevated lytic interactions in the mixed layer, more temporally variable temperate phage interactions at the base of the euphotic zone, and increased lysogeny in the mesopelagic ocean. Furthermore, the depth variability of auxiliary metabolic genes suggested habitat-specific strategies for viral influence on light-energy, nitrogen, and phosphorus acquisition during host infection. Most virus populations were temporally persistent over several years in this environment at the 95% nucleic acid identity level. In total, our analyses revealed variable distributional patterns and diverse reproductive and metabolic strategies of virus populations in the open-ocean water column.

Introduction

Viruses represent dynamic reservoirs of unexplored genetic diversity. On average an order of magnitude more abundant than cellular organisms (1), viruses occur at ~10^7/mL in the surface layer of the open oceans, which cover ~40% of Earth (2). In this environment, dsDNA bacteriophages infect key microbial groups, including oxygenic photoautotrophs such as Prochlorococcus and Synechococcus (e.g., 3,4) and common bacterial heterotrophs such as Pelagibacter (SAR11), Puniceispirillum (SAR116), Roseobacter, and Alteromonas (e.g., 5–8). Viruses can lyse their hosts at estimated rates of 20-40% per day (9,10), potentially contributing as much as 145 gigatonnes to annual global carbon flux (11,12).

Viruses also influence the diversity and biogeochemistry of marine ecosystems by carrying auxiliary metabolic genes (AMGs) that manipulate host metabolism during infection (reviewed in 13).

Recent developments in high-throughput DNA sequencing have allowed for exploration of viral diversity at unprecedented scales. The continuing description of viral genetic diversity highlights the importance of further cultivation-independent in situ characterization of environmental viral populations.

Metagenomic virus surveys have focused on characterizing geographic variability across surface oceans (14–17), temporal variability at the surface ocean (18–22), and vertical variability across depth profiles (23–25). Coupled studies of viral and host dynamics in both space and time at well-defined sample sites (26–28) have the potential to provide further perspective on the dynamics, patterns, and consequences of viral diversity.

To further explore virus diversity, environmental distributions and dynamics in the open ocean, we characterized virus genomes from seawater in virus-enriched (0.02-0.2µm) and cell-enriched (>0.2µm) size fractions over time and depth in the North Pacific Subtropical Gyre (NPSG). The approach facilitated the exploration of both extracellular virioplankton particles and cell-associated phages, to better characterize reproductive and metabolic strategies of viruses in the open ocean. We use the term “reproductive strategies” here to refer to life history differences between strictly lytic viruses, which have a singular strategy of host lysis, and temperate phages that, in addition to the lytic cycle, have the potential to either integrate into their host’s genome or reside as an extrachromosomal element. In this study, we analyzed samples collected over 1.5 years across 12 depths (5-500m) to characterize dsDNA viruses in both virioplankton and bacterioplankton size fractions. This virus genome dataset (referred to here as the ALOHA 2.0 viral database) provides new perspectives on the diversity, reproductive strategies, gene content, and ecology of viruses in the open ocean.

Methods

A schematic overview of our workflow is presented in Fig. S3.1.

Sample collection, extraction, and sequencing. Station ALOHA (22°45’ N, 158° W), a relatively seasonally stable environment located in the NPSG, is a well-characterized sampling site of the Hawaii Ocean Time-series (HOT) program.

The ALOHA 2.0 dataset contains 374 metagenomic samples collected at 12 depths (5, 24, 45, 75, 100, 125, 150, 175, 200, 225, 250, and 500 meters) at approximately monthly intervals across 16 time points spanning 1.5 years from 2014-2016 (Fig. S3.2). These collections correspond to HOT cruise numbers 267-283, for which physicochemical data are available in Table S3.1 in the publication and on the Hawaii Ocean Time-series HOT-DOGS application (http://hahana.soest.hawaii.edu/hot/hot-dogs).

All samples were collected using the following procedure. 2-4L of seawater (2L from 5-175m, 4L from 200-500m) were collected using CTD-attached Niskin bottles and filtered, using peristaltic pumps at a flow rate of about 6L/h, onto a 0.2µm 25mm Supor filter (VWR 28147-956) housed in a polypropylene filter holder (Cole-Parmer EW-06623-32). These >0.2µm cell-enriched samples were removed from the filter holder and stored in 300µL of RNALater (Ambion AM7021, Waltham MA) at -80°C. 1-2L (1L from 5-175m, 2L from 200-500m) of the <0.2µm filtrate were collected and filtered, using peristaltic pumps at a flow rate of about 1L/h, onto a 0.02µm Whatman Anotop filter (VWR 28138-017, Radnor PA). These corresponding 0.2-0.02 µm virus-enriched samples were stored sealed in the filter housing at -80°C. DNA extraction, sequencing, and read quality-control are described in the Supplementary Methods.

Quality-controlled reads, on average 9-10 million per sample (Table S3.1 in the publication), were assembled within each sample using option “-k 21,33,55,77,99,127” on metaSPAdes v3.10.1 (29), generating a total of 83 million contigs amongst 416 metagenomes (Table S3.1 in the publication). All sequences were used for viral reassembly and database curation. For downstream analyses, smaller metagenomes from samples with duplicate sequencing runs and samples with <1 million reads were removed, resulting in a total of 374 metagenomes.

Sequences were submitted to NCBI SRA under project number PRJNA352737 and assemblies can be found under BioSample SAMN12604809.

Viral-specific reassembly. All >3kb contigs were filtered using VirSorter v1.03 (30) with the virome database, under regular mode for cell-enriched samples and decontamination mode for virus-enriched samples. Contigs from all identified viral categories were retained (Table S3.2 in the publication). BWA-MEM v0.7.15 (Li 2013) and msamtools (31) were used to identify 809 million reads mapping to these putative viral contigs at >95% average nucleotide identity (ANI) across >45bp (Table S3.2 in the publication). Viral reads were reassembled using metaSPAdes v3.11.1, which was chosen due to improved genome recovery and a low rate of generating false apparent circularity (32). Reads from contigs with >10× coverage, a threshold representing 99% genome recovery (32), were pooled across each depth for 12 reassemblies. Reads from contigs with <10× coverage were pooled across all samples into a low-coverage reassembly to improve genome recovery.

ALOHA 2.0 virus database curation. Viral contigs across the 13 reassemblies, along with viral contigs from two previous smaller datasets near Station ALOHA (22, 27), were clustered with cd-hit-est v4.6 (33) at >95% ANI to form 1.5 million non-redundant viral populations. Populations were filtered through VirSorter and 262 197 putative viruses from all categories were retained. Proteins were predicted using Prodigal v2.6.3 (34) and functionally annotated using HMMER v3.2 (35) against the PFAM-A v30 database (36). Populations containing one or more known viral marker proteins were retained (bit score >30 to capsid, head, neck, tail, spike, portal, terminase, clamp loader, T4 proteins, T7 proteins, Mu proteins, excisionase, phage integrase, repressor protein CI, or Cro), resulting in 56 559 high-confidence virus populations ranging from 0.5-366kbp in length. To focus on full genomes or large genomic fragments, we retained only >10kbp contigs, resulting in 17 369 populations that form the ALOHA 2.0 virus database. Functional annotations were inspected to ensure that no ribosomal proteins were present, with the exception of S21, also found in a cultivated Pelagibacter phage; S33, found enriched in aquatic viruses; and L7/12, found in assembled viral contigs (37). One population containing L11 was retained due to the protein’s proximity to and interaction with L7/12 (38).

The high proportion of novel viral diversity in our samples precludes using reference genomes for the detection of chimeras, which are expected to occur at a frequency of ~0.5% using metaSPAdes (32). As a result, we inspected clusters for chimeric signatures through self-alignment using LAST v756 (39) to identify stretches of repeats at >95% ANI across >10 kbp. 15 populations displayed this signature and were noted as chimeras (Table S3.3 in the publication).

Genomic completion. We used terminal repeats (apparent circularity) to identify a non-redundant set of complete genomes (40), using a combination of four different methods: i. 439 were identified using VirSorter; ii. 411 using check_circularity.pl (41); iii. 790 using LAST to identify overlaps at 95% ANI across 100 bp-10 kbp within 200 bp of both ends; and iv. 1131 using NUCmer v3.1 (42) to identify direct terminal repeats 200-2000 bp in length within 200 bp of both ends (43). Pooling these four methods identified 1543 complete genomes (Table S3.3 in the publication). LAST was then used to identify redundant, circularly permuted contigs at 95% ANI across 150 bp, yielding 961 non-redundant complete genomes (16 787 populations in total; Fig. S3.1).
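As an illustration of the terminal-repeat criterion, the following minimal Python sketch flags a contig as apparently circular when its first and last k bases are identical for some k between 200 and 2000 bp. It is a simplification and not the workflow used here: it requires exact matches at the very ends of the contig, whereas the LAST and NUCmer searches tolerate mismatches and offsets of up to 200 bp. The function name and defaults are hypothetical.

```python
def terminal_repeat_length(seq, min_len=200, max_len=2000):
    """Return the length of the longest exact direct terminal repeat.

    Assumption: the repeat sits at the very ends of the contig and matches
    exactly; returns 0 when no repeat of at least min_len is found.
    """
    seq = seq.upper()
    # A repeat cannot be longer than half the contig.
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Toy contig: a 300 bp repeat flanking a 1000 bp unique region.
repeat = "ACGT" * 75
contig = repeat + "G" * 1000 + repeat
print(terminal_repeat_length(contig))
```

A contig with a non-zero result would be a candidate complete genome; a real pipeline would also need to collapse circular permutations of the same genome, as described above.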

Viral and prokaryotic contribution to total DNA. Viral contribution was calculated for each sample as the proportion of reads mapping to the ALOHA 2.0 viral database (>95% ANI across >45 bp). Prokaryotic contribution was calculated for each sample as the proportion of reads classified as bacterial or archaeal with Kaiju v1.6.2 (44).

Spatiotemporal distribution and abundance. Reads from each sample were mapped using BWA-MEM to virus populations and filtered using msamtools at >95% ANI across >45 bp. Anvi’o v3 (45) was used to calculate coverage profiles for every sample, using interquartile range (IQR) coverage, which diminishes the effect of conserved and hypervariable regions that respectively over- and underestimate coverage. For analyses including all virus populations, relative nucleotides mapped (nucleotides mapped to a population divided by nucleotides mapped across all populations) were used to calculate relative abundances (Tables S3.4, S3.5 in the publication). For analyses including only complete viral genomes, relative coverage (genome coverage divided by total coverage summed across all complete genomes) was used to approximate relative abundances (Tables S3.6, S3.7 in the publication).
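The two abundance measures can be sketched in a few lines of Python. This is an illustrative approximation only: the interquartile coverage below is taken as the mean of the middle half of the sorted per-base coverage values, which is an assumption about how the IQR coverage is defined rather than a reproduction of the Anvi’o implementation, and the function names are hypothetical.

```python
def interquartile_mean(per_base_cov):
    """Mean coverage over the middle half of the sorted per-base values.

    Assumption: this approximates an interquartile coverage statistic by
    trimming the lowest and highest quarters of positions.
    """
    cov = sorted(per_base_cov)
    if not cov:
        return 0.0
    n = len(cov)
    inner = cov[n // 4 : n - n // 4]
    return sum(inner) / len(inner)


def relative_abundance(values):
    """Divide each population's value by the total across all populations."""
    total = sum(values.values())
    if total == 0:
        return {name: 0.0 for name in values}
    return {name: v / total for name, v in values.items()}


# Conserved regions (very high coverage) and hypervariable regions
# (zero coverage) are trimmed before averaging.
print(interquartile_mean([0, 0, 10, 10, 10, 10, 500, 500]))
print(relative_abundance({"pop_a": 30, "pop_b": 10}))
```

The same `relative_abundance` helper applies to either input: nucleotides mapped per population, or coverage per complete genome.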

To examine temporal persistence, reads from a previous 2010-2011 dataset from Station ALOHA (46) were mapped to the 2014-2016 ALOHA 2.0 virus database using BWA-MEM and filtered using msamtools at >95% ANI across >45 bp. Populations with non-zero IQR coverage were considered present in the 2010-2011 dataset (Table S3.3 in the publication).

Characterizing cellular assemblages. COG0012, a universal single-copy marker protein, was used to generate 2568 mOTUs representing cells at the near-species level, at higher resolution than with rRNA-based OTUs (46–48). One Crocosphaera mOTU not previously included was curated, and relative abundances were calculated using relative coverages of reads mapping to these 2569 mOTUs (Table S3.8 in the publication).

Viral population identification. Taxonomy was assigned using protein LAST alignments to known phages in RefSeq84 (49; Table S3.9 in the publication), as well as five viral metagenomic databases available as of 2018: uvMED (50), uvDEEP (24), GOV (15), EV (16), and MED2017 (25). To avoid inflating the number of novel populations, we performed broad taxonomic assignments at >60% average amino acid identity (AAI) across >50% of proteins to any reference genome or contig, with the best hit assigned based on the highest AAI (Table S3.3 in the publication). A broader cut-off of >50% of proteins at any AAI was used to identify phages infecting heterotrophic bacterioplankton in RefSeq84, due to lower sequence representation than picocyanophages in these databases (Table S3.3 in the publication).
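The assignment rule can be illustrated with a short Python sketch. It assumes the protein alignments have already been summarized per reference as a list of percent identities, one for each query protein with a hit; the data structure, function name, and reference names are hypothetical and are not part of the original workflow.

```python
def assign_taxonomy(n_proteins, hits, min_aai=60.0, min_frac=0.5):
    """Pick the reference with the highest AAI that passes both cut-offs.

    hits maps a reference name to the percent identities of the query
    proteins that aligned to it (assumed input format). Returns
    (reference, AAI), or (None, 0.0) when no reference qualifies.
    """
    best_ref, best_aai = None, 0.0
    for ref, identities in hits.items():
        if not identities:
            continue
        frac = len(identities) / n_proteins   # share of proteins with a hit
        aai = sum(identities) / len(identities)
        if frac > min_frac and aai > min_aai and aai > best_aai:
            best_ref, best_aai = ref, aai
    return best_ref, best_aai


# A population with 10 predicted proteins and three candidate references.
hits = {
    "phage_A": [70.0] * 6,   # 60% of proteins, AAI 70: passes
    "phage_B": [90.0] * 3,   # only 30% of proteins: fails
    "phage_C": [65.0] * 8,   # passes, but lower AAI than phage_A
}
print(assign_taxonomy(10, hits))
```

Setting `min_aai=0.0` gives the broader cut-off described for phages infecting heterotrophic bacterioplankton.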

Proteins were annotated using HMMsearch against PFAM (bit score >30). Proteins with functional domains not found in previously reported datasets (15,27) were considered novel viral genes (Table S3.10 in the publication).

Putative temperate phages were identified using i. functional annotations to identify phages with the genomic potential for lysogeny and ii. VirSorter to identify integrated prophages. i. Functional annotations identified 922 populations with temperate phage markers (>30 bit score to integrase, excisionase, Cro, or CI repressor). ii. VirSorter identified prophages from the original assemblies, with which 413 temperate phage populations shared significant homology (a minimum of 150 bp at 95% ANI); VirSorter also identified 73 final viral populations as prophages.

Archaeal virus identification. Putative archaeal viruses were identified in unannotated populations using archaeal protein markers (PFAM bit score >30) and sequence similarity to Archaea or archaeal viruses in RefSeq84. Populations carrying protein markers included one with AmoC (51), 16 with archaeal Holliday junction resolvase, and 37 with MCM DNA helicase of archaeal origin (Table S3.9 in the publication), yielding a total of 53 archaeal virus marker populations. 632 populations were identified as having ≥1 protein with top hits (protein-protein LAST) to archaea or archaeal viruses in the RefSeq84 database (49). We refined this set with the expectation that archaeal viruses should display a high ratio of top protein hits to archaea and archaeal viruses divided by top hits to Bacteria (AB ratio) and a large proportion of proteins without RefSeq84 hits, given the scarcity of open ocean archaeal viruses in current databases. The 53 archaeal virus marker populations all displayed >0.06 AB ratio and >0.8 proportion of proteins without RefSeq84 hits (Fig. S3.3b). We used modified, conservative cut-offs (>0.5 AB ratio, >0.8 proportion of proteins without RefSeq84 hits) to retain 161 putative archaeal viruses (Fig. S3.3c).

Phylum-level classifications were assigned based on number of protein top hits to Thaumarchaeota and Euryarchaeota. Populations with ties in number of top hits to these phyla were considered unclassified archaeal viruses.
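The AB-ratio filter described above can be sketched as follows. This is a hedged illustration: the function names and the per-population hit counts are hypothetical, and the treatment of populations with zero bacterial top hits (an infinite ratio when at least one archaeal hit exists) is an assumption, since the text does not specify it.

```python
def ab_ratio(n_archaeal, n_bacterial):
    """Top hits to archaea or archaeal viruses divided by top hits to Bacteria.

    Assumption: with no bacterial hits the ratio is treated as infinite
    if any archaeal hit exists, and as zero otherwise.
    """
    if n_bacterial == 0:
        return float("inf") if n_archaeal > 0 else 0.0
    return n_archaeal / n_bacterial


def is_putative_archaeal_virus(n_proteins, n_with_hits, n_archaeal,
                               n_bacterial, min_ab=0.5, min_no_hit=0.8):
    """Apply the two conservative cut-offs (>0.5 AB ratio, >0.8 without hits)."""
    no_hit_fraction = (n_proteins - n_with_hits) / n_proteins
    return ab_ratio(n_archaeal, n_bacterial) > min_ab and no_hit_fraction > min_no_hit


# 50 proteins, 5 with RefSeq84 hits: 3 archaeal and 2 bacterial top hits.
print(is_putative_archaeal_virus(50, 5, 3, 2))
```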

Eukaryotic virus identification. Putative eukaryotic viruses were identified using the nucleocytoplasmic large DNA virus capsid marker (PFAM bit score >30) and ≥2 proteins with top hits to eukaryotic viruses in RefSeq84 (Table S3.9).

All taxonomic identifications were cross-referenced to confirm that a population is assigned only to one group.

Crocosphaera putative phage identification. Putative Crocosphaera phages were identified through a combination of methods. First, alignments to RefSeq84 revealed one putative Crocosphaera phage (Table S3.3 in the publication) with 11 of 15 proteins hitting to Crocosphaera, and one hit to another cyanobacterium (Table S3.9 in the publication). For independent confirmation, this virus displayed abundance profiles expected from Crocosphaera (i.e., a summer bloom in the upper ocean). 170 other potential Crocosphaera phages that displayed similar spatiotemporal distributions (Fig. S3.9 in the publication) were also included as candidates for independent confirmation. Reads from samples capturing a Crocosphaera bloom (and presumably its phages) near Station ALOHA around the same time (52) were mapped (>95% ANI across >45bp) to these 171 populations to confirm presence. 115 candidates that recruited reads at >0 IQR coverage were retained. We then retained only candidates with a higher number of top hits to non-Prochlorococcus/Synechococcus cyanobacterial proteins than top hits to Prochlorococcus or Synechococcus proteins, for a final set of 1 “putative” and 6 “potential” Crocosphaera phages (Table S3.3 in the publication).

Viral and prokaryotic diversity. To examine within-sample α-diversity, we calculated the Shannon diversity, evenness, and richness for each sample using the vegan package in R (53). Respective assemblages were assessed using relative coverages of 16 787 viral populations and 2568 cellular mOTUs.
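For readers without R, the three α-diversity metrics can be approximated in plain Python. This sketch assumes the natural-log Shannon index, Pielou's evenness (Shannon diversity divided by the log of richness), and richness as the number of populations with non-zero abundance. The choice of Pielou's index is an assumption, since the evenness formula is not stated here.

```python
import math


def alpha_diversity(abundances):
    """Return (Shannon diversity, Pielou evenness, richness) for one sample."""
    present = [a for a in abundances if a > 0]
    richness = len(present)
    if richness == 0:
        return 0.0, 0.0, 0
    total = sum(present)
    shannon = -sum((a / total) * math.log(a / total) for a in present)
    # Evenness is undefined for a single population; report 0 by convention.
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return shannon, evenness, richness


# Four equally abundant populations give maximal evenness.
print(alpha_diversity([0.25, 0.25, 0.25, 0.25]))
```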

VC ratio. We define the Virus:Cell ratio (VC ratio) here as the log-ratio of a population’s relative abundance (nucleotides mapped) in the virus-enriched versus cell-enriched size fractions. To calculate VC ratios, samtools bedcov (54) was used to calculate nucleotides mapping to each population from all samples, normalized to the average library size of 1.3 million nucleotides per sample.

Populations with zero nucleotides mapping in any sample were adjusted to one nucleotide for log-ratio calculation. Temporal variability of VC ratios was calculated for each population using the mean-normalized variance of its VC ratios within each depth.
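A minimal Python sketch of the VC ratio and its temporal variability follows. Several details are assumptions rather than statements of the method: the logarithm is taken as base 10, the variance is the population variance, and "mean-normalized" is read as variance divided by the absolute value of the mean. The function names are hypothetical.

```python
import math
import statistics


def vc_ratio(virus_nt, cell_nt):
    """Log10 ratio of nucleotides mapped in virus- vs cell-enriched fractions.

    Zero counts are adjusted to one nucleotide, as described in the text.
    """
    virus_nt = virus_nt if virus_nt > 0 else 1
    cell_nt = cell_nt if cell_nt > 0 else 1
    return math.log10(virus_nt / cell_nt)


def vc_variability(ratios):
    """Variance of VC ratios divided by the absolute mean (assumed definition)."""
    mean = statistics.mean(ratios)
    if mean == 0:
        return float("nan")
    return statistics.pvariance(ratios) / abs(mean)


# One population at one depth across three time points.
ratios = [vc_ratio(v, c) for v, c in [(1000, 10), (0, 100), (500, 500)]]
print(ratios, vc_variability([1.0, 3.0]))
```

Under this reading, a population with a constant VC ratio through time has zero variability, while episodic particle production raises it.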

Results and discussion

A total of 374 metagenomes were generated from seawater sampled at Station ALOHA across 16 time points at 12 depths (5-500m). These samples were used to prepare DNA from cell-enriched >0.2µm size fractions (encompassing cellular DNA, giant viruses, active infections, prophages, and adsorbed phages), and virus-enriched 0.2-0.02µm size fractions (capturing ultra-small cells, free virus particles, and ultra-small detritus). After metagenome assembly, curation, and classification (Fig. S3.1), the 4.2 TB of sequence data yielded 16 787 virus populations, with 8079 novel populations not represented in publicly available virus sequences. A total of 961 assembled populations were identified as complete via evidence of their terminal repeats (55,56). 1352 populations were identified as putative temperate phages, due to the presence of flanking cellular sequences or diagnostic temperate phage marker genes. Among novel viruses, 29 putative temperate Pelagibacter and Puniceispirillum phages were identified, along with 12 complete archaeal virus genomes, 25 genomic fragments of eukaryotic viruses, and a putative Crocosphaera phage-like element (Fig. S3.4, S3.5). The curated viral populations contained 236 novel viral proteins (Table S3.10 in the publication) not previously reported from marine environments (15,27). These proteins included auxiliary metabolic genes in nitrogen (nitronate monooxygenase) and phosphorus cycling (transporter), phage holin, a toxin-antitoxin system (Fig. S3.4), and CRISPR-associated proteins.

Viral DNA contribution to total DNA. On average across all depths, the ALOHA 2.0 viral dataset recruited 42% of reads from the virus-enriched fraction and 8% of reads from the cell-enriched fraction (Fig. 3.1). Assuming that viruses account for 56% of the total DNA in the <0.22µm fraction (57), our dataset recovered on average 75% of the total viral DNA from this habitat. In the virus-enriched fraction, viral contribution to total DNA peaked near the base of the euphotic zone, 150-250 m. This increase in relative viral DNA contribution may reflect increased cellular turnover and viral production, increased temperate phage induction (described below), or decreased decay rates due to lower ambient UV light fluxes and temperature at these depths (58).
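The 75% recovery estimate follows from dividing the share of reads recruited by the assumed viral share of DNA in that fraction. The short Python check below makes the arithmetic explicit; the function name is hypothetical, and equating the read fraction with the DNA fraction is a simplifying assumption.

```python
def viral_dna_recovered(reads_recruited, viral_share_of_dna):
    """Fraction of total viral DNA captured by the database.

    Assumes the proportion of reads recruited equals the proportion of DNA.
    """
    return reads_recruited / viral_share_of_dna


# 42% of virus-enriched reads recruited; viruses assumed to be 56% of DNA.
recovered = viral_dna_recovered(0.42, 0.56)
print(f"{recovered:.0%}")
```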

In the cell-enriched fraction, relative viral contribution to total DNA peaked near the deep chlorophyll maximum (DCM) around 100 m, extending previous observations of subsurface cyanophage maxima in cell-enriched fractions at Station ALOHA (59). Since other sources of DNA contribution are mostly cellular in origin, the increase in the ratio of viral-to-bacterial DNA likely indicates a combination of increased active infections and prophages in these cell populations.

Viral diversity hotspot at the base of the euphotic zone. Within-sample α-diversity and richness revealed a hotspot for virioplankton diversity, and confirmed one previously reported for prokaryotic diversity (60), at the base of the euphotic zone between 150-250 m (Fig. 3.2). This peak in diversity could reflect both habitat variability and transitions in microbial metabolic diversity. For example, in this environment, photoautotrophic cyanobacteria dominate in the photic zone, while chemolithotrophic ammonium-oxidizing Thaumarchaeota and (presumptive) heterotrophic Euryarchaeota are both more abundant in deep waters. However, both cyanobacterial and archaeal groups do co-exist at transitional depths in and just below the deep chlorophyll maximum (60). Similarly, while cyanophages were found predominantly in shallow waters and archaeal viruses mostly below the photic zone, viruses from both cyanobacteria and archaea were found at transitional depths between 150-250 m (annotations shown on Fig. 3.3, top bar).

Previously uncharacterized viruses also increased in abundance below the upper surface waters, and represented >50% of viral assemblages below 125 m (Fig. 3.1). Elevated richness in oligotrophic picoplankton communities just below the photic zone was evident not only in bacteria and archaea, but also their viruses.

Depth-dependent patterns in temperate phage integration and induction. The community-wide abundances of 1352 putative temperate phages displayed depth-specific patterns (Fig. 3.4). In cell-enriched samples, we postulate that putative temperate phages represented integrated prophages, which increased at and below the DCM, consistent with previous reports of apparent increased lysogeny in deeper waters (27,61,62). Depth-dependent changes in viral reproductive strategies were also evident within groups at the genus level, particularly in SAR11 phages with broad depth ranges (Fig. S3.7). Our results suggest that community-wide changes in reproductive strategies are driven not only by specific host groups, but also partly by environmental variability along the water column. Given that cellular host abundance generally decreases with depth (Fig. S3.2c), our results do not appear to support the piggyback-the-winner hypothesis (63) in this oligotrophic pelagic habitat. It seems probable that virus-host dynamics may have different trajectories in different environmental, biological and ecological contexts, and may not be driven simply by bulk numerical host-cell and virus-particle ratios alone. In oligotrophic pelagic environments, high prokaryote cell densities in surface waters correspond to smaller average genome sizes (46,64). Within and below the DCM, low cell densities correspond to larger average genome sizes (46). Host cell genome sizes (and therefore genomic “real estate” available for prophage, genomic islands and other mobile elements), rather than cell densities, might drive viral reproductive strategies (64,65).

In virus-enriched samples, we postulate that temperate phage signal represents temperate virus particles, which peaked at 150-250 m (Fig. 3.4). This peak may reflect increased temperate phage productivity relative to deeper waters associated with more slowly growing hosts (66). Temperate phage production at these intermediate depths appeared to be episodic, as indicated by temporal variability in populations’ extracellular to intracellular abundance (VC) ratios (Fig. 3.5). Low VC variability reflects phages with constant or no viral particle production. High VC variability indicates phages with more episodic virus particle production. The VC variability of temperate phages peaked at 150-250 m (Fig. 3.5), consistent with episodic induction and production. Episodic temperate phage induction may be driven by temporal resource variability (67–69), or increased cellular stress due to lower light for photoautotrophs at these depths (Fig. S3.2f).

Given the low abundance of temperate phages in the upper 75 meters (Fig. 3.4), we infer that lytic phage-host interactions were prevalent in surface waters, potentially reflecting consistently high productivity and host abundance that can favor lytic strategies (66,70–72). Temperate phage induction and production appeared to peak at the base of the euphotic zone (Figs. 3.4, 3.5), potentially reflecting increased environmental variability at these depths that might favor a flexible reproductive strategy (69,73). Evidence for host-integrated prophages increased in mesopelagic waters (Fig. 3.4), possibly reflecting decreases in host productivity and abundance (72,73), or an increase in genome size that might better accommodate prophages (46,65). Using size fractionation to separate intra- and extra-cellular viruses, our results revealed depth-specific viral reproductive strategies from the surface to mesopelagic ocean.

VC ratio to confirm genomic temperate phage identification. The VC ratio of each population represents its extracellular to intracellular abundance ratio. On average, temperate phages, or phages persisting intracellularly for long periods before initiating a lytic cycle, are expected to have lower VC ratios relative to strictly lytic phages. Indeed, temperate phages on average displayed lower VC ratios than other coexisting populations (Welch’s t-test, p=0.02). This trend provided further support for our temperate phage identifications using genomic markers. Nevertheless, intracellular temperate phage counts may reflect a variety of life history states: prophage DNA still integrated in the host genome, replicating temperate phages that have excised from the host genome, or temperate phages that entered directly into the lytic cycle immediately post-infection (in eclipse phase). Presuming that the temperate phage signal in the virus-enriched fraction represents packaged phage particles, ~98% of putative temperate phages appeared to have the ability to produce viral particles in situ (Fig. S3.6b, Table S3.3 in the publication). Hence, the lower VC ratios found in the temperate phages do appear to reflect their different lifestyles and reproductive strategies, relative to non-temperate phages.
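The comparison above can be sketched in a few lines. This is a minimal illustration, not the analysis code used in the study: the function names and the toy abundance values are hypothetical, and only the Welch t statistic and its Welch-Satterthwaite degrees of freedom are computed (a p-value would additionally require the t distribution).

```python
from math import sqrt
from statistics import mean, variance


def vc_ratio(virus_fraction_abundance, cell_fraction_abundance):
    """Extracellular-to-intracellular abundance ratio of one population."""
    if cell_fraction_abundance <= 0:
        raise ValueError("population must be detected in the cell-enriched fraction")
    return virus_fraction_abundance / cell_fraction_abundance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two group means (unequal variances allowed).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical (virus-enriched, cell-enriched) relative abundances.
temperate = [vc_ratio(v, c) for v, c in [(0.002, 0.004), (0.001, 0.004), (0.003, 0.004)]]
other = [vc_ratio(v, c) for v, c in [(0.008, 0.004), (0.006, 0.004), (0.010, 0.004)]]
t_stat, dof = welch_t(temperate, other)
```

A negative `t_stat` here corresponds to lower VC ratios in the temperate group, the direction reported in the text.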

Environmental distribution of viral AMGs. Viral copies of AMGs potentially encode rate-limiting steps in host metabolism that are essential for viral propagation (reviewed in 11). We show here that the abundances of three key AMGs involved in energy and nutrient acquisition vary with depth, potentially indicating cellular host variability or energy or nutrient limitation at specific depths.

Virus-encoded copies of photosystem reaction center genes were first observed in a Synechococcus phage genome, were thought to prevent photoinhibition by supplementing declining host photosynthesis during infection (74), and were found to be co-expressed with high-light genes in cyanophages infecting high-light Prochlorococcus hosts (4,75). Photosystem genes were subsequently observed in many cyanophage genomes (76), yet their environmental distribution in the water column remains relatively unexplored. Here, we observed a three-fold increase in abundance of phages carrying photosystem genes from the surface waters to low-light environments around the DCM (Fig. 3.6a). Because cyanophages and cyanobacteria were most abundant in the surface ocean (Figs. 3.6a, S3.8a), this increase implies that a higher proportion of cyanophages carried photosystem genes at the DCM relative to surface waters. Our observations suggest that virus-encoded photosystem genes may be more advantageous in light-limited conditions, potentially to prevent light-energy limitation in hosts during infection. Alternatively, longer latent periods, consistent with our observed increase in viral DNA in the cell-enriched fraction at the DCM (Fig. 3.1a), might also favor phage-encoded photosystem genes at this depth (76).

Viral-encoded ammonium transporter genes, which shared highest similarity to bacterial homologues (Table S3.9 in the publication), increased at and below the DCM in tandem with Thaumarchaeota abundance (Fig. 3.6b). Ammonium-oxidizing Thaumarchaeota in the ocean have demonstrated high affinity for ammonia (77) and therefore may compete with bacteria in ammonia-limited open-ocean environments (78). As a result, bacteriophage-encoded copies of ammonium transporter genes might assist in ammonium acquisition during infection, particularly at depths with higher abundances of Thaumarchaeota competitors.

PhoH genes are used during phosphorus starvation (79), upregulated during cyanophage infection (80), present in diverse groups of marine viruses (5,7,81–83), and can be used as a marker gene for assessing viral assemblages (26,84). Viruses are richer in phosphorus relative to cells (85), potentially driving an enrichment of this auxiliary metabolic gene in marine viruses (84) that inhabit relatively nutrient-limited open oceans. At Station ALOHA, up to 10% of viruses encoded copies of phoH genes, peaking at the DCM in tandem with a maximum in the total dissolved nitrogen to phosphorus ratio (Fig. 3.6c). Our observations suggest that the phoH gene might assist in phosphorus acquisition during viral infection in relatively phosphorus-limited environments.

Spatiotemporal distributions: temporal persistence and depth distributions. Many virioplankton populations displayed temporal persistence through the 1.5-year time series (Figs. 3.3, S3.9), consistent with recent studies at Station ALOHA (22,27). However, some virioplankton populations occurred only at specific times of year (Fig. S3.10), potentially reflecting the seasonal variability of physicochemical variables and their cellular hosts. Of the 9579 viral populations that recruited reads from the ALOHA 2.0 cell-enriched samples from 2014-2016, 6959 (76%) also recruited reads from cell-enriched samples from 2010-2011 (Table S3.3 in the publication). This strong temporal persistence indicates that viral genome evolution may be under stabilizing selection, constrained to very specific loci, or occurring at finer resolutions than the 95% ANI used here for read-mapping and defining populations. Our results show that many indigenous phage populations persist over multiannual cycles, indicative of continuous infection, replication, and lytic cycles that maintain this persistence.

Viral assemblages displayed stratified depth structure, with shifting depth ranges along the transition zone between surface and mesopelagic waters (Figs. 3.3, S3.9). This depth structure likely reflects biogeochemical gradients (86) and depth partitioning of bacterial host communities (46). Consistent with a previous study at Station ALOHA (27), we found no evidence of eurybathic viruses in either the cell-enriched or virus-enriched size fractions.

Depth distributions of archaeal viruses and their hosts. Several different archaeal lineages, including Euryarchaeota and Thaumarchaeota (formerly Group I marine Crenarchaeota (87)), are commonly found along the water column at Station ALOHA (59,88,89). Euryarchaeota apparently function as organic matter degrading heterotrophs, while Thaumarchaeota appear to function as chemolithoautotrophic ammonium oxidizers (78,90). We identified 161 putative archaeal virus populations, 12 of which represented complete genomes. We classified 67 of these populations as euryarchaeal and 60 as thaumarchaeal, with the latter encompassing crenarchaeal populations (Table S3.3 in the publication). In both size fractions, total archaeal virus abundance peaked at and below the DCM (Fig. S3.8b), with putative thaumarchaeal viruses driving most of this change. Relative abundances of archaeal viruses were consistent with those of archaea, though somewhat lower in magnitude (Fig. S3.8b), suggesting that our methods of archaeal virus identification were conservative but accurate.

Putative Crocosphaera phages. The unicellular cyanobacterium Crocosphaera can fix both nitrogen and carbon, and plays an important role in fueling primary production in nutrient-poor open ocean environments (91). Although it is the third-most abundant cyanobacterium at Station ALOHA, after Prochlorococcus and Synechococcus (52), no Crocosphaera phage sequences have yet been identified in current databases. We identified the genome of a putative Crocosphaera phage or phage parasite based on multiple lines of evidence: sequence similarity to Crocosphaera in 11 of 15 proteins, spatiotemporal abundance profiles similar to those expected from Crocosphaera (summer in the upper ocean) during the 1.5-year time series, and presence during a confirmed Crocosphaera “bloom” near Station ALOHA around the same time (52). This genome has a higher GC content (41.7%) than other co-occurring viruses (37.3%), likely reflecting the higher GC content of Crocosphaera at ~37.4% (92), compared to abundant surface Prochlorococcus at ~32% (46). Accounting for up to 2% of viral DNA in the virus-enriched size fraction, this genome likely represents a phage, plasmid, or phage-inducible chromosomal island (PICI) packaged into phage particles. PICIs are phage parasitic DNA elements that can hijack another infecting phage’s packaging machinery and mobilize in infectious phage-like particles (93,94). The genome (11 kbp) appears complete and encodes a phage integrase (Fig. S3.6), characteristic of both temperate phages and PICIs. In addition to this genome, we identified 6 other potential Crocosphaera phages (Table S3.3 in the publication) based on abundance profiles at Station ALOHA similar to those of Crocosphaera, their presence during a Crocosphaera “bloom” in one set of samples, and their dissimilarity from Prochlorococcus or Synechococcus gene homologues.

Conclusion

Viruses impact the ecology and biogeochemistry of microbial communities across the oceans. Our study recovered a large fraction of dsDNA viruses found at Station ALOHA, revealing a hotspot for microbial diversity at the base of the euphotic zone, where the majority of viruses were distinct from those previously reported. The concurrent characterization of both intracellular and extracellular viruses provided independent support of temperate phage identification using marker genes, and revealed community-wide shifts in viral reproductive strategies. In this open-ocean environment, lytic interactions dominated the upper photic layer above the DCM, and temperate phages were most abundant in the mesopelagic ocean. Temporal variability in reproductive strategies appeared most prevalent in transitional depths at the base of the euphotic zone, marked by a peak in prophage integration and induction. Environmental distributions of viral auxiliary metabolic genes also displayed depth-specific patterns in energy and nutrient acquisition. For example, photosystem genes were most abundant at light-limited depths around the DCM; bacteriophage-encoded ammonium transporter genes increased along with potential ammonia-oxidizing thaumarchaeal competitors; and phosphorus starvation genes increased in tandem with the N:P ratio. Although most viruses were temporally persistent over several years, some displayed temporal variability that appeared to reflect the seasonal distributions of specific hosts. These temporal patterns, in conjunction with other lines of evidence, led to the identification of putative Crocosphaera phages or phage parasites that have not been previously identified. Taken together, these new data and analyses provide new insight on the spatiotemporal patterns of planktonic viral diversity, reproductive strategies and metabolic repertoires in the open ocean. Furthermore, simultaneous enumeration of extracellular and intracellular viruses and their hosts sets the stage for delineating more specific host-virus interactions.

Acknowledgements

We thank the captain and crew of R/V Kilo Moana, R/V KOK, Hawaii Ocean Time-series, and SCOPE-ops team for cruise organization, sample collection, and oceanographic data acquisition. We thank Jackie Mueller and Grieg Steward for their previous methods development using Anotop filters, and Paul Den Uyl for contributions to library preparation and sequencing. We thank Wei Qin, John Beaulaurier, Dominique Boeuf, Fuyan Li, Matthew Sullivan, and Murat Eren for discussions on data analysis. This project is funded by the Simons Foundation (#329108) and the Gordon and Betty Moore Foundation (GBMF 3777) to EFD. Partial support for EL was provided by the Natural Sciences and Engineering Research Council of Canada (PGSD3-487490-2016). This work is a contribution of the Simons Collaboration on Ocean Processes and Ecology and the Center for Microbial Oceanography: Research and Education.

Figures

[Figure 3.1 plot. Panels: a. virus-enriched, b. cell-enriched. y-axis: depth (meters), 5–500; x-axis: proportion of total sequenced DNA, 0.0–0.6. Legend (number of populations): known phages (464), uvMED (691), uvDEEP (138), Med2017 (107), GOV (3355), EV (2592), novel viruses (8079).]

Figure 3.1. Depth profiles of time-averaged viral and prokaryotic contributions to total sequenced DNA in a. virus-enriched and b. cell-enriched size fractions. Relative abundances of viral populations are colored by average amino acid identity (AAI) (>60% AAI across >50% genes) to known phages in RefSeq, and five other viral metagenomic datasets: uvMED (50), uvDEEP (24), Med2017 (25), GOV (15), and EV (16). Legend shows number of populations identified in each category.

[Figure 3.2 plot. Panels: a, b, c. y-axis: depth (meters), 5–500; x-axes: Shannon's diversity, richness, evenness. Series: viral, prokaryotic.]

Figure 3.2. α-diversity depth profiles of viral and prokaryotic assemblages: a. Shannon diversity, b. richness (number of populations), and c. evenness (Shannon diversity divided by log richness). Solid and open circles represent viral and prokaryotic assemblages, respectively, averaged through time (mean +/- SE).
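The three α-diversity metrics in this caption can be sketched as follows. This is an illustrative implementation only: the function name is hypothetical, and the natural logarithm is an assumption (the caption does not state the base; evenness is unaffected as long as the same base is used for both terms).

```python
from math import log


def alpha_diversity(abundances):
    """Return (Shannon diversity, richness, evenness) for one assemblage.

    abundances: counts or relative abundances, one value per population.
    """
    present = [a for a in abundances if a > 0]  # undetected populations are ignored
    total = sum(present)
    proportions = [a / total for a in present]
    shannon = -sum(p * log(p) for p in proportions)
    richness = len(present)
    # Evenness: Shannon diversity divided by log richness (undefined for one population).
    evenness = shannon / log(richness) if richness > 1 else 0.0
    return shannon, richness, evenness


# Four equally abundant populations give the maximum evenness of 1.
shannon, richness, evenness = alpha_diversity([10, 10, 10, 10])
```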

[Figure 3.3 plot. Legend: taxonomy (cyanophage, SAR11 phage, SAR116 phage, archaeal virus, eukaryotic virus); time (2014-11 to 2016-04); coverage (0 to max). Rows: depth (m), 5–500.]

Figure 3.3. Spatiotemporal distributions of annotated virus populations present in the virus-enriched size fraction. Each node on the top dendrogram and its associated column represents the coverage profile of one virus population, colored by its corresponding taxonomic annotation. Rows represent individual samples that are horizontally ordered by depth and time. The height of the black bar in every sample shows mean interquartile range (IQR) coverage of every population, normalized to the maximum IQR coverage in that sample. Time-averaged depth profiles (mean +/-SE) of environmental variables of photosynthetically active radiation (PAR), fluorometric chlorophyll a, and inorganic nitrogen are shown in colored triangles on the right panel (data retrieved from Hawaii Ocean Time-Series HOT-DOGS application).

[Figure 3.4 plot. y-axis: depth (meters), 5–500; x-axis: proportion temperate, 0.1–0.5. Series: virus-enriched, cell-enriched.]

Figure 3.4. Depth profiles of relative abundances of putative temperate phages. Circles represent time-averaged depth profiles (mean +/- SE) of the proportion of all viruses that were identified as temperate in the virus-enriched size fraction (closed circles) and cell-enriched size fraction (open circles).

[Figure 3.5 plot. y-axis: depth (meters), 5–500; x-axis: VC ratio temporal variability. Series: temperate phages, other viruses.]

Figure 3.5. Depth profiles of temporal variabilities of VC ratios (mean +/-SE) of 1352 inferred temperate phages (blue) and other viral populations (orange). Higher VC ratio temporal variability indicates episodic production of free viral particles, while lower variability indicates consistent production of free viral particles. Temporal variability of VC ratios is calculated for each population by pooling its VC ratios within each depth to determine the mean-normalized variance.
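One possible reading of the variability metric in this caption is sketched below. The caption does not give a formula, so this sketch assumes that "mean-normalized variance" means the sample variance of a population's VC ratios at one depth divided by their mean; dividing by the squared mean would be another plausible reading. The function names and toy values are hypothetical.

```python
from statistics import mean, variance


def vc_temporal_variability(vc_ratios):
    """Sample variance of VC ratios divided by their mean (assumed definition)."""
    if len(vc_ratios) < 2:
        raise ValueError("at least two time points are needed")
    average = mean(vc_ratios)
    if average == 0:
        raise ValueError("mean VC ratio is zero")
    return variance(vc_ratios) / average


def variability_by_depth(vc_by_depth):
    """Pool one population's VC ratios within each depth, then score each depth."""
    return {depth: vc_temporal_variability(ratios) for depth, ratios in vc_by_depth.items()}


# Hypothetical population: steady particle production at 25 m, episodic at 125 m.
profile = variability_by_depth({25: [2.0, 2.0, 2.0], 125: [0.5, 0.5, 5.0]})
```

Under this definition the steady depth scores 0 and the episodic depth scores higher, matching the interpretation given in the caption.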

[Figure 3.6 plot. Panels: a. viruses with photosystem, b. viruses with NH4+ transporter, c. viruses with phoH. y-axis: depth (meters), 5–500. Lower axes: cyanophage abundance, Cyanobacteria abundance, Thaumarchaeota abundance, total dissolved N:P.]

Figure 3.6. Distribution of viruses containing specific auxiliary metabolic genes (AMGs) in the water column. Solid circles show depth profiles, averaged through time (mean +/- SE), of the abundance-normalized proportion of virus populations carrying auxiliary metabolic genes respectively for carbon, nitrogen, and phosphorus metabolism: a. photosystem, b. ammonium transporter, and c. phoH. Time-averaged depth profiles of the following are included for environmental context: a. relative abundances of cyanophages in the virus-enriched fraction (small open circles) and of cyanobacteria in the cell-enriched fraction (large open circles), photosynthetically active radiation (PAR) (yellow triangles), and fluorometric chlorophyll a (green triangles); b. Thaumarchaeal relative abundance in the cell-enriched fraction (open circles) and inorganic nitrogen concentrations (purple triangles); c. total dissolved nitrogen (blue triangles), phosphorus (red triangles), and their ratio (open circles). Environmental metadata were retrieved from Hawaii Ocean Time-Series HOT-DOGS application. Grey box highlights depths around the deep chlorophyll maximum (DCM).

Supplementary Figures

Figure S3.1. Bioinformatic workflow. Data products included in this publication are highlighted in black.

Figure S3.2. Hydrography depth profile time-series of samples collected during the Hawaii Ocean Time-Series and used to prepare the ALOHA 2.0 virus database. Points represent time and depth of 186 samples taken for the 0.2–0.02 µm virus-enriched size fraction. Lines in panels a and f respectively represent the mixed layer depth (surface density offset of 0.125) and euphotic zone depth (1% of surface PAR). Data obtained via the Hawaii Ocean Time-Series HOT-DOGS application, University of Hawai'i at Mānoa, National Science Foundation Award #1756517 (http://hahana.soest.hawaii.edu/hot/hot-dogs).

[Figure S3.3 plot. Panels: a. RefSeq hits to archaea/archaeal virus (632); b. archaeal protein marker (53); c. RefSeq hits to archaea/archaeal virus, refined >0.5, >0.8 (147). y-axis: proportion of proteins without RefSeq hits, 0.0–1.0; x-axis: ratio of top protein hits to archaea/archaeal virus:Bacteria, 0–4.]

Figure S3.3. Cut-off selection for identifying putative archaeal viruses. The proportion of proteins without RefSeq84 hits (y-axis) and the ratio of top hits to Archaea/archaeal virus vs. Bacteria (AB ratio, x-axis) were calculated for a. all possible archaeal viruses with one or more proteins with a top hit to Archaea or archaeal virus on RefSeq, and b. high-confidence archaeal viruses containing archaeal protein markers from PFAM (bit score >30). Based on the high-confidence archaeal viruses in b., we used cutoffs of >0.8 proportion of unknown proteins and >0.5 AB ratio to refine populations in a., retaining 147 putative archaeal viruses shown in c. Combining b. and c. resulted in a final set of 161 putative archaeal viruses.
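The refinement described in this caption amounts to a two-threshold filter followed by a set union. The sketch below is illustrative only: the function name, data layout, and population names are hypothetical, and it assumes that "combining b. and c." means taking the union of the two sets.

```python
def refine_archaeal_candidates(candidates, marker_confirmed,
                               min_unknown=0.8, min_ab_ratio=0.5):
    """Apply the caption's cutoffs, then add marker-confirmed populations.

    candidates: {population: (proportion of proteins without RefSeq hits, AB ratio)}
    marker_confirmed: populations carrying archaeal PFAM protein markers (panel b)
    """
    refined = {
        name
        for name, (unknown_fraction, ab_ratio) in candidates.items()
        if unknown_fraction > min_unknown and ab_ratio > min_ab_ratio
    }
    return refined, refined | set(marker_confirmed)


# Hypothetical populations for illustration.
candidates = {
    "pop_a": (0.90, 2.0),  # passes both cutoffs
    "pop_b": (0.70, 3.0),  # too many proteins with RefSeq hits
    "pop_c": (0.95, 0.4),  # AB ratio too low
}
refined, final_set = refine_archaeal_candidates(candidates, {"pop_b", "pop_d"})
```

Because a population can appear in both the refined set and the marker-confirmed set, the final count can be smaller than the sum of the two, as with 147 and 53 giving 161 in the caption.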

Figure S3.4. Examples of viral genomes in the ALOHA 2.0 database. Top captions represent contig number, putative hosts based on the highest number of protein hits to reference genome, genome completion, and size. The outer ring represents ALOHA 2.0 viruses, and the inner ring represents reference genomes with 10 or more protein hits. Blue shading between the genomes represents amino acid identity from LAST alignments. Genes of interest are color-coded by functional groups (PFAM bit score >30).

[Figure S3.5 genome map. Contig 11140: putative Crocosphaera phage, complete genome, 11 kbp. Annotation tracks: RefSeq84 (AAI, bit score), with top hits to Crocosphaera; PFAM (bit score); eggNOG (bit score). Annotation labels include P10 protein, D5 N-terminal, Helix-turn-helix, CHY zinc finger, Phage integrase, Primase C terminal 2, Nucleopolyhedrovirus, and Virus database.]

Figure S3.5. Genomic structure of a putative Crocosphaera phage or phage parasite. Top caption represents contig number, putative host, genome completion, and size. Genes of interest are color-coded by functional groups (PFAM bit score >30) described in Fig. S5. Taxonomic and functional annotations are included for RefSeq84, PFAM, and eggNOG databases.


Figure S3.6. Size fraction distribution and genome sizes of putative temperate phage and other viruses. a. Median VC ratio across all samples, relative abundance profiles in the b. virus-enriched and c. cell-enriched size fractions, and d. genome sizes of 1543 complete circular viral populations. Inferred temperate phages are shown in blue, while all other populations are shown in orange.

[Figure S3.7 plot. Title: SAR11 phage. y-axis: depth (meters), 5–500; x-axis: proportion temperate, 0.3–0.7. Series: virus-enriched, cell-enriched.]

Figure S3.7. Depth profiles of time-averaged proportion (mean +/-SE) of putative temperate SAR11 phages normalized to all SAR11 phages. Open circles represent phages captured in the cell-enriched fraction, and closed circles represent phages captured in the virus-enriched fraction.

[Figure S3.8 plot. Panels: a. cyanophage, b. thaumarchaeal virus. y-axis: depth (meters), 5–500. Lower axes: Cyanobacteria, Thaumarchaeota. Series: virus V, virus C, host.]

Figure S3.8. Virus-host relative abundances for a. cyanophage and b. thaumarchaeal virus. Time-averaged depth profiles (mean +/-SE) show viruses in the virus-enriched size fraction (small closed circles), viruses in the cell-enriched size fraction (large closed circles), and hosts in the cell-enriched size fraction (open circles).


Figure S3.9. Spatiotemporal distributions of all virus populations present in the virus-enriched size fraction. Each node on the top dendrogram and its associated column represents the coverage profile of one virus population. The green bar near the dendrogram represents 171 populations displaying summer blooms in the upper ocean, representing possible Crocosphaera phages for downstream identification. Rows represent individual samples that are horizontally ordered by depth and time. The height of the black bar in every sample shows mean coverage (calculated using only the second and third quartile) of every population, normalized to the maximum coverage in that sample.

[Figure S3.10 plot. Left panels: relative abundance by year (2015–2016). Right panels: Spearman’s rho (0–1) against temp, chl, cyano, hbact, nit, and phos. Rows: cyanophage and SAR11 phage at 5 m; cyanophage and SAR11 phage at 125 m; archaeal virus and SAR11 phage at 250 m.]

Figure S3.10. Temporal variability in relative abundances of select annotated virus populations at 5m, 125m, and 250m, and their correlation with environmental variables: potential temperature (temp), fluorometric chlorophyll a (chl), Prochlorococcus+Synechococcus abundance (cyano), heterotrophic bacterial abundance (hbact), nitrate+nitrite (nit), and phosphate (phos). Stars represent significant Spearman’s rho correlations (P<0.05). White shading indicates missing data.

Supplementary tables

Tables available at https://www.nature.com/articles/s41396-020-0604-8#Sec26

Table S3.1. Sample information and associated metadata. Sample naming conventions are as follows: HSD[size fraction]-[HOT cruise number]-[depth in meters]-[cast number]-[sequencing date in yymmdd]. Data obtained via the Hawaii Ocean Time-Series HOT-DOGS application, University of Hawai'i at Mānoa, National Science Foundation Award #1756517 (http://hahana.soest.hawaii.edu/hot/hot-dogs).
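The naming convention above can be unpacked programmatically. The following is a minimal sketch, not a tool distributed with the dataset: the class and function names are hypothetical, the example sample name is invented for illustration, and the cruise, depth, and cast fields are assumed to be plain integers.

```python
from datetime import date, datetime
from typing import NamedTuple


class SampleName(NamedTuple):
    size_fraction: str
    hot_cruise: int
    depth_m: int
    cast: int
    sequencing_date: date


def parse_sample_name(name):
    """Split HSD[size fraction]-[cruise]-[depth]-[cast]-[yymmdd] into fields."""
    fields = name.split("-")
    if len(fields) != 5 or not fields[0].startswith("HSD"):
        raise ValueError(f"unexpected sample name: {name!r}")
    prefix, cruise, depth, cast, yymmdd = fields
    return SampleName(
        size_fraction=prefix[len("HSD"):],  # whatever code follows the HSD prefix
        hot_cruise=int(cruise),
        depth_m=int(depth),
        cast=int(cast),
        sequencing_date=datetime.strptime(yymmdd, "%y%m%d").date(),
    )


# Invented name following the stated pattern; real names are listed in Table S3.1.
sample = parse_sample_name("HSDV-280-125-2-170315")
```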

Table S3.2. Sequencing, initial assemblies, VIRSorter contigs, and read statistics for all samples. Columns in order represent sample name, number of quality-controlled reads in each sample, number of contigs assembled by metaSPAdes, number of contigs >3kb, number of >3kb contigs that were identified as viral (all categories) by VIRSorter, and number of viral reads mapping to VIRSorter-identified contigs. Two smaller columns on the right represent the number of viral reads summed across depth. Sampling points with multiple sequencing runs were included at this initial stage prior to viral-specific reassembly. In post-reassembly analyses, only the largest sequencing run for any sampling point was used. Sample naming conventions are described in Table S3.1.

Table S3.3. Information for 17 369 >10kbp ALOHA 2.0 virus populations: name, GC content, length, circularity, name of circular genome representative (if redundant), chimeric signature, temperate phage identification, homology (>60% AAI across >50% genes) to RefSeq84 viruses or viral metagenomic datasets (names consistent with Fig. 3.1), modified homology (any AAI across >50% genes) to RefSeq84 viruses, archaeal virus identification, eukaryotic virus identification, Crocosphaera phage identification, presence in virus-enriched fraction, presence in cell-enriched fraction, and presence in 2010-2011 dataset.

Table S3.4. Relative abundances of 17 369 ALOHA 2.0 virus populations in the virus-enriched fraction, approximated by nucleotides mapping to population normalized to nucleotides mapping to all populations. Sample naming conventions are described in Table S3.1.
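The normalization described for this table is a simple proportion. A minimal sketch follows, with hypothetical function and population names and invented nucleotide counts:

```python
def relative_abundances(mapped_nucleotides):
    """Nucleotides mapping to each population divided by nucleotides mapping to all.

    mapped_nucleotides: {population: nucleotides mapped in one sample}
    """
    total = sum(mapped_nucleotides.values())
    if total == 0:
        raise ValueError("no nucleotides mapped to any population in this sample")
    return {population: n / total for population, n in mapped_nucleotides.items()}


# Invented counts for one sample.
abundances = relative_abundances({"pop_1": 600_000, "pop_2": 300_000, "pop_3": 100_000})
```

Values computed this way sum to 1 within each sample, so they are comparable across samples with different sequencing depths.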

Table S3.5. Relative abundances of 17 369 ALOHA 2.0 virus populations in the cell-enriched fraction, approximated by nucleotides mapping to population normalized to nucleotides mapping to all populations. Sample naming conventions are described in Table S3.1.

Table S3.6. Relative abundances of 1543 circular ALOHA 2.0 virus populations in the virus-enriched fraction, approximated by the population’s coverage normalized to summed coverage across all populations. Sample naming conventions are described in Table S3.1.

Table S3.7. Relative abundances of 1543 circular ALOHA 2.0 virus populations in the cell-enriched fraction, approximated by the population’s coverage normalized to summed coverage across all populations. Sample naming conventions are described in Table S3.1.

Table S3.8. Relative abundances of 2568 mOTUs (cellular populations) in the cell-enriched fractions, approximated by the population’s coverage normalized to summed coverage across all populations. Sample naming conventions are described in Table S3.1, with the exception of omitted sequencing date.

Table S3.9. Taxonomic assignments (top hits) of ALOHA 2.0 virus proteins aligned using LAST to the RefSeq84 protein database.

Table S3.10. List of novel viral PFAM domains (bit score >30) from ALOHA 2.0 populations that are distinct from two reported lists in previous metagenomic datasets (13,25).

References

1. Wigington CH, Sonderegger DL, Brussaard CPD, Buchan A, Finke JF, Fuhrman J, et al. Re-examining the relationship between virus and microbial cell abundances in the global oceans. Nat Microbiol. 2016;1:15024.

2. Karl DM, Church MJ. Microbial oceanography and the Hawaii Ocean Time-series programme. Nat Rev Microbiol. 2014;12:1–15.

3. Wilson WH, Joint IR, Carr NG, Mann NH. Isolation and molecular characterization of five marine cyanophages propagated on Synechococcus sp. strain WH7803. Appl Environ Microbiol. 1993;59(11):3736–43.

4. Lindell D, Sullivan MB, Johnson ZI, Tolonen AC, Rohwer F, Chisholm SW. Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci U S A. 2004;101(30):11013–8.

5. Rohwer F, Segall A, Steward G, Seguritan V, Breitbart M. The complete genomic sequence of the marine phage Roseophage. Limnol Ocean. 2000;45(2):408–18.

6. Garcia-Heredia I, Rodriguez-Valera F, Martin-Cuadrado A. Novel group of podovirus infecting the marine bacterium Alteromonas macleodii. Bacteriophage. 2013;3(2):e24766.

7. Zhao Y, Temperton B, Thrash JC, Schwalbach MS, Vergin KL, Landry ZC, et al. Abundant SAR11 viruses in the ocean. Nature. 2013 Feb 21;494:357–60.

8. Kang I, Oh H-M, Kang D, Cho J-C. Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans. Proc Natl Acad Sci U S A. 2013;110(30):12343–8.

9. Weinbauer MG. Ecology of prokaryotic viruses. FEMS Microbiol Rev. 2004;28(2):127–81.

10. Weinbauer MG, Rassoulzadegan F. Are viruses driving microbial diversification and diversity? Environ Microbiol. 2004;6(1):1–11.

11. Suttle CA. Viruses in the sea. Nature. 2005;437(7057):356–61.

12. Puxty RJ, Millard AD, Evans DJ, Scanlan DJ. Viruses inhibit CO2 fixation in the most abundant phototrophs on Earth. Curr Biol. 2016;26:1–5.

13. Hurwitz BL, U’Ren JM. Viral metabolic reprogramming in marine ecosystems. Curr Opin Microbiol. 2016;31:161–8.

14. Brum JR, Ignacio-Espinoza JC, Roux S, Doulcier G, Acinas SG, Alberti A, et al. Patterns and ecological drivers of ocean viral communities. Science. 2015;348(6237):1261498.

15. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and biogeochemical impacts of uncultivated globally abundant ocean viruses. Nature. 2016;537:689–93.

16. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536(7617):425–30.

17. Gregory AC, Zayed AA, Sunagawa S, Wincker P, Sullivan MB, Ferland J, et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell. 2019;177:1–15.

18. Chow CT, Fuhrman JA. Seasonality and monthly dynamics of marine myovirus communities. Environ Microbiol. 2012;14(8):2171–83.

19. Needham DM, Chow C-ET, Cram JA, Sachdeva R, Parada A, Fuhrman JA. Short-term observations of marine bacterial and viral communities: patterns, connections and resilience. ISME J. 2013;7:1274–85.

20. Chow CET, Winget DM, White RA, Hallam SJ, Suttle CA. Combining genomic sequencing methods to explore viral diversity and reveal potential virus-host interactions. Front Microbiol. 2015;6:1–15.

21. Needham DM, Sachdeva R, Fuhrman JA. Ecological dynamics and co-occurrence among marine phytoplankton, bacteria and myoviruses shows microdiversity matters. ISME J. 2017;11:1614–29.

22. Aylward FO, Boeuf D, Mende DR, Wood-Charlson EM, Vislova A, Eppley JM, et al. Diel cycling and long-term persistence of viruses in the ocean’s euphotic zone. Proc Natl Acad Sci U S A. 2017;201714821.

23. Hurwitz BL, Brum JR, Sullivan MB. Depth-stratified functional and taxonomic niche specialization in the ‘core’ and ‘flexible’ Pacific Ocean Virome. ISME J. 2015;9:472–84.

24. Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. Genomes of abundant and widespread viruses from the deep ocean. MBio. 2016;7(4):e00805-16.

25. López-Pérez M, Haro-Moreno JM, Gonzalez-Serrano R, Parras-Moltó M, Rodriguez-Valera F. Genome diversity of marine phages recovered from Mediterranean metagenomes: Size matters. PLoS Genet. 2017;13(9):1–23.

26. Goldsmith DB, Parsons RJ, Beyene D, Salamon P, Breitbart M. Deep sequencing of the viral phoH gene reveals temporal variation, depth-specific composition, and persistent dominance of the same viral phoH genes in the Sargasso Sea. PeerJ. 2015;3:e997.

27. Luo E, Aylward FO, Mende DR, DeLong EF. Bacteriophage Distributions and Temporal Variability in the Ocean’s Interior. MBio. 2017;8(6):e01903-17.

28. Sieradzki ET, Ignacio-Espinoza JC, Needham DM, Fichot EB, Fuhrman JA. Dynamic marine viral infections and major contribution to photosynthetic processes shown by spatiotemporal picoplankton metatranscriptomes. Nat Commun. 2019;10(1169).

29. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J Comput Biol. 2012;19(5):455–77.

30. Roux S, Enault F, Hurwitz BL, Sullivan MB. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3:e985.

31. Arumugam M, Harrington ED, Raes J, Foerstner KU, Bork P. SmashCommunity: a metagenomic annotation and analysis tool. Bioinformatics. 2010;26(23):2977–8.

32. Roux S, Emerson JB, Eloe-Fadrosh EA, Sullivan MB. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017;5:e3817.

33. Li W, Godzik A. Cd-hit : a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.

34. Hyatt D, Chen G, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal : prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11(119).

35. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10).

36. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36(Database issue):281– 8.

37. Mizuno CM, Guyomar C, Roux S, Lavigne R, Rodriguez-Valera F, Sullivan M, et al. Numerous cultivated and uncultivated viruses encode ribosomal proteins. Nat Commun. 2019;10:752.

38. Gudkov AT. The L7 / L12 ribosomal domain of the ribosome : structural and functional studies. FEBS Lett. 1997;407(3):253–6.

39. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21:487–93.

40. Nishimura Y, Watai H, Honda T, Mihara T, Omae K, Roux S, et al. Environmental viral genomes shed new light on virus-host interactions in the ocean. mSphere. 2017;2(2):e00359-16.

41. Imai T. sprai = single pass read accuracy improver [Internet]. 2013. Available from: http://zombie.cb.k.u-tokyo.ac.jp/sprai/

42. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12.

43. Beaulaurier J, Luo E, Eppley J, Uyl P Den, Dai X, Turner DJ, et al. Assembly-free single-molecule nanopore sequencing recovers complete virus genomes from natural microbial communities. bioRxiv. 2019 Apr 26;619684.

44. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:1–9.

45. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ. 2015;3:e1319.

46. Mende DR, Bryant JA, Aylward FO, Eppley JM, Nielsen T, Karl DM, et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat Microbiol. 2017;2(10):1367–73.

47. Mende DR, Sunagawa S, Zeller G, Bork P. Accurate and universal delineation of prokaryotic species. Nat Methods. 2013;10(9):881–4.

48. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger S a, Kultima JR, et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013;10(12):1196–9.

49. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, Mcveigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(Database issue):733–45.

50. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. Expanding the Marine Virosphere Using Metagenomics. PLoS Genet. 2013;9(12).

51. Ahlgren NA, Fuchsman CA, Rocap G, Fuhrman JA. Discovery of several novel, widespread, and ecologically distinct marine Thaumarchaeota viruses that encode amoC nitrification genes. ISME J. 2018;13:618–31.

52. Wilson ST, Aylward FO, Ribalet F, Barone B, Casey JR, Connell PE, et al. Coordinated regulation of growth, activity and transcription in natural populations of the unicellular nitrogen-fixing cyanobacterium Crocosphaera. Nat Microbiol. 2017;2:1–27.

53. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, et al. vegan: Community ecology package [Internet]. 2019. Available from: https://cran.r-project.org/package=vegan

54. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment / Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

55. Casjens SR, Gilcrease EB. Determining DNA Packaging Strategy by Analysis of the Termini of the Chromosomes in Tailed-Bacteriophage Virions. Methods Mol Biol. 2009;502:91–111.

56. Grose JH, Casjens SR. Understanding the enormous diversity of bacteriophages: The tailed phages that infect the bacterial family Enterobacteriaceae. Virology. 2014;468:421–43.

57. Brum JR, Steward GF, Karl DM. A novel method for the measurement of dissolved deoxyribonucleic acid in seawater. Limnol Oceanogr Methods. 2004;2:248–55.

58. Suttle CA, Chen F. Mechanisms and rates of decay of marine viruses in seawater. Appl Environ Microbiol. 1992;58(11):3721–9.

59. Delong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N, et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006;311:496–503.

60. Mende DR, Boeuf D, DeLong EF. Persistent core populations shape the microbiome throughout the water column in the North Pacific Subtropical Gyre. Front Microbiol. 2019;10(October):1–12.

61. Weinbauer MG, Brettar I, Höfle MG. Lysogeny and virus-induced mortality of bacterioplankton in surface, deep, and anoxic marine waters. Limnol Oceanogr. 2003;48(4):1457–65.

62. Williamson SJ, Cary SC, Williamson KE, Helton RR, Bench SR, Winget D, et al. Lysogenic virus-host interactions predominate at deep-sea diffuse-flow hydrothermal vents. ISME J. 2008;2(11):1112–21.

63. Knowles B, Silveira CB, Bailey BA, Barott K, Cantu VA, Cobián-Güemes AG, et al. Lytic to temperate switching of viral communities. Nature. 2016;531(7595):533–7.

64. Swan BK, Tupper B, Sczyrba A, Lauro FM, Martinez-Garcia M, González JM, et al. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proc Natl Acad Sci U S A. 2013;110(28):11463–8.

65. Casjens S. Prophages and bacterial genomics: What have we learned so far? Mol Microbiol. 2003;49(2):277–300.

66. Middelboe M. Bacterial growth rate and marine virus–host dynamics. Microb Ecol. 2000;40:114–24.

67. McDaniel LD, Houchin LA, Williamson SJ, Paul JH. Lysogeny in marine Synechococcus. Nature. 2002;415(6871):496.

68. McDaniel L, Paul JH. Effect of nutrient addition and environmental factors on prophage induction in natural populations of marine synechococcus species. Appl Environ Microbiol. 2005;71(2):842–50.

69. Brum JR, Hurwitz BL, Schofield O, Ducklow HW, Sullivan MB. Seasonal time bombs: dominant temperate viruses affect Southern Ocean microbial dynamics. ISME J. 2015;10(2):1–13.

70. Stewart FM, Levin BR. The population biology of bacterial viruses: why be temperate. Theor Popul Biol. 1984;26(1):93–117.

71. Moebus K. Marine bacteriophage reproduction under nutrient-limited growth of host bacteria. I. Investigations with six phage-host systems. Mar Ecol Prog Ser. 1987;144:1–12.

72. Thingstad TF. Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol Oceanogr. 2000;45(6):1320–8.

73. Williamson SJ, Houchin LA, Mcdaniel L, Paul JH. Seasonal variation in lysogeny as depicted by prophage induction in Tampa Bay, Florida. Appl Environ Microbiol. 2002;68(9):4307–14.

74. Mann NH, Cook A, Bailey S, Clokie M, Amanullah A, Azam N, et al. Bacterial photosynthesis genes in a virus. Nature. 2003;424:741–2.

75. Lindell D, Jaffe JD, Johnson ZI, Church GM, Chisholm SW. Photosynthesis genes in marine viruses yield proteins during host infection. Nature. 2005;438(7064):86–9.

76. Sullivan MB, Lindell D, Lee JA, Thompson LR, Bielawski JP, Chisholm SW. Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol. 2006;4(8):1344–57.

77. Santoro AE, Casciotti KL, Francis CA. Activity, abundance and diversity of nitrifying archaea and bacteria in the central California Current. Environ Microbiol. 2010;12(7):1989–2006.

78. Martens-Habbena W, Berube PM, Urakawa H, de la Torre JR, Stahl DA. Ammonia oxidation kinetics determine niche separation of nitrifying Archaea and Bacteria. Nature. 2009;461(7266):976–9.

79. Kim S, Makino K, Amemura M, Shinagawa H. Molecular analysis of the phoH gene, belonging to the phosphate regulon in Escherichia coli. J Bacteriol. 1993;175(5):1316–24.

80. Lindell D, Jaffe JD, Coleman ML, Futschik ME, Axmann IM, Rector T, et al. Genome-wide expression dynamics of a marine virus and host reveal features of co-evolution. Nature. 2007;449(7158):83–6.

81. Weigele PR, Pope WH, Pedulla ML, Houtz JM, Smith AL, Conway JF, et al. Genomic and structural analysis of Syn9, a cyanophage infecting marine Prochlorococcus and Synechococcus. Environ Microbiol. 2007;9(7):1675–95.

82. Weynberg KD, Allen MJ, Ashelford K, Scanlan DJ, Wilson WH. From small hosts come big viruses: the complete genome of a second Ostreococcus tauri virus, OtV-1. Environ Microbiol. 2009;11(11):2821–39.

83. Moreau H, Piganeau G, Desdevises Y, Cooke R, Derelle E, Grimsley N. Marine Prasinovirus genomes show low evolutionary divergence and acquisition of protein metabolism genes by horizontal gene transfer. J Virol. 2010;84(24):12555–63.

84. Goldsmith DB, Crosti G, Dwivedi B, Mcdaniel LD, Varsani A, Suttle CA, et al. Development of phoH as a novel signature gene for assessing marine phage diversity. Appl Environ Microbiol. 2011;77(21):7730–9.

85. Jover LF, Effler TC, Buchan A, Wilhelm SW, Weitz JS. The elemental composition of virus particles: implications for marine biogeochemical cycles. Nat Rev Microbiol. 2014;12(7):519–28.

86. Karl DM, Lukas R. The Hawaii Ocean Time-series (HOT) program: Background, rationale and field implementation. Deep Sea Res Part II Top Stud Oceanogr. 1996;43(2):129–56.

87. DeLong EF. Archaea in coastal marine environments. Proc Natl Acad Sci U S A. 1992;89(12):5685–9.

88. Karner MB, Delong EF, Karl DM. Archaeal dominance in the mesopelagic zone of the Pacific Ocean. Nature. 2001;409:507–10.

89. Church MJ, Wai B, Karl DM, DeLong EF. Abundances of crenarchaeal amoA genes and transcripts in the Pacific Ocean. Environ Microbiol. 2010;12(3):679–88.

90. Iverson V, Morris RM, Frazar CD, Berthiaume CT, Morales RL, Armbrust EV, et al. Untangling genomes from metagenomes: Revealing an uncultured class of marine Euryarchaeota. Science. 2012;335(6068):587–90.

91. Mohr W, Intermaggio MP, LaRoche J. Diel rhythm of nitrogen and carbon metabolism in the unicellular, diazotrophic cyanobacterium Crocosphaera watsonii WH8501. Environ Microbiol. 2010;12(2):412–21.

92. Bench SR, Ilikchyan IN, James Tripp H, Zehr JP. Two strains of Crocosphaera watsonii with highly conserved genomes are distinguished by strain-specific features. Front Microbiol. 2011;2:1–13.

93. Fillol-Salom A, Martínez-Rubio R, Abdulrahman RF, Chen J, Davies R, Penadés JR. Phage-inducible chromosomal islands are ubiquitous within the bacterial universe. ISME J. 2018;12(9):2114–28.

94. Penadés JR, Christie GE. The Phage-Inducible Chromosomal Islands: A Family of Highly Evolved Molecular Parasites. Annu Rev Virol. 2015;2(1):181–201.

CHAPTER 4. DIVERSITY AND ORIGINS OF VIRUSES ON SINKING PARTICLES IN THE DEEP OCEAN

Abstract

Sinking particles and particle-associated microbial assemblages influence global biogeochemistry by serving as a conduit for particulate matter export from the surface to the deep oceans. Despite ongoing studies on the diversity and activity of particle-associated microbial assemblages, viral diversity within these assemblages has been largely unexplored. In part due to sparse viral representation in current reference databases, assigning taxonomy to novel environmental viruses remains a challenge. In this study, virus genomes associated with sinking particles collected over three years in deep-moored sediment traps at 4000 m in the North Pacific Subtropical Gyre were analyzed.

Sequencing, assembly, and virus-specific reassembly recovered 857 >10 kbp viral population genomes or genomic fragments. We linked 68 viral populations to their cellular hosts via alignments to prophages or CRISPRs in cellular metagenome-assembled genomes. Among these, we identified novel viruses infecting putative deep-sea bacteria such as Colwellia, Moritella, and Shewanella. Relative abundances for some virus-host pairs were nearly identical through time, suggesting that the viruses exist as prophages. In contrast, other virus-host pairs displayed more variable abundance ratios, which might reflect dynamic virus-host interactions. Some viruses displayed inter-annual variability unlike that of host populations, potentially a reflection of negative frequency-dependent selection mediated by host-acquired resistance. Alternatively, this result might instead reflect population heterogeneity and variable pangenomes of closely related host strains that were captured at different times in the sediment traps.

We found evidence for the vertical transport of particle-associated viruses from the upper water column to the deep sea, including cyanophages infecting primary producers from the photic zone. The abundance of viruses from the upper water column positively correlated with carbon export flux. Other viruses of unknown origins were also enriched in samples having high carbon flux, presumably reflecting viral co-transport in sinking surface water populations.

Taken together, these observations shed new light on the diversity and dynamics of viruses found on deep-sea sinking particles in the open oceans.

Introduction

Microbial processes are fundamental to productivity and export in the global oceans. Microbes fuel the biological carbon pump through primary production that fixes inorganic carbon into biomass, some of which sinks into the deep sea in the form of sinking aggregates composed of both organic and inorganic particulate matter (1,2). These sinking particles and their associated microbial communities sequester roughly 4 gigatonnes of our planet’s atmospheric carbon annually (3) and play a critical role in the global carbon cycle. Sinking particles are sometimes “hotspots” of microbial activity, harboring diverse assemblages that play active roles in the remineralization and transport of organic matter in the oceans (4–10). These microbial communities connect the surface and deep oceans (11) and fuel biogeochemical cycling through selective remineralization of labile organic carbon (12).

Previous studies of microbial communities associated with sinking particles have focused mostly on bacteria, archaea, and eukaryotes. Particle-attached microbial communities are often dominated by Bacteria and Archaea that are sometimes distinct, or display variable activity, compared with their planktonic counterparts (13–18). Compared to planktonic samples, sinking particles are enriched in larger “copiotrophic” bacteria typically associated with gut microbiomes, such as Bacteroidetes, δ-, ɛ-, and γ-proteobacterial groups (7–10,13–18). While these prokaryotic communities are currently being studied, viral diversity on sinking particles has not yet been investigated. Viruses are likely present in sinking particles, since viral-like particles have been observed using microscopy (19,20). Exploring this reservoir of genetic diversity could reveal novel viral populations and provide insight into processes that contribute to carbon export in the oceans. For example, identifying the origin of particle-attached viruses could provide insight into the vertical transport of viruses in the ocean. Furthermore, whether viral lysis promotes or even decreases particle export remains an open question. Identifying viruses that correlate with export flux will highlight viral groups that might be worth further investigation.

Determining which viruses are found on sinking particles, their origins and host associations, and whether and how they influence particle export processes, will further our understanding of microbial diversity and biogeochemical processes in our oceans.

This study investigates the diversity and dynamics of viruses on sinking particles in the North Pacific Subtropical Gyre (NPSG), an environment characteristic of the oligotrophic open oceans that cover roughly 40% of our planet (21). To study viruses associated with sinking particles, three years of sediment trap samples collected at 4000 m at Station ALOHA were analyzed. At Station ALOHA, deep-moored sediment traps have been deployed since 1992 (22), providing rich time-series data on particulate export flux. Metagenomic data from sediment trap samples collected between 2014-2016 (9,23) were used to assemble 857 deep-trap viruses (referred to here as DTV) that represent some of the most abundant particle-associated viruses reaching 4000 m in the NPSG. To overcome challenges in identifying novel viruses without sequence representation in current databases, viral populations were linked to their cellular hosts using metagenome-assembled genomes (MAGs) previously determined from the same samples (23,24). Viral diversity and viral-host dynamics were explored across 3 years, providing evidence of vertical transport of viruses, and identifying viral populations that correlated with patterns of carbon export flux. The DTV data and analyses reported here provide new perspectives on the diversity and dynamics of viruses on sinking particles in the open oceans.

Methods

A schematic overview of our workflow is presented in Fig. S4.1.

Sample collection, extraction, and sequencing. Station ALOHA (22°45’ N, 158° W) is a seasonally stable environment located in the NPSG and is a well-characterized sampling site of the Hawaii Ocean Time-series program (25). Metagenomes previously generated from sinking particles collected in a deep-moored sediment trap at 4000 m (9,22,23) were used here to assemble DTVs. This metagenomic dataset contained 63 samples spanning 3 years from 2014-2016 (9,23). Particulate carbon export flux data were generated by Karl and co-workers for the same samples (23). Sediment trap set-up, deployment, recovery, and sample processing for measuring particulate carbon flux were previously described (22). For reference and consistent with previous studies, samples that displayed >=150% of the 28-year mean carbon flux were considered summer export pulse samples (22,23). Methods for extraction, sequencing, read quality-control (QC), and assembly of sediment trap metagenomes have been previously described (9,23). Read sequence data are available on NCBI SRA under Bioproject PRJNA482655.
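The export-pulse criterion reduces to a threshold test against the long-term mean. The following is a minimal sketch only, not code from this study; the function name and the example flux values are hypothetical, and flux and mean are assumed to share the same units.

```python
def is_export_pulse(flux, long_term_mean, threshold=1.5):
    """Return True if a sample's carbon flux is >=150% of the long-term mean."""
    if long_term_mean <= 0:
        raise ValueError("long_term_mean must be positive")
    return flux >= threshold * long_term_mean


# Hypothetical flux values in arbitrary units (not measurements from this study).
mean_flux = 2.0
samples = {"sample_A": 1.8, "sample_B": 3.0, "sample_C": 4.2}
pulse_samples = [name for name, flux in samples.items()
                 if is_export_pulse(flux, mean_flux)]
```

The threshold is inclusive, matching the ">=150%" wording used above.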

Virus-specific reassembly. Viral contigs and associated reads were identified using four methods. 1. >3kb contigs were filtered using VIRSorter v1.03 (26) using the virome database, and 11,610 viral contigs from all categories were retained (Table S4.1). 2. >1kb contigs were filtered using VIBRANT (27), and 15,356 viral contigs were retained (Table S4.1). 3. 43,663 contigs from 121 metagenome-assembled genomes (MAGs) from the same samples (23,24) were filtered through VIRSorter and VIBRANT, and 1123 viral contigs identified from either program were retained. 4. 1470 eukaryotic viruses were collected from NCBI and de-replicated using cd-hit-est at >=95% average nucleotide identity (ANI) to form a eukaryotic virus database. BWA-MEM v0.7.15 (Li 2013) and msamtools (28) were used to identify reads mapping to putative viral contigs at >=95% identity across >=45 bp or to the NCBI eukaryotic viruses at >=70% identity across >=45 bp. A total of 24 million viral reads were reassembled using metaSPAdes v3.13.1 (29), which was chosen due to improved genome recovery and a low rate of generating false apparent circularity (30).
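The read-recruitment thresholds can be illustrated with a small filter over alignment records. This is a sketch under stated assumptions, not the msamtools implementation: the `Alignment` record and `passes_filter` function are hypothetical, and identity is assumed to be the number of matching bases divided by the aligned length.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    read_id: str
    matches: int         # identical aligned bases (assumed definition)
    aligned_length: int  # length of the aligned region in bp


def passes_filter(aln, min_identity=95.0, min_length=45):
    """Keep alignments with >=min_identity percent identity over >=min_length bp."""
    if aln.aligned_length < min_length:
        return False
    identity = 100.0 * aln.matches / aln.aligned_length
    return identity >= min_identity


def recruited_reads(alignments, min_identity=95.0, min_length=45):
    """Return the set of read IDs with at least one passing alignment."""
    return {a.read_id for a in alignments
            if passes_filter(a, min_identity, min_length)}
```

Calling `recruited_reads(alignments, min_identity=70.0)` would apply the more permissive threshold used for the eukaryotic virus database.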

Deep Trap Virus (DTV) database curation. Viral contigs were identified and retained if they met one or more of the following three criteria. 1. The contig was classified as viral by both VIRSorter and VIBRANT (767 contigs). 2. The contig was classified as viral by either VIRSorter or VIBRANT, and contained a phage marker protein (bit score >30 to capsid, head, neck, tail, spike, portal, terminase, clamp loader, T4 proteins, T7 proteins, Mu proteins, excisionase, phage integrase, repressor protein CI, or Cro). Proteins were predicted using Prodigal v2.6.3 (31) and functionally annotated using HMMER v3.2 (32) against the PFAM-A v30 database (33). 537 contigs were identified as viral by VIRSorter and contained one or more phage marker proteins. 1396 contigs were identified as viral by VIBRANT and contained one or more phage marker proteins. 3. The contig contained one or more eukaryotic virus marker proteins (>=30 bit score to NCLDV capsids, envelope, and Poxvirus proteins) with PFAM annotation (674 contigs).
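The three retention criteria combine as a logical OR. A minimal sketch, assuming each contig has already been reduced to four boolean flags (the function and argument names are hypothetical and do not come from the original pipeline):

```python
def retain_contig(virsorter, vibrant, has_phage_marker, has_euk_marker):
    """Apply the three curation criteria; any one is sufficient for retention.

    virsorter / vibrant: contig classified as viral by that tool.
    has_phage_marker: contig encodes at least one phage marker protein.
    has_euk_marker: contig encodes at least one eukaryotic virus marker protein.
    """
    both_tools = virsorter and vibrant                              # criterion 1
    one_tool_plus_marker = (virsorter or vibrant) and has_phage_marker  # criterion 2
    return bool(both_tools or one_tool_plus_marker or has_euk_marker)   # criterion 3
```

Note that a contig flagged by a single tool without any marker protein is not retained under this reading of the criteria.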

If a contig contained prophages identified from VIRSorter and VIBRANT that differed in length, the shorter prophage sequence was retained prior to de-replication. All contigs were clustered with cd-hit-est v4.6 (34) at >=95% ANI, resulting in 2359 non-redundant viral populations. To focus on full genomes or large genomic fragments, only >=10 kbp contigs were retained for the final 857 populations that form the DTV database. Functional annotations were inspected to ensure that no ribosomal proteins were present, with the exception of S21, which was previously found in a cultivated Pelagibacter phage (35). The high proportion of novel viral diversity in our samples precludes using reference genomes for the detection of chimeras, which are expected to occur at a frequency of ~0.5% using metaSPAdes (30). Sequences were inspected for chimeras through self-alignment using LAST v1021 (36) to identify repeats at >=95% ANI across >=5 kbp. Three populations displayed this signature and were noted as chimeras (Table S4.2).

Genomic completion. 109 complete, non-redundant virus genomes (Table S4.2) were identified by looking for terminal repeats indicating apparent circularity (37): i. 83 were identified using VIRSorter; ii. 4 using check_circularity.pl (38); iii. 32 using NUCmer v3.1 (39) to find direct terminal repeats 20-5000 bp in length within 200 bp of both ends (40).
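The idea behind the terminal-repeat test can be sketched in a few lines. This is a deliberate simplification and not the NUCmer-based procedure described above: it only detects exact direct repeats located at the very ends of a sequence, whereas the alignment-based approach tolerates mismatches and repeats offset by up to 200 bp. The function name is hypothetical.

```python
def terminal_repeat_length(seq, min_len=20, max_len=5000):
    """Length of the longest exact direct repeat shared by both ends of seq.

    Returns 0 if no repeat of at least min_len bp is found.
    """
    seq = seq.upper()
    upper = min(max_len, len(seq) // 2)  # the two repeat copies must not overlap
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def looks_circular(seq, min_len=20, max_len=5000):
    """Flag a contig as apparently circular if it carries a direct terminal repeat."""
    return terminal_repeat_length(seq, min_len, max_len) > 0
```

A nonzero result is only evidence of apparent circularity; repeats can also arise from assembly artifacts.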

Identifying virus populations. Taxonomy was assigned to 171 DTVs using these three methods, ordered by priority. 1. 57 viral populations were identified through alignments to putative prophages in MAGs with known taxonomic annotations (Table S4.3). 105 viral populations aligned to MAG scaffolds at >=95% identity across >=50% of the viral contig. To confirm the taxonomic assignments of these MAG scaffolds, proteins were aligned using LAST against the Genome Taxonomy Database (GTDB) release 04-RS89 (41). The MAG scaffold’s taxonomy was considered accurate if it contained the highest number of proteins with hits to the same GTDB genus or family. By retaining only hits to MAG scaffolds with confirmed annotations, 57 virus-host links were retained (Table S4.2). 2. 14 viruses were linked to cellular hosts using CRISPR spacers identified from MAGs with known taxonomic annotations. CRISPR spacers were identified using CRASS (42) from reads mapping to MAGs at >=97% identity across >=75% of the read length, and viral populations were linked to that MAG if they aligned to an entire spacer at 100% ANI (Table S4.2). 3. 37 viruses were identified based on protein alignments to GTDB. Predicted virus proteins were aligned using LAST and taxonomy was broadly assigned if viruses contained >=50% proteins with hits at >=60% amino acid identity (AAI) to a single genus-level group (Table S4.2).
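The protein-based assignment in method 3 amounts to a thresholded majority vote. A minimal sketch, assuming each protein contributes at most one best hit and that the fraction is computed over all predicted proteins in the virus; the function name and input layout are hypothetical:

```python
from collections import Counter


def assign_genus(protein_hits, n_proteins, min_aai=60.0, min_fraction=0.5):
    """Assign a genus if enough proteins hit it at sufficient identity.

    protein_hits: list of (genus, percent_aai), one best hit per protein.
    n_proteins: total number of predicted proteins in the virus.
    Returns the genus name, or None if no genus reaches min_fraction.
    """
    if n_proteins <= 0:
        return None
    counts = Counter(genus for genus, aai in protein_hits if aai >= min_aai)
    if not counts:
        return None
    genus, n_hits = counts.most_common(1)[0]
    return genus if n_hits / n_proteins >= min_fraction else None
```

Hits below the identity threshold are discarded before counting, so a virus with many weak hits remains unassigned.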

To identify the number of novel viral populations recovered from our samples, deep trap viral proteins were aligned using LAST to known marine viruses in RefSeq96 (Table S4.2, (43)), as well as marine viral metagenomic databases available as of early 2020: uvMED (44), uvDEEP (45), GOV (46), EV (47), MED2017 (48), ALOHA 2.0 (49), Nishimura 2017 (37), Coutinho 2017 (50), and GOV2.0 (51). For a conservative estimate of the number of novel populations, populations were considered novel if they did not meet broad taxonomic assignments at >=60% AAI across >=50% of proteins to any reference genome or contig (Table S4.2). Annotated viral proteins were considered novel if they contained PFAM domains not found in previously reported datasets (46,49,52) (Table S4.4).

A total of 184 putative temperate phages were identified by pooling results across four methods. i. Functional annotations identified 94 populations with temperate phage markers (>30 bit score to integrase, excisionase, Cro, or CI repressor). ii. 46 populations shared significant homology (>=95% identity across at least half of the viral contig) with prophages identified by VIRSorter and VIBRANT from original assemblies. iii. VIRSorter and VIBRANT respectively identified 49 and 110 final viral populations as prophages. iv. 48 viruses linked to MAGs were identified as prophages because they shared >=95% ANI to a contiguous MAG chromosomal region and the aligning MAG scaffold was >=10 kb longer than the virus, consistent with flanking cellular regions indicating integrated prophages.
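Criterion iv can be written as a two-part test on identity and flanking length. This is an illustrative sketch only; the function name is hypothetical, lengths are assumed to be in bp, and "10 kb" is taken as 10,000 bp.

```python
def is_integrated_prophage(ani, virus_len, scaffold_len,
                           min_ani=95.0, min_flank=10_000):
    """Flag a MAG-linked virus as an integrated prophage.

    ani: percent nucleotide identity between the virus and the MAG region.
    virus_len, scaffold_len: lengths in bp of the virus and the aligning scaffold.
    The scaffold must exceed the virus by min_flank bp, which is taken as
    evidence of flanking cellular sequence.
    """
    return ani >= min_ani and (scaffold_len - virus_len) >= min_flank
```

The length difference is a proxy for flanking host sequence and does not check on which side of the prophage the extra sequence lies.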

Recovering cellular metagenome-assembled genomes (MAGs) used in viral host identification. The MAGs used here were assembled, quality controlled, and analyzed as previously described (23,24). Assembled MAGs are available under the NCBI accession numbers SAMN14675689-SAMN14675809.

Temporal abundance and persistence. Reads from each sample were mapped using BWA-MEM and filtered using msamtools at >95% identity across >45 bp for viruses, and >97% identity across >45 bp for MAGs. Anvi’o v3 (53) was used to calculate coverage profiles for every sample, using interquartile range (IQR) coverage, which diminishes the effect of conserved or hypervariable regions in respectively over- and under-estimating coverage. IQR coverage normalized to the smallest library size (normalized coverage) was used to approximate relative abundances (Tables S4.5, S4.6). A population was considered present in a sample if it displayed a nonzero IQR coverage (Table S4.2).
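The coverage metric and its normalization can be sketched as follows. This is not the Anvi’o implementation: it assumes that IQR coverage means the mean per-base coverage after discarding the lowest and highest quarters of positions, and that normalization scales each sample by the ratio of the smallest library size to its own library size. Function names are hypothetical.

```python
def iqr_mean_coverage(per_base_coverage):
    """Mean coverage of positions between the first and third quartiles."""
    values = sorted(per_base_coverage)
    n = len(values)
    if n == 0:
        return 0.0
    lo, hi = n // 4, n - n // 4  # drop the lowest and highest quarter
    middle = values[lo:hi]
    return sum(middle) / len(middle)


def normalize_coverage(coverage, library_size, smallest_library_size):
    """Scale coverage to what it would be at the smallest library size."""
    return coverage * smallest_library_size / library_size


def is_present(per_base_coverage):
    """A population is called present if its IQR coverage is nonzero."""
    return iqr_mean_coverage(per_base_coverage) > 0
```

Trimming both tails is what dampens the influence of hypervariable regions (near-zero coverage) and conserved regions (inflated coverage) on the estimate.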

Vertical transport and depth of origin. To look for evidence of vertical transport, contigs from samples collected from 5-500 m in the same environment from 2014-2016, the ALOHA 2.0 virus database (49), were aligned using LAST to deep-trap viruses at >95% identity across >50% of the virus (Table S4.7). To determine viral depth of origin, ALOHA 2.0 reads were mapped using BWA-MEM and filtered using msamtools at >95% identity across >45 bp. Populations were assigned based on the highest normalized IQR coverage to the surface (5 – 75 m), DCM (100 – 125 m), transitional (150 – 250 m) or mesopelagic (500 m) depth of origin (Table S4.2). Populations with zero IQR coverage from reads recruited from this 5 – 500 m dataset were inferred to be bathypelagic (Table S4.2).
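The depth-of-origin rule can be expressed as a lookup on the depth with the highest normalized coverage. A minimal sketch, assuming coverage has already been computed per sampling depth; the function name, the dictionary layout, and the "unassigned" fallback for depths outside the listed ranges are hypothetical.

```python
# Depth ranges in meters, taken from the zone definitions in the text.
DEPTH_ZONES = {
    "surface": (5, 75),
    "DCM": (100, 125),
    "transitional": (150, 250),
    "mesopelagic": (500, 500),
}


def depth_of_origin(coverage_by_depth):
    """Assign a zone from the depth with the highest normalized IQR coverage.

    coverage_by_depth: dict mapping sampling depth (m) to normalized coverage.
    Populations with zero coverage at every depth are inferred bathypelagic.
    """
    if not coverage_by_depth or max(coverage_by_depth.values()) == 0:
        return "bathypelagic"
    best_depth = max(coverage_by_depth, key=coverage_by_depth.get)
    for zone, (top, bottom) in DEPTH_ZONES.items():
        if top <= best_depth <= bottom:
            return zone
    return "unassigned"  # depth falls between the defined zones
```

Ties are resolved in favor of the first depth encountered, which is an arbitrary choice of this sketch.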

Viral correlation with particulate carbon export flux. The WGCNA package in R (54) was used to identify 194 viral populations belonging to three modules that displayed significant Pearson’s correlation (p<0.05, (55)) with log-transformed particulate carbon flux (Table S4.2). Viruses were grouped into modules using the dynamic hybrid method with tree cut height at 0.988.
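The correlation step alone (not the WGCNA module detection, which was performed in R) can be illustrated with a plain Pearson coefficient against log-transformed flux. This is a sketch with hypothetical example values; the choice of base-10 logarithm is an assumption, although Pearson's r is unaffected by the base because changing it only rescales the values linearly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only (not data from this study).
flux = [1.0, 10.0, 100.0, 1000.0]
abundance = [0.5, 1.5, 2.5, 3.5]
r = pearson_r(abundance, [math.log10(f) for f in flux])
```

Assessing significance would additionally require a p-value for r, which is omitted here.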

Results and Discussion

The deep trap virus (DTV) database reported here represents 857 viruses from sinking particles collected at 4000 m at Station ALOHA, an oligotrophic habitat in the NPSG. Metagenomic data previously generated from a total of 63 sediment trap samples between 2014-2016 (9,23) were used for these analyses. Sediment trap metagenome assembly, classification, virus-specific reassembly and curation of these previously reported metagenomic data recovered 857 virus populations forming the DTV database (Fig. S4.1). Additionally, previously reported particulate carbon export flux data (22,23) was utilized for carbon flux comparisons across sample time points. As previously described, these samples spanned the summer export pulse events in 2015 and 2016 (Fig. S4.2, (23)).

Viruses from this environment appeared to be under-sampled and represent a reservoir of novel genetic diversity (Table S4.2). A total of 735 (86%) of DTVs were novel with respect to previously sequenced viruses (37,43–51). Of the remaining 122 that were similar to reference databases at >=60% AAI across >=50% proteins, 23 DTVs were similar to the ALOHA 2.0 viral database collected from the upper 500 m in the same environment (49). A total of 112 DTVs were similar to marine viruses collected from other environments, suggesting widespread distribution of a few conserved viral groups throughout the global ocean basins. Only one virus was similar to the taxonomically annotated RefSeq96 database (Vibrio), further highlighting the degree of novel genetic diversity observed in DTVs. The DTVs contained 115 protein functional domains that were not found in previously reported virus datasets (46,49,52), including predicted proteins that might be involved in cellulose, chitin, pectate, or trehalose metabolism (Table S4.4). This functional diversity in auxiliary metabolic genes potentially reflects the diversity of cellular hosts on sinking particles, and corresponding diverse viral strategies in supplementing host carbon and energy metabolism during infection.

Linking novel viruses to hosts using metagenome-assembled genomes (MAGs). DTV sequence novelty precluded using standard approaches, such as the presence of homologous sequences in reference databases, for identifying viral taxonomy. To address this challenge, we leveraged cellular MAGs generated from the same samples to link viruses with their hosts. MAG bins can contain virus genomes that are integrated into their host genomes as prophages. MAGs can also contain CRISPRs that include short spacer sequences identical to a part of viral genomes from previous infections (reviewed in (56)). By aligning viral sequences to putative prophages or CRISPR spacers in MAGs that were reconstructed from the same samples, 68 viruses could be linked to their hosts.

These linkages revealed broad taxonomic diversity in viruses associated with sinking particles, with representation from Bacteroidetes, α-, δ-, ɛ-, and γ-proteobacterial groups (Fig. 4.1). In general, viruses infecting hosts in the Alteromonadales group (γ-proteobacteria) appear to dominate sinking particles, frequently accounting for over half of all annotated viruses. Of the Alteromonadales phages, phages infecting close relatives of deep-sea bacteria (57), such as Shewanella, Colwellia, and Moritella, were particularly abundant, mirroring the known prevalence of these bacteria on sinking particles in the deep sea. In particular, one Moritella phage accounted for over half of all annotated viruses observed in late 2016. Other abundant viruses include two phages of Arcobacteraceae (ɛ-proteobacteria) that accounted for over half of all annotated viruses observed in early 2014. These results are consistent with previous observations of abundant heterotrophs found on sinking particles in this environment (7,9) and in other environments following cyanobacterial blooms (58). Annotated viruses account for only 68 (8%) of all 857 DTVs, so the taxonomic groups shown here likely represent a small subset of total viral diversity on sinking particles.

Of 105 virus-linked MAG scaffolds, 57 links were retained after independent confirmation, with alignments to reference databases, that the virus annotations were consistent with MAG-assigned taxonomy. Inconsistencies between virus versus MAG taxonomic annotations may be due to several potential, and non-mutually exclusive, reasons including: (i) mis-annotation of the virus due to sparse representation in existing databases; (ii) mis-binned viral contigs in the MAG bins; and (iii) the existence of previously unrecognized broad host range viruses. Overall, linking viruses to MAGs appears to be a promising approach for identifying novel viruses. Alignments to reference databases identified 37 DTVs at a low confidence level (>=60% AAI across >=50% proteins), while alignments to MAGs identified 68 DTVs at high confidence levels (>=95% identity across >=50% of the viruses, or 100% identity across 100% of CRISPR spacers).

Viral infection patterns in bathypelagic bacteria. Virus-host links revealed different abundance patterns between the virus and host populations. Some viral populations displayed temporal abundance profiles that were nearly identical to those of their hosts, such as the Shewanella phages shown in Fig. 4.2a (left). Such similarity suggests that these viruses had infected and integrated into the genomes of nearly all of their host population, and remained stable as prophages in populations captured in sinking particles over three years. Similar read coverages in both viral and cellular regions in the aligning Shewanella scaffold (Fig. 4.2a, right) further reflect roughly equal predicted abundance between the integrated prophage and the rest of the host genome. This viral population could represent either a stable prophage that was rarely induced to replicate independently from its host genome, or an inactive prophage no longer capable of induction.

In contrast, some viral populations displayed temporally variable abundance profiles relative to that of their hosts. For example, a Moritella phage population was nearly absent in 2014 and 2015 compared to its host, but mirrored host abundance in late 2016 (Fig 4.2b, left). Its absence in 2014 is evident by sparse to no coverage in the aligning viral region in the Moritella scaffold, a large discrepancy compared to the ~20x coverage shown in non-viral scaffold regions

(Fig 4.2b, right). One interpretation is that the Moritella host populations observed were immune or had not been exposed to this virus in 2014 and 2015.

In 2016, the host population might have been exposed or lost immunity, allowing successful viral integration into the genomes of nearly all of this host population.

Another interpretation is that from year to year, highly related but distinct populations of Moritella were sampled in the sediment traps. In this case, the 2016 Moritella population we observed may represent a closely related but different subpopulation than those observed in 2014 and 2015.

Conversely, one Arcobacter phage that was abundant in sinking particles in 2014 was undetectable in 2015 and 2016 (Fig 4.2c, left). Its absence in 2015 is evident in the undetectable coverage in aligning viral regions in the MAG scaffold, a large discrepancy compared to the ~50x coverage observed in non-viral MAG scaffold regions (Fig 4.2c, right). Interestingly, other shorter genomic islands on this MAG scaffold seem to also have disappeared in 2015. Upon closer inspection, one of these islands contains an integrase, while two of these islands contain transposases, which are common features of island regions that can be diagnostic of different populations and strains. These results likely reflect the heterogeneity of host populations sampled and their spatiotemporal patchiness.
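The coverage comparisons described for Fig. 4.2 amount to a ratio between mean read coverage inside the viral region and mean coverage in the rest of the scaffold. The sketch below illustrates that ratio on toy data; the function name, the half-open coordinate convention, and the coverage values are assumptions for illustration, not the procedure used to produce the figure.

```python
from statistics import mean


def region_coverage_ratio(coverage, start, end):
    """Ratio of mean coverage inside [start, end) to mean coverage outside it.

    coverage: list of per-base read depths along one scaffold
    start, end: 0-based, half-open coordinates of the putative prophage
    """
    inside = coverage[start:end]
    outside = coverage[:start] + coverage[end:]
    if not inside or not outside:
        raise ValueError("region must be a non-empty, proper part of the scaffold")
    flank = mean(outside)
    if flank == 0:
        raise ValueError("flanking coverage is zero; ratio is undefined")
    return mean(inside) / flank


# Toy scaffolds: host regions at 20x, prophage region at positions 50 to 79.
prophage_absent = [20] * 50 + [0] * 30 + [20] * 20
prophage_integrated = [20] * 100

print(region_coverage_ratio(prophage_absent, 50, 80))
print(region_coverage_ratio(prophage_integrated, 50, 80))
```

Under this reading, a ratio near 1 corresponds to a prophage carried by nearly the whole host population, a ratio near 0 to a host population lacking the prophage, and a ratio well above 1 would be the signature expected from induction and independent replication.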

It is important to recognize that these trap samples do not follow specific virus and host populations over space and time. Instead, the samples represent time points capturing sinking particles, potentially from heterogeneous sources, over the preceding 10 – 14 day period. Thus, variability between virus and host abundances does not necessarily reflect viral integration or excision within the same host population. Considering that particles in open-ocean environments sink at variable speeds (59), and that they can be carried by horizontal advection up to tens of kilometers per day (60), sediment trap samples likely reflect microbial communities from a diverse range of temporal and spatial scales, and of heterogeneous sources. Despite the time, depth, space and source-integrated

nature of our dataset, we still observed strong differential patterns in viral and host abundances. Such strong patterns reflect a high level of viral and host microdiversity and dynamic viral-host interactions in the open oceans, even within highly genetically similar populations.

We found no evidence of prophage induction and replication, since no MAG-linked virus population was at any time point much more abundant than its host. Prophage induction and replication might not have been detected due to the heterogeneous sampling of sinking particles and spatiotemporally asynchronous infections amongst individual cells in particle-attached populations. Sinking particles are composed of heterogeneous organic and inorganic materials, including marine snow and fecal pellets (reviewed in (61)).

Spatially disparate microhabitats might result in asynchronous prophage induction that is difficult to identify from the bulk nucleic-acid signatures sampled in sediment traps. The time-integrated nature of our samples (10 – 14 days) and variable particle sinking rates (59) could also blur signatures of temporally asynchronous induction. Alternatively, if a portion of a prophage population is induced and replicating, differences in copy number could break assemblies between an integrated prophage and host genome. Therefore, identifying virus-host links using alignments to putative prophages in MAGs might be biased towards dormant temperate phages found at similar abundances to that of their hosts.

Temperate phages on sinking particles. The proportion of temperate phages represented 4 – 45% of the total viral assemblages (Fig. 4.3, right panel), with an

average of 19%. This proportion is considerably lower than the average of 48% reported for planktonic samples at 500 m (49). The relatively low proportion of temperate phages on sinking particles is perplexing from a genomic perspective. Considering that particle-attached bacteria tend to be copiotrophic, in contrast with free-living oligotrophic bacterioplankton, particle-attached hosts are predicted to have greater genomic “real estate” for integrative elements such as prophages (62,63). However, these results are consistent from a host-density perspective, assuming that lysogeny decreases with higher host densities (64), a condition that is characteristic of particle-attached bacterial communities. Several alternative reasons might explain this lower than expected proportion of temperate phages observed on sinking particles. The high proportion of novel viral diversity in these samples could hinder identification of temperate phage protein markers, leading to false negatives. Particle-attached viruses could be enriched in adsorbed free viral particles, which contain a lower proportion of temperate phages relative to cell-associated viruses (49).

Alternatively, lytic viruses could be enriched on particles due to higher host growth rates relative to those in planktonic habitats (65–67). Some of these deep trap virus hosts grow relatively quickly, at rates similar to primary producers in the surface ocean (68,69), where a high proportion of lytic viruses was observed

(49). Although we cannot resolve amongst these possibilities using metagenomic data alone, the mechanisms that govern viral reproductive strategies on sinking particles are worth further investigation.
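The temperate fraction discussed here is abundance-normalized (see the caption of Fig. 4.3), meaning each population contributes in proportion to its coverage rather than counting once. A minimal sketch of that calculation follows; the function name and the toy values are hypothetical, and the exact normalization used in this study may differ.

```python
def temperate_proportion(coverages, is_temperate):
    """Fraction of total viral coverage contributed by temperate populations.

    coverages: per-population coverage values in one sample
    is_temperate: booleans of the same length, True for temperate populations
    """
    if len(coverages) != len(is_temperate):
        raise ValueError("inputs must have the same length")
    total = sum(coverages)
    if total == 0:
        raise ValueError("no viral coverage in this sample")
    temperate = sum(c for c, flag in zip(coverages, is_temperate) if flag)
    return temperate / total


# Toy sample: one of three populations is temperate, but it is the rarest.
print(temperate_proportion([10.0, 30.0, 60.0], [True, False, False]))
```

In the toy sample a simple count would report one third temperate, while the abundance-normalized value is much lower, which is why the two measures should not be compared directly.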

Interannual similarities of viruses and prokaryotes on sinking particles. Viral assemblages on sinking particles appeared to be highly taxonomically variable

through time (Fig. 4.1). The majority of viral populations were not sampled consistently throughout the 3-year period (Fig. 4.3, coverage profiles). Despite near-identical read-mapping and abundance-calculation methods, viral populations on average appeared to be more ephemeral than MAG populations

(Fig. S4.3). At a presence cutoff of >0 IQR coverage (i.e. >=1 coverage across

>=25% of the population’s sequences), most MAGs were present in >60 samples, while most viruses were present in <20 samples. This discrepancy diminished with an increasing presence cutoff. Taken together, almost all cellular MAGs were present in every sample, though many were rare. In contrast, most viral populations were absent in all but a few samples. This discrepancy is intriguing considering that DTV populations were dereplicated at broader ANI (95%) than that of MAGs (97%). All else being equal, a broader population cutoff should have increased read recruitment and persistence across samples. Despite this expectation, viral populations on average displayed lower interannual persistence than cellular populations. The lower virus recovery might reflect: (i) the population genetic heterogeneity of the hosts; (ii) the physiological heterogeneity of even genetically identical host cells, due to the heterogeneity of sinking particles and the sporadic nature of sampling; (iii) frequent presence of degraded host cells that cannot support viruses; and/or (iv) assuming these samples captured active and contiguous viral and host populations over time, negative frequency-dependent selection mediated by cell-acquired defenses against dominant viral populations (70).
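The presence cutoff above can be illustrated with a short sketch. It assumes that "IQR coverage" means the mean of the middle 50% of sorted per-base coverage values; that definition is an assumption made here for illustration, as are the function names and toy data. Under this definition the value rises above zero only once more than about a quarter of positions are covered, which roughly matches the rule of thumb given in the text.

```python
def iqr_coverage(per_base):
    """Mean of the middle 50% of sorted per-base coverage values.

    This is an assumed definition of "IQR coverage" used for illustration.
    """
    if not per_base:
        raise ValueError("per_base must not be empty")
    ordered = sorted(per_base)
    n = len(ordered)
    trim = n // 4  # number of positions dropped from each tail
    middle = ordered[trim:n - trim]
    return sum(middle) / len(middle)


def samples_present(coverage_by_sample, cutoff=0.0):
    """Count samples in which a population's IQR coverage exceeds the cutoff."""
    return sum(
        1 for per_base in coverage_by_sample.values()
        if iqr_coverage(per_base) > cutoff
    )


# Hypothetical per-base coverage of one population in three samples.
toy = {
    "sample_a": [0, 0, 0, 0, 0, 1, 1, 1],  # 3 of 8 positions covered
    "sample_b": [0] * 8,                    # never covered
    "sample_c": [5] * 8,                    # evenly covered at 5x
}
print(samples_present(toy))              # presence at the >0 cutoff
print(samples_present(toy, cutoff=1.0))  # presence at the >1 cutoff
```

Raising the cutoff removes sparsely covered populations first, which is consistent with the observation that the difference between viruses and MAGs shrinks as the cutoff increases.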

Evidence of vertical transport of viruses. Although viruses were previously observed on sinking particles through electron microscopy (19), the origin of

particle-associated viruses remains unclear. As evidence for vertical transport of particle-attached viruses, alignments between DTVs and viral contigs captured from samples from the upper 500 m in the same environment during late 2014 to early 2016 (49) revealed 21 viruses that were present in the upper ocean during the same period (Table S4.7). Previous observations at Station ALOHA suggested that viruses are generally depth-specific, with little evidence of eurybathic viruses found throughout the water column (49,52). As a result, it is unlikely that these 21 viruses present in the upper 500 m also inhabit bathypelagic depths at

4000 m. Instead, these viruses likely originated from the upper 500 m and were transported to the deep ocean via sinking particles. Indeed, most of these 21 viruses displayed depth-specific abundance profiles in the upper water column, supporting their origins in the upper ocean (Fig. S4.4). The summed normalized abundances of these viruses (Fig. 4.4) also significantly correlated with carbon flux (Pearson’s correlation, p = 0.01), further supporting their origins in the upper ocean. Additionally, all three of the annotated viruses in this group infected bacteria expected to be present in the upper water column, such as cyanobacteria and Caulobacterales (Figs. 4.4, S4.4), which have presumed surface-adapted lifestyles and can be associated with phytoplankton blooms (58,71). These findings are consistent with previous reports of associations between cyanophages and particle export (72).
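The correlation between summed viral abundance and carbon flux can be sketched as follows. This is an illustrative pure-Python version of Pearson's r on toy data, not the analysis code used in this study; the function names and values are hypothetical, and no p-value is computed here.

```python
from math import sqrt


def summed_abundance(profiles):
    """Sum abundances across populations for each sample.

    profiles: list of per-population lists, one value per sample
    """
    return [sum(values) for values in zip(*profiles)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Toy data: two viral populations across four samples, and carbon flux.
viral_profiles = [[1.0, 2.0, 8.0, 3.0], [0.5, 1.0, 4.0, 1.5]]
carbon_flux = [120.0, 150.0, 400.0, 180.0]

total = summed_abundance(viral_profiles)
print(total, pearson_r(total, carbon_flux))
```

A significance test would additionally need the sample size, for example through a t-distribution or a permutation test, and should account for the temporal ordering of the samples.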

Of the 21 viruses that we postulate originated from the upper ocean, 6, 3, 7, and 5 were most abundant and assigned respectively to surface (5 – 75 m), DCM (deep chlorophyll maximum, 100 – 125 m), transitional (150 – 250 m) or mesopelagic

(500 m) depths of origin (Fig. 4.4, Table S4.2). Viruses that originated from 150 –

500 m were more frequently found in exported particles relative to those from above 150 m (Fig. 4.5). Taken together, viruses throughout the upper 500 m were observed on sinking particles exported to the deep ocean at 4000 m, and viruses that originate below the DCM were associated most frequently with particle export. Although the bulk of the SEP was attributed to large primary producers

(22,23) presumably originating from the near-surface ocean, these patterns revealed a frequent but low background signal of viruses originating below the mixed layer depth in sediment trap samples. Our observations both revealed where viral export occurs along the water column, and provided evidence indicating viruses were directly transported to the deep sea on sinking particles.

Future investigation of the mechanism(s) underlying viral contribution to particle export, and of its rate, perhaps coupled with in situ incubations to target specific viral groups, will help constrain viral effects on global biogeochemical cycles.
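The depth-of-origin assignment described above places each virus in the depth zone where it was most abundant in the upper-ocean samples. The sketch below illustrates that rule using the zone boundaries given in the text and the bathypelagic category from the Fig. 4.5 caption; the function names, the data layout, and the toy values are assumptions, and depths that fall between zones are simply rejected.

```python
# Depth zones (name, shallowest m, deepest m) as given in the text.
ZONES = [
    ("surface", 5, 75),
    ("DCM", 100, 125),
    ("transitional", 150, 250),
    ("mesopelagic", 500, 500),
]


def zone_for_depth(depth_m):
    """Return the name of the zone containing a sampling depth."""
    for name, shallow, deep in ZONES:
        if shallow <= depth_m <= deep:
            return name
    raise ValueError(f"depth {depth_m} m is outside the defined zones")


def depth_of_origin(abundance_by_depth):
    """Assign a virus to the zone of its peak upper-ocean abundance.

    abundance_by_depth: dict mapping sampling depth (m) to abundance.
    Returns "bathypelagic" when the virus has zero abundance at every depth.
    """
    if not abundance_by_depth:
        raise ValueError("no depths supplied")
    depth, peak = max(abundance_by_depth.items(), key=lambda item: item[1])
    if peak <= 0:
        return "bathypelagic"
    return zone_for_depth(depth)


# Toy profile with peak abundance at 125 m.
profile = {5: 0.0, 25: 2.0, 125: 9.5, 200: 1.0, 500: 0.2}
print(depth_of_origin(profile))
```

Assigning by peak abundance gives each virus exactly one zone, at the cost of hiding populations that are spread across several depths.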

Viral correlation with particulate carbon export flux. 194 DTVs belonged to one of three WGCNA clusters that significantly correlated with particulate carbon export flux (Pearson p<0.05, Fig. 4.3, highlighted groups). Twelve viral populations were annotated using alignment to either MAGs or the GTDB protein database, revealing abundant representation from γ-proteobacterial groups (Fig. 4.6). In particular, one Shewanella phage was an order of magnitude more abundant than other viruses at the start of the 2015 summer export pulse.

Other viral groups that correlated with carbon flux include phages infecting

Caenarcaniphilales, Oligoflexales, Flavobacteriales, Psychrobiaceae,

Pseudoalteromonas, and Vibrio. These results are consistent with some cellular hosts, such as Oligoflexales, Flavobacteriales, and Alteromonadales, previously observed to display positive correlations with both carbon flux and the summer export pulse (23). These results are also consistent with a 2015 report of the presence of Pseudoalteromonas and Vibrio species correlating with modeled carbon flux (72).

Viral correlation with carbon flux relates to two proposed conceptual models that invoke conflicting effects of viruses as respectively decreasing or increasing carbon export to the deep sea: the “viral shunt” and the “viral shuttle” (reviewed in (73)). In the viral shunt model, viruses enhance the microbial loop by transforming living cells into dissolved and particulate organic matter that increase the availability of substrates for heterotrophic respiration in the upper ocean (74–76). Conversely, in the viral shuttle model, viruses enhance export from the surface to the deep ocean by lysing cells, releasing sticky material, and promoting aggregation, leading to larger particles for more efficient export (73).

It is likely that a combination of both models applies to how viruses influence the marine environment, possibly depending on host metabolism, size, habitat, and viral reproductive strategies. Our finding of particle-associated viral populations that correlated with particulate carbon flux hints at the possibility of the viral shuttle hypothesis. However, additional evidence is needed to directly support or refute either hypothesis, with a particular focus on linking the mechanism of this correlation to viral lysis of hosts in the upper ocean.

Constructing a mechanistic and quantitative framework that explains how viruses influence marine ecosystems might require both laboratory and field

studies. Research on viral influence on aggregation and sinking has focused on cultured eukaryotic hosts (19,20,77–79), which provided strong evidence that specific viruses can cause host death, aggregation, and sinking. Two other studies have examined correlations between virus occurrence and indirect measures of carbon export in the field (72,80). These studies captured geographically diverse viruses that correlated with a proxy for export calculated using observed particle size distributions. Our findings complement this current body of work by using direct field measurements of particulate carbon flux to identify viruses whose presence correlates with sinking particles in the oceans.

While our results show that viruses are indeed transported on sinking particles, they should not be interpreted as evidence that viruses are mechanistically involved in promoting or inhibiting carbon export. Identifying the mechanism(s) by which viruses might contribute to export, for example actively through cell lysis or passively through particle adsorption, remains an open area for future investigation. Such mechanisms might vary greatly, depending on the oceanographic region and predominant primary producers involved.

Conclusions

Viral diversity has largely been overlooked in studies of microbial assemblages associated with sinking particles. In this study, we assembled a database of 857 deep-trap viruses collected from sediment traps at 4000 m in an open ocean environment characteristic of the largest biome on Earth. To overcome challenges

in deciding on taxonomic assignments for novel viruses that lack relatives in reference databases, we used cellular MAGs assembled from the same samples to link 68 of these viral populations to their hosts. Using these linkages, we identified novel viral populations infecting deep-sea bacteria, and examined virus-host dynamics across a three-year period. We found that while some viruses exhibited near-identical abundances to those of their hosts, others displayed discrepancies in abundance profiles that likely reflect dynamic prophage-host interactions. On average, viral populations appeared to be less persistent than cellular MAG populations, possibly reflecting negative frequency-dependent selection resulting from host-acquired resistance. Some deep trap virus populations appeared to originate from the surface oceans, providing evidence for vertical transport of viruses on sinking particles. The abundances of some viral populations displayed a positive correlation with particulate carbon flux, suggesting taxonomic groups that might influence export processes. These results and analyses reveal viral diversity, virus-host dynamics, and viral export on deep-sea sinking particles in the open ocean.

Acknowledgements

I thank the captain and crew of R/V Kilo Moana, R/V KOK, the Hawaii Ocean Time-series program, and the SCOPE-ops team for cruise organization, sample collection, and oceanographic data acquisition. I thank David Karl for sediment trap design and set-up, Tara Clemente and Blake Watkins for sediment trap deployment and recovery, Eric Grabowski for generating particulate export flux data, Anna Romano and Paul Den Uyl for library preparation and sequencing, Kirsten Poff and Andy Leu for sharing their unpublished and preprint data, Andy Leu for generating cellular MAGs, virus-CRISPR spacer alignments, and general discussions, and John Eppley for bioinformatics advice. This project is funded by Simons Foundation (#329108) and the Gordon and Betty Moore Foundation (GBMF 3777) to EFD. Partial support for EL was provided by the Natural Sciences and Engineering Research Council of Canada (PGSD3-487490-2016). This work is a contribution of the Simons Collaboration on Ocean Processes and Ecology and the Center for Microbial Oceanography: Research and Education.

Figures

[Figure 4.1 plot. y-axis: relative abundance of annotated viruses (0.00 to 1.00); x-axis: time (2014 to 2016). Legend, predicted hosts: other (8); Bacteroidetes: Flavobacteriales (6); α-proteobacteria: Rhizobiales (4), Rhodobacterales (6); δ-proteobacteria: Myxococcota (4); ε-proteobacteria: Arcobacter (2); γ-proteobacteria: Oceanospirillales (5), Pseudomonadales (5), Vibrio (2); Alteromonadales: Shewanellaceae (12), Colwellia (5), Moritella (1), other (8).]

Figure 4.1. Relative abundances through time of 68 annotated deep-trap viruses identified by alignments to putative prophages and CRISPRs in MAGs. Relative abundances are calculated using IQR coverage normalized to a total of 1. Viral host annotations are grouped by color, with the number of viral populations in each group marked in parentheses. Bottom grey bars indicate summer export pulse samples based on particulate carbon export flux.


Figure 4.2. Examples of different virus-host abundance patterns observed in deep-trap viruses (DTVs): a. Phage-host abundance covariance; b, c. Phage-host decoupled abundance variability. The left panels display abundances of viruses (closed circles) and hosts (open circles) through time. Normalized coverage is calculated using IQR coverage for each viral contig and MAG scaffold, weighted by scaffold length for MAGs, and normalized to the smallest sequenced sample. The right panels display viruses and their position in MAG scaffolds, with viral structural proteins colored in orange, integrases in yellow, and protein alignments in blue. Below, coverage along that MAG scaffold is shown for two samples corresponding to time-points labeled on the left panel (blue and orange text). Black rectangles highlight viral regions in MAG scaffolds that display coverage discrepancies relative to non-viral regions.


Figure 4.3. Coverage abundance profiles of 857 DTV populations through time. Each node on the top dendrogram and its associated column represents the coverages of one viral population. The top row indicates groups that appear to be variable or persistent across 63 samples. Each row represents an individual sample, ordered by time. The height of black bars represents log IQR coverage for each population, normalized to the maximum in that sample. The right panels display sample biogeochemical data: particulate carbon export flux (µmol/m2/day), summer export pulse samples, proportion of total reads mapping to DTVs, abundance-normalized proportion of DTVs that are temperate, Shannon’s diversity, richness, and evenness. Three groups of viruses positively correlated with carbon export flux are highlighted in color on the top dendrogram, on the background, and on bottom bars. Asterisks indicate variables that significantly correlate with carbon flux: proportion of temperate phages, and the bottom panel of three WGCNA clusters.

[Figure 4.4 plot, two panels. y-axis: normalized coverage abundance; x-axis: year (2014 to 2016); SEP: summer export pulse. Legend by depth of origin: surface (5-75m): DTV 79, DTV 367, DTV 392, DTV 541, DTV 575, DTV 592; DCM (100-125m): Cyanophage DTV 293, Cyanophage DTV 465, Caulobacterales phage DTV 750; transition (150-250m): DTV 59, DTV 226, DTV 450, DTV 562, DTV 614, DTV 636, DTV 772; mesopelagic (500m): DTV 393, DTV 526, DTV 601, DTV 631, DTV 819.]

Figure 4.4. Abundances through time of 21 deep-trap viruses (DTVs) with depths of origin in the upper 500m: surface (5-75m), deep chlorophyll maximum (100-125m), transition (150-250m), and mesopelagic (500m). The bottom panel is a subset of the top panel, with abundant viruses removed. Abundances are approximated using normalized coverage, calculated using IQR coverage normalized to the smallest sequenced sample. Grey shading indicates summer export pulse samples based on particulate carbon export flux.

[Figure 4.5 plot. y-axis: depth of origin (surface, 5 - 75 m; DCM, 100 - 125 m; transition, 150 - 250 m; mesopelagic, 500 m; bathypelagic). x-axis: presence (number of samples), ticks from 10 to 40.]

Figure 4.5. Viral presence in deep trap samples grouped by presumptive depth of origin: surface (5-75m), DCM (deep chlorophyll maximum, 100-125m), transition (150-250m), mesopelagic (500m), and bathypelagic (0 IQR coverage from upper 500m reads). A population is considered present in a sample if it has >0 IQR coverage (63 samples total).

[Figure 4.6 plot, two panels. y-axis: normalized coverage; x-axis: year (2014 to 2016); SEP: summer export pulse. Legend: Caenarcaniphilales phage DTV 671*; Oligoflexales phage DTV 684*; Bacteroidetes: Flavobacteriales phage DTV 697, Flavobacteriales phage DTV 700; α-proteobacteria: Erythrobacter phage DTV 14; γ-proteobacteria: Shewanella phage DTV 442*, Psychrobiaceae phage DTV 26*, Alteromonadaceae phage DTV 222*, Pseudoalteromonas phage DTV 122, Pseudoalteromonas phage DTV 416*, Vibrio phage DTV 281*, Vibrio phage DTV 694*.]

Figure 4.6. Normalized abundances of annotated viruses that significantly correlated with carbon flux. Of the 194 viral populations belonging to WGCNA clusters that significantly correlated in abundances with particulate carbon export flux, 12 were annotated using alignment to MAGs (high-confidence, denoted by asterisks) or the GTDB protein database. The bottom panel is a subset of the top panel, with the abundant Shewanella phage DTV 442 removed. Abundances are approximated using normalized coverage, calculated using IQR coverage normalized to the smallest sequenced sample. Grey shading indicates summer export pulse samples based on particulate carbon export flux.

Supplementary figures

[Figure S4.1 workflow diagram; the layout is not recoverable from the text extraction. Labels that survive include: QC'ed reads from 63 samples; SPAdes assembly (63 individual assemblies, 1 pooled assembly); viral contig identification with VIRSorter (11,610 viral contigs), VIBRANT (15,356 viral contigs), phage or eukaryotic virus PFAM markers, and MAGs; clustering at 95% ANI with cd-hit-est; 2359 high-confidence viral populations; 857 deep trap viruses >10kb (735 novel; max 143kb, median 24kb); read-mapping with BWA for temporal abundance profiles; WGCNA (194) for correlation with C flux; prophages: PFAM markers (94), VIRSorter (49), VIBRANT (110); virus-host alignments with LAST: MAG CRISPRs (14), MAG alignments (57), homology to GTDB (37), homology to RefSeq96 (1); upper 500m viruses (21); 109 complete genomes (max 143kb, median 41kb) from VIRSorter “circular”, check_circularity.pl, or DTR.]

Figure S4.1. Bioinformatic workflow from metagenomic reads to DTV database and analyses.

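[Figure S4.2 plot, two panels. Top panel y-axis: particulate carbon flux (µmol/m2/day), ticks from 100 to 500. Bottom panel y-axis: primary productivity (mmol/m2/day), ticks from 30 to 70. x-axis: year (2014 to 2016). SEP: summer export pulse.]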

Figure S4.2. Sample metadata plotted through the 3-year sampling period: particulate carbon export flux and primary productivity measured from 12-hour light incubations. Solid line indicates the 30-year mean, while the dashed line indicates 150% of the 30-year mean. Grey shading indicates carbon-based summer export pulse samples. Particulate carbon export data is courtesy of Eric Grabowski and David Karl. Productivity data was retrieved from http://hahana.soest.hawaii.edu/hot/hot-dogs/

[Figure S4.3 plot, eight histograms. Top row: viruses present at cutoffs of >0, >0.1, >0.25, and >1 IQR coverage. Bottom row: MAGs present at the same four cutoffs. x-axis: number of samples (0 to 60); y-axis: frequency.]

Figure S4.3. Histograms of the number of samples in which each viral and MAG population was present. A viral or MAG population is considered present in a sample if it has an IQR coverage above that cutoff.


Figure S4.4. Spatiotemporal coverage abundance profiles of 21 viruses captured in sinking particles that we postulate originated from the upper 500 m. Each node on the top dendrogram and its associated column represents the coverages of one viral population. Each row represents an individual sample from the ALOHA2.0 dataset collected from the upper 500m at Station ALOHA (49). All samples are ordered by depth and, within depth, by time. The height of black bars represents log IQR coverage for each population, normalized to the maximum in that sample. The bottom blue bar highlights viruses that correlate with particulate carbon export flux (group color is consistent with Fig. 5).

Supplementary tables

Table S4.1. Sequencing, assembly, and viral contig identification for 63 metagenomic samples (23): sample name, number of quality-controlled reads, number of contigs, number of contigs >=1 kbp for input into viral identification programs, number of viral contigs identified by VIRSorter, number of viral contigs identified by both VIRSorter and VIBRANT, number of viral contigs identified by VIBRANT, number and proportion of contigs identified by VIRSorter but not by VIBRANT, number and proportion of contigs identified by VIBRANT but not by VIRSorter. A dash marks a cell that is blank in the original table.

sample  number_reads  number_contigs  number_contigs_>=1kbp  VIRSorter  VIRSorter_VIBRANT_shared  VIBRANT  VIRSorter_uniq  VIRSorter_uniq_prop  VIBRANT_uniq  VIBRANT_uniq_prop
ALOHA-2014-XVII-1-01-DNA  13972166  474341  13805  79  29  63  50  0.632911392  34  0.53968254
ALOHA-2014-XVII-1-02-DNA  15307084  552244  12898  56  24  69  32  0.571428571  45  0.652173913
ALOHA-2014-XVII-1-03-DNA  15667679  488925  16344  131  76  162  55  0.419847328  86  0.530864198
ALOHA-2014-XVII_1-04_DNA  15524334  499984  24748  0  0  466  0  -  466  1
ALOHA-2014-XVII-1-05-DNA  10566629  236928  11139  21  8  33  13  0.619047619  25  0.757575758
ALOHA-2014-XVII-1-06-DNA  19306916  785399  37391  135  64  148  71  0.525925926  84  0.567567568
ALOHA-2014-XVII-1-07-DNA  15276438  438959  17493  128  55  157  73  0.5703125  102  0.649681529
ALOHA-2014-XVII-1-08-DNA  19660437  739925  21539  169  91  185  78  0.461538462  94  0.508108108
ALOHA-2014-XVII-1-09-DNA  16962157  711902  31402  74  40  98  34  0.459459459  58  0.591836735
ALOHA-2014-XVII-1-10-DNA  17007212  540053  17897  102  45  123  57  0.558823529  78  0.634146341
ALOHA-2014-XVII-1-11-DNA  12339340  369814  14536  125  70  128  55  0.44  58  0.453125
ALOHA-2014-XVII-1-12-DNA  15024796  435100  19623  166  105  246  61  0.36746988  141  0.573170732
ALOHA-2014-XVII-1-13-DNA  18312712  718833  35593  230  137  317  93  0.404347826  180  0.567823344
ALOHA-2014-XVII-1-14-DNA  21284809  768786  28713  153  84  246  69  0.450980392  162  0.658536585
ALOHA-2014-XVII_1-15_DNA  18539239  632331  28653  0  0  409  0  -  409  1
ALOHA-2014-XVII_1-16_DNA  8272891  328628  12372  0  0  132  0  -  132  1
ALOHA-2014-XVII-1-17-DNA  18911127  649198  26470  203  112  279  91  0.448275862  167  0.598566308
ALOHA-2014-XVII_1-18_DNA  16723167  492460  17815  0  0  370  0  -  370  1
ALOHA-2014-XVII-1-19-DNA  24913139  682448  24212  116  62  173  54  0.465517241  111  0.641618497
ALOHA-2014-XVII-1-20-DNA  21761378  613304  22204  94  56  136  38  0.404255319  80  0.588235294
ALOHA-2014-XVII_1-21_DNA  33871894  1260226  87893  0  0  865  0  -  865  1
ALOHA-2015-XVIII-1-01-DNA  20661721  980943  46061  330  176  354  154  0.466666667  178  0.502824859
ALOHA-2015-XVIII-1-02-DNA  20732107  892310  44554  259  133  294  126  0.486486486  161  0.547619048
ALOHA-2015-XVIII-1-03-DNA  19562514  727872  20734  130  60  118  70  0.538461538  58  0.491525424
ALOHA-2015-XVIII-1-04-DNA  20634900  939846  36810  286  131  269  155  0.541958042  138  0.513011152
ALOHA-2015-XVIII-1-05-DNA  18841701  760201  29588  223  99  190  124  0.556053812  91  0.478947368
ALOHA-2015-XVIII-1-06-DNA  20972207  924719  44952  312  132  268  180  0.576923077  136  0.507462687
ALOHA-2015-XVIII-1-07-DNA  20240552  815179  46609  266  116  260  150  0.563909774  144  0.553846154
ALOHA-2015-XVIII-1-08-DNA  20818685  1069041  54191  390  155  282  235  0.602564103  127  0.45035461
ALOHA-2015-XVIII-1-09-DNA  19208997  920878  50439  334  125  252  209  0.625748503  127  0.503968254
ALOHA-2015-XVIII-1-10-DNA  18659722  1092331  53015  199  89  203  110  0.552763819  114  0.561576355
ALOHA-2015-XVIII-1-11-DNA  18813166  453943  20945  18  7  30  11  0.611111111  23  0.766666667
ALOHA-2015-XVIII-1-12-DNA  18217248  710538  23777  147  67  144  80  0.544217687  77  0.534722222
ALOHA-2015-XVIII-1-13-DNA  23335678  804413  65296  313  180  403  133  0.424920128  223  0.553349876
ALOHA-2015-XVIII-1-14-DNA  23124780  838434  67006  342  197  439  145  0.423976608  242  0.551252847
ALOHA-2015-XVIII-1-15-DNA  20473074  884009  59699  418  201  408  217  0.519138756  207  0.507352941
ALOHA-2015-XVIII-1-16-DNA  14656736  481627  38143  222  120  284  102  0.459459459  164  0.577464789
ALOHA-2015-XVIII-1-17-DNA  20693473  929995  68036  348  190  406  158  0.454022989  216  0.532019704
ALOHA-2015-XVIII-1-18-DNA  22011730  908740  39867  208  94  229  114  0.548076923  135  0.589519651
ALOHA-2015-XVIII-1-19-DNA  23405186  1071628  64083  355  195  496  160  0.450704225  301  0.606854839
ALOHA-2015-XVIII-1-20-DNA  19405912  592178  38464  0  0  111  0  -  111  1
ALOHA-2015-XVIII-1-21-DNA  21657466  738021  36034  0  0  175  0  -  175  1
ALOHA-2016-XIX-1-01-DNA  26387124  1133111  36448  65  27  80  38  0.584615385  53  0.6625
ALOHA-2016-XIX-1-02-DNA  18486161  744163  18241  129  46  67  83  0.643410853  21  0.313432836
ALOHA-2016-XIX-1-03-DNA  23173596  607419  27394  194  87  186  107  0.551546392  99  0.532258065
ALOHA-2016-XIX-1-04-DNA  21119950  591547  56554  108  55  267  53  0.490740741  212  0.794007491
ALOHA-2016-XIX-1-05-DNA  23751988  887808  38436  306  155  348  151  0.493464052  193  0.554597701
ALOHA-2016-XIX-1-06-DNA  24624343  965241  31634  210  90  228  120  0.571428571  138  0.605263158
ALOHA-2016-XIX-1-07-DNA  21972823  823242  24882  101  38  107  63  0.623762376  69  0.644859813
ALOHA-2016-XIX-1-08-DNA  25880194  1034354  37341  188  98  237  90  0.478723404  139  0.58649789
ALOHA-2016-XIX-1-09-DNA  17589008  528988  42741  157  101  244  56  0.356687898  143  0.586065574
ALOHA-2016-XIX-1-10-DNA  28458871  1547398  29511  100  33  105  67  0.67  72  0.685714286
ALOHA-2016-XIX-1-11-DNA  30403002  1020956  38827  168  60  178  108  0.642857143  118  0.662921348
ALOHA-2016-XIX-1-12-DNA  20794057  866134  27822  96  36  114  60  0.625  78  0.684210526
ALOHA-2016-XIX-1-13-DNA  5579117  159822  6457  39  10  24  29  0.743589744  14  0.583333333
ALOHA-2016-XIX-1-14-DNA  20825665  686198  48302  121  41  140  80  0.661157025  99  0.707142857
ALOHA-2016-XIX-1-15-DNA  20266129  683014  18812  125  54  172  71  0.568  118  0.686046512
ALOHA-2016-XIX-1-16-DNA  18852719  668567  12361  73  26  71  47  0.643835616  45  0.633802817
ALOHA-2016-XIX-1-17-DNA  7607416  222042  7005  36  11  34  25  0.694444444  23  0.676470588
ALOHA-2016-XIX-1-18-DNA  22540592  901752  20517  121  43  139  78  0.644628099  96  0.690647482
ALOHA-2016-XIX-1-19-DNA  9217257  337245  34630  18  5  37  13  0.722222222  32  0.864864865
ALOHA-2016-XIX-1-20-DNA  20964271  1011743  34878  10  1  41  9  0.9  40  0.975609756
ALOHA-2016-XIX-1-21-DNA  19341953  712272  23324  93  37  131  56  0.602150538  94  0.717557252
pooled_assembly_megahit  -  107204  61374  2070  1144  1942  926  0.447342995  798  0.410916581
total  1223169335  46196854  2148534  11610  5723  15312  5887  -  9589  -

Table S4.2. Information on 857 deep-trap viruses: name, length, GC content, genomic completion, presence of chimeras, temperate status, 34 “confident” hosts as predicted by alignments to MAG scaffolds >=10 kb longer than the virus, 23 “putative” hosts as predicted by alignments to MAG scaffolds <10 kb longer than the virus, 14 hosts as predicted by alignments to MAG CRISPR spacers, novelty with respect to previously sequenced viral metagenomes, group that correlates with particulate carbon flux (Fig. 3), depth of origin through alignments to the ALOHA 2.0 viral database (49), and number of samples present out of 63 total samples.

(omitted due to size and available upon request)

Table S4.3. Alignments between deep-trap viruses and MAGs. Columns are in BlastTab format and represent: deep trap virus, deep trap virus length, MAG scaffold, MAG scaffold length, percent nucleic acid identity, length of alignment, mismatches, gap opens, start of alignment on deep trap virus, end of alignment on deep trap virus, start of alignment on MAG scaffold, end of alignment on MAG scaffold, e-value, and bit score.

DTV  length  MAG  length  identity  alignment_length  mismatches  gap_opens  DTV_start  DTV_end  MAG_start  MAG_end  evalue  bit_score
c_000000000012  15164  DT-Colwellia-2_scaffold_6  205424  100  15164  0  0  1  15164  139234  154397  2.40E+04  51027
c_000000000081  15006  DT-Arcobacteraceae-2_scaffold_13  112354  100  15006  0  0  15006  1  30391  45396  2.37E+04  66958
c_000000000106  10961  DT-Alteromonadaceae-1_scaffold_60  37810  99.98  10781  2  0  181  10961  26650  37430  1.70E+04  380
c_000000000124  12101  DT-Shewanella-1_scaffold_29  60683  97.97  12111  229  6  12101  1  73  12176  1.83E+04  48507
c_000000000135  13269  DT-Halomonas-1_scaffold_28  43550  100  13269  0  0  1  13269  177  13445  2.10E+04  30105
c_000000000140  35781  DT-Halomonas-2_scaffold_2  856920  100  35781  0  0  35781  1  573767  609547  5.66E+04  247373
c_000000000155  12438  DT-Flavobacteriaceae-1_scaffold_1  135909  100  12438  0  0  1  12438  30005  42442  1.97E+04  93467
c_000000000160  24384  DT-Rhodobacteraceae-4_scaffold_90  38630  99.99  24384  2  0  1  24384  196  24579  3.86E+04  14051
c_000000000162  11988  DT-Nitrincolaceae-3_scaffold_32  66171  100  8895  0  0  11988  3094  57277  66171  1.41E+04  0
c_000000000165  30280  DT-Nitrincolaceae-1_scaffold_4  211761  99.96  30280  11  0  30280  1  123113  153392  4.78E+04  58369
c_000000000184  10768  DT-Rhodobacteraceae-4_scaffold_153  43550  100  10768  0  0  10768  1  27318  38085  1.70E+04  5465
c_000000000195  17030  DT-Epibacterium_A-1_scaffold_188  34855  99.92  13130  5  1  1  13125  21726  34855  2.07E+04  0
c_000000000197  37776  DT-Halomonas-1_scaffold_17  74106  100  37776  0  0  37776  1  36295  74070  5.97E+04  36
c_000000000199  12319  DT-Nitrincolaceae-1_scaffold_4  211761  99.85  12319  19  0  1  12319  153338  165656  1.94E+04  46105
c_000000000210  19514  DT-Winogradskyella-2_scaffold_1  482502  99.99  19514  2  0  19514  1  492  20005  3.09E+04  462497
c_000000000232  11214  DT-Moritella-2_scaffold_37  65254  99.99  9624  1  0  11214  1591  55631  65254  1.52E+04  0
c_000000000250  10345  DT-Shewanella-1_scaffold_29  60683  99.47  10345  55  0  1  10345  14687  25031  1.62E+04  35652
c_000000000254  10117  DT-Nitrincolaceae-2_scaffold_31  49507  99.65  10117  35  0  10117  1  33554  43670  1.59E+04  5837
c_000000000270  40491  DT-Idiomarina-1_scaffold_2  570286  99.99  40490  4  0  40491  2  96602  137091  6.40E+04  433195
c_000000000281  31717  DT-Vibrio-1_scaffold_27  98328  100  31717  0  0  1  31717  58984  90700  5.02E+04  7628
c_000000000527  26274  DT-Nitrincolaceae-3_scaffold_32  66171  99.92  26288  8  1  26274  1  31042  57329  4.15E+04  8842
c_000000000573  20385  DT-Epibacterium_A-1_scaffold_187  38355  99.83  19577  34  0  19577  1  1  19577  3.08E+04  18778
c_000000000653  10091  DT-Idiomarina-2_scaffold_6  161845  99.98  10039  2  0  10091  53  151807  161845  1.59E+04  0
c_000000000661  33949  DT-Halomonas-2_scaffold_6  368046  100  33949  0  0  1  33949  160364  194312  5.37E+04  173734
c_000000000662  23616  DT-Idiomarina-2_scaffold_18  48678  100  23348  0  0  1  23348  25323  48670  3.69E+04  8
c_000000000663  27924  DT-Halomonas-1_scaffold_26  49389  100  27924  0  0  1  27924  65  27988  4.42E+04  21401
c_000000000669  42978  DT-Rhizobiales-2_scaffold_3  954054  99.98  42978  9  0  42978  1  324069  367046  6.79E+04  587008
c_000000000671  43367  DT-Caenarcaniphilales-1_scaffold_4  358477  100  43367  0  0  43367  1  31303  74669  6.86E+04  283808
c_000000000674  46849  DT-Rhizobiales-2_scaffold_8  207452  99.97  37149  13  0  46849  9701  82997  120145  5.87E+04  87307
c_000000000677  43482  DT-Methylophagaceae-3_scaffold_3  150783  100  43482  0  0  1  43482  67783  111264  6.88E+04  39519
c_000000000678  56721  DT-Methylophagaceae-3_scaffold_3  150783  100  37976  0  0  1  37976  112808  150783  6.00E+04  0
c_000000000685  30618  DT-Alteromonas-1_scaffold_63  92976  100  30618  1  0  1  30618  9934  40551  4.84E+04  52425
c_000000000690  28934  DT-Arcobacteraceae-2_scaffold_2  322036  100  28934  0  0  1  28934  39104  68037  4.58E+04  253999
c_000000000694  13385  DT-Vibrio-1_scaffold_49  26113  100  13385  0  0  1  13385  7357  20741  2.12E+04  5372
c_000000000026  36557  DT-Psychrobiaceae-2_scaffold_36  36602  100  36557  0  0  1  36557  23  36579  5.78E+04  23
c_000000000054  15119  DT-Shewanella-1_scaffold_63  24806  99.94  8064  5  0  8064  1  6552  14615  1.27E+04  10191
c_000000000063  14024  DT-Colwellia-8_s107  14948  99.91  10444  9  0  3581  14024  4337  14780  1.65E+04  168
c_000000000075  15734  DT-Margulisbacteria-1_scaffold_198  15075  99.99  15075  1  0  325  15399  1  15075  2.38E+04  0
c_000000000086  22038  DT-Rhizobiales-1_scaffold_1  26352  99.68  22038  71  0  1  22038  24  22061  3.46E+04  4291
c_000000000095  11424  DT-Cetobacterium-1_s69  10806  100  10787  0  0  10787  1  20  10806  1.71E+04  0
c_000000000099  12394  DT-Rhizobiales-1_scaffold_39  11497  99.96  11474  5  0  11474  1  1  11474  1.81E+04  23
c_000000000101  18337  DT-Algicola-1_scaffold_75  17965  99.99  17965  2  0  18203  239  1  17965  2.84E+04  0
c_000000000107  34941  DT-Colwellia-4_scaffold_30  34612  99.78  23560  45  4  23717  163  1  23557  3.70E+04  11055
c_000000000127  13701  DT-Flavobacteriales-1_scaffold_75  14348  99.99  13701  1  0  13701  1  86  13786  2.17E+04  562
c_000000000136  28905  DT-Cetobacterium-1_s6  31153  99.98  28905  6  0  1  28905  99  29003  4.57E+04  2150
c_000000000205  13788  DT-Rhodobacteraceae-4_scaffold_152  15072  100  13788  0  0  1  13788  944  14731  2.18E+04  341
c_000000000214  10454  DT-Flavobacteriaceae-4_scaffold_84  10035  99.97  9941  3  0  9987  47  1  9941  1.57E+04  94
c_000000000222  23993  DT-Alteromonadaceae-2_scaffold_30  26663  99.41  23996  138  1  23993  1  1416  25411  3.75E+04  1252
c_000000000227  11703  DT-Colwellia-2_scaffold_54  11708  99.98  11237  2  0  1  11237  472  11708  1.78E+04  0
c_000000000229  20111  DT-Rhodobacteraceae-4_scaffold_129  19799  99.95  19721  2  1  1  19721  87  19799  3.11E+04  0
c_000000000230  14262  DT-Rickettsiaceae-1_scaffold_39  14308  100  14262  0  0  14262  1  23  14284  2.26E+04  24
c_000000000301  38520  DT-Colwellia-7_s16  38573  100  21152  0  0  1  21152  17422  38573  3.34E+04  0
c_000000000416  40581  DT-Pseudoalteromonas-1_scaffold_35  40733  100  40581  0  0  1  40581  71  40651  6.42E+04  82
c_000000000502  29014  DT-Psychroserpens-1_scaffold_230  31227  100  29014  1  0  29014  1  1  29014  4.59E+04  2213
c_000000000684  66141  DT-Oligoflexales-1_scaffold_4  52994  100  40922  0  0  40922  1  1  40922  6.47E+04  12072
c_000000000757  24425  DT-Psychrobium-1_scaffold_77  15124  99.97  15124  4  0  15478  355  1  15124  2.39E+04  0
c_000000000824  11668  DT-Psychroserpens-1_scaffold_23  15026  99.97  9582  3  0  11668  2087  5445  15026  1.51E+04  0
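
The caption of Table S4.2 distinguishes "confident" from "putative" hosts by how much longer the aligned MAG scaffold is than the virus (>=10 kb versus <10 kb). The following is a minimal sketch of that rule only, not the workflow actually used in this dissertation. The `Hit` class and `host_confidence` function are hypothetical names introduced for illustration, and the two example rows are taken from Table S4.3.

```python
from dataclasses import dataclass

# Threshold from the Table S4.2 caption: scaffold must exceed the virus by >= 10 kb.
MARGIN_BP = 10_000


@dataclass
class Hit:
    """One virus-to-MAG alignment (hypothetical container, for illustration)."""
    virus: str
    virus_length: int
    scaffold: str
    scaffold_length: int


def host_confidence(hit: Hit, margin_bp: int = MARGIN_BP) -> str:
    """Label a host prediction by the length margin of the MAG scaffold."""
    if hit.scaffold_length - hit.virus_length >= margin_bp:
        return "confident"
    return "putative"


hits = [
    Hit("c_000000000012", 15164, "DT-Colwellia-2_scaffold_6", 205424),
    Hit("c_000000000026", 36557, "DT-Psychrobiaceae-2_scaffold_36", 36602),
]
labels = {h.virus: host_confidence(h) for h in hits}
```

Applied to every row of Table S4.3, this rule should reproduce the split into 34 confident and 23 putative hosts stated in the caption of Table S4.2, provided the scaffold lengths are read as reconstructed above.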

Table S4.4. Novel PFAM protein domains (bit score >30) recovered from the deep trap viral database that were not found in previously reported datasets (46,49,52).

PF00126 Bacterial regulatory helix-turn-helix protein lysR family
PF00150 Cellulase (glycosyl hydrolase family 5)
PF00459 Inositol monophosphatase family
PF00503 G-protein alpha subunit
PF00533 BRCA1 C Terminus (BRCT) domain
PF00544 Pectate lyase
PF00653 Inhibitor of Apoptosis domain
PF00773 RNB domain
PF01030 Receptor L domain
PF01364 Peptidase family C25
PF01400 Astacin (Peptidase family M12A)
PF01595 Domain of unknown function DUF21
PF01607 Chitin binding Peritrophin-A domain
PF01725 Ham1 family
PF01797 Transposase IS200 like
PF01972 Serine dehydrogenase proteinase
PF02099 Josephin
PF02272 DHHA1 domain
PF02442 Lipid membrane protein of large eukaryotic DNA viruses
PF02581 Thiamine monophosphate synthase/TENI
PF02834 LigT like Phosphoesterase
PF02877 Poly(ADP-ribose) polymerase
PF03098 Animal haem peroxidase
PF03142 Chitin synthase
PF03699 Uncharacterised protein family (UPF0182)
PF04218 CENP-B N-terminal DNA-binding domain
PF04245 37-kD nucleoid-associated bacterial protein
PF04367 Protein of unknown function (DUF502)
PF04583 Baculoviridae p74 conserved region
PF04606 Ogr/Delta-like zinc finger
PF04631 Baculovirus hypothetical protein
PF04664 Opioid growth factor receptor (OGFr) conserved region
PF04798 Baculovirus 19 kDa protein conserved region
PF04947 Poxvirus Late Transcription Factor VLTF3 like
PF05006 Protein of unknown function (DUF666)
PF05593 RHS Repeat
PF05635 23S rRNA-intervening sequence protein
PF05685 Putative restriction endonuclease
PF05816 Toxic anion resistance protein (TelA)
PF06101 Plant protein of unknown function (DUF946)
PF06120 Tail length tape measure protein
PF06283 Trehalose utilisation
PF06322 Phage NinH protein
PF06564 Cellulose biosynthesis protein BcsQ
PF06725 3D domain
PF07130 YebG protein
PF07273 Protein of unknown function (DUF1439)
PF07308 Protein of unknown function (DUF1456)
PF07509 Protein of unknown function (DUF1523)
PF07534 TLD
PF07589 PEP-CTERM motif
PF07638 ECF sigma factor
PF08614 Autophagy protein 16 (ATG16)
PF08798 CRISPR associated protein
PF09124 T4 recombination endonuclease VII
PF09299 Mu transposase
PF09393 Phage tail tube protein
PF09558 Protein of unknown function (DUF2375)
PF09641 Protein of unknown function (DUF2026)
PF10000 ACT domain
PF10048 Predicted integral membrane protein (DUF2282)
PF10162 G8 domain
PF10269 Transmembrane Fragile-X-F protein
PF10688 Bacterial inner membrane protein
PF10789 Phage RNA polymerase binding RpbA
PF10800 Protein of unknown function (DUF2528)
PF10886 Protein of unknown function (DUF2685)
PF11008 Protein of unknown function (DUF2846)
PF11039 Protein of unknown function (DUF2824)
PF11140 Protein of unknown function (DUF2913)
PF11559 Afadin- and alpha -actinin-Binding
PF11637 ATP-dependant DNA helicase UvsW
PF11646 Protein of unknown function DUF3258
PF12514 Protein of unknown function (DUF3718)
PF12686 Protein of unknown function (DUF3800)
PF12699 phiKZ-like phage internal head proteins
PF12952 Domain of unknown function (DUF3841)
PF13031 Protein of unknown function (DUF3892)
PF13181 Tetratricopeptide repeat
PF13271 Domain of unknown function (DUF4062)
PF13290 Chitobiase/beta-hexosaminidase C-terminal domain
PF13404 AsnC-type helix-turn-helix domain
PF13424 Tetratricopeptide repeat
PF13431 Tetratricopeptide repeat
PF13542 Helix-turn-helix domain of transposase family ISL3
PF13737 Transposase DDE domain
PF13749 Putative ATP-dependent DNA helicase recG C-terminal
PF13781 DoxX-like family
PF13795 HupE / UreJ protein
PF13856 ATP-binding sugar transporter from pro-phage
PF13935 Ead/Ea22-like protein
PF13937 Domain of unknown function (DUF4212)
PF13992 YecR-like lipoprotein
PF14076 Domain of unknown function (DUF4258)
PF14081 Domain of unknown function (DUF4262)
PF14213 Domain of unknown function (DUF4325)
PF14350 Beta protein
PF14470 Bacterial PH domain
PF14568 SMI1-KNR4 cell-wall
PF15891 Nucleoside 2-deoxyribosyltransferase like
PF15902 Sortilin neurotensin receptor 3
PF15919 HicB_like antitoxin of bacterial toxin-antitoxin system
PF16083 LydA holin phage
PF16084 LydA-holin antagonist
PF16184 Cadherin-like
PF16452 Bacteriophage CI repressor C-terminal domain
PF16461 Lambda phage tail tube protein TTP
PF16462 Phage tail assembly chaperone protein TAC
PF16463 Phage tail tube protein family
PF16684 Telomere resolvase
PF16697 Inner membrane component of T3SS cytoplasmic domain
PF16872 Putative phage abortive infection protein
PF16928 DNA/protein translocase of phage P22 injectosome
PF16931 Putative phage holin
PF17037 Cellulose biosynthesis protein BcsO

Table S4.5. Relative abundances of 857 deep-trap viruses approximated by IQR coverage normalized to the smallest library size (normalized coverage).

(omitted due to size and available upon request)
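
As a rough illustration of the "normalized coverage" named in the caption of Table S4.5, the sketch below assumes that IQR coverage means the mean per-base read depth restricted to values between the first and third quartiles, and that normalization rescales each sample by the ratio of the smallest library size to its own library size. Both readings are assumptions, and the function names `iqr_mean_coverage` and `normalize_to_smallest_library` are hypothetical.

```python
from statistics import mean, median, quantiles


def iqr_mean_coverage(per_base_coverage):
    """Mean depth over positions whose depth lies within [Q1, Q3] (assumed definition)."""
    values = sorted(per_base_coverage)
    if len(values) < 2:
        return float(values[0]) if values else 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    inner = [v for v in values if q1 <= v <= q3]
    # Fall back to the median if no observed depth lies inside the quartile range.
    return float(mean(inner)) if inner else float(median(values))


def normalize_to_smallest_library(coverage_by_sample, library_sizes):
    """Scale each sample's coverage as if it had the depth of the smallest library."""
    smallest = min(library_sizes.values())
    return {
        sample: cov * smallest / library_sizes[sample]
        for sample, cov in coverage_by_sample.items()
    }


# Toy example with made-up depths and library sizes.
cov = iqr_mean_coverage([0, 2, 4, 4, 4, 6, 100])
normalized = normalize_to_smallest_library(
    {"A": 4.0, "B": 10.0},
    {"A": 5_000_000, "B": 20_000_000},
)
```

Restricting the mean to the interquartile range damps the effect of short, very deeply covered or uncovered regions, which is one plausible reason to prefer it over a plain mean.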

Table S4.6. Relative abundances of 129 cellular MAGs approximated by IQR coverage weighted by scaffold length for each MAG and normalized to the smallest library size (normalized coverage).

(omitted due to size and available upon request)
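
For MAGs, the caption of Table S4.6 adds a weighting by scaffold length. A minimal sketch of that step, assuming it is a length-weighted mean of per-scaffold coverage values, could look as follows; `mag_coverage` is a hypothetical name and the numbers are made up.

```python
def mag_coverage(scaffolds):
    """Length-weighted mean coverage of one MAG.

    scaffolds: list of (length_bp, coverage) pairs, one pair per scaffold.
    """
    total_length = sum(length for length, _ in scaffolds)
    if total_length == 0:
        raise ValueError("MAG has no sequence")
    return sum(length * cov for length, cov in scaffolds) / total_length


# A 100 kb scaffold at 10x and a 50 kb scaffold at 4x.
example = mag_coverage([(100_000, 10.0), (50_000, 4.0)])
```

Weighting by length keeps short scaffolds, whose coverage estimates are noisier, from dominating the abundance estimate of the whole MAG.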

Table S4.7. Alignments between 21 deep-trap viruses and the ALOHA 2.0 viral database of viruses recovered from upper 500 m samples in the same environment (49). Columns are in BlastTab format and represent: deep trap virus, deep trap virus length, ALOHA2.0 virus, ALOHA2.0 virus length, taxonomy based on >=60% AAI across >=50% of proteins, percent nucleic acid identity, length of alignment, mismatches, gap opens, start of alignment on deep trap virus, end of alignment on deep trap virus, start of alignment on ALOHA2.0 virus, end of alignment on ALOHA2.0 virus, e-value, and bit score.

DTV  length  aloha_2.0_virus  length  taxonomy  identity  alignment_length  mismatches  gap_opens  DTV_start  DTV_end  aloha_2.0_virus_start  aloha_2.0_virus_end  evalue  bit_score
c_000000000367  68167  c_000000007519  68964  unknown  99.89  68167  66  1  1  68167  54  68209  0  1.08E+05
c_000000000393  49014  c_000000052847  45931  unknown  99.89  45217  50  1  1243  46458  715  45931  0  7.13E+04
c_000000000392  49243  c_000000047334  36836  unknown  99.97  36836  10  0  7757  44592  1  36836  0  5.82E+04
c_000000000450  35422  c_000000047348  35797  unknown  99.95  34974  16  0  34974  1  1  34974  0  5.53E+04
c_000000000465  33571  c_000000007487  165374  Prochlorococcus phage P-SSM2  99.89  33571  36  0  33571  1  3141  36711  0  5.30E+04
c_000000000750  27442  c_000000045251  63965  d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Caulobacterales;f__Maricaulaceae;g__Hyphobacterium19  99.91  27442  26  0  27442  1  7716  35157  0  4.33E+04
c_000000000526  26275  c_000000030215  43780  unknown  99.84  26276  40  2  26275  1  12140  38413  0  4.14E+04
c_000000000541  24695  c_000000018743  30355  unknown  99.99  24640  2  0  1  24640  5716  30355  0  3.90E+04
c_000000000772  20142  c_000000052722  37533  unknown  99.94  20142  13  0  1  20142  11310  31451  0  3.18E+04
c_000000000575  20112  c_000000030323  34068  unknown  99.95  20112  11  0  20112  1  397  20508  0  3.18E+04
c_000000000605  16218  c_000000018854  47113  unknown  99.72  16217  46  0  2  16218  905  17121  0  2.55E+04
c_000000000592  18079  c_000000013652  22624  unknown  99.66  16173  55  0  48  16220  6452  22624  0  2.54E+04
c_000000000562  22759  c_000000047174  60884  unknown  99.97  15876  4  0  15876  1  38387  54262  0  2.51E+04
c_000000000059  12652  c_000000047174  60884  unknown  99.98  12652  3  0  1  12652  18862  31513  0  2.00E+04
c_000000000636  12384  c_000000047174  60884  unknown  99.92  12384  10  0  1  12384  5815  18198  0  1.96E+04
c_000000000819  12133  c_000000018922  38344  unknown  99.67  12133  40  0  12133  1  21158  33290  0  1.91E+04
c_000000000631  13051  c_000000036801  11623  unknown  99.41  11432  68  0  1  11432  192  11623  0  1.79E+04
c_000000000226  10659  c_000000007927  24679  unknown  99.72  10659  30  0  1  10659  76  10734  0  1.68E+04
c_000000000614  15255  c_000000000432  10449  unknown  99.75  10449  26  0  14374  3926  1  10449  0  1.64E+04
c_000000000293  10058  c_000000056554  92787  Prochlorococcus phage P-SSM2  99.62  10058  38  0  10058  1  34172  44229  0  1.58E+04
c_000000000079  10675  c_000000053969  9565  unknown  98.15  6115  113  0  10675  4561  3451  9565  0  9.31E+03

References

1. McCave IN. Vertical flux of particles in the ocean. Deep Res. 1975;22(7):491–502.

2. Ducklow HW, Steinberg DK, Buesseler KO. Upper ocean carbon export and the biological pump. Oceanography. 2001;14(4):50–8.

3. Siegenthaler U, Sarmiento JL. Atmospheric carbon dioxide and the ocean. Nature. 1993;365(6442):119–25.

4. Turley CM, Mackie PJ. Biogeochemical significance of attached and free-living bacteria and the flux of particles in the NE Atlantic Ocean. Mar Ecol Prog Ser. 1994;115(1–2):191–204.

5. Turley CM, Stutt ED. Depth-related cell-specific bacterial leucine incorporation rates on particles and its biogeochemical significance in the Northwest Mediterranean. Limnol Oceanogr. 2000;45(2):419–25.

6. Aristegui J, Gasol JM, Duarte CM, Herndl GJ. Microbial oceanography of the dark ocean’s pelagic realm. Limnol Oceanogr. 2009;54(5):1501–29.

7. Fontanez KM, Eppley JM, Samo TJ, Karl DM, DeLong EF. Microbial community structure and function on sinking particles in the North Pacific Subtropical Gyre. Front Microbiol. 2015;6:469.

8. Pelve EA, Fontanez KM, DeLong EF. Bacterial succession on sinking particles in the ocean’s interior. Front Microbiol. 2017;8:2669.

9. Boeuf D, Edwards BR, Eppley JM, Hu SK, Poff KE, Romano AE, et al. Biological composition and microbial dynamics of sinking particulate organic matter at abyssal depths in the oligotrophic open ocean. Proc Natl Acad Sci USA. 2019;116(24):11824–32.

10. Preston CM, Durkin CA, Yamahara KM. DNA metabarcoding reveals organisms contributing to particulate matter flux to abyssal depths in the North East Pacific ocean. Deep Res Part II. 2020;173:104708.

11. Mestre M, Ruiz-González C, Logares R, Duarte CM, Gasol JM, Sala MM. Sinking particles promote vertical connectivity in the ocean microbiome. Proc Natl Acad Sci USA. 2018;115(29):E6799–807.

12. Jiao N, Herndl GJ, Hansell DA, Benner R, Kattner G, Wilhelm SW, et al. Microbial production of recalcitrant dissolved organic matter: Long-term carbon storage in the global ocean. Nat Rev Microbiol. 2010;8(8):593–9.

13. DeLong EF, Franks DG, Alldredge AL. Phylogenetic diversity of aggregate‐attached vs. free‐living marine bacterial assemblages. Limnol Oceanogr. 1993;38(5):924–34.

14. Rieck A, Herlemann DPR, Jürgens K, Grossart HP. Particle-associated differ from free-living bacteria in surface waters of the Baltic Sea. Front Microbiol. 2015;6:1297.

15. Crespo BG, Pommier T, Fernández-Gómez B, Pedrós-Alió C. Taxonomic composition of the particle-attached and free-living bacterial assemblages in the Northwest Mediterranean Sea analyzed by pyrosequencing of the 16S rRNA. Microbiologyopen. 2013;2(4):541–52.

16. Eloe EA, Shulse CN, Fadrosh DW, Williamson SJ, Allen EE, Bartlett DH. Compositional differences in particle-associated and free-living microbial assemblages from an extreme deep-ocean environment. Environ Microbiol Rep. 2011;3(4):449–58.

17. Ghiglione JF, Mevel G, Pujo-Pay M, Mousseau L, Lebaron P, Goutx M. Diel and seasonal variations in abundance, activity, and community structure of particle-attached and free-living bacteria in NW Mediterranean Sea. Microb Ecol. 2007;54(2):217–31.

18. López-Pérez M, Kimes NE, Haro-Moreno JM, Rodriguez-Valera F. Not all particles are equal: The selective enrichment of particle-associated bacteria from the Mediterranean Sea. Front Microbiol. 2016;7:996.

19. Proctor LM, Fuhrman JA. Roles of viral infection in organic particle flux. Mar Ecol Prog Ser. 1991;69:133–42.

20. Peduzzi P, Weinbauer MG. Effect of concentrating the virus-rich 2-200 nm size fraction of seawater on the formation of algal flocs (marine snow). Limnol Oceanogr. 1993;38(7):1562–5.

21. Karl DM, Church MJ. Microbial oceanography and the Hawaii Ocean Time-series programme. Nat Rev Microbiol. 2014;12:1–15.

22. Karl DM, Church MJ, Dore JE, Letelier RM, Mahaffey C. Predictable and efficient carbon sequestration in the North Pacific Ocean supported by symbiotic nitrogen fixation. Proc Natl Acad Sci USA. 2012;109(6):1842–9.

23. Poff K, Leu AO, Eppley JM, Karl DM, DeLong EF. Microbial dynamics of the open ocean summer export pulse. In review.

24. Leu AO, Eppley JM, DeLong EF. Comparative genomics of particle-attached versus free-living bacteria. In preparation.

25. Karl DM, Lukas R. The Hawaii Ocean Time-series (HOT) program: Background, rationale and field implementation. Deep Res Part II. 1996;43(2–3):129–56.

26. Roux S, Enault F, Hurwitz BL, Sullivan MB. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3:e985.

27. Kieft K, Zhou Z, Anantharaman K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome. 2020;8(1):90.

28. Arumugam M, Harrington ED, Raes J, Foerstner KU, Bork P. SmashCommunity: a metagenomic annotation and analysis tool. Bioinformatics. 2010;26(23):2977–8.

29. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.

30. Roux S, Emerson JB, Eloe-Fadrosh EA, Sullivan MB. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017;5:e3817.

31. Hyatt D, Chen G, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.

32. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195.

33. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36(Database issue):281–8.

34. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.

35. Mizuno CM, Guyomar C, Roux S, Lavigne R, Rodriguez-Valera F, Sullivan M, et al. Numerous cultivated and uncultivated viruses encode ribosomal proteins. Nat Commun. 2019;10:752.

36. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21:487–93.

37. Nishimura Y, Watai H, Honda T, Mihara T, Omae K, Roux S, et al. Environmental viral genomes shed new light on virus-host interactions in the ocean. mSphere. 2017;2(2):e00359-16.

38. Imai T. sprai = single pass read accuracy improver [Internet]. 2013. Available from: http://zombie.cb.k.u-tokyo.ac.jp/sprai/

39. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12.

40. Beaulaurier J, Luo E, Eppley JM, Den Uyl P, Dai X, Burger A, et al. Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities. Genome Res. 2020;30(3):437–46.

41. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36(10):996.

42. Skennerton CT, Imelfort M, Tyson GW. Crass: Identification and reconstruction of CRISPR from unassembled metagenomic data. Nucleic Acids Res. 2013;41(10):e105.

43. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, Mcveigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(Database issue):733–45.

44. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. Expanding the marine virosphere using metagenomics. PLoS Genet. 2013;9(12):e1003987.

45. Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. Genomes of abundant and widespread viruses from the deep ocean. MBio. 2016;7(4):e00805-16.

46. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and biogeochemical impacts of uncultivated globally abundant ocean viruses. Nature. 2016;537:689–93.

47. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536(7617):425–30.

48. López-Pérez M, Haro-Moreno JM, Gonzalez-Serrano R, Parras-Moltó M, Rodriguez-Valera F. Genome diversity of marine phages recovered from Mediterranean metagenomes: Size matters. PLoS Genet. 2017;13(9):1–23.

49. Luo E, Eppley JM, Romano AE, Mende DR, DeLong EF. Double-stranded DNA virioplankton dynamics and reproductive strategies in the oligotrophic open ocean water column. ISME J. 2020;14:1304–1315.

50. Coutinho FH, Silveira CB, Gregoracci GB, Thompson CC, Edwards RA, Brussaard CPD, et al. Marine viruses discovered via metagenomics shed light on viral strategies throughout the oceans. Nat Commun. 2017;8:15955.

51. Gregory AC, Zayed AA, Sunagawa S, Wincker P, Sullivan MB, Ferland J, et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell. 2019;177:1–15.

52. Luo E, Aylward FO, Mende DR, DeLong EF. Bacteriophage distributions and temporal variability in the ocean’s interior. MBio. 2017;8(6):e01903-17.

53. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ. 2015;3:e1319.

54. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.

55. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria; 2019. Available from: https://www.r-project.org/

56. Sorek R, Kunin V, Hugenholtz P. CRISPR - A widespread system that provides acquired resistance against phages in bacteria and archaea. Nat Rev Microbiol. 2008;6(3):181–6.

57. Lauro FM, Chastain RA, Blankenship LE, Yayanos AA, Bartlett DH. The unique 16S rRNA genes of piezophiles reflect both phylogeny and adaptation. Appl Environ Microbiol. 2007;73(3):838–45.

58. Berg KA, Lyra C, Sivonen K, Paulin L, Suomalainen S, Tuomi P, et al. High diversity of cultivable heterotrophic bacteria in association with cyanobacterial water blooms. ISME J. 2009;3:314–25.

59. McDonnell AMP, Boyd PW, Buesseler KO. Effects of sinking velocities and microbial respiration rates on the attenuation of particulate carbon fluxes through the mesopelagic zone. Global Biogeochem Cycles. 2015;29:175–93.

60. Qiu B, Koh DA, Lumpkin C, Flament P. Existence and formation mechanism of the North Hawaiian Ridge Current. J Phys Oceanogr. 1997;27:431–44.

61. Turner JT. Zooplankton fecal pellets, marine snow, phytodetritus and the ocean’s biological pump. Prog Oceanogr. 2015;130:205–48.

62. Mende DR, Bryant JA, Aylward FO, Eppley JM, Nielsen T, Karl DM, et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat Microbiol. 2017;2(10):1367–73.

63. Lauro FM, McDougald D, Thomas T, Williams TJ, Egan S, Rice S, et al. The genomic basis of trophic strategy in marine bacteria. Proc Natl Acad Sci USA. 2009;106(37):15527–33.

64. Weitz JS, Li G, Gulbudak H, Cortez MH, Whitaker RJ. Viral invasion fitness across a continuum from lysis to latency. Virus Evol. 2019;5(1):1–9.

65. Riemann L, Grossart HP. Elevated lytic phage production as a consequence of particle colonization by a marine Flavobacterium (Cellulophaga sp.). Microb Ecol. 2008;56(3):505–12.

66. Moebus K. Marine bacteriophage reproduction under nutrient-limited growth of host bacteria. I. Investigations with six phage-host systems. Mar Ecol Prog Ser. 1996;144:1–12.

67. Middelboe M. Bacterial growth rate and marine virus–host dynamics. Microb Ecol. 2000;40:114–24.

68. Yayanos AA. Evolutional and ecological implications of the properties of deep-sea barophilic bacteria. Proc Natl Acad Sci USA. 1986;83(24):9542–6.

69. Vaulot D, Marie D, Olson RJ, Chisholm SW. Growth of Prochlorococcus, a Photosynthetic Prokaryote, in the Equatorial Pacific Ocean. Science. 1995;268(9):1480–2.

70. Avrani S, Schwartz DA, Lindell D. Virus-host swinging party in the oceans: Incorporating biological complexity into paradigms of antagonistic coexistence. Mob Genet Elements. 2012;2(2):88–95.

71. Jin L, Lee HG, Kim HS, Ahn CY, Oh HM. Caulobacter daechungensis sp. nov., a stalked bacterium isolated from a eutrophic reservoir. Int J Syst Evol Microbiol. 2013;63:2559–64.

72. Guidi L, Chaffron S, Bittner L, Eveillard D, Larhlimi A, Roux S, et al. Plankton networks driving carbon export in the oligotrophic ocean. Nature. 2016;532(7600):465–70.

73. Weinbauer MG. Ecology of prokaryotic viruses. FEMS Microbiol Rev. 2004;28(2):127–81.

74. Wilhelm SW, Suttle CA. Viruses and nutrient cycles in the sea. Bioscience. 1999;49(10):781–8.

75. Gobler CJ, Hutchins DA, Fisher NS, Cosper EM, Sañudo-Wilhelmy SA. Release and bioavailability of C, N, P, Se, and Fe following viral lysis of a marine chrysophyte. Limnol Oceanogr. 1997;42(7):1492–504.

76. Middelboe M, Jørgensen NOG, Kroer N. Effects of viruses on nutrient turnover and growth efficiency of noninfected marine bacterioplankton. Appl Environ Microbiol. 1996;62(6):1991–7.

77. Shibata A, Kogure K, Koike I, Ohwada K. Formation of submicron colloidal particles from marine bacteria by viral infection. Mar Ecol Prog Ser. 1997;155:303–7.

78. Yamada Y, Tomaru Y, Fukuda H, Nagata T. Aggregate formation during the viral lysis of a marine diatom. Front Mar Sci. 2018;5:1–7.

79. Lawrence JE, Suttle CA. Effect of viral infection on sinking rates of Heterosigma akashiwo and its implications for bloom termination. Aquat Microb Ecol. 2004;37(1):1–7.

80. Sianturi ET. Viruses of the eukaryotic plankton are predicted to increase carbon export efficiency in the global sunlit ocean. bioRxiv. 2019;(doi.org/10.1101/710228).

CHAPTER 5. SUMMARY, FUTURE DIRECTIONS, AND QUESTIONS MOVING FORWARD

Summary

Viruses influence ecosystem dynamics, microbial diversity, and biogeochemical cycling across the global oceans. Despite their importance, viruses represent an undersampled component of marine ecosystems, particularly in the open oceans, which cover roughly 40% of our planet. At our open-ocean study site in the North Pacific Subtropical Gyre, Station ALOHA, viruses reach abundances of 10^7 per mL in surface waters (1), and their diversity has remained largely unexplored. Efforts to sample this diversity constitute the three data chapters described in this dissertation. In this body of work, I analyzed metagenomic samples collected from both planktonic cells and virioplankton from the upper ocean, as well as sinking particles from the deep ocean. From these metagenomes, I recovered viral population genomes and explored how viral diversity, reproductive strategies, and temporal dynamics varied across the open ocean’s water column.

Chapter II represents the first metagenomic study targeting bacterioplankton-associated viral assemblages at Station ALOHA (2). Sporadic snapshots were considered inadequate for describing microbial assemblages in dynamic environments (3,4), so repeated depth-profile sampling was performed on the monthly HOT cruises to explore the spatiotemporal variability of microbial assemblages at Station ALOHA. Bacterioplankton samples were sequenced from 7 depths, from 25 – 1000 m, and spanning 1.5 years, during 2010 – 11. From these 83 metagenomes, I created a bioinformatics workflow to assemble viral population genomes, and recovered 129 genomes of abundant cell-associated viruses in this environment. I explored how viral diversity, reproductive strategies, and temporal variability changed, both with their host (5) and with collection depth. In total, 61% of viral populations were novel (not homologous at >=60% amino acid identity across >=50% of proteins) with respect to previous studies (6–9), with particularly high novelty in the aphotic ocean, which appeared under-sampled with respect to surface waters. Both virus and host assemblages were highly structured by depth. No evidence of eurybathic viruses was observed throughout 25 – 1000 m, suggesting that viruses are specific to hosts that are adapted to depth-specific niches throughout the water column. I identified temperate phages using genomic markers for lysogeny, such as integrase and excisionase, used respectively to integrate into and excise from host genomes. The proportion of assembled temperate phages relative to total assembled viruses increased with depth, consistent with previous hypotheses suggesting that lysogeny may be favorable in habitats with low host abundance or productivity (10–13). Interestingly, most viral populations were persistent through the 1.5-year time-series, a finding that is inconsistent with the idea that viral populations rapidly diversify in a co-evolutionary arms race with their hosts. Viral persistence suggests that stabilizing variables might be at play in this environment, such as possible limitations in evolutionary trajectories in a nutrient-poor environment. The temporal variability of viral assemblages as a whole increased with depth, potentially reflecting ecological differences between surface and mesopelagic assemblages.
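The novelty criterion quoted above (>=60% amino acid identity across >=50% of proteins) can be illustrated with a minimal sketch. It assumes a hypothetical input in which each viral population has, for every reference virus, a list of percent amino acid identities for the proteins that matched that reference. The function name, the input layout, and the choice to evaluate each reference genome separately are illustrative assumptions, not the workflow used in Chapter II.

```python
from typing import Dict, List


def is_novel(
    hits_by_reference: Dict[str, List[float]],
    n_proteins: int,
    min_identity: float = 60.0,
    min_fraction: float = 0.5,
) -> bool:
    """Return True if no single reference virus matches the population.

    hits_by_reference maps a reference virus ID (hypothetical) to the percent
    amino acid identities of the population's proteins that hit it.
    n_proteins is the total number of predicted proteins in the population.
    """
    if n_proteins <= 0:
        raise ValueError("n_proteins must be positive")
    for identities in hits_by_reference.values():
        # Count proteins matching this reference at or above the identity cutoff.
        shared = sum(1 for pid in identities if pid >= min_identity)
        # Homologous if enough of the population's proteins are shared.
        if shared / n_proteins >= min_fraction:
            return False
    return True
```

Under these assumptions, a population with four proteins, two of which match one reference at 65% and 70% identity, would be classed as homologous (not novel), while a population with no qualifying hits would be classed as novel.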

Chapter III builds on Chapter II through additional sampling of virioplankton (operational definition 0.02 – 0.2 µm) in tandem with viruses associated with planktonic cells (operational definition >0.2 µm) (14). Targeted sampling of the virioplankton size fraction enabled sequencing and assembly of packaged double-stranded DNA virions at Station ALOHA. Samples from 5 – 500 m, over a period spanning 1.5 years during 2014 – 16, were sequenced by short-read sequencing technologies (Illumina), resulting in 374 metagenomes, half targeting the virus-enriched and half the cell-enriched size fraction. Analyzing metagenomes from both size fractions, I generated the ALOHA 2.0 viral database of 16,787 populations. Similar to observations reported in Chapter II, the majority of viral populations (52%) were novel with respect to previously studied viruses (6–9,15), particularly in the mesopelagic and bathypelagic samples. Previous work by collaborators in the DeLong Lab had shown that prokaryotic diversity peaked at and just below the DCM (5). My work showed that the prokaryotic diversity maximum at the base of the euphotic zone coincides with a “hotspot” of viral population diversity, possibly reflecting metabolically diverse host assemblages existing there that can utilize both light and chemical substrates for energy. Distributions of viral auxiliary metabolic genes revealed depth-specific adaptations to light-energy, nitrogen, and phosphorus metabolism. Consistent with observations described in Chapter II, most viral populations appear to be specific to a narrow depth range along the water column. Not only did many viral populations persist throughout this 1.5-year time-series from 2014 – 16, the majority were also found within the 2010 – 11 dataset used in Chapter II.

Coupled sampling of both size fractions enabled the calculation of viral population abundance ratios between virioplankton and cellular size fractions (VC ratio). The VC abundance ratio provided independent confirmation of genomic temperate phage identification through marker genes. On average, viruses carrying temperate phage markers displayed lower VC ratios than other populations, reflecting a lower extracellular to intracellular abundance ratio that is consistent with temperate reproductive strategies. The VC ratio also revealed that most temperate phages found in this environment were active and capable of producing free virions, and that temperate phage production was highest at the base of the euphotic zone (150 – 250 m). Temperate phages were rare in surface waters and increased in abundance with depth, particularly as putative prophages in the cell-enriched fraction. Taken together, these analyses revealed that lytic virus-host interactions dominated the surface waters, lysogenic virus-host interactions increased with depth, and flexible reproductive strategies peaked within the transitional depths at the base of the euphotic zone.
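The VC ratio described above can be sketched as a simple quotient of a population's abundance in the two size fractions. The sketch below is illustrative only: the function names, the tuple layout, the handling of populations undetected in the cellular fraction, and the toy abundance values are all assumptions made for this example and are not data or code from Chapter III.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def vc_ratio(virus_fraction_abundance: float,
             cell_fraction_abundance: float) -> Optional[float]:
    """Abundance in the virioplankton fraction divided by abundance in the
    cellular fraction. Returns None if undetected in the cellular fraction."""
    if virus_fraction_abundance < 0 or cell_fraction_abundance < 0:
        raise ValueError("abundances must be non-negative")
    if cell_fraction_abundance == 0:
        return None
    return virus_fraction_abundance / cell_fraction_abundance


# Invented toy values for illustration only (not data from this study):
# (population ID, carries temperate marker, virus-fraction, cell-fraction)
toy_populations: List[Tuple[str, bool, float, float]] = [
    ("pop_A", True, 2.0, 8.0),
    ("pop_B", True, 1.0, 2.0),
    ("pop_C", False, 30.0, 3.0),
    ("pop_D", False, 12.0, 4.0),
]


def mean_vc_by_group(
    populations: List[Tuple[str, bool, float, float]]
) -> Dict[bool, float]:
    """Mean VC ratio for populations with and without temperate markers."""
    groups: Dict[bool, List[float]] = {True: [], False: []}
    for _name, has_marker, virus_abund, cell_abund in populations:
        ratio = vc_ratio(virus_abund, cell_abund)
        if ratio is not None:
            groups[has_marker].append(ratio)
    return {key: mean(vals) for key, vals in groups.items() if vals}
```

In the toy values, the two marker-carrying populations have the lower mean ratio, which is the direction of the pattern reported above. A real analysis would also need to normalize abundances between fractions and decide how to treat zero counts.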

In Chapter IV, I explored viral diversity on sinking particles in the deep ocean. Previously, metagenomic samples obtained from a three-year time-series collected from a sediment trap moored at 4000 m at Station ALOHA were used to examine cellular organisms and microbial dynamics associated with sinking particles (16,17). From the 63 metagenomes obtained in this study, I generated a deep trap virus database of 857 viral populations found on sinking particles in the deep ocean. These sinking-particle associated viruses appear to be even more distinct than planktonic viruses from the upper ocean in Chapters II and III. In total, 86% of viral populations were novel with respect to previously studied viruses (6–9,15). This large degree of novelty complicates taxonomic identification based on similarity to known viral sequences in reference databases. To overcome this challenge and associate specific viral populations with their probable hosts, I used metagenome-assembled genomes generated by collaborator Andy Leu in the DeLong Lab (17,18) to look for viruses that could be linked to prophage sequences associated with host genomes. This method yielded 68 virus-host linkages at a high confidence level relative to using reference databases to identify viruses and their putative hosts. These virus-host links revealed intriguing patterns in abundance profiles between some temperate phages and their hosts. Some temperate phages displayed near-identical abundance profiles with their hosts through the three-year sampling period, consistent with a prophage that has integrated into nearly all of its host genomes. Other temperate phages displayed decoupled abundance profiles compared with their hosts, indicating that some prophages became absent or reappeared during our three-year deep-moored sediment-trap sampling period. Taken together, these patterns reveal the dynamic microdiversity of viral and host populations observed on deep-sea sinking particles.

Studying viruses on sinking particles is also relevant to the “viral shuttle” hypothesis, in which viral lysis promotes export through particle formation and aggregation. Laboratory studies in favor of this hypothesis found that viral lysis enhances aggregation and sinking in large cultured eukaryotic hosts, but we do not know if this process is relevant to prokaryote-dominated open ocean habitats. This hypothesis is difficult to test in situ, because other diverse processes may also lead to the aggregation and sinking of particulate material. Furthermore, it is difficult to link lysed and aggregated cells in surface waters to particle-associated viruses, since some viruses might be found on particles simply as adsorbed “bycatch”, and not due to their enhancement of aggregation and sinking. Interpretations are further complicated by variable particle sinking rates (19), as well as horizontal advection that can carry particles at rates up to several kilometers per day (20). Considering these challenges, I used this dataset to ask the more preliminary question of whether there is any evidence of vertical transport of viruses. If not, then it is unlikely that viruses enhance particle aggregation and export in this environment. To look for evidence of vertical transport, I identified deep-trap viral populations that were also found in planktonic assemblages from the upper water column (Chapter III). Since we observed that viral populations were depth-specific, deep trap virus populations that were present in the upper water column were likely transported on sinking particles from the surface to the deep ocean. I found that 21 deep trap virus populations were nearly identical to ALOHA 2.0 viruses sampled from the upper 500 m, indicating that they likely originated there. Taxonomic assignments and correlation with carbon export flux provided independent lines of evidence that these sediment trap viruses originated near surface waters. The three identifiable viruses were cyanophages or phages infecting bacterial taxa that have been associated with surface-attached lifestyles and, sometimes, cyanobacterial blooms. These analyses represent a step towards examining the viral shuttle hypothesis in natural assemblages by identifying viral groups that contribute to vertical export flux. Whether this contribution is active through lysis and aggregation, or passive through adsorption onto particles, remains an open question.

Studying viruses on sinking particles revealed differences between particle-associated and planktonic viral assemblages described in Chapters II and III. Most notably, particle-associated viruses displayed higher temporal variability than their planktonic counterparts. Most particle-associated viruses found at 4000 m were present only in a few samples over the three-year sampling period, whereas planktonic viruses from the upper ocean generally persisted through similar timescales. In addition, particle-associated viruses appeared depleted in temperate phages relative to planktonic cell-associated viruses observed in the mesopelagic ocean, likely reflecting sampling efficiencies, or potentially real ecological differences, between free-living and particle-attached habitats. Overall, these studies of viruses on sinking particles complemented the planktonic focus of previous chapters by providing contrasting perspectives on particle-attached microbial assemblages.

Taken together, these studies revealed how viral diversity, reproductive strategies, and temporal dynamics varied across the open ocean water column, and between planktonic and particle-attached habitats. Planktonic viruses in the upper ocean generally were specific to a narrow depth range and persisted at interannual timescales. Viral assemblages displayed depth-specific metabolic and reproductive strategies, reflecting environmental structure on viral diversity. In situ characterization of both sinking particle-associated and planktonic viruses revealed differences in viral temporal persistence and reproductive strategies, highlighting ecological differences between these habitats. Studying viruses from both the upper and deep oceans provided preliminary evidence of vertical transport of viruses on sinking particles in this environment, raising further questions concerning the mechanism of how viruses contribute to export. The culmination of these projects revealed the diversity and dynamics of some of the most abundant life forms in an environment representing the largest biome on Earth. The viral population genomes curated by these studies, the ALOHA 1.0, 2.0, and deep trap virus databases, serve as reference databases for further research on environmental viruses (e.g., (21,22)). Future research directions to further characterize viral diversity in the open ocean are discussed below.

Future directions

This section describes the background research for a future direction that I plan to pursue in my postdoctoral studies: developing novel long-read sequencing approaches to explore viral diversity. As discussed in Chapter I, major discoveries in the field of viral metagenomics have relied on developments in novel sequencing technologies. One recent development is known as long-read sequencing, in contrast to short-read sequencing, which generates reads on the order of a few hundred bases. Currently, the vast majority of metagenomic studies use short DNA sequence reads, which necessitate additional assembly steps to form longer pieces of contiguous genomic information, since viral genomes are typically on the order of tens of thousands of base pairs. Short-read assembly is computationally intensive, making it difficult to recover genomes from complex assemblages (23); it is prone to generating assembly errors, such as chimeras caused by erroneous linkage of sequences that do not belong together (23,24); it is also prone to missing complex genomic regions such as hypervariable regions or repeats, which results in smaller genomic fragments instead of complete genomes (25). Recent advances in long-read sequencing technology can overcome some of these challenges. Long-read sequencing has the potential to capture an entire viral genome in a single read, bypass the assembly step, and span complex or repetitive regions to accurately preserve information across genomes (Fig. 5.1).

Current long-read sequencing technology, however, has its specific challenges. Special consideration is required in DNA extraction to avoid fragmentation. The current long-read sequencing landscape is dominated by Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio). Both sequencers require large (microgram) amounts of high molecular weight DNA input. This requirement makes it difficult to sequence rare or small microbes such as viruses without amplifying their DNA, which can introduce biases that complicate quantitative analysis of metagenomes (26). ONT sequencing is cost-effective relative to PacBio, which could enable deeper sequencing to capture rarer genotypes, but it also yields higher sequencing error rates, particularly compared with short-read technologies (27). ONT sequencing determines DNA bases by measuring changes in electrical conductivity as each base passes through a nanoporous protein on a membrane (28). This technology is sensitive to DNA base-pair modifications that change the shape of the DNA, with recent applications producing error rates of 2 – 6% (29). Since viral “species” are typically determined at >=95% nucleic acid identity (30), such high error rates are problematic for studying viral assemblages in the environment.
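A short back-of-the-envelope calculation illustrates why error rates of 2 – 6% matter against a >=95% identity threshold. The sketch below assumes a deliberately simplified model that is not taken from the cited studies: errors are independent across bases, each error creates a mismatch, and two erroneous reads never share the same wrong base by chance. The function names are hypothetical.

```python
SPECIES_THRESHOLD = 0.95  # >=95% nucleic acid identity, as cited in the text


def identity_to_reference(error_rate: float) -> float:
    """Expected identity between one raw read and its error-free genome."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return 1.0 - error_rate


def identity_between_reads(error_rate: float) -> float:
    """Expected identity between two raw reads of the same genome, assuming
    independent errors: a position matches only if both reads are correct."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return (1.0 - error_rate) ** 2


if __name__ == "__main__":
    for rate in (0.02, 0.06):
        print(
            f"error rate {rate:.0%}: "
            f"read vs genome {identity_to_reference(rate):.4f}, "
            f"read vs read {identity_between_reads(rate):.4f}, "
            f"threshold {SPECIES_THRESHOLD:.2f}"
        )
```

Under this simplified model, two raw reads from the same genome stay above the threshold at a 2% error rate but fall below it at 6%, so sequencing error alone could split one population into apparent separate "species" unless reads are polished or corrected.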

Considering these challenges, novel approaches are needed to fully leverage long-read technology. To our knowledge, long-read sequencing has only been applied in one other virus study in marine systems. However, this study fragmented and amplified viral DNA, and used an assembly step to piece together smaller DNA fragments to reconstruct viral genomes (21). While longer reads helped recover many genomic sequences, this approach did not directly address the above biases in amplification and assembly.

To further explore the potential of long-read sequencing, the DeLong laboratory collaborated with the ONT long-read sequencing platform to pilot an amplification-free and assembly-free approach, to recover complete viral genomes from natural marine populations (31). During my PhD, I contributed to this collaborative project with bioinformatic analyses of the assembly-free viral genomes recovered from this novel approach.

This collaboration showed that a single sample sequenced using long reads can recover thousands of complete viral genomes, which was a significant improvement over short-read assemblies that tended to generate genomic fragments (Fig. 5.3, (31)). Raw long reads displayed peaks in length distribution that ranged from ~35 kbp to ~63 kbp, depending on the sample depth (Fig. 5.2, (31)). These results are generally consistent with size ranges of native Pacific Ocean virus populations as determined by pulsed-field gel electrophoresis (32). By comparison, when samples were subjected to short-read assembly (Chapter III), 4.2 TB of sequencing data from 374 metagenomes yielded 16,787 viral population genomes, of which 961 were identified as complete. In contrast, a single 12 GB metagenome sequenced using long reads yielded 1229 complete viral genomes (31). Since a single long read can span an entire viral genome, long-read sequencing appears to be effective and reliable in recovering complete polished viral genomes from environmental samples (examples shown in Fig. 5.4, (31)).

These complete viral genomes revealed repetitive genomic regions that would have been otherwise missed using short-read assemblies, thereby highlighting two interesting aspects of marine viruses. First, the ability to sequence end repeats of individual viral genomes enabled the identification of the genome replication strategies of viral populations. Viruses replicate by generating connected copies of their DNA, then packaging individual genomes by cleaving at non-specific or specific sites (33). Since many viruses are terminally redundant with end repeats, we mapped the position of these repeats to identify viral packaging strategies. Non-specific cleavage results in circularly permuted genomes within a viral population, with end-repeats that span the genome randomly (Fig. 5.5, left; adapted from (31)). In contrast, a specific cleavage mechanism results in identical genomes within a viral population, with end-repeats located in the same location (Fig. 5.5, right; adapted from (31)). To explore possible biological differences associated with packaging strategies, I compared differences in gene content between populations with circularly permuted genomes or with specific end-repeats. Viral populations with circularly permuted genomes were enriched in portal proteins (Fig. 5.6) involved in sensing pressure (34), which is potentially important during genome packaging with nonspecific cleavage. On the other hand, viral populations with specific end-repeats were enriched in genes involved with sequence diversity and lysogeny (Fig. 5.6), which confer greater flexibility in host range and reproductive strategies, respectively. Taken together, these results raise the possibility that viral genome packaging strategies could influence viral diversity and evolution.
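The distinction between the two packaging strategies can be sketched as a test of whether end-repeat positions cluster at one site or spread around the genome. The heuristic below is a hypothetical illustration and not the method used in (31): the function names, the 500 bp window, and the 80% cutoff are arbitrary assumptions chosen for the example.

```python
from typing import List


def circular_distance(a: int, b: int, genome_length: int) -> int:
    """Shortest distance between two positions on a circular coordinate."""
    d = abs(a - b) % genome_length
    return min(d, genome_length - d)


def classify_packaging(
    end_positions: List[int],
    genome_length: int,
    window: int = 500,
    min_fraction: float = 0.8,
) -> str:
    """Classify a viral population from the mapped positions of its reads'
    end repeats on a shared reference coordinate (hypothetical input).

    If most end repeats fall within `window` bp of a single site, the
    population is called "specific ends"; otherwise "circularly permuted".
    """
    if not end_positions:
        raise ValueError("need at least one end-repeat position")
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # For each observed position, count how many positions lie close to it.
    best_cluster = max(
        sum(
            1
            for q in end_positions
            if circular_distance(p, q, genome_length) <= window
        )
        for p in end_positions
    )
    if best_cluster / len(end_positions) >= min_fraction:
        return "specific ends"
    return "circularly permuted"
```

Using circular distance means that repeats mapped just before and just after the coordinate origin are treated as the same site. A production analysis would need enough reads per population to distinguish a true spread from sampling noise.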

The second interesting aspect, shown by long reads, was that some viral particles contained parasitic DNA (31). The ability to sequence repetitive regions revealed a group of viral parasites at Station ALOHA called phage-inducible chromosomal islands (PICIs, (31,35)). PICIs represent a unique class of mobile genetic elements that replicate by hijacking the infection process of a “helper virus”, replacing the helper’s DNA with their own (Fig. 5.7, (36)). Some polished long-read genomes contained connected repeating copies of DNA, whose gene content and genome length (5 – 13 kbp) were consistent with known PICIs (Fig. 5.8, (31,35)). The total lengths of these repeats (33 – 66 kbp) were consistent with viral genome sizes, likely due to their packaging by the headful mechanism in which DNA is translocated into limited-capacity viral capsids until filled (31,37). These repeating PICI sequences, revealed by long reads, provided the first evidence that these viral parasites are actually packaged as concatemers in viral particles in the oceans (31).
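The relationship between element length and packaged length under headful packaging reduces to simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes the capsid is filled entirely with tandem copies of a single element; it is meant only to show how the reported length ranges relate, not to reproduce the analysis in (31).

```python
def concatemer_copies(packaged_length_kbp: float,
                      element_length_kbp: float) -> float:
    """Approximate number of tandem element copies in one packaged molecule,
    assuming the capsid holds only copies of that element."""
    if element_length_kbp <= 0:
        raise ValueError("element length must be positive")
    if packaged_length_kbp < 0:
        raise ValueError("packaged length must be non-negative")
    return packaged_length_kbp / element_length_kbp


if __name__ == "__main__":
    # Schematic values from the life-cycle diagram (Fig. 5.7):
    # an 8 kbp element in a capsid with 40 kbp capacity.
    print(concatemer_copies(40, 8))
    # Arithmetic bounds implied by the reported ranges (5 - 13 kbp elements,
    # 33 - 66 kbp total lengths); these are not observed copy numbers.
    print(concatemer_copies(33, 13), concatemer_copies(66, 5))
```

The schematic case gives five copies per capsid, matching the "5x" shown in the life-cycle diagram.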

As shown in this previous work, long-read sequencing can be a useful tool in studying novel biology in the oceans. Developing and utilizing long-read sequencing approaches to further characterize viral diversity in the ocean will form the foundation of my postdoctoral work. The initial explorative project described above broadly sequenced the 30 kilodalton to 0.1 µm size fraction, which included viral particles, ultra-small cells, free DNA, and cellular vesicles (31). Building on this work, targeted sequencing of viral particles will enable recovery of more diverse viral genomes, particularly rare viral populations or viral parasites that would be missed by a broader sequencing of the entire size fraction. These rare viruses include populations with circularly permuted genomes, which represented only 6% of viral populations from the preliminary study. Targeted sequencing of viral particles will also confirm that recovered genomes such as PICI-like elements are inside viral particles, and enable estimating the proportion of viral particles that contain parasitic genomes. Building on our preliminary study, I am excited to study what other aspects of unexplored biology these long-read sequences might bring to light.

Questions moving forward

The identification of thousands of novel genomes from viral populations throughout this thesis hints at an important question for future research in viral metagenomics. Considering the dearth of reference sequences to provide taxonomic and functional context to sequenced viral populations, how can we fully utilize our sequencing potential to explore viral diversity? This challenge could be potentially addressed by developing new ways to link viruses to their hosts, either in silico through bioinformatics, or in the laboratory through culturing. Continuous developments in bioinformatics methods could include using either nucleotide frequency (e.g., (38)) or cellular metagenome-assembled genomes (Chapter IV) to link viruses to their hosts. Improving culturing techniques to address the “great plate count anomaly” that has plagued microbiology for a century (reviewed in (39)) will help expand the range of hosts (and their specific viruses) that can be studied in the lab, and thereby provide context for virus-host links. Accurately cataloguing the taxonomy and function underlying novel viral diversity remains an ongoing challenge in the field of viral metagenomics.

The observation of depth-specific viral reproductive strategies (Chapters II and III) raises another open question in microbial oceanography: how can we improve our understanding of the environmental variables that shape viral reproduction, to better model virus-host interactions in the ocean? This question might be first addressed with virus-host laboratory co-cultures to confirm that viral reproductive strategies, inferred from in situ genome-based observations, translate into true biological differences in viral productivity and the fate of the host. Building on these data, laboratory experiments might help constrain variables that promote population-level shifts to non-lytic strategies such as lysogeny, prophage induction, pseudolysogeny, or chronic infections. These variables could include both (i) abiotic factors such as light, temperature, nutrients (11), particle-attached or free-living habitats, and oligotrophic or eutrophic environments, and (ii) biotic factors such as virus and host density, host productivity (13), metabolism, diversity, and genome size (40). Coupled with a greater confidence in genome-based observations, and experimental evidence of variables that structure reproductive strategies, further metagenomic exploration of viruses from diverse environments will help build a more comprehensive model of virus-host interactions in the ocean.

The evidence of vertical transport of viruses observed in Chapter IV raises the possibility that the viral shuttle could take place in natural environments, and highlights important questions for this field: e.g., do viruses in this environment actively promote particle export through cell lysis? Alternatively, are they simply present due to co-transport with their dead and sinking hosts, and/or passively adsorbed onto sinking particles? Evidence for the former possibility would transform the way we think about the effects of viruses on biogeochemical cycling in the oceans, particularly for bacteriophages that have not been thought of as major components in the viral-shuttle hypothesis. The next steps towards testing this hypothesis could include a combination of metagenomic and laboratory approaches to target viruses infecting primary producers important to export. For example, some of these key players could include viruses infecting cyanobacterial diazotrophs like Crocosphaera, which are linked to increased carbon flux (17), as well as cyanophages previously observed to be correlated with export (41), and observed to be exported from the upper ocean on sinking particles (Chapter IV). Incubation experiments with these groups of interest could more directly test the viral shuttle hypothesis by investigating whether lysis enhances host cell aggregation and sinking. Studying environmentally relevant viruses in laboratory studies, and linking them to in situ metagenomic observations, might help clarify the role of viruses in export processes in the open ocean.

An important aspect of how viruses influence microbial diversity is their potential to serve as horizontal gene transfer (HGT) agents. Viruses replicating with non-specific cleavage might be more prone to erroneously packaging host sequences, and have been observed to serve as effective HGT agents through generalized transduction (42). In our preliminary long-read study, we observed two groups of viruses with disparate genome-replication mechanisms, specific (94% of populations) vs. non-specific cleavage (6% of populations). These observations raise the questions of (i) what hosts or environments might select for non-specific viral packaging mechanisms, and (ii) how might we better understand the relationships between viral biology and HGT?

In addition to HGT through transduction, viruses can also confer novel or enhanced functions through the transfer and incorporation of viral auxiliary metabolic genes (AMGs). We saw in Chapter III that three key AMGs involved in energy and nutrient acquisition displayed depth-specific patterns that corresponded to physicochemical gradients along the water column. This observation raises the possibility that virus-encoded AMGs could confer adaptive benefits to the infected host. Furthermore, the presence of the viral superinfection immunity protein observed in Chapter II suggested that viruses could confer on their hosts an immunity against other viruses. However, finding the presence of a gene does not provide evidence of its functionality, and relying on metagenomic data alone is likely to over-estimate the functional diversity of sequenced viral populations (43). To address this concern, efforts should be made to validate in situ metagenomic observations with laboratory assays as a means to confirm the functionality of virus-mediated genes. Such coupled metagenomic and laboratory studies could elucidate how viruses might confer functional diversity to their hosts, and potentially transform the way we think about virus-host interactions in natural environments.

The discovery of viral parasites, such as PICIs at Station ALOHA, further complicates the biological classification of organisms in the ocean. These enigmatic mobile elements raise some open questions in the field of marine metagenomics: what environmental variables might structure the abundance of parasites found in viral particles? How might these parasites affect viral productivity and virus-host interactions? How might we constrain the proportion of cellular or viral productivity that ends up being shunted by these parasites? What other types of parasitic mobile elements exist in the ocean? Addressing these questions could help untangle the complex web of interactions amongst different groups of genetic elements in marine ecosystems. The ease with which novel sequences with unique reproductive strategies are continuously discovered hints at the diversity that remains to be explored in the ocean.

Ending on a broader note, how can we connect our explorative metagenomic work in the natural sciences to humanitarian applications? Curiosity-driven basic research might not immediately reveal its applicability, but it underlies revolutionary developments in science, technology, and medicine that continuously transform the way we live, as well as our imaginations for the human potential. Thinking about how broader implications of our scientific pursuits might better serve humanity remains an ongoing challenge and goal for researchers.

Figures

Figure 5.1. Long-read sequencing overcomes limitations in short-read sequencing. Short reads require assembly to piece together genomes, which is prone to missing hyper-variable or repetitive genomic regions. In contrast, long-read sequencing can capture entire viral genomes in a single read, preserving information across multiple genomic regions.

[Figure 5.2 plot: histograms of read count versus read length (kbp, axis spanning 30 to 70), with one panel each for the 25 m and 250 m samples.]

Figure 5.2. Genome length distributions of raw sequenced long reads shown in kilobase pairs (kbp) for two sequencing runs from samples collected at 25 m and 250 m from Station ALOHA. 250 m data from Beaulaurier et al. (31).

[Figure 5.3 plot: viral genome recovery (axis spanning 0.7 to 0.9) for long-read polished genomes and short-read assemblies.]

Figure 5.3. Viral genome recovery using long-read and short-read approaches. The proportion of genome recovery is calculated using reciprocal alignments between polished long-read genomes and short-read assemblies. Data from Beaulaurier et al. (31).

Figure 5.4. Complete viral genomes recovered using long-read sequencing, with an example each from 25 m, 117 m, and 250 m at Station ALOHA. Predicted proteins are shown along the genome and color-coded by function as annotated using the PFAM-A v30 database (44). Blue shading indicates amino acid identity to the closest relative in RefSeq v92 (45), if any, shown below the ALOHA viral genomes. Data from Beaulaurier et al. (31).


Figure 5.5. End repeat distributions along the genome differ between viral populations with circularly permuted genomes (left) and with specific ends (right). The percentage of populations displaying circularly permuted genomes or genomes with specific ends are displayed in parentheses. Red lines show alignment positions of end repeats. Figure adapted from Beaulaurier et al. (31).

Figure 5.6. Gene content differences between viruses with different replication strategies: circularly permuted or specific ends. Gene functions were identified using the PFAM-A v30 database (44), at bit score >=30. Gene copies per genome were approximated using the average coverage along the gene divided by the average coverage along the entire genome. Phage genomes with specific ends versus circularly permuted were identified in Beaulaurier et al. (31).

[Figure 5.7 diagram: PICI life cycle. A PICI (8 kbp genome) hijacks infection by a helper virus (40 kbp genome). After replication, connected copies of PICI DNA (5x PICI genome, filling the 40 kbp capsid capacity) are packaged in the viral capsid.]

Figure 5.7. Life cycle of known viral parasites, phage-inducible chromosomal islands (PICIs). PICIs are mobile genetic islands residing in host cells that hijack viral infections. When a helper virus infects the host, PICIs become activated, replicate their DNA, and take advantage of viral machinery to replace viral DNA with their own. After Penades and Christie (35).


Figure 5.8. Genome figures of putative viral parasites, phage-inducible chromosomal islands (PICIs), recovered from Station ALOHA using long reads. Bold labels indicate the depth where the genome was recovered and genome length in kilobase pairs (kbp). Predicted proteins are shown along the genome and color-coded by function as annotated using the PFAM-A v30 database (44). Individual repeat segments are highlighted using the vertical dashed lines and grey background shading along the genome. Taxonomic (% amino acid identity) and functional annotations (left: PFAM-A v30 (44); right: EGGNOG v4.5 (46)) are shown for each predicted protein. Figure adapted from Beaulaurier et al. (31).

References

1. Brum JR. Concentration, production and turnover of viruses and dissolved DNA pools at Stn ALOHA, North Pacific Subtropical Gyre. Aquat Microb Ecol. 2005;41:103–13.

2. Luo E, Aylward FO, Mende DR, DeLong EF. Bacteriophage distributions and temporal variability in the ocean’s interior. MBio. 2017;8(6):e01903-17.

3. Breitbart M. Marine viruses: truth or dare. Ann Rev Mar Sci. 2012;4(1):425–48.

4. Hewson I, Winget DM, Williamson KE, Fuhrman JA, Wommack KE. Viral and bacterial assemblage covariance in oligotrophic waters of the West Florida Shelf (Gulf of Mexico). J Mar Biol Assoc UK. 2006;86(03):591.

5. Mende DR, Bryant JA, Aylward FO, Eppley JM, Nielsen T, Karl DM, et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat Microbiol. 2017;2(10):1367–73.

6. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R. Expanding the marine virosphere using metagenomics. PLoS Genet. 2013;9(12):e1003987.

7. Mizuno CM, Ghai R, Saghaï A, López-García P, Rodriguez-Valera F. Genomes of abundant and widespread viruses from the deep ocean. MBio. 2016;7(4):e00805-16.

8. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and biogeochemical impacts of uncultivated globally abundant ocean viruses. Nature. 2016;537:689–93.

9. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536(7617):425–30.

10. Stewart FM, Levin BR. The population biology of bacterial viruses: why be temperate. Theor Popul Biol. 1984;26(1):93–117.

11. Moebus K. Marine bacteriophage reproduction under nutrient-limited growth of host bacteria. I. Investigations with six phage-host systems. Mar Ecol Prog Ser. 1987;144:1–12.

12. Thingstad TF. Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol Oceanogr. 2000;45(6):1320–8.

13. Middelboe M. Bacterial growth rate and marine virus–host dynamics. Microb Ecol. 2000;40:114–24.

14. Luo E, Eppley JM, Romano AE, Mende DR, DeLong EF. Double-stranded DNA virioplankton dynamics and reproductive strategies in the oligotrophic open ocean water column. ISME J. 2020;14:1304–15.

15. López-Pérez M, Haro-Moreno JM, Gonzalez-Serrano R, Parras-Moltó M, Rodriguez-Valera F. Genome diversity of marine phages recovered from Mediterranean metagenomes: Size matters. PLoS Genet. 2017;13(9):1–23.

16. Boeuf D, Edwards BR, Eppley JM, Hu SK, Poff KE, Romano AE, et al. Biological composition and microbial dynamics of sinking particulate organic matter at abyssal depths in the oligotrophic open ocean. Proc Natl Acad Sci USA. 2019;116(24):11824–32.

17. Poff K, Leu AO, Eppley JM, Karl DM, DeLong EF. Microbial dynamics of the open ocean summer export pulse. In review.

18. Leu AO, Eppley JM, DeLong EF. Comparative genomics of particle-attached versus free-living bacteria. In preparation.

19. McDonnell AMP, Buesseler KO. Variability in the average sinking velocity of marine particles. Limnol Oceanogr. 2010;55(5):2085–96.

20. Qiu B, Koh DA, Lumpkin C, Flament P. Existence and formation mechanism of the North Hawaiian Ridge Current. J Phys Oceanogr. 1997;27:431–44.

21. Warwick-Dugdale J, Solonenko N, Moore K, Chittick L, Gregory AC, Allen MJ, et al. Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands. PeerJ. 2019;7:e6800.

22. Kumar J, Sharma N, Kaushal G, Samurailatpam S, Sahoo D, Rai AK, et al. Metagenomic insights into the taxonomic and functional features of kinema, a traditional fermented soybean product of Sikkim Himalaya. Front Microbiol. 2019;10(August):1–17.

23. Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012;6(4):898–901.

24. Roux S, Emerson JB, Eloe-Fadrosh EA, Sullivan MB. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017;5:e3817.

25. Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet. 2008;24(3):142–9.

26. Yilmaz S, Allgaier M, Hugenholtz P. Multiple displacement amplification compromises quantitative analysis of metagenomes. Nat Methods. 2010;7(12):943–4.

27. Glenn TC. Field guide to next-generation DNA sequencers. Mol Ecol Resour. 2011;11(5):759–69.

28. Stoddart D, Heron AJ, Mikhailova E, Maglia G, Bayley H. Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc Natl Acad Sci USA. 2009;106(19):7702–7.

29. Tyler AD, Mataseje L, Urfano CJ, Schmidt L, Antonation KS, Mulvey MR, et al. Evaluation of Oxford Nanopore’s MinION sequencing device for microbial whole genome sequencing applications. Sci Rep. 2018;8(1):1–12.

30. Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, Krupovic M, et al. Minimum information about an uncultivated virus genome (MIUVIG). Nat Biotechnol. 2019;37(1):29–37.

31. Beaulaurier J, Luo E, Eppley JM, Den Uyl P, Dai X, Burger A, et al. Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities. Genome Res. 2020;30(3):437–46.

32. Steward GF, Montiel JL, Azam F. Genome size distributions indicate variability and similarities among marine viral assemblages from diverse environments. Limnol Oceanogr. 2000;45(8):1697–706.

33. Fujisawa H, Morita M. Phage DNA packaging. Genes to Cells. 2003;2(9):537–45.

34. Prevelige PE, Cortines JR. Phage assembly and the special role of the portal protein. Curr Opin Virol. 2018;31:66–73.

35. Penadés JR, Christie GE. The phage-inducible chromosomal islands: a family of highly evolved molecular parasites. Annu Rev Virol. 2015;2(1):181–201.

36. Fillol-Salom A, Martínez-Rubio R, Abdulrahman RF, Chen J, Davies R, Penadés JR. Phage-inducible chromosomal islands are ubiquitous within the bacterial universe. ISME J. 2018;12(9):2114–28.

37. Black L. DNA packaging in dsDNA bacteriophages. Annu Rev Microbiol. 1989;43(1):267–92.

38. Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017;45(1):39–53.

39. Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol. 1985;39:321–46.

40. Casjens S. Prophages and bacterial genomics: What have we learned so far? Mol Microbiol. 2003;49(2):277–300.

41. Guidi L, Chaffron S, Bittner L, Eveillard D, Larhlimi A, Roux S, et al. Plankton networks driving carbon export in the oligotrophic ocean. Nature. 2016;532(7600):465–70.

42. Casjens SR, Gilcrease EB. Determining DNA packaging strategy by analysis of the termini of the chromosomes in tailed-bacteriophage virions. Methods Mol Biol. 2009;502:91–111.

43. Enault F, Briet A, Bouteille L, Roux S, Sullivan MB. Phages rarely encode antibiotic resistance genes: a cautionary tale for virome analyses. ISME J. 2017;11:237–47.

44. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36(Database issue):281–8.

45. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(Database issue):733–45.

46. Huerta-Cepas J, Szklarczyk D, Forslund K, Cook H, Heller D, Walter MC, et al. EGGNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2016;44(D1):D286–93.
