UC San Diego UC San Diego Electronic Theses and Dissertations

Title EXPLORING THE GLOBAL VIROME AND DECIPHERING THE ROLE OF PHAGES IN CYSTIC FIBROSIS

Permalink https://escholarship.org/uc/item/1103w639

Author Cobian Guemes, Ana Georgina

Publication Date 2019

Peer reviewed|Thesis/dissertation

eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA SAN DIEGO SAN DIEGO STATE UNIVERSITY

EXPLORING THE GLOBAL VIROME AND DECIPHERING THE ROLE OF PHAGES IN CYSTIC FIBROSIS A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Biology

by

Ana Georgina Cobián Güemes

Committee in charge:

University of California San Diego

Professor Douglass Conrad Professor Justin Meyer Professor Joseph Pogliano

San Diego State University

Professor Forest Rohwer, Chair Professor Robert Edwards

2019

The Dissertation of Ana Georgina Cobián Güemes is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

______

______

______

______

______Chair

University of California San Diego San Diego State University 2019

iii Dedication

To Adrian, Pilar and Jorge for always supporting me through this unique journey.

iv Epigraph

“No temas a la perfección, nunca la alcanzarás” Salvador Dali

v Table of Contents

Signature page ...... iii Dedication ...... iv Epigraph ...... v Table of Contents ...... vi List of Figures ...... xi List of Tables ...... xiii List of Equations ...... xiv Acknowledgments ...... xv Vita ...... xix Abstract of the Dissertation ...... iv Chapter 1 : Viruses as Winners in the Game of Life ...... 1 Abstract ...... 1 How many viruses? ...... 1 A Global Census ...... 3 Phages Matter ...... 8 How many different viruses? ...... 11 Signature Genes ...... 12 Phage ...... 15 Virome Meta-analysis ...... 16 How many different phages on Earth? ...... 22 Which are the most abundant phage genes? ...... 23 The future for FRAP ...... 25 Summary Points ...... 26 Future Issues ...... 26 Definition of terms ...... 28 Acknowledgements ...... 29 References ...... 30 Appendix for Chapter 1 ...... 39 Supplemental methods ...... 39 Chapter 2: Fragment Recruitment Assembly Purification (FRAP): Put bioinformatics back into the hands of biologists...... 45 Abstract ...... 45

vi Introduction ...... 46 Current bioinformatics ...... 46 Why use FRAP? ...... 47 How to use FRAP? ...... 48 Challenges in the development of FRAP ...... 49 Results and Discussion ...... 50 Basic FRAP ...... 51 Blind FRAP ...... 60 The future of FRAP ...... 65 Biological insights from FRAP ...... 65 Methods ...... 65 File types ...... 65 FRAP implementations ...... 66 FRAP-tools ...... 69 Reference databases ...... 69 Denovo assemblies ...... 69 Acknowledgments ...... 69 References ...... 70 Appendix for Chapter 2 ...... 73 Supplemental tables ...... 73 Chapter 3 : Cystic Fibrosis Rapid Response: Translating Multi-omics Data into Clinically Relevant Information ...... 76 Abstract ...... 76 Introduction ...... 76 Results ...... 79 Patient CF01 fatal exacerbation expedited monitoring: metatranscriptomes and metabolomes ...... 79 Bacterial small molecule profiles before and during fatal exacerbation...... 85 Active members of the microbial community during a stable period and the fatal exacerbation...... 86 Microbial community dynamics during a non-fatal exacerbation...... 88 Discussion ...... 89 CF01 fatal exacerbation mechanism ...... 89 CFRR for polymicrobial infections management, the importance of historical samples and a fast sample to results strategy...... 92

vii Considerations about implementing the Cystic Fibrosis Rapid Response...... 95 Materials and methods ...... 96 Clinical data ...... 96 Metagenome and Metatranscriptome shotgun sequencing ...... 97 Sequencing data processing ...... 97 Samples comparison ...... 99 Metabolomics ...... 100 Metabolomics data processing ...... 100 Data availability ...... 102 Acknowledgments ...... 102 References ...... 103 Appendix for Chapter 3 ...... 111 Supplemental figures ...... 111 Supplemental tables ...... 118 Chapter 4 : Mobile genetic elements in Cystic Fibrosis exacerbations ...... 123 Abstract ...... 123 Introduction ...... 124 Results ...... 126 Clinical characterization of Cystic Fibrosis pulmonary exacerbations...... 126 Cystic Fibrosis exacerbations display higher microbial loads and phage production. ... 128 Microbial community composition of pulmonary exacerbations...... 129 Temporal dynamics of microbial community composition ...... 131 Genomic insertions and deletions in CF pulmonary exacerbations...... 131 Phage activity in pulmonary exacerbations ...... 135 Discussion ...... 137 CF exacerbations and bacteria genomic insertions ...... 137 Achromobacter in CF exacerbations ...... 138 Stenotrophomonas in CF exacerbations ...... 138 Unique mobile elements in CF exacerbations ...... 139 models of acute CF exacerbations...... 139 Materials and methods ...... 140 Clinical data...... 140 Samples collection and pre-processing...... 140

viii Viral and microbial enumeration...... 140 Total DNA metagenomes...... 141 Total RNA metatranscriptomes...... 141 Virome ...... 142 Clinical isolates genome sequencing...... 143 Metagenomes, metatranscriptomes and viromes data analysis...... 144 Genomic insertions and deletions identification...... 144 Insertions and deletions annotations...... 145 Technical considerations for CF sputum metagenomics...... 145 Acknowledgments ...... 146 References ...... 147 Appendix for Chapter 4 ...... 153 Supplemental figures ...... 153 Chapter 5 : Compounding Achromophages for therapeutic applications ...... 169 Abstract ...... 169 Introduction ...... 170 Results ...... 171 Cystic fibrosis and Achromobacter ...... 171 Achromobacter clinical isolates ...... 172 Achromophage isolation and characterization ...... 173 Achromophage comparative genomics ...... 177 Achromophage genome annotation ...... 181 Toxins annotations in Achromophages ...... 181 Achromophage lifestyle determination ...... 182 Prophage CF418-P1 induction and characterization ...... 182 Host range determination for Achromophages ...... 184 Achromobacter phages lysates preparation for therapeutic applications ...... 187 Discussion ...... 189 Achromophages hunting ...... 189 Achromophages genomics ...... 189 Toxins characterization in phage genomes...... 189 Cryptic prophages induction ...... 190 Beyond phage hunting ...... 191

ix Materials and methods ...... 191 Cystic fibrosis metagenomes ...... 191 Achromobacter strains ...... 192 Phage hunting ...... 192 Phages Transmission Electron Microscopy (TEM) ...... 194 Phages genome size determination by Pulse Field Gel Electrophoresis (PFGE) ...... 194 Phages host range determination ...... 195 Phages DNA isolation for sequencing ...... 196 Phages Illumina sequencing ...... 196 Phages Nanopore sequencing ...... 197 Phages genome assembly ...... 197 Phages genome annotation ...... 197 Phages comparative genomics ...... 198 High titer phage production and endotoxin removal and quantification ...... 198 Acknowledgments ...... 199 References ...... 199 Appendix for Chapter 5 ...... 206 Supplemental figures ...... 206 Supplemental tables ...... 212 Chapter 6 : Synthesis ...... 217 Phages abundance, diversity and lifestyle...... 217 Are phages hard to find? ...... 219 The personalized of Cystic Fibrosis exacerbations ...... 219 Model for Cystic Fibrosis acute exacerbations ...... 220 The Chaotic Neutral Phage ...... 221 References ...... 223

x List of Figures

Figure 1.1 Virus-to-microbe ratios for major biomes...... 6 Figure 1.2 Bioinformatics pipeline for the FRAP ...... 18 Figure 1.3 Viral rank abundance curves...... 21 Figure 2.1 Fragment Recruitment Assembly Purification general case...... 51 Figure 2.2 FRAP basic concept and normalization...... 52 Figure 2.3 Heatmap of the 100 most abundant bacteria in the aquarium metagenomes...... 56 Figure 2.4 Fragment recruitment plot of Phaeobacter gallaciensis ...... 57 Figure 2.5 Contamination detection in CF samples ...... 59 Figure 2.6 Genome contamination detection ...... 60 Figure 2.7 Generalized FRAP pipeline to map metagenomes to viral assembled contigs...... 62 Figure 2.8 FRAP metagenomes to viromes in coral-algae interactions ...... 63 Figure 2.9 Relation between HISAT2 score and identity for reads of length 100nt...... 68 Figure 3.1 Clinical data for the last 24 months of patient CF01’s life ...... 80 Figure 3.2 The most abundant bacterial genera of fatal exacerbation sample D-8 ...... 82 Figure 3.3 Shiga toxin and its human receptor globotriaosylceramide (Gb3) ...... 83 Figure 3.4 Actively transcribing members ...... 87 Figure 3.5 Proposed model of lung dynamics resulting in patient CF01’s death ...... 90 Figure 3.6 Cystic fibrosis rapid response ...... 94 Figure 4.1 Virus to Microbe Ratio (VMR) in CF sputum samples ...... 128 Figure 4.2 Bacterial abundance in CF exacerbations metagenomes ...... 130 Figure 4.3 Insertions in CF exacerbation dominant genomes...... 134 Figure 4.4 Stenotrophomonas insertions are phages ...... 135 Figure 4.5 Stenotrophomonas phage SHP2 and zonula occludens toxin ...... 137 Figure 5.1 Achromobacter phages on The Phage Proteomic Tree...... 178 Figure 5.2 Achromobacter phages clade phiAxp1...... 179 Figure 5.3 Achromobacter phages clade JWX, genomes comparisons...... 180 Figure 5.4 Achromobacter prophage induced when infected with other lytic phages...... 183 Figure 5.5 Achromobacter phages host range test...... 186 Figure 5.6 Achromobacter phage nyaak high titer lysate production ...... 188 Figure 5.1 Virus to microbe ratio, microbial and viral abundance in Earth biomes ...... 218

xi Figure 5.2 Model for Cystic Fibrosis acute exacerbations...... 221

xii List of Tables

Table 1.1 Estimated number of virus-like particles on Earth...... 4 Table 1.2 Virome metrics ...... 19 Table 2.1 Average genome length for FRAP normalization...... 53 Table 2.2 Average locus length for FRAP normalization...... 54 Table 2.3 Average gene length for FRAP normalization...... 54 Table 2.4 Exact hit, using smalt at 100% identity reads to contigs ...... 64 Table 2.5 Exact hit, using smalt at 100% identity reads to reads ...... 64 Table 4.1 Cystic Fibrosis patients clinical data during exacerbations ...... 127 Table 4.2 Deletions in dominant genomes...... 132 Table 4.3 Insertions in dominant genomes...... 133 Table 5.1 Achromobacter lytic phages in the literature (n=24) ...... 173 Table 5.2 Achromobacter isolated in this study...... 176

xiii List of Equations

Equation 1.1 Fractional abundance of viral genotypes ...... 42 Equation 1.2 Fractional abundance of contigs ...... 42 Equation 1.3 Estimated a and b for rank abundance curves ...... 43 Equation 1.4 Number of viral genotypes prediction ...... 44

xiv Acknowledgments

Thanks to my mentor Dr. Forest Rohwer for giving me the opportunity to explore the amazing world of phages. Thanks for teaching me how to ask the really interesting questions and work really hard to answer them. Thanks for pushing me to the limits of knowledge.

Thanks to Dr. Rob Edwards who encouraged me to come to San Diego and has always being very generous with this advice, computers and time. Thanks to Dr. Doug Conrad for caring about the microbes and sharing his clinical knowledge and resources to get us one step closer to apply genomics in the clinic. Thanks to Dr. Justin Meyer for teaching a bioinformatician to do phage plaque essays and guiding me through the phage mutations space. Thanks to Dr. Joe Pogliano and Dr. Elio Schaechter for the microbiology class and providing sharp and original questions for my research.

Dr. Anca Segall is an inspiration and I thank her for guiding me through my phage experiments and allowing me to work in her lab. (I hope one day I can identify an integrase by eye, like she does). Dr Linda Wegley-Kelley, I couldn’t have published this dissertation witouth your guidance and support, thank you.

The Rohwer lab Cystic Fibrosis team was an inspiration and I thank Dr. Yan Wei Lim,

Mike Furlan, Dr. Rob Quinn and Dr. Katrine Whiteson for welcoming me and showing me very usefull methods and tools. Thanks to Dr. Cynthia Silveira for bringing her viral world knowledge to the CF rapid response team and making the “not so rapid response” weekends enjoyable. Thanks to Gina Spidel and Eugenia Kronen for their hard work in getting all the

xv reagents and permits that we need to do science; and to Patti Swinford and Medora Bratlien for helping with the PhD related matters.

Thanks to my Rohwer lab comrades that made this five years more fun: Yan Wei Lim,

Lauren Paul, Savannah Sanchez, Emma George, Adam Barno, Kevin Green, Sean Benler, Ty

Roach, Mark Little, Marisa Rojas, Brandon Reyes, Ines Glatier D’Auriac, Lance Boling, Jose

Reina, Esther Rubio, Aaron Hartman, Helena Villela, Doug Naliboff, Brandie White, Jason

Baer and Zach Quilan. Thanks to the wise advice from Dr. Linda Wegley-Kelley and Dr. Juris

Grasis that helped me navigate grad school.

Thanks to the Edwards lab for bioinformatics help and friendship: Kate McNair,

Daniel Cuevas and Adrian Cantu. Thanks to the awesome Heather Maughman and Merry

Youle for helping me improve my writing skills. Thanks to Dr. Conrad’s team at the UCSD

CF clinic for making it possible to have samples to work with: Jenna Mielke, Nelly and

Rohaum. Thanks to the phage wisperers Shr-Hau Hung and Greg Peters for making any phage related protocol work. Thanks to the phage hunters team: Tram Le, Jessica Octavio, Lorena

Dominguez, Helena Villela, Lili Han.

Thanks to Dr. Liz Dinsdale for her advice, and support in getting our sequencing runs to work. Thanks to the everyone in the “Phage Journal club”, the “Bioinformatics Breakfast” and the “Biomath Group” for their delightfull scientific conversations.

Adrian Cantu, thanks for all your programming support, science discussions, love, care and patience. Thanks for coming with me to this journey and making my life amazing.

xvi Mom and dad, thanks for giving me all the tools to achieve this goal. Thanks for you love and support. Thanks to all my family for showing their support for my PhD. Thanks for calling, visiting and making sure everything was OK. A special thanks to my grandpa for encouraging me to go abroad for the PhD and for asking the question of “Who counted them?”, in this case I did counted the phages. Thanks to our friends in San Diego, specially to

Peter and Sally Salamon.

Thanks to the funding agencies that supported me during my PhD: CONACyT,

UCMEXUS and the Spruance Foundation.

Chapter 1, in full, is published in Annual Review of Virology. Ana Georgina Cobián

Güemes, Merry Youle, Vito Adrian Cantú, Ben Felts, James Nulton and Forest Rohwer;

2016. The dissertation author was the primiary investigator and author of this paper.

Chapter 2, in part, is in preparation for submission. Ana Georgina Cobián Güemes,

Vito Adrian Cantú, Jose Carlos Reina Cabello and Forest Rohwer. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in full, is published in mBIO. Ana Georgina Cobián Güemes, Yan Wei

Lim, Robert A. Quinn, Douglas J. Conrad, Sean Benler, Heather Maughan, Rob Edwards,

Thomas Brerrin, Vito Adrian Cantú, Daniel Cuevas, Rohaum Hamidi, Pieter Dorrestein and

Forest Rohwer; 2019. The dissertation author was the primiary investigator and author of this paper.

Chapter 4 is in preparation for submission. Ana Georgina Cobián Güemes, Cynthia B.

Silveira, Mark Little, Vito Adrian Cantú, Sean Benler, Jose Carlos Reina Cabello, Rob

xvii Edwards, Douglas Conrad and Forest Rohwer; 2019. The dissertation author was the primary investigator and author of this paper.

Chapter 5 is in preparation for submission. Ana Georgina Cobián Güemes, Tram Le,

Maria Isabel Rojas, Lorena Dominguez, Jessica Claire Octavio, Lance Boling, Helena Villela,

Shr-Hau Hung, Lili Han, Vito Adrian Cantú, Rob Edwards, Anca Segal, Douglas Conrad and

Forest Rohwer; 2019. The dissertation author was the primary investigator and author of this paper.

xviii VITA

2019 Doctor of Phylosophy in Biology, University of California San Diego and San

Diego State University

2013 – 2014 Research Assistant in Bioinformatics, San Diego State University, CA, USA

2012 – 2013 Bioinformatics analyst, Research Center for Food and Development (CIAD),

Sonora, Mexico

2012 Master of Sciences in Biochemistry, National Autonomous University of

Mexico

2010 Bachelor of Science in Genomics, National Autonomous University of Mexico

PUBLICATIONS

Cobián Güemes, A. G., Vito Adrian Cantú, Jose Carlos Reina Cabello and Forest Rohwer “Fragment Recreuitment Assembly Purification” (in preparation) Cobián Güemes, A. G., Forest Rohwer, Rob Cole, Sandra Morales, Saima Aslam and Susan M. Lehman “Population dynamics of a phage therapy product during treatment of a Pseudomonas aeruginosa lung infection” (in preparation) Cobián Güemes, A. G., Tram Le, Maria Isabel Rojas, Lorena Dominguez, Jessica Claire Octavio, Lance Boling, Helena Villela, Shr-Hau Hung, Lili Han, Vito Adrian Cantú, Rob Edwards, Anca Segal, Douglas Conrad and Forest Rohwer “Compounding Achromophages for therapeutic applications.” (in preparation) Cobián Güemes, A. G., Cynthia B. Silveira, Jose Carlos Reina Cabello, Mark Little, Adrian Cantú, Sean Benler, Rob Edwards, Douglas Conrad and Forest Rohwer. “Mobile genetic elements in Cystic Fibrosis” (in preparation) Cobián Güemes, A. G., Yan Wei Lim, Robert A. Quinn, Douglas J. Conrad, Sean Benler, Heather Maughan, Rob Edwards, Thomas Brettin, Vito Adrian Cantú, Daniel Cuevas, Rohaum Hamidi, Pieter Dorrestein, and Forest Rohwer. “Cystic Fibrosis Rapid Response: Translating Multi-omics Data into Clinically Relevant Information.” MBio 10, no. 2 (2019).

xix Cobián Güemes, A. G., Merry Youle, Vito Adrian Cantú, Ben Felts, James Nulton, and Forest Rohwer. "Viruses as Winners in the Game of Life." Annual Review of Virology 3, no. 1 (2016). Benler, Sean, Cobián Güemes, A. G., Katelyn Mcnair, Shr-Hau Hung, Kyle Levi, Rob Edwards, and Forest Rohwer. "A Diversity-generating Retroelement Encoded by a Globally Ubiquitous Bacteroides Phage." Microbiome 6, no. 1 (2018). Galtier D'Auriac, Ines, Robert A. Quinn, Heather Maughan, Louis-Felix Nothias, Mark Little, Clifford A. Kapono, Cobián Güemes, A. G., Brandon T. Reyes, Kevin Green, Steven D. Quistad, Matthieu Leray, Jennifer E. Smith, Pieter C. Dorrestein, Forest Rohwer, Dimitri D. Deheyn, and Aaron C. Hartmann. "Before Platelets: The Production of Platelet- activating Factor during Growth and Stress in a Basal Marine Organism." Proceedings of the Royal Society B: Biological Sciences 285, no. 1884 (2018). Knowles, Ben, Barbara Bailey, Lance Boling, Mya Breitbart, Cobián Güemes, A. G., Javier Del Campo, Rob Edwards, Ben Felts, Juris Grasis, Andreas F. Haas, Parag Katira, Linda Wegley Kelly, Antoni Luque, Jim Nulton, Lauren Paul, Gregory Peters, Nate Robinett, Stuart Sandin, Anca Segall, Cynthia Silveira, Merry Youle, and Forest Rohwer. "Variability and Host Density Independence in Inductions-based Estimates of Environmental Lysogeny." Nature Microbiology 2, no. 7 (2017). Knowles, B., C. B. Silveira, B. A. Bailey, K. Barott, V. A. Cantu, Cobián Güemes, A. G., F. H. Coutinho, et al. “Lytic to Temperate Switching of Viral Communities.” Nature 531, no. 7595 (2016): 466–70. Gomez-Jimenez, S., L. Noriega-Orozco, R. R. Sotelo-Mundo, V. A. Cantu, Cobián Güemes, A. G., R. G. Cota-Verdugo, L. A. Gamez-Alejo, L. Del Pozo-Yauner, E. Guevara- Hernandez, K. D. Garcia-Orozco, A. A. Lopez-Zavala, and A. Ochoa-Leyva. "High- Quality Draft Genomes of Two Vibrio Parahaemolyticus Strains Aid in Understanding Acute Hepatopancreatic Necrosis Disease of Cultured Shrimps in Mexico." Genome Announcements 2, no. 4 (2014). Escalera-Zamudio, Marina, Martha I. Nelson, Cobián Güemes, A. G., Irma López-Martínez, Natividad Cruz-Ortiz, Miguel Iguala-Vidales, Elvia Rodríguez García, et al. “Molecular Epidemiology of Influenza A/H3N2 Viruses Circulating in Mexico from 2003 to 2012.” PLoS ONE 9, no. 7 (2014). Vazquez-Perez, Joel, Pavel Isa, Darwyn Kobasa, Christopher E Ormsby, Jose E Ramírez- Gonzalez, Damaris P Romero-Rodríguez, Charlene Ranadheera, Cobián Güemes, A. G.et al. “A (H1N1) Pdm09 HA D222 Variants Associated with Severity and Mortality in Patients during a Second Wave in Mexico.” Virology Journal 10, no. January (2013): 41.

xx Escalera-Zamudio, Marina, Cobián Güemes, A. G., María de los Dolores Soto-del Río, Pavel Isa, Iván Sánchez-Betancourt, Aurora Parissi-Crivelli, María Teresa Martínez-Cázares, et al. “Characterization of Influenza A Virus in Mexican Swine That Is Related to the A/H1N1/2009 Pandemic Clade.” Virology 433, no. 1 (2012): 176–82. Landa-Cardeña, Adriana, Jaime Morales-Romero, Rebeca García-Roman, Cobián Güemes, A. G., Ernesto Méndez, Cristina Ortiz-Leon, Felipe Pitalúa-Cortés, Silvia Ivonne Mora, and Hilda Montero. “Clinical Characteristics and Genetic Variability of Human Rhinovirus in Mexico.” Viruses 4, no. 2 (2012): 200–210. Arias, Carlos F., Marina Escalera-Zamudio, María De Los Dolores Soto-Del Río, Cobián Güemes, A. G., Pavel Isa, and Susana López. "Molecular Anatomy of 2009 Influenza Virus A (H1N1)." Archives of Medical Research 40, no. 8 (2009): 643-54.

xxi ABSTRACT OF THE DISSERTATION

Exploring the Global Virome and deciphering the role of phages in Cystic Fibrosis

by

Ana Georgina Cobián Güemes

Doctor of Philosophy in Biology

University of California San Diego, 2019

San Diego State University, 2019

Professor Forest Rohwer, Chair

Viruses are the most abundant and diverse life form on Earth. In Chapter 1 of this dissertation, a global census of the number of viral particles showed that 6.03 x 1031 viral particles are distributed across almost every ecosystem. The global census results showed that most viruses are in soils and sediments, two unexplored biomes for viral diversity. Accurate and fast bioinformatics methods to explore viral and bacterial composition in metagenomes are presented in Chapter 2 as “Fragment Recruitment Assembly Purification.”

iv Phages are viruses that infect bacteria, their role in polymicrobial infections such as

Cystic Fibrosis is explored. Cystic Fibrosis is a genetic disease in which the lung phenotype promotes mucus accumulation and microbial colonization. A multi-omics strategy to explore changes in the lung microbial community during acute pulmonary exacerbations is presented in Chapter 3 as “Cystic Fibrosis Rapid Response: Translating Multi-omics data into Clinically

Relevant Information.” In Chapter 4, eight acute exacerbations were studied, and a trend of loss of diversity and viral lytic lifestyle was observed. Two Cystic Fibrosis patients studied in

Chapter 4 were colonized by antibiotic resistant bacteria from the genera Achromobacter.

Both patients suffered fatal exacerbations. This motivated the isolation and characterization of lytic phages to be used as antimicrobials against bacterial infections.

v Chapter 1 : Viruses as Winners in the Game of Life

Abstract

Viruses are the most abundant and the most diverse life form. Their global abundance was previously estimated as 1031, but their abundance remains uncertain and their global distribution vague. In this meta-analysis we estimated that there are 4.80 × 1031 phages on

Earth. Further, 97% of them are in soil and sediments—two underinvestigated biomes that combined account for only ~2.5% of publicly available viral metagenomes. The majority of the most abundant phage sequences from all biomes are novel. Our analysis drawing on all publicly available viral metagenomes predicted a mere 257, 679 phage genotypes on Earth— an unrealistically low number— which attests to the current paucity of metagenomic data.

Further advances in viral ecology and diversity call for a shift of attention to previously ignored major biomes and careful application of verified methods for viral metagenomic analysis.

J.B.S. Haldane observed, “God has an inordinate fondness for stars and beetles.”

To which we would add, “…and viruses.”

How many viruses?

Until the 1970s, viruses were of import only insofar as they were found to cause disease in us, our domesticates, and other eukaryotes of economic value. Bacteriophages — viruses that infect bacteria— were proving themselves to be eminently useful as model systems for molecular biology research, but were still thought to be of little significance to the functioning of the biosphere. There simply could not be enough environmental phages to

1 matter, given the then measurable number of potential prokaryotic hosts counted by culturing techniques. Since typically only ~ 1% of the Bacteria were culturable by methods at that time, bacterial populations were routinely underestimated by two orders of magnitude: the “great plate count anomaly” (Staley and Konopka, 1985) . In the late 1970s, the reported environmental bacterial populations jumped 100-fold or more with the development of improved direct counting methods employing epifluorescence microscopy (Hobbie, Daley, and Jasper 1977). In 1979 Torrella and Morita (Torrella and Morita 1979) concentrated particles larger than 0.2 μm from Oregon bay water by filtration and made direct counts of virus-like particles (VLPs) by transmission electron microscopy. Their counts, only about 104 mL–1 or one VLP per microbial cell, overlooked VLPs <0.2 μm, the majority of phages. This omission was corrected in 1989 when Bergh and associates centrifuged water samples directly onto grids for viewing by transmission electron microscopy (Bergh et al. 1989). They reported direct virion counts up to 1.5 × 107 mL–1, an order of magnitude greater than their hosts.

Moreover, based on several reasonable assumptions, they predicted that in marine environments, as much as one-third of the bacterial population suffered phage attack daily— hardly insignificant.

This was only the beginning as abundant microbes, and even more numerous phages, were then found in many environments, including inhospitable locations such as glacial ice

(Anesio et al. 2007), bubbling acidic hot springs (Bolduc et al. 2015), deep-sea vents

(Ortmann and Suttle 2005), and deep sediments (Engelhardt et al. 2014). Based on subsequent methodological developments, viruses are now recognized for what they are: the winners in

2 the game of life. They are the most abundant and the most diverse life forms, and are of great import for ecology and evolution.

A Global Census

How many viruses are there on Earth, and where are they located? One would expect to find the most viruses in the habitats with the most host organisms. The most abundant cellular organisms are the prokaryotes with an estimated 4.15 × 1030 cells, the majority of which are found in soil, subseafloor sediments, and marine waters (Table 1.1) (Whitman,

Coleman, and Wiebe 1998; Williamson, Radosevich, and Wommack 2005; Kallmeyer et al.

2012; DeLong 2003). The number of microbial eukaryotes, including algae, is comparatively small, in the range of only 103 to 104 mL–1 in seawater and other planktonic environments

(Rocke et al. 2015).

Taking a global phage census poses methodological challenges. Current methods for direct counting of VLPs commonly combine nucleic acid stains with EFM or flow cytometry.

These methods were developed for aquatic environments, and their application to soil and sediment is problematic. Additional steps are required to detach prokaryotes and VLPs from particles and surfaces, and successful methods are specific to particular situations. Both may be obscured by opaque particles, and background fluorescence can interfere. The marine environment remains the most extensively sampled and best studied biome. Other intensively studied locations, such as the human gut and freshwater, make a relatively small contribution to the global total. Soils and the extensive subsurface sediments warrant more attention as their contribution is expected to be major.

3 Table 1.1 Estimated number of virus-like particles on Earth. Virus-to-microbe ratios (Supplemental Table 2 in Cobián et al., 2016) were extracted from 53 previous studies ratios (Supplemental Table 1 in Cobián et al., 2016) and used to calculate virus-like particles in each biome. Numbers of prokaryotic cells in marine, freshwater, other aquatic, sediment, and soil biomes are from Whitman at al., 1998. Number of cells per human is from Sender et al. 2016; human population is from United Nations, 2016. Median virus-to-microbe ratio for human biome is from Kim et al., 2011.

Number of Median virus- Virus-like Percentage of Biome prokaryotic to-microbe particles total virus-like cells ratio per biome particles Marine 1.01 x 1029 12.76 1.29 x 1030 2.6828 Freshwater 1.26 x 1026 14 1.76 x 1027 0.0037 Other aquatic 2.44 x 1027 30 7.32 x 1028 0.1524 Sediment 3.80 x 1030 11 4.18 x 1031 87.0131 Soil 2.50 x 1029 19.5 4.88 x 1030 10.1481 Human-associated 2.80 x 1023 0.1 2.80 x 1022 0.0000 Other host-associated Unknown 25 Unknown Unknown

Total 4.15 x 1030 12 4.80 x 1031

Marine sediment contains up to 105 times more organic matter than the water column above, supporting bacterial densities about 103 times higher (Kallmeyer et al. 2012). Summed globally, the total number of prokaryotes in subseafloor sediment (2.9 × 1029) is roughly equal to the estimates of Whitman et al. (Whitman, Coleman, and Wiebe 1998) for the total number of prokaryotes in seawater (1.2 × 1029) and in soil (2.6 × 1029). Cell densities alone would predict good hunting for the phage, as the probability of non-specific phage-host contact would be ~103 times greater on average than in the water column above. Indeed, VLP counts of 109 ml–1 in sediment have been reported, three orders of magnitude greater than that

4 observed in the overlying water column at that depth (Danovaro and Serresi 2000). Similarly, the large numbers of prokaryotes in soil, the rhizosphere, and the rhizosheath, and their associated phages, remain largely terra incognita despite their accessibility and their importance to agriculture.

VLPs outnumber their hosts by approximately a factor of ten-to-one in various oceanic locations (Wommack and Colwell 2000), and similar ratios have been observed in some other biomes (Figure 1.1). It is not understood what drives this consistency, nor the observed variations. On this basis, earlier extrapolations from the prokaryote abundances in the major biomes yielded an estimated 1.2 × 1030 in open ocean, 2.6 × 1030 in soil, 3.5 × 1031 oceanic subsurface, and 0.25–2.5 × 1031 in terrestrial subsurfaces, consistent with the rule-of-thumb estimate of 1031 viruses on Earth (Mokili, Rohwer, and Dutilh 2012). Here we revisited this question by combining the prokaryote populations drawn primarily from the classic 1998 paper by Whitman et al. (Whitman, Coleman, and Wiebe 1998) with the median measured

VMRs from 53 previously-reported studies (Supplemental Table 1 in Cobian et al. 2016). In addition to the major biomes, we included some specialized environments of particular interest, such as niches associated with humans and other hosts. From a global perspective, the majority of prokaryotes inhabit soil, seawater, and the sediments. Thus, we expected that these biomes would also account for the majority of viruses.

5 Figure 1.1 Virus-to-microbe ratios for major biomes. Box plots of biome virus-to-microbe ratios drawn from 53 previous studies (see Supplemental Methods). The means are indicated with gray diamonds. The medians are indicated with red lines.

This approach yielded 1.29 × 1030 VLPs in marine waters, 4.18 × 1031 in sediments, and 4.88 × 1030 in soil (Table 1.1). Adding in other lesser contributors brought the global total to 4.80 × 1031 VLPs (details in Supplemental Table 2 in Cobian et al. 2016). This confirms and slightly augments the oft-quoted number of 1031. Given that the vast majority of cellular entities are prokaryotes, this number represents the global phage population. Significantly, sediments and soil combined accounted for 97% of the global total. In sum, phages are the most abundant life forms, exceeding the runner-up, the prokaryotes, by more than an order of magnitude. Populations of this magnitude dictate that phage evolution is driven by selection, with the contribution of drift being insignificant (Kimura 1962).

Caveats: To the best of our knowledge, there are only two published reports of VMRs for human-associated communities. One reported VMRs of 38.6 and 7.9 for gum-associated mucus and the adjacent milieu, respectively (Barr et al. 2013), while the other reported a mean VMR of 0.129 and a median VMR of 0.1 in fecal matter (M.-S. Kim et al. 2011). Since

6 the vast majority of human-associated microbes and VLPs are in the distal gut (Sender, Fuchs, and Milo 2016; Haynes and Rohwer 2011), the fecal VMR was used to calculate the total

VLPs for the human-associated biome. Admittedly, that fecal VMR, being two orders of magnitude lower than the VMRs typical of most other biomes, was suspect. However, combining it with the recently revised estimate of 3.9 × 1013 human-associated prokaryotes

(Sender, Fuchs, and Milo 2016) yields 3.9 × 1012 human-associated VLPs, in good agreement with the previous estimate of 3.0 × 1012 (Haynes and Rohwer 2011). Unraveling the phage- host dynamics in the human-associated microbiome will require further investigation.

Several factors may result in underestimations of phage abundance. 1) Equating the number of VLPs with the number of phages overlooks those residing as prophages in bacterial genomes—not an insignificant number (see below). 2) Some stains used to enumerate microbial cells and VLPs by EFM, such as SYBR Green, are relatively insensitive to single- stranded DNA and RNA, thus overlook many small viruses. A more representative assay can be made using SYBR Gold (Tuma et al. 1999). 3) VLP counts for soil and sediment are biased by reduced extraction efficiencies, while visualization and identification of VLPs can be compromised by particulate matter in samples of fecal matter, soil, sediment, etc. 4) Some

VLP isolation methods for EFM include filtration through 0.2 μm filters which miss larger virions and also virions stuck to particulate matter. 5) The 0.02 μm grids used for TEM viewing miss the smallest virions.

Conversely, are these VLPs really viruses? In some environments, some VLPs may be gene transfer agents (GTAs), i.e., packaged cellular genes masquerading as tailed phage particles (Lang, Zhaxybayeva, and Beatty 2012). Although GTAs are produced by

7 roseobacters, a group that may account for more than 25% of the prokaryotes in some marine environments (Lang, Zhaxybayeva, and Beatty 2012), it remains unknown whether GTAs make a significant contribution to marine VLP counts. Also unknown is what fraction of the

VLPs in various environments are infectious, and what percentage were defective upon release or subsequently inactivated by UV irradiation, physical damage, or enzymatic attack.

Phages Matter

How much do 1031 viruses matter? One measure of this is to consider the matter they contain. The mass of a typical phage virion containing 50 kbp DNA packaged inside an icosahedral capsid is calculated to be 0.0823 fg; of this, 0.054 fg is DNA and 0.0283 fg is the protein capsid (Antoni Luque, personal communication). For comparison, each

Prochlorococcus, a particularly small autotrophic marine bacterium, has a mass of 300 fg.

Based on this virion mass, the 4.80 × 1031 VLPs on Earth have a total mass of 3.95 × 1015 g or

3.95 Pg. The stoichiometry of carbon, nitrogen, and phosphorus (C/N/P) in this typical virion is 20/6/1 (Jover et al. 2014), which partitions that mass as approximately 0.06 fg C, 0.02 fg N, and 0.0075 fg P. Calculation based on these data suggests that the Earth's VLPs represent roughly 2.9 Pg C, i.e., two orders of magnitude less than the 350-500 Pg total carbon previously estimated for Earth's prokaryotes (Whitman, Coleman, and Wiebe 1998).

Compared to microbial cells, virions are enriched in both nitrogen and phosphorus, with estimated global totals of 0.96 Pg N, and 0.36 Pg P. As a result of this enrichment, >5% of the total marine DOP and DON pools is estimated to reside in virions in some locations (Jover et al. 2014).

8 It is now widely recognized that microbes are of tremendous importance for global biogeochemical processes, and consequently, that which controls the microbes — the phages — runs the world. Prokaryotes are subject to predation by both heterotrophic protists and phages. Grazing by protist predators is size-selective, whereas lysis by phages is strain- specific. Moreover, in the oceans grazing moves the organic carbon and nutrients to higher trophic levels, whereas lysis routes these components instead through the viral shunt as dissolved organic matter (DOM). This DOM feeds the heterotrophic microbial community, thereby increasing net primary productivity and slowing the movement of carbon to the deep ocean (Fuhrman 1999; Weinbauer 2004; Wommack and Colwell 2000; Weitz et al. 2015;

Proctor and Fuhrman 1990; C A Suttle 2005). Because of the stoichiometric mismatch between virions and their host cells, after lysis, a disproportionate amount of the P is found in the progeny virions, leaving the cellular debris that feeds the heterotrophs depleted in P (Jover et al. 2014). An estimated 1028 marine bacteria are lysed daily including ~30% of the cyanobacteria and ~60% of the heterotrophic bacteria (Curtis A. Suttle 2007; Proctor and

Fuhrman 1990). This selective, strain-specific lysis profoundly impacts prokaryote community diversity increasing both richness and evenness (Wommack and Colwell 2000;

Fuhrman 1999; T.F. Thingstad 2000; T.F. Thingstad et al. 2015; Sandaa et al. 2009).

Phages also add a horizontal dimension to microbial evolution by mediating the transfer of genes between cells (John H Paul 1999). They nab useful metabolic genes from their hosts, maintain them, evolve them further to suit their own needs, and then sometimes return the new version to the microbial gene pool (Frank et al. 2013; Lindell et al. 2004;

Sullivan et al. 2006; D.B. Goldsmith et al. 2011). On an ecosystem level, the phage

9 community encodes environment-specific repertoires of microbial metabolic genes (Dinsdale et al. 2008).

Through lysogeny, phage genes are a continual presence in microbial communities.

Identification of prophages in microbial genomes is challenging, and numerous bioinformatic methods have been developed (Bose and Barber 2006; Zhou et al. 2011; McNair, Bailey, and

Edwards 2012; Akhter, Aziz, and Edwards 2012; Fouts 2006) Estimates of the percentage of prokaryotes in various biomes that carry one or more prophages have ranged between 0% and

100% (J.H. Paul 2008). Approximately 82% of the sequenced prokaryotic genomes available as of 2015 are predicted to contain at least one prophage (Katelyn McNair, personal communication). Resident prophages often account for the differences between strains within a bacterial species (Canchaya et al. 2003). Prophage-encoded exotoxin genes cause many notable human diseases, including cholera, diphtheria, and enterohaemorrhagic diarrhea

(Casas et al. 2006) as well as diseases that plague our agriculture and aquaculture.

We do not yet understand the factors influencing the prevalence of lysogeny in any biome. Siphoviruses have long been associated with the temperate lifestyle, although lysogeny is not confined to that family. This group was observed to dominate the community in the Southern Ocean, marine sediment, desert, hypersaline ponds, and human fecal samples, accounting for 44% of the total in sediment (M. Breitbart et al. 2004; M Breitbart et al. 2002;

M. Breitbart et al. 2003; Brum et al. 2015; Adriaenssens et al. 2015; Roux et al. 2016).

However, whether a temperate phage will follow the lytic or lysogenic pathway is decided at the start of each infection in response to host and environmental factors. The common interpretation based primarily on studies of coliphage λ posits that host abundance, as sensed

10 by a low multiplicity of infection (MOI), favors lytic replication, while high MOI favors lysogeny (Herskowitz and Hagen 1980). A recent analysis of phage communities on coral reefs suggests more complex dynamics (Knowles et al. 2016).

How many different viruses?

Ever since the discovery of phages, it has been evident that there are different ones capable of killing different bacteria. However, one hundred years later we are still do not know the extent of this diversity, its biogeography, or its dynamic role in ecosystem function.

While the diversity of all cellular life has been probed using universal genes such as the small subunit ribosomal RNA gene, the polyphyletic viruses have no gene in common, thus precluding a comprehensive PCR-based survey of viral diversity (F Rohwer and Edwards

2002; Dwivedi et al. 2012). Therefore, other methods had to be devised.

Two approaches based on data available in 2003 both yielded an estimated 100 million phage 'species' (F. Rohwer 2003). One conservatively assumed that 10 phage 'species' infect each of the estimated 10 million microbial species — thus 100 million different phages. This alone does not measure genetic diversity since two 'different' phages might differ in only one or two key proteins that determine host range. Similarly, the swapping of structural gene modules can yield a new 'species' that differs in virion morphology without any increase in the global genetic diversity, while two phages indistinguishable by morphology and host range can differ significantly in other genome modules. The second approach compared all the sequenced phage ORFs in GenBank at that time using BLAST and clustered the ORFs using a E-value of 10–4. From this data the non-parametric estimator Chao1 (Chao 1984) predicted that 2 × 109 phage ORFs remained to be discovered. Assuming 50 ORFs per phage

11 genome with 50% of those being novel, calculations based on the Chao1 value predicted 100 million different phages.

A decade later, armed with more metagenomic data and new analysis tools, Sullivan and colleagues presented an analysis based on protein clustering that reduced the estimated total number of phage ORFs from 2 × 109 to only 3.9 × 106 (Ignacio-Espinoza, Solonenko, and Sullivan 2013), a demotion of almost three orders of magnitude. The debate continues.

Signature Genes

Although there is no universal phage gene, diversity within phage groups can be assessed using signature genes shared by all group members (Adriaenssens and Cowan 2014).

For instance, the capsid portal gene (g20) is conserved among many of the large cyanomyophages that inhabit marine and freshwater environments. Several studies using g20 reported rich community diversity, typically 100 or more OTUs and including clades for which there are no cultured isolates (Zhong et al. 2002; Jameson et al. 2011). Attempts to correlate variations in community composition with depth, host abundance, season, and geographic distance yielded inconsistent results. Some OTUs demonstrated consistent seasonal variation while others persisted in moderate abundance from year to year (Chow and

Fuhrman 2012). Sampling across a north-south Atlantic Ocean transect found similar cyanomyophages to be widely distributed with no apparent geographical segregation

(Jameson et al. 2011), while others were present in both marine and freshwater environments

(Dorigo, Jacquet, and Humbert 2004). However, some diversity within this group eluded these surveys; of 39 cyanophage isolates from the Gulf of Mexico, only 63% carried detectable g20 sequences (McDaniel, DelaRosa, and Paul 2006).

12 The entire T4 superfamily (Myoviridae) was similarly surveyed using their major capsid protein (MCP, gp23). In aquatic environments, these are primarily T4-like cyanomyophages. The results here echoed the same trends: diversity (more than 100 OTUs) exceeding that represented in cultured isolates (Comeau and Krisch 2008), seasonal succession patterns, and long-term persistence of some OTUs (Chow and Fuhrman 2012;

Pagarete et al. 2013). In some cases persistent OTUs were also the most abundant (>4% relative abundance), thus contradicting predictions of both the Bank Model of viral community structure and the classic Kill-the-Winner dynamic (T. Thingstad and Lignell

1997; Pagarete et al. 2013; M. Breitbart and Rohwer 2005).

These analyses based on g20 and gp23 are limited to the myophages (predominantly cyanomyophages) and miss the podophages and the abundant siphophages. A more inclusive assessment of cyanophage diversity, including myophages and podophages, used psbA, the gene that encodes the D1 protein of oxygenic photosystem II (Chenard and Suttle 2008).

Phage-encoded sequences clustered by both phage family and by host, separate from the host psbA genes, and included clusters with no cultured isolates. Multiple clusters coexisted in some locales while some clusters extended over vast geographic distances. Moreover, one cyanopodovirus subcluster was found to be globally distributed based on its DNA polymerase gene (pol) (Huang et al. 2010).

Other signature genes can provide a more complete picture of marine phage diversity.

For example, the phosphate-starvation gene phoH is present in nearly 40% of all marine phages compared to 4% of non-marine phages, reflecting the scarcity of phosphate in the marine environment (D.B. Goldsmith et al. 2011). It is not restricted to a single viral family

13 and is found in phages that infect heterotrophs as well as autotrophs, even in viruses of photosynthetic green algae. A phoH-based marine survey found, yet again, that most of the environmental diversity was not represented in cultured isolates, that phoH homologs are widely distributed with most clusters represented in multiple oceanic regions, and that marine phage community composition varies with depth and geographical location. A subsequent survey of the phoH genes in Sargasso Sea phage communities at depths of 0 to 1,000 m over a two-year period identified 3,619 OTUs (97% identical) and provided new insights into community dynamics (Dawn B Goldsmith et al. 2015). Approximately 96% of those OTUs were rare, each accounting for <0.01% of the total sequences, while more than 50% of the sequences were from five abundant OTUs. The presence of a few abundant OTUs (1-4 in any particular sample) and many rare ones is consistent with the Bank Model of phage community structure. However, whereas that model predicts the cycling of phages between the two groups over time, here the rare OTUs remained rare, and the most abundant OTUs persisted through seasons and years.

To date, signature genes have been developed for only some viral families. They are strikingly lacking for the Siphoviridae, the family that includes many temperate phages and that dominates both metagenomic datasets and cultured isolates. In even the best cases, signature genes fail to capture the full richness present in natural communities. However, they have provided insights into the global distribution of specific phage genes. The DNA polymerase conserved in the T7-like podophages and restricted to that group was found in multiple biomes (M. Breitbart, Miyake, and Rohwer 2004). Moreover, identical or nearly- identical 533 bp segments were recovered from different biomes, indicating that phages, or at

14 least phage genes, have moved between biomes in recent evolutionary time. Similarly, >98% identical sequences from algal virus DNA polymerase genes were found from the northern

Pacific Ocean to Antarctica (Short and Suttle 2005). These observations could represent either the movement of phages or of individual phage genes. That phages from freshwater, sediment, and soil are able to infect marine prokaryotic communities suggests that phages can move successfully between biomes (Sano et al. 2004). These, and similar observations, suggest that the global viral diversity may be less than previously estimated based on the diversity in individual biomes.

Phage Metagenomics

Expansion of the field of view from signature genes to assessment of phage community diversity was made possible by the development of . Earlier sequencing methods that required cloning of phage DNA had often encountered issues. Many phages carry genes that are lethal to the cloning host cells, or their DNA contains modified bases that block cloning. An alternative method, linker-amplified shotgun sequencing, was first used to assess near-shore marine communities (M Breitbart et al. 2002). Sample preparation included passage through a 0.16 μm tangential flow filter, purification of VLPs by

CsCl gradient centrifugation, and subsequent DNA extraction. This procedure did not recover large viruses (e.g., algal Phycodnaviruses) or RNA phages. This initial marine survey found that more than 65% of the phage sequences were novel. Four years later, viromes prepared using next generation sequencing from four oceanic regions contained >90% unknowns

(Angly et al. 2006). Even 15 or more years later, despite the increased number of sequenced viral genomes in the public databases, 60-99% of the sequences in viromes from diverse

15 biomes are still unknowns (Mokili, Rohwer, and Dutilh 2012; Brum and Sullivan 2015;

Adriaenssens et al. 2015; Watkins et al. 2015; Roux et al. 2016). More than 99% of viral genetic diversity remains to be explored (Mokili, Rohwer, and Dutilh 2012).

Even though most sequences recovered are unknowns, bioinformatics methods can provide insights into community structure. Metagenomic reads are assembled into contigs in silico. The more diverse the community, the lower the probability of sequencing two overlapping fragments from the same genome, thus shorter contigs and more unassembled singleton reads. Plots of contig spectra (number of contigs versus contig length) were best represented by power-law based mathematical models and provided estimates of both the richness and evenness of the sampled community (M Breitbart et al. 2002). Application of this approach to near-shore marine communities estimated 374 – 7,114 genotypes present. Of these, the most abundant one represented only 2 – 3% of the total community, while only three contributed more than 1% of the reads. Subsequent metagenomic surveys of diverse environments reported genotypes numbering in the hundreds to tens of thousands [reviewed in (Youle, Haynes, and Rohwer 2012); data in (M. Breitbart et al. 2003; Angly et al. 2006;

Tseng et al. 2013; Bolduc et al. 2015; Youle, Haynes, and Rohwer 2012)].

Virome Meta-analysis

We have developed a new bioinformatics method, FRAP (Fragment Recruitment,

Assembly, Purification) (Figure 1.2, Supplemental Methods) and used it to assess global phage diversity by analyzing 1,623 publicly available viromes (Supplemental Table 3 in

Cobian et al. 2016). To create a reference library of the observed phage genotypes on Earth, all reads from each of the viromes were assembled separately using SPAdes with the K value

16 adjusted to maximize the incorporation of base pairs into contigs (Bankevich et al. 2012).

Comparative evaluations of assembler performance on metagenomic datasets had ranked

SPAdes among the best based on contig accuracy (García-López, Vázquez-Castellanos, and

Moya 2015). We assumed that each ≥ 1kbp contig represented a partial phage genotype that was relatively abundant in that biome. Rarely more than one non-overlapping contig might have been recovered from the same genotype. All of these ≥1 kbp contigs were added to the reference library. In addition, the 2,669 sequenced phage genomes (December 2015) and 67 archaeal virus genomes (February 2016) in the NCBI Viral Genomes database were added to the library, along with an additional 123 genomes from the Broad Institute

Marine Phage Sequencing Project. Subsequent dereplication by CD-HIT (Li and Godzik

2006) at 98% identity yielded 2,267,978 phage contigs plus genomes.

Given this reference library, the fragment recruitment step could then retrieve the matching reads from any virome. In principle this FRAP method could be used to retrieve and thus purify sequences that are only minor components of any dataset. For our analysis all reads from all viromes were mapped to the reference library at 90%. 95%, and 99% identity, and the normalized number of hits to each contig was tallied for each biome (Table 1.2,

Supplemental Methods). This yielded the fractional abundance in each biome of every contig that is also present in the reference library (Figure 1.3 b–g). Parallel mapping was performed on a pooled global virome prepared by proportionate subsampling of the individual biome files (Figure 1.3 a). Of these pooled reads, 87% were from sediment and 10% were from soil.

The ten most abundant viral contigs in each case occupied the same rank for all three mapping identities, evidencing the robustness of the method.

17 Figure 1.2 Bioinformatics pipeline for the FRAP (fragment recruitment, assembly, purification) method (see Supplemental Methods).

18 Table 1.2 Virome metrics. 1For the Tara Oceans Virome data set, only 1% of the reads were used. Including all would increase the number of reads from the marine biome to 1,688,798,702.

Percentage Percentage of reads mapped Number of reads to reference library Number of Biome of assembled reads viromes into ≥1 kbp 90% 95% 99% contigs identity identity identity Marine 192 56,676,5171 11.83 38.1 29.4 20.4 Freshwater 48 11,519,523 19.51 15.7 11.4 8.1 Other aquatic 19 3,342,537 14.04 40.2 35.8 27.3 Sediments 21 15,729,082 10.00 54.7 51.7 44.2 Soil 9 2,459,152 15.54 41.9 36.6 29.0 Human- associated 1158 481,172,486 25.16 30.3 27.2 12.4 Other host- 3.81 34.6 31.1 25.3 associated 167 34,600,192 Other 7 396,889 28.22 44.3 36.8 17.4 Total 1621 605,896,378 16.01 31.8 28.1 14.7

Mapping at 99% identity provided the most conservative estimate of the number of identified viral contigs and was used to calculate the number of viral genotypes potentially encoded for each biome (Figure 1.3). Here we summed the lengths of all observed phage contigs and divided by the assumed average phage genome size of 50 kbp. The most viral genotypes were observed in the marine and human-associated biomes, the least in soil. This reflects the size of the datasets: 192, 1158, and 7 viromes for marine, human-associated, and soil, respectively. This meta-analysis demonstrated that even when using all the information currently available for DNA-containing phages, we detect only 1,160 viral genotypes in the

19 pooled global virome, likely due to the low number of viromes from soil and sediment biomes.

The rank abundance plots for each biome are best described by a power law model

(Supplemental Table 5 in Cobian et al. 2016). Using this model, we calculated the predicted richness of each biome (Supplemental Methods). This approach predicted only 24,795 viral genotypes for the marine biome, which is not realistic (Supplemental Table 5 in Cobian et al.

2016). We conclude that we have insufficient information about the shape of the curve to calculate the actual viral richness of each biome. Further work is needed before we even know how much we still do not know, i.e., how far we are from understanding the viral dark matter of the biosphere.

20 Figure 1.3 Viral rank abundance curves for (a) marine, (b) sediment, (c) freshwater, (d ) soil, (e) other aquatic, and ( f ) human-associated biomes. Reads in these six major biomes were mapped to the genotypes present in the reference library at 90%, 95%, and 99% identity. In each case, the relative abundance of all recovered viral genotypes is shown as well as the observed coding capacity expressed as the potential number of different 50-kbp phage genomes and the predicted viral genotypes using both the curve fit and tail fit methods.

21

How many different phages on Earth?

Providing a direct answer to this question remains challenging, in part because we do not know whether phages are provincial or cosmopolitan. If a biome is sampled in two geographic locations, A and B, and the number of phage genotypes present in each is estimated, is the richness of the phage community in that biome equal to A + B, or is it significantly less? Likewise, if the number of phage genotypes in each biome is known, are we justified in summing them to calculate the global virome? Other studies have addressed this by assessing the fraction of genotypes shared between biomes or geographical regions. A three-pronged analysis of viromes from four oceanic regions (the Pacific off the coast of

British Columbia, Arctic Ocean, Gulf of Mexico, and Sargasso Sea) found that a large fraction of the phage community is cosmopolitan, that is they are found in two, three, or even four of the surveyed regions (Angly et al. 2006). Within this global distribution, the phage communities showed regionalization in that community members, including the cosmopolitan and the most abundant, shifted in relative abundance from region to region. Assembly of reads from each region separately yielded a total of ~150,000 genotypes, whereas co- assembly reduced the total to only 57,600 different genotypes. Similarly, a recent study of hypersaline ponds on three continents found that community composition varied with the level of salinity but that communities in the same salinity are genetically connected across the globe (Roux et al. 2016). Conversely, phage communities present in three soil biomes shared essentially no genotypes (Fierer et al. 2007).

The cosmopolitan range of individual phage genes or gene modules is another factor that complicates determination of the ecological or geographical range of phage genotypes.

22 Studies described earlier that found nearly identical sequences of signature genes to be globally distributed indicated a global phage gene pool, at least within the marine biome.

However, even within that biome, this does not distinguish between the global travels of phage genes by horizontal gene transfer and the presence of the same phage genotypes in geographically remote locations.

Here we used the number of phage encoding capacity observed in our meta-analysis to estimate the number of possible phage genotypes on Earth (Supplemental Methods). In brief, we used the sum of lengths of all the phage contigs as a proxy for all phage DNA currently available in the databases, and then divided by the assumed 50 kbp average phage genome length (Supplemental Table 5 in Cobian et al. 2016). On this basis we estimate that the observed phage DNA is sufficient to encode 257, 698 different phages.

This estimate is undoubtedly low. The sequenced samples so far represent only a minute fraction of the total phage-encoded information present in every biome. For example, including all 2.16 × 109 reads from the massive Tara Oceans Viromes dataset (average read length of ~101 bp) would provide 2 × 108 kbp of marine phage genomic sequence. To put this in perspective, 1.29 × 1030 marine VLPs with an average genome of 50 kbp would contain 6.5

× 1031 kbp of DNA.

Which are the most abundant phage genes?

To explore this, we annotated the 10 most abundant contigs in each biome and in the pooled global virome (Supplemental Table 4 in Cobian et al. 2016). The majority of them have no significant similarity to sequences in the NCBI database. For the soil biome, none of the 10 have similarities to known sequences. A podovirus polymerase is the most abundant

23 contig in the marine biome followed by several uncultured virus clones and unknown sequences. For the freshwater biome, nonstructural viral genes were detected among the most abundant ones, as well as others with no similarities to known sequences. Similarly, for the global virome the most abundant contigs have no significant hits in the databases, followed by contigs annotated as circovirus, a circular virus, and a cryptic MLU1 plasmid from

Micrococcus.

Caveats: Our reference database was clustered at 98% identity. The 2 most abundant viral contigs in the marine biome were podovirus polymerase genes that share >98% identity among themselves. Methodological improvements allowing clustering at 100% identity sequences would eliminate this issue.

Further, all of our results are unavoidably skewed due to the biased distribution of the current viromes. Of the 1,623 viromes, 71% were from human-associated communities, 10% from communities associated with other animals or plants, and 11% from marine environments (Table 1.2). Thus 92% of the viromes explored 3% of the VLPs on Earth, while only 1.8% investigated the two biomes with 97% of the VLPs — soil and sediment (Table

1.1). Some biomes remain virtually unsampled. Even in the marine biome, only a very limited geographic territory and range of environmental parameters have been sampled. Despite the surging interest in the human microbiome, published counts for human-associated prokaryotic cells and VLPs are strikingly lacking.

The publicly available viromes are further compromised by poor methods. Of those viromes, 132 (7.5%) were omitted from our library because they were mislabeled microbial metagenomes or showed obvious contamination with human or microbial sequences. Some

24 contained abundant φX174 sequences due to either the amplification bias of multiple displacement amplification (K.-H. Kim and Bae 2011; Duhaime and Sullivan 2012) or, when sequencing on the Illumina platform, the failure to remove the PhiX quality control sequences prior to submission to the public databases. In the sediment biome, the highly abundant fragments from Staphylococcus aureus plasmids could be from phages or GTAs, or from bacterial contaminants of viromes due to inadequate viral purification during sample preparation. Many sources of error can be avoided and more quantitative data obtained by carefully adhering to current best practices (Duhaime and Sullivan 2012; Duhaime et al. 2012;

K.-H. Kim and Bae 2011). Still needed is the comparable development of methods for RNA viruses and ssDNA phages, as well as VLP purification procedures that do not exclude the largest viruses.

The future for FRAP

Given a high-quality reference library, FRAP can be used to get rid of the crap, i.e., to fish out matching sequences from a metagenomic dataset, even when they comprise only a small percentage of the reads. This can potentially eliminate some sample purification steps or enable you to selectively retrieve different components from a mixed sample. However,

FRAP's utility depends on the completeness and quality of the reference library. Library development, in turn, calls for clean sample preparation methods, sequencing of more bp, and longer read lengths (such as the 10 to 15 bp lengths now possible with PacBio SMRT℗ sequencing).

25 Recent advances in phage ecology have heightened our awareness that we live in a phage world. Much of that world remains a terra incognita. For the careful researcher equipped with today's technology, the opportunities for discovery are vast.

Summary Points

1. Phages are the winners: the most numerous and genetically diverse life forms on

Earth. The estimated 4.80 × 1031 VLPs on Earth comprise at least 257,698 different phage genotypes.

2. We have barely begun to explore phage diversity. Metagenomic studies have focused on a few biomes and have ignored soil and sediment—the two that combined contain

97% of the global phage population. Sampling has been sparse at best, while numerous environments remain terra incognita.

3. Global phage diversity far exceeds that represented by cultured isolates.

4. One cannot fully understand the ecology or evolution of any ecosystem without including the phages.

5. Our current knowledge of the fractional abundances of phages on each biome is limited and we need better strategies to describe the population structure of the phages on each biome.

Future Issues

1. Metagenomic sampling has been narrowly focused on selected regions of the marine environment and on host-associated communities, primarily human-associated. Most

26 of the globe and most biomes await exploration. Initial surveys of the human-associated phage community uncovered anomalous dynamics that await resolution.

2. Viruses that have chromosomes of RNA or single-stranded DNA are significant components of some communities, but have been generally ignored. Correcting this requires new inclusive methods for both their direct VLP counts and their metagenomics.

3. Some phages and some phage genes are cosmopolitan, while others appear to be geographically or ecologically restricted. The question remains: do all phages share the same global gene pool, with varying levels of access?

4. Genomic analysis of both phages and their hosts indicates that lysogeny is commonplace. Awaiting further exploration are the prevalence of temperate phages in various environments, the lysis-lysogeny decision, and the impact of lysogeny on the ecology and evolution of both phages and their hosts.

5. Phages are the greatest reservoir of unexplored genetic diversity on Earth. Current estimates of the number of different proteins encoded by phages vary widely. After more than a decade of phage metagenomics, the majority of phage sequences remain novel. Sequencing technology has advanced rapidly, and now enhanced bioinformatics methods are essential to analyze the mushrooming metagenomic data.

6. Assignment of function to phage-encoded proteins based on sequence homology is limited by the rapid rate of phage evolution. Other approaches are called for in order to translate environmental metagenomic data into the metabolic potential of phage communities.

27

7. The species concept is not directly applicable to viruses. An alternative, generally- accepted metric is needed to facilitate discussion of viral diversity, ecology, and evolution.

8. In the current era of the microbiome, researchers are actively investigating the roles of microbes in processes including human health, ecosystem functioning, and global biogeochemical cycles. Now, a century after the discovery of phage, exploration of the role of phage in these and other activities is overdue.

Definition of terms

Virome: a viral metagenome

OTU: operational taxonomic unit

Phage: a virus that infects a prokaryote (Bacteria and )

Microbe: a prokaryote, i.e., an archaeon or a bacterium

PFU: plaque-forming unit

Virion: the intercellular transport form of a virus, typically comprising the chromosome(s) enclosed within a protein capsid

VLP: virus-like particle

VMR: the virus-to-microbe ratio

EFM: epifluorescence microscopy fg: 10–15 g

Pg: 1015 g

Prophage: a phage chromosome that resides within a host cell (lysogen) without immediately engaging in lytic replication

28 Cyanophage: a phage that infects cyanobacteria

Myophage: a member of the family Myoviridae

Podophage: a member of the family Podoviridae

Siphophage: a member of the family Siphoviridae

DOP: dissolved organic phosphorus

DON: dissolved organic nitrogen

Acknowledgements

Chapter 1, in full, is published in Annual Review of Virology. Ana Georgina Cobián

Güemes, Merry Youle, Vito Adrian Cantú, Ben Felts, James Nulton and Forest Rohwer;

2016. The dissertation author was the primiary investigator and author of this paper.

All calculations were made on Rob Edwards’s lab cluster at San Diego State

University, which is supported by National Science Foundation grant DBI-0850356 for computational resources. We are thankful to the members of the Biomath group at San Diego

State University for critical discussion of this work; to Cynthia Silveira for early access to marine VMRs; to Rizki Wulandari, Jan Janouˇskovec, Nate Robinett, and Ben Knowles for suggestions and comments; and to Evelien Adriaenssens for data sharing. This work was partially supported by the National Council on Science and Technology (CONACyT),

Mexico.

29 References

Adriaenssens, Evelien M, and Don A Cowan. 2014. "Using signature genes as tools to assess environmental viral ecology and diversity." Appl. Environ. Microbiol. 80 (15): 4470-4480.

Adriaenssens, Evelien M, Lonnie Van Zyl, Pieter De Maayer, Enrico Rubagotti, Ed Rybicki, Marla Tuffin, and Don A Cowan. 2015. "Metagenomic analysis of the viral community in Namib Desert hypoliths." Environ. Microbiol. 17 (2): 480-495.

Akhter, Sajia, Ramy K Aziz, and Robert A Edwards. 2012. "PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity-and composition- based strategies." Nucleic Acids Res. 40 (16): e126-e126.

Anesio, Alexandre M, Birgit Mindl, Johanna Laybourn‐Parry, Andrew J Hodson, and Birgit Sattler. 2007. "Viral dynamics in cryoconite holes on a high Arctic glacier (Svalbard)." Journal of Geophysical Research: Biogeosciences 112 (G4).

Angly, F.E., B. Felts, M. Breitbart, P. Salamon, R.A. Edwards, C. Carlson, A.M. Chan, M. Haynes, S. Kelley, and H. Liu. 2006. "The marine viromes of four oceanic regions." PLoS Biol. 4 (11): e368.

Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, and Andrey D Prjibelski. 2012. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." J. Comput. Biol. 19 (5): 455-477.

Barr, Jeremy J, Rita Auro, Mike Furlan, Katrine L Whiteson, Marcella L Erb, Joe Pogliano, Aleksandr Stotland, Roland Wolkowicz, Andrew S Cutting, Kelly S Doran, P. Salamon, M. Youle, and F Rohwer. 2013. "Bacteriophage adhering to mucus provide a non-host-derived immunity." Proc. Natl. Acad. Sci. USA 110 (26): 10771-10776.

Bergh, Ø., K Y Børsheim, G Bratbak, and M Heldal. 1989. "High abundance of viruses found in aquatic environments." Nature 340: 467-468.

Bolduc, Benjamin, Jennifer F Wirth, Aurélien Mazurie, and Mark J Young. 2015. "Viral assemblage composition in Yellowstone acidic hot springs assessed by network analysis." ISME J. 9: 2162-2177.

Bolduc, Benjamin, Ken Youens-Clark, Simon Roux, Bonnie L Hurwitz, and Matthew B Sullivan. "iVirus." http://ivirus.us/.

Bose, Michael, and Robert D Barber. 2006. "Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences." In Silico Biol. 6 (3): 223-227.

30 Breitbart, M, P Salamon, B Andresen, J M Mahaffy, A M Segall, D Mead, F Azam, and F Rohwer. 2002. "Genomic analysis of uncultured marine viral communities." Proc. Natl. Acad. Sci. USA 99 (22): 14250-14255.

Breitbart, M., B. Felts, S. Kelley, J.M. Mahaffy, J. Nulton, P. Salamon, and F. Rohwer. 2004. "Diversity and population structure of a near–shore marine–sediment viral community." Proc. R. Soc. Lond. B Biol. Sci. 271 (1539): 565.

Breitbart, M., I. Hewson, B. Felts, J.M. Mahaffy, J. Nulton, P. Salamon, and F. Rohwer. 2003. "Metagenomic analyses of an uncultured viral community from human feces." J. Bacteriol. 185 (20): 6220.

Breitbart, M., J.H. Miyake, and F. Rohwer. 2004. "Global distribution of nearly identical phage encoded DNA sequences." FEMS Microbiol. Lett. 236 (2): 249-256.

Breitbart, M., and F. Rohwer. 2005. "Here a virus, there a virus, everywhere the same virus?" Trends Microbiol. 13 (6): 278-284.

Broad Institute. "Marine Phage Sequencing Project." www.broadinstitute.org/annotation/viral/Phage.

Brum, Jennifer R, Bonnie L Hurwitz, Oscar Schofield, Hugh W Ducklow, and Matthew B Sullivan. 2015. "Seasonal time bombs: dominant temperate viruses affect Southern Ocean microbial dynamics." ISME J.

Brum, Jennifer R, and Matthew B Sullivan. 2015. "Rising to the challenge: accelerated pace of discovery transforms marine virology." Nat. Rev. Microbiol. 13 (3): 147- 159.

Canchaya, C., G. Fournous, S. Chibani-Chennoufi, M.L. Dillmann, and H. Brüssow. 2003. "Phage as agents of lateral gene transfer." Curr. Opin. Microbiol. 6 (4): 417-424.

Casas, V., J. Miyake, H. Balsley, J. Roark, S. Telles, S. Leeds, I. Zurita, M. Breitbart, D. Bartlett, and F. Azam. 2006. "Widespread occurrence of phage-encoded exotoxin genes in terrestrial and aquatic environments in Southern California." FEMS microbiology letters 261 (1): 141-149.

Chao, A. 1984. "Non-parametric estimation of the number of classes in a population." Scan. Stat. Theory Appl. 11: 265-270.

Chenard, C, and CA Suttle. 2008. "Phylogenetic diversity of sequences of cyanophage photosynthetic gene psbA in marine and freshwaters." Appl. Environ. Microbiol. 74 (17): 5317-5324.

Chow, Cheryl‐Emiliane T, and Jed A Fuhrman. 2012. "Seasonality and monthly dynamics of marine myovirus communities." Environ. Microbiol. 14 (8): 2171-2183.

31 Comeau, André M, and Henry M Krisch. 2008. "The capsid of the T4 phage superfamily: the evolution, diversity, and structure of some of the most prevalent proteins in the biosphere." Mol. Biol. Evol. 25 (7): 1321-1332.

Danovaro, R., and M. Serresi. 2000. "Viral density and virus-to-bacterium ratio in deep-sea sediments of the Eastern Mediterranean." Appl. Environ. Microbiol. 66 (5): 1857.

DeLong, Edward F. 2003. "Oceans of archaea." ASM News 69 (10): 503-503.

Dinsdale, E.A., R.A. Edwards, D. Hall, F. Angly, M. Breitbart, J.M. Brulc, M. Furlan, C. Desnues, M. Haynes, and L. Li. 2008. "Functional metagenomic profiling of nine biomes." Nature 452 (7187): 629-632.

Dorigo, U., S. Jacquet, and J.F. Humbert. 2004. "Cyanophage diversity, inferred from g20 gene analyses, in the largest natural lake in France, Lake Bourget." Appl. Environ. Microbiol. 70 (2): 1017.

Duhaime, Melissa B, Li Deng, Bonnie T Poulos, and Matthew B Sullivan. 2012. "Towards quantitative metagenomics of wild viruses and other ultra‐low concentration DNA samples: a rigorous assessment and optimization of the linker amplification method." Environ. Microbiol. 14 (9): 2526-2537.

Duhaime, Melissa B, and Matthew B Sullivan. 2012. "Ocean viruses: rigorously evaluating the metagenomic sample-to-sequence pipeline." Virology 434 (2): 181-186.

Dwivedi, Bhakti, Robert Schmieder, Dawn B Goldsmith, Robert A Edwards, and Mya Breitbart. 2012. "PhiSiGns: an online tool to identify signature genes in phages and design PCR primers for examining phage diversity." BMC Bioinformatics 13 (1): 37.

ENA. "ENA Sequence Search." https://www.ebi.ac.uk/metagenomics/.

Engelhardt, Tim, Jens Kallmeyer, Heribert Cypionka, and Bert Engelen. 2014. "High virus-to-cell ratios indicate ongoing production of viruses in deep subsurface sediments." ISME J. 8 (7): 1503-1509.

Fierer, N., M. Breitbart, J. Nulton, P. Salamon, C. Lozupone, R. Jones, M. Robeson, R.A. Edwards, B. Felts, and S. Rayhawk. 2007. "Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil." Appl. Environ. Microbiol. 73 (21): 7059-7066.

Fouts, Derrick E. 2006. "Phage_Finder: automated identification and classification of prophage regions in complete bacterial genome sequences." Nucleic Acids Res. 34 (20): 5839-5851.

32 Frank, Jeremy A, Don Lorimer, Merry Youle, Pam Witte, Tim Craig, Jan Abendroth, Forest Rohwer, Robert A Edwards, Anca M Segall, and Alex B Burgin. 2013. "Structure and function of a cyanophage-encoded peptide deformylase." ISME J. 7 (6): 1150-1160.

Fu, Limin, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. 2012. "CD-HIT: accelerated for clustering the next-generation sequencing data." Bioinformatics 28 (23): 3150-3152.

Fuhrman, Jed A. 1999. " and their biogeochemical and ecological effects." Nature 399 (6736): 541-548.

García-López, Rodrigo, Jorge Francisco Vázquez-Castellanos, and Andrés Moya. 2015. "Fragmentation and coverage variation in viral metagenome assemblies, and their effect in diversity calculations." Front. Bioeng. Biotechnol. 3.

Goldsmith, D.B., G. Crosti, B. Dwivedi, L.D. McDaniel, A. Varsani, C.A. Suttle, M.G. Weinbauer, R.A. Sandaa, and M. Breitbart. 2011. "Development of phoH as a Novel Signature Gene for Assessing Marine Phage Diversity." Appl. Environ. Microbiol. 77 (21): 7730-7739.

Goldsmith, Dawn B, Rachel J Parsons, Damitu Beyene, Peter Salamon, and Mya Breitbart. 2015. "Deep sequencing of the viral phoH gene reveals temporal variation, depth- specific composition, and persistent dominance of the same viral phoH genes in the Sargasso Sea." PeerJ. 3: e997.

Haynes, Matthew, and Forest Rohwer. 2011. "The human virome." In Metagenomics of the Human Body, 63-77. Springer.

Herskowitz, Ira, and David Hagen. 1980. "The lysis-lysogeny decision of phage lambda: explicit programming and responsiveness." Ann. Rev. Genet. 14 (1): 399-445.

Hobbie, J El, R Jasper Daley, and STTI977 Jasper. 1977. "Use of nuclepore filters for counting bacteria by fluorescence microscopy." Appl. Environ. Microbiol. 33 (5): 1225-1228.

Huang, Sijun, Steven W Wilhelm, Nianzhi Jiao, and Feng Chen. 2010. "Ubiquitous cyanobacterial podoviruses in the global oceans unveiled through viral DNA polymerase gene sequences." ISME J. 4 (10): 1243-1251.

Ignacio-Espinoza, J Cesar , Sergei A Solonenko, and Matthew B Sullivan. 2013. "The global virome: Not as big as we thought?" Curr. Opin. Virol.: 566-571.

Jameson, Eleanor, Nicholas H Mann, Ian Joint, Christine Sambles, and Martin Mühling. 2011. "The diversity of cyanomyovirus populations along a North–South Atlantic Ocean transect." ISME J. 5 (11): 1713-1721.

33 Jover, Luis F, T Chad Effler, Alison Buchan, Steven W Wilhelm, and Joshua S Weitz. 2014. "The elemental composition of virus particles: implications for marine biogeochemical cycles." Nature Reviews Microbiology 12 (7): 519-528.

Kallmeyer, Jens, Robert Pockalny, Rishi Ram Adhikari, David C Smith, and Steven D’Hondt. 2012. "Global distribution of microbial abundance and biomass in subseafloor sediment." Proc. Natl. Acad. Sci. USA 109 (40): 16213-16216.

Kim, Kyoung-Ho, and Jin-Woo Bae. 2011. "Amplification methods bias metagenomic libraries of uncultured single-stranded and double-stranded DNA viruses." Appl. Environ. Microbiol. 77 (21): 7663-7668.

Kim, Min-Soo, Eun-Jin Park, Seong Woon Roh, and Jin-Woo Bae. 2011. "Diversity and abundance of single-stranded DNA viruses in human feces." Appl. Environ. Microbiol. 77 (22): 8062-8070.

Kimura, Motoo. 1962. "On the probability of fixation of mutant genes in a population." Genetics 47 (6): 713.

Knowles, B, CB Silveira, BA Bailey, Barott. K, A Cantu, AG Cobián-Güemes, FH Coutinho, E Dinsdale, B Felts, KA Furby, EE George, KT Green, GB Gregoracci, AF Haas, JM Haggerty, ER Hester, NG Hisakawa, LW Kelly, YW Lim, M Little, A Luque, T McDole-Somera, K McNair, LS de Oliveira, SD Quistad, NL Robinett, E Sala, P Salamon, SE Sanchez, S Sandin, GGZ Silva, J Smith, C Sullivan, C Thompson, MJA Vermeij, M Youle, C Young, B Zgliczynski, R Brainard, RA Edwards, J Nulton, F Thompson, and F Rohwer. 2016. "Lytic to temperate switching of viral communities." Nature: in press.

Lang, Andrew S, Olga Zhaxybayeva, and J Thomas Beatty. 2012. "Gene transfer agents: phage-like elements of genetic exchange." Nat. Rev. Microbiol. 10 (7): 472-482.

Li, Weizhong, and Adam Godzik. 2006. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences." Bioinformatics 22 (13): 1658-1659.

Lindell, D., M.B. Sullivan, Z.I. Johnson, A.C. Tolonen, F. Rohwer, and S.W. Chisholm. 2004. "Transfer of photosynthesis genes to and from Prochlorococcus viruses." Proc. Natl. Acad. Sci. USA 101 (30): 11013.

McDaniel, Lauren D, Michéle DelaRosa, and John H Paul. 2006. "Temperate and lytic cyanophages from the Gulf of Mexico." J. Mar. Biol. Assoc. U.K. 86 (03): 517- 527.

McNair, K., B.A. Bailey, and R.A. Edwards. 2012. "PHACTS, a computational approach to classifying the lifestyle of phages." Bioinformatics 28 (5): 614-618.

34 Meyer, Folker, Daniel Paarmann, Mark D'Souza, Robert Olson, Elizabeth M Glass, Michael Kubal, Tobias Paczian, A Rodriguez, Rick Stevens, and Andreas Wilke. 2008. "The metagenomics RAST server–a public resource for the automatic phylogenetic and functional analysis of metagenomes." BMC Bioinformatics 9 (1): 386.

Mokili, J.L., F. Rohwer, and B.E. Dutilh. 2012. "Metagenomics and future perspectives in virus discovery." Curr. Opin. Virol.

Ortmann, Alice C, and Curtis A Suttle. 2005. "High abundances of viruses in a deep-sea hydrothermal vent system indicates viral mediated microbial mortality." Deep Sea Research Part I: Oceanographic Research Papers 52 (8): 1515-1527.

Pagarete, A, C-ET Chow, T Johannessen, JA Fuhrman, TF Thingstad, and RA Sandaa. 2013. "Strong seasonality and interannual recurrence in marine myovirus communities." Appl. Environ. Microbiol. 79 (20): 6253-6259.

Paul, J.H. 2008. "Prophages in marine bacteria: dangerous molecular time bombs or the key to survival in the seas?" ISME J. 2 (6): 579-589.

Paul, John H. 1999. "Microbial gene transfer: an ecological perspective." J. Mol. Microbiol. Biotechnol. 1 (1): 45-50.

Proctor, L.M., and J.A. Fuhrman. 1990. "Viral mortality of marine bacteria and cyanobacteria." Nature 343: 60-62.

R Core Team. 2013. "R: A language and environment for statistical computing." http://www.R-project.org/.

RefSeq. "NCBI Viral Genomes." http://www.ncbi.nlm.nih.gov/genome/viruses/.

Rocke, Emma, Maria G Pachiadaki, Alec Cobban, Elizabeth B Kujawinski, and Virginia P Edgcomb. 2015. "Protist community grazing on prokaryotic prey in deep ocean water masses." PLoS One 10 (4): e012450.

Rohatgi, Ankit. "WebPlotDigitizer v. 3.9." http://arohatgi.info/WebPlotDigitizer.

Rohwer, F, and R Edwards. 2002. "The phage proteomic tree: a genome-based taxonomy for phage." J. Bacteriol. 184 (16): 4529-4535.

Rohwer, F. 2003. "Global phage diversity." Cell 113 (2): 141-141.

Roux, Simon, Francois Enault, Viviane Ravet, Jonathan Colombet, Yvan Bettarel, Jean‐ Christophe Auguet, Thierry Bouvier, Soizick Lucas‐Staat, Agnès Vellet, and David Prangishvili. 2016. "Analysis of metagenomic data reveals common features of halophilic viral communities across continents." Environ. Microbiol.

35 Roux, Simon, Michaël Faubladier, Antoine Mahul, Nils Paulhe, Aurélien Bernard, Didier Debroas, and François Enault. 2011. "Metavir: a web server dedicated to virome analysis." Bioinformatics 27 (21): 3074-3075.

Sandaa, Ruth‐Anne, Laura Gómez‐Consarnau, Jarone Pinhassi, Lasse Riemann, Andrea Malits, Markus G Weinbauer, Josep M Gasol, and T Frede Thingstad. 2009. "Viral control of bacterial biodiversity–evidence from a nutrient‐enriched marine mesocosm experiment." Environ. Microbiol. 11 (10): 2585-2597.

Sano, E., S. Carlson, L. Wegley, and F. Rohwer. 2004. "Movement of viruses between biomes." Appl. Environ. Microbiol. 70 (10): 5842.

Schmieder, Robert, and Robert Edwards. 2011. "Quality control and preprocessing of metagenomic datasets." Bioinformatics 27 (6): 863-864.

Sender, Ron, Shai Fuchs, and Ron Milo. 2016. "Revised estimates for the number of human and bacteria cells in the body." bioRxiv: 036103. https://doi.org/http://dx.doi.org/10.1101/036103.

Short, C.M., and C.A. Suttle. 2005. "Nearly identical bacteriophage structural gene sequences are widely distributed in both marine and freshwater environments." Appl. Environ. Microbiol. 71 (1): 480.

SILVA. "SSU Ref NR." http://www.arb-silva.de/projects/ssu-ref-nr/.

SMALT. "Sanger Institute." http://www.sanger.ac.uk/science/tools/smalt-0.

SRA. "NCBI Sequence Read Archive." http://www.ncbi.nlm.nih.gov/sra.

Staley, James T, and Allan Konopka. 1985. "Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats." Ann. Rev. Microbiol. 39 (1): 321-346.

Steward, G.F., J.L. Montiel, and F. Azam. 2000. "Genome size distributions indicate variability and similarities among marine viral assemblages from diverse environments." Limnol. Oceanogr. 45: 1697-1706.

Sullivan, M.B., D. Lindell, J.A. Lee, L.R. Thompson, J.P. Bielawski, and S.W. Chisholm. 2006. "Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts." PLoS Biol. 4 (8): e234.

Suttle, C A. 2005. "Viruses in the sea." Nature 437: 356-361.

Suttle, Curtis A.. 2007. "Marine viruses — Major players in the global ecosystem." Nat. Rev. Microbiol. 5: 801-812.

36 Thingstad, T Frede. 2000. "Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems." Limnol. Oceanogr. 45 (6): 1320-1328.

Thingstad, T Frede, Bernadette Pree, Jarl Giske, and Selina Våge. 2015. "What difference does it make if viruses are strain-, rather than species-specific?" Front. Microbiol. 6.

Thingstad, TF, and R Lignell. 1997. "Theoretical models for the control of bacterial growth rate, abundance, diversity and carbon demand." Aquat. Microb. Ecol. 13 (1): 19-27.

Torrella, Francisco, and Richard Y Morita. 1979. "Evidence by electron micrographs for a high incidence of bacteriophage particles in the waters of Yaquina Bay, oregon: ecological and taxonomical implications." Appl. Environ. Microbiol. 37 (4): 774- 778.

Tseng, Ching-Hung, Pei-Wen Chiang, Fuh-Kwo Shiah, Yi-Lung Chen, Jia-Rong Liou, Ting-Chang Hsu, Suhinthan Maheswararajah, Isaam Saeed, Saman Halgamuge, and Sen-Lin Tang. 2013. "Microbial and viral metagenomes of a subtropical freshwater reservoir subject to climatic disturbances." ISME J. 7 (12): 2374-2386.

Tuma, Rabiya S, Matthew P Beaudet, Xiaokui Jin, Laurie J Jones, Ching-Ying Cheung, Stephen Yue, and Victoria L Singer. 1999. "Characterization of SYBR Gold nucleic acid gel stain: a dye optimized for use with 300-nm ultraviolet transilluminators." Anal. Biochem. 268 (2): 278-288.

UNIVEC. "NCBI Univec Database." http://www.ncbi.nlm.nih.gov/tools/vecscreen//univec/.

Watkins, Siobhan C, Neil Kuehnle, C Anthony Ruggeri, Kema Malki, Katherine Bruder, Jinan Elayyan, Kristina Damisch, Naushin Vahora, Paul O'Malley, and Brieanne Ruggles-Sage. 2015. "Assessment of a metaviromic dataset generated from nearshore Lake Michigan." Mar. Freshw. Res.

Weinbauer, M.G. 2004. "Ecology of prokaryotic viruses." FEMS Microbiol. Rev. 28 (2): 127-181.

Weitz, Joshua S, Charles A Stock, Steven W Wilhelm, Lydia Bourouiba, Maureen L Coleman, Alison Buchan, Michael J Follows, Jed A Fuhrman, Luis F Jover, and Jay T Lennon. 2015. "A multitrophic model to quantify the effects of marine viruses on microbial food webs and ecosystem processes." ISME J.

Whitman, William B., David C. Coleman, and William J. Wiebe. 1998. "Prokaryotes: The unseen majority." Proc. Natl. Acad. Sci. USA 95: 6578-6583.

37 Wilke, Andreas, Jared Bischof, Travis Harrison, Tom Brettin, Mark D'Souza, Wolfgang Gerlach, Hunter Matthews, Tobias Paczian, Jared Wilkening, and Elizabeth M Glass. 2015. "A RESTful API for accessing microbial community data for MG- RAST." PLoS Comput Biol 11 (1): e1004008.

Williamson, K.E., M. Radosevich, and K.E. Wommack. 2005. "Abundance and diversity of viruses in six Delaware soils." Appl. Environ. Microbiol. 71 (6): 3119.

Wommack, K.E., and R.R. Colwell. 2000. "Virioplankton: viruses in aquatic ecosystems." Microbiol. Mol. Biol. Rev. 64 (1): 69-114.

Youle, Merry, Matthew Haynes, and Forest Rohwer. 2012. "Scratching the surface of biology’s dark matter." In Viruses: Essential agents of life, 61-81. Springer.

Zhong, Y., F. Chen, S.W. Wilhelm, L. Poorvin, and R.E. Hodson. 2002. "Phylogenetic diversity of marine cyanophage isolates and natural virus communities as revealed by sequences of viral capsid assembly protein gene g20." Appl. Environ. Microbiol. 68 (4): 1576.

Zhou, You, Yongjie Liang, Karlene H Lynch, Jonathan J Dennis, and David S Wishart. 2011. "PHAST: a fast phage search tool." Nucleic Acids Research.

38 Appendix for Chapter 1

Supplemental methods

Global VLP calculation

A total of 6,154 VMR measurements were collected from 53 research papers

(Supplemental Table 2 in Cobian et al. 2016). When specific VMR values were not provided in the manuscript, the VMRs were extracted from the published plots using Web Plot

Digitizer (Rohatgi). Each VMR value was assigned to one of 7 major biomes (marine, freshwater, other aquatic, sediments, soil, human-associated, and other host-associated) and both the mean and median VMRs were calculated for each biome. Boxplots for each biome

VMR were generated using R (R Core Team 2013). Sources for the number of prokaryotic cells are noted on Figure 1. For each biome, the number of VLPs was estimated by multiplying the number of prokaryotic cells by the calculated median VMR. The global VMR was calculated by summing the VLPs for all 7 biomes and then dividing by the sum of the number of prokaryotic cells.

Virome collection

Virome FASTA sequences were obtained from MGRAST (Meyer et al. 2008),

MetaVir (Roux et al. 2011), iVIRUS (Bolduc et al.), SRA (SRA), and ENA (ENA). Since there is overlap between these databases, virome duplicates were removed by manual curation. Only viromes accompanied by a peer-reviewed paper were included. For the large

Tara Ocean Viromes dataset, one sequencing run per site was selected, of which a 1% subsample without replacement was used for analysis. FASTA sequences from MGRAST were downloaded after quality filtering using MGRAST API (Wilke et al. 2015); those from

39 SRA and ENA were quality filtered with PRINSEQ (Schmieder and Edwards 2011) (99% quality); raw FASTA files were used from MetaVir and iVIRUS. Six viromes (metavir_2726, metavir_2727, MGRAST: 4519681, 4519682, 4519683, 4519684) had been spiked with PhiX control DNA; those introduced sequences were eliminated from the FASTA files using in- house scripts (SMALT mapping at 95% identity to the φX174 genome; Genbank accession

J02482.1). When viromes had been sequenced as paired-end, only one read was used for subsequent analysis. Of the 1,622 viromes collected, 1,615 were assigned to one of the 7 major biomes (marine, freshwater, other aquatic, sediments, soil, human-associated, and other host-associated). The remaining 7 viromes were classified as 'other' (fermented food and air viromes). The available metadata and major biome classification for each virome are provided in Supplemental Table 3.

Reference library creation

Each virome was assembled de novo using SPAdes (Bankevich et al. 2012) and all contigs ≥1000 nt from all viromes were merged into a single file for further analysis. A prokaryotic virus reference database was created from 2,699 bacteriophage genomes and 67 archaeal virus genomes from the NCBI RefSeq database (January 2016) (RefSeq), plus an additional 123 bacteriophage genomes from the Broad Institute Marine Phage Sequencing

Project (Broad Institute). The virome contig file and the prokaryotic virus reference database were merged into a single file and clustered using CD-HIT-EST (Fu et al. 2012) at 98% identity. BLASTn (e-value cutoff of 0.001) was used to identify cloning vectors (UNIVEC) and rRNA (SILVA) sequences (using SSU Ref NR and LSURef_123_10_07_15). Removal of these sequences left a reference library containing 2,258,219 phage (sensu lato) contigs.

40 Size of the global virome

All assembled contigs <1000 nt were merged into a single file and clustered at 98% identity using CD-HIT-EST to generate the small_contigs_98 database. The lengths of every sequence in this database, and in the reference library were obtained and summed to yield the size of the global virome in nucleotides (1.29 × 1010). This total was divided by 50 kbp, the assumed average phage genome size, to yield the number of viral genotypes on Earth.

Virome mapping to the reference library

All viromes were mapped individually to the reference library using SMALT

(SMALT) in 3 separate runs using 99%, 95%, and 90% alignment identity cutoff values. The mapping at 99% alignment identity was used when calculating both observed and predicted viral genotypes.

In addition, a separate global analysis was performed by first creating a pool of all virome FASTA sequences for each of the 5 biomes with the most VLPs. A pooled global virome (1,000,000 reads) was then generated by subsampling with replacement from each biome pool, with the percentage contributed by each biome being the percentage of global

VLPs present in that biome (i.e., 26,857 marine reads, 37 freshwater, 1,525 other aquatic,

870,833 sediments, and 101,667 soil). This pooled global virome was mapped to the reference library as above.

Fractional abundances of viral genotypes

41 Mapping as described above yielded the number of hits to each genotype in the reference library for each virome. The fractional abundance (f) of each viral genotype in each virome was then calculated as:

Equation 1.1 Fractional abundance of viral genotypes

푟(푖) 퐿(푚푒푎푛) 푓(푖) ≅ ∗ 푇(푗) 퐿(푖) where

f (i): the fractional abundance of contig i in this virome

r (i): the number of reads that map to contig i

T (j): the total number of reads in virome j

L (mean): the mean genome length (bp), assumed to be 50,000 bp (Steward, Montiel, and

Azam 2000)

L (i): the length (bp) of contig i

Then the fractional abundance of contig i in a biome was calculated as:

Equation 1.2 Fractional abundance of contigs

푓(푖) 푓(푖 ) ≅ 푏푖표푚푒 푁 where

f (ibiome): the fractional abundance of contig i in the biome

N: number of viromes in biome

42 Parallel calculations were made to calculate the fractional abundance of contig i in the global virome sample. Rank abundance plots for each biome and the global virome sample were generated using R.

Annotation

The 10 viral genotypes with the greatest fractional abundance in each biome and in the pooled global virome were annotated through online NCBI BLASTn against the nr/nt database.

Observed viral genotypes

For each biome and the pooled global virome, the sum of the lengths of every contig with a fractional abundance at 99% identity > 0 was divided by 50,000 (the assumed average phage genome length) to estimate the number of observed viral genotypes.

Predicted viral genotypes

Curve fitting was performed for the 99% identity rank abundance plots for each biome using the Matlab Curve Fitting ToolboxTM R2015b (Mathworks, Inc.). Several curve fit models were evaluated when applied to the first 100, 1,000, and 10,000 fractional abundance ranks for each sample (Supplemental Table 5 in Cobian et al. 2016). The power law model gave the best fit and was therefore used to calculate the predicted number of viral genotypes for each biome. Estimated a and b for each curve were used to calculate the number of VLPs on each rank of the curve as follow:

Equation 1.3 Estimated a and b for rank abundance curves

푏 푉푖 = 푎푖 × 푉푡표푡푎푙

43 where 푉푖 is the number of VLPs on rank i, and 푉푡표푡푎푙 is the total number of VLPs in the biome (from Table 1.1). The values for 푉푖 were summed until the following condition was met:

Equation 1.4 Number of viral genotypes prediction

푥 푉 = ∑ 푎푖푏 × 푉 . 푡표푡푎푙 푖=1 푡표푡푎푙

At that point, the sum was considered to be the predicted number of viral genotypes. For the soil virome, the curve was asymptotic, and x was not calculated.

44 Chapter 2: Fragment Recruitment Assembly Purification (FRAP): Put bioinformatics

back into the hands of biologists.

Abstract

A strict or exact match strategy for the analysis of metagenomes is presented as

Fragment Recruitment Assembly Purification. This strategy reduces the problem of metagenomic assignments to text matching using strict or exact hits from a dataset to a database. Such databases are constructed by the user and tailored to adrress specific biological questions. With the increasing number of available genomes, and the high number of reads generated by current sequencing technologies, unreliable assignments can be eliminated from the analysis. Reliable assignments can be identified by using only strict and exact matches, such strategy limits the occurrence of false positives. Here the implementation of Fragment

Recruitment Assembly Purification is presented as well as a set of auxiliary tools that allow the user to quickly build heatmaps, coverage plots, and fragment recruitment plots; and then use these data to make biological inferences. The performance of the presented algorithm was tested in aquatic, and host associated metagenomes. Fragment Recruitment Assembly

Purification was used for the exploration of The Global Virome (Chapter 1) and in the analysis of Cystic Fibrosis metagenomes and viromes (Chapter 4).

45 Introduction

Current bioinformatics

Current bioinformatic methods have become a convoluted land for biologists. In metagenomic studies, once next generation sequencing (NGS) data (Singer et al. 2016) is generated, there are a myriad of options (Fonseca et al. 2012; Naccache et al. 2014;

Huttenhower 2019; Huson et al. 2016) and opinions (Greninger 2018) about the best way to analyze (Sczyrba et al. 2017; 2019, n.d.; McIntyre et al. 2017) such data.

Bioinformatic metodologies to compare a dataset (i.e. a metagenome) to a database

(i.e. viral reference genomes) rely heavily on statistics in calculating the probability that a

DNA fragment originated from a given organism (reference genome). Variability in the results of metagenomic analyses using different methods and pipelines is embedded in several steps such as: 1) the scoring systems used for each application (i.e. BLAST family algorithms developed by Altschul et al. 1990), 2) the construction of k-mer profiles with different databases, k-mer sizes and inference methods (i.e. KAIJU(Menzel, Ng, and Krogh 2016) and

FOCUS2 (Silva, Dutilh, and Edwards 2016)), and 3) the use of different sets of marker genes for taxonomical inferences (i. e. MetaPhlan2, MetaVir (Roux et al. 2011)). Such variability in the search methods can result in distinct biological interpretations when comparing the analysis of the same dataset through different software or platforms.

To overcome these issues, Fragment Recruitment Assembly Purification (FRAP) was developped, as a simple and reproducible method to compare a dataset to a database through strict or exact hits. FRAP assign strict (96% identity over 100% of the read) or exact hits

46 (100% identity over 100% of the read) in the dataset to a custom database specific to the biological problem at hand.

Why use FRAP?

FRAP aims to be a simple approach to bioinformatics with a fast learning curve. Exact hits are reliable and reproducible, which is defined as 100% of the read matching to the database with 100% identity. As the probability of having a specific 100 nucleotides sequence by chance is 1/4100, it is hard to argue that an exact hit is not a true positive. The exact hits approach is possible because the amount of sequences in the databases and in the datasets is large enough so that non perfect matches can be ignored or used for separate analysis.

Databases and datasets size will continue to increase(Stephens et al. 2015), so much so that eventually we will have the Earth’s metagenome. As the databases become bigger, the more

FRAP will become relevant.

The FRAP approach has an adaptable design in which datasets and databases are any collection of sequences in fasta format. This makes the strategy directed towards answering biological questions and researchers can easily construct databases relevant to their questions.

FRAP reduces metagenomic assignments to text strings comparisons which have been implemented in several operating systems (such as Unix) as low level functions. This allows

FRAP to be a “never break” pipeline which should be stable across platforms and versions.

String matching algorithms have at worst lineal time complexity(Pandiselvam, Marimuthu, and Lawrance 2014). Regardless of the method used to obtain exact hits, FRAP results should always be the same.

47 How to use FRAP?

Databases and datasets are nucleotide fasta files. Datasets are usually short DNA fragments (reads) from metagenomes, metatrascriptomes or viromes. Databases are usually a set of genomes of interest such as bacteria representative genomes, archaea reference sequences, and viral reference sequences; or a set of genes of interest such as bacteria metabolic genes (i.e. the SEED (Overbeek et al. 2014) database), virulence factors (Sayers et al. 2019; Chen et al. 2016), or antibiotic resistance genes (Jia et al. 2017). These are popular examples, but a dataset and a database can be any collection of sequences in a fasta file format.

Several exact hit implementations were explored in this study, such as the use of mapping algorithms with adjusted parameters to obtain exact hits such as smalt (“SMALT

Manual” 2010) and HISAT (Kim, Langmead, and Salzberg 2015); the implementation of a search via a hash table, called hash exact hit (heh), and the use of the Unix command grep.

FRAP has two main applications: basic FRAP and blind FRAP. Basic FRAP compares a set of reads to a set of known reference genomes or genes. Examples include comparing aquarium metagenomes to bacteria and archaea representative genomes and comparing lung metagenomes to virulence factors genes and antibiotic resistance genes. In basic FRAP the reported fractional abundances are interpreted as taxonomical or functional assignments. Basic FRAP can also be used to filter out or retrieve reads present in a database of interest, for example to filter out the human genome reads present in a lung metagenome and keep the remaining reads for further analysis, or to retrieve the viral reads in a lung metagenome and assemble all of them together to discover viral like contigs.

48 Furthermore, basic FRAP aids in the detection of contamination in samples or in databases. For example, a marine bacteria was detected in a lung metagenome. Upon closer inspection, all the detected reads map to a single short region of the marine bacteria genome, in this case the sample was contaminated with a marine bacteria PCR product. Contamination in the databases can be detected, for example a human biopsy sample had exact hits to a bacteria and all hits were in a very short region, further exploration led to the identification of a human genome region in the assembly of the bacteria in the reference database.

Blind FRAP compares a dataset to itself via contigs or reads. In blind FRAP the database has no identified function or taxa and the databases are usually contigs assembled from the dataset or contigs assembled from another dataset. One example of blind FRAP is presented in Chapter 1, where contigs assembled from viromes were used as a database and reads from the same viromes were used as datasets. The resulting fractional abundances for each contig represent the abundance of viral-like elements that have no functional or taxonomical annotation yet. A second example of blind FRAP is obtaining viromes and metagenomes from the same sample and using the contigs from a virome as database and the reads from metagenomes as datasets, in this case the fractional abundances of virome derived contigs is interpreted as the presence of viral-like elements in metagenomes.

Challenges in the development of FRAP

FRAP needs further development to become portable and fast. ideally FRAP should be easy to use, not requiring the use of the terminal and where dataset and database files can be dragged and dropped into the analysis window.

49 Computational challenges in the development of FRAP are: 1) The development of efficient search algorithms (So 2017) for text search which is currently limited by database size, and 2) Determine the amount of computational power needed for large scale analysis, for example, how much computing power do we need to use the Earth’s metagenome?

Biological challenges in the development of FRAP rely on the compilation of reliable databases that are vetted for reproducible results.

Results and Discussion

FRAP was used in several projects. FRAP generalized pipeline (Figure 2.1) inputs are a single fasta file containing a database and one or multiple fasta files containing the reads in a dataset or datasets. The implemented methods to obtain exact hits are smalt at 100% identity, HISAT at 100% identity, hash exact hit, and grep. FRAP outputs are a tab delimited file containing the fractional abundances of each element in the database in each dataset, a tab delimited file containing the number of hits to each element in the database in each dataset and fragment recruitment plots for the 10 database elements with the highest fractional abundances. In addition, frap-tools can be used to obtain a fasta file containing the reads that have hits to the database (si_hits.pl) or a fasta file containing the reads that have no hits to the database (no_hits.pl). FRAP can be accessed through the GitHub repository https://github.com/yinacobian/frap

50 Figure 2.1 Fragment Recruitment Assembly Purification general case.

Basic FRAP

The simplest implementation of FRAP is the “just map FRAP” approach (Figure 2.2-

A). FRAP normalization to obtain fractional abundances from exact hits (Figure 2.2-B) is based on the principle that datasets have different sizes and the length of each element in the database is different. The fractional abundance of each element in the database (f(i)) is calculated by dividing the number of hits to an element in the database (r(i)) by the total number of sequences in the dataset (T(j)), next this term is multiplied by the mean length of

51 the elements in the dataset (L(mean)) divided by the length of the database element (L(i)).

Fractional abundances are then scaled by a factor of 1,000,000.

Figure 2.2 FRAP basic concept and normalization. A) FRAP implementation as “Just map FRAP” which is implemented in the script jmf.pl B) FRAP normalization to fractional abundances

L(mean) values represent the average genome length. When the database is composed of complete genomes, the L(mean) value can be pre-determined and used for the calculations.

Microbial average genome length was calculated from publicly available complete genomes included in RefSeq-release80 (Table 2.1). Based on this set of genomes, virus average genome length is 29,936 bp, bacteria average genome length is 4,028,000 bp and archaea average genome length is 2,730,000 bp. For fungi and protozoa, the average genome length

52 was calculated using locus length, which can be the complete genome, or a chromosome

(Table 2.2). Subsequent calculations are needed to get accurate average genome lengths for this organisms. L(mean) values represent the average gene length when the database is composed of genes. The average gene length for virus, bacteria, archaea, fungi and protozoa was calculated (Table 2.3). The average gene length for virus is 741 bp, for bacteria 899 bp, for archaea 855 bp, for fungi 1,660 bp and for protozoa 1,798 bp.

Table 2.1 Average genome length for FRAP normalization. RefSeq-release80 published on January 9, 2017. Genomes/ASSEMBLY_REPORTS/ published on March 6, 2017. Virus: obtained from all viral genomes available at refseq. Bacteria: obtained from assembly reports, the average genome length per taxid group was used as input, 67,706 genomes are distributed in 964 taxid groups. Archaea: obtained from assembly reports, the average genome length per taxid group was used as input, 192 genomes are distributed in 11 taxid groups. Information not available for fungi and protozoa, everything is together in an eukaryotes genome size file.

GENOME LENGTH

N minimum 1st quartile median mean 3rd quartile maximum

Virus 8,321 200 2,737 7,233 29,936 38,124 2,473,870 Bacteria 67,706 162,600 2,425,000 3,912,000 4,028,000 5,193,000 11,930,000 Archaea 192 174,700 2,150,000 2,486,000 2,730,000 3,046,000 4,568,000

53 Table 2.2 Average locus length for FRAP normalization. Locus length for L-mean in FRAP normalization. RefSeq-release80 published on January 9, 2017. A LOCUS is a refseq entry, in the case of viruses one locus is one genome, for the rest of the groups a locus can be either a compelte genome, a plasmid, a contig or any refseq entry.

LOCUS LENGTH 1st 3rd N minimum quartile median mean quartile maximum Virus 8,321 200 2,737 7,233 29,936 38,124 2,473,870 Bacteria 8,067,030 16 886 4,656 36,891 25,351 14,782,125 Archaea 20,952 4 2,851 11,354 30,810 32,179 4,064,496 Fungi 74,670 86 664 2,226 89,939 14,047 11,880,248 Protozoa 247,163 22 922 1,158 14,559 2,514 13,391,543

Table 2.3 Average gene length for FRAP normalization. 1min represent the start site of regulatory regions in the database, therefore it does not represent an accurate minimum gene length.

GENE LENGTH 1st 3rd N min1 quartile median mean quartile maximum Virus 308,153 1 260 443 747 860 262,387 Bacteria 291,916,012 1 437 773 899 1,179 110,417 Archaea 1,638,741 1 419 731 855 1,124 32,447 Fungi 2,195,385 3 857 1,394 1,660 2,100 186,159 Protozoa 976,901 3 649 1,221 1,798 2,168 131,440

54 FRAP-tools are a set of auxiliary scripts to visualize and generate fasta files from jmf.pl outputs. Heatmaps are generated from fractional abundances (Figure 2.3). Fragment recruitment plots (Figure 2.4) are generated for the 10 elements in the database with the highest fractional abundances. In the case of complete genomes as a database, the confidence of the identification of a bacteria genome in the dataset is assessed by the distribution of fragments across most of the genome.

A basic FRAP approach was used in a set of metagenomes from an aquarium water column microbial community establishment process, in which 47 samples were obtained over a 6 months period. The databases used were bacteria representative genomes (Figure 2.3-A) and archaea representative genomes (Figure 2.3-B). The dominant bacteria during the microbial community establishment was Phaeobacter galleciensis, whose hits are distributed across the whole genome (Figure 2.4) which provides evidence of the presence of the organism in the system. Results from this study will be further discussed and published by

Calhoun S. et. al.

55 Figure 2.3 Heatmap of the 100 most abundant bacteria in the aquarium metagenomes. FRAP at 96% identity. A) Heatmap of the most abundant bacteria found in the aquarium metagenomes. B) Heatmap of the most abundant archaea in the aquarium metagenomes. Propionibacterium acnes was manually removed from the analysis since it is not a marine bacteria and is a possible contamination.

56 Figure 2.4 Fragment recruitment plot of Phaeobacter gallaciensis, the most abundant bacteria in the aquarium metagenomes.

A basic FRAP approach was used in the analysis of CF associated microbial metagenomes obtained from sputum samples of acute exacerbations. These results are discussed in detail in Chapter 4 of this dissertation. Human DNA in the sputum metagenomes was removed using FRAP and FRAP-tools, in which the remaining sequences were compared to a set of bacteria reference genomes. In a sample from patient CF418, the microbial community was dominated by Achromobacter xylosoxidans, which is the closest reference genome and evidence for its presence in the clinical sample is supported by hits across the genome. In Chapter 4 of this dissertation a detailed analysis of the genome deletions supports the hypothesis that the genome deletions are phage regions which may be excising from the genome. In clinical metagenomic studies is essential to use exact hits with curated reference databases so clinicians receive complete information that they can trust for trearment consideration.

57 Visual inspection of fragment recruitment plots allows the identification of contaminants in the dataset or in the databases. In a sputum metagenome from patient CF01, sample contamination was detected when for three bacteria genomes hits in a 1,500 nt region were identified (Figure 2.5). Such bacteria were Delftia acidovorans, Bordetella petrii, and

Marinomonas aquamarina and the identified regions are the result of amplicons amplification of such regions. Contamination in an assembled reference bacteria genome present in the bacteria representative genomes database was identified when a metatranscriptome from a human lung biopsy had hits to a small region of Microbacterium barkeri (Figure 2.6), in this case the region in the bacteria reference genome had a human origin, which implies it was contaminated during the sequencing and assembly process. It is important to identify and flag this reference genomes to avoid further misassignments.

The reproducibility of FRAP was assessed using replicates of subsamples with replacement from a human plasma metagenome spiked with 104 copies of human immunodeficiency virus (Naccache et al. 2014). Five subsamples with replacement were compared to viral refseq (Supplemental Table 4.1), overall the fractional abundances of viruses in each replicates have low variability. The mean standard deviation from all datasets was 0.0109, which means the replicates show reproducibility.

58

Figure 2.5 Contamination detection in CF samples. The most abundant bacteria found in CF01 were Delftia acidovorans, Bordetella petrii and Marinomonas aquimarina. In these 3 bacteria, all the hits mapped in one isolated position, supporting the idea of a sample contamination.

59 Figure 2.6 Genome contamination detection. Recruitment plot of the most abundant bacteria (Microbacterium barkeri) in a cancer biopsy sample. All the 2,295,340 reads map in the same position.

Blind FRAP

The second family of FRAP uses is the blind FRAP strategy. Its basic principle is the use of a database that does not have taxonomical or functional assignments to obtain fractional abundances for each contig in the datasets. The database are contigs assembled either from the dataset or an extended dataset. Functional assignments are obtained for the contigs with highest fractional abundances though distant evolutionary relationships assignments such as tBLASTx.

A blind FRAP strategy was used in the global virome calculations presented in

Chapter 1 of this dissertation. In that study a global virome database was constructed using available public viromes from several biomes. Datasets were viromes in each biome, the result was the fractional abundances of contigs per biome. The contigs with highest fractional abundance on each biome were annotated using tBLASTx, those contigs represent new diversity which has not yet been described.

60 A second blind FRAP strategy is proposed for the study of pairs of metagenomes and viromes from the same sample (Figure 2.7). The rationale behind this strategy is to estimate the fractional abundances of viral-derived elements in metagenomes, which otherwise would be missed since they are unrepresented viruses that are not yet in reference databases.

Viromes to metagenomes FRAP was used in a test dataset of coral and algae interactions

(Figure 2.8). The samples are punches from a coral-algae interaction transect in which samples A and B are coral punches, sample C is the interaction zone between coral and algae and samples D and E are algae tissue. Metagenomes were obtained from all samples and viromes were obtained from samples A, B and C. Hits to viral like elements derived from coral punches were identified in all the coral derived samples and in the interaction zone, in algae samples, two contigs had hits to the coral-derived viral elements. This data will be explored in further detail and published by Little M. et. al.

A blind FRAP strategy in which reads and contigs originated from the same sample are compared to each other is proposed. In this case fractional abundance of contigs can be obtained from the samples themselves. This strategy was applied to two cheese rinds viromes.

Viromes from Winnimere cheese and Bailey cheese rinds were obtained and contigs were assembled and used as a database, the reads were used as dataset. The percentage of exact hit reads to the contigs database was 77.5 % for Winnimere and 63.8% for Bailey (Table 2.4).

From these estimates we can calculate false negatives as the number of fragments that don’t recruit to the assembly, which is 23% for Winnimere and 36% for Bailey. An assessment of how accurate the exact hit search algorithm is to compare reads to reads. In the case of cheese

61 rind viromes, the search strategies are good, with 99.8% reads recruited back to reads for

Winnimere and 99.98% for Bailey (Table 2.5).

Figure 2.7 Generalized FRAP pipeline to map metagenomes to viral assembled contigs.

62 Figure 2.8 FRAP metagenomes to viromes in coral-algae interactions. Samples 2A and 2B are coral punches, sample 2C is a punch of the coral-algae interaction zone, samples 2D and 2E are algae punches. Viromes were obtained for samples 2A, 2B and 2C. Metagenomes were obtained for all samples. Contigs fractional abundances were visualized in Anvi’o (Eren et al. 2015).

63 Table 2.4 Exact hit, using smalt at 100% identity. Dataset: polished reads, q-phred>30 (99.9% base call accuracy), no low complexity (Shannon entropy 0.5) single end. Database: all assembled contigs (spades --only-assembler) from single end.

Exact hit reads to contigs Winnimere Bailey cheese cheese virome virome number of reads 3,256,476 1,714,998 number of nucleotides 497,990,024 262,232,327 number of contigs 30,484 23,315 number of nucleotides in contigs 26,541,438 13,239,529 number of reads that map to contigs 2,523,962 1,095,563 number of contigs with no reads mapped back 533 1,253 percentage of reads that map back to contigs 77.51 63.88 percentage of contigs with no reads 1.75 5.37

Table 2.5 Exact hit, using smalt at 100% identity. Dataset: polished reads, q-phred>30 (99.9% base call accuracy), no low complexity (Shannon entropy 0.5) single end. Database: polished reads, q-phred>30 (99.9% base call accuracy), no low complexity (Shannon entropy 0.5) single end.

Exact hit reads to reads Winnimere Bailey cheese cheese virome virome # reads 3,256,476 1,714,998 # reads that map to reads 3,249,940 1,714,714 % reads map back to reads 99.80 99.98

64 The future of FRAP

A third family of FRAP uses is proposed, in which two or more sample groups are compared using blind FRAP, then the fractional abundances are used as input for machine learning algorithms such as random forests and a group of contigs that better differentiate among samples is identified. This strategy needs further exploration.

FRAP aims to enable easy, robust and meaningful bioinformatics to accelerate biological discoveries. To get to this point, FRAP needs to be adapted as an easy to use platform and optimization of exact hits search are needed to be able to use large databases and datasets and obtain results in a short time.

Biological insights from FRAP

Inferences about biological mechanisms are enabled through FRAP analysis, ecological and evolutional processes can be elucidated using FRAP. For example, the co- occurrence of changes across contigs shows an ecological relation among such contigs.

Contigs that increase by the same amount are part of an ecological unit that co-varies. Also, the co-occurrence of changes at specific sites shows an evolutionary relation between such sites, viral quasispecies(Lauring and Andino 2010) can be identified in this way.

Methods

File types

The input for FRAP are nucleotide fasta files. Datasets are usually fasta files containing multiple reads (DNA fragments between 35 and 1000 nucleotides originated from an NGS instrument), several datasets can be used at the same time as long as each sample is in

65 a single fasta file and have a unique identifier. Databases are usually fasta files containing several reference genomes or genes. Databases or Datasets can also be contigs.

FRAP outputs are tab delimited files, each column is a dataset and each row is an element of the database. Both files are provided as outputs in the basic version of FRAP, just map frap (jmp.pl). Such files are the hits file which contains the number of exact hits to each element of the database (hits.tab) and the normalized file (normalized.tab) which contains the

FRAP values to each element of the database.

Two graphical outputs are generated using frap-tools. The first one is a generic heatmap of the FRAP values. The second are fragment recruitment plots in which the x axis is the element in the database and the y axis is the identity of each hit.

A fasta output with the reads that have exact hits to the database can be generated using the script si_hit.pl, a fasta file containing the reads that do not have exact hits to the database can be generated using the script no_hit.pl.

FRAP implementations

FRAP implementations using existing mapper algorithms are FRAP-smalt and FRAP-

HISAT, both can be used through the perl script jmf.pl. The arguments of jmf.pl are the complete path to the database fasta file, the path to the folder containing the datasets, the path to the results folder that would be created, the mapper to use, and the average L(mean) to use.

FRAP-smalt is a FRAP implementation that uses the mapper smalt (“SMALT

Manual” 2010) to find exact hits between dataset and database. To use smalt, the database is indexed using a k-mer of 10 (-k 10) with a step size of 5 (-s 5), next the dataset is mapped to

66 the database using an identity filter of 100% (y=1), the selected output format is samsoft (-o samsoft). From the samsoft file, the exact hits are counted, and tab delimited files generated.

FRAP-HISAT is a FRAP implementation that uses the mapper HISAT (Kim,

Langmead, and Salzberg 2015) to find exact hits between dataset and database. HISAT uses scores to evaluate if a read map to the database. To obtain a score equivalent to 100% identity over 100% of the read, the scores were back calculated to identity the corresponding identity to score (Figure 2,9). In a read of length 100, a score of 0 represents 75% identity, a score of

120 represents 90% identity, a score of 160 represents 95% identity, a score of 200 represents

100% identity. A score of 200 is used in FRAP.

FRAP was implemented in Perl using a hash strategy, this implementation is called hash exact hit (heh). FRAP-heh is available in the repository https://github.com/yinacobian/frap-heh, to construct the index a k-mer size is selected, viral refseq index was contructed using 50, 100 and 200 k-mers, index construction is made using the program heh-db2.pl. The database in then indexed in a hash that is used to compare the dataset to, which is implemented in the program heh-compa.pl. Further development is needed to make FRAP-heh efficient.

FRAP was implemented using the command grep. FRAP-grep is available in the repository https://github.com/yinacobian/FRAP-grep, the database is converted to a single line, then each read is compared to the database. This approach in portable since grep is a broadly used command available in many operating systems. In an Ubuntu operating system with 156 Gb of RAM, each read took 0.5 seconds to be searched against viral refseq (7,194

67 genomes and 218Mb), which is very slow. FRAP-grep needs further development to be more efficient.

Figure 2.9 Relation between HISAT2 score and identity for reads of length 100nt. HISAT2 (Kim, Langmead, and Salzberg 2015) scores are calculated as following: score = ((number of mismatches * (match weight=2)) + (length – number of mismatches))*(mismatch weight=-6). In a read of length 100, a score of 0 represents 75% identity, a score of 120 represents 90% identity, a score of 160 represents 95% identity, a score of 200 represents 100% identity.

68 FRAP-tools

Graphic outputs from FRAP are obtained using frap-tools, which uses R and python.

The script Plot_Heatmap.R creates a heatmap from FRAP values, this is a generic heatmap that can be further customized by the user. The script fragplot2.py creates individual fragment recruitment plots. The program si_hit.pl is used to generate a fasta file containing the reads that map the dataset, the program no_hit.pl is used to generate a fasta file containing the reads that do not map to the database.

Reference databases

Three main databases were used in the presented FRAP examples: bacteria representative genomes, archaea refseq, and viral refseq. Bacteria representative genomes contains 5,460 complete genomes which include at least one genome from each branch of the bacteria phylogenetic tree as calculated by NCBI(O’Leary et al. 2016). Archaea refseq contains 192 complete genomes. Viral refseq contains 8,321 complete genomes. All databases are part of RefSeq-release80 and were accessed on 01/09/2017.

Denovo assemblies

Denovo assemblies were performed using SPAdes (Bankevich et al. 2012) with the parameter –only-assembler. For some of the test cases, only contigs >900 nucleotides were used as database.

Acknowledgments

We are grateful to Sandi Calhoun and Mark Little for data sharing during FRAP tests and implementation. We are grateful to Dr. Rachel Dutton for providing cheese samples for

69 viromes generation and for viromes sequencing. Thanks to the bioinformatics breakfast group and the biomath group for fruitful discussion during the development of this project.

References

2019, CAMI. n.d. “CAMI II Challenge.” https://data.cami-challenge.org/cami2.

Altschul, S F, W Gish, W Miller, E W Myers, and D J Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10. https://doi.org/10.1016/S0022-2836(05)80360-2.

Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, et al. 2012. “SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing.” Journal of Computational Biology 19 (5): 455–77. https://doi.org/10.1089/cmb.2012.0021.

Chen, Lihong, Dandan Zheng, Bo Liu, Jian Yang, and Qi Jin. 2016. “VFDB 2016: Hierarchical and Refined Dataset for Big Data Analysis - 10 Years On.” Nucleic Acids Research 44 (D1): D694–97. https://doi.org/10.1093/nar/gkv1239.

Eren, A. Murat, Özcan C. Esen, Christopher Quince, Joseph H. Vineis, Hilary G. Morrison, Mitchell L. Sogin, and Tom O. Delmont. 2015. “Anvi’o: An Advanced Analysis and Visualization Platform for ‘omics Data.” PeerJ 3: e1319. https://doi.org/10.7717/peerj.1319.

Fonseca, Nuno A., Johan Rung, Alvis Brazma, and John C. Marioni. 2012. “Tools for Mapping High-Throughput Sequencing Data.” Bioinformatics 28 (24): 3169–77. https://doi.org/10.1093/bioinformatics/bts605.

Greninger, Alexander L. 2018. “The Challenge of Diagnostic Metagenomics.” Expert Review of Molecular Diagnostics 18 (7): 605–15. https://doi.org/10.1080/14737159.2018.1487292.

Huson, Daniel H., Sina Beier, Isabell Flade, Anna G??rska, Mohamed El-Hadidi, Suparna Mitra, Hans Joachim Ruscheweyh, and Rewati Tappu. 2016. “MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data.” PLoS Computational Biology 12 (6): 1–12. https://doi.org/10.1371/journal.pcbi.1004957.

Huttenhower, Curtis. 2019. “HUMANN2 Pipeline.” 2019. http://huttenhower.sph.harvard.edu/humann.

70 Jia, Baofeng, Amogelang R. Raphenya, Brian Alcock, Nicholas Waglechner, Peiyao Guo, Kara K. Tsang, Briony A. Lago, et al. 2017. “CARD 2017: Expansion and Model- Centric Curation of the Comprehensive Antibiotic Resistance Database.” Nucleic Acids Research 45 (D1): D566–73. https://doi.org/10.1093/nar/gkw1004.

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT : A Fast Spliced Aligner with Low Memory Requirements” 12 (4). https://doi.org/10.1038/nmeth.3317.

Lauring, Adam S., and Raul Andino. 2010. “Quasispecies Theory and the Behavior of RNA Viruses.” PLoS Pathogens 6 (7): 1–8. https://doi.org/10.1371/journal.ppat.1001005.

McIntyre, Alexa B. R., Rachid Ounit, Ebrahim Afshinnekoo, Robert J. Prill, Elizabeth Hénaff, Noah Alexander, Samuel S. Minot, et al. 2017. “Comprehensive Benchmarking and Ensemble Approaches for Metagenomic Classifiers.” Genome Biology 18 (1): 182. https://doi.org/10.1186/s13059-017-1299-7.

Menzel, Peter, Kim Lee Ng, and Anders Krogh. 2016. “Fast and Sensitive Taxonomic Classification for Metagenomics with Kaiju.” Nature Communications 7: 1–9. https://doi.org/10.1038/ncomms11257.

Naccache, Samia N., Scot Federman, Narayanan Veeraraghavan, Matei Zaharia, Deanna Lee, Erik Samayoa, Jerome Bouquet, et al. 2014. “A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from next- Generation Sequencing of Clinical Samples.” Genome Research 24 (7): 1180–92. https://doi.org/10.1101/gr.171934.113.

O’Leary, Nuala A., Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, et al. 2016. “Reference Sequence (RefSeq) Database at NCBI: Current Status, Taxonomic Expansion, and Functional Annotation.” Nucleic Acids Research 44 (D1): D733–45. https://doi.org/10.1093/nar/gkv1189.

Overbeek, Ross, Robert Olson, Gordon D. Pusch, Gary J. Olsen, James J. Davis, Terry Disz, Robert A. Edwards, et al. 2014. “The SEED and the Rapid Annotation of Microbial Genomes Using Subsystems Technology (RAST).” Nucleic Acids Research 42 (D1): 206–14. https://doi.org/10.1093/nar/gkt1226.

Pandiselvam, P, T Marimuthu, and R Lawrance. 2014. “A Comparative Study on String Matching Algorithms of Biological Sequences.” International Conference on Intelligent Computing, no. January 2014: 1–5. https://arxiv.org/pdf/1401.7416.pdf%0Ahttp://arxiv.org/ftp/arxiv/papers/1401/1401 .7416.pdf.

Roux, Simon, Michaël Faubladier, Antoine Mahul, Nils Paulhe, Aurélien Bernard, Didier Debroas, and François Enault. 2011. “Metavir: A Web Server Dedicated to Virome

71 Analysis.” Bioinformatics 27 (21): 3074–75. https://doi.org/10.1093/bioinformatics/btr519.

Sayers, Samantha, Li Li, Edison Ong, Shunzhou Deng, Guanghua Fu, Yu Lin, Brian Yang, et al. 2019. “Victors: A Web-Based Knowledge Base of Virulence Factors in Human and Animal Pathogens.” Nucleic Acids Research 47 (D1): D693–700. https://doi.org/10.1093/nar/gky999.

Sczyrba, Alexander, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen, Johannes Dröge, Ivan Gregor, et al. 2017. “Critical Assessment of Metagenome Interpretation - A Benchmark of Metagenomics Software.” Nature Methods 14 (11): 1063–71. https://doi.org/10.1038/nmeth.4458.

Silva, Genivaldo, Bas Dutilh, and Robert Edwards. 2016. “FOCUS2: Agile and Sensitive Classification of Metagenomics Data Using a Reduced Database.” BioRxiv, 046425. https://doi.org/10.1101/046425.

Singer, Esther, Bill Andreopoulos, Robert M. Bowers, Janey Lee, Shweta Deshpande, Jennifer Chiniquy, Doina Ciobanu, et al. 2016. “Next Generation Sequencing Data of a Defined Microbial Mock Community.” Scientific Data 3: 160081. https://doi.org/10.1038/sdata.2016.81.

“SMALT Manual.” 2010. October, no. 1: 1–7.

So, Martin. 2017. “Sequence Analysis Edlib : A C / C 11 Library for Fast , Exact Sequence Alignment Using Edit Distance” 33 (January): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

Stephens, Zachary D., Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. 2015. “Big Data: Astronomical or Genomical?” PLoS Biology 13 (7): 1– 11. https://doi.org/10.1371/journal.pbio.1002195.

72 Appendix for Chapter 2 Supplemental tables

Supplemental Table 2.1. FRAP vs viral refseq of plasma metagenome spiked with HIV. Five subsample replicates. spikeHIV-mg-SRR1106548-2.

id name R1 R2 R3 R4 R5 Mean SD gi|9628705|ref|NC_001710.1| GB virus C/Hepatitis G virus, 899.54 899.54 899.54 899.54 899.54 899.54 0.0000 complete genome gi|9629357|ref|NC_001802.1| Human immunodeficiency 324.90 324.90 324.90 324.90 324.90 324.90 0.0000 virus 1, complete genome gi|56718463|ref|NC_003287.2| Enterobacteria phage M13, 17.59 17.70 17.59 17.33 17.73 17.59 0.1420 complete genome gi|295413923|ref|NC_014075.1| Torque teno virus 12, complete 13.68 13.68 13.68 13.68 13.68 13.68 0.0000 genome gi|9627425|ref|NC_001604.1| Enterobacteria phage T7, 7.96 7.96 7.96 7.96 7.96 7.96 0.0000 complete genome gi|339832375|ref|NC_015783.1| Torque teno virus, 6.36 6.36 6.36 6.36 6.36 6.36 0.0000 complete genome gi|29502191|ref|NC_002076.2| Torque teno virus 1, complete 5.72 5.72 5.72 5.72 5.72 5.72 0.0000 genome gi|730977588|ref|NC_025824.1| Enterobacteria phage fd strain 5.13 5.01 5.13 5.39 4.99 5.13 0.1420 478, complete genome gi|295413965|ref|NC_014082.1| Torque teno mini virus 7, complete 4.29 4.29 4.39 4.34 4.34 4.33 0.0382 genome gi|9634957|ref|NC_002195.1| Torque teno mini virus 9, complete 3.93 3.93 3.83 3.88 3.88 3.89 0.0387 genome gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, 3.13 3.09 3.13 3.13 3.13 3.12 0.0166 complete genome gi|295413834|ref|NC_014073.1| Torque teno virus 28, complete 1.83 1.83 1.83 1.83 1.83 1.83 0.0000 genome gi|295413918|ref|NC_014074.1| Torque teno virus 27, complete 1.74 1.74 1.74 1.78 1.74 1.75 0.0162 genome gi|295413958|ref|NC_014081.1| Torque teno virus 3, complete 1.77 1.77 1.77 1.77 1.77 1.77 0.0000 genome gi|295413928|ref|NC_014076.1| Torque teno virus 10, complete 1.56 1.56 1.56 1.56 1.56 1.56 0.0000 genome gi|295441877|ref|NC_014089.1| Torque teno mini virus 5, complete 1.56 1.56 1.56 1.56 1.56 1.56 0.0000 genome gi|418487627|ref|NC_019445.1| Escherichia phage TL-2011b, 1.27 1.27 1.27 1.27 1.27 1.27 0.0000 complete genome gi|295441896|ref|NC_014094.1| Torque teno virus 6, complete 0.85 0.85 0.85 0.85 0.85 0.85 0.0000 genome

73 Supplemental Table 2.1. FRAP vs viral refseq of plasma metagenome spiked with HIV. Five subsample replicates. spikeHIV-mg-SRR1106548-2. (Continued)

id name R1 R2 R3 R4 R5 Mean SD gi|727071839|ref|NC_025726.1| Torque teno mini virus ALA22, 0.72 0.72 0.72 0.72 0.72 0.72 0.0000 complete genome gi|295441905|ref|NC_014096.1| Torque teno virus 15, complete 0.48 0.48 0.48 0.48 0.48 0.48 0.0000 genome gi|209427726|ref|NC_011356.1| Enterobacteria phage YYZ-2008, 0.68 0.68 0.68 0.68 0.68 0.68 0.0000 complete prophage genome gi|46401626|ref|NC_005856.1| Enterobacteria phage P1, 0.59 0.63 0.64 0.62 0.60 0.62 0.0179 complete genome gi|557307526|ref|NC_022749.1| Shigella phage SfIV, complete 0.70 0.66 0.66 0.68 0.71 0.68 0.0197 genome gi|744692686|ref|NC_026013.1| Microviridae IME- 16, complete 0.71 0.71 0.71 0.71 0.71 0.71 0.0000 sequence gi|971482474|ref|NC_028748.1| Bacillus phage BMBtpLA, 0.46 0.42 0.40 0.42 0.44 0.43 0.0206 complete genome gi|374531645|ref|NC_016765.1| Pseudomonas phage 0.28 0.28 0.28 0.28 0.28 0.28 0.0025 vB_PaeS_PMG1, complete genome gi|124300942|ref|NC_008376.2| Geobacillus phage GBSV1, complete 0.26 0.26 0.26 0.26 0.26 0.26 0.0000 genome gi|221328618|ref|NC_011976.1| Salmonella phage epsilon34, 0.32 0.32 0.32 0.32 0.32 0.32 0.0000 complete genome gi|849250250|ref|NC_027339.1| Enterobacteria phage SfI, 0.18 0.16 0.15 0.16 0.15 0.16 0.0109 complete genome gi|82700933|ref|NC_007623.1| Pseudomonas phage EL, 0.16 0.16 0.16 0.16 0.16 0.16 0.0000 complete genome gi|295441884|ref|NC_014091.1| Torque teno virus 16, complete 0.16 0.16 0.16 0.16 0.16 0.16 0.0000 genome gi|209447126|ref|NC_011357.1| Stx2-converting phage 1717, 0.15 0.15 0.15 0.15 0.15 0.15 0.0000 complete prophage genome gi|937456792|ref|NC_027991.1| Staphylococcus phage SA1, 0.16 0.16 0.16 0.16 0.16 0.16 0.0000 complete genome gi|939482395|ref|NC_002484.2| Bacteriophage D3, 0.13 0.13 0.14 0.14 0.14 0.14 0.0024 complete genome gi|428782011|ref|NC_019711.1| Enterobacteria phage HK629, 0.14 0.13 0.15 0.13 0.15 0.14 0.0096 complete genome gi|428782787|ref|NC_019723.1| Enterobacteria phage HK630, 0.12 0.19 0.16 0.17 0.16 0.16 0.0220 complete genome gi|29134936|ref|NC_004629.1| Pseudomonas phage phiKZ, 0.11 0.11 0.11 0.11 0.11 0.11 0.0000 complete genome

74 Supplemental Table 2.1. FRAP vs viral refseq of plasma metagenome spiked with HIV. Five subsample replicates. spikeHIV-mg-SRR1106548-2. (Continued)

id name R1 R2 R3 R4 R5 Mean SD gi|428783345|ref|NC_019769.1| Enterobacteria phage HK542, 0.04 0.03 0.03 0.03 0.04 0.04 0.0045 complete genome gi|428783215|ref|NC_019767.1| Enterobacteria phage HK544, 0.04 0.06 0.03 0.02 0.01 0.03 0.0169 complete genome gi|428782316|ref|NC_019716.1| Enterobacteria phage mEp460, 0.07 0.07 0.07 0.07 0.07 0.07 0.0000 complete genome gi|966201269|ref|NC_028694.1| Propionibacterium phage PA1-14, 0.03 0.04 0.05 0.05 0.07 0.05 0.0117 complete genome gi|408905847|ref|NC_018852.1| Propionibacterium phage P100D, 0.06 0.05 0.06 0.06 0.04 0.05 0.0088 complete genome gi|543171632|ref|NC_022342.1| Propionibacterium phage 0.05 0.05 0.05 0.05 0.05 0.05 0.0021 PHL111M01, complete genome gi|330858351|ref|NC_015453.1| Propionibacterium phage PAS50 0.02 0.05 0.04 0.05 0.07 0.04 0.0185 endogenous virus, complete genome gi|849254459|ref|NC_027373.1| Propionibacterium phage PHL030N00, 0.06 0.04 0.06 0.04 0.02 0.04 0.0150 complete genome gi|408905837|ref|NC_018842.1| Propionibacterium phage P1.1, 0.08 0.06 0.04 0.05 0.05 0.06 0.0128 complete genome gi|682123269|ref|NC_024787.1| Listeria phage LMTA-148, 0.02 0.02 0.02 0.02 0.02 0.02 0.0000 complete genome gi|238801880|ref|NC_012753.1| Streptococcus phage 5093, 0.02 0.02 0.02 0.02 0.02 0.02 0.0000 complete genome gi|89152530|ref|NC_007805.1| Pseudomonas phage F10, 0.03 0.03 0.03 0.03 0.03 0.03 0.0000 complete genome gi|431809676|ref|NC_019914.1| Staphylococcus phage StB27, 0.00 0.00 0.00 0.00 0.00 0.00 0.0000 complete genome gi|849251120|ref|NC_027346.1| Propionibacterium phage 0.00 0.00 0.01 0.01 0.01 0.00 0.0038 PHL171M01, complete genome gi|543171262|ref|NC_022334.1| Propionibacterium phage PHL112N00, 0.00 0.01 0.00 0.00 0.01 0.00 0.0025 complete genome gi|906475910|ref|NC_027624.1| Propionibacterium phage SKKY, 0.01 0.00 0.01 0.00 0.00 0.00 0.0025 complete genome gi|422933554|ref|NC_019491.1| Cyprinid herpesvirus 1 strain 0.00 0.00 0.00 0.00 0.00 0.00 0.0002 NG-J1, complete genome gi|363539767|ref|NC_016072.1| Megavirus chiliensis, complete 0.00 0.00 0.00 0.00 0.00 0.00 0.0000 genome Mean of 0.0109 all SD

75 Chapter 3 : Cystic Fibrosis Rapid Response: Translating Multi-omics Data into

Clinically Relevant Information

Abstract

Pulmonary exacerbations are the leading cause of death in cystic fibrosis (CF). To track microbial dynamics during acute exacerbations, a CF Rapid Response (CFRR) strategy was developed. The CFRR relies on viromics, metagenomics, metatranscriptomics and metabolomics data to rapidly monitor active members of the viral and microbial community during acute CF exacerbations. To highlight CFRR, a case study of a CF patient is presented, in which an abrupt decline in lung function characterized a fatal exacerbation. The microbial community in the patient’s lungs was closely monitored through the multi-omics strategy, which led to the identification of pathogenic shigatoxigenic Escherichia coli (STEC) expressing shiga toxin. This case study illustrates the potential for CFRR to deconstruct complicated disease dynamics and provide clinicians with alternative treatments to improve the outcomes of pulmonary exacerbations and expand the lifespans of individuals with CF.

Introduction

Cystic fibrosis (CF) is a recessive genetic disease in which defects or deficits in the cystic fibrosis transmembrane conductance regulator (CFTR) protein result in disease phenotypes of the pancreas, sweat glands and reproductive, respiratory and digestive systems

(Knowles and Drumm, 2012). In the lungs of individuals with CF, mucociliary clearance is impaired, which promotes chronic polymicrobial infections (Laguna et al., 2016). Antibiotic treatments and proper disease management have extended the average lifespan of CF patients;

76 nevertheless, these polymicrobial lung infections are still the primary cause of morbidity and mortality (Alexander et al., 2016). Common bacteria that colonize CF lungs over the long- term include Pseudomonas aeruginosa, Staphylococcus aureus, Haemophilus influenzae,

Burkholderia cepacia complex, Rothia mucilaginosa and Streptococcus spp. (Surette 2014;

LiPuma 2010; Lim et al. 2013; Whiteson et al. 2014), but every CF individual presents a unique microbial community that changes over time (Lim et al., 2014a; Whelan et al., 2017;

Zhao et al., 2012). This highlights the need to characterize the microbial communities in each

CF individual.

Microbial community dynamics in CF lungs follow the Climax Attack Model (CAM)

(Conrad et al., 2013; Quinn et al., 2014), in which a climax community is acclimated to the host and dominates during stable periods, and a transient attack community is associated with exacerbations. Attack communities are virulent and colonize the CF lungs either from an external source or are already present in the CF lungs and become active during exacerbations. In the CAM, attack communities lead to Cystic Fibrosis Pulmonary

Exacerbations (CFPE), declines in lung function, and eventually death. Preventing CFPE relies on quickly identifying attack viral and microbial communities and the genes they carry and express, such as those encoding specific toxins (Gallant et al., 2015), to efficiently tailor antimicrobial therapies.

Herein we propose Cystic Fibrosis Rapid Response (CFRR), a strategy for determining microbial dynamics during CFPE. This strategy is a personalized multi-omics approach that uses viromes (Willner et al., 2012), metagenomes, metatranscriptomes (Lim et al., 2013b), and metabolomes (Quinn et al., 2016a; Whiteson et al., 2014) from longitudinal

77 samples to monitor the whole microbial community, particularly its active members and their metabolic products. Using CFRR to obtain personalized taxonomic and functional profiles of the lung microbial communities would provide clinicians with comprehensive information about each patient’s viral and microbial ecosystem. This information allows clinicians to generate testable hypotheses, test those hypotheses using standard clinical tests, and propose specific clinical interventions (e.g., precisely targeted antibiotic therapy) to improve CFPE outcomes.

The ability to generate multi-omic datasets and analyze large amounts of data in a clinically relevant time frame (i.e., ≤ 48 hours) makes the CFRR approach applicable in CF clinical practice, especially in clinics closely related to research institutions. It requires access to a sequencing instrument, mass spectrometer, computational resources and specialized personnel in each one of these areas. In an optimal situation, the time between sample collection and data interpretation is 30 hours for metabolomes (Quinn et al. 2016), 38 hours for metagenomes and metatranscriptomes, and 48 hours for viromes. These times are expected to shorten as technologies improve. The rapid decrease in sequencing costs

(National Human Genome Research Institute, 2018) and incorporation of sequencing cores within hospitals (Deurenberg et al., 2017) will increase CF patients accessibility to the CFRR in the foreseeable future.

A case study is presented to demonstrate the potential of the CFRR strategy. A 37- year-old male CF patient (CF01) was monitored over a two-year period with metagenomes, metatranscriptomes and metabolomes. Integrating the information from these sources led to

78 the identification of an attack community in which a strain of E. coli that likely produced shiga-toxin was detected during a fatal exacerbation.

Results

Patient CF01 fatal exacerbation expedited monitoring: metatranscriptomes and metabolomes

An overall decline in lung function was observed in patient CF01 during his last year of life and four CFPEs were reported. In the last month of life 10% of predicted FEV1 was lost (Figure 3.1-A). During the last exacerbation, the patient was hospitalized at the intensive care unit (ICU) for seven days and then died. The fatal exacerbation was characterized by severe lung tissue damage (Figure 3.1- D, Supplemental Table 3.1A), an increase in white blood cell counts (Figure 3.1-B, Supplemental Table 3.1B) and a general decline in health.

During the fatal exacerbation, clinical microbiology lab cultures from sputum samples tested positive for P. aeruginosa, Stenotrophomonas maltophilia, Aspergillus terreus and yeast

(Figure 3.1-C, Supplemental Table 3.1C). Treatment alternated between the antibiotics aztreonam, azithromycin, in addition to a sulfonamide and a quinolone; at the ICU, colistin and meropenem were administered (Supplemental Table 3.1D) but no improvement was observed.

The CFRR strategy was launched to rapidly identify the cause of the CFPE. Sputum samples were collected 7 and 8 days before death (samples D-7 and D-8). In samples D-7 and

D-8 active members of the microbial community were determined using metatranscriptomics.

In sample D-8 small molecule profiles (using metabolomics) were characterized and a total

DNA metagenome was sequenced.

79 Figure 3.1 Clinical data for the last 24 months of patient CF01’s life. (A) Percentage of predicted FEV1 of patient CF01 over the last 24 months of life. Solid dots are FEV1 measurements. The line is included to highlight lung function dynamics and does not represent measurements. Seven exacerbation periods were reported and are shown in gray. (B) White blood cell (WBC) counts for the last month of life. (C) Clinical microbiology positive cultures from patient CF01’s sputum samples over the last 24 months of life. Dots represent days where cultures were positive for each microbe tested in the clinical microbiology panel. Omics sampling points for metagenomes, metatranscriptomes, and metabolomes are indicated by dots in the Omic panel. Performed X rays and WBC measurement days are indicated with dots in the clinical care panel. (D) Patient CF01 chest X rays in a frontal view with quantitative disease severity evaluation using Brasfield scores. D- 193, mild exacerbation; D-8, acute exacerbation, the time point where CFRR data were obtained; D-1, 1 day before death. A lower Brasfield score represents a higher disease severity. The Brasfield score scale is from 25 to 0, where 25 is lower disease severity and 0 is higher disease severity. Parameters used for Brasfield scores calculations are air trapping, linear markings, nodular cystic lesion, large lesions, and general severity, and individual scores are shown in Supplemental Table 3.1A in the supplemental material.

80 Metatranscriptomics data from sample D-8 showed that the most abundant microbial ribosomal RNAs (rRNA) belonged to the genera Bacillus (29.9%), Escherichia-Shigella

(23.9%), Streptococcus (11.6%), Salmonella (6.9%) and Lactococcus (4.4%) among other genera (23.3%) (Supplemental Figure 3.1-A). The microbial messenger RNA (mRNA) composition was dominated by the genera Pseudomonas (97.1%), followed by

Stenotrophomonas (1.9%) and Escherichia (0.07%) (Supplemental Figure 3.1-B). At species level resolution, the most abundant bacterial genomes (based on total RNA, Figure 3.2) were:

Bacillus sp., E. coli (STEC), Salmonella enterica serovar Infantis, P. aeruginosa and S. maltophilia. Enterobacteria phage SP6, Pseudomonas phages and Stenotrophomonas phage

S1 were also detected. Two members of the phylum Ascomycota were identified: Candida albicans and Aspergillus fumigatus. Metagenomics data of sample D-8 detected Pseudomonas

(98.5%) as the dominant bacterial genus (Supplemental Figure 3.5-A).

81

Figure 3.2 The most abundant bacterial genera of fatal exacerbation sample D-8. The relative abundances of each genus as determined by rRNA, mRNA, and total RNA are shown. A reference genome from each genus was selected based on the number of reads recruited in the rRNA (Escherichia, Bacillus, and Salmonella) or mRNA (Pseudomonas and Stenotrophomonas) category. Fragment recruitment was visualized using Anvi’o, showing a logarithmic scale for mRNA and rRNA from 1 to 1,000. Anvi’o plots show reads mapped along the genome coordinates. Nonribosomal microbial reads were recruited against each reference genome using SMALT with an identity cutoff of 80% and are shown in brown along the external ring. rRNA reads were classified into each genus by BLASTn, were recruited against the corresponding reference genome using SMALT with an identity cutoff of 60% and are shown in gray along the internal ring.

The presence of Escherichia-Shigella in the lungs of a CF patient is unusual and thus a detailed analysis was performed to further resolve the taxonomy at strain level. Strain level analysis identified that E. coli present in CF01’s lungs was most closely related to the genome of E. coli (STEC) B2F1. This strain typically carries the shiga toxin 1 and shiga toxin 2 genes, both of which were identified in the metatranscriptomes (Figure 3.3-B, C). Furthermore, the shiga toxin receptor, globotriaosylceramide (Gb3), was detected in the metabolome from sample D-8 (Figure 3.3-A). This suggests that shiga toxin, and its Gb3 target, were being produced in the lungs of CF01. Gb3 is produced in human cells by Gb3 synthase, which adds a sugar to a lactosylceramide molecule. Ceramide is produced by sphingomyelinase (SMase)

82 in the host cell or by the action of bacterial encoded SMase (Figure 3.5-B). The gene that encodes a P. aeruginosa secreted SMase, the hemolytic phospholipase C (PlcH) (Vasil et al.,

2009), was detected in the sample D-8 metatranscriptome (Supplemental Figure 3.2-B).

Figure 3.3 Shiga toxin and its human receptor globotriaosylceramide (Gb3). (A) The masses of globotriaosylceramide and its precursors lactosylceramide and sphingomyelin from exacerbation sample D-8 and 14 historical nonexacerbation samples were determined by parent mass searching and validated by MS/MS matching. The fatal exacerbation sample is shown in gray. (B) STEC BRF1 was used as a reference genome for fragment recruitment to the Shiga-like toxin 2 subunit A protein sequence. The amino acid sequence position is shown on the x axis, and percent identity is shown on the y axis. The nucleotide sequences from patient CF01 metatranscriptome exacerbation sample D-8 were mapped to proteins using BLASTx with an E value cutoff of 0.001 and filtered by an identity of ⬎60%. (C) Metatranscriptome recruitment as explained above for panel B, except that in this case, reads were recruited to the Shiga-like toxin 2 subunit B amino acid sequence.

In a longitudinal metabolomics dataset, Gb3 was highly abundant (p < 0.0001) in sample D-8 but was in low abundance in the prior samples (Figure 3.3-A). The Gb3 precursor lactosylceramide (18:1/16:0) (Sandvig et al., 2012) and its ceramide donor sphingomyelin

(18:1/16:0) (Obrig et al., 2003) were abundant in all samples throughout the longitudinal dataset (Figure 3.3-A, Supplemental Table 3.2A). These data demonstrate that Gb3 precursors

83 were present for at least a year before the fatal exacerbation, but Gb3 was produced in significantly high quantities eight days before death (sample D-8).

Gb3 levels positively correlate with shiga toxin levels (Boyd et al., 1993) although the mechanism behind this positive correlation is not clear. Gb3 is the only known functional receptor for shiga toxins (Aigal et al., 2015) and shiga toxins induce reorganization of lipids in the epithelial cell’s membrane. Shiga toxin B can bind up to 15 Gb3 molecules (Ling et al.,

1998) and this binding result in the aggregation of Gb3 in lipid rafts. The aggregation of Gb3 in lipid rafts promotes a negative membrane curvature and internalization of shiga toxin (Betz et al., 2011). Spatial distribution of Gb3 in the cell membrane has a regulatory role in its presentation (Lingwood et al., 2010), thus higher recruitment of Gb3 in lipid rafts may induce the production of more Gb3.

Antibiotic resistance genes were detected in the metatranscriptomes of D-8 and D-7 samples. Transcripts encoding all the protein components were identified for two RND-type multidrug exporters, MexGH1-OpmD (Aendekerk et al., 2002) and MexA-MexB-OprM (Li et al., 1995) previously described in Pseudomonas, as well as the tetracycline efflux pump tet(C) previously described in Achromobacter. Transcripts encoding several beta-lactamases were identified, such as TEM-116, PDC-3, OXA-50 and BEL-3 (Jia et al., 2017), which are typically found in Pseudomonas, and CTX-M-21 (Saladin et al., 2002), which is usually found in Enterobacteriaceae. Transcripts encoding for enzymes that are involved in resistance to macrolides, aminoglycosides, lincosamide, diaminopyrimidine and glycopeptide antibiotics were detected; these enzymes were previously described in Pseudomonas, Achromobacter,

84 Escherichia, Streptomyces, Paenibacillus, Clostridium and Morganella (Supplemental Table

3A).

A partial P. aeruginosa genome sequence was recovered by assembling reads from the fatal exacerbation metatranscriptomes (D-8 and D-7) into contigs and then mapping those contigs to the P. aeruginosa PAO1 reference genome (Supplemental Figure 3.2-A). In the resulting P. aeruginosa CF01 contigs, 38 genes related to resistance to antibiotics and toxic compounds were identified (Supplemental Table 3.3B). Two prophages were also identified in the assembled P. aeruginosa CF01 contigs (D-8 and D-7); one was complete and the second one was a partial prophage (Supplemental Figure 3.2-C and 3.2-D).

Bacterial small molecule profiles before and during fatal exacerbation.

Longitudinal metabolomic data from CF01 historical samples and fatal exacerbation sample D-8 were compared to metabolic profiles from six pathogenic bacterial isolates previously detected in CF sputum (P. aeruginosa VVP172, Enterococcus sp. VVP100, E. coli

VVP427, Streptococcus sp. VVP047, Stenotrophomonas sp. VVP327 and S. aureus

VVP270). The goal was to identify metabolites produced by pathogenic bacteria and track how changes in their abundances might have preceded the fatal exacerbation. Metabolites from these pathogens were consistently detected throughout the longitudinal samples. In sample D-8 there was an increase (p < 0.001) in the number of metabolites that matched P. aeruginosa VVP172, E. coli VVP427, Streptococcus sp. VVP047 and S. aureus VVP270

(Supplemental Figure 3.3, Supplemental Table 3.2B).

85 Active members of the microbial community during a stable period and the fatal exacerbation.

Analysis of metatranscriptomes from a stable period 10 and 9 months before the fatal exacerbation event (samples D-303 and D-279) identified several differences between this stable period and the fatal exacerbation. First, Firmicutes was the most active phylum during the stable period whereas Proteobacteria was the most active during exacerbation (Figure 3.4-

A). Second, samples from the stable period showed an active microbial community that was more even and diverse than the community in exacerbation samples (Figure 3.4-D). Third, transcripts from Pseudomonas were detected at very low levels in stable samples (average relative abundance 3%), but at high levels (average relative abundance 37%) in exacerbation samples (Supplemental Figure 3.2-A). Fourth, the percentages of unclassified sequences were higher in the stable samples D-303 and D-279 (40.9% and 39.0%) than in the exacerbation samples D-8 and D-7 (27.6% and 17.0%). Fifth, a higher fractional abundance of bacteriophages was detected in the fatal exacerbation samples relative to the stable ones.

Enterobacteria phage SP6, several Pseudomonas phages (Figure 3.4-B) and sarcoma viruses

(Figure 3.4-C) were the dominant viruses in samples D-8 and D-7.

86 Figure 3.4 Actively transcribing members of the viral and bacterial communities in sputum samples of patient CF01. Metatranscriptomes from two exacerbations and two stable samples were obtained. (A) Bacterial taxonomical assignments were made using KAIJU at the genus level and are color-coded by phylum. (B) Fractional abundances of bacteriophages based on viral RefSeq mapping and FRAP normalization. (C) Fractional abundances of eukaryotic viruses based on viral RefSeq mapping and FRAP normalization. (D) Bacterial rank abundance plot generated using relative abundances at the genus level. Evenness was

87 calculated as H/In(S), where H is the Shannon diversity index and S is the total number of species.

Microbial community dynamics during a non-fatal exacerbation.

Two years before the fatal exacerbation, CF01’s lung function declined faster than in previous years (Supplemental Figure 3.4-A). The rate of lung function change in the last two years of life was -9.75 FEV1%/year (Supplemental Figure 3.4-C). The overall rate of lung function change during CF01 last 14 years of life was -1.39 FEV1%/year. During a two-year period of 4 and 3 years before death, the rate of lung function change was 1.30 FEV1%/year

(Supplemental Figure 3.4-B).

During the two-year period leading up to the fatal exacerbation, seven exacerbation events were reported, and sputum samples were periodically screened for fungi and bacteria at the clinical microbiology lab (Supplemental Table 3.1-C). P. aeruginosa was detected in all samples. Six months before the fatal exacerbation S. maltophilia was detected, and during the last two months of life, Enterobacter cloacae was detected. A. terreus was detected in two samples in the last six months of life. Yeast was detected in all screened samples, except for the final exacerbation samples. Based on this information, several antibiotics were prescribed to manage the exacerbations (Supplemental Table 3.1-D); these included monobactams, macrolides, quinolones, beta-lactams, sulfonamides and a cationic polypeptide.

Two years before CF01’s death, metagenomics was used to monitor the microbial composition of the respiratory tract during an exacerbation event, the subsequent antibiotic treatment (samples D-724 to D-718), and a stable period that followed (samples D-409 and D-

286) (Supplemental Figure 3.5-A). The bacterial genera that best differentiated between samples collected during antibiotic treatment (D-722 to D-718) and no antibiotic treatment

88 (D-724 and D-723) were Rothia, Campylobacter, Veilonella and Prevotella (Supplemental

Figure 3.6). The antibiotics prescribed during this exacerbation were a fluoroquinolone

(ciprofloxacin), and a tetracycline (doxycycline). Clinical microbiology lab tests performed on D-719 were positive for P. aeruginosa, Pseudomonas fluorescens, A. fumigatus and yeast

(Supplemental Table 3.1C). Exacerbation and stable samples had Streptococcus phages,

Staphylococcus phages and Pseudomonas phages, whereas only exacerbation samples had a shigatoxin-converting phage (Supplemental Figure 3.5-B), and stable samples had higher abundances of Herpesviruses (Supplemental Figure 3.5-D).

Discussion

CF01 fatal exacerbation mechanism

The unusually fast decline of patient CF01 led to the implementation of the CFRR.

During CF01 fatal exacerbation, E. coli mRNA, rRNA and metabolites were detected, which demonstrated not only the presence but also the activity of shigatoxigenic E. coli. The identification of a shigatoxinogenic E. coli is supported by ribosomal RNA (36,590 unique ribosomal RNA sequences in metatranscriptome D-8), messenger RNA (1,412 E. coli mRNA reads in metatranscriptome D-8, and 11 partial mRNA reads at 60% identity to STEC BRF1), and metabolites (10 matched metabolome spectra to E. coli in D-8). The presence of STEC in the lungs of a CF patient was alarming as this strain causes severe damage to the lung epithelium (Bergan et al., 2012; Uchida et al., 1999). Moreover, interactions between shiga toxin and the host epithelium were inferred from metabolomes. The molecule

89 globotriaosylceramide (Gb3), the receptor for shiga toxin, showed an increase of three orders of magnitude during the fatal exacerbation (sample D-8), compared to previous samples.

Figure 3.5 Proposed model of lung dynamics resulting in patient CF01’s death. (A) A nonfatal exacerbation (days ⫺724 to ⫺718) was followed by a recovery of lung function, and attack and climax communities were diverse. (B) The fatal exacerbation was triggered by colonization by STEC, which is supported by the presence of its rRNA in the metatranscriptomes. This bacterium encodes Shiga toxin, which was likely taken into host cells by the human receptor globotriaosylceramide. (C) Later during the fatal exacerbation, Shiga toxin was internalized and then induced apoptosis, necrosis, and inflammation. P. aeruginosa was reestablished and came to dominate the community, as suggested by its abundant mRNA.

Altogether, these multi-omics data support the following model of microbial dynamics that caused patient CF01’s death. At the beginning of the fatal exacerbation, STEC produced shiga toxin that remained inside the bacterial cells. Later in the exacerbation, STEC’s cell membranes were disrupted and the shiga toxin was released (Figure 3.5-B). This release may have been triggered by the action of the cationic polypeptide colistin (Gupta et al., 2009).

Next, the toxin was taken up by lung epithelial cells through the host membrane receptor globotriaosylceramide (Gallegos et al., 2012) (Figure 3.5-C). Inside the lung epithelial cells

(Uchida et al., 1999), shiga toxin inhibited host translation by blocking the ribosomes, thereby

90 inducing cell death, necrosis and an acute inflammatory response (Melton-Celsa, 2014; Obrig,

2010; Uchida et al., 1999). The immune response and lung tissue damage was evident in the chest X-rays and the increase in white blood cells (Figure 3.1-D, D-8 and D-1).

During the fatal exacerbation, STEC led the attack community that ultimately destabilized the climax community, a phenomenon previously reported in CF exacerbations

(Conrad et al., 2013); this resulted in a decline of evenness (diversity index that quantifies how equal the community is (Pielou, 1979)) and diversity (the number of different species in a community (Shannon and Weaver, 1949)), a switch from a community dominated by

Firmicutes to one dominated by Proteobacteria, and transcription of Enterobacteria and

Pseudomonas bacteriophages and sarcoma viruses. This event was followed by a

Pseudomonas and Stenotrophomonas bloom, characterized by active transcription, as both rRNA and mRNA were detected, as well as an increase in their metabolites. Pseudomonas was the most active member of the microbial community with an mRNA abundance of 97%, followed by 1.92% of Stenotrophomonas mRNA. Bacillus was either lysed or dormant as only rRNA was detected. A feature that may have contributed to the success of Pseudomonas was its resistance to multiple antibiotics, as detected by the transcription of over 38 antibiotic resistance genes. This scenario is congruent with the one described by the clinical laboratory, as positive cultures for Pseudomonas and Stenotrophomonas were reported during the fatal exacerbation.

Additional dynamics such as bacteriophage induction may have happened during the fatal exacerbation, as active transcription was detected from Enterobacteria phage SP6 and

91

Pseudomonas bacteriophages. Bacteriophage induction is known to play a role in the control of bacterial populations in CF lungs (James et al., 2015).

CFRR for polymicrobial infections management, the importance of historical samples and a fast sample to results strategy.

The CFRR emerged from the need to investigate the cause of acute exacerbations. The power of the CFRR is shown in the information obtained for CF01 case study. The CFRR is ideal for medical centers closely associated with research facilities where the equipment is available. However, as technologies improve and become more accessible, CFRR could be implemented within the clinic.

A key component of the CFRR strategy is the comparison between acute exacerbation and stable periods. Because CF microbial communities are heterogeneous, a baseline needs to be determined for each patient. Longitudinal samples are essential to identify the changes in the microbial community and metabolites during acute exacerbations.

In the presented CF01 case study, historical samples were essential to differentiate the attack community that led to a fatal exacerbation from the attack community associated with a non-fatal exacerbation. The increase in Gb3 abundance during CF01 fatal exacerbation

(Figure 3.3) was detected when comparing its abundances in historical samples. In the case of metabolites, a baseline is necessary for each CF patient because for many compounds the basal levels are not known. Accumulation of ceramides and sphingomyelins is observed in CF lungs (Seitz et al., 2015). In particular, sphingomyelins, ceramides and lactosylceramide are significantly higher in CF lungs compared to non-CF ones (Quinn et al., 2015).

92 A challenging component of the CFRR is the collection and storage of historical samples. Sputum samples intended for viromes, metagenomes and metabolomes (Wandro et al., 2017) analysis are stable if stored at -20 ºC or -80 ºC. Metatranscriptomes are prone to

RNA degradation and sputum collection intended for this purpose requires RNA stabilization prior to -20 ºC or -80 ºC storage. Given these considerations, each patient can be provided with a non-thaw cycle -20 ºC freezer where individual raw sputum samples can be stored for viromes, metagenomes and metabolomes (Supplemental Figure 3.7-A). Sputum samples for metatranscriptomes can be collected during the patient’s visit to the CF clinic, where immediately after collection the RNA integrity is preserved by adding TRizol or RNAlater.

RNA should then be extracted as soon as possible. A proposed sampling scheme, in which higher resolution of samples is desired close to an acute exacerbation and fewer samples far away from the exacerbation event is proposed (Supplemental Figure 3.7).

Historical samples collected by the patient at home or during routine visits to the clinic are a valuable resource in the event of an acute exacerbation. In these cases, historical samples would be processed along with those from acute exacerbations in the CFRR pipeline (Figure

3.6) and valuable information would be obtained in less than 48 hours. This information is then analyzed by a multidisciplinary scientific team along with the clinician to 1) validate the multi-omics findings with approved clinical tests and 2) identify appropriate therapeutic options.

The information presented by the CFRR to the clinician is more detailed than that provided by classical clinical microbiology. A clear understanding of how this information is obtained and the exploratory nature of the findings needs to be considered when interpreting

93 the results. Discussion among clinicians and experts on the benefits and limitations of each

‘omics approach is essential to identify the elements causing CF acute exacerbations and then select the course of action to prevent a fatality. The final treatment decision is always in the hands of the clinician, who evaluates the different lines of evidence for each finding and considers the cost to benefit ratio of possible therapeutic interventions. The application of the

CFRR in a clinical context gives CF patients the opportunity for a better outcome based on an informed treatment decision. Another consideration when implementing the CFRR in the clinic is the availability of financial resources to perform the multi-omics strategy in exacerbation and historical samples.

Figure 3.6 Cystic fibrosis rapid response. Our proposed multi-omics strategy is to analyze sputum samples from cystic fibrosis patients, in which metabolomes, metagenomes, metatranscriptomes, and viromes are obtained from a single sputum sample. Estimated times and equipment for each omics step are included, as are recommended transport conditions. Recommended transport condition temperatures can be achieved by using ice, dry ice, or liquid nitrogen.

94 Considerations about implementing the Cystic Fibrosis Rapid Response.

This was a retrospective study in which the patient’s treatment was not modified based on the presented meta-omics results. The course of action of the CFRR strategy is to provide information to clinicians so that they can evaluate and confirm the findings before proceeding with pertinent treatment modifications.

In the case of CF01’s fatal exacerbation, the information obtained from the CFRR strategy could have informed the course of action in the treatment with the following modifications: 1) use of different antibiotics, since colistin mechanism of action results in liberation of the bacteria cell contents such as shiga toxin, and 2) administration of neutralizing antibodies against shiga toxin. Colistin is a cationic polypeptide that disrupts the cell membrane of gram-negative bacteria through a detergent-like mechanism and it is often used in the treatment of multidrug-resistant exacerbation in patients with CF (Wishart et al.,

2018).

In the presented case study only metatranscriptomes, metabolomes and metagenomes were used to elucidate the cause of a fatal exacerbation. In future CFRR case studies the use of viromes could be incorporated. The combination of metagenomes and viromes allows the identification of viral induction events, for example of prophages carrying toxins.

Shigatoxigenic phages are capable of lysogenic conversion (Krüger and Lucchesi, 2015;

Moons et al., 2013), and in the case of the CF01 fatal exacerbation, an early detection of shigatoxin in the viromes of historical samples could have provided valuable information about the coding potential of the viral community.

95 Time is crucial during the management of CF exacerbations. The estimated execution time of the CFRR in an ideal situation with specialized staff working 24/7 is 48 hours. Each step has room for improvement that would shorten the execution times. For example, real time direct sequencing, such as Oxford Nanopore, can eventually be used for the CFRR metagenomes, metatranscriptomes and viromes. These technologies provide genomic information as it is being sequenced (Greninger et al., 2015; Schmidt et al., 2017), which will be ideal for CFRR once sample preparation and data analysis are optimized for human DNA removal (Gu et al., 2016), and once high amounts of sputum starting material (400 ng of DNA needed for a Nanopore run) are no longer necessary for DNA for sequencing.

Combining data from multiple ‘omics sources enabled the identification of shigatoxigenic E. coli as the likely cause of patient CF01’s fatal exacerbation. Although these

‘omics data were not used to alter clinical treatment of CF01, future applications of CFRR are expected to provide information that is essential for improving therapy, e.g., antibiotic resistance predictions and gene expression in major attack community pathogens. Although each individual’s CF community is unique, these methods will allow for the observation of overarching trends within and between patients, for example a loss in diversity in acute exacerbations.

Materials and methods

Clinical data

Sample collection procedures and access to clinical data were approved by the

Institutional Review Board of University of California San Diego (HRPP 081510) and San

Diego State University (IRB#1711018R). Clinical microbiology, hematology, and X-rays

96 were taken during the normal care of the patient at UCSD medical center. Spirometry tests were used to calculate the percentage of predicted FEV1 as previously described (Hankinson et al., 1999). Clinical status (exacerbation or stable) was determined by the clinician. Lung function dynamics were modeled using splines and linear model fitting as previously described (Conrad et al., 2017).

Metagenome and Metatranscriptome shotgun sequencing

Sputum samples were collected by expectoration in a sterile cup and processed for metagenomes or metatranscriptomes as previously described (Lim et al., 2014b). Metagenome libraries were constructed using Nextera DNA library preparation kit. Metatranscriptome libraries were constructed using TruSeq RNA library preparation kit. All libraries were sequenced on the Illumina GAIIx platform. Metatranscriptomes D-7 and D-8 were prepared using a modified procedure to obtain rRNA and mRNA in a single sequencing step, where half of the sample was depleted of rRNA using Ribo-Zero Gold kit (Lim et al., 2013b) while total RNA was extracted from the other half. Both fractions were pooled in a proportion of

4:1, and then a single Illumina library was constructed and sequenced.

Sequencing data processing

Quality filtering and dereplication was done using PRINSEQ (Schmieder and

Edwards, 2011) (-min_qual_mean 20 -derep 1245 -lc_method entropy -lc_threshold 50 - ns_max_p 1 -out_bad null). Cloning vector sequences were removed using SMALT (-y 0.8 - x) with 80% identity against the UniVec database (National Center for Biotecnology

Information, 2016a), possible sources of cloning vector sequences are reagents used in the library preparation (National Center for Biotecnology Information, 2016b; Woyke et al.,

97 2011). Human genome sequences were removed using BLASTn (E value of 0.1) against the human reference genome GRCh38. The resulting FASTA files are available on NCBI SRA with the following accession numbers: (SAMN10605049 to SAMN10605062, n=12).

Metagenome and metatranscriptome datasets presented in this study are summarized in

Supplemental Table 3.1E. Microbial taxonomy assignments at the genus level were made from BLASTn against NT (E value of 0.001, the hit with lowest E value out of 10 hits was kept) for metagenomes and KAIJU (Menzel et al., 2016) for metatranscriptomes. Viral assignments were made by mapping reads against the viral reference genomes database

(NCBI RefSeq, release 87) using SMALT (2010) with 80% identity. Fractional abundances were calculated using FRAP as previously described (Cobián Güemes et al., 2016) and expressed per million reads. After quality filtering and removing reads that mapped to the human genome, metatranscriptome D-8 reads were compared to the SILVA SSU database using BLASTn with an E-value cutoff of 0.001, and taxonomy was assigned at the genus level using the best hit from 10000 subsample replicates. Non-ribosomal reads were compared to the NCBI nucleotide database (NT) using BLASTn with an E-value cutoff of 0.001. The best hit was selected and used to assign bacterial taxonomy at the genus level. Species level assignments were determined by the genome that recruited the most reads for eac