GENOMIC AND METABOLIC GENE CHARACTERIZATION OF BACTERIAL COMMUNITIES FROM THE NEUSE RIVER ESTUARINE SYSTEM USING LONG READ METAGENOMICS

Laura Elizabeth Fisch

A thesis submitted to the faculty at the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Master of Science in the Department of Marine Sciences.

Chapel Hill 2019

Approved By:

Scott Gifford

Alecia Septer

Adrian Marchetti

Chris Osborn

© 2019 Laura Elizabeth Fisch ALL RIGHTS RESERVED

ABSTRACT

Laura Elizabeth Fisch: GENOMIC AND METABOLIC GENE CHARACTERIZATION OF BACTERIAL COMMUNITIES FROM THE NEUSE RIVER ESTUARINE SYSTEM USING LONG READ METAGENOMICS (Under the direction of Scott Gifford)

The productivity of estuaries is linked to the microbes in the estuary and their significant role in carbon cycling. The greatest challenge in studying these microbial communities is finding a method to capture their complexity. Metagenomics is the recovery and sequencing of DNA from an environmental sample and can be applied to studying the potential of a community to drive carbon cycling by identifying the carbon metabolism genes encoded in the community’s DNA. Long read sequencing technology produces DNA sequence fragments (reads) of ~1,000 to 50,000 base pairs, which are long enough to contain multiple complete functional genes. This study aims to understand the taxonomic identity and carbon metabolic pathways of estuarine microbial communities by applying long read sequencing technology from Oxford Nanopore Technologies to water samples collected from the Neuse River Estuary on March 8th and October 4th of 2018.

ACKNOWLEDGMENTS

Thank you to Scott Gifford for advising me through this graduate degree. Under your guidance, I feel as though I have grown immensely as a scientist, presenter and writer. I am very grateful for the many hours you have put into my education. I want to especially thank you for the time spent training me to write scientifically through creating this thesis, which has greatly improved the quality of my writing. I also want to thank you for imparting your impressive presentation skills to me, and for developing my ability to see the big picture in all of the work that we do.

I also want to thank my committee members, Alecia Septer, Adrian Marchetti, and Chris Osborn. Each of you has, through your own expertise, contributed to this project. I found committee meetings with you all to be quite enjoyable, as we always had interesting conversations about microbiology, organic carbon, and metagenomics. Thank you all for contributing your ideas to this project and to my education. I want to mention Chris especially, as some of the papers you had me read for my comprehensive exams and the Daniel Thornton paper made their way into this thesis because of the questions you asked me to answer.

Through my classes in the Marine Science program, I have been well educated on marine microbial ecology so that I could contribute to this project.

I want to thank the National Science Foundation for funding my graduate education through the Graduate Research Fellowship Program. By funding my graduate education, NSF made it possible to devote resources to a sequencing project, which is something I’d always wanted to work on.

A huge thanks to Hans Paerl’s lab for allowing us to come on the ModMon and Pamlico Sound cruises for sample collection. In particular, thanks to Jeremy Brady and Betsy Abare for taking us out to the estuary and for helping us out with our sampling.

Another huge thanks to Acacia Zhao for helping me with many of the bioinformatics tools from installation to usage to understanding these tools. I really appreciate your help as it made this project possible for a biologist without a computer science background.

Lastly, I’d like to thank Lauren Speare for all the great conversations about research and science. I found your tips and tricks for developing topic sentences to be very helpful. Thanks a bunch for sharing your writing experience and knowledge with me.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

INTRODUCTION

METHODS

Sample Collection

DNA extraction

DNA sequencing with Oxford Nanopore Technologies

Sequence processing and Quality control

Read annotation

RESULTS AND DISCUSSION

Internal Standards

Sequencing Statistics

Sequence Quality

Community Composition

Functional Composition

Glucose metabolism: glycolysis and the pentose phosphate pathway

Carbohydrate metabolism: isomerase enzymes

Aromatic metabolism: the homogentisate aromatic degradation pathway

Aromatic metabolism: the beta-ketoadipate pathway

CONCLUSIONS

APPENDIX: CARBON METABOLIC ANNOTATIONS

WORKS CITED

LIST OF TABLES

Table 1: Sequencing statistics

Table 2: The top 10 most abundant phyla and the relative abundances of the total number of reads assigned to them out of the total number of reads analyzed in the sample

Table 3: The most abundant Proteobacteria classes and the relative abundances of the total number of reads assigned to them out of the total number of Proteobacteria reads in the sample

Table 4: Top 20 most abundant families and the number of reads assigned to them

Table 5: Broad COG categories and the number of reads with a COG assigned to them. Numbers represent the number of reads with a COG assignment that falls into each category

Table 6: COG Metabolism categories and the number of gene annotations in each category. Numbers represent the number of reads with a COG assignment that falls into each category

Table 1A: COG annotations of glycolysis enzymes

LIST OF FIGURES

Figure 1: 2018 sampling sites

Figure 2: Bioinformatics workflow

Figure 3: Example figure of how MEGAN6’s LCA algorithm assigns reads to a taxonomic identification from the list of best hits returned by DIAMOND blastx

Figure 4: Tree depicting the evolutionary relationships of the different taxonomic classifications of NR180 reads within the Proteobacteria

Figure 5: Tree depicting the evolutionary relationships of the different taxonomic classifications of PS9 reads within the Proteobacteria

Figure 6: 19 reads from NR180 that include the functional annotation uronic isomerase and the surrounding COG annotations on the read

Figure 7: 19 reads from NR180 that include the functional annotation homogentisate 1,2-dioxygenase and the surrounding COG annotations on the read

Figure 8: Metabolic map of the homogentisate pathway and the first read from the NR180 sample found with the marker gene annotation of homogentisate 1,2-dioxygenase

Figure 1A: Reads from PS9 that include the functional annotation uronic isomerase and the surrounding COG annotations on the read

Figure 2A: Reads from NR180 that include the functional annotation xylose isomerase and the surrounding COG annotations on the read

Figure 3A: Reads from PS9 that include the functional annotation xylose isomerase and the surrounding COG annotations on the reads

Figure 4A: Reads from NR180 that include the functional annotation galactose 1 phosphate uridylyltransferase and the surrounding COG annotations on the read

Figure 5A: Reads from PS9 that include the functional annotation galactose 1 phosphate uridylyltransferase and the surrounding COG annotations on the read

Figure 6A: Reads from PS9 that include the functional annotation homogentisate 1,2-dioxygenase and the surrounding COG annotations on the read

Figure 7A: Reads from PS9 that include the functional annotation protocatechuate 3,4-dioxygenase and the surrounding COG annotations on the read

Figure 8A: Reads from PS9 that include the functional annotation protocatechuate 3,4-dioxygenase and the surrounding COG annotations on the read

LIST OF ABBREVIATIONS

NRE Neuse River Estuary

COG Cluster of Orthologous Genes

INTRODUCTION

The health and productivity of estuaries are directly linked to the microbes that have a significant role in estuarine carbon cycling. In aquatic ecosystems, heterotrophic bacteria are the biogeochemical machines that process and recycle organic carbon to build biomass and drive energy flow through the ecosystem (Moran et al., 2016). The Neuse River Estuary (NRE) is located on the coast of North Carolina, where fresh water meets oceanic water coming in from the Pamlico Sound. Influxes of riverine water bring natural and anthropogenically derived inorganic and organic nutrients into the estuary (Paerl et al., 2009). It is believed that the large influxes of organic matter result in a system where heterotrophic bacteria are not limited by organic carbon (Peierls and Paerl, 2010). A four-year study in the NRE and the Pamlico Sound found that bacterial production was highest when water temperatures were highest (Peierls and Paerl, 2010).

However, the bacterial community composition and functional roles in carbon cycling are not as extensively studied in estuaries. Studying metabolic pathways within the bacterial community is essential to understanding their influence on estuarine carbon cycling. These pathways determine the biochemical processes by which bacterial communities transform carbon and can provide insight into the sources of organic substrates sustaining the heterotrophic community.

Delving into bacterial community function in an estuarine ecosystem is inherently challenging due to both the taxonomic complexity of the microbial communities found in these systems and the diversity of metabolic pathways they contain. Furthermore, the organic matter that supports heterotrophic bacteria contains diverse carbon molecules that originate from different sources, including surrounding terrestrial land plants, soils, anthropogenic inputs such as fertilizers and/or farm waste, and algal exudates (Bauer and Bianchi, 2012). A microbial community capable of processing many different carbon molecules is often complex, with many different taxa and carbon processing pathways (Moran et al., 2016; Barberán et al., 2012). A large amount of information must therefore be obtained to study these processes. Methods that provide a holistic description of the many different parts of a complex system are necessary for studying microbial communities such as those found in estuaries.

Metagenomics (the recovery and sequencing of DNA from an environmental sample) can reveal a microbial community’s taxonomic identity and diversity; however, it is more challenging to ascertain the details of community functional potential. The popular sequencing platform, Illumina, produces highly accurate DNA sequences (reads). The shortcoming of this technology is that the reads are a maximum of 250 base pairs long. Assembly is required to put the short reads together into larger contigs (contiguous sequences) that span entire genes, but most metagenomic studies are either unable to assemble many reads into contigs or choose not to in order to avoid chimeric sequences. Furthermore, the short reads cannot resolve genomic structures like repetitive regions (Wick et al., 2017). The result of an assembly using Illumina reads is many discrete contigs (Wick et al., 2017). Illumina’s short reads limit the extent to which the reads can be assembled into genes and genomes from complex microbial communities, such as those found in estuaries.

To overcome the challenge of sequencing metagenomic samples of complex communities, long read technology can be applied to study the details of microbial carbon metabolic pathways. Long read sequencing technology produces long (1,000 to 50,000 base pairs) DNA sequence fragments that can contain multiple complete functional genes. Oxford Nanopore Technologies offers one such long read technology, which directly senses DNA nucleotides by measuring the changes in current they cause as the DNA fragment passes through a pore (Range et al., 2018). This method does not have a limit on the size of the DNA fragments that can pass through the pore. The trade-off with nanopore sequencing is that it produces reads of lower accuracy than Illumina (80 to 90% accuracy compared to 99% with Illumina). However, the value of using long read sequencing technology is that the spatial relationships of distal sequences can be determined (Range et al., 2018).

This study aims to understand the taxonomic identity and carbon metabolic pathways of estuarine microbial communities by applying Nanopore long read sequencing technology to water samples collected from the Neuse River Estuary on March 8th and October 4th of 2018 (Fig 1). Specifically, we asked:

1) Which bacterial taxa are found in the Neuse River Estuary?

2) What functional potential exists within the bacterial community’s metagenome?

3) How does long read sequencing technology improve our understanding of the metabolic pathways that can alter carbon in an ecosystem?

The resulting data set is the first metagenomic analysis of bacterial communities in the NRE. It provides insight into the bacterial taxa and functional mechanisms driving carbon metabolism in this biogeochemically important estuarine system.

Figure 1. 2018 sampling sites. The orange circles mark where the two samples were collected. Sample NR180 was collected on March 4th 2018 and sample PS9 was collected on October 8th 2018. Image provided by the Paerl lab at the Institute of Marine Sciences.

METHODS

Sample collection

Sample NR180 was collected from ModMon (Paerl et al., 2011; Paerl et al., 2007) station 180 on March 8th 2018. Sample PS9 was collected on October 4th 2018 from station 9 in the Pamlico Sound (Paerl et al., 2007). Surface seawater (~1 m depth) was pumped through silicone tubing (Masterflex) and passed through a 142 mm, 3 µm polycarbonate filter followed by a 142 mm, 0.2 µm polyethersulfone filter. A total of 3 L of water was filtered for NR180 and 2.5 L for PS9. The 0.2 µm filter was folded into a Whirl-Pak bag, flash frozen in liquid nitrogen in the field, and then stored at -80 °C in the laboratory.

DNA extraction

The frozen filter was fragmented with a rubber mallet while in the Whirl-Pak bag, and the pieces were then added to a 50 mL bead bashing tube from the DNeasy PowerMax Soil DNA extraction kit from Qiagen. Internal standards were added directly to the bead tubes before continuing with the extraction and included the genomes of Thermus thermophilus HB27, Deinococcus radiodurans ATCC13939, and Blautia producta ATCC2734 (each added to the extraction tube individually from a separate stock). For the NR180 sample, the internal standards were added in the following amounts: B. producta: 3.83 ng, D. radiodurans: 1.62 ng, and T. thermophilus: 2.98 ng. For the PS9 sample, internal standards were added in the following amounts: B. producta: 15.2 ng, D. radiodurans: 7.4 ng, and T. thermophilus: 15.9 ng. The bead bashing step was performed for 5 minutes. For the PS9 filter, prior to the bead bashing step, the tube with the filter and the internal standards was placed in a 55 °C water bath for 5 minutes to improve cell lysis and DNA recovery. The remainder of the extraction followed the PowerMax Soil kit protocol. A genomic DNA clean and concentrator kit from Zymo Research was used to concentrate the large elution volume to 60 µL for NR180 and 100 µL for PS9 in the kit’s elution buffer. NR180’s extracted DNA concentration was assessed with a Quant-iT PicoGreen dsDNA Assay Kit from Invitrogen and was found to be 21.6 ng/µL. PS9’s extracted DNA was assessed on a spectrophotometer and found to be at a concentration of 120 ng/µL with an A260/280 of 1.86.

DNA sequencing with Oxford Nanopore Technologies

Nanopore sequence libraries for each sample were prepared with the 1D Genomic DNA by ligation kit (SQK-LSK108) (Oxford Nanopore, London, UK). All steps for end prep, adaptor ligation, Ampure XP bead binding, and sequencing were followed as indicated in the protocol. The library preps started with 1 µg of DNA from NR180 and 2 µg of DNA from PS9. After losses during library prep, 0.47 µg of prepared library from NR180 and 1.2 µg from PS9 were each loaded into a MinION 107 flow cell. The 48-hour sequencing workflow was run using MinKNOW software.

Sequence processing and Quality control

Basecalling to convert the raw fast5 files from the MinKNOW software to fastq files was performed with Albacore v2.2.7 (Oxford Nanopore Technologies). Default parameters were used except for quality filtering, which was turned off (--disable_filtering). The sequences were checked for quality and read length distribution with NanoPlot (De Coster et al., 2018). Porechop was used to remove sequencing adaptors (rrwick, github). The first and last 50 base pairs of each read were trimmed off with NanoFilt (--headcrop 50 --tailcrop 50) (De Coster et al., 2018). Reads with an average quality score <8 or a length <100 bp were removed with NanoFilt (-l 100 -q 8) (De Coster et al., 2018). The filtered sequences were then checked again for quality and read length distribution with NanoPlot (Fig 2).
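The trimming and filtering criteria described above can be sketched in a few lines of Python. This is only an illustration of the logic and not NanoFilt's own code: the function names and the record layout are hypothetical, and averaging quality through per-base error probabilities is an assumption about how the average read quality is defined.

```python
import math
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (read name, bases, Phred+33 quality string)
FastqRecord = Tuple[str, str, str]


def mean_quality(quality: str) -> float:
    """Average Phred score, computed by averaging per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - 33) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def crop(record: FastqRecord, head: int = 50, tail: int = 50) -> FastqRecord:
    """Remove the first `head` and last `tail` bases (and their qualities)."""
    name, seq, qual = record
    end = len(seq) - tail
    return name, seq[head:end], qual[head:end]


def quality_filter(records: Iterable[FastqRecord],
                   min_length: int = 100,
                   min_quality: float = 8.0) -> Iterator[FastqRecord]:
    """Crop each read, then keep it only if it is long and accurate enough."""
    for record in records:
        name, seq, qual = crop(record)
        # The length check runs first, so empty reads never reach mean_quality.
        if len(seq) >= min_length and mean_quality(qual) >= min_quality:
            yield name, seq, qual


# Toy reads: '+' encodes Q10 and '&' encodes Q5 in Phred+33.
reads = [
    ("long_good", "A" * 300, "+" * 300),
    ("too_short", "A" * 120, "+" * 120),
    ("low_quality", "A" * 300, "&" * 300),
]
kept = list(quality_filter(reads))
print([name for name, _, _ in kept])
```

In this toy input only the first read passes: the second is shorter than 100 bp once 50 bases are removed from each end, and the third has an average quality below 8.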

Read annotation

Reads were aligned to the NCBI RefSeq Protein database v84 with DIAMOND v0.9.21.122 (Buchfink et al., 2015) using the following parameters to account for long read lengths: -F 15 (enables frameshift-aware alignment for lenience with long, error-prone reads), --range-culling, --top 10, and -f 100 (produces a DAA formatted file). The DIAMOND DAA files were then loaded into MEGAN6, which uses a naive lowest common ancestor (LCA) algorithm to assign each read to a single taxonomic identity out of the best hits for each read (Huson, 2018) (Fig 2). The percent to cover parameter was set to 70, indicating that the alignments supporting a taxonomic assignment must cover at least 70% of the read (Fig 3). MEGAN6 was used to view taxonomic diversity and functional profiles within each sample. Two files for converting NCBI RefSeq accession numbers to taxonomic names and functional annotations were downloaded from the MEGAN downloads page: the November 2018 version of the prot_acc2tax.abin file was used for taxonomic characterization, and the October 2016 acc2eggnog.abin file was used for functional grouping of annotations into Clusters of Orthologous Genes (COGs). Reads given functional annotations were extracted as GFF files for further analysis of gene organization on the long reads.
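For reference, the DIAMOND parameters listed above can be assembled into a single command line. The sketch below only builds the argument list; the helper name, the database file name, and the input and output file names are hypothetical placeholders, and the exact invocation used in this study is not reproduced here.

```python
from typing import List


def diamond_blastx_command(reads_fastq: str, database: str, output_daa: str) -> List[str]:
    """Build a DIAMOND blastx argument list with the long-read settings from the text."""
    return [
        "diamond", "blastx",
        "--query", reads_fastq,   # quality-filtered nanopore reads
        "--db", database,         # DIAMOND-formatted protein database
        "--out", output_daa,      # alignment output file
        "-F", "15",               # frameshift-aware alignment
        "--range-culling",        # report hits along the whole length of a long read
        "--top", "10",            # keep hits scoring within 10% of the best hit
        "-f", "100",              # DAA output format
    ]


# Hypothetical file names, used for illustration only.
cmd = diamond_blastx_command("NR180_final.fastq", "refseq_protein.dmnd", "NR180.daa")
print(" ".join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True) on a machine with DIAMOND installed.
```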

Figure 2. Bioinformatics workflow. 1. The fast5 files output by the MinKNOW software from the sequencing run were converted into fastq files with Albacore. 2. Adaptors were trimmed off of all of the reads in the fastq files with Porechop. NanoPlot was used to perform quality control checks at each step of the read processing and quality filtering. 3. NanoFilt was used to trim the first and last 50 base pairs from each read and to remove reads below 100 bp and reads with an average Q score below 8. 4. The reads were annotated with a DIAMOND blastx alignment against the NCBI RefSeq database with parameters specific for long reads generated by nanopore sequencing, with DAA files as the output. 5. When importing the files to MEGAN, there is an option to “Meganize” the file. Meganizing the DAA files is a fast and efficient way to convert the DAA files into a visual form that can be explored with the MEGAN6 GUI, which assigns reads with the naive lowest common ancestor algorithm.

RESULTS AND DISCUSSION

Internal standards

Internal standards have been used to estimate absolute abundances of taxa genomes in a metagenomic sample (Satinsky et al., 2013). Internal standards were added to both samples with the expected outcome of recovering multiple genomes from each bacterium added. After quality filtering, the number of reads from all three bacteria, if present at all, was below 20, which is too low to make use of them as estimators of absolute abundances. Therefore, they were not used for this study’s analysis.

In order to work with internal standards in future metagenomes, changes must be made. Eutrophic environments like the NRE are productive regions, and because the DNA yields for this study were high, the proportion of internal standard to environmental DNA could be increased. Sequencing depth could also be increased; however, for this experiment the flow cell was run to its maximum sequencing capacity. Thus, depth could not have been increased with just one flow cell.
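The proportion of internal standard to environmental DNA can be roughly gauged from the amounts reported in the Methods. The sketch below is an illustrative back-of-the-envelope calculation, not part of the original analysis: it assumes that the concentrated eluate (concentration multiplied by final volume) represents the total recovered DNA and that the added standards were recovered completely, and the function and variable names are hypothetical.

```python
# Amounts transcribed from the Methods; the calculation itself is an assumption.
SAMPLES = {
    "NR180": {"standards_ng": [3.83, 1.62, 2.98], "conc_ng_per_ul": 21.6, "volume_ul": 60},
    "PS9": {"standards_ng": [15.2, 7.4, 15.9], "conc_ng_per_ul": 120, "volume_ul": 100},
}


def standard_mass_percent(sample: dict) -> float:
    """Added internal standard DNA as a percentage of total recovered DNA."""
    total_dna_ng = sample["conc_ng_per_ul"] * sample["volume_ul"]
    return 100 * sum(sample["standards_ng"]) / total_dna_ng


for name, sample in SAMPLES.items():
    print(f"{name}: {standard_mass_percent(sample):.2f}% of recovered DNA by mass")
```

Under these assumptions the standards make up well under 1% of the DNA mass in both samples.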

Sequencing statistics

The sequencing run for NR180 returned ~700,000 reads (Table 1). After quality filtering for reads of Q score 8 or more, 39.8% of those sequences remained for downstream analysis (Table 1). The sequencing run for PS9 returned ~200,000 reads, with 61.5% remaining after quality filtering for downstream analysis (Table 1). The total base pairs sequenced for each sample (accounting for read length and number of reads) equated to ~700 to 800 Mbp (Table 1).
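The fraction of reads retained after quality filtering follows directly from the read counts in Table 1. A minimal check, with hypothetical variable and function names:

```python
# (raw reads, reads after quality filtering), transcribed from Table 1
READ_COUNTS = {
    "NR180": (678_655, 269_884),
    "PS9": (211_010, 129_837),
}


def percent_retained(raw_reads: int, filtered_reads: int) -> float:
    """Percentage of raw reads that remain after quality filtering."""
    return 100 * filtered_reads / raw_reads


for sample, (raw, kept) in READ_COUNTS.items():
    print(f"{sample}: {percent_retained(raw, kept):.1f}% of reads retained")
```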

When assessing a metagenomic community, coverage is helpful for understanding how much of the whole community is represented by the sequenced metagenome (Rodriguez et al., 2014). Coverage is an estimate of the fraction of the actual environmental metagenome that was sequenced. Unfortunately, there are no quantitative computational methods to assess the coverage of a metagenomic data set from nanopore sequencing. Nonpareil curves are a computational method to estimate the coverage of a metagenomic data set based on read redundancy. The method is valuable for measuring metagenome coverage because it is independent of assembly, reference databases, and abundance distribution models (Rodriguez et al., 2014). However, the program requires data with Q scores greater than what is generally produced by nanopore sequencing. To address coverage, the sequencing effort of 0.7 to 0.8 Gbp in this study can instead be compared to that of other aquatic metagenomes for which coverage has been estimated as a function of sequencing effort. Nonpareil curves determined that 1 Gbp of sequencing effort covered half of the diversity of a sample from the Baltic Sea and that 0.5 Gbp of sequencing effort covered half of the diversity of a sample from Lake Lanier (Rodriguez et al., 2014). If it is assumed that the estuarine samples are of similar complexity to these other aquatic samples, then it is unlikely that the 0.7 to 0.8 Gbp of sequencing effort in this study captures the entire metagenome. Compared to the sequencing efforts of Rodriguez et al., our study could have a roughly estimated coverage of 20% to 70%.

Table 1. Sequencing statistics

Sample                                    NR180                PS9
# of raw reads                            678,655              211,010
# of reads after quality filtering        269,884              129,837
# of bases (Mbp)                          732                  811
Average read Q score                      9.0                  9.4
Maximum Q score                           15.5 (139 bp long)   13.8 (199 bp long)
Average read length (bp)                  3,005                5,637
Max read length (bp)                      42,191 (Q 8.6)       59,953 (Q 9.3)
# of reads annotated by DIAMOND blastx    157,205              91,387

Sequence quality

The quality of sequence data generated by nanopore sequencing is important to consider for data analysis and interpretation. An average Q score of 9 for the NR180 sample means that 12.6% of the bases on a read are expected to be incorrectly called, and an average Q score of 9.4 for PS9 translates to 11.5% incorrectly called bases (Table 1). When interpreting downstream annotations of these reads, it is therefore important to consider that 10% to 15% of the bases are incorrect. The alignment of these reads to a reference database for identifying taxa and functional genes relies on computational programs that take the higher error rate into account (Arumugam et al., 2019). These sequencing errors include erroneous insertions and deletions (indels). Nanopore indels are accounted for using frameshift-aware alignment techniques like those provided in the long-read setting of DIAMOND (Arumugam et al., 2019). Using these settings has been shown to improve the accuracy of metagenome annotations, and the long-read DIAMOND settings account for the fact that multiple genes may be encoded on a single read (Arumugam et al., 2019).
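The conversion from a Q score to an expected error rate uses the standard Phred relationship, P = 10^(-Q/10). A minimal sketch (the function name is hypothetical) that applies it to the average Q scores in Table 1:

```python
def phred_to_error_rate(q_score: float) -> float:
    """Expected probability that a base is called incorrectly, P = 10^(-Q/10)."""
    return 10 ** (-q_score / 10)


# Average read Q scores from Table 1
for sample, q in {"NR180": 9.0, "PS9": 9.4}.items():
    print(f"{sample}: Q{q} corresponds to {100 * phred_to_error_rate(q):.1f}% expected errors")
```

The same function gives 1% for Q20 and 0.1% for Q30, which is why short-read platforms with typical Q scores of 30 or more are described as far more accurate per base.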

The output of a long-read DIAMOND blastx alignment is such that, on a single long read, one section can be aligned to several different taxa (Fig 3). Algorithms account for this possibility, assigning a single taxonomic identification to each individual long read and also determining functional annotations for each gene within a long read (Huson et al., 2018). Some reads cannot be resolved to the species level and are therefore assigned to higher taxonomic levels (Huson et al., 2018).

[Figure 3 diagram: a long read with sections of reference alignments from Species A (covering 80% of the read), B (60%), C (60%), D (40%), E (20%), F (20%), G (20%), and H (20%); at the genus level, Genus P covers 100% of the read and Genus Q covers 20%; at the family level, Family R covers 100%.]

Figure 3. Example figure of how MEGAN6’s LCA algorithm assigns reads to a taxonomic identification from the list of best hits returned by DIAMOND blastx. 100% of the alignments that span the long read (in blue) belong to Family R and Genus P. 20% of the read is covered with alignments that fit within Genus Q. Within Genus P and Q, sequences from different species within the RefSeq database have aligned to the read. The read is assigned to Species A because that is the alignment that covers more than the required cut off of 70%. Other species alignments cover 60% or less; therefore, the read is not identified as belonging to any of the other species that aligned to the read. Because Species A is the alignment that meets the threshold cut off, this read is identified as belonging to Species A. Species A is considered the long read’s “read assignment.” Figure adapted from Huson 2018.
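The assignment rule illustrated in Figure 3 can be sketched in Python. This is a simplified illustration of the coverage threshold idea and is not MEGAN6's implementation: the function names, the read coordinates, and the grouping of Species A to E under Genus P and Species F to H under Genus Q are assumptions chosen to reproduce the coverage values in the figure.

```python
from collections import defaultdict
from typing import List, Tuple

# (lineage from highest to lowest rank, start, end), half-open read coordinates
Alignment = Tuple[Tuple[str, ...], int, int]


def covered_length(intervals: List[Tuple[int, int]]) -> int:
    """Number of read positions covered by the union of half-open intervals."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end)
        if end > start:
            total += end - start
            current_end = end
    return total


def assign_read(alignments: List[Alignment], read_length: int, min_cover: float = 0.70) -> str:
    """Return the lowest-rank taxon whose alignments cover at least min_cover of the read."""
    intervals_by_taxon = defaultdict(list)
    depth = 0
    for lineage, start, end in alignments:
        depth = max(depth, len(lineage))
        for rank, taxon in enumerate(lineage):
            intervals_by_taxon[(rank, taxon)].append((start, end))
    # Walk from the lowest rank (species) up toward the highest (family).
    for rank in range(depth - 1, -1, -1):
        passing = [taxon for (r, taxon), ivs in intervals_by_taxon.items()
                   if r == rank and covered_length(ivs) / read_length >= min_cover]
        if len(passing) == 1:
            return passing[0]
        # If no taxon, or more than one, passes at this rank, try the next rank up.
    return "unassigned"


# Hypothetical coordinates on a 100 bp read that reproduce the Figure 3 coverages.
P, Q = ("Family R", "Genus P"), ("Family R", "Genus Q")
example = [
    (P + ("Species A",), 0, 80), (P + ("Species B",), 40, 100),
    (P + ("Species C",), 20, 80), (P + ("Species D",), 0, 40),
    (P + ("Species E",), 80, 100), (Q + ("Species F",), 0, 20),
    (Q + ("Species G",), 0, 20), (Q + ("Species H",), 0, 20),
]
print(assign_read(example, read_length=100))                   # 70% threshold
print(assign_read(example, read_length=100, min_cover=0.90))   # stricter threshold
```

With the 70% threshold only Species A qualifies at the species level, so the read is assigned to it. Raising the threshold to 90% leaves no qualifying species, and the sketch falls back to Genus P, which mirrors how reads that cannot be resolved to the species level are assigned to higher taxonomic levels.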

Community composition

In both samples, 99% of reads were annotated as Bacteria, 1% as Archaea, and none as viruses. Proteobacteria was the most abundant phylum in both samples, with just over half of all reads assigned to this phylum (Table 2). The next most abundant phyla were Bacteroidetes and Actinobacteria, each of which averaged about 10% of all reads across the two samples (Table 2). Within Proteobacteria, Alphaproteobacteria and Gammaproteobacteria were the most abundant classes, making up on average 67% and 20% of the Proteobacteria, respectively (Table 3).

The two samples, NR180 and PS9, had similar community composition and relative abundances of individual taxa. Both samples contain the same Proteobacterial orders (Figs 4 and 5). The most abundant families have similar relative abundances in the two samples (Table 4). The third and fourth most abundant families, Flavobacteriaceae and Rhodobacteraceae, have relative abundances of ~3% (Table 4). The most prevalent family found in both samples was the Pelagibacteraceae, an Alphaproteobacteria family, although it was more abundant in the NR180 sample (Table 4).

A few notable differences were also found between the two samples when comparing relative abundances of read assignments. The relative abundance of Betaproteobacteria is about twice as high in PS9 as in NR180 (Table 3). At the family level, Pelagibacteraceae account for 10 percentage points more of the reads in NR180 than in PS9 (Table 4). The relative abundance of the marine Cyanobacteria family Synechococcaceae is about twice as high in PS9 as in NR180 (Table 4). Sample PS9 includes Cyanobacteria from the family Leptolyngbyaceae in the top 20 most abundant families (Table 4). This family was also found in NR180 but at a much lower relative abundance, having only 32 reads.
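Because the two samples have different numbers of assigned reads, these comparisons are made on relative abundances rather than raw read counts. A minimal sketch using counts transcribed from Table 4 (the dictionary layout and function name are hypothetical):

```python
# Read counts transcribed from Table 4
TABLE4 = {
    "NR180": {"total": 157_250, "Pelagibacteraceae": 43_301, "Synechococcaceae": 2_426},
    "PS9": {"total": 81_847, "Pelagibacteraceae": 14_336, "Synechococcaceae": 2_934},
}


def relative_abundance(sample: str, family: str) -> float:
    """Percentage of a sample's assigned reads that belong to the given family."""
    counts = TABLE4[sample]
    return 100 * counts[family] / counts["total"]


# Difference in percentage points for Pelagibacteraceae
pelagi_diff = (relative_abundance("NR180", "Pelagibacteraceae")
               - relative_abundance("PS9", "Pelagibacteraceae"))
# Fold difference for Synechococcaceae
synecho_ratio = (relative_abundance("PS9", "Synechococcaceae")
                 / relative_abundance("NR180", "Synechococcaceae"))
print(f"Pelagibacteraceae: {pelagi_diff:.1f} percentage points higher in NR180")
print(f"Synechococcaceae: {synecho_ratio:.1f}-fold higher in PS9")
```

Note that the raw Synechococcaceae read counts differ by far less than twofold between the samples; the roughly twofold difference appears only after normalizing to each sample's total.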

The environmental conditions on October 8th may explain some of the differences between the two samples. PS9 was collected shortly after Hurricane Florence, which caused heavy rains on the coast and inland of North Carolina. Increased nutrient fluxes into the estuary are associated with rain events (Paerl et al., 2010). This could have elevated estuary nutrient levels for PS9, which could support more phytoplankton growth and explain the observed increase in Cyanobacteria in PS9. The increase in Betaproteobacteria and decrease in Pelagibacteraceae in PS9 could be due to the influx of freshwater associated with the large amount of rain after Hurricane Florence. Freshwater ecosystems have been shown to have higher abundances of Betaproteobacteria and lower abundances of Pelagibacter compared to marine ecosystems (Fortunato and Crump, 2015). Therefore, we hypothesize that the observed relative abundances of these taxa in PS9 are likely a result of the recent freshwater input.

The community composition and abundant taxonomic groups in samples PS9 and NR180 have similarities to what has been observed in other aquatic metagenomes. Consistent with these two samples, the Tara Oceans project reported that the most abundant taxa in coastal and open ocean surface waters are Alphaproteobacteria, followed by Gammaproteobacteria (Sunagawa et al., 2015). Bacteroidetes and Actinobacteria were also found in surface waters from the Tara Oceans project (Sunagawa et al., 2015). Six of the eight families reported from a coastal metagenome study near the NRE were also observed in these two samples (Ward et al., 2017). Freshwater samples also have the same taxa in the metagenome as these estuarine samples, although they appear in different relative abundances (Fortunato and Crump, 2015). Taken together, these observations suggest that these NRE samples have similar bacterial community taxa to many different aquatic environments at the broader taxonomic levels of order or family, although the relative abundances of individual taxa differ between ecosystems.

The cyanobacteria families in the estuarine metagenomes show one difference from coastal and open ocean communities. The cyanobacterium Prochlorococcus is widely distributed and abundant in coastal and open oceans (Flombaum et al., 2013). Worldwide, it is more abundant than Synechococcus (Flombaum et al., 2013). However, NR180 and PS9 had no Prochlorococcus reads, while Synechococcaceae was one of the most abundant families in both samples (Table 4). Synechococcaceae was also the only reported cyanobacterial family in the coastal study near the NRE (Ward et al., 2017).

There are also noticeable differences between the Neuse and Pamlico Sound metagenomes and other aquatic communities, especially when comparing the relative abundances of taxa. Archaea have a presence in the metagenomes of coastal ocean and open ocean communities, while they are not an abundant taxon in the two metagenomes from this study (Fortunato and Crump, 2015; Sunagawa et al., 2015). Archaea also do not have a presence in the estuarine and freshwater samples of the Columbia River estuarine and coastal aquatic metagenomes (Fortunato and Crump, 2015). The relative abundances of Betaproteobacteria and Actinobacteria in this study are comparable to those in the Columbia River Estuary (Fortunato and Crump, 2015). Betaproteobacteria and Actinobacteria are more abundant in freshwater than in the coastal ocean, but the estuarine relative abundances of these groups fall between the reported values for freshwater and marine samples (roughly 10% for both) (Fortunato and Crump, 2015). Future work is needed to address the underlying factors driving these observations. Are estuarine microbial communities unique to their ecosystems, or are they a mix of freshwater and marine communities? Or do estuarine environments select the dominant taxa based on their physicochemical conditions? Different types of carbon compounds could contribute to environmental selection of the microbial taxa that have the metabolic ability to degrade the available carbon compounds. Future experiments that sample the riverine beginning of the NRE, the intersection of the NRE with the Pamlico Sound, and the coastal ocean waters that influence the NRE, and that compare microbial community composition and carbon processing pathways among these sites, could help address these questions.


Table 2. The top 10 most abundant phyla and the relative abundances of the total number of reads assigned to them out of the total number of reads analyzed in the sample

Phylum                   NR180 (% of total)   PS9 (% of total)
Proteobacteria           56.8                 53.0
Actinobacteria           9.6                  9.7
[phylum name missing]    8.3                  12.1
Verrucomicrobia          2.3                  2.3
Cyanobacteria            2.0                  5.2
Firmicutes               0.6                  1.4
Planctomycetes           0.4                  2.5
Balneolaeota             0.3                  0.3
Chloroflexi              0.1                  0.4
Euryarchaeota            0.1                  0.2
other                    0.4                  1.8
unassigned               14.2                 11.2

Table 3. The most abundant proteobacteria classes and the relative abundances of the total number of reads assigned to them out of the total number of reads analyzed in the sample

Proteobacteria class     NR180 (% of total)   PS9 (% of total)
Alphaproteobacteria      73.4                 61.2
Gammaproteobacteria      19.0                 20.0
Betaproteobacteria       6.7                  14.8
Deltaproteobacteria      0.52                 3.1
Epsilonproteobacteria    0.20                 0.2
Oligoflexia              0.1                  0.5
Acidithiobacillia        0.01                 0.04
Zetaproteobacteria       0.008                0.04

Table 4. Top 20 most abundant families and the number of reads assigned to them

Rank  NR180 family              # assigned reads  % of total   PS9 family                 # assigned reads  % of total
      total # of reads          157,250           100          total # of reads           81,847            100
1     Pelagibacteraceae         43301             27.5         Pelagibacteraceae          14336             17.5
2     Microbacteriaceae         7737              4.3          Synechococcaceae           2934              3.6
3     Rhodobacteraceae          4707              3.0          Flavobacteriaceae          2870              3.5
4     Flavobacteriaceae         4273              2.7          Rhodobacteraceae           2507              3.1
5     Synechococcaceae          2426              1.5          Ilumatobacteraceae         2397              3.0
6     Rhodospirillaceae         1791              1.1          Planctomycetaceae          1400              1.7
7     Halieaceae                1303              0.8          Halieaceae                 1050              1.3
8     Puniceicoccaceae          1245              0.8          Rhodospirillaceae          678               0.8
9     Ilumatobacteraceae        710               0.5          Comamonadaceae             643               0.8
10    Porticoccaceae            654               0.4          Crocinitomicaceae          613               0.8
11    Mycobacteriaceae          567               0.4          Microbacteriaceae          437               0.5
12    Francisellaceae           530               0.3          Haliscomenobacteraceae     326               0.4
13    Planctomycetaceae         522               0.3          Burkholderiaceae           325               0.4
14    Balneolaceae              411               0.3          Leptolyngbyaceae           306               0.4
15    Pseudomonadaceae          393               0.3          Verrucomicrobiaceae        274               0.3
16    Hyphomonadaceae           366               0.2          Verrucomicrobia subDiv 3   266               0.3
17    Streptomycetaceae         362               0.2          Pseudomonadaceae           248               0.3
18    Haliscomenobacteraceae    359               0.2          Opitutaceae                248               0.3
19    Comamonadaceae            302               0.2          Balneolaceae               229               0.3
20    Opitutaceae               302               0.2          Sphingomonadaceae          221               0.3
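The percentages in Tables 2 to 4 follow from dividing the number of reads assigned to a taxon by the total number of reads analyzed in the sample. The short Python sketch below reproduces that arithmetic for the first few families of Table 4. The function and variable names are illustrative assumptions and are not part of MEGAN6 or of the pipeline used in this study.

```python
# Read counts copied from Table 4 (a few of the most abundant families per sample).
TOTAL_READS = {"NR180": 157_250, "PS9": 81_847}
FAMILY_READS = {
    "NR180": {
        "Pelagibacteraceae": 43_301,
        "Rhodobacteraceae": 4_707,
        "Synechococcaceae": 2_426,
    },
    "PS9": {
        "Pelagibacteraceae": 14_336,
        "Synechococcaceae": 2_934,
        "Flavobacteriaceae": 2_870,
    },
}


def relative_abundance(assigned_reads: int, total_reads: int) -> float:
    """Percent of all analyzed reads that were assigned to one taxon."""
    return 100.0 * assigned_reads / total_reads


def abundance_table(sample: str) -> dict:
    """Relative abundance (percent, one decimal) for each family in a sample."""
    total = TOTAL_READS[sample]
    return {
        family: round(relative_abundance(count, total), 1)
        for family, count in FAMILY_READS[sample].items()
    }


if __name__ == "__main__":
    for sample in TOTAL_READS:
        print(sample, abundance_table(sample))
```

Because the denominator is the total number of reads analyzed, including unassigned reads, the percentages for the assigned taxa in a sample are not expected to sum to 100.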

Figure 4. Tree depicting the evolutionary relationships of the different taxonomic classifications of NR180 reads within the Proteobacteria. Circles scale with the number of reads assigned to each family. Figure generated in MEGAN6.

[Figure 5 clade labels: Alphaproteobacteria; Betaproteobacteria; delta/epsilon subdivisions; Gammaproteobacteria; Zetaproteobacteria]

Figure 5. Tree depicting the evolutionary relationships of the different taxonomic classifications of PS9 reads within the Proteobacteria. Circles scale with the number of reads assigned to each family. Figure generated in MEGAN6.

Functional Composition

MEGAN6, coupled with the EggNOG/COG database, groups the data into Clusters of Orthologous Genes (COGs), or functionally related genes, in order to explore the functional potential of the two metagenomes. MEGAN6 uses bit score comparisons and percent coverage from the DIAMOND alignment to assign each annotated gene a function (Huson et al., 2018).

Within a metagenome, only a fraction of the predicted genes will receive a COG annotation. Marine metagenomic studies have calculated the percentage of genes receiving COG annotations out of all of the predicted genes in the metagenome (Sunagawa et al., 2015; Quince et al., 2017). About 40% of all genes from those metagenomes received COG annotations (Sunagawa et al., 2015; Quince et al., 2017). This value gives a rough indication of how much information the EggNOG/COG database can provide for an aquatic metagenome.

At the broad level of cellular function and metabolism, the similarities between the two samples indicate that basic metabolic and cellular machinery functions are present in all bacteria (Table 5 and Table 6). The majority of COG assignments were to the metabolism category, which is reasonable because bacteria spend much of their energy transporting and metabolizing molecules from the environment (Table 6). The similarities between the two samples’ broad biological functions suggest that the metagenomic data set provides an accurate snapshot of the community’s functional roles.

Table 5. Broad COG categories and the number of reads with a COG assigned to them. Numbers represent the number of reads with a COG assignment that falls into each category. If two of the same COG assignment appear on one read, it is counted twice.

COG category                         NR180 # of reads   NR180 % of reads   PS9 # of reads   PS9 % of reads
                                     in category        assigned           in category      assigned
Total reads assigned                 149974             100.0              47,500           100.0
Information Storage and Processing   35402              23.6               10,203           21.5
Cellular Processes and Signaling     30362              20.2               9796             20.6
Metabolism                           84210              56.1               27,501           57.9
Not assigned                         69327                                 35,595
No hits                              18642                                 9,178

Table 6. COG Metabolism categories and the number of gene annotations in each category. Numbers represent the number of reads with a COG assignment that falls into each category. If two of the same COG assignment appear on one read, it is counted twice.

COG category                                                       NR180 # of reads   NR180 % of reads   PS9 # of reads   PS9 % of reads
                                                                   in category        assigned           in category      assigned
Total assigned                                                     84210              100.0              27501            100.0
[C] Energy production and conversion                               17250              20.5               5940             21.6
[E] Amino acid transport and metabolism                            23470              27.9               7296             26.5
[F] Nucleotide transport and metabolism                            8111               9.6                2788             10.1
[G] Carbohydrate transport and metabolism                          8599               10.2               2789             10.1
[H] Coenzyme transport and metabolism                              8114               9.6                1972             7.2
[I] Lipid transport and metabolism                                 7416               8.8                2656             9.7
[P] Inorganic ion transport and metabolism                         7573               9.0                2950             10.7
[Q] Secondary metabolites biosynthesis, transport and catabolism   3677               4.4                1110             4.0
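The counting rule stated in the captions of Tables 5 and 6 (a COG that appears twice on one read is counted twice) can be illustrated with a small tally. The sketch below is a minimal illustration in Python; the read names, COG identifiers, and category letters in the example data are hypothetical placeholders, not values from the metagenomes.

```python
from collections import Counter

# Hypothetical annotations as (read id, COG id, COG category letter).
# The identifiers are placeholders chosen only to show the counting rule.
EXAMPLE_ANNOTATIONS = [
    ("read_1", "COG_A", "G"),
    ("read_1", "COG_A", "G"),  # same COG again on the same read: counted twice
    ("read_1", "COG_B", "G"),
    ("read_2", "COG_C", "E"),
]


def count_by_category(annotations):
    """Tally annotations per category without collapsing duplicates.

    Returns {category: (count, percent of all counted annotations)}.
    """
    counts = Counter(category for _read, _cog, category in annotations)
    total = sum(counts.values())
    return {
        category: (n, round(100.0 * n / total, 1))
        for category, n in counts.items()
    }


if __name__ == "__main__":
    print(count_by_category(EXAMPLE_ANNOTATIONS))
```

Under this rule the counts describe annotations rather than distinct reads, so a single long read carrying several genes contributes to the totals several times.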

Glucose metabolism: glycolysis and the pentose phosphate pathway

We examined the reads for enzymes involved in glucose metabolism in order to assess the legitimacy of the metagenome annotations, given that glucose is an important carbon and energy source for cells. Additionally, glucose metabolism pathways are highly prevalent in bacterial genomes and well annotated (Jurtshuk, 1996). Glycolysis enzymes are considered part of the core metabolism, that is, the metabolic capabilities present in all living cells. For these reasons, complete or nearly complete pathways are an expected finding for these metagenome samples.

Annotations for all 10 enzymes in glycolysis and all 7 enzymes in the pentose phosphate pathway were found within the read annotations for both samples (Table 1A and 2A). Finding evidence for two pathways that are commonly used by bacteria for glucose metabolism provides reassurance that the data, as processed through the chosen bioinformatics pipeline, reasonably reflect the bacterial community’s functions.
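The completeness check described above amounts to asking whether every enzyme of a pathway has at least one COG annotation in a sample. A minimal Python sketch of that check is given here, using the glycolysis COG identifiers and the NR180 counts from Table 1A. The function name and data layout are assumptions made for illustration and do not reproduce the actual analysis scripts.

```python
# Glycolysis enzymes and their COG identifiers, as listed in Table 1A.
GLYCOLYSIS = {
    "glucokinase": "COG0837",
    "phosphoglucose isomerase": "COG0166",
    "phosphofructokinase": "COG0205",
    "aldolase": "COG0191",
    "triose phosphate isomerase": "COG0149",
    "glyceraldehyde-3-phosphate dehydrogenase": "COG0057",
    "phosphoglycerate kinase": "COG0126",
    "phosphoglycerate mutase": "COG0406",
    "enolase": "COG0148",
    "pyruvate kinase": "COG0469",
}
GLYCOLYSIS_COGS = set(GLYCOLYSIS.values())

# Number of annotations per COG in NR180, copied from Table 1A.
NR180_COUNTS = {
    "COG0837": 7, "COG0166": 139, "COG0205": 76, "COG0191": 76,
    "COG0149": 70, "COG0057": 273, "COG0126": 173, "COG0406": 16,
    "COG0148": 235, "COG0469": 53,
}


def pathway_completeness(pathway_cogs, observed_counts):
    """Return (fraction of pathway COGs observed, sorted list of missing COGs)."""
    found = {cog for cog in pathway_cogs if observed_counts.get(cog, 0) > 0}
    missing = sorted(set(pathway_cogs) - found)
    return len(found) / len(pathway_cogs), missing


if __name__ == "__main__":
    print(pathway_completeness(GLYCOLYSIS_COGS, NR180_COUNTS))
```

A presence test of this kind says nothing about whether the genes co-occur in the same organism; it only confirms that each step is represented somewhere in the metagenome.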

Carbohydrate metabolism: isomerase enzymes

Phytoplankton have been shown to exude simple sugar molecules, considered to be the most bioavailable sources of energy, into the environment (Thornton, 2014). Because this is a potentially labile source of dissolved organic carbon to the ecosystem, we looked for evidence of carbohydrate metabolism in the metagenomes. Enzymes that belong to the isomerase family convert one isomer to another, rearranging molecules in a way that facilitates carbohydrate metabolism (Cuesta et al., 2014). The product and substrate have the same molecular formula, but the atoms are arranged differently. Isomerases are known to be involved in carbohydrate degradation, and some of the enzymes involved in glucose metabolism are isomerases (Cuesta et al., 2014). Not all isomerases are a part of the core metabolism (i.e., glycolysis). In particular, key isomerases involved in the degradation of carbohydrates were examined.

Uronic isomerase, xylose isomerase, and galactose 1-phosphate uridylyltransferase (which converts galactose into glucose) were all found in the metagenomes. Both samples had several read assignments of Flavobacteria for all three marker genes (Fig 6 and Figs 1A-5A). Flavobacteria are known to associate with phytoplankton blooms (Buchan et al., 2014), and phytoplankton have been shown to exude xylose, uronic acid, and galactose (Thornton, 2014). Finding the read annotations for predicted carbohydrate metabolism within Flavobacteria suggests that this group has the potential to participate in the recycling of phytoplankton-derived dissolved organic carbon.

Reads assigned to the same taxa show similar annotation patterns across the long reads and include other enzymes involved in carbohydrate processing. One example is a pair of reads, each containing a predicted uronic isomerase, that were assigned to Flavobacteria bacterium MS024-2A (Fig 6). Both reads are of similar length and both have the same annotation pattern for the genes surrounding uronic isomerase. A mannonate dehydratase enzyme (COG1312) annotation is followed by uronic isomerase (COG1904), then a tripartite transporter (COG3090), and a TRAP dicarboxylate transporter (COG1593) on the opposite strand (Fig 6). These transporters are known to participate in the uptake of organic acids and molecules. They could potentially transport uronic acid, the substrate for uronic isomerase, as it has a carboxylate group. Moreover, three reads in the NR180 metagenome with the same read assignment, Coraliomargarita akajimensis, have similar annotation patterns (Fig 6). A malate/L-lactate dehydrogenase (COG2055) is found in the neighborhood of uronic isomerase on all three reads assigned to C. akajimensis (Fig 6). For reads that span the same section of the same species’ genome, we would expect to see similar annotation patterns. This holds true for reads with uronic isomerase that have been assigned to the same species (Fig 6).
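The comparison of annotation patterns described above was made by inspecting the read maps, but the same idea can be expressed programmatically. The sketch below, in Python, treats each read as an ordered list of COG identifiers and asks whether two reads share the same genes immediately flanking a marker gene, allowing for a read sequenced in the opposite orientation. The example reads are hypothetical and only mirror the gene order described in the text; strand information is ignored for simplicity.

```python
MARKER = "COG1904"  # uronic isomerase


def neighborhood(cogs, marker=MARKER, flank=1):
    """Ordered COG ids within `flank` genes of the first marker occurrence."""
    i = cogs.index(marker)  # raises ValueError if the marker is absent
    return tuple(cogs[max(0, i - flank): i + flank + 1])


def same_pattern(read_a, read_b, marker=MARKER, flank=1):
    """True if both reads have the same genes around the marker.

    A read sequenced from the opposite strand lists its genes in reverse
    order, so the reversed neighborhood is also accepted.
    """
    a = neighborhood(read_a, marker, flank)
    b = neighborhood(read_b, marker, flank)
    return a == b or a == b[::-1]


# Hypothetical reads that mirror the gene order described in the text.
READ_A = ["COG1312", "COG1904", "COG3090", "COG1593"]
READ_B = ["COG0209", "COG1312", "COG1904", "COG3090", "COG1593"]
READ_C = ["COG2055", "COG1904", "COG0111"]

if __name__ == "__main__":
    print(same_pattern(READ_A, READ_B))  # same flanking genes
    print(same_pattern(READ_A, READ_C))  # different flanking genes
```

A flank of one gene on each side is a deliberately conservative choice: reads begin and end at arbitrary positions, so wider windows are more often truncated on one of the two reads.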

[Figure 6 image: two columns of read maps; axis shows position along the read in bp, 0 to 24,000. Read assignments in figure order: Flavobacteriaceae; Coraliomargarita akajimensis; Flavobacteria bacterium MS024-2A; Bacteroidetes; Coraliomargarita akajimensis; Gammaproteobacteria; Opitutaceae; Opitutae; Verrucomicrobia; Bacteria; Coraliomargarita akajimensis; Bacteria; Bacteria; Francisella; Flavobacteria bacterium MS024-2A; Cellvibrionales; Alphaproteobacteria]

COG1904_Uronic_isomerase; COG0111_Dehydrogenase; COG2055_malate_L-lactate_dehydrogenase; COG0209_Provides_the_precursors_necessary_for_DNA_synthesis._Catalyzes_the_bios...; COG1312_Catalyzes_the_dehydration_of_D-mannonate_(By_similarity); COG1638_Trap_dicarboxylate_transporter,_dctp_subunit; COG3090_Tripartite_ATP-independent_periplasmic_transporter_dctq_component; COG1593_trap_dicarboxylate_transporter_dctm_subunit; COG2133_Dehydrogenase; COG3264_mechanosensitive_ion_channel; COG1189_hemolysin_a; COG0033_phosphoglucomutase; COG3808_pump_that_utilizes_the_energy_of_pyrophosphate_hydrolysis_as_the_drivin...; COG3119_Sulfatase; COG0001_Glutamate-1-semialdehyde_aminotransferase; COG0564_pseudouridine_synthase_activity; COG0574_phosphotransferase_activity,_paired_acceptors; COG0848_Biopolymer_transport_protein_exbD_tolR; COG0811_MotA_TolQ_exbB_proton_channel; COG4401_chorismate_mutase; COG0761_Converts_1-hydroxy-2-methyl-2-(E)-butenyl_4-diphosphate_into_isopenteny...; COG1047_peptidylprolyl_cistrans_isomerase; COG2885_Ompa_motb_domain_protein; COG1393_arsenate_reductase_(glutaredoxin)_activity; COG1951_fumarate; COG0246_Mannitol_dehydrogenase; ENOG410YMJM_Outer_membrane_fimbrial; COG1804_l-carnitine_dehydratase_bile_acid-inducible_protein_F;

Figure 6. 19 reads from NR180 that include the functional annotation uronic isomerase and the surrounding COG annotations on the read. Rectangles represent genes. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, uronic isomerase, is colored black. Read assignments are below each read.


Aromatic metabolism: the homogentisate aromatic degradation pathway

The homogentisate pathway is a route for the metabolism of the aromatic amino acids phenylalanine and tyrosine for the purpose of energy production (Arias-Barrau et al., 2004). The homogentisate pathway has been shown to metabolize additional aromatic compounds in Pseudomonas putida (Arias-Barrau et al., 2004). This pathway has not been shown to be involved in the synthesis of amino acids or any other products, and it appears to be a strictly catabolic pathway. Aromatic metabolism is not a part of the core metabolism; thus, it can be used for investigating the environment. Because Roseobacters, a group of abundant marine and estuarine bacteria, are typically found with genes for the homogentisate degradation pathway in their genomes, we looked for annotations of this pathway in our samples (Newton et al., 2010).

Nineteen reads from NR180 and 15 from PS9 contained annotations for homogentisate 1,2-dioxygenase activity (COG3508). Homogentisate 1,2-dioxygenase is a ring-cleaving enzyme that opens the aromatic ring in the compound homogentisate (Arias-Barrau et al., 2004). Some of the longer reads include two or more COG annotations involved in the homogentisate pathway, based on pathway information from the KEGG database. A 14,827 bp read from NR180 has three enzymes in the homogentisate pathway annotated one after the other (Fig 7 and 8). Many of the reads from both samples have 4-hydroxyphenylpyruvate dioxygenase and homogentisate 1,2-dioxygenase next to each other, even reads assigned to different taxonomic groups (Fig 7). Of the 19 reads with homogentisate 1,2-dioxygenase from NR180, 13 had read assignments belonging to Bacteroidetes, the phylum which is composed mostly of Flavobacteria (Fig 7). PS9 had 6 of 14 read assignments belonging to Bacteroidetes or Flavobacteria (Fig 6A). According to the pathway in the KEGG database, tyrosine is first metabolized to 4-hydroxyphenylpyruvate (Fig 8). Tyrosine-like signals have shown up in fluorescence analysis of CDOM in oceanic waters (Yamashita et al., 2003). The observed pathways have the potential to provide a key to the way members of Bacteroidetes process aromatic amino acids found in the environment, like tyrosine.

When we examined the gene annotations on individual long reads, we noticed that there are often substantial gaps between the gene annotations (Fig 6-8). The process of mapping the RefSeq data to the EggNOG database requires that a COG or an EggNOG exists to represent a RefSeq annotation. Thus, not all RefSeq alignments from the DIAMOND blastx may be present in the visualization of the long reads. We looked at the original DIAMOND blastx output for the first read in Fig 7. There were no RefSeq annotations between the DNA polymerase and the next three annotations, a region with room for three more genes (Fig 7 and 8). Other explanations, such as sequencing errors in sections of reads large enough to prevent database alignments when running the DIAMOND blastx, or limited information in the RefSeq and EggNOG/COG databases, could account for this observation.
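The gaps discussed above were identified by eye on the read maps. As an illustration of how such unannotated stretches could be listed systematically, the Python sketch below reports every stretch of a read longer than a chosen threshold that is not covered by any annotated gene. The gene coordinates and the 1,000 bp threshold are hypothetical; only the read length of 14,827 bp is taken from the text.

```python
def annotation_gaps(read_length, genes, min_gap=1000):
    """List unannotated stretches of at least `min_gap` bp on a read.

    `genes` is a list of (start, end) positions in bp, 0-based and
    half-open. Overlapping genes are handled by tracking the furthest
    annotated position seen so far.
    """
    gaps = []
    cursor = 0  # end of the annotated region covered so far
    for start, end in sorted(genes):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if read_length - cursor >= min_gap:
        gaps.append((cursor, read_length))
    return gaps


# Hypothetical gene coordinates on a 14,827 bp read.
EXAMPLE_GENES = [(200, 2900), (6100, 7300), (7350, 8500), (8600, 9800)]

if __name__ == "__main__":
    print(annotation_gaps(14_827, EXAMPLE_GENES))
```

In this made-up example, the 3,200 bp stretch between the first and second gene would be reported, which is comparable in size to a region with room for about three genes of roughly 1 kb each.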

[Figure 7 image: two columns of read maps; axis shows position along the read in bp, 0 to 22,000. Read assignments in figure order: Bacteroidetes; Bacteria; Rhodobacteraceae bacterium HIMB11; Alphaproteobacteria; Flavobacteriaceae; Flavobacteriales; Rhodobacteraceae; Proteobacteria; Flavobacteriaceae; Flavobacteriales; Flavobacteriaceae; Flavobacterium; Flavobacteriaceae; Alphaproteobacteria; Bacteroidetes; Flavobacteriales; Bacteroidetes; Flavobacteriaceae; Bacteroidetes]

COG0749_DNA_polymerase; COG0179_Fumarylacetoacetate_hydrolase; COG3185_4-Hydroxyphenylpyruvate_dioxygenase; COG3508_homogentisate_1,2-dioxygenase_activity; COG0323_This_protein_is_involved_in_the_repair_of_mismatches_in_DNA._It_is_requ...; COG0590_deaminase; COG0449_Catalyzes_the_first_step_in_hexosamine_metabolism,_converting_fructose-...; COG4231_indolepyruvate_ferredoxin_oxidoreductase; COG0554_Key_enzyme_in_the_regulation_of_glycerol_uptake_and_metabolism_(By_simi...; ENOG4110K8B_Adenylate_guanylate_Cyclase; COG0189_Responsible_for_the_addition_of_glutamate_residues_to_the_C-terminus_of...; COG0162_Catalyzes_the_attachment_of_tyrosine_to_tRNA(Tyr)_in_a_two-step_reactio...; COG0451_Nad-dependent_epimerase_dehydratase; COG3483_Catalyzes_the_oxidative_cleavage_of_the_L-tryptophan_(L-_Trp)_pyrrole_r...; COG4547_Cobalt_chelatase,_pCobT_subunit; ENOG410XNMH_Histidine_kinase; COG0173_aspartyl-trna_synthetase; ENOG4111G1Y_This_protein_specifically_catalyzes_the_removal_of_signal_peptides_...; COG0322_The_UvrABC_repair_system_catalyzes_the_recognition_and_processing_of_DN...; ENOG410XQJQ_tonB-dependent_Receptor;

Figure 7. 19 reads from NR180 that include the functional annotation homogentisate 1,2-dioxygenase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, homogentisate 1,2-dioxygenase, is colored black. Read assignments are below each read.

[Figure 8 image: pathway map alongside a read map; axis shows position along the read in bp, 0 to 18,000]

4-hydroxyphenylpyruvate
  -> (1) 4-hydroxyphenylpyruvate dioxygenase, COG3185
homogentisate
  -> (2) homogentisate 1,2-dioxygenase, COG3508
4-maleylacetoacetate
  -> (3) maleylacetoacetate isomerase (enzyme not found in read annotations)
4-fumarylacetoacetate
  -> (4) fumarylacetoacetate hydrolase, COG0179
acetoacetate + fumarate
  -> TCA

Figure 8. Metabolic map of the homogentisate pathway and the first read from the NR180 sample found with the marker gene annotation of homogentisate 1,2-dioxygenase. Colors of the read assignments coordinate to the enzyme colors in the pathway. The map was put together with information on KEGG.

Aromatic metabolism: the beta-ketoadipate pathway

The beta-ketoadipate pathway is involved in the degradation of aromatic compounds derived from vascular plants (Gulvik and Buchan, 2013). Several monomers of lignin are metabolized via this pathway (Gulvik and Buchan, 2013). Peripheral pathways transform lignin-derived aromatic compounds and additional plant-derived aromatics into protocatechuate, where the enzyme protocatechuate 3,4-dioxygenase then cleaves the ring in protocatechuate (Harwood and Parales, 1996). Because the protocatechuate pathway is known to be involved in lignin degradation, it is a good marker for a community’s potential for processing plant-derived aromatics in the environment. Additionally, protocatechuate 3,4-dioxygenase has been shown to actively metabolize several aromatic compounds in Roseobacters (Gulvik and Buchan, 2013). For these reasons, the two metagenomes were explored for the presence of this pathway.

Three reads from NR180 and four reads from PS9 have the COG annotation protocatechuate 3,4-dioxygenase (COG3485). All of the reads have read assignments within the Alphaproteobacteria except for one from NR180, which has the read assignment Verrucomicrobiales (Fig 7A and 8A). Other known enzymes from the beta-ketoadipate pathway were not found in the metagenomes. It is possible that this pathway has not been studied extensively enough to provide sufficient database information for this metagenomic study.

CONCLUSION

The application of long read technology to studying marine bacterial metagenomes is beneficial when trying to understand details about community function that are inaccessible with short read technology. Short read data sets can identify taxa and functional potential, as we have done in this study using long reads. The advantage of working with long reads is the ability to observe the spatial relationships of distal annotations on the same read. By searching for functional genes of interest, like homogentisate 1,2-dioxygenase, annotations for additional genes involved in the homogentisate pathway were found on reads with this marker gene. The discovery of multiple functional genes on the same read provides more confidence in the annotations as evidence that an organism can carry out a certain function. In the case of the marker gene homogentisate 1,2-dioxygenase, evidence was discovered for the potential capability of members of the Flavobacteria to metabolize the aromatic compound homogentisate.

Another benefit of long reads is the use of the spatial arrangement of annotations to predict a more specific functional role for a less specific annotation, such as a dehydrogenase or an oxidoreductase. In the EggNOG/COG database, there are many such annotations that can identify the protein family of a gene but cannot provide a more specific functional role. If an annotation is defined as an oxidoreductase and it is found next to a marker gene for a specific pathway, then it is possible that it is an oxidoreductase involved in that pathway. For example, there is an annotated oxidoreductase next to the marker gene protocatechuate 3,4-dioxygenase on all three reads from PS9 with this marker gene (Fig 7A). Since oxidoreductases are known to be involved in aromatic degradation, it is possible that this particular oxidoreductase is part of the beta-ketoadipate pathway for aromatic degradation.
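The reasoning above, that a generic annotation gains meaning from the marker gene beside it, can be sketched as a simple neighbor search. The Python example below scans the ordered annotations on a read and returns any directly adjacent annotation whose description contains a generic term. The gene order in the example read is hypothetical; the COG identifiers come from the legend of Fig 7A, with the oxidoreductase description shortened.

```python
GENERIC_TERMS = ("oxidoreductase", "dehydrogenase")


def generic_neighbors(read_annotations, marker_cog, generic_terms=GENERIC_TERMS):
    """Return generically annotated genes directly adjacent to a marker gene.

    `read_annotations` is an ordered list of (COG id, description) pairs
    in the order the genes appear along the read.
    """
    hits = []
    for i, (cog, _description) in enumerate(read_annotations):
        if cog != marker_cog:
            continue
        for j in (i - 1, i + 1):  # gene before and gene after the marker
            if 0 <= j < len(read_annotations):
                neighbor_cog, neighbor_desc = read_annotations[j]
                if any(term in neighbor_desc.lower() for term in generic_terms):
                    hits.append((neighbor_cog, neighbor_desc))
    return hits


# Hypothetical gene order; COG ids taken from the Fig 7A legend.
EXAMPLE_READ = [
    ("COG0634", "hypoxanthine phosphoribosyltransferase"),
    ("COG0654", "oxidoreductase activity"),
    ("COG3485", "protocatechuate 3,4-dioxygenase"),
    ("COG0626", "cystathionine"),
]

if __name__ == "__main__":
    print(generic_neighbors(EXAMPLE_READ, "COG3485"))
```

Adjacency alone is only suggestive: a candidate found this way remains a hypothesis about pathway membership until supported by other evidence, such as the same arrangement recurring on independent reads.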

Long read technology has provided a new way to view bacterial genomes; however, in order to get a better picture of each bacterial genome within the metagenome, assembling genomes from the metagenome is the next step. Unfortunately, read accuracies of only 85% to 95% for nanopore sequencing (that is, error rates of roughly 5% to 15%) mean that this long read data is unsuitable for genome assembly (Wick et al., 2017). If the same metagenome is sequenced on both Illumina and Oxford Nanopore platforms, then a hybrid assembly can be performed to combine the benefits of long read technology with highly accurate short reads. This method has been shown to greatly improve the genome assembly of single isolates in culture by reducing the final number of contigs and increasing the accuracy of the genome compared to genomes assembled with Nanopore-only or Illumina-only data (Wick et al., 2017). One substantial benefit of assembly would be resolving some of the read assignments placed within the kingdom of Bacteria by MEGAN6’s Lowest Common Ancestor algorithm. Some of the reads that we identified as having a marker gene of interest did not have a taxonomic read assignment. With genome assembly, these reads could be placed within a larger bacterial genome, making it possible to identify the bacteria with that functional potential. A hybrid assembly performed on future metagenomes has the potential to yield a greater understanding of the carbon cycling capabilities encoded in the metagenome.
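To illustrate why some reads are placed only at the level of Bacteria, the Python sketch below shows a naive lowest common ancestor calculation: a read whose alignments hit several taxa is assigned to the deepest rank that all of the hit lineages share. This is a simplified illustration and is not MEGAN6’s implementation, which applies additional filtering and weighting of alignments; the lineages in the example are illustrative.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by every lineage, or None if nothing is shared.

    Each lineage is a list of taxon names ordered from the root downward.
    """
    if not lineages:
        return None
    shared = []
    for ranks in zip(*lineages):  # walk all lineages rank by rank
        if all(name == ranks[0] for name in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared[-1] if shared else None


# Illustrative lineages for the alignments found on one read.
FLAVOBACTERIACEAE = ["Bacteria", "Bacteroidetes", "Flavobacteriia",
                     "Flavobacteriales", "Flavobacteriaceae"]
CROCINITOMICACEAE = ["Bacteria", "Bacteroidetes", "Flavobacteriia",
                     "Flavobacteriales", "Crocinitomicaceae"]
ALPHAPROTEOBACTERIA = ["Bacteria", "Proteobacteria", "Alphaproteobacteria"]

if __name__ == "__main__":
    print(lowest_common_ancestor([FLAVOBACTERIACEAE, CROCINITOMICACEAE]))
    print(lowest_common_ancestor([FLAVOBACTERIACEAE, ALPHAPROTEOBACTERIA]))
```

When the genes on a single read align to distantly related reference genomes, the only shared rank is Bacteria, which is the situation that placing the read within an assembled genome could help resolve.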

APPENDIX: CARBON METABOLIC ANNOTATIONS

Table 1A) COG annotations of glycolysis enzymes

Enzyme name                                COG #     NR180       PS9         NR180 % of      PS9 % of
                                                     # of COGs   # of COGs   assigned COGs   assigned COGs
Glucokinase                                COG0837   7           1           0.008           0.004
Phosphoglucose isomerase                   COG0166   139         51          0.2             0.2
Phosphofructokinase                        COG0205   76          53          0.09            0.2
Aldolase                                   COG0191   76          46          0.09            0.2
Triose phosphate isomerase                 COG0149   70          20          0.08            0.07
Glyceraldehyde-3-phosphate dehydrogenase   COG0057   273         90          0.3             0.3
Phosphoglycerate kinase                    COG0126   173         35          0.2             0.2
Phosphoglycerate mutase                    COG0406   16          5           0.02            0.02
Enolase                                    COG0148   235         77          0.3             0.3
Pyruvate kinase                            COG0469   53          34          0.06            0.1
Total metabolism COGs                                84210       27,501      100             100

Table 2A) COG annotations of pentose phosphate enzymes

Enzyme name                         COG #     NR180       PS9         NR180 % of      PS9 % of
                                              # of COGs   # of COGs   assigned COGs   assigned COGs
Glucose-6-phosphate dehydrogenase   COG0364   45          54          0.05            0.2
6-phosphogluconolactonase           COG2706   1           3           0.001           0.01
6-phosphogluconate dehydrogenase    COG1023   12          7           0.01            0.03
Ribulose-5-phosphate isomerase      COG0362   16          13          0.02            0.05
Ribulose-5-phosphate 3-epimerase    COG0036   202         36          0.2             0.1
Transketolase                       COG0021   234         138         0.3             0.5
Transaldolase                       COG0176   170         25          0.2             0.09
Total metabolism COGs                         84210       27,501      100             100

[Figure 1A image: two columns of read maps; axis shows position along the read in bp, 0 to 24,000. Read assignments in figure order: Bacteria; Proteobacteria; Cellvibrionales; Saccharophagus degradans]

COG1904_Uronic_isomerase; COG1875_Phoh_family; ENOG4111FSN_Transcriptional_regulator; COG0412_dienelactone_hydrolase; COG1173_Binding-protein-dependent_transport_systems_inner_membrane_component;

Figure 1A) Reads from PS9 that include the functional annotation uronic isomerase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, uronic isomerase, is colored black. Read assignments are below each read.

[Figure 2A image: two columns of read maps; axis shows position along the read in bp, 0 to 22,000. Read assignments in figure order: Opitutaceae; Coraliomargarita akajimensis; Actinobacteria; Flavobacteriia; Actinobacteria; Flavobacteriia; PVC group; Coraliomargarita akajimensis; Coraliomargarita akajimensis; Flavobacteriaceae; Bacteria; Pseudonocardiaceae; Flavobacteria bacterium MS024-2A; Coraliomargarita akajimensis; Coraliomargarita akajimensis; Flavobacteria bacterium MS024-2A; Flavobacteria bacterium MS024-2A; Coraliomargarita akajimensis; Planctomycetales; Coraliomargarita akajimensis; Flavobacteriia; Phaeodactylibacter xiamenensis; Bacteroidetes; Alphaproteobacteria; Flavobacteriaceae]

COG2115_Xylose_isomerase; COG0337_3-dehydroquinate_synthase; COG1070_Carbohydrate_kinase; COG0706_Required_for_the_insertion_and_or_proper_folding_and_or_complex_formati...; COG1459_Type_ii_secretion_system; COG1304_Catalyzes_the_1,3-allylic_rearrangement_of_the_homoallylic_substrate_is...; ENOG410XT88_Xylose_isomerase_domain_protein_TIM_barrel; COG0745_regulatoR; COG0451_Nad-dependent_epimerase_dehydratase; COG2086_Electron_transfer_flavoprotein; COG2025_Electron_transfer_flavoprotein; COG1629_receptor; ENOG410XP70_tonB-dependent_Receptor; COG0544_Involved_in_protein_export._Acts_as_a_chaperone_by_maintaining_the_newl...; COG0740_Cleaves_peptides_in_various_proteins_in_a_process_that_requires_ATP_hyd...; COG0708_Exodeoxyribonuclease_III; COG0611_Catalyzes_the_ATP-dependent_phosphorylation_of_thiamine-_monophosphate_...; ENOG4111JPI_general_secretion_pathway_protein_G; COG1788_Key_enzyme_for_ketone_body_catabolism._Transfers_the_CoA_moiety_from_su...;

Figure 2A) Reads from NR180 that include the functional annotation xylose isomerase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, xylose isomerase, is colored black. Read assignments are below each read.

[Figure 3A image: two columns of read maps; axis shows position along the read in bp, 0 to 20,000. Read assignments in figure order: Bacteria; Rhodopirellula; Phycisphaera mikurensis; Coraliomargarita akajimensis; Flavobacteriia; Bacteria; PVC group; Pedosphaera parvula; Pedosphaera parvula; Bacteria; Bacteria; Alphaproteobacteria; Pedosphaera parvula; Planctomycetaceae; Bacteria; Rhodopirellula]

COG2115_Xylose_isomerase; COG2931_Hemolysin-type_calcium-binding; COG0215_Cysteinyl-tRNA_synthetase; COG1070_Carbohydrate_kinase; ENOG410XT88_Xylose_isomerase_domain_protein_TIM_barrel; COG1883_decarboxylase_beta_subunit; COG0413_Catalyzes_the_reversible_reaction_in_which_hydroxymethyl_group_from_5,1...; COG0342_Part_of_the_Sec_protein_translocase_complex._Interacts_with_the_SecYEG_...; COG0861_membrane_protein_terC;

Figure 3A) Reads from PS9 that include the functional annotation xylose isomerase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, xylose isomerase, is colored black. Read assignments are below each read.

[Figure 4A image: two columns of read maps; axis shows position along the read in bp, 0 to 4,000. Read assignments in figure order: Francisella; Pelagibacteraceae; Flavobacteriaceae; Francisella; Francisella; Flavobacteriia; Bacteria; Pelagibacteraceae; Bacteroidetes]

COG1085_galactose-1-phosphate_uridylyltransferase; COG0760_peptidyl-prolyl_cis-trans_isomerase; COG0513_purine_NTP-dependent_helicase_activity; COG1486_glycoside_hydrolase_family_4; COG0395_Binding-protein-dependent_transport_systems_inner_membrane_component; COG0153_Catalyzes_the_transfer_of_the_gamma-phosphate_of_ATP_to_D-galactose_to_...; COG1331_mannose-6-phosphate_isomerase_activity; COG0057_glyceraldehyde3phosphate_dehydrogenase; COG0399_DegT_DnrJ_EryC1_StrS_aminotransferase; COG1209_Catalyzes_the_formation_of_dTDP-glucose,_from_dTTP_and_glucose_1-phosph...; COG0451_Nad-dependent_epimerase_dehydratase;

Figure 4A) Reads from NR180 that include the functional annotation galactose 1-phosphate uridylyltransferase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, galactose 1-phosphate uridylyltransferase, is colored black. Read assignments are below each read.

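[Figure 5A image: two columns of read maps; axis shows position along the read in bp, 0 to 4,000. Read assignments in figure order: Bacteria; Bacteria; Magnetofaba australis; Gammaproteobacteria; Flavobacteria bacterium MS024-2A; Ramlibacter tataouinensis]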

COG1085_galactose-1-phosphate_uridylyltransferase; COG0153_Catalyzes_the_transfer_of_the_gamma-phosphate_of_ATP_to_D-galactose_to_...; COG1475_parb-like_partition_protein;

Figure 5A) Reads from PS9 that include the functional annotation galactose 1-phosphate uridylyltransferase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, galactose 1-phosphate uridylyltransferase, is colored black. Read assignments are below each read.

[Figure 6A image: two columns of read maps; axis shows position along the read in bp, 0 to 20,000. Read assignments in figure order: Bradyrhizobium; Bacteroidetes; Bacteria; Flavobacteriaceae; Proteobacteria; Flavobacteriaceae; Thiothrix caldifontis; Bacteroidetes; Rhodobacteraceae; Flavobacteriales; Alphaproteobacteria; Candidatus Koribacter versatilis; Owenweeksia hongkongensis; Proteobacteria; Rhodobacteraceae]

COG3508_homogentisate_1,2-dioxygenase_activity; COG0060_amino_acids_such_as_valine,_to_avoid_such_errors_it_has_two_additional_...; COG0322_The_UvrABC_repair_system_catalyzes_the_recognition_and_processing_of_D; COG3185_4-Hydroxyphenylpyruvate_dioxygenase; COG3483_Catalyzes_the_oxidative_cleavage_of_the_L-tryptophan_(L-_Trp)_pyrrole_r...; COG0166_glucose-6-phosphate_isomerase_activity; COG0611_Catalyzes_the_ATP-dependent_phosphorylation_of_thiamine-_monophosphate; COG2204_two_component,_sigma54_specific,_transcriptional_regulator,_Fis_family; COG3643_Glutamate_formiminotransferase; COG0745_regulatoR; COG0193_The_natural_substrate_for_this_enzyme_may_be_peptidyl-_tRNAs_which_drop; COG0142_synthase; COG0560_phosphoserine_phosphatase_activity; COG0389_Poorly_processive_error-prone_DNA_polymerase_involved_in_untargeted_mu; COG2998_(ABC)_transporter; COG1351_Catalyzes_the_formation_of_dTMP_and_tetrahydrofolate_from_dUMP_and_me; COG0590_deaminase; COG1530_ribonuclease; COG0743_Catalyzes_the_NADP-dependent_rearrangement_and_reduction_of_1-deoxy-D; COG0496_Nucleotidase_that_shows_phosphatase_activity_on_nucleoside_5'-monophosp; COG0210_helicase; COG1024_Enoyl-CoA_hydratase; COG0620_Catalyzes_the_transfer_of_a_methyl_group_from_5-_methyltetrahydrofolate...; COG2818_DNA-3-methyladenine_glycosylase_activity; COG0028_acetolactate_synthase;

Figure 6A) Reads from PS9 that include the functional annotation homogentisate 1,2-dioxygenase and the surrounding COG annotations on the read. Genes annotated on the plus strand are above the line. Genes annotated on the minus strand are below the line. The marker gene of interest, homogentisate 1,2-dioxygenase, is colored black. Read assignments are below each read.

[Figure 7A image: two columns of read maps; axis shows position along the read in bp, 0 to 6,000. Read assignments in figure order: Alphaproteobacteria; Alphaproteobacteria; Verrucomicrobiales]

COG0634_hypoxanthine_phosphoribosyltransferase; COG0654_oxidoreductase_activity,_acting_on_paired_donors,_with_incorporation_or...; COG3485_protocatechuate_3,4-dioxygenase; COG0626_cystathionine; COG0180_tryptophanyltRNA_synthetase;

Figure 7A) Reads from PS9 that include the functional annotation protocatechuate 3,4-dioxygenase, shown with the surrounding COG annotations on each read. Genes annotated on the plus strand are above the line; genes annotated on the minus strand are below it. The marker gene of interest, protocatechuate 3,4-dioxygenase, is colored black. Read assignments are below each read.

[Figure 8A panel: reads 1 to 5, axis from 0 to 10000. Read assignment labels, in extraction order: Alphaproteobacteria; Planktotalea frisia; Alphaproteobacteria; Tistlia consotensis.]

COG3485_protocatechuate_3,4-dioxygenase; COG0477_major_facilitator_Superfamily; ENOG410ZVFC_D-amino_acid_dehydrogenase,_small_subunit;

Figure 8A) Reads from PS9 that include the functional annotation protocatechuate 3,4-dioxygenase, shown with the surrounding COG annotations on each read. Genes annotated on the plus strand are above the line; genes annotated on the minus strand are below it. The marker gene of interest, protocatechuate 3,4-dioxygenase, is colored black. Read assignments are below each read.

WORKS CITED

Arumugam, K., Bagci, C., Bessarab, I., Beier, S., Buchfink, B., Gorska, A., ... & Williams, R. B. (2019). Annotated bacterial chromosomes from frame-shift-corrected long read metagenomic data. bioRxiv, 511683.

Arias-Barrau, E., Olivera, E. R., Luengo, J. M., Fernández, C., Galán, B., García, J. L., ... & Minambres, B. (2004). The homogentisate pathway: a central catabolic pathway involved in the degradation of L-phenylalanine, L-tyrosine, and 3-hydroxyphenylacetate in Pseudomonas putida. Journal of bacteriology, 186(15), 5062-5077.

Barberán, A., Fernández-Guerra, A., Bohannan, B. J., & Casamayor, E. O. (2012). Exploration of community traits as ecological markers in microbial metagenomes. Molecular ecology, 21(8), 1909-1917.

Bauer, J. E., & Bianchi, T. S. (2011). Dissolved organic carbon cycling and transformation. In E. Wolanski & D. S. McLusky (Eds.), Treatise on Estuarine and Coastal Science, Vol. 5, pp. 7–67. Waltham: Academic Press.

Buchan, A., LeCleir, G. R., Gulvik, C. A., & González, J. M. (2014). Master recyclers: features and functions of bacteria associated with phytoplankton blooms. Nature Reviews Microbiology, 12(10), 686.

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12, 59-60. doi:10.1038/nmeth.3176

Cuesta, S. M., Furnham, N., Rahman, S. A., Sillitoe, I., & Thornton, J. M. (2014). The evolution of enzyme function in the isomerases. Current opinion in structural biology, 26, 121-130.

De Coster, W., D’Hert, S., Schultz, D. T., Cruts, M., & Van Broeckhoven, C. (2018). NanoPack: visualizing and processing long-read sequencing data. Bioinformatics, 34(15), 2666-2669.

Flombaum, P., Gallegos, J. L., Gordillo, R. A., Rincón, J., Zabala, L. L., Jiao, N., ... & Vera, C. S. (2013). Present and future global distributions of the marine Cyanobacteria Prochlorococcus and Synechococcus. Proceedings of the National Academy of Sciences, 110(24), 9824-9829.

Fortunato, C. S., & Crump, B. C. (2015). Microbial gene abundance and expression patterns across a river to ocean salinity gradient. PLoS One, 10(11), e0140578.

Gulvik, C. A., & Buchan, A. (2013). Simultaneous catabolism of plant-derived aromatic compounds results in enhanced growth for members of the Roseobacter lineage. Appl. Environ. Microbiol., 79(12), 3716-3723.

Harwood, C. S., & Parales, R. E. (1996). The β-ketoadipate pathway and the biology of self-identity. Annual Reviews in Microbiology, 50(1), 553-590.

Huson, D. H., Albrecht, B., Bağcı, C., Bessarab, I., Gorska, A., Jolic, D., & Williams, R. B. (2018). MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biology direct, 13(1), 6.

Huson, D. (2018). Computational Microbiome Analysis [PowerPoint slides]. Retrieved from http://ab.inf.uni-tuebingen.de/data/software/megan6/download/MeganTutorialApril2018.pdf

Jurtshuk Jr, P. (1996). Chapter 4: Bacterial metabolism. Medical microbiology (4th ed.). Galveston: University of Texas Medical Branch at Galveston. Available from: http://www.ncbi.nlm.nih.gov/books/NBK7919

Moran, M. A., Kujawinski, E. B., Stubbins, A., Fatland, R., Aluwihare, L. I., Buchan, A., ... & Howe, B. (2016). Deciphering ocean carbon in a changing world. Proceedings of the National Academy of Sciences, 113(12), 3143-3151.

Newton, R. J., Griffin, L. E., Bowles, K. M., Meile, C., Gifford, S., Givens, C. E., ... & Rinta-Kanto, J. M. (2010). Genome characteristics of a generalist marine bacterial lineage. The ISME journal, 4(6), 784.

Paerl, H. W., Rossignol, K. L., Hall, S. N., Peierls, B. L., & Wetz, M. S. (2010). Phytoplankton community indicators of short- and long-term ecological change in the anthropogenically and climatically impacted Neuse River Estuary, North Carolina, USA. Estuaries and Coasts, 33(2), 485-497.

Paerl, H. W., Valdes-Weaver, L. M., Joyner, A. R., & Winkelmann, V. (2007). Phytoplankton indicators of ecological change in the eutrophying Pamlico Sound system, North Carolina. Ecological Applications, 17(sp5), S88-S101.

Peierls, B. L., & Paerl, H. W. (2010). Temperature, organic matter, and the control of bacterioplankton in the Neuse River and Pamlico Sound estuarine system. Aquatic Microbial Ecology, 60(2), 139-149.

Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J., & Segata, N. (2017). Shotgun metagenomics, from sampling to analysis. Nature biotechnology, 35(9), 833.

Rang, F. J., Kloosterman, W. P., & de Ridder, J. (2018). From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome biology, 19(1), 90.

Rodriguez-r, L. M., & Konstantinidis, K. T. (2014). Estimating coverage in metagenomic data sets and why it matters. The ISME journal, 8(11), 2349.

Rrwick. https://github.com/rrwick/Porechop

Satinsky, B. M., Gifford, S. M., Crump, B. C., & Moran, M. A. (2013). Use of internal standards for quantitative metatranscriptome and metagenome analysis. Methods in Enzymology, 531, 237–250. https://doi.org/10.1016/B978-0-12-407863-5.00012-5

Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., ... & Cornejo-Castillo, F. M. (2015). Structure and function of the global ocean microbiome. Science, 348(6237), 1261359.

Thornton, D. C. (2014). Dissolved organic matter (DOM) release by phytoplankton in the contemporary and future ocean. European Journal of Phycology, 49(1), 20-46.

Ward, C. S., Yung, C. M., Davis, K. M., Blinebry, S. K., Williams, T. C., Johnson, Z. I., & Hunt, D. E. (2017). Annual community patterns are driven by seasonal switching between closely related marine bacteria. The ISME journal, 11(6), 1412.

Wick, R. R., Judd, L. M., Gorrie, C. L., & Holt, K. E. (2017). Completing bacterial genome assemblies with multiplex MinION sequencing. Microbial genomics, 3(10).

Yamashita, Y., & Tanoue, E. (2003). Chemical characterization of protein-like fluorophores in DOM in relation to aromatic amino acids. Marine Chemistry, 82(3-4), 255-271.
