Metagenomics of the Galapagos Marine

By Liang Zhao

Senior Honors Thesis Department of Biology University of North Carolina at Chapel Hill

04/19/2019

Dr. Scott Gifford, Thesis Advisor

Dr. Mara Evans, Reader

Dr. Todd Vision, Reader

Abstract

The Galapagos Islands are located in the eastern Pacific Ocean where several currents interact to create diverse marine environments. Heterotrophic bacteria control how much carbon is available to organisms at higher trophic levels; however, the composition of the Galapagos microbiome and how differences in microbial community structure are driven by the diverse environmental conditions found among the islands are currently unknown. Environmental samples were collected from 18 stations across the islands and DNA was extracted and sequenced. The sequencing data were analyzed to estimate the relative and absolute cell abundance in each sample. Candidatus Pelagibacter and Synechococcus were the most abundant taxa across the majority of the stations. Distinct community compositions were found at stations on the western side of the islands in 2016, when La Nina strengthened the upwelling caused by the Equatorial Undercurrent in this region, suggesting that certain bacteria were selected for by the nutrient-rich environment. The findings provide insight into the microbial community composition of the Galapagos marine environment and offer potential microbial markers for characterizing the physical environment.

Introduction

The Galapagos Islands are located in the eastern Pacific Ocean, sitting directly on the equator. Several ocean currents interact around the islands. The Equatorial Undercurrent (the Cromwell Current) flows eastward along the equator beneath the South Equatorial Current and runs directly into the Galapagos, bringing cold, nutrient-rich deep water to the surface on the western side of the islands (Kessler 2006) (Fig. 1A). The South Equatorial Current, joined by the Peru Current, brings cold upwelled sea water to the Galapagos Islands, making the climate cooler and milder. These interacting currents create diverse marine environments among the islands, which impose different selection pressures on the microbial communities.

Microbes are an essential part of marine ecosystems, yet how the types of microbes and the functional capabilities encoded in their genomes vary across different habitats remains largely unknown (Stocker 2012; Moran 2015). Primary producers, in the form of microbial phytoplankton, fix carbon through photosynthesis, forming the foundation of the food web (Buchan et al. 2014; Moran 2015). Phytoplankton abundance and activity are limited by the amount of inorganic nutrients in the water; therefore, phytoplankton abundance often increases when more nutrients are available in the ecosystem. Some of the carbon fixed by primary producers is transferred directly to higher trophic levels, while a large portion is released into the water as dissolved organic carbon (DOC). Heterotrophic bacteria take up DOC and can either make it available to organisms at higher trophic levels or make it unavailable through continuous recycling in the microbial loop and remineralization to CO2 (Buchan et al. 2014; Moran 2015). The diversity of bacteria and their activities can thus have a substantial impact on ecosystem processes. Therefore, identifying the types of microbes around the Galapagos and how their distribution relates to environmental factors such as ocean currents, upwelling, and nutrient conditions enables a better understanding of the factors shaping these rich ecosystems.

Metagenomics is the sequencing and analysis of the collective community genome from an environmental sample (DeLong et al. 2006). It enables capturing an unbiased representation of the taxonomic composition of a given microbial community and of the potential functions its members may be able to perform, without the need for direct cell cultivation and its accompanying biases. In this project, ca. 700 million metagenomic reads were analyzed to answer the following questions: 1) How does microbial community composition vary across the Galapagos Archipelago, and 2) Is variation in microbial community composition driven by differences in environmental conditions, particularly the physical gradients set up by the diverse currents entering the archipelago?

Methods

Environmental sampling. 48 cell samples were collected at stations across the Galapagos archipelago, Ecuador, in 2015 and 2016 (Fig. 1B; Appendix Table 1). Bacterial cells were collected using 3 µm and 0.22 µm filters. The filters were frozen and the volume of water passed through the filtering setup was recorded. Along with the cell samples, a range of environmental data was collected at each station, including particulate organic carbon (POC), chlorophyll a (chl a), phosphate, nitrate, silicate, and cell abundance from flow cytometry (Marchetti Lab, UNC Chapel Hill), as well as dissolved organic carbon (DOC) (Gifford Lab, UNC; Medeiros Lab, UGA).

DNA extraction and library preparation. DNA was extracted from the frozen 0.22 µm filters (representing the free-living bacterioplankton). Prior to starting the extraction, three internal standard genomes were added to each sample to make the metagenomes quantitative. For the 2015 samples, 4.0 ng of Thermus thermophilus, 4.7 ng of Blautia producta, and 2.7 ng of Deinococcus radiodurans genomic DNA were added. For the 2016 samples, 4.4 ng of Thermus thermophilus, 4.6 ng of Blautia producta, and 3.5 ng of Deinococcus radiodurans genomic DNA were added. DNA was then extracted using a Qiagen DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) following the manufacturer’s procedure. Metagenomic DNA libraries were prepared using the KAPA HyperPlus Library Preparation Kit Protocol (Kapa Biosystems Scientific, Massachusetts, United States). DNA was enzymatically fragmented to achieve an average fragment size of 395 bp. Barcodes were ligated to the sequences. After quantification using the Quant-iT PicoGreen dsDNA kit (Molecular Probes Inc., Oregon, United States), DNA concentrations were normalized and the libraries were pooled into two sets (one for 2015 and the other for 2016) and sent for sequencing.

Sequencing, quality control and paired-end assembly. Sequencing data were generated using the Illumina HiSeq platform (San Diego, California, United States) at UNC’s High-Throughput Sequencing Facility (Chapel Hill, North Carolina). 359 million paired-end reads (2×150 bp) were generated from the pooled 2015 samples, and 364 million reads from the 2016 samples. Read quality assessment, trimming, and joining were performed using the Galaxy bioinformatics platform (Afgan et al. 2016). FastQC was used to perform quality control checks on the sequencing data. Trimmomatic was used to trim off any ends of the read that had low quality (sliding window size 10, average quality threshold 20) and to discard any reads with length < 50 bp. PEAR was used to assemble paired-end reads (minimum overlap size 10). The assembled paired-end reads, paired-end reads that could not be assembled (because of the lack of an overlapping region), and unpaired reads (Trimmomatic removed one from the pair) were then concatenated into one file.
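As a rough illustration of the trimming step, the sketch below applies a sliding-window quality cut with the settings quoted above (window size 10, mean quality threshold 20, minimum length 50 bp). It is a simplified stand-in written for this text, not Trimmomatic's implementation; the function name `sliding_window_trim` and the choice to cut at the start of the first failing window are assumptions.

```python
from statistics import mean


def sliding_window_trim(seq, quals, window=10, min_avg_q=20, min_len=50):
    """Cut a read at the first window whose mean quality is below min_avg_q.

    seq is the base string and quals the per-base Phred scores (same length).
    Returns the trimmed (seq, quals) pair, or None when the surviving read
    is shorter than min_len. Simplified relative to Trimmomatic.
    """
    cut = len(seq)
    # Scan from the 5' end; stop at the first window that fails the threshold.
    for start in range(0, max(len(seq) - window + 1, 0)):
        if mean(quals[start:start + window]) < min_avg_q:
            cut = start
            break
    if cut < min_len:
        return None  # read would be discarded by the length filter
    return seq[:cut], quals[:cut]
```

A read for which this function returns None corresponds to one discarded by the 50 bp length cutoff.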

Estimate cell/gene abundance from internal standards. Recovery of internal standards in the sequence libraries reflects sequencing coverage and was used to estimate absolute cell and gene abundances of taxa for each sample. These estimates were derived from three equations as described by Satinsky et al. (2013).
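Written out from the symbol definitions given in the text, the three relations take the form below. This is a reconstruction from those definitions and from the stated conversion factor Sa/Sr, so it should be checked against Satinsky et al. (2013) before reuse.

```latex
\begin{align}
S_r &= \frac{S_s}{S_p} \tag{1} \\
P_g &= P_s \times \frac{S_a}{S_r} \tag{2} \\
G_a &= G_s \times \frac{S_a}{S_r} \tag{3}
\end{align}
```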

In Eq. 1, Sr is the number of molecules of the internal standard genome recovered from the sequencing effort. Ss is the number of protein encoding internal standard reads in the sequence library. Sp is the number of protein encoding genes in the internal standard reference genome.

The quality-controlled reads were first annotated via a homology search against the three internal standard genomes using BLASTn (e-value < 0.001, %ID > 90). Duplicate hits and hits with %ID < 95, an alignment length < 50% of the read length, or a bit score < 50 were removed using a custom Python script (filter_blastn_internal_stds_result_v4.py). The internal standard reads were then annotated via a homology search against the internal standard protein database using BLASTx (e-value < 0.001). Duplicate hits and hits with bit scores < 40 or %ID < 95 were removed (blastx_IS_filter.py), and the number of protein encoding internal standard reads in the sequence library (Ss) was acquired by counting the remaining hits. Sp was acquired by counting protein encoding genes in the internal standard reference genome. The three internal standard reference genomes are available on NCBI (IMG Genome IDs of T. thermophilus, D. radiodurans, and B. producta are 637000322, 2556921628, and 2515154176 respectively).
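The BLASTn hit-filtering rules described above can be sketched as follows. This is not the filter_blastn_internal_stds_result_v4.py script, whose contents are not shown here; the tuple layout, the `read_lengths` lookup, and the decision to resolve duplicate hits by keeping the highest bit score are assumptions made for illustration.

```python
def filter_hits(hits, read_lengths, min_pident=95.0, min_cov=0.5, min_bitscore=50.0):
    """Keep at most one hit per read that passes all three thresholds.

    hits: iterable of (read_id, subject_id, pident, aln_len, bitscore) tuples.
    read_lengths: dict mapping read_id to the read length in bp.
    Returns a dict of read_id -> best passing hit tuple.
    """
    best = {}
    for read_id, subject_id, pident, aln_len, bitscore in hits:
        if pident < min_pident:
            continue  # percent identity too low
        if aln_len < min_cov * read_lengths[read_id]:
            continue  # alignment covers less than half of the read
        if bitscore < min_bitscore:
            continue  # bit score too low
        # Assumed duplicate handling: retain only the highest-scoring hit.
        if read_id not in best or bitscore > best[read_id][4]:
            best[read_id] = (read_id, subject_id, pident, aln_len, bitscore)
    return best
```

Counting the keys of the returned dictionary gives the number of reads assigned to the internal standards under these assumptions.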

In Eq. 2, Pg is the total number of protein encoding genes in the sample. Ps is the number of protein encoding sequences in the sequence library. Sa is the number of molecules of internal standard genome added to the sample. The number of reads in the sequence library is converted to the number of genes in the sample by multiplying by Sa/Sr. Sa was calculated by dividing the weight of the added genomic DNA by the weight of a single internal standard genome. Ps was determined by identifying non-internal standard protein encoding reads through annotation.
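The calculation of Sa can be illustrated with a short sketch. The value of 650 g/mol per base pair is a commonly used approximation rather than a figure from this thesis, the genome length passed in is whatever the user assumes for the standard, and `genome_copies` is a hypothetical helper name.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MOLAR_MASS = 650.0      # g/mol per base pair (approximation, an assumption here)


def genome_copies(mass_ng, genome_length_bp):
    """Number of genome molecules in mass_ng of double-stranded genomic DNA."""
    # Weight of a single genome in grams.
    genome_mass_g = genome_length_bp * BP_MOLAR_MASS / AVOGADRO
    # Convert the added mass from nanograms to grams before dividing.
    return mass_ng * 1e-9 / genome_mass_g
```

Because the result scales inversely with the assumed genome length, the choice of reference length (for example, whether plasmids are included) directly changes Sa.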

Gene Annotation. After removal of internal standard reads, all remaining reads were annotated using a homology search against NCBI’s Reference Sequence database (RefSeq, version 84) containing bacterial and archaeal protein sequences via the Diamond search algorithm (blastx z-- salltitles --max-target-seqs 1 --block-size 70 --index-chunks 1 --threads 35) (Buchfink et al. 2015). Duplicate hits and any hits with bit score < 50 were removed (diamond_result_bitscore50_filter_linux.py). To identify reads of viral origin, a homology search against NCBI’s RefSeq viral database (v85) was conducted using Diamond (same parameters). The same Python script was used to remove duplicate hits and hits with bit scores lower than 50. Bacterial annotated reads were replaced by a viral annotation if the viral hit had a higher bit score. The number of protein encoding sequences in the sequence library (Ps) was acquired by counting bacterial and archaeal hits.
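The rule for reconciling prokaryotic and viral annotations can be sketched as below. The dictionary layout and the function name `merge_annotations` are hypothetical, and treating reads that hit only the viral database as viral is an assumption; the text specifies only that a viral hit replaces a bacterial one when its bit score is higher.

```python
def merge_annotations(prokaryote_hits, viral_hits):
    """Combine per-read best hits from the two database searches.

    Both arguments map read_id -> (subject_id, bitscore).
    Returns read_id -> (subject_id, bitscore, source).
    """
    merged = {
        rid: (sid, score, "prokaryote")
        for rid, (sid, score) in prokaryote_hits.items()
    }
    for rid, (sid, score) in viral_hits.items():
        # Replace only when the viral bit score is strictly higher;
        # reads with no prokaryotic hit are labeled viral (assumption).
        if rid not in merged or score > merged[rid][1]:
            merged[rid] = (sid, score, "virus")
    return merged


def count_ps(merged):
    """Ps: number of reads whose retained annotation is bacterial or archaeal."""
    return sum(1 for _, _, source in merged.values() if source == "prokaryote")
```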

In Eq. 3, Ga is the number of molecules of any particular gene category in the sample. Gs is the number of sequences of any particular gene category in the sequence library. One can be converted into the other using the conversion factor Sa/Sr. Cells L^-1 was calculated by dividing the number of protein encoding genes in the sample (Pg) by 2000, the approximate average number of genes in bacterial cells (Koonin and Wolf 2008), and by the volume of filtered seawater. When considering a particular gene category, the number of molecules of that gene category in the sample (Ga) was divided by the volume of filtered seawater to give gene abundance (genes L^-1).
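The chain from read counts to cells L^-1 can be sketched as a few small functions. The form Sr = Ss/Sp is read from the symbol definitions in the text rather than copied from a printed equation, the function names are hypothetical, and the numbers in the usage note are invented purely to show the arithmetic.

```python
def standards_recovered(ss, sp):
    """Eq. 1 as read from the definitions: Sr = Ss / Sp."""
    return ss / sp


def genes_in_sample(library_count, sa, sr):
    """Eq. 2 and Eq. 3: scale a library count (Ps or Gs) by Sa / Sr."""
    return library_count * sa / sr


def cells_per_litre(pg, volume_l, genes_per_genome=2000):
    """Cells per litre from Pg, assuming 2000 genes per genome."""
    return pg / genes_per_genome / volume_l
```

With invented inputs Ss = 4000, Sp = 2000, Sa = 2.0×10^6, Ps = 1.0×10^7 and 2.0 L filtered, `standards_recovered` gives Sr = 2.0, `genes_in_sample` gives Pg = 1.0×10^13, and `cells_per_litre` gives 2.5×10^9 cells L^-1.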

Estimate cell abundance from single copy gene (recA). Genome and cell abundances were estimated based on the number of recA annotated reads in a sample. Recombinase A (recA) is a single copy gene, and therefore one read is representative of a single genome. A bacterial and archaeal RecA protein database was constructed by identifying proteins in NCBI’s Reference Sequence protein database (v84) annotated with the key words "recombinase RecA", "protein RecA", "recombinase A", or "RecA protein". The metagenomic reads were then compared to the RecA database using Diamond (same parameters as before). Duplicate hits and any hits with bit score < 50 were removed (diamond_result_bitscore50_filter_linux.py). The number of recA reads in the sequence library (Gs) was acquired by counting reads that were annotated as recA in both the search against the RecA database and the search against the RefSeq database. The abundance of recA genes in the sample (Ga) was calculated using the conversion factor Sa/Sr in Eq. 3. The average genome size in each sample was estimated by dividing the number of total annotated reads by the number of recA reads.
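Building the RecA database by keyword can be sketched as a filter over FASTA headers. The four key words come from the text; the case-sensitive substring match, the function names, and the in-memory record format are assumptions, and the thesis does not state how the selection was actually scripted.

```python
RECA_KEYWORDS = ("recombinase RecA", "protein RecA", "recombinase A", "RecA protein")


def is_reca_header(header):
    """True when the FASTA description contains any of the RecA key words."""
    return any(keyword in header for keyword in RECA_KEYWORDS)


def select_reca(fasta_lines):
    """Return (header, sequence) pairs for records whose header matches."""
    records = []
    header, seq_parts, keep = None, [], False
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one.
            if keep:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line[1:], []
            keep = is_reca_header(header)
        elif keep:
            seq_parts.append(line)
    if keep:
        records.append((header, "".join(seq_parts)))
    return records
```

A plain substring match such as this can also pick up unrelated proteins whose names merely begin with one of the key words, so the selected set would normally be inspected before use.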

Taxonomic annotations. The community composition of the samples was determined by matching the protein accession number of the annotated read with a taxon ID using the NCBI database (downloaded on Oct. 25th, 2017). These taxonomic annotations were then compiled into a summary table by counting the number of reads for each taxon in each sequence library (all_lineage.py, get_ranked3.py). Entries with the same genus were collapsed. The read counts were then converted into cells L^-1 using the conversion factor Sa/Sr, the filtered volume, and the assumed 2000 genes per genome. The table was used to make the community composition heatmap across samples. Figures 2 and 3 were made using the R package ggplot2 (Wickham 2009).
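The collapse-by-genus and unit conversion steps can be sketched as follows. This is an illustration only, not the all_lineage.py or get_ranked3.py scripts; `genus_abundance` is a hypothetical name, and the input is assumed to be one genus label (or None) per annotated read.

```python
from collections import Counter


def genus_abundance(read_genera, sa, sr, volume_l, genes_per_genome=2000):
    """Collapse reads by genus and convert read counts to cells per litre.

    read_genera: iterable with one genus name per annotated read; reads
    without a genus are grouped under "(no genus)".
    """
    counts = Counter(genus if genus else "(no genus)" for genus in read_genera)
    # Sa/Sr converts reads to genes in the sample; dividing by the assumed
    # genes per genome and the filtered volume gives cells per litre.
    scale = sa / sr / genes_per_genome / volume_l
    return {genus: n * scale for genus, n in counts.items()}
```

Each row of the resulting table corresponds to one genus in one sample; repeating the call per sequence library yields the matrix used for a heatmap.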

Results

Metagenomic sequencing and cell abundance estimation. Sequencing of the 2015 samples yielded 358,981,271 paired-end reads across 23 station samples. After trimming low quality reads, assembling overlapping paired-end reads, and removing the internal standard reads, 356,761,607 (90.8%) reads remained. A homology search against NCBI’s RefSeq database resulted in 193,818,125 (54.3%) reads with significant homology to bacterial, archaeal, or viral genes. For the 2016 samples, 364,092,724 paired-end reads were produced from 24 samples. After trimming, paired-end assembly, and internal standard removal, 88.9% of reads remained, and 175,506,420 (48.3%) of these reads were homologous with bacterial or archaeal genes. Estimation of cell abundance using internal standard recovery showed that on average 2.6×10^12 protein encoding genes L^-1 were found in each sample in 2015 (Appendix Table 1). Assuming an average open ocean bacterial genome contains 2000 protein encoding genes (Koonin and Wolf 2008), this results in 1.3×10^9 cells L^-1. In total, 142,350 recA genes/reads were found, and an average of 1.9×10^9 cells L^-1 was estimated from the number of recA genes. For the 2016 samples, an average of 2.4×10^9 cells L^-1 was estimated assuming 2000 genes genome^-1, and an average of 3.7×10^9 cells L^-1 was estimated from the 135,688 recA genes found. The average estimated genome size was 1300 (± 96 S.D.) genes genome^-1 for the 2015 samples and 1297 (± 162 S.D.) genes genome^-1 for the 2016 samples.

Environmental conditions and flow cytometry data. The levels of inorganic nutrients were largely synchronized (Fig. 2). High nitrate, phosphate and silicate levels were observed in 2016, especially at stations 3 and 24. High levels of chlorophyll a in the small size fractions (< 5 µm) were found at stations 4, 14, 16 and 18 in both years. The flow cytometry data showed that the average abundance of Prochlorococcus was 6.4×10^7 cells L^-1 in 2015 and 1.1×10^7 cells L^-1 in 2016. High abundances of Prochlorococcus were found at stations 9 through 26 in 2015 and at stations 14 through 26 in 2016 (Fig. 2). The average abundance of Synechococcus was 8.7×10^7 cells L^-1 in 2015 and 5.6×10^7 cells L^-1 in 2016. High abundances of Synechococcus were found at stations 4 and 14 in 2015, and at station 14 in 2016. The average dissolved organic carbon (DOC) concentration was 77 μM in 2015 and 66 μM in 2016. DOC was below 70 μM at stations 1 through 12 in 2016. At stations 3 and 4, DOC dropped below 50 μM.

General trends in cellular abundance. The average cellular abundance of bacterial taxa estimated from the metagenomes was 1.27×10^9 cells L^-1 in 2015 and 2.36×10^9 cells L^-1 in 2016. Members of the SAR11 clade (Candidatus Pelagibacter) and Synechococcus were the top two taxa across the majority of stations, averaging 4.0×10^8 and 2.1×10^8 cells L^-1 respectively. The majority of the remaining taxa had an average ranging from 10^4 to 10^6 cells L^-1. The least abundant taxa found in the sequence library were estimated to have 61.0 cells L^-1 in a single sample. The distribution indicated that the community was composed of a handful of highly abundant taxa and a long tail of rare taxa.

Taxa specific trends. In 2015, increased abundances of Leeuwenhoekiella were found at station 9 and stations 12 through 26. Stations 2, 3, 5, 6, 7, 14 and 16 had increased abundances of Formosa, Flavobacterium and gamma proteobacterium HIMB55. Samples GN69, GN81, and GN83 had higher abundances across taxa (Fig. 3).

In 2016, stations 3, 4, 5 and 7 had low abundances of Synechococcus, Prochlorococcus and alpha proteobacterium HIMB59 (Fig. 2). Stations 5 and 7 had low abundances of Candidatus Pelagibacter. Increased abundances of Rhodobacterales bacterium HTCC2255, Polaribacter, gamma proteobacterium HTCC2207, and bacterium MOLA455 were found at stations 3, 4, 5, and 7. Stations 5 and 7 also had increased abundances of Rhodobacteraceae members HIMB11 and SB2, as well as Tenacibaculum. Stations 3 and 4 had increased abundances of Candidatus Thioglobus and Candidatus Nitrosopelagicus. Increased abundances of Leeuwenhoekiella were found at station 9 and stations 12 through 26.

Discussion

Strong upwelling leads to distinct community profile. Primary producers contribute to the dissolved organic carbon (DOC) pool in the ocean, and they are highly active in the surface water where they have access to sunlight and can carry out photosynthesis. Fewer primary producers are found in deeper waters where access to sunlight is limited, resulting in lower DOC concentrations in deep water than in surface water (Moran 2015). High primary productivity in the surface water quickly uses up the nutrients, while the downward transport of organic matter and microbial remineralization increase the nutrient concentration in deeper waters (Moore et al. 2013). Therefore, nutrient levels increase with depth while DOC levels decrease.

Stations 2, 3, 4, 5 and 7 were located in the upwelling region and had lower DOC and higher nutrients in 2016 than in 2015, indicating that the surface waters in 2016 resembled deep waters. This could be explained by the fact that 2015 was an El Nino year and 2016 was a La Nina year. In an El Nino event, the slope of the thermocline decreases and the ocean currents flowing westward in the Pacific weaken (Philander 1985). As a result, the Equatorial Undercurrent and upwelling at the Galapagos also weaken. In a La Nina event, in contrast, the thermocline becomes steeper and the upwelling at the Galapagos becomes stronger, causing the surface water to acquire deep water characteristics.

Our measured DOC concentrations and microbial community composition suggest that the upwelling in 2016 was so strong that the upwelled water did not have enough time to acquire surface water characteristics. In 2015, the upwelling was weakened and thus the DOC and nutrient levels were relatively consistent across the stations. All 2016 stations seemed to have some deep water characteristics. Among these stations, stations 3 and 4 had the lowest DOC concentrations and the highest abundances of members of Candidatus Nitrosopelagicus, which is an archaeal genus. Archaeal abundances are enriched below the photic zone (Karner et al. 2001). Stations 3, 4, 5 and 7 had increased abundances of Rhodobacterales bacterium HTCC2255, gamma proteobacterium HTCC2207 and Polaribacter, which were also found to be abundant at Monterey Bay, California (Ottesen et al. 2011), where strong upwelling happens seasonally (Pennington and Chavez 2000). In addition to Polaribacter, other groups of Flavobacteriia, including Formosa and Tenacibaculum, were found to have increased abundance at these stations (Fig. 3). Many flavobacterial species have been associated with algal blooms (Teeling et al. 2012), which provide a nutrient-rich environment for the bacteria similar to the one found at upwelling sites. The environmental data and community compositions support the interpretation that 2016 had very strong upwelling and that the microbial community profile changed as a consequence.

Cyanobacterial trends. Station 14 is located at Darwin Bay, which is a collapsed volcanic crater filled with sea water. Little is known about the physical and chemical oceanography of this site, but it is possible that terrestrial input enhances the nutrient level and results in high primary productivity. The increase in chlorophyll a was mainly contributed by the small size fractions (Fig. 2), indicating that bacterial primary producers were abundant and active at the crater. Based on the flow cytometry data (Fig. 2), Synechococcus was the major contributor to the increase in chlorophyll a and thus primary productivity at the crater.

Prochlorococcus is known to be highly abundant in oligotrophic oceans (Johnson et al. 2006). Our data show increased abundances of Prochlorococcus at the eastern stations, indicating that the eastern side of the islands resembles an oligotrophic open ocean environment.

In conclusion, the Galapagos microbiome consists of several abundant taxa (e.g. Pelagibacteraceae, Synechococcaceae, and Rhodobacteraceae) and a diverse group of less abundant taxa. The overall taxonomic composition resembles that of many other surface ocean microbiomes (Sunagawa et al. 2015). Due to the effects of La Nina, the communities at the upwelling sites showed distinct compositions, and bacteria adapted to eutrophic environments dominated. In contrast to the nutrient-rich western side, more oligotrophic bacteria were found on the eastern side of the islands. The physical gradients caused by the currents lead to variations in microbial communities.

Figure 1. A) Ocean currents intersecting with the Galapagos (green box) and B) sample station map for the 2015 and 2016 research expedition.


Figure 2. Environmental measurements and flow cytometry data in 2015 and 2016. DOC, POC, nitrate, phosphate and silicate concentrations were measured in µM. Chlorophyll a (small size fractions: < 5 µm; large size fractions: > 5 µm) was measured in mg m^-3. Prochlorococcus and Synechococcus cell counts were measured in 10^7 cells L^-1. Missing data points represent missing measurements.


Figure 3. Community composition heatmap. This heatmap shows 34 of the most abundant genera. The size of the circle represents the proportion of cell abundance across stations. The color of the circle indicates the log-transformed absolute abundance (cells L^-1). Samples GN01 through GN111 were from 2015 and samples GN201 through GN295 were from 2016. Sample GN09 was excluded from this figure because its volume of filtered seawater was uncertain. Taxa were collapsed by genus and taxa that did not have a genus were marked with "(no genus)".

References

Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Čech M, Chilton J, Clements D, Coraor N, Eberhard C, et al. 2016. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44(W1):W3–W10. doi:10.1093/nar/gkw343.

Buchan A, LeCleir GR, Gulvik CA, González JM. 2014. Master recyclers: features and functions of bacteria associated with phytoplankton blooms. Nat Rev Microbiol. 12:686.

Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 12(1):59–60. doi:10.1038/nmeth.3176.

DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U, Martinez A, Sullivan MB, Edwards R, Brito BR, et al. 2006. Community Genomics among Stratified Microbial Assemblages in the Ocean’s Interior. Science. 311(5760):496–503.

Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EMS, Chisholm SW. 2006. Niche Partitioning among Prochlorococcus Ecotypes along Ocean-Scale Environmental Gradients. Science. 311(5768):1737–1740.

Karner MG, DeLong EF, Karl DM. 2001. Archaeal dominance in the mesopelagic zone of the Pacific Ocean. Nature. 409(6819):507–10. doi:10.1038/35054051.

Kessler WS. 2006. The circulation of the eastern tropical Pacific: A review. Rev East Trop Pac Oceanogr. 69(2):181–217. doi:10.1016/j.pocean.2006.03.009.

Koonin EV, Wolf YI. 2008. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 36(21):6688–6719. doi:10.1093/nar/gkn668.

Moore CM, Mills MM, Arrigo KR, Berman-Frank I, Bopp L, Boyd PW, Galbraith ED, Geider RJ, Guieu C, Jaccard SL, et al. 2013. Processes and patterns of oceanic nutrient limitation. Nat Geosci. 6:701.

Moran MA. 2015. The global ocean microbiome. Science. 350(6266). doi:10.1126/science.aac8455.

Ottesen EA, Marin R III, Preston CM, Young CR, Ryan JP, Scholin CA, DeLong EF. 2011. Metatranscriptomic analysis of autonomously collected and preserved marine bacterioplankton. ISME J. 5(12):1881–95. doi:10.1038/ismej.2011.70.

Philander SGH. 1985. El Niño and La Niña. J Atmospheric Sci. 42(23):2652–2662. doi:10.1175/1520-0469(1985)042<2652:ENALN>2.0.CO;2.

Satinsky BM, Gifford SM, Crump BC, Moran MA. 2013. Chapter Twelve - Use of Internal Standards for Quantitative Metatranscriptome and Metagenome Analysis. In: DeLong EF, editor. Methods in Enzymology. Vol. 531. Academic Press. p. 237–250.

Stocker R. 2012. Marine Microbes See a Sea of Gradients. Science. 338(6107):628. doi:10.1126/science.1208929.

Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, Djahanschiri B, Zeller G, Mende DR, Alberti A, et al. 2015. Structure and function of the global ocean microbiome. Science. 348(6237):1261359. doi:10.1126/science.1261359.

Teeling H, Fuchs BM, Becher D, Klockow C, Gardebrecht A, Bennke CM, Kassabgy M, Huang S, Mann AJ, Waldmann J, et al. 2012. Substrate-Controlled Succession of Marine Bacterioplankton Populations Induced by a Phytoplankton Bloom. Science. 336(6081):608. doi:10.1126/science.1218344.

Pennington JT, Chavez FP. 2000. Seasonal fluctuations of temperature, salinity, nitrate, chlorophyll and primary production at station H3/M1 over 1989–1996 in Monterey Bay, California. Deep Sea Res Part II Top Stud Oceanogr. 47(5):947–973. doi:10.1016/S0967-0645(99)00132-0.

Wickham H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York

Acknowledgements

Drs. Adrian Marchetti (UNC Chapel Hill), Harvey Seim (UNC Chapel Hill) and Patricia Medeiros (UGA) collected temperature, salinity, nutrient concentration, chlorophyll, cyanobacteria flow cytometry and DOC data and shared them with us. Brooke Stemple and Beryl DeLong (UNC Chapel Hill) extracted DNA and prepared libraries for sequencing. We thank Dr. Gary Bishop for advice on improving the scripts for reference binning and members of the Gifford and Septer labs for valuable input on the project. Dr. Scott Gifford, as my adviser, was helpful at all stages of this project and offered insightful biological interpretations of the data and encouraging feedback. This research was supported by UNC’s Center for Galapagos Studies and the joint UNC-Universidad San Francisco de Quito Galapagos Science Center.

Appendix Table 1. Sequencing and analysis statistics of samples. "total Reads" describes the number of reads after trimming low quality reads, assembling overlapping paired-end reads, and removing the internal standard reads. "B" stands for Blautia producta, "D" stands for Deinococcus radiodurans, and "T" stands for Thermus thermophilus. For 2015 samples, Sa-Blautia is 7.5×10^5, Sa-Deinococcus is 8.0×10^5, and Sa-Thermus is 2.1×10^6 (unit: genome). For 2016 samples, Sa-Blautia is 7.4×10^5, Sa-Deinococcus is 1.1×10^6, and Sa-Thermus is 2.3×10^6 (unit: genome). Sample GN59 had missing filtered volume and the volume shown was estimated.

[Table body not reproduced: the per-sample values could not be recovered in row order from this copy. Columns: year (2015, 2016); sample (GN#); station #; filtered volume (L); paired-end reads (×10^7); total reads (×10^7); annotated bacterial proteins (×10^6); Sr-B; Sr-D; Sr-T; Pg-B; Pg-D; Pg-T; Pg-Mean; protein encoding genes (×10^12 L^-1); genome equivalents, cells L^-1 (genome size 2000); recA count; genome equivalents, cells L^-1 (recA); mean genome size (recA ratio).]