Rachel L. Harris 1

Insights into the phylogeny and coding potential of microbial dark matter: a replication of phylogenetic anchoring methods described by Rinke et al., 2013

Rachel Harris, David Zhao, Melany Ruiz Urigen, & Chuhan Zong

QCB 455 – MOL 455 – COS 551, Fall 2014 | Instructor: Dr. Anastasia Baryshnikova

ABSTRACT

In their 2013 study, Rinke et al. challenge the efficacy of a gold standard reference genome database in accurately anchoring metagenomic reads. By appending 201 uncultivated archaeal and bacterial genomes representing largely uncharted taxa (so-called “microbial dark matter”) to this gold standard, Rinke et al. significantly improve phylogenetic anchoring of 475 metagenomes. In this study, we apply their methods to ten of their reported top-recruiting metagenomes. We not only replicate but also improve upon their results, concluding that microbial dark matter genomes are key to improving the phylogenetic anchoring performance of reference genome databases.

INTRODUCTION

A clear cultivation bias exists in microbial genome sequencing. As of 2010, half of all sequenced prokaryotic phyla lacked a cultivable representative, and 88% of sequenced representatives were phylogenetically anchored to just four phyla: Proteobacteria, Firmicutes, Actinobacteria, or Bacteroidetes¹. Rinke et al. attempt to abolish these biases by testing whether the genomes of microbial dark matter (uncultivated organisms representing poorly sequenced branches of the tree of life), when appended to an NCBI BLASTx reference database, improve phylogenetic anchoring for queried metagenomic reads². In this study we aimed to confirm Rinke et al.’s methodology by means of alternative tools, including BLASTn, the R software environment, and the Galaxy platform. The results of the two studies agreed, and our own analyses often improved upon those of Rinke et al.

MATERIALS AND METHODS

Ten metagenomes were selected for analysis from publicly available databases according to their relative anchoring performance (Rinke et al., Figure 4) and their representation of nine diverse habitats: Sakinaw Lake (SAK), TA Mother Reactor (TAM), GBS 85C sediment (GBS), Saanich Inlet pooled fosmids (SAA), GOS Mangrove on Isabella Island (MAN), Yellowstone Bison Hot Spring (BIS), Line P J08P26-500 (LNP), TA reactor (BIO), Peru Margin (PER), and Marine Sediments sample SCG71 (MAR). All 201 SAG assemblies were accessed from the Microbial Dark Matter project website (http://genome.jgi.doe.gov/MDM).

A random subset of 10,000 reads was extracted from each metagenome and subjected to two runs of NCBI’s nucleotide BLAST (BLASTn). The first run searched each metagenome against NCBI’s non-redundant nucleotide (nt) database, whereas the second run searched each metagenome against a modified database composed of the nt and Rinke et al.’s 201 SAG assemblies (nt+SAGs). The resulting BLASTn hits against the nt and nt+SAGs databases are hereafter referred to as BLAST Hits 1 (BH1) and BLAST Hits 2 (BH2), respectively. Target labels of these hits were either NCBI GI sequence identifiers (GI IDs) or one of Rinke et al.’s SAG IDs. A third BLASTn run was subsequently performed on all SAG assemblies in order to exchange any SAG target labels identified in BH2 for their respective GI IDs. Queries originally assigned to SAG targets that had no corresponding GI ID were considered false positives, and these entries were removed from BH2 analysis. Duplicate queries with the same target label were also removed from both BH1 and BH2 to ensure the validity of community composition estimates and subsequent statistical analysis.
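
The read subsampling and BH2 clean-up steps described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the tooling actually used in this study; the function names, the `sag_ids` set, and the `sag_to_gi` lookup (standing in for the output of the third BLASTn run) are hypothetical.

```python
import random


def subsample_reads(read_ids, n=10_000, seed=0):
    """Return a random subset of at most n read IDs, drawn without replacement."""
    read_ids = list(read_ids)
    if len(read_ids) <= n:
        return read_ids
    return random.Random(seed).sample(read_ids, n)


def clean_hits(hits, sag_ids, sag_to_gi):
    """Relabel SAG targets with GI IDs, drop unmapped SAG hits, de-duplicate.

    hits      : iterable of (query_id, target_label) pairs
    sag_ids   : set of labels that denote SAG assemblies (assumed known)
    sag_to_gi : dict mapping SAG ID -> GI ID (hypothetical lookup table)
    """
    seen = set()
    cleaned = []
    for query, target in hits:
        if target in sag_ids:
            gi = sag_to_gi.get(target)
            if gi is None:
                # SAG target without a corresponding GI ID: treated as a false positive
                continue
            target = gi
        key = (query, target)
        if key in seen:
            # Duplicate query with the same target label
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

For BH1, the same function can be called with an empty `sag_ids` set, in which case only the duplicate removal applies.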

Whereas Rinke et al. determined phylogenies of BLAST hits with the aid of MEGAN4 software³, we obtained taxonomic summaries by submitting GI ID targets from both BLAST hit databases of each metagenome to Princeton University’s Galaxy Project⁴ server (https://galaxy.princeton.edu). All statistical analyses were performed on BLAST hits at the phylum level in the R software environment (http://r-project.org) to determine whether BLASTing against nt+SAGs yielded significant improvements in read anchoring, phylogenetic binning, and percent identity distribution relative to BLASTing against the nt alone.


RESULTS

Anchoring of Metagenomic Reads

Rinke et al. report >2% BH2 read anchoring at the phylum level for all ten metagenomes analyzed in this study. We confirm these findings for eight of the ten metagenomes and recover greater read anchoring than Rinke et al. for six of them (Fig. 1). Only the BIO and PER metagenomes yield <2% read anchoring, at 1.49% and 1.57%, respectively. Whereas the Rinke study reports SAK as demonstrating the greatest recovery of read hits (19.56%), our highest-recruiting metagenome, BIS, anchors 60.22% of reads, a roughly three-fold improvement. A paired Student’s t-test across our ten analyzed metagenomes reveals that significantly more reads (P=0.00023) were assigned at the phylum level for BH2 than for BH1. This result agrees with that of Rinke et al. (P=0.00024), who performed the same analysis for their top 19 recruiting metagenomes (Fig. 1a).
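
For readers unfamiliar with the paired test used here, the statistic can be sketched as below. This is an illustrative standard-library Python version, not the R code used for the analysis; the function name is hypothetical. It returns only the t statistic and degrees of freedom; the P value would then be read from the t distribution, for example via R's `t.test(..., paired = TRUE)` or `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t(bh1_values, bh2_values):
    """Paired Student's t statistic for BH2 versus BH1, one pair per metagenome.

    Returns (t, degrees_of_freedom).
    """
    if len(bh1_values) != len(bh2_values):
        raise ValueError("paired samples must have equal length")
    diffs = [b2 - b1 for b1, b2 in zip(bh1_values, bh2_values)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.fmean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder percentages for three imaginary metagenomes (not the study's data)
t_stat, dof = paired_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```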

Phylogenetic Binning

Figure 4 in Rinke et al. depicts the 23 most anchored phyla following classification of BH2 hits by MEGAN4. By contrast, phylogenetic binning conducted via Galaxy in this analysis reveals an overlap of only eight phyla as top recruiters in the surveyed metagenomes. However, all eight overlapping phyla (Acetothermia, Caldiserica, Cloacimonetes, Marinimicrobia, Synergistetes, Euryarchaeota, Nanoarchaeota, and Thaumarchaeota) show improved phylogenetic anchoring from BH1 to BH2 in several metagenomes for which Rinke et al. noted no such improvement (Fig. 1a). Our results also indicate at least four additional phyla not mentioned in the parent study that demonstrate an average improvement in binning of ≥1% across all metagenomes: Aquificae, Firmicutes, Ignavibacteriae, and Proteobacteria (Fig. 1b).
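
The per-phylum comparison between BH1 and BH2 amounts to simple bookkeeping, sketched below in Python. This is only an illustration under stated assumptions: each hit is taken to have already been assigned a phylum name (for example by the Galaxy taxonomy summary), the percentages are expressed relative to the number of reads queried, and all function names are hypothetical.

```python
from collections import Counter


def phylum_percentages(phylum_calls, n_reads):
    """Percent of queried reads assigned to each phylum."""
    counts = Counter(phylum_calls)
    return {phylum: 100.0 * c / n_reads for phylum, c in counts.items()}


def binning_improvement(bh1_calls, bh2_calls, n_reads):
    """Change in percent anchoring per phylum from BH1 to BH2 (percentage points)."""
    pct1 = phylum_percentages(bh1_calls, n_reads)
    pct2 = phylum_percentages(bh2_calls, n_reads)
    return {p: pct2.get(p, 0.0) - pct1.get(p, 0.0) for p in set(pct1) | set(pct2)}


def bh2_unique_phyla(bh1_calls, bh2_calls):
    """Phyla that appear among BH2 assignments but never among BH1 assignments."""
    return sorted(set(bh2_calls) - set(bh1_calls))
```

Averaging the per-phylum improvement over all ten metagenomes and keeping phyla with a mean of at least 1 would give a list comparable to the additional recruiters named above.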

Fig. 1 | Phylogenetic anchoring. 1a. Modified Figure 4 from the original Rinke et al. publication, depicting 19 top-recruiting metagenomes characterized by >2% phylum-level read anchoring. Highlighted metagenomes are those analyzed in this study. Top-recruiting phyla in Rinke et al.’s study are listed at the top, with marked phyla representing overlapping top recruiters in our own analysis. Black rectangles represent additional phylum-level recruits elucidated in our analysis that were not discovered by Rinke et al. 1b. Replication of the Rinke heat map described in 1a, portraying all phyla that demonstrate improved read anchoring from BH1 to BH2. Grey cells label phyla showing 0% anchoring improvement.

Beyond the analysis conducted by Rinke et al., we tested each metagenome individually for significant phylum-level differences in community composition. A paired Student’s t-test revealed a significant difference in classification at the phylum level for the LNP (P=0.02627), SAK (P=0.03152), and TAM (P=0.01966) metagenomes (Fig. 2).

Percent Identity Distribution

We also determined whether BLASTing metagenomes against nt+SAGs improved phylogenetic anchoring at a threshold of ≥97% query-target shared nucleotide identity relative to BLASTing against the nt alone. To do so, we removed all BLAST hits that shared the same query and target label across the BH1 and BH2 databases for each metagenome, leaving only novel classifications for consideration.
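
The filtering step just described can be sketched as a set operation on (query, target) pairs. The Python below is a minimal illustration, assuming each hit table is held as a dictionary from (query, target) to percent identity; this data layout and the function names are assumptions for the example, not the format used in this study.

```python
def novel_hits(bh1, bh2):
    """Drop hits whose (query, target) pair occurs in both BH1 and BH2.

    bh1, bh2 : dict mapping (query_id, target_label) -> percent identity
    Returns the novel BH1 hits and the novel BH2 hits.
    """
    shared = set(bh1) & set(bh2)
    novel1 = {k: v for k, v in bh1.items() if k not in shared}
    novel2 = {k: v for k, v in bh2.items() if k not in shared}
    return novel1, novel2


def fraction_at_or_above(hits, threshold=97.0):
    """Fraction of hits whose percent identity meets the threshold."""
    if not hits:
        return 0.0
    return sum(1 for identity in hits.values() if identity >= threshold) / len(hits)
```

Comparing `fraction_at_or_above` for the novel BH1 and BH2 hits of a metagenome gives the kind of BH1:BH2 proportion that is summarized per metagenome in Fig. 3a.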


We were able to elucidate, via unpaired, one-sided t-tests, significant improvements in phylum-level clustering for the GBS (P=0.03569), LNP (P=5.1E-06), SAK (P=2.2E-16), and TAM (P=2.2E-16) metagenomes (Fig. 3a). Despite only four of the ten analyzed metagenomes showing significant improvement in anchoring with ≥97% identity, all novel BH2 classifications in this study demonstrated significantly improved percent identities (Welch two-sample t-test, P=2.2E-16) relative to queries that maintained the same target label in both BH1 and BH2 databases (Fig. 3b).

Fig. 2 | Community Composition at the Phylum Level. Distribution of phylum-level classifications for BH1 hits against the nt database (left panel) and BH2 hits against the nt+SAGs database (right panel). BH2 hits characterized by statistically significant changes in community composition are marked. P = 0.02627, 0.03152, and 0.01966 for LNP, SAK, and TAM, respectively.
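
For reference, the Welch two-sample statistic mentioned above differs from the pooled-variance t-test in its standard error and in its degrees of freedom (the Welch-Satterthwaite approximation). The following standard-library Python sketch shows the calculation; it is illustrative only, the function name is hypothetical, and the P value would in practice come from R's `t.test` (whose default is the Welch test) or `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    x, y : sequences of percent identities for two independent groups of hits
    Returns (t, degrees_of_freedom).
    """
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each sample needs at least two observations")
    vx_n = statistics.variance(x) / nx  # squared standard error of mean(x)
    vy_n = statistics.variance(y) / ny  # squared standard error of mean(y)
    se2 = vx_n + vy_n
    if se2 == 0:
        raise ValueError("both samples have zero variance; t is undefined")
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se2)
    dof = se2 ** 2 / (vx_n ** 2 / (nx - 1) + vy_n ** 2 / (ny - 1))
    return t, dof
```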


Fig. 3 | True hit (≥97% identity) trends relative to BLAST type. 3a. Per-metagenome BH1:BH2 hit proportions with percent identities ≥97%. Significant improvements in classification above this threshold for BH2 data are marked (GBS, P=0.03569; LNP, P=5.1E-06; SAK, P=2.2E-16; and TAM, P=2.2E-16). 3b. Distribution of percent identities for all BLAST hits across all analyzed metagenomes. A significant increase in the number of hits with ≥97% query-target identity is generally associated with BH2 hits (P=2.2E-16). This increase is clearly enhanced when only novel hits are considered.

DISCUSSION

With the exception of a few inconsistencies, the results of this analysis show robust agreement with those of Rinke et al. In several instances we were able not only to duplicate their results but also to improve upon them. We attribute these improvements to the rapid growth of NCBI’s non-redundant reference databases. As of January 2015, more than 11,100 reference genomes are publicly available in NCBI’s databases (http://ncbi.nlm.nih.gov); this is more than twice the number of reference genomes that were available at the time of the parent study’s publication in early 2013 and more than ten times the number available at the start of the study in mid-2010⁵. Advancements in high-throughput sequencing technologies have enabled swift and reliable taxonomic identifications from uncultured microbial samples. As per-sample costs have dropped, the number of published metagenomes has risen, drastically improving our knowledge of microbial diversity⁶. This improvement is particularly evident in our own read anchoring data (Fig. 1).

It is possible that some discrepancies between our data and those published by Rinke et al. may be attributed to differences in the choice of processing tools. For example, where Rinke et al. used NCBI’s BLASTx algorithm to search metagenomic reads against the non-redundant protein database (nr), we used BLASTn against the nt database. Both methods are valid for elucidating taxonomic information from raw reads. We elected to employ BLASTn over BLASTx for two reasons: its faster processing time (BLASTx translates nucleotide queries as they are submitted for analysis, whereas BLASTn directly searches nucleotide strings against the nt) and its more concise output (BLASTx outputs protein-specific GI IDs, which were useful to Rinke et al. in a part of their study that is not relevant to this investigation).

Nevertheless, we maintain that major statistical differences between our results and Rinke et al.’s are most likely the result of the tremendous growth of NCBI reference databases. For instance, discrepancies between the two studies’ top-recruiting phyla can be attributed to the expansion in the number of unique representative genomes per phylum since Rinke et al.’s original analysis. Our results reflect this improvement, and are supported by significantly increased read anchoring (Fig. 1) and improved binning of true hits (Fig. 3).

Notwithstanding major improvements in reference genome databases, this study’s replication of Rinke et al.’s methods continues to support the notion that appending MDM single-cell genomes to these databases still results in significantly improved phylogenetic anchoring for submitted queries. As such, we regard single-cell genomics as a viable next step in elucidating rare taxa in microbial communities, since these genomes contribute significantly to the correct inference of community composition.

WORKS CITED

1. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 551–553 (2009).
2. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
3. Huson, D. H., Mitra, S., Ruscheweyh, H.-J., Weber, N. & Schuster, S. C. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21, 1552–1560 (2011).
4. Blankenberg, D. et al. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology (2010). doi:10.1002/0471142727.mb1910s89
5. Lagesen, K., Ussery, D. W. & Wassenaar, T. M. Genome update: the 1000th genome--a cautionary tale. Microbiology 156, 603–608 (2010).
6. Ni, J., Yan, Q. & Yu, Y. How much metagenomic sequencing is enough to achieve a given goal? Sci. Rep. 3, 1968 (2013).

SUPPLEMENTARY MATERIAL

Table S1 | Unique phylum-level assignments in BH2. Seven of ten analyzed metagenomes exhibit novel phylum hits when raw reads are BLASTed against the nt+SAGs database. Phyla marked with * represent overlapping top recruiters in the analysis by Rinke et al.

Metagenome    BH2-unique Phyla
BIO           N/A
BIS           Caldiserica*, Dictyoglomi, Elusimicrobia, Tenericutes
GBS           Gemmatimonadetes, Synergistetes*
LNP           Cloacimonetes*, Phaeophyceae, Xanthophyceae
MAN           Cloacimonetes*
MAR           N/A
PER           N/A
SAA           Gemmatimonadetes
SAK           Acetothermia*, Elusimicrobia, Fusobacteria, Synergistetes*
TAM           Aquificae, Chlamydiae, Deferribacteres, Dictyoglomi, Gemmatimonadetes, Nitrospirae, Synergistetes*