Rachel L. Harris

Insights into the phylogeny and coding potential of microbial dark matter: a replication of phylogenetic anchoring methods described by Rinke et al., 2013

Rachel Harris, David Zhao, Melany Ruiz Urigen, & Chuhan Zong
QCB 455 – MOL 455 – COS 551, Fall 2014 | Instructor: Dr. Anastasia Baryshnikova

ABSTRACT

In their 2013 study, Rinke et al. challenge the efficacy of a gold standard reference genome database in accurately anchoring metagenomic reads. By appending 201 uncultivated archaeal and bacterial genomes representing largely uncharted taxa (so-called “microbial dark matter”) to this gold standard, Rinke et al. significantly improve phylogenetic anchoring of 475 metagenomes. In this study, we apply Rinke et al.’s methods to ten of their reported top-recruiting metagenomes. We not only replicate but also improve upon their results, concluding that microbial dark matter genomes are key to improving the phylogenetic anchoring performance of reference genome databases.

INTRODUCTION

A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all sequenced prokaryotic phyla lacked a cultivable representative, and 88% of all these known phyla were phylogenetically anchored as belonging to Proteobacteria, Firmicutes, Actinobacteria, or Bacteroidetes¹. Rinke et al. attempt to overcome these biases by testing whether the genomes of microbial dark matter (uncultivated prokaryotes representing poorly sequenced branches on the tree of life), when appended to an NCBI BLASTx reference database, improve phylogenetic anchoring for queried metagenomic reads². In this study we aimed to confirm Rinke et al.’s methodology by means of alternative tools, including BLASTn, the R software environment, and the Galaxy computational platform. The results of the two studies were in agreement, and our analyses often improved upon those of Rinke et al.
MATERIALS AND METHODS

Ten metagenomes were selected for analysis from publicly available databases according to their relative anchoring performance (Rinke et al., Figure 4) and their representation of nine diverse habitats: Sakinaw Lake (SAK), TA Mother Reactor (TAM), GBS 85C sediment (GBS), Saanich Inlet pooled fosmids (SAA), GOS Mangrove on Isabella Island (MAN), Yellowstone Bison Hot Spring (BIS), Line P J08P26-500 (LNP), TA reactor biofilm (BIO), Peru Margin (PER), and Marine Sediments sample SCG71 (MAR). All 201 single amplified genome (SAG) assemblies were accessed from the Microbial Dark Matter project website (http://genome.jgi.doe.gov/MDM).

A random subset of 10,000 reads was extracted from each metagenome and subjected to two runs of NCBI’s Nucleotide BLAST (BLASTn). The first run BLASTed each metagenome against NCBI’s non-redundant nucleotide (nt) database, whereas the second run BLASTed each metagenome against a modified database composed of the nt and Rinke et al.’s 201 SAG assemblies. The resulting BLASTn hits against the nt and nt+SAGs databases are hereafter referred to as BLAST Hits 1 (BH1) and BLAST Hits 2 (BH2), respectively. Target labels of these hits were identified by either NCBI’s GI sequence identification markers (GI IDs) or one of Rinke et al.’s SAG IDs. A third BLASTn run was subsequently performed on all SAG assemblies in order to exchange any SAG target labels identified in BH2 for their respective GI IDs. Queries originally assigned to SAG targets that were found to have no corresponding GI ID were considered false positives, and these entries were removed from the BH2 analysis. Duplicate queries with the same target label were also removed from both BH1 and BH2 to ensure the validity of community composition and subsequent statistical analysis.

Whereas Rinke et al.
determined phylogenies of BLAST hits with the aid of MEGAN4 software³, we obtained taxonomic summaries by submitting GI ID targets from both BLAST hit sets (BH1 and BH2) of each metagenome to Princeton University’s Galaxy Project⁴ server (https://galaxy.princeton.edu). All statistical analyses were performed on BLAST hits at the phylum level in the R software environment (http://r-project.org) to determine whether BLASTing against nt+SAGs yielded significant improvements in read anchoring, phylogenetic binning, and percent identity distribution relative to BLASTing against the nt alone.

RESULTS

Anchoring of Metagenomic Reads

Rinke et al. report >2% BH2 read anchoring at the phylum level for all ten metagenomes analyzed in this study. We largely confirm these findings and, for six of the ten metagenomes, improve upon them by recovering greater read anchoring (Fig. 1). Only the BIO and PER metagenomes yield <2% read anchoring (1.49% and 1.57%, respectively). Whereas the Rinke study reports SAK as demonstrating the greatest recovery of read hits (19.56%), our results show a three-fold improvement in anchoring for our highest-recruiting metagenome, BIS (60.22%). A paired Student’s t-test of all BLAST hits from our ten analyzed metagenomes reveals that significantly (P=0.00023) more reads were assigned at the phylum level for BH2 relative to BH1. This result is in concordance with that of Rinke et al. (P=0.00024), who performed the same analysis for their top 19 recruiting metagenomes (Fig. 1a).

Phylogenetic Binning

Figure 4 in Rinke et al. depicts the 23 most anchored phyla following classification of BH2 hits by MEGAN4. By contrast, phylogenetic binning conducted via Galaxy in this analysis reveals only eight of these phyla as top recruiters in the surveyed metagenomes.
However, all eight phyla that overlap between the two studies (Acetothermia, Caldiserica, Cloacimonetes, Marinimicrobia, Synergistetes, Euryarchaeota, Nanoarchaeota, and Thaumarchaeota) show improved phylogenetic anchoring from BH1 to BH2 across several metagenomes for which Rinke et al. noted no such improvement (Fig. 1a). Furthermore, our results indicate at least four additional phyla not mentioned in the parent study that demonstrate an average improvement in binning of ≥1% across all metagenomes: Aquificae, Firmicutes, Ignavibacteriae, and Proteobacteria (Fig. 1b).

Fig. 1 | Phylogenetic anchoring. 1a. Modified Figure 4 from the original Rinke et al. publication, depicting the 19 top-recruiting metagenomes characterized by >2% phylum-level read anchoring. Highlighted metagenomes are those analyzed in this study. Top-recruiting phyla in Rinke et al.’s study are listed at the top, with marked phyla representing overlapping top recruiters in our own analysis. Black rectangles represent additional phylum-level recruits elucidated in our analysis that were not discovered by Rinke et al. 1b. Replication of the Rinke et al. heat map described in 1a, portraying all phyla demonstrating improved read anchoring from BH1 to BH2. Grey cells label phyla showing 0% anchoring improvement.

Beyond the analysis conducted by Rinke et al., we tested each metagenome individually for significant phylum-level differences in community composition. A paired Student’s t-test revealed a significant difference in classification at the phylum level for the LNP (P=0.02627), SAK (P=0.03152), and TAM (P=0.01966) metagenomes (Fig. 2).

Percent Identity Distribution

We also determined whether BLASTing metagenomes against nt+SAGs improved phylogenetic anchoring at a confidence threshold of ≥97% query-target shared nucleotide identity relative to BLASTing against the nt alone.
This was performed by removing all BLAST hits that shared the same query and target label across the BH1 and BH2 databases for each metagenome, leaving behind only novel classifications for consideration. Unpaired, one-sided t-tests revealed significant improvements in phylum-level clustering for the GBS (P=0.03569), LNP (P=5.1E-06), SAK (P=2.2E-16), and TAM (P=2.2E-16) metagenomes (Fig. 3a). Although only four of the ten analyzed metagenomes showed significant improvement in anchoring with ≥97% identity, all novel BH2 classifications in this study demonstrated significantly improved percent identities (Welch two-sample t-test, P=2.2E-16) relative to queries that maintained the same target label in both BH1 and BH2 databases (Fig. 3b).

Fig. 2 | Community Composition at the Phylum Level. Distribution of phylum-level classifications for BH1 hits against the nt database (left panel) and BH2 hits against the nt+SAGs database (right panel). BH2 hits characterized by statistically significant changes in community composition are marked. P = 0.02627, 0.03152, and 0.01966 for LNP, SAK, and TAM, respectively.

Fig. 3 | True hit (CI≥97%) trends relative to BLAST type. 3a. Per-metagenome BH1:BH2 hit proportions with percent identities ≥97%. Significant improvements in classification above this threshold for BH2 data are marked (GBS, P=0.03569; LNP, P=5.1E-06; SAK, P=2.2E-16; and TAM, P=2.2E-16). 3b. Distribution of percent identities for all BLAST hits across all analyzed metagenomes. A significant increase in the number of hits with ≥97% query-target identity is generally associated with BH2 hits (P=2.2E-16). This increase is clearly enhanced when only novel hits are considered.

DISCUSSION

With the exception of a few inconsistencies, the results of this analysis show robust agreement with those of Rinke et al. In several instances we were able not only to duplicate their results, but also to improve upon them. We attribute these improvements to the exponential growth of NCBI’s non-redundant reference databases. As of January 2015, more than 11,100 reference genomes are publicly available in NCBI’s databases (http://ncbi.nlm.nih.gov); this is more than twice the number of reference genomes that were available at the time of the parent study’s publication in early 2013 and more than ten times the number available at the start of that study in mid-2010⁵. Advancements in high-throughput sequencing technologies have enabled swift and reliable taxonomic identifications from uncultured microbial samples. As per-sample costs have dropped, the number of published metagenomes has risen, drastically improving our knowledge of microbial diversity⁶.
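For readers who wish to reproduce the hit-filtering steps described in Materials and Methods and in the percent identity analysis, the following is a minimal illustrative sketch in Python. It is not the code used in this study, which relied on R and Galaxy. It assumes each BLAST hit has already been reduced to a query label, a target label (with SAG labels exchanged for GI IDs), and a percent identity. The `Hit` record, the function names, and the toy GI IDs are hypothetical and serve only to show duplicate removal, extraction of novel BH2 classifications, and the share of hits at or above 97% identity.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLASTn hit (hypothetical record layout for illustration)."""
    query: str       # read identifier
    target: str      # target label, assumed already converted to a GI ID
    identity: float  # percent identity reported for the alignment


def dedupe(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the first hit seen for each (query, target) pair."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit.query, hit.target)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


def novel_hits(bh1: Iterable[Hit], bh2: Iterable[Hit]) -> List[Hit]:
    """Return BH2 hits whose (query, target) pair does not occur in BH1."""
    shared = {(h.query, h.target) for h in bh1}
    return [h for h in bh2 if (h.query, h.target) not in shared]


def fraction_at_or_above(hits: List[Hit], threshold: float = 97.0) -> float:
    """Fraction of hits with percent identity >= threshold (0.0 if empty)."""
    if not hits:
        return 0.0
    return sum(h.identity >= threshold for h in hits) / len(hits)


# Toy data only; these labels and identities are invented for the example.
bh1 = dedupe([
    Hit("read1", "gi|111", 92.0),
    Hit("read1", "gi|111", 92.0),  # duplicate query/target pair, dropped
    Hit("read2", "gi|222", 98.5),
])
bh2 = dedupe([
    Hit("read1", "gi|333", 99.1),  # new target label: a novel classification
    Hit("read2", "gi|222", 98.5),  # same label as in BH1: not novel
    Hit("read3", "gi|444", 95.0),  # read unanchored in BH1: novel
])

novel = novel_hits(bh1, bh2)
share_high_identity = fraction_at_or_above(novel)
```

In this sketch the per-metagenome shares produced by `fraction_at_or_above` would be the inputs to the t-tests reported above; the tests themselves are left to a statistics package.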