Correcting for 16S rRNA gene copy numbers in microbiome surveys remains an unsolved problem
Stilianos Louca1,2*, Michael Doebeli1,2,3 and Laura Wegener Parfrey1,2,4

Louca et al. Microbiome (2018) 6:41
https://doi.org/10.1186/s40168-018-0420-9

SHORT REPORT, Open Access

Abstract
The 16S ribosomal RNA gene is the most widely used marker gene in microbial ecology. Counts of 16S sequence variants, often in PCR amplicons, are used to estimate proportions of bacterial and archaeal taxa in microbial communities. Because different organisms contain different 16S gene copy numbers (GCNs), sequence variant counts are biased towards clades with greater GCNs. Several tools have recently been developed for predicting GCNs using phylogenetic methods and based on sequenced genomes, in order to correct for these biases. However, the accuracy of those predictions has not been independently assessed. Here, we systematically evaluate the predictability of 16S GCNs across bacterial and archaeal clades, based on ∼6,800 public sequenced genomes and using several phylogenetic methods. Further, we assess the accuracy of GCNs predicted by three recently published tools (PICRUSt, CopyRighter, and PAPRICA) over a wide range of taxa and for 635 microbial communities from varied environments. We find that regardless of the phylogenetic method tested, 16S GCNs could only be accurately predicted for a limited fraction of taxa, namely taxa with closely to moderately related representatives (≲15% divergence in the 16S rRNA gene). Consistent with this observation, we find that all considered tools exhibit low predictive accuracy when evaluated against completely sequenced genomes, in some cases explaining less than 10% of the variance. Substantial disagreement was also observed between tools (R² < 0.5) for the majority of tested microbial communities.
The nearest sequenced taxon index (NSTI) of microbial communities, i.e., the average distance to a sequenced genome, was a strong predictor for the agreement between GCN prediction tools on non-animal-associated samples, but only a moderate predictor for animal-associated samples. We recommend against correcting for 16S GCNs in microbiome surveys by default, unless OTUs are sufficiently closely related to sequenced genomes or unless a need for true OTU proportions warrants the additional noise introduced, so that community profiles remain interpretable and comparable between studies.

Keywords: 16S rRNA, Gene copy number, Microbiome surveys, Phylogenetic reconstruction

*Correspondence: [email protected]
1Biodiversity Research Centre, University of British Columbia, Vancouver, Canada
2Department of Zoology, University of British Columbia, Vancouver, Canada
Full list of author information is available at the end of the article

© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Introduction
Amplicon sequencing of the 16S ribosomal RNA (rRNA) gene is widely used for estimating the composition of bacterial and archaeal communities. Global microbial diversity initiatives, including the Human Microbiome Project [1], the Earth Microbiome Project [2], and the Tara Oceans global ocean survey [3], use the 16S rRNA gene to determine which microbes are present by matching 16S rRNA sequence variants to reference databases like SILVA [4] and estimate the proportions of taxa based on relative read counts. Many bacteria and archaea, however, have more than one copy of the 16S gene, which leads to biased cell count estimates when the latter are estimated solely based on 16S rRNA read counts [5]. This has led to efforts to predict the distribution of 16S gene copy numbers (GCNs) across clades based on available sequenced genomes, in order to then correct 16S rRNA read counts to account for variable 16S GCNs in cells [5–8]. These corrections can substantially affect community profiles and diversity patterns, since some clades have over 10 copies of the 16S rRNA gene [5, 7]. It is thus important to carefully evaluate the accuracy [9] of predicted 16S GCNs across the wide range of microbial clades encountered in microbiome surveys. Inaccurate prediction of 16S GCNs can introduce substantial noise to community profiles, which can be worse than the original GCN-related biases, particularly when prediction methods differ between studies.

An accurate prediction of 16S GCNs relies heavily on the assumption that 16S GCNs are sufficiently phylogenetically conserved. That is, 16S GCNs must be autocorrelated among related taxa at least across phylogenetic distances typically covered by available sequenced genomes [10]. Kembel et al. [5] found that 16S GCN exhibits a strong phylogenetic signal, as measured by Blomberg's K statistic [11], and concluded that 16S GCNs may be predictable based on phylogenetic placement with respect to genomes with known 16S GCN. A similar conclusion was reached independently by Angly et al. [7], based on a strong phylogenetic signal as measured by Pagel's λ [12]. However, neither Blomberg's K nor Pagel's λ makes any statement about the time scales (or phylogenetic scales) over which traits vary. While 16S GCN variation is relatively rare within species, variation increases with taxonomic distance [13] and this may lead to inaccurate predictions for the many clades which are distant from sequenced genomes. To date, no independent evaluation of existing 16S GCN prediction tools has been published.

To resolve these uncertainties, we assessed the phylogenetic autocorrelation of 16S GCNs across bacteria and archaea (prokaryota) in a phylogenetic tree comprising ∼570,000 OTUs (99% similarity in 16S rRNA), based on ∼6,800 quality-checked complete sequenced genomes. The tree was constructed from sequences in SILVA and partly constrained using SILVA's taxonomic annotations. We predicted 16S GCNs using several common phylogenetic reconstruction methods and examined the accuracy achieved by each method for OTUs in the SILVA-derived tree. We assessed the predictive accuracy as a function of an OTU's nearest-sequenced-taxon-distance (NSTD), that is, the minimum phylogenetic distance (mean nucleotide substitutions per site) of the OTU to the nearest sequenced genome. We note that the average NSTD for a particular microbial community, weighted by OTU frequencies, is known as its nearest sequenced taxon index (NSTI; [6]). Further, we systematically assessed the predictive accuracy of three recent tools for correcting 16S GCNs in microbiome surveys, PICRUSt [6], CopyRighter [7], and PAPRICA [8], which together have been [...] NSTD. To further evaluate these tools under more realistic scenarios, we also compare all tools to each other for OTUs in 635 prokaryotic communities sampled from diverse natural environments, including the ocean, lakes, hot springs, soil, bioreactors, and animal guts. We find that 16S GCNs are moderately phylogenetically conserved and that prediction of 16S GCNs for the large number of clades without sequenced genomes from close relatives will generally be inaccurate. This conclusion is verified by our finding of low predictive accuracies by CopyRighter, PICRUSt, and PAPRICA, both for the sequenced genomes as well as when compared to each other on microbiomes. Using phylogenetically predicted 16S GCNs to correct 16S read counts in microbiome surveys, as previously suggested [5–7], worked well only for a small number of microbiomes.

Results and discussion
How predictable are 16S GCNs from phylogeny?
We found that the autocorrelation function of 16S GCNs, that is, the correlation between the GCNs of two randomly picked OTUs at a certain phylogenetic distance, decays moderately with increasing phylogenetic distance (Fig. 1a), dropping below 0.5 at a phylogenetic distance of ∼15% and to zero at a phylogenetic distance of ∼30% (nucleotide substitutions per site in the 16S gene). Hence, predictions of 16S GCNs are expected to be inaccurate for clades with an NSTD greater than about 15% and close to random for clades with an NSTD greater than about 30%. To explicitly test this conclusion, we predicted 16S GCNs for randomly chosen tips of our SILVA-derived tree and compared these predictions to the GCNs known from complete sequenced genomes, where possible. We considered the following common ancestral state reconstruction algorithms for predicting GCNs: Sankoff's maximum-parsimony with various transition costs [14], maximum-likelihood of Mk models with rerooting (equal rates model), weighted-squared-change parsimony [15], phylogenetic independent contrasts (PIC) [16], and subtree averaging (arithmetic average of GCNs across descending tips). CopyRighter and PICRUSt use PIC, while PAPRICA uses subtree averaging. We measured the accuracy of each method using the cross-validated coefficient of determination R²_cv [17]. The R²_cv corresponds to the fraction of variance explained by a reconstruction algorithm, when tested against a separate set of randomly chosen sequenced genomes (“test
