Confident Phylogenetic Identification of Uncultured Prokaryotes Through
Total Page:16
File Type:pdf, Size:1020Kb
Environmental Microbiology (2019) 21(7), 2485–2498 doi:10.1111/1462-2920.14636 Confident phylogenetic identification of uncultured prokaryotes through long read amplicon sequencing of the 16S-ITS-23S rRNA operon Joran Martijn ,1 Anders E. Lind,1 Max E. Schön,1 method to those who wish to cost-effectively and confi- Ian Spiertz,1 Lina Juzokaite,1 Ignas Bunikis,2 dently estimate the phylogenetic diversity of prokary- Olga V. Pettersson2 and Thijs J. G. Ettema 1,3* otes in environmental samples at high throughput. 1Department of Cell and Molecular Biology, Science for Life Laboratory, Uppsala University, SE-75123, Uppsala, Sweden. Introduction 2Science for Life Laboratory, Uppsala University, The 16S rRNA gene has been used for decades to phylo- SE-75185, Uppsala, Sweden. genetically classify bacteria and archaea (Woese and Fox, 3 Laboratory of Microbiology, Department of 1977). The gene excels in this respect because of its uni- Agrotechnology and Food Sciences, Wageningen versal occurrence, resistance to horizontal gene transfer University, Stippeneng 4, 6708WE, Wageningen, and high degree of conservation (Woese, 1987; Green The Netherlands. and Noller, 1997). Highly conserved regions are inter- spersed with highly variable regions, allowing for phyloge- Summary netic classification at species and higher taxonomic levels. In addition, the gene has proven to be an excellent target Amplicon sequencing of the 16S rRNA gene is the pre- for studies aiming to quantify the taxonomic composition dominant method to quantify microbial compositions of microbial communities via high-throughput PCR and to discover novel lineages. However, traditional amplicon sequencing (Doolittle, 1999). Primers are usually short amplicons often do not contain enough informa- designed such that they anneal to stretches of conserved tion to confidently resolve their phylogeny. Here we sites that flank a variable region, in effect of capturing the present a cost-effective protocol that amplifies a large informative variable region of a large fraction of the micro- part of the rRNA operon and sequences the amplicons bial community. 16S rRNA gene amplicon surveys are with PacBio technology. We tested our method on a now a standard method in microbial ecology and have led mock community and developed a read-curation pipe- to important insights into the taxonomic makeup of many line that reduces the overall read error rate to 0.18%. different environments. Examples include oceanic waters Applying our method on four environmental samples, (Sogin et al., 2006), deep sea sediments (Jorgensen we captured near full-length rRNA operon amplicons et al., 2012), hot springs (Hou et al., 2013) and the human from a large diversity of prokaryotes. The method oper- gut (Turnbaugh et al., 2007). ated at moderately high-throughput (22286–37,850 raw Despite its many advantages, the 16S rRNA gene is lim- ccs reads) and generated a large amount of putative novel ited in its number of phylogenetically informative sites. 16S archaeal 23S rRNA gene sequences compared to the rRNA gene-based phylogenetic analyses are therefore archaeal SILVA database. These long amplicons allowed sensitive to stochastic error and exhibit limited resolution for higher resolution during taxonomic classification by (Brown et al., 2001; Delsuc et al., 2005). Studies aiming to means of long (1000 bp) 16S rRNA gene fragments and resolve deeper evolutionary relationships between taxa for substantially more confident phylogenies by means of often favour large data sets of conserved protein coding combined near full-length 16S and 23S rRNA gene genes over the 16S rRNA gene to overcome such error sequences, compared to shorter traditional amplicons and increase resolution (Brown et al., 2001; Wolf et al., (250 bp of the 16S rRNA gene). We recommend our 2001; Brochier et al., 2002; Matte-Tailliez et al., 2002). Another frequently used method is to concatenate the 16S rRNA gene with the larger 23S rRNA gene (Williams et al., Received 30 August, 2018; revised 16 April, 2019; accepted 18 April, 2019. *For correspondence. Tel. (+31)6 38164355. E-mail 2012; Ferla et al., 2013; Zaremba-Niedzwiedzka et al., [email protected] 2017). However, both methods typically require that © 2019 The Authors. Environmental Microbiology published by Society for Applied Microbiology and John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes. 2486 J. Martijn et al. genome sequences are available for the taxa in question. A ccs reads BC While it is now possible to obtain genomic data via meta- 2.45% 22286 genomic binning (Albertsen et al., 2013; Alneberg et al., ·read quality > .99 sub 2014), relatively expensive deep sequencing and computa- ·no local low ins quality stretches tionally demanding metagenome assembly is required. In ·not siamaera del addition, draft genomes acquired via metagenomic binning ·not chimera et al., approaches often lack rRNA genes (Hugenholtz ·encodes 16S & 2016; Nelson and Mobberley, 2017), which obstructs 23S rRNA genes linking the obtained metagenomic bins to environmental lin- eages observed in 16S rRNA amplicon surveys. One pos- mean %error quality trimmed sible solution is to obtain 16S and 23S rRNA gene ccs reads .64% sequences simultaneously via PCR amplicon sequencing. number of reads (mock) 4916 This approach exploits that both rRNA genes are often ·precluster ·align .18% neighbouring (67% in known bacterial genomes, 74% in ·consensus 110 known archaeal genomes—Supporting Information Table S1). However, sequences generated by currently ccs ccs consensus available high-throughput sequencing methods are too ccs reads qtrim ccs qtrim ccs short to capture such long amplicons. consensus consensus ‘ ’ With the introduction of Pacific Bioscience’ssingle- Fig. 1. A. Read curation pipeline. ccs = circular consensus sequence, ‘qtrim’ = quality trimmed. For a more detailed overview of the pipeline molecule real-time (SMRT) sequencing technology, see Supporting Information Fig. S1. sequencing long amplicons at moderately high through- B. Observed mean error rates. ‘del’ = deletions, ‘ins’ =insertions, ‘ ’ put became realistic. Its relatively high sequencing error sub = substitutions. C. Number of (remaining) ccs reads of the mock community per stage rates (15%) are now substantially reduced via circular of the read curation pipeline. consensus sequencing (ccs). A number of pioneering studies have already used the technology to success- available that encode at least one 16S-ITS-23S cluster. A fully obtain near full-length 16S rRNA gene sequences single SMRT cell RSII run generated 22,286 ccs reads. with low error rates (Schloss et al., 2016; Singer et al., We observed a mean error rate of 2.45%. The large 2016; Wagner et al., 2016). Here we go one step fur- majority of errors were deletions (77.7%), followed by ther and obtain near full-length 16S and 23S rRNA insertions (15.8%) and substitutions (6.5%) (Fig. 1). Of genes by sequencing a large part of the rRNA operon. these, 868 ccs reads (3.9%) had no errors. We develop a read curation pipeline that deals with We sought to reduce the error rate by removing reads PacBio-specific issues and evaluate error rates with a that are highly erroneous because of several reasons. phylogenetically diverse mock community. We apply First, as was observed by (Schloss et al., 2016), error our method to four diverse environmental samples and rates were strongly correlated with the ‘read quality’ values compare our method with classic partial 16S rRNA that are calculated by the SMRT analysis software amplicon sequencing with respect to taxonomic classifi- (Supporting Information Fig. S2). We removed 11,664 cation and resolving deeper phylogenetic relationships reads (52.3%) with an associated read quality value of of novel taxa. lower than 0.99. Second, we removed 106 high quality reads (1.0%) containing local stretches (≥ 30 bp) of con- Results and discussion secutive low-quality base-calls (Phred ≤18) with an enriched number of errors (Supporting Information Fig. S3). Reducing the mean error rates Third, we observed 1201 high quality ccs reads (11.3%) Here we present a method for generating and sequencing that have been previously referred to as ‘siamaeras’ amplicons of approximately 4000 bp containing near full- (Hackl et al., 2014). The first half of these reads consists length 16S and 23S rRNA genes from environmental taxa. of the expected amplicon, while the second half consists Because the method uses PacBio sequencing technology, of the reverse complement of the first half, but missing a which exhibits higher error rates compared to Illumina, we primer at the breakpoint (Supporting Information Fig. S4). developed a read curation pipeline (Fig. 1 and Supporting Siamaeras most likely stem from damaged amplicons that Information Fig. S1) that attempts to reduce the mean have a long overhang. The overhang forms a hairpin, error rate. To evaluate the error rate, we applied the which anneals to the complementary strand. As a result, method to a synthetic ‘mock community’ composed of the the SMRTbell adapter is blocked from ligating there during genomic DNA of 38 phylogenetically diverse archaeal and the library preparation, and the read processing software bacterial lineages for which complete genomes are will interpret the concatenation of both strands as a single