An Improved Greengenes Taxonomy with Explicit Ranks for Ecological and Evolutionary Analyses of Bacteria and Archaea
Total Page:16
File Type:pdf, Size:1020Kb
The ISME Journal (2012) 6, 610–618 & 2012 International Society for Microbial Ecology All rights reserved 1751-7362/12 www.nature.com/ismej ORIGINAL ARTICLE An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea Daniel McDonald1,MorganNPrice2, Julia Goodrich1,8, Eric P Nawrocki3, Todd Z DeSantis5,8, Alexander Probst4,9,GaryLAndersen4,RobKnight1,6 and Philip Hugenholtz7 1Department of Chemistry & Biochemistry and Biofrontiers Institute, University of Colorado, Boulder, CO, USA; 2Lawrence Berkeley National Laboratory, Physical Biosciences Division, Berkeley, CA, USA; 3Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA; 4Lawrence Berkeley National Laboratory, Center for Environmental Biotechnology, Berkeley, CA, USA; 5Department of Bioinformatics, Second Genome Inc., San Bruno, CA, USA; 6Howard Hughes Medical Institute, Boulder, CO, USA and 7Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences and Institute for Molecular Bioscience, St Lucia, Queensland, Australia Reference phylogenies are crucial for providing a taxonomic framework for interpretation of marker gene and metagenomic surveys, which continue to reveal novel species at a remarkable rate. Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference. We developed a ‘taxonomy to tree’ approach for transferring group names from an existing taxonomy to a tree topology, and used it to apply the Greengenes, National Center for Biotechnology Information (NCBI) and cyanoDB (Cyanobacteria only) taxonomies to a de novo tree comprising 408 315 sequences. We also incorporated explicit rank information provided by the NCBI taxonomy to group names (by prefixing rank designations) for better user orientation and classification consistency. The resulting merged taxonomy improved the classifica- tion of 75% of the sequences by one or more ranks relative to the original NCBI taxonomy with the most pronounced improvements occurring in under-classified environmental sequences. We also assessed candidate phyla (divisions) currently defined by NCBI and present recommendations for consolidation of 34 redundantly named groups. All intermediate results from the pipeline, which includes tree inference, jackknifing and transfer of a donor taxonomy to a recipient tree (tax2tree) are available for download. The improved Greengenes taxonomy should provide important infrastructure for a wide range of megasequencing projects studying ecosystems on scales ranging from our own bodies (the Human Microbiome Project) to the entire planet (the Earth Microbiome Project). The implementation of the software can be obtained from http://sourceforge.net/projects/tax2tree/. The ISME Journal (2012) 6, 610–618; doi:10.1038/ismej.2011.139; published online 1 December 2011 Subject Category: evolutionary genetics Keywords: evolution; phylogenetics; taxonomy Introduction rRNA gene (16S) is the most comprehensive and widely used in microbiology today (Pruesse et al., A robust universal reference taxonomy is a necessary 2007; Peplies et al., 2008), but has yet to reach its aid to interpretation of high-throughput sequence full potential because numerous microbes belong data from microbial communities (Tringe and to taxa that have not yet been characterized and Hugenholtz, 2008). Taxonomy based on the 16S because numerous sequences that could be reliably classified remain unannotated. For example, two Correspondence: P Hugenholtz, Australian Centre for Eco- thirds of 16S sequences in GenBank are only genomics, School of Chemistry and Molecular Biosciences and classified to domain (kingdom), that is, Archaea or Institute for Molecular Bioscience, University of Queensland, Bacteria, by the National Center for Biotechnology Molecular Biosciences Building 76, St Lucia, Queensland, 4072, Australia. Information (NCBI) taxonomy: this taxonomy is E-mail: [email protected] likely the most widely consulted 16S-based taxo- 8Current address: Department of Biology, Cornell University, nomy, despite its disclaimer that it is not an Ithaca, NY, USA. authoritative source, in part because classifications 9Current address: Institute for Microbiology and Archaea Centre, University of Regensburg, Regensburg, Germany. are maintained up to date through user submissions. Received 10 May 2011; revised 23 August 2011; accepted 25 Most of the un(der)classified sequences are from August 2011; published online 1 December 2011 culture-independent environmental surveys; these An improved Greengenes taxonomy D McDonald et al 611 sequences can swamp BLAST searches, leaving sequences that had o1% non-ACGT characters. users baffled about the phylogenetic affiliation of The sequences were checked for chimeras using their submitted sequences. This shortcoming has UCHIME (http://www.drive5.com/uchime/) and been addressed by several dedicated 16S databases, ChimeraSlayer (Haas et al., 2011). We only removed including the Ribosomal Database Project (Cole sequences from named isolates if they were classi- et al., 2009), Greengenes (DeSantis et al., 2006), fied as chimeric by both tools; we removed other SILVA (Pruesse et al., 2007) and EzTaxon (Chun sequences if they were classified as chimeric by et al., 2007), that classify a higher proportion of either tool or if they were unique to one study, environmental sequences. However, improvements meaning that no similar sequence (within 3% in a are still needed because many sequences remain preliminary tree) was reported by another study. unclassified and numerous classification conflicts Quality filtered 16S sequences were aligned based exist between the different 16S databases (DeSantis on both primary sequence and secondary structure et al., 2006). Moreover, the emergence of large-scale to archaeal and bacterial covariance models (ssu- next-generation sequencing projects such as the align-0.1) using Infernal (Nawrocki et al., 2009) with Human Microbiome Project (Turnbaugh et al., the sub option to avoid alignment errors near the 2007; Peterson et al., 2009) and TerraGenome (Vogel ends. The models were built from structure-anno- et al., 2009), and the availability of affordable tated training alignments derived from the Com- sequencing to a wide range of users who have parative RNA Website (Cannone et al., 2002) as traditionally lacked access, mean that the need to described in detail previously (Nawrocki et al., integrate new sequences into a consistent universal 2009). The resulting alignments were adjusted to taxonomic framework has never been greater. fit the fixed 7682 character Greengenes alignment The Greengenes taxonomy is currently based on a through identification of corresponding positions de novo phylogenetic tree of 408 135 quality-filtered between the model training alignments and the sequences calculated using FastTree (Price et al., Greengenes alignment. Hypervariable regions were 2010). De novo tree construction is among the most filtered using a modified version of the Lane mask objective means for inferring sequence relationships, (Lane, 1991). A tree of the remaining 408 135 filtered but requires either generation of new taxonomic sequences, (tree_16S_all_gg_2011_1) was built using classifications or transfer of existing taxonomic FastTree v2.1.1, a fast and accurate approximately classifications between iterations of trees as the maximum-likelihood method using the CAT approx- 16S database expands. Previously, we developed a imation and branch lengths were rescaled using a tool that automatically assigns names to mono- gamma model (Price et al., 2010). Statistical support phyletic groups in large phylogenetic trees (Dalevi for taxon groupings in this tree was conservatively et al., 2007), which is useful for naming novel approximated using taxon jackknifing, in which a (unclassified) clusters of environmental sequences. fraction (0.1%) of the sequences (rather than align- Here we describe a method to transfer group names ment positions) is excluded at random and the tree from any existing taxonomy to any tree topology that reconstructed. We use these support values to help has overlapping terminal node (tip) names. We used guide selection of monophyletic interior nodes for this ‘taxonomy to tree’ approach to annotate the group naming during manual curation. 408 135 sequence tree with the NCBI taxonomy as For evaluation of NCBI-defined candidate phyla, downloaded in June 2010 (Sayers et al.,2011),supple- we added 765 mostly partial length sequences, that mented with the Greengenes taxonomy from the failed the Greengenes filtering procedure but were previous iteration (Dalevi et al., 2007) and CyanoDB required for the evaluation, to the alignment using (http://www.cyanodb.cz). Explicit rank information, PyNAST (Caporaso et al., 2010; based on the 29 prefixed to group names, was incorporated into the November, 2010 Greengenes OTU templates) and Greengenes taxonomy to help users orient themselves generated a second FastTree (tree_16S_candiv_gg_ and to improve the consistency of the classification. 2011_1) using the parameters described above. We assessed the consistency of the resulting classi- fication with the NCBI taxonomy including currently defined candidate phyla (divisions), and present Transferring taxonomies to trees (tax2tree) recommendations for consolidation of 34 redun- Having constructed de novo trees, the next key step dantly named groups and exclusion of one on the was to link