Taxonomic Annotation Errors Incorrectly Assign the Family
Total Page:16
File Type:pdf, Size:1020Kb
A peer-reviewed version of this preprint was published in PeerJ on 10 July 2018. View the peer-reviewed version (peerj.com/articles/5248), which is the preferred citable publication unless you specifically need to cite this preprint. Lydon KA, Lipp EK. 2018. Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: implications for microbial community assessments. PeerJ 6:e5248 https://doi.org/10.7717/peerj.5248 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: Implications for microbial community assessments Keri A Lydon Corresp., 1 , Erin K Lipp Corresp. 1 1 Environmental Health Science, University of Georgia, Athens, GA, United States Corresponding Authors: Keri A Lydon, Erin K Lipp Email address: [email protected], [email protected] Next-generation sequencing has provided powerful tools to conduct microbial ecology studies. Analysis of community composition relies on annotated databases of curated sequences to provide taxonomic assignments; however, these databases occasionally have errors with implications for downstream analyses. Systemic taxonomic errors were discovered in Greengenes database (v13_5 and 13_8) related to orders Vibrionales and Alteromonadales. These orders have family level annotations that were erroneous at least one taxonomic level, e.g., 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common; we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations, with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific versions of Greengenes can lead to incorrect conclusions, especially in marine systems where these taxa can be common. PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26824v1 | CC BY 4.0 Open Access | rec: 3 Apr 2018, publ: 3 Apr 2018 1 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order 2 Vibrionales in Greengenes: Implications for microbial community assessments 3 4 5 Keri Ann Lydon1* and Erin K. Lipp1* 6 7 1 University of Georgia, Department of Environmental Health Science, Athens, GA 8 9 10 *Co-Corresponding authors: 150 East Green St., Athens, GA, 30602; [email protected]; 11 [email protected]; (706) 583-8138; @SciKeri 12 13 14 Running title: Greengenes errors in Vibrionales 1 15 Abstract 16 Next-generation sequencing has provided powerful tools to conduct microbial ecology studies. 17 Analysis of community composition relies on annotated databases of curated sequences to 18 provide taxonomic assignments; however, these databases occasionally have errors with 19 implications for downstream analyses. Systemic taxonomic errors were discovered in 20 Greengenes database (v13_5 and 13_8) related to orders Vibrionales and Alteromonadales. 21 These orders have family level annotations that were erroneous at least one taxonomic level, e.g., 22 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in 23 Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the 24 Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common; 25 we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations, 26 with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific 27 versions of Greengenes can lead to incorrect conclusions, especially in marine systems where 28 these taxa can be common. 29 30 31 32 33 34 35 36 37 2 38 Introduction 39 Analysis of 16S rRNA gene sequences has dramatically changed the way microbiologists 40 understand the ecology of whole bacterial communities in an ecosystem. We are now able to 41 sequence millions of reads of this gene in mixed samples to understand changes and dynamics in 42 microbial composition. One of the main ways for analyzing these data is to compare the 43 sequence reads against a ribosomal sequence database with known taxonomic identities. 44 Frequently used databases include Greengenes (DeSantis et al., 2006, McDonald et al., 2012; 45 http://greengenes.secondgenome.com), SILVA (Pruesse et al., 2007), and the ribosomal database 46 project (RDP) (Cole et al., 2014). Greengenes is highly cited and was first introduced in 2006 for 47 assigning taxonomies to Archaea and Bacteria. One of the reasons for its popularity is its ease of 48 use given that it has been assimilated into 16S rRNA gene sequence analysis pipelines such as 49 QIIME (Caporaso et al., 2010). Although Greengenes is one of the smallest databases, it has 50 been suggested as the preferred database for classification of taxonomy because of its capacity to 51 assign taxonomy to great depth (e.g., species level identification) (Werner et al., 2012). 52 In a recent analysis of 16S rRNA gene sequences using QIIME version 1.9.1 (Caporaso 53 et al., 2010) in our laboratory, we observed that the Greengenes taxonomy for the order 54 Vibrionales appeared to have errors when using version 13_8. Further investigation of 55 Greengenes versions revealed that these errors first appeared in version 13_5, but were not 56 present in earlier versions (Accessed 26 March 2018; ftp://ftp.microbio.me/greengenes_release/). 57 Less than a week after we first identified these misclassifications (21 March 2018), Edgar (2018) 58 published a non-peer reviewed pre-print (25 March 2018) that found the annotation error rate in 59 Greengenes version 13_5 to be approximately 15%, including mismatches for the families 60 Vibrionaceae and Pseudoalteromonadaceae between SILVA version 128 and Greengenes 13_5 3 61 (Edgar et al., 2018). Although the analyses reported by Edgar (2018) suggest broader issues with 62 taxonomic mismatches, we specifically assessed the extent of the misclassification associated 63 with order Vibrionales and family Pseudoalteromonadacae, given the importance of these taxa in 64 many marine systems. 65 66 Methods 67 Taxonomic Comparison of Pseudoalteromonadaceae Across Curated Databases. We 68 searched for the bacterial family Pseudoalteromonadaceae within the representative sequences 69 and taxonomy for Greengenes version 13_8 provided within QIIME version 1.9.1 and found that 70 100% of Pseudoalteromonadaceae sequences (n=164) were misclassified within the Vibrionales 71 order when they properly belong to the Alteromonadales order (Data File 1). This was also 72 observed in an earlier version of Greengenes (13_5). Subsequently, we checked taxonomies 73 against NCBI (16 rRNA database), RDP (Type Strains), and SILVA (ref NR) using BLAST 74 (Morgulis et al., 2008), SeqMatch (Cole et al., 2014), and SINA (Pruesse et al., 2012), 75 respectively, for all Pseudoalteromonadaceae sequences from the Greengenes representative 76 sequences in version 13_8 to determine if there were additional mismatches at other levels of 77 taxonomy (Summarized in Table 1; Full details in Data File 1). 78 79 Literature Search for Taxonomic Errors. We attempted to examine how widespread such a 80 taxonomic annotation error might have been propagated in the microbial ecology literature. We 81 conducted a systematic literature search within Google Scholar using the terms "Greengenes + 82 Vibrionales + Pseudoalteromonadaceae" or "Greengenes + Vibrionales" for the years 2013 – 83 2018 (Greengenes 13_5 and 13_8 were released in 2013) and including only published peer- 4 84 reviewed papers accessible in English and which used a next-generation amplicon sequencing 85 approach (i.e., excluding analyses with PhyloChip, DGGE, or clone libraries). 86 87 Results 88 In total, SILVA and NCBI identified 45 sequences, and RDP identified 46, that were 89 incorrectly assigned to the Pesudoalteromonadaceae family in Greengenes but should have been 90 classified in the Vibrionaceae family (in other words, these were Vibrio spp. misclassified as 91 Pseudoalteromonadaceae in Greengenes). However, because Greengenes assigned all members 92 of the Pseudoalteromonadaceae family within the Vibrionales order, these sequences were only 93 incorrect at the family level (Data File 1). These sequences mismatched by family represented 23 94 Vibrio spp. and included common species such as V. alginolyticus, V. cholerae, and V. harveyi 95 (Data File 1). This means that while these sequences did belong in the order Vibrionales, if the 96 family level was chosen for downstream analysis, they would have been analyzed incorrectly as 97 Pseudoalteromonadaceae. In addition, 115, 116, and 118 Pseudoalteromonadaceae sequences 98 were confirmed by SILVA, RDP, and NCBI, respectively, as correctly identified by family, and 99 sometimes genus (Pseudoalteromonas), but were incorrectly placed within Vibrionales order 100 when they in fact belonged to Alteromonadales. There were several sequences for which we 101 were unable to confirm taxonomic identity using NCBI, SILVA, or RDP (Table 1). 102 We found 86 articles that met our search criteria in journals such as PLoS One, Frontiers 103 in Microbiology, ISME J, among others that used either 13_5 or 13_8 versions of Greengenes to 104 assign their taxonomy (Table 2). Of these articles, 20 reported results clearly stating the incorrect 105 taxonomic assignments or listing the mismatched taxonomy in a table (in text or in supplemental