A peer-reviewed version of this preprint was published in PeerJ on 10 July 2018.
View the peer-reviewed version (peerj.com/articles/5248), which is the preferred citable publication unless you specifically need to cite this preprint.
Lydon KA, Lipp EK. 2018. Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: implications for microbial community assessments. PeerJ 6:e5248 https://doi.org/10.7717/peerj.5248 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: Implications for microbial community assessments
Keri A Lydon Corresp., 1 , Erin K Lipp Corresp. 1
1 Environmental Health Science, University of Georgia, Athens, GA, United States
Corresponding Authors: Keri A Lydon, Erin K Lipp Email address: [email protected], [email protected]
Next-generation sequencing has provided powerful tools to conduct microbial ecology studies. Analysis of community composition relies on annotated databases of curated sequences to provide taxonomic assignments; however, these databases occasionally have errors with implications for downstream analyses. Systemic taxonomic errors were discovered in Greengenes database (v13_5 and 13_8) related to orders Vibrionales and Alteromonadales. These orders have family level annotations that were erroneous at least one taxonomic level, e.g., 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common; we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations, with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific versions of Greengenes can lead to incorrect conclusions, especially in marine systems where these taxa can be common.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26824v1 | CC BY 4.0 Open Access | rec: 3 Apr 2018, publ: 3 Apr 2018 1 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order
2 Vibrionales in Greengenes: Implications for microbial community assessments
3
4
5 Keri Ann Lydon1* and Erin K. Lipp1*
6
7 1 University of Georgia, Department of Environmental Health Science, Athens, GA
8
9
10 *Co-Corresponding authors: 150 East Green St., Athens, GA, 30602; [email protected];
11 [email protected]; (706) 583-8138; @SciKeri
12
13
14 Running title: Greengenes errors in Vibrionales
1 15 Abstract
16 Next-generation sequencing has provided powerful tools to conduct microbial ecology studies.
17 Analysis of community composition relies on annotated databases of curated sequences to
18 provide taxonomic assignments; however, these databases occasionally have errors with
19 implications for downstream analyses. Systemic taxonomic errors were discovered in
20 Greengenes database (v13_5 and 13_8) related to orders Vibrionales and Alteromonadales.
21 These orders have family level annotations that were erroneous at least one taxonomic level, e.g.,
22 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in
23 Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the
24 Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common;
25 we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations,
26 with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific
27 versions of Greengenes can lead to incorrect conclusions, especially in marine systems where
28 these taxa can be common.
29
30
31
32
33
34
35
36
37
2 38 Introduction
39 Analysis of 16S rRNA gene sequences has dramatically changed the way microbiologists
40 understand the ecology of whole bacterial communities in an ecosystem. We are now able to
41 sequence millions of reads of this gene in mixed samples to understand changes and dynamics in
42 microbial composition. One of the main ways for analyzing these data is to compare the
43 sequence reads against a ribosomal sequence database with known taxonomic identities.
44 Frequently used databases include Greengenes (DeSantis et al., 2006, McDonald et al., 2012;
45 http://greengenes.secondgenome.com), SILVA (Pruesse et al., 2007), and the ribosomal database
46 project (RDP) (Cole et al., 2014). Greengenes is highly cited and was first introduced in 2006 for
47 assigning taxonomies to Archaea and Bacteria. One of the reasons for its popularity is its ease of
48 use given that it has been assimilated into 16S rRNA gene sequence analysis pipelines such as
49 QIIME (Caporaso et al., 2010). Although Greengenes is one of the smallest databases, it has
50 been suggested as the preferred database for classification of taxonomy because of its capacity to
51 assign taxonomy to great depth (e.g., species level identification) (Werner et al., 2012).
52 In a recent analysis of 16S rRNA gene sequences using QIIME version 1.9.1 (Caporaso
53 et al., 2010) in our laboratory, we observed that the Greengenes taxonomy for the order
54 Vibrionales appeared to have errors when using version 13_8. Further investigation of
55 Greengenes versions revealed that these errors first appeared in version 13_5, but were not
56 present in earlier versions (Accessed 26 March 2018; ftp://ftp.microbio.me/greengenes_release/).
57 Less than a week after we first identified these misclassifications (21 March 2018), Edgar (2018)
58 published a non-peer reviewed pre-print (25 March 2018) that found the annotation error rate in
59 Greengenes version 13_5 to be approximately 15%, including mismatches for the families
60 Vibrionaceae and Pseudoalteromonadaceae between SILVA version 128 and Greengenes 13_5
3 61 (Edgar et al., 2018). Although the analyses reported by Edgar (2018) suggest broader issues with
62 taxonomic mismatches, we specifically assessed the extent of the misclassification associated
63 with order Vibrionales and family Pseudoalteromonadacae, given the importance of these taxa in
64 many marine systems.
65
66 Methods
67 Taxonomic Comparison of Pseudoalteromonadaceae Across Curated Databases. We
68 searched for the bacterial family Pseudoalteromonadaceae within the representative sequences
69 and taxonomy for Greengenes version 13_8 provided within QIIME version 1.9.1 and found that
70 100% of Pseudoalteromonadaceae sequences (n=164) were misclassified within the Vibrionales
71 order when they properly belong to the Alteromonadales order (Data File 1). This was also
72 observed in an earlier version of Greengenes (13_5). Subsequently, we checked taxonomies
73 against NCBI (16 rRNA database), RDP (Type Strains), and SILVA (ref NR) using BLAST
74 (Morgulis et al., 2008), SeqMatch (Cole et al., 2014), and SINA (Pruesse et al., 2012),
75 respectively, for all Pseudoalteromonadaceae sequences from the Greengenes representative
76 sequences in version 13_8 to determine if there were additional mismatches at other levels of
77 taxonomy (Summarized in Table 1; Full details in Data File 1).
78
79 Literature Search for Taxonomic Errors. We attempted to examine how widespread such a
80 taxonomic annotation error might have been propagated in the microbial ecology literature. We
81 conducted a systematic literature search within Google Scholar using the terms "Greengenes +
82 Vibrionales + Pseudoalteromonadaceae" or "Greengenes + Vibrionales" for the years 2013 –
83 2018 (Greengenes 13_5 and 13_8 were released in 2013) and including only published peer-
4 84 reviewed papers accessible in English and which used a next-generation amplicon sequencing
85 approach (i.e., excluding analyses with PhyloChip, DGGE, or clone libraries).
86
87 Results
88 In total, SILVA and NCBI identified 45 sequences, and RDP identified 46, that were
89 incorrectly assigned to the Pesudoalteromonadaceae family in Greengenes but should have been
90 classified in the Vibrionaceae family (in other words, these were Vibrio spp. misclassified as
91 Pseudoalteromonadaceae in Greengenes). However, because Greengenes assigned all members
92 of the Pseudoalteromonadaceae family within the Vibrionales order, these sequences were only
93 incorrect at the family level (Data File 1). These sequences mismatched by family represented 23
94 Vibrio spp. and included common species such as V. alginolyticus, V. cholerae, and V. harveyi
95 (Data File 1). This means that while these sequences did belong in the order Vibrionales, if the
96 family level was chosen for downstream analysis, they would have been analyzed incorrectly as
97 Pseudoalteromonadaceae. In addition, 115, 116, and 118 Pseudoalteromonadaceae sequences
98 were confirmed by SILVA, RDP, and NCBI, respectively, as correctly identified by family, and
99 sometimes genus (Pseudoalteromonas), but were incorrectly placed within Vibrionales order
100 when they in fact belonged to Alteromonadales. There were several sequences for which we
101 were unable to confirm taxonomic identity using NCBI, SILVA, or RDP (Table 1).
102 We found 86 articles that met our search criteria in journals such as PLoS One, Frontiers
103 in Microbiology, ISME J, among others that used either 13_5 or 13_8 versions of Greengenes to
104 assign their taxonomy (Table 2). Of these articles, 20 reported results clearly stating the incorrect
105 taxonomic assignments or listing the mismatched taxonomy in a table (in text or in supplemental
106 material). An additional 41 papers were presumed to have misclassification errors based on the
5 107 database version reported. Only 17 of the evaluated articles reported either the correct taxonomic
108 assignment directly or were presumed to have the correct assignment based on the use of earlier
109 Greengenes versions, which did not have the mismatch error. The remaining 8 papers did not
110 report sufficient detail on the database or bioinformatics pipeline used to determine if the
111 mismatch error was likely. In all, 61 of the 86 papers (>71%) since 2013 likely reported relative
112 abundance levels of Vibrionales or Alteromonadales that were incorrect due to improper
113 taxonomic assignments.
114
115 Discussion and Conclusions
116 Vibrionales and Alteromonadales are important members in a number of systems. For
117 example, the genus Vibrio in Vibrionales are opportunistic, ubiquitous, and conditionally rare
118 taxa that are studied for their contributions to nutrient cycling in the environment and
119 pathogenicity in both vertebrate and invertebrates. We hypothesize that ambiguous sequences
120 might contribute to taxonomic annotation errors in Greengenes, which allows for polyphyletic
121 groupings (McDonald et al., 2012). Results presented here demonstrate the extent of taxonomic
122 annotation errors present in the Greengenes database versions 13_8 and 13_5 for the order
123 Vibrionales. Results with erroneous attribution can alter our understanding or interpretation of
124 the ecology of those microbial assemblages. For example, given that 27% of sequences were
125 incorrectly attributed to Pseudoalteromonadaceae instead of Vibrionaceae, we could be making
126 inferences on disproportioned relative abundances. The same can be said vice versa for the
127 sequences that are truly Pseudoalteromonadaceae and could be inflating the total of Vibrionales
128 artificially because of these assignment errors.
6 129 We acknowledge that potential taxonomic errors are problematic for other databases
130 (Edgar et al., 2018) and encourage further support for curation of taxonomic databases that are
131 foundational to microbial ecology research using next-generation sequencing approaches. This is
132 an opportunity for field-wide collaboration to ensure that databases reviewed and curated.
133
134 Acknowledgments
135 We thank M. Ghazaleh Bucher, J. Westrich, and E. Ottesen for comments and advice on
136 this manuscript.
137
138 References
139 1. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. 2006.
140 Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with
141 ARB. Appl. Environ. Microbiol 72: 5069-5072.
142 2. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, et al. 2012. An
143 improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses
144 of bacteria and archaea. ISME J 6: 610-618.
145 3. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. 2007. SILVA:
146 A comprehensive online resource for quality checked and aligned ribosomal RNA sequence
147 data compatible with ARB. Nucleic Acids Res 35: 7188–7196.
148 4. Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al., 2014. Ribosomal Database
149 Project: data and tools for high throughput rRNA analysis Nucl Acids Res 42(Database
150 issue): D633-D642.
7 151 5. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. 2010.
152 QIIME allows analysis of high-throughput community sequencing data. Nature Methods
153 7(5): 335-336.
154 6. Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Capoaso JG, et al. 2012.
155 Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys.
156 ISME J 6: 94-103.
157 7. Edgar RC. Taxonomy annotation errors in 16S rRNA and fungal ITS sequence databases.
158 bioRxiv Accessed 27 March 2018; doi: 10.1101/288654
159 8. Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schäffer AA. 2008.
160 Database indexing for production MegaBLAST searches. Bioinformatics 15:1757-1764.
161 9. Pruesse E, Peplies J, and Glöckner FO. 2012. SINA: accurate high-throughput multiple
162 sequence alignment of ribosomal RNA genes. Bioinformatics 28:1823-1829.
163
164
165
166
167
8 Table 1(on next page)
Summary of taxonomic assignments made using RDP, SILVA, and NCBI for representative sequences in Greengenes assigned to the family Pseudoalteromonadaceae (n=164).
Search was conducted by pulling the representative sequences from the Greengenes database and then assigning taxonomy with RDP (SeqMatch), SILVA refNR (SINA), and NCBI (BLAST) searches. Full results in Data File 1. 1
Assigned Taxonomy of Database
Pseudoalteromonadaceae sequences GG RDP (n) SILVA NCBI (n)
in Greengenes (n=164) (n) (n)
Vibrionales (order); 164 0 0 0
Pseudoalteromonadaceae (family) a
Vibrionales (order); 0 46 45 45
Vibrionaceae (family) b
Alteromonadales (order); 0 116 115 118
Pseudoalteromonadaceae (family) b
2 a Incorrect assignment of family Pseudoalteromonadaceae to order Vibrionales
3 b Correct assignment of family Vibrionaceae to order Vibrionales and family
4 Pseudoalteromonadaceae to order Alteromonadaceae
5
6
7
8 Table 2(on next page)
Extent of incorrect taxonomic annotation of Pseudoalteromonadaceae in Vibrionales as published in peer reviewed literature.
Search was conducted between 23 and 28 March 2018 using Google Scholar. Search results included in the analysis were peer-reviewed papers published between 2013 and 2018 that were accessible in English and used a 16S rRNA gene next-generation sequencing approach. 1
Search Term Queries Greengenes+ Vibrionales+ Greengenes+ Pseudoalteromonadaceae Vibrionales a Total All papers fitting search criteria 22 64 86 Papers confirmed using Greengenes versions 13_5 or 13_8 with known taxonomy errors 14 47 61 Papers explicitly stating taxonomic mismatch (in text or supplemental material) b 10 10 20 Papers stating correct assignment or using earlier version of Greengenes 7 10 17 Papers using Greengenes but with no information on database version. 1 7 8 2 a Counts do not include papers appearing in the “Greengenes + Vibrionales + 3 Pseudoalteromonadaceae” search 4 b These papers are also included in the tally for papers using Greengenes 13_5 or 13_8
5