A peer-reviewed version of this preprint was published in PeerJ on 10 July 2018.

View the peer-reviewed version (peerj.com/articles/5248), which is the preferred citable publication unless you specifically need to cite this preprint.

Lydon KA, Lipp EK. 2018. Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: implications for microbial community assessments. PeerJ 6:e5248 https://doi.org/10.7717/peerj.5248 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order Vibrionales in Greengenes: Implications for microbial community assessments

Keri A Lydon Corresp., 1 , Erin K Lipp Corresp. 1

1 Environmental Health Science, University of Georgia, Athens, GA, United States

Corresponding Authors: Keri A Lydon, Erin K Lipp Email address: [email protected], [email protected]

Next-generation sequencing has provided powerful tools to conduct microbial ecology studies. Analysis of community composition relies on annotated databases of curated sequences to provide taxonomic assignments; however, these databases occasionally have errors with implications for downstream analyses. Systemic taxonomic errors were discovered in Greengenes database (v13_5 and 13_8) related to orders Vibrionales and . These orders have family level annotations that were erroneous at least one taxonomic level, e.g., 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common; we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations, with 20 explicitly stating the incorrect . Erroneous assignments using these specific versions of Greengenes can lead to incorrect conclusions, especially in marine systems where these taxa can be common.

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.26824v1 | CC BY 4.0 Open Access | rec: 3 Apr 2018, publ: 3 Apr 2018 1 Taxonomic annotation errors incorrectly assign the family Pseudoalteromonadaceae to the order

2 Vibrionales in Greengenes: Implications for microbial community assessments

3

4

5 Keri Ann Lydon1* and Erin K. Lipp1*

6

7 1 University of Georgia, Department of Environmental Health Science, Athens, GA

8

9

10 *Co-Corresponding authors: 150 East Green St., Athens, GA, 30602; [email protected];

11 [email protected]; (706) 583-8138; @SciKeri

12

13

14 Running title: Greengenes errors in Vibrionales

1 15 Abstract

16 Next-generation sequencing has provided powerful tools to conduct microbial ecology studies.

17 Analysis of community composition relies on annotated databases of curated sequences to

18 provide taxonomic assignments; however, these databases occasionally have errors with

19 implications for downstream analyses. Systemic taxonomic errors were discovered in

20 Greengenes database (v13_5 and 13_8) related to orders Vibrionales and Alteromonadales.

21 These orders have family level annotations that were erroneous at least one taxonomic level, e.g.,

22 100% of sequences assigned to the Pseudoalteromonadaceae family were placed improperly in

23 Vibrionales (rather than Alteromonadales) and >20% of these sequences were assigned to the

24 Pseudoalteromonadaceae family (rather than to Vibrionaceae). Use of this database is common;

25 we identified 67 peer-reviewed papers since 2013 that likely included erroneous annotations,

26 with 20 explicitly stating the incorrect taxonomy. Erroneous assignments using these specific

27 versions of Greengenes can lead to incorrect conclusions, especially in marine systems where

28 these taxa can be common.

29

30

31

32

33

34

35

36

37

2 38 Introduction

39 Analysis of 16S rRNA gene sequences has dramatically changed the way microbiologists

40 understand the ecology of whole bacterial communities in an ecosystem. We are now able to

41 sequence millions of reads of this gene in mixed samples to understand changes and dynamics in

42 microbial composition. One of the main ways for analyzing these data is to compare the

43 sequence reads against a ribosomal sequence database with known taxonomic identities.

44 Frequently used databases include Greengenes (DeSantis et al., 2006, McDonald et al., 2012;

45 http://greengenes.secondgenome.com), SILVA (Pruesse et al., 2007), and the ribosomal database

46 project (RDP) (Cole et al., 2014). Greengenes is highly cited and was first introduced in 2006 for

47 assigning taxonomies to Archaea and . One of the reasons for its popularity is its ease of

48 use given that it has been assimilated into 16S rRNA gene sequence analysis pipelines such as

49 QIIME (Caporaso et al., 2010). Although Greengenes is one of the smallest databases, it has

50 been suggested as the preferred database for classification of taxonomy because of its capacity to

51 assign taxonomy to great depth (e.g., species level identification) (Werner et al., 2012).

52 In a recent analysis of 16S rRNA gene sequences using QIIME version 1.9.1 (Caporaso

53 et al., 2010) in our laboratory, we observed that the Greengenes taxonomy for the order

54 Vibrionales appeared to have errors when using version 13_8. Further investigation of

55 Greengenes versions revealed that these errors first appeared in version 13_5, but were not

56 present in earlier versions (Accessed 26 March 2018; ftp://ftp.microbio.me/greengenes_release/).

57 Less than a week after we first identified these misclassifications (21 March 2018), Edgar (2018)

58 published a non-peer reviewed pre-print (25 March 2018) that found the annotation error rate in

59 Greengenes version 13_5 to be approximately 15%, including mismatches for the families

60 Vibrionaceae and Pseudoalteromonadaceae between SILVA version 128 and Greengenes 13_5

3 61 (Edgar et al., 2018). Although the analyses reported by Edgar (2018) suggest broader issues with

62 taxonomic mismatches, we specifically assessed the extent of the misclassification associated

63 with order Vibrionales and family Pseudoalteromonadacae, given the importance of these taxa in

64 many marine systems.

65

66 Methods

67 Taxonomic Comparison of Pseudoalteromonadaceae Across Curated Databases. We

68 searched for the bacterial family Pseudoalteromonadaceae within the representative sequences

69 and taxonomy for Greengenes version 13_8 provided within QIIME version 1.9.1 and found that

70 100% of Pseudoalteromonadaceae sequences (n=164) were misclassified within the Vibrionales

71 order when they properly belong to the Alteromonadales order (Data File 1). This was also

72 observed in an earlier version of Greengenes (13_5). Subsequently, we checked taxonomies

73 against NCBI (16 rRNA database), RDP (Type Strains), and SILVA (ref NR) using BLAST

74 (Morgulis et al., 2008), SeqMatch (Cole et al., 2014), and SINA (Pruesse et al., 2012),

75 respectively, for all Pseudoalteromonadaceae sequences from the Greengenes representative

76 sequences in version 13_8 to determine if there were additional mismatches at other levels of

77 taxonomy (Summarized in Table 1; Full details in Data File 1).

78

79 Literature Search for Taxonomic Errors. We attempted to examine how widespread such a

80 taxonomic annotation error might have been propagated in the microbial ecology literature. We

81 conducted a systematic literature search within Google Scholar using the terms "Greengenes +

82 Vibrionales + Pseudoalteromonadaceae" or "Greengenes + Vibrionales" for the years 2013 –

83 2018 (Greengenes 13_5 and 13_8 were released in 2013) and including only published peer-

4 84 reviewed papers accessible in English and which used a next-generation amplicon sequencing

85 approach (i.e., excluding analyses with PhyloChip, DGGE, or clone libraries).

86

87 Results

88 In total, SILVA and NCBI identified 45 sequences, and RDP identified 46, that were

89 incorrectly assigned to the Pesudoalteromonadaceae family in Greengenes but should have been

90 classified in the Vibrionaceae family (in other words, these were Vibrio spp. misclassified as

91 Pseudoalteromonadaceae in Greengenes). However, because Greengenes assigned all members

92 of the Pseudoalteromonadaceae family within the Vibrionales order, these sequences were only

93 incorrect at the family level (Data File 1). These sequences mismatched by family represented 23

94 Vibrio spp. and included common species such as V. alginolyticus, V. cholerae, and V. harveyi

95 (Data File 1). This means that while these sequences did belong in the order Vibrionales, if the

96 family level was chosen for downstream analysis, they would have been analyzed incorrectly as

97 Pseudoalteromonadaceae. In addition, 115, 116, and 118 Pseudoalteromonadaceae sequences

98 were confirmed by SILVA, RDP, and NCBI, respectively, as correctly identified by family, and

99 sometimes genus (), but were incorrectly placed within Vibrionales order

100 when they in fact belonged to Alteromonadales. There were several sequences for which we

101 were unable to confirm taxonomic identity using NCBI, SILVA, or RDP (Table 1).

102 We found 86 articles that met our search criteria in journals such as PLoS One, Frontiers

103 in Microbiology, ISME J, among others that used either 13_5 or 13_8 versions of Greengenes to

104 assign their taxonomy (Table 2). Of these articles, 20 reported results clearly stating the incorrect

105 taxonomic assignments or listing the mismatched taxonomy in a table (in text or in supplemental

106 material). An additional 41 papers were presumed to have misclassification errors based on the

5 107 database version reported. Only 17 of the evaluated articles reported either the correct taxonomic

108 assignment directly or were presumed to have the correct assignment based on the use of earlier

109 Greengenes versions, which did not have the mismatch error. The remaining 8 papers did not

110 report sufficient detail on the database or bioinformatics pipeline used to determine if the

111 mismatch error was likely. In all, 61 of the 86 papers (>71%) since 2013 likely reported relative

112 abundance levels of Vibrionales or Alteromonadales that were incorrect due to improper

113 taxonomic assignments.

114

115 Discussion and Conclusions

116 Vibrionales and Alteromonadales are important members in a number of systems. For

117 example, the genus Vibrio in Vibrionales are opportunistic, ubiquitous, and conditionally rare

118 taxa that are studied for their contributions to nutrient cycling in the environment and

119 pathogenicity in both vertebrate and invertebrates. We hypothesize that ambiguous sequences

120 might contribute to taxonomic annotation errors in Greengenes, which allows for polyphyletic

121 groupings (McDonald et al., 2012). Results presented here demonstrate the extent of taxonomic

122 annotation errors present in the Greengenes database versions 13_8 and 13_5 for the order

123 Vibrionales. Results with erroneous attribution can alter our understanding or interpretation of

124 the ecology of those microbial assemblages. For example, given that 27% of sequences were

125 incorrectly attributed to Pseudoalteromonadaceae instead of Vibrionaceae, we could be making

126 inferences on disproportioned relative abundances. The same can be said vice versa for the

127 sequences that are truly Pseudoalteromonadaceae and could be inflating the total of Vibrionales

128 artificially because of these assignment errors.

6 129 We acknowledge that potential taxonomic errors are problematic for other databases

130 (Edgar et al., 2018) and encourage further support for curation of taxonomic databases that are

131 foundational to microbial ecology research using next-generation sequencing approaches. This is

132 an opportunity for field-wide collaboration to ensure that databases reviewed and curated.

133

134 Acknowledgments

135 We thank M. Ghazaleh Bucher, J. Westrich, and E. Ottesen for comments and advice on

136 this manuscript.

137

138 References

139 1. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. 2006.

140 Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with

141 ARB. Appl. Environ. Microbiol 72: 5069-5072.

142 2. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, et al. 2012. An

143 improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses

144 of bacteria and archaea. ISME J 6: 610-618.

145 3. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. 2007. SILVA:

146 A comprehensive online resource for quality checked and aligned ribosomal RNA sequence

147 data compatible with ARB. Nucleic Acids Res 35: 7188–7196.

148 4. Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al., 2014. Ribosomal Database

149 Project: data and tools for high throughput rRNA analysis Nucl Acids Res 42(Database

150 issue): D633-D642.

7 151 5. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. 2010.

152 QIIME allows analysis of high-throughput community sequencing data. Nature Methods

153 7(5): 335-336.

154 6. Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Capoaso JG, et al. 2012.

155 Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys.

156 ISME J 6: 94-103.

157 7. Edgar RC. Taxonomy annotation errors in 16S rRNA and fungal ITS sequence databases.

158 bioRxiv Accessed 27 March 2018; doi: 10.1101/288654

159 8. Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schäffer AA. 2008.

160 Database indexing for production MegaBLAST searches. Bioinformatics 15:1757-1764.

161 9. Pruesse E, Peplies J, and Glöckner FO. 2012. SINA: accurate high-throughput multiple

162 sequence alignment of ribosomal RNA genes. Bioinformatics 28:1823-1829.

163

164

165

166

167

8 Table 1(on next page)

Summary of taxonomic assignments made using RDP, SILVA, and NCBI for representative sequences in Greengenes assigned to the family Pseudoalteromonadaceae (n=164).

Search was conducted by pulling the representative sequences from the Greengenes database and then assigning taxonomy with RDP (SeqMatch), SILVA refNR (SINA), and NCBI (BLAST) searches. Full results in Data File 1. 1

Assigned Taxonomy of Database

Pseudoalteromonadaceae sequences GG RDP (n) SILVA NCBI (n)

in Greengenes (n=164) (n) (n)

Vibrionales (order); 164 0 0 0

Pseudoalteromonadaceae (family) a

Vibrionales (order); 0 46 45 45

Vibrionaceae (family) b

Alteromonadales (order); 0 116 115 118

Pseudoalteromonadaceae (family) b

2 a Incorrect assignment of family Pseudoalteromonadaceae to order Vibrionales

3 b Correct assignment of family Vibrionaceae to order Vibrionales and family

4 Pseudoalteromonadaceae to order Alteromonadaceae

5

6

7

8 Table 2(on next page)

Extent of incorrect taxonomic annotation of Pseudoalteromonadaceae in Vibrionales as published in peer reviewed literature.

Search was conducted between 23 and 28 March 2018 using Google Scholar. Search results included in the analysis were peer-reviewed papers published between 2013 and 2018 that were accessible in English and used a 16S rRNA gene next-generation sequencing approach. 1

Search Term Queries Greengenes+ Vibrionales+ Greengenes+ Pseudoalteromonadaceae Vibrionales a Total All papers fitting search criteria 22 64 86 Papers confirmed using Greengenes versions 13_5 or 13_8 with known taxonomy errors 14 47 61 Papers explicitly stating taxonomic mismatch (in text or supplemental material) b 10 10 20 Papers stating correct assignment or using earlier version of Greengenes 7 10 17 Papers using Greengenes but with no information on database version. 1 7 8 2 a Counts do not include papers appearing in the “Greengenes + Vibrionales + 3 Pseudoalteromonadaceae” search 4 b These papers are also included in the tally for papers using Greengenes 13_5 or 13_8

5