Newton and Roeselers BMC Microbiology 2012, 12:221 http://www.biomedcentral.com/1471-2180/12/221

RESEARCH ARTICLE Open Access The effect of training set on the classification of gut microbiota using the Naïve Bayesian Classifier Irene LG Newton1* and Guus Roeselers2

Abstract Background: Microbial ecologists now routinely utilize next-generation sequencing methods to assess microbial diversity in the environment. One tool heavily utilized by many groups is the Naïve Bayesian Classifier developed by the Ribosomal Database Project (RDP-NBC). However, the consistency and confidence of classifications provided by the RDP-NBC is dependent on the training set utilized. Results: We explored the stability of classification of honey bee gut microbiota sequences by the RDP-NBC utilizing three publically available ribosomal RNA sequence databases as training sets: ARB-SILVA, Greengenes and RDP. We found that the inclusion of previously published, high-quality, full-length sequences from 16S rRNA clone libraries improved the precision in classification of novel bee-associated sequences. Specifically, by including bee-specific 16S rRNA gene sequences a larger fraction of sequences were classified at a higher confidence by the RDP-NBC (based on bootstrap scores). Conclusions: Results from the analysis of these bee-associated sequences have ramifications for other environments represented by few sequences in the public databases or few bacterial isolates. We conclude that for the exploration of relatively novel habitats, the inclusion of high-quality, full-length 16S rRNA gene sequences allows for a more confident taxonomic classification. Keywords: Honey bee, Gut, Microbiota, Naïve Bayesian classifier, Pyrosequencing,

Background large number of groups [1-6] utilize the Ribosomal Data- Microbial ecology studies routinely utilize 454 pyrose- base Project’s Naïve Bayesian Classifier (RDP-NBC) [7] quencing of ribosomal RNA gene amplicons in order to for the classification of rRNA sequences into the new determine composition and functionality of environmen- higher-order taxonomy, such as that proposed in Ber- tal communities [1-6]. Where it was once costly to gen- gey's Taxonomic Outline of the Prokaryotes [8]. Bayesian erate libraries of a few hundred 16S rRNA gene classifiers assign the most likely class to a given example sequences, so called next-generation sequencing meth- described by its feature vector based on applying Bayes' ods now allow researchers to deeply probe a microbial theorem. Developing such classifiers can be greatly sim- community at relatively little cost per sequence. Taxo- plified by assuming that features are independent given nomic classification is a key part of these studies as it class (naïve independence assumptions). Because inde- allows researchers to correlate relative abundance of par- pendent variables are assumed, only the variances of the ticular sequences with taxonomic groupings. These variables for each class need to be determined and not kinds of informative data can also allow for hypothesis the entire covariance matrix. Despite this unrealistic as- generation concerning the community function in the sumption, the resulting classifier is remarkably success- context of a given biological or ecological question. A ful in practice, often competing with much more sophisticated techniques [9,10]. The practical advantages of the RDP-NBC are that classification are straightfor- * Correspondence: [email protected] 1Department of Biology, 1001 E 3rd Street, Bloomington, IN 47405, USA ward (putting sequences in a predetermined taxonomic Full list of author information is available at the end of the article context), computationally efficient (building a statistical

© 2012 Newton and Roeselers; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Newton and Roeselers BMC Microbiology 2012, 12:221 Page 2 of 9 http://www.biomedcentral.com/1471-2180/12/221

model based on k-mers in the training set), can analyze arb) and visually inspected using ARB [24]. We refer to thousands of sequences, and does not require full-length this custom seed alignment as the arb-silva SSU + honey 16S sequences (making it an appropriate tool for next bee alignment (ASHB). To generate a phylogeny we generation sequencing based studies). The RDP-NBC re- used the ASHB as input to RAxML (GTR + γ with lies on an accurate training set – on reference sequences 1,000 bootstrap replicates) using a maximum likeli- used to train the model and a taxonomic designation file hood framework (Stamatakis 2006). This phylogeny was to generate the classification results. Recently the effects used to inform the taxonomic designations (see below). of training sets on RDP-NBC performance were investi- In addition, we used the RAxML evolutionary place- gated [11]; the size and taxonomic breadth of the train- ment algorithm to identify the placement of short ing set had a significant impact on classification, such reads within this framework (raxmlHPC-SSE3 –fv–m that improvements in the confidence of classification of GTRGAMMA –n Placement). Alignment (ASHB) and previously “unclassified” sequences were made with a phylogeny are available in TreeBase at http://purl.org/ larger, more diverse training set [11]. phylo/treebase/phylows/study/TB2:S13210?x-access- For environments that lack cultured isolates or are code=52f01c46c780bc323ba5d1d50ea58fd6&format=html). relatively underexplored, researchers are often unable to find an appropriate training set to reveal the taxonomic identity of the extracted sequences [11-13]. However, if Generating a taxonomy file for bee-associated sequences previous clone libraries have generated full length, high- Fine scale taxonomic placement (below phylum level) quality 16S rRNA gene sequences, then these sequences for relatively novel bacterial groups is difficult to accom- can be utilized in a training set and taxonomy frame- plish and subject to some debate [13]. In order to taxo- work, potentially increasing the precision of the classifi- nomically classify these sequences we utilized the cation provided by the RDP-NBC. Our primary goal in phylogenetic framework generated above (Figure 1) and this study was to test the effect of training set on the RDP- also queried the RDP (using the RDP-seqmatch tool) for NBC-based classification of Apis mellifera (European nearest cultured representatives. We used cultured iso- honey bee) gut derived 16S rRNA gene sequences. In- lates (identified by the RDP-seqmatch tool) to root our sect guts are relatively underexplored and host novel phylogeny, generated by the 276 honey bee representa- bacterial groups for which there do not exist close, cul- tive sequences. Based on percent identity to the cultured tured relatives, making taxonomic assignments for 16S representative, each sequence in the honey bee dataset sequences and metatranscriptomic data difficult [14-16]. was assigned to either the class or the genus level. If the We also sought to improve the classification of sequences cultured representative was >95% identical to the bee from the honey bee gut by the RDP-NBC through the derived sequence, we placed the novel bee sequence in creation of training sets that include full-length the genus of the cultured representative. If, however, a sequences identified as core honey bee microbiota as cultured isolate was not found with identity above 95% part of a phylogenetic framework first put forward by for the bee sequence, but they grouped in a clade con- Cox-Foster et al., 2006 and extended by Martinson taining a cultured representative, the bee sequences were et al., 2010 [17,18]. Below we compare the precision placed in the same class and we noted incertae sedis in and reproducibility of classification of the honey bee gut the taxonomy file. In addition to this de novo generation microbiota using six different training sets: RDP, Green- of taxonomic information for the bee associated genes, arb-silva, and custom, honey bee specific data- sequences, if phylogenetic information (as established bases generated from each. by Cox-Foster et al., 2006) was available for any of the Genbank submissions, that information was also Methods included in the taxonomy. For example, names of bee Generating a bee-specific seed alignment specific groups such as “alpha-2.1” and “beta” (recur- Sequences that corresponded to accession numbers pub- ring in many bee studies) often appear in the full gen- lished in analyses of bee-associated microbiota and that bank accession for these sequences. Occasionally the were near full length (at least 1250 bp) were used to gen- Genbank records list an organism’s full taxonomic erate the seed alignment for our subsequent analyses (A designation without considering its placement in the total of 5,713 sequences were downloaded and 5,158 phylogenetic framework previously identified for passed the length threshold) [18-22]. These sequences honey bee guts. For example, Lactobacillus apis has a were clustered at 99% identity, reducing the dataset to 276 Genbank taxonomy that does not consider it part of representatives. This set of sequences is referred to as the the firm-4 group. In our taxonomy, we did not re- honey bee database (HBDB) throughout and were aligned move the genus and name but instead con- using the SINA aligner (v 1.2.9, [23]) to the arb-silva SSU sider this sequence to be part of the firm-4 clade at database (SSURef_108_SILVA_NR_99_11_10_11_opt_v2. the family level. Newton and Roeselers BMC Microbiology 2012, 12:221 Page 3 of 9 http://www.biomedcentral.com/1471-2180/12/221

HM111703 HM110347 HM110307 HM110404 HM113086 HM108525 HM108617 HM108524 HM110318 HM108469 HM108649 HM108619 HM108539 HM108641 HM108593 HM108520 HM108637 HM108620 HM108521 HM108591 HM108650 HM108489 HM108599 HM108556 HM108582 HM108595 HM108584 HM108592 HM108515 HM108905 HM108529 Burkholderia cepacia complex HM111767 HM108646 HM108647 HM108487 HM108547 HM109369 HM108614 HM108598 HM110359 HM108570 HM108612 94 HM109161 HM108557 HM110399 HM110501 HM108513 HM108609 HM108594 HM108551 HM108559 Burkholderiaceae HM108512 HM108565 HM108541 HM108569 HM110352 99 HM111713 CP000459 Burkholderia cepacia HM108651 HM110322 HM110259 HM108572 HM108603 HM108576 HM108523 HM108658 HM108636 Burkholderiales HM109784 99 HM108471 HM108482 92 HM110488 EF554880 Ralstonia Ralstonia HM111069 HM111652 AB362826 97 HM111704 HM111212 HM110417 95 HM110520 95 HM111030 AB076856 Diaphorobacter nitroreducens HM109070 HM108494 93 HM108477 HM108509 90 HM108454 HM108479 Commomonadaceae HM108455 HM108481 β HM108451 -proteo HM111381 94 AY426974 Neisseria canis HM108674 HM215021 95 HM108516 HM215011 HM108492 HM113221 Beta HM111908 HM112100 HM108719 HM108718 HM108716 HM113217 HM108449 91 HM113304 HM112050 HM112108 Neisseriales HM113136 HM113280 86 HM215026 HM215024 HM215032 HM215034 HM215031 HM215029 HM108542 HM108540 HM108578 Gamma-1 86 HM108522 HM112068 HM113164 HM112029 HM111910 HM113171 HM113301 HM113131 HM113113 HM113325 HM215025 HM113269 AB480768 Enterobacteriaceae bacterium Acj201 HM112168 95 HM109483 HM113288 95 HM113123 HM108971 HM110617 Enterobacteriales HM110311 HM110623 γ-proteo HM112315 HM112402 HM112256 HM112340 96 CP001120 Salmonella enterica subsp. enterica HM108396 Pseudomonadales AB681773 Erwinia HM109714 HM110291 HM110317 HM110127 91 JF494820 Pantoea HM113016 HM109389 HM109805 HM109789 HM110641 HQ284953 Acinetobacter sp. 18N3 Acinetobacter HM109573 Acetobacteriaceae NR024819 Saccharibacter florica strain S-877 HM109437 HM109591 HM112426 80 Alpha 2.2 92 HM112382 HM112282 EU409601 Commensalibacter intestini strain A911 Alpha 2.1 HM112118 HM109329 HM108486 HM108459 96 HM108484 HM109557 HM109937 HM109678 96 HM109141 HM108849 HM111817 95 HM111706 HM111812 HM111792 96 98 HM109860 HM111020 99 HM108436 HM108371 α HM108501 -proteo 96 HM108452 Alpha 1 Rhizobiales HM108461 HM110470 CP001562 Bartonella 68 HM111597 HM111424 HM111434 HM111526 HM111400 HM112916 Wolbachia HM113074 HM112870 AB491204 Wolbachia Rickettsiales HM109883 HM215040 HM109889 HM215041 HM215046 HM215043 HM113333 HM113154 HM113308 HM113341 99 92 HM113104 HM113265 HM113135 96 HM113281 HM113345 92 HM112113 HM113153 84 HM113286 AY667701 Lactobacillus apis strain 1F1 94 HM113185 HM112070 HM113175 HM111947 Firmicutes 98 HM112858 HM113232 Firm-5 HM111949 HM113160 HM113102 HM113114 HM113334 96 97 HM215048 HM113169 HM111918 HM113295 HM113255 HM113298 97 97 HM113328 HM113316 HM113222 96 Lactobacillales HM113139 HM112046 HM113103 Firm-4 99 HM112004 HM113305 HM111914 EF187245 Lactobacillus sp bin4 HM112042 95 HM112615 HM112771 HM112623 HM112522 98 HM112788 HM112599 HM112474 HM112730 94 HM112780 HM112762 HM112629 HM112552 HM112612 HM110569 HM112983 HM109382 78 HM109380 97 HM111168 HM109469 HM109806 95 HM109713 HM111788 Actinobacteria HM109505 98 HM109753 HM110564 97 HM215049 AB437358 Actinomycetales HM112025 HM113220 Bifidobacterium HM113111 Bifidobacteriales HM113290

0.2 Figure 1 Phylogenetic relationships for the bacterial species included in the honey bee specific database (with bootstrap support indicated above branches if > 75%). Class level designations are highlighted in red while lower rank taxonomic designations are indicated using arrows on nodes. Specific clades identified previously in honey bees are colored in blue while novel clades identified here, including cultured isolates and well-described genera (such as Wolbachia), are colored in yellow. Newton and Roeselers BMC Microbiology 2012, 12:221 Page 4 of 9 http://www.biomedcentral.com/1471-2180/12/221

Processing of pyrosequencing amplicons from SILVA training sets differ in size, in diversity of honey bee guts sequences, and partly in taxonomic framework. The lar- Raw .sff files corresponding to 16S rRNA gene ampli- gest of these datasets, the Greengenes reference, is by far cons from the honey bee gut were downloaded from the the most diverse, comprised of 84,414 sequences includ- DDBJ (DRA000526). The sequences were the result of ing multiple representatives from each taxonomic class. an amplification of the V1/V2 hypervariable 16S regions With regards to taxonomic framework, the RDP relies with primers 27 F and 338IIR [25]. All data extraction, on Bergey's Taxonomic Outline of the Prokaryotes (2nd pre-processing, analysis of operational taxonomic units ed., release 5.0, Springer-Verlag, New York, NY, 2004) as (OTUs), and classifications were performed using mod- its reference. In contrast, the Greengenes taxonomy ules implemented in the Mothur software platform [26] assigns reference sequences to individual classifications as in [25] except where noted below. Information about using phylogenies based on a subset of sequences but which colony each sequence came from was retained also includes NCBI’s explicit rank information [27]. Fi- throughout sequence processing so we could make stat- nally, SILVA, like the RDP, uses Bergey’s Manual of Sys- istical inferences based on the ecological framework tematic Bacteriology (volumes 1 through 4), Bergey's tested previously [25]. Unique sequences were aligned Taxonomic Outlines (volume 5), and the List of Prokary- using the “align.seqs” command and the Mothur- otic names with Standing in Nomenclature [28]. In all compatible Bacterial SILVA SEED database modified to three taxonomic references, six taxonomic ranks are include the ASHB. Out of 70,939 sequences, a total of predominantly utilized for classification: Domain, 4,480 unique, high-quality sequences were retrieved Phylum, Class, Order, Family and Genus (although the from honey bee guts using this pipeline. Operational training set taxonomies differ with regards to the preva- taxonomic units (OTUs) were generated using a 97% lence of suborders, subclasses, etc.). We chose to utilize sequence-identity threshold, as in [25]. the SILVA taxonomic nomenclature for the HBDB with- out observable conflicts across all three training sets for Taxonomic classification and generation of a these specific bacterial groups (Figure 2B). custom database Training set had a significant impact on both the pres- To create custom training datasets for Mothur, one ence and also the predicted abundance of particular requires a reference sequence database and the corre- taxonomic groups within honey bee guts (Figure 2A). sponding taxonomy file for those sequences. We down- Across all training sets, a total of 10 bacterial classes loaded three pre-existing, Mothur-compatible training were predicted to be represented in the bee gut includ- sets: 1) the RDP 16S rRNA reference v7 (9,662 sequences), ing 27 distinct orders, although certain orders were 2) the Greengenes reference (84,414 sequences), and 3) prevalent only in results from specific datasets, notably the SILVA bacterial reference (14,956 sequences) each Acidobacteriales and Pasteurellales (found predomin- available on the Mothur WIKI page (http://www.mothur. antly in the Greengenes taxonomic classification) and org/wiki/Main_Page). The datasets are each comprised of Bacillales and Aeromonadales (found predominantly in both an unaligned sequence file and a taxonomy file. We the SILVA results). When comparing classification modified each of these to include the honey bee database results at the order level, 3,145/4,480 (70%) of the (HBDB) to create RDP + bees, GG + bees and SILVA + sequences were classified differently by all three training bees. Using each of these six alternative datasets, we clas- sets, suggesting a severe inability of the RDP-NBC to sified the honey bee gut microbiota sequences using the place the novel sequences within known cultured iso- RDP-II Naive Bayesian Classifier [7] and a 60% confidence lates and databases. The incongruence between the clas- threshold. In addition, we also tested the ability of the sifications provided by each training set was magnified HBDB alone to confidently classify these short reads. as the taxonomic scale progressed from phylum to genus Blastn searches were performed using the blast + package (Table 1). A systematic analysis of congruence between (version 2.2.26) using default parameters. all three training sets for each unique sequence classi- fied revealed that only 595 (~13%) of the sequences concurred in their complete taxonomic classification, Results and discussion down to genus, regardless of training set (Table 1). At The effect of pre-existing training sets on the the genus level, between the three training sets, RDP classification of honey bee gut sequences and SILVA were the most similar in their classification, In order to explore how three heavily utilized pre- agreeing 1017/4480 times. The results provided by the existing training sets perform on honey bee gut micro- GG based classification were different from those pro- biota, we systematically tested the RDP-NBC in the vided by either the SILVA or the RDP datasets, dis- classification of a 16S rRNA gene pyrosequencing data- agreeing ~99% of the time with regards to genus set from the honey bee gut. The RDP, Greengenes, and (Figure 2A). Newton and Roeselers BMC Microbiology 2012, 12:221 Page 5 of 9 http://www.biomedcentral.com/1471-2180/12/221

AB GG RDP SILVA GG+bees RDP+bees SILVA+bees Lactobacillaceae 96.6 92.0 85.5 Enterobacteriaceae 97.3 98.3 96.6 Pasteurellaceae 61.5 firm-5 96.9 96.8 96.8 Psychromonadaceae 61.7 beta 98.7 98.7 98.6 Enterobacteriaceae 99.6 99.7 99.6 gamma-1 99.0 88.4 99.1 Neisseriaceae 99.4 88.8 81.7 Bifidobacteriaceae 99.0 97.3 97.7 Bifidobacteriaceae 99.9 99.9 99.9 firm-4 99.3 99.3 99.3 97.5 96.2 96.0 alpha-2.1 98.8 99.5 98.6 Bartonellaceae 77.7 Lactobacilliaceae 68.7 73.2 72.6 Moraxellaceae 94.4 89.8 80.9 alpha-1 99.6 99.5 99.4 Enterococcaceae 99.4 100 unclassified Aeromonadaceae 99.6 alpha-2.2 81.4 80.2 79.1 Succinivibrionaceae 66.1 Enterococcaceae Leuconostocaceae 100 Moraxellaceae Sphingomonadaceae Lachnospiraceae Xanthomonadaceae Leuconostocaceae Lachnospiraceae Sphingomonadaceae Methylophilaceae Xanthomonadaceae Acidobacteriaceae Acidobacteriaceae Actinomycetaceae Actinomycetaceae Clostridiaceae Aeromonadaceae Corynebacteriaceae Clostridiales Kineosporiaceae Corynebacteriaceae Oxalobacteraceae Kineosporiaceae Prevotellaceae Oxalobacteraceae Cyclobacteriaceae Prevotellaceae Flexibacteraceae Acetobacteraceae Micromonosporaceae Cyclobacteriaceae Porphyromonadaceae Flexibacteraceae Orbus 98.8 Micromonosporaceae Aerococcaceae 65.0 Neisseriaceae Carnobacteriaceae Porphyromonadaceae Brucellaceae Cytophagaceae Bacillaceae Incertae_Sedis_XI Rhizobiaceae Nakamurellaceae Geobacteraceae Flavobacteriaceae Flavobacteriaceae Nocardioidaceae Thermoactinomycetaceae Orbus 65.6 Cytophagaceae Hyphomonadaceae Nocardioidaceae C Alteromonadaceae Enterobacteriaceae 85.4 Aurantimonadaceae firm-5 93.3 Bacillales_incertae_sedis beta 98.7 Colwelliaceae Bifidobacteriaceae 98.7 Desulfobacteraceae gamma-1 78.8 Gracilibacteraceae firm-4 99.7 Hahellaceae alpha-2.1 98.6 Incertae_Sedis_XI Lactobacillaceae 83.2 Listeriaceae alpha-1 99.9 Nakamurellaceae alpha-2.2 78.8 Planococcaceae Moraxellaceae Rhodobiaceae Actinomycetales_incertae_sedis Unclassified Oxalobacteraceae Staphylococcaceae Unclassified (650)

0 to 10 10 to 50 50 to 100 100 to 200 200 to 400 400 to 1000 1000 to 2000 Figure 2 The effect of training set on the classification of sequences from the honey bee gut visualized by a heat map. Unique sequences (4,480) were classified using the NBC trained on either RDP, GG, or SILVA (A), three custom databases including near full length honey bee-associated sequences RDP + bees, GG + bees, SILVA + bees (B), or the near full length honey bee-associated sequences alone (C). Family-level taxonomic designations are shown and where taxonomic classifications occur across all three datasets, these are highlighted in bold lettering. Where a classification is unique to one training set, this is highlighted in red font. The average bootstrap score resulting from the classification is provided for each taxonomic assignment.

Resultant classification differences could be the prod- sequences could suggest that representation of related uct of either 1) differences in the taxonomic framework taxonomic groups within the training set is particularly provided to the RDP-NBC for each sequence or 2) dif- low. To explore the source of classification differences, ferences in the availability of sequences within different we investigated the pool of sequences for which training lineages in the training sets used on the RDP-NBC prior sets altered the classification. In total, 1,335 sequences to classification. Systematic phylogeny-dependent in- were unstable in their classification across all three train- stability with regards to classification of particular ing sets at the order level (Table 1), meaning that they Newton and Roeselers BMC Microbiology 2012, 12:221 Page 6 of 9 http://www.biomedcentral.com/1471-2180/12/221

Table 1 The taxonomic classification for 16S rRNA gene training set. They did not find close homologs in the sequences improves with the addition of custom SILVA training set either, the closest top scoring hits being databases 86% identical (on average). In contrast, in the GG training Taxonomic Congruent Incongruent Congruent set, top hits that were 98.6% identical were found and Level Classifications across all three Classifications these sequences were classified as γ-, with- (No. sequences) training sets with HBDB out further taxonomic depth. This result suggests that Kingdom 4,480 0 4,480 training set breadth is playing a role in the incongruity Phylum 4,465 0 4,478 observed here. In support of this hypothesis, a large num- Class 4,453 4 4,479 ber of short reads were unclassifiable using each training Order 2,579 1,335 4,669 set (1,167 unclassified by SILVA, 1,468 by GG, 2,818 by Family 1,870 2,784 4,216 RDP) and the RDP training set resulted in the least Genus 595 2,552 –* confident classification out of all three with a majority (62%) of the sequences unclassifiable at the 60% threshold. *HBDB sequences were not taxonomically assigned to genus so this level of taxonomic classification was excluded. Bootstrap scores resulting from RDP-NBC classifications The number of 16S rRNA gene sequences from honey bee guts with identical can be an indicator of sequence novelty [29]; sequences or completely divergent classifications across three widely used training sets (RDP, Greengenes, SILVA) is shown. As the taxonomic levels become more with low scores at particular taxonomic levels may repre- fine, there is an increase in the discordance/errors in taxonomic placement sent new groups with regards to the training set utilized. across all three datasets. The addition of honey bee specific sequences greatly The average bootstrap scores for each classification at the improves the congruence across all datasets (last column). family level for each of the three training sets was calcu- were classified as different orders in each of the lated (Figure 2A). Certain sequences were classified with three published training sets (RDP, GG, and SILVA). relatively low average bootstrap values, suggesting that These discrepancies were found to correspond to these sequences do not have close representatives in the classifications in three major classes: the α-proteobac- training sets. For example, a low average bootstrap score teria, γ-proteobacteria and bacilli. Sequences classified as was observed for the classification of sequences as Succi- Bartonellaceae by the Greengenes taxonomy were either nivibrionaceae by SILVA or as Aerococcaceae by the RDP. classified as Brucellaceae (RDP), Rhizobiaceae (RDP), Aurantimonadaceae (SILVA), Hyphomonadaceae (SILVA) or Rhodobiacea (SILVA). Within the γ-proteobacteria, The use of custom sequences improves the stability of those sequences classified as Orbus by the RDP training set classification of honey bee gut pyrosequences, regardless were identified as Pasteurellaceae (GG), Enterobacteriaceae of training set (GG), Psychromonadaceae (GG), Aeromonadaceae (GG In order to improve the classification of honey bee gut and SILVA), Succinivibrionaceae (GG and SILVA), Alter derived 16S rRNA gene sequences, a custom database was omonadaceae (SILVA), or Colwelliaceae (SILVA). The used to classify unique sequences. The taxonomic classifi- number of incongruent classifications for sequences cations in this custom database were generated either by identified as Lactobacillaecae by Greengenes were even close identity (95%) to a cultured isolate or by the inclu- more astonishing as they were classified as different sion of cultured isolates in the phylogeny. This phylogeny phylabyuseoftheRDPorSILVAtrainingsets;these mirrors those published by others for these bee-associated sequences were classified as Aerococcaceae (RDP), sequences [18,19,30]; honey bee-specific clades were Carnobacteriaceae (RDP), Orbus (RDP), Succ inivibrio recovered with bootstrap support >90% (Figure 1). The naceae (RDP), Bacillaceae (RDP or SILVA), Leucono addition of honey bee specific sequences to each training stocaceae (SILVA), Listeriacae (SILVA), Ther moactino set not only altered spurious taxonomic assignments for mycetaceae (SILVA), Enterococcaceae (SILVA), Gracili certain classes (notably the δ-proteobacteria are not bacteraceae (SILVA), Planococcaceae (SILVA), Desulfo present in results from these datasets, Figure 2B) but also bacteraceae (SILVA). significantly improved the congruence between classifica- Training set composition could be affecting the classi- tions provided for each training set (nearly 100% of se- fication results by the RDP-NBC presented above. We quence classification assignments concurred at the family explored this possibility by analyzing one particularly level, Figure 2B). In total, 8 different bacterial orders striking incongruity between training sets: the classifica- including Caulobacterales, Rhizobiales, Methylophilales, tion of sequences as Orbus. Only the RDP training set Neisseriales, Desulfobacterales, Desulfuromonadales, resulted in the classification of honey bee microbiota Bacillales, and Pasteurellales were reclassified into the short reads as Orbus and these sequences were used as six bee-specific families (Figure 2A,B). Importantly, the queries in a blast search against all three training sets large number of unclassifiable short reads observed pre- (RDP, SILVA, and GG). On average, these Orbus-classified viously was reduced to <100 sequences when the HBDB sequences were 93% identical to top hits in the RDP was included in the training set (Figure 2B) and the Newton and Roeselers BMC Microbiology 2012, 12:221 Page 7 of 9 http://www.biomedcentral.com/1471-2180/12/221

average bootstrap scores for these classifications were Table 2 Bacterial isolates with genus and species generally above 90% (Figure 2B). When we classify these designations that clade within the bee-specific groups short reads using the HBDB alone (that is, without the Bee-specific group Strain taxonomic designation inclusion of existing training sets), we see a similar Alpha-2.2 Saccharibacter florica strain S-877 – result the majority of the sequences are classified at a Alpha-2.1 Commensalibacter intestini strain A911 60% bootstrap threshold (Figure 2C). However, without Alpha-1 Bartonella grahamii as4aup the additional breadth provided by the GG, SILVA, or RDP training sets, nearly 15% of the short reads (650 Firm-5 Lactobacillus apis strain 1 F1 out of a total of 4,480) are unclassifiable and average These isolates, and their existing taxonomic information, may inform research into the function of the honey bee gut microbiota. bootstrap scores drop in value, suggesting that the di- versity within the bee gut has not been exhaustively were looking at the family level, including information characterized by previous 16S rRNA clone library based about whether or not the sequence had been found in studies. In contrast to the classifications provided by honey bees. As is obvious in Figure 1, certain bee- the published training sets alone (where only 62% of the associated clades include strains identified to the genus classifications agreed at the family level across all three and species level (Table 2). Because these strains are bac- training sets), the inclusion of the bee specific terial isolates that can be studied with regards to their sequences dramatically increased the congruence (94% metabolic capabilities (in some cases, their genome of the sequences agreed at the family level, Table 1). For particular taxonomic orders with high representation Table 3 Diversity of species and unique sequences found (>100 unique sequences) in the honey bee gut, there are within honey bee microbiota particularly few incongruences at the Family level Family Num. unique sequences OTUs (97% ID) (Figure 2B). Only the RDP + bees training set identifies Enterobacteriaceae 1621 175 sequences as Orbus classified as either Gamma-1 or gamma-1 436 48 Enterobacteriales by the GG + bees or SILVA + bees training sets. It is possible that this error is due to the beta 532 35 fact that the RDP training set was the smallest included Bifidobacteriaceae 363 32 in this comparative analysis; size and diversity of the firm-5 929 32 training set affects the resulting assignments [11]. We firm-4 253 21 utilized an evolutionary placement algorithm imple- alpha-2.1 90 15 mented in RAxML to identify the phylogenetic position alpha-1 65 13 of short reads classified as Orbus by the RDP + bees training set. Indeed, these Orbus-like sequences clade Lactobacilliaceae 86 12 within the gamma-1 group (Additional file 1). The Flavobacteriaceae 2 2 spurious placement of these short reads within Orbus Leuconostocaceae 2 2 by RDP was therefore primarily due to the fact that Moraxellaceae 6 2 Orbus is the closest sequence to gamma-1 found within Sphingomonadaceae 2 2 the RDP training set. Xanthomonadaceae 2 2 Biological significance Actinomycetaceae 1 1 In the end, the goal of the classifications provided by the Aeromonadaceae 1 1 RDP-NBC for next generation sequencing datasets is to alpha-2.2 10 1 provide a sense of community structure that may be Clostridiaceae 2 1 relevant to function in the environment. There were few Corynebacteriaceae 1 1 incongruities between the HBDB-based taxonomies and Cytophagaceae 1 1 those in the existing training sets, primarily because existing training sets did not include sequences identical Enterococcaceae 9 1 to these bee-specific groups. Across all three training Incertae_Sedis_XI 1 1 sets, only 14 sequences were found to be identical to Kineosporiaceae 1 1 those in the HBDB. The Greengenes training set, for ex- Nakamurellaceae 1 1 ample, included the majority of these identical sequences Oxalobacteraceae 1 1 (12/14) and many closely related sequences (>95% iden- Prevotellaceae 1 1 tical across the full length) Additional file 2). However, For each family found with honey bee guts (based on SILVA + bees rarely did our taxonomic designation differ from that in classification) the number of unique sequences and the number of 97% the original training set largely due to the fact that we identical operational taxonomic units (OTUs) is shown. Newton and Roeselers BMC Microbiology 2012, 12:221 Page 8 of 9 http://www.biomedcentral.com/1471-2180/12/221

sequences have been completed, see ncbi accession representation in training sets markedly affects RDP- #CP001562), we can begin to determine whether or not NBC performance [11,29]. there are functional differences relevant in the classification of an organism as either “alpha-2.1” (Commensalibacter Additional files intestini)or“alpha-2.2” (Saccharibacter florica). For ex- ample, the pathogen Bartonella henselae sequence Additional file 1: Table S1. Total number of operational taxonomic CP00156 (B. henselae) clades with the alpha-1 sequences units (97% ID) in either genetically uniform or genetically diverse colonies (Figure 1), a group that often is found in honey bee col- and classified as one of the honey bee specific taxonomic groups. Additional file 2: Table S2. Top scoring blastn hits between onies although the fitness effects on the host are unclear. full-length, bee specific sequences and the Greengenes training set. Additionally, the relevance of the taxonomic designation Additional file 3: Figure S1. Phylogenetic placement of representative below the family level for these bee-specific groups remains short read classified as Orbus by the RDP + bees training set. to be determined. Abbreviations Fine scale diversity within the honey bee gut RDP-NBC: The Ribosomal Database Project’s Naïve Bayesian Classifier; Using the RDP-NBC and the HBDB custom training HBDB: The Honey Bee Database; ASHB: Arb-silva small subunit honey bee alignment. sets, a large number of diverse sequences within the honey bee gut were classified in each of the honey bee Competing interests specific families (Table 3). Although our classification The authors declare that they have no competing interest. schema does not designate different genera within bee- specific bacterial families, the schema can be used to ex- Authors’ contributions ILGN conceived of the study, implemented the bioinformatics, analyzed plore the relevance of fine-scale diversity (at the OTU resultant data, and drafted the manuscript. GR provided bioinformatics tools, level) within the honey bee gut (as in [25]). The fine- participated in the analysis of the data, and helped to draft the manuscript. scale diversity identified previously as present in genetic- All authors read and approved the final manuscript. ally diverse colonies was found to exist within Acknowledgements honey bee-specific bacterial families (Additional file 3), This work was funded by startup funds provided by Indiana University to suggesting that host genetic diversity may play a role in ILGN. The manuscript benefited from the critiques of four anonymous shaping the diversity and composition of associated reviewers, to which we are thankful. microflora in colonies. Author details 1Department of Biology, 1001 E 3rd Street, Bloomington, IN 47405, USA. Conclusions 2Microbiology & Systems Biology group, TNO, Utrechtseweg, Zeist, The Netherlands. Insect-associated microbiota can be difficult to classify using existing databases [15]; The lack of cultured iso- Received: 14 May 2012 Accepted: 23 September 2012 lates or characterized species from insect environments Published: 26 September 2012 and also the enormous diversity of hosts for the micro- References bial communities is problematic. For example, when pre- 1. Andersson AF, Lindberg M, Jakobsson H, Backhed F, Nyren P, Engstrand L: defined, publically available datasets are used to train the Comparative Analysis of Human Gut Microbiota by Barcoded RDP-NBC and classify sequences from the honey bee Pyrosequencing. PLoS One 2008, 3(7):e2836. doi:10.1371/journal.pone.0002836. 2. Bates ST, Berg-Lyons D, Caporaso JG, Walters WA, Knight R, Fierer N: gut, an environment for which there are no cultured Examining the global distribution of dominant archaeal populations in representatives, taxonomic classifications are unstable soil. ISME J 2011, 5(5):908–917. and inconsistent (Figure 2A). In contrast, the HBDB cus- 3. Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at tom training sets effectively and confidently classify the a depth of millions of sequences per sample. P Natl Acad Sci USA 2011, in the honey bee gut. Results from our classifi- 108:4516–4522. cation are consistent with previous studies of the honey 4. Dethlefsen L, Huse S, Sogin ML, Relman DA: The Pervasive Effects of an Antibiotic on the Human Gut Microbiota, as Revealed by Deep 16S rRNA bee gut using 16S rRNA clone libraries [17,18], suggest- Sequencing. PLoS Biol 2008, 6(11):2383–2400. ing that the inclusion of environment-specific, high- 5. Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: quality, full-length sequences in the training set can Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing. PLoS Genet 2008, 4(11):e1000255. dramatically affect the classification results produced by doi:10.1371/journal.pgen.1000255. the RDP-NBC. In addition, the larger, more diverse 6. Koenig JE, Spor A, Scalfone N, Fricker AD, Stombaugh J, Knight R, Angenent training sets (SILVA + bees and GG + bees), provided LT, Ley RE: Succession of microbial consortia in the developing infant gut microbiome. P Natl Acad Sci USA 2011, 108:4578–4585. more stable and precise classifications, echoing results 7. Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid of previous studies and suggesting that breadth and assignment of rRNA sequences into the new bacterial taxonomy. depth in the RDP-NBC training set is crucial for more Appl Environ Microbiol 2007, 73(16):5261–5267. 8. Garrity GM, Bell JA, Lilburn TG: Taxonomic Outline of the Prokaryotes Bergey's confident taxonomic classifications [11]. This result Manual of Systematic Bacteriology, Second Edition, release 5.0. New York, NY: echoes those of other groups who have found that Springer-Verlag; 2004. Newton and Roeselers BMC Microbiology 2012, 12:221 Page 9 of 9 http://www.biomedcentral.com/1471-2180/12/221

9. Domingos P, Pazzani M: On the optimality of the simple Bayesian 29. Lan Y, Wang Q, Cole JR, Rosen GL: Using the RDP classifier to predict classifier under zero–one loss. Mach Learn 1997, 29(2–3):103–130. taxonomic novelty and reduce the search space for finding novel 10. Friedman N, Geiger D, Goldszmidt M: Bayesian network classifiers. Mach organisms. PLoS One 2012, 7(3):e32491. Learn 1997, 29(2–3):131–163. 30. Moran NA, Hansen AK, Powell E, Sabree ZL: Distinctive gut microbiota of 11. Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Caporaso JG, honey bees assessed using deep sampling from individual worker bees. Angenent LT, Knight R, Ley RE: Impact of training sets on classification of PLoS One 2012, 7(4):e36393. high-throughput bacterial 16 s rRNA gene surveys. ISME J 2012, 6(1):94–103. doi:10.1186/1471-2180-12-221 12. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Cite this article as: Newton and Roeselers: The effect of training set on Edwards RA, Stoye J: Phylogenetic classification of short environmental the classification of honey bee gut microbiota using the Naïve Bayesian DNA fragments. Nucleic Acids Res 2008, 36(7):2230–2239. Classifier. BMC Microbiology 2012 12:221. 13. Wu M, Eisen JA: A simple, fast, and accurate method of phylogenomic inference. Genome Biol 2008, 9(10):R151. doi:10.1186/gb-2008-9-10-r151. 14. Warnecke F, Luginbuhl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, Cayouette M, McHardy AC, Djordjevic G, Aboushadi N, et al: Metagenomic and functional analysis of hindgut microbiota of a wood- feeding higher termite. Nature 2007, 450(7169):560–U517. 15. Soergel D, Dey N, Knight R, Brenner S: Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. ISME J 2012, doi:10.1038/ismej.2011.208. 16. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006, 72(7):5069–5072. 17. Cox-Foster DL, Conlan S, Holmes EC, Palacios G, Evans JD, Moran NA, Quan PL, Briese T, Hornig M, Geiser DM, et al: A metagenomic survey of microbes in honey bee colony collapse disorder. Science 2007, 318(5848):283–287. 18. Martinson VG, Danforth BN, Minckley RL, Rueppell O, Tingek S, Moran NA: A simple and distinctive microbiota associated with honey bees and bumble bees. Mol Ecol 2011, 20(3):619–628. 19. Jeyaprakash A, Hoy MA, Allsopp MH: Bacterial diversity in worker adults of Apis mellifera capensis and Apis mellifera scutellata (Insecta: Hymenoptera) assessed using 16S rRNA sequences. J Invertebr Pathol 2003, 84(2):96–103. 20. Koch H, Schmid-Hempel P: Socially transmitted gut microbiota protect bumble bees against an intestinal parasite. P Natl Acad Sci USA 2011, 108(48):19288–19292. 21. Olofsson TC, Vasquez A: Detection and identification of a novel lactic acid bacterial flora within the honey stomach of the honeybee Apis mellifera. Curr Microbiol 2008, 57(4):356–363. 22. Vasquez A, Forsgren E, Fries I, Paxton RJ, Flaberg E, Szekely L, Olofsson TC: Symbionts as Major Modulators of Insect Health: Lactic Acid Bacteria and Honeybees. PLoS One 2012, 7(3):e33188. doi:10.1371/journal. pone.0033188. 23. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig WG, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 2007, 35(21):7188–7196. 24. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, et al: ARB: a software environment for sequence data. Nucleic Acids Res 2004, 32(4):1363–1371. 25. Mattila HR, Rios D, W-S VE, Roeselers G, Newton ILG: Characterization of the active microbiotas associated with honey bees reveals healthier and broader communities when colonies are genetically diverse. PLoS ONE 2012, 7:e32962. 26. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, et al: Introducing mothur: Submit your next manuscript to BioMed Central open-source, platform-independent, community-supported software for and take full advantage of: describing and comparing microbial communities. Appl Environ Microbiol 2009, 75(23):7537–7541. • Convenient online submission 27. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P: An improved Greengenes • Thorough peer review taxonomy with explicit ranks for ecological and evolutionary analyses of • No space constraints or color figure charges bacteria and archaea. ISME J 2012, 6(3):610–618. 28. Euzeby JP: List of bacterial names with standing in nomenclature: • Immediate publication on acceptance A folder available on the Internet. Int J Syst Bacteriol 1997, • Inclusion in PubMed, CAS, Scopus and Google Scholar – 47(2):590 592. • Research which is freely available for redistribution

Submit your manuscript at www.biomedcentral.com/submit