APPLIED AND ENVIRONMENTAL MICROBIOLOGY, June 2010, p. 3886–3897 Vol. 76, No. 12 0099-2240/10/$12.00 doi:10.1128/AEM.02953-09 Copyright © 2010, American Society for Microbiology. All Rights Reserved.

Diversity of 16S rRNA Genes within Individual Prokaryotic Genomes▿†

Anna Y. Pei,1‡ William E. Oberdorf,2‡ Carlos W. Nossa,2 Ankush Agarwal,2 Pooja Chokshi,5 Erika A. Gerz,2 Zhida Jin,2 Peng Lee,3 Liying Yang,3 Michael Poles,2 Stuart M. Brown,4 Steven Sotero,4 Todd DeSantis,7 Eoin Brodie,7 Karen Nelson,8 and Zhiheng Pei2,3,6*

New York University College of Arts and Science, New York, New York 10012 (1); Departments of Medicine (2) and Pathology (3) and Center for Health Informatics and Bioinformatics (4), New York University School of Medicine, New York, New York 10016; Tufts University College of Arts and Sciences, Medford, Massachusetts 02155 (5); Department of Veterans Affairs New York Harbor Healthcare System, New York, New York 10010 (6); Ecology Department, Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720 (7); and J. Craig Venter Institute, Rockville, Maryland 20850 (8)

Received 7 December 2009/Accepted 19 April 2010

Analysis of intragenomic variation of 16S rRNA genes is a unique approach to examining the concept of ribosomal constraints on rRNA genes; the degree of variation is an important parameter to consider for estimation of the diversity of a complex microbiome in the recently initiated Human Microbiome Project (http://nihroadmap.nih.gov/hmp). The current GenBank database has a collection of 883 prokaryotic genomes representing 568 unique species, of which 425 species contained 2 to 15 copies of 16S rRNA genes per genome (2.22 ± 0.81). Sequence diversity among the 16S rRNA genes in a genome was found in 235 species (from 0.06% to 20.38%; 0.55% ± 1.46%). Compared with the 16S rRNA-based threshold for operational definition of species (1 to 1.3% diversity), the diversity was borderline (between 1% and 1.3%) in 10 species and >1.3% in 14 species. The diversified 16S rRNA genes in Haloarcula marismortui (diversity, 5.63%) and Thermoanaerobacter tengcongensis (6.70%) were highly conserved at the 2° structure level, while the diversified gene in Borrelia afzelii (20.38%) appears to be a pseudogene. The diversified genes in the remaining 21 species were also conserved, except for a truncated 16S rRNA gene in “Candidatus Protochlamydia amoebophila.” Thus, this survey of intragenomic diversity of 16S rRNA genes provides strong evidence supporting the theory of ribosomal constraint. Taxonomic classification using the 16S rRNA-based operational threshold could misclassify a number of species into more than one species, leading to an overestimation of the diversity of a complex microbiome. This phenomenon is especially seen in 7 bacterial species associated with the human microbiome or diseases.

* Corresponding author. Mailing address: Department of Medicine, New York University School of Medicine, New York, NY 10016. Phone: (212) 951-5492. Fax: (212) 263-4108. E-mail: zhiheng.pei@med.nyu.edu.
‡ A. Y. Pei and W. E. Oberdorf contributed equally to this work.
† Supplemental material for this article may be found at http://aem.asm.org/.
▿ Published ahead of print on 23 April 2010.

rRNA genes are widely used for estimation of evolutionary history and taxonomic assignment of individual organisms (14, 26, 50–52). The choice of rRNA genes as optimal tools for such purposes is based on both observations and assumptions of ribosomal conservation (13, 50). rRNA genes are essential components of the ribosome, which consists of >50 proteins and three classes of RNA molecules; precise spatial relationships may be essential for assembly of functional ribosomes, constraining rRNA genes from drastic change (9, 13). In bacteria, the three rRNA genes are organized into a gene cluster that is expressed as a single operon, which may be present in multiple copies in the genome. In organisms with multiple rRNA gene operons, the gene sequences tend to evolve in concert. It is generally believed that copies of rRNA genes within an organism are subject to a homogenization process through homologous recombination, also known as gene conversion (18), a form of concerted evolution that maintains their fit within the ribosome. The homogenization process may involve short domains without affecting the entire sequence of each gene (8).

However, significant differences between copies of rRNA genes in single organisms, albeit few, have been discovered in all three domains of life and in all three classes of rRNA genes. The amphibian Xenopus laevis and the loach Misgurnus fossilis have two types of 5S rRNA genes that are specific to either somatic or oocyte ribosomes (30, 48). The parasite Plasmodium berghei contains two types of 18S rRNA genes that differ at 3.5% of the nucleotide positions and are life cycle stage specific (17). The metazoan Dugesia mediterranea possesses two types of 18S rRNA genes with 8% dissimilarity (6). The archaeon Haloarcula marismortui contains two distinct types of 16S rRNA genes that differ by 5% (32, 33). In the domain Bacteria, the actinomycete Thermobispora bispora contains two types of 16S rRNA genes that differ by 6.4% (47). Copies of the 16S rRNA genes and 23S rRNA genes of the actinomycete Thermospora chromogena differ by approximately 6 and 10%, respectively (54). Paralogous copies of rRNA genes with different sequences may have functionally distinct roles.

Divergent evolution between rRNA genes in the same genome may corrupt the record of evolutionary history and obscure the true identity of an organism. Substantial variation, if it occurs, may lead to the artificial classification of an organism into more than one species. For a cultivable organism, this problem can be resolved by cloning rRNA genes from a pure

culture of the organism to identify the degree of variation. However, most environmental surveys and the recently initiated Human Microbiome Project (HMP) (http://nihroadmap.nih.gov/hmp/) (34) use cultivation-independent techniques to examine microbiomes that contain mixed species. In the case of the HMP, it is hoped that this approach may identify some idiopathic diseases that are caused by alterations in the microbiome in humans. In this type of study, it may be impossible to trace all rRNA genes observed back to their original host. For example, in the phylum TM7, multiple 16S rRNA gene sequences have been reported (21), but it is not known whether they belong to multiple species or to the same bacterium with a high degree of intragenomic variation among rRNA gene paralogs. Due to the limited number of microorganisms for which nucleotide sequences are available for all copies of the rRNA genes, intragenomic variation among 16S rRNA genes, and the likelihood of pyrosequencing errors (25, 40), the potential to overestimate the diversity of a microbiome exists.

Coenye et al. analyzed 55 bacterial genomes and found the intragenomic heterogeneity between multiple 16S rRNA genes in these genomes was below the common threshold (1 to 1.3%) for distinguishing species (44) and was unlikely to have a profound effect on the classification of taxa (10). The analysis of 76 whole genomes by Acinas et al. revealed the extreme diversity (11.6%) of 16S rRNA genes in Thermoanaerobacter tengcongensis (2). These early analyses of intragenomic variation of 16S rRNA genes were limited to a small number of available whole genomes. With the increasing number of whole microbial genomes available from the National Center for Biotechnology Information (NCBI), the extent of diversity among the paralogous 16S rRNA genes within single organisms can now be more thoroughly assessed. In the present study, we (i) addressed the theory of 16S rRNA conservation by systematic evaluation of intragenomic diversity of 16S rRNA sequences in completely sequenced prokaryotic genomes to assess its effect on the accuracy of 16S rRNA-based molecular taxonomy and (ii) examined whether previously observed ribosomal constraints on conservation of 2° structures are uniformly applicable at the intragenomic level.

MATERIALS AND METHODS

Annotation of rRNA genes. Gene sequences were obtained from the Complete Microbial Genomes database at the NCBI website (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html). For some species, more than one genome was available. To avoid overrepresentation of any species, only the most completely annotated genome was selected to represent a species.

Analysis of intragenomic diversity in 16S rRNA genes. Genomes that contained only a single 16S rRNA gene were not further analyzed. Copies of 16S rRNA genes from each remaining genome were aligned with Sequencher (Gene Codes Corporation, Ann Arbor, MI). To calculate diversity, the number of revealed mismatches and insertions was divided by the total number of positions, including gaps in the alignment. If gaps were determined to be caused by intervening sequences (IVS) (inserts of >10 bp), they were recorded and removed, and sequences were realigned and reanalyzed. To determine the effect of masking hypervariable positions, the two 16S rRNA genes with the largest difference in a genome were aligned using NAST (12) in the Greengenes 7,682-character format. The aligned sequences were compared to the Lane mask (27) to mask away all but 1,287 conserved columns (lanes) of aligned characters. After common gaps had been removed, diversity between the two sequences was calculated.

Comparison of 2° structures. Secondary structures were analyzed based on minimizing free energy, using RNAstructure (31) and visualized by Rnaviz (11), with experimentally defined 16S rRNA or the consensus 16S rRNA models (53) used for reference. To compare two related 2° structures, a mismatch was defined as conserved if located in a loop or located in a stem, but causing GC:GU conversions or covariation resulting in no change in base pairing. In contrast, a nonconserved mismatch that alters base pairing and structure was classified as a stem-loop transition. When base pairing and/or structure was significantly altered and did not allow for more detailed comparison, localized segments including 50 bp upstream and 50 bp downstream of the area of diversity were analyzed at various energy levels. The two most similar structures at the lowest energy levels were compared. If no identical structures were found, these areas were classified into one of two groups. In areas of regional complex rearrangement (37), the bases pair differently and result in a slight change to the sizes and relative location of 2° structures in the immediate area, but the structures look similar topographically. In regional alteration of 2° structures, highly concentrated areas of substitutions result in topographically different structures. Neither group could be further broken down into covariation, GU = GC, indels, or in-loop changes, and both were considered on the whole to be localized, nonconservative changes. For 16S rRNA genes that displayed high levels of regional diversity, the regions in question were also folded using the KnetFold program (5). This folding method creates secondary structures based on multiple sequences. The output from KnetFold was entered into jViz.RNA 2.0 in order to visualize the secondary structure (49). jViz.RNA 2.0 allows for the creation of complex 2° structures that may contain pseudoknots. The multiple-sequence folding was verified using another program named Murlet (23).

For T. tengcongensis and Borrelia afzelii, mismatches also were classified by the position-specific relative variability rate, calculated from the consensus 16S rRNA gene model based on an alignment of 3,407 bacterial 16S rRNA genes (53). Positions were classified as variable or nonvariable according to the substitution rate relative to the average substitution rate of all sites (53). The relative substitution rate for a variable position, v > 1, indicates a substitution rate higher than that averaged for all sites in the rRNA gene analyzed, while a conserved position had a relative substitution rate of v < 1; uncommon sites are positions occupied in <25% of organisms due to insertions. The expected variability for certain classes of positions was calculated from the consensus models. Differences between expected and observed variability were analyzed by chi-square analysis, considered significant at P < 0.05.

RESULTS

16S rRNA gene data set. In total, 883 complete prokaryotic genomes were available for analysis, 61 from Archaea and 822 from Bacteria. The 883 genomes represented 25 phyla or 568 unique species (see Table S1 in the supplemental material). Proteobacteria was the most abundant phylum (264 species), followed by Firmicutes (106 species), Actinobacteria (47 species), Euryarchaeota (34 species), Crenarchaeota (18 species), Bacteroidetes (13 species), Cyanobacteria (13 species), and Spirochaetes (13 species). The remaining 17 phyla were represented by only 60 species. Genome sequences from 143 species contained only a single 16S rRNA gene. The remaining 425 unique (19 archaea, 406 bacteria) species containing multiple 16S rRNA genes belonged to 18 different phyla (Table 1). Once again, the most abundant was Proteobacteria (210 species), followed by Firmicutes (98 species), Actinobacteria (36 species), Euryarchaeota (19 species), Cyanobacteria (11 species), and Bacteroidetes (10 species). The remaining 12 phyla were represented by only 41 species.

Diversity of 16S rRNA genes. The 425 unique species contained 2 to 15 copies of 16S rRNA genes per genome (2.22 ± 0.81). Intragenomic diversity of 16S rRNA genes ranged between 0 and 20.38% (0.30% ± 1.12%). Sequence diversity among the 16S rRNA genes in a genome was found in 235 species, ranging from 0.06% to 20.38% (0.55% ± 1.46%). Using the 1%-to-1.3% threshold for operational identification of species based on 16S rRNA genes (45), 24 species were found to have intragenomic diversity equal to or higher than the operational threshold (Table 2), of which the diversity was

TABLE 1. Distribution of unique prokaryotic species with more than one 16S rRNA gene by phylum

Phylum               No. of   No. of 16S rRNA genes per genome  % diversity
                     species  Minimum  Avg    Maximum  σ        Minimum  Avg      Maximum  σ
Proteobacteria       210      2        4.47   15       2.54     0.00     0.20     2.09     0.34
Firmicutes           98       2        5.97   15       3.13     0.00     0.46     6.70     1.00
Actinobacteria       36       2        3.64   6        1.40     0.00     0.10     1.30     0.23
Euryarchaeota        19       2        2.74   4        0.73     0.00     0.38     5.63     1.28
Cyanobacteria        11       2        2.55   4        0.93     0.00     0.06     0.27     0.09
Bacteroidetes        10       2        4.90   7        1.79     0.00     0.34     1.30     0.47
Chlorobi             6        2        2.17   3        0.41     0.00     0.04     0.14     0.06
Chloroflexi          6        2        3.00   5        1.10     0.00     0.57     1.08     0.46
Spirochaetes         6        2        2.00   2        0.00     0.00     3.41(a)  20.38    8.31
Aquificae            5        2        2.20   3        0.45     0.00     0.04     0.20     0.09
Deinococcus-Thermus  4        2        2.75   3        0.50     0.00     0.30     1.06     0.51
Thermotogae          4        2        2.75   4        0.96     0.00     0.14     0.50     0.24
Chlamydiae           3        2        2.33   3        0.58     0.00     0.45     1.34     0.77
Dictyoglomi          2        2        2.00   2        0.00     0.00     0.03     0.06     0.04
Tenericutes          2        2        2.00   2        0.00     0.00     0.04     0.07     0.05
Acidobacteria        1        2        2.00   2        0.00     0.00     0.00     0.00     0.00
Fusobacteria         1        5        5.00   5        0.00     0.14     0.14     0.14     0.00
Verrucomicrobia      1        3        3.00   3        0.00     0.00     0.00     0.00     0.00

Summary              425      2        4.45   15       2.66     0.00     0.30     20.38    1.12

a The high average diversity is due to the 20.38% diversity in Borrelia afzelii.

borderline (between 1% and 1.3%) in 10 species and ≥1.3% in 14 species. In particular, this phenomenon was seen in 7 bacterial species associated with the human microbiome or diseases (Table 2), including Escherichia coli (1.10%), Bacillus subtilis (1.16%), Pseudomonas stutzeri (1.23%), Bacteroides thetaiotaomicron (1.30%), Bifidobacterium adolescentis (1.30%), “Candidatus Protochlamydia amoebophila” (1.34%), and Borrelia afzelii (20.38%) (15, 19, 20, 22, 39, 46, 55). The sequence diversity observed in these species appears to follow five patterns: intervening sequences, regional diversity, random diversity, pseudogene, and gene truncation/partial rRNA operon.

Species with IVS. IVS in 16S rRNA genes are associated with significant intragenomic diversity among 16S rRNA genes in four species. The diversity seems related to regions near

TABLE 2. Species with significant diversity among paralogous 16S rRNA genes

Species                                      No. of  % diversity  Patterns                   % diversity
                                             copies                                          (masked)
Deinococcus geothermalis                     3       1.06         Random                     1.09
Chloroflexus sp. strain Y-400-fl             3       1.08         Regional                   0.31
Shewanella sp. strain ANA-3                  9       1.09         Partial operon, regional   0.16
Escherichia coli (a)                         7       1.10         Partial operon, regional   0.31
Carboxydothermus hydrogenoformans            4       1.14         Regional                   1.24
Bacillus clausii                             7       1.15         Partial operon, regional   0.08
Bacillus subtilis (a)                        10      1.16         Random                     0.39
Desulfotalea psychrophila                    7       1.16         Regional                   0.00
Geobacillus thermodenitrificans              10      1.22         Regional                   1.09
Pseudomonas stutzeri (a)                     4       1.23         Regional                   0.47
Bacteroides thetaiotaomicron (a)             5       1.30         Regional                   0.66
Bifidobacterium adolescentis (a)             5       1.30         Partial operon, regional   0.39
“Candidatus Protochlamydia amoebophila” (a)  3       1.34         Gene truncation, regional  1.53
Caldicellulosiruptor saccharolyticus         3       1.35         Regional                   0.31
Shewanella baltica                           10      1.36         Partial operon, regional   0.08
Shewanella woodyi                            10      1.56         Regional                   0.08
Heliobacterium modesticaldum                 8       1.58         Regional                   0.23
Syntrophomonas wolfei                        3       1.67         Partial operon, IVS        0.16
Photobacterium profundum                     14      1.76         IVS                        0.08
Desulfitobacterium hafniense                 5       1.80         IVS                        0.08
Clostridium cellulolyticum                   8       2.07         Regional                   0.32
Haloarcula marismortui                       2       5.63         Regional                   4.86
Thermoanaerobacter tengcongensis             4       6.70         IVS                        5.01
Borrelia afzelii (a)                         2       20.38        Pseudogene                 NA (b)

Avg                                          6.08    2.55

a Species found in humans.
b NA, not alignable.
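
As an illustrative aside, the calculation behind the % diversity column and the borderline/above-threshold split can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' pipeline (their alignments were produced with Sequencher): the function names `pairwise_diversity` and `classify`, the gap symbol, and the toy alignment are hypothetical, and the listed values are simply the % diversity entries of Table 2.

```python
# Illustrative sketch only; names and the toy alignment are hypothetical.

def pairwise_diversity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent diversity between two aligned sequences of equal length.

    Mismatches and insertions/deletions are counted and divided by the
    number of alignment positions, gaps included. Columns that are gaps
    in both sequences (common gaps) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    positions = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # common gap: not an informative column
        positions += 1
        if a != b:
            differences += 1
    if positions == 0:
        raise ValueError("alignment has no informative columns")
    return 100.0 * differences / positions


def classify(diversity_pct: float) -> str:
    """Place a % diversity value relative to the 1%-to-1.3% threshold."""
    if diversity_pct >= 1.3:
        return "above threshold"
    if diversity_pct >= 1.0:
        return "borderline"
    return "below threshold"


# Toy alignment: one indel and one mismatch over 10 columns.
toy = pairwise_diversity("ACGT-ACGTA", "ACGTTACGAA")

# % diversity column of Table 2, in table order.
TABLE2_DIVERSITY = [
    1.06, 1.08, 1.09, 1.10, 1.14, 1.15, 1.16, 1.16, 1.22, 1.23,
    1.30, 1.30, 1.34, 1.35, 1.36, 1.56, 1.58, 1.67, 1.76, 1.80,
    2.07, 5.63, 6.70, 20.38,
]

counts = {}
for value in TABLE2_DIVERSITY:
    label = classify(value)
    counts[label] = counts.get(label, 0) + 1

print(toy)
print(counts)
```

Applied to the Table 2 values, this tally gives 10 borderline species and 14 species at or above 1.3%, which agrees with the counts reported in the text.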

FIG. 1. Secondary structures of 16S rRNA genes from Thermoanaerobacter tengcongensis. (a) The distribution of substitutions is shown on the secondary structure predicted for rrnB16S, based on free energy minimization (31) with the consensus 16S rRNA gene model as reference (53). Positions that differ between rrnB16S and rrnC16S are shown in colored letters. Conservative changes located in loops or compensatory changes due to covariation in stems are shown in blue; changes that result in alteration of secondary structures are shown in red. Insertions/deletions are shown in brown. Substitutions are coded as follows: K = G or U, M = A or C, R = A or G, S = C or G, W = A or U, and Y = C or U. (b) The distribution of substitutions also is shown on the variability map based on the 2° structure models for Thermus thermophilus (53). Mismatched positions between rrnB16S and rrnC16S are highlighted in colors according to the position-specific relative variability rate, calculated from the consensus rrn16S model based on an alignment of 3,407 bacterial 16S rRNA genes (53). A position with a relative substitution rate of v > 1 (red) implies that it has a substitution rate higher than the average substitution rate of all its sites in the rRNA gene analyzed, while v < 1 (blue) indicates that the rate is lower than the average rate. Uncommon sites are positions that are occupied in <25% of organisms because of insertions, which are shown by black dots. The expected variability was calculated from the consensus models (c).
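
The comparison of observed and expected variability described in Materials and Methods can be illustrated with a short Python sketch. The article states only that a chi-square analysis was used, so the two-class goodness-of-fit layout below (variable versus conserved positions, 1 degree of freedom) is an assumption, as is the function name `chi_square_two_classes`. The inputs are taken from Results: 74 of the 103 mismatched positions in T. tengcongensis fell at variable positions, against an expected fraction of 32.0%.

```python
import math


def chi_square_two_classes(observed_hits: int, total: int,
                           expected_fraction: float) -> tuple:
    """Two-class chi-square goodness-of-fit test (1 degree of freedom).

    Hypothetical helper: compares the observed count in one class with
    the count expected from `expected_fraction`, and does the same for
    the complementary class.
    """
    expected_hits = total * expected_fraction
    expected_misses = total - expected_hits
    observed_misses = total - observed_hits
    chi2 = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_misses - expected_misses) ** 2 / expected_misses)
    # With 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# 74 of 103 mismatches at variable positions versus 32.0% expected.
chi2, p = chi_square_two_classes(74, 103, 0.32)
print(f"chi-square = {chi2:.2f}, P = {p:.2e}")
```

Under these assumptions the statistic is far beyond the P < 0.05 cutoff used in the study, consistent with the P < 0.0001 reported for this comparison.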

IVS. T. tengcongensis contains four 16S rRNA genes that can be classified into two major 16S rRNA gene alleles. rrnA16S, rrnB16S, and rrnD16S were nearly identical. Their diversity in relation to rrnC16S was 11.5% (186 of 1,620 positions), due to 3 inserts (24 nucleotides [nt], 28 nt, and 31 nt) and 103 mismatched positions (Fig. 1a). Considering only the 103 mismatched positions, the difference between the 2 rRNA alleles was 6.70%. Homologs of the 28- and 31-nt inserts also were present in rrn16S of Thermoanaerobacter kivui (96.6%) and Thermoanaerobacter siderophilus (89.8%), respectively, suggesting that such insertions may be widespread in Thermoanaerobacter (4) (Fig. 2). Syntrophomonas wolfei contains three 16S rRNA genes that varied in length (1,634 nt, 1,696 nt, and 1,646 nt, respectively). These three genes each contain a different IVS. rrnA has a 66-nt IVS starting at the 104th nt that shares no homology with the IVS found in rrnB and rrnC. The IVS in rrnB and rrnC are 130 nt and 79 nt, respectively, and each begins at the 103rd nt compared to Escherichia coli. After the IVS are removed, the greatest divergence is found between rrnA and rrnC, with 26 mismatches resulting in 1.67% diversity. The majority of the intragenomic difference is located in the 33-nt region directly adjacent to the 5′ end of the IVS. There are 22 mismatches in this area resulting in regional alteration and a divergence of 66.67%. Desulfitobacterium hafniense contains five 16S rRNA genes. 16S rrnA, 16S rrnB, and 16S rrnC are 1,554 nt long and only differed by 0.51%, but 16S rrnD (1,661 nt) and 16S rrnE (1,671 nt) each contain an IVS of 104 nt and 117 nt, respectively, inserted after the 91st nt. While the IVS are in nearly identical areas, they differ significantly (58.87%). Excluding the IVS, the maximal difference is 1.8% between 16S rrnB and 16S rrnF. The majority of the mismatches occurred within the 33-nt segment upstream of the IVS. This area of variability might have been caused by the insertion of IVS. The source for these IVS is unknown, but they appear to be rRNA in nature because they exhibit a highly complex 2° structure like rRNA. They were also found in two closely related species (16S rrnE IVS in Desulfitobacterium frappieri and 16S rrnD IVS in D. frappieri and Desulfitobacterium dehalogenans). Photobacterium profundum has 14 copies of 16S rRNA genes. There are 3 IVS of 14 nt (IVS-A), 14 nt (IVS-B), and 16 nt (IVS-C), respectively. Three genes (rrnC, rrnJ, and rrnK) contain no insert. Seven genes (rrnA, rrnF, rrnL, rrnM, rrnN, rrnO, and rrnP) contain only IVS-A. Three genes (rrnB, rrnD, and rrnG) contain IVS-A and IVS-C. There is only one gene (rrnB) that contains all three IVS. Excluding the IVS, the maximum difference was 2.09% between 16S rrnC and 16S rrnE.

FIG. 2. Homologs of the 28- and 31-bp inserts in T. tengcongensis rrnC 16S rRNA genes also are present in 16S rRNA genes of T. kivui and T. siderophilus. The 24- and 28-bp inserts were separated by a 6-nucleotide conserved sequence shown in red.

Species with random or unusual regional diversity. No apparent clustering of substitutions in 16S rRNA genes was observed in Deinococcus geothermalis (1.06%) and Bacillus subtilis (1.16%).

Unusual regional diversity is associated with high overall diversity among paralogous 16S rRNA genes in four species. Shewanella woodyi has 10 16S rRNA genes. The greatest divergence of 1.56% occurs between 16S rrnC and 16S rrnD. 16S rrnC has two regions in which the majority of the variability lies. The first spans 41 nt between positions 995 and 1035 in variable region 6 (V6) (7) and shows 13 substitutions, which result in a 31.7% regional divergence. The second region is only 11 bp long between positions 1128 and 1138 (V7), with six substitutions for a regional divergence of 54.5%. Heliobacterium modesticaldum has 10 16S rRNA genes, nine of which are 1,510 nt and one of which is 1,516 nt in length. Maximal diversity is 1.58%, between 16S rrnF and rrnC. This diversity is in part due to a 20-nt region found in V2 (170th to 189th nt) that exhibits 13 mismatches for a 65% regional divergence. In addition, 16S rrnF has a 6-nt insert between nt 1490 and 1495 (V9). Clostridium cellulolyticum has eight 16S rRNA genes. The maximal divergence is 2.07%, between 16S rrnE and rrnA. This high divergence is found mainly in a 123-nt segment spanning V1 and V2 (nt 82 to 204). Within this segment in 16S rrnE, there are 22 substitutions and 3 indels of 1 base each, compared with 16S rrnA. The segment ends with a 5-nt deletion, and the rest of the 3′ strand is rather homologous to the other 7 copies. The variable area shows a divergence of 22.76% compared with the same area in rrnA (28 mismatches among 123 nt). The genome of H. marismortui has three 16S rRNA genes on two chromosomes. Although rrnA and rrnC are virtually identical, rrnB differs substantially from rrnA (5.63%), due to 83 substitutions that are mainly concentrated in three regions. Region 1 (positions 59 to 184; V1 to V2) has 11 substitutions for a regional difference of 8.7%. Region 2 (positions 524 to 840; V4 to V5) has 52 substitutions for a difference of 16.4%. Region 3 (positions 1054 to 1133, involving the region between V6 and V7, as well as a portion of V7) has 10 substitutions for a difference of 16.67%.

Species with nonfunctional rRNA gene. Borrelia afzelii contains two 16S rRNA genes found on two separate chromosomes that differ by 20.38%. The lengths of B. afzelii rrnA and rrnB are 1,509 nt and 1,537 nt, respectively, when compared to E. coli as a template. While rrnB is closely related to 16S rRNA genes in other species of Borrelia, rrnA is significantly different. Alignment of the two genes results in a 1,570-nt overlap, of which 320 bp are mismatched (225 substitutions and 95 indels). These mismatches do not follow any of the aforementioned patterns and are distributed evenly throughout the alignment. This level of random diversity may have occurred due to the escape from constraint in a nonfunctional 16S rRNA pseudogene. There has not been any experiment designed to examine the expression of rrnA of B. afzelii. It does not appear to be horizontally transferred into B. afzelii from other species, as it is closer to rrnB of B. afzelii than to a species in any other genus.

Species with truncation of 16S rRNA genes or partial rRNA operons. Truncated 16S rRNA genes and partial rRNA operons are a form of structural diversity among 16S rRNA genes in a genome. Truncated 16S rRNA genes were only found in two species. “Candidatus Protochlamydia amoebophila” contains three 16S rRNA genes with 16S rrnA (1,481 nt) and 16S rrnC (1,481 nt) being identical, but the rrnB operon contains a truncated 23S gene as well as a truncated 16S gene, although the 5S gene is intact. The 16S gene was missing its 3′ end and was fused with the 23S gene on its 5′ end. The truncated 16S rrnB was 898 nt long and had 12 mismatches (8 substitutions and 4 indels) resulting in a diversity of 1.34%. When the missing 3′ portion was searched for in the genome, only two genes were found belonging to rrnA and rrnC, indicating that 16S rrnB truly is truncated. Bacillus amyloliquefaciens FZB42 has 10 16S rRNA genes, all of which are nearly identical (<0.5% diversity). 16S rrnF is only 599 nt in length, corresponding to the 3′ portion of the 16S rRNA gene. Again, a search of the whole genome of B. amyloliquefaciens for the 5′ portion of the 16S rRNA gene found only nine copies that are associated with the nine complete 16S genes.

The absence of a whole 16S rRNA gene in an rRNA operon, as evidenced by the presence of 23S or 5S rRNA genes but absence of a 16S rRNA gene, was observed in rRNA operons in 95 species (see Table S2 in the supplemental material). The 23S or 5S rRNA genes in the partial rRNA operon appear functional because none of the genes exhibit excessive random mutations characteristic of a pseudogene. Interestingly, intragenomic diversity among 16S rRNA genes in 6 of the 95 species was borderline or slightly above the 1 to 1.3% threshold for separation of species (Table 2). These species include Shewanella sp. strain ANA-3 (1.09%), Escherichia coli (1.10%), Bacillus clausii (1.15%), Bifidobacterium adolescentis (1.30%), Shewanella baltica (1.36%), and Syntrophomonas wolfei (1.67%). As described before, the high diversity in S. wolfei was also associated with IVS in 16S rRNA genes.

Conservation of secondary structures. In 18 of the 24 species with significant intragenomic diversity (Table 2), the majority of substitutions in each species were conserved at the 2° structure level (Table 3), and the diversity at the secondary structure level is <1%. For example, at the 2° structure level,

TABLE 3. Effect of diversity on the conservation of 2° structure

Organism                                     Diversity (%) No. of         No. of conservative changes          No. of nonconservative changes
                                             1°     2°     substitutions  Covariation  G:U =    In     Indel   Stem-loop   Indel   Regional complex  Regional alteration
                                                                                       G·C      loop           transition          rearrangement     of 2° structure
Deinococcus geothermalis                     1.06   0.20   16             0            6        7      0       3           0       0                 0
Chloroflexus sp. strain Y-400-fl             1.08   0.41   16             2            1        6      1       6           0       0                 0
Shewanella sp. strain ANA-3                  1.09   0.13   17             6            6        3      0       2           0       0                 0
Escherichia coli (a)                         1.10   0.32   17             2            3        6      1       4           1       0                 0
Carboxydothermus hydrogenoformans            1.14   0.06   18             2            3        11     1       1           0       0                 0
Bacillus clausii                             1.15   0.45   18             6            2        3      0       5           2       0                 0
Bacillus subtilis (a)                        1.16   0.58   18             0            3        6      0       4           5       0                 0
Desulfotalea psychrophila                    1.16   0.52   18             6            0        1      3       8           0       0                 0
Geobacillus thermodenitrificans              1.22   0.32   19             0            1        8      5       5           0       0                 0
Pseudomonas stutzeri (a)                     1.23   0.06   19             8            5        5      0       1           0       0                 0
Bacteroides thetaiotaomicron (a)             1.30   0.07   18             4            1        12     0       0           1       0                 0
Bifidobacterium adolescentis (a)             1.30   0.26   20             8            3        1      4       3           1       0                 0
“Candidatus Protochlamydia amoebophila” (a)  1.34   1.34   12             0            0        0      0       0           0       0                 12
Caldicellulosiruptor saccharolyticus         1.35   0.00   21             8            5        7      1       0           0       0                 0
Shewanella baltica                           1.36   0.19   21             10           4        4      0       3           0       0                 0
Shewanella woodyi                            1.56   0.98   24             0            4        2      3       2           0       13                0
Heliobacterium modesticaldum                 1.58   0.60   24             12           2        1      2       7           0       0                 0
Syntrophomonas wolfei                        1.66   1.46   26             0            1        2      0       1           0       0                 22
Desulfitobacterium hafniense                 1.80   1.16   28             6            3        1      0       1           0       0                 17
Photobacterium profundum                     2.09   0.98   32             8            3        6      0       3           0       12                0
Clostridium cellulolyticum                   2.13   1.76   35             4            1        1      0       2           0       27                0
Haloarcula marismortui                       5.63   0.82   83             51           9        11     0       12          0       0                 0
Thermoanaerobacter tengcongensis             6.70   0.52   103            34           10       57     4       8           0       0                 0
Borrelia afzelii PKo                         20.38  6.88   320            16           82       74     40      53          55      0                 0

a Species found in humans.
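
The 2° diversity column of Table 3 can be read as the share of alignment positions that carry a nonconservative change. The following Python sketch illustrates that arithmetic for the organisms whose total position counts are stated in the Results text; it is a worked check under that reading, not the authors' procedure, and the helper name `structural_diversity` is hypothetical.

```python
def structural_diversity(nonconservative_changes: int,
                         total_positions: int) -> float:
    """Percent of alignment positions predicted to alter the 2° structure."""
    return round(100.0 * nonconservative_changes / total_positions, 2)


# (nonconservative changes, total positions) as reported in Results.
EXAMPLES = {
    "Thermoanaerobacter tengcongensis": (8, 1527),
    "Haloarcula marismortui": (12, 1472),
    "Shewanella woodyi": (15, 1537),
    "Desulfitobacterium hafniense": (18, 1555),
    "Borrelia afzelii": (108, 1570),
}

results = {
    organism: structural_diversity(changes, positions)
    for organism, (changes, positions) in EXAMPLES.items()
}

for organism, value in results.items():
    print(f"{organism}: {value}%")
```

For these five organisms the computed percentages (0.52, 0.82, 0.98, 1.16, and 6.88) agree with the corresponding 2° diversity entries in Table 3. Rows for which the text gives no position total are left out rather than guessed.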

95 (92.2%) of the 103 mismatches between rrnB16S and rrnC16S in T. tengcongensis are conserved (Fig. 1a), composed of 48 mismatches in loops, 36 covariations, 4 insertions, and 7 GU:GC conversions. Thus, only 8 (0.5%) (3 substitutions and 5 insertions) of the 1,527 total positions are predicted to alter 2° structure, causing a stem-loop transition. Superimposition of the mismatches onto the variability map of the Thermus thermophilus 16S rRNA gene (31) (Fig. 1b) revealed that 74 (71.8%) of the 103 mismatch positions occurred at highly variable positions, significantly higher than expected by chance (32.0%; P < 0.0001) (Fig. 1c). The 29 mismatches at conserved positions account for 1.9% of the 1,527 positions in rrn16S. The highly conserved 2° structure indicates constraints from ribosomal proteins, indicating that rrnC16S is functional. In the 2° structure of H. marismortui, 71 (85.5%) of the 83 mismatches were conserved. Only 12 (0.82% of the 1,472 total positions) are predicted to cause an alteration, in the form of stem-loop transitions. Similarly, for H. modesticaldum, 17 (70.83%) of the 24 mismatches are conserved, while only 7 mismatches (0.46% of the 1,516 total positions) are predicted to alter the 2° structure, causing stem-loop transitions. In the remaining 16 species, there are a total of 498 substitutions between the 16 pairs of most diversified 16S rRNA genes. The large majority (411; 82.5%) of the substitutions do not cause alteration of the 2° structure.

In the remaining 6 of the 24 species, either the majority of the substitutions are not conserved at the 2° structure level or the diversity at the 2° structure level remains >1%, of which 4 species are associated with exceptional regional diversity and 2 with nonfunctional 16S rRNA genes.

In two species, thermodynamic folding showed that substitutions caused complex regional rearrangement of base pairing, but the resultant 2° structures still topographically retained the main features of the original ones (Fig. 3 and Table 3), similar to the complex rearrangement described in 23S rRNA genes (37). In S. woodyi, 9 of the 24 mismatches between rrnC16S and rrnD16S were conserved, while the other 15 mismatches (0.98% of the 1,537 total positions) are predicted to alter the 2° structure. The alteration is due to 2 substitutions causing stem-loop transitions and 13 substitutions resulting in regional complex rearrangement (Fig. 3a and b). Likewise, in C. cellulolyticum, 28 of the 35 mismatches between rrnA16S and rrnE16S (1.70% of the 1,645 total positions) are predicted to alter the 2° structure, of which 28 are involved in regional complex rearrangement (Fig. 3c and d). Besides these two species, P. profundum also exhibits features of complex regional rearrangement, although the majority of substitutions (17/32) in this species are conserved (Table 3). The remaining 15 (0.98% of 1,533 total positions) are associated with alteration of the 2° structures, of which 12 are concentrated in an area of regional complex rearrangement (Fig. 3e and f). In the other two species, substitutions directly caused regional alteration in the 2° structures without complex rearrangement (Table 3). The altered regions are located directly upstream of the IVS insertion sites. In D. hafniense, there were 28 total substitutions, of which 18 (1.16% of the 1,555 total positions) caused

FIG. 3. Conservation of 2° structure by complex rearrangement of base pairing and substitutions in 16S rRNA genes of S. woodyi (a and aЈ and b and bЈ), C. cellulolyticum (c and cЈ and d and dЈ), and P. profundum (e and eЈ and f and fЈ). Each molecule was folded using a program based on thermodynamics (a, b, c, d, e, and f), as well as a program based on multiple sequence alignment (aЈ,bЈ,cЈ,dЈ,eЈ, and fЈ), as described in Materials and Methods. For thermodynamic folding, the regions 50 bp upstream of the first mutation and 50 bp downstream of the last mutation were used to create the structures for each rRNA molecule, while only the area of interest is shown. Nucleotides related to substitutions are VOL. 76, 2010 DIVERSITY OF 16S rRNA GENES 3893 an alteration. One caused stem-loop transition, while 17 were tochlamydia amoebophila” (1.53%), Carboxydothermus hydrog- found in an area of regional alteration directly upstream of the enoformans (1.24%), Deinococcus geothermalis (1.09%), and IVS insertion site. In S. wolfei, there were 26 total mismatches Geobacillus thermodenitrificans (1.09%). For B. afzelii, the two between rrnA16S and rrnC16S: 22 resulting in regional alter- 16S rRNA genes are too diversified to be aligned using avail- ation of the 2° structure immediately upstream of the IVS and able algorithms (the original alignment of full sequences of 1 resulting in a stem-loop transition (1.46% of the 1,570 total these two genes was done manually). Such a high level of positions). For these 16S rRNA genes that displayed high diversity after masking will be troublesome not only for thresh- levels of regional diversity, the regions in question were also old-based taxonomic assignment using full-length sequences, folded using the multiple-sequence-alignment approach (5). but also for phylogenetic inference using masked sequences As shown in Fig. 
3 (aЈ to fЈ), a consensus 2° structure for the because 16S rRNA genes from within the same genome are pair of diversified regions can be constructed if noncanonical not monophyletic in these species. base pairing is allowed (35, 36), which provides another pos- sibility for conservation of the 2° structure among highly diver- DISCUSSION sified regions. The truncated 16S rRNA gene (16S rrnB)in“Candidatus We analyzed 16S rRNA genes from genomes representing Protochlamydia amoebophila” is probably a nonfunctional 425 prokaryotic species to examine the evidence supporting the gene. It is only 898 nt long and contains mismatches all asso- theory of ribosomal constraint of rRNA structures at the in- ciated with alteration of the 2° structure. tragenomic level. Our findings support the hypothesis that Borrelia afzelii is unique among the genomes deciphered to individual rRNA genes within a genome are conserved due to date, in that the divergence remains extremely high (6.68%) such structural constraints. Prokaryotes comply with the con- even after examination of the 2° structure of the genes. The straint at both the primary and secondary structure levels. In 320 substitutions between rrnA (1,509 nt) and rrnB (1,537 nt) organisms containing multiple rRNA genes, the homogeneity resulted in 108 positions (6.88%) of 2° structure alteration (55 of primary sequences is believed to be maintained through indels and 53 stem-loop transitions) over a 1,570-nt alignment gene conversion by homologous recombination (18), as a form (Fig. 4a). Superimposition of the mismatches to the variability of concerted evolution (1). Although there are no criteria to map of E. coli 16S rRNA gene (31) (Fig. 
4b) revealed that only define the ribosomal constraint, observations suggest that 1 to 73 (22.8%) of the 320 substitutions occurred at highly variable 1.3% diversity between 16S rRNA genes is tolerated by strains positions, significantly less than expected by chance (29.9%; from the same species (44). Our data indicate that homologous P Ͻ 0.0058) (Fig. 4c). Similarly, the substitution rate at con- recombination is highly effective in maintenance of 16S rRNA served positions (207/320; 64.7%) was also less than expected homogeneity, as the diversity is confined within 1% in all but by chance (68.3%) but not statistically different (P Ͻ 0.1649) only 24 of the 425 species. The exceptionally high diversity in (Fig. 4c). Conversely, a large number of the substitutions (40; the 24 species does not violate the ribosomal constraints, as 12.5%) were located at sites not commonly seen the majority they can be explained at the 2° structure level, the ultimate line of 16S rRNA genes (Fig. 4c). of ribosomal constraint (56). For example, in T. tengcongensis, Effect of masking hypervariable positions. Normally, for the 2° structure is well conserved despite 6.70% diversity in the phylogenetic analysis, only those positions that can be well primary structure (Table 3). Similarly, in H. marismortui,71of aligned between most sequences are included in the analysis. the 83 (85.5%) substitutions in the primary structure are con- Hypervariable regions are usually “masked out.” To assess the served at the 2° structure level. The type of diversity in B. afzelii effect of masking hypervariable regions on intragenomic vari- represents a novel example to support the theory of ribosomal ation of 16S rRNA genes, we aligned 16S rRNA genes from constraint. The two 16S rRNA genes in this species differ by the 24 highly diversified species (Table 2) and masked the 20.38% in the 1° structure. Unlike the pattern seen in T. teng- aligned sequences with Lane mask (12, 27). 
Intragenomic dif- congensis and H. marismortui, substitutions in B. afzelii oc- ferences were recalculated on the masked sequences, and the curred in a nonselective fashion among both the conserved and results are shown in Table 2. The effect of masking hypervari- variable positions, resulting in drastic alteration of the 2° struc- able positions was remarkable for the 14 of the 24 species with ture (6.88%) and a large number of indels (n ϭ 95). This level intragenomic diversity of 16S rRNA genes greater than 1%. of random diversity in B. afzelii is not violating the concept of The masking reduced the diversity from between 1.06% and ribosomal constraint; rather, it may have occurred due to the 2.07% to Ͻ0.66% (Table 2). This level of diversity will not have escape from constraint in a nonfunctional 16S rRNA pseudo- a significant impact on phylogenetic inference. However, the gene. Similarly, the truncated 16S rRNA gene (16S rrnB) with variation after masking remained high for H. marismortui high diversity at the 2° structural level in “Candidatus Pro- (4.86%) and T. tengcongensis (5.01%), “Candidatus Pro- tochlamydia amoebophila” is probably a nonfunctional gene.

highlighted in red and indels in green. For folding using the multiple-sequence-alignment approach, nucleotides making up noncanonical base pairs are highlighted in brown (35, 36). Segments of rrn16S of S. woodyi correspond to positions 965 through 1065 of rrnC16S and rrnD16S. The segments of rrnC16S (a and aЈ) and rrnD16S (b and bЈ) differ by 13 positions, all substitutions. Segments of rrn16S of C. cellulolyticum correspond to positions 31 through 256 of rrnA16S and rrnE16S. The segments of rrnA16S (c and cЈ) and rrnE16S (d and dЈ) differ by 28 positions, including 10 indels and 18 substitutions. Segments of rrn16S of P. profundum correspond to positions 129 through 254 of rrnC16S and rrnE16S. The segments of rrnC16S (e and eЈ) and rrnE16S (f and fЈ) differ by 12 positions, including 1 indel and 11 substitutions. 3894 PEI ET AL. APPL.ENVIRON.MICROBIOL.
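The masking comparison described in the Results (diversity recalculated after hypervariable columns are removed, then compared with the 1 to 1.3% species threshold) can be illustrated with a minimal sketch. This is not the authors' pipeline: the two toy sequences, the mask string, and the choice to count a base paired with a gap as a difference are all assumptions made for illustration, and the real Lane mask is not reproduced here.

```python
# Minimal sketch (assumed conventions, not the authors' pipeline):
# percent difference between two ALIGNED 16S rRNA gene copies,
# optionally restricted to the columns kept by a mask.

SPECIES_THRESHOLD_PERCENT = 1.3  # upper end of the 1 to 1.3% range cited in the text


def percent_difference(seq_a, seq_b, mask=None):
    """Return the percentage of compared columns at which two aligned sequences differ.

    mask, if given, has one character per alignment column:
    '1' keeps the column, anything else masks it out.
    A base paired with a gap ('-') counts as a difference (assumption);
    columns where both sequences have a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if mask is not None and len(mask) != len(seq_a):
        raise ValueError("mask must have one character per alignment column")

    compared = 0
    differing = 0
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if mask is not None and mask[i] != "1":
            continue  # hypervariable column, masked out
        if a == "-" and b == "-":
            continue  # shared gap carries no information
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no columns left to compare")
    return 100.0 * differing / compared


# Hypothetical toy alignment: the copies differ at columns 4 and 7 (0-based).
copy_1 = "ACGUACGUAC"
copy_2 = "ACGUUCGAAC"
hypothetical_mask = "1111011011"  # masks exactly the two differing columns

full = percent_difference(copy_1, copy_2)
masked = percent_difference(copy_1, copy_2, hypothetical_mask)
print(f"full-length: {full:.2f}%  exceeds threshold: {full > SPECIES_THRESHOLD_PERCENT}")
print(f"masked:      {masked:.2f}%  exceeds threshold: {masked > SPECIES_THRESHOLD_PERCENT}")
```

In this toy case all differences sit in masked columns, so the masked diversity drops to zero; for species such as H. marismortui the text reports that substantial diversity remains after masking.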

FIG. 4. Secondary structures of 16S rRNA genes from Borrelia afzelii. (a) The distribution of substitutions is shown on the secondary structure predicted for rrnB16S, based on free energy minimization (31) with the consensus 16S rRNA gene model as the reference (53). Positions that differ between rrnB16S and rrnA16S are shown in colored letters. Conservative changes located in loops or compensatory changes due to covariation in stems are shown in blue; changes that result in alteration of secondary structures are shown in red. Insertions/deletions are shown in brown. Substitutions are coded as follows: K = G or U, M = A or C, R = A or G, S = C or G, W = A or U, and Y = C or U. (b) The distribution of substitutions is also shown on the variability map based on the 2° structure models for Escherichia coli (53). Substitutions between rrnB16S and rrnA16S are highlighted in colors according to the position-specific relative variability rate calculated from the consensus rrn16S model based on an alignment of 3,407 bacterial 16S rRNA genes (53). A position with a relative substitution rate of v > 1.18 (red) implies that it has a substitution rate higher than the average substitution rate of all of its sites in the rRNA gene analyzed, while v < 1.18 (blue) indicates that the rate is lower than the average. Uncommon sites are positions that are occupied in <25% of organisms because of insertions, which are shown by black dots. The expected variability was calculated from the consensus models (c).
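The comparisons of observed and expected proportions of substitutions (for example, 73 of 320 substitutions at highly variable positions against an expected 29.9% in B. afzelii) can be checked with a one-sample test of a proportion. The text does not state which test the authors used, so the normal-approximation z-test below is an assumption chosen for illustration; the function name is hypothetical.

```python
import math


def proportion_z_test(observed, total, expected_fraction):
    """One-sample z-test of a proportion (normal approximation).

    Returns (z, two_sided_p). This is an assumed test; the source does not
    name the method behind its reported P values.
    """
    standard_error = math.sqrt(expected_fraction * (1.0 - expected_fraction) / total)
    z = (observed / total - expected_fraction) / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Counts and expected fractions taken from the text.
z_var, p_var = proportion_z_test(73, 320, 0.299)      # B. afzelii, highly variable positions
z_cons, p_cons = proportion_z_test(207, 320, 0.683)   # B. afzelii, conserved positions
z_tt, p_tt = proportion_z_test(74, 103, 0.320)        # T. tengcongensis, highly variable positions

print(f"B. afzelii variable:  z = {z_var:.2f}, p = {p_var:.4f}")
print(f"B. afzelii conserved: z = {z_cons:.2f}, p = {p_cons:.4f}")
print(f"T. tengcongensis:     z = {z_tt:.2f}, p = {p_tt:.2e}")
```

Under this assumption the results point the same way as the reported values: fewer substitutions than expected at variable positions in B. afzelii, no significant departure at conserved positions, and a strong excess at variable positions in T. tengcongensis.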

For the four species with exceptionally high regional diversity (Table 3), the classical thermodynamic folding failed to show a significant effect of ribosomal constraint at the 2° structural level, but consensus 2° structure can be obtained for the pair of diversified regions if noncanonical base pairs are allowed (35, 36), providing another possibility for conservation of 2° structure among highly diversified regions. Thus, this largest survey of intragenomic diversity indicates that all prokaryotic 16S rRNA genes conform to the theory of ribosomal constraints.

Gene truncation is another form of intragenomic variation of 16S rRNA genes. It ranges from complete absence of the 16S rRNA gene in an rRNA operon to partial truncation of the 16S rRNA gene. A complete absence of a 16S rRNA gene is a common phenomenon involving rRNA operons in 95 of the 568 species (see Table S2 in the supplemental material). The absence of the 16S rRNA gene in an rRNA operon is associated with an increase in intragenomic diversity in 6 species (Table 2).

Intervening sequences (IVS) have been observed in rRNA genes previously (4, 37). IVS in 16S rRNA genes were found in 4 species in this study. IVS appear to correlate with a high level of intragenomic variation among 16S rRNA genes. In one species, T. tengcongensis, the diversity (6.70%) in the primary structure far exceeds 1 to 1.3% (the species boundary), while in the other three species, P. profundum, D. hafniense, and S. wolfei, IVS are associated with above-average diversity, at 2.09%, 1.8%, and 1.67%, respectively. The variations appear to be due to sequence diversity at the region 5′ upstream from the IVS, likely associated with the insertion event of the IVS. Compared with IVS found in 23S rRNA genes (37), those in 16S rRNA genes are smaller and do not contain open reading frames. The small IVS found in this study often can be folded to an rRNA-like secondary structure.

It is well known that there is wide variation of copy numbers of 16S rRNA gene among various species (28, 42). Currently, it is common practice to describe the composition of a microbial community using 16S gene composition rather than cell composition. It would be desirable to convert 16S gene composition to cell composition, but for a large number of organisms in a complex microbiome, this conversion is not possible because of the lack of knowledge about the copy numbers of 16S rRNA gene in their genomes. Here is a hypothetical example to illustrate the difference between the 16S gene composition and cell composition, in which a microbial community contains 100 bacterial cells: 90 cells from Borrelia turicatae and 10 cells from Brevibacillus brevis. Because there is one 16S rRNA gene per cell for B. turicatae and 15 16S rRNA genes per cell for Brevibacillus brevis, this community contains 240 16S rRNA genes, 90 from B. turicatae and 150 from Brevibacillus brevis. Consequently, this community is dominated by cells from B. turicatae (90/100) and by 16S rRNA genes from Brevibacillus brevis (150/240). Thus, 16S gene composition is an acceptable way to describe a microbial community with the understanding of the difference between the 16S gene composition and cell composition.

In 3 of the 24 diversified species, H. marismortui, T. tengcongensis, and B. afzelii, the intragenomic diversity between 16S rRNA genes correlated with differences in GC content. H. marismortui is a halophilic red archaeon found in the Dead Sea, a high-saline environment (16). T. tengcongensis, isolated from a hot spring in Tengchong, China, grows optimally at 80°C (4). It is known that these organisms thrive in such environments due to adaptations in protein structure, metabolic strategies, and physiologic responses. Data from this work support the recent observation by López-López et al. that the spectrum of adaptation may extend to diversification in rRNA genes (29). The diversified copies of 16S rRNA genes have a higher GC content than their regular counterpart (58.4% versus 56.7% for H. marismortui and 60.3% versus 59.2% for T. tengcongensis). In H. marismortui, the rRNA operon containing the high-GC 16S rRNA gene appears to have a different promoter and is preferentially expressed at higher temperature (29). Thus, diversification in rRNA genes could be selected to meet survival needs under different environmental conditions. B. afzelii normally resides in the digestive tract of ticks or fleas and is transmitted to humans via the vector’s saliva following a bite (19). It is one of the four major Borrelia species causing Lyme disease. Of the two 16S rRNA genes in B. afzelii, the pseudogene has a much lower GC content (38.1%) than the functional copy (46.5%). It appears the random mutations in the pseudogene have been bringing its GC content toward the baseline for the whole genome (28%).

16S rRNA has been a widely used marker for taxonomic classification of prokaryotic organisms. By the operational definition, two 16S rRNA genes differing by 1 to 1.3% or more represent two different species (44). Using such thresholds would artificially classify the 24 species with intragenomic diversity of 16S rRNA genes of >1% into more than 24 species. In a population-based microbiome survey, the presence of such intragenomic sequence divergence would result in overestimation of the species diversity of the microbiome and possibly skew the community structure, if these species are prevalent in the microbiome. Besides confirming the previous observation of the presence of highly diversified 16S rRNA in single prokaryotic genomes (2), our study provides an estimate of the extent of this phenomenon. It involves both Bacteria and Archaea, ranging from extremophiles to human pathogens. The overall probability of encountering a species with significant intragenomic diversity among its 16S rRNA genes is about 4.2% (24 in 568 unique species) or about 5.6% (24 in 425 unique species with multiple 16S rRNA genes). This phenomenon will have an impact on analysis of the human microbiome because 7 species found in the human microbiome or diseases exhibit intragenomic 16S rRNA diversity greater than the threshold for defining species.

The definition for a prokaryotic species is polyphasic, in that it requires a distinct set of biological characteristics and corresponding DNA reassociation values of >70%. However, there is not a simple universal definition. 16S rRNA genes have been used as a surrogate marker for operationally defining species. Initially, a >3% difference between 16S rRNA genes from two organisms was required to claim the two organisms belong to different species (43). Later, the threshold was lowered to 1 to 1.3% (44). This operational definition is helpful in taxonomic classification using 16S rRNA genes, especially for studies of complex microbiomes using cultivation-independent techniques in which biological characteristics and DNA reassociation values cannot be determined for individual bacterial cells/species. Nevertheless, it is critical to understand the limitations of the 16S rRNA-based operational definition for species. The main limitation is that 16S rRNA genes evolve at different rates, but the operational species threshold (1 to 1.3%) is relatively rigid. As a consequence, closely related species that evolve slowly will be grouped as a single species by the operational definition, such as Streptococcus pseudopneumoniae and Streptococcus pneumoniae (3), which differ by only 5 bp between their 16S rRNA genes (a difference of about 0.3% over a gene of roughly 1,500 bp). Another limitation is that 16S does not represent the entire genomic content that determines the biological characteristics for a species. Significant differences in genome composition may be present in bacterial species that are completely identical or that differ only slightly in 16S rRNA genes. For example, isolates of Vibrio splendidus exhibit up to 25% genotypic difference (45), and strains of E. coli may differ up to 40% in the number of genes in their genomes (24, 38). The three members in the Bacillus cereus group (B. cereus, Bacillus anthracis, and Bacillus thuringiensis) can be classified as a single species by their nearly identical 16S rRNA genes, but they differ greatly by the number and type of genes they harbor due to the presence of large plasmids (41). In both E. coli and the B. cereus group, these differences confer various biological capabilities and pathogenicities. Intragenomic variation of 16S rRNA genes is another limit that can be encountered when classifying species that harbor 16S rRNA genes with diversity greater than the threshold set by the operational definition (Table 2), which will lead to overestimation of species diversity in a microbiome. Thus, it can be expected that community structures determined using the 16S rRNA-based operational species definition approximate but do not necessarily reflect the true community structures.
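The hypothetical community used in the Discussion (90 cells of Borrelia turicatae with one 16S rRNA gene each, 10 cells of Brevibacillus brevis with 15 each) can be worked through in a few lines. The sketch below only restates that arithmetic; the function names are hypothetical, and the reverse conversion assumes the copy number per genome is known, which the text notes is often not the case.

```python
# Worked version of the hypothetical community from the Discussion.
# Function names are illustrative, not taken from the source.

def gene_counts_from_cells(cell_counts, copies_per_cell):
    """16S rRNA gene counts implied by cell counts and per-genome copy numbers."""
    return {species: cell_counts[species] * copies_per_cell[species]
            for species in cell_counts}


def cell_counts_from_genes(gene_counts, copies_per_cell):
    """Reverse conversion; only possible when copy numbers are known."""
    return {species: gene_counts[species] / copies_per_cell[species]
            for species in gene_counts}


def fractions(counts):
    """Relative abundance of each species within a set of counts."""
    total = sum(counts.values())
    return {species: n / total for species, n in counts.items()}


cells = {"Borrelia turicatae": 90, "Brevibacillus brevis": 10}
copies = {"Borrelia turicatae": 1, "Brevibacillus brevis": 15}

genes = gene_counts_from_cells(cells, copies)
print("16S genes:", genes, "total:", sum(genes.values()))
print("cell composition:", fractions(cells))
print("gene composition:", fractions(genes))
print("cells recovered from genes:", cell_counts_from_genes(genes, copies))
```

The same community is 90% B. turicatae by cells but 62.5% Brevibacillus brevis by 16S rRNA genes, which is the gap between the two descriptions that the paragraph points out.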

ACKNOWLEDGMENTS

This study was supported in part by grant UH2CA140233 from the Human Microbiome Project of the NIH Roadmap Initiative and National Cancer Institute and by grant R01AI063477 from the National Institute of Allergy and Infectious Diseases. Part of this work was carried out under the auspices of the University of California, at Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231.

We thank Meisheng Zhou and David Rosmarin for assistance in verification of sequence diversity and Lauren Valentor for editing the manuscript.

REFERENCES

1. Abdulkarim, F., and D. Hughes. 1996. Homologous recombination between the tuf genes of Salmonella typhimurium. J. Mol. Biol. 260:506–522.
2. Acinas, S. G., L. A. Marcelino, V. Klepac-Ceraj, and M. F. Polz. 2004. Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. J. Bacteriol. 186:2629–2635.
3. Arbique, J. C., C. Poyart, P. Trieu-Cuot, G. Quesne, M. da Glória, S. Carvalho, A. G. Steigerwalt, R. E. Morey, D. Jackson, R. J. Davidson, and R. R. Facklam. 2004. Accuracy of phenotypic and genotypic testing for identification of Streptococcus pneumoniae and a description of Streptococcus pseudopneumoniae sp. nov. J. Clin. Microbiol. 42:4686–4696.
4. Bao, Q., Y. Tian, W. Li, Z. Xu, Z. Xuan, S. Hu, W. Dong, J. Yang, Y. Chen, Y. Xue, Y. Xu, X. Lai, L. Huang, X. Dong, Y. Ma, L. Ling, H. Tan, R. Chen, J. Wang, J. Yu, and H. Yang. 2002. A complete sequence of the T. tengcongensis genome. Genome Res. 12:689–700.
5. Bindewald, E., and B. A. Shapiro. 2006. RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers. RNA 12:342–352.
6. Carranza, S., G. Giribet, C. Ribera, J. Baguna, and M. Riutort. 1996. Evidence that two types of 18S rDNA coexist in the genome of Dugesia (Schmidtea) mediterranea (Platyhelminthes, Turbellaria, Tricladida). Mol. Biol. Evol. 13:824–832.
7. Chakravorty, S., D. Helb, M. Burday, N. Connell, and D. Alland. 2007. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J. Microbiol. Methods 69:330–339.
8. Cilia, V., B. Lafay, and R. Christen. 1996. Sequence heterogeneities among 16S ribosomal RNA sequences, and their effect on phylogenetic analyses at the species level. Mol. Biol. Evol. 13:451–461.
9. Clayton, R. A., G. Sutton, P. S. Hinkle, Jr., C. Bult, and C. Fields. 1995. Intraspecific variation in small subunit rRNA sequences in GenBank: why single sequences may not adequately represent prokaryotic taxa. Int. J. Syst. Bacteriol. 45:595–599.
10. Coenye, T., and P. Vandamme. 2003. Intragenomic heterogeneity between multiple 16S ribosomal RNA operons in sequenced bacterial genomes. FEMS Microbiol. Lett. 228:45–49.
11. De Rijk, P., and R. De Wachter. 1997. RnaViz, a program for the visualization of RNA secondary structure. Nucleic Acids Res. 25:4679–4684.
12. DeSantis, T. Z., Jr., P. Hugenholtz, K. Keller, E. L. Brodie, N. Larsen, Y. M. Piceno, R. Phan, and G. L. Andersen. 2006. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 34:W394–W399.
13. Doolittle, W. F. 1999. Phylogenetic classification and the universal tree. Science 284:2124–2129.
14. Eigen, M., B. Lindemann, R. Winkler-Oswatitsch, and C. H. Clarke. 1985. Pattern analysis of 5S rRNA. Proc. Natl. Acad. Sci. U. S. A. 82:2437–2441.
15. Fritsche, T. R., R. K. Gautom, S. Seyedirashti, D. L. Bergeron, and T. D. Lindquist. 1993. Occurrence of bacterial endosymbionts in Acanthamoeba spp. isolated from corneal and environmental specimens and contact lenses. J. Clin. Microbiol. 31:1122–1126.
16. Goo, Y. A., J. Roach, G. Glusman, N. S. Baliga, K. Deutsch, M. Pan, S. Kennedy, S. DasSarma, W. V. Ng, and L. Hood. 2004. Low-pass sequencing for microbial comparative genomics. BMC Genomics 5:3.
17. Gunderson, J. H., M. L. Sogin, G. Wollett, M. de la Cruz Hollingdale, V. F. de la Cruz, A. P. Waters, and T. F. McCutchan. 1987. Structurally distinct, stage-specific ribosomes occur in Plasmodium. Science 238:933–937.
18. Hashimoto, J. G., B. S. Stevenson, and T. M. Schmidt. 2003. Rates and consequences of recombination between rRNA operons. J. Bacteriol. 185:966–972.
19. Hengge, U. R., A. Tannapfel, S. K. Tyring, R. Erbel, G. Arendt, and T. Ruzicka. 2003. Lyme borreliosis. Lancet Infect. Dis. 3:489–500.
20. Hong, H. A., R. Khaneja, N. M. Tam, A. Cazzato, S. Tan, M. Urdaci, A. Brisson, A. Gasbarrini, I. Barnes, and S. M. Cutting. 2009. Bacillus subtilis isolated from the human gastrointestinal tract. Res. Microbiol. 160:134–143.
21. Hugenholtz, P., G. W. Tyson, R. I. Webb, A. M. Wagner, and L. L. Blackall. 2001. Investigation of candidate division TM7, a recently recognized major lineage of the domain Bacteria with no known pure-culture representatives. Appl. Environ. Microbiol. 67:411–419.
22. Johnson, T. J., and L. K. Nolan. 2009. Pathogenomics of the virulence plasmids of Escherichia coli. Microbiol. Mol. Biol. Rev. 73:750–774.
23. Kiryu, H., Y. Tabei, T. Kin, and K. Asai. 2007. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics 23:1588–1598.
24. Kudva, I. T., P. S. Evans, N. T. Perna, T. J. Barrett, F. M. Ausubel, F. R. Blattner, and S. B. Calderwood. 2002. Strains of Escherichia coli O157:H7 differ primarily by insertions or deletions, not single-nucleotide polymorphisms. J. Bacteriol. 184:1873–1879.
25. Kunin, V., A. Engelbrektson, H. Ochman, and P. Hugenholtz. 2010. Wrinkles in the rare biosphere: pyrosequencing errors lead to artificial inflation of diversity estimates. Environ. Microbiol. 12:118–123.
26. Küntzel, H., M. Heidrich, and B. Piechulla. 1981. Phylogenetic tree derived from bacterial, cytosol and organelle 5S rRNA sequences. Nucleic Acids Res. 9:1451–1461.
27. Lane, D. J. 1991. 16S/23S rRNA sequencing, p. 115–174. In E. Stackebrandt and M. Goodfellow (ed.), Nucleic acid techniques in bacterial systematics. John Wiley & Sons, Chichester, United Kingdom.
28. Lee, Z. M. P., C. Bussema III, and T. M. Schmidt. 2009. rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Res. 37:D489–D493.
29. López-López, A., S. Benlloch, M. Bonfá, F. Rodríguez-Valera, and A. Mira. 2007. Intragenomic 16S rDNA divergence in Haloarcula marismortui is an adaptation to different temperatures. J. Mol. Evol. 65:687–696.
30. Mashkova, T. D., T. I. Serenkova, A. M. Mazo, T. A. Avdonina, M. Y. Timofeyeva, and L. L. Kisselev. 1981. The primary structure of oocyte and somatic 5S rRNAs from the loach Misgurnus fossilis. Nucleic Acids Res. 9:2141–2151.
31. Mathews, D. H., M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner. 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl. Acad. Sci. U. S. A. 101:7287–7292.
32. Mevarech, M., S. Hiesch-Twizer, S. Goldman, E. Yakobson, H. Eisenberg, and P. P. Dennis. 1989. Isolation and characterization of the rRNA gene clusters of Halobacterium marismortui. J. Bacteriol. 171:3479–3485.
33. Mylvaganam, S., and P. P. Dennis. 1992. Sequence heterogeneity between the two genes encoding 16S rRNA from the halophilic archaebacterium Haloarcula marismortui. Genetics 130:399–410.
34. NIH HMP Working Group, J. Peterson, S. Garges, M. Giovanni, P. McInnes, L. Wang, J. A. Schloss, V. Bonazzi, J. E. McEwen, K. A. Wetterstrand, C. Deal, C. C. Baker, V. Di Francesco, T. K. Howcroft, R. W. Karp, R. D. Lunsford, C. R. Wellington, T. Belachew, M. Wright, C. Giblin, H. David, M. Mills, R. Salomon, C. Mullins, B. Akolkar, L. Begg, C. Davis, L. Grandison, M. Humble, J. Khalsa, A. R. Little, H. Peavy, C. Pontzer, M. Portnoy, M. H. Sayre, P. Starke-Reed, S. Zakhari, J. Read, B. Watson, and M. Guyer. 2009. The NIH Human Microbiome Project. Genome Res. 19:2317–2323.
35. Ninio, J. 1979. Prediction of pairing schemes in RNA molecules—loop contributions and energy of wobble and non-wobble pairs. Biochimie 61:1133–1150.
36. Olson, W. K., M. Esguerra, Y. Xin, and X. J. Lu. 2009. New information content in RNA base pairing deduced from quantitative analysis of high-resolution structures. Methods 47:177–186.
37. Pei, A., C. W. Nossa, P. Chokshi, M. J. Blaser, L. Yang, D. M. Rosmarin, and Z. Pei. 2009. Diversity of 23S rRNA genes within individual prokaryotic genomes. PLoS One 4:e5437.
38. Perna, N. T., G. Plunkett III, V. Burland, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew, P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Klink, A. Boutin, Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D. Potamousis, J. Apodaca, T. S. Anantharaman, J. Lin, G. Yen, D. C. Schwartz, R. A. Welch, and F. R. Blattner. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529–533.
39. Poirel, L., J. M. Rodríguez-Martínez, N. Al Naiemi, Y. J. Debets-Ossenkopp, and P. Nordmann. 22 March 2010, posting date. Characterization of DIM-1, an integron-encoded metallo-β-lactamase from a Pseudomonas stutzeri clinical isolate in The Netherlands. Antimicrob. Agents Chemother. [Epub ahead of print.]
40. Quince, C., A. Lanzén, T. P. Curtis, R. J. Davenport, N. Hall, I. M. Head, L. F. Read, and W. T. Sloan. 2009. Accurate determination of microbial diversity from 454 pyrosequencing data. Nat. Methods 6:639–641.
41. Rasko, D. A., M. R. Altherr, C. S. Han, and J. Ravel. 2005. Genomics of the Bacillus cereus group of organisms. FEMS Microbiol. Rev. 29:303–329.
42. Rastogi, R., M. Wu, I. Dasgupta, and G. E. Fox. 2009. Visualization of ribosomal RNA operon copy number distribution. BMC Microbiol. 9:208.
43. Stackebrandt, E., and B. M. Goebel. 1994. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int. J. Syst. Bacteriol. 44:846–849.
44. Stackebrandt, E., and J. Ebers. 2006. Taxonomic parameters revisited: tarnished gold standards. Microbiol. Today 2006:153–155.
45. Thompson, J. R., S. Pacocha, C. Pharino, V. Klepac-Ceraj, D. E. Hunt, J. Benoit, R. Sarma-Rupavtarm, D. L. Distel, and M. F. Polz. 2005. Genotypic diversity within a natural coastal bacterioplankton population. Science 307:1311–1313.
46. Turroni, F., J. R. Marchesi, E. Foroni, M. Gueimonde, F. Shanahan, A. Margolles, D. van Sinderen, and M. Ventura. 2009. Microbiomic analysis of the bifidobacterial population in the human distal gut. ISME J. 3:745–751.
47. Wang, Y., Z. Zhang, and N. Ramanan. 1997. The actinomycete Thermobispora bispora contains two distinct types of transcriptionally active 16S rRNA genes. J. Bacteriol. 179:3270–3276.
48. Wegnez, M. R., R. Monier, and H. Denis. 1972. Sequence heterogeneity of 5S rRNA in Xenopus laevis. FEBS Lett. 25:13–20.
49. Wiese, K. C., E. Glen, and A. Vasudevan. 2005. JViz.Rna—a Java tool for RNA secondary structure visualization. IEEE Trans. Nanobioscience 4:212–218.
50. Woese, C. R. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.
51. Woese, C. R. 1998. The universal ancestor. Proc. Natl. Acad. Sci. U. S. A. 95:6854–6859.
52. Woese, C. R., O. Kandler, and M. L. Wheelis. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Natl. Acad. Sci. U. S. A. 87:4576–4579.
53. Wuyts, J., Y. Van de Peer, and R. De Wachter. 2001. Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. Nucleic Acids Res. 29:5017–5028.
54. Yap, W. H., Z. Zhang, and Y. Wang. 1999. Distinct types of rRNA operons exist in the genome of the actinomycete Thermomonospora chromogena and evidence for horizontal transfer of an entire rRNA operon. J. Bacteriol. 181:5201–5209.
55. Zocco, M. A., M. E. Ainora, G. Gasbarrini, and A. Gasbarrini. 2007. Bacteroides thetaiotaomicron in the gut: molecular aspects of their interaction. Dig. Liver Dis. 39:707–712.
56. Zwieb, C., C. Glotz, and R. Brimacombe. 1981. Secondary structure comparisons between small subunit ribosomal RNA molecules from six different species. Nucleic Acids Res. 9:3621–3640.

Supplementary Table 1. 568 prokaryotic species analyzed in this study

Bacteria : Acidobacteria (n=3)
Species                                         Accession number  Number of 16S rRNA genes  Diversity %
Acidobacteria bacterium                         NC_008009.1       1                         0
Acidobacterium capsulatum                       NC_012483.1       1                         0
Solibacter usitatus                             NC_008536.1       2                         0
Average                                                           1.33                      0.00

Bacteria : Actinobacteria (n=47)
Species                                         Accession number  Number of 16S rRNA genes  Diversity %
Acidothermus cellulolyticus                     NC_008578.1       1                         0
Arthrobacter aurescens                          NC_008711.1       6                         0
Arthrobacter chlorophenolicus                   NC_011886.1       5                         0
Arthrobacter sp. FB24                           NC_008541.1       5                         0
Beutenbergia cavernae                           NC_012669         2                         0
Bifidobacterium adolescentis                    NC_008618.1       5                         1.3
Bifidobacterium animalis subsp. lactis          NC_011835.1       2                         0.19
Bifidobacterium longum                          NC_010816.1       4                         0.07
Clavibacter michiganensis subsp. michiganensis  NC_009480.1       2                         0
Corynebacterium aurimucosum                     NC_012590.0       4                         0
Corynebacterium diphtheriae                     NC_002935.2       5                         0.07
Corynebacterium efficiens                       NC_004369.1       5                         0.43
Corynebacterium glutamicum                      NC_006958.1       6                         0.26
Corynebacterium jeikeium                        NC_007164.1       3                         0
Corynebacterium urealyticum                     NC_010545.1       3                         0.13
Frankia alni                                    NC_008278.1       2                         0
Frankia sp. CcI3                                NC_007777.1       2                         0
Kineococcus radiotolerans                       NC_009664.1       4                         0
Kocuria rhizophila                              NC_010617.1       3                         0
Leifsonia xyli subsp. xyli                      NC_006087.1       1                         0
Mycobacterium abscessus                         NC_010397.1       1                         0
Mycobacterium avium                             NC_008595.1       1                         0
Mycobacterium bovis                             NC_002945.3       1                         0
Mycobacterium gilvum                            NC_009338.1       2                         0
Mycobacterium leprae                            NC_011896.1       1                         0
Mycobacterium marinum                           NC_010612.1       1                         0
Mycobacterium smegmatis                         NC_008596.1       2                         0
Mycobacterium sp. JLS                           NC_009077.1       2                         0.13
Mycobacterium tuberculosis                      NC_002755.2       1                         0
Mycobacterium ulcerans                          NC_008611.1       1                         0
Mycobacterium vanbaalenii                       NC_008726.1       2                         0.13
Nocardia farcinica                              NC_006361.1       3                         0.13
Nocardioides sp. JS614                          NC_008699.1       2                         0
Propionibacterium acnes                         NC_006085.1       3                         0.07
Renibacterium salmoninarum                      NC_010168.1       2                         0
Rhodococcus erythropolis                        NC_012490.1       5                         0
Rhodococcus jostii                              NC_008268.1       4                         0
Rhodococcus opacus                              NC_012522.1       4                         0.07
Rubrobacter xylanophilus                        NC_008148.1       1                         0
Saccharopolyspora erythraea                     NC_009142.1       4                         0.13
Salinispora arenicola                           NC_009953.1       3                         0
Salinispora tropica                             NC_009380.1       3                         0
Streptomyces avermitilis                        NC_003155.4       6                         0
Streptomyces coelicolor                         NC_003888.3       6                         0.2
Streptomyces griseus subsp. griseus             NC_010572.1       6                         0
Thermobifida fusca                              NC_007333.1       4                         0.13
Tropheryma whipplei                             NC_004572.3       1                         0
Average                                                           3.00                      0.07

Bacteria: Aquificae (n=5)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Aquifex aeolicus | NC_000918.1 | 2 | 0
Hydrogenobaculum sp. Y04AAS1 | NC_011126.1 | 2 | 0
Persephonella marina | NC_012440.1 | 2 | 0
Sulfurihydrogenibium azorense | NC_012438.1 | 2 | 0.2
Sulfurihydrogenibium sp. YO3AOP1 | NC_010730.1 | 3 | 0
Average | | 2.20 | 0.04

Bacteria: Bacteroidetes (n=13)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Bacteroides fragilis | NC_003228.3 | 6 | 0
Bacteroides thetaiotaomicron | NC_004663.1 | 5 | 1.3
Bacteroides vulgatus | NC_009614.1 | 7 | 0.72
Candidatus Amoebophilus asiaticus | NC_010830.1 | 1 | 0
Candidatus Azobacteroides pseudotrichonymphae | NC_011565.1 | 2 | 0
Candidatus Sulcia muelleri | NC_010118.1 | 1 | 0
Cytophaga hutchinsonii | NC_008255.1 | 3 | 0
Flavobacterium johnsoniae | NC_009441.1 | 6 | 0.26
Flavobacterium psychrophilum | NC_009613.1 | 6 | 0.13
Gramella forsetii | NC_008571.1 | 3 | 0.07
Parabacteroides distasonis | NC_009615.1 | 7 | 0.94
Porphyromonas gingivalis | NC_010729.1 | 4 | 0
Salinibacter ruber | NC_007677.1 | 1 | 0
Average | | 4.00 | 0.26

Bacteria: Chlamydiae (n=7)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Candidatus Protochlamydia amoebophila | NC_005861.1 | 3 | 1.34
Chlamydia muridarum Nigg | NC_002620.2 | 2 | 0
Chlamydia trachomatis | NC_010287.1 | 2 | 0
Chlamydophila abortus | NC_004552.2 | 1 | 0
Chlamydophila caviae | NC_003361.3 | 1 | 0
Chlamydophila felis | NC_007899.1 | 1 | 0
Chlamydophila pneumoniae | NC_002179.2 | 1 | 0
Average | | 1.57 | 0.19

Bacteria: Chlorobi (n=9)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Chlorobaculum parvum | NC_011027.1 | 2 | 0
Chlorobium chlorochromatii | NC_007514.1 | 1 | 0
Chlorobium limicola | NC_010803.1 | 2 | 0
Chlorobium phaeobacteroides | NC_010831.1 | 2 | 0
Chlorobium tepidum | NC_002932.3 | 2 | 0
Chloroherpeton thalassium | NC_011026.1 | 1 | 0
Pelodictyon luteolum | NC_007512.1 | 2 | 0.14
Pelodictyon phaeoclathratiforme | NC_011060.1 | 3 | 0.07
Prosthecochloris aestuarii | NC_011059.1 | 1 | 0
Average | | 1.78 | 0.02

Bacteria: Chloroflexi (n=9)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Chloroflexus aggregans | NC_011831.1 | 3 | 0.75
Chloroflexus aurantiacus | NC_010175.1 | 3 | 1.08
Chloroflexus sp. Y-400-fl | NC_012032.1 | 3 | 1.08
Dehalococcoides ethenogenes | NC_002936.3 | 1 | 0
Dehalococcoides sp. BAV1 | NC_009455.1 | 1 | 0
Herpetosiphon aurantiacus | NC_009972.1 | 5 | 0.14
Roseiflexus castenholzii | NC_009767.1 | 2 | 0.14
Roseiflexus sp. RS-1 | NC_009523.1 | 2 | 0.2
Thermomicrobium roseum | NC_011959.1 | 1 | 0
Average | | 2.33 | 0.38

Archaea: Crenarchaeota (n=18)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Aeropyrum pernix | NC_000854.2 | 1 | 0
Caldivirga maquilingensis | NC_009954.1 | 1 | 0
Desulfurococcus kamchatkensis | NC_011766.1 | 1 | 0
Hyperthermus butylicus | NC_008818.1 | 1 | 0
Ignicoccus hospitalis | NC_009776.1 | 1 | 0
Metallosphaera sedula | NC_009440.1 | 1 | 0
Nitrosopumilus maritimus | NC_010085.1 | 1 | 0
Pyrobaculum aerophilum | NC_003364.1 | 1 | 0
Pyrobaculum arsenaticum | NC_009376.1 | 1 | 0
Pyrobaculum calidifontis | NC_009073.1 | 1 | 0
Pyrobaculum islandicum | NC_008701.1 | 1 | 0
Staphylothermus marinus | NC_009033.1 | 1 | 0
Sulfolobus acidocaldarius | NC_007181.1 | 1 | 0
Sulfolobus islandicus | NC_012589.0 | 1 | 0
Sulfolobus solfataricus | NC_002754.1 | 1 | 0
Sulfolobus tokodaii | NC_003106.2 | 1 | 0
Thermofilum pendens | NC_008698.1 | 1 | 0
Thermoproteus neutrophilus | NC_010525.1 | 1 | 0
Average | | 1.00 | 0.00

Bacteria: Cyanobacteria (n=13)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Acaryochloris marina | NC_009925.1 | 2 | 0
Anabaena variabilis | NC_007413.1 | 4 | 0
Cyanothece sp. ATCC 51142 | NC_010546.1 | 2 | 0.07
Gloeobacter violaceus | NC_005125.1 | 1 | 0
Microcystis aeruginosa | NC_010296.1 | 2 | 0.27
Nostoc punctiforme | NC_010628.1 | 4 | 0.13
Nostoc sp. PCC 7120 | NC_003272.1 | 4 | 0.07
Prochlorococcus marinus | NC_008816.1 | 2 | 0
Synechococcus elongatus | NC_006576.1 | 2 | 0
Synechococcus sp. CC9311 | NC_008319.1 | 2 | 0.14
Synechocystis sp. PCC 6803 | NC_000911.1 | 2 | 0
Thermosynechococcus elongatus | NC_004113.1 | 1 | 0
Trichodesmium erythraeum | NC_008312.1 | 2 | 0
Average | | 2.31 | 0.05

Bacteria: Deinococcus-Thermus (n=4)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Deinococcus deserti | NC_012526.1 | 3 | 0
Deinococcus geothermalis | NC_008025.1 | 3 | 1.06
Deinococcus radiodurans | NC_001263.1 | 3 | 0.13
Thermus thermophilus | NC_005835.1 | 2 | 0
Average | | 2.75 | 0.30

Bacteria: Dictyoglomi (n=2)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Dictyoglomus thermophilum | NC_011297.1 | 2 | 0.06
Dictyoglomus turgidum | NC_011661.1 | 2 | 0
Average | | 2.00 | 0.03

Bacteria: Elusimicrobia (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Elusimicrobium minutum | NC_010644.1 | 1 | 0
Average | | 1.00 | 0.00

Archaea: Euryarchaeota (n=34)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Archaeoglobus fulgidus | NC_000917.1 | 1 | 0
Candidatus Methanoregula boonei | NC_009712.1 | 1 | 0
Candidatus Methanosphaerula palustris | NC_011832.1 | 3 | 0
Haloarcula marismortui | NC_006396.1 | 3 | 5.63
Halobacterium salinarum | NC_010364.1 | 1 | 0
Halobacterium sp. NRC-1 | NC_002607.1 | 1 | 0
Haloquadratum walsbyi | NC_008212.1 | 2 | 0.07
Halorubrum lacusprofundi | NC_012029.1 | 2 | 0.07
Methanobrevibacter smithii | NC_009515.1 | 2 | 0.07
Methanocaldococcus jannaschii | NC_000909.1 | 2 | 0.41
Methanococcoides burtonii | NC_007955.1 | 3 | 0
Methanococcus aeolicus | NC_009635.1 | 2 | 0.07
Methanococcus maripaludis | NC_009135.1 | 3 | 0
Methanococcus vannielii | NC_009634.1 | 4 | 0
Methanocorpusculum labreanum | NC_008942.1 | 3 | 0
Methanoculleus marisnigri | NC_009051.1 | 1 | 0
Methanopyrus kandleri | NC_003551.1 | 1 | 0
Methanosaeta thermophila | NC_008553.1 | 2 | 0
Methanosarcina acetivorans | NC_003552.1 | 3 | 0.07
Methanosarcina barkeri | NC_007355.1 | 3 | 0.27
Methanosarcina mazei | NC_003901.1 | 3 | 0
Methanosphaera stadtmanae | NC_007681.1 | 4 | 0
Methanospirillum hungatei | NC_007796.1 | 4 | 0.14
Methanothermobacter thermautotrophicus | NC_000916.1 | 2 | 0.13
Natronomonas pharaonis | NC_007426.1 | 1 | 0
Picrophilus torridus | NC_005877.1 | 1 | 0
Pyrococcus abyssi | NC_000868.1 | 1 | 0
Pyrococcus furiosus | NC_003413.1 | 1 | 0
Pyrococcus horikoshii | NC_000961.1 | 1 | 0
Thermococcus kodakarensis | NC_006624.1 | 1 | 0
Thermococcus onnurineus | NC_011529.1 | 1 | 0
Thermoplasma acidophilum | NC_002578.1 | 1 | 0
Thermoplasma volcanium | NC_002689.2 | 1 | 0
Methanogenic archaeon | NC_009464.1 | 3 | 0.2
Average | | 2.00 | 0.21

Bacteria: Firmicutes (n=106)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Acholeplasma laidlawii | NC_010163.1 | 2 | 0
Alkaliphilus metalliredigens | NC_009633.1 | 10 | 0
Alkaliphilus oremlandii | NC_009922.1 | 8 | 0.07
Anaerocellum thermophilum | NC_012034.1 | 3 | 0.26
Anoxybacillus flavithermus | NC_011567.1 | 8 | 0.66
Aster yellows witches'-broom phytoplasma | NC_007716.1 | 2 | 0.13
Bacillus amyloliquefaciens | NC_009725.1 | 10 | 0.5
Bacillus anthracis | NC_012659 | 11 | 0.66
Bacillus cereus | NC_004722.1 | 13 | 0.19
Bacillus clausii | NC_006582.1 | 7 | 1.15
Bacillus halodurans | NC_002570.2 | 8 | 0.39
Bacillus licheniformis | NC_006322.1 | 7 | 0.64
Bacillus pumilus | NC_009848.1 | 7 | 0.26
Bacillus subtilis subsp. subtilis | NC_000964.2 | 10 | 1.16
Bacillus thuringiensis serovar konkukian | NC_005957.1 | 14 | 0.4
Bacillus weihenstephanensis | NC_010184.1 | 14 | 0.32
Brevibacillus brevis | NC_012491.1 | 15 | 0.6
Caldicellulosiruptor saccharolyticus | NC_009437.1 | 3 | 1.35
Candidatus Desulforudis audaxviator | NC_010424.1 | 2 | 0
Candidatus Phytoplasma australiense | NC_010544.1 | 2 | 0.2
Carboxydothermus hydrogenoformans | NC_007503.1 | 4 | 1.14
Clostridium acetobutylicum | NC_003030.1 | 11 | 0.79
Clostridium beijerinckii | NC_009617.1 | 14 | 0.6
Clostridium botulinum A | NC_009697.1 | 8 | 0.53
Clostridium cellulolyticum | NC_011898.1 | 8 | 2.07
Clostridium difficile | NC_009089.1 | 11 | 0.2
Clostridium kluyveri | NC_009706.1 | 7 | 0.27
Clostridium novyi | NC_008593.1 | 10 | 0.13
Clostridium perfringens | NC_003366.1 | 10 | 0.66
Clostridium phytofermentans | NC_010001.1 | 8 | 0.79
Clostridium tetani | NC_004557.1 | 6 | 0.33
Clostridium thermocellum | NC_009012.1 | 4 | 0.2
Coprothermobacter proteolyticus | NC_011295.1 | 2 | 0.53
Desulfitobacterium hafniense | NC_011830.1 | 5 | 1.8
Desulfotomaculum reducens | NC_009253.1 | 8 | 0.37
Enterococcus faecalis | NC_004668.1 | 4 | 0.07
Exiguobacterium sibiricum | NC_010556.1 | 9 | 0.26
Exiguobacterium sp. AT1b | NC_012673 | 9 | 0.52
Finegoldia magna | NC_010376.1 | 4 | 0.33
Geobacillus kaustophilus | NC_006510.1 | 9 | 0.77
Geobacillus thermodenitrificans | NC_009328.1 | 10 | 1.22
Halothermothrix orenii | NC_011899.1 | 4 | 0.06
Heliobacterium modesticaldum | NC_010337.2 | 10 | 1.58
Lactobacillus acidophilus | NC_006814.3 | 4 | 0.84
Lactobacillus brevis | NC_008497.1 | 5 | 0.32
Lactobacillus casei | NC_008526.1 | 5 | 0.13
Lactobacillus delbrueckii subsp. bulgaricus | NC_008054.1 | 9 | 0.51
Lactobacillus fermentum | NC_010610.1 | 5 | 0.32
Lactobacillus gasseri | NC_008530.1 | 6 | 0
Lactobacillus helveticus | NC_010080.1 | 4 | 0
Lactobacillus johnsonii | NC_005362.1 | 6 | 0
Lactobacillus plantarum | NC_004567.1 | 5 | 0.13
Lactobacillus reuteri | NC_009513.1 | 6 | 0.76
Lactobacillus sakei subsp. sakei | NC_007576.1 | 7 | 0.38
Lactobacillus salivarius | NC_007929.1 | 7 | 0.38
Lactococcus lactis subsp. cremoris | NC_009004.1 | 6 | 0
Leuconostoc citreum | NC_010471.1 | 4 | 0
Leuconostoc mesenteroides subsp. mesenteroides | NC_008531.1 | 4 | 0
Listeria innocua | NC_003212.1 | 6 | 0.26
Listeria monocytogenes | NC_012488.1 | 6 | 0.27
Listeria welshimeri | NC_008555.1 | 6 | 0.06
Lysinibacillus sphaericus | NC_010382.1 | 10 | 0.71
Macrococcus caseolyticus | NC_011999.1 | 4 | 0
Mesoplasma florum | NC_006055.1 | 2 | 0
Moorella thermoacetica | NC_007644.1 | 1 | 0
Mycoplasma agalactiae | NC_009497.1 | 2 | 0.46
Mycoplasma arthritidis | NC_011025.1 | 1 | 0
Mycoplasma capricolum subsp. capricolum | NC_007633.1 | 2 | 0.2
Mycoplasma gallisepticum | NC_004829.1 | 2 | 0.2
Mycoplasma genitalium | NC_000908.2 | 1 | 0
Mycoplasma hyopneumoniae | NC_006360.1 | 1 | 0
Mycoplasma mobile | NC_006908.1 | 1 | 0
Mycoplasma mycoides subsp. mycoides | NC_005364.2 | 2 | 0.65
Mycoplasma penetrans | NC_004432.1 | 1 | 0
Mycoplasma pneumoniae | NC_000912.1 | 1 | 0
Mycoplasma pulmonis | NC_002771.1 | 1 | 0
Mycoplasma synoviae | NC_007294.1 | 2 | 0.13
Natranaerobius thermophilus | NC_010718.1 | 4 | 0.14
Oceanobacillus iheyensis | NC_004193.1 | 7 | 0.96
Oenococcus oeni | NC_008528.1 | 2 | 0
Onion yellows phytoplasma | NC_005303.1 | 2 | 0.19
Pediococcus pentosaceus | NC_008525.1 | 5 | 0.13
Pelotomaculum thermopropionicum | NC_009454.1 | 2 | 0.13
Staphylococcus aureus | NC_007622.1 | 5 | 0.19
Staphylococcus aureus subsp. aureus | NC_002951.2 | 6 | 0.13
Staphylococcus carnosus subsp. carnosus | NC_012121.1 | 5 | 0.39
Staphylococcus epidermidis | NC_004461.1 | 5 | 0.71
Staphylococcus haemolyticus | NC_007168.1 | 5 | 0.32
Staphylococcus saprophyticus subsp. saprophyticus | NC_007350.1 | 6 | 0.19
Streptococcus agalactiae | NC_004116.1 | 7 | 0
Streptococcus equi subsp. equi | NC_012471.1 | 6 | 0
Streptococcus gordonii | NC_009785.1 | 4 | 0.26
Streptococcus mutans | NC_004350.1 | 5 | 0.19
Streptococcus pneumoniae | NC_012468.1 | 4 | 0
Streptococcus pyogenes | NC_004606.1 | 5 | 0
Streptococcus sanguinis | NC_009009.1 | 4 | 0.71
Streptococcus suis | NC_009442.1 | 4 | 0
Streptococcus thermophilus | NC_006449.1 | 6 | 0.13
Streptococcus uberis | NC_012004.1 | 5 | 0
Symbiobacterium thermophilum | NC_006177.1 | 6 | 0.25
Syntrophomonas wolfei subsp. wolfei | NC_008346.1 | 3 | 1.67
Thermoanaerobacter pseudethanolicus | NC_010321.1 | 4 | 0.39
Thermoanaerobacter sp. X514 | NC_010320.1 | 4 | 0.07
Thermoanaerobacter tengcongensis | NC_003869.1 | 4 | 6.7
Ureaplasma parvum | NC_010503.1 | 2 | 0.07
Ureaplasma parvum | NC_002162.1 | 2 | 0.07
Average | | 5.61 | 0.43

Bacteria: Fusobacteria (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Fusobacterium nucleatum subsp. nucleatum | NC_003454.1 | 5 | 0.14
Average | | 5.00 | 0.14

Bacteria: Gemmatimonadetes (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Gemmatimonas aurantiaca | NC_012489.1 | 1 | 0
Average | | 1.00 | 0.00

Archaea: Korarchaeota (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Candidatus Korarchaeum cryptofilum | NC_010482.1 | 1 | 0
Average | | 1.00 | 0.00

Bacteria: Planctomycetes (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Rhodopirellula baltica | NC_005027.1 | 1 | 0
Average | | 1.00 | 0.00

Bacteria: Proteobacteria (n=264)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Acidiphilium cryptum | NC_009484.1 | 2 | 0
Acidithiobacillus ferrooxidans | NC_011761.1 | 2 | 0
Acidovorax citrulli | NC_008752.1 | 3 | 0
Acidovorax sp. JS42 | NC_008782.1 | 3 | 0
Acinetobacter baumannii | NC_011586.1 | 6 | 0
Acinetobacter sp. ADP1 | NC_005966.1 | 7 | 0
Actinobacillus pleuropneumoniae | NC_009053.1 | 3 | 0.26
Actinobacillus succinogenes | NC_009655.1 | 6 | 0.07
Aeromonas hydrophila subsp. hydrophila | NC_008570.1 | 10 | 0.13
Aeromonas salmonicida subsp. salmonicida | NC_009348.1 | 9 | 0.32
Agrobacterium radiobacter | NC_011985.1 | 3 | 0
Agrobacterium tumefaciens | NC_003062.2 | 4 | 0
Agrobacterium vitis | NC_011989.1 | 2 | 0
Alcanivorax borkumensis | NC_008260.1 | 3 | 0
Aliivibrio salmonicida | NC_011313.1 | 1 | 0
Alkalilimnicola ehrlichei | NC_008340.1 | 2 | 0.07
Alteromonas macleodii | NC_011138.1 | 5 | 0.96
Anaeromyxobacter dehalogenans | NC_011891.1 | 2 | 0.06
Anaeromyxobacter sp. Fw109-5 | NC_009675.1 | 2 | 0
Anaplasma marginale | NC_012026.1 | 1 | 0
Anaplasma phagocytophilum | NC_007797.1 | 1 | 0
Arcobacter butzleri | NC_009850.1 | 5 | 0
Aromatoleum aromaticum | NC_006513.1 | 4 | 0
Azoarcus sp. BH72 | NC_008702.1 | 4 | 0
Azorhizobium caulinodans | NC_009937.1 | 3 | 0
Azotobacter vinelandii | NC_012560.1 | 2 | 0
Bartonella bacilliformis | NC_008783.1 | 2 | 0
Bartonella henselae | NC_005956.1 | 2 | 0
Bartonella quintana | NC_005955.1 | 2 | 0
Bartonella tribocorum | NC_010161.1 | 2 | 0
Baumannia cicadellinicola | NC_007984.1 | 2 | 0
Bdellovibrio bacteriovorus | NC_005363.1 | 2 | 0
Beijerinckia indica subsp. indica | NC_010581.1 | 3 | 0.07
Bordetella avium | NC_010645.1 | 3 | 0
Bordetella bronchiseptica | NC_002927.3 | 3 | 0
Bordetella parapertussis | NC_002928.3 | 3 | 0
Bordetella pertussis | NC_002929.2 | 3 | 0
Bordetella petrii | NC_010170.1 | 3 | 0
Bradyrhizobium japonicum | NC_004463.1 | 1 | 0
Brucella abortus | NC_006932.1 | 3 | 0
Brucella canis | NC_010103.1 | 2 | 0
Brucella melitensis | NC_003317.1 | 3 | 0
Brucella ovis | NC_009504.1 | 1 | 0
Brucella suis | NC_004310.3 | 3 | 0
Buchnera aphidicola | NC_011833.1 | 1 | 0
Burkholderia ambifaria | NC_008390.1 | 4 | 0.39
Burkholderia cenocepacia | NC_008060.1 | 4 | 0.33
Burkholderia mallei | NC_006349.2 | 4 | 0.07
Burkholderia multivorans | NC_010801.1 | 1 | 0
Burkholderia phymatum | NC_010622.1 | 4 | 0.13
Burkholderia phytofirmans | NC_010681.1 | 3 | 0.07
Burkholderia pseudomallei | NC_009076.1 | 4 | 0.07
Burkholderia sp. 383 | NC_007509.1 | 6 | 0.41
Burkholderia thailandensis | NC_007651.1 | 4 | 0.07
Burkholderia vietnamiensis | NC_009254.1 | 1 | 0
Burkholderia xenovorans | NC_007952.1 | 6 | 0
Campylobacter concisus | NC_009802.1 | 3 | 0
Campylobacter curvus | NC_009715.1 | 6 | 0
Campylobacter fetus subsp. fetus | NC_008599.1 | 3 | 0
Campylobacter hominis | NC_009714.1 | 3 | 0
Campylobacter jejuni | NC_003912.7 | 3 | 0
Campylobacter lari | NC_012039.1 | 3 | 0
Candidatus Blochmannia floridanus | NC_005061.1 | 1 | 0
Candidatus Blochmannia pennsylvanicus | NC_007292.1 | 1 | 0
Candidatus Carsonella ruddii | NC_008512.1 | 1 | 0
Candidatus Hamiltonella defensa | NC_012751.1 | 3 | 0.19
Candidatus Pelagibacter ubique | NC_007205.1 | 1 | 0
Candidatus Ruthia magnifica | NC_008610.1 | 1 | 0
Candidatus Vesicomyosocius okutanii | NC_009465.1 | 1 | 0
Caulobacter crescentus | NC_002696.2 | 2 | 0
Caulobacter sp. K31 | NC_010338.1 | 2 | 0
Cellvibrio japonicus | NC_010995.1 | 3 | 0
Chromobacterium violaceum | NC_005085.1 | 8 | 0
Chromohalobacter salexigens | NC_007963.1 | 5 | 0
Citrobacter koseri | NC_009792.1 | 7 | 0.9
Colwellia psychrerythraea | NC_003910.7 | 9 | 0.26
Coxiella burnetii | NC_011527.1 | 1 | 0
Cronobacter sakazakii | NC_009778.1 | 7 | 0.58
Cupriavidus taiwanensis | NC_010528.1 | 3 | 0.33
Dechloromonas aromatica | NC_007298.1 | 4 | 0
Delftia acidovorans | NC_010002.1 | 5 | 0
Desulfatibacillum alkenivorans | NC_011768.1 | 2 | 0
Desulfobacterium autotrophicum | NC_012108.1 | 6 | 0.97
Desulfococcus oleovorans | NC_009943.1 | 1 | 0
Desulfotalea psychrophila | NC_006138.1 | 7 | 1.16
Desulfovibrio desulfuricans subsp. desulfuricans | NC_011883.1 | 3 | 0
Desulfovibrio vulgaris | NC_008751.1 | 5 | 0.27
Diaphorobacter sp. TPSY | NC_011992.1 | 3 | 0
Dichelobacter nodosus | NC_009446.1 | 3 | 0
Dinoroseobacter shibae | NC_009952.1 | 2 | 0
Ehrlichia canis | NC_007354.1 | 1 | 0
Ehrlichia chaffeensis | NC_007799.1 | 1 | 0
Ehrlichia ruminantium | NC_006831.1 | 1 | 0
Enterobacter sp. 638 | NC_009436.1 | 7 | 0.45
Erwinia tasmaniensis | NC_010694.1 | 7 | 0.06
Erythrobacter litoralis | NC_007722.1 | 1 | 0
Escherichia coli | NC_008253.1 | 7 | 1.1
Escherichia fergusonii | NC_011740.1 | 7 | 0.39
Francisella novicida | NC_008601.1 | 3 | 0
Francisella philomiragia subsp. philomiragia | NC_010336.1 | 3 | 0
Francisella tularensis subsp. holarctica | NC_007880.1 | 3 | 0.33
Geobacter bemidjiensis | NC_011146.1 | 4 | 0.78
Geobacter lovleyi | NC_010814.1 | 2 | 0
Geobacter metallireducens | NC_007517.1 | 2 | 0
Geobacter sp. FRC-32 | NC_011979.1 | 2 | 0
Geobacter sulfurreducens | NC_002939.4 | 2 | 0
Geobacter uraniireducens | NC_009483.1 | 2 | 0
Gluconacetobacter diazotrophicus | NC_011365.1 | 4 | 0
Gluconobacter oxydans | NC_006677.1 | 4 | 0
Granulibacter bethesdensis | NC_008343.1 | 3 | 0
Haemophilus ducreyi | NC_002940.2 | 6 | 0
Haemophilus influenzae | NC_007146.2 | 6 | 0
Haemophilus parasuis | NC_011852.1 | 6 | 0.29
Haemophilus somnus | NC_008309.1 | 5 | 0.13
Hahella chejuensis | NC_007645.1 | 5 | 0
Halorhodospira halophila | NC_008789.1 | 2 | 0.13
Helicobacter acinonychis | NC_008229.1 | 2 | 0
Helicobacter hepaticus | NC_004917.1 | 1 | 0
Helicobacter pylori | NC_000915.1 | 2 | 0.8
Herminiimonas arsenicoxydans | NC_009138.1 | 2 | 0
Hyphomonas neptunium | NC_008358.1 | 1 | 0
Idiomarina loihiensis | NC_006512.1 | 4 | 0.26
Jannaschia sp. CCS1 | NC_007802.1 | 1 | 0
Janthinobacterium sp. Marseille | NC_009659.1 | 2 | 0
Klebsiella pneumoniae | NC_011283.1 | 8 | 0.78
Laribacter hongkongensis | NC_012559.1 | 7 | 0
Lawsonia intracellularis | NC_008011.1 | 2 | 0
Legionella pneumophila | NC_009494.1 | 3 | 0.13
Leptothrix cholodnii | NC_010524.1 | 2 | 0
Magnetococcus sp. MC-1 | NC_008576.1 | 3 | 0.07
Magnetospirillum magneticum | NC_007626.1 | 2 | 0
Mannheimia succiniciproducens | NC_006300.1 | 6 | 0.19
Maricaulis maris | NC_008347.1 | 2 | 0
Marinobacter aquaeolei | NC_008740.1 | 3 | 0.26
Marinomonas sp. MWYL1 | NC_009654.1 | 8 | 0.13
Mesorhizobium loti | NC_002678.2 | 2 | 0
Mesorhizobium sp. BNC1 | NC_008254.1 | 2 | 0
Methylibium petroleiphilum | NC_008825.1 | 1 | 0
Methylobacillus flagellatus | NC_007947.1 | 2 | 0
Methylobacterium chloromethanicum | NC_011757.1 | 5 | 0
Methylobacterium extorquens | NC_010172.1 | 5 | 0
Methylobacterium nodulans | NC_011894.1 | 7 | 0
Methylobacterium populi | NC_010725.1 | 5 | 0.14
Methylobacterium radiotolerans | NC_010505.1 | 4 | 0.14
Methylobacterium sp. 4-46 | NC_010511.1 | 6 | 0.94
Methylocella silvestris | NC_011666.1 | 2 | 0
Methylococcus capsulatus | NC_002977.6 | 2 | 0
Myxococcus xanthus | NC_008095.1 | 4 | 0
Nautilia profundicola | NC_012115.1 | 4 | 0
Neisseria gonorrhoeae | NC_002946.2 | 4 | 0
Neisseria meningitidis | NC_010120.1 | 4 | 0
Neorickettsia sennetsu | NC_007798.1 | 1 | 0
Nitratiruptor sp. SB155-2 | NC_009662.1 | 3 | 0.06
Nitrobacter hamburgensis | NC_007964.1 | 1 | 0
Nitrobacter winogradskyi | NC_007406.1 | 1 | 0
Nitrosococcus oceani | NC_007484.1 | 2 | 0
Nitrosomonas europaea | NC_004757.1 | 1 | 0
Nitrosomonas eutropha | NC_008344.1 | 1 | 0
Nitrosospira multiformis | NC_007614.1 | 1 | 0
Novosphingobium aromaticivorans | NC_007794.1 | 3 | 0
Ochrobactrum anthropi | NC_009667.1 | 2 | 0
Oligotropha carboxidovorans | NC_011386.1 | 1 | 0
Orientia tsutsugamushi | NC_009488.1 | 1 | 0
Paracoccus denitrificans | NC_008686.1 | 2 | 0
Parvibaculum lavamentivorans | NC_009719.1 | 1 | 0
Pasteurella multocida subsp. multocida | NC_002663.1 | 6 | 0.61
Pectobacterium atrosepticum | NC_004547.2 | 7 | 0.19
Pelobacter carbinolicus | NC_007498.2 | 2 | 0.27
Pelobacter propionicus | NC_008609.1 | 4 | 0
Phenylobacterium zucineum | NC_011144.1 | 1 | 0
Photobacterium profundum | NC_006370.1 | 15 | 2.09
Photorhabdus luminescens subsp. laumondii | NC_005126.1 | 7 | 0.78
Polaromonas naphthalenivorans | NC_008781.1 | 2 | 0
Polaromonas sp. JS666 | NC_007948.1 | 1 | 0
Polynucleobacter necessarius subsp. asymbioticus | NC_009379.1 | 1 | 0
Polynucleobacter necessarius subsp. necessarius STIR1 | NC_010531.1 | 1 | 0
Proteus mirabilis | NC_010554.1 | 7 | 0.19
Pseudoalteromonas atlantica | NC_008228.1 | 5 | 0.13
Pseudoalteromonas haloplanktis | NC_007482.1 | 9 | 0.65
Pseudomonas aeruginosa | NC_011770.1 | 4 | 0.07
Pseudomonas entomophila | NC_008027.1 | 7 | 0.07
Pseudomonas fluorescens | NC_007492.1 | 6 | 0.13
Pseudomonas mendocina | NC_009439.1 | 4 | 0.13
Pseudomonas putida | NC_009512.1 | 6 | 0.13
Pseudomonas stutzeri | NC_009434.1 | 4 | 1.23
Pseudomonas syringae | NC_005773.3 | 5 | 0
Psychrobacter arcticus | NC_007204.1 | 4 | 0
Psychrobacter cryohalolentis | NC_007969.1 | 4 | 0
Psychrobacter sp. PRwf-1 | NC_009524.1 | 5 | 0.65
Psychromonas ingrahamii | NC_008709.1 | 10 | 0.39
Ralstonia eutropha | NC_007347.1 | 6 | 0.52
Ralstonia metallidurans | NC_007973.1 | 4 | 0
Ralstonia pickettii | NC_010682.1 | 2 | 0
Ralstonia solanacearum | NC_003295.1 | 3 | 0
Rhizobium etli | NC_007761.1 | 3 | 0
Rhizobium leguminosarum | NC_011369.1 | 3 | 0
Rhizobium sp. NGR234 | NC_012587.0 | 3 | 0.56
Rhodobacter sphaeroides | NC_007494.1 | 2 | 0.2
Rhodoferax ferrireducens | NC_007908.1 | 2 | 0
Rhodopseudomonas palustris | NC_008435.1 | 2 | 0
Rhodospirillum centenum | NC_011420.1 | 4 | 0
Rhodospirillum rubrum | NC_007643.1 | 4 | 0
Rickettsia akari | NC_009881.1 | 1 | 0
Rickettsia bellii | NC_009883.1 | 1 | 0
Rickettsia canadensis | NC_009879.1 | 1 | 0
Rickettsia conorii | NC_003103.1 | 1 | 0
Rickettsia felis | NC_007109.1 | 1 | 0
Rickettsia massiliae | NC_009900.1 | 1 | 0
Rickettsia prowazekii | NC_000963.1 | 1 | 0
Rickettsia rickettsii | NC_010263.1 | 1 | 0
Rickettsia typhi | NC_006142.1 | 1 | 0
Roseobacter denitrificans | NC_008209.1 | 1 | 0
Ruegeria pomeroyi | NC_003911.11 | 3 | 0
Ruegeria sp. TM1040 | NC_008044.1 | 1 | 0
Saccharophagus degradans | NC_007912.1 | 2 | 0
Salmonella enterica subsp. arizonae serovar | NC_003197.1 | 7 | 0.45
Serratia proteamaculans | NC_009832.1 | 7 | 0.59
Shewanella amazonensis | NC_008700.1 | 8 | 0.39
Shewanella baltica | NC_009052.1 | 10 | 1.36
Shewanella denitrificans | NC_007954.1 | 8 | 0.78
Shewanella frigidimarina | NC_008345.1 | 9 | 0.07
Shewanella halifaxensis | NC_010334.1 | 10 | 0.26
Shewanella loihica | NC_009092.1 | 8 | 0.96
Shewanella oneidensis | NC_004347.1 | 9 | 0.26
Shewanella pealeana | NC_009901.1 | 11 | 0.13
Shewanella piezotolerans | NC_011566.1 | 8 | 0.33
Shewanella putrefaciens | NC_009438.1 | 8 | 0.9
Shewanella sediminis | NC_009831.1 | 12 | 0.98
Shewanella sp. ANA-3 | NC_008577.1 | 9 | 1.09
Shewanella woodyi | NC_010506.1 | 10 | 1.56
Shigella boydii | NC_010658.1 | 7 | 0.26
Shigella dysenteriae | NC_007606.1 | 7 | 0
Shigella flexneri | NC_004741.1 | 7 | 0.58
Shigella sonnei | NC_007384.1 | 7 | 0.91
Sinorhizobium medicae | NC_009636.1 | 3 | 0
Sinorhizobium meliloti | NC_003047.1 | 3 | 0
Sodalis glossinidius | NC_007712.1 | 7 | 0.13
Sorangium cellulosum | NC_010162.1 | 4 | 0.17
Sphingomonas wittichii | NC_009511.1 | 2 | 0
Sphingopyxis alaskensis | NC_008048.1 | 1 | 0
Stenotrophomonas maltophilia | NC_011071.1 | 4 | 0.06
Sulfurimonas denitrificans | NC_007575.1 | 4 | 0
Sulfurovum sp. NBC37-1 | NC_009663.1 | 3 | 0
Syntrophobacter fumaroxidans | NC_008554.1 | 2 | 0.13
Syntrophus aciditrophicus | NC_007759.1 | 1 | 0
Thioalkalivibrio sp. HL-EbGR7 | NC_011901.1 | 1 | 0
Thiobacillus denitrificans | NC_007404.1 | 2 | 0
Thiomicrospira crunogena | NC_007520.2 | 3 | 0
Vibrio cholerae | NC_012582.1 | 8 | 0.78
Vibrio fischeri | NC_006840.2 | 12 | 0.91
Vibrio harveyi | NC_009783.1 | 10 | 0.45
Vibrio parahaemolyticus | NC_004603.1 | 11 | 0.65
Vibrio vulnificus | NC_005139.1 | 8 | 0.39
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis | NC_004344.2 | 2 | 0.13
Wolbachia endosymbiont of Drosophila melanogaster | NC_002978.6 | 1 | 0
Wolbachia endosymbiont | NC_006833.1 | 1 | 0
Wolbachia sp. wRi | NC_012416.1 | 1 | 0
Xanthobacter autotrophicus | NC_009720.1 | 2 | 0
Xanthomonas campestris | NC_010688.1 | 2 | 0.06
Xanthomonas oryzae | NC_007705.1 | 2 | 0.06
Xylella fastidiosa | NC_010577.1 | 2 | 0.07
Yersinia pestis | NC_008149.1 | 7 | 0.13
Yersinia pseudotuberculosis | NC_010634.1 | 7 | 0.07
Zymomonas mobilis subsp. mobilis | NC_006526.1 | 3 | 0.2
Average | | 3.75 | 0.16

Archaea: Nanoarchaeota (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Nanoarchaeum equitans | NC_005213.1 | 1 | 0
Average | | 1.00 | 0.00

Bacteria: Nitrospirae (n=1)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Thermodesulfovibrio yellowstonii | NC_011296.1 | 1 | 0
Average | | 1.00 | 0.00

Bacteria: Spirochaetes (n=13)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Borrelia afzelii | NC_008277.1 | 2 | 20.38
Borrelia burgdorferi | NC_001318.1 | 1 | 0
Borrelia duttonii | NC_011229.1 | 1 | 0
Borrelia garinii | NC_006156.1 | 1 | 0
Borrelia hermsii | NC_010673.1 | 1 | 0
Borrelia recurrentis | NC_011244.1 | 1 | 0
Borrelia turicatae | NC_008710.1 | 1 | 0
Brachyspira hyodysenteriae | NC_012225.1 | 1 | 0
Leptospira biflexa | NC_010842.1 | 2 | 0
Leptospira borgpetersenii | NC_008510.1 | 2 | 0
Leptospira interrogans | NC_005823.1 | 2 | 0.07
Treponema denticola | NC_002967.9 | 2 | 0
Treponema pallidum subsp. pallidum | NC_010741.1 | 2 | 0.07
Average | | 1.46 | 1.58

Bacteria: Thermotogae (n=9)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Fervidobacterium nodosum | NC_009718.1 | 2 | 0
Petrotoga mobilis | NC_010003.1 | 2 | 0.07
Thermosipho africanus | NC_011653.1 | 3 | 0.5
Thermosipho melanesiensis | NC_009616.1 | 4 | 0
Thermotoga lettingae | NC_009828.1 | 1 | 0
Thermotoga maritima | NC_000853.1 | 1 | 0
Thermotoga neapolitana | NC_011978.1 | 1 | 0
Thermotoga petrophila | NC_009486.1 | 1 | 0
Thermotoga sp. RQ2 | NC_010483.1 | 1 | 0
Average | | 1.78 | 0.06

Bacteria: Tenericutes (n=2)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Candidatus Phytoplasma mali | NC_011047.1 | 2 | 0
Ureaplasma urealyticum | NC_011374.1 | 2 | 0.07
Average | | 2.00 | 0.04

Bacteria: Verrucomicrobia (n=3)
Species | Accession number | Number of 16S rRNA genes | Diversity (%)
Akkermansia muciniphila | NC_010655.1 | 3 | 0
Methylacidiphilum infernorum | NC_010794.1 | 1 | 0
Opitutus terrae | NC_010571.1 | 1 | 0
Average | | 1.67 | 0.00
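The Average row that closes each phylum table above appears to be the unweighted arithmetic mean over that table's species, for both the number of 16S rRNA genes and the diversity percentage. The following is a minimal sketch for checking this, assuming that interpretation; the function name `average_row` and the tuple layout are illustrative and not part of the original supplement. The values are copied from the Dictyoglomi and Verrucomicrobia tables.

```python
from statistics import mean

# (species, number of 16S rRNA genes, diversity %) copied from the tables above
dictyoglomi = [
    ("Dictyoglomus thermophilum", 2, 0.06),
    ("Dictyoglomus turgidum", 2, 0.0),
]
verrucomicrobia = [
    ("Akkermansia muciniphila", 3, 0.0),
    ("Methylacidiphilum infernorum", 1, 0.0),
    ("Opitutus terrae", 1, 0.0),
]


def average_row(rows):
    """Hypothetical helper: unweighted means of gene count and diversity,
    formatted to two decimals as in the tables' Average rows."""
    gene_mean = mean(r[1] for r in rows)
    diversity_mean = mean(r[2] for r in rows)
    return f"{gene_mean:.2f}", f"{diversity_mean:.2f}"


print(average_row(dictyoglomi))      # compare with "Average 2.00 0.03"
print(average_row(verrucomicrobia))  # compare with "Average 1.67 0"
```

Because the mean is taken per species rather than per gene copy, a phylum dominated by single-copy genomes will show a low average diversity even if one member is highly divergent.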

Supplementary Table 2. 95 prokaryotic species with partial rRNA operons missing 16S rRNA genes
Organism | Number of 5S genes | Number of 16S genes | Number of 23S genes | 16S diversity

Archaea: Crenarchaeota (n=1)
Ignicoccus hospitalis | 2 | 1 | 1 | NA*
Average | 2.00 | 1.00 | 1.00 |

Archaea: Euryarchaeota (n=11)
Pyrococcus abyssi | 2 | 1 | 1 | NA
Pyrococcus furiosus | 2 | 1 | 1 | NA
Pyrococcus horikoshii | 2 | 1 | 1 | NA
Thermococcus kodakarensis | 2 | 1 | 1 | NA
Methanobrevibacter smithii | 3 | 2 | 2 | 0.07
Methanococcus maripaludis | 4 | 3 | 3 | 0.00
Methanococcus vannielii | 5 | 4 | 4 | 0.00
Methanococcus aeolicus | 4 | 2 | 2 | 0.07
Methanospirillum hungatei | 6 | 4 | 4 | 0.14
Candidatus Methanosphaerula palustris | 6 | 3 | 3 | 0.00
Methanococcoides burtonii | 6 | 3 | 3 | 0.00
Average | 3.82 | 2.27 | 2.27 | 0.04

Bacteria: Actinobacteria (n=4)
Corynebacterium glutamicum | 6 | 6 | 12 | 0.26
Bifidobacterium animalis subsp. lactis | 3 | 2 | 2 | 0.19
Bifidobacterium adolescentis | 6 | 5 | 5 | 1.30
Beutenbergia cavernae | 4 | 2 | 2 | 0.00
Average | 4.75 | 3.75 | 5.25 | 0.44

Bacteria: Bacteroidetes/Chlorobi (n=1)
Bacteroides fragilis | 7 | 6 | 6 | 0.00
Average | 7.00 | 6.00 | 6.00 | 0.00

Bacteria: Firmicutes (n=19)
Thermoanaerobacter pseudethanolicus | 4 | 4 | 8 | 0.39
Thermoanaerobacter sp. X514 | 4 | 4 | 8 | 0.07
Mycoplasma pulmonis | 2 | 1 | 1 | NA
Mycoplasma synoviae | 3 | 2 | 2 | 0.13
Macrococcus caseolyticus | 5 | 4 | 4 | 0.00
Lactobacillus plantarum | 6 | 5 | 5 | 0.13
Staphylococcus aureus | 6 | 5 | 5 | 0.19
Staphylococcus epidermidis | 6 | 5 | 5 | 0.71
Staphylococcus haemolyticus | 6 | 5 | 5 | 0.32
Lactococcus lactis subsp. cremoris | 7 | 6 | 6 | 0.00
Staphylococcus aureus subsp. aureus | 7 | 6 | 6 | 0.13
Staphylococcus saprophyticus subsp. saprophyticus | 7 | 6 | 6 | 0.19
Bacillus clausii | 8 | 7 | 7 | 1.15
Oceanobacillus iheyensis | 8 | 7 | 7 | 0.96
Bacillus halodurans | 9 | 8 | 8 | 0.39
Lysinibacillus sphaericus | 11 | 10 | 10 | 0.71
Alkaliphilus oremlandii | 10 | 8 | 8 | 0.07
Desulfotomaculum reducens | 10 | 8 | 8 | 0.37
Syntrophomonas wolfei subsp. wolfei | 11 | 3 | 3 | 1.67
Average | 6.84 | 5.47 | 5.89 | 0.40

Bacteria: Planctomycetes (n=1)
Rhodopirellula baltica | 2 | 1 | 2 | 0.00
Average | 2.00 | 1.00 | 2.00 | 0.00

Bacteria: Proteobacteria (n=56)
Mesorhizobium sp. BNC1 | 1 | 2 | 3 | 0.00
Brucella canis | 2 | 2 | 4 | 0.00
Brucella melitensis | 3 | 3 | 6 | 0.00
Diaphorobacter sp. TPSY | 6 | 3 | 3 | 0.00
Desulfobacterium autotrophicum | 7 | 6 | 6 | 0.97
Geobacter sp. FRC-32 | 4 | 2 | 2 | 0.00
Desulfovibrio desulfuricans subsp. desulfuricans | 6 | 3 | 3 | 0.00
Alkalilimnicola ehrlichei | 1 | 2 | 3 | 0.07
Francisella novicida | 4 | 3 | 3 | 0.00
Francisella tularensis subsp. holarctica | 4 | 3 | 3 | 0.33
Stenotrophomonas maltophilia | 5 | 4 | 4 | 0.06
Alteromonas macleodii | 6 | 5 | 5 | 0.96
Haemophilus ducreyi | 7 | 6 | 6 | 0.00
Mannheimia succiniciproducens | 7 | 6 | 6 | 0.19
Pseudomonas fluorescens | 7 | 6 | 6 | 0.13
Pseudomonas putida | 7 | 6 | 6 | 0.13
Citrobacter koseri | 8 | 7 | 7 | 0.90
Cronobacter sakazakii | 8 | 7 | 7 | 0.58
Enterobacter sp. 638 | 8 | 7 | 7 | 0.45
Erwinia tasmaniensis | 8 | 7 | 7 | 0.06
Escherichia coli | 8 | 7 | 7 | 1.10
Escherichia fergusonii | 8 | 7 | 7 | 0.39
Pectobacterium atrosepticum | 8 | 7 | 7 | 0.19
Photorhabdus luminescens subsp. laumondii | 8 | 7 | 7 | 0.78
Proteus mirabilis | 8 | 7 | 7 | 0.19
Pseudomonas entomophila | 8 | 7 | 7 | 0.07
Serratia proteamaculans | 8 | 7 | 7 | 0.59
Shigella boydii | 8 | 7 | 7 | 0.26
Shigella dysenteriae | 8 | 7 | 7 | 0.00
Shigella flexneri | 8 | 7 | 7 | 0.58
Shigella sonnei | 8 | 7 | 7 | 0.91
Sodalis glossinidius | 8 | 7 | 7 | 0.13
Yersinia pestis | 8 | 7 | 7 | 0.13
Klebsiella pneumoniae | 9 | 8 | 8 | 0.78
Marinomonas sp. MWYL1 | 9 | 8 | 8 | 0.13
Shewanella amazonensis | 9 | 8 | 8 | 0.39
Shewanella denitrificans | 9 | 8 | 8 | 0.78
Shewanella piezotolerans | 9 | 8 | 8 | 0.33
Shewanella putrefaciens | 9 | 8 | 8 | 0.90
Vibrio cholerae | 9 | 8 | 8 | 0.78
Vibrio vulnificus | 9 | 8 | 8 | 0.39
Aeromonas salmonicida subsp. salmonicida | 10 | 9 | 9 | 0.32
Colwellia psychrerythraea | 10 | 9 | 9 | 0.26
Pseudoalteromonas haloplanktis | 10 | 9 | 9 | 0.65
Shewanella frigidimarina | 10 | 9 | 9 | 0.07
Shewanella sp. ANA-3 | 10 | 9 | 9 | 1.09
Shewanella baltica | 11 | 10 | 10 | 1.36
Shewanella pealeana | 12 | 11 | 11 | 0.13
Vibrio parahaemolyticus | 12 | 11 | 11 | 0.65
Shewanella sediminis | 13 | 12 | 12 | 0.98
Vibrio fischeri | 13 | 12 | 12 | 0.91
Psychromonas ingrahamii | 11 | 10 | 10 | 0.39
Shewanella loihica | 9 | 8 | 9 | 0.96
Haemophilus somnus | 6 | 5 | 10 | 0.13
Haemophilus parasuis | 8 | 6 | 6 | 0.29
Actinobacillus pleuropneumoniae | 7 | 3 | 6 | 0.26
Average | 7.80 | 6.75 | 7.04 | 0.41

Bacteria: Spirochaetes (n=2)
Borrelia burgdorferi | 2 | 1 | 2 | NA
Borrelia garinii | 2 | 1 | 2 | NA
Average | 2.00 | 1.00 | 2.00 |

*NA: not applicable because there is only one 16S rRNA gene in the genome.