
PNAS PLUS

A null model for microbial diversification

Timothy J. Straub^a,1 and Olga Zhaxybayeva^a,b,2

^aDepartment of Biological Sciences, Dartmouth College, Hanover, NH 03755; and ^bDepartment of Computer Science, Dartmouth College, Hanover, NH 03755

Edited by Eugene V. Koonin, National Institutes of Health, Bethesda, MD, and approved May 10, 2017 (received for review December 12, 2016)

Whether prokaryotes (Bacteria and Archaea) are naturally organized into phenotypically and genetically cohesive units comparable to animal or plant species remains contested, frustrating attempts to estimate how many such units there might be, or to identify the ecological roles they play. Analyses of gene sequences in various closely related prokaryotic groups reveal that sequence diversity is typically organized into distinct clusters, and processes such as periodic selection and extensive recombination are understood to be drivers of cluster formation (“speciation”). However, observed patterns are rarely compared with those obtainable with simple null models of diversification under stochastic lineage birth and death and random genetic drift. Via a combination of simulations and analyses of core and phylogenetic marker genes, we show that patterns of diversity for the genera Escherichia, Neisseria, and Borrelia are generally indistinguishable from patterns arising under a null model. We suggest that caution should thus be taken in interpreting observed clustering as a result of selective evolutionary forces. Unknown forces do, however, appear to play a role in Helicobacter pylori, and some individual genes in all groups fail to conform to the null model. Taken together, we recommend the presented birth−death model as a null hypothesis in prokaryotic speciation studies. It is only when the real data are statistically different from the expectations under the null model that some speciation process should be invoked.

prokaryotic diversity | bacterial species | typing | genetic drift | neutral evolution

Significance

When evolutionary histories of closely related microorganisms are reconstructed, the lineages often cluster into visibly recognizable groups. However, we do not know if these clusters represent fundamental units of bacterial diversity, such as “species,” nor do we know the nature of evolutionary and ecological forces that are responsible for cluster formation. Addressing these questions is crucial, both for describing biodiversity and for rapid and unambiguous identification of microorganisms, including pathogens. Multiple competing scenarios of ecological diversification have been previously proposed. Here we show that simple cell death and division over time could also explain the observed clustering. We argue that testing for the signatures of such “neutral” patterns should be considered a null hypothesis in any microbial classification analysis.

Author contributions: T.J.S. and O.Z. designed research; T.J.S. performed research; T.J.S. and O.Z. analyzed data; and T.J.S. and O.Z. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Gene families are deposited to FigShare, available at https://dx.doi.org/10.6084/m9.figshare.4731946. Note that this is a derived dataset. The actual sequence data were taken from PATRIC database records, descriptions of which are supplied in Dataset S1.

^1Present address: Genomic Center for Infectious Diseases, Broad Institute, Cambridge, MA 02142.

^2To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1619993114/-/DCSupplemental.

Microbiologists have long debated whether a coherent species concept might apply to Bacteria (and the other prokaryotic domain, Archaea). Without such a concept, ad hoc conventions, for instance grouping taxa into “operational taxonomic units” (OTUs) delimited by regions of 16S ribosomal RNA (rRNA) genes showing at least 97% sequence identity, have been used for identification and quantification of prokaryotic diversity (1). Sometimes, however, 16S rRNA or other marker genes in environmental samples exhibit “microdiversity” at this or more stringent levels, with an overabundance of closely related sequences that could be interpreted as resulting from speciation-like ecological and genetic processes (2–4). For instance, in an early pivotal study of marine vibrios, Acinas et al. (5) observed “a large predominance of closely related taxa in this community” and concluded that “such microdiverse clusters arise by selective sweeps and persist because competitive mechanisms are too weak to purge diversity from within them.”

Two models have been most widely invoked to explain the evolutionary forces behind the formation of such clusters. Under the ecotype (periodic selection) model alluded to above, the progeny of the most-fit genotype in a clonally evolving bacterial population takes over the population, resulting in a selective sweep that purges diversity at all loci (6). In this case, microdiversity would be the consequence of neutral mutations occurring between sweeps (6, 7). Under the competing recombination model, the frequent exchange of material between genes within ecologically differentiated bacterial populations effects gene sequence similarity clustering (8, 9). Perhaps a combination of these forces is in place, or different diversification modes act on different bacterial groups, even if they coexist in the same environment (10–12). Under the selective forces of recombination and selective sweeps, nascent ecologically differentiated groups are proposed to undergo up to five stages that eventually result in their speciation (11). Whatever the genetic and ecological mechanisms, clustering patterns are suggested to be an “accurate” (13), “quick,” and “easy” (14) way to delineate bacterial species and are widely interpreted as indicative of the operation of evolutionary forces (e.g., refs. 12, 15, and 16).

However, a simple turnover of microbial populations due to their random diversification and extinction accompanied by accumulation of neutral changes in genomic DNA sequence will also inevitably produce clusters in the genealogies of the encoded genes (17). Intuitively, such clustering becomes apparent when one examines patterns on any tree-like diagram resulting from the processes of birth and death, but it can also be shown mathematically (e.g., ref. 18). That purely stochastic processes produce clusters raises the possibility that microdiverse clusters observed in gene trees might not actually reflect the operation of genetic or ecological forces of speciation.

In this study, we developed a statistical framework that simulates microbial diversity under a null model of birth–death and compares the created patterns to those observed in data collected from actual bacterial populations, such as 16S rRNA genes, housekeeping genes frequently used in multilocus sequence analyses (MLSA), and protein-coding gene families. We applied this framework to analyses of hundreds of genomes within four bacterial groups: Escherichia spp., Borrelia spp., Neisseria spp., and Helicobacter pylori. The first three of these

analyzed groups have been routinely subdivided into clusters designated as species and recognized as such in Bergey’s Manual, the authoritative classification compendium (15). However, for these three groups, the majority of ubiquitous gene families formed clusters indistinguishable from those simulated under the null model in the 1 to 5% DNA sequence divergence interval, and therefore no speciation processes need be invoked to explain the observed diversity.

www.pnas.org/cgi/doi/10.1073/pnas.1619993114 | PNAS Early Edition

Results and Discussion

A Null Model of Microbial Diversification. As a null model, we propose a diversification scenario in which a bacterial group does not experience evolutionary and ecological forces that either purge or promote diversity. Under the proposed model, the genealogy of diversifying bacterial populations (hereafter referred to as lineages) is influenced only by a random process of extinction (death) or divergence into two new lineages (birth). The net birth–death rate is constant, and recombination is absent. Additionally, the DNA sequences of the genomes of these lineages are assumed to accumulate substitutions at a constant rate. In population genetics terminology, the genes within the genomes of the simulated lineages experience only genetic drift.

To simulate evolution of such gene families, we implemented a two-step procedure (see Methods for detailed description). First, the evolutionary history (genealogy) of a gene family was simulated under the stochastic lineage birth–death model until all “extant” lineages could be traced back to one common ancestor. This resulted in a phylogenetic tree relating the extant lineages that contain the gene. Second, the DNA sequences of the gene in these extant lineages were established by evolving the DNA sequence of the common ancestor along the obtained genealogy under one of the simplest nucleotide substitution models [the F81 substitution model (19)]. The F81 model does not distinguish synonymous and nonsynonymous substitutions, which, given the likely strong constraints on the amino acid sequences of the protein-coding gene families to maintain protein function, may appear overly simplistic. However, at the examined level of divergence, the majority of the selected gene families are expected to be strongly conserved, and therefore a substantial fraction of their nucleotide substitutions are projected to be synonymous. For gene families that do deviate from this expectation, we expect the null model to be rejected (discussed in Framework for Comparison of Divergence Patterns in Real and Simulated Gene Families). Because, in the real bacterial groups, even conserved gene families exhibit a range of divergence (from hyperconserved to faster evolving), each simulated phylogenetic tree was scaled to a randomly selected value drawn from the distribution of the divergences observed for the real gene families of the analyzed bacterial group. Similarly, the composition and length of the DNA sequence of the common ancestor in each gene family simulation was informed by the respective parameters observed in the real data (see Methods for details). To simulate evolution of multiple gene families, the procedure was repeated 1,000 times.

Selection of Bacterial Groups and Compilation of Gene Families. To test the null hypothesis, we examined four bacterial groups (Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori) that had at least 100 genomes in public access (to ensure statistical power) and represented a range of lifestyles (to capture potentially different selective pressures that could have affected their diversification) (Table 1 and Dataset S1). Escherichia is a γ-proteobacterial genus comprised of six described species and a few yet unclassified strains (20–24). The analyzed 620 Escherichia genomes span representatives from environmental isolates to commensals and disease-causing strains of Escherichia coli, Escherichia albertii, Escherichia fergusonii, Escherichia hermannii, and the unclassified Escherichia spp. Members of the β-proteobacterial genus Neisseria colonize the mucosal and dental surfaces of many animals, and two of its 21 currently named species (25), Neisseria meningitidis and Neisseria gonorrhoeae, cause disease in humans (26). Neisseria cluster into “species groups” that do not match perfectly to named species (27), and the relationships between the species groups are unclear (2, 27, 28), perhaps as a result of Neisseria’s natural propensity for DNA uptake and recombination (29). The analyzed 232 Neisseria genomes represent 15 named species, although the majority are from N. meningitidis and N. gonorrhoeae. Named species of the spirochete Borrelia (30, 31) form two distinct groups associated with Lyme disease and relapsing fever, respectively (32), and, within the “Lyme disease group,” named species typically form distinct clusters (33). The 107 Borrelia genomes in our dataset are dominated by the Lyme disease-causing B. burgdorferi and B. garinii but overall represent 16 named species and include a few relapsing fever isolates. Finally, the human-associated ε-proteobacterium Helicobacter pylori (34, 35) is known for its unexpectedly high genomic diversity (36) and extensive recombination (37, 38), likely due to its natural competence. Thus, in contrast to the other three bacterial groups, we limited our analyses to 233 genomes within H. pylori and did not expand the analyses to the rest of the genus Helicobacter.

To assess genetic diversity patterns within each group, we selected markers frequently used for this purpose: the 16S rRNA gene; several genes known as MLSA loci (39, 40); and a subset of protein-coding genes hereafter referred to, for brevity, as “gene families.” For the last, many comparative bacterial genomic studies aiming at characterization of diversity and structure of a microbial community use single-copy universally or nearly universally present genes, so-called “core” genes (41). The relaxation of universality is needed to account for the substantial gene content variation even among closely related bacteria (42, 43), as well as for the variable quality and completeness of the genomes. We focused our analyses on a similar, “relaxed core single copy” subset of protein-coding gene families within each of the four groups (Table 1; see Methods for details). The relaxation of the requirement of a gene to be present in all members of the group by allowing it to be absent in up to 10% of genomes ensured that a sufficiently large number of gene families were available for analyses. Additionally, keeping only

Table 1. Summary of the gene families and their diversification patterns in the four analyzed bacterial groups

                                                      No. of gene families
Bacterial group                      No. of genomes   Total (%*)    D= (%†)        D+ (%†)        D− (%†)
Escherichia                          620              2,837 (54)    2,534 (89.3)   12 (0.4)       291 (10.3)
Borrelia                             107              304 (22)      303 (99.7)     0 (0)          1 (0.3)
H. pylori                            233              1,138 (68)    13 (1.1)       1,124 (98.8)   1 (0.1)
Neisseria                            232              1,244 (47)    835 (67.1)     14 (1.1)       395 (31.8)
Neisseria excluding N. gonorrhoeae   214              1,259 (46)    1,041 (82.7)   27 (2.1)       191 (15.2)

*Average percent of the gene families present in each genome of the group.
†Percent of the total number of gene families.
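The two-step simulation described under A Null Model of Microbial Diversification can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (their exact procedure is in Methods, which is not reproduced here). The function names, the rate values, the stopping rule (grow the tree until a target number of extant lineages is alive), and the `scale` parameter are all assumptions introduced for the example. The F81 step uses the standard property of that model that, along a branch of length t, each site is redrawn from the equilibrium base frequencies with probability 1 − exp(−βt), where β = 1/(1 − Σπ²).

```python
import math
import random

def simulate_birth_death(n_extant, birth=1.0, death=0.5, rng=random):
    """Constant-rate birth-death tree (hypothetical stopping rule: n_extant live tips).

    Returns (parent, branch, alive): parent[i] is the parent index of node i
    (None for the root), branch[i] is the branch length above node i, and
    alive lists the indices of extant tips.
    """
    while True:  # restart whenever every lineage has gone extinct
        parent, born, ended = [None], [0.0], [None]
        alive, t = [0], 0.0
        while 0 < len(alive) < n_extant:
            t += rng.expovariate((birth + death) * len(alive))
            node = alive.pop(rng.randrange(len(alive)))
            ended[node] = t  # the lineage either splits or dies at time t
            if rng.random() < birth / (birth + death):
                for _ in range(2):  # birth: two daughter lineages
                    parent.append(node)
                    born.append(t)
                    ended.append(None)
                    alive.append(len(parent) - 1)
        if alive:
            # let the youngest tips accumulate some time before sampling
            t += rng.expovariate((birth + death) * len(alive))
            branch = [(e if e is not None else t) - b for b, e in zip(born, ended)]
            return parent, branch, alive

def evolve_f81(parent, branch, root_seq, freqs, scale=1.0, rng=random):
    """Evolve root_seq down the tree under F81; freqs are A, C, G, T frequencies."""
    beta = 1.0 / (1.0 - sum(f * f for f in freqs))
    seqs = [None] * len(parent)
    for i in range(len(parent)):  # parents always precede their children
        if parent[i] is None:
            seqs[i] = root_seq
            continue
        p_redraw = 1.0 - math.exp(-beta * scale * branch[i])
        seqs[i] = "".join(
            rng.choices("ACGT", freqs)[0] if rng.random() < p_redraw else base
            for base in seqs[parent[i]]
        )
    return seqs

# Example: one simulated gene family with 20 extant lineages (values are arbitrary)
rng = random.Random(1)
parent, branch, tips = simulate_birth_death(20, rng=rng)
freqs = [0.25, 0.25, 0.25, 0.25]
root = "".join(rng.choices("ACGT", freqs, k=300))
tip_seqs = [evolve_f81(parent, branch, root, freqs, scale=0.05, rng=rng)[i] for i in tips]
```

In the study itself, the tree scaling and the ancestral sequence length and composition were drawn from the real data for each bacterial group; here they are fixed placeholder values.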

“single-copy” genes eliminated gene families expanded via gene duplication and/or horizontal gene transfer and therefore likely to have their evolutionary histories deviate from the assumptions of the null birth−death model.

Framework for Comparison of Divergence Patterns in Real and Simulated Gene Families. Divergence patterns presented as phylogenetic trees are difficult to compare, especially in large-scale analyses. Instead, for each gene family, we calculated the number of clusters at a specific level of divergence using furthest-neighbor hierarchical clustering (44) and summarized the data as a cluster−divergence (CD) curve (Fig. 1) (inspired by ref. 5). If a special process, like recombination or periodic selection, influences diversification of real gene families, the shapes of their CD curves are expected to differ from those of the simulated gene families. If no difference is detected, then it is inappropriate to conclude that such processes delimit the observed discrete groups. Because the focus of the study is on examination of bacterial microdiversity and biological inferences drawn therefrom, we have limited our analyses to evaluation of the divergence patterns that correspond to 0.01 to 0.05 nucleotide substitutions per gene family. The range was chosen because the clusters of prokaryotic sequences thought to mark “species” exhibit, on average, less than 5% and 3% of nucleotide substitutions per genome (45) and 16S rRNA gene (46), respectively.

To assess whether the differences in CD curve shapes between real and simulated data are significant, we developed a statistical test reminiscent of the widely used Kolmogorov−Smirnov (KS) test (47). The deviations in shapes of two CD curves are captured in a distance metric Dmax, which measures how maximally far apart the CD curves are in a specified interval of divergence (see Methods for details). When Dmax between a real gene family and the median of the simulated gene families is compared with the distribution of Dmax among simulated gene families, the former distance metric can be classified as either falling within 95% of the values for the simulated gene families (i.e., indistinguishable from the diversification patterns under the null model; referred to as category D=) or significantly differing from the null model by being in the upper (D+) or lower (D−) 2.5 percentile. The comparisons of the KS test to our test show that ours is less conservative (Fig. S1), and thus is less likely to generate false matches to the null model.
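The comparison framework above can be illustrated with a short sketch. The exact definition of Dmax is given in the paper's Methods and is not reproduced here, so the version below is an assumption: Dmax is taken as the signed difference of largest magnitude between a CD curve and the median simulated CD curve over the chosen divergence thresholds. The use of uncorrected p-distances, the function names, and the percentile cutoffs computed with `statistics.quantiles` are likewise illustrative choices.

```python
import statistics

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (assumed metric)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def merge_heights(seqs):
    """Furthest-neighbor (complete-linkage) clustering; returns the merge heights in order."""
    n = len(seqs)
    dist = {(i, j): p_distance(seqs[i], seqs[j]) for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]

    def link(c1, c2):  # furthest pair of members between two clusters
        return max(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    heights = []
    while len(clusters) > 1:
        h, a, b = min(
            (link(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        heights.append(h)
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return heights

def cd_curve(seqs, thresholds):
    """Number of clusters at each divergence threshold."""
    heights = merge_heights(seqs)
    return [len(seqs) - sum(h <= t for h in heights) for t in thresholds]

def d_max(curve, reference):
    """Signed difference of largest magnitude between two CD curves (assumed definition)."""
    return max((c - r for c, r in zip(curve, reference)), key=abs)

def classify(real_curve, sim_curves):
    """Assign D=, D+ or D- by comparing Dmax with the simulated (null) distribution."""
    median = [statistics.median(col) for col in zip(*sim_curves)]
    null = [d_max(c, median) for c in sim_curves]
    cuts = statistics.quantiles(null, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    d = d_max(real_curve, median)
    if d > cuts[-1]:
        return "D+"
    if d < cuts[0]:
        return "D-"
    return "D="

# Thresholds spanning the 0.01 to 0.05 divergence interval used in the study
thresholds = [0.01 + 0.001 * k for k in range(41)]
```

With simulated families produced under the null model, `classify(cd_curve(real_seqs, thresholds), [cd_curve(s, thresholds) for s in simulated])` would return one of the three categories. The naive clustering loop is fine for a few dozen sequences but would need a faster implementation for real datasets.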

[Fig. 1 plot: genealogy of an example gene family drawn above its CD curve. Axes: Number of Clusters (y, log scale) vs. Divergence (x).]

Fig. 1. An illustration of how diversification patterns of a gene family are represented in a CD plot. For each pair of taxa in a gene family, the nucleotide sequences are converted to pairwise distances. The distances are then clustered at a specific level of divergence using the furthest-neighbor algorithm (44). The number of clusters (y axis) is plotted against divergence (x axis; measured in number of substitutions per site). If all taxa in the gene family evolve at the same rate, there is a one-to-one correspondence between the points on the CD curve and the nodes of the genealogy of the examined gene family (shown above the plot), as exemplified by several dashed lines. In this specific case, the CD curve is identical to a Lineage-Through-Time (LTT) plot (106), a widely used approach (51, 107) that transforms the patterns of divergence on phylogenetic trees and assesses them quantitatively via the γ statistic (50, 108). However, although the equal rate of substitutions is assumed in our simulations under the null model, it is often violated in the real bacterial gene families. Therefore, the CD plots of real data are not LTT plots, and we have developed a statistical test distinct from the γ statistic (see Methods for details).

Diversification Patterns Observed in Gene Families of Bacterial Groups Are Often Indistinguishable from Those Generated Under the Null Model. Similarity of 16S rRNA gene sequences is one of the widely used approaches implemented to circumscribe OTUs, which are often used in lieu of species to describe microbial diversity within a group of microorganisms or in a microbial community (16). However, we found that, for all four analyzed groups, the clustering observed in the real 16S rRNA genes is indistinguishable from the simulations under the null model (Fig. 2A and Fig. S2). Ubiquitous housekeeping genes are also commonly used to identify clustering on species or “clonal complex” levels (e.g., for pathogen typing) via MLSA (48), and most of these loci are among the identified gene families. The majority of the MLSA loci-corresponding gene families of Escherichia, Borrelia, and Neisseria were classified as D= (Fig. 2B, Fig. S3, and Table S1), and therefore their diversification patterns are also indistinguishable from those simulated under the null model.

Similar to the clustering patterns of their 16S rRNA and MLSA gene families, 89.3% of Escherichia and 99.7% of Borrelia relaxed core protein-coding gene families are classified as D= (Table 1, Fig. S4, and Fig. 3). This result is surprising, given that our gene family identification likely disproportionally selected very conserved gene families expected to be under purifying selection. Taken together, our observations imply that, for Escherichia and Borrelia, the evolutionary histories of the majority of gene families, including 16S rRNA and the MLSA loci, are statistically indistinguishable from the genes simulated to evolve under genetic drift. If recombination or periodic selection impacts these groups (49), they are not sufficiently strong or constrained by phylogeny to produce sequence clusters that are different from the simple model of random turnover of lineages over time.

However, the results were somewhat different for H. pylori and Neisseria. In stark contrast to Escherichia and Borrelia, 98.8% of H. pylori gene families, including all MLSA loci, were not D= (Fig. 4, Fig. S3, Table 1 and Table S1). In addition to such a dramatic number of H. pylori non-D= gene families, all but one were D+ (Fig. 4D). Therefore, it is likely that the H. pylori lineage experiences evolutionary forces that violate at least some of the null model assumptions of constant diversity, constant rate of lineage birth/death, and absence of recombination. Such patterns of diversification could result from longer terminal branches

[Fig. 2 plots: panel A, 16S rRNA simulations and 16S rRNA; panel B, gene family simulations and MLSA loci. Axes: # of Clusters (y, log scale) vs. Divergence (x, 0.06 to 0.00).]

Fig. 2. The divergence patterns of (A) the Escherichia 16S rRNA gene and (B) most of the MLSA loci are indistinguishable from those simulated under the null model of diversification. (A) CD curves of 100 simulations are shown in black, and real data are shown in purple. Based on our statistical test, the 16S rRNA gene CD curve is classified as D= in the 0.01 to 0.05 divergence interval. (B) CD curves of 1,000 gene family simulations are shown in black, and real data are shown in purple (D=) and blue (D−). Based on our statistical test, all but one of the MLSA loci CD curves are classified as D= in the 0.01 to 0.05 divergence interval (Table S1). The unit of divergence on the x axis of A and B is number of substitutions per site.

than expected under the pure birth−death process. Longer terminal branches can be, for example, a result of the variable net rate of birth and death, or due to niche expansion (50) or ecological diversification (51). Consistent with the latter possibility, the phylogenetic trees of H. pylori strains and human mitochondrial DNA from various geographic regions are congruent (35). High levels of recombination are also shown to result in phylogenetic trees that have longer terminal branches and shorter internal branches (9, 52). Given that H. pylori is one of the most mutagenic and recombinogenic bacteria known (53), with the recombination rate being at least an order of magnitude higher than the substitution rate (54), the observed long terminal branches may, in addition, reflect the combination of elevated substitution and recombination rates.

Members of the Neisseria genus are also known to be highly recombinogenic, albeit to a lesser degree than H. pylori (55). Although the majority of Neisseria gene families were classified as D= (Fig. S4E and Table 1), the fraction (67.1%) was much smaller than for Borrelia and Escherichia. Also, unlike in H. pylori comparisons, most of the non-D= gene families were D− (Fig. S4F and Table 1). These observations suggest that processes shaping diversification of Neisseria are different from those of H. pylori, Escherichia, and Borrelia. Although all known Neisseria strains have a natural propensity for DNA uptake, there are barriers to successful between-strain gene exchange (29), which is probably why there are recognizable (albeit fuzzy) clusters on phylogenetic networks of marker genes (25). In particular, a genetically uniform group of N. gonorrhoeae forms a clearly isolated cluster (25), which may be explained by N. gonorrhoeae’s restriction to the urogenital tract, an ecological niche distinct from the oral and nasal cavities colonized by most known Neisseria spp. (25). Suspecting that such distinct patterns of N. gonorrhoeae diversification and “geographic” isolation may have impacted shapes of CD curves of its gene families, we reanalyzed the Neisseria group without N. gonorrhoeae genomes. In this reduced dataset, 82.7% of Neisseria gene families became statistically indistinguishable from the simulated gene families (Table 1 and Fig. S4H), making the overall results comparable to those of Escherichia and Borrelia. However, Neisseria are more recombinogenic than Escherichia (55) and Borrelia (32), suggesting that, even in the presence of substantial amounts of recombination, the null model cannot be easily rejected.

Effects of Sampling on the Inferred Divergence Patterns. Incomplete and biased sampling (for example, of only pathogenic or drug-resistant strains within a species) can influence clustering independently of any speciation-like processes (22, 56) and, as a result, can affect measurements of clade diversifications (57). Even though hundreds to thousands of gene families were analyzed for hundreds of genomes within each group, some homologs within a gene family have identical nucleotide sequences, and, as a result, a gene family may contain fewer than 100 nonidentical (“unique”) nucleotide sequences. Therefore, our limited datasets likely miss some of the microdiversity that exists in real, fully sampled bacterial populations. If undersampling influences the apparent diversification patterns, the calculations of the values of Dmax for a given null distribution could be affected, which, subsequently, will result in false assignment of gene families into D=, D+, and D− categories. However, when simulated gene families were randomly subsampled at 25%, 50%, and 75% levels, the patterns of diversification of the subsampled datasets were not significantly different from each other and from the 100% sampled set (one-way ANOVA; F(3,396) = 0.115; P = 0.951; Fig. S5). Thus, the observed diversification patterns and the summary statistic Dmax appear to be robust to incomplete random sampling of a microbial community, consistent with findings of ref. 50.

However, isolates with available genomes may not represent random sampling of both cultured strains and overall microbial community diversity. These nonrandom sampling biases are difficult to pinpoint and simulate. To investigate the impact of such potential biases empirically, we compared diversification patterns observed in the detected gene families to those in the much more extensive collection of the MLSA database records (refs. 39 and 40 and Table S1). Although these records are likely to have skewed sampling due to their bias for pathogenic strains, when corrected for the difference in the absolute number of nonidentical sequences between the two datasets, the overall

[Fig. 3 plots: panel A, simulations; panel B, simulations and D=; panel C, simulations and D−; panel D, simulations, D+, and FliC (D+). Axes: # of Clusters (y, log scale) vs. Divergence (x, 0.6 to 0.0).]

Fig. 3. The divergence patterns of most Escherichia gene families are indistinguishable from those simulated under the null model of diversification. (A) CD curves of 50 gene families randomly sampled from the 1,000 gene families simulated under the null model. The simulations were tailored to this dataset as described in Methods. (B−D) CD curves of 2,837 Escherichia gene families divided into three categories: statistically indistinguishable from simulations [(B) D=, n = 2,534] or significantly different [(C) D−, n = 291; (D) D+, n = 12]. Highlighted in red is the most divergent gene family, fliC, the H antigen in E. coli. The curves in B and C were randomly subsampled to 50 for representation purposes.

shapes of CD curves of the MLSA loci were qualitatively comparable to those of the real gene families (Fig. S3 D–G). Future analyses of genomic sets of a substantially larger size will allow testing of the null hypothesis at precisions higher than used in this study.

Dmax as a Metric for Identifying Gene Families Under Selection and Those Suitable as Phylogenetic Markers of Microdiversity. Are there any common properties of non-D= gene families that may hint at the evolutionary and ecological forces that shape their evolutionary history? Compared with D= gene families, D− and D+ gene families had significantly smaller and larger tree heights (i.e., the sum of all branch lengths of a phylogenetic tree), respectively (Fig. S6A). This relation was not due to elevated substitution rates in just a few taxa within a gene family: In many cases, the overall substitution rates for gene families in D+, D=, and D− categories differed by an order of magnitude, as reflected in the significantly higher and lower number of nonidentical sequences in D+ and D− gene families, respectively (Fig. S6B), and as exemplified by trees of selected gene families (Fig. S7). Taken together, these observations suggest that, in general, D+ gene families represent faster-evolving genes, whereas D− gene families are very conserved.

Consistent with this hypothesis, 291 D− gene families of Escherichia are significantly enriched in genes encoding translation machinery and significantly depleted for several categories of metabolic genes (Fig. S8), a functional bias expected for conserved genes (58). Additionally, the only D− gene families in Borrelia and H. pylori (acpP and rps19, respectively; Fig. S4C and Fig. 4C) exhibit strong purifying selection in 95% and 98% of their codons (Table 2). (Among Neisseria’s D− gene families, no significant depletion or enrichment of functional categories was detected after correction for multiple testing, possibly reflecting the lack of power to detect statistical significance from the smaller dataset.)

Among the identified D+ gene families, fliC of Escherichia exhibits a notably elevated rate of nonsynonymous substitutions, with 18% and 43% of the gene’s sites inferred to be under positive and neutral selection, respectively (Table 2). The fliC gene encodes flagellin, the flagellar filament protein. FliC is known to be a highly polymorphic gene, the evolutionary history of which is shaped by both fixation of mutations and horizontal gene transfer, not only in Escherichia/Salmonella (59) but also in other bacterial pathogens (60, 61). Given that fliC protein serves as an antigen (known as an H antigen of Gram-negative bacteria) and an important factor in animal innate immune response (62), generation of fliC allelic diversity via mutation and recombination is one of the strategies for antigenic variation necessary for the arms race against host immune responses (59, 63).

Similarly, among the numerous H. pylori D+ gene families, mod-1 and omp3 exhibit notably elevated substitution rates (Table 2). In many pathogens, mod genes are hypothesized to function as regulators that control gene expression of surface antigens (64, 65). Interestingly, H. pylori exhibits a very high level of variability in its methyltransferase-encoding genes even in

Straub and Zhaxybayeva PNAS Early Edition | 5of10 Downloaded by guest on September 27, 2021 ABSimulations Simulations D= # of Clusters # of Clusters 125 200 5102050 2 1 12 5102050200 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Divergence Divergence CD Simulations Simulations Ribosomal protein S19 (D-) D+ mod−1 (D+) omp3 (D+) # of Clusters # of Clusters 2 5 10 20 50 200 1 12 5102050200 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 Divergence Divergence

Fig. 4. The divergence patterns of most H. pylori gene families are distinct from those simulated under the null model of diversification. (A) CD plot of the 50 gene families randomly sampled from the 1,000 gene families simulated under the null model. The simulations were tailored to this dataset as described in Methods. (B−D) CD plots of 1,138 H. pylori gene families divided into three categories: statistically indistinguishable from simulations [(B) D=, n = 13] or significantly different [(C) D−, n = 1; (D) D+, n = 1,124]. The only D− gene family (rps19) encodes ribosomal protein S19. Highlighted in red and green are two of the most divergent D+ gene families: mod-1, encoding a type III restriction modification system methyltransferase, and omp3, encoding an outer membrane protein. For representation purposes, the remaining D+ gene families shown in D are a subset of 50 randomly sampled families.
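A CD plot such as those in Fig. 4 reports, for each divergence cutoff, how many clusters a gene family falls into. The following is a minimal sketch of that calculation, assuming a symmetric pairwise distance matrix is already available. The function names `count_clusters` and `cd_curve` are hypothetical, and the naive furthest-neighbor (complete-linkage) procedure shown here only illustrates the idea; it is not the Mothur implementation used in the study and ignores Mothur's precision parameter.

```python
def count_clusters(dist, cutoff):
    """Count clusters at one divergence cutoff (hypothetical helper).

    Furthest-neighbor rule: two clusters may merge only if the LARGEST
    distance between any of their members does not exceed the cutoff.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (linkage distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if link <= cutoff and (best is None or link < best[0]):
                    best = (link, a, b)
        if best is None:  # no pair of clusters can be merged at this cutoff
            break
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return len(clusters)


def cd_curve(dist, cutoffs):
    """Number of clusters (y axis) at each divergence cutoff (x axis)."""
    return [count_clusters(dist, c) for c in cutoffs]


# Toy example: two tight pairs of sequences separated by a large distance.
toy = [
    [0.00, 0.01, 0.10, 0.10],
    [0.01, 0.00, 0.10, 0.10],
    [0.10, 0.10, 0.00, 0.02],
    [0.10, 0.10, 0.02, 0.00],
]
toy_curve = cd_curve(toy, [0.0, 0.01, 0.02, 0.05, 0.10])
```

The cubic-time search is chosen for readability; for gene families with hundreds of sequences a dedicated clustering tool would be the practical choice.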

isolates from a restricted geographic region (66), highlighting the importance of diversifying selection on these loci in this pathogen. H. pylori also carries an unusually large and diverse family of outer membrane proteins (including omp3), and this gene family diversity is hypothesized to be adaptive (67) and to influence the pathogenic success of this bacterium (68).

Not all D+ gene families exhibit signatures of positive selection, as exemplified by pilS and mutL gene families in Neisseria (Table 2). The pilS gene is part of the two-component signal transduction system PilR/PilS that regulates transcription of the pilin gene in Pseudomonas (69), a component of type IV pili present in Neisseria (70). Although it is not clear why pilS in Neisseria has the observed elevated substitution rates, notably, in N. gonorrhoeae, it no longer controls piliation but is still expressed, suggesting the recruitment for some other function (71). Such neofunctionalization, if occurring in Neisseria, may underlie observed relaxation of selection

Table 2. Inference of selection in certain gene families with high Dmax

The last three columns give the proportion of sites in three categories of the M2a model, with the estimated dN/dS in parentheses.

| Bacterial group | Gene family (its functional annotation) | Category | Tree height, substitutions/site | No. of unique (total) sequences | Purifying | Neutral | Positive |
|---|---|---|---|---|---|---|---|
| Escherichia | fliC (flagellin) | D+ | 40.39 | 138 (610) | 0.40 (0.04) | 0.42 (1) | 0.18 (1.94) |
| Borrelia | acpP (acyl carrier protein) | D− | 8.9 | 28 (95) | 0.95 (0.02) | 0.05 (1) | 0 (-) |
| H. pylori | rpS19 (ribosomal protein S19) | D− | 0.73 | 70 (232) | 0.99 (0.04) | 0 (1) | 0.01 (4.89) |
| H. pylori | mod-1 (type III restriction-modification system methyltransferase) | D+ | 29.56 | 185 (233) | 0.53 (0.08) | 0.43 (1) | 0.04 (3.71) |
| H. pylori | omp3 (outer membrane protein) | D+ | 17.66 | 188 (241) | 0.61 (0.11) | 0.28 (1) | 0.11 (3.16) |
| Neisseria | mutL (MMR) | D+ | 11.89 | 93 (229) | 0.86 (0.03) | 0.13 | 0.01 (2.74) |
| Neisseria | pilS (transcription regulator) | D+ | 27.53 | 88 (233) | 0.91 (0.11) | 0.09 (1) | 0 (-) |
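The gene families in Table 2 were chosen by their Dmax values. As an illustration of how that statistic is built from CD curves, here is a minimal sketch following the definitions given in Methods. All function names are hypothetical, and two choices are assumptions rather than statements about the authors' code: the sign of the largest-magnitude D(x) is retained so that D+ and D− can be told apart, and the 2.5% and 97.5% cut points are taken from `statistics.quantiles` with its default method.

```python
from statistics import median, quantiles


def mad(values):
    """Median absolute deviation of a list of cluster counts."""
    m = median(values)
    return median(abs(v - m) for v in values)


def d_statistic(y, null_counts):
    """D(x) at one divergence x.

    y           -- number of clusters in the gene family of interest at x
    null_counts -- numbers of clusters in the simulated families at x
    The difference from the null median is divided by the MAD only when
    the null distribution is variable (MAD > 1).
    """
    diff = y - median(null_counts)
    spread = mad(null_counts)
    return diff / spread if spread > 1 else diff


def d_max(curve, null_curves):
    """D(x) of largest magnitude across the cutoffs (sign kept: assumption)."""
    ds = [
        d_statistic(y, [nc[k] for nc in null_curves])
        for k, y in enumerate(curve)
    ]
    return max(ds, key=abs)


def classify(curve, null_curves):
    """Assign a CD curve to 'D+', 'D-' or 'D=' relative to the null curves."""
    # Null distribution of Dmax: every simulated family against all of them.
    null_dmax = [d_max(nc, null_curves) for nc in null_curves]
    cuts = quantiles(null_dmax, n=40)  # cut points every 2.5%
    low, high = cuts[0], cuts[-1]
    value = d_max(curve, null_curves)
    if value > high:
        return "D+"
    if value < low:
        return "D-"
    return "D="


# Toy null distribution: 100 made-up "simulated" CD curves at three cutoffs.
toy_null = [[10 + i % 5, 6 + i % 3, 3 + i % 2] for i in range(100)]
```

In this toy setting, `classify([40, 30, 20], toy_null)` falls far above the null range and `classify([1, 1, 1], toy_null)` far below it; real use would substitute cluster counts at the 0.01 to 0.05 divergence cutoffs.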

on the pilS nucleotide sequence. MutL, a gene encoding a component of the mismatch repair (MMR) system, is known to be frequently horizontally transferred (72, 73) and shows evidence for extensive intragenic recombination (72). In agreement with our analyses (Table 2), most of the nucleotide variation observed in mutL homologs does not cause a change in amino acid (72), and horizontal acquisition of the mutL gene is hypothesized to be a mechanism for functional restoration of the MMR system in hyperrecombinogenic mutants (72).

Combined with the observation that most of the highly recombinogenic H. pylori gene families are in the D+ category, D+ gene families may be enriched in genes that have undergone strong diversifying selection of the corresponding amino acid sequence and/or experienced extensive horizontal gene transfer. Regardless of the underlying evolutionary forces, D+ gene families promise to be useful phylogenetic markers for delineating microdiversity within a group of closely related bacteria. Some of the identified D+ gene families have already been either shown to be efficacious markers for typing of pathogenic isolates [e.g., porB (74) and lip (75) for Neisseria, and the above-discussed fliC (76, 77) for E. coli] or identified to have a potential to be such markers [e.g., E. coli housekeeping genes ugd and galF that often flank frequently horizontally exchanged genes involved in the O antigen biosynthesis (78, 79)]. Therefore, the presented framework could help identify loci promising for describing bacterial genotypic variation that is driven by ecological forces.

Concluding Remarks: Implications for the Search for the Biological Process Underlying Bacterial Speciation. Our analysis shows that many of the diversification patterns observed in 16S rRNA genes, MLSA markers, and core gene families are indistinguishable from a random process of lineage birth−death and, therefore, could be a result of neutral lineage turnover. Hence, for these markers, the observed genotypic clustering alone cannot be used as evidence for existence of selection-driven ecological processes behind its formation, as is often done (6, 14, 42, 43, 80), even if the patterns are compatible with such models. Taken together with the recent finding that patterns of diversification arising from distinct processes of periodic selection and recombination also mimic each other in appearance (10), our analysis highlights yet another challenge to identifying a single biological process that defines bacterial species (81). Moreover, distinct evolutionary patterns of divergence in most gene families of the H. pylori group and many gene families of Neisseria group hint that clustering patterns observed in bacterial groups are not formed by one universal process, and the answer to the question "How do bacteria diverge?" is likely to be group-specific and not generalizable, as has been hypothesized (49). We recommend neutral birth−death models such as that used here as null hypotheses in future prokaryotic speciation studies.

Methods

Sources for the Bacterial Datasets. Nucleotide and amino acid sequences of the protein-coding genes from the genomes of Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori (Dataset S1) were obtained from the Pathosystems Resource Integration Center (PATRIC) database (82) between June 2014 and April 2015. Nucleotide sequences of 16S rRNA and MLSA genes were extracted from the same PATRIC records. Additional nucleotide sequences from MLSA loci for E. coli (40), B. burgdorferi (33), Neisseria (39), and H. pylori (39) (Table S1) were obtained in April 2015 from the MLST databases hosted at https://pubmlst.org/ and mlst.warwick.ac.uk/mlst/dbs/Ecoli.

Identification of Relaxed Single-Copy Core Gene Families. Amino acid sequences of protein-coding genes in the Escherichia spp., Neisseria spp., Borrelia spp., and Helicobacter pylori groups (analyzed separately) were used as an input in all-versus-all sequence similarity searches, which were conducted using Afree v2.0beta (83) with a word size of five and a minimum Sørensen-Dice (SD) similarity index (84, 85) of 10. The obtained SD index-based similarity relationships among the genes were used to determine the gene families using the Markov Cluster Algorithm v14-137 (86, 87) with an inflation parameter of 1.2.

The acquired gene families were consequently filtered to identify relaxed core gene families by requiring the genes in a gene family to be present in most (but not all) genomes of the analyzed group. Because the four analyzed bacterial groups have different numbers of genomes (Table 1), this requirement translated into a constraint for the gene family homologs to be present in 90% (Borrelia), 95% (Neisseria and H. pylori), and 99% (Escherichia) of genomes, with less stringency imposed on bacterial groups with a smaller number of genomes. The relaxed core gene families were further filtered for single-copy relaxed core gene families, defined as those with less than an average of 1.1 gene copies per genome across all genomes in a group. Finally, to avoid inclusion of homologs that only shared a protein domain, sequences that were 2 standard deviations outside of the mean amino acid sequence length of the gene family were removed. For brevity, we refer to the resulting single-copy relaxed core gene families simply as gene families throughout the manuscript. All further analyses were performed on the nucleotide sequences of the gene families, which are available via FigShare (https://dx.doi.org/10.6084/m9.figshare.4731946).

Modeling Genealogies of Gene Families That Evolve Under the Null Model. To simulate evolutionary history of a gene family that is not affected by ecological and evolutionary forces (our null model), we implemented a simple birth−death process known as the Hey/Moran model (88, 89). Under this model, at each generation, a constant-size population of n gene family-containing lineages experiences only an extinction of one randomly chosen lineage (death) and a duplication of a different, also randomly chosen, lineage (birth). The birth−death process is repeated until all n lineages in the population, when tracked back in time, coalesce to a single common ancestor (18). The resulting genealogical relationships among n extant lineages can be summarized in a phylogenetic tree. To do so, we first define a distance between two extant lineages as a value proportional to the number of generations to their last common ancestor, then calculate pairwise distances for all extant lineages, and, finally, transform the resulting distance matrix into a phylogenetic tree using the neighbor-joining method (90), as implemented in Biopython v1.64 (91). The Python source code of the birth−death model is available at https://github.com/ecg-lab/microdiversity/.

The simulations were independently repeated 1,000 times to obtain evolutionary histories of 1,000 gene families, and the procedure was repeated separately for each bacterial group. Given that the real gene families varied in size, the size of each simulated gene family (i.e., the number of lineages n) was drawn from the distribution of the observed real gene family sizes in the respective bacterial group.

Simulation of Nucleotide Sequences of Neutrally Evolving Gene Families. Among the gene families of real bacterial groups, we expect to observe a gamut of gene lengths, variable nucleotide composition, and level of sequence conservation (which can be measured by tree height) that ranges from "ultraconserved" to "quickly evolving" gene families. This variation was captured as distributions of the parameters observed within each analyzed bacterial group. The distributions were used to mimic variation among the simulated gene families. Specifically, the genealogical history of each simulated gene family was scaled with a tree height randomly drawn from the observed range of tree heights. The length of the gene in a simulated gene family was drawn from the distribution of observed gene lengths. Lastly, the nucleotide sequence of the last common ancestor (i.e., the sequence at the root node of the simulated gene family's genealogy) was generated randomly, but with nucleotide composition drawn from the distribution of observed nucleotide frequencies in real gene families. Using the ancestral sequence and the scaled simulated genealogical history as an input tree, nucleotide sequences of a simulated gene family were evolved in Seq-Gen v1.3.3 (92) under the F81 substitution model (19). This model takes into account empirical nucleotide frequencies but, otherwise, does not make any additional assumptions about how nucleotide changes occur over time, and therefore represents neutral accumulation of the substitutions over time. No insertions or deletions were simulated.

Adjustments for the Analyses of 16S rRNA Gene. Due to presence of multiple, often nonidentical copies of 16S rRNA genes (93, 94) in many of the genomes in the four analyzed groups, and much higher nucleotide conservation of 16S rRNA gene, a slightly different methodology was used for both the assembly of 16S rRNA gene family from the analyzed genomes and the simulation of 16S rRNA-like sequences. Specifically, 16S rRNA-like gene genealogies were simulated 100 times, using the number of unique rRNA sequences in the

real bacterial group as a value of n. For the corresponding nucleotide sequence simulations, these genealogies were scaled with respect to the maximum divergence observed for the real 16S rRNA gene family, and the length of the ancestral nucleotide sequence was set to 1,500 bp. The nucleotide sequences were evolved in Seq-Gen v1.3.3 (92) under the F81 substitution model (19). Because presence of multiple nonidentical copies of a 16S rRNA gene per genome may affect the inferred clustering patterns, we also constructed an additional 16S rRNA gene family that included only one, randomly selected, full-length, 16S rRNA gene copy per genome, and we investigated the differences in observed clustering in the full and subsampled 16S rRNA gene families (Fig. S2A). As expected, such data subsampling underestimated the amount of observed microdiversity, but the overall shapes of the two CD curves remained similar (Fig. S2A). Given that our simulations assume one gene copy per lineage, we used subsampled 16S rRNA gene family in the comparisons of real and simulated data.

Alignments of Gene Families. Nucleotide sequences of each real gene family, as well as of the MLSA loci, were aligned using the MUSCLE v3.8.31 program (95) with a maximum number of iterations set to 16. Sequences with fewer than 50 nucleotides aligned to all other sequences were removed. Alignments of 16S rRNA genes were calculated using the SINA program (96), which incorporates structural information of the rRNA. Alignment was unnecessary for simulated gene families, as simulations did not incorporate insertions or deletions of nucleotides.

Quantification of Divergence Patterns. For each gene family, pairwise nucleotide distances were calculated from the gene family alignment using the Mothur v1.32.1 program (44). These distances were not corrected for multiple substitutions, because we assumed that genomes within analyzed groups have recently diverged and therefore are not expected to accumulate many multiple substitutions, and, additionally, we analyzed only distances between 0.01 and 0.05 (i.e., divergence of 1 to 5%). The calculated distances were used to group taxa into clusters at various distance cutoffs using the furthest-neighbor algorithm with a precision parameter of 1,000, as implemented in Mothur v1.32.1 (44). The resulting patterns of divergence were captured in a CD curve, in which the number of clusters in a gene family (y axis) is a function of genetic divergence (x axis) (Fig. 1). The shape of the CD curve is expected to reflect the evolutionary processes (e.g., selective sweeps, recombination) that influence the history of the gene family.

Assignment of Statistical Significance to Differences in Divergence Patterns. To determine if the shape of a given gene family's CD curve is significantly different from that of the simulated neutrally evolving gene families (the null distribution of CD curves), we developed a statistical method inspired by an application of the KS test (47) to the comparison of evolutionary histories (97). Although our test resembles the KS test, we introduced improvements that allowed us to assess differences in specific parts of two curves, to take into account how variable the null distribution is, and to discriminate between two classes of significantly different CD curves (that we designate as D+ and D−; see below).

Specifically, in our method, the distance D(x) between a CD curve Y of a real gene family and a distribution of CD curves of the simulated gene families N = {N_1, N_2, ..., N_1000} (the null distribution) at a particular divergence x is defined as

\[
D(x) =
\begin{cases}
Y(x) - \operatorname{median}(N(x)), & \text{if } \operatorname{MAD}(N(x)) \le 1; \\[6pt]
\dfrac{Y(x) - \operatorname{median}(N(x))}{\operatorname{MAD}(N(x))}, & \text{if } \operatorname{MAD}(N(x)) > 1,
\end{cases}
\]

where Y(x) is the number of clusters in the given gene family at divergence x, median(N(x)) is the median number of clusters of the null distribution N at divergence x, and MAD(N(x)) = median(|N_i(x) − median(N(x))|) is the median absolute deviation of the null distribution at divergence x. As a robust measure of variability among a set of measurements, MAD is used here to take into account the spread of the null distribution N at a specific divergence x. When the simulations produced a variable distribution [i.e., MAD(N(x)) > 1], D(x) was normalized by MAD(N(x)) to avoid an inflation of D(x). When simulations were less variable [i.e., MAD(N(x)) ≤ 1], the normalization was not used.

Each real gene family was then associated with a single value Dmax defined as

\[
D_{\max} = \max_{x} \lvert D(x) \rvert ,
\]

where x varies across a specified divergence range (in our case, between 0.01 and 0.05, at 0.01 intervals; the clusters at divergence of 0.00 were excluded to reduce the impact of biased sampling). Using the same procedure, each simulated gene family was compared with all simulated gene families to obtain the set of Dmax values of the null distribution.

Based on the Dmax metric, real gene families were classified into three categories, D+, D−, and D=, which correspond to sets of gene families with Dmax values greater than the top 2.5%, less than the bottom 2.5%, or within the middle 95%, respectively, of the Dmax values of the null distribution. Patterns of diversification of a given gene family were designated as significantly different from the null distribution in a given divergence range if the family was classified into the D+ or D− category; this corresponds to a P value < 0.05, because the expectation of seeing the specific Dmax value due to chance is less than 5%.

Assessment of Incomplete Sampling on CD Curve Shape. To investigate the effects of random subsampling from a larger population, four different population sizes (100, 133, 200, and 400 lineages) were simulated 100 times each. From each population size, 100 random genes were subsampled without replacement, which corresponds to 100% (100 out of 100), 75% (100 out of 133), 50% (100 out of 200), and 25% (100 out of 400) sampling rates. Distance Dmax was calculated for each gene family in both the full simulation and each subsample. The distributions of Dmax were compared by one-way ANOVA (98).

Assignment and Analysis of Gene Families' Functional Annotations. A randomly selected amino acid sequence from each gene family was used as a query in a protein BLAST (99) search (database size 10^7, e value < 10^−5, low-complexity filter on) against the Clusters of Orthologous Groups (COG) database (2014 update; accessed at ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data) (100). Each gene family was assigned a functional annotation of the top-scoring BLASTP match. If a gene family had no homologs in the COG database, it was excluded from further analysis. Underrepresentation and overrepresentation of specific functional categories among the gene families classified as either D+ or D− was evaluated using Fisher's exact test (101).

Inference of Selection Regimes in Gene Families with the Elevated Dmax. From each of the four analyzed bacterial groups, gene families with the largest values of Dmax in [0.01; 0.05] divergence interval in both D+ and D− categories were examined for evidence of selection (Table 2). For these gene families, the codon nucleotide alignments were created using translatorX v1.1 (102) from the amino acid sequence alignment generated in MUSCLE v3.8.31 (95). Maximum likelihood phylogenetic trees were reconstructed in the RAxML program v7.3.6 (103) under the GTR+Γ substitution model [general time-reversible (GTR) substitution model with among-site rate variation accommodated using Γ distribution] and with 100 rapid bootstraps. Using these trees and corresponding codon alignments as an input, the fraction of sites under selection, as well as estimates of dN and dS, were inferred under the M2a model (104), as implemented in the PAML program v4.8 (105).

ACKNOWLEDGMENTS. We thank Dr. W. Ford Doolittle for numerous stimulating discussions of the project and for comments on the manuscript. We also thank Drs. Deborah Hogan, Phillip Honenberger, Mark McPeek, and Camilla Nesbø for critical reading of the manuscript drafts, and two anonymous reviewers for the constructive and insightful feedback. This work was supported by the Simons Foundation Investigator in Mathematical Modeling of Living Systems award (to O.Z.) and Dartmouth start-up funds (to O.Z.).
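The birth−death process described in Methods (one random death and one random duplication per generation, repeated until all extant lineages share a common ancestor) can be illustrated with a short simulation. This is a minimal sketch written for this text, not the authors' published code: the function name `simulate_moran_genealogy` is hypothetical, distances are counted in whole generations (with the splitting generation included, an assumption), and the later steps of the study (neighbor-joining tree construction, scaling to a tree height, sequence evolution in Seq-Gen) are omitted.

```python
import random


def simulate_moran_genealogy(n, rng=None):
    """Simulate a constant-size birth-death (Moran-type) population.

    Returns (dist, generations), where dist[i][j] is the number of
    generations separating extant lineages i and j from their last
    common ancestor, and generations is the total number simulated.
    """
    rng = rng or random.Random()
    founder = list(range(n))                 # initial lineage each slot descends from
    split = [[None] * n for _ in range(n)]   # generation at which i and j diverged
    generation = 0
    # Stop once every extant lineage traces back to one initial lineage.
    while len(set(founder)) > 1:
        generation += 1
        dead, parent = rng.sample(range(n), 2)   # two different lineages
        # The dead lineage's slot is taken by a copy of the parent, so it
        # inherits the parent's relationships to all other lineages.
        founder[dead] = founder[parent]
        for k in range(n):
            if k != dead and k != parent:
                split[dead][k] = split[k][dead] = split[parent][k]
        split[dead][parent] = split[parent][dead] = generation
    dist = [
        [0 if i == j else generation - split[i][j] + 1 for j in range(n)]
        for i in range(n)
    ]
    return dist, generation


# Example: one simulated genealogy of 8 lineages with a fixed seed.
example_dist, example_generations = simulate_moran_genealogy(8, random.Random(1))
```

Because all tips are sampled at the same time, the resulting distances are ultrametric; a distance matrix of this kind is what would then be passed to a tree-building method.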

1. Locey KJ, Lennon JT (2016) Scaling laws predict global microbial diversity. Proc Natl Acad Sci USA 113:5970–5975.
2. Hanage WP, Fraser C, Spratt BG (2006) Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci 361:1917–1927.
3. Hunt DE, et al. (2008) Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320:1081–1085.
4. Caro-Quintero A, Konstantinidis KT (2012) Bacterial species may exist, metagenomics reveal. Environ Microbiol 14:347–355.
5. Acinas SG, et al. (2004) Fine-scale phylogenetic architecture of a complex bacterial community. Nature 430:551–554.
6. Cohan FM, Perry EB (2007) A systematics for discovering the fundamental units of bacterial diversity. Curr Biol 17:R373–R386.
7. Cohan FM (2016) Bacterial speciation: Genetic sweeps in bacterial species. Curr Biol 26:R112–R115.
8. Hanage WP, Spratt BG, Turner KM, Fraser C (2006) Modelling bacterial speciation. Philos Trans R Soc Lond B Biol Sci 361:2039–2044.

8of10 | www.pnas.org/cgi/doi/10.1073/pnas.1619993114 Straub and Zhaxybayeva Downloaded by guest on September 27, 2021 9. Feil EJ (2004) Small change: Keeping pace with microevolution. Nat Rev Microbiol 2: 46. Stackebrandt E, Ebers J (2006) Taxonomic parameters revisited: Tarnished gold PNAS PLUS 483–495. standards. Microbiol Today 33:152–155. 10. Shapiro BJ, et al. (2012) Population genomics of early events in the ecological dif- 47. Massey FJ (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc ferentiation of bacteria. Science 336:48–51. 46:68–78. 11. Shapiro BJ, Polz MF (2014) Ordering microbial diversity into ecologically and ge- 48. Maiden MCJ, et al. (2013) MLST revisited: The gene-by-gene approach to bacterial netically cohesive units. Trends Microbiol 22:235–247. genomics. Nat Rev Microbiol 11:728–736. 12. Bendall ML, et al. (2016) Genome-wide selective sweeps and gene-specific sweeps in 49. Achtman M, Wagner M (2008) Microbial diversity and the genetic nature of mi- natural bacterial populations. ISME J 10:1589–1601. crobial species. Nat Rev Microbiol 6:431–440. 13. Mende DR, Sunagawa S, Zeller G, Bork P (2013) Accurate and universal delineation 50. Morlon H, Potts MD, Plotkin JB (2010) Inferring the dynamics of diversification: A of prokaryotic species. Nat Methods 10:881–884. coalescent approach. PLoS Biol 8:e1000493. 14. Thompson CC, et al. (2015) Microbial in the post-genomic era: Rebuilding 51. McPeek MA (2008) The ecological dynamics of clade diversification and community from scratch? Arch Microbiol 197:359–370. assembly. Am Nat 172:E270–E284. 15. Whitman W, ed (2015) Bergey’s Manual of Systematics of Archaea and Bacteria 52. Ferretti L, Disanto F, Wiehe T (2013) The effect of single recombination events on (Bergey’s Manual Trust, Athens, GA). coalescent tree height and shape. PLoS One 8:e60123. 16. Huse SM, et al. (2014) VAMPS: A website for visualization and analysis of microbial 53. 
Suerbaum S, Josenhans C (2007) Helicobacter pylori evolution and phenotypic di- population structures. BMC Bioinformatics 15:41. versification in a changing host. Nat Rev Microbiol 5:441–452. 17. Wakeley J (2009) Coalescent Theory: An Introduction (Roberts and Company, 54. Kennemann L, et al. (2011) Helicobacter pylori genome evolution during human Greenwood Village, CO). infection. Proc Natl Acad Sci USA 108:5033–5038. 18. Kingman JFC (1982) The coalescent. Stochastic Process Appl 13:235–248. 55. Vos M, Didelot X (2009) A comparison of rates in bac- 19. Felsenstein J (1981) Evolutionary trees from DNA sequences: A maximum likelihood teria and archaea. ISME J 3:199–208. approach. J Mol Evol 17:368–376. 56. Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the 20. Welch RA (2006) The genus Escherichia. : Gamma Subclass, The Pro- rare biosphere through improved OTU clustering. Environ Microbiol 12:1889–1898. karyotes, eds Dworkin M, Falkow S, Rosenberg E, Schleifer K-H, Stackebrandt E 57. Revell LJ, Harmon LJ, Glor RE (2005) Underparameterized model of sequence evo- (Springer, New York), Vol 6, pp 60−71. lution leads to bias in the estimation of diversification rates from molecular phy- 21. Luo C, et al. (2011) Genome sequencing of environmental Escherichia coli expands logenies. Syst Biol 54:973–983. understanding of the ecology and speciation of the model bacterial species. Proc 58. Jordan IK, Rogozin IB, Wolf YI, Koonin EV (2002) Essential genes are more evolu- Natl Acad Sci USA 108:7200–7205. tionarily conserved than are nonessential genes in bacteria. Genome Res 12:962–968. 22. Walk ST (2015) The “cryptic” Escherichia. EcoSal Plus 6:10.1128/ecosalplus.ESP- 59. Wang L, Rothemund D, Curd H, Reeves PR (2003) Species-wide variation in the 0002-2015. Escherichia coli flagellin (H-antigen) gene. J Bacteriol 185:2936–2943. 23. Octavia S, Lan R (2014) The family . The Prokaryotes: Gammap- 60. Wicker E, et al. 
(2012) Contrasting recombination patterns and demographic histo- roteobacteria, eds Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F ries of the plant pathogen Ralstonia solanacearum inferred from MLSA. ISME J 6: (Springer, Berlin), pp 225−286. 961–974. 24. Liu S, et al. (2015) Escherichia marmotae sp. nov., isolated from faeces of Marmota 61. Monteil CL, et al. (2013) Nonagricultural reservoirs contribute to emergence and

himalayana. Int J Syst Evol Microbiol 65:2130–2134. evolution of Pseudomonas syringae crop pathogens. New Phytol 199:800–811. EVOLUTION 25. Bennett JS, Bratcher HB, Brehony C, Harrison OB, Maiden MCJ (2014) The genus 62. Smith KD, et al. (2003) Toll-like receptor 5 recognizes a conserved site on flagellin Neisseria. The Prokaryotes: Alphaproteobacteria and , eds required for protofilament formation and bacterial motility. Nat Immunol 4: Rosenberg E, DeLong EF, Lory S, Stackebrandt E, Thompson F (Springer, Berlin), pp 1247–1253. 881−900. 63. Vinatzer BA, Monteil CL, Clarke CR (2014) Harnessing population genomics to un- 26. Maiden MCJ, Harrison OB (2016) Population and functional genomics of Neisseria derstand how bacterial pathogens emerge, adapt to crop hosts, and disseminate. revealed with gene-by-gene approaches. J Clin Microbiol 54:1949–1955. Annu Rev Phytopathol 52:19–43. 27. Bennett JS, et al. (2012) A genomic approach to : An examination 64. Rao DN, Dryden DTF, Bheemanaik S (2014) Type III restriction-modification enzymes: and proposed reclassification of species within the genus Neisseria. Microbiology A historical perspective. Nucleic Acids Res 42:45–55. 158:1570–1580. 65. de Vries N, et al. (2002) Transcriptional phase variation of a type III restriction- 28. Bratcher HB, Bennett JS, Maiden MCJ (2012) Evolutionary and genomic insights into modification system in Helicobacter pylori. J Bacteriol 184:6615–6623. meningococcal biology. Future Microbiol 7:873–885. 66. Kojima KK, et al. (2016) Population evolution of Helicobacter pylori through di- 29. Rotman E, Seifert HS (2014) The genetics of Neisseria species. Annu Rev Genet 48: versification in DNA methylation and interstrain sequence homogenization. Mol Biol 405–431. Evol 33:2848–2859. 30. Caimano M (2006) The genus Borrelia. Proteobacteria: Delta, Epsilon subclass, The 67. Tomb J-F, et al. 

www.pnas.org/cgi/doi/10.1073/pnas.1619993114