Phylogenetic incongruence arising from fragmented speciation in enteric bacteria

Adam C. Retchless and Jeffrey G. Lawrence1

Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260

Edited* by W. Ford Doolittle, Dalhousie University, Halifax, NS, Canada, and approved May 17, 2010 (received for review February 2, 2010)

Evolutionary relationships among species are often assumed to be fundamentally unambiguous, where genes within a genome are thought to evolve in concert and phylogenetic incongruence between individual orthologs is attributed to idiosyncrasies in their evolution. We have identified substantial incongruence between the phylogenies of orthologous genes in Escherichia, Salmonella, and Citrobacter, or E. coli, E. fergusonii, and E. albertii. The source of incongruence was inferred to be recombination, because individual genes support conflicting topology more robustly than expected from stochastic sequence homoplasies. Clustering of phylogenetically informative sites on the genome indicated that the regions of recombination extended over several kilobases. Analysis of phylogenetically distant taxa resulted in consensus among individual gene phylogenies, suggesting that recombination is not ongoing; instead, conflicting relationships among genes in descendent taxa reflect recombination among their ancestors. Incongruence could have resulted from random assortment of ancestral polymorphisms if species were instantly created from the division of a recombining population. However, the estimated branch lengths in alternative phylogenies would require ancestral populations with far more diversity than is found in extant populations. Rather, these and previous data collectively suggest that genome-wide recombination rates decreased gradually, with variation in rate among loci, leading to pluralistic relationships among their descendent taxa.

recombination | species | Tree of Life | population structure | incomplete lineage sorting

Author contributions: A.C.R. and J.G.L. designed research; A.C.R. performed research; A.C.R. and J.G.L. contributed new reagents/analytic tools; A.C.R. and J.G.L. analyzed data; and A.C.R. and J.G.L. wrote the paper.
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
1To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1001291107/-/DCSupplemental.
PNAS Early Edition, www.pnas.org/cgi/doi/10.1073/pnas.1001291107

At first glance, prokaryotes appear to have simple, well-ordered relationships resulting from asexual reproduction and divergence by mutation. However, homologous recombination between closely related strains can lead to complex, nonclonal relationships (1). Recombination has implications that are so profound that its potential within populations is often taken to be the definitive feature of species. Mayr's biological species concept (BSC) frames species in the context of reproductive barriers whereby only conspecific individuals exchange genes; individuals that fail to recombine represent different species (2). Despite its formulation for sexual eukaryotes, Dykhuizen and Green (1) proposed that the BSC could apply to bacteria; operationally, phylogenies of orthologous genes would be identical for strains of different species but demonstrably different for strains within species due to recombination. Several studies have applied these criteria to multilocus phylogenetic analysis of prokaryotes, supporting the notion that there are distinct groups of organisms experiencing recombination within each group but not between them (3).

Despite these results, the BSC may not be generally applicable to prokaryotes, even for taxa that undergo high rates of recombination. Whereas eukaryotic recombination occurs genome-wide during meiosis, recombination in prokaryotes involves only a small fragment of DNA being introduced into a cell via transformation, phage-mediated transduction, or plasmid conjugation. In both prokaryotes and eukaryotes, recombinants may be counter-selected when genomic incompatibilities reduce hybrid fitness. In eukaryotes, this inhibits gene exchange genome-wide (4), whereas in prokaryotes, recombination interference affects only that small portion of the genome that carries the incompatible DNA. Similarly, antirecombination driven by sequence divergence and the mismatch repair system causes hybrid sterility in eukaryotes such as Saccharomyces (5), whereas in prokaryotes it simply prevents the integration of the particular sequence that has diverged from the recipient (6).

As a result, barriers to recombination in bacteria could be limited to specific portions of a genome. Consider a bacterial population freely recombining at all loci. Subpopulations can develop through genetic isolation of only a few loci, driven by ecology or sequence divergence (7); such subpopulations (recombining at many loci but genetically isolated at other loci) could be numerous within a larger population. Consistent with this fragmented speciation model (8), several multilocus sequence analysis (MLSA) studies identified closely related populations that appear to be recombining at some loci but remain genetically isolated at others (9, 10). In addition, a comparison of Escherichia and Salmonella genomes revealed extensive variation in the level of sequence divergence (after correcting for position and codon selection) across regions of the chromosome, suggesting that many regions experienced homogenizing recombination as much as 70 million years after other regions had ceased recombination (11). Critically, excessively diverged regions were clustered around the loci where gene gains or losses distinguish Escherichia from Salmonella; such adaptive changes in gene inventory could have contributed to ecological differentiation within the recombining ancestral population and been the focus of selection against recombination. This suggests that interference with recombination emerged at different loci at different times, driven by adaptation. Consistent with this view, population genetic simulations examining the recombination-suppressing role of sequence divergence do not result in population splitting if given recombination parameters similar to those observed in E. coli (12, 13), but population splitting is observed when using the more restrictive parameters inferred from Salmonella (14).

If recombination barriers are imparted gradually as populations split, they may not be complete before each descendent population splits again. The stepwise acquisition of genetic isolation at different locations around the chromosome would lead to differing phylogenies of orthologous genes, resulting in the lack of clear organismal relationships. Alternatively, recombination could cease for all loci instantly when each population splits, as suggested for Yersinia pestis (15). In this instant speciation model, all recombination events occur before the acquisition of genome-wide genetic isolation. Any phylogenetic incongruence would result from the partitioning of ancestral variation among descendent lineages, which would confound our ability to discern otherwise robust organismal relationships. Here, we test these models directly.

Results

Phylogenetic Discordance in the Enterobacteriaceae Does Not Reflect Ongoing Recombination. To identify taxa with potentially confounded relationships, we looked within the well-characterized species-rich clade of enteric bacteria. To establish a reference phylogeny, we aligned a core genome containing 1,174 orthologous open reading frames (ORFs) in each of 17 genomes (Table S1), with <15% of aligned sites having gaps in any genome. A Neighbor-Net (16) analysis of the concatenated codon alignment of these genes shows conflicting phylogenetic signals among these genomes (Fig. 1). Regions with conflicting signal may reflect the incongruent histories among genes due to recombination (17). Examining each gene independently by maximum likelihood (ML), there is near-universal support for the separation of Erwinia, Dickeya, Pectobacterium, , and Yersinia from Cronobacter and the other genomes (>99% of those alignments with a single topology in the 90% confidence limit). These taxa are used as outgroup genomes in subsequent tests.

Using this outgroup, we examined reference pairs of taxa in a quartet analysis to test the robustness of their relationship with respect to an additional taxon (Fig. 2). For high-confidence phylogenies, the additional taxon should be either a clear outgroup (supporting the reference pair as sister taxa) or a clear ingroup (rejecting the pair as sister taxa). As expected from the Neighbor-Net results, virtually all genes supported the Escherichia/Salmonella pair when it was evaluated with either Klebsiella or Cronobacter, and virtually no gene supported this pair when evaluated with E. albertii, E. fergusonii (Fig. 2C), or S. enterica arizonae (Fig. 2B). However, two regions of the Neighbor-Net phylogeny show substantial conflicting signal (Fig. 1, regions A and B). The patterns of divergence between Escherichia and Salmonella support a fragmented speciation model (11) whereby chromosomal regions became genetically isolated during an extended time frame (shaded area in Fig. 1). This range includes the nodes representing the divergence of the Citrobacter lineages, suggesting that the relationship between these three genera will be ambiguous (Fig. 1 Inset). This theme was reinforced by individual gene quartet analyses, where relationships between Escherichia and Salmonella were ambiguous, with the pair being neither widely accepted nor widely rejected, when evaluated with Citrobacter koseri (14% accepted) or C. youngae (30% accepted). A similar pattern was observed for the C. youngae/Salmonella clade (Fig. 2A).

In contrast to the divergence of Escherichia, Salmonella, and Citrobacter, which have been separated for tens of millions of years, the three species of Escherichia are likely in the final throes of genetic isolation. MLSA results (18) are consistent with very low levels of recombination between otherwise distinct groups of Escherichia with divergence comparable to E. coli and E. fergusonii. The vast majority of genes in our analysis also supports the monophyly of E. coli K12 with E. coli UTI89 (Fig. 2C) as well as the monophyly of the Escherichia relative to other genera; however, the relationships between the three Escherichia species remain unclear. The E. coli/E. albertii clade was supported by 53% of genes and the alternative E. coli/E. fergusonii clade by 44% of genes. Thus, these taxa represent the genesis of the phylogenetic ambiguity that plagues the relationships of Escherichia, Salmonella, and Citrobacter.

These gene-based quartets provided no evidence for recent recombination between species of different genera, indicating that (for the genomes analyzed here) any substantial gene flow was limited to the time periods before the genera diversified into extant species. Yet with both radiations, the remaining phylogenetic incongruence may be interpreted several ways. The conflicting signal may simply represent noise, especially when inferring the relationships between the more distantly related Escherichia, Salmonella, and Citrobacter. Alternatively, conflicting phylogenetic signal may reflect maintenance of ancestral polymorphism, whereby incomplete lineage sorting leads to ambiguous phylogenetic relationships even when genetic isolation occurs instantly for all genes (19). Lastly, the fragmented speciation model predicts conflicting phylogenetic signal due to stepwise acquisition of barriers to recombination.

Robust Alternative Relationships Among Bacterial Genera. One expects the phylogenetic signal to be weakest for very short branches, so one may posit that the inference of conflicting topologies described above simply reflects the lack of support for the "true" organismal phylogeny. To test this hypothesis, we determined whether the support for alternative topologies is stronger than expected. Likelihood analyses were performed on alignments of sequences from 14 genomes representing the maximal available diversity among Escherichia, Salmonella, Citrobacter, and Klebsiella, while maintaining the monophyly of each group (6, 4, 2, and

Fig. 1. Phylogenetic network of enteric bacteria. The Neighbor-Net (16) dendrogram was calculated by SplitsTree. The shaded region indicates the range for placement of the node separating Escherichia coli and Salmonella enterica according to relative divergence analysis (11). The Inset focuses on the divergence of Escherichia, Salmonella, and Citrobacter (arrow A).

Fig. 2. Percentage of genes supporting clades. Each panel shows how often a reference genome was found to be the sister taxon to a second genome (defining each trendline; see legend) in a four-taxa ML phylogeny. Each phylogeny included the reference pair, a constant outgroup, and each of a series of test genomes. Results are plotted according to the distance from the reference genome to the node leading to the test genome on a neighbor-joining tree based on estimates of amino acid substitution counts (Ka). Gene counts were limited to those ORF alignments that generated substantial likelihood support for a single topology (90% confidence interval, SH test). Quartet composition is listed to the right of each chart; x represents the test genome, which is identified on the distance axis. Cko, Citrobacter koseri; Csp, C. sp. 30_2; Cyo, C. youngae; Csa, Cronobacter sakazakii; Eal, E. albertii; Eco, E. coli MG1655; Efe, E. fergusonii; Esp, Enterobacter sp. 683; Kpn, K. pneumoniae; Saz, S. enterica arizonae; Sen, S. enterica LT2; UTI, E. coli UTI89.
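The quartet tallies summarized in Fig. 2 can be illustrated with a minimal sketch. It assumes that each gene has already been assigned a single best-supported four-taxon topology by some upstream likelihood analysis; the function names and the toy gene list are hypothetical and are not taken from the paper.

```python
def quartet_topologies(a, b, c, d):
    """Return the three possible unrooted topologies for four taxa.

    Each topology is a frozenset holding its two sister pairs,
    so ab|cd and cd|ab compare equal.
    """
    return [
        frozenset((frozenset((a, b)), frozenset((c, d)))),
        frozenset((frozenset((a, c)), frozenset((b, d)))),
        frozenset((frozenset((a, d)), frozenset((b, c)))),
    ]


def percent_supporting(best_topology_per_gene, ref, partner):
    """Percentage of genes whose best topology pairs `partner` with `ref`."""
    pair = frozenset((ref, partner))
    hits = sum(1 for topo in best_topology_per_gene if pair in topo)
    return 100.0 * hits / len(best_topology_per_gene)


# Toy data (hypothetical): four genes, taxon labels borrowed from Fig. 2.
topos = quartet_topologies("Eco", "Sen", "Cko", "Kpn")
genes = [topos[0], topos[0], topos[1], topos[2]]
print(percent_supporting(genes, "Eco", "Sen"))  # share of genes with Eco+Sen as sisters
```

In an unrooted quartet, grouping the reference pair is equivalent to grouping the other two taxa, which is why a constant outgroup is needed to read each result as acceptance or rejection of the reference pair.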

2 genomes, respectively; Table S1). The use of multiple taxa for each clade increased the signal-to-noise ratio. Although 2,028 potentially orthologous ORFs were present in each of the 14 genomes, we removed 705 unreliably aligned ORFs (>5% of their multiple sequence alignment contained gaps), 14 potentially paralogous ORFs for which syntenic neighboring ORFs could not be reliably identified, and 165 ORFs for which the monophyly of each of the four genera was not confidently supported by Bayesian analysis. For the 1,144 remaining ORFs, the three possible relationships of the four genera were evaluated by codon-based maximum likelihood, holding the relationships within each genus fixed as defined by Bayesian analysis (Dataset S1).

A substantial number of alignments supported each of the three topologies (Fig. 3 A and B) and several lines of evidence ruled out stochastic and systematic errors as the basis for these incongruent results. Among those alignments generating strong bootstrap support for a topology (Fig. 3A), where bootstrap support thresholds provide conservative estimates of accuracy (20, 21), no topology had the level of support expected if it were the true topology for all genes (dashed line). Furthermore, an excess of alignments rejected each topology with high confidence, indicating strong phylogenetic information at the gene level despite the widespread incongruence between genes (Fig. 3B; P < 0.01 for all categories where individual genes are rejected at P ≤ 0.25; binomial test using threshold gene P values as the expected probabilities). Unlike what would be expected for an unambiguous organismal phylogeny, large fractions of alignments reject the Escherichia/Citrobacter clade, the Escherichia/Salmonella clade, and, for high-confidence alignments (P < 0.25), the Citrobacter/Salmonella clade.

The proportions of alignments supporting each topology were robust to subsampling guided by statistical confidence and to a variety of other subsampling techniques that would purge different varieties of stochastic and systematic errors. Support for a single topology did not arise when we removed potentially mismatched orthologs, alignments with gaps or few informative sites, or any alignment generating inconsistent phylogenetic results using a codon-position model, other outgroups, or fewer taxa (Table S2).

Fig. 3. Phylogenetic discordance at all confidence levels. Measures of confidence on ML tests for individual genes do not match expectations for a genome-wide topology, either for the relationship among Citrobacter, Escherichia, and Salmonella (A and B) or E. albertii, E. coli, and E. fergusonii (C and D). Bootstrap support is expected to correspond to accuracy (A and C), and SH test P values provide expected frequencies of topology rejection (B and D). Support for (A and C) or rejection of (B and D) alternative clades is indicated by trendlines. Within-species quartets are shown in Fig. S1.

Robust Alternative Relationships Among the Escherichia. The same set of 1,144 alignments was tested according to a codon-position maximum-likelihood model applied to each of the three topologies involving E. albertii, E. coli, and E. fergusonii, with an outgroup comprising two genomes each of Salmonella and Klebsiella. As with the original tests (Fig. 2C), roughly equal portions of alignments supported the clustering of E. coli with either E. albertii or E. fergusonii, whereas E. albertii and E. fergusonii rarely clustered together (Fig. 3 C and D). None of the topologies had support from a sufficient number of genes to justify the hypothesis that it is the single true topology and the others are artifactual (Fig. 3C), whereas each topology was rejected more often than expected by chance (Fig. 3D). Support for a single topology did not arise when we removed potentially mismatched orthologs or alignments with gaps, few informative sites, or that did not produce identical results when Klebsiella, Citrobacter, or Salmonella were used as single outgroups (Table S3).

Interestingly, alignments supporting the E. albertii/E. fergusonii clade are rare (Fig. 3 C and D), illustrating the complexity of the isolation process. E. fergusonii and E. albertii may have arisen from small populations that rarely encountered each other, but continued to recombine with E. coli. Alternatively, ecological differentiation may be greater between E. fergusonii and E. albertii than between either of them and E. coli, suppressing recombination between them more. The former explanation is consistent with elevated substitution rates in the lineages leading to E. fergusonii and E. albertii relative to E. coli observed previously (18), but is not exclusive of the latter explanation. Also of note, whereas the E. albertii/E. coli clade is favored by those genes with strongest bootstrap support, a greater number of genes support the E. fergusonii/E. coli clade (Fig. 3 C and D). This could reflect the idiosyncratic nature of the fragmented

speciation process, suggesting either that some loci in the E. fergusonii genome have not recombined with E. albertii/E. coli for a long time, or that E. albertii/E. coli continued to recombine at these loci until relatively recently.

As a final test of the robustness of the incongruence in both quartet analyses, we attempted to identify hidden likelihood support (22) for a congruent topology by concatenating those alignments supporting each of the three topologies tested, then repeating the maximum-likelihood analysis. In each case, we found unambiguous support for the topology that the genes had individually supported (no alternate topology within a 99% confidence interval). To guard against the analysis being dominated by a few genes with the strongest support for the given topology, we repeated the analysis using only those genes that had 50–60% bootstrap support for the given topology. Again, we recovered unambiguous support for each topology except the E. albertii/E. fergusonii clade, which produced 64% bootstrap support for itself, but could not reject the E. fergusonii/E. coli clade.

Clustering of Phylogenetic Signal in the Chromosome. The above results suggest that no single phylogeny is appropriate to describe the relationship between Escherichia, Salmonella, and Citrobacter, or between the three species of Escherichia. If recombination between nascent species were responsible for the phylogenetic incongruence, then the phylogenetically informative sites should be clustered in their respective genomes according to the topology that they support. To evaluate clustering, we concatenated single-gene, high-quality alignments of genes with reliable identification of neighboring ORFs. Supporting sites in the alignment were defined as those for which one topology is more parsimonious than the other two. Parsimony criteria were applied to nucleotide, amino acid, and synonymous codon alignments, and two analyses were performed to measure the clustering of sites supporting each topology within the 1,309 ORFs.

A runs test for randomness in the order of supporting sites indicated highly significant clustering of supporting sites by topology (P << 10^−10; Table 1). That is, for both the Escherichia/Salmonella/Citrobacter and E. albertii/E. coli/E. fergusonii clades, sites supporting each of the conflicting topologies were clustered in these genomes. To test whether sites supporting each topology were clustered, we repeated the runs test by omitting each topology in turn; significant clustering was still observed (Table S4; P < 10^−5). To investigate the scale of clustering as a function of distance between supporting sites, pairs of sites were binned according to distance, calculating the frequency that both sites within the binned distance range supported the same topology (Fig. 4). For all analyses, there was a clear enrichment of sites supporting the same topology over the frequency expected from genome averages. This is most noticeable for the analysis of Escherichia species, where phylogenetic signal is expected to be the strongest; clustering is apparent both within and between genes (Fig. 4 A–C). Clustering is detectable in the Escherichia/Citrobacter/Salmonella analysis (Fig. 4D), although less apparent due to the accumulation of noise.

Table 1. Occurrence of sites supporting the same topology

                   Topology                    Runs of sites
             1        2        3         Exp     SD      Obs      Z
ESC Nuc   19,539   19,886   18,829    38,827    114   35,684   27.6
ESC Pro    2,382    2,879    2,442     5,117     41    4,600   12.5
ESC Syn   12,229   11,666   11,621    23,672     89   22,844    9.3
EEE Nuc    9,687   19,298   18,950    30,718    102   24,327   62.6
EEE Pro      471    3,074    1,361     2,558     29    1,660   31.0
EEE Syn    6,950   11,009   13,658    20,357     83   17,159   38.7

Topologies: EEE1, Eal,Efe; EEE2, Eal,Eco; EEE3, Eco,Efe; ESC1, Cit,Esc; ESC2, Cit,Sal; ESC3, Esc,Sal; Nuc, nucleotide; Pro, protein; Syn, synonymous codons. Runs test of randomness was performed on the sequence of sites supporting each topology. We observed (Obs) fewer runs than expected (Exp) given the number of sites supporting each topology, indicating that sites supporting a given topology are clustered. Z-scores [(Obs − Exp)/SD], evaluated with a one-tailed test, reject the null hypothesis of random ordering at P << 10^−10 for all tests, including both quartets analyzed and all three character types evaluated. Additional tests are shown in Table S4.

Phylogenetic Incongruence Does Not Reflect Incomplete Lineage Sorting. The clustering of informative sites is evidence that recombination produced different phylogenies for different regions of these genomes, thus eliminating the possibility of recovering an unambiguous organismal phylogeny from the sequence data. Yet this evidence is not sufficient to determine whether recombination occurred at some loci subsequent to genetic isolation at other loci (fragmented speciation), or an ancestral population split into two descendent populations, one of which split again before its

Fig. 4. Chromosomal clustering of parsimony informative sites supporting each topology. Each site supporting a distinct topology was compared against each other site and the observation binned according to the distance between sites. Trendlines report proportions of observations where a site supporting a given topology was paired with a site supporting the same topology. Expected values are derived from the genome-wide proportion of sites supporting that topology. Data are plotted at the midpoint of the bin range. Within-species quartets are shown in Fig. S2.
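The two clustering analyses (the runs test of Table 1 and the distance binning of Fig. 4) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the runs test uses the standard multi-category mean and variance with a normal approximation, which is assumed rather than confirmed to be what the paper used, although it gives values close to the Exp and SD entries of the ESC Nuc row. The Z entries in Table 1 are positive where fewer runs were observed than expected, so the sketch reports (expected − observed)/SD. The binning function pools all topologies into one same-topology fraction, a simplification of the per-topology trendlines in Fig. 4, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def count_runs(labels):
    """Number of maximal runs of identical labels in an ordered sequence."""
    return sum(1 for i, x in enumerate(labels) if i == 0 or x != labels[i - 1])


def runs_test(counts, observed_runs):
    """Multi-category runs test under random ordering (normal approximation).

    counts: number of sites supporting each topology.
    Returns (expected_runs, sd, z), with z = (expected - observed) / sd,
    so a positive z means fewer runs than expected, i.e. clustering.
    """
    n = sum(counts)
    s2 = sum(c ** 2 for c in counts)
    s3 = sum(c ** 3 for c in counts)
    expected = n + 1 - s2 / n
    variance = (s2 * (s2 + n * (n + 1)) - 2 * n * s3 - n ** 3) / (n ** 2 * (n - 1))
    sd = math.sqrt(variance)
    return expected, sd, (expected - observed_runs) / sd


def same_topology_by_distance(positions, labels, bin_width):
    """Fraction of site pairs supporting the same topology, per distance bin.

    Compares every pair of sites (quadratic in the number of sites) and keys
    each bin by its lower bound.
    """
    same, total = defaultdict(int), defaultdict(int)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            b = (abs(positions[j] - positions[i]) // bin_width) * bin_width
            total[b] += 1
            same[b] += labels[i] == labels[j]
    return {b: same[b] / total[b] for b in sorted(total)}


# Site counts and observed runs from the ESC Nuc row of Table 1.
expected, sd, z = runs_test([19539, 19886, 18829], 35684)
print(round(expected), round(sd), round(z, 1))
```

The expected same-topology frequency for the binning analysis would come from the genome-wide proportions of sites supporting each topology, as described in the Fig. 4 legend.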

ancestral polymorphisms had been resolved (incomplete lineage sorting). By the latter model, one topology would reflect the history of population splitting (the species topology), and the other two would represent the diversity of the original population (19). To evaluate this "instant-speciation" model empirically, we compared the observed diversity of the extant populations to the inferred diversity of the ancestral population that could have generated the incongruent phylogenies (Fig. S3). Putative ancestral diversity was measured as the length of the innermost branch connecting all four genera on a maximum-likelihood tree (internal branch). Extant diversity (terminal branch) was measured as half of the branch length separating genes from two E. coli strains. To maximize our estimate of extant diversity, we began with a collection of four maximally divergent E. coli genomes (13), selecting the two (UMN026/UTI89) where the terminal branch length exceeds the internal branch length most often. As reported above, widespread support for the monophyly of E. coli indicates that E. albertii and E. fergusonii do not recombine freely with E. coli; therefore they were not included in measures of extant diversity.

If incomplete lineage sorting were to produce the observed phylogenetic ambiguity between Escherichia, Salmonella, and Citrobacter, then the terminal and internal branch lengths on maximum-likelihood trees with nonspecies topologies would be comparable because both would represent within-species diversity. Only trees with the "true" species topology could have accumulated extra divergence along the internal branch. However, we observed that the terminal branch was generally shorter than the internal branch. This was true for each topology, in 93.0% (346/372), 94.8% (292/308), and 95.7% (443/463) of the trees that clustered Escherichia with Salmonella, Escherichia with Citrobacter, or Citrobacter with Salmonella, respectively (Fig. S3A). When using S. enterica enterica strains, extant diversity was smaller than the ancestral variation in at least 98% of trees for each comparison (S. enterica arizonae was an outgroup in >97.5% of alignments where no alternate topology is within the 90% confidence limit).

To determine how often the internal branch would be longer than the terminal branch if genetic isolation were imposed simultaneously for all genes, we simulated the evolution of these taxa according to the best ML tree. This guide tree was modified so that the internal branch length was equal to the average terminal branch length between strains UMN026 and UTI89 (Fig. S3B). We repeated the quartet analysis using 100 simulations of the set of 1,144 genes (Fig. S3C). The internal branch on an ML tree of the simulated data was longer than the average terminal branch for 60 ± 1% of the genes (maximum value 63.9%). Comparable values were found when the genes supporting each topology were analyzed separately (Table S5; maximum value 66.8% for the 308 genes supporting the Escherichia/Citrobacter clade). Therefore, our data indicate that measured ancestral diversity for the Escherichia/Salmonella/Citrobacter split far exceeds diversity found in extant species of E. coli and S. enterica. Because the data suggest that all three topologies represent the true species topology, we reject incomplete lineage sorting as the mechanism leading to phylogenetic incongruence. Similar results were found using the internal branch of the E. albertii/E. coli/E. fergusonii split when the genes supporting each of the two dominant topologies were analyzed (Table S5); the rare gene alignments that supported an E. albertii/E. fergusonii clade produced branch lengths within the expected distribution, which is consistent with the lack of prolonged recombination between genes in these lineages, as was suggested above.

Discussion

Questioning the Tree of Life. Significant phylogenetic incongruence was observed between bacterial taxa (Fig. 1). This incongruence could have reflected noise, recent recombination between otherwise genetically isolated populations, or the random assortment of ancestral diversity following instant acquisition of genetic isolation. Above, we provided evidence rejecting these alternatives; therefore, another model, such as the stepwise acquisition of genetic isolation (fragmented speciation), must be invoked to explain the data. In accordance with this model, previous data for E. coli and S. enterica show that genetic isolation occurred at different times for different genes, driven by adaptive change (11). The fragmented speciation model suggests that organismal phylogenies cannot be deduced from gene phylogenies because genes have different evolutionary histories. Given the vast diversity of prokaryotes (23), groups of ambiguously related taxa produced by rapid evolutionary radiations may be common.

The number and density of such problematic relationships can only increase as more microbial diversity is characterized. In the most extreme interpretation, this would invalidate the Tree of Life hypothesis, which is founded on the idea that extant taxa have unambiguous relationships (24). Phylogenetic trees of organisms serve as frameworks for interpreting evolutionary change; characteristics of ancestral taxa are inferred by coalescence and serve as platforms for interpreting changes in descendent taxa. Yet our data suggest that such ancestral taxa may not have existed, and inferences that require them, such as any utilization of parsimony, would fail. For example, if one accepts an organismal phylogeny that places Escherichia as an outgroup to the Citrobacter/Salmonella clade, then any feature in common between E. coli and Salmonella would be interpreted as a parallel gain or a loss from Citrobacter (Fig. S4). Alternatively, this feature may have been shared by the two taxa throughout the fragmented speciation process, because a distinct taxon ancestral to the Citrobacter/Salmonella clade need not exist (17). Given these complications, the Tree of Life, in demanding a strict bifurcating relationship among descendent taxa, cannot form the basis for rigorous examination of bacterial diversity.

Questioning Bacterial Species Concepts. The patterns of incongruence that we identified suggest that populations of potentially recombinogenic bacteria are neither freely recombining nor genetically isolated at all loci as required by the BSC. The implication is that any apparently freely recombining population actually comprises many partially genetically isolated subpopulations. As a result, Mayrian species boundaries cannot be defined rigorously by gene flow because extant species will include numerous ecological protospecies that are in partial genetic isolation, leading to ambiguity in the relationships among derived taxa as shown above. Moreover, ecotypes are not a good basis for species identification, as closely related ecotypes can still experience substantial gene flow, causing the evolution of one ecotype to influence the trajectory of another as in Neisseria (9) or Campylobacter (25). Given the complexity of bacterial gene exchange, we are unlikely to identify any rules for identifying the threshold beyond which two populations are destined to follow separate paths. Historical evidence for recombination does not necessitate ongoing potential for recombination.

Thus, species concepts may not apply to bacteria (26), even if phenotypically distinct groups of related bacteria are readily identifiable. Such concepts connect patterns of phenotypic diversity in groups of organisms to the evolutionary forces acting upon those organisms' constituent genes. Forces that lead to cohesion within sexual eukaryotic populations act upon all genes in concert; as a result, the history of such organisms is reflected in the collective history of their genes. Their sexual systems simplify species conceptualization by producing mating barriers that affect entire genomes at once (4). In contrast, evolutionary forces do not act on all bacterial genes in unison; recombination may be successful at some loci, but be counter-selected at others. The evolutionary independence of bacterial genes afforded by position-specific gene exchange generates incongruence among gene trees. Therefore, a species concept attributing the unambiguous

species delineation to the action of a particular evolutionary process may be unattainable in bacteria.

One response would be to embrace the pluralistic nature of bacterial taxa, placing strains into more than one species (27), or abandoning species names altogether for a less hierarchical approach (28). Yet one could argue that bacterial species names carry the greatest practical impact, placing organisms into defined groups that are used for agriculture, biotechnology, epidemiology, public health, disease diagnosis, and bioterrorism. Indeed, the public policy impact of such ambiguity and fluidity in the characterization of bacteria may preclude the widespread adoption of such a classification system. Barring this approach, then, what is left is the necessary use of practical definitions in the absence of a feasible species concept. Such definitions would encompass collections of bacteria that are phenotypically similar by criteria that are subjectively important to the classifiers, leading to both narrowly defined (e.g., Bacillus anthracis) and broadly defined (e.g., E. coli) groups. The ease by which many medically relevant taxa can be classified suggests that it can be an effective approach. Although this lacks the elegance and satisfaction of groupings driven by biological processes, the absence of strong theoretical underpinning to their delineation does not detract from their utility.

Materials and Methods

Genomes. The sequences of Citrobacter sp. 30_2, C. koseri, C. youngae, Cronobacter sakazakii, Dickeya zeae, Klebsiella pneumoniae strains 342 and MGH 78578, Enterobacter sp. 638, Erwinia tasmaniensis, Escherichia coli MG1655, UMN026, UTI89, and IAI39, E. fergusonii, E. albertii, Pectobacterium wasabiae, Salmonella enterica enterica LT2, CT18, and CVM19633, S. enterica arizonae, Serratia proteamaculans, and Yersinia enterocolitica were downloaded from the National Center for Biotechnology Information. Accession numbers appear in Table S1.

Ortholog Identification. Annotated ORFs were translated and used as BLASTP queries to search databases composed of ORFs from each of the other genomes (e < 1) followed by semiglobal alignment. Sets of putative orthologs were assembled from those ORFs where each was a reciprocal best match with the others. Analyses of 17 enteric genomes (Figs. 1 and 2) used alignments with >65% similarity; 14 genome analyses (Figs. 3 and 4; Table 1) used alignments with >70% sequence similarity (Dataset S1). Multiple sequence alignments (MSA) were produced with ClustalW and back-translated to codon alignments. At least five syntenic genes must have been identified to establish orthology.

Quartet Analyses. For each analysis, we evaluated the relationships among four groups of genomes for each ortholog. Alignments for each gene were subject to maximum-likelihood analysis for each of the three possible topologies, whereas the relationships within each group were specified if the group comprised more than two genomes. The root in Fig. 2 was specified according to the dominant topology in the Neighbor-Net tree (Fig. 1), and the relationships among Salmonella and Escherichia strains in Fig. 3 A and B were specified using MrBayes (29).

Maximum Likelihood on ORFs. The topologies were evaluated by the PAML package (30), using each MSA in turn, generating resampling estimated log-likelihood (RELL) bootstrap support and Shimodaira–Hasegawa (SH) test P values (31). Our simulations (below) support bootstrap thresholds as a conservative estimate of accuracy. Codon-based ML used a single Ω parameter and the Miyata geometric amino acid substitution probabilities. Codon-position ML constructed an HKY85 nucleotide-substitution model for each codon position. This model is computationally efficient relative to the codon model, and has been shown to have similar accuracy for both closely and distantly related sequences (32).

Simulation of Instant Speciation. Simulated sequences were generated by the Evolver program in PAML. Test trees were generated by a codon-position nucleotide model. Using actual sequences, this produced results comparable to the full codon model, but was much less computationally intensive. All parameters were based upon the actual MSA being tested. The input tree was identical to the ML tree generated from the actual MSA, except that the innermost branch was set to be the same length as other branches in the tree that were being compared with the innermost branch. Sequence length was set to be identical to the number of sites aligned across all sequences of the MSA. Codon proportions were identical to the frequencies observed across all sequences in the MSA, and the κ and Ω parameters were set according to the parameters estimated by the YN00 program, with pairwise values averaged together by first averaging all pairs of genomes between any two groups within the quartets, and then averaging the six pairwise values for the four groups within the quartet analysis.

ACKNOWLEDGMENTS. This work was supported by National Institutes of Health Grant GM078092.
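The RELL bootstrap named under Maximum Likelihood on ORFs can be illustrated with a minimal sketch. PAML computes these values internally, and the code below is not PAML's implementation. It assumes that per-site log-likelihoods for each of the three candidate quartet topologies are already available; the function name `rell_support`, the topology labels, and all numbers are hypothetical and serve only to show the resampling step.

```python
import random


def rell_support(site_lnl, n_reps=1000, seed=0):
    """Estimate RELL bootstrap support for competing topologies.

    site_lnl maps a topology label to a list of per-site log-likelihoods
    (hypothetical inputs; every list must have the same length).
    Returns a dict mapping each label to the fraction of replicates in
    which that topology had the highest resampled total log-likelihood.
    """
    names = list(site_lnl)
    n_sites = len(site_lnl[names[0]])
    if any(len(site_lnl[name]) != n_sites for name in names):
        raise ValueError("all topologies need the same number of sites")

    rng = random.Random(seed)
    wins = {name: 0 for name in names}
    for _ in range(n_reps):
        # Resample site indices with replacement; the same indices are
        # applied to every topology, so no likelihood is re-estimated.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {
            name: sum(site_lnl[name][i] for i in sample) for name in names
        }
        # Ties go to the first label in insertion order.
        best = max(names, key=lambda name: totals[name])
        wins[best] += 1
    return {name: wins[name] / n_reps for name in names}


if __name__ == "__main__":
    # Hypothetical per-site log-likelihoods for one gene alignment.
    example = {
        "Escherichia+Salmonella": [-2.1, -1.9, -3.0, -2.2, -1.8, -2.5],
        "Escherichia+Citrobacter": [-2.3, -1.7, -3.1, -2.0, -2.0, -2.6],
        "Citrobacter+Salmonella": [-2.4, -2.0, -2.9, -2.3, -2.1, -2.4],
    }
    print(rell_support(example, n_reps=500, seed=1))
```

A topology whose support exceeds a chosen threshold in such a resampling would be counted as the one favored by that gene; the threshold itself is a separate choice and is not encoded in this sketch.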

1. Dykhuizen DE, Green L (1991) Recombination in Escherichia coli and the definition of biological species. J Bacteriol 173:7257–7268.
2. Mayr E (1942) Systematics and the Origin of Species (Columbia Univ Press, New York).
3. Wertz JE, Goldstone C, Gordon DM, Riley MA (2003) A molecular phylogeny of enteric bacteria and implications for a bacterial species concept. J Evol Biol 16:1236–1248.
4. Rieseberg LH, Wood TE, Baack EJ (2006) The nature of plant species. Nature 440:524–527.
5. Greig D (2009) Reproductive isolation in Saccharomyces. Heredity 102:39–44.
6. Shen P, Huang HV (1986) Homologous recombination in Escherichia coli: Dependence on substrate length and homology. Genetics 112:441–457.
7. Ray JL, et al. (2009) Sexual isolation in Acinetobacter baylyi is locus-specific and varies 10,000-fold over the genome. Genetics 182:1165–1181.
8. Lawrence JG (2002) Gene transfer in bacteria: Speciation without species? Theor Popul Biol 61:449–460.
9. Spratt BG, Bowler LD, Zhang QY, Zhou J, Smith JM (1992) Role of interspecies transfer of chromosomal genes in the evolution of penicillin resistance in pathogenic and commensal Neisseria species. J Mol Evol 34:115–125.
10. Hanage WP, Fraser C, Spratt BG (2005) Fuzzy species among recombinogenic bacteria. BMC Biol 3:6.
11. Retchless AC, Lawrence JG (2007) Temporal fragmentation of speciation in bacteria. Science 317:1093–1096.
12. Fraser C, Hanage WP, Spratt BG (2007) Recombination and the nature of bacterial speciation. Science 315:476–480.
13. Touchon M, et al. (2009) Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 5:e1000344.
14. Falush D, et al. (2006) Mismatch induced speciation in Salmonella: Model and data. Philos Trans R Soc Lond B Biol Sci 361:2045–2053.
15. Dykhuizen DE (2000) Yersinia pestis: An instant species? Trends Microbiol 8:296–298.
16. Bryant D, Moulton V (2004) Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Mol Biol Evol 21:255–265.
17. Lawrence JG, Retchless AC (2010) The myth of bacterial species and speciation. Biol Philos, in press.
18. Walk ST, et al. (2009) Cryptic lineages of the genus Escherichia. Appl Environ Microbiol 75:6534–6544.
19. Pamilo P, Nei M (1988) Relationships between gene trees and species trees. Mol Biol Evol 5:568–583.
20. Taylor DJ, Piel WH (2004) An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol Biol Evol 21:1534–1537.
21. Hillis DM, Bull JJ (1993) An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol 42:182–192.
22. Gatesy J, Baker RH (2005) Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst Biol 54:483–492.
23. Dykhuizen DE (1998) Santa Rosalia revisited: Why are there so many species of bacteria? Antonie Van Leeuwenhoek 73:25–33.
24. Doolittle WF, Bapteste E (2007) Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci USA 104:2043–2049.
25. Sheppard SK, McCarthy ND, Falush D, Maiden MC (2008) Convergence of Campylobacter species: Implications for bacterial evolution. Science 320:237–239.
26. Doolittle WF, Zhaxybayeva O (2009) On the origin of prokaryotic species. Genome Res 19:744–756.
27. Bapteste E, Boucher Y (2009) Epistemological impacts of horizontal gene transfer on classification in microbiology. Methods Mol Biol 532:55–72.
28. Lawrence JG, Hatfull GF, Hendrix RW (2002) Imbroglios of viral taxonomy: Genetic exchange and failings of phenetic approaches. J Bacteriol 184:4891–4905.
29. Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574.
30. Yang Z (2007) PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591.
31. Shimodaira H, Hasegawa M (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16:1114–1116.
32. Ren F, Tanaka H, Yang Z (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol 54:808–818.

www.pnas.org/cgi/doi/10.1073/pnas.1001291107