Each of 3,323 metabolic innovations in the evolution of E. coli arose through the horizontal transfer of a single DNA segment

Tin Yau Pang (a,b) and Martin J. Lercher (a,b,1)

(a) Institute for Computer Science, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany; and (b) Department of Biology, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany

Edited by W. Ford Doolittle, Dalhousie University, Halifax, Nova Scotia, Canada, and approved November 15, 2018 (received for review October 31, 2017)

Even closely related prokaryotes often show an astounding diversity in their ability to grow in different nutritional environments. It has been hypothesized that complex metabolic adaptations (those requiring the independent acquisition of multiple new genes) can evolve via selectively neutral intermediates. However, it is unclear whether this neutral exploration of phenotype space occurs in nature, or what fraction of metabolic adaptations is indeed complex. Here, we reconstruct metabolic models for the ancestors of a phylogeny of 53 Escherichia coli strains, linking genotypes to phenotypes on a genome-wide, macroevolutionary scale. Based on the ancestral and extant metabolic models, we identify 3,323 phenotypic innovations in the history of the E. coli clade that arose through changes in accessory genome content. Of these innovations, 1,998 allow growth in previously inaccessible environments, while 1,325 increase biomass yield. Strikingly, every observed innovation arose through the horizontal acquisition of a single DNA segment less than 30 kb long. Although we found no evidence for the contribution of selectively neutral processes, 10.6% of metabolic innovations were facilitated by horizontal gene transfers on earlier phylogenetic branches, consistent with a stepwise adaptation to successive environments. Ninety-eight percent of metabolic phenotypes accessible to the combined E. coli pangenome can be bestowed on any individual strain by transferring a single DNA segment from one of the extant strains. These results demonstrate an amazing ability of the E. coli lineage to adapt to novel environments through single horizontal gene transfers (followed by regulatory adaptations), an ability likely mirrored in other clades of generalist bacteria.

horizontal gene transfer | lateral gene transfer | Escherichia coli | metabolic adaptation | flux balance analysis

Significance

Bacteria often evolve by copying genes from other strains, a process termed horizontal gene transfer. As a consequence, different strains of the bacterial species Escherichia coli differ substantially in the sets of genes they possess. Here, we use the inferred gene sets of all recent ancestors of 53 E. coli strains to reconstruct the ancestors’ abilities to grow in different nutritional environments. This allows us to infer over 3,000 metabolic innovations in E. coli’s evolutionary history. All innovations arose through the copying (transfer) of only one small piece of DNA from another strain, demonstrating an amazing capacity of E. coli to quickly adapt to new environments.

Author contributions: T.Y.P. and M.J.L. designed research; T.Y.P. performed research; T.Y.P. analyzed data; and T.Y.P. and M.J.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
1 To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1718997115/-/DCSupplemental.
Published online December 18, 2018.
www.pnas.org/cgi/doi/10.1073/pnas.1718997115 PNAS | January 2, 2019 | vol. 116 | no. 1 | 187–192

In many ways, homologous recombination between the strains of a prokaryotic species is analogous to meiotic recombination in eukaryotes: It contributes to the efficient purging of deleterious mutations (1, 2) and brings together beneficial mutations that arose in different genetic backgrounds (i.e., it counters clonal interference) (3). Similar to recombination in eukaryotes, prokaryotic recombination may sometimes break up beneficial combinations of epistatically interacting sequences (1). Crucially, prokaryotic recombination of genomic regions that are only partially homologous facilitates horizontal gene transfer (HGT) between strains, a phenomenon contributing to prokaryotic adaptation (4). The role of recombination in the evolution of Escherichia coli and its relationship to HGT has been studied extensively over the past 70 y (4–11).

With the advent of high-throughput DNA sequencing, comparative genomics all but replaced the comparison of phenotypes as the basis for understanding evolution and adaptation. However, it is the phenotype that natural selection acts upon; to fully appreciate the patterns and driving forces of adaptation, we need to link genotypes to phenotypes on both the genomic scale and an evolutionary timescale. Bacterial metabolism is arguably the most promising model system for such an endeavor. The ability to efficiently metabolize nutrient sources is an essential determinant of bacterial fitness (12), and flux balance analysis (FBA) has been established as a robust and reliable modeling framework for the prediction of this ability (13, 14).

A computational analysis of approximate metabolic models generated automatically from genome sequences suggested that within-species phenotypic divergence is almost instantaneous, whereas divergence between genera is gradual or “clock-like” (12). Accordingly, the genetic distance calculated from multilocus sequence typing data is a weak indicator of how similar two E. coli strains are in terms of the carbon sources they can metabolize (15). How can within-species divergence be so much faster than between-species divergence? The answer likely lies in frequent recombination and the HGT events it facilitates between bacterial strains belonging to the same species (16): A small set of new genes acquired through HGT can potentially lead to drastic phenotypic changes (4, 12).

Horizontally transferred genes that do not provide fitness benefits are likely to be lost quickly, not least because of a mutational bias toward deletions in bacterial genomes (17). This logic suggests that successful HGTs (that is, those events that left their traces in extant genomes) were individually adaptive. A requirement for individually adaptive DNA acquisitions would impose a strong barrier on the emergence of complex phenotypes that require multiple gene acquisitions, because the size of horizontally transferred DNA segments is limited by the mechanisms of cellular DNA uptake (18, 19). For example, DNA transfers by phages (transduction), a major mechanism of HGT in E. coli (20), are limited by the carrying capacity of the phage capsid (18, 21).

Did ancient strains of the E. coli lineage find a way to circumvent the barrier to complex adaptations imposed by the size limit on HGTs? It has been proposed that complex metabolic adaptations may evolve via a neutral exploration of phenotype space (22, 23), hypothesizing that “many additions of individual reactions to a metabolic network will not change a metabolic phenotype until a second added reaction connects the first reaction to an already existing metabolic pathway” (23). However, no empirical data from bacterial metabolism supports this scenario (24); bacterial genomes appear compact and almost devoid of nonfunctional DNA sequences (17).

An alternative explanation for the emergence of complex adaptations was put forward by Szappanos et al. (24), who suggested that metabolic complexity may arise through successive noncomplex adaptations to changing environments. However, the relative roles in bacterial evolution of simple adaptations (proceeding through individually adaptive DNA acquisitions) vs. complex adaptations are currently unknown. What proportion of metabolic innovations in a given bacterial clade was complex (i.e., required multiple independent HGT events)? Did such multiple DNA acquisitions occur in quick succession, or were they spread over long evolutionary time spans, suggesting a string of successive adaptations to stepping-stone environments (24)? And, more fundamentally, how adaptable are generalist bacteria such as E. coli; that is, how many independent DNA acquisitions are typically required to allow a strain to grow in an environment where it was initially unviable?

Results

The E. coli Dataset. We examine these questions by linking genomic and phenotypic evolution in the E. coli clade. Our analyses focus exclusively on changes in the accessory genomes that arose via HGT and gene losses; see the Discussion section for a review of the contribution of other genomic changes. Our dataset consists of 53 E. coli and Shigella strains (SI Appendix, Fig. S1 and Dataset S1), encompassing commensal as well as intestinal and extraintestinal pathogenic strains (25). These strains had been chosen to form a representative sample of the E. coli species (25) and cover the major recognized clades of E. coli (A, B1, B2, D, and E; see SI Appendix, Fig. S2). Because the Shigella strains are nested within the E. coli phylogeny (18, 26), we subsume them hereafter under the term E. coli. Genome-scale metabolic models have previously been reconstructed for these extant strains based on gene annotations, manually curated genome-scale metabolic models of the E. coli K-12 strain and several close relatives, and information obtained from public databases (25).

In a previous publication (18), we identified orthologous groups of genes between the 53 strains and 17 outgroup genomes based on amino acid sequence similarities. We then reconstructed a well-supported maximum-likelihood phylogeny of vertical descent based on the concatenated alignments of 1,334 universal one-to-one orthologs (SI Appendix, Fig. S1). Based on the presence and absence patterns of gene family members across the phylogeny, we inferred the gene content of the 52 ancestral genomes, using the maximum-likelihood algorithm implemented in the web server GLOOME (27). A gene acquisition via HGT was inferred if a gene was present in the derived, but not the ancestral, node of a phylogenetic branch (18).

In the same publication, we found 1,790 gene pairs whose members were repeatedly coacquired via HGT on the same branches of the phylogeny. The genomic distance between such cotransferred genes rarely exceeds 30 kb, indicating that individual genomic acquisitions by the E. coli strains in the present study are restricted to DNA segments of <30 kb. This 30-kb size limit is highly consistent with the size distribution of clusters of cofunctioning E. coli genes (18) and agrees with the size distribution of domesticated prophages in E. coli genomes (21). Although a comparative analysis of E. coli genomes identified “hot” genomic regions with elevated rates of homologous recombination that exceeded 100 kb in size, these appear to result from the superposition of multiple smaller HGT events (28).

Reconstruction of Ancestral E. coli Metabolic Systems. Here, to track phenotypic innovations in the evolutionary history of the E. coli clade, we first reconstructed the metabolic networks of the 52 ancestral strains based on a consensus annotation of the extant metabolic networks published by Monk et al. (25) and the gene presence and absence data inferred by Pang and Lercher (18) (see Fig. 1 for an overview and SI Appendix, Detailed Materials and Methods for details). We then performed FBA on the ancestral and extant networks, testing their ability to grow in 200,000 randomly generated nutritional environments as well as in 2,418 environments used in previous simulations of E. coli K-12 metabolic adaptation (24) (Dataset S2). Thirty extant and 46 ancestral networks were each able to grow in more than 20% of the environments (SI Appendix, Fig. S3 and Dataset S1). Due to auxotrophies or gaps in essential pathways, the remaining models produced biomass in a small minority of the tested environments (≤0.5%) and were excluded from further analyses.

As expected, strains that diverged very recently (amino acid sequence divergence <0.01%) tend to grow in the same environments. Beyond those nearly identical strains, however, we find that phenotypic similarity is practically independent of the amino acid sequence divergence of two strains (Spearman’s ρ = 0.0052, P = 0.88; SI Appendix, Fig. S4A), consistent with earlier observations of strong within-species diversification (12, 15). Thus, in contrast to longer evolutionary timescales (12), phenotypic divergence within species does not show clock-like evolution. Given that phenotypic divergence has been proposed to be driven by HGT (4), this independence is consistent with evidence for high rates of within-species recombination relative to point mutations (9).

Fig. 1. Overview of the methodology and main result. We started from the phylogeny of the E. coli genomes studied in refs. 18 and 25. (Left) Based on the genomes of the ancestral strains (18), we reconstructed their metabolic networks ①. For each strain and its immediate ancestor, we performed FBA to estimate viability and biomass yield across a wide range of nutritional environments ②. If the derived strain could grow in a given environment but the ancestor was unviable or produced biomass with much lower yield, we inferred a phenotypic innovation. (Right) For each such innovation, the newly acquired genes responsible for the innovation (red bars) lie within 30 kb of each other on the genome of at least one descendant of the innovating strain. We therefore conclude that these genes were cogained through the horizontal transfer of a single DNA segment (indicated by the single gene-donating phage). In contrast, we found no case in which multiple independent HGT events (e.g., multiple phages) contributed to the same innovation.

Most Potential Phenotypic Innovations Require only a Single DNA Transfer. We first wanted to study how easily possible phenotypic innovations could, in principle, arise via HGT within the E. coli pangenome. To this end, we developed a model of functional HGT based on the extant metabolic networks. We merged the
metabolic networks of all E. coli strains into a supermodel. We performed FBA on this supermodel to identify all metabolic phenotypes accessible to the E. coli pangenome. The pangenome-scale supermodel was able to produce biomass in many environments inaccessible to any of the extant strains. By comparing the viability and biomass yield of each extant genome with the supermodel, we identified potential phenotype changes that could be bestowed on a given strain through the addition of reactions from the pangenome. We distinguished “new phenotypes”, defined as the ability of a given strain to produce biomass in an environment where it was previously unable to grow, and “yield-improved phenotypes”, defined as at least a doubling in biomass yield.

DNA segments of 30 kb represent an upper size limit for being successfully acquired via HGT by E. coli strains (18). We thus assumed that a set of metabolic genes could be transferred in a single HGT event from one of the extant genomes (the donor) if the genes reside within a segment of <30 kb in that genome. For each potential new or yield-improved phenotype, we then identified the minimal number of such 30-kb segments from other extant genomes that would have to be jointly transferred to bestow this phenotype on the recipient. We found that only 2.4% of potential new phenotypes and 7.4% of potential yield improvements require the acquisition of multiple DNA segments (i.e., they should be classified as complex adaptations) (Fig. 2).

Cross-Strain Phenotype Transfer Depends only Weakly on Sequence Divergence. If a strain can grow in a certain environment, how likely is it that it can confer this ability to another strain through the transfer of a single DNA segment? To test this, we examined all environments in which one extant strain (the donor) can grow while another (the recipient) cannot. For most donor–recipient pairs, all previously lacking phenotypes can be bestowed on the recipient through the transfer of genes found on a single 30-kb DNA segment of the donor (i.e., through one HGT event); on average, 98.8% of growth abilities of one strain can be transferred to another strain in this way (SI Appendix, Fig. S4B). Although there is a weak negative correlation between phenotype transferability and amino acid divergence between strains (Spearman’s ρ = −0.24, $P < 10^{-12}$), the vast majority of phenotypes can be transferred with a single DNA segment between even the most-divergent E. coli strains.

The probability that a DNA segment that confers a given phenotype to one strain confers the same phenotype to another strain decreases slightly with the sequence divergence between the two recipients (Spearman’s ρ = −0.20, $P < 10^{-8}$) but remains above 90% regardless of sequence divergence (SI Appendix, Fig. S4C). The potential utility of HGT, however, does not depend on sequence divergence: The probability that the transfer of the genes found on a random 30-kb segment of the donor genome leads to any phenotypic innovation remains at around 24% per donor–recipient pair (considering only segments that contain at least one metabolic gene not present in the recipient; Spearman’s ρ = −0.0064, P = 0.85; SI Appendix, Fig. S4D).

Phenotypic Innovations in the History of E. coli Each Required only a Single DNA Transfer. How does this picture of possible phenotypic innovations compare with the phenotypic changes actually observed throughout E. coli’s evolutionary history? From FBA simulations on the ancestral and the derived node of each phylogenetic branch, we identified 3,323 phenotypic innovations that arose on individual branches of the E. coli phylogeny (Fig. 1). Of these phenotypic innovations, 1,998 are new phenotypes, whereas 1,325 are yield improvements. For each phenotypic innovation on a given phylogenetic branch, we identified the genes that contributed to the innovation by performing parsimonious FBA (29). As the considered phenotype arose somewhere along the branch, one or more of the contributing genes must have been acquired there. We inferred that these genes were cotransferred on a single DNA segment if they are found within 30 kb of each other on the genome of at least one of the extant descendants of the branch (SI Appendix, Detailed Materials and Methods).

Strikingly, all metabolic innovations that we observed for the E. coli lineages were achieved through the horizontal acquisition of a single DNA segment. Whereas 71% of these segments facilitated the innovation through a single gene, <3% contained more than five relevant genes. Thus, the E. coli clade seems to be completely devoid of complex adaptations, at least on the timescale of individual phylogenetic branches. Our simulations identified at least one new phenotype for 55.8% and one yield improvement for 58.9% of the observed DNA segment acquisitions that involve metabolic genes of known molecular function. Thus, we find potential adaptations behind a majority of HGT events amenable to analysis via FBA, indicating that the simulated set of hypothetical environments overlaps substantially with the environmental landscape encountered by E. coli strains throughout their evolution. Moreover, we identified many more new phenotypes and yield improvements per HGT than expected had randomly chosen DNA segments been transferred (SI Appendix, Fig. S5; binomial tests $P < 10^{-15}$), indicating that the phenotypic innovations inferred via FBA are indeed correlated with meaningful evolutionary events.

While complex innovations drawn from the E. coli pangenome are expected to be rare (Fig. 2), they are still possible, either as alternatives to single-segment adaptations or in the rare cases in which multiple DNA segments are indeed necessary to evolve a phenotypic innovation. That we did not find a complex HGT scenario behind any of the 3,323 phenotypic innovations we observed indicates that metabolic evolution is unlikely to be a predominantly neutral process in E. coli [$P < 10^{-15}$ for each type of innovation; one-sided binomial tests comparing the number of observed complex innovations (0) with the number expected if complex and noncomplex innovations were equally likely to be observed (2.4% of 1,998 new phenotypes and 7.4% of 1,325 yield-improved phenotypes, respectively)]. We thus conclude that only DNA segments that are individually adaptive are likely to spread through E. coli populations and that complex E. coli phenotypes rarely or never evolve through the neutral exploration of phenotype space.
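The one-sided binomial test described above can be checked with a few lines of arithmetic. The sketch below assumes, as the text states, null probabilities of 2.4% and 7.4% for a complex innovation and zero observed complex cases; the function name `binom_lower_tail` is illustrative and not taken from the authors' code.

```python
import math


def binom_lower_tail(k_observed: int, n_trials: int, p_success: float) -> float:
    """Return P(X <= k_observed) for X ~ Binomial(n_trials, p_success)."""
    q = 1.0 - p_success
    return sum(
        math.comb(n_trials, k) * p_success ** k * q ** (n_trials - k)
        for k in range(k_observed + 1)
    )


# Zero complex innovations observed among 1,998 new phenotypes (null: 2.4%)
# and among 1,325 yield-improved phenotypes (null: 7.4%).
p_new = binom_lower_tail(0, 1998, 0.024)
p_yield = binom_lower_tail(0, 1325, 0.074)

print(f"new phenotypes:     P = {p_new:.1e}")
print(f"yield improvements: P = {p_yield:.1e}")
```

With zero observed cases the lower tail collapses to the single term (1 − p)^n, which is why both values fall far below the reported bound of 10^−15.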

Fig. 2. A small minority of potential phenotypic innovations accessible within the E. coli pangenome require the acquisition of several distinct DNA segments via HGT (i.e., these innovations are complex). [Plot: frequency of new phenotypes and yield improvements vs. number of DNA segments required for phenotype innovations (1 to 5).]

Complex Innovations Through Stepwise Niche Expansion. It has been suggested that metabolic evolution may often proceed through adaptations to intermediate environments that act as evolutionary
stepping stones, providing reactions that can later be exapted for additional adaptations (24). Over the timescale of individual branches of the E. coli phylogeny, no such successive adaptations are observable. What about larger evolutionary timescales? For each phenotypic innovation observed on a branch of the E. coli phylogeny, we identified the number of DNA segments contributing to this phenotype that were acquired on earlier branches. We found that 10.6% of new phenotypes and 19.0% of yield improvements indeed relied on combining multiple DNA segment transfers since the last common ancestor of the E. coli strains studied here, using gene acquisitions on earlier branches as evolutionary stepping stones (Fig. 3). Examples of such complex innovations are shown in SI Appendix, Figs. S6–S8. Note that every single one of these apparently complex innovations could alternatively have been bestowed on the ancestor of the successive gene acquisitions in a noncomplex fashion: in each case, at least one of the extant genomes contains a genomic segment of <30 kb that contains all of the genes required for the later innovation.

Discussion

Based on combining comparative genomics and genome-scale metabolic modeling, we conclude that every single phenotypic innovation identified by our methodology was achieved through the acquisition of a single short segment of DNA via HGT. Such a far-reaching statement warrants some discussion.

Our analysis of phenotypic evolution in E. coli focuses on observed phenotypic innovations rather than on observed HGT events. While we additionally found that the majority of HGT events contribute to at least one metabolic phenotypic innovation, we emphasize that we cannot draw any conclusions on the adaptiveness of individual HGTs.

We analyzed only 53 extant genomes, a tiny subset of the total genomic diversity in the E. coli clade. While our strains represent the major E. coli clades (SI Appendix, Fig. S2), the analysis of additional strains would undoubtedly have unearthed additional phenotypic innovations. We cannot know what fraction of these unseen phenotypic innovations were complex (i.e., required the acquisition of multiple DNA segments). We can conclude, however, that complex metabolic innovations in E. coli are exceedingly rare: Our best estimate indicates that fewer than 1 in 3,323 (0.03%) metabolic innovations in E. coli were complex.

Of the 202,418 tested environments, 200,000 represent a largely unbiased sample of minimal nutritional environments, each constructed by choosing a random source each of carbon, nitrogen, phosphorus, and sulfur. Thus, our set of environments is obviously biased against more nutrient-rich environments. However, the main type of evolutionary innovation examined in our manuscript is the emergence of new phenotypes, defined as the ability to grow in a previously inaccessible environment. If an E. coli strain can produce biomass on a subset of the nutrients available in a given nutrient-rich environment, it can also grow in the full environment. Thus, acquiring the ability to grow in a nutrient-rich environment can never be more complex (in the sense of requiring more DNA acquisitions) than adaptation to a minimal subset of the corresponding nutrients. If anything, the restriction to minimal environments biases our analysis toward complex adaptations and is, in that sense, conservative.

We emphasize that we base our analysis on the known or predicted molecular functions of individual genes; if a gene family is known to have multiple molecular functions, we also consider those. While our analysis is, for obvious reasons, biased against phenotypic innovations that rely on currently unknown molecular gene functions, this should not create a bias toward noncomplex interactions. Even if the majority of horizontally acquired genes with unknown function were metabolic, there is no reason to think that they differ systematically in some fundamental way from genes of known molecular functions.

Gene functions are frequently discovered based on the effects of single-gene knockouts, suggesting a possible bias of known functions toward noncomplex phenotypes. However, the functions of metabolic genes are, by definition, “noncomplex”: Complex phenotypes arise not from individual enzymes or transporters, but from their interactions. Importantly, the gene-protein-reaction (GPR) associations in E. coli metabolism are well known also for protein complexes, as evidenced by the accurate prediction of gene knockout effects for genes that require protein complex formation for their function (30, 31). Accordingly, the experimental bias toward the study of individual genes appears uncritical in the context of our study.

Importantly, we make no assumptions about the processes or physiological functions in which genes or sets of genes are involved. FBA is agnostic of a gene’s or an operon’s physiological role and simply tests all possible ways in which a molecular function can benefit the production of biomass, given the organism’s overall network of metabolic reactions. Thus, when we test for biomass production in our simulations, the physiological function of genes and operons emerges from the molecular interaction of their gene products with other metabolic proteins encoded anywhere in the genome.
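To make concrete how a newly acquired segment can turn an unviable strain into a viable one, the following is a minimal toy sketch, not the authors' pipeline. Genome-scale FBA solves a linear program over the full stoichiometric matrix; here we assume a drastic simplification in which every reaction converts one unit of a single substrate into one unit of a single product, in which case maximizing steady-state biomass flux reduces to a maximum-flow problem. All metabolite names, reaction names, and flux bounds are hypothetical.

```python
from collections import deque


def max_biomass_flux(reactions, source="ENV", sink="BIOMASS"):
    """Maximal steady-state flux from source to sink (Edmonds-Karp max flow).

    reactions maps a name to (substrate, product, upper_bound); each reaction
    is assumed to be irreversible with one-to-one stoichiometry.
    """
    cap = {}  # residual capacities: cap[u][v]
    for substrate, product, upper_bound in reactions.values():
        cap.setdefault(substrate, {}).setdefault(product, 0.0)
        cap[substrate][product] += upper_bound
        cap.setdefault(product, {}).setdefault(substrate, 0.0)
    if source not in cap or sink not in cap:
        return 0.0
    total = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck along the path, then push that much flux.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        total += bottleneck


def growth(strain, environment):
    """Biomass flux of a strain when the environment supplies the given nutrients."""
    network = dict(strain)
    for nutrient, max_uptake in environment.items():
        network["supply_" + nutrient] = ("ENV", nutrient, max_uptake)
    return max_biomass_flux(network)


# Hypothetical ancestor: can import nutrient A and convert it to biomass.
ancestor = {
    "A_transport": ("A_ext", "A", 10.0),
    "A_to_B": ("A", "B", 10.0),
    "biomass": ("B", "BIOMASS", 1000.0),
}
# Hypothetical single acquired segment: a transporter and an enzyme for nutrient X.
acquired_segment = {
    "X_transport": ("X_ext", "X", 10.0),
    "X_to_A": ("X", "A", 10.0),
}
derived = {**ancestor, **acquired_segment}

env_x = {"X_ext": 10.0}
print("ancestor on X:", growth(ancestor, env_x))
print("derived on X: ", growth(derived, env_x))
```

In this toy case the ancestor produces no biomass on nutrient X while the derived strain does, which corresponds to a "new phenotype" in the sense used in the text. The acquired reactions are useful only because they connect to the existing A-to-biomass route, mirroring the point that physiological function emerges from the interaction with the rest of the network.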

Fig. 3. In the E. coli lineage, 10.6% of new phenotypes and 19.0% of yield improvements evolved through two to four successive horizontal DNA acquisitions on distinct branches of the phylogeny. Note that every single one of these apparently complex phenotypic innovations could instead have been bestowed on the immediate ancestor of the successive DNA acquisitions through a single <30-kb DNA segment presently found in one of the other extant E. coli strains. [Plot: frequency of new phenotypes and yield improvements vs. number of HGTs on distinct branches contributing to phenotype innovations (1 to 4).]

FBA identifies a set of reactions that, if activated together, would provide maximal growth. Thus, when identifying a phenotypic innovation, FBA implicitly assumes that regulation is optimally adapted to the environment considered. When a phenotypic innovation occurs (that is, when a given strain first acquires the enzymes and/or transporters required for growth in a new environment), protein expression is unlikely to be initially optimal; on the contrary, genomic DNA acquisitions may disrupt existing regulatory programs. If the strain finds itself in an environment where the innovation is of relevance, the FBA-predicted phenotype will be realized only after a period of regulatory adaptation. Laboratory evolution experiments have demonstrated that adaptive regulatory changes arise rapidly in evolving E. coli populations through point mutations or copy number changes (32–35). Thus, additional gene acquisitions are typically not required for regulatory adaptation.

190 | www.pnas.org/cgi/doi/10.1073/pnas.1718997115 Pang and Lercher

We cannot test whether an appropriate protein expression pattern evolved in reality, whether the innovation was adaptive, or, indeed, whether the strain even experienced any environment in which the phenotype was of relevance. We find that every potentially adaptive phenotypic innovation arose through the acquisition of a single stretch of DNA of <30 kb in length. While only an unknown subset of these phenotypic innovations were truly adaptive, we can still conclude that all adaptive phenotypic innovations, as far as they are discernible from the genomic data and current metabolic modeling technology, arose through individual (and thus individually adaptive) HGT events.

In our analysis, we focused exclusively on HGT as the source of phenotypic innovations. What about the role of homologous sequence changes induced by genomic mutations or homologous recombination (5–8, 10, 11, 36)? Such sequence changes are likely to affect metabolic phenotypes, especially through changes in gene regulation (discussed above), in enzyme kinetics, or in substrate specificity. While changes in enzyme kinetics may affect growth rates in a given environment, they do not influence a strain’s ability to grow or its maximal biomass yield and are thus irrelevant to the phenotypic innovations analyzed here. Our analysis assumes that all members of an orthologous enzyme family in the E. coli strains are considered to catalyze the same reaction(s) of the same substrates (i.e., have identical biochemical functions). This assumption is justified because homologous enzymes with different biochemical functions generally show much higher sequence divergence than is observed between the orthologous E. coli sequences analyzed here (37). Thus, it appears likely that the majority of metabolic phenotypic innovations observable within the E. coli clade indeed arose via HGT, even if their full fitness benefits required subsequent regulatory adaptations through genomic mutations.

It would of course be desirable to test the predicted phenotypic innovations experimentally, adding the DNA segment inferred to be responsible for the innovation to an ancestral genome and observing the growth of the ancestral and engineered strains in the corresponding environment (after allowing enough time for the strains to adapt their gene regulation). Although such experimental gene additions are infeasible with the scale of our analysis, the removal (knockout) of genes provides very similar information on model reliability. FBA predictions for metabolic gene essentiality in E. coli have been shown to be extremely accurate, with correct predictions for between 91% and 95% of genes (31), justifying our reliance on this methodology.

In sum, our results provide a comprehensive picture of metabolic evolution across the E. coli species. Because two strains belonging to the E. coli species can easily recombine (36, 38), sequence divergence does not impose a barrier to phenotype transfer within the E. coli pangenome. Quite the opposite: even for the most diverged E. coli genomes in our dataset, there is still a 99% chance that a given phenotype of a donor strain can be transferred through the genes contained on a single DNA segment in the donor’s genome of <30 kb. Thus, if one E. coli strain is already adapted to a given environment, another strain “stranded” in this environment can almost always acquire the necessary metabolic reactions from its relative in a single HGT event. Contrary to earlier suggestions (22, 23), neutral metabolic evolution does not appear necessary, nor is it observable across E. coli strains.

E. coli’s adaptive efficiency is largely a consequence of its well-filled metabolic “toolbox” (39, 40). Based on the mathematical and computational analysis of abstract metabolic network topologies (40), the toolbox model posits that the number of phenotypes supported by a metabolic network, F, scales approximately quadratically with the number of enzymes, N, across different bacterial species: $F \approx N^{1.9}/1600$ (see figure 2 in ref. 40). This means that on average, each new phenotype requires the acquisition of $1600/(1.9 \times N^{0.9})$ new enzymes (40). In their toolbox analogy, Maslov and colleagues (39, 40) liken enzymes to tools: The larger one’s toolbox, the more tools one can repurpose for a new task and the fewer additional tools one needs to acquire. For the relatively large E. coli metabolic network [1,336 enzymes for the MG1655 strain (41)], the toolbox model predicts 1.3 enzymes for each new phenotype. For comparison, the endosymbiont Buchnera aphidicola, a species with a much smaller metabolic network [288 enzymes for the APS strain (41)] is predicted to need, on average, 5.2 enzymes. The already high metabolic versatility of E. coli strains means that a large selection of enzymes can be repurposed (exapted) for new phenotypes and that only a few additional genes need to be acquired via HGT. It appears likely that the efficiency of E. coli metabolic evolution through individual DNA transfers within a versatile pangenome is at least in part responsible for the frequent emergence of new pathogenic strains in this clade.

Materials and Methods

A detailed account of the materials and methods used can be found in SI Appendix, Detailed Materials and Methods; for a graphical summary, see Fig. 1. Briefly, we obtained a reliable phylogeny as well as reconstructions of the gene content of ancestral genomes and of HGT events from Pang and Lercher (18). Based on genomic neighborhoods of the extant strains and the previous observation that horizontally cotransferred sets of genes in E. coli genomes almost always lie within 30 kb of each other (18), we identified sets of genes likely to be cotransferred in a single HGT event.

We assembled a set of universal GPR associations across the metabolic networks of the extant E. coli strains reconstructed by Monk et al. (25). We reconstructed the metabolic networks of the ancestral strains by applying the universal GPR associations to the inferred genomes. An efficient implementation (42) of FBA (13, 14) was used to calculate maximal biomass production rates of the ancestral and extant strains across 202,418 nutritional environments. We defined phenotypic distances as 1 − J, where J is the Jaccard index of the subsets of environments in which each of the two compared strains can grow. We defined phenotypic innovations as cases in which a strain can produce biomass in an environment where its immediate ancestor could not (new phenotypes) or where its biomass yield is at least double that of its immediate ancestor (yield-improved phenotypes). In each case, the innovation occurred through the acquisition of one or more genes via HGT on the phylogenetic branch immediately preceding the innovation; if these genes lie within 30 kb of each other on the genome of one of the innovator’s descendants, we concluded that they were coacquired in a single HGT event.

ACKNOWLEDGMENTS. We thank Esther Sundermann for preparing Fig. 1 and Balázs Papp, Csaba Pál, and Bill Martin for helpful discussions. This work was supported by DFG Grants CRC 680 (to M.J.L.) and CRC 1310 (to T.Y.P. and M.J.L.) and by Volkswagen Foundation Grant 93 043 (to M.J.L.).
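
The toolbox-model figures quoted in the Discussion (1.3 enzymes per new phenotype for E. coli MG1655, 5.2 for B. aphidicola APS) can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the authors' analysis. It assumes the scaling law has the form F = N^1.9/1600, which is the form whose derivative yields the stated per-phenotype cost 1600/(1.9 × N^0.9); the function names are hypothetical.

```python
# Illustrative check of the toolbox-model arithmetic (names are hypothetical).
# Assumption: F = N**1.9 / 1600, so that dN/dF = 1600 / (1.9 * N**0.9).


def phenotypes_supported(n_enzymes: float) -> float:
    """Approximate number of phenotypes F supported by a network of N enzymes."""
    return n_enzymes ** 1.9 / 1600.0


def enzymes_per_new_phenotype(n_enzymes: float) -> float:
    """Average number of new enzymes needed per additional phenotype (dN/dF)."""
    return 1600.0 / (1.9 * n_enzymes ** 0.9)


if __name__ == "__main__":
    # Enzyme counts as quoted in the text (ref. 41).
    for label, n in [("E. coli MG1655", 1336), ("B. aphidicola APS", 288)]:
        cost = enzymes_per_new_phenotype(n)
        print(f"{label}: N = {n}, about {cost:.1f} enzymes per new phenotype")
```

Run as a script, this reproduces the two values given in the text to one decimal place, which supports reading the per-phenotype cost as the inverse slope of the scaling law.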

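The definitions given in Materials and Methods (phenotypic distance as 1 − J, the two classes of phenotypic innovation, and the 30 kb criterion for coacquisition) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the growth threshold `GROWTH_EPS`, the treatment of two strains that grow nowhere, and the use of linear (non-circular) genome coordinates are all assumptions introduced here.

```python
from typing import Iterable, Optional, Set

GROWTH_EPS = 1e-9        # assumed numerical threshold for "can produce biomass"
MAX_SEGMENT_BP = 30_000  # 30 kb limit quoted in the text


def phenotypic_distance(envs_a: Set[str], envs_b: Set[str]) -> float:
    """1 - J, where J is the Jaccard index of the two sets of growth environments."""
    union = envs_a | envs_b
    if not union:
        return 0.0  # assumption: two strains that grow nowhere count as identical
    return 1.0 - len(envs_a & envs_b) / len(union)


def classify_innovation(ancestor_yield: float, descendant_yield: float) -> Optional[str]:
    """Return 'new', 'yield-improved', or None for one environment."""
    if descendant_yield <= GROWTH_EPS:
        return None                   # descendant cannot grow: no innovation
    if ancestor_yield <= GROWTH_EPS:
        return "new"                  # ancestor could not produce biomass
    if descendant_yield >= 2.0 * ancestor_yield:
        return "yield-improved"       # biomass yield at least doubled
    return None


def coacquired(gene_positions_bp: Iterable[int]) -> bool:
    """True if all acquired genes lie within 30 kb of each other.

    Assumption: positions are linear coordinates; genome circularity is ignored.
    """
    positions = list(gene_positions_bp)
    if not positions:
        return False
    return max(positions) - min(positions) <= MAX_SEGMENT_BP
```

For example, two strains that share two of four environments in total have a phenotypic distance of 0.5, and a set of acquired genes at positions 1,000, 12,000, and 28,500 bp would be treated as a single HGT event under the 30 kb criterion.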
1. Felsenstein J (1974) The evolutionary advantage of recombination. Genetics 78:737–756.
2. Moran NA (1996) Accelerated evolution and Muller’s rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA 93:2873–2878.
3. Cooper TF (2007) Recombination speeds adaptation by reducing competition between beneficial mutations in populations of Escherichia coli. PLoS Biol 5:e225.
4. Pál C, Papp B, Lercher MJ (2005) Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet 37:1372–1375.
5. Tatum EL, Lederberg J (1947) Gene recombination in the bacterium Escherichia coli. J Bacteriol 53:673–684.
6. Dykhuizen DE, Green L (1991) Recombination in Escherichia coli and the definition of biological species. J Bacteriol 173:7257–7268.
7. Guttman DS, Dykhuizen DE (1994) Clonal divergence in Escherichia coli as a result of recombination, not mutation. Science 266:1380–1383.
8. Kowalczykowski SC, Dixon DA, Eggleston AK, Lauder SD, Rehrauer WM (1994) Biochemistry of homologous recombination in Escherichia coli. Microbiol Rev 58:401–465.
9. Spratt BG, Hanage WP, Feil EJ (2001) The relative contributions of recombination and point mutation to the diversification of bacterial clones. Curr Opin Microbiol 4:602–606.
10. Wirth T, et al. (2006) Sex and virulence in Escherichia coli: An evolutionary perspective. Mol Microbiol 60:1136–1151.
11. Tenaillon O, Skurnik D, Picard B, Denamur E (2010) The population genetics of commensal Escherichia coli. Nat Rev Microbiol 8:207–217.
12. Plata G, Henry CS, Vitkup D (2015) Long-term phenotypic evolution of bacteria. Nature 517:369–372.
13. Watson MR (1984) Metabolic maps for the Apple-II. Biochem Soc Trans 12:1093–1094.

Pang and Lercher PNAS | January 2, 2019 | vol. 116 | no. 1 | 191

14. Orth JD, Thiele I, Palsson BO (2010) What is flux balance analysis? Nat Biotechnol 28:245–248.
15. Sabarly V, et al. (2011) The decoupling between genetic structure and metabolic phenotypes in Escherichia coli leads to continuous phenotypic diversity. J Evol Biol 24:1559–1571.
16. Soucy SM, Huang J, Gogarten JP (2015) Horizontal gene transfer: Building the web of life. Nat Rev Genet 16:472–482.
17. Mira A, Ochman H, Moran NA (2001) Deletional bias and the evolution of bacterial genomes. Trends Genet 17:589–596.
18. Pang TY, Lercher MJ (2017) Supra-operonic clusters of functionally related genes (SOCs) are a source of horizontal gene co-transfers. Sci Rep 7:40294.
19. Junier I, Rivoire O (2016) Conserved units of co-expression in bacterial genomes: An evolutionary insight into transcriptional regulation. PLoS One 11:e0155740.
20. Golomidova A, Kulikov E, Isaeva A, Manykin A, Letarov A (2007) The diversity of coliphages and coliforms in horse feces reveals a complex pattern of ecological interactions. Appl Environ Microbiol 73:5975–5981.
21. Bobay LM, Touchon M, Rocha EPC (2014) Pervasive domestication of defective prophages by bacteria. Proc Natl Acad Sci USA 111:12127–12132.
22. Wagner A (2008) Neutralism and selectionism: A network-based reconciliation. Nat Rev Genet 9:965–974.
23. Wagner A (2011) The Origins of Evolutionary Innovations: A Theory of Transformative Change in Living Systems (Oxford Univ Press, Oxford).
24. Szappanos B, et al. (2016) Adaptive evolution of complex innovations through stepwise metabolic niche expansion. Nat Commun 7:11607.
25. Monk JM, et al. (2013) Genome-scale metabolic reconstructions of multiple Escherichia coli strains highlight strain-specific adaptations to nutritional environments. Proc Natl Acad Sci USA 110:20338–20343.
26. Pupo GM, Lan R, Reeves PR (2000) Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci USA 97:10567–10572.
27. Cohen O, Ashkenazy H, Belinky F, Huchon D, Pupko T (2010) GLOOME: Gain loss mapping engine. Bioinformatics 26:2914–2915.
28. Yahara K, et al. (2016) The landscape of realized homologous recombination in pathogenic bacteria. Mol Biol Evol 33:456–471.
29. Holzhütter HG (2004) The principle of flux minimization and its application to estimate stationary fluxes in metabolic networks. Eur J Biochem 271:2905–2922.
30. Orth JD, et al. (2011) A comprehensive genome-scale reconstruction of Escherichia coli metabolism–2011. Mol Syst Biol 7:535.
31. Hartleb D, Jarre F, Lercher MJ (2016) Improved metabolic models for E. coli and Mycoplasma genitalium from GlobalFit, an algorithm that simultaneously matches growth and non-growth data sets. PLoS Comput Biol 12:e1005036.
32. Fong SS, Palsson BO (2004) Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes. Nat Genet 36:1056–1058.
33. Fong SS, Joyce AR, Palsson BO (2005) Parallel adaptive evolution cultures of Escherichia coli lead to convergent growth phenotypes with different gene expression states. Genome Res 15:1365–1372.
34. Blount ZD, Barrick JE, Davidson CJ, Lenski RE (2012) Genomic analysis of a key innovation in an experimental Escherichia coli population. Nature 489:513–518.
35. Tenaillon O, et al. (2016) Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536:165–170.
36. Dixit PD, Pang TY, Studier FW, Maslov S (2015) Recombinant transfer in the basic genome of Escherichia coli. Proc Natl Acad Sci USA 112:9070–9075.
37. Whisstock JC, Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36:307–340.
38. Dixit PD, Pang TY, Maslov S (2017) Recombination-driven genome evolution and stability of bacterial species. Genetics 207:281–295.
39. Maslov S, Krishna S, Pang TY, Sneppen K (2009) Toolbox model of evolution of prokaryotic metabolic networks and their regulation. Proc Natl Acad Sci USA 106:9743–9748.
40. Pang TY, Maslov S (2011) A toolbox model of evolution of metabolic pathways on networks of arbitrary topology. PLoS Comput Biol 7:e1001137.
41. Caspi R, et al. (2016) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 44:D471–D480.
42. Gelius-Dietrich G, Desouki AA, Fritzemeier CJ, Lercher MJ (2013) Sybil–Efficient constraint-based modelling in R. BMC Syst Biol 7:125.
