Downloaded by guest on October 2, 2021 irba growth microbial growth genomic these and potentials. regimes that selective functional found distinct selec- with We the associated occupies. are on classes organism based an copiotrophy regime and tive organ- oligotrophy a slow-growing evolutionary of observed suggest to definitions extremely us and led for ultimately classes observations predictions These growth isms. growth two in into bias organ- group that toward found naturally biased we Finally, strongly isms growth. are rapid of collections capable culture organisms organisms uncultivated how well and illustrate cultivated as to of com- alone rates We growth metagenomes. sequences the from pared rRNA estimates 16S community weighted from as estimates exten- diversity; allow prokaryotic of sions range the survey across to potential growth amplified single-cell and devel- metagenome- genomes, genomes, predicted 200,000 assembled We and over from estimator status. rates rate growth growth maximal cultivation maximal improved of an oped regardless microbial rate of potential, distribution growth the maximal survey growth to of solution practical estimators a Genomic provide micro- rates. of diversity growth global the bial assess them, to for ability experi- our conditions Growth limiting culture severely culture of appropriate days. physiology design the laboratory to multiple of microbes understanding most using to sufficient lack measured minutes we Yet, ments. typically of are matter rates lifestyle a times microbial doubling from with of magnitude, 2020) parameter ranging of 10, orders basic August several review over a for varies (received that is 2021 rate 16, February growth approved and Maximal CT, Haven, New University, Yale Turner, E. Paul by Edited a Weissman L. Jake codon patterns via usage cells single and metagenomes, from cultures, rates growth microbial maximal Estimating PNAS over- capture to appear (9), statistics usage codon -wide microbial habitats. of different snapshot across unbiased growth and comprehensive a build possi- be to may ble it rates sequences, derived growth fast-growing environmentally maximal toward from estimating directly biased ecosys- By be community. specific to the tend of a members gut) even human at here, the show targeted (e.g., we tem efforts as Moreover, ecosys- culturing waters. many sea comprehensive for deep manner as such high-throughput be laborious tems a is can in cultivation species impractical (14), maximal some phylogeny and microbial their for of on media based diversity growth predicted true Although mak- the rates. 13), assess growth (12, to unknown difficult are it ing organisms the prokaryotic for gut conditions of human culture adequate, majority the even or like optimal, Yet, habitats 11). (9, nutrient-rich slow-growing to harboring habitats, relative systems across organisms marine detected rapidly oligotrophic be to many can ability with (7– differences genomes their lifestyle their Broad of maximal replicate 10). function their and components a in cellular vary as synthesize will rates subsurface species in growth deep and competition, potential for conditions of years nutrient organ- absence many optimal as the marine under high Even oligotrophic as (4–6). even for microbes and days 3) (2, several isms to (1) organisms T eateto ilgclSine–aieadEvrnetlBooy nvriyo otenClfri,LsAgls A90089 CA Angeles, Los California, Southern of University Biology, Environmental and Sciences–Marine Biological of Department ecno oe aia rwhrtspeitdusing predicted rates growth maximal hope, of beacon A ie agn rmudr1 i o laboratory-reared for min 10 doubling with under widely, from vary ranging of times rates growth he 01Vl 1 o 2e2016810118 12 No. 118 Vol. 2021 | oligotrophy a,1 hnwiHou Shengwei , | copiotrophy | oo sg bias usage codon a n e .Fuhrman A. Jed and , a doi:10.1073/pnas.2016810118/-/DCSupplemental at online information supporting contains article This 1 ulse ac 5 2021. 15, March Published BY-NC-ND) (CC 4.0 NoDerivatives License distributed under is article access open This Submission.y Direct PNAS a is article This interest.y performed competing no J.L.W. declare authors data; research; The paper.y analyzed the S.H. wrote designed J.A.F. and and J.L.W. J.A.F. S.H., tools; J.L.W., and reagents/analytic and new contributed S.H., J.L.W. J.L.W., research; contributions: Author and environmentally (26–30)], we [MAGs genomes, method, genomes metagenome-assembled our reference derived (23–25), Using genomes representative prokaryotic (gRodon). 200,000 including over these package in of rates R growth implementation assay an an community in provide bulk methods neglected we previously to Together, but applied important correction. an when species metagenomes, method on from perfor- data based doing the predictive correction In to our a 22). improve abundances provide substantially we (20, to Additionally, usage able mance. codon are we of so, dimensions additional ing (9). data. data genomic genomic maximal from predict partial rates to growth bias with this leverages used even software be growthpred predictions can Their and accurate rates growth make maximal to high of is predictor genes expressed best highly genes the other in and (CUB) proteins bias ribosomal replication, RNA usage for of codon coding several high origin ribosomal etc.), the number, among (e.g., to copy proximity tRNA that growth and number showed of copy [rRNA] indicators alternative (9) genomic Rocha of (16– pools possible and usage (tRNA) Vieira-Silva RNA in biased transfer 21). vary cellular a Highly to (15). may acid. optimized demonstrate amino codons, communities genes given a genes natural degenerate, for codons expressed of is alternative of rates code usage their genetic growth the the Because in trends all owo orsodnemyb drse.Eal [email protected] Email: addressed. be may correspondence whom To etn aua iiino irba rwhsrtge into strategies growth microbial classes. of two sug- division rates, natural growth a maximal we of gesting Finally, distribution bimodal organisms. a fast-growing noticed are toward collections culture biased current strongly how reveal in data uncultivated rates yet These distributed growth as species. many the including rates organisms, predicted 200,000 we over growth of data, experiments. genomic potential Using growth are tradition- nature? that laboratory have how those using rates However, measured be growth been to microbial ally tend because toward which fast, biased grow organisms, necessarily is culturable growth often easily about are slowly. knowledge soil very grow Our only and can that seawater microorganisms by like dominated environments growth rapid many have microbes rates, that perception wide the Despite Significance eetn h oko iiaSlaadRca()b assess- by (9) Rocha and Vieira-Silva of work the extend We https://doi.org/10.1073/pnas.2016810118 . y raieCmosAttribution-NonCommercial- Commons Creative . y https://www.pnas.org/lookup/suppl/ | f10 of 1

MICROBIOLOGY single-cell amplified genomes [SAGs (31, 32)], in order to sur- more difficult to accurately estimate pair bias due to the large vey the natural diversity of prokaryotic growth rates. We provide number of possible codon pairs, we do so on a genome-wide this comprehensive set of over 200,000 predictions as a com- scale, calculating pair bias over all genes rather than just for piled database of estimated growth rates (estimated growth rates highly expressed genes (our R package includes a “partial” mode from gRodon online [EGGO]). This database reveals a strong for when this is not possible due to partial genomic information). bias in existing reference genomic databases toward fast-growing Consider that if there are 64 codons, the number of possible organisms. Finally, we observe a bias in growth predictions for ordered pairs is 4,096, and accordingly, far more data will be slow-growing organisms, ultimately leading us to suggest an evo- needed to reliably estimate the frequencies of all of these pairs lutionary definition of oligotrophy based on the selective regime than the original set of codons. an organism occupies. We fit our model using all available completely assembled genomes in RefSeq (1,415) for the set of 214 species with doc- Results and Discussion umented maximal growth rates compiled by Vieira-Silva and Predicting Maximal Growth Rates. Rocha (9). All three of these measures were significantly associ- More than one aspect of codon usage is associated with growth. ated with growth rate in a multiple regression (CUB, P = 2.2 × We measured three features of codon usage: 1) the CUB of a 10−37; consistency, P = 8.1 × 10−15; codon pair bias, P = 5.3 × user-defined set of highly expressed genes relative to an expecta- 10−6; linear regression). Furthermore, comparing nested mod- tion calculated from the genome-wide codon usage pattern (20), els, incorporating first CUB, then consistency, and finally, codon 2) the CUB of highly expressed genes relative to an expecta- pair bias, we found that each nested model fit the data signif- tion calculated from the codon usage pattern of highly expressed icantly better than the last (addition of consistency, P = 4.2 × genes only, and 3) the genome-wide codon pair bias (22). Details 10−11; addition of codon pair bias, P = 4.0 × 10−6; likelihood of these calculations are in Materials and Methods. In practice, ratio test). we take the set of highly expressed genes to be those coding gRodon accurately predicts maximal growth rates. The gRodon for ribosomal proteins because these genes are expected to be model fit the available maximal growth rate data well (adjusted highly expressed in most organisms (9). The first measure cap- R2 = 0.63) (Fig. 1A). Our model demonstrated a significantly bet- tures CUB in the classical sense, and the measure independent ter fit to growth data than a linear model fit on the output of of length and composition (MILC) metric we use (20) controls growthpred (ANOVA, P = 1.1 × 10−8)(SI Appendix, Fig. S1). for overall genome composition as well as gene length. The sec- Notably, gRodon provided a better fit to the data than growthpred ond measure captures the “consistency” of bias across highly at both high and low growth rates (SI Appendix, Fig. S2). expressed genes, with the intuition that if highly expressed genes We considered the possibility of overfitting our model to the are optimized to cellular tRNA pools, then they will share a com- data, which would inhibit our ability to apply our predictor to mon bias (low values indicate high consistency). This quantity new datasets. Overfitting is a particularly relevant concern when can be thought of as the “distance” between highly expressed dealing with species data since models may end up being fit to genes in codon usage space, where we expect these genes to be underlying phylogenetic structure rather than real associations close together when they are highly optimized for growth. The between variables. In addition to traditional cross-validation (SI third measure, codon pair bias, captures associations between Appendix, Fig. S1A), we implemented a blocked cross-validation neighboring codons, which have been suggested to impact trans- approach, which effectively controls for phylogenetic structure lation (22, 33, 34). Specifically, it has been shown that altering when estimating model error (35). Under this framework, we the frequency of different codon pairs (but not the overall codon take each phylum in our dataset as a fold to hold out for inde- or usage) can lead to lower rates of translation, and pendent error estimation rather than holding out random subsets this strategy has been used to produce attenuated polioviruses of our data as in traditional cross-validation. We found that [potentially to engineer novel vaccines (22)]. Because it is much even when predicting growth rates for each phylum in this way

AB

Fig. 1. Predictions from gRodon accurately reflect prokaryotic growth rates, with the caveat that (A) gRodon underestimates doubling times when growth is very slow due to (B) a floor on CUB reached in slow-growth regimes. Vertical dashed red lines at 5 h indicate where the CUB vs. doubling-time relationship appears to flatten. The black dashed line in A is the x = y reference line.

2 of 10 | PNAS Weissman et al. https://doi.org/10.1073/pnas.2016810118 Estimating maximal microbial growth rates from cultures, metagenomes, and single cells via codon usage patterns Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 siaigmxmlmcoilgot ae rmclue,mtgnms n igeclsvacodon via cells single and metagenomes, cultures, patterns from usage rates growth microbial maximal Estimating al. et Weissman with domi- consistent will is drift expectation This enough, process. low evolutionary low, is the quite coefficient nate likely selective the is after proteins and ribosomal to of selection microbes, slow-growing optimize 1A very of (Fig. populations time In underes- S1A). doubling to actual tends still the it timate growthpred growers. outperforms slow gRodon while of problem growth The maximal average Appendix, (SI community-wide metagenomes from rates bulk accurate of more EGGO for allows prediction that the metagenomes for in correction include dance entry and metagenomes. each dataset from Prediction for (9) models Rocha database. both and trained from Vieira-Silva model predictions the the alterna- alongside this on package of include gRodon only rates We the in ). S7 growth model Fig. two maximal tive Appendix, the the (SI despite species original on ), S8 several the and disagreeing S7 to datasets Figs. results training Appendix , similar (SI very model genomes). yields gRodon 8,287 with model species retrained (464 database The (42) al. in rates et genomes Madin growth assembled the the completely on with gRodon microbes with retrained associated We rates. including growth (42), on data data phenotypic attempted microbial has available work all recent compile describing but to rates, database growth cultured single microbial known for no all is estimates there doubling-time Unfortunately, recorded sub- microbes. a available only of includes from but set dataset application times hand-curated this doubling carefully for specifically a minimal compiled was (9) of Rocha and set Vieira-Silva original The predictions. growth (15). predict microcosms previous can marine with low-nutrient approaches in consistent CUB-based rates GC is that or This sequence showing S6). coding work Fig. percent Appendix , either (SI by residu- content model affected have our not rate, to were growth als appear with does of association sequence targets nonlinear some coding other our correct these percent sure by should While be affected selection. usage to not wanted was codon we performance of model’s composition, measures nucleotide content genome our GC for genomic While low 41). for cod- (40, selection just (high-percentage experi- alongside and streamlining microbes sequence) genome S4) ing many for Fig. environments, selection Appendix , look- ence especially marine (SI Finally, both S5). oligotrophic genome Fig. recombination, a in Appendix, (SI of in proteins between ribosomal signal genes CUB the all a in at without differences ing apparent or genome no with the significant found genes along but We weak content 39). to (GC) (38, lead guanine–cytosine can in and patterns selection of increases efficiency locally the Recombination estimates. rate growth (e.g., genomic that pop- sizes caution For population (SI we weak. symbionts), effective residuals intracellular rather atypical model is extremely our effect with as the ulations well although as S3), Fig. (37)], Appendix, expected population be given might a [as effec- that on found the acts We but selection (36). here, efficiently how exploit (N determine we size for signal population Selection the the forces. tive drives are evolutionary growth statistics interacting rapid several codon of Observed result performance. model affect an given when even disadvantage. growthpred that unfair outperform meaning to phylum), able test was the its (including gRodon on dataset based entire the were to Importantly, predictions fit S1 B). growthpred’s Fig. comparison, Appendix, this (SI for phyla for of majority predictions large growthpred’s excluding the outperformed but we phyla phylum), other test all to the fit model our from (extrapolating eas sesdteipc fortann e ngRodon’s on set training our of impact the assessed also We could that variables confounding of number a examined We N e e screae ihmxmlgot rate growth maximal with correlated is n h aeo eobnto will recombination of rate the and ) eipeetdaseisabun- species a implemented We o eyln obigtimes, doubling long very For etS1 Text N and e slkl oconfound to likely is n is S9–S12 ). Figs. and Fig. Appendix, SI al .Smayo GOdatabase Source EGGO of Summary 1. Table 3 (Fig. marine both 50). in (12, seen difficult clearly more be B culturing be can since general pattern environment in This that will across rates species average growth slow-growing maximal true higher the have than average an from on microbes will cultured that environment uncultured is expectation of basic rates important A growth organisms. provide of and distribution the (26,490) about database information overall our biases of culture portion insights. rRNA strong 16S ecological reveal their and genomes of derived basis Environmentally the Appendix, on (SI alone taxa sequence microbial to growth of predictions propagation the including applications, potential 2A). many Fig. in plot time to doubling small minimal account- (too mean h cluster, a 99 with slow-growing of observations and of also 0.4% to small for We corresponding very ing 2A). third roughly (Fig. a respectively, divide obtained h, copiotroph/ 7.9 proposed doubling and mean our with 2.7 microbes, of of clusters times large two in obtained slow- we and together fast- 2 separated group (Fig. divide phyla 5-h broadly growing the and to rate, (Fig. growth tended of terms classifying phyla for Additionally, doubling- above 5-h 2A). proposed the we to cutoff corresponding peaks time bimodal, between roughly split was the RefSeq publicly with across The rates 217,074 24)]. growth (23, of [184,907 from distribution assemblies genome cor- rates RefSeq majority the to these, growth responded Of SAGs. predicted and MAGs, genomes, of available (44–49) 1) Database. EGGO The (dis- observed is CUB) Copiotrophy). that (e.g., for so in optimization organism enough growth cussed an low of is as growth signal oligotroph maximal no an rapid sug- for threshold of selection this definition fact, which In natural default. a pragmatic a gests information, as additional well confounders serves without most h and to 5 S3–S6), robust Figs. largely Appendix, population be (SI but to size, vary), appears population etc. predictor local rates, our recombination as regimes, (e.g., selective vary rate unre- structure, will reasons 1B; growth for CUB Fig. populations to of and in lated degree species flattens greater across the is CUB degree Obviously time some which doubling to S13). at a Fig. threshold if While Appendix, us (the tell SI scenario? long reliably h a can 5 it such h, than in 100 or time done 10 doubling a of be between differentiate can accurately cannot What gRodon coefficients (43). selective when small drift are to limited due is fluctuations optimization of stochastic Importantly, evolutionary predictors by genomic where times. many rate for doubling growth problem high a maximal be very likely will at floor floor this 1B a Fig. reaches in teins seen pattern the OGtois(2 A ,1 aiesurface Marine 809 2,266 7,214 7,287 — MAG MAG (45) MAG MarRef SAG (44) al. et Delmont (27) al. et 184,907 Tully (32) GORG-tropics (26) al. et Isolate Parks (23) Assemblies RefSeq oe ta.(8 slt ,5 ua gut Human gut Human microbiome Human 3,459 4,483 4,431 Isolate MAG MAG (49) al. et Zou (48) al. et Poyet (47) al. et Nayfach (46) al. et Pasolli and ent htti ag aaaeo rdce rwhrtshas rates growth predicted of database large this that note We OGtois lblOenRfrneGnmsTropics. Genomes Reference Ocean Global GORG-tropics, IAppendix, (SI -associated and ) S18 Fig. Appendix, SI rpsdEouinr entoso lgtoh and Oligotrophy of Definitions Evolutionary Proposed ecntutdadtbs EG)(Table (EGGO) database a constructed We B slt 725 Isolate slt ,9 ua gut Human 1,493 Isolate yeN.o eoe Environment genomes of No. Type and AsadSG aeu sizable a up make SAGs and MAGs .UigaGusa itr model, mixture Gaussian a Using C). hr U fterbsmlpro- ribosomal the of CUB where etS2 Text https://doi.org/10.1073/pnas.2016810118 n is S15–S17). Figs. and PNAS Marine Marine Marine — | A f10 of 3 and

MICROBIOLOGY A

BC

Fig. 2. Prokaryotes with sequenced genomes span a broad range of predicted growth rates. (A) Predicted growth rates for assemblies in NCBI’s RefSeq database. Growth rates were averaged over genera to produce this distribution since the sampling of taxa in RefSeq is highly uneven (SI Appendix, Fig. S14 has full distribution) (a small number of genera had inferred doubling times over 100 h, 6 of 2,984). Clusters correspond to the components of a Gaussian mixture model, with area under each curve scaled to the relative likelihood of an observation being drawn from that cluster. (B and C) Growth rate distributions for individual (B) fast- and (C) slow-growing phyla (only showing phyla with ≥ 30 genera represented in RefSeq). Vertical dashed red lines in A–C are at 5 h for reference.

Fig. S19) environments, with isolate collections showing much especially when studying subsets of microbes for which additional shorter predicted doubling times than MAGs and SAGs from metadata are available. For example, the very largest cells in the same environments. Even in sets of isolates meant to capture marine samples seem to also be those with the highest maximal the complete taxonomic diversity in an environment (48, 49), we growth rates (Fisher’s exact test, P = 2.2 × 10−15)(SI Appendix, see that they fail to capture the most slowly growing members Fig. S21). This is consistent with the “nutrient growth law” coined of the community (SI Appendix, Fig. S19). Illustrating this gap by Schaechter et al. (51), which describes a simple exponen- is important, as it shows how existing culture collections are not tial relationship between bacterial cell volumes and their growth only incomplete but also biased. These patterns are most appar- rates. In contrast, a similar analysis of growth rate vs. cell size ent when looking within an environment and largely disappear using data from a trait database and cultured isolates (42) was when comparing against MAGs from diverse environments (SI unable to find any association between cell size and growth rate Appendix, Fig. S20) (26). (SI Appendix, Fig. S22). Because maximal growth rate is a basic For marine environments in particular, where oligotrophic parameter of microbial lifestyle (10), gRodon and EGGO allow organisms are prevalent, the existing set of fully sequenced iso- us to build better large-scale comparative studies linking specific lates [MarRef (45)] does a poor job of representing the natural traits and habitats to particular microbial life histories. distribution of growth rates among taxa (Fig. 3 A and B). We found that these biases are not attributable to simple taxonomic Proposed Evolutionary Definitions of Oligotrophy and Copiotrophy. biases in the genomic database. Using phylogenetic logistic Codon usage may be optimized to promote either translational regression, we found that whether or not we classify an organ- accuracy, efficiency, or more likely, some combination of the ism as a copiotroph (d < 5) (Proposed Evolutionary Definitions of two (21, 52–57). In particular, the strong relationship between Oligotrophy and Copiotrophy) has a positive impact on whether maximal growth rate and CUB is thought to be a product of opti- that strain is represented by a fully sequenced isolate (Fig. 3C) mization for translational efficiency (9, 52, 56, 58). Our results (β = 1.0, P = 7.3 × 10−5). This relationship is robust to removal indicate a clear divide among prokaryotes, where an organism of individual species and entire phyla from the dataset (Fig. 3 D either does or does not experience selection on codon usage to and E), as well as to sample size (Fig. 3F) and phylogenetic optimize translational efficiency (Fig. 1B). We use this division uncertainty (SI Appendix, Fig. S18). Thus, slow-growing marine as the basis for an evolutionary definition of oligotrophy and organisms are less likely to be represented among completely copiotrophy—defining an oligotroph as an organism for which sequenced genomes than fast-growing organisms, regardless of selection for rapid maximal growth is weak enough so that trans- phylogenetic group. We found the same general pattern in the lation efficiency is not optimized by selection on codon usage human gut, although our analysis had less power due to the (and a copiotroph as an organism that does experience optimiza- small number of slow-growing organisms in the gut (β = 1.8, tion of translation efficiency). This grouping was supported by P = 0.012) (SI Appendix, Fig. S19). clustering growth predictions from all organisms in RefSeq (Fig. The strong bias shown in Fig. 3 A and B and SI Appendix, 2), where we saw two groups naturally emerge with the boundary Fig. S19 A and B serves to illustrate how important methods for between them at a doubling time of approximately 5 h, consistent genome-based phenotype prediction are for understanding nat- with the CUB optimization cutoff in Fig. 1B. ural microbial systems. With this in mind, we note that there are To illustrate our point that oligotrophy and copiotrophy cor- many potential use cases for gRodon and the EGGO database, respond to two distinct selective regimes, we devised a test

4 of 10 | PNAS Weissman et al. https://doi.org/10.1073/pnas.2016810118 Estimating maximal microbial growth rates from cultures, metagenomes, and single cells via codon usage patterns Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 nhgl xrse ee eaiet h eto h genome the of rest the to relative genes expressed sites highly synonymous at in selection purifying stronger for evidence had siaigmxmlmcoilgot ae rmclue,mtgnms n igeclsvacodon via cells single and metagenomes, cultures, patterns from usage rates growth microbial maximal Estimating al. et Weissman in difference clear a Fig. was Appendix , S23 (SI oligotrophs there and copiotrophs cutoff, between selection doubling-time calculate 5-h to our organisms pairwise between 60,000 nearly comparisons made we Genomic (60)], Tight database (ATGC) [Alignable Clusters large organisms a related copi- Using closely We among oligotrophs. of genome. lower database with significantly the compared as be of organisms will rest otrophic in statistic the sites this to synonymous that relative predict at genes selection expressed of across highly degree substitution statistic, the genome-wide synonymous describes a of which define we rate Thus, genome. overall the also the we to efficiency, (dS relative substitution translational synonymous of high rate on the maintain selection expect purifying to strong usage a expect codon has we where 59 the genes, ref. of (dS expressed e.g., discussion substitution [although, developed synonymous evolution substitution more neutral of synonymous for rate and expectation the nonsynonymous uses of the and across to rates similar genes is pares on expressed test selection highly Our purifying species. of in strength polymorphisms the synonymous compare to selection for species genus-level individual analysis, of for removal used the dataset (D) (full to MarRef robust (F in and is represented relationship), copiotrophy are positive and genus still doubling that status but mean of isolate weaker the research representatives between (E whether distinct any regression analysis, summarizes by whether logistic the heatmap generated and Phylogenetic from corresponding MAGs genus the depths). only). that and multiple visualization for genera, at for h than per summary 5 rather sampled than surface tip ocean less one to the includes was predicted at shown time times (only tree doubling sampled The few (C phylogeny. were very distributions. of SAGs with rate capture independent MAGs, how growth to than maximal to fail consistent rate part and surprisingly growth in SAGs showed overall and due groups lower database) likely a [GORG-tropics] h, Tropics showed 5 Genomes SAGs under Reference Additionally, community. be Ocean the Global of the fraction (from slow-growing MAGs the than average on times doubling predicted 3. Fig. (P ) rdce aia rwhrtsi aieevrnet.Osreta (A that Observe environments. marine in rates growth maximal Predicted < 2 × 10 −16 h eoa fetr ldsfo h nlss(h eoa fPoebcei,tems bnatpyu ntedtst ed oa to leads dataset, the in phylum abundant most the Proteobacteria, of removal (the analysis the from clades entire of removal the ) Mann–Whitney , CD AB dN dN U h eoa flrefatoso h aa(pt 50%). to (up data the of fractions large of removal the ) /dS /dS et,weecopiotrophs where test), ttsi,wihcom- which statistic, ttsi] o highly For statistic]. HE dS ob reduced be to ) HE /dS sanull a as ) dS HE Using . /dS , ul eune sltsfo aRfaemr ieyt ecporps( copiotrophs be to likely more are MarRef from isolates sequenced Fully ) piiaino ihyepesdgns(dS genes expressed highly of optimization oe eiul bv) sn rxe for our proxies that Using above). residuals model (N (s size coefficient selective the by both (dS inrt r rbbymsiga vnsrne ifrnein difference stronger even an masking muta- probably the in are differences rate anything, if tion Thus, shown (61–66). been with locally has level increases it expression rate as mutation true, that likely organisms diverse olig- is across than opposite the copiotrophs fact, in are, In expressed genes otrophs). highly selec- expressed more of even highly absence general, encoding the in genes expect in that even might (assuming here we shown tion rate those level, differ- mutation to expression if results by with and similar confounded genome, correlated the be negatively across could is rate test mutation in our ences caveat, important an htamcoecnocp.I ouaingntctrs we that such terms, which enough under low genetic is regime efficiency population selective translational a In on in selection existing occupy. as can oligotrophs regime define selective microbe specific a a as that terms, evolutionary in oligotrophy here. see we what N 1 e motnl,orcasfiainrdfie oitoh and copiotrophy redefines classification our Importantly, h onaiso lgtoh,i u iw r hsdefined thus are view, our in oligotrophy, of boundaries The . HE dS and F E HE /dS e eoe rmflysqecdioae Mre)hv shorter have (MarRef) isolates sequenced fully from genomes B) faseis(silsrtdb h fet of effects the by illustrated (as species a of ) /dS dS 0 = HE ttsi ewe oitoh n lgtoh than oligotrophs and copiotrophs between statistic ,btoiorpssoe iteeiec for evidence little showed oligotrophs but .74), /dS ttsi a eaieycreae with correlated negatively was statistic https://doi.org/10.1073/pnas.2016810118 n h fetv population effective the and ) s HE and /dS N PNAS e 0 = efound we , N e . .As 98). | nour on d f10 of 5 < s sN  5), e

MICROBIOLOGY as expected, particularly in copiotrophs (SI Appendix, Fig. S23). at high nutrient concentrations must grow quickly), yet theory More broadly, it appears that at typical Ne values for microbes predicts that, given a rate vs. yield trade-off in adenosine triphos- [∼ 108 (37)] (SI Appendix, Fig. S3), codon optimization levels phate (ATP) production, high-resource environments should off for maximal doubling times greater than 5 h (Fig. 1B and favor more energetically wasteful but faster growth due to SI Appendix, Fig. S13). Even for marinus, which increased competition (69). Thus, we generally expect oppor- may have very large effective population sizes [> 1013 (67) over a tunistic and rapid but wasteful ATP production in nutrient- well-mixed marine region, although some estimates of Prochloro- replete environments vs. slow but relatively energy-efficient ATP 8 coccus Ne are much lower at ∼ 10 (37)], growth rates were production in nutrient-limited environments. Therefore, a natu- severely underestimated, although still above our 5-h threshold ral expectation is that slow- and fast-growing organisms should (predicted doubling time of 6.2 h vs. an actual doubling time of have distinct resource acquisition strategies aligned to these 17 h for strain CCMP1375). This would seem to indicate that for general niche types (7). slow-growing organisms like Prochlorococcus, there is essentially In fact, we found that organisms belonging to our two nat- no selective advantage to optimizing translational efficiency via urally defined growth rate clusters (Fig. 2), which we refer codon usage (s ≈ 0). to here as copiotrophs and oligotrophs, had distinct genomic Functional differences between copiotrophs and oligotrophs. content reflecting two alternative microbial lifestyles. Gene We recognize that our evolutionary definition of copiotro- families involved in transcription and carbohydrate transport phy/oligotrophy complicates an already murky set of definitions. and metabolism were strongly overrepresented on copiotroph The terms “oligotroph” and “copiotroph,” as used in the lit- genomes relative to oligotrophs, corresponding to an overall erature, typically conflate two features of microbial lifestyle: strategy of rapid acquisition of nutrients and protein produc- resource use and growth rate (41). Giovannoni et al. (41) dif- tion (Fig. 4A). Gene families involved in energy production and ferentiate the classical definition of oligotrophs and copiotrophs conversion and replication, recombination, and repair were over- as organisms that grow at low and high nutrient concentrations, represented on oligotroph genomes, corresponding to an overall respectively, from the more general ecological classes of r and K strategy of energy production and cell maintenance (Fig. 4A). strategists, which are specialized for either rapid, opportunistic In all, 13 major classes of genes [as defined by the Clusters growth or slow and steady growth, respectively (68). Giovannoni of Orthologous Genes (COG) database (70)] were significantly et al. (41) emphasize that these nutrient and growth rate defini- differentially enriched on copiotroph or oligotroph genomes. tions need not overlap (not all organisms specialized for growth Moreover, many individual gene families were far more

AB

C

Fig. 4. Copiotroph and oligotroph genomes are enriched for different functions. (A) The difference between the average proportion of genes in copiotroph genomes (ECopiotrophs) and the average proportion of genes in oligotroph genomes (EOligotrophs) assigned to various classes of genes [COG classifications from eggnogmapper (107)]. Positive numbers indicate a functional class is enriched as a percentage of total genes in copiotrophs relative to oligotrophs, and negative values are the opposite. Only significantly differentially enriched classes are shown (with red bars emphasizing classes with larger differences). (B) Volcano plot showing differential prevalence across copiotroph (PCopiotrophs) and oligotroph (POligotrophs) genomes of specific gene families belonging to the most differentially enriched gene classes. (C) Table of differentially prevalent gene families from the most commonly differentially prevalent classes (SI Appendix, Fig. S24 has a full table).

6 of 10 | PNAS Weissman et al. https://doi.org/10.1073/pnas.2016810118 Estimating maximal microbial growth rates from cultures, metagenomes, and single cells via codon usage patterns Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 esnbygo o fpeitn bevdinstantaneous observed predicting of job do good estimators CUB-based reasonably that using a shown work recent habi- has that communities suboptimal encouraging natural is to it dispersal Nevertheless, etc.). quality, tats, rate habitat maximal physiological why fluctuating its reasons at of (e.g., reproduce number not any may are organism not There an time. is con- sampling organism bottom-up the and an at top-down of trols of influence rate cryptic growth the given maximal clear the and rate growth limitations, here. and done robustness have their we assess to as important is it that processes and be demographic and/or may evolutionary signals other clear genome-wide by has confounded that genomes emphasize from we traits although of utility, rate inference Growth where cultivable. example their easily one extrapolate more is to are able that be microbes may from and traits genomes methods). their extinction have to dilution we consuming) by Yet, (e.g., time laboratory very the necessarily in grow (and to difficult ones most the missing be over- precisely to microbes are to found collections slow-growing unlikely culture the are derived as classically We from easily, 3). biases ref. have these (e.g., efforts come gaps) of culturing significant diversity of in notable the set filled capture (although the fully lifestyles how not rates show prokaryotic does we well- growth isolates tools, cultured a maximal these existing compiling of Using prokaryotes. and form across predicting the database for comprehensive in and (EGGO) resource (gRodon) package community R documented a produced We resource and Conclusions growth their just than sys- strategies. more oligotrophs acquisition in and copiotrophs differ specific, that tematically specific. copiotroph oligotroph suggests are this are defense Altogether, involved antiviral of antimicrobials genes forms many many against whereas that appears defending it in Thus, test). 15%, exact vs. Fisher’s (48 whereas were (76)], genes copiotroph-specific systems few CRISPR-Cas restric- genes systems, [e.g., defense modification oligotroph-specific antiviral tion many DNA-degrading in Similarly, involved were test]. 0%, vs. exact 24 Fisher’s (72–75); helix- domains and (HTH) domains, (PIN) turn-helix pili N-terminal domains, protein motility (HEPN) eukary- twitching nucleotide-binding higher prokaryotes [containing and specific no otes copiotroph whereas were defense, antiviral genes DNA- in such were involved hand, likely other proteins the binding on pumps). genes, efflux (e.g., oligotroph-specific proteins Many transport were genes these of Many siaigmxmlmcoilgot ae rmclue,mtgnms n igeclsvacodon via cells single and metagenomes, cultures, patterns from usage rates growth microbial maximal Estimating al. et Weissman 0%, vs. (55 olig- were specific genes and otroph such antibiotics) no whereas bacteriocins, two hydroperoxides), (e.g., (e.g., oxidants these provide antimicrobials in to to appeared of found specific resistance majority copiotroph genes were The A that defense differences. differ genes stark 4). defense of not revealed (Fig. types organisms did oligotrophs of the defense groups and at to copiotrophs look related between percent- closer genome total deal the the great in as a genes even among S24), of prevalent Fig. age more were Appendix , (70)] (SI V oligotrophs group [COG defense lular S25). Fig. Appendix, hydrolases (SI (glycoside lyases) those carbohydrates polysaccharide specifically copiotrophs and down families, that breaking gene found in these and involved for (71) enriched genomes greatly of enzymes were classes carbohydrate-active oligotrophs. two for among our searched prevalent produc- in specifically energy more also in were involved We At conversion families copiotrophs. and gene amino among many tion and time, prevalent carbohydrates same more the of frequently metabolism were and acids transport the in 4 (Fig. C genomes oligotroph or copiotroph among prevalent ti motn orcgieta h eainhpo h nsitu in the of relationship the that recognize to important is It ayidvda eefmle lsie sfntoigi cel- in functioning as classified families gene individual Many and .Ntby eefmle involved families gene Notably, S24). Fig. Appendix, SI P < 3.7 × 10 −15 ihrseattest). exact Fisher’s , P P < < 2.6 2.4 × × B 10 10 and −4 −4 , , ecuinuesaantdaigsrn ocuin hni sapidto [optimal applied prediction). for is used it not model, when is this conclusions temperature fit default, the strong (by to drawing in extremophiles used against extremophiles microbes users few caution these the the we in given using temperature although growth fit model, optimal growth- for option final to accounts that temperature Similar package (9). gRodon a final patterns organisms include usage these methods we codon as prediction pred, their (31) two in dataset the differ the excluded sure we systematically from we fitting, make psychrophiles model although to initial and For as database, thermophiles datasets. so identical provided on feature compared a this were using use not proteins also proteins did default ribosomal can ribosomal growthpred NCBI’s for Importantly, the gRodon. by search and were growthpred generated these both across to annotations and statistics passed pipeline, the usage annotation from codon annotations prokaryotic directly protein three Ribosomal taken our species. that of were we to species, each corresponding each genomes of all For mean (23–25)]. the Biotechnology [1,415 available for calculated database all Center RefSeq downloaded National (NCBI) we the Information (9)], from assemblies [214 genome dataset complete Rocha and Vieira-Silva and nal ggplot2 Fitting. packages Model R using made 82). were (81, at ggpubr figures downloaded All be can ecoevo/gRodon. vignette, a and doc- umentation including available package, are gRodon database, The EGGO growth full https://github.com/jlw-ecoevo/eggo. predicted the at as and well datasets as genomic analysis, various and for rates figures generate to used scripts All Methods and organisms. Materials of groups oligotrophs discrete be practice to in appear his- copiotrophs strategies, life and of acquisition continuum resource a to and describe 2), correspond fact tory in not and could in and need 1 while classes copiotrophy distinct Thus, (Figs. and spectra. optimization these oligotrophy of growth principle ends of evolution- opposite distinct to terms two corresponding in into fall regimes to ary appear continuous along microbes fall but rate spectra, growth and Environmen- concentration 4). resource (Fig. tal maintenance cell and specializing production in energy oligotrophs in specializing and breakdown copiotrophs distinct and with two acquisition nutrient with microbes, align of also classes that functional classes growth defined tionarily community, to system. a copiotrophs a of of in proportion oligotrophs rate relative the growth approximating instantaneous by likely the tend capture rate growth also maximal to of estimators CUB-based that data suggest the our experiments, enrichment with nutrient together most against to trough taken the benchmarking Thus, of reported to (15). exception copiotrophs the been peak abundant with highly have plankton, as marine growth even for poorly estimating (15), work of systems methods marine (77–80) in rates growth aemdlfitn rcdr.W oktemnmlrcre doubling recorded “condensed minimal the in the species took each We from procedure. time and fitting model bias same pair (excluding “par- “metagenome” gRodon’s and for bias) modes. models consistency) fit pair we (excluding Similarly, tial” predictors. as measures usage rmNB ne iPoetPJB21.Rwsqecn aafrthe for data sequencing Raw obtained PRJEB22811. were experiments BioProject (85) al. under et Okie NCBI the of from end the at taken samples Data. Metagenomic expectation an using variance) and criterion the mean information (for algorithm. chooses maximization Bayesian optimum Mclust this the (84). finds on and settings based (BIC) default Gaussians with of mixture package optimal mclust the in tion with psy- dealing or when thermophiles preferable set either training extremophiles. were this species making these 8,287 perhaps to chrophiles, of matched species species 130 464 that Notably, with set with genomes. training our associated yielded genomes This RefSeq. assembled from completely possible, all where obtained and (https://doi.org/10.6084/m9.figshare.c.4843290.v1) file ete talna oe oBxCxtasomdduln times doubling Box–Cox-transformed to model linear a fit then We evolu- propose to us led usage codon of analysis our Finally, o tigo h ai ta.(2 riigst eue the used we set, training (42) al. et Madin the on fitting For h asinmxuemdli i.2wsfi sn h cut)func- Mclust() the using fit was 2 Fig. in model mixture Gaussian The λ hsnuigteMS akg 8) sn u he codon three our using (83)] package MASS the using chosen o ahseiswt rwhrt itdi h origi- the in listed rate growth a with species each For h a eunigdt o h eaeoi water metagenomic the for data sequencing raw The https://doi.org/10.1073/pnas.2016810118 traits CIcv supplementary NCBI.csv” https://github.com/jlw- PNAS | f10 of 7

MICROBIOLOGY time series samples taken by Coello-Camba et al. (86) were obtained from median as with CUB) since we are interested in the distance of all ribosomal NCBI under BioProject PRJNA395437. Adapters and low-quality reads were proteins from the expected codon usage patterns. trimmed using fastp v0.21.0 (87) with default parameters, and only reads For codon pair bias, we implemented the calculation by Coleman et al. longer than 30 base pairs (bp) were kept for further analysis. Okie et al. (85) (22) that controls for background amino acid and codon usage when esti- samples were assembled individually using metaSPAdes v3.10.1 (88). Coello- mating the over-/underrepresentation of codon pairs (figure S1 in ref. 22 Camba et al. (86) samples were assembled individually using megahit v1.2.9 has a relevant equation). (89) with default parameters. We called and annotated open reading frames

(ORFs) from assemblies using prokka (90) (with options “–metagenome Population Parameters. We obtained estimates of Ne from ref. 37, which –compliant –fast”). Reads were mapped to ORFs using bwa 0.7.12 (91), are based on dN/dS ratios (the intuition being that selection acts more and the number of reads aligned to each ORF was counted using bamcov efficiently in large populations). Gene-specific recombination rates were v0.1.1 (available at https://github.com/fbreitwieser/bamcov). We ran gRodon obtained by applying the PhiPack (94) program for detecting recombination in weighted and unweighted metagenome modes on each sample, with to the ATGC database of closely related genome clusters (60), as described weights corresponding to mean coverage depth (corrected for gene length). in Weissman et al. (39). In weighted metagenome mode, the median CUB of the highly expressed genes is taken as a weighed median (weightedMedian in matrixStats R pack- Classification and Phylogeny. For marine and gut organisms, we classified all age), with weights corresponding to mean depth of coverage for that gene. genomes, MAGs, and SAGs using GTDB-Tk [v1.3.0 with database release 95 One sample from Coello-Camba et al. (86) had a very atypical estimated (95, 96)]. For the analyses in Fig. 3C and SI Appendix, Fig. S19C, we used average minimal doubling time over twice as long as any other estimated the tree output by GTDB-Tk [built automatically using FastTree (97)]. To doubling time from this dataset (MG078 at 3.1 h, as compared with the assess sensitivity to phylogeny, we built a maximum likelihood tree with second longest doubling time in MG002 at 1.4 h) and strongly disagreeing 10 bootstrap replicates from the GTDB-Tk alignment using RAxML [v8.2.11, with a replicate sample from the same experiment and time point (MG073 with -k -f a -m PROTGAMMAGTR options (98)]. We performed phyloge- at 0.35 h). Upon closer inspection, this sample had far fewer bases than netic logistic regression using the R package phylolm (99). We then assessed the rest (133 mega-bases vs. > 1 giga-bases), and only a little over 400 sensitivity to individual species, entire phyla, sample size, and phylogenic genes were detected in the assembly, far too few for accurate assessment of uncertainty using the R package sensiPhy (100). Trees and aligned heatmaps community-wide growth rate, leading us to omit this sample from further were visualized using R package ggtree (101, 102). analyses. Extrapolating between Closely Related Taxa. For all genomes used to build EGGO Datasets. We downloaded all prokaryotic assemblies from RefSeq (23, EGGO, we extracted all annotated 16S rRNA genes; then, we aligned these 24), as well as several collections of isolate genomes (45, 48, 49), MAGs (27, sequences and removed poorly aligned columns using ssu-align and ssu- 46, 47), and SAGs (32). Where possible, we used per-existing gene annota- mask [default settings (103)]. We then filtered sequences for which less tions provided by NCBI. For the Pasolli et al. (46) and Nayfach et al. (47) than 80% of positions were accounted for (i.e., were gaps). We ran fast- MAGs, gene predictions were not available, and we used prokka to predict tree on the resulting alignment [with -fastest, -nt, and -gtr options (97)] ORFs and annotate ribosomal proteins (90). Note that for both of these MAG to obtain a phylogeny with 192,195 tips representing 60,421 organisms. datasets, we used a subset of all MAGs designated as being representatives For phylogenetic prediction of maximal growth rate, we then omitted any of species clusters by the authors. We then ran gRodon on each genome, tips with EGGO entries where d > 100 h (13 tips) to minimize the influence using partial mode for MAGs and SAGs (which vary in their completeness). of outliers. Finally, we filtered results from genomes with few ribosomal proteins. Sim- To predict growth rate, we first randomly sampled one tip per organism ilar to Vieira-Silva and Rocha (9), we found that growth rates were biased in our tree (to avoid predicting an organisms growth rate from itself). We when <10 highly expressed genes were used for prediction (SI Appendix, then iteratively found the five closest tips to each tip in the tree and took Fig. S26), and we used this cutoff for our MAGs and SAGs. For our isolate the weighted geometric mean of the growth rates associated with these genomes, this generally was not an issue, with over 99% of genomes in tips. This gave us our predicted maximal growth rate on the basis of 16S RefSeq having between 50 and 70 annotated ribosomal proteins. We fil- rRNA in SI Appendix, Fig. S15A. Weights were calculated as inverse patristic tered all genomes outside this range to remove a very small set of obvious distance, with a small constant added for when organisms had identical 16S problem cases (e.g., one Bacillus genome that had over 1,000 annotated sequences (e.g., multiple genomes in EGGO for the same species): ribosomal proteins). The numbers in Table 1 correspond to postfiltering genome counts. 1 w = . [2] distance + 10−8 Measuring Bias. We use the MILC measure of CUB (20) implemented in the coRdon R package (92). This bias measure behaves slightly better than the For SI Appendix, Fig. S17, the predicted rate was simply taken as the rate Effective Number of Codons Prime (ENC’) measure used by Vieira-Silva and associated with the closest tip on the tree. We identified the closest tips Rocha (9) and automatically accounts for the CUB of genomic background in using the castor R package (104). its calculation (93) [by taking the genome-wide distribution of codons as its To produce SI Appendix, Fig. S15B, we sampled 10,000 tips from expected distribution (20, 92)]. As recommended in the coRdon documenta- our tree and calculated all pairwise distances between tips using the tion, genes with fewer than 80 codons were omitted from our calculations. cophenetic.phylo() function in the ape R package (105). Importantly, we calculate the MILC statistic on a per-gene basis rather than concatenating all of our genes. The contribution (Ma) of each amino acid (a) Signals of Selection. We obtained all ATGC clusters of genomes and their to the MILC statistic for a gene is calculated as core gene alignments (60). We calculated dN/dS and dS values for each aligned gene in each cluster of organisms using PAML [v4.9j (106)]. We X Oc restricted our final analysis to pairwise comparisons where 0.01 < dS < 1 to Ma = Oc log , [1] E c∈C c ensure that organisms had sufficient substitutions to analyze and that the probability of multiple substitutions at the same site was low (i.e., no satu-

where C is the set of codons coding for a, Oc is the observed count of codon ration). To calculate dSHE/dS, we divided the mean dS of all ribosomal genes c, and Ec is the expected count of codon c [the original paper has the full cal- by the mean core genome-wide dS for each pairwise comparison. Previous culation of the MILC statistic (20)]. Typically, Ec for a given gene is estimated work has noted that dS/dN can be used as a proxy measure for Ne since using the genome-wide frequency of that codon c. This is what we mean the two values should generally be correlated (37). Additionally, we assume when we say that for our CUB measurement, the bias of highly expressed that selection for translation optimization should be proportional to growth 1 genes is calculated “relative to the genomic background.” For calculating rate, such that s ∼ d , in SI Appendix, Fig. S23 C and D. the average CUB, we used the median value in order to reduce the influence of any outliers (i.e., misbehaving ribosomal proteins). Functional Analysis. For each genus in our RefSeq dataset, we randomly sam- For our consistency calculation, MILC was also used but was calculated pled a single genome for functional annotation. We assigned a label of using the highly expressed proteins as the expected background (using the “copiotroph” or “oligotroph” on the basis of whether each genome was “subset” option in coRdon). In other words, we estimated the expected predicted to have a minimal doubling time of greater or less than 5 h. We count of a codon, Ec, using the frequency of that codon in highly expressed ran eggnogmapper [diamond setting, v2.0.0 (107, 108)] on each genome to genes only, rather than the genome-wide frequency. For the consistency get COG and archaeal COG (arCOG) annotations and a broad annotation of metric, we took the mean value across ribosomal proteins (rather than the functional class for each ORF (70, 109). We used hmmer v3.3.1 to search the

8 of 10 | PNAS Weissman et al. https://doi.org/10.1073/pnas.2016810118 Estimating maximal microbial growth rates from cultures, metagenomes, and single cells via codon usage patterns Downloaded by guest on October 2, 2021 Downloaded by guest on October 2, 2021 2 .G Pachiadaki G. M. 32. 1 .Choi J. 31. Øvre L. Jonassen, I. Xue, Y. 30. 9 .Almeida A. 29. 7 .J ul,E .Gaa,J .Hiebr,Tercntuto f261draft 2,631 of reconstruction The Heidelberg, F. Stewart J. D. Graham, R. 28. D. E. Tully, J. B. 27. microbial Benchmarking Fuhrman, A. J. Ignacio-Espinoza, C. J. Hou, S. Long, M. A. 15. Oberhardt A. M. 14. Hug A. L. 13. 6 .H Parks H. D. 26. Haft H. D. 25. usage codon and abundance tRNA of expressivity. Co-variation Kurland, gene G. with C. Nilsson, Correlation L. bacteria: Dong, H. in usage 18. Codon Gautier, C. Gouy, M. 17. of abundance the between Correlation Ikemura, T. 16. Rapp S. M. 12. 3 .Ttsv,S if,B eoo,K ’el,I oso,Rfe irba genomes microbial Refseq Tatusova Tolstoy, T. I. 24. O’Neill, K. Fedorov, B. Ciufo, S. Tatusova, T. 23. along usage codon and nucleotide in Gradients Berg, G. O. Hooper, D. S. 19. to number copy operon rRNA Exploiting Sunagawa S. Schmidt, M. 11. T. Stoddard, F. S. Roller, R. B. 10. 2 .R Coleman R. J. 22. Frumkin I. 21. Vlahovi K. Supek, F. 20. siaigmxmlmcoilgot ae rmclue,mtgnms n igeclsvacodon via cells single and metagenomes, cultures, patterns from usage rates growth microbial maximal Estimating al. et Weissman rssac, tasot”“xot”“fu, bt-atms, “chloram- “beta-lactamase,” words “efflux,” key “export,” the “transport,” using annotations “resistance,” family differen- gene among significantly antimicro- resistance were for bial/oxidant searched that we copiotrophs/oligotrophs), classification among COG prevalent tially “V” the under broadly Benjamini–Hochberg a with tests (α exact correction and Fisher’s copiotrophs performed across we families certain gene oligotrophs, (α individual of of correction prevalence enrichment Benjamini–Hochberg in differences a the with Mann– in performed tests we differences Whitney oligotrophs, and assess copiotrophs database across To classes Annotation functional 110)]. Enzyme (71, Carbohydrate-Active (dbCAN2) Automated the with 10 [e-value genome cutoff each against database (CAZy) Enzyme Carbohydrate-Active .S iiaSla .P oh,Tessei mrn fgot n t ssi ecological in uses its and growth of imprint reflects systemic number The Rocha, copy P. operon E. rRNA Vieira-Silva, S. Schmidt, M. T. 9. Dunbar, M. J. Klappenbach, A. J. 8. Lauro M. F. 7. Trembath-Reichert E. 6. biosphere. deep the of Starnawski extent P. and 5. Nature D’Hondt, S. Colwell, S. F. 4. Rapp cultures. S. bacterial M. of growth 3. The Monod, J. 2. Eagon, G. R. 1. o h nlsso ees ee cnieiggn aiisclassified families gene (considering genes defense of analysis the For igecl genomics. single-cell (2017). 829–834 11, (2019). permafrost. Svalbard from sequences genome (2019). 499–504 568, eunigo h o rumen. cow the of sequencing pairings. organism–media new eaeoeasmldgnmsfo h lbloceans. global the from genomes metagenome-assembled life. of tree the expands substantially metagenomes. from predictions rate growth Microbiol. uli cd Res. Acids Nucleic (2016). 6614–6624 44, in Res. Acids Nucleic the synony- for a optimal (1981). for 389–409 is proposal 151, A that genes: choice protein its codon in mous codons respective the of occurrence the (2015). 1261359 348, aaae e ersnainadantto strategy. annotation and (2014). D553–D559 representation New database: Science coli strategies. reproductive bacterial investigate genomics. (meta) bacteria. of strategies ecological U.S.A. Sci. Acad. beds. shale and coal subseafloor U.S.A. 2-km-deep in life microbial sediment. Geochem. clade. marine sar11 minutes. 10 than less rnlto efficiency. translation expressivity. gene microbial of prediction in shrci coli Escherichia genes. 90–91 (2017). E9206–E9215 114, taeist mrv eeec aaae o olmicrobiomes. soil for databases reference improve to Strategies al., et −5 7418 (2008). 1784–1787 320, e iwo h reo life. of tree the of view new A al., et rc al cd c.U.S.A. Sci. Acad. Natl. Proc. esq nudt npoaytcgnm noainadcuration. and annotation genome prokaryotic on update An Refseq: al., et ,S .Cno,K .Vri,S .Goann,Cliaino h ubiquitous the of Cultivation Giovannoni, J. S. Vergin, L. K. Connon, A. S. e, Aydtbs 9hde akvmdl HM)packaged (HMMs) models Markov hidden V9 database CAZy ; 4–7 (2013). 547–574 75, 6–9 (2003). 369–394 57, ´ uli cd Res. Acids Nucleic h eoi ai ftohcsrtg nmrn bacteria. marine in strategy trophic of basis genomic The al., et oo sg fhgl xrse ee fet proteome-wide affects genes expressed highly of usage Codon al., et ,S .Goann,Teuclue irba majority. microbial uncultured The Giovannoni, J. S. e, ´ CIpoaytcgnm noainpipeline. annotation genome prokaryotic NCBI al., et e eoi lern ftehmngtmicrobiota. gut human the of blueprint genomic new A al., et = aiebceimwt eeaintm of time generation a with bacterium marine a natriegens, Pseudomonas eoeyo ery800mtgnm-sebe genomes metagenome-assembled 8,000 nearly of Recovery al., et tutr n ucino h lbloenmicrobiome. ocean global the of function and Structure al., et iu teuto ygnm-cl hne ncdnpi bias. pair codon in changes genome-scale by attenuation al., et irba omnt sebyadeouini subseafloor in evolution and assembly community Microbial al., et 0.05). tal. et 52–53 (2009). 15527–15533 106, tal. et hrigtecmlxt ftemrn irboethrough microbiome marine the of complexity the Charting al., et LSGenet. PLoS 81D6 (2018). D851–D860 46, (1982). 7055–7074 10, tdfeetgot rates. growth different at e,Cmaio fcdnuaemaue n hi applicability their and measures usage codon of Comparison cek, ˇ Cell sebyo 1 irba eoe rmmetagenomic from genomes microbial 913 of Assembly , .Bacteriol. J. ansigtelnsaeo irba utr ei opredict to media culture microbial of landscape the Harnessing , rc al cd c.U.S.A. Sci. Acad. Natl. Proc. ehlcmon s n lwgot characterize growth slow and use Methyl-compound al., et 6313 (2019). 1623–1635 179, s .Ts,Bceiladacaa metagenome-assembled archaeal and Tas¸, Bacterial N. as, ˚ 5732 (2000). 3517–3523 28, 1088(2010). e1000808 6, a.Commun. Nat. a.Commun. Nat. pl nio.Microbiol. Environ. Appl. 3–3 (1962). 736–737 83, a.Microbiol. Nat. 9024 (2017). 2940–2945 114, Nature nu e.Microbiol. Rev. Annu. .coli E. M Bioinf. BMC .Ml Biol. Mol. J. a.Microbiol. Nat. SEJ. ISME 3–3 (2002). 630–633 418, 7 (2018). 870 9, 43(2015). 8493 6, irbo.Rs Ann. Res. Microbiol. a.Microbiol. Nat. 44–44 (2018). E4940–E4949 115, shrci coli Escherichia rnltoa system. translational 8–9 (2021). 183–195 15, 5314 (2017). 1533–1542 2, 4–6 (1996). 649–663 260, 8 (2005). 182 6, 3813 (2000). 1328–1333 66, c.Data Sci. 66 (2016). 16160 1, uli cd Res. Acids Nucleic 64 (2016). 16048 1, rc al cd Sci. Acad. Natl. Proc. = 7–9 (1949). 371–394 3, rnfrRA and RNAs transfer 5.T assess To 0.05). uli cd Res. Acids Nucleic e00516–e00519 8, 723(2018). 170203 5, e.Mineral. Rev. .Ml Biol. Mol. J. nu Rev. Annu. Escherichia rc Natl. Proc. Science SEJ. ISME Nature 42, 8 .M oa,H cmn mato eobnto ntebs opsto of composition base the on recombination of Impact Ochman, H. pan-genome Bobay, and size M. population L. effective 38. driving Factors Ochman, H. Bobay, M. L. 37. 3 .Pr,W in .Zag eoi vdnefreeae uainrtsi highly in rates mutation elevated for evidence Genomic Zhang, J. Qian, W. bacteria. Park, in C. biases mutational 63. Whole-genome Andersson, I. D. Lind, 5 the A. P. at mutations 62. strand-specific induces Transcription Arndt, F. P. Polak, P. 61. updated An ATGC-COGS: and database ATGC Koonin, V. E. Wolf, I. Y. Kristensen, M. strength D. the in Variation 60. Sockett, Hahn, E. W. R. M. Peden, F. 59. J. Grocock, J. R. subject Bailes, is E. genes Sharp, expressed M. P. lowly of bias. 58. usage codon codon The of Hershberg, evolution R. the with Katz, influence S. Yannai, associate that A. Forces codons Zeng, 57. K. optimal Emery, Translationally R. L. Wilke, Sharp, M. P. O. C. 56. Weems, M. Zhou, bias. T. codon 55. on Selection Petrov, A. D. Hershberg, R. 54. in usage codon Synonymous Akashi, H. 53. usage. codon synonymous of theory selection-mutation-drift The temper- Bulmer, M. and medium 52. on Dependency Kjeldgaard, ‘unculturable’ O. N. of Maaløe, culture O. for Schaechter, Strategies M. Wade, G. 51. W. Palmer, M. R. Vartoukian, R. S. 50. Zou Y. 49. from insights New Kyrpides, C. N. Poyet Pollard, M. S. 48. K. Seshadri, R. Shi, J. Z. Nayfach, S. 47. Pasolli E. 46. Klemetsen T. 45. Delmont O. T. 44. of repair the to content GC Swan high K. Linking B. Johnson, 40. L. P. Fagan, F. W. Weissman, L. J. 39. pair Kimura, codon M. Crow, shape F. J. help 36. properties tRNA Roberts Stansfield, R. I. D. Aucott, 35. S. L. in pairs Buchan, codon R. of J. utilization Nonrandom 34. Hatfield, W. G. Gutman, A. G. 33. 3 .Lnh vlto ftemtto rate. mutation the of Evolution Lynch, M. 43. for theory streamlining Madin of S. J. Implications Temperton, 42. B. Thrash, C. J. Giovannoni, J. S. 41. 493(oJAF)adU S iiino ca cecs(C)Gat1737409 Grant (OCE) Sciences Ocean Grant J.A.F.). of Division (to (CBIOMES) NSF Compu- US Ecosystems and J.A.F.) on Marine (to 549943 of Collaboration Modeling Foundation also Biogeochemical We Simons tational 653212. from Award Foundation support Simons acknowledge from ecology microbial marine ACKNOWLEDGMENTS. Appendix Availability. we “nuclease,” Finally, Data words “HTH.” key and the using “PIN,” function “methyl.” “HEPN,” and defense “CRISPR,” words antiviral key for searched the DNA-binding/degrading using for searched domains We “hydroperoxide.” and phenicol,” atraadarchaea. and bacteria bacteria. in evolution 1970). Press, irrhcl rpyoeei structure. phylogenetic or hierarchical, frames. reading open in preferences coli xrse genes. expressed U.S.A. Sci. Acad. and genes. genomes human prokaryotic of studies annotation. macro-evolutionary family protein micro-and for resource bacteria. among bias usage codon selected of selection. natural to Sci. Biol. Trans. Phil. proteins. in sites sensitive structurally (2008). accuracy. translational and (1991). 897–907 129, of growth balanced during composition chemical and typhimurium. size cell of ature bacteria. analyses. microbiome functional research. (2019). microbiome mechanistic enables data multiomics microbiome. gut human global (2019). the of genomes uncultivated lifestyle. and geography, age, (2019). spanning 649–662 176, metagenomes from genomes 150,000 metagenomics. marine for specific databases metagenomes. ocean surface in (2018). abundant are teria ocean. surface the in bacteria tonic genomes. prokaryotic in breaks strand double 7 (2020). 170 7, ecology. microbial (2013). . rc al cd c.U.S.A. Sci. Acad. Natl. Proc. ,2 eeec eoe rmcliae ua u atraenable bacteria gut human cultivated from genomes reference 1,520 al., et . ESMcoil Lett. Microbiol. FEMS xesv nxlrdhmnmcoim iest eeldb over by revealed diversity microbiome human unexplored Extensive al., et irr fhmngtbceilioae ardwt longitudinal with paired isolates bacterial gut human of library A al., et rvln eoesralnn n aiuia iegneo plank- of divergence latitudinal and streamlining genome Prevalent al., et ytei fbceiladacaa hntpctatdata. trait phenotypic archaeal and bacterial of synthesis A al., et oeua ouainGenetics Population Molecular Microbiology irgnfiigppltoso lntmctsadproteobac- and planctomycetes of populations Nitrogen-fixing al., et eoeRes. Genome h a aaae:Dvlpetadipeetto of implementation and Development databases: Mar The al., et rs-aiainsrtge o aawt eprl spatial, temporal, with data for strategies Cross-validation al., et 77–78 (2008). 17878–17883 105, MORep. EMBO l td aaaeicue nteatceand/or article the in included are data study All SEJ. ISME 2311 (2010). 1203–1212 365, eoeBo.Evol. Biol. Genome M vl Biol. Evol. BMC o.Bo.Evol. Biol. Mol. ...wsspotdb ototrlflosi in fellowship postdoctoral a by supported was J.L.W. nItouto oPplto eeisTheory Genetics Population to Introduction An 5316 (2014). 1553–1565 8, uli cd Res. Acids Nucleic Genetics 9–0 (1958). 592–606 19, 2612 (2008). 1216–1223 18, 1312 (2012). 1123–1129 13, – (2010). 1–7 309, 6930 (1989). 3699–3703 86, a.Biotechnol. Nat. 2–3 (1994). 927–935 136, uli cd Res. Acids Nucleic rc al cd c.U.S.A. Sci. Acad. Natl. Proc. 5 (2018). 153 18, o.Bo.Evol. Biol. Mol. (2017). 2627–2636 34, Ecography 2714 (2018). 1237–1246 10, rspiamelanogaster Drosophila https://doi.org/10.1073/pnas.2016810118 rnsGenet. Trends SnurAscae,NwYr,N,2019). NY, York, New Associates, (Sinauer uli cd Res. Acids Nucleic 20D1 (2017). D210–D218 45, uli cd Res. Acids Nucleic LSGenet. PLoS 7–8 (2019). 179–185 37, 1–2 (2017). 913–929 40, 5118 (2009). 1571–1580 26, nu e.Genet. Rev. Annu. 0512 (2006). 1015–1027 34, a.Microbiol. Nat. 4–5 (2010). 345–352 26, 1043(2019). e1008493 15, a.Med. Nat. 62D9 (2018). D692–D699 46, 1115 (2005). 1141–1153 33, Nature aua selection Natural : PNAS 11463–11468 110, 1442–1452 25, 505–510 568, 287–299 42, 804–813 3, Escherichia Salmonella (Blackburn rc Natl. Proc. | c.Data. Sci. Genetics 0 f10 of 9 n of end Cell SI

MICROBIOLOGY 64. X. Chen, J. Zhang, No gene-specific optimization of mutation rate in Escherichia coli. 87. S. Chen, Y. Zhou, Y. Chen, J. Gu, fastp: An ultra-fast all-in-one fastq preprocessor. Mol. Biol. Evol. 30, 1559–1562 (2013). Bioinformatics 34, i884–i890 (2018). 65. W. Wei et al., Smal: A resource of spontaneous mutation accumulation lines. Mol. 88. S. Nurk, D. Meleshko, A. Korobeynikov, P. A. Pevzner, Metaspades: A new versatile Biol. Evol. 31, 1302–1308 (2014). metagenomic assembler. Genome Res. 27, 824–834 (2017). 66. X. Chen, J. Zhang, Yeast mutation accumulation experiment supports elevated 89. D. Li, C. M. Liu, R. Luo, K. Sadakane, T. W. Lam, Megahit: An ultra-fast single-node mutation rates at highly transcribed sites. Proc. Natl. Acad. Sci. U.S.A. 111, E4062 solution for large and complex metagenomics assembly via succinct de Bruijn graph. (2014). Bioinformatics 31, 1674–1676 (2015). 67. N. Kashtan et al., Single-cell genomics reveals hundreds of coexisting subpopulations 90. T. Seemann, Prokka: Rapid prokaryotic genome annotation. Bioinformatics 30, 2068– in wild Prochlorococcus. Science 344, 416–420 (2014). 2069 (2014). 68. E. R. Pianka, On r-and k-selection. Am. Nat. 104, 592–597 (1970). 91. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA- 69. T. Pfeiffer, S. Schuster, S. Bonhoeffer, Cooperation and competition in the evolution MEM. arXiv [Preprint] (2013) https://arxiv.org/abs/1303.3997 (Accessed 3 February of ATP-producing pathways. Science 292, 504–507 (2001). 2021). 70. M. Y. Galperin, D. M. Kristensen, K. S. Makarova, Y. I. Wolf, E. V. Koonin, Microbial 92. A. Elek, M. Kuzman, K. Vlahovicek,ˇ Cordon: Codon usage analysis and prediction of genome analysis: The COG approach. Briefings Bioinf. 20, 1063–1070 (2019). gene expressivity. Bioconductor 3, 8 (2019). 71. V. Lombard, H. G. Ramulu, E. Drula, P. M. Coutinho, B. Henrissat, The carbohydrate- 93. J. A. Novembre, Accounting for background nucleotide composition when measuring active enzymes database (CAZY) in 2013. Nucleic Acids Res. 42, D490–D495 codon usage bias. Mol. Biol. Evol. 19, 1390–1394 (2002). (2014). 94. T. C. Bruen, H. Philippe, D. Bryant, A simple and robust statistical test for detecting 72. D. Matelska, K. Steczkiewicz, K. Ginalski, Comprehensive classification of the pin the presence of recombination. Genetics 172, 2665–2681 (2006). domain-like superfamily. Nucleic Acids Res. 45, 6995–7020 (2017). 95. D. H. Parks et al., A complete domain-to-species taxonomy for bacteria and archaea. 73. K. S. Makarova, V. Anantharaman, N. V. Grishin, E. V. Koonin, L. Aravind, Carf and wyl Nat. Biotechnol. 38, 1079–1086 (2020). domains: Ligand-binding regulators of prokaryotic defense systems. Front. Genet. 5, 96. P. A. Chaumeil, A. J. Mussig, P. Hugenholtz, D. H. Parks, GTDB-Tk: A toolkit to classify 102 (2014). genomes with the genome taxonomy database. Bioinformatics 36, 1925–1927 (2019). 74. K. S. Makarova et al., Evolutionary and functional classification of the carf domain 97. M. N. Price, P. S. Dehal, A. P. Arkin, Fasttree 2—approximately maximum-likelihood superfamily, key sensors in prokaryotic antivirus defense. Nucleic Acids Res. 48, 8828– trees for large alignments. PloS One 5, e9490 (2010). 8847 (2020). 98. A. Stamatakis, Raxml version 8: A tool for phylogenetic analysis and post-analysis of 75. V. Anantharaman, K. S. Makarova, A. M. Burroughs, E. V. Koonin, L. Aravind, Com- large phylogenies. Bioinformatics 30, 1312–1313 (2014). prehensive analysis of the hepn superfamily: Identification of novel roles in intra- 99. L. S. T. Ho, C. Ane, A linear-time algorithm for Gaussian and non-Gaussian trait genomic conflicts, defense, pathogenesis and RNA processing. Biol. Direct 8, 1–28 evolution models. Syst. Biol. 63, 397–408 (2014). (2013). 100. G. B. Paterno, C. Penone, G. D. A. Werner, sensiphy: An r- package for sensitiv- 76. A. Bernheim, R. Sorek, The pan-immune system of bacteria: Antiviral defence as a ity analysis in phylogenetic comparative methods. Method Ecol. Evol. 9, 1461–1467 community resource. Nat. Rev. Microbiol. 18, 113–119 (2020). (2018). 77. T. Korem et al., Growth dynamics of gut microbiota in health and disease inferred 101. G. Yu, D. K. Smith, H. Zhu, Y. Guan, T. T. Y. Lam, ggtree: An r package for visualization from single metagenomic samples. Science 349, 1101–1106 (2015). and annotation of phylogenetic trees with their covariates and other associated data. 78. C. T. Brown, M. R. Olm, B. C. Thomas, J. F. Banfield, Measurement of bacterial Method Ecol. Evol. 8, 28–36 (2017). replication rates in microbial communities. Nat. Biotechnol. 34, 1256 (2016). 102. G. Yu, Using ggtree to visualize data on tree-like structures. Curr. Protocol. Bioinform. 79. Y. Gao, H. Li, Quantifying and comparing bacterial growth dynamics in multiple 69, e96 (2020). metagenomic samples. Nat. Methods 15, 1041–1044 (2018). 103. E. P. Nawrocki, D. L. Kolbe, S. R. Eddy, Infernal 1.0: Inference of RNA alignments. 80. A. Emiola, J. Oh, High throughput in situ metagenomic measurement of bacterial Bioinformatics 25, 1335–1337 (2009). replication at ultra-low sequencing coverage. Nat. Commun. 9, 4956 (2018). 104. S. Louca, M. Doebeli, Efficient comparative phylogenetics on large trees. Bioinformat- 81. H. Wickham, Ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, New York, ics 34, 1053–1055 (2018). NY, 2016). 105. E. Paradis, K. Schliep, Ape 5.0: An environment for modern phylogenetics and 82. A. Kassambara, ggpubr: ‘ggplot2’ Based Publication Ready Plots (R package Version evolutionary analyses in R. Bioinformatics 35, 526–528 (2019). 0.4.0, 2020). https://rpkgs.datanovia.com/ggpubr/. Accessed 2 March 2021. 106. Z. Yang et al., Paml: A program package for phylogenetic analysis by maximum 83. W. N. Venables, B. D. Ripley, Modern Applied Statistics with S (Springer, New York, likelihood. Comput. Appl. Biosci. 13, 555–556 (1997). NY, ed. 4, 2002). 107. J. Huerta-Cepas et al., Fast genome-wide functional annotation through orthology 84. L. Scrucca, M. Fop, T. B. Murphy, A. E. Raftery, Mclust 5: Clustering, classifica- assignment by eggnog-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017). tion and density estimation using Gaussian finite mixture models. RJ 8, 289–317 108. J. Huerta-Cepas et al., Eggnog 5.0: A hierarchical, functionally and phylogenetically (2016). annotated orthology resource based on 5090 organisms and 2502 . Nucleic 85. J. G. Okie et al., Genomic adaptations in information processing underpin trophic Acids Res. 47, D309–D314 (2019). strategy in a whole-ecosystem nutrient enrichment experiment. Elife 9, e49816 109. K. S. Makarova, Y. I. Wolf, E. V. Koonin, Archaeal clusters of orthologous genes (2020). (arcogs): An update and application for analysis of shared features between 86. A. Coello-Camba et al., Picocyanobacteria community and cyanophage thermococcales, methanococcales, and methanobacteriales. Life 5, 818–840 (2015). responses to nutrient enrichment in a mesocosms experiment in oligotrophic waters. 110. H. Zhang et al., dbcan2: A meta server for automated carbohydrate-active enzyme Front. Microbiol. 11, 1153 (2020). annotation. Nucleic Acids Res. 46, W95–W101 (2018).

10 of 10 | PNAS Weissman et al. https://doi.org/10.1073/pnas.2016810118 Estimating maximal microbial growth rates from cultures, metagenomes, and single cells via codon usage patterns Downloaded by guest on October 2, 2021