<<

Inferring the shape of global

Jakub Otwinowskia,1, David M. McCandlishb, and Joshua B. Plotkina

aBiology Department, University of Pennsylvania, Philadelphia, PA 19104; and bSimons Center for Quantitative , Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724

Edited by Richard E. Lenski, Michigan State University, East Lansing, MI, and approved June 20, 2018 (received for review March 7, 2018)

Genotype– relationships are notoriously complicated. ear relationship between the underlying trait and the measured Idiosyncratic between specific combinations of muta- phenotype, or fitness. tions occur and are difficult to predict. Yet it is increasingly clear A classic argument for this form of epistasis is the physiological that many interactions can be understood in terms of global theory of (12), which held that dominance arises from epistasis. That is, may act additively on some under- the inherently nonlinear relationship between activity or lying, unobserved trait, and this trait is then transformed via dosage and measured phenotype [e.g., saturation phenomena in a nonlinear function to the observed phenotype as a result of metabolic networks (13)]. Another classical example arose in the subsequent biophysical and cellular processes. Here we infer the attempt to reconcile apparent contradictions between patterns of shape of such global epistasis in three , based on pub- genetic and the tolerable degree of , lished high-throughput mutagenesis data. To do so, we develop by positing models of nonlinear selection—for instance, trunca- a maximum-likelihood inference procedure using a flexible family tion selection—operating on an underlying additive trait such as of monotonic nonlinear functions spanned by an I-spline basis. the number of deleterious mutations or heterozygous (14– Our analysis uncovers dramatic nonlinearities in all three pro- 18). Following these classic examples, research into the shape teins; in some proteins a model with global epistasis accounts of nonlinear selection on an observed quantitative trait using for virtually all of the measured variation, whereas in others we either quadratic (19) or more general (20) forms became a major find substantial local epistasis as well. This method allows us to interest of evolutionary quantitative (21). Contempo- test hypotheses about the form of global epistasis and to distin- rary studies on the fitness effects of mutations often incorporate guish variance components attributable to global epistasis, local epistasis based on thermodynamic arguments, where the prob- epistasis, and measurement error. ability of binding (22) or folding (23–25) is expressed as a nonlinear function of a binding or | | | deep mutational scanning fitness landscape –phenotype map folding energy that is additive across sites. Today, the idea that protein | epistatic interactions are the result of a nonlinear mapping from an underlying additive trait has been called “unidimensional” he mapping from genetic sequence to biological phenotype in (26, 27), “nonspecific” (28), or “global” (29) epistasis, and sev- Tlarge part determines the course of evolution. Nonadditive eral heuristic techniques have been proposed to infer epistasis of interactions between sites, called epistasis, can either acceler- this form (30–33). ate or severely constrain the pace of . Due to the Here we present a maximum-likelihood framework for infer- high dimensionality of sequence space, studies of evolution on ring models of global epistasis, and we apply this framework epistatic genotype–phenotype maps, or fitness landscapes, have to several large genotype–phenotype maps of proteins derived historically been limited to mathematical models, such as NK landscapes (1, 2), or computational models of simple biophysical , such as RNA folding (3, 4). Significance Recently, high-throughput sequencing has made it possible to measure genotype–phenotype maps for proteins (5) and across How does an ’s genetic sequence govern its mea- (6). While many thousands of genotype–phenotype surable characteristics? New technologies provide libraries of pairs can now be assayed in a single experiment, this is still randomized sequences to study this relationship in unprece- only a minuscule fraction of all possible sequences, and so dented detail for proteins and other molecules. Deriving the sampled typically take the form of a scattered insight from these data is difficult, though, because the space cloud centered on the . Statistical models are therefore of possible sequences is enormous, so even the largest exper- required to derive biological insight from these sparsely sam- iments sample a tiny minority of sequences. Moreover, the pled genotype–phenotype maps. One approach has been to fit effects of mutations may combine in unexpected ways. We models that include terms for each pairwise between present a statistical framework to analyze such mutagenesis genetic sites (7, 8). Such models can sometimes predict protein data. The key assumption is that mutations contribute in a contacts or mutational effects from protein sequence alignments simple way to some unobserved trait, which is related to the (9), but they do a poor job at capturing a complete picture of the observed trait by a nonlinear mapping. Analyzing three pro- genotype–phenotype map—so that their predictions far from the teins, we show that this model is easily interpretable and yet observed data are often wildly inaccurate (10, 11). fits the data remarkably well. Models of a genotype–phenotype map that contain terms for every possible pairwise interaction seem reasonable if we believe Author contributions: J.O., D.M.M., and J.B.P. designed research; J.O. performed research; that epistasis arises primarily through the idiosyncratic effects and J.O., D.M.M., and J.B.P. wrote the paper. of particular pairs of —for instance, pairs of residues The authors declare no conflict of interest. that contact each other in a folded protein. While there is abun- This article is a PNAS Direct Submission. dant biochemical evidence for epistasis of this type, geneticists Published under the PNAS license. have long suspected that a substantial fraction of observed epis- Data deposition: The code used for inference and analysis, as well as input–output data, tasis is due not to specific pairwise interactions, but rather to is available on GitHub at https://github.com/jotwin/GlobalEpistasis.jl. inherent nonlinearities in molecular phenotypes, cellular fitness, 1 To whom correspondence should be addressed. Email: [email protected] organismal physiology, and reproductive . Apparent pair- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. wise interactions may be caused by mutations that contribute 1073/pnas.1804015115/-/DCSupplemental. additively to some underlying trait, combined with a nonlin- Published online July 23, 2018.

E7550–E7558 | PNAS | vol. 115 | no. 32 www.pnas.org/cgi/doi/10.1073/pnas.1804015115 Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 phenotype olnaiysget lblculn ewe l ie that sites model formalized all be semiparametric can a between idea as This coupling phenotype. observed global the determines as a nonlinearity, suggests smooth nonlinearity a show in nonepistatic often shown best-fit deviations their devia- Such and for the models. datasets examining procedure by empirical inference motivated between and epistasis, tions of model form a 8). different (7, develop a well we as interactions contrast, higher-order By possibly and each sites between interactions of for pair terms explicit including by is model measurement of magnitudes known (34). the errors cannot given that for, variance is accounted unexplained there be of but portion maps, explain significant genotype–phenotype a often in usually models variance the Nonepistatic of regression. much linear effects by additive mated the phenotypes and sequences term constant the by modeled is with (i.e., acid substitution amino no and position each for effects effects additive The data. term noise the where sequence a starting Given model. The nonepistatic structure. acids a amino is underlying of analysis of our of form point some assuming data from by maps genotype–phenotype infer can models Statistical Epistasis Global of Model additional A of handful a more only far using capture can variation parameters. it phenotypic but much the models, shares additive of of to approach simplicity due Our the epistasis confidence sites. of additional their between of coeffi- interactions with extent additive specific together the the the estimate trait about and infer unobserved intervals, hypotheses the relationship, test for nonlinear the cients epistasis, find this we global of trait, form of additive non- model monotonic, unobserved a some best-fit is of phenotype function observed linear the the Under experiments. that (DMS) assumption scanning mutational deep from tiosie al. et Otwinowski available, are sequence each esti- for independent error measurement When the model. of our mates in nonlinearity global the by effects additive simultaneously. standard the arbi- to data, an inference sufficient β the fitting With reduce likelihood. avoid to maximum and to model model (35), To complex phe- trarily I-splines trait. measured additive functions, the unobserved monotonic affect single function mutations a the that via only is under- notype assumption refer model The We nonlinear. this is data. lying model measured full the the though epista- even to coefficients global trait the of additive to shape the and the relating represents sequence, sis that genetic function the nonlinear on depends where i ,α n biu a oicroaeeitssit hsadditive this into epistasis incorporate to way obvious One eas siaeteaon feitssta sntcaptured not is that epistasis of amount the estimate also We n h hp fgoa epistasis global of shape the and φ sa nerd u nbevd diietatthat trait additive unobserved, but inferred, an is i.S1A Fig. Appendix, SI y h oeittcmdlis model nonepistatic the , g a (φ) i o sites for ecos eil aaercfml of family parametric flexible a choose we β a y i ε i ,α φ y = ersnsdvain ewe oe and model between deviations represents = = = nti oe steadtv effects, additive the as model this in β a i 0 i β g β WT 1 = + + (φ) i 0 ,α + X ,s httewl-yephenotype wild-type the that so ), L to i β a evsaie samti of matrix a as visualized be can X 0 L i o rti .Ti observed This G. protein for β ihmn ar fmeasured of pairs many With . L ε i β ,a n nascae measured associated an and i i ,a + i , ε, g g (φ) (φ) β i β a einferred be can ,a sa arbitrary an is i i ,α 0 = a eesti- be can fteeis there if [1] [2] ilyrdcsteetn fumdldeitss(σ cross-validated epistasis 10-fold unmodeled of from extent reduced the reduces tially of Methods model ( and our model rials test, nonepistatic GB1 sub- purely likelihood-ratio for double a 0.0003, that all outperforms the find epistasis of We measured global Olson 95% (37). and particular, (509,693) and single– stitutions In (1,045) all for for (37). substitutions sites (IgGFC) al. using fragment amino-acid 55 immunoglobulin et 1), an targeted a to Olson (Fig. binding (37) (GB1), by stability al. G study and et protein a folding from of protein data domain of IgG-binding system global characterized model the we framework, in our of epistasis application first a As GB1 Protein in underlying Epistasis Global the the on assess mutations to of and trait. impact additive coef- epistasis the the and global describing epistasis global of ficients both of but shape estimates features, our the in certain uncertainty in for support others, statistical not is bootstrap complex- there parametric whether different via ask test of likelihood-ratio models in a compare (detailed of means to by us ity allows so Doing error measurement phenotype of Methods). and estimate (Materials independent the be is to likelihood Gaussian HOC per-sequence the estimate we component, particular, uncorrelated epistasis In fully (36). a geno- model to each distribution—akin house-of-cards for Gaussian draw, a random from independent type, an contribution as phenotypic epistasis the this model of in we because of framework epistasis, remaining statistical forms (HOC) mea- our the the house-of-cards other in of as variation to of phenotype much due component sured last how is this to infer phenotype refer We can measured epistasis. we the case, in the variation often is as I htecuezr,oto oa f105mttos(SI are mutations they but 1,045 what model, to of additive purely similar total the are under a effects inferred additive be of would These out 95% S1 D). with zero, Fig. mutations Appendix, exclude deleterious 774 that and CIs beneficial 234 find affinities visible). we be binding and to observed small (Materials too the 0.13 1A, of Fig. width in range gray of full (light CI the extremely 95% over are average gener- epistasis Methods) More global an of assay. with shape their the precise, in of score estimates binding our established the ally, that on 37 bound ref. lower by overall a experiments the control with with around consistent compared (from is observed small scores very binding is of relative range of slope units This of in binding. change mutation, expected log per an 0.0034–0.0046) with (CI small, 0.0037 extremely only is slope this of mate (i.e., (P side derivative left-hand second positive for bootstraps a −1. of 95% by in additive positive deleteri- indicated decreasing returns; increasing For with as possible. saturation values, are observe trait affinity data we binding observed mutations, in of ous increases (P range further the values that outside trait suggesting high S1), remains Table at slope the positive but epistasis, strictly type diminishing-returns wild of the pattern for around a bootstraps region of 95% the underlying has in in (negative function the derivative nonlinear second and of inferred negative diminishing our value a both particular, the In includes trait. on it additive depending that see returns, we increasing 1A), (Fig. model model. nonepistatic the for eifrtemdldsrbdaoeb aiu likelihood. maximum by above described model the infer We o h fet nteuosre diietat(i.1D), (Fig. trait additive unobserved the on effects the For our by inferred epistasis global of shape the at first Looking .W n infiatspotfranneosoeo the on slope nonzero a for support significant find We 04). .Ti prahalw sto us allows approach This Methods). and Materials σ HOC < o ttsia ehdlg) n tsubstan- it and methodology), statistical for 0.003; 0 = σ HOC 2 r .56 2 oprdwith compared 93.5%, = ,btoresti- our but S1), Table Appendix, SI ystigtettlvrac four of variance total the setting by , o h oeittcmdl iha with model) nonepistatic the for see S1; Table Appendix, SI PNAS −0.97 | o.115 vol. σ < φ < < HOC 2 0.003; + o+)adit and +2) to −5 ,indicating 1.72), | σ o 32 no. m 2 IAppendix, SI −4.27 HOC where , r 2 86.2% = 0 = | < φ < Mate- E7551 P . σ 35, < m 2

EVOLUTION A 3 count B 10000 0.3 0.2 0 7500 density 0.1 5000 0.0 −9 −6 −3 0 3 −3 2500 relative log binding C 0.4

relative log binding relative −6 0.3 0.2

density 0.1 −9 0.0 −20 −10 0 −20 −10 0 φ additive trait additive trait D K additive R H effect E D N 0 Q T S C G A −5 V L I M P −10 Y F W 5 10152025303540455055

Fig. 1. (A) Global epistasis as a nonlinear function of an intermediate additive trait (Eq. 1) for binding of protein GB1 to an immunoglobulin fragment (IgGFC) (37). Solid black line indicates the nonlinear function g(φ) (95% confidence interval in light gray, too small to be visible). Based on the inferred magnitude of nonglobal epistasis, we estimate that the true phenotypes for 95% of genotypes lie between the two dashed lines (2σHOC = 0.7). Colors are a histogram of observed binding. (B and C) The distribution of binding values is bimodally distributed (measured dashed line, inferred solid line) (B), while the distribution of inferred additive phenotypes is unimodal (C). (D) The impact of each possible substitution on the underlying additive trait. Open squares are the wild-type amino acid. Additive effects and the underlying trait are scaled so that the wild type has trait value 0 and the average mutation changes the trait by distance 1 (Materials and Methods).

more widely distributed for deleterious mutations (SI Appendix, interactions between specific combinations of sites. Unlike global Fig. S1B). Interestingly, the distribution of inferred additive epistasis, which depends in a nonlinear way on a single additive traits is unimodal whereas the observed distribution of bind- trait, or HOC epistasis, which consists of independent Gaussian ing affinities is bimodal, with one peak around the wild-type effects associated with each genotype, complex epistasis can be affinity and another at low binding (Fig. 1 B and C). This attributed to nonlinear interactions between specific sites on the apparent contradiction is resolved by the form of global epis- measured phenotype. tasis, which maps all sequences with low values of the additive To investigate complex epistasis we compared the predictions trait to similar values for the observed phenotype, creating a of our model, fitted to data from Olson et al. (37), to an indepen- second mode. dent dataset from a follow-up experiment by ref. 38. Wu et al. Taken together, our model for the underlying additive trait (38) measured the binding of GB1 variants mutated to all possi- and for the nonlinear function φ provides a simple explanation ble combinations of amino acids at four sites specifically chosen for the observed binding data. The model accounts for essentially because they were suspected to exhibit idiosyncratic epistatic all of the variation in binding measurements beyond what would interactions. be expected under measurement noise alone. In particular, the On the whole, our results confirm that the pattern of global p root-mean-square error in our predictions is P(y − yˆ)2/N = epistasis inferred using shallow mutagenesis across the whole 0.59 whereas even under a perfect model we would expect a domain persists at a deeper level of mutagenesis for these four pP 2 sites. In particular, the predictions of the global epistasis model, root-mean-square error of σm /N = 0.52 due to the Pois- son variation in the counts used to calculate the binding scores. fitted to Olsen et al. (37) data, are close to the mean binding The extent of these errors is very small compared with the overall curve measured by Wu et al. (38), across the entire range of range of phenotypes observed, and it constitutes a sequence– the inferred additive trait (SI Appendix, Fig. S2A). However, the function relationship that is dominated by global epistasis with width of the deviations from our binding predictions is much little room for epistasis of other kinds. This can be under- broader than the variation predicted by the error model fitted stood visually by examining Fig. 1A, where we predict that the to the Olsen et al. (37) experiment (Fig. 1A). Some of this dif- true phenotypes for 95% of genotypes fall between the two ference can be attributed to the much larger measurement noise pP 2 dashed lines. in the ref. 38 dataset [ σm /N = 0.52 (Olsen et al.) (37) vs. pP 2 σm /N = 0.77 (Wu et al.) (38)] due to the lower read depth Complex Epistasis in GB1 in Wu et al. (38). However, the deviations from the model are While our model captures the form of global epistasis and quan- also skewed—so that around 13% of variants measured by Wu tifies the extent of remaining HOC epistasis, deviations from our et al. (38) have significantly higher binding than predicted, even model can provide evidence for one remaining type of epistasis, after the difference in measurement noise is taken into account which we call complex epistasis. This form of epistasis arises from (SI Appendix, Fig. S2B). We interpret this skew as evidence of

E7552 | www.pnas.org/cgi/doi/10.1073/pnas.1804015115 Otwinowski et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 i.2. Fig. sequences of positive fluorescence define relative we log where have 0.9787, to 0.9966, rate rate positive negative (true true outliers on few nonfluorescing remarkably from mutations with fluorescing proteins, delineates 3.7 that and threshold sharp sequence per of mutations fluorescence 11 includ- (31). the (avGFP), average to protein measured fluorescent up that green ing a global study of of DMS variants model 51,715 our a applied to we epistasis broadly genotype– and more characterize single maps To space only sequence. phenotype sequence protein of contains wild-type cloud (37) the limited al. near a covers et which Olsen mutants, of double dataset GB1 The GFP in Epistasis Global a for interactions complex the phenotype. idiosyncratic, measured identifying have in that for instrumental sites Accounting be specific epistasis. therefore can global epistasis by pairwise global caused this indicating primarily analysis, small, is are our epistasis model in the yet, from and deviations data the binding epista- from pairwise of directly magnitude calculated similar sis contrast, a By exhibits (37). 41Q/54P GB1 pair 41L/54G in the known changes is mutations structural it with indeed of associated and be analysis, pair to our in the deviation largest example, the shows For global model. the from substi- epistasis deviation systematic of of (SI indicate pair S2C) deviations Fig. mean-square distribution each Appendix, large with for mutations the of predictions Pairs marginalize model tutions. our we from epistasis deviations complex exhibit mutagenized (38). sites al. et specific Wu the by among epistasis complex positive ito 175poenvrat,wt . uain naeae O epistasis HOC average. on mutations 3.7 magnitude with has the variants, test, protein of 51,715 slope (likelihood-ratio of the sist positive and fluo- is high low is is mapping fluorescence there nonlinear threshold threshold the the above Below and trait. rescence additive the in threshold sharp tiosie al. et Otwinowski dis- value by trait trait has the in type shown changes are S4. wild effects mutation Fig. Additive The magnitude the Methods). lines. and absolute that (Materials dashed 1 mean so tance the the be the between scaled to that and lie small is suggests 0 will (too trait analysis genotypes gray additive Our of light underlying 95% CI. with for bootstrap line values 95% solid true the black the indicating by visible) indicated is g(φ)

log relative fluorescence a infers epistasis global of model our data, GFP the to Applied GB1 in mutations and sites of combinations which infer To −2 −1 0 3 2 1 0 −10 −20 −30 h hp fgoa psai naDSsuyo F 3)sosa shows (31) GFP of study DMS a in epistasis global of shape The σ count HOC = 100 200 7,ad1-odcross-validated 10-fold and 0.073, additive trait .5o rae;Fg 2). Fig. greater; or −1.25 φ P < r 2 0) aacon- Data 0.003). = 0.931 IAppendix, SI ± 0.003. hw hticuiggoa psai mrvsthe improves epistasis global (r including substantially that fit shows genetic S3A) different Fig. a the in below effect effect substantial no a background. In have have 0.019–0.021). to may CI observed threshold 95% mutations mutation, words, per increasing other fluorescence is log function 0.020 epistasis global test, the (likelihood-ratio while threshold 0.00022–0.00041); CI the 95% above background mutation, per at 0.00032 fluorescence (slope is flat log almost fluorescence is relation the trait–fluorescence the where levels, threshold, the Below uhmdl aebooia es ncssweebt the both where cases in applicable, biological universally make not While models function. and such nonlinear a genotype via the type that latent the trait assuming to underlying additively by contribute each this condition environmental do We prob- genotype-by-environment analyze this interactions. to address model our we extending by Here lem mutation conditions. given it experimental any particular, the of In condition, effects across the interpretability. single summarize of to a difficult problem provide becomes in a should introduces measurements and provide also protein than principle it a results in of robust function can more the conditions— conditions into different insight multiple . more several in of concentrations testing in several at While variants activity phenotypes antibiotic of as measure such library often a studies across scanning mutational Deep of in Interactions quantification Genotype-by-Environment better with together uncertainty. a sequence– relationship the with of approach orig- picture architecture function understand our their to but network easier trait), in simpler, latent a neural 31 provides the a represent ref. to results on by bottleneck the sharp used (based with approach analysis consistent network inal observed neural essentially the the explain are to of results epistasis of These forms data. other little with invoke map, plays to genotype–phenotype epistasis GFP need global the that in role find alone; dominant we a error GB1, like measurement Thus, to Methods). due and 0.16 our is of mea- in predictions error square under our error in expected the error be and (mean-square would 2, alone than Fig. noise P more surement in fall little lines will is genotypes dashed predictions of two 95% the that More between mutations. suggests of model effects our the specifically, in additive a fluo- essentially threshold Below is of this above rescence trait. and terms additive zero, is in underlying fluorescence threshold, understood an this on adequately imposed be threshold can unobserved, sharp GFP the for on tionship mutations infer- individual to of respect trait. with effects additive underpowered form the was the it ring infer the epistasis, to although power global that enough of sufficient suggest had they experiment and importance GFP the uncertainty, show results quantifying magni- These the of mutation. times and 0.91 average to an total), equal (141 of was 1,810 tude CIs zero of these only of excluding out width that coefficients CIs average the negative find have 1,131 we coefficients and particular, additive positive In the uncer- GB1. of greater for much 62% observed show for than S3 C) CIs Fig. tainty 95% , Appendix bootstrapped (SI the coefficients that find (e.g., we trait g GFP underlying for the S1D), on Fig. for mutations both of CIs small effects very found accounting we after GB1 smaller much (σ variance epistasis is global fluorescent for epistasis inferred HOC the to and attributable model), nonepistatic the (φ) IAppendix , (SI models nonepistatic and epistatic the Comparing vrl,orrslsidct httesqec–ucinrela- sequence–function the that indicate results our Overall, (y − r eysal(en01) u htteifre additive inferred the that but 0.18), (mean small very are y ˆ ) 2 /N 0 = hc hndtrie h bevdpheno- observed the determines then which φ, hra ewudepc root-mean- a expect would we whereas .18, CV 2 HOC 0 = P < .935 0 = 0.003; .073 oprdwith compared PNAS vs. slope S1; Table Appendix, SI σ HOC g | (φ) o.115 vol. 0 = β n o h additive the for and -Lactamase .Weesfor Whereas .31). r CV | 2 o 32 no. IAppendix, SI 0 = Materials .688 | E7553 for

EVOLUTION genotype and the environment can act on some latent underly- The global-epistasis model provides a score for each muta- ing trait. For instance, it is reasonable to assume that the action tion’s additive effect on the underlying trait. We again normalize of an antibiotic depends on its “local” concentration in the cellu- so that the wild type has a score equal to zero and the mean abso- lar or extracellular environment, which can be modulated either lute effect of a mutation on the additive trait equals 1. We infer through changes to the activity of an that degrades the that the single mutation effects (SI Appendix, Fig. S5) are largely antibiotic or by changes to the overall concentration of antibiotic deleterious: 3,295 out of 3,312 single mutations have CI below in the medium. zero, and none has CI above zero (SI Appendix, Fig. S6B), con- Here we apply this approach to a study of β-lactamase, an sistent with the observation that no displays a growth rate enzyme that degrades β-lactam , which has been a consistently higher than wild type. model system for DMS (30, 39, 40) and protein evolution (41– The inferred activity scores are mapped to growth rates via 44). Stiffler et al. (45) measured the effects of 4,997 single a nonlinear function that shifts, depending on the antibiotic mutants of TEM-1 β-lactamase in five different concentra- concentration. This function shows first increasing and then tions of the antibiotic ampicillin. They quantified β-lactamase decreasing returns (for instance, in the highest antibiotic concen- activity by measuring the growth rate of each mutant in each tration, the second derivative has positive 95% CI for −0.61 < condition. Our model takes these five growth rate measure- φ < −0.38 and negative CI for −0.33 < φ < −0.09, where φ is ments and synthesizes them into a single score capturing the the genetic effect on the underlying trait), and it ultimately activity of that particular genotype, together with a nonlin- saturates with zero slope at high concentrations (P = 0.39; SI ear function capturing the mapping from the latent trait (e.g., Appendix, Table S1). These results indicate a threshold-like phe- effective local concentration of antibiotic) to the observed nomenon where, for any given genotype, increasing the antibi- growth rate. otic concentration up to a certain point produces no growth The results of fitting this global-epistatic model are shown defect, whereas further increases in antibiotic produce rapidly in Fig. 3, where the x axis displays the activity of each mutant declining growth rates. Interestingly, while the slope of the non- (i.e., the impact of that mutant on the underlying trait), and the linear function becomes shallower for low-activity genotypes environmental conditions determine a series of curves, one for (slope 0.090 fitness change per mutation on left-hand side, CI each condition. Due to our assumption that the environment and 0.073–0.095), it remains significantly greater than 0 (P < .0003; genotype interact additively to determine the underlying trait, SI Appendix, Table S1). This result may have clinical rele- these curves are simply translations of one another, and each vance: Although growth rates experience a sudden decline with curve produces a prediction of the growth rate of each mutant increasing antibiotic concentration, increasing the concentration for one environment. Overall, we find the model produces a very beyond the point of first inhibition is still capable of produc- good fit to the data (cross-validated r 2 = 0.92), far better than ing a further decrease in bacterial growth rate. Finally we tested a purely additive model (P < 0.0003; SI Appendix, Table S1). the hypothesis that the environmental coefficients were a linear The performance is also comparable to that of the biophysically function of the log antibiotic concentration vs. a more general based mechanistic model originally fitted by ref. 45, which uses model that fitted an arbitrary coefficient for each condition. approximately 5,000 more parameters. We found strong evidence against the linear model (P < 0.003; SI Appendix, Table S1), consistent with the visual pattern of increasingly severe antibiotic effects at higher concentrations (Fig. 3). The simplicity of our model together with its graphical repre- concentration, sentation provides insights into the design and behavior of this 0 trial experiment. For instance, in Fig. 3 we have shown the two bio- 10, 1 logical replicates in each condition in slightly different colors. The plot shows that the replicates are not completely overlap- 39, 1 ping, which indicates biological variation between the replicates, −1 and indeed we can reject our basic model in favor of a model 39, 2 with a different environmental coefficient for each antibiotic 156, 1 concentration in each replicate (P < 0.003; SI Appendix, Table S1). Another fine-scale pattern can be observed in the CIs 156, 2 −2 for the additive effects (SI Appendix, Fig. S6B), which widen 625, 1 for genotypes that have WT-like growth rates at one antibiotic concentration and then very low growth rates in the next con- log relative growth rate growth log relative 625, 2 centration. This suggests a finer grid of antibiotic concentrations −3 2500, 1 would be optimal for future experiments. 2500, 2 Additive Traits and Biophysical Quantities Since we infer an underlying additive trait when fitting a global −4 epistasis model, it is natural to ask whether the underlying trait −2.5 −2.0 −1.5 −1.0 −0.5 0.0 corresponds to some known biophysical quantity. While this genetic effect (protein activity) question will surely depend upon the details of the assay used in a given DMS experiment—e.g., whether the assay is mea- Fig. 3. Gene-by-environment interactions in a DMS study of β-lactamase suring binding of a protein to a substrate, enzymatic activity, (45). Data consist of 4,997 single mutants, measuring growth rate under dif- etc.—thermodynamic stability for a protein’s native conforma- ferent concentrations of antibiotic (ampicillin) and two replicates (45). Log tion is almost always an important phenotype, and there is a relative growth rate is plotted against the inferred protein activity for each consistent thread in the literature suggesting that the phenotypic substitution; the mapping between protein activity for log relative growth rate for each antibiotic concentration is given by the colored curves and effects of most mutations are mediated via their effects on ther- each such curve is surrounded by a gray region giving its 95% CI. Growth modynamic stability and the nonlinear relationship between free rate measurements were made using two biological replicates, shown as energy of folding and probability of being folded (23–25, 46–48). two slightly different hues for each antibiotic concentration. Cross-validated Moreover, low-throughput measurements of mutational effects r2 = 0.923 ± 0.002. on thermodynamic stability (∆∆G) for mutations relative to

E7554 | www.pnas.org/cgi/doi/10.1073/pnas.1804015115 Otwinowski et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 h nerdadtv fet r eaieycreae (ρ correlated negatively are effects additive inferred the effects S6A; additive Fig. inferred Appendix, our (SI with correlated significantly not are for Similarly independent with found bio- correlation energies infer, strong folding recent correlate a and A we binding not separates pro- stability. that effects do fold model small physical the stability, of this measurements binding that in independent with surprising fold with confounded not the are of is which binding stability it the the and Therefore, of as IgGFC, tein. stability to well 52). binding the as (37, for affect interface G stability mutations protein protein many assayed on presumably (37) effects al. et their are Olson binding sequence with the wild-type correlated the on around not effects mutants mutational single that of (SI phenotype observation phenotype (37) al.’s and additive et underlying effects S1C; not the stability Fig. do on measured we Appendix, effects the G, inferred protein between the our For correlation of mutations. a estimates these find published of previously effects and stability effects these additive of our effects stability the with correlated mutations. model same our ther- highly under of be coefficients most effects additive should inferred energetic that the their hypothesis stability, via mal the phenotype under over affect Thus, conserved mutations relatively (51). and time 50) (49, evolutionary additive typically are WT diietatadacresoigtesaeo h nonlin- the of shape al. the et Otwinowski showing curve of effects a pair underlying the and an a showing on trait by coefficients substitution additive visually of single–amino-acid map summarized possible heat variation. each be a phenotypic can graphics: most of models for account resulting can The epistasis global on genetic model of based simple-to-understand complex a for sites, is specific evidence between strategy abundant interactions despite Our that, gambit demands. the competing trade-offs make these must models between however, practice, under- In the biology. into of lying make insight model mechanistic behavior, give ideal and comprehensible predictions, An accurate exhibit genotypes. would possible relationship of dimen- this space large the and size of enormous sionality the to due challenge substantial for framework statistical uncertainty. principled quantifying and a hypotheses easy-to-understand testing provide provides an can also Our providing epistasis relationship. approach sequence–function global while underlying the of fit of summary form good this remarkably that a shown proteins, the well-studied have to three mutation we Analyzing possible trait. each additive of increasing underlying contributions monotonically rela- the nonlinear and the of of tionship form class the infer flexible non- simultaneously we a this functions, Modeling by phenotype. mapping unobserved, measured interactions linear an the between genetic and mapping trait that nonlinear additive mutagenesis global is a high-throughput assumption from arise key from Our interactions infer- experiments. for genetic framework maximum-likelihood ring a proposed have We trait. Discussion underlying the is hypothesis folding held of widely mutations energy folding the free of for that of effects support equivocal stability energy best inferred estimated at free provides our the between and the effects relationship of additive inconsistent function the the deter- 46), nonlinear with mostly (23–25, consistent a are is by effects model phenotypic our mined most of measured for fit that effects good hypothesis stabil- the fitness folding while on on the Thus, effects mutations ity. if their single through predicted mediated for were be effects mutations would energetic of as 31 stability, ref. by dictions P 2 = ots hshptei,w acltdtecreainbetween correlation the calculated we hypothesis, this test To nesadn eoyepeoyerltosispoie a provides relationships genotype–phenotype Understanding × 10 −16 β ; lcaae needn esrmnsof measurements independent -lactamase, i.S3 Fig. Appendix, SI P 0 = P 0 = . ,arsl ossetwt Olson with consistent result a 14), .FrGP nteohrhand, other the on GFP, For .08). ihcmuainlpre- computational with B) ∆∆G esrmns(53). measurements = −0.62, ∆∆G 2–5 64) hnormdlsol twl n h theory the and basis well mechanistic a fit provide should would epistasis model stability-mediated our trans- of folding of nonlinear then probability a 46–48), the by (23–25, and converted free energy then folding the between on are formation in effects that additive epistasis folding essentially most of to energy If due biophysics. indeed and is in proteins biology interpretation molecular mechanistic of of the unambiguous features terms whether an major unclear provides the is capture model it to relationship, sufficient genotype–phenotype sometimes the is it and con- extreme highly to provides extrapolating the it when outside so phenotypes. behavior additive and conservative is phenotypes trolled, model of Fur- epistasis range backgrounds. poten- observed global genetic or our across saturation thermore, consistently for framework, operate responsible that epistasis assumption tiation factors plausible global biologically physiological the the on the can based under is Extrapolation it 11). understood because (10, predic- easily models out-of-sample pairwise be mutagenesis incorrect by the exhibited qualitatively in tions sampled the those unlike from assay, far genotypes for tions indi- data. even of mutagenesis suggesting possible be limited effects (31), will from epistasis GFP additive global of of the measurements reanalysis robust for our that in CIs e.g., the mutations, epistasis global than vidual of tighter form the much for CIs are inferred the space surpris- that Perhaps genotypic find uncertainty. we of quantifying ingly, on dimensionality premium a large puts The also GB1). of are for majority there actions (e.g., the models parameters pairwise capture fewer in times and than hundred several fit addi- with variation better of can observed we dramatically handful function, a a nonlinear just the produce control with which model parameters, parameters. additive tional thousand an several augmenting have By typically will completely model the a even additive of space, Because sequence vir- protein variation. of for experimental dimensionality account observed high the that of predictions all tually accurate provide can epistasis global of model our directly in off coefficients read additive epistasis. be of map can heat prob- genotype the optimal (NP)-complete from the time interaction whereas polynomial pairwise (58), For nondeterministic lem a challenging. a in is extremely maximum fit global be the can finding landscape coefficients instance, between the fitted of interactions the features pairwise qualitative from possible even assessing all where with by sites, inferred model typically a landscapes rugged fitting complex, highly functional contrast the tractability highly with analytical among and use simplicity This “func- acid (55). amino sequences the site-specific predictions as the detailed of more as literature level as such well the given as 57)], a in (56, least information” known frac- tional readily at quantity the achieve can [a that as one functionality sequences such model random of statistics epistasis (55). tion descriptive global Hippel important von a approximate and given Berg particular, by In information- models introduced now-classical level, the treatment technical using analyzed theoretic more be a can form At details this (29). the of effect on sequence is than the genetic rather mutation the value of trait of particular magnitude current In the any the (54). on of only and epistasis depends effect sign backgrounds, the no across of shows constant sign resulting it the the and trait, peaked, particular, underlying single additive is the model with increasing of assume monotonically we values is Because increasing relationship understand. nonlinear to the easy that is space observed sequence the over and trait additive the phenotype. between relationship ear hl h lbleitssmdli edl comprehensible, readily is model epistasis global the While predic- reasonable makes also model epistasis global The global of model simple a sometimes that show results Our model the of behavior the display, to easy being Besides PNAS | 00piws inter- pairwise ∼500,000 o.115 vol. | o 32 no. | E7555

EVOLUTION for the observed additive trait. On the other hand, the observa- genotype–phenotype relationships in high-throughput - tion that our model fits well is not alone sufficient to infer the esis data. While the form of epistasis is very simple and easy to existence of a mechanistically meaningful underlying additive understand, our model nonetheless captures the major features trait. For example, if the phenotype is a saturating function of genetic and genotype-by-environment interaction for GB1, of several underlying traits (e.g., refs. 37, 52, and 59), the GFP, and TEM-1 β-lactamase. We suggest that it provides a additive trait inferred under our model may correspond to an natural first model to apply to any high-throughput mutagenesis amalgamation of multiple mechanistic traits. dataset. The interpretation of the global epistasis model as the sim- plest possible epistatic extension of an additive model, together Materials and Methods with its capacity to capture a large proportion of the observed Preprocessing of Genotype–Phenotype Data. As input our procedure requires epistasis, suggests that it should be used as a standard first anal- pairs of genotype–phenotype measurements and optional estimates of ysis of mutagenesis assay data, instead of an additive fit. More measurement error and environmental condition. expressive models, capable of capturing complex epistasis, can For the GB1 datasets (37, 38), we defined a functional score and variance from the sequence counts before and after selection based on a Poisson then be used to describe the idiosyncratic interactions between approximation the small set of sites whose interactions deviate substantially WT c1 c from the large-scale patterns identified by the global-epistasis fit. y = log − log 1 c cWT Moreover, differences in experimental protocol can be viewed 0 0 as changes in the nonlinear mapping from the additive trait to 1 1 1 1 σ2 = + + + , the observed phenotype, and so we expect the inferred additive m WT WT c0 c1 c c coefficients to be more consistent and reproducible across lab- 0 1

oratories and experiments than the raw measurements. Indeed where c0 and c1 are the counts before and after selection, respectively, and 1 for TEM-1, we find that our inferred coefficients are more highly both include a pseudocount of 2 (66). The score is defined relative to the WT WT correlated with previous measurements of TEM-1 activity (39) WT sequence, using counts c0 and c1 . than the measured growth rates in any single antibiotic concen- For the GFP dataset (31), we used the fluorescence measurement (minus tration (r 2 = 0.96 for coefficients, 0.06, 0.74, 0.90, 0.91, 0.79 in the WT fluorescence) and variance as provided. For the β-lactamase dataset (45), trials 1 and 2 were combined, with given individual conditions). functional measurements relative to WT (and no estimates of measurement A subtle, but important, technical note concerns our assump- error). The WT sequences were not given, and therefore we added them to tion that the nonlinear global epistasis function g(φ) is a mono- our data with value equal to zero for each trial and antibiotic concentration. tonic function spanned by an I-spline basis. Some restriction on Mutations were excluded that showed small effects relative to the distribu- the form of g(φ) is essential because a model allowing g(φ) tion of neutral variants (absolute difference from WT less than 0.25 under all could exactly match the measured phenotypes and would there- concentrations of antibiotic) since the effect of mutations that show no fit- fore provide no useful simplification. In our case, we restrict the ness defect across all conditions is not well defined under our best-fit model function g(φ) to be monotonic for comprehensibility—even a (below). quadratic mapping can produce extremely complex patterns of epistasis (60). Although other families of monotonically increas- Inference of Nonepistatic Model. The additive trait φ(a), defined in Eq. 2, is the prediction of a nonepistatic model given a sequence a and parameters ing functions are certainly possible (ref. 61; see also ref 62, which β and β . In practice, sequences are defined as a set of zeros and ones 0 i,ai used very similar methods to show there is no global nonlinearity for each site, representing each possible amino acid substitution from the in antibody-binding energies), we have found that a four-element WT (dummy coding), and coefficients are defined consistently such that φ I-spline basis is sufficient for the datasets explored here and adds can be calculated by matrix multiplication. We assume a Gaussian likelihood y ∼ N (φ(a), σ2 + σ2 ) for each observation (representing binding, flu- only a handful of additional parameters relative to an additive a ma HOC model. orescence, etc.) with mean equal to the additive trait and variance equal In contrast, recent heuristic approaches to model global epis- to the given measurement error for each sequence plus the HOC epista- + σ2 tasis essentially suffer from imposing too strong or too weak sis. We find the 19L 1 coefficients for the trait and HOC which maximize the total log-likelihood P log N (φ(a), σ2 + σ2 ), where the sum is over constraints on the nonlinear function g(φ). For instance, refs. 31 a ma HOC and 32 use a neural network approach that lacks easy compre- all observed sequences. We use a gradient-based algorithm, L-BFGS (NLopt library), with parameters initialized from the coefficients of a nonepistatic hensibility, whereas ref. 33 assumes that the nonlinear function model fitted to the data using ordinary least-squares regression. is a power transformation, which is comprehensible but too con- In the GFP (31) dataset, estimated errors are based on the fluorescence strained to capture physiologically realistic patterns of saturation variance of sequences with the same barcode. Many of the sequences typical of metabolic and developmental systems (13). A subtler appeared only once, with no associated variance estimates. In this case, we distinction between our method and the method proposed by extended our likelihood such that if there was measurement error associ- Sailer and Harms (33) is that their fitting procedure assumes ated with a sequence, we set the total variance in the likelihood to be σ2 + σ2 . This single new parameter σ2 is estimated simultaneously that the form of the nonlinearity can be accurately inferred from HOC m0 m0 the observed residuals when fitting a nonepistatic model. While with the other parameters. It assumes the same amount of measurement approximately true for the datasets analyzed by Sailer and Harms noise for each sequence with only one barcode, which is a reasonable approximation for the GFP data, but may not be reasonable for other (33), this is not the case for the more complex nonlinearities ana- experiments. lyzed here because the additive coefficients inferred based on our In the β-lactamase dataset, no measurement noise estimates were used, nonlinear mapping sometimes differ substantially from the addi- and the log-likelihood reduces to a mean-square error. tive coefficients inferred under a standard nonepistatic fit (e.g., SI Appendix, Fig. S1B). Inference of Global Epistasis. Our model of global epistasis is an extension The intuition that complicated behavior in a high-dimensional of the nonepistatic model, replacing φ with g(φ) in the likelihood. The non- space can often be summarized by learning a nonlinear function linearity g is implemented by a flexible family of third-order monotonic of a latent additive trait has a long history in biology—as detailed I-splines (35). I-splines are a basis set of piecewise polynomials, defined by a in the Introduction. And techniques for inferring such relation- set of control points, or knots, that are linearly combined to form a mono- tonic nonlinear function. We found that four evenly spaced knots were ships have been rediscovered multiple times across a breadth sufficient for each analysis and had sufficient flexibility without overfitting, of scientific disciplines [e.g., single-index models in economet- resulting in five basis functions. The nonlinear function is a linear combi- rics (63), projection pursuit regression in statistics (64), and P nation of these basis functions plus a constant, g(φ) = cα + m αmIm(φ), linear–nonlinear models in neuroscience (65)]. Here we have with nonnegative αm coefficients. I-splines are defined only within the shown that these ideas can be productively applied to modeling knot boundaries, so we extend g to linearly extrapolate beyond the knot

E7556 | www.pnas.org/cgi/doi/10.1073/pnas.1804015115 Otwinowski et al. Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 7 iuaM rwJ 17)Efc foealpeoyi eeto ngntccag at change genetic on selection phenotypic overall of Effect (1978) JF Crow M, Kimura 17. 6 ika D(97 eeoi samjrcueo eeoyoiyi nature. in heterozygosity of cause major a as (1967) that RD fitness. affecting polymorphisms Milkman factors balanced distributed 16. Continuously of (1967) number JL King The (1967) 15. WF Bodmer dominance. TE, of Reed basis JA, molecular The Sved (1981) JA 14. Burns H, Kacser at 13. models dominance. of statistical theories are evolutionary and good Physiological (1934) How S (2016) Wright S 12. Bonhoeffer GE, Leventhal produces L, regression Plessis by du landscapes fitness 11. Inferring (2014) JB Plotkin J, Otwinowski 10. tiosie al. et Otwinowski β n eoohrie o otnosvariable continuous a For otherwise. zero and predictions: identical give can for epistasis because, global occurs the g(φ) This in transformation. change affine responding an transformation affine to invertible up any only defined the as is coefficients spline fit. epistasis these global used the in We parameters fixed. initial coefficients, parameters spline the other evenly chose holding four we chose and we one, I-spline and α the zero of between parameters knots initial spaced the For fit. epistasis 0 that such hspolm egnrt h riigadtsigsbesbsdo the on based several under subsets of observed way testing is that no and mutation have each training we For avoid the conditions. To since environmental generate observed. set, been we test never problem, the have this to that assign- mutations mutation for avoid given predictions a to making on modification data some all requires ing model the of cross-validation and position. finish, knot to allowed is optimization the change the not then instances, does these coefficient condi- In the environmental in prediction. all change the for a curve as the unidentifiable, of become portion tions flat coefficients the associated in is the that speaking, genotype Strictly position. knot est and environment) reference the to (relative effect, a environmental is the plus phenotype trait measured intermediate g(φ the an where of model function nonlinear a Effects. with effects Environmental environmental and Genetic and 1. to 0 Separating equal to coefficients additive phenotype the WT of the value absolute by sets average interpretation that the biological makes transformation natural postpro- affine a then the have We to choosing maximization. parameters positions inferred the knot the throughout the cess fixed choosing these by keeping transformation affine and the fix we therefore, was model nonepistatic produced a in parameters First, tested. initial resulting we of fitted, datasets choice all in following solutions the good fit, very poor a in as resulting likelihood, maximum by simultaneously with above parameters described variance the and estimate We boundaries. a state, tal .Lv M adn ,FynW 21)PtsHmloinmdl fpoenco- protein of models Hamiltonian Potts (2017) WF Flynn A, Haldane fitness RM, the and Levy mapping phenotype 9. to Genotype (2013) and I protease Nemenman HIV-1 in J, effects Otwinowski mutational of 8. analysis systems A and (2011) dynamics al. et evolutionary T, of Hinkley investigations 7. Genomic (2015) MM Desai science. protein ER, of Jerison style new A 6. scanning: mutational Deep (2014) S Fields RNA. DM, with Fowler role ‘evo-devo’ Modelling The 5. (2002) ruggedness: W within Fontana Smoothness 4. (1996) W Fontana PF, Stadler MA, Huynen 3. rugged on walks (1993) adaptive SA of theory Kauffman general a 2. Towards (1987) S Levin S, Kauffman 1. lcaaedt a lp qa ozero, to equal slope had data -lactamase m sntdi h antx,orbs-tnnierfunction nonlinear best-fit our text, main the in noted As n uteyi h neec rcdr sta h aetadtv trait additive latent the that is procedure inference the in subtlety One hntedt oss fsnl uat nmlil niomns the environments, multiple in mutants single of consist data the When optima, local in stuck get to optimization the for possible is it While niiulloci. individual 55:493–495. a emitie nantrlpopulation. natural a in maintained be can 53. landscapes. fitness complex approximating epistasis. of estimates biased fitness. evolutionary and landscapes, 43:55–62. energy free variation, . lac coli E. the of landscape transcriptase. reverse experiments. evolution microbial in epistasis Methods Nat adaptation. in neutrality of Evolution landscapes. htpouealna function linear a produce that , + = φ g e 0 .Fractgrclevrnetlefc where effect environmental categorical a For ). ) where (Aφ)), φ β i e ,α ≤ Ofr nvPes e York). New Press, Univ (Oxford = ho Biol Theor J φ htaeafce r erae ob qa oterightmost the to equal be to decreased are affected are that P 11:801–807. ≤ rcNt cdSiUSA Sci Acad Natl Proc γ ,adteewr sda nta aaeesfrteglobal the for parameters initial as used were these and 1, ζ β γ i ,α δ a Genet Nat replacing g(φ) h rgn fOdr efOgnzto n eeto in Selection and Organization Self Order: of Origins The γ g 128:11–45. ,z and 0 where , is rcNt cdSiUSA Sci Acad Natl Proc c rcNt cdSiUSA Sci Acad Natl Proc α σ g five , HOC 2 43:487–489. opsdwith composed ζ h diieefcswr hnrescaled then were effects additive The . γ LSOne PLoS ymaso eaaeoptimization separate a of means by g, 75:6168–6171. steefc fbigi environment in being of effect the is φ α A . m fteudryn hntp,acor- a phenotype, underlying the of o ilEvol Biol Mol Genetics ofcet,teadtv effects, additive the coefficients, urOi ee Dev Genet Opin Curr ∂φ ∂ 8:e61570. g δ φ = Bioessays γ 93:397–401. A e 111:E2301–E2309. ,z 55:469–481. esprt eei and genetic separate We ,t h ih ftehigh- the of right the to 0, −1 = = ζ 33:2454–2468. uigteinference, the During . when 1 z . 24:1164–1177. Genetics z urOi tutBiol Struct Opin Curr steenvironmen- the is Genetics 35:33–39. γ seulto equal is 97:639–666. mNat Am β i 55:483–492. ,α g o the for o any for Genetics 68:24– γ z 2 ouav ,e l 21)Eprmna sa fafins adcp na on landscape genotype- nonlinear fitness in a epistasis high-order of Detecting assay (2017) MJ Experimental Harms ZR, (2017) Sailer al. 33. protein. et fluorescent green V, the of Pokusaeva landscape fitness 32. Local (2016) beta-lactamase al. the et makes of KS, landscape Sarkisyan epistasis mutational the 31. Global Capturing (2013) (2014) al. et MM H, Jacquier Desai 30. ER, Jerison DP, fitness Rice rugged evolution. S, of protein in Kryazhimskiy Epistasis features (2016) JW 29. Topological Thornton TN, Starr (2015) 28. FA Kondrashov DA, disadvan- Kondrashov the and 27. for epistasis Multidimensional accounts (2001) model AS folding Kondrashov FA, protein Kondrashov biophysical sequence A 26. in (2011) meanderings EI Missense Shakhnovich CS, (2005) Wylie DL 25. Hartl DM, neutrality. Weinreich protein MA, of DePristo natural prediction 24. Thermodynamic in (2005) al. selection et JD, phenotypic Bloom of 23. strength The L S, Willmann J, (2001) Berg trait. al. 22. quantitative a et on JG, selection natural Kingsolver of 21. form characters. the correlated Estimating on (1988) selection D of Schluter measurement 20. deleterious The (1983) slightly SJ very Arnold by R, Lande the 19. of Contamination (1995) AS Kondrashov 18. h ai n uiePcadFudto n h SAm eerhOffice Research Army US from the support and acknowledges Foundation sup- J.B.P. Packard (W911NF-12-R-0012-04). was Lucile and and J.O. T32AI055400, David comments. Grant the helpful NIH for by referees ported anonymous the and preprint ACKNOWLEDGMENTS. conservative. somewhat is center, the test in the probability that excess indicating has distribution resulting resulting the The however, bootstraps). case; 100 (with models and P data bootstrapped the on a late in test above. described distri- difference log-likelihood model. bootstrapped with had bution, a (null) model to simpler comparing complex the by more calculated by the determined of parameters optimization initial likelihood (hypoth- The complex more model. the and esis) model (null) was simpler log-likelihood the in between difference computed the First test. likelihood-ratio bootstrapped Comparison. Model models, maximum-likelihood and bootstrapped and the transforma- in affine traits scale, an intermediate the trait to where bootstrap regression, the up least-squares into only values estimates trait the defined original Because bootstrap are variable. each trait random for tion, normal underlying standard the a for of instance an The as is predictions. generated were maximum-likelihood data the bootstrapped on based bootstrap parametric Intervals. Confidence Bootstrapped as used are observations particular remaining the a and testing. to set and assigned test training. the training observations as for all used used cross-validation, are subsets categories integer of nine the fold the define each (i.e., integers For conditions These and 3). replicates Fig. of in number between the integers and of permutation one random a assign we conditions/replicates, aho w ieryetaoae ein fo h ntbudre othe on to based distribution boundaries knot the the (from for regions values maximum/minimum extrapolated spaced linearly linearly two 101 of and one) each and zero model (between bootstrapped region linear each of on value based each for distribution a make we hudb nfr nteideal the in uniform be should S7 ) Fig. Appendix, (SI distribution -value evldtdti rcdr ybosrpigthe bootstrapping by procedure this validated We hntp maps. phenotype bioRxiv:222778. scale. macroevolutionary Nature TEM-1. stochasticity. sequence-level despite 1522. predictable adaptation space. sequence in landscapes . of tage viruses. in effects fitness mutational most evolution. protein of view biophysical A space: USA Sci Acad sites. populations. Evolution Evolution over? times 100 died not we have Why mutations: m P b hti,frec otta ntets,w calcu- we test, the in bootstrap each for is, That S1. Table Appendix, SI M vlBiol Evol BMC au ymaso eod(rrcrie ieiodrtots based test likelihood-ratio recursive) (or second a of means by value and 533:397–401. USA Sci Acad Natl Proc 42:849–861. 37:1210–1226. y c rcNt cdSiUSA Sci Acad Natl Proc b 0 mNat Am 102:606–611. eeae yaprmti otta rmtenl oe as model null the from bootstrap parametric a by generated r h nerdprmtr.FrcluaigteC of CI the calculating For parameters. inferred the are ecos 0 ierysae ausfor values spaced linearly 101 choose We χ. Genetics ecmae etdmxmmlklho oesb a by models maximum-likelihood nested compared We 4:42. si 20)Aatv vlto ftasrpinfco binding factor transcription of evolution Adaptive (2004) M assig ¨ .FrteC fmaximum-likelihood of CI the For φ). 157:245–261. β i etakJseBomfrcmet ntebioRxiv the on comments for Bloom Jesse thank We ,αb efidprmtr htlnal rnfr the transform linearly that parameters find we b, 205:1079–1088. /m rnsGenet Trends 110:13067–13072. b . 98:12089–12092. eetmtdcndneitrasb a by intervals confidence estimated We φ b rcNt cdSiUSA Sci Acad Natl Proc PNAS y and 0 31:24–33. = a e Genet Rev Nat g(φ) φ | ML ho Biol Theor J o.115 vol. + r etr eciigthe describing vectors are q σ rti Sci Protein P φ HOC 2 b 6:678–687. ausfrtefirst the for values 175:583–594. | = 108:9916–9921. + Science β o 32 no. m i P σ ,α b χ g m 2 φ 25:1204–1218. auswere values b η emk a make we ntenon- the in ML (m where , 344:1519– b + rcNatl Proc | χ c E7557 + b g ML via c b η ) ,

EVOLUTION 34. Szendro IG, Schenk MF, Franke J, Krug J, de Visser JAGM (2013) Quantitative analyses 51. Risso VA, et al. (2014) Mutational studies on resurrected ancestral proteins reveal of empirical fitness landscapes. J Stat Mech Theor Exp 2013:P01005. conservation of site-specific amino acid preferences throughout evolutionary history. 35. Ramsay JO (1988) Monotone regression splines in action. Stat Sci 3:425–441. Mol Biol Evol 32:440–455. 36. Kingman JF (1978) A simple model for the balance between selection and mutation. 52. Wu NC, Olson CA, Sun R (2016) High-throughput identification of protein mutant J Appl Probab 15:1–12. stability computed from a double mutant fitness landscape. Protein Sci 25: 37. Olson CA, Wu NC, Sun R (2014) A comprehensive biophysical description of pairwise 530–539. epistasis throughout an entire protein domain. Curr Biol 24:2643–2651. 53. Otwinowski J (2018) Biophysical inference of epistasis and the effects of mutations 38. Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R (2016) Adaptation in protein fitness on protein stability and function. arXiv:1802.08744. landscapes is facilitated by indirect paths. eLife 5:e16965. 54. Weinreich DM, Chao L (2005) Rapid evolutionary escape by large populations from 39. Firnberg E, Labonte JW, Gray JJ, Ostermeier M (2014) A comprehensive, high- local fitness peaks is likely in nature. Evolution 59:1175–1182. resolution map of a gene’s fitness landscape. Mol Biol Evol 31:1581–1592. 55. Berg OG, von Hippel PH (1987) Selection of DNA binding sites by regulatory proteins: 40. Klesmith JR, Bacik JP, Wrenbeck EE, Michalczyk R, Whitehead TA (2017) Trade-offs Statistical-mechanical theory and application to operators and promoters. J Mol Biol between enzyme fitness and solubility illuminated by deep mutational scanning. Proc 193:723–743. Natl Acad Sci USA 114:2265–2270. 56. Carothers JM, Oestreich SC, Davis JH, Szostak JW (2004) Informational complexity and 41. Weinreich DM, Delaney NF, Depristo MA, Hartl DL (2006) Darwinian evolution can functional activity of RNA structures. J Am Chem Soc 126:5130–5137. follow only very few mutational paths to fitter proteins. Science 312:111–114. 57. Hazen RM, Griffin PL, Carothers JM, Szostak JW (2007) Functional information and 42. Novais A, et al. (2010) Evolutionary trajectories of beta-lactamase CTX-m-1 cluster the of biocomplexity. Proc Natl Acad Sci USA 104:8574–8581. : Predicting antibiotic resistance. PLoS Pathog 6:e1000735. 58. Barahona F (1982) On the computational complexity of Ising spin glass models. J Phys 43. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M (2016) Coevolutionary land- A Math Gen 15:3241–3253. scape inference and the context-dependence of mutations in beta-lactamase TEM-1. 59. Manhart M, Morozov AV (2015) and binding can emerge as Mol Biol Evol 33:268–280. evolutionary spandrels through structural coupling. Proc Natl Acad Sci USA 44. Bloom JD (2014) An experimentally informed evolutionary model improves phyloge- 112:1797–1802. netic fit to divergent lactamase homologs. Mol Biol Evol 31:1–17. 60. Hwang S, Park SC, Krug J (2017) Genotypic complexity of Fisher’s geometric model. 45. Stiffler M, Hekstra D, Ranganathan R (2015) as a function of purifying Genetics 206:1049–1079. selection in TEM-1-lactamase. Cell 160:882–892. 61. Ramsay JO (1998) Estimating smooth monotone functions. J R Stat Soc Ser B Stat 46. Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS (2006) –epistasis Methodol 60:365–375. link shapes the fitness landscape of a randomly drifting protein. Nature 444:929–932. 62. Adams RM, Kinney JB, Walczak AM, Mora T (2017) Physical epistatic landscape of 47. Gong LI, Suchard MA, Bloom JD (2013) Stability-mediated epistasis constrains the antibody binding affinity. arXiv:1712.04000 [q-bio]. evolution of an influenza protein. eLife 2:e00631. 63. Li Q, Racine JS (2007) Nonparametric Econometrics: Theory and Practice (Princeton 48. Dasmeh P, Serohijos AW, Kepp KP, Shakhnovich EI (2014) The influence of selection Univ Press, Princeton). for protein stability on dN/dS estimations. Genome Biol Evol 6:2956–2967. 64. Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76: 49. Wells JA (1990) Additivity of mutational effects in proteins. 29: 817–823. 8509–8517. 65. Atencio CA, Sharpee TO, Schreiner CE (2008) Cooperative nonlinearities in auditory 50. Sandberg WS, Terwilliger TC (1993) Engineering multiple properties of a protein by cortical neurons. Neuron 58:956–966. combinatorial mutagenesis. Proc Natl Acad Sci USA 90:8367–8371. 66. Plackett RL (1981) The Analysis of Categorical Data (MacMillan, New York), 2nd Ed.

E7558 | www.pnas.org/cgi/doi/10.1073/pnas.1804015115 Otwinowski et al. Downloaded by guest on September 27, 2021