S41467-019-12130-8.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
ARTICLE https://doi.org/10.1038/s41467-019-12130-8 OPEN Learning the pattern of epistasis linking genotype and phenotype in a protein Frank J. Poelwijk 1, Michael Socolich2 & Rama Ranganathan2 Understanding the pattern of epistasis—the non-independence of mutations—is critical for relating genotype and phenotype. However, the combinatorial complexity of potential epi- static interactions has severely limited the analysis of this problem. Using new mutational 13 1234567890():,; approaches, we report a comprehensive experimental study of all 2 mutants that link two phenotypically distinct variants of the Entacmaea quadricolor fluorescent protein—an oppor- tunity to examine epistasis up to the 13th order. The data show the existence of many high- order epistatic interactions between mutations, but also reveal extraordinary sparsity, enabling novel experimental and computational strategies for learning the relevant epistasis. We demonstrate that such information, in turn, can be used to accurately predict phenotypes in practical situations where the number of measurements is limited. Finally, we show how the observed epistasis shapes the solution space of single-mutation trajectories between the parental fluorescent proteins, informative about the protein’s evolutionary potential. This work provides conceptual and experimental strategies to profoundly characterize epistasis in a protein, relevant to both natural and laboratory evolution. 1 cBio Center, Department of Data Sciences, Dana-Farber Cancer Institute, 360 Longwood Avenue, Boston, MA 02215, USA. 2 Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, The Pritzker School for Molecular Engineering, University of Chicago, 929 East 57th Street, Chicago, IL 60637, USA. Correspondence and requests for materials should be addressed to F.J.P. (email: [email protected]) or to R.R. (email: [email protected]) NATURE COMMUNICATIONS | (2019) 10:4213 | https://doi.org/10.1038/s41467-019-12130-8 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-019-12130-8 he central properties of proteins—folding, biochemical involved in the actual color switch, indicating a strong coopera- Tfunction, and evolvability—arise from a global pattern of tivity between the fluorophore and its immediate environment. cooperative energetic interactions between amino acid residues. When introducing amino acid substitutions in a protein, Results cooperativity manifests itself as context-dependence of the effects A complete combinatorial mapping of phenotypes. We devel- of those mutations, or epistasis1. Knowledge of the extent and oped an efficient iterative gene synthesis approach to simulta- distribution of epistasis in a protein is essential for understanding neously construct and barcode the full library of 8192 fluorescent its evolution. For example, when a certain functional improve- protein (FP) variants that represents the parental genotypes ment requires a combination of mutations that are individually mKate219 and mTagBFP220 and all possible intermediates unfavorable, no single-mutation trajectory exists that increases (Fig. 1b, Supplementary Fig. 1, and “Methods”). This strategy fitness at each step, and evolution towards the new functionality – makes it possible to readout the identity of every combination of will be hampered2 4. Being able to uncover epistasis is relevant mutations simply by high-throughput DNA sequencing of the for the reconstruction of phylogenetic trees5 and for estimating barcode region—a method that should be generally valuable for the evolutionary potential of antibiotic resistance genes6,7 and studies of high-order epistasis. We expressed the library in viruses8, but also for protein engineering efforts that make use of Escherichia coli, carried out two-color fluorescence activated cell directed evolution: information on epistatic architectures should sorting (FACS) to select variants with brightness above a prove useful in the selection of evolvable templates9,10,in threshold, deep sequenced the input and selected libraries, and 11 focusing mutations to highly-epistatic regions of a protein ,orin computed the frequency fa of every FP allele a in the sorted identifying cooperative units for DNA shuffling experiments12,13. population relative to the input population—an enrichment score However, the challenge of mapping epistasis is extraordinarily (Fig. 1b). The color channels are normalized by the measured complex. Epistasis can occur due to nonlinearities in any process spectral properties of the parental red and blue proteins and are linking genotype and fitness and can occur at the pairwise level combined to produce a single quantitative phenotype–y—that (two-way) or extend to a series of higher-order terms (three-way, reports fluorescence brightness (Fig. 1b, c, “Methods”, and Sup- four-way, etc.) that describes the full extent of possible interac- – plementary Fig. 2). This measure integrates various underlying tions14 17. As a consequence, the number of potential epistatic biophysical properties—extinction coefficient, quantum yield, interactions grows exponentially with the number of positions in protein expression—and is therefore a rich phenotype for char- a protein, a combinatorial problem that becomes rapidly inac- acterizing epistatic effects of mutations, even though other cessible to any scale of experimentation. Indeed, the theoretical properties such as maturation half-time and some detailed complexity of this problem is such that it is not feasible or spectral properties are not explicitly considered here. rational to propose an exhaustive mapping of epistasis for any Global non-linearities are expected due to the experimental protein. process and were minimized in a mechanistically unbiased How, then, can we practically characterize the architecture of manner using the procedure of linear-nonlinear optimization21–23, epistatic interactions between amino acids? We reasoned that a permitting the assessment of relevant epistatic interactions strategy may emerge from a focused experimental case study in between mutations (see “Methods” and Supplementary Fig. 6 which we make all possible combinations of mutations within a for robustness of the main conclusions to this process). The limited set of positions within a protein—a dataset from which we bottom line is a quantitative assignment of phenotypes for all can directly determine the extent of epistasis and explore possible 8192 variants in a form in which independence of mutational simplifying methods. As a model system, we chose the Entacmaea effects for reasons other than the internal cooperativities of amino quadricolor fluorescent protein eqFP61118, a protein in which acids and the single threshold selection for brightness corre- spectral properties and brightness represent easily measured, sponds to additivity in y (see Supplementary Fig. 13 for additional quantitative phenotypes with a broad dynamic range. Recently, discussion on detection limits). The fact that the parental two variants of eqFP611 have been reported, one bright deep-red genotypes are brightly fluorescent but many of the intermediates λ = λ = 19 (mKate2, ex 590 nm, em 635 nm ) and one bright blue are not (Fig. 1c) is a first indication that we can expect substantial λ = λ = 20 (mTagBFP2, ex 405 nm, em 460 nm ), that are separated epistasis between mutations linking the two. by thirteen mutations (Fig. 1a); we will refer to these as the “ ” parental genotypes. From phenotypes to epistasis. From the full dataset of pheno- By developing new technologies for high-throughput combi- types, we computed the complete hierarchy—1-way, 2-way, 3-way, natorial mutagenesis and quantitative phenotyping, we explored 4-way, etc.—of epistatic interactions between the thirteen mutated 13 = the full space of 2 8192 variants comprising the parental positions. Mathematically, epistasis is a transform (Ω)inwhich genotypes and all possible intermediates between them. These phenotypes (y) of individual variants are represented as context- data reveal a broad range of high-order epistatic interactions dependent effects of the underlying mutations (ω,Fig.2a)14,15: between mutations, suggesting great complexity in the relation- ship between genotype and phenotype. However, we find that ω ¼ Ωy ð1Þ epistasis is also highly sparse compared to theoretical limits, a property that opens up the use of powerful computational tech- For N positions with a single substitution at each position, y is a niques for uncovering the epistatic architecture with practical vector of 2N phenotypic measurements in binary order (here, 213) experimental or sequence-based approaches. If sparsity is general, and ω is a vector of 2N corresponding epistatic interactions. A first- ω this observation suggests an approach for accurate phenotypic order epistatic term ( 1) is the phenotypic effect of a single ω predictions of unobserved mutants based on phenotypic mea- mutation, a second order epistatic term ( 2) is the degree to which surements of a limited set of genotypes, which we illustrate using a single mutation effect is different in the background of second ω our model system. Finally, we obtain a view of the model pro- mutation, and a third-order epistasis ( 3) is the degree to which tein’s evolutionary potential: we find that the observed high-order the second order epistasis is different in the background of a third epistasis greatly limits the number of viable single-mutation tra- mutation. Higher-order