Amino Acid Coevolution Induces an Evolutionary Stokes Shift

Amino acid coevolution induces an evolutionary PNAS PLUS Stokes shift David D. Pollocka, Grant Thiltgenb, and Richard A. Goldsteinb,1 aDepartment of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045; and bDivision of Mathematical Biology, National Institute for Medical Research, London NW7 1AA, United Kingdom Edited by Andrew J. Roger, Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Canada, and accepted bythe Editorial Board April 3, 2012 (received for review December 7, 2011) The process of amino acid replacement in proteins is context- mine parameters and the overwhelming computational demands dependent, with substitution rates influenced by local structure, of more realistic but complex models. Recent advances in high- functional role, and amino acids at other locations. Predicting how throughput sequencing, high-performance computing, and phy- these differences affect replacement processes is difficult. To make logenetic model building have improved the situation, but it is still such inference easier, it is often assumed that the acceptabilities necessary to make simplifications and assumptions. The question of different amino acids at a position are constant. However, is which simplifications are most useful for addressing a particular evolutionary interactions among residue positions will tend to problem or issue? The model including the simplifications should invalidate this assumption. Here, we use simulations of purple acid be correct enough to decipher key features of how protein struc- phosphatase evolution to show that amino acid propensities at a ture and function interact with protein evolution, allowing us position undergo predictable change after an amino acid replace- to interpret the terms of the model in terms of the basic biology ment at that position. After a replacement, the new amino acid and biochemistry. To answer this question, we need to consider and similar amino acids tend to become gradually more acceptable the mechanistic theory underlying different evolutionary models. over time at that position. In other words, proteins tend to The most direct approach to formulating a mechanistic model equilibrate to the presence of an amino acid at a position through of protein evolution is to make the replacement rates dependent replacements at other positions. Such a shift is reminiscent of the on the resulting change in various protein properties, which is spectroscopy effect known as the Stokes shift, where molecules calculated as a function of the entire protein sequence. This ap- EVOLUTION receiving a quantum of energy and moving to a higher electronic proach has been the rationale for thermodynamic energy-based state will adjust to the new state and emit a smaller quantum of models, which allow for direct calculation of contextual inter- energy whenever they shift back down to the original ground actions. Although this approach might seem a priori preferable state. Predictions of changes in stability in real proteins show that because of its directness, it has dual disadvantages: it is slow, mutation reversals become less favorable over time, and thus, and it does not seem to explain the data as well as site-specific broadly support our results. The observation of an evolutionary empirical models (12, 13). Thermodynamic models are compro- Stokes shift has profound implications for the study of protein mised by the unlikely assumption that the fitness of a protein is evolution and the modeling of evolutionary processes. a simple function of its thermodynamic stability and the assumptions necessary to make the thermodynamic calculations com- major focus of modern evolutionary studies is to understand putationally feasible. For example, the energy potentials used how structural and functional contexts determine the pat- are almost certainly incorrect. We still do not have adequate A fl terns of evolutionary change at different positions in a biological ways to include the effect of side chain and backbone exibility macromolecule. Such an understanding is important to phylo- in these calculations. Thermodynamic properties generally involve genetics, partially because the position-specific processes of evo- small differences between large numbers, meaning that these larger lution are known to determine our ability to reconstruct deep quantities must be computed to excruciating accuracy. Small er- rors in energy potentials can accumulate across the many atomic nodes in the tree of life (1) but also because features of the evo- interactions in a protein, compounding error in even the best en- lutionary process such as convergence can deterministically mis- ergy models. Most of the time in evolutionary studies, however, lead phylogenetic reconstruction (2). Understanding patterns of many variants need to be evaluated, and therefore, simple (and evolution and how they respond to details of structure and func- even more error-prone) pairwise contact potentials are used for tion can also potentially help us to better decode the evolutionary computational reasons. Furthermore, to correctly calculate the free record, allowing us to distinguish between structural and func- energy of the native fold, it is necessary to consider the energy of all tional constraints and identify the signatures of positive selection. thermodynamically relevant alternative folds. Because it is cur- fi This nding can lead to improved understanding of a biomole- rently impossible to include the vast multitude of such folds or even ’ cule s structure, dynamics, thermodynamics, functionality, and know what the relevant folds are, decoy datasets are generally used, physiological context. Key questions concern how evolutionary consisting of, for example, a subset of the known folds in protein processes vary among sites and over time and particularly, how the databases. These decoy datasets are inadequate representations of evolution at different locations influences each other. For example, coevolution between different locations in a protein can slow the amino acid replacement process, allowing phylogenetic inference at deep nodes that would otherwise have been swamped Author contributions: D.D.P. and R.A.G. designed research; G.T. and R.A.G. performed by recurrent neutral changes. Furthermore, coevolution tends research; D.D.P. and R.A.G. analyzed data; and D.D.P. and R.A.G. wrote the paper. to depend on proximity in the 3D structure of proteins, leading to The authors declare no conflict of interest. the hope that, if properly understood, it could improve our ability – This article is a PNAS Direct Submission. A.J.R. is a guest editor invited by the Editorial to predict important features of protein structure (3 11). Board. To understand patterns of molecular evolution, it is necessary Freely available online through the PNAS open access option. to model the process of evolution. Models of protein evolution 1 To whom correspondence should be addressed. E-mail: [email protected]. must, out of necessity, make assumptions or simplifications con- uk. cerning the underlying replacement process. In the past, most This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. model assumptions have been driven by the lack of data to deter- 1073/pnas.1120084109/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1120084109 PNAS Early Edition | 1of8 Downloaded by guest on September 24, 2021 the energetically relevant competing folds, leading to large amounts amino acids at other locations. A fundamental difference be- of error in energy analyses. tween expectations under what we will call a coevolutionary fit- Because of these limitations, phylogenetic analyses have been ness model and the site-specific constant fitness (independent dominated by phenomenological models. Such models attempt evolution) model described above can be seen by considering to capture the results of evolutionary change rather than model forward and reverse substitutions. In the constant model, if a the selective constraints acting on the evolutionary process per forward substitution (for example, from amino acid ak to al)is se. Historically, empirical models of amino acid evolution were advantageous, then the reverse substitution (from al to ak) will obtained from observed differences among sequences (14–16). always be disadvantageous or deleterious and will be unlikely to Phenomenological substitution models were initially constructed occur. In other words, the change in fitness of the reverse sub- based on the assumption that evolution at all locations was stitution, Δωðal ⇒ akÞ, is the same magnitude and the opposite identical and independent or that different locations might direction as the change in fitness of the forward substitution of evolve at different rates, with the rate acting as a simple scaling Δωðal ⇒ akÞ¼ −Δωðak ⇒ alÞ. In contrast, the coevolutionary factor for the rate matrix. More elaborate models have been fitness model would allow for the possibility that Δωðak ⇒ alÞ and constructed that allow substitution rates to depend on local Δωðal ⇒ akÞ might change because of substitutions at other sites. structure (17–19), whereas other models have allowed sub- Despite their limitations and the lack of a plausible underlying stitution rates to vary between branches and locations in ways theoretical rationale, the overall comparative success of the that are determined by the sequence data (20–22). In these phenomenological models suggests

Amino Acid Coevolution Induces an Evolutionary Stokes Shift

Lecture 5: Sequence Alignment – Global Alignment

Novel Bioinformatics Applications for Protein Allergology

Information-Theoretic Bounds of Evolutionary Processes Modeled As a Protein Communication System

Testing the Independence Hypothesis of Accepted Mutations for Pairs Of

A Thesis Entitled Homology-Based Structural Prediction of the Binding

Bioinformatics Scoring Matrices

Oxidising Bacteria (SAOB)

Lecture 10: Local Alignment and Substitution Matrices 10.1

2-PAM Matrices

Pairwise Alignment

PHAT: a Transmembrane-Specific Substitution Matrix

Substitution Matrices E S V U