From Residue Coevolution to Protein Conformational Ensembles and Functional Dynamics
Ludovico Sutto (a,1), Simone Marsili (b,1), Alfonso Valencia (b,2), and Francesco Luigi Gervasio (a,c,2)

(a) Institute of Structural and Molecular Biology, University College London, London WC1H 0AJ, United Kingdom; (b) Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain; and (c) Department of Chemistry, University College London, London WC1H 0AJ, United Kingdom

Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved September 23, 2015 (received for review May 2, 2015)

The analysis of evolutionary amino acid correlations has recently attracted a surge of renewed interest, also due to their successful use in de novo protein native structure prediction. However, many aspects of protein function, such as substrate binding and product release in enzymatic activity, can be fully understood only in terms of an equilibrium ensemble of alternative structures, rather than a single static structure. In this paper we combine coevolutionary data and molecular dynamics simulations to study protein conformational heterogeneity. To that end, we adapt the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develop a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble. By means of exhaustive sampling simulations, we analyze the set of conformations that are consistent with the observed residue correlations for a set of representative protein domains, showing that (i) the most representative structure is consistent with the experimental fold and (ii) the various regions of the sequence display different stability, related to multiple biologically relevant conformations and to the cooperativity of the coevolving pairs. Moreover, we show that the proposed protocol is able to reproduce the essential features of a protein folding mechanism as well as to account for regions involved in conformational transitions through the correct sampling of the involved conformers.

coevolution | network inference | coarse-grained | protein folding | protein dynamics

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Significance

Evolutionarily related protein sequences have been selected to preserve a common function and fold. Residues in contact in this conserved structure are coupled by evolution and show correlated mutational patterns. The exponential growth of sequenced genomes makes it possible to detect these coevolutionarily coupled pairs and to infer three-dimensional folds from predicted contacts. But how far can we push the prediction of native folds? Can we predict the conformational heterogeneity of a protein directly from sequences? We address these questions by developing an accurate contact prediction algorithm and a protein coarse-grained model, and by exploring conformational landscapes congruent with coevolution. We find that both structural and dynamical properties can already be recovered using evolutionary information only.

Author contributions: L.S., S.M., A.V., and F.L.G. designed research; L.S. and S.M. performed research; L.S. and S.M. analyzed data; and L.S., S.M., A.V., and F.L.G. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
1 L.S. and S.M. contributed equally to this work.
2 To whom correspondence may be addressed. Email: [email protected] or f.l.gervasio@ucl.ac.uk.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1508584112/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1508584112 PNAS | November 3, 2015 | vol. 112 | no. 44 | 13567–13572

Pairs of positions along a protein sequence can show strong correlations arising both from functional and structural constraints (1–9). Earliest approaches for detecting interdependent residues and predicting 3D contacts in proteins (1–4, 8) analyzed alignments containing from tens to a few hundred sequences. Given the small size of available sequence datasets, these works relied on an independent pair approximation: a “coevolutionary coupling” between two residues was estimated independently for each pair, ignoring the rest of the network of residues. The number of known protein sequences, however, has grown dramatically in the past few years (10). Such a large increase in the size of datasets has made it possible to fit, either explicitly (11, 12) or implicitly (13, 14), pairwise models for protein sequences that take into account the whole network of correlated residues simultaneously, and are able to disentangle correlated positions from “interacting” positions by identifying the parameters of the model with the coupling constants in an Ising-like Hamiltonian (15, 16). Despite their simplicity, these models have had remarkable success in the design of synthetic sequences preserving natural function (13, 14) and in the prediction of interacting pairs of residues from the knowledge of their sequence alone (12, 17–21).

In this paper, we tackle the problem of sampling an ensemble of structures compatible with the observed coevolution between protein residues. We will follow a two-step procedure. The first step corresponds to an inverse problem: from a set of homologous sequences to the parameters of a model. Inverse problems are notoriously computationally hard. For large sets of variables, an exact evaluation of the normalizing constant of the variables’ joint distribution (the partition function, in the language of statistical mechanics) is impracticable. Previous works in the literature focused on efficiency, circumventing this problem by adopting different, approximated solutions (12, 17–19, 22–26), generically based on tractable approximations of the likelihood function. However, given the success and the number of potential applications of coevolutionary analysis, the study of reference and more quantitative approaches is necessary. In this regard, the Monte Carlo Markov chain (MCMC)-based, maximum-likelihood approach, albeit computationally demanding, is in principle exact given sufficient sampling at each minimization step. In this work we adopt the Boltzmann learning algorithm (11, 27), whose accuracy in inferring the parameters of the pairwise model, at variance with all of the previous approaches in the literature, is not biased a priori by the choice of a particular approximation scheme.

The second step is a direct problem: after translating the probabilistic model for sequences into an energy potential for protein structures, we can explore the resulting energy landscape using molecular dynamics (MD). After extensive sampling, we can characterize the folding reaction and find the best candidate for the native fold as well as metastable intermediates and conformers that may have a functional role. Moreover, we can spot flexible regions, directly connecting coevolution to function and dynamics.

With this goal in mind, we introduce a coarse-grained model particularly apt to translate predictions of contacts to a structural ensemble. Thanks to the great reduction in the number of degrees of freedom, coarse-grained models have been widely used to study many aspects of proteins (28–34). Due to their simplicity, Cα models in particular have already been used to predict protein folds from coevolutionary data (35–37). Here, in the same spirit as the model presented in ref. 35, where coevolutionary information is used with a Cα coarse-grained protein model, we present a higher-resolution coarse-grained model that combines the pairwise predictions with an adapted all-atom force field for the heavy backbone atoms, similar to the approach used in ref. 38. The predicted contacts are introduced as favorable interactions between Cβ atoms of a coarse-grained side-chain, whereas the protein backbone is modeled with all of the heavy atoms to capture the secondary structure conformation with high resolution. Indeed, we show through extensive molecular dynamics simulations on a set of 18 proteins that the final accuracy of structure prediction, measured as root-mean-square deviation (RMSD) from the native experimental structure, is determined solely by the accuracy of contact predictions.

However, besides recovering a protein native fold, the main advantage of the proposed approach is its ability to capture the conformational heterogeneity and the thermodynamical features of the folding reaction as implied by coevolutionary information only. In contrast to more expensive approaches like all-atom MD or more refined coarse-grained potentials (39), we can afford an extensive equilibrium sampling of the conformational space. We illustrate this point by applying our approach to analyze two energy landscapes related to the folding of the Ras protein and the

Fig. 1. (A and B) Energy surfaces obtained projecting sequences from the ADH_zinc_N domain family (A) and sequences sampled from the model (B) over the first two principal components of the MSA (7), and taking the negative logarithm of the resulting distribution. High-probability regions in sequence space are in dark blue. The cluster structure of the alignment is clearly reproduced by the simulated trajectories. (C and D) Mean precision of the top-ranked predictions for different values of the scaled rank (rank/total number of contacts), and the mean true positive rate for different