Origins of Coevolution Between Residues Distant in Protein 3D Structures

Origins of coevolution between residues distant in protein 3D structures Ivan Anishchenkoa,b, Sergey Ovchinnikova,b,c, Hetunandan Kamisettyd, and David Bakera,b,e,1 aDepartment of Biochemistry, University of Washington, Seattle, WA 98105; bInstitute for Protein Design, University of Washington, Seattle, WA 98105; cMolecular and Cellular Biology Program, University of Washington, Seattle, WA 98195; dFacebook Inc., Seattle, WA 98109; and eHoward Hughes Medical Institute, University of Washington, Seattle, WA 98105 Edited by Barry Honig, Howard Hughes Medical Institute, Columbia University, New York, NY, and approved July 6, 2017 (received for review February 15, 2017) Residue pairs that directly coevolve in protein families are generally conformational changes (16, 17), and modeling homo-oligomeric close in protein 3D structures. Here we study the exceptions to this complexes (18, 19). general trend—directly coevolving residue pairs that are distant in However, not all directly coupled residues are close in the 3D protein structures—to determine the origins of evolutionary pres- structure. It has been proposed that coevolution between distant sure on spatially distant residues and to understand the sources of sites can arise from residue–residue interactions through allosteric error in contact-based structure prediction. Over a set of 4,000 pro- interaction networks (20), negative design (21), codon effects (22), tein families, we find that 25% of directly coevolving residue pairs and spurious phylogenetic correlations (23, 24). Although previous are separated by more than 5 Å in protein structures and 3% by work has examined individual cases in which pairs of directly more than 15 Å. The majority (91%) of directly coevolving residue coevolving residues are distant in the 3D structure, we are not aware – pairs in the 5 15 Å range are found to be in contact in at least one of any comprehensive analysis over the protein structure database — homologous structure these exceptions arise from structural vari- of the contributions to such potential contact mispredictions. ation in the family in the region containing the residues. Thirty-five Coevolution-based structure prediction methods will become percent of the exceptions greater than 15 Å are at homo-oligomeric even more powerful as more genomes are sequenced, and it is interfaces, 19% arise from family structural variation, and 27% are important to understand possible sources of error as well as to in repeat proteins likely reflecting alignment errors. Of the remain- gain general understanding of the sources of direct evolutionary BIOPHYSICS AND ing long-range exceptions (<1% of the total number of coupled coupling between residue pairs. In this paper, we systematically COMPUTATIONAL BIOLOGY pairs), many can be attributed to close interactions in an oligomeric study the extent and source of direct coupling between residues state. Overall, the results suggest that directly coevolving residue distant in protein 3D structures in the Protein Data Bank (PDB). pairs not in repeat proteins are spatially proximal in at least one biologically relevant protein conformation within the family; we We find that almost all such coupling can be traced to direct find little evidence for direct coupling between residues at spatially physical interactions in at least one physiologically relevant con- separated allosteric and functional sites or for increased direct formation in the protein family. coupling between residue pairs on putative allosteric pathways Results and Discussion connecting them. Distances Between Coevolving Residue Pairs. To investigate the ori- protein coevolution | structural variation | homo-oligomeric contacts gins of direct evolutionary coupling of residue pairs in protein 3D structures, we focus on the 0.5 × (protein length) most strongly coupled pairs identified by GREMLIN separated by more than six atural proteins tread a delicate balance between maintain- residues in the primary sequence. As found previously, the fraction ing structural stability and carrying out their biological N of these pairs that are close in the 3D structure (shortest distance function. The amino acid sequence covariation in evolutionarily between heavy atoms less than 5 Å) increases with increasing related proteins arises from protein structural, functional, and number and diversity of the sequences in the family (Fig. 1A;we stability constraints, and methods have been developed for pre- use Meff defined in Methods to summarize both effects). As the dicting residue–residue contacts in protein 3D structures from multiple sequence alignment (MSA) data (1–5). These methods disentangle direct couplings between residue pairs from indirect Significance couplings that arise from chaining effects: If A is correlated with B, and B with C, then correlations between A and C will be Coevolution-derived contact predictions are enabling accurate observed even if A and C do not directly interact with one an- protein structure modeling. However, coevolving residues are other. The direct couplings are isolated by expressing the energy not always in contact, and this is a potential source of error in of a sequence as the sum of one and two body interactions be- such modeling efforts. To investigate the sources of such errors tween residues and then finding values of the one and two body and, more generally, the origins of coevolution in protein interaction parameters that best account for the observed se- structures, we provide a global overview of the contributions to the “exceptions” to the general rule that coevolving residues quence data; residue pairs with significant two body interaction are close in protein three-dimensional structures. parameters are considered to be directly coupled. In large pro- — tein families many pairs of residues both close and distant in Author contributions: S.O., H.K., and D.B. designed research; I.A. and S.O. performed space—are found to covary because of chaining effects; directly research; I.A., S.O., and D.B. analyzed data; and I.A. and D.B. wrote the paper. coupled residues, however, are usually close in the 3D structure. The authors declare no conflict of interest. There has been considerable recent success in using direct cou- This article is a PNAS Direct Submission. – plings to infer residue residue contacts for protein structure Freely available online through the PNAS open access option. – – – prediction (6 9), protein protein complex prediction (10 12), Data deposition: MSAs and GREMLIN contact maps are available at gremlin.bakerlab.org/ identification of specific interaction partners among two sets of exceptions. paralogous proteins (13), resolution of ambiguities in protein 1To whom correspondence should be addressed. Email: [email protected]. NMR spectra (14), identification of structurally ordered re- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. gions within intrinsically disordered proteins (15), modeling of 1073/pnas.1702664114/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1702664114 PNAS Early Edition | 1of6 Downloaded by guest on September 28, 2021 ABC Fig. 1. The frequency with which coevolving directly coupled residues are in contact depends on structure quality and MSA size. (A) The contact frequency of directly coupled residue pairs depends on number of sequences in family. Fraction of top 0.5 × (protein length) coevolving directly coupled residue–residue pairs identified by GREMLIN that make contacts in small (protein length L ≤ 150; blue box-and-whiskers), medium (150 < L ≤ 400; red), and large (L > 400; 3 green) protein 3D structures. The contact prediction regime analyzed in this paper (with Meff > 10 ) is highlighted in gray. Two residues in the protein 3D structure were considered to be in contact if any pair of heavy atoms are within 5 Å distance. (B) The contact frequency of directly coupled residue pairs increases with increasing structure accuracy. The correlation between GREMLIN prediction accuracy and X-ray crystallographic resolution is shown in scatter (Lower) and box-and-whiskers (Upper) plots (boxes and whiskers comprise 25%, 75% and 2.5%, 97.5% percentiles, respectively; the median is shown by a solid horizontal line). (C) Comparison of GREMLIN contact prediction accuracy in X-ray and solution NMR structures for 222 proteins with structures de- termined using both methods; contact prediction accuracy is consistently higher for the X-ray structures. The outliers marked on B and C by PDB codes are all repeat proteins. number of sequences reaches 103, the contact prediction accuracy hence, GREMLIN incorrectly predicts contacts between seg- saturates and remains almost constant regardless of the size of the ments (Fig. S2). Seventeen percent of the strongly coevolving protein (with smaller numbers of sequences, prediction accuracy is residue pairs identified by GREMLIN with dmin > 15 Å are in smaller for larger proteins) (Fig. 1A). Hence, in the following anal- repeat proteins; the average contact prediction accuracy for the ysis, we focus on a subset of 3,883 proteins with at least 103 99 repeat proteins in the 3883 proteins is 0.62 compared with effective sequences in the MSAs. 0.75 overall. Coevolution-based contact prediction for repeat This paper is focused on residue pairs that are strongly cou- proteins clearly should be carried out using techniques that ex- pled but are not close in the 3D structure—we refer to such pairs plicitly address the coupling arising from the internal

Origins of Coevolution Between Residues Distant in Protein 3D Structures

Uniprot at EMBL-EBI's Role in CTTV

Webnetcoffee

The Biogrid Interaction Database

The Uniprot Knowledgebase BLAST

Functionfinderspresnotes.Pdf

Protein Function Prediction in Uniprot with the Comparison of Structural Domain Arrangements

Uniprot.Ws: R Interface to Uniprot Web Services

Three-Dimensional Structures of Carbohydrates and Where to Find Them

The Role of Uniprot's Protein Sequence Databases in Biomedical Research

Biogrid Australia (Formerly Biogrid)

Development and Analysis of Web Resources Using Glycoinformatics

Introduction to Bioinformatics