
Patterns of coevolving amino acids unveil structural and dynamical domains

Daniele Granata^a,1,2, Luca Ponzoni^b,1,2, Cristian Micheletti^b,2, and Vincenzo Carnevale^a,2

^aInstitute for Computational Molecular Science, College of Science and Technology, Temple University, Philadelphia, PA 19122; and ^bMolecular and Statistical Biophysics, Scuola Internazionale Superiore di Studi Avanzati (SISSA), 34136 Trieste, Italy

Edited by Richard W. Aldrich, The University of Texas at Austin, Austin, TX, and approved November 7, 2017 (received for review July 6, 2017)

Patterns of interacting amino acids are so preserved within protein families that the sole analysis of evolutionary comutations can identify pairs of contacting residues. It is also known that evolution conserves functional dynamics, i.e., the concerted motion or displacement of large regions or domains. Is it, therefore, possible to use a pure sequence-based analysis to identify these dynamical domains? To address this question, we introduce here a general coevolutionary coupling analysis strategy and apply it to a curated sequence database of hundreds of protein families. For most families, the sequence-based method partitions amino acids into a few clusters. When viewed in the context of the native structure, these clusters have the signature characteristics of viable protein domains: They are spatially separated but individually compact. They have a direct functional bearing too, as shown for various reference cases. We conclude that even large-scale structural and functionally related properties can be recovered from inference methods applied to evolutionary-related sequences. The method introduced here is available as a software package and web server (spectrus.sissa.it/spectrus-evo webserver).

coevolution | protein domains | spectral clustering | structural dynamics | allosteric networks

Significance

Patterns of pairwise correlations in sequence alignments can be used to reconstruct the network of residue-residue contacts and thus the three-dimensional structure of proteins. Less explored, and yet extremely intriguing, is the functional relevance of such coevolving networks: Do they encode for the collective motions occurring in proteins at thermal equilibrium? Here, by combining coevolutionary coupling analysis with a state-of-the-art dimensionality reduction approach, we show that the network of pairwise evolutionary couplings can be analyzed to reveal communities of amino acids, which we term "evolutionary domains," that are in striking agreement with the quasi-rigid protein domains obtained from elastic network models and molecular dynamics simulations.

Author contributions: D.G., L.P., C.M., and V.C. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Published under the PNAS license.
1 D.G. and L.P. contributed equally to this work.
2 To whom correspondence may be addressed. Email: [email protected], [email protected], [email protected], or [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1712021114/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1712021114 | PNAS | Published online November 28, 2017 | E10612–E10621

A powerful paradigm in molecular biology is the flow of information from the chemical composition of proteins to their biological function, which is typically viewed as a chain of implications: The protein sequence encodes for the structure, which, in turn, assists function (1, 2). Much attention has been, and still is, paid to the two key steps in this logical ladder, namely the sequence–structure and structure–function relationships. These, however, have been mostly considered separately, and addressed with distinct conceptual frameworks and tools.

For instance, a nowadays well-established mediator of structure and function for globular proteins is their internal dynamics. Single-molecule experiments, in fact, have provided vivid and quantitative descriptions of the dynamical basis of protein function (3–7). Computational and theoretical studies, from atomistic molecular dynamics (MD) simulations (8–10) to coarse-grained elastic networks (11–13), have also provided a detailed understanding of the strong ties between proteins' structural architecture and internal dynamics. In particular, the secondary and higher-order structural organization of several proteins and enzymes are well suited to sustain collective conformational changes needed for function (3, 4, 14). These large-scale changes can therefore be efficiently excited by thermal fluctuations or triggered by the binding of ligands and effectors (15–31).

Efforts to clarify the sequence–structure relationship have also followed different routes: from the development of increasingly accurate force fields to be used in unbiased folding simulations (8, 9) to higher-level approaches where structural features are inferred from the sole physicochemical or statistical profiling of the primary sequence (32–36). Recent methodological breakthroughs for the latter contexts involve the application of statistical inference techniques to the analysis of multiple sequence alignments (MSAs). Correlated substitutions can help identify those sites that host coevolving mutations, and these, in turn, are an indicator of spatial proximity (37–46). The natural question posed by these parallel advances in the sequence–structure and structure–function relationships is whether or not it is at all feasible to establish a more direct connection between them (29, 47). In particular, one may ask if, going beyond the simple impact of function on sequence conservation, covariation between pairs of amino acids can be directly related to functional properties without relying on the prior knowledge of the structure. A positive answer to this question would have important practical implications. For instance, it could be used in contexts where structural information is covered more sparsely than at the sequence level. It could also clarify how local mutations in a given protein family are related to conserved global functional features, which is beyond the reach of structure-based approaches.

To our knowledge, this overarching question has been previously addressed only in specific, though important, contexts (29, 48–58), and hence a more comprehensive, general approach would be particularly valuable. Here, motivated by these studies and especially by the protein sector analysis of ref. 49, we carry out a systematic characterization of the sequence→function relationship without harnessing the wealth of dynamical properties encoded in protein structures. We shall, in fact, only rely on sequence-based coevolutionary data and use it to infer dynamical/functional domains whose organization has been conserved by evolution. We term such fundamental units "evolutionary domains" (EDs).

The presentation of the strategy is articulated in the following way. First, we introduce and apply the method to a prototypical system, namely adenylate kinase, and show the consistency of the EDs with known structural and functional units for the enzyme. Next, for a database-wide survey, we apply the method to the set of about 800 previously annotated MSAs of ref. 42. The results show that the EDs, inferred from the sole analysis of sequence comutations, are compact in space and consistent with well established dynamical, or quasi-rigid, domains from structural fluctuations. They also demonstrate a strong [...] with respect to the number of sequences contained in the MSA and to the methods used in the inference of the coevolutionary coupling, and are in good agreement with the protein sector analysis (49), as discussed in detail in Results and Discussion and SI Appendix. Finally, for a more direct functional interpretation of the EDs, we consider various representatives of the ion channels superfamily (59). For such members, that have functionally diverged while retaining the same overall structural architecture, we show that our sequence-based partitioning highlights a diverse spatial organization of the functional domains, allowing for comparative analysis.

Results and Discussion

Our strategy is based on the analysis of comutating pairs to identify the groups of amino acids that have putatively evolved in a concerted manner. Revealing such groups of coevolving residues is important because they would expectedly reflect the action of structural and functional constraints operating at a larger scale than the pairwise one, which can be elegantly and effectively probed by direct coupling analysis methods (39, 42, 60). Considering this expectedly extended organization of clusters of coevolving amino acids, we shall refer to them as EDs. A detailed presentation of the ED search method is presented in Materials and Methods. For completeness, and to make this section self-contained, we provide a brief summary of the method before discussing its applications.

Searching for EDs: Methodological Overview. As summarized in Fig. 1A, the input of the ED search strategy for a given protein is the matrix of the statistical couplings between any pair of amino acid positions within its relative MSA encoding for the protein family. Our method of choice for the coupling analysis is the plmDCA approach described in ref. 60, but similar results can be obtained by other approaches, such as gplmDCA and plmDCA20 (42) (Materials and Methods and SI Appendix).

The evolutionary relatedness encoded by the statistical coupling is used to assign a similarity score (or evolutionary proximity) between pairs of amino acids. The subdivision in multiple EDs is then the result of a spectral clustering procedure (62), which returns an optimal set of densely connected groups. To ensure a robust domain subdivision, the similarity matrix is regularized into a k-nearest neighbors graph by retaining the top k = 7 strongest evolutionary couplings of each amino acid, which have been found to maximize the clustering properties of the coupling network (as will be discussed in Results and Discussion, Dataset-Wide Survey). The k-nearest neighbors strategy has been chosen for its simplicity, but we note that the final outcomes are not significantly affected by the particular sparsification strategy used to regularize the similarity matrix (Materials and Methods and SI Appendix).

The number of evolutionary-based subdivisions is not specified a priori but is set by analyzing the profile of a quality score as a function of the number of domains, indicating the best protein decomposition at both large and small scales, thus providing a consistent description of the protein for the low and high numbers of clusters, respectively. The clustering strategy is similar to that used by the SPECTRUS (spectral-based rigid units subdivision) algorithm (61) to determine dynamical, quasi-rigid domains in proteins or protein complexes.

A Test Case: Adenylate Kinase. To illustrate and validate the sequence-based ED decomposition, we first apply it to a standard benchmark for domain partitioning methods, Escherichia coli adenylate kinase [Protein Data Bank (PDB) ID 4AKE]. The results are given in Fig. 1B. The red curve in the upper left graph shows the quality score, for subdivisions of the enzyme into an increasing number of EDs.

Fig. 1. (A) Schematic illustration of the steps performed for identifying the groups of coevolving residues (EDs) from a protein MSA. (B) Application to adenylate kinase. The evolutionary partitioning is compared with the subdivisions into quasi-rigid DDs, obtained from the analysis of an MD simulation with the SPECTRUS webserver (61). The local maxima of the quality score guide the choice of the main sequence- and dynamics-based subdivisions, shown in the two color-coded representations, both on the protein structure and on its sequence.
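The sparsification step summarized in the methodological overview can be illustrated with a minimal Python sketch. It only shows how a coupling matrix may be reduced to a k-nearest neighbors graph; the function name `knn_graph`, the toy coupling matrix, and the union rule used to symmetrize the graph are assumptions made for this example and are not taken from the published software. The spectral clustering step itself is not reproduced here.

```python
def knn_graph(coupling, k=7):
    """Return the undirected edges of a k-nearest-neighbours graph.

    coupling: square list-of-lists of coupling strengths (larger = stronger).
    An edge (i, j) is kept if j is among the k strongest partners of i,
    or vice versa (union rule; an assumption of this sketch).
    """
    n = len(coupling)
    edges = set()
    for i in range(n):
        # Rank all other positions by their coupling with i, strongest first.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: coupling[i][j], reverse=True)
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Hypothetical toy input: six positions forming two tightly coupled triplets.
groups = [0, 0, 0, 1, 1, 1]
toy = [[0.9 if groups[i] == groups[j] else 0.1 for j in range(6)]
       for i in range(6)]
edges = knn_graph(toy, k=2)
print(sorted(edges))
```

In this toy case each position retains only its two same-group partners, so the graph splits into two disconnected triangles, which is the kind of densely connected structure a clustering procedure would then isolate.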

PNAS PLUS | BIOPHYSICS AND COMPUTATIONAL BIOLOGY | EVOLUTION

The quality score reflects how sharply defined, according to the clustering metrics, is the returned optimal subdivision compared with random partitions. The highest scores for the sequence-based partitioning are found for Q = 3, 6, and 9 EDs. The structural and sequence-wise representations of the partitions into Q = 3 and 6 domains are given in Fig. 1B. Note that EDs, which can span several intercalated stretches of the primary sequence, are nonetheless structurally compact. This is a noteworthy and intriguing result, since the evolutionary subdivisions are exclusively sequence-based, with no input about the actual structure of the protein.

As a matter of fact, the returned subdivisions are viable from the structural and functional point of view. This emerges from their comparison with quasi-rigid, dynamical domains (DDs). These were identified with the SPECTRUS web server (61) using as input the structural fluctuations observed in extensive MD simulations of adenylate kinase. As shown in Fig. 1B, the Q = 3 and Q = 6 evolutionary subdivisions are well consistent, both structurally and sequence-wise, with the high-scoring quasi-rigid partitionings into similar numbers of domains. In particular, for both cases, the Q = 3 subdivision corresponds to the well-known partitioning into three main functional domains, namely the ATP-, the AMP-binding site, and the core, shown respectively in red, gray, and blue. In addition, even the finest partitionings (Q = 9 and Q = 10; see SI Appendix, Fig. S1) provide consistent decompositions in the two cases and highlight structural elements that are arguably crucial for the protein functional dynamics.

The result is noteworthy because, although sequences code for both structural and functional properties, it would have been difficult to anticipate that the latter could be obtained directly from the primary sequence without additionally using a 3D conformation. In addition, although DCA is a very powerful means of extracting reliable indications of proteins' folds, we are not aware of documented instances where DCA-derived structural information was used to infer functional movements. These considerations reinforce the significance of showing that functional and structural domains can be directly and confidently extracted from the coupling analysis of MSAs (see SI Appendix, Fig. S2).

Dataset-Wide Survey. For a systematic characterization of the EDs, we then extended the analysis to a dataset of 813 MSAs compiled by Feinauer et al. (42). This was chosen for two main reasons. First, it gives a comprehensive coverage of several protein families, with various MSA sizes (from 16 to 65,000 entries) and protein lengths (from 30 to 500 amino acids). Second, a PDB entry is available for a representative protein of each family/MSA. This is a key element in this study because it allows us to assess the spatial compactness of ED decompositions, which are exclusively based on the sequences defining each MSA, and to compare them to the DDs subdivisions.

Evolutionary couplings: Clustering propensity and community structure. As a preliminary step toward identifying the EDs, we first investigated if the input networks of statistical couplings J_ij, obtained from coevolutionary analysis, exhibit an intrinsic propensity to be densely organized and, thus, to be clustered. As detailed in Materials and Methods, such propensity is conveniently captured by ∆C = C − C_rand, that is, the difference of the clustering coefficients of the k-nearest neighbors graph, C, and of a randomized, reshuffled version, C_rand (63), measuring the probability that two neighbors of a vertex are also connected between themselves. As shown in Fig. 2A, this quantity also proves useful in choosing the optimal k, since the different graphs usually show a maximum for the clustering coefficient ∆C at k = 7, especially for MSAs containing a large number of sequences (see also SI Appendix, Fig. S3 for the other inference methods). Importantly, the MSA size (calculated as effective number of sequences, N_seq^eff, i.e., the number of sequences in the set whose mutual identity is smaller than 90%) crucially affects the clustering propensity of the similarity graph, as clarified by the strong correlation between these two quantities shown in Fig. 2B (and SI Appendix, Fig. S4). Thus, when a large dataset is available and the reconstruction of the network of couplings is most reliable, the latter shows a high tendency to cluster and an unambiguous number of "relevant" neighbors (k = 7), which is indicative of an inherent collective organization of the coevolution patterns. Strikingly, this number coincides with the average structural neighbors surrounding each residue in protein structures (6.75 ± 0.04, calculated on the PDB structures of this dataset using a Cβ–Cβ distance threshold of 8.5 Å as in ref. 42). We also note that, for k = 7, the percentage of true contacts (including along the sequence) is systematically larger than 50%, especially for larger MSA size (SI Appendix, Fig. S5 and Materials and Methods).

Compactness of evolutionary domains. We used the networks of evolutionary couplings, derived from each of the 813 MSAs, as input for the clustering algorithm. For an initial unsupervised overview of the EDs organization, we identified the subdivisions from Q = 2 to Q = 10 domains for each protein family. Next, we studied whether the sequence-based subdivisions corresponded to spatially compact domains once mapped on the available PDB structures of the MSAs' representatives. The results are given in Fig. 3.

Fig. 3A presents the probability distribution of the compactness parameter, Ω, which measures the fraction of amino acids that are no further than 10 Å from most residues in their same domain (Materials and Methods). For clarity, the results are presented as aggregated over the considered values of Q; more detailed, nonaggregated representations, including those for the other inference methods, are given in SI Appendix, Figs. S7–S9. The Ω distribution of genuine ED partitions in Fig. 3A is strongly skewed toward the Ω = 1 limit. In fact, the median value is 0.98, indicating that, over all considered MSAs and partitioning levels Q, very few amino acids are isolated, or at a distance larger than 10 Å from the other members of their domains. By contrast, the compactness Ω computed for random partitioning of the same entries, and into the same range of Q, follows a very different distribution that is so shifted toward lower Ω values (the mean is about 0.57) that it has negligible overlap with the ED one. The scatter plot in Fig. 3B additionally reveals a strong correlation between the number of sequences in the MSAs and the observed compactness of the inferred EDs, similarly to what is observed for the clustering coefficient. In fact, one notes that values in the left tail of the Ω distribution are typically found for MSAs featuring the smallest numbers of entries, 300 sequences or less.

Fig. 2. (A) Histograms of the maximum adjusted clustering coefficient ∆C for plmDCA method, obtained by progressively excluding from the dataset the MSAs containing a low number of sequences. (B) Scatter plots of ∆C as a function of the corresponding MSA size (N_seq^eff, effective number of sequences with sequence identity less than 90%).

We interpret this result as an indirect indication that, when less than ∼300 sequences are used to infer the couplings, the network is less reliably reconstructed and consequently the ED subdivisions are less compact, although their compactness can still be significantly high compared with the random case. Analogous conclusions, but with even higher values of compactness, can be drawn when repeating the analyses in Fig. 3 A and B for the optimal decompositions in Q_opt domains, picked according to the quality score of the ED decomposition, demonstrating its relevance in the method (see SI Appendix, Fig. S10).

To illustrate the concepts discussed above within the context of selected protein structures, we show a few notable examples of ED subdivisions in Fig. 3 C–E. The entry in Fig. 3C corresponds to an MSA with a large pool of sequences (14,080) and an average compactness ⟨Ω⟩_Q = 0.96. The structure shown in the figure is the representative PDB entry 1NE2, and the subdivision corresponds to the optimal partitioning (Q = 6). Its high degree of compactness, Ω = 0.99, is readily perceived by inspecting the subdivisions that, with the sole exception of a terminal residue, are visibly compact in space.

The other two examples in Fig. 3 D and E, by contrast, pertain to proteins whose average ED compactness is about 0.62 ± 0.02, i.e., on the low side of the distribution. The first instance is represented by the PDB entry 3L28, Ebola viral protein 35, which has the least numerous MSA in the dataset (nine sequences only). This entry presents a noticeable fragmentation of each of the Q = 7 domains, and, indeed, its compactness value is not too dissimilar from the case of random partitions.

The second instance is a most interesting outlier, because it corresponds to one of the most numerous MSAs, encoding the Cys2His2 (C2H2) zinc finger protein, transcription factor IIIA (TFIIIA), specifically involved in nucleic acid recognition and regulation (64). The two structures in Fig. 3E represent the fingers 4 to 6, both in the free state (65) (PDB ID 2J7J), and bound to 5S rRNA 55mer (66) (PDB ID 2HGH). TFIIIA is particularly noteworthy because it contains nine C2H2 domains. As discussed by Espada et al. (67), in such cases, DCA signals can reflect correlations due to the common origin of the domains, as well as correlations due to genuine structural and functional couplings.

The optimal partitioning of TFIIIA, i.e., the one with the highest quality score, consists of Q = 3 EDs. When the corresponding tripartite subdivision is superposed to the apo structure of the zinc finger, it yields spatially fragmented domains. However, differently from the previous instance of Fig. 3D, the residues in each domain are not scattered but rather arranged in coherent structural patterns. In particular, the partitioning of a single zinc finger is consistently repeated across all the three motifs. Indeed, when the same subdivision is superposed to the holo (RNA bound) form, the domains acquire a spatial organization that is functionally meaningful. Specifically, the organization outlines (i) the red domain in the binding site formed by two cysteines on the β-hairpin and two histidines on the helix (highlighted in yellow and cyan, respectively, in the apo form) that coordinate the zinc ions crucial to stabilize the fold (66); (ii) the white domain, which sustains and locks the hairpin onto the helix (note the white facing residues, consistently present in all three helices); and, finally, (iii) the blue remaining part of the helix (referred to as the "recognition helix"), which contains the residues forming sequence-specific contacts with the groove of the nucleic acid. Therefore, the seemingly fragmented nature of this outlier can be recapitulated in more coherent functional ways in the holo context. This suggests that, even in challenging cases where DCA reflects the presence of repeated domains, ED analysis can still extract meaningful large-scale functional relationships.

Fig. 3. (A) Distribution of the average structural compactness Ω, computed for each single MSA and averaged over the subdivisions into Q = 2, ..., 10 domains, over the MSA dataset (in red), compared with the one computed for random partitionings of the same protein sequences (in cyan). (B) Scatter plot of the EDs structural compactness ⟨Ω⟩_Q vs. the relative MSA size. The dashed line represents the average compactness for the set of random partitionings. (C–E) Structural representations of three notable examples of ED decompositions, marked by blue squares in B. (E) Views of the transcription factor IIIA, in the apo form and in complex with a 5S rRNA 55mer.

Comparison with dynamical domains. Motivated by these observations, we undertook a systematic comparison of the EDs and the quasi-rigid (or dynamical) domains (DDs) for each of the 813 MSAs. The DDs were obtained from the SPECTRUS decomposition tool (61), based on an elastic network model (ENM) analysis (68, 69) of the PDB structures of the MSA's reference entries, as detailed in Materials and Methods. The structure- and dynamics-based character of the DD analysis is an apt complement of the sequence-based one of EDs. This duality makes the comparison particularly interesting and relevant for framing the sequence→structure→function relationship. The overlap of the two types of domain subdivisions was measured in terms of the adjusted mutual information (AMI), which allows for a straightforward assessment of the statistical significance of the subdivisions overlap, as described in SI Appendix, Supplementary Methods.

To better illustrate the correspondence of the EDs and DDs and to give an immediate meaning to the AMI value, we discuss here two examples. Fig. 4A shows the results for SbmC protein (PDB ID 1JYH, N_seq^eff = 3,707) subdivided into Q = 4 domains. This level of subdivision was considered because it provides the best quality score for dynamical domains. The consistency of the ED and DD subdivisions is very clearly conveyed by the structural and sequence-wise representations, which overlap almost perfectly. This consistency extends to both coarse and finer subdivisions, as highlighted by the AMI profile, which is particularly high (>0.8) for Q = 2 and Q = 4, and remains larger than 0.5 in all other cases as well. Likewise, for the example in Fig. 4B [ATP-binding cassette (ABC) transporter, PDB ID 2ONK, N_seq^eff = 17,503], a consistent overlap between EDs and DDs is observed at various levels of subdivision. In particular, we note that even the lowest AMI value of 0.5, attained for Q = 4, still corresponds to a clear and satisfactory consistency of the two types of subdivisions.

To extend considerations to the entire dataset, we computed for each MSA the average and the largest AMI between EDs and DDs, for Q in the range [2, 10]. The results are presented as a function of the number of MSA sequences in the scatter plots of Fig. 4C. Interestingly, we observe again a strong dependence on N_seq^eff: For MSAs with 500 sequences or more, the average values for AMI_max and ⟨AMI⟩_Q are 0.62 and 0.47, compared with the corresponding values of 0.49 and 0.35, respectively, when N_seq^eff < 500. Clearly, when N_seq^eff tends to 0, the AMI vanishes, again consistent with a random partitioning of a sequence. Slightly higher values of AMI are observed, on average, when comparing domains at the optimal ED number Q_opt, as determined by the individual quality scores. However, such comparison is more delicate, because the respective Q_opt values for ED and DD decompositions do not generally coincide, making it advisable to consider the more stable average ⟨AMI⟩_Q. For more details, see discussion in SI Appendix, Fig. S11.

The good overlap between EDs and DDs at all levels of subdivision suggests that our clustering approach captures all of the relevant topological features from the network of statistical couplings. It thus constitutes a powerful tool for inferring meaningful structural and functional relationships, as discussed in Case Study: Comparative Analysis Across the 6TM Family of Ion Channels.

Case Study: Comparative Analysis Across the 6TM Family of Ion Channels. To further assess the capability of ED decompositions to outline important functional properties of a protein family, we conclude by applying the ED analysis in a comparative scenario to a specific class of ion channels, the six-transmembrane-helices (6TM) superfamily, for which the sequence–function relationship has been actively investigated in a number of seminal studies (70). This superfamily is characterized by a strictly conserved tetrameric architecture. The latter is shown in Fig. 5A where different colors are used to highlight the main functional domains, including the four-helix bundle voltage sensor domain (VSD) and the pore of the ion conduction pathway, which involves two transmembrane helices and the linking reentrant pore, containing the selectivity filter. This single structural template inherited from an ancestor gene has enabled, through differentiation, an explosion of functional variability. Channels in the 6TM class, in fact, are involved, for instance, in reporting noxious environmental conditions, in shaping the neuronal action potential, and in syncing the beating of the heart (59). Since all these channels share the same architecture, different decompositions in EDs in different phylogenetic groups likely reflect distinct functional rather than structural aspects (51, 52).

For definiteness, we focus on three different 6TM families: the voltage-gated potassium-selective channel [Kv, PDB ID 2R9R (71)], the bacterial voltage-gated sodium-selective channel [BacNav, PDB ID 4EKW (72)], and transient receptor potential [TRP, PDB ID 3J5P (73)] channels. We analyzed the MSAs for the three families based on a pool of ∼800 sequences, each with 200 positions (74) from which we omitted the highly gapped regions of the alignments (typically occurring in loops between the six transmembrane helices).


Fig. 4. (A and B) ED and DD decompositions of an SbmC protein (PDB ID 1JYH:A) and an ABC transporter permease protein (PDB ID 2ONK:C). (C) Scatter plots of the maximum (Upper) and average (Lower) AMI, over the domain number Q, between the ED and DD decompositions, as a function of the effective MSA size.
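For readers who wish to reproduce this kind of comparison, the following self-contained Python sketch computes an adjusted mutual information between two residue-to-domain assignments. It is a generic illustration and not the authors' implementation: it uses the hypergeometric expectation of the mutual information and normalizes by the arithmetic mean of the two partition entropies, which is one common convention and may differ from the one detailed in SI Appendix. The function names and toy labels are hypothetical, and each partition is assumed to contain at least two domains.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its cluster sizes."""
    return -sum(c / n * math.log(c / n) for c in counts)


def ami(u, v):
    """Adjusted mutual information between two labelings of the same items.

    Assumes both partitions have at least two clusters; otherwise the
    denominator below can vanish.
    """
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    # Observed mutual information from the contingency table.
    mi = sum(nij / n * math.log(n * nij / (a[i] * b[j]))
             for (i, j), nij in joint.items())
    # Expected mutual information under random relabeling (hypergeometric).
    emi = 0.0
    for ai in a.values():
        for bj in b.values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                p = (math.comb(ai, nij) * math.comb(n - ai, bj - nij)
                     / math.comb(n, bj))
                emi += p * nij / n * math.log(n * nij / (ai * bj))
    norm = 0.5 * (entropy(a.values(), n) + entropy(b.values(), n))
    return (mi - emi) / (norm - emi)


# Hypothetical toy labels: identical partitions up to a renaming of domains,
# and two partitions that share no information.
same = ami([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
unrelated = ami([0, 0, 1, 1], [0, 1, 0, 1])
print(same, unrelated)
```

The adjustment makes the score insensitive to domain relabeling and pushes chance-level agreement to zero or below, which is what allows AMI values such as those in Fig. 4C to be compared across different numbers of domains.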

Although the 6TM dataset that we used is the most comprehensive available at the moment, its size is clearly limited by comparison with the much better populated cases discussed previously, showing a pretty low ∆C (see SI Appendix, Fig. S12). To ensure a robust analysis, we decided to decompose the graph corresponding to the maximum ∆C for each MSA.

In Fig. 5 B–E, we present various subdivisions for the Kv family for increasing numbers of domains (see SI Appendix, Fig. S13 for quality score). The subdivision for Q = 2 is already unexpectedly informative, since the fourth helix of the VSD (called S4) and its facing residues are associated with the pore domain rather than the rest of the VSD. This is an intriguing result because the aforementioned classic subdivision into structural domains would have kept these functional elements apart. From the functional point of view, however, the sequence-based subdivision (Q = 2) agrees with the strong mechanical coupling between the pore region and S4 (75–77). We recall that this helix contains the positive residues (yellow spheres in Fig. 5B) that sense the variations of transmembrane potential and determine the movement of this helix across the membrane; this movement is, in turn, transmitted to the pore domain for gating. The division into Q = 4 EDs, in Fig. 5C, picks up further functional features. One domain largely corresponds to the selectivity region, formed by all of the residues lining the narrow and highly conductive ion pathway (in yellow), another is associated to the gating region [...] EDs is meaningful. In fact, it [...]. When viewed in the context of the channel tetramer, it appears natural to speculate that this region is instrumental for the signal propagation between the loops of the voltage-sensing domains and the pore domains, which can indeed be modulated by external stimuli, like ligand binding. We accordingly surmise that amino acids in this region are genuinely evolutionary-related for functional reasons.

[...] from the comparisons of Kv, BacNav, and TRP tetrameric subdivisions, which are given in Fig. 6 and are further detailed in Fig. S13. The comparison between Kv and BacNav (another voltage-gated family, selective for sodium) reflects how the functional constraints shaped evolution in these two families in an almost superimposable way. In fact, the S4 helix segregates along with the lower part of the pore, and, together, they form the "gating domain" (in red). Similarly, the reentrant pore helix is split into the upper and lower faces, sustaining the selectivity domain (in yellow), and the rest of the VSD is grouped into internal and external residues. The organization of EDs for TRP channels is, instead, totally different. Indeed, this channel family, identified only in Eukaryota, has distinct characteristics with respect to the other ones. Specifically, it is a nonselective cation channel gated by a variety of stimuli, such as temperature, pH, and ligands binding (78–81). In particular, these channels have been shown to possess two different gating regions (73, 82), which are, indeed, well captured by the ED decomposition. The division of S4 is compelling in this respect, since it is consistent with the lack of the dynamical role that, instead, characterizes it in voltage-gated ion channels: Only the C-terminal residues are associated with the gating domain (in red). The upper part of S4 is, instead, longitudinally sectioned, with the internal residues grouped with all of the rest of the VSD. The external part of the upper S4 belongs to the extended yellow domain: The latter effectively represents a second upper gating domain, as suggested in refs. 73 and 82. Remarkably, the yellow cavity determined by the two pore helices and by the external part of S4 corresponds exactly to the location of the vanilloid pocket (82–86), which represents the main intracellular binding site for the activators of these channels. [...] domains that, owing to their specific functional character, are distinct from subdivisions made with static structural criteria.

Fig. 5. EDs for Kv channels. (A) Schematic representation of biological tetrameric assembly of 6TM channels, with each color representing a single monomeric subunit (top and lateral view). For the blue subunit, the VSD is highlighted in cyan. (B) Representation of the most significant monomeric subdivision, Q = 2, shown in the context of the full tetramer; see SI Appendix, Fig. S13 for the quality score. Positively charged residues responsible for voltage sensing are shown as yellow spheres. (C–E) Finer subdivisions into four and six monomeric domains.

Comparison with Protein Sectors Analysis. Analysis of coevolving residues from patterns of correlated mutations is a long-standing issue (48) that has been tackled from various perspectives. Among the best-known and most elegant approaches are the protein sectors analysis (49) and CoeViz (87), which provides insight into the cooperative nature of residue coevolution. ED analysis is complementary to these techniques, mostly because of several methodological differences. For instance, protein sectors analysis returns a nonexhaustive coverage of the protein residues. In fact, it uses the top eigenvectors of a conservation-weighted covariance matrix built from an MSA, and typically only 20% of residues with the largest component on one eigenvector determines a sector,
in (highlighted channel blue the the in of with portion EDs, extracellular different the to volt- of assigned the exception mostly and are pore regions the features. subdivision, sensor this age protein In system. various the of the elements (Q about provide subdivisions descrip- example can Finer multilevel and Kv EDs the filter This that regarding domain. selectivity tion instructive pore gating particularly the the the also differ- of sustaining contacting is a faces one one that two lower upper the notable the the for is internal with found It the is helix, VSD. assignment respectively, the domain comprise, of ent two residues external other and the and (red), h niainfo h T aiyi htEscnsnl out single can EDs that is family 6TM the from indication The emerge EDs of role functional the regarding elements Further D PNAS and 6 = | ,wihsilbigsbtentetwo. the between bridges still which E), ulse nieNvme 8 2017 28, November online Published otyrtr h ai structural basic the return mostly ) dniyn rusof groups Identifying IAppendix, SI | E10617

EVOLUTION BIOPHYSICS AND PNAS PLUS COMPUTATIONAL BIOLOGY Fig. 6. Comparative analysis of the EDs for Kv, BacNav, and TRP channels, corresponding to subdivision Q = 4 (see quality scores in SI Appendix, Fig. S13). While Kv and BacNav show similar organization, coherent with their analogous functional requirements, TRP is characterized by a different domain pattern, consistent with its ligand-gated properties and loss of voltage-gated ones, specific to the other two channels.

i.e., a group of residues evolving concertedly. By construction, the method prioritizes the most conserved residues (88). Importantly, this nonexhaustive assignment is nonexclusive too, meaning that one residue can be part of distinct sectors. By contrast, the ED decomposition uses the entire DCA-based similarity to ensure a residue assignment that is both exhaustive and exclusive. The latter feature, in particular, is instrumental to the specific goal pursued here of comparing EDs with DDs.

DCA and statistical coupling analysis share nevertheless important conceptual similarities (89, 90), and, therefore, similarities between EDs and sectors can be expected in specific contexts. We therefore compared the two types of subdivisions for several case studies. We first considered the two datasets of ref. 49, which consist of the PDZ domain and the S1A serine protease families. The former dataset has 240 sequences and features one sector. The quality score profile of the ED analysis in Fig. 7A has an overall decreasing trend, which is typical of datasets of this size, indicating meaningful division for Q = 2, 3. The first subdivision features a domain that totally includes the aforementioned sector (red spheres). In the finer ED subdivisions, the protein sector is resolved into smaller and spatially coherent EDs (red and gray domains in the sequence diagram), allowing a further comparison with DDs for Q = 3: the highlighted residues (and corresponding EDs) overlap with two distinct dynamical partitions of the protein.

The second dataset, with a larger number of sequences (1,388), yields three sectors. The EDs quality score profile in Fig. 7B indicates that significant subdivisions are found for Q = 2, 3, 8 domains. Two of the three sectors (red and orange in the diagram) have a good correspondence with the EDs. They are compact and both contained in the red domain for Q = 2, 3, and then perfectly separated for Q = 8. The other sector (in gray in Fig. 7B) instead comprises scattered residues. This is consistent with previous studies that showed that this sector is more related to thermal stability than structural properties (49). Interestingly, when S1A sectors and EDs differ from DDs (again for Q = 3), they are still consistent with each other. In fact, one sees in Fig. 7 that the red ED includes the orange sector but both groups differ from the blue DD. Overall, the comparative analysis of these two families, whose MSAs contain homogeneous sets of sequences, shows that EDs and sectors have significant similarities.

Remarkable differences, however, are observed in case of larger and more heterogeneous sets of sequences. In SI Appendix, Figs. S14–S16, we illustrate three examples discussed previously, namely SbmC gyrase inhibitory protein, adenylate kinase, and ABC transporter, whose MSAs have been built by including the largest number of sequences (42). While, for SbmC (N_seq^eff = 3,714), some similarity is still noticeable between two sectors (cyan and orange in SI Appendix, Fig. S14) and the subdivisions in two DDs and EDs, for the other datasets (adenylate kinase and the ABC transporter), it is not possible to relate sectors to EDs or DDs: Protein sector analysis on these large datasets (more than 14,000 effective sequences) returns groups of residues distant in both primary and tertiary structure (see SI Appendix, Figs. S15 and S16). The fact that the differences between EDs and protein sectors are more pronounced for large datasets suggests that, when presented with highly heterogeneous sequence sets, these two algorithms highlight different aspects of residue–residue correlations. For instance, protein sectors analysis has been shown to effectively identify the groups of amino acids that experience the largest variations on passing from one phylogenetic group to another (91). On the other hand, DCA is seemingly less sensitive to the phylogenetic structure of the MSA analyzed (42). For this reason, we believe that the interpretation of EDs in terms of structural domains and DDs ought to be applicable in more general contexts, and particularly to large datasets.

Conclusions

Patterns of correlated mutations in an MSA can be used to reveal a set of pairwise statistical interactions that are often informative about the possible spatial proximity between the residues involved. Strikingly, we showed that this network of couplings has

Fig. 7. Comparison of ED decomposition and protein sector analysis (49) for (A) the PDZ domain (PDB ID 1BE9) and (B) the S1A serine protease family (PDB ID 3TGI), also with the corresponding division in DDs. The sectors are shown as spheres in the 3D representations, and EDs and DDs are shown as different colors also in the sequence diagram.
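The sector-versus-ED comparison of Fig. 7 boils down to asking how the residues of one (nonexhaustive) sector distribute over the domains of an exhaustive and exclusive partition. The sketch below is only an illustration of that bookkeeping: the function name, residue indices, and domain labels are hypothetical toy values, not data from the PDZ or S1A families.

```python
from collections import Counter


def sector_distribution(sector, domain_of):
    """Fraction of a sector's residues that fall into each domain.

    sector    : iterable of residue indices belonging to the sector
    domain_of : dict mapping every residue index to its domain label
    """
    sector = list(sector)
    counts = Counter(domain_of[residue] for residue in sector)
    return {domain: count / len(sector) for domain, count in counts.items()}


# Toy ten-residue chain: a coarse (Q = 2) and a finer (Q = 3) subdivision.
partition_q2 = {i: ("red" if i < 6 else "blue") for i in range(10)}
partition_q3 = {i: ("red" if i < 3 else "gray" if i < 6 else "blue")
                for i in range(10)}

# Hypothetical sector covering only four of the ten residues.
sector = [1, 2, 4, 5]

coarse = sector_distribution(sector, partition_q2)  # sector inside one domain
fine = sector_distribution(sector, partition_q3)    # sector split over two domains
```

In this toy case the sector is fully contained in a single domain at the coarse level and is resolved into two domains at the finer level, which mirrors the qualitative behavior described for the PDZ sector.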

E10618 | www.pnas.org/cgi/doi/10.1073/pnas.1712021114 Granata et al. Downloaded by guest on September 23, 2021 Downloaded by guest on September 23, 2021 rnt tal. quasi- et Granata clus- into a structures for protein data partitions input optimally as that fluctuations procedure distance tering interresidue uses method DDs. and EDs methods, DCA three reconstruc- using phylogenetic obtained same curated in were the on couplings described were relative based channels the 52, of ion and All the three 74 tions. with the refs. obtained for in was MSAs derived kinase while ones adenylate 42, for ref. MSA analysis of The coupling approach direct 93). sole (39, the dynam- MSAs the to from and the with inferred structural of (16 EDs associated the the assess structures sequences of to characteristics PDB posteriori involved ical a The used of 494). were to dataset sequences number query the (30 the in positions MSAs and both The 65,535) using queries. for input scheme, heterogeneous as detection entries are PDB a target (92), of HHblits sequences with obtained were ter Dataset. server web and package software a as at available is here introduced method Appendix SI Methods and Materials ion of case challenging the for here shown channels. as distinctive subgroup, features each functionally to the the with highlighting separately, of studied goal are particular, ultimate group, subfami- entire In homologous the than which available. rather in lies, are studies cases comparative sequences those enables of this beyond thousands methods which inference for cou- these scope the interresidue of facto single de applicability widens the of aspect are this Importantly, with than 42). robust (41, size plings more sample of For are the topology to couplings, the sequences. 
respect therefore coevolutionary of and of hundreds and EDs, network that robust few the believe observe a we we reason, for MSA, this even the results in sequences consistent effec- the of on number crucially tive depends couplings, statistical of with inference chimeras protein of design properties. novo archi- biological de novel same enable the sharing might proteins tecture pro- across existing engineer EDs structural to Transferring approach any teins: this interesting using more for of Even perspective step detail. the of is initial level atomistic crucial an with a modeling represent sequence- can the domains than sparser are or struc- where available data. contexts not based fea- in are ED valuable These data particularly of subdivi- tural resolutions.” be viability to two “spatial the ought the and indicating tures different small thus fact, at both domains, of decompositions for decompositions of matter results number domain consistent a large provide to less As approaches because route a ones. sion give noisier, noteworthy structural to hence is than expected and This be diverse direct, could a families. across approaches protein consistent sequence-based very of be structure-based to repertoire and found sequence- were recently The subdivisions a (61). by identified EDs approach domains comparing pos- introduced quasi-rigid subdivisions, the dynamical, these the explored of with We meaning groups. into biological struc- segregate compact sible to protein and tendency the localized innate of an spatially show context couplings the spectral these in the end, ture, namely, analyzed this framework, When To clustering clustering. residues. efficient mea- an between a used as proximity we couplings evolutionary statistical of the that interpreting sure communities, by EDs, these term characterized we We in fashion. evolve to of concerted appear a residues pairs of groups involving entire residues, mutations sequence. 
contacting the compensatory of beyond rest more the are Therefore, with that than residues themselves of among communities connected with structure peculiar a spectrus.sissa.it/spectrus-evo ial,dsietefc htteDAaayi,ue o the for used analysis, DCA the that fact the despite Finally, in organization hierarchical the detecting cases, these In eue h aae frf 2cnitn f83MA.Telat- The MSAs. 813 of consisting 42 ref. of dataset the used We IAppendix SI otissplmnayrsls ehd,adfiue.The figures. and methods, results, supplementary contains D eeietfidwt h PCRSagrtm(1.The (61). algorithm SPECTRUS the with identified were DDs . webserver . eiuswt h eks opig ntentok see omitting network; upon the variability in assignment couplings S23 weakest domain the the with tested residues also we case, ter (see (iv matrix and similarity DCA-based the to applied assign- k-medoids the see (i of instances ment; to independent over respect variability their with is, (that robust- procedure (see the assignment dataset evaluated ED MSA We the residues. with of consecutive associated ness to network corresponding the of entries connectivity the the ensure To Network. adopted bor we used choice, simplest be the the not As could EDs. symmetric DDs of the for context used sequence-based concerns step the criterion in change in (contact) matrix second structural con- similarity The the the the because couplings. increase of strong to sparsification squared and the then weak and between values additively trast negative first were remove couplings to the shifted Specifically, couplings. pairwise DCA the ranking) the (top First, significant ferences. top-most the out single to it subdivision(s). used and score, value ity each for times (iii 1,000 repeated was of procedure k-medoids stochastic epnigt ar faioaista eei otc ntereference the in contact in were that (C signifi- acids cor- structures most amino entries PDB only of the retained pairs only accordingly to we retaining 61, responding of by ref. 
number Following matrix couplings. increasing similarity algorithm, cant k-medoids the an a sparsifying with into performed after customarily acids is amino step partitioning subdivide tral to used domains, is 94) acids, 62, amino of pairs matrix, similarity considered iary 68. ref. other of the ENM the for using while, computed were trajectories, they MD instances, were kinase atomistic adenylate from for obtained fluctuations distance Interresidue domains. rigid hce hte rntte a ecnetdb aho h network as measured the then on is path domain the a of by disconnectedness connected of be degree The can links. they of not or whether checked 10 more the or two where of are consisted that EDs individual subdomains the whether establishing by teriori Compactness. Structural averaging by n the obtained the in is between node ficient links a of of coef- ber clustering node neighbors local The of two neighbors. themselves ficient are that residue) a probability (i.e., network the measures ficient k .Teueo uotmlgah nwy ed aial otesame set the always to we basically sion, leads (see anyway, size graph, MSA suboptimal (see a the subdivisions of to spe- use the respect The to with S3). respect and Fig. with robust used is method location DCA peak cific The analysis. ED the for used k largest DCA. the computed ing the we from entries, derived MSA 813 network for the the the of profile of each to for propensity quantity Specifically, clustering this the use of we Here dence (96). node per neighbors k each with associated weights the account (63). into link takes which coefficient, ing Network. coefficient Neighbor clustering k-Nearest the the of Optimization EDs improve could residues such omitting resolution. for protocol the on optimization ¯ naetnihoigrsde o aiu ausof values various for residues neighboring -nearest /(N i ssoni i.2 h itiuini iil ekdat peaked visibly is distribution The 2. Fig. 
in shown is > o ahpriininto partition each for ) omnyue emo eeec sarno rp,frwhich for graph, random a is reference of term used commonly A D eeietfidwt xcl h aemto,btwt w dif- two with but method, same the exactly with identified were EDs nbif h Dpriinn novstefloigses (i steps: following the involves partitioning DD the brief, In n ertie nyteoewt h etkmdissoe Finally, score. k-medoids best the with one the only retained we and Q, .W hncniee all considered then We A. k esrieta,i uuedvlpet ftecretmto,an method, current the of developments future in that, surmise We . ˚ k S .I hsppr eue eeaie,wihe ento fcluster- of definition weighted generalized, a used we paper, this In 1. − ete dnie h au of value the identified then We {3,5,7,10,15,20,25,30,40,55}. ∈ togs opig r et see kept; are couplings strongest arx eust o ennflsbiiin,w lasretained always we subdivisions, meaningful for requisite a matrix, h C ehdue (see used method DCA the ) ) where 1), rm2t 0 o pia oandsrmnto,ti spec- this discrimination, domain optimal For 10. to 2 from Q, ,(iii S19), Fig. Appendix, SI n q oe rsde)wr ikdol ftheir if only linked were (residues) nodes ∆C k naetniho rtro,wee o ahrsde only residue, each for where, criterion, neighbor -nearest i .Acrigy in Accordingly, S29). Fig. Appendix, SI α N sdfie as defined is ,( S18), and S17 Figs. Appendix, SI o ahety h itiuino uhotmlvle of values optimal such of distribution The entry. each for k S stenme fndsand nodes of number the is itne esta 10 than less distances ij PNAS = > S i nre ftesmlrt arxwr eie from derived were matrix similarity the of entries ij ,mkn u ehdparameter-free. method our making 7, h pta opcns fEswsasse pos- a assessed was EDs of compactness spatial The and sdrvdfo h itneflcutoso all of fluctuations distance the from derived is , 10 n Q C | pr.W apdec ED each mapped We apart. 
A i ˚ j n (ii ; ulse nieNvme 8 2017 28, November online Published egbr of neighbors 9) r“lqees”o h ewr of network the of “cliqueness,” or (95), = q (n C et pcrlcutrn cee(61, scheme clustering spectral a next, ) i 2, q C = i − . . . .I h lat- the In S24–S29). Figs. Appendix, SI 2 vralndsi h ewr with network the in nodes all over h pcfi pricto strategy sparsification specific the ) piiaino h -ers Neigh- k-Nearest the of Optimization )dsic arnso h oe and nodes the of pairings distinct 1) t i 1 oan,w optdaqual- a computed we domains, ,10 /[ n ) o outes(2 4,the 94), (62, robustness For A). i ˚ (n i h lblcutrn coef- clustering global The . i − is S20–S23), Figs. Appendix, SI k ¯ where 1)], steaeaenme of number average the is hc a necessary was which ii, oset To ii k h prt”o EDs of “purity” the ) h lseigcoef- clustering The . k = eut n Discus- and Results h ieo the of size the ) Fig. Appendix, SI C ,wihw then we which 7, ∆C α k t eprofiled we , eewithin were s q i IAppendix, SI stenum- the is = oagraph a to nauxil- An ) | C k k C depen- − E10619 rand yield- C rand =

$d_q = b_q/(n_q - 1)$, where $b_q$ is the number of residue pairs without a connecting path. The overall compactness of a subdivision into $Q$ domains is then defined as

$$\Omega = 1 - \frac{1}{N} \sum_{q=1}^{Q} d_q, \qquad [1]$$

where $N$ is the total number of nodes/residues. For a physical interpretation of $\Omega$, see SI Appendix, Fig. S6.

ACKNOWLEDGMENTS. This work was partially supported by National Institutes of Health Grants R01GM093290, S10OD020095, and P01GM055876 (to V.C.) and National Science Foundation Grant ACI-1614804 (to V.C.).
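The compactness measure of Eq. 1 can be sketched as follows. The example assumes that each domain is mapped to a contact graph in which two residues are linked when their Cα atoms lie within a distance cutoff (10 Å here, which is our reading of the Methods), and that a single-residue domain contributes zero disconnectedness. The function names and the toy coordinates are hypothetical.

```python
from collections import Counter, defaultdict
from math import dist


def _find(parent, node):
    """Union-find root lookup with path halving."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def compactness(coords, labels, cutoff=10.0):
    """Omega = 1 - (1/N) * sum_q d_q, with d_q = b_q / (n_q - 1).

    coords : list of (x, y, z) C-alpha positions, one per residue
    labels : domain label of each residue (same order as coords)
    cutoff : contact distance in angstroms (assumed value)
    """
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    total_disconnectedness = 0.0
    for residues in members.values():
        n_q = len(residues)
        if n_q < 2:
            continue  # assumption: a one-residue domain has d_q = 0
        parent = {r: r for r in residues}
        for pos, a in enumerate(residues):
            for b in residues[pos + 1:]:
                if dist(coords[a], coords[b]) < cutoff:
                    parent[_find(parent, a)] = _find(parent, b)
        # Pairs joined by a path are exactly the pairs within one component.
        component_sizes = Counter(_find(parent, r) for r in residues).values()
        connected_pairs = sum(c * (c - 1) // 2 for c in component_sizes)
        b_q = n_q * (n_q - 1) // 2 - connected_pairs
        total_disconnectedness += b_q / (n_q - 1)

    return 1.0 - total_disconnectedness / len(coords)


# Toy chain: two tight pairs of residues placed 100 angstroms apart.
toy_coords = [(0, 0, 0), (5, 0, 0), (100, 0, 0), (105, 0, 0)]
omega_split = compactness(toy_coords, [0, 0, 1, 1])   # each domain is compact
omega_merged = compactness(toy_coords, [0, 0, 0, 0])  # one fragmented domain
```

Assigning the two distant pairs to separate domains gives the maximal score, whereas lumping them into one domain lowers Ω because four of the six residue pairs have no connecting path.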

1. Alberts B, et al. (2002) Molecular Biology of the Cell (Garland Sci, New York).
2. Petsko G, Ringe D (2004) and Function (New Sci Press, London).
3. Eisenmesser EZ, Bosco DA, Akke M, Kern D (2002) Enzyme dynamics during catalysis. Science 295:1520–1523.
4. Eisenmesser EZ, et al. (2005) Intrinsic dynamics of an enzyme underlies catalysis. Nature 438:117–121.
5. Min W, Xie XS, Bagchi B (2008) Two-dimensional reaction free energy surfaces of catalytic reaction: Effects of protein conformational dynamics on . J Phys Chem B 112:454–466.
6. Min W, Xie XS, Bagchi B (2009) Role of conformational dynamics in kinetics of an enzymatic cycle in a nonequilibrium steady state. J Chem Phys 131:08B606.
7. Nevin Gerek Z, Kumar S, Banu Ozkan S (2013) Structural dynamics flexibility informs function and evolution at a scale. Evol Appl 6:423–433.
8. Shaw DE, et al. (2008) Anton, a special-purpose machine for molecular dynamics simulation. Commun ACM 51:91–97.
9. Shaw DE, et al. (2010) Atomic-level characterization of the structural dynamics of proteins. Science 330:341–346.
10. Granata D, Camilloni C, Vendruscolo M, Laio A (2013) Characterization of the free-energy landscapes of proteins by NMR-guided metadynamics. Proc Natl Acad Sci USA 110:6817–6822.
11. Zheng W, Brooks BR, Thirumalai D (2007) Allosteric transitions in the chaperonin GroEL are captured by a dominant normal mode that is most robust to sequence variations. Biophys J 93:2289–2299.
12. Zen A, Micheletti C, Keskin O, Nussinov R (2010) Comparing interfacial dynamics in protein-protein complexes: An elastic network approach. BMC Struct Biol 10:26.
13. Ramanathan A, Agarwal PK (2011) Evolutionarily conserved linkage between enzyme fold, flexibility, and catalysis. PLoS Biol 9:e1001193.
14. Micheletti C, Lattanzi G, Maritan A (2002) Elastic properties of proteins: Insight on the folding process and evolutionary selection of native structures. J Mol Biol 321:909–921.
15. Agarwal PK, Billeter SR, Rajagopalan PR, Benkovic SJ, Hammes-Schiffer S (2002) Network of coupled promoting motions in enzyme catalysis. Proc Natl Acad Sci USA 99:2794–2799.
16. Hammes-Schiffer S, Benkovic SJ (2006) Relating protein motion to catalysis. Annu Rev Biochem 75:519–541.
17. Carnevale V, Raugei S, Micheletti C, Carloni P (2006) Convergent dynamics in the protease enzymatic superfamily. J Am Chem Soc 128:9766–9772.
18. Chennubhotla C, Bahar I (2007) Signal propagation in proteins and relation to equilibrium fluctuations. PLoS Comput Biol 3:e172.
19. Carnevale V, Pontiggia F, Micheletti C (2007) Structural and dynamical alignment of enzymes with partial structural similarity. J Phys Condens Matter 19:285206.
20. Zen A, Carnevale V, Lesk AM, Micheletti C (2008) Correspondences between low-energy modes in enzymes: Dynamics-based alignment of enzymatic functional families. Protein Sci 17:918–929.
21. del Sol A, Tsai CJ, Ma B, Nussinov R (2009) The origin of allosteric functional modulation: Multiple pre-existing pathways. Structure 17:1042–1050.
22. Jackson CJ, et al. (2009) Conformational sampling, catalysis, and evolution of the bacterial phosphotriesterase. Proc Natl Acad Sci USA 106:21631–21636.
23. Teilum K, Olsen JG, Kragelund BB (2009) Functional aspects of protein flexibility. Cell Mol Sci 66:2231–2247.
24. Morra G, Verkhivker G, Colombo G (2009) Modeling signal propagation mechanisms and ligand-based conformational dynamics of the hsp90 molecular chaperone full-length dimer. PLoS Comput Biol 5:e1000323.
25. Provasi D, Artacho MC, Negri A, Mobarec JC, Filizola M (2011) Ligand-induced modulation of the free-energy landscape of -coupled receptors explored by adaptive biasing techniques. PLoS Comput Biol 7:e1002193.
26. Bhabha G, et al. (2011) A dynamic knockout reveals that conformational fluctuations influence the chemical step of enzyme catalysis. Science 332:234–238.
27. Bavro VN, et al. (2012) Structure of a kirbac potassium channel with an open bundle crossing indicates a mechanism of channel gating. Nat Struct Mol Biol 19:158–163.
28. Glembo TJ, Farrell DW, Gerek ZN, Thorpe M, Ozkan SB (2012) Collective dynamics differentiates functional divergence in protein evolution. PLoS Comput Biol 8:e1002428.
29. Liu Y, Bahar I (2012) Sequence evolution correlates with structural dynamics. Mol Biol Evol 29:2253–2263.
30. Lai J, Jin J, Kubelka J, Liberles DA (2012) A phylogenetic analysis of normal modes evolution in enzymes and its relationship to enzyme function. J Mol Biol 422:442–459.
31. Micheletti C (2013) Comparing proteins by their internal dynamics: Exploring structure–function relationships beyond static structural alignments. Phys Life Rev 10:1–26.
32. Shakhnovich EI, Gutin AM (1993) Engineering of stable and fast-folding sequences of model proteins. Proc Natl Acad Sci USA 90:7195–7199.
33. Dokholyan NV, Shakhnovich EI (2001) Understanding hierarchical protein evolution from first principles. J Mol Biol 312:289–307.
34. Sali A, Potterton L, Yuan F, van Vlijmen H, Karplus M (1995) Evaluation of comparative protein modeling by MODELLER. Proteins Struct Funct Genet 23:318–326.
35. Rohl CA, Strauss CE, Misura KM, Baker D (2004) Protein Structure Prediction Using Rosetta in Methods in Enzymology (Elsevier, San Diego), pp 66–93.
36. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: A unified platform for automated protein structure and function prediction. Nat Protoc 5:725–738.
37. Lockless SW (1999) Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286:295–299.
38. Dutheil J, Pupko T, Jean-Marie A, Galtier N (2005) A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol 22:1919–1928.
39. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T (2008) Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci USA 106:67–72.
40. Morcos F, et al. (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA 108:E1293–E1301.
41. Kamisetty H, Ovchinnikov S, Baker D (2013) Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci USA 110:15674–15679.
42. Feinauer C, Skwark MJ, Pagnani A, Aurell E (2014) Improving contact prediction along three dimensions. PLoS Comput Biol 10:e1003847.
43. Skwark M, Raimondi D, Michel M, Elofsson A (2014) Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 10:e1003889.
44. Hayat S, Sander C, Marks D, Elofsson A (2015) All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences. Proc Natl Acad Sci USA 112:5413–5418.
45. Dutheil J, Pupko T, Jean-Marie A, Galtier N (2005) A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol 22:1919–1928.
46. Fodor AA, Aldrich RW (2004) Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins Struct Funct 56:211–221.
47. Liberles DA, et al. (2012) The interface of protein structure, protein biophysics, and . Protein Sci 21:769–785.
48. Süel G, Lockless S, Wall M, Ranganathan R (2003) Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol 10:59–69.
49. Halabi N, Rivoire O, Leibler S, Ranganathan R (2009) Protein sectors: Evolutionary units of three-dimensional structure. Cell 138:774–786.
50. Dwyer RS, Ricci DP, Colwell LJ, Silhavy TJ, Wingreen NS (2013) Predicting functionally informative mutations in Escherichia coli BamA using evolutionary covariance analysis. Genetics 195:443–455.
51. Palovcak E, Delemotte L, Klein ML, Carnevale V (2014) Evolutionary imprint of activation: The design principles of VSDs. J Gen Physiol 143:145–156.
52. Palovcak E, Delemotte L, Klein ML, Carnevale V (2015) Comparative suggests a conserved gating mechanism for TRP channels. J Gen Physiol 146:37–50.
53. Sutto L, Marsili S, Valencia A, Gervasio FL (2015) From residue coevolution to protein conformational ensembles and functional dynamics. Proc Natl Acad Sci USA 112:13567–13572.
54. Woods KN, Pfeffer J (2015) Using THz spectroscopy evolutionary network analysis methods, and MD simulation to map the evolution of allosteric communication pathways in c-type . Mol Biol Evol 33:40–61.
55. Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M (2015) Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol Biol Evol 33:268–280.
56. Haldane A, Flynn WF, He P, Vijayan R, Levy RM (2016) Structural propensities of kinase family proteins from a Potts model of residue co-variation. Protein Sci 25:1378–1384.
57. Sutto L, Marsili S, Valencia A, Gervasio F (2015) From residue coevolution to protein conformational ensembles and functional dynamics. Proc Natl Acad Sci USA 112:13567–13572.
58. Poon A, et al. (2010) Phylogenetic analysis of population-based and deep sequencing data to identify coevolving sites in the nef gene of HIV-1. Mol Biol Evol 27:819–832.
59. Yu FH, Catterall WA (2004) The VGL-Chanome: A specialized for electrical signaling and ionic homeostasis. Sci Signaling 2004:re15.
60. Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E (2013) Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys Rev E 87:012707.
61. Ponzoni L, Polles G, Carnevale V, Micheletti C (2015) SPECTRUS: A dimensionality reduction approach for identifying dynamical domains in protein complexes from limited structural datasets. Structure 23:1516–1525.
62. Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416.
63. Saramäki J, Kivelä M, Onnela JP, Kaski K, Kertész J (2007) Generalizations of the clustering coefficient to weighted complex networks. Phys Rev E 75:027105.
64. Hanas J, Hazuda D, Bogenhagen D, Wu F, Wu C (1983) Xenopus transcription factor A requires zinc for binding to the 5 S RNA gene. J Biol Chem 258:14120–14125.
65. Lu D, Klug A (2007) Invariance of the zinc finger module: A comparison of the free structure with those in nucleic-acid complexes. Proteins: Structure, Function, and Bioinformatics 67:508–512.

66. Lee BM, et al. (2006) Induced fit and “lock and key” recognition of 5S RNA by zinc fingers of transcription factor IIIA. J Mol Biol 357:275–291.
67. Espada R, Parra RG, Mora T, Walczak AM, Ferreiro DU (2015) Capturing coevolutionary signals in repeat proteins. BMC Bioinformatics 16:207.
68. Micheletti C, Carloni P, Maritan A (2004) Accurate and efficient description of protein vibrational dynamics: Comparing molecular dynamics and Gaussian models. Proteins Struct Funct Bioinformatics 55:635–645.
69. Fuglebakk E, Reuter N, Hinsen K (2013) Evaluation of protein elastic network models based on an analysis of collective motions. J Chem Theor Comput 9:5618–5628.
70. Hille B, et al. (2001) Ion Channels of Excitable Membranes (Sinauer, Sunderland, MA), Vol 507.
71. Long SB, Tao X, Campbell EB, MacKinnon R (2007) Atomic structure of a voltage-dependent K+ channel in a lipid membrane-like environment. Nature 450:376–382.
72. Payandeh J, Scheuer T, Zheng N, Catterall WA (2011) The crystal structure of a voltage-gated sodium channel. Nature 475:353–358.
73. Liao M, Cao E, Julius D, Cheng Y (2013) Structure of the TRPV1 ion channel determined by electron cryo-microscopy. Nature 504:107–112.
74. Kasimova M, Granata D, Carnevale V (2016) Voltage-gated sodium channels: Evolutionary history and distinctive sequence features. Curr Top Membr 78:261–286.
75. Lu Z, Klem AM, Ramu Y (2002) Coupling between voltage sensors and activation gate in voltage-gated K+ channels. J Gen Physiol 120:663–676.
76. Broomand A, Männikkö R, Larsson HP, Elinder F (2003) Molecular movement of the voltage sensor in a K channel. J Gen Physiol 122:741–748.
77. Long S, Campbell E, Mackinnon R (2005) Voltage sensor of Kv1.2: Structural basis of electromechanical coupling. Science 309:903–908.
78. Voets T, Talavera K, Owsianik G, Nilius B (2005) Sensing with TRP channels. Nat Chem Biol 1:85–92.
79. Ramsey IS, Delling M, Clapham DE (2006) An introduction to TRP channels. Annu Rev Physiol 68:619–647.
80. Feng Q (2014) Temperature Sensing by Thermal TRP Channels in Current Topics in Membranes (Elsevier, San Diego), pp 19–50.
81. Carnevale V, Rohacs T (2016) TRPV1: A target for rational drug design. Pharmaceuticals 9:E52.
82. Cao E, Liao M, Cheng Y, Julius D (2013) TRPV1 structures in distinct conformations reveal activation mechanisms. Nature 504:113–118.
83. Yang F, et al. (2015) Structural mechanism underlying capsaicin binding and activation of the TRPV1 ion channel. Nat Chem Biol 11:518–524.
84. Darré L, Domene C (2015) Binding of capsaicin to the TRPV1 ion channel. Mol Pharmaceutics 12:4454–4465.
85. Elokely K, et al. (2015) Understanding TRPV1 activation by ligands: Insights from the binding modes of capsaicin and resiniferatoxin. Proc Natl Acad Sci USA 113:E137–E145.
86. Gao Y, Cao E, Julius D, Cheng Y (2016) TRPV1 structures in nanodiscs reveal mechanisms of ligand and lipid action. Nature 534:347–351.
87. Baker FN, Porollo A (2016) Coeviz: A web-based tool for coevolution analysis of protein residues. BMC Bioinformatics 17:119.
88. Teşileanu T, Colwell LJ, Leibler S (2015) Protein sectors: Statistical coupling analysis versus conservation. PLoS Comput Biol 11:e1004091.
89. Rivoire O (2013) Elements of coevolution in biological sequences. Phys Rev Lett 110:178102.
90. Cocco S, Monasson R, Weigt M (2013) From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS Comput Biol 9:e1003176.
91. Smock RG, et al. (2010) An interdomain sector mediating allostery in molecular chaperones. Mol Syst Biol 6:414.
92. Remmert M, Biegert A, Hauser A, Söding J (2011) HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175.
93. Lapedes A, Giraud B, Liu L, Stormo G (1998) Correlated Mutations in Protein Sequences: Phylogenetic and Structural Effects (Santa Fe Inst, Santa Fe, NM).
94. Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: Analysis and an algorithm. Adv Neural Inf Process Syst 2:849–856.
95. Watts D, Strogatz S (1998) Collective dynamics of ‘small-world’ networks. Nature 393:440–442.
96. Newman M (2004) Random graphs as models of networks. From the Genome to the Internet (Wiley-Blackwell, Berlin), pp 35–68.
