The structure of dynamic space

S. Rackovskya,b,1 and Harold A. Scheragaa,1

aDepartment of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853; and bDepartment of Biochemistry and , University of Rochester School of Medicine and Dentistry, Rochester, NY 14642

Contributed by Harold A. Scheraga, June 26, 2020 (sent for review May 5, 2020; reviewed by Robert L. Jernigan and Jeffrey Skolnick)

We use a bioinformatic description of dynamic properties, based on residue-specific average B factors, to construct a dynamics-based, large-scale description of a space of protein sequences. We examine the relationship between that space and an independently constructed, structure-based space comprising the same sequences. It is demonstrated that structure and dynamics are only moderately correlated. It is further shown that helical proteins fall into two classes with very different structure–dynamics relationships. We suggest that dynamics in the two helical classes are dominated by distinctly different modes: pseudo–one-dimensional, localized helical modes in one case, and pseudo–three-dimensional (3D) global modes in the other. Sheet/barrel and mixed-α/β proteins exhibit more conventional structure–dynamics relationships. It is found that the strongest correlation between structure and dynamic properties arises when the latter are represented by the sequence average of the dynamic index, which corresponds physically to the overall mobility of the protein. None of these results are accessible to bioinformatic methods hitherto available.

B factor | protein dynamics | structure–dynamics relationships | Fourier transform

Significance

Protein dynamic properties have been computationally accessible only by means of molecular-dynamic simulations, or network models, of specific molecules. We have developed a bioinformatic approach to dynamics, which makes it possible to delineate the dynamic characteristics of large numbers of sequences simultaneously. In this work we report an analysis of the large-scale dynamic structure of protein space. It is demonstrated that proteins of different structural classes have different dynamic behaviors, and that all-helical proteins occur with two distinct types of dynamic behavior. One subset of helical proteins is characterized by localized, helix-based dynamics, while the complementary subset exhibits dynamics of a more three-dimensional nature. This information has not been available through the application of more traditional methods.

Author contributions: S.R. and H.A.S. designed research, performed research, analyzed data, and wrote the paper.
Reviewers: R.L.J., Iowa State University; and J.S., Georgia Institute of Technology.
The authors declare no competing interest.
Published under the PNAS license.
1To whom correspondence may be addressed. Email: [email protected] or has5@cornell.edu.
First published August 5, 2020.

Protein structure and evolution have been intensively studied for many years, and vast bodies of sequence and structure data have been accumulated and analyzed using tools of bioinformatics. This approach is usually referred to as “knowledge-based,” to distinguish it from an equally impressive body of computational studies based on simulation, using physically motivated empirical energy functions, of actual physical processes. One central area of protein science, however, has thus far resisted knowledge-based study. Protein dynamic characteristics have only been available computationally from two frameworks. These are molecular-dynamic simulations and elastic network models, processor-intensive approaches which limit studies to single proteins, or to comparisons of small groups of molecules of interest. This situation arises from the fact that no informatic parameter has been available which adequately represents the dynamics of individual amino acids.

In recent work (1), we have developed a measure of the dynamic properties of amino acids in protein sequences which is suitable for bioinformatic use. This property is the residue-specific average value of the B factor (2), determined from a large database of protein structures. We denote the average B factor for amino acid X as ⟨B_X⟩. The quantity ⟨B_X⟩ plays the same general role with respect to dynamics that a hydrophobicity index plays with respect to solvent exposure. It is not the case that every hydrophilic amino acid is in actual contact with solvent, nor does every amino acid with a high value of ⟨B_X⟩ exhibit high mobility. Rather, ⟨B_X⟩ is a measure of the tendency of the amino acid X to be in motion. The information carried by ⟨B_X⟩ becomes important in the context of a complete sequence, as is also true of hydrophobicity indices.

It was shown (1) that the values of ⟨B_X⟩ differ between amino acids in a statistically significant manner. Using statistical, signal processing, and information theoretic methods, we demonstrated several properties of ⟨B_X⟩:

1) Values of ⟨B_X⟩ are partly, but not exclusively, determined by the values of the intrinsic physical properties of the amino acids, as represented by a complete and orthogonal set of property factors (3, 4);
2) ⟨B_X⟩ also encodes information about the influence of protein fold and other residue-external factors on dynamics;
3) Using Fourier techniques, the global, whole-sequence dynamic properties of sequences can be represented;
4) A substantial fraction of the information encoded in the global representation of dynamic properties originates from the part of ⟨B_X⟩ which does not arise from single-amino acid physical properties, and is therefore not accessible from any representation based on static amino acid properties;
5) Groups of proteins which fold to different architectures differ from one another in their behavior in a detectable and statistically significant manner, when represented by global dynamic parameters.

The availability of a well-characterized bioinformatic quantity derived from the dynamic properties of amino acids makes possible study of the dynamic properties of proteins on a large scale, rather than anecdotally. We wish to understand the relationship between the space of protein structures and a parallel, distinct space determined by the dynamic properties of the same proteins. We demonstrate the following results:

1) The relationship between the two spaces is characterized, in part, by an anomalous dependence of dynamic distance on structure difference.
2) This anomaly arises from unexpected behavior of all-helical proteins, which exhibit two distinct types of behavior in dynamic space.
3) We suggest that these behaviors correspond to physically different dynamic regimes within the universe of all-helical proteins.
4) Structure–dynamics correlations in proteins are encoded in the overall mobility of the structure, rather than in more localized descriptions of chain dynamic properties.

19938–19942 | PNAS | August 18, 2020 | vol. 117 | no. 33 www.pnas.org/cgi/doi/10.1073/pnas.2008873117

We next turn to the construction of the distance function in D. The proteins in our database are labeled by values of the four indices C, A, T, and H which together classify entries in the CATH database (8). We focus first on the identifier C, which specifies structural class. It will be remembered that C = 1 denotes helical architecture, C = 2 sheet/barrel architecture, and C = 3 mixed-α/β architecture. The sequences of the 5,719 proteins in our database are written in terms of the residue-specific average B factors, giving a numerical string for each sequence (which we denote as the dynamic sequence), and we ask in what way sequences belonging to the three C classes differ from one another. We answer this question by Fourier analyzing the dynamic sequences, and carrying out an ANOVA of the distributions of the resulting Fourier coefficients. (Details of the procedure are given in Methods.) We require a high degree of statistical significance (9), and find that there are 11 values of the wave number k at which the distributions of sine or cosine Fourier coefficients of sequences belonging to the three classes differ from one another with P < 0.0001. We measure distance in D using a weighted Euclidean distance function based on these 11 Fourier coefficients, as shown in Eq. 4 below. The weighting allows us to measure independently the contribution of each of the significant wave numbers to any structure–dynamics correlation we observe.

Fig. 1. Side-by-side boxplot of the average, maximum [MAX(RALL)], and minimum [MIN(RALL)] of the correlation between structure and dynamic distances, over all possible choices of the weighting set {w_i} and all proteins in the dataset. The values of these quantities for the choice {w_i} = (1,0,...,0) (see text) are indicated by arrows.
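The conversion of a sequence into a dynamic sequence, followed by Fourier analysis, can be sketched as follows. This is a minimal illustration, not the authors' code. The B-factor values are those of Table 4, keyed here by one-letter code. The function names and the 1/N normalization of the coefficients are assumptions; the normalization is chosen so that the cos(k = 0) coefficient equals the sequence average, as the text describes.

```python
import math

# Residue-specific average B factors from Table 4, keyed by one-letter code.
AVG_B = {
    "A": 26.27, "D": 29.89, "C": 26.27, "E": 31.92, "F": 25.31,
    "G": 27.45, "H": 26.98, "I": 25.86, "K": 31.30, "L": 26.88,
    "M": 27.65, "N": 28.52, "P": 28.77, "Q": 29.54, "R": 29.47,
    "S": 28.32, "T": 27.06, "V": 25.67, "W": 23.98, "Y": 25.07,
}


def dynamic_sequence(seq):
    """Replace each residue by its average B factor (hypothetical helper)."""
    return [AVG_B[res] for res in seq]


def fourier_coefficients(values, k):
    """Return (cosine, sine) coefficients at wave number k.

    Assumed convention: c_k = (1/N) * sum_j v_j * cos(2*pi*k*j/N), and
    likewise for the sine coefficient.
    """
    n = len(values)
    cos_k = sum(v * math.cos(2 * math.pi * k * j / n) for j, v in enumerate(values)) / n
    sin_k = sum(v * math.sin(2 * math.pi * k * j / n) for j, v in enumerate(values)) / n
    return cos_k, sin_k
```

Under this convention the k = 0 cosine coefficient of any sequence is simply the mean of its dynamic sequence, so it is unchanged when the residues are reordered.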
Given these two functions, the correlation between distances in the two spaces can be determined for any protein in the database, and for any set of values of the dynamic weighting factors. We denote this correlation coefficient by R(S_m, D_m; {w_i}), where distances are measured from protein m, and {w_i | i = 1,2,...,11} is the set of weighting functions used in the dynamic distance function. We have carried out this calculation for every protein in the database, and for all 2,047 possible binary values of the 11 weights. The results are summarized in Fig. 1, in which we show a side-by-side boxplot of the average value, maximum, and minimum of R(S_m, D_m; {w_i}). It will be seen that one specific choice of {w_i} gives an exceptionally large average and range for R. This weighting, {w_i} = (1,0,0,...,0), corresponds to a distance measured solely by the values of the cos(k = 0) Fourier coefficient. The cos(k = 0) Fourier coefficient is the average value of the residue-specific average B factor, measured over the sequence. This coefficient, it should be noted, contains no information about the actual linear arrangement of residues along the sequence. In physical terms, it measures the average tendency of all of the residues in the sequence to be in motion. We shall refer to this coefficient as the mobility of the sequence.

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Results and Discussion

In the present work, we examine a basic, but hitherto inaccessible question about proteins: whether structure and dynamics are related in a simple way. We proceed as follows.

1) We construct a dynamics-based distance function between proteins, based on the global properties of the residue-specific average B factors, and apply it to a large protein database. This generates a protein space determined by sequence dynamic properties.
2) We construct a structure-based distance function between proteins which allows rapid, optimization-free comparison of molecular architectures. We demonstrate that, despite the fact that this function is based on low-resolution information, it accurately describes the organization of structure space. This distance function generates a second, independent space determined by the structures in the same large protein database.
3) We analyze the relationship between these two spaces, and ask what information this relationship carries about protein physics.
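The scan over weighting sets described above can be sketched as follows. This is a minimal illustration under stated assumptions. The function names are hypothetical, the Pearson coefficient is assumed as the correlation measure (the text does not name one), and the inputs stand in for quantities the paper computes elsewhere: a precomputed matrix of structure distances and, for each protein, a vector of 11 normalized Fourier coefficients.

```python
import itertools
import math


def weight_sets(n):
    """All binary weight vectors of length n except the all-zero one."""
    return [w for w in itertools.product((0, 1), repeat=n) if any(w)]


def dynamic_distance(z_p, z_q, w):
    """Weighted Euclidean distance between two coefficient vectors (form of Eq. 4)."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, z_p, z_q)))


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def correlation_from(m, struct_dist, z, w):
    """Correlate structure and dynamic distances measured from protein m."""
    others = [q for q in range(len(z)) if q != m]
    s = [struct_dist[m][q] for q in others]
    d = [dynamic_distance(z[m], z[q], w) for q in others]
    return pearson(s, d)
```

For 11 weights, `weight_sets(11)` yields the 2,047 nonzero binary vectors mentioned in the text; the vector with a single 1 in the first position corresponds to a distance based on the cos(k = 0) coefficient alone, provided that coefficient is stored first.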
It should be noted that there is reason to question whether structure and dynamics are of necessity related. Even in cases where structural homology and functional similarity are both obtained, dynamic differences have been demonstrated (5).

In order to understand the relationship between the spaces associated with our database (the structure space, which we denote by S, and the dynamic space, which we denote by D), we consider the correlation between corresponding pairwise distances in the two spaces. Naively, one expects to find that distances in the two spaces should be positively correlated, because, as structures become less similar, the dynamic characteristics of the molecules diverge in some reasonable sense. We shall examine whether this is, in fact, the case.

We begin by considering the distance function in S. We use a low-resolution representation of protein structures which we have shown (6, 7) to give a structure space essentially equivalent to that obtained by high-resolution methods, while making possible extremely rapid comparisons of structure. This leads to a Euclidean sequence distance function in a three-dimensional space, shown in Eq. 3 in Methods. We demonstrate there that the organization of the space generated by this metric corresponds precisely to what would be expected based on physical intuition.

Fig. 2. A histogram of the values of RALL, the correlation between structure and dynamic distances, for all 5,719 proteins in the data set.
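The structure distance can be sketched in a few lines. The coefficients below are those of Eq. 2 in Methods, and the distance has the form of Eq. 3. The function names are hypothetical, and any fragment fractions passed in are illustrative inputs, not data from the paper.

```python
import math

# Rows of the transformation in Eq. 2, applied to (E_L, E_0, E_R, A_R).
EQ2_ROWS = (
    (-0.515, -0.518, -0.34, 0.592),
    (-0.099, -0.358, 0.92, 0.128),
    (-0.780, 0.6, 0.158, -0.065),
)


def to_structure_space(fractions):
    """Map fragment fractions (E_L, E_0, E_R, A_R) to the three-vector (Xi_1, Xi_2, Xi_3)."""
    return tuple(sum(c * f for c, f in zip(row, fractions)) for row in EQ2_ROWS)


def structure_distance(fractions_p, fractions_q):
    """Euclidean distance between two proteins in structure space (form of Eq. 3)."""
    xi_p = to_structure_space(fractions_p)
    xi_q = to_structure_space(fractions_q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi_p, xi_q)))
```

Because the comparison needs only four fractions per protein and no structural superposition, each pairwise distance costs a handful of arithmetic operations, which is what makes database-wide comparison cheap.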

The results embodied in Fig. 1 are of interest for several reasons.

1) The values of R(S_m, D_m; {w_i}) are not large, even for the exceptional case {w_i} = (1,0,...,0). (The actual distribution of R for the optimal weighting set is shown in Fig. 2.) This confirms the anecdotal observation that molecular architecture does not strongly dictate dynamic behavior.
2) Both positive and negative values of R are found, in contrast to intuitive expectation.
3) The strongest correlation between dynamic and structural distances occurs when global dynamics are described by a function which depends only on the average tendency of residues in each sequence to be mobile.

We shall refer to the R > 0 and R < 0 subsets of the database as normal and anomalous, respectively. The normal regime corresponds physically to “expected” behavior, in which dynamics diverge as structures do. The complementary, anomalous regime corresponds physically to an extended region of dynamic similarity in structure space. It is instructive to examine the relationship between these two types of behavior and molecular architecture. We find that the distribution of C values in the anomalous region is very different from that in the sample as a whole. Fully 65% of proteins with R < 0 are α-helical. All-β (17%) and α/β (18%) structures are found with much lower frequency in this region. In the database as a whole the corresponding percentages are 24, 26, and 50%.

A statistical analysis of the observed distribution sheds further light on this result. We ask to what extent the observed distribution of structure classes in the anomalous region differs from that in the entire database. For this purpose we subdivide the structural classes, by labeling each protein with values of both C, the CATH parameter which indicates structural class, and A, the parameter which indicates architecture. For each CA group, we ask whether the number of representatives in the R < 0 region is significantly larger than would be expected on a random basis. The answer to this question is summarized in Table 1. The only structural classes which occur in the anomalous region with greater than random probability (corresponding to Z > 0) are those with C = 1, and all C = 1 classes are represented in this region.

The same C = 1 classes are also represented in the normal region. We ask what the dynamic difference is between C = 1 proteins which occur in the anomalous region of dynamic space and those in the normal region. Table 2 shows average values of the mobility distributions in the two regions. Comparison of the two averages gives |Z| = 33.5, corresponding to distributions of R(S_m, D_m; {w_i}) which differ with p ∼ 0. The two distributions differ with very high significance, despite the structural similarities between members of each CA class.

Helical proteins in the anomalous region are seen to have lower mobility than those in the normal region, and the sensitivity of those mobilities to structural change is greatly reduced. The insensitivity in this region to overall structural differences suggests that mobility is significantly influenced by structural subfeatures which are common to all of the proteins in question, irrespective of changes in overall architecture. We speculate that the dynamics of molecules in the R < 0 regime is dominated by modes which are characteristic of helical segments, and are quasi–one-dimensional in nature, rather than those which depend in a 2- or 3D way on the manner in which helical segments are assembled.

Further inspection of Table 1 indicates that, whereas sheet/barrel proteins are generally distributed randomly between the two regions of dynamic space, mixed-α/β proteins show a strong tendency to avoid the anomalous region. A majority of C = 3 groups (comprising 88% of all α/β proteins in the dataset) are found to prefer the normal region, with very high significance. The peak in the histogram of Fig. 2 at high positive values of R(S_m, D_m; {w_i}) contains only proteins with C = 2 and C = 3.

In Table 3 we show the average values of the mobility for the three structural classes. Differences between the averages are significant with P <<0.0001. The highest average mobility is exhibited by helical proteins, and the lowest by sheet/barrel structures. However, the average value for helical proteins is a composite. As is clear from Table 2, the subset of helical proteins with R > 0 exhibit higher average mobility, and the subset with R < 0 lower average mobility, than any other class of structures. From a dynamic perspective, there are two separate classes of helical proteins.

The fact that the greatest correlation with structure is given by a dynamic description which depends only on the cos(k = 0) Fourier coefficient is suggestive. The k = 0 coefficient arises from an equal weighting of the residue-specific average B factors at all sequence positions and is a global measure of dynamic characteristics. Dynamic distance functions which include contributions from coefficients with higher values of k, and are therefore more localized in nature, underweight some regions of the molecule, and apparently obscure information which connects dynamics with overall structure.

Several questions are raised by our results. One would like to know what structural features differentiate helical proteins with R < 0 from those with R > 0. What detailed molecular motions lead to the observed behaviors? Is there a difference in the character of intramolecular interactions between the two regimes? Another intriguing question is whether one observes a similar dichotomy of behavior in groups of proteins which are known to be structurally very similar. What behavior will be observed if the analysis is extended to topological (CAT) subgroups of the C = 1 proteins? We are addressing these questions in ongoing work.

Table 1. Comparative structural properties of database and anomalous region

C   A     Nneg   Nall    fNeg     fAll     Z             P*
1   10    332    899     0.3701   0.1572   15.2357385    ∼0
1   20    207    415     0.2308   0.0726   15.0944806    ∼0
1   25    20     58      0.0223   0.0101   3.13571099    <0.0017
1   40    2      2       0.0022   0.0003   2.12962067    <0.04
1   50    19     24      0.0212   0.0042   5.88583513    <0.00001
2   10    17     110     0.0190   0.0192   −0.0572426
2   20    4      28      0.0045   0.0049   −0.1752547
2   30    30     152     0.0334   0.0266   1.16903786
2   40    45     311     0.0502   0.0544   −0.5199053
2   60    54     683     0.0602   0.1194   −5.2417886    <<0.0001
2   70    2      44      0.0022   0.0077   −1.8310616
2   80    1      36      0.0011   0.0063   −1.9341994
2   160   1      34      0.0011   0.0059   −1.8541518
2   170   1      25      0.0011   0.0044   −1.4493877
3   10    30     269     0.0334   0.0470   −1.8219082
3   20    6      234     0.0067   0.0409   −5.0973756    <<0.0001
3   30    56     766     0.0624   0.1339   −6.0365754    <<0.0001
3   40    51     1,136   0.0569   0.1986   −10.289245    <<0.0001
3   50    4      41      0.0045   0.0072   −0.9180483
3   80    1      20      0.0011   0.0035   −1.1793186
3   90    14     282     0.0156   0.0493   −4.5394363    <<0.0001

C and A are CATH labels (see text); Nneg and Nall are the number of occurrences of each CA group in the anomalous region and full database, respectively; fNeg and fAll are the corresponding composition fractions; Z is the Z score for the comparison of fractional occurrences; P is the associated probability of randomly observing the difference in fractions.
*Values of P are given only for cases where the occurrence is significantly different from random.

Table 2. Dynamic comparison between helical proteins with R ≤ 0 and R > 0

Region   N     Average mobility   σ
R ≤ 0    580   27.04              0.24
R > 0    818   28.15              0.26

N is the number of proteins in each region. The average mobility is the average value of the k = 0 Fourier coefficient in the region, and σ is the associated SD.

Table 3. Average mobility for different structural classes

C   Average mobility
1   27.97
2   27.79
3   27.84

Methods

Average B Factors. In this work, we use values of the residue-specific average B factor which have been adjusted from our previous work (1), to account for the removal from our database of some defective sequences. The new values are correlated with the previous values with R = 0.98, and all conclusions of our previous work remain. The adjusted values are shown in Table 4.

The Z functions used in the Fourier analysis are the normalized Fourier coefficients

\[ Z_k^N = \frac{c_k - \langle c_k \rangle}{\sigma(c_k)}, \tag{1} \]

where the bracket indicates an average of the Fourier coefficient c_k over all permutations of the wild-type sequence of the protein in question, and σ is the associated SD. We have shown (14) that these statistical quantities can be calculated analytically. As was noted previously (13), the effect of this normalization is to remove any dependence on sequence composition alone, so that the Z function explicitly encodes information about the specific linear arrangement of amino acids in the wild-type sequence. Sequence composition information is encoded in the k = 0 cosine Fourier coefficient, whose value is independent of the linear arrangement of amino acids.

Structure–Distance Function. We have shown in previous work (8, 9) that the structure of a protein can be represented numerically by a low-resolution representation which is four-dimensional. Each protein can be written as a point in this space with coordinates (E_L, E_0, E_R, A_R), where the coordinates are the fractional occurrence of 4-Cα fragments in one of three extended structure types or in an α-helical conformation, respectively. Because the four fractions are normalized (E_L + E_0 + E_R + A_R = 1), there are only three independent variables, and a principal component analysis shows the structure space S to be 3D under the transformation given in Eq. 2.

Database. The dataset we used for these studies is a subset of our standard dataset (8, 10). The basic data set is drawn from the CathDomainSeqs.S60.ATOM.v.3.2.0 dataset (ref. 11; www.cathdb.info), and the sequences therein exhibit no more than 60% pairwise sequence identity. The subset utilized in this work was selected to contain only structures for which reliable B factors are given. It contains 5,719 structures.

Fourier Analysis. The details of the Fourier approach have been extensively described in previous work (10, 12–17). The methods which were developed to study the 10 static property factors in that work carry over unchanged to the analysis of the dynamic (B-factor) sequence. As in previous work, the sequences are described in terms of the Z functions.
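The idea behind the Z functions, normalizing a Fourier coefficient against its distribution over permutations of the same sequence, can be illustrated numerically. The sketch below is not the authors' method: they evaluate the permutation mean and SD analytically (14), whereas this example substitutes a Monte Carlo estimate over random shuffles. The function names, the number of shuffles, the seed, and the 1/N normalization of the coefficient are all assumptions.

```python
import math
import random
import statistics


def cos_coefficient(values, k):
    """Cosine Fourier coefficient at wave number k (assumed 1/N normalization)."""
    n = len(values)
    return sum(v * math.cos(2 * math.pi * k * j / n) for j, v in enumerate(values)) / n


def permutation_z(values, k, n_perm=2000, seed=0):
    """Monte Carlo estimate of (c_k - <c_k>) / sigma(c_k) over random permutations.

    k = 0 is rejected: that coefficient does not depend on residue order,
    so its permutation SD is zero and the ratio is undefined.
    """
    if k == 0:
        raise ValueError("the k = 0 coefficient is permutation-invariant")
    rng = random.Random(seed)
    observed = cos_coefficient(values, k)
    shuffled = list(values)
    samples = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        samples.append(cos_coefficient(shuffled, k))
    return (observed - statistics.mean(samples)) / statistics.pstdev(samples)
```

A dynamic sequence with a strong periodicity at wave number k gives a large positive value, because almost no random reordering of the same residues reproduces that periodicity; a sequence with no particular order gives a value near zero.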

Table 4. Values of the residue-specific average B factor

Amino acid   N        Average B factor   σ
ALA          70,827   26.27              18.58
ASP          49,858   29.89              20.36
CYS          12,640   26.27              18.60
GLU          57,792   31.92              21.21
PHE          34,385   25.31              17.48
GLY          65,405   27.45              19.18
HIS          19,436   26.98              18.99
ILE          48,947   25.86              17.92
LYS          51,027   31.30              20.83
LEU          77,160   26.88              18.44
MET          18,205   27.65              19.42
ASN          37,037   28.52              20.14
PRO          39,407   28.77              19.58
GLN          31,524   29.54              20.65
ARG          42,687   29.47              20.16
SER          51,065   28.32              19.64
THR          48,400   27.06              19.32
VAL          62,648   25.67              18.03
TRP          11,966   23.98              16.41
TYR          30,558   25.07              17.41

N is the number of occurrences of each amino acid, the average B factor is the value of the residue-specific average B factor, and σ is the associated SD.

Fig. 3. The organization of the structure space S, arising from the structure–distance function in Eq. 3. Helical proteins are shown in red, sheet/barrel proteins in blue, and mixed-α/β proteins in black.

The principal-component transformation of the structure coordinates (E_L, E_0, E_R, A_R) is

\[
\begin{aligned}
\Xi_1 &= -(0.515)E_L - (0.518)E_0 - (0.34)E_R + (0.592)A_R \\
\Xi_2 &= -(0.099)E_L - (0.358)E_0 + (0.92)E_R + (0.128)A_R \\
\Xi_3 &= -(0.780)E_L + (0.6)E_0 + (0.158)E_R - (0.065)A_R
\end{aligned}
\tag{2}
\]

Each protein is now represented in structure space by a three-vector Ξ = (Ξ_1, Ξ_2, Ξ_3). The three principal components account for 100% of the variance of the four coordinates.

A meaningful representation of structure space should exhibit a physically sensible separation between proteins belonging to the three structural classes. In Fig. 3 this is shown to be the case. We find that the two extremes of S are occupied by helical (C = 1) and sheet/barrel (C = 2) structures, and the region intermediate between those extremes contains mixed-α/β (C = 3) structures, which are indeed structurally intermediate between the two other classes. The organization of structure space is precisely that which would be expected on the basis of physical intuition.

With this result established, we use a straightforward Euclidean structure distance function. The structural distance Ω between proteins P and Q is given by

\[ \Omega(P,Q) = \left[ \sum_{i=1}^{3} \left( \Xi_i(P) - \Xi_i(Q) \right)^2 \right]^{1/2}. \tag{3} \]

ANOVA. In the present paper, we are interested in determining at which values of the wave-number k distributions of Fourier coefficients of the dynamic sequences differ with statistical significance between the three structural classes. We therefore carry out an ANOVA comparison of these distributions, in the same manner as our previous work (17). Only structure groups with at least 20 representatives were used in the analysis, in order to guarantee statistical reliability. We require that differences between distributions be significant with P < 0.001. It is found that differences at this level of significance exist at 11 wave numbers, which are shown in Table 5. These 11 wave numbers are used to construct the dynamic distance function discussed next.

Table 5. Values of k included in dynamic distance function

k    sin cos
0    x
2    X
3    X
4    X
5    X
15   x
18   X
46   x
54   x
55   x
60   x

Dynamic Distance Function. The distance Δ between two proteins P and Q in the dynamic space D is written as a weighted Euclidean function:

\[ \Delta(P,Q,\{w_i\}) = \left[ \sum_{i=1}^{11} w_i \left( Z_i(P) - Z_i(Q) \right)^2 \right]^{1/2}, \tag{4} \]

where the Z_i are defined in Eq. 1, the k values indexed by i are shown in Table 5, and the weighting factors w_i take the values 0 or 1.

Structure–Dynamic Correlation. We calculated the correlation coefficient R(S_m, D_m; {w_i}) between Δ(P,Q,{w_i}) and Ω(P,Q) (Eqs. 3 and 4) for all proteins m in the data set, and all possible weighting sets {w_i | w_i = 0,1; i = 1,...,11}, a total of 2,047 sets (excluding the trivial set {w_i | w_i = 0 ∀ i}). The results of that calculation are shown in Figs. 1 and 2. It should be noted that, given the size of our dataset, values of |R(S_m, D_m; {w_i})| ≥ 0.03 are statistically significant, for any {w_i}.

Data Availability. All study data are included in the article.

ACKNOWLEDGMENTS. We thank Dr. Gia Maisuradze and Dr. Khatuna Kachlishvili for enlightening and helpful discussions. We thank the referees for helpful comments, which have contributed to this work. This work was supported by NIH/NIGMS Grant GM14312.

1. H. A. Scheraga, S. Rackovsky, Sequence-specific dynamic information in proteins. Proteins 87, 799–804 (2019).
2. Z. Sun, Q. Liu, G. Qu, Y. Feng, M. T. Reetz, Utility of B-factors in protein science: Interpreting rigidity, flexibility, and internal motion and engineering thermostability. Chem. Rev. 119, 1626–1665 (2019).
3. A. Kidera, Y. Konishi, M. Oka, T. Ooi, H. A. Scheraga, Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem. 4, 23–55 (1985).
4. A. Kidera, Y. Konishi, T. Ooi, H. A. Scheraga, Relation between sequence similarity and structural similarity in proteins: Role of important properties of amino acids. J. Protein Chem. 4, 265–297 (1985).
5. Y. He et al., Sequence-, structure-, and dynamics-based comparisons of structurally homologous CheY-like proteins. Proc. Natl. Acad. Sci. U.S.A. 114, 1578–1583 (2017).
6. H. A. Scheraga, S. Rackovsky, Homolog detection using global sequence properties suggests an alternate view of structural encoding in protein sequences. Proc. Natl. Acad. Sci. U.S.A. 111, 5225–5229 (2014).
7. S. Rackovsky, Quantitative organization of the known protein x-ray structures. I. Methods and short-length-scale results. Proteins 7, 378–402 (1990).
8. N. L. Dawson et al., CATH: An expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295 (2017).
9. V. E. Johnson, Revised standards for statistical evidence. Proc. Natl. Acad. Sci. U.S.A. 110, 19313–19317 (2013).
10. H. A. Scheraga, S. Rackovsky, Global informatics and physical property selection in protein sequences. Proc. Natl. Acad. Sci. U.S.A. 113, 1808–1810 (2016).
11. C. A. Orengo et al., CATH–A hierarchic classification of structures. Structure 5, 1093–1108 (1997).
12. S. Rackovsky, “Hidden” sequence periodicities and protein architecture. Proc. Natl. Acad. Sci. U.S.A. 95, 8580–8584 (1998).
13. S. Rackovsky, Characterization of architecture signals in proteins. J. Phys. Chem. B 110, 18771–18778 (2006).
14. S. Rackovsky, Sequence physical properties encode the global organization of protein structure space. Proc. Natl. Acad. Sci. U.S.A. 106, 14345–14348 (2009).
15. S. Rackovsky, Global characteristics of protein sequences and their implications. Proc. Natl. Acad. Sci. U.S.A. 107, 8623–8626 (2010).
16. S. Rackovsky, Spectral analysis of a protein conformational switch. Phys. Rev. Lett. 106, 248101 (2011).
17. S. Rackovsky, Sequence determinants of protein architecture. Proteins 81, 1681–1685 (2013).
