Sequence physical properties encode the global organization of protein structure space

S. Rackovsky1

Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine of NYU, One Gustave L. Levy Place, New York, NY 10029

Edited by Harold A. Scheraga, Cornell University, Ithaca, NY, and approved July 7, 2009 (received for review April 3, 2009)

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

Author contribution: S.R. designed research, performed research, analyzed data, and wrote the paper.
The author declares no conflict of interest.
This article is a PNAS Direct Submission.
1 E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0903433106/DCSupplemental.

It is demonstrated that, properly represented, the amino acid composition of protein sequences contains the information necessary to delineate the global properties of protein structure space. A numerical representation of amino acid sequence in terms of a set of property factors is used, and the values of those property factors are averaged over individual sequences and then over sets of sequences belonging to structurally defined groups. These sequence sets then can be viewed as points in a 10-dimensional space, and the organization of that space, determined only by sequence properties, is similar at both local and global scales to that of the space of protein structures determined previously.

proteomics | sequence analysis | sequence–structure relationship

Evaluating the degree of structural homology between protein sequences is a significant outstanding problem in biomedical research. That this problem remains open is apparent from the persistence of interest in the "remote homolog" problem: the observation that in any reasonably large group of sequences that fold to a specified, common architecture, there will be pairs of sequences that are not related by any currently known criterion.

Accurate methods for structural homology detection depend on an understanding of the sequence code underlying fold selection. There are intriguing hints that this code may be less complex than once thought. That the average amino acid compositions of proteins can give reasonably accurate classifications of structural class (1–3) and fold family (4–12) is well known. But a classification scheme gives no information about quantitative relationships between the classes under consideration, because it reflects only local details of the underlying space of protein structures. Quantitating those relationships requires a sequence-based metric function capable of objectively measuring the distance between 2 arbitrarily selected classes.

We have delineated the organization of structure space in previous work (13–15). Those results were obtained using only structural data, with no reference to sequence information. The picture that emerged, subsequently verified by Yee and Dill (16), is that of a structure gradient in which all-helical structures are concentrated at one extreme of the space and all-sheet/barrel structures are concentrated at the other extreme, with mixed alpha/beta structures in the intervening region.

In the present work we demonstrate that when protein sequences are represented appropriately, the average amino acid properties of those sequences encode a similar picture of the global organization of protein sequence space. We demonstrate the existence of a metric function, based entirely on sequence properties, that reproduces the known characteristics of structure space.

Sequence Model

Rigorous determination of the characteristics of protein sequences requires that they be analyzed numerically, using a representation that is both complete and nonredundant. Representations that rely on arbitrarily chosen sets of physical properties of the amino acids generally are both incomplete and correlated. This problem was addressed by Kidera and coworkers (17, 18), who performed a factor analysis on all available sets of physical properties of the 20 amino acids. They demonstrated that all of these data can be represented by a set of 10 property factors, which together carry 86% of the variance of the entire property database. Therefore, to a very good approximation, an amino acid X can be represented numerically as a 10-vector,

$$X = (f_1^X, f_2^X, \ldots, f_{10}^X) \qquad [1]$$

It follows that an N-residue sequence can be written as a set of 10 numerical strings of length N, each of which describes the variation of one of the property factors along the length of the protein. The property factors are linearly independent by construction, and therefore the 10 strings together give a complete, uncorrelated description of the physical properties of the sequence. The definitions of the property factors are given in supporting information (SI) Table S1.

Applying this representation requires a database of protein sequences. We have constructed a very large set of sequences taken from the CATH database (19, 20). The organization of CATH is ideal for this investigation, because domains are organized in a hierarchical fashion based (in order of increasing detail) on class, architecture, topology, and homology. In the present work, we wanted to use a comprehensive sequence/structure database that reflects the composition of the entire Protein Data Bank, rather than relying on selection criteria. A primary consideration in avoiding biased results is eliminating sequences with a high degree of similarity from the database. We therefore began with a subset of the entire CATH database, CathDomainSeqs.S35.ATOM.v3.1.020, which was selected by the CATH curators to be representative of the entire database while containing no pairs of sequences with sequence identity exceeding 35%. This value is generally considered to mark the lower limit of sequence relatedness, and thus our working database is composed entirely of sequence pairs that are in the "twilight zone." It contains no pairs that can be considered homologs in the traditional sense.

We further adjusted the database by removing all sequences with missing residues and all sequences with fewer than 60 amino acids. We were left with a data set of 7,056 sequences known to be complete and unrelated by any standard criterion. The highest level of the CATH hierarchy consists of 4 classes. C = 1 contains all-helical structures, C = 2 contains sheet/barrel structures, and C = 3 contains mixed alpha/beta structures. The very small class C = 4 (73 sequences) contains proteins whose only common feature is a lack of regular structure, and is not considered in this work. Our final database contained 1,538 sequences with C = 1, 1,690 sequences with C = 2, and 3,755 sequences with C = 3.

www.pnas.org/cgi/doi/10.1073/pnas.0903433106, PNAS Early Edition

Sequences ranged in length from 60 to 1146 aa, and the total number of residues in the database was 1,114,667.

The sets of sequences in which we are interested here are those characterized by common values of the 3 identifiers C, A, and T. These are sequences known to fold to similar architectures but for which no specification of homology is given. We restrict our attention to those CAT classes in the database that have at least 20 members. There are 59 such classes, constituting 6% of the 980 CAT classes in the database. These contain a total of 4,319 sequences (60% of the sequences in the database). The groups included in the present study are shown in Table S2. It should be noted that this database is significantly larger than the databases used in earlier work (13, 16).

For every sequence S in the database, we can define the sequence-averaged value of the mth property factor,

$$\langle f^{(m)} \rangle_S = \frac{1}{N_S} \sum_{n=1}^{N_S} f_n^{(m)} \qquad [2]$$

where N_S is the number of residues in the sequence. We can further average these quantities over the set of N_Q sequences that belong to some predefined set {Q},

$$\langle f^{(m)} \rangle_{\{Q\}} = \frac{1}{N_Q} \sum_{T \in \{Q\}} \langle f^{(m)} \rangle_T \qquad [3]$$

The N_Q sequences in {Q} are then represented by the 10-vector of averaged property factors,

$$Q = \left( \langle f^{(1)} \rangle_{\{Q\}}, \langle f^{(2)} \rangle_{\{Q\}}, \ldots, \langle f^{(10)} \rangle_{\{Q\}} \right) \qquad [4]$$

We refer to this as the averaged property factor (APF) representation of the sequence class Q. It should be noted that the sequence-averaged property factors in eq. (2) are the k = 0 Fourier transforms of the 10 numerical strings that together represent the sequence (21, 22). This observation provides a direction for further generalization of the results, through inclusion of higher Fourier components in the analysis of sequence space.

The 10-vectors Q_CAT for the 59 CAT classes can be thought of as the position vectors of these classes in 10-space. To understand the relationships between classes established by the Euclidean metric inherent in the APF representation, we need to visualize the distribution of the corresponding points. We therefore performed a principal components analysis (PCA) (23) of the 10-vectors.

Results

The PCA results are summarized in Table 1. The first 3 eigenvectors carry a total of 67.2% of the variance of the entire data set; therefore, a low-dimensional representation of the structure space is both feasible and meaningful. Each of the principal components includes contributions from all 10 property factors. The principal components are listed in Table 2.

Table 1. Eigenvalues of the principal eigenvectors

Eigenvector   Eigenvalue   Variance proportion (%)
e1            4.078        40.8
e2            1.426        14.3
e3            1.211        12.1
e4            0.949         9.5
e5            0.791         7.9
e6            0.541         5.4
e7            0.446         4.5
e8            0.272         2.7
e9            0.198         2.0
e10           0.088         0.9

A projection of the APF space onto the first 3 eigenvectors of the PCA is shown in Fig. 1. It can be seen that the distribution of CAT groups, identified by structural class (i.e., by the value of the CATH classifier C), is isomorphic to that obtained from purely structural considerations, in that the all-helical and all-sheet/barrel groups occupy opposite extremes of the space, separated by alpha/beta structures. To make this observation quantitative, hyperplanes separating the regions corresponding to the 3 C classes were determined, using a minimum squared error (MSE) algorithm (23). The ability of these hyperplanes (which are defined in Table S3) to separate the classes is summarized in Table 3. It can be seen that the separation between classes, although not perfect, is very clean. The relatively few misclassified groups arise from an inability of fairly simplistic, unoptimized hyperplane classifiers to completely separate the points in the 3 regions, and the misclassifications are entirely consistent with the large-scale structure of the space. P values were calculated for the observed distributions of groups with all 3 C values, and all satisfy P < 0.0001. Optimization of the hyperplanes, or use of a more flexible separation function, may produce a perfect classification of the 59 CAT groups.

A related question of potential interest is the predictive power of this approach. As a preliminary test, a test data set (Table S4) was constructed comprising 60 CAT groups from the original database that have between 10 and 19 members. By construction, the sequences in this data set have no more than 35% pairwise sequence identity with each other and with the sequences in the 59-group development set. This data set is expected to be a challenging test

Table 2. The principal components

        ⟨f1⟩     ⟨f2⟩     ⟨f3⟩     ⟨f4⟩     ⟨f5⟩     ⟨f6⟩     ⟨f7⟩     ⟨f8⟩     ⟨f9⟩    ⟨f10⟩        C
U1     2.228   −2.388    2.584   −2.286    3.945    3.387   −4.775    3.279   −4.297   −1.229    1.426
U2     0.73     2.351    0.902    8.174    2.676    9.352    3.010   −2.851   −2.563  −17.233    3.218
U3    −2.391   −5.221   −3.168    3.897  −12.529   −5.277   −0.341    7.730   −7.368   −8.474   −5.181
U4    −1.203   −2.079    9.038   −0.529   −4.457   −9.396  −11.939  −10.586    3.692  −10.838   −3.366
U5     1.662   −2.123   −9.205  −10.71    −8.622    8.833   −3.014   −3.695    8.812  −14.255    3.23
U6     0.74     1.276    9.144    1.925  −25.742   15.313    6.211   −4.466   −5.774   13.654    2.01
U7    −1.126   −7.337    2.402   −8.596    9.415   −4.77    27.633  −10.987   −9.914   −6.868   −1.137
U8    −0.207    0.912   15.02    −4.479   −2.24    −1.763   17.849   22.849   23.702  −16.13    −0.544
U9   −11.163  −16.34     1.492    9.107   13.122   25.787   −5.281   −2.872   16.958   16.329    4.358
U10   23.277  −17.543   −4.64    26.19   −12.113  −14.568   14.858  −10.044   24.143   18.283   −9.952

The general form of the ith principal component is $U_i = \sum_{n=1}^{10} a_{in} \langle f^{(n)} \rangle$, where $a_{in}$ is the (i,n)th element of the table and $\langle f^{(n)} \rangle$ is the average of the nth property factor.
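The averaging in Eqs. 2 to 4 and the weighted sum in the Table 2 footnote can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code: the function names and the toy input values are assumptions, and the column labelled C in Table 2 is left out because the footnote formula does not say how it enters.

```python
from typing import List, Sequence


def sequence_average(strings: Sequence[Sequence[float]]) -> List[float]:
    """Eq. 2: mean of each property-factor string over the residues of one sequence."""
    return [sum(string) / len(string) for string in strings]


def group_average(sequence_means: Sequence[Sequence[float]]) -> List[float]:
    """Eqs. 3 and 4: mean of the per-sequence averages over all sequences in a group."""
    n_sequences = len(sequence_means)
    n_factors = len(sequence_means[0])
    return [sum(v[m] for v in sequence_means) / n_sequences for m in range(n_factors)]


def principal_component(coefficients: Sequence[float], apf: Sequence[float]) -> float:
    """Table 2 footnote: U_i as the coefficient-weighted sum of averaged property factors."""
    if len(coefficients) != len(apf):
        raise ValueError("coefficient row and APF vector must have the same length")
    return sum(a * f for a, f in zip(coefficients, apf))


# Row U1 of Table 2, columns <f1>..<f10> (column C omitted, see the note above).
U1_COEFFICIENTS = [2.228, -2.388, 2.584, -2.286, 3.945,
                   3.387, -4.775, 3.279, -4.297, -1.229]

# HYPOTHETICAL toy group of two "sequences", each given as 10 property-factor strings.
seq_a = [[1.0, 3.0] for _ in range(10)]       # every factor averages to 2.0
seq_b = [[0.0, 0.0, 0.0] for _ in range(10)]  # every factor averages to 0.0

apf = group_average([sequence_average(seq_a), sequence_average(seq_b)])
u1 = principal_component(U1_COEFFICIENTS, apf)
```

Note that the group average in Eq. 3 weights every sequence equally, regardless of its length, which is why the toy group above averages to 1.0 for each factor.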


Fig. 1. Projection of the points representing the 59 CAT groups onto the first 3 principal components (see text). Groups for which C = 1 are coded green, those for which C = 2 are coded red, and those for which C = 3 are coded yellow.
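The kind of analysis behind Table 1 and this projection can be illustrated with a small, dependency-free sketch. It finds only the leading principal direction, by power iteration on the covariance matrix, and reports the variance proportion as the eigenvalue divided by the total variance. The function names and the 2-dimensional toy points are assumptions for illustration. The eigenvalues in Table 1 sum to 10, which is what a PCA of standardized variables would give; the sketch uses the plain covariance matrix for brevity.

```python
import math
from typing import List, Sequence, Tuple


def column_means(points: Sequence[Sequence[float]]) -> List[float]:
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def covariance_matrix(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix (divisor n - 1) of the rows of `points`."""
    n, d = len(points), len(points[0])
    means = column_means(points)
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(matrix: Sequence[Sequence[float]],
                      iterations: int = 200) -> Tuple[float, List[float]]:
    """Largest eigenvalue and unit eigenvector of a symmetric matrix by power iteration."""
    d = len(matrix)
    vector = [1.0 / math.sqrt(d)] * d  # start vector; assumed not orthogonal to the answer
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    eigenvalue = sum(vector[i] * sum(matrix[i][j] * vector[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, vector


def project(points: Sequence[Sequence[float]], direction: Sequence[float]) -> List[float]:
    """Coordinate of each mean-centered point along `direction`."""
    means = column_means(points)
    return [sum((p[j] - means[j]) * direction[j] for j in range(len(direction)))
            for p in points]


# HYPOTHETICAL toy data: 4 points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
cov = covariance_matrix(points)
eigenvalue, direction = leading_eigenpair(cov)
variance_proportion = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = project(points, direction)
```

Because the toy points are collinear, the leading component carries all of the variance; for the 59 APF vectors the corresponding figure for e1 in Table 1 is 40.8%.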

of any classification procedure for two reasons: (i) The small size of the CAT groups makes the averages over sequence properties in eq. (3) less reliable, and (ii) the disjunction between CAT classes in the 2 data sets guarantees that the groups in the test database differ significantly from those in the development set.

Application of the MSE hyperplanes that classify the development set to the classification of the CAT groups in the test set gives an overall accuracy of 67.8% (Table 4). A more sophisticated classification was then carried out using a support vector machine (SVM), with an RBF (Gaussian) kernel, giving an overall accuracy of 81.7%. It is of interest to compare this result to other recent results on sequence classification. This comparison is complicated by 2 factors: (i) A wide spectrum of methods was used, some of which combine multiple classes of preoptimized descriptive properties whose statistical independence has not been investigated, and (ii) the training and testing sequence databases on which those studies are based differ widely in both size and difficulty.

A recent study comparing results on class prediction using 16 different methods found accuracies between 77% and 99.5% (24). The data sets used in that study were those of Chou (25) and of Zhou (26), containing 204–498 sequences. The present work is based on a much larger data set, and, because it is directed toward delineating the global structure of sequence space, classifies CAT groups rather than individual sequences. More importantly, the sequence descriptors are not optimized for correspondence with a preexisting classification. Nevertheless, the SVM results are consonant with previous results.

Further confirmation of the power of the APF representation comes from a complete-linkage clustering of the 59 CAT groups. This gives a set of superclusters, the members of which are CAT groups, each containing at least 20 sequences. Cluster compositions at the 7-supercluster level are given in Table 5. Almost

Table 3. Classification of the 59 CAT groups

Class    Number of CAT groups   CAT groups correctly classified   Number of proteins   Proteins correctly classified
C = 1    16                     14 (88%)                          762                  703 (92%)
C = 2    14                     13 (93%)                          1,220                1,182 (97%)
C = 3    29                     26 (90%)                          2,337                2,163 (93%)

Table 4. Classification of the 60 CAT groups in the test set

Class    Number of CAT groups   CAT groups correctly classified   Number of proteins   Proteins correctly classified
C = 1    14                     9 (64%)                           195                  129 (66%)
C = 2    11                     6 (55%)                           145                  82 (56%)
C = 3    35                     25 (71%)                          481                  352 (73%)
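For readers unfamiliar with minimum squared error classifiers, the sketch below shows one common two-class formulation: the hyperplane weights are chosen to minimize the summed squared difference between a linear score and target labels of +1 and −1, by solving the normal equations. It is a hedged illustration only. The function names and toy points are assumptions, the text does not say how the 3 classes were arranged into hyperplanes, and the actual hyperplanes of Table S3 are not reproduced here.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mse_hyperplane(points: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> List[float]:
    """Least-squares weights (bias last) minimizing sum((w . x + w0 - label)^2)."""
    rows = [list(p) + [1.0] for p in points]  # append 1.0 so the bias is a weight
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def classify(weights: Sequence[float], point: Sequence[float]) -> int:
    """Return +1 or -1 according to the side of the hyperplane the point falls on."""
    score = sum(w * x for w, x in zip(weights, list(point) + [1.0]))
    return 1 if score >= 0 else -1


# HYPOTHETICAL toy data: two well-separated clusters in 2 dimensions.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
          (-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

weights = fit_mse_hyperplane(points, labels)
predictions = [classify(weights, p) for p in points]
```

In the setting of Tables 3 and 4, the points would be the 10-dimensional APF vectors of the CAT groups and the labels would come from the CATH class C; an unoptimized linear fit of this kind is not guaranteed to separate every group, which is consistent with the small number of misclassifications reported.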

every cluster is dominated by one value of C, indicating that the APF parameters encode information capable of distinguishing the structure classes. At the same time, the clusters straddle the borders between classes in a manner consistent with the large-scale structure of sequence space, as revealed by the PCA and shown in Fig. 1.

Table 5. Complete-linkage cluster composition

Cluster number   Fraction (C = 1)   Fraction (C = 2)   Fraction (C = 3)
1                0.0                0.9                0.1
2                1.0                0.0                0.0
3                0.5                0.0                0.5
4                0.83               0.08               0.08
5                0.0                0.67               0.33
6                0.07               0.0                0.93
7                0.09               0.0                0.91

Discussion

It should be emphasized that our results were obtained without using structural information, and that the chemical data used did not include the actual sequences of amino acids along the chain, only sequence- and group-averaged values of the amino acid property factors. Clearly, when amino acid physical properties are appropriately represented, their averages encode not only membership in fold families, but also the global organization of protein structure space.

We have demonstrated an unexpectedly simple connection between chemical constitution and structure in proteins. We have also shown that the principal components can be used as a metric function to quantitate the differences between groups. Further explorations of the implications of this metric are underway.

ACKNOWLEDGMENTS. I thank Professor Igor Kuznetsov for very helpful discussions. This work was supported by the National Library of Medicine of the National Institutes of Health (Grant LM06789).

1. Nakashima H, Nishikawa K, Ooi T (1986) The folding type of a protein is relevant to the amino acid composition. J Biochem 99:153–162.
2. Wang Z-X, Yuan Z (2000) How good is prediction of protein structural class by the component-coupled method? Proteins Struct Funct Genet 38:165–175.
3. Du Q-S, Jiang Z-Q, He W-Z, Li D-P, Chou K-C (2006) Amino acid principal component analysis (AAPCA) and its applications in protein structural class prediction. J Biomol Struct Dyn 23:635–640.
4. van Heel M (1991) A new family of powerful multivariate statistical sequence analysis techniques. J Mol Biol 220:877–997.
5. Reczko M, Bohr H (1994) The DEF database of sequence-based protein fold class predictions. Nucleic Acids Res 22:3616–3619.
6. Dubchak I, Muchnik I, Holbrook SR, Kim S-H (1995) Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 92:8700–8704.
7. Hobohm W, Sander C (1995) A sequence property approach to searching protein databases. J Mol Biol 255:390–399.
8. Ding CHQ, Dubchak I (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17:349–358.
9. Edler L, Grassmann J, Suhai S (2001) Role and results of statistical methods in protein fold class prediction. Math Comput Model 33:1401–1417.
10. Shen H-B, Chou K-C (2006) Ensemble classifier for protein fold pattern recognition. Bioinformatics 22:1717–1722.
11. Ofran Y, Margalit H (2006) Proteins of the same fold and unrelated sequences have similar amino acid composition. Proteins Struct Funct Bioinform 64:275–279.
12. Taguchi Y-H, Gromiha MM (2007) Application of amino acid occurrence for discriminating different types of globular proteins. BMC Bioinform 8:404.
13. Rackovsky S (1990) Quantitative organization of the known protein x-ray structures, I: Methods and short length-scale results. Proteins Struct Funct Genet 7:378–402.
14. Rackovsky S (1990) Quantitative classification of the known protein x-ray structures. Polymer Preprints 31:205.
15. Rackovsky S (2006) Classification of protein sequences and structures. The Proteomics Protocols Handbook, ed Walker JM (Humana Press, Totowa, NJ), pp 861–874.
16. Yee DP, Dill KA (1993) Families and the structural relatedness among globular proteins. Protein Sci 2:884–899.
17. Kidera A, Konishi Y, Oka M, Ooi T, Scheraga HA (1985) Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem 4:23–55.
18. Kidera A, Konishi Y, Ooi T, Scheraga HA (1985) Relation between sequence similarity and structural similarity in proteins: Role of important properties of amino acids. J Protein Chem 4:265–297.
19. Orengo CA, et al. (1997) CATH: A hierarchic classification of protein domain structures. Structure 5:1093–1108.
20. Available at http://cathwww.biochem.ucl.ac.uk/latest/index.html. Accessed May 9, 2007.
21. Rackovsky S (1998) "Hidden" sequence periodicities and protein architecture. Proc Natl Acad Sci USA 95:8580–8584.
22. Rackovsky S (2006) Characterization of architecture signals in proteins. J Phys Chem B 110:18771–18778.
23. Duda RO, Hart PE, Stork DG (2001) Pattern Classification (Wiley-Interscience, New York), 2nd Ed.
24. Li Z-C, Zhou X-B, Lin Y-R, Zou X-Y (2008) Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 35:581–590.
25. Chou KC (1999) A key driving force in determination of protein structure classes. Biochem Biophys Res Commun 264:216–224.
26. Zhou GP (1998) An intriguing controversy over protein structural class prediction. J Protein Chem 17:729–738.
