Sequence Physical Properties Encode the Global Organization of Protein Structure Space
S. Rackovsky¹

Department of Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine of NYU, One Gustave L. Levy Place, New York, NY 10029

Edited by Harold A. Scheraga, Cornell University, Ithaca, NY, and approved July 7, 2009 (received for review April 3, 2009)

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

It is demonstrated that, properly represented, the amino acid composition of protein sequences contains the information necessary to delineate the global properties of protein structure space. A numerical representation of amino acid sequence in terms of a set of property factors is used, and the values of those property factors are averaged over individual sequences and then over sets of sequences belonging to structurally defined groups. These sequence sets then can be viewed as points in a 10-dimensional space, and the organization of that space, determined only by sequence properties, is similar at both local and global scales to that of the space of protein structures determined previously.

proteomics | sequence analysis | sequence–structure relationship

Evaluating the degree of structural homology between protein sequences is a significant outstanding problem in biomedical research. That this problem remains open is apparent from the persistence of interest in the "remote homolog" problem: the observation that in any reasonably large group of sequences that fold to a specified, common architecture, there will be pairs of sequences that are not related by any currently known criterion.

Accurate methods for structural homology detection depend on an understanding of the sequence code underlying fold selection. There are intriguing hints that this code may be less complex than once thought. That the average amino acid compositions of proteins can give reasonably accurate classifications of structural class (1–3) and fold family (4–12) is well known. But a classification scheme gives no information about quantitative relationships between the classes under consideration, because it reflects only local details of the underlying space of protein structures. Quantitating those relationships requires a sequence-based metric function capable of objectively measuring the distance between 2 arbitrarily selected classes.

We have delineated the organization of structure space in previous work (13–15). Those results were obtained using only structural data, with no reference to sequence information. The picture that emerged, subsequently verified by Yee and Dill (16), is that of a structure gradient in which all-helical structures are concentrated at one extreme of the space and all-sheet/barrel structures are concentrated at the other extreme, with mixed alpha/beta structures in the intervening region.

In the present work we demonstrate that when protein sequences are represented appropriately, the average amino acid properties of those sequences encode a similar picture of the global organization of protein sequence space. We demonstrate the existence of a metric function, based entirely on sequence properties, that reproduces the known characteristics of structure space.

Sequence Model

Rigorous determination of the characteristics of protein sequences requires that they be analyzed numerically, using a representation that is both complete and nonredundant. Representations that rely on arbitrarily chosen sets of physical properties of the amino acids generally are both incomplete and correlated. This problem was addressed by Kidera and coworkers (17, 18), who performed a factor analysis on all available sets of physical properties of the 20 amino acids. They demonstrated that all of these data can be represented by a set of 10 property factors, which together carry 86% of the variance of the entire property database. Therefore, to a very good approximation, an amino acid X can be represented numerically as a 10-vector,

\[ X = \left( f_1^{X}, f_2^{X}, \ldots, f_{10}^{X} \right) \tag{1} \]

It follows that an N-residue sequence can be written as a set of 10 numerical strings of length N, each of which describes the variation of one of the property factors along the length of the protein. The property factors are linearly independent by construction, and therefore the 10 strings together give a complete, uncorrelated description of the physical properties of the sequence. The definitions of the property factors are given in supporting information (SI) Table S1.

Applying this representation requires a database of protein sequences. We have constructed a very large set of sequences taken from the CATH database (19, 20). The organization of CATH is ideal for this investigation, because domains are organized in a hierarchical fashion based (in order of increasing detail) on class, architecture, topology, and homology. In the present work, we wanted to use a comprehensive sequence/structure database that reflects the composition of the entire Protein Data Bank, rather than relying on selection criteria. A primary consideration in avoiding biased results is eliminating sequences with a high degree of similarity from the database. We therefore began with a subset of the entire CATH database, CathDomainSeqs.S35.ATOM.v3.1.020, which was selected by the CATH curators to be representative of the entire database while containing no pairs of sequences with sequence identity exceeding 35%. This value is generally considered to mark the lower limit of sequence relatedness, and thus our working database is composed entirely of sequence pairs that are in the "twilight zone." It contains no pairs that can be considered homologs in the traditional sense.

We further adjusted the database by removing all sequences with missing residues and all sequences with fewer than 60 amino acids. We were left with a data set of 7,056 sequences known to be complete and unrelated by any standard criterion. The highest level of the CATH hierarchy consists of 4 classes. C = 1 contains all-helical structures, C = 2 contains sheet/barrel structures, and C = 3 contains mixed alpha/beta structures. The very small class C = 4 (73 sequences) contains proteins whose only common feature is a lack of regular structure, and is not considered in this work. Our final database contained 1,538 sequences with C = 1, 1,690 sequences with C = 2, and 3,755 sequences with C = 3. Sequences ranged in length from 60 to 1,146 aa, and the total number of residues in the database was 1,114,667.

The sets of sequences in which we are interested here are those characterized by common values of the 3 identifiers C, A, and T. These are sequences known to fold to similar architectures but for which no specification of sequence homology is given. We restrict our attention to those CAT classes in the database that have at least 20 members. There are 59 such classes, constituting 6% of the 980 CAT classes in the database. These contain a total of 4,319 sequences (60% of the sequences in the database). The groups included in the present study are shown in Table S2. It should be noted that this database is significantly larger than the databases used in earlier work (13, 16).

For every sequence S in the database, we can define the sequence-averaged value of the mth property factor,

\[ \left\langle f^{(m)} \right\rangle_S = \frac{1}{N_S} \sum_{n=1}^{N_S} f_n^{(m)} \tag{2} \]

... 0 Fourier transforms of the 10 numerical strings that together represent the sequence (21, 22). This observation provides a direction for further generalization of the results, through inclusion of higher Fourier components in the analysis of sequence space.

The 10-vectors Q_CAT for the 59 CAT classes can be thought of as the position vectors of these classes in 10-space. To understand the relationships between classes established by the Euclidean metric inherent in the APF representation, we need to visualize the distribution of the corresponding points. We therefore performed a principal components analysis (PCA) (23) of the 10-vectors.

Results

The PCA results are summarized in Table 1. The first 3 eigenvectors carry a total of 67.2% of the variance of the entire data set; therefore, a low-dimensional representation of the structure space is both feasible and meaningful. Each of the principal components includes contributions from all 10 property factors. The principal components are listed in Table 2.

Table 1. Eigenvalues of the principal eigenvectors

Eigenvector   Eigenvalue   Variance proportion (%)
e1            4.078        40.8
e2            1.426        14.3
e3            1.211        12.1
e4            0.949         9.5
e5            0.791         7.9
e6            0.541         5.4
e7            0.446         4.5
e8            0.272         2.7
e9            0.198         2.0
e10           0.088         0.9

A projection of the APF space onto the first 3 eigenvectors of the PCA is shown in Fig. 1. It can be seen that the distribution of CAT groups, identified by structural class (i.e., by the value of the CATH classifier C), is isomorphic to that obtained from purely structural considerations, in that the all-helical and all-sheet/barrel groups occupy opposite extremes of the space, separated by alpha/beta structures. To make this observation quantitative, hyperplanes separating the regions corresponding to the 3 C classes were determined, using a minimum squared error (MSE) algorithm (23). The ability of these hyperplanes (which are defined in Table S3) to separate the classes is summarized in Table 3. It can be seen that the separation between classes, although not perfect, is very clean.

Author contribution: S.R. designed research, performed research, analyzed data, and wrote the paper.
The author declares no conflict of interest.
This article is a PNAS Direct Submission.
¹E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0903433106/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.0903433106 (PNAS Early Edition)