An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids Through PCA

Interdisciplinary Bio Central Open Access IBC 2014;6:3, 1-10 • DOI: 10.4051/ibc.2014.6.4.0003 FULL REPORT An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids through PCA Youngki You1, Inhwan Jang1, Kyungro Lee2, Heonjoo Kim1, Kwanhee Lee1,* 1School of Life Science, Handong Global University, Pohang, Korea 2Department of Biotechnology Yonsei, University, Seoul, Korea SYNOPSIS Subject areas; Bioinformatics/Computational biology/Molecular modeling Amino acid substitution matrices are essential tools for protein sequence analysis, ho- Author contribution; The research is mainly mology sequence search in protein databases and multiple sequence alignment. The done by Y. Y., who did most of the work. H. Kim PAM matrix was the first widely used amino acid substitution matrix. The BLOSUM se- was in charge of statistical work. ries then succeeded the PAM matrix. Most substitution matrixes were developed by us- *Correspondence and requests for materials ing the statistical frequency of substitution between each amino acid at blocks repre- should be addressed to K.L. (khlee@handong. senting groups of protein families or related proteins. However, substitution of amino edu). acids is based on the similarity of physiochemical properties of each amino acid. In this Editor; Keun Woo Lee, Gyeongsang National study, a new approach was used to obtain major physiochemical properties in multiple University, Korea sequence alignment. Frequency of amino acid substitution in multiple sequence align- Received February 25, 2014; ment database and selected attributes of amino acids in physiochemical properties da- August 29, 2014; Accepted tabase were merged. This merged data showed the major physiochemical properties Published November 05, 2014 through principle components analysis. Using factor analysis, these four principle com- Citation; You, Y., et al. An Approach for a ponents were interpreted as flexibility of electronic movement, polarity, negative Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids charge and structural flexibility. Applying these four components, BAPS was construct- through PCA. IBC 2014, 6:3, 1-10. ed and validated for accuracy. When comparing receiver operated characteristic (ROC50) doi: 10.4051/ibc. 2014.6.4.0003 values, BAPS scored slightly lower than BLOSUM and PAM. However, when evaluating Funding; The research was funded by Handong for accuracy by comparing results from multiple sequence alignment with the structural Global University. alignment results of two test data sets with known three-dimensional structure in the Competing interest; All authors declare no homologous structure alignment database, the result of the test for BAPS was compar- financial or personal conflict that could atively equivalent or better than results for prior matrices including PAM, Gonnet, inappropriately bias their experiments or Identity and Genetic code matrix. writing. © You, Y. et al. This is an Open Access article distrib- uted under the terms of the Creative Commons At- tribution Non-Commercial License (http://creative- commons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and Key Words: BAPS; factor analysis; principle component analysis; scoring matrix; se- reproduction in any medium, provided the original work is properly cited. quence alignment www.ibc7.org 1 IBC 2014;6:3 • DOI: 10.4051/ibc.2014.6.4.0003 Interdisciplinary Bio Central You Y, et al. INTRODUCTION base in comparison to data sets from the sequence database continued to be a reoccurring problem. New methods using Protein sequence alignment is a method used to find homology substitution matrices were studied14-18. among different proteins and is consequently a powerful tool This study fused the substitution frequency between amino for predicting protein structure and function. Homology was acids from PROSITE and the data of physiochemical properties based on specific groups according to specific physiochemical of amino acids. The fused resulting data reflects key amino acid properties of amino acids. Substitution matrices, which de- physicochemical properties to align protein sequences. The scribe the relationship between each amino acid quantitatively, number of dimensions decreased through principle compo- were studied to improve accuracy1-6. nent analysis. And the resulting principle components were ap- PAM and BLOSUM matrices are commonly used alignment plied to make a substitution matrix. The new substitution ma- programs. PAM matrix is a mutation probability matrix based trix, Block based Amino acid Physiochemical properties Substi- on evolutionary distance1,7. BLOSUM series was constructed tution matrix (BAPS), shows the utility of this study. Several using the substitution frequency between amino acids from major potential physiochemical properties are selected to make PROSITE, which contains information concerning conserved the matrix. BAPS was then compared with seven other matrices 8 regions between related proteins in a protein family . PAM se- by receiver operated characteristic (ROC50) values and structur- ries, BLOSUM series and most substitution matrices use only al alignment tests. statistical information from the sequence alignment database. However, this statistical information is limited to sequence RESULTS AND DISCUSSION alignment data and can be further improved using structural information. When comparing sequences, a matrix based on Factor analysis with principle component analysis structural information is essential for accurate and efficient The transform value of the 20 amino acid physiochemical prop- predictions concerning protein structure and function9,10. erties (Table 1) for the 387,240 columns was determined by fac- Structure based matrices were developed to satisfy this re- tor analysis. The transform value signifies the influence of the quirement11,12. However, these matrices were constrained by 20 physiochemical properties on each column. Using principle structural data set size. The structure block matrix was con- component analysis (PCA), four out of 20 resulting factors were structed in an attempt to overcome this problem13. Although, selected according to eigenvalue and scree plot in Figure 1. Fac- this matrix considered even more properties involved in pro- tors preceding drastic slope changes on the scree plot and with tein conformation, the lack of data sets in the structure data- eigenvalues greater than 1 were chosen. As shown in Table 2, factor loading scores for the four factors were then rotated by Table 1. Selected attribute of amino acid to the structure and function varimax. The four factors consist of a set of unique properties based on high absolute loading scores. Accession Data description number 9 BEGF750101 Conformational parameter of inner helix (Beghin-Dirkx, 1975) BEGF750102 Conformational parameter of beta-structure (Beghin-Dirkx, 1975) 8 BEGF750103 Conformational parameter of beta-turn (Beghin-Dirkx, 1975) BHAR880101 Average flexibility indices (Bhaskaran-Ponnuswamy, 1988) 7 CHAM810101 Steric parameter (Charton, 1981) CHAM820101 Polarizability parameter (Charton-Charton, 1982) 6 CHAM830107 A parameter of charge transfer capability (Charton-Charton, 1983) CHAM830108 A parameter of charge transfer donor capability 5 (Charton-Charton, 1983) CHOC760102 Residue accessible surface area in folded protein (Chothia, 1976) 4 DAWD720101 Size (Dawson, 1972) Eigenvalues DAYM780201 Relative mutability (Dayhoff et al., 1978b) 3 EISD860101 Solvation free energy (Eisenberg-McLachlan, 1986) FASG760101 Molecular weight (Fasman, 1976) 2 FASG760103 Optical rotation (Fasman, 1976) FAUJ880108 Localized electrical effect (Fauchere et al., 1988) 1 FAUJ880111 Positive charge (Fauchere et al., 1988) FAUJ880112 Negative charge (Fauchere et al., 1988) HUTJ700101 Heat capacity (Hutchens, 1970) 0 5 10 15 20 HUTJ700102 Absolute entropy (Hutchens, 1970) PRAM900101 Hydrophobicity (Prabhakaran, 1990) Figure 1. Scree plot of principle component analysis. www.ibc7.org 2 IBC 2014;6:3 • DOI: 10.4051/ibc.2014.6.4.0003 Interdisciplinary Bio Central You Y, et al. Factor I has high factor loading scores in size, absolute entropy, bility are attributes with high factor loading scores for Factor III. polarizability parameter, and molecular weight. High scores in These properties are strengthened in the abundance of elec- size and molecular weight indicate association between Factor I trons. Although not to the same degree, Factor III also shares and volume. Factor I can also be expected to relate to electron attributes with Factor II. motion within a molecule due to high factor load scores in abso- Factor IV reflects structural flexibility. Optical rotation, con- lute entropy and polarizability parameter. Accordingly, Factor I formational parameter of inner helix, conformational parame- represents flexibility of electron movement in relation to volume. ter of beta-turn and steric parameter are characteristics directly Factor II signifies polarity. Factor II had high absolute scores related to secondary and tertiary structure. Scores for optical in residue accessible surface area in folded protein, solvation rotation and conformational parameter of inner helix are high, free energy, conformational parameter of beta-structure, aver- and values for conformational parameter of beta-turn (inverse- age flexibility indices, positive charge, and negative charge. ly proportion to factor IV) are low, when a structure is flexible.

An Approach for a Substitution Matrix Based on Protein Blocks and Physicochemical Properties of Amino Acids Through PCA

Optimal Matching Distances Between Categorical Sequences: Distortion and Inferences by Permutation Juan P

Bioinformatics 1: Lecture 3

BASS: Approximate Search on Large String Databases

Sequence Motifs, Correlations and Structural Mapping of Evolutionary

Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh

Development of Novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in Msas

3D Representations of Amino Acids—Applications to Protein Sequence Comparison and Classiﬁcation

B.Sc. (Hons.) Biotech BIOT 3013 Unit-5 Satarudra P Singh

Parameter Advising for Multiple Sequence Alignment

Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices

Deriving Amino Acid Exchange Matrices (II) and Multiple Sequence Alignment (I) Summarysummary Dayhoff’Sdayhoff’S PAMPAM--Matricesmatrices

Assume an F84 Substitution Model with Nucleotide Frequ