Biological Meaning of Nonlinear Mapping of Substitution Matrices in Metric Spaces

Biological Meaning of Nonlinear Mapping of Substitution Matrices in Metric Spaces Juan Mendez and Javier Lorenzo Institute of Intelligent Systems(SIANI) Univ. Las Palmas de Gran Canaria 35017 Las Palmas, Canary Islands, Spain Email: [email protected] Abstract—Based on the Information Theory, a distance func- substitution matrices collected from sets of related sequences tion between amino acids is induced to model the affinity are statistical summaries of the evolution process[8]; they show relationship between them. Tables with mapping coordinates have that similar amino acids tend to replace each other more often been obtained by using nonlinear multidimensional scaling for different dimensions. These mapping coordinates are meaningless than dissimilar ones in accordance with the rather neutral virtual data, but a high relationship with physical and chemical action of most mutations, as asserted by the neutral theory of properties is found. The main conclusion is that the number evolution[1]. Yang [9] determines an amino acid substitution of effective characteristics involved in substitution matrices is model from a codon substitution model at DNA level by using low. The hypothesis that hydrophobicity and secondary struc- a Markov model. His conclusion is that a relationship exits ture propensities are very important characteristics involved in substitution matrices is reinforced by the analysis of the results. between the acceptance rates of probabilities –from which the substitution matrices are computed – and the chemical Index Terms—Substitution Matrices; Molecular Evolution; properties, but this relationship may not be simple. Principal Component Analysis; Multidimensional Scaling; Infor- The conservational or homeostatic nature of physical-che- mation Theory mical characteristics in mutational process has also been considered [10] by focussing on the amino acid isoelectric I. INTRODUCTION point, which combines both electric and spatial information. The main assumption of the neutral theory of evolution[1] Those authors study the effect of co-ordinated substitution is that mutations at the molecular level are mainly neutral or (pairs of related amino acid substitutions) and the correlation weakly disadvantageous. Moreover, it asserts that substitution with protein changes in physical-chemical properties. The rates in low biologically relevant sites in proteins are greater study concludes that co-ordinated substitutions play a com- than the active rates. The role of physical-chemical properties pensatory role. The relationship between mutation and place in evolutionary models has been emphasized by considering of amino acids in the protein was analyzed by Gromiha and that local mutations tend to conserve important properties of Oobatake[11]. They studied the correlation between stability the protein and that the knowledge implicitly contained in sub- changes caused in buried amino acids and 48 physicochemical stitution matrices is related to the properties of protein domains and conformal properties. rather the amino acids themselves[2]. This approach has been The study of the relationship between physical-chemical extended for the detection of protein domains in the space of parameters in structure-dependent matrices[12] finds a high property sequence instead of the primary sequence[3][4][5]. correlation between the matrices and some parameters, par- Most models of substitution matrices are computed as log- ticularly for hydrophobicity and amino acid volume. Both of odds ratio of sequence probabilities. This is related to the them seem to be important factors for the protein folding and concept of relative entropy or mutual information between in its functionality. Volume is important due to the physical two sequences. Intuitively, sequences with high homological constraints in the torsion angles of the peptide chain, while relationship have high levels of common information coming hydrophobicity is important in the interactions of residues from their ancestor, while distantly related sequences retain with the aqueous environment. Koshi and Goldstein[12] use low levels of common information. Formally, the relationship the correlation between the matrix ¢Qab = jQ(a) ¡ Q(b)j between substitution matrices and information theory has been related to a characteristic Q of amino acids, such as the presented by Altschul[6] and is based on the statistical proper- contained in the AAindex database[13], and the functional ties associated with general score matrices[7]. This theoretical ©ab(M) where M is the substitution matrix. The matrix approach provides a sound framework for substitution matrices ¢Qab is a distance in the characteristic domain, but they that is used to propose an amino acid distance based on the use the functional ©ab(M) = ln(MabMba). Since ¢Qaa = evolutionary data. ¢Qbb = 0 and that in the general case, ©aa 6= ©bb, some Amino acids can be organized into physical-chemical perturbations could be introduced in the correlation values. A groups with similar relationship between their members. The more homogeneous correlation must be between ¢Q and some functional verifying ©aa(M) = 0 which is more similar to a of this distance is very simple and heuristically inferible, it distance. In this paper is suggested that some type of amino can be obtained from the sound theoretical framework of acid distance is a more correct measure to be correlated with information theory. ¢Q, but it is proposed that this distance be obtained from Protein sequences can be formally represented as random substitution matrices due to their biological richness and the distributions of symbols contained in the set A of amino acid. soundness of their theoretical background. Let qab be the probability distribution on the A£A alphabet of A set of orthogonal linear amino acid characteristics for the alignment of pairs of homologous sequences in a defined clustering purposes has been obtained[14] from the eigenvec- biological framework. Let pa be the probability distribution on tors of the PCA selection procedure from a wide characteristic the projection of the pairwise distribution: A £ A ! A. This set. They are significant among the characteristic set itself, projection destroys the information about the probabilities of but there are doubts about whether they are significant in the pair (a; b), so that the probability qab cannot be recovered relationship with the substitution matrices. Also, an amino acid from pa. The choice of the set of homologous sequences used index was proposed to maximize its relationship to structural to compute the distribution qab defines the biological context properties form a Machine Learning approach [15] for the computation of alignments. Different biological envi- Venkatarajan and Brown [16] have represented the animo ronments are possible, as defined in the different BLOSUM or acids in a high dimensional space of 237 properties, and PAM matrices. The product value papb 6= qab is the probability by using Principal Component Analysis(PCA) have used of a pair taken from two independent random sequences, while multidimensional scaling to generate linear mapping. Their qab is the probability among related sequences, thus it contains conclusion is that five-dimensional property space can be the information about these relations. The mutual information, constructed such that the amino acids are in a similar spatial I(A; B), between two sequences A and B is defined as [19], distribution to that in the original high-dimensional property [20]: space, since the distances computed for pairs of amino acids I(A; B) = H(A) + H(B) ¡ H(A; B) (1) in the five-dimensional property space are highly correlated with corresponding scores from similarity matrices where H(A; B) is the entropy of A £ A. Both, H(A) and The converse approach to map the characteristic based H(B) are the entropy of random sequences with the same multidimensional representation of amino acids is to map the probability distribution, consequently they become identical evolutionary distance of amino acid. There are previous works to the entropy of A. The mutual information is computed as: X in mapping substitutions matrices in virtual spaces from the qab I(A; B) = qab log (2) papb Pattern Matching perspective in information retrieval[17]. An ab amino acid distance can be obtained from the Information This expression also can be interpreted as the relative Theory by mining the knowledge about the dimensionality of entropy – or Kullback-Leibler distance– between distributions substitution matrices. The proposed distance is a evolutionary q of related pairs and p p of independent pairs. A goal of distance between amino acids; it is related to a biological ab a b alignment statistics is to define useful tables that capture the environment, as general/local as the substitution matrix itself. biological significance of a set of related sequences. A substi- The mapping of this distance in a multidimensional space tution matrix s(a; b) is introduced as the log-odds between the provides knowledge about the intrinsic dimensionality of this relationship probability q and the independent probability information. However, the distance used by some autors[18] ab p p , so that the mutual information becomes the expected is heuristic. The main goal of this paper is to propose an well a b P value of this matrix: I(A; B) = E[s(a; b)] = q s(a; b). found distance based on the Information Theory and also an ab ab The substitution matrix is the additional information needed attempt to identify the meaning of some coordinates provided to relate both probabilities. Thus, it can be interpreted as the by the nonlinear mapping. information lost in the projection A £ A ! A, such as[7]: The remainder of this paper is organized in sections cov- ¸s(a;b) ering the methods, results and discussion. In the methods qab = papbe (3) section, the concept of distance entropy is used to generate an amino acid distance based on the Information Theory. where ¸ is introduced according to the base of the logarithm. The results and discussion sections discuss about the meaning Similarly to the the mutual information between sequence of some mapping coordinates.

Biological Meaning of Nonlinear Mapping of Substitution Matrices in Metric Spaces

Optimal Matching Distances Between Categorical Sequences: Distortion and Inferences by Permutation Juan P

Sequence Motifs, Correlations and Structural Mapping of Evolutionary

Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh

Development of Novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in Msas

3D Representations of Amino Acids—Applications to Protein Sequence Comparison and Classiﬁcation

Assume an F84 Substitution Model with Nucleotide Frequ

Amino Acid Substitution Matrices from an Information Theoretic Perspective

Comparison Theorems for Splittings of M-Matrices in (Block) Hessenberg

Pathological Rate Matrices: from Primates to Pathogens

3D Deep Convolutional Neural Networks for Amino Acid Environment Similarity Analysis Wen Torng1 and Russ B

Multiplicatively Closed Markov Models Must Form Lie Algebras

Metzlerian and Generalized Metzlerian Matrices: Some Properties and Economic Applications