DIMACS Technical Report 2008-16 December 2008

Probabilistic evolutionary model for substitution matrices of PAM and BLOSUM families

by

Valentina V. Sulimova
Tula State University
Lenin Ave. 92, 300600 Tula, Russia
[email protected]

Vadim V. Mottl
Computing Center of the Russian Academy of Sciences
Vavilov St. 40, 119333 Moscow, Russia
[email protected]

Casimir A. Kulikowski
Department of Computer Science, Rutgers University
New Brunswick, New Jersey 08903
[email protected]

Ilya B. Muchnik
DIMACS, Rutgers University
New Brunswick, New Jersey 08903
[email protected]

DIMACS is a collaborative project of Rutgers University, Princeton University, AT&T Labs – Research, Bell Labs, NEC Laboratories America and Telcordia Technologies, as well as affiliate members Avaya Labs, HP Labs, IBM Research, Microsoft Research, Stevens Institute of Technology, Georgia Institute of Technology and Rensselaer Polytechnic Institute. DIMACS was founded as an NSF Science and Technology Center.

Abstract

Background: Almost all problems of protein sequence analysis must inevitably be based on comparing the types of amino acids from which protein sequences are composed. Similarities between amino acids are most commonly expressed by two families of substitution matrices derived from very different approaches: the evolution-based substitution matrices of the PAM (Point Accepted Mutation) family, inferred from phylogenetic trees, and the BLOSUM substitution matrices, which are statistically inferred from multiple alignments of groups of related proteins and which, according to their authors, S. and J. Henikoff, are essentially different from the PAM family of matrices.

Results: In this paper we prove that the statistical approach for computing substitution matrices of the BLOSUM family can be explained in terms of the PAM evolutionary model. This means that both of these approaches are actually based on the same type of evolutionary model, and the main difference between them lies in the different initial data used for estimating their unknown model parameters. We also show that all PAM substitution matrices can be represented as kernel functions in their mathematical structure, and lose their positive semi-definiteness only because of the choice of final representation.

Conclusions: The fact that the PAM and BLOSUM substitution matrices are originally positive semi-definite allows them to be easily used for constructing kernels over a set of proteins, so these similarity measures can be applied without correction and without loss of biological meaning. Furthermore, any new substitution matrix will automatically be a kernel if, first, it is estimated by either the Dayhoff or the Henikoff technique and, second, the final representation proposed in the present research is adopted.


Contents

Abstract
1  Introduction
2  Dayhoff’s PAM model and its evolutionary assumptions
3  Similarity measures between amino acids on the basis of Dayhoff’s model and its interpretation
4  Dayhoff’s approach to estimating unknown parameters of the PAM model
5  Family of BLOSUM substitution matrices and statistical basis of the Henikoff derivation
6  Dayhoff’s model of evolution for BLOSUM
7  Mathematical properties of substitution matrices of the PAM and BLOSUM families: kernels in the set of amino acids on the basis of Dayhoff’s model of evolution
8  Results and discussion
9  Conclusions
References

1 Introduction

One of the fundamental issues in bioinformatics is the problem of choosing a quantitative measure of similarity or dissimilarity between amino-acid sequences, which in turn has to be based on measuring the similarities between the amino acids that compose them. Amino acid similarity is traditionally described by a 20×20 square matrix of pairwise similarity values called a substitution matrix. For protein analysis these substitution matrices should be consistent with the best available knowledge about the evolution of protein amino acid sequences.

The first and most commonly adopted similarity measure involves the family of Point Accepted Mutation (PAM) substitution matrices, introduced by Margaret Dayhoff [1] and based on a probabilistic model of evolution. The parameters of this model are estimated from empirical data in the form of phylogenetic trees for families of closely related proteins.

The second, more popular family of substitution matrices, whose advantages have been confirmed in a number of practical cases, was introduced by Steven and Jorja Henikoff and called BLOSUM (BLOcks SUbstitution Matrices) [2]. According to the authors, these matrices differ significantly from matrices of the PAM family in that they are inferred from local multiple alignments within protein families, which produce blocks of conserved fragments of amino acid sequences. BLOSUM directly calculates the frequencies with which different amino acids appear at the same positions in an extracted block of similar sequence fragments, requiring no knowledge of phylogeny but only the results of the alignment.

In this paper we show that the family of BLOSUM substitution matrices can be explained in terms of Dayhoff’s evolutionary model, as was done for PAM. From this analysis one can see that the Henikoffs’ statistical approach, applied to their particular BLOCKS data, gives exactly the same result which one would have obtained from Dayhoff’s model using knowledge of an appropriate hierarchical structure.

The Henikoffs’ work, together with our explanation of how the same result can be derived through the application of Dayhoff’s technique, strongly suggests that it is possible and useful to calculate adaptive substitution matrices which specify the substitution scale for a particular group of proteins. For instance, the COG server [3] provides a very important classification if one is particularly interested in a specific group of clusters such that each is found in the same group of organisms. These COG clusters taken together can be considered as a separate group characterized by a particular substitution matrix; the specificity arises from having already restricted attention to a particular set of organisms. It is then very likely that reconstructing COG clusters from this group (and only within this group) will allow a more detailed clustering to be built. Further, if one is interested in building a classification rule to distinguish a pair of protein functional families, ignoring all other protein families, then one can determine a substitution matrix specific to the two families. Determining such specific substitution matrices has been discussed over the last few years [4, 5], yet all such attempts have so far been restricted by heuristic statistical assumptions. By showing that BLOSUM matrices can be derived in a way that satisfies the rigorous Dayhoff PAM model of evolution, this research allows such work to be recast on a solid theoretical base.
In parallel with the discussion of Dayhoff’s evolutionary model, which is the fundamental basis for constructing substitution matrices, we investigate some of their algebraic properties. Specifically, we investigate the conditions of their derivation which guarantee that these matrices are positive semi-definite. We have analyzed this aspect in greatest detail, encouraged by the opportunity of building automatic classifiers of protein families by Support Vector Machine (SVM) methods [6, 7, 8]. These methods allow one to construct classifiers at any level of complexity if one is able to build a similarity function for pairs of objects which possesses the property of an inner product in some linear vector space. Such functions are traditionally called kernel functions. From our perspective, this means that a similarity score for pairs of proteins has to satisfy a particular formal property. In the last seven years many bioinformatics researchers have devoted much time and effort to building such kernels [9, 10, 11, 12, 13, 14, 15, 16]. These procedures are, on the whole, based on similarity functions for amino acids, which serve as the principal foundation for constructing a function for amino acid sequences with the same property of being positive semi-definite. Yet the substitution matrices of the PAM and BLOSUM families in their traditional representation have negative eigenvalues, which has led to numerous attempts at correcting them; these have resulted either in the loss of their biological meaning [9, 10] or in revised matrices whose positive semi-definiteness was not guaranteed [11].

The present paper gives a very clear picture of the conditions under which a similarity function over the set of amino acids satisfies the positive semi-definiteness property. It proves that the conditions which determine the positive semi-definiteness of a substitution matrix are completely characterized by a few natural and simple properties of Dayhoff’s probabilistic evolutionary model.

2 Dayhoff’s PAM model and its evolutionary assumptions

This section describes the basic probabilistic Dayhoff evolutionary model in an alternative, non-standard form which is most convenient for further analysis and comparison with the BLOSUM approach.

To measure the similarity or dissimilarity between amino-acid sequences requires that one adopt a representation for measuring the similarity of their constituent amino acids. Let $A = \{\alpha(1), \ldots, \alpha(20)\}$ be the finite set of the 20 natural amino acids (AAs). A measure of similarity between two amino acids $\alpha^i, \alpha^j \in A$ is usually understood as their predisposition to be mutually exchanged by mutation from one protein sequence to another over evolutionary time, and the simplest and most commonly applied probabilistic model of mutation is Dayhoff’s PAM.

The key hypothesis underlying this model is that evolutionary changes in the amino-acid sequence of a protein are the result of random independent mutations at separate points of the AA chain, and that the mutations observed today were those "accepted" as a result of subsequent natural selection.

The PAM model represents these predispositions of AAs towards mutual mutative transformations as a square matrix of conditional probabilities
$$\Psi = (\psi_{ij},\; i,j = 1,\ldots,n), \qquad \psi_{ij} = \psi(\alpha^j \mid \alpha^i), \quad \alpha^i, \alpha^j \in A, \quad n = 20, \qquad (1)$$
which are interpreted as the probabilities that at the next step of evolution amino acid $\alpha^i$ will transform into amino acid $\alpha^j$. So,
$$\sum_{\alpha^j \in A} \psi(\alpha^j \mid \alpha^i) = 1 \quad \text{for each } \alpha^i \in A.$$

The main mathematical notion used in Dayhoff’s model is that of a Markov chain $h_s$, $s = 1, 2, 3, \ldots$, of evolution of an amino acid, which is completely defined by the matrix of probabilities (1) and applied independently at every separate point of an AA chain. It is further assumed that this Markov chain represents an ergodic random process, which is characterized by a final probability distribution $\xi(\alpha^j)$:
$$\sum_{\alpha^i \in A} \xi(\alpha^i)\,\psi(\alpha^j \mid \alpha^i) = \xi(\alpha^j).$$

This formalizes the assumption that the evolutionary process started very long ago, so that the random process of mutations has settled in and no longer depends on the unknown (and empirically unknowable) initial probabilities of states.

The second fundamental supposition of the PAM model is that its Markov chain is reversible, i.e. that the condition
$$\xi(\alpha^i)\,\psi(\alpha^j \mid \alpha^i) = \xi(\alpha^j)\,\psi(\alpha^i \mid \alpha^j)$$
holds true. This implies that it is impossible to determine by observation which of two amino acids is the ancestor and which is the descendant. While these are simplifications from a biological perspective, PAM models have proven remarkably good at predicting observed evolutionary transformations in sequences of rapidly reproducing organisms, and they make derivations of probabilities of transformation from one AA chain to another mathematically tractable.

3 Similarity measures between amino acids on the basis of Dayhoff’s model and its interpretation

This section gives a more mathematically rigorous introduction of the similarity functions for pairs of amino acids based on Dayhoff’s evolutionary model and of their possible probabilistic interpretations.

Let $A = \{\alpha(1), \ldots, \alpha(20)\}$, as before, be the finite set of amino acids, which are the states of an ergodic and reversible Markov chain with conditional probabilities $\psi_{ij} = \psi(\alpha^j \mid \alpha^i)$ and final probability distribution $\xi(\alpha^j)$, $\alpha^j \in A$. Let us also define a two-step random transformation of amino acids $\alpha^i \to \vartheta \to \alpha^j$:

$$\psi_{[2]}(\alpha^j \mid \alpha^i) = \sum_{\vartheta \in A} \psi(\alpha^j \mid \vartheta)\,\psi(\vartheta \mid \alpha^i),$$
which defines a new Markov chain, thinned out in comparison with the initial one defined by the original transformation $\psi(\alpha^j \mid \alpha^i)$, which by contrast can be called the one-step transformation $\psi_{[1]}(\alpha^j \mid \alpha^i)$.

Theorem 1. The Markov random process produced by the two-step random transformation remains ergodic and reversible with the same final distribution $\xi(\alpha)$.

Proof.

First we show that the Markov process defined by the two-step transformation $\psi_{[2]}$ is ergodic with the same final distribution $\xi(\alpha)$. Since it can be represented as $\psi^{ij}_{[2]} = \sum_{l=1}^{n} \psi^{il}_{[1]}\psi^{lj}_{[1]}$, we have
$$\sum_{i=1}^{n} \xi^i \psi^{ij}_{[2]} = \sum_{i=1}^{n} \xi^i \sum_{l=1}^{n} \psi^{il}_{[1]}\psi^{lj}_{[1]} = \sum_{l=1}^{n} \Bigl( \sum_{i=1}^{n} \xi^i \psi^{il}_{[1]} \Bigr) \psi^{lj}_{[1]} = \sum_{l=1}^{n} \xi^l \psi^{lj}_{[1]} = \xi^j.$$
Thus, ergodicity is proved.

The proof of the reversibility of the two-step transformation follows from the reversibility of the original one-step transformation, $\xi^i \psi^{ij} = \xi^j \psi^{ji}$:
$$\xi^i \psi^{ij}_{[2]} = \xi^i \sum_{l=1}^{n} \psi^{il}\psi^{lj} = \sum_{l=1}^{n} \bigl(\xi^i \psi^{il}\bigr)\psi^{lj} = \sum_{l=1}^{n} \bigl(\xi^l \psi^{li}\bigr)\psi^{lj} = \sum_{l=1}^{n} \bigl(\xi^l \psi^{lj}\bigr)\psi^{li} = \sum_{l=1}^{n} \bigl(\xi^j \psi^{jl}\bigr)\psi^{li} = \xi^j \psi^{ji}_{[2]}.$$
This completes the proof.
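For concreteness, here is a minimal numerical sketch of these assumptions and of Theorem 1 on a hypothetical three-state chain standing in for the 20 amino acids; the matrix S and all names below are illustrative, not taken from the paper’s data.

```python
# A minimal sketch of the model assumptions and of Theorem 1 on a hypothetical
# 3-state chain standing in for the 20 amino acids (the matrix S is made up).
import numpy as np

# Any symmetric non-negative matrix S defines a reversible ergodic chain:
# psi(alpha^j | alpha^i) = S[i, j] / sum_j S[i, j],
# with final distribution xi(alpha^i) proportional to the row sums of S.
S = np.array([[4.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 2.0, 5.0]])
Psi = S / S.sum(axis=1)[:, None]     # conditional probabilities, rows sum to 1
xi = S.sum(axis=1) / S.sum()         # final (stationary) distribution

assert np.allclose(Psi.sum(axis=1), 1.0)                       # each row is a distribution
assert np.allclose(xi @ Psi, xi)                               # ergodic final distribution
assert np.allclose(xi[:, None] * Psi, (xi[:, None] * Psi).T)   # reversibility

# Theorem 1: the two-step chain Psi @ Psi has the same final distribution
# and is reversible with respect to it as well.
Psi2 = Psi @ Psi
assert np.allclose(xi @ Psi2, xi)
assert np.allclose(xi[:, None] * Psi2, (xi[:, None] * Psi2).T)
print("Theorem 1 verified numerically on the toy chain.")
```

The same checks apply verbatim to a 20×20 amino acid transition matrix.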

It is natural to evaluate the similarity of two amino acids by the probability that they resulted from two independent one-step branches of evolution descending from one and the same unknown amino acid:
$$K_{[2]}(\alpha^i, \alpha^j) = \sum_{k=1}^{n} \xi(\alpha^k)\,\psi_{[1]}(\alpha^i \mid \alpha^k)\,\psi_{[1]}(\alpha^j \mid \alpha^k). \qquad (2)$$
However, it is more convenient to represent this similarity measure in an alternative way, based on the reversibility of the Markov chain:
$$K_{[2]}(\alpha^i, \alpha^j) = \sum_{k=1}^{n} \underbrace{\xi(\alpha^k)\,\psi_{[1]}(\alpha^i \mid \alpha^k)}_{\xi(\alpha^i)\,\psi_{[1]}(\alpha^k \mid \alpha^i)} \psi_{[1]}(\alpha^j \mid \alpha^k) = \xi(\alpha^i) \underbrace{\sum_{k=1}^{n} \psi_{[1]}(\alpha^k \mid \alpha^i)\,\psi_{[1]}(\alpha^j \mid \alpha^k)}_{\psi_{[2]}(\alpha^j \mid \alpha^i)} = \xi(\alpha^i)\,\psi_{[2]}(\alpha^j \mid \alpha^i). \qquad (3)$$

Such a representation allows us to interpret $K_{[2]}(\alpha^i, \alpha^j)$ differently: it can be considered as the probability that two randomly chosen immediately adjacent positions of the Markov chain with transition probabilities $\psi_{[2]}(\alpha^j \mid \alpha^i)$ are occupied by the two preset states $(\alpha^i, \alpha^j)$.

Besides, in a number of cases it is very convenient to use the similarity measure (3) normalized by the final probabilities of the amino acids being compared:
$$\tilde K_{[2]}(\alpha^i, \alpha^j) = \frac{K_{[2]}(\alpha^i, \alpha^j)}{\xi(\alpha^i)\,\xi(\alpha^j)} = \frac{\xi(\alpha^i)\,\psi_{[2]}(\alpha^j \mid \alpha^i)}{\xi(\alpha^i)\,\xi(\alpha^j)} = \frac{\psi_{[2]}(\alpha^j \mid \alpha^i)}{\xi(\alpha^j)}. \qquad (4)$$
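A brief numerical sketch of (2)–(4), again on the hypothetical three-letter chain used above (all matrices here are illustrative):

```python
# A minimal sketch of the similarity measures (2)-(4) on the same hypothetical
# 3-letter chain as above (all numbers are illustrative).
import numpy as np

S = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 5.0]])
Psi = S / S.sum(axis=1)[:, None]     # psi(alpha^j | alpha^i)
xi = S.sum(axis=1) / S.sum()         # final distribution
n = len(xi)

# Eq. (2): K2[i, j] = sum_k xi[k] * psi(alpha^i | alpha^k) * psi(alpha^j | alpha^k)
K2 = sum(xi[k] * np.outer(Psi[k, :], Psi[k, :]) for k in range(n))

# Eq. (3): the same similarity equals xi[i] * psi_[2](alpha^j | alpha^i)
assert np.allclose(K2, xi[:, None] * (Psi @ Psi))

# Eq. (4): normalization by the final probabilities
K2_norm = K2 / np.outer(xi, xi)
assert np.allclose(K2_norm, (Psi @ Psi) / xi[None, :])
print(np.round(K2_norm, 3))          # symmetric, as reversibility implies
```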

4 Dayhoff’s approach to estimating unknown parameters of the PAM model

This section presents Dayhoff’s method for reconstructing the similarity functions (the PAM substitution matrices) from empirical data.

For inferring an estimate $\hat\Psi = (\hat\psi_{ij},\; i,j = 1,\ldots,20)$ of the unknown matrix of random point mutations $\Psi = (\psi_{ij},\; i,j = 1,\ldots,20)$, M. Dayhoff studied 34 protein superfamilies of closely related proteins (more than 85% identical to one another), grouped into 71 phylogenetic trees. From these data 1572 accepted point mutations were observed empirically, and a symmetric matrix $C = (C_{ij},\; i,j = 1,\ldots,20)$ of frequencies with which amino acid $\alpha^i$ is replaced by amino acid $\alpha^j$ was derived, together with a vector $m = (m_i,\; i = 1,\ldots,20)$ of so-called relative mutabilities of the amino acids $\alpha^i$, $i = 1,\ldots,20$, i.e. the probabilities that each amino acid will change within a given small evolutionary interval.

An estimate of the vector of occurrence frequencies of amino acids $\hat\xi = (\hat\xi(i),\; i = 1,\ldots,20)$ was also computed by Dayhoff from the accepted point mutation data.

The matrix $\hat\Psi = (\hat\psi_{ij},\; i,j = 1,\ldots,20)$ can be calculated as
$$\hat\psi_{ij} = \frac{\lambda\, m_j\, C_{ij}}{\sum_i C_{ij}} \;\text{ for } i \neq j, \qquad \hat\psi_{jj} = 1 - \lambda\, m_j \;\text{ for } i = j,$$
where $\lambda$ is a normalization parameter, equal for each column of the matrix $\hat\Psi$ and defined in such a way that
$$\sum_{i=1}^{20} \hat\xi(i)\,\bigl(1 - \hat\psi_{ii}\bigr) = 0.01, \qquad (5)$$
i.e. so that, on average, only one amino acid among 100 randomly chosen amino acids changes. The value $(1 - \hat\psi_{ii})$ is proportional to the mutability of the respective amino acid.

The conditional mutation probabilities $\hat\Psi$ satisfying (5) are said to define what is called an evolutionary distance of 1 PAM. The most widely used evolutionary distance is that of 250 PAM, associated with the 250th power of the 1 PAM matrix: $\hat\Psi_{[250]} = \underbrace{\hat\Psi \times \hat\Psi \times \cdots \times \hat\Psi}_{250}$.

If a 1 PAM matrix is multiplied by itself an infinite number of times, all columns of the resulting matrix become equal to one another and to the vector of occurrence frequencies of the amino acids $\hat\xi = (\hat\xi(i),\; i = 1,\ldots,20)$, which in terms of Dayhoff’s model of evolution is the vector of final probabilities.

The commonly adopted Dayhoff substitution matrices are computed from $\hat\Psi_{[m]}$ and $\hat\xi$ by the rule
$$d^{ij}_{[m]} = 10 \log_{10} \pi^{ij}_{[m]}, \qquad \pi^{ij}_{[m]} = \frac{\hat\psi^{ij}_{[m]}}{\hat\xi^j}, \qquad (6)$$
and then rounded to the nearest integer value. It should be noticed that the value under the logarithm has the same structure as the similarity measure (4).

5 Family of BLOSUM substitution matrices and statistical basis of the Henikoff derivation

This section describes the statistical method for determining the BLOSUM matrices (BLOcks SUbstitution Matrices) from empirical data, proposed by Steven and Jorja Henikoff in 1992 [2] and widely used for protein sequence alignments.

In contrast to the Dayhoff matrices, the BLOSUM matrices are inferred not from phylogenetic trees but from local multiple alignments containing much more diverse protein families (starting from a level of 45% sequence similarity). Such multiple alignments produce blocks of evolutionarily conserved amino acid sequence fragments without gaps. Each element of a BLOSUM matrix is computed as the log-odds ratio between the observed probability of occurrence of an amino acid pair within the blocks and the probability expected by chance.

Let us consider all blocks from the Henikoff BLOCKS database as one consolidated block consisting of $Q$ columns, column $k$ containing $N_k$ amino acids, $k = 1,\ldots,Q$.

For computing the square matrix of observed probabilities $p = (p^{ij},\; i,j = 1,\ldots,20)$ of occurrence of amino acid pairs in accordance with the Henikoff technique, one first counts pair frequencies for each unordered pair of amino acids $\alpha^i$ and $\alpha^j$ in each column $k$ of the consolidated block:
$$M_k^{ij} = \begin{cases} N_k^i (N_k^i - 1)/2, & i = j, \\ N_k^i N_k^j, & i \neq j, \end{cases} \qquad (7)$$
where $N_k^i$ and $N_k^j$ are the numbers of positions in column $k$ occupied by amino acids $\alpha^i$ and $\alpha^j$, respectively. Over all columns, the pair frequency for each unordered pair $(\alpha^i, \alpha^j)$ is
$$M^{ij} = \sum_{k=1}^{Q} M_k^{ij}, \qquad (8)$$
and the total number of amino acid pairs in all columns is
$$M = \sum_{k=1}^{Q} M_k = \sum_{k=1}^{Q} N_k (N_k - 1)/2. \qquad (9)$$
The probabilities $p^{ij}$ are computed as
$$p^{ij} = \frac{M^{ij}}{M}. \qquad (10)$$
The next step is to compute the expected probability of occurrence of amino acid $\alpha^i$ in an $(i,j)$ pair:
$$q^i = p^{ii} + \frac{1}{2} \sum_{j \neq i} p^{ij}. \qquad (11)$$

The resulting BLOSUM matrix of amino acid similarities $b = (b^{ij},\; i,j = 1,\ldots,20)$ is defined as
$$b^{ii} = 2 \log_2 \Bigl( \frac{p^{ii}}{q^i q^i} \Bigr) \;\text{ for } i = j \qquad \text{and} \qquad b^{ij} = 2 \log_2 \Bigl( \frac{p^{ij}}{2\, q^i q^j} \Bigr) \;\text{ for } i \neq j, \qquad (12)$$
and rounded to the nearest integer value.

There is a series of BLOSUM substitution matrices which differ from each other by the level of identity required of the proteins in the multiply aligned families. Thus the matrices BLOSUM 45, BLOSUM 50, BLOSUM 62 and BLOSUM 80 are computed from protein families with levels of similarity of 45%, 50%, 62% and 80%, respectively.

6 Dayhoff’s model of evolution for BLOSUM

The above derivation of BLOSUM substitution matrices comes from applying statistical methods to conserved amino acid blocks, and appears completely different in origin from the PAM substitution matrices. In this section we will show that BLOSUM expresses the same probabilistic model of amino acid mutations as PAM, based on the notion of an ergodic and reversible Markov chain defined by the matrix $\Psi = (\psi_{ij},\; i,j = 1,\ldots,20)$ of conditional transformation probabilities (1).

Let each column $k = 1,\ldots,Q$ of the consolidated block be associated with the hypothesis that all amino acids in it are produced by the same unknown amino acid $\vartheta$ (a one-step ancestor) as the result of independent random transformations with unknown conditional probabilities $\psi(\alpha \mid \vartheta)$ forming an ergodic and reversible Markov chain with final distribution $\xi(\alpha)$. Further, we assume that the ancestor of each column $k$ of the block is chosen independently in accordance with this distribution $\xi(\alpha)$.

The notion of an evolutionary step, which is so important in the PAM framework, does not apply here, but let us assume, for definiteness, that these conditional probabilities correspond to one step: $\psi(\alpha \mid \vartheta) = \psi_{[1]}(\alpha \mid \vartheta)$.

Let also $\Psi_{[2]} = (\psi^{ij}_{[2]},\; i,j = 1,\ldots,n)$ be the matrix of two-step conditional probabilities,
$$\psi^{ij}_{[2]} = \psi_{[2]}(\alpha^j \mid \alpha^i) = \sum_{l=1}^{n} \psi^{il}\psi^{lj},$$
which, in accordance with Theorem 1, defines an ergodic reversible Markov process with final distribution $\xi(\alpha)$. Then, in accordance with (3), the probability of occurrence of the amino acid pair $(\alpha^i, \alpha^j)$ can be expressed as
$$p^{ij} = p(\alpha^i, \alpha^j) = \xi^i \psi^{ij}_{[2]} = \xi^j \psi^{ji}_{[2]}. \qquad (13)$$

Theorem 2. The statistic (11) proposed by the Henikoffs is an unbiased estimate of $\xi^i$.

Proof. First, it should be noticed that the statistic (11) can be represented in a more convenient form:
$$q^i = p^{ii} + \frac{1}{2} \sum_{j \neq i} p^{ij} = \frac{M^{ii}}{M} + \frac{1}{2} \sum_{j \neq i} \frac{M^{ij}}{M} = \frac{\sum_{k=1}^{Q} N_k^i (N_k^i - 1) + \sum_{k=1}^{Q} N_k^i \sum_{j \neq i} N_k^j}{\sum_{k=1}^{Q} N_k (N_k - 1)} = \frac{\sum_{k=1}^{Q} N_k^i (N_k - 1)}{\sum_{k=1}^{Q} N_k (N_k - 1)} = \frac{\sum_{k=1}^{Q} N_k^i}{\sum_{k=1}^{Q} N_k}.$$
So,
$$q^i = N^i / N, \qquad (14)$$
where $N^i = \sum_{k=1}^{Q} N_k^i$ is the total number of occurrences of $\alpha^i$ in the consolidated block and $N = \sum_{k=1}^{Q} N_k$ is the total number of amino acids in it (the next-to-last equality is exact when all column depths $N_k$ coincide).

Let us show that $q^i = N^i / N$ is a maximum-likelihood estimate of the final probability $\xi^i$.

In accordance with the proposed model, the random variable $N_k^i$ is the number of occurrences of an event of probability $\xi^i$ in $N_k$ independent trials. This random variable follows a binomial distribution:
$$\eta(N_k^i \mid \xi^i) = C_{N_k}^{N_k^i} (\xi^i)^{N_k^i} (1 - \xi^i)^{N_k - N_k^i}.$$
The random variables $(N_k^i,\; k = 1,\ldots,Q)$ are independent in accordance with the accepted model:
$$\eta(N_k^i,\; k = 1,\ldots,Q \mid \xi^i) = \prod_{k=1}^{Q} \eta(N_k^i \mid \xi^i) = \prod_{k=1}^{Q} C_{N_k}^{N_k^i} (\xi^i)^{N_k^i} (1 - \xi^i)^{N_k - N_k^i}. \qquad (15)$$

For the observed values $(N_k^i,\; k = 1,\ldots,Q)$, the distribution (15) is the likelihood function with respect to the unknown value of the probability $\xi^i$. The maximum-likelihood estimate of this probability is defined by the expression
$$\hat\xi^i = \arg\max_{\xi^i} \log \eta(N_k^i,\; k = 1,\ldots,Q \mid \xi^i) = \arg\max_{\xi^i} \Bigl[ \log \prod_{k=1}^{Q} C_{N_k}^{N_k^i} + \sum_{k=1}^{Q} \bigl( N_k^i \log \xi^i + (N_k - N_k^i) \log(1 - \xi^i) \bigr) \Bigr]. \qquad (16)$$
The maximum of this likelihood function is found by setting its derivative to zero:
$$\frac{d}{d\xi^i} \log \eta(N_k^i,\; k = 1,\ldots,Q \mid \xi^i) = \sum_{k=1}^{Q} \Bigl( \frac{N_k^i}{\xi^i} - \frac{N_k - N_k^i}{1 - \xi^i} \Bigr) = \frac{1}{\xi^i (1 - \xi^i)} \sum_{k=1}^{Q} \bigl( N_k^i (1 - \xi^i) - (N_k - N_k^i)\,\xi^i \bigr) = \frac{1}{\xi^i (1 - \xi^i)} \Bigl( \sum_{k=1}^{Q} N_k^i - \xi^i \sum_{k=1}^{Q} N_k \Bigr) = 0,$$
i.e.
$$\sum_{k=1}^{Q} N_k^i - \Bigl( \sum_{k=1}^{Q} N_k \Bigr) \xi^i = 0.$$
So, the maximum-likelihood estimate for $\xi^i$ is
$$\hat\xi^i = \frac{\sum_{k=1}^{Q} N_k^i}{\sum_{k=1}^{Q} N_k} = \frac{N^i}{N}. \qquad (17)$$
Comparing with (14), we see that the estimate (17) is identical to the statistic (11) proposed by the Henikoffs. Furthermore, this estimate is unbiased:
$$E\bigl(\hat\xi^i\bigr) = E\Bigl( \frac{N^i}{N} \Bigr) = \frac{1}{N}\, E\Bigl( \sum_{k=1}^{Q} N_k^i \,\Big|\, \xi^i \Bigr) = \frac{1}{N} \sum_{k=1}^{Q} E\bigl( N_k^i \mid \xi^i \bigr).$$
Here $E(N_k^i \mid \xi^i)$ is the mean of the binomially distributed random variable $N_k^i$, i.e. $E(N_k^i \mid \xi^i) = N_k \xi^i$. So,
$$E\bigl(\hat\xi^i\bigr) = \frac{\sum_{k=1}^{Q} N_k \xi^i}{\sum_{k=1}^{Q} N_k} = \xi^i.$$
So, the Henikoffs’ statistic (11) is an unbiased maximum-likelihood estimate of $\xi^i$: $q^i = N^i/N = \hat\xi^i$. The proof is complete.
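As a small illustration of the counting formulas (7)–(11) and of the identity $q^i = N^i/N$ from Theorem 2, the following sketch processes a hypothetical consolidated block over a three-letter alphabet; all data are made up, and every column is given the same depth so that the identity holds exactly.

```python
# Illustration of the Henikoff counts (7)-(11) and of the identity q^i = N^i / N
# (Theorem 2) on a made-up consolidated block over the alphabet {0, 1, 2}.
# Every column has the same depth here, so the identity holds exactly.
import numpy as np
from itertools import combinations

n = 3
block = [[0, 0, 1, 2],        # each inner list is one column of the block
         [1, 1, 1, 0],
         [2, 2, 0, 2],
         [0, 1, 2, 2]]

M = np.zeros((n, n))          # unordered pair counts M^{ij}, eqs. (7)-(8)
for column in block:
    for a, b in combinations(column, 2):
        M[min(a, b), max(a, b)] += 1
M = M + np.triu(M, 1).T       # symmetric matrix of unordered pair counts

M_total = sum(len(c) * (len(c) - 1) // 2 for c in block)   # eq. (9)
p = M / M_total                                            # eq. (10)
q = np.array([p[i, i] + 0.5 * (p[i, :].sum() - p[i, i]) for i in range(n)])  # eq. (11)

# Theorem 2: q^i coincides with the plain residue frequency N^i / N
counts = np.bincount(np.concatenate([np.array(c) for c in block]), minlength=n)
assert np.allclose(q, counts / counts.sum())
print("q =", np.round(q, 3))
```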

Theorem 3. The statistic (10), taken as $M^{ij}/M$ for $i = j$ and $M^{ij}/(2M)$ for $i \neq j$, is an unbiased estimate of $p^{ij} = p(\alpha^i, \alpha^j)$:
$$\hat p^{ij} = \begin{cases} M^{ij}/M, & i = j, \\ M^{ij}/(2M), & i \neq j, \end{cases} \qquad E\bigl(\hat p^{ij} \mid \xi, \Psi\bigr) = p^{ij}. \qquad (18)$$
Proof. Let us consider the two cases $i = j$ and $i \neq j$ separately.

Let $i = j$. Then
$$\frac{M^{ii}}{M} = \frac{\sum_{k=1}^{Q} \tfrac{1}{2} N_k^i (N_k^i - 1)}{\tfrac{1}{2} \sum_{k=1}^{Q} N_k (N_k - 1)} = \frac{\sum_{k=1}^{Q} N_k^i (N_k^i - 1)}{\sum_{k=1}^{Q} N_k (N_k - 1)}.$$
The value of the random product $N_k^i (N_k^i - 1)$ in the $k$-th column of the consolidated block is associated with $n$ hypotheses about the random choice of the ancestor amino acid $a_k \in A = \{\alpha^l,\; l = 1,\ldots,n\}$ in accordance with the probability distribution $\xi = (\xi^l,\; l = 1,\ldots,n)$, therefore
$$E\bigl( N_k^i (N_k^i - 1) \mid \xi, \Psi \bigr) = \sum_{l=1}^{n} \xi^l\, E\bigl( N_k^i (N_k^i - 1) \mid N_k,\, a_k = \alpha^l,\, \psi^l \bigr) = \sum_{l=1}^{n} \xi^l \Bigl[ E\bigl( (N_k^i)^2 \mid a_k = \alpha^l,\, \psi^l \bigr) - E\bigl( N_k^i \mid a_k = \alpha^l,\, \psi^l \bigr) \Bigr]. \qquad (19)$$
Given the ancestor $a_k = \alpha^l$, the random variable $N_k^i$ is binomially distributed with success probability $\psi^{li}$, the probability of the amino acid $\alpha^i$ occurring in each of the $N_k$ independent trials. The mean of this distribution is the success probability $\psi^{li}$ multiplied by the number of trials $N_k$:
$$E\bigl( N_k^i \mid a_k = \alpha^l,\, \psi^l \bigr) = N_k \psi^{li}. \qquad (20)$$
Its variance is
$$E\Bigl[ \bigl( N_k^i - E(N_k^i \mid a_k = \alpha^l,\, \psi^l) \bigr)^2 \,\Big|\, a_k = \alpha^l,\, \psi^l \Bigr] = N_k \psi^{li} \bigl(1 - \psi^{li}\bigr). \qquad (21)$$
The expectation of the square of any random variable is the sum of its variance and its squared expectation, i.e.
$$E\bigl( (N_k^i)^2 \mid a_k = \alpha^l,\, \psi^l \bigr) = N_k \psi^{li} (1 - \psi^{li}) + \bigl(N_k \psi^{li}\bigr)^2 = N_k \psi^{li} \bigl\{ 1 - \psi^{li} + N_k \psi^{li} \bigr\} = N_k (N_k - 1)\, \psi^{li}\psi^{li} + N_k \psi^{li}. \qquad (22)$$
So,
$$E\bigl( N_k^i (N_k^i - 1) \mid \xi, \Psi \bigr) = N_k (N_k - 1) \sum_{l=1}^{n} \xi^l \psi^{li}\psi^{li} = N_k (N_k - 1)\, p^{ii}.$$
Therefore,
$$E\bigl(\hat p^{ii} \mid \xi, \Psi\bigr) = \frac{\sum_{k=1}^{Q} N_k (N_k - 1)\, p^{ii}}{\sum_{k=1}^{Q} N_k (N_k - 1)} = p^{ii}.$$

Let us now consider the case $i \neq j$.

Here again there are $n$ hypotheses about the random choice of the ancestor amino acid $a_k \in A$ generating all amino acids of the $k$-th column of the consolidated block, yielding the equality
$$E\bigl( N_k^i N_k^j \mid \xi, \Psi \bigr) = \sum_{l=1}^{n} \xi^l\, E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr).$$
The random variable $N_k^i$ can take $N_k + 1$ values $N_k^i = s$, $s = 0, 1, 2, \ldots, N_k$, with probabilities $P(N_k^i = s)$:
$$E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr) = \sum_{s=0}^{N_k} P\bigl( N_k^i = s \mid a_k = \alpha^l,\, \psi^l \bigr)\, s\, E\bigl( N_k^j \mid a_k = \alpha^l,\, N_k^i = s,\, \psi^l \bigr).$$
Here the conditional distribution of the random variable $N_k^j$ is binomial, with probability of occurrence of the event (amino acid $\alpha^j$) equal to $\psi^{lj} / (1 - \psi^{li})$ in each of the $N_k - s$ remaining independent trials. The mean of this binomial distribution is
$$E\bigl( N_k^j \mid a_k = \alpha^l,\, N_k^i = s,\, \psi^l \bigr) = \frac{\psi^{lj}}{1 - \psi^{li}} \bigl( N_k - s \bigr),$$
consequently,
$$E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr) = \frac{\psi^{lj}}{1 - \psi^{li}} \sum_{s=0}^{N_k} P\bigl( N_k^i = s \mid a_k = \alpha^l,\, \psi^l \bigr)\, s\, (N_k - s) = \frac{\psi^{lj}}{1 - \psi^{li}} \Bigl\{ N_k \sum_{s=0}^{N_k} s\, P\bigl( N_k^i = s \mid a_k = \alpha^l,\, \psi^l \bigr) - \sum_{s=0}^{N_k} s^2\, P\bigl( N_k^i = s \mid a_k = \alpha^l,\, \psi^l \bigr) \Bigr\}.$$
Here the first sum in braces is the mean of the random variable $N_k^i$, equal to $N_k \psi^{li}$, and the second sum is the mean of its square, so that
$$E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr) = \frac{\psi^{lj}}{1 - \psi^{li}} \Bigl\{ N_k \bigl( N_k \psi^{li} \bigr) - E\bigl( (N_k^i)^2 \mid a_k = \alpha^l,\, \psi^l \bigr) \Bigr\},$$
i.e., using (22),
$$E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr) = \frac{\psi^{lj}}{1 - \psi^{li}} \Bigl\{ N_k^2 \psi^{li} - N_k \psi^{li}(1 - \psi^{li}) - N_k^2 \psi^{li}\psi^{li} \Bigr\} = \frac{\psi^{lj}}{1 - \psi^{li}}\, N_k (N_k - 1)\, \psi^{li} \bigl(1 - \psi^{li}\bigr) = N_k (N_k - 1)\, \psi^{li} \psi^{lj}.$$
Averaging over the ancestor hypotheses gives
$$E\bigl( N_k^i N_k^j \bigr) = \sum_{l=1}^{n} \xi^l\, E\bigl( N_k^i N_k^j \mid a_k = \alpha^l,\, \psi^l \bigr) = N_k (N_k - 1) \sum_{l=1}^{n} \xi^l \psi^{li} \psi^{lj}.$$
It follows that
$$E\bigl(\hat p^{ij}\bigr) = E\Bigl( \frac{1}{2} \frac{M^{ij}}{M} \Bigr) = \frac{1}{2} \cdot \frac{\sum_{k=1}^{Q} E\bigl( N_k^i N_k^j \bigr)}{\frac{1}{2} \sum_{k=1}^{Q} N_k (N_k - 1)} = \frac{\sum_{k=1}^{Q} N_k (N_k - 1)}{\sum_{k=1}^{Q} N_k (N_k - 1)} \sum_{l=1}^{n} \xi^l \psi^{li} \psi^{lj} = p^{ij}.$$
So, for any $i$ and $j$, $E(\hat p^{ij} \mid \xi, \Psi) = p^{ij}$. The proof is complete.
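A quick Monte-Carlo sanity check of Theorem 3 can be run on the hypothetical three-letter chain used earlier: columns are generated exactly as the model prescribes (an ancestor drawn from $\xi$, then $N_k$ conditionally independent descendants), the statistic (18) is computed, and its value is compared with $p^{ij} = \xi^i \psi^{ij}_{[2]}$. All numbers below are illustrative.

```python
# Monte-Carlo sketch of Theorem 3 on the hypothetical 3-letter chain: columns are
# generated as the model prescribes and the statistic (18) is compared with
# p^{ij} = xi^i * psi_[2]^{ij} from eq. (13).  All settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
S = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 5.0]])
Psi = S / S.sum(axis=1)[:, None]
xi = S.sum(axis=1) / S.sum()
n, Q, Nk = len(xi), 20000, 6            # Q columns, Nk amino acids per column

M_ordered = np.zeros((n, n))            # ordered pair counts over all columns
for _ in range(Q):
    ancestor = rng.choice(n, p=xi)                      # one-step ancestor ~ xi
    column = rng.choice(n, size=Nk, p=Psi[ancestor])    # descendants ~ psi(.|ancestor)
    c = np.bincount(column, minlength=n).astype(float)
    # c_i*c_j ordered cross pairs off the diagonal, c_i*(c_i - 1) on the diagonal
    M_ordered += np.outer(c, c) - np.diag(c)

p_hat = M_ordered / M_ordered.sum()     # equivalent to the statistic (18)
p_model = xi[:, None] * (Psi @ Psi)     # eq. (13)
print("max |p_hat - p_model| =", float(np.abs(p_hat - p_model).max()))  # small
```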

According to Theorems 2 and 3, BLOSUM expresses the same probabilistic model of amino acid mutations as PAM, based on the notion of an ergodic and reversible Markov chain, so that each element of the BLOSUM matrix (12) can be expressed in the corresponding terms as
$$b^{ii} = 2 \log_2 \Bigl( \frac{p^{ii}}{q^i q^i} \Bigr) = 2 \log_2 \Bigl( \frac{\hat p^{ii}}{\hat\xi^i \hat\xi^i} \Bigr) \;\text{ for } i = j, \qquad b^{ij} = 2 \log_2 \Bigl( \frac{p^{ij}}{2\, q^i q^j} \Bigr) = 2 \log_2 \Bigl( \frac{\hat p^{ij}}{\hat\xi^i \hat\xi^j} \Bigr) \;\text{ for } i \neq j,$$
or, following (13),
$$b^{ij} = 2 \log_2 \Bigl( \frac{p^{ij}}{2\, q^i q^j} \Bigr) = 2 \log_2 \Bigl( \frac{\xi^i \psi^{ij}_{[2]}}{\xi^i \xi^j} \Bigr) = 2 \log_2 \Bigl( \frac{\psi^{ij}_{[2]}}{\xi^j} \Bigr). \qquad (23)$$

7 Mathematical properties of substitution matrices of the PAM and BLOSUM families: kernels in the set of amino acids on the basis of Dayhoff’s model of evolution

This section is focused on the analysis of the positive semi-definiteness of substitution matrices and its relation to the original Dayhoff model of point evolutionary processes.

In a number of problems of protein analysis it is useful for the similarity measure over the set of amino acids to possess the properties of an inner product. Functions of this kind are called kernel functions. For the finite set of amino acids, a kernel function is any real-valued two-argument function forming a positive semi-definite matrix. Each kernel function defined over the set of amino acids embeds this set into some hypothetical linear space and lets us consider each amino acid as a point in it. However, it should be noted that the substitution matrices of the PAM and BLOSUM families in their traditional log-odds representations (6) and (23) have negative eigenvalues, i.e. they are not valid kernels.

In this section we prove an interesting fact: Dayhoff’s evolutionary model, based on the notion of an ergodic and reversible Markov chain, is sufficient for constructing a valid kernel over the set of amino acids. From this it follows that the PAM and BLOSUM families are kernel functions by their mathematical structure, but have lost their natural positive semi-definiteness simply because of an unsuitable choice of final representation.

It should be noted that the similarity measure (2) can easily be represented as an inner product:
$$K_{[2]}(\alpha^i, \alpha^j) = \sum_{k=1}^{n} \Bigl( \sqrt{\xi(\alpha^k)}\, \psi_{[1]}(\alpha^i \mid \alpha^k) \Bigr) \Bigl( \sqrt{\xi(\alpha^k)}\, \psi_{[1]}(\alpha^j \mid \alpha^k) \Bigr) = \sum_{k=1}^{n} x_{ik} x_{jk} = \mathbf{x}_i^T \mathbf{x}_j,$$
where $\mathbf{x}_i = \bigl( \sqrt{\xi(\alpha^k)}\, \psi_{[1]}(\alpha^i \mid \alpha^k),\; k = 1,\ldots,n \bigr) \in \mathbb{R}^n$ can be considered as a feature vector of the $i$-th amino acid. So the similarity measure (2), and hence its representation (3) through the two-step random transformation, is a kernel. It is easy to show that the normalized similarity measure (4) is also a kernel.

Theorem 4. The two-argument function $\tilde K_{[2]}(\alpha^i, \alpha^j) = \dfrac{K_{[2]}(\alpha^i, \alpha^j)}{\xi(\alpha^i)\,\xi(\alpha^j)} = \dfrac{\psi_{[2]}(\alpha^j \mid \alpha^i)}{\xi(\alpha^j)}$ is a kernel.

Proof.
$$\tilde K_{[2]}(\alpha^i, \alpha^j) = \frac{K_{[2]}(\alpha^i, \alpha^j)}{\xi(\alpha^i)\,\xi(\alpha^j)} = \frac{1}{\xi(\alpha^i)\,\xi(\alpha^j)} \sum_{k=1}^{n} \xi(\alpha^k)\, \psi_{[1]}(\alpha^i \mid \alpha^k)\, \psi_{[1]}(\alpha^j \mid \alpha^k) = \sum_{k=1}^{n} \Bigl( \frac{\sqrt{\xi(\alpha^k)}}{\xi(\alpha^i)}\, \psi_{[1]}(\alpha^i \mid \alpha^k) \Bigr) \Bigl( \frac{\sqrt{\xi(\alpha^k)}}{\xi(\alpha^j)}\, \psi_{[1]}(\alpha^j \mid \alpha^k) \Bigr) = \sum_{k=1}^{n} \tilde x_{ik} \tilde x_{jk} = \tilde{\mathbf{x}}_i^T \tilde{\mathbf{x}}_j.$$
So the function $\tilde K_{[2]}(\alpha^i, \alpha^j)$ is a kernel. The proof is complete.
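The following sketch illustrates, for the hypothetical three-letter chain used above, that $K_{[2]}$ and $\tilde K_{[2]}$ are positive semi-definite while the log-odds representation of the same similarity (as in (6) and (23)) need not be; all matrices are illustrative.

```python
# Sketch: K_[2] and the normalized K~_[2] are positive semi-definite, while the
# log-odds representation of the same similarity (as in (6) and (23)) need not be.
# Hypothetical 3-letter chain as in the earlier sketches.
import numpy as np

S = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 5.0]])
Psi = S / S.sum(axis=1)[:, None]
xi = S.sum(axis=1) / S.sum()

K2 = xi[:, None] * (Psi @ Psi)          # similarity measure (2)-(3)
K2_norm = K2 / np.outer(xi, xi)         # normalized similarity (4)
log_odds = 2.0 * np.log2(K2_norm)       # final log-odds representation, cf. (23)

for name, A in [("K_[2]", K2), ("normalized K_[2]", K2_norm), ("log-odds", log_odds)]:
    print(name, "min eigenvalue:", round(float(np.linalg.eigvalsh(A).min()), 4))
# The first two minima are non-negative (valid kernels); for this example the
# log-odds matrix has a negative eigenvalue, i.e. it is not a kernel.
```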

It should be noted that the expression under the logarithm in the definition (23) of the BLOSUM substitution matrix is exactly the normalized similarity measure $\tilde K_{[2]}(\alpha^i, \alpha^j)$ and so, in accordance with Theorem 4, is a kernel.

As for the PAM substitution matrix, the expression under the logarithm in its definition (6) has the same structure as $\tilde K_{[2]}(\alpha^i, \alpha^j)$. It is evident that if $m = 2$ the functions $\pi^{ij}_{[m]} = \dfrac{\psi^{ij}_{[m]}}{\xi^j}$ and $\tilde K_{[2]}(\alpha^i, \alpha^j) = \dfrac{\psi^{ij}_{[2]}}{\xi^j}$ coincide, and so $\pi^{ij}_{[2]}$ is a kernel.

The question arises: will the functions $\pi^{ij}_{[m]}$ be kernels for any $m$? The answer is that they will, if there exists a random transformation $\psi_{[m/2]}(\alpha^j \mid \alpha^i)$ such that
$$\psi_{[m]}(\alpha^j \mid \alpha^i) = \sum_{\vartheta \in A} \psi_{[m/2]}(\alpha^j \mid \vartheta)\, \psi_{[m/2]}(\vartheta \mid \alpha^i).$$
Such a transformation $\psi_{[m/2]}(\alpha^j \mid \alpha^i)$ also defines an ergodic and reversible Markov random process. If the transformation $\psi_{[m/2]}(\alpha^j \mid \alpha^i)$ exists, we say that the ergodic and reversible Markov random process $\psi_{[m]}(\alpha^j \mid \alpha^i)$ is divisible. It should be noted that divisibility does not automatically follow from the fact that a Markov random process is ergodic and reversible. It is evident that $\pi^{ij}_{[m]}$ is a kernel at least for any even $m$. For odd values of $m$, the divisibility of the respective Markov random process $\psi_{[m]}(\alpha^j \mid \alpha^i)$ can be checked experimentally. In particular, $\psi_{[1]}(\alpha^j \mid \alpha^i)$, which participates in forming the PAM1 substitution matrix, is divisible, and so PAM1, or rather the expression under the logarithm $\pi^{ij}_{[1]}$, is also a kernel.
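A hedged numerical sketch of such an experimental divisibility check, again on the made-up three-letter chain: reversibility lets the principal square root of $\Psi$ be computed through the symmetrized matrix $D^{1/2}\Psi D^{-1/2}$, and one then inspects whether that root is a valid (non-negative, row-stochastic) transition matrix. This examines only the principal root of an illustrative example; the same check can be applied to an estimated PAM1 matrix.

```python
# Numerical divisibility check on the toy 3-letter chain: reversibility makes
# D^{1/2} Psi D^{-1/2} symmetric (D = diag(xi)), so a principal square root of
# Psi can be computed via an eigendecomposition; we then inspect whether it is
# a valid transition matrix.  Only the principal root of a made-up chain is
# examined here; the same check can be run on an estimated PAM1 matrix.
import numpy as np

S = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 5.0]])
Psi = S / S.sum(axis=1)[:, None]
xi = S.sum(axis=1) / S.sum()

d = np.sqrt(xi)
A = (d[:, None] * Psi) / d[None, :]            # symmetric thanks to reversibility
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
Psi_half = (A_half / d[:, None]) * d[None, :]  # candidate one-half-step transition

print("Psi_half @ Psi_half == Psi :", bool(np.allclose(Psi_half @ Psi_half, Psi)))
print("rows sum to one            :", bool(np.allclose(Psi_half.sum(axis=1), 1.0)))
print("all entries non-negative   :", bool((Psi_half >= -1e-12).all()))
```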

8 Results and discussion

One of the main results of this paper consists in proving an interesting and useful fact: the statistical approach for computing substitution matrices of the BLOSUM family, introduced by the Henikoffs, can be explained in terms of the PAM evolutionary model proposed by M. Dayhoff. So both of these commonly adopted ways of comparing amino acid sequences are based on the same model of evolution, and the main difference between them lies only in the different initial data used for estimating its unknown parameters.

Another result following from the above is that the PAM model of evolution, with its main assumption of an ergodic and reversible Markov chain of point mutations of amino acids, is sufficient for constructing kernel functions over the set of amino acids. Moreover, we have shown that the natural similarity measure for amino acids based on Dayhoff’s model and commonly adopted in bioinformatics, namely the probability that the two amino acids being compared might have resulted from two independent random transformations of one and the same unknown source amino acid, possesses all the properties of a kernel function. So the substitution matrices of the PAM and BLOSUM families are kernels by nature and have lost this property only as a result of a specific and unsuitable choice of final representation. Revising the traditional representation of substitution matrices opens the possibility of applying more general methods for constructing kernels over sets of proteins, whenever a positive semi-definite substitution matrix is needed, while still using a source similarity matrix justified from the point of view of evolution.

From a mathematical perspective there are several interesting questions which remain to be answered, but which should be solvable in the near future. One set of questions is related to the problem of how to approximate an original substitution matrix by one in simple integer form (as is done now) so that the new representation preserves the positive semi-definiteness property. Other open questions arising from the above analysis are likely to be harder to answer, but are important in practice. The best example is how to construct a pairwise alignment of two protein sequences, based on positive semi-definite substitution matrices, whose score function also satisfies the positive semi-definiteness property. There exist a few publications [10, 11, 14, 15, 16] which describe such alignment procedures, but these procedures do not have a rigorous evolutionary foundation.

9 Conclusions

One of the foundations for solving many problems in bioinformatics, such as protein homology detection, prediction of protein-protein interactions, prediction of biological functions, and prediction of the secondary and 3D structure of proteins, is the use of a similarity measure over the set of amino acids.

The commonly used families of substitution matrices, PAM and BLOSUM, define similarity measures which are adequate from the biological point of view but which, in their traditional representation, fail to satisfy the kernel property; this has obscured the fact that they are closely related and differ only in the data used for estimating their parameters. In numerous publications, attempts at correcting the traditional substitution matrices have resulted either in the loss of their biological meaning or in revised matrices whose positive semi-definiteness was not guaranteed.

In this paper we prove that all PAM substitution matrices are kernel functions by their mathematical structure and lose their positive semi-definiteness only because of an unsuitable final representation. Moreover, we have proved that BLOSUM substitution matrices can also be justified in terms of the PAM evolutionary model. As a result, the BLOSUM family needs only a modified final representation, one that does not destroy the original biological rationale, to become a family of kernels.

The results obtained are especially significant in view of the growing popularity of constructing data-specific amino acid substitution matrices. Any new substitution matrix will automatically be a kernel if, first, it is estimated by Dayhoff’s or the Henikoffs’ techniques, both based on the PAM evolutionary model, and, second, the final representation guaranteeing positive semi-definiteness proposed in the present research is applied.

References

1. Dayhoff M.O., Schwartz R.M., Orcutt B.C. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 1978, Vol. 5, pp. 345-352.
2. Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 1992, pp. 10915-10919.
3. Tatusov R.L., Galperin M.Y., Natale D.A., Koonin E.V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 2000, Vol. 28, No. 1, pp. 33-36.
4. Liu X., Zheng W.-M. Amino acid substitution matrices for protein conformation identification. arXiv:q-bio.BM/0406032, v1, 15 June 2004.
5. Ng P.C., Henikoff J.G., Henikoff S. PHAT: a transmembrane-specific substitution matrix. Bioinformatics, 2000, Vol. 16, pp. 760-766.
6. Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons, Inc., 1998, 732 p.
7. Burges C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998, Vol. 2, pp. 121-167.
8. Aizerman M.A., Braverman E.M., Rozonoer L.I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 1964, Vol. 25, pp. 821-837.
9. Vanschoenwinkel B., Manderick B. Substitution matrix based kernel functions for protein secondary structure prediction. Proceedings of the 2004 International Conference on Machine Learning and Applications, December 16-18, 2004, pp. 388-396.
10. Wu F., Olson B., Dobbs D., Honavar V. Comparing kernels for predicting protein binding sites from amino acid sequence. Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2006, pp. 1612-1616.
11. Vert J.-P., Saigo H., Akutsu T. Local alignment kernels for biological sequences. In: Schölkopf B., Tsuda K., Vert J.-P. (eds.), Kernel Methods in Computational Biology. MIT Press, 2004.
12. Mottl V.V., Seredin O.S., Sulimova V.V. Mathematically correct methods of similarity measurement on sets of signals and symbol sequences of different length. Pattern Recognition and Image Analysis, 2005, Vol. 15, No. 1, pp. 87-89.
13. Caragea C., Sinapov J., Silvescu A., Dobbs D., Honavar V. Glycosylation site prediction using ensembles of Support Vector Machine classifiers. BMC Bioinformatics, November 2007, 8:438, doi:10.1186/1471-2105-8-438.
14. Vert J.-P., Qiu J., Noble W.S. A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics, December 2007, 8 (Suppl 10): S8, doi:10.1186/1471-2105-8-S10-S8.
15. Vincent M., Passerini A., Labbe M., Frasconi P. A simplified approach to disulfide connectivity prediction from protein sequences. BMC Bioinformatics, January 2008, 9:20, doi:10.1186/1471-2105-9-20.
16. Schölkopf B., Tsuda K., Vert J.-P. (eds.). Kernel Methods in Computational Biology. Cambridge, MA: MIT Press, 2004, 397 p.
