Evolutionary Models for Insertions and Deletions in a Probabilistic Modeling Framework Elena Rivas Washington University School of Medicine in St

Washington University School of Medicine Digital Commons@Becker

Open Access Publications

2005 Evolutionary models for insertions and deletions in a probabilistic modeling framework Elena Rivas Washington University School of Medicine in St. Louis

Follow this and additional works at: https://digitalcommons.wustl.edu/open_access_pubs Part of the Medicine and Health Sciences Commons

Recommended Citation Rivas, Elena, ,"Evolutionary models for insertions and deletions in a probabilistic modeling framework." BMC Bioinformatics.,. 63. (2005). https://digitalcommons.wustl.edu/open_access_pubs/147

This Open Access Publication is brought to you for free and open access by Digital Commons@Becker. It has been accepted for inclusion in Open Access Publications by an authorized administrator of Digital Commons@Becker. For more information, please contact [email protected]. BMC Bioinformatics BioMed Central

Research article Open Access Evolutionary models for insertions and deletions in a probabilistic modeling framework Elena Rivas*

Address: Department of Genetics, Washington University School of Medicine, 4444 Forest Park Blvd., Saint Louis, Missouri 63108 USA Email: Elena Rivas* - [email protected] * Corresponding author

Published: 21 March 2005 Received: 16 December 2004 Accepted: 21 March 2005 BMC Bioinformatics 2005, 6:63 doi:10.1186/1471-2105-6-63 This article is available from: http://www.biomedcentral.com/1471-2105/6/63 © 2005 Rivas; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time. Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events. Results: I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively. Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence. I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA). Conclusion: These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.

Background are another class of probabilistic models used for struc- Probabilistic models are widely used for sequence analysis tural RNAs for problems such as RNA homology searches [1]. Hidden Markov models (HMMs) are a very large class [9-13], RNA structure prediction [14,15], and RNA gene- of probabilistic models used for many problems in bio- finding [16]. logical sequence analysis such as sequence homology searches [2-4], sequence alignment [5], or protein gene- Sequence similarity methods based on HMMs or SCFGs finding [6-8]. Stochastic context-free grammars (SCFGs) can take the form of profile or pair models and are very

Page 1 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

important for comparative genomics. These probabilistic very important goal in order to make those probabilistic methods for sequence comparison assume a certain models more realistic. degree of sequence divergence. For instance, in profile models (either profile HMMs [2-4] or profile SCFGs I encountered this problem in working on QRNA, a com- [12,13]) a sequence is compared to a consensus model. putational program to identify noncoding RNA genes de Profile models must allow for the occurrence of insertions novo. QRNA uses probabilistic comparative methods to and deletions with respect to the consensus, and they do analyze the pattern of mutation present in a pairwise so by using state transition probabilities that assign some alignment in order to decide whether the compared position-dependent penalties for modifying the consen- nucleic acid sequences are more likely to be protein-cod- sus with insertions or deletions. Similarly, in pair proba- ing, structural RNA encoding, or neither. Originally bilistic models [8,16] two related sequences are compared QRNA was parameterized at a fixed divergence time. (aligned and/or scored). Pairwise alignments need to Motivated by the goal of making QRNA a time dependent allow for substitution, insertion and deletion events parametric family of models, I investigated the possibility between the two related sequences. Substitutions are of evolving the transition and emission probabilities asso- taken care of by residue emission probabilities, while ciated with a given probabilistic model. Since I already insertion and deletion events are generally taken care of had the model parameterized for a given time, I aimed to by state transition probabilities as in the case of profile use that model as a generating point of the whole time- HMMs. parameterized family of models.

In the BLAST programs [17], the score of a pairwise align- Because QRNA includes both linear and affine gap mod- ment is determined using substitution matrices which els in different places, in this paper I propose algorithms measure the degree of similarity between two aligned res- to describe the evolution of indels as a (N + 1)-th charac- idues. Similarly, in pair probabilistic models, residue ter in a substitution matrix, and algorithms to describe the emission probabilities are based on substitution matrices. evolution of the transition probabilities associated with a The evolution of substitution matrices has been studied at probabilistic model. large for many different kinds of processes: nucleotides, amino acids, codons, or RNA basepairs [18-23]. The evo- The purpose of this paper is to describe the general theo- lution of emission probabilities using substitution matri- retical framework behind these methods. A detailed ces is easily integrated into probabilistic models both for description of the particular implementation of these HMMs [24-29] and for SCFGs [14]. algorithms in QRNA and a discussion of the results obtained with "evolutionary QRNA" (eQRNA) will In probabilistic models, insertion and deletion events appear in a complementary publication. (indels) are sometimes described by treating indels as an additional residue (gap characters) in a substitution Results matrix. More often they are described using additional Evolutionary models for emission probabilities hidden states, where transition probabilities into those The evolution of emission probabilities without gaps states represent the cost of gap initiation and transitions In order to introduce notation, I will start with a brief within those states represent the cost of gap extension. If review of the current methods for calculating joint proba- the cost of gap initiation and gap extension are identical, bilities conditional on time, P(i, j|t), where i, j are two res- it is referred to as a linear gap cost model. Hidden states idues (for instance, nucleotides, amino acids, RNA allow arbitrary costs for gap initiation and gap extension, basepairs, or codons). P(i, j|t) gives us the probability that which is traditionally referred to as an affine gap cost residues i and j are observed at a homologous site in a model. Treating gaps as an extra character in a substitution pairwise alignment after a divergence time t. Pairwise matrix is equivalent to assuming a linear gap cost model. sequence comparison methods score aligned residue pairs The parameters that modulate those processes should be with these joint probabilities either explicitly or implicitly allowed to change as the divergence time for the [17]. In explicit generative pair probabilistic models, like sequences being compared is varied. It has been difficult the pair-HMMs and pair-SCFG in QRNA, the P(i, j|t) terms to combine probabilistic models such as profile and pair are referred to as pair emission probabilities. HMMs or SCFGs with evolutionary models for insertion and deletions [30-33]. Methods to evolve transition prob- The evolution of joint probabilities is usually obtained by abilities are not as well developed as those describing sub- modeling the corresponding conditional probabilities stitution matrices, but significant effort is currently aimed P(j|i, t)as a substitution process in which residue i has at this problem [34-41]. Models incorporating the evolu- been substituted by residue j over time t. Probabilistic tion of insertions and deletions in the context of probabi- models for nucleotide substitutions [18,19,42-44] assume listic models such as profile HMMs or pair models are a

Page 2 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

that nucleotide substitution follows a model of evolution example, a similar assumption was taken to construct the that depends on an instantaneous rate matrix, BLOSUM matrices [51], which were obtained as joint probabilities at discrete point estimates from clusters of tR Qt = e , (1) aligned sequences. where t is the divergence time, R is the instantaneous rate In this third approach the parameters at the generating matrix, and Qt is the substitution matrix of conditional time t* will be used to construct a rate matrix for the proc- ≡ probabilities, that is Qt(ij) P(j|i, t). This is a reasonable ess. This approach is motivated by the kind of situation in model used, for instance, to describe nucleotide substitu- which we find ourselves with probabilistic methods based tions in the Jukes-Cantor [42] or Kimura [43] models, or on homology such as QRNA: a model has been trained in the more general REV model [44]; this is also the evolu- one kind of data, and the resulting probabilities represent tionary model used for amino acid substitutions some effective but fixed divergence time, and we wish to [18,19,21,45,46], codon to codon substitutions [20,47], extend that model to a time-dependent parameterization. and RNA basepair to basepair substitutions [14,22,23]. For residue substitution processes, the rate matrix R and Throughout this paper, I will use the words "divergence Q*, defined as the substitution matrix at the generating time", "divergence", or "time" equivalently to describe the time t [Q ≡ Q ], convey exactly the same information. * * t* amount of dissimilarity between biological sequences More explicitly, assuming the evolutionary model given in measured as the number of mutations and gaps intro- equation (1) we can calculate the rate matrix of the proc- duced in the alignment of the sequences. I will never refer ess as a function of Q* as to "time" as representing an actual number of years of divergence, since this number cannot be determined =<<∞1 intrinsically from sequence data. R log(Qt** ),03 . ( ) t*

Thus, given a rate matrix R, Qt (and therefore the desired Kishino et al [52] introduced the idea of calculating the joint emission probabilities) can be inferred for any rate matrix starting from a given substitution matrix using desired time using the Taylor expansion for the matrix equation (3) and an eigenvalue decomposition of Q*. It is exponential, worth noting that the matrix equation for the rate R can be expressed as a Taylor expansion of the form n=∞ n = ()tR Qt ∑ ()2 n=∞ n+1 1 ()−1 n n=0 n! R = ∑ ()QI− ()4 tn* 23 * n=1 =+ +t +t +   ItR RR RRR ... =−−−+−112 1 3 − 23!! ()()()QI** QI QI *) ... , t*  2 3  This Taylor series converges in all cases. which allows for a direct numerical calculation of the rate matrix. The convergence of this series requires only that λ There are several ways in which the rate matrix R can be for every (real or complex) eigenvalue of matrix Q*, then determined. One approach is to use analytically inferred |λ - 1| < 1. In addition, for any valid substitution matrix rate matrices that depend on a small number of external the eigenvalues have to be real and |λ| ≤ 1 (see Appendix parameters [42-44,48]. For instance, the HKY model for A). Under these two conditions, the above Taylor series nucleotide substitutions [48] depends on six parameters: converges so long as the eigenvalues of Q* are positive. the four stationary nucleotide frequencies, a rate of transi- Therefore the three properties required of Q* in order to be tions, and a rate of transversions, which have to be pro- able to obtain a rate matrix using the Taylor expansion in vided externally. Another type of approach uses equation (4) are that its eigenvalues are all smaller than maximum likelihood methods [21,49,50] in order to esti- one (but one that is strictly one), real, and positive. Com- mate a rate matrix numerically from a training set of plex or negative eigenvalues would correspond to oscilla- sequence alignments. tory behaviors, which do not seem to reflect the biology. All the substitution processes I have tested so far for nucle- A third approach arises naturally in cases where suitable otides, amino acids, and RNA basepairs correspond to real joint probabilities have already been estimated for a pair and positive eigenvalues for which the above method is model, and we wish to make that model conditional on applicable. evolutionary divergence time. This approach starts from the assumption that our point estimate represents It is relevant to compare instantaneous rate matrix sequences at a particular arbitrary divergence time t*. For approaches to the approach used in the PAM amino-acid

Page 3 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

  ACDEF GH I KLMNP Q RS T VWY  − .029 .019 .056 .022 .106 .011 .032 .049 .088 .020 .005 .045 .037 .055 .238 .089 .138 .003 .016     .142 − .021 .001 .024 .020 .008 .056 .017 .089 .028 .013 .016 .012 .018 .077 .058 .060 .008 .019   .026 .006 − .233 .014 .052 .015 .024 .034 .006 .005 .100 .038 .048 .013 .102 .045 .014 .002 .007     .061 .000 .189 − .011 .023 .030 .009 .136 .018 .006 .049 .035 .181 .061 .096 .040 .040 .003 .013   .045 .010 .022 .021 − .027 .019 .111 .017 .150 .051 .013 .008 .006 .017 .049 .033 .039 .027 .209     .113 .004 .042 .023 .014 − .010 .005 .024 .011 .007 .061 .022 .017 .025 .102 .020 .020 .007 .009     .037 .006 .039 .093 .031 .032 − .018 .039 .022 .014 .114 .025 .063 .084 .060 .023 .014 .005 .125   .044 .015 .024 .012 .074 .006 .007 − .002 .465 .054 .016 .011 − .002 .017 .050 .032 .692 .002 .023     .059 .004 .030 .150 .010 .027 .014 .002 − .049 .025 .059 .040 .100 .222 .113 .041 .041 .003 .015  R =  .072 .015 .004 .014 .060 .008 .005 .282 .033 − .138 .012 .013 .021 .031 .030 .053 .113 .006 .025     .069 .019 .012 .020 .084 .022 .014 .135 .070 .566 − .026 .033 .127 .070 .101 .050 .239 .015 .026   .008 .004 .115 .069 .010 .088 .053 .018 .076 .022 .012 − .018 .055 .075 .191 .082 .012 .001 .014     .065 .005 .041 .046 .005 .031 .011 .012 .048 .024 .014 .017 − .034 .022 .057 .047 .037 .002 .009     .068 .005 .065 .307 .006 .030 .035 −.002 .153 .048 .070 .066 .043 − .153 .124 .046 .040 .008 .037   .085 .005 .015 .086 .013 .036 .038 .020 .282 .058 .032 .075 .023 .127 − .050 .059 .003 .004 .017     .233 .015 .074 .087 .024 .094 .018 .036 .092 .036 .030 .121 .039 .066 .032 − .164 .005 .003 .014   .113 .015 .043 .046 .021 .024 .009 .030 .043 .081 .019 .067 .041 .032 .049 .212 − .126 .007 .014     .161 .014 .012 .043 .022 .022 .005 .595 .040 .161 .083 .009 .029 .025 .002 .006 .116 − .003 .031  .020 .009 .008 .018 .077 .037 .008 .009 .017 .043 .026 .004 .008 .024 .016 .018 .033 .014 − .094 .037 .009 .012 .029 .241 .019 .088 .040 .030 .071 .018 .021 .015 .047 .026 .034 .027 .063 .037 −

RateFigure matrix 1 generated from BLOSUM62 Rate matrix generated from BLOSUM62. Rate matrix obtained from the amino-acid substitution matrix BLOSUM62, rescaled to have an average number of one substitution per amino acid. Notice in bold the two off diagonal negative entries.

substitution matrices [53]. The PAM matrices were not other methods have been proposed to calculate a rate generated by calculating a rate matrix, but by estimating matrix from the Dayhoff data [55]. These methods still from a collection of highly similar sequences the substitu- ≈ PAM PAM assume that R (Q* - I) which corresponds to taking tion matrix for the time of one substitution per site Q* ≡ only the first term in the Taylor series for the logarithm in Qt = 0.01, and then calculating Qt at any other (integer t) time by multiplication. This is a discrete approximation equation (4). This assumption is good only for very that converges to the same answer given by the rate closely related sequences. Using the Taylor series allows method for very small time units. PAM matrices have been one to estimate, using the same input data and avoiding criticized for not being able to capture the substitutions the calculation of eigenvalues, the rate matrix to any that are observed for more dissimilar sequences. BLOSUM desired level of precision, independent of the degree of matrices empirically outperform PAM in sequence similarity in the training set. homology searches, presumably because sequences at larger divergence times were used to calculate the BLO- Notice that the rate matrix obtained using BLOSUM62 SUM matrices. However, the BLOSUM method is not a (Figure 1) has two off-diagonal negative entries (and if we time dependent continuous model but a very coarse- use more divergent BLOSUM matrices we have more neg- grained discretization. There are ways of combining the ative off-diagonals). Off-diagonal entries of the rate δ best of both approaches (more divergent sequence for matrix have to be positive so that I + tR can be interpreted δ training and a continuous-time model) to generate rate as a substitution matrix for very small times t. This prob- matrices, for instance by using the resolvent method [54], lem is not unique to sequence data. The construction of or using maximum likelihood methods as in the WAG rate matrices for a Markov process from empirical data matrices [21]. However, it is also possible to take a dis- using a generating time is also used in mathematical mod- crete BLOSUM matrix, for instance BLOSUM62, and con- eling of financial processes such as credit risk modeling vert it to an underlying rate matrix. The BLOSUM62- [56,57]. In the world of mathematical finances the prob- generated rate matrix obtained using equation (4) is lem is referred to as the regularization problem. I will use shown in Figure 1. one of following regularization algorithms presented in [57]. The QOG algorithm (quasi-optimization of the gen- A rate matrix can also be derived from the PAM data erator) regularizes the rate matrix. The QOM algorithm (quasi-optimization of the root matrix) leaves the rate PAM Q* by various methods. One exact method is to do an matrix unchanged and regularizes the conditional matrix eigenvalue decomposition as presented in [52]. Recently, at a given time if any negative probability appears. Using

Page 4 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

  ACDEF GH I KLMNP QRS T VWY  − .029 .019 .056 .022 .106 .011 .032 .049 .088 .020 .005 .045 .037 .055 .238 .089 .138 .003 .016     .142 − .021 .001 .024 .020 .008 .056 .017 .089 .028 .013 .016 .012 .018 .077 .058 .060 .008 .019   .026 .006 − .233 .014 .052 .015 .024 .034 .006 .005 .100 .038 .048 .013 .102 .045 .014 .002 .007     .061 .000 .189 − .011 .023 .030 .009 .136 .018 .006 .049 .035 .181 .061 .096 .040 .040 .003 .013   .045 .010 .022 .021 − .027 .019 .111 .017 .150 .051 .013 .008 .006 .017 .049 .033 .039 .027 .209     .113 .004 .042 .023 .014 − .010 .005 .024 .011 .007 .061 .022 .017 .025 .102 .020 .020 .007 .009     .037 .006 .039 .093 .031 .032 − .018 .039 .022 .014 .114 .025 .063 .084 .060 .023 .014 .005 .125   .044 .015 .024 .011 .074 .006 .007 − .002 .465 .054 .015 .011 .000 .017 .049 .032 .692 .002 .023     .059 .004 .030 .150 .010 .027 .014 .002 − .049 .025 .059 .040 .100 .222 .113 .041 .041 .003 .015  R =  .072 .015 .004 .014 .060 .008 .005 .282 .033 − .138 .012 .013 .021 .031 .030 .053 .113 .006 .025     .069 .019 .012 .020 .084 .022 .014 .135 .070 .566 − .026 .033 .127 .070 .101 .050 .239 .015 .026   .008 .004 .115 .069 .010 .088 .053 .018 .076 .022 .012 − .018 .055 .075 .191 .082 .012 .001 .014     .065 .005 .041 .046 .005 .031 .011 .012 .048 .024 .014 .017 − .034 .022 .057 .047 .037 .002 .009     .067 .004 .065 .307 .006 .030 .035 .000 .153 .048 .070 .066 .043 − .153 .124 .046 .040 .007 .037   .085 .005 .015 .086 .013 .036 .038 .020 .282 .058 .032 .075 .023 .127 − .050 .059 .003 .004 .017     .233 .015 .074 .087 .024 .094 .018 .036 .092 .036 .030 .121 .039 .066 .032 − .164 .005 .003 .014   .113 .015 .043 .046 .021 .024 .009 .030 .043 .081 .019 .067 .041 .032 .049 .212 − .126 .007 .014     .161 .014 .012 .043 .022 .022 .005 .595 .040 .161 .083 .009 .029 .025 .002 .006 .116 − .003 .031  .020 .009 .008 .018 .077 .037 .008 .009 .017 .043 .026 .004 .008 .024 .016 .018 .033 .014 − .094 .037 .009 .012 .029 .241 .019 .088 .040 .030 .071 .018 .021 .015 .047 .026 .034 .027 .063 .037 −

RegularizedFigure 2 rate matrix generated from BLOSUM62 Regularized rate matrix generated from BLOSUM62 Regularized rate matrix generated from BLOSUM62 after the QOG algorithm has been applied. The matrix has been rescaled to have an average number of one substitution per amino acid. In this simple case in which there was at most one negative off-diagonal entry per row, the regularization process requires the negative off-diagonal value to be set to zero (represented in bold in this Figure), and to shift the rest of the elements in that row by the corresponding amount so that the sum of all elements is zero. Rows without any off-diagonal negative values remain unchanged from the values obtained in Figure 1.

the QOG algorithm we obtain a regularized version of the 3. Obtain the permutation wPwp = () , such that rate matrix using BLOSUM62, which is given in Figure 2. wwp ≤ p . Regularization algorithms i i+1 Here I reproduce the QOG and QOM regularization algorithms. The proofs for these algorithms can be found in =+p nk−−1 p −−+p 4. Construct Cwk ∑ = w− () nkw1 + , for [57]. The QOG algorithm regularizes each row of a rate 1 i 0 ni k 1 matrix independently. Given a row in a rate matrix R, k = 2, ..., n - 1. ≤ ≡ 5. Calculate kmin = mink = 2, ..., n - 1 {k such that Ck 0}. r = (r1,...,rn) (R(i, 1) ..., R(i, n)), (5) the QOG algorithm solves the problem of finding the vec- 6. Construct the vector tor at the minimal Euclidean distance from r such that the 02if ≤≤ik sum of all its elements is zero, and all elements but one  min are positive. rˆ =  1 n i wpp− {}ww+ ∑ p otherwise  i −+1 jk=+1 j  nkmin 1 min The steps of the QOG algorithm are: 7. The regularized row is given by r ← P-1 (rˆ ). Finally

1. Permute the row vector so that r1 = R(i, i). reverse the permutation of step (1).

λ 2. Construct the vector w, such that wi = ri - , where The QOM algorithm regularizes each row of a conditional matrix independently. Given a row in a conditional 1 n λ = ∑ r . matrix Q n i=1 i t ≡ r = (r1,..., rn) (Qt(i, 1),..., Qt(i, n)), (6)

Page 5 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

the QOM algorithm solves the problem of finding the vec- amino acid substitutions. These 4 × 4 probabilities can be tor at the minimal Euclidean distance from r such that the viewed as a particular example of the REV model [44]. sum of all its elements is one, and all elements are Note that the sum of all elements of P* adds up to one, positive. and the matrix is symmetric.

The steps of the QOM algorithm are: The marginal probabilities defined as pi = ∑j P(i, j|t*) can be calculated from the joint probabilities to be, λ 1. Construct the vector w, such that wi = ri - , where p = (p , p , p , p ) = (0.2836, 0.2311, 0.2531, 0.2322). (8) 1 n a c g t λ =−()∑ r 1 . n i=1 i Similarly, the conditional probabilities P(j|i, t*) can be ← calculated from the previous joint and marginal probabil- 2. If all wi are non negative, r w is the new regularized ities using the relationship P(i, j|t*) = P(j|i, t*) pi. Using the row. ≡ matrix representation Q*(ij) P(j|i, t*) we have,

3. Otherwise, obtain the permutation wPwp = () , such  0.... 4401 0 1834 0 2154 0 1611     0.... 2250 0 3778 0 1939 0 2034  p ≥ p Q = .(99) that wwi i+1 . *  0. 2414 0....1770 0 4295 0 1521     0.... 1968 0 2024 0 1658 0 4350  k 4. Construct Cwkw=−∑ p p , for k = 1,..., n. k i=1 i k Notice how the sum of the elements in each row adds up to one. Notice also how Q* is quite different from the ≤ identity matrix, which means that we have started with a 5. Calculate kmax = maxk = 1,..., n {k such that Ck 1}. quite divergent generating time. 6. Construct the vector If we assume a homogeneous Markov substitution process, we can interpret the conditional probabilities Q as  1 k * wp +−[]11∑ max wikp if ≤≤ the matrix of substitution probabilities at the generating = i j=1 j max rˆi  kmax time. Thus, we can characterize the underlying evolution-  0 otherwise ary process by its instantaneous rate of evolution, which can be calculated from Q* using equation (4). The resulting rate matrix R (up to an arbitrary scaling factor t ) is 7. The regularized row is given by r ← P-1 ().rˆ * given by, A 4 × 4 example starting from joint probabilities at a given  −1.. 0965 0 3690 0 . 4575 0 . 2701  generating time   1  0.... 4528− 1 2750 0 3720 0 4503  As an review of these techniques, I will use a set of 4 × 4 R = .()10 t  05.1126 0... 3396− 1 0876 0 2353  single-nucleotide joint probabilities P(i, j|t*) for i, j = {a, *    −  c, g, t} at a particular generating time t* to construct the  0... 3298 0 4481 0 2565 1 . 0345  corresponding rate matrix. This rate matrix has all the good properties: (i)"Normali- zation": the sum of the elements of each row is zero. (ii) In this example, the joint probabilities at the generating "Reversibility": p R = p R . The process is reversible by time using the matrix notation P (ij) ≡ P(i, j|t ) are given i ij j ji * * construction because we started with symmetric joint by, probabilities. (iii) "Saturation". The rate matrix converges at time infinity to the given marginal probabilities in  ACGT   equation (8). We can test saturation by using equation (2)  0.... 1248 0 0520 0 0611 0 0457  and calculating the substitution matrix for a very large P =  0.... 0520 0 0873 0 0448 0 0470 .()7 time. For instance, for t = 10t* we have *    006. 611 0... 0448 0 1087 0 0385     0.... 2836 0 2311 0 2531 0 2322   0.... 0457 0 0470 0 0385 0 1010     0.... 2836 0 2311 0 2531 0 2322  Q = .()11 10. 0t*   These 4 × 4 pair-nucleotide probabilities are taken from 0.22836 0... 2311 0 2531 0 2322   the program QRNA. They were calculated according to  0.... 2836 0 2311 0 2531 0 2322  [16] by marginalizing codon-codon joint probabilities which were constructed from the BLOSUM62 matrix of

Page 6 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Saturation (or stationarity) of a Markov process is a neces- TKF model however is hard to implement in combination sary consequence of (i) normalization and (ii) reversibil- with a probabilistic model such as an HMM, although an ity. Appendix A shows a derivation of the previous active area of research exists in that direction [36,39,40]. statement which was useful for me (and hopefully for A more direct attack to the problem of introducing phyl- some readers) when studying the behavior at t = ∞ of ogeny into existing probabilistic models originated with more complicated evolutionary models. Therefore, start- the concept of tree HMMs [34,35]. The tree HMM method ing from joint probabilities as in this example, we can models the evolution of the parsing of different sequences always interpret the marginal probabilities as the station- through an HMM. This approach is more related with the ary probabilities of the evolutionary process. evolution of transition probabilities, and I will discuss it later on in this paper. In summary, starting with a single set of joint probabilities at one particular generating divergence time t*, we calcu- Here I am going to describe a method for the evolution of late the joint probabilities at any other arbitrary time, indels under the assumption that they behave like an assuming an exponential model of evolution. To that additional residue added to a N × N residue substitution effect, given the particular set of joint probabilities (7) we matrix. This is a simplification of the problem because it have calculated the corresponding rate matrix (10) by Tay- forces indels to have linear penalties (that is, the cost of lor expansion. Thus we can estimate the substitution opening an indel in an alignment or the cost of extending matrix/conditional probabilities at any other arbitrary it with one more indel character is the same) and to time, simply using equation (2), and reconstruct the joint behave independently of each other (that is, successive probabilities at any other arbitrary time. For instance, for indel characters in one sequence will be treated as inde- t = 0.3t* we obtain, pendent events, rather than as a single indel of n residues long). Despite its apparent simplicity, this approach poses  0.... 2089 0 0249 0 0303 0 0195  interesting problems in parameterizing evolution.    0.... 0249 0 1615 0 0208 0 0239  P = .()12 03. t*  00. 3303 0... 0208 0 1864 0 0156  Let us review some of the implications of insertion and   deletion processes. The treatment that pair models give to  0.... 0195 0 0239 0 0156 0 1731  pairwise alignments can be interpreted (if we assume This method allows us to evolve pair emission probabili- reversibility, as is the case here) with all generality as if ties corresponding to different processes (in addition to one of the sequences is the ancestor of the other one. For the 4 × 4 nucleotide emissions) for instance 20 × 20 any two aligned residues we assume that they can be amino acid-to-amino acid joint emission probabilities, 64 related by a substitution process. For a residue aligned to × 64 codon-to-codon joint emission probabilities, or 16 × a gap we assume that either a residue in the ancestor was 16 RNA basepair-to-basepair joint emission probabilities. deleted in the descendent sequence, or that a residue not Thus, this method is useful to be applied in combination present in the ancestor appeared in the descendent with pair HMMs or pair SCFGs already parameterized at sequence. one fixed divergence time to make their emission probabilities a time-dependent family. An stochastic insertion–deletion process also involves insertions followed by subsequent deletions. These events The evolution of emission probabilities with indels treated as an extra leave no trace in pairwise alignments because alignments character usually do not retain gaps aligned to gaps. However, when Substitution processes (even if describing multi-nucle- we are treating indels as an extra character, we have to otide events such as codon evolution or RNA basepair account for such events. evolution) are not enough to describe the full evolutionary relationship between two biological sequences. We If we were given ideal alignments with all their gap-to-gap also need to consider indels, for which we need to intro- aligned columns we could estimate from data the (N + 1) duce more complicated models of evolution than the one × (N + 1) extended joint probabilities at a generating time, described so far. ε P* . Because that is not the case, we need to make some Indels have traditionally been a problem for phylogenetic ε ∆ ≤ inference about P* . Let us represent with , such that 0 methods. Programs to construct phylogenetic trees from ∆ ≤ 1/2, the expected frequency of observed gaps with * data such as PHYLIP [58], PAUP [59], and other phylog- respect to the total number of residues in pairwise align- eny packages [60-64] treat gaps as missing data. The theo- ments at a particular time t . The parameter ∆, can be esti- retical description of the evolution of gaps in a * mated from data, or it could be estimated according to the probabilistic fashion reached a landmark with the TKF model [30] as Thorne/Kishino/Felsenstein (TKF) model [30,31]. The

Page 7 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Suppose we start with a N × N set of joint probabilities P* λµ− λ − ()t* at a generating time t*, where p stands for the marginal ∆= 1 e λµ− ,()13 probabilities and Q* represents the set of conditional 2 µλ− e()t* probabilities associated with P*. if we knew the values for the rate of insertions λ and the µ λ µ rate of deletions , such that 0 < < . 1. Extend the joint probabilities at the generating time t* ε Let us represent with ∆' the expected frequency of missing to a (N + 1) × (N + 1) matrix of joint probabilities P* of gap-to-gap aligned columns in a pairwise alignment at a the form, ∆ particular time t*. One can estimate ' as the expected length of insertions that were later deleted without leaving  ∆  ε = Ω Pp*()|ÓÕ Ó  any trace in current sequences. The probability of a stretch P* ,()16  p ∆∆| ′  of l gap-to-gap characters is given by the geometric distri-  Õ  bution density ρ(l) = (1 - ∆2) ∆2l. Therefore ∆' is given by, where ∆ is a parameter which represents the expected frequency of gaps with respect to the total number of resi- ∞ ∆2 dues in an pairwise alignment at t*, and which satisfies the ∆′ ==∑llρ() . (14 ) ≤ ∆ ≤ ∆ 2 condition 0 1/2. The parameter ' is given in terms = 1− ∆ l 1 ∆2 Using these two parameters and the joint probabilities in of ∆ as ∆′ = , and the normalization constant is 1− ∆2 the absence of gaps at the generating time P ()ˆˆÓÕ we can * Ω ∆ ∆ ˆ construct the set of (N + 1) × (N + 1) extended joint prob- given by = 1/(1 + 2 + '). The indices with hats (Ó ) abilities at t as stand for the N residues, and exclude the gap character, * which I represent with the symbol -.   Pp()|ÓÕ ∆ The (N + 1) × (N + 1) extended conditional probabilities Ω * Ó ,()15  ∆∆′  ε  pÕ |  at the generating time Q* are given by, where we have assumed independence for the joint prob- Ω ability of a residue and a gap. The normalization factor  ∆  = 1/(1 + 2∆ + ∆') represents the fact that the observed ∆ is   + ∆ different from the value we would have obtained had we  1  + ∆ known the complete alignment.  Q*()/(ÓÕ 1 )  ε =  ∆  Q*  .()17 Another implication of insertion and deletions appears in  1 + ∆  the behavior of the marginal probabilities of single resi-  ∆ ∆′   p  dues and indels. At t = 0 when sequences have not yet  Õ 1 + ∆′ ∆∆+ ′  diverged, the marginal probability of finding a gap in an alignment should be zero. In the limit t = ∞, the pairwise 2. Construct the (N + 1) × (N + 1) extended rate matrix Rε alignment of two finite-length sequences is going to be as dominated by gap-to-gap alignments, which implies that as the divergence time increases the marginal probability n=∞ n+1 εε11− ()−1 − ε of a residue becomes negligible, while the marginal prob- R ==log(QQ1 ) ∑ (),()QQ1 − In 18 t 0 * tn0 * ability of a gap becomes one in the limit t = ∞. Our evolu- * * n=1 tionary model has to be able to accommodate such where saturation frequencies.  ∆  A step-by-step description of the algorithm for the evolution of gaps   + ∆ as an extra residue  1   Q ()/(ÓÕ 1 + ∆ )  I will start by describing the steps to implement the −1 ε = * QQ0 *  ∆ .(199) method before explaining how to derive those steps. This   method can be applied starting from two different situa-  1 + ∆    tions: starting from a N × N set of joint probabilities at a  001...  generating time that need to be extended to allow indel characters and evolved with time; or starting from a given ε tR N × N rate matrix that needs to be extended to allow indel 3. Calculate the exponential of the rate matrix e using characters. the Taylor expansion,

Page 8 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

=∞ ε εεε ε n ()tR n PijpiQij()= () (). (25 ) etR = ∑ .()20 ttt n=0 n! Λ The expression for t in equation (24) guarantees reversi- 4. Construct the extended matrix of conditional probabil- ε ε bility, that is, that the extended Pt constructed according ities at arbitrary time Qt as to the above expression are symmetric.

ε ε QQe= tR ,()21 For the other starting situation, in which we have a N × N t 0 rate marix R, the procedure to generate a (N + 1) × (N + 1) where the matrix Q0 is given by quasi-stationary reversible evolutionary model is the following:  0    1. Construct the (N + 1) × (N + 1) extended rate matrix Rε δ()ÒÔ =   Q0  ,()22 as  0   −   pqqÔ()1 00  β    − ε  R()ÓÕβδ () ÓÕ  where pˆÓ are the original marginal probabilities of P*, and R = ,()26  β    ∆∆′ − 2   the probability 0 0. the extreme case in which the N + 1 gap residue does not evolve. The instantaneous rate is given by,

ε  β  5. Construct the extended marginal probabilities p as   t  R()ÒÔ− βδ () ÒÔ  QRε = .()27 0  β      pi()1 −=Λ if Ò ,  −−βββ()...()()111qp −− qp − q  piε ()=  Ò t ()23 01 0N 0 t Λ =−  t if i , Thus β is the instantaneous rate of deletion of a character, where the probability of a gap at time t is given by, β while - (1 - q0)pˆÓ is the rate of insertion of character ˆÓ . (More complicated models in which the rate of deletion is ε − ∑ pQˆÓ t (ˆÓ ) different for different characters are also possible.) Notice Λ = ˆÓ .()24 t εε ˆ−+ − −− that q0 = 1 corresponds to the case in which the rate of ∑ˆ pQˆÓ tt(Ó )()1 Q Ó insertions is zero. I call this process "quasi-stationary" because the back- ε ε tR ε ground frequencies pt (ˆÓ ) at any finite time are always 2. Find e analytically, if an analytic expression for R is proportional to the original N-dimensional background ε given by solving the differential equation d ()etR / dt = Rε frequencies pˆÓ . This result is a concequence of the fact that ε etR , or numerically, proceeding as in step (3) of the pre- the first N elements of the last row of Q0 are proportional vious procedure. to the N stationary frequencies pˆÓ . On the other hand, while remaining "quasi-stationary" the background fre- 3. Proceed as in steps (4)-(6) of the previous procedure. quencies evolve from (pˆÓ , 0) at time zero towards "all Λ A 5 × 5 example starting from joint probabilities at a given gaps" at time infinity, i.e. limt → ∞ t = 1. This behaviour at time infinity is the consequence of the particular value of generating time q selected in the previous step. We start with the generating joint probabilities P* in the 4 0 × 4 example in equation (7), which we want to extend to a 5 × 5 matrix by adding a gap character. For this example, 6. Finally, construct the evolved (N + 1) × (N + 1) joint I have selected the arbitrary value for the gap parameter ∆ ε probabilities at arbitrary time Pt as = 0.18.

Page 9 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

The 4 × 4 joint probabilities in equation (7) augmented to matrix; instead, the instantaneous rate of evolution is ε ∆ given by, a 5 × 5 matrix P* using the gap parameter = 0.18 (which ∆ ∆2 ∆2 implies that ' = /(1 - ) = 0.0335) is given by, ε dQt = ε QR0 .()33  ACGT −  dt   t=0  0..... 0896 0 0373 0 0438 0 0328 0 0366    ε 0... 0373 0 0626 0 0321 0...0337 0 0299 P =  .()28 In this example, the instantaneous rate of evolution takes *  0..... 0438 0 0321 0 0780 0 0276 0 0327    the form, 0... 0328 0 0337 0 02776 0.. 0725 0 0300    0..... 0366 0 0299 0 0327 0 0300 0 0240   −1.. 2625 0 3692 0 . 4578 0 .. 2700 0 1655     0....5 4531− 1 4415 0 3721 0 4508 0. 1655  ε ε QR =  0.. 5130 0 3398− 1 ... 2535 0 2352 0 1655 .()34 ε 0   The conditional probabilities Q* (ij) = P (j|i, t*) are given  0... 3298 0 4486 0 2564 − 1.. 2003 0 1655    by,  −−−−0..... 0467 0 0381 0 0417 0 0382 0 1647  One should not be concerned either by having some neg-  0..... 3729 0 1554 0 1826 0 1366 0 1525    ative off diagonal components. For small times δt, the  0.... 1907 0 3201 0 1643 0 17240 . 1525  ε conditional matrix is given by, Q =  0..... 2046 0 1500 0 3640 0 1289 0 1525 .(229) *    0....6 1668 0 1715 0 1405 0 3686 0. 1525    εδε ε 0..... 2391 0 1949 0 2134 0 1958 0 1568 =≈+tR δ   QQeQδ t 000 tQR().()35 Therefore, in order to have a proper matrix of conditional The extended marginal probabilities at the particular time probabilities for sufficiently small δt, it is necessary to sat- instance t are given by, * isfy the following condition for each pair of indices i, j,

ε ε p = (pa, pc, pg, pt, p-) = (0.2402, 0.1957, 0.2143, 0.1966, if Q0(ij) = 0 then (Q0R )(ij) > 0. (36) t* 0.1532), (30) In this case, the off-diagonal components of the last row of Q are non-zero, which allows us to have negative off- which are quasi-stationary with respect to the 4 × 4 sta- 0 diagonal elements for that row in the instantaneous rate tionary probabilities p = (0.2836, 0.2311, 0.2532, 0.2322) matrix Q Rε. we started with in equation (8). 0 With the 5 × 5 rate matrix in hand, we can apply steps (3) The matrix of conditional probabilities at time zero using and (4) to obtain the conditional probabilities at any arbi- expression (22) is given by, ε trary time Qt . For instance for t = 0.3t* we obtain the fol-  10000   lowing evolved conditional probabilities:  01000   00100   Q =  .()31 0..... 7011 0 0834 0 1017 0 0654 0 0484 0  00010    0.... 1023 0 6651 0 0856 0 00985 0. 0484    ε   Q =  0..... 1139 0 0781 0 7007 0 0588 0 0484 .()37   03. t*    0..... 2822 0 2299 0 2518 0 2310 0 0051   0...1 0799 0 0981 0 0641 0.. 7095 0 0484     0..... 2685 0 2188 0 2396 0 2198 0 0533  The rate matrix for this example, calculated using the Tay- lor expansion described in equation (18) takes the value, The quasi-stationary marginal probabilities are constructed using the result Λ = 0.0487, and the 4 × 4 sta-  −1..... 2625 0 3692 0 4578 0 2700 0 1655  03. t*    0....8 4531− 1 4415 0 3721 0 4508 0. 1655  tionary probabilities p = (0.2836, 0.2311, 0.2532, ε R =  0.. 5130 0 3398− 1 ... 2535 0 2352 0 1655 .()32 0.2322), following step (5) of the algorithm as,    0... 3298 0 4486 0 2564 −1.. 2003 0 1655    ε  00000 p = (.0 2698 ,. 0 2199 ,. 0 2408 ,. 0 2209 ,. 0 0487 ). ( 38 ) 03. t*

One should not be concerned to see a whole row of zeros Finally, using equation (25), for t = 0.3t*, we obtain the for this rate matrix. For this generalized model the instan- following evolved joint probabilities taneous rate of evolution is not directly given by the rate

Page 10 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

and the solutions are,  0..... 1891 0 0225 0 0274 0 0176 0 0131     0.... 0225 0 1462 0 0188 0 00217 0. 0106  1 3 −4αt ε =+ P =  0..... 0274 0 0188 0 1687 0 0142 0 0117 .()39 ret ,()45 03. t*   4 4  0...2 0176 0 0217 0 0142 0.. 1567 0 0107     0..... 0131 0 0106 0 0117 0 0107 0 0026  1 1 − α se=− 4 t ,()46 t 4 4 Notice that this matrix is symmetric, which is the result of having imposed reversibility for any arbitrary divergence By taking the limit t = ∞ in the previous two equations, time. one can see that the saturation frequencies of the Jukes- Cantor model are pi = 0.25 for i = a, c, g, t. We can also see by calculating the conditional probabilities at large divergence times how these probabilities The 5 × 5 extended Jukes-Cantor rate matrix Rε is con- evolve towards their saturation values given by (0, 0, 0, 0, structed by adding a rate of mutation to a gap represented 1). For instance, for t = 30t* we have, by the quantity β ≥ 0 which in principle we will assume is different from the rate of substitutions α,  0..... 0020 0 0016 0 0018 0 0016 0 9930     0.... 0020 0 0016 0 0018 0 00016 0. 9930   −+()3αβαααβ ε   Q =  0..... 0020 0 0016 0 0018 0 0016 0 9930 .()40 ααβααβ−+ 30t*    ()3  ε  0...0 0020 0 0016 0 0018 0.. 0016 0 9930  R =  αααβαβ−+()3 .()47      0..... 0019 0 0016 0 0017 0 0016 0 9933   ααααββ−+()3     00000 An example starting from a rate matrix: The Jukes-Cantor model extended to gaps We also introduce the matrix at time zero Q0 which As an example of a situation in which we start with a rate ≥ depends on the probability parameter 1 q0 > 0, matrix, let us consider the generalization of the Jukes-Can- tor model [42] to a 5 × 5 evolutionary model with a gap  10000   character. The original Jukes-Cantor model assumes that  01000 α all nucleotides mutate at the same rate > 0 which is rep- Q =  00100,()48 0   resented by the rate matrix  00010  −−−−   ()/()/()/()/14141414qqqqq00000  −3αααα    αααα−3  where the particular case q0 = 1 is only allowed if simulta- R = .()41 neously β = 0, and corresponds to a trivial extension of the  αα−3 αα   original Jukes-Cantor model in which the gap character  ααα−3 α does not evolve.

tR In this simple case the conditional matrix Qt = e can be ε found analytically by solving the matrix differential equa- The conditional matrix at arbitrary time is given by Qt = ε = tR tion QRQtt . Because of the symmetries of the problem Q0 e . The symmetries of the problem in this case allow we can write ε us to parameterize Qt as  rsss  tttt  rsssγ  srss  tttt t =  tttt γ Qt  ,()42  srsstttt t ssrstttt ε     Q = ssrsγ ,()49 sssr t  tttt t  tttt γ  sssrtttt t  ξξξξσ with the condition rt + 3st = 1. We then obtain the follow-  tttt t ing differential equations γ ξ σ with the conditions rt + 3st + t = 1 and 4 t + t = 1. rrs =−33αα + ,() 43 ttt ε ≡ tR Introducing the matrix Mt e , we can parameterize =−αα + ssrttt,()44

Page 11 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

probabilities are given by (0, 0, 0, 0, 1). At any other finite γ time, the background frequencies of the model are quasi-  rssstttt t   stationary with respect to the background frequencies of srssγ  tttt t the original Jukes-Cantor model, and are given by M =  ssrsγ ,()50 t  tttt t γ  sssrtttt t =−1 Λ   pitt() (161 ), ( )  00001 4 Λ which implies that pt (-) = t. (62)

ξγ=−1 − T = T tt()(),11q0 () 51 Imposing the reversibility condition pQt tt p in partic- 4 Λ γ Λ σ Λ ular we obtain (1 - t) t + t t = t which implies, σ γ t = (1 - q0) t + q0. (52) − −βt Λ = 1 e t .()63 The differential equation to calculate Mt takes the form − −βt 1 qe0 ε M = R Mt, which translates into the differential t Therefore q0 controls how fast the background frequencies equations, approach the saturation probabilities (0, 0, 0, 0, 1) Λ β through the factor t. For a given , the larger q0, the faster =−αβ + + α Λ Λ rrsttt()33 , () 53 t approaches one. (Note that t always approaches one as t goes to infinity.) =−αβ + + α ssrttt() , ()54 At first glance, it looks like q0 could take any value includ- γβγ=− ing one in the solution for the extended Jukes-Cantor tt().155 () model. q0 = 1 would result in fixed background frequen- Which are satisfied by cies of the form (0, 0, 0, 0, 1), which is an undesirable result, and the value q0 = 1 would have to be excluded when β > 0. In fact, the limit to the ungapped Jukes-Can- =+1 −−+βαβtt3 ()4 β ret e ,()56 tor model has to be taken by setting = 0 first, and then 4 4 Λ q0 = 1. In that way, t = 0 for all times, which is the correct result for the original Jukes-Cantor model. =−1 −−+βαβtt1 ()4 set e ,()57 4 4 Derivation of the algorithm for a (N + 1) × (N + 1) quasi-stationary and reversible evolutionary process γ -βt t = 1 - e . (58) Unlike the ungapped N × N case in which the marginal probabilities are time independent, in the presence of And in addition we have gaps the marginal probabilities have to evolve with time. In fact, as I discussed earlier, the marginal probability of a 1 −β ξ =−(),159qe t () ε t 4 0 gap pt (-) has to evolve from zero at time zero to one at time infinity. As a result of that observation, probabilistic σ -βt ≠ t = 1 - (1 - q0) e . (60) evolutionary models with Q0 I are necessary in the presence of gaps in order to maintain reversibility. The reason β In the limit case = 0, the solutions for rt and st reduce to for this requirement is the following: for an evolutionary those of the original Jukes-Cantor model with the trivial model of the form etR, reversibility implies that there is additions of σ = 1, ξ = 0 and γ = 0, after setting q = 1. t t t 0 some p* such that p*Q* = p* [see Appendix B, equation (202)]; it follows then that p R = 0, and therefore p etR = p The extended Jukes-Cantor model depends on three * * * for arbitrary time t. Thus, under a reversible model of the parameters: the rate of nucleotide substitution α > 0, the tR β ≥ ≥ form Qt = e , marginal probabilities do not evolve with rate of nucleotide deletion 0, and the parameter 1 q0 time. On the other hand, if Q ≠ I then the condition p Q > 0. What is the meaning of q0 ? q0 controls the saturation 0 * * frequencies (i.e. the background frequencies at time infin- = p* does not imply p*R = 0 for the rate matrix R, and there- ity), as well as the background frequencies at any other fore it does not impose p* as the marginal probabilities for β ∞ finite time. For > 0 and 1 >q0 > 0, taking the limit t = arbitrary t. (See appendix B for more details on this point.) in equations (56)-(60), one can see that the saturation

Page 12 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Therefore, to model the evolution of gaps we need to gen- The generalized conditional matrix in (64) also saturates eralize the evolutionary model to have the following form at very large times, and the saturation probabilities (i.e. the marginal probabilities at infinity) are given by those of ε ε ε tR ε tR Qt = Q0 e . (64) the rate matrix, that is limt→∞ Qt = limt→∞ e (see Cor- ollary A.1). Because of the relationship in equation (66) The matrix Q0 can be parameterized in following form, −1 ε between the rate matrix and the matrix QQ0 * , the satu- ε  0  ration probabilities p∞ are given by the condition (see    δ()ÓÕ  Appendix B), Q = ,()65 0  0    εεεε−1 = −1  −  p∞()()()()()().() iQQ0 ** ijp∞ jQQ0 ji 71  pqqÕ()1 00 ξ This matrix depends on one additional parameter q . The 0 GF ED ∝ particular dependency Q0(-, ˆÓ ) pˆÓ , is necessary to obtain quasi-stationary reversibility of the marginal ?? probablilities. ? ?? ?? X s9?F 2 εε≡ ss ?? 2 The rate matrix is now a function of Q0 and QQ* = , ss O 2 tt* s 2 ss γ 2 and takes the form ss γ 2 κ ss 2 ss 22τ ss 2 ss δ 1−−τ−γ 2 εε= 1 −1 ss 22 R log(QQ0 * ), (66 ) ss 2 t ss 1−2δ−τ Ö 2 * ss GFED@ABCB / XY /GFED@ABCE −1 ε JJ 1−2κ−ξ τ E where the matrix QQ0 * has the form, JJ JJ 22X1 JJ 211 JJ 221  ∆  JJ 21 1−−τ−γ JJ δ 211   JJ 2 1 τ 1 + ∆ κ JJ 2 1   JJ 221  Q ()/(ÓÕ 1 + ∆ )  JJ 211 −1 ε = * JJ 221 ?? QQ0 *  ∆ , (()67 JJ ?   J% ?? ?? Y  1 + ∆  ?? O  −  ?  pj()1 ηη where

AFigure pair-HMM 3 model  1  ∆ 1 ∆ η =−1  + .()68 A pair-HMM model. Description of a pair-HMM model. + ∆ ∆+∆′  qq001 The three states: Emit-a-pair (XY), Emit-X (X), and Emit-Y Notice that Q may be inverted as long as 0

Page 13 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Then using equation (71) we can see that the saturation vation of rate matrices from sequence data [23,67], or esti- probabilities maintain the quasi-stationary property that mating multiple nucleotide changes [68]. In comparison, was imposed at the time instance t*, and are given by, the ideas to describe the evolution of transition parameters in probabilistic models are much less standardized [34-40]. ε pp∞∞∞=−[(172ΛΛ ),], ( ) i The goal of this section is to describe the evolution of tran- where sition probabilities. For instance, in the pair-HMM of Fig- ure 3 the transition probabilities from the "XY" state to the ∆ "X" or "Y" states describe the introduction of gaps in one Λ∞ = .()73 ∆∆++()()11 −η of the two aligned sequences, using an affine penalty. These transitions should be zero when the sequences have As I discussed before, it is reasonable to impose that at not yet diverged (time zero), but they should be maximal Λ η infinity all we find is gaps, i.e ∞ = 1.0, which implies = at infinite divergence. In between these two extremes, it is 1 and desirable to model the transition probabilities changing with divergence time. These methods are termed "evolu- ∆∆′ − 2 tionary" because the transition probabilities will be q = .()74 0 ∆∆′ + parameterized with time, using functions that are general- izations of the Markov process that probabilistic evolu- Notice that because the relationship given in equation tionary models assume for substitutions. Unlike the TKF ∆ ∆ (14) between ' and , then 0

Page 14 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

not fix them all completely. When the more restrictive An example: evolution of the transition probabilities of a pair-HMM conditions are used, both algorithms give the same "XY" state results. These two algorithms are applicable to most pair Consider the transition probability vector associated to and profile probabilistic models, be they HMMs or the "XY" state of the pair-HMM given in Figure 3, SCFGs, generalized or not. I present an example of the evolution of a vector probability vector for a pair HMM, →  T XY XY  and an example of the evolution of a matrix of transition  t  → substitutions for a profile HMM.  T XY X  XY =  t  qt → ,()83  T XY Y  Evolution of a vector of transition probabilities  t   XY→ E  A step-by-step description of the algorithm  Tt  Let us start by providing the recipe to apply the algorithm: which describe the four possible transitions from a corre- 1. Given a transition probability vector lated emission of two nucleotides to another correlated XY→ XY emission in both sequences (Tt ); to a gap in  T1   t  n XY→ X XY→ Y = i =∀ sequence Y (Tt ); to a gap in sequence X (Tt ); qt   such that∑Ttt 177 , , ( )   →  n  i=1 or to end the alignment (T XY E ).  Tt  t 2. Assume its set of values is known at the three particular Below are some arbitrary values for the transition vector at ∞ ∞ time instances of t = 0, t = t*, and t = , named q0, q*, and divergence times: t = 0, t = t*, and t = associated with q∞. Assume each component i in these probability vectors state "XY", qXY, satisfies one of the following three conditions,  101.(−τ )  0741.(−τ)   0001.(−τ )       −τ −τ −τ q (i) q (i) >q∞(i) or q (i) = q (i) = XY 001.( ) XY  0131.( ) XY  0501.( )  0 * 0 * 0 * qq= , = ,q∞ = .()84 0  001.(−τ ) *  0131.(−τ )  0501.(−τ ) q∞(i), for all i. (78)        τ   τ   τ  3. If the three input vectors satisfy the condition, The transition TXY→E = τ is related to the expected length of the alignments generated using the model. We typically qi()− q∞ () i want to keep that transition invariant through time, and * =−r,()79 − correlated with the alignment length L: τ = 1/L. (This pair qi0() q∞ () i HMM produces sequences with a geometric length distri- where r > 0 is a real number independent of i, then calcu- bution of mean 1/τ.) The other three transitions change late qt at an arbitrary time t (0

n    qi()− q∞ () i Similarly to this "XY" state case, all the other transition wqiqitt=−∑[ ( ) ( )]exp / log  *  .()82 t 0 ∞ * − probabilities that appear in the pair model of Figure 3 =   qi0() q∞ () i  i 1 could be continuously parameterized with the divergence time of the alignment being scored. This algorithm can be

Page 15 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

applied to any full set of transition probabilities emerging because it corresponds to the case described in equation from a particular state in a given probabilistic model that (80). must evolve with time. Derivation of the algorithm to evolve a vector of transition Connection with a tree-HMM 2×2 match-transition matrix probabilities In the original representation of a tree-HMM [35] the idea To describe the evolution of transition probabilities, the of a match-transition matrix is introduced. If one parse simple exponential models used for substitution matrices through the HMM generated a Match to Match (MM) are not sufficient. I propose to adopt a generalization of transition, while another parse through the model gener- the exponential model of the form, ated a Match-to-Delete (MD) transition, one can consider the substitution of MM by MD similarly to a substitution qqreITTTtR=+(), − ()92 of residues by the conditional probability P(MD|MM, t). t 0 This leads to the concept of a 2 × 2 match-transition where I is the n × n identity matrix, r is a vector still to be matrix given by, identify, and a R is a n × n rate matrix.

MM MD This model simply adds to the exponential term a time independent vector a, MM  PMMMMt(|,)PMDMMt(|,),(86))   MD  PMMMDt(|,)PMDMDt(|,) TTTtR=+ qaret .()93 ≥ which in [35] is parameterized with two real numbers r Because qt=0 = q0, then it is necessary that a = q0 - r, thus giv- 0, and 0 ≤ a ≤ 1 as ing the expression in equation (92). Note that this is the most general solution of a differential equation of the MM MD ∝ form qt (qt - a). Until now it was always assumed that  −rt −rt  MM aae+−()111 ()()−−ae (887) the constant term was zero, that is r = q0. The freedom  .  − −  added by including a term constant in time is that, while MD  aae− rt 1−+aaert  before the behavior at t = ∞ was solely controlled by etR, Tree-HMMs model the evolution of paths though the now the additional term also contributes to that limit. HMM. In contrast, the method proposed here models the evolution of the transition probabilities of the model An immediate consequence of this generalization is that themselves. However, one can see that the match-transi- the rate matrix is not now sufficient to determine the tion matrix is closely related in form to the model we have whole evolutionary process. In addition to the rate matrix, proposed here. Introduce the probability vectors, the probability vector must also be specified at time zero (no divergence) and at time infinity (all mutations have  PMMMMt(|,)  PMMMDt(|,) saturated) such that, qMM =  , qMD =  .()88 t  PMDMMt(|,) t  PMDMDt(|,) T =+TT tR − qqr∞ o  lim eI . (94 ) For t = 0 and t = ∞ they have the following values,  t→∞ 

      The exponential of the rate matrix R has the general form, MM1 MD0 MM MD a qqqq=  ,,=   ∞∞==  .()89 00 0   11  − a     k1  It is easy to see that the match-transition matrix given in tR    −1  etU=−exp  U  equation (87) can be rewritten as,       kn   − MM=+ MM tR MM − MM  kt1  qqeqqt ∞∞(),()0 90 e   − = U  U 1,()95 MD=+ MD tR MD − MD  −  qqeqqt ∞∞(),()0 91  e ktn 

 −r 0  n for a diagonal rate matrix R =   . Such diagonal for some real eigenvalues {}kii=1 . If conditions are  0 −r  ∀ restricted to the case in which ki > 0, i, the immediate rate matrix does not require additional normalization consequence of working with positive eigenvalues is that,

Page 16 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

q0(i) q* (i) >q∞ (i). (104) limetR = 096 . ( ) t→∞ Even though this model was derived under the conditions of equation (104), it also extends to the degenerate case There is then a simple relationship between the vector r where for some i we have and the values of the probabilities at time zero and saturation, q0(i) = q*(i) = q∞(i), (105) q∞ = q0 - r. (97) since this simply corresponds to these parameters under- going no evolution at all. Therefore, we can write with all generality Therefore if the input column vectors satisfy one of the TTTtR=+ − three previous conditions for each one of their elements, qqreIt 0 [] the parametric expression is =+−TTtR qqqe∞∞()0 ()98 TT TtR − qqqqe=+−∞∞()  kt1  t 0 e   −      qq*()11∞ ()  − exptt /* log( ) TT 1   q (1))()− q∞ 1   =+−qqqU∞∞() U . 0 ()106 0 =+−TT .   qqq∞∞()0   −    ktn   qn()− q∞ () n   exptt / log( * )   e   * −    qn0() q∞ () n 

However, for the given information (q0, q*, q∞), the time- A normalization condition has not yet been imposed. T parameterized vector qt in (98) is still underdetermined. Using the unity vector u = (1,..., 1), normalization requires that In order to reduce the amount of freedom, I assume that tR e is diagonal (i.e. U = I). Diagonal rate matrices have quT =∀1107 t.() been used in other contexts of generalized evolution such t ∝ tR as the tree-HMM model [34,35]. Then we have, For an evolutionary model of the form qt e normalization requires that etRu = u for arbitrary times. I refer to this −  kt1  property as the strong normalization condition. The nor-  e  TT=+− T malization of a generalized evolutionary model of the qqt ∞∞() qq0   . ()99 TT TtR  −  form qq=+−∞∞() qqe requires the weaker condi-  e ktn  t 0 T tR tion (q0 - q∞) e u = 0. This property is always true for a rate At this time the known probabilities at the generating matrix that satisfies the strong normalization condition. I time t*, q*, have not yet been used. These are, refer to this property as the weak normalization condition. − − TT=+ kt1 * ktn * qq* ∞ ( re (1100 ) ,..., rne ( ) ). ( ) In order to obtain the strong normalization condition Thus we obtain automatically it is necessary to have a rate matrix of the λ λ -1 form R = U diag (0, - 2,..., - n) U (see Appendix A, equa- − qi()− q∞ () i qi()− qi () tion (193)). Such a type of rate matrix is not appropriate e kti * = * =+1* 0 ,( 1011) − − to describe the evolution of a probability vector, since qi0() q∞∞ () i qi0() q () i such rate matrix cannot be uniquely inferred from the which can be solved for ki, three input probability vectors q0, q*, q∞. For that reason, I have explored the use of rate matrices of the diagonal 1 qi()− q∞ () i form R = diag (-k ,..., -k ), which can be inferred from the k =− log * .()102 1 n i − three input probability vectors q , q , q∞ using expression t* qi0() q∞ () i 0 * (102). Such diagonal rate matrices, however, in general do − not satisfy the strong normalization condition, thus the The condition k > 0 translates into 0

Page 17 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

≡−TtR I→I wqqeut ()0 ∞ T n  −  =− qi*() q∞ () i ∑[qi0 ( ) q∞ ( i )]exp tt /* log( ). (108 )  qi()− q∞()i  i=1 0 ? If the input vectors satisfy the condition for all i, ? ?? ? ? qi()− q∞ () i ? * =−r,()109 ? I − D? 7 qi0() q∞ () i Ù ? 7 ÙÙ 77 for some real number r > 0, then the rate matrix has the Ù 7 M→I Ù 7 I→M particular form R = diag(-r,..., -r), the function wt = 0 for T Ù 77T arbitrary times, and the weak normalization condition is ÙÙ 7 satisfied automatically. Ù 7 ÙÙ 77 The previous condition is in general too restrictive. If the Ù previous condition is not satisfied, by construction wt is T M→M zero at t = 0, t = t and t = t∞, but in general w ≠ 0 for n > / * t M M 2. Normalization is then achieved (107) by modifying our definition of q to @ ? t @ ~ @@ ~~ q @ ~ q ← t @ ~ D→M t + @ ~ T 1 wt @~ ~~@@ The final expression is ~ @ ~~ @@T M→D  −  qq*()11∞ () ~ @  exp[tt /* log( ))]  qq()11− ∞ () ~ @ T T  0  q∞ ()qq− ∞ qT = + 0  .(1110) ~ @ t 11+ w + w   ~ t t  qn()− q∞ () n  exp[tt / log( * )]  * −   qn0() q∞ () n  ONMLHIJK /ONMLHIJK D D Evolution of a matrix of transition probabilities T D→D In some cases, several states of a model correspond to a particular evolutionary event, and it seems natural to expect that their transitions would evolve under the con- PartFigure of a4 profile HMM model trol of the same rate matrix. For instance, in a profile Part of a profile HMM model. For a profile HMM we HMM (Figure 4) I will consider the joint evolution of the depict the transition probabilities associated with the states transitions of three states associated with a given consen- of a given consensus position in the profile: Match (M), Insert sus position: Match (M), Insert (I), and Delete (D). (I), and Delete (D). The three states corresponding to the next position in the profile are referred to with primes. The Match state has three transitions (into I, M', and D'), while For a collection of m states S = {S1,..., Sm} that transition the Insert and Delete states have two transitions each (into I into a collection of n states E = {E1,..., En}, consider the set of all transition probabilities emerging from the m origi- and M', and into D' and M' respectively). nating states S and ending in the n states E,

SE→ , Timini i for ==1,..., , and 1 ,..., . ( 111 ) The set of S and E states do not have to be mutually exclu- design. For instance, for the states associated with a con- sive, and some E states can also be part of the S set. The set sensus position in a profile HMM the set of originating of E states also has to be complete, in the sense that states is S = {M, D, I}, and the set of ending states is E = {M', D', I}, where the prime index indicates the next posi- n tion in the profile. The condition m ≤ n can be imposed SE→ i i = ∑TS1..()for all i 112 with all generality. i=1 On the other hand, not all E states need to be reached by a given Si state; some transitions may be forbidden by

Page 18 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

We want to describe the evolution with time of the transi- fashion that change is absorbed by the transition proba- tion probabilities. For that purpose, I will use the m × n bilities into any other E state. matrix of transitions Qt such that, A step-by-step description of the algorithm SE→ The recipe to implement the algorithm is as follows: ≡ i i QiiTt ()t . (113 ) tR ≤ Note that an evolutionary model of the form Qt = Q0e , 1. Assume we know the m × n (m n) matrices of transi- like that used for evolving gaps as extra characters, is not tion probabilities at time zero Q0 and at time infinity Q∞, tR sufficient. A model of the form Qt = Q0e in the limit of such that the rank of Q0 - Q∞ is m. infinite divergence would necessarily result in transitions that for a given end state are all identical and independent 2. If an analytic n × n rate matrix R is given, one can find of the previous state. That is clearly too restrictive for most the analytic expression for etR by solving the differential models, for instance in a profile HMM, in which some of equation d(etR)/dt = RetR, and jump to step (6). For a the transitions are not evolved and are set to zero. numerical solution jump to step (5).

In order to allow for more general saturation properties of 3. If the information given is the set of transition proba- the transition probabilities, I propose the following bilities at a generating time t*, calculate the rate matrix R model for the evolution of the matrix of transition as, probabilities, l=∞ − l+1 =+−=11()1 − l tR R log[IOQQnn× (* 0 )] ∑ [(OQ* Q0 )]. (119) Qt = Q0 + K (e - I), (114) t* tl* l=1 where R is the n × n rate matrix, and the m × n matrix K is The n × m matrix O is obtained by solving the set of linear still to be determined. This extension (as in the vector equations model proposed before) corresponds to adding a constant T term A = Q0 - K, and it is is the more general solution of a -(Q∞ - Q0) O + umv = Im × m, (120) ∝ differential equation of the form Qt (Qt - A). where um is the m dimensional unity vector [i.e.

tR tR T = We will see that K(e - I) = (Q0 - Q∞)(e - I), thus um (11 ,..., ) ], and v is a m dimensional vector uniquely determined by the set of m independent linear equations, tR Qt = Q0 + (Q0 - Q∞) (e - In × n), (115) T v (Q∞ - Q0) = 0, (121) tR = Q∞ + (Q0 - Q∞) e . (116) T v um = 1. (122) As in the previous case, a freedom provided by the additional constant-in-time term is that while the saturation The solution of equation (120) is not unique. In fact, tR behavior of Q0e is controlled by the saturation probabil- equation (120) determines the matrix O up to a n dimen- ities of etR, the model given by equation (116) is inde- sional probability vector ψ that satisfies the conditions ψT pendent of those saturation probabilities so that the O = 0. This probability vector corresponds to the satura- probabilities at infinity can be set arbitrarily. That is, tion probabilities of the matrix etR. While the rate matrix R assuming that ψ is the n dimensional vector of saturation and the matrix etR depend on the choice of the saturation probabilities of etR, probabilities ψ, the asymptotic behaviour of the matrix of transition probabilities is independent of ψ, as was shown tR ==ψψT T in equation (118). limQe00 Qun un , (117 ) t→∞ 4. Impose the condition, tR T lim[QQQeQQQuQ∞∞∞∞∞+− ( ) ] =+− ( )ψ = . (118 ) →∞ 00n t T v (Q* - Q0) = 0. (123) Notice that while Qt, Q0, Q∞ and K are m × n matrices operating in the S × E space, the matrices R and etR are square This condition [necessary so that ψT R = 0] imposes con- n × n matrices operating in the E × E space. In fact, etR straints between the set of probabilities at time zero, at determines the change in time that a transition probabil- time t*, and at time infinity. ity into one of the E states experiences and in which

Page 19 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

5. Calculate the exponential of the rate matrix etR using the → times when T MI> 0 , the average length of a insertion corresponding Taylor expansion. δ t would be very close to one. 6. Finally, calculate the set of evolved transition probabilities as, At time infinity, one can parameterize the transition probabilities as tR Qt = Q0 - (Q∞ - Q0)(e - In × n) (124) MDI′′ or M1−+()mmmm  DI DI Q∞ = D  10− dd. ()128 Q = Q + (Q - Q ) etR. (125) DD t ∞ 0 ∞  −  I  10iiII An example: Evolution of the transitions of a profile HMM given a rate matrix where mD and mD represent the probabilities of Match to To illustrate this method, consider the case of a profile Delete and Match to Insert at infinity, and dD and iI are the HMM (Figure 4). There are three states associated with a Delete to Delete and Insert to Insert probabilities at time ≤ ≤ given consensus position in the profile: the Match state infinity, (0 mD, mI, dD, iI 1). (M), the Insert state (I) and the Delete state (D). These three states transition into states M' (the Match state at the Let us assume that the rate matrix is given by next position in the profile), D' (the delete state at the next ′′ position in the profile), and I (the insert state between the MDI two matched positions), therefore in this example m = 3 M′ −2αα α and n = 3.   R = D′  ααα−2 . ()129  αα− α Consider the following the transition matrix I  2  α MDI′′ for some parameter > 0. This rate matrix assumes that the rate of change in the occurrence of state M' is similar  MM→ ′ MD→ ′ Mt→  M TTTt t t  to that of state D' and that of state I, and that this change → ′ → ′ reverts equally into the other two states. More realistic sit- Q = D  TTDM DD 0 .126() t  t t  uations can be achieved using rate matrices depending on I  IM→ ′ II→   TTt 0 t  more parameters.

This transition matrix, like a nucleotide substitution The instantaneous rate of transition change is given by, matrix, adds up to one by rows. We assume as in HMMER ′′ [4] that there are no transitions between the Insert and the MDI M −+333ααα()mm m m  DI D I Delete states, but the model could work under more gen- −− =−−αα − ()QQR∞ 0 D  33()()dqDD dq DD 0. ()130 eral conditions.  − ααα−− I  3 ()iqII03 () iq II

The matrix at time zero is given by This matrix gives the instantaneous change that a transition probability experiences under this model and MDI′′ describes how that change is transferred to the other allowed transition probabilities. M 100   =− tR QtDDD 10qq . ()127 The matrix e can be obtained analytically by solving the  −  differential equation d(etR)/dt = RetR. This is a 3-dimen- I  10qqII sional Jukes-Cantor model which has as its solution, The parameter 1/(1 - q ) is the average length of an insert I ′′ in between two matched positions at very short times (if MDI there were no deletions). The parameter 1/(1 - qD) is the M′ rss  ttt average length of a deletion at very short times (if there tR = ′ e D  srsttt.131() were no insertions, and all position in the profile had the   I  ssr same parameters at time zero). For instance, one could set ttt q very close to zero, which implies that, for very small I with

Page 20 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

idues in between the two matches a and b; the model 1 2 − α assigns to such sequence a probability given by, re=+ 3 t ,()132 t 3 3 = MI→→−IIn1 IIM→ ′ Paiibt(11 ...nMMIInt | ) p ( ap )′ ( bpi ) ( )... piT ( ) ( Tt ) Tt ,()143 =−1 1 −3αt set .()133 where pI(ii) represent the emission probabilities associ- 3 3 ated to the Insert state. Putting all together, we obtain the following evolved transition probabilities for a profile HMM under a Jukes-Can- We can interpret that in time t the path through the model tor like assumption for the rate matrix: that generated ab has evolved into the path through the model that generated ai1...inb with probability given by MDI′′ −+ M 13()mmsDIt 3 ms Dt 3 ms It  MI→→−→IIn1 IM′   = =+−tR = −− −−+− Paiibabt(11 ...nIInt | , ) pi ( )... piT ( ) ( Tt ) Tt . (1444) QQt ∞∞() QQe0 D  ()(13qdDDqsDt)() q D30 d D qs Dt .() 134  −− − + −  I  ()()13qI i I qs It 0 q I 3 () i I qs It To get a better intuition of what this means, take as an Substituting the values for st given in equation (133), we have the following evolved transition probabilities for a example the case in which the time interval t is very small. profile HMM, Then the probability that a path between two matches in the HMM inserts n residues in time t ≈ 0 is MM→ ′ =− + −−3αt Tmmet 1()(),()ID 1 135 − Paiibabt( ... | , )≈− pi ( )... pi ( )31α mqn 1 ( qt ) . ( 145 ) → ′ − α in I1 In II I TmeMD=−(),()13 t 136 t D α This probability is proportional to 3 mI the rate of substi- MI→−=−3αt tuting a Match-to-Match transition for a Match-to-Insert Tmet I(),()1 137 n−1 transition, and to qI (1 - qI), which is the geometric fac- DM→ ′ =− − − −−3αt tor associated to an insert of length n at time zero. Tqdqet ()()(),()1DDD 1 138 An example: Evolution of the transitions of a profile HMM given the DD→ ′ =+ − −−3αt Tqdqet DDD()(),1 () 139 transitions at a generating time In this case, we maintain the same values for the transition → ′ − α probabilities at time zero Q and at time infinity Q , but TqiqeIM=−()()(),1 − − 1 −3 t () 140 0 ∞ t III the rate matrix will be obtained from a generating time for which we know the transition probabilities. II→−=+− −3αt Tqiqet III()(),()1 141 The set of linear equations in step (3) of this algorithm The evolution of different paths through the HMM that determine the vector vT = (v , v , v ) are In a tree-HMM one assumes that the different paths 1 2 3 through the model are the objects that are subject to evo- m v + (d - q )v = 0, (146) lution [34]. Here we have directly modeled the evolution D 1 D D 2 of the transition probabilities of the HMM. We can get an m v + (i - q )v = 0, (147) intuition for the meaning of these evolved transition I 1 I I 3 probabilities by estimating how these evolved transition v + v + v = 1. (148) probabilities induce the evolution of different paths 1 2 3 through the model. A process that is similarly to that The solution of these linear equations is modeled by a one-branch tree-HMM. v = (d - q )(i - q )/d, (149) Suppose that at time zero, we emitted residue a from state 1 D D I I M, and residue b from state M'. The model assigns to such v = -m (i - q )/d, (150) sequence a probability given by, 2 D I I

v3 = -mI (dD - qD)/d, (151) ==MM→ ′ = Pabt(|0 ) pMM ()() ap′ bT0 pMM ()(), ap′ b () 142 ≡ where d (dD - qD)(iI - qI) - mD(iI - qI) - mI(dD - qD). where pM(a) and pM'(b) represent the emission probabilities associated to the M and M' states respectively. Now Parameterize the matrix O in the form suppose that at time t there has been an insertion of n res-

Page 21 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

 MMM  070... 020 010  123   = = O  DDD123,()152 Q∗  060.. 040 0 .()162      III123  0700030.. with each row adding to zero. The set of linear equations Selecting the particular values qD = qI = 0.1 and dD = iI = 0.6, in step (3) that determine the matrix O are using the constraints of equations (161) implies that mD = 0.33 and mI = 0.25. Using these values and the arbitrary ψ (dD - qD)(M1 - D1) + v1 = 0, (iI - qI)(M1 - I1) + v1 = 0, (153) values for the saturation probabilities = (1/3, 1/3, 1/3), we obtain the following O matrix: (dD - qD)(M2 - D2) + v2 = 1, (iI - qI)(M2 - I2) + v2 = 0, (154)  8... 0000−− 4 6667 3 3333  (d - q )(M - D ) + v = 0, (i - q )(M - I ) + v = 1. (155)   D D 3 3 3 I I 3 3 3 O =  −4.. 0000 1 3333 2 . 6667 .()163    −4.. 0000 3 3333 0 .6 6667  Solving Mi and Ii in terms of Di we have, The rate matrix R constructed using equation (119) is   −−11 +−− +mI  D12()iqII D (iqm II I ) D 3 given by  d d d  = O  DDD123..()156   1 11  D +−−[(dq ) ( i−−+−−−−−−−qD)] (iqmm ) D (dqmm )   1 d DD IIIIIDDDDI23d d   −0.. 4757 0 3054 0 . 1703    R =  0... 4406− 0 6109 0 1703 ,()164 The matrix O is therefore determined up to the unitary   ψT ψ  0.. 0351 0 3054− 0 .6 3406  vector (D1,D2, D3). The saturation probabilities = ( M', ψ ψ ψT ψT D', I')( R = 0) are defined by the equations O = 0, and an instantaneous rate matrix -(Q∞ - Q0)R is given by, which imply  −0... 4323 0 3046 0 1277    − ()dq− −−() iq −− =− = ψψiqII− DD II ()QQR∞ 0  0.. 4569 0 4569 0 .()165 D1 M, I ,()157   d d  −0.. 2554 0 0 2554 

−− −− − The evolved set transition probabilities at time t = 0.3 is iqmII I iqmmII I D D =−ψψ′ − ,()158 2 M d I d given by,

−− −   mI dqmmDD D I 0... 8845 0 0800 0 0355 D =−ψψ′ + .()159   3 M I = d d Q03.  0.. 7800 0 2200 0 .()166    0.. 8290 0 0 1710  Substituting vector D with vector ψ we finally obtain the following expression for the matrix O in terms of the sat- Using the "vector" method, in which transition probabil- uration probabilities ψ: ity vectors evolve independently with the same set of parameters, we would have obtained the identical result,

 −−−−ψψ′′′′()(iq d q ) ψψψψ () iq −+− m m ( d −+ q ) ψmm−ψ   DI I ID D DI I ID DI ID D DDI ID  = 1 ψψ+−−− ψ −+−−+ ψψ ψ ψψψψ−− − + O  ()()()()()MIIIIDD′′iq d q MIIII iqm ID m ID()()dqm D D M′ I m I.(1660) d  ψψ+−−− ψ ψ −−−ψψ + − ψψ + −−+ ψ   ()()()(MDDDDIIDI′′dq ′ iq ′ iqqmII)( MDD′′ ) m ( MDDDD ′′ )( d q m ) DI ′ m  0... 8846 0 0800 0 0354    The condition in step (4) of the algorithm translates in = Q03.  0.. 7798 0 2202 0 .()167 this case into the following relationship of parameters,    0.. 8290 0 0 1710  → ′ → ′ The normalization function w given in equation (82) is ()dqt−=MD mt (DD − q ),()161 t DD**D D different from zero only for dimensions larger than two. −=−MI→→II ()iqtII** mtI ( qI ). The second and third row effectively have dimension two This is an additional set of constraints that the "vector" (since one of the elements is always zero), and do not algorithm does not impose. require normalization. For the first row the normalization function takes the value w0.3 = 0.0020. To test the algorithm I have made up a toy HMM consen- The vector method allows us to use more unrestricted sets sus state, which at the generating time t* is given by the matrix of transitions, of parameters than the matrix method since the conditions in equation (78) are independent for each row. In principle, however, the conditions in equation (161)

Page 22 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

seem to allow behaviors that the vector model does not by Q*, the calculation of the rate matrix R involves the fol- MD→ ′ ≡ MD→ ′ lowing steps: allow such as t* >mD t∞ as long as, simultane- DD→ ′ ≡ DD→ ′ (a) The m × n matrix K is by construction invertible ously, t* >dD t∞ . In practice when I have tested ≠ that kind of situation, the rate matrices obtained are because we have imposed ki 0, for all rows i. always not real, and therefore they lack any biological interpretation. A little aside with respect to matrix inversions is in order here. The (unique) inverse of a matrix is defined only for Derivation of the algorithm to evolve a matrix of transition square matrices. One can introduce a inverse-like matrix probabilities for a non-square matrix; these are called pseudoinverses We start with a model of the general form [69]. The pseudoinverse of a non-square matrix is not unique and many pseudoinverses can be defined; one of tR Qt = Q0 + K (e - In×n), (168) the best known is the Moore-Penrose matrix inverse [70]. We will see how despite the fact that the pseudoinverse of where Q is the known m × n matrix of probabilities at 0 K is not unique, we can still define Qt uniquely. time zero, and the m × n matrix K must still be determined. Therefore solving for R in equation (114) at the particular time t* we have Assume that we know the transition probabilities at time infinity, which we represent by the m × n matrix Q∞. Then, 1 − R =+−log IK1 () Q Q , (172 ) because of the asymptotic behavior of the exponential t  * 0  tR tR ψT * family e , limt→∞e = un for some n dimensional ψT ψ ψ -1 saturation probabilities, where = ( i,..., n), and the n where K is the n × m pseudoinverse of K defined by the conditions KK-1 = I and K-1K = I . T m×m n×n dimensional unity vector un = (1, ..., 1), we have

(b) Because the final result for Qt in equation (116) does ψT Q∞ = Q0 + K (un - In×n). (169) not depend on the values ki we can set them with all gen- This equation implies that ρ ≠ erality to the form ki = 0. Therefore we have

ψT ρ ψT K = -(Q∞ - Q0)+ k , (170) K = -(Q∞ - Q0) + um . (173)

-1 ρ -1 Because K Kun = un and Kun = um, then we need that K um where k is a m dimensional vector that represents the ρ-1 = un. Therefore we propose that the n × m pseudoin- -1 sums by rows of K, i.e. ∑jK(i, j) = ki which we impose to verse matrix K has the following form, be different from zero. − 1 KOu1 =+ ν T ,()174 Because for the exponential family etR we have the reversi- ρ n bility condition ψTetR = ψT for arbitrary time, introducing the expression for K in equation (170) in the equation where the n × m matrix O, and the m dimensional vector (114) we have the general result, v satisfy the conditions,

tR O u = 0, (175) Qt = Q0 - (Q∞ - Q0)(e - In×n). (171) m

T This result proves point (6) of the previous algorithm v um = 1. (176) description. -1 (c) In order to satisfy K K = In×n we need to have,

Therefore if given Q0, Q∞ and a n × n rate matrix R, which T satisfy the reversibility conditions ψTR = 0, we can calcu- v (Q∞ - Q0) = 0, (177) late the evolved transition probabilities using the equa- ψT tion (171). -O(Q∞ - Q0) + un = In×n. (178)

In the case in which the information given is the set of transition probabilities at a generating time t*, designated

Page 23 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Equation (177) is a set of homogeneous linear equations Reversibility and multiplicativity that together with the normalization conditions in equa- For a given probabilistic model, imposing reversibility has tion (176) uniquely determine the vector v. different implications for its emission and transition probabilities. In pair models, we assume that the emission -1 On the other hand, in order to satisfy KK = Im×m, the fol- probabilities are reversible by imposing P(at, bt+t') = lowing must apply: P(at+t',bt), which corresponds to using symmetric joint probabilities represented by the shorthand notation P(a, ψTO = 0, (179) b|t'). If the emissions do not involve gaps, the marginal probabilities do not evolve, and the evolved joint proba- T -(Q∞ - Q0)O + umv = Im×m. (180) bilities are obtained from the evolved conditionals and the saturation probabilities. In the presence of gaps, I have Equation (180) is a set of linear equations which deter- described how to construct the evolved conditionals and mines O aside from a dependence on an arbitrary proba- the corresponding evolved marginals in a way that main- bility vector. In particular we can find the expression of tains reversibility for any arbitrary time, so that we can matrix O in terms of the vector ψ as we did in equation construct evolved symmetric joint probabilities. (160). For transition probabilities the situation is different. Once the matrix O has been obtained using equation Mathematically, a matrix of transition probabilities is like (180) as a function of the vector ψ, one can verify that the a substitution matrix (i.e. conditional probabilities) but set of equations describe by (178) is automatically there is not the equivalent of "joint" probabilities for tran- satisfied for any vector ψ as long as it satisfies the condi- sitions. To maintain reversibility for the transitions of a tion ψTO = 0. This is the result presented in step (4) of the probabilistic model, one has to build reversibility in the algorithm. design of the model. In particular, one needs to be sure that the transition probabilities that involve gaps lack any (d) Because ψ corresponds to the saturation probabilities directionality. For instance, in the pair-HMM of Figure 3 of etR, then it is necessary that ψTR = 0. This condition is →→ we need to impose that TTXY X = XY Y for arbitrary satisfied if, t t times. That is achieved by making sure that the input tran- 1 sition probabilities at time t*, zero and infinity do lack ψ T []Ou+−ν T () QQ= 0181 .() ρ n * 0 directionality.

Therefore it implies that, Another property of probabilistic models of evolution for residue substitutions is multiplicativity. Multiplicativity is T an immediate property for evolutionary models of the v (Q* - Q0) = 0, (182) form etR. For residue-substitution evolutionary processes, which is the condition imposed in step (4) of the algo- multiplicativity implies that the transition from one given rithm. Under those conditions, it results for the rate event (say residue a) to another event (say residue b) in a matrix R, finite time, if it goes through any intermediate state, has to be of the form of any other possible substitution. In mathematical terms, =+−1   R log IOQQnn× ()* 0  . (183 ) t* Pbat(),,,.()+ t′ = ∑ PcatPbct()()′ 184 Notice that the parameter ρ ≠ 0, which is necessary to be c able to invert the matrix K to calculate the rate R, does not appear anywhere in the final result, either in the evolved However, when allowing gaps, any intermediate evolutionary step can go through processes of deletions or transitions Qt or in the value of R. This results from the fact that in either equation the only relevant component is the insertions in addition to substitutions; therefore multipli- projection of K (or K-1) into (etR - I). The same projection cativity as described in the previous equation does not is what makes the vector ψ that appears in the pseudoin- hold anymore. There is a natural explanation of why "sub- -1 tR ψT stitutions-only multiplicativity" is modified when consid- verse K irrelevant. Even though limt→∞e = u , it is also tR ering insertion and deletion events. Consider the true that limt→∞ (Q∞ - Q0)e = 0, so that the dependence on ψ disappears from the final expression of Q∞. evolution of gaps as single characters, which was introduced previously in this paper. The substitution matrix ε with gaps Qt satisfies the relationship

Page 24 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

finite time could have appeared from a very large number εεε−1 of intermediate situations in which stretches of other QQQQ′ = ′.()185 tt+ t 0 t nucleotides could have been added or removed. The sim- Analyzing this matrix equation by components and using ple one-to-one correspondence that models of substitu- the expression for Q0 given in equation (22), the substitutions maintain through evolution does not exist in the tion of residue a into residue b in finite time t + t' has the presence of insertion and deletion events. This does not following terms: mean that evolutionary models with gaps are inconsist- ent, however some traditional properties of phylogenetic Pbat(),,,+ t′ = ∑ PcatPbct()()′ trees of single residue evolution such as the pulley princi- c ple [71] cannot be applied under the transition probabil- 1 ity evolution models. +−PatPbt(),,()− ′ ()186 q0   Conclusion +−1 − ′ 1  PatpPbct(),,.∑ c () Motivated by the goal of making QRNA (a comparative  q0  c probabilistic method for RNA genefinding) an evolution- The first term corresponds to pure substitution events of ary model, I have introduced several probabilistic meth- ′ ods to describe the evolution of insertion and deletion →tt→ the form ac b , and it is identical to equation events. The methods introduced here have a larger scope (184). The second term modulated by the coefficient 1/q0 than this program alone, and they can be applied to other (introduced in equation (65), which is part of the non pair probabilistic models and to profile HMMs and SCFGs trivial matrix Q0) represents the event in which as well. ′ ab→tt−→− . The third term (preceded by coeffi- I described an algorithm which addresses the evolution of cient (1 - 1/q0)) represents the event in which gaps as an extra residue in a (N + 1) × (N + 1) substitution tt′ matrix. This method can be applied to the joint emission ac→− →− b. Note that this model would align at probabilities of pair models. This method allows us to time t + t' residues which could have been derived by a gap maintain a stationary N-dimensional background distri- intermediate. This is usually discouraged by evolutionary bution, while the actual (N + 1)-dimensional background models that describe the evolutionary history of inser- frequencies evolve towards all gaps at time infinity. I call tions and deletions, in which such event would be repre- this process quasi-stationary. As an example, I showed an a − analytic solution for the Jukes-Cantor model extended to sented as . For the model at hand, the fact that a gap − b gaps. can revert into a residue is a consequence of treating gaps as an additional residue in a substitution matrix. I also presented two methods for the evolution of transition probabilities in a profile or pair HMM or SCFG, that For the particular case of the generalized Jukes-Cantor are applicable to any probabilistic model that uses transi- model introduced before, it turns out that the two extra tions between states to model insertions and deletions. In terms in equation (186) are independent of the particular the first algorithm, the transition probabilities associated substitutions and cancel, such that with one state in the model are evolved as a vector independently of the transition probabilities associated to any   − γ other state in the model. I also presented a second algo- 1 γξγ+−1 1 t′ = tt′ 1  t 0.() 187 rithm in which the transition probabilities associated with qq00  4 a given set of states co-evolve under the control of a single Therefore the generalized Jukes-Cantor model preserves rate matrix. I presented an example of the application of multiplicativity. This results from the extreme simplicity these methods to a pair-HMM and to a profile HMM. of the model and is not true for more complicated models. For instance, for the rate matrix created from a partic- I have applied these methods to the program QRNA, which was the motivation for the development of the ular Q* in the other example presented in this paper (which is a particular case of the REV model [44]), the two algorithms in the first place. QRNA contains three proba- extra terms in equation (186) are different for the differ- bilistic models (the oth, cod, and rna models) that ana- ent nucleotide substitutions, and do not cancel out. lyze the pattern of mutation of a given pairwise alignment to decide which of the three models best classifies the A more complicated situation appears for probabilistic alignment. These models are a combination of general- models that introduce gaps in an affine manner. A given ized pair-HMMs and a pair-SCFG. Originally, this pro- residue-to-residue substitution process that occurred in a gram assumed a fixed divergence time, and all the

Page 25 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

emission probabilities of the different models were tied to ary methods into multiple sequence probabilistic models those of BLOSUM62. That produced a QRNA parameter- that explicitly describe the phylogenetic relationship ized for highly diverse sequences, which in turn produced between the sequences. a large number of false positives for highly similar sequences. In the new program eQRNA, all emission and Availability transition probabilities are a continuous time-dependent The different models presented in this paper have been family able to match any possible degree of sequence implemented in several small ANSI C programs. These are divergence. not fully developed software applications, but demonstra- tions (for those who want to avoid the mathematical The three models of QRNA (the OTH, COD, and RNA descriptions) of how the different algorithms work. The models) need to be at approximately the same evolution- programs are freely available at http://selab.wustl.edu/ ary distance, so that when a pairwise alignment is ana- publications/Rivas05/evolve.tar.gz. lyzed, the differences in scores of the models result from observing a different pattern of mutations (coding, RNA, Methods or none in particular) rather than because one model Appendix A. Conditions for the saturation of a generalized favors more closely related sequences than the other. This substitution matrix model synchronization requires a number of QRNA-spe- In this appendix I provide the conditions for saturation of tR cific design elements which are tangential to the imple- a generalized evolutionary model of the form Qt = Q0e . mentation of the evolutionary models for indels and Saturation can be described as transition probabilities presented in this paper. For rea- sons of clarity, I leave for another paper a detailed descrip-  1 n  qq∞∞ 1     tion of the particular implementation designs that went = = 1 nT= limQt    …()q∞ , ,qquq∞∞,()188 into eQRNA in order to make it fully evolutionary. In a t→∞      1 n  1 nutshell, the transition probabilities of the OTH and COD  qq∞∞  models are evolved according to the algorithm to evolve for the unitary vector u, and a set of saturation frequencies vectors of transition probabilities, while the emission at time infinity denoted by q , such that quT = 1 . probabilities of those two models were evolved using the ∞ ∞ original QRNA parameters as the generating time of the tR respective rate matrix. In the RNA model, for the context- Here I show that saturation of Qt = Q0e is a necessary free grammar component of the model, the transitions are condition of two properties of the matrix Q = {Q(ij)}, fixed, and the evolution of gaps is accommodated by normalization and positivity. I also show that the satura- tR treating gaps as extra characters according to the method tion probabilities of Qt are the same as those of e . presented here for that purpose. The HMM component of tR the RNA model is parameterized with time similarly to the Proposition A.1. Consider first the simplest case Qt = e . OTH and COD models. Preliminary results show an Normalization, i.e. ∑jQ(ij) = 1, together with positivity, ∀ important improvement compared with the previous i.e. Q(ij) > 0 i, j, imply that a substitution matrix of the tR fixed-time implementation. The application of these evo- form Qt = e saturates to a set of probabilities at time lutionary methods for other probabilistic models for infinity. sequence comparison beyond eQRNA should be tractable. Proof. Normalization of the rate matrix, ∑j Q(ij) = 1 implies that So far the methods presented here have been introduced only in profile and pair models. They could also be 1 1     applied to probabilistic models where, instead of aligning Qu= Q  =  .()189 two contemporary sequences, one aligns a sequence to an     1 1 ancestor. The only difference with respect to an evolutionary pair model is that, in this case, the emission probabil- λ ities will be the substitution (conditional) matrices That is, = 1 is an eigenvalue of Q. It also has implications themselves instead of joint conditional-on-time probabil- for the norm of Q, defined as the largest row sum of abso- ities. One important limitation of the methods presented lute values here is that, in general, they lack the property of multipli- n cativity. In consequence, in order to extend the methods ≡= presented here to more than two sequences related by a QQijmax∑ ( )1190 . ( ) i = phylogenetic tree, one would have to work with rooted j 1 trees. A future challenge is to incorporate these evolution-

Page 26 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

Therefore, because of the spectral theorem [72], the spectral radius σ(Q), defined as the largest absolute value of 1   any eigenvalue of Q, is bounded by, Ψ = U 0  .()197   σ(Q) ≤ ||Q|| = 1. (191) 1

On the other hand there is an eigenvalue λ = 1 therefore Substituting in equation (195) we finally obtain,

σ(Q) = 1. (192) 1   = ΨT −1 λ limQUt   0 . (198 ) In consequence, Q has one eigenvalue, = 1, and all other t→∞   eigenvalues are smaller than one. 1

Therefore because the substitution matrix is of the form Qt This is the saturation condition (188) for some saturation = etR, it implies that the instantaneous rate matrix R has probabilities defined by q = ΨT U-1. one null eigenvalue, and all the other are negative. If we ∞ 0 assume that the null eigenvalue is not degenerate and that the negative eigenvalues are real, we can write with all Corollary A.1. For a generalized evolutionary model of tR generality, the form Qt = Q0e , Qt also saturates at infinity, and the tR saturation probabilities of Qt are given by those of e , that is,  0    −λ  2  −1 ==tR T RU= U,()193 limQeuqt lim∞ . (199 )   t→∞t →∞    −λ   n  Proof. Note that by construction Q0 has to have the same λ normalization and positivity conditions as Q . It can be for some matrix U, and such that i > 0 for i = 2, ..., n. t − shown that under those conditions, Q 1 Q = etR also has Therefore Q = etR can be cast into the form, 0 t t to add up to one, summing by rows, and all its elements have to be positive. Therefore, using the result of Proposi-   1 tion A.1,  − λ   e t 2  = −1 QUt  U .()194 −1 T limQQ= uq∞ . (200 )   →∞ 0 t  − λ  t  e t n  In the limit, Therefore

==TT  1  limQQuquqt (0 )∞∞ , (201 )   t→∞ 0 =   −−1 = ΨΨT 1 limQUt ()1 , 0 ,… , 0 U U00 U , ( 195 ) t→∞     which proves saturation for an evolutionary probabilistic  0  tR process of the form Qt = Q0 e . ΨT for 0 = (1, 0, ..., 0). Appendix B. Implications of reversibility on a generalized evolutionary process On the other hand using equation (194) we obtain In this appendix I discuss the implications that reversibility imposes on a generalized evolutionary model. I show tR  1   1   1  that for an evolutionary model of the form Qt = e , the  − λ      t 2 marginal probabilities with respect to which Q is reversi-  e   0   0  t QUΨ = U  = U ,,∀t ()196 t 0     ble have to be stationary, and therefore coincide with the        − λ      saturation probabilities. I also show that for an evolution-  e t n   0   0  tR ary model of the general form Qt = Q0e , the marginal Ψ which implies that U 0 is the eigenvector of Q corre- probabilities with respect to which Qt is reversible can sponding to the eigenvalue λ = 1. According to (189) that change with time. In this way we decouple the "reversibil- is, ity" frequencies from the saturation frequencies. I also

Page 27 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

demonstrate how to calculate the saturation probabilities, Lemma B.3. If Q* is a conditional matrix that satisfies the given Q0 and Q* at one particular time t*. This system sets reversibility condition (202) and R = log Q* then the sat- the ground for the quasi-stationary model of evolution uration probabilities of R are given by the p* vector in with gaps as an extra indel. (202), that is,

Lemma B.1. Consider a given matrix of conditional prob- limeuptR= T . (210 ) ∀ * abilities Q*, [∑j Q*(ij) = 1 i] which is reversible with t→∞ respect to a set of marginal probabilities p*, T = Proof. Taking from Lemma B.2., we have pR* 0 ; there- p*(i)Q*(ij) = p*(j)Q*(ji). (202) Tn= ≥ fore pR* 0 for n 0, and because of the relationship Then one can see that reversibility implies =∞ n ()tR n TT= tR =+ pQ** p *.()203 eI∑ ,()211 n=1 n! Proof. Summing one of the indices in the reversibility con- it results that ditions and taking into account the normalization condition for the Q matrix results in, * TtR= T pe** p.()212 == ∑∑piQijpj**() ( ) * () Qjipj * ( ) * (), (204 ) for arbitrary t. Therefore, it also holds in the limit of very ii large time that which in vectorial notation takes the form pepT limtR= T . (213 ) TT= **→∞ pQ** p *.()203 t

tR = T Lemma B.2. If R = log Q* then the reversibility condition Additionally, Appendix A shows that limt→∞ euqsat . (202) for Q* implies that Combining those two equations together we have

=∧=T TT===tR T T T piRijpjRji**()() ()() pR *0205 . ( ) pp**lim e puqq *sat sat . (214 ) t→∞ Proof. If R = log Q* then because of the Taylor series we have This proves that the saturation probabilities are p*.

=∞ n+1 Proposition B.1. For a reversible evolutionary model of 1 n ()− = − n tR R ∑ ()QI* .()206 the form Qt = e , it results that the associated marginal tn* n=1 probabilities with respect to which the parametric family Qt is reversible have to be stationary (i.e. time Because of the reversibility condition for Q* (202) it is independent). also true that

Proof. From the parametric family Qt select one particular n n p*(i) (Q* - I) (ij) = p*(j)(Q* - I) (ji), (207) = instance t*, and consider QQ* tt= ∗ . Suppose that the for n ≥ 1. Therefore it follows that marginal probabilities at this time are given by p*, that is: TT= pQ** p *. Because of the relationship R = log Q*, it fol- p*(i) R(ij) = p*(j)R(ji). (208) lows from Lemma B.3 that the whole parametric family etR In addition we can also see by inspecting equation (206) has p* as the corresponding marginal probabilities, there- that the normalization condition for Q* translates into ∑j fore the marginal probabilities do not evolve with time R(ij) = 0 ∀i, which implies that (stationary).

Proposition B.2. For a reversible generalized evolutionary ∑ piRij() ( )=∀00209 jor pRT = . ( ) ** model of the form Q = Q etR, the associated marginal i t 0 probabilities with respect of which Qt is reversible can be evolved with time.

Page 28 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

proof. In order to prove that this is the case, we just need to Rawlings C, Clark D, Altman R, Hunter L, Lengauer T, Wodak S. Menlo Park, CA: AAAI Press; 1995:114-120. find an example in which that statement is true. Consider 6. Burge CB, Karlin S: Finding the Genes in Genomic DNA. COSB again one particular instance QQ= = with its corre- 1998, 8:346-354. * tt∗ 7. Cawley SL, Pachter L: HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 2003:II36-II41. ii36– sponding marginal probabilities p*. Because the model is ii41 reversible for arbitrary divergence times, in particular 8. Meyer IM, Durbin R: Gene structure conservation aids similar- there should be some p0 probabilities such that ity based gene prediction. Nucl Acids Res 2004, 32:776-783. 9. Sakakibara Y, Brown M, Hughey R, Mian IS, Sjolander K, Underwood pQTT= p . For this generalized model, the rate matrix is RC, Haussler D: Stochastic Context-Free Grammars for tRNA 00 0 Modeling. Nucl Acids Res 1994, 22:5112-5120. = −1 10. Eddy SR, Durbin R: RNA Sequence Analysis Using Covariance given by RQQlog(0 * ) . Therefore it follow by Lemma Models. Nucl Acids Res 1994, 22:2079-2088. B.3 that the saturation probabilities of R are given by the 11. Lowe TM, Eddy SE: tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. condition Nucl Acids Res 1997, 25:955-964. 12. Eddy SR: A Memory-Efficient Dynamic Programming Algo- rithm for Optimal Alignment of a Sequence to an RNA Sec- −1 = −1 p∞()()()()()(). iQQ0 ** ijp∞ jQQ0 ji ()215 ondary Structure. BMC Bioinformatics 2002, 3:18. 13. Klein RJ, Eddy SR: RSEARCH: Finding homologs of single struc- tured RNA sequences. BMC Bioinformatics 2003, 4:44. Therefore the saturation probabilities p∞ are different from ≠ 14. Knudsen B, Hein J: RNA secondary structure prediction using p* as long as p0 p*. stochastic context-free grammars and evolutionary history. Bioinformatics 1999, 15:446-454. Therefore, we have constructed a parametric family, Q = 15. Dowell RD, Eddy SR: Evaluation of Several Lightweight Sto- t chastic Context-Free Grammars for RNA Secondary Struc- tR Q0e , in which the marginal probabilities for reversibility ture Prediction. BMC Bioinformatics 2004, 5:71. 16. Rivas E, Eddy SR: Noncoding RNA gene detection using com- are p0 at time zero, p* at t*, and p∞ at time infinity, with p0 ≠ ≠ parative sequence analysis. BMC Bioinformatics 2001, 2:8. p* p∞. Therefore if there is reversibility at arbitrary time, 17. Altschul S, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local the marginals have to be time dependent, Alignment Search Tool. Jour Mol Biol 1990, 215:403-410. 18. Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol 1994, 39:105-111. pt (i)Qt(ij) = pt(j)Qt (ji). (216) 19. Goldman N, Thorne JL, Jones DT: Using evolutionary trees in protein secondary structure prediction and other compara- In particular in the Section "The evolution of emission tive sequence analyses. J Mol Biol 1996, 263:196-208. 20. Muse SV: Estimating synonymous and nonsynonymous substi- probabilities with indels treated as an extra character" we tution rates. Mol Biol Evol 1996, 13:105-114. have constructed a system in which the time-dependent 21. Whelan S, Goldman N: general empirical model of protein evolution derived from multiple protein families using a maxi- reversibility condition (216) is satisfied by marginal prob- mum likelihood approach. Mol Biol Evol 2001, 18:691-699. abilities that are quasi-stationary with respect to some (n 22. Smith AD, Lui TW, Tillier ER: Empirical models for substitution - 1) p probabilities, in ribosomal RNA. Mol Biol Evol 2004, 21:419-421. 0 23. Knudsen B, Andersen ES, Damgaard C, Kjems J, Gorodkin J: Evolu- tionary rate variation and RNA secondary structure Λ pt(i) = p0(i) (1 - t), for i = 1,...n - 1 (217) prediction. Comput Biol Chem 2004, 28:219-226. 24. Yang Z: A space-time process model for the evolution of DNA sequences. Genetics 1995, 139:993-1005. Λ pt(n) = t. (218) 25. Felsenstein J, Churchill GA: Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol 1996, 13:93-104. Acknowledgements 26. Gribskov M, Veretnik S: Identification of sequence pattern with Thanks to Sean Eddy for numerous discussions. Thanks to Matt Visser for profile analysis. Methods Enzymol 1996, 266:198-212. insights into matrix logarithms. This work was supported by the NIH 27. Coin L, Durbin R: Improved techniques for the identification of National Human Genome Research Institute. I would like to acknowledge pseudogenes. Bioinformatics 2004:I94-I100. 28. McAuliffe JD, Pachter L, Jordan MI: Multiple-sequence functional the Benasque Center for Science in which part of this work took shape at annotation and the generalized hidden Markov phylogeny. an ESF and NIH funded workshop on computational RNA biology in the Bioinformatics 2004, 20:1850-1860. summer of 2003. 29. Siepel A, Haussler D: Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 2004, 11:413-428. References 30. Thorne JL, Kishino H, Felsenstein J: An evolutionary model for 1. Durbin R, Eddy SR, Krogh A, Mitchison GJ: Biological Sequence Analysis: maximum likelihood alignment of DNA sequences. J Mol Evol Probabilistic Models of Proteins and Nucleic Acids Cambridge UK: Cam- 1991, 33:114-124. bridge University Press; 1998. 31. Thorne JL, Kishino H, Felsenstein J: Inching toward reality: an 2. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden improved likelihood model of sequence evolution. J Mol Evol Markov models in computational biology: Applications to 1992, 34:3-16. protein modeling. J Mol Biol 1994, 235:1501-1531. 32. Metzler D: Statistical alignment based on fragment insertion 3. Karplus K, Barrett C, Hughey R: Hidden Markov models for and deletion models. Bioinformatics 2003, 19:490-499. detecting remote protein homologies. Bioinformatics 1998, 33. Miklos I, Lunter GA, Holmes I: "Long Indel" model for evolution- 14:846-856. ary sequence alignment. Mol Biol Evol 2004, 21:529-540. 4. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 34. Mitchison GJ, Durbin RM: Tree-based maximal likelihood sub- 14:755-763. stitutions matrices and hidden Markov models. J Mol Evol 1995, 5. Eddy SR: Multiple Alignment Using Hidden Markov Models. In 41:1139-11351. Proc Third Int Conf Intelligent Systems for Molecular Biology Edited by:

Page 29 of 30 (page number not for citation purposes) BMC Bioinformatics 2005, 6:63 http://www.biomedcentral.com/1471-2105/6/63

35. Mitchison GJ: probabilistic treatment of phylogeny and 65. Siepel A, Haussler D: Phylogenetic estimation of context- sequence alignment. J Mol Evol 1999, 49:11-22. dependent substitution rates by maximum likelihood. Mol Biol 36. Holmes I, Bruno W: Evolutionary HMMs: a bayesian approach Evol 2004, 21:468-488. to multiple alignment. Bioinformatics 2001, 17:803-820. 66. Lunter G, Hein J: A nucleotide substitution model with near- 37. Qian B, Goldstein RA: Detecting distant homologs using phylo- est-neighbour interactions. Bioinformatics 2004:I216-I223. genetic tree-based HMMs. Proteins 2003, 52:446-453. 67. Goldman N, Whelan S: A novel use of equilibrium frequencies 38. Holmes I: Using guide trees to construct multiple-sequence in models of sequence evolution. Mol Biol Evol 2002, evolutionary HMMs. Bioinformatics 2003, Suppl 1:147-157. 19:1821-1831. 39. Knudsen B, Miyamoto MM: Sequence alignments and pair hid- 68. Whelan S, Goldman N: Estimating the frequency of events that den Markov models using evolutionary history. J Mol Biol 2003, cause multiple-nucleotide changes. Genetics 2004, 333:453-460. 167:2027-2043. 40. Pedersen JS, Hein J: Gene finding with a hidden Markov model 69. Campbell SL, Meyer CDJ: Generalized Inverses of Linear Transformations of genome structure and evolution. Bioinformatics 2003, New York: Dover; 1991. 19:219-227. 70. Jodár L, Law AG, Rezazadeh A, Watson JH, Wu G: Computations 41. Holmes I: A probabilistic model for the evolution of RNA for the Moore-Penrose and Other Generalized Inverses. Con- structure. BMC Bioinformatics 2004, 5:166. gress Numer 1991, 80:57-64. 42. Jukes TH, Cantor C: Evolution of protein molecules. In Mamm 71. Felsenstein J: Evolutionary Trees from DNA Sequences: A Prot Met Academic Press; 1965:21-132. Maximum Likelihood Approach. J Mol Evol 1981, 17:368-376. 43. Kimura M: A simple method for estimating evolutionary rates 72. Bronson R: Matrix operations New York: McGraw-Hill; 1973. of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 1980, 16:111-120. 44. Tavaré S: Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 1986, 17:57-86. 45. Yang Z, Nielsen R, Hasegawa M: Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 1998, 15:1600-1611. 46. Kosiol C, Goldman N, Buttimore NH: new criterion and method for amino acid classification. J Theor Biol 2004, 228:97-106. 47. Yang Z, Nielsen R, Goldman N, Pedersen A: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 2000, 155:431-449. 48. Hasegawa M, Kishino H, Yano T: Dating of the human-ape split- ting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 21:160-174. 49. Holmes I, Rubin GM: An expectation maximization algorithm for training hidden substition models. J Mol Biol 2002, 317:757-768. 50. Müller T, Spang R, Vingron M: Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood methods. Mol Biol Evol 2002, 19:8-13. 51. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci 1992, 89:10915-10919. 52. Kishino H, Miyata T, Hasegawa M: Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J Mol Evol 1990, 31:151-160. 53. Dayhoff M, Schwartz R, Orcutt B: model of evolutionary change in protein. Atlas Prot Seq Struct 1978, 5:345-352. 54. Müller T, Vingron M: Modeling amino acid replacement. J Comp Biol 2000, 7:761-776. 55. Kosiol C, Goldman N: Different Versions of the Dayhoff Rate Matrix. Mol Biol Evol 2004, 22:193-199. 56. Israel RB, Rosenthal JS, Wei JZ: Finding generators for Markov chains via empirical transition matrices, with applications to credit rating. Mathematical Finance 2001, 11:245-265. 57. Kreinin A, Sidelnikova M: Regularization algorithms for transition matrices. Algo Res Quartely 2001, 4:23-40. 58. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle 2004. Publish with BioMed Central and every 59. Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, scientist can read your work free of charge Massachusetts 2003. "BioMed Central will be the most significant development for 60. Adachi J, Hasegawa M: MOLPHY programs for molecular phyl- disseminating the results of biomedical research in our lifetime." ogenetics version 2.3. Institute of Statistical Mathematics, Tokyo 1995. Sir Paul Nurse, Cancer Research UK 61. Yang Z: PAML: a program package for phylogenetic analysis Your research papers will be: by maximum likelihood. Comput Appl Biosci 1997, 13:555-556. 62. Liò P, Goldman N, Thorne JL, Jones3 DT: PASSML: combining available free of charge to the entire biomedical community evolutionary inference and protein secondary structure peer reviewed and published immediately upon acceptance prediction. Bioinformatics 1998, 14:726-733. 63. Ronquist F, Huelsenbeck JP: MRBAYES: Bayesian inference of cited in PubMed and archived on PubMed Central phylogenetic trees. Bioinformatics 2001, 17:754-755. yours — you keep the copyright 64. Cai W, Pei J, Grishin NV: Reconstruction of ancestral protein sequences and its applications. BMC Evol Biol 2004, 4:33. Submit your manuscript here: BioMedcentral http://www.biomedcentral.com/info/publishing_adv.asp

Page 30 of 30 (page number not for citation purposes)