ISSN: 2378-3648

Xia. J Genet Genome Res 2017, 4:031 DOI: 10.23937/2378-3648/1410031 Volume 4 | Issue 1 Journal of Open Access and Genome Research

REVIEW ARTICLE Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning Xuhua Xia1,2* 1Department of Biology, University of Ottawa, Canada 2Ottawa Institute of Systems Biology, Ottawa, Canada *Corresponding author: Xuhua Xia, Department of Biology, Ottawa Institute of Systems Biology, University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A, Ottawa, Ontario, K1N 6N5, Canada, Tel: (613)-562-5800, Ext: 6886, Fax: (613)- 562-5486, E-mail: [email protected] , amino acid and codon sequences. All substi- Abstract tution models used in molecular phylogenetics are Mar- kov chain models characterized by 1) Either a transition Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transi- probability matrix (P) with discrete time or a rate matrix tion probabilities and evolutionary distances from substi- (Q) in continuous time where P can be derived from Q, tution rate matrices. The derivation is essential in molec- and 2) Equilibrium frequencies. The general form of an ular phylogenetics and phylogenomics, and represents the instantaneous rate matrix for nucleotide sequences is, only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., in the order of A, G, C, and T: BLAST and FASTA). Three different approaches are fre- A− abc quently used for deriving transition probabilities and evolu-  Gg− d e tionary distances: 1) The probability reasoning, 2) Solving Q =  partial differential equations, and 3) Matrix exponential and Ch i− f (1) logarithm. The first approach demands the least amount of  mathematical skills but offers the best way for conceptual Tjk l − understanding, and can often generate nice mathematical Transition probability matrix, often referred to as the P expressions of transition probabilities and evolutionary dis- tances. This review represents the most systematic and matrix, specifies the probability of a nucleotide or amino comprehensive numerical illustration of the first approach. acid changing into another one after time t. It is needed to Keywords calculate likelihood and to derive evolutionary distances, and consequently is needed phylogenetics based on the , Substitution rate, Transition probability, Evolutionary distance maximum likelihood and distance-based methods as well as Bayesian inference. Whether a substitution model can Introduction be implemented for phylogenetic analysis essentially de- pends on whether the model’s transition probabilities can Substitutions occur over time and can overwrite each be calculated. other at the same nucleotide or amino acid site. When we compare two homologous nucleotide sequences and There are three ways to obtain transition probabilities find differences in N sites, the actual number of substi- from the Q matrix [1,2] :1) By probability reasoning, 2) By tutions (designated by M) could be much greater than solving differential equations involving rates, and 3) By tak- N because multiple substitutions could have happened ing the matrix exponential of the rate matrix. The last two at the same site, overwriting each other. Substitution require some mathematical background in calculus and models are used to infer the observable M from the ob- linear algebra. The first, in contrast, demands little math- served substitutions from sequence comparisons. ematical skill except for careful book-keeping and solving simultaneous equations. This approach is particularly rel- Many substitution models have been proposed for evant to biological students not only for gaining a concep-

Citation: Xia X (2017) Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning. J Genet Genome Res 3:031. doi.org/10.23937/2378-3648/1410031 Received: April 03, 2017: Accepted: July 24, 2017: Published: July 27, 2017 Copyright: © 2017 Xia X. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Xia. J Genet Genome Res 2017, 4:031 • Page 1 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648 tual understanding of the substitution models, but also There are some quick ways to check the derived to deriving nice mathematical expressions for transition transition probabilities. First, we note that when t ap- probabilities and evolutionary distances. New researchers proaches infinity, then all entries in matrix P approaches often ask why we can derive evolutionary distances be- ¼ if α > 0. This is what we have expected. Second, when tween two aligned sequences for the TN93 model [3] but t = 0, then all diagonal elements in matrix P are 1 and cannot for the simpler HKY85 model [4] which is a special all off-diagonal elements are zero. This is again what we case of the TN93 model, yet another model, F84 (used in expected. Third, if α is zero, then no change is possible, PHYLIP since 1984), which is also a special case of the TN93 and we again expect all diagonal elements in matrix P model, can have its evolutionary distance readily derived. to be 1 and all off-diagonal elements to be zero, which One can easily obtain answers to such questions by taking is also true. the first approach. However, the first approach is not of From p in Eq. (2), the expected proportion of sites general purpose and cannot handle very complicated sub- ij that are different between two aligned homologous se- stitution models. In contrast, the last one can be used with quences (p ) is 3*p (t), i.e., any substitution models specified by a rate matrix from diff ij which the matrix exponential can be obtained. In short, all 33−4αt p = 3( pt ) = − e (4) these approaches need to be learned by anyone wishing to diff ij 44 become a molecular phylogeneticist, but this paper will fo- Note that pdiff approaches ¾ when t is infinitely large, cus only on the probability reasoning approach illustrated which means that multiple substitutions can no longer with JC69 [5], K80 [6], F84 (the model used in PHYLIP since be corrected. Eq. (4) offers another way of deriving p (t) 1984), HKY85 [4], and TN93 [3] models. ii in Eq. (3), i.e., pii(t) is simply 1-pdiff. Probability Reasoning to Obtain Transition Prob- Eq. (4) allows us to derive the JC69 distance (DJC69) abilities and Evolutionary Distances because a distance is defined as μt where μ is the substi- Felsenstein [1] presented nice examples of prob- tution rate which is equal to 3α in the JC69 model. This ability reasoning to derive transition probabilities and is the same as the distance that you have driven is the product of the speed (rate) and time. Given that D = evolutionary distances from rate matrices. This section JC69 3αt, we can derive D (Figure 1g) by substituting αt = presents the approach in a more systematic and acces- JC69 D /3 into Eq. (4), i.e., sible way. JC69

JC69 model 3 4 pdiff DJC 69 =−− ln 1 (5) 43 Consider nucleotide A in the JC69 model (Figure 1a). 

Imagine that the nucleotide has a rate α of changing into Where pdiff (the expected number of sites that are dif- any of the four , i.e., including changing to it- ferent between the two homologous sequences) can be self (Figure 1b). This is effectively the same specification approximated by the observed proportion of sites (pdiff. as the JC69 model. After time t, the expected number obs) differing between the two aligned sequences. Note of substitutions is 4αt and the probability of no substi- that pdiff.obs may differ from pdiff even when the under- tution is p (x = 0, α, t) = -4αte according to the Poisson lying substitution model indeed follows JC69 because distribution, and the probability of having at least one of 1) Stochastic factors due to limited aligned length of change is then p (x ≥ 1, α, t) = 1-e-4αt (Figure 1c). Because the two sequences, and 2) Distortion caused by subop- nucleotide A can change into any one of the four nu- timal sequence alignment. Thus, although pdiff in Eq. (4) cleotides (including nucleotide A itself), each nucleotide cannot be greater than 0.75, pdiff.obs could, even when gets 1/4 of p (x ≥ 1, α, t). We therefore have in Figure 1. sequences evolve strictly according to the JC69 model. D is not defined when p ≥ 0.75 as there is no loga- px(≥ 1,α , t ) 1 1 JC69 diff pt( ) = = − e−4αt (2) rithm for 0 or negative values. ij 4 44 We can optionally show that DJC69 in Eq. (5) is a max- The transition probability pii(t) is the summation of two probabilities: the probability of no change (which is e‑4αt) imum likelihood distance. For two aligned sequences and the probability of changing to itself which is the same of length N, designate the number of sites that differ as specified in Eq. (2), as shown in Figure 1e, i.e., between the two sequences as ND and the number of sites identical between the two sites as (N-ND). Now the −−44ααttpx(≥ 1,α , t ) 1 3 ptii ( ) =+=+ e e (3) likelihood function is: 4 44 N 1 ()NN− DDN L =  ppii (1− ii ) The transition probability matrix for the JC69 model 4 has only two distinct elements. The four diagonal ele- 1 lnL = Nln +− ( NND ) ln( p ii ) + N D ln(1 − pii ) ments are the same as specified in Eq. (3) and all the 4 off-diagonal elements are the same as specified in Eq. 1 13−−4DDJC 69 /3  334JC 69 /3  =N ln +− ( NNDD ) ln  + e  + Nln  − e  (2). Each row in P adds up to 1 as a nucleotide can either 4 44  44  stay the same or change into some other nucleotides. (6)

Xia. J Genet Genome Res 2017, 4:031 • Page 2 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

Figure 1: Derivation of transition probabilities and the evolutionary distance (D) based on the JC69 model. The d value in the diagonal of the rate matrix (a) is constrained by the row sum equal to 0, i.e., d = -3α. P(j|i,t) means the probability of changing

from the original nucleotide i to nucleotide j after time t, and is synonymous to pij(t) or simply pij in this paper.

one would have expected. We illustrate the application S1: AAG CCT CGG GGC CCT TAT TTT TTG of Eqs. (5) and (8) by using the aligned sequences in Fig-

|| | ||| ||| | ||| ||| || ure 2 where N = 24, ND = 6, and pdiff = 6/24 = 0.25. So DJC69 S2: AAT CTC CGG GGC CTC TAT TTT TTT = 0.3041, and var (DJC69) = 0.0176. Figure 2: Two homologous sequences for illustrating com- putation of pairwise evolutionary distances. The equilibrium frequencies of the π vector can be

derived by set t = ∞ in Eqs. (2) and (3) which leads to pii = p = ¼. This implies that equilibrium frequencies of the Where the constant term N*ln(1/4) can be dropped ij four nucleotides will be equal for the JC69 model. This in maximizing lnL to obtain the distance estimate, but is not surprising because the frequencies did not even needs to be kept when performing likelihood ratio test appear in the rate matrix (Figure 1a). for comparing different substitution models (e.g., JC69 against TN93). K80 model

We take the derivative of lnL with respect to DJC69, set The K80 model has a transition substitution rate α the derivative to 0 and solve for DJC69. The resulting DJC69 and a rate β (Figure 3a). We will focus on is exactly the same as that in Eq. (5). I used D instead of nucleotide A and conceptualize the model with two

DJC69 in the equations below: events (Figure 3b), in contrast to only one event in the

JC69 model. The first event (e1) occurs when nucleotide dLln ()N− N e−−4DD /3 Ne4 /3 A changes into any of the four nucleotides (including to =−+= DD 0 13−− 33 itself). In other words, the original A is replace by a nu- dD +−ee4DD /3 4 /3 44 44 (7) cleotide randomly drawn from a nucleotide pool with equal nucleotide frequencies. This event occurs with a 3334NN− D 4 pdiff D =− ln  =−− ln 1 rate β. The second event (e2) occurs when nucleotide 43N 4 3 A changes either to G or to itself, i.e., the original A is replace by a nucleotide randomly drawn from a The variance of DJC69 (designated as VJC69) is obtained as the negative reciprocal of the second derivative of pool with equal number of A and G. This e2 occurs with a lnL: rate γ. Thus the transition rate α equals β+γ according to 1 pp(1− ) this conceptualization. Note that, whenever e1 happens, =−=diff diff VJC 69 22 (8) the original nucleotide is replace by any one of the four dLln 4 p diff nucleotides with equal probability, no matter how many 2 L1− dDJC 69 3 e2 events has occur before or after the occurrence of

e1. It might help to think of a long sequence with L sites Note that VJC69 decreases with sequence length L as

Xia. J Genet Genome Res 2017, 4:031 • Page 3 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

Figure 3: Derivation of transition probabilities and the evolutionary distance (D) based on the K80 model. The rate matrix (a) has the diagonal elements (d) constrained by the row sum equal to 0, i.e., d = -α - 2β. P and Q are the observed proportion of transitional and transversional changes between two aligned homologous sequences. Equating them to their respective expected values, E(P) and E(Q), leads to the solution of αt and βt shown, and the evolutionary distance D. P(j|i,t) means the probability of changing from

the original nucleotide i to nucleotide j after time t, and is synonymous to pij(t) or simply pij in this paper. being A at time 0. If these L sites each have experienced These probabilities are also shown in Figure 3c. The at least one e2 event, then these sites will either be A reason for the condition that “e1 has not occurred” is or G with equal probability (i.e., 0.5), and we expect to because e1 event can erase e2 event (as we have dis- have L/2 sites being A and the other L/2 sites being G. In cussed before). contrast, if each of these L sites has experienced at least Now the probability of the starting nucleotide A one e event, then the site will be replaced by either 1 changing to G during time t, designated as p(G|A,t), is A, C, G, or T with equal probability, and we expect to the summation of two probabilities. The first is 1/2 of observe A, C, G and T in L/4 sites each. Any e events 2 the probability of p(e > 0, e = 0,t) in Eq. (11) because occurring before or after the e event do not change this 2 1 1 the other 1/2 is for A to itself. The second is 1/4 of p(e expectation. This means that e erases e , but not vice 1 1 2 > 0,t) in Eq. (10) because A→A, A→G, A→C and A→T versa. The probability that an e event happened is in- 2 each get ¼, so only 1/4 of p(e > 0,t) is for A→G. The formative only when no e event has happened. 1 1 summation of these two probabilities (Figure 3d) is After time t, the expected number of substitutions is p(G|A,t). This probability is equal to p(A|G,t), p(C|T,t), 2(α+β)t, i.e., the nucleotide A has two ways of change and p(T|C,t) in the K80 model. In other words, the sum- with a rate of α (to A and to G) and another two ways of mation of these two probabilities is the probability of a change with a rate β (to C and to T), so the probability of transition (Ps) during time t. Thus, pe(>= 0, e 0, t ) pe ( > 0, t ) no change, according to Poisson distribution, is P = 21+ 1 s 24 −+2(αβ )t pe( = 0, e = 0,) t = e (9) ee−4βttt−− −+ 2( αβ ) 1 e −4β 12 = + (12) 24 Note that α is conceptualized as (β+γ) in Figure 3, so 1 ee−4βtt −+ 2( αβ ) =+− = PGAtPAGtPTCtPCTt (|,) = (|,) = (|,) = (|,) e-2(α+β)t in Eq. (9) is equivalent to e-(4β+2γ)t. The probability 44 2 that at least one e1 event has occurred is Similarly, the probability of the starting A changing −4βt to C (or to T) is 1/4 of p(e > 0,t) in Eq. (10) because 1/4 pe(>=− 0, t ) 1 e (10) 1 1 is for A→A, ¼ is for A→G and 1/4 is for A→T, so only 1/4

Thus, the probability that at least one e2 has occurred is for A→C (Figure 3d). This probability is the probability but e1 has not occurred is simply for a transversional change during time t, > = =−= =−> −4βt pe(21 0, e 0,) t 1 pe (1 0, e 2 0,) t pe (1 0, t ) pe(> 0, t ) 1− e 1 (13) −+2(αβ )tt−4β (11) Pv = = =− 1ee −− (1 ) 44 =ee−4βtt − −+ 2( αβ ) As a quick check of the derived transition probabil-

Xia. J Genet Genome Res 2017, 4:031 • Page 4 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

ities, we note that sP and Pv are zero when t = 0 (or when tionary distances based on a substitution model). Meth- = 0 and β = 0). This also implies that all diagonal ele- ods for handling such situations are discussed later in ments in the transition probability matrix are equal to 1, the section on the GTR model. and is what we have expected. When t = ∞, with α > 0 We may optionally show D in Eq. (17) to be a max- and β > 0, all entries in matrix P approaches ¼ (the equi- K80 imum likelihood estimator of the distance based on the librium frequency of the K80 model). This is also what K80 model, just like the K distance in Eq. (5). To see we expected. JC69 this, it is better to re-parameterize the K80 model by

For two aligned homologous sequences, Ps can be replacing αt and βt by DK80 and κ using the following re- approximated by the proportion of sites differing by a lationship: transition (P), and 2P by the portion of sites differing by D = αβ tt + 2 v K 80 (18) a transversion (Q, Figure 3e). Note that the expected Q κ = αβtt / is equal to 2Pv because there are two ways of having a transversional change. Therefore, Solving these two equations gives us

−4βtt −+ 2( αβ ) D κ 1 ee αt = K 80 P =+− (14) κ + 44 2 2 (19) D 11−−ee−−44ββtt β = K 80 = = = (15) t QP 2v 2 κ + 2 42 Substituting αt and βt into Eqs. (14) and (15) so that We can now first solve for βt from Eq. (15), and then P and Q will be functions of D and κ, and the likelihood substitute the solution for βt into Eq. (14) to solve for K80 function for deriving D and κ is αt. This leads to K80 N 1 NN NN−− N ln(1−− 2PQ ) ln(1 − 2 Q ) L =  PQs v (1−− PQ ) sv αt = −+ 4 (20) 24 (16) 1 ln(1− 2Q ) lnLN = ln + NPNQNNNs ln + v ln + ( −sv − ) ln(1 −− PQ ) βt = − 4 4 Recall that evolutionary distance is defined as µt, Where the constant term N*ln(1/4) can be dropped where µ is the substitution rate which is equal to (α+2β) in maximizing lnL to obtain the distance estimate, but in the K80 model. Thus, the evolutionary distance based need to be kept when performing likelihood ratio test on the K80 model (DK80) is (α+2β)t, which comes to for comparing different substitution models (e.g., K80 ln(1-2PQ - ) ln(1-2 Q ) against TN93). DK 80 =+=− αβ tt 2 − (17) 24 Taking partial derivatives with respect to DK80 and κ, setting them to zero and solving the simultaneous equa- Where P and Q can be approximated by the observed tions, we have proportion of sites differing by a transition or a transver- ln(1-2PQ - ) ln(1-2 Q ) sion from two aligned sequences, designated as P and D = − obs obs − obs (21) obs K 80 24 Qobs. Similar to what I have mentioned with reference to D , P and Q may differ from P and Q even if the K80 2 ln(1−− 2PQ ) JC69 obs obs κ = obs obs −1 (22) model is followed during the sequence evolution. This is ln(1− 2Q ) because 1) Limited aligned length of the two sequences obs may result in stochastic variation in P and Q , and obs obs Where P = N/N, and Q = N/N. Using the two 2) The two observed proportions may be distorted by obs s obs v aligned sequences in Figure 2, we have N = 24 and Pobs = alignment errors (i.e., misidentification of site homolo- 4/24 and Q = 2/24. These lead to D = 0.3151, and κ = gy). For example, two homologous sequences that have obs K80 4.9126. It may be relevant to add that, while DJC69 and DK80 diverged for an infinite length of time according to the are maximum likelihood estimates, distance formulae for K80 model should have expected P and Q equal to 0.25 F84 and TN93 models, obtained in the same way by equat- and 0.5, respectively. However, we may actually have ing the observed substitutions to expected substitutions, P > 0.25 or Q > 0.5, which would render D inap- obs obs K80 are generally not maximum likelihood estimates. This will plicable. On the other hand, after sequence alignment become clear when we deal with these models. and of (because evolutionary distances are typically calculated without using sites with indels), We have previously derived the variance of DJC69 as the negative reciprocal of the second derivative of lnL Pobs and Qobs may well be much smaller than the expect- ed 0.25 and 0.5 leading to severer underestimation of with respect to DJC69. This can be used only when the log-likelihood function is used to estimate a single par- the true distance. It is also possible to have Pobs and Qobs values that, when used to replace P and Q in Eq. (16), ameter. When there are multiple parameters (e.g., DK80 result in negative αt or βt values that make no biological and κ), we cannot use the same approach unless the parameters are not correlated. There are two common- sense. The same applies to DJC69 (in fact to any evolu-

Xia. J Genet Genome Res 2017, 4:031 • Page 5 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

ly used methods for deriving variances of parameters. Where πA, πG, πC and πT are equilibrium frequencies,

The first is the delta method [7-9], and the second uses πR and πY are frequencies of and , the Fisher information matrix to obtain the variances and the diagonal elements are constrained by each row and covariance matrix for the parameters. The delta summing up to 0. The parameter γ in Eq. (26) is some- method, which often yields nice and clean mathematical times replaced by κβ, but it is easier to understand the expressions for the variance, is illustrated in the Appen- F84 model by using QF84 specified in Eq. (26). dix. The method using the Fisher information matrix is We may view the F84 model as featuring two events shown below. (e1 and e2). Suppose we start with a nucleotide A. Event

To estimate variance involving multiple parameters e1 occurs with rate β. When it occurs, the original A will such as DK80 and κ, we first take the second order partial be replaced by a nucleotide drawn randomly from a nu- derivatives of lnL with respective to DK80 and κ, substi- cleotide pool in which the nucleotide frequencies are tuting the estimatedK80 D and κ in Eqs. (21) and (22) into the same as the equilibrium frequencies. This means the second-order partial derivatives, arranging them that the original A has a rate of βπA, βπG, βπC and βπT to into what is called a Fisher information matrix (MFI) change to A, G, C and T, respectively, when 1e occurs. below, and compute the matrix inverse of MFI (designat- This is different from the K80 model where, 1 whene ‑1 ed by MFI ): occurs, the original A has a rate of 0.25 to change to any of the four nucleotides. Event e has a rate of γ to ∂∂22lnLL ln 2 −− occur, and will result in the original A being replaced ∂κκ2 ∂∂D K 80 (23) by a purine drawn randomly from a purine pool with M FI = ∂∂22lnLL ln A and G frequencies specified as π / and π / . Thus, −− A R G R 2 the original A has a rate of γπ /π and γπ /π to change ∂∂DDKK80 κ ∂80 A R G R to A and G when e2 occurs. This again differs from K80 ‑1 The diagonal elements of MFI are the variances for where the original A has a rate of 0.5 of changing to A or κ and D , and the off-diagonal elements of M ‑1 are K80 FI Gwhen e2 occurs. These events are illustrated in Figure covariances. The mathematical expression for the vari- 4a, where we use x to represent β+γ/πR. Note that it is ance of κ is tedious, but that for the variance of D is K80 not a good idea to use α to represent β+γ/πR for two simpler: reasons. First, if we had started with a nucleotide C or T 22 2 aPcQaPcQ+−+() instead of A, then we would have β+γ/πY instead of β+γ/ VD(K 80 ) = , where N πR which would force us to use α1 and α2 to distinguish 11ab+ (24) between the two. A casual reader will then be misled to ab = , = ,c = 1-2PQ - 1-2 Q 2 think that F84 has three rate parameters (i.e., β, α1 and

α2) without knowing that α1 and α2 are used as different With the aligned sequences in Figure 2, we have N = functions of the same rate parameter γ. Second, I have 24 and empirical P = 4/24 and Q = 2/24. These lead to reserved α to represent β+γ in Figure 4b which simpli- D = 0.3151, and κ = 4.9126. The M and M ‑1 are K80 FI FI fies the derivation of transition probabilities illustrated in Figure 4. 0.047435 -0.286795 M = FI  Note that whenever event e happens, the original -0.286795 49.641105 1 A is replace by A, C, G and T with probabilities π π 21.84451668 0.126204037 (25) A, G, −1 π and π , no matter how many e events has occur be- M FI = C T 2 0.126204037 0.020873724 fore or after the occurrence of 1e . This is similar to the Where the two parameters are in the order of κ scenario involving the K80 model, except that the K80 and DK80, i.e., the variance is 21.8445 for κ and 0.0209 model assumes equal nucleotide frequencies. It might for DK80. The off-diagonal elements are covariances be- help to think of a long sequence with L sites being A at tween the two parameters. time 0. If these L sites each have experienced at least one e event, then these sites will either be A or G with F84 and HKY85 model 2 probabilities Aπ and πG, respectively, and we expect to

The F84 and HKY85 model accommodate not only have πAL sites being A and πGL sites being G. In contrast, the differential substitution rates between transitions if each of these L sites has experienced at least one e1 and , but also different equilibrium nu- event, then the site will be replaced by A, C, G, or T with cleotide frequencies, in contrast to JC69 and K80 which probabilities A,π πG, πC and πT, and we expect to observe assume equal equilibrium nucleotide frequencies. The A, C, G and T in πAL, πGL, πCL, and πTL sites, respectively. same probabilistic reasoning used before can be applied Any number of e2 events occurring before or after the to derive transition probabilities for the HKY80 model. e1 event does not change this expectation. This means that e erases e , but not vice versa. The occurrence of The rate matrix for the F84 model is 1 2 an e event is informative only when no e event has A −+βπG γπ GR π βπC βπT 2 1 βπ+− γπ π βπ βπ G A AR C T happened. QF 84 = (26) C βπ βπ −+βπ γπ π A G T TY T βπ A βπG βπC+− γπ CY π

Xia. J Genet Genome Res 2017, 4:031 • Page 6 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

Figure 4: Derivation of transition probabilities based on the F84 model. πA, πG,πC, πT and πR, πY are equilibrium frequencies

of A, C, G, T, purine (A + G) and (C + T). Event e1 occurs at a rate of β and leads to the original A being replaced

by any of the four nucleotides according to their equilibrium frequencies, and event e2 occurs at a rate of γ and results in the

original A being replaced by either A or G according to their frequencies in the purine pool, i.e., πA/πR, and πG/πR. I used x as a

shorthand for β+γ/πR. The rate γ does not appear in the final transition probabilities because it has been absorbed into α which equals β+γ shown in (b). P(j|i,t) means the probability of changing from the original nucleotide i to nucleotide j after time t, and

is synonymous to pij(t) or simply pij in this paper.

After time t, the total flow of the original A to the + γ) and β in Eq. (31). The transition probability from the four nucleotides (including itself, Figure 4a and Figure original A to C (a transversion, Figure 4e) is simply 4b) is -bt pAC = pp C pe (1 >= 0, t ) C (1 - e ) (32) +++=+= + +=+= pAxx p G bpbp C T p Rx pb Y pbgp R( /) RY pb bg a (27) For other transversions, e.g., pAT, one just need to re-

So the probability that no substitution has happened place πC by πT. The complete transition probability ma- during time t (Figure 4c), according to Poisson distribu- trix for the F84 model is −−ββtt πA++ ππ AYxx12 π G π G +− ππ GY xx12 π G πC(11 − e) πT( −e) tion, is A  −−ββtt G πA+− ππ AYxx12 π A π G ++ ππ GY xx12 π A πC(11 − e) πT( −e) -at P =  == F 84 C  −−ββtt  (33) pe( , e 0,) t e (28) πA(11−e) πG( −e) πC ++ ππ CRxx3 π G 4 π T +− ππ TR xx34 π T 12   T  π −−−ββttπ −π +− ππ π π ++ ππ π  A(11e) G( e) C CRxx34 C T TR xx34 C The rate of A changing to A, G, C, and T through e1 is Where βπA + βπG + βπC + βπT = β, so the probability that at least one e has occurred during time t is eeee----babatttt 1 xxxx ==== , , , (34) -bt 1234pppp pe(1 > 0, t ) =- 1 e (29) RRY Y As a quick check of the transition probabilities, we The probability that e2 has happened but e1 has not is then first note that when t = 0 (or when α = 0 and β = 0), then the diagonal elements are 1 and all off-diagonal ele- > = =- = - > =--batt - (30) pe(2 0, e 1 0, t ) 1 pe (12 , e 0,) t pe (1 0, t ) e e ments are 0, which is what we expected. Second, when t = ∞ with α > 0 and β > 0, then the transition probabil- The reason for the condition that “e1 has not oc- ities will approach the equilibrium frequencies, which is curred” is because e1 event can erase e2 event. With these, it is easy to derive transition probability from A also what we expected. to G (Figure 4d) as the summation of 1) A fraction ofG π To obtain the distance for the F84 model (DF84), re- of p(e1 > 0,t), which is the probability of e1 event that call that a distance is defined as µt where µ is the aver- results in the original A being replaced by A, C, G, and T age substitution rate, i.e., substitution rates in Eq. (26) with probabilities A,π πG, πC and πT, and 2) A fraction of weighted by the equilibrium frequencies: πG/πR of p(e2 > 0,e1 = 0,t), which is the probability that e2 DF84 = 2ppAG ( b tt ++ g / pRTC )2 pp ( btt ++ g / pYYR )2 ppb t (35) events not erased by e1. That is, >=p pp--batt p pe(21 0, e 0, t ) G GYe G e Now we need to obtain βt and γt in order to calcu- pG( | At ,) = pe(1 > 0,) tppGG + =+ - (31) pR ppRR late DF84. We can obtain αt and βt, and then obtain γt =

From now on, p(j|i,t) will be written simply as pij, so αt - βt, remembering that α = β + γ (Figure 4b and Eq. p(G|A,t) is pAG. With the same reasoning, we can derive (27)]. The method we will use is the same as that for transition probabilities for other A↔G and C↔T sub- the K80 model, i.e., we obtain the expected transitions stitutions. Note that the two rate parameters in the F84 and transversions, designated E(S) and E(V), respective- model (β and γ) have been re-parameterized into α (= β ly, from transition probabilities and equate them to the

Xia. J Genet Genome Res 2017, 4:031 • Page 7 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

observed S and V to solve for αt and βt. With the prop- librium frequencies. Event e2 occurs with a rate γ and erty of time reversibility (e.g., A•π pAG = πG•pGA), we have results in the original A being replaced by either A or G ES( ) =+ 2pp p 2 p with the probabilities equal to their respective equilib- A AG C CT (36) rium frequencies. Fictionalized in this way, the expected EV( ) = 2ppppA p AT +++ 2 A p AC 2 G p GC 2 G p GT number of substitutions after time t is β(πA+πG+πC+πT) + Equating E(S) and E(V) to the observed S and V, and γ(πA+πG) = β+γR. According to the Poisson distribution, solving these two equations with the two unknowns (αt the probability that no substitution has happened dur- and βt), we have ing time t is æö22 ÷ ç -2(ppppAGRY+ pppp CT RY) ÷ at = lnç ÷ (37) −+()βγR ç 22 2 2 2 2 ÷ pe( , e = 0,) t = 1- e (42) èøçSVppRY-22 pppp AGRY - pppp CT RY ++( ppp AGY ppp CT R) ÷ 12 The probability that at least one e occurred after æö 1 ç V ÷ bt =- lnç 1 - ÷ (38) time t is ç 2pp÷ èøRY −βt pe(1 >=− 0, t ) 1 e (43) Substitute βt and γt (= αt - βt) into Eq. (35) and, af- ter some algebraic manipulation, we have a more useful The probability that e2 has occurred but e1 has not is

−βtR −+() βγ form of DF84: pe(2> 0, e 1 = 0,) t =− 1 pe (12 , e =−> 0,) t pe (1 0, t ) =− e e (44) 2  x The transition probability p(G|A,t), abbreviated as Dx =  −+ (ππ ππ) ππ ln( ) + πππ ln 2 F84 ππ AG CT RY 1 CT R RY x3 (39) pAG, is −β −+β πγ π ππt π ()R t   G GYee G x2 22 pAG = ππ G pe (1 >+ 0, t ) pe (21 > 0, e = 0,) t =+ G − (45) +−πππ ln  ππ ln ( x ) π ππ AGY x RY 1 R RR 3  In the same way, we can derive other transition Where probabilities which are shown below: V −−ββtt ππAA++xx12 π G ππ GG +− xx 12 π G π C(11 − e) πT( − e) x1 =- 1; A  2ppRY −−ββtt (40) G ππAA+−xx12 π A ππ GG ++ xx 12 π A π C(11 − e) πT( − e) (46) P =  xV2 =+ (pppAGY ppp CT R)(2 pp RY - ) HKY C −−ββtt πA(11−e) πG( − e) ππCC ++ xx34 π T ππ TT +− xx 34 π T  22 2 2 2 2 T xS3 =- ppRY + 22 pppp AGRY + pppp CTY R - ppp AGYVV - ppp CT R π −−−ββttπ − ππ +− π ππ ++ π A(11e) G( e) CC xx34 C TT xx 34 C

To illustrate the calculation of F84D , we may use the Where two aligned sequences in Figure 2 which gives us π = −−ββ−+βπγ −+βπγ A ππtt()RYtt() YReeee (47) 6/48, π = 12/48, π = 10/48, π = 20/48, S = 4/24, V = xx12 = ; = ; xx 34 = ; = C G T ππππ 2/24, αt = 0.5778363341, βt = 0.2076393648, γt = αt - βt RRY Y = 0.3701969693, and D = 0.3198867427. The variance As a quick check of the transition probabilities, we F84 first note that when t = 0 (or when α = 0 and β = 0), of the DF84 can be obtained by either the delta method or the method using Fisher information matrix. then the diagonal elements are 1 and all off-diagonal elements are 0, which is what we expected. Second, A substitution model similar to the F84 model is the when t approaches infinity with β > 0 and γ > 0, then the HKY85 model, with its rate matrix specified as: transition probabilities will approach the equilibrium −+β γ π βπ βπ A ( ) GC T frequencies, which is also what we expected.  β+− γ π βπ βπ G ( ) A CT (41) QHKY 85 = We cannot derive the distance for the HKY85 mod- C βπ βπ −+(β γ) π AG T el by following the same approach as that for the F84 T βπ βπ( β+− γ) π AG C model. Hasegawa, et al. [4] has tried this approach but Where (β+γ) is often written as α and the diagonal were not successful because there is no explicit solution elements are constrained by each row summing up to for βt and γt. However, if we treat the A↔G transition 0. The HKY85 model and the F84 model differ only in and C↔T transition separate, then we can solve for βt the specification of rates involving transitions. qAG and and γt [10]. In other words, we obtain one set of βt and qCT are πG(β+γ) and πT(β+γ) in the HKY85 model speci- γt from observed A↔G transitions and transversions, fied in Eq. (41), in contrast to Gπ (β+γ/πR) and πT(β+γ/πY), and another set of βt and γt from observed C↔T transi- respectively, in the F84 model specified in Eq. (26). By tions and transversions. βt in the two sets are the same comparing these rates, it becomes obvious that the F84 as that in Eq. (38), but γt is different between the two model would be equivalent to the HKY85 model if πR = sets of estimates. We can then take a weighted aver-

πY. age of γt. Admittedly, this does sound mathematically clumsy and explains why HKY85, while commonly used We can obtain the transition probabilities for the in phylogenetic analysis involving a likelihood - frame HKY85 model in the same way as that for the F84 model. work or Bayesian inference, is almost never used in dis- In short, we again start with a nucleotide A and envision tance-based phylogenetics. two events e1 and e2. Event e1 occurs with rate β, and re- sults in the original A replaced by any of the four nucleo- Here is the somewhat circuitous protocol to get βt tides with probabilities equal to their respective equi- and γt from HKY85. The expected numbers of A↔G

Xia. J Genet Genome Res 2017, 4:031 • Page 8 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

y = 0.9991x - 0.0023 although DHKY85 is consistently but slightly smaller than R² = 0.9996 DF84. 0.29 The HKY85 model itself may not carry much biologic- al significance given the existence of the F84 model. 0.24

5 However, the twists involved in computing the evolu- 8

K Y tionary distance, i.e., the separate estimation ofγ H A↔G D 0.19 and γC↔T, lead very naturally to a very useful TN93 mod- el that we will cover next. 0.14 TN93 model

0.09 We have come far, so far that we need hardly any ex- 0.09 0.14 0.19 0.24 0.29 tra effort to derive transition probabilities for the TN93 DF84 model. There are two equivalent specifications of the Figure 5: Evolutionary distances from the HKY85 and F84 rate matrix for the TN93 model. The first is models are nearly identical.

A −+βπG γ RG π π R βπC βπT βπ+− γ π π βπ βπ G A RA R C T (53) QTN 93 = and C↔T transitions, designated SR and SY, respective- C βπ βπ −+βπ γ π π A G T YT Y ly, and transversions are T βπ A βπG βπC+− γ YC π π Y Where the diagonal elements are constrained by ES( ) = 2p p R A AG each row summing up to 0. The second specification ES(Y ) = 2pC p CT (48) simply replaces (β + γR/πR) by α1 and (β + γY/πY) by α2. We

EV( ) = 2ppppA p AT +++ 2 A p AC 2 G p GC 2 G p GT see that TN93 is reduced to F84 if γR = γY, and to HKY85

if γR / πR = γY / πY. Setting E(SR) and E(V) to their the observed SR and V, and solve for βt and γt, we have The similarity between TN93 and F84 allows us to re-use Figure 4 for deriving transition probabilities for æöV bt =- lnç 1 - ÷ TN93. We only need to add a subscript R to γ and α in ç ÷ èø2ppRY (49) Figure 4 so that we have γR and αR as rates for purine, 1 æöpp(2 pp -V ) keeping everything else the same, and we instantly ob- g t = lnç AG RY ÷ R ç 2 ÷ tain the transition probabilities for transitional substitu- pRèø2 pppp AGRY--SV RY pp R ppp AGY tions between purines and for transversional substitu- Where βt is the same as that in Eq. (38), and γ t in Eq. R tions as shown inFigure 4. To get transition probabilities (49) is γt estimated from observed S and V. R between pyrimidines, we can just replace the original Now we obtain another set of solutions for βt and γt nucleotide A inFigure 4 by nucleotide C or T and rename by setting E(SY) and E(V) to their observed SY and V, and γ and α in Figure 4 to γY and αY. Note that our αR = β+γR, solve for βt and γt, we have the same βt but a different and αY = β+γY. γt: The transition probability matrix for the TN93 model 1 æöpp(2 pp -V ) ç CT RY ÷ π++ ππxx π π +− ππ xx π π 11 − e−−ββttπ −e g t = lnç ÷ (50) is A AY12 G G GY12 G C( ) T( ) Y ç 2 ÷ A  ç -- ÷ −−ββtt pYèø2 pppp CTRYSV YRY pp ppp CTR G πA+− ππ AYxx12 π A π G ++ ππ GY xx12 π A πC(11 − e) πT( −e) P =   TN 93 C  −−ββtt  (54) πA(11−e) πG( −e) πC ++ ππ CRxx3 π G 4 π T +− ππ TR xx34 π T   T A weighted average of γt could be  π −−−ββttπ −π +− ππ π π ++ ππ π  A(11e) G( e) C CRxx34 C T TR xx34 C

gt =+ pgRR tt pg YY (51) Where x1 and x3 are the same as those in Eq. (34), but

The distance for the HKY model x2 has α replaced by αR and x4 has α replaced by αY, i.e., ---- eebbttaaRYtt ee xx == , , xx == , DHKY 85 = m t = 2( ppbgAG tt ++ )2 ppbgTC ( tt ++ )2 ppbYR t (52) 12 34 (55) ppppRRY Y To compute D using the two aligned sequences HY85 To obtain the distance for the TN93 model (D ), re- in Figure 2, we have π = 6/48, π = 12/48, π = 10/48, π TN93 A C G T call that a distance is defined as µt where µ is the aver- = 20/48, S = 4/24, S = 0, V = 2/24, βt = 0.2076393648, Y R age substitution rate, i.e., substitution rates in Eq. (53) γ t = -0.2223239164, γ t = 1.047432870, weighted γt = R Y weighted by the equilibrium frequencies, so: 0.624180608, DHKY85 = 0.308904. I intentionally choose (56) the aligned sequences in Figure 2 with SR = 0 just to see DTN 93 = 2ppAG ( b tt ++ gRRTC / p )2 pp ( b tt ++ gYYYR / p )2 ppb t if D would behave strangely. It did not. For compari- HKY85 Now we need to obtain α t, α t, and βt. The meth- son, the same two sequences yield D = 0.319887. R Y F84 od we will use is the same as that for the K80 and F84 models, i.e., we obtain the expected numbers of A↔G In general, DHKY85 is slightly smaller than DF84. I used the eight vertebrate COI sequences in the FASTA file transitions, C↔T transitions, and transversions, desig-

VertCOI.fas that comes with DAMBE [11] to compute nated E(SR), E(SY) and E(V), respectively, from transition probabilities, and equate them to the observed S , S both DHKY85 and DF84 (Figure 5). The difference is minor, R Y

Xia. J Genet Genome Res 2017, 4:031 • Page 9 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648

and V to solve for αRt, αYt, and βt: ferential equations and the other involving matrix expo- nential and logarithms, are often used in practical com- ES(R ) == 2pA p AG S R putation with the GTR model for nucleotide sequences ES( ) == 2p p S (57) Y C CT Y and amino acid-based substitution models. They will be EV( ) = 2pppp p +++ 2 p 2 p 2 p = V A AT A AC G GC G GT numerically illustrated elsewhere.

The resulting αRt, αYt, and βt are Acknowledgements æö 2ppppAGRY ÷ at = lnç ÷ (58) This study is funded by the Discovery Grant from R ç 2 ÷ èø2ppppAGRY-- pp RYSV R ppp AGY Natural Science and Engineering Research Council of Canada (RGPIN/261252-2013). I thank C. Vlasschaert æö2pppp ç CT RY ÷ (59) and S. Aris-Brosou for feedback. atY = lnç 2 ÷ èøç2pppp-- ppSV ppp ÷ CTRY YRY CTR References æöV ÷ bt =- lnç 1 - ÷ (60) 1. Felsenstein J (2004) Inferring phylogenies. Sinauer, Sun- ç ÷ derland, Massachusetts, 664. èø2ppRY If one wishes to express D in S , S and V, then one 2. Yang Z (2006) Computational . Oxford TN93 R Y University Press, Oxford, 357. may just substitute γRt, γYt, and βt into Eq. (56), which yields: 3. Tamura K, Nei M (1993) Estimation of the number of nucleo- tide substitutions in the control region of mitochondrial DNA in 2pp p ln(xx )++ ln( ) 2pp p ln( xx ) ln( ) AG[ Y 12] CTR[ 13] humans and chimpanzees. Mol Biol Evol 10: 512-526. DxTN 93 = +-2ppRY 1 (61) ppRY 4. Hasegawa M, Kishino H, Yano T (1985) Dating of the hu- Where man-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22: 160-174. V x1 =- 1; 5. Jukes TH, Cantor CR (1969) Evolution of protein mole- 2ppRY cules. In: Munro HN, Mammalian protein metabolism. Ac- 2pppp ademic Press, New York, 21-132. x = AGRY ; (62) 2 2 6. Kimura M (1980) A simple method for estimating evolution- 2ppppAGRY--SV RRY pp ppp AGY ary rates of base substitutions through comparative studies 2ppppCT RY of nucleotide sequences. J Mol Evol 16: 111-120. x3 = 2 . 2ppppCTRY--SV YYR pp ppp CTR 7. Kimura M, Ohta T (1972) On the stochastic model for es- To illustrate the application of D with the two aligned timation of mutational distance between homologous pro- TN93 teins. J Mol Evol 2: 87-90. sequences in Figure 2, we have πA = 6/48, πC = 12/48, πG = 10/48, π = 20/48, S = 4/24, S = 0, V = 2/24, α t = 0.13353, 8. Waddell PJ, Steel MA (1997) General time-reversible dis- T Y R R tances with unequal rates across sites: mixing gamma and αYt = 0.90593, βt = 0.20764, γRt = αRt – βt = -0.07411, γYt inverse Gaussian distributions with invariant sites. Mol Phy- = αRt – βt = 0.69829, DTN93 = 0.35299. The variance of the logenet Evol 8: 398-414. DTN93 can be obtained by either the delta method or the 9. Xia X (2007) Bioinformatics and the cell: Modern computa- method using Fisher information matrix. Note that SR = 0 tional approaches in genomics, proteomics and transcrip- tomics. Springer US, New York, 349. means no information for estimating αRt properly. I should mention that all distance formulations in this 10. Rzhetsky A, Nei M (1995) Tests of applicability of several substitution models for DNA sequence data. Mol Biol Evol paper are known as Independently Estimated (IE) distan- 12: 131-151. ces because they use information from only two aligned sequences and are independent of other pairs of sequen- 11. Xia X (2013) DAMBE5: A comprehensive software package for data analysis in and evolution. Mol ces. Practical molecular phylogenetic analysis typically Biol Evol 30: 1720-1728. would use Simultaneously Estimated (SE) distances [12,13] 12. Tamura K, Nei M, Kumar S (2004) Prospects for inferring which use information from all pairs of sequences. SE dis- very large phylogenies by using the neighbor-joining meth- tances are implemented in MEGA [14] and DAMBE [11,15]. od. Proc Natl Acad Sci U S A 101: 11030-11035. The PhyPA [16] function in DAMBE, which performs phylo- 13. Xia X (2009) Information-theoretic indices and an approximate genetic reconstruction base on pairwise alignment when significance test for testing the molecular clock hypothesis reliable multiple sequence alignment is difficult to obtain with genetic distances. Mol Phylogenet Evol 52: 665-676. for highly diverged sequences, uses SE distances only. 14. Kumar S, Stecher G, Tamura K (2016) MEGA7: Molecu- lar Evolutionary Genetics Analysis Version 7.0 for Bigger In short, the approach of deriving transition proba- Datasets. Mol Biol Evol 33: 1870-1874. bilities by probability reasoning can go a long way if one 15. Xia X (2017) DAMBE6: New tools for microbial genomics, can do good bookkeeping. In particularly, the probabili- phylogenetics and molecular evolution. J Hered 108: 431-437. ty reasoning approach is very useful for conceptual un- 16. Xia X (2016) PhyPA: Phylogenetic method with pairwise derstanding. However, the approach becomes increas- sequence alignment outperforms likelihood methods in ing difficult with more complicated substitution models. phylogenetics involving highly diverged sequences. Mol Two alternative approaches, one involving solving dif- Phylogenet Evol 102: 331-343.

Xia. J Genet Genome Res 2017, 4:031 • Page 10 of 10 •