Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning
Total Page:16
File Type:pdf, Size:1020Kb
ISSN: 2378-3648 Xia. J Genet Genome Res 2017, 4:031 DOI: 10.23937/2378-3648/1410031 Volume 4 | Issue 1 Journal of Open Access Genetics and Genome Research REVIEW ARTICLE Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning Xuhua Xia1,2* 1Department of Biology, University of Ottawa, Canada 2Ottawa Institute of Systems Biology, Ottawa, Canada *Corresponding author: Xuhua Xia, Department of Biology, Ottawa Institute of Systems Biology, University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A, Ottawa, Ontario, K1N 6N5, Canada, Tel: (613)-562-5800, Ext: 6886, Fax: (613)- 562-5486, E-mail: [email protected] nucleotide, amino acid and codon sequences. All substi- Abstract tution models used in molecular phylogenetics are Mar- kov chain models characterized by 1) Either a transition Substitution rate matrices are used to correct multiple hits at the same sites, which requires the derivation of transi- probability matrix (P) with discrete time or a rate matrix tion probabilities and evolutionary distances from substi- (Q) in continuous time where P can be derived from Q, tution rate matrices. The derivation is essential in molec- and 2) Equilibrium frequencies. The general form of an ular phylogenetics and phylogenomics, and represents the instantaneous rate matrix for nucleotide sequences is, only statistically sound way for developing scoring matrices used in sequence alignment and local string matching (e.g., in the order of A, G, C, and T: BLAST and FASTA). Three different approaches are fre- A− abc quently used for deriving transition probabilities and evolu- Gg− d e tionary distances: 1) The probability reasoning, 2) Solving Q = partial differential equations, and 3) Matrix exponential and Ch i− f (1) logarithm. The first approach demands the least amount of mathematical skills but offers the best way for conceptual Tjk l − understanding, and can often generate nice mathematical Transition probability matrix, often referred to as the P expressions of transition probabilities and evolutionary dis- tances. This review represents the most systematic and matrix, specifies the probability of a nucleotide or amino comprehensive numerical illustration of the first approach. acid changing into another one after time t. It is needed to Keywords calculate likelihood and to derive evolutionary distances, and consequently is needed phylogenetics based on the Substitution model, Substitution rate, Transition probability, Evolutionary distance maximum likelihood and distance-based methods as well as Bayesian inference. Whether a substitution model can Introduction be implemented for phylogenetic analysis essentially de- pends on whether the model’s transition probabilities can Substitutions occur over time and can overwrite each be calculated. other at the same nucleotide or amino acid site. When we compare two homologous nucleotide sequences and There are three ways to obtain transition probabilities find differences in N sites, the actual number of substi- from the Q matrix [1,2] :1) By probability reasoning, 2) By tutions (designated by M) could be much greater than solving differential equations involving rates, and 3) By tak- N because multiple substitutions could have happened ing the matrix exponential of the rate matrix. The last two at the same site, overwriting each other. Substitution require some mathematical background in calculus and models are used to infer the observable M from the ob- linear algebra. The first, in contrast, demands little math- served substitutions from sequence comparisons. ematical skill except for careful book-keeping and solving simultaneous equations. This approach is particularly rel- Many substitution models have been proposed for evant to biological students not only for gaining a concep- Citation: Xia X (2017) Deriving Transition Probabilities and Evolutionary Distances from Substitution Rate Matrix by Probability Reasoning. J Genet Genome Res 3:031. doi.org/10.23937/2378-3648/1410031 Received: April 03, 2017: Accepted: July 24, 2017: Published: July 27, 2017 Copyright: © 2017 Xia X. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Xia. J Genet Genome Res 2017, 4:031 • Page 1 of 10 • DOI: 10.23937/2378-3648/1410031 ISSN: 2378-3648 tual understanding of the substitution models, but also There are some quick ways to check the derived to deriving nice mathematical expressions for transition transition probabilities. First, we note that when t ap- probabilities and evolutionary distances. New researchers proaches infinity, then all entries in matrix P approaches often ask why we can derive evolutionary distances be- ¼ if α > 0. This is what we have expected. Second, when tween two aligned sequences for the TN93 model [3] but t = 0, then all diagonal elements in matrix P are 1 and cannot for the simpler HKY85 model [4] which is a special all off-diagonal elements are zero. This is again what we case of the TN93 model, yet another model, F84 (used in expected. Third, if α is zero, then no change is possible, PHYLIP since 1984), which is also a special case of the TN93 and we again expect all diagonal elements in matrix P model, can have its evolutionary distance readily derived. to be 1 and all off-diagonal elements to be zero, which One can easily obtain answers to such questions by taking is also true. the first approach. However, the first approach is not of From p in Eq. (2), the expected proportion of sites general purpose and cannot handle very complicated sub- ij that are different between two aligned homologous se- stitution models. In contrast, the last one can be used with quences (p ) is 3*p (t), i.e., any substitution models specified by a rate matrix from diff ij which the matrix exponential can be obtained. In short, all 33−4αt p = 3( pt ) = − e (4) these approaches need to be learned by anyone wishing to diff ij 44 become a molecular phylogeneticist, but this paper will fo- Note that pdiff approaches ¾ when t is infinitely large, cus only on the probability reasoning approach illustrated which means that multiple substitutions can no longer with JC69 [5], K80 [6], F84 (the model used in PHYLIP since be corrected. Eq. (4) offers another way of deriving p (t) 1984), HKY85 [4], and TN93 [3] models. ii in Eq. (3), i.e., pii(t) is simply 1-pdiff. Probability Reasoning to Obtain Transition Prob- Eq. (4) allows us to derive the JC69 distance (DJC69) abilities and Evolutionary Distances because a distance is defined as μt where μ is the substi- Felsenstein [1] presented nice examples of prob- tution rate which is equal to 3α in the JC69 model. This ability reasoning to derive transition probabilities and is the same as the distance that you have driven is the product of the speed (rate) and time. Given that D = evolutionary distances from rate matrices. This section JC69 3αt, we can derive D (Figure 1g) by substituting αt = presents the approach in a more systematic and acces- JC69 D /3 into Eq. (4), i.e., sible way. JC69 JC69 model 3 4 pdiff DJC 69 =−− ln 1 (5) 43 Consider nucleotide A in the JC69 model (Figure 1a). Imagine that the nucleotide has a rate α of changing into Where pdiff (the expected number of sites that are dif- any of the four nucleotides, i.e., including changing to it- ferent between the two homologous sequences) can be self (Figure 1b). This is effectively the same specification approximated by the observed proportion of sites (pdiff. as the JC69 model. After time t, the expected number obs) differing between the two aligned sequences. Note of substitutions is 4αt and the probability of no substi- that pdiff.obs may differ from pdiff even when the under- tution is p (x = 0, α, t) = e-4αt according to the Poisson lying substitution model indeed follows JC69 because distribution, and the probability of having at least one of 1) Stochastic factors due to limited aligned length of change is then p (x ≥ 1, α, t) = 1-e-4αt (Figure 1c). Because the two sequences, and 2) Distortion caused by subop- nucleotide A can change into any one of the four nu- timal sequence alignment. Thus, although pdiff in Eq. (4) cleotides (including nucleotide A itself), each nucleotide cannot be greater than 0.75, pdiff.obs could, even when gets 1/4 of p (x ≥ 1, α, t). We therefore have in Figure 1. sequences evolve strictly according to the JC69 model. D is not defined when p ≥ 0.75 as there is no loga- px(≥ 1,α , t ) 1 1 JC69 diff pt( ) = = − e−4αt (2) rithm for 0 or negative values. ij 4 44 We can optionally show that JC69D in Eq. (5) is a max- The transition probability pii(t) is the summation of two probabilities: the probability of no change (which is e-4αt) imum likelihood distance. For two aligned sequences and the probability of changing to itself which is the same of length N, designate the number of sites that differ as specified in Eq. (2), as shown in Figure 1e, i.e., between the two sequences as ND and the number of sites identical between the two sites as (N-ND). Now the −−44ααttpx(≥ 1,α , t ) 1 3 ptii ( ) =+=+ e e (3) likelihood function is: 4 44 N 1 ()NN− DDN L = ppii (1− ii ) The transition probability matrix for the JC69 model 4 has only two distinct elements. The four diagonal ele- 1 lnL = Nln +− ( NND ) ln( p ii ) + N D ln(1 − pii ) ments are the same as specified in Eq. (3) and all the 4 off-diagonal elements are the same as specified in Eq.