A simple, general result for the variance of substitution number in molecular evolution. Bahram Houchmandzadeh, Marcel Vallade
To cite this version:
Bahram Houchmandzadeh, Marcel Vallade. A simple, general result for the variance of substitution number in molecular evolution. Molecular Biology and Evolution, Oxford University Press (OUP), 2016, 33 (7), pp. 1858. hal-01275060
HAL Id: hal-01275060 https://hal.archives-ouvertes.fr/hal-01275060 Submitted on 16 Feb 2016
A simple, general result for the variance of substitution number in molecular evolution.
Bahram Houchmandzadeh and Marcel Vallade CNRS, LIPHY, F-38000 Grenoble, France Univ. Grenoble Alpes, LIPHY, F-38000 Grenoble, France
The number of substitutions (of nucleotides, amino acids, ...) that take place during the evolution of a sequence is a stochastic variable of fundamental importance in the field of molecular evolution. Although the mean number of substitutions during the molecular evolution of a sequence can be estimated for a given substitution model, no simple solution exists for the variance of this random variable. We show in this article that the computation of the variance is as simple as that of the mean number of substitutions, for both short and long times. Apart from its fundamental importance, this result can be used to investigate the dispersion index R, i.e. the ratio of the variance to the mean substitution number, which is of prime importance in the neutral theory of molecular evolution. By investigating large classes of substitution models, we demonstrate that although R ≥ 1, obtaining R significantly larger than unity requires, in general, additional hypotheses on the structure of the substitution model.
I. INTRODUCTION.

Evolution at the molecular level is the process by which random mutations change the content of some sites of a given sequence (of nucleotides, amino acids, ...) during time. The number of substitutions n that occur during this time is of prime importance in the field of molecular evolution, and its characterization is the first step in deciphering the history of evolution and its many branchings. The main observable in molecular evolution, on comparing two sequences, is p̂, the fraction of sites at which the two sequences are different. In order to estimate the statistical moments of n, the usual approach is to postulate a substitution model Q through which p̂ can be related to the statistical moments of n. The simplest and most widely used models assume that Q is site independent, although this constraint can be relaxed [1, 2].

Once a substitution model Q has been specified, it is straightforward to deduce the mean number of substitutions ⟨n⟩, and the process is detailed in many textbooks. However, the mean is only the first step in the characterization of a random variable and by itself is a rather poor indicator. The next step in the investigation of a random variable is to obtain its variance V. Surprisingly, no simple expression for V can be found in the literature for an arbitrary substitution model Q. The first purpose of this article is to overcome this shortcoming. We show that computing V is as simple as computing ⟨n⟩, both for short and for long times.

We then apply this fundamental result to the investigation of the dispersion index R, the ratio of the variance to the mean number of substitutions. The neutral theory of molecular evolution introduced by Kimura [3] supposes that the majority of mutations are neutral (i.e. have no effect on the phenotypic fitness) and that substitutions in protein or DNA sequences therefore accumulate at a "constant rate" during evolution, a hypothesis that plays an important role in the foundation of the "molecular clock" [4, 5]. The original neutral theory postulated that the substitution process is Poissonian, i.e. that R = 1. Since the earliest work on the index of dispersion, however, it has become evident that R is usually much larger than unity (see [6] for a review of data). Many alternatives have been suggested to reconcile the "overdispersion" observation with the neutral theory ([6]). Among these various models, a promising alternative, that of fluctuating neutral space, was suggested by Takahata [7] and has been extensively studied in various frameworks ([8–12]).

The fluctuating neutral space model states that the substitution rate m^i_j from state i to state j is a function of both i and j. States i and j can be nucleotides or amino acids, in which case we recover the usual substitution models of molecular evolution discussed above. The states can also be nodes of a neutral graph used to study global protein evolution ([13–15]). For neutral networks used in the study of protein evolution, Bloom, Raval and Wilke [11] devised an elegant procedure to estimate the substitution rates. We will show in this article that in general R ≥ 1, and that the equality is reached only for the most trivial cases. Producing large R, however, requires additional hypotheses on the structure of the substitution rates.

In summary, the problem we investigate in this article is to find a simple and general solution for the variance and dispersion index of any substitution matrix of dimension K. A substitution matrix Q collects the transition rates m^i_j (i ≠ j); its diagonal elements q^i_i = −m_i are set such that its columns sum to zero (see below for notations) and designate the rate of leaving state i. Because of this condition, Q is singular.

Zheng [8] was the first to use Markov chains to investigate the variance of the substitution number as the solution of a set of differential equations. His investigation was further developed by Bloom, Raval and Wilke [11], who gave the general solution in terms of the spectral decomposition of the substitution matrix; this solution was extended by Raval [12] for a specific class of matrices used for random walks on neutral graphs. Minin and Suchard [16] used the same spectral method to derive an analytical form for the generating function of a binary process.

The first step in characterizing the substitution number, as is well known, is to find the equilibrium probabilities π_i of being in state i, which are obtained by solving the linear system Σ_i q^i_j π_i = 0 with the additional condition Σ_i π_i = 1. Once the π_i are obtained, the mean substitution number as a function of time is simply ⟨n⟩ = m̄t, where m̄ = Σ_i m_i π_i is the weighted average of the "leaving" rates.

We show here that finding the variance necessitates a similar computation. Denoting the weighted deviation of the diagonal elements of Q from the mean by h_i = (m̄ − m_i)π_i, we have to find the solution of the linear system Σ_i q^i_j r_i = h_j with the additional condition Σ_i r_i = 0. For long times, the dispersion index is then simply

    R = 1 + \frac{2}{\bar m} \sum_{i=1}^{K} m_i r_i    (1)

For short times, i.e. when the mean number of substitutions is small, the result is even simpler:

    R = 1 + \frac{v_m}{\bar m^2} \langle n \rangle    (2)

where

    v_m = \sum_{i=1}^{K} (\bar m - m_i)^2 \pi_i

In other words, v_m is the variance of the diagonal elements of the substitution matrix, weighted by the equilibrium probabilities.

This article is organized as follows. In the next section, we use a Markov chain approach to derive relations (1,2) and show their validity by comparing them to results obtained by direct numerical simulations. The simplicity of these results then allows us to study the dispersion index for specific models of nucleotide substitutions widely used in the literature (section III) and for general models (section IV). We investigate in particular the conditions necessary to produce large R. The last section is devoted to a general discussion of these results and to conclusions. Technical details, such as the proof of R ≥ 1, are given in the appendices.

II. MARKOV CHAIN MODEL OF DISPERSION INDEX.

A. Background and definitions.

The problem we investigate in this article is mainly that of counting the transitions of a random variable (figure 1). Consider a random variable X that can occupy K distinct states and let m^i_j (i ≠ j, 1 ≤ i, j ≤ K) be the transition rate from state i to state j. The probability density p_i(t) of being in state i at time t is governed by the Master equation

    \frac{dp_i}{dt} = -\sum_j m^i_j \, p_i + \sum_j m^j_i \, p_j

where m_i = Σ_j m^i_j is the "leaving" rate from state i. We can collect the p_i into a (column) vector |p⟩ = (p_1, ..., p_K)^T and write the above equations in matrix notation:

    \frac{d}{dt} |p\rangle = (-D + M) |p\rangle = Q |p\rangle    (3)

where D is the diagonal matrix of the m_i and M = (m^i_j) collects the detailed transition rates from state i to state j (i ≠ j) and has zeros on its diagonal. In our notations, the upper (lower) index designates the column (row) of a matrix. The matrix Q = −D + M is called the substitution matrix; its columns sum to zero.

Figure 1. The random variable X can switch between K states; the counter N of the number of transitions is incremented at each transition of the variable X. The figure shows one realization of these random variables as a function of time.

Before proceeding, we explain the notations used in this article. As the matrix Q is not in general symmetric, a clear distinction must be made between right (column) and left (row) vectors. The Dirac notations are standard and useful for handling this distinction.
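The assembly of eq. (3) can be illustrated with a small numerical sketch (Python with NumPy; the rate values below are arbitrary and purely illustrative) that builds Q = −D + M for K = 3 states and checks the two structural properties used throughout, namely ⟨1|Q = 0 and the singularity of Q:

```python
import numpy as np

# Hypothetical transition rates for K = 3 states: M[j, i] is the
# rate m^i_j from state i to state j (upper index = column = source
# state, as in the text); the diagonal of M is zero.
M = np.array([[0.0, 0.4, 0.1],
              [0.7, 0.0, 0.5],
              [0.2, 0.3, 0.0]])

m = M.sum(axis=0)      # leaving rates m_i = sum_j m^i_j
D = np.diag(m)         # diagonal matrix of the leaving rates
Q = -D + M             # substitution matrix of eq. (3)

print(np.allclose(Q.sum(axis=0), 0.0))   # columns sum to zero -> True
print(abs(np.linalg.det(Q)) < 1e-12)     # hence Q is singular -> True
```

Because every column of Q sums to zero, ⟨1| is automatically a left zero-eigenvector, which is the property exploited below to isolate the zero eigenvalue.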
A column vector (x_1, ..., x_K)^T is denoted |x⟩, while a row vector (y^1, ..., y^K) is denoted ⟨y|, and ⟨y|x⟩ = Σ_i y^i x_i is their scalar product. In some of the literature (see [2]), the substitution matrix is the transpose of the matrix used here; the master equation is then written as d⟨p|/dt = ⟨p|Q, and the rows of the substitution matrix sum to zero.

By construction, the matrix Q is singular and has one zero eigenvalue, while all the others are negative. Therefore, as time flows, |p(t)⟩ → |π⟩, where |π⟩ = (π_1, ..., π_K)^T is the equilibrium occupation probability and the zero-eigenvector of the substitution matrix:

    Q |\pi\rangle = 0
    \langle 1 | \pi \rangle = 1

where ⟨1| = (1, ..., 1), and the second condition expresses that the sum of the probabilities must be 1. Note that, by definition, ⟨1|Q = 0, and thus ⟨1| is a zero left eigenvector of the substitution matrix.

B. Problem formulation.

To count the number of substitutions (figure 1), we consider the probability densities p^n_i(t) of being in state i after n substitutions at time t. These probabilities are governed by the master equation

    \frac{dp^n_i}{dt} = -m_i p^n_i + \sum_j m^j_i \, p^{n-1}_j, \quad n > 0
    \frac{dp^0_i}{dt} = -m_i p^0_i

We can combine the above equations by setting p^n_i(t) = 0 if n < 0. Collecting the elements (p^n_1, p^n_2, ..., p^n_K)^T into the vector |p^n⟩, the above equations can then be written as

    \frac{d}{dt} |p^n\rangle = -D |p^n\rangle + M |p^{n-1}\rangle    (4)

The quantities of interest for the computation of the dispersion index are the mean and the variance of the number of substitutions. The mean number of substitutions at time t is

    \langle n(t) \rangle = \sum_{i,n} n \, p^n_i(t)

Let us define the partial means

    n_i(t) = \sum_n n \, p^n_i(t)

and collect them into the vector |n(t)⟩ = (n_1, ..., n_K)^T. The mean is then simply

    \langle n(t) \rangle = \sum_i n_i(t) = \langle 1 | n(t) \rangle

By the same token, the second moment ⟨n²(t)⟩ = Σ_{i,n} n² p^n_i(t) can be written in terms of the partial second moments n²_i = Σ_n n² p^n_i(t) as

    \langle n^2(t) \rangle = \langle 1 | n^2(t) \rangle

where |n²(t)⟩ = (n²_1, ..., n²_K)^T. It is straightforward to show (see appendix A), for the initial condition |p^0(0)⟩ = |π⟩, that |n(t)⟩ and |n²(t)⟩ obey linear differential equations:

    \frac{d}{dt} |n\rangle = Q |n\rangle + D |\pi\rangle    (5)
    \frac{d}{dt} |n^2\rangle = Q |n^2\rangle + 2 M |n\rangle + D |\pi\rangle    (6)

Let us define the left vector ⟨m| = (m_1, ..., m_K), which collects the leaving rates. By definition, ⟨1|M = ⟨1|D = ⟨m|. Multiplying eqs (5,6) by the left vector ⟨1| and noting that ⟨1|Q = 0, we get a simple relation for the moments:

    \frac{d}{dt} \langle n \rangle = \langle m | \pi \rangle    (7)
    \frac{d}{dt} \langle n^2 \rangle = 2 \langle m | n \rangle + \langle m | \pi \rangle    (8)

We observe that the mean number of substitutions involves only a trivial integration. Defining the weighted average of the leaving rates as

    \bar m = \langle m | \pi \rangle = \sum_i m_i \pi_i

the mean number of substitutions is simply

    \langle n(t) \rangle = \bar m t    (9)

To compute the second moment of the substitution number, on the other hand, we must solve for |n⟩ using equation (5) and then perform one integration. The next subsection is devoted to the efficient solution of this procedure.

C. Solution of the equation for the moments.

One standard way of solving equation (5) would be to express the matrix Q in its eigenbasis; equation (5) is then diagonalized and can be formally solved. This is the method used by Bloom, Raval and Wilke [11] and further refined by Raval [12] for a specific class of substitution matrices where m^i_j = 0 or 1. The first problem with this approach is that there is no guarantee that Q is diagonalizable. Even if Q can be diagonalized, this is not the most efficient procedure to find V, as it necessitates the computation of all eigenvalues and all left and right eigenvectors of Q, and then the cumbersome summation of their binomial products.

The procedure we follow involves some straightforward, albeit cumbersome, linear algebraic operations, but the end result is quite simple. We note that the matrix Q is singular and has exactly one zero eigenvalue, associated with the left ⟨1| and right |π⟩ eigenvectors. The method we use is to isolate the zero eigenvalue by making a round trip to a new basis. Thus, if we can find a new basis in which the substitution matrix Q' = X^{-1} Q X takes the lower block triangular form

    Q' = \begin{pmatrix} 0 & 0 \cdots 0 \\ |\tilde\alpha\rangle & \tilde Q \end{pmatrix}    (10)

we will have achieved our goal of isolating the zero eigenvalue. The non-singular matrix Q̃ is of rank K − 1 and carries the non-zero, negative eigenvalues of Q. As ⟨1| is the known left eigenvector of Q, we can split the vector space into B = {|u⟩ : ⟨1|u⟩ = 0} and the space spanned by |π⟩. It is then straightforward to find the transfer matrices X and X^{-1} for such a transformation:

    X = \begin{pmatrix} 1 & -1 & \cdots & -1 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} ; \quad
    X^{-1} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}

Under such a transformation, a right vector |x⟩ = (x_1, x_2, ..., x_K)^T transforms into

    |x'\rangle = X^{-1} |x\rangle = \begin{pmatrix} \sum_i x_i \\ |\tilde x\rangle \end{pmatrix}

where |x̃⟩ = (x_2, ..., x_K)^T is of dimension K − 1. In general, we will designate by a tilde all vectors that belong to the (K − 1)-dimensional space B in which the linear application Q̃ operates. A left vector ⟨y| = (y^1, y^2, ..., y^K) transforms into

    \langle y' | = \langle y | X = (y^1, y^2 - y^1, ..., y^K - y^1) = (y^1, \langle \tilde y |)

where ⟨ỹ| = (y^2 − y^1, ..., y^K − y^1) is a (K − 1)-dimensional left vector. Finally, Q̃^i_j = Q^i_j − Q^1_j, where the elements of Q̃ are indexed from 2 to K.

Expressing now equation (5) for the evolution of the first moments in the new basis, we find that

    \frac{d}{dt} \langle n \rangle = \bar m    (11)
    \frac{d}{dt} |\tilde n\rangle = \tilde Q |\tilde n\rangle + \langle n \rangle |\tilde\alpha\rangle + |\tilde\mu\rangle    (12)

where |ñ⟩ = (n_2, ..., n_K)^T, |µ̃⟩ = (m_2 π_2, ..., m_K π_K)^T and |α̃⟩ is given in relation (10). Equation (11) is the same as equation (7) and implies that ⟨n⟩ = m̄t. As Q̃ is non-singular (and negative definite), equations (11,12) can now readily be solved. Noting that Q|π⟩ = 0 implies |α̃⟩ + Q̃|π̃⟩ = 0, the differential equation (12) integrates to

    |\tilde n\rangle = (I - e^{\tilde Q t}) \, \tilde Q^{-1} |\tilde h\rangle + \langle n \rangle |\tilde\pi\rangle    (13)

where |h̃⟩ = m̄|π̃⟩ − |µ̃⟩.

To compute the second moment (equation 8) and the variance, we must integrate the above expression one more time. We finally obtain

    \mathrm{Var}(n) = \langle n^2 \rangle - \langle n \rangle^2    (14)
                    = \langle n \rangle + 2 \langle \tilde m | \left[ I t + \tilde Q^{-1} (I - e^{\tilde Q t}) \right] \tilde Q^{-1} |\tilde h\rangle

The second term on the r.h.s. of the above equation is the excess variance δV with respect to a Poisson process.

1. Long time behavior.

As all eigenvalues of Q̃ are negative, for large times exp(Q̃t) → 0 and the leading term of the excess variance is therefore

    \delta V = 2 \langle \tilde m | \tilde r \rangle \, t    (15)

where |r̃⟩ is the solution of the linear equation

    \tilde Q |\tilde r\rangle = |\tilde h\rangle    (16)

Returning to the original basis, relation (15) becomes

    \delta V = 2 \langle m | r \rangle \, t    (17)

where ⟨m| = (m_1, m_2, ..., m_K) is the left vector of the leaving rates and |r⟩ is the solution of the linear equations

    Q |r\rangle = |h\rangle    (18)
    \langle 1 | r \rangle = 0    (19)

Here |h⟩ = (h_1, ..., h_K)^T is the vector of weighted deviations of the leaving rates m_i from m̄:

    h_i = (\bar m - m_i) \pi_i

Finally, for large times, the dispersion index is

    R = 1 + \frac{2 \langle m | r \rangle}{\bar m}    (20)
Figure 2. Comparison between the theoretical result (20) and numerical simulations. 3 × 10^5 random 4 × 4 (uniform(0,1)) matrices were generated. For each matrix, a Gillespie algorithm was used to generate 10^6 random paths as a function of time (t_final = 1000), from which the dispersion index was computed. Each dot corresponds to one random matrix. The mean relative error (R_theor − R_sim)/R_theor is 1.3 × 10^{-3}.

Figure 2 shows the agreement between the above theoretical results and stochastic numerical simulations.

Two important consequences should be noted. First, it is not difficult to show that R ≥ 1, which is called the overdispersion of the molecular clock. A demonstration of this theorem for symmetric substitution matrices whose elements are 0 or 1 (adjacency matrices) was given by Raval [12]. We give the demonstration for general time reversible (GTR) substitution matrices in appendix B.

The second consequence of relation (20) is that if all diagonal elements of the substitution matrix are equal (i.e. m_i = m_j ∀ i, j), then the dispersion index is exactly 1 and we recover the property of a normal Poisson process, regardless of the fine structure of Q and of the equilibrium probabilities π_i. This is a sufficient condition. We show that the necessary condition for R = 1 is |h⟩ = |0⟩, which, except for the trivial case where some π_i = 0, again implies the equality of the diagonal elements of Q (see appendix B). For the K80 model proposed by Kimura [17], all diagonal elements of the substitution matrix are equal; hence, relation (20) implies that R = 1.

2. Short time behavior.

For short times, i.e. when the mean number of substitutions is small, we can expand δV given by expression (14) to second order in time:

    \delta V = -t^2 \langle \tilde m | \tilde h \rangle + O(t^3)    (21)
             = t^2 \sum_{i=2}^{K} (m_i - m_1)(m_i - \bar m) \pi_i + O(t^3)    (22)

Note that the above summation is over i = 2 ... K. However, by definition,

    \sum_{i=1}^{K} (\bar m - m_i) \pi_i = 0

and hence the sum in relation (22) can be rearranged as

    \delta V = t^2 \sum_{i=1}^{K} (\bar m - m_i)^2 \pi_i    (23)

The sum, which we will denote by v_m, represents the variance of the diagonal elements of Q, weighted by the equilibrium probabilities. It is more meaningful to express the variance in terms of the mean substitution number. Using relation (9), we therefore have

    \delta V = \frac{v_m}{\bar m^2} \langle n \rangle^2    (24)

The dispersion index for short times is therefore

    R = 1 + \frac{v_m}{\bar m^2} \langle n \rangle

which is the relation (2) given in the introduction.

The dispersion index for all times can also be computed, and an example is given in appendix C. In the next section, we investigate some applications of relation (20).

III. APPLICATION TO SPECIFIC NUCLEOTIDE SUBSTITUTION MODELS.

Nucleotide substitution models are widely used in molecular evolution [1, 2], for example to deduce distances between sequences. Some of these models have few parameters or particular symmetries. For these models, it is worthwhile to express relation (20) for large times in an even more explicit form and to compute the dispersion index as an explicit function of the parameters. We provide below such a computation for some of the most commonly used models.

A. T92 model.

Tamura [18] introduced a two-parameter model (T92) extending the K80 model to take into account biases in G+C content. Solving relation (20) explicitly for this model, we find for the dispersion index

    R = 1 + \frac{2 k^2}{k + 1} \, \frac{\theta (1 - \theta)(2\theta - 1)^2}{1 + 2 k \theta (1 - \theta)}    (25)

Here k = α/β, where α and β are the two parameters of the original T92 model, and θ is the equilibrium G+C content. A similar expression was found by Zheng [8]. For a given k, the maximum value of R is

    R^* = 1 + \frac{(\sqrt{2 + k} - \sqrt{2})^2}{k + 1}
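Expression (25) and the maximum value R* above can be cross-checked against each other numerically; a small sketch (Python with NumPy, using the formulas exactly as given in this section):

```python
import numpy as np

def R_T92(k, theta):
    """Dispersion index of the T92 model, expression (25);
    k = alpha/beta, theta = equilibrium G+C content."""
    return 1.0 + (2.0 * k**2 / (k + 1.0)) * theta * (1 - theta) \
        * (2*theta - 1)**2 / (1 + 2*k*theta*(1 - theta))

def R_star(k):
    """Maximum of R over theta for a given k."""
    return 1.0 + (np.sqrt(2 + k) - np.sqrt(2))**2 / (k + 1)

# The strong asymmetry quoted in the text (k = 18.8, theta = 0.063)
# indeed brings the dispersion index up to R = 1.5 ...
print(round(R_T92(18.8, 0.063), 3))                          # -> 1.5
# ... which matches the maximum over theta predicted by R*.
theta = np.linspace(0.0, 0.5, 50001)
print(abs(R_T92(18.8, theta).max() - R_star(18.8)) < 1e-6)   # -> True
```

The scan over θ also illustrates how narrow the high-R region is: away from the optimum, R quickly falls back toward 1.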
It is straightforward to show that in this case R ∈ [1, 2], although even reaching a value of R = 1.5 necessitates strong asymmetries in the substitution rates (such as k = 18.8 and θ = 0.063).

B. TN93 model.

Tamura and Nei [19] proposed a generalization of the F81 [20] and HKY85 [21] models which allows for biases in the equilibrium probabilities, different rates of transitions vs transversions, and different rates for the two kinds of transitions. The corresponding substitution matrix is

    Q_{TN93} = \mu \begin{pmatrix} * & k_1 \pi_1 & \pi_1 & \pi_1 \\ k_1 \pi_2 & * & \pi_2 & \pi_2 \\ \pi_3 & \pi_3 & * & k_2 \pi_3 \\ \pi_4 & \pi_4 & k_2 \pi_4 & * \end{pmatrix}

where the diagonal elements (∗) are set such that each column sums to zero. The specific case k_1 = k_2 corresponds to the HKY85 model, while k_1 = k_2 = 1 corresponds to the F81 model (also called "equal input"). Solving equation (20) leads to

    R = 1 + \frac{2}{\bar m} \sum_{i,j} C_{ij} (m_i - m_j)^2    (26)

Figure 3. Cumulative histogram of the dispersion index R of the TN93 model and its specific cases. The three-dimensional space |π⟩ = (π_1, π_2, π_3, π_4), Σ_i π_i = 1, is scanned in steps of dπ = 0.025 (≈ 11200 points). For each value of |π⟩, the dispersion index of the corresponding substitution matrix Q_TN93 is computed from relation (26). Black circles: the F81 model (k_1 = k_2 = 1); red diamonds and green left triangles correspond to HKY85 with respectively k_1 = k_2 = 0.1 and 10; blue squares, cyan right triangles and magenta up triangles correspond to the TN93 model with {k_1, k_2} = {0.1, 1}, {1, 10}, {0.1, 10} respectively. Permutations of {k_1, k_2} lead to the same results and are not displayed. For each substitution matrix, it has been checked that solution (26) and the general solution (20) are identical.

For the specific case k_1 = k_2 = 1 (equal input, or F81 model), expression (26) takes a particularly simple form:

    R = 1 + \frac{\sum_{i,j} \pi_i \pi_j (\pi_i - \pi_j)^2}{\sum_{i \neq j} \pi_i \pi_j}    (27)

The lower bound R = 1 is reached for |π⟩ = (1/4)(1, 1, 1, 1)^T.

    R_{TN} = 0.04 + 0.24 k_2 + 6/(6 + k_2) + O(\epsilon)
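Mirroring the check reported in the caption of figure 3, the equal-input result (27) can be compared with the general long-time solution (20); a minimal sketch (Python with NumPy; the values of µ and |π⟩ are arbitrary illustrative choices):

```python
import numpy as np

def R_general(Q):
    """Long-time dispersion index from the general relations (18-20)."""
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    pi = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)[0]
    m = -np.diag(Q)
    mbar = m @ pi
    r = np.linalg.lstsq(A, np.r_[(mbar - m) * pi, 0.0], rcond=None)[0]
    return 1.0 + 2.0 * (m @ r) / mbar

def R_F81(pi):
    """Equal-input (F81) dispersion index, expression (27)."""
    K = len(pi)
    num = sum(pi[i] * pi[j] * (pi[i] - pi[j])**2
              for i in range(K) for j in range(K))
    den = sum(pi[i] * pi[j]
              for i in range(K) for j in range(K) if i != j)
    return 1.0 + num / den

mu, pi = 1.3, np.array([0.1, 0.2, 0.3, 0.4])
Q = mu * np.tile(pi[:, None], (1, 4))     # rate i -> j equals mu*pi_j
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=0))       # F81 substitution matrix

print(abs(R_F81(pi) - R_general(Q)) < 1e-9)   # -> True
```

As expected from (27), the result is independent of the overall rate µ, and R = 1 is recovered for the uniform |π⟩.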