
A simple, general result for the variance of substitution number in molecular evolution. Bahram Houchmandzadeh, Marcel Vallade

To cite this version:

Bahram Houchmandzadeh, Marcel Vallade. A simple, general result for the variance of substitution number in molecular evolution. Molecular Biology and Evolution, Oxford University Press (OUP), 2016, 33 (7), pp.1858. hal-01275060

HAL Id: hal-01275060 https://hal.archives-ouvertes.fr/hal-01275060 Submitted on 16 Feb 2016

A simple, general result for the variance of substitution number in molecular evolution.

Bahram Houchmandzadeh and Marcel Vallade
CNRS, LIPHY, F-38000 Grenoble, France
Univ. Grenoble Alpes, LIPHY, F-38000 Grenoble, France

The number of substitutions (of nucleotides, amino acids, ...) that take place during the evolution of a sequence is a stochastic variable of fundamental importance in the field of molecular evolution. Although the mean number of substitutions during the molecular evolution of a sequence can be estimated for a given substitution model, no simple solution exists for the variance of this random variable. We show in this article that the computation of the variance is as simple as that of the mean number of substitutions, for both short and long times. Apart from its fundamental importance, this result can be used to investigate the dispersion index R, i.e. the ratio of the variance to the mean substitution number, which is of prime importance in the neutral theory of molecular evolution. By investigating large classes of substitution models, we demonstrate that although R ≥ 1, obtaining R significantly larger than unity in general necessitates additional hypotheses on the structure of the substitution model.

I. INTRODUCTION.

Evolution at the molecular level is the process by which random mutations change the content of some sites of a given sequence (of nucleotides, amino acids, ...) during time. The number of substitutions n that occur during this time is of prime importance in the field of molecular evolution, and its characterization is the first step in deciphering the history of evolution and its many branchings. The main observable in molecular evolution, on comparing two sequences, is p̂, the fraction of sites at which the two sequences are different. In order to estimate the statistical moments of n, the usual approach is to postulate a substitution model Q through which p̂ can be related to the statistical moments of n. The simplest and most widely used models assume that Q is site independent, although this constraint can be relaxed [1, 2].

Once a substitution model Q has been specified, it is straightforward to deduce the mean number of substitutions ⟨n⟩, and the process is detailed in many textbooks. However, the mean is only the first step in the characterization of a random variable and by itself is a rather poor indicator. The next step in the investigation of a random variable is to obtain its variance V. Surprisingly, no simple expression for V can be found in the literature for an arbitrary substitution model Q. The first purpose of this article is to overcome this shortcoming. We show that computing V is as simple as computing ⟨n⟩, both for short and long times.

We then apply this fundamental result to the investigation of the dispersion index R, the ratio of the variance to the mean number of substitutions. The neutral theory of molecular evolution introduced by Kimura [3] supposes that the majority of mutations are neutral (i.e. have no effect on the phenotypic fitness) and therefore that substitutions in protein or DNA sequences accumulate at a "constant rate" during evolution, a hypothesis that plays an important role in the foundation of the "molecular clock" [4, 5]. The original neutral theory postulated that the substitution process is Poissonian, i.e. that R = 1. Since the earliest work on the index of dispersion, it became evident however that R is usually much larger than unity (see [6] for a review of data). Many alternatives have been suggested to reconcile the "overdispersion" observation with the neutral theory ([6]). Among these various models, a promising alternative, that of fluctuating neutral space, was suggested by Takahata [7] and has been extensively studied in various frameworks ([8–12]).

The fluctuating neutral space model states that the substitution rate m^i_j from state i to state j is a function of both i and j. States i and j can be nucleotides or amino acids, in which case we recover the usual substitution models of molecular evolution discussed above. The states can also be nodes of a neutral graph used to study global protein evolution ([13–15]). For neutral networks used in the study of protein evolution, Bloom, Raval and Wilke [11] devised an elegant procedure to estimate the substitution rates. We will show in this article that in general R ≥ 1 and the equality is reached only for the most trivial cases. However, producing large R requires additional hypotheses on the structure of the substitution rates.

In summary, the problem we investigate in this article is to find a simple and general solution for the variance and dispersion index of any substitution matrix of dimension K. A substitution matrix Q collects the transition rates m^i_j (i ≠ j); its diagonal elements q^i_i = −m^i are set such that its columns sum to zero (see below for notations) and designate the rate of leaving state i. Because of this condition, Q is singular.

Zheng [8] was the first to use Markov chains to investigate the variance of the substitution number as a solution of a set of differential equations. His investigation was further developed by Bloom, Raval and Wilke [11], who gave the general solution in terms of the spectral decomposition of the substitution matrix; this solution was extended by Raval [12] for a specific class of matrices used for random walks on neutral graphs. Minin and Suchard [16] used the same spectral method to derive an analytical form for the generating function of a binary process.

The first step to characterize the substitution number, as is well known, is to find the equilibrium probabilities π_i of being in a state i, which are obtained by solving the linear system Σ_i q^i_j π_i = 0 with the additional condition Σ_i π_i = 1. Once the π_i are obtained, the mean substitution number as a function of time is simply ⟨n⟩ = m̄ t, where m̄ = Σ_i m^i π_i is the weighted average of the "leaving" rates.

We show here that finding the variance necessitates a similar computation. Denoting the weighted deviation of the diagonal elements of Q from the mean by h_i = (m̄ − m^i) π_i, we have to find the solution of the linear system Σ_i q^i_j r_i = h_j with the additional condition Σ_i r_i = 0. For long times, the dispersion index is then simply

  R = 1 + (2/m̄) Σ_{i=1}^K m^i r_i    (1)

For short times, i.e. when the mean number of substitutions is small, the result is even simpler:

  R = 1 + (v_m/m̄²) ⟨n⟩    (2)

where

  v_m = Σ_{i=1}^K (m̄ − m^i)² π_i

in other words, v_m is the variance of the diagonal elements of the substitution matrix, weighted by the equilibrium probabilities.

This article is organized as follows. In the next section, we use a Markov chain approach to derive relations (1,2) and show their validity by comparing them to results obtained by direct numerical simulations. The simplicity of these results then allows us to study the dispersion index for specific models of nucleotide substitutions widely used in the literature (section III) and for general models (section IV). We investigate in particular the conditions necessary to produce large R. The last section is devoted to a general discussion of these results and to conclusions. Technical details, such as the proof of R ≥ 1, are given in the appendices.

II. MARKOV CHAIN MODEL OF DISPERSION INDEX.

A. Background and definitions.

The problem we investigate in this article is mainly that of counting the transitions of a random variable (figure 1). Consider a random variable X that can occupy K distinct states and let m^i_j (i ≠ j, 1 ≤ i, j ≤ K) be the transition rate from state i to state j.

Figure 1. The random variable X can switch between K states; the counter N of the number of transitions is incremented at each transition of the variable X. The figure above shows one realization of these random variables as a function of time.

The probability density p_i(t) of being in state i at time t is governed by the Master equation

  dp_i/dt = −Σ_j m^i_j p_i + Σ_j m^j_i p_j
          = −m^i p_i + Σ_j m^j_i p_j

where m^i = Σ_j m^i_j is the "leaving" rate from state i. We can collect the p_i into a (column) vector |p⟩ = (p_1, ..., p_K)^T and write the above equations in matrix notation

  d|p⟩/dt = (−D + M) |p⟩ = Q |p⟩    (3)

where D is the diagonal matrix of the m^i and M = (m^i_j) collects the detailed transition rates from state i to state j (i ≠ j) and has zeros on its diagonal. In our notations, the upper (lower) index designates the column (row) of a matrix. The matrix Q = −D + M is called the substitution matrix and its columns sum to zero.
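To make the construction concrete, the matrix Q of relation (3) and its equilibrium vector can be computed numerically. The following is a minimal Python sketch (the function names are ours, not from the paper); it keeps the convention above that column i of Q holds the rates of leaving state i:

```python
import numpy as np

def substitution_matrix(M):
    """Build Q = -D + M from off-diagonal rates M[j, i] = rate i -> j.
    Columns of Q sum to zero; the diagonal holds -m^i (minus the leaving rates)."""
    M = np.asarray(M, dtype=float)
    Q = M - np.diag(np.diag(M))          # enforce a zero diagonal in M
    leaving = Q.sum(axis=0)              # m^i = sum of column i
    return Q - np.diag(leaving)

def equilibrium(Q):
    """Solve Q|pi> = 0 with <1|pi> = 1 (least squares on the stacked system)."""
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    b = np.zeros(K + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

rng = np.random.default_rng(0)
Q = substitution_matrix(rng.uniform(size=(4, 4)))   # a random 4x4 model
pi = equilibrium(Q)
```

The stacked least-squares solve is exact here: Q has rank K − 1 and the normalization row removes the remaining degree of freedom.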

Before proceeding, we explain the notations used in this article. As the matrix Q is not in general symmetric, a clear distinction must be made between right (column) and left (row) vectors. The Dirac notations are standard and useful for handling this distinction: a column vector (x_1, ..., x_K)^T is denoted |x⟩ while a row vector (y^1, ..., y^K) is denoted ⟨y|, and ⟨y|x⟩ = Σ_i y^i x_i is their scalar product. In some of the literature (see [2]), the substitution matrix is the transpose of the matrix used here; the master equation is then written as d⟨p|/dt = ⟨p|Q and the rows of the substitution matrix sum to zero.

By construction, the matrix Q is singular and has one zero eigenvalue while all the others are negative. Therefore, as time flows, |p(t)⟩ → |π⟩, where |π⟩ = (π_1, ..., π_K)^T is the equilibrium occupation probability and the zero-eigenvector of the substitution matrix:

  Q |π⟩ = 0
  ⟨1|π⟩ = 1

where ⟨1| = (1, ..., 1), and the second condition expresses that the sum of the probabilities must be 1. Note that by definition ⟨1| Q = 0, and thus ⟨1| is a zero left eigenvector of the substitution matrix. Let us define the left vector

  ⟨m| = (m^1, ..., m^K)

which collects the leaving rates. By definition, ⟨1| M = ⟨1| D = ⟨m|.

B. Problem formulation.

To count the number of substitutions (figure 1), we consider the probability densities p^n_i(t) of being in state i after n substitutions at time t. These probabilities are governed by the master equation

  dp^n_i/dt = −m^i p^n_i + Σ_j m^j_i p^{n−1}_j    (n > 0)
  dp^0_i/dt = −m^i p^0_i

We can combine the above equations by setting p^n_i(t) = 0 if n < 0. Collecting the elements (p^n_1, p^n_2, ..., p^n_K)^T into the vector |p^n⟩, the above equations can then be written as

  d|p^n⟩/dt = −D |p^n⟩ + M |p^{n−1}⟩    (4)

The quantities of interest for the computation of the dispersion index are the mean and the variance of the number of substitutions. The mean number of substitutions at time t is

  ⟨n(t)⟩ = Σ_{i,n} n p^n_i(t)

Let us define

  n_i(t) = Σ_n n p^n_i(t)

and collect the partial means n_i into the vector |n(t)⟩ = (n_1, ..., n_K)^T. The mean is then simply

  ⟨n(t)⟩ = Σ_i n_i(t) = ⟨1|n(t)⟩

By the same token, the second moment

  ⟨n²(t)⟩ = Σ_{i,n} n² p^n_i(t)

can be written in terms of the partial second moments n²_i = Σ_n n² p^n_i(t) as

  ⟨n²(t)⟩ = ⟨1|n²(t)⟩

where |n²(t)⟩ = (n²_1, ..., n²_K)^T. It is straightforward to show (see appendix A), for the initial condition |p^n(0)⟩ = δ_{n,0} |π⟩, that |n(t)⟩ and |n²(t)⟩ obey linear differential equations

  d|n⟩/dt = Q |n⟩ + D |π⟩    (5)
  d|n²⟩/dt = Q |n²⟩ + 2 M |n⟩ + D |π⟩    (6)

Multiplying eqs (5,6) by the left vector ⟨1|, and noting that ⟨1| Q = ⟨0|, we get simple relations for the moments:

  d⟨n⟩/dt = ⟨m|π⟩    (7)
  d⟨n²⟩/dt = 2 ⟨m|n⟩ + ⟨m|π⟩    (8)

We observe that the mean number of substitutions involves only a trivial integration. Defining the weighted average of the leaving rates as

  m̄ = ⟨m|π⟩ = Σ_i m^i π_i

the mean number of substitutions is simply

  ⟨n(t)⟩ = m̄ t    (9)

To compute the second moment of the substitution number, on the other hand, we must solve for |n⟩ using equation (5) and then perform one integration. The next subsection is devoted to the efficient solution of this procedure.
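Relations (5), (7) and (9) can be checked numerically: integrating d|n⟩/dt = Q|n⟩ + D|π⟩ from |n(0)⟩ = 0 must give ⟨1|n(t)⟩ = m̄t. A small Python sketch (our own construction, using a classical RK4 integrator):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.uniform(size=(4, 4))
np.fill_diagonal(M, 0.0)                 # M[j, i] = rate i -> j, zero diagonal
m = M.sum(axis=0)                        # leaving rates m^i
Q = M - np.diag(m)                       # substitution matrix, columns sum to zero
K = Q.shape[0]

# equilibrium |pi>: Q|pi> = 0 with <1|pi> = 1
A = np.vstack([Q, np.ones(K)])
pi, *_ = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)
mbar = m @ pi                            # weighted average of the leaving rates

# integrate d|n>/dt = Q|n> + D|pi> from |n(0)> = 0 (relation 5) with RK4
D_pi = m * pi
f = lambda n: Q @ n + D_pi
dt, steps = 1e-3, 2000                   # integrate up to t = 2
n = np.zeros(K)
for _ in range(steps):
    k1 = f(n); k2 = f(n + 0.5 * dt * k1)
    k3 = f(n + 0.5 * dt * k2); k4 = f(n + dt * k3)
    n += (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
t_final = steps * dt
mean_n = n.sum()                         # <n(t)> = <1|n(t)>
```

Since ⟨1|Q = 0, the sum ⟨1|n⟩ grows exactly linearly with slope m̄, whatever the fine structure of Q; the integrator only adds rounding-level error.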

C. Solution of the equation for the moments.

One standard way of solving equation (5) would be to express the matrix Q in its eigenbasis; equation (5) is then diagonalized and can be formally solved. This is the method used by Bloom, Raval and Wilke [11] and further refined by Raval [12] for a specific class of substitution matrices where m^i_j = 0 or 1. The first problem with this approach is that there is no guarantee that Q is diagonalizable. Even if Q can be diagonalized, this is not the most efficient procedure to find V, as it necessitates the computation of all eigenvalues and left and right eigenvectors of Q and then the cumbersome summation of their binomial products.

The procedure we follow involves some straightforward, albeit cumbersome, linear algebraic operations, but the end result is quite simple. We note that the matrix Q is singular and has exactly one zero eigenvalue, associated with the left ⟨1| and right |π⟩ eigenvectors. The method we use is to isolate the zero eigenvalue by making a round-trip to a new basis. Thus, if we can find a new basis in which the substitution matrix Q′ = X^{−1} Q X takes a lower block triangular form

  Q′ = ( 0     0 ⋯ 0 )
       ( |α̃⟩   Q̃     )    (10)

we will have achieved our goal of isolating the zero eigenvalue. The non-singular matrix Q̃ is of rank K − 1 and has the non-zero and negative eigenvalues of Q. As ⟨1| is the known left eigenvector of Q, we can split the vector space into B = { |u⟩ | ⟨1|u⟩ = 0 } and the space padded by |π⟩. It is then straightforward to find the transfer matrices X and X^{−1} for such a transformation:

  X = ( 1 −1 −1 ⋯ −1 )        X^{−1} = ( 1 1 1 ⋯ 1 )
      ( 0  1  0 ⋯  0 )                 ( 0 1 0 ⋯ 0 )
      ( 0  0  1 ⋯  0 )                 ( 0 0 1 ⋯ 0 )
      ( ⋮          ⋱ )                 ( ⋮        ⋱ )
      ( 0  0  0 ⋯  1 )                 ( 0 0 0 ⋯ 1 )

Under such a transformation, a right vector |x⟩ = (x_1, x_2, ..., x_K)^T transforms into

  |x′⟩ = X^{−1} |x⟩ = ( Σ_i x_i )
                      (  |x̃⟩   )

where the (K−1)-dimensional vector |x̃⟩ = (x_2, ..., x_K)^T. In general, we will designate by a tilde all vectors that belong to the (K−1)-dimensional space B in which the linear application Q̃ operates.

A left vector ⟨y| = (y^1, y^2, ..., y^K) transforms into

  ⟨y′| = ⟨y| X = (y^1, y^2 − y^1, ..., y^K − y^1) = (y^1, ⟨ỹ|)

where the (K−1)-dimensional left vector ⟨ỹ| = (y^2 − y^1, ..., y^K − y^1). Finally, Q̃^i_j = Q^i_j − Q^1_j, where the elements of Q̃ are indexed from 2 to K.

Expressing now equation (5) for the evolution of the first moments in the new basis, we find that

  d⟨n⟩/dt = m̄    (11)
  d|ñ⟩/dt = Q̃ |ñ⟩ + ⟨n⟩ |α̃⟩ + |µ̃⟩    (12)

where |ñ⟩ = (n_2, ..., n_K)^T, |µ̃⟩ = (m² π_2, ..., m^K π_K)^T and |α̃⟩ is given in relation (10). Equation (11) is the same as equation (7) and implies that ⟨n⟩ = m̄ t. As Q̃ is non-singular (and negative definite), equations (11,12) can now readily be solved. Noting that Q |π⟩ = 0 implies |α̃⟩ + Q̃ |π̃⟩ = 0, the differential equation (12) integrates to

  |ñ⟩ = (I − e^{Q̃t}) Q̃^{−1} |h̃⟩ + ⟨n⟩ |π̃⟩    (13)

where |h̃⟩ = m̄ |π̃⟩ − |µ̃⟩.

To compute the second moment (equation 8) and the variance, we must integrate the above expression one more time. We finally obtain

  Var(n) = ⟨n²⟩ − ⟨n⟩²    (14)
         = ⟨n⟩ + 2 ⟨m̃| ( I t + Q̃^{−1}(I − e^{Q̃t}) ) Q̃^{−1} |h̃⟩

The second term on the r.h.s. of the above equation is the excess variance δV with respect to a Poisson process.

1. Long time behavior.

As all eigenvalues of Q̃ are negative, for large times exp(Q̃t) → 0 and the leading term of the excess variance is therefore

  δV = 2 ⟨m̃|r̃⟩ t    (15)

where |r̃⟩ is the solution of the linear equation

  Q̃ |r̃⟩ = |h̃⟩    (16)

Returning to the original basis, relation (15) becomes

  δV = 2 ⟨m|r⟩ t    (17)

where ⟨m| = (m^1, m^2, ..., m^K) is the left vector of the leaving rates and |r⟩ is the solution of the linear equations

  Q |r⟩ = |h⟩    (18)
  ⟨1|r⟩ = 0    (19)

Here |h⟩ = (h_1, ..., h_K)^T is the vector of the weighted deviations from m̄ of the leaving rates m^i:

  h_i = (m̄ − m^i) π_i

Finally, for large times, the dispersion index is

  R = 1 + 2 ⟨m|r⟩ / m̄    (20)

2. Short time behavior.

For short times, i.e. when the mean number of substitutions is small, we can expand δV given by expression (14) to second order in time:

  δV = −t² ⟨m̃|h̃⟩ + O(t³)    (21)
     = −t² Σ_{i=2}^K (m^i − m^1)(m̄ − m^i) π_i + O(t³)    (22)

Note that the above summation is over i = 2 ⋯ K. However, by definition,

  Σ_{i=1}^K (m̄ − m^i) π_i = 0

and hence the sum in relation (22) can be rearranged as

  δV = t² Σ_{i=1}^K (m̄ − m^i)² π_i    (23)

The sum, which we will denote by v_m, represents the variance of the diagonal elements of Q, weighted by the equilibrium probabilities. It is more meaningful to express the variance in terms of the mean substitution number. Using relation (9), we therefore have

  δV = (v_m / m̄²) ⟨n⟩²    (24)

The dispersion index for short times is therefore

  R = 1 + (v_m / m̄²) ⟨n⟩

which is the relation (2) given in the introduction.

The dispersion index for all times can also be computed, and an example is given in appendix C. In the next section, we investigate some applications of the relation (20).

Figure 2 shows the agreement between the above theoretical results and stochastic numerical simulations.

Figure 2. Comparison between the theoretical result (20) and numerical simulations. 3 × 10^5 random (uniform(0,1)) 4 × 4 matrices were generated. For each matrix, a Gillespie algorithm was used to generate 10^6 random paths as a function of time (t_final = 1000), from which the dispersion index was computed. In the above figure, each dot corresponds to one matrix. The mean relative error (R_theor − R_sim)/R_theor is 1.3 × 10^−3.

Two important consequences should be noted. First, it is not difficult to show that R ≥ 1, which is called the overdispersion of the molecular clock. A demonstration of this theorem for symmetric substitution matrices whose elements are 0 or 1 (adjacency matrices) was given by Raval [12]. We give the demonstration for general time reversible (GTR) substitution matrices in appendix B.

The second consequence of relation (20) is that if all diagonal elements of the substitution matrix are equal (i.e. m^i = m^j ∀ i, j), then the dispersion index is exactly 1 and we recover the property of a normal Poisson process, regardless of the fine structure of Q and of the equilibrium probabilities π_i. This is a sufficient condition. We show that the necessary condition for R = 1 is |h⟩ = |0⟩, which, except for the trivial case where some π_i = 0, again implies the equality of the diagonal elements of Q (see appendix B).

III. APPLICATION TO SPECIFIC NUCLEOTIDE SUBSTITUTION MODELS.

Nucleotide substitution models are widely used in molecular evolution [1, 2], for example to deduce distances between sequences. Some of these models have few parameters or particular symmetries. For these models, it is worthwhile to express relation (20) for large times in an even more explicit form and compute the dispersion index as an explicit function of the parameters. We provide below such a computation for some of the most commonly used models.

For the K80 model proposed by Kimura [17], all diagonal elements of the substitution matrix are equal; hence, relation (20) implies that R = 1.

A. T92 model.

Tamura [18] introduced a two-parameter model (T92) extending the K80 model to take into account biases in G+C content. Solving relation (20) explicitly for this model, we find for the dispersion index

  R = 1 + [2k² θ(1−θ)(2θ−1)²] / [(k+1)(1 + 2kθ(1−θ))]    (25)

Here k = α/β, where α and β are the two parameters of the original T92 model, and θ denotes the G+C content. A similar expression was found by Zheng [8]. For a given k, the maximum value of R is

  R* = 1 + (√(2+k) − √2)² / (k+1)

and it is straightforward to show that in this case

  R ∈ [1, 2]

although even reaching a value of R = 1.5 necessitates strong asymmetries in the substitution rates (such as k = 18.8 and θ = 0.063).
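The closed form (25) can be checked against the general solution (20). The sketch below uses our own reconstruction of the T92 rates (states ordered T, C, A, G; equilibrium frequencies π_G = π_C = θ/2 and π_A = π_T = (1−θ)/2; transitions T↔C and A↔G at rate α·π_target, transversions at rate β·π_target); with this parameterization the two expressions agree to machine precision:

```python
import numpy as np

def dispersion_index_long_time(Q):
    """R = 1 + 2<m|r>/mbar (relation 20), via two least-squares solves."""
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    pi, *_ = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)
    m = -np.diag(Q)
    mbar = m @ pi
    r, *_ = np.linalg.lstsq(A, np.r_[(mbar - m) * pi, 0.0], rcond=None)
    return 1.0 + 2.0 * (m @ r) / mbar

def t92_matrix(alpha, beta, theta):
    """T92 substitution matrix, states (T, C, A, G); transitions are T<->C, A<->G."""
    pi = np.array([(1 - theta) / 2, theta / 2, (1 - theta) / 2, theta / 2])
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}
    M = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                rate = alpha if (i, j) in transitions else beta
                M[j, i] = rate * pi[j]        # rate i -> j, stored at [row j, col i]
    return M - np.diag(M.sum(axis=0))

k, theta = 18.8, 0.063                        # the strongly asymmetric example of the text
R_general = dispersion_index_long_time(t92_matrix(k, 1.0, theta))
R_closed = 1 + 2 * k**2 * theta * (1 - theta) * (2 * theta - 1)**2 \
               / ((k + 1) * (1 + 2 * k * theta * (1 - theta)))
```

At θ = 1/2 all diagonal elements of Q become equal and both expressions give R = 1, as expected.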

B. TN93 model.

Tamura and Nei [19] proposed a generalization of the F81 [20] and HKY85 [21] models which allows for biases in the equilibrium probabilities, different rates of transitions vs transversions, and different rates for the two kinds of transitions. The corresponding substitution matrix is

  Q_TN93 = µ ( *      k₁π₁   π₁     π₁   )
             ( k₁π₂   *      π₂     π₂   )
             ( π₃     π₃     *      k₂π₃ )
             ( π₄     π₄     k₂π₄   *    )

where the diagonal elements (*) are set such that each column sums to zero. The specific case k₁ = k₂ corresponds to the HKY85 model, while k₁ = k₂ = 1 corresponds to that of F81 (also called the "equal input" model). Solving equation (20) leads to

  R = 1 + (2/m̄²) Σ_{i<j} C_ij (m^i − m^j)²    (26)

where the C_ij are coefficients that depend on the parameters of the model. The lower bound is reached for |π⟩ = (1/4)(1, 1, 1, 1)^T; […] R_TN = 0.04 + 0.24 k₂ + 6/(6 + k₂) + O(ǫ).

For the specific case k₁ = k₂ = 1 (equal input or F81 model), expression (26) takes a particularly simple form:

  R = 1 + [ Σ_{i<j} π_i π_j (π_i − π_j)² ] / [ Σ_{i<j} π_i π_j ]    (27)

where the summations run over the pairs of states and Σ_i π_i = 1. As every term of the first sum in relation (27) is smaller than the corresponding term in the second sum,

  R_F81 ∈ [1, 2]

The results are displayed in figure 3; obtaining large values of R such as R > 1.5 necessitates high asymmetries in the transition rates and/or strong biases in the equilibrium probabilities of states.

Figure 3. Cumulative histogram of the dispersion index R of the TN93 model and its specific cases. The three-dimensional space |π⟩ = (π₁, π₂, π₃, π₄), Σ_i π_i = 1, is scanned by steps of dπ = 0.025 (≈ 11200 points). For each value of |π⟩, the dispersion index of the corresponding substitution matrix Q_TN93 is computed from relation (26). Black circles: the F81 model (k₁ = k₂ = 1); red diamonds and green left triangles correspond to HKY85 with respectively k₁ = k₂ = 0.1 and 10; blue squares, cyan right triangles and magenta up triangles correspond to the TN93 model with {k₁, k₂} = {0.1, 1}, {1, 10}, {0.1, 10} respectively. Permutations of {k₁, k₂} lead to the same results and are not displayed. For each substitution matrix, it has been checked that the solution (26) and the general solution (20) are identical.
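Relation (27) is easy to verify numerically against the general solution (20). A Python sketch (the helper is our own; the F81 rates are i → j at rate µπ_j):

```python
import numpy as np

def dispersion_index_long_time(Q):
    """R = 1 + 2<m|r>/mbar (relation 20), via two least-squares solves."""
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    pi, *_ = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)
    m = -np.diag(Q)
    mbar = m @ pi
    r, *_ = np.linalg.lstsq(A, np.r_[(mbar - m) * pi, 0.0], rcond=None)
    return 1.0 + 2.0 * (m @ r) / mbar

rng = np.random.default_rng(3)
pi = rng.uniform(size=4)
pi /= pi.sum()                                  # arbitrary equilibrium probabilities
mu = 1.0
M = mu * np.tile(pi[:, None], (1, 4))           # rate i -> j is mu * pi_j, at [row j, col i]
np.fill_diagonal(M, 0.0)
Q = M - np.diag(M.sum(axis=0))                  # F81 ("equal input") substitution matrix

R_general = dispersion_index_long_time(Q)
num = sum(pi[i] * pi[j] * (pi[i] - pi[j]) ** 2 for i in range(4) for j in range(i + 1, 4))
den = sum(pi[i] * pi[j] for i in range(4) for j in range(i + 1, 4))
R_closed = 1.0 + num / den                      # relation (27)
```

For F81, Q = µ(|π⟩⟨1| − I) on the space ⟨1|r⟩ = 0, so the linear solve can even be done by hand; the numerical check confirms the closed form.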

IV. STATISTICAL INVESTIGATION OF THE DISPERSION INDEX AND THE INFLUENCE OF SPARSENESS.

The relation (20) can be solved explicitly for general substitution matrices. However, a general substitution matrix of dimension 4 has 11 free parameters (substitution matrices are defined up to a scaling parameter); the explicit solution of (20) as a function of the substitution matrix parameters is rather cumbersome and does not provide insightful information.

An exchangeable (time reversible, GTR) substitution matrix has the additional constraint [22] m^i_j π_i = m^j_i π_j. Considering only exchangeable matrices reduces the number of free parameters to 9, but the parameter space is still too large to be explored systematically.

We can however sample the parameter space by generating a statistically significant ensemble of substitution matrices Q and get an estimate of the probability distribution of the dispersion index R. The simplicity of relation (20) allows us to generate 10^7 random matrices for each class (see section VI) and compute their associated R in a few minutes on a normal computer: depending on the dimension of Q (from 4 to 20), this computation takes between 2 and 10 minutes.

Figure 4.a shows the cumulative probability P(R) for both arbitrary (R) and GTR (G) matrices, computed from 10^7 matrices in each case. We observe that arbitrary matrices produce statistically low dispersion indices: P(R > 1.5) = 0.08 and P(R > 3) = 2.7 × 10^−3. The GTR matrices have statistically higher dispersion indices: P(R > 1.5) = 0.495 and P(R > 3) = 0.048. Still, values larger than R = 5, as have been reported in the literature [6], have a very low probability (1.4 × 10^−4 for random matrices and 7 × 10^−3 for GTR matrices).

Figure 4. Statistical study of 4 × 4 substitution matrices for (i) GTR matrices ("G", black curves); (ii) arbitrary matrices ("R", red curves); and (iii) sparse GTR matrices ("S", blue curves). In each case, 10^7 matrices are generated and for each matrix its dispersion index R and its eccentricity e = min(π_i) are computed, where π_i is the equilibrium probability of state i. In each case, the data are sorted by R value to compute the cumulative histogram (a). Figure (b) shows the relation between e and R. To make visible the statistical relation between e and R, a moving average of size 1000 data points is applied to the 10^7 sorted (R, e) data in each data set and the result (R_m, e_m) is displayed in the lower plot (b).

We observed in the preceding section that high values of R are generally associated with large biases in the equilibrium probabilities, i.e. a given state must have a very low equilibrium probability in order to allow for large R. We can investigate how this observation holds for general and GTR matrices. For each K × K matrix Q that is generated, we quantify its relative eccentricity by

  e = K min_i(π_i)

The relation between R and e is statistical: matrices with dispersion index in [R, R + dR] will have a range of e, and we display the average e for each small interval (figure 4.b). We observe again that high values of the dispersion index in each class of matrices require a high bias in the equilibrium probabilities of states.

Another effect that can increase the dispersion index of a matrix is its sparseness. This effect was investigated by Raval [12] for a random walk on neutral networks, which we generalize here. Until now, we have examined fully connected graphs, i.e. substitution processes where the random variable X can jump from any state i to any other state j. For a 4-state random variable, each node of the connectivity graph is then of degree 3 (d_G = 3). This assumption may however be too restrictive. Consider for example a 4 × 4 nucleotide substitution matrix for synonymous substitutions. Depending on the identity of the codon to which it belongs, a nucleotide can only mutate to a subset of other nucleotides. For example, at the third codon position of Tyrosine, only T ↔ C transitions are allowed, while at the third codon position of Alanine, all substitutions are synonymous. For a given protein sequence, the mean nucleotide synonymous substitution graph is therefore of degree smaller than 3. In general, the degree of each state (node) i is given by the number (minus one) of non-zero elements of the i-th column of the associated substitution matrix.

We can investigate the effect of the sparseness of substitution matrices on the dispersion index with the formalism developed above.
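The sampling of exchangeable matrices used here (detailed in section VI via the factorization Q = SΠ^{−1}) can be sketched in a few lines; on such a sample one can also check in passing the overdispersion bound R ≥ 1 discussed above (a small ensemble instead of 10^7, for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_gtr(K):
    """Random exchangeable matrix via Q = S Pi^{-1}: S symmetric with uniform(0,1)
    off-diagonal entries, Pi the diagonal matrix of a random equilibrium vector."""
    S = rng.uniform(size=(K, K))
    S = (S + S.T) / 2.0
    np.fill_diagonal(S, 0.0)
    pi = rng.uniform(size=K)
    pi /= pi.sum()
    M = S / pi[None, :]                  # rate i -> j is S_ij / pi_i
    return M - np.diag(M.sum(axis=0)), pi

def dispersion_index_long_time(Q):
    K = Q.shape[0]
    A = np.vstack([Q, np.ones(K)])
    pi, *_ = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)
    m = -np.diag(Q)
    mbar = m @ pi
    r, *_ = np.linalg.lstsq(A, np.r_[(mbar - m) * pi, 0.0], rcond=None)
    return 1.0 + 2.0 * (m @ r) / mbar

samples = [random_gtr(4) for _ in range(500)]
Rs = np.array([dispersion_index_long_time(Q) for Q, _ in samples])
balanced = all(np.allclose(Q @ pi, 0.0) for Q, pi in samples)   # pi is the equilibrium
```

The symmetry of S guarantees detailed balance, m^i_j π_i = m^j_i π_j, so the drawn vector π is automatically the equilibrium distribution of Q.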

Figure 4 shows the probability distribution of R for GTR matrices with d_G = 2. As can be observed, the dispersion index distribution for sparse GTR matrices is shifted to higher values and P(R > 5) increases six-fold, from 0.007 (for d_G = 3) to 0.044 (for d_G = 2).

The effect of sparseness can be investigated better by considering higher dimensional substitution matrices. Consider a random variable X that can take 16 different values. Figure 5 shows the effect of the sparseness of Q on the distribution of the dispersion index. We have considered GTR matrices in three cases: (i) fully connected transition graphs (d_G = 15); (ii) regular graphs of degree 4, where two different states i, j = 1...16 are connected if their binary representations are one mutation apart; and (iii) regular graphs of degree 2. For each class, 10^7 matrices are generated. As can be observed, the sparseness shifts the dispersion index distribution to the right: the median in the three cases is respectively 1.65, 2.08 and 6.15.

Figure 5. Effect of sparseness of 16 × 16 GTR substitution matrices, for three different connectivities. For each class, 10^7 random matrices are generated and their statistical properties displayed. Black lines (marked F): fully connected graphs, d_G = 15; blue lines (marked Sp4): d_G = 4; red lines (marked Sp2): d_G = 2. (a) The cumulative probability P(R) for each class. (b) The mean eccentricity as a function of the dispersion index for each class.

A more insightful model is that of 20 × 20 GTR matrices for amino acid substitutions. We compare the case of fully connected graphs (F), where any amino acid can replace any other one, to the case where only amino acids one nucleotide mutation apart can replace each other (non-synonymous substitutions, NS). The average degree of the graph in the latter case is d̄_G = 7.5. As before, we generate 10^7 random matrices in each class and compute their statistical properties. We observe again (figure 6) that the distribution of R is shifted to the right for the NS graphs, where the median is R_NS = 2.48, compared to R_F = 1.66 for fully connected graphs.

Figure 6. Cumulative probability of the dispersion index for fully connected (black, F) and non-synonymous (blue, NS) 20 × 20 substitution matrices. For the fully connected matrices, any amino acid can be replaced by any other. For the NS matrices, only amino acids one nucleotide mutation apart can replace each other. For each case, 10^7 20 × 20 GTR matrices are generated and their dispersion index R computed. For the NS matrices, transitions are weighted by the number of nucleotide substitutions that can lead from one amino acid to another: there are for example 6 nucleotide mutations that transform a Phenylalanine into a Leucine, but only one mutation that transforms a Lysine into an Isoleucine.

For specific amino acid substitution matrices used in the literature such as WAG [23], LG [24] and IDR [25], the index of dispersion is 1.253, 1.196 and 1.242 respectively.

V. DISCUSSION AND CONCLUSION.

The substitution process (of nucleotides, amino acids, ...) and therefore the number of substitutions n that take place during a time t are stochastic. One of the most fundamental tasks in molecular evolutionary investigation is to characterize the random variable n from molecular data.

A given model for the substitution process, in the form of a substitution matrix Q, enables us to estimate the mean number of substitutions ⟨n⟩ that occur during a time t. The mean depends only on the diagonal elements of Q and the equilibrium probabilities of states π_i:

  ⟨n(t)⟩ = −t Σ_i Q^i_i π_i = −t tr(QΠ)    (29)

where tr() designates the trace operator and Π is the diagonal matrix of the equilibrium probabilities.
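Relation (29) is also easy to confirm by direct stochastic simulation with the Gillespie algorithm mentioned in section VI. A Python sketch (our own small ensemble of paths started at equilibrium; the tolerance on the sample mean is several standard errors wide):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.uniform(size=(4, 4))
np.fill_diagonal(M, 0.0)                 # rates i -> j stored at [row j, col i]
m = M.sum(axis=0)                        # leaving rates m^i
Q = M - np.diag(m)
K = 4
A = np.vstack([Q, np.ones(K)])
pi, *_ = np.linalg.lstsq(A, np.r_[np.zeros(K), 1.0], rcond=None)

rate = -np.trace(Q @ np.diag(pi))        # -tr(Q Pi): expected substitutions per unit time

T, paths = 5.0, 5000
counts = np.empty(paths)
for p in range(paths):
    i = rng.choice(K, p=pi)              # start from the equilibrium distribution
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1.0 / m[i])         # waiting time in state i
        if t > T:
            break
        i = rng.choice(K, p=M[:, i] / m[i])      # jump i -> j with prob m^i_j / m^i
        n += 1
    counts[p] = n
mean_sim = counts.mean()
```

Note that −tr(QΠ) is nothing but m̄ = ⟨m|π⟩, so this is the same linear growth as relation (9), now estimated from simulated paths.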

In molecular evolution, the main observable is the probability p_d(t) that two sequences differ at a given site. Denoting U(t) = exp(tQ), and assuming that both sequences are at equilibrium [2],

    p_d(t) = 1 - \sum_i U_{ii}\,\pi_i = 1 - \mathrm{tr}(U\Pi)    (30)

One can estimate p_d(t) from the fraction p̂ of observed differences between two sequences. By eliminating time in relations (29,30), it is then possible to relate the estimator d̂ (of ⟨n⟩) to p̂:

    \hat{d} = f(\hat{p})    (31)

For sequences of length L, p̂ follows a binomial distribution B(L,p), and the variance of the distance estimator d̂ can be deduced from relation (31). This quantity, however, is very different from the intrinsic variance of the substitution number.

The mean substitution number, its estimator d̂ and the variance of this estimator are only the first step in characterizing a random variable. The next crucial step is to evaluate the variance V of this number. What we have achieved in this article is to find a simple expression for V. In particular, we have shown that for both short and long time scales, V can easily be deduced from Q. For long times, the procedure is similar to deriving the equilibrium probabilities π_i from Q, i.e. we only need to solve a linear equation associated with Q (relation 17). For short times, only the diagonal elements of Q are required to compute V (relation 23).

A long-standing debate in the neutral theory of evolution concerns the value of the dispersion index R = V/⟨n⟩. On the one hand, the exact solution of this paper demonstrates that in general, any substitution process given by a matrix Q is overdispersed, i.e. R ≥ 1, and equality is observed only for trivial models where all diagonal elements of Q are equal. On the other hand, a comprehensive investigation of various substitution models (sections III, IV) shows that models producing R much larger than ≈2 generally require strong biases in the equilibrium probabilities of the states. One possibility to produce a higher dispersion index is sparse matrices, in which the ensemble of possible transitions has been reduced.

The substitution models we have considered here can be applied to sequences where the Q matrix is the same for all sites. For nucleotide sequences, this can describe the evolution of non-coding sequences or synonymous substitutions (by taking into account the sparseness of the matrix). For amino acid substitutions, on the other hand, it is well known that some sites are nearly constant or evolve at a slower pace than others. A natural extension of the present work would be to take into account substitution matrices that vary among sites, drawing for example their scaling factor from a Gamma distribution [2]. The formalism we have developed in this article can be readily adapted to such an extension, and the variance of the substitution number can be computed for variable substitution matrices.

VI. METHODS.

Numerical simulations of the stochastic equations use the Gillespie algorithm [26] and are written in the C++ language. To compute the dispersion index of a given matrix Q, we generate 10^6 random paths over a time period of 1000. To compare the analytical solutions given in this article to stochastic simulations (figure 2), we generated 3.6 × 10^5 random matrices and numerically computed their dispersion index by the above method. This numerical simulation took approximately 10 days on a 60-core cluster.

All linear algebra computations and all data processing were performed with the high-level Julia language [27]. Computing the analytical dispersion index for the above 3.6 × 10^5 random matrices took about 5 seconds on a normal desktop computer, using only one core.

To generate random GTR matrices, we use the factorization Q = S\Pi^{-1} (see appendix B), which allows for independent generation of the K(K-1)/2 elements of S and the K-1 elements of Π. For arbitrary matrices, we draw the K(K-1) off-diagonal elements of the matrix. All random generators used in this work are uniform(0,1).
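The long-time recipe above, solving a single linear system built from Q, is straightforward to make concrete. The sketch below is ours, not the authors' Julia code: it assumes Q is given as a list of rows whose columns sum to zero (the transposed convention of this paper), takes m_i = -Q_ii and h_i = (m̄ - m_i)π_i, solves Q|r⟩ = |h⟩ under the constraint ⟨1|r⟩ = 0, and returns R = 1 + 2⟨m|r⟩/m̄ (relations 17 and 20).

```python
# Sketch (ours, not the authors' code) of the long-time dispersion index
# R = 1 + 2<m|r>/m_bar for a substitution matrix Q whose columns sum to zero.
# Pure-Python Gaussian elimination stands in for a linear-algebra library.

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def equilibrium(Q):
    """pi with Q|pi> = 0 and <1|pi> = 1: trade one redundant row of the
    singular system for the normalization condition."""
    K = len(Q)
    return solve([Q[i][:] for i in range(K - 1)] + [[1.0] * K],
                 [0.0] * (K - 1) + [1.0])

def dispersion_index(Q):
    """R = 1 + 2<m|r>/m_bar, where Q|r> = |h> and <1|r> = 0."""
    K = len(Q)
    pi = equilibrium(Q)
    m = [-Q[i][i] for i in range(K)]            # substitution rate out of state i
    mbar = sum(mi * p for mi, p in zip(m, pi))  # mean substitution rate
    h = [(mbar - m[i]) * pi[i] for i in range(K)]
    # Q is singular, but <1|h> = 0 guarantees solvability, so one row of
    # Q|r> = |h> may be replaced by the constraint <1|r> = 0.
    r = solve([Q[i][:] for i in range(K - 1)] + [[1.0] * K],
              h[:K - 1] + [0.0])
    return 1.0 + 2.0 * sum(mi * ri for mi, ri in zip(m, r)) / mbar
```

For a two-state matrix with equilibrium (π, 1−π), this yields R = 1 + (2π−1)²: R = 1 exactly when the two diagonal elements of Q are equal, and R approaches 2 as the equilibrium bias becomes extreme, in line with the overdispersion proof of appendix B.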

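The stochastic estimate of R described in the Methods can be sketched in a few lines. This is a minimal pure-Python stand-in for the C++ Gillespie code (the path count and time horizon are scaled down here, and all function names are ours):

```python
import random

def substitution_count(Q, pi, T, rng):
    """Substitution count of one Gillespie realization on [0, T], started at
    equilibrium. Q is a list of rows whose columns sum to zero (this paper's
    transposed convention), so -Q[j][j] is the total rate out of state j."""
    K = len(Q)
    u, state = rng.random(), 0          # draw the initial state from pi
    while u > pi[state]:
        u -= pi[state]
        state += 1
    t, n = 0.0, 0
    while True:
        rate = -Q[state][state]
        t += rng.expovariate(rate)      # waiting time to the next substitution
        if t > T:
            return n
        u = rng.random() * rate         # pick target i with prob. Q[i][state]/rate
        target = state
        for i in range(K):
            if i == state:
                continue
            u -= Q[i][state]
            if u <= 0.0:
                target = i
                break
        if target != state:             # guard against 1-ulp rounding leftovers
            state = target
            n += 1

def dispersion_from_paths(Q, pi, T, paths, seed=0):
    """Sample mean substitution number and dispersion index R = Var(n)/<n>."""
    rng = random.Random(seed)
    ns = [substitution_count(Q, pi, T, rng) for _ in range(paths)]
    mean = sum(ns) / paths
    var = sum((x - mean) ** 2 for x in ns) / (paths - 1)
    return mean, var / mean
```

Comparing the output of this estimator with the linear-algebra value of R is the consistency check performed (at much larger scale) in the simulations of this article.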

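For the equal input model (F81), the dispersion index at all times reduces to an elementary function of ⟨n⟩ (appendix C). A small helper, with hypothetical names and µ set to 1, evaluates R(⟨n⟩) = 1 + (2v_m/⟨n⟩)(⟨n⟩/m̄ + e^{-⟨n⟩/m̄} - 1), with v_m = Σπ_i³ - (Σπ_i²)² and m̄ = 1 - Σπ_i²:

```python
import math

def f81_dispersion(n_mean, pi):
    """R(<n>) for the equal-input (F81) model with mu = 1
    (our helper; see appendix C)."""
    s2 = sum(p * p for p in pi)
    s3 = sum(p ** 3 for p in pi)
    vm = s3 - s2 * s2   # variance of the leaving rates m_i = 1 - pi_i over pi
    mbar = 1.0 - s2     # mean substitution rate
    x = n_mean / mbar   # elapsed time t = <n>/m_bar
    return 1.0 + 2.0 * vm * (x + math.exp(-x) - 1.0) / n_mean
```

For uniform π, v_m = 0 and R ≡ 1 (the Jukes–Cantor limit); for biased π, R increases monotonically from 1 at short times toward 1 + 2v_m/m̄ at long times.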
Acknowledgments. We are grateful to Erik Geissler, Olivier Rivoire and Ivan Junier for fruitful discussions and critical reading of the manuscript.

Appendix A: Mean and variance equation.

Obtaining equations for the moments such as (7,8) from the master equation is a standard procedure of stochastic processes [28, 29]. We give here the outline of the derivation. Consider the master equation (4),

    \frac{d}{dt}|p^n\rangle = -D|p^n\rangle + M|p^{n-1}\rangle    (A1)

which is a system of K equations for the p_i^n(t), written in vectorial form. Multiplying each row by n and summing over all n leads, in vectorial form, to

    \frac{d}{dt}\sum_n n|p^n\rangle = -D\sum_n n|p^n\rangle + M\sum_n n|p^{n-1}\rangle

The term \sum_n n|p^n\rangle was defined as the vector of partial means |n⟩. For the second term, we have

    \sum_n n|p^{n-1}\rangle = \sum_n (n+1)|p^n\rangle = |n\rangle + |p\rangle    (A2)

where |p⟩ = \sum_n |p^n\rangle is the probability density of the random variable X, whose dynamics is given by relation (3). For the initial condition |p(0)⟩ = |π⟩, we have at all times

    |p(t)\rangle = |\pi\rangle

so the moment equation is

    \frac{d}{dt}|n\rangle = (-D + M)|n\rangle + M|\pi\rangle

which is relation (5). Note that by definition, M|π⟩ = D|π⟩.

The equation for the second moment (8) is obtained by the same procedure, where each row of equation (A1) is multiplied by n^2. Higher moments and the probability generating function equations can be obtained by similar computations.

Appendix B: Proof of overdispersion for GTR substitution matrices.

As we have seen in relation (20), the dispersion index is

    R = 1 + 2\,\frac{\langle m|r\rangle}{\bar{m}}

We must demonstrate that ⟨m|r⟩ ≥ 0 to prove that R ≥ 1. We give here the proof for GTR matrices. These matrices can be factorized into

    Q = S\,\Pi^{-1}    (B1)

where Π = diag(π_1, ..., π_K) and S is a symmetric matrix with positive non-diagonal elements whose columns (and rows) sum to zero. Note that in the literature [2], a slightly different factorization is used, of the form Q = ΠF, where F is a symmetric matrix (we stress again that in our notation, the substitution matrix is the transpose of that used in most of the literature). The advantage of the factorization (B1) is that except for one zero eigenvalue, all other eigenvalues of S are negative. S can therefore be written as

    S = \sum_{i=2}^{K} \lambda_i\, |v_i\rangle\langle v_i|    (B2)

where |v_i⟩ and ⟨v_i| are the right and left orthonormal eigenvectors of S associated with the eigenvalue λ_i. The pseudo-inverse of S is defined as

    \tilde{S}^{-1} = \sum_{i=2}^{K} \lambda_i^{-1}\, |v_i\rangle\langle v_i|    (B3)

and it is strictly negative definite.

The vector |r⟩ is the solution of the linear equation

    Q|r\rangle = |h\rangle    (B4)
    \langle 1|r\rangle = 0    (B5)

where h_i = (m̄ - m_i)π_i. The general solution of the underdetermined equation (B4) is therefore

    |r\rangle = C|\pi\rangle + \Pi\tilde{S}^{-1}|h\rangle

where the constant C is determined from condition (B5). On the other hand,

    \langle m| = \langle m| - \bar{m}\langle 1| + \bar{m}\langle 1| = -\langle h|\Pi^{-1} + \bar{m}\langle 1|

and thus

    \langle m|r\rangle = -\langle h|\Pi^{-1}|r\rangle + \bar{m}\langle 1|r\rangle
                      = -\langle h|\tilde{S}^{-1}|h\rangle - C\langle h|1\rangle
                      = -\langle h|\tilde{S}^{-1}|h\rangle    (B6)

where we have used the fact that ⟨1|r⟩ = ⟨1|h⟩ = 0. As S̃^{-1} is negative definite, for long times

    \langle m|r\rangle \ge 0

Moreover, equality is reached (for π_i ≠ 0) only when |h⟩ = 0, i.e. only when all diagonal elements of Q are equal. To see this, we can expand relation (B6):

    \langle m|r\rangle = -\sum_{i=2}^{K} \lambda_i^{-1}\, \langle v_i|h\rangle^2

The only way to obtain ⟨m|r⟩ = 0 is to have ⟨v_i|h⟩ = 0 for i = 2..K. As, on the other hand, ⟨v_1|h⟩ = ⟨1|h⟩ = 0, we must have |h⟩ = 0.

Appendix C: Dispersion index for all times.

In subsection II C we gave the long-time (eq. 20) and short-time (eq. 24) solutions for the variance. For all the specific models used in the literature (section III), the variance at all times can also be determined explicitly through relation (14):

    \delta V = 2\,\langle \tilde{m}|\left[ I t + \tilde{Q}^{-1}\left(I - e^{\tilde{Q}t}\right) \right]\tilde{Q}^{-1}|\tilde{h}\rangle    (C1)

The procedure requires the computation of exp(Q̃t) and is analogous to the determination of ⟨n⟩ from sequence dissimilarities [2, 8].

As an example, consider the equal input model (F81), which we studied in subsection III B. For this model, the reduced matrix is simply

    \tilde{Q} = -\mu I_3

where I_3 is the 3 × 3 identity matrix, and therefore exp(Q̃t) = e^{-\mu t} I_3. Relation (C1) then becomes

    \delta V = -\frac{2}{\mu^2}\left(-1 + \mu t + e^{-\mu t}\right)\langle \tilde{m}|\tilde{h}\rangle

We have previously shown (eq. 23) that generally

    \langle \tilde{m}|\tilde{h}\rangle = -\sum_{i=1}^{K} (\bar{m} - m_i)^2\, \pi_i = -v_m

and for the F81 model,

    \mu^{-2} v_m = \sum_{i=1}^{K} \pi_i^3 - \left(\sum_{i=1}^{K} \pi_i^2\right)^2

The time, however, can be expressed as a function of the mean substitution number. Finally, for the F81 model, setting µ = 1 without loss of generality, the dispersion index for all times is

    R(\langle n\rangle) = 1 + \frac{2 v_m}{\langle n\rangle}\left( \frac{\langle n\rangle}{\bar{m}} + e^{-\langle n\rangle/\bar{m}} - 1 \right)

[1] Dan Graur and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Sinauer Associates, 1999.
[2] Ziheng Yang. Computational Molecular Evolution. Oxford Series in Ecology and Evolution, 2006.
[3] Motoo Kimura. The Neutral Theory of Molecular Evolution. Cambridge University Press, 1984.
[4] Lindell Bromham and David Penny. The modern molecular clock. Nature Reviews Genetics, 4(3):216–224, mar 2003.
[5] Simon Y. W. Ho and Sebastián Duchêne. Molecular-clock methods for estimating evolutionary rates and timescales. Molecular Ecology, 23(24):5947–5965, dec 2014.
[6] D. J. Cutler. The index of dispersion of molecular evolution: slow fluctuations. Theoretical Population Biology, 57(2):177–186, mar 2000.
[7] Naoyuki Takahata. Statistical models of the overdispersed molecular clock. Theoretical Population Biology, 39(3):329–344, jun 1991.
[8] Qi Zheng. On the dispersion index of a Markovian molecular clock. Mathematical Biosciences, 172(2):115–128, aug 2001.
[9] Ugo Bastolla, Markus Porto, H. Eduardo Roman, and Michele Vendruscolo. Lack of self-averaging in neutral evolution of proteins. Physical Review Letters, 89(20):208101, nov 2002.
[10] Claus O. Wilke. Molecular clock in neutral protein evolution. BMC Genetics, 5(1):25, aug 2004.
[11] Jesse D. Bloom, Alpan Raval, and Claus O. Wilke. Thermodynamics of neutral protein evolution. Genetics, 175(1):255–266, jan 2007.
[12] Alpan Raval. Molecular clock on a neutral network. Physical Review Letters, 99(13):138104, sep 2007.
[13] M. A. Huynen, P. F. Stadler, and W. Fontana. Smoothness within ruggedness: the role of neutrality in adaptation. Proceedings of the National Academy of Sciences, 93(1):397–401, jan 1996.
[14] E. Bornberg-Bauer and H. S. Chan. Modeling evolutionary landscapes: Mutational stability, topology, and superfunnels in sequence space. Proceedings of the National Academy of Sciences, 96(19):10689–10694, sep 1999.
[15] E. van Nimwegen, J. P. Crutchfield, and M. Huynen. Neutral evolution of mutational robustness. Proceedings of the National Academy of Sciences, 96(17):9716–9720, aug 1999.
[16] Vladimir N. Minin and Marc A. Suchard. Counting labeled transitions in continuous-time Markov models of evolution. Journal of Mathematical Biology, 56(3):391–412, mar 2008.
[17] Motoo Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120, jun 1980.
[18] K. Tamura. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Molecular Biology and Evolution, 9(4):678–687, jul 1992.
[19] K. Tamura and M. Nei. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution, 10(3):512–526, may 1993.
[20] Joseph Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, nov 1981.
[21] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22(2):160–174, 1985.
[22] Simon Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17:57, 1986.
[23] S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution, 18(5):691–699, may 2001.
[24] Si Quang Le and Olivier Gascuel. An improved general amino acid replacement matrix. Molecular Biology and Evolution, 25(7):1307–1320, jul 2008.
[25] Adam M. Szalkowski and Maria Anisimova. Markov models of amino acid substitution to study proteins with intrinsically disordered regions. PLoS ONE, 6(5):e20488, jan 2011.
[26] Daniel T. Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.
[27] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. nov 2014.
[28] C. Gardiner. Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences. Springer, 2004.
[29] Bahram Houchmandzadeh. Theory of neutral clustering for growing populations. Physical Review E, 80(5):051920, nov 2009.