Genetics 540 Models of DNA Evolution, Part 1 Joe Felsenstein

Department of Genetics, University of Washington
joe@genetics

A change in the DNA sequence of a species is actually a mutation, followed by a substitution, in a population.

[Figure: population diagram with the bases T and C at a site.]

The Jukes-Cantor model (1969), the simplest symmetrical model of DNA evolution

[Figure: the four bases A, G, C, and T, with every change from one base to another occurring at rate u/3.]

Transition probabilities under the Jukes-Cantor model

• All sites change independently.
• All sites have the same stochastic process working at them.
• Make up a fictional kind of event, such that when it happens the site changes to one of the 4 bases chosen at random (equiprobably).
• Assertion: having these events occur at rate $\frac{4}{3}u$ is the same as having the Jukes-Cantor model events occur at rate $u$. (A fictional event leaves the base unchanged one time in four, so actual changes occur at rate $\frac{3}{4} \cdot \frac{4}{3}u = u$.)
• The probability that none of these fictional events happens in time $t$ is $\exp(-\frac{4}{3}ut)$.
• No matter how many of these fictional events occur, provided it is not zero, the chance of ending up at a particular base is $\frac{1}{4}$.
• Putting all this together, the probability of changing to C, given that the site is currently at A, in time $t$ is

$$\mathrm{Prob}(C \mid A, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

while

$$\mathrm{Prob}(A \mid A, t) = e^{-\frac{4}{3}ut} + \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

or

$$\mathrm{Prob}(A \mid A, t) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}ut}\right)$$

so that the total probability of change is

$$\mathrm{Prob}(\mathrm{change} \mid t) = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$

[Figure: the fraction of sites differing after branches of different length, under the Jukes-Cantor model. Horizontal axis: branch length; vertical axis: differences per site, rising from 0 toward an asymptote of 0.75. A branch length of 0.7945 corresponds to 0.49 differences per site.]

Kimura's (1980) K2P model of DNA change, which allows for different rates of transitions and transversions

[Figure: the four bases, with the transitions A ↔ G and C ↔ T occurring at rate a and the four kinds of transversions occurring at rate b.]

A similar derivation can be done for the K2P model. It involves making two kinds of fictional events:

• I. At rate α, if the site has a purine (A or G), choose one of the two purines at random and change to it. If the site has a pyrimidine (C or T), choose one of the two pyrimidines at random and change to it.
• II. At rate β, choose one of the 4 bases at random and change to it.
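The fictional-event argument above can be turned into a few lines of arithmetic. The following is a minimal sketch, not code from the lecture: the function names `jc_probs` and `k2p_probs` are hypothetical, and it assumes that the two kinds of K2P events occur independently of each other at constant rates. If any type II event occurs, the base ends up uniformly random over all 4 bases; if only type I events occur, it ends up uniformly random within its own class (purine or pyrimidine).

```python
import math

def jc_probs(u, t):
    """Jukes-Cantor: (P(same base), P(one specific different base)) after time t."""
    # Fictional events happen at rate (4/3)u; this is P(no such event by time t).
    no_event = math.exp(-4.0 * u * t / 3.0)
    p_other = 0.25 * (1.0 - no_event)   # at least one event, then 1/4 per base
    p_same = no_event + p_other         # no event, or an event that lands on the same base
    return p_same, p_other

def k2p_probs(alpha, beta, t):
    """K2P from event types I (rate alpha) and II (rate beta).

    Returns (P(same base), P(transition), P(one specific transversion)).
    Assumes the two event types occur independently (illustrative assumption).
    """
    no_i = math.exp(-alpha * t)         # no type I event by time t
    no_ii = math.exp(-beta * t)         # no type II event by time t
    p_tv_each = 0.25 * (1.0 - no_ii)    # needs at least one type II event
    # Transition: only type I events (1/2 chance), or any type II event (1/4 chance).
    p_ts = 0.5 * no_ii * (1.0 - no_i) + 0.25 * (1.0 - no_ii)
    # Same base: no events at all, or events that happen to return to the start.
    p_same = no_ii * no_i + p_ts
    return p_same, p_ts, p_tv_each

if __name__ == "__main__":
    p_same, p_other = jc_probs(u=1.0, t=0.7945)
    print("JC total probability of change:", 3.0 * p_other)
    print("K2P with alpha=2, beta=0.5, t=1:", k2p_probs(2.0, 0.5, 1.0))
```

Setting alpha to 0 and beta to (4/3)u removes the type I events, and the K2P probabilities then coincide with the Jukes-Cantor ones, which is a convenient sanity check on the argument.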
By proper choice of α and β one can achieve the overall rate of change and the Ts/Tn ratio R you want. For rate of change 1, the transition probabilities (warning: terminological tangle) are

$$\mathrm{Prob}(\mathrm{transition} \mid t) = \frac{1}{4} - \frac{1}{2}\exp\left(-\frac{2R+1}{R+1}\,t\right) + \frac{1}{4}\exp\left(-\frac{2}{R+1}\,t\right)$$

$$\mathrm{Prob}(\mathrm{transversion} \mid t) = \frac{1}{2} - \frac{1}{2}\exp\left(-\frac{2}{R+1}\,t\right) \qquad (1)$$

(the transversion probability is the sum of the probabilities of both kinds of transversions).

[Figure: fractions of transitions and transversions expected in different amounts of branch length under the K2P model, for Ts/Tn = 10. Curves: total differences, transitions, transversions. Horizontal axis: time (branch length), 0 to 3; vertical axis: differences.]

[Figure: fractions of transitions and transversions expected in different amounts of branch length under the K2P model, for Ts/Tn = 2. Curves: total differences, transversions, transitions. Horizontal axis: time (branch length), 0 to 3; vertical axis: differences.]

Other commonly used models include:

Two models that specify the equilibrium base frequencies (you provide the frequencies $\pi_A, \pi_C, \pi_G, \pi_T$ and they are set up to have an equilibrium which achieves them), and also let you control the transition/transversion ratio:

The Hasegawa-Kishino-Yano (1985) model:

$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G + \beta\pi_G & \alpha\pi_C & \alpha\pi_T \\
G & \alpha\pi_A + \beta\pi_A & - & \alpha\pi_C & \alpha\pi_T \\
C & \alpha\pi_A & \alpha\pi_G & - & \alpha\pi_T + \beta\pi_T \\
T & \alpha\pi_A & \alpha\pi_G & \alpha\pi_C + \beta\pi_C & -
\end{array}
$$

My F84 model:

$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G + \beta\frac{\pi_G}{\pi_R} & \alpha\pi_C & \alpha\pi_T \\
G & \alpha\pi_A + \beta\frac{\pi_A}{\pi_R} & - & \alpha\pi_C & \alpha\pi_T \\
C & \alpha\pi_A & \alpha\pi_G & - & \alpha\pi_T + \beta\frac{\pi_T}{\pi_Y} \\
T & \alpha\pi_A & \alpha\pi_G & \alpha\pi_C + \beta\frac{\pi_C}{\pi_Y} & -
\end{array}
$$

where $\pi_R = \pi_A + \pi_G$ and $\pi_Y = \pi_C + \pi_T$ (the equilibrium frequencies of purines and pyrimidines).

Both of these models have formulas for the transition probabilities, and both are subcases of a slightly more general class of models, the Tamura-Nei model (1993).
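One way to check that the HKY and F84 rate matrices above really have the supplied base frequencies as their equilibrium is to build them numerically. The sketch below is illustrative only: the function name `rate_matrix` and its `model` argument are hypothetical, and the "−" diagonal entries are taken to be minus the sum of the other entries in the row, which is the usual convention for a rate matrix but is an assumption here.

```python
BASES = "AGCT"
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(x, y):
    """True if x -> y stays within the purines or within the pyrimidines."""
    return (x in PURINES) == (y in PURINES)

def rate_matrix(pi, alpha, beta, model="HKY"):
    """Build the HKY or F84 rate matrix as a nested dict q[from][to].

    pi maps each base to its equilibrium frequency. Every change gets
    alpha * pi[to]; transitions get an extra beta term, divided by the
    frequency of the purine or pyrimidine class in the F84 case.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            rate = alpha * pi[y]
            if is_transition(x, y):
                if model == "HKY":
                    rate += beta * pi[y]
                else:  # F84
                    group = PURINES if y in PURINES else PYRIMIDINES
                    rate += beta * pi[y] / sum(pi[b] for b in group)
            q[x][y] = rate
        # Assumed convention: diagonal = minus the sum of the off-diagonal row entries.
        q[x][x] = -sum(q[x].values())
    return q

def net_flow(pi, q):
    """sum_x pi[x] * q[x][y] for each y; all zeros means pi is an equilibrium."""
    return {y: sum(pi[x] * q[x][y] for x in BASES) for y in BASES}

if __name__ == "__main__":
    pi = {"A": 0.1, "G": 0.2, "C": 0.3, "T": 0.4}
    for model in ("HKY", "F84"):
        q = rate_matrix(pi, alpha=1.0, beta=2.0, model=model)
        print(model, net_flow(pi, q))
```

In both matrices the rate from x to y is pi[y] times a factor that is the same for the pair (x, y) and the pair (y, x), so pi[x] * q[x][y] equals pi[y] * q[y][x], and the net flow into each base at equilibrium is zero up to rounding error.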
There is also the General Time-Reversible model (GTR), which maintains "detailed balance" so that the probability of starting at (say) A and ending at (say) T in evolution is the same as the probability of starting at T and ending at A:

$$
\begin{array}{c|cccc}
\text{from} \backslash \text{to} & A & G & C & T \\ \hline
A & - & \alpha\pi_G & \beta\pi_C & \gamma\pi_T \\
G & \alpha\pi_A & - & \delta\pi_C & \epsilon\pi_T \\
C & \beta\pi_A & \delta\pi_G & - & \upsilon\pi_T \\
T & \gamma\pi_A & \epsilon\pi_G & \upsilon\pi_C & -
\end{array}
$$

And there is of course the general 12-parameter model, which has arbitrary rates for each of the 12 possible changes (from each of the 4 nucleotides to each of the 3 others). Neither of these has formulas for the transition probabilities, but those can be computed numerically.

There are many other models, but these are the most widely used ones. Here is a general scheme of which models are subcases of which other ones. Each model is a subcase of the one listed above it, HKY and F84 are both subcases of Tamura-Nei, and the number of parameters is given in parentheses:

General 12-parameter model (12)
General time-reversible model (9)
Tamura-Nei (6)
HKY (5), F84 (5)
Kimura K2P (2)
Jukes-Cantor (1)

[Figure: Gamma distributions with mean 1 and different coefficients of variation (standard deviation / mean). Curves: CV = 2 (α = 0.25), CV = 1 (α = 1), CV = 0.3 (α = 11.1111). Horizontal axis: rate, 0 to 2; vertical axis: frequency.]

$\alpha = 1/\mathrm{CV}^2$ is the "shape parameter" of the Gamma distribution.

References:

Barry, D., and J. A. Hartigan. 1987. Statistical analysis of hominoid molecular evolution. Statistical Science 2: 191-210.
Felsenstein, J. and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13: 93-104.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174.
Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Molecular Biology and Evolution 7: 82-102.
Jukes, T. H. and C. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New York.
Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.
Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20: 86-93.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11: 605-612.
Olsen, G. J. 1987. Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symposia on Quantitative Biology 52: 825-837.
Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512-526.
Waddell, P. J. and M. A. Steel. 1997. General time-reversible distances with unequal rates across sites: mixing Γ and inverse Gaussian distributions with invariant sites. Molecular Phylogenetics and Evolution 8: 398-414.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39: 306-314.
Yang, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139: 993-1005.
