Divergence Time Slides
Total Page:16
File Type:pdf, Size:1020Kb
Divergence time estimation Rachel Warnock 001011 Computational Evolution http://www.bsse.ethz.ch/cevo Testing hypotheses in evolutionary biology & macroevolution t speciation or extinction times r rates of morphological or molecular evolution λ rate of speciation μ rate of extinction Macroevolutionary parameters of interest are phylogenetic parameters r t λ μ Macroevolutionary parameters of interest are phylogenetic parameters r t λ μ Macroevolutionary parameters of interest are phylogenetic parameters r t λ μ ψ ρ fossil sampling rate extant species sampling A straightforward model of evolution and sampling λ — speciation rate μ — extinction rate A straightforward model of evolution and sampling ●● λ — speciation rate ● ● ●● μ — extinction rate ● ● ● ● ψ — fossil sampling rate ● ● ●●●● ● ● ● ● ● A straightforward model of evolution and sampling ●● λ — speciation rate ● ● ●● μ — extinction rate ● ● ● ● ψ — fossil sampling rate ● ● ● ρ — extant species sampling ●●●● ● ● ● ●● ● ● ● ● ● The fossilized birth-death process ●● λ — speciation rate ● ● ●● μ — extinction rate ● ● ● ● ψ — fossil sampling rate ● ● ● ρ — extant species sampling ●●●● ● ● ● ●● ● ● ● ● ● Stadler, 2010 A straightforward model of evolution and sampling ●● r λ — speciation rate ● ● r ●● μ — extinction rate ● ● ● ● ψ — fossil sampling rate ● ● r ● ρ — extant species sampling ●●●● r ● ● ● ●● ● ● ● ● ● tsampling textinction tspeciation A straightforward model of evolution and sampling ●● λ — speciation rate ● ● ●● μ — extinction rate ● ● ● ● ψ — fossil sampling rate ● ● ● ρ — extant species sampling ●●●● ● ● ● ●● ● ● ● ● ● The molecular clock hypothesis Zuckerkandl, Pauling. 1962, 1965 Building the tree of life ● ATGCATGC ● TTGCCTGC ● ● TTGCATCG ● ATGCATCG ● ATGCATGG ● TTGCCTGG ● ● TAGCGTGC ● TAGCGAGC branch lengths = rate x time ν = rt Dating the tree of life ● ATGCATGC ● TTGCCTGC ● ● TTGCATCG ● ATGCATCG ● ATGCATGG ● TTGCCTGG ● ● TAGCGTGC ● TAGCGAGC branch lengths = time Calibrating the molecular clock ● ATGCATGC ● TTGCCTGC ● ● TTGCATCG ● ATGCATCG ● ATGCATGG ● TTGCCTGG ● ● TAGCGTGC ● TAGCGAGC 6.5 branch lengths = time Calibrating the molecular clock ● ATGCATGC ● TTGCCTGC ● 20 ● TTGCATCG 70 ● ATGCATCG ● ATGCATGG ● TTGCCTGG 55 ● ● TAGCGTGC ● TAGCGAGC 6.5 branch lengths = time Bayesian divergence time estimation P( data | model ) P( model ) P( model | data ) = P( data ) Slides adapted from Louis du Plessis Taming the BEAST Workshop 2016 Bayesian divergence time estimation likelihood priors P( data | model ) P( model ) P( model | data ) = P( data ) posterior marginal likelihood of the data Slides adapted from Louis du Plessis Taming the BEAST Workshop 2016 Bayesian divergence time estimation ACAC... λ μ TCAC... ACAG... ψ ρ tree calibration substitution clock alignment tree model priors model model P( data | model ) P( model ) P( model | data ) = P( data ) Slides adapted from Louis du Plessis Taming the BEAST Workshop 2016 Bayesian divergence time estimation ACAC... λ μ TCAC... ACAG... ψ ρ tree calibration substitution clock alignment tree model priors model model ACAC... λ μ λ μ TCAC... P( A CAG... | ψ ρ ) P( ψ ρ ) λ μ ACAC... TCAC... ψ ρ P( | A C AG... ) = ACAC... TCAC... P( A C AG... ) Slides adapted from Louis du Plessis Taming the BEAST Workshop 2016 Bayesian divergence time estimation ACAC... λ μ TCAC... ACAG... ψ ρ tree calibration substitution clock alignment tree model priors model model ACAC... λ μ λ μ TCAC... P( A CAG... | ψ ρ ) P( ψ ρ ) λ μ ACAC... TCAC... ψ ρ P( | A C AG... ) = ACAC... TCAC... P( A C AG... ) λ μ Coming up… ψ ρ Slides adapted from Louis du Plessis Taming the BEAST Workshop 2016 Bayesian divergence time estimation ACAC... λ μ •TCSequenceAC... data do not contain information about absolute time ACAG... ψ ρ tree calibration substitution clock alignment tree model priors model model ACAC... λ μ λ μ TCAC... P( A CAG... | ψ ρ ) P( ψ ρ ) λ μ ACAC... TCAC... ψ ρ P( | A C AG... ) = ACAC... TCAC... P( A C AG... ) λ μ Coming up… ψ ρ Bayesian divergence time estimation ACAC... λ μ •TCSequenceAC... data do not contain information about absolute time ACAG... ψ ρ •This has several importanttree consequences:calibration substitution clock alignment tree model priors model model •We need strong prior information about the divergence times or the substitution ArateCAC... λ μ λ μ TCAC... P( A CAG... | ψ ρ ) P( ψ ρ ) λ μ ACAC... TCAC... ψ ρ P( • Model selection | A C AG... cannot) = be used to select among possible ACAC... TCAC... calibration strategies P( A C AG... ) •If there is uncertainty in the calibrations, even an infinite amount of sequence data won’t completely eliminate uncertainty in the posteriors λ μ Coming up… ψ ρ REVIEWS Neutral theory illustrates the Bayesian clock dating of equation (2) in a the Bayesian estimates converge to the true parameter Also termed the neutral two-species case. values as more and more data are analysed. However, mutation-random drift theory; Direct calculation of the proportionality constant convergence on truth does not occur in divergence time claims that evolution at the z in equation (2) is not feasible. In practice, a simula- estimation. The use of priors to resolve times and rates molecular level is mainly Markov Chain Monte Carlo random fixation of mutations tion algorithm known as the has two consequences. First, as more loci or increasingly that have little fitness effect. algorithm (MCMC algorithm) is used to generate a longer sequences are included in the analysis but the sample from the posterior distribution. The MCMC calibration information does not change, the posterior Neutral mutations algorithm is computationally expensive, and a typi- time estimates do not converge to point values and will Mutations that do not affect cal MCMC clock-dating analysis may take from a few instead involve uncertainties31,54,62. Second, the priors on the fitness (survival or reproduction) of the individual. minutes to several months for large genome-scale data times and on rates have an important impact on the pos- sets. Methods that approximate the likelihood can terior time estimates even if a huge amount of sequence Advantageous mutations substantially speed up the analysis29,57,58. For technical data is used62,63. Errors in the time prior and in the rate Mutations that improve the reviews on Bayesian and MCMC molecular clock dating prior can lead to very precise but grossly inaccurate time fitness of the carrier and are REFS 59,60 62,64 favoured by natural selection. see . estimates . Great care must always be taken in the con- Nearly a dozen computer software packages cur- struction of fossil calibrations and in the specification Deleterious mutations rently exist for Bayesian dating analysis (TABLE 1), all of of priors on times and on rates in a dating analysis65,66. Mutations that reduce the which incorporate models of rate variation among lin- As the amount of sequence data approximates fitness of the carrier and are eages (the episodic or relaxed clock models envisioned genome scale, the molecular distances or branch removed from the population 61 by negative selection. by Gillespie) . All of these programs can also analyse lengths on the phylogeny are essentially determined multiple gene loci and accommodate multiple fossil without any uncertainty, as are the relative ages of the Substitution calibrations in one analysis. nodes. However, the absolute ages and absolute rates Mutations that spread into the cannot be known without additional information (in population and become fixed, Limits of Bayesian divergence time estimation driven either by chance or by the form of priors). The joint posterior of times and natural selection. Estimating species divergence times on the basis of rates is thus one-dimensional. This reasoning has been uncertain calibrations is challenging. The main diffi- used to determine the limiting posterior distribution Relaxed clock models culty is that molecular sequence data provide informa- when the amount of sequence data (that is, the number Models of evolutionary rate tion about molecular distances (the product of times of loci or the length of the sequences) increases without drift over time or across 31,54 lineages developed to relax the and rates) but not about times and rates separately. In bound . An infinite-sites plot can be used to deter- molecular clock hypothesis. other words, the time and rate parameters are unidenti- mine whether the amount of sequence data is satur- fiable. Thus, in Bayesian clock dating, the sequence ated or whether including more sequence data is likely Rate and timedistances areare resolved only into absolute semi-identifiable times and rates to improve the time estimates (FIG. 2). The theory has through the use of priors. In a conventional Bayesian been extended to the analysis of large but finite data estimation problem, the prior becomes unimportant and sets to partition the uncertainties in the posterior time a Prior f(t,r) b Likelihood L(D|t,r) c Posterior f(t,r|D) 6 5 4 s/s/My) 3 (× 10 r 2 Rate, 1 0 0 10 20 30 40 50 0 10 20 30 40 50 0 10 20 30 40 50 Time, t (Ma) Time, t (Ma) Time, t (Ma) | Figure 1 Bayesian molecular clock dating. We estimate the posterior likelihood (part b) is calculated under the Jukes–CantorNature model Reviews. The | posteriorGenetics distribution of divergence time (t) and rate (r) in a two-species case to surface (part c) is the result of multiplying the prior and the likelihood. The illustrate Bayesian molecular clock dating. The data are an alignment of the data are informative about the molecular distance, d = tr, but not about t and 12S RNA gene sequences from humans and orang-utans, with 90 r separately. The posterior is thus very sensitive to the prior. The blue line differences at 948 nucleotides sites. The joint prior (part a) is composed of indicates the maximum likelihood estimate of t and r, and the molecular two gamma densities (reflecting our prior information on the molecular rate distance d, with ˆtˆr = ˆd. When the number of sites is infinite, the likelihood and on the geological divergence time of human–orang-utan), and the collapses onto the blue line, and the posterior becomes one-dimensional62.