
Molecular Clocks
Rose Hoberman

The Holy Grail
• Fossil evidence is sparse and imprecise (or nonexistent)
• Predict divergence times by comparing molecular data

Rate Constancy?
• Given
  – a tree
  – branch lengths (rt)
  – a date estimate for one (or more) nodes
• Can we date other nodes in the tree?
• Yes... if the rate of molecular change is constant across all branches
[Figure: tree with leaves C, D, R, M, H and one node dated at 110 MYA (Page & Holmes p240)]

Evidence for Rate Constancy
[Figure: evidence for rate constancy, with one point labeled "Large carnivorous marsupial" (Page and Holmes p229)]

Protein Variability
• Structures & functions differ
  – Proportion of neutral sites differs
• Rate constancy does not hold across different protein types
• However...
  – Each protein does appear to have a characteristic rate

The Molecular Clock Hypothesis
• Amount of genetic difference between sequences is a function of time since separation.
• Rate of molecular change is constant (enough) to predict times of divergence

Outline
• Methods for estimating time under a molecular clock
  – Estimating genetic distance
  – Determining and using calibration points
  – Sources of error
• Rate heterogeneity
  – reasons for variation
  – how it's taken into account when estimating times
• Reliability of time estimates
• Estimating gene duplication times

Measuring Evolutionary Time with a Molecular Clock
1. Estimate genetic distance
   d = number of replacements
2. Use paleontological data to determine date of common ancestor
   T = time since divergence
3. Estimate calibration rate (number of genetic changes expected per unit time)
   r = d / 2T
4. Calculate time of divergence for novel sequences
   T_ij = d_ij / 2r

Estimating Genetic Differences
• If all nt are equally likely, the observed difference would plateau at 0.75
• Simply counting differences underestimates distances
• Fails to account for multiple hits
(Page & Holmes p148)
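The four steps above, together with the multiple-hits problem, can be sketched in a few lines. This is only an illustration: the slides do not name a correction, so the Jukes-Cantor formula is an assumption chosen because it matches the 0.75 plateau, and the proportions 0.30 and 0.15 and the 110 MY calibration date are made-up inputs.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (no gap handling)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor distance (assumed model); corrects p for multiple hits."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def calibration_rate(d, T):
    """Step 3: r = d / 2T."""
    return d / (2.0 * T)

def divergence_time(d_ij, r):
    """Step 4: T_ij = d_ij / 2r."""
    return d_ij / (2.0 * r)

# Hypothetical inputs: a calibration pair differing at 30% of sites whose
# common ancestor is dated at 110 MYA, and a novel pair differing at 15%.
d_cal = jukes_cantor(0.30)
r = calibration_rate(d_cal, 110.0)
t_new = divergence_time(jukes_cantor(0.15), r)
t_naive = divergence_time(0.15, calibration_rate(0.30, 110.0))
print(round(t_new, 1), round(t_naive, 1))
```

With uncorrected counts the novel pair would be dated at exactly half the calibration age; the correction moves the estimate because the older pair has lost proportionally more changes to multiple hits.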

Estimating Genetic Distance with a Substitution Model
• accounts for relative frequency of different types of substitutions
• allows variation in substitution rates between sites
• given learned parameter values
  – frequencies
  – transition/transversion bias
  – alpha parameter of gamma distribution
• can infer branch length from differences

Distances from Gamma-Distributed Rates
• rate variation among sites
  – "fast/variable" sites
    • 3rd codon positions
    • codons on surface of globular protein
  – "slow/invariant" sites
    • Tryptophan (1 codon) structurally required
    • 1st or 2nd codon position when di-sulfide bond needed
• alpha parameter of gamma distribution describes degree of variation of rates across positions
• modeling rate variation changes branch length/sequence differences curve
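To see how the alpha parameter changes the branch length/sequence differences curve, here is a minimal sketch. It assumes the Jukes-Cantor model with gamma-distributed rates across sites, which is one standard closed-form case and not necessarily the model used in the lecture. The function names and the example values are illustrative.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance with equal rates at all sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jukes_cantor_gamma(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    Small alpha means strong rate variation; large alpha approaches
    the equal-rates distance.
    """
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p = 0.30  # hypothetical observed proportion of differing sites
for alpha in (0.5, 1.0, 2.0, 1e6):
    print(alpha, round(jukes_cantor_gamma(p, alpha), 4))
print("equal rates:", round(jukes_cantor(p), 4))
```

The same observed difference maps to a longer branch when rates vary strongly among sites, because the fast sites have already saturated and the slow sites contribute differences only gradually.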

Gamma Corrected Distances
• high rate sites saturate quickly
• sequence difference rises much more slowly as the low-rate sites gradually accumulate differences
(Felsenstein, Inferring Phylogenies p219)

The 'Sloppy' Clock
• 'Ticks' are stochastic, not deterministic
  – happen randomly according to a Poisson distribution.
• Many divergence times can result in the same number of mutations
• Actually over-dispersed Poisson
  – Correlations due to structural constraints

Poisson Variance (Assuming a Perfect Molecular Clock)
• If one substitution is expected every MY
• Poisson variance
  – 95% of lineages 15 MY old have 8-22 substitutions
  – 8 substitutions could also be 5 MYA
(Molecular Systematics p532)

Need for Calibrations
• Changes = rate*time
• Can explain any observed branch length
  – Fast rate, short time
  – Slow rate, long time
• Suppose 16 changes along a branch
  – Could be 2 * 8 or 8 * 2
  – No way to distinguish
  – If told time = 8, then rate = 2
• Assume rate=2 along all branches
  – Can infer all times
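The numbers on the Poisson Variance slide can be checked directly, and the same few lines show why a branch length alone cannot separate rate from time. This sketch assumes one expected substitution per MY, which is the reading of the slide that makes 15 MY correspond to a mean of 15 changes.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def poisson_cdf(k, mean):
    """Probability of at most k events."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))

rate = 1.0  # assumed: one substitution expected per MY

# Fraction of 15-MY-old lineages carrying between 8 and 22 substitutions.
coverage = poisson_cdf(22, rate * 15.0) - poisson_cdf(7, rate * 15.0)

# How well does an observed count of 8 fit an age of 15 MY versus 5 MY?
fit_15 = poisson_pmf(8, rate * 15.0)
fit_5 = poisson_pmf(8, rate * 5.0)

# Need for calibrations: different (rate, time) pairs with the same
# product predict the same expected number of changes.
expected = {(r, t): r * t for r, t in [(2.0, 8.0), (8.0, 2.0)]}

print(round(coverage, 3), round(fit_15, 4), round(fit_5, 4), expected)
```

Under this assumption a count of 8 is more probable for a 5 MY old lineage than for a 15 MY old one, even though 8 sits inside the 95% range for 15 MY.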

Estimating Calibration Rate
• Calculate separate rate for each data set (species/genes) using known date of divergence (e.g. from a fossil)
• One calibration point
  – Rate = d/2T
• More than one calibration point
  – use regression
  – use generative model that constrains time estimates (more later)

Calibration Complexities
• Cannot date fossils perfectly
• Fossils usually not direct ancestors
  – branched off tree before (after?) splitting event.
• Impossible to pinpoint the age of last common ancestor of a group of living species

Linear Regression
• Fix intercept at (0,0)
• Fit line between divergence estimates and calibration times
• Calculate regression and prediction confidence limits
(Molecular Systematics p536)

Molecular Dating Sources of Error
• Both X and Y values only estimates
  – substitution model could be incorrect
  – tree could be incorrect
  – errors in orthology assignment
  – Poisson variance is large
• Pairwise divergences correlated (Systematics p534?)
  – inflates correlation between divergence & time
• Sometimes calibrations correlated
  – if using derived calibration points
• Error in inferring slope
• Confidence interval for predictions much larger than confidence interval for slope
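The Linear Regression slide amounts to a least-squares fit with the intercept fixed at the origin. The sketch below is a minimal version with made-up calibration data; it reports the slope and its standard error only, and does not attempt the wider prediction limits the slide warns about. All names and numbers are illustrative.

```python
import math

def fit_through_origin(times, distances):
    """Least-squares slope for distance = slope * time, intercept fixed at 0.

    Returns (slope, standard error of slope).
    """
    sxx = sum(t * t for t in times)
    sxy = sum(t * d for t, d in zip(times, distances))
    slope = sxy / sxx
    # One fitted parameter, so n - 1 residual degrees of freedom.
    resid = sum((d - slope * t) ** 2 for t, d in zip(times, distances))
    s2 = resid / (len(times) - 1)
    return slope, math.sqrt(s2 / sxx)

def predict_time(d_new, slope):
    """Point estimate of divergence time for a new distance."""
    return d_new / slope

# Hypothetical calibration points: (time in MY, corrected distance).
times = [10.0, 25.0, 50.0, 80.0]
distances = [0.021, 0.048, 0.102, 0.158]

slope, se_slope = fit_through_origin(times, distances)
t_new = predict_time(0.06, slope)
print(slope, se_slope, round(t_new, 1))
```

The standard error here reflects only scatter around the line. It ignores the error in the calibration dates themselves and the correlation between pairwise divergences, so it understates the real uncertainty.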

Rate Heterogeneity
• Rate of molecular change can differ between
  – nucleotide positions
  – genes
  – genomic regions
  – genomes (nuclear vs organelle)
  – species
  – over time
• If not considered, introduces bias into time estimates

Rate Heterogeneity among Lineages
Cause              Reason
Repair equipment   e.g. RNA viruses have error-prone polymerases
Metabolic rate     More free radicals
Generation time    Copies DNA more frequently
Population size    Affects mutation fixation rate

Local Clocks?
• Closely related species often share similar properties, likely to have similar rates
• For example
  – murids on average 2-6 times faster than apes (Graur & Li p150)
  – mouse and rat rates are nearly equal (Graur & Li p146)

Rate Changes within a Lineage
Cause                                     Reason
Population size changes                   More likely to fix neutral alleles in small population
Strength of selection changes over time   1. new role/environment
                                          2. gene duplication
                                          3. change in another gene

Working Around Rate Heterogeneity
1. Identify lineages that deviate and remove them
2. Quantify degree of rate variation to put limits on possible divergence dates
   – requires several calibration dates, not always available
   – gives very conservative estimates of molecular dates
3. Explicitly model rate variation

Search for Genes with Uniform Rate across Taxa
Many 'clock' tests:
  – Relative rates tests
    • compares rates of sister nodes using an outgroup
  – Tajima test
    • Number of sites in which character shared by outgroup and only one of two ingroups should be equal for both ingroups
  – Branch length test
    • deviation of distance from root to leaf compared to average distance
  – Likelihood ratio test
    • identifies deviance from clock but not the deviant sequences
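The Tajima test in the list above is simple enough to sketch. The version below counts sites where exactly one ingroup sequence differs from the other two and compares the two counts with a one-degree-of-freedom chi-square statistic, which is the usual form of the test. The toy sequences are invented, and no handling of gaps or ambiguous bases is attempted.

```python
def tajima_relative_rate(ingroup1, ingroup2, outgroup):
    """Return (m1, m2, chi2) for a Tajima-style relative rate test.

    m1 counts sites where ingroup2 matches the outgroup but ingroup1 does
    not (a change unique to lineage 1); m2 is the mirror image.
    """
    if not (len(ingroup1) == len(ingroup2) == len(outgroup)):
        raise ValueError("sequences must be aligned to equal length")
    m1 = m2 = 0
    for a, b, o in zip(ingroup1, ingroup2, outgroup):
        if a != b:
            if b == o:
                m1 += 1
            elif a == o:
                m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0
    return m1, m2, (m1 - m2) ** 2 / (m1 + m2)

# Toy alignment: lineage 1 carries 8 unique changes, lineage 2 carries 1.
outgroup = "A" * 20
ingroup1 = "C" * 8 + "A" * 12
ingroup2 = "A" * 8 + "G" + "A" * 11

m1, m2, chi2 = tajima_relative_rate(ingroup1, ingroup2, outgroup)
CRITICAL_5_PERCENT_1_DF = 3.841
print(m1, m2, round(chi2, 3), chi2 > CRITICAL_5_PERCENT_1_DF)
```

A statistic above 3.841 rejects equal rates at the 5% level. With the short sequences and low counts typical of real data the test has little power, as the Relative Rates Tests slide notes.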

Likelihood Ratio Test
• estimate a phylogeny under molecular clock and without it
  – e.g. root-to-tip distances must be equal
• 2 * difference in log likelihood ~ Chi^2 with n-2 degrees of freedom (n = number of taxa)
  – asymptotically
  – when models are nested
  – when nested parameters aren't set to boundary

Relative Rates Tests
• Tests whether distances between two taxa and an outgroup are equal (or average rate of two clades vs an outgroup)
  – need to compute expected variance
  – many triples to consider, and not independent
• Lacks power, esp
  – short sequences
  – low rates of change
• Given length and number of variable sites in typical sequences used for dating, (Bromham et al 2000) says:
  – unlikely to detect moderate variation between lineages (1.5-4x)
  – likely to result in substantial error in date estimates
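As a rough illustration of the likelihood ratio test, the sketch below takes two log likelihoods that are assumed to come from some tree-fitting program (the values are invented), forms twice their difference, and converts it to a p-value with n-2 degrees of freedom. The chi-square tail is computed from the closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    pdf = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * pdf * total

def clock_lrt(lnl_clock, lnl_free, n_taxa):
    """Return (statistic, df, p-value) for clock vs. no-clock models."""
    stat = 2.0 * (lnl_free - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)

# Hypothetical log likelihoods for a 6-taxon data set.
stat, df, p_value = clock_lrt(lnl_clock=-2350.4, lnl_free=-2343.1, n_taxa=6)
print(round(stat, 2), df, round(p_value, 4))
```

A small p-value says the clock is rejected somewhere in the tree. As the clock-test list points out, it does not say which sequences deviate.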

Modeling Rate Variation: Relaxing the Molecular Clock
• Learn rates and times, not just branch lengths
  – Assume root-to-tip times equal
  – Allow different rates on different branches
  – Rates of descendants correlate with that of common ancestor
• Restricts choice of rates, but still too much flexibility to choose rates well
[Figure: tree with node labels R, N, M, D, E, F, A, B, C]

Relaxing the Molecular Clock
• Likelihood analysis
  – Assign each branch a rate parameter
    • explosion of parameters, not realistic
  – User can partition branches based on domain knowledge
  – Rates of partitions are independent
• Nonparametric methods
  – smooth rates along tree
• Bayesian approach
  – stochastic model of evolutionary change
  – prior distribution of rates
  – Bayes theorem
  – MCMC

Bayesian Approaches
• Learn rates, times, and substitution parameters simultaneously
• Devise model of relationship between rates
  – Thorne/Kishino et al
    • Assigns new rates to descendant lineages from a lognormal distribution with mean equal to ancestral rate and variance increasing with branch length
  – Huelsenbeck et al
    • Poisson process generates random rate changes along tree
    • new rate is current rate * gamma-distributed random variable

Parsimonious Approaches
• Sanderson 1997, 2002
  – infer branch lengths via parsimony
  – fit divergence times to minimize difference between rates in successive branches
  – (unique solution?)
• Cutler 2000
  – infer branch lengths via parsimony
  – rates drawn from a distribution (negative rates set to zero)
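The two rate-change models on the Bayesian Approaches slide can be simulated along a single branch to get a feel for them. This is a sketch of the prior on rates only, not of the published inference methods. The parameter names (`nu`, `event_rate`, `shape`) and the choice of a mean-one gamma multiplier are assumptions made for the illustration.

```python
import math
import random

def lognormal_child_rate(parent_rate, branch_time, nu, rng):
    """Autocorrelated rate change in the Thorne/Kishino style.

    The child rate is lognormal with mean equal to the parent rate and
    log-scale variance nu * branch_time, so longer branches drift further.
    """
    var = nu * branch_time
    mu = math.log(parent_rate) - var / 2.0  # makes E[child] = parent_rate
    return rng.lognormvariate(mu, math.sqrt(var))

def compound_poisson_rate(start_rate, branch_time, event_rate, shape, rng):
    """Rate at the end of a branch under a compound Poisson process.

    Rate-change events arrive as a Poisson process; each event multiplies
    the current rate by a gamma variable (mean 1 here, an assumption).
    """
    rate = start_rate
    t = rng.expovariate(event_rate)
    while t < branch_time:
        rate *= rng.gammavariate(shape, 1.0 / shape)
        t += rng.expovariate(event_rate)
    return rate

rng = random.Random(1)
children = [lognormal_child_rate(2.0, 1.0, 0.5, rng) for _ in range(20000)]
mean_child = sum(children) / len(children)
jumped = [compound_poisson_rate(2.0, 1.0, 0.5, 4.0, rng) for _ in range(1000)]
unchanged = sum(1 for r in jumped if r == 2.0) / len(jumped)
print(round(mean_child, 2), round(unchanged, 2))
```

Under the lognormal model every branch gets a slightly different rate, while under the compound Poisson model many branches keep the ancestral rate exactly and a few change by a large factor.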

Comparison of Likelihood & Bayesian Approaches for Estimating Divergence Times (Yang & Yoder 2003)
• Analyzed two mitochondrial genes
  – each codon position treated separately
  – tested different model assumptions
  – used 7 calibration points
• Neither model reliable when
  – using only one codon position
  – using a single model for all positions
• Results similar for both methods
  – using the most complex model
  – use separate parameters for each codon position (could use codon model?)

Sources of Error/Variance
• Lack of rate constancy (due to lineage, population size or selection effects)
• Wrong assumptions in evolutionary model
• Errors in orthology assignment
• Incorrect tree
• Stochastic variability
• Imprecision of calibration points
• Imprecision of regression
• Sloppiness in analysis
  – self-fulfilling prophecies

Reading the entrails of chickens (Graur and Martin 2004)
• single calibration point
• error bars removed from calibration points
• standard error bars instead of 95% confidence intervals
• secondary/tertiary calibration points treated as reliable and precise
  – based on incorrect initial estimates
  – variance increases with distance from original estimate
• few used

Multiple Gene Loci
• "Trying to estimate time of divergence from one protein is like trying to estimate the average height of humans by measuring one human"
  --Molecular Systematics p539
• Use multiple genes! (and multiple calibration points)

McLysaght, Hokamp, Wolfe 2002: Dating Human Gene Duplications
• [758] Trees generated (ML method using PAM matrix)
• [602] Alpha parameter for gamma distribution learned
  – (Gu and Zhang 1997) faster than ML, more accurate than parsimony
  – Thrown out if variance > mean. Why would this happen?
  – "May be problematic to apply this model for gene family evolution because of the possible functional divergence among paralogous genes"
• [481] NJ trees built from Gamma-corrected distances
  – Family kept only if worm/fly group together
• [191] Two-cluster test of rate constancy (Takezaki et al 1995)

Even so... Be Very Wary Of Molecular Times
• Point estimates are absurd
• Sample errors often based only on the difference between estimates in the same study
• Even estimates with confidence intervals unlikely to really capture all sources of variance

Blanc, Hokamp, Wolfe: Dating Arabidopsis Duplications
• Create nucleotide alignments
• Estimate "Level of" Synonymous substitutions (Yang's ML method)
  – per site? per synonymous site?
• Ks values > 10 ignored (Yang; Anisimova)
• Why was a different method used than for human?
• How reliable is ranking of Ks values? How much variance expected?

Ks > 10 unreliable?
• Yang (abstract) calculates effect of evolutionary rate on accuracy of phylogenetic reconstruction
• Anisimova calculates accuracy and power of LRT in detecting adaptive molecular evolution
• Neither seems to give any cutoff regarding dS > 10.

Future Improvements
• Calculate accurate confidence intervals taking into account multiple sources of variance
• Novel models that account for variation in rates between taxa
• Build explicit models that predict rates based on an understanding of the underlying processes that generate differences in substitution rates

General References
Reviews/Critiques:
1. Bromham and Penny. The modern molecular clock. Nature review in ?, 2003.
2. Graur and Martin. Reading the entrails of chickens...the illusion of precision. Trends in Genetics, 2004.
Textbooks:
1. Molecular Systematics. 2nd edition. Edited by Hillis, Moritz, and Mable.
2. Inferring Phylogenies. Felsenstein.
3. Molecular Evolution, a phylogenetic approach. Page and Holmes.

Rate Heterogeneity References
Dealing with Rate Heterogeneity:
1. Yang and Yoder. Comparison of likelihood and bayesian methods for estimating divergence times... Syst. Biol., 2003.
2. Kishino, Thorne, and Bruno. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol. Biol. Evol., 2001.
3. Huelsenbeck, Larget, and Swofford. A compound poisson process for relaxing the molecular clock. Genetics, 2000.
Testing for Rate Heterogeneity:
1. Takezaki, Rzhetsky and Nei. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol., 1995.
2. Bromham, Penny, Rambaut, and Hendy. The power of relative rates tests depends on the data. J Mol Evol, 2000.

Dating Duplications References
Dating duplications:
• McLysaght, Hokamp, and Wolfe. Extensive genomic duplication during early chordate evolution. Nature Genetics?, 2002.
• Blanc, Hokamp, and Wolfe. Recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Research, 2003.
References used for dating duplications in above papers:
• Gu and Zhang. A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol., 1997.
• Yang Z. On the best evolutionary rate for phylogenetic analysis. Syst. Biol., 1998.
• Anisimova, Bielawski, Yang. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol., 2001.

Synonymous vs Nonsynonymous Distance
• Syn sites are sites where a nt change does not cause an AA change
  – only ~25% of sites, so become saturated more quickly
• Between proteins
  – more variation in non-synonymous rates
• Within same protein
  – more variation in synonymous rates
• Which are used? What is effect?

Relative vs Absolute Rates
• M. Systematics p540
  – "Differences in rates of divergence among lineages detract only from methods of analysis that require clocklike behavior of molecules, and alternative methods of analysis exist for all applications of molecular systematics except for the absolute estimation of time."
• t1 = 2 * t2 still requires clocklike behavior?

Two-cluster Test: Takezaki, Rzhetsky and Nei (1995?)
• estimate tree
• for each nonroot interior node:
  – calculate average "rate" for both descendant clades
  – test equality of rates (using variance & covariance of branch lengths) [doesn't appear to correct for multiple testing]
• move up from leaves, eliminating a cluster if not equal
• finally, linear tree created
  – reestimate branch lengths under clock constraint

Neutral Hypothesis
• Most mutations have no influence on the fitness of the organism
  – Advantageous mutations rare
  – Deleterious mutations rapidly removed
  – Greatest proportion of mutations have no effect on protein function
• Rate of change is thus affected only by mutation rate, and so should be relatively constant within a species
  – Variation in rate among genes b/c differences in selective constraints

dS and dN in Nuclear Genes of Mammals (Yang & Nielsen 1997)
Gene                 dS (P)   dS (R)   dN (P)   dN (R)
Acid phosphatase     0.354    0.680    0.028    0.049
Myelin Proteolipid   0.033    0.117    0.009    0.000
Interleukin 6        0.100    0.566    0.191    0.373
IGF binding 1        0.307    0.667    0.109    0.084
Thrombomodulin       0.414    1.337    0.092    0.108
Average              0.190    0.525    0.039    0.066

Perfect Molecular Clock
• Change linear function of time (substitutions ~ Poisson)
• Rates constant (positions/lineages)
• Tree perfect
• Molecular distance estimated perfectly
• Calibration dates without error
• Regression (time vs substitutions) without error

Yang, effect of evol. rate abstract
• Yang calculates effect of evolutionary rate on accuracy of phylogenetic reconstruction
  – simulation study
  – branch length = "expected total number nt substitutions per site" (not synonymous?)
  – estimates proportion of correctly recovered branch partitions
  – "optimum levels of sequence divergence were even higher than previously suggested for saturation of substitutions, indicating that the problem of saturation may have been exaggerated"

Bayesian parametric estimation
• Density function for $x$, given the training data set $X^{(n)} = \{x_1, \ldots, x_N\}$:
  $$p(x \mid X^{(n)}) = \int p(x, \theta \mid X^{(n)}) \, d\theta$$
• From the definition of conditional probability densities
  $$p(x, \theta \mid X^{(n)}) = p(x \mid \theta, X^{(n)}) \, p(\theta \mid X^{(n)}).$$
• The first factor is independent of $X^{(n)}$ since it is just our assumed form for the parameterized density:
  $$p(x \mid \theta, X^{(n)}) \Rightarrow p(x \mid \theta)$$
• Therefore
  $$p(x \mid X^{(n)}) = \int p(x \mid \theta) \, p(\theta \mid X^{(n)}) \, d\theta$$

Bayesian parametric estimation
• Instead of choosing a specific value $\theta$, the Bayesian approach performs a weighted average over all values of $\theta$.
• If the weighting factor $p(\theta \mid X^{(n)})$, which is the posterior of $\theta$, peaks very sharply about some value $\hat{\theta}$, we obtain $p(x \mid X^{(n)}) \approx p(x \mid \hat{\theta})$.
• Thus the optimal estimator is the most likely value of $\theta$ given the data and the prior of $\theta$.

The Holy Grail
• Fossil evidence is sparse and imprecise (or nonexistent)
• Predict divergence times by comparing molecular data
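The weighted average in the Bayesian parametric estimation slides can be made concrete with a grid approximation. The sketch assumes a setting the slides do not specify: theta is a substitution rate, each observation is a substitution count over a known time with a Poisson likelihood, and the prior is flat over a grid of rates. The data are invented.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when `mean` are expected."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def grid_posterior(observations, grid, prior):
    """Posterior weights p(theta | X) on a grid, via Bayes' theorem.

    observations: list of (substitution count, elapsed time) pairs.
    """
    weights = []
    for theta, pr in zip(grid, prior):
        like = 1.0
        for k, t in observations:
            like *= poisson_pmf(k, theta * t)
        weights.append(like * pr)
    total = sum(weights)
    return [w / total for w in weights]

def predictive(k_new, t_new, grid, posterior):
    """p(x | X) as the posterior-weighted average of p(x | theta)."""
    return sum(poisson_pmf(k_new, th * t_new) * w
               for th, w in zip(grid, posterior))

grid = [0.05 * i for i in range(1, 101)]   # candidate rates 0.05 .. 5.0
prior = [1.0 / len(grid)] * len(grid)      # flat prior (assumption)
obs = [(16, 8.0), (9, 5.0), (22, 10.0)]    # hypothetical (count, time) data

post = grid_posterior(obs, grid, prior)
theta_hat = grid[post.index(max(post))]    # posterior mode
p_avg = predictive(6, 3.0, grid, post)     # full weighted average
p_plug = poisson_pmf(6, theta_hat * 3.0)   # plug-in approximation
print(round(theta_hat, 2), round(p_avg, 3), round(p_plug, 3))
```

With only three observations the posterior is not sharply peaked, so the averaged prediction is somewhat flatter than the plug-in value. The two converge as more data concentrate the posterior around its mode.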
