Molecular Clocks
Rose Hoberman

The Holy Grail
• Fossil evidence is sparse and imprecise (or nonexistent)
• Predict divergence times by comparing molecular data
• Given
  – a phylogenetic tree
  – branch lengths (rt, i.e. rate * time)
  – a time estimate for one (or more) nodes
• Can we date other nodes in the tree?
• Yes... if the rate of molecular change is constant across all branches
[Figure: example tree with leaves C, D, R, M, H and one node dated at 110 MYA]

Rate Constancy?
[Figure: Page & Holmes p240]

Protein Variability
• Protein structures & functions differ
  – Proportion of neutral sites differs
• Rate constancy does not hold across different protein types
• However...
  – Each protein does appear to have a characteristic rate of evolution

Evidence for Rate Constancy in Hemoglobin
[Figure: Page & Holmes p229; one point is annotated "Large carnivorous marsupial"]

The Molecular Clock Hypothesis
• The amount of genetic difference between sequences is a function of time since separation.
• The rate of molecular change is constant (enough) to predict times of divergence

Outline
• Methods for estimating time under a molecular clock
  – Estimating genetic distance
  – Determining and using calibration points
  – Sources of error
• Rate heterogeneity
  – reasons for variation
  – how it's taken into account when estimating times
• Reliability of time estimates
• Estimating gene duplication times

Measuring Evolutionary Time with a Molecular Clock
1. Estimate genetic distance
   d = number of amino acid replacements
2. Use paleontological data to determine the date of the common ancestor
   T = time since divergence
3. Estimate calibration rate (number of genetic changes expected per unit time)
   r = d / 2T
4. Calculate time of divergence for novel sequences
   T_ij = d_ij / 2r

Estimating Genetic Differences
• If all nucleotides are equally likely, the observed difference would plateau at 0.75
• Simply counting differences underestimates distances
  – Fails to account for multiple hits
(Page & Holmes p148)

Estimating Genetic Distance with a Substitution Model
• accounts for relative frequency of different types of substitutions
• allows variation in substitution rates between sites
• given learned parameter values
  – nucleotide frequencies
  – transition/transversion bias
  – alpha parameter of gamma distribution
• can infer branch length from differences

Distances from Gamma-Distributed Rates
• rate variation among sites
  – “fast/variable” sites
    • 3rd codon positions
    • codons on surface of globular protein
  – “slow/invariant” sites
    • Tryptophan (1 codon) structurally required
    • 1st or 2nd codon position when disulfide bond needed
• alpha parameter of gamma distribution describes degree of variation of rates across positions
• modeling rate variation changes the curve relating branch length to sequence difference

Gamma Corrected Distances
• high-rate sites saturate quickly
• sequence difference rises much more slowly as the low-rate sites gradually accumulate differences
(Felsenstein, Inferring Phylogenies, p219)

The ‘Sloppy’ Clock
• ‘Ticks’ are stochastic, not deterministic
  – Mutations happen randomly according to a Poisson distribution.
• Many divergence times can result in the same number of mutations
• Actually over-dispersed Poisson
  – Correlations due to structural constraints

Poisson Variance (Assuming a Perfect Molecular Clock)
If one mutation occurs every MY on average
• Poisson variance
  – 95% of lineages 15 MY old have 8-22 substitutions
  – 8 substitutions could also be 5 MY old
(Molecular Systematics p532)

Need for Calibrations
• Changes = rate * time
• Can explain any observed branch length
  – Fast rate, short time
  – Slow rate, long time
• Suppose 16 changes along a branch
  – Could be 2 * 8 or 8 * 2
  – No way to distinguish
  – If told time = 8, then rate = 2
• Assume rate = 2 along all branches
  – Can infer all times

Estimating Calibration Rate
• Calculate a separate rate for each data set (species/genes) using a known date of divergence (from fossils, biogeography)
• One calibration point
  – Rate = d/2T
• More than one calibration point
  – use regression
  – use generative model that constrains time estimates (more later)

Calibration Complexities
• Cannot date fossils perfectly
• Fossils are usually not direct ancestors
  – branched off tree before (after?) splitting event.
• Impossible to pinpoint the age of the last common ancestor of a group of living species

Linear Regression
• Fix intercept at (0,0)
• Fit line between divergence estimates and calibration times
• Calculate regression and prediction confidence limits
(Molecular Systematics p536)

Molecular Dating Sources of Error
• Both X and Y values are only estimates
  – substitution model could be incorrect
  – tree could be incorrect
  – errors in orthology assignment
  – Poisson variance is large
• Pairwise divergences correlated (Systematics p534?)
  – inflates correlation between divergence & time
• Sometimes calibrations correlated
  – if using derived calibration points
• Error in inferring slope
• Confidence interval for predictions much larger than confidence interval for slope

Rate Heterogeneity
• Rate of molecular evolution can differ between
  – nucleotide positions
  – genes
  – genomic regions
  – genomes (nuclear vs organelle)
  – species
  – over time
• If not considered, introduces bias into time estimates

Rate Heterogeneity among Lineages
Cause              Reason
Repair equipment   e.g. RNA viruses have error-prone polymerases
Metabolic rate     More free radicals
Generation time    Copies DNA more frequently
Population size    Affects mutation fixation rate

Local Clocks?
• Closely related species often share similar properties and are likely to have similar rates
• For example
  – murid rodents on average 2-6 times faster than apes and humans (Graur & Li p150)
  – mouse and rat rates are nearly equal (Graur & Li p146)

Rate Changes within a Lineage
Cause                                     Reason
Population size changes                   Genetic drift more likely to fix neutral alleles in small population
Strength of selection changes over time   1. new role/environment
                                          2. gene duplication
                                          3. change in another gene

Working Around Rate Heterogeneity
1. Identify lineages that deviate and remove them
2. Quantify degree of rate variation to put limits on possible divergence dates
   – requires several calibration dates, not always available
   – gives very conservative estimates of molecular dates
3. Explicitly model rate variation

Search for Genes with Uniform Rate across Taxa
Many ‘clock’ tests:
  – Relative rates tests
    • compare rates of sister nodes using an outgroup
  – Tajima test
    • Number of sites in which character shared by outgroup and only one of two ingroups should be equal for both ingroups
  – Branch length test
    • deviation of distance from root to leaf compared to average distance
  – Likelihood ratio test
    • identifies deviance from clock but not the deviant sequences

Likelihood Ratio Test
• estimate a phylogeny under molecular clock and without it
  – e.g. under the clock, root-to-tip distances must be equal
• 2 * (difference in log-likelihoods) ~ Chi^2 with n-2 degrees of freedom (n = number of sequences)
  – asymptotically
  – when models are nested
  – when nested parameters aren’t set to boundary

Relative Rates Tests
• Test whether the distances from each of two taxa to an outgroup are equal (or average rate of two clades vs an outgroup)
  – need to compute expected variance
  – many triples to consider, and not independent
• Lacks power, especially with
  – short sequences
  – low rates of change
• Given length and number of variable sites in typical sequences used for dating, (Bromham et al 2000) says:
  – unlikely to detect moderate variation between lineages (1.5-4x)
  – likely to result in substantial error in date estimates

Modeling Rate Variation

Relaxing the Molecular Clock
• Learn rates and times, not just branch lengths
  – Assume root-to-tip times equal
  – Allow different rates on different branches
  – Rates of descendants correlate with that of common ancestor
• Restricts choice of rates, but still too much flexibility to choose rates well
[Figure: example tree with node labels R, N, M, A, B, C, D, E, F]

Relaxing the Molecular Clock
• Likelihood analysis
  – Assign each branch a rate parameter
    • explosion of parameters, not realistic
  – User can partition branches based on domain knowledge
  – Rates of partitions are independent
• Nonparametric methods
  – smooth rates along tree
• Bayesian approach
  – stochastic model of evolutionary change
  – prior distribution of rates
  – Bayes theorem
  – MCMC

Parsimonious Approaches
• Sanderson 1997, 2002
  – infer branch lengths via parsimony
  – fit divergence times to minimize difference between rates in successive branches
  – (unique solution?)
• Cutler 2000
  – infer branch lengths via parsimony
  – rates drawn from a normal distribution (negative rates set to zero)

Bayesian Approaches
• Learn rates, times, and substitution parameters simultaneously
• Devise model of relationship between rates
  – Thorne/Kishino et al
    • Assigns new rates to descendant lineages from a lognormal distribution with mean equal to ancestral rate and variance increasing with branch length
  – Huelsenbeck et al
    • Poisson process generates random rate changes along tree
    • new rate is current rate * gamma-distributed random variable

Comparison of Likelihood & Bayesian Approaches for Estimating Divergence Times (Yang & Yoder 2003)
• Analyzed two mitochondrial genes
  – each codon position treated separately
  – tested different model assumptions
  – used 7 calibration points
• Neither model reliable when
  – using only one codon position
  – using a single model for all positions
• Results similar for both methods
  – using the most complex model
  – use separate parameters for each codon position (could use codon model?)

Sources of Error/Variance
• Lack of rate constancy (due to lineage, population size or selection effects)
• Wrong assumptions in evolutionary model
• Errors in orthology assignment
• Incorrect tree
• Stochastic variability
• Imprecision of calibration
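
Worked sketch: the strict-clock calculation

The four steps under "Measuring Evolutionary Time with a Molecular Clock" (distance d, calibration time T, rate r = d / 2T, then T_ij = d_ij / 2r) can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides. It assumes nucleotide data and swaps in the Jukes-Cantor correction for multiple hits, which is the simplest model consistent with the 0.75 plateau mentioned earlier. The function names and the observed differences are hypothetical; only the 110 MY calibration echoes the opening figure.

```python
import math


def jukes_cantor_distance(p):
    """Correct an observed proportion of differing sites for multiple hits.

    Assumption: Jukes-Cantor model (all nucleotides equally likely), under
    which the observed difference plateaus at 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def calibration_rate(d, t):
    """Step 3: r = d / 2T, from a pair with a known divergence time T."""
    return d / (2.0 * t)


def divergence_time(d_ij, r):
    """Step 4: T_ij = d_ij / 2r, for a pair with unknown divergence time."""
    return d_ij / (2.0 * r)


# Hypothetical inputs for illustration only (not data from the slides).
p_calibration = 0.30   # observed difference for the calibrated pair
p_query = 0.15         # observed difference for the pair to be dated
t_calibration = 110.0  # known divergence time of the calibrated pair, in MY

# Steps 1-2: corrected distances and the known calibration date.
d_calibration = jukes_cantor_distance(p_calibration)
d_query = jukes_cantor_distance(p_query)

# Step 3: calibration rate; step 4: date the uncalibrated pair.
r = calibration_rate(d_calibration, t_calibration)
t_query = divergence_time(d_query, r)

print(f"corrected distances: {d_calibration:.4f}, {d_query:.4f}")
print(f"rate per site per MY: {r:.6f}")
print(f"estimated divergence time: {t_query:.1f} MY")
```

With raw, uncorrected differences the query pair would be dated at exactly half the calibration time (0.15 / 0.30 of 110 MY). The corrected estimate comes out smaller, because the more divergent calibration pair hides proportionally more multiple hits. The estimate still inherits every source of error listed above, including Poisson variance and imprecise calibration.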