Introduction to Coalescent Theory Overview
Total Page:16
File Type:pdf, Size:1020Kb
Overview • Intro to Coalescent Theory (Today) Introduction to Coalescent Theory • Genomic Imprinting, Mathematical Modeling, and Notions of Optimality in Jon Wilkins Evolution (Wednesday) Santa Fe Institute [email protected] • Statistical Inference in Complex Beijing CSSS 2007 Systems Ingredients of Natural Selection Population Genetics • Heritable variation • How is variation generated and maintained in a population? • Differential reproductive success • What can patterns of genetic diversity tell us • Causal connection between the two about the history of a population? – Demography (migration, reproduction, etc.) – Molecular events (mutation, recombination, etc.) – Natural selection (directional, purifying, etc.) 1 Why diversity? Neutral Theory • Muller - mutation drives deviations from the • Not just selective neutrality optimal phenotype • Dobzhansky - heterogeneous environments / • Constant population size frequency dependent effects • Well mixed population (panmictic) • Lewontin-Hubby experiments (mid 1960s) – Too much variation for either explanation • This is always true • Kimura - neutral theory Sampling with Replacement The Coalescent • Some alleles pass ACTT • Homologous genes on no copies to the share a common next generation, ancestor while some pass on T • DNA sequence more than one G diversity is shaped by t s a genealogical history P • All that we care about are the • Genealogies are ancestors of C G shaped by chance, sequences present demography, selection in our dataset ACGT ACGT ACTT ACTT AGTT Present 2 Balls in Boxes The Shapes of Genealogies • The coalescent • Time to the MRCA of a models genealogies pair of sequences is Probability = 1/2N * (1-1/2N)3 exponentially t s backwards in time a distributed with mean p E[T2] = 2N e h 2 t time of 2N generations Probability = 1/2N * (1-1/2N) n i s • Time to the next n • Follow ancestral o i t coalescent event for a a r Probability = 1/2N * (1-1/2N) lineages back until e n sample of n sequences e the most recent E[T3] = 2N/3 G is exponential with Probability = 1/2N common ancestor " n% E[T ] = 2N/6 mean 2N/ $ ' generations (MRCA) is reached 4 # 2& E[T ] = 2N/10 Present 5 ! Genealogies are highly variable The problem • The variance on • Want to infer the underlying processes that the length of each have shaped genetic diversity, but portion of the genealogy is large, on the order of N2 • The inherent stochasticity means that any • Variation in given genealogy is consistent with a wide topology as well range of demographic processes • Mutations are random on top of the genealogy • How do we estimate parameters, and how do we know how good our estimates are? 3 Estimating N The Structured Coalescent • Expected pairwise distance (π) – 2N times 2µ (= θ) • Expected number of polymorphisms (S) n#1 1 Location Location " $ • With geographically structured populations, all i 1 i = topologies are not equally likely ! The Structured Coalescent The Island Model N N N N N N Low Migration High Migration • Each migrant is equally likely to come from any deme • The relationship between genealogy and • Population structure, but no geography geography can be used to make inferences " # " 1 F = 0 = ST " 1+ 4Nm ! 4 A finite, linear habitat The Solution % % * * * 2 2 2 2 " f (x )cos(" x) #2$ " (t# 1 ) " j f j (x0 )sin(" j x) #2$ "* (t# 1 ) U (x,t) = i i 0 i e m i 2 + e m j 2 x &" + sin(" )cos(" ) &" * # sin(" * )cos(" * ) MRCA i=1 i i i j=1 j j j % % * * * 2 2 1 " f (y )sin(" y) 2 * 2 1 " i f i (y0)cos(" i y) #2$ m "i (t# 2 ) j j 0 j #2$ m " j (t# 2 ) Uy (y,t) = & e +& * * * e i=1 " i + sin(" i )cos(" i ) j=1 " j # sin(" j )cos(" j ) % 4N 2 1 ( 1)n (2 )2n#1 (n#1)! $ m " i + ' & # $ m" i (2n#1)! 1 1 2 2 2 2 cot(" ) = n=1 f (x ) = cos(" x)(e#(x#x0 ) / 4$ m + e#(2#x#x0 ) / 4$ m )dx i % n i 0 2 ) i 2 2 n 4' $ m #1 past 1+ &(#$ m " i ) (2m n=1 m=1 % 4N 2 * 1 ( 1)n (2 * )2n#1 (n#1)! $ m " j + ' & # $ m" j (2n#1)! 1 1 2 2 2 2 #tan(" * ) = n=1 f *(x ) = sin(" * x)(e#(x#x0 ) / 4$ m + e#(2#x#x0 ) / 4$ m )dx j % n j 0 2 ) j 2 2 * n 4' $ m #1 1+ &(#$ m " j ) (2m n=1 m=1 ! Not trivial to extend to > 1 dimension present Not trivial to extend to > 2 sequences Realistic Geography Coalescent Simulations • In most systems of interest, analytic solutions are too cumbersome • The coalescent provides an efficient framework in which to do simulations • Must understand how to relate the forward-time system to a corresponding backward-time process 5 Coalescent Simulations Ancestral Selection Graph • Dual Process • In the case of selective neutrality, the genealogical and mutational processes are separable – we can produce the genealogy, and then simply place • Lineages coalesce mutations on the tree afterwards " k% with probability $ ' /2N # 2& • But, if there is selection, then the reproductive success of an individual depends on its type • A lineage splits with ! probability sk/2 Extracting the genealogy Take-home messages • The coalescent provides a convenient approach to modeling evolutionary processes – Well suited to dealing with data • Analytic results are accessible only for very simple models • Move down the graph, allowing mutations to – In other cases, it produces efficient simulations occur • Choose the incoming branch with higher • Leaves the question of how to make inferences fitness – Come back on Friday 6.