Overview

• Intro to Coalescent Theory (Today) Introduction to Coalescent Theory • Genomic Imprinting, Mathematical Modeling, and Notions of Optimality in Jon Wilkins (Wednesday) Santa Fe Institute [email protected] • Statistical Inference in Complex Beijing CSSS 2007 Systems

Ingredients of

• Heritable variation • How is variation generated and maintained in a population? • Differential reproductive success • What can patterns of genetic diversity tell us • Causal connection between the two about the history of a population? – Demography (migration, reproduction, etc.) – Molecular events (, recombination, etc.) – Natural selection (directional, purifying, etc.)

1 Why diversity? Neutral Theory

• Muller - mutation drives deviations from the • Not just selective neutrality optimal phenotype • Dobzhansky - heterogeneous environments / • Constant population size frequency dependent effects • Well mixed population (panmictic) • Lewontin-Hubby experiments (mid 1960s) – Too much variation for either explanation • This is always true • Kimura - neutral theory

Sampling with Replacement The Coalescent

• Some pass ACTT • Homologous genes on no copies to the share a common next generation, ancestor while some pass on T • DNA sequence more than one G diversity is shaped by t s

a genealogical history

P • All that we care about are the • Genealogies are ancestors of C G shaped by chance, sequences present demography, selection

in our dataset ACGT ACGT ACTT ACTT AGTT

Present

2 Balls in Boxes The Shapes of Genealogies

• The coalescent • Time to the MRCA of a models genealogies pair of sequences is Probability = 1/2N * (1-1/2N)3 exponentially t

s backwards in time a distributed with mean p E[T2] = 2N e h 2 t time of 2N generations

Probability = 1/2N * (1-1/2N) n i

s • Time to the next

n • Follow ancestral o i t coalescent event for a a

r Probability = 1/2N * (1-1/2N) lineages back until e n sample of n sequences e the most recent E[T3] = 2N/3 G is exponential with Probability = 1/2N common ancestor " n% E[T ] = 2N/6 mean 2N/ $ ' generations (MRCA) is reached 4 # 2& E[T ] = 2N/10 Present 5 !

Genealogies are highly variable The problem

• The on • Want to infer the underlying processes that the length of each have shaped genetic diversity, but portion of the genealogy is large, on the order of N2 • The inherent stochasticity means that any • Variation in given genealogy is consistent with a wide topology as well range of demographic processes • are random on top of the genealogy • How do we estimate parameters, and how do we know how good our estimates are?

3 Estimating N The Structured Coalescent

• Expected pairwise distance (π) – 2N times 2µ (= θ)

• Expected number of polymorphisms (S) n#1 1 Location Location " $ • With geographically structured populations, all i 1 i = topologies are not equally likely

!

The Structured Coalescent The Island Model

N N

N N

N N Low Migration High Migration • Each migrant is equally likely to come from any deme • The relationship between genealogy and • Population structure, but no geography geography can be used to make inferences " # " 1 F = 0 = ST " 1+ 4Nm

! 4 A finite, linear habitat The Solution

% % * * * 2 2 2 2 " f (x )cos(" x) #2$ " (t# 1 ) " j f j (x0 )sin(" j x) #2$ "* (t# 1 ) U (x,t) = i i 0 i e m i 2 + e m j 2 x &" + sin(" )cos(" ) &" * # sin(" * )cos(" * ) MRCA i=1 i i i j=1 j j j % % * * * 2 2 1 " f (y )sin(" y) 2 * 2 1 " i f i (y0)cos(" i y) #2$ m "i (t# 2 ) j j 0 j #2$ m " j (t# 2 ) Uy (y,t) = & e +& * * * e i=1 " i + sin(" i )cos(" i ) j=1 " j # sin(" j )cos(" j ) % 4N 2 1 ( 1)n (2 )2n#1 (n#1)! $ m " i + ' & # $ m" i (2n#1)! 1 1 2 2 2 2 cot(" ) = n=1 f (x ) = cos(" x)(e#(x#x0 ) / 4$ m + e#(2#x#x0 ) / 4$ m )dx i % n i 0 2 ) i 2 2 n 4' $ m #1 past 1+ &(#$ m " i ) (2m n=1 m=1 % 4N 2 * 1 ( 1)n (2 * )2n#1 (n#1)! $ m " j + ' & # $ m" j (2n#1)! 1 1 2 2 2 2 #tan(" * ) = n=1 f *(x ) = sin(" * x)(e#(x#x0 ) / 4$ m + e#(2#x#x0 ) / 4$ m )dx j % n j 0 2 ) j 2 2 * n 4' $ m #1 1+ &(#$ m " j ) (2m n=1 m=1

! Not trivial to extend to > 1 dimension present Not trivial to extend to > 2 sequences

Realistic Geography Coalescent Simulations

• In most systems of interest, analytic solutions are too cumbersome

• The coalescent provides an efficient framework in which to do simulations

• Must understand how to relate the forward-time system to a corresponding backward-time process

5 Coalescent Simulations Ancestral Selection Graph

• Dual Process • In the case of selective neutrality, the genealogical and mutational processes are separable – we can produce the genealogy, and then simply place • Lineages coalesce mutations on the tree afterwards " k% with probability $ ' /2N # 2& • But, if there is selection, then the reproductive success of an individual depends on its type • A lineage splits with ! probability sk/2

Extracting the genealogy Take-home messages

• The coalescent provides a convenient approach to modeling evolutionary processes – Well suited to dealing with data

• Analytic results are accessible only for very simple models • Move down the graph, allowing mutations to – In other cases, it produces efficient simulations occur • Choose the incoming branch with higher • Leaves the question of how to make inferences – Come back on Friday

6