The Coalescent Process
Total Page:16
File Type:pdf, Size:1020Kb
The coalescent process Introduction ‡ Random drift can be seen in several ways Forwards in time: variation in allele frequency Backwards in time: a process of inbreeding//coalescence Allele frequencies Random variation in reproduction causes random fluctuations in allele frequency: pq var p = ÅÅÅÅÅÅÅÅÅÅÅ 2 Ne After many generations, the distribution can be approximated by a diffusion. With random driftH L and mutation (PØQ at rate m, QØP at rate n) the equilibrium distribution is: prob p ~ p4 Ne n-1 q4 Ne m-1 -5 -5 The left-hand plot shows the distribution of p for Ne = 2, 500, n = 2.5 µ 10 , m = 5 µ 10 ; the right-hand plot is for N =20,000 e H L 7 2 6 5 1.5 4 1 3 2 0.5 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 The diffusion approximation can also include other forces, such as selection and migration. For example, the equilibrium distribution under mutation, random drift, and selection is: ê2 N prob p ~ p4 Ne n-1 q4 Ne m-1 W e êêê2 Ne 2 2 2 2 With heterozygote advantage (fitnesses 1-s;1:1-s), W = 1 - s p + q ~Exp -2 Ne s p + q H L -5 -5 With Ne = 2, 500, n = 2.5 µ 10 , m = 5 µ 10 , and s=0.0001, 0.001, 0.004 (left to right): H L @ H LD 2 Coalescent process.nb 5 0.8 1 4 0.8 0.6 3 0.6 2 0.4 0.4 1 0.2 0.2 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 ¤ The key parameters are Ne m, Ne n, Ne s, which give the strength of drift relative to mutation and selection. ü Further reading: Kimura, The neutral theory of molecular evolution, Chap.3 Identity by descent ‡ Definition Wright (1921, 1922), Haldane & Moshinsky (1939), Cotterman (1940) and Malécot (1948) developed the idea of identity by descent. Two genes are identical by descent if they descend from the same gene in some ancestral population. ü Note: - Identity by descent is distinct from identity in state - i.b.d. is defined relative to some ancestral reference population. - Identity measures can extend to many genes; usually, however, we just deal with identity between pairs of genes. This is related to variance of allele frequency, correlation between genes, and homozygosity - Relationships among many genes are better thought of in terms of coalescence of lineages in a genealogy. ‡ The probability of identity by descent is easily calculated for pedigrees e.g. brother-sister mating Coalescent process.nb 3 Genes are NOT ibd Probability of identity in this case by descent is 1/4 In general, the probability that two distinct genes in a diploid individual are i.b.d. is 1 n-1 f = ÅÅÅÅ 1 + fA , where the sum is over all loops in the pedigree, n is the number of individuals in loops 2 the loop, and fA the identity between genes in the common ancestor. Note ‚that the HrandomL H elementL here is in segregation, not reproduction 4 Coalescent process.nb ‡ The increase in i.b.d. with random mating ü Wright-Fisher model Suppose that there are 2 Nt individuals in a haploid population. In the next generation, there are 2 Nt+1 , drawn randomly from all 2 Nt possible parents. On this scheme, individuals produce a number of offspring which is close to a Poisson distribution. The Wright-Fisher model also applies to a random-mating diploid population, provided that individuals are as likely to mate with themselves as with anyone else. Then, the probability that two genes are i.b.d. from the previous generation is 1 2 Nt : 1 1 ft+1 = ÅÅÅÅÅÅÅÅÅÅÅ + 1 - ÅÅÅÅÅÅÅÅÅÅÅ ft f0 = 0 2 Nt 2 Nt ê 1 t-1 1 ht+1 ª 1 - ft = 1 - ÅÅÅÅÅÅÅÅÅÅÅ ht hence ht = 1 - ÅÅÅÅÅÅÅÅÅÅÅ J 2 NtN i=0 2 Ni With constant population size, ht declines by (1-1/2N) per generation - approximately, as ~exp(-t/2N). The typical timescale for inbreedingJ andN random drift is 2N generations.‰ J N With fluctuating sizes, h declines (approximately) as exp - t-1 ÅÅÅÅ1ÅÅÅÅ = exp -t 2 N where N is t i=0 2 Ni H H the harmonic mean population size. H H⁄ LL H ê L Coalescence The ancestry of a sample of neutral genes has a simple statistical distribution: the chance that any two lineages coalesce is ÅÅÅÅÅÅÅÅ1 ÅÅÅ per generation 2 Ne - - - - More precisely: - suppose that each gene leaves v descendants var n - As Nض, the probability that any pair of lineages coalesce, per generation, tends to ÅÅÅÅÅÅÅÅ2 NÅÅÅÅÅÅ i.e. Ne = N var n H L The coalescent process refers to this limit - equivalent to the diffusionê H L approximation An influential idea: - DNA sequences are best described by their genealogy - a variety of mutation models can be superimposed - tracing back samples of alleles - speeds up simulations - gives statistical tests on sampled data Coalescent process.nb 5 An influential idea: - DNA sequences are best described by their genealogy - a variety of mutation models can be superimposed - tracing back samples of alleles - speeds up simulations - gives statistical tests on sampled data ü References Hudson, R. (1990). Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 1-44. Hudson, R. (1993). The how and why of generating gene genealogies. In Mechanisms of molecular evolution, ed. Takahata N & Clark AG, pp 23-36. Donnelly, P. and S. Tavaré. (1995). Coalescents and genealogical structure under neutrality. Ann. Rev. Genet. 29, 401-421. Rosenberg, N. A., and M. Nordborg, 2002 Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3: 380-390. ‡ Properties of the coalescent process 1 2 Ne The time during which there are k lineages is exponentially distributed with expectation ÅÅlÅÅ = ÅÅÅÅÅÅÅÅk k-ÅÅÅÅÅÅÅÅ1 ÅÅ2ÅÅ : k k - 1 H Lê P tk = Exp -l tk l „tk where l = ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ 4 Ne ü The genealogy is dominated by the deepest split. H L H L @ D The expected depth of the tree is: 2 2 1 1 2 N ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ + ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ … ÅÅÅÅ + ÅÅÅÅ + 1 = e k k - 1 k - 1 k - 2 6 3 2 2 2 2 2 2 2 2 2 N ÅÅÅÅÅÅÅÅÅÅÅÅ - ÅÅÅÅ + ÅÅÅÅÅÅÅÅÅÅÅÅ - ÅÅÅÅÅÅÅÅÅÅÅÅ + … ÅÅÅÅ - ÅÅÅÅ + ÅÅÅÅ - ÅÅÅÅ = e k - 1 k k - 2 k - 1 2 3 1 2 J 2 N H 2 N L 1 -H ÅÅÅÅ +L1H ~4 NL for large k e k e JJ N J N J N J NN Thus, the tree collapses to 2 lineages in ~ 2 Ne generations; these take another 2 Ne generations to coalesce Hence, pairwise measures areJJ uninformatN ivNe ü The expected length of the genealogy is ~ 4 Ne Log 1.78 k The expected length of the tree is: @ D 2 2 4 3 2 N k ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ + k - 1 ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ … ÅÅÅÅ + ÅÅÅÅ + 2 e k k - 1 k - 1 k - 2 6 3 2 2 2 2 2 = 2 N ÅÅÅÅÅÅÅÅÅÅÅÅ + ÅÅÅÅÅÅÅÅÅÅÅÅ + … ÅÅÅÅ + ÅÅÅÅ + ÅÅÅÅ e k - 1 k - 2 3 2 1 J H L N k-1 H L H L H L 1 = 4 N ÅÅÅÅ e j Jj=1 N ~4 Ne Log 1.78 k for large k ‚ The distribution of length is highly variable: @ D The dots show the quantiles at 0.001, 0.01, 0.1, 0.9, 0.99, 0.999. 6 Coalescent process.nb L 20 10 5 2 1 n 5 10 20 50 Figure 1 ‡ Fluctuating population size Changes in Ne cause changes in timescale The standard coalescent Expanding populations Ø "star phylogeny" exponential growth: popl'n was 10% of the current size at TMRCA Coalescent process.nb 7 Population bottlenecks Ø burst of coalescence a bottleneck equivalent to 2 Ne 'ordinary' generations of drift ü Changing timescales The "scaled time" is a measure of the total amount of genetic drift that has occurred: t „t T = ÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅÅ 0 2 N t For a constant population size, T = t 2 N . If the population is growing at a rate l, and the present size is -lt N0 , then N = N0 ‰ , and so: ‡ H L t ‰lt 1 T = ÅÅÅÅÅÅÅÅÅÅÅ „t = ÅÅÅÅÅÅÅÅÅÅÅÅÅÅ ê H‰ltL - 1 0 2 N0 2 N0 l The parameter l is a measure of the amount of population growth over the current timescale set by population size, ‡2 N0 . Here is the transformationH for l = 1.5 L 8 Coalescent process.nb scaled time 12 10 8 6 4 2 actual time 0.5 1 1.5 2 ‡ Branching processes The coalescent process only applies to samples from a large population If all genes are observed, we have a branching process e.g. discrete time: # of offspring i follows a Poisson distribution with E i = l @ D P 1 l 1 2 3 4 More generally, for l~1, P ~ 2 l - 1 var i H L ê H L Coalescent process.nb 9 1920 18 16 15 17 14 1310 12 7 119 8 6 4 1 5 2 3 t d i s o q f p c k h m j a e n r l b g coalescent 17 1920 18 16 13 15 14 12 11 9 8 10 7 5 6 4 3 2 1 sample from o b s d g i t j k r a m p n e q c f h l a branching process l = 1.1 Mutation ‡ Infinite alleles Assuming that every mutation generates a new allele, the probability of identity in allelic state 2 t ("homozygosity") is F = t ft 1 - m , where ft is the distribution of coalescence times.