
An introduction to

Nicolas Lartillot

May 26, 2014

Nicolas Lartillot (CNRS - Univ. Lyon 1) Coalescent May 26, 2014 1 / 39 De Novo in Animals

Table 3. Coding sequence and divergence patterns in five non-model animals.

species   #contigs   #SNPs    πS (%)        πN (%)         πN/πS         dN/dS         α              α0.2          αEWK   ωA
turtle    1 041      2 532    0.43 ± 0.03   0.05 ± 0.007   0.12 ± 0.02   0.17 ± 0.03   0.01 ± 0.18    0.43 ± 0.15   0.92   0.17
hare      524        2 054    0.38 ± 0.04   0.05 ± 0.008   0.12 ± 0.02   0.15 ± 0.03   −0.11 ± 0.22   0.30 ± 0.23   <0     <0
ciona     2 004      11 727   1.58 ± 0.06   0.15 ± 0.011   0.10 ± 0.01   0.10 ± 0.01   −0.28 ± 0.10   0.10 ± 0.11   0.34   0.04
termite   4 761      5 478    0.12 ± 0.01   0.02 ± 0.002   0.18 ± 0.02   0.26 ± 0.02   0.08 ± 0.10    0.28 ± 0.11   0.74   0.20
oyster    994        3 015    0.59 ± 0.05   0.09 ± 0.011   0.15 ± 0.02   0.21 ± 0.02   0.13 ± 0.12    0.22 ± 0.13   0.79   0.21

doi:10.1371/journal.pgen.1003457.t003

Explaining genetic variation

Figure 4. Published estimates of genome-wide πS, πN and πN/πS in animals. a. πN as function of πS; b. πN/πS as function of πS; Blue: vertebrates; Red: invertebrates; Full circles: species analysed in this study, designated by their upper-case initial (H: hare; Tu: turtle; O: oyster; Te: termite; C: ciona); Dashed blue circles: non-primate mammals (from left to right: mouse, tupaia, rabbit). Estimates were taken from Bustamante et al. 2005 (human), Hvilsom et al 2012 (chimpanzee), Carneiro et al 2012 (rabbit), Perry et al 2012 (other mammals), Begun et al 2007 (D. simulans) and Tsagkogeorga et al 2012 (C. intestinalis B = right-most circle). doi:10.1371/journal.pgen.1003457.g004

Gayral et al, 2013, PLoS Genetics 9:e1003457

Fraction of heterozygous sites in nuclear genomes
- humans: 0.1% (1 every 1000 positions)
- Drosophila: 3.3% (1 every 30 nucleotide positions)

what determines levels of within-species genetic variation?

Viral phylogenies

- within host: Lemey et al, 2007, PLoS Comput Biol 3:e29
- within and between hosts: Rambaut et al, 2004, Nature Rev Genet 5:52
- what tree shape can say about infection dynamics
- what global tree can tell us about the global pandemics

Figure 1 (Lemey et al, 2007). Internal, Backbone, and External Branches in a Within-Host HIV Genealogy, and Mean Nucleotide Substitution Rates for These Branches in Nine Longitudinally Sampled HIV-1 Patients. To avoid the influence of deleterious mutations segregating to external branches in intrapatient HIV genealogies, we estimate mean substitution rates for the set of internal and backbone branches. These branch sets are depicted in color in the maximum a posteriori tree for "patient 1" obtained by Bayesian relaxed-clock inference [26] (backbone, red; internal, blue; external, black). The backbone represents the central trunk of trees shaped by rapid lineage turnover and can be defined phylogenetically (see Methods). Note that the set of internal branches also includes the backbone branches. Samples for each time point are indicated by the dotted line. Mean nucleotide substitution rates and their standard deviations on internal, external, and backbone branches are shown for all longitudinally sampled HIV-1 patients. The consistently higher substitution rate on external branches might be indicative of higher load on these branches. doi:10.1371/journal.pcbi.0030029.g001

Figure 4 (Rambaut et al, 2004) | Contrasting patterns of intra- and inter-host evolution of HIV. The tree was constructed using the NEIGHBOUR-JOINING METHOD on envelope gene-sequence data that was taken from nine HIV-infected patients (ref. 48) (a total of 1,195 sequences, 822 base pairs in length), with those viruses sampled from each patient depicted by a different colour. In each case, intra-host HIV evolution is characterized by continual immune-driven selection, such that there is a successive selective replacement of strains through time, with relatively little genetic diversity at any time point. By contrast, there is little evidence for positive selection at the population level (bold lines connecting patients), so that multiple lineages are able to coexist at any time point. A major BOTTLENECK is also likely to occur when the virus is transmitted to new hosts.

Inferring population history from haplotype data

Hein, Schierup and Wiuf, 2005

- a set of n haplotypes randomly sampled from a population
- sequences of length L, known mutation rate µ
- what can we say about
  - population size (N) and structure?
  - demographic history?
  - selection?

Approach
- define a model of demography and reproduction (Wright-Fisher)
- define a model of DNA sequence mutations
- explain genetic variation based on this two-level model

Applications
- estimating parameters (population size, mutation rate)
- testing hypotheses (e.g. deviation from neutrality)
- building blocks for more sophisticated models

The Wright-Fisher model

[Figure: successive generations of a Wright-Fisher population, with time on the vertical axis; from Felsenstein]

Assumptions
- panmictic population
- constant population size
- neutrality

each offspring 'chooses' its parent uniformly at random in the previous generation
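The parent-choosing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the population of 20 gene copies and the number of replicates are hypothetical choices, not taken from the slides. Because generations are drawn independently, each generation is drawn at the moment the sampled lineages are traced back through it.

```python
import random

def draw_parents(num_copies, rng):
    """One Wright-Fisher generation: every copy picks its parent uniformly at random."""
    return [rng.randrange(num_copies) for _ in range(num_copies)]

def generations_to_common_ancestor(sample, num_copies, rng, max_generations=100_000):
    """Follow the sampled copies backward until they share a single ancestor."""
    lineages = set(sample)
    for generation in range(1, max_generations + 1):
        parent_of = draw_parents(num_copies, rng)
        # Copies that picked the same parent merge into one lineage.
        lineages = {parent_of[i] for i in lineages}
        if len(lineages) == 1:
            return generation
    return None  # no common ancestor found within max_generations

rng = random.Random(2014)
num_copies = 20  # plays the role of 2N gene copies (hypothetical, kept small)
times = [generations_to_common_ancestor([0, 1], num_copies, rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
print(mean_time)  # expected to be close to the number of gene copies
```

With two sampled copies, the average waiting time should be close to the number of gene copies, which is the 2N result discussed for the coalescence of two genes.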

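Genealogy of a sample

[Figure: genealogy of a sample traced back through the Wright-Fisher population, with time on the vertical axis; from Felsenstein]

- n individuals taken at random in the present (here n = 3)
- age of their ancestor?
- typical shape of the genealogy?

Coalescence of n = 2 genes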

www.coalescent.dk

- prob. of coalescence in previous generation: 1/(2N)
- average coalescence time for 2 individuals: T = 2N
- average total length separating the 2 sequences: 2T = 4N

Genetic diversity and coalescence time

[Figure: two sequences, AACAGT and ATCACG, sampled at time 0 and joined by their common ancestor at time T]

- time since last common ancestor: T generations
- sequences of length L, known mutation rate µ
- mean fraction of sites differing between 2 individuals: π = 2µT

Coalescence of n = 2 genes

www.coalescent.dk

with mutation
- mutations at rate µ per base pair per generation
- average diversity: π = 2Tµ = 2 · 2N · µ = 4Nµ = θ
- θ: scaled mutation rate (N and µ are confounded)
- yields an estimate of N if µ is known and π is observed

Tajima's estimator

n = 4 observed DNA sequences

1  A C C A C A A G
2  A C C A G T A G
3  A C T G C A T G
4  A C T G G T A C

- π_ij: fraction of polymorphic sites between haplotypes i and j
- π̂: average over all possible pairs of haplotypes:

\[ \hat{\pi} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} \]

yields a simple estimator of θ = 4Nµ
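As a worked example, the estimator can be applied to the four sequences shown on this slide. The sketch below is one possible way to write it; the function names are illustrative and not part of the original material.

```python
from itertools import combinations

# The n = 4 sequences from the slide, written without spaces.
sequences = ["ACCACAAG", "ACCAGTAG", "ACTGCATG", "ACTGGTAC"]

def pairwise_fraction(a, b):
    """pi_ij: fraction of sites at which two aligned haplotypes differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def tajima_pi(seqs):
    """Average of pi_ij over all n(n-1)/2 pairs of haplotypes."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_fraction(a, b) for a, b in pairs) / len(pairs)

print(round(tajima_pi(sequences), 4))  # 0.4583
```

The six pairs differ at 2, 3, 5, 5, 3 and 4 of the 8 sites, so the estimate is 22/48 per site.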

Effective population size of humans

Human-chimp divergence
- SND (single nucleotide differences): ≈ 2%
- divergence time: ≈ 6 Ma
- thus, mutation rate: µ ≈ 3 × 10^-8

Human polymorphism
- heterozygosity: π = 0.001 (1 every 1000 bp)
- SNP (single nucleotide polymorphisms): 1 every 100 to 300 bp

π = 4Nµ, so N = π/(4µ) ≈ 10 000
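The arithmetic behind these numbers can be retraced as follows. The slide does not give a generation time, so the value of 20 years used here is an assumption introduced only to convert a per-year rate into a per-generation rate; the variable names are likewise illustrative.

```python
divergence = 0.02        # human-chimp single nucleotide differences (slide value)
divergence_time = 6e6    # years since the split (slide value)
generation_time = 20     # years per generation: an assumption, not given on the slide
pi = 0.001               # human heterozygosity (slide value)

# Differences accumulate along both lineages since the split, hence the factor 2.
mu_per_year = divergence / (2 * divergence_time)
mu_per_generation = mu_per_year * generation_time

# N = pi / (4 mu), once with the rate derived above, once with the slide's rounded 3e-8.
n_from_derived_mu = pi / (4 * mu_per_generation)
n_from_slide_mu = pi / (4 * 3e-8)
print(mu_per_generation, n_from_derived_mu, n_from_slide_mu)
```

Both routes give an effective size of order 10^4, far below the census population size.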

effective population size < census population size

Effective population size

Genetic aspects
- autosomal: 2N
- mitochondrial, Y: N
- X chromosome: 3/2 N

Demographic aspects
- demographic fluctuations modulate coalescence rate
- coalescence events more likely at population bottlenecks
- reproductive (species with male dominance have low N)
- population structure (e.g. a parasite has N of its host)

Linkage and selection
- selection at linked loci reduces N at neutral loci
- purifying selection:
- positive selection: selective sweeps

Nucleotide diversity across life forms

Lynch and Conery, Science 2003; 302:1401

Effective population sizes across life forms

Mutation rates (per generation)
- human (nuclear genome): ≈ 10^-8
- fly, nematode (nuclear genome): ≈ 10^-9
- unicellular eukaryotes and prokaryotes: ≈ 10^-10

Effective population sizes
- human, large vertebrates: 10^4
- small vertebrates: 10^5
- invertebrates, terrestrial plants: 10^6
- unicellular eukaryotes: 10^7
- prokaryotes: > 10^8

Population size and evolutionary genomics

[Figure: page 1403 of Lynch and Conery, Science, vol. 302, 21 November 2003, including Fig. 3: the relationship between average intron size (solid circles) in base pairs (bp) and intron number (open circles) and genome size. The regression for intron size is highly significant, with an intercept of 1.41 ± 0.36, a slope of 0.51 ± 0.10, and r² = 0.641, df = 16.]

Effective size and selection
- random drift proportional to 1/N
- selection efficient only if s >> 1/N

Evolutionary genomics
- small N: random drift dominates molecular evolution in humans
- many features selected in fly/yeast/E. coli, not selected in humans
- genome structure influenced by population genetics parameters

Lynch and Conery, Science 2003; 302:1401

Distribution of age of ancestor

www.coalescent.dk

- prob. of coalescence in previous generation: 1/(2N)
- prob. of coalescence in 2 generations: (1 − 1/(2N)) (1/(2N))
- prob. of coalescence in t generations: (1 − 1/(2N))^(t−1) (1/(2N))
- t has a geometric distribution

[Figure: "Histogram of x10000", density on the vertical axis, values from 0 to 10, compared with the Exponential(1) density: u = t/2N, p(u) = e^(−u); www.coalescent.dk]

- age of ancestor of 2 individuals has geometric distribution
- for n << N, approx. an exponential distribution
- mean of t2 is 2N (std dev of t2 is 2N)
- rescaling: u2 = t2/(2N) has mean 1 and std dev 1

Generalization for n > 2

[Figure: genealogy of 5 sampled genes (tips 1 to 5), with the intervals t_n, t_{n−1}, ..., t_2 marked from the present (time 0) back to the root]

rate of coalescence

\[ r_2 = \frac{1}{2N}, \qquad r_j = \binom{j}{2} \frac{1}{2N} = \frac{j(j-1)}{4N} \]

mean coalescence times

\[ t_2 \simeq 2N, \qquad t_j \simeq \frac{4N}{j(j-1)}, \quad j = 2..n \]

in rescaled time units

\[ u_2 \simeq 1, \qquad u_j \simeq \frac{2}{j(j-1)}, \quad j = 2..n \]

distribution of coal. times

\[ t_j \sim \mathrm{Exp}\left( \text{mean} = \frac{4N}{j(j-1)} \right) \]

in rescaled time units

\[ u_j \sim \mathrm{Exp}\left( \text{mean} = \frac{2}{j(j-1)} \right) \]
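Since each waiting time is exponential, a genealogy can be drawn backward in time without simulating the whole population. The sketch below is a minimal version under stated assumptions: the function name, the nested-parentheses labels for merged lineages and the sample size of 5 are illustrative choices. Times are in rescaled units of 2N generations.

```python
import random

def draw_genealogy(n, rng):
    """Draw one coalescent genealogy for n sampled genes.

    Returns a list of (j, u_j, merged_label) events, from j = n down to j = 2,
    where u_j is the time spent with j lineages (in units of 2N generations).
    """
    lineages = [str(i) for i in range(1, n + 1)]
    events = []
    while len(lineages) > 1:
        j = len(lineages)
        # Rate j(j-1)/2 in rescaled time, i.e. mean 2 / (j(j-1)).
        u_j = rng.expovariate(j * (j - 1) / 2)
        # The pair that coalesces is chosen uniformly among the j lineages.
        a, b = rng.sample(range(j), 2)
        merged = f"({lineages[a]},{lineages[b]})"
        lineages = [x for k, x in enumerate(lineages) if k not in (a, b)] + [merged]
        events.append((j, u_j, merged))
    return events

rng = random.Random(42)
for j, u_j, merged in draw_genealogy(5, rng):
    print(j, round(u_j, 3), merged)
```

Summing the u_j of one draw gives the age of the root of that genealogy; repeating the draw many times shows how variable the deepest interval u_2 is.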

Drawing from the coalescent

[Figure: simulated genealogy of 5 genes (tips 1 to 5), with the intervals u_n, u_{n−1}, ..., u_2 along the time axis, starting from time 0]

Forward versus backward simulation
- forward: Wright-Fisher simulation + backtracking of ancestors
- backward: Kingman's coalescent: drawing exponential times
- equivalence for large N
- but Kingman's approach more efficient ... and more insightful

Drawing from the coalescent

- large variability of deep branches
- high uncertainty on estimates based on one locus
- suggests approaches averaging over several independent loci

What is coalescent theory useful for

Theory
- obtaining insights about patterns in sequence variation
- deriving theoretical expectations (e.g. age of sample's last common ancestor)

Simulations
- null distribution for hypothesis testing
- detecting departures from neutrality (selection)

Parameter estimation
- estimating θ = 4Nµ based on observed polymorphism
- estimating demographic scenarios

Mean age of most recent common ancestor (MRCA)

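[Figure: genealogy of 5 genes (tips 1 to 5), with the intervals u_n, u_{n−1}, ..., u_2 from the present (time 0) back to the MRCA]

\[ T_n = u_n + u_{n-1} + \dots + u_2, \qquad E[T_n] = 2\left(1 - \frac{1}{n}\right) \]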
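The closed form follows from the mean coalescence times in rescaled units, E[u_j] = 2/(j(j−1)), by a telescoping sum. A short derivation, using only the symbols already defined on the slides:

```latex
\begin{align*}
E[T_n] &= \sum_{j=2}^{n} E[u_j]
        = \sum_{j=2}^{n} \frac{2}{j(j-1)} \\
       &= 2 \sum_{j=2}^{n} \left( \frac{1}{j-1} - \frac{1}{j} \right)
        = 2 \left( 1 - \frac{1}{n} \right)
\end{align*}
```

For n = 2 this gives 1 (that is, 2N generations), and the value approaches 2 (4N generations) as n grows, so half of the expected depth of a large genealogy is spent waiting for the last two lineages to coalesce.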

- expected MRCA age reaches a limit (4N generations) for large n
- intra-specific variation gives access to relatively shallow past
- in contrast to interspecific divergence (human-chimp: 6 Myr)

Age of most recent common ancestor

- mitochondrial: 200 000 years (Soares et al, 2009, Am J Human Genet 84:740)
- Y chromosome: 55 000 years (Thomson et al, 2000, PNAS, 97:7360)
- nuclear genome: variation along genome (because of recombination)

Genealogies and recombination

Box 2 | The coalescent
Here we introduce the most popular model: the coalescent. We begin by introducing the simplest form, in which there is no recombination, and then discuss the version that applies in a more realistic setting.

Coalescent without recombination
Panels a–c illustrate the intuition that underlies the coalescent using a population of DNA fragments that are evolving according to a Wright–Fisher model, that is, in the absence of recombination, in a population of constant size. Panel a shows a schematic of an evolving population. In this simplified representation of evolution, each row corresponds to a single generation, and each blue circle denotes a fragment in that generation. Generations are replaced in their entirety by their offspring, with arrows running from the parental fragment to the offspring fragment. The present day is represented by the bottom row, with each higher row representing one generation further back into the past. Panel b indicates the ancestry of a sample from the present day. In this example, six fragments, indicated in red, are sampled from the current generation. The ancestry of this sample is then traced back in time (that is, up the page), and is indicated in red. Panel c highlights one of the key features of the coalescent: all information outside the ancestry of the sample of interest can be ignored. The coalescent provides a mathematical description of the ancestry of the sample. As we move back in time, the number of lines of ancestry decreases until, ultimately, a single line remains. The most recent fragment from which the entire sample is descended is known as the 'most recent common ancestor' (MRCA), whereas the time at which the MRCA appears is known as the 'time to the most recent common ancestor' (TMRCA).

Coalescent with recombination
The coalescent with recombination is illustrated in panel d. In such settings, lines bifurcate, as well as coalesce (join), as we move back in time. Here we show the genealogy for three copies of a fragment. By tracing the lineages back in time, we observe the following events: in event 1 the green lineage undergoes recombination and splits into two lineages, which are then traced separately; in event 2 one of the resulting green lineages coalesces with the red lineage, creating a segment that is partially ancestral to both green and red, and partially ancestral to red only; in event 3 the blue lineage coalesces with the lineage created by event 2, creating a segment that is partially ancestral to blue and red, and partially ancestral to all three colours; in event 4 the other part of the green lineage coalesces with the lineage created by event 3, creating a segment that is ancestral to all three colours in its entirety. As the inset shows, the recombination event induces different genealogical trees on either side of the break. Coalescent methods have been reviewed extensively (refs 20–22), and there are now book-length treatments (refs 97, 98) to which the reader is referred for further details.

[Figure: panels a to d. a: generations of DNA fragments, time running up from the present day; b: sample taken in the present day; c: MRCA and TMRCA; d: events 1 to 4 (recombination, coalescence), break point, induced trees]

Panel d is modified with permission from REF. 89 © (2002) Elsevier.

Marjoram and Tavaré, 2006, Nat Rev Genet, 7:759

Genealogies and recombination

mutation and recombination. We now need a population-genetics model that incorporates these principles and that allows us to construct and analyse random genealogies. The coalescent has become the standard model for this purpose. This choice is not arbitrary, as the coalescent is a natural extension of classical population-genetics theory and models (ref. 7). It was discovered independently by several authors in the early 1980s (refs 12–15), although the definitive treatment is due to Kingman (refs 12, 16).

The basic idea underlying the coalescent is that, in the absence of selection, sampled lineages can be viewed as randomly 'picking' their parents, as we go back in time (FIG. 4). Whenever two lineages pick the same parent, their lineages coalesce. Eventually, all lineages coalesce into a single lineage, the MRCA of the sample. The rate at which lineages coalesce depends on how many lineages are picking their parents (the more lineages, the faster the rate) and on the size of the population (the more parents to choose from, the slower the rate).

[Figure: panel A: time to MRCA (vertical axis, about 1.5 to 3.5) against position in kilobases (horizontal axis, 0 to 10), with positions a to e marked; panel B: gene trees a to e for haplotypes 1 to 6]

Figure 3 | A simulated sample of six haplotypes using the standard coalescent with recombination. A | In the top graph, the red line shows how the time to the most recent common ancestor (MRCA) (in units of coalescence time; 1 unit corresponds to 2N generations, if N is the size of the population) varies along the chromosome as a result of recombination. The parameters were chosen to represent ~10 kb of human DNA. The crosses along this line mark positions at which recombination took place in the history of the sample. Note that only a fraction of the recombination events resulted in a change of the time to the MRCA. B | A selection of gene trees (a–e) that correspond to specific positions along the chromosome (a–e) is shown. Trees for closely linked regions tend to be very similar (for example, c and d), if not identical. Numbers 1–6 represent individual haplotypes.

Rosenberg and Nordborg, 2002, Nat Rev Genet, 3:380

Total length of the genealogy

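[Figure: genealogy of 5 genes (tips 1 to 5), with the intervals u_n, u_{n−1}, ..., u_2 starting from time 0]

\[ L_n = \sum_{j=2}^{n} j\, u_j \]

\[ E[L_n] = \sum_{j=2}^{n} j \cdot \frac{2}{j(j-1)} = 2 \sum_{j=2}^{n} \frac{1}{j-1} \]

for large n: E[L_n] ∼ 2 ln n (slow increase)

Estimating θ = 4Nµ: Watterson's estimator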

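[Figure: genealogy of five sequences (1–5), with the intervals $u_2, \dots, u_{n-1}, u_n$ marked.]

\[ L_n = \sum_{j=2}^{n} j\,u_j, \qquad E[L_n] = 2 \sum_{j=2}^{n} \frac{1}{j-1} \]

$S_n$: number of sites segregating in the sample

low mutation rate: $S_n$ = total # mutations along genealogy

\[ E[S_n] = 2N\mu\,E[L_n] = \frac{\theta\,E[L_n]}{2} \]

\[ \hat{\theta} = \frac{2 S_n}{E[L_n]} \]

Estimating θ = 4Nµ: Tajima versus Watterson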

Tajima’s estimator of scaled mutation rate
mean fraction of polymorphic sites between pairs of sequences

\[ \hat{\pi} = \frac{2}{n(n-1)} \sum_{i<j} \pi_{ij} \]
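Both estimators can be computed directly from a sample of aligned haplotypes. The sketch below uses a made-up toy sample of four three-site haplotypes and hypothetical function names; it returns per-sequence values (divide by the sequence length for per-site values) and assumes at most one mutation per site.

```python
from itertools import combinations

def watterson_theta(haplotypes):
    """theta_hat = 2 S_n / E[L_n] = S_n / sum_{j=1}^{n-1} 1/j."""
    n = len(haplotypes)
    seg_sites = sum(1 for site in zip(*haplotypes) if len(set(site)) > 1)
    a1 = sum(1.0 / j for j in range(1, n))  # equals E[L_n] / 2
    return seg_sites / a1

def tajima_pi(haplotypes):
    """pi_hat = 2 / (n (n - 1)) * sum over pairs i < j of pairwise differences."""
    n = len(haplotypes)
    total = sum(sum(a != b for a, b in zip(x, y))
                for x, y in combinations(haplotypes, 2))
    return 2.0 * total / (n * (n - 1))

sample = ["CGA", "CGG", "GGA", "CAG"]  # illustrative data, not a real data set
print(watterson_theta(sample), tajima_pi(sample))
```

Under the standard neutral model with constant population size, both quantities estimate the same θ, which is what makes their difference informative.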

Variance of the two estimators

Felsenstein 1992

Tajima’s estimator is not consistent
Watterson’s estimator consistent but not optimal
maximum likelihood (see later) optimal and more general

Demography and population structure

[Figure: embedded slides from another course (Cours BCM6215, 14-1-2010): alignment of one chimpanzee and four human sequences, the corresponding haplotypes H1–H4, a haplotype network, and shapes of genealogies (genetree).]

changes in population size induce changes in rate of coalescence
at time t, rate of coalescence of j lineages is $j(j-1)/(4N(t))$
increasing population: comparatively higher rates in distant past
decreasing population: comparatively higher rates near present
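The time-dependent rate $j(j-1)/(4N(t))$ can be explored with a generation-by-generation simulation. This is a rough sketch, assuming at most one coalescence per generation (reasonable when the rate is small) and a hypothetical callback `pop_size_at(t)` giving N at t generations before the present.

```python
import math
import random

def coalescence_times(n, pop_size_at, rng):
    """Generations before present at which the n - 1 coalescences occur,
    using per-generation probability j (j - 1) / (4 N(t))."""
    times, t, j = [], 0, n
    while j > 1:
        t += 1
        p = min(1.0, j * (j - 1) / (4.0 * pop_size_at(t)))
        if rng.random() < p:
            times.append(t)
            j -= 1
    return times

rng = random.Random(7)
constant = lambda t: 500.0
# population growing forward in time = shrinking as we go back (floor at 10)
growing = lambda t: max(10.0, 500.0 * math.exp(-0.01 * t))
for label, size in (("constant", constant), ("growing", growing)):
    depths = [coalescence_times(10, size, rng)[-1] for _ in range(200)]
    print(label, sum(depths) / len(depths))
```

With growth, the deep coalescences are expected to happen sooner than under constant size, because N(t) is small in the distant past.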

Tajima’s $\hat{\pi}$ and Watterson’s $\hat{\theta}$ respond differently to changes in N
increasing population size: $d = \hat{\pi} - \hat{\theta} < 0$
decreasing population size: $d = \hat{\pi} - \hat{\theta} > 0$
Tajima’s $D = d / \sqrt{\hat{V}(d)}$
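Tajima's D can be computed from three summary numbers. The sketch below uses the variance estimate from Tajima (1989), which is not spelled out on the slide; the function and argument names are illustrative. Here `pi` is the mean number of pairwise differences per sequence and `seg_sites` is $S_n$.

```python
import math

def tajimas_d(n, seg_sites, pi):
    """D = (pi - S_n / a1) / sqrt(e1 S_n + e2 S_n (S_n - 1))."""
    if n < 2:
        raise ValueError("need at least two sequences")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:  # no segregating sites, or n = 2
        return float("nan")
    d = pi - seg_sites / a1  # pi_hat minus Watterson's theta_hat
    return d / math.sqrt(variance)

print(tajimas_d(10, 20, 4.0))   # few pairwise differences relative to S_n
print(tajimas_d(10, 20, 10.0))  # many pairwise differences relative to S_n
```

A negative value reflects an excess of rare variants (as after population growth); a positive value reflects an excess of intermediate-frequency variants.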

Hypothesis testing using Tajima’s D

Principle
estimate $\hat{\pi}$ and $\hat{\theta}$, compute D
simulate genealogies and distribute mutations over them with rate $\hat{\theta}$
on each replicate, estimate $\hat{\pi}$ and $\hat{\theta}$, compute D: null distribution

Empirical findings
negative D for non-Africans using mitochondrial genomes
however, not observed in nuclear genomes
suggests complex demography (or other confounding effects)

Limitations
significant deviation: departure from any assumption
could be demography or selection or admixture
selection (D < 0: directional selection, D > 0: balancing selection)
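The simulation principle can be sketched as follows, assuming the standard neutral coalescent with constant population size, the infinite-sites model, and the Tajima (1989) variance constants. All function names are hypothetical. Time is in units of 2N generations, so mutations fall on each branch at rate θ/2.

```python
import math
import random

def tajimas_d(n, s, pi):
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1 ** 2)
    var = c1 / a1 * s + c2 / (a1 ** 2 + a2) * s * (s - 1)
    return (pi - s / a1) / math.sqrt(var)

def poisson(mean, rng):
    """Count arrivals of a rate-1 Poisson process in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_s_and_pi(n, theta, rng):
    """One neutral genealogy with mutations; returns (S_n, pi_hat)."""
    sizes = [1] * n  # sampled descendants below each ancestral lineage
    s, pair_diffs = 0, 0
    while len(sizes) > 1:
        j = len(sizes)
        u = rng.expovariate(j * (j - 1) / 2.0)  # time with j lineages
        for k in sizes:
            m = poisson(theta / 2.0 * u, rng)  # mutations on this branch
            s += m
            pair_diffs += m * k * (n - k)  # each splits k versus n - k samples
        a, b = rng.sample(range(j), 2)  # two lineages coalesce
        merged = sizes[a] + sizes[b]
        sizes = [k for i, k in enumerate(sizes) if i not in (a, b)] + [merged]
    return s, pair_diffs / (n * (n - 1) / 2.0)

def null_distribution(n, theta, replicates, rng):
    ds = []
    while len(ds) < replicates:
        s, pi = simulate_s_and_pi(n, theta, rng)
        if s > 0:  # D is undefined without segregating sites
            ds.append(tajimas_d(n, s, pi))
    return ds

def lower_tail_p(observed_d, null):
    return sum(d <= observed_d for d in null) / len(null)

null = null_distribution(10, 5.0, 1000, random.Random(3))
print(lower_tail_p(-1.8, null))
```

Replacing the constant-size genealogy simulator with one that includes a demographic model would give a null distribution for that model instead.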

Tajima’s D and selection

in explaining the pattern of variability within and between species (39, 59).

Much of the theoretical literature in population genetics over the past 50 years has focused on developing and analyzing models that generalize the previously mentioned basic di-allelic models to models where more than two alleles may be segregating, where multiple mutations may arise and interact, possibly in the presence of recombination, where the environment may be changing through time, and where random genetic drift may be acting in populations subject to various demographic forces (25, 39). From theory alone we have gained many valuable insights, including the fact that the efficacy of selection depends not only on the selection coefficient, but primarily on the product of the selection coefficient and the effective population size. An increased effect of selection may be due to either an increased population size or a larger selection coefficient. Among other important findings is that balancing selection may occur for many reasons other than overdominance, (e.g., fluctuating environmental conditions) and could therefore, potentially, be quite common (38, 39). However, the efficacy of selection will tend to be reduced when multiple selected alleles are segregating simultaneously in the genome. The mutations will tend to interfere with each other and reduce the local effective population size (8, 29, 40, 57). Many population geneticists used to believe that the number of selective deaths required to maintain large amounts of selection would have to be so large that selection would probably play a very small role in shaping genetic variation (43, 60, 61). These types of arguments, known as genetic load arguments, were instrumental in the development of the neutral theory. However, the amount of selection that a genome can permit depends on the way mutations interact in their effect on organismal fitness and on several other critical model assumptions (25, 62, 71, 107). Population genetic theory does not exclude the possibility that selection is very pervasive and cannot alone determine the relative importance and modality of selection in the absence of data from real living organisms (25, 39).

Much excitement currently exists in the population genetics communities over the fact that many predictions generated from the theory may now be tested in the context of the large genomic data sets. In particular, we should be able to detect the molecular signatures of new, strongly selected advantageous mutations that have recently become fixed (reached a frequency of one in the population). As these mutations increase in frequency, they tend to reduce variation in the neighboring region where neutral variants are segregating (13, 51, 52, 68). This process, by which a selected mutation reduces variability in linked sites as it goes to fixation, is known as a selective sweep (Figure 1). The hope is that by analysis of large comparative genomic data sets and large SNP data sets we will be able to determine how and where both positive and negative selection

Selective sweep: the process by which a new advantageous mutation eliminates or reduces variation in linked neutral sites as it increases in frequency in the population

Figure 1 The effect of a selective sweep on genetic variation. The figure is based on averaging over 100 simulations of a strong selective sweep. It illustrates how the number of variable sites (variability) is reduced, LD is increased, and the frequency spectrum, as measured by Tajima’s D, is skewed, in the region around the selective sweep. All statistics are calculated in a sliding window along the sequence right after the advantageous allele has reached frequency 1 in the population. All statistics are also scaled so that the expected value under neutrality equals one.

Nielsen, 2005, Ann Rev Genet 39:197

directional selection like population increase (at selected locus)
locally in genome, looks like demographic expansion
recombination progressively dissipates linkage with nearby loci

Tajima’s D and selection: illustration

recent selective sweep at the lactase locus
Korneliussen et al, 2013, BMC Bioinformatics 14:289

Figure 8 Applications to real data. Genomic scans using a sliding window approach of 10 kb. The UCSC Tajima track was downloaded from the UCSC genome browser, and was shifted relatively to LCT gene on the hg19 human assembly. The vertical red line indicates window centers where the EB method (100 kb window) has a Tajima’s D below −1.8. Figure a) is using our EB method with varying window sizes. Figure b) is our EB method together with the genotype calling methods. We tried with varying p-value cutoffs for the genotype calling methods, and are using a window size of 100 kb.

500 kb) all using a fixed step size of 10 kb. We also compared the results for the naïve genotype calling methods using 2 different SNP inclusion cutoff criteria’s ($10^{-6}$, $10^{-3}$). Notice that the overall estimate of Tajima’s D is very positive for the SNP data, most likely due to ascertainment biases [11]. Also notice that the lowest observed value of Tajima’s D is the LCT region for the EB approach while there are multiple regions with low Tajima’s D values for the GC approaches and the SNP chip data.

Discussion

We have developed two methods to perform the neutrality tests on NGS data that take the uncertainty of the genotypes into account. Both methods perform better than using called genotypes and in most instances they are approximately unbiased. The full ML method is computationally slow when applied in sliding windows at a genome-wide scale, which is why we also present a fast empirical Bayes method with a prior that is estimated from the data itself, for example the entire genome, or a reasonable subset of the genome. This makes the method computationally feasible for full genomic data of any magnitude and any windows size.

There is not a single obvious way to identify SNP sites and call genotypes from NGS data. Here we have tried different approaches with different cutoffs. Regardless of method and chosen cutoff they all show a large bias in some or all simulated scenarios. This is evident from the raw theta estimates (Table 1), and the actual test statistics (Figure 1). The level of bias varies between the different scenarios, not only for different depths and error rates, but it also depends on whether or not the data set is neutral or affected by selection (Figure 2). The results from this study suggest/confirms it is not unproblematic to perform neutrality tests on genotypes called from low or medium coverage NGS data.

In contrast, both the ML and the EB approach give fairly accurate estimates for all the examined measures. When applying a neutral genome-wide prior for our analysis, we observed only small deviations from the true values of Tajima’s D even for very low depth data. When applying the EB approach we did observe a small bias in the regions under selection when the prior was estimated from regions without selection (see Figure 7). However, the bias is always much smaller than the bias of the GC approach. Even though the EB approach can give small biases it can still have an advantage over the full ML approach. When estimating the neutrality test statistics for small windows, we often obtained a few extreme outliers with positive values of Tajima’s D for the ML approach (see Figure 6). Since the EB method uses the entire 1 Mb region as prior we do not see a similar problem with extreme positive outliers and the variance of the estimates is smaller overall. If the goal is to identify regions under selection the smaller variance of the EB approach will give fewer regions with extreme values. This is because regions with little data will increase the variance for the full ML approach while they will give

Extensions to Kingman’s coalescent

with demographic variation (time-dependent N(t))
with population structure (demes with migration between demes)
with recombination (ancestral recombination graphs)
    Hudson 1983, Theor Popul Biol 23:183.
    important tool for estimating recombination rates along genomes
with selection (ancestral selection graphs)
    Krone and Neuhauser, 1997, Theor Popul Biol 51:210.

Ancestral recombination graph: 2 loci

• The combined history of recombination and coalescence is described by the ancestral recombination graph

[Figure: embedded slides "The ancestral recombination graph" and "Deconstructing the ARG", with events labelled coalescence, mutation, coalescence, coalescence, mutation, coalescence, recombination.]

from Awadalla (McVean, Awadalla and Fearnhead, Genetics, 160:1231)

scaled recombination rate ρ = 4Nr
coalescence at rate j(j − 1)/2
recombination at rate jρ/2
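The two competing rates define a simple birth-death process on the number of lineages j. The sketch below simulates only that count, going back in time until a single lineage remains. It is a simplification: it ignores which parts of each sequence are ancestral, and the function name is hypothetical.

```python
import random

def simulate_arg_events(n, rho, rng):
    """With j lineages: coalescence at rate j (j - 1) / 2 (j -> j - 1),
    recombination at rate j * rho / 2 (j -> j + 1).
    Returns (number of coalescences, number of recombinations, total time)."""
    j, n_coal, n_rec, t = n, 0, 0, 0.0
    while j > 1:
        coal_rate = j * (j - 1) / 2.0
        rec_rate = j * rho / 2.0
        t += rng.expovariate(coal_rate + rec_rate)  # wait for the next event
        if rng.random() < coal_rate / (coal_rate + rec_rate):
            j -= 1
            n_coal += 1
        else:
            j += 1
            n_rec += 1
    return n_coal, n_rec, t

rng = random.Random(11)
for rho in (0.0, 0.5, 2.0):
    print(rho, simulate_arg_events(6, rho, rng))
```

Because coalescence grows quadratically in j while recombination grows only linearly, the process always ends, and each recombination has to be undone by one extra coalescence.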


Learning about recombination

• Just like there is a true genealogy underlying a sample of sequences without recombination, there is a true ARG underlying samples of sequences with recombination

• As before, we can consider nonparametric and parametric ways of learning about recombination

• There are several useful nonparametric ways of learning about recombination which we will consider first
  – These really only apply to species, such as humans, where we can be fairly sure that most SNPs are the result of a single ancestral mutation event

The signal of recombination?

[Figure: an ancestral chromosome recombines; recurrent mutation versus recombination.]

Ancestral recombination graph: continuous segment of loci

[Figure: ancestral recombination graph for two sequences, with columns giving, for each interval, the rate of coalescence (Coal.), the rate of recombination (Recomb.) and the amount of ancestral material (Anc. mat.).]

Figure 5.12 Black lines represent the sample sequences or ancestral sequence material to these. Dotted lines represent non-ancestral material. Light grey lines indicate that a MRCA has been found. The non-ancestral piece formed after the first coalescence event between two non-consecutive pieces of ancestral material is trapped material. Also shown is the rate of coalescence and recombination, and the amount of material spanned by ancestral material. λ is the length of the black bar in the sequence with dashed ends.

Hein, Schierup and Wiuf, 2005

scaled recombination rate (for the whole segment) ρ = 4Nr
coalescence at rate j(j − 1)/2
recombination at rate jρ/2

either a coalescent event (with rate k(k − 1)/2 = 1) or a recombination event (with rate kρ/2 = ρ) could occur. In this example the first event is a recombination event. After the event there are three sequences with ancestral material to the two sampled sequences. The next two events are also recombination events. In one of the two events a sequence is created with no material ancestral to the sample. The rate of a coalescence is now 10, while the rate of recombination is 2.5ρ. The fourth event is the first coalescent event that also traps a piece of non-ancestral material between two pieces of ancestral material. As long as the flanking regions are linked, their genealogical histories are identical, so if one segment coalesces into another sequence, so does the other. After three more coalescence events all the ancestral material from the two sampled sequences have found common ancestry, in fact have found a GMRCA. There are two MRCAs: one is also the GMRCA which is the MRCA of the middle island of ancestral material, the other is the MRCA of the two flanking islands of ancestral material. This MRCA is created at the second coalescent event. When two pieces of ancestral material are bridged together they share fate as long as they are not cut by recombination again. The material between the two pieces is called trapped material.

5.5.1.2 Discrete versus continuous sequences

Real sequences have a discrete number of base pairs rather than an infinite number of sites. The infinite sites model described in the previous section can be converted to a discrete model by dividing the continuous interval

Summary and conclusions

Summary
rate of coalescence of j lineages is $j(j-1)/(4N)$
depth of genealogy reflects population size
shape of genealogy reflects demographic history
Kingman’s coalescent: simple and powerful model for
    understanding population genetics
    estimating parameters
    testing models

From there
coalescent at the core of models for statistical inference
represents the natural law for integrating over genealogies