Computational Methods in Population Genetics

Dirk Metzler, February 1, 2017

Contents

1 Introduction
   1.1 Examples
   1.2 Some recommended books on population genetics
   1.3 Wright-Fisher model and Kingman's Coalescent
   1.4 Estimators for θ and Tajima's π
   1.5 Outline of methods
      1.5.1 ML with Importance Sampling
      1.5.2 MCMC for frequentists and Bayesians
      1.5.3 Approximate Bayesian Computation (ABC)
2 Importance sampling for genealogies
   2.1 Griffiths and Tavaré
3 Lamarc (and Migrate)
4 IM, IMa, IMa2
5 Approximate Bayesian Computation (ABC)
   5.1 ABC with local regression correction
   5.2 MCMC without likelihoods
   5.3 Sequential / Adaptive ABC
   5.4 Optimizing sets of summary statistics with PLS
6 Jaatha
   6.1 Wild Tomatoes and Jaatha 1.0
   6.2 Jaatha 2.0
   6.3 Application to genome-wide data
   6.4 Statistics in jaatha beyond jsfs
   6.5 Conclusions
7 The program STRUCTURE
   7.1 no admixture, no sampling locations
   7.2 with admixture
   7.3 taking sampling locations into account
   7.4 Faster alternatives to STRUCTURE for large datasets
      7.4.1 ADMIXTURE
      7.4.2 fastSTRUCTURE
      7.4.3 snmf() in the Bioconductor R package LEA
8 Simulating and detecting selection
   8.1 Ancestral Selection Graphs
   8.2 Simulating selective sweeps and other kinds of strong selection
9 Stats for Selection
   9.1 Selective Sweeps
      9.1.1 Detecting classical selective sweeps
      9.1.2 Stats for detecting incomplete sweeps
      9.1.3 Soft Sweeps
      9.1.4 Signals of sweeps and demography
   9.2 Balancing selection
   9.3 Does the gene list make sense?
10 Phasing and PAC
   10.1 Classical methods for phasing
      10.1.1 Excoffier and Slatkin's EM algorithm
      10.1.2 Excursus: EM algorithm
      10.1.3 Basic algorithms in PHASE
   10.2 Li & Stephens' PAC approach
      10.2.1 Excursus: Stephens and Donnelly's Importance Sampling
      10.2.2 Estimating LD and recombination hotspots
      10.2.3 PAC in PHASE
      10.2.4 Population splitting and recombination
      10.2.5 Diversifying selection and recombination
   10.3 Phasing large genomic datasets
      10.3.1 fastPHASE
      10.3.2 Phasing with Beagle software package
      10.3.3 IMPUTE version 2
      10.3.4 MaCH
      10.3.5 polyHAP

1 Introduction

1.1 Examples

Complex demography?
• substructure
• population growth
• recent speciation
• introgression?
• recombination within loci

Can we still detect selection?

1.2 Some recommended books on population genetics

References

[1] R. Nielsen, M. Slatkin (2013) An Introduction to Population Genetics: Theory and Applications. Sinauer
[2] J.H. Gillespie (2004) Population Genetics: A Concise Guide. 2nd Ed., Johns Hopkins University Press
[3] J. Felsenstein (2016+) Theoretical Evolutionary Genetics. http://evolution.genetics.washington.edu/pgbook/pgbook.html
[4] J. Wakeley (2008) Coalescent Theory: An Introduction. Roberts & Company Publishers
[5] J. Hein, M.H. Schierup, C. Wiuf (2005) Gene Genealogies, Variation, and Evolution: A Primer in Coalescent Theory. Oxford University Press
[6] J. Felsenstein (2002) Inferring Phylogenies, Chapters 26, 27 and 28. Sinauer Associates

1.3 Wright-Fisher model and Kingman's Coalescent

Basic assumptions of the Wright-Fisher model
• non-overlapping generations
• constant population size
• panmictic
• neutral (i.e. no selection)
• no recombination
• N diploid individuals, treated as a population of 2N haploid alleles (in the case of autosomal DNA)

Wright-Fisher model

Each allele chooses an ancestor in the generation before.

[Figure: Wright-Fisher population, generations 0 back to −9]

Samples are assumed to be taken purely randomly from the population.

[Figure: the same population with sampled alleles A, B, C, D marked in generation 0]

This induces a specific random distribution for the genealogies of the sampled alleles.
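The ancestor-choosing rule of the Wright-Fisher model can be sketched in a few lines of Python. This is only an illustration under the assumptions listed above (constant size, neutrality, no recombination); the function names `wright_fisher_ancestry` and `trace_sample`, the population size of 10 alleles, and the sample of four alleles are hypothetical choices made for this example.

```python
import random


def wright_fisher_ancestry(num_alleles, num_generations, rng):
    """For each generation step back in time, let every allele pick its
    ancestor uniformly at random among the alleles of the previous generation.

    Returns a list `parents` where parents[g][i] is the index of the ancestor
    (one generation further back) of allele i at step g.
    """
    return [
        [rng.randrange(num_alleles) for _ in range(num_alleles)]
        for _ in range(num_generations)
    ]


def trace_sample(parents, sample):
    """Follow the sampled alleles backwards in time.

    Returns, for generation 0 and every earlier generation, the set of
    distinct ancestral alleles of the sample.
    """
    lineages = list(sample)
    history = [set(lineages)]
    for generation in parents:
        lineages = [generation[i] for i in lineages]
        history.append(set(lineages))
    return history


rng = random.Random(1)  # fixed seed so the run is reproducible
parents = wright_fisher_ancestry(num_alleles=10, num_generations=9, rng=rng)
history = trace_sample(parents, sample=[0, 1, 2, 3])  # sampled alleles A, B, C, D
lineage_counts = [len(ancestors) for ancestors in history]
print(lineage_counts)  # number of distinct ancestral lineages per generation
```

Whenever two sampled lineages pick the same ancestor they merge, so the number of distinct lineages can only stay the same or decrease going backwards in time.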
[Figure: genealogy of the sampled alleles A, B, C, D traced back through generations 0 to −9]

Haploid population of size Ne.

Average time until two ancestral lineages coalesce: Ne generations.

Scale time: (1 time unit) = (Ne generations) ⇒ pairwise coalescence rate = 1

µ := mutation rate per generation

θ := 2Ne · µ is the expected number of mutations between 2 random individuals

Let Ne → ∞

The Kingman Coalescent

[Figure: coalescent tree with leaves 1, 2, 3, ..., k−1, k, annotated with the expected duration of each period]

Expected duration of the period with k ancestral lineages, shown for a tree with 7 leaves:

  k   time in generations   time in units of N generations
  7   2N/(7·6)              2/(7·6) = 0.0476
  6   2N/(6·5)              2/(6·5) = 0.0667
  5   2N/(5·4)              2/(5·4) = 0.1
  4   2N/(4·3)              2/(4·3) = 0.167
  3   2N/(3·2)              2/(3·2) = 0.333
  2   2N/(2·1)              2/(2·1) = 1

In general, the period with k lineages lasts 2N/(k(k−1)) generations on average, i.e. 2/(k(k−1)) time units.

\[ E(\text{total length}) = 2 \cdot \sum_{i=1}^{k-1} \frac{1}{i} \]

[Figure: typical coalescent trees for n = 8]

[Figure: simulated coalescent tree with n = 500]

1.4 Estimators for θ and Tajima's π

Two estimators of θ

θπ ("Tajima's π"): average number of pairwise differences.

θW ("Watterson's θ"), for a sample of n sequences:

\[ \theta_W = \frac{\text{number of mutations}}{\sum_{i=1}^{n-1} 1/i} \]

Both are unbiased estimators of θ, i.e. E θW = E θπ = θ.

Example: Ward et al. (1991) sampled 360 bp sequences from the mtDNA control region of n = 63 Nuu Chah Nulth and observed 26 mutations.

\[ \theta_W = \frac{26}{\sum_{i=1}^{62} 1/i} = 5.5123 \]

This corresponds to 0.0153 mutations per base and per 2 · Ne generations. Assuming a mutation rate of \( \mu_b \approx 6.6 \cdot 10^{-6} \) per generation per site, this leads to an effective population size of

\[ \hat{N}_e = \frac{\theta_W / 360}{2 \cdot \mu_b} \approx 1150 \text{ females.} \]

How precise is this estimation?

\[ \mathrm{var}(\theta_W) = \frac{\theta}{\sum_{i=1}^{n-1} 1/i} + \theta^2 \cdot \frac{\sum_{i=1}^{n-1} 1/i^2}{\left( \sum_{i=1}^{n-1} 1/i \right)^2} \]

Theorem 1 Any unbiased estimator of θ has variance at least

\[ \frac{\theta}{\sum_{k=1}^{n-1} \frac{1}{k+\theta}}. \]

(Here, we assume that the estimation is based on a single locus without recombination.)

For the Nuu Chah Nulth data we get:

θW = 5.5123, σθW = 3.42

Confidence range? (The 2σ-rule would lead to negative values...)

Conclusion: Ne could perhaps also be 200 or 3000 females.

How can we improve this estimate?
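Before turning to that question, the two point estimators can be written down as a short Python sketch. The function names and the three toy sequences are hypothetical and only serve as illustration. The harmonic sum runs to n − 1, following the definition of θW; the trailing digits of the result may differ slightly from the value quoted above, depending on rounding and on the summation limit used.

```python
from itertools import combinations


def harmonic_number(m):
    """Return 1/1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))


def watterson_theta(num_mutations, sample_size):
    """Watterson's estimator: number of mutations divided by sum_{i=1}^{n-1} 1/i."""
    return num_mutations / harmonic_number(sample_size - 1)


def tajima_pi(sequences):
    """Tajima's estimator: average number of pairwise differences.

    Assumes aligned sequences of equal length.
    """
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


# Numbers taken from the Nuu Chah Nulth example in the text.
theta_w = watterson_theta(num_mutations=26, sample_size=63)
theta_per_site = theta_w / 360            # 360 bp per sequence
mu_b = 6.6e-6                             # assumed mutation rate per site per generation
n_e = theta_per_site / (2 * mu_b)         # effective number of females

# Hypothetical toy alignment, only to show how theta_pi is computed.
toy_sequences = ["AAGT", "AAGC", "TAGC"]
pi_toy = tajima_pi(toy_sequences)

print(theta_w, theta_per_site, n_e, pi_toy)
```

In the toy alignment the three pairs differ at 1, 2 and 1 sites, so θπ is their mean.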
Sample more individuals?

How many individuals n would we need to get σθW = 0.1 · θ? From the formula for var θW it follows that we need \( n \approx 2 \cdot e^{100/\theta} \).

For θ = 5, this is \( n \approx 10^{9} \).
For θ = 1, this is \( n \approx 10^{43} \).

number of water molecules on earth ≈ \( 10^{47} \)
number of seconds since big bang ≈ \( 4.3 \cdot 10^{17} \)

Solution: sample many loci!

References

[Fel06] J. Felsenstein (2006) Accuracy of Coalescent Likelihood Estimates: Do We Need More Sites, More Sequences, Or More Loci? Mol. Biol. Evol., 23.3: 691–700.

How to sample if
• one read is 600 bp long
• the cost of developing a new locus is $40
• the cost of collecting a sample is $10 or $0.10
• the cost of a single read is $6
• you can spend $1000
• the true θ is 1.8 (per locus)

Optimal sampling scheme: n = 7 or n = 8 individuals, respectively, and 11 loci.

With this sampling scheme we get: σθW ≈ 0.2 · θ and σθπ ≈ 0.22 · θ

(all this is based on infinite-sites assumptions)

Tajima's D

\[ D := \frac{\theta_\pi - \theta_W}{\hat{\sigma}_{\theta_\pi - \theta_W}} \]

[Figure: genealogies illustrating the cases θπ > θW and θπ < θW]

Substructure? Population growth? Selection?

1.5 Outline of methods

1.5.1 ML with Importance Sampling

The Likelihood

\( \Theta = (\Theta_i)_i \): vector of model parameters
D: sequence data

\[ L_D(\Theta) = \Pr_\Theta(D) = \int_{\text{all genealogies } G} \Pr_\Theta(D \mid G) \cdot P_\Theta(dG). \]

Importance Sampling

Draw \( G_1, \dots, G_k \) (approx.) i.i.d. with density Q and approximate

\[ \int \Pr_\Theta(D \mid G) \, P_\Theta(dG) \approx \frac{1}{k} \sum_{i=1}^{k} \frac{\Pr_\Theta(D \mid G_i) \cdot P_\Theta(G_i)}{Q(G_i)}. \]

This is efficient for Θ with \( \Pr_\Theta(D \mid G_i) \cdot P_\Theta(G_i) \approx Q(G_i) \).

Methods differ in their choice of Q.

Griffiths & Tavaré (1994)

Q: Generate G backwards in time, greedily, with probabilities proportional to the coalescence and mutation probabilities. Choose between all allowed events.

Good for infinite-sites models, inefficient if back-mutations are allowed.

1.5.2 MCMC for frequentists and Bayesians

Felsenstein, Kuhner, Yamato, Beerli, ...

For some initial \( \Theta_0 \), sample genealogies G approx. i.i.d. according to \( \Pr_{\Theta_0}(G \mid D) \) by Metropolis-Hastings MCMC.

The coalescent is a natural prior for G!

Two flavours:

for frequentists: use \( G_1, \dots, G_k \) for Importance Sampling. Optimize the approximate likelihood to obtain \( \Theta_1 \). Iterate with \( \Theta_0 \) replaced by \( \Theta_1 \).

for Bayesians: then sample Θ conditioned on the genealogies and iterate to do Gibbs sampling from \( \Pr(\Theta, G \mid D) \).

Problems of full-data methods
• usual runtime for one dataset: several weeks or months
• complex software, development takes years
• most programs are not flexible, and it is hard to write extensions

1.5.3 Approximate Bayesian Computation (ABC)

Pritchard et al. (1999)

Approximate Bayesian Computation

1. Select summary statistics \( S = (S_i)_i \) and compute their values \( s = (s_i)_i \) for the given data set.
2. Choose tolerance δ.
3. Repeat until k values Θ' are accepted:
   • Simulate Θ' from the prior distribution of Θ.
   • Simulate a genealogy G according to \( \Pr_{\Theta'}(G) \).
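A minimal rejection-ABC sketch in Python, for a single parameter θ and a single summary statistic (the number of mutations at one locus under the infinite-sites model). Several parts are assumptions made for this illustration and are not taken from the text: the uniform prior on [0, 20], the tolerance δ = 1, and the acceptance step, which keeps Θ' when the simulated statistic lies within δ of the observed one (the usual rejection-ABC rule; the steps listed above stop at the genealogy simulation). Instead of a full genealogy, only the coalescent waiting times are simulated, since the total tree length is all that this summary statistic needs.

```python
import random


def simulate_num_mutations(theta, sample_size, rng):
    """Simulate the number of mutations on a Kingman coalescent tree.

    Time is measured in units of Ne generations, so while k lineages remain
    the waiting time is exponential with rate k(k-1)/2, and mutations fall
    on the tree at rate theta/2 per unit of branch length.
    """
    total_length = 0.0
    for k in range(sample_size, 1, -1):
        waiting_time = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * waiting_time
    # Poisson draw with mean theta/2 * total_length, obtained by counting
    # unit-rate exponential arrivals that fall below the mean.
    mean = theta / 2.0 * total_length
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def abc_rejection(observed, sample_size, tolerance, num_accepted, prior_max, rng):
    """Collect num_accepted prior draws whose simulated statistic is within
    `tolerance` of the observed one (assumed acceptance rule)."""
    accepted = []
    while len(accepted) < num_accepted:
        theta_prime = rng.uniform(0.0, prior_max)   # assumed uniform prior
        simulated = simulate_num_mutations(theta_prime, sample_size, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta_prime)
    return accepted


rng = random.Random(42)  # fixed seed for reproducibility
posterior_sample = abc_rejection(
    observed=26, sample_size=63, tolerance=1,
    num_accepted=200, prior_max=20.0, rng=rng,
)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
print(posterior_mean)
```

The accepted values form an approximate sample from the posterior of θ given the summary statistic; a smaller tolerance gives a better approximation at the price of more rejected simulations.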
