Computational Methods in Population Genetics
Total Page:16
File Type:pdf, Size:1020Kb
Computational Methods in Population Genetics Dirk Metzler February 1, 2017 Contents 1 Introduction 2 1.1 Examples . .2 1.2 Some recommended books on population genetics . .3 1.3 Wright Fisher model and Kingman's Coalescent . .3 1.4 Estimators for θ and Tajima's π ................................6 1.5 Outline of methods . .8 1.5.1 ML with Importance Sampling . .8 1.5.2 MCMC for frequentists and Bayesians . .8 1.5.3 Approximate Bayesian Computation (ABC) . .9 2 Importance sampling for genealogies 9 2.1 Griffiths und Tavar´e . 12 3 Lamarc (and Migrate) 13 4 IM, IMa, IMa2 19 5 Approximate Bayesian Computation (ABC) 22 5.1 ABC with local regression correction . 23 5.2 MCMC without likelihoods . 25 5.3 Sequential / Adaptive ABC . 26 5.4 Optimizing sets of summary statistics with PLS . 27 6 Jaatha 30 6.1 Wild Tomatoes and Jaatha 1.0 . 30 6.2 Jaatha 2.0 . 38 6.3 Application to genome-wide data . 41 6.4 Statistics in jaatha beyond jsfs . 42 6.5 Conclusions . 42 7 The program STRUCTURE 42 7.1 no admixture, no sampling locations . 43 7.2 with admixture . 45 7.3 taking sampling locations into account . 47 7.4 Faster alternatives to STRUCTURE for large datasets . 49 7.4.1 ADMIXTURE . 49 7.4.2 fastSTRUCTURE . 49 7.4.3 snmf() in the Bioconductor R package LEA . 50 8 Simulating and detecting selection 52 8.1 Ancestral Selection Graphs . 53 8.2 Simulating selective sweeps and other kinds of strong selection . 54 1 9 Stats for Selection 56 9.1 Selective Sweeps . 56 9.1.1 Detecting classical selective sweeps . 56 9.1.2 Stats for detecting incomplete sweeps . 58 9.1.3 Soft Sweeps . 60 9.1.4 Signals of sweeps and demography . 61 9.2 Balancing selection . 63 9.3 Does the gene list make sense? . 64 10 Phasing and PAC 65 10.1 Classical methods for phasing . 65 10.1.1 Excoffier and Slatkin's EM algorithm . 65 10.1.2 Excursus: EM algorithm . 66 10.1.3 Basic algorithms in PHASE . 68 10.2 Li&Stephens' PAC approach . 71 10.2.1 Excursus: Stephens and Donnelly's Importance Sampling . 71 10.2.2 Estimating LD and recombination hotspots . 75 10.2.3 PAC in PHASE . 79 10.2.4 Population splitting and recombination . 80 10.2.5 Diversifying selection and recombination . 82 10.3 Phasing large genomic datasets . 83 10.3.1 fastPHASE . 83 10.3.2 Phasing with Beagle software package . 85 10.3.3 IMPUTE version 2 . 86 10.3.4 MaCH . 86 10.3.5 polyHAP . 86 1 Introduction 1.1 Examples Complex Demography ? substructure population growth recent speciation introgression? recombination within loci can we still detect selection? 2 1.2 Some recommended books on population genetics References [1] R. Nielsen, M. Slatkin (2013) An Introduction to Population Genetics: Theory and Applications. Sinauer [2] J.H. Gillespie (2004) Population Genetics { A Concise Guide. 2nd Ed., John Hopkins University Press [3] J. Felsenstein (2016+) Theoretical Evolutionary Genetics. http://evolution.genetics.washington.edu/pgbook/pgbook.html [4] J. Wakeley (2008) Coalescent Theory: An Introduction. Roberts & Company Publishers [5] J. Hein, M.H. Schierup, C. Wiuf (2005) Gene Genealogies, Variation, and Evolution: A Primer in Coalescent Theory. Oxford University Press [6] J. Felsenstein (2002) Inferring Phylogenies, Chapters 26, 27 and 28. Sinauer Associates 1.3 Wright Fisher model and Kingman's Coalescent Basic assumptions of the Wright Fisher model • non-overlapping generations • constant population size • panmictic • neutral (i.e. no selection) • no recombination • N diploid individuals population of 2N haploid alleles (in case of autosomal DNA) Wright Fisher model Each allele chooses an ancestor in the generation before. Generation 0 − 1 − 2 − 3 − 4 −− 5 − 6 − 7 − 8 − 9 Samples are assumed to be taken purely randomly from the population. 3 Generation 0 ABCD − 1 − 2 − 3 − 4 −− 5 − 6 − 7 − 8 − 9 This induces a specific random distribution for the genealogies of the sampled alleles. Generation 0 ABCD − 1 − 2 − 3 − 4 −− 5 − 6 − 7 − 8 − 9 Haploid population of size Ne Average time until two ancestral lineages coalesce: Ne generations. Scale time: (1 time unit) = (Ne generations) ) pairwise coalescence rate = 1 µ := mutation rate per generation θ := 2Ne · µ is the expected number of mutations between 2 random individuals Let Ne −! 1 The Kingman Coalescent 4 Zeit in Zeit in 12 3 ........ k−1 k Generationen N Generatioen 2N/(k(k−1)) 2/(k(k−1)) = 2/(7*6) = 0,0476 2N/(6*5) 2/(6*5) = 0,667 2N/(5*4) 2/(4*5) = 0,1 2N/(4*3) 2/(4*3) = 0,167 2N/(3*2) 2/(3*2) = 0,333 E(total length) k−1 X = 2 · 1=i 2N/(2*1) 2/(2*1) = 1 i=1 typical coalescent trees for n = 8: 6 6 6 7 2 1 2 3 5 4 3 8 3 3 7 6 8 8 5 4 1 5 4 1 7 2 1 5 4 7 8 2 7 6 2 8 6 1 1 4 3 2 3 3 5 8 6 2 4 3 4 7 8 7 8 1 2 5 5 6 1 4 7 5 4 5 7 7 3 4 3 3 8 8 5 1 5 2 6 8 6 3 4 5 2 1 8 2 7 6 1 6 1 7 2 4 simulated coalescent tree with n = 500: 5 1.4 Estimators for θ and Tajima's π Two estimators of θ θπ (\Tajima's π") Average number of pairwise differences. number of mutations θW (\Watterson's θ") = Pk−1 i=1 1=i Both are unbiased estimators of θ, i.e. EθW = Eθπ = θ. Example: Ward et al. (1991) sampled 360 bp sequences from mtDNA control region of n = 63 Nuu Chah Nulth and observed 26 mutations. 26 θ = = 5:5123 W P63 i=1 1=i This corresponds to 0.0153 Mutations per base and per 2 · Ne generations.Assuming a mutation rate −6 µb ≈ 6:6 · 10 per generation per site this leads to an effective population size of θW =360 Nce = ≈ 1150 females 2 · µb How precise is this estimation? θ Pn 1=i2 var(θ ) = + θ2 · i=1 W Pn 1=i Pn 2 i=1 ( i=1 1=i) Theorem 1 Any unbiased estimator of θ has variance at least θ : Pn−1 1 k=1 k+θ (Here, we assume that the estimation is based on a single locus without recombination). 6 For the Nuu Chah Nulth data we get: θW = 5:5123 σθW = 3:42 Confidence range? (2σ-rule would leed to negative values...) Conclusion: Ne could perhaps also be 200 or 3000 females. How can we improve this estimate? Sample more individuals? How many individuals n would we 100/θ need to get σθW = 0:1 · θ? From the formula for varθW follows that we need n ≈ 2 · e . For θ = 5, this is n ≈ 109. For θ = 1, this is n ≈ 1043. number of water molecules on earth≈ 1047 number of seconds since big bang≈ 4:3 · 1017 Solution: sample many loci! References [Fel06] J. Felsenstein (2006) Accuracy of Coalescent Likelihood Estimates: Do We Need More Sites, More Sequences, Or More Loci?Mol. Biol. Evol., 23.3: 691{700. How to sample if • one read is 600 bp long • costs for developing a new locus is 40$ • costs for collecting a sample is 10 or 0.10$ • costs for a single read is 6$ • you can spend 1000$ • true θ is 1.8 (per locus) Optimal sampling scheme: n = 7 or n = 8 , respectively, individuals and 11 loci. With this sampling scheme we get: σθW ≈ 0:2 · θ and σθπ ≈ 0:22 · θ (all this is based on infinte-sites assumptions) Tajima's D θ π > θ W : θ π < θ W : θπ −θW D := σ bθπ −θW substructure? population growth? selection? 7 1.5 Outline of methods 1.5.1 ML with Importance Sampling The Likelihood = ( i)i vector of model parameters D sequence data Z LD( ) = Pr (D) = Pr (D j G) · P (dG): all Genealogies G Importance Sampling Draw G1;:::;Gk (approx.) i.i.d. with density Q and approximate Z k 1 X Pr (D j Gi) · P (Gi) Pr (D j G) P (dG) ≈ : k Q(G ) i=1 i efficient for with Pr (D j Gi) · P (Gi) ≈ Q(Gi) Methods differ in their choice of Q. Griffiths & Tavar´e(1994) Q: Generate G backwards in time, greedy proportional to coalescence and mutation probabilities. Choose between all allowed events. Good for infinite sites models, inefficient if back-mutations are allowed. 1.5.2 MCMC for frequentists and Bayesians Felsenstein, Kuhner, Yamato, Beerli,. For some initial 0, sample Genealogies G approx. i.i.d. according to Pr 0 (G j D) by Metropolis- Hastings MCMC. Coalescent is a natural prior for G! Two flavours: for frequentists: use G1;:::;Gk for Importance Sampling Optimize approx. Likelihood ! 1 Iterate with 0 replaced by 1 for Baysians: Then sample conditioned on Genealogies and iterate to do Gibbs-sampling from Pr( ; G j D). Problems of full-data methods • usual runtime for one dataset: several weeks or months • complex software, development takes years • most programs not flexible, hard to write extensions 8 1.5.3 Approximate Bayesian Computation (ABC) Pritchard et al. (1999) Approximate Bayesian Computation 1. Select summary statistics S = (Si)i and compute their values s = (si)i for given data set 2. Choose tolerance δ 3. repeat until k accepted 0: • Simulate 0 from prior distribution of • Simulate genealogy G according to Pr 0 (G).