bioRxiv preprint doi: https://doi.org/10.1101/733600; this version posted August 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The Effects of Gene Flow from Unsampled 'Ghost' Populations on the Estimation of Evolutionary History under the Isolation with Migration Model

Melissa Lynch∗1 and Arun Sethuraman†1

1 Department of Biological Sciences, California State University San Marcos

August 12, 2019

Abstract

Unsampled or extinct 'ghost' populations leave signatures on the genomes of individuals from extant, sampled populations, especially if they have exchanged genes with them over evolutionary time. This gene flow from 'ghost' populations can introduce biases when estimating evolutionary history from genomic data, often leading to data misinterpretation and ambiguous results. To assess the extent of this bias, we perform an extensive simulation study under varying degrees of gene flow to and from 'ghost' populations under the Isolation with Migration (IM) model. Estimates of popular summary statistics like Watterson's θ, π, and F_ST, and of evolutionary demographic history (estimated as effective population sizes, divergence times, and migration rates) using the IMa2p software clearly indicate that we (a) underestimate divergence times between sampled populations, (b) overestimate effective population sizes of sampled populations, and (c) underestimate migration rates between sampled populations, with increased gene flow from the unsampled 'ghost' population. Similarly, summary statistics like F_ST and π are also affected depending on the amount of gene flow from the unsampled 'ghost'.

∗[email protected] †[email protected]

Introduction

Studies that apply population genetics methods to infer evolutionary history, including those in the fields of conservation, agriculture, evolution, and anthropology, often begin with a sampling strategy. As a rule of thumb, we would expect that the more extensive the sampling of individuals across their geographical range (or according to the biological question at hand), the better the resolution of population genetic analyses. However, in any such study, the genomic data collected harbor signatures of the evolutionary processes of drift, selection, and gene flow involving both sampled and unsampled 'ghost' populations (Beerli, 2004). Such 'ghost' populations are ubiquitous, and have been increasingly discovered across numerous species complexes. For example, several studies of the evolutionary history of African Hunter-Gatherers (Lachance et al., 2012; Hey et al., 2018; Durvasula & Sankararaman, 2019) identify significant gene flow from an unsampled archaic population (diverged from modern humans around the same time as Neanderthals) into multiple modern Hunter-Gatherer lineages. Similar studies in black rats (Konečný et al., 2013), Brachypodium sylvaticum bunchgrass (Rosenthal, Ramakrishnan, & Cruzan, 2008), bonobos (Kuhlwilm, Han, Sousa, Excoffier, & Marques-Bonet, 2019), and modern humans (summarized in Nielsen et al., 2017) also describe the occurrence of unsampled 'ghost' populations in their respective systems.
The degree of such unsampled genomic variation in sampled genomic data is a result of (a) incomplete population sampling on the part of the researcher, (b) population extinction or decline, and (c) the amount of gene flow from such unsampled or extinct populations over evolutionary time. Importantly, not accounting for gene flow from unsampled 'ghost' populations into extant, sampled populations can lead to erroneous estimates when using summary statistics or phylogenetic/mutation model-based methods, and thus to erroneous conclusions about the evolutionary history of the species. Interestingly, studies that account for unsampled 'ghosts' are still a minority in the landscape of population genomics research and publications.

For example, under an evolutionary scenario where two populations of a species do not exchange genes directly with each other, but each exchanges genes bidirectionally with an unsampled 'ghost', one would expect measures of population differentiation (F_ST) between the sampled extant populations to be very low compared to what we would expect under a model where the gene flow from the 'ghost' is unaccounted for. Similarly, other summary statistics, such as the allele frequency distribution (Tajima's D), heterozygosity, nucleotide diversity (π), effective population size (Ne), and estimates of time to most recent common ancestor (TMRCA), can all be biased depending on the degree of gene flow from unsampled 'ghost' populations.

Model-based methods to estimate population structure and gene flow, including Bayesian or maximum
likelihood approaches, as implemented in software such as structure (Pritchard, Stephens, & Donnelly, 2000), ADMIXTURE (Alexander, Novembre, & Lange, 2009), MIGRATE-n (Beerli & Felsenstein, 2001), and IMa2p (Sethuraman & Hey, 2016), are also not immune to the effects of 'ghost' gene flow, in that identical estimates of phylogenetic and mutational history may be obtained under a number of distinct demographic histories (Lawson, Van Dorp, & Falush, 2018). For example, Lawson et al. (2018) describe three scenarios of evolutionary history (one of recent admixture from four divergent populations, one of gene flow from a 'ghost' population, and one of recent bottlenecks in sampled, extant populations) that all yield the same population structure and admixture proportions when analyzed with the programs structure or ADMIXTURE. Correspondingly, not accounting for 'ghost' gene flow is known to bias estimates of migration rates (Hey et al., 2018) and divergence times (Lachance et al., 2012) when using other model-based estimators of evolutionary history.

Biases in estimates of effective population sizes and migration rates in the presence of unsampled 'ghost' populations were previously investigated by Beerli (2004) and Slatkin (2004). Using an 'n'-island model (Slatkin, 1985), where each of 'n' populations of constant size can exchange genes at constant rates with the 'n-1' other populations, Beerli estimated the effect of the magnitude and direction of migration in the presence of a 'ghost' population. He simulated three populations of identical effective population size and per-generation mutation rate under this 'n'-island model, such that each set of populations had either high or low magnitude of unidirectional or bidirectional migration with the 'ghost' population (Figure 1).
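
The island-model setup described above can be made concrete with a minimal sketch. This is not Beerli's simulation code and is not part of MIGRATE-n. It assumes a purely deterministic, migration-only model (no drift, no mutation), and the function names (`island_migration_matrix`, `ghost_only_matrix`, `migrate`) and the rate m = 0.01 are hypothetical choices made for illustration. The sketch builds a backward migration matrix and tracks allele frequencies in two sampled populations that never exchange migrants directly but both exchange migrants with a third, unsampled population.

```python
def island_migration_matrix(n_demes, m):
    """Backward migration matrix for a symmetric n-island model.

    Entry [i][j] is the fraction of deme i's gene copies drawn from deme j
    each generation. Every deme receives a total migrant fraction m, split
    evenly among the other n_demes - 1 demes.
    """
    share = m / (n_demes - 1)
    return [[1.0 - m if i == j else share for j in range(n_demes)]
            for i in range(n_demes)]


def ghost_only_matrix(m):
    """Two sampled demes (0 and 1) that exchange genes only with a ghost (2).

    Hypothetical illustration: direct migration between demes 0 and 1 is zero.
    """
    return [
        [1.0 - m, 0.0, m],        # deme 0 receives migrants from the ghost only
        [0.0, 1.0 - m, m],        # deme 1 receives migrants from the ghost only
        [m / 2, m / 2, 1.0 - m],  # ghost receives migrants from both sampled demes
    ]


def migrate(matrix, freqs, generations):
    """Apply new_p[i] = sum_j matrix[i][j] * p[j] for the given number of generations."""
    for _ in range(generations):
        freqs = [sum(row[j] * freqs[j] for j in range(len(freqs)))
                 for row in matrix]
    return freqs


if __name__ == "__main__":
    m = 0.01  # assumed per-generation migrant fraction
    start = [1.0, 0.0, 0.5]  # allele frequencies in deme 0, deme 1, ghost
    end = migrate(ghost_only_matrix(m), start, 500)
    print("difference between sampled demes:", end[0] - end[1])
```

In this sketch the allele-frequency difference between the two sampled populations shrinks by a factor of (1 - m) each generation, even though their direct migration rate is zero. A two-population analysis that leaves out the 'ghost' could read that convergence as direct gene flow between the sampled populations.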
Beerli also estimated the effect of the number of 'ghost' populations by sampling two populations out of a larger, varied set of unsampled populations. Each of these datasets was then analyzed using the MIGRATE-n software. Beerli's study identified several effects of 'ghost' population gene flow: (1) a higher migration rate, whether unidirectional or bidirectional, to and from the 'ghost' population led to an overestimation of migration rates between the sampled populations; (2) increasing the number of 'ghost' populations increased the bias in estimates of population sizes, but had little effect on the migration rate estimates; and (3) increasing the number of sampled loci did not improve or affect the estimation of migration rates.

Here we extend Beerli's study to a more complex model of evolutionary history, popularly termed the Isolation with Migration (IM) model (Hey & Nielsen, 2004, 2007; Nielsen & Wakeley, 2001). This class of models is widely used as a model of divergence with gene flow, where two sampled populations or subpopulations have diverged from a common ancestor and have maintained gene flow via the exchange of migrants since divergence. Numerous tools have been developed to estimate evolutionary history under the IM model, including the IM suite (IMa2p and IMa3; Sethuraman & Hey, 2016; Hey et al., 2018), MIST (Chung & Hey, 2017), and dadi (Gutenkunst, Hernandez, Williamson, & Bustamante, 2009). The IM suite of tools are
genealogy samplers that use Metropolis Coupled Markov Chain Monte Carlo (MCMCMC) to explore the parameter space of evolutionary demographic parameters, propose updates to parameters and