| INVESTIGATION Inferring the Probability of the Derived vs. the Ancestral Allelic State at a Polymorphic Site Peter D. Keightley1 and Benjamin C. Jackson Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, EH9 3FL, United Kingdom ABSTRACT It is known that the allele ancestral to the variation at a polymorphic site cannot be assigned with certainty, and that the most frequently used method to assign the ancestral state—maximum parsimony—is prone to misinference. Estimates of counts of sites that have a certain number of copies of the derived allele in a sample (the unfolded site frequency spectrum, uSFS) made by parsimony are therefore also biased. We previously developed a maximum likelihood method to estimate the uSFS for a focal species using information from two outgroups while assuming simple models of nucleotide substitution. Here, we extend this approach to allow multiple outgroups (implemented for three outgroups), potentially any phylogenetic tree topology, and more complex models of nucleotide substitution. We find, however, that two outgroups and the Kimura two-parameter model are adequate for uSFS inference in most cases. We show that using parsimony to infer the ancestral state at a specific site seriously breaks down in two situations. The first is where the outgroups provide no information about the ancestral state of variation in the focal species. In this case, nucleotide variation will be underestimated if such sites are excluded. The second is where the minor allele in the focal species agrees with the allelic state of the outgroups. In this situation, parsimony tends to overestimate the probability of the major allele being derived, because it fails to account for the fact that sites with a high frequency of the derived allele tend to be rare. We present a method that corrects this deficiency and is capable of providing nearly unbiased estimates of ancestral state probabilities on a site-by-site basis and the uSFS. KEYWORDS nucleotide polymorphism; ancestral allele; derived allele; unfolded site frequency spectrum; parsimony; misinference ANY population genetic and quantitative genetic anal- derived alleles across sites, the unfolded site frequency spec- Mysis methods require the assignment of ancestral vs. trum (uSFS) (e.g., Boyko et al. 2008; Schneider et al. 2011; derived states at polymorphic nucleotide sites. For example, Tataru et al. 2017). Fay and Wu (2000) and Zeng et al. (2006) proposed statis- The minor allele at a site or counts of numbers of minor tics, H and E, that compare the numbers of high, intermedi- alleles at a group of sites (the folded site frequency spectrum) ate, and low frequency derived variants, which can then be can be observed directly fromsequence polymorphism data.In used to distinguish between different modes of natural selec- contrast, the derived vs. the ancestral allele at a site cannot be tion and demographic change. A number of methods have known with certainty, because at least one outgroup is re- also been developed to infer selection and demographic quired for inference, and there is the possibility of more than change based on the complete distribution of counts of one mutation separating the focal species from the outgroup. This also implies that the uSFS cannot be known precisely. Copyright © 2018 Keightley and Jackson For the purpose of ancestral state inference, rule-based max- doi: https://doi.org/10.1534/genetics.118.301120 Manuscript received March 23, 2018; accepted for publication May 14, 2018; imum parsimony is the most frequently applied method in published Early Online May 16, 2018. molecular evolutionary genetics (e.g., Voight et al. 2006; Available freely online through the author-supported open access option. Dreszer et al. 2007; Keinan et al. 2007; Sabeti et al. 2007; This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which 1000 Genomes Project Consortium 2010, 2015; Lohse and permits unrestricted use, distribution, and reproduction in any medium, provided the Barton 2011; Langley et al. 2012; Schmidt et al. 2017). It has original work is properly cited. Supplemental material available at Figshare: https://doi.org/10.25386/genetics. been recognized, however, that parsimony potentially pro- 6275915. duces misleading results (Felsenstein 1981; Collins et al. 1Corresponding author: Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Charlotte Auerbach Rd., Edinburgh EH9 3FL, United 1994; Eyre-Walker 1998). Of particular relevance here is that Kingdom. E-mail: [email protected] sites that have a low frequency of the derived allele are Genetics, Vol. 209, 897–906 July 2018 897 usually more common than sites that have a high frequency infer ancestral state probabilities for each polymorphic site of the derived allele. This implies that misinference tends in the data. We evaluate the performance of the new ap- to upwardly bias counts of high frequency derived alleles proach by simulations, apply it to data from the Drosophila (Baudry and Depaulis 2003; Hernandez et al. 2007). Population Genomics Project (DPGP) as a test case, and rein- There is a related problem concerning the assignment of fer the ancestral state probabilities for a population of the ancestral states, which does not seem to have been addressed. 1000 Genomes Project in humans, which were previously If ancestral states are assigned on site-by-site basis, potentially inferred by a parsimony-related approach. useful information is ignored. For example, consider the case of a single outgroup species that is uninformative about the ancestral allele of the variation in a focal species at a site. It is Materials and Methods more likely, however, that the ancestral allele at the site is the Following Keightley et al. (2016), uSFS inference is carried low frequency allele, if sites with a high frequency of the out in two-steps. Evolutionary rate parameters are estimated derived allele are uncommon in the data set as a whole (as is from all sites in the data (including polymorphic and mono- usually the case). morphic sites) in step 1. In step 2 the uSFS is computed, Matsumoto et al. (2015) pointed out that ancestral states conditional on the evolutionary rate parameter estimates. are not observable, that a single best ancestral reconstruction Information from steps 1 and 2 is then combined in a third is not advisable, and that assuming one can bias molecular step to infer the ancestral state probability for each polymor- evolutionary inference. This was developed by Jackson et al. phic site. (2017), who assigned the ancestral state probability at a site fi as the inferred probability of the node for the common an- Representation of the data and some de nitions cestor of the focal species and the closest outgroup, obtained Suppose we have sampled m orthologous gene copies at a set using PAML (Yang 2007), while ignoring polymorphism of sites from a population of a focal species. The uSFS we data. However, this does not optimally weight information require to estimate therefore contains m 2 1 elements, ex- coming from the focal site itself and from the data as a whole. cluding the elements where the ancestral or derived allele is Inference of ancestral states on a site-by-site basis has been fixed. We assume that we have randomly sampled a single problematic, but there has been progress in inferring the uSFS. gene copy at each site in one or more outgroup species. We Hernandez et al. (2007) developed a context-dependent sub- assume that the tree topology relating the species is known stitution model using a single outgroup to infer the ancestral and does not vary among sites (Figure 1). In the analysis we state at a polymorphic site in a focal species, and then imple- assume that the nucleotide variation within the focal species fi mented a step to correct for ancestral misidenti cation. Their coalesces within the branch labeled b1. The consequences of simulations suggested, however, that the approach only par- polymorphism in the outgroup species and violation of the tially corrects for ancestral misidentification, depending on assumptions of an invariant tree topology and coalescence the divergence between the focal species and the outgroup. within branch b1 are investigated in simulations. The ob- Schneider et al. (2011) developed a probabilistic method served nucleotide configuration for a site is the count of each to infer the uSFS on a site-by-site basis, but did not use infor- of the four nucleotides in the focal species (labeled X, Y for a mation from the frequencies of polymorphisms across all sites, biallelic site), along with the state for each outgroup (A, C, G, so results from this method are biased. Keightley et al. (2016) or T). Let the number of outgroups = n (in Figure 1, n = 3), developed a maximum likelihood (ML) method that addresses and denote the outgroups o1, o2...on. Assuming an unrooted the deficiency in Schneider et al. (2011) and simulations sug- tree (as in Figure 1), the number of branches in the tree is gested that it is capable of correctly inferring the uSFS. It uses a therefore b =2n 2 1. two-stage process in which the evolutionary rates are estimated Models of nucleotide substitution by ML and then, assuming the rates, estimates the uSFS ele- ments by ML, while correctly weighting information from in- The JC model, K2 model, and a model allowing six symmet- formative and uninformative sites. However, the method is rical rates (R6; Figure 2) are considered. All substitution limited to two outgroups, assumes simple substitution models models require the estimation of evolutionary rates (i.e., [for one outgroup, the Kimura two-parameter (K2) model; for mean number of nucleotide changes per site) for each – two outgroups, the Jukes Cantor (JC) model], and is not read- branch, K1...Kb.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages10 Page
-
File Size-