A Maximum Likelihood Approach
Total Page:16
File Type:pdf, Size:1020Kb
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 15452–15457, December 1998 Evolution Using rare mutations to estimate population divergence times: A maximum likelihood approach GIORGIO BERTORELLE*† AND BRUCE RANNALA‡ *Department of Integrative Biology, University of California, Berkeley, CA 94720-3140; and ‡Department of Ecology and Evolution, State University of New York, Stony Brook, NY 11794-5245 Communicated by Henry C. Harpending, University of Utah, Salt Lake City, UT, October 23, 1998 (received for review July 27, 1998) ABSTRACT In this paper we propose a method to estimate size N. This estimator is based on a model of genetic drift without by maximum likelihood the divergence time between two popu- mutation, and it is therefore most suitable for populations that lations, specifically designed for the analysis of nonrecurrent have recently separated. However, when the differences between rare mutations. Given the rapidly growing amount of data, rare alleles are taken into account by using equivalents of Fst that allow f disease mutations affecting humans seem the most suitable for mutation [such as st (7) or Rst (8)], equivalent estimators candidates for this method. The estimator RD, and its conditional become feasible for older population subdivision events (8). version RDc, were derived, assuming that the population dynam- More recently, Nielsen et al. (9) have proposed a maximum ics of rare alleles can be described by using a birth–death process likelihood estimator based on the coalescent process and assum- approximation and that each mutation arose before the split of ing no mutation. When applied to simulated data from stable a common ancestral population into the two diverging popula- populations, this estimator appears to have less bias and lower tions. The RD estimator seems more suitable for large sample variance than an Fst-based estimator (9). sizes and few alleles, whose age can be approximated, whereas In this paper we present another likelihood estimator of the the RDc estimator appears preferable when this is not the case. time of divergence of two populations, specifically designed for When applied to three cystic fibrosis mutations, the estimator RD the analysis of rare alleles that have arisen by nonrecurrent could not exclude a very recent time of divergence among three mutation. Rare alleles are becoming an important source of Mediterranean populations. On the other hand, the divergence information on human populations as more disease mutations are time between these populations and the Danish population was mapped, more effort is focused on the study of population estimated to be, on the average, 4,500 or 15,000 years, assuming frequencies of disease mutations, and large-scale programs of or not a selective advantage for cystic fibrosis carriers, respec- genetic screening are becoming a realistic possibility. The method tively. Confidence intervals are large, however, and can probably we present here is best suited for analyzing data on rare disease be reduced only by analyzing more alleles or loci. mutations, since it assumes that the number of copies of each mutant (at the same or different loci) can be modeled by using a The amount of genetic divergence between two isolated popu- stochastic birth–death process with sampling (10, 11). This as- lations tends to accumulate with time, following their subdivision sumption, which is satisfied if mutants are rare (12), allows many from a common ancestral population. New alleles are indepen- demographic factors, including selection and population growth, dently generated in each descendent population by mutation, and to be introduced into the model in a relatively simple way. This the frequencies of the pre-subdivision alleles tend to diverge due approach also greatly simplifies the analysis of multiallelic and to the random sampling of genes in each generation (genetic multilocus data. drift). These two processes therefore leave a signature, increas- The Model ingly evident with time, on the genetic composition of the subdivided populations. We consider a simple model of an ancestral population (labeled Several methods have been proposed to estimate the time, in population 0) that separates into two descendent populations the past, when two or more populations arose from a single (labeled populations 1 and 2) at a time T generations in the past. ancestral population (i.e., the divergence time). Takahata and Nei The descendent populations experience no immigration. A rare (1) suggested that the net number of nucleotide substitutions d allele is assumed to have arisen by nonrecurrent mutation at time accumulated between two populations (called also dA in ref. 2) be T 1 t in the past. When the population split occurs, any copy of used to estimate the time of their divergence. If the two popu- the allele in the ancestral population has a probability s and (1 2 lations have the same constant size, d is expected to increase s) of joining one, or the other, descendent population. linearly with the product mT (where m is the mutation rate and T Within each population, we assume that the demography of the is the divergence time). The same linear increase with mT is mutant lineages can be described by using a birth–death process predicted for the genetic distance (dm)2 (3), computed as the approximation to the coalescent model, valid when the allele has square of the difference between the average allele size observed remained rare (12). Population sizes are allowed to change over at microsatellite markers in the diverging populations. In both time according to an exponential process of growth (or decline) j j j cases, therefore, an estimation of T can be simply obtained if the with rates 0 (ancestral population), 1, and 2 (descendent mutation rate is known. populations). Another commonly used approach (see, for example, ref. 4) is The Likelihood to first estimate Wright’s Fst (5) from allele frequencies and then use this estimate to predict the divergence time. The relationship 1 5 2 2T/2N A mutant allele arises at time T t in the past. The probability Fst 1 e (5, 6) can be used to estimate the divergence time distribution of the total number of alleles descended from the T, given that the populations have the same constant and known mutant, k, that existed immediately prior to the population subdivision event T generations ago is The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked ‘‘advertisement’’ in Abbreviations: pdf, probability distribution function; pgf, probability accordance with 18 U.S.C. §1734 solely to indicate this fact. generating function; CF, cystic fibrosis. © 1998 by The National Academy of Sciences 0027-8424y98y9515452-6$2.00y0 †To whom reprint requests should be addressed. e-mail: giorgio@ PNAS is available online at www.pnas.org. mws4.biol.berkeley.edu. 15452 Downloaded by guest on September 26, 2021 Evolution: Bertorelle and Rannala Proc. Natl. Acad. Sci. USA 95 (1998) 15453 ~ ! 5 ~ 2 n !n k22 $ Pr k 1 t t k 2, [1] and the conditional pdf is then n where (see ref. 12) t is defined as j 1 j ~ u . ! 5 S 1 2D~ n !j1~~ 2 !n !j2 2j Pr j1, j2 j1 0, j2 0 s t 1 s t j 0t j1 2 0e n 5 2 2j [2] t 1 2 ~ 2 j ! 0t . 1 1 2 0 e ~1 1 n ~1 2 s!!~sn 2 1!~1 2 n ! 3 t t t n2~ 2 ! ~n 2 ! . [9] Let j1 be the number of copies of the mutant that enter descen- t 1 s s t 2 dent population 1 at the population subdivision event and let j2 be the number that enter descendent population 2. The joint Let l be the number of copies of the mutant found in population probability distribution function (pdf) of j and j , conditional on i 1 2 i immediately after the divergence event at time T that leave one k 5 j 1 j ,is 1 2 or more descendents in a present-day sample from population i, k ~ u ! 5 S D j1~ 2 !j2 where i is either 1 or 2. The pdf of li is a binomial of the form Pr j1, j2 k s 1 s . [3] j1 The joint probability generating function (pgf) of j and j , j 2 1 2 ~ u ! 5 S iD li ~ 2 !ji li # # Pr li ji QTi 1 QTi 0 li ji, [10] conditional on k, is then li k k f ~ ! 5 O S D j1~ 2 !j2 j1 j2 u j1, j2 k z1, z2 s 1 s z1 z2 where QTi is the probability that an allele in population i leaves 5 j1 j1 0 one or more descendents at present and is given by (see ref. 12) 5 ~ 2 1 !k z2 sz2 sz1 . [4] j 2fi i The unconditional joint pgf of j1 and j2 is 5 2j [11] QTi 2 ~ 2 j ! iT , ` fi fi 2 i e f ~ ! 5 O ~ 2 1 !k~ 2 n ! k22 j , j z1, z2 z2 sz2 sz1 1 t vt 1 2 5 k 2 where the sampling fraction fi is the probability that a chromo- some in the present-day descendents of population i is sampled. ~ 2 1 !2~ 2 n ! z2 sz2 sz1 1 t 5 The pgf of li, conditional on ji,is 2 n 1 n 2 n . [5] 1 tz2 s tz2 s tz1 ji j 2 Using standard techniques (see ref. 13), we obtain the probability f ~ ! 5 O S iD li ~ 2 !ji li li l uj zi QTi 1 QTi zi i i 5 li of j1 and j2 by evaluating the j1th and j2th-order partial derivatives li 0 with respect to z1 and z2 as follows 5 ~ 2 1 !ji 1 QTi QTizi .