A model for detecting positive selection in protein-coding DNA sequences

John P. Huelsenbeck*†, Sonia Jain‡, Simon W. D. Frost§, and Sergei L. Kosakovsky Pond§

*Section of Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116; ‡Division of Biostatistics and Bioinformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0717; and §Antiviral Research Center, University of California at San Diego, 150 West Washington Street, La Jolla, CA 92103

Edited by Joseph Felsenstein, University of Washington, Seattle, WA, and approved March 1, 2006 (received for review September 21, 2005)

Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process prior. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.

Natural selection leaves a detectable signature in comparisons of protein-coding DNA sequences; a bias in the ratio of the rates of nonsynonymous and synonymous substitutions is unambiguous evidence for natural selection. Purifying selection causes the rate of nonsynonymous substitution to be smaller than the rate of synonymous substitution. In fact, the predominant pattern found in analysis of alignments of protein-coding DNA is that nonsynonymous substitutions have a lower rate than synonymous substitutions. This finding is consistent with natural selection acting to eliminate deleterious mutations that change the protein to a less functional form. However, natural selection can also act to increase the probability that a nonsynonymous mutation is fixed in the population. Positive, or directional, selection causes the relative rates of nonsynonymous to synonymous substitutions to be >1. Examples of positive selection have been found in many genes but perhaps most famously in human major histocompatibility complex (MHC) (1), HIV-1 envelope (env) gene (2), sperm lysins (3), and primate stomach lysozymes (4).

Although seemingly simple, detecting positive natural selection from alignments of protein-coding DNA is recognized as a formidable statistical problem. Many methods have been proposed to detect the footprint of natural selection, all of which are based on measuring the relative rates or numbers of nonsynonymous and synonymous substitutions (2, 5–20). Most of the methods assume a constant rate of nonsynonymous and synonymous substitutions across the sequence. These methods are ill-suited for detecting positive natural selection when only a few of the amino acid positions in a gene are under the influence of positive natural selection, with the others under purifying selection. Applying a method that assumes a constant rate of nonsynonymous change across the sequence potentially masks the signature of positive natural selection that is present at only a few positions in the alignment (21). In cases where such methods have been successful, many sites are typically under strong positive selection (e.g., MHC). More recently, Nielsen and Yang (2) developed a method that allows the rate of nonsynonymous substitution to vary across the sequence. The method of Nielsen and Yang has proven useful for detecting positive natural selection in sequences where only a few sites are under directional selection.

The Nielsen and Yang (2) approach assumes that the nonsynonymous/synonymous rate ratio (dN/dS ratio) at a site is a random variable drawn from some probability distribution. The goal is to estimate parameters of the model from alignments of protein-coding DNA and, by using these parameter estimates, identify amino acids under positive selection. Their model is quite complicated and contains many parameters to estimate. Some of the parameters account for the fact that the sequences are related to one another through some unknown phylogenetic tree. Other parameters account for biases in the substitution process, such as an increased rate of mutation for transitions. The remaining parameters, however, describe how dN/dS varies across the sequence. Nielsen and Yang showed not only that it is practical to efficiently estimate these parameters from alignments of protein-coding DNA sequences but also that one can use an empirical Bayesian approach to detect specific amino acid residues that are under the influence of positive natural selection.

How should variation in the rate of nonsynonymous substitution across a sequence be modeled? Population genetics theory is largely silent on this issue. For the most part, we lack information on the distribution of selection coefficients for new mutations and the demography of the populations under consideration, making it difficult to predict a distribution for the rate of nonsynonymous substitution (22, 23). The original Nielsen and Yang (2) approach assumed that a site could be in one of three categories, each of which differed in its dN/dS rate ratio. With probability p1, a site is in category 1, which has dN/dS = 0; with probability p2, the site is in category 2, which has dN/dS = 1; and with probability p3, the site is in a category that has dN/dS > 1 (p1 + p2 + p3 = 1). Nielsen and Yang (2) also considered a model that allowed dN/dS to vary continuously on the interval (0, 1); a gamma distribution, truncated on the interval (0, 1), was used to model amino acid sites that are acting neutrally or under the influence of purifying natural selection. Under this model, a site has dN/dS drawn from a truncated gamma distribution with probability p1 or has dN/dS > 1 with probability p2. Later, Yang et al. (12) systematically explored more ways to model variation in dN/dS across a sequence. They considered a total of 13 models. The "M10" model from Yang et al. (12), for example, assumes that, with probability p1, the dN/dS rate ratio is drawn from a beta distribution on the interval (0, 1) and, with probability p2, the dN/dS rate ratio is drawn from an offset gamma distribution. Even though many of the models considered in Nielsen and Yang (2) and in Yang et al. (12) describe dN/dS as varying continuously, in practice, these continuous distributions are discretized to allow the likelihood to be calculated (the likelihood is averaged over the different discrete categories). At least as currently implemented, then, all of these models describing how the rate of nonsynonymous substitution varies across the sequence are discrete.

We take a different approach to modeling variation in dN/dS across sites, allowing sites to be in one of a number of classes, with each class having a different dN/dS ratio. The prior probability distribution for the number of classes and the dN/dS for each class is described by a Dirichlet process prior. The Dirichlet process prior provides a flexible way to model situations in which the data elements are drawn from a mixture of simpler parametric distributions. For typical mixture models, the number of mixture components is assumed to be known, and determining the correct number of components a priori for a particular model is difficult. For the Dirichlet process prior, however, the number of mixture components is countably infinite, obviating the need to determine the correct number of mixture components. Here we apply the Dirichlet process to the problem of detecting positive selection in protein-coding DNA sequences. Instead of assuming that the dN/dS rate ratio for a site is drawn from a particular parametric distribution, as is the case for Nielsen and Yang (2) and Yang et al. (12), we assume that a site is assigned to a category with a particular dN/dS value. The number of selection categories and the dN/dS value for each category are both considered random variables in our model.

The Dirichlet process has been used in one other case for an evolutionary problem: Lartillot and Philippe (24) used the Dirichlet process to model variation in the substitution process across alignments of amino acid sequences. Similarly, Pond and Frost (25) described a flexible discretization scheme that is able to fit a wide range of rate distributions. However, this scheme still requires that the maximum number of rate classes be specified a priori.

We estimate the parameters of the model in a Bayesian framework. Calculating the joint posterior probability distribution of the parameters involves summation over all possible phylogenetic trees relating the sequences and, for each tree, integration over all possible combinations of parameter values. We use Markov chain Monte Carlo (MCMC) to approximate the posterior probability distribution of the parameters and apply the method to six alignments of protein-coding DNA sequences (12, 17, 26, 27).

Conflict of interest statement: No conflicts declared.
This paper was submitted directly (Track II) to the PNAS office.
Abbreviation: MCMC, Markov chain Monte Carlo.
†To whom correspondence should be addressed. E-mail: [email protected].
© 2006 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0508279103  PNAS | April 18, 2006 | vol. 103 | no. 16 | 6263–6268

Results and Discussion

The Dirichlet process prior has two components: One is a parameter, usually called α, that influences the probability that data elements find themselves in the same cluster. In other words, the parameter controls the "clumpiness" of the process. The other component of a Dirichlet process prior is a probability distribution that describes the probability of the parameter assigned to each cluster. (The Dirichlet process prior is described in more detail in Materials and Methods.) We examined the robustness of the inferences of positive selection to three different choices for α, choosing α such that the prior mean of the number of components (k) was E(k) ≈ 1, E(k) = 5, and E(k) = 10; we did this to keep the number of selection classes to a manageable number because the likelihood calculations become too computationally cumbersome when k is large (e.g., k > 25). An alternative that we did not explore is to place a prior probability distribution on α. Often in problems using a Dirichlet process prior, a gamma hyperprior is placed on α.

Tables 1–6 compare the posterior and prior probability distributions for the number of mixture components (selection categories) for each of the six alignments. For all of the alignments we examined, there is very little posterior probability for k = 1 and k = 2, even when there may be a substantial amount of prior probability for those cases. The data are very difficult to explain with only a few selection classes.

Table 2. Prior and posterior probabilities for the number of selection classes, k, for HIV-1 pol

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 3.22        E(k|X) = 5.99        E(k|X) = 9.58
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9285   0.0000      0.0157   0.0000      0.0001   0.0000
 2      0.0690   0.0006      0.0686   0.0000      0.0006   0.0000
 3      0.0025   0.7837      0.1459   0.0313      0.0033   0.0006
 4      0.0001   0.2096      0.2014   0.1518      0.0114   0.0055
 5      0.0000   0.0062      0.2036   0.2316      0.0285   0.0252
 6      0.0000   0.0001      0.1611   0.2467      0.0557   0.0607
 7      0.0000   0.0000      0.1041   0.1688      0.0891   0.1018
 8      0.0000   0.0000      0.0566   0.0979      0.1199   0.1404
 9      0.0000   0.0000      0.0265   0.0467      0.1388   0.1615
 10     0.0000   0.0000      0.0108   0.0161      0.1405   0.1599
 >10    0.0000   0.0000      0.0057   0.0093      0.3797   0.3349

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

We also examined the marginal posterior probability of ω = dN/dS for each site. Fig. 1 shows the probability distribution for four

Table 1. Prior and posterior probabilities for the number of selection classes, k, for HIV-1 vif

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 3.04        E(k|X) = 6.33        E(k|X) = 10.21
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9434   0.0000      0.0139   0.0000      0.0000   0.0000
 2      0.0550   0.0000      0.0648   0.0000      0.0004   0.0000
 3      0.0015   0.9634      0.1439   0.0334      0.0026   0.0002
 4      0.0000   0.0363      0.2040   0.1096      0.0096   0.0027
 5      0.0000   0.0002      0.2086   0.1929      0.0256   0.0134
 6      0.0000   0.0000      0.1647   0.2292      0.0528   0.0371
 7      0.0000   0.0000      0.1050   0.1967      0.0879   0.0751
 8      0.0000   0.0000      0.0556   0.1288      0.1216   0.1225
 9      0.0000   0.0000      0.0251   0.0667      0.1432   0.1561
 10     0.0000   0.0000      0.0098   0.0286      0.1460   0.1630
 >10    0.0000   0.0000      0.0047   0.0143      0.3828   0.4085

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 3. Prior and posterior probabilities for the number of selection classes, k, for HIV-1 env

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 2.16        E(k|X) = 6.06        E(k|X) = 10.61
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9505   0.0000      0.0124   0.0000      0.0000   0.0000
 2      0.0483   0.8532      0.0613   0.0039      0.0003   0.0000
 3      0.0012   0.1350      0.1415   0.0430      0.0020   0.0001
 4      0.0000   0.0117      0.2057   0.1328      0.0082   0.0014
 5      0.0000   0.0001      0.2129   0.2147      0.0232   0.0085
 6      0.0000   0.0000      0.1682   0.2344      0.0502   0.0265
 7      0.0000   0.0000      0.1060   0.1820      0.0867   0.0600
 8      0.0000   0.0000      0.0550   0.1069      0.1233   0.1024
 9      0.0000   0.0000      0.0240   0.0510      0.1476   0.1399
 10     0.0000   0.0000      0.0090   0.0221      0.1513   0.1635
 >10    0.0000   0.0000      0.0039   0.0091      0.3844   0.4688

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.
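The prior columns f(k) in these tables follow from the Dirichlet process prior alone. As a rough cross-check, that distribution can be approximated by simulating the Chinese-restaurant construction of the process. This is an illustrative sketch only; the function names are ours, and the alignment length c and concentration α used in the usage note below are hypothetical stand-ins, not the values used for any of the six genes.

```python
import random

def sample_num_classes(alpha, c, rng):
    """One draw of the number of occupied selection classes k for c sites
    under a Dirichlet process prior, via the Chinese-restaurant construction."""
    counts = []  # counts[j] = number of sites currently assigned to class j
    for i in range(c):
        # Site i joins existing class j with probability counts[j]/(alpha + i),
        # or opens a new class with probability alpha/(alpha + i).
        r = rng.uniform(0.0, alpha + i)
        acc = 0.0
        for j, n in enumerate(counts):
            acc += n
            if r < acc:
                counts[j] += 1
                break
        else:
            counts.append(1)
    return len(counts)

def estimate_prior_on_k(alpha, c, draws, seed=1):
    """Monte Carlo estimate of f(k | alpha, c)."""
    rng = random.Random(seed)
    freq = {}
    for _ in range(draws):
        k = sample_num_classes(alpha, c, rng)
        freq[k] = freq.get(k, 0) + 1
    return {k: n / draws for k, n in sorted(freq.items())}
```

With a very small α the prior piles up on k = 1, mirroring the E(k) ≈ 1 columns above; raising α shifts the mass toward larger k, as in the E(k) = 5 and E(k) = 10 columns.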

Table 4. Prior and posterior probabilities for the number of selection classes, k, for human influenza virus hemagglutinin

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 2.24        E(k|X) = 5.50        E(k|X) = 9.46
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9383   0.0000      0.0149   0.0000      0.0000   0.0000
 2      0.0598   0.7744      0.0673   0.0048      0.0005   0.0000
 3      0.0018   0.2146      0.1460   0.0775      0.0029   0.0020
 4      0.0000   0.0107      0.2037   0.1963      0.0103   0.0062
 5      0.0000   0.0003      0.2064   0.2670      0.0267   0.0244
 6      0.0000   0.0000      0.1625   0.2124      0.0540   0.0580
 7      0.0000   0.0000      0.1037   0.1355      0.0883   0.1156
 8      0.0000   0.0000      0.0553   0.0620      0.1208   0.1564
 9      0.0000   0.0000      0.0252   0.0301      0.1412   0.1720
 10     0.0000   0.0000      0.0100   0.0104      0.1436   0.1571
 >10    0.0000   0.0000      0.0050   0.0038      0.3820   0.2962

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Table 6. Prior and posterior probabilities for the number of selection classes, k, for vertebrate β-globin

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 3.03        E(k|X) = 5.65        E(k|X) = 8.95
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9462   0.0000      0.0132   0.0000      0.0000   0.0000
 2      0.0525   0.0000      0.0630   0.0000      0.0004   0.0000
 3      0.0014   0.9716      0.1422   0.0524      0.0024   0.0008
 4      0.0000   0.0280      0.2039   0.1811      0.0092   0.0094
 5      0.0000   0.0004      0.2100   0.2687      0.0250   0.0323
 6      0.0000   0.0000      0.1664   0.2343      0.0524   0.0816
 7      0.0000   0.0000      0.1060   0.1474      0.0881   0.1395
 8      0.0000   0.0000      0.0560   0.0732      0.1227   0.1767
 9      0.0000   0.0000      0.0250   0.0307      0.1449   0.1797
 10     0.0000   0.0000      0.0097   0.0086      0.1477   0.1493
 >10    0.0000   0.0000      0.0047   0.0035      0.3815   0.2276

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

arbitrarily chosen sites from the HIV-1 env alignment. Three things are apparent: First, and as expected, different sites for the same data set have different probability distributions for ω. Second, the distribution for ω varies depending on the value of the parameter α used in the analysis. The distributions were similar when the prior expectation of the number of selection classes was 5 or 10 but were different when the prior expectation was 1. In short, when the model attempts to explain the data with too few selection categories, the marginal posterior probability distribution for ω at a site becomes a compromise between the necessities for that site and others that are grouped into the same selection category. Third, and perhaps more importantly, the marginal posterior probability distribution of ω for a site can be used to directly calculate the probability that the site experiences positive selection. The probability that ω > 1 is the probability that the site is under positive selection. This quantity can be directly calculated from the output of the MCMC procedure as the fraction of the time the site had ω > 1.

Fig. 2 shows the probability that each site is under positive selection for all of the analyses of this paper. There is strong concordance between the sites that we find to be under positive selection and the sites that Yang et al. (12) found to be under positive selection. Yang et al. (12) found that analysis of different probability distributions describing how ω varies across a sequence would typically result in a set of sites that always were found to be under positive selection and another set of sites that were inferred to be under positive selection for some models but not others. In general, the Dirichlet process prior picks out the sites that were consistently found to be under positive selection by Yang et al. (12) but assigns lower probabilities to these same sites of being under positive selection. For example, the Yang et al. (12) analysis of the HIV-1 env alignment consistently found sites 28, 66, and 87 to be under positive selection (with probabilities of 0.99 or greater), and sites 26 and 51 to be under positive selection with probability 0.99

Table 5. Prior and posterior probabilities for the number of selection classes, k, for Japanese encephalitis virus env

        E(k) ≈ 1             E(k) = 5.00          E(k) = 10.00
        E(k|X) = 2.01        E(k|X) = 3.58        E(k|X) = 5.44
 k      f(k)     f(k|X)      f(k)     f(k|X)      f(k)     f(k|X)
 1      0.9344   0.0000      0.0149   0.0000      0.0001   0.0000
 2      0.0635   0.9849      0.0669   0.1832      0.0006   0.0223
 3      0.0021   0.0146      0.1445   0.3497      0.0032   0.1089
 4      0.0000   0.0005      0.2017   0.2627      0.0110   0.1863
 5      0.0000   0.0000      0.2052   0.1365      0.0281   0.2201
 6      0.0000   0.0000      0.1627   0.0500      0.0556   0.2066
 7      0.0000   0.0000      0.1049   0.0139      0.0897   0.1348
 8      0.0000   0.0000      0.0568   0.0032      0.1213   0.0721
 9      0.0000   0.0000      0.0263   0.0008      0.1406   0.0316
 10     0.0000   0.0000      0.0106   0.0000      0.1421   0.0117
 >10    0.0000   0.0000      0.0054   0.0000      0.3774   0.0057

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.

Fig. 1. The posterior probability density distribution of ω, the dN/dS rate ratio, for four sites of the HIV-1 env alignment. The lightly shaded portion of each distribution is the part that has ω > 1.

Fig. 2. The posterior probabilities of sites being under positive selection for each of the analyses of the six alignments of this study. The graphs are grouped by alignment, with each group consisting of three graphs. The top graph of each group has E(k) ≈ 1, the middle graph has E(k) = 5, and the bottom graph has E(k) = 10.

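The per-site probabilities plotted in Fig. 2 come directly from the MCMC output: the probability that a site is under positive selection is estimated as the fraction of post-burn-in samples in which that site's ω exceeds 1. A minimal sketch (the function name is ours and the sample values are invented for illustration):

```python
def prob_positive_selection(omega_samples):
    """Posterior probability that omega > 1 for one site, estimated as the
    fraction of MCMC samples (post burn-in, post thinning) with omega above 1."""
    if not omega_samples:
        raise ValueError("need at least one sample")
    return sum(1 for w in omega_samples if w > 1.0) / len(omega_samples)

# Illustrative only: four fabricated draws of omega for a single site.
print(prob_positive_selection([0.2, 1.5, 1.1, 0.9]))  # 0.5
```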
under the M12 model. (The M12 model is a mixture of two normal distributions with a discrete category with ω = 0.) Our analysis of the HIV-1 env alignment finds sites 26, 28, 51, 66, 83, and 87 to be under positive selection, all having a probability of >0.95 of having ω > 1. Our analysis does not condition on the maximum likelihood values of the parameters (the tree, branch lengths, and substitution model parameters) as is the case of the Nielsen and Yang (2) approach. It is likely that the accommodation of uncertainty in the model parameters causes the probabilities of sites being in particular categories to be dampened relative to approaches that do not account for parameter uncertainty.

Table 7 lists sites that had a high probability of being under positive selection for all six genes. For the most part, the same sites are found to be under positive selection regardless of the value of the concentration parameter used in the analysis. For example, sites 26, 28, 51, 66, 83, and 87 of the HIV-1 env alignment were inferred to be under positive selection regardless of the value of α assumed in the analysis. Sites 24, 68, 69, and 76 had a probability >0.95 of being under positive selection when E(k) ≈ 1 but not when E(k) = 5 or E(k) = 10. However, the probability of those sites being under positive selection was just below the 0.95 threshold. (Sites 24, 68, 69, and 76 had probabilities ranging between 0.88 and 0.93 of having ω > 1 when the expected number of selection categories was set to 5 or 10.)

Table 7. Sites potentially under positive selection

Data                         E(k)   Sites with probability >0.95 of being under positive selection
Vertebrate β-globin          1      –
                             5      –
                             10     –
Japanese encephalitis        1      –
virus env                    5      –
                             10     –
Human influenza virus        1      –
hemagglutinin                5      226, 135
                             10     226, 135
HIV-1 env                    1      28, 66, 26, 87, 51, 83, 76, 69, 68, 24
                             5      28, 66, 26, 87, 83, 51
                             10     28, 66, 26, 87, 83, 51
HIV-1 pol                    1      67, 347, 478, 779, 568, 761
                             5      67, 347, 779, 478, 3, 568
                             10     67, 347, 779, 478, 3, 568
HIV-1 vif                    1      33, 167, 33, 127, 39, 109, 122, 47, 92, 37
                             5      33, 167, 127, 31, 37, 109, 39, 122, 92, 47, 63
                             10     33, 127, 167, 31, 37, 109, 122, 39, 92, 47

Methods for detecting the presence of positive natural selection in protein-coding DNA have become an important tool in studies of molecular evolution. The recent advances that allow the nonsynonymous/synonymous rate ratio to vary across the sequence have opened up the possibility of detecting specific amino acid residues that are functionally important, displaying an elevated dN/dS rate ratio. The method we describe here represents an important extension of existing methods by allowing a more flexible description of how dN/dS varies across a sequence and by accounting for uncertainty in parameters of the model when making inferences of positive selection.

Materials and Methods

Data. We assume an alignment of protein-coding DNA sequences is available. The alignment is contained in the matrix X = {x_ij},

Table 8. The prior probability distributions assigned to parameters of the phylogenetic model

Parameter                      Symbol   Prior
Tree topology                  τ        All trees have equal prior probability
Branch lengths                 ν        Branch lengths are independent exponential(10) random variables
Codon frequencies              π        Flat
Transition/transversion ratio  κ        Ratio of two identically distributed exponential random variables
dN/dS rate ratio               ω        Ratio of two identically distributed exponential random variables
Category information           z, k     Dirichlet process
Dirichlet process parameter    α        Fixed such that E(k) is small [E(k) ≈ 1, E(k) = 5, E(k) = 10]
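Two of the priors in Table 8 are the ratio of two identically distributed exponential random variables. That ratio has density f(w) = 1/(1 + w)^2 regardless of the common rate, because the rate cancels in the quotient. A quick sketch to sample from and sanity-check this prior (function names are ours):

```python
import random

def sample_rate_ratio(rate=1.0, rng=random):
    # Ratio of two i.i.d. exponential random variables; the rate parameter
    # cancels, so the density is f(w) = 1/(1 + w)^2 for any rate > 0.
    return rng.expovariate(rate) / rng.expovariate(rate)

def rate_ratio_density(w):
    # Density of the ratio of two i.i.d. exponentials.
    return 1.0 / (1.0 + w) ** 2

# By symmetry, P(X/Y <= 1) = P(X <= Y) = 1/2, which matches the CDF of the
# density at 1: the integral of 1/(1+w)^2 from 0 to 1 is 1 - 1/2 = 1/2.
```

Note the median of this prior is 1 (the neutral value ω = 1), which is one way to see that it does not tilt the analysis toward or away from positive selection a priori.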

where i = 1, 2, ..., s and j = 1, 2, ..., c; s is the number of sequences, and c is the number of codons in each sequence. The information at the jth site is contained in the vector x_j = (x_1j, x_2j, ..., x_sj)′.

Phylogenetic Model. The s species are related to one another through an unknown phylogeny, represented as a bifurcating tree. Possible phylogenies are denoted τ_1, τ_2, ..., τ_B(s), where B(s) is the number of possible phylogenetic trees for s species. We assume a time-reversible model of nucleotide substitution (see below), which means that the phylogenetic trees can be arbitrarily rooted without changing the likelihood. Therefore, we assume unrooted phylogenetic trees throughout this study. The number of unrooted phylogenetic trees for s species is B(s) = (2s − 5)!/[2^(s−3)(s − 3)!] (28). Each unrooted tree has 2s − 3 branches. Each branch on the tree has a length, ν. The set of branch lengths for the jth tree is denoted ν_j = (ν_1, ν_2, ..., ν_(2s−3)).

We assume that substitutions occur on the phylogenetic tree according to a continuous-time Markov process. The instantaneous rate of change from codon i to codon j is contained in a matrix, Q = {q_ij}, the entries of which are given by

q_ij = { ωκπ_j : nonsynonymous transition
       { ωπ_j  : nonsynonymous transversion
       { κπ_j  : synonymous transition
       { π_j   : synonymous transversion
       { 0     : i and j differ at more than one position,   [1]

where ω is the nonsynonymous/synonymous rate ratio, κ is the transition/transversion rate ratio, and π_j is the stationary frequency of codon j (2, 14, 15). The rate matrix, Q, is 61 × 61 in dimension; although there are 64 possible codon triplets, the three termination codons of the universal code are excluded from the state space, leaving the 61 sense codons as the states of the substitution process. The diagonal elements of the rate matrix Q are specified such that each row sums to 0. We rescale the matrix such that the average rate of synonymous substitution is 1; this means that the branch lengths are in terms of expected number of synonymous substitutions per codon site. This scaling is different from the usual for phylogenies, where the rate matrix is rescaled such that the branch lengths are in terms of expected number of substitutions per site. Finally, note that the substitution process is time-reversible because π_i q_ij = π_j q_ji for all i ≠ j. Practically speaking, this means that we can use unrooted phylogenies instead of rooted trees when calculating the likelihood on the tree (29). The transition probabilities of the process are the probability of finding the process in codon j conditioned on the process starting in codon i and run over a branch of length ν. The transition probabilities can be calculated as P(ν) = e^(Qν).

Each of the c sites can be in one of k selection categories, where each category differs in its dN/dS value. The information on the category membership is contained in an association vector, z. For example, for an alignment of four sites, one possible assignment of sites to selection categories is z = (1, 1, 2, 3). Here, k = 3, and the first two sites are in the same category. The allocation vector describes a partition of the sites into different classes, each class with its own dN/dS value. The number of ways to partition a set of c sites into k classes is described by the Stirling numbers of the second kind (30):

S(c, k) = (1/k!) Σ_{i=0}^{k−1} (−1)^i C(k, i) (k − i)^c.   [2]

The sites for the example alignment of c = 4 sites can be assigned to k = 3 categories in a total of S(4, 3) = 6 ways: z = (1, 1, 2, 3); z = (1, 2, 1, 3); z = (1, 2, 3, 1); z = (1, 2, 2, 3); z = (1, 2, 3, 2); and z = (1, 2, 3, 3). We label possible partitions of the sites into classes by using the restricted growth function notation of Stanton and White (31), where elements are sequentially numbered with the constraint that the index numbers for two sites are the same if they are found in the same category. The number of sites in the ith category is denoted η_i. For the example given above, η_1 = 2, η_2 = 1, and η_3 = 1.

Likelihood. Assuming that substitutions are independent across sites, the likelihood for an alignment can be calculated as the product of the site likelihoods:

L(τ, ν, κ, π, k, ω_1, ..., ω_k, z) = Π_{i=1}^{c} Pr(x_i | τ, ν, κ, π, ω_{z_i}).   [3]

We used the Felsenstein (29) pruning algorithm to calculate likelihoods for specific combinations of parameter values.

Statistical Model. We estimate parameters in a Bayesian framework. All parameters of the model are considered random variables with their own prior and posterior probability distributions. The joint posterior probability of all of the parameters of the model is

f(τ, ν, κ, π, k, ω_1, ..., ω_k, z | X) = L(τ, ν, κ, π, k, ω_1, ..., ω_k, z) f(τ, ν, κ, π, k, ω_1, ..., ω_k, z) / f(X).   [4]

The posterior probability [f(·|X)] is equal to the product of the likelihood [L(·) or, equivalently, f(X|·)] and the prior probability distribution [f(·)] of the parameters divided by the marginal likelihood [f(X)].

Bayesian analysis requires that a prior probability distribution be specified for the parameters. Here we assume that the parameters take the prior probability distributions shown in Table 8.

The model, as described, is fairly typical of codon models implemented to date (32). However, the Dirichlet process prior is new to the problem of detecting positive natural selection and requires more explanation. The Dirichlet process prior is widely used in clustering problems as a nonparametric alternative, where the data elements are drawn from a mixture distribution with a

Huelsenbeck et al. | PNAS | April 18, 2006 | vol. 103 | no. 16 | 6267

countably infinite number of components (33, 34). [Ewens and Tavaré (35) describe the relationship between the Dirichlet process prior and the Ewens sampling formula from population genetics.] The Dirichlet process prior has two parameters: a concentration parameter, α, which, loosely speaking, determines the degree to which data elements are grouped together, and a base measure, G0(·), which describes the probability distribution of the parameter value associated with each cluster. The model can be described as follows: First, the number of classes, k, and an allocation variable, z, are randomly drawn from the probability distribution

f(z, k \mid \alpha, c) = \frac{\alpha^{k} \prod_{i=1}^{k} (\eta_i - 1)!}{\prod_{i=1}^{c} (\alpha + i - 1)},   [5]

where c is the number of sites and η_i is the number of sites allocated to the ith class. Once the sites have been partitioned into k categories, a ω = dN/dS value is assigned to each category by drawing randomly from the following probability distribution k times, G0(ω_i) = f(ω_i) = 1/[(1 + ω_i)^2], which is the probability density of the ratio of two identically distributed exponential random variables. Here, ω_i is the dN/dS rate ratio for the ith selection category. The probability of having k categories is obtained by summing over all possible partitions for k categories, and is

f(k \mid \alpha, c) = \frac{a_{c,k}\, \alpha^{k}}{\prod_{i=1}^{c} (\alpha + i - 1)},   [6]

where a_{c,k} is the absolute value of a Stirling number of the first kind. The expected number of categories is

E(k \mid \alpha, c) = \sum_{i=1}^{c} i\, f(k = i \mid \alpha, c) \approx \alpha \ln\!\left(1 + \frac{c}{\alpha}\right).   [7]

Finally, the probability of finding two sites grouped together into the same cluster is f(z_i = z_j | α, c) = 1/(1 + α).

MCMC. We approximated posterior probabilities by using MCMC (36, 37). The general idea is to construct a Markov chain that has as its state space the parameters of the statistical model and a stationary distribution that is the posterior probability of the parameters. Samples drawn from the Markov chain when at stationarity are valid, albeit dependent, draws from the posterior probability distribution of the parameters (38). The fraction of the time the Markov chain visits any particular combination of parameter values is an approximation of the joint posterior probability of those parameters.

We wrote a computer program that implements the MCMC method for the phylogenetic model described above. For each iteration of the Markov chain, the program randomly picks a parameter to change, proposes a new value for the parameter, and then accepts or rejects the proposed state as the next state of the chain by using the Metropolis–Hastings formula. Most of the proposal mechanisms have been described elsewhere. We use the LOCAL mechanism of Larget and Simon (39) to simultaneously change the tree and branch lengths (τ, ν). We update the codon frequencies (π) by using a Dirichlet proposal mechanism and change the transition/transversion rate ratio (κ) and the category dN/dS values (ω) by using a rate multiplier [i.e., we multiply the old parameter value by e^{λ(u − 1/2)} to obtain a new value for the parameter, where u is a uniform(0, 1) random number and λ is a tuning parameter]. Finally, we update the allocation vector by using a Gibbs sampling algorithm for nonconjugate models proposed by Neal (40). We implement the method of Neal (40) as follows: First, we randomly choose a site, i, and remove it from the set of k selection classes currently in computer memory. If the site is in a selection class by itself (i.e., η_{z_i} for the class was 1), then the entire class is deleted, and k is decreased by 1. Otherwise, η_{z_i} for the selection class that the site is associated with is decreased by 1. We then construct five new (temporary, auxiliary) classes by drawing the dN/dS value for each from the prior probability distribution. These auxiliary classes represent components that have no other sites associated with them. Gibbs sampling is then performed to update site i, but with respect to the distribution that includes the temporary auxiliary parameters. The site is assigned to the jth previously existing class with probability Zη_{−i,j} f(x_i | ω_j), where η_{−i,j} is the number of sites, excluding i, in class j, and to the jth of the five auxiliary classes with probability Z(α/5) f(x_i | ω_j), where Z is the appropriate normalizing constant. After this update, all dN/dS values not associated with a selection class are discarded.

All analyses were run for a total of 2 million update cycles. Chains were thinned, with samples taken every 100 cycles. Samples taken during the first 100,000 cycles were discarded as the burn-in phase. The MCMC procedure was repeated, resulting in two sets of samples for each gene. Convergence was assessed by using the program TRACER (http://evolve.zoo.ox.ac.uk/software.html?id=tracer).

Data Analysis. We analyzed six alignments of protein-coding DNA sequences. These alignments included: (i) β-globin gene sequences from s = 17 vertebrates (12); (ii) s = 23 env gene sequences from Japanese encephalitis virus (12, 26); (iii) s = 28 sequences of the HA1 domain of the hemagglutinin gene of human influenza virus A (17); (iv) s = 13 sequences of the HIV-1 env gene V3 region (27); (v) HIV-1 pol gene sequences from s = 23 isolates (12); and (vi) s = 29 HIV-1 vif gene sequences (12).

J.P.H. was supported by National Institutes of Health Grant GM-069801 and National Science Foundation Grants DEB-0445453 and CACTTOL-22035A.
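The partition prior in Eqs. 5–7 is easy to check numerically. The Python sketch below (the values of `alpha` and `c` are arbitrary illustrations, not taken from the paper) tabulates the unsigned Stirling numbers of the first kind a_{c,k}, evaluates f(k | α, c) from Eq. 6, and compares the exact expected number of classes from Eq. 7 with the α ln(1 + c/α) approximation.

```python
import math

def stirling1_unsigned(n):
    """Table of unsigned Stirling numbers of the first kind a_{m,k},
    via the recurrence a_{m,k} = a_{m-1,k-1} + (m-1) * a_{m-1,k}."""
    a = [[0] * (n + 1) for _ in range(n + 1)]
    a[0][0] = 1
    for m in range(1, n + 1):
        for k in range(1, m + 1):
            a[m][k] = a[m - 1][k - 1] + (m - 1) * a[m - 1][k]
    return a

def prob_k_classes(k, alpha, c, a):
    """Eq. 6: f(k | alpha, c) = a_{c,k} alpha^k / prod_{i=1}^{c}(alpha + i - 1)."""
    denom = math.prod(alpha + i - 1 for i in range(1, c + 1))
    return a[c][k] * alpha ** k / denom

alpha, c = 2.0, 50                 # illustrative values only
a = stirling1_unsigned(c)
probs = [prob_k_classes(k, alpha, c, a) for k in range(1, c + 1)]
exact = sum(k * p for k, p in zip(range(1, c + 1), probs))  # Eq. 7, exact sum
approx = alpha * math.log(1 + c / alpha)                    # Eq. 7, approximation
```

The probabilities sum to 1, and the exact expectation also equals the well-known Dirichlet-process identity E(k) = Σ_{i=1}^{c} α/(α + i − 1); for these values the logarithmic approximation agrees with the exact expectation to within about half a class.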
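The rate-multiplier move for κ and the class dN/dS values can be sketched as follows. In this Python sketch, `log_posterior` is a hypothetical stand-in for the likelihood-times-prior term (in the paper this requires a full phylogenetic likelihood); the Hastings correction for this proposal is the multiplier itself, new/old.

```python
import math
import random

def multiplier_proposal(old, lam):
    """Propose new = old * exp(lam * (u - 1/2)), with u ~ uniform(0, 1).
    Returns the proposed value and the Hastings ratio, new/old."""
    u = random.random()
    new = old * math.exp(lam * (u - 0.5))
    return new, new / old

def mh_update(value, log_posterior, lam=0.7):
    """One Metropolis-Hastings update of a positive-valued parameter."""
    prop, hastings = multiplier_proposal(value, lam)
    log_r = log_posterior(prop) - log_posterior(value) + math.log(hastings)
    if math.log(random.random()) < min(0.0, log_r):
        return prop    # accept the proposed value
    return value       # reject and keep the current value

# Illustration: sample omega under its prior alone, f(w) = 1/(1 + w)^2.
log_prior = lambda w: -2.0 * math.log1p(w)
omega = 1.0
for _ in range(1000):
    omega = mh_update(omega, log_prior)
```

Because the proposal is symmetric on the log scale, the proposed value always lies within a factor of e^{±λ/2} of the current one, and a positive parameter can never be driven negative, which is why this move suits rate-like quantities such as κ and ω.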
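The allocation update described above can be sketched in a few lines of Python, in the style of Neal's (2000) Algorithm 8. Here `site_lik(i, w)` is a hypothetical placeholder for the site likelihood f(x_i | ω) (in the paper this is a phylogenetic likelihood computation), classes are kept in a dict from class label to ω value, and the base measure G0 is sampled as the ratio of two iid exponentials.

```python
import random

def draw_omega_from_prior():
    """Draw omega from G0: the ratio of two iid exponentials, density 1/(1+w)^2."""
    return random.expovariate(1.0) / random.expovariate(1.0)

def update_allocation(i, z, omegas, alpha, site_lik, n_aux=5):
    """One Gibbs update of site i's class assignment.
    z: list mapping each site to a class label; omegas: {label: omega}."""
    # eta[c]: number of sites other than i currently in class c
    eta = {}
    for s, c in enumerate(z):
        if s != i:
            eta[c] = eta.get(c, 0) + 1
    if z[i] not in eta:                 # site i was alone in its class: delete it
        del omegas[z[i]]
    # temporary auxiliary classes, each with an omega drawn from the prior
    aux = {("aux", j): draw_omega_from_prior() for j in range(n_aux)}
    # unnormalized assignment probabilities: eta_j f(x_i|w_j) for existing
    # classes, (alpha/n_aux) f(x_i|w_j) for the auxiliary ones
    weights = {c: eta[c] * site_lik(i, omegas[c]) for c in eta}
    weights.update({c: (alpha / n_aux) * site_lik(i, w) for c, w in aux.items()})
    # sample a class label in proportion to its weight
    r = random.uniform(0.0, sum(weights.values()))
    for c, w in weights.items():
        r -= w
        if r <= 0.0:
            break
    if c in aux:                        # a fresh class was chosen: keep its omega
        omegas[c] = aux[c]
    z[i] = c                            # unused auxiliary omegas are discarded
```

After every update, each label in `omegas` has at least one site assigned to it, so the number of selection classes grows and shrinks as the chain runs, exactly as the text describes.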

1. Hughes, A. & Nei, M. (1988) Nature 335, 167–170.
2. Nielsen, R. & Yang, Z. (1998) Genetics 148, 929–936.
3. Lee, Y. H., Ota, T. & Vacquier, V. (1995) Mol. Biol. Evol. 12, 231–238.
4. Messier, W. & Stewart, C. B. (1997) Nature 385, 151–154.
5. Miyata, T. & Yasunaga, T. (1980) J. Mol. Evol. 16, 23–36.
6. Nei, M. & Gojobori, T. (1986) Mol. Biol. Evol. 3, 418–426.
7. Li, W. H., Wu, C. I. & Luo, C. C. (1985) Mol. Biol. Evol. 2, 150–174.
8. Li, W. H. (1993) J. Mol. Evol. 36, 96–99.
9. Pamilo, P. & Bianchi, N. O. (1993) Mol. Biol. Evol. 10, 271–281.
10. Comeron, J. M. (1995) J. Mol. Evol. 41, 1152–1159.
11. Ina, Y. (1995) J. Mol. Evol. 40, 190–226.
12. Yang, Z., Nielsen, R., Goldman, N. & Pedersen, A. (2000) Genetics 155, 431–449.
13. Yang, Z. & Nielsen, R. (2000) Mol. Biol. Evol. 17, 32–43.
14. Goldman, N. & Yang, Z. (1994) Mol. Biol. Evol. 11, 725–736.
15. Muse, S. V. & Gaut, B. S. (1994) Mol. Biol. Evol. 11, 715–724.
16. Yang, Z. (1998) Mol. Biol. Evol. 15, 568–573.
17. Fitch, W. M., Bush, R. M., Bender, C. A. & Cox, N. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7712–7718.
18. Suzuki, Y. & Gojobori, T. (1999) Mol. Biol. Evol. 16, 1315–1328.
19. Pond, S. K. & Muse, S. V. (2005) Mol. Biol. Evol. 22, 2375–2385.
20. Pond, S. L. K. & Frost, S. D. W. (2005) Mol. Biol. Evol. 22, 1208–1222.
21. Crandall, K. A., Kelsey, C. R., Imamichi, H., Lane, H. C. & Salzman, N. P. (1999) Mol. Biol. Evol. 16, 372–382.
22. Sawyer, S. & Hartl, D. (1992) Genetics 132, 1161–1176.
23. Nielsen, R. & Yang, Z. (2003) Mol. Biol. Evol. 20, 1231–1239.
24. Lartillot, N. & Philippe, H. (2004) Mol. Biol. Evol. 21, 1095–1109.
25. Pond, S. L. K. & Frost, S. D. W. (2005) Mol. Biol. Evol. 22, 223–234.
26. Zanotto, P. M., Kallas, E. G., Souza, R. F. & Holmes, E. C. (1999) Genetics 153, 1077–1089.
27. Leitner, T., Kumar, S. & Albert, J. (1997) J. Virol. 71, 4761–4770.
28. Schröder, E. (1870) Z. Math. Phys. 15, 361–376.
29. Felsenstein, J. (1981) J. Mol. Evol. 17, 368–376.
30. Abramowitz, M. & Stegun, I. A. (1972) Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables (Dover, New York).
31. Stanton, D. & White, D. (1986) Constructive Combinatorics (Springer, New York).
32. Huelsenbeck, J. P. & Dyer, K. A. (2004) J. Mol. Evol. 58, 661–672.
33. Ferguson, T. S. (1973) Ann. Stat. 1, 209–230.
34. Antoniak, C. E. (1974) Ann. Stat. 2, 1152–1174.
35. Ewens, W. J. & Tavaré, S. (1998) in Encyclopedia of Statistical Science, eds. Kotz, S., Read, C. B. & Banks, D. L. (Wiley, New York), pp. 230–234.
36. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953) J. Chem. Phys. 21, 1087–1092.
37. Hastings, W. K. (1970) Biometrika 57, 97–109.
38. Tierney, L. (1996) in Markov Chain Monte Carlo in Practice, eds. Gilks, W. R., Richardson, S. & Spiegelhalter, D. J. (Chapman & Hall, New York), pp. 59–74.
39. Larget, B. & Simon, D. L. (1999) Mol. Biol. Evol. 16, 750–759.
40. Neal, R. M. (2000) J. Comput. Graph. Stat. 9, 249–265.

6268 | www.pnas.org/cgi/doi/10.1073/pnas.0508279103 | Huelsenbeck et al.