A Dirichlet Process Model for Detecting Positive Selection in Protein-Coding DNA Sequences
John P. Huelsenbeck*†, Sonia Jain‡, Simon W. D. Frost§, and Sergei L. Kosakovsky Pond§

*Section of Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 92093-0116; ‡Division of Biostatistics and Bioinformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 92093-0717; and §Antiviral Research Center, University of California at San Diego, 150 West Washington Street, La Jolla, CA 92103

Edited by Joseph Felsenstein, University of Washington, Seattle, WA, and approved March 1, 2006 (received for review September 21, 2005)

Conflict of interest statement: No conflicts declared.
This paper was submitted directly (Track II) to the PNAS office.
Abbreviation: MCMC, Markov chain Monte Carlo.
†To whom correspondence should be addressed.
© 2006 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0508279103
PNAS | April 18, 2006 | vol. 103 | no. 16 | 6263–6268

Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.

Natural selection leaves a detectable signature in comparisons of protein-coding DNA sequences; a bias in the ratio of the rates of nonsynonymous and synonymous substitutions is unambiguous evidence for natural selection. Purifying selection causes the rate of nonsynonymous substitution to be smaller than the rate of synonymous substitution. In fact, the predominant pattern found in analysis of alignments of protein-coding DNA is that nonsynonymous substitutions have a lower rate than synonymous substitutions. This finding is consistent with natural selection acting to eliminate deleterious mutations that change the protein to a less functional form. However, natural selection can also act to increase the probability that a nonsynonymous mutation is fixed in the population. Positive, or directional, selection causes the relative rates of nonsynonymous to synonymous substitutions to be >1. Examples of positive selection have been found in many genes but perhaps most famously in human major histocompatibility complex (MHC) (1), HIV-1 envelope (env) gene (2), sperm lysins (3), and primate stomach lysozymes (4).

Although seemingly simple, detecting positive natural selection from alignments of protein-coding DNA is recognized as a formidable statistical problem. Many methods have been proposed to detect the footprint of natural selection, all of which are based on measuring the relative rates or numbers of nonsynonymous and synonymous substitutions (2, 5–20). Most of the methods assume a constant rate of nonsynonymous and synonymous substitutions across the sequence. These methods are ill-suited for detecting positive natural selection when only a few of the amino acid positions in a gene are under the influence of positive natural selection, with the others under purifying selection. Applying a method that assumes a constant rate of nonsynonymous change across the sequence potentially masks the signature of positive natural selection that is present at only a few positions in the alignment (21). In cases where such methods have been successful, many sites are typically under strong positive selection (e.g., MHC). More recently, Nielsen and Yang (2) developed a method that allows the rate of nonsynonymous substitution to vary across the sequence. The method of Nielsen and Yang has proven useful for detecting positive natural selection in sequences where only a few sites are under directional selection.

The Nielsen and Yang (2) approach assumes that the nonsynonymous/synonymous rate ratio (dN/dS ratio) at a site is a random variable drawn from some probability distribution. The goal is to estimate parameters of the model from alignments of protein-coding DNA and, by using these parameter estimates, identify amino acids under positive selection. Their model is quite complicated and contains many parameters to estimate. Some of the parameters account for the fact that the sequences are related to one another through some unknown phylogenetic tree. Other parameters account for biases in the substitution process, such as an increased rate of mutation for transitions. The remaining parameters, however, describe how dN/dS varies across the sequence. Nielsen and Yang showed not only that it is practical to efficiently estimate these parameters from alignments of protein-coding DNA sequences but also that one can use an empirical Bayesian approach to detect specific amino acid residues that are under the influence of positive natural selection.

How should variation in the rate of nonsynonymous substitution across a sequence be modeled? Population genetics theory is largely silent on this issue. For the most part, we lack information on the distribution of selection coefficients for new mutations and the demography of the populations under consideration, making it difficult to predict a distribution for the rate of nonsynonymous substitution (22, 23). The original Nielsen and Yang (2) approach assumed that a site could be in one of three categories, each of which differed in its dN/dS rate ratio. With probability p1, a site is in category 1, which has dN/dS = 0; with probability p2, the site is in category 2, which has dN/dS = 1; and with probability p3, the site is in a category that has dN/dS > 1 (p1 + p2 + p3 = 1). Nielsen and Yang (2) also considered a model that allowed dN/dS to vary continuously on the interval (0, 1); a gamma distribution, truncated on the interval (0, 1), was used to model amino acid sites that are acting neutrally or under the influence of purifying natural selection. Under this model, a site has dN/dS drawn from a truncated gamma distribution with probability p1 or has dN/dS > 1 with probability p2. Later, Yang et al. (12) systematically explored more ways to model variation in dN/dS across a sequence. They considered a total of 13 models. The "M10" model from Yang et al. (12), for example, assumes that, with probability p1, the dN/dS rate ratio is drawn from a beta distribution on the interval (0, 1) and, with probability p2, the dN/dS rate ratio is drawn from an offset gamma distribution. Even though many of the models considered in Nielsen and Yang (2) and in Yang et al. (12) describe dN/dS as varying continuously, in practice, these continuous distributions are discretized to allow the likelihood to be calculated (the likelihood is averaged over the different discrete categories). At least as currently implemented, then, all of these models describing how the rate of nonsynonymous substitution varies across the sequence are discrete.

We take a different approach to modeling variation in dN/dS across sites, allowing sites to be in one of a number of classes, with each class having a different dN/dS ratio. The prior probability distribution for the number of classes and the dN/dS for each class is described by a Dirichlet process prior. The Dirichlet process prior provides a flexible way to model situations in which the data elements are drawn from a mixture of simpler parametric distributions. For typical mixture models, the number of mixture components is assumed to be known, and determining the correct

Table 2. Prior and posterior probabilities for the number of selection classes, k, for HIV-1 pol

        E(k) ≈ 1            E(k) = 5.00         E(k) = 10.00
        E(k|X) = 3.22       E(k|X) = 5.99       E(k|X) = 9.58
k       f(k)     f(k|X)     f(k)     f(k|X)     f(k)     f(k|X)
1       0.9285   0.0000     0.0157   0.0000     0.0001   0.0000
2       0.0690   0.0006     0.0686   0.0000     0.0006   0.0000
3       0.0025   0.7837     0.1459   0.0313     0.0033   0.0006
4       0.0001   0.2096     0.2014   0.1518     0.0114   0.0055
5       0.0000   0.0062     0.2036   0.2316     0.0285   0.0252
6       0.0000   0.0001     0.1611   0.2467     0.0557   0.0607
7       0.0000   0.0000     0.1041   0.1688     0.0891   0.1018
8       0.0000   0.0000     0.0566   0.0979     0.1199   0.1404
9       0.0000   0.0000     0.0265   0.0467     0.1388   0.1615
10      0.0000   0.0000     0.0108   0.0161     0.1405   0.1599
>10     0.0000   0.0000     0.0057   0.0093     0.3797   0.3349

The prior and posterior probabilities for the number of selection classes are f(k) and f(k|X), respectively.
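As a rough illustration of how a prior on the number of classes, such as f(k) in Table 2, can arise from a Dirichlet process, the following Python sketch computes the distribution of k under the standard Chinese restaurant process construction, in which the i-th site opens a new class with probability alpha / (alpha + i - 1). The function names, the number of sites, and the concentration parameter alpha are assumptions made for illustration only; they are not the values or the code used in the paper, and the sketch is not expected to reproduce the entries of Table 2.

```python
def crp_num_classes_prior(n_sites, alpha):
    """Prior on the number of classes k for n_sites elements under a
    Chinese restaurant process with concentration parameter alpha.

    Returns a list p in which p[k] is the prior probability of exactly
    k classes (p[0] is always 0). Hypothetical helper, for illustration.
    """
    probs = [0.0, 1.0]  # after the first site there is exactly one class
    for i in range(2, n_sites + 1):
        # Probability that site i starts a new class.
        new_class = alpha / (alpha + i - 1)
        nxt = [0.0] * (i + 1)
        for k in range(1, i):
            nxt[k] += probs[k] * (1.0 - new_class)  # site joins an existing class
            nxt[k + 1] += probs[k] * new_class      # site opens a new class
        probs = nxt
    return probs


def expected_num_classes(n_sites, alpha):
    """Closed-form prior mean E(k): sum over sites of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n_sites))


if __name__ == "__main__":
    # Both values below are assumed, not taken from the paper.
    prior = crp_num_classes_prior(n_sites=200, alpha=0.8)
    for k in range(1, 11):
        print(k, round(prior[k], 4))
    print("E(k) =", round(expected_num_classes(200, 0.8), 2))
```

With three sites and alpha = 1, this recursion gives probabilities 1/3, 1/2, and 1/6 for k = 1, 2, and 3. Under this construction, larger values of alpha shift prior mass toward more classes, which is one way to obtain priors with different E(k).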