Nonlinear dynamics of nonsynonymous (dN) and synonymous (dS) substitution rates affects inference of selection Jochen B W Wolf, Axel Künstner, Ki Woong Nam, Mattias Jakobsson, Hans Ellegren
To cite this version:
Jochen B W Wolf, Axel Künstner, Ki Woong Nam, Mattias Jakobsson, Hans Ellegren. Nonlinear dy- namics of nonsynonymous (dN) and synonymous (dS) substitution rates affects inference of selection. Genome Biology and Evolution, Society for Molecular Biology and Evolution, 2009, 1, pp.308-319. 10.1093/gbe/evp030. hal-01413570
HAL Id: hal-01413570 https://hal.archives-ouvertes.fr/hal-01413570 Submitted on 9 Dec 2016
HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. RESEARCH ARTICLE
Nonlinear Dynamics of Nonsynonymous (dN) and Synonymous (dS) Substitution Rates Affects Inference of Selection
Jochen B. W. Wolf, Axel Ku¨nstner, Kiwoong Nam, Mattias Jakobsson, and Hans Ellegren Department of Evolutionary Biology, Uppsala University, Uppsala, Sweden
Selection modulates gene sequence evolution in different ways by constraining potential changes of amino acid sequences (purifying selection) or by favoring new and adaptive genetic variants (positive selection). The number of nonsynonymous differences in a pair of protein-coding sequences can be used to quantify the mode and strength of selection. To control for regional variation in substitution rates, the proportionate number of nonsynonymous differences (dN) is divided by the proportionate number of synonymous differences (dS). The resulting ratio (dN/dS) is a widely used indicator for functional divergence to identify particular genes that underwent positive selection. With the ever-growing amount of genome data, summary statistics like mean dN/dS allow gathering information on the mode of evolution for entire species. Both applications hinge on the assumption that dS and mean dS (;branch length) are neutral and
adequately control for variation in substitution rates across genes and across organisms, respectively. We here explore the Downloaded from validity of this assumption using empirical data based on whole-genome protein sequence alignments between human and 15 other vertebrate species and several simulation approaches. We find that dN/dS does not appropriately reflect the action of selection as it is strongly influenced by its denominator (dS). Particularly for closely related taxa, such as human and chimpanzee, dN/dS can be misleading and is not an unadulterated indicator of selection. Instead, we suggest that inconsistencies in the behavior of dN/dS are to be expected and highlight the idea that this behavior may be inherent to
taking the ratio of two randomly distributed variables that are nonlinearly correlated. New null hypotheses will be needed http://gbe.oxfordjournals.org/ to adequately handle these nonlinear dynamics.
Introduction from gene families to chromosomes to entire species. This can address questions about the relative importance of neg- The extent to which selection affects genes and ative and positive selection and about the influence of genomes is a key question in genetics and molecular evo- parameters such as life-history traits or effective population lution. Selection may modulate gene sequence evolution in sizes that covary with patterns of molecular evolution different ways, for example, by constraining potential (Wright and Andolfatto 2008; Ellegren 2009). changes of amino acid sequences (purifying or negative se- at Montpellier SupAgro on September 12, 2016 Despite the extensive use of d /d , there are substan- lection) or by favoring new and adaptive genetic variants N S tial uncertainties associated with its basic properties. Esti- (positive selection). To quantify selection in the simplest mates of mean d /d in sets of human–chimpanzee case, the number of nonsynonymous differences in a pair N S orthologous genes for instance have varied from 0.64 of protein-coding sequences can be estimated. However, (Eyre-Walker and Keightley 1999) and 0.34 (Fay et al. substitution rates vary across the genome and between spe- 2001) to about 0.20–0.25 (CSAC 2005; Arbiza et al. cies that makes direct comparisons solely based on nonsy- 2006; Bakewell et al. 2007; RMGSC 2007). Moreover, nonymous substitutions difficult. To control for variation in based on alignments of sequences from several mammalian the underlying mutation rate, a standard way is to take the genomes, mean d /d has recently been found to vary ratio of the number of nonsynonymous differences per total N S among different branches of the mammalian tree (Kosiol number of possible nonsynonymous changes (d ) to the N et al. 2008). Although some of the variation may be attrib- number of synonymous differences per total number of syn- uted to technical problems like sequence quality and align- onymous changes (d ). This ratio (d /d ) is then used as S N S ment inaccuracies (Schneider et al. 2009), the interpretation a measure of ‘‘functional divergence’’ that accounts for and validity of d /d as a tool for locating genes affected by the underlying local or regional variation in the substitution N S selection have also been questioned on theoretical grounds. rate for which d is taken as a proxy. S Recent studies convincingly suggest that d /d shows time The application of d /d has a strong tradition in evo- N S N S dependency (Rocha et al. 2006), that within-population var- lutionary research, notably for the identification of genes iation can cause a nonmonotonic relationship of the selec- with a history of positive selection (e.g., Nielsen 2005). With tion strength and d /d (Kryazhimskiy and Plotkin 2008), the recent advances in sequencing technology, we are now at N S and that gene conversion may potentially mimic the effects the wake of an era that will allow comparative genomic anal- of selection in the genome (Berglund et al. 2009). There is ysis across large evolutionary timescales where summary further a growing literature on the effects of negative selec- statistics like mean d /d potentially make it possible to N S tion on d that can erroneously mimic signatures of positive gather information on the mode of evolution for any entity S selection (Chamary et al. 2006). A detailed understanding of the factors influencing dN/dS is of crucial importance as it Key words: positive selection, negative selection, protein evolution, strongly bears on our ability to make inferences about the selection models, dN/dS ratio, neutral theory, adaptive evolution, melanocortin-1-receptor. role of selection in evolution. In this study, we focus on the idea that d /d will be an E-mail: [email protected]. N S adequate estimator of functional divergence only if local Genome. Biol. Evol. 1:308–319. doi:10.1093/gbe/evp030 variation in substitution rates equally affects both synony- Advance Access publication August 13, 2009 mous and nonsynonymous sites. Hence, it is of crucial
Ó The Author(s) 2009. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nonlinear Dynamics of Nonsynonymous (dN) and Synonymous (dS) Substitution Rates 309 importance to understand how dN scales with dS. We use and Yang 1994) and several counting methods (Nei and simulations in combination with gene sequences available Gojobori 1986; Li 1993; Yang and Nielsen 2000) imple- from the genomes of a wide range of vertebrate species to mented in the CODEML program of the PAML package investigate the relationship between dN and dS and how this Version 4.1 (Yang 2007). ML analysis was performed with relationship affects their ratio (dN/dS and mean dN/dS). runmode-2. We used a method that takes nucleotide fre- quencies at each codon position into account and thereby controls for an artificial signature of x that may be due Materials and Methods to differences in the effective number of codons (Albu Terminology et al. 2008). Coding sequence alignments where dN, dS, or x exceeded 5 were excluded from all downstream anal- Throughout the manuscript, we adhere to the follow- yses (excluding all values .3 qualitatively yields the same ing terminology: the ratio of dN and dS for a single gene is results). We report the results from the ML method. Note denoted x, the arithmetic mean of x across genes is denoted that the maximum estimator is asymptotically unbiased. x , the ratio of the sum of dN (across genes), and the sum of The distributional properties of dN/dS we expand on below dS (across genes) is denoted by w. are thus unlikely to be produced by an estimation bias but We expand on this in little more detail below as it will most likely be inherent in the parameter as such. This is recurrently emerges as an issue. Mean dN/dS can be com- partially supported by the fact that counting methods Downloaded from puted in two ways. One can either calculate x for each gene yielded similar results. and take the average across all genes or calculate the sum of In a first step, estimates were only based on pairwise dN and the sum of dS across all genes and take the ratio of alignments between human and all other species (fig. 1A these two sums. Although the two approaches look similar and B) instead of branch-specific estimates based on mul- at a first glance, they are not equal. With a few exceptions, tiple alignments (fig. 1C). This allows evaluating the effect http://gbe.oxfordjournals.org/ the expectation of a ratio of two random variables is gen- of different gene sets across evolutionary distance and erally not equal to the ratio of the expectation of the two avoids potential bias from ancestral reconstruction. The random variables (Hejmans 1999). We can denote drawback of this approach is that the same starting point X d (human) is repeatedly used what essentially results in pseu- x 5 ½ N;i =n; ð1Þ d doreplication and may lead to properties specific to the pri- i2C S;i mate lineage being overrated in the result. Explicit and comparative contrasts cannot be used to control for it be- P cause evolutionary distance (branch length) is the parameter at Montpellier SupAgro on September 12, 2016 dN;i of interest here. We therefore replicated the analyses with w 5 iP2C ; ð2Þ mouse as a starting point (supplementary fig. S1, Supple- dS;i mentary Material online). i2C To further ensure that a single influential branch in the where the set C contains all genes with dS . 0, n is the num- primate lineage does not introduce a systematic bias in the ber of genes in C, and the summation is over the genes in the repeated pairwise comparisons with human, we also con- set C (note that we could not include dS50 when computing structed multiple alignments for 4,181 genes common to w, but we use the same set C for both calculations to be able all 11 species from human until opossum (11-way core to compare the values directly). To assess the level of differ- set, see above). As for pairwise alignments, multiple align- ence between x and w, we performed simulations under ments were generated using MAFFT Version 6.606b (Katoh a simple sequence evolution model (see Results on simula- and Toh 2008) and back translated to DNA sequences for tions). A third option would be to concatenate all coding se- subsequent analysis. A total of 3,866 alignments could be quences in a genome and estimate mean dN/dS directly. used for subsequent analyses. dN, dS,andx were estimated Although we expect this to be very similar to w, in-depth for each gene using the ML method from Yang (2007) im- analysis of the relative performance of these measures plemented in CODEML (model 5 1; user tree specified ac- may be warranted in future studies. cording to Miller et al. [2007]). A threshold of ,5ondN, dS, and x reduced the final data set to 826 estimates. Data Extraction and Parameter Estimates Pairwise and Multiple Comparisons with Human and Pairwise Comparisons between Zebra Finch and Several Other Vertebrate Species Chicken
Full-coding sequences for human and 15 additional Consideration of dN, dS, and x involving several spe- species (see table 1) were downloaded from the BioMart cies can be influenced by differences in Ne or lineage- database (ENSEMBL 50), and information about pairwise specific substitution rates. To exclude the effects of Ne 1:1 orthologues was extracted (http://www.biomart.org). or substitution rate priori, we constructed pairwise align- Pairwise alignments with human were generated for all spe- ments between chicken and zebra finch orthologues. We cies on protein sequences using MAFFT Version 6.606b made use of the fact that in birds, there is a large variation (Katoh and Toh 2008) and back translated to DNA sequen- in substitution rates across chromosomes and investigated ces for subsequent analysis. Alignments are available upon the relationship of mean dN, dS, and w across chromosomes. request. Estimates for dN, dS, and x were computed for each We downloaded the zebra finch protein sequences (ZE- gene using a maximum likelihood (ML) method (Goldman BRA_FINCH_1, 2009; ENSEMBL 53) from the BioMart 310 Wolf et al. homepage (http://www.biomart.org) and the chicken or the smoother span in the lowess algorithm. Bivariate his- protein sequences from the inparanoid database that yielded tograms for the heatmaps in figure 3 and supplementary a total of 17,148 sequences from zebra finch and 16,715 figures S2–S5 (Supplementary Material online) were sequences from chicken, respectively. 1:1 orthology for generated by an in-house script making use of the ‘‘fields these two proteomes was established by reciprocal blasting package.’’ using inparanoid 3.0 (O’Brien et al. 2005). The program An ML approach implemented in the ‘‘MASS pack- identified 11,413 groups of orthologs, of which 11,309 age’’ was used to fit the best univariate density function groups could be shown to have 1:1 orthologue relation- from a range of distributions (gamma, Gaussian, uniform, ships. From this set of genes, we constructed codon-based and Poisson) to empirical dN and dS distributions. The alignments using MUSCLE (Edgar 2004) followed by the gamma distribution was found to give the best fit (supple- calculation of dN and dS using the CODEML program of the mentary fig. S6, Supplementary Material online). PAML 4.1 package (see above). dN, dS,orx . 3 were dis- carded for subsequent analyses that reduced the data to a re- maining 11,107 pairwise dN and dS values. A Model for Pairs of Genes with Synonymous and Nonsynonymous Sites
Pairwise and Multiple Alignments of Passerine MC1R This section contains a summary of the model used to Downloaded from Sequences simulate data from a simple population divergence model.
We also assessed the relationship between dN, dS, and A more detailed description can be found in the Supplemen- x on a single-gene basis. We chose a gene (MC1R)with tary Material online. a prominent role in evolutionary research. Full passerine Let us consider a particular gene for which ortholo- gous genes exist in a pair of species and that these two spe- MC1R sequences were obtained from the National Center http://gbe.oxfordjournals.org/ for Biotechnology Information database (for GenBank cies diverged TD units of time ago (time is measured in units accession numbers, see fig. 2). Codon-based pairwise align- of N generations and N denotes the population size). For ments were constructed with the chicken MC1R sequence this particular gene, the total substitution rate for synony- (GenBank accession number: AB201628) and each of the mous sites is denoted rS/2, and the total substitution rate for passerine sequences. dN and dS were estimated from each nonsynonymous sites is denoted rN/2. We can view these alignments using CODEML program. dN, dS,ordN/dS . 3 two sets of sites as evolving independent of each other. We were not discarded to present the relationship across the full will let the sites evolve under rates that are similar to em- range of observed dS values. Qualitatively, the results do pirically observed rates (a lower rate for the nonsynony- not change if discarded. In a second step, multiple align- mous sites compared with synonymous sites—a at Montpellier SupAgro on September 12, 2016 ments between all 22 passerine sequences were obtained difference likely to be caused by purifying selection acting by MUSCLE (codon based). From this alignment, on nonsynonymous sites). an ML phylogenetic tree was constructed using PhyML Let’s assume that we have sampled one lineage from (Guindon and Gascuel 2003). dN and dS were estimated each species and that substitutions are added to a lineage with CODEML, applying the free-ratio model to calculate proportional to the length of the branch. In other words, the estimates from individual branches. the number of substitutions M of a branch of length t is Pois- son distributed with parameter r/2t, M ; Po(r/2t). The time till coalescence for two lineages (after they have entered the Statistical Analyses ancestral population) is denoted T2. This waiting time is ex- Statistical analyses were performed in R 2.8.0 (R De- ponentially distributed T2 ; Exp(1), with parameter 1. The velopment Core Team 2006). Model selection based on total coalescence time for the two lineages is TD þ T2 5 T. Akaike’s information criterion (AIC), Bayesian informa- Assuming no recombination within a gene, all sites in a par- tion criterion (BIC), and backward selection was used to ticular gene (both synonymous and nonsynonymous) find the best description of the relationship between w evolve according to the same genealogy, that is, all sites (or x ) with evolutionary distance and the relationship of within a gene have the exact same coalescent times. We show in the Supplementary Material online that allowing single gene x with dS. A log-log fit described the relation- ship better than a linear fit (cf. table 2) and is reported T2 to vary has negligible impact on the variables that we throughout the results. are interested in here; therefore, we assume that all genes Splines, or piecewise polynomials, were used to fit have the same divergence time T. smoothing curves through the scatterplot data of all genes in pairwise comparison (fig. 3; supplementary figs. S2– Results and Discussion S5, Supplementary Material online). We used B-splines as implemented in the ‘‘splines package.’’ To decide on We produced pairwise coding sequence alignments the number of knots for the final graphical representation between the complete set of human protein-coding sequen- of the splines, we used the BIC, which penalizes the number ces and the orthologous sequences of 15 species, chosen of parameters more strongly than AIC. As splines can be un- such that they cover a large part of the vertebrate evolution- duly influenced by values at the extreme of the ranges, we ary history. The number of genes obtained with a stringent also fitted local regression algorithms (lowess in the ‘‘base 1:1 orthologue relationship ranges between 17,226 for hu- package’’). The shape of the curves was very robust to man–chimpanzee and 936 for human–zebra fish (table 1). A changes in the number of knots in the regression splines total of 105 orthologous genes appear in all 15 pairwise Nonlinear Dynamics of Nonsynonymous (dN) and Synonymous (dS) Substitution Rates 311
r comparisons, representing a common core set of genes S d shared between all the studied vertebrate species. For each ; 0.178 0.071 0.008 0.054 0.073 0.036 0.048 0.107 0.004 0.104 gene, we estimated the number of nonsynonymous changes x per nonsynonymous site (d ), the number of synonymous
Spearman’s N changes per synonymous site (dS), and their ratio x 5 r d /d . N N S d As there is theoretical motivation that the degree of ;
x divergence between two lineages can affect mean dN/dS
Spearman’s (Rocha et al. 2006; Kryazhimskiy and Plotkin 2008), we
r initially chose to use external estimates of divergence S
d (branch length estimates from Miller et al. [2007]) to ex-
; plore its relationship with mean dN/dS. However, it turns N
d out that this measure can basically be equated with mean
Spearman’s 2 dS as estimated from our data (R 5 0.96, P , 0.001; dS 5 0.03 þ 1.46 Miller branch length). Therefore, w dS serves as a good proxy for branch length and all conclu- Downloaded from sions based on branch length will be true for dS as well. This N
d is of importance as we later propose that it is really dS that Mean influences dN/dS and not the concept of divergence per se. S d 0.327 0.059 0.181 0.358 0.881 Mean http://gbe.oxfordjournals.org/ Mean dN/dS Depends on Branch Length and the Set of a Genes Used
Mean dN/dS, measured by the unbiased estimator w, after Miller et al. (2007)
Branch Length strongly decreases with branch length (fig. 1A; log-log re- 2 gression: P , 0.001, Radj50:89). For example, w is 0.31 for human–chimpanzee, 0.14 for human–mouse, and 0.07 for human–zebra fish comparisons. An intuitive expla- Length Branch in PAML nation for this relationship is that the set of orthologues of increasingly distant species comparisons contain an in- at Montpellier SupAgro on September 12, 2016 creasing fraction of conserved genes that are involved in 1(%)
. basic biological processes and molecular functions shared x Genes among many vertebrate species. Low x values in distant
Probability of comparisons could thus be seen to represent genes evolving with under strong purifying selection. This effect of the selected 1