<<

Copyright © 2004 by the Genetics Society of America
DOI: 10.1534/genetics.103.023226

Estimating the Frequency of Events That Cause Multiple-Nucleotide Changes

Simon Whelan*,†,1 and Nick Goldman* *EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, United Kingdom and †Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, United Kingdom Manuscript received October 15, 2003 Accepted for publication April 21, 2004

ABSTRACT Existing mathematical models of DNA sequence evolution assume that all substitutions derive from point mutations. There is, however, increasing evidence that larger-scale events, involving two or more consecutive sites, may also be important. We describe a model, denoted SDT, that allows for single-nucleotide, doublet, and triplet mutations. Applied to protein-coding DNA, the SDT model allows doublet and triplet mutations to overlap codon boundaries but still permits data to be analyzed using the simplifying assumption of independence of sites. We have implemented the SDT model for maximum-likelihood phylogenetic inference and have applied it to an alignment of mammalian globin sequences and to 258 other protein-coding sequence alignments from the Pandit database. We find the SDT model's inclusion of doublet and triplet mutations to be overwhelmingly successful in giving statistically significant improvements in fit of model to data, indicating that larger-scale events do occur. Distributions of inferred parameter values over all alignments analyzed suggest that these events are far more prevalent than previously thought. Detailed consideration of our results and the absence of any known mechanism causing three adjacent nucleotides to be substituted simultaneously, however, leads us to suggest that the actual evolutionary events occurring may include still-larger-scale events, such as gene conversion, inversion, or recombination, or a series of rapid compensatory changes.

1Corresponding author: EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom. E-mail: [email protected]

THE growing popularity of the maximum-likelihood (ML) approach to phylogenetic inference has led to increased interest in attempting to improve the modeling of the evolutionary process at the molecular level. This is due to the expectation that more realistic models will both increase our understanding of molecular evolution and lead to more accurate estimates of phylogenetic trees. For example, the models introduced by Kimura (1980) and Yang (1994a) have helped improve phylogenetic tree estimation (see, e.g., Yang et al. 1994) by the inclusion of a parameter describing a bias toward transition mutation and a parameter describing heterogeneity of evolutionary rate along a biological sequence, respectively. Models of codon evolution have provided insights into how protein-coding sequences evolve by allowing the identification of the overall type of selection occurring on specific branches of a tree (Yang 1998) and the specific sites where selection has been acting in a protein throughout its evolutionary history (Nielsen and Yang 1998). Recently, Yang and Nielsen (2002) have extended this approach by introducing a combination of these models that allows selection to vary both among sites and across the tree, demonstrating that an improved understanding of molecular evolution can be achieved by carefully increasing the complexity of a model through the addition of a limited number of biologically interpretable parameters. ML also provides the opportunity to compare models in a robust statistical framework (Edwards 1972; Yang et al. 1994). These techniques have been used to good effect in choosing models appropriate for phylogenetic inference (e.g., Posada and Crandall 1998; Whelan and Goldman 2001) and the detection of positive selection in phylogenies (e.g., Yang et al. 2000).

Current models applied at the nucleotide level describe the evolutionary process as point events occurring independently at individual nucleotide sites, i.e., a series of events each causing a single nucleotide to change and unaffected by surrounding sequence. Existing models describing the evolution of codons enforce the same assumption by allowing only a single nucleotide in a codon to change in any given event (Goldman and Yang 1994; Muse and Gaut 1994; Nielsen and Yang 1998; Yang 1998; Yang et al. 2000; Yang and Nielsen 2002). This implies that amino acid changes requiring two or more nucleotide changes at the codon level may occur only via a series of intermediate steps. Modeling evolution in this manner is biologically appealing because our knowledge of the molecular mechanisms causing DNA sequence to change suggests the majority of mutations that occur in the germline are due to errors made at individual sites during the replication of DNA. Additionally, the neutral theory suggests that these mutations will become fixed independently of one another

in a population at a constant rate, thus appearing over evolutionary time as independent point events (Kimura 1984).

Genetics 167: 2027–2043 (August 2004)

Biological mechanisms of larger-scale events that affect contiguous nucleotides have been known for some time. If the frequency of such events is relatively high, then failure to consider them in an evolutionary model may affect estimates of tree topologies and other factors important in the evolutionary process, for example, selection or rate heterogeneity. The presence of large-scale events may also have an impact on current methodologies used for searching in databases and sequence alignment. However, such events are thought to be rare and this has led to the widely made assumption that sites evolve independently. This is also computationally appealing, having the added benefit of simplifying the mathematics of evolutionary modeling. Relaxing the independent sites assumption has proven to be difficult and there are no widely used solutions to date, although some progress has been made. For example, Pedersen and Jensen (2001) used a sophisticated approach describing the accelerated rate of C → T mutation in the context of a CpG dinucleotide. Unfortunately this approach was limited to pairwise sequence comparison and incurs a heavy computational burden.

Several sources of information have suggested that the assumption that all mutations occur as independent point events at individual nucleotide sites should be questioned. Our best estimates of protein evolution have nonzero instantaneous rates of change between amino acids whose codons differ by more than one nucleotide (for example, see Dayhoff et al. 1972, 1978; Whelan and Goldman 2001), thus suggesting single events may change multiple nucleotides on a regular basis. Furthermore, Yang et al. (1998, Table 1) compared a model where only amino acids that differ by a single nucleotide were allowed to interchange with a model that allowed all possible amino acid interchanges and found that the latter, more general, model yielded a significantly better description of the evolutionary process. Most recently, there has also been some debate over the relative importance of doublet events (an observed change in two adjacent nucleotide sites due to a single event). Averof et al. (2000) provided evidence that doublet mutation occurs at ∼2% of the rate of point mutation in a globin pseudogene, an amount high enough to cause concern for those using evolutionary models. However, Smith et al. (2003) have performed similar analyses on genomic and SNP data and their results suggested that the rate may be far lower, at ∼0.3%, when regional rate variation is taken into account; other authors have also contested the frequency of doublet mutations (e.g., Silva and Kondrashov 2002).

In this study we develop a new model, the singlet-doublet-triplet (SDT) model, which incorporates events that change one, two, or three adjacent nucleotides. A series of submodels, created by placing various restrictions on the SDT model, are used to investigate the relative importance of events involving single-nucleotide, doublet, and triplet changes. The models, applicable to protein-coding DNA sequences, take account of changes spanning two codons yet still permit data analysis to proceed using the simplifying assumption of independent codons. We apply these models to an aligned set of mammalian globin sequences and a substantial subset of a large database of nonredundant biological sequence alignments to estimate the relative frequency of point, doublet, and triplet events. Likelihood-ratio tests show that allowing for doublet and triplet events substantially improves the fit of the model to the data in nearly all cases. We find that the estimated frequency of doublet and triplet events prior to and after selection is surprisingly high, and that estimates of other evolutionary parameters in the model are affected by including such events in the model. Alignment errors and varying degrees of selection may explain some of this estimated abundance of doublet and triplet events, but the levels of these events are so high that alignment and selection artifacts are unlikely to be the sole source. We propose that biologically recognized large-scale events affecting long stretches of DNA occurring during the evolutionary history of the sequences (e.g., gene conversion, inversion, and recombination) or compensatory change are a likely source for these events. The model we introduce, while not completely describing such events, may help reduce the bias introduced when estimating phylogenies and evolutionary model parameters for data where large-scale events have occurred.

MATERIALS AND METHODS

In the past, models of DNA sequence evolution have assumed that only point mutations occur. To attempt to make a codon model that more accurately describes the inferred process of coding sequence evolution, we have developed a model in which evolution occurs according to three concurrent mutational processes, moderated by a common selection process.

DNA mutation processes: The first mutation process is of point (singlet) mutations, as is usual in codon models; the second is of doublet mutations, in which two consecutive nucleotides are simultaneously changed; and the third is of triplet mutations that change three consecutive nucleotides. We now describe these mutational processes and the method by which they are scaled and eventually combined within the SDT model of evolution.

On the basis of previous studies comparing models of single-nucleotide substitutions, and because of its success as the basis of modeling mutation in previous codon models, we have chosen to model single-nucleotide mutation events using the Hasegawa-Kishino-Yano model (Hasegawa et al. 1985), assumed to apply independently and identically at each sequence site. This leads to the use of two parameters (totaling 4 d.f.), the transition/transversion rate bias parameter $\kappa$ and the vector-valued parameter $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$ describing the base frequencies expected at a site at equilibrium of this evolutionary process. The underlying single-nucleotide mutation process now has the following instantaneous rate of substitution of nucleotide $i$ with $j$:

$$s^*_{i,j} = \begin{cases} \pi_j & \text{if } i \text{ and } j \text{ differ by a transversion} \\ \kappa\pi_j & \text{if } i \text{ and } j \text{ differ by a transition} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Note that the instantaneous rate of substitution of $i$ with $i$ is defined to be zero.

Doublet and triplet mutations are defined as events that cause two and three consecutive nucleotides to change, respectively. This means, for example, that dinucleotides affected by a doublet mutation can interchange only with dinucleotides that differ in both their bases; from each of the 16 dinucleotides $i_1i_2$ there are nine permitted doublet mutations to the dinucleotides $j_1j_2$ with $i_1 \neq j_1$ and $i_2 \neq j_2$. We have little prior knowledge about what form the processes of doublet or triplet mutation events might take and we expect them to be reasonably rare, so it seems best to use a very simple mathematical model to describe them. We therefore use the following underlying substitution models for doublets and triplets. The instantaneous rate of replacement of the doublet $i_1i_2$ with $j_1j_2$ is given by

$$d^*_{i_1i_2,\,j_1j_2} = \begin{cases} \pi_{j_1}\pi_{j_2} & \text{if } i_1 \neq j_1 \text{ and } i_2 \neq j_2 \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Similarly, the instantaneous rate of replacement of the triplet $i_1i_2i_3$ with $j_1j_2j_3$ is given by

$$t^*_{i_1i_2i_3,\,j_1j_2j_3} = \begin{cases} \pi_{j_1}\pi_{j_2}\pi_{j_3} & \text{if } i_1 \neq j_1,\ i_2 \neq j_2, \text{ and } i_3 \neq j_3 \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Note that we use "triplet" to mean three consecutive nucleotides irrespective of codon position in coding DNA and "codon" in the usual sense to mean an in-frame triplet that encodes an amino acid.

It is important to note that each of the singlet, doublet, and triplet mutation processes is assumed to apply to every singlet, doublet, and triplet site, respectively, of the sequences studied, which means that the doublet and triplet processes can span codons. For example, if at a given time a sequence is TGCACC, then the singlet process applies equally to the six sites (i.e., to the nucleotides T, G, C, A, C, and C), the doublet process applies to each of the five doublets (TG, GC, CA, AC, and CC), and the triplet process applies to the four triplets (TGC, GCA, CAC, and ACC). Our implementation (below) combining these models ignores any "edge effects" caused by finite sequence length (i.e., for a sequence of length $N$ codons there are $3N$ singlets, $3N - 1$ doublets, and $3N - 2$ triplets) and is generated as though it applied to an infinitely long sequence. We also ignore the effect of intron and exon structure in a gene, where particular codons may not be adjacent in the genomic sequence. This is justified partly by the fact that protein sequences do not evolve in isolation, but as part of larger genomic regions that will have similar stationary distributions of nucleotide content to the gene. This ignores the effect of variation in selection at the edges of genes and introns; such effects are assumed to be small and their inclusion would greatly increase the complexity of the model.

Standardizing the relative rates of point, doublet, and triplet events: To parameterize and estimate the relative frequencies of point, doublet, and triplet events each process needs to be scaled to have comparable units. We scale each process so that time is in units of the expected number of mutational events (point, doublet, or triplet) per codon. To achieve this, we need to know the equilibrium distribution of codons in the SDT model. This is denoted $f_{m_1m_2m_3}$ for each codon $m_1m_2m_3$. In practice, we find that $f_{m_1m_2m_3} = \pi_{m_1}\pi_{m_2}\pi_{m_3}/(1 - \pi_{\mathrm{stop}})$ for all sense codons, where for the universal code $\pi_{\mathrm{stop}} = \pi_T\pi_A\pi_A + \pi_T\pi_A\pi_G + \pi_T\pi_G\pi_A$, and that for the three stop codons $f_{TAA} = f_{TAG} = f_{TGA} = 0$. Note that $\pi_{\mathrm{stop}}$ would be the expected frequency of stop codons if these were not prevented from arising by the parts of the model describing the effects of selection acting on sequence evolution. However, until these values for $f_{m_1m_2m_3}$ are verified (see below) it is convenient to retain the $f_{m_1m_2m_3}$ notation.

Recalling that our single-nucleotide mutation events occur independently at each nucleotide site, the mean rate of single-nucleotide mutation events per codon in a sequence at equilibrium with each codon $m_1m_2m_3$ occurring with frequency $f_{m_1m_2m_3}$ is given by

$$\rho_s = \sum_{\text{codons } m_1m_2m_3} f_{m_1m_2m_3}\,\bigl(s^*_{m_1,+} + s^*_{m_2,+} + s^*_{m_3,+}\bigr), \tag{4}$$

where the replacement of a subscript with a $+$ symbol indicates summation over all possible values of that subscript (so, for example, $s^*_{m_1,+} = s^*_{m_1,A} + s^*_{m_1,C} + s^*_{m_1,G} + s^*_{m_1,T}$). Then, to arrive at a model giving rise to one single-nucleotide mutation event per unit time, we use the following for the instantaneous rate of substitution of nucleotide $i$ with $j$ at each site of a sequence:

$$s_{i,j} \equiv s^*_{i,j}/\rho_s. \tag{5}$$

We use a similar approach to scale time in the doublet and triplet processes to make the units the expected numbers of mutational events per codon (doublet or triplet changes). Again recalling that the mean rates must be calculated at the equilibrium state of the SDT model that combines the single, doublet, and triplet mutation processes, this is achieved by the following rescaling,

$$d_{i_1i_2,\,j_1j_2} \equiv d^*_{i_1i_2,\,j_1j_2}/\rho_d \tag{6}$$

and

$$t_{i_1i_2i_3,\,j_1j_2j_3} \equiv t^*_{i_1i_2i_3,\,j_1j_2j_3}/\rho_t, \tag{7}$$

where $\rho_d$ and $\rho_t$ are the doublet and triplet analogues of $\rho_s$.

The mean rates of the unscaled doublet and triplet processes per codon are

$$\rho_d = \sum_{\text{codons } m_1m_2m_3} f_{m_1m_2m_3} \Bigl( d^*_{m_1m_2,++} + d^*_{m_2m_3,++} + \sum_{\text{codons } n_1n_2n_3} f_{n_1n_2n_3}\, d^*_{m_3n_1,++} \Bigr) \tag{8}$$

and

$$\rho_t = \sum_{\text{codons } m_1m_2m_3} f_{m_1m_2m_3} \Bigl( t^*_{m_1m_2m_3,+++} + \sum_{\text{codons } n_1n_2n_3} f_{n_1n_2n_3} \bigl\{ t^*_{m_2m_3n_1,+++} + t^*_{m_3n_1n_2,+++} \bigr\} \Bigr). \tag{9}$$

Note the more complicated forms of Equations 8 and 9 relative to Equation 4. This is due to having to consider doublet and triplet events that can span two consecutive codons, for example, the pair $(m_1m_2m_3, n_1n_2n_3)$ arising consecutively with frequency $f_{m_1m_2m_3} f_{n_1n_2n_3}$. Our approach for describing the frequency of two adjacent codons simply as the product of their individual occurrence makes the assumption that codons appear independently at neighboring sites (see below).

Finally we introduce parameters $\sigma$, $\delta$, and $\tau$ that describe the relative frequencies of point, doublet, and triplet mutation events, respectively. If we make the constraint that $\sigma + \delta + \tau = 1$ then our SDT model, described by the mutation process consisting of the concurrent functioning of the singlet, doublet, and triplet processes defined by the values $\sigma s_{i,j}$, $\delta d_{i_1i_2,\,j_1j_2}$, and $\tau t_{i_1i_2i_3,\,j_1j_2j_3}$, also has a mean rate of one mutation event per unit time and has these different types of event occurring with frequencies in the ratio $\sigma : \delta : \tau$.

Formulation of the SDT model: In existing codon models, only singlet mutation events may occur. Each event thus affects exactly one codon, and a process of evolution with independent codon sites is easily constructed. The mutation processes described above, however, allow for the possibility of single events that affect two neighboring codons. This means that a perfect implementation of these processes into a codon model would require us to analyze codon positions while making allowance for the actual neighboring codons at every point in evolutionary time. While these are known in the sequences studied (i.e., those at the tips of the phylogeny relating the sequences in an alignment), they are not known at intermediate evolutionary times. It is no longer possible to analyze the sites of a sequence alignment as though they are independent realizations of some evolutionary process on a phylogeny. The state space of the Markov process that should be used in effect becomes the space of all possible sequences, and for a codon model for a sequence of length $N$ codons this space contains $61^N$ possible sequences. This is clearly too large for analysis by straightforward methods, although complex statistical techniques have been used to make some progress in such cases (Pedersen and Jensen 2001; Arndt et al. 2003; Robinson et al. 2003; Siepel and Haussler 2004).

Instead, to maintain independence of sites, we choose to consider only the average effect of neighboring codons when constructing the evolutionary model from the three mutational processes described above. Some events are straightforward to deal with: all singlet events, doublet events affecting codon positions 1 and 2, or 2 and 3, and triplet events affecting codon positions 1, 2, and 3 of one codon each generate one change in a single codon. The remaining events, namely doublet and triplet events that span a codon boundary, each generate two changes, one in each of two codons. Since we wish to describe the net effect on independent codons, we cannot assume we know the precise nature of each of these two changes. In this case, we treat the first affected codon as known, so that the effect of the event on that codon is fully specified. (Exactly the same model would arise if we treated the second affected codon as known.) The affected neighboring codon will have some nucleotides specified (since the mutation event is known) and is then assumed to be drawn randomly according to expectations for individual codons over evolutionary time, conditional on the presence of the known nucleotides. With our modeling assumptions of independent codon sites and codon model equilibrium, the frequencies for each possible neighboring codon $m_1m_2m_3$ are precisely proportional to the $f_{m_1m_2m_3}$.

For example, consider the doublet mutation event AT → GG falling on the third position of the codon CGA and the first position of the neighboring codon. This generates two codon changes within our codon model. In the first codon, we can identify the change CGA → CGG. However, we can identify only the second codon change as TWZ → GWZ, for unknown nucleotides W and Z. Our codon model is constructed so that this doublet event generates all the changes TWZ → GWZ (i.e., all possibilities for W and Z), but with each occurring with relative frequency given by $f_{TWZ}$. This leads to actual frequencies $f_{TWZ}/f_{T++}$; these sum to one, which is appropriate to ensure that this doublet event has generated two codon changes when codons are observed independently.

More generally, the doublet change $X_3Y_1 \to X_3'Y_1'$ falling on the third codon position of the triplet $X_1X_2X_3$ and the first codon position of the neighboring triplet generates one codon change, $X_1X_2X_3 \to X_1X_2X_3'$, and each of the changes $Y_1WZ \to Y_1'WZ$, for all nucleotides W and Z, with frequency $f_{Y_1WZ}/f_{Y_1++}$. The exactly analogous approach is used for triplet events; so, for example, the triplet change CGG → GAT falling on the third codon position of triplet TAC and the first and second codon positions of the neighboring triplet causes one codon change TAC → TAG and causes each of the changes GGW → ATW with frequency $f_{GGW}/f_{GG+}$.

Selection process: The final components of our model to be introduced are parameters to describe the effect of selection on the sequence changes described by our mutation processes. We follow an approach similar to that first used by Goldman and Yang (1994) and Muse and Gaut (1994) to introduce parameters describing the relative probability of fixation in a population (species) of different types of change. Following those authors and the later work of Yang and co-workers (e.g., Nielsen and Yang 1998; Yang et al. 2000; Yang and Nielsen 2002), we assign different probabilities of acceptance of a mutation event depending on whether there are any consequent changes to the encoded amino acid sequence. However, whereas those authors had to consider only the possibility of single synonymous and single nonsynonymous changes, we have to consider the possibility of individual mutation events that may give rise to zero, one, or two nonsynonymous changes. (In fact, an exhaustive list of changes that can arise from our three types of mutation events is 0 nonsynonymous changes (N) + 1 synonymous change (S); 0N + 2S; 1N + 0S; 1N + 1S; 2N + 0S.) In a simple extension of earlier approaches, we use the following parameterization to describe the effect of selection on mutation events:

$$\Pr(\text{mutation event being accepted}) = \begin{cases} 0 & \text{if any stop codons arise} \\ p_S & \text{if all resulting substitutions are synonymous} \\ p_{1N} & \text{if one nonsynonymous substitution results} \\ p_{2N} & \text{if two nonsynonymous substitutions result.} \end{cases} \tag{10}$$

In previous models only $p_S$ and $p_{1N}$ (and the probability 0 for stop codons) have been considered and are expressed as relative rates of 1 and $\omega = p_{1N}/p_S$ for synonymous and nonsynonymous substitutions, respectively. Similarly, we work in terms of the relative rates of 1, $\omega_1 = p_{1N}/p_S$, and $\omega_2 = p_{2N}/p_S$. As has been widely discussed, values of $\omega < 1$ indicate an excess of synonymous substitutions, suggesting negative selection and conservative evolution, and values of $\omega > 1$ indicate an excess of nonsynonymous substitutions and suggest positive selection (e.g., Nielsen and Yang 1998; Yang and Bielawski 2000). The same interpretations may be placed on our selection parameters: $\omega_1$ is directly analogous to $\omega$ in standard models of codon evolution, and $\omega_2$ describes the average selective pressure on two amino acids changing simultaneously. Inference under models that allow $\omega$ values to vary between sites in the alignment has proven useful in detecting more specifically the effects of selection (for example, Yang and Bielawski 2000; Yang et al. 2000; Swanson et al. 2003), but would be computationally difficult to include in our model.

The SDT model: Finally, the above ideas may be combined into a Markov process model for codon evolution that permits the modeling of mutation events that affect more than one nucleotide site and that still permits the computationally convenient analysis of sequence alignments under the assumption that all codon sites evolved independently. Our codon model is embodied in a matrix $Q$ of instantaneous replacement rates, where the element $Q_{i_1i_2i_3,\,j_1j_2j_3}$ gives the rate of replacement of codon $i_1i_2i_3$ with codon $j_1j_2j_3$ when codon $i_1i_2i_3$ is initially present. $Q$ can be broken into the contributions from the three mutation processes described above, denoted $Q^{(s)}$, $Q^{(d)}$, and $Q^{(t)}$ for the singlet, doublet, and triplet processes, respectively. The full methodology for calculating these individual rate matrices is provided in the appendix, and they are combined to generate our SDT model according to

$$Q = Q^{(s)} + Q^{(d)} + Q^{(t)}. \tag{11}$$

We are now able to verify that the SDT model has the expected equilibrium distribution $f = (f_{i_1i_2i_3})$, as defined in terms of the elements of the parameter $\pi$ and discussed above, by confirming that all elements of the vector $fQ$ equal zero. We also note that the SDT model is time reversible (see, e.g., Liò and Goldman 1998), as it satisfies $f_{i_1i_2i_3} Q_{i_1i_2i_3,\,j_1j_2j_3} = f_{j_1j_2j_3} Q_{j_1j_2j_3,\,i_1i_2i_3}$ for all codons $i_1i_2i_3$ and $j_1j_2j_3$.

So far as we are aware, no other models of codon sequence evolution have been proposed that make allowance for individual mutation events that affect more than one nucleotide. Our model is perhaps closest to the codon model denoted M0 by Yang et al. (2000) and first described by Nielsen and Yang (1998) as a simplified version of the model of Goldman and Yang (1994). Our new SDT model is almost identical to M0 when the restrictions $\delta = \tau = 0$ are made; i.e., only singlet mutations can occur ($\sigma = 1$). In this case, the only difference is that M0 assumes that the rate of change to a codon $j_1j_2j_3$ is proportional to the frequency with which that codon ($j_1j_2j_3$) is observed, whereas the SDT model, in common with the model of Muse and Gaut (1994), assumes it is proportional to the frequency of the altered nucleotide ($j_1$, $j_2$, or $j_3$). We prefer the latter assumption, since we know of no mechanism by which a mutational process might observe and be affected by the codon that a nucleotide lies in. We see this as a different issue from the fact that the occurrence of mutations will be affected by neighboring bases; this is undoubtedly the case, but would not be expected to be related in any way to a nucleotide's position in the coding structure of protein-coding DNA.

Likelihood implementation of the SDT model: We implemented the SDT model for maximum-likelihood analysis in purpose-written software using standard methods for Markov process models of sequence evolution, described (for example) by Goldman and Yang (1994), Swofford et al. (1996), and Liò and Goldman (1998). Phylogenetic trees were chosen as described below, and maximum likelihood was used to estimate branch lengths and the parameters $\kappa$, $\pi$, $\sigma$, $\delta$, $\tau$, $\omega_1$, and $\omega_2$. Gaps in the sequence alignment are considered as missing data during likelihood calculations (see, e.g., Yang 1997). Maximum-likelihood estimates under the SDT model can occasionally prove difficult to find using standard numerical optimization procedures. To avoid issues regarding bad convergence and local optima all analyses were run from multiple starting values and results were accepted only when several starting points converged on the same maximum likelihood. Likelihood-ratio tests (LRTs; see Yang et al. 1994; Swofford et al. 1996) were used for model comparisons, for example, to test whether the SDT model gives a significantly better fit to data than a version constrained to have no doublet and triplet mutation events ($\sigma = 1$, $\delta = \tau = 0$). Precise details of the LRTs used are given below.

Additional factors of interest: A primary focus of this article is to see whether or not a model that allows for mutation events affecting multiple nucleotides gives an improved description of the evolution of protein-coding sequences, as assessed by LRTs of competing models that do and do not make this allowance. Assuming, as is shown below, that the new SDT model is overwhelmingly successful, it is then of interest to examine the extent to which doublet and triplet mutation events are inferred to occur. At the level of mutation, this is described in the SDT model by the parameters $\sigma$, $\delta$, and $\tau$. It is also interesting to consider the relative proportions of different mutation events occurring that are accepted by selection and the resulting numbers of different mutation events that actually contribute to observed sequence evolution (i.e., after the effects of selection are taken into account). These characteristics may all be computed from the inferred parameter estimates in the SDT model.

We denote by $M_{i,j}$ the number of mutation events per codon site per unit time affecting $i$ nucleotides (i.e., $i = 1$, 2, and 3 indicating singlet, doublet, and triplet events, respectively), the first of the affected nucleotides being at codon position $j$ ($j = 1, 2, 3$). For example, $M_{3,2}$ is the number of triplet events affecting positions 2 and 3 of one codon and position 1 of the following codon. The events described by $M_{i,j}$ are the inferred mutations that have occurred prior to selection, and we denote by $A_{i,j}$ the corresponding number of events inferred to have been accepted by selection. Formulae for the computation of the $M_{i,j}$ and $A_{i,j}$ are given in the appendix. Note the contrasting interpretations of the $M_{i,j}$ and $A_{i,j}$, which relate to actual mutational events in our model, and the events described directly by the elements of the matrix $Q$, which are the observable outcomes of those events. To use an earlier example, a triplet mutation CGG → GAT occurring at position 3 of a codon TAC and positions 1 and 2 of the following codon contributes to $M_{3,3}$ and, according to its probability of being accepted by selection, to $A_{3,3}$. However, the observable outcomes are five codon changes: the singlet change TAC → TAG and the doublet changes GGW → ATW (with probability $f_{GGW}/f_{GG+}$) for each of W = A, C, G, T, thus contributing to $Q_{TAC,TAG}$ and to $Q_{GGW,ATW}$ for each of W = A, C, G, T.

Because of the way our model is constructed $M_{1,+} = \sigma$, $M_{2,+} = \delta$, $M_{3,+} = \tau$, and $M_{+,+} = 1$. The total rate of accepted mutation events per site per unit time is given by $A_{+,+}$; this will in general be different from the mean rate at equilibrium of $Q$, which indicates the rate of observable events. The relationship of these rates is explained further in the appendix.

Sequence data and trees: For our initial study a set of six mammalian globin sequences from human (NM_000558), rhesus monkey (J04495), cow (AJ242798), goat (J00043), Grevy's zebra (U70191.1), and rabbit (M11113) were extracted from GenBank (accession numbers in parentheses) and aligned using ClustalX (Thompson et al. 1997) to give an alignment of 143 codons after initiation and stop codons were removed. This data set is presented as a typical example for which we can present detailed results and is ideal for studying doublet and triplet mutation because the sequences are closely related and require no insertions/deletions (indels) to produce the alignment; in our model described below, ambiguity in the alignment, which is more likely to occur in more distantly related sequences, may potentially influence estimates of the rate of doublet and triplet mutation. A phylogenetic tree for these sequences was estimated using the REV+Γ model and using an exhaustive tree search procedure, as implemented in PAML (Yang 1994b, 1997) with the Γ-distribution discretized into four categories (Yang 1994a). The resulting topology is ((human, rhesus monkey), ((cow, goat), zebra), rabbit), which we believe agrees with the consensus opinion of the relationship between these mammals. Similar tree estimation procedures were performed using various other models of DNA evolution and no variation in the topology estimate was observed. Both the tree and sequence alignment are available from the authors on request.

In addition, a large number of coding sequence alignments were taken from the nucleotide section of the Pandit 7.6 database (Whelan et al. 2003; available at http://www.ebi.ac.uk/goldman-srv/pandit/), which is derived from the Pfam-A seed database of protein domain alignments (Bateman et al. 2004). In Pandit, nucleotide sequences for each alignment are extracted from the EMBL database (Kulikova et al. 2004) and aligned using the Pfam-A seed alignment as a template. The manual curation of these families, the quality and length of the alignments, and the levels of sequence divergence make them representative of the type of data used to perform phylogenetic inference (Whelan et al. 2003). Codon models are computationally very slow, necessitating that only a subset of Pandit was analyzed; only families with four, five, or six sequences were used. Families were constrained to have alignment lengths of >100 codons and so that the sequences all use the universal genetic code, to give a lower limit to the information present from which to estimate parameters and ensure that sequences evolve in a comparable manner, respectively. We also required that there must be no variation in tree estimate under several standard models of DNA evolution for the families used. The effect of

using an inaccurate tree topology has an unknown influence on parameter estimation under our new model, so ideally the ML trees for each codon model and data set combination should be used. Using families where only a single evolutionary tree topology is estimated under various DNA models should reduce any potential effects because tree topology has proven to be relatively stable when adding complexity to a model. To further ensure data quality we make the final constraint of ensuring that branch length estimates under the null model do not tend to extremely long values, which may occur when there are so many synonymous changes that they have become saturated. When this occurs the number of inferred synonymous changes may be severely underestimated and may potentially affect the accuracy of other parameter estimates. These constraints should ensure that the data used were of high quality and resulted in 258 families (out of a possible 4699) being used in subsequent analyses. These represent a total of 75,177 aligned codon positions, consisting of 1200 different protein-coding sequences containing in total 358,837 codons.

RESULTS

Mammalian globin sequences: Four variants of the SDT model are used to examine the evolution of the globin sequences: (a) σ = 1 and δ = τ = 0, representing a null model where no large scale events occur, (b) δ = 1 − σ and τ = 0, a model where only point and doublet events can occur, (c) τ = 1 − σ and δ = 0, a model where only point and triplet events can occur, and (d) the full SDT model (any σ, δ, τ ≥ 0 satisfying σ + δ + τ = 1).

Relevant parameter estimates, the maximum likelihood under the tree considered, and the relative numbers of degrees of freedom of these variant models are shown in Table 1. LRTs may be performed between the models when they are nested by comparing 2Δ, twice the improvement in likelihood observed when progressing from the simpler (null) model to the more complex (alternative) model, to a χ²ₙ-distribution, where the n degrees of freedom are the number of parameters by which the null and alternate models differ. When the parameters δ and/or τ are added, the use of the χ²-distribution is likely to be a conservative test because they can both rest on boundaries of the parameter space under null models (Self and Liang 1987). In addition, ω2 effectively does not exist under the null model (a), and any value it takes will result in exactly the same likelihood. In cases such as this ω2 is said to be unidentifiable under the null model; regularity conditions are violated and the standard theory justifying a χ²-distribution for significance testing may break down (for a pathological example see Chen and Chen 2001). Obtaining appropriate asymptotic distributions for significance testing in such cases is difficult (Davies 1977, 1987; Andrews 2001) and obtaining data-specific distributions using simulation approaches (e.g., Goldman 1993) is too computationally expensive for general use in our model. For a typical set of parameter values we used a parametric bootstrap approach (Goldman 1993) to estimate the distribution of 2Δ for comparing models (a) and (d) from 200 simulations. The results (not shown) suggest that the χ²₃-distribution is still conservative, with the 95% points of the simulation and χ²₃-distribution being 3.88 and 7.82, respectively. Examination of parameter estimates from the simulation shows that both δ and τ are distributed near zero, confirming that when no doublet or triplet events occur their parameter estimates are reasonable. In the following discussion we assume the use of χ²-distributions is appropriate for significance testing.

Statistical testing: The comparison of model variants (a) and (d) represents a general test of the significance of doublet and triplet events and shows that the description of the evolutionary process provided by the SDT model is significantly better than that provided by the simpler null model that allows no large-scale events (2Δ = 68.26; P < 0.001). Assessing the relative importance of doublet and triplet events on the evolution of the globin sequences is less clear because the presence of doublet and triplet events can each be tested by two model comparisons and the results are somewhat contradictory. Using a LRT to assess the importance of doublet events by comparing model variants (a) and (b), neither of which consider triplet events, yields a highly significant result (2Δ = 56.54; P < 0.001). In a similar test performed in the presence of a parameter describing triplet events, i.e., the comparison of (c) and (d), the addition of doublet mutation does not significantly improve the model (2Δ = 0.43; P = 0.51). A similar observation is made for triplet events. When comparing model variants without a parameter describing doublet events a much higher increase in likelihood is achieved [comparison of (a) and (c): 2Δ = 67.83; P < 0.001] than when comparing model variants that also contain a parameter describing doublet mutation [comparison of (b) and (d): 2Δ = 11.72; P < 0.001], although the addition of τ (triplet events) is highly significant in both these comparisons.

This behavior is probably due to the ability of the parameters describing doublet events to partially describe triplet events, and vice versa. This is illustrated by examining the test for the significance of doublet mutation given by comparing (c) and (d). The relatively small value of 2Δ obtained by progressing to the SDT model is accompanied by quite a large estimate of the frequency of doublet mutation and a corresponding reduction in the estimate of the frequency of triplet mutation. It appears reasonable to assume that the statistical test comparing the null model (a) and the SDT model (d) can ably distinguish between single-nucleotide events and larger-scale events. The parameter esti-

TABLE 1
Parameter estimates from mammalian globin sequences

Value under model:
  a. σ = 1, δ = τ = 0
  b. σ variable, δ = 1 − σ, τ = 0
  c. σ variable, δ = 0, τ = 1 − σ
  d. σ, δ variable, τ = 1 − σ − δ

Parameter        a                 b                   c                   d
d.f. [a]         0                 2                   2                   3
σ [b]            1 (0.53/1)        0.62 (0.43/0.78)    0.70 (0.44/0.85)    0.66 (0.44/0.83)
δ [c]            0 (–/0)           0.38 (0.20/0.22)    0 (–/0)             0.11 (0.22/0.07)
τ [d]            0 (–/0)           0 (–/0)             0.30 (0.18/0.15)    0.23 (0.16/0.10)
ω1               0.36              0.20                0.22                0.23
ω2               –                 0.11                0.13                0.05
κ                1.39              1.74                1.84                1.82
πA               0.18              0.18                0.18                0.18
πC               0.36              0.36                0.36                0.36
πG               0.28              0.29                0.29                0.29
πT               0.18              0.17                0.17                0.17
Tree length [e]  2.39              3.08                2.76                2.85
ln L             −1301.615         −1273.343           −1267.700           −1267.484

[a] Degrees of freedom by which model variant varies from model a.
[b] First value in parentheses is the proportion of singlet events accepted by selection, i.e., A_{1,+}/M_{1,+} = A_{1,+}/σ; second value is the proportion of all accepted events that are singlets, i.e., A_{1,+}/A_{+,+}; see materials and methods section and appendix for details of computation.
[c] First value in parentheses is the proportion of doublet events accepted by selection, i.e., A_{2,+}/M_{2,+} = A_{2,+}/δ; second value is the proportion of all accepted events that are doublets, i.e., A_{2,+}/A_{+,+}.
[d] First value in parentheses is the proportion of triplet events accepted by selection, i.e., A_{3,+}/M_{3,+} = A_{3,+}/τ; second value is the proportion of all accepted events that are triplets, i.e., A_{3,+}/A_{+,+}.
[e] Tree length is the sum of all branch lengths in a topology.
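As a hedged illustration, the likelihood-ratio statistics quoted in the text can be re-derived from the ln L row of Table 1. The sketch below is not the authors' implementation: the function names (`chi2_sf`, `lrt`, `accepted_shares`) are hypothetical, the closed-form χ² tail probabilities cover only 1, 2, or 3 degrees of freedom, and the degrees of freedom used for each comparison are assumed from the d.f. row of Table 1. It also recomputes, from the rounded table entries, the share of all accepted events contributed by each event type, A_{i,+}/A_{+,+}, as (rate × proportion accepted) normalized over the three types.

```python
import math

# ln L values copied from Table 1 (model variants a-d).
LN_L = {"a": -1301.615, "b": -1273.343, "c": -1267.700, "d": -1267.484}
# Extra free parameters relative to model (a), from the d.f. row of Table 1.
DF = {"a": 0, "b": 2, "c": 2, "d": 3}


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms, df = 1, 2, 3 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1, 2, 3 are supported in this sketch")


def lrt(null, alt):
    """Return (2*delta, P) for a nested null/alternative pair of model variants."""
    two_delta = 2 * (LN_L[alt] - LN_L[null])
    return two_delta, chi2_sf(two_delta, DF[alt] - DF[null])


def accepted_shares(rates, accepted):
    """Share of all accepted events per event type: rate * acceptance, normalized."""
    weights = [r * a for r, a in zip(rates, accepted)]
    total = sum(weights)
    return [w / total for w in weights]


for null, alt in [("a", "d"), ("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]:
    two_delta, p = lrt(null, alt)
    print(f"({null}) vs ({alt}): 2*delta = {two_delta:.2f}, P = {p:.3g}")

# Full SDT model (d): sigma, delta, tau and the proportions accepted by selection.
shares_d = accepted_shares([0.66, 0.11, 0.23], [0.44, 0.22, 0.16])
print([round(s, 2) for s in shares_d])
```

Under these assumptions the statistics agree with the values reported in the text to two decimals (68.26, 56.54, 0.43, 67.83, 11.72), and the shares for model (d) round to the 0.83/0.07/0.10 given in Table 1. Because the text argues that the χ² reference distribution is probably conservative here, the P values should be read as indicative only.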

mates under the SDT model, as well as the LRT results, can then be used as an indicator of the relative importance of doublet and triplet mutations.

Parameter estimates: In the SDT model the estimates of the frequency of doublet and triplet mutation events (prior to selection; Table 1, δ and τ, respectively) and accepted events (after selection; A_{2,+}/A_{+,+} and A_{3,+}/A_{+,+}, respectively) are both surprisingly high and may indicate that both types of mutation played an important role in the evolution of the mammalian globin genes. The mutation event numbers indicate how frequently doublet and triplet mutations are estimated to occur and suggest that in the globin alignment approximately one-third of all mutations affect two or three nucleotides simultaneously (δ + τ = 0.34). The numbers regarding events accepted by selection suggest that only a limited proportion of these doublet and triplet events pass through the filter of selection and impact the evolutionary history of the gene. The relative fractions of the different types of mutation accepted by selection reflect the genetic code. More point mutations than either doublet (A_{2,+}/δ = 22%) or triplet (A_{3,+}/τ = 16%) mutations are accepted by selection (A_{1,+}/σ = 44%) because some of the point mutations at the first or third base in a codon are synonymous and occur at the same rate as neutral mutation. Fewer triplet mutations than doublet mutations survive selection because there are more synonymous doublet mutations than triplet mutations; doublet mutation may occur at the first and third sites of a codon where it is possible for synonymous mutations to occur, whereas triplet mutations always affect the second position of a codon and consequently always result in a nonsynonymous change.

It is interesting to assess the effect of the inclusion of doublet and triplet events in a model on the other parameter estimates our model contains (see Table 1). It is immediately noticeable that the inclusion of doublet and triplet events in the model has a negligible effect on the equilibrium distribution parameters contained in π. This is reassuring, because the expected frequency of the nucleotide bases in the observed sequences is defined by π in a similar manner for all of the models. The inclusion of doublet and triplet mutations in the model appears to have a substantial effect on the remaining parameters. The transition/transversion ratio, κ, appears to increase under SDT and the selection parameter, ω1, decreases. The cause of the change in κ is unclear, while the decrease in ω1 may be expected: if doublet and triplet events have occurred during the evolution of the sequence, under the null model multiple nonsynonymous point events are required to describe many of the doublet and triplet events and ω1 needs to be artificially high to describe them; but when the larger-scale events are better accounted for (by the SDT model) the value of ω1 is closer to its true level. For example, consider the triplet mutation that changes isoleucine (ATA) to alanine (GCC). Under the null model (a), at least two nonsynonymous events would be required to describe the change and ω1 would be large to make these events more likely. In the SDT model the change may be correctly described as a triplet event and ω1 is unaffected.

The value of ω2 is also of interest because no comparable parameter has previously been estimated. In the mammalian globin alignment there appears to be a preference for nonsynonymous changes to cause only a single-amino-acid residue in a protein to change rather than affecting adjacent pairs of residues, i.e., ω1 > ω2 in Table 1, particularly for the full SDT model. The final factor of interest is the effect the inclusion of doublet and triplet events has on tree length, which appears to slightly increase under the SDT model. The ratio of branch lengths between the two models also differs (i.e., differences are not a simple linear scaling of all branch lengths; results not shown), suggesting that the SDT model may affect tree estimation procedures. The cause for the increase in tree length is unclear but may be related to the increase in κ, with the SDT model inferring more synonymous transition mutations than the null model.

Database analysis: Application of the SDT model to a large number of alignments provides a more general assessment of the relative importance of different types of mutational event. The parameter estimates from our database analysis confirm that the conclusions of the globin analysis apply across a broader spectrum of biology, indicating that doublet and triplet events occur on a regular basis.

Statistical testing: The comparison between the null model (a) and the SDT model (d) is used to investigate the doublet and triplet mutation for our database analysis. The submodels (b) and (c) are not used because, as discussed above, the parameters δ and τ have a tendency to partially describe each other, making the result from these models difficult to interpret. Statistical tests show that progression from the null model to the SDT model is significant at the 95% level in 257 (99.6%) of the families examined and the mean (median) increase in likelihood was 117.79 (90.23), indicating that the SDT model provides a substantial improvement in describing evolution and is likely to provide biologically interesting information.

Parameter estimates: The distribution across families of the frequency of doublet and triplet events prior to selection (δ and τ, respectively) is shown in Figure 1A, with the proportion of doublet and triplet mutation events taking mean (median) values of 0.22 (0.16) and 0.28 (0.25), respectively. Figure 1B shows the distribution of the proportions of single, doublet, and triplet mutations that are accepted by selection (A_{1,+}/σ, A_{2,+}/δ, and A_{3,+}/τ, respectively). Figure 1C indicates the proportions of all accepted events that derive from doublet and triplet mutations (A_{2,+}/A_{+,+} and A_{3,+}/A_{+,+}, respectively); the mean (median) values are 0.091 (0.041) and 0.097 (0.068), respectively. Note that the estimates of the proportion of doublet and triplet mutations from the globin analysis are consistent with the estimates made from the database. The variability observed in all of the distributions in Figure 1 should be viewed with some caution because the standard errors of the parameters that contribute to them are different in each family due to differences in the length of alignments, tree lengths, etc.

General effects on other parameters' estimates of including doublet and triplet events in an evolutionary model are illustrated in Figure 2, A and B. Figure 2A shows the effect of including doublet and triplet events on estimates of κ, which tends to be greater under the SDT model than under the null model. Figure 2B shows the distribution of selection parameters under the null and SDT models. The estimates of ω1 tend to be much lower under the SDT model than under the null model; even cases where pervasive positive selection is implied under the null model (cases where ω1 > 1) revert to much lower levels under the SDT model. The parameter ω2 of the SDT model tends to take substantially higher values than ω1 takes in the SDT model, but slightly lower values than ω1 takes in the null model, and there is a weak correlation between ω2 and ω1 of the null model: high estimates of ω1 in the null model tend to be associated with high estimates of ω2. This unexpected general (database) finding differs from the globin result. We expected values of ω2 to be lower than ω1 (SDT), reflecting a presumed greater physicochemical disruption caused by changing two amino acids relative to a single-amino-acid change. The high values of ω2 may reflect compensatory amino acid replacements (such context effects are not considered by the SDT model) or the model describing larger-scale events than doublets and triplets (see below).

DISCUSSION

For the surprisingly high estimates of doublet and triplet mutation rates to be believable some support from the biological literature is required. Unlike doublet events (see Discussion in Averof et al. 2000), there is no explicit mechanism known to us that causes exactly three adjacent nucleotides to change simultaneously. It therefore seems reasonable to consider the possibility that events spanning larger stretches of DNA are the source of these high estimates. Figure 3 demonstrates how the products of gene conversion, small-scale recombination, and inversion mutations, if plausible for the data at hand, may appear as high estimates of doublet and triplet events in the SDT model. The apparent ability of a parameter describing doublet events to partially describe triplet events, and vice versa, as noted in the analysis of the globin sequences, supports this hypothesis.

We find the potential of gene conversion to be the source of our high estimates of doublet and triplet events particularly intriguing because many recent studies have suggested that gene conversion occurs more often than

Figure 1. Distribution of parameter estimates and statistics relating to singlet, doublet, and triplet mutations on a per family basis from a database of 258 aligned protein families. (A) The proportions of doublet (shaded bars) and triplet (open bars) mutations before selection (δ and τ, respectively). (B) The proportions of singlet (solid bars), doublet (shaded bars), and triplet (open bars) mutations accepted by selection. (C) The proportions of all events accepted by selection that derive from doublet (shaded bars) and triplet (open bars) mutations. Estimates under the SDT model from the globin sequence alignment are also indicated.

Figure 2. Effects of including parameters describing doublet and triplet events on other parameter estimates in the SDT model, on a per family basis from a database of 258 aligned protein families. (A) The change in κ between the null and SDT models; note that 8 families with σ < 0.05 under SDT are not included because they do not contain enough information to estimate κ accurately. (B) The distribution of ω1 under the null (solid bars) and SDT (shaded bars) models and the distribution of ω2 under the SDT model (open bars); note that two families with σ > 0.95 are not included because they do not contain enough information to estimate ω2 accurately.

previously thought. Laboratory-based studies have found significant gene conversion on the human sex chromosomes (Skaletsky et al. 2003) and in the malaria parasite (Nielsen et al. 2003). Another study, examining human sperm, found high levels of gene conversion in meiotic crossover hot spots (Jeffreys and May 2004); between 80 and 94% of recombination events at these hot spots were observed to resolve as gene conversions and the mean length of the conversion tract was estimated to be in the range of 55–290 bp. Two computational studies have estimated the length and frequency of gene conversion events in gene families of Caenorhabditis elegans (Semple and Wolfe 1999) and Saccharomyces cerevisiae (Drouin 2002). The length and variance of the observed gene conversions are quite large but broadly agree with the lengths proposed by Jeffreys and May (2004); the shape of the distributions and the inherent difficulties in uncovering short gene conversion events suggest that relatively short segments of DNA may be altered by gene conversion (Semple and Wolfe 1999, Figure 6; Drouin 2002, Figure 2). The estimated frequency at which gene conversions occur in these studies, however, is probably too low to describe the high frequency of doublet and triplet events inferred by our model. This may be because only known, active homologs of the gene families are considered in the analyses, whereas gene conversion events have also been documented where a section of a pseudogene sequence replaces part of a functional gene. For example, in the human condition 21-hydroxylase deficiency, the most common form of adrenal hyperplasia, gene conversion is thought to account for ~75% of mutant alleles (Col-

Figure 3. Possible mechanisms by which large-scale events may result in elevated estimates of doublet and triplet mutation. (A) Gene conversion. The left-hand side shows how three independent mutations may occur in a sequence; these may be correctly inferred as three point mutations (indicated by dashed boxes). The right-hand side demonstrates that a gene conversion event occurring as indicated makes these three changes appear simultaneously in the right-hand lineage, and one point and one doublet mutation will be inferred. Note that small-scale recombination will result in a similar inference except the exchange of information will be reciprocal. (B) Sequence inversion (illustrated with double-stranded DNA). The observed strand of DNA (in boldface type) appears to have evolved via a single triplet event as a consequence of this sequence inversion.
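To make the inversion mechanism of Figure 3B concrete, the following minimal sketch inverts a short segment of a double-stranded sequence, which on the observed strand amounts to replacing that segment with its reverse complement. The sequence, the coordinates, and the function names are hypothetical and are not taken from the figure; the sketch only illustrates that a single inversion can change three adjacent sites at once.

```python
# Watson-Crick complement for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(seq, start, end):
    """Invert seq[start:end] in double-stranded DNA, as read on the observed strand.

    After inversion the observed strand carries the reverse complement
    of the original segment.
    """
    segment = seq[start:end]
    inverted = segment.translate(COMPLEMENT)[::-1]
    return seq[:start] + inverted + seq[end:]


def changed_positions(a, b):
    """Zero-based positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical example: invert the middle codon (positions 3-5).
ancestor = "ATGAACCTG"
descendant = invert_segment(ancestor, 3, 6)

print(descendant)                               # ATGGTTCTG
print(changed_positions(ancestor, descendant))  # [3, 4, 5]
```

A model restricted to point mutations would need three separate events to explain this difference, whereas a model that allows triplet events could describe it as one. Not every inversion changes every site in the segment (a segment equal to its own reverse complement is unchanged), so the example is deliberately a favorable case.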

lier et al. 1993; Tusie-Luna and White 1995). Gene conversion from pseudogenes is also thought to be important in polycystic kidney disease (Watnik et al. 1998). This introduces the possibility that gene families convert not only within themselves but also with other areas of the genome with high sequence similarity. If this were occurring it may raise the frequency of gene conversion to levels that may substantially describe our high estimates of doublet and triplet mutation.

An alternative explanation for the high estimates of doublet and triplet events would be if compensatory mutations were regularly occurring. For example, if a change at the first position of a codon were accepted by selection only if other changes rapidly occurred at the second and third position of the preceding codon, this would be well described by triplet mutation in our model. A phenomenon similar to this has been modeled in RNA evolution where matching bases in regions with stem structures are described as changing simultaneously (e.g., Savill et al. 2001). If compensatory events that are accepted by selection arise quickly relative to the frequency of occurrence of the mutations that “trigger” them and of events that do not lead to any compensatory changes, then a model such as ours that treats the compensating mutations as arising simultaneously with the triggering events seems adequate: it captures the general behavior of sequence evolution, at least at the level observable with species-level data, and does so without having to model a complex pattern of heterogeneous selective pressures that vary over sequence position and over evolutionary time. We do, however, have to remember to be cautious with the interpretation of our inferred levels of instantaneous doublet and triplet events.

Another possibility is that high estimates of doublet and triplet events may be due to properties of the sequence data used. Possible problems arising from errors in tree topology or the misalignment of homologous sequences may potentially influence the estimates of the proportion of doublet and triplet events. However, results obtained by examining subsets of the data formed by removing the most divergent families, which can be expected to contain those families with the most ambiguous alignments, and by including only families with long alignments of five or more sequences do not substantially differ from the results presented above, strongly suggesting that the estimated high levels of doublet and triplet events are not artifacts of the data (results not shown). This is further supported by the detailed analysis of the globin family, where estimates typical of the results obtained from the database are obtained from an alignment containing only closely related sequences with no ambiguity or indels in the alignment. The possibility that errors in tree estimation may be the cause of the high estimates of large-scale events cannot be discounted, but many measures have been employed to ensure that tree topology estimates are as good as possible. It is unlikely that the topology estimates for all 258 families studied are wrong, which would indicate that the inference of frequent large-scale events is correct in at least some families.

The SDT model, in common with many other models used in phylogenetics, makes many assumptions about the evolutionary process and it is possible that violations of these assumptions are the source of the high estimates of doublet and triplet mutation. The assumption of no variation in evolutionary factors (e.g., selection) along a protein may conceivably lead to overestimation of the rate of doublet and triplet mutation. If there is variation in selection in a relatively conserved protein then a site undergoing positive selection (with ω1 > 1) would have more accepted point mutations at the first and second codon position than at other codons. Consequently, in the SDT model, which takes into account only the average amount of selection occurring, this excess of changes may potentially be misinterpreted as doublet and triplet events. Another plausible contributor to the high estimates of doublet and triplet mutation would be a significant degree of rate variation in the mutational process acting on the DNA, which could be caused, for example, by the presence of mutational hot spots. However, the levels of doublet and triplet mutation events estimated from the data are very high and occur over a wide variety of biological sequences. Given the plausible biological explanation of such events, it seems unlikely that misspecifications of the model, such as those discussed above, are the sole cause of the high estimates of doublet and triplet mutation.

The misspecification argument can also be turned on its head to support the SDT model: if large-scale mutational events are occurring during evolution then estimates from studies that do not describe these events may be affected. For example, Anisimova et al. (2003) present a simulation study that demonstrates that recombination (a larger-scale event) may result in overestimation of selection parameters in standard codon models. If large-scale events are as frequent as our study suggests it may go some way to explaining the extreme level of positive selection observed by some studies; the proportion of amino acid changes driven by positive selection in primates and Drosophila has been estimated as 35% (Fay et al. 2001) and 45% (Smith and Eyre-Walker 2002), respectively. The results of this study also show that other parameter estimates may be affected by not allowing for large-scale events. The SDT model, while not completely describing events such as recombination and gene conversion, does represent a step toward being able to do so and consequently parameter estimates from it may reflect reality more accurately than those of simpler models. We hope in the future that approaches based on the formulation of the SDT model, combined with recent advances in modeling context dependence (e.g., Pedersen and Jensen 2001; Arndt et al. 2003; Siepel and Haussler 2004), may be used to more completely describe such larger-scale events in the evolutionary process.

CONCLUSION

In this study we introduce a new codon model of evolution that can estimate the frequency of singlet, doublet, and triplet mutation using an approach that allows events to span multiple codon sites while maintaining the simplifying independent-sites assumption for data analysis. The model is constructed in a manner that separates the process of mutation on the DNA sequence from the process of selection acting upon the protein. Inclusion of doublet and triplet mutations gives a statistically significant improvement in the fit of the model to data in the vast majority of 258 protein-coding sequence alignments studied, indicating that these mutation events do occur and are an important feature of genome evolution. Parameter estimates from the model suggest that doublet and triplet mutations occur far more frequently than previously thought and give some indication of the variations that may be observed over typical protein-coding DNA sequences, which reflect variations over genomic regions.

The exact cause of these high estimates, however, remains open to debate. It seems unlikely that misspecification of the model caused, for example, by high levels of variation in selection along proteins or poor data quality can be the sole cause of the high estimates of doublet and triplet mutation. The lack of a biological mechanism for triplet mutation, the more prevalent of the two types of large-scale mutation inferred, leads us to conclude that these high estimates may more likely be reflecting compensatory change and larger-scale mutation, probably in the form of gene conversion, inversion mutation, or small-scale recombination. If such events are regularly occurring during evolution, this may be a worrying departure from the independence-of-sites assumption and affect parameter estimates under current models. In these circumstances, the SDT model should provide parameter estimates that better reflect reality.

We thank Adam Siepel and David Haussler for letting us see a prepublication version of their work and two anonymous referees for comments leading to improvements to this manuscript. S.W. is supported by the Wellcome Trust. N.G. is supported by a Wellcome Trust Senior Fellowship in Basic Biomedical Science.

LITERATURE CITED

Andrews, D. W. K., 2001 Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica 69: 683–734.
Anisimova, M., R. Nielsen and Z. Yang, 2003 Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164: 1229–1236.
Arndt, P. F., C. B. Burge and T. Hwa, 2003 DNA sequence evolution with neighbor-dependent mutation. J. Comp. Biol. 3–4: 313–322.
Averof, M., A. Rokas, K. H. Wolfe and P. M. Sharp, 2000 Evidence for a high frequency of simultaneous doublet-nucleotide substitutions. Science 287: 1283–1286.
Bateman, A., L. Coin, R. Durbin, R. D. Finn, V. Hollich et al., 2004 The Pfam protein families database. Nucleic Acids Res. 32: D138–D141.
Chen, H. F., and J. H. Chen, 2001 The likelihood ratio test for homogeneity in finite mixture models. Can. J. Stat. 29: 201–215.
Collier, S., M. Tassabehji, P. Sinnott and T. Strachan, 1993 A de novo pathological point mutation at the 21-hydroxylase locus: implications for gene conversion in the human genome. Nat. Genet. 4: 260–265.
Davies, R. B., 1977 Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64: 247–254.
Davies, R. B., 1987 Hypothesis testing when a parameter is present only under the alternative. Biometrika 74: 33–43.
Dayhoff, M. O., R. V. Eck and C. M. Park, 1972 A model of evolutionary change in proteins, pp. 89–99 in Atlas of Protein Sequence and Structure, Vol. 5, edited by M. O. Dayhoff. National Biomedical Research Foundation, Washington, DC.
Dayhoff, M. O., R. M. Schwartz and B. C. Orcutt, 1978 A model of evolutionary change in proteins, pp. 345–352 in Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, edited by M. O. Dayhoff. National Biomedical Research Foundation, Washington, DC.
Drouin, G., 2002 Characterization of the gene conversions between the multigene family members of the yeast genome. J. Mol. Evol. 55: 14–23.
Edwards, A. W. F., 1972 Likelihood. Cambridge University Press, Cambridge, UK.
Fay, J. C., G. J. Wyckoff and C.-I Wu, 2001 Positive and negative selection on the human genome. Genetics 158: 1227–1234.
Goldman, N., 1993 Statistical tests of models of DNA substitution. J. Mol. Evol. 36: 182–198.
Goldman, N., and Z. Yang, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725–736.
Hasegawa, M., H. Kishino and T. Yano, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160–174.
Jeffreys, A. J., and C. A. May, 2004 Intense and highly localised gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36: 151–156.
Kimura, M., 1980 A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111–120.
Kimura, M., 1984 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
Kulikova, T., P. Aldebert, N. Althorpe, W. Baker, K. Bates et al., 2004 The EMBL nucleotide sequence database. Nucleic Acids Res. 32: D27–D30.
Liò, P., and N. Goldman, 1998 Models of molecular evolution and phylogeny. Genome Res. 8: 1233–1244.
Savill, N. J., D. C. Hoyle and P. G. Higgs, 2001 RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157: 399–411.
Self, S. G., and K.-L. Liang, 1987 Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82: 605–610.
Semple, C., and K. H. Wolfe, 1999 Gene duplication and gene conversion in the Caenorhabditis elegans genome. J. Mol. Evol. 48: 555–564.
Siepel, A., and D. Haussler, 2004 Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21: 468–488.
Silva, J. C., and A. S. Kondrashov, 2002 Patterns in spontaneous mutation revealed by human-baboon sequence comparison. Trends Genet. 18: 544–547.
Skaletsky, H., T. Kuroda-Kawaguchi, P. J. Minx, H. S. Cordum, L. Hillier et al., 2003 The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423: 825–837.
Smith, N. G. C., and A. Eyre-Walker, 2002 Adaptive protein evolution in Drosophila. Nature 415: 1022–1024.
Smith, N. G. C., M. T. Webster and H. Ellegren, 2003 A low rate of simultaneous doublet-nucleotide mutations in primates. Mol. Biol. Evol. 20: 47–53.
Swanson, W. J., R. Nielsen and Q. F. Yang, 2003 Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol. 20: 18–20.
Swofford, D. L., G. J. Olsen, P. J. Waddell and D. M. Hillis, 1996 Phylogenetic inference, pp. 407–514 in Molecular Systematics, Ed. 2, edited by D. M. Hillis, C. Moritz and B. K. Mable. Sinauer Associates, Sunderland, MA.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin and D. G. Higgins, 1997 The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25: 4876–4882.
Tusie-Luna, M. T., and P. C. White, 1995 Gene conversions and unequal crossovers between CYP21 (steroid 21-hydroxylase gene) and CYP21P involve different mechanisms. Proc. Natl. Acad. Sci. USA 92: 10796–10800.
Watnik, T. J., M. A. Gandolph, H. Weber, H. P. H. Neumann and G. G. Germino, 1998 Gene conversion is a likely cause of mutation in PKD1. Hum. Mol. Genet. 7: 1239–1243.
Whelan, S., and N. Goldman, 2001 A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol. Biol. Evol. 18: 691–699.
Whelan, S., P. I. W. de Bakker and N. Goldman, 2003 Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics 19: 1556–1563.
Yang, Z., 1994a Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39: 306–314.
Yang, Z., 1994b Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39: 105–111.
Yang, Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555–556.
Muse, S. V., and B. S.
Gaut, 1994 A likelihood approach for compar- Yang, Z., 1998 Likelihood ratio tests for detecting positive selection ing synonymous and nonsynonymous nucleotide substitution and application to primate lysozyme evolution. Mol. Biol. Evol. rates, with applications to the chloroplast genome. Mol. Biol. 15: 568–573. Evol. 11: 715–724. Yang, Z., and J. P. Bielawski, 2000 Statistical methods for detecting Nielsen,K.M.,J.Kasper,M.Choi,T.Bedford,K.Kristiansen et molecular . Trends Ecol. Evol. 15: 496–503. al., 2003 Gene conversion as a source of nucleotide diversity in Yang, Z., and R. Nielsen, 2002 Codon-substitution models for de- Plasmodium falciparum. Mol. Biol. Evol. 20: 726–734. tecting molecular adaptation at individual sites along specific Nielsen, R., and Z. Yang, 1998 Likelihood models for detecting lineages. Mol. Biol. Evol. 19: 908–917. positively selected amino acid sites and applications to the HIV-1 Yang, Z., N. Goldman and A. Friday, 1994 Comparison of models envelope gene. Genetics 148: 929–936. for nucleotide substitution used in maximum likelihood phyloge- Pedersen, A.-M. K., and J. L. Jensen, 2001 A dependent-rates model netic estimation. Mol. Biol. Evol. 11: 316–324. and an MCMC-based methodology for the maximum-likelihood Yang, Z., R. Nielsen and M. Hasegawa, 1998 Models of amino acid analysis of sequences with overlapping reading frames. Mol. Biol. substitution and applications to mitochondrial protein evolution. Evol. 18: 763–776. Mol. Biol. Evol. 15: 1600–1611. Posada, D., and K. A. Crandall, 1998 MODELTEST: testing the Yang, Z., R. Nielsen,N.Goldman and A.-M. K. Pedersen, 2000 model of DNA substitution. Bioinformatics 14: 817–818. Codon-substitution models for heterogeneous selection pressure Robinson, D. M., D. T. Jones,H.Kishino,N.Goldman and J. L. at amino acid sites. Genetics 155: 431–449. Thorne, 2003 Protein evolution with dependence among co- dons due to tertiary structure. Mol. Biol. Evol. 20: 1692–1704. Communicating editor: Z. 
Yang Modeling Multiple-Nucleotide Changes 2041

APPENDIX

Calculation of $Q^{(s)}$, $Q^{(d)}$, $Q^{(t)}$, and $Q$: Before calculating the constituent rate matrices of the SDT model it is convenient to define the following functions:

$$
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3) =
\begin{cases}
0 & \text{if } j_1j_2j_3 \text{ is a stop codon} \\
1 & \text{if the change } i_1i_2i_3 \to j_1j_2j_3 \text{ is synonymous} \\
\omega_1 & \text{if the change } i_1i_2i_3 \to j_1j_2j_3 \text{ is nonsynonymous}
\end{cases}
\tag{A1}
$$

$$
\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ j_1j_2j_3\,l_1l_2l_3) =
\begin{cases}
0 & \text{if either } j_1j_2j_3 \text{ or } l_1l_2l_3 \text{ are stop codons} \\
1 & \text{if changes } i_1i_2i_3 \to j_1j_2j_3 \text{ and } k_1k_2k_3 \to l_1l_2l_3 \text{ are both synonymous} \\
\omega_1 & \text{if changes } i_1i_2i_3 \to j_1j_2j_3 \text{ and } k_1k_2k_3 \to l_1l_2l_3 \text{ represent one synonymous and one nonsynonymous change} \\
\omega_2 & \text{if both } i_1i_2i_3 \to j_1j_2j_3 \text{ and } k_1k_2k_3 \to l_1l_2l_3 \text{ are nonsynonymous changes}
\end{cases}
\tag{A2}
$$

to represent the probabilities of mutation events being accepted as functions of the changes between codons $i_1i_2i_3$ and $j_1j_2j_3$, and codon pairs $i_1i_2i_3\,k_1k_2k_3$ and $j_1j_2j_3\,l_1l_2l_3$, respectively. Of these rate matrices $Q^{(s)}$ is the simplest to describe, since all singlet events affect only one codon:

$$
Q^{(s)}_{i_1i_2i_3,\ j_1j_2j_3} =
\begin{cases}
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\sigma\, s_{i_1,j_1} & \text{if } i_1 \ne j_1,\ i_2 = j_2,\ i_3 = j_3 \\
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\sigma\, s_{i_2,j_2} & \text{if } i_1 = j_1,\ i_2 \ne j_2,\ i_3 = j_3 \\
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\sigma\, s_{i_3,j_3} & \text{if } i_1 = j_1,\ i_2 = j_2,\ i_3 \ne j_3 \\
0 & \text{otherwise.}
\end{cases}
\tag{A3}
$$

$Q^{(d)}$ has to allow for doublet mutation events that alter only one codon (those falling on codon positions 1 and 2 or 2 and 3) and that alter two codons (falling on codon positions 3 and 1). The first two of these are straightforward to accommodate, leading to the following:

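$$
Q^{(d)}_{i_1i_2i_3,\ j_1j_2j_3} =
\begin{cases}
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\delta\, d_{i_1i_2,\,j_1j_2} & \text{if } i_1 \ne j_1,\ i_2 \ne j_2,\ i_3 = j_3 \\
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\delta\, d_{i_2i_3,\,j_2j_3} & \text{if } i_1 = j_1,\ i_2 \ne j_2,\ i_3 \ne j_3 \\
0 & \text{otherwise.}
\end{cases}
\tag{A4}
$$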
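As an illustration only, the acceptance functions of Equations A1 and A2 and the within-codon rates of Equations A3 and A4 might be sketched in Python as follows. The sketch assumes the universal genetic code. The names `delta1`, `delta2`, and `within_codon_rates`, and the callables `s` and `d` (standing in for the singlet and doublet rates of Equations 1 and 2), are hypothetical and are not taken from our implementation.

```python
from itertools import product

BASES = "TCAG"
# Assumption: universal genetic code, codons in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def delta1(i, j, omega1):
    """Equation A1: acceptance probability of the codon change i -> j."""
    if CODE[j] == "*":
        return 0.0
    return 1.0 if CODE[i] == CODE[j] else omega1


def delta2(i, k, j, l, omega1, omega2):
    """Equation A2: acceptance probability of the codon-pair change (i, k) -> (j, l)."""
    if CODE[j] == "*" or CODE[l] == "*":
        return 0.0
    # Count how many of the two codon changes are nonsynonymous (0, 1, or 2).
    n_nonsyn = (CODE[i] != CODE[j]) + (CODE[k] != CODE[l])
    return (1.0, omega1, omega2)[n_nonsyn]


def within_codon_rates(i, j, s, d, sigma, delta, omega1):
    """Return (Q_s, Q_d) for i -> j from Equations A3 and A4 only.

    s(a, b) and d(ab, cd) are hypothetical callables giving the singlet and
    doublet rates; codon-boundary doublets (A5) are not included here.
    """
    diff = [x for x in range(3) if i[x] != j[x]]
    accept = delta1(i, j, omega1)
    if len(diff) == 1:                 # singlet at the one differing position (A3)
        x = diff[0]
        return accept * sigma * s(i[x], j[x]), 0.0
    if diff == [0, 1]:                 # doublet on codon positions 1 and 2 (A4)
        return 0.0, accept * delta * d(i[:2], j[:2])
    if diff == [1, 2]:                 # doublet on codon positions 2 and 3 (A4)
        return 0.0, accept * delta * d(i[1:], j[1:])
    return 0.0, 0.0                    # identical codons or any other pattern
```

Returning the singlet and doublet contributions separately keeps the two constituent matrices distinct until they are summed.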

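To complete $Q^{(d)}$, we have to consider all possible neighboring codons $i_1i_2i_3\,k_1k_2k_3$ and all possible doublet mutation events that can alter them at codon positions 3 and 1: these are the changes $i_3k_1 \to j_3l_1$ for all $j_3 \ne i_3$ and $l_1 \ne k_1$, giving rise to codons $i_1i_2j_3\,l_1k_2k_3$. Recalling our assumption that codon $k_1k_2k_3$ follows a given instance of codon $i_1i_2i_3$ with probability $f_{k_1k_2k_3}$ and, similarly, $i_1i_2i_3$ precedes $k_1k_2k_3$ with probability $f_{i_1i_2i_3}$, $Q^{(d)}$ may be completed using the following iterative procedure,

$$
\begin{aligned}
&\text{for all codons } i_1i_2i_3\ \{ \\
&\quad \text{for all (following) codons } k_1k_2k_3\ \{ \\
&\qquad \text{for all (replacement) doublets } j_3l_1 \text{ with } j_3 \ne i_3 \text{ and } l_1 \ne k_1\ \{ \\
&\qquad\quad Q^{(d)}_{i_1i_2i_3,\ i_1i_2j_3} \mathrel{+}= f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1k_2k_3)\,\delta\, d_{i_3k_1,\,j_3l_1} \\
&\qquad\quad Q^{(d)}_{k_1k_2k_3,\ l_1k_2k_3} \mathrel{+}= f_{i_1i_2i_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1k_2k_3)\,\delta\, d_{i_3k_1,\,j_3l_1} \\
&\}\}\},
\end{aligned}
\tag{A5}
$$

where the $\mathrel{+}=$ symbol indicates that in each iteration of the loops, the indicated values are cumulatively added to the values of $Q^{(d)}_{i_1i_2i_3,\ i_1i_2j_3}$ and $Q^{(d)}_{k_1k_2k_3,\ l_1k_2k_3}$ already computed by Equation A4. Following the same logic, $Q^{(t)}$ can be produced by first calculating

$$
Q^{(t)}_{i_1i_2i_3,\ j_1j_2j_3} =
\begin{cases}
\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\,\tau\, t_{i_1i_2i_3,\,j_1j_2j_3} & \text{if } i_1 \ne j_1,\ i_2 \ne j_2,\ i_3 \ne j_3 \\
0 & \text{otherwise}
\end{cases}
\tag{A6}
$$

and completed using the following iterative procedure:

$$
\begin{aligned}
&\text{for all codons } i_1i_2i_3\ \{ \\
&\quad \text{for all (following) codons } k_1k_2k_3\ \{ \\
&\qquad \text{for all (replacement) triplets } j_2j_3l_1 \text{ with } j_2 \ne i_2,\ j_3 \ne i_3, \text{ and } l_1 \ne k_1\ \{ \\
&\qquad\quad Q^{(t)}_{i_1i_2i_3,\ i_1j_2j_3} \mathrel{+}= f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1j_2j_3\,l_1k_2k_3)\,\tau\, t_{i_2i_3k_1,\,j_2j_3l_1} \\
&\qquad\quad Q^{(t)}_{k_1k_2k_3,\ l_1k_2k_3} \mathrel{+}= f_{i_1i_2i_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1j_2j_3\,l_1k_2k_3)\,\tau\, t_{i_2i_3k_1,\,j_2j_3l_1} \\
&\qquad \} \\
&\qquad \text{for all (replacement) triplets } j_3l_1l_2 \text{ with } j_3 \ne i_3,\ l_1 \ne k_1, \text{ and } l_2 \ne k_2\ \{ \\
&\qquad\quad Q^{(t)}_{i_1i_2i_3,\ i_1i_2j_3} \mathrel{+}= f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1l_2k_3)\,\tau\, t_{i_3k_1k_2,\,j_3l_1l_2} \\
&\qquad\quad Q^{(t)}_{k_1k_2k_3,\ l_1l_2k_3} \mathrel{+}= f_{i_1i_2i_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1l_2k_3)\,\tau\, t_{i_3k_1k_2,\,j_3l_1l_2} \\
&\}\}\}.
\end{aligned}
\tag{A7}
$$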
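The codon-boundary doublet loop of procedure A5 can be sketched in Python as below; the two triplet loops of procedure A7 follow the same pattern with three changed nucleotides. This is a minimal sketch under stated assumptions, not our implementation: the function name `add_boundary_doublets`, the dictionary representation of $Q^{(d)}$, and the callables `delta2` and `d` are all hypothetical.

```python
from itertools import product

BASES = "TCAG"


def add_boundary_doublets(Q, codons, f, delta2, d, delta):
    """Accumulate codon-boundary doublet rates into Q, following procedure A5.

    Q      : dict mapping (from_codon, to_codon) -> rate, updated in place
    codons : iterable of sense codons (3-letter strings)
    f      : dict mapping codon -> equilibrium frequency
    delta2 : callable (i, k, j, l) -> acceptance probability (Equation A2)
    d      : callable (from_doublet, to_doublet) -> doublet rate
    delta  : proportion of doublet events
    """
    codons = list(codons)
    for i in codons:                      # preceding codon i1 i2 i3
        for k in codons:                  # following codon k1 k2 k3
            for j3, l1 in product(BASES, repeat=2):
                if j3 == i[2] or l1 == k[0]:
                    continue              # both boundary nucleotides must change
                j = i[:2] + j3            # codon i1 i2 j3
                l = l1 + k[1:]            # codon l1 k2 k3
                rate = delta2(i, k, j, l) * delta * d(i[2] + k[0], j3 + l1)
                if rate == 0.0:
                    continue              # e.g. a stop codon would arise
                Q[(i, j)] = Q.get((i, j), 0.0) + f[k] * rate
                Q[(k, l)] = Q.get((k, l), 0.0) + f[i] * rate
    return Q
```

Each accepted boundary event updates two entries, one for each of the codons it alters, weighted by the frequency of the neighboring codon.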

The matrix Q actually used in our model is now simply calculated as

$$
Q = Q^{(s)} + Q^{(d)} + Q^{(t)} \tag{A8}
$$

and, as is usual for Markov process models, it is convenient to define the diagonal elements of $Q$ according to

$$
Q_{i_1i_2i_3,\ i_1i_2i_3} = -\sum_{\text{codons } j_1j_2j_3 \ne i_1i_2i_3} Q_{i_1i_2i_3,\ j_1j_2j_3}. \tag{A9}
$$
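Equations A8 and A9 amount to an elementwise sum followed by setting each diagonal element so that its row sums to zero. A minimal Python sketch, assuming the three constituent matrices are held as equally sized lists of lists (the function name `combine_rate_matrices` is hypothetical):

```python
def combine_rate_matrices(Qs, Qd, Qt):
    """Return Q = Qs + Qd + Qt (A8) with diagonal elements set as in A9."""
    n = len(Qs)
    # Elementwise sum of the three constituent matrices (Equation A8).
    Q = [[Qs[a][b] + Qd[a][b] + Qt[a][b] for b in range(n)] for a in range(n)]
    for a in range(n):
        # Diagonal element is minus the sum of the off-diagonal row entries (A9).
        Q[a][a] = -sum(Q[a][b] for b in range(n) if b != a)
    return Q
```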

Calculation of the quantities $M_{i,j}$ and $A_{i,j}$: $M_{i,j}$ and $A_{i,j}$, as defined in the text, are computed using the following formulae. For singlet events,

$$
M_{1,x} = \sum_{\text{codons } i_1i_2i_3}\ \sum_{j_x \ne i_x} \sigma\, s_{i_x,j_x}\, f_{i_1i_2i_3}
= \sigma \sum_{\text{codons } i_1i_2i_3} f_{i_1i_2i_3}\, s_{i_x,+}
\tag{A10}
$$

$$
\begin{aligned}
A_{1,x} &= \sum_{\text{codons } i_1i_2i_3}\ \sum_{j_x \ne i_x} \Delta^{(1)}(i_1i_2i_3,\ [i_1i_2i_3]'_x)\,\sigma\, s_{i_x,j_x}\, f_{i_1i_2i_3} \\
&= \sigma \sum_{\text{codons } i_1i_2i_3}\ \sum_{j_x \ne i_x} f_{i_1i_2i_3}\,\Delta^{(1)}(i_1i_2i_3,\ [i_1i_2i_3]'_x)\, s_{i_x,j_x}
\end{aligned}
\tag{A11}
$$

for each $x = 1, 2, 3$, where $[i_1i_2i_3]'_x$ indicates the codon arising when codon $i_1i_2i_3$ has nucleotide $i_x$ at position $x$ changed to $j_x$ (so, for example, $[i_1i_2i_3]'_1$ represents the codon $j_1i_2i_3$ and $[[i_1i_2i_3]'_2]'_3$ represents $i_1j_2j_3$). For doublet events,

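$$
M_{2,x} = \delta \sum_{\text{codons } i_1i_2i_3} f_{i_1i_2i_3}\, d_{i_xi_{x+1},\,++}, \tag{A12}
$$

$$
A_{2,x} = \delta \sum_{\text{codons } i_1i_2i_3}\ \sum_{\substack{j_x \ne i_x \\ j_{x+1} \ne i_{x+1}}} f_{i_1i_2i_3}\,\Delta^{(1)}(i_1i_2i_3,\ [[i_1i_2i_3]'_x]'_{x+1})\, d_{i_xi_{x+1},\,j_xj_{x+1}}
\tag{A13}
$$

for $x = 1, 2$, and

$$
M_{2,3} = \delta \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\, d_{i_3k_1,\,++}, \tag{A14}
$$

$$
A_{2,3} = \delta \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3}\ \sum_{\substack{j_3 \ne i_3 \\ l_1 \ne k_1}} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1k_2k_3)\, d_{i_3k_1,\,j_3l_1}.
\tag{A15}
$$

For triplet events,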

$$
M_{3,1} = \tau \sum_{\text{codons } i_1i_2i_3} f_{i_1i_2i_3}\, t_{i_1i_2i_3,\,+++}, \tag{A16}
$$

$$
M_{3,2} = \tau \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\, t_{i_2i_3k_1,\,+++}, \tag{A17}
$$

$$
M_{3,3} = \tau \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\, t_{i_3k_1k_2,\,+++}, \tag{A18}
$$

$$
A_{3,1} = \tau \sum_{\text{codons } i_1i_2i_3}\ \sum_{\substack{j_1 \ne i_1 \\ j_2 \ne i_2 \\ j_3 \ne i_3}} f_{i_1i_2i_3}\,\Delta^{(1)}(i_1i_2i_3,\ j_1j_2j_3)\, t_{i_1i_2i_3,\,j_1j_2j_3}
\tag{A19}
$$

$$
A_{3,2} = \tau \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3}\ \sum_{\substack{j_2 \ne i_2 \\ j_3 \ne i_3 \\ l_1 \ne k_1}} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1j_2j_3\,l_1k_2k_3)\, t_{i_2i_3k_1,\,j_2j_3l_1}
\tag{A20}
$$

$$
A_{3,3} = \tau \sum_{\text{codons } i_1i_2i_3}\ \sum_{\text{codons } k_1k_2k_3}\ \sum_{\substack{j_3 \ne i_3 \\ l_1 \ne k_1 \\ l_2 \ne k_2}} f_{i_1i_2i_3}\, f_{k_1k_2k_3}\,\Delta^{(2)}(i_1i_2i_3\,k_1k_2k_3,\ i_1i_2j_3\,l_1l_2k_3)\, t_{i_3k_1k_2,\,j_3l_1l_2}.
\tag{A21}
$$

As mentioned in the text, $M_{1,+} = \sigma$, $M_{2,+} = \delta$, $M_{3,+} = \tau$, and $M_{+,+} = 1$. Although all singlet, doublet, and triplet mutation events arise independently of codon position (as seems reasonable for a model of sequence evolution), they are not independent of the affected nucleotides in each singlet, doublet, or triplet. Different singlets, doublets, and triplets have different rates of evolution (as is evident in Equations 1, 2, and 3), and the same singlet, doublet, or triplet will occur with different frequencies at different codon positions because of the effects of the selection parameters $\omega_1$ and $\omega_2$ and because stop codons are not allowed to arise. Consequently, the following relationships do not hold exactly:

$$
\begin{aligned}
M_{1,1} &= M_{1,2} = M_{1,3} = \sigma/3 \\
M_{2,1} &= M_{2,2} = M_{2,3} = \delta/3 \\
M_{3,1} &= M_{3,2} = M_{3,3} = \tau/3.
\end{aligned}
$$

However, we note that for reasonable parameter values these relationships remain approximately true.
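To see how such position-specific quantities could be evaluated in practice, the singlet sums of Equations A10 and A11 can be sketched in Python as below. This is an illustrative sketch only: the function name `singlet_M_and_A` and the callables `s` and `delta1` are hypothetical, and the normalization of the singlet rates described in the text is assumed rather than enforced.

```python
BASES = "TCAG"


def singlet_M_and_A(x, codons, f, s, delta1, sigma):
    """Return (M_{1,x}, A_{1,x}) of Equations A10 and A11 for codon position x in {1, 2, 3}.

    codons : iterable of sense codons (3-letter strings)
    f      : dict mapping codon -> equilibrium frequency
    s      : callable (from_base, to_base) -> singlet rate
    delta1 : callable (i, j) -> acceptance probability (Equation A1)
    sigma  : proportion of singlet events
    """
    p = x - 1                               # 0-based index of codon position x
    M = A = 0.0
    for i in codons:
        for b in BASES:
            if b == i[p]:
                continue                    # the nucleotide must change
            j = i[:p] + b + i[p + 1:]       # the codon [i1 i2 i3]'_x
            rate = sigma * f[i] * s(i[p], b)
            M += rate                       # all singlet mutations (A10)
            A += delta1(i, j) * rate        # accepted singlet mutations (A11)
    return M, A
```

Evaluating the function for x = 1, 2, and 3 gives the three values whose approximate equality to σ/3 is discussed above.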

Since each “accepted mutation” $A_{1,1}$, $A_{1,2}$, $A_{1,3}$, $A_{2,1}$, $A_{2,2}$, and $A_{3,1}$ leads to one observed codon change, and each of $A_{2,3}$, $A_{3,2}$, $A_{3,3}$ leads to exactly two observed codon changes, we can also relate the mean rate of observable codon changes at equilibrium with the values $A_{i,j}$ as follows:

$$
-\sum_{\text{codons } i_1i_2i_3} f_{i_1i_2i_3}\, Q_{i_1i_2i_3,\ i_1i_2i_3}
= A_{1,1} + A_{1,2} + A_{1,3} + A_{2,1} + A_{2,2} + A_{3,1} + 2\,(A_{2,3} + A_{3,2} + A_{3,3}).
\tag{A22}
$$