10

Selecting models of evolution

THEORY David Posada

10.1 Models of evolution and phylogeny reconstruction

Phylogenetic reconstruction is a problem of statistical inference. Since statistical inferences cannot be drawn in the absence of probabilities, the use of a model of nucleotide substitution or amino acid replacement (a model of evolution) becomes indispensable when using DNA or protein sequences to estimate phylogenetic relationships among taxa. Models of evolution are sets of assumptions about the process of nucleotide or amino acid substitution (see Chapters 4 and 9). They describe the different probabilities of change from one nucleotide or amino acid to another along a phylogenetic tree, allowing us to choose among different phylogenetic hypotheses to explain the data at hand. Comprehensive reviews of models of evolution are offered elsewhere (Swofford et al., 1996; Liò & Goldman, 1998).

As discussed in the previous chapters, phylogenetic methods are based on a number of assumptions about the evolutionary process. Such assumptions can be implicit, as in parsimony methods (see Chapter 8), or explicit, as in distance or maximum likelihood methods (see Chapters 5 and 6, respectively). The advantage of making a model explicit is that the parameters of the model can be estimated. Distance methods can only estimate the number of substitutions per site. However, maximum likelihood methods can estimate all the relevant parameters of the model of evolution. Parameters estimated via maximum likelihood have desirable statistical properties: as sample sizes get large, they converge to the true value and

The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (eds.). Published by Cambridge University Press. © Cambridge University Press 2009.

have the smallest possible variance among all estimates with the same expected value.

It is well known that the use of one model of evolution or another may change the results of a phylogenetic analysis (Sullivan & Joyce, 2005). When the assumed model is wrong, branch lengths, the transition/transversion ratio, and divergence may be underestimated, while the strength of rate variation among sites may be overestimated. Simple models tend to suggest that a clade is significantly supported when it cannot be, and tests of evolutionary hypotheses (e.g. of the molecular clock, see Chapter 11) can become conservative. In general, phylogenetic methods may be less accurate (recover an incorrect tree more often), or may be inconsistent (converge to an incorrect tree with increased amounts of data) when the assumed model of evolution is wrong. Cases where the use of wrong models increases phylogenetic performance are the exception, and they rather represent a bias towards the true tree due to violated assumptions. Indeed, models are important not only because of their consequences for the phylogenetic analysis, but also because the characterization of the evolutionary process at the molecular level is itself relevant.

Models of evolution make assumptions to make complex problems computationally tractable. A model becomes a powerful tool when, despite its simplified assumptions, it can fit the data and make accurate predictions about the problem at hand. The performance of a method is maximized when its assumptions are satisfied, and some indication of the fit of the data to the phylogenetic model is necessary. If the model used may influence the results of the analysis, it becomes crucial to decide which is the most appropriate model to work with.

Before proceeding further, a word of caution is in order when selecting best-fit models for heterogeneous data, for example, when joining different genes for the phylogenetic analysis, or coding and non-coding regions.
Since different genomic regions are subjected to different selective pressures and evolutionary constraints, a single model of evolution may not fit well with all the data. Nowadays, some options exist for a combined analysis in which each data partition (e.g. different genes) has its own model (Nylander et al., 2004). In addition, mixture models consider the possibility of the model varying in different parts of the alignment (Pagel & Meade, 2004).

10.2 Model fit

In general, models that are more complex will fit the data better than simpler ones just because they have more parameters. An a priori attractive procedure to select a model of evolution would be the arbitrary use of the most complex, parameter-rich model available. However, when using complex models a large

number of parameters need to be estimated, and this has several disadvantages. First, the analysis becomes computationally difficult and requires a large amount of time. Second, as more parameters need to be estimated from the same amount of data, more error is included in each estimate. Ideally, it would be advisable to incorporate as much complexity as needed, i.e. to choose a model that is intricate enough to explain the data, but not so complicated that it requires impractically long computations or large data sets to obtain accurate estimates.

The best-fit model of evolution for a particular data set can be selected using sound statistical techniques. During the last few years, different approaches have been proposed to select the best-fit model of evolution within a collection of candidate models, such as hierarchical likelihood ratio tests (hLRTs), information criteria, Bayesian, or performance-based approaches. In addition, although not considered here, the overall adequacy of a particular model can also be evaluated using different procedures (Goldman, 1993; Bollback, 2002).

Regardless of the model selection strategy chosen, the fit of a model can be measured through the likelihood function. The likelihood is proportional to the probability of the data (D) given a model of evolution (M), a vector of K model parameters (θ), a tree topology (τ), and a vector of branch lengths (ν):

$$L = P(D \mid M, \theta, \tau, \nu) \tag{10.1}$$

When the goal is to compute the likelihood of a model, the parameter values and the tree affect the calculations, but they are not really what we want to infer (they are nuisance parameters). A standard strategy to “remove” nuisance parameters is to utilize their maximum likelihood estimates (MLEs), which are the values that make the likelihood function as large as possible:

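$$\hat{\theta}, \hat{\tau}, \hat{\nu} = \underset{\theta, \tau, \nu}{\arg\max}\; L(\theta, \tau, \nu) \tag{10.2}$$

Note that, to facilitate the computation, we usually work with the maximized log likelihood:

$$\ell = \ln P(D \mid M, \hat{\theta}, \hat{\tau}, \hat{\nu}) \tag{10.3}$$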

Alternatively, in a Bayesian setting we can integrate the nuisance parameters out and obtain the marginal probability of the data given only the model (P(D|M), also called the model likelihood), typically using computationally intensive techniques like Markov chain Monte Carlo (MCMC). Integrating out the tree, branch lengths, and model parameters to obtain P(D|M) is represented by:

$$P(D \mid M) = \int P(D \mid M, \theta, \tau, \nu)\, P(\theta, \tau, \nu \mid M)\, d\theta\, d\tau\, d\nu \tag{10.4}$$

10.3 Hierarchical likelihood ratio tests (hLRTs)

A standard way of comparing the fit of two models is to contrast their log likelihoods using the likelihood ratio test (LRT) statistic:

$$\mathrm{LRT} = 2(\ell_1 - \ell_0) \tag{10.5}$$

where $\ell_1$ is the maximum log likelihood under the more parameter-rich, complex model (alternative hypothesis) and $\ell_0$ is the maximum log likelihood under the less parameter-rich, simple model (null hypothesis). The value of this statistic is always equal to, or greater than, zero, even if the simple model is closest to the true model, simply because the superfluous parameters in the complex model provide a better explanation of the stochastic variation in the data than the simpler model. When the models compared are nested (i.e. the null hypothesis is a special case of the alternative hypothesis) and the null hypothesis is correct, this statistic is asymptotically distributed as a χ² distribution with a number of degrees of freedom equal to the difference in the number of free parameters between the two models. When the value of the LRT is significantly large, the conclusion is that the inclusion of additional parameters in the alternative model increases the likelihood of the data significantly, and consequently the use of the more complex model is favored. On the other hand, a small difference in the log likelihoods indicates that the alternative hypothesis does not explain the data significantly better than the null hypothesis.

That two models are nested means that one model (null or constrained model) is equivalent to a restriction of the possible values that one or more parameters can take in the other model (alternative, unconstrained, or full model). For example, the Jukes–Cantor (JC) model (1969) and the Felsenstein (F81) model (1981) are nested. This is because the JC model is a special case of F81 in which the base frequencies are set to be equal (0.25), while in the F81 model these frequencies can be different. When comparing two different nested models through an LRT, we are testing hypotheses about the data. The hypotheses tested are those represented by the difference in the assumptions among the models compared.

Several hypotheses can be tested hierarchically to select the best-fit model for the data set at hand among a set of possible models. Are the base frequencies equal? Is there a transition/transversion bias? Are all transition rates equal? Are there invariable sites? Is there rate homogeneity among sites? And so on. For example, testing the equal base frequencies hypothesis can be done with an LRT comparing JC vs. F81, as these models only differ in the fact that F81 allows for unequal base frequencies (alternative hypothesis), while JC assumes equal base frequencies (null hypothesis). Indeed, the same hypothesis could also have been evaluated by comparing JC+Γ vs. F81+Γ, or K80+I vs. HKY+I, or SYM vs. GTR. An example of such a

hierarchical LRT procedure (hLRT) for 24 models is shown in Fig. 10.1. The hLRTs can be easily accomplished by using the program ModelTest (Posada & Crandall, 1998) for a set of 56 candidate models (see the practice section in this chapter).
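A single step of such a hierarchy can be sketched in Python as follows. This is a minimal illustration, not the ModelTest implementation: the function names are hypothetical, the χ² tail probability is computed from the standard finite series for integer degrees of freedom so that only the standard library is needed, and the example values are the -lnL scores reported for the vertebrate mtDNA data set in the practice section of this chapter (TrN+I+G vs. GTR+I+G, 3 degrees of freedom).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)  # standard normal pdf at sqrt(x)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total


def likelihood_ratio_test(neg_lnl_null, neg_lnl_alt, df):
    """Return (LRT statistic, P-value); inputs are -lnL scores as printed by Paup*."""
    lrt = 2.0 * (neg_lnl_null - neg_lnl_alt)  # equals 2 * (lnL_alt - lnL_null)
    return lrt, chi2_sf(lrt, df)


# TrN+I+G (null, 5 rate/frequency parameters) vs. GTR+I+G (alternative, 8 parameters)
lrt, p_value = likelihood_ratio_test(21262.9570, 21148.9279, df=3)
print(f"LRT = {lrt:.4f}, P < 0.0001: {p_value < 1e-4}")
```

The P-value is only meaningful when the two models are nested and the null model does not sit on a parameter boundary.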

10.3.1 Potential problems with the hLRTs

We should be aware that there are some potential problems derived from the use of pairwise LRTs for model selection (Posada & Buckley, 2004). The χ² approximation for the LRT statistic may not be appropriate when the null model is equivalent to fixing some parameter at the boundary of its possible values (Whelan & Goldman, 1999). An example of this situation is the invariable sites test. In this case, the alternative hypothesis postulates that the proportion of invariable sites could range from 0 to 1. The null hypothesis (no invariable sites) is a special case of the alternative hypothesis, with the proportion of invariable sites fixed to 0, which is at the boundary of the range of the parameter in the alternative model. In this case, the use of a mixed χ² distribution (50% χ²₀ and 50% χ²₁) is appropriate. However, even after using the most appropriate χ² distribution, obtaining correct P-values for the LRT statistics can be difficult, because LRTs implicitly assume that at least one of the models compared is correct. Moreover, the χ² approximation does not hold when the two competing hypotheses are not nested, and it may perform poorly when the data include very short sequences relative to the number of parameters to be estimated. In these cases, the null distribution of the LRT statistic can be approximated by Monte Carlo simulation.

In addition, which model comparison is used to test which hypothesis depends on the starting model of the hierarchy, and on the order in which the different hypotheses are tested. For example, it could be possible to start with the simple JC or with the most complex GTR+I+Γ. In the same way, a test for equal base frequencies could be performed first followed by a test for rate heterogeneity among sites, or vice versa.
Many hierarchies of LRTs are possible, and they can result in different models being selected (Posada & Crandall, 2001; Posada & Buckley, 2004), and in some cases they can even lead to the estimation of different trees (Pol, 2004).
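The boundary correction for the invariable sites test can be sketched as follows. This is an illustration under the 50:50 mixture stated above; the function name and the example statistic are hypothetical. Because a χ² distribution with 0 degrees of freedom places all its mass at zero, only the χ²₁ half contributes to the tail probability of a positive statistic, so the mixture P-value is half of the usual one.

```python
import math


def boundary_lrt_pvalue(lrt):
    """P-value under a 50:50 mixture of chi-square(0) and chi-square(1)."""
    if lrt <= 0:
        return 1.0
    # Upper tail of chi-square(1) is erfc(sqrt(x/2)); the chi-square(0) half adds nothing.
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))


# Hypothetical LRT statistic for a model vs. model+I comparison
statistic = 3.0
naive = math.erfc(math.sqrt(statistic / 2.0))   # plain chi-square(1) P-value
mixed = boundary_lrt_pvalue(statistic)          # mixture P-value
print(f"chi2(1): {naive:.4f}  mixture: {mixed:.4f}")
```

With this hypothetical statistic the plain χ²₁ test would not reject the null at the 5% level, while the mixture test would, which shows why the choice of null distribution matters near the threshold.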

Fig. 10.1 Hierarchical likelihood ratio tests for 24 models of nucleotide substitution (in ModelTest the number of models is 56; see Table 10.1). At each level the null hypothesis (model on top) is either accepted (A) or rejected (R). G: shape parameter of the gamma distribution; I: proportion of invariable sites. Here the model selected would be HKY+G as indicated by the pathway represented using bold arrows.

10.4 Information criteria

A different approach for model selection is the simultaneous comparison of all competing models. The idea again is to include as much complexity in the model as needed. To do that, the likelihood of each model is penalized by a function of the number of free parameters in the model (K); the more parameters, the bigger the penalty. The Akaike Information Criterion or AIC (Akaike, 1974) is an asymptotically unbiased estimator of the Kullback–Leibler information quantity (Kullback & Leibler, 1951), which measures the expected distance between the true model and the estimated model:

$$\mathrm{AIC} = -2\ell + 2K \tag{10.6}$$

We can think of the AIC as the amount of information lost when we use, say, HKY85 to approximate the real process of molecular evolution. Hence, the model with the smallest AIC is preferred. If branch lengths are estimated de novo for every model, as is usually the case, K will include the number of branches (twice the number of taxa minus three). An advantage of the AIC is that it can be used to compare both nested and non-nested models. When sample size (n) is small compared with the number of parameters (n/K < 40), a corrected version of the AIC is recommended:

$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K + 1)}{n - K - 1} \tag{10.7}$$

Note that sample size is usually approximated by the total number of characters in the alignment, although what constitutes the sample size of an alignment is still an open question. The AIC and AICc calculations are implemented in the programs ModelTest and ProtTest (Abascal et al., 2005) for DNA and protein sequences, respectively.
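Equations 10.6 and 10.7 are simple enough to check by hand. The sketch below (hypothetical function names) uses the values reported for GTR+I+G on the vertebrate mtDNA data set in the practice section: -lnL = 21148.9277, K = 41 (10 model parameters plus 2 × 17 − 3 = 31 branch lengths), and the alignment length, 1998, taken as the sample size.

```python
def aic(neg_lnl, k):
    """AIC = -2*lnL + 2K, written in terms of the -lnL score."""
    return 2.0 * neg_lnl + 2.0 * k


def aicc(neg_lnl, k, n):
    """Small-sample corrected AIC (Eq. 10.7); requires n > K + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > K + 1")
    return aic(neg_lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)


n_taxa, n_sites = 17, 1998
k = 10 + (2 * n_taxa - 3)            # GTR (8) + I (1) + G (1) + branch lengths (31)
aic_value = aic(21148.9277, k)
aicc_value = aicc(21148.9277, k, n_sites)
print(f"K = {k}, n/K = {n_sites / k:.1f}")
print(f"AIC = {aic_value:.4f}, AICc = {aicc_value:.4f}")
```

Here n/K is about 49, above the threshold of 40, so the correction adds less than two units to the AIC.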

10.5 Bayesian approaches

Model selection can be implemented in a Bayesian setting using Bayes factors, posterior probabilities, or the Bayesian Information Criterion. Bayes factors are similar to the LRTs in that they compare the evidence (here, the model likelihoods) for two competing models:

$$B_{ij} = \frac{P(D \mid M_i)}{P(D \mid M_j)} \tag{10.8}$$

In this case, evidence for $M_i$ is considered very strong if $B_{ij}$ > 150, strong if 12 < $B_{ij}$ < 150, positive if 3 < $B_{ij}$ < 12, barely worth mentioning if 1 < $B_{ij}$ < 3, and negative (supports $M_j$) if $B_{ij}$ < 1. Bayes factors for models of molecular evolution can be calculated using reversible jump MCMC (Huelsenbeck et al., 2004) (see also Chapter 7). In addition, when multiple models are considered, it is possible to choose the model with the highest posterior probability (Raftery, 1996). For R candidate models, the posterior probability of the ith model is:

$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{\sum_{r=1}^{R} P(D \mid M_r)\, P(M_r)} \tag{10.9}$$

where P(M) are the model prior probabilities.
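Given model likelihoods on the log scale, Eqs. 10.8 and 10.9 can be evaluated as in the sketch below. The function names and the log marginal likelihoods are hypothetical (in practice they would come from an MCMC analysis), and assigning the boundary values 1, 3, 12, and 150 to the lower category is an assumption, since the text only gives strict inequalities. Working with logarithms and subtracting the maximum before exponentiating avoids numerical underflow.

```python
import math


def model_posteriors(log_marginals, priors=None):
    """Posterior probability of each model (Eq. 10.9) from ln P(D|M) values."""
    names = list(log_marginals)
    if priors is None:
        priors = {m: 1.0 / len(names) for m in names}   # equal model priors
    log_terms = {m: log_marginals[m] + math.log(priors[m]) for m in names}
    shift = max(log_terms.values())                     # log-sum-exp shift
    unnormalized = {m: math.exp(v - shift) for m, v in log_terms.items()}
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


def bayes_factor(log_marginal_i, log_marginal_j):
    """B_ij (Eq. 10.8) from two log model likelihoods."""
    return math.exp(log_marginal_i - log_marginal_j)


def evidence_label(b_ij):
    """Verbal scale quoted in the text; boundary values fall in the lower category."""
    if b_ij > 150:
        return "very strong"
    if b_ij > 12:
        return "strong"
    if b_ij > 3:
        return "positive"
    if b_ij > 1:
        return "barely worth mentioning"
    return "negative"


# Hypothetical log model likelihoods for three candidate models
log_marginals = {"HKY+G": -1000.0, "GTR+G": -1002.3, "JC": -1020.0}
posteriors = model_posteriors(log_marginals)
b = bayes_factor(log_marginals["HKY+G"], log_marginals["GTR+G"])
print(posteriors)
print(f"B = {b:.2f} ({evidence_label(b)})")
```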

Both Bayes factors and model posterior probabilities can be difficult to compute. The Bayesian Information Criterion (BIC) (Schwarz, 1978) provides an approximate solution to the natural log of the Bayes factor:

$$\mathrm{BIC} = -2\ell + K \ln n \tag{10.10}$$

The smaller the BIC, the better the fit of the model to the data. Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability. Because with standard alignments the natural log of n is usually >2, the BIC tends to choose simpler models than the AIC. Like the AIC, the BIC can be used to compare nested and non-nested models. The BIC calculation is implemented in the programs ModelTest and ProtTest.
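The difference between the two penalties is easy to see numerically. In the sketch below (hypothetical function name) the BIC is computed for the GTR+I+G values reported in the practice section of this chapter (-lnL = 21148.9277, K = 41, n = 1998 sites); each parameter costs ln n units under the BIC against 2 units under the AIC.

```python
import math


def bic(neg_lnl, k, n):
    """BIC = -2*lnL + K*ln(n), written in terms of the -lnL score."""
    return 2.0 * neg_lnl + k * math.log(n)


neg_lnl, k, n = 21148.9277, 41, 1998
bic_value = bic(neg_lnl, k, n)
aic_value = 2.0 * neg_lnl + 2.0 * k

print(f"penalty per parameter: AIC = 2, BIC = ln({n}) = {math.log(n):.2f}")
print(f"AIC = {aic_value:.2f}, BIC = {bic_value:.2f}")
```

The BIC penalty exceeds the AIC penalty whenever n is larger than e² (about 7.4 characters), which covers any realistic alignment.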

10.6 Performance-based selection

Arguing that there is no guarantee that the best-fit models will produce the best estimates of phylogeny, Minin et al. (2003) developed a novel approach that selects models on the basis of their phylogenetic performance, measured as the expected error on branch length estimates weighted by their BIC. Under this decision theoretic framework (DT) the best model is the one that minimizes the risk function:

$$C_i \approx \sum_{j=1}^{R} \left\lVert \hat{B}_i - \hat{B}_j \right\rVert \frac{e^{-\mathrm{BIC}_j/2}}{\sum_{r=1}^{R} e^{-\mathrm{BIC}_r/2}} \tag{10.11}$$

where

$$\left\lVert \hat{B}_i - \hat{B}_j \right\rVert^2 = \sum_{l=1}^{2t-3} \left(\hat{B}_{il} - \hat{B}_{jl}\right)^2 \tag{10.12}$$

and where t is the number of taxa. Indeed, simulations suggested that models selected with this criterion result in slightly more accurate branch length estimates than those obtained under models selected by the hLRTs (Minin et al., 2003; Abdo et al., 2005).
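The risk function can be sketched as follows. This is a minimal illustration and not the DT-Sel program: the function name, branch length vectors, and BIC values are all hypothetical. It assumes that every model was fitted on the same topology, so that each vector holds the 2t − 3 branch lengths in the same order, and it weights the distance to model j by the BIC-based weight of model j.

```python
import math


def dt_risks(branch_lengths, bics):
    """Risk C_i of each model: BIC-weighted mean distance to all models' branch lengths."""
    best = min(bics)
    # Subtracting the smallest BIC leaves the normalized weights unchanged
    # and avoids underflow in exp().
    raw = [math.exp(-(b - best) / 2.0) for b in bics]
    total = sum(raw)
    weights = [r / total for r in raw]
    risks = []
    for b_i in branch_lengths:
        # math.dist is the Euclidean norm of the difference (Eq. 10.12)
        risks.append(sum(w * math.dist(b_i, b_j)
                         for w, b_j in zip(weights, branch_lengths)))
    return risks


# Hypothetical estimates for t = 4 taxa (2t - 3 = 5 branches) under three models
branch_lengths = [
    [0.10, 0.20, 0.05, 0.30, 0.15],
    [0.12, 0.22, 0.05, 0.33, 0.16],
    [0.12, 0.23, 0.06, 0.34, 0.16],
]
bics = [2050.0, 2040.0, 2041.0]

risks = dt_risks(branch_lengths, bics)
best_model = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> best model index:", best_model)
```

In this toy example the second model has both the smallest BIC and the smallest risk, but the two criteria need not agree: a model with a slightly larger BIC can win if its branch lengths are closer to those of the other well-supported models.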

10.7 Model selection uncertainty

One big advantage of the AIC, Bayesian, and DT methods over the hLRTs is that they can rank models, allowing us to assess how confident we are in the model selected. Indeed, models could be ranked according to posterior probabilities, and credible intervals could easily be constructed by summing these probabilities. For other relative measures like the AIC or BIC, we could present their differences (Δ). For example, for the ith model, the AIC (or BIC) difference is:

$$\Delta_i = \mathrm{AIC}_i - \min \mathrm{AIC} \tag{10.13}$$

where min AIC is the smallest AIC value among all candidate models.

Very conveniently, we can use these differences to obtain the relative weight ($w_i$) of each model:

$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R} \exp(-\Delta_r/2)} \tag{10.14}$$

Note that the weights of all models add up to 1, so it is easy to establish a 95% confidence set of models by summing the weights from largest to smallest until the sum just reaches 0.95 (or similar).
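The differences, weights, and confidence set can be computed in a few lines. The sketch below uses hypothetical function names and hypothetical AIC scores; the same code applies unchanged to BIC scores.

```python
import math


def information_weights(scores):
    """Weights (Eq. 10.14) from a {model: AIC or BIC} mapping."""
    best = min(scores.values())
    raw = {m: math.exp(-(s - best) / 2.0) for m, s in scores.items()}  # exp(-delta_i / 2)
    total = sum(raw.values())
    return {m: r / total for m, r in raw.items()}


def confidence_set(weights, level=0.95):
    """Models ranked by weight, accumulated until the summed weight reaches `level`."""
    chosen, cumulative = [], 0.0
    for model, w in sorted(weights.items(), key=lambda item: item[1], reverse=True):
        chosen.append(model)
        cumulative += w
        if cumulative >= level:
            break
    return chosen


# Hypothetical AIC scores for four candidate models
aic_scores = {"GTR+I+G": 100.0, "TVM+I+G": 102.0, "HKY+G": 110.0, "JC": 150.0}
weights = information_weights(aic_scores)
for model, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{model:8s} delta = {aic_scores[model] - 100.0:5.1f}  weight = {w:.4f}")
print("95% confidence set:", confidence_set(weights))
```

With these scores the best model alone carries a weight of about 0.73, so the 95% confidence set has to include the second-ranked model as well.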

10.8 Model averaging

Very interestingly, the model weights (or the posterior probabilities) allow us to obtain a model-averaged estimate (also called a multimodel estimate) of any parameter (Raftery, 1996; Wasserman, 2000; Burnham & Anderson, 2003; Hoeting et al., 1999; Madigan & Raftery, 1994). For example, a model-averaged estimate of the relative substitution rate between adenine and cytosine ($\varphi_{A-C}$) using the model weights (w) for R candidate models would be:

$$\hat{\bar{\varphi}}_{A-C} = \frac{\sum_{i=1}^{R} w_i\, I_{\varphi_{A-C}}(M_i)\, \varphi_{A-C,i}}{w_+(\varphi_{A-C})} \tag{10.15}$$

where

$$w_+(\varphi_{A-C}) = \sum_{i=1}^{R} w_i\, I_{\varphi_{A-C}}(M_i) \tag{10.16}$$

and

$$I_{\varphi_{A-C}}(M_i) = \begin{cases} 1 & \text{if } \varphi_{A-C} \text{ is in model } M_i, \\ 0 & \text{otherwise} \end{cases} \tag{10.17}$$
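Equations 10.15 to 10.17 amount to a weighted mean restricted to the models that contain the parameter. In the sketch below the function name, weights, and estimates are hypothetical; the indicator function is represented simply by whether a model's estimate dictionary contains the parameter, and the function returns both the averaged value and the normalizing sum $w_+$.

```python
def model_average(weights, estimates, parameter):
    """Return (model-averaged estimate, w_plus) for one parameter.

    weights:   {model: weight}
    estimates: {model: {parameter name: estimate under that model}}
    """
    # The indicator I(M_i) is 1 exactly when the parameter exists in model i.
    containing = [m for m in weights if parameter in estimates[m]]
    w_plus = sum(weights[m] for m in containing)
    if w_plus == 0.0:
        return None, 0.0          # no candidate model includes this parameter
    weighted_sum = sum(weights[m] * estimates[m][parameter] for m in containing)
    return weighted_sum / w_plus, w_plus


# Hypothetical weights and A-C rate estimates; HKY+G has no separate A-C rate
weights = {"GTR+I+G": 0.70, "TVM+I+G": 0.25, "HKY+G": 0.05}
estimates = {
    "GTR+I+G": {"rAC": 3.4, "alpha": 0.73},
    "TVM+I+G": {"rAC": 3.0, "alpha": 0.75},
    "HKY+G": {"alpha": 0.47},
}

averaged_rac, w_plus = model_average(weights, estimates, "rAC")
print(f"model-averaged rAC = {averaged_rac:.4f}, w_plus = {w_plus:.2f}")
```

Note that this toy input averages nothing for alpha on purpose: the gamma shape parameter means different things in +G and +I+G models, so in practice such parameters are averaged separately within each class of models.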

Remarkably, it is possible to construct a model-averaged estimate of the phylogeny itself. Also, note that some parameters do not have the same interpretation across different models. For example, the shape of the gamma distribution (α, commonly used to describe among-site rate variation) in a +Γ model is not the same parameter as α in a +I+Γ model.

Furthermore, if we sum up the weights of the models that contain a given parameter we will get an estimate of the relative importance of that parameter (ranging from 0 to 1). For example, the relative importance of the relative substitution rate between adenine and cytosine is simply the $w_+(\varphi_{A-C})$ coefficient above. Because we usually do not explore all the possible combinations of parameters in the set of candidate models, the relative importance of some parameters can be correlated.

PRACTICE David Posada

10.9 The model selection procedure

This section will demonstrate how to use Paup* (Swofford, 1998) (see Chapter 8) and ModelTest (Posada & Crandall, 1998) for selecting models of nucleotide substitution, and PhyML (Guindon & Gascuel, 2003) and ProtTest (Abascal et al., 2005) for selecting models of amino acid replacement, using hLRTs, AIC, and BIC. Although Bayes factors and model posterior probabilities can be approximated using MrBayes (Ronquist & Huelsenbeck, 2003), and the DT strategy can be accomplished with Paup* and DT-Sel (Minin et al., 2003), these procedures are beyond the scope of this chapter (although see new versions of ModelTest). The data files required for this section are available at www.thephylogenetichandbook.org.

By far the most complicated and time-consuming step for the hLRTs, AIC, and BIC strategies is the estimation of the likelihood scores for each model. For the hLRTs the hypotheses should be nested, and therefore the likelihood scores are estimated on the same fixed topology (the base tree). This tree is used just to estimate parameters and likelihood scores of the different models, and it is not our final estimate of the phylogenetic relationships among the taxa under investigation. It has been shown that, as long as this tree is a reasonable estimate of the phylogeny (e.g. a maximum parsimony or a neighbor-joining tree; never use a random tree!), the parameter estimates and the model selected should be appropriate (Posada & Crandall, 2001; Abdo et al., 2005). In the case of the AIC and BIC, the models compared do not have to be nested, and one could relax the fixed-topology restriction. That is, we could estimate the maximum likelihood tree for each model to obtain the likelihood scores. In principle, however, this has little effect on the model selection procedure, while it increases computation time a great deal (Abdo et al., 2005).

10.10 ModelTest †

ModelTest is a simple program written in ANSI C that has been compiled for Macintosh, Windows, and Linux. The ModelTest package is available for free and can be downloaded from the software section at http://darwin.uvigo.es. The input of ModelTest is a text file containing the log likelihood scores that correspond to each one of the 56 nucleotide substitution models specified in Table 10.1. Such an input file can be generated by executing in Paup* a particular

† See also the new program ModelTest at the same website.

Table 10.1 Set of candidate models evaluated in ModelTest. ef: equal base frequencies; uf: unequal base frequencies. For each model we evaluate model, model+I, model+G, and model+I+G, with I = proportion of invariable sites (1 parameter) and G = shape parameter of the gamma distribution (1 parameter)

Model    Free parameters   Base frequencies   Substitution code (a)   Reference
-------  ----------------  -----------------  ----------------------  --------------------------
JC       0                 equal              000000                  (Jukes & Cantor, 1969)
F81      3                 unequal            000000                  (Felsenstein, 1981)
K80      1                 equal              010010                  (Kimura, 1980)
HKY      4                 unequal            010010                  (Hasegawa et al., 1985)
TrNef    2                 equal              010020                  (Tamura & Nei, 1993)
TrN      5                 unequal            010020                  (Tamura & Nei, 1993)
K81      2                 equal              012210                  (Kimura, 1981)
K81uf    5                 unequal            012210                  (Kimura, 1981)
TIMef    3                 equal              012230                  (Posada, 2003)
TIM      6                 unequal            012230                  (Posada, 2003)
TVMef    4                 equal              012314                  (Posada, 2003)
TVM      7                 unequal            012314                  (Posada, 2003)
SYM      5                 equal              012345                  (Zharkikh, 1994)
GTR      8                 unequal            012345                  (Rodríguez et al., 1990)

(a) Numbers refer to how many distinct parameters are used to describe the relative rates A⇔C, A⇔G, A⇔T, C⇔G, C⇔T, and G⇔T. For example, 010020 implies three different parameters: 0 ≈ A⇔C = A⇔T = C⇔G = G⇔T, 1 ≈ A⇔G, and 2 ≈ C⇔T.
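The "Free parameters" column follows directly from the substitution code and the base frequency setting: the rates are relative, so a code with c distinct digits contributes c − 1 free rate parameters, and unequal base frequencies add three more (four frequencies that sum to one). The sketch below, with a hypothetical function name, reproduces the column and the K = 41 that results for GTR+I+G on 17 taxa when branch lengths are counted.

```python
def free_parameters(code, unequal_freqs, inv=False, gamma=False, n_taxa=None):
    """Number of free parameters K implied by a Table 10.1 substitution code."""
    k = len(set(code)) - 1          # relative rates: one class is the reference
    if unequal_freqs:
        k += 3                      # four base frequencies constrained to sum to 1
    k += int(inv) + int(gamma)      # +I and +G add one parameter each
    if n_taxa is not None:
        k += 2 * n_taxa - 3         # branch lengths of an unrooted tree
    return k


table_10_1 = {
    "JC": ("000000", False), "F81": ("000000", True),
    "K80": ("010010", False), "HKY": ("010010", True),
    "TrNef": ("010020", False), "TrN": ("010020", True),
    "K81": ("012210", False), "K81uf": ("012210", True),
    "TIMef": ("012230", False), "TIM": ("012230", True),
    "TVMef": ("012314", False), "TVM": ("012314", True),
    "SYM": ("012345", False), "GTR": ("012345", True),
}

counts = {name: free_parameters(code, uf) for name, (code, uf) in table_10_1.items()}
print(counts)
print("candidate models:", len(table_10_1) * 4)   # model, +I, +G, +I+G
print("GTR+I+G, 17 taxa:", free_parameters("012345", True, inv=True, gamma=True, n_taxa=17))
```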

block of commands available in the modelblock file included in the ModelTest package. The complete procedure requires three steps:

(i) Execute the alignment file (NEXUS format) in Paup* (see Chapter 8, Practice).

(ii) Open the command file modelblock and execute it. Paup* estimates a neighbor-joining tree and the likelihood and parameter values for several models. This can take from several minutes to several hours depending on the number of taxa and the computer speed. Once finished, a file called model.scores will appear in the same directory as the modelblock file. It is often good practice to rename this file as yourdata.scores.

(iii) Run ModelTest using the file yourdata.scores as the input file. The easiest way is to use the ModelTest Web Server (Posada, 2006) at http://darwin.uvigo.es (Fig. 10.2). Alternatively, one can run ModelTest locally. The Mac version of the program has a command-line interface that asks the user to select an input file and to choose a name for the output file. The PC version requires using the console (go to Start > Accessories > Console prompt) or an MS-DOS window. Here you would type:

    modeltest.exe < model.scores > outfile

and press enter. The program will save the outfile with the results in the same directory. Note that model.scores has to be in the same folder where modeltest.exe resides (such folder, called Modeltest, is created during the installation of the program).

Fig. 10.2 ModelTest Web Server at http://darwin.uvigo.es. Note that the specification to count branch lengths as parameters in the standalone ModelTest software requires the option "-t17", where 17 is the number of taxa in this data set.
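When many data sets have to be processed, the redirection in step (iii) can be scripted. The following is only a sketch: `run_with_redirection` is a hypothetical helper, it passes no ModelTest options, and it assumes that the executable and the scores file are in the current directory. It does exactly what the shell command above does, feeding the scores file to the program's standard input and writing its standard output to a file.

```python
import subprocess


def run_with_redirection(command, infile, outfile):
    """Run `command` with stdin read from `infile` and stdout written to `outfile`."""
    with open(infile, "rb") as source, open(outfile, "wb") as sink:
        # check=True raises CalledProcessError if the program exits with an error
        completed = subprocess.run(command, stdin=source, stdout=sink, check=True)
    return completed.returncode


# Equivalent of: modeltest.exe < model.scores > outfile
# (uncomment once modeltest.exe and model.scores are in the working directory)
# run_with_redirection(["modeltest.exe"], "model.scores", "outfile")
```

Passing the command as a list avoids invoking a shell, so file names containing spaces need no quoting.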

The output of ModelTest consists of a description of the hLRTs and AIC or BIC strategies. The first section of the output corresponds to the hLRTs strategy. Here, the particular LRTs performed and their associated P-values are listed, and the model selected with the corresponding parameter estimates (actually calculated by Paup*) is described. In the second section of the output, the program will use the AIC or the BIC framework, depending on the option previously indicated by the user. Here, the model selected is described first. Then, three tables show the models ranked according to the AIC/BIC differences and AIC/BIC weights, the model-averaged parameter estimates, and their relative parameter importance. In addition, ModelTest also provides a block of commands in NEXUS format, which can be executed in Paup* in order to facilitate the implementation of the model selected. It is often convenient to paste this block of commands after the data block in the NEXUS file. This is very useful if the user wants to use this model for further analyses in Paup*; for example, to perform an LRT of the molecular clock, or to estimate a tree using the best-fit model.

Table 10.2 Set of candidate models evaluated in ProtTest. For each model (0 free parameters) we evaluate model, model+F, model+I, model+G, model+F+I, model+F+G, model+I+G, and model+F+I+G, with F = amino acid frequencies (19 parameters), I = proportion of invariable sites (1 parameter) and G = shape parameter of the gamma distribution (1 parameter)

Model      Data type                Reference
---------  -----------------------  ----------------------------
JTT        nuclear                  (Jones et al., 1992)
MtREV      mitochondrial            (Adachi & Hasegawa, 1996)
MtMam      mitochondria             (Cao et al., 1998)
MtArt      arthropod mitochondria   (Abascal et al., 2007)
Dayhoff    nuclear                  (Dayhoff et al., 1978)
WAG        nuclear                  (Whelan & Goldman, 2001)
RtREV      retroviral polymerase    (Dimmic et al., 2002)
CpREV      chloroplastic            (Adachi et al., 2000)
Blosum62   nuclear                  (Henikoff & Henikoff, 1992)
VT         nuclear                  (Muller & Vingron, 2000)

10.11 ProtTest

ProtTest (Abascal et al., 2005) is a program for the selection of models of amino acid replacement. It is written in Java, so it can run in any operating system with a Java Runtime Environment. ProtTest performs calculations similar to those of ModelTest, except for the hLRTs, which are not implemented because most of the amino acid models considered are not nested. Likelihood calculations are implemented using PhyML (Guindon & Gascuel, 2003), and some functions make use of the Pal library (Drummond & Strimmer, 2001).

ProtTest provides a user-friendly interface where the user selects the different options for the analysis. In addition, it can be run at the command line or through a web server (http://darwin.uvigo.es). Given an input file with an amino acid alignment in PHYLIP or NEXUS format, the program automatically calculates a BIONJ tree (Gascuel, 1997). Alternatively, a user-defined tree can be specified. Next, likelihoods and parameter estimates can be calculated using PhyML for every model upon this fixed topology, or a PhyML tree search can be implemented for every model. At the time of writing, ProtTest selects among 80 empirical models of amino acid replacement (Table 10.2). These models are preferentially based on empirical amino acid replacement matrices for computational reasons. Such matrices have been estimated from large data sets consisting of diverse protein families.

    ---------------------------------------------
    * * * AKAIKE INFORMATION CRITERION (AIC) * * *
    ---------------------------------------------

    Model selected: GTR+I+G
      -lnL = 21148.9277
      K    = 41
      AIC  = 42379.8555

    Base frequencies:
      freqA = 0.3771
      freqC = 0.2255
      freqG = 0.1759
      freqT = 0.2214

    Substitution model:
      Rate matrix
      R(a) [A-C] = 3.4141
      R(b) [A-G] = 5.0976
      R(c) [A-T] = 3.5059
      R(d) [C-G] = 0.4437
      R(e) [C-T] = 14.9945
      R(f) [G-T] = 1.0000

    Among-site rate variation
      Proportion of invariable sites (I) = 0.1595
      Variable sites (G)
      Gamma distribution shape parameter = 0.7279

Fig. 10.3 Partial output of ModelTest Web Server indicating the AIC model for the mtDNA data set. Note that the maximum likelihood estimates shown by ModelTest are actually obtained by Paup*.

10.12 Selecting the best-fit model in the example data sets

10.12.1 Vertebrate mtDNA

The first data set is an alignment of mitochondrial sequences from several vertebrates including 17 taxa and 1998 nucleotides. The model selected by the hLRTs in ModelTest 3.8 is TrN+I+G, while the model selected by the AIC (Fig. 10.3) and the BIC (using the alignment length as sample size) is GTR+I+G. In both models base frequencies are unequal, there is a significant proportion of invariable sites (indicated by the +I), and there is rate heterogeneity among sites (indicated by +G). However, while the TrN model considers three types of substitutions (all transversions together and the two types of transitions), GTR considers that the six possible kinds of substitutions among the different bases occur at different rates (see Table 10.1). The hLRTs therefore selected a simpler model than the AIC or the BIC, which is often the case. In the output of ModelTest, we can see that the TrN model cannot be rejected when it is tested against the TIM model, and therefore the full GTR scheme is never reached. However, note that TrN+I+G (-lnL = 21262.9570) is significantly worse than GTR+I+G (-lnL = 21148.9279) (LRT = 228.0582; d.f. = 3; P-value < 0.0001). In this case, the hLRTs seem to have been stuck in a local optimum (see Posada & Buckley, 2004).

According to the AIC or BIC weights (0.99 or 0.96, respectively), there is little uncertainty about GTR+I+G being the best model for this data set, so it would be very reasonable to use just GTR+I+G for the next analyses. The relative parameter importance values (Table 10.3) indicate that all the parameters in this model are equally relevant. Given the large weights of the best model, the model-averaged estimates (Table 10.3) are indeed very similar to those obtained under the GTR+I+G model (Fig. 10.3). Importantly, these parameter estimates characterize the molecular evolution of this gene. We can see that A is the most frequent nucleotide (0.3771), that the most common substitution is between C and T (relative rate 15.0010), that around 16% of the sites have not changed (the proportion of invariable sites, p inv, is 0.1595), and that there is medium rate heterogeneity (the shape of the gamma distribution, α, is 0.7280).

Table 10.3 Parameter importances and model-averaged estimates for the vertebrate mtDNA data set

Parameter (a)   Importance   Model-averaged estimate
--------------  -----------  -----------------------
fA              1.0000       0.3771
fC              1.0000       0.2255
fG              1.0000       0.1759
fT              1.0000       0.2214
TiTv            0.0000       –
rAC             1.0000       3.4152
rAG             1.0000       5.0991
rAT             1.0000       3.5073
rCG             1.0000       0.4440
rCT             1.0000       15.0010
pinv(I)         0.0000       –
alpha(G)        0.0022       0.4738
pinv(IG)        0.9978       0.1595
alpha(IG)       0.9978       0.7280

(a) (I): averaged using only +I models; (G): averaged using only +G models; (IG): averaged using only +I+G models.

10.12.2 HIV-1 envelope gene

The second data set is the env gene from HIV and SIV strains. The alignment includes 14 taxa and 2202 nucleotides. In this case, the three criteria, hLRTs, AIC, and BIC, select the same model, TVM+I+G. The TVM ("transversional model"; Posada, 2003) scheme considers one type of transition and four types of transversions. For this particular data set the hLRTs do not seem to have been stuck in local optima, as GTR+I+G (-lnL = 15780.9895) is not significantly better than TVM+I+G (-lnL = 15780.9971). There is some uncertainty about the model selected, as the AIC and BIC weights for TVM+I+G are 0.70 and 0.56, respectively. For the AIC, the second best model is GTR+I+G, with a weight of 0.26. For the BIC, the second best model is TVM+G, with a weight of 0.41. The relative parameter importances indicate that the A⇔G and C⇔T rate parameters do not need to be considered independently (importance = 0.2705). The model-averaged estimates indicate that A is the most frequent nucleotide (0.3537), that transitions are the most common type of substitution, that around 17% of the sites have not changed (p inv = 0.1721), and that there is medium rate heterogeneity (α = 0.9436).

10.12.3 G3PDH protein

Finally, the third example data set is an amino acid alignment of the enzyme glycerol-3-phosphate dehydrogenase in bacteria, protozoa, and animals. It includes 12 taxa and 234 amino acid residues. The model selected by ProtTest 1.3 for this data set, after running for 2–3 hours on a laptop, was WAG+I+G (Whelan & Goldman, 2001), regardless of the criterion used (AIC, AICc, or BIC). From the output, we can see that there is virtually no uncertainty in the selection; the WAG+I+G model has a weight of almost 1 under any criterion (look for AICw, AICcw, or BICw). The WAG matrix was inferred from a large database of sequences comprising a broad range of nuclear protein families, and it is suited for distantly related amino acid sequences. Therefore, the model selected appears to be appropriate. Given its large weight, the relative importances of the different parameters are concentrated around +I+G. The model-averaged estimates indicate a very small proportion of invariable sites (p inv ∼ 0.08 for the different criteria) and moderate rate variation among residues (α ∼ 2). Note that this model does not include the +F option (Cao et al., 1994), which suggests that it is not worth including in the model the amino acid frequencies estimated (observed) from this particular alignment. Remember that each replacement matrix has attached its own set of amino acid frequencies estimated from a large protein database. Conveniently, ProtTest has an option that allows the user to visualize at once the results for all criteria, including different ways of defining sample size for the AICc and the BIC (number of sites in the alignment, sum of site entropies, length × number of sequences scaled by the total entropy).