
bioRxiv preprint doi: https://doi.org/10.1101/041566; this version posted February 26, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. Hoff et al.

RESEARCH

Does the choice of substitution models matter topologically?

Michael Hoff1†, Stefan Orf1†, Benedikt Riehm1†, Diego Darriba2 and Alexandros Stamatakis1,2*

{michael.hoff, steffan.orf, benedikt.riehm}@student.kit.edu {diego.darriba, alexandros.stamatakis}@h-its.org

Abstract

Background: In the context of a master-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available an open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria. We address the question whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible model, yields different tree topologies. We also assess to which degree models selected and trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess if the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies.

Results: We find that all three factors (by order of impact: nucleotide model selection, information criterion used, sample size definition) can yield topologically substantially different final tree topologies (topological difference exceeding 10%) for approximately 5% of the tree inferences conducted on the 39 empirical datasets used in our study.

Conclusions: We find that using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.

Keywords: nucleotide substitution; model selection; information criterion; BIC; AIC

Background

Statistical models of DNA as used in Bayesian inference (BI) and Maximum Likelihood (ML) methods for phylogenetic reconstruction are typically required to be time-reversible. That is, evolution (or the Markov process modeling it) is assumed to occur in the same way if followed forward or backward in time. Time-reversibility has some intrinsic computational advantages, mainly that the likelihood of a tree one intends to score is the same for any placement of the virtual root that is deployed to direct the Felsenstein pruning algorithm [1].

To ensure that a nucleotide substitution matrix is time-reversible, it must exhibit a certain symmetry. This symmetry requirement is depicted in the following example for a set of substitution rates {rAC, rAG, rAT, rCG, rCT, rGT} and a set of stationary frequencies {πA, πC, πG, πT}:

\mu \begin{pmatrix}
-           & r_{AC}\pi_C & r_{AG}\pi_G & r_{AT}\pi_T \\
r_{AC}\pi_A & -           & r_{CG}\pi_G & r_{CT}\pi_T \\
r_{AG}\pi_A & r_{CG}\pi_C & -           & r_{GT}\pi_T \\
r_{AT}\pi_A & r_{CT}\pi_C & r_{GT}\pi_G & -
\end{pmatrix}

where rows and columns are ordered A, C, G, T.

*Correspondence: [email protected]
1 Karlsruhe Institute of Technology, Department of Informatics, Kaiserstraße 12, 76131 Karlsruhe, Germany
2 The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany
Full list of author information is available at the end of the article
† Equal contributor
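The symmetry requirement can be checked numerically. The sketch below (illustrative Python, not the authors' code; the rate and frequency values are made up) builds the off-diagonal entries q_XY = μ·r_XY·π_Y from symmetric relative rates and verifies the detailed-balance condition π_X·q_XY = π_Y·q_YX that characterizes a time-reversible process:

```python
# Sketch: verify that a rate matrix built from symmetric relative rates r_XY
# and stationary frequencies pi satisfies detailed balance
# (pi_X * q_XY == pi_Y * q_YX), i.e. is time-reversible.
# Rates and frequencies below are illustrative, not from the paper.

BASES = "ACGT"

def build_q(rates, pi, mu=1.0):
    """q[x][y] = mu * r_xy * pi_y for x != y; diagonal chosen so rows sum to 0."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = mu * rates[frozenset((x, y))] * pi[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q

rates = {frozenset("AC"): 1.5, frozenset("AG"): 4.0, frozenset("AT"): 0.7,
         frozenset("CG"): 1.1, frozenset("CT"): 4.2, frozenset("GT"): 1.0}
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

q = build_q(rates, pi)
for x in BASES:
    for y in BASES:
        if x != y:
            # detailed balance must hold for every pair of states
            assert abs(pi[x] * q[x][y] - pi[y] * q[y][x]) < 1e-12
```

Note that r_GT is fixed to 1.0 here, mirroring the convention described below that one relative rate is set to 1.0.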

As implied by the above representation, the rates rXY in the upper and lower diagonal parts of this matrix must be symmetrical. Therefore, there can be at most 6 distinct, independent rates. Because rates in such a matrix are relative rates, one rate (typically rGT) is set to 1.0 by default, such that there are at most 5 free parameters that can be estimated via ML optimization or sampled with MCMC methods for BI. A time-reversible DNA substitution matrix can have between 1 and 6 distinct rates. Note that we further differentiate between substitution models with identical base frequencies (i.e., πA = ... = πT = 0.25) or distinct base frequencies. Distinct base frequencies can be obtained either by using empirical base frequencies or via direct sampling (BI) or ML inference of the base frequencies. Nonetheless, here we mainly focus on models as defined by substitution rates and not base frequencies. If all 6 rates and all base frequencies are identical we obtain the Jukes-Cantor (JC [2]) model, and if all 6 rates are independent, we obtain the General Time-Reversible (GTR [3]) model.

Evidently, there is a large number of possible time-reversible models situated between the JC and GTR models. We can essentially consider all possible combinations of dependencies between the 6 substitution rates. In the literature, only a subset of these has been explicitly described and named, for instance the HKY85 model [4]. We are aware of two papers and respective implementations that consider all possible time-reversible nucleotide models and intend to select the best-fitting one among these 203 substitution matrices.

The first paper that addressed the problem used Bayesian statistics and MCMC methods to sample among all possible substitution matrices proportional to their posterior probability [5]. The method for integrating over all substitution models is available in MrBayes version 3.2 [6] via the command lset nst=mixed.

Another code that allows for testing all possible substitution matrices in a ML framework based on the AIC (Akaike Information Criterion), AICc (corrected AIC), BIC (Bayesian Information Criterion), and DT (Decision Theory) criteria is jModelTest 2 [7]. However, jModelTest relies on PHYML [8] to calculate likelihood scores for a given substitution model and tree. PHYML is relatively slow compared to RAxML (v. 8.1.7 [9]) and the Phylogenetic Likelihood Library (PLL [10]), which relies on the same phylogenetic likelihood kernel implementation as RAxML. For instance, optimizing the model parameters and branch lengths on a fixed tree with 354 ITS sequences [11] using a GTR model with Γ-distributed rate heterogeneity among sites [12] and 4 discrete rate categories requires 11 seconds with RAxML and 110 seconds with PHYML.

Here, we present and make available a fast implementation of a ML-based model testing tool for all possible time-reversible DNA substitution models. It was developed by three students within the context of a programming practical at the Master level in the computer science department of the Karlsruhe Institute of Technology. The educational aspects of this project are discussed in the on-line supplement.

The effect of different selection criteria and substitution models on ML-based phylogenetic analyses has already been discussed before. Darriba et al. [7] evaluated the accuracy of distinct model selection criteria with respect to recovering the true generating model parameters (e.g., substitution rates, proportion of invariant sites, or the α shape parameter of the Γ distribution) under simulation, using 40,000 synthetic alignments generated under a wide range of model parameters.

Ripplinger and Sullivan [13] evaluated the influence of model choice on empirical tree topologies as determined by using the hierarchical Likelihood Ratio Test (hLRT [14]), the AICc and BIC criteria, as well as DT. In their tests, the selection criteria at hand often returned different best-fit models. These different best-fit models then also induced distinct ML trees. Ripplinger and Sullivan exclusively focused on selecting named models in their study, that is, they did not select among all possible 203 substitution models. They did, however, also investigate the impact of model selection on bootstrap support values. Note that this was not feasible in our study because of the limited amount of time and credits available to the students for completing the programming practical.

Nonetheless, our work fills the gap left by the two aforementioned papers [7, 13], since we study the impact of selecting among all possible 203 substitution models on ML-based tree inferences on empirical datasets, compared to tree inferences conducted under a default GTR+Γ model.

Initially, we present two algorithms for generating all possible substitution models. Thereafter, we discuss the sequential and parallel implementation using the PLL. Subsequently, we present a set of experiments on empirical datasets to answer the following question: Does model selection really matter with respect to its impact on the shape of the final tree topology? Posada and Buckley discussed the potential impact of the sample size on the AICc and BIC criteria [15]. We also assess if different possible definitions of the sample size as used for calculating information criteria influence the model that will be selected. In addition, we conduct a small simulation study to verify that our tool works correctly. We conclude with speedup data for the parallel implementation of our tool.
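As a side note (our framing, not stated explicitly in the paper): the count of 203 models corresponds to the number of ways the six relative rates can be partitioned into equivalence classes, i.e., the 6th Bell number. A quick check via the standard Bell-triangle recurrence:

```python
# The time-reversible models correspond to set partitions of the six rates
# {rAC, rAG, rAT, rCG, rCT, rGT}; their count is the Bell number B(6) = 203.
def bell(n):
    """n-th Bell number via the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]

assert [bell(k) for k in range(7)] == [1, 1, 2, 5, 15, 52, 203]
```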

Algorithms

The model test needs to calculate the ML score for every possible time-reversible nucleotide substitution model. A full list of all possible models was already available [5]. Nonetheless, part of the programming task was to design algorithms for generating them. In the following, we first provide some background information and notation and then describe our two algorithms for generating all time-reversible nucleotide substitution models.

Notation, Terminology & Properties

In analogy to [5], we encode a substitution model as a string of six digits over Σ6 := {1, 2, 3, 4, 5, 6}. For example, the string 111333 describes a substitution model with the following constraints: rAC = rAG = rAT and rCG = rCT = rGT. Note that the PLL [10] offers a similar format for specifying time-reversible nucleotide substitution models. The only difference is that it uses the alphabet {0, 1, 2, 3, 4, 5}.

In this string notation, the GTR model is denoted by 123456, that is, all rates are different and independent from each other. The HKY85 model is denoted by 121121. Here, we constrain rAC = rAT = rCG = rGT and rAG = rCT.

We call a model a k−model if it has exactly k distinct substitution rates. Hence, the parameter k characterizes the degree of freedom of a model. We further define Σk := {1, 2, ..., k} as the alphabet of all digits with value ≤ k.

While two distinct models over Σ6 can consist of different digits/characters, they can specify exactly the same constraints. For example, 121121 and 212212 are semantically identical since they have the same degree of freedom and identical rate dependencies. We call such models redundant.

To avoid generating redundant models we introduce a normalization property, which holds only for one, unique model in a set of semantically identical models. A model string s = s1s2s3s4s5s6 (e.g., 123456 induces s1 = 1, s2 = 2, ...) with digits si ∈ Σ6 is normalized if, and only if, for each first occurrence of a value si in s the following holds: {s1, s2, ..., si−1} = Σ_{si−1}. For example, the first occurrence of digit 3 in a model requires the prefix, excluding the current digit, to consist of exactly two digits: 1 and 2. The models 123111, 122132, 123456 fulfill this constraint and the models 311111, 113111, 124356 do not. The semantically equivalent, normalized models 122222, 112111, 123456 for the unnormalized ones in our example can be obtained by consistently replacing characters from left to right. As a consequence of the above definition, a normalized k−model only comprises characters in Σk.

We further define the k−prefix of a k−model m as the shortest prefix of m that contains all characters in Σk at least once. This part of the string is denoted by underlining the respective characters. For example, the 4−model 124311 has the 4−prefix 1243, as any shorter prefix, say 124, does not yield all characters in Σ4 = {1, 2, 3, 4}.

In the following, we first present a brute-force algorithm and then a more elegant constructive algorithm for generating all model strings.

Brute Force Approach

Our brute-force implementation initially enumerates all possible strings of length 6 over the alphabet Σ6 = {1, 2, 3, 4, 5, 6}. The enumeration is done by counting from 111111 to 666666 in the base-6 numerical system, except that the digits are in the range 1 ... 6 instead of 0 ... 5. In total, this yields 6^6 = 46656 strings, out of which only 203 are non-redundant. To ensure that the algorithm only returns normalized model strings, we normalize every enumerated string and subsequently filter out redundant models.

The normalization re-maps the digits of the model string such that the normalization property holds. In the model string, every constraint is represented by a unique digit. The transformation now chooses new digits for each constraint. These digits are assigned incrementally, starting with 1. For this, the algorithm identifies the constraints of the model string by traversing it from left to right. To properly order the constraints, they are ordered by the index of their corresponding leftmost digit.

The first occurrence of each normalized string encountered in the enumeration of all strings is then added to a final model list. We discard normalized model strings that have already been included in this list. When all strings have been enumerated, normalized, and added to the final model list, the list contains all 203 non-redundant model strings.

Note that a normalized string always begins with 1, the second digit can only be 1 or 2, etc. We use this observation to optimize the brute-force enumeration algorithm. For this, we only place characters in the set {1, ..., n} at the nth position of the string. This prevents the algorithm from enumerating a large number of non-normalized strings. With this modification, only 1·2·3·4·5·6 = 720 strings are enumerated. This generates only 1.5% of the strings the naive brute-force approach enumerates and yields only 720 − 203 = 517 redundant models.

Inductive Algorithm

Our inductive algorithm uses the set of all k−models to construct all possible (k + 1)−models. The algorithm starts with the single 1−model 111111 (the JC

model) and constructs all 31 normalized 2−models (e.g., 111112, 111121, 111122, ...). This process continues inductively by incrementing k until the sole possible 6−model 123456 (the GTR model) is generated.

In the following, we explain how a single new (k + 1)−model can be derived by applying k ↦ k + 1 to a k−model via character replacement. Extending a normalized model to k + 1 degrees of freedom requires one character to be replaced by a new character k + 1. To maintain the normalization property and to ensure the correctness of this operation, the replacement must not be applied to the k−prefix. For example, when applying 2 ↦ 3, the 2−prefix 12 of model 121111 must not be used for a character replacement operation, since the resulting models 321111 and 131111 would not be normalized. To avoid the construction of redundant models, we further restrict the character replacement step to only replace characters of value k. Otherwise, the sets of (k + 1)−models that are derived from different k−source models will not be disjoint. To illustrate what can happen, assume the step 2 ↦ 3 is applied to the third character of the two 2−models 121111 and 122111. This will generate the 3−model 123111 twice. An example of a derivation respecting both constraints is 123321 ↦ 123421 (applying 3 ↦ 4). Note that the (k + 1)−prefix always extends the k−prefix to include the newly inserted character k + 1.

The algorithm relies on the property that every conducted replacement operation yields a valid model and thus a fraction of the overall result. As a consequence, every model derivation step has to be applied to a copy of the source model to avoid over-writing already generated models.

Given a single k−model, the induction step k ↦ k + 1 is conducted in two phases. First, all characters not contained in the k−prefix that have value k are used to derive new models, one for each such character. This generates a set of (k + 1)−models, where the (k + 1)−prefixes now protect their newly inserted characters. For example, the 1−model 111111 produces the 2−models 121111, 112111, 111211, 111121 and 111112. Note that only the first phase derives (k + 1)−models from k−models and thus produces models with a longer protected prefix.

In the second phase, we use this set of (k + 1)−models to recursively derive further (k + 1)−models. For each model in the set, we identify all occurrences of k not contained in the (k + 1)−prefix. For each such "unprotected" occurrence of k we generate a new model by replacing the character k by k + 1. Each new model is then processed recursively in the same way. Note that only occurrences of k right of the last replaced character are considered. This process continues until we reach the end of the string. For example, the model 111211, which has been generated in the first phase of 1 ↦ 2, yields 111221 and 111212. Notice the changes in both new models and the trailing 1 in the first model, marked by a line on top of it. The line indicates that the recursive processing of this model will start at this character and not consider characters left of it. The recursion then yields 111222, as the marked 1 can still be replaced.

Figure 1 shows a more complex derivation process. The 2−model 122122 is used to derive 3−models, starting at the left-most digit not contained in the prefix. Edges represent both continuation of iteration and start of recursion on newly created models. Each first replacement increases the degree of freedom by one. Hence, the prefix is extended and the second phase of processing begins on this path.

[Figure 1: Inductive model derivation step 2 ↦ 3. Process tree showing how all 3−models derivable from 122122 are generated. Lines on top of characters indicate execution progress. Lines beneath visualize the 2− respectively 3−prefix.]

Given a set of k−models, the induction step k ↦ k + 1 as explained above is applied to each model in the set. The hereby obtained sets of (k + 1)−models are disjoint by construction and can thus easily be unified into one complete set of (k + 1)−models. The overall induction process can be visualized by the following sequence of inductive steps:

{111111} ↦ S2 ↦ S3 ↦ S4 ↦ S5 ↦ {123456}

(applying 1 ↦ 2, 2 ↦ 3, 3 ↦ 4, 4 ↦ 5, 5 ↦ 6, respectively), where |S2| = 31, |S3| = 90, |S4| = 65 and |S5| = 15.

Implementation

The goal of the programming practical was to develop a program that tests all possible time-reversible nucleotide substitution models under Maximum Likelihood (ML) on a fixed, reasonable tree and subsequently selects the best model. Model selection is conducted using the following standard criteria: AIC, BIC, and AICc. In other words, our algorithm represents an ML-based re-implementation of the Bayesian approach deployed in [5]. Once the best model has been selected, the tool shall also conduct a ML-based phylogenetic tree search under the best model.
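Before turning to the implementation, the two generation algorithms above can be condensed into a short sketch (a Python rendition for illustration only; the authors' tool is written in C on top of the PLL). It cross-checks that the brute-force enumeration and the inductive derivation produce the identical set of 203 normalized model strings, with per-k counts matching the text:

```python
# Illustrative Python rendition (not the authors' C code): generate all 203
# normalized time-reversible model strings in two independent ways.
from itertools import product

def normalize(s):
    """Re-map digits so each new constraint gets the next free digit (1, 2, ...)."""
    mapping = {}
    for ch in s:
        if ch not in mapping:
            mapping[ch] = str(len(mapping) + 1)
    return "".join(mapping[ch] for ch in s)

# Brute force: normalize all 6^6 = 46656 strings and keep the unique ones.
brute = {normalize("".join(p)) for p in product("123456", repeat=6)}

def derive(model, k):
    """All (k+1)-models derivable from a k-model: replace 'unprotected'
    occurrences of digit k (right of the k-prefix) by k+1, left to right."""
    # k-prefix ends at the first position where all digits 1..k have appeared
    prefix_len = 1 + max(model.index(str(d)) for d in range(1, k + 1))
    out = []
    def rec(m, start):
        for i in range(start, 6):
            if m[i] == str(k):
                new = m[:i] + str(k + 1) + m[i + 1:]
                out.append(new)
                rec(new, i + 1)  # phase two: only replace right of the last change
    rec(model, prefix_len)
    return out

levels = [["111111"]]                      # JC model
for k in range(1, 6):
    levels.append([m for src in levels[-1] for m in derive(src, k)])

# Per-k counts as stated in the text: 1, 31, 90, 65, 15, 1 (summing to 203)
assert [len(l) for l in levels] == [1, 31, 90, 65, 15, 1]
inductive = {m for level in levels for m in level}
assert inductive == brute and len(brute) == 203
```

This mirrors the verification described later in the paper: the two independently implemented generators must yield exactly identical model sets.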

For likelihood calculations, tree searches, and handling I/O we used the PLL [10]. Initially, we provide a top-level outline of the algorithm:
• Parse a multiple sequence alignment of DNA sequences
• Compute a reasonable, non-random reference tree using parsimony
• Generate all 203 possible time-reversible substitution matrices
• Loop over these 203 matrices and optimize the ML score on the fixed tree for each model using the PLL and Γ-distributed rate heterogeneity among sites with 4 discrete rate categories
• Use AIC, BIC, and AICc to determine the best among the 203 models
• Conduct an ML tree search (as implemented in the PLL) under the best model
• Write the final ML tree to a Newick file

Note that the actual implementation computes the best model for 5 instead of 3 information criteria: we compute BIC and AICc scores twice to allow for using distinct sample size definitions. In the following, we briefly describe the sequential and parallel implementations of the algorithm.

Sequential Implementation

Initially, we parse and read the alignment data (either in PHYLIP or FASTA format) into a PLL data structure using the appropriate I/O routines of the PLL. Besides the alignment file name, users can also specify if they want to use empirical base frequencies or conduct a ML estimate of the base frequencies. We then build a reasonable tree on which we are going to calculate the likelihoods for all models. This is done via the randomized stepwise addition parsimony tree algorithm as implemented in the PLL. We then loop over all (pre-computed) possible models and set the substitution rate dependencies in the matrix using the appropriate PLL function according to the current model string. Subsequently, we optimize all model parameters (base frequencies, rates, branch lengths, α shape parameter of the Γ model of rate heterogeneity) on the fixed parsimony tree with the PLL and calculate the values for the five information criteria. We keep track of the minimum value and the corresponding model for each criterion separately for further downstream analysis. Given the set of best models (at least one and at most five) according to the criteria, we then conduct a ML tree search for each criterion and best-scoring model and write the resulting ML tree to file in Newick format.

Parallel Implementation

The runtime of the sequential implementation is dominated by likelihood calculations in the PLL, which account for approximately 90−95% of overall execution time. Thus, the "model optimization" and "tree search" steps can become performance bottlenecks, in particular for large datasets. To this end, we designed an MPI-based master/worker approach that evaluates all 203 distinct models in parallel. Note that the likelihood evaluations for each of the 203 models are completely independent of each other.

The primary goal is to efficiently find the best-fitting model for each information criterion. As already mentioned, we need to optimize the model parameters for each of the 203 models. Thereafter, we can calculate the information criteria using the respective ML score and other relevant information such as the degrees of freedom and the dataset properties, as mentioned before. Since we only need to select the "best" model per information criterion, we can merge the criterion scores of different models by keeping only the minimum[1] for each criterion. This means that, if we merge the results for two distinct model evaluations, we obtain a list of lowest criterion-score/model-index pairs. To explain the parallelization, we distinguish between "local" and "global" result pair lists. A local result only aggregates the information for a subset of models, whereas the global result aggregates the information over all models.

In the sequential case, the one and only thread starts with a local result list and updates it after each model evaluation. When the last model has been processed, the local result corresponds to the global result.

In the parallel case, each worker process maintains its own local result list. Worker processes receive work (models) from the master process, evaluate them, and update their local result lists. When all models have been evaluated, we use MPI to collect and automatically merge[2] all local result lists into the global result list at the master.

[1] A lower score indicates a better model.
[2] Via an intercommunicator reduce operation.

In the following, we briefly outline the MPI communication and synchronization scheme. After asynchronously sending an initial workload to each worker, the master process uses a blocking wait for any worker to finish its model evaluation task. Once a worker has finished, the master again uses an asynchronous send to propagate the next model to be evaluated to this worker. Via the asynchronous send, the master can immediately continue distributing tasks to other workers. The worker employs a blocking receive for a model, evaluates it, updates its own local result list, and informs the master about the task completion. When all

models have been distributed and evaluated, the master initiates a parallel reduction operation to obtain the global result list and terminates all MPI processes.

Code Quality Assessment

In order to continuously monitor and improve the quality of our source code, we used several tools and methods throughout our project. We deployed both static and dynamic code analyses.

Static Analysis

Static analyses help to identify static programming errors such as incorrect programming language syntax or static typing errors. We compiled our code with the gcc compiler using all available and reasonable warning flags[3]. This allowed us to identify potential programming errors at an early stage. Due to its more pedantic nature (i.e., its ability to detect more errors), we also periodically used the clang compiler with respective flags[4] alongside gcc to further reduce the number of potential errors.

[3] -Wall -Wextra -Wredundant-decls -Wswitch-default -Wimport -Wno-int-to-pointer-cast -Wbad-function-cast -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wformat-nonliteral -Wundef
[4] -Weverything -pedantic

Dynamic Analysis

These analyses cover run-time issues such as memory leaks. We used valgrind and its sub-module memcheck for detecting memory-related errors in our program. By appending the Open MPI related valgrind suppression file to the tool, we excluded several MPI-specific memory management functions from the analyses, which resulted in easier-to-interpret debugging output.

Experimental Setup

To answer our main question, whether models matter or not, we applied our tool to a large number of representative empirical datasets. For each dataset we conducted 36 independent runs with 18 distinct random number seeds and 2 base frequency configurations (empirical and ML estimate), so as to generate 18 distinct randomized stepwise addition order parsimony starting trees. In each run, we determined a total of at most 5 best-scoring models using the following criteria: AIC, AICc-S (sample size: #alignment sites), AICc-M (sample size: #alignment sites × #taxa), BIC-S (sample size: #alignment sites), BIC-M (sample size: #alignment sites × #taxa). In addition, in each run we conducted one tree search under the GTR+Γ model. Thus, each of the 36 per-dataset runs generated at most 6 ML trees, 5 using the best-scoring models under the respective criteria and one under GTR+Γ. Note that different information criteria (like the 3 variants of AIC) tend to choose the same model, and therefore typically fewer than 6 distinct ML trees are generated overall per run. Thus, the total number of trees inferred for each dataset is ≤ 6·36 = 216. For subsequent analyses, we then selected the trees with the respective highest ML score out of the 36 runs (18 random seeds) for each criterion separately and independently.

Empirical Test Datasets

The empirical test datasets we used are summarized in Table 1. For each dataset we list the number of taxa, the number of sites, and the source from which they were obtained. 'Original paper' refers to the Bayesian implementation in [5], 'Lakner et al.' refers to the datasets used in [16], 'MrBayes' refers to benchmark datasets used for testing MrBayes [6], and 'RAxML' refers to datasets used for benchmarking RAxML [9].

Hardware Platform

With an increasing number of taxa, the computations can become time-consuming. Thus, we used a comparatively large multi-core shared memory machine for our experiments (running Ubuntu 14.04). The server comprises four AMD Opteron 6174 processors, each equipped with 12 cores running at 2.2 GHz, and has 256 GB RAM. While our parallel implementation can use all 48 cores simultaneously, the computations for all test datasets still required several days.

Experimental Results

In the following, we describe the results of our experiments. Initially, we verify that our model generation algorithm is correct. Then, we provide speedup values for the parallel implementation. After these technical issues, we assess the magnitude of topological differences between trees inferred under GTR+Γ and trees inferred under the best-fit model according to the respective information criterion. Finally, we quantify differences between AICc, BIC, and AIC and also assess if the sample size definition has an impact on model selection.

Verification of Model Generation Algorithms

We verified the correctness of the two model generation algorithms by systematically comparing their respective outputs to each other. Fortunately, they yield exactly identical sets of model strings. Note that the two algorithms were developed and implemented completely independently. In addition, we also verified that our set of model strings is identical to that provided in [5].
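The criterion variants used above can be written down compactly. The sketch below (illustrative numbers, not the authors' code) computes AIC, AICc, and BIC from a maximized log-likelihood, the number of free parameters, and a sample size, and shows how the two sample-size definitions (#sites vs. #sites × #taxa) can flip which model is selected:

```python
import math

# AIC, AICc, and BIC as functions of the maximized log-likelihood lnL,
# the number of free parameters k, and the sample size n (lower is better).
def aic(lnl, k):
    return -2.0 * lnl + 2.0 * k

def aicc(lnl, k, n):
    return aic(lnl, k) + (2.0 * k * (k + 1)) / (n - k - 1)

def bic(lnl, k, n):
    return -2.0 * lnl + k * math.log(n)

# Two hypothetical candidate models (lnL, k): a simple one and a richer one.
models = {"simple": (-10000.0, 5), "rich": (-9978.0, 10)}
sites, taxa = 1000, 50

def best(criterion, n):
    """Name of the model minimizing the criterion for sample size n."""
    return min(models, key=lambda name: criterion(models[name][0], models[name][1], n))

# BIC-S (n = #sites) selects the richer model, while BIC-M (n = #sites * #taxa)
# penalizes parameters more heavily and selects the simpler one.
print(best(bic, sites), best(bic, sites * taxa))  # rich simple
```

With these made-up numbers, the larger sample size of the "-M" definition increases the BIC penalty term k·ln(n) enough to reverse the ranking, which is precisely the effect the experiments below probe.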

Table 1 Summary of empirical test datasets

#taxa  #sites  source
46     1104    Original paper
64     1620    Original paper
13     1255    Original paper
106    378     Original paper
68     818     Original paper
15     379     Original paper
13     273     Original paper
23     2841    Original paper
17     379     Original paper
30     1456    Original paper
23     9741    Original paper
18     1689    Original paper
17     432     Original paper
31     1140    Original paper
12     894     Original paper
27     1949    Lakner et al.
29     2520    Lakner et al.
36     1812    Lakner et al.
41     1137    Lakner et al.
43     1660    Lakner et al.
50     1133    Lakner et al.
59     1824    Lakner et al.
64     1008    Lakner et al.
71     1082    Lakner et al.
50     378     Lakner et al.
67     955     Lakner et al.
4      16119   MrBayes
12     898     MrBayes
9      720     MrBayes
123    1606    MrBayes
150    1269    RAxML
218    2294    RAxML
354    460     RAxML
404    13158   RAxML
500    1398    RAxML
10     705     RAxML
101    1858    RAxML
714    1241    RAxML

Figure 2 Speedup of model evaluation Comparison of the theoretical and practical speedup of our application (execution time over 1 to 48 MPI processes).

Speedups
To assess the parallel efficiency of our code, we measured the speedups for an increasing number of MPI processes. We only measured speedups for the model evaluation phase and excluded the tree search phase, which is handled by a stand-alone PLL function. The obtained execution time over number of cores plots are provided in Figure 2 and compared to a linear speedup line. Based on our results, running the code with 6, 12, or 24 cores shows a higher parallel efficiency than with 48 cores. Nonetheless, with respect to time-to-completion, executing the code with 203 MPI processes for the 203 models will yield the fastest response.

Topological Analysis (RF Distances)
Initially, we asked the question if model selection has an impact on the final tree topology with respect to trees inferred under GTR+Γ. To this end, we calculated the relative Robinson-Foulds (rRF [17]) distances (using RAxML) between the best-scoring ML tree under GTR+Γ and the respective best-scoring trees under the 5 criteria. In Figures 3, 4, 5, 6, and 7 we show histograms of the distribution of rRF distances between the tree inferred under GTR+Γ and the trees inferred under AIC, AICc-S, AICc-M, BIC-S, and BIC-M, respectively. Note the log scale on the y-axis (fraction of test datasets) of the histograms.
While for the vast majority of datasets the rRF distance between the tree inferred under GTR+Γ and the tree inferred under the respective optimal model is zero, there is a substantial number of outliers. For more than 5% of the tree inferences the rRF distance is larger than 10%, which represents a notable topological difference.

Figure 3 GTR+Γ model - AIC Histogram of occurrences of relative RF distances between the GTR+Γ model and the best model according to AIC.

Differences between AICc, BIC, AIC
In the following, we quantify the differences in tree topologies and also in inferred models between the different criteria (AIC, AICc, BIC). To simplify the presentation, we only use AICc-S and BIC-S, since the number of sites is commonly used as sample size.
For comparing the inferred models, we used the model that yielded the best-scoring ML tree under

Figure 4 GTR+Γ model - AICc-S Histogram of occurrences of relative RF distances between the GTR+Γ model and the best model according to AICc-S.

Figure 5 GTR+Γ model - AICc-M Histogram of occurrences of relative RF distances between the GTR+Γ model and the best model according to AICc-M.

Figure 6 GTR+Γ model - BIC-S Histogram of occurrences of relative RF distances between the GTR+Γ model and the best model according to BIC-S.

Figure 7 GTR+Γ model - BIC-M Histogram of occurrences of relative RF distances between the GTR+Γ model and the best model according to BIC-M.

AIC, AICc-S, and BIC-S, respectively. We found that AIC and AICc-S propose the same model for 1181 out of 1368 dataset configurations (38 datasets times 36 independent runs per dataset). Further, AIC and BIC-S select the same model in 683 out of 1368 cases. Finally, AICc-S and BIC-S suggest the same model for 731 out of 1368 dataset configurations.
For the cases where the models proposed by the criterion pairs (i.e., AIC/AICc-S, AIC/BIC-S, AICc-S/BIC-S) are different, we then also quantified the differences among the tree topologies. For this purpose we provide histograms (Figures 8, 9, and 10) of the pairwise RF distances between the best-scoring trees for the three aforementioned criterion pairs (note the log scale on the y-axis). We observe that, in cases where the proposed models differ, this difference can induce notable rRF distances among the inferred trees.

Impact of Sample Size Definition
In these analyses, we determined the impact of the sample size definition on the models selected by the AICc and BIC criteria. When comparing the model inferred for the respective best-scoring ML tree for AICc-S versus AICc-M, we found that they yielded different models for 187 out of 1368 datasets. Analogously, for BIC-S versus BIC-M, different models were proposed in 261 out of 1368 cases.
As before, we then calculated the rRF distances between trees inferred under models proposed by AICc-S/AICc-M and BIC-S/BIC-M, but again only for those datasets where the models differed. The respective rRF distance histograms are provided in Figures 11 and 12 (note the log scale on the y-axis). As in previous experiments, there are a few datasets where the definition of the sample size yields trees that exhibit large topological distances exceeding 10%.

Selected Models
Another noteworthy observation is that, regardless of the criterion deployed, the subset of models that was finally selected (only 37 out of 203 possible models) is comparatively small. The type and frequency of the models selected by each criterion is depicted in Figure 13.
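As an aside, the 203 candidate models correspond exactly to the set partitions of the six substitution rates (AC, AG, AT, CG, CT, GT); their count is the Bell number B(6) = 203. The model-string notation used as labels in Figure 13 can be enumerated as restricted growth strings, as in the following illustrative Python sketch (not the tool's actual C implementation):

```python
# Sketch (not the paper's code): enumerate the 203 time-reversible
# nucleotide models as restricted growth strings over the six
# exchangeability rates (AC, AG, AT, CG, CT, GT). "123456" is GTR
# (all rates free); "111111" ties all rates together (JC/F81-like).

def model_strings(n_rates=6):
    strings = []
    def grow(prefix, max_used):
        if len(prefix) == n_rates:
            strings.append("".join(str(d) for d in prefix))
            return
        # the next digit may reuse any rate class seen so far,
        # or open exactly one new class (max_used + 1)
        for d in range(1, max_used + 2):
            grow(prefix + [d], max(max_used, d))
    grow([1], 1)  # the first rate always defines class 1
    return strings

models = model_strings()
print(len(models))             # -> 203 (Bell number B(6))
print(models[0], models[-1])   # -> 111111 123456
```

Restricted growth strings avoid counting relabeled partitions twice (e.g., "212" never appears because its rate classes are the same as "121").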

Figure 8 AIC - AICc-S Histogram of occurrences of relative RF distances between the best model according to AIC and the best model according to AICc-S.

Figure 9 AIC - BIC-S Histogram of occurrences of relative RF distances between the best model according to AIC and the best model according to BIC-S.

Figure 10 AICc-S - BIC-S Histogram of occurrences of relative RF distances between the best model according to AICc-S and the best model according to BIC-S.

Figure 11 AICc-S - AICc-M Histogram of occurrences of relative RF distances between the best model according to AICc-S and the best model according to AICc-M.
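The relative RF distance plotted in these histograms was computed with RAxML; purely to illustrate the metric itself, the following self-contained Python sketch computes it on trees pre-encoded as split sets (a simplified, hypothetical encoding; real implementations must canonicalize the splits):

```python
# Illustration of the relative Robinson-Foulds (rRF) distance [17],
# not the RAxML implementation used in the study. Each unrooted binary
# tree is represented by the set of its non-trivial bipartitions
# (splits); each split is a frozenset holding one side of the
# bipartition, canonicalized here as the side containing taxon "A".

def relative_rf(splits_a, splits_b, n_taxa):
    # RF distance = number of splits present in exactly one tree;
    # for fully resolved trees the maximum is 2 * (n_taxa - 3).
    return len(splits_a ^ splits_b) / (2 * (n_taxa - 3))

# Five-taxon example: the trees share the split AB|CDE but disagree
# on their second internal split (ABC|DE versus ABD|CE).
t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AB"), frozenset("ABD")}
print(relative_rf(t1, t2, 5))  # -> 0.5 (two of four comparable splits differ)
```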

This observation could be used for accelerating model selection by only using this subset of 37 promising candidate models.

Significance of Results
To assess the significance of our results, we conducted 50 distinct ML searches under the respective best-fit model and 50 ML searches under the GTR+Γ model for each dataset. We then calculated the pairwise rRF distances between all pairs of trees from the first (best-fit model) and second (GTR+Γ) tree set. Thereafter, we compared the average rRF distance among these tree pairs with the random expectation for each empirical dataset. The random expectation is defined as the average rRF distance among all pairs of the 50 trees obtained from the searches under GTR+Γ. We observed significantly higher average rRF distances than the random expectation between the trees inferred under the best-fit model and GTR+Γ in 14 out of 41 datasets (p < 0.01). Inversely, for 16 out of 41 datasets, we observed differences that were significantly lower than the random expectation (p < 0.01). For the remaining 11 datasets, the inferred ML topology was identical, regardless of the model and the tree search replicate.

Model Selection Accuracy under Simulation
In order to further evaluate the accuracy of our model selection process, we created 100 synthetic alignments based on the two empirical datasets (the alignment with 27 taxa and 1949 sites from [16], and the lice dataset from [5]) that exhibit the highest topological variance. To simulate, we initially determined the best-fit model, the best-known ML tree, and the optimized model parameters for the two datasets. We then deployed INDELible [18] to generate the simulated alignments.
Table 2 summarizes the results of our simulations. For single tree searches under the respective best-fit model according to our model selection criteria and under GTR+Γ, we quantify accuracy by showing average values (over 100 simulation replicates) for (i) recovery rate of the true topology, (ii) RF distance to

the true topology, (iii) branch score (BS) difference to the true tree, and (iv) recovery rate of the true, generating model.
The perfect topology was recovered slightly more frequently with selection criteria using the sequence length times the number of taxa as sample size (0.10 vs. 0.09). Inferences conducted under models determined by BIC show lower BS distances than the other criteria or GTR+Γ (0.054−0.055 vs. 0.057−0.059). However, these differences are not significant given the number of simulation replicates. We also observed that, when using only the sequence length as sample size, the true generating model was recovered more frequently (0.51 vs. 0.34−0.40).
For the second dataset (lice), we did not observe notable differences in any of our topological accuracy measures, neither under the models selected by our criteria nor under GTR+Γ. All selected best-fit models as well as the GTR+Γ model recovered the true topology with the same frequency (0.35). However, AIC performed much better than BIC in recovering the true generating model (0.48−0.49 vs. 0.02−0.14). For this dataset, though, the recovery of the true model does not affect topological accuracy.

Figure 12 BIC-S - BIC-M Histogram of occurrences of relative RF distances between the best model according to BIC-S and the best model according to BIC-M.

Table 2 Accuracy of model selection for 100 simulations based on two 'difficult' empirical datasets. TTR = fraction of recovery of the true topology, RF = average relative Robinson-Foulds distance to the true tree over 100 simulation replicates, BS = average branch score difference to the true tree over 100 simulation replicates, TMR = fraction of true, generating model recovery.

                     TTR   RF     BS     TMR
Dataset 1  AIC       0.10  0.176  0.057  0.34
           AICc-S    0.09  0.179  0.057  0.40
           AICc-M    0.10  0.176  0.057  0.34
           BIC-S     0.09  0.176  0.055  0.51
           BIC-M     0.10  0.174  0.054  0.40
           GTR+Γ     0.09  0.173  0.059  0.00
Dataset 2  AIC       0.35  0.036  0.001  0.49
           AICc-S    0.35  0.036  0.001  0.48
           AICc-M    0.35  0.036  0.001  0.49
           BIC-S     0.35  0.036  0.001  0.14
           BIC-M     0.35  0.036  0.001  0.02
           GTR+Γ     0.35  0.037  0.001  0.00

Figure 13 Actual subset of models selected by the distinct information criteria The histogram shows the number and type of models selected by each information criterion (AIC, AICc-M, AICc-S, BIC-M, BIC-S) for all datasets. Note that the vast majority of possible time-reversible models was never selected.

Discussion & Conclusions
We implemented, parallelized, and make available an efficient tool for determining the best nucleotide substitution model among all 203 possible models using the AIC, AICc, and BIC criteria. Apart from achieving our teaching goals, and beyond the computational speed of the tool, we addressed the question if nucleotide model selection matters topologically. To the best of our knowledge, this question has not been addressed before using such a large collection of empirical datasets and taking into account all possible models. Furthermore, we analyzed the topological differences between trees inferred under models proposed by the three standard criteria and assessed if these differences are significant. Finally, we also assessed to which extent the definition of the sample size in the criteria has an effect on the selected models and inferred tree topologies.
Overall, we find that everything matters topologically.
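For reference, the three information criteria and the two sample-size definitions compared throughout (-S: number of sites; -M: number of sites times number of taxa) can be summarized in a few lines. The following Python sketch uses the standard textbook formulas with made-up numbers; it is an illustration, not the tool's C implementation:

```python
import math

# Sketch of the information criteria used in the study. lnL is the
# maximized log-likelihood, k the number of free model parameters,
# and n the sample size. All numeric values below are hypothetical.

def aic(lnL, k):
    return 2.0 * k - 2.0 * lnL

def aicc(lnL, k, n):
    # small-sample correction on top of AIC
    return aic(lnL, k) + (2.0 * k * (k + 1)) / (n - k - 1)

def bic(lnL, k, n):
    return k * math.log(n) - 2.0 * lnL

lnL, k, sites, taxa = -5000.0, 10, 1000, 50  # hypothetical values
print(aic(lnL, k))                  # AIC: no sample-size term -> 10020.0
print(aicc(lnL, k, sites))          # AICc-S (n = #sites)
print(aicc(lnL, k, sites * taxa))   # AICc-M (n = #sites * #taxa)
print(bic(lnL, k, sites))           # BIC-S
print(bic(lnL, k, sites * taxa))    # BIC-M
```

Since the penalty terms differ between the criteria and grow with n under BIC, the same set of per-model likelihoods can yield different best-fit models depending on the criterion and on how n is defined, which is exactly the effect quantified above.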
In particular, our experiments suggest that selecting the best-scoring out of the 203 nucleotide substitution models changes the final ML tree topology (compared to an inference under GTR+Γ) by over 10% for 5% of the tree inferences we conducted. Thus, clearly, such a model test should be executed prior to starting tree inferences.
With respect to the observed topological differences induced by using distinct information criteria and sample size definitions, the effect is less pronounced since it affects a smaller fraction of datasets. Nonetheless, in some cases we did obtain substantial topological differences. However, whether one should use AIC, AICc,

and BIC, and how one should define the sample size, is still subject to on-going statistical debates. Thus, we are not in a position to make a clear statement here. What we can state, though, is that the statistical discussions are justified based on the differences we observed.

Code & Data Availability
The code, test datasets, and results are available under GNU GPL at https://github.com/team-pltb.

Competing interests
The authors declare that they have no competing interests.

Author's contributions
M.H., S.O., and B.R. (the students) implemented the tool and conducted the experiments. A.S. taught the course and designed as well as supervised the programming task. D.D. provided support, including bug fixes, for the PLL, and also helped with the information criteria. All authors contributed to drafting and writing the manuscript.

Acknowledgments
We wish to thank John Huelsenbeck for providing the test datasets from the original paper, Andre J. Aberer for providing additional test datasets, and Lucas Czech for implementing an initial prototype version for the task to verify if it was 'doable'.

Author details
1 Karlsruhe Institute of Technology, Department of Informatics, Kaiserstraße 12, 76131 Karlsruhe, Germany. 2 The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany.

References
1. Felsenstein, J.: Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6), 368–376 (1981)
2. Jukes, T.H., Cantor, C.R.: Evolution of protein molecules. Mammalian Protein Metabolism 3, 21–132 (1969)
3. Tavaré, S.: Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17, 57–86 (1986)
4. Hasegawa, M., Kishino, H., Yano, T.: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22(2), 160–174 (1985)
5. Huelsenbeck, J.P., Larget, B., Alfaro, M.E.: Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Molecular Biology and Evolution 21(6), 1123–1133 (2004). doi:10.1093/molbev/msh123
6. Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D.L., Darling, A., Höhna, S., Larget, B., Liu, L., Suchard, M.A., Huelsenbeck, J.P.: MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic Biology 61(3), 539–542 (2012)
7. Darriba, D., Taboada, G.L., Doallo, R., Posada, D.: jModelTest 2: more models, new heuristics and parallel computing. Nature Methods 9(8), 772 (2012)
8. Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W., Gascuel, O.: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology 59(3), 307–321 (2010)
9. Stamatakis, A.: RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9), 1312–1313 (2014)
10. Flouri, T., Izquierdo-Carrasco, F., Darriba, D., Aberer, A., Nguyen, L.-T., Minh, B., von Haeseler, A., Stamatakis, A.: The phylogenetic likelihood library. Systematic Biology (2014)
11. Grimm, G.W., Renner, S.S., Stamatakis, A., Hemleben, V.: A nuclear ribosomal DNA phylogeny of Acer inferred with maximum likelihood, splits graphs, and motif analysis of 606 sequences. Evolutionary Bioinformatics Online 2, 7 (2006)
12. Yang, Z.: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39(3), 306–314 (1994)
13. Ripplinger, J., Sullivan, J.: Does choice in model selection affect maximum likelihood analysis? Systematic Biology 57(1), 76–85 (2008)
14. Huelsenbeck, J.P., Crandall, K.A.: Phylogeny estimation and hypothesis testing using maximum likelihood. Annual Review of Ecology and Systematics 28, 437–466 (1997)
15. Posada, D., Buckley, T.R.: Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53(5), 793–808 (2004)
16. Lakner, C., van der Mark, P., Huelsenbeck, J.P., Larget, B., Ronquist, F.: Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systematic Biology 57(1), 86–103 (2008). doi:10.1080/10635150801886156
17. Robinson, D., Foulds, L.R.: Comparison of phylogenetic trees. Mathematical Biosciences 53(1), 131–147 (1981)
18. Fletcher, W., Yang, Z.: INDELible: a flexible simulator of biological sequence evolution. Molecular Biology and Evolution 26(8), 1879–1888 (2009). doi:10.1093/molbev/msp098