Impact of the Partitioning Scheme on Divergence Times Inferred from Mammalian Genomic Data Sets
Total Page:16
File Type:pdf, Size:1020Kb
Evolutionary Bioinformatics OPEN ACCESS Full open access to this and thousands of other papers at ORIGINal ReseaRCH http://www.la-press.com. Impact of the Partitioning Scheme on Divergence Times Inferred from Mammalian Genomic Data Sets Carolina M. Voloch and Carlos G. Schrago Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil. Corresponding author email: [email protected] Abstract: Data partitioning has long been regarded as an important parameter for phylogenetic inference. The division of heterogeneous multigene data sets into partitions with similar substitution patterns is known to increase the performance of probabilistic phylogenetic methods. However, the effect of the partitioning scheme on divergence time estimates has generally been ignored. To investigate the impact of data partitioning on the estimation of divergence times, we have constructed two genomic data sets. The first one with 15 nuclear genes comprising 50,928 bp were selected from the OrthoMam database; the second set was composed of complete mitochondrial genomes. We studied two partitioning schemes: concatenated supermatrices and partitioned gene analysis. We have also measured the impact of taxonomic sampling on the estimates. After drawing divergence time inferences using the uncorrelated relaxed clock in BEAST, we have compared the age estimates between the partitioning schemes. Our results show that, in general, both schemes resulted in similar chronological estimates, however the concatenated data sets were more efficient than the partitioned ones in attaining suitable effective sample sizes. Keywords: relaxed molecular clock, data partitioning, timescale, molecular dating Evolutionary Bioinformatics 2012:8 207–218 doi: 10.4137/EBO.S9627 This article is available from http://www.la-press.com. © the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited. Evolutionary Bioinformatics 2012:8 207 Voloch and Schrago Introduction assembled (http://www.ensembl.org). In addition, Divergence time estimation has been revolutionized the rich fossil record of mammals provides several by the application of models that relax the assump- sources of calibration information that can be used in tion of the strict molecular clock by decomposing molecular dating analyses.18 branch lengths into absolute times and evolution- In this paper, we compared the impact on diver- ary rates.1,2 Such an approach has been successfully gence time estimation of concatenated and parti- implemented in a Bayesian framework in which tioned schemes using nuclear and mitochondrial data complex models of evolutionary rate evolution may sets of mammals with the relaxed molecular clock. be feasibly used because the marginal distributions Our analyses showed that, although both of the parti- of divergence times are obtained via a Markov chain tioning schemes yielded similar divergence time esti- Monte Carlo method.3 Although this approach is mates, the concatenated data were more efficient than widely used, several aspects of molecular dating by the partitioned scheme because the former allowed such relaxed clock methods require detailed scrutiny. effective sample sizes to increase more rapidly. This For example, it is still unclear which modeling strat- finding indicates that the concatenated data yielded egy for describing the change in rates along lineages parametric estimates with better mixing of the Markov is more appropriate for biological data.4,5 Another chain. The effectiveness of concatenated data is prob- unsolved issue is the possible impact of taxonomic ably due to the smaller number of model parameters, sampling on the estimates of evolutionary rates.6–8 which facilitates exploration of the parametric space In addition to the modeling of evolutionary rates by the MCMC sampler. and taxonomic sampling, the effect of the data- partitioning scheme on the estimation of divergence times has attracted relatively little attention. Curiously, Materials and Methods this issue has been addressed frequently over the past Sequences, alignments decade in the context of topological estimation only, and tree topologies mainly because of the increased availability of mul- We have constructed two phylogenomic data tigene data sets.9–12 Researchers may study multigene sets to investigate the impact of data partitioning data sets as a single concatenated supermatrix or may on mammalian divergence times. In each data set a predefined number of partitions. Evidently, the set, chronological inferences were obtained by estimation of the optimal number of partitions to be concatenating all of the genes in a single supermatrix used in phylogenetic analysis is a subject of theoreti- or by allowing the partitions to have independent cal interest.13,14 Although the effect of data partition- evolutionary parameters. To evaluate the behavior ing on the estimation of divergence time has rarely of the chronological estimates with increasing been investigated, the few existing studies show that taxonomic sampling, we have studied three species divergence times are influenced by the partitioning compositions with increasing numbers of terminals scheme used in multigene data sets.11 in each data set (Fig. 1). In this sense, evaluations of the effects of data par- The first data set was composed of 15 nuclear genes titioning on divergence time inference are needed. that were selected from the OrthoMam database.19 Ideally, such analyses must be conducted via simu- Genes were selected according to their relative rates lation or by the analysis of biological data in which (varying from 0.25 to 1.25), calculated following there exists considerable empirical evidence of the the approach of Criscuolo et al.20 The 15 genes were values of the parametric estimates. The advantage sampled in order to obtain an alignment with approxi- of the latter approach is that the complexity of the mately 50,000 nucleotides (Table 1), which is close to evolutionary process is captured. Mammalian times- the total sequence length used in recent mammalian cales have been intensively investigated,15–17 and the phylogenomic studies.21,22 Only genes with the full availability of molecular data for the lineage is unri- species sampling were selected. The final concate- valled among vertebrates because of the hundreds of nated supermatrix contained 50,928 nucleotides, and mitochondrial genomes that have been sequenced the corresponding taxonomic compositions included and the more than 30 nuclear genomes that are being 18, 26 and 34 species (Fig. 1). 208 Evolutionary Bioinformatics 2012:8 Impact of the partitioning scheme on divergence time estimation A Ornithorhynchus BCOrnithorhynchus Ornithorhynchus Macropus Macropus Macropus Monodelphis Monodelphis Choloepus Monodelphis Choloepus Dasypus Dasypus Echinops Dasypus Loxodonta Echinops Procavia Echinops Loxodonta Sorex Procavia Erinaceus Loxodonta Vicugna Sorex Sus Sus Vicugna Tursiops Sus Bos Bos Equus Bos Canis Equus Equus Felis Canis Oryctolagus Canis Ochotona Felis Spermophilus Felis Oryctolagus Cavia Ochotona Dipodomys Oryctolagus Mus Dipodomys Rattus Mus Mus Tupaia Rattus Rattus Micorcebus Tarsius Otolemur Callithrix Tarsius Callithrix Callithrix Macaca Macaca Macaca Gorilla Pongo Pan Gorilla Pan Pan Homo Homo Homo DEOrnithorhynchus Ornithorhynchus F Ornithorhynchus Macropus Macropus Macropus Monodelphis Monodelphis Choloepus Monodelphis Choloepus Dasypus Dasypus Echinops Dasypus Loxodonta Echinops Procavia Echinops Loxodonta Sorex Procavia Erinaceus Loxodonta Lama Sorex Sus Sus Lama Lagenorhychus Sus Ovis Ovis Bos Bos Ovis Equus Bos Canis Equus Equus Felis Oryctolagus Canis Canis Ochotona Felis Sciurus Felis Oryctolagus Cavia Ochotona Jaculus Oryctolagus Mus Jaculus Rattus Mus Mus Tupaia Rattus Lemur Rattus Eulemur Tarsius Tarsius Cebus Cebus Cebus Cercopithecus Cercopithecus Cercopithecus Gorilla Pongo Pan Gorilla Pan Pan Homo Homo Homo Figure 1. Phylogenies used in this study. Topologies (A–C) refer to the taxonomic compositions 1 (A), 2 (B) and 3 (C) of the nuclear data set. Topologies (D–F) refer to the taxonomic compositions 1 (C), 2 (D) and 3 (F) of the mitochondrial data set. Phylogenies (C and F) were inferred in PhyML. Note: Black circles indicate nodes in which aLRT value was lower than 0.9, otherwise aLRT statistics = 1. The second data set was composed of mitochondrial classes containing first, second and third codon posi- genomes of the same lineages present in the nuclear tions, as opposed to the nuclear. This is the partition- data set. When the genome of the same species was ing scheme commonly applied to mitochondrial data not available, we used a close sister taxa. For instance, and is intended to reduce evolutionary rate variability Callithrix was used in the nuclear data set and Cebus within the partitions studied.23 Accession numbers of was used in the mitochondrial data set. The taxonomic the mitochondrial genomes of the species shown in compositions contained 19, 27 and 35 species respec- Figure 1D–F are available in Table 2. tively (Fig. 1). We used all 13 mitochondrial coding The genes were aligned individually in PRANK genes, resulting in a supermatrix of 11,763 nucleotides. under default parametric settings.24 The phylogenetic Mitochondrial genes were partitioned into three inference was performed with supermatrices of the nuclear and mitochondrial data sets under the taxo- Table 1. Orthologous