Research Article

Maximum Likelihood Analyses of 3,490 rbcL Sequences: Scalability of Comprehensive Inference versus Group-Specific Taxon Sampling

Alexandros Stamatakis1,*, Markus Göker2,3, Guido Grimm4

1 The Exelixis Lab, Dept. of Computer Science, Technische Universität München, Germany
2 Organismic Botany, Eberhard-Karls-University, Tübingen, Germany
3 German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
4 Department of Palaeobotany, Swedish Museum of Natural History, Stockholm, Sweden

* Corresponding author: Technische Universität München, Fakultät für Informatik, I 12, Boltzmannstr. 3, 85748 Garching b. München, Tel: +49 89 28919434, Fax: +49 89 28919414, Email:
[email protected]

Keywords: Phylogenetic Inference, Maximum Likelihood, RAxML, large single-gene datasets, eudicots

Running head: Scalability of Maximum Likelihood Analyses

Abbreviations: GRTS: Group-based Randomized Taxon-Subsampling; TU: Taxonomic Unit.

Abstract

The constant accumulation of sequence data poses new computational and methodological challenges for phylogenetic inference, since multiple sequence alignments grow in both the horizontal dimension (number of base pairs, phylogenomic alignments) and the vertical dimension (number of taxa). Setting aside the ongoing controversy about appropriate models, partitioning schemes, and assembly methods for phylogenomic alignments, as well as the high computational cost of inferring phylogenies from them, for many organismic groups a sufficient number of taxa is often available for only one or a few genes (e.g., rbcL, matK, rDNA). In this paper we address the scalability of Maximum-Likelihood-based phylogeny reconstruction with respect to the number of taxa, using as an example several large nested single-gene rbcL alignments comprising 400 to 3,491 taxa. To thoroughly test the effect of taxon sampling, we deploy an appropriately adapted taxon jackknifing approach.