A Toolkit for Creating and Manipulating Supermatrices and Other Large 1

bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 SuperCRUNCH: A toolkit for creating and manipulating supermatrices and other large 2 phylogenetic datasets 3 4 Daniel M. Portik1*, John J. Wiens1 5 1. Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 6 7 *Correspondence: [email protected], [email protected] 8 9 Running Title: SuperCRUNCH for phylogenetic data 10 11 Keywords: bioinformatics, GenBank, genomics, multiple sequence alignment, orthology, 12 phylogenetics, phylogeography, supermatrix 13 14 1 bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 15 Abstract 16 1. Phylogenies with extensive taxon sampling have become indispensable for many types of 17 ecological and evolutionary studies. Many large-scale trees are based on a “supermatrix” 18 approach, which involves amalgamating thousands of published sequences for a group. 19 Constructing up-to-date supermatrices can be challenging, especially as new sequences may 20 become available within the group almost constantly. However, few tools exist for assembling 21 large-scale, high-quality supermatrices (and other large datasets) for phylogenetic analysis. 22 2. Here we present SuperCRUNCH, a Python toolkit for assembling large phylogenetic datasets 23 from GenBank/NCBI nucleotide data. SuperCRUNCH searches for specified sets of taxa and 24 loci to create species-level or population-level datasets. It offers many transparent options for 25 orthology detection, sequence selection, alignment, and file manipulation for generating large- 26 scale phylogenetic datasets. 27 3. We compared SuperCRUNCH to the most recent alternative approach for generating 28 supermatrices (PyPHLAWD) for two datasets. Given the same set of starting sequences, 29 SuperCRUNCH required more computational time but it retrieved more taxa and total sequences 30 and produced trees having greater congruence with previous studies. SuperCRUNCH can 31 assemble supermatrices for genomic datasets with thousands of loci, and can also generate 32 population-level datasets for phylogeographic analyses. We demonstrate clear advantages for 33 using data downloaded directly from GenBank/NCBI rather than using intermediate databases. 34 Furthermore, we show the effectiveness of initially identifying loci through label searching 35 followed by rigorous orthology detection, rather than relying on automated clustering of all 36 sequences. SuperCRUNCH is open-source, well-documented, and freely available at 2 bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 37 https://github.com/dportik/SuperCRUNCH, with several complete example analyses available at 38 https://osf.io/bpt94/. 39 4. SuperCRUNCH is a flexible method that can be used to assemble high quality phylogenetic 40 datasets for any taxonomic group and for any scale (kingdoms to individuals). It allows rapid 41 construction of supermatrices, greatly simplifying the process of updating large phylogenies with 42 new data. SuperCRUNCH streamlines the major tasks required to process sequence data, 43 including filtering, alignment, trimming, and formatting. 44 3 bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 45 1 | INTRODUCTION 46 Large-scale phylogenies, including hundreds or thousands of species, have become essential for 47 many studies in ecology and evolutionary biology. Many of these large-scale phylogenies are 48 based on the supermatrix approach (e.g., de Queiroz & Gatesy, 2007), which typically involves 49 amalgamating thousands of sequences from public databases (e.g., GenBank). Yet only a handful 50 of tools exist for automatically assembling these large phylogenetic datasets. These include 51 programs like PhyLoTA (Sanderson, Boss, Chen, Cranston, & Wehe 2008), PHLAWD (Smith, 52 Beaulieu, & Donoghue 2009), phyloGenerator (Pearse & Purvis, 2013), SUMAC (Freyman, 53 2015), SUPERSMART (Antonelli et al., 2017), PhylotaR (Bennett et al., 2018) and PyPHLAWD 54 (Smith & Walker, 2018). Each program has its own pros and cons for assembling molecular 55 datasets. However, many of them rely on particular GenBank database releases to retrieve 56 starting sequences (i.e., PhyLoTA, PyPHLAWD, SUMAC, SUPERSMART). GenBank releases 57 occur every two months, and unless continually updated in the workflow, all new sequence data 58 are automatically excluded from searches. Other workflows use the NCBI taxonomic database to 59 locate sequence data, which can conflict with clade-specific taxonomies and inadvertently 60 exclude relevant records. Many programs (e.g., PHLAWD, PyPHLAWD, PhyLoTA, PhylotaR, 61 SUPERSMART) also employ automated (blind) clustering of all sequences, which produces 62 orthologous sequence clusters that represent either individual or aggregate loci (such as longer 63 mtDNA sequences spanning multiple genes). Although blind clustering may be useful under 64 some circumstances, the ability to target specific loci instead may often be desirable. In addition, 65 the criteria for orthology detection, filtration, and sequence selection are not always clear in these 66 programs. Thus, producing high-quality phylogenetic datasets is presently challenging using 4 bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 67 many of the available methods. because of the rapid influx of new sequence data, development 68 of new loci, changing taxonomies, and lack of methodological transparency. 69 To address these challenges, we developed SuperCRUNCH, a semi-automated method 70 for extracting, filtering, and manipulating nucleotide data. SuperCRUNCH uses sequence data 71 downloaded directly from NCBI as a starting point, rather than retrieving sequences through an 72 intermediate step (i.e. via a GenBank database release and/or NCBI Taxonomy). Sequence sets 73 are created based on designated lists of taxa and loci, offering fine-control for targeted searches. 74 SuperCRUNCH also includes refined methods for orthology detection and sequence selection. 75 By offering the option to select representative sequences for taxa or retain all filtered sequences, 76 SuperCRUNCH can be used to generate interspecific supermatrix datasets (one sequence per 77 taxon per locus) or population-level datasets (multiple sequences per taxon per locus). It can also 78 be used to assemble phylogenomic datasets with thousands of loci. SuperCRUNCH is modular in 79 design, is intended to be transparent, objective and repeatable, and offers flexibility across all 80 major steps in constructing phylogenetic datasets. SuperCRUNCH is open-source and freely 81 available at https://github.com/dportik/SuperCRUNCH. 82 83 2 | WORKFLOW 84 SuperCRUNCH consists of a set of PYTHON (v2.7) modules that function as stand-alone 85 command-line scripts. These modules can be downloaded and executed independently without 86 the need to install SuperCRUNCH as a PYTHON package or library, making them easy to use and 87 edit. Nevertheless, there are eight dependencies that should be installed prior to use. These 88 include the BIOPYTHON package for PYTHON, and the following seven external dependencies: 89 NCBI-BLAST+ (for BLASTN and MAKEBLASTDB; Altschul, Gish, Miller, Myers, & Lipman 1990; 5 bioRxiv preprint doi: https://doi.org/10.1101/538728; this version posted February 2, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 90 Camacho et al., 2009), CD-HIT-EST (Li & Godzik, 2006), CLUSTAL-O (Sievers et al., 2011), MAFFT 91 (Katoh, Misawa, Kuma, & Miyata 2002; Katoh & Standley, 2013), MUSCLE (Edgar, 2004), 92 MACSE (Ranwez, Douzery, Cambon, Chantret, & Delsuc 2018), and TRIMAL (Capella-Gutiérrez, 93 Silla-Martínez, & Gabaldón 2009). Installation instructions for these programs, a comprehensive 94 user-guide, and detailed usage instructions for all modules can be found at 95 https://github.com/dportik/SuperCRUNCH. We provide several complete analysis examples on 96 the Open Science Framework SuperCRUNCH project page available at https://osf.io/bpt94. We 97 illustrate the

A Toolkit for Creating and Manipulating Supermatrices and Other Large 1

RESEARCH PUBLICATIONS by ZOO ATLANTA STAFF 1978–Present

Extreme Miniaturization of a New Amniote Vertebrate and Insights Into the Evolution of Genital Size in Chameleons

Herpetological Information Service No

Phylogenetic Relationships and Subgeneric Taxonomy of ToadHeaded Agamas Phrynocephalus (Reptilia, Squamata, Agamidae) As Determined by Mitochondrial DNA Sequencing E

Xenosaurus Tzacualtipantecus. the Zacualtipán Knob-Scaled Lizard Is Endemic to the Sierra Madre Oriental of Eastern Mexico

Gliding Dragons and Flying Squirrels: Diversifying Versus Stabilizing Selection on Morphology Following the Evolution of an Innovation

MADAGASCAR: the Wonders of the “8Th Continent” a Tropical Birding Custom Trip

Fossil Amphibians and Reptiles from the Neogene Locality of Maramena (Greece), the Most Diverse European Herpetofauna at the Miocene/Pliocene Transition Boundary

Mouse Exph5 Conditional Knockout Project (CRISPR/Cas9)

On the Andaman and Nicobar Islands, Bay of Bengal

Primate Specific Retrotransposons, Svas, in the Evolution of Networks That Alter Brain Function

Literature Cited in Lizards Natural History Database