An Integrated Pathway for Building Regional Phylogenies For
‘An integrated pathway for building regional phylogenies for ecological studies’: a user friendly tutorial describing the regPhylo functions and the general work-flow

David Eme & Libby Liggins
2 July 2019

Contents

Aims
Data accessibility to reproduce the example.
Preamble: installing the regPhylo R package and external software.
1) Prepare the species list
2) Taxonomic checks using NCBI taxonomic database
3) Extract the DNA sequences and associated metadata from different sources and assemble the data
   A) Extraction from Genbank, via the NCBI platform
   B) Extraction from BOLD
   C) Add data from another source
   D) Assemble the data into a single table
4) Improve the spatial metadata associated with the DNA sequences in three steps
   A) Homogenise the geographic coordinates
   B) Retrieve geographic coordinates using GeOMe database
   C) Infer the geographic coordinates from the name of the sampling location
5) Build the species-by-gene matrix and remove undesirable sequences
6) Selection of the gene regions of interest
   A) Minimum number of gene regions maximizing the species coverage
   B) Degree of species overlap between gene regions
   C) Amount of missing data in the species-by-gene matrix for the selected gene regions
   D) Export all DNA sequences and metadata for the selected gene regions
7) Export the best sequence per species and gene region, based on sequence length and/or geographic criteria
   A) Selection based on sequence length and geographic criteria
   B) Selection based on sequence length only
8) Multiple alignments
9) Trim poorly aligned positions and/or gappy positions
10) Concatenate the trimmed alignments into a single supermatrix
11) Select the best partitioning scheme and substitution model using PARTITIONFINDER2
12) Define soft topological constraints
13) Prepare the baseline .xml file for BEAST2 using BEAUTi
14) Include hard topological constraints in the baseline .xml file for BEAST2
   A) Define hard topological constraints based on bootstrap support of a RAxML tree guided by the constrained tree
   B) Edit the baseline .xml file to include hard topological constraints in BEAST2
15) Date the phylogenetic tree in absolute time using the CLADEAGE approach in BEAST2
Advanced topic 1: detecting problematic sequences that should be removed from the pool of sequences
   A) Build the species-by-gene matrix naively.
   B) Select DNA sequences and metadata and export the selected gene regions in fasta format
   C) Detecting outlier sequences in the pool of selected gene regions
      1) First alignment of all sequences and detection of sequences that should be reverse complemented
      2) Detection of potential outlier sequences in a gene region alignment
Advanced topic 2: including species without DNA and resolving polytomies using BEAST2
   A) Export a supermatrix including the taxa without DNA to prepare the baseline xml file in BEAUTi
   B) Edit the baseline xml file including hard topological constraints plus the new constraints for the taxa without DNA
References

Aims

This tutorial helps the user to build a Bayesian posterior distribution of time-calibrated molecular trees, for an example community of 30 species, using tools in the R environment. Users are guided through the steps of tree building based on a supermatrix approach using functions of the regPhylo R package (Figure 1). The output of this tutorial is an xml file ready to run in the Bayesian tree building software BEAST2 (Bouckaert et al.
2014). For more information about regPhylo and a case study based on a larger dataset, please refer to our original publication (“An integrated pathway for building regional phylogenies for ecological studies”, Eme et al. 2019) and the related case study tutorial (Appendix 2 in the supporting information).

Data accessibility to reproduce the example.

The species list, all final and intermediate tables, alignments and files used in this tutorial can be downloaded from Dryad as a zip file called “Tuto_regPhylo.zip”.

Warning: The tutorial assumes that all the files and folders from the “Tuto_regPhylo.zip” file are extracted to one folder called “Tuto_regPhylo” in your working directory.

Figure 1: General work-flow when using the regPhylo R package to construct a posterior distribution of time-calibrated multi-gene phylogenies. The names of the regPhylo functions are in red italicised font. Superscript numbers indicate the relevant step in this tutorial (“AdvTopic” refers to the advanced topic section), and the large numbers (bottom right) refer to the paragraph of the “Package description and methods” section of the paper where the method is described (Eme et al. 2019).

Preamble: installing the regPhylo R package and external software.

regPhylo requires the following packages to be installed and loaded in the R environment: bold, seqinr, ape, geomedb, RJSONIO, stringr, fields, parallel, caper, phytools

    install.packages(c("bold", "seqinr", "ape", "RJSONIO", "stringr", "fields",
                       "parallel", "caper", "phytools"))
    library(bold)
    library(seqinr)
    library(ape)
    library(RJSONIO)
    library(stringr)
    library(fields)
    library(parallel)
    library(caper)
    library(phytools)

    # The "geomedb" package requires the latest version available on GitHub;
    # to download it, do the following:
    install.packages("devtools")
    library(devtools)
    install_github("biocodellc/fimsR-access")
    library(geomedb)

Note about the accessibility of the regPhylo R package during the review process

The regPhylo R package is available on GitHub at https://github.com/dvdeme/regPhylo

To install the regPhylo R package from GitHub do the following:

    install.packages("devtools")
    library(devtools)
    # Install the package from GitHub
    install_github("dvdeme/regPhylo")

Load the regPhylo package.

    library(regPhylo)

To see the list and a short description of all the functions and data available in Rdata format:

    help(package=regPhylo)

To access the full functionality of the regPhylo functions, the following external software must be installed.

• BLAST+ (required for Detect.Outlier.Seq): All information required to download and install this software is available at the following link: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download. Specific installation instructions for the different operating systems (OS) can be found at https://www.ncbi.nlm.nih.gov/books/NBK279671/.
• Gblocks (required for Filtering.align.Gblocks): Downloads for the different OS are available at http://molevol.cmima.csic.es/castresana/Gblocks.html, and online documentation, including instructions for installing the program for the different OS, is available at http://molevol.cmima.csic.es/castresana/Gblocks/Gblocks_documentation.html.
• trimAl (required for Filtering.align.Trimal): Online documentation and the download page are available at http://trimal.cgenomics.org/downloads. regPhylo has been trialled with trimAl v1.2 (Official release).
• PartitionFinder2 (required for PartiFinder2): Online documentation and the download page are available at http://www.robertlanfear.com/partitionfinder/. Installation instructions are provided in the PartitionFinder2 manual available at http://www.robertlanfear.com/partitionfinder/assets/Manual_v2.1.x.pdf. Download the source code from the GitHub page available at https://github.com/brettc/partitionfinder/releases/tag/v2.1.1. Then copy and paste the archive into the desired folder, and extract (decompress) all the files.

WARNING: PartitionFinder requires Python 2.7.x or higher (but not 3.x!) and dependencies, including specific Python libraries, in order to run. All instructions are provided in the manual, but we provide a simple procedure below that avoids many hurdles:

• Download Python 2.7.15 from https://www.python.org/downloads/release/python-2715/ and select the file appropriate for your OS. Install Python 2.7.15 by double clicking on the installer and following the instructions (for Windows users, during the installation “Customize Python 2.7.15” allows the option “Add python.exe to Path”, but be careful with this option if other versions of Python are already installed and set up in the path).
• Then, open the terminal (in Windows, cmd.exe) and use the next set of commands to install the numpy, pandas, tables, pyparsing, scipy and sklearn Python libraries.

    python -m pip install numpy
    python -m pip install pandas
    python -m pip install tables
    python -m pip install pyparsing
    python -m pip install scipy
    python -m pip install sklearn

• To check if PartitionFinder2 was successfully installed, open the terminal, navigate to the folder storing the file PartitionFinder.py (you may need to change directories), and run the following command in order to see the help file. You should see several “help” options appear in your terminal.

    PartitionFinder.py --help

• Mumsa (required for Mumsa.Comp, not available for Windows OS): Download from http://msa.cgb.ki.se/cgi-bin/msa.cgi, paste the archive into the desired folder, extract the archive and then compile the program using the following commands:

    # move into the newly decompressed "mumsa-1.0" folder.
    cd mumsa-1.0
    # compile the program
    make

• PASTA (required for First.Align.All and Multi.Align, not available for Windows OS): Downloads and instructions for installation are provided on the GitHub page available at: https://github.com/smirarab/pasta
• Muscle (required for Multi.Align): Downloads for the different OS are available at https://www.drive5.com/muscle/downloads.htm, and instructions at https://www.drive5.com/muscle/manual/.
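
Once the R packages and external programs listed above are installed, it can be useful to confirm from within R what is actually visible to the session. The following is a minimal sketch, not part of regPhylo: the helper names check_r_packages and check_executables are hypothetical, and the executable names passed to check_executables are assumptions that may differ with your OS, software version, or installation path. It relies only on the base R functions requireNamespace and Sys.which.

```r
# Hypothetical helper (not a regPhylo function): report which R packages
# can be found, without attaching them to the search path.
check_r_packages <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# Hypothetical helper (not a regPhylo function): report which external
# programs are visible on the system PATH of the current R session.
check_executables <- function(exes) {
  paths <- Sys.which(exes)   # "" for programs that cannot be found
  found <- nzchar(paths)
  names(found) <- exes
  found
}

# R packages named in this preamble.
r_pkgs <- c("bold", "seqinr", "ape", "geomedb", "RJSONIO", "stringr",
            "fields", "parallel", "caper", "phytools", "regPhylo")
print(check_r_packages(r_pkgs))

# ASSUMED executable names; adjust them to match your own installation.
tools <- c("blastn", "Gblocks", "trimal", "python", "mumsa",
           "run_pasta.py", "muscle")
print(check_executables(tools))
```

A FALSE entry only means that the package or program was not found under that name in the current session; a program installed outside the PATH (for example PartitionFinder2, which is run from its own folder) will also be reported as FALSE.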