‘An integrated pathway for building regional phylogenies for ecological studies’: a user friendly tutorial describing the regPhylo functions and the general work-flow David Eme & Libby Liggins 2 July 2019

Contents

Aims

Data accessibility to reproduce the example.

Preamble: installing the regPhylo R package and external software.

1) Prepare the species list

2) Taxonomic checks using NCBI taxonomic database

3) Extract the DNA sequences and associated metadata from different sources and assemble the data
   A) Extraction from Genbank, via the NCBI platform
   B) Extraction from BOLD
   C) Add data from another source
   D) Assemble the data into a single table

4) Improve the spatial metadata associated with the DNA sequences in three steps
   A) Homogenise the geographic coordinates
   B) Retrieve geographic coordinates using GeOMe database
   C) Infer the geographic coordinates from the name of the sampling location

5) Build the species-by-gene matrix and remove undesirable sequences

6) Selection of the gene regions of interest
   A) Minimum number of gene regions maximizing the species coverage
   B) Degree of species overlap between gene regions
   C) Amount of missing data in the species-by-gene matrix for the selected gene regions
   D) Export all DNA sequences and metadata for the selected gene regions

7) Export the best sequence per species and gene region, based on sequence length and/or geographic criteria
   A) Selection based on sequence length and geographic criteria
   B) Selection based on sequence length only

8) Multiple alignments

9) Trim poorly aligned positions and/or gappy positions

10) Concatenate the trimmed alignments into a single supermatrix

11) Select the best partitioning scheme and substitution model using PARTITIONFINDER2

12) Define soft topological constraints

13) Prepare the baseline .xml file for BEAST2 using BEAUTi

14) Include hard topological constraints in the baseline .xml file for BEAST2
   A) Define hard topological constraints based on bootstrap support of a RAxML tree guided by the constrained tree
   B) Edit the baseline .xml file to include hard topological constraints in BEAST2

15) Date the phylogenetic tree in absolute time using the CLADEAGE approach in BEAST2

Advanced topic 1: detecting problematic sequences that should be removed from the pool of sequences
   A) Build the species-by-gene matrix naively
   B) Select DNA sequences and metadata and export the selected gene regions in fasta format
   C) Detecting outlier sequences in the pool of selected gene regions
      1) First alignment of all sequences and detection of sequences that should be reverse complemented
      2) Detection of potential outlier sequences in a gene region alignment

Advanced topic 2: including species without DNA and resolving polytomies using BEAST2
   A) Export a supermatrix including the taxa without DNA to prepare the baseline xml file in BEAUTi
   B) Edit the baseline xml file including hard topological constraints plus the new constraints for the taxa without DNA

References

Aims

This tutorial helps the user to build a Bayesian posterior distribution of time-calibrated molecular trees, for an example community of 30 species, using tools in the R environment. Users are guided through the steps of tree building based on a supermatrix approach using functions of the regPhylo R package (Figure 1). The output of this tutorial is an xml file ready to run in the Bayesian tree building software BEAST2 (Bouckaert et al. 2014). For more information about regPhylo and a case study based on a larger dataset, please refer to our original publication (“An integrated pathway for building regional phylogenies for ecological studies” Eme et al. 2019) and related case study tutorial (Appendix 2 in supporting information).

Data accessibility to reproduce the example.

The species list and all final and intermediate tables, alignments and files used in this tutorial can be downloaded from Dryad as a zip file called “Tuto_regPhylo.zip”. Warning: the tutorial assumes that all the files and folders from the “Tuto_regPhylo.zip” archive are extracted to a single folder called “Tuto_regPhylo” in your working directory.
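As a quick sanity check before starting, you can confirm from R that the archive was extracted where the rest of the tutorial expects it (a minimal sketch; the file checked below is the species list used in step 1).

# Should return TRUE if the "Tuto_regPhylo" folder sits in the working directory.
file.exists("Tuto_regPhylo/SpeciesList_Classification.csv")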

Figure 1: General work-flow when using the regPhylo R package to construct a posterior distribution of time-calibrated multi-gene phylogenies. The names of the regPhylo functions are in red italicised font. Superscript numbers indicate the relevant step in this tutorial (“AdvTopic” refers to the advanced topic section), and the large numbers (bottom right) refer to the paragraph of the “Package description and methods” section of the paper where the method is described (Eme et al. 2019).

Preamble: installing the regPhylo R package and external software.

regPhylo requires the following packages to be installed and loaded in the R environment: bold, seqinr, ape, geomedb, RJSONIO, stringr, fields, parallel, caper, phytools.

install.packages(c("bold", "seqinr", "ape", "RJSONIO", "stringr", "fields",
                   "parallel", "caper", "phytools"))
library(bold)
library(seqinr)
library(ape)
library(RJSONIO)
library(stringr)
library(fields)
library(parallel)
library(caper)
library(phytools)

# The "geomedb" requires the latest version available on Github, to download it, # do the following: install.packages("devtools") library(devtools) install_github("biocodellc/fimsR-access") library(geomedb)

Note on the accessibility of the regPhylo R package: regPhylo is available on GitHub at https://github.com/dvdeme/regPhylo. To install the regPhylo R package from GitHub do the following:

install.packages("devtools")
library(devtools)

# Install the package from GitHub.
install_github("dvdeme/regPhylo")

Load the regPhylo package.

library(regPhylo)

To see the list and a short description of all the functions and data (available in Rdata format):

help(package = regPhylo)

To access the full functionality of the regPhylo functions, the following external software must be installed.

• BLAST+ (required for Detect.Outlier.Seq): All information required to download and install this software is available at https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download. Specific installation instructions for the different operating systems (OS) can be found at https://www.ncbi.nlm.nih.gov/books/NBK279671/.
• Gblocks (required for Filtering.align.Gblocks): Downloads for the different OS are available at http://molevol.cmima.csic.es/castresana/Gblocks.html, and online documentation, including instructions for installing the program on the different OS, is available at http://molevol.cmima.csic.es/castresana/Gblocks/Gblocks_documentation.html.
• trimAl (required for Filtering.align.Trimal): Online documentation and the download page are available at http://trimal.cgenomics.org/downloads. regPhylo has been trialled with trimAl v1.2 (official release).
• PartitionFinder2 (required for PartiFinder2): Online documentation and the download page are available at http://www.robertlanfear.com/partitionfinder/. Installation instructions are provided in the PartitionFinder2 manual available at http://www.robertlanfear.com/partitionfinder/assets/Manual_v2.1.x.pdf. Download the source code from the GitHub release page at https://github.com/brettc/partitionfinder/releases/tag/v2.1.1, copy the archive into the desired folder, and extract (decompress) all the files.

WARNING: PartitionFinder2 requires Python 2.7.x (but not 3.x!) and several specific Python libraries in order to run. All instructions are provided in the manual, but we outline a simple procedure below to avoid many hurdles:

• Download Python 2.7.15 from https://www.python.org/downloads/release/python-2715/ and select the file appropriate for your OS. Install Python 2.7.15 by double clicking on the installer and following the instructions (for Windows users, the “Customize Python 2.7.15” step of the installation offers the option “Add python.exe to Path”; be careful with this option if other version(s) of Python are already installed and set up in the PATH).
• Then, open the terminal (cmd.exe in Windows) and use the following commands to install the numpy, pandas, tables, pyparsing, scipy and sklearn Python libraries:

python -m pip install numpy
python -m pip install pandas
python -m pip install tables
python -m pip install pyparsing
python -m pip install scipy
python -m pip install sklearn

• To check whether PartitionFinder2 was successfully installed, open the terminal, navigate to the folder storing the file PartitionFinder.py (you may need to change directories), and run the following command to see the help file. You should see several “help” options appear in your terminal.

python PartitionFinder.py --help

• Mumsa (required for Mumsa.Comp, not available for Windows OS): Download from http://msa.cgb.ki.se/cgi-bin/msa.cgi, paste the archive into the desired folder, extract the archive, and then compile the program using the following commands:

# Move into the newly decompressed "mumsa-1.0" folder.
cd mumsa-1.0
# Compile the program.
make

• PASTA (required for First.Align.All and Multi.Align, not available for Windows OS): Downloads and installation instructions are provided on the GitHub page at: https://github.com/smirarab/pasta

• Muscle (required for Multi.Align): Downloads for the different OS are available at https://www.drive5.com/muscle/downloads.htm, and instructions at https://www.drive5.com/muscle/manual/.

• Mafft (required for First.Align.All and Multi.Align): Downloads for the different OS and all documentation are available at https://mafft.cbrc.jp/alignment/software/ (for Windows OS we have only tested the “All-in-one package for Windows” version of Mafft).

• Prank (required for First.Align.All and Multi.Align): Downloads for the different OS and all documentation are available at http://wasabiapp.org/software/prank/; specific installation instructions are available at http://wasabiapp.org/software/prank/prank_installation/.

• RAxML (not strictly required for any regPhylo functions, but it can be used to build a likelihood tree that helps to define hard topological constraints in Step 14): Installation instructions and a manual are available at https://cme.h-its.org/exelixis/web/software/raxml/index.html. See the GitHub page to download the appropriate OS version and software architecture at: https://github.com/stamatak/standard-RAxML. In this example we use the “raxmlHPC-PTHREADS-AVX” architecture.

On Linux, PASTA, Muscle, Mafft, Prank, Gblocks, and trimAl must be in the PATH. Below we provide an example of how this is done for BLAST+. Open the .bashrc file in a text editor (e.g. gedit), and add the following lines of code at the bottom of the .bashrc file (it can also be .bash_profile depending on the configuration), adapting the path to the bin folder according to your personal configuration:

if [ -d "$HOME/Programs/BLAST/ncbi-blast-2.6.0+/bin" ]; then
    PATH="$HOME/Programs/BLAST/ncbi-blast-2.6.0+/bin:$PATH"
fi

Alternatively, the program can be added to the PATH in .bashrc (or .bash_profile) directly from the terminal using the following command (adapting the path to the bin folder according to your personal configuration):

echo "export PATH=$PATH:$HOME/Programs/BLAST/ncbi-blast-2.6.0+/bin" >> ~/.bashrc
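Once the external programs are installed, a quick way to check from R that they are reachable is the base function Sys.which(), which returns the full path of a program found on the PATH, or "" otherwise. This is only a sketch: the executable names below are typical defaults and may differ on your installation (e.g. the Gblocks or trimAl binaries may be named differently).

# Check which external programs are visible on the PATH from R.
Sys.which(c("blastn", "Gblocks", "trimal", "mafft", "muscle", "prank"))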

1) Prepare the species list

Here we use a hypothetical ray-finned fish community of 30 species to illustrate the utility of the regPhylo functions and work-flow. After extracting the archive available on Dryad, the species list of interest can be loaded into the R environment. The species list is in the first column of the file “SpeciesList_Classification.csv”. Only the first column is used here; the other columns contain the hierarchical levels of the Linnaean classification and will be used later.

# Load the species list and classification into R.
SpList.Classif = read.csv("Tuto_regPhylo/SpeciesList_Classification.csv", h = TRUE)

# Extract the species list only.
Sp.List = SpList.Classif$SpeciesName
# Replace the "_" between the genus and species name with a space.
Sp.List = gsub("_", " ", Sp.List)
Sp.List[1:10] # Display the first 10 species.

##  [1] "Aplodactylus etheridgii" "Aplodactylus arctidens"
##  [3] "Asterorhombus filifer"   "Arnoglossus scapha"
##  [5] "Bassanago bulbiceps"     "Conger verreauxi"
##  [7] "Epigonus robustus"       "Epigonus denticulatus"
##  [9] "Aldrovandia affinis"     "Halosauropsis macrochir"

Once in the appropriate format, we export the species list so that it can be checked for potential synonymous species names recognised by NCBI, using the NCBI taxonomic database.

write.table(Sp.List, file = "Sp.List_forNCBITaxo.txt", sep = "\t",
            row.names = F, col.names = F, quote = F)

2) Taxonomic checks using NCBI taxonomic database

Upload the file “Sp.List_forNCBITaxo.txt” to the NCBI web page (www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi) in order to check the taxonomic status of the species, extract the unique NCBI taxids, and check for species synonyms. Export the results by pressing the button “Save in file”. Move the downloaded “tax_report.txt” file into the working directory, then load it into the R environment.

taxReport = read.delim("Tuto_regPhylo/tax_report.txt", sep = "\t", h = T,
                       stringsAsFactors = FALSE) # accessed on 7th March 2019.
# We remove the unnecessary columns 2, 4 and 6 containing "|" as separator.
taxReport = taxReport[, -c(2, 4, 6)]
head(taxReport, n = 3)

##   code                    name preferred.name   taxid
## 1    1 Aplodactylus etheridgii                  91854
## 2    1  Aplodactylus arctidens                  82892
## 3    1   Asterorhombus filifer                1461739

The first column contains the code regarding the NCBI taxonomic status of the species names:

• 1 = the incoming name is the primary name in the NCBI taxon database,
• 2 = the incoming name is a secondary name in the NCBI taxon database (it could be listed as a synonym, a misspelling, a common name, or several other name types),
• 3 = the incoming name is not found in the NCBI taxon database,
• “+” = the incoming name is duplicated in our database (used in combination with the other status codes).

The second column reports the incoming species names, the third column reports the preferred name in the NCBI taxon database, and the last column reports the unique NCBI species taxonomic identifier (i.e. taxid). We identify which species are not found in the NCBI taxon database (i.e. code = 3) and therefore do not have a taxid:

taxReport$name[which(taxReport$code == 3)]

## [1] "Epigonus robustus"

Except for “Epigonus robustus”, all other species in the list are recognised by the NCBI taxon database. We use the function Taxreport2Sp.List to prepare species lists formatted for searching NCBI (GenBank) and BOLD for sequences and metadata. For NCBI, the input species list is a two-column table including all species with an NCBI taxid (the first column reports the taxid and the second reports the binomial species name). For BOLD, the input species list is also a two-column table (the first column includes all identified synonyms preferred by the NCBI taxonomic database, and the second column reports the original species name).

# Run the function with the path to the file "tax_report.txt" exported by the
# NCBI taxonomic web facility as input.
SpList.DF = Taxreport2Sp.List(input = "Tuto_regPhylo/tax_report.txt")

# Extract the species list for the NCBI search (first object of the list).
SpList.NCBI = SpList.DF$SpList.NCBI
head(SpList.NCBI, n = 3)

##     taxid                Sp.names
## 1   91854 Aplodactylus etheridgii
## 2   82892  Aplodactylus arctidens
## 3 1461739   Asterorhombus filifer

dim(SpList.NCBI) # 29 species with a taxid.

## [1] 29 2

# Extract the species list for the BOLD search (second object of the list).
SpList.BOLD = SpList.DF$SpList.BOLD
head(SpList.BOLD, n = 3)

##      SpName.Bold.search        Sp.names
## [1,] "Aplodactylus etheridgii" "Aplodactylus etheridgii"
## [2,] "Aplodactylus arctidens"  "Aplodactylus arctidens"
## [3,] "Asterorhombus filifer"   "Asterorhombus filifer"

dim(SpList.BOLD) # 32 species because two species had another preferred name in NCBI.

## [1] 32 2

3) Extract the DNA sequences and associated metadata from different sources and assemble the data

All extracted data will be stored in a folder called “Data_Extraction”.

dir.create("Data_Extraction")

A) Extraction from Genbank, via the NCBI platform

The SpList.NCBI table is the input file for the function GetSeqInfo_NCBI_taxid, which extracts DNA sequences, accession numbers and related metadata from GenBank and the associated databases EMBL and DDBJ (via NCBI, https://www.ncbi.nlm.nih.gov/). For more detail about the function and the metadata retrieved, please refer to the help of the function (?GetSeqInfo_NCBI_taxid). Here we run the function requesting all the DNA sequences available (gene = “ALL”), and we export a txt table called “Seq.NCBI.txt” storing all the output information (DNA sequences and metadata). Warning: This function is very slow (the seqinr functions extract information through a remote server) and can take multiple hours or even days to run, depending on the query (it took us 2h30min with this example). If you do not want to wait, use our provided output to continue exploring the regPhylo functions; see Note2 below for the code to load it.

Seq.NCBI.info = GetSeqInfo_NCBI_taxid(splist = SpList.NCBI, gene = "ALL",
                                      filename = "Data_Extraction/Seq.NCBI.txt", timeout = 15)
# Time difference of 3.42 hours using a timeout = 20,
# and Time difference of 2.57 hours for timeout = 15.

The output table must be loaded into the R environment.

Seq.NCBI.all = read.delim("Data_Extraction/Seq.NCBI.txt", sep = "\t", h = TRUE)
dim(Seq.NCBI.all) # 542 DNA sequences.

Note1: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available. Note2: Our provided output can be loaded using the following code:

Seq.NCBI.all = read.delim("Tuto_regPhylo/Data_Extraction/Seq.NCBI.txt", sep ="\t",h= TRUE)

B) Extraction from BOLD

The SpList.BOLD is the input file for the function GetSeq_BOLD which extracts DNA sequences and related metadata from the Barcode Of Life Database (BOLD, http://www.boldsystems.org/) and exports the output as a txt table called “Seq.BOLD.txt”.

Seq.BOLD.info = GetSeq_BOLD(splist = SpList.BOLD, filename = "Data_Extraction/Seq.BOLD.txt")

The output table must be loaded into the R environment.

Seq.BOLD = read.delim("Data_Extraction/Seq.BOLD.txt", sep = "\t", h = T)
dim(Seq.BOLD) # 291 sequences are retrieved.

Note: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available. This function runs quickly (23 seconds on our desktop), but if you do not want to wait, the output table can also be loaded using the following code.

Seq.BOLD = read.delim("Tuto_regPhylo/Data_Extraction/Seq.BOLD.txt", sep="\t", h=T)

C) Add data from another source

We also have the option to complement the data extracted from GenBank and BOLD with DNA sequences and metadata from another source, such as a personal repository.

# Here we load a table with the same structure as "Seq.NCBI.txt",
# including 11 sequences with the associated metadata coming from a
# personal repository.
Seq.PerRep = read.delim("Tuto_regPhylo/Data_Extraction/Seq.PerRep.txt", sep = "\t", h = T)
dim(Seq.PerRep) # 11 sequences.

## [1] 11 24

D) Assemble the data into a single table

The Congr.NCBI.BOLD.perReposit function homogenises the output from GenBank, BOLD, and the personal repository. This function assembles a common data frame by removing duplicated sequences (based on accession numbers) and selects the most relevant information (i.e. the longest sequence among sources, and complementary metadata across the different sources for the location, geographic coordinates, and collection date). For more information about the different fields of the output table see the help file of the function (?Congr.NCBI.BOLD.perReposit). The output is a single table called “AllSeqDF.txt”, and the option “perReposit” provides a name for the personal repository.

AllSeqDF = Congr.NCBI.BOLD.perReposit(input.NCBI = Seq.NCBI.all, input.BOLD = Seq.BOLD,
                                      output = "Data_Extraction/AllSeqDF.txt",
                                      input.perReposit = Seq.PerRep, perReposit = "PerRep")
dim(AllSeqDF) # 729 sequences in total after removing the duplicates.

## [1] 729 25

Number of sequences coming from the different sources:

length(which(AllSeqDF$OriginDatabase == "NCBI")) # 427 sequences from NCBI.
length(which(AllSeqDF$OriginDatabase == "BOLD")) # 176 sequences from BOLD.
length(which(AllSeqDF$OriginDatabase == "NCBI-BOLD")) # 115 sequences duplicated in NCBI and BOLD.
length(which(AllSeqDF$OriginDatabase == "PerRep")) # 11 sequences from the personal repository.
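A more compact alternative (a small base-R sketch, not part of regPhylo) returns the counts for all sources at once:

# Number of sequences per source database.
table(AllSeqDF$OriginDatabase)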

Note: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available.

4) Improve the spatial metadata associated with the DNA sequences in three steps

All the data associated with the improvement of the geolocation of the DNA sequences will be stored in the folder “Geolocation”.

dir.create("Geolocation")

A) Homogenise the geographic coordinates

In NCBI and BOLD, geographic coordinates may have different formats; the function GeoCoord.WGS84 standardises the coordinate format (WGS84 decimal degrees, using +/- North and +/- East). The function also splits the field “Lat_Lon” into two distinct fields for latitude (“Lat”) and longitude (“Long”). We run the function using the AllSeqDF table exported by the Congr.NCBI.BOLD.perReposit function as input, and we output a table called “AllSeqDF_Geo1.txt”.

AllSeqDF1 = GeoCoord.WGS84(input = AllSeqDF, output = "Geolocation/AllSeqDF_Geo1.txt")
dim(AllSeqDF1)

## [1] 729 26

names(AllSeqDF1) # Confirm that Latitude (i.e. "Lat") and Longitude (i.e. "Long") are distinct.

B) Retrieve geographic coordinates using GeOMe database

We use the function Query.GeOMe.XY.R to retrieve geographic coordinates for NCBI sequences that do not have geographic coordinates in NCBI but may be present in the GeOMe database, https://www.geome-db.org/ (Deck et al. 2017). Here, we run the function considering the phylum “Chordata”, and using the output of the previous function GeoCoord.WGS84 as input (i.e. AllSeqDF1). (It can take a couple of minutes.)

AllSeqDF2 = Query.GeOMe.XY.R(input = AllSeqDF1 , Phylum = "Chordata", output ="Geolocation/AllSeqDF_Geo2.txt")

No metadata (geographic coordinates) were retrieved from GeOMe in this example (see our case study in Appendix 2 of the supporting information of Eme et al. 2019 for an example where geographic coordinates were retrieved).

length(grep("XY-GeOMe", AllSeqDF2$OriginDatabase))

## [1] 0

C) Infer the geographic coordinates from the name of the sampling location

To infer geographic coordinates (in decimal degrees) for DNA sequences that only have the place name of their sampling location (e.g. country, village, city...) associated with them, we use the function GeoCodeName. This function uses the Nominatim OpenStreetMap API (https://nominatim.openstreetmap.org) to retrieve geographic coordinates, and takes as input the output tables exported by GeoCoord.WGS84 or by Query.GeOMe.XY.R. The function also accepts, as an option, a two-column table “CorrTab” providing corrected location names (for more detail see ?GeoCodeName). This function can take from a few minutes to several hours depending on the request (here, it took us less than 2 minutes).

# Run the function, without any correction for the place names.
AllSeqDF3a = GeoCodeName(input = AllSeqDF2, output = "Geolocation/AllSeqDF_Geo3a.txt")

# To detect the place names of the locations that couldn't be found because
# the location name requires some correction (e.g. "New Zealand; 100m
# off-shore Muriwai beach" might be problematic and can easily be corrected
# to "New Zealand; Muriwai beach"), see the code below, and the first three examples.
as.character(unique(AllSeqDF3a$Location[AllSeqDF3a$Geo_accuracy == "NoLocationFound"]))[1:3]

## [1] NA
## [2] "Australia:Tasmania,Maria Island CDS"
## [3] "Atlantic Ocean: near Spain, Galicia Bank"

# A two-column table with the corrected place names can be loaded into R.
LocNameCorrected = read.delim("Tuto_regPhylo/Geolocation/LocNameCorrected.csv", sep = "\t", h = T)
head(LocNameCorrected, n = 3)

##                                   Location                     Location_used
## 1      Australia:Tasmania,Maria Island CDS Australia: Tasmania, Maria Island
## 2 Atlantic Ocean: near Spain, Galicia Bank             Zepa banco de Galicia
## 3                      USA: Gulf of Mexico                    Gulf of Mexico

# Run the function with the place name corrections.
AllSeqDF3 = GeoCodeName(input = AllSeqDF2, output = "Geolocation/AllSeqDF_Geo3.txt",
                        CorrTab = LocNameCorrected)

Evaluate the percentage of DNA sequences with geographic coordinates.

(table(AllSeqDF3$Geo_accuracy)/length(AllSeqDF3$Geo_accuracy))*100

##
##  From_DB Inferred
## 30.04115 28.53224
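The overall proportion of georeferenced sequences can also be computed directly; a minimal sketch, assuming that “From_DB” and “Inferred” are the Geo_accuracy categories flagging sequences with usable coordinates:

# Percentage of sequences with coordinates taken from the source databases or
# inferred from place names.
round(100 * mean(AllSeqDF3$Geo_accuracy %in% c("From_DB", "Inferred")), 1)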

30% of the DNA sequences have precise geographic coordinates from the source databases, and a further 28.5% have geographic coordinates inferred from their associated place names. In total, 58.5% of the DNA sequences could be georeferenced. Note: The numbers that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available. Now, it is time to understand a bit more about the presence or absence of the different gene regions across the different species, in order to select a set of gene regions maximizing the gene coverage and the species overlap.

5) Build the species-by-gene matrix and remove undesirable sequences

We export all the data from steps 5 and 6 into a folder called “CleanSeqPool”.

dir.create("CleanSeqPool")

First, we use the function SpeciesGeneMat.Bl to build a species-by-gene matrix presenting the number of sequences available for each gene region and each species. This function cleans the gene names, with a special focus on 36 gene regions recognized as phylogenetically informative (including for fishes), and removes microsatellites and unassigned DNA regions. The function then exports four tables including the species-by-gene matrix (see ?SpeciesGeneMat.Bl for additional details). We run the function including two lists of “blacklisted” DNA sequences (one for BOLD and one for GenBank) that we consider to be problematic and that need to be removed from the pool of sequences.

# Prepare two lists with the accession numbers of the sequences that we would like
# to remove from the pool of sequences.
# For BOLD.
BOLD.SeqTrash = c("3191061", "3214910")

# For GenBank.
NCBI.SeqTrash = c("AB018233", "AF202547", "AF133061", "KP194660", "FJ896410", "EU366662")

See “Advanced topic 1: detecting problematic sequences that should be removed from the pool of sequences” for an example of how functions in regPhylo can help to detect and remove sequences that can cause alignment problems (e.g. badly annotated sequences, poor quality sequences...).

# Run the SpeciesGeneMat.Bl function to get a clean pool of sequences and
# an appropriate species-by-gene matrix.
Sp.DNAMat_cl = SpeciesGeneMat.Bl(input = AllSeqDF3, output = "CleanSeqPool/SpAll.DNA.Mat_cl_",
                                 NCBI.Trash = NCBI.SeqTrash, BOLD.Trash = BOLD.SeqTrash)
names(Sp.DNAMat_cl) # Names of the different elements of the list.

## [1] "Species.Gene_matrix"             "Summary_DNA"
## [3] "Summary_Species"                 "MissingSpecies_WithoutSequences"

dim(Sp.DNAMat_cl$Species.Gene_matrix) # Dimension of the species-by-gene matrix.

## [1] 30 103

# Display the first 5 columns (including the species name and the gene regions with the
# best species coverage).
Sp.DNAMat_cl$Species.Gene_matrix[1:5, 1:5]

##                              Species_Name co1 16srrna rag1 12srrna
## Arnoglossus scapha     Arnoglossus scapha   4       0    0       0
## Bassanago bulbiceps   Bassanago bulbiceps  20       0    0       0
## Conger verreauxi         Conger verreauxi   5       0    0       0
## Epigonus robustus       Epigonus robustus   2       0    0       0
## Scorpaena cardinalis Scorpaena cardinalis   9       0    0       0

# Summary of the number of different gene regions ("NB_TypeDNA") and number of
# sequences available for each species.
head(Sp.DNAMat_cl$Summary_Species, n = 3)

##               Name_Species NB_TypeDNA NB_Seq
## 1           Ablennes hians         54    103
## 2      Aldrovandia affinis         48     93
## 21 Halosauropsis macrochir         48     99

# Summary of the number of different species for each gene region, and number of sequences.
head(Sp.DNAMat_cl$Summary_DNA, n = 5)

##    Name_DNA NB_Species NB_Seq
## 13      co1         29    355
## 2   16srrna         22     49
## 68     rag1         18     25
## 1   12srrna         16     41
## 19     cytb         13     38

We can see that the co1, 16srrna (mitochondrial DNA) and rag1 (nuclear DNA) are the three regions with the highest species coverage. Note1: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available.

6) Selection of the gene regions of interest

A) Minimum number of gene regions maximizing the species coverage

We use the function SelGene.MaxSpCov to identify the set of gene regions that provide 100% species coverage. (See ?SelGene.MaxSpCov for additional details and possibilities)

SelGene.MaxSpCov(input = Sp.DNAMat_cl$Species.Gene_matrix)

[Figure: species coverage accumulation curve from a gene perspective. The x-axis is the gene index and the y-axis the number of species covered; the curve reaches the total number of species (30), and the minimum number of genes for 100% species coverage is 2.]

## $Minimum_Number_of_Gene_With_Full_Species_Coverage
## [1] 2
##
## $List_Gene_Name_Full_Species_Coverage
## [1] "co1"     "16srrna"

Only co1 and 16srrna are necessary to have all the species included in the tree.

B) Degree of species overlap between gene regions

Now, we use the function Matrix.Overlap to estimate the species overlap between each pair of selected gene regions.

# Based on the species-by-gene matrix, we select 2 mitochondrial
# gene regions ("co1", "16srrna") and 3 nuclear gene regions
# ("rag1", "myh6", "plagl2").
GeneSelection = c("co1", "16srrna", "rag1", "myh6", "plagl2")

Mat.overlap.GeneSelection = Matrix.Overlap(input = Sp.DNAMat_cl$Species.Gene_matrix,
                                           gene.Sel = GeneSelection)
# The species overlap for each pairwise comparison of the selected gene regions is presented.
Mat.overlap.GeneSelection$NumberOfSpecies

##         co1 16srrna rag1 myh6 plagl2
## co1      29      21   17   13      9
## 16srrna  21      22   17   12      8
## rag1     17      17   18   10      8
## myh6     13      12   10   13      6
## plagl2    9       8    8    6      9

# The average number of species that have a sequence and overlap among the selected gene regions.
diag(Mat.overlap.GeneSelection$NumberOfSpecies) = NA # Remove the diagonal.
mean(Mat.overlap.GeneSelection$NumberOfSpecies, na.rm = TRUE)

## [1] 12.1

Note1: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available.

C) Amount of missing data in the species-by-gene matrix for the selected gene regions

The function AmMissData extracts the percentage of missing data in the species-by-gene matrix according to a list of selected gene regions.

AmMissData(input = Sp.DNAMat_cl$Species.Gene_matrix, gene.list = GeneSelection)

## [1] 39.33333
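The complement, i.e. the completeness of the matrix, can be computed directly from the same call (a one-line sketch):

# Percentage of the species-by-gene matrix that is filled for the selected gene regions.
100 - AmMissData(input = Sp.DNAMat_cl$Species.Gene_matrix, gene.list = GeneSelection)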

39.33% of the species-by-gene matrix is missing data; in other words, the matrix is 100 - 39.33 = 60.67% complete. Note: The number that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available.

D) Export all DNA sequences and metadata for the selected gene regions

From the cleaned pool of sequences exported in step 5 by the SpeciesGeneMat.Bl function (the table with the extension "_CleanDataset.txt"), we select and export the sequences and associated information for the selected gene regions (i.e. co1, 16srrna, rag1, myh6, plagl2) using the Select.DNA function. This function also extracts the gene regions of interest from long annotated DNA sequences (i.e. > 5000 bp), such as complete mitochondrial genomes, and exports all associated metadata into a single output table.

# Load the pool of sequences exported by the SpeciesGeneMat.Bl function into the R
# environment.
PoolSeq = read.delim("CleanSeqPool/SpAll.DNA.Mat_cl__CleanDataset.txt", sep = "\t", h = T)

# Extract all the sequences (including from long annotated DNA sequences) and
# associated information.
PoolAllSeq.5Genes = Select.DNA(input = PoolSeq, gene.list = GeneSelection,
                               output = "CleanSeqPool/PoolAllSeq.5Genes", timeout = 15)

This function can take from seconds to hours to run if there are many sequences to be extracted from the long annotated DNA sequences in GenBank (the seqinr functions extract information through a remote server). For our example, it should take less than two minutes, but for those who do not want to wait, the output of this function can be loaded into R using the following code:

PoolAllSeq.5Genes=read.delim("Tuto_regPhylo/CleanSeqPool/PoolAllSeq.5Genes.Select.DNA.txt", sep ="\t", h=T)

There are 462 sequences in the pool of sequences for the 5 gene regions.

dim(PoolAllSeq.5Genes)

## [1] 462 29

Note: The number of DNA sequences that you retrieve may differ from our example. This reflects that more sequences for the species of interest are now available.

7) Export the best sequence per species and gene region, based on sequence length and/or geographic criteria

All the data exported in step 7 will be stored in the folder “Alignments”.

dir.create("Alignments")

A) Selection based on sequence length and geographic criteria

We can select the best representative sequences for each species and gene region using the SelBestSeq function. In our example, we selected the sequence most geographically proximate to the centroid of New Zealand (i.e. RefPoint = cbind(174.7976, -41.3355)), and used the median sequence length option. Only one sequence per species and gene region is selected (i.e. MaxSeq = 1).

# Create the directory to store the alignments using geographic criteria.
dir.create("Alignments/OneBest.Geo")

# Run the function using sequence length and geographic proximity as the
# criteria to select the sequence.
BestSeq.Geo.export = SelBestSeq(input = PoolAllSeq.5Genes, output = "Alignments/OneBest.Geo/Best.Geo",
                                RefPoint = cbind(174.7976, -41.3355), perReposit = "PerRep",
                                Alignment = T, MaxSeq = 1, gene.list = GeneSelection,
                                SeqChoice = "Median")
dim(BestSeq.Geo.export) # 91 sequences have been exported.

## [1] 91 33

Note: a warning message “closing unused connection...” may appear after running this function; this is normal and the user should not worry about it.

B) Selection based on sequence length only

The sequences are selected based on their median sequence length.

# Create the directory to store the alignments when the geographic criterion is disabled.
dir.create("Alignments/OneBest")
# Run the function disabling the geographic criterion to select the sequences.
BestSeq.export = SelBestSeq(input = PoolAllSeq.5Genes, output = "Alignments/OneBest/Best",
                            perReposit = "PerRep", Alignment = T, MaxSeq = 1,
                            gene.list = GeneSelection, SeqChoice = "Median")

8) Multiple alignments

All the data created in steps 8, 9, 10 and 11 are stored in the folder called “Alignments”. The function Multi.Align calls different programs (Muscle, Mafft, Prank and PASTA) to align the sequence files. This function also detects sequences that need to be reverse complemented. (For additional details see the help page ?Multi.Align.) The ‘output’, ‘input’, ‘nthread’ and ‘methods’ objects need to be present in the R environment before running the function, because the function runs in parallel using the parallel R package. Reminder: for Windows OS, PASTA is not available, and the paths to the software must be provided in Mafft.path, Muscle.path and Prank.path.

# First we need to load the objects into the R environment.
output = "Alignments/OneBest.Geo/MultiAlign"
input = "Alignments/OneBest.Geo"
nthread = 5
methods = c("mafftfftns2", "mafftfftnsi", "muscle", "prank")

# Run the function.
Multi.Align(input = "Alignments/OneBest.Geo", output = "Alignments/OneBest.Geo/MultiAlign",
            nthread = 5, methods = c("mafftfftns2", "mafftfftnsi", "muscle", "prank"))

See below an example for Windows users to run the function.

# First we need to load the objects into the R environment,
# including the paths of the programs, but remember to change
# backslashes "\" to slashes "/" in the path names.
output = "Alignments/OneBest.Geo/MultiAlign"
input = "Alignments/OneBest.Geo"
nthread = 5
methods = c("mafftfftns2", "mafftfftnsi", "muscle", "prank")
Mafft.path = "C:/Users/david/Program/mafft/mafft-7.409-win64-signed/mafft-win/mafft"
Muscle.path = "C:/Users/david/Program/Muscle/muscle3.8.31_i86win32.exe"
Prank.path = "C:/Users/david/Program/Prank/prank.windows.140603/prank/bin/prank.exe"

# Run the function.
Multi.Align(input = "Alignments/OneBest.Geo", output = "Alignments/OneBest.Geo/MultiAlign",
            nthread = 5, methods = c("mafftfftns2", "mafftfftnsi", "muscle", "prank"),
            Mafft.path = "C:/Users/david/Program/mafft/mafft-7.409-win64-signed/mafft-win/mafft",
            Muscle.path = "C:/Users/david/Program/Muscle/muscle3.8.31_i86win32.exe",
            Prank.path = "C:/Users/david/Program/Prank/prank.windows.140603/prank/bin/prank.exe")

As an additional option, the function Mumsa.Comp compares multiple alignments using the MUMSA software (Lassmann & Sonnhammer, 2005) and reports the MOS (Multiple Overlap Score) and the AOS (Average Overlap Score) to help the user assess dissimilarities among alignments and select which alignment to use. Remark: this function is not available for Windows users in R, but the online web server http://msa.cgb.ki.se/cgi-bin/msa.cgi can be used instead. The MOS is provided for each alignment, allowing us to check the support (more exactly the agreement, or congruence) of those alignments. The alignment with the highest MOS is supposed to be the most consensual alignment (Lassmann & Sonnhammer, 2005). The AOS also indicates whether the sequences are too divergent to be aligned (an AOS score less than 0.5 indicates that the sequences are too divergent, potentially caused by saturation effects).

MumsaRes = Mumsa.Comp(input = "Alignments/OneBest.Geo/MultiAlign",
                      output = "Alignments/OneBest.Geo/Mumsa_Res_OneBest.Geo",
                      remove.empty.align = TRUE)
MumsaRes # Display the table, with the AOS and MOS scores.

Here, for simplicity, we use all the alignments exported by “mafftfftnsi” and copy these files into a folder called “Alignments/OneBest.Geo/ToTrim”.

dir.create("Alignments/OneBest.Geo/ToTrim")
list.ali = list.files("Alignments/OneBest.Geo/MultiAlign")
list.mafft.ali = list.ali[grep("Mafftfftnsi", list.ali)]
file.copy(paste("Alignments/OneBest.Geo/MultiAlign/", list.mafft.ali, sep = ""),
          "Alignments/OneBest.Geo/ToTrim")

At this point, the alignments should also be inspected by eye. We inspected the alignments using Seaview (Gouy et al. 2010) and adjusted the rag1 and plagl2 alignments manually.
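For a quick visual check without leaving R, an alignment can also be plotted with the ape package (already loaded); this is only a sketch, and a dedicated viewer such as Seaview remains more convenient for manual editing. The file name below follows the naming pattern of the mafftfftnsi alignments copied into the “ToTrim” folder.

# Plot one alignment to spot obviously misaligned sequences.
ali.rag1 = ape::read.FASTA("Alignments/OneBest.Geo/ToTrim/Mafftfftnsi_Best.Geo_rag1.fas")
image(as.matrix(ali.rag1), cex.lab = 0.4)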

9) Trim poorly aligned positions and/or gappy positions

Two of the most popular programs to remove poorly aligned positions can be used in regPhylo: trimAl (Capella-Gutierrez et al. 2009) and Gblocks (Castresana 2000). The function Filtering.align.Trimal uses the software trimAl and runs the options -gappyout and -automated1 simultaneously (see Capella-Gutierrez et al. 2009, and ?Filtering.align.Trimal for additional details). Reminder: for Windows users the path to trimAl must be provided in TrimAl.path.

outtrimAl = Filtering.align.Trimal(input = "Alignments/OneBest.Geo/ToTrim",
                                   output = "Alignments/OneBest.Geo/Trimmed_trimAl")
outtrimAl # Display the sequence length for the different alignments.

# Example for Windows users.
outtrimAl = Filtering.align.Trimal(input = "Alignments/OneBest.Geo/ToTrim",
                                   output = "Alignments/OneBest.Geo/Trimmed_trimAl",
                                   TrimAl.path = "C:/Users/david/Program/trimAl/trimal.v1.2rev59/trimAl/bin/trimal.exe")

The function Filtering.align.Gblocks uses the software GBLOCKS V 0.91b (Castresana, 2000). The function allows the user to specify (with “Type”) whether the sequences are proteins (“p”), DNA-non coding (“d”), or DNA coding (“c”) regions. The function offers the possibility to use the default parameter (more stringent selection) or the less stringent selection approach. We used the latter because the former tends to be too conservative.

# Four of the alignments are coding DNA and one is non-coding DNA (16srrna).
list.ali = list.files("Alignments/OneBest.Geo/ToTrim")
list.ali

## [1] "Mafftfftnsi_Best.Geo_16srrna.fas" "Mafftfftnsi_Best.Geo_co1.fas"
## [3] "Mafftfftnsi_Best.Geo_myh6.fas"    "Mafftfftnsi_Best.Geo_plagl2.fas"
## [5] "Mafftfftnsi_Best.Geo_rag1.fas"

# We prepare a vector with the type of DNA for each alignment.
# The first alignment is the 16srrna, so it must
# be coded "d"; all the others must be coded "c" (all coding DNA).
Type.ali = c("d", "c", "c", "c", "c")

# We loop the function over the different alignments.
for(i in 1:length(list.ali)){
  outGblocks = Filtering.align.Gblocks(input = "Alignments/OneBest.Geo/ToTrim",
                                       target.file = list.ali[i], LessStringent = "TRUE",
                                       Type = Type.ali[i],
                                       output = "Alignments/OneBest.Geo/Trimmed_Gblocks",
                                       remove.empty.align = TRUE)
}
outGblocks

##                                   Align.Name Trim.Method   Gene.Name
## 1 Gblocksls_Mafftfftnsi_Best.Geo_16srrna.fas   Gblocksls 16srrna.fas
## 2     Gblocksls_Mafftfftnsi_Best.Geo_co1.fas   Gblocksls     co1.fas
## 3    Gblocksls_Mafftfftnsi_Best.Geo_myh6.fas   Gblocksls    myh6.fas
## 4  Gblocksls_Mafftfftnsi_Best.Geo_plagl2.fas   Gblocksls  plagl2.fas
## 5    Gblocksls_Mafftfftnsi_Best.Geo_rag1.fas   Gblocksls    rag1.fas
##       Program SeqLength
## 1 Mafftfftnsi       534
## 2 Mafftfftnsi       648
## 3 Mafftfftnsi       708
## 4 Mafftfftnsi       741
## 5 Mafftfftnsi      1275

# Again, for Windows users the Gblocks.path option has to be provided, see the example below.
for(i in 1:length(list.ali)){
  outGblocks = Filtering.align.Gblocks(input = "Alignments/OneBest.Geo/ToTrim",
                                       target.file = list.ali[i], LessStringent = "TRUE",
                                       Type = Type.ali[i],
                                       output = "Alignments/OneBest.Geo/Trimmed_Gblocks",
                                       remove.empty.align = TRUE,
                                       Gblocks.path = "C:/Users/david/Program/Gblocks/Gblocks_Windows_0.91b/Gblocks_0.91b")
}
outGblocks

After trimming, we inspected the alignments again using Seaview. We decided to retain all the alignments exported by Gblocks, because they kept more sites for most of the coding DNA regions (co1, myh6 and plagl2) and successfully trimmed the poorly aligned positions of the non-coding 16srrna alignment.
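The number of alignment columns retained by each trimming method can also be compared directly in R; the sketch below simply reads the trimmed fasta files back with ape and counts their columns (it assumes the two output folders contain only the trimmed fasta alignments).

# Compare the alignment lengths kept by Gblocks and trimAl.
trim.dirs = c(Gblocks = "Alignments/OneBest.Geo/Trimmed_Gblocks",
              trimAl = "Alignments/OneBest.Geo/Trimmed_trimAl")
lapply(trim.dirs, function(d) {
  files = list.files(d, full.names = TRUE)
  setNames(sapply(files, function(f) ncol(as.matrix(ape::read.FASTA(f)))), basename(files))
})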

10) Concatenate the trimmed alignments into a single supermatrix

We use the function Align.Concat to concatenate the trimmed alignments for the five gene regions into a single supermatrix, in nexus and fasta formats, at the species level. This function also allows the inclusion of species without DNA sequences, if necessary (for instance, to then use BEAST2 to resolve polytomies, Kuhn et al. 2011). (For additional details see ?Align.Concat, and Advanced topic 2: including species without DNA and resolving polytomies using BEAST2.) We export the Gblocks alignments into a folder called “ForConcat”.

# Create the "ForConcat" folder. dir.create("Alignments/OneBest.Geo/ForConcat") list.ali = list.files("Alignments/OneBest.Geo/Trimmed_Gblocks") # Copy the files into the new folder. file.copy(paste("Alignments/OneBest.Geo/Trimmed_Gblocks/", list.ali, sep =""), "Alignments/OneBest.Geo/ForConcat")

## [1] TRUE TRUE TRUE TRUE TRUE

Run the function; the output files will be stored in the “Alignments/OneBest.Geo/ForConcat” folder.

Align.Concat(input="Alignments/OneBest.Geo/ForConcat", Sp.List=NULL, outputConcat = "Alignments/OneBest.Geo/ForConcat/Concat")

##      Name.PartitionFinder2 Common.Gene.Name
## [1,] "gene1"               "16srrna"
## [2,] "gene2"               "co1"
## [3,] "gene3"               "myh6"
## [4,] "gene4"               "plagl2"
## [5,] "gene5"               "rag1"

The supermatrix is called “Concat.fas” (in fasta format) and “Concat.nex” (in nexus format), and the file “Partitions_Concat.txt” reports the different partitions (i.e. the gene regions here) following the RAxML format. A table “convtab.txt” providing the entry order of the different gene regions is also exported.

11) Select the best partitioning scheme and substitution model using PARTITIONFINDER2

The function PartiFinder2 is a wrapper running PARTITIONFINDER2 (Lanfear et al. 2017) on a supermatrix (or concatenated alignment) to estimate the best number of partitions and the best substitution models. This function uses the alignment “Concat.fas” as “input”, “Partitions_Concat.txt” as “Partition” (i.e. the partition file), and “codon” specifies which gene regions are coding. In our example, the only non-coding region, the 16srrna, appears in first position, so “codon” must include all the other gene regions (i.e. codon = c(2:5)). (See also “convtab.txt” exported by Align.Concat.) For additional information see ?PartiFinder2. Warning: the path to PartitionFinder2 needs to be provided in “Path.PartiF2”.

PartiFinder2(input = "Alignments/OneBest.Geo/ForConcat/Concat.fas", Partition = "Alignments/OneBest.Geo/ForConcat/Partitions_Concat.txt", codon = c(2:5), nexus.file = "Alignments/OneBest.Geo/ForConcat/Concat.nex", Path.PartiF2 = "/home/davidpc/Programs/PartitionFinder2/partitionfinder-2.1.1/PartitionFinder.py", branchlengths = "linked", models = "all", model_selection = "BIC", search = "greedy", Raxml = "TRUE", nthread =5)

This function can take several minutes to run (48 sec. using 5 threads, here). The file containing the best partition scheme is called “Partitions_Concat.txt_PF2_all.txt” and is located in the same folder as the “Concat.fas” alignment. All details and files exported by PartitionFinder2 can be found in a folder called “analysis” in the working directory, including the summary of the results (best number of partitions and best substitution models can be found in the “analysis/best_scheme.txt” file).

bestpartition = readLines("Alignments/OneBest.Geo/ForConcat/Partitions_Concat.txt_PF2_all.txt")
bestpartition # Display the 8 partitions that have been delineated.

Warnings:
1) In some cases users might need to install the “png” and “reticulate” R packages in order for Python to run through R. This is likely the case if the files “partition_finder.cfg” and “Concat.phy” have been successfully exported into your working directory and PartitionFinder2 is working on your machine, but the following error message is printed: “Error in file(con, "r") : cannot open the connection. In addition: Warning message: In file(con, "r") : cannot open file 'analysis/best_scheme.txt': No such file or directory”.
2) Sometimes PartitionFinder2 crashes prior to finishing the run, usually due to threading problems. In order to re-run the function, remove the folder and the files created by the unsuccessful execution of the PartiFinder2 function (i.e. “partition_finder.cfg”, “Concat.phy”, and the folder called “analysis”), and we advise decreasing the number of threads.
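If PartiFinder2 does need to be re-run, the leftover files can be removed from R before re-launching it (a sketch, assuming they were written to the current working directory as described above):

# Remove the folder and files created by the unsuccessful PartitionFinder2 run.
unlink("analysis", recursive = TRUE)
file.remove(c("partition_finder.cfg", "Concat.phy"))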

12) Define soft topological constraints

All data created in step 12 will be exported into a folder called “Constraints_Calibs”.

dir.create("Constraints_Calibs")

The goal of this step is to designate soft topological constraints on the tree following established phylogenetic relationships recovered from published phylogenies. The constraints are considered soft at this stage because some taxa remain unconstrained and are therefore free to enter any clade according to their molecular affinities (see the RAxML manual p. 24, option -g; manual available at: https://sco.h-its.org/exelixis/resource/download/NewManual.pdf). To implement this, we build a multifurcating tree in newick format that will be used in RAxML, using:

• 1) the function ConstraintTaxo2newick,
• 2) a two-column table listing the name of the hierarchical level of the constraint (e.g. ‘Family’, ‘Order’) and the name of the clade that must be constrained (e.g. ‘Aplodactylidae’, ‘Aulopiformes’). For example, we load a file listing 17 topological constraints based on previous phylogenetic work.

Constraint = read.delim("Tuto_regPhylo/Constraints_Calibs/TopoConstraints_Example.txt",
                        sep = "\t", h = T)
head(Constraint, n = 5)

##   Level_Hierarchy Constraint_Name
## 1          Cohort     Elopomorpha
## 2        Division Acanthopterygii
## 3          Family      Serranidae
## 4           Order  Anguilliformes
## 5           Order    Aulopiformes

• 3) a taxonomic classification table for all the taxa present in the phylogenetic tree (here we use SpList.Classif, which contains both the species list and the taxonomic hierarchy). The SpList.Classif table can be loaded using the following command:

SpList.Classif = read.csv("Tuto_regPhylo/SpeciesList_Classification.csv", h = TRUE)

We run the function ConstraintTaxo2newick to build a multifurcating phylogenetic tree ready to be used by RAxML as the constraint tree (option ‘-g’ in RAxML) to guide the reconstruction of the molecular phylogenetic tree.

BackBoneTree = ConstraintTaxo2newick(inputTaxo = SpList.Classif, inputConst = Constraint,
                                     outputNewick = "Constraints_Calibs/Backbone_17TopoConst")
# Plot the backbone tree using the ape R package.
plot(BackBoneTree$NewickConstraintTree, cex = 0.6)

[Figure: the multifurcating backbone (constraint) tree for the 30 species, plotted with ape.]

Warning: the branch lengths of the multifurcating tree are arbitrary and therefore misleading; only the topology matters.

13) Prepare the baseline .xml file for BEAST2 using BEAUTi

Before including further topological constraints and any calibration constraints, we prepare the baseline .xml file used by BEAST2 to perform the Bayesian tree construction. We use BEAUTi to build this file. First, we load the appropriate BEAST2 packages (CLADEAGE in our case) and the supermatrix with the desired number of partitions estimated by PartitionFinder2 and implemented in the nexus file exported by PartiFinder2. The input alignment is called “Concat_PF2.nex” and can be found in the directory “Alignments/OneBest.Geo/ForConcat”. Then, the parameters and their priors, related to the substitution models, clock models, tree model, and MCMC, are set up. Finally the file is saved in xml format. We do not use BEAUTi to implement the topological constraints or the CLADEAGE analysis to date the tree; those steps are automated using other regPhylo functions, see steps 14 and 15. A detailed description of how to use BEAUTi to prepare .xml files for BEAST2 is beyond the scope of this document, but tutorials and detailed descriptions can be found at: https://www.beast2.org/. A baseline .xml file for this example can be found here: “Tuto_regPhylo/Constraints_Calibs/BaseLine_8GTR.xml”.

14) Include hard topological constraints in the baseline .xml file for BEAST2

A) Define hard topological constraints based on bootstrap support of a RAxML tree guided by the constrained tree

Here, RAxML (Stamatakis 2014) is called directly from R (see the instructions in the preamble for the installation). All the input files for RAxML must be present in the R working directory.

We assumed a GTRCAT substitution model for each partition (8 GTRCAT models in total), and we used the rapid bootstrap search with the autoMRE option as the bootstrap convergence criterion to automatically stop the bootstrapping. The ‘-g’ option allows us to include a multifurcating tree as a constraint tree. See the RAxML manual for further details about all the options available (manual available at: https://sco.h-its.org/exelixis/resource/download/NewManual.pdf). Here, exceptionally, we change our working directory while running RAxML to a subfolder “Trees/RAxML”, nested within the folder used previously as the working directory.

# Create a directory to store the RAxML tree.
dir.create("Trees")
dir.create("Trees/RAxML")
# Copy all the input files into the same directory.
# The supermatrix in fasta format.
file.copy("Alignments/OneBest.Geo/ForConcat/Concat.fas", "Trees/RAxML")
# The best partition scheme defined by PartitionFinder2.
file.copy("Alignments/OneBest.Geo/ForConcat/Partitions_Concat.txt_PF2_all.txt", "Trees/RAxML")
# The multifurcating guiding tree.
file.copy("Constraints_Calibs/Backbone_17TopoConst.txt", "Trees/RAxML")
# We change our working directory.
setwd("Trees/RAxML")

# Set up the options for RAxML.
nthread = 5
input = "Concat.fas"
output = "Tree_RAxML_autoMRE"
ConstraintTree = "Backbone_17TopoConst.txt"
PartitionFile = "Partitions_Concat.txt_PF2_all.txt"

# Prepare the command line that is going to be passed to the console.
a = paste("raxmlHPC-PTHREADS-AVX -g ", ConstraintTree,
          " -f a -x 22345 -p 12345 -# autoMRE -q ", PartitionFile,
          " -m GTRCAT -T ", nthread, " -s ", input, " -n ", output, sep = "")

# Pass the command line to the console to run RAxML.
system(a)
# We come back to the previous working directory.
setwd("../..")

For Windows users if RAxML is not in the PATH, then the PATH needs to be specified, see an example of the code below.

# RAxML path.
raxml = "C:/Users/david/Program/RAxML/standard-RAxML-master/standard-RAxML-master/WindowsExecutables_v8.2.10/raxmlHPC-PTHREADS-AVX.exe"
a = paste(raxml, " -g ", ConstraintTree,
          " -f a -x 22345 -p 12345 -# autoMRE -q ", PartitionFile,
          " -m GTRCAT -T ", nthread, " -s ", input, " -n ", output, sep = "")
# Pass the command line to the console to run RAxML.
system(a)

# We come back to the previous working directory to continue the tutorial.
setwd("../..")

The overall run time using 5 threads took 2 min 09 sec on our desktop.

B) Edit the baseline .xml file to include hard topological constraints in BEAST2

To edit the .xml file and include hard topological constraints, we also need a rooted RAxML tree with bootstrap support. The latter defines the clades (with 100% bootstrap support) that are going to be used to set up hard monophyletic constraints in the .xml file for further analysis in BEAST2. Those clades include all previously constrained clades plus some clades strongly supported by molecular data in the maximum likelihood analytical framework. The constraints are considered hard because an unconstrained species is not free to enter into them. Great care should be taken when rooting the RAxML maximum likelihood tree. To ensure that all bootstrap values were still associated with the appropriate nodes after re-rooting, we opened the RAxML tree (including the bootstrap supports, ‘RAxML_bipartitions.Tree_RAxML_autoMRE’) in Dendroscope 3.5.9 (available at: http://dendroscope.org/), specifying that internal node labels should be interpreted as edge labels. The tree is re-rooted using the MRCA of the out-group (here we use the MRCA of ‘Halosauropsis_macrochir’, ‘Aldrovandia_affinis’, ‘Conger_verreauxi’ and ‘Bassanago_bulbiceps’). To do that, go to “Select”, “Select Root”, select the branch leading to the MRCA of the outgroup, and export (in “File”, “Export”) the re-rooted tree in nexus format (e.g. ‘RAxML_bipartitions.Tree_RAxML_autoMRE_ReRooted’). Then the rooted tree is loaded into the R environment.

require(ape)
TreeRooted = read.nexus("Trees/RAxML/RAxML_bipartitions.Tree_RAxML_autoMRE_ReRooted")
# Check if the tree is rooted.
is.rooted(TreeRooted)

Specify the hard constraints according to the bootstrap supports of the RAxML tree, using the MultiTopoConst.EditXML4BEAST2 function. This function also allows the inclusion of taxa without DNA in the Bayesian tree construction. For additional details see Advanced topic 2: including species without DNA and resolving polytomies using BEAST2, and ?MultiTopoConst.EditXML4BEAST2.

MultiTopoConst.EditXML4BEAST2(inputtree=TreeRooted, input.xml="Tuto_regPhylo/Constraints_Calibs/BaseLine_8GTR.xml", output="Constraints_Calibs/BaseLine_8GTR_w.xml", bootstrapTH=100, xmltreename="ConcatF", Partitions="TRUE")

## [1] TRUE TRUE

The output of the MultiTopoConst.EditXML4BEAST2 function, called “Constraints_Calibs/BaseLine_8GTR_w.xml”, can be opened in BEAUTi. From here, users can decide to perform conventional tree dating with BEAST2. However, regPhylo also includes a function that helps to set-up a more advanced calibration approach in BEAST2 using the CLADEAGE approach (Matschiner et al. 2017) (see next section).

15) Date the phylogenetic tree in absolute time using the CLADEAGE approach in BEAST2

The regPhylo function CladeAgeCalib.xml edits the BEAST2 xml file to set up a CLADEAGE analysis, which allows us to objectively determine the prior distributions of the calibration points used to calibrate the tree in an absolute time frame (Matschiner et al. 2017). The CLADEAGE approach is based on the oldest fossil occurrence of the clades used for the calibration, the net diversification rate, the diversification turnover, and the fossil sampling rates of the group under investigation. For the last three parameters, we use values provided by Matschiner et al. (2017). For a tutorial about CLADEAGE analysis see the original paper (Matschiner et al. 2017) and the tutorial ‘A Rough Guide to CladeAge’ available at https://www.beast2.org/tutorials/.

CLADEAGE analyses can be set up directly through the BEAUTi GUI; however, when there are many clades, each including a large number of species, this is a daunting task. Here, we provide the CladeAgeCalib.xml function to automatically edit the output .xml file from the MultiTopoConst.EditXML4BEAST2 function, according to CLADEAGE requirements. The output .xml file is ready for analysis in BEAST2.

Calibration = read.delim("Tuto_regPhylo/Constraints_Calibs/Calibration_4_clades.txt",
    sep = "\t", h = TRUE)
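Before editing the xml file, it can be useful to check that the calibration table was read correctly. We make no assumption about its column names here and simply display its structure and first rows.

# Quick sanity check of the calibration table.
str(Calibration)
head(Calibration)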

We run the function, using the xml file including the topological constraints (i.e. “Constraints_Calibs/BaseLine_8GTR_w.xml”), and the classification table in the object “SpList.Classif”.

CladeAgeCalib.xml(xml.input = "Constraints_Calibs/BaseLine_8GTR_w.xml",
    input.tree = TreeRooted,
    output = "Constraints_Calibs/BaseLine_8GTR_END.xml",
    CalPointTable = Calibration,
    MinDivRate = 0.041, MaxDivRate = 0.081,
    MinTurnoverRate = 0.0011, MaxTurnoverRate = 0.37,
    MinSamplingRate = 0.0066, MaxSamplingRate = 0.01806,
    xmltreename = "ConcatF", inputTaxono = SpList.Classif,
    Partitions = "TRUE")

## [1] TRUE

The output file “Constraints_Calibs/BaseLine_8GTR_END.xml” is the final input .xml file ready to be run in BEAST2.
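The analysis itself is then launched in BEAST2, either through its GUI or from the command line. Below is a minimal sketch calling BEAST2 from R; the executable name "beast" is an assumption (adapt it to the full path of the BEAST2 launcher on your system).

# Hypothetical command line call to BEAST2 from within R; adapt the executable path.
beast2 = "beast"
system(paste(beast2, "Constraints_Calibs/BaseLine_8GTR_END.xml"))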

Advanced topic 1: detecting problematic sequences that should be removed from the pool of sequences

Here, we show how regPhylo functions can help to detect and remove problematic sequences from the pool of sequences. All the data generated in this section will be exported to the folder "AdvTop1_ProbSeq".

dir.create("AdvTop1_ProbSeq")

A) Build the species-by-gene matrix naively.

The first time we build the species-by-gene matrix we have no a priori idea of which sequences might be problematic, so no sequences are blacklisted (the "NCBI.Trash" and "BOLD.Trash" options are left unspecified) when running the SpeciesGeneMat.Bl function. We use the AllSeqDF3 object as the input table (i.e. the output of the GeoCodeName function, step 4).

Sp.DNAMat_naive = SpeciesGeneMat.Bl(input = AllSeqDF3,
    output = "AdvTop1_ProbSeq/SpAll.DNA.Mat_Naive_")
# Names of the different elements of the list.
names(Sp.DNAMat_naive)

## [1] "Species.Gene_matrix" "Summary_DNA" ## [3] "Summary_Species" "MissingSpecies_WithoutSequences"

# Dimensions of the species-by-gene matrix.
dim(Sp.DNAMat_naive$Species.Gene_matrix)

## [1] 30 103

# Display the first 5 columns and last 4 rows (including the species name and
# the gene regions with the best species coverage).
Sp.DNAMat_naive$Species.Gene_matrix[27:30, 1:5]

##                                      Species_Name co1 16srrna rag1 12srrna
## Omosudis lowii                     Omosudis lowii  29       2    1       6
## Aldrovandia affinis           Aldrovandia affinis  27       4    1       5
## Halosauropsis macrochir   Halosauropsis macrochir  40       5    2       6
## Ablennes hians                     Ablennes hians  22       4    1       4

# Summary of the number of different gene regions ("NB_TypeDNA") and
# number of sequences available for each species.
head(Sp.DNAMat_naive$Summary_Species, n = 3)

##               Name_Species NB_TypeDNA NB_Seq
## 1           Ablennes hians         54    103
## 2      Aldrovandia affinis         48     94
## 21 Halosauropsis macrochir         48    101

# Summary of the number of different species for each gene region,
# and number of sequences.
head(Sp.DNAMat_naive$Summary_DNA, n = 5)

##    Name_DNA NB_Species NB_Seq
## 13      co1         29    361
## 2   16srrna         22     49
## 68     rag1         18     26
## 1   12srrna         16     41
## 19     cytb         13     38

We can see that co1, 16srrna (mitochondrial DNA) and rag1 (nuclear DNA) are the three gene regions with the highest species coverage. We keep the same gene regions of interest as in the main tutorial.

GeneSelection = c("co1", "16srrna", "rag1", "myh6", "plagl2")

B) Select DNA sequences and metadata and export the selected gene regions in fasta format

From the pool of sequences exported by the function SpeciesGeneMat.Bl, we select and export the sequences and metadata for the selected gene regions (including extraction of the target regions from long annotated DNA sequences > 5000 bp, such as complete mitochondrial genomes).

# Load the sequences exported by the SpeciesGeneMat.Bl function into the R environment.
PoolSeq.naive = read.delim("Tuto_regPhylo/AdvTop1_ProbSeq/SpAll.DNA.Mat_Naive__CleanDataset.txt",
    sep = "\t", h = T)

# Extract selected sequences (including from long annotated DNA sequences) and metadata.
PoolAllSeq.5Genes.naive = Select.DNA(input = PoolSeq.naive, gene.list = GeneSelection,
    output = "AdvTop1_ProbSeq/PoolAllSeq.5Genes.naive")
# The output table includes 470 sequences.
dim(PoolAllSeq.5Genes.naive)

## [1] 470 29

This function can take seconds to hours to run if there are many sequences to be extracted from the long annotated DNA sequences in GenBank (the seqinr functions retrieve information through a remote server). For our example, it should take less than two minutes, but for those who do not want to wait, the output of this function can be loaded into R using the following code:

PoolAllSeq.5Genes.naive = read.delim(
    "Tuto_regPhylo/AdvTop1_ProbSeq/PoolAllSeq.5Genes.naive.Select.DNA.txt",
    sep = "\t", h = T)

We use the function SelBestSeq to export all the sequences available per species for each selected gene region in fasta format, in order to detect problematic sequences that need to be "blacklisted" and removed from the overall pool of sequences. The "input" is the output of the Select.DNA function, the "output" is a path to the folder storing the alignment (the folder must be created first), and the "SeqChoice" option is not relevant here because all the sequences per gene region and species are exported. The option "perReposit" provides the name of the personal repository the sequences originated from (the same name as the one provided to Congr.NCBI.BOLD.perReposit in step 5). "Alignment = T" specifies that the fasta alignment must be exported, and "MaxSeq" is "ALL" because all the sequences per species and gene region must be exported. For additional information about the function options see ?SelBestSeq.

# First we create the folders to store all the alignments.
dir.create("AdvTop1_ProbSeq/Alignments")
dir.create("AdvTop1_ProbSeq/Alignments/AllSeq")

# Run the function.
PoolAllSeq.5Genes.naive.export = SelBestSeq(input = PoolAllSeq.5Genes.naive,
    output = "AdvTop1_ProbSeq/Alignments/AllSeq/AllSeq",
    perReposit = "PerRep", Alignment = T, MaxSeq = "ALL",
    gene.list = GeneSelection, SeqChoice = "Median")
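To check that the export worked, the content of the alignment folder can be listed; the exact file names follow the function's own naming scheme, which we do not assume here.

# List the fasta files exported by SelBestSeq (one file per selected gene region).
list.files("AdvTop1_ProbSeq/Alignments/AllSeq")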

C) Detecting outlier sequences in the pool of selected gene regions

1) First alignment of all sequences and detection of sequences that should be reverse complemented

Here, we use the function First.Align.All to detect sequences that should be reverse complemented (i.e. that enter in the wrong direction, from 3' to 5' instead of 5' to 3') and to perform the first alignment of the sequences for each gene region of interest. This alignment can be inspected by eye to identify problematic sequences that cause alignment problems. Enter the names of the input, output, the number of threads, and the methods to use into the R environment before running the First.Align.All function (this step is required because the function is performed in parallel). Note: we remind users that one of the alignment programs, PASTA, is currently not available for Windows users (see ?First.Align.All for more information).

input = "AdvTop1_ProbSeq/Alignments/AllSeq"
output = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign"
nthread = 5
methods = c("mafftfftnsi")

Run the function.

First.Align.All(input = "AdvTop1_ProbSeq/Alignments/AllSeq",
    output = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign",
    nthread = 5, methods = c("mafftfftnsi"))

For Windows users, the "Mafft.path" must be provided as well (remember that backslashes must be converted into slashes).

Mafft.path = "C:/Users/david/Program/mafft/mafft-7.409-win64-signed/mafft-win/mafft" input = "AdvTop1_ProbSeq/Alignments/AllSeq" output = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign" nthread=5 methods = c("mafftfftnsi")

First.Align.All(input = "AdvTop1_ProbSeq/Alignments/AllSeq",
    output = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign",
    nthread = 5, methods = c("mafftfftnsi"),
    Mafft.path = "C:/Users/david/Program/mafft/mafft-7.409-win64-signed/mafft-win/mafft")

The list of sequences that have been reverse complemented is stored in "AdvTop1_ProbSeq/Alignments/AllSeq/ListSeq_RevCompl_FirstAlignAll.txt"; it can be inspected directly in R, as shown below.
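A minimal sketch, without assuming the file's exact column layout:

# Display the raw content of the reverse-complement report.
readLines("AdvTop1_ProbSeq/Alignments/AllSeq/ListSeq_RevCompl_FirstAlignAll.txt")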

2) Detection of potential outlier sequences in a gene region alignment

Here we use the CO1 alignment as an example to illustrate the Detect.Outlier.Seq function. This function helps to detect potential outlier sequences that should be removed from the pool of sequences before selecting the best sequence (i.e. mis-aligned sequences, which might be caused by different problems such as gene or species annotation errors, or the presence of paralogous sequences). This function is only provided as a tool to HELP detect potential outlier sequences; great care must be taken to ensure that real outliers are not missed and that too many sequences are not wrongly flagged as outliers. We recommend checking alignments by eye to ensure your confidence in the retained sequences. We run the function with the 'Comb' option, a distance threshold of 0.6, and disable the search for secondary outlier sequences (for more details about the function options see ?Detect.Outlier.Seq).

C01_MisAlign0.6_1 = Detect.Outlier.Seq(
    inputal = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign/Mafftfftnsi_AllSeq_co1.fas",
    Strat.DistMat = "Comb", Dist.Th = 0.6,
    output = "AdvTop1_ProbSeq/Mafftfftnsi_AllSeq_co1_outlier_1.txt",
    Second.Outlier = "No")

The sequence “Coryphaenoides_armatus|4862685|AB018233|NA” is reported as a potential primary outlier.

C01_MisAlign0.6_1

##                             PrimaryOulierSeq Dist.Prop.Comb
## 1 Coryphaenoides_armatus|4862685|AB018233|NA              1

We then re-run the function with the 'Comb' option and a distance threshold of 0.6 (i.e. 60%), this time allowing a search for secondary outlier sequences using a local BLAST database and a bitscore threshold of 0.8 (i.e. 80%).

C01_MisAlign0.6_2 = Detect.Outlier.Seq(
    inputal = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign/Mafftfftnsi_AllSeq_co1.fas",
    Strat.DistMat = "Comb", Dist.Th = 0.6,
    output = "AdvTop1_ProbSeq/Mafftfftnsi_AllSeq_co1_outlier_2.txt",
    Second.Outlier = "Yes", Bitsc.Th = 0.8)

C01_MisAlign0.6_2 # no other secondary sequence has been detected

##                                Query_SeqName SeqLen_Query
## 1 Coryphaenoides_armatus|4862685|AB018233|NA          444
##                                  Hit_SeqName Hit_SeqLen evalue pident
## 1 Coryphaenoides_armatus|4862685|AB018233|NA        444      0    100
##   length mismatch gapopen qstart qend sstart send bitscore qcovs
## 1    444        0       0      1  444      1  444      821   100
##   BitScore.Prop
## 1             1

Decreasing the distance threshold to 'Dist.Th = 0.4', we detect 4 primary outlier sequences (see below) that should be removed.

C01_MisAlign0.4_1 = Detect.Outlier.Seq(
    inputal = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign/Mafftfftnsi_AllSeq_co1.fas",
    Strat.DistMat = "Comb", Dist.Th = 0.4,
    output = "AdvTop1_ProbSeq/Mafftfftnsi_AllSeq_co1_outlier_1a.txt",
    Second.Outlier = "No")

C01_MisAlign0.4_1

##                              PrimaryOulierSeq Dist.Prop.Comb
## 1       Aplodactylus_arctidens|NA|AF092140|NA      0.4305831
## 2       Aplodactylus_arctidens|NA|AF202547|NA      0.4305764
## 3 Aplodactylus_etheridgii|4876360|AF133061|NA      0.4184084
## 4  Coryphaenoides_armatus|4862685|AB018233|NA      1.0000000

After inspecting the alignment by eye in Seaview (Gouy et al. 2010), we identified two other sequences that must be removed because they were too divergent from other conspecific sequences and from BLAST results.

• “Aldrovandia_affinis|3191061|NA|NA”

• “Halosauropsis_macrochir|3214910|NA|NA”

Finally, the 6 outlier sequences (and the resulting gap-only positions) can be removed using the Rm.OutSeq.Gap function.

Mafftfftnsi_AllSeq_Clean_co1 = Rm.OutSeq.Gap(input = c(as.character(C01_MisAlign0.4_1[, 1]),
    "Aldrovandia_affinis|3191061|NA|NA",
    "Halosauropsis_macrochir|3214910|NA|NA"),
    SeqInput = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign/Mafftfftnsi_AllSeq_co1.fas",
    AligoutputName = "AdvTop1_ProbSeq/Alignments/AllSeq/FirstAlign/Mafftfftnsi_AllSeq_Clean_co1")

# The output alignment is a "DNAbin" object.
class(Mafftfftnsi_AllSeq_Clean_co1)

## [1] "DNAbin"

Advanced topic 2: including species without DNA and resolving polytomies using BEAST2

regPhylo functions allow you to include taxa without DNA sequences and to use BEAST2 as a polytomy resolver based on taxonomic constraints (Kuhn et al. 2011). Here, we show how this can be done in 2 steps:

• 1) Export a supermatrix with the additional taxa without DNA to prepare the baseline xml file in BEAUti for BEAST2.
• 2) Edit the baseline xml file to include hard topological constraints plus the new constraints for the taxa without DNA.

We complete our previous 30-species phylogeny by including four taxa without DNA: one constrained at the family level, and the other three constrained at the genus level.

TaxaNoDNA = cbind(c("Ariosoma_sp1", "Epinephelus_sp1", "Coryphaenoides_sp1", "Coryphaenoides_sp2"),
    c("Family", "Genus", "Genus", "Genus"),
    c("Congridae", "Epinephelus", "Coryphaenoides", "Coryphaenoides"))
colnames(TaxaNoDNA) = c("SpeciesName", "hier.level", "ConstraintName")
TaxaNoDNA = as.data.frame(TaxaNoDNA)
TaxaNoDNA

##          SpeciesName hier.level ConstraintName
## 1       Ariosoma_sp1     Family      Congridae
## 2    Epinephelus_sp1      Genus    Epinephelus
## 3 Coryphaenoides_sp1      Genus Coryphaenoides
## 4 Coryphaenoides_sp2      Genus Coryphaenoides

A) Export a supermatrix including the taxa without DNA to prepare the baseline xml file in BEAUTi

All data created in this section will be stored in the directory called "AdvTop2_NoDNA".

dir.create("AdvTop2_NoDNA")

We use the function Align.Concat to build the new supermatrix including the 4 taxa without DNA. We create a new directory called “ForConcat_noDNA” to store the new supermatrix and alignments, and we copy all the trimmed alignments produced in step 9 into this new folder.

# Create the folder.
dir.create("AdvTop2_NoDNA/ForConcat_noDNA")
list.ali = list.files("Tuto_regPhylo/Alignments/OneBest.Geo/Trimmed_Gblocks")

# Copy the files into the new folder.
file.copy(paste("Tuto_regPhylo/Alignments/OneBest.Geo/Trimmed_Gblocks/", list.ali, sep = ""),
    "AdvTop2_NoDNA/ForConcat_noDNA")

## [1] TRUE TRUE TRUE TRUE TRUE

# Define the species list of the 4 new taxa without DNA.
SpListNoDNA = as.character(TaxaNoDNA[, 1])

# Run the function to build a supermatrix including the 4 species without DNA.
Align.Concat(input = "AdvTop2_NoDNA/ForConcat_noDNA",
    Sp.List = SpListNoDNA,
    outputConcat = "AdvTop2_NoDNA/ForConcat_noDNA/Concat_NoDNA")

##      Name.PartitionFinder2 Common.Gene.Name
## [1,] "gene1"               "16srrna"
## [2,] "gene2"               "co1"
## [3,] "gene3"               "plagl2"
## [4,] "gene4"               "rag1"
## [5,] "gene5"               "myh6"

# The new supermatrix can be imported into R with the following command:
require(ape)
Supermatrix_NoDNA = read.dna("AdvTop2_NoDNA/ForConcat_noDNA/Concat_NoDNA.fas", format = "fasta")

# Display the species names in the new supermatrix.
labels(Supermatrix_NoDNA)

## [1] "Ablennes_hians" "Aldrovandia_affinis" ## [3] "Aplodactylus_arctidens" "Aplodactylus_etheridgii" ## [5] "Ariosoma_sp1" "Arnoglossus_scapha" ## [7] "Asterorhombus_filifer" "Bassanago_bulbiceps" ## [9] "Coelorinchus_bollonsi" "Coelorinchus_fasciatus" ## [11] "Conger_verreauxi" "Coryphaenoides_armatus" ## [13] "Coryphaenoides_murrayi" "Coryphaenoides_sp1" ## [15] "Coryphaenoides_sp2" "Coryphaenoides_striaturus" ## [17] "Coryphaenoides_subserrulatus" "Epigonus_denticulatus" ## [19] "Epigonus_robustus" "Epinephelus_daemelii" ## [21] "Epinephelus_octofasciatus" "Epinephelus_rivulatus" ## [23] "Epinephelus_sp1" "Euleptorhamphus_viridis" ## [25] "Halosauropsis_macrochir" "Omosudis_lowii" ## [27] "Platybelone_argalus" "Pterois_antennata" ## [29] "Scopelarchus_analis" "Scorpaena_cardinalis" ## [31] "Scorpaena_onaria" "Stemonosudis_macrura" ## [33] "Vinciguerria_nimbaria" "Vinciguerria_poweriae"

To include the best partitioning scheme for the new supermatrix based on PartitionFinder2, there are two possibilities:

• 1) Copy and paste "manually" the previous partitioning scheme defined during step 11 from the file "Alignments/OneBest.Geo/ForConcat/Concat_PF2.nex".

• 2) Re-run PartitionFinder2 using the PartiFinder2 function as below. Warning: the path to PartitionFinder2 needs to be provided in "Path.PartiF2", and must be adapted to your configuration.

# Re-run PartitionFinder2 to determine the best partitioning scheme.
# First we change the working directory to avoid interaction with the previous
# run of PartitionFinder2.
setwd("AdvTop2_NoDNA")

# Run the function.
PartiFinder2(input = "ForConcat_noDNA/Concat_NoDNA.fas",
    Partition = "ForConcat_noDNA/Partitions_Concat.txt",
    codon = c(2:5),
    nexus.file = "ForConcat_noDNA/Concat_NoDNA.nex",
    Path.PartiF2 = "/home/davidpc/Programs/PartitionFinder2/partitionfinder-2.1.1/PartitionFinder.py",
    branchlengths = "linked", models = "all", model_selection = "BIC",
    search = "greedy", Raxml = "TRUE", nthread = 5)

# Re-set the normal working directory.
setwd("..")

The nexus output of the PartiFinder2 function for the supermatrix, "AdvTop2_NoDNA/ForConcat_noDNA/Concat_NoDNA_PF2.nex", can be loaded in BEAUTi to produce the baseline xml file as done in step 13. Here, a baseline xml file including the supermatrix with the 4 taxa without DNA is provided in "Tuto_regPhylo/AdvTop2_NoDNA/Baseline_noDNA_8GTR.xml".

B) Edit the baseline xml file including hard topological constraints plus the new constraints for the taxa without DNA

We use the same soft topological constraints as in step 12 and the same 30-species phylogenetic tree produced in step 14.A based on those constraints (the re-rooted RAxML tree loaded below).

require(ape)
TreeRooted = read.nexus("Tuto_regPhylo/Trees/RAxML/RAxML_bipartitions.Tree_RAxML_autoMRE_ReRooted")

We use the function MultiTopoConst.EditXML4BEAST2 to include all the constraints. To do this, we use the same objects as in step 14.B: the "input.tree" with the bootstrap scores ("TreeRooted"), the classification table ("SpList.Classif"), and the baseline xml file ("AdvTop2_NoDNA/Baseline_noDNA_8GTR.xml"), but we additionally provide a 3-column table defining the constraints for the 4 taxa without DNA. In this example, we use the object "TaxaNoDNA" defined above. A new taxonomic table and a new constrained tree will be exported, both including the 4 new taxa.

# Get the classification table.
SpList.Classif = read.csv("Tuto_regPhylo/SpeciesList_Classification.csv", h = TRUE)

# Run the function to edit the baseline .xml file.
MultiTopoConst.EditXML4BEAST2(inputtree = TreeRooted,
    output = "AdvTop2_NoDNA/Baseline_noDNA_8GTR_w.xml",
    bootstrapTH = 100, xmltreename = "ConcatF",
    input.xml = "Tuto_regPhylo/AdvTop2_NoDNA/Baseline_noDNA_8GTR.xml",
    Partitions = "TRUE", TaxaNoDNA = TaxaNoDNA,
    TaxoTable = SpList.Classif,
    output.new.TaxoTable = "AdvTop2_NoDNA/New_ClassifDF.csv",
    output.new.tree = "AdvTop2_NoDNA/New_BackboneTree.txt")

## [1] TRUE TRUE

# Plot the newly constrained tree including the 4 taxa without DNA.
NewTree = read.tree("AdvTop2_NoDNA/New_BackboneTree.txt")
plot(NewTree, cex = 0.6)

[Figure: plot of the newly constrained backbone tree, showing the 34 tips including the 4 taxa without DNA.]

# The position of the 4 taxa without DNA can differ among runs,
# because the grafting of these species involves a degree of randomness
# within the defined taxonomic constraints.

# To see the new classification table including the 4 taxa without DNA.
NewClassifDF = read.delim("AdvTop2_NoDNA/New_ClassifDF.csv", sep = "\t", header = TRUE)
# Now there are 34 rows and still 23 columns.
dim(NewClassifDF)

## [1] 34 23
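We can also check that the 4 new taxa appear in the new classification table; because we do not assume the exact column names, nor whether species names are stored with spaces or underscores, we simply search the whole table for both forms.

# Check that each taxon without DNA occurs somewhere in the new classification table.
newTaxa = as.character(TaxaNoDNA$SpeciesName)
sapply(newTaxa, function(x)
    any(NewClassifDF == x | NewClassifDF == gsub("_", " ", x), na.rm = TRUE))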

The output xml file "AdvTop2_NoDNA/Baseline_noDNA_8GTR_w.xml" can be opened in BEAUTi. From here, users may decide to perform conventional tree dating with BEAST2. However, regPhylo also includes a function that helps to set up a more advanced calibration approach in BEAST2 using the CLADEAGE approach (Matschiner et al. 2017) (see step 15).

References

• Bouckaert R., Heled J., Kühnert D., Vaughan T., Wu C.H., Xie D., . . . Drummond A.J. (2014) BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Computational Biology, 10, 1–6.
• Capella-Gutiérrez S., Martínez J.M.S., & Gabaldón T. (2009) TrimAl: a tool for automatic alignment trimming. Bioinformatics, 25, 1972–1973.
• Castresana J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution, 17, 540–552.
• Deck J., Gaither M.R., Ewing R., Bird C.E., Davies N., Meyer C., . . . Crandall E.D. (2017) The Genomic Observatories Metadatabase (GeOMe): A new repository for field and sampling event metadata associated with genetic samples. PLoS Biology, 15, 1–7.
• Eme D., Anderson M.J., Struthers C.D., Roberts C.D., & Liggins L. (2019) An integrated pathway for building regional phylogenies for ecological studies. Global Ecology and Biogeography, Accepted.
• Gouy M., Guindon S., & Gascuel O. (2010) Seaview version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution, 27, 221–224.
• Kuhn T.S., Mooers A., & Thomas G.H. (2011) A simple polytomy resolver for dated phylogenies. Methods in Ecology and Evolution, 2, 427–436.
• Lanfear R., Frandsen P.B., Wright A.M., Senfeld T., & Calcott B. (2017) PartitionFinder 2: New Methods for Selecting Partitioned Models of Evolution for Molecular and Morphological Phylogenetic Analyses. Molecular Biology and Evolution, 34, 772–773.
• Lassmann T. & Sonnhammer E.L.L. (2005) Automatic assessment of alignment quality. Nucleic Acids Research, 33, 7120–7128.
• Matschiner M., Musilová Z., Barth J.M.I., Starostová Z., Salzburger W., Steel M., & Bouckaert R. (2017) Bayesian phylogenetic estimation of clade ages supports trans-Atlantic dispersal of cichlid fishes. Systematic Biology, 66, 3–22.
• Stamatakis A. (2014) RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
