bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Multi-rate Poisson Tree Processes for single-locus species delimitation under Maximum Likelihood and Markov Chain Monte Carlo. P. Kapli 1∗, S. Lutteropp 1,2, J. Zhang 1, K. Kobert 1, P. Pavlidis 3, A. Stamatakis 1,2∗ and T. Flouri 1,2∗ 1The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-68159 Heidelberg, Germany, 2Department of Informatics, Institute of Theoretical Informatics, Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany, 3Institute of Molecular Biology and Biotechnology (IMBB), Foundation of Research and Technology, Nikolaou Plastira 100 GR-70013 Heraklion, Crete, Greece

ABSTRACT delimitation is a critical task in systematic studies with potential Motivation: In recent years, molecular species delimitation implications in all subfields of biology that involve evolutionary has become a routine approach for quantifying and classifying relationships. In line with the species concept of De Queiroz biodiversity. Barcoding methods are of particular importance in large- (2007) and the integrative approach (Dayrat, 2005), a scale surveys as they promote fast species discovery and biodiversity reliable identification of a new species involves data from multiple estimates. Among those, distance-based methods are the most sources (e.g., ecology, morphology, evolutionary history). Such common choice as they scale well with large datasets; however, an approach is necessary for meticulous comparative evolutionary they are sensitive to similarity threshold parameters and they ignore studies but is tremendously difficult to apply, if at all possible, evolutionary relationships. The recently introduced “Poisson Tree in large biodiversity studies (e.g., vast barcoding, environmental, Processes” (PTP) method is a phylogeny-aware approach that does microbial samples). In such studies, researchers use alternative units not rely on such thresholds. Yet, two weaknesses of PTP impact its of comparison that are easier to delimit [Molecular Operational accuracy and practicality when applied to large datasets; it does not Taxonomic Units (Blaxter et al., 2005), Recognizable Taxonomic account for divergent intraspecific variation and is slow for a large Units (Oliver and Beattie, 1993)] rather than going through the number of sequences. cumbersome task of delimiting and describing full species. The Results: We introduce the multi-rate PTP (mPTP), an improved correspondence of these units to real species remains ambiguous method that alleviates the theoretical and technical shortcomings of (Casiraghi et al., 2010), especially in the absence of additional PTP. It incorporates different levels of intraspecific genetic diversity information. However, they are practical for biodiversity and β- deriving from differences in either the evolutionary history or sampling diversity estimates (Valentini et al., 2009). of each species. Results on empirical data suggest that mPTP With the introduction of DNA-barcoding (Hebert et al., 2003) is superior to PTP and popular distance-based methods as it, and the advances in coalescent theory (Hickerson et al., 2010), consistently, yields more accurate delimitations with respect to the genetic data became the most popular data source for delimiting taxonomy (i.e., identifies more taxonomic species, infers species species. Several algorithms and implementations for molecular numbers closer to the taxonomy). Moreover, mPTP does not require species delimitation exist, most of which are inspired by the any similarity threshold as input. The novel dynamic programming phylogenetic species concept (Zhang et al., 2013; Fujisawa and algorithm attains a speedup of at least five orders of magnitude Barraclough, 2013; Yang and Rannala, 2014) and the DNA compared to PTP, allowing it to delimit species in large (meta-) barcoding concept (Puillandre et al., 2012; Hao et al., 2011; barcoding data. In addition, Markov Chain Monte Carlo sampling Edgar, 2010). These methods address different research questions provides a comprehensive evaluation of the inferred delimitation in (Casiraghi et al., 2010); thus, the user needs to assess several just a few seconds for millions of steps, independently of tree size. factors before choosing the most appropriate method for a particular Availability: mPTP is implemented in C and is available for download study. The “species-tree” approaches rely on multiple genetic loci at http://github.com/Pas-Kapli/mptp under the GNU Affero (Yang and Rannala, 2014; Jones et al., 2015) and account for 3 license. A web-service is available at http://mptp.h-its.org. potential species-tree/gene-tree incongruence (Maddison, 1997). Contact: [email protected], Alexandros.Stamatakis@h- Such methods are the most appropriate when the goal is to its.org, [email protected] identify and describe new species. However, most current implementations (Yang and Rannala, 2014; Jones et al., 2015) are computationally demanding and can only be applied to small datasets of closely related taxa, becoming impractical with a 1 INTRODUCTION growing number of species and/or loci [see Fujisawa et al. (2016) Species are fundamental units of life and form the most common for a recently introduced faster method]. Large-scale biodiversity basis of comparison in evolutionary studies. Therefore, species (meta-) barcoding studies comprise hundreds or even thousands of samples of high evolutionary divergence. The goal of such studies ∗to whom correspondence should be addressed often is to obtain β-diversity estimates (comparative studies of

1 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Kapli et al

different treatments, ecological factors, etc.) or a rough estimate of a broader range of empirical datasets. In addition, we develop of the biodiversity for a given sample. Hence, a researcher may and present a novel, dynamic-programming algorithm for the mPTP use distance-based methods (Puillandre et al., 2012; Hao et al., model. The implementation is several orders of magnitude faster 2011; Edgar, 2010) that scale well on large datasets with respect than the original PTP and yields more accurate delimitations almost to run times. However, these methods ignore the evolutionary instantly on large datasets comprising thousands of taxa. Note that, relationships of the involved taxa and rely on not necessarily the original PTP requires days of computation time on a desktop biologically meaningful ad hoc sequence similarity thresholds. to analyze such datasets. Finally, we provide an Markov chain Moreover, they are restricted to single-locus data, which reduces Monte Carlo (MCMC) delimitation sampling approach that allows species delimitation accuracy (Dupuis et al., 2012). for inferring delimitation support values. The General Mixed Yule Coalescent (GMYC; Pons et al., 2006; Fujisawa and Barraclough, 2013) and the recently introduced Poisson Tree Processes (PTP; Zhang et al., 2013) are two 2 METHODS similar models that bridge the gap between “species-tree” and In this section we introduce the basic notation that is used throughout distance-based methods for species delimitation. While GMYC the manuscript, and provide a detailed description of the mPTP algorithm and PTP are also restricted to single loci, they do take the including the MCMC sampling method. evolutionary relationships of the sequences into account. Since, A binary rooted tree T = (V,E) is a connected acyclic graph where V is at the same time, they are computationally inexpensive they can the set of nodes and E the set of branches (or edges), such that E ⊂ V × V . be deployed for analyzing large (meta-)barcoding samples. The Each inner node u has a degree (number of branches for which u is an end- GMYC method (Fujisawa and Barraclough, 2013) uses a speciation point) of 3 with the exception of the root node which has degree 2, while (Yule, 1925) and a neutral coalescent model (Hudson, 1990). It leaves (or tips) have degree 1. We use the notation (u, v) ∈ E to denote strives to maximize the likelihood score by separating/classifying an branch with end-points u, v ∈ V , and ` : V × V → R to denote the the branches of an ultrametric tree (in units of absolute or associated branch length. Finally, we use Tu to denote the subtree rooted at relative ages) into two processes; within and between species. In node u. contrast to GMYC, PTP models the branching processes based 2.1 Multi-rate PTP (mPTP) heuristic algorithm on the number of accumulated expected substitutions between Let T = (V,E) be a binary rooted phylogenetic tree with root node r. The subsequent speciation events. PTP tries to determine the transition optimization problem in the original PTP is to find a connected subgraph point from a between- to a within-species process by assuming G = (Vs,Es) of T , where Vs ⊆ V , Es ⊆ E, r ∈ Vs such that (a) that a two parameter model, with one parameter for speciation G is a binary tree, and (b) the likelihood of i.i.d branch lengths Es and and one parameter for the coalescent process best fits the data. Ec = E \ Es fitting two distinct exponential distributions is maximized. The underlying assumption is that each substitution has a small Formally, we are interested in maximizing the likelihood probability of generating a branching event. Within species,     ˆ ˆ branching events will be frequent whereas among species they will Y ˆ −λs`(x) Y ˆ −λc`(x) L =  λse  ×  λce  (1) be more sparse.The probability of observing n speciations for k x∈Es x∈Ec substitutions follows a Poisson process and therefore, the number −1 −1  1 P   1 P  of substitutions until the next speciation event can be modeled via where λˆs = `(x) and λˆc = `(x) |Es| x∈Es |Ec| x∈Ec an exponential distribution (Zhang et al., 2013). Given that PTP are maximum likelihood (ML) estimates for the rate parameters. directly uses substitutions, it does not require an ultrametric input In the multi-rate PTP (mPTP), we are interested in fitting the branch tree, the inference of which can be time consuming and error prone. lengths of each maximal subtree of T (each species delimited in T ) formed Thus, PTP often yields more accurate delimitations than GMYC by branches from Ec to a distinct exponential distribution. Let T1 = (Tang et al., 2014). (V1,E1),T2 = (V2,E2),...,Tk = (Vk,Ek) be the maximal subtrees of T formed exclusively by branches from E such that tk E = E and Here, we introduce a new algorithm and an improved model as c i=1 i c no pair i, j exists for which Vi ∩ Vj 6= ∅. The task in mPTP is to maximize well as implementation of PTP that alleviates previous shortcomings the likelihood of the method. The initial PTP assumes only one exponential k distribution for the speciation events and only one for the coalescent Y Y −λˆ `(x) Y −λˆs`(x) L = λˆie i × λˆse (2) events, across all species in the phylogeny. While the speciation rate i=1 x∈Ei x∈Es can be assumed to be constant among closely related species, the −1  1 P  intraspecific coalescent rate and consequently the genetic diversity where λˆi = `(x) for 1 ≤ i ≤ k. We refer to the k |Ei| x∈Ei may vary significantly even among sister species. This divergence in maximal subtree root nodes as coalescent roots. intraspecific variation can be attributed to factors, such as population Whether a polynomial-time algorithm exists to solve the two problems size and structure, population bottlenecks, selection, life cycle remains an open question. We assume that the problem is hard, and thus, and mating systems (see Bazin et al. (2006) for further details). we propose a greedy, dynamic-programming (DP) algorithm to solve the Additionally, sampling bias may also be responsible for observing problem. The algorithm visits all inner nodes of T in a bottom-up postorder different levels of intraspecific genetic diversity and it is already traversal, that is, a node is visited only after both of its child nodes have been visited. For each node u, the DP computes an array of |E | + 1 scores, known to decrease the accuracy of PTP (Zhang et al., 2013). To u where E is the set of branches of subtree T . Entry 0 ≤ i ≤ |E | assumes incorporate the potential divergence in intraspecific diversity, we u u u that Tu contains i between-species branches. For the case i = 0, Tu is propose the novel multi-rate Poisson Tree Processes (mPTP) model. part of a coalescent process, while for i = |Eu|, Tu is part of the single In contrast to PTP, it fits the branching events of each delimited between-species process. Each entry i > 0 contains the maximization of species to a distinct exponential distribution. Thereby it can better Eq. 2, by considering only (i) the branches of subtree Tu and (ii) the set accommodate the sampling- and population-specific characteristics S of branches of T that have at least one end-point on the path from the

2 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Multi-rate Poisson Tree Processes (mPTP)

r 0 1 2 · ·· |Er | 2.2 MCMC sampling We deploy an MCMC approach to obtain delimitation support values for each node. In each step, we propose a new delimitation by designating a set 0 1 2 · ·· |Eu| of coalescent roots and compute its score. We denote the current delimitation as θ and the proposed delimitation, after applying a move to θ, as θ0. We use u i the acceptance ratio L0 g(θ0 → θ) α = c × 0 Lc g(θ → θ ) to decide whether to keep the proposed delimitation θ0 or not. If α ≥ 1, 0 1 2 · ·· |Ev | 0 1 2 · ·· |Ew| we accept the delimitation, otherwise we accept it with probability α. The 0 0 j v w k term Lc/Lc is the ratio of the AICc-corrected likelihood values of the θ over θ, and g(x → y) is the transition probability from one delimitation, x, to any other delimitation y. Given θ, we propose θ0 by applying one of two possible moves with equal probability. The first move is to split a coalescent clade by placing a randomly selected coalescent root node u (and its two out-going branches) in the set of nodes (and branches) corresponding to the speciation process. The second type of move selects a node u whose two Fig. 1. Visual representation of the mPTP dynamic programming algorithm. child nodes are coalescent roots, joins their branch sets into one, and sets Each entry i at node u is computed from information stored at entries j and u as the new coalescent root. The support value for a particular node u is k of child nodes v and w, for all j and k such that i = j + k + 2. The the sum of Akaike weights (Akaike, 1978; Burnham and Anderson, 2002) of dashed branches denote the smallest set S of branches which, by definition, all MCMC samples for which u is part of the speciation process normalized must be part of the between-species process, irrespective of the resolution of by the sum of all Akaike weights. The method’s run-time is linear to the other subtrees outside T . u number of MCMC steps and independent of the tree size. For a sketch of the algorithm see Paragraph 3 of Supplement I. It is considered good practice that MCMC results are accompanied by a critical convergence assessment. Arguably, a good way of accomplishing root r to u. For this subset of branches, the between-species process consists this is to compare samples obtained from independent MCMC runs. of exactly |S| + i branches, and we only consider the coalescent processes We use the average standard deviation of delimitation support values inside Tu. We restrict ourselves to the set of branches S since it is infeasible (ASDDSV) for quantifying the similarity among such samples. Inspired by to test all possible groupings of branches outside of Tu. For any i > 0, it the average standard deviation of split frequencies (ASDSF) technique from holds that the edges from u leading to its child nodes v and w belong to phylogenetic inference (Ronquist et al., 2012), we calculate the ASDDSV the between-species process, and therefore, S is the smallest set of edges by averaging the standard deviation of per-node delimitation support values which, by definition, must be part of the between-species process. Figure 1 across multiple independent MCMC runs that explore the delimitation space. illustrates the algorithm for a particular triplet of nodes u, v, w, and marks Each such MCMC run starts from a randomly generated delimitation. the set S with dashed lines. Entry 0 represents the null-model for Tu, that Similarly to ASDSF, ASDDSV approaches zero as runs converge to the same is, all branches of Tu belong to the coalescent process [or, equivalently, distribution of delimitations. to the speciation process, as the two cases can not be distinguished by the ASDDSV is useful for monitoring that chains converge to the same (m)PTP model]. The score (maximization of Eq. 2) for entry i ≥ 1 is distribution, but does not imply that they have globally converged. It is also computed from the information stored in the array entries j and k of the desirable to compare the delimitation support values of an MCMC run with two child nodes v and w, by considering all combinations of j and k such the ML estimate. For this, we introduce the average support value (ASV), that i = j + k + 2 (plus two accounts for the two outgoing edges of u). that is,   Not all array entries are necessarily valid. For instance, entry i = 1 may be 1 X X  f(u) + 1 − f(u) , invalid given that node u has two out-going branches in the between-species |T | process. Similarly, it is possible that two invalid entries j and k exist for a u∈VS u∈ /VS given value of i. Each entry i at node u stores the computed score, the sum of where VS is the set of the speciation process nodes and f(u) the support i i between-species branch lengths σu within Tu, the product Lu of likelihoods value of a speciation node u. The closer ASV is to one, the better the support of coalescent processes inside Tu, and pointers for storing which entries j values agree with the ML delimitation. and k were chosen for calculating a specific entry i. The first term of Eq. 2 j k (coalescent process) is the product of Lv and Lw, while the second term j k (speciation process) is computed from the sums σv and σw and the sum of branch lengths from S. We store the score and child node entry indices for 3 EXPERIMENTAL SETUP the combination of j and k that maximize the score. Once the root node of T For the evaluation of mPTP we use empirical datasets that comprise a is processed, we have the likelihood values of exactly |E| + 1 delimitations, number of varying parameters (substitution rate, geographic range, genetic one of which corresponds to the null-model (i = 0). We select the entry diversity etc.), which are difficult to reproduce with simulations. We i that minimizes the Akaike Information Criterion score corrected for small retrieved all datasets from the Barcode of Life Database (BOLD, http:// sample size (AICc; McQuarrie and Tsai (1998)). This way we penalize over- www.boldsystems.org/), the largest barcode library for eukaryotes. splitting caused by the increasing number of parameters (λ1, λ2, . . . , λk). For each of the datasets we inferred the putative species with mPTP A lower AICc value corresponds to a model which better explains the data. and four other methods that model speciation on the basis of genetic Once the best AICc corrected entry i at the root has been determined, we distance (including the original PTP method), and assess their accuracy with perform a backtracking procedure using the stored child node pointers to respect to the current taxonomy (available in BOLD). Finally, we avoided retrieve the coalescent roots and hence obtain the species delimitation. The comparisons with time-based species delimitation methods (Fujisawa and average asymptotic run-time complexity of the method is O(n2) with a Barraclough, 2013; Yang and Rannala, 2014; Jones et al., 2015) which worst-case of O(n3), where n is the number of nodes in T (for the proofs are time consuming and heavily dependent on the factorization accuracy of see paragraph 1, Supplement I). branch lengths into time and evolutionary rate.

3 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Kapli et al

Table 1. Main characteristics of empirical test datasets. NSp: Number singletons also interfere with the efficiency of delimitation methods of species, NMSp: Number of Monophyletic species, NSeq: Number (Puillandre et al., 2012). of Sequences, AL: Alignment Length, APD: Average P-Distance. For each dataset, we inferred putative species using the two PTP models and we compared the results to three popular and well-established distance- based methods; ABGD (Puillandre et al., 2012), Uclust (Edgar, 2010) and Phylum NSp NMSp NSeq AL APD (%) Crop (Hao et al., 2011). The input format of the data and parameters differ among the five methods. The two PTP versions require a rooted Amynthas Annelida 43 28 265 1539 16.7 Phyllotreta Arthropoda 16 14 195 659 11.8 phylogenetic tree. Therefore, we used Mafft v7.123 (Katoh and Standley, Atheta Arthropoda 84 60 399 1076 14.2 2013) to align the sequences of each dataset, and subsequently RAxML Philodromus Arthropoda 15 11 493 713 10.2 under the default algorithm and the GTR+Γ model, to infer an ML tree. Digrammia Arthropoda 20 17 286 659 6.2 To root each tree, we chose outgroup taxa based on previously published Carabus Arthropoda 49 34 514 880 11.5 phylogenies. For the genera without such prior phylogenetic knowledge, we Bembidion Arthropoda 97 80 532 1495 12.9 selected a representative species of a genus belonging to the same family Calcinus Arthropoda 29 24 641 679 16.6 Xysticus Arthropoda 35 33 861 1238 8.9 or tribe given the taxonomy in BOLD (NCBI Accession Numbers of the Gammarus Arthropoda 53 27 755 1082 21.4 ingroup and outgroup sequences are provided in Supplement II). Moreover, Clubiona Arthropoda 35 32 1127 1256 8.4 we implemented a method that identifies and ignores branches that result Culicoides Arthropoda 100 76 1252 1485 17.5 from identical sequences, during the delimitation process (for more details Balanus Arthropoda 4 3 1775 1548 2.3 see paragraph 2 and Figures 1 and 2, Supplement I). Drosophila Arthropoda 148 109 2303 1589 11.2 For mPTP, we further assessed the confidence of the ML solution using Anopheles Arthropoda 121 68 2741 1544 10.5 2×107 Anolis Chordata 41 35 181 709 18.9 MCMC sampling. For each dataset, we executed ten MCMC runs of Coryphopterus Chordata 12 10 282 658 14.7 steps, each starting from an initial random delimitation. We estimated the Myotis Chordata 54 37 789 1542 11.7 convergence of the independent runs by calculating the ASDDSV. To obtain Cyanea Cnidaria 5 4 92 1002 12.7 an overall support for the ML estimate, we computed the mean ASV over all Ophiura Echinodermata 13 8 75 1605 16.8 ten independent runs. Holothuria Echinodermata 18 12 355 1553 13.6 For each of the remaining three methods (ABGD, Uclust and Crop), Mopalia 20 11 304 708 10.9 we optimized a set of performance-critical parameters and chose the Rhagada Mollusca 24 12 686 675 11.1 Echinococcus Platyhelminthes 7 5 316 1608 5.1 delimitation that recovered the highest number of taxonomic species. ABGD requires two parameters: (i) the prior limit to intraspecific diversity (P ) and (ii) a proxy for the minimum barcoding gap among the inter- and intraspecific genetic distances (X). We ran ABGD with the aligned sequences and 400 parameter combinations. For P we sampled 100 values 3.1 Empirical Datasets from the range h0.001, 0.1i and for X we used the four values 0.05, 0.1, We retrieved 24 empirical genus level datasets (Table 1) that cover six 0.15 and 0.2. The only critical parameter for Uclust is the fractional identity of the most species rich Phyla (Arthropoda, Annelida, Chordata, threshold (id) which defines the minimum similarity for the sequences of Echinodermata, Platyhelminthes, Cnidaria) in BOLD. This allows to each cluster. We performed 50 consecutive runs, increasing the id value by evaluate the efficiency of mPTP given a wide range of organisms with diverse 0.01 starting with 0.5. Finally, for Crop we tried four combinations of the l biological traits and evolutionary histories. The majority of datasets (14 out and u parameters that correspond to similarity thresholds of 1%, 2%, 3%, of 24 datasets) belong to the Arthropoda which is the most species-rich and 5% as suggested by the authors (http://code.google.com/p/ animal group, and the most common group PTP has been applied to so far crop-tingchenlab/). Another critical parameter for Crop is z, which (in 61 out of 110 citations). Within Arthropoda, we mainly focus on insects specifies the maximum number of sequences to consider after the so-called (8 out 14 datasets), which represents over 90% of all . The rest of the initial “split and merge”. We set z to 100 which is considered as a reasonable empirical datasets spans the remaining five phyla. value for full-length barcoding genes. All datasets comprise the 5’ end of the common animal barcoding gene “Cytochrome Oxidase subunit I” (COI-5P) (Hebert et al., 2003). The number 3.3 Comparison of delimitation methods of taxonomic species in the 24 datasets ranged from four (Balanus) to 148 The evaluation of the five methods is based on three measures. First, the (Drosophila), while the number of sequences from 75 (Ophiura) to 2741 percentage of recovered taxonomic species (RTS), that is, the percentage of (Anopheles). The number of sequences per species ranged between two delimited species that match the “true species”. We consider the taxonomic and the extreme case of 1763 for Balanus glandula, reflecting situations species retrieved from BOLD to be the “true species”, and we deem the where some species might be readily and others rarely available. Finally, performance of the algorithm better when the number of matches to the the geographical distribution of the datasets also varied from the global scale taxonomic species is higher. Since we do not have the expertise to evaluate (e.g., Anopheles, Drosophila) to the local scale (e.g., the Anolis samples the taxonomy of each dataset, we could not assess the accuracy of each originate from the islands and surrounding shores of the Caribbean sea). individual delimitation. However, by assuming that, the closer a delimitation is to the current taxonomy, the higher the probability to correspond to the real 3.2 Putative Species Delimitation species, we gain some insight as to the relative accuracy among the different The sequence files obtained from the BOLD database were preprocessed to methods. The second measure is the F-score, also known as F-measure or remove three factors that interfere with the assessment of the delimitation F1-score (Rijsbergen, 1979), that is, the harmonic mean of precision and methods. First, the efficiency of each method is measured by comparing recall measures. In species delimitation, precision denotes the fraction of the delimited to the taxonomic species (see paragraph 3.3). Therefore, it clustered sequences belonging to a single taxonomic species. The recall, was necessary to remove sequences of ambiguous taxonomic assignment. describes the fraction of sequences of a species that are clustered together. Second, we removed duplicate sequences, using the “-f c” option of RAxML The F-Score improves when decreasing (i) the number of species which are version 8.1.17 (Stamatakis, 2014), to avoid unnecessary computations. split into more than one groups and (ii) the number of taxonomic species Finally, we removed singleton species (i.e., species represented by a single lumped together into one group. The F-score ranges from 0 to 1, where 1 sequence) to eliminate the impact of random exact matches (of delimited indicates a perfect agreement among two delimitations. Finally, the third to taxonomic species) in methods that tend to oversplit. Additionally, measure is simply the number of delimited species.

4 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Multi-rate Poisson Tree Processes (mPTP)

Table 2. Percentage of RTS, F-scores and number of delimited species for the five delimitation methods (mPTP, PTP, Uclust, Crop and ABGD) for five of the empirical datasets.

mPTP PTP Usearch Crop ABGD

Genus RTS (%)

Amynthas 51 40 44 40 44 Anopheles 41 39 30 31 40 Atheta 64 60 57 55 60 Drosophila 47 51 41 34 53 Philodromus 60 27 47 53 47 Fig. 2. Average performance over all datasets of the five delimitation methods (mPTP, PTP, Uclust, Crop, and ABGD) for the A) number of species, B) F-scores and C) number of RTS F-score

Amynthas 0.784 0.638 0.649 0.674 0.673 The accuracy of either PTP model in recovering the true species depends Anopheles 0.787 0.704 0.730 0.681 0.730 on whether these taxa form monophyletic clades. Thus, for PTP and mPTP, Atheta 0.839 0.844 0.836 0.832 0.850 we also compare their performance by applying the above measures for Drosophila 0.728 0.559 0.747 0.729 0.765 monophyletic taxonomic species only. Finally, we also quantify how the Philodromus 0.882 0.717 0.812 0.852 0.828 percentage of monophyletic species correlates to the percentage of RTS for the five delimitation methods we test. Number of species

4 RESULTS Amynthas 64 104 95 91 90 We analyzed a total of 17,219 COI sequences. The alignment length Anopheles 126 218 193 154 118 ranged from 658 bp to 1620 bp, while the proportion of variable Atheta 85 100 95 95 91 nucleotides, measured by the P-Distance (MEGA v5.2; Tamura Drosophila 139 444 183 217 157 et al., 2011) averaged over the sequences of each dataset , varied Philodromus 21 38 26 20 24 from 2.3% (Balanus) to 21.4% (Gammarus). Table 1 presents the fraction of monophyletic taxonomic species in the phylogenies for each dataset. The fraction ranges from only 51% in Gammarus, the models (82% and 72% on average for mPTP and PTP, respectively), most variable of the datasets, to 94% (Xysticus). Out of the 400 indicating that polyphyly [we use the term polyphyly in referring combinations of parameters tested for ABGD, two recovered the to both paraphyly and polyphyly sensu Funk and Omland (2003)] highest number of species, with P = 0.010235 and X = 0.1 or 0.05. is a major contributing factor when taxonomic species are not For Uclust, the delimitation that maximized the number of RTS recovered (Figure 3 in Supplement I). In particular, the RTS was with the threshold value of id = 0.97. Finally, the best scoring percentage of either PTP model is highly correlated with the parameter combination for Crop (l = 1.0 and u = 1.5) corresponds percentage of monophyletic species in a dataset (Figure 3). The to the 3% similarity threshold. Pearson coefficient indicates that the correlation is stronger among Table 2, presents the efficiency of the five delimitation methods the species recovered with mPTP (rmP T P = 0.75, p-value = for five of the datasets using three measures: percentage of RTS, 2.642e-05) than with PTP (rPTP = 0.62, p-value = 0.001218). The F-score and number of species. A complete list of results for all correlation is also positive for the other three methods (rABGD = 24 datasets is given in paragraph 4.1 in Supplement I. Overall, 0.63 / p-value = 0.0009327, rUclust = 0.5 / p-value = 0.01294, rCrop mPTP and ABGD scored best in terms of RTS percentage (59% = 0.63 / p-value = 0.0009421) and comparable to PTP but smaller and 57% on average, respectively) and F-scores (0.828 and 0.819, than for mPTP. Similarly, the slope of the regression line was greater respectively) compared to the other three methods (PTP: RTS = for mPTP than for the other methods, indicating a steeper linear 53%, F-score = 0.776, Uclust: RTS = 53%, F-score = 0.79, Crop: relationship between the two variables (Figure 3). RTS = 52%, F-score = 0.791) (Figure 2). The striking difference Regarding the confidence of the mPTP delimitations, all between the five methods is in the number of delimited species. The independent MCMC runs appear to converge with an ASDDSV novel mPTP method delimited a total of 1190 species which is the below 0.01 for all datasets, except for Drosophila (one of the closest to the total number of taxonomic species (1041). In contrast, largest datasets), for which the ASDDSV was 0.048, but still in the PTP inferred 2048 species which is almost twice this number. The acceptable range for assuming convergence (Ronquist et al., 2012). other three methods yielded more conservative species numbers The ASV with respect to the ML delimitation was very high for compared to PTP (Uclust: 1671, Crop: 1663, ABGD: 1412) that all datasets, ranging from 72% (Myotis) to 99.8% (Phyllotreta), are, however, still notably higher than the mPTP estimates (Figure indicating that the data support well the ML solution (Table 2). 3 in Supplement I). The accumulated running time for all ten When considering only monophyletic species, as one might independent runs (executed sequentially on an Intel Core i7-4500U expect, the recovery percentage increases substantially for both PTP CPU @ 1.80GHz) was less than 50 seconds on average across all

5 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Kapli et al

be similarly challenging. This and previous studies (Monaghan et al., 2009; Esselstyn et al., 2012; Tang et al., 2014; Ratnasingham and Hebert, 2013) suggest that single-locus barcoding methods provide meaningful clusters, close to taxonomically acknowledged species. This makes them useful for approximate species estimation studies or when more thorough systematic research is practically impossible.

5.1 (m)PTP The molecular systematics of the example taxa, vary from well- studied [e.g., Anopheles (Harbach and Kitching, 2016), Drosophila (Bachli,¨ 2016)] to scarcely studied (e.g., Xysticus, Clubiona). They further differ in the number of species, number of sequences per species, geographic ranges, and nucleotide divergences. Despite these differences, mPTP outperformed PTP and yields substantially smaller putative species numbers as well as delimitations that are closer to the current taxonomy. The assumption of mPTP that per- species branch lengths can be fit to distinct exponential distributions increases flexibility and adjustability to more realistic datasets Fig. 3. For each method, we fit a regression line to the points of (e.g., including variable intraspecific diversity). The divergence correspondence of the percentage of monophyletic species (x-axis) to the of intraspecific diversity patterns may either be due to population percentage of RTS (y-axis). The Pearson coefficient (r) is given for each et al. correlation in the corresponding color. traits and processes (Bazin , 2006) or the uneven sampling of the species (e.g., a highly sampled species from a single location compared to a species represented by one sample from multiple locations). The latter is known to decrease PTP accuracy (Zhang datasets, which corresponds to ∼ 5 seconds per run. For a thorough et al., 2013). run-time comparison between PTP and mPTP (including the ML The accuracy of (m)PTP strongly correlates with the proportion method) please refer to paragraph 4.2 in Supplement I. of monophyletic species in the underlying phylogeny. Paraphyletic species will either be delimited into smaller groups or delimited with other, nested species. Therefore, the recovery rate of 5 DISCUSSION taxonomic species is significantly higher when only considering Molecular species delimitation has caused mixed reactions within the monophyletic species in each dataset. Among the 24 datasets, the scientific community, from those highly enthusiastic about the number of monophyletic species ranged from 51% to 94%. its potential to accelerate biodiversity cataloging (Blaxter, 2003; Thus, the lack of monophyly is the primary contributing factor Blaxter and Floyd, 2003) to those very critical about its role for not recovering taxonomic species with either PTP model. in shaping modern systematics (Bauer et al., 2010; Will et al., The average observed monophyly in our arthropod (76.4%) and 2005). The main argument between the two conflicting sides is the remaining invertebrate (64.7%) datasets corresponds well to whether molecular delimitation on its own is sufficient to justify available estimates for the same groups (73.5% and 61.4%, taxonomic rearrangements (Will et al., 2005). Integrative taxonomy respectively) based on a large number of empirical studies (Funk alleviates this conflict as it, by definition, requires multiple levels and Omland, 2003). The same study reports that the most common of evidence taking into account various biodiversity characteristics reasons for polyphyly are: inaccurate taxonomy, incomplete lineage of an organism to accept potential taxonomic changes. Within this sorting, and retrogressive hybridization (Funk and Omland, 2003; framework, and in line with the independently evolving species McKay and Zink, 2010). All three effects could apply to the concept (De Queiroz, 2007) and the phylogenetic species concept, selected datasets. Nevertheless, only the latter two are relevant to molecular species delimitation using DNA-barcoding serves as an the efficiency of the algorithm per se, while the accuracy of the excellent tool in modern taxonomy (Tautz et al., 2003; Vogler and taxonomy is only relevant when it is used as a reference measure. Monaghan, 2007). It can be easily applied to a large range of Polyphyletic species affect both PTP versions in the same way. organisms regardless of their life stage, gender or prior taxonomic The reason for the improved delimitation accuracy of monophyletic knowledge, by a broad range of researchers. In addition, barcoding species of mPTP over PTP ie because it can accommodate different genes are easily amplified from small tissue samples even from degrees of intraspecific genetic diversity within a phylogeny. poorly preserved historical samples (Austin and Melville, 2006). Furthermore, such an approach might represent the only feasible 5.1.1 MCMC Sampling The drawback of an ML approach is approach in biodiversity surveys, as they often comprise a large that it only provides a point estimate and no information on model number of species, many of which might be unknown or not easily uncertainty. The confidence about a given solution has a substantial accessible. Hence, collecting ecological or morphological traits is impact in drawing reasonable conclusions. Therefore, we provide often simply not feasible. For historical samples or samples from an MCMC method for assessing the plausibility of the ML solution. inaccessible areas (e.g., deep seas, deserts) barcoding methods are In phylogenies comprising hundreds of taxa, it is hard to obtain an equally important since the collection of life history traits might overall support for a particular delimitation hypothesis by visual

6 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Multi-rate Poisson Tree Processes (mPTP)

inspection of the tree. To alleviate this problem, at the end of also be applied to entire organelle phylogenies (e.g. mitochondria, an MCMC run we calculate the ASV for the ML solution. In Qin et al., 2015). In contrast to methods based on sequence our experiments, the ML delimitations were highly supported by similarity, mPTP does not require any similarity threshold or the ASVs for all datasets, pointing towards a unimodal likelihood other user-defined parameter as input. The limitations of mPTP surface for our model (Table 3 in Supplement I). Low ASVs are associated with processes that can not be detected neither in may be interpreted as low confidence for the given delimitation single-gene phylogenies (incomplete lineage sorting, hybridization) scheme, either because another (multi-modal likelihood surface) or nor in recent speciation events. The novel dynamic programming no delimitation scheme (flat likelihood surface) is well supported. delimitation algorithm reduces computation time to a minimum and This also indicates a poor fit of the data to the model. The execution allows for almost instantaneous species delimitation on phylogenies times of the MCMC sampling are almost negligible, regardless of with thousands of taxa. The MCMC sampling provides support tree size. Therefore, we can thoroughly sample the delimitation values for the delimited species based on millions (or even billions) space even for phylogenies comprising thousands of taxa. For the of MCMC generations in just a few seconds on a modern desktop. large phylogenies of our study (i.e., Drosophila, Anopheles), the The mPTP tool is available both, as a standalone package, and as a PTP implementation required over 30 hours for the ML optimization web service. alone, and would require days or even months for the estimation of support values. Instead, mPTP required less than a minute for both the ML optimization and the support value estimation. ACKNOWLEDGMENT The authors gratefully acknowledge the support of the Klaus Tschira 5.2 Distance-based Methods Foundation. PP was funded by FP7-PEOPLE-2013-IEF EVOGREN Distance-based methods are easy to apply to large datasets as (625057). they need minimal preprocessing effort and computational time. Their major weakness is that they require either a threshold value or a combination of parameters associated with the threshold REFERENCES value, the sampling effort, or the search strategy. Empirical data Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Annals of the show that certain similarity cut-offs (2-3%) correspond well to Institute of Statistical Mathematics, 30(1), 9–14. the species boundaries of several taxonomic groups (Hebert et al., Austin, J. J. and Melville, J. (2006). Incorporating historical museum specimens into 2003; Smith et al., 2005; Hebert et al., 2004); however, these molecular systematic and conservation genetics research. Molecular Ecology Notes, 6(4), 1089–1092. values are often far from optimal (Lin et al., 2015). Selecting Bachli,¨ G. (1999-2016). The database on Taxonomy of Drosophilidae. http:// these parameters is not intuitive and they may only be evaluated a taxodros.unizh.ch/. posteriori based on the expectations of the researcher. Furthermore, Bauer, A. M., Parham, J. F., Brown, R. M., Stuart, B. L., Grismer, L., Papenfuss, the empirical knowledge for threshold settings is tightly associated T. J., Bohme,¨ W., Savage, J. M., Carranza, S., Grismer, J. L., Wagner, P., Schmitz, A., Ananjeva, N. B., and Inger, R. F. (2010). Availability of new Bayesian- with barcoding genes, and it may not be as useful for other delimited gecko names and the importance of character-based species descriptions. marker genes. Here, we optimized the parameters for three of Proceedings of the Royal Society of London B: Biological Sciences. the most popular distance methods (ABGD, Crop and Uclust) Bazin, E., Glemin,´ S., and Galtier, N. (2006). Population size does not influence based on the RTS percentage. Despite our substantial effort to mitochondrial genetic diversity in animals. Science, 312(5773), 570–572. use optimal parameters for the given data, our results show that, Blaxter, M. (2003). Molecular systematics: counting angels with DNA. Nature, 421(6919), 122–124. on average, mPTP performs better with respect to F-scores and Blaxter, M. and Floyd, R. (2003). Molecular taxonomics for biodiversity surveys: RTS percentage. At the same time, it also delimited notably already a reality. Trends in Ecology & Evolution, 18(6), 268–269. fewer species. ABGD accuracy was closest to mPTP, while PTP, Blaxter, M., Mann, J., Chapman, T., Thomas, F., Whitton, C., Floyd, R., and CROP, and Uclust performed notably worse. Besides accuracy, Abebe, E. (2005). Defining operational taxonomic units using DNA barcode data. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462), the greatest advantage of mPTP is that it consistently yields 1935–1943. more accurate results without requiring the user to optimize any Burnham, K. and Anderson, D. (2002). Model Selection and MultiModel Inference: A parameters/thresholds. Finally, in contrast to (m)PTP, distance Practical Information-Theoretic Approach. New York: Springer. methods ignore evolutionary relationships. Hence, there is no direct Casiraghi, M., Labra, M., Ferri, E., Galimberti, A., and De Mattia, F. (2010). DNA relation between monophyletic species and delimitation accuracy barcoding: a six-question tour to improve users’ awareness about the method. Briefings in Bioinformatics, 11(4), 440–453. in similarity based tools. However, polyphyly often reflects recent Dayrat, B. (2005). Towards integrative taxonomy. Biological Journal of the Linnean speciation while reciprocal monophyly indicates that significant Society, 85(3), 407–415. time since speciation has passed. Consequently, the barcoding gap De Queiroz, K. (2007). Species concepts and species delimitation. Systematic biology, should be less pronounced in datasets of recently diverged species. 56(6), 879–886. Dupuis, J. R., Roe, A. D., and Sperling, F. A. (2012). Multi-locus species delimitation This justifies that the RTS fraction improves with the number of in closely related animals and fungi: one marker is not enough. Molecular ecology, monophyletic species. 21(18), 4422–4436. Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460–2461. Esselstyn, J. A., Evans, B. J., Sedlock, J. L., Anwarali Khan, F. A., and Heaney, 6 CONCLUSIONS L. R. (2012). Single-locus species delimitation: a test of the mixed Yule–coalescent We presented mPTP, a novel approach for single-locus delimitation model, with an empirical application to Philippine round-leaf bats. Proceedings of the Royal Society of London B: Biological Sciences that consistently provides faster and more accurate species estimates . Fujisawa, T. and Barraclough, T. G. (2013). Delimiting Species Using Single-Locus than PTP and other popular delimitation methods. As PTP, mPTP is Data and the Generalized Mixed Yule Coalescent Approach: A Revised Method and mainly designed for analyzing barcoding loci, but can potentially Evaluation on Simulated Data Sets. Systematic Biology, 62(5), 707–724.

7 bioRxiv preprint doi: https://doi.org/10.1101/063875; this version posted July 14, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Kapli et al

Fujisawa, T., Aswad, A., and Barraclough, T. G. (2016). A Rapid and Scalable Method 55(4), 595–609. for Multilocus Species Delimitation Using Bayesian Model Comparison and Rooted Puillandre, N., Lambert, A., Brouillet, S., and Achaz, G. (2012). ABGD, Automatic Triplets. Systematic Biology. Barcode Gap Discovery for primary species delimitation. Molecular ecology, 21(8), Funk, D. J. and Omland, K. E. (2003). Species-Level Paraphyly and Polyphyly: 1864–1877. Frequency, Causes, and Consequences, with Insights from Animal Mitochondrial Qin, J., Zhang, Y., Zhou, X., Kong, X., Wei, S., Ward, R. D., and Zhang, A.-b. (2015). DNA. Annual Review of Ecology, Evolution, and Systematics, 34, 397–423. Mitochondrial phylogenomics and genetic relationships of closely related pine Hao, X., Jiang, R., and Chen, T. (2011). Clustering 16S rRNA for OTU prediction: a moth (Lasiocampidae: Dendrolimus) species in China, using whole mitochondrial method of unsupervised Bayesian clustering. Bioinformatics, 27(5), 611–618. genomes. BMC genomics, 16(1), 1. Harbach, R. E. and Kitching, I. J. (2016). The phylogeny of Anophelinae revisited: Ratnasingham, S. and Hebert, P. D. (2013). A DNA-based registry for all animal inferences about the origin and classification of Anopheles (Diptera: Culicidae). species: The Barcode Index Number (BIN) System. PloS one, 8(7), e66213. Zoologica Scripta, 45(1), 34–47. Rijsbergen, C. J. V. (1979). Information Retrieval. Butterworth-Heinemann, Newton, Hebert, P. D., Cywinska, A., Ball, S. L., et al. (2003). Biological identifications through MA, USA, 2nd edition. DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences, Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Hohna,¨ S., 270(1512), 313–321. Larget, B., Liu, L., Suchard, M. A., and Huelsenbeck, J. P. (2012). MrBayes 3.2: Hebert, P. D., Stoeckle, M. Y., Zemlak, T. S., and Francis, C. M. (2004). Identification efficient Bayesian phylogenetic inference and model choice across a large model of birds through DNA barcodes. PLoS biology, 2, 1657–1663. space. Systematic biology, 61(3), 539–542. Hickerson, M., Carstens, B., Cavender-Bares, J., Crandall, K., Graham, C., Johnson, Smith, M. A., Fisher, B. L., and Hebert, P. D. (2005). DNA barcoding for effective J., Rissler, L., Victoriano, P., and Yoder, A. (2010). Phylogeographys past, present, biodiversity assessment of a hyperdiverse arthropod group: the ants of Madagascar. and future: 10 years after. Molecular Phylogenetics and Evolution, 54(1), 291–301. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1462), Hudson, R. R. (1990). Gene genealogies and the coalescent process. Oxford surveys in 1825–1834. evolutionary biology, 7(1), 44. Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post- Jones, G., Aydin, Z., and Oxelman, B. (2015). DISSECT: an assignment-free Bayesian analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313. discovery method for species delimitation under the multispecies coalescent. Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., and Kumar, S. (2011). Bioinformatics, 31(7), 991–998. MEGA 5: Molecular Evolutionary Genetics Analysis. Center for Evolutionary Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software Functional GenomicsThe Biodesign Institute, pages 85287–5301. version 7: improvements in performance and usability. Molecular biology and Tang, C. Q., Humphreys, A. M., Fontaneto, D., and Barraclough, T. G. (2014). Effects evolution, 30(4), 772–780. of phylogenetic reconstruction method on the robustness of species delimitation Lin, X., Stur, E., and Ekrem, T. (2015). Exploring Genetic Divergence in a Species-Rich using single-locus data. Methods in Ecology and Evolution, 5(10), 1086–1094. Insect Genus Using 2790 DNA Barcodes. PLoS ONE, 10(9), 1–24. Tautz, D., Arctander, P., Minelli, A., Thomas, R. H., and Vogler, A. P. (2003). A plea Maddison, W. P. (1997). Gene trees in species trees. Systematic biology, 46(3), 523– for DNA taxonomy. Trends in Ecology & Evolution, 18(2), 70–74. 536. Valentini, A., Pompanon, F., and Taberlet, P. (2009). DNA barcoding for ecologists. McKay, B. D. and Zink, R. M. (2010). The causes of mitochondrial DNA gene tree Trends in Ecology & Evolution, 24(2), 110–117. paraphyly in birds. Molecular Phylogenetics and Evolution, 54(2), 647–650. Vogler, A. P. and Monaghan, M. T. (2007). Recent advances in DNA taxonomy. Journal McQuarrie, A. D. and Tsai, C.-L. (1998). Regression and time series model selection. of Zoological Systematics and Evolutionary Research, 45(1), 1–10. World Scientific. Will, K. W., Mishler, B. D., and Wheeler, Q. D. (2005). The perils of DNA barcoding Monaghan, M. T., Wild, R., Elliot, M., Fujisawa, T., Balke, M., Inward, D. J., Lees, and the need for integrative taxonomy. Systematic biology, 54(5), 844–851. D. C., Ranaivosolo, R., Eggleton, P., Barraclough, T. G., and Vogler, A. P. (2009). Yang, Z. and Rannala, B. (2014). Unguided Species Delimitation Using DNA Sequence Accelerated Species Inventory on Madagascar Using Coalescent-Based Models of Data from Multiple Loci. Molecular Biology and Evolution, 31(12), 3125–3135. Species Delineation. Systematic Biology, 58(3), 298–311. Yule, G. U. (1925). A Mathematical Theory of Evolution, Based on the Conclusions of Oliver, I. and Beattie, A. J. (1993). A possible method for the rapid assessment of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. biodiversity. Conservation biology, 7(3), 562–568. Series B, Containing Papers of a Biological Character, 213, 21–87. Pons, J., Barraclough, T. G., Gomez-Zurita, J., Cardoso, A., Duran, D. P., Hazell, S., Zhang, J., Kapli, P., Pavlidis, P., and Stamatakis, A. (2013). A general species Kamoun, S., Sumlin, W. D., and Vogler, A. P. (2006). Sequence-based species delimitation method with applications to phylogenetic placements. Bioinformatics, delimitation for the DNA taxonomy of undescribed insects. Systematic biology, 29(22), 2869–2876.

8