<<

diversity

Article Phylogenetic Signal of Indels and the Neoavian Radiation

Peter Houde 1,*, Edward L. Braun 2,* , Nitish Narula 1 , Uriel Minjares 1 and Siavash Mirarab 3,*

1 Department of Biology, New Mexico State University, Las Cruces, NM 88003, USA 2 Department of Biology, University of Florida, Gainesville, FL 32607, USA 3 Electrical and Computer Engineering, University of California, San Diego, CA 92093, USA * Correspondence: [email protected] (P.H.); ebraun68@ufl.edu (E.L.B.); [email protected] (S.M.); Tel.: +1-575-646-6019 (P.H.); +1-352-846-1124 (E.L.B.); +1-858-822-6245 (S.M.)  Received: 3 June 2019; Accepted: 4 July 2019; Published: 6 July 2019 

Abstract: The early radiation of has been hypothesized to be an intractable “hard polytomy”. We explore the fundamental properties of insertion/deletion alleles (indels), an under-utilized form of genomic data with the potential to help solve this. We scored >5 million indels from >7000 pan-genomic intronic and ultraconserved element (UCE) loci in 48 representatives of all neoavian orders. We found that intronic and UCE indels exhibited less homoplasy than nucleotide (nt) data. Gene trees estimated using indel data were less resolved than those estimated using nt data. Nevertheless, Accurate Species TRee Algorithm (ASTRAL) species trees estimated using indels were generally similar to nt-based ASTRAL trees, albeit with lower support. However, the power of indel gene trees became clear when we combined them with nt gene trees, including a striking result for UCEs. The individual UCE indel and nt ASTRAL trees were incongruent with each other and with the intron ASTRAL trees; however, the combined indel+nt ASTRAL tree was much more congruent with the intronic trees. Finally, combining indel and nt data for both introns and UCEs provided sufficient power to reduce the scope of the polytomy that was previously proposed for several supraordinal lineages of Neoaves.

Keywords: insertion; deletion; indel; ASTRAL; multispecies coalescent; hemiplasy; phylogeny; polytomy; neoaves

1. Introduction There have been many attempts to resolve the relationships among the major lineages of using molecular data [1]. The earliest efforts used mitochondrial data (e.g., [2–4]), but attention rapidly shifted to the nuclear genome. Analyses of nuclear sequence data began by using individual genes [5–7] or small numbers of genes [8,9] with increasingly comprehensive higher-taxon sampling. A constant across all of these studies was the poor resolution of the relationships at the base of the most species-rich avian , Neoaves. The absence of resolution was so striking that Poe and Chubb [10] suggested that was evidence of a hard polytomy (simultaneous cladogenic events). Subsequent analyses of multiple genes [11–13] provided convincing support for monophyly of several supraordinal , falsifying the specific version of the hard polytomy hypothesis that was proposed by Poe and Chubb [10]. However, the multigene studies were still unable to provide strong support for all relationships among orders. In fact, subsequent phylogenomic studies using hundreds or even thousands of genes [14–16] only provided convincing evidence that a few additional supraordinal clades were monophyletic despite the use of very large data matrices, some comprising as much as 322 megabase pairs (Mbp) of sequence data, for analyses. The continued lack of resolution in the bird tree in the phylogenomic era led Suh [17] to advocate a return to the hard polytomy hypothesis involving the radiation of nine clades at the base of Neoaves (as in Figure1, excluding Phaethontimorphae).

Diversity 2019, 11, 108; doi:10.3390/d11070108 www.mdpi.com/journal/diversity Diversity 2019, 11, x FOR PEER REVIEW 2 of 24

Diversitypolytomy2019 hypothesis, 11, 108 involving the radiation of nine clades at the base of Neoaves (as in Figure2 of 231, excluding Phaethontimorphae).

FigureFigure 1.1. DefinitionsDefinitions of of clade clade names names used used in inthis this paper. paper. The Thebasal basal-most-most polytomy polytomy represents represents “the “themagnificent magnificent seven” seven” and andthree three “orphan “orphan orders” orders” of Red ofdy Reddy et al. et[18] al.. [ Other18]. Other polytomies polytomies reflect reflectdifferences differences in tree topologies in tree topologies recovered recovered using various using data various types data and typesanalyses and (reviewed analyses by (reviewed Braun et byal. Braun[1]). et al. [1]). The hard polytomy hypothesis makes a fundamental prediction that individual gene trees will The hard polytomy hypothesis makes a fundamental prediction that individual gene trees will show no more similarity to each other than expected for random trees generated by the coalescent show no more similarity to each other than expected for random trees generated by the coalescent process. Sayyari and Mirarab [19] proposed a statistical test of the polytomy hypothesis based on process. Sayyari and Mirarab [19] proposed a statistical test of the polytomy hypothesis based on quartet frequencies; when they applied this test to the bird tree, they were unable to reject the null quartet frequencies; when they applied this test to the bird tree, they were unable to reject the null hypothesis of a zero-length branch (i.e., a true polytomy) for several branches near the base of Neoaves hypothesis of a zero-length branch (i.e., a true polytomy) for several branches near the base of (Figure2). Suh [ 17] combined two lines of evidence when he argued for a hard polytomy at the Neoaves (Figure 2). Suh [17] combined two lines of evidence when he argued for a hard polytomy at base of Neoaves. The first line of evidence was the conflict among gene trees and between the gene the base of Neoaves. The first line of evidence was the conflict among gene trees and between the trees and estimates of the species tree, presumably due to incomplete lineage sorting (ILS). Gene tree gene trees and estimates of the species tree, presumably due to incomplete lineage sorting (ILS). incongruence is certainly ubiquitous for the base of Neoaves. Consider, for example, that none of the Gene tree incongruence is certainly ubiquitous for the base of Neoaves. Consider, for example, that 14,446 individual gene trees of Jarvis et al. [15] were identical to the estimated species tree. However, none of the 14,446 individual gene trees of Jarvis et al. [15] were identical to the estimated species the observation that there is extensive conflict among gene trees does not provide direct evidence tree. However, the observation that there is extensive conflict among gene trees does not provide of a hard polytomy because ILS is expected to be ubiquitous, even for fully resolved species trees. direct evidence of a hard polytomy because ILS is expected to be ubiquitous, even for fully resolved Gene tree–species tree conflicts due to ILS are an inevitable consequence of genetic drift in a set of species trees. Gene tree–species tree conflicts due to ILS are an inevitable consequence of genetic populations connected by a phylogenetic tree [20,21], although there will be more ILS for short internal drift in a set of populations connected by a phylogenetic tree [20,21], although there will be more ILS branches (like the branches at the base of Neoaves; [15,16]). The hard polytomy hypothesis is only for short internal branches (like the branches at the base of Neoaves; [15,16]). The hard polytomy corroborated when quartets within gene trees are considered and the percentage of gene trees that hypothesis is only corroborated when quartets within gene trees are considered and the percentage support each of the three mutually exclusive paired sister taxa in a quartet approaches one-third [19]. of gene trees that support each of the three mutually exclusive paired sister taxa in a quartet Of course, the hard polytomy test assumes the estimated gene trees are accurate; any topological errors in the gene trees could inflate the estimated amount of ILS [22].

Diversity 2019, 11, x FOR PEER REVIEW 3 of 24

approaches one-third [19]. Of course, the hard polytomy test assumes the estimated gene trees are Diversity 2019, 11, 108 3 of 23 accurate; any topological errors in the gene trees could inflate the estimated amount of ILS [22].

FigureFigure 2. 2. EvidenceEvidence for for extensive extensive gene gene tree–species tree–species tree tree conflict conflict at the at base the of base Neoaves. of Neoaves. (a) Total (a Evidence) Total NucleotideEvidence N Treeucleotide (TENT) Tree maximum (TENT) maximum likelihood likelihood (ML) tree (ML) of Jarvis tree etof al.Jarvis [15 ]et with al. [15] branch with lengths branch in mutationlengths inunits, mutation drawn units, using drawn Interactive using Interactive Tree Of Life Tree (iTOL) Of Life [ 23(iTOL]. Branches) [23]. Branches are colored are colored by Quartet by ScoreQuartet (QS) Score computed (QS) computed using ASTRAL using based ASTRAL on all based 14,446 on unbinned all 14,446 nt unbinned gene trees nt with gene branches trees with that have bootstrap support (BS) 5% contracted. Low QS (red) in deep divergences among Neoaves are branches that have bootstrap ≤support (BS) ≤5% contracted. Low QS (red) in deep divergences among suggestiveNeoaves are of suggestive high levels of ofhigh ILS. levels The of null ILS. hypothesis The null hypothesis of a polytomy of a polytomy could not could be rejectednot be rejected for four branchesfor four branches (dashedlines) (dashed at the lines)p < 0.01at the level p < ( p0.01-values level shown (p-values in green shown on nodesin green where on thenodes null where hypothesis the isnull rejected) hypothesis [19]; is (b rejected)) ASTRAL-III [19].; ( treeb) ASTRAL run on- allIII 14,446tree run unbinned on all 14,446 nt gene unbinned trees withnt gene branches trees with with BS 5% contracted, as reported by [24,25]. Branch lengths in coalescent units computed by ASTRAL branches≤ with BS ≤5% contracted, as reported by [24,25]. Branch lengths in coalescent units arecomputed written by under ASTRAL each branch.are writt Branchen under colors each are branch. identical Branch to (a). colors The nullare identical hypothesis to of(a). a The polytomy null couldhypothesis not be of rejected a polytomy for six could branches not be (dashed rejected lines) for six at thebranchesp < 0.01 (dashed level ( plines)-values at shownthe p < in0.01 green level on nodes(p-values where shown the nullin green hypothesis on nodes is where rejected). the null hypothesis is rejected).

TheThe secondsecond argument Suh [17] [17] used to to support support the the hard hard polytomy polytomy hypothesis hypothesis is is the the fact fact that that all all availableavailable large-scalelarge-scale estimates of of avian phylogeny phylogeny conflict conflict with with each each other. other. However However,, the the observed observed conflictsconflicts among among phylogenomic phylogenomic estimates estimates of the of avian the avian species species tree could tree also could reflect also errors reflect in phylogenetic errors in estimationphylogenetic [1]. estimation If some of [1] the. phylogenomicIf some of the treesphylogenomic are actually trees better are estimates actually better of the trueestimates avian of species the treetrue (oravian parts species of it), tree then (or the parts conflicts of it), then could the reflect conflicts errors could in the reflect poorer errors estimates in the poorer of the estimates tree. Lack of of resolutionthe tree. Lack along of short resolution branches along in gene short trees branches can also in result gene from trees a paucity can also of informativeresult from characters a paucity [ of26 ], withinformative relative abundancecharacters [26] varying, with among relative data abundance partitions varyingaccording among to their data inherent partitions mutational according rate and to selectivetheir inherent constraints. mutational Differences rate and in levelsselective of homoplasy constraints. and Differences/or systematic in levels bias associatedof homoplasy with and/or specific datasystematic partitions bias canassociated contribute with tospecific apparent data gene partitions tree conflict can contribute [1]. Incongruence to apparent betweengene tree relativelyconflict accurate[1]. Incongruence estimates of between phylogeny relatively and biased accurate estimates estimates of phylogeny of phylogeny do not provide and biase evidenced estimates to support of thephylogeny hard polytomy do not hypothesis. provide evidence Regardless to support of the details, the hard it is polytomy important hypothesis. to consider Regardless the possibility of the that somedetails, (or it evenis important all) of the to currentlyconsider the available possibility phylogenomic that some (or trees even for all) birds of simplythe currently represent available biased estimatesphylogenomic of avian trees phylogeny. for birds simply Obviously, represen the quartet-basedt biased estimates polytomy of avian test phylogeny. can have both Obviously, false positive the andquartet false-based negative polytomy errors iftest gene can trees have are both biased false in positive some systematic and false manner, negative for errors example, if gene if one trees of threeare possiblebiased in gene some tree systematic quartets is manner,overrepresented for example, due to if phenomenon one of three such possible as long-branch gene tree attraction quartets [ is27 ]. Ultimately,overrepresented whether due one to phenomenon embraces the such hard as polytomy long-branch hypothesis attraction or views[27]. Ultimately, the base of whether Neoaves one as a treeembraces that is the likely hard to bepolytomy resolved, hypothesis it is important or views to consider the base the of possibilityNeoaves as that a tree some that (or is even likely all) to of be the currentlyresolved, availableit is important phylogenomic to consider trees the forpossibility birds represent that some biased (or even estimates all) of of the avian currently phylogeny. available phylogenomicThere is certainly trees for evidence birds represent that large-scale biased estimates bird trees of basedavian onphylogeny. different data types yield distinct estimatesThere of is phylogenycertainly evidence [15,18]. that Both large Jarvis-scale et bird al. [trees15] and based Reddy on dif etferent al. [ 18data] argued types yield that distinct the trees thatestimates emerge of inphylogeny analyses [15,18] of non-coding. Both Jarvis data et mightal. [15] be and closer Reddy to theet al. true [18] tree. argued One that argument the trees for that this hypothesisemerge in analysesis the observation of non-coding that coding data might data exhibits be closer more to the variation true tree. in base One composition argument for [ 15 this,18 ]; thishypothesis means thatis the coding observation data are that likely coding to exhibit data exhibits stronger more violations variation of the in assumptionsbase composition of the [15,18] general; time reversible (GTR) model than non-coding data. However, that argument is indirect. Another way Diversity 2019, 11, 108 4 of 23 to corroborate the hypothesis that trees generated using non-coding data are closer to the true tree would be to conduct analyses of yet another data type. Insertions and deletions (indels) represent a distinct type of data that has been historically under-utilized in phylogenetics. This reflects, at least in part, the fact that their positions in alignments has been derided as ambiguous or subjective [28,29]; however, that argument is unconvincing because phylogenetic inference using nt data depends on the same alignment used to place indels. Ultimately, questions regarding the utility of indels in phylogenetics are empirical and must be addressed by comparisons to analyses that use nt data. Regardless of the details, the larger point is that phylogenetic analyses of indels might provide a different way of looking at existing data. Assuming that one accepts that indels are useful characters for phylogenetic estimation, then one is left with the question of the best model for their analysis. This leads to a more valid reason why indels have been historically under-utilized in phylogenetics: because, in sharp contrast to the situation for nt substitutions where a plethora of models that are informed by knowledge of substitution processes exist [30], the models that are currently available for indel analysis are few and relatively unsophisticated. Some of the more sophisticated indel models (e.g., the TKF91 and TKF92 [Thorne–Kishino–Felsenstein] models [31,32] and their extensions [33] as well as the “Wagner–Dollo duet” model [34]) are impractical for phylogenomic-scale data sets. This has meant that most analyses of indels have coded them as binary characters [35] that are then analyzed using either the maximum parsimony (MP) criterion or the maximum likelihood (ML) criterion using a simple model [35]. When analyses of indels are conducted, a variant of the two-state Markov model (i.e., Cavender–Farris–Neyman [CFN] model [36–38]) that assumes unequal state frequencies and includes among-characters rate heterogeneity is typically used (we used a variant of the CFN model with Γ-distributed rate variation that is called BINGAMMA in Randomized Axelerated Maximum Likelihood (RaxML) [39]). Use of this unsophisticated model could be a problem because indel data from biological sources almost certainly violate the CFN+Γ model. However, it seems reasonable to postulate that analyses based on nts and indels will deviate from the models used from their analysis in different ways. Thus, indel data could provide a different way of looking at the data that is likely to be valuable, despite the relative dearth of models for indel analyses. Despite the limited number of models available for analyses of indels, there are reasons to be hopeful. It has long been clear that most indel datasets exhibit less homoplasy than nt datasets [7,9,15,35,38]. Their low homoplasy is due, at least in part, to the fact that they have a very large character state space [40]. is expected to be homoplasy-free when the character state space is large enough [41]. MP should be a consistent estimator of phylogeny when characters reach the limiting condition of being homoplasy-free, and simple ML methods are likely to perform better when characters exhibit limited homoplasy. In addition to the large state space, the limited homoplasy of indels is also likely to reflect the mechanisms responsible for the origin of novel indel mutations and the fixation of those mutations in populations. Those mechanisms differ in fundamental ways from those responsible for nt substitutions. The mutational mechanisms responsible for the origin of new indels include phenomena ranging from replication slippage and unequal crossing over [42–45]. For the insertion of longer sequences, one finds other processes making substantial contributions, like the insertion of organellar DNA into the nuclear genome as pseudogenes (numts; [46–48]) as well as the insertion of transposable elements (TEs) [49–51]. Of course, one could postulate that some mutational mechanisms generate unexpectedly large numbers of identical indels or that some types of identical indels have a higher probability of fixation, but this is the equivalent of postulating that indels have a smaller state space than expected given naïve models. We can largely exclude the latter possibility because empirical observations indicate that, as an ensemble, indels exhibit less homoplasy than nt substitutions [7,9,15,35,38] and that long indels exhibit less homoplasy than short ones [9,15] (also see below in Results and Discussion, Section 3.1). The observation of very low homoplasy for long indels might seem to suggest that it is desirable to limit analyses to the largest indels, but there are too few of them to clearly resolve many of the short internodes at the base of Neoaves (Supplemental Materials S.1, Table S1, Figure S1). Fortunately, even the most length-inclusive indel data sets exhibit low homoplasy relative to nts, and indels comprise Diversity 2019, 11, 108 5 of 23 a very large data set that spans the entire genome (Tables S12 and S13 in Jarvis et al. [15]). Thus, it seems reasonable to investigate whether phylogenomic-scale analyses of indels can shed light on the phylogeny of birds. Since indel and nt data are likely to deviate from the models used from their analysis in different ways, it is reasonable to postulate that if they converge on a specific topology then that may be indicative of underlying phylogenetic signal. Herein, we analyze a genome-scale indel data matrix for birds, using the CFN+Γ model. We compare the behavior of indels and nts from a common set of genome-scale loci from the full diversity of extant birds at the ordinal level in the context of multispecies coalescent (MSC) analyses, with limited comparison to previously unpublished MP and ML trees (Supplemental Figures S2–S7). We extend prior studies of avian phylogeny that used indel data [38,52] by increasing the data matrix size by almost three orders of magnitude and by implementing methods that explicitly address conflict among gene trees due to the multispecies coalescent. Herein we address several major questions regarding the properties of indel data as they are applied in phylogenetic analyses of Neoaves. First, how similar are the estimates of gene trees generated by analyses of indels to those based on nt data? To answer this, one must ask how well resolved are indel trees? Are indel and nt gene trees generated from the same loci congruent with one another? Are there differences in the level of performance of indels among different data partitions, and if there are then do partitioned nt data exhibit similar differences? Can nt and indel data be combined in MSC analysis? If so, then do combined data perform better than either alone? Finally, do analyses of two very different sources of phylogenetic information—indels and nt sequence data—provide insights into the hard polytomy hypothesis? We expect the answers to the questions we posed to provide insight into the phylogenetic utility of indels for resolving the avian tree of life and, more generally, to reveal information about the potential for MSC analyses of indel data to be a tool for phylogenomic analyses.

2. Methods We studied the indels present in the 2516 intronic and 3769 ultraconserved element (UCE) loci from Jarvis et al. [53] datasets. Indels were scored by the principle of simple indel coding [54]. Indels for the MP analyses were scored using 2xread [55,56] and verified using GapCoder [57]. Indels for multispecies coalescent analyses were scored for each locus using 2matrix [56], an updated version of 2xread, in combination with a custom perl script (see Supplemental Material). MP bootstrap (BS) trees were generated using Tree Analysis Technology (TNT) [58] (Figures S2–S5). The retention index (RI; [59]) was calculated and MP trees produced using TNT [58] and PAUP*4.0b11 [60] on ML gene and species trees. Individual introns were concatenated by locus. ML estimates of gene trees were generated using the BINGAMMA model in RAxML version-8.2.11 [39] (Figures S6 and S7). Coalescent trees were generated using ASTRAL-III [24]. For many analyses, we collapsed poorly supported nodes, which we defined as those with BS support below a specific threshold, to yield polytomies. Sensitivity analyses were performed by varying the level of BS support of input gene trees for ASTRAL analysis from 0–33% contraction. Where we do not specify, our results are based on gene trees with branches of 5% or less BS contracted. The 5% threshold, while somewhat arbitrary, was chosen based on previous simulation and empirical results [24] that indicated only very low values of support should be contracted and that 5% works well for an avian nt dataset. Support values in the low range (e.g., <20%) produce robust results, as we will show. We used DiscoVista [61] to examine quartet frequencies, clade recovery, and Robinson–Foulds (RF) distances. The calculation of local posterior probabilities (localPP) in ASTRAL is described in Sayyari and Mirarab [22]; for ASTRAL analyses that used both nt+indel gene trees for the same locus, we instructed ASTRAL to count each pair of estimated gene trees as one tree (this was done to avoid double counting the same locus twice). Tests of the hard polytomy were performed in ASTRAL as described [19]. Our taxonomic nomenclature generally follows Ericson [62], Yuri et al. [38], Jarvis et al. [15], and Cracraft [63] where it does not conflict with the former two. Whenever possible (i.e., barring supraordinal clades), we use common names because we believe they are easier to read and remember. Diversity 2019, 11, 108 6 of 23

Data and code used in this paper can be found in Supplementary Materials.

3. Results and Discussion Binary coding of indels in the introns and non-coding UCE alignments from Jarvis et al. [53] resulted in an indel data matrix with more than five million characters (Table1). This was almost three orders of magnitude larger than the Yuri et al. [38] data matrix, which had 12,030 characters and was the largest previous study focused on avian indels. Almost 25% of the indels were parsimony informative (Table1). Although the number of indel characters is an order of magnitude lower than number of aligned nts, it still represents a very large phylogenetic data matrix. The number of indels per locus varied from a minimum of six to more than 10,000, with the intronic regions typically having more indels per locus than the UCEs (Table1).

Table 1. Indel and nt information content.

Ultraconserved Elements Introns All Data (UCEs) Number (#) of loci 2515 3679 6194 Total # of nts 19,258,212 9,229,346 28,487,558 Total # of indels 3,918,322 1,242,014 5,160,336 # Parsimony Informative nts 10,855,614 3,908,476 14,764,090 # Parsimony Informative indels 942,169 315,393 1,257,562 % Parsimony Informative nts 56.37 42.35 51.83 % Parsimony Informative indels 24.05 25.39 24.37 Indels per locus: Minimum 6 10 6 Lower quartile 431.5 266 284 Upper quartile 2065 406.5 704 Median 1014 333 389 Mean 1557.98 337.6 833.12

3.1. Indels Exhibit Less Homoplasy Than nt Substitutions Measured (apparent) homoplasy represents three distinct phenomena that contribute to character state distributions that conflict with an estimate of the species phylogeny. These phenomena are: (1) true homoplasy (convergence, reversal, and parallelism), (2) hemiplasy (characters that are perfectly consistent with their underlying gene tree but conflict with the species tree due to ILS [64]), and (3) analytical errors (e.g., alignment errors). Ensemble RI provides a metric of homoplasy relative to a specific estimate of the species tree, and it includes all three sources of apparent homoplasy (including error introduced by inaccurate estimation of the species tree). As expected, based on prior studies, all indel size classes have lower homoplasy (higher RI) than nt substitutions (nt RI is 0.31 whereas indel RI ranges from 0.49 to 0.86 depending on the indel size class; Table2). If we focus on the fit of individual indels to an estimate of the species tree, using the percentage of indels with perfect fit to the tree (RI = 1.0), we see a similar size dependence (Figure3). The number of indels falls o ff sharply as a function of indel size, probably reflecting the complex mixture of phenomena that lead to the accumulation of indels over evolutionary time. First, mutational origin of indels is heterogeneous, with very small indels reflecting phenomena like replication slippage followed by a shift to unequal crossing-over for somewhat larger indels and finally arriving at the largest size class, that is likely to include many TE insertions. Han et al. [65] examined a set of avian introns and found that 73% of insertions that were >40 bp in length were TE insertions. Second, the probability of indel fixation may differ for indels of different sizes. Regardless of the differences within indels, these data confirm previous work showing that indels are less homoplastic than nt substitutions [35,38] and they are likely to provide a useful source of phylogenetic information. Diversity 2019, 11, x FOR PEER REVIEW 7 of 24

Diversity 2019, 11, 108 7 of 23 homoplastic than nt substitutions [35,38] and they are likely to provide a useful source of phylogenetic information. Table 2. Intron indel number, information content, and parsimony consistency by size class. Table 2. Intron indel number, information content, and parsimony consistency by size class. All 1bp 2–9bp 10–49bp 50–99bp 100+bp All 1bp 2–9bp 10–49bp 50–99bp 100+bp # indels 3,918,322 1,023,680 1,924,526 828,741 65,651 75,724 # informative indels# indels 942,1693,918,322 239,295 1,023,680 510,0231,924,526 828,741 168,906 65,651 10,515 75,724 13,430 % informative# informative indels indels 24.05 942,169 23.38239,295 26.50510,023 168,906 20.38 10,515 16.02 13,430 17.74 Ensemble% RIinformative indels 1 indels0.52 24.05 0.5423.38 0.4926.50 20.38 0.56 16.02 0.73 17.74 0.86 EnsembleEnsemble RI nts 1 RI indels 0.311 0.52 0.54 0.49 0.56 0.73 0.86 1 Ensemble1 ensemble RI RI nts calculated as0.31 concatenated data on the Jarvis et al. ML TENT [15], not by locus. 1 ensemble RI calculated as concatenated data on the Jarvis et al. ML TENT [15], not by locus.

Figure 3. 3. NumbersNumbers of of indels indels and and estimates estimates of their of homoplasytheir homoplasy are related are torelated indel size. to indel Indel size. frequency Indel (blue;frequency log scale) (blue; and log the scale) percentage and the of percentage indels that areof indels parsimony that areconsistent parsimony (red) consistentas a function (red) of indel as a lengthfunction (log of indel scale). length We defined (log scale). indels We as defined parsimony indels consistent as parsimony if they consi havestent RI =if they1.0 on have a specific RI = 1.0 tree. on Wea specific used the tree. “Total We used Evidence the “Total Indel Evidence Tree” (TEIT; Indel Figure Tree S12” (TEIT; in Jarvis figure et S12 al. [ 15in]) Jarvis as that et estimateal. [15]) as of that the avianestimate species of the tree avian because species it reflects tree indel because data it (it reflects is based indel on concatenated data (it is based exon, intron, on concatenated and UCE indels). exon, Theintron, curve and for UCE the indels). percentage The ofcurve parsimony for the consistentpercentage indels of parsimony resembles consistent a step function. indels resembles This likely a reflectsstep function. the fact This that the likely accumulation reflects the of fact indels that over the evolutionary accumulation time of indels involves over multiple evolutionary mechanisms. time Theinvolves case multiple where 100% mechanisms. of the characters The case have where RI =100%1.0(i.e., of the the characters situation have for the RIlargest = 1.0 (i.e., indels the above)situation is consistentfor the largest with indels the absence above) of is true consistent homoplasy with as the well absence as complete of true lineage homoplasy sorting. as well as complete lineage sorting. Hemiplasy is a reflection of the fact that gene and species histories conflict. This leads to a situation where a character (e.g., an indel) may be perfectly consistent with the associated gene tree but it appears Hemiplasy is a reflection of the fact that gene and species histories conflict. This leads to a to be homoplastic because it is inconsistent with the species tree [64]. Thus, it should be possible situation where a character (e.g., an indel) may be perfectly consistent with the associated gene tree to minimize the contribution of hemiplasy to the observed RI by calculating the RI for individually but it appears to be homoplastic because it is inconsistent with the species tree [64]. Thus, it should optimized gene trees. There are two potential biases that might emerge when RI is estimated for be possible to minimize the contribution of hemiplasy to the observed RI by calculating the RI for individual gene trees. First, we have to use estimated gene trees that can have topological errors [66,67]. individually optimized gene trees. There are two potential biases that might emerge when RI is Second, contiguous genomic segments can be associated with multiple gene trees due to intralocus estimated for individual gene trees. First, we have to use estimated gene trees that can have recombination [67,68]. Thus, indels can exhibit indel hemiplasy even if one focuses on individual topological errors [66,67]. Second, contiguous genomic segments can be associated with multiple genes. More properly, one can see hemiplasy in any region that includes multiple c-genes, where gene trees due to intralocus recombination [67,68]. Thus, indels can exhibit indel hemiplasy even if c-genes are defined as DNA segments with a single evolutionary history [69,70]. The common usage one focuses on individual genes. More properly, one can see hemiplasy in any region that includes of the term “gene” may refer to a DNA segment that includes multiple c-genes (indeed, large gene multiple c-genes, where c-genes are defined as DNA segments with a single evolutionary history regions subject to recombination will almost certainly include multiple c-genes). Nevertheless, using [69,70]. The common usage of the term “gene” may refer to a DNA segment that includes multiple estimates of individual gene trees for the RI calculations likely minimize the impact of hemiplasy and c-genes (indeed, large gene regions subject to recombination will almost certainly include multiple therefore provide a more accurate measurement of true homoplasy. c-genes). Nevertheless, using estimates of individual gene trees for the RI calculations likely

Diversity 2019, 11, x FOR PEER REVIEW 8 of 24 Diversity 2019, 11, 108 8 of 23 minimize the impact of hemiplasy and therefore provide a more accurate measurement of true homoplasy. Intronic indels exhibit slightly more homoplasy than UCE indels, even when estimates of of individual gene gene trees trees were were used used to tocalculate calculate the the ensemble ensemble RI for RI each for each gene gene (ensemble (ensemble RI: intron RI: intron indel meanindel mean= 0.59,= median0.59, median = 0.57, =1st0.57, quartile 1st quartile= 0.51, 3rd= 0.51, quartile 3rd = quartile 0.66; UCE= 0.66; indel UCE mean indel = 0.65, mean median= 0.65, = 0.65,median 1st =quartile0.65, 1st = 0.60, quartile 3rd =quartile0.60, 3rd = 0.70) quartile (Figure= 0.70) 4). Indels (Figure exhibited4). Indels substantially exhibited substantially less homoplasy less inhomoplasy both data in types both data than types nt substitutions than nt substitutions (ensemble (ensemble RI: intron RI: intronnt mean nt mean = 0.32,= 0.32, median median = 0.32,= 0.32, 1st 1stquartile quartile = 0.30,= 0.30,3rd quartile 3rd quartile = 0.34;= UCE0.34; nt UCEmean nt = 0.34, mean median= 0.34, = median0.34, 1st =quartile0.34, 1st= 0.32, quartile 3rd quartile= 0.32, =3rd 0.36) quartile (Figure= 0.36) 4). The (Figure lesser4). variation The lesser of variation ensemble of RI ensemble in nt gene RI in trees nt gene may trees be may due beboth due to phylogeneticboth to phylogenetic signal and/or signal andcova/orriance covariance of physically of physically and/or and selectively/or selectively linked linked nt positions. nt positions. The differenceThe difference in intron in intron nt vs nt indel vs. indel gene gene tree tree ensemble ensemble RI RIis not is not due due to tothe the obvious obvious difference difference in in dataset dataset sizes; the relationship holds up when loci are selected to match indel indel and and nt nt dataset dataset si sizeszes ( (FigureFigure S S8).8).

Figure 4. Frequency distributiondistribution of of ensemble ensemble retention retention index index (RI) (RI calculated) calculated using using individual individual gene gene trees treesby intron by intron indel geneindel trees, gene UCEtrees, indel UCE gene indel trees, gene intron trees, nt intron gene trees,nt gene and trees, UCE and nt gene UCE trees. nt gene Ensemble trees. RI of individual loci should provide a more accurate measure of homoplasy than RI of species trees by Ensemble RI of individual loci should provide a more accurate measure of homoplasy than RI of minimizing the role of hemiplasy. Indel gene trees exhibit substantially higher RI than nt gene trees. species trees by minimizing the role of hemiplasy. Indel gene trees exhibit substantially higher RI 3.2. Howthan Wellnt gene Resolved trees. Are Indel Trees in Comparison to nt Trees?

3.2. HowResolution Well Resolved of both A intronre Indel and Trees UCE in Comparison nt gene trees to isnt superior Trees? to that of indel gene trees. Indel gene trees are less resolved as binary trees than nt gene trees for any cutoff of BS support (contraction), Resolution of both intron and UCE nt gene trees is superior to that of indel gene trees. Indel despite differences in resolution between intron and UCE partitions (Figure5a). With contraction of gene trees are less resolved as binary trees than nt gene trees for any cutoff of BS support only branches with 0% support, the median (across loci) number of binary nodes recovered normalized (contraction), despite differences in resolution between intron and UCE partitions (Figure 5a). With to the maximum possible is 100% for intron nt gene trees, somewhat less for intron indel and UCE nt contraction of only branches with 0% support, the median (across loci) number of binary nodes gene trees, and substantially lower for UCE indel gene trees. The median percentage of recovered recovered normalized to the maximum possible is 100% for intron nt gene trees, somewhat less for nodes drops rapidly and monotonically to below 50% of the maximum possible as nodes are contracted intron indel and UCE nt gene trees, and substantially lower for UCE indel gene trees. The median up to 33% based on BS support. UCE gene trees are less resolved than intron gene trees within each percentage of recovered nodes drops rapidly and monotonically to below 50% of the maximum data type. The relative differences between data types and partitions remain the same at all levels of possible as nodes are contracted up to 33% based on BS support. UCE gene trees are less resolved contraction. Similarly, mean BS support averaged over all branches of each gene tree is higher for nt than intron gene trees within each data type. The relative differences between data types and gene trees than indel gene trees, despite differences between intron and UCE partitions (Figure5b). partitions remain the same at all levels of contraction. Similarly, mean BS support averaged over all What does gene tree resolution imply about the utility of indels for phylogeny reconstruction? branches of each gene tree is higher for nt gene trees than indel gene trees, despite differences Overall, nt gene trees exhibit higher levels of BS support than indel gene trees. This suggests that, between intron and UCE partitions (Figure 5b). broadly, nts are more phylogenetically informative than indels for recovering gene trees. That nt data produce better resolved gene trees than indel data seems counter-intuitive since indels exhibit higher ensemble RI than nts. The superior resolution of nt trees may result from the fact that the nt data set has more than an order of magnitude more characters than the indel data set for the same loci. However, without knowing the true phylogeny, it is not certain that tree resolution is an accurate measure of unbiased phylogenetic informativeness. After all, violations of substitution modeling can result in

Diversity 2019, 11, 108 9 of 23 strong support for an incorrect tree (both for indels and substitutions) [1,27,71,72]. Insight on the relationship of resolution and phylogenetic informativeness may be gained by comparing congruence amongDiversity gene 2019, trees11, x FOR within PEERand REVIEW between partitions. 9 of 24

FigureFigure 5.5. Gene tree tree resolution resolution for for intron intron and and UCE UCE gene gene trees trees estimated estimated from from indel indel and andnt data. nt data. (a) (aBoxplots) Boxplots show show the the number number of of binary binary no nodes,des, normalized normalized by by the the maximum maximum possible possible number number of of binary binary nodesnodes (e.g., (e.g., 46 46 for for genegene treestrees withwith nono missingmissing data) for gene trees where nodes nodes with with BS BS up up to to a a certain certain thresholdthreshold (x-axis) (x-axis) areare contracted.contracted. With With no no contraction, contraction, this this value value is is1.0 1.0 and and with with increased increased filtering, filtering, it itquickly quickly drops drops;; (b ()b The) The density density of ofthe the gene gene tree tree BS BSaveraged averaged over over all branches all branches of each of each gene gene tree are tree areshown. shown. Overall, Overall, nt gene nt gene trees trees are better are better supported supported than thanindels indels and introns and introns are better are better supported supported than thanUCEs. UCEs.

3.3. CongruenceWhat does of gene Indel tree and resolution nt Gene Trees imply about the utility of indels for phylogeny reconstruction? Overall,Indel nt and gene nt treestrees forexhibit the same higher locus levels are of only BS slightlysupport more than similarindel gene to each trees. other This than suggests indel and that, nt treesbroadly, for two nts unrelated are more lociphylogenetically (Table3). However, informative when than we contract indels for gene recovering trees to only gene include trees. That highly nt supporteddata produce branches better with resolved BS 75%, gene pairs trees of than indel indel and data nt gene seems trees counter from di-intuitivefferent loci since have indels substantially exhibit ≥ morehigher conflict ensemble than RI the than indel nts. and The nt s geneuperior trees resolution of the same of nt locus trees (Table may 4result). from the fact that the nt data set has more than an order of magnitude more characters than the indel data set for the same loci. However, without knowingTable the 3. Robinson–Foulds true phylogeny, (RF) it is distances. not certain that tree resolution is an accurate measure of unbiased phylogenetic informativeness. After all, violations of substitution modelingQuartiles can result in indel–nt strong Same support Locus for an indel–indel incorrect Di treefferent (both Loci for indels nt–nt and Diff substitutions)erent Loci [1,27,71,72]1st quartile. Insight on the relationsh 0.658ip of resolution and 0.744 phylogenetic informativeness 0.659 may be gainedmedian by comparing congruence 0.721 among gene trees within 0.791 and between partitions. 0.698 3rd quartile 0.775 0.842 0.743 3.3. Congruence of Indel and nt Gene Trees Table 4. Number of incompatible branches between intron indel vs. nt gene trees. Indel and nt trees for the same locus are only slightly more similar to each other than indel and nt trees for two unrelated loci (Table 3). However,Same Locus when we contractDifferent gene Loci trees to only include highly supported branchesTree Resolution with BS ≥75%, pairsMean of (SD)indel and nt geneMean trees (SD) from different loci have substantially more conflict than the indel and ntn = gene2515 trees of the nsame= 6322710 locus (Table 4). Fully resolved 30.06 (3.88) 30.84 (2.90) <34% contractedTable 3. Robinson 3.530–Foulds (2.61) (RF) distances. 4.138 (2.96) <76% contracted 0.192 (0.50) 0.343 (0.72) Quartiles indel–nt Same Locus indel–indel Different Loci nt–nt Different Loci 1st quartile 0.658 0.744 0.659 medianMeasuring incongruence0.721 using the normalized RF distance0.791 of fully resolved gene trees,0.698 randomly sampled3rd quartile nt gene trees appear0.775 to be more congruent with0.842 one another than indel and0.743 nt gene trees of the same locus (Figure6, Table3). Indel and nt gene trees of the same locus are only slightly more congruentTable with 4. oneNumber another of incompatible than randomly branches sampled between indel intron gene indel trees vs nt (Figure gene trees.6). However, it is important to note that uncontracted gene trees include a great deal of unsupported error. Although RF Same Locus Different Loci scores suggestTree that Resolution nt gene trees of two random lociMean are (SD) more similar to each otherMean than (SD) indel and nt of n = 2515 n = 6322710 Fully resolved 30.06 (3.88) 30.84 (2.90) <34% contracted 3.530 (2.61) 4.138 (2.96) <76% contracted 0.192 (0.50) 0.343 (0.72)

Diversity 2019, 11, x FOR PEER REVIEW 10 of 24

Measuring incongruence using the normalized RF distance of fully resolved gene trees, randomly sampled nt gene trees appear to be more congruent with one another than indel and nt gene trees of the same locus (Figure 6, Table 3). Indel and nt gene trees of the same locus are only slightly more congruent with one another than randomly sampled indel gene trees (Figure 6). Diversity 2019, 11, 108 10 of 23 However, it is important to note that uncontracted gene trees include a great deal of unsupported error. Although RF scores suggest that nt gene trees of two random loci are more similar to each otherthe same than locus indel in and uncontracted nt of the gene same trees locus (Figure in 6 uncontracted), numbers of gene incompatible trees (Figure branches 6), between numbers indel of incompatibleand nt gene treesbranches of the between same vs. indel random and nt loci gene show trees that of this the issame not truevs random of highly loci supported show that branches this is not(Table true4 ).of highly supported branches (Table 4).

FigureFigure 6.6. RFRF distances distances for for intron intron gene gene trees. trees. Density Density of normalized of normalized RF distances RF distances between between indel indel and andnt genent gene trees trees of the of same the same locus locus(red) (red)are compared are compared to distances to distances between between randomly randomly sampled sampled indel gene indel treesgene and trees randomly and randomly sampled sampled nt gene nt gene trees. trees. Randomly Randomly sampled sampled nt gene nt gene trees trees are are slightly slightly more more congruentcongruent with with one one another another than than indel indel and and nt nt gene gene trees trees of of the the same same locus. locus. Indel Indel and and nt nt gene gene trees trees of of thethe same same locus locus are are substantially more more congruent congruent with with one one another another than than randomly randomly sampled sampled indel indel gene genetrees. trees. Similar Similar patterns patterns are observedare observed for UCEfor UCE gene gene trees trees (Figure (Figure S9). S9). Indel and nt gene trees of the same locus should be congruent with one another if both recover Indel and nt gene trees of the same locus should be congruent with one another if both recover gene phylogeny accurately, because any given gene has only one evolutionary history (this statement gene phylogeny accurately, because any given gene has only one evolutionary history (this assumes the absence of intralocus recombination, but it should hold approximately as long as intralocus statement assumes the absence of intralocus recombination, but it should hold approximately as recombination is limited). The fact that randomly sampled nt gene trees are more congruent with one long as intralocus recombination is limited). The fact that randomly sampled nt gene trees are more another than either indel and nt gene trees of the same locus or randomly sampled indel gene trees are congruent with one another than either indel and nt gene trees of the same locus or randomly to one another is compelling evidence that, locus by locus, nt gene trees (especially those based on sampled indel gene trees are to one another is compelling evidence that, locus by locus, nt gene trees introns) more accurately track gene phylogeny than indel gene trees, at least within the context of the (especially those based on introns) more accurately track gene phylogeny than indel gene trees, at substitution models we employed. least within the context of the substitution models we employed. Thus, we should anticipate that gene tree QS derived from nt data will be higher and likely more Thus, we should anticipate that gene tree QS derived from nt data will be higher and likely accurate than those estimated from indels. In fact, they are. Comparing different branches, nt QS show more accurate than those estimated from indels. In fact, they are. Comparing different branches, nt similar patterns to indel QS. However, nt QS for the recovered ASTRAL tree are almost universally QS show similar patterns to indel QS. However, nt QS for the recovered ASTRAL tree are almost higher than indel QS (Figure7). Thus, it follows that indel gene trees in our analysis artifactually universally higher than indel QS (Figure 7). Thus, it follows that indel gene trees in our analysis suggest higher levels of ILS than nt gene trees. That being said, both indel and nt trees recover artifactually suggest higher levels of ILS than nt gene trees. That being said, both indel and nt trees near-equal quartet frequencies for some deep branches of Neoaves, suggestive of near-zero-length recover near-equal quartet frequencies for some deep branches of Neoaves, suggestive of branches and abundant ILS. near-zero-length branches and abundant ILS. The ratio of the two incongruent indel bipartitions in unrooted quartets are expected to be equal The ratio of the two incongruent indel bipartitions in unrooted quartets are expected to be equal given the multispecies coalescent for neutral loci. It is notable that both indel and nt data recover given the multispecies coalescent for neutral loci. It is notable that both indel and nt data recover similarly unequal QS for some branches (e.g., sister branches labeled 8 and 19 in Figure7). Such similarly unequal QS for some branches (e.g., sister branches labeled 8 and 19 in Figure 7). Such asymmetries are suggestive of deviations from expectation for true gene trees given the multispecies asymmetries are suggestive of deviations from expectation for true gene trees given the multispecies coalescent. Those deviations could reflect processes such as hybridization or slight biases in gene tree estimation. See Figures S10–S13 for the quartet frequencies of all other data partitions.

Diversity 2019, 11, x FOR PEER REVIEW 11 of 24 coalescent. Those deviations could reflect processes such as hybridization or slight biases in gene Diversity 2019, 11, 108 11 of 23 tree estimation. See Figures S10–S13 for the quartet frequencies of all other data partitions.

FigureFigure 7. 7.Comparison Comparison of quartet of quartet frequencies frequencies for combined for combined intron + introUCEn+UCE nt+indel nt+indel gene trees. gene (a) ASTRALtrees. (a) treeASTRAL using tree all 12,388using all gene 12,388 trees gene from trees combined from combined intron+ UCEintron+UCE nt+indel nt+indel datasets datasets with branches with branches with upwith to 5% up BS to contracted. 5% BS contracted. Branch labels Branch are labels local posterior are local probabilities posterior probabilities (localPP) < 1.0.(localPP In computing) <1.0. In localPP,computing two localPP gene trees, two from gene the trees same from locus the are same counted locus as are one counted to avoid as double one to counting avoid double the same counting locus. Branchesthe same where locus. the Branches polytomy where could the not polytomy be rejected could using not the be polytomy rejected using test in the ASTRAL polytomy at the test 0.05 in significanceASTRAL at levelthe 0.05 are markedsignificance with level a #; ( bare) Right marked panel: with the a same #; (b tree) Right as abovepanel: with the collapsedsame tree branches,as above andwith remaining collapsed branchesbranches, labeled and remaining 1 to 65; Left branches panel: labeled quartet 1 frequencyto 65; Left ofpanel: the main quartet resolution frequency (red of bar) the andmain the resolution two alternative (red bar) resolutions and the two around alternative that branch resolutions (blue bars)around for that all labeledbranch internal(blue bars) branches. for all Labelslabeled on internal the x-axis branches. indicate Labels the quartet on the topology. x-axis indicate For example, the quartet for box topology. “37”, the second For example, set of bars for puts box “36”“37”, and the “59” second together, set of meaningbars puts that “36” the and position “59” oftogether, pelican meaning and that are swapped;the position the of third pelican bar putsand “58”loonand are “59”swapped; together, the meaningthird bar thatputs loon “58” and and pelican “59” together, are put together. meaning Quartet that loon frequencies and pelican computed are put fromtogether. combined Quartet nt frequencies gene trees (i.e., computed UCEs+ fromintrons) combined are shown nt gene in lighter trees ( tonesi.e., UCEs and+ quartetintrons) frequencies are shown basedin lighter on combined tones and indel quartet gene frequencies trees are shown based in on darker combined tones. indel The horizontal gene trees dashed are shown lines in show darker the 1 tones./3 threshold, The horizontal which corresponds dashed lines to show either the a true ⅓ threshold, hard polytomy which or corresponds random resolutions to either in a gene true trees.hard Nodespolytomy with or high random support resolutions (top) and low in gene support trees. (bottom) Nodes show with two high di ff supporterent scales. (top) Note and thatlow many support of 1 the(bottom) basal branchesshow two have different quartet scales. frequencies Note that that many are very of the close basal to /branches3. have quartet frequencies that are very close to ⅓.

Diversity 2019, 11, x FOR PEER REVIEW 12 of 24

3.4. Congruence among Indel and nt Trees for Introns and UCEs Overall, indel and nt ASTRAL trees of the same partition (i.e., introns only or UCEs only) are largely congruent, although important differences exist (Figure 8). Differences exist among “the

Diversityusual suspects”,2019, 11, 108 i.e., those associated with very short branches and QS approaching 33%. There12 are of 23 two reciprocally strongly supported conflicts (localPP ≥ 95%) among ASTRAL trees computed from gene trees with BS up to 5% contracted. The first is between intron indels and intron nt trees with 3.4.respect Congruence to the relationships among Indel and of pelican, nt Trees foregret, Introns and andibis. UCEs UCE indels recover the same relationship as intronOverall, indels indelwith andlesser nt support, ASTRAL with trees one of theexception. same partition Second, seriema (i.e., introns is placed only at or the UCEs base only) of the are with full support in the UCE nt tree, but in the indel tree its inclusion with largely congruent, although important differences exist (Figure8). Di fferences exist among “the usual +Accipitrimorphae causes Psittacopasseria to be rejected with high support. suspects”, i.e., those associated with very short branches and QS approaching 33%. There are two Overall, within-partition ASTRAL trees (i.e., intron nt and intron indel, and UCE nt and UCE reciprocally strongly supported conflicts (localPP 95%) among ASTRAL trees computed from gene indel) are more similar to one another than between≥ -partition ASTRAL trees (i.e., intron vs UCE), trees with BS up to 5% contracted. The first is between intron indels and intron nt trees with respect to but the UCE indel tree differs the most (Figures 8 and 9a). Intron indel and nt trees are topologically the relationships of pelican, egret, and ibis. UCE indels recover the same relationship as intron indels identical except for the sister relationships within the pelican+egret+ibis clade, and that the positions with lesser support, with one exception. Second, seriema is placed at the base of the Australaves with of both Hoatzin and differ by one node. Neither of the latter two is well full support in the UCE nt tree, but in the indel tree its inclusion with owl+Accipitrimorphae causes supported in either tree. Where differences in branch support exist, the nt tree tends to have higher Psittacopasseria to be rejected with high support. support.

FigureFigure 8. 8.ASTRAL ASTRAL trees trees using using intron intron ( a(a,b,b)) and and UCE UCE ( c(,cd,d)) loci loci with with gene gene trees trees computed computed using using nt nt (a (a,c),c) or indelsor indels (b,d ) (b,d) and andbranches branches with with up to up 5% to BS 5% contracted. BS contracted. Branch Bra labelsnch show labels localPP show localPP and those and without those awithout designation a designation have a localPP have a of localPP 1.0. Shades of 1.0. of Shades grade of correspond grade correspond to the support to the support value. value.

Overall, within-partition ASTRAL trees (i.e., intron nt and intron indel, and UCE nt and UCE indel) are more similar to one another than between-partition ASTRAL trees (i.e., intron vs. UCE), but the UCE indel tree differs the most (Figures8 and9a). Intron indel and nt trees are topologically identical except for the sister relationships within the pelican+egret+ibis clade, and that the positions of both Hoatzin and Caprimulgiformes differ by one node. Neither of the latter two is well supported in either tree. Where differences in branch support exist, the nt tree tends to have higher support. Diversity 2019, 11, 108 13 of 23 Diversity 2019, 11, x FOR PEER REVIEW 13 of 24

FigureFigure 9. 9.(a )(a For) For each each focal focal relationship relationship (i.e., (i.e., thosethose withwith somesome conflict),conflict), we we show show whether whether they they are are recoveredrecovered or rejected or rejected in various in various trees trees with with high high or low or support, low support, defined defined by a threshold by a threshold of 95% of support. 95% Threesupport. relevant Three trees relevant from the trees Jarvis from dataset the Jarvis and dataset two MP and indel two trees MP are indel also trees included. are also Note included. that support Note valuesthat support are localPP values for are ASTRAL localPP trees for ASTRAL and BS supporttrees and for BS concatenationsupport for concatenation trees, including trees, thoseincluding from Jarvisthose et from al. [15 Jarvis]. See et Figure al. [151]. for See clade Figure names. 1 for clade Figure names. created Figure using created DiscoVista. using DiscoVista ( b) The number. (b) The of diffnumbererent branches of different between branches all pairs between of ASTRAL all pairs speciesof ASTRAL trees. species trees.

Diversity 2019, 11, 108 14 of 23

Diversity 2019, 11, x FOR PEER REVIEW 14 of 24 UCE indel and nt ASTRAL trees are less congruent with one another than are the intron indel and nt ASTRALUCE trees. indel Theand positionsnt ASTRAL of trees several are less clades congruent all diff wither by one one another node than between are the UCE intron indel indel and nt trees,and but nt larger ASTRAL differences trees. The exist positions as well. of Mostseveral notably, clades all Phoenicopterimorphae differ by one node between is recovered UCE indel as and sister to nt trees, but larger differences exist as well. Most notably, Phoenicopterimorphae is recovered as all other Neoaves in the UCE nt tree, a position identical to the concatenated UCE tree presented by sister to all other Neoaves in the UCE nt tree, a position identical to the concatenated UCE tree Jarvis et al. [15]. In contrast, pigeon occupies the position sister to all other Neoaves in the UCE indel presented by Jarvis et al. [15]. In contrast, pigeon occupies the position sister to all other Neoaves in tree.the Overall, UCE indel the UCE tree. ntOverall, tree receives the UCE highernt tree receives support higher than thesupport UCE than indel the tree. UCE indel tree. IncongruenceIncongruence between between intron intron indel indel and and UCEUCE indel ASTRAL ASTRAL trees trees are are largely largely mirrored mirrored by intron by intron nt andnt UCEand UCE nt trees. nt trees. Notable Notable exceptions exceptions include include the basal basal position position of ofpigeon pigeon and and the thepan- pan-waterbirdswaterbirds cladeclade in the in UCEthe UCE indel indel tree tree already already noted. noted. InIn otherother words, words, there there are are no no differences differences between between indel indel and and nt ASTRALnt ASTRAL trees trees that that are are recovered recovered consistently consistently by both intron intron and and UCE UCE indel indel partit partitionsions except except the the sistersister relation relation of ibis of ibis and and pelican. pelican. UCEUCE indel indel trees trees exhibit exhibit relatively relatively higher higher within-within- and and among among-partition-partition RF RFdistances distances (Figure (Figure 9b). 9b). For all gene tree contraction levels, the QS was higher for nt than for indels (Figure 10). Similarly, For all gene tree contraction levels, the QS was higher for nt than for indels (Figure 10). Similarly, over almost all contraction levels, intron indels have higher QS than UCE indels (Figure 10). This over almost all contraction levels, intron indels have higher QS than UCE indels (Figure 10). This suggests that UCE gene trees may be less precisely resolved than intron gene trees when indels are suggestsused. that The UCE situation gene is trees more may complex be less for precisely nt data; resolved the QS was than higher intron for gene intron trees nt treeswhen at indels low are used.contraction The situation levels is, more whereas complex the QS for was nt data; higher the for QS UCE was nt higher trees for when intron the nt contraction trees at low level contraction was levels,higher whereas (Figure the 10). QS was higher for UCE nt trees when the contraction level was higher (Figure 10).

FigureFigure 10. QS 10. ofQS the of the ASTRAL ASTRAL tree tree computed computed fromfrom intron and and UCE, UCE, indel indel and and nt gene nt gene trees trees at levels at levels of of genegene tree contractiontree contraction from from 0–33%. 0–33%. UCE UCE indels indels yield yield uniformly uniformly lower lower QS QS than than intronintron indelsindels and nt data and removingdata and removing low support low support branches branches increases increases the QS the of QS the of ASTRALthe ASTRAL tree. tree.

In lightIn light of the of the higher higher level level of of discordance discordance forfor indel gene gene trees trees evident evident in the in theRF distanc RF distanceses and and QS, itQS, should it should not not be surprisingbe surprising that that nt nt ASTRAL ASTRAL treestrees are are better better supported. supported. On Onthe theother other hand, hand, it is it is perhapsperhaps surprising surprising that that indel indel ASTRAL ASTRAL trees trees soso closelyclosely resemble resemble nt nt ASTRAL ASTRAL trees. trees. The Thereason reason for for this congruence is likely to be the simple fact that signal can overcome noise when datasets are this congruence is likely to be the simple fact that signal can overcome noise when datasets are sufficiently large. sufficiently large. 3.5. Are nt Trees Improved When Combined with Indel Trees? 3.5. Are nt Trees Improved When Combined with Indel Trees? Neither nt nor indel data perfectly fit the models we used for their analyses. Moreover, the Neitherintrinsic nt stochasticity nor indel dataof molecular perfectly ev fitolution the models may meanwe used that for only their one analyses. form of Moreover,substitution the might intrinsic stochasticity of molecular evolution may mean that only one form of substitution might occur during a particular interval of time. This could lead to the cases where a particular branch in a gene tree might have some nt synapomorphies but no indel synapomorphies, or vice versa. Thus, we speculate that for Diversity 2019, 11, 108 15 of 23 any given locus, both nts and indels may contain phylogenetic signals that the other lacks. To test this, we examined combining nts and indels. Two ways to address the effects of combining nt+indel data are whether the addition of indel data either recover or decompose clades that are absent or present in nt-only trees, and whether they improve or reduce branch support. The UCE nt ASTRAL tree (Figure8c) di ffers from the intron nt tree (Figure8a) at 7 branches and the UCE-indel tree (Figure8d) di ffers from the intron-indel tree (Figure8b) at 9 branches. However, the UCE nt+indel tree (Figure 11b) is similar to the combined intron nt+indel tree (Figure 11a; including key clades discussed below), with only three branch differences. The reduction in discrepancy between UCEs and introns when indels are added shows that indel and nt data complement one another and combining both in phylogenetic analyses improves consistency across data partitions. However, it must be noted that the increased similarity of UCE and intron trees after the addition of indels is mostly due to changes in UCE trees.

Medium ground-finch Medium ground-finch Zebra finch Zebra finch a) Intron-combined American crow American crow Golden-collared manakin b) UCE-combined Golden-collared manakin Rifleman Rifleman Budgerigar Budgerigar Kea Kea Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema 0.97 Carmine bee-eater 0.79 Carmine bee-eater Downy woodpecker Downy woodpecker Rhinoceros hornbill Rhinoceros hornbill 0.96 Bar-tailed 0.69 Bar-tailed trogon -roller Cuckoo-roller Speckled Speckled mousebird White-tailed eagle White-tailed eagle Bald eagle Bald eagle vulture Turkey vulture Barn owl Barn owl Little egret Little egret Dalmatian pelican Dalmatian pelican Crested ibis 0.95 0.37 Crested ibis Great cormorant Great cormorant Emperor Emperor penguin Adelie penguin Adelie penguin 0.76 Northern fulmar 0.97 Northern fulmar Red-throated loon Red-throated loon 0.37 Sunbittern 0.63 Sunbittern White-tailed White-tailed tropicbird 0.87 0.65 0.55 Hoatzin 0.4 Killdeer 0.93 0.85 0.33 Grey-crowned 0.59 Hoatzin Killdeer Grey-crowned crane Annas Annas hummingbird Chimney swift Chimney swift Chuck-wills-widow Chuck-wills-widow 0.68 MacQueens 0.91 MacQueens bustard Common cuckoo Red-crested Red-crested turaco Common cuckoo Brown Brown mesite 0.65 Yellow-throated 0.85 Yellow-throated sandgrouse 0.99 Pigeon 0.63 Pigeon American American flamingo Great-crested Great-crested grebe Chicken Chicken Turkey Turkey Pekin Pekin duck Common ostrich Common ostrich White-throated White-throated tinamou

Medium ground-finch Medium ground-finch Zebra finch Zebra finch c) nt-combined American crow d) indel-combined American crow Golden-collared manakin Golden-collared manakin Rifleman Rifleman Budgerigar Budgerigar Kea Kea Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema Carmine bee-eater Carmine bee-eater Downy woodpecker Downy woodpecker Rhinoceros hornbill Rhinoceros hornbill Bar-tailed trogon 0.72 Bar-tailed trogon Cuckoo-roller Cuckoo-roller 0.78 Speckled mousebird Speckled mousebird White-tailed eagle White-tailed eagle Bald eagle Bald eagle Turkey vulture Turkey vulture Barn owl Barn owl Little egret Crested ibis 0.94 Dalmatian pelican Dalmatian pelican Crested ibis 0.52 Little egret Great cormorant Great cormorant Emperor penguin Emperor penguin Adelie penguin Adelie penguin Northern fulmar 0.68 Northern fulmar Red-throated loon Red-throated loon 0.47 0.52 Sunbittern Sunbittern 0.97 White-tailed tropicbird 0.58 0.71 White-tailed tropicbird Annas hummingbird Hoatzin 0.81 Chimney swift 0.53 0.76 Grey-crowned crane Chuck-wills-widow Killdeer 0.82 Grey-crowned crane American flamingo 0.94 Killdeer 0.47 Great-crested grebe Hoatzin Annas hummingbird 0.69 MacQueens bustard Chimney swift Red-crested turaco Chuck-wills-widow Common cuckoo 0.74 MacQueens bustard Brown mesite Red-crested turaco 0.93 Yellow-throated sandgrouse Common cuckoo Pigeon Brown mesite American flamingo 0.96 Yellow-throated sandgrouse Great-crested grebe Pigeon Chicken Chicken Turkey Turkey Pekin duck Pekin duck Common ostrich Common ostrich White-throated tinamou White-throated tinamou

Figure 11. ASTRAL trees using combined dataset gene trees with branches with up to 5% BS contracted. (a) Intron-combined = intron nt+indel gene trees; (b) UCE-combined = UCE nt+indel gene trees; (c) nt-combined = intron+UCE nt gene trees; and (d) indel-combined = intron+UCE indel gene trees. Branch labels are localPP <1.0. Note, in computing branch support of intron-combined and UCE-combined trees, every two gene trees of the same locus are counted as 1 to avoid double counting. Diversity 2019, 11, 108 16 of 23

The addition of indels did nothing to alter intron nt tree topology (compare Figures8a and 11a) except for the position of Hoatzin, which received low support (i.e., localPP < 95%) in both. The combined intron nt+indel tree (Figure 11a) improved branch support for Australaves+ (from 89% to 97%). However, it also reduced support for several branches, including (from 98% to 87%) and Aequornithia+Phaethontimorphae (from 98% to 76%). Adding indels did more to alter UCE nt tree topology than they did to intron nt topology (compare Figures8c and 11b). The UCE nt tree failed to recover reciprocal monophyly of and Passerea, a relationship recovered by Jarvis et al. [15], whereas the combined UCE nt+indel tree did with low support (Figure 11b). Instead, the nt tree supported +Otidimorphae (localPP = 91%). While neither recovered as monophyletic, the nt tree did so without mousebird (localPP = 96%), whereas the combined tree recovered owl+Accipitrimorphae as sister to all other . The addition of indels reduced support for Cursorimorphae+Hoatzin (from 98% to 85%). No other changes in topology and support resulting from the addition of indel data either increased or decreased localPP above or below 90%. While members of “the magnificent seven” [18] differed topologically without support in either tree, the combined UCE nt + indel tree was more congruent with the combined intron nt+indel tree as well as with the Exascale Maximum Likelihood (ExaML) TENT and Maximum Pseudo-likelihood Estimation of Species Trees* (MP-EST*) TENT trees of Jarvis et al. [15] with respect to the recovery of Columbea and Passerea. There are numerous topological differences between combined UCE+intron nt and UCE+intron indel trees (Figure 11c,d, respectively). Most notable among these are the repositioning of Phoenicopterimorphae from Columbea to a pan-waterbirds clade and movement of Hoatzin from sister to Cursorimorphae to the pan-waterbirds clade, with concomitant reductions to below 90% in localPP support for the indel tree. Alternate sisterships of egret+pelican and ibis+pelican are strongly-supported by both trees. The position of owl+Accipitrimorphae as sister to all other Telluraves in the indel tree is better supported than it is as sister to Coraciimorphae in the nt tree (100% vs. 78%). Caprimulgiformes and Otidimorphae also reversed positions with low support in both trees. Support for the same position of Phaethontimorphae was significantly lower in the indel tree. The addition of indels to the combined intron+UCE nt dataset affected branch support while the topology remained stable and changed in only 2 branches (compare Figures7a and 11c). Without altering topology, indels decreased localPP support for Passerea from 97% to 86%, and Hoatzin+Cursorimorphae from 94% to 67%. They moved owl+Accipitrimorphae from Afroaves to sister to all other Telluraves, improving support for its new position from 78% to 100%. Combining indel with nt data reduced support for Hoatzin+Cursorimorphae (from 94% to 67%), and Columbimorphae from 96% to 90%. No other differences in topology or support either increased or decreased localPP above or below 90% by adding indels to nt data.

3.6. Indel ASTRAL Trees Are Largely Congruent with Concatenation Trees We compared our combined intron+UCE indel ASTRAL tree (Figure 11d) with the published ExaML TENT and TEIT [15], as well as unpublished MP and ML intron and whole genome trees from the same study (Figure9a and Figures S2–S7). It is important to recognize that whereas all of our ASTRAL trees include the same intron and UCE datasets, some concatenation trees also include exon data that we did not consider. We excluded exons because they include relatively very few indels, both because they are short and under selective constraint (see many unresolved nodes in Figure S4). ASTRAL analysis requires well resolved gene trees. Exons also have been suggested to carry misleading phylogenetic signal [15,18] related to composition heterogeneity, selective constraint, or other factors. Thus, exons are expected to suffer from lack of or misleading signal. Our comparison to concatenation trees should be viewed as nothing more than a comparison to other trees, not a test of the properties of indels in ASTRAL analysis. We describe only our intron+UCE and indel+nt combined trees with concatenation trees in this section. Refer to Figure9a for comparisons with individual partitions and data types. Diversity 2019, 11, 108 17 of 23

Overall, there are few incompatibilities with our ASTRAL trees that include indels compared with either the ExaML TENT or TEIT that are strongly supported in both (Figure9a). Both the ExaML TENT and TEIT strongly rejected the Psittacopasserae+Coraciimorphae, whereas it is strongly supported by the combined intron+UCE indel (Figure 11d), combined intron nt+indel (Figure 11a), and combined intron+UCE nt+indel trees (Figure7). The TENT was in strong agreement with all combined analyses except the combined intron+UCE indel tree (Figure 11d), with which it was in strong disagreement on recovering the pelican+egret clade but rejecting the ibis+pelican clade. The TEIT was further in strong conflict with the combined intron+UCE indel tree (Figure 11d), which strongly supported Columbimorphae. Comparisons to MP-EST* TENT, which is the MSC-based method reported by Jarvis et al. [15], is also interesting. Despite being computed using different data types and different gene tree and species tree estimation methods (i.e., the MP-EST* TENT included exons and it used binned loci), the MP-EST* TENT and our combined intron+UCE nt+indel ASTRAL tree (Figure7) are virtually identical, differing in only two branches. They both recover Passerea and Columbea at the base of Neoaves with high support (as does the ExaML TENT) and both recover Phaethontimorphae as sister to Aequornithia (“pan-waterbirds”, also as does the ExaML TENT). More remarkably, the two deepest nodes within Passerea were recovered identically (i.e., Otidimorphae was sister to all other Passerea). Finally, both trees placed owl sister to and placed the owl+Accipitriformes clade sister to the other Telluraves. These similarities are noteworthy because they show agreement between two MSC-based trees constructed using very different estimation methods in contrast to the concatenation ExaML TENT tree, which does not recover any of these relationships (and differs from MP-EST* in five branches).

3.7. Can Indel Data Address the Neoaves Hard Polytomy Hypothesis? The hypothesis that molecular data might indicate that the base of Neoaves was a hard polytomy (a “phylogenetic bush”) makes the fundamental prediction that gene trees will show no more similarity to each other than expected for a random set of trees generated by the coalescent. Although this prediction is straightforward, it is actually quite challenging to test in empirical studies because one is limited to using estimated gene trees (rather than the true gene trees). Like many other studies (e.g., Jarvis et al. [15]), we also observed extensive conflict among gene trees (Figures6 and7). This is expected since gene tree conflict is the inevitable result of independent segregation of ancestral polymorphism as well as gene tree estimation error. In arguing for a hard polytomy, Suh [17] contended that conflicts among estimates of gene trees largely reflect genuine discordance among true gene trees by emphasizing that TE insertions also show extensive conflict [49–51,73]. Since TE insertions are thought to be homoplasy-free or virtually so [65,74], they should define a single gene tree bipartition accurately. Phylogenetic networks based on observed TE insertion data and those based on random binary data are indeed strikingly similar (compare Figures. 4D and 4F in Suh [17]). However, a more quantitative test of the hard polytomy hypothesis is clearly desirable because TEs are very few in number and unequally distributed among taxa (S.1, Table S1, Figure S1). What, if anything, do our analyses of indel data using a species tree framework tell us about the hard polytomy hypothesis? The number of branches for which a polytomy cannot be rejected is six (p < 0.05) or seven (p < 0.01) for the intron nt data (Figure S15), and the same for the combined intron indel+nt data (Figure S16; see Figure S17 for UCE indels). The number of branches for which a polytomy cannot be rejected is seven (p < 0.05) or ten (p < 0.01) for the UCE nt data (Figure S18), and seven (p < 0.05) or nine (p < 0.01) for the combined UCE indel+nt data (Figure S19; see Figure S20 for combined intron+UCE indels). The number of branches for which a polytomy cannot be rejected is five (p < 0.05) for the combined intron+UCE nt data (Figure S21), and four (p < 0.01) for the combined intron+UCE indel+nt data (Figure7a [the branches indicated with “#”], Figure S22). If this last finding is accurate, it reduces the scale of the hard polytomy from a radiation of nine lineages (or even ten lineages) to one with a radiation of six major lineages (Figure 12). Moreover, it shifts the radiation from Diversity 2019, 11, 108 18 of 23

Diversity 2019, 11, x FOR PEER REVIEW 18 of 24 the base of Neoaves to a position within Passerea (Figure7). Taken as a whole, these observations cannotobservations reject the cannot hard reject polytomy the hard hypothesis polytomy in hypothesis its entirety in but its theyentirety do but suggest they thatdo suggest the number that the of lineagesnumber involved of lineages in the involved simultaneous in the simultaneous radiation is likely radiation to be is smaller likely to than be the smaller nine than suggested the nine by Suhsuggested [17]. Similar by Suh results [17]. Similar obtained results previously obtained [19 previously] (Figure2) [19] are not(Figure directly 2) are comparable not directly to comparable this study becauseto this they study used because contracted they used gene contracted trees of a di genefferent trees level of of a BSdifferent cut-off as level input, of BS included cut-off exonas input data, thatincluded were not exon used data here, that and were included not used analyses here, basedand included on binned analyses loci. What based is importanton binned inloci. the What context is ofimportant the present in study the context is that theof the addition present of indelstudy data is that to nucleotidethe addition data of improvedindel data the to powernucleotide of the data test toimprove reject polytomies.d the power of the test to reject polytomies.

Figure 12. Collapsed polytomy trees estimated by the hard polytomy test of combined intron+UCE Figure 12. Collapsed polytomy trees estimated by the hard polytomy test of combined intron+UCE nt+indel implemented in ASTRAL [19] (right) compared to the polytomous tree hypothesized by Suh [17] nt+indel implemented in ASTRAL [19] (right) compared to the polytomous tree hypothesized by Suh (left). This study reduces the scale of the basal neoavian hard polytomy from a radiation of nine lineages [17] (left). This study reduces the scale of the basal neoavian hard polytomy from a radiation of nine (or even ten lineages) to a radiation of six major lineages, i.e., core landbirds (Telluraves), pan-waterbirds lineages (or even ten lineages) to a radiation of six major lineages, i.e., core landbirds (Telluraves), (Aequornithia+Phaethontimorphae), , , Hoatzin, and Caprimulgiformes pan-waterbirds (Aequornithia+Phaethontimorphae), Charadriiformes, Gruiformes, Hoatzin, and (including swift+hummingbird). Caprimulgiformes (including swift+hummingbird). One limitation of the Sayyari and Mirarab [19] polytomy test is that while it can reject the null One limitation of the Sayyari and Mirarab [19] polytomy test is that while it can reject the null hypothesis of a polytomy, it cannot corroborate the null hypothesis. The power to reject a polytomy test hypothesis of a polytomy, it cannot corroborate the null hypothesis. The power to reject a polytomy is lower if one considers shorter branches and, in the limit of a branch length tending to zero, the power test is lower if one considers shorter branches and, in the limit of a branch length tending to zero, the converges to zero for any fixed number of genes. Both Walsh et al. [75] and Braun and Kimball [26] power converges to zero for any fixed number of genes. Both Walsh et al. [75] and Braun and have argued that questions regarding polytomies should be examined in a power analysis framework. Kimball [26] have argued that questions regarding polytomies should be examined in a power In addition, one has to ask at what (true) branch length is a resolved branch no different than a hard analysis framework. In addition, one has to ask at what (true) branch length is a resolved branch no polytomy for practical purposes. Given an answer, estimated branch lengths (in unit of time) can different than a hard polytomy for practical purposes. Given an answer, estimated branch lengths in principle provide evidence for a hard polytomy when they fall below a certain level of practical (in unit of time) can in principle provide evidence for a hard polytomy when they fall below a significance. Consideration of both extrinsic factors (e.g., the timescale for climatic fluctuations) and certain level of practical significance. Consideration of both extrinsic factors (e.g., the timescale for intrinsicclimatic factors fluctuations) (e.g., eff andective intrins populationic factors size (e.g., and effective generation population length) that size a ffandect generation the rate ofcladogenesis length) that affect the rate of cladogenesis might permit one to define a reasonable minimum branch length (the “effect size”). However, ASTRAL’s test of the polytomy hypothesis operates on branch lengths in

Diversity 2019, 11, 108 19 of 23 might permit one to define a reasonable minimum branch length (the “effect size”). However, ASTRAL’s test of the polytomy hypothesis operates on branch lengths in coalescent units. It is hard to convert coalescent units to time since they are affected both by population size and by generation time. When one considers the fact that gene tree error will typically lead to the underestimation of coalescent branch lengths, the number of variables necessary to establish a biologically meaningful estimate of the effect size further grows. Future work should examine if indels can improve our estimates of branch length measured in coalescent units, hence providing more realistic tests of polytomy. Congruence among trees produced by independent data partitions provide further suggestive evidence that the base of Neoaves is not a hard polytomy. The combined UCE indel+nt tree is very similar to the combined intron indel+nt tree (RF distance = 3; see Figure9b). Both the combined intron indel+nt tree (Figure 11a) and the combined UCE indel+nt tree (Figure 11b) include the Columbea and Passerea clades as the basal divergence for Neoaves, as in Jarvis et al. [15] and Reddy et al. [18]. Support for both of these clades is limited in the combined UCE indel+nt tree, but the fact that it recovers them at all is significant because neither the UCE nt-only tree (Figure8c) nor the UCE indel-only tree (Figure8d) include those clades. Thus, combining indel +nt data increases congruence between the UCE data and the intron data, the latter of which by itself supports the deep divergence between Columbea and Passerea regardless of whether nts (Figure8a) or indels (Figure8b) are analyzed. This increased congruence when trees based on indels and nts are combined is not predicted by the hard polytomy hypothesis—it is predicted if these clades are present in the true species tree.

4. Conclusions and Summary We investigated the properties of indels in phylogenetic analysis, with an emphasis on the multispecies coalescent as implemented in ASTRAL III [24,25]. Indels are less homoplasious than nts as measured by RI and ensemble RI, but they produce less congruent gene trees as measured by RF distance and more equal quartet frequencies (i.e., greater quartet conflict). Indel and nt data lead to largely congruent species tree topologies, localPP branch support values, and quartet frequencies. Where differences exist in topology or support, neither data type is consistently superior, except with regard to quartet frequencies in which nt data show less gene tree discordance than indels. Importantly, combining nt+indel data leads to improved congruence between intron and UCE data partitions. UCE indel-only and nt-only trees failed to recover Columbea and Passerea as monophyletic clades at the base of Neoaves, but the combined UCE nt+indel tree recovered this relationship, in agreement with the Jarvis et al. ExaML and MP-EST* TENT trees [15]. Similarly, combining intron+UCE indels recovered owl+Accipitriformes as monophyletic, in agreement with the Jarvis et al. MP-EST* TENT tree [15].

5. Directions for Future Research This study suggests several directions for future work. (1) Foremost, more enlightened models of indel substitution are wanting. (2) Although short indels are on average clearly more homoplasious than long ones, it is also clear that many of even the shortest indels are virtually free of homoplasy. This provides motivation to develop methods for identifying and filtering indels to improve phylogenetic accuracy independent of guide trees based, for example, on flanking sequence complexity, presence of tandem repeats, or iterative alignment. (3) Mirarab et al. [76] improved input “gene tree” resolution by binning similar loci prior to MSC analysis. A similar approach might be used to further investigate the role of input indel data set size on species tree resolution. (4) We detected similar differences in the two lesser quartet frequencies using both indel and nt gene trees, suggesting that they may be real. Such violations of the coalescent might result from hybridization or introgression, and are worthy of investigation.

Supplementary Materials: The following are available online at http://www.mdpi.com/1424-2818/11/7/108/s1, Figure S1: Bootstrap Support for Maximum Likelihood Intron Indel trees using indels of different size classes, Figure S2: Bootstrap Maximum Parsimony tree of concatenated intron indels using TNT [4] with 1000 replications of tree bisection-reconnection branch-swapping and 10 trees saved per replication (equally most parsimonious trees Diversity 2019, 11, 108 20 of 23 were treated as strict consensus) and 10,000 bootstrap replications. Bootstrap scores are shown only for branches with <100% support, Figure S3: Bootstrap Maximum Parsimony tree of concatenated UCE indels, Figure S4: Bootstrap Maximum Parsimony tree of concatenated exon indels, Figure S5: Bootstrap Maximum Parsimony tree of concatenated intron+UCE+exon indels, Figure S6: Best Maximum Likelihood tree of concatenated intron indels using the BINGAMMA model in RAxML [1], from 10 randomized starting MP trees ( 10 option), Figure S7: Best Maximum Likelihood tree of whole genome indels using the BINGAMMA model in RAxML− [1], Figure S8: Intron indel and nt gene tree ensemble RI as a function of number of characters per locus, Figure S9: Gene tree Robinson-Foulds (RF) distances for intron and uce gene trees, Figure S10: Quartet Scores of intron indel ASTRAL tree. Right panel: Collapsed intron indel 5% ASTRAL tree, Figure S11: Quartet Scores of intron nt ASTRAL tree. Right panel: Collapsed intron nt 5% ASTRAL tree, Figure S12: Quartet Scores of UCE indel ASTRAL tree. Right panel: Collapsed UCE indel 5% ASTRAL tree, Figure S13: Quartet Scores of UCE nt ASTRAL tree. Right panel: Collapsed UCE nt 5% ASTRAL tree, Figure S14: Polytomy test of intron indel dataset [11] conducted using ASTRAL III [12], Figure S15: Polytomy test of intron nt dataset [11] conducted using ASTRAL III [12], Figure S16: Polytomy test of combined intron indel+nt dataset [11] conducted using ASTRAL III [12], Figure S17: Polytomy test of UCE indel dataset [11] conducted using ASTRAL III [12], Figure S18: Polytomy test of the UCE nt dataset [11] conducted using ASTRAL III [12], Figure S19: Polytomy test of the combined UCE indel+nt dataset [11] conducted using ASTRAL III [12], Figure S20: Polytomy test of the combined intron+UCE indel dataset [11] conducted using ASTRAL III [12], Figure S21: Polytomy test of the combined intron+UCE nt dataset [11] conducted using ASTRAL III [12], Figure S22: Polytomy test of the combined intron+UCE indel+nt dataset [11] conducted using ASTRAL III [12], Table S1: Distribution of indels per intron per size class. Data and code used in this paper can be found at http://doi.org/10.5281/zenodo.3237219. Author Contributions: Conceptualization, P.H., E.L.B. and S.M.; Methodology, P.H., E.L.B. and S.M.; Software, S.M.; Validation, P.H., E.L.B. and S.M.; Formal Analysis, P.H., E.L.B., S.M., N.N. and U.M.; Investigation, P.H., E.L.B. and S.M.; Resources, P.H., E.L.B. and S.M.; Data Curation, P.H., E.L.B. and S.M.; Writing—Original Draft Preparation, P.H., E.L.B. and S.M.; Writing—Review and Editing, P.H., E.L.B. and S.M.; Visualization, P.H., E.L.B. and S.M.; Supervision, P.H., E.L.B. and S.M.; Project Administration, P.H., E.L.B. and S.M.; Funding Acquisition, P.H., E.L.B. and S.M. Funding: S.M. was funded by the National Science Foundation (NSF) grant IIS-1565862. E.L.B. was funded by NSF grant DEB-1655683. P.H., N.N., and M.U. were funded by New Mexico State University. Acknowledgments: We thank Andre Aberer for performing concatenated indel RAxML analyses of Figures S6 and S7. Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Braun, E.L.; Cracraft, J.; Houde, P. Resolving the avian tree of life from top to bottom: The promise and potential boundaries of the phylogenomic era. In Avian Genomics in Ecology and Evolution—From the Lab into the Wild; Kraus, R.H.S., Ed.; Springer: Cham, Switzerland, 2019. 2. Mindell, D.P.; Sorenson, M.D.; Dimcheff, D.E.; Hasegawa, M.; Ast, J.C.; Yuri, T. Interordinal relationships of birds and other reptiles based on whole mitochondrial genomes. Syst. Biol. 1999, 48, 138–152. [CrossRef] 3. Braun, E.L.; Kimball, R.T. Examining basal avian divergences with mitochondrial sequences: Model complexity, taxon sampling, and sequence length. Syst. Biol. 2002, 51, 614–625. [CrossRef] 4. Gibb, G.C.; Kardailsky, O.; Kimball, R.T.; Braun, E.L.; Penny, D. Mitochondrial genomes and avian phylogeny: Complex characters and resolvability without explosive radiations. Mol. Biol. Evol. 2007, 24, 269–280. [CrossRef] 5. Groth, J.G.; Barrowclough, G.F. Basal divergences in birds and the phylogenetic utility of the nuclear RAG-1 gene. Mol. Phylogenet. Evol. 1999, 12, 115–123. [CrossRef] 6. Chubb, A.L. New nuclear evidence for the oldest divergence among neognath birds: The phylogenetic utility of ZENK (i). Mol. Phylogenet. Evol. 2004, 30, 140–151. [CrossRef] 7. Fain, M.G.; Houde, P. Parallel radiations in the primary clades of birds. Evolution 2004, 58, 2558–2573. [CrossRef] 8. Ericson, P.G.P.; Anderson, C.L.; Britton, T.; Elzanowski, A.; Johansson, U.S.; Källersjö, M.; Ohlson, J.I.; Parsons, T.J.; Zuccon, D.; Mayr, G. Diversification of Neoaves: Integration of molecular sequence data and fossils. Biol. Lett. 2006, 2, 543–547. [CrossRef] 9. Chojnowski, J.L.; Kimball, R.T.; Braun, E.L. Introns outperform exons in analyses of basal avian phylogeny using clathrin heavy chain genes. Gene 2008, 410, 89–96. [CrossRef] 10. Poe, S.; Chubb, A.L. Birds in a bush: Five genes indicate explosive evolution of avian orders. Evolution 2004, 58, 404–415. [CrossRef] Diversity 2019, 11, 108 21 of 23

11. Hackett, S.J.; Kimball, R.T.; Reddy, S.; Bowie, R.C.K.; Braun, E.L.; Braun, M.J.; Chojnowski, J.L.; Cox, W.A.; Han, K.-L.; Harshman, J.; et al. A phylogenomic study of birds reveals their evolutionary history. Science 2008, 320, 1763–1768. [CrossRef] 12. Wang, N.; Braun, E.L.; Kimball, R.T. Testing hypotheses about the sister group of the passeriformes using an independent 30-locus data set. Mol. Biol. Evol. 2012, 29, 737–750. [CrossRef][PubMed] 13. Kimball, R.T.; Wang, N.; Heimer-McGinn, V.; Ferguson, C.; Braun, E.L. Identifying localized biases in large datasets: A case study using the avian tree of life. Mol. Phylogenet. Evol. 2013, 69, 1021–1032. [CrossRef] [PubMed] 14. McCormack, J.E.; Harvey, M.G.; Faircloth, B.C.; Crawford, N.G.; Glenn, T.C.; Brumfield, R.T. A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing. PLoS ONE 2013, 8, e54848. [CrossRef][PubMed] 15. Jarvis, E.D.; Mirarab, S.; Aberer, A.J.; Li, B.; Houde, P.; Li, C.; Ho, S.Y.W.; Faircloth, B.C.; Nabholz, B.; Howard, J.T.; et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 2014, 346, 1320–1331. [CrossRef][PubMed] 16. Prum, R.O.; Berv, J.S.; Dornburg, A.; Field, D.J.; Townsend, J.P.; Lemmon, E.M.; Lemmon, A.R. A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 2015, 526, 569–573. [CrossRef][PubMed] 17. Suh, A. The phylogenomic forest of bird trees contains a hard polytomy at the root of Neoaves. Zool. Scr. 2016, 45, 50–62. [CrossRef] 18. Reddy, S.; Kimball, R.T.; Pandey, A.; Hosner, P.A.; Braun, M.J.; Hackett, S.J.; Han, K.-L.; Harshman, J.; Huddleston, C.J.; Kingston, S.; et al. Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling. Syst. Biol. 2017, 66, 857–879. [CrossRef] 19. Sayyari, E.; Mirarab, S. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes 2018, 9, 132. [CrossRef] 20. Degnan, J.H.; Rosenberg, N.A. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. (Amst.) 2009, 24, 332–340. [CrossRef] 21. Edwards, S.V. Is a new and general theory of molecular systematics emerging? Evolution 2009, 63, 1–19. [CrossRef] 22. Sayyari, E.; Mirarab, S. Fast coalescent-based computation of local branch support from quartet frequencies. Mol. Biol. Evol. 2016, 33, 1654–1668. [CrossRef][PubMed] 23. Letunic, I.; Bork, P. Interactive Tree Of Life (iTOL) v4: Recent updates and new developments. Nucleic Acids Res. 2019.[CrossRef] [PubMed] 24. Zhang, C.; Sayyari, E.; Mirarab, S. ASTRAL-III: Increased scalability and impacts of contracting low support branches. In Comparative Genomics; Lecture Notes in Computer Science; Meidanis, J., Nakhleh, L., Eds.; Springer International Publishing: Cham, Switzerland, 2017; Volume 10562, pp. 53–75. ISBN 978-3-319-67978-5. 25. Rabiee, M.; Sayyari, E.; Mirarab, S. Multi-allele species reconstruction using ASTRAL. Mol. Phylogenet. Evol. 2019, 130, 286–296. [CrossRef][PubMed] 26. Braun, E.L.; Kimball, R.T. Polytomies, the power of phylogenetic inference, and the stochastic nature of molecular evolution: A comment on Walsh et al. (1999). Evolution 2001, 55, 1261–1263. [CrossRef][PubMed] 27. Felsenstein, J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol. 1978, 27, 401–410. [CrossRef] 28. Morgan-Richards, M.; Trewick, S.A.; Bartosch-Härlid, A.; Kardailsky, O.; Phillips, M.J.; McLenachan, P.A.; Penny, D. Bird evolution: Testing the Metaves clade with six new mitochondrial genomes. BMC Evol. Biol. 2008, 8, 20. [CrossRef][PubMed] 29. Saurabh, K.; Holland, B.R.; Gibb, G.C.; Penny, D. Gaps: An elusive source of phylogenetic information. Syst. Biol. 2012, 61, 1075–1082. [CrossRef] 30. Huelsenbeck, J.P.; Alfaro, M.E.; Suchard, M.A. Biologically inspired phylogenetic models strongly outperform the no common mechanism model. Syst. Biol. 2011, 60, 225–232. [CrossRef] 31. Thorne, J.L.; Kishino, H.; Felsenstein, J. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 1991, 33, 114–124. [CrossRef] 32. Thorne, J.L.; Kishino, H.; Felsenstein, J. Inching toward reality: An improved likelihood model of sequence evolution. J. Mol. Evol. 1992, 34, 3–16. [CrossRef] 33. Holmes, I.H. Solving the master equation for Indels. BMC Bioinform. 2017, 18, 255. [CrossRef][PubMed] Diversity 2019, 11, 108 22 of 23

34. Alekseyenko, A.V.; Lee, C.J.; Suchard, M.A. Wagner and Dollo: A stochastic duet by composing two parsimonious solos. Syst. Biol. 2008, 57, 772–784. [CrossRef][PubMed] 35. Simmons, M.P.; Ochoterena, H.; Carr, T.G. Incorporation, relative homoplasy, and effect of gap characters in sequence-based phylogenetic analyses. Syst. Biol. 2001, 50, 454–462. [CrossRef][PubMed] 36. Mossel, E. On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 2003, 10, 669–676. [CrossRef][PubMed] 37. Steel, M.; Penny, D. Two further links between MP and ML under the poisson model. Appl. Math. Lett. 2004, 17, 785–790. [CrossRef] 38. Yuri, T.; Kimball, R.T.; Harshman, J.; Bowie, R.C.K.; Braun, M.J.; Chojnowski, J.L.; Han, K.-L.; Hackett, S.J.; Huddleston, C.J.; Moore, W.S.; et al. Parsimony and model-based analyses of indels in avian nuclear genes reveal congruent and incongruent phylogenetic signals. Biology 2013, 2, 419–444. [CrossRef][PubMed] 39. Stamatakis, A. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22, 2688–2690. [CrossRef] 40. Miklós, I.; Lunter, G.A.; Holmes, I. A “Long Indel” model for evolutionary sequence alignment. Mol. Biol. Evol. 2004, 21, 529–540. [CrossRef] 41. Steel, M.; Penny, D. Maximum parsimony and the phylogenetic information in multistate characters. In Parsimony, Phylogeny, and Genomics; Albert, V.A., Ed.; Oxford University Press: Oxford, UK, 2005; pp. 163–178. 42. Chuzhanova, N.A.; Anassis, E.J.; Ball, E.V.; Krawczak, M.; Cooper, D.N. Meta-analysis of indels causing human genetic disease: Mechanisms of mutagenesis and the role of local DNA sequence complexity. Hum. Mutat. 2003, 21, 28–44. [CrossRef] 43. Garcia-Diaz, M.; Kunkel, T.A. Mechanism of a genetic glissando: Structural biology of indel mutations. Trends Biochem. Sci. 2006, 31, 206–214. [CrossRef] 44. Messer, P.W.; Arndt, P.F. The majority of recent short DNA insertions in the human genome are tandem duplications. Mol. Biol. Evol. 2007, 24, 1190–1197. [CrossRef][PubMed] 45. Volfovsky, N.; Oleksyk, T.K.; Cruz, K.C.; Truelove, A.L.; Stephens, R.M.; Smith, M.W. Genome and gene alterations by insertions and deletions in the evolution of human and chimpanzee chromosome 22. BMC Genom. 2009, 10, 51. [CrossRef][PubMed] 46. Lopez, J.V.; Yuhki, N.; Masuda, R.; Modi, W.; O’Brien, S.J. Numt, a recent transfer and tandem amplification of mitochondrial DNA to the nuclear genome of the domestic cat. J. Mol. Evol. 1994, 39, 174–190. [PubMed] 47. Qu, H.; Ma, F.; Li, Q. Comparative analysis of mitochondrial fragments transferred to the nucleus in vertebrate. J. Genet. Genom. 2008, 35, 485–490. [CrossRef] 48. Liang, B.; Wang, N.; Li, N.; Kimball, R.T.; Braun, E.L. Comparative genomics reveals a burst of homoplasy-free numt insertions. Mol. Biol. Evol. 2018, 35, 2060–2064. [CrossRef][PubMed] 49. O’Donnell, K.A.; Burns, K.H. Mobilizing diversity: Transposable element insertions in genetic variation and disease. Mob. DNA 2010, 1, 21. [CrossRef][PubMed] 50. Jurka, J.; Bao, W.; Kojima, K.K. Families of transposable elements, population structure and the origin of species. Biol. Direct 2011, 6, 44. [CrossRef] 51. Matzke, A.; Churakov, G.; Berkes, P.; Arms, E.M.; Kelsey, D.; Brosius, J.; Kriegs, J.O.; Schmitz, J. Retroposon insertion patterns of neoavian birds: Strong evidence for an extensive incomplete lineage sorting era. Mol. Biol. Evol. 2012, 29, 1497–1501. [CrossRef] 52. Pa´sko,Ł.; Ericson, P.G.P.; Elzanowski, A. Phylogenetic utility and evolution of indels: A study in neognathous birds. Mol. Phylogenet. Evol. 2011, 61, 760–771. [CrossRef] 53. Jarvis, E.D.; Mirarab, S.; Aberer, A.J.; Li, B.; Houde, P.; Li, C.; Ho, S.Y.W.; Faircloth, B.C.; Nabholz, B.; Howard, J.T.; et al. AvianPhylogenomics Consortium Phylogenomic analyses data of the avian phylogenomics project. Gigascience 2015, 4, 4. [CrossRef] 54. Simmons, M.P.; Ochoterena, H. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 2000, 49, 369–381. [CrossRef][PubMed] 55. Little, D.P. 2xread: A simple indel coding tool. Available online: https://www.nybg.org/files/scientists/2xread. html (accessed on 28 April 2019). 56. Salinas, N.R.; Little, D.P.2matrix: A utility for indel coding and phylogenetic matrix concatenation. Appl. Plant Sci. 2014, 2.[CrossRef][PubMed] Diversity 2019, 11, 108 23 of 23

57. Young, N.D.; Healy, J. GapCoder automates the use of indel characters in phylogenetic analysis. BMC Bioinform. 2003, 4, 6. [CrossRef] 58. Goloboff, P.A.; Farris, J.S.; Nixon, K.C. TNT, a free program for phylogenetic analysis. Cladistics 2008, 24, 774–786. [CrossRef] 59. Farris, J.S. The retention index and the rescaled consistency index. Cladistics 1989, 5, 417–419. [CrossRef] 60. Swofford, D.L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods); Sinauer Associates: Sunderland, MA, USA, 2003. 61. Sayyari, E.; Whitfield, J.B.; Mirarab, S. DiscoVista: Interpretable visualizations of gene tree discordance. Mol. Phylogenet. Evol. 2018, 122, 110–115. [CrossRef] 62. Ericson, P.G.P. Evolution of terrestrial birds in three continents: Biogeography and parallel radiations. J. Biogeogr. 2012, 39, 813–824. [CrossRef] 63. Cracraft, J.C. Avian higher-level relationships and classification: Nonpasseriforms. In The Howard and Moore Complete Checklist of the Birds of the World, 4th ed.; Dickinson, E.C., Remsen, J.V., Jr., Eds.; Aves Press: Eastbourne, UK, 2013; Volume 1. 64. Avise, J.C.; Robinson, T.J. Hemiplasy: A new term in the lexicon of phylogenetics. Syst. Biol. 2008, 57, 503–507. [CrossRef] 65. Han, K.-L.; Braun, E.L.; Kimball, R.T.; Reddy, S.; Bowie, R.C.K.; Braun, M.J.; Chojnowski, J.L.; Hackett, S.J.; Harshman, J.; Huddleston, C.J.; et al. Are transposable element insertions homoplasy free?: An examination using the avian tree of life. Syst. Biol. 2011, 60, 375–386. [CrossRef] 66. Patel, S. Error in phylogenetic estimation for bushes in the tree of life. J. Phylogenet. Evol. Biol. 2013, 01. [CrossRef] 67. Springer, M.S.; Gatesy, J. The gene tree delusion. Mol. Phylogenet. Evol. 2016, 94, 1–33. [CrossRef][PubMed] 68. Hobolth, A.; Dutheil, J.Y.; Hawks, J.; Schierup, M.H.; Mailund, T. Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection. Genome Res. 2011, 21, 349–356. [CrossRef][PubMed] 69. Doyle, J.J. The irrelevance of allele tree topologies for species delimitation, and a non-topological alternative. Syst. Bot. 1995, 20, 574. [CrossRef] 70. Doyle, J.J. Trees within trees: Genes and species, molecules and morphology. Syst. Biol. 1997, 46, 537–553. [CrossRef][PubMed] 71. Hendy, M.D.; Penny, D. A framework for the quantitative study of evolutionary trees. Syst. Zool. 1989, 38, 297. [CrossRef] 72. Kim, J. Slicing hyperdimensional oranges: The geometry of phylogenetic estimation. Mol. Phylogenet. Evol. 2000, 17, 58–75. [CrossRef][PubMed] 73. Suh, A.; Smeds, L.; Ellegren, H. The dynamics of incomplete lineage sorting across the ancient adaptive radiation of neoavian birds. PLoS Biol. 2015, 13, e1002224. [CrossRef] 74. Doronina, L.; Reising, O.; Clawson, H.; Ray, D.A.; Schmitz, J. True homoplasy of retrotransposon insertions in primates. Syst. Biol. 2019, 68, 482–493. [CrossRef] 75. Walsh, H.E.; Kidd, M.G.; Moum, T.; Friesen, V.L. Polytomies and the power of phylogenetic inference. Evolution 1999, 53, 932–937. [CrossRef] 76. Mirarab, S.; Bayzid, M.S.; Boussau, B.; Warnow, T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 2014, 346, 1250463. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).