DiscoVista: interpretable visualizations of gene tree discordance ∗

Erfan Sayyari†1, James B. Whitfield‡2, and Siavash Mirarab§1

1 Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093.
2 Department of Entomology, University of Illinois at Urbana-Champaign, 505 S. Goodwin Avenue, Urbana, IL 61801

February 1, 2018

Abstract

Phylogenomics has ushered in an age of discordance. Analyses often reveal abundant discordances among phylogenies of different parts of genomes, as well as incongruences between species trees obtained using different methods or data partitions. Researchers are often left trying to make sense of such incongruences. Interpretive ways of measuring and visualizing discordance are needed, both among alternative species trees and gene trees, especially for specific focal branches of a tree. Here, we introduce DiscoVista, a publicly available tool that creates a suite of simple but interpretable visualizations. DiscoVista helps quantify the amount of discordance and some of its potential causes.

Index terms: Software, Species tree estimation, Gene trees, Phylogenomics, Visualization, ILS

1 Introduction

The age of phylogenomics, once hoped to be the end of incongruence in phylogenetic analyses (Rokas et al., 2003), has turned out to be rife with incongruence (Jeffroy et al., 2006) and methodological difficulties. The long-understood theoretical concerns about gene tree incongruence due to incomplete lineage sorting (ILS) (Maddison, 1997) have been implicated in many studies (e.g., Jarvis et al., 2014; Wickett et al., 2014; Suh et al., 2015). While methods that seek to address gene tree incongruence have been developed, no consensus has emerged as to the choice of the best methodology (see Edwards et al., 2016; Springer and Gatesy, 2016, for examples). Nevertheless, phylogenomic studies have to at least consider the possibility of gene tree incongruence and its impacts, a feat made difficult by the noisy estimates of gene trees (Springer and Gatesy, 2016; Patel et al., 2013; Mirarab et al., 2016; Sayyari and Mirarab, 2016). Moreover, in some cases, the incongruence itself may be of interest (Hahn and Nakhleh, 2016). Even ignoring gene tree discordance, choices of models to apply to the sequences, delineation of data partitions, alignment techniques, or simply software packages used to analyze the data have all proved consequential (e.g., Jeffroy et al., 2006; Jarvis et al., 2014; Wickett et al., 2014; Romiguier et al., 2013; Philippe et al., 2011; Zwickl et al., 2014). These difficulties have compelled some researchers to use several alternative models and methods and then test the sensitivity of results to such choices. Occasionally, analysts also choose to perturb the set of species included, and they often run analyses on different partitions of the data. The analyst hopes for congruence between the various analyses, which would indicate robustness of the results to assumptions, but often observes differences.
Ideally, the results of all analyses should be published, to convey the existence of incongruence in results to the reader. As long as incongruence remains an important force in phylogenetics, we need interpretable ways to measure and visualize the discordance between species tree estimates resulting from different analytical methods and assumptions, and also between gene trees and a summary species tree. Sophisticated tools have been developed to visualize discordance. For example, DensiTree (Bouckaert, 2010) overlays trees on top of each other to create a phylogenetic cloud, and SplitsTree (Huson, 1998) conveys incongruence by producing a network while keeping some of the tree-like structure. These tools produce creative and striking indicators of discordance. Yet it is often hard to interpret the meaning of such figures in measurable ways. We believe that in addition to these methods, phylogenomics will benefit from simple, interpretable, and easy-to-perform visualizations that help systematists to identify discordance and its potential causes.

∗Accepted by Molecular Phylogenetics and Evolution, 2018 †[email protected] ‡jwhitfi[email protected] §[email protected]

Table 1: Description and examples for different DiscoVista analyses

Analysis name | Shows ... | Examples
Species tree compatibility | compatibility of focal groups of species in species tree | Figs. 1a, S3, and S4
Occupancy analysis | taxon occupancy for individuals or focal groups of species in genes | Figs. 1b, 1c, and S5
Gene tree compatibility | compatibility of focal groups of species in gene trees | Figs. 2a, S6, and S7
Branch quartet frequencies | quartet frequencies around important branches of the species tree | Figs. 2b and S8
GC content | GC content of each codon position | Fig. S9

In this paper, we introduce DiscoVista, a tool that creates a series of simple yet powerful visualizations of discordance. DiscoVista is a command-line tool and relies on several other packages, including DendroPy (Sukumaran and Holder, 2010), ape (Paradis et al., 2004), Newick utilities (Junier and Zdobnov, 2010), the ggplot2 package (Wickham, 2016), and ASTRAL (Sayyari and Mirarab, 2016). The code, a Docker (Boettiger, 2015) virtual image (for easy installation), and examples are available online at https://github.com/esayyari/DiscoVista. DiscoVista can generate several visualizations (Table 1) that summarize gene tree discordance and discordance among species trees, show taxon occupancy, and show sequence statistics such as GC content. DiscoVista strives for interpretability. In many analyses, not all branches in a phylogenetic tree are equally important because questions of interest typically concern several hypotheses surrounding the relationships between focal groups. Visualizing discordance with respect to only these focal relationships simplifies interpretation. Assessing hypotheses concerning these larger subsets of the species helps in answering the downstream biological questions of interest.
Thus, DiscoVista allows researchers to define focal groups of taxa and evaluate discordance relevant to those groups.

2 Results

We apply DiscoVista to three datasets to demonstrate its output visualizations. The exact commands for generating the example figures are given in the supplementary materials.

Datasets The position of Xenacoelomorpha among deep branches of the Metazoan phylogeny has proved challenging to resolve, with two prevailing hypotheses. One hypothesis puts Xenacoelomorpha as sister to all other Bilateria, while the other hypothesis puts them inside Deuterostomia, implying a dramatic loss of complexity. Intriguingly, these marine worms are bilaterally symmetrical but lack several other features compared to most other bilaterians. Two independent and simultaneous studies by Rouse et al. (2016) and Cannon et al. (2016) have focused on the position of Xenacoelomorpha in the tree of life. These two studies used different (but overlapping) sets of species, and each used several reconstruction methods. The two papers come to the same conclusion, putting Xenacoelomorpha as the sister to all other Bilateria. The final results of Cannon et al. (2016) are based on 78 species and 212 orthologous genes with average per-taxon occupancy of 80%. Their paper includes alternative analyses based on several subsets of taxa and reconstruction methods. The dataset by Rouse et al. (2016) includes 26 species and 1178 genes (including four species) with average gene occupancy of 70%. They also report alternative trees using concatenation and ASTRAL (run on 393 genes with 80% occupancy). Although these two datasets used different sets of taxa, by focusing on focal splits, DiscoVista can generate visual comparisons of results across both datasets (Figs. 1 and 2).

As a second example, we show DiscoVista results on a phylotranscriptomic dataset of 103 plants (Wickett et al., 2014) in the supplementary material (Figs. S3 – S9). This dataset comes with both DNA and AA sequences, allowing us to show additional figures that could not be built for the Xenacoelomorpha datasets.

Split definitions A central input to DiscoVista is a split definitions file where the user can combine taxa into groups of interest and give names to the groups (see supplementary material for details). Each split is a bipartition of the taxa into two groups and corresponds to an edge in an unrooted tree. The user can specify one side of a split (which would be a clade if the side that doesn't include the root is given). With careful definition of splits, alternative hypotheses of interest can be specified. For our two empirical datasets, we consider focal groups of species from the original publications (Rouse et al., 2016; Cannon et al., 2016), as shown partially in Table 2 (see Figs. S1 and S2 for full definitions).

Table 2: An example split definition file for the two Xenoturbella datasets; several lines on top define base splits and are left blank here, but see the full files in Figures S1 and S2.

Cluster Name | Definition
... | ...
... | ...
Xenacoelomorpha | ...
Chordata | ...
Protostomia | Spiralia+Ecdysozoa
Deuterostomia | Ambulacraria+Chordata
Nephrozoa | Protostomia+Deuterostomia
Bilateria | Nephrozoa+Xenacoelomorpha
Xenambulacraria(C1) | Ambulacraria+Xenacoelomorpha
C2 | Chordata+Xenambulacraria(C1)
Bilateria(C3) | C2+Protostomia
D2 | Xenacoelomorpha+Deuterostomia
Bilateria(D3) | D2+Protostomia
Xenambulacraria(E1) | Xenoturbella+Ambulacraria
E2 | Chordata+Xenambulacraria(E1)
E3 | Protostomia+E2

i) Species tree compatibility This visualization shows whether focal splits are supported, weakly rejected, or strongly rejected by different analyses. The inputs are a set of species trees in Newick format, a support threshold, and the split definitions file; the output is a heat map. For each focal split and for each species tree, if the split is compatible with the tree, the corresponding cell is colored in shades of blue-green, with the shade indicating the branch support (any measure of support, including bootstrap support, Bayesian posterior probability, or localPP (Sayyari and Mirarab, 2016), could be used). A split is considered weakly rejected if it is incompatible with the fully resolved species tree but is compatible with the tree when branches with support values below the input threshold (e.g., 75% BS) are contracted; it is considered strongly rejected if it remains incompatible even after these low-support branches are contracted.

Compatibility is a concept that we find useful for measuring discordance in an interpretable way. A split is compatible with an unrooted tree if it is compatible with all splits of that tree and can therefore be added to that tree; compatibility can be easily and efficiently checked (Warnow, 1994). Let two splits be A|B and C|D; then, the two splits are compatible if and only if at least one of the four pairwise intersections A ∩ C, A ∩ D, B ∩ C, or B ∩ D is empty.
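The compatibility test and the three-way classification described above can be sketched in a few lines of Python. This is a minimal illustration and not DiscoVista's own code: the function names (`make_split`, `compatible`, `classify_split`), the representation of a tree as a list of (split, support) pairs, and the choice to treat support equal to the threshold as "kept" are all assumptions made here. It also assumes that the focal split and the tree are defined over the same taxon set.

```python
from typing import FrozenSet, Iterable, List, Tuple

Split = Tuple[FrozenSet[str], FrozenSet[str]]


def make_split(side: Iterable[str], taxa: Iterable[str]) -> Split:
    """Build the bipartition side | (taxa - side) from one side of a split."""
    one_side = frozenset(side)
    return one_side, frozenset(taxa) - one_side


def compatible(s1: Split, s2: Split) -> bool:
    """Two splits A|B and C|D are compatible iff at least one of the four
    pairwise intersections (A&C, A&D, B&C, B&D) is empty."""
    return any(not (x & y) for x in s1 for y in s2)


def classify_split(focal: Split,
                   tree_splits: List[Tuple[Split, float]],
                   threshold: float) -> str:
    """Classify a focal split against a tree given as (split, support) pairs.

    Hypothetical labels: 'compatible', 'weakly rejected', 'strongly rejected'.
    Branches with support below `threshold` are treated as contracted.
    """
    if all(compatible(focal, s) for s, _ in tree_splits):
        return "compatible"
    kept = [s for s, support in tree_splits if support >= threshold]
    if all(compatible(focal, s) for s in kept):
        return "weakly rejected"
    return "strongly rejected"


# Toy unrooted tree ((A,B),C,(D,E)) with two internal branches.
taxa = {"A", "B", "C", "D", "E"}
toy_tree = [(make_split({"A", "B"}, taxa), 95.0),
            (make_split({"D", "E"}, taxa), 60.0)]
label = classify_split(make_split({"C", "D"}, taxa), toy_tree, 75.0)
```

In the toy tree, the split CD|ABE conflicts only with the branch DE|ABC, whose support (60) is below the 75 threshold, so it is the kind of split the text calls weakly rejected.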
On the empirical datasets, the species tree compatibility figures (Fig. 1a) show several patterns. The two datasets produce largely congruent results, with only minor differences. In the Rouse et al. (2016) dataset, aggressive filtering of genes based on their occupancy negatively impacts concatenation trees in terms of bootstrap support and the recovery of some clades (e.g., Chordata). Consistent with the literature (Mirarab et al., 2016), ASTRAL run on maximum likelihood gene trees (bestML) seems more accurate than ASTRAL run on the bootstrap replicate (MLBS) gene trees (impacting the recovery of Chordata).

ii) Occupancy analysis Missing data is common in phylogenomics, and common filtering practices can further increase the amount of missing data, with potential impact on downstream analyses (e.g., Hosner et al., 2016; Mai and Mirarab, 2017). DiscoVista can visualize the taxon occupancy for individual species or for groups of taxa; taxon occupancy is defined as the fraction of gene alignments that include at least one of the species from a group. The inputs to this analysis are a set of sequence alignment files (FASTA format) and a species annotation file; the output is a line plot of the occupancy per species or per group. DiscoVista can also visualize taxon occupancy as a heat map that shows the presence or absence of species in gene alignments and, for those present, uses color shades to indicate their non-gapped sequence length compared to the other sequences of the same gene. On the empirical Rouse et al. (2016) dataset, the occupancy plots immediately reveal that Xenoturbella bocki has low occupancy (Fig. 1b), appearing only in a handful of genes. However, the negative impacts of this low occupancy may be ameliorated by the high occupancy of the other two Xenacoelomorpha species, Xenoturbella profundus and Hofstenia miamia. Patterns of occupancy do not vary widely on this dataset.
On the Cannon et al. (2016) dataset, in contrast, some genes have lower levels of occupancy than others (from left to right in Fig. 1c). As in the Rouse et al. (2016) dataset, several species have very low occupancy (e.g., Astrotomma agassizi, Cephalodiscus gracilis, Sterreria sp.). However, Xenoturbella bocki, which is the only Xenoturbella species in this dataset, has high occupancy and appears in most of the genes.

[Figure 1 image. Panel a): heat map with legend "Supported" (0% to 100%), "Weakly Rejected (<90%)", and "Strongly Rejected (>=90%)"; columns include ASTRAL_212, MrBayes_212, RAxML80_393, RAxML70_1178, ASTRAL_MLBS_393, ASTRAL_bestML_393, and RAxML_212_Acoelomorpha. Panels b) and c): occupancy heat maps with species on the y-axis and GENE_ID on the x-axis.]

Figure 1: Species tree discordance and occupancy. a) DiscoVista species tree analysis. Rows correspond to focal splits, and columns correspond to alternative species trees reported in two papers (Rouse et al., 2016; Cannon et al., 2016). The spectrum of blue-green indicates support values for splits compatible with a tree. Note that support values are of different types (Bayesian posterior, concatenation bootstrap support, and multi-locus bootstrap support) and thus may not be directly comparable. Weakly rejected splits correspond to splits that are not present in the tree, but are compatible if low support branches (below 90%) are contracted. In Rouse et al. (2016), RAxML70_1178, RAxML80_393, and RAxML90 are results of concatenation analyses on genes with average occupancy of 70%, 80%, and 90%, respectively. For Cannon et al. (2016), RAxML_212, RAxML_212_Acoelomorpha, RAxML_336, and RAxML_881 are concatenation analyses on 212 orthologous genes of 78 species, 212 orthologous genes after removing Acoelomorpha, 336 orthologous genes on 56 selected species, and 881 orthologous genes on 77 species, respectively. ASTRAL_212 is the species tree estimated using ASTRAL and 212 orthologous genes of 78 species. b, c) Occupancy maps of the 393 gene alignments of Rouse et al. (2016) (b) and the 212 gene alignments of Cannon et al. (2016) (c), with average occupancy of 80% in both datasets. The spectrum of green color shows the gene length, and the white color indicates missing data.
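The definition of taxon occupancy used in the occupancy analysis (the fraction of gene alignments that contain at least one species from a group) can be illustrated with a short sketch. The code below is not taken from DiscoVista; the function names and the in-memory inputs (FASTA text per gene, and a dictionary mapping group names to species) are hypothetical. It also assumes that a sequence consisting only of gap characters counts as absent, which is a choice made for this example.

```python
from typing import Dict, Set


def species_in_fasta(fasta_text: str, gap_chars: str = "-?") -> Set[str]:
    """Return names of sequences that contain at least one non-gap character."""
    present: Set[str] = set()
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else None
        elif name is not None and line.strip(gap_chars):
            # A non-empty remainder means the line holds a non-gap character.
            present.add(name)
    return present


def taxon_occupancy(alignments: Dict[str, str],
                    groups: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of gene alignments containing at least one species per group."""
    presence = [species_in_fasta(text) for text in alignments.values()]
    n_genes = len(presence)
    return {
        group: (sum(1 for p in presence if p & members) / n_genes
                if n_genes else 0.0)
        for group, members in groups.items()
    }


# Two toy gene alignments; X. bocki is all gaps in the first gene.
toy_alignments = {
    "gene1": ">Homo_sapiens\nACGT\n>Xenoturbella_bocki\n----\n",
    "gene2": ">Homo_sapiens\nAC-T\n>Hofstenia_miamia\nACGT\n",
}
toy_groups = {
    "Xenacoelomorpha": {"Xenoturbella_bocki", "Hofstenia_miamia"},
    "Homo_sapiens": {"Homo_sapiens"},
}
toy_occupancy = taxon_occupancy(toy_alignments, toy_groups)
```

Passing a single-species group gives the per-species occupancy, while a multi-species group gives the per-group value, which mirrors the two modes of the line plot described in the text.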

[Figure 2 image. Panel a): gene tree compatibility. Panel b): quartet frequency bar graphs with a cartoon species tree; x-axis tick labels give quartet topologies using neighboring branch labels separated by "#".]

Figure 2: Gene tree discordance figures. a) Gene tree compatibility. This figure shows the portion of RAxML genes for which focal splits (x-axis) are highly (weakly) supported or rejected. Weakly rejected splits are those that are not in the tree but are compatible if low support branches (below 75%) are contracted. b) Frequency of three topologies around focal internal branches of ASTRAL species trees in both datasets. The main topology is shown in red, and the other two alternative topologies are shown in blue. The dotted line indicates the 1/3 threshold. The title of each subfigure indicates the label of the corresponding branch on the tree on the right (also generated by DiscoVista). Each internal branch has four neighboring branches, which can be used to represent quartet topologies. On the x-axis, the exact definition of each quartet topology is shown using the neighboring branch labels separated by "#".

iii) Gene tree compatibility This visualization simply depicts the portion of gene trees supporting or rejecting each focal split. The inputs are one or several collections of gene trees, the split definition file, and a support threshold. For splits that are compatible with a gene tree, those whose branch has bootstrap support above the threshold are considered highly supported, and those with support below it are considered weakly supported. Splits that are not compatible with the original tree but are compatible with the tree if branches with low support values are contracted are considered weakly rejected, and those that are not compatible even after contracting low support branches are considered strongly rejected. By default, gene trees that miss one side of the split completely are removed from the analysis. The figure can also be created such that genes that completely miss one side of the split (or completely miss a user-defined subset of a side of the split) are marked as "missing" (Fig. S6).

On the empirical datasets, these figures reveal high levels of gene tree discordance (Fig. 2a). Although ASTRAL species trees on maximum likelihood gene trees are mostly congruent in both datasets and with concatenation results, gene trees show high amounts of discordance, and none of the major splits are recovered in most gene trees (Fig. 2a). However, for all of the focal splits, the majority of the gene trees are compatible with the species tree after contracting low support branches (below 75%). Interestingly, maximum likelihood gene trees in Rouse et al. (2016) show somewhat less discordance with major species tree splits compared to those in Cannon et al. (2016).

iv) Branch quartet frequencies Every internal branch of a species tree divides the tree into four parts, and thus, at least one quartet tree (often many) can be mapped uniquely onto that branch. For a given branch, assuming (a) the branch is correct, (b) the only source of discordance between gene trees and the species tree is ILS, and (c) there is no gene tree estimation error, the multi-species coalescent (MSC) model has specific expectations about these quartets (Allman et al., 2011). The probability of a gene tree quartet matching the species tree topology ($p > \frac{1}{3}$) is higher than the probability of matching either of the two alternatives, and the two alternatives have equal probabilities ($q = \frac{1-p}{2} < \frac{1}{3}$).
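To make these expectations concrete, the sketch below computes the expected quartet frequencies for a branch of length d in coalescent units, using the relation $p = 1 - \frac{2}{3}e^{-d}$ that the text gives in the next paragraph, together with its inverse. The function names are hypothetical, and clamping estimates to 0 when the observed frequency is at or below 1/3 is an assumption of this example rather than something stated in the paper.

```python
import math
from typing import Tuple


def expected_quartet_frequencies(d: float) -> Tuple[float, float, float]:
    """Expected frequencies (main, alt1, alt2) under the MSC for a branch of
    length d (coalescent units): p = 1 - (2/3) * exp(-d), q = (1 - p) / 2."""
    if d < 0:
        raise ValueError("branch length must be non-negative")
    p = 1.0 - (2.0 / 3.0) * math.exp(-d)
    q = (1.0 - p) / 2.0
    return p, q, q


def branch_length_from_frequency(p: float) -> float:
    """Invert p = 1 - (2/3) * exp(-d), giving d = -ln(1.5 * (1 - p)).

    Frequencies at or below 1/3 are mapped to d = 0 (an assumption made
    here), and p = 1 corresponds to an infinitely long branch.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a frequency in [0, 1]")
    if p <= 1.0 / 3.0:
        return 0.0
    if p == 1.0:
        return math.inf
    return -math.log(1.5 * (1.0 - p))


# A polytomy (d = 0) gives three equal frequencies of 1/3.
polytomy = expected_quartet_frequencies(0.0)
# Round trip: frequency expected for d = 0.5, then back to a branch length.
d_back = branch_length_from_frequency(expected_quartet_frequencies(0.5)[0])
```

Comparing observed frequencies of the two alternative topologies against the common value q is the check that the visualization supports by eye; the sketch only encodes the model's expectations and performs no statistical test.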
Visualizing the empirical frequency of quartets around each branch can serve two purposes: it gives an interpretable measure of discordance specific to that branch (Sayyari and Mirarab, 2016), and one can check whether the assumptions of ILS are met. If the discordance is purely due to ILS, then one would expect the second and the third topologies to have similar frequencies (close to $q$). Lack of this pattern can have many causes, including hybridization (Solís-Lemus et al., 2016). Finally, note that if the length of a species tree branch in coalescent units (Allman et al., 2011) is $d$, then for quartets around it, $p = 1 - \frac{2}{3}e^{-d}$, and thus, the quartet frequencies also convey information about the branch length. In the limit, for $d = 0$ (e.g., a polytomy), one would expect all three frequencies to be close to $\frac{1}{3}$, and $p = q = \frac{1}{3}$ can be tested as a null hypothesis (Sayyari and Mirarab, 2017). In this visualization, for every focal split, a bar graph shows the quartet frequencies around that branch. Moreover, a cartoon tree is generated where large groups of taxa are collapsed into single leaves and branches are labeled consistently with the bar graph. Inputs to this analysis are a set of gene trees, a species tree, and an annotation file that maps each species to a named group and thus defines the leaves of the cartoon tree and, by extension, the focal splits. The portions of quartets around three focal branches of the ASTRAL species trees in our empirical studies show why some branches have been controversial (Fig. 2b). For Nephrozoa (branch labeled as "8") and Deuterostomia (labeled as "9"), the relative frequencies of the main topologies are extremely close to 1/3 in both studies.
This high level of gene tree discordance around these two branches is likely caused by a combination of high true discordance and gene tree estimation error; irrespective of which is more prevalent, the high discordance reveals a cause of the difficulties faced in resolving these relationships. Interestingly, the portions of quartets for alternative topologies in Rouse et al. (2016) follow the expectations of the MSC theory quite well (i.e., the second and third topologies have very close frequencies); the same cannot necessarily be said of the Cannon et al. (2016) gene trees.

v) GC content Commonly used models of sequence evolution are stationary and assume that all species have identical base composition. This assumption is often violated in gene coding sequences, especially in the third codon position (Jeffroy et al., 2006). While non-stationary models (Boussau and Gouy, 2006) and tests of divergence from stationarity (Ababneh et al., 2006) exist, simply visualizing the GC content of different codon positions for different extant taxa can help in judging non-stationarity and in deciding whether a GTR analysis is appropriate for some or all of the data (Jarvis et al., 2014; Wickett et al., 2014). DiscoVista generates box plots (distributions) or dot plots (averages) of the GC content of each codon position, as well as of all three codon positions, for each species. The input to this analysis is a set of gene coding sequences in FASTA format. The two Xenacoelomorpha datasets did not release DNA sequences, and therefore we could not compute their GC content. Nevertheless, examples of the GC content figures can be seen for the plant dataset (Fig. S9). These figures immediately show that assumptions of equal base frequencies are violated. They also show that the variations in the GC content are mostly concentrated in the third codon position. The GC content levels for the second codon position, and to a lesser degree the first codon position, do not vary much across species. These results favor the removal of the third codon position when building gene trees or in concatenation analyses.

To summarize, DiscoVista provides a useful tool for visualizing several patterns of discordance, missing data, and GC variations in phylogenomic datasets.
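The per-codon-position GC content summarized in the GC content analysis can be sketched as follows. This is a minimal example and not DiscoVista's implementation: the function name is hypothetical, and it assumes that each coding sequence is in frame (the first character is codon position 1) and that gap or ambiguous characters still occupy an alignment column but are excluded from the counts.

```python
from typing import List


def gc_by_codon_position(cds: str) -> List[float]:
    """Return GC fractions for codon positions 1, 2, and 3 of an in-frame
    coding sequence. Characters other than A, C, G, T advance the position
    but are not counted; a position with no counted bases yields NaN."""
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for index, base in enumerate(cds.upper()):
        if base in "ACGT":
            position = index % 3
            total[position] += 1
            if base in "GC":
                gc[position] += 1
    return [g / t if t else float("nan") for g, t in zip(gc, total)]


# Two codons, ATG and GCC: position 1 is A,G; position 2 is T,C; position 3 is G,C.
toy_gc = gc_by_codon_position("ATGGCC")
```

Applying the function to every species' sequence in every gene and collecting the three values per species would give the data behind box plots or dot plots of the kind described in the text.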

References

Ababneh, F., Jermiin, L. S., Ma, C., and Robinson, J. 2006. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics, 22(10): 1225–1231.

Allman, E. S., Degnan, J. H., and Rhodes, J. A. 2011. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. Journal of Mathematical Biology, 62(6): 833–862.

Boettiger, C. 2015. An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review, 49(1): 71–79.

Bouckaert, R. R. 2010. DensiTree: Making sense of sets of phylogenetic trees. Bioinformatics, 26(10): 1372–1373.

Boussau, B. and Gouy, M. 2006. Efficient likelihood computations with nonreversible models of evolution. Systematic Biology, 55(5): 756–768.

Cannon, J. T., Vellutini, B. C., Smith, J., Ronquist, F., Jondelius, U., and Hejnol, A. 2016. Xenacoelomorpha is the sister group to Nephrozoa. Nature, 530(7588): 89–93.

Edwards, S. V., Xi, Z., Janke, A., Faircloth, B. C., McCormack, J. E., Glenn, T. C., Zhong, B., Wu, S., Lemmon, E. M., Lemmon, A. R., Leaché, A. D., Liu, L., and Davis, C. C. 2016. Implementing and testing the multispecies coalescent model: A valuable paradigm for phylogenomics. Molecular Phylogenetics and Evolution, 94: 447–462.

Hahn, M. W. and Nakhleh, L. 2016. Irrational exuberance for resolved species trees. Evolution, 70(1): 7–17.

Hosner, P. A., Faircloth, B. C., Glenn, T. C., Braun, E. L., and Kimball, R. T. 2016. Avoiding Missing Data Biases in Phylogenomic Inference: An Empirical Study in the Landfowl (Aves: Galliformes). Molecular Biology and Evolution, 33(4): 1110–1125.

Huson, D. H. 1998. Splitstree: analyzing and visualizing evolutionary data. Bioinformatics, 14(1): 68–73.

Jarvis, E. D., Mirarab, S., Aberer, A. J., Li, B., Houde, P., Li, C., Ho, S. Y. W., Faircloth, B. C., Nabholz, B., Howard, J. T., Suh, A., Weber, C. C., da Fonseca, R. R., Li, J., Zhang, F., Li, H., Zhou, L., Narula, N., Liu, L., Ganapathy, G., Boussau, B., Bayzid, M. S., Zavidovych, V., Subramanian, S., Gabaldón, T., Capella-Gutiérrez, S., Huerta-Cepas, J., Rekepalli, B., Munch, K., Schierup, M. H., Lindow, B., Warren, W. C., Ray, D., Green, R. E., Bruford, M. W., Zhan, X., Dixon, A., Li, S., Li, N., Huang, Y., Derryberry, E. P., Bertelsen, M. F., Sheldon, F. H., Brumfield, R. T., Mello, C. V., Lovell, P. V., Wirthlin, M., Schneider, M. P. C., Prosdocimi, F., Samaniego, J. A., Velazquez, A. M. V., Alfaro-Núñez, A., Campos, P. F., Petersen, B., Sicheritz-Ponten, T., Pas, A., Bailey, T., Scofield, P., Bunce, M., Lambert, D. M., Zhou, Q., Perelman, P., Driskell, A. C., Shapiro, B., Xiong, Z., Zeng, Y., Liu, S., Li, Z., Liu, B., Wu, K., Xiao, J., Yinqi, X., Zheng, Q., Zhang, Y., Yang, H., Wang, J., Smeds, L., Rheindt, F. E., Braun, M. J., Fjeldså, J., Orlando, L., Barker, F. K., Jønsson, K. A., Johnson, W., Koepfli, K.-P., O'Brien, S., Haussler, D., Ryder, O. A., Rahbek, C., Willerslev, E., Graves, G. R., Glenn, T. C., McCormack, J. E., Burt, D. W., Ellegren, H., Alström, P., Edwards, S. V., Stamatakis, A., Mindell, D. P., Cracraft, J., Braun, E. L., Warnow, T., Jun, W., Gilbert, M. T. P., and Zhang, G. 2014. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science, 346(6215): 1320–1331.

Jeffroy, O., Brinkmann, H., Delsuc, F., and Philippe, H. 2006. Phylogenomics: the beginning of incongruence? Trends in Genetics, 22(4): 225–231.

Junier, T. and Zdobnov, E. M. 2010. The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics, 26(13): 1669–1670. Maddison, W. P. 1997. Gene Trees in Species Trees. Systematic Biology, 46(3): 523–536. Mai, U. and Mirarab, S. 2017. Treeshrink: Efficient detection of outlier tree leaves. In J. Meidanis and L. Nakhleh, editors, Comparative Genomics, pages 116–140, Cham. Springer International Publishing. Mirarab, S., Bayzid, M. S., and Warnow, T. 2016. Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. Systematic Biology, 65(3): 366–380. Paradis, E., Claude, J., and Strimmer, K. 2004. Ape: Analyses of phylogenetics and evolution in r language. Bioinformatics, 20(2): 289–290.

Patel, S., Kimball, R., and Braun, E. 2013. Error in phylogenetic estimation for bushes in the tree of life. Phylogenetics and Evolutionary Biology, 1(2): 1–10.

Philippe, H., Brinkmann, H., Lavrov, D. V., Littlewood, D. T. J., Manuel, M., Wörheide, G., and Baurain, D. 2011. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS biology, 9(3): e1000602.

Rokas, A., Williams, B. L., King, N., and Carroll, S. B. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature, 425(6960): 798–804.

Romiguier, J., Ranwez, V., Delsuc, F., Galtier, N., and Douzery, E. J. P. 2013. Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Molecular biology and evolution, 30(9): 2134–2144.

Rouse, G. W., Wilson, N. G., Carvajal, J. I., and Vrijenhoek, R. C. 2016. New deep-sea species of Xenoturbella and the position of Xenacoelomorpha. Nature, 530(7588): 94–97.

Sayyari, E. and Mirarab, S. 2016. Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies. Molecular Biology and Evolution, 33(7): 1654–1668.

Sayyari, E. and Mirarab, S. 2017. Testing for polytomies in phylogenetic species trees using quartet frequencies. arXiv preprint 1708.08916.

Solís-Lemus, C., Yang, M., and Ané, C. 2016. Inconsistency of Species Tree Methods under Gene Flow. Systematic Biology, 65(5): 843–851.

Springer, M. S. and Gatesy, J. 2016. The gene tree delusion. Molecular Phylogenetics and Evolution, 94(Part A): 1–33.

Suh, A., Smeds, L., and Ellegren, H. 2015. The dynamics of incomplete lineage sorting across the ancient adaptive radiation of neoavian birds. PLoS Biol, 13(8): e1002224.

Sukumaran, J. and Holder, M. T. 2010. DendroPy: a Python library for phylogenetic computing. Bioinformatics, 26(12): 1569–1571.

Warnow, T. 1994. Tree compatibility and inferring evolutionary history. Journal of Algorithms, 16(3): 388–407.

Wickett, N. J., Mirarab, S., Nguyen, N., Warnow, T., Carpenter, E. J., Matasci, N., Ayyampalayam, S., Barker, M. S., Burleigh, J. G., Gitzendanner, M. A., Ruhfel, B. R., Wafula, E., Der, J. P., Graham, S. W., Mathews, S., Melkonian, M., Soltis, D. E., Soltis, P. S., Miles, N. W., Rothfels, C. J., Pokorny, L., Shaw, A. J., DeGironimo, L., Stevenson, D. W., Surek, B., Villarreal, J. C., Roure, B., Philippe, H., DePamphilis, C. W., Chen, T., Deyholos, M. K., Baucom, R. S., Kutchan, T. M., Augustin, M. M., Wang, J., Zhang, Y., Tian, Z., Yan, Z., Wu, X., Sun, X., Wong, G. K.-S., and Leebens-Mack, J. 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proceedings of the National Academy of Sciences, 111(45): 4859–4868.

Wickham, H. 2016. ggplot2: elegant graphics for data analysis, volume 35 of Use R! Springer International Publishing, New York.

Zwickl, D. J., Stein, J. C., Wing, R. A., Ware, D., and Sanderson, M. J. 2014. Disentangling methodological and biological sources of gene tree discordance on Oryza (Poaceae) chromosome 3. Systematic Biology, 63(5): 645–659.

Supplementary Material for “DiscoVista: interpretable visualizations of gene tree discordance”

Erfan Sayyari1, James B. Whitfield2, Siavash Mirarab1*

1Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093. 2Department of Entomology, 320 Morrill Hall, University of Illinois, 505 S. Goodwin Avenue, Urbana, IL 61801

This document provides supplementary tables and figures used in the main paper.

Supplementary Figures

Clade Name | Clade Definition | Section Letter | Components | Show
Ambulacraria | Anneissia japonica+Strongylocentrotus purpuratus+Saccoglossus kowalevskii | None | None | 1
Ecdysozoa | Drosophila melanogaster+Peripatopsis capensis+Priapulus sp | None | None | 1
Xenoturbella | Xenoturbella bocki+Xenoturbella profundus | None | None | 0
Acoelomorpha | Hofstenia miamia | None | None | 0
Spiralia | Capitella teleta+Lottia gigantea+Baseodiscus unicolor+Macrodasys sp+Phoronis vancouverensis+Loxosoma pectinaricola+Adineta ricciae+Gnathostomula paradoxa+Schmidtea mediterranea | None | None | 1
Xenacoelomorpha | Hofstenia miamia+Xenoturbella profundus+Xenoturbella bocki | None | None | 1
Chordata | Branchiostoma floridae+Ciona intestinalis+Gallus gallus+Homo sapiens | None | None | 1
All | Gnathostomula paradoxa+Schmidtea mediterranea+Priapulus sp+Xenoturbella profundus+Branchiostoma floridae+Phoronis vancouverensis+Macrodasys sp+Gallus gallus+Strongylocentrotus purpuratus+Amphimedon queenslandica+Loxosoma pectinaricola+Capitella teleta+Homo sapiens+Adineta ricciae+Xenoturbella bocki+Peripatopsis capensis+Hofstenia miamia+Baseodiscus unicolor+Saccoglossus kowalevskii+Nematostella vectensis+Anneissia japonica+Mnemiopsis leidyi+Drosophila melanogaster+ adhaerens+Lottia gigantea+Ciona intestinalis | None | None | 0
Protostomia | Spiralia+Ecdysozoa | None | Spiralia+Ecdysozoa | 1
Deuterostomia | Ambulacraria+Chordata | None | None | 1
Nephrozoa | Protostomia+Deuterostomia | None | Protostomia+Deuterostomia | 1
Bilateria | Nephrozoa+Xenacoelomorpha | None | Nephrozoa+Xenacoelomorpha | 1
Xenambulacraria(C1) | Ambulacraria+Xenacoelomorpha | None | None | 1
C2 | Chordata+Xenambulacraria(C1) | None | None | 1
Bilateria(C3) | C2+Protostomia | None | None | 0
D2 | Xenacoelomorpha+Deuterostomia | None | None | 1
Bilateria(D3) | D2+Protostomia | None | None | 0
Xenambulacraria(E1) | Xenoturbella+Ambulacraria | None | None | 1
E2 | Chordata+Xenambulacraria(E1) | None | None | 1
E3 | Protostomia+E2 | None | None | 1
 | Loxosoma pectinaricola | None | None | 0
Outgroup | All-Bilateria | None | None | 0

Figure S1: Complete split definitions for the Xenoturbella dataset of Rouse et al. (2016).

Clade Name | Clade Definition | Section Letter | Components | Show
Xenoturbella | Xenoturbella bocki | None | None | 0
Acoelomorpha | Diopisthoporus gymnopharyngeus+Diopisthoporus longitubus+Hofstenia miamia+Isodiametra pulchra+Eumecynostomum macrobursalium+Convolutriloba macropyga+Childia submaculatum | None | None | 0
Ambulacraria | Dumetocrinus sp+Labidiaster annulatus+Astrotomma agassizi+Leptosynapta clarki+Strongylocentrotus purpuratus+Cephalodiscus gracilis+Saccoglossus mereschkowskii+Ptychodera bahamensis+Schizocardium braziliense | None | None | 1
Entoprocta | Loxosoma pectinaricola+Barentsia gracilis | None | None | 0
Ecdysozoa | Priapulus caudatus+Halicryptus spinulosus+Peripatopsis capensis+Drosophila melanogaster+Daphnia pulex+Strigamia maritima+Ixodes scapularis | None | None | 1
Chordata | Branchiostoma floridae+Botryllus schlosseri+Ciona intestinalis+Homo sapiens+Gallus gallus+Petromyzon marinus | None | None | 1
Spiralia | Crassostrea gigas+Lineus longissimus+Cephalothrix hongkongiensis+Helobdella robusta+Pomatoceros lamarckii+Capitella teleta+Adineta ricciae+Adineta vaga+Brachionus calyciflorus+Lepidodermella squamata+Macrostomum lignano+Prostheceraeus vittatus+Taenia pisiformes+Schistosoma mansoni+Schmidtea mediterranea+Megadasys sp+Macrodasys sp+Membranipora membranacea+Loxosoma pectinaricola+Barentsia gracilis+Phoronis psammophila+Terebratalia transversa+Hemithiris psittacea+Novocrania anomala+Leptochiton rugatus+Lottia gigantea | None | None | 1
Xenacoelomorpha | Xenoturbella bocki+Diopisthoporus gymnopharyngeus+Diopisthoporus longitubus+Hofstenia miamia+Isodiametra pulchra+Eumecynostomum macrobursalium+Convolutriloba macropyga+Childia submaculatum+Ascoparia sp+Meara stichopi+Sterreria sp+Nemertoderma westbladi | None | None | 1
Deuterostomia | Dumetocrinus sp+Labidiaster annulatus+Astrotomma agassizi+Leptosynapta clarki+Strongylocentrotus purpuratus+Cephalodiscus gracilis+Saccoglossus mereschkowskii+Ptychodera bahamensis+Schizocardium braziliense+Branchiostoma floridae+Botryllus schlosseri+Ciona intestinalis+Homo sapiens+Gallus gallus+Petromyzon marinus | None | None | 1
All | Cliona varians+Salpingoeca rosetta+Xenoturbella bocki+Branchiostoma floridae+Botryllus schlosseri+Macrodasys sp+Membranipora membranacea+Petromyzon marinus+Eumecynostomum macrobursalium+Pomatoceros lamarckii+Astrotomma agassizi+Labidiaster annulatus+Macrostomum lignano+Peripatopsis capensis+Hofstenia miamia+Cephalodiscus gracilis+Phoronis psammophila+Adineta vaga+Priapulus caudatus+Lottia gigantea+Euplokamis dunlapae+Schizocardium braziliense+Sycon ciliatum+Halicryptus spinulosus+Ptychodera bahamensis+Cephalothrix hongkongiensis+Leucosolenia complicata+Strongylocentrotus purpuratus+Diopisthoporus longitubus+Capitella teleta+Leptochiton rugatus+Novocrania anomala+Crassostrea gigas+Lepidodermella squamata+Helobdella robusta+Oscarella carmela+Ixodes scapularis+Mnemiopsis leidyi+Stomolophus meleagris+Ciona intestinalis+Nemertoderma westbladi+Schmidtea mediterranea+Ascoparia sp+Strigamia maritima+Brachionus calyciflorus+Convolutriloba macropyga+Monosiga brevicollis+Terebratalia transversa+Loxosoma pectinaricola+Meara stichopi+Lineus longissimus+Hemithiris psittacea+Childia submaculatum+Adineta ricciae+Daphnia pulex+Taenia pisiformes+Prostheceraeus vittatus+Megadasys sp+Nematostella vectensis+Schistosoma mansoni+Eunicella cavolinii+Drosophila melanogaster+Isodiametra pulchra+Leptosynapta clarki+Homo sapiens+Barentsia gracilis+Craspedacusta sowerby+Gallus gallus+Amphimedon queenslandica+Saccoglossus mereschkowskii+Trichoplax adhaerens+Diopisthoporus gymnopharyngeus+Dumetocrinus sp+Pleurobrachia bachei+Aphrocallistes vastus+Acropora digitifera+Agalma elegans+Sterreria sp | None | None | 0
Protostomia | Spiralia+Ecdysozoa | None | Spiralia+Ecdysozoa | 1
Nephrozoa | Protostomia+Deuterostomia | None | Protostomia+Deuterostomia | 1
Bilateria | Nephrozoa+Xenacoelomorpha | None | Nephrozoa+Xenacoelomorpha | 1
Xenambulacraria(C1) | Ambulacraria+Xenacoelomorpha | None | None | 1
C2 | Chordata+Xenambulacraria(C1) | None | None | 1
Bilateria(C3) | C2+Protostomia | None | None | 0
D2 | Xenacoelomorpha+Deuterostomia | None | None | 1
Bilateria(D3) | D2+Protostomia | None | None | 0
Xenambulacraria(E1) | Xenoturbella+Ambulacraria | None | None | 1
E2 | Chordata+Xenambulacraria(E1) | None | None | 1
E3 | Protostomia+E2 | None | None | 1

Figure S2: Complete split definitions for the Xenoturbella dataset of Cannon et al. (2016).

Figures S3 to S9 are all for the plant dataset of Wickett et al. (2014).

[Figure S3 image: clades as rows, analyses as columns; color scale "BOOT" from -100 to 100.]

Figure S3: DiscoVista species tree analysis on the 1kp dataset. Rows correspond to major orders and clades, and columns correspond to the results of different methods reported in two different closely related datasets. The blue-green spectrum indicates the MLBS values for monophyletic clades. Weakly rejected clades are clades that are not present in the tree but are compatible with it if low-support branches (below 90%) are contracted.

[Figure S4 image: clades as rows, analyses as columns; legend "Classification": Strong Support, Weak Support, Compatible (Weak Rejection), Strong Rejection.]

Figure S4: DiscoVista species tree analysis on the 1kp dataset. Rows correspond to major orders and clades, and columns correspond to the results of different methods reported in two different closely related datasets. Weakly rejected clades are clades that are not present in the tree but are compatible with it if low-support branches (below 90%) are contracted.
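The four categories in the legend of Figure S4 can be read as a small decision rule. The Python sketch below is illustrative only and is not taken from the DiscoVista source code: the function name `classify_clade` and its arguments are hypothetical, and it assumes that the same threshold (90% here) both separates strong from weak support and determines which branches are contracted.

```python
def classify_clade(is_monophyletic, support, compatible_after_contraction,
                   threshold=90.0):
    """Assign one of the four labels used in the legend of Figure S4.

    is_monophyletic: True if the clade appears as a branch of the tree.
    support: support (in percent) of that branch; ignored if the clade is absent.
    compatible_after_contraction: True if the clade is compatible with the tree
        once branches with support below `threshold` are contracted.
    """
    if is_monophyletic:
        # Assumption: the contraction threshold also separates strong from
        # weak support.
        return "Strong Support" if support >= threshold else "Weak Support"
    if compatible_after_contraction:
        return "Compatible (Weak Rejection)"
    return "Strong Rejection"


# Hypothetical inputs, one per category.
examples = [
    (True, 100.0, True),
    (True, 62.0, True),
    (False, None, True),
    (False, None, False),
]
labels = [classify_clade(*example) for example in examples]
print(labels)
```

In practice the two boolean inputs would come from comparing a clade's bipartition against the tree before and after contracting low-support branches; that step is left out here.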

[Figure S5 image: panels a) and b); y-axis "Occupancy" in percent; x-axis lists individual species in a) and major clades in b); series FNA2AA-filterlen33 and FNA2AA-f25.]

Figure S5: DiscoVista analyses on the 1kp dataset. a) Occupancy analysis on the 1kp dataset for each individual species under two model conditions. FNA2AA-f25: amino acid sequences back-translated to DNA, with sequences on long branches (25X median branch length) removed; FNA2AA-filterlen33: amino acid sequences back-translated to DNA, with fragmentary sequences (66% gaps or more) removed. b) Occupancy analysis of major clades in the 1kp dataset under the same model conditions as in (a).
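As a rough illustration of what a per-species occupancy value is, the following Python sketch computes, for each species, the fraction of genes in which it is present. This definition is an assumption made for illustration, and the function name `species_occupancy` and the toy input are hypothetical; it is not DiscoVista's own implementation.

```python
from collections import Counter


def species_occupancy(gene_to_species):
    """Return, for each species, the fraction of genes in which it appears."""
    n_genes = len(gene_to_species)
    counts = Counter()
    for species in gene_to_species.values():
        counts.update(set(species))  # count each species once per gene
    return {sp: counts[sp] / n_genes for sp in counts}


# Hypothetical toy input: gene name -> species present in that alignment.
genes = {
    "gene1": ["Zea_mays", "Oryza_sativa", "Pinus_taeda"],
    "gene2": ["Zea_mays", "Pinus_taeda"],
    "gene3": ["Zea_mays", "Oryza_sativa"],
    "gene4": ["Zea_mays"],
}
occupancy = species_occupancy(genes)
for name in sorted(occupancy):
    print(f"{name}\t{occupancy[name]:.0%}")
```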

[Figure S6 image: three panels (FAA_filterlen33, FNA2AA_c1c2_filterlen33, FNA2AA_filterlen33); y-axis "Number of Gene Trees" (0 to 800); x-axis lists clades; legend: Strongly Supported, Weakly Supported, Missing, Weakly Reject, Strongly Reject.]

Figure S6: Gene tree analysis on the 1kp dataset. The portion of RAxML gene trees for which important clades (x-axis) are highly (weakly) supported or rejected, for three model conditions of the 1kp dataset. FAA-filterlen33: gene trees on amino acid sequences, with fragmentary sequences (66% gaps or more) removed; FNA2AA-f25: amino acid sequences back-translated to DNA, with sequences on long branches (25X median branch length) removed; FNA2AA-filterlen33: amino acid sequences back-translated to DNA, with fragmentary sequences (66% gaps or more) removed. Weakly rejected clades are those that are not in the tree but are compatible with it if low-support branches (below 75%) are contracted.

[Figure S7 image: three panels (FAA_filterlen33, FNA2AA_c1c2_filterlen33, FNA2AA_filterlen33); y-axis "Proportion of relevant gene trees" (0.00 to 1.00); x-axis lists clades; legend: Strongly Supported, Weakly Supported, Weakly Reject, Strongly Reject.]

Figure S7: Gene tree analysis on the 1kp dataset. The number of RAxML gene trees for which important clades (x-axis) are highly (weakly) supported, rejected, or missing, for three model conditions (the same as in Figure S6) of the 1kp dataset. Weakly rejected clades are those that are not in the tree but are compatible with it if low-support branches (below 75%) are contracted.

[Figure S8 image: bar charts of relative frequency (y-axis) for the three quartet topologies t1, t2, and t3 (legend "Topology") around focal branches, shown next to cartoon trees with numbered branches.]

Figure S8: Example DiscoVista visualizations of gene tree discordance on the 1kp dataset (Wickett et al., 2014). Branch quartet frequency graphs. Bars show the frequencies of observing the three quartet topologies around focal branches of the ASTRAL species tree among the main trimmed gene trees of the 1kp dataset. Each internal branch has four neighboring branches, which can be arranged in three ways. The frequency of the species tree topology among gene trees is shown in red, and the other two alternative topologies are shown in blue. The dotted lines indicate the 1/3 threshold. The title of each box indicates the label of the corresponding branch on the associated cartoon tree (also generated by DiscoVista). On the x-axis the exact definition of each quartet topology is shown using the neighboring branch labels separated by “#”.
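The bars in Figure S8 are relative frequencies of the three possible quartet topologies around a branch. The Python sketch below shows that normalization and the comparison with the 1/3 line. The counts are hypothetical, the function name is not part of DiscoVista, and t1 is assumed to be the species tree topology.

```python
def quartet_relative_frequencies(counts):
    """Normalize counts of the three quartet topologies (t1, t2, t3) to sum to 1."""
    if len(counts) != 3:
        raise ValueError("expected counts for exactly three topologies")
    total = sum(counts)
    if total == 0:
        raise ValueError("no gene tree resolves this quartet")
    return [c / total for c in counts]


# Hypothetical counts around one focal branch; t1 is assumed to be the
# species tree topology.
counts = [420, 190, 190]
freqs = quartet_relative_frequencies(counts)
for name, freq in zip(("t1", "t2", "t3"), freqs):
    side = "above" if freq > 1 / 3 else "at or below"
    print(f"{name}: {freq:.3f} ({side} the 1/3 line)")
```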

[Figure S9 image: a) box plots and b) dot plot of GC content (y-axis) per species (x-axis) for the ALL, C1, C2, and C3 codon positions.]

Figure S9: DiscoVista analyses on the 1kp dataset. a, b) GC content analysis of the 1kp dataset using box plots and a dot plot, respectively, for the first, second, and third codon positions, as well as all three codon positions together. In the dot plot, each dot shows the average GC content ratio for one species at all (red), first (pink), second (light blue), and third (dark blue) codon positions.
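To make the C1, C2, C3, and ALL quantities of Figure S9 concrete, the following Python sketch computes GC content per codon position for a single in-frame coding sequence. It is a minimal illustration rather than the DiscoVista implementation; the function name and the choice to skip gaps and ambiguous characters are assumptions.

```python
def gc_content_by_codon_position(seq):
    """Return GC fractions for codon positions 1, 2, 3 and for all positions.

    `seq` is an in-frame coding sequence. Gaps and ambiguous characters are
    skipped but still advance the codon position.
    """
    gc = {"C1": 0, "C2": 0, "C3": 0}
    total = {"C1": 0, "C2": 0, "C3": 0}
    for i, base in enumerate(seq.upper()):
        if base not in "ACGT":
            continue
        key = f"C{i % 3 + 1}"
        total[key] += 1
        if base in "GC":
            gc[key] += 1
    result = {k: (gc[k] / total[k] if total[k] else float("nan")) for k in gc}
    n_all = sum(total.values())
    result["ALL"] = sum(gc.values()) / n_all if n_all else float("nan")
    return result


# Hypothetical aligned coding sequence with one trailing gap codon.
gc_result = gc_content_by_codon_position("ATGGCCGAT---")
print(gc_result)
```

Averaging these per-sequence values over all genes of a species would give one dot per species and codon position, as in panel b).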

Structure of parameter files

Splits definitions: The splits definition file has 7 columns separated by tabs. The first column lists the split names. The second column lists the species or other splits that define the split, joined with “+” or “-” signs: “+” adds a species or split, and “-” subtracts one. The third column, which may be left blank, gives a name for the part of the tree that the split belongs to, e.g. 1-Base or Base. The fourth column defines the components of the split; if any of them is absent, the split is considered missing. The fifth column indicates whether the split should be shown in the species tree (gene trees) analysis. The sixth column defines components from the other side of the split; if any of them is absent, the split is likewise considered missing. If this column is left blank, “All” minus the split is used as the other-side component. The last column holds any comments. Note also that a split named “All” must be defined, consisting of all the species names in your analysis.
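The “+” and “-” rules for the second column can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DiscoVista's parser: the function `resolve_splits` is hypothetical, operators are applied left to right, a token that matches a split name is expanded recursively, and species names are assumed to contain no “+” or “-” characters (a hyphenated name would need extra handling). The toy definitions reuse names from Figure S1 but are not the real definitions.

```python
import re


def resolve_splits(definitions):
    """Expand split definitions into sets of species names.

    `definitions` maps a split name to its definition string, in which species
    or other split names are combined with "+" (add) and "-" (subtract).
    """
    resolved = {}

    def resolve(name, stack=()):
        if name in resolved:
            return resolved[name]
        if name in stack:
            raise ValueError(f"circular definition involving {name!r}")
        members, sign = set(), "+"
        for token in re.split(r"([+-])", definitions[name]):
            token = token.strip().strip('"')
            if token in ("+", "-"):
                sign = token
                continue
            if not token:
                continue
            # A token naming another split is expanded; anything else is
            # treated as a species name.
            if token in definitions and token != name:
                part = resolve(token, stack + (name,))
            else:
                part = {token}
            members = members | part if sign == "+" else members - part
        resolved[name] = members
        return members

    for split_name in definitions:
        resolve(split_name)
    return resolved


# Toy definitions (hypothetical subset, for illustration only).
toy = {
    "All": "Homo sapiens+Gallus gallus+Drosophila melanogaster"
           "+Priapulus sp+Nematostella vectensis",
    "Chordata": "Homo sapiens+Gallus gallus",
    "Ecdysozoa": "Drosophila melanogaster+Priapulus sp",
    "Nephrozoa": "Chordata+Ecdysozoa",
    "Outgroup": "All-Nephrozoa",
}
splits = resolve_splits(toy)
print(sorted(splits["Nephrozoa"]))
print(sorted(splits["Outgroup"]))
```

Resolving names lazily means the rows of the file can appear in any order, at the cost of an explicit check for circular definitions.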

Annotation file: In this file, there should be a row for each species. The first column of the annotation file specifies the name of species, and the second column defines the split that this species belongs to. Columns are separated by tab.
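A file with this two-column layout can be read with a few lines of Python. The sketch below is only an illustration of the format described above; the function name `read_annotation` and the example rows are hypothetical (the species and split names are borrowed from the labels of Figure S5).

```python
import csv
import io
from collections import defaultdict


def read_annotation(handle):
    """Read a two-column, tab-separated annotation file into {species: split}."""
    annotation = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected two columns, got: {row!r}")
        annotation[row[0].strip()] = row[1].strip()
    return annotation


# Hypothetical file content, one species per row.
text = "Zea_mays\tMonocots\nOryza_sativa\tMonocots\nPinus_taeda\tGymnosperms\n"
annotation = read_annotation(io.StringIO(text))

# Group species by the split they belong to.
by_split = defaultdict(list)
for species, split in annotation.items():
    by_split[split].append(species)
print(dict(by_split))
```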

Installation, Commands, and Usage

To install DiscoVista from source, please refer to https://github.com/esayyari/DiscoVista. DiscoVista relies on several R, Python, and C packages, as well as ASTRAL. To make the installation easier, we also provide a docker image, available at https://hub.docker.com/r/esayyari/discovista/. To use this image, after installing docker, it is sufficient to download it using the following command:

docker pull esayyari/discovista

You can then run the different analyses using the following command:

docker run -v :/data esayyari/discovista discoVista.py [OPTIONS]

Species discordance analysis: For this analysis you need the path to the species trees folder, the clade definitions, the output folder, and a threshold. For more information regarding the structure of the species tree folder, please refer to https://github.com/esayyari/DiscoVista. You can then use the following command with docker:

docker run -v :/data esayyari/discovista discoVista.py -m 0 -p -t -c -o -y -w

Discordance analysis on gene trees: For this analysis you need the path to the gene trees folder, the clade definitions, the output folder, and a threshold. For more information regarding the structure of the gene trees folder, please refer to https://github.com/esayyari/DiscoVista. You can then use the following command with docker:

docker run -v :/data esayyari/discovista discoVista.py -m 1 -p -t -c -o -w

GC content analysis: For this analysis you need the path to the coding alignments (in FASTA format), with the structure described in more detail at https://github.com/esayyari/DiscoVista, and the output folder. You can use the following command with docker:

docker run -v :/data esayyari/discovista discoVista.py -m 2 -p -o

Occupancy analysis: For this analysis you need the path to the alignments (in FASTA format), with the structure described in more detail at https://github.com/esayyari/DiscoVista, and the output folder. You can then use this command with docker:

docker run -v :/data esayyari/discovista discoVista.py -p $path -m 3 -a -o

Relative frequency analysis: For this analysis you need the path to the gene trees and species tree folder (structure described on GitHub), the annotation file, the outgroup clade name, and the output folder. Using docker, you can run this analysis with the following command:

docker run -v :/data esayyari/discovista discoVista.py -p $path -m 5 -a -o

If desired, the cartoon tree can be shown rooted, which requires the name of the outgroup as input (e.g. Base or Outgroup):

docker run -v :/data esayyari/discovista discoVista.py -p $path -m 5 -a -o -g
