<<

ASTRAL-III: increased scalability and impacts of contracting low support branches

Chao Zhang Maryam Rabiee Erfan Sayyari Siavash Mirarab

University of California, San Diego Gene tree discordance

The species tree

Gorilla Human Chimp Orangutan gene 1 gene1000

Gorilla Human Chimp Orang. Gorilla Chimp Human Orang. A gene tree

Causes of gene tree discordance include: • Incomplete Lineage Sorting (ILS) • Duplication and loss • Horizontal Gene Transfer (HGT)

2 Incomplete Lineage Sorting (ILS)

A random process related to the coalescence of alleles across various populations

Tracing alleles through generations

3 Incomplete Lineage Sorting (ILS)

A random process related to the coalescence of alleles across various populations

Tracing alleles through generations

3 Incomplete Lineage Sorting (ILS)

A random process related to the coalescence of alleles across various populations

Tracing alleles through generations

Multi-species coalescent (MSC) model captures ILS

3 Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

Chimp Gorilla θ1=70% θ2=15% θ3=15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp

Human Orang. Human Orang. Human Gorilla Human Orang.

4 Unrooted quartets under MSC model

For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)

Chimp Gorilla θ1=70% θ2=15% θ3=15% d=0.8 Chimp Gorilla Orang. Chimp Gorilla Chimp

Human Orang. Human Orang. Human Gorilla Human Orang.

The most frequent gene tree = The most likely species tree

4 More than 4 species

For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

Gorilla Human Chimp Orang. Rhesus

5 More than 4 species

For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

Gorilla Human Chimp Orang. Rhesus

n 1. Break gene trees into (4 ) quartets of species 2. Find the dominant tree for all quartets of taxa 3. Combine quartet trees

Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

5 More than 4 species

For >4 species, the species tree topology can be different from the most like gene tree (called anomaly zone) (Degnan, 2013)

Gorilla Human Chimp Orang. Rhesus

n 1. Break gene treesAlternative into (4 ) quartets: of species weight all n quartet topologies 2. Find the dominant3(4 ) tree for all quartets of taxa 3. Combine quartetby their trees frequency and find the optimal tree Some tools (e.g.. BUCKy-p [Larget, et al., 2010])

5 Maximum Quartet Support Species Tree

• Optimization problem:

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet m trees induced by T Score(T )= Q(T ) Q(ti) | \ | a gene tree 1 X

6 Maximum Quartet Support Species Tree

• Optimization problem:

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet m trees induced by T Score(T )= Q(T ) Q(ti) | \ | a gene tree 1 X • Theorem: Statistically consistent under the multi- species coalescent model when solved exactly

6 Maximum Quartet Support Species Tree

• Optimization problem: NP-Hard [Lafond & Scornavaccaori, 2016]

Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

the set of quartet m trees induced by T Score(T )= Q(T ) Q(ti) | \ | a gene tree 1 X • Theorem: Statistically consistent under the multi- species coalescent model when solved exactly

6 ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem using dynamic programming

7 ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]

• ASTRAL solves the problem using dynamic programming

• Introduced a constrained version of the problem

• Draws branches in the species tree from a given set X = {all bipartitions in all gene trees}

• Th constrained version is statistically consistent

2 1.73 • Running time: O(n k|X| )

7 ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk|X|1.73) instead of O(n 2k|X|1.73)

8 ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk|X|1.73) instead of O(n 2k|X|1.73)

2. Add extra bipartitions to the set X using heuristics

• Consensus + support + subsampling species

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

8 ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]

1. Faster calculation of the score function inside DP

• O(nk|X|1.73) instead of O(n 2k|X|1.73)

2. Add extra bipartitions to the set X using heuristics

• Consensus + support + subsampling species

• Using quartet-based distances to find likely branches

• Complete incomplete gene trees

3. Can handle input gene trees with polytomies in O(n 3k|X|1.73)

8 ASTRAL used by the biologists

• Plants: Wickett, et al., 2014, PNAS • : Prum, et al., 2015, Nature • Xenoturbella, Cannon et al., 2016, Nature ASTRAL • Xenoturbella, Rouse et al., 2016, Nature

• Flatworms: Laumer, et al., 2015, eLife

• Shrews: Giarla, et al., 2015, Syst. Bio.

• Frogs: Yuan et al., 2016, Syst. Bio.

• Tomatoes: Pease, et al., 2016, PLoS Bio. ASTRAL-II

• Angiosperms: Huang et al., 2016, MBE

• Worms: Andrade, et al., 2015, MBE Zhang et al. Page 14 of 34

Zhang et al. Page 14 of 34 (a) ASTRAL-III50 : improved200 running time 500 1000

−6 2 1600 (a) 200 Feature 1: 50 200 500 1000

−6 2−8 ASTRAL−II 2 1600Low gene tree error ASTRAL−III 200High gene tree error

ASTRAL-III can now −8 2 ASTRAL−II ASTRAL−III 2−10 analyze multifurcating gene trees with the 2−10 same running time as

Avg weight calculation time (seconds) weight Avg 2−12 binary gene trees: Avg weight calculation time (seconds) weight Avg −12 2 non 0 3 5 7 10 201.7333 50 75 non 0 3 5 7 10 20 non33 0 503 755 7 non10 20 033 503 755 non7 0 103 205 73310 5020 3375 50 non75 non0 0 3 3 55 77 10 1020 2033 503375 50non750 3 5 7 10 20 33 50 75 O(nk|X| ) contraction contraction more polytomies (b)

(b) 50 200 500 1000 10 100 50 200 500 1000 ● ● ● ●

100 ● ● ● ● 75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 ● ● 75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

Set X cluster size (K) Set X cluster size ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● 0 ● ● ● non 0 3 5 7 10 20 33 50 75 non 0 3 ●5 7 10 20 33 50 75 non 0 3 5 7 10 20 33● 50● 75 non 0 3 5 7 10 20 33 50 75 ● ● ● ● ●

Set X cluster size (K) Set X cluster size ● ● ● ● contraction ● ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure 6 Weight● calculation and X on S100. Average and standard error of (a) the time it ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● | | ● ● ● ● ● takes to score a single tripartition using Eq. 3 and (b) search space size X are shown for both 0 | | non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 ASTRAL-II10 20 33 50 and75 ASTRAL-IIInon 0 3 5 on7 10 the20 S10033 50 dataset.75 non Running0 3 5 time7 10 is20 in33 log50 scale.75 We vary numbers of gene treescontraction (boxes)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsfor with 400 and 800bp alignments.

Figure 6 Weight calculation and X on S100. Average and standard error of (a) the time it takes to score a single tripartition| using| Eq. 3 and (b) search space size X are shown for both ASTRAL-II and ASTRAL-III on the S1003.2.2 dataset. Running Running time for largetime polytomies is in log| scale.| We vary numbers of gene trees (boxes)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsforASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the with 400 and 800bp alignments. running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and ASTRAL-III can have a di↵erent set X, we show the running time per each weight calculation (i.e., Eq. 3). As we contract low support branches and hence increase the prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II, 3.2.2 Running time for large polytomieswhereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases. ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the 3.2.3 The search space running time when gene trees includeComparing polytomies the size of the (Fig. search 6a). space Since ( X ) ASTRAL-II between ATRAL-II and and ASTRAL- | | ASTRAL-III can have a di↵erentIII set showsX, that we as show intended, the the running search space time is perdecreased each in weight size for cases with no polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac- calculation (i.e., Eq. 3). As we contract low support branches and hence increase the tion, on average, X is always smaller for ASTRAL-III than ASTRAL-II. With few | | prevalence of polytomies, the weighterror-prone calculation gene trees time (50 gene quickly trees from grows 200bp for alignments), ASTRAL-II, the search space has whereas, in ASTRAL-III, the weightreduced calculation dramatically time but with remains many genes flat, or or high-quality even decreases. gene trees, the reduc- tions are minimal. Moreover, the search space for gene trees estimated from short alignments (e.g., 200bp) is several times larger than those based on longer align- 3.2.3 The search space ments (e.g., 1600bp) for both methods. Contracting low support branches initially Comparing the size of the searchincreases space the ( searchX ) space between because ATRAL-II ASTRAL-III adds and multiple ASTRAL- resolutions per poly- tomy to X. Further| | contraction results in reductions in X , presumably because | | III shows that as intended, the searchmany polytomies space existis decreased and they are in resolved size similarly for cases inside with ASTRAL-III. no polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac- tion, on average, X is always smaller for ASTRAL-III than ASTRAL-II. With few | | error-prone gene trees (50 gene trees from 200bp alignments), the search space has reduced dramatically but with many genes or high-quality gene trees, the reduc- tions are minimal. Moreover, the search space for gene trees estimated from short alignments (e.g., 200bp) is several times larger than those based on longer align- ments (e.g., 1600bp) for both methods. Contracting low support branches initially increases the search space because ASTRAL-III adds multiple resolutions per poly- tomy to X. Further contraction results in reductions in X , presumably because | | many polytomies exist and they are resolved similarly inside ASTRAL-III. Zhang et al. Page 13 of 34

Zhang et al. Page 30 of 34

ASTRAL-III: improved500 running time 1500

● 10 2 ● ● ●

● Feature 2: 500 1500 ● ● ● ● 1500 Can exploit similarities ● ● ● 5 ASTRAL2 2 ● between gene trees ● ASTRAL3 ●

● 1.73 1000 ● • ● O(D|X| ) where D is ● 2.09 ● 2.06 ● 2.27 ● 2.29 ● the sum of degrees of all ● Running time (minutes) ● 0 ● unique gene tree2 nodes 500 ● ● Running time (minutes) ASTRAL2 ●

● ● ● ● ● ASTRAL3 • Overlay all gene trees onto ● ● ● ● ● ● ● ● a single polytree; use a 0 ●● ● ● ●● ● ● 8 9 10 11 12 13 14 8 9 10 11 12 13 14 bottom-up traversal2 to 2 2 2 02 25000 2 100002 215000 2 0 2 5000 2 100002 2 15000 #Genes score a cluster #Genes Figure S2 Running time versus k. Average running time of ASTRAL-II versus ASTRAL-III on Figure 5 Running timethe versus avian dataset11k. Average with 500bp running or 1500bp times alignments (4 replicates) with varying are numbers shown of gens for ASTRAL-II (k), shown in 14 and ASTRAL-III on thenormal avian scale. dataset A LOESS with curve 500bp is fit to or the 1500bp data points. alignments ASTRAL-II with could varying not finish numbers on 2 genes of in the allotted 48 hour time. Averages are over 4 runs. gens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in the log/log space and line slopes are shown. ASTRAL-II did not finish on 214 genes in 48 hour.

branches has minimal impact on the discordance (eight discordant branches with (a)

binned MP-EST instead of nine).50 However, contracting200 low500 support branches1000 with

2−6

3%–33% thresholds dramaticallyASTRAL−II reduces the discordance with the reference tree (2, ASTRAL−III

2, 4, 2, 3, and 3 discordant2−8 branches, respectively, for 3%, 5%, 7%, 10%, 20%, and 400 33%). Three thresholds (3%,800 5%, and 10%) produce an identical tree (Fig. 4d). The remaining di↵erences are2−10 among the branches that are deemed unresolved by Jarvis Avg weight calculation time (seconds) weight Avg et al. and change among2−12 the reference trees as well [5]. Contracting at 50% and 75%

non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 thresholds, however, increases discordance to five andcontraction six branches, respectively.

Thus, consistent with simulations, contracting very(b) low support branches seems to produce the best results, when50 judged by200 similarity with500 the reference1000 trees. To ● ● ● ●

● 60000 ● ● summarize, ASTRAL-III obtained on unbinned but collapsed gene trees● agreed with ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 40000 ● ● ● all major relations in Jarvis et al., including the novel ● group, whereas● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Set X cluster size ● ● ● the unresolved tree missed important ● ● (Fig. 4). ● ● 20000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 3.2 RQ2: Running time improvements contraction We study the improvements in running time400 as800 various● ASTRAL−II parameters● ASTRAL−III change.

Figure S3 Weight calculation and X on S100. Average and standard error of (a) the time it takes to score a single tripartition using| | Eq. 3 and (b) search space size X for both ASTRAL-II 3.2.1 Varying the number(red) and ASTRAL-III of genes (blue) (k) on the S100 dataset. Running time is in log| scale| for varying We compare ASTRAL-IIInumbers of to gene ASTRAL-II trees (boxes)andsequencelength400and800( on the avian simulatedline types dataset,). changing the number of genes from 28 to 214 and forcing X to be the same for both versions to enable comparing impacts of improved weight calculation. We allow each replicate run to take up to two days. ASTRAL-II is not able to finish on the dataset with k =214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the running time over ASTRAL-II and the extent of the improvement depends on k (Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With 213 genes, the largest value where both versions could run, ASTRAL-III finishes on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as O(k2.08), which is better than that of ASTRAL-II at O(k2.28), and both are better than the theoretical worst case, which is O(k2.726). ASTRAL-III: improved running time

Feature 3.

An A*-like algorithm is used to trim the dynamic programming

• Compute an upper-bound on the score of a cluster without “expanding” it

• If the upper bound is below the best existing score, don’t expand

12 New features (journal version)

4. Restrict the search space to guarantee |X|=O(nk)

• Overall running time is O(D(nk)1.73)= O((nk)2.73)

• Does not impact accuracy

• Further improves speed

13 New features (journal version)

4. Restrict the search space to guarantee |X|=O(nk)

• Overall running time is O(D(nk)1.73)= O((nk)2.73)

• Does not impact accuracy

• Further improves speed

5. The A*-like algorithm is further developed to compute . . even tighter upper bounds (complicated)

• Improves speed only empirically and slightly

13 Zhang et al. Page 34 of 34

Zhang et al. Page 13 of 34

Zhang et al. Zhang et al. Page 13 of 34 Page 13 of 34

Empirical running500 time 1500 ● 10 2 ● ● ● 500 1500500 1500

● ● ● ● ● 10 10 ● ● 2 ● 2 ● 1M ● ● ● ●

1000 ● ● 500k ● ● ● ● ● ● ● 5 ● ● ● 2 ● ● ● ● ● 200k ● ● 5 ● 5 2 ● 2 ● 100 ● ● ●

100k 1.09 ● ● 1.83 ● ● 2.09 ● 2.06 ●

Size of X Size 2.27 2.29 ● ● ● 2.09 ● 2.062.09 ● ● 2.06 50k ● ● ● 2.27Running time (minutes) ● 2.292.27 ● 2.29 ● 10 ● 0 ● ● ●

Running time (minutes) 2 Running time (minutes) Running time (minutes) ● ● ● ● 0 0 ● ● ASTRAL2 2 2 ● 20k ● ● ● ASTRAL2 ● ASTRAL2 ● ● ●● ● ● ASTRAL3 ● ● ● ● ● ● ●● ● ASTRAL3 ● ASTRAL3 10k ● ● ● 1

8 9 10 11 12 13 14 8 9 10 11 12 13 14 200 500 1000 2000 5000 10000 200 500 100028 200029 5000210 2 10000211 2 212 2 213 2 214 22882 2299 2 221100 2 211 2 212 2 213 2 214 282 29 2 210 2 211 2 212 213 214 #Species #Species #Genes #Genes #Genes

Figure S8 Empirical running time of ASTRAL-III with n.Averagerunningtimeisshownfor ASTRAL-III for datasetsFigure with varying 5 Runningn.Averagesareover20replicates.Onereplicateof2000 time versus k.FigureAverage 5 Running running times time versus(4 replicates)k. Average2 are shown running for times ASTRAL-II (4 replicates) are shown for ASTRAL-II species dataset could not finishFigure in 2Empirical days and is5 removed Running fromrunning the analysis. time Notetime versus that theseseemsk. Average close to running O((nk times) ) (4 replicates) are shown for ASTRAL-II datasets have factorsand other ASTRAL-III than n that change on as well the (e.g., avian the amount dataset ofand ILS, etc.). ASTRAL-III with Thus, 500bp these or on 1500bp the avian alignments dataset with with 500bp varying or numbers 1500bp alignments of with varying numbers of running times should be treatedand as ball-park ASTRAL-III estimates. Finally, we on note thethat on avian the 10,000 dataset with 500bp or 1500bp alignments with varying numbers of dataset, we have onlygens 2 replicates (k), and shown not 20. in log scale (seegens Fig. S2 (k), for shown normal in scale). log scale A (see line is Fig. fit S2 to the for normal data points scale). in A the line is fit to the data points in the log/loggens space ( andk), line shown slopes in are log shown.log/log scale ASTRAL-II space (see14 Fig. and line S2 did slopes for not normal finish are shown. on scale).214 ASTRAL-IIgenes A inline 48 is did hour. fit not to finish the dataon 214 pointsgenes in in 48 the hour. log/log space and line slopes are shown. ASTRAL-II did not finish on 214 genes in 48 hour.

branches has minimal impactbranches on the has discordance minimal impact (eight discordant on the discordance branches (eight with discordant branches with binnedbranches MP-EST instead has minimal ofbinned nine). impact However, MP-EST on contracting instead the discordance of nine). low support However, (eight branches contracting discordant with low branches support branches with with 3%–33%binned thresholds MP-EST dramatically instead3%–33% reduces of thresholds nine). the discordance However, dramatically with contracting reduces the reference the discordance low tree support (2, with branches the reference with tree (2, 2, 4, 2, 3, and 3 discordant2, branches, 4, 2, 3, and respectively, 3 discordant for 3%, branches, 5%, 7%, respectively, 10%, 20%, for and 3%, 5%, 7%, 10%, 20%, and 3%–33% thresholds dramatically reduces the discordance with the reference tree (2, 33%). Three thresholds (3%,33%). 5%, and Three 10%) thresholds produce (3%, an identical 5%, and tree 10%) (Fig. produce 4d). Thean identical tree (Fig. 4d). The remaining2, 4, di 2,↵erences 3, and are 3 among discordantremaining the branches di branches,↵erences that are arerespectively, among deemed the unresolved branches for 3%, that by 5%, Jarvis are 7%, deemed 10%, unresolved 20%, and by Jarvis et al. and33%). change Three among thresholds theet reference al. and (3%, change trees 5%, as among and well 10%) [5]. the Contracting reference produce trees at an 50% as identical well and [5]. 75% Contracting tree (Fig. 4d). at 50% The and 75% thresholds,remaining however, di↵ increaseserencesthresholds, discordance are among however, to the five increases branches and six discordance branches, that are respectively. deemedto five and unresolved six branches, by respectively. Jarvis Thus,et consistent al. and change with simulations, amongThus, the consistent contracting reference with very trees simulations, low as support well contracting [5]. branches Contracting very seems low at support 50% and branches 75% seems to producethresholds, the best however, results,to when produce increases judged the by discordance best similarity results, with when to five the judged reference and by six similarity trees. branches, To with respectively. the reference trees. To summarize, ASTRAL-III obtainedsummarize, on unbinned ASTRAL-III but collapsed obtained gene on unbinned trees agreed but collapsed with gene trees agreed with Thus, consistent with simulations, contracting very low support branches seems all major relations in Jarvisallet major al., including relations the in Jarvis novelet Columbea al., including group, the whereas novel Columbea group, whereas the unresolvedto produce tree the missed bestthe important results, unresolved clades when tree (Fig. judged missed 4). important by similarity clades with (Fig. the 4). reference trees. To summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with 3.2 RQ2:all Running major relations time improvements3.2 in RQ2: Jarvis Runninget al. time, including improvements the novel Columbea group, whereas We studythe the unresolved improvements treeWe in missed study running the important time improvements as various clades in parameters running (Fig. 4). time change. as various parameters change. 3.2.1 Varying the number of3.2.1 genes Varying (k) the number of genes (k) We compare3.2 RQ2: ASTRAL-III Running toWe time ASTRAL-II compare improvements ASTRAL-III on the avian to simulated ASTRAL-II dataset, on the changing avian simulated dataset, changing the numberWe study of genes the from improvements 28theto 2 number14 and forcing of in genes runningX fromto be 2 time8 theto 2 same14 asand various for forcing both parameters versionsX to be to the same change. for both versions to enable comparing impacts ofenable improved comparing weight impacts calculation. of improved We allow weight each calculation. replicate We allow each replicate run to3.2.1 take up Varying to two days. therun number ASTRAL-II to take of up genes is to not two ( ablek days.) to ASTRAL-II finish on the is dataset not able with to finish on the dataset with 14 14 k =2We, while compare ASTRAL-III ASTRAL-IIIk finishes=2 , while to on ASTRAL-II all ASTRAL-III conditions. on ASTRAL-III finishes the avian on all improves simulated conditions. the dataset,ASTRAL-III changing improves the running time over ASTRAL-IIrunning and time the extent over ASTRAL-II of the improvement and the extent depends of the on k improvement depends on k the number of genes from 28 to 214 and forcing X to be the same for both versions to (Fig. 5). With 1000 genes(Fig. or more, 5). With there 1000 is at genes least or a 2.1X more, improvement. there is at least With a 2.1X improvement. With 213 genes,enable the comparing largest value2 impacts13 wheregenes, both of the improved versions largest value could weight where run, calculation. ASTRAL-III both versions finishesWe could allow run, each ASTRAL-III replicate finishes on averagerun to 3.2 take times up faster toon than two average days.ASTRAL-II 3.2 ASTRAL-II times (234 faster versus than is not758 ASTRAL-II minutes). able to finish Moreover, (234 versus on the 758 dataset minutes). with Moreover, fittingk a=2 line14 to, the while average ASTRAL-IIIfitting running a line time finishes to in the the average on log-log all running conditions. scale graph time in reveals ASTRAL-III the log-log that scale improves graph reveals the that on thisrunning dataset, time the running overon ASTRAL-II time this dataset, of ASTRAL-III and the running the on extent average time of of grows ASTRAL-III the improvementas O(k2. on08), average depends grows as onOk(k2.08), 2.28 2.28 which(Fig. is better 5). than With that 1000 ofwhich ASTRAL-II genes is better or at more, thanO(k that there), of and ASTRAL-II is both at are least better at aO( 2.1Xk than), the improvement. and both are better With than the theoretical worst case, whichtheoretical is O(k2.726 worst). case, which is O(k2.726). 213 genes, the largest value where both versions could run, ASTRAL-III finishes on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as O(k2.08), which is better than that of ASTRAL-II at O(k2.28), and both are better than the theoretical worst case, which is O(k2.726). Low support branches

• Does it help to contract genes branches with low support?

• Yes, but only for very low supports

Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 15 Low support branches

• Does it help to contract genes branches with low support?

• Yes, but only for very low supports

• Mostly helps in the presence of low support gene trees

Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 15 Low support branches

• Does it help to contract genes branches with low support?

• Yes, but only for very low supports

• Mostly helps in the presence of low support gene trees

• More genes allows for

more filtering Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping 15 Zhang etAvian al. biological datasetPage 12 of 34

Zebra finch Zebra finch Medium ground-finch Medium ground-finch a) Binned MP-EST American crow b) ExaML concatenation American crow Golden-collared manakin Golden-collared manakin (coalescent-based Rifleman (published tree) Rifleman Budgerigar Budgerigar published tree) Kea Kea Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema 91 Carmine bee-eater Carmine bee-eater Downy woodpecker 72 Downy woodpecker Rhinoceros hornbill Rhinoceros hornbill Bar-tailed trogon Bar-tailed trogon -roller 84 Cuckoo-roller Speckled mousebird Speckled mousebird Bald eagle Barn owl White-tailed eagle Bald eagle Turkey vulture White-tailed eagle Barn owl 96 Turkey vulture 59 Little egret Little egret Dalmatian pelican Dalmatian pelican 58 Crested ibis Crested ibis Great cormorant Great cormorant Adelie penguin Adelie penguin Emperor penguin 91 Emperor penguin Northern fulmar 70 Northern fulmar Red-throated loon Red-throated loon 97 50 White-tailed tropicbird White-tailed tropicbird Sunbittern 99 88 Grey-crowned crane 96 Grey-crowned crane Killdeer 91 Killdeer Hoatzin Hoatzin Anna's hummingbird 55 Red-crested turaco Chimney swift MacQueen's bustard Chuck-will's-widow 91 Common cuckoo 80 Red-crested turaco Chimney swift MacQueen's bustard Anna's hummingbird Common cuckoo Chuck-will's-widow Brown mesite Yellow-throated Yellow-throated sandgrouse 97 Columbea Pigeon Pigeon American American flamingo Great-crested Great-crested grebe Turkey Turkey Chicken Chicken Pekin Pekin duck Common ostrich White-throated tinamou White-throated tinamou Common ostrich

Zebra finch Zebra finch c) Unbinned ASTRAL Medium ground-finch d) Unbinned ASTRAL Medium ground-finch American crow American crow - no contraction Golden-collared manakin -3%, 5%, 10% contraction Golden-collared manakin Rifleman Rifleman Budgerigar Budgerigar 14K gene trees with very lowKea gene tree resolution (25%Kea mean BS) Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema Carmine bee-eater Carmine bee-eater Downy woodpecker Downy woodpecker Rhinoceros hornbill Rhinoceros hornbill 97 Bar-tailed trogon Bar-tailed trogon Cuckoo-roller Cuckoo-roller Speckled mousebird Speckled mousebird Bald eagle Bald eagle 88 White-tailed eagle 91,90,70 White-tailed eagle Turkey vulture 16 Turkey vulture Barn owl Barn owl Anna's hummingbird Anna's hummingbird Chimney swift Chimney swift Chuck-will's-widow Chuck-will's-widow Little egret Little egret Dalmatian pelican 84,92,95 Dalmatian pelican Crested ibis Crested ibis Great cormorant Great cormorant 97 Emperor penguin Emperor penguin Adelie penguin Adelie penguin 60 Northern fulmar Northern fulmar 98 Red-throated loon 80,73,70 Red-throated loon Grey-crowned crane Sunbittern 90 Sunbittern White-tailed tropicbird White-tailed tropicbird 87,91,85 Grey-crowned crane 56 Great-crested grebe 52,59,74 Killdeer 88 American flamingo Hoatzin 92,93,85 98 Killdeer Red-crested turaco Hoatzin MacQueen's bustard Red-crested turaco Common cuckoo MacQueen's bustard Yellow-throated sandgrouse Yellow-throated sandgrouse Brown mesite 89 98,99,100 99 Brown mesite Pigeon Pigeon Great-crested grebe Common cuckoo American flamingo Turkey Turkey Chicken Chicken Pekin duck Pekin duck White-throated tinamou White-throated tinamou Common ostrich Common ostrich

Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5] using the coalescent-based binning (a)andconcatenation(b), and two new trees using ASTRAL-III with no contraction (c)andwithcontractionwith3%,5%,and10%thresholds(d). Support values (bootstrap for a,b and local posterior probability for c,d)shownforallbranches except those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively. Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33] that were largely congruent with the concatenation using ExaML [45] and di↵ered in only five branches with low support (Fig. 4ab); both trees were used as the reference [5]. Here, we test if simply contracting low support gene tree branches and using ASTRAL-III produces a tree that is congruent with these reference trees. Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con- traction, the results di↵er in nine and 11 branches, respectively, with respect to the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this tree contradicts some strong results from the avian analyses (e.g., not recovering the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours, but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0% Zhang et al. Page 12 of 34

Zebra finch Zebra finch Medium ground-finch Medium ground-finch a) Binned MP-EST American crow b) ExaML concatenation American crow Golden-collared manakin Golden-collared manakin (coalescent-based Rifleman (published tree) Rifleman Budgerigar Budgerigar published tree) Kea Kea Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema 91 Carmine bee-eater Carmine bee-eater Downy woodpecker 72 Downy woodpecker Rhinoceros hornbill Rhinoceros hornbill Bar-tailed trogon Bar-tailed trogon Cuckoo-roller 84 Cuckoo-roller Speckled mousebird Speckled mousebird Bald eagle Barn owl White-tailed eagle Bald eagle Turkey vulture White-tailed eagle Barn owl 96 Turkey vulture 59 Little egret Little egret Dalmatian pelican Dalmatian pelican 58 Crested ibis Crested ibis Great cormorant Great cormorant Adelie penguin Adelie penguin Emperor penguin 91 Emperor penguin Northern fulmar 70 Northern fulmar Red-throated loon Red-throated loon 97 50 White-tailed tropicbird White-tailed tropicbird Sunbittern Sunbittern 99 88 Grey-crowned crane 96 Grey-crowned crane Killdeer 91 Killdeer Hoatzin Hoatzin Anna's hummingbird 55 Red-crested turaco Chimney swift MacQueen's bustard Chuck-will's-widow 91 Common cuckoo 80 Red-crested turaco Chimney swift MacQueen's bustard Anna's hummingbird Zhang et al. Common cuckoo Page 12 of 34 Chuck-will's-widow Avian biological datasetBrown mesite Brown mesite Yellow-throated sandgrouse Yellow-throated sandgrouse 97 Columbea Pigeon Pigeon American flamingo American flamingo Great-crested grebe Great-crested grebe Turkey Turkey Chicken Chicken Pekin duck Pekin duck Common ostrich White-throated tinamou White-throated tinamou Common ostrich

Zebra finch ZebraZebra finch finch Zebra finch Medium ground-finch c) Unbinned ASTRAL MediumMedium ground-finch ground-finch d) Unbinned ASTRAL Medium ground-finch a) Binned MP-EST American crow b) ExaML concatenation AmericanAmerican crow crow American crow Golden-collared manakin - no contraction Golden-collaredGolden-collared manakin manakin -3%, 5%, 10% contraction Golden-collared manakin (coalescent-based Rifleman (published tree) RiflemanRifleman Rifleman Budgerigar BudgerigarBudgerigar Budgerigar published tree) Kea KeaKea Kea Peregrine falcon PeregrinePeregrine falcon falcon Peregrine falcon Red-legged seriema Red-leggedRed-legged seriema seriema Red-legged seriema 91 Carmine bee-eater CarmineCarmine bee-eater bee-eater Carmine bee-eater Downy woodpecker 72 DownyDowny woodpecker woodpecker Downy woodpecker Rhinoceros hornbill RhinocerosRhinoceros hornbill hornbill Rhinoceros hornbill Bar-tailed trogon 97 Bar-tailedBar-tailed trogon trogon Bar-tailed trogon Cuckoo-roller 84 Cuckoo-rollerCuckoo-roller Cuckoo-roller Speckled mousebird SpeckledSpeckled mousebird mousebird Speckled mousebird Bald eagle BaldBarn eagle owl Bald eagle White-tailed eagle 88 White-tailedBald eagle eagle 91,90,70 White-tailed eagle Turkey vulture TurkeyWhite-tailed vulture eagle Turkey vulture Barn owl 96 BarnTurkey owl vulture Barn owl 59 Little egret Anna'sLittle egret hummingbird Anna's hummingbird Dalmatian pelican ChimneyDalmatian swift pelican Chimney swift 58 Crested ibis Chuck-will's-widowCrested ibis Chuck-will's-widow Great cormorant LittleGreat egret cormorant Little egret Adelie penguin DalmatianAdelie penguin pelican 84,92,95 Dalmatian pelican Emperor penguin 91 CrestedEmperor ibis penguin Crested ibis Northern fulmar 70 GreatNorthern cormorant fulmar Great cormorant Red-throated loon 97 EmperorRed-throated penguin loon Emperor penguin 97 50 White-tailed tropicbird AdelieWhite-tailed penguin tropicbird Adelie penguin Sunbittern 60 NorthernSunbittern fulmar Northern fulmar 99 88 96 Grey-crowned crane 98 Red-throatedGrey-crowned loon crane 80,73,70 Red-throated loon Killdeer 91 Grey-crownedKilldeer crane Sunbittern Hoatzin 90 SunbitternHoatzin White-tailed tropicbird Anna's hummingbird 55 White-tailedRed-crested tropicbird turaco 87,91,85 Grey-crowned crane Chimney swift 56 Great-crestedMacQueen's bustard grebe 52,59,74 Killdeer Chuck-will's-widow 88 91 AmericanCommon flamingocuckoo Hoatzin 80 92,93,85 Red-crested turaco 98 KilldeerChimney swift Red-crested turaco MacQueen's bustard HoatzinAnna's hummingbird MacQueen's bustard Common cuckoo Red-crestedChuck-will's-widow turaco Common cuckoo Brown mesite MacQueen'sBrown mesite bustard Yellow-throated sandgrouse Yellow-throated sandgrouse Yellow-throatedYellow-throated sandgrouse sandgrouse Brown mesite 97 Columbea 89 98,99,100 Pigeon 99 BrownPigeon mesite Pigeon American flamingo PigeonAmerican flamingo Great-crested grebe Great-crested grebe CommonGreat-crested cuckoo grebe American flamingo Turkey TurkeyTurkey Turkey Chicken ChickenChicken Chicken Pekin duck PekinPekin duck duck Pekin duck Common ostrich White-throatedWhite-throated tinamou tinamou White-throated tinamou White-throated tinamou CommonCommon ostrich ostrich Common ostrich

Zebra finch Zebra finch c) Unbinned ASTRAL Medium ground-finch d) Unbinned ASTRAL Medium ground-finch American crow American crow - no contraction Golden-collared manakin Figure -3%, 5%, 4 10% Avian contraction dataset with 14,446Golden-collared genes. manakinShown are reference trees from the original paper [5] Rifleman Rifleman Budgerigar using the coalescent-based binningBudgerigar (a)andconcatenation(b), and two new trees using 14K gene trees with very lowKea gene tree resolution (25%Kea mean BS) Peregrine falcon Peregrine falcon Red-legged seriema ASTRAL-III with no contraction (c)andwithcontractionwith3%,5%,and10%thresholds(Red-legged seriema d). Carmine bee-eater Carmine bee-eater Downy woodpecker Support values (bootstrap for a,b andDowny localwoodpecker posterior probability for c,d)shownforallbranches Rhinoceros hornbill Rhinoceros hornbill 97 Bar-tailed trogon Bar-tailed trogon Cuckoo-roller except those with full support; in (d),Cuckoo-roller support is shown for 3%, 5%, and 10%, respectively. Speckled mousebird Speckled mousebird Bald eagle Branches conflicting with the referenceBald eagle coalescent-based tree are shown as dotted red lines. 88 White-tailed eagle 91,90,70 White-tailed eagle Turkey vulture 16 Turkey vulture Barn owl Barn owl Anna's hummingbird Anna's hummingbird Chimney swift Chimney swift Chuck-will's-widow Chuck-will's-widow Little egret Little egret Dalmatian pelican 84,92,95 Dalmatian pelican Crested ibis Crested ibis Great cormorant Great cormorant 97 Emperor penguin Emperor penguin Adelie penguin MP-EST run on binned gene treesAdelie penguin (i.e., binned MP-EST) produced results [5, 33] 60 Northern fulmar Northern fulmar 98 Red-throated loon 80,73,70 Red-throated loon Grey-crowned crane Sunbittern 90 Sunbittern that were largely congruent withWhite-tailed the tropicbird concatenation using ExaML [45] and di↵ered White-tailed tropicbird 87,91,85 Grey-crowned crane 56 Great-crested grebe 52,59,74 Killdeer 88 American flamingo Hoatzin 92,93,85 98 Killdeer in only five branches with lowRed-crested support turaco (Fig. 4ab); both trees were used as the Hoatzin MacQueen's bustard Red-crested turaco Common cuckoo MacQueen's bustard Yellow-throated sandgrouse Yellow-throated sandgrouse Brown mesite 89 reference [5].98,99,100 Here, we test if simply contracting low support gene tree branches 99 Brown mesite Pigeon Pigeon Great-crested grebe Common cuckoo American flamingo Turkey and using ASTRAL-III producesTurkey a tree that is congruent with these reference trees. Chicken Chicken Pekin duck Pekin duck White-throated tinamou White-throated tinamou Common ostrich Similar to MP-EST, when ATRAL-IIICommon ostrich is run on 14,446 gene trees with no con- traction, the results di↵er in nine and 11 branches, respectively, with respect to Figure 4 Avian dataset with 14,446 genes. Shownthe are reference reference binned trees from MP-EST the original and paper concatenation [5] trees (Fig. 4c). Moreover, this using the coalescent-based binning (a)andconcatenation(b), and two new trees using ASTRAL-III with no contraction (c)andwithcontractionwith3%,5%,and10%thresholds(tree contradicts some strong results fromd). the avian analyses (e.g., not recovering Support values (bootstrap for a,b and local posteriorthe Columbea probability for groupc,d)shownforallbranches [5]). ASTRAL-III with no contraction finishes in 32 hours, except those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively. Branches conflicting with the reference coalescent-basedbut with tree contraction, are shown as dotted depending red lines. on the threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33] that were largely congruent with the concatenation using ExaML [45] and di↵ered in only five branches with low support (Fig. 4ab); both trees were used as the reference [5]. Here, we test if simply contracting low support gene tree branches and using ASTRAL-III produces a tree that is congruent with these reference trees. Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con- traction, the results di↵er in nine and 11 branches, respectively, with respect to the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this tree contradicts some strong results from the avian analyses (e.g., not recovering the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours, but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0% Zhang et al. Zhang et al. Page 12 of 34 Page 12 of 34

Zebra finch Zebra finch Zebra finch Medium ground-finch Medium ground-finch Medium ground-finch a) Binned MP-EST American crow b) ExaMLa) Binned concatenation MP-EST American crow b) ExaML concatenation American crow Golden-collared manakin Golden-collared manakin Golden-collared manakin (coalescent-based Rifleman (published(coalescent-based tree) Rifleman (published tree) Rifleman Budgerigar Budgerigar Budgerigar published tree) Kea published tree) Kea Kea Peregrine falcon Peregrine falcon Peregrine falcon Red-legged seriema Red-legged seriema Red-legged seriema 91 Carmine bee-eater 91 Carmine bee-eater Carmine bee-eater Downy woodpecker Downy woodpecker Downy woodpecker 72 Downy woodpecker 72 Rhinoceros hornbill Rhinoceros hornbill Rhinoceros hornbill Bar-tailed trogon Bar-tailed trogon Bar-tailed trogon Cuckoo-roller Cuckoo-roller Cuckoo-roller 84 Cuckoo-roller 84 Speckled mousebird Speckled mousebird Speckled mousebird Bald eagle BaldBarn eagleowl Barn owl White-tailed eagle White-tailedBald eagle eagle Bald eagle Turkey vulture TurkeyWhite-tailed vulture eagle White-tailed eagle Barn owl 96 BarnTurkey owl vulture 96 Turkey vulture 59 Little egret 59 Little egret Little egret Dalmatian pelican Dalmatian pelican Dalmatian pelican 58 Crested ibis 58 Crested ibis Crested ibis Great cormorant Great cormorant Great cormorant Adelie penguin Adelie penguin Adelie penguin 91 Emperor penguin 91 Emperor penguin Emperor penguin Northern fulmar 70 Northern fulmar 70 Northern fulmar Red-throated loon Red-throated loon Red-throated loon 97 Red-throated loon 97 50 White-tailed tropicbird 50 White-tailed tropicbird White-tailed tropicbird Sunbittern Sunbittern Sunbittern 99 88 99 88 96 Grey-crowned crane 96 Grey-crowned crane Grey-crowned crane Grey-crowned crane 91 Killdeer 91 Killdeer Killdeer Hoatzin Hoatzin Hoatzin Anna's hummingbird 55 Anna'sRed-crested hummingbird turaco 55 Red-crested turaco Chimney swift ChimneyMacQueen's swift bustard MacQueen's bustard Chuck-will's-widow 91 Chuck-will's-widowCommon cuckoo 91 Common cuckoo 80 Red-crested turaco 80 Red-crestedChimney swift turaco Chimney swift MacQueen's bustard MacQueen'sAnna's hummingbird bustard Anna's hummingbird Zhang et al. Common cuckoo CommonChuck-will's-widow cuckoo Page 12 of 34 Chuck-will's-widow Avian biologicalBrown mesite datasetBrown mesite Brown mesite Yellow-throated sandgrouse Yellow-throated sandgrouse Yellow-throated sandgrouse 97 Columbea Pigeon 97 Columbea Pigeon Pigeon American flamingo American flamingo American flamingo Great-crested grebe Great-crested grebe Great-crested grebe Turkey Turkey Turkey Chicken Chicken Chicken Pekin duck Pekin duck Pekin duck Common ostrich CommonWhite-throated ostrich tinamou White-throated tinamou White-throated tinamou White-throatedCommon ostrich tinamou Common ostrich

ZebraZebra finchfinch ZebraZebra finch finch Zebra finch c) Unbinned ASTRAL MediumMedium ground-finchground-finch d) Unbinnedc) Unbinned ASTRAL ASTRAL MediumMedium ground-finch ground-finch d) Unbinned ASTRAL Medium ground-finch a) Binned MP-EST AmericanAmerican crowcrow b) ExaML concatenation AmericanAmerican crow crow American crow - no contraction Golden-collaredGolden-collared manakinmanakin -3%, - no 5%, contraction 10% contraction Golden-collaredGolden-collared manakin manakin -3%, 5%, 10% contraction Golden-collared manakin (coalescent-based RiflemanRifleman (published tree) RiflemanRifleman Rifleman BudgerigarBudgerigar BudgerigarBudgerigar Budgerigar published tree) KeaKea KeaKea Kea PeregrinePeregrine falconfalcon PeregrinePeregrine falcon falcon Peregrine falcon Red-leggedRed-legged seriemaseriema Red-leggedRed-legged seriema seriema Red-legged seriema 91 CarmineCarmine bee-eaterbee-eater CarmineCarmine bee-eater bee-eater Carmine bee-eater DownyDowny woodpeckerwoodpecker 72 DownyDowny woodpecker woodpecker Downy woodpecker RhinocerosRhinoceros hornbillhornbill RhinocerosRhinoceros hornbill hornbill Rhinoceros hornbill 97 Bar-tailedBar-tailed trogontrogon 97 Bar-tailedBar-tailed trogon trogon Bar-tailed trogon Cuckoo-rollerCuckoo-roller 84 Cuckoo-rollerCuckoo-roller Cuckoo-roller SpeckledSpeckled mousebirdmousebird SpeckledSpeckled mousebird mousebird Speckled mousebird BaldBald eagleeagle BaldBarn eagle owl Bald eagle 88 White-tailedWhite-tailed eagleeagle 91,90,7088 White-tailedBald eagle eagle 91,90,70 White-tailed eagle TurkeyTurkey vulturevulture TurkeyWhite-tailed vulture eagle Turkey vulture BarnBarn owlowl 96 BarnTurkey owl vulture Barn owl 59 Anna'sLittle egret hummingbird Anna'sLittle egret hummingbird Anna's hummingbird ChimneyDalmatian swift pelican ChimneyDalmatian swift pelican Chimney swift 58 Chuck-will's-widowCrested ibis Chuck-will's-widowCrested ibis Chuck-will's-widow LittleGreat egret cormorant LittleGreat egret cormorant Little egret DalmatianAdelie penguin pelican 84,92,95 DalmatianAdelie penguin pelican 84,92,95 Dalmatian pelican CrestedEmperor ibis penguin 91 CrestedEmperor ibis penguin Crested ibis GreatNorthern cormorant fulmar 70 GreatNorthern cormorant fulmar Great cormorant 97 EmperorRed-throated penguin loon 97 EmperorRed-throated penguin loon Emperor penguin 97 50 AdelieWhite-tailed penguin tropicbird AdelieWhite-tailed penguin tropicbird Adelie penguin 60 NorthernSunbittern fulmar 60 NorthernSunbittern fulmar Northern fulmar 99 88 Red-throatedGrey-crowned loon crane 96 Red-throatedGrey-crowned loon crane Red-throated loon 98 80,73,70 98 91 Red-throated loon 80,73,70 Grey-crownedKilldeer crane Grey-crownedSunbitternKilldeer crane Sunbittern 90 SunbitternHoatzin 90 SunbitternWhite-tailedHoatzin tropicbird White-tailed tropicbird White-tailedAnna's hummingbird tropicbird 87,91,8555 White-tailedGrey-crownedRed-crested tropicbird turaco crane 87,91,85 Grey-crowned crane 52,59,74 56 Great-crestedChimney swift grebe 56 52,59,74 Great-crestedKilldeerMacQueen's bustard grebe Killdeer 88 AmericanChuck-will's-widow flamingo 88 91 AmericanHoatzinCommon flamingocuckoo Hoatzin 80 92,93,85 92,93,85 98 KilldeerRed-crested turaco 98 KilldeerRed-crestedChimney swift turaco Red-crested turaco HoatzinMacQueen's bustard HoatzinMacQueen'sAnna's hummingbird bustard MacQueen's bustard Red-crestedCommon cuckoo turaco Red-crestedCommonChuck-will's-widow cuckoo turaco Common cuckoo MacQueen'sBrown mesite bustard MacQueen'sYellow-throatedBrown mesite bustard sandgrouse Yellow-throated sandgrouse Yellow-throatedYellow-throated sandgrousesandgrouse Yellow-throatedBrownYellow-throated mesite sandgrouse sandgrouse Brown mesite 97 Columbea 89 98,99,100 89 98,99,100 99 BrownPigeon mesite 99 BrownPigeonPigeon mesite Pigeon PigeonAmerican flamingo PigeonGreat-crestedAmerican flamingo grebe Great-crested grebe CommonGreat-crested cuckoo grebe CommonAmericanGreat-crested cuckooflamingo grebe American flamingo TurkeyTurkey TurkeyTurkey Turkey ChickenChicken ChickenChicken Chicken PekinPekin duckduck PekinPekin duck duck Pekin duck White-throatedCommon ostrich tinamou White-throatedWhite-throated tinamou tinamou White-throated tinamou CommonWhite-throated ostrich tinamou CommonCommon ostrich ostrich Common ostrich

Zebra finch Zebra finch c) Unbinned ASTRAL Medium ground-finch d) Unbinned ASTRAL Medium ground-finch American crow American crow Figure - no 4contraction Avian dataset with 14,446Golden-collared genes. manakinShownFigure -3%, are 5%, reference 4 10% Avian contraction trees dataset from with the 14,446 originalGolden-collared genes. paper manakinShown [5] are reference trees from the original paper [5] Rifleman Rifleman using the coalescent-based binningBudgerigar (a)andconcatenation(using theb coalescent-based), and two new trees binning usingBudgerigar (a)andconcatenation(b), and two new trees using 14K gene trees with very lowKea gene tree resolution (25%Kea mean BS) Peregrine falcon Peregrine falcon ASTRAL-III with no contraction (c)andwithcontractionwith3%,5%,and10%thresholds(Red-legged seriema ASTRAL-III with no contraction (c)andwithcontractionwith3%,5%,and10%thresholds(Red-legged seriemad). d). Carmine bee-eater Carmine bee-eater Support values (bootstrap for a,b andDowny localwoodpecker posteriorSupport probability values for (bootstrapc,d)shownforallbranches for a,b andDowny localwoodpecker posterior probability for c,d)shownforallbranches Rhinoceros hornbill Rhinoceros hornbill 97 Bar-tailed trogon Bar-tailed trogon except those with full support; in (d),Cuckoo-roller support is shownexcept for those 3%, with 5%, full and support; 10%, respectively. in (d),Cuckoo-roller support is shown for 3%, 5%, and 10%, respectively. Speckled mousebird Speckled mousebird Branches conflicting with the referenceBald eagle coalescent-basedBranches tree conflicting are shown with as dotted the reference redBald lines. eagle coalescent-based tree are shown as dotted red lines. 88 White-tailed eagle 91,90,70 White-tailed eagle Turkey vulture 16 Turkey vulture Barn owl Barn owl Anna's hummingbird Anna's hummingbird Chimney swift Chimney swift Chuck-will's-widow Chuck-will's-widow Little egret Little egret Dalmatian pelican 84,92,95 Dalmatian pelican Crested ibis Crested ibis Great cormorant Great cormorant 97 Emperor penguin Emperor penguin MP-EST run on binned geneAdelie trees penguin (i.e.,MP-EST binned MP-EST) run on binned produced gene results treesAdelie penguin (i.e., [5, 33] binned MP-EST) produced results [5, 33] 60 Northern fulmar Northern fulmar 98 Red-throated loon 80,73,70 Red-throated loon Grey-crowned crane Sunbittern that were largely90 congruent withSunbittern the concatenationthat were largely using congruentExaML [45] withWhite-tailed and the ditropicbird↵ concatenationered using ExaML [45] and di↵ered White-tailed tropicbird 87,91,85 Grey-crowned crane 56 Great-crested grebe 52,59,74 Killdeer 88 American flamingo Hoatzin 92,93,85 in only98 five branches with lowKilldeer supportin (Fig. only 4ab); five branches both trees with were lowRed-crested used support turaco as the (Fig. 4ab); both trees were used as the Hoatzin MacQueen's bustard Red-crested turaco Common cuckoo MacQueen's bustard Yellow-throated sandgrouse reference [5]. Here, we test ifYellow-throated simply sandgrouse contracting low support gene treeBrown mesite branches 89 reference [5].98,99,100 Here, we test if simply contracting low support gene tree branches 99 Brown mesite Pigeon Pigeon Great-crested grebe Common cuckoo American flamingo and using ASTRAL-III producesTurkey a tree thatand is using congruent ASTRAL-III with these produces referenceTurkey a tree trees. that is congruent with these reference trees. Chicken Chicken Pekin duck Pekin duck White-throated tinamou White-throated tinamou Similar to MP-EST, when ATRAL-IIICommon ostrich isSimilar run on to 14,446 MP-EST, gene when trees ATRAL-III withCommon ostrich no con- is run on 14,446 gene trees with no con- traction, the results di↵er in nine and 11traction, branches, the respectively, results di↵er with in nine respect and 11to branches, respectively, with respect to theFigure reference 4 Avian binned dataset with MP-EST 14,446 and genes. concatenationShownthe are reference reference trees binned trees (Fig. from MP-EST 4c). the original Moreover, and paper concatenation this[5] trees (Fig. 4c). Moreover, this using the coalescent-based binning (a)andconcatenation(b), and two new trees using treeASTRAL-III contradicts with no some contraction strong (c results)andwithcontractionwith3%,5%,and10%thresholds( fromtree the contradicts avian analyses some (e.g., strong not results recovering fromd). the avian analyses (e.g., not recovering theSupport Columbea values (bootstrap group [5]). for a,b ASTRAL-IIIand local posteriorthe with Columbea probability no contraction for groupc,d)shownforallbranches finishes [5]). ASTRAL-III in 32 hours, with no contraction finishes in 32 hours, except those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively. butBranches with conflicting contraction, with dependingthe reference coalescent-based on thebut threshold, with tree contraction, are it shown takes as 3 dotted depending to 84 red hours lines. on (> the50 threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26hours hours for for 0% 33% – 20% – 75%). thresholds Contracting and < 0%26 hours for 33% – 75%). Contracting 0%

MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33] that were largely congruent with the concatenation using ExaML [45] and di↵ered in only five branches with low support (Fig. 4ab); both trees were used as the reference [5]. Here, we test if simply contracting low support gene tree branches and using ASTRAL-III produces a tree that is congruent with these reference trees. Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no con- traction, the results di↵er in nine and 11 branches, respectively, with respect to the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this tree contradicts some strong results from the avian analyses (e.g., not recovering the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours, but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0% Moving forward …

• ASTRAL, like all other two-step approaches, is sensitive to errors in the input gene trees

• What else can be done to reduce impacts of gene tree error?

17 Moving forward …

• ASTRAL, like all other two-step approaches, is sensitive to errors in the input gene trees

• What else can be done to reduce impacts of gene tree error?

• ASTRAL scales to 10K species. We have 90K bacterial genomes.

• Can we scale further? … divide-and-conquer …

17 Tandy Warnow Chao Zhang Erfan Sayyari Maryam Rabiee Zhang et al. Page 14 of 34

ZhangZhangetet al. al. Zhang et al. PagePage 14 14 of of 34 34 Page 14 of 34 (a)

50 200 500 1000

−6 2 1600 (a)(a) (a) 200 5050 200200 50050500 10002001000 500 1000

−8 −6−6 ASTRAL−II −6 2 22 16001600 2 1600 ASTRAL200200−III 200

2−28−8 ASTRALASTRAL−II−II 2−8 ASTRAL−II ASTRALASTRAL−III−III ASTRAL−III 2−10

2−21−010 2−10

Avg weight calculation time (seconds) weight Avg 2−12 Avg weight calculation time (seconds) weight Avg Avg weight calculation time (seconds) weight Avg 2−21−212 calculation time (seconds) weight Avg 2−12

non 0 non3non 050 33 755 1077 1010202020333333 505050 757575nonnon non00 33 055 737 101052020 33733 505010757520nonnonnon33000 33503 55575777 101010non202020 33033 505037575 5nonnon700 3103 55520777 10331010 20202050333333 75505050 757575nonnon 00 33 5 57 107 201033 2050 7533 non50 0 753 5 7 10 20 33 50 75 contractioncontraction contraction contraction (b)(b) (b) 5050 200200 (b)50050500 10001000200 500 1000

100100 100 ● ● ● ● ● ● 50 200 500 ● ● 1000 ● 19 ● ● ●

● ● ● 100 ● ● ● ● 7575 75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5050 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 75 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Set X cluster size (K) Set X cluster size

Set X cluster size (K) Set X cluster size ● Set X cluster size (K) Set X cluster size ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2525 ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● 0 0 ● ● ● ● ● ● ● ● ● ● nonnon00 33 55 77 10102020333350507575 nonnon00 33 55 77 10102020333350507575 nonnon00 33 555 777 101010202020333333505050757575 nonnonnon000 333 555 777 1010102020203333335050507575 non 0 3 5 7 10● 20● 33 50 75 non 0 3 5 7 10 20 33 50 75 ● contractioncontraction ● contraction ● ● ● ● ● ● ●

Set X cluster size (K) Set X cluster size ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 25 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● FigureFigure 6 6 Weight Weight calculation calculation and and XX onon S100. S100.FigureAverageAverage 6● Weight and and standard standard calculation error error and of of (a) (a)X the theon time time S100. it it Average and standard error of (a) the time it ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● || || ● ● ● | | ● ● ● ● ● ● ● ● takes●takes to to score score● ● a a● single single● ● tripartition tripartition● ● using using Eq. Eq. 3 3takes and and (b) (b)to score search search● a space single space size size tripartitionXX areare shownusing shown Eq. for for 3 both both and (b) search space size X are shown for both ● ● ● ● ● ● ● ● ● ● ● ● || || | | 0 ASTRAL-IIASTRAL-II and and ASTRAL-III ASTRAL-III on on the the S100 S100 dataset. dataset.ASTRAL-II Running Running and time time ASTRAL-III is is in in log log scale. scale. on the We We S100 vary vary dataset. numbers numbers Running time is in log scale. We vary numbers nonofof gene0 gene3 trees trees5 7 ( (boxes10boxes20)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsfor)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsfor33 50 75 non 0 3 5 7of10 gene20 33 trees50 (75boxesnon)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsfor0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 withwith 400 400 and and 800bp 800bp alignments. alignments. with 400 andcontraction 800bp alignments.

Figure 6 Weight calculation and X on S100. Average and standard error of (a) the time it takes3.2.23.2.2 to score Running Running a single time time tripartition for for large large polytomies polytomies| using3.2.2| Eq. Running 3 and (b) time search for large space polytomies size X are shown for both ASTRAL-IIASTRAL-IIIASTRAL-III and ASTRAL-III has has a a clear clear advantage advantage on the S100 comparedASTRAL-III compared dataset. to to ASTRAL-II ASTRAL-II has Running a clear time advantagewith with respect is respect in compared log to to| scale. the the| to We ASTRAL-II vary numbers with respect to the of generunningrunning trees time time (boxes when when)andsequencelength(200and1600).SeeFig.S3forsimilarpatternsfor gene gene trees trees include includerunning polytomies polytomies time (Fig. (Fig. when 6a). 6a). gene Since Since trees ASTRAL-II ASTRAL-II include polytomies and and (Fig. 6a). Since ASTRAL-II and with 400ASTRAL-IIIASTRAL-III and 800bp can can have alignments. have a a di di↵↵erenterent set setASTRAL-IIIXX,, we we show show the the can running running have a time di time↵erent per per each set eachX weight weight, we show the running time per each weight calculationcalculation (i.e., (i.e., Eq. Eq. 3). 3). As As we we contract contractcalculation low low support support (i.e., branches branches Eq. 3). and andAs we hence hence contract increase increase low the the support branches and hence increase the prevalenceprevalence of of polytomies, polytomies, the the weight weight calculation calculationprevalence time of time polytomies, quickly quickly grows grows the weightfor for ASTRAL-II, ASTRAL-II, calculation time quickly grows for ASTRAL-II, whereas,whereas, in in ASTRAL-III, ASTRAL-III, the the weight weight calculation calculationwhereas, in time time ASTRAL-III, remains remains flat, flat, the or or weight even even decreases. decreases. calculation time remains flat, or even decreases. 3.2.2 Running time for large polytomies ASTRAL-III3.2.33.2.3 The The search searchhas a space space clear advantage3.2.3 The compared search space to ASTRAL-II with respect to the ComparingComparing the the size size of of the the search search space spaceComparing ( (XX)) between between the size ATRAL-II ATRAL-II of the search and and spaceASTRAL- ASTRAL- ( X ) between ATRAL-II and ASTRAL- || || | | runningIIIIII shows time shows that that when as as intended, intended, gene trees the the search search includeIII space space shows is is polytomies that decreased decreased as intended, in in size size (Fig. for the for cases casessearch 6a). with with space Since no no is ASTRAL-IIdecreased in size for and cases with no ASTRAL-IIIpolytomypolytomy but but can can can have increase increase a in di in the↵ theerent presence presencepolytomy set of of polytomiesX polytomies but, we can show increase (Fig. (Fig. 6b). 6b). the in With With the running presence no no contrac- contrac- timeof polytomies per each (Fig. 6b). weight With no contrac- tion,tion, on on average, average, XX isis always always smaller smallertion, for for ASTRAL-III ASTRAL-III on average, X than thanis ASTRAL-II. always ASTRAL-II. smaller With With for ASTRAL-III few few than ASTRAL-II. With few || || | | calculationerror-proneerror-prone (i.e., gene gene Eq. trees trees 3). (50 (50 As gene gene we trees trees contracterror-prone from from 200bp 200bp low gene alignments), alignments), support trees (50 the the gene branches search search trees space space from and has has 200bp hence alignments), increase the searchthe space has prevalencereducedreduced of dramatically dramatically polytomies, but but with with the many many weightreduced genes genes calculation or or dramatically high-quality high-quality time but gene gene with trees, quicklytrees, many the the genes reduc- reduc- grows or high-quality for ASTRAL-II, gene trees, the reduc- whereas,tionstions in are are ASTRAL-III, minimal. minimal. Moreover, Moreover, the the the weight search searchtions space space are calculation minimal. for for gene gene trees Moreover, trees time estimated estimated remains the search from from short flat, short space or for even gene trees decreases. estimated from short alignmentsalignments (e.g., (e.g., 200bp) 200bp) is is several several times timesalignments larger larger than (e.g., than those those 200bp) based based is several on on longer longer times align- align- larger than those based on longer align- mentsments (e.g., (e.g., 1600bp) 1600bp) for for both both methods. methods.ments Contracting Contracting (e.g., 1600bp) low low support support for both branches branches methods. initially initially Contracting low support branches initially 3.2.3increases Theincreases search the the search search space space space because because ASTRAL-III ASTRAL-IIIincreases the adds adds search multiple multiple space resolutions resolutions because ASTRAL-III per per poly- poly- adds multiple resolutions per poly- tomytomy to to XX.. Further Further contraction contraction results resultstomy in in to reductions reductionsX. Further in in X contractionX,, presumably presumably results because because in reductions in X , presumably because Comparing the size of the search space ( X )|| between|| ATRAL-II and ASTRAL-| | manymany polytomies polytomies exist exist and and they they are are resolvedmany resolved polytomies similarly similarly| inside exist| inside and ASTRAL-III. ASTRAL-III. they are resolved similarly inside ASTRAL-III. III shows that as intended, the search space is decreased in size for cases with no polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac- tion, on average, X is always smaller for ASTRAL-III than ASTRAL-II. With few | | error-prone gene trees (50 gene trees from 200bp alignments), the search space has reduced dramatically but with many genes or high-quality gene trees, the reduc- tions are minimal. Moreover, the search space for gene trees estimated from short alignments (e.g., 200bp) is several times larger than those based on longer align- ments (e.g., 1600bp) for both methods. Contracting low support branches initially increases the search space because ASTRAL-III adds multiple resolutions per poly- tomy to X. Further contraction results in reductions in X , presumably because | | many polytomies exist and they are resolved similarly inside ASTRAL-III. Further improvements

• Use a polytree to overly all the input gene trees into one data structure

• Allows us to spend time for each unique node in gene trees once

• O(|D|(nm)1.73) unique nodes in gene trees = O(nm)

• 3X running time improvement

20 Dynamic programming

L

A L-A

21 Dynamic programming

L

A L-A

A’ A-A’ A’ A-A’ A’ A-A’

• Recursively break subsets of species into smaller subsets

21 Dynamic programming

L A’ A L-A u L-A

A-A’ A’ A-A’ A’ A-A’ A’ A-A’

• Recursively break subsets of • w ( u ) : Compare u against input species into smaller subsets gene trees and compute quartets from gene trees satisfied by u

21 Dynamic programming

L Exponential A’ A L-A u L-A

A-A’ A’ A-A’ A’ A-A’ A’ A-A’

• Recursively break subsets of • w ( u ) : Compare u against input species into smaller subsets gene trees and compute quartets from gene trees satisfied by u

21 Constrained version

L

A1 A2

A L-A A6 A4 A10 A5

A8 A9 A3 A7 A’ A-A’ A’ A-A’ A’ A-A’

22 Constrained version

L

A1 A2

A L-A A6 A4 A10 A5

A8 A9 A3 A7 A’ A-A’ A’ A-A’ A’ A-A’

• Restrict “branches” in the species tree to a given constraint set X.

22 Asymptotic running time?

• Simple discrete math question:

• X = a set of subsets of some set L. Y = { (a,b) ∈ X | a∩b=∅ , a∪b ∈ X }

• Clearly, |Y| < |X|2

• What’s the maximum |Y| with respect to |X|?

23 Asymptotic running time?

• Simple discrete math question:

• X = a set of subsets of some set L. Y = { (a,b) ∈ X | a∩b=∅ , a∪b ∈ X }

• Clearly, |Y| < |X|2

• What’s the maximum |Y| with respect to |X|?

• Turns out to be rather challenging

• Daniel Kane and Terence Tao proved: |Y|=O(|X|1.73)

23