Species tree inference and update on very large datasets using approximation, randomization, parallelization, and vectorization

Siavash Mirarab

Electrical and Computer Engineering University of California at San Diego

Phylogenetic reconstruction from data

Unaligned sequences at the leaves:

Gorilla     ACTGCACACCG
Human       ACTGCCCCCG
Chimpanzee  AATGCCCCCG
Orangutan   CTGCACACGG

Ancestral sequences at the internal nodes: CTGCACACCG, CTGCACACCG, CTGCACACGG

Aligned data D:

Gorilla     ACTGCACACCG
Human       ACTGC-CCCCG
Chimpanzee  AATGC-CCCCG
Orangutan   -CTGCACACGG

[Figure: a tree T on Orangutan, Chimpanzee, Gorilla, and Human, related to the data through P(D | T)]

Applications: HIV forensic

Texas case Washington case

Scaduto et al., PNAS, 2010

Applications: microbiome

https://www.nytimes.com/2017/11/06/well/live/unlocking-the-secrets-of-the-microbiome.html

Morgan, Xochitl C., Nicola Segata, and Curtis Huttenhower. Trends in Genetics (2013)

Applications: food safety

Tracking the source of a listeriosis outbreak

Jackson, Brendan R., et al. Reviews of Infectious Diseases (2016)

Applications: Ebola

Fig. 3. Molecular dating of the 2014 outbreak. (A) BEAST dating of the separation of the 2014 lineage from Middle African lineages (SL = Sierra Leone; GN = Guinea; DRC = Democratic Republic of Congo; tMRCA: Sep 2004, 95% HPD: Oct 2002 - May 2006). (B) BEAST dating of the tMRCA of the 2014 West African outbreak (tMRCA: Feb 23, 95% HPD: Jan 27 - Mar 14) and the tMRCA of the Sierra Leone lineages (tMRCA: Apr 23, 95% HPD: Apr 2 - May 13); probability distributions for both 2014 divergence events overlayed below. Posterior support for major nodes is shown.

source: Gire et al., Science, 2014

Sequence data growth

data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics)

[Figure: number of sequences (millions, 0 to 500) per year, 1984 to 2017, for individual genes and whole genomes. Insets: a gene (hundreds of letters), a chromosome, a genome (billions of letters).]

• Rapid growth in the available sequences

• Our focus has shifted to “whole genomes”

[Figure: the data matrix grows in both directions. More sequences: a million rows. More genomic regions: billions of columns.]

Phylogenetic reconstruction: a computational nightmare

• Almost all problems are NP-Hard!

• The search space is huge.

• Focusing on topology alone, there are (2n-5)!! trees with n leaves

• We also care about branch lengths: ℝ^(2n-3)

• We are interested in n in the 10^1 … 10^6 range

(Tandy Warnow)
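To get a feel for how fast the (2n-5)!! count grows, the short sketch below evaluates the double factorial directly. The function name is made up for illustration; only the formula comes from the slide.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n < 3:
        return 1  # zero, one, or two leaves admit a single topology
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # multiplies 3 * 5 * ... * (2n-5)
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_binary_trees(n))
```

Already at n = 10 the count exceeds two million, which is why exhaustive search is out of the question for the dataset sizes mentioned above.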

Dealing with uncertainty

• Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support

• Genome evolution is complex. We need complex statistical models for accurate inference

• We need extensive simulations to test methods

To scale to large datasets …

• Approximate and heuristic solutions

• Make the problem easier

• Divide-and-conquer

• Constrained search

• Develop optimized code.

How about accuracy?

Increased data can make problems easier, but …

• Larger datasets often

• Are used to answer harder problems

• Allow the use of more complex models

• Are riddled with erroneous data

• Often, methods lose their accuracy on large datasets

Examples of improving scalability

To scale to large datasets

• Approximate and heuristic solutions

• Make the problem easier

• Divide-and-conquer

• Constrained search

• Develop optimized code.

ASTRAL

• Optimization problem (NP-Hard):

Find the median tree among a set of input trees

Distance between two trees := the number of quartet trees they do not share

• ASTRAL: an exact solution using a dynamic programming algorithm

• Solves the problem exactly on a constrained search space
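The quartet distance used in the objective can be sketched by brute force on small trees. This is only an illustration of the definition, not ASTRAL's dynamic programming; the nested-tuple tree encoding and all function names are assumptions made for the example.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of leaf sets below every internal node) of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def quartet_topology(clade_list, quartet):
    """Return the split ab|cd that the tree induces on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = q & clade
        if len(inside) == 2:  # this clade separates two of the four leaves from the other two
            return frozenset([inside, q - inside])
    return None


def quartet_distance(t1, t2):
    """Number of four-leaf subsets on which the two trees induce different quartet trees."""
    leaves1, c1 = clades(t1)
    leaves2, c2 = clades(t2)
    assert leaves1 == leaves2, "both trees must have the same leaf set"
    return sum(
        quartet_topology(c1, q) != quartet_topology(c2, q)
        for q in combinations(sorted(leaves1), 4)
    )


def median_score(candidate, gene_trees):
    """Total quartet distance from a candidate species tree to the input trees (lower is better)."""
    return sum(quartet_distance(candidate, g) for g in gene_trees)


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(quartet_distance(t1, t2))
```

The brute-force version enumerates all four-leaf subsets, so it is only usable on toy inputs; the point of the constrained dynamic programming is to avoid exactly this enumeration.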

Choosing Constraints

• Restricted search space should

• Not compromise statistical consistency

• Be large enough that accuracy is not sacrificed

• Be small for scalability

[Figure: species tree error versus number of genes]

ASTRAL evolution

• Search space should ideally grow polynomially with the dataset size

• Tandy talked about ASTRAL-I and ASTRAL-II, which do not have such guarantees

• ASTRAL-III [Zhang et al., 2018]

• Guarantees polynomial size search space

• Increases the speed of the dynamic programming

Changing the number of species (n)

[Figure: search space size (size of X, 10k to 1M) and running time (minutes) versus #Species (200 to 10,000), both on log scales; annotated values 1.09 and 1.83; the largest run is labeled 52 hours]

Simulations: n = 200 to 10,000, k = 1000 gene trees

Empirical running time seems to increase close to O(n^1.8)

[Zhang et al., 2018]

Changing the number of genes (k)

[Figure: search space size (size of X) and running time versus #Genes (2^8 to 2^14), both on log scales; annotated values 0.777 and 2.09; the largest run is labeled 17 hours]

Simulations: n = 48 species, k = 256 to 16,384 gene trees

Empirical running time seems to increase close to O(k^2)
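An empirical exponent like the ones quoted on these slides is typically read off as the slope of a straight line fitted on log-log axes. The sketch below shows that calculation with ordinary least squares. The timings in the usage example are synthetic, generated from an exact quadratic, and are not measurements from the talk.

```python
import math


def fit_power_law_exponent(sizes, times):
    """Slope of the least-squares line through (log size, log time)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic data only: running time proportional to k^2.
    genes = [256, 512, 1024, 2048, 4096]
    seconds = [3e-6 * k ** 2 for k in genes]
    print(round(fit_power_law_exponent(genes, seconds), 3))
```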

[Zhang et al., 2018]

To scale to large datasets

• Approximate and heuristic solutions

• Make the problem easier

• Divide-and-conquer

• Constrained search

• Develop optimized code.

Profiling ASTRAL-III

[Figure: running time (minutes); panels: Many species, Many genes]

Further scaling improvements

(Chao Zhang)

• Developed a randomized algorithm for a key step in the dynamic programming.

For a set X of subsets of an alphabet, find:

{(A, B) | A, B ∈ X, A ∪ B ∈ X, A ∩ B = ∅}

• Using Abelian group theory, we can compute this in time quadratic in |X|.

• Has an astronomically low probability of failure

• Implemented the kernel in C++ and added vectorization.
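One way such a randomized step could work is sketched below. The slides do not spell out the construction, so everything here is an assumption: each alphabet element gets a random value in the additive group of integers modulo 2^128, a set is hashed to the sum of its elements' values, and a pair (A, B) is reported when hash(A) + hash(B) matches the hash of some set in X. For disjoint A and B the sum equals hash(A ∪ B) exactly; overlapping pairs match only through a hash collision, which is where the tiny failure probability comes from.

```python
import random


def disjoint_union_pairs(X, bits=128, seed=0):
    """Pairs (A, B) from X with A and B disjoint and A | B in X, found via additive hashing.

    X is a collection of frozensets. Assumed construction, not necessarily the one
    used in ASTRAL: set hashes live in the integers modulo 2**bits under addition.
    """
    rng = random.Random(seed)
    modulus = 1 << bits
    alphabet = sorted(set().union(*X))
    weight = {e: rng.getrandbits(bits) for e in alphabet}
    sets = list(X)
    set_hash = {S: sum(weight[e] for e in S) % modulus for S in sets}
    known = set(set_hash.values())  # constant-time membership test
    pairs = []
    for i, A in enumerate(sets):  # quadratic number of pairs, constant work per pair
        for B in sets[i + 1:]:
            if (set_hash[A] + set_hash[B]) % modulus in known:
                pairs.append((A, B))
    return pairs


if __name__ == "__main__":
    X = [frozenset(s) for s in ("a", "b", "c", "ab", "bc", "abc")]
    for A, B in disjoint_union_pairs(X):
        print(sorted(A), sorted(B))
```

Each pair is tested with one addition and one table lookup, without ever materializing A ∪ B, which is the kind of inner loop that lends itself to a vectorized kernel.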

Improved scalability

[Figure: running time (minutes); panels: Many species, Many genes]

Parallelize the main step of the dynamic programming (blue)

[Figure: running time (minutes); panels: Many species, Many genes]

Scaling + GPU

(John Yin)

• Can analyze datasets with 10,000 species and 1000 genes in less than a day given 24 cores and a GPU

• https://github.com/smirarab/ASTRAL/tree/MP-similarity

[Figure: speedup versus ASTRAL-III against the number of CPU cores (1 to 24), with no GPU and with 1 GPU; panel: Many genes; annotated values 4, 31, 73, and 108]

[Figure: tree of microbial taxa spanning Archaea and Eubacteria, with labeled clades including the CPR, the DPANN, TACK, and Asgard groups, the PVC and FCB groups, the Terrabacteria group, the Microgenomates and Parcubacteria groups, and many candidate (Ca.) phyla; scale bar: 0.1 substitutions per site. Credits: Qiyun Zhu, Rob Knight, Uyen Mai]

• 10,000 microbial species and 381 genes

• ASTRAL infers the tree in 24 hours (4 GPUs)

To scale to large datasets

• Approximate and heuristic solutions

• Make the problem easier

• Divide-and-conquer

• Constrained search

• Develop optimized code.

Statistical support

(Erfan Sayyari)

• Traditional approach: bootstrapping each gene, then bootstrapping species tree

• Slow: requires bootstrapping all genes (e.g., 100 times slower)

• Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015]

[Figure from Mirarab et al., Sys bio, 2014]

Local posterior probability

• Quartet frequencies follow a multinomial distribution

[Figure: the three possible unrooted quartet topologies on Human, Chimp, Gorilla, and Orang., observed m1 = 80, m2 = 63, and m3 = 57 times among the gene trees, with true probabilities θ1, θ2, θ3]

• P (gene tree seen m1/m times = species tree) = P(θ1 > 1/3)

• Solved analytically: “local posterior probability” (localPP)

• n>4 leads to an exponentially growing number of cases

• Approximate by averaging all quartet scores, which can be computed in time quadratic in n.
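The single-quartet calculation can be sketched numerically. The version below is a simplified stand-in for the analytical solution and rests on assumptions not stated on the slide: the three topologies are equally likely a priori, the topology matching the species tree has probability t drawn uniformly from (1/3, 1), and the other two topologies share the remaining probability equally. The integral is evaluated with a midpoint rule in log space instead of a closed form.

```python
import math


def log_marginal(z, n, steps=5000):
    """log of the integral over t in (1/3, 1) of t^z * ((1 - t) / 2)^(n - z)."""
    width = (2.0 / 3.0) / steps
    logs = []
    for j in range(steps):
        t = 1.0 / 3.0 + (j + 0.5) * width  # midpoint of the j-th slice
        logs.append(z * math.log(t) + (n - z) * math.log((1.0 - t) / 2.0))
    peak = max(logs)  # subtract the peak before exponentiating to avoid underflow
    return peak + math.log(sum(math.exp(v - peak) for v in logs) * width)


def local_pp(m1, m2, m3):
    """Posterior probability of each quartet topology being the species tree topology."""
    n = m1 + m2 + m3
    logs = [log_marginal(m, n) for m in (m1, m2, m3)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    print(local_pp(80, 63, 57))  # the counts shown on the slide
```

With equal counts the three posteriors are equal, and support for the most frequent topology grows as its count pulls away from the other two.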

Quartet support vs. localPP

[Figure: impact of the number of genes (colors) and the amount of ILS (x-axis; quartet frequency, a proxy for discordance) on support (local posterior probability, y-axis) for a branch. Under the coalescent model, for four species separated by an unrooted branch of length d, the probability of a gene tree being identical to the species tree is 1 - (2/3)e^(-d), used here as a measure of ILS. Support is measured with local PP for various numbers of gene trees, assuming no gene tree estimation error. More ILS, i.e., a lower 1 - (2/3)e^(-d), requires more genes to get high support.]

• Increased number of genes ⇒ increased support

• Increased discordance ⇒ reduced support
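The relationship between branch length and discordance quoted above, 1 - (2/3)e^(-d), can be evaluated directly. In the sketch below the function name is illustrative, and the probabilities of the two discordant topologies, (1/3)e^(-d) each, follow from symmetry and from the three probabilities summing to one.

```python
import math


def quartet_probabilities(d):
    """Gene tree quartet probabilities for an internal branch of length d (coalescent units).

    Returns (matching, discordant, discordant).
    """
    discordant = math.exp(-d) / 3.0
    return 1.0 - 2.0 * discordant, discordant, discordant


if __name__ == "__main__":
    for d in (0.0, 0.1, 0.5, 1.0, 3.0):
        match, alt, _ = quartet_probabilities(d)
        print(f"d={d:<4} matching={match:.3f} each discordant={alt:.3f}")
```

At d = 0 all three topologies are equally likely, so no number of genes can resolve the branch; as d grows the matching topology dominates and fewer genes are needed.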

localPP is more accurate than bootstrapping

[Figure: ROC curves (recall versus false positive rate) for MLBS and local PP support values, with panels for 250, 500, 1000, and 1500 sites per gene (controlling gene tree estimation error). For the 250 bp condition at a 0.01 false positive rate, local PP recovers 84% of correct branches, whereas MLBS retains 70%.]

• Avian simulated dataset (48 taxa, 1000 genes)

• 100X faster

[Sayyari and Mirarab, MBE, 2016]

To scale to large datasets

• Approximate and heuristic solutions

• Make the problem easier

• Divide-and-conquer

• Constrained search

• Develop optimized code.

Phylogenetic placement

[Figure: query sequences (ACCG, CGAG, CGGT, GGCT, TAGA, GGAG, GcTT, …, ACCT) placed on a backbone dataset: a tree of Escherichia coli, Salmonella enterica, Salmonella bongori, Yersinia pestis, and Vibrio cholerae]

Applications:

• Place sequences of unknown origins on a reference tree of known sequences (a major goal in microbiome analyses)

• Update an existing tree quickly without recomputing

Why placement?

• Placement is easier than de novo inference

• Placement is usually sufficient for downstream applications

• Sometimes, placement is more accurate.

• Placement is embarrassingly parallel
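The last point follows from each query being placed against the same fixed backbone, independently of every other query. The sketch below illustrates only that structure. The per-query step is a deliberately naive stand-in (closest backbone sequence by Hamming distance) and not a real placement method, and all names and sequences are made up. Threads keep the example short; CPU-bound work in Python would normally use processes.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def place_one(query, backbone):
    """Toy stand-in for placement: name of the closest backbone sequence."""
    return min(backbone, key=lambda name: hamming(query, backbone[name]))


def place_all(queries, backbone, workers=4):
    """Place every query independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: place_one(q, backbone), queries))


if __name__ == "__main__":
    backbone = {"ref1": "ACCG", "ref2": "GGCT"}  # hypothetical reference sequences
    print(place_all(["ACCT", "GGAT"], backbone))
```

Because no query depends on another, the work can be split across cores or machines with no coordination beyond sharing the backbone.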

SEPP: placement for microbiome data

• Uses divide-and-conquer to align and place on very large backbone trees

• Useful for identifying microbiome data

• Has better accuracy than de novo inference of the tree [Janssen, mSystems, 2018]

• When inferring from fragmentary data

[Figure: inference error]

Placement algorithms: State of the art (ML) is memory-hungry

• PPLACER: ML (used inside SEPP)

• APPLES*: minimum least square error on sequence distances (our new method)

[Figure: peak memory (Gb, 0.25 to 16) versus reference tree size (5e+02 to 2e+05) for APPLES* and PPLACER; annotated values 1 and 1.07]
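The idea of placing by least squares on distances can be sketched on a toy tree. For every backbone edge, the code below finds the attachment point x along the edge and the pendant branch length y that minimize the squared difference between the query's sequence distances and the resulting tree distances, then keeps the best edge. This is an unweighted, brute-force illustration under assumed data structures (an edge list and a dictionary of distances), not the algorithm or weighting used by the tool named on the slide.

```python
from collections import defaultdict


def path_lengths(edges, source):
    """Distance from source to every node of a tree given as (u, v, length) edges."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist, stack = {source: 0.0}, [source]
    while stack:
        node = stack.pop()
        for nxt, w in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + w
                stack.append(nxt)
    return dist


def fit_edge(length, du, dv, delta):
    """Best (error, x, y) on one edge; x is measured from u, y is the pendant length."""
    a, s = [], []
    for leaf, d in delta.items():
        if du[leaf] < dv[leaf]:  # path from the query to this leaf leaves the edge through u
            a.append(d - du[leaf]); s.append(1.0)
        else:                    # path leaves the edge through v
            a.append(d - length - dv[leaf]); s.append(-1.0)
    n, A, S = len(a), sum(a), sum(s)
    B = sum(ai * si for ai, si in zip(a, s))
    error = lambda x, y: sum((ai - y - si * x) ** 2 for ai, si in zip(a, s))
    candidates = []
    det = n * n - S * S
    if det > 0:  # unconstrained optimum from the 2x2 normal equations
        x, y = (n * B - S * A) / det, (n * A - S * B) / det
        if 0.0 <= x <= length and y >= 0.0:
            candidates.append((x, y))
    if not candidates:  # otherwise the optimum lies on the boundary of the feasible region
        for x in (0.0, length):
            candidates.append((x, max(0.0, (A - S * x) / n)))
        candidates.append((min(max(B / n, 0.0), length), 0.0))
    return min((error(x, y), x, y) for x, y in candidates)


def place(edges, delta):
    """Return (error, (u, v), x, y) for the edge with the smallest squared error."""
    results = []
    for u, v, length in edges:
        err, x, y = fit_edge(length, path_lengths(edges, u), path_lengths(edges, v), delta)
        results.append((err, (u, v), x, y))
    return min(results, key=lambda r: r[0])


if __name__ == "__main__":
    edges = [("A", "X", 1.0), ("B", "X", 1.0), ("X", "Y", 1.0), ("C", "Y", 1.0), ("D", "Y", 1.0)]
    delta = {"A": 2.9, "B": 2.9, "C": 0.7, "D": 1.9}  # made-up query-to-leaf distances
    print(place(edges, delta))
```

Only distances to the backbone leaves are needed as input, which is one reason a distance-based approach can keep memory use low.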

Distance-based method is also much faster than ML

[Figure: running time (minutes, 0.25 to 4) versus reference tree size (5e+02 to 2e+05) for APPLES* and PPLACER; annotated values 0.89 and 1.03]

The distance-based method is equally or more accurate

[Figure: empirical cumulative distribution of placement error (# edges) and delta error (edges) for APPLES* and PPLACER, with one panel per reference tree size: 500, 1000, 5000, 10000, 50000, and 100000]

How about placing on species trees?

• Once new data are added to gene trees, we want to update the species tree

(Maryam Rabiee)

• INSTRAL: a (worst-case) quadratic-time ASTRAL-like method to update species trees [Rabiee, biorxiv, 2018]

• Finds the best place almost always (>99.9%)

• Sub-quadratic running time in practice

Examples of concerns with accuracy

But how about accuracy?

Increased data can make problems easier, but …

• Larger datasets often

• Are used to answer harder problems

• Allow the use of more complex models

• Are riddled with erroneous data

• Often, methods lose their accuracy on large datasets

Errors abound in phylogenomic datasets

Errors show as long branches

Gatesy et al. (2014)

[Figure: a gene tree from the mammalian dataset of Song et al., PNAS, 2012]

An optimization problem

The k-shrink problem:

• Given:

• a tree with n leaves and branch lengths

• some 1 ≤ k ≤ n

• Find:

• for every 1 ≤ i ≤ k:

• the set of i leaves that should be removed to reduce the tree diameter maximally

• We have a polynomial time solution using dynamic programming (Uyen Mai)
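To make the problem statement concrete, here is a brute-force version for tiny trees. It tries every set of i leaves and is exponential in k, so it only illustrates the definition and is not the polynomial-time dynamic programming mentioned above. The leaf-distance dictionary and function names are assumptions for the example. The ratio of consecutive diameters is also reported, since a large ratio signals a removal that shrinks the tree a lot.

```python
from itertools import combinations


def diameter(dist, leaves):
    """Largest pairwise distance among the given leaves (0.0 if fewer than two)."""
    return max((dist[a][b] for a, b in combinations(leaves, 2)), default=0.0)


def brute_force_k_shrink(dist, k):
    """For i = 1..k, the i leaves whose removal minimizes the diameter, plus diameter ratios.

    k must leave at least two leaves so that every diameter is positive.
    """
    leaves = sorted(dist)
    remaining = lambda gone: [x for x in leaves if x not in gone]
    diameters = [diameter(dist, leaves)]
    removals = []
    for i in range(1, k + 1):
        best = min(combinations(leaves, i), key=lambda gone: diameter(dist, remaining(gone)))
        removals.append(set(best))
        diameters.append(diameter(dist, remaining(best)))
    ratios = [diameters[i - 1] / diameters[i] for i in range(1, k + 1)]
    return removals, ratios


if __name__ == "__main__":
    # Made-up tree: A, B, C sit close together and D hangs off a long branch.
    pairs = {("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 3,
             ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 9}
    dist = {leaf: {} for leaf in "ABCD"}
    for (a, b), d in pairs.items():
        dist[a][b] = dist[b][a] = float(d)
    print(brute_force_k_shrink(dist, 2))
```

In the toy example the first removal is the long-branch leaf D and it shrinks the diameter by more than a factor of three, while the second removal changes it far less.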

What to remove?

Let ν_i = (the diameter after i-1 removals) / (the diameter after i removals)

[Figure: ratio ν_i (y-axis, 1 to 5) versus removal index (x-axis, up to 20), shown for two examples]

Running Time

• k-shrink can be solved in O(k^2 h + n), where h = the tree height

• by default, we set k = O(n^0.5)

• Fast: processes a tree of n=203,452 leaves with k=2255 in 28 mins
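The quoted numbers are consistent with a default of about five times the square root of n. The constant 5 is inferred here from n = 203,452 and k = 2255 and is not stated on the slide, so treat it as an assumption.

```python
import math


def default_k(n, constant=5):
    """Hypothetical default: k = floor(constant * sqrt(n)), which is O(n^0.5)."""
    return math.floor(constant * math.sqrt(n))


if __name__ == "__main__":
    print(default_k(203452))
```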

How much data?

• In biology, data is not free. Fundamental questions:

• How much data is enough?

• What’s the best strategy to obtain data?

• Theoretical answers: sample complexity (Shubhanshu Shekhar)

• For ASTRAL, we have bounds on the number of genes needed [Shekhar et al, TCBB, 2017]

• Practical:

• For ASTRAL, more genes are better than more individuals [Rabiee and Mirarab, MPE, 2018]

• For phylogenetic placement, we showed a very low coverage of genome (too little for assembly) is more than enough [Sarmashghi, et. al., biorxiv, 2018]

Code Optimization


• All methods shown here are in Python, Java, etc. with little or no optimization (perhaps except ASTRAL)

• What’s the best measure of performance?

• FLOPS?

• POPS ~ PPPD?

• Programs Optimized Per Student ~ Papers Published Per Dissertation

• Ease of development matters in fast-moving fields

More generally …

• Analyzing large datasets requires method development:

• Scalability: hardware + software + better algorithms

• Attention to accuracy and how properties of big data (e.g. data noisiness) affect accuracy

Tandy Warnow

Chao Zhang, Uyen Mai, James B. Whitfield

S.M. Bayzid, Théo Zimmermann

Maryam Rabiee, John Yin

Erfan Sayyari, Shubhanshu Shekhar