Species tree inference and update on very large datasets using approximation, randomization, parallelization, and vectorization
Siavash Mirarab
Electrical and Computer Engineering University of California at San Diego
1 Phylogenetic reconstruction from data
Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG
2 Phylogenetic reconstruction from data
CTGCACACCG CTGCACACCG
CTGCACACGG
Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG
2 Phylogenetic reconstruction from data
CTGCACACCG CTGCACACCG
CTGCACACGG
Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG
2 Phylogenetic reconstruction from data
CTGCACACCG CTGCACACCG
CTGCACACGG
Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG
Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG D 2 Phylogenetic reconstruction from data
CTGCACACCG CTGCACACCG
CTGCACACGG
Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG
Orangutan Chimpanzee Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG Gorilla Human D P (D T ) T | 2 Applications: HIV forensic
Texas case Washington case
Scaduto et al., PNAS, 2010 3 Applications: microbiome
https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html
4 Applications: microbiome
https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html
Morgan, Xochitl C., Nicola Segata, and Curtis Huttenhower. "Trends in genetics (2013)
4 Applications: food safety
Tracking the source of a listeriosis outbreak
Jackson, Brendan R., et al. Reviews of Infectious Diseases (2016) 5 Fig. 3. Molecular dating of the 2014 outbreak. (A) BEAST dating of the separation of the 2014 lineage from Middle African lineages (SL = Sierra Leone; GN = Guinea; DRC = Democratic Republic of Congo; tMRCA: Sep 2004, 95% HPD: Oct 2002 - May 2006). (B) BEAST dating of the tMRCA of the 2014 West African outbreak (tMRCA: Feb 23, 95% HPD: Jan 27 - Mar 14) and the tMRCA of the Sierra Leone lineages (tMRCA: Apr 23, 95% HPD: Apr 2 - May 13); probability distributions for both 2014 divergence events overlayed below. Posterior support for major nodes is shown. Applications: ebola
source: Gire et al., Science, 2014
6 Sequence data growth
data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics)
500
400
300
200 sequences (millions)
100
0
1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017
• Rapid growth in the available sequences
7 Sequence data growth
data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics)
500 Whole Genomes (sequences) ACTGCACACCG Individual Genes (sequences) 400
Chromosome
(hundredsGene 300 of letters)
200 sequences (millions)
100
0 Genome 1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017 (billions of letters)
• Rapid growth in the • Our focus has shifted available sequences to “whole genomes”
7 Billions of More genomic regions columns More sequences More
A million rows
8 Phylogenetic reconstruction : a computational nightmare
• Almost all problems are NP-Hard!
• The search space is huge.
• Focusing on topology alone, there are (2n-5)!! trees with n leaves
• We also care about branch lengths: ℝ2n-3
• We are interested in n in 101…106 range Tandy Warnow
9 Dealing with uncertainty
• Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support
10 Dealing with uncertainty
• Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support
• Genome evolution is complex. We need complex statistical models for accurate inference
10 Dealing with uncertainty
• Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support
• Genome evolution is complex. We need complex statistical models for accurate inference
• We need extensive simulations to test methods
10 To scale to large datasets …
• Approximate and heuristic solutions
11 To scale to large datasets …
• Approximate and heuristic solutions
• Make the problem easier
• Divide-and-conquer
• Constrained search
11 To scale to large datasets …
• Approximate and heuristic solutions
• Make the problem easier
• Divide-and-conquer
• Constrained search
• Develop optimized code.
11 How about accuracy?
Increased data can make problems easier, but …
12 How about accuracy?
Increased data can make problems easier, but …
• Larger datasets often
• Are used to answer harder problems
• Allow the use of more complex models
• Riddled with erroneous data
12 How about accuracy?
Increased data can make problems easier, but …
• Larger datasets often
• Are used to answer harder problems
• Allow the use of more complex models
• Riddled with erroneous data
• Often, methods lose their accuracy on large datasets
12 Examples of improving scalability
13 To scale to large datasets
• Approximate and heuristic solutions
• Make the problem easier
• Divide-and-conquer
• Constrained search
• Develop optimized code.
14 ASTRAL
• Optimization problem (NP-Hard):
Find the median tree among a set of input trees
Distance between two trees := the number of quartet trees they do not share
15 ASTRAL
• Optimization problem (NP-Hard):
Find the median tree among a set of input trees
Distance between two trees := the number of quartet trees they do not share
• ASTRAL: an exact solution using a dynamic programming algorithm
• Solves the problem exactly on a constrained search space
15 Choosing Constraints
• Restricted search space should
• Not compromise statistical consistency
• Be large enough that accuracy is not sacrificed
• Be small for scalability Speciestreeerror
Number of genes
16 ASTRAL evolution
• Search space should ideally grow polynomially with the dataset size
• Tandy talked about ASTRAL-I and ASTRAL-II, which do not such have guarantees
• ASTRAL-III [Zhang et al., 2018]
• Guarantees polynomial size search space
• Increases the speed of the dynamic programming
17 Changing the number of species (n)
● ● 1M
1000 500k 52 hours ● ● ● ● 200k 100 100k 1.09 1.83 Size of X Size
50k ● ● ● 10 ● Running time (minutes)
20k
● Search space size Search 10k ●
Running time (minutes) 1
200 500 1000 2000 5000 10000 200 500 1000 2000 5000 10000 #Species #Species Simulations: n = 200 to 10,000 ,k = 1000 gene trees
Empirical running time seems to increase close to O(n 1.8)
[Zhang et al., 2018] 18 Changing the number of genes (k)
18 2
215 217 17 hours