Species Tree Inference and Update on Very Large Datasets Using Approximation, Randomization, Parallelization, and Vectorization
Total Page:16
File Type:pdf, Size:1020Kb
Species tree inference and update on very large datasets using approximation, randomization, parallelization, and vectorization Siavash Mirarab Electrical and Computer Engineering University of California at San Diego 1 Phylogenetic reconstruction from data Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG D 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG Orangutan Chimpanzee Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG Gorilla Human D P (D T ) T | 2 Applications: HIV forensic Texas case Washington case Scaduto et al., PNAS, 2010 3 Applications: microbiome https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html 4 Applications: microbiome https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html Morgan, Xochitl C., Nicola Segata, and Curtis Huttenhower. "Trends in genetics (2013) 4 Applications: food safety Tracking the source of a listeriosis outbreak Jackson, Brendan R., et al. Reviews of Infectious Diseases (2016) 5 Fig. 3. Molecular dating of the 2014 outbreak. (A) BEAST dating of the separation of the 2014 lineage from Middle African lineages (SL = Sierra Leone; GN = Guinea; DRC = Democratic Republic of Congo; tMRCA: Sep 2004, 95% HPD: Oct 2002 - May 2006). (B) BEAST dating of the tMRCA of the 2014 West African outbreak (tMRCA: Feb 23, 95% HPD: Jan 27 - Mar 14) and the tMRCA of the Sierra Leone lineages (tMRCA: Apr 23, 95% HPD: Apr 2 - May 13); probability distributions for both 2014 divergence events overlayed below. Posterior support for major nodes is shown. Applications: ebola source: Gire et al., Science, 2014 6 Sequence data growth data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics) 500 400 300 200 sequences (millions) 100 0 1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017 • Rapid growth in the available sequences 7 Sequence data growth data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics) 500 Whole Genomes (sequences) ACTGCACACCG Individual Genes (sequences) 400 Chromosome (hundredsGene 300 of letters) 200 sequences (millions) 100 0 Genome 1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017 (billions of letters) • Rapid growth in the • Our focus has shifted available sequences to “whole genomes” 7 Billions of More genomic regions columns More sequences More A million rows 8 Phylogenetic reconstruction : a computational nightmare • Almost all problems are NP-Hard! • The search space is huge. • Focusing on topology alone, there are (2n-5)!! trees with n leaves • We also care about branch lengths: ℝ2n-3 • We are interested in n in 101…106 range Tandy Warnow 9 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support 10 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support • Genome evolution is complex. We need complex statistical models for accurate inference 10 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support • Genome evolution is complex. We need complex statistical models for accurate inference • We need extensive simulations to test methods 10 To scale to large datasets … • Approximate and heuristic solutions 11 To scale to large datasets … • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search 11 To scale to large datasets … • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 11 How about accuracy? Increased data can make problems easier, but … 12 How about accuracy? Increased data can make problems easier, but … • Larger datasets often • Are used to answer harder problems • Allow the use of more complex models • Riddled with erroneous data 12 How about accuracy? Increased data can make problems easier, but … • Larger datasets often • Are used to answer harder problems • Allow the use of more complex models • Riddled with erroneous data • Often, methods lose their accuracy on large datasets 12 Examples of improving scalability 13 To scale to large datasets • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 14 ASTRAL • Optimization problem (NP-Hard): Find the median tree among a set of input trees Distance between two trees := the number of quartet trees they do not share 15 ASTRAL • Optimization problem (NP-Hard): Find the median tree among a set of input trees Distance between two trees := the number of quartet trees they do not share • ASTRAL: an exact solution using a dynamic programming algorithm • Solves the problem exactly on a constrained search space 15 Choosing Constraints • Restricted search space should • Not compromise statistical consistency • Be large enough that accuracy is not sacrificed • Be small for scalability Speciestreeerror Number of genes 16 ASTRAL evolution • Search space should ideally grow polynomially with the dataset size • Tandy talked about ASTRAL-I and ASTRAL-II, which do not such have guarantees • ASTRAL-III [Zhang et al., 2018] • Guarantees polynomial size search space • Increases the speed of the dynamic programming 17 Changing the number of species (n) ● ● 1M 1000 500k 52 hours ● ● ● ● 200k 100 100k 1.09 1.83 Size of X Size 50k ● ● ● 10 ● Running time (minutes) 20k ● Search space size Search 10k ● Running time (minutes) 1 200 500 1000 2000 5000 10000 200 500 1000 2000 5000 10000 #Species #Species Simulations: n = 200 to 10,000 ,k = 1000 gene trees Empirical running time seems to increase close to O(n 1.8) [Zhang et al., 2018] 18 Changing the number of genes (k) 18 2 215 217 17 hours 216 0.777 2.09 210 15 e of X 2 z si 14 2 25 Running time (seconds) 213 Search space size Search Running time (minutes) 28 29 210 211 212 213 214 28 29 210 211 212 213 214 #Genes #Genes Simulations: n = 48 species, k = 256 to 16,384 gene trees Empirical running time seems to increase close to O(k 2) [Zhang et al., 2018] 19 To scale to large datasets • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 20 Profiling ASTRAL-III Running time (minutes) Running time (minutes) Many species Many genes 21 Further scaling improvements Chao Zhang • Developed a randomized algorithm for a key step in the dynamic programming. For a set X of subsets of an alphabet, find: {(A, B)|A, B ∈ X, A ∪ B ∈ X, A ∩ B = ∅} • Using, Abelian group theory, we can do compute this in time quadratic in |X|. • Has an astronomically low probability of failure 22 Further scaling improvements Chao Zhang • Developed a randomized algorithm for a key step in the dynamic programming. For a set X of subsets of an alphabet, find: {(A, B)|A, B ∈ X, A ∪ B ∈ X, A ∩ B = ∅} • Using, Abelian group theory, we can do compute this in time quadratic in |X|. • Has an astronomically low probability of failure • Implemented the kernel in C++ and added vectorization. 22 Improved scalability Running time (minutes) Running time (minutes) Many species Many genes 23 Parallelize the main step of the dynamic programming (blue) Running time (minutes) Running time (minutes) Many species Many genes 24 Scaling + GPU John Yin 108 ● • Can analyze datasets with III − 10,000 species and 1000 ● ● 73 ● genes in less than a day ● given 24 cores and a GPU ● ● 31 ● • ● No GPU https://github.com/ ● ● 1 GPU Speedup versus ASTRAL Speedup versus ● smirarab/ASTRAL/tree/ 4 ● 1 2 4 6 8 12 16 20 24 MP-similarity CPU cores Many genes 25 Bacteroidia Thermodesulfobacteria, Ca. Dadabacteria Epsilonproteobacteria Ca. Dependentiae Nitrospirae_1, Nitrospinae, Ca. Schekmanbacteria, Ca. NC10 Flavobacteriia Archaea Ca. Calescamantes Asgard group Ca.Aminicenantes, Ca. Fischerbacteria TACK group Cytophagia Crenarchaeota ChitinophagiaSphingobacteriia Alphaproteobacteria Euryarchaeota Saprospiria DPANN group Phycisphaerae, Planctomycetia_1, Planctomycetia_2 Acidobacteriia Ca. Omnitrophica, Ca. Firestonebacteria, Ca. Desantisbacteria SAR324 CPR Deferribacteres, Chrysiogenetes Ca. Marinimicrobia, Calditrichaeota Nitrospirae_2 Verrucomicrobiae_1, Opitutae, Lentisphaerae Microgenomates group Ca. Tectomicrobia Gemmatimonadetes, Ca. Glassbacteria, Ca. Eisenbacteria Aquificae ChlamydiaeFibrobacteria Parcubacteria group Ca. WOR-3, Ca. Edwardsbacteria Ca. Delongbacteria Balneolaeota Qiyun Zhu Rob Knight Ca. Latescibacteria, Ca. Zixibacteria Ignavibacteria Eubacteria Ca. Cloacimonetes Chlorobi Betaproteobacteria Ca. Hyd24-12 Ca. Kryptonia G Terrabacteria group a m Actinobacteria m Spirochaetes Acidithiobacillia a Firmicutes p Ca. Hydrogenedentes r o Chloroflexi Ca. Coatesbacteria t e Cyanobacteria Elusimicrobia