
Copyright
by
Siavash Mir arabbaygi
2015

The Dissertation Committee for Siavash Mir arabbaygi
certifies that this is the approved version of the following dissertation:

Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction

Committee:
Keshav Pingali, Supervisor
Tandy Warnow, Co-Supervisor
David Hillis
Bonnie Berger
Joydeep Ghosh
Ray Mooney

Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction

by

Siavash Mir arabbaygi, B.S.; M. APPL S.

DISSERTATION
Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN
August 2015

Dedicated to my mother, and the memory of my father.

Acknowledgments

I wish to thank the countless people who helped me throughout my PhD. I enjoyed working with many collaborators, most of whom I never met, and not all of whom I can name here. I had the good fortune of working with my labmates, especially Nam Nguyen, Shamsuzzoha Md. Bayzid, and Théo Zimmermann, and our interactions have always been smooth and fruitful. I benefited tremendously from my involvement in both the avian phylogenomics and the thousand transcripts projects, both of which included many top researchers from across the world. I want to especially thank Bastien Boussau, Edward Braun, Tom Gilbert, Erich Jarvis, Jim Leebens-Mack, Gane Wong, Norman Wickett, and Guojie Zhang. I also thank Kevin Liu, Mark Holder, and Jeet Sukumaran for allowing me to use their code throughout my work.

I would also like to thank all members of my committee. My understanding of the biology was tremendously improved by participation in discussions led by Prof. David Hillis. I am also grateful to Prof. Keshav Pingali, who accepted me in his group after my original supervisor officially left UT; Keshav was a role model, and involvement with his group has greatly expanded my depth.
I should also thank my sources of funding (Howard Hughes Medical Institute international student fellowship, Canadian NSERC Doctoral award, and NSF) and the generous computational support of Texas Advanced Computing Center (TACC) and the computing cluster (Condor) at the CS department.

I cannot start to thank enough my supervisor, Prof. Tandy Warnow, for all her guidance and her support. I enjoyed the chance to argue with her frequently on all matters related to our research, small or large, and even though I often found myself on the losing side of these arguments, I always felt encouraged. Nothing could have better trained me for the skepticism and inquisitiveness necessary for the scientific pursuit than her openness and acceptance on the rare occasions I won an argument. I can only aspire to match her enthusiasm for research, which was a constant boost of energy to me throughout. Her endless support as a research advisor and a career mentor has opened to me many new doors, which I did not believe accessible when I started my studies. I owe much of my past and future achievements to her.

For all the support I had at work, no progress would have been possible if it weren't for the persistent emotional support my fiancee, now wife, showered me with. She raised my spirits with expressions of confidence in my abilities, warranted or not. Her confidence in me was a constant motivation to overcome obstacles. And that's to say nothing of all the day-to-day ways in which she has nudged me towards being a more organized and focused student. I would never have been in the office early in the morning every day of the week if it weren't for her. She was a major reason I started my PhD studies, and her encouragement is a major reason I feel confident continuing as a researcher.

Finally, everything I have ever achieved has been possible only because of my mother, who taught me that starting is easy, persevering hard, and letting go the hardest. Her example is always with me.
Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction

Publication No.

Siavash Mir arabbaygi, Ph.D.
The University of Texas at Austin, 2015

Supervisors: Keshav Pingali
Tandy Warnow

The amount of biological sequence data is increasing rapidly, a promising development that would transform biology if we can develop methods that can analyze large-scale data efficiently and accurately. A fundamental question in evolutionary biology is building the tree of life: a reconstruction of relationships between organisms in evolutionary time. Reconstructing phylogenetic trees from molecular data is an optimization problem that involves many steps. In this dissertation, we argue that to answer long-standing phylogenetic questions with large-scale data, several challenges need to be addressed in various steps of the pipeline. One challenge is aligning a large number of sequences so that evolutionarily related positions in all sequences are put in the same column. Constructing alignments is necessary for phylogenetic reconstruction, but also for many other types of evolutionary analyses. In response to this challenge, we introduce PASTA, a scalable and accurate algorithm that can align datasets with up to a million sequences. A second challenge is related to the interesting fact that various parts of the genome can have different evolutionary histories. Reconstructing a species tree from genome-scale data needs to account for these differences. A main approach for species tree reconstruction is to first reconstruct a set of "gene trees" from different parts of the genome, and to then summarize these gene trees into a single species tree. We argue that this approach can suffer from two challenges: reconstruction of individual gene trees from limited data can be plagued by estimation error, which translates to errors in the species tree, and also, methods that summarize gene trees are not scalable or accurate enough under some conditions.
To address the first challenge, we introduce statistical binning, a method that re-estimates gene trees by grouping them into bins. We show that binning improves gene tree accuracy, and consequently the species tree accuracy. To address the second challenge, we introduce ASTRAL, a new summary method that can run on a thousand genes and a thousand species in a day and has outstanding accuracy. We show that the development of these methods has enabled biological analyses that were otherwise not possible.

Table of Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1. Introduction

Chapter 2. Background
  2.1 Phylogeny: an evolutionary tree
    2.1.1 Properties of a phylogenetic tree
    2.1.2 Character evolution
      2.1.2.1 Substitutions
      2.1.2.2 Alignments
  2.2 Gene trees and the species tree
    2.2.1 Definitions and concepts
    2.2.2 Coalescence and Incomplete Lineage Sorting (ILS)
      2.2.2.1 Coalescence
      2.2.2.2 Multi-Species Coalescent (MSC)
  2.3 Phylogenetic reconstruction
    2.3.1 Multiple Sequence Alignment (MSA)
    2.3.2 Tree reconstruction
      2.3.2.1 Maximum parsimony
      2.3.2.2 Maximum likelihood
      2.3.2.3 Bayesian estimation
      2.3.2.4 Distance-based
      2.3.2.5 Branch support
    2.3.3 Analyzing multiple genes
      2.3.3.1 Concatenation
      2.3.3.2 Summary methods
      2.3.3.3 Co-estimation of gene trees and species trees
  2.4 Method evaluation
    2.4.1 Comparing two phylogenies
    2.4.2 Comparing two alignments

Chapter 3. PASTA: Practical Alignments using SATé and TrAnsitivity
  3.1 Motivation
  3.2 Background: SATé-II
  3.3 PASTA's algorithm
    3.3.1 Transitivity merge of two alignments
    3.3.2 Steps of a PASTA iteration
      3.3.2.1 Six PASTA Steps
      3.3.2.2 Computing the transitivity merge
    3.3.3 Running time
  3.4 Experimental setup
    3.4.1 Datasets
      3.4.1.1 Nucleotide
      3.4.1.2 Amino-acid
    3.4.2 Methods
    3.4.3 Criteria
    3.4.4 Computational platform
  3.5 Results
    3.5.1 Ability to complete analyses
    3.5.2 Results on nucleotide datasets
    3.5.3 Alignment accuracy on AA datasets
    3.5.4 Comparisons on larger datasets
    3.5.5 Running Time
    3.5.6 Impact of varying algorithmic parameters
      3.5.6.1 Starting tree
      3.5.6.2 Subset size
      3.5.6.3 Choice of tool for merging alignments
    3.5.7 Varying the bootstrap threshold for reference tree
    3.5.8 Comparisons to UPP
  3.6 Summary and discussion

Chapter 4. Statistical Binning
  4.1 Sensitivity of summary methods to gene tree error
    4.1.1 Gene tree error in simulations
    4.1.2 Gene tree estimation on the avian phylogenomics project
  4.2 Statistical binning pipeline
    4.2.1 Details of statistical binning
    4.2.2 Theoretical properties of statistical binning
  4.3 Experimental setup
    4.3.1 Datasets
      4.3.1.1 Biological datasets
      4.3.1.2 Simulated datasets
    4.3.2 Methods
    4.3.3 Criteria
    4.3.4 Computational platform
  4.4 Simulation results
    4.4.1 Binning parameters
      4.4.1.1 Impact of support threshold
      4.4.1.2 Impact of weighting
    4.4.2 Avian simulations using unweighted binning
