Efficient Algorithms for Comparing, Storing, and Sharing
Total Page:16
File Type:pdf, Size:1020Kb
EFFICIENT ALGORITHMS FOR COMPARING, STORING, AND SHARING LARGE COLLECTIONS OF EVOLUTIONARY TREES A Dissertation by SUZANNE JUDE MATTHEWS Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY May 2012 Major Subject: Computer Science EFFICIENT ALGORITHMS FOR COMPARING, STORING, AND SHARING LARGE COLLECTIONS OF EVOLUTIONARY TREES A Dissertation by SUZANNE JUDE MATTHEWS Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Approved by: Chair of Committee, Tiffani L. Williams Committee Members, Nancy M. Amato Jennifer L. Welch James B. Woolley Head of Department, Hank W. Walker May 2012 Major Subject: Computer Science iii ABSTRACT Efficient Algorithms for Comparing, Storing, and Sharing Large Collections of Evolutionary Trees. (May 2012) Suzanne Jude Matthews, B.S.; M.S., Rensselaer Polytechnic Institute Chair of Advisory Committee: Dr. Tiffani L. Williams Evolutionary relationships between a group of organisms are commonly summarized in a phylogenetic (or evolutionary) tree. The goal of phylogenetic inference is to infer the best tree structure that represents the relationships between a group of organisms, given a set of observations (e.g. molecular sequences). However, popular heuristics for inferring phylogenies output tens to hundreds of thousands of equally weighted candidate trees. Biologists summarize these trees into a single structure called the consensus tree. The central assumption is that the information discarded has less value than the information retained. But, what if this assumption is not true? In this dissertation, we demonstrate the value of retaining and studying tree collections. We also conduct an extensive literature search that highlights the rapid growth of trees produced by phylogenetic analysis. Thus, high performance algo- rithms are needed to accommodate this increasing production of data. We created several efficient algorithms that allow biologists to easily compare, store and share tree collections over tens to hundreds of thousands of phylogenetic trees. Universal hashing is central to all these approaches, allowing us to quickly identify the shared evolutionary relationships contained in tree collections. Our algorithms MrsRF and Phlash are the fastest in the field for comparing large collections of trees. Our al- gorithm TreeZip is the most efficient way to store large tree collections. Lastly, we developed Noria, a novel version control system that allows biologists to seamlessly iv manage and share their phylogenetic analyses. Our work has far-reaching implications for both the biological and computer science communities. We tested our algorithms on four large biological datasets, each consisting of 20; 000 to 150; 000 trees over 150 to 525 taxa. Our experimental results on these datasets indicate the long-term applicability of our algorithms to modern phylogenetic analysis, and underscore their ability to help scientists easily exchange and analyze their large tree collections. In addition to contributing to the reproducibility of phylogenetic analysis, our work enables the creation of test beds for improving phylogenetic heuristics and applications. Lastly, our data structures and algorithms can be applied to managing other tree-like data (e.g XML). v ACKNOWLEDGMENTS First and foremost, I would like to thank my advisor, Dr. Tiffani L. Williams, for her guidance and support over the last six years. I first met Dr. Williams through the CRA-W DMP program, when she first became my mentor. Back then, she saw a potential in me that I didn't even know that I had. She has inspired me to always question my limits and to constantly strive for excellence. Her mentoring, compassion, and sacrifice has molded me into the researcher I am today. Though my time at Texas A&M is ending, her lessons and aphorisms will accompany me through all of life's journeys. I would also like to thank my committee members, Dr. Nancy Amato, Dr. Jennifer Welch and Dr. James Woolley, for the countless conversations, and their support and faith in me. I have had so much fun interacting and learning from them over the years. Dr. Amato and Dr. Welch have been a constant source of inspiration for me, and I am truly grateful for all their advice and time. I credit Dr. Woolley for cementing my love and appreciation for Phylogeny, and I have several fond memories of the class I took with him at Texas A&M. Next, I would like to thank the National Science Foundation for funding my research for the first three years of my PhD. I am also indebted to the Texas A&M Dissertation Fellowship program for funding me during the last year of my PhD. I am also grateful for the travel fellowships I have received over the years, including those from the Grace Hopper Conference, the Richard Tapia Conference, the SuperCom- puting Broader Engagement program, International Symposium on Bioinformatics Research and Applications (ISBRA), the Applied Mathematics Program workshop on Reproducible Research (AMP), and the NSF-ADVANCE program. The opportu- vi nities I received to network and share my research with a broader audience are truly irreplaceable. My work would also not be possible without the many wonderful collaborators I've had: Dr. Seung-Jin Sul, Dr. Marc Smith, Ana Dal Molin, Grant Brammer and Ralph Crosby. I am thankful for the discussions and friendships that have resulted from our partnerships over the years. Their separate insights allowed me to think about my work in novel ways, and have only added to the strength of this dissertation. Dr. Seung-Jin Sul was also my graduate mentor while I was in the DMP program. I am especially grateful to him for the patience, guidance and kindness he showed me while I was a new PhD student. So many friends and student mentors have been a constant source of support for me over the years. I am especially grateful to Dr. Lydia Tapia and Charles Lively for their encouragement and support. They eased my transition to Texas, and were among my first friends here. I am also grateful to all the women I met and interacted with through the Aggie Women in Computer Science: you are a constant source of joy. AWICS kept me sane these last four years, and it was an honor to be able to serve both as an officer and as Graduate President. Lastly, I would like to thank my family for their love, prayers and support over these last many years. Even though they didn't always understand what I was doing, my parents were always thrilled with any accomplishments I had to share and were a source of comfort when things got tough. Most of all, I am thankful to my fianc´e Kevin. He has been a source of constant strength and unconditional love over the last five and half years. He stood by me as made the difficult decision to move to Texas, and helped me never regret my decision. His unwavering confidence in me and my abilities have kept me going even when I thought the challenges were too much to bear. To him, I lovingly dedicate this work. vii TABLE OF CONTENTS CHAPTER Page I INTRODUCTION :::::::::::::::::::::::::: 1 A. Research Objective . 2 B. Our Contributions . 3 C. Dissertation Organization . 7 II PHYLOGENETIC ANALYSIS ::::::::::::::::::: 8 A. Motivation . 8 B. A Brief Overview of Phylogenetic Analysis . 9 C. Establishing the Value of Phylogenetic Tree Collections . 14 D. The Current State of Phylogenetic Analysis . 18 1. Journals of Interest and their Population Sizes . 20 2. Sampling Details . 22 3. Analysis . 23 a. Trends in the number of taxa and trees pro- duced over time . 25 b. Trends in the type of analysis method used over time . 27 c. Trends in software used for phylogenetic analysis 28 E. Our Biological Datasets . 29 F. Summary . 30 III CURRENT METHODS AND RELATED WORK :::::::: 32 A. Comparing Phylogenetic Trees . 32 1. Bipartitions: Topological Units of a Tree . 32 2. Distance Methods . 34 3. Topological Matrices for Comparing Trees . 36 4. Limitations . 37 B. Storing Phylogenetic Trees . 39 1. The Newick Format . 39 2. The NEXUS Format . 40 3. TASPI . 41 4. Limitations . 41 C. Sharing Phylogenetic Trees . 42 viii CHAPTER Page 1. Electronic Transfer . 42 2. Tree Databases . 43 3. Limitations . 43 D. Summary . 44 IV UNIVERSAL HASHING OF PHYLOGENETIC TREES :::: 46 A. Universal Hashing of Phylogenetic Trees . 46 B. Using Hashing to Detect Unique Bipartitions . 47 C. Using Hashing to Detect Unique Trees . 50 D. Supporting Heterogeneous Trees . 52 E. Handling Hashing Collisions . 54 F. Summary . 55 V EFFICIENT ALGORITHMS FOR COMPARING LARGE GROUPS OF PHYLOGENETIC TREES ::::::::::::: 57 A. Motivation . 57 B. MrsRF: Using MapReduce for Comparing Trees with Robinson-Foulds Distance . 58 1. MapReduce . 58 a. Phoenix: a MapReduce library . 60 2. Algorithm Description . 61 3. MrsRF(p; q): Computing a p × q RF sub-matrix . 63 a. Phase 1 of the MrsRF(p; q) algorithm . 63 b. Phase 2 of the MrsRF(p; q) algorithm . 65 4. Algorithm Analysis . 67 5. Experimental Analysis . 68 a. Cluster configurations . 68 b. Establishing the fastest sequential algorithm . 68 c. Multi-core performance on freshwater trees . 70 d. Multi-core performance on angiosperms trees . 71 C. Phlash: Advanced Topological Comparison . 72 1. The ABCs of Topological Comparisons . 73 a. Computing similarity and distance between bit- strings . 73 b. Weighted similarity and distance . 75 2. Algorithm Description . 77 a. Step 1: Constructing the compact or shared table of interest . 77 ix CHAPTER Page b. Step 2: Computing the a, b, c, and d values between trees . 81 c. Step 3: Computing a t × t matrix of interest . 85 d. A note about comparing heterogeneous trees . 86 3. Algorithm Analysis . 86 4. Experimental Analysis . 87 a.