Bazinet Umd 0117E 16489.Pdf (9.061Mb)
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of dissertation: COMPUTATIONAL METHODS TO ADVANCE PHYLOGENOMIC WORKFLOWS Adam Bazinet, Doctor of Philosophy, 2015 Dissertation directed by: Professor Michael Cummings Center for Bioinformatics and Computational Biology Affiliate Professor, Department of Computer Science Phylogenomics refers to the use of genome-scale data in phylogenetic analysis. There are several methods for acquiring genome-scale, phylogenetically-useful data from an organism that avoid sequencing the entire genome, thus reducing cost and effort, and enabling one to sequence many more individuals. In this dissertation we focus on one method in particular | rna sequencing | and the concomitant use of assembled protein-coding transcripts in phylogeny reconstruction. Phylogenomic workflows involve tasks that are algorithmically and computationally demanding, in part due to the large amount of sequence data typically included in such analyses. This dissertation applies techniques from computer science to improve methodology and performance associated with phylogenomic workflow tasks such as sequence clas- sification, transcript assembly, orthology determination, and phylogenetic analysis. While the majority of the methods developed in this dissertation can be applied to the analysis of diverse organismal groups, we primarily focus on the analysis of transcriptome data from Lepidoptera (moths and butterflies), generated as part of a collaboration known as \Leptree". COMPUTATIONAL METHODS TO ADVANCE PHYLOGENOMIC WORKFLOWS by Adam Bazinet Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2015 Advisory Committee: Professor Michael Cummings, Chair/Advisor Professor Charles Mitter, Dean's Representative Professor Mihai Pop Professor H´ectorCorrada Bravo Professor Amitabh Varshney c Copyright by Adam Bazinet 2015 Preface This dissertation is based, in part, on the following publications, listed by chapter: Chapter 2 Adam L. Bazinet and Michael P. Cummings. A comparative evaluation of se- quence classification programs. BMC Bioinformatics, 13:92, 2012. Chapter 3 Adam L. Bazinet, Michael P. Cummings, and Antonis Rokas. Homologous gene consensus avoids orthology-paralogy misspecification in phylogenetic inference. Un- published. Chapter 4 Adam L. Bazinet, Derrick J. Zwickl, and Michael P. Cummings. A gateway for phylogenetic analysis powered by grid computing featuring GARLI 2.0. Systematic Biology, 63(5):812-818, 2014. Chapter 5 Adam L. Bazinet and Michael P. Cummings. Computing the tree of life: lever- aging the power of desktop and service grids. In Proceedings of the Fifth Workshop on Desktop Grids and Volunteer Computing Systems (PCGrid), 2011. ii Chapter 6 Adam L. Bazinet and Michael P. Cummings. Subdividing long-running, variable- length analyses into short, fixed-length BOINC workunits. Journal of Grid Com- puting. Submitted. Chapter 7 Adam L. Bazinet, Michael P. Cummings, Kim T. Mitter, and Charles W. Mit- ter. Can RNA-Seq resolve the rapid radiation of advanced moths and butter- flies (Hexapoda: Lepidoptera: Apoditrysia)? An exploratory study. PLoS ONE 8(12):e82615, 2013. Chapter 8 Adam L. Bazinet, Michael P. Cummings, Kim T. Mitter, and Charles W. Mit- ter. Can RNA-Seq resolve the rapid radiation of advanced moths and butterflies (Hexapoda: Lepidoptera: Apoditrysia)? A follow-up study. In preparation. iii Dedication I dedicate this dissertation to my wife Jennifer, and my two children Cassandra and Derek. A loving family makes every day worth living. iv Acknowledgements I am grateful to many, many people for making this dissertation work possible. First, I would like to give wholehearted thanks to my advisor, Dr. Michael Cummings. For well over a decade, I have been privileged to be a member of the Laboratory of Molecular Evolution, which Dr. Cummings leads and directs. My experience in his research group has been truly rewarding, and I am exceedingly grateful to be able to work with and learn from Dr. Cummings on a daily basis. He is a tireless, dedicated scientist and a devoted mentor. I would like to recognize Drs. Charles and Kim Mitter, who have been ex- tremely pleasant colleagues to work with on lepidopteran systematics over the past several years. Thanks also to the rest of my committee members for their support and encouragement. Thanks to the faculty members in computer science at the University of Mary- land | particularly those affiliated with the Center for Bioinformatics and Compu- tational Biology | for the many years of instruction that I have had the privilege to receive. To the many colleagues, coworkers, collaborators, system administrators, and support staff with whom I have had the pleasure of working these many years, thank you for your dedication and cooperation. Finally, I would like to thank my family for their love and unwavering support. The acknowledgements that follow are specific to various chapters. v Chapter 2 We acknowledge all of the tool developers for making their programs, data sets, and documentation available. In particular, we thank the following au- thors for their assistance acquiring data sets, running various programs, or making sense of results: Bo Liu (MetaPhyler), Manuel Stark (mltreemap), Kaustubh Patil (PhyloPythiaS), Arthur Brady and Steven Salzberg (phymmbl), and Ozkan Nal- bantoglu (raiphy). In addition, we thank the reviewers for their helpful comments and suggestions. Chapter 4 We thank Barry Dutton, Yevgeny Deviatov, Derrick Hinkle, and Kevin Young for their efforts developing various aspects of the grid system and the garli web service; Charles Mitter for developing the statistical determination of the num- ber of required garli search replicates; and Mike Landavere, Christopher Camacho, Ahmed El-Haggan, Taha Mohammed, Patrick Beach, Matthew Kweskin, Kevin Hildebrand, and Fritz McCall (and other umiacs staff members) for connecting and administering grid system resources. We also thank the associate editor and one anonymous reviewer for their helpful suggestions. Funding for this work was provided by National Science Foundation award number DBI-0755048. Chapter 5 U.S. National Science Foundation award DBI-0755048 provided funding for this project. Chapter 6 We thank Derrick Zwickl for adding necessary features to garli that enabled this work. Similarly, we thank David Anderson for adding new features to boinc that aided with testing. This work was funded by National Science Founda- vi tion award DBI-1356562. Chapter 7 We are greatly indebted to the following generous colleagues for pro- viding specimens used in this study: Andreas Zwick, Marcus J. Matthews, Akito Y. Kawahara, Robert F. Denno, Bernard Landry, Timothy Friedlander, Daniel H. Janzen, Andrew Mitchell, Joaquin Baixeras, Ron Robertson, Jadranka Rota, David Adamski, Jaga Giebultowicz, Chris Bergh, Neil R. Spencer, James K. Adams, and Luis Pea. Our work was greatly facilitated by the expert help of Suwei Zhao of the University of Maryland-Institute for Bioscience and Biotechnology Research Se- quencing Core. We thank the Editor as well as Niklas Wahlberg and two anonymous reviewers for helpful comments on the manuscript. This study was made possible by the foundation built by our esteemed colleague Jerome Regier, who has now transitioned to a new scholarly career. vii Table of Contents List of Tables xiii List of Figures xiv List of Abbreviations xvii 1 Introduction 1 1.1 Background on phylogenetics . .1 1.2 \Genes to trees": an overview of phylogenetic workflows . .2 1.2.1 A traditional phylogenetic workflow . .2 1.2.2 The Leptree collaboration . .2 1.2.3 Phylogenomics: whole-genome-inspired methodology . .3 1.3 Computational methods in phylogenomic workflows . .5 1.3.1 Transcriptome assembly . .7 1.3.2 Orthology determination . .7 1.3.3 Multiple sequence alignment . .8 1.3.4 Phylogenetic analysis . .9 1.4 Applying high-throughput computing to phylogenetic analysis . 10 1.4.1 The Lattice Project: a multi-model grid computing system . 11 1.5 Dissertation outline . 12 2 An evaluation of sequence classification programs 13 2.1 Background . 13 2.1.1 General approaches to sequence classification . 15 2.1.1.1 Additional considerations . 18 2.1.2 Program capability analysis . 19 2.1.3 Program performance evaluation . 19 2.2 Results . 23 2.2.1 Alignment . 23 2.2.1.1 FACS 269 bp high complexity 454 metagenomic data set............................ 25 2.2.1.2 MetaPhyler 300 bp simulated metagenomic data set 25 viii 2.2.1.3 CARMA 265 bp simulated 454 metagenomic data set 26 2.2.1.4 PhyloPythia 961 bp simMC data set . 28 2.2.1.5 Discussion . 28 2.2.2 Composition . 30 2.2.2.1 PhyloPythia 961 bp simMC data set . 31 2.2.2.2 PhymmBL 243 bp RefSeq data set . 31 2.2.2.3 RAIphy 238 bp RefSeq data set . 32 2.2.2.4 Discussion . 32 2.2.3 Phylogenetics . 33 2.2.3.1 PhyloPythia 961 bp simMC data set . 34 2.2.3.2 Discussion . 34 2.2.4 Comparison of all programs . 35 2.3 Conclusions . 35 2.4 Methods . 37 2.4.1 Program classification . 37 2.4.2 Tool usage and result processing . 38 2.5 Use of sequence classification programs in phylogenomics . 39 3 Use of consensus sequences as an alternative to orthology determination 45 3.1 Background on orthology . 45 3.2 Consensus sequences: an alternative to orthology determination . 49 3.3 Validation of the consensus method . 50 3.3.1 Yeast transcriptome analysis . 50 3.3.2 Vertebrate proteome analysis . 55 3.3.3 Plant proteome analysis . 57 3.4 Summary . 58 4 The GARLI web service 62 4.1 Introduction . 62 4.2 The Lattice Project compared to other scientific gateways . 65 4.3 GARLI web service: user interface and functionality . 68 4.3.1 Modes of use . 68 4.3.2 Create job page . 69 4.3.3 Job status page . 69 4.3.4 Job details page . 70 4.4 Partitioned analysis specification . 70 4.5 Post-processing routines . 71 4.5.1 Calculating the required number of GARLI search replicates .