The Pennsylvania State University
The Graduate School

COMPUTATIONAL METHODS FOR COMPARATIVE GENOMICS OF NON-MODEL SPECIES: A CASE STUDY IN THE PARASITIC PLANT FAMILY OROBANCHACEAE

A Dissertation in Biology
by
Eric Kenneth Wafula

© 2019 Eric Kenneth Wafula

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2019

The dissertation of Eric Kenneth Wafula was reviewed and approved* by the following:

Claude W. dePamphilis
Professor of Biology
Dissertation Advisor

Naomi S. Altman
Professor of Statistics

Istvan Albert
Research Professor of Bioinformatics
Director of Online Graduate Certificate in Applied Bioinformatics

James H. Marden
Professor of Biology
Associate Director Huck Institute of the Life Sciences
Chair of Committee

Stephen W. Schaeffer
Professor of Biology
Associate Department Head of Graduate Education

*Signatures are on file in the Graduate School

ABSTRACT

The rapid development of sequencing technologies, coupled with the continuous drop in the cost of sequencing, has facilitated studies of genomes, transcriptomes, and metagenomes of a variety of organisms at unprecedented resolution. However, sequencing and accurately assembling the genomes of many non-model organisms, especially plants, remains cost-prohibitive because these genomes are often large or complex, which poses challenges to current sequencing technologies and assembly algorithms. Therefore, many researchers now rely on comparative genomic approaches that integrate data from genomes and transcriptomes to gain novel insights into evolutionary history, including the unique features of complex non-model organisms. The genomes of parasitic angiosperms are relatively understudied. Past genome-scale research has focused primarily on understanding the mechanisms of plant parasitism as a means to control weedy species that parasitize crops.
Research efforts to understand the evolutionary aspects of parasitic plants have been restricted to the plastome degradation associated with the reduction and loss of photosynthesis. In this dissertation, I present PlantTribes 2, a gene family analysis framework that uses objective classifications of complete protein sequences from genomes for genome-scale comparative and evolutionary analyses of gene families and transcriptomes. Using PlantTribes 2, the draft genome of Striga asiatica, and transcriptomes of sister lineages, I present evidence for an ancient polyploidy event shared by parasitic Orobanchaceae and closely related non-parasitic sister lineages. The observed gene family evolutionary dynamics in Striga reveal an association between whole genome duplication (WGD) and the evolutionary origins of parasitism in Orobanchaceae. Older genes whose functions are complemented by the host are overrepresented among gene losses, while gene gains often result from the WGD event specific to Orobanchaceae and have functions associated with further adaptations to the parasitic lifestyle. The evolutionary transition from autotrophy to heterotrophy is associated with changes in gene functions common to non-parasitic plants. These findings will provide a focus for future studies into the mechanisms of plant parasitism and potential targets for parasite control.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

Chapter 1 Introduction and overview
    Introduction
    Gene discovery
    Gene expression following character evolution
    Phylogenetic analysis
    Gene and genome duplication
    Parasitic plants
    Content of this dissertation
    References

Chapter 2 PlantTribes 2: tools for comparative gene family analysis in plant genomics
    Abstract
    Introduction
    Pipeline implementation
    Gene family scaffolds
    Analysis tools
        Assembly post-processing
        Gene family classification
        Gene family alignment estimation
        Gene family phylogenetic inference
        Estimation of genome duplications
    Pipeline test dataset
    Performance evaluation
        Evaluation of gene family inference methods
        Evaluation of sequence classifiers
        Evaluation of targeted gene family assembly
    Conclusions
    Availability and requirements
    References

Chapter 3 Ancient whole genome duplication events in the parasitic lineages of Orobanchaceae
    Abstract
    Introduction
    Results and discussion
        Gene annotation
            Genome cleaning
            Annotation-specific repeat masking
            Gene prediction and quality assessment
        Genome duplication history
            Global gene family phylogenetic analysis
            Duplicated gene divergence of phylogenetic syntelogs
            Genome structure analysis
    Conclusions
    Materials and methods
        Genome assembly cleaning
        Repeat library construction
        RNA-Seq sequencing and transcriptome assembly
        Gene prediction and functional assignment
        Gene family classification
        Phylogenetic and gene collinearity analysis
        Analysis of synonymous substitution rates (Ks)