Sequencing, Pipeline Development, and Select Comparative Analysis of 64

SEQUENCING, PIPELINE DEVELOPMENT, AND SELECT COMPARATIVE ANALYSIS OF 64 HIGH-QUALITY DRAFT GENOMES OF EXTREMOPHILIC BACTERIA ISOLATED FROM COMMUNITIES IN CARBOXYLATE PLATFORM FERMENTATIONS

A Thesis
by
EMMA BRITAIN CARAWAY

Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE

Chair of Committee, Heather H. Wilkinson
Committee Members, Joshua Yuan
                   Joseph Sorg
Head of Department, Leland Pierson III

August 2016

Major Subject: Plant Pathology

Copyright 2016 Emma Britain Caraway

ABSTRACT

Microbial extremophiles have the potential for a wide variety of biotechnological and industrial applications, and yet extremophiles are underrepresented in whole genome sequencing efforts to date. The generation of whole genome sequences allows for gene calling, function prediction, and creation of evolutionary models, and adds to the richness of extant knowledge of the bacterial world. The sequencing of extremophiles is thus of high value. Previous efforts collected 501 soil samples from 77 thermal and saline sites across the United States and Puerto Rico and used these in an effort to optimize the microbial communities in a carboxylate biofuel platform. The 34 best-performing inocula were used to isolate 1866 strains using a variety of media in a low-oxygen and high-temperature environment. A diverse subset of this isolate library was screened for traits of industrial relevance. In this project I created a model to choose a characteristic subset of these isolates while maintaining the phylogenetic, phenotypic, and geographic diversity of the isolate library. Using this subset I created a pipeline to sequence, assemble, annotate, and disseminate high-quality draft genomes of these microbes. In this work I created high-quality draft genome sequences of 64 isolates from 22 sites across the United States and Puerto Rico.
I inferred the phylogeny of a subset (N=48) of these isolates using multilocus sequence analysis of four housekeeping genes and discovered three potentially novel genera. Using the Joint Genome Institute's Integrated Microbial Genomes system I was able to annotate and make functional assertions about these isolates. These isolates display a diverse range of carbohydrate utilization that is directly related to their phylogeny, and many isolates show industrially relevant carbohydrate utilization pathways, such as those for cellulose, arabinose, and xylose. Many of the sequenced isolates also show a pathway for degradation of furfural, an inhibitory compound that causes issues in second-generation biofuel platforms. The furfural degradation pathway is shown to be rare among extant sequenced prokaryotes. The Opu operon, which when complete transports the compatible solute glycine betaine into the cell, was found in many of these isolates. This pathway has been implicated in osmoregulation, thermotolerance, and cold-shock protection. Finally, four isolates were found to have a group II intron interrupting the housekeeping gene recA, which codes for a protein involved in DNA repair. The insertion of a group II intron into a housekeeping gene is extremely rare and has potential implications for our existing knowledge about the role of group II introns. This work creates 64 high-quality draft genome sequences and annotations as well as select analyses, clearly demonstrating the potential of these resources for future applications.

DEDICATION

This work is dedicated to my family, who continue to be relentlessly optimistic about my future. It is also dedicated to two professors, Marilyn Turnbull of Wellesley College and Charles Kennerley of Texas A&M, who believed in me and showed me that science could be not just fascinating, but empowering. Most importantly, this is dedicated to my husband, Davis Caraway, who makes every day better than the last.

ACKNOWLEDGEMENTS

I would like to thank my committee chair, Dr. Heather Wilkinson, and my committee members, Dr. Joseph Sorg and Dr. Joshua Yuan, as well as former committee member Dr. Daniel Ebbole, for their support and guidance throughout this process. I would like to thank Elena Kolomiets for always being available to help and Cruz Torres for always knowing how to fix things. Thanks go to my colleagues and to the faculty and staff of the Department of Plant Pathology and Microbiology for their support and camaraderie. Both the Texas AgriLife Research Bioenergy Program and the Texas A&M University Office of the Vice President for Research Energy Resources Program provided financial support for this project. Finally, I would especially like to thank Dr. Charles Kennerley. Without his guidance, long talks about science, and eternal patience I wouldn’t be where I am today.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER I INTRODUCTION
CHAPTER II PIPELINE TO ANALYZE DRAFT GENOME SEQUENCES FOR EXTREMOPHILES FROM SUCCESSFUL CARBOXYLATE PLATFORM FERMENTATIONS
    II.1 Introduction
    II.2 Methods
    II.3 Results
    II.4 Discussion
CHAPTER III MULTILOCUS SEQUENCE ANALYSIS OF A SUBGROUP OF HIGH-QUALITY DRAFT GENOMES FOR ISOLATES IN THE GENERA GEOBACILLUS, ANOXYBACILLUS, AND AERIBACILLUS
    III.1 Introduction
    III.2 Methods
    III.3 Results
    III.4 Discussion
CHAPTER IV SELECT COMPARATIVE ANALYSIS OF THE 64 HIGH-QUALITY DRAFT GENOME SEQUENCES OF EXTREMOPHILES
    IV.1 Introduction
    IV.2 Methods
    IV.3 Results
    IV.4 Discussion
CHAPTER V THESIS CONCLUSIONS
REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C

LIST OF FIGURES

Figure 1 Phylogenetic tree for the isolate library at start of this project, reproduced from [12]
Figure 2 Sites of origin for isolates in this study
Figure 3 Heat map showing A5-MiSeq vs SOAPdenovo2
Figure 4 Maximum-likelihood estimated phylogenetic tree obtained from 55 partial sequences of the 16S rDNA gene
Figure 5 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial gyrB sequences
Figure 6 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial groEL sequences
Figure 7 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial rpoD sequences
Figure 8 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial trmE sequences
Figure 9 Maximum-likelihood phylogenetic tree estimated by multilocus sequence analysis with concatenated gyrB, groEL, rpoD, and trmE partial sequences
Figure 10 Schematic of group II intron interrupting recA and table showing actual sizes of recA fragments and group II intron
Figure 11 Alignment of the open reading frame for group II introns in the RecA protein
Figure 12 Maximum parsimony tree of amino acid sequence ORF of group II intron

LIST OF TABLES

Table 1 Isolates and select metadata used in this study
Table 2 Phenotypic data available for a subset of the isolates in this study
Table 3 Comparison of select QUAST outputs of SOAPdenovo2 and A5-MiSeq
Table 4 Isolates used in multilocus sequence analysis (MLSA), adapted from Cope, 2013 [9]
Table 5 Reference strains used in multilocus sequence analysis
Table 6 Properties of genetic loci used in multilocus sequence analysis
Table 7 P values from pairwise ILD test between 5 genetic loci
Table 8 Identifiers associated with the IMG carbohydrate utilization network for pathways expressed in the draft genomes
Table 9 Gene counts of IMG Pathways for carbohydrate utilization
Table 10 Draft genomes that contain enzymes associated with a furfural degradation pathway
Table 11 Draft genomes with osmoprotectant genes in Opu family
Table 12 Reference sequences with group II introns

CHAPTER I INTRODUCTION

Depletion of non-renewable petroleum resources, variable costs associated with oil, and
