SEQUENCING, PIPELINE DEVELOPMENT, AND SELECT COMPARATIVE ANALYSIS OF 64

HIGH-QUALITY DRAFT GENOMES OF EXTREMOPHILIC BACTERIA ISOLATED FROM

COMMUNITIES IN CARBOXYLATE PLATFORM FERMENTATIONS

A Thesis

by

EMMA BRITAIN CARAWAY

Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Chair of Committee, Heather H. Wilkinson
Committee Members, Joshua Yuan
                   Joseph Sorg
Head of Department, Leland Pierson III

August 2016

Major Subject: Plant Pathology

Copyright 2016 Emma Britain Caraway

ABSTRACT

Microbial extremophiles have the potential for a wide variety of biotechnological and industrial applications, and yet extremophiles are underrepresented in whole genome sequencing efforts to date. The generation of whole genome sequences allows for gene calling, function prediction, and creation of evolutionary models and adds to the richness of extant knowledge of the bacterial world. The sequencing of extremophiles is thus of high value. Previous efforts collected 501 soil samples from 77 thermal and saline sites across the United States and Puerto Rico and used these in an effort to optimize the microbial communities in a carboxylate biofuel platform. The 34 best performing inocula were used to isolate 1,866 strains using a variety of media in a low-oxygen and high-temperature environment. A diverse subset of this isolate library was screened for traits of industrial relevance. In this project I created a model to choose a characteristic subset of these isolates while maintaining the phylogenetic, phenotypic, and geographic diversity of the isolate library. Using this subset I created a pipeline to sequence, assemble, annotate, and disseminate high-quality draft genomes of these microbes.

In this work I created high-quality draft genome sequences of 64 isolates from 22 sites across the United States and Puerto Rico. I inferred phylogeny of a subset (N=48) of these isolates using multilocus sequence analysis of four housekeeping genes and discovered three potentially novel genera. Using the Joint Genome Institute's Integrated Microbial Genomes system I was able to annotate and make functional assertions about these isolates. These isolates display a diverse range of carbohydrate utilization that is directly related to their phylogeny, and many isolates show industrially relevant carbohydrate utilization pathways for substrates such as cellulose, arabinose, and xylose. Many of the isolates sequenced also show a pathway for degradation of furfural, an inhibitory compound that causes issues in second-generation biofuel platforms. The furfural degradation pathway is shown to be rare among extant sequenced prokaryotes. The Opu operon, which when complete transports the compatible solute glycine betaine into the cell, was found in many of these isolates. This pathway has been implicated in osmoregulation, thermotolerance, and cold-shock protection. Finally, four isolates were found to have a group II intron interrupting the housekeeping gene recA, which codes for a protein related to DNA repair. The insertion of a group II intron into a housekeeping gene is extremely rare and has potential implications for our existing knowledge about the role of group II introns. This work creates 64 high-quality draft genome sequences and annotations as well as select analyses, clearly demonstrating the potential of these resources for future applications.

DEDICATION

This work is dedicated to my family, who continue to be relentlessly optimistic about my future. It is also dedicated to two professors, Marilyn Turnbull of Wellesley College and Charles Kennerley of Texas A&M, who believed in me and showed me that science could be not just fascinating, but empowering. Most importantly, this is dedicated to my husband, Davis Caraway, who makes every day better than the last.

ACKNOWLEDGEMENTS

I would like to thank my committee chair, Dr. Heather Wilkinson, and my committee members, Dr. Joseph Sorg and Dr. Joshua Yuan, as well as former committee member Dr. Daniel Ebbole, for their support and guidance throughout this process.

I would like to thank Elena Kolomiets for always being available to help and Cruz Torres for always knowing how to fix things. Thanks go to my colleagues, faculty, and staff in the Department of Plant Pathology and Microbiology for their support and camaraderie.

Both the Texas AgriLife Research Bioenergy Program and the Texas A&M University Office of the Vice President for Research Energy Resources Program provided financial support for this project.

Finally, I would like to specially thank Dr. Charles Kennerley. Without his guidance, long talks about science, and eternal patience I wouldn’t be where I am today.

TABLE OF CONTENTS

ABSTRACT

DEDICATION

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER I INTRODUCTION

CHAPTER II PIPELINE TO ANALYZE DRAFT GENOME SEQUENCES FOR EXTREMOPHILES FROM SUCCESSFUL CARBOXYLATE PLATFORM FERMENTATIONS

II.1 Introduction
II.2 Methods
II.3 Results
II.4 Discussion

CHAPTER III MULTILOCUS SEQUENCE ANALYSIS OF A SUBGROUP OF HIGH-QUALITY DRAFT GENOMES FOR ISOLATES IN THE GENERA GEOBACILLUS, ANOXYBACILLUS, AND AERIBACILLUS

III.1 Introduction
III.2 Methods
III.3 Results
III.4 Discussion

CHAPTER IV SELECT COMPARATIVE ANALYSIS OF THE 64 HIGH-QUALITY DRAFT GENOME SEQUENCES OF EXTREMOPHILES

IV.1 Introduction
IV.2 Methods
IV.3 Results
IV.4 Discussion

CHAPTER V THESIS CONCLUSIONS

REFERENCES

APPENDIX A

APPENDIX B

APPENDIX C

LIST OF FIGURES

Figure 1 Phylogenetic tree for the isolate library at start of this project, reproduced from [12]

Figure 2 Sites of origin for isolates in this study

Figure 3 Heat map showing A5-MiSeq vs SOAPdenovo2

Figure 4 Maximum-likelihood estimated phylogenetic tree obtained from 55 partial sequences of the 16S rDNA gene

Figure 5 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial gyrB sequences

Figure 6 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial groEL sequences

Figure 7 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial rpoD sequences

Figure 8 Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial trmE sequences

Figure 9 Maximum-likelihood phylogenetic tree estimated by multilocus sequence analysis with concatenated gyrB, groEL, rpoD, and trmE partial sequences

Figure 10 Schematic of group II intron interrupting recA and table showing actual sizes of recA fragments and group II intron

Figure 11 Alignment of the open reading frame for group II introns in the RecA protein

Figure 12 Maximum parsimony tree of amino acid sequence ORF of group II intron

LIST OF TABLES

Table 1 Isolates and select metadata used in this study

Table 2 Phenotypic data available for a subset of the isolates in this study

Table 3 Comparison of select QUAST outputs of SOAPdenovo2 and A5-MiSeq

Table 4 Isolates used in multilocus sequence analysis (MLSA), adapted from Cope, 2013 [9]

Table 5 Reference strains used in multilocus sequence analysis

Table 6 Properties of genetic loci used in multilocus sequence analysis

Table 7 P values from pairwise ILD test between 5 genetic loci

Table 8 Identifiers associated with the IMG carbohydrate utilization network for pathways expressed in the draft genomes

Table 9 Gene counts of IMG Pathways for carbohydrate utilization

Table 10 Draft genomes that contain enzymes associated with a furfural degradation pathway

Table 11 Draft genomes with osmoprotectant genes in Opu family

Table 12 Reference sequences with group II introns

CHAPTER I

INTRODUCTION

Depletion of non-renewable petroleum resources, variable costs associated with oil, and concerns about global climate change have sparked interest in ways to create biofuels and other high-value renewable bioproducts and chemicals [1, 2]. Many of the chemical reactions that produce these kinds of products have optimal conditions (temperature, pH, etc.) that are not suitable for most life [2]. Extremophiles are microbes that are adapted to thrive under harsh conditions. They represent a largely unexplored biological frontier with a wealth of potential for biotechnological and industrial applications. Biomolecules with specific activities and capacities, primarily enzymes but also other proteins, lipids, and other molecules, are the most obvious putative products available via exploration of extremophiles. Perhaps the most famous example of an extremophile-derived biomolecule is Taq DNA polymerase, a component of the polymerase chain reaction (PCR) amplification of DNA used in molecular biology labs across the world. Taq DNA polymerase originates from the thermophilic bacterium Thermus aquaticus (3), isolated from the Lower Geyser Basin of Yellowstone National Park [3, 4].

Genomic explorations of extremophilic microorganisms can reveal the genetic mechanisms of adaptations to a variety of extreme environments and thus provide for new biotechnologies and bioproducts. Thermophiles have attracted a great deal of attention in industry both because of more rapid reaction times at higher temperatures and because the capacity to grow at the higher temperatures is rare, reducing risk of contamination by mesophilic organisms [5]. Most thermophiles are polyextremophiles, meaning that they can live under a variety of extreme conditions (such as temperature, pH, and salinity), which also grants them a much higher degree of industrial relevance [6]. Currently, thermophiles are used in bioremediation, biomining, and production of biofuels, thermozymes, and biosurfactants [6].

One area of particular interest for the use of extremophiles is in the biofuel industry. Biofuels are produced from biomass. First generation biofuels are made from food crops like corn or sugarcane, while second generation biofuels are created using lignocellulosic biomass (non-food crops, agricultural/forestry waste).

Commercially feasible biofuel production today relies on first generation methods that produce ethanol. This is limiting both because it utilizes food stocks and because ethanol can only be used in limited quantities in our existing fuel infrastructure [7]. Second generation biofuel platforms can produce, out of waste products, compounds such as n-butanol and 2-butanol that have the potential to be compatible with existing infrastructure [7]. The problem is that second generation biofuel platforms have yet to become commercially feasible. The main stumbling block for these platforms is the lack of robust and high-yielding microbes. The optimal microbe for a second generation biofuel platform would be able to degrade lignocellulose, ferment the wide variety of sugars resulting from that degradation quickly and with high yields, and tolerate the extreme conditions of reactors, such as temperature, pH, and accumulated end products [7].

The MixAlco™ process is a patented second-generation carboxylate biofuel platform developed at Texas A&M University [8]. It takes any biomass and converts it into mixed alcohol fuel via mixed-acid fermentation. Briefly, the MixAlco™ process works first by treating biomass with lime to increase digestibility. This biomass is mixed with a consortium of microbes for fermentation, in the presence of iodoform to inhibit methane production. Calcium carbonate is added as a buffer. The calcium carbonate reacts with the carboxylic acids produced in the fermentation to form carboxylate salts. Next, following well-established chemical processes, the salts are dewatered, dried, thermally converted into ketones, and then hydrogenated into alcohols.

Cope et al. (2014) collected 501 microbial communities in soils from 77 thermal and/or saline sites across the US and Puerto Rico [9]. The soil samples were used to inoculate 30-day batch fermentations of a cellulosic substrate at 55°C. The fermentation conditions were optimized to favor communities that could convert cellulosic biomass to carboxylate salts while tolerating high product concentrations [10]. That study demonstrated that microbial communities from various extreme environments exhibit a wide range of performances for these fermentations, and that the variation in performance could be explained largely by complex multivariate relationships between the soil characteristics associated with the inoculum, the fermentation conditions, and the resulting product spectra (8). Thus, that work is a proof of concept wherein there is sufficient variation among microbial communities from extreme environments and fermentation conditions, some exhibiting close to a four-fold improvement over standard fermentations with the standard inoculum (9), to justify efforts to optimize the microbial inocula and fermentation conditions.

To provide for studies of simplified microbial communities and adaptations associated with individual isolates, a library of 1,866 isolates was created from the 34 best performing fermentations identified by Cope et al. (2014). Specifically, colonies were isolated at 55°C (the temperature of the fermentation screens) on three different media, and in some cases cultivated under different oxygen conditions. The three media were: 1) Drake’s thermophilic acetogen medium (DTAM), which selects for acetogens; 2) Cellulose Agar for Thermophiles (CAT), which selects for microbes that utilize cellulose; and 3) Modified Growth Medium (MGM), which selects for halophiles [9, 11, 12]. The major rationale for three different media and variable oxygen conditions was to maximize diversity in the library, rather than specifically to favor particular adaptations. In fact, phylogenetic analysis using partial 16S rDNA sequences revealed a highly diverse isolate library, spanning 46 operational taxonomic units (OTUs) (Fig 1).

Forty of the 46 OTUs comprising the library fell within the phylum Firmicutes. This was not surprising since metagenomic sequencing of these best performing fermentations revealed that all were dominated by bacteria from the phylum Firmicutes [9, 11]. The phylum Firmicutes has been rearranged extensively to reflect changes that genome sequence data (especially that of the 16S rDNA) has made to phylogenies [13]. Phylum Firmicutes contains five classes: Bacilli, Clostridia, Erysipelotrichi, Negativicutes, and Thermolithobacteria; the class Mollicutes was removed due to sequence and phenotypic differences [14–17]. The Bacilli, in particular, are interesting for industrial uses because of their high growth rates and capacity for secretion, and because many have been recognized as ‘safe’ by the FDA [18]. Bacillus subtilis, a model organism, is the dominant enzyme-producing organism in industrial microbiology. Its genome sequence was published in 1997, just two years after the publication of the first full microbial genome, and was the first published genome of a Gram-positive bacterium [19]. A wide variety of species of Bacillus and of genera formerly classified within Bacillus (e.g., Geobacillus, Brevibacillus, Virgibacillus) are extremophiles, able to withstand wide ranges of heat, pH, and salt [13, 20–22].

Additionally, in an effort to explore phenotypic diversity in the library, Haynes (2014) selected a subset of 207 (out of the 1,866) isolates based on phylogenetic and geographic diversity, as well as the media used for isolation [12]. Haynes subjected this subset to tests of temperature tolerance, extracellular cellulase production, Congo red decolorization (a putative surrogate phenotype for lignin degradation), n-butanol tolerance, and vanillin utilization.

[Figure 1: phylogenetic tree image showing the 46 OTUs with their closest reference sequences and bootstrap values; scale bar 0.05]

Figure 1. Phylogenetic tree for the isolate library at start of this project, reproduced from [12]. Based on partial 16S rDNA, all 1,866 isolates fell within 46 operational taxonomic units. Isolates with 97% sequence similarity were collapsed into the same OTU. Closest sequences were identified with the Ribosomal Database Project (RDP) SeqMatch (accessed 11/11/11). Tree constructed using Jukes-Cantor corrected distance matrix modeling and weighted neighbor-joining with bootstrapping at 100 iterations. Colored squares indicate those branches resulting from a single media/oxygen incubation type: cellulose/anaerobic chamber (grey), cellulose (blue), acetogen (red), halophile, other (purple). Reproduced from Haynes 2015 with permission [12].
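The two calculations named in the caption, the 97% similarity cutoff for OTUs and the Jukes-Cantor distance correction, can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to equal length; the function names and the toy sequences are hypothetical and are not taken from the software used to build the tree.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def same_otu(seq_a, seq_b, threshold=0.97):
    """True when sequence similarity meets the OTU cutoff (assumed here to be inclusive)."""
    return 1 - p_distance(seq_a, seq_b) >= threshold


# Toy example: 100 aligned sites with 3 mismatches (97% similarity).
ref = "A" * 100
query = "A" * 97 + "CCC"
print(p_distance(ref, query), same_otu(ref, query))
```

The correction matters little at 3% divergence but grows as sequences diverge, which is why a corrected distance matrix is preferable for deeper branches of the tree.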

Further, the authors subjected smaller subsets of isolates to screens for hydrolysate or bio-oil tolerance or utilization and polyhydroxyalkanoate (PHA) accumulation (a putative indicator of bio-plastic production potential) [12]. This earlier work not only identified several isolates that possessed many of these industrially relevant phenotypes, but also further defined diversity within the library, wherein isolates that were closely related in the partial 16S rDNA analysis exhibited distinctly different phenotypes.

In this study, I explored the genotypic potential of the isolate library by sequencing a diverse subset of the microbes isolated from the best carboxylate platform fermentations. Whole genome sequences allow for a great deal of functional prediction in silico as well as the creation of evolutionary models. Whole genome sequences, especially from less commonly sampled sites (like those from our studies), add to the richness of the genome collection as a whole and allow for comparative models of ecological diversity [23]. From a more practical standpoint, whole genome sequences allow for the prediction of high-value products like thermo-stable enzymes and dissecting and engineering metabolic networks for fermentation platforms [2, 5, 24, 25].

The Joint Genome Institute (JGI) Genomes Online Database (GOLD) contains genomes from all JGI studies and also those directly submitted by users [26]. A total of 26.7% of the sequencing projects in the GOLD database were initiated by the JGI [27]. The Department of Energy Joint Genome Institute’s mission is to produce DNA sequencing, synthesis, and analysis in areas related to bioenergy, the carbon cycle, and biogeochemistry [28]. The GOLD database is thus enriched in these areas more than any other existing database. Extremophiles are still underrepresented in whole genome sequencing efforts. As of June 2015, there were 416 genomes classified as thermophilic or hyperthermophilic and 302 classified as halophilic in GOLD out of 44,925 publicly available isolates. In this study we sequenced, assembled, annotated, and disseminated the draft genomes of 64 isolates taken from a carboxylate biofuel platform (Chapter II). I performed a multilocus sequence analysis on a large subset of these isolates to elucidate their phylogenetic relationships (Chapter III). Finally, I conducted comparative genomic analysis on these 64 isolates for industrially relevant and/or rare traits (Chapter IV).

CHAPTER II

PIPELINE TO ANALYZE DRAFT GENOME SEQUENCES FOR EXTREMOPHILES FROM

SUCCESSFUL CARBOXYLATE PLATFORM FERMENTATIONS

II.1 Introduction

As the costs of sequencing decrease, whole genome sequencing is no longer just the purview of big research consortiums but is also available to individual labs. Whole genome sequences, even in draft form, offer potential for comparative genomics, phylogenetics, forward genetics, and data mining [29]. It is vitally important that raw sequences go through quality control during every step of the process; otherwise the utility of the data is greatly decreased [30]. In this chapter I created a heuristic model to choose the most interesting and diverse isolates from a library of 1,866 extremophilic soil microbes isolated from a carboxylate biofuel platform. I sequenced, assembled, annotated, and disseminated the data from this sequencing effort. This chapter describes the development of the pipeline I used to create high-quality draft genomes for 64 isolates.

Sequencing

In this study, we used Illumina MiSeq for the generation of sequence data. This system uses ‘tagmentation’ to combine the steps of fragmentation and tagging by using a transposase, which fragments and incorporates sequence tags in one step. Fragments are immobilized on a solid flow cell and clusters are generated via bridge amplification. This process relies on reversible terminator chemistry for sequence reads [31]. The MiSeq can be run in less than 24 hours and produces up to 1.5 Gb of paired end reads per run [32]. This creates a massive amount of data in an extremely short period of time.

Assembly

The first and arguably most important step to understanding any next-generation sequencing data is the process of assembly. The output of a sequencer is completely uninformative without the application of complex analysis algorithms to translate short sequence reads into usable assemblies. When working with novel organisms, such as those used in this study, the process of de novo assembly must be performed. De novo assembly means building an assembly without the aid of a reference genome, in contrast to aligning reads against a known sequence [33]. De novo assembly programs work by first taking short sequence reads and then joining them via overlapping areas into contiguous sequences (contigs) [30]. These contigs are then assembled into scaffolds. With the very short reads produced by next-generation sequencing technology, the number of overlaps produced is immense. Assembly programs developed in the mid-1990s handled this by using a de Bruijn graph data structure, originally developed for combinatorial mathematics [33]. Many of the assemblers in use today, including both of the assembly programs used in this study, use some form of de Bruijn graph. Pipelines in use today incorporate features beyond assembly, such as pre-processing (error correction of sequence reads, adaptor removal) and post-processing (scaffold breaking, quality assessment) [29].
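The de Bruijn graph idea described above can be sketched in a few lines. This is a toy illustration only, assuming error-free reads and no repeats; it is not the algorithm implemented by A5-MiSeq or SOAPdenovo2, and the function names, reads, and k value are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph in which nodes are (k-1)-mers and edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point: stop extending
        node = next(iter(successors))
        if node in visited:
            break  # cycle (repeat): stop extending
        visited.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)  # prints ATGGCGTGCA
```

Because overlaps are implied by shared (k-1)-mers rather than computed pairwise between reads, the graph stays tractable even when the number of reads is very large. Branch points and cycles, where this sketch simply stops, are where real assemblers apply error correction and paired-end information.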

Annotation

After assembly, some form of annotation is performed. Gene prediction, however, is a computationally complex and fundamentally important problem [34]. Early annotation algorithms analyzed a single ORF at a time and each predicted gene was manually curated. With the speed of genomic sequencing today, this type of approach is no longer feasible [35]. Reliance on automated annotation is becoming more and more common. Generally, all annotation pipelines for prokaryotes begin by identifying long ORFs. These are then compared to validated gene data from extant organisms. Even those predicted regions that do not have sequence similarity could still be annotated as putative proteins based on ab initio signs such as promoter elements or Shine-Dalgarno regions. Modern annotation pipelines aim to include not just coding sites but also ribosomal-binding sites, termination sites, transposons, RNAs, pseudogenes, and other functional units [34, 35].
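The first step described above, identifying long ORFs, can be sketched as a scan of all six reading frames. This is a simplified illustration and not the gene caller used in this study: it assumes ATG as the only start codon, the standard stop codons, and an unambiguous A/C/G/T sequence. The function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs in six frames.

    Coordinates are 0-based and end-exclusive on the strand that was scanned;
    the codon count used for filtering includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))
```

In a real pipeline the candidate ORFs found this way would then be filtered by length and compared against databases of validated genes, as the paragraph above describes.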

Dissemination of Data

High-throughput sequencing presents a new set of challenges for researchers in terms of data processing, storage, and dissemination. The speed at which data is being created is far outpacing our ability to analyze it. The dissemination of this kind of data will determine what, if any, scientific and social value it will have [36]. Most scientists recognize that free access to data is necessary for the advancement of research, and most funding agencies and publications require sequence deposition into a public database [37]. This type of biocuratorial work is often overlooked in grant writing and training, and those who perform this work are not often credited. This type of omission has caused the publication and dissemination of false or missing data and metadata [38]. The largest initiative towards the dissemination of genomic data is the International Nucleotide Sequence Database Collaboration (INSDC) [39]. The INSDC was founded in 1987 as a collaboration between the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), and the National Center for Biotechnology Information (NCBI) [40]. It sets out to establish functional annotation conventions, create a unified accessioning system, link metadata to the Sequence Read Archive (SRA), and ensure free and open access. In the United States, we use the NCBI’s GenBank database, which provides comprehensive databases of nucleotides and associated metadata [41].

More recently, the Genomic Standards Consortium (GSC) formed in 2005 with the goal to standardize genomic descriptions and promote data exchange [42]. The group created Minimum Information about any (x) Sequence (MIxS), which aims to capture core metadata that is often lacking in database submissions. The INSDC databases now have a keyword to indicate if datasets comply with MIxS requirements [43]. Further, the GSC created a journal, Standards in Genomic Sciences, to support easy and rapid publication of genomic data [43]. In a similar vein, the Genomes OnLine Database (GOLD) is a database that collects, curates, and disseminates metadata associated with sequencing projects [26]. This database is in full compliance with the GSC’s MIxS guidelines, which creates a direct pipeline from GOLD submission to publication in Standards in Genomic Sciences. In this study we submitted all sequence data to both GenBank and GOLD. As the number of sequenced genomes continues to increase, the value of biocuration and standardized metadata will only continue to rise.

II.2 Methods

Isolate Selection

I created a heuristic model to choose which of the more than 1,800 isolates from the library would be used in this study. Selection was made based upon an original estimate of sequencing 96 isolates, though this was pared down later due to growth, isolation, and extraction constraints. First, if only one unique isolate (based on extant 16S rDNA data) existed in an operational taxonomic unit (OTU), that isolate was automatically selected for sequencing (Figure 1; Table 1) [9]. After development and implementation of various screens for industrially relevant traits, Haynes identified 12 isolates with interesting, often complex phenotypes, all of which were included for sequencing (Tables 1 and 2) [12]. Sequenced isolates that have phenotypic data are in Table 2 [12]. I then aimed to select isolates that were in the same OTU as the phenotypically interesting isolates while capturing different sites and media to increase the diversity of the group to be sequenced. Ten isolates were chosen this way as ‘foils’ to the Haynes phenotypically interesting strains. A selection of unique isolates from OTU 11 was sequenced because they were most closely related to Bacillus licheniformis, a species with various known industrially relevant phenotypes, including butanol tolerance, the focus of a collaboration with Dr. Katy Kao, Chemical Engineering, Texas A&M University [44, 45]. The remaining isolates were allocated in proportion to each OTU's share of the library, that is, the number of unique isolates in a given OTU as a percent of the total unique isolates. This was only applied to the remaining 45 ‘slots’ left in the original plan to sequence 96 isolates. Every OTU was given a minimum of one strain to be sequenced even if proper rounding would result in a zero. When rounding led to more isolates to be sequenced than was feasible, the extras were removed from OTU 1, the largest OTU, comprising 43% of the total unique isolates in the library. Whenever an OTU gained an isolate by this percentage method, the isolate that would maximize geographic and media diversity within the OTU was selected.
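The percentage step of the heuristic can be sketched as follows. This is an illustrative reading of the description above, not the procedure as actually executed: it assumes simple rounding of each OTU's proportional share, a floor of one isolate per OTU, and removal of any excess from the largest OTU. The function name and the isolate counts are hypothetical.

```python
def allocate_slots(unique_counts, slots):
    """Allocate sequencing slots to OTUs in proportion to their unique isolates.

    unique_counts: dict mapping an OTU label to its number of unique isolates.
    slots: number of sequencing slots to distribute.
    """
    total = sum(unique_counts.values())
    # Proportional share, rounded, but never below one isolate per OTU.
    allocation = {
        otu: max(1, round(slots * count / total))
        for otu, count in unique_counts.items()
    }
    # If rounding overshoots the plan, remove the extras from the largest OTU.
    largest = max(unique_counts, key=unique_counts.get)
    excess = sum(allocation.values()) - slots
    if excess > 0:
        allocation[largest] = max(1, allocation[largest] - excess)
    return allocation


# Hypothetical counts: the largest OTU holds 43% of the unique isolates.
counts = {"OTU 1": 43, "OTU 2": 30, "OTU 3": 24, "OTU 4": 2, "OTU 5": 1}
plan = allocate_slots(counts, slots=10)
print(plan)
```

In this toy case the two smallest OTUs would round to zero and are raised to one isolate each, which pushes the total to 11; the single extra is then taken from OTU 1. Choosing which specific isolate fills each slot (to maximize geographic and media diversity) is a separate step not modeled here.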

Genomic DNA (gDNA) Extraction

All isolates were grown in 250 mL flasks in 50 mL of Lysogeny Broth (LB) at 55°C and 70 RPM in a rotary shaker for 24 hours. After 24 hours the optical density (OD) was measured using a BioScreenC (LabSystems, Helsinki, Finland) [46, 47]. If the wide-band optical density of the culture was below 0.5, the culture was allowed to grow for an additional 24 hours. Cultures with >0.5 OD were prepared for extraction using a modified version of the phenol-chloroform extraction described in McBride, Bouillaut, and Sorg [48].

Table 1. Isolates and select metadata used in this study. Listed are the isolates, the original operational taxonomic unit (OTU), site of origin, soil characteristics (pH, Na+ mg kg-1, soil temperature), and rationale for sequencing. For the heuristic used to create the rationale for sequencing, see the Methods section of Chapter II. Isolates from the same fermentation are identical for the first 3 characters of the label. Different samples from the same site are distinct numbers within the first 3 characters. All soil data from Cope et al., 2014. For complete metadata (soil chemistry, GPS coordinates, sampling dates, etc.) see the associated GOLD Project ID online (Appendix A).

Isolate OTU Site Name State Temp (°C) pH Na+ (mg kg-1) Rationale for Sequencing

A07M350 4 La Sal Del Ray TX 34 7.5 40553 Phenotypically Interesting

A07M352 4 La Sal Del Ray TX 34 7.5 40553 Phenotypically Interesting

E07C003 1 Great Salt Plains OK 19 6.8 13136 Characteristic percent of total isolates

E08C011 25 Great Salt Plains OK 19 7.4 15534 Characteristic percent of total isolates

E08C017 16 Great Salt Plains OK 19 7.4 15534 Only Unique Isolate in OTU

E08C020 27 Great Salt Plains OK 19 7.4 15534 Only Unique Isolate in OTU

E08D002 1 Great Salt Plains OK 19 7.4 15534 Characteristic percent of total isolates

F02C013 19 Brazoria TX 21 6.1 7742 Characteristic percent of total isolates

F05M388 1 Brazoria TX 22 7.7 12697 Characteristic percent of total isolates

F09D005 1 Brazoria TX 22 6.9 9812 Phenotypically Interesting

F09M437 5 Brazoria TX 22 6.9 9812 Characteristic percent of total isolates

G08C001 6 Bitter Lake (Roswell) NM 9 7.2 37623 Characteristic percent of total isolates

G08C006 44 Bitter Lake (Roswell) NM 9 7.2 37623 Only Unique Isolate in OTU

G08C008 1 Bitter Lake (Roswell) NM 9 7.2 37623 Characteristic percent of total isolates

G08C011 7 Bitter Lake (Roswell) NM 9 7.2 37623 Only Unique Isolate in OTU

G08C017 8 Bitter Lake (Roswell) NM 9 7.2 37623 Characteristic percent of total isolates

G09D026 1 Bitter Lake (Roswell) NM 9 7.2 37623 Characteristic percent of total isolates

G13D008 1 Bitter Lake (Roswell) NM 11 7.3 18741 Characteristic percent of total isolates

G13D016 39 Bitter Lake (Roswell) NM 11 7.3 18741 Only Unique Isolate in OTU

G13D029 37 Bitter Lake (Roswell) NM 11 7.3 18741 Only Unique Isolate in OTU

G13D038 1 Bitter Lake (Roswell) NM 11 7.3 18741 Characteristic percent of total isolates

G13D043 1 Bitter Lake (Roswell) NM 11 7.3 18741 Characteristic percent of total isolates

G19C023 43 Bitter Lake (Roswell) NM 12 7.4 5946 Only Unique Isolate in OTU

G23C002 1 Lazy Lagoon (Roswell) NM 14 7.4 10017 Closely related to phenotypically interesting (Foils)

G23C019 1 Lazy Lagoon (Roswell) NM 14 7.4 10017 Characteristic percent of total isolates

G23D015 1 Lazy Lagoon (Roswell) NM 14 7.4 10017 Characteristic percent of total isolates

G24C011 3 Lazy Lagoon (Roswell) NM 15 7.2 12181 Only Unique Isolate in OTU

H01C001 2 San Francisco Bay CA 11 7.4 8258 Characteristic percent of total isolates

H01C007 15 San Francisco Bay CA 11 7.4 8258 Characteristic percent of total isolates

H01D012 41 San Francisco Bay CA 11 7.4 8258 Only Unique Isolate in OTU

H01M105 1 San Francisco Bay CA 11 7.4 8258 Characteristic percent of total isolates

H20C002 2 San Francisco Bay CA 10 7.1 18262 Phenotypically Interesting

H20C009 2 San Francisco Bay CA 10 7.1 18262 Phenotypically Interesting

H20D004 36 San Francisco Bay CA 10 7.1 18262 Characteristic percent of total isolates

J04M017 12 Big Bend TX 31 7.3 181 Characteristic percent of total isolates

J11M005 11 Big Bend TX 37 7.2 134 Putative Bacillus licheniformis

J11M011 11 Big Bend TX 37 7.2 134 Putative Bacillus licheniformis

J11M287 11 Big Bend TX 37 7.2 134 Putative Bacillus licheniformis

J18C011 9 Big Bend TX 30 7.4 221 Characteristic percent of total isolates

J18C022 9 Big Bend TX 30 7.4 221 Closely related to phenotypically interesting (Foils)

J18C025 13 Big Bend TX 30 7.4 221 Only Unique Isolate in OTU

J18D015 1 Big Bend TX 30 7.4 221 Characteristic percent of total isolates

J19C022 9 Big Bend TX 31 7.3 182 Closely related to phenotypically interesting (Foils)

J20M022 10 Big Bend TX 28 7.3 205 Only Unique Isolate in OTU

J20M030 2 Big Bend TX 28 7.3 205 Closely related to phenotypically interesting (Foils)

K49C015 1 Baker Hot Spring UT 47 N/D N/D Characteristic percent of total isolates

K49D024 17 Baker Hot Spring UT 47 N/D N/D Only Unique Isolate in OTU

K49M014 1 Baker Hot Spring UT 47 N/D N/D Characteristic percent of total isolates

K49M015 45 Baker Hot Spring UT 47 N/D N/D Characteristic percent of total isolates

M24C029 9 Cape Romain SC 22 3.6 134 Phenotypically Interesting

N09C011 2 Sapelo Island GA 20 6.88 2476 Characteristic percent of total isolates

N09C014 1 Sapelo Island GA 20 6.88 2476 Characteristic percent of total isolates

P01M009 2 Puerto Rico PR N/D 7.1 86 Closely related to phenotypically interesting (Foils)

R08M008 32 Sulfur Springs NM 66 1.74 92 Only Unique Isolate in OTU

S44C017 1 Yellowstone WY 60 6.6 324 Characteristic percent of total isolates

S44C019 20 Yellowstone WY 60 6.6 324 Only Unique Isolate in OTU

S44C021 2 Yellowstone WY 60 6.6 324 Characteristic percent of total isolates

S44D013 40 Yellowstone WY 60 6.6 324 Only Unique Isolate in OTU

S48D018 1 Yellowstone WY 64 2.3 58 Characteristic percent of total isolates

T02C003 12 Still Water NV 32 7.98 77679 Characteristic percent of total isolates

T02C029 4 Still Water NV 32 7.98 77679 Characteristic percent of total isolates

U22C014 12 Owens Lake CA 33 9.59 164045 Characteristic percent of total isolates

U22D017 25 Owens Lake CA 33 9.59 164045 Only Unique Isolate in OTU

U22M431 1 Owens Lake CA 33 9.59 164045 Characteristic percent of total isolates

Table 2. Phenotypic data available for a subset of the isolates in this study. A subset of the isolates (N=30) in this study were characterized previously for a variety of biofuel and bioproduct related phenotypes. The assays included: a) cellulase production, as indicated by clearing on carboxymethyl cellulose (CMC) agar plates; b) Congo red decolorization, a surrogate for lignin degradation; c) growth at 37°C, a characteristic one might expect to be rare among isolates originally cultivated in 55°C fermentations; d) accumulation of polyhydroxyalkanoates (PHA), based on a fluorescence screen during growth with waste glycerol as the carbon source; e) tolerance/utilization of vanillin, an inhibitor that accumulates during bioconversion of lignin; and f) capacity to tolerate/utilize complex substrates that result from different processes to convert and concentrate lignocellulosic biomass, hydrolysates (enzymatic conversion) or bio-oil (product of pyrolysis). A 1 indicates the isolate was positive for the trait screened and 0 indicates the isolate was negative. Gray filling in the table indicates the isolate was not screened for that trait. Adapted from Haynes, 2014 with permission.

Isolate Cellulase CR Decolor 37C PHA Vanillin 1:2 Hydrolysate 1:10 Hydrolysate 1:20 Hydrolysate 0.2% Bio-oil 0.5% Bio-oil 1.0% Bio-oil 1.25% Bio-oil 1.5% Bio-oil 1.75% Bio-oil

A07M350 0 0 1 1 0

A07M352 0 0 1 1 0 0 1 1 1 1 0 0 0 0

H20C009 1 1 1 1 1 0 0 1 1 1 0 0 0 0

F09M437 1 0 1 0 0

G08C001 1 0 1 0 0

H01C007 0 1 1 0 0 0 0 0

J11M005 1 0 1 0 0 0 0 0 1 1 1 0 0 0

J11M287 1 0 1 0 0 0 1 1 1 1 0 0 0 0

S44C017 0 0 1 0 0 1 1 0 0 0 0

F09D005 0 1 0

G08C006 1 1 0

G08C011 0 0 0

G08C017 0 0 0

G13D029 0 0 0

G23C002 0 0 0

H01C001 1 1 0

H01D012 0 0 0

H20D004 0 0 0

J04M017 0 0 0

J11M011 0 0 0

J18C022 1 0 0

J19C022 0 0 0

J20M030 1 1 0

K49M015 0 0 0

M24C029 1 1 0

R08M008 0 0 0

S44C019 0 0 0

S44C021 1 0 0

S44D013 0 0 0

U22C014 1 0 0

Briefly, 10 mL of culture was pelleted via centrifugation for 10 minutes at 4,000 x g at 4°C in a 50 mL tube. The supernatant was discarded and the resulting pellet was washed and vortexed thoroughly in 2 mL of TE buffer. This liquid was collected and transferred to a 2 mL tube and centrifuged again for 10 minutes at 4,000 x g at 4°C. Lysis and extraction steps continued as in the original protocol with an additional chloroform step at the end of extraction. After the DNA was precipitated in 95% cold ethanol with 77 μM sodium acetate, centrifuged for 20 minutes at maximum speed, and washed with 70% room temperature ethanol, the tubes were cycled for 20 minutes in a vacuum centrifuge to evaporate the ethanol. DNA pellets were resuspended in 300 µL of TE buffer and incubated at 65°C for at least 3 hours. The initial DNA concentration and quality check was performed on a NanoDrop (NanoDrop Products, Thermo Fisher Scientific, Wilmington, DE). Because of slow growth, oxygen conditions, difficulty extracting large amounts of high-quality gDNA, and contaminated cultures, only 80 isolates were successfully extracted.

Sequencing

We provided the genomic DNA samples to the Texas A&M Genomics and Bioinformatics Service [49], where the gDNA was cleaned using the Agencourt AMPure system (Beckman Coulter, Inc., Indianapolis, IN) and the paired-end library was constructed using the Illumina TruSeq DNA Kit (Illumina Inc., San Diego, CA) [50, 51]. Sequencing was performed using the Illumina MiSeq (Illumina Inc., San Diego, CA) [31].

Assembly and Quality Check

The paired reads from the Illumina MiSeq were first assembled using A5-MiSeq, which has built-in quality-checking algorithms [52, 53]. Several isolates would not assemble and thus were not considered in subsequent steps. Isolate assemblies with raw coverage of less than 20% were discarded. Isolates with more than 300 contigs were also discarded. Isolates with genomes greater than 5 Mbp were examined; all were found to have contaminating sequences and were thus discarded. Many of these issues overlapped (e.g., an isolate had more than 300 contigs and low coverage). A total of 16 isolates had to be discarded this way. After these discards, a total of 64 isolates yielded draft genome sequences sufficient for further study. These 64 isolates were run through an alternate assembly pipeline, SOAPdenovo2; this analysis was performed by Cheng Yanbing, Plant Pathology and Microbiology, Texas A&M University. Of the 64 isolates put into SOAPdenovo2, 61 assembled. The three isolates that did not assemble with SOAPdenovo2 were G08C008, J18D015, and T02C029. The 61 isolates that were assembled in both A5-MiSeq and SOAPdenovo2 were compared using the Quality Assessment Tool for Genome Assemblies (QUAST) [54]. QUAST calculates the largest contiguous sequence (contig) and N50 based on contigs of sizes ≥ 500 bp.

Annotation

The assembled data were uploaded into two separate annotation pipelines. Both the RAST (Rapid Annotations Using Subsystems Technology) Server and the Joint Genome Institute’s (JGI) Integrated Microbial Genome (IMG)/Expert Review (ER) annotation pipeline were used under default parameters [55–57].

Dissemination

All relevant data and associated metadata were uploaded into NCBI’s GenBank and the Joint Genome Institute’s Genomes OnLine Database (GOLD). These data will be released to the public upon publication. See Appendix A for relevant accessions.

II. 3 Results

Isolate Selection

The isolates used in this study, select associated metadata, and the rationale for which isolates to sequence are available in Table 1. Figure 2 is a map, by latitude and longitude, of the soil samples from which each isolate originated. These soil samples are the inocula sources for the top 34 performing fermentations from the carboxylate biofuel platform [9]. The sample ID associated with each isolate is the first three characters of the isolate ID (e.g., isolate A07M350 was isolated from fermentation A07) [9].
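The labeling convention can be made concrete with a short sketch that recovers the fermentation (sample) ID from an isolate ID and groups isolates by fermentation. The function names are hypothetical; the isolate IDs are taken from Table 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def fermentation_id(isolate_id: str) -> str:
    """The first three characters of an isolate ID name its fermentation."""
    return isolate_id[:3]


def group_by_fermentation(isolate_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each fermentation ID to the isolates recovered from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for isolate in isolate_ids:
        groups[fermentation_id(isolate)].append(isolate)
    return dict(groups)


groups = group_by_fermentation(["A07M350", "A07M352", "E07C003", "E08C011"])
```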

Sequencing

Paired reads had a minimum of 271 base pairs, a maximum of 658 base pairs, and an average of 434 base pairs. Raw sequence files for each isolate were submitted to the NCBI Short Read Archive (Appendix A).

Figure 2. Sites of origin for isolates in this study. Map of locations where soil samples were collected 2008-2010 [9]. Map created using www.batchgeo.com.

Assembly

Two different assembly pipelines, SOAPdenovo2 and A5-MiSeq, were used to assemble the paired-end reads. The quality of each assembly was analyzed using QUAST (Table 3). An optimal de novo assembly would have a high N50, very long contiguous sequences (contigs), a low number of contigs, and a low number of uncalled bases (N’s) per 1,000 base pairs. Figure 3 is a heat map comparing these four statistical parameters under the working hypothesis that A5-MiSeq is the better of the two assemblers for these data. Of the 64 draft genomes included in this study, we analyzed 61 with both pipelines. A5-MiSeq outperformed SOAPdenovo2 in all four statistical parameters in 37 of the 61 isolates analyzed (Fig 3). No assemblies produced with the SOAPdenovo2 pipeline outperformed A5-MiSeq in all four statistical parameters. This supports the hypothesis that A5-MiSeq is a better de novo assembler for this dataset (Table 3).

The number of uncalled bases per 1,000 base pairs did not correlate with any of the other statistics (Fig 3). A5-MiSeq outperformed SOAPdenovo2 in number of N’s per 1,000 base pairs in all but three isolates: E08D002, G13D016, and K49M015. The other three parameters (N50, largest contig, and number of contigs) were correlated (Fig 3). Of the 61 genomes analyzed, 52 showed better overall performance for these parameters with the A5-MiSeq assembler. A5-MiSeq had a higher N50 value in 55 of the assembled genomes, the larger largest contig in 48 of the assembled genomes, and fewer contigs in 44 of the assembled genomes. Nine isolate assemblies showed lower performance across these three parameters with A5-MiSeq (H01C001, J04M017, J18C022, J19C022, K49D024, M24C029, R08M008, S44D013, and U22C014).
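The per-isolate comparison behind these counts can be sketched as follows, using the definition of ‘performance’ given for Figure 3 (higher N50, longer largest contig, fewer contigs, fewer N’s). The class and function names are hypothetical, and the two example isolates use the values reported in Table 3; this is an illustration of the comparison logic, not the script used to produce Figure 3.

```python
from typing import Dict, NamedTuple


class QuastStats(NamedTuple):
    contigs: int
    n50: int
    largest_contig: int
    ns_per_kbp: float


def a5_wins(a5: QuastStats, soap: QuastStats) -> Dict[str, bool]:
    """For each parameter, True where A5-MiSeq is strictly better."""
    return {
        "contigs": a5.contigs < soap.contigs,                       # fewer is better
        "n50": a5.n50 > soap.n50,                                   # higher is better
        "largest_contig": a5.largest_contig > soap.largest_contig,  # longer is better
        "ns_per_kbp": a5.ns_per_kbp < soap.ns_per_kbp,              # fewer is better
    }


# Values from Table 3 (SOAPdenovo2 first, then A5-MiSeq).
a07m350 = a5_wins(a5=QuastStats(81, 138150, 731051, 11.2),
                  soap=QuastStats(124, 77757, 266987, 124.55))
e08d002 = a5_wins(a5=QuastStats(289, 27346, 110921, 127.02),
                  soap=QuastStats(306, 24671, 77421, 55.51))
```

For A07M350 all four entries are True, whereas for E08D002 A5-MiSeq is better on the three contiguity parameters but not on uncalled bases.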

Annotation

All annotation data will be available to the public on the Integrated Microbial Genome (IMG) Data Warehouse following publication or in October 2017, whichever comes first. A summary of IMG’s annotation data for these draft genomes is provided in Appendix B. The IMG annotation pipeline automatically determines the quality of assemblies submitted and marks them as high or low quality. All 64 assemblies submitted were marked as high-quality drafts.

Table 3. Comparison of select QUAST outputs of SOAPdenovo2 and A5-MiSeq. Outputs of the Quality Assessment Tool (QUAST) for genome assemblies showing various genome statistics: the number of contiguous sequences produced (contigs), the shortest sequence length at 50% of the genome (N50), the number of nucleotides in the largest contig (largest contig), and the average number of uncalled bases (Ns) per 1000 base pairs.

Isolate; SOAPdenovo2: contigs, N50, largest contig, Ns per kbp; A5-MiSeq: contigs, N50, largest contig, Ns per kbp

A07M350 124 77757 266987 124.55 81 138150 731051 11.2
A07M352 181 42164 286423 28.17 89 130208 413436 2.57
E07C003 321 25208 89659 63.12 257 35500 101354 8.79
E08C011 333 20938 79415 77.62 274 25726 100424 8.12
E08C017 280 31923 105788 89.82 253 39369 144884 9.62
E08C020 147 43795 219196 107.02 168 46308 219736 5.56
E08D002 306 24671 77421 55.51 289 27346 110921 127.02
F02C013 350 17657 122241 40.6 210 49269 221472 3.39
F05M388 100 87112 340715 95.21 65 122930 341228 27.32
F09D005 293 34069 132085 22.56 72 105819 419122 2.99
F09M437 164 49843 138931 79.45 132 60283 152776 8.65
G08C001 88 123700 418998 110.71 49 263659 625940 2.47
G08C006 164 54376 152351 37.45 132 60500 152776 0.93
G08C011 320 22601 83138 39.87 103 118043 362184 1.92
G08C017 369 19754 72478 88.81 268 35429 122331 3.26
G09D026 167 65136 215434 80.29 48 200472 386638 3.5
G13D008 326 24281 82967 24.2 240 36317 135326 19.79
G13D016 389 17049 71567 21.76 213 49270 137429 105.84
G13D029 86 101201 263356 40.39 31 355884 850760 3.04
G13D038 329 24440 83065 23.34 262 35279 129198 2.78
G13D043 260 34436 134986 15.23 231 36111 135326 3.32
G19C023 298 23985 104498 27.13 273 32539 132768 2.37
G23C002 353 22335 83023 73.43 401 89667 372544 7.53
G23C019 144 61968 118279 27.29 133 65808 152776 2.58
G23D015 251 29229 134986 81.72 501 47646 341230 8.07
G24C011 204 57422 164615 27.44 80 107371 341078 2.12
H01C001 143 69583 240718 21.49 265 35611 92870 1.68
H01C007 623 11797 85447 24.24 63 186638 468775 8.3
H01D012 60 180455 468408 50.3 63 180915 319860 9.05
H01M105 72 151076 324293 128.21 65 189018 427677 8.34
H20C002 137 51654 147804 197.32 56 183938 744641 3.03
H20C009 86 131013 743497 30.6 61 183768 744414 6.08
H20D004 1060 5526 31676 45.46 158 50972 118621 6.88
J04M017 54 128248 452187 27.41 743 6646 35092 3.7
J11M005 134 76486 217200 82.67 49 175723 392086 7.61
J11M011 152 65312 364846 69.22 86 120925 341077 1.55
J11M287 147 74214 301212 40.01 113 75096 262032 9.44
J18C011 186 54626 188491 66.5 38 250911 917715 1.46
J18C022 97 93695 264915 37.49 128 65808 152776 4.72
J18C025 131 65240 198023 21.82 78 122982 290159 6.64
J19C022 141 75337 264089 32.44 228 44605 262770 2.44
J20M022 289 23395 93755 64.77 276 31055 113048 1.07
J20M030 410 16049 70314 19.01 58 173483 466342 3.21
K49C015 130 63605 343347 31.46 120 78041 288038 7.78
K49D024 168 61086 235738 29.85 242 35611 135394 3.84
K49M014 256 28075 135037 127.16 240 35977 135394 4.28
K49M015 255 31155 135054 110.04 204 81598 316355 216.4
M24C029 216 36686 136422 74.2 259 33522 101354 7.17
N09C011 289 30410 100941 37.44 36 232805 639487 26.11
N09C014 72 180327 294581 71.59 152 153492 625369 7.02
P01M009 94 88208 304203 23.91 72 109407 341078 1.24
R08M008 95 92541 340675 68.19 137 65808 152776 13.11
S44C017 164 57637 139294 151.97 175 58509 253653 5.72
S44C019 238 37568 125011 40.74 178 62690 168298 2.76
S44C021 255 29541 83420 89.24 614 54680 156317 8.39
S44D013 91 104382 278235 27.18 307 34818 111212 2.43
S48D018 165 55888 152339 24.52 61 188957 421276 1.2
T02C003 79 119965 392686 138.09 117 136074 404552 6.36
U22C014 61 126509 396891 66.44 67 122953 341078 8.42
U22D017 114 74892 290638 22.71 65 109707 341078 1.74
U22M431 104 87112 219709 19.09 65 109707 341078 1.74

Figure 3. Heat map showing A5-MiSeq vs SOAPdenovo2. A5-MiSeq was hypothesized to outperform SOAPdenovo2 in the QUAST statistical parameters shown in Table 3. ‘Performance’ was defined as N50 being higher, largest contiguous sequence being longer, number of contiguous sequences being lower, and number of uncalled bases (N’s) per 1000 base pairs being lower. Areas where this hypothesis was supported are shown in grey; areas where the opposite hypothesis, SOAPdenovo2 outperforming A5-MiSeq, is supported are shown in black.

II. 4 Discussion

De novo assembly quality control is limited by the fact that, by its very definition, it lacks a reference genome for comparison. With the explosion of whole genome sequencing there has been a concurrent proliferation of assembly software. While there have been large scale efforts to set up standards, like the Genome Assembly Gold-Standard Evaluation (GAGE) Project, there has been speculation that software has been optimized to work on these specific datasets (GAGE, A5 MiSeq). In a similar vein, it is clear from projects like GAGE and Assemblathon that not all assembly programs are best suited for every kind of dataset [58–60]. The Quality Assessment Tool for Genome Assemblies (QUAST) allows for comparisons of both aligned and de novo assemblies produced by different software and provides numerous statistics and color-coded graphs as outputs. I used QUAST to compare the assemblies of A5-MiSeq and SOAPdenovo2 for 61 of our isolates and looked at four of the most commonly cited statistics for assembly quality. While this type of comparison cannot show which assembly is closer to the true arrangement of a given genome, it can show which assembly software is better relative to the other.

Of the 61 assemblies that were compared, most (52 of 61) had overall better assemblies using the A5-MiSeq pipeline, as judged by N50, largest contig, and number of contigs (Fig 3). There are several potential reasons for this. The first is that A5-MiSeq is a newer assembler, first published in 2015, while SOAPdenovo2 was first published in 2012 [52, 61]. Assembly software design is an extremely active area of research and three years can allow for a great deal of innovation [53]. In this vein, SOAPdenovo2 was optimized for use with the older Illumina HiSeq 2000 platform, which has different sequencing chemistry and output data than the slightly newer Illumina MiSeq [52, 62]. A5-MiSeq and its progenitor program A5 were created for the specific purpose of simplifying the entire genome assembly process into a single automated pipeline that could be used on a laptop computer [52, 53]. To this end the creators of A5 limited their scope to datasets that could be assembled with the processing power of a standard laptop and thus optimized their program for microbial genome assembly. So while SOAPdenovo2 has a much broader range of applications, in our particular case A5-MiSeq was optimized both for the sequencer we used and for the kind of genomic data we were inputting. It is not surprising that A5-MiSeq outperformed SOAPdenovo2 for our dataset.

As with assembly, not all annotations are created equal. For the initial annotation of these genomes, Rapid Annotations using Subsystems Technology (RAST) was employed [55, 56, 63, 64]. This automated annotation service assigns gene function and performs metabolic reconstruction of archaeal and bacterial genomes, typically within 12–24 hours. Compared to genomes manually annotated in RAST’s parent project SEED, RAST’s functional annotations are 91–94% identical. This allows for rapid and accurate comparative projects and allowed me to work on downstream applications (see Chapter III) while waiting for more comprehensive annotations. I also used the Joint Genome Institute’s Integrated Microbial Genomes Expert Review annotation (IMG-ER) system. IMG-ER generates a preliminary automated annotation, compares it to any available known annotations provided by the user (in this case, those generated with RAST), and uses this comparison to find missing annotations and to revise and review the automated annotation [57]. The IMG-ER system is extremely comprehensive and user friendly and has a great deal of built-in computational capability.

The results of our IMG-ER annotation are shown in Appendix B. The percent of genes with predicted functions ranged from 69-79%. Function prediction relies on a variety of different databases to compare already predicted protein coding sequences to our sequences. The database of Clusters of Orthologous Groups of proteins (COGs) uses fully sequenced genomes and known protein function in an attempt to classify proteins based on orthology [65]. The COG collection contains 138,458 proteins, which form 4,873 COGs and comprise 75% of the 185,505 predicted proteins encoded in 66 genomes of prokaryotic organisms [66]. IMG compares sequences to these ‘knowns’ and creates functional predictions. Between 56-68% of the protein coding genes in our isolates had COGs (Appendix B). Pfam is a manually curated database of multiple sequence alignments for protein families [67]. This is another database that IMG uses for sequence data comparison and function prediction. Our isolates have between 73-83% of their protein coding genes annotated via comparison to the Pfam database (Appendix B). Other, smaller databases are also searched via IMG’s annotation, but COG and Pfam are the two largest and most comprehensive.

The creation of 64 high quality draft genomes from extremophilic bacteria is a significant addition to the body of knowledge of bacterial genomics. Extremophiles are historically underrepresented in the whole of sequenced genomes to date [23]. These isolates were taken from high-yielding carboxylate biofuel fermentations, which creates interest in understanding their genomes for fermentation and enzymatic processes. Having the whole genome sequences of 64 isolates from our library creates a wide variety of potential future applications.

CHAPTER III

MULTILOCUS SEQUENCE ANALYSIS OF A SUBGROUP OF HIGH-QUALITY DRAFT

GENOMES FOR ISOLATES IN THE GENERA GEOBACILLUS, ANOXYBACILLUS, AND

AERIBACILLUS

III. 1 Introduction

The genera Anoxybacillus, Geobacillus, and Aeribacillus were separated from the genus Bacillus relatively recently and have only been described in the last two decades [68–70]. In fact, the rapid separation of these genera from Bacillus involved substantial proliferation of named and characterized species within each, resulting in discrepancies in the characteristics of each genus and shuffling of names among genera [70]. The List of Prokaryotic names with Standing in Nomenclature (LPSN) states that since January 2000 the names of prokaryotes have changed at a rate of almost 750 names a year [71]. Thus, the influx of additional draft genome sequence data for geographically diverse isolates from extreme environments within these genera provides for identification of new and potentially important clades.

Anoxybacillus

Pikuta et al. 2000 established the genus Anoxybacillus as distinct from the genus Bacillus (31). Today, Anoxybacillus includes 21 valid species [68, 71]. The type strain A. pushchinoensis was originally described as an anaerobe (hence the genus name Anoxybacillus); however, subsequent phenotypic analysis by the same research group established its capacity to grow aerobically, despite superior growth in anaerobic conditions [72]. Anoxybacillus spp. are alkaliphilic or alkalitolerant and are either aerotolerant anaerobes or facultative anaerobes. Anoxybacillus spp. are also thermophiles, with G+C contents ranging from 42-57%. Most characterized Anoxybacillus are isolates from hot springs and have optimal growth temperatures of 50-65°C [73]. Anoxybacillus genomes tend to be smaller than Geobacillus or Bacillus genomes (<3 Mbp as compared to 3-4 and 4-6 Mbp, respectively); however, this observation is based on a limited sample size [73]. This genus has potential for a wide variety of industrial applications, including enzyme production, lignocellulose and starch degradation, and bioremediation [74–78].

Geobacillus

Nazina et al. 2001 identified the genus Geobacillus as distinct from the genus Bacillus [69]. Geobacillus has 20 published species, with the type strain G. stearothermophilus (originally called Bacillus stearothermophilus) [69]. G. stearothermophilus is aerobic or facultatively anaerobic, neutrophilic, and has a G+C content of 48-58%. The optimum temperature for Geobacillus spp. varies by strain, but the majority of isolates require temperatures of 45-70°C for growth. Unlike Anoxybacillus and Aeribacillus, Geobacillus has been found in almost every environment sampled [79]. Geobacillus spp. have been found in every thermophilic niche, including hot springs, hydrothermal vents (reaching up to 90°C), and thermogenic alkaline natural gas wells in the Barnett Shale [80–82]. Perhaps even more surprising, Geobacillus is often isolated from cool environments. Several Geobacillus strains have been successfully isolated from permanently cool soils in Ireland and Bolivia, and the genus has been found on all seven continents [79, 83]. The ubiquity of Geobacillus in the soil is a source of active debate. A 2008 study hypothesized that in cool climates Geobacillus is deposited by rainwater into the soil. Once there, it stays dormant until a rare thermal incident, such as decomposing plant material, at which point the bacteria or spores can rapidly grow and propagate [84]. The study of such airborne transport is called aerobiology. No evidence exists for the aerobiology of Geobacillus because, to date, aerobiology studies have been carried out in mesothermic conditions. There are, however, many studies showing Bacillus in long-distance bioaerosols [85, 86]. Ziegler suggests that Bacillus and Geobacillus spores evolved to be exactly the appropriate size to fall into the ‘scavenging gap,’ allowing them to stay in the atmosphere longer than slightly larger particles without being ‘scavenged’ by water and brought back to Earth [79]. While there is still much to debate about the maxim ‘Everything is everywhere’, the genus Geobacillus certainly adds an interesting layer of complexity to the study of biogeography.

Aeribacillus

The genus Aeribacillus is the newest of the three genera detailed in this chapter and was split from Geobacillus in 2010 [70]. Aeribacillus has only one valid species, A. pallidus. It is an aerobic, alkalitolerant thermophile with a G+C content ranging from 39 to 41%. The strains used in the reclassification study were found in a hot spring in Mexico, while the original Bacillus pallidus strain was isolated from sewage [70, 87]. A strain of A. pallidus isolated from crude oil contaminated soil produced a novel bioemulsifier with industrial applications [88]. Aeribacillus pallidus has also been shown to remove a variety of industrial dyes from waste-water, which presents an important bioremedial application [89]. This is by far the least well characterized of the three genera presented in this chapter.

Taxonomy in the Genomic Era

Taxonomy seeks to bring order to the seemingly endless multitude of biological diversity. Taxonomists attempt to characterize, classify, and name the organisms that they study in a stepwise manner [90]. While this kind of work may seem esoteric to some, the proper taxonomy of prokaryotes allows researchers worldwide to properly communicate with one another, which has broad clinical, biotechnical, and ecological significance. Early microbiologists relied on phenotypic data, such as morphology and biochemical assays, to classify organisms. Prokaryotic taxonomy underwent a revolution in the 1970s when Woese showed the highly conserved nature of 16S ribosomal RNA across microbial taxa [91]. The widespread adoption of 16S rRNA gene sequencing ushered in what is known as polyphasic taxonomy. Polyphasic taxonomy defines the bacterial species genetically as “a group of strains (including the type strain), having > 70% DDH (DNA-DNA hybridization) similarity, < 5°C ΔTm, < 5% mol G + C difference of total genomic DNA, > 98% 16S rRNA identity” [92]. Beyond genetic identification, polyphasic taxonomy uses morphological (cell shape, size), physiological (spores, motility, staining) and biochemical (peptidoglycan, benzoquinones, naphthoquinones, fatty acids) characteristics to arrive at a consensus of features [90].
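The genetic part of the polyphasic species definition quoted above amounts to four numerical thresholds, which can be written out as a minimal sketch. The function and parameter names are hypothetical; the thresholds are those in the quoted definition, and the G+C helper simply counts G and C among unambiguous bases. Actual species delineation also weighs the phenotypic evidence described above, which is not modeled here.

```python
def gc_percent(sequence: str) -> float:
    """Mol% G+C of a nucleotide sequence, ignoring ambiguous bases."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def meets_genetic_species_criteria(ddh_percent: float,
                                   delta_tm_c: float,
                                   gc_difference_mol_percent: float,
                                   identity_16s_percent: float) -> bool:
    """Apply the four genetic thresholds of the polyphasic definition."""
    return (ddh_percent > 70.0
            and delta_tm_c < 5.0
            and gc_difference_mol_percent < 5.0
            and identity_16s_percent > 98.0)


# Hypothetical measurements for two pairs of strains.
same = meets_genetic_species_criteria(75.0, 2.0, 1.5, 99.1)
different = meets_genetic_species_criteria(65.0, 2.0, 1.5, 99.1)
```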

It is important to note that polyphasic taxonomy relies on a consensus of attributes and that this consensus does not necessarily conform to any biological realities [93]. This means that the strict application of any particular set of guidelines would be outside of its purview, though this has not stopped certain journals from using quantification as law when it comes to publication [93, 94]. Also, the polyphasic approach is unable to deal with unculturable organisms, which cannot be run through the traditional gamut of biochemical tests. The DNA-DNA hybridization test is considered the ‘gold standard’ for the delineation of bacterial species [95]. One of the main problems with this test is that it is only performed by a few specialized labs and can take years to complete [94]. Requiring a DNA-DNA hybridization test for species discrimination has left prokaryotic taxonomy available to only a select few labs that have the money and expertise to run this type of experiment [96]. There is a certain irony in the fact that DNA-DNA hybridization is meant to mimic whole-genome sequence data as closely as possible, yet now that WGS is widely available, researchers are still stuck performing DNA-DNA hybridizations for publication. With the low cost of whole genome sequencing, a paradigm shift in taxonomy is at hand. Polyphasic taxonomy must be reconsidered in the genomic era [92, 94, 96, 97].

For sequence-based taxonomy, several different tests have been developed and compared to more traditional polyphasic methods. Average nucleotide identity (ANI) of whole genomes has been shown to correlate with DNA-DNA hybridization such that a value of 70% DNA-DNA hybridization corresponds to about 95% average nucleotide identity [98]. In silico approaches to DNA-DNA hybridization are also now available through genome-to-genome distance calculation (GGD) [99]. The third and perhaps most widely used ‘genomic-era’ taxonomic tool is multilocus sequence analysis (MLSA). I used MLSA to estimate the phylogenies of 48 isolates predicted to be members of the genera Anoxybacillus, Geobacillus, and Aeribacillus based on a previous analysis of partial 16S rDNA sequences [9, 12].

MLSA is based on multilocus sequence typing (MLST), a method developed in the late 1990s for strain characterization using multiple housekeeping genes [100]. Four to ten loci from a type strain are sequenced and compared to unknowns, and the number of differences is recorded. This type of scheme is mostly used in epidemiological studies, not to calculate phylogeny. In 2005 Gevers introduced the term MLSA, which is distinct from MLST in that it has a taxonomic application and uses genes not under selection, whereas MLST often focuses on virulence genes [97]. Since then a wide variety of studies have shown that MLSA schemes correlate closely with DNA-DNA hybridization studies [101–104].

In this chapter I applied the methodologies of these and other published MLSA schemes to isolates previously and putatively assigned to the genera Anoxybacillus, Geobacillus, and Aeribacillus. For these analyses I used four protein-coding genes and the 16S rRNA gene. Specifically, the housekeeping genes selected were gyrB (DNA gyrase B subunit), groEL (a chaperonin), rpoD (sigma 70 factor of RNA polymerase), and trmE (tRNA modification GTPase, also called thdF or mnmE). By elaborating a better-defined phylogeny for these genera within our library of extremophiles we expect both to refine our ability to estimate the diversity within the library and to identify clades novel to this library relative to the available sequenced genomes.

III.2 Methods

Isolate Selection

To better understand the relatedness among isolates with draft genome sequences, we selected a subset previously identified by phylogenetic analysis of partial 16S rDNA as falling within the genera Anoxybacillus, Aeribacillus, and Geobacillus [9]. The isolates selected for this study, their original operational taxonomic unit (OTU), site name and state, genomic G+C content, and best BLAST hit for genus based on the original partial 16S rDNA are listed in Table 4 [9]. This subset included 25 putative Aeribacillus spp. isolates from OTUs 1, 12, 24, 40, and 43. OTUs 1 and 25 contained isolates assigned to both Aeribacillus and Bacillus, so for comparative purposes all members of these OTUs were included. The subset included 2 putative Anoxybacillus spp. isolates from OTUs 1 and 27. The subset also included 15 putative Geobacillus spp. isolates from OTUs 2, 6, 9, 16, 19, 20, 36, and 44. OTU 2 contained isolates assigned to both Geobacillus and Bacillus, so all members of this OTU were included. Isolates G24C011 and G13D016 were included because they showed closest similarity to the genera Thermoactinomyces and Planifilum, respectively. These genera are more closely related to genus Bacillus than to the actinomycetes [105]. No substrate mycelia have ever been recorded in the isolate library, which led me to believe that isolates G24C011 and G13D016 had been mischaracterized [9, 11, 12].

To place our environmental isolates in context, we included validated species as references. These reference strains were chosen based on the strains used in the analysis of Aeribacillus by Miñana-Galbis et al. We chose Clostridium difficile 630 as the outgroup for this analysis because Clostridia and Bacillus are both Firmicutes, but it was unlikely that any of these isolates, cultured in the presence of some oxygen, would group with Clostridia.

Table 4. Isolates used in multilocus sequence analysis (MLSA), adapted from Cope, 2013 [9]. Columns show the original operational taxonomic unit (OTU), site name, state, genomic G+C content, and genus based on BLAST of the partial 16S rDNA. All data other than G+C content are from Cope, 2013 [9].

Isolate  Original OTU  Site Name              State  G+C (%)  Genus
E07C003  1             Great Salt Plains      OK     39       Aeribacillus
E08C011  25            Great Salt Plains      OK     40       Bacillus
E08C017  16            Great Salt Plains      OK     39       Geobacillus
E08C020  27            Great Salt Plains      OK     42       Anoxybacillus
E08D002  1             Great Salt Plains      OK     39       Aeribacillus
F02C013  19            Brazoria               TX     52       Geobacillus
F05M388  1             Brazoria               TX     39       Aeribacillus
F09D005  1             Brazoria               TX     39       Aeribacillus
G08C001  6             Bitter Lake (Roswell)  NM     44       Geobacillus
G08C006  44            Bitter Lake (Roswell)  NM     39       Geobacillus
G08C008  1             Bitter Lake (Roswell)  NM     37       Anoxybacillus
G09D026  1             Bitter Lake (Roswell)  NM     39       Aeribacillus
G13D008  1             Bitter Lake (Roswell)  NM     52       Aeribacillus
G13D016  39            Bitter Lake (Roswell)  NM     39       Planifilum
G13D038  1             Bitter Lake (Roswell)  NM     39       Aeribacillus
G13D043  1             Bitter Lake (Roswell)  NM     39       Aeribacillus
G19C023  43            Bitter Lake (Roswell)  NM     39       Aeribacillus
G23C002  1             Lazy Lagoon (Roswell)  NM     39       Aeribacillus
G23C019  1             Lazy Lagoon (Roswell)  NM     45       Aeribacillus
G23D015  1             Lazy Lagoon (Roswell)  NM     39       Aeribacillus
G24C011  3             Lazy Lagoon (Roswell)  NM     39       Thermoactinomyces
H01C001  2             San Francisco Bay      CA     56       Geobacillus
H01M105  1             San Francisco Bay      CA     37       Aeribacillus
H20C002  2             San Francisco Bay      CA     37       Bacillus
H20C009  2             San Francisco Bay      CA     39       Geobacillus
H20D004  36            San Francisco Bay      CA     46       Geobacillus
J04M017  12            Big Bend               TX     52       Aeribacillus
J18C011  9             Big Bend               TX     46       Geobacillus
J18C022  9             Big Bend               TX     40       Geobacillus
J18D015  1             Big Bend               TX     39       Aeribacillus
J19C022  9             Big Bend               TX     52       Geobacillus
J20M030  2             Big Bend               TX     40       Geobacillus
K49C015  1             Baker Hot Spring       UT     42       Aeribacillus
K49M014  1             Baker Hot Spring       UT     39       Aeribacillus
M24C029  9             Cape Romain            SC     39       Geobacillus
N09C011  2             Sapelo Island          GA     39       Geobacillus
N09C014  1             Sapelo Island          GA     46       Aeribacillus
P01M009  2             Puerto Rico            PR     39       Bacillus
S44C017  1             Yellowstone            WY     52       Aeribacillus
S44C019  20            Yellowstone            WY     52       Geobacillus
S44C021  2             Yellowstone            WY     39       Geobacillus
S44D013  40            Yellowstone            WY     39       Aeribacillus
S48D018  1             Yellowstone            WY     39       Aeribacillus
T02C003  12            Still Water            NV     39       Aeribacillus
U22C014  12            Owens Lake             CA     39       Aeribacillus
U22D017  25            Owens Lake             CA     39       Aeribacillus
U22M431  1             Owens Lake             CA     39       Aeribacillus

Data Analysis

For all BLAST analyses I used Geneious 9.1.3 with default parameters and NCBI’s RefSeq database, accessed in April 2016 [106]. I selected genes to include in this analysis based on published recommendations for bacterial multiple sequence alignment [100, 107, 108]. My approach included accessing gene sequences via RAST [55, 56, 63], creating alignments using ClustalW within the MEGA7 program [109], and inferring the correct open reading frame using the ExPASy Translate Tool and trimming the alignments in-frame [110]. I generated statistics for each locus using MEGA7 and Geneious 9.1.3 [111]. Geneious counts identical sites only among those columns of an alignment that have at least 2 nucleotides or gaps that are not free end gaps and that do not consist entirely of gaps. Ambiguity characters are not considered identical. Pairwise percent identity is computed by looking at all pairs of bases in a column, scoring one for each identical pair, and dividing by the total number of pairs [112]. To calculate dN/dS I used the Nei and Gojobori method in MEGA7 [113].
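The two alignment statistics defined above can be illustrated with a short Python sketch. This is a minimal illustration of the stated definitions, not the Geneious implementation: the function name alignment_identity is hypothetical, the alignment is assumed to be already trimmed so that free end gaps do not need special handling, and leaving gap-containing pairs unscored is an assumption.

```python
from itertools import combinations

BASES = set("ACGT")  # ambiguity codes are never counted as identical


def alignment_identity(seqs):
    """Return (% identical sites, % pairwise identity) for aligned sequences.

    seqs: equal-length aligned nucleotide strings, with '-' for gaps.
    Assumption: pairs that include a gap are not scored at all.
    """
    identical_sites = scored_sites = 0
    identical_pairs = total_pairs = 0
    for column in zip(*seqs):
        residues = [c for c in column if c != "-"]
        if len(residues) < 2:
            continue  # all-gap or single-residue column: nothing to compare
        scored_sites += 1
        # A site is identical when every row carries the same unambiguous base.
        if (len(residues) == len(column) and len(set(residues)) == 1
                and residues[0] in BASES):
            identical_sites += 1
        for a, b in combinations(residues, 2):
            total_pairs += 1
            if a == b and a in BASES:
                identical_pairs += 1
    return (100.0 * identical_sites / scored_sites,
            100.0 * identical_pairs / total_pairs)


# Toy alignment (hypothetical sequences, not thesis data).
sites, pairs = alignment_identity(["ACGT", "ACGA", "ACTT"])
print(sites, round(pairs, 1))  # 50.0 66.7
```

In the toy alignment two of four columns are fully conserved (50% identical sites), while 8 of the 12 base pairs compared across all columns match (66.7% pairwise identity), which shows why pairwise identity is always at least as high as the proportion of identical sites.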

To perform the Incongruence Length Difference (ILD) test (also called the partition homogeneity test) I used the PAUP* 4 command-line binary for Linux [114]. ILD analysis was performed using the resources and systems administration support of the Texas A&M Institute for Genome Sciences and Society (TIGSS) High Performance Computing Cluster. The maximum number of trees was limited to 1000 to optimize computation time, and 1000 replications were run. P values greater than 0.001 were considered significant, that is, indicative of congruence between the two loci. The outgroup and the reference strain A. pallidus str. GS3372 were not included in this analysis.
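As a rough illustration of how an ILD P value arises from replicate tree lengths, the sketch below compares the summed tree length for the real partition with the summed lengths for random repartitions. It is not the PAUP* code. The function name ild_p_value and the tree lengths are hypothetical, and it assumes the common convention that the observed partition counts as one of the replicates, which would make 0.001 the smallest attainable value with 1000 replicates.

```python
def ild_p_value(observed_sum, random_sums):
    """P value for an ILD test from summed most-parsimonious tree lengths.

    observed_sum: summed tree length for the two real loci.
    random_sums: summed tree lengths for random repartitions of the same sizes.
    Assumption: the observed partition is counted as one replicate.
    """
    # Incongruent loci give an unusually SHORT summed length for the real
    # partition, so count replicates that are at least as short.
    as_short = 1 + sum(1 for s in random_sums if s <= observed_sum)
    return as_short / (len(random_sums) + 1)


# Hypothetical tree lengths: no random repartition is as short as the real one.
print(ild_p_value(100, [120] * 999))               # 0.001
# Hypothetical tree lengths: about half of the repartitions are as short.
print(ild_p_value(100, [90] * 499 + [120] * 500))  # 0.5
```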

The data for the four protein-coding genes were concatenated head-to-tail, in frame, in the order that the genes appear in the B. subtilis genome. I assembled the data and estimated the phylogeny using the concatenation tool and PAUP* 4.0 plugin in Geneious 9.1.3 [115]. For all phylogenies we employed maximum likelihood (ML) with the Jukes-Cantor model assuming equal rates of variation. C. difficile 630 was used as an outgroup to root trees. We used 1,000 replications for all bootstrap analyses.
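Head-to-tail, in-frame concatenation can be sketched as follows. This is an illustrative stand-in for the Geneious concatenation tool, not a description of it. The function name concatenate_alignments and the toy sequences are hypothetical, and the locus order gyrB, groEL, rpoD, trmE is assumed from the order in which the loci are listed for the concatenated alignment.

```python
LOCUS_ORDER = ["gyrB", "groEL", "rpoD", "trmE"]  # assumed concatenation order


def concatenate_alignments(alignments, order=LOCUS_ORDER):
    """Join per-locus alignments head-to-tail for each isolate.

    alignments: {locus: {isolate: aligned_sequence}}
    Only isolates present at every locus are kept.
    """
    shared = set.intersection(*(set(alignments[locus]) for locus in order))
    for locus in order:
        lengths = {len(alignments[locus][iso]) for iso in shared}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to one length")
        if lengths.pop() % 3 != 0:
            # A length that is not a multiple of 3 would shift the reading
            # frame of every locus that follows it.
            raise ValueError(f"{locus}: alignment is not trimmed in frame")
    return {iso: "".join(alignments[locus][iso] for locus in order)
            for iso in sorted(shared)}


# Toy data (hypothetical sequences, not thesis data).
toy = {
    "gyrB": {"iso1": "ATGAAA", "iso2": "ATGAAG", "iso3": "ATGAAC"},
    "groEL": {"iso1": "GGT", "iso2": "GGC"},
    "rpoD": {"iso1": "TTT", "iso2": "TTC"},
    "trmE": {"iso1": "CCC", "iso2": "CCA"},
}
print(concatenate_alignments(toy))
# {'iso1': 'ATGAAAGGTTTTCCC', 'iso2': 'ATGAAGGGCTTCCCA'}
```

Dropping isolates that lack any locus (iso3 in the toy data) keeps every row of the concatenated alignment the same length, which tree-building software generally requires.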

III.3 Results

Sequence Attributes

This study involved querying draft genome sequences for isolates in an extensive isolate library established from carboxylate platform fermentations [9, 12]. The subgroup of isolates examined in this study originally occupied clades associated with Anoxybacillus, Aeribacillus, and Geobacillus based on analyses of partial 16S rDNA [9, 12]. In total we included 47 isolates from 17 original operational taxonomic units (OTUs) from 13 different sites (Table 4). We also chose to query extant genomes to establish points of reference across the phylogeny, selected based on the phylogenetic tree published by D. Miñana-Galbis et al. 2010 (Table 5) [70]. A summary of statistics for each gene locus is displayed in Table 6. The fragments ranged from 700 to 1,577 nucleotides long after alignment and trimming, including gaps. The proportion of identical sites varied among the loci from 31.2% in the 16S rDNA to 55.8% in groEL. The proportion of pairwise identity ranged from 78.3% in trmE to 86.5% in groEL. The dN/dS ratios were all less than 1, indicating that there is selection against amino acid changes. The G+C content of the protein-coding alleles ranged from 42.9% in gyrB to 47% in rpoD, while the 16S rDNA was an outlier at 55.9%. The G+C content of the protein-coding alleles is congruent with that of the isolates used in this study, which have an average genomic G+C content of 41.8%.
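For readers who want to reproduce the G+C figures, a minimal Python sketch is given here. The helper gc_percent is hypothetical and simply counts G and C among unambiguous bases; the list of genomic values is copied from the G+C (%) column of Table 4 in table order.

```python
def gc_percent(sequence):
    """Percent G+C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


# Genomic G+C (%) of the 47 isolates, in the row order of Table 4.
TABLE4_GC = [
    39, 40, 39, 42, 39, 52, 39, 39, 44, 39,
    37, 39, 52, 39, 39, 39, 39, 39, 45, 39,
    39, 56, 37, 37, 39, 46, 52, 46, 40, 39,
    52, 40, 42, 39, 39, 39, 46, 39, 52, 52,
    39, 39, 39, 39, 39, 39, 39,
]

mean_genomic_gc = sum(TABLE4_GC) / len(TABLE4_GC)
print(len(TABLE4_GC), round(mean_genomic_gc, 1))  # 47 41.8
print(gc_percent("ATGC"))                         # 50.0
```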

Single Gene Trees

Figures 4-8 are phylogenetic trees constructed using maximum likelihood estimation with nucleotide sequences for each gene locus. Initial visual inspection showed that while the trees for the four protein-coding loci were similar, the 16S rDNA tree had a very different topology. An incongruence length difference (ILD) test can assess the overall congruence of any two sets of phylogenetic trees and thus determine their suitability for concatenation [114]. Pairwise ILD tests were performed between all pairs of genetic loci (Table 7).

Table 5. Reference strains used in multilocus sequence analysis. C. difficile strain 630 was used as an outgroup.

Reference Strain                            NCBI Accession     G+C%
C. difficile strain 630                     NC_009089.1        28.4
B. halodurans C-125                         NC_002570.2        43.7
A. pallidus str. GS3372                     NZ_JYCD00000000.1  41.0
Anoxybacillus flavithermus WK1              NC_011567.1        41.8
Anoxybacillus gonensis strain G2            NZ_CP012152.1      41.6
Bacillus subtilis subsp. subtilis str. 168  NC_000964.3        43.7
Geobacillus thermodenitrificans NG80-2      NC_009328.1        48.9
Geobacillus kaustophilus HTA426             NC_006510.1        51.9

Table 6. Properties of genetic loci used in multilocus sequence analysis. Length is after trimming and includes gaps. dN/dS calculated in MEGA7 using the Nei-Gojobori model. Statistics do not include outgroup C. difficile strain 630.

Gene   Length  dN/dS  % Pairwise identity  % Identical sites  G+C%
groEL  1577    0.047  86.5                 55.8               43.7
gyrB   1547    0.105  79.4                 37.7               42.7
rpoD   847     0.422  84.1                 34.9               47.0
trmE   1378    0.956  78.3                 34.0               43.9
16S    700     N/A    80.1                 31.2               55.9

Table 7. P values from pairwise ILD tests between 5 genetic loci. Values greater than 0.001 are considered significant and are marked with an asterisk. Values estimated based on 1,000 replicates. Outgroup Clostridium difficile 630 and reference strain Aeribacillus pallidus str. GS3372 were not included in this analysis.

       gyrB      groEL     rpoD      trmE      16S
gyrB   --        0.00100   0.00400*  0.00300*  0.00100
groEL  0.00100   --        0.23900*  0.00400*  0.00100
rpoD   0.00400*  0.23900*  --        0.07000*  0.00100
trmE   0.00300*  0.00400*  0.07000*  --        0.00100
16S    0.00100   0.00100   0.00100   0.00100   --


Figure 4. Maximum-likelihood estimated phylogenetic tree obtained from 55 partial sequences of the 16S rDNA gene. Original operational taxonomic units are indicated by the number after the underscore. The length of the legend bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup. Significant bootstrap values (>70%) are based on 1000 replications.


Figure 5. Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial gyrB sequences. Original operational taxonomic units are indicated by the number after the underscore. The length of the legend bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup. Significant bootstrap values (>70%) are based on 1000 replications.


Figure 6. Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial groEL sequences. Original operational taxonomic units are indicated by the number after the underscore. The length of the legend bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup. Significant bootstrap values (>70%) are based on 1000 replications.


Figure 7. Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial rpoD sequences. Original operational taxonomic units are indicated by the number after the underscore. The length of the legend bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup. Significant bootstrap values (>70%) are based on 1000 replications.


Figure 8. Maximum-likelihood phylogenetic tree estimated for 55 isolates based on partial trmE sequences. Original operational taxonomic units are indicated by the number after the underscore. The length of the legend bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup. Significant bootstrap values (>70%) are based on 1000 replications.

Aeribacillus pallidus displayed a large degree of difference in the rpoD tree, suggesting the presence of a mutation in that particular allele, and was thus left out of this test (Fig 7). While the original authors of the ILD test suggested a significance value of P<0.05, subsequent studies have found this to be too conservative with large or unevenly paired datasets like ours [116–118]. We thus used the more relaxed but still valid threshold of 0.001. All protein-coding loci showed at least two significant pairwise congruence values, while the 16S rDNA showed none (Table 7). Thus subsequent analysis did not include the 16S rDNA data.
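The decision rule can be checked directly against the values in Table 7. The short Python sketch below applies the 0.001 threshold to each pair of loci; the P values are copied from Table 7, while the dictionary layout and the function name congruent_partners are illustrative choices.

```python
# Pairwise ILD P values from Table 7 (each unordered pair listed once).
ILD_P = {
    ("gyrB", "groEL"): 0.001, ("gyrB", "rpoD"): 0.004,
    ("gyrB", "trmE"): 0.003, ("gyrB", "16S"): 0.001,
    ("groEL", "rpoD"): 0.239, ("groEL", "trmE"): 0.004,
    ("groEL", "16S"): 0.001, ("rpoD", "trmE"): 0.070,
    ("rpoD", "16S"): 0.001, ("trmE", "16S"): 0.001,
}
THRESHOLD = 0.001  # pairs with P strictly greater than this are congruent


def congruent_partners(locus):
    """Loci whose pairwise ILD P value with `locus` exceeds the threshold."""
    partners = []
    for pair, p in ILD_P.items():
        if locus in pair and p > THRESHOLD:
            partners.append(pair[0] if pair[1] == locus else pair[1])
    return sorted(partners)


for locus in ["gyrB", "groEL", "rpoD", "trmE", "16S"]:
    print(locus, congruent_partners(locus))
# gyrB ['rpoD', 'trmE']
# groEL ['rpoD', 'trmE']
# rpoD ['groEL', 'gyrB', 'trmE']
# trmE ['groEL', 'gyrB', 'rpoD']
# 16S []
```

Each protein-coding locus has at least two congruent partners and the 16S rDNA has none, matching the statement above. Note that the gyrB and groEL pair sits exactly at 0.001 and so does not pass a strict greater-than test.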

Multi-locus Sequence Analysis

The concatenated alignment of the gyrB, groEL, rpoD, and trmE nucleotide sequences contains 5,347 nucleotides with a mean G+C content of 43.9% and a pairwise identity of 81.6%. The isolates cluster into eight distinct clades that all group with at least 99% bootstrap values (Fig 9).

Clade I contains four isolates, all originally grouped into OTU2, from sites in San Francisco Bay and Puerto Rico. This clade has 97.7% identical sites and 98.7% pairwise identity. These isolates were originally grouped in the genera Bacillus and Geobacillus by partial 16S rDNA sequences [9], and OTU2 was broken up by MLSA into highly divergent clades. The genomic G+C content for Clade I members ranged from 39-56% with an average of 42.75%. None of the nucleotide sequences for members of this clade exhibits closer than 81.6% pairwise identity with Bacillus sp. MS. Clade I was present in all the individual gene trees with 100% bootstrap support (Fig 5-8).

Clade II contains three isolates from original OTUs 25, 9, and 1 from the sites at Great Salt Plains National Park, Big Bend National Park, and Baker Hot Springs, respectively. This clade has 99.2% identical sites and 99.5% pairwise identity. These isolates were originally grouped in genera Aeribacillus, Geobacillus, and Bacillus. Genomic G+C content ranged from 40-46% with an average of 42%. Consensus sequence Megablast analysis of each locus shows strong consensus support that this clade belongs with Bacillus smithii strain DSM 4216 as either a closely related species or subspecies (groEL 99.4% identity, gyrB 99.8%, rpoD 99.8%, trmE 99.3%). All the individual gene trees supported clade II with 100% bootstrap support (Fig 5-8).

Clade III has a genomic G+C content of 40%. Megablast analysis indicates remarkable similarity to Bacillus licheniformis strain HRBL-15TDI7, with loci groEL and gyrB both showing 100% pairwise identity and rpoD and trmE showing 99.9% and 99.8% pairwise identity, respectively. All individual gene trees supported clade III with bootstrap support ranging from 70% in rpoD to 100% in groEL (Fig 5-8).

Clade IV contains a single isolate, E08C020 from OTU27, isolated from Great Salt Plains State Park. It was originally grouped with genus Anoxybacillus. It has a genomic G+C content of 42%. It is grouped in a clade with the reference strain Anoxybacillus flavithermus in all individual gene trees with bootstrap support ranging from 98-100% (Fig 5-8). Megablast analysis groups it with Anoxybacillus flavithermus WK1, with pairwise identities of 97.2% for groEL, 97.3% for gyrB, 96.8% for rpoD, and 95.5% for trmE.


Figure 9. Maximum-likelihood phylogenetic tree estimated by multilocus sequence analysis with concatenated gyrB, groEL, rpoD, and trmE partial sequences. Original operational taxonomic units are indicated by the number after the underscore. The bar indicates 1% estimated sequence divergence. The tree was rooted using Clostridium difficile 630 as an outgroup (root branch not shown to scale). Significant bootstrap values (>70%), expressed as percentages of 1000 replications, are shown at branch points. Roman numerals indicate consensus clades.

Clade V contains 2 isolates from OTUs 6 and 9 taken from Bitter Lake near Roswell, New Mexico and Big Bend National Park. They were both originally grouped into the genus Geobacillus. This clade has 99.6% identical sites and 99.6% pairwise identity. The genomic G+C content ranges from 44-52% with an average of 48%. Consensus megablast analysis shows strong consensus support that this clade belongs with Geobacillus thermoglucosidasius DSM 2542, with groEL and gyrB at 99.7% pairwise identity, rpoD at 99.8%, and trmE at 99.5%. This clade is supported at 100% bootstrap support by the gyrB and groEL individual trees and at 98% by the rpoD and trmE trees (Fig 5-8).

Clade VI contains 5 isolates from 4 different OTUs (2, 19, 20, and 39) isolated from Big Bend National Park, Brazoria, Yellowstone National Park, and Bitter Lake near Roswell, New Mexico. These were originally grouped in the genus Geobacillus except isolate G13D016, which was grouped in genus Planifilum. This clade has 99.2% identical sites and 99.7% pairwise identity. Their genomic G+C content ranges from 39-52% with an average of 44.4%. Consensus sequence Megablast identifies the clade with Geobacillus sp. LC300 (groEL 99.2% pairwise identity, gyrB 99.9%, rpoD 99.9%, and trmE 100%). In the rpoD tree J20M030 is excluded from this clade, but with very low bootstrap support (55%), and the other members are grouped with 74% bootstrap support (Fig 7). In the groEL tree J20M030 is similarly excluded with 100% bootstrap support, while the other members are grouped with 69% support, just below the significance cutoff (Fig 6). The clade is supported at 100% by gyrB and trmE (Fig 5 and 8).

Clade VII contains 2 isolates from OTU12 isolated from Big Bend National Park and Owens Lake. They were both originally grouped in genus Aeribacillus (Table 4). Isolate J04M017 has a genomic G+C content of 52% while isolate U22C014 has a G+C content of 39%. This clade has 100% identical sites and pairwise identity. It is supported with 100% bootstrap support by all individual gene trees except rpoD, which supports it with 99% (Fig 5-8). Megablast analysis shows no consensus as to which extant taxa are closest to this clade. groEL shows 81% pairwise identity with Bacillus thuringiensis strain CTC, gyrB shows 75.6% with Bacillus methanolicus MGA3, rpoD shows 80.5% with Bacillus smithii strain DSM 4216 (which we have already shown to associate much more closely with clade II), and megablast with trmE from this clade returns no sequences over 48% similar to the query.

Clade VIII is by far the largest clade in the tree and includes isolates from all three media. It is split into subgroups a and b rather than separate clades because there was not well-supported agreement among the individual gene trees for this additional separation. However, when the concatenated alignment of clade VIII was run through an additional 1000 bootstrap replications using clade VII as an outgroup, the additional split between clade VIIIa and clade VIIIb was well supported with 91% and 82% bootstrap support, respectively (Fig 9). Clade VIIIa contains 7 isolates from OTUs 1, 2, 16, and 43, isolated from Great Salt Plains; Bitter Lake near Roswell, New Mexico; Lazy Lagoon, New Mexico; and Sapelo Island. Their genomic G+C content ranges from 37-39% with an average of 41.4%. This clade has 99% identical sites and 99.6% pairwise identity. Clade VIIIb contains 22 isolates from OTUs 1, 3, 9, 12, 25, 36, 40, and 44 isolated from Brazoria NWR, Bitter Lake, Lazy Lagoon, San Francisco Bay, Big Bend NP, Baker Hot Spring, Sapelo Island Microbial Observatory, Yellowstone NP, Owens Lake, Cape Romain, and Still Water. Their genomic G+C content ranges from 37-52% with an average of 41.4%. It has 99% identical sites and 99.8% pairwise identity. Consensus megablast analysis does not show strong support for any existing taxon across the protein-coding genes examined. groEL showed 85.2% pairwise identity to Bacillus smithii strain DSM 4216 (already shown to be closely aligned to clade II), gyrB showed 76.4% pairwise identity to Bacillus sp. X1, rpoD showed 79% pairwise identity to Anoxybacillus gonensis strain G2 (which was one of our reference strains), and trmE did not have full query cover on any of the matches.

Summary of Clades

Clade I has no close sequence similarities to extant genomes in GenBank. Clade II is closely related to Bacillus smithii strain DSM 4216. Clade III is very closely related to Bacillus licheniformis strain HRBL-15TDI7. Clade IV is closely related to Anoxybacillus flavithermus WK1. Clade V is closely related to Geobacillus thermoglucosidasius DSM 2542. Clade VI is closely related to Geobacillus sp. LC300. Clade VII shows no close sequence similarities. Clade VIII shows no close sequence similarities. See Appendix C for summaries of each clade's G+C content and isolates.

III.4 Discussion

At the beginning of this analysis, we hypothesized that there were three genera in this subset: Anoxybacillus, Aeribacillus, and Geobacillus. Isolates from genera Bacillus, Thermoactinomyces, and Planifilum were also used, although I hypothesized that the Thermoactinomyces and Planifilum isolates were mischaracterized. These came from 17 operational taxonomic units in the original partial 16S rDNA analysis (Haynes (2014), Cope (2013)). Together, however, these isolates resolved into 9 clades, three of which had very low sequence similarity to known genomes. This shows both that the original analysis yielding OTUs was splitting units below the genus level, and that culturing from fermentations inoculated with naturally occurring communities from extreme environments yielded microbes quite distinct from any sequenced to date.

This multilocus sequence analysis showed that the four protein-coding loci chosen for this study are under stabilizing selection, create congruent phylogenetic trees, and when concatenated show increased resolution (Tables 6-7 and Fig 5-9). The 16S rDNA, however, created a phylogenetic tree that was not congruent with the protein-coding loci. Even using a relaxed congruence threshold of P values greater than 0.001 and excluding outliers, the 16S rDNA was not congruent with any of the protein-coding loci and thus could not be concatenated into the combined tree (Table 7) [117]. The 16S rDNA also showed a very different average G+C content than the protein-coding genes (Table 6). It must be noted that because we used draft sequences in this analysis we were unable to obtain the full 16S rDNA gene for many of our isolates. The traditional incongruence length difference (ILD) test uses a P value cutoff of 0.05; however, this has been found to be extremely conservative. Studies have found that combining data with P values above 0.001 does not reduce phylogenetic accuracy [117]. Several published MLSA studies have used this or even more relaxed P values [119–121].

A potential reason that the 16S rDNA tree is so incongruent is that 16S rDNA is not a good marker for family Bacillaceae. While this is clearly against the current polyphasic taxonomic paradigm, it is not a new phenomenon. In 1992 Fox et al. published a paper in which they showed that Bacillus globisporus and Bacillus psychrophilus were phenotypically and genotypically distinct but showed 99.8% 16S rDNA sequence similarity [122]. They suggested that recently diverged species might not be recognizable by 16S rDNA characterization. In type strains of genus Anoxybacillus the 16S rRNA gene has sequence similarities of 94-99% [123]. This kind of similarity does not allow for adequate discrimination of closely related species [123]. Similarly high 16S rRNA gene similarity has been found in genus Geobacillus. In a recent study 16 Geobacillus isolates were compared using housekeeping genes and the 16S rRNA gene [124]. The authors concluded that the similarity of the 16S rRNA gene was 93-100% in these isolates even though other tests showed all 16 isolates to be genetically distinct [124]. Despite the use of the 16S rRNA gene for modern bacterial taxonomy, it is clear that it is not optimal for all situations, especially when working with family Bacillaceae. So, while we were unable to use the entirety of the 16S rDNA sequence in this analysis, I believe that it is appropriate to hypothesize that the 16S rDNA would not have been able to add additional resolution to our multilocus sequence analysis.

A question inherent to any multilocus sequence analysis is whether the concatenation of the genes strengthened the phylogeny beyond what was seen in the individual trees. Several studies have shown that incorporating non-congruent data has little to no effect on the resulting concatenated tree topology as long as multiple genes are congruent [104, 125]. Since I allowed only congruent genes into the final concatenated tree, we see an increase in bootstrap support for all of our clades between the individual gene trees and the concatenated tree (Fig 5-9). The concatenated tree has 99-100% bootstrap support for all clades (Fig 9). This shows that concatenating the protein-coding genes allowed for greater phylogenetic resolution.

Genomic G+C content is often used as a feature to help delineate genera [126]. Bacillus has genomic G+C contents ranging from 33-60%, but as I have noted elsewhere, new information often surfaces that removes isolates from genus Bacillus and creates new, more specialized genera [13]. Genus Geobacillus has genomic G+C contents ranging from 48-58% and Anoxybacillus has G+C contents ranging from 42-57% [70]. We can infer that a range of about 12% is standard for these newer and genetically confirmed genera. The clades created by the MLSA show genomic G+C contents congruent with an approximately 12% genomic G+C content range (Appendix C). Some of the clades, such as VIIIa, show even closer grouping, with a genomic G+C content range of only 37-39% (Appendix C). These findings further strengthen our evidence for the MLSA clades being valid.
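The arithmetic behind the "about 12%" figure can be laid out explicitly. In the sketch below the genus ranges are the ones cited in this paragraph and the clade ranges are those reported in the Results; the variable names are illustrative, and no judgment about what width counts as acceptable is built in.

```python
# Genomic G+C ranges (%) for the two reference genera, as cited above.
GENUS_GC_RANGES = {"Geobacillus": (48, 58), "Anoxybacillus": (42, 57)}

# Genomic G+C ranges (%) for multi-isolate clades, as reported in the Results.
CLADE_GC_RANGES = {
    "I": (39, 56), "II": (40, 46), "V": (44, 52),
    "VI": (39, 52), "VIIIa": (37, 39), "VIIIb": (37, 52),
}


def widths(ranges):
    """Width (max minus min, in percentage points) of each G+C range."""
    return {name: high - low for name, (low, high) in ranges.items()}


genus_widths = widths(GENUS_GC_RANGES)
mean_genus_width = sum(genus_widths.values()) / len(genus_widths)
clade_widths = widths(CLADE_GC_RANGES)

print(genus_widths)       # {'Geobacillus': 10, 'Anoxybacillus': 15}
print(mean_genus_width)   # 12.5
print(clade_widths)
# {'I': 17, 'II': 6, 'V': 8, 'VI': 13, 'VIIIa': 2, 'VIIIb': 15}
```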

One of the most noteworthy findings of the MLSA is that four of our clades show no close sequence similarity to characterized strains, which could mean that our collection has potentially uncovered four previously unknown genera. While this may seem surprising, it is important to remember that the majority (79-89%) of 16S rDNA sequences found in metagenomic studies are not associated with known genera [127]. There are several parts of our experimental method that selected for diverse and potentially underrepresented isolates. Most large-scale metagenomic studies of soils use PCR-amplified 16S rDNA sequences for characterization, which, apart from Bacillus-specific issues, can show incomplete or inaccurate pictures of microbial diversity [127]. Most metagenomic studies never go on to isolate and characterize individual strains. Another compounding factor is that almost all of the strains included in the MLSA are obligate thermophiles. The carboxylate platform screen provided an initial bottleneck that was only strengthened by the three selection media used for strain isolation and the high (55°C) isolation temperature [10]. This, coupled with my model for picking the isolates to be sequenced, which had a strong bias towards both genotypic and phenotypic diversity, provided a clear proof of concept: the approaches worked to identify novel clades (Appendix C).

CHAPTER IV

SELECT COMPARATIVE ANALYSIS OF THE 64 HIGH-QUALITY DRAFT GENOME

SEQUENCES OF EXTREMOPHILES

IV.1 Introduction

Fermentation of lignocellulosic biomass into fuel and other useful chemicals is an extremely active area of research. The isolate library was born of an effort to optimize microbial inoculum and fermentation conditions by using microbial communities from extreme environments [10]. Some consortia from these extreme environments showed an almost four-fold improvement in yield compared to standard inoculum in a carboxylate biofuel platform [9]. Phenotypic tests were performed on isolates from a range of the best performing fermentations by Haynes (Table 2) [12]. In this work I have sequenced, assembled, and annotated 64 of these isolates. To further elucidate the richness of the isolate library, I have performed initial comparative genomic screens using annotation data from our sequencing efforts. I have focused on traits that are of particular relevance to extremophiles isolated from fermentation and/or traits that are rare.

Carbohydrate utilization

Microbes that can utilize a broad range of substrates are of high value to biofuel platforms. This is because biofuel reactors contain a spectrum of sugars that must be metabolized for economic biomass conversion [128]. In second-generation biofuel platforms, the hydrolysis of lignocellulose results in a broad range of carbohydrates including glucose, galactose, mannose, xylose, and arabinose [1, 129]. Bioconversion of pentose sugars (especially xylose) into ethanol is of particular interest, as it is a major bottleneck for ethanol production [1]. In this study I used the Integrated Microbial Genomes (IMG) Network Subtree Structure for carbohydrate utilization. This structure includes networks for the utilization of disaccharides, hexarates, hexitols, hexonates, hexoses, hexuronates, ketoaldonates, pentoses, and polysaccharides. In this chapter I use the IMG carbohydrate utilization Network Structure to illustrate the diversity of carbohydrate utilization genes among our isolates.

Opu Family

Adapting to rapid changes in the environment is key to the survival of microorganisms in a given habitat. In soil environments, like those from which we isolated our bacteria, changes in the amount of water lead to frequent variation in osmotic conditions. This is made even more extreme by the fact that our isolates were sampled from areas with high levels of salinity (potassium, calcium, magnesium, sodium) such as salt lakes and hot springs [9]. Under high osmolarity stress B. subtilis begins to accumulate osmolytes to maintain turgor and allow for optimal growth [130]. First, B. subtilis rapidly accumulates K+ ions. This quick response allows the cell to return to normal function, but high levels of K+ can affect concentration gradients and thus hurt the cell. The ions are thus replaced with compatible solutes, organic osmolytes that do not harm the cell. In B. subtilis proline is the most predominant of these molecules, but it must be synthesized and this process takes time [131]. Glycine betaine, which is found in nature and can also be synthesized from choline, is another effective compatible solute [130]. Compatible solutes also function as protein stabilizers via the preferential exclusion model [132]. Glycine betaine has been implicated in both thermoprotection and cold-shock protection via this mechanism in Bacillus subtilis [133, 134].

The KEGG (Kyoto Encyclopedia of Genes and Genomes) Module database is a collection of curated functional units that allow for the annotation of operons or tightly clustered genes [135]. The osmoprotectant transport system module (M00209) has three units, defined as KO:K05845 osmoprotectant transport system substrate-binding protein (opuC), KO:K05846 osmoprotectant transport system permease protein (opuBD), and KO:K05847 osmoprotectant transport system ATP-binding protein (opuA). Together these units allow the transport of environmentally scavenged glycine betaine into the cell. This operon has been experimentally validated in Bacillus subtilis [136]. I used the KEGG orthology database within IMG to look for these genes within our draft genomes.

Furfural Degradation

To efficiently ferment lignocellulosic biomass, pretreatment to break down lignin and cellulose is essential. This process creates inhibitory compounds such as furanic aldehydes, weak acids, and phenolic compounds [137]. Formation of these inhibitory compounds is not easy to prevent economically, so several approaches for overcoming inhibition have been proposed. One of these approaches is to select or engineer microbes capable of tolerating or degrading inhibitory compounds [137].

Furaldehydes (furfural and 5-hydroxymethylfurfural) have been shown to cause microbes to enter an extended lag phase by inhibiting protein synthesis, and to induce DNA mutations [138]. In this chapter I use the MetaCyc database’s experimentally validated furfural degradation pathway to screen our draft genomes for the ability to degrade the inhibitory compound furfural.

Group II Intron Interrupting recA

In the multilocus sequence analysis (MLSA) in Chapter III, I attempted to use the gene recA as a locus for comparison. I used the Rapid Annotation using Subsystems Technology (RAST) Server to download the nucleotide sequences of the regions of the draft genomes annotated as recA [56, 63, 64]. I then aligned these nucleotide sequences using ClustalW in MEGA7 [109]. This alignment showed that in four isolates (F02C013, G13D016, J20M022, and J20M030) the sequence was ~400 base pairs (bp) shorter than expected. Because of this discrepancy, recA was not used in the MLSA. After the completion of the MLSA, I returned to these aberrant sequences using IMG’s neighborhood sequence viewer. In isolate F02C013 the viewer showed another open reading frame (ORF) ~2600 bp upstream of the truncated recA sequence. In IMG, unlike RAST, this ORF was also annotated as recA. Between the two recA fragments, a group II intron reverse transcriptase/maturase was annotated. I found this pattern, a recA sequence interrupted by a group II intron, in the other three aberrant isolates. Group II introns are usually reported in intergenic regions or in genes of unknown function, not in housekeeping genes. There is only one published instance of a group II intron interrupting recA, in G. kaustophilus, an obligate thermophile isolated from the bottom of the Mariana Trench [139]. In this chapter I will present the group II introns in the four isolates and compare them to those of known isolates.

IV.2 Methods

Carbohydrate utilization

I used the Integrated Microbial Genomes (IMG) Network Subtree Structure for carbohydrate utilization to look for pathways that were complete in our draft genomes. Any pathway for which at least one genome contained every enzyme required for that pathway was included in this study. A heatmap was created in Microsoft Excel. IMG Pathway Object Identifiers (OIDs) and the full details of these carbohydrate utilization pathways are located in Table 8. The clades indicated in Table 9 were taken from the multilocus sequence analysis performed in Chapter III.
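The inclusion rule described above (a pathway is kept if at least one genome contains every enzyme it requires) can be sketched in a few lines of Python. This is only an illustrative sketch: the isolate names, enzyme labels, pathway names, and gene counts below are hypothetical placeholders, not data exported from IMG, and the selection in this study was made from IMG output rather than with this code.

```python
# Hypothetical gene counts: genome -> enzyme -> number of annotated genes.
# All names and numbers are invented for illustration only.
gene_counts = {
    "isolate_A": {"enzyme_1": 1, "enzyme_2": 2},
    "isolate_B": {"enzyme_1": 0, "enzyme_2": 1, "enzyme_3": 1},
}

# Hypothetical pathway definitions: pathway -> enzymes it requires.
pathways = {
    "pathway_X": ["enzyme_1", "enzyme_2"],
    "pathway_Y": ["enzyme_3", "enzyme_4"],
}


def is_complete(counts, enzymes):
    """True if the genome has at least one gene for every required enzyme."""
    return all(counts.get(enzyme, 0) >= 1 for enzyme in enzymes)


def included_pathways(gene_counts, pathways):
    """Pathways that are complete in at least one genome, sorted by name."""
    return sorted(
        name
        for name, enzymes in pathways.items()
        if any(is_complete(counts, enzymes) for counts in gene_counts.values())
    )


print(included_pathways(gene_counts, pathways))
```

With these placeholder data only pathway_X is retained, because isolate_A carries both of its enzymes while no genome carries both enzymes of pathway_Y.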

Table 8. Identifiers associated with the IMG carbohydrate utilization network for pathways expressed in the draft genomes. IMG Pathway Object Identifiers (OIDs) and the IMG pathway details associated with the pathways.

IMG Pathway OID   IMG Pathway Details
IPWAY:528         L-arabinose conversion to D-xylulose 5-phosphate - bacterial pathway
IPWAY:531         D-xylose conversion to D-xylulose 5-phosphate
IPWAY:539         D-fructose conversion to D-fructose 1,6-bisphosphate via D-fructose 1-phosphate
IPWAY:541         Trehalose conversion to glucose and glucose 6-phosphate
IPWAY:543         Maltose conversion to glucose and glucose 6-phosphate
IPWAY:545         Maltose conversion to glucose and glucose 1-phosphate
IPWAY:546         Sucrose conversion to fructose and glucose 6-phosphate
IPWAY:547         Sucrose hydrolysis
IPWAY:550         D-sorbitol conversion to fructose 6-phosphate
IPWAY:551         D-mannitol conversion to fructose 6-phosphate
IPWAY:552         Cellobiose 6-phosphate conversion to glucose and glucose 6-phosphate
IPWAY:554         Cellobiose hydrolysis
IPWAY:555         Chitobiose conversion to N-acetylglucosamine and N-acetylglucosamine 6-phosphate
IPWAY:557         Chitobiose hydrolysis
IPWAY:567         Cellulose degradation to cellobiose
IPWAY:568         Cellulose degradation to glucose
IPWAY:570         Galactarate conversion to 5-dehydro-4-deoxy-D-glucarate
IPWAY:571         D-glucuronate conversion to D-fructuronate
IPWAY:573         D-galacturonate conversion to pyruvate and glyceraldehyde 3-phosphate
IPWAY:601         D-glucarate conversion to 5-dehydro-4-deoxy-D-glucarate
IPWAY:604         L-rhamnose conversion to glycerone phosphate and lactaldehyde
IPWAY:653         D-mannonate conversion to pyruvate and glyceraldehyde 3-phosphate
IPWAY:654         D-altronate conversion to pyruvate and glyceraldehyde 3-phosphate

Table 9. Gene counts of IMG Pathways for carbohydrate utilization. Column numbers indicate IMG Pathway OIDs. Data are number of enzymes associated with IMG Carbohydrate Utilization present in draft genomes. Pathways included were those with differential results across these isolates. Full pathway names and descriptions are in Table 8.

Furfural Degradation

MetaCyc is a database that contains experimentally validated metabolic pathways [140]. MetaCyc pathway PWY-6997 is furfural degradation and contains two enzymes: 2-furoyl-CoA dehydrogenase and 2-furoate-CoA ligase. I used the Integrated Microbial Genomes (IMG) system’s function profile tool to compare our 64 draft genomes to these two enzymes. The function profile tool generates a sortable table that shows the number of genes coding for each selected function across the selected genomes. I used this table to determine which genomes contained at least one copy of each enzyme required for furfural degradation. To determine the relative abundance of the furfural degradation pathway in the Joint Genome Institute’s database, I performed the same steps using all finished, permanent draft, and draft genomes in any domain that were publicly available, excluding the 64 genomes used in this study. As of June 2016 this represented 44,925 genomes.

Opu Family

The Kyoto Encyclopedia of Genes and Genomes (KEGG) Module database collects functional units such as operon structures [135]. Within a module is a set of KEGG Ortholog (KO) identifiers that show the combination of enzymes, and potential alternatives, required to complete the module. I used IMG’s function search to find KO:K05845 osmoprotectant transport system substrate-binding protein (opuC), KO:K05846 osmoprotectant transport system permease protein (opuBD), and KO:K05847 osmoprotectant transport system ATP-binding protein (opuA) within our draft genomes. I then compared these results to the salinity and temperature data gathered by Cope 2013 [9]. Heatmaps were created using Microsoft Excel.
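As a minimal sketch of how the KO search results can be summarized, the Python below classifies each genome's Opu module as complete, partial, or absent from per-KO copy numbers. The three KO identifiers are those named above; the genome names and copy numbers are hypothetical, and the three-way classification is an illustrative convention rather than an IMG feature.

```python
# KO identifiers for the osmoprotectant transport system, as listed above:
# K05845 (opuC), K05846 (opuBD), K05847 (opuA).
OPU_MODULE = ("K05845", "K05846", "K05847")

# Hypothetical KO copy numbers per genome (placeholder data for illustration).
ko_counts = {
    "genome_1": {"K05845": 1, "K05846": 2, "K05847": 1},
    "genome_2": {"K05845": 1, "K05846": 0, "K05847": 1},
    "genome_3": {"K05846": 1},
    "genome_4": {},
}


def module_status(counts, module=OPU_MODULE):
    """Classify a genome by how many of the module's KOs it carries."""
    present = [ko for ko in module if counts.get(ko, 0) > 0]
    if len(present) == len(module):
        return "complete"
    return "partial" if present else "absent"


status = {genome: module_status(counts) for genome, counts in ko_counts.items()}
for genome, state in status.items():
    print(genome, state)
```

Separating "partial" from "absent" mirrors the distinction between genomes with at least one Opu gene and genomes with every gene in the module.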

Group II Intron Interrupting recA

During research for Chapter III, I discovered that the recA sequence of isolates F02C013, G13D016, J20M022, and J20M030 was interrupted by a group II intron. Sequence data for the fragmented recA gene and the group II intron reverse transcriptase/maturase were obtained from the Integrated Microbial Genomes (IMG) system’s gene search tool. I then used the IMG Genome BLAST tool to run blastp searches for homologs of the group II intron reverse transcriptase/maturase against all finished genomes in any domain that were publicly available. I repeated this blastp search for each isolate, saving any proteins that were >30% similar to our group II intron reverse transcriptase/maturase to an IMG Analysis Cart. I then used the IMG Gene Cart Neighborhoods tool to display all annotated proteins in the ‘neighborhood’ of the group II intron reverse transcriptase/maturases that were similar to those of our isolates. This allowed me to quickly determine which gene had been interrupted by the group II intron and to locate other isolates in which recA was interrupted. These isolates are G. kaustophilus HTA426, Geobacillus sp. WCH70, and G. thermoleovorans CCB_US3_UF5, and they were added to the study. G. kaustophilus HTA426 is the same strain used in the paper describing recA interrupted by a group II intron [139]. I downloaded the protein sequences of the group II intron reverse transcriptase/maturase for our isolates, the isolates found to have sequence similarity, and the reference strains used in Chee and Takami, and aligned them with Clustal Omega in Geneious 9.1.4 [111, 139, 141]. I then generated a maximum parsimony tree using the PAUP* plugin for Geneious 9.1.4 [115]. I edited this tree using FigTree 1.4.2 [142].
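The hit-filtering step above (keeping proteins more than 30% similar to the query) was done interactively in IMG. As a hedged illustration of the same idea, the sketch below filters a tab-separated hit table whose third column is a percent value. The column layout (query, subject, percent, alignment length) and all hit names are assumptions made for this example; IMG's actual export format may differ, and percent identity is used here as a stand-in for the similarity measure.

```python
import csv
import io

# Hypothetical tab-separated hits: query, subject, percent, alignment length.
# The layout and values are assumptions for illustration, not IMG output.
BLAST_TABLE = (
    "query_RT\thit_1\t62.5\t580\n"
    "query_RT\thit_2\t28.0\t410\n"
    "query_RT\thit_3\t30.0\t595\n"
    "query_RT\thit_4\t45.1\t602\n"
)


def hits_above(table_text, threshold=30.0):
    """Return (subject, percent) pairs whose percent is strictly above threshold."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    kept = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        subject, percent = row[1], float(row[2])
        if percent > threshold:
            kept.append((subject, percent))
    return kept


print(hits_above(BLAST_TABLE))
```

The comparison is strict, so a hit at exactly 30.0% is dropped, matching the ">30%" criterion stated above.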

IV.3 Results

Carbohydrate utilization

There is a clear correlation between the clades created in the multilocus sequence analysis and both the presence and copy number of genes associated with carbohydrate utilization (Table 9). For example, isolates in Clade I all have genes associated with Maltose conversion to glucose and glucose 1-phosphate (IPWAY:545), Sucrose conversion to fructose and glucose 6-phosphate (IPWAY:546), and Sucrose hydrolysis (IPWAY:547), while isolates from Clade II universally lack these genes. This illustrates both the strength of the MLSA model and the differences in carbohydrate utilization pathways between putative genera. Across all our isolates D-fructose conversion to D-fructose 1,6-bisphosphate via D-fructose 1-phosphate (IPWAY:539) is extremely well conserved and is missing from only two isolates (J04M017 and U22C014), both from Clade VII. Genes for L-arabinose conversion to D-xylulose 5-phosphate (IPWAY:528) and L-rhamnose conversion to glycerone phosphate and lactaldehyde (IPWAY:604) are both well conserved across isolates but are missing in Clade VI. Genes for D-mannitol conversion to fructose 6-phosphate (IPWAY:551) are well conserved across isolates but are missing from Clade I. Genes associated with D-xylose conversion to D-xylulose 5-phosphate (IPWAY:531), which is of particular industrial relevance, were found in 31 isolates (Table 9). Two isolates, J18C022 and J11M004, had carbohydrate utilization profiles that were both identical and noteworthy. These isolates had four carbohydrate utilization pathways that were found only in these isolates: Chitobiose conversion to N-acetylglucosamine and N-acetylglucosamine 6-phosphate (IPWAY:555), Chitobiose hydrolysis (IPWAY:557), Cellulose degradation to cellobiose (IPWAY:567), and Cellulose degradation to glucose (IPWAY:568). These two isolates also had high gene copy numbers in a variety of pathways (Table 9).

Furfural Degradation

MetaCyc pathway PWY-6997, furfural degradation, was elucidated in Cupriavidus basilensis and shown in the same study in Pseudomonas putida [137]. This pathway contains two enzymes, 2-furoyl-CoA dehydrogenase and 2-furoate-CoA ligase. Of the 64 sequenced isolates in this study, 29 contain at least one copy of each of these enzymes (Table 10). While not all of these isolates were included in the multilocus sequence analysis in Chapter III, there is a correlation between MLSA clade VIIIb and putative furfural degradation (Table 10). Of the 24 isolates in MLSA clade VIIIb, 17 are putative furfural degraders. The enzymes involved in furfural degradation are not common within the Joint Genome Institute’s (JGI) database. Of the 44,925 publicly available genomes (37,651 of which are bacterial), only 422 have 2-furoyl-CoA dehydrogenase and 322 have 2-furoate-CoA ligase. The number of genomes that have both of these enzymes is even smaller, just 276, all of which are bacterial. The 3:1 ratio of 2-furoyl-CoA dehydrogenase to 2-furoate-CoA ligase observed in all but one (S44C019) of our isolates is conserved among the majority of publicly available genomes: 222 of the 276 (80%) publicly available genomes with the full furfural degradation pathway show this 3:1 ratio. The furfural degraders in the Genomes Online Database (GOLD) span 5 phyla: Actinobacteria (24 isolates), Bacteroidetes (2 isolates), Firmicutes (26 isolates), Proteobacteria (223 isolates), and Tectomicrobia (1 isolate). This study has thus more than doubled the number of furfural degraders in phylum Firmicutes in the JGI database. This is particularly exciting because of the 37,652 bacterial genomes in GOLD as of June 2016, 11,313 (30%) are in phylum Firmicutes.

Table 10. Draft genomes that contain enzymes associated with a furfural degradation pathway. Draft genomes that contain the two enzymes associated with MetaCyc:PWY-6997 furfural degradation. Table shows gene copy number of the enzymes associated with this pathway and the MLSA clade these isolates were assigned to if they were analyzed.

Isolate   2-furoyl-CoA dehydrogenase   2-furoate-CoA ligase   MLSA Clade
J18C011   3                            1                      II
K49C015   3                            1                      II
G08C001   3                            1                      V
J19C022   3                            1                      V
S44C019   3                            2                      VI
E08D002   3                            1                      VIII
F05M388   3                            1                      VIII
F09D005   3                            1                      VIII
G08C006   3                            1                      VIII
G09D026   3                            1                      VIII
G23C019   3                            1                      VIII
G24C011   3                            1                      VIII
H01M105   3                            1                      VIII
H20D004   3                            1                      VIII
J18D015   3                            1                      VIII
N09C014   3                            1                      VIII
S44C017   3                            1                      VIII
S44D013   3                            1                      VIII
S48D018   3                            1                      VIII
T02C003   3                            1                      VIII
U22D017   3                            1                      VIII
U22M431   3                            1                      VIII
F09M437   3                            1                      n/a
G13D029   3                            1                      n/a
J11M011   3                            1                      n/a
J11M287   3                            1                      n/a
J18C025   3                            1                      n/a
R08M008   3                            1                      n/a
T02C029   3                            1                      n/a
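The proportions reported for the furfural degradation results can be rechecked with simple arithmetic. The sketch below uses only counts stated in this chapter. It assumes, as the text implies, that the 29 putative degraders from this study are Firmicutes and are not among the 276 publicly available genomes.

```python
# Publicly available genomes with both furfural degradation enzymes, by phylum
# (counts as reported in the text).
phyla = {
    "Actinobacteria": 24,
    "Bacteroidetes": 2,
    "Firmicutes": 26,
    "Proteobacteria": 223,
    "Tectomicrobia": 1,
}
study_degraders = 29  # putative furfural degraders sequenced in this study

total_public = sum(phyla.values())
ratio_share = 100 * 222 / total_public           # genomes showing the 3:1 ratio
firmicutes_after = phyla["Firmicutes"] + study_degraders
firmicutes_share = 100 * 11313 / 37652           # Firmicutes among bacterial genomes
study_share = 100 * study_degraders / (total_public + study_degraders)

print(total_public)                                # phylum counts sum to 276
print(round(ratio_share))                          # about 80 percent
print(firmicutes_after > 2 * phyla["Firmicutes"])  # more than doubled
print(round(firmicutes_share))                     # about 30 percent
print(round(study_share, 1))                       # share contributed by this study
```

Under these assumptions the phylum counts sum to the reported 276, and the study's 29 isolates make up roughly a tenth of all putative furfural degraders once they are added to the public set.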

Opu Family

The Opu family of genes is associated with the transport of compatible solutes into the cell during osmotic stress. The operon that transports glycine betaine consists of a substrate-binding protein (opuC), a permease protein (opuBD), and an ATP-binding protein (opuA). This operon is complete in 28 of the 64 genomes sequenced in this work (Table 11). Of the 28 isolates with the complete Opu operon, 22 (79%) carry between 2 and 5 copies of the permease gene opuBD.

Group II Intron Interrupting recA

The housekeeping gene recA codes for a protein that has numerous functions, all associated with DNA repair. In four of our isolates (F02C013, G13D016, J20M022, and J20M030), recA is interrupted by a group II intron approximately one-third of the way through its nucleotide sequence (Figure 10A). The group II intron ORF proceeds in the same direction as the ORF of recA. These group II introns have similar insertion sites, between three and four hundred base pairs (bp) into the recA sequence, and are all ~2600 bp long (Figure 10B). Within each group II intron is an open reading frame that codes for a reverse transcriptase/maturase that is ~600 amino acids long (Figure 10B). I compared the reverse transcriptase/maturase protein sequences from F02C013, G13D016, J20M022, and J20M030 to those of other bacteria that had a group II intron interrupting recA, as well as to group II intron reference sequences [139]. The alignment of these group II intron reverse transcriptase/maturases is shown in Figure 11. I then created a maximum parsimony tree of the amino acid sequences of the group II intron reverse transcriptase/maturases (Figure 12). G. kaustophilus, G. thermoleovorans, isolate J20M030, and Geobacillus sp. WCH70 group together in one clade, isolates G13D016, F02C013, and J20M022 group together in another clade, and B. cereus and L. lactis are each on their own branches (Figure 12).

Table 11. Draft genomes with osmoprotectant genes in the Opu family. Table shows 48 draft genomes that contain at least one gene in the Opu family. Original media for isolation are indicated (D - Drake’s thermophilic acetogen medium (DTAM), C - Cellulose Agar for Thermophiles (CAT), M - Modified Growth Medium (MGM)). Copy number of genes shown in columns. Isolates that have at least one copy of each gene in the Opu family are shown in bold. Various ions from the soil data from Cope 2013 are shown in heatmap form with units in mg kg-1 and range between blue (low values) and red (high values).

Table 12. Reference sequences with group II introns. Reference sequences containing group II introns used for sequence comparisons with isolates in this study. GOLD Project Accessions and GenBank Accessions for these strains are indicated.

Reference Sequence               GOLD Project Accession   GenBank Accession
G. kaustophilus HTA426           Gp0000519                NC_006510.1
G. sp. WCH70                     Gp0000952                NC_012790.1
G. thermoleovorans CCB_US3_UF5   Gp0013443                NC_016593.1
B. cereus strain 03BB87          Gp0117404                CP009941.1
L. lactis subsp. lactis Il1403   Gp0000731                NC_002662.1

Figure 10. Schematic of group II intron interrupting recA and table showing actual sizes of recA fragments and group II intron. A) Schematic of a group II intron interrupting the recA gene and breaking it into two fragments. Adapted from Chee and Takami 2005. B) Table with isolates from this study and other isolates found to have a group II intron interrupting the recA gene. Shows lengths of the first fragment (recA-A) in base pairs, the group II intron in base pairs, the group II intron reverse transcriptase/maturase open reading frame (ORF) in amino acids (aa), and the second fragment of recA (recA-B).


Figure 11. Alignment of the open reading frame for group II introns in the recA gene. Data are group II intron sequences present in 4 draft genome sequences and sequences for 5 reference species. Alignment view generated in Geneious 9.1.4. Numbers indicate aligned amino acid count. Colors are standard Clustal scheme.


Figure 12. Maximum parsimony tree of amino acid sequence ORF of group II introns. Maximum parsimony phylogeny created using PAUP plugin for Geneious 9.1.4. Lactococcus lactis used as an outgroup to root tree. Numbers indicate the number of substitutions per site.

IV.4 Discussion

Carbohydrate Utilization

A broad substrate range is an extremely desirable trait for fermentation processes. This is because the breakdown of lignocellulose yields a variety of sugars, all of which should be transformed into ethanol for an optimal fermentation [143]. In this chapter I demonstrated a variety of carbohydrate utilization pathways present in our sequenced isolates and showed that these pathways were strongly correlated with the clades proposed in Chapter III. Xylose and arabinose pentose utilization pathways were present in at least half our isolates. Pentose utilization is of particular relevance because the preferred microorganism in many fermentation processes, Saccharomyces cerevisiae, is unable to ferment pentose sugars. Two isolates stood out as particularly noteworthy, J18C022 and J11M004, because they contained high gene copy numbers and pathways that were not present in any other isolates. The genomes of these two isolates suggest that they can break down cellulose without first requiring a pre-processing step [143].

Furfural Degradation

The ability to tolerate or degrade furfural is of high value for use in second-generation biofuel platforms. Furfural degradation does not appear to be a common pathway in nature. While the JGI’s database is not as extensive as GenBank, it cannot be called small, and isolates from our study now represent ~10% of putative furfural degraders in the JGI database. Putative furfural-degrading isolates are not correlated with an isolation medium or a sampling site, but there is a strong correlation with Clade VIIIb from the multilocus sequence analysis in Chapter III. This clade was determined to be a potentially novel genus. I hypothesize that furfural degraders exist naturally among thermophilic soil bacteria in low numbers and that the selective pressure created by the carboxylate fermentation platform allowed these bacteria to proliferate.

Opu Family

The Opu operon allows for the rapid transport of the compatible solute glycine betaine into the cell for osmoprotection. Of the 64 isolates sequenced, 28 contain a complete module for this operon. There is no clear correlation between the isolates that contain the complete operon and the chemistry of the soil from which the bacteria were isolated. There were, however, high levels of salts within the carboxylate fermentation platform. This could have established selection pressure and given bacteria that contained the Opu operon a greater chance of success and proliferation.

Hyperthermophiles are typically defined as those thriving above 60°C. Of the 64 isolates sequenced, 6 (R08M008, S44C017, S44C019, S44C021, S44D013, S44D018) came from such thermal environments. All but two (S44C017 and S44D018) of these isolates contained the full Opu operon. R08 samples were taken from Sulfur Springs, NM and S44 samples were taken from Yellowstone, WY. Glycine betaine has been shown to act as a thermoprotectant in E. coli and B. subtilis [133, 144]. While this is a small sample size, it is possible that the compatible solute glycine betaine functions as a thermoprotectant in our isolates.

Group II Intron Interrupting recA

Group II introns in prokaryotes are typically associated with mobile elements or plasmids, or are found in intergenic regions or in proteins without functional homology. A group II intron interrupting a housekeeping gene is a rarity. There have only been two published instances of a group II intron interrupting an essential gene. The first reported case was in genus Azotobacter, where a group II intron is inserted into the stop codon of the chaperonin gene groEL, though it is important to note that this insertion did not change the amino acid sequence [145]. The other reported instance is the one I used in this chapter, where a group II intron interrupted recA. The reason for the prevalence of this group II intron in recA is unclear. A study on the group II intron inserted into groEL in Azotobacter vinelandii discovered that the intron had optimal catalytic activity at 65°C [146]. No group II intron had been shown to retain activity above 60°C before that study. This suggests an unusual level of thermal tolerance in this group II intron. The authors suggest that the intron may have a role in post-transcriptional regulation that has yet to be explored [146]. Though this is speculative, the fact that our data come from thermophiles may strengthen this line of reasoning.

In Chee and Takami 2005, a tree similar to Figure 12 allowed the authors to hypothesize that the group II intron from Geobacillus kaustophilus represented a new family of group II introns [139]. With data that were unavailable to those authors, the group II introns from Geobacillus thermoleovorans and Geobacillus sp. WCH70, as well as our isolate J20M030, can be added to the protein family proposed by Chee and Takami [139]. The group II introns from isolates G13D016, F02C013, and J20M022 clustered closely together phylogenetically and are potentially a novel protein family (Figure 12).

With 64 high quality draft genomes there is tremendous potential to study these data in depth for many years to come. By analyzing carbohydrate utilization pathways I was able to show the diversity of substrate utilization among our isolates as well as the strength of my phylogenetic analysis. The presence of the furfural degradation pathway showed both the presence of an industrially relevant trait as well as its rarity among other sequenced genomes. The Opu family’s operon in many of our extremophilic isolates correlates well to known phenotypic data about osmoregulation and thermotolerance. A group II intron interrupting the recA gene in four isolates is a rare occurrence and presents interesting questions as to a potential functional role of group II introns. In this chapter I merely scratched the surface of the potential of the IMG analysis system to compare genes and functions of our isolates across the breadth of known prokaryotes.

CHAPTER V

THESIS CONCLUSIONS

In Chapter II I used existing phenotypic and genotypic data to choose a diverse subset of the microbes collected from extreme environments and isolated from a carboxylate biofuel platform for whole genome sequencing. I extracted genomic DNA from these isolates and their genomes were sequenced using an Illumina MiSeq at the Texas A&M Genomics and Bioinformatics Service. I used two different de novo assembly programs to assemble the paired-end reads into scaffolds and compared these assemblies using the Quality Assessment Tool for Genome Assemblies (QUAST). The A5-MiSeq Assembler pipeline was superior to SOAPdenovo2 in the majority of cases. I annotated the 64 assembled draft genomes and submitted them to publicly available databases for dissemination.

In Chapter III I took a subset of the 64 draft genomes that were known to be closely related based on their 16S rDNA and compared them using multilocus sequence analysis (MLSA). Using a subset of 47 of the isolates sequenced, I compared sequence data for the housekeeping genes gyrB, groEL, rpoD, and trmE as well as the 16S rRNA gene. An incongruence length difference (ILD) test showed that the 16S rRNA gene was incongruent with the protein-coding genes. A concatenated maximum likelihood tree of the four protein coding loci was created which revealed 8 distinct genera. Three of these genera had no sequence similarity to any known genera.

In Chapter IV I used the Integrated Microbial Genomes (IMG) annotation browser to analyze the draft genomes of our 64 isolates. Our isolates have genes for a broad range of carbohydrate utilization pathways and these utilization profiles have strong correlations to the clades established in the multilocus sequence analysis from Chapter III. Many of our isolates have genes associated with the utilization pathways for pentoses and cellobiose, both of which are very attractive to second-generation biofuel platforms. Two isolates, J18C022 and J11M004, are of particular interest because they showed the widest range of carbohydrate utilization pathways and have genes associated with the utilization of cellulose and chitin. Furfural is an inhibitory compound that is created during pretreatment of lignocellulosic biomass at high temperatures. Of the isolates sequenced, 29 contain the enzymes necessary to degrade furfural. This degradation pathway was determined to be relatively rare in genomes sequenced to date. The Opu operon, which contains genes associated with the transport of the compatible solute glycine betaine into the cell, was complete in 28 of our isolates. This operon has been associated with osmoprotection, thermoprotection, and cold-shock protection. Finally, a group II intron was found to interrupt the recA gene in four isolates. Group II introns rarely disrupt housekeeping genes, so this was considered remarkable. Phylogenetic analysis showed that in three of the four isolates the group II intron reverse transcriptase/maturase grouped into its own potentially novel protein family.

This thesis describes the generation of a genomic resource for a library of extremophilic bacteria. Further, the studies in which I used these genome data demonstrate the potential utility of both the library and the genome data. Based on what I have learned from this work there are some important future directions for this project. The creation of high-quality draft genomes is the first step towards genome finishing. Future work could use new technology like optical mapping to create finished genomes to add to the mere 4,762 (9.6%) that are of ‘finished’ quality [147, 148]. Characterizing at least some of the isolates found in Clades I, VII, and VII from Chapter III could lead to the description of novel genera, which would add to our body of taxonomic knowledge. Because we sequenced a characteristic subset of our total isolate library, knowledge gained from these isolates will allow us to make informed choices when screening non-sequenced isolates. Using knowledge gained from the Integrated Microbial Genomes (IMG) analysis system we can predict a great number of functions in silico before testing them in vivo. One study of particular interest would be to test in vivo substrate utilization of our isolates and compare those data to the predictions in Chapter IV. Another study of interest would be to test isolates found to be putative furfural degraders (Chapter IV). The submission of these data to the Joint Genome Institute’s (JGI) Genomes Online Database (GOLD) also allows us to publish these high-quality drafts in Standards in Genomic Sciences. This would help disseminate these genomes to researchers around the world. The creation of high-quality draft genomes presents numerous opportunities for further study and opens many doors for future collaborations.

REFERENCES

1. Arora R, Behera S, Kumar S: Bioprospecting thermophilic/thermotolerant microbes for production of lignocellulosic ethanol: A future perspective. Renew Sustain Energy Rev 2015, 51:699–717.

2. Liszka MJ, Clark ME, Schneider E, Clark DS: Nature versus nurture: developing enzymes that function under extreme conditions. Annu Rev Chem Biomol Eng 2012, 3:77–102.

3. Podar M, Reysenbach A-L: New opportunities revealed by biotechnological explorations of extremophiles. Curr Opin Biotechnol 2006, 17:250–5.

4. Brock TD, Freeze H: Thermus aquaticus gen. n. and sp. n., a nonsporulating extreme thermophile. J Bacteriol 1969, 98:289–97.

5. van den Burg B: Extremophiles as a source for novel enzymes. Curr Opin Microbiol 2003, 6:213–218.

6. Urbieta MS, Donati ER, Chan K-G, Shahar S, Sin LL, Goh KM: Thermophiles in the genomic era: Biodiversity, science, and applications. Biotechnol Adv 2015, 33(6 Pt 1):633–47.

7. Fischer CR, Klein-Marcuschamer D, Stephanopoulos G: Selection and optimization of microbial hosts for biofuels production. Metab Eng 2008, 10:295–304.

8. Holtzapple MT, Davison RR, Ross MK, Albrett-Lee S, Nagwani M, Lee C-M, Lee C, Adelson S, Kaar W, Gaskin D, Shirage H, Chang N-S, Chang VS, Loescher ME: Biomass Conversion to Mixed Alcohol Fuels Using the MixAlco Process. Appl Biochem Biotechnol 1999, 79:609–632.

9. Cope JL: Evaluation of Microbial Communities from Extreme Environments as Inocula in a Carboxylate Platform for Biofuel Production from Cellulosic Biomass. 2013.

10. Cope JL, Hammett AJM, Kolomiets EA, Forrest AK, Golub KW, Hollister EB, DeWitt TJ, Gentry TJ, Holtzapple MT, Wilkinson HH: Evaluating the performance of carboxylate platform fermentations across diverse inocula originating as sediments from extreme environments. Bioresour Technol 2014, 155:388–94.

11. Hammett AJM: Assessing the Potential of Natural Microbial Communities to Improve a Second-Generation Biofuels Platform. 2011.

12. Haynes AR: Characterization of Extremophilic Bacteria for Potential in the Biofuel and Bioprocess Industries. 2015.

13. Revised Road Map to the Phylum Firmicutes [http://www.bergeys.org/outlines/bergeys_vol_3_outline.pdf]

14. Marchandin H, Teyssier C, Campos J, Jean-Pierre H, Roger F, Gay B, Carlier J-P, Jumas-Bilak E: Negativicoccus succinicivorans gen. nov., sp. nov., isolated from human clinical samples, emended description of the family Veillonellaceae and description of Negativicutes classis nov., Selenomonadales ord. nov. and Acidaminococcaceae fam. nov. in the ba. Int J Syst Evol Microbiol 2010, 60(Pt 6):1271–9.

15. Yutin N, Galperin MY: A genomic update on clostridial phylogeny: Gram-negative spore formers and other misplaced clostridia. Environ Microbiol 2013, 15:2631–41.

16. Sokolova T, Hanel J, Onyenwoke RU, Reysenbach A-L, Banta A, Geyer R, González JM, Whitman WB, Wiegel J: Novel chemolithotrophic, thermophilic, anaerobic bacteria Thermolithobacter ferrireducens gen. nov., sp. nov. and Thermolithobacter carboxydivorans sp. nov. Extremophiles 2007, 11:145–157.

17. Euzéby JP, Tindall BJ: Nomenclatural type of orders: corrections necessary according to Rules 15 and 21a of the Bacteriological Code (1990 Revision), and designation of appropriate nomenclatural types of classes and subclasses.

Request for an opinion. Int J Syst Evol Microbiol 2001, 51:725–727.

18. Schallmey M, Singh A, Ward OP: Developments in the use of Bacillus species for industrial production. Can J Microbiol 2004, 50:1–17.

19. Kunst F, Ogasawara N, Moszer I, Albertini AM, Alloni G, Azevedo V, Bertero MG,

Bessières P, Bolotin A, Borchert S, Borriss R, Boursier L, Brans A, Braun M, Brignell

SC, Bron S, Brouillet S, Bruschi C V, Caldwell B, Capuano V, Carter NM, Choi SK,

Cordani JJ, Connerton IF, Cummings NJ, Daniel RA, Denziot F, Devine KM, Düsterhöft

A, Ehrlich SD, et al.: The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 1997, 390:249–56.

20. Brumm PJ, Land ML, Mead DA: Complete genome sequence of Geobacillus thermoglucosidasius C56-YS93, a novel biomass degrader isolated from obsidian hot spring in Yellowstone National Park. Stand Genomic Sci 2015,

10:73.

21. Rastogi G, Muppidi GL, Gurram RN, Adhikari A, Bischoff KM, Hughes SR, Apel WA, Bang SS, Dixon DJ, Sani RK: Isolation and characterization of cellulose-degrading bacteria from the deep subsurface of the Homestake gold mine, Lead, South Dakota, USA. J Ind Microbiol Biotechnol 2009, 36:585–98.

22. Lin PP, Rabe KS, Takasumi JL, Kadisch M, Arnold FH, Liao JC: Isobutanol production at elevated temperatures in thermophilic Geobacillus thermoglucosidasius. Metab Eng 2014, 24:1–8.

23. Hughes Martiny JB, Field D: Ecological perspectives on the sequenced genome collection. Ecol Lett 2005, 8:1334–1345.

24. Stephanopoulos G: Challenges in engineering microbes for biofuels production. Science 2007, 315:801–4.

25. Lin L, Xu J: Dissecting and engineering metabolic and regulatory networks of thermophilic bacteria for biofuel production. Biotechnol Adv 2013, 31:827–

37.

26. Reddy TBK, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula

J, Pagani I, Lobos EA, Kyrpides NC: The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 2015, 43(Database issue):D1099–106.

27. JGI GOLD | Statistics [https://gold.jgi.doe.gov/statistics]

28. DOE Mission Areas - DOE Joint Genome Institute [http://jgi.doe.gov/our-science/doe-mission-areas/]

29. Edwards DJ, Holt KE: Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data. Microb Inform Exp 2013, 3:2.

30. Baker M: De novo genome assembly: what every biologist should know. Nat

Methods 2012, 9:333–337.

31. MiSeq Gene & Small Genome Sequencer

[http://www.illumina.com/systems/miseq.html]

32. Loman NJ, Constantinidou C, Chan JZM, Halachev M, Sergeant M, Penn CW, Robinson ER, Pallen MJ: High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol 2012, 10:599–606.

33. Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009, 6(11 Suppl):S6–S12.

34. Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Ciufo S, Li W: Prokaryotic

Genome Annotation Pipeline. 2013.

35. Richardson EJ, Watson M: The automatic annotation of bacterial genomes.

Brief Bioinform 2013, 14:1–12.

36. Foster MW, Sharp RR: Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nat Rev Genet 2007, 8:633–9.

37. Noor MAF, Zimmerman KJ, Teeter KC: Data sharing: how much doesn’t get submitted to GenBank? PLoS Biol 2006, 4:e228.

38. Field D, Sansone S-A, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar

P, Kolker E, Maxon M, Millard S, Mugabushaka A-M, Perrin N, Remacle JE, Remington

K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Megascience. ’Omics data sharing. Science 2009, 326:234–6.

39. International Nucleotide Sequence Database Collaboration | INSDC

[http://www.insdc.org/about]

40. Karsch-Mizrachi I, Nakamura Y, Cochrane G, International Nucleotide Sequence

Database Collaboration: The International Nucleotide Sequence Database

Collaboration. Nucleic Acids Res 2012, 40(Database issue):D33–7.

41. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2013, 41(Database issue):D36–42.

42. Genomic Standards Consortium (GSC) [http://gensc.org/]

43. Field D, Amaral-Zettler L, Cochrane G, Cole JR, Dawyndt P, Garrity GM, Gilbert J,

Glöckner FO, Hirschman L, Karsch-Mizrachi I, Klenk H-P, Knight R, Kottmann R,

Kyrpides N, Meyer F, San Gil I, Sansone S-A, Schriml LM, Sterk P, Tatusova T, Ussery

DW, White O, Wooley J: The Genomic Standards Consortium. PLoS Biol 2011,

9:e1001088.

44. Rachadech W, Navacharoen A, Ruangsit W, Pongtharangkul T, Vangnai AS: An organic solvent-, detergent-, and thermostable alkaline protease from the mesophilic, organic solvent-tolerant Bacillus licheniformis 3C5. Microbiology

2010, 79:620–629.

45. Li S, He B, Bai Z, Ouyang P: A novel organic solvent-stable alkaline protease from organic solvent–tolerant Bacillus licheniformis YP1A. J Mol Catal B Enzym

2009, 56:85–88.

46. Bertani G: Studies on lysogenesis. I. The mode of phage liberation by lysogenic Escherichia coli. J Bacteriol 1951, 62:293–300.

47. Growth Curves USA, exclusive US distributor of Bioscreen C Automated Microbiology Growth Curves Analysis System [http://www.growthcurvesusa.com/]

48. Bouillaut L, McBride SM, Sorg JA: Genetic manipulation of Clostridium difficile. Curr Protoc Microbiol 2011, Chapter 9:Unit 9A.2.

49. Texas A&M Genomics & Bioinformatics Service [http://www.txgen.tamu.edu/]

50. Agencourt AMPure XP - Beckman Coulter, Inc. [https://www.beckmancoulter.com/wsrportal/wsrportal/wsr/research-and-discovery/products-and-services/nucleic-acid-sample-preparation/agencourt-ampure-xp-pcr-purification/index.htm?i=A63880#2/10//0/25/1/0/asc/2/A63880///0/1//0/%2Fwsrportal%2Fwsr%2Fresearch-and-discovery%2Fproducts-and-services%2Fnucleic-acid-sample-preparation%2Fagencourt-ampure-xp-pcr-purification%2Findex.htm/]

51. TruSeq DNA LT Sample Prep Kit Support [http://support.illumina.com/sequencing/sequencing_kits/truseq_dna_lt_sample_prep_kit.html.html]

52. Coil D, Jospin G, Darling AE: A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics 2015, 31:587–9.

53. Tritt A, Eisen JA, Facciotti MT, Darling AE: An integrated pipeline for de novo assembly of microbial genomes. PLoS One 2012, 7:e42304.

54. Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: quality assessment tool for genome assemblies. Bioinformatics 2013, 29:1072–5.

55. Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, Edwards RA, Gerdes S,

Parrello B, Shukla M, Vonstein V, Wattam AR, Xia F, Stevens R: The SEED and the

Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

Nucleic Acids Res 2014, 42(Database issue):D206–14.

56. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Shukla M, Thomason JA, Stevens R, Vonstein V, Wattam AR, Xia F: RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 2015, 5:8365.

57. Markowitz VM, Mavromatis K, Ivanova NN, Chen I-MA, Chu K, Kyrpides NC: IMG

ER: a system for microbial genome annotation expert review and curation.

Bioinformatics 2009, 25:2271–8.

58. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz

MC, Delcher AL, Roberts M, Marçais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 2012, 22:557–67.

59. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S,

Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou W-C, Corbeil J, Del Fabbro C,

Docking T, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs

RA, Gnerre S, Godzaridis É, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2013, 2:10.

60. Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino

DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung W-K, Ning Z, Haimel M, Simpson JT,

Fonseca NA, Birol İ, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G,

Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, et al.:

Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res 2011, 21:2224–41.

61. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 2012, 1:18.

62. Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M: Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012, 2012:251364.

63. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes

S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil

LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O,

Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9:75.

64. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes

S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil

LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O,

Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9:75.

65. Tatusov RL, Galperin MY, Natale DA, Koonin E V: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res

2000, 28:33–6.

66. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin E V, Krylov

DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov A V,

Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41.

67. Bateman A: The Pfam Protein Families Database. Nucleic Acids Res 2002, 30:276–280.

68. Pikuta E, Lysenko A, Chuvilskaya N, Mendrock U, Hippe H, Suzina N, Nikitin D, Osipov G, Laurinavichius K: Anoxybacillus pushchinensis gen. nov., sp. nov., a novel anaerobic, alkaliphilic, moderately thermophilic bacterium from manure, and description of Anoxybacillus flavithermus comb. nov. Int J Syst Evol Microbiol 2000, 50(Pt 6):2109–17.

69. Nazina TN, Tourova TP, Poltaraus AB, Novikova E V, Grigoryan AA, Ivanova AE,

Lysenko AM, Petrunyaka V V, Osipov GA, Belyaev SS, Ivanov M V: Taxonomic study of aerobic thermophilic bacilli: descriptions of Geobacillus subterraneus gen. nov., sp. nov. and Geobacillus uzenensis sp. nov. from petroleum reservoirs and transfer of Bacillus stearothermophilus, Bacillus thermocatenulatus,

Bacillus th. Int J Syst Evol Microbiol 2001, 51(Pt 2):433–46.

70. Miñana-Galbis D, Pinzón DL, Lorén JG, Manresa A, Oliart-Ros RM:

Reclassification of Geobacillus pallidus (Scholz et al. 1988) Banat et al. 2004 as

Aeribacillus pallidus gen. nov., comb. nov. Int J Syst Evol Microbiol 2010, 60(Pt

7):1600–4.

71. Parte AC: LPSN--list of prokaryotic names with standing in nomenclature.

Nucleic Acids Res 2014, 42(Database issue):D613–6.

72. Pikuta E, Cleland D, Tang J: Aerobic growth of Anoxybacillus pushchinoensis

K1(T): emended descriptions of A. pushchinoensis and the genus

Anoxybacillus. Int J Syst Evol Microbiol 2003, 53(Pt 5):1561–2.

73. Goh KM, Kahar UM, Chai YY, Chong CS, Chai KP, Ranjani V, Illias R, Chan K-G: Recent discoveries and applications of Anoxybacillus. Appl Microbiol Biotechnol 2013, 97:1475–88.

74. Kambourova M, Mandeva R, Fiume I, Maurelli L, Rossi M, Morana A: Hydrolysis of xylan at high temperature by co-action of the xylanase from Anoxybacillus flavithermus BC and the beta-xylosidase/alpha-arabinosidase from Sulfolobus solfataricus Oalpha. J Appl Microbiol 2007, 102:1586–93.

75. Derekova A, Mandeva R, Kambourova M: Phylogenetic diversity of thermophilic carbohydrate degrading bacilli from Bulgarian hot springs.

World J Microbiol Biotechnol 2008, 24:1697–1702.

76. Duran C, Bulut VN, Gundogdu A, Soylak M, Belduz AO, Beris FS: Biosorption of

Heavy Metals by Anoxybacillus gonensis Immobilized on Diaion HP-2MG. Sep

Sci Technol 2009.

77. Kritee K, Blum JD, Barkay T: Mercury Stable Isotope Fractionation during

Reduction of Hg(II) by Different Microbial Pathways. Environ Sci Technol 2008,

42:9171–9177.

78. Deive FJ, Domínguez A, Barrio T, Moscoso F, Morán P, Longo MA, Sanromán MA:

Decolorization of dye Reactive Black 5 by newly isolated thermophilic microorganisms from geothermal sites in Galicia (Spain). J Hazard Mater 2010,

182:735–42.

79. Zeigler DR: The Geobacillus paradox: why is a thermophilic bacterial genus so prevalent on a mesophilic planet? Microbiology 2014, 160(Pt 1):1–11.

80. Derekova A, Sjøholm C, Mandeva R, Michailova L, Kambourova M: Biosynthesis of a thermostable gellan lyase by newly isolated and characterized strain of Geobacillus stearothermophilus 98. Extremophiles 2006, 10:321–6.

81. Kimura H, Asada R, Masta A, Naganuma T: Distribution of Microorganisms in the Subsurface of the Manus Basin Hydrothermal Vent Field in Papua New

Guinea. Appl Environ Microbiol 2003, 69:644–648.

82. Struchtemeyer CG, Davis JP, Elshahed MS: Influence of the drilling mud formulation process on the bacterial communities in thermogenic natural gas wells of the Barnett Shale. Appl Environ Microbiol 2011, 77:4744–53.

83. Marchant R, Banat IM, Rahman TJ, Berzano M: The frequency and characteristics of highly thermophilic bacteria in cool soil environments.

Environ Microbiol 2002, 4:595–602.

84. Marchant R, Franzetti A, Pavlostathis SG, Tas DO, Erdbrugger I, Unyayar A,

Mazmanci MA, Banat IM: Thermophilic bacteria in cool temperate soils: are they metabolically active or continually added by global atmospheric transport?

Appl Microbiol Biotechnol 2008, 78:841–52.

85. Brodie EL, DeSantis TZ, Parker JPM, Zubietta IX, Piceno YM, Andersen GL: Urban aerosols harbor diverse and dynamic bacterial populations. Proc Natl Acad Sci

U S A 2007, 104:299–304.

86. Bowers RM, McCubbin IB, Hallar AG, Fierer N: Seasonal variability in airborne bacterial communities at a high-elevation site. Atmos Environ 2012, 50:41–49.

87. Scholz T, Demharter W, Hensel R, Kandler O: Bacillus pallidus sp. nov., a new thermophilic species from sewage. Syst Appl Microbiol 1987, 9:91–96.

88. Zheng C, Li Z, Su J, Zhang R, Liu C, Zhao M: Characterization and emulsifying property of a novel bioemulsifier by Aeribacillus pallidus YM-1. J Appl Microbiol 2012, 113:44–51.

89. Adiguzel A, Nadaroglu H, Bozoglu C, Güllüce M, Taslimi P: Removal of Some

Textile Dyes from Aqueous Solution by Using A Catalase-Peroxidase from

Aeribacillus pallidus (P26). J Pure Appl Microbiol 2014, 7.

90. Tindall BJ, Rosselló-Móra R, Busse H-J, Ludwig W, Kämpfer P: Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol

Microbiol 2010, 60(Pt 1):249–66.

91. Woese CR, Fox GE, Zablen L, Uchida T, Bonen L, Pechman K, Lewis BJ, Stahl D: Conservation of primary structure in 16S ribosomal RNA. Nature 1975, 254:83–86.

92. Thompson CC, Chimetto L, Edwards RA, Swings J, Stackebrandt E, Thompson FL:

Microbial genomic taxonomy. BMC Genomics 2013, 14:913.

93. Gillis M, Vandamme P, De Vos P, Swings J, Kersters K: Polyphasic Taxonomy. In Bergey’s Manual of Systematics of Archaea and Bacteria. Chichester, UK: John Wiley & Sons, Ltd; 2015:1–10.

94. Thompson CC, Amaral GR, Campeão M, Edwards RA, Polz MF, Dutilh BE, Ussery

DW, Sawabe T, Swings J, Thompson FL: Microbial taxonomy in the post-genomic era: rebuilding from scratch? Arch Microbiol 2015, 197:359–70.

95. Stackebrandt E, Frederiksen W, Garrity GM, Grimont PAD, Kämpfer P, Maiden

MCJ, Nesme X, Rosselló-Mora R, Swings J, Trüper HG, Vauterin L, Ward AC, Whitman

WB: Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 2002, 52(Pt 3):1043–7.

96. Vandamme P, Peeters C: Time to revisit polyphasic taxonomy. Antonie Van Leeuwenhoek 2014, 106:57–65.

97. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, Stackebrandt E,

Van de Peer Y, Vandamme P, Thompson FL, Swings J: Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol 2005, 3:733–9.

98. Klappenbach JA, Goris J, Vandamme P, Coenye T, Konstantinidis KT, Tiedje JM:

DNA–DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 2007, 57:81–91.

99. Meier-Kolthoff JP, Auch AF, Klenk H-P, Göker M: Genome sequence-based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 2013, 14:60.

100. Glaeser SP, Kämpfer P: Multilocus sequence analysis (MLSA) in prokaryotic taxonomy. Syst Appl Microbiol 2015, 38:237–45.

101. Brady C, Cleenwerck I, Venter S, Vancanneyt M, Swings J, Coutinho T:

Phylogeny and identification of Pantoea species associated with plants, humans and the natural environment based on multilocus sequence analysis

(MLSA). Syst Appl Microbiol 2008, 31:447–460.

102. Rong X, Huang Y: Taxonomic evaluation of the Streptomyces hygroscopicus clade using multilocus sequence analysis and DNA-DNA hybridization, validating the MLSA scheme for systematics of the whole genus. Syst Appl Microbiol 2012, 35:7–18.

103. Guo Y, Zheng W, Rong X, Huang Y: A multilocus phylogeny of the

Streptomyces griseus 16S rRNA gene clade: use of multilocus sequence analysis for streptomycete systematics. Int J Syst Evol Microbiol 2008, 58(Pt

1):149–59.

104. Martens M, Dawyndt P, Coopman R, Gillis M, De Vos P, Willems A: Advantages of multilocus sequence analysis for taxonomic studies: a case study using 10 housekeeping genes in the genus Ensifer (including former Sinorhizobium).

Int J Syst Evol Microbiol 2008, 58(Pt 1):200–14.

105. Yoon JH, Park YH: Phylogenetic analysis of the genus Thermoactinomyces based on 16S rDNA sequences. Int J Syst Evol Microbiol 2000, 50 Pt 3:1081–6.

106. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35(Database issue):D61–5.

107. Zeigler DR: Gene sequences useful for predicting relatedness of whole genomes in bacteria. Int J Syst Evol Microbiol 2003, 53(Pt 6):1893–900.

108. De Clerck E, Vanhoutte T, Hebb T, Geerinck J, Devos J, De Vos P: Isolation, characterization, and identification of bacterial contaminants in semifinal gelatin extracts. Appl Environ Microbiol 2004, 70:3664–72.

109. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S: MEGA6: Molecular

Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 2013, 30:2725–9.

110. Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, Duvaud S, Flegel V, Fortier A, Gasteiger E, Grosdidier A, Hernandez C, Ioannidis V, Kuznetsov D, Liechti R, Moretti S, Mostaguir K, Redaschi N, Rossier G, Xenarios I, Stockinger H: ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res 2012, 40(Web Server issue):W597–603.

111. Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S,

Cooper A, Markowitz S, Duran C, Thierer T, Ashton B, Meintjes P, Drummond A:

Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 2012, 28:1647–9.

112. Geneious 9.1 Manual [http://assets.geneious.com/manual/9.1/index.html]

113. Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986,

3:418–26.

114. Farris JS, Kallersjo M, Kluge AG, Bult C: Testing significance of incongruence.

Cladistics 1995:315–319.

115. Swofford D: PAUP*: Phylogenetic Analysis Using Parsimony (*and Other

Methods) 4. 1999.

116. Leigh JW, Susko E, Baumgartner M, Roger AJ: Testing congruence in phylogenomic analysis. Syst Biol 2008, 57:104–15.

117. Cunningham CW: Can three incongruence tests predict when data should be combined? Mol Biol Evol 1997, 14:733–740.

118. Darlu P, Lecointre G: When Does the Incongruence Length Difference Test

Fail? Mol Biol Evol 2002, 19:432–437.

119. Ngamskulrungroj P, Gilgado F, Faganello J, Litvintseva AP, Leal AL, Tsui KM, Mitchell TG, Vainstein MH, Meyer W: Genetic diversity of the Cryptococcus species complex suggests that Cryptococcus gattii deserves to have varieties. PLoS One 2009, 4:e5862.

120. Besansky NJ, Krzywinski J, Lehmann T, Simard F, Kern M, Mukabayire O,

Fontenille D, Touré Y, Sagnon N: Semipermeable species boundaries between

Anopheles gambiae and Anopheles arabiensis: evidence from multilocus DNA sequence variation. Proc Natl Acad Sci U S A 2003, 100:10818–23.

121. Martens M, Delaere M, Coopman R, De Vos P, Gillis M, Willems A: Multilocus sequence analysis of Ensifer and related taxa. Int J Syst Evol Microbiol 2007,

57(Pt 3):489–503.

122. Fox GE, Wisotzkey JD, Jurtshuk P: How close is close: 16S rRNA sequence identity may not be sufficient to guarantee species identity. Int J Syst Bacteriol

1992, 42:166–70.

123. Inan K, Bektas Y, Canakci S, Belduz AO: Use of rpoB sequences and rep-PCR for phylogenetic study of Anoxybacillus species. J Microbiol 2011, 49:782–90.

124. Tourova TP, Korshunova A V., Mikhailova EM, Sokolova DS, Poltaraus AB,

Nazina TN: Application of gyrB and parE sequence similarity analyses for differentiation of species within the genus Geobacillus. Microbiology 2010,

79:356–369.

125. Wertz JE, Goldstone C, Gordon DM, Riley MA: A molecular phylogeny of enteric bacteria and implications for a bacterial species concept. J Evol Biol

2003, 16:1236–1248.

126. Muto A, Osawa S: The guanine and cytosine content of genomic DNA and bacterial evolution (biased mutation pressure/codon usage/neutral theory). Proc Natl Acad Sci U S A 1987, 84:166–169.

127. Janssen PH: Identifying the Dominant Soil Bacterial Taxa in Libraries of

16S rRNA and 16S rRNA Genes. Appl Environ Microbiol 2006, 72:1719–1728.

128. Vanfossen AL, Verhaart MRA, Kengen SMW, Kelly RM: Carbohydrate utilization patterns for the extremely thermophilic bacterium

Caldicellulosiruptor saccharolyticus reveal broad growth substrate preferences. Appl Environ Microbiol 2009, 75:7718–24.

129. Kim J-H, Shoemaker SP, Mills DA: Relaxed control of sugar utilization in

Lactobacillus brevis. Microbiology 2009, 155(Pt 4):1351–9.

130. Kappes RM, Kempf B, Bremer E: Three transport systems for the osmoprotectant glycine betaine operate in Bacillus subtilis: characterization of OpuD. J Bacteriol 1996, 178:5071–9.

131. Whatmore AM, Chudek JA, Reed RH: The effects of osmotic upshock on the intracellular solute pools of Bacillus subtilis. J Gen Microbiol 1990, 136:2527–35.

132. Arakawa T, Timasheff SN: The stabilization of proteins by osmolytes.

Biophys J 1985, 47:411–4.

133. Holtmann G, Bremer E: Thermoprotection of Bacillus subtilis by exogenously provided glycine betaine and structurally related compatible solutes: involvement of Opu transporters. J Bacteriol 2004, 186:1683–93.

134. Hoffmann T, Bremer E: Protection of Bacillus subtilis against cold stress via compatible-solute acquisition. J Bacteriol 2011, 193:1552–62.

135. Kanehisa M: Chemical and genomic evolution of enzyme-catalyzed reaction networks. FEBS Lett 2013, 587:2731–7.

136. Kempf B, Bremer E: OpuA, an Osmotically Regulated Binding Protein-dependent Transport System for the Osmoprotectant Glycine Betaine in Bacillus subtilis. J Biol Chem 1995, 270:16701–16713.

137. Wierckx N, Koopman F, Ruijssenaars HJ, de Winde JH: Microbial degradation of furanic compounds: biochemistry, genetics, and impact. Appl Microbiol

Biotechnol 2011, 92:1095–1105.

138. Almeida JRM, Bertilsson M, Gorwa-Grauslund MF, Gorsich S, Lidén G:

Metabolic effects of furaldehydes and impacts on biotechnological processes.

Appl Microbiol Biotechnol 2009, 82:625–638.

139. Chee G-J, Takami H: Housekeeping recA gene interrupted by group II intron in the thermophilic Geobacillus kaustophilus. Gene 2005, 363:211–20.

140. Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, Holland TA,

Keseler IM, Kothari A, Kubo A, Krummenacker M, Latendresse M, Mueller LA, Ong Q,

Paley S, Subhraveti P, Weaver DS, Weerasinghe D, Zhang P, Karp PD: The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of

Pathway/Genome Databases. Nucleic Acids Res 2014, 42(Database issue):D459–

71.

141. Geneious 9.1. 2016:1–246.

142. FigTree [http://tree.bio.ed.ac.uk/software/figtree/]

143. Zaldivar J, Nielsen J, Olsson L: Fuel ethanol production from lignocellulose: a challenge for metabolic engineering and process integration. Appl Microbiol

Biotechnol 2001, 56:17–34.

144. Caldas T, Demont-Caulet N, Ghazi A, Richarme G: Thermoprotection by glycine betaine and choline. Microbiology 1999, 145(Pt 9):2543–8.

145. Ferat J-L, Le Gouar M, Michel F: A group II intron has invaded the genus

Azotobacter and is inserted within the termination codon of the essential groEL gene. Mol Microbiol 2003, 49:1407–1423.

146. Adamidi C, Fedorova O, Pyle AM: A group II intron inserted into a bacterial heat-shock operon shows autocatalytic activity and unusual thermostability.

Biochemistry 2003, 42:3409–18.

147. Nagarajan N, Cook C, Di Bonaventura M, Ge H, Richards A, Bishop-Lilly KA,

DeSalle R, Read TD, Pop M: Finishing genomes with limited resources: lessons from an ensemble of microbial genomes. BMC Genomics 2010, 11:242.

148. Nagarajan N, Read TD, Pop M: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 2008,

24:1229–35.

APPENDIX A

Accession information for isolates used in this study. BioProject, BioSample, and Short Read Archive (SRA) are all part of the NCBI’s Archive and are accessible at https://www.ncbi.nlm.nih.gov/. The Genomes OnLine Database (GOLD) and Integrated Microbial Genomes (IMG) System are both part of the JGI’s Archive and are accessible at https://gold.jgi.doe.gov/ and https://img.jgi.doe.gov/, respectively.

Isolate  BioProject  BioSample     SRA        GOLD Study ID  GOLD Project ID  IMG Genome ID
A07M350  311332      SAMN04549648  SRP071281  Gs0118462      Gp0136437        2657245263
A07M352  311332      SAMN04549649  SRP071281  Gs0118462      Gp0136456        2657245260
E07C003  311332      SAMN04549623  SRP071281  Gs0118462      Gp0136457        2657245259
E08C011  311332      SAMN04549624  SRP071281  Gs0118462      Gp0137056        2660238831
E08C017  311332      SAMN04549625  SRP071281  Gs0118462      Gp0136458        2657245258
E08C020  311332      SAMN04549626  SRP071281  Gs0118462      Gp0136459        2657245257
E08D002  311332      SAMN04549627  SRP071281  Gs0118462      Gp0136460        2657245256
F02C013  311332      SAMN04549613  SRP071281  Gs0118462      Gp0137055        2660238830
F05M388  311332      SAMN04549614  SRP071281  Gs0118462      Gp0136489        2657245346
F09D005  311332      SAMN04549615  SRP071281  Gs0118462      Gp0136490        2657245347
F09M437  311332      SAMN04549616  SRP071281  Gs0118462      Gp0136491        2657245348
G08C001  311332      SAMN04549607  SRP071281  Gs0118462      Gp0136492        2657245349
G08C006  311332      SAMN04549608  SRP071281  Gs0118462      Gp0136494        2657245350
G08C008  311332      SAMN04549609  SRP071281  Gs0118462      Gp0137054        2660238829
G08C011  311332      SAMN04549610  SRP071281  Gs0118462      Gp0136495        2657245351
G08C017  311332      SAMN04549611  SRP071281  Gs0118462      Gp0136496        2657245352
G09D026  311332      SAMN04549612  SRP071281  Gs0118462      Gp0136497        2657245353
G13D008  311332      SAMN04549601  SRP071281  Gs0118462      Gp0136498        2657245354
G13D016  311332      SAMN04549602  SRP071281  Gs0118462      Gp0136500        2657245355
G13D029  311332      SAMN04549603  SRP071281  Gs0118462      Gp0136502        2657245356
G13D038  311332      SAMN04549604  SRP071281  Gs0118462      Gp0136503        2657245357
G13D043  311332      SAMN04549605  SRP071281  Gs0118462      Gp0136504        2657245358
G19C023  311332      SAMN04549606  SRP071281  Gs0118462      Gp0136574        2660238739
G23C002  311332      SAMN04549628  SRP071281  Gs0118462      Gp0136582        2660238747
G23C019  311332      SAMN04549629  SRP071281  Gs0118462      Gp0136583        2660238748
G23D015  311332      SAMN04549630  SRP071281  Gs0118462      Gp0136584        2660238749
G24C011  311332      SAMN04549631  SRP071281  Gs0118462      Gp0136585        2660238750
H01C001  311332      SAMN04549636  SRP071281  Gs0118462      Gp0137037        2660238821
H01C007  311332      SAMN04549637  SRP071281  Gs0118462      Gp0137042        2660238823
H01D012  311332      SAMN04549638  SRP071281  Gs0118462      Gp0137045        2660238824
H01M105  311332      SAMN04549639  SRP071281  Gs0118462      Gp0137047        2660238825
H20C002  311332      SAMN04549640  SRP071281  Gs0118462      Gp0137057        2660238832
H20C009  311332      SAMN04549641  SRP071281  Gs0118462      Gp0136581        2660238746
H20D004  311332      SAMN04549642  SRP071281  Gs0118462      Gp0137058        2660238833
J04M017  311332      SAMN04549600  SRP071281  Gs0118462      Gp0136573        2660238738
J11M005  311332      SAMN04549590  SRP071281  Gs0118462      Gp0136443        2657245262
J11M011  311332      SAMN04549591  SRP071281  Gs0118462      Gp0136569        2660238732
J11M287  311332      SAMN04549592  SRP071281  Gs0118462      Gp0136570        2660238733
J18C011  311332      SAMN04549593  SRP071281  Gs0118462      Gp0136510        2657245363
J18C022  311332      SAMN04549594  SRP071281  Gs0118462      Gp0136511        2657245364
J18C025  311332      SAMN04549595  SRP071281  Gs0118462      Gp0136512        2657245365
J18D015  311332      SAMN04549596  SRP071281  Gs0118462      Gp0136513        2657245366
J19C022  311332      SAMN04549597  SRP071281  Gs0118462      Gp0137052        2660238828
J20M022  311332      SAMN04549598  SRP071281  Gs0118462      Gp0136571        2660238736
J20M030  311332      SAMN04549599  SRP071281  Gs0118462      Gp0136572        2660238737
K49C015  311332      SAMN04549586  SRP071281  Gs0118462      Gp0136546        2657245396
K49D024  311332      SAMN04549587  SRP071281  Gs0118462      Gp0136444        2657245261
K49M014  311332      SAMN04549588  SRP071281  Gs0118462      Gp0136548        2657245397
K49M015  311332      SAMN04549589  SRP071281  Gs0118462      Gp0136568        2660238731
M24C029  311332      SAMN04549618  SRP071281  Gs0118462      Gp0136578        2660238743
N09C011  311332      SAMN04549643  SRP071281  Gs0118462      Gp0137048        2660238826
N09C014  311332      SAMN04549644  SRP071281  Gs0118462      Gp0137051        2660238827
P01M009  311332      SAMN04549617  SRP071281  Gs0118462      Gp0136575        2660238740
R08M008  311332      SAMN04549632  SRP071281  Gs0118462      Gp0136579        2660238744
S44C017  311332      SAMN04549619  SRP071281  Gs0118462      Gp0136505        2657245359
S44C019  311332      SAMN04549620  SRP071281  Gs0118462      Gp0136506        2657245360
S44C021  311332      SAMN04549621  SRP071281  Gs0118462      Gp0136507        2657245361
S44D013  311332      SAMN04549622  SRP071281  Gs0118462      Gp0136508        2657245362
S48D018  311332      SAMN04549645  SRP071281  Gs0118462      Gp0136580        2660238745
T02C003  311332      SAMN04549646  SRP071281  Gs0118462      Gp0136577        2660238742
T02C029  311332      SAMN04549647  SRP071281  Gs0118462      Gp0136576        2660238741
U22C014  311332      SAMN04549633  SRP071281  Gs0118462      Gp0136586        2660238751
U22D017  311332      SAMN04549634  SRP071281  Gs0118462      Gp0136587        2660238756
U22M431  311332      SAMN04549635  SRP071281  Gs0118462      Gp0137040        2660238822

APPENDIX B

Statistics for 64 draft genomes. All genome statistics obtained from IMG-ER.

                              Coding                Genes w/    Func
         Genome   Gene   G+C  Base     RNA    RNA   Function    Pred   COG    COG    Pfam   Pfam
Isolate  Size     Count  %    Count    Count  %     Prediction  %      Count  %      Count  %
A07M350  3793737  3718   37   3043916  127    3.42  2802        75.36  2365   63.61  2970   79.88
A07M352  3796206  3718   37   3047569  125    3.36  2799        75.28  2373   63.82  2968   79.83
E07C003  3781182  4146   39   3083546  303    7.31  2927        70.6   2369   57.14  3084   74.38
E08C011  3382012  3710   40   2683309  282    7.6   2606        70.24  2106   56.77  2738   73.8
E08C017  3743360  4088   39   3050963  296    7.24  2927        71.6   2366   57.88  3077   75.27
E08C020  2715755  3039   42   2419436  131    4.31  2330        76.67  1879   61.83  2437   80.19
E08D002  3685776  4065   39   2983836  321    7.9   2866        70.5   2319   57.05  3025   74.42
F02C013  3376709  3661   52   2900786  145    3.96  2761        75.42  2234   61.02  2899   79.19
F05M388  3911764  3894   39   3186024  133    3.42  3032        77.86  2594   66.62  3176   81.56
F09D005  3909449  3895   39   3184684  130    3.34  3029        77.77  2589   66.47  3172   81.44
F09M437  3780439  3815   39   3089963  154    4.04  2945        77.2   2500   65.53  3089   80.97
G08C001  3836028  3954   44   3240910  141    3.57  3057        77.31  2567   64.92  3207   81.11
G08C006  3783734  3821   39   3093396  151    3.95  2956        77.36  2502   65.48  3099   81.1
G08C008  3738135  4111   39   3050147  296    7.2   2909        70.76  2353   57.24  3062   74.48
G08C011  3742161  4105   39   3056999  301    7.33  2909        70.86  2362   57.54  3060   74.54
G08C017  3735946  4065   39   3042579  280    6.89  2919        71.81  2363   58.13  3070   75.52
G09D026  3811435  3783   39   3127338  134    3.54  2965        78.38  2548   67.35  3120   82.47
G13D008  3678540  3991   39   2999607  292    7.32  2876        72.06  2345   58.76  3026   75.82
G13D016  3411064  3706   52   2931568  145    3.91  2803        75.63  2261   61.01  2942   79.38
G13D029  3810976  3770   39   3126880  129    3.42  2964        78.62  2551   67.67  3121   82.79
G13D038  3675921  4000   39   2994845  285    7.13  2874        71.85  2338   58.45  3022   75.55
G13D043  3678265  3988   39   2998366  290    7.27  2876        72.12  2347   58.85  3025   75.85
G19C023  3708305  4136   39   3011890  336    8.12  2868        69.34  2320   56.09  3017   72.94
G23C002  3736691  4105   39   3052684  294    7.16  2908        70.84  2356   57.39  3060   74.54
G23C019  3786670  3829   39   3095850  156    4.07  2953        77.12  2502   65.34  3096   80.86
G23D015  3603042  3893   39   2936625  278    7.14  2822        72.49  2307   59.26  2968   76.24
G24C011  3906981  3902   39   3186052  133    3.41  3034        77.75  2589   66.35  3176   81.39
H01C001  3913251  3820   37   3124908  123    3.22  2907        76.1   2454   64.24  3059   80.08
H01C007  4066452  4145   56   3550624  143    3.45  3092        74.6   2564   61.86  3243   78.24
H01D012  4067285  4146   56   3551596  143    3.45  3095        74.65  2562   61.79  3246   78.29
H01M105  3790409  3751   39   3098150  143    3.81  2945        78.51  2528   67.4   3104   82.75
H20C002  3728927  3625   37   2997043  116    3.2   2753        75.94  2363   65.19  2911   80.3
H20C009  3731698  3639   37   3001986  121    3.33  2759        75.82  2362   64.91  2918   80.19
H20D004  3770109  3819   39   3080925  148    3.88  2952        77.3   2491   65.23  3094   81.02
J04M017  3396372  3430   40   2832147  126    3.67  2697        78.63  2281   66.5   2849   83.06
J11M005  4213643  4462   46   3713553  150    3.36  3476        77.9   2931   65.69  3688   82.65
J11M011  3806784  3780   39   3122385  133    3.52  2961        78.33  2553   67.54  3114   82.38
J11M287  3910957  3892   39   3187403  133    3.42  3032        77.9   2592   66.6   3174   81.55

Appendix B Cont.

                              Coding                Genes w/    Func
         Genome   Gene   G+C  Base     RNA    RNA   Function    Pred   COG    COG    Pfam   Pfam
Isolate  Size     Count  %    Count    Count  %     Prediction  %      Count  %      Count  %
J18C011  3476121  3473   40   2761784  164    4.72  2629        75.7   2239   64.47  2776   79.93
J18C022  4216198  4465   46   3717306  149    3.34  3481        77.96  2936   65.76  3696   82.78
J18C025  3477710  3471   40   2762611  162    4.67  2629        75.74  2239   64.51  2778   80.03
J18D015  3786564  3821   39   3097514  159    4.16  2952        77.26  2505   65.56  3101   81.16
J19C022  3910131  3908   39   3188084  131    3.35  3035        77.66  2591   66.3   3180   81.37
J20M022  3261474  3567   52   2795383  146    4.09  2656        74.46  2135   59.85  2781   77.96
J20M030  3448863  3864   52   2954570  134    3.47  2844        73.6   2214   57.3   2987   77.3
K49C015  3525104  3523   40   2825274  165    4.68  2652        75.28  2280   64.72  2817   79.96
K49D024  2814502  3165   42   2550542  130    4.11  2440        77.09  1985   62.72  2569   81.17
K49M014  3680476  3988   39   2999866  285    7.15  2874        72.07  2346   58.83  3021   75.75
K49M015  3672566  3986   39   2993268  281    7.05  2882        72.3   2346   58.86  3029   75.99
M24C029  3982763  4233   44   3374350  128    3.02  3174        74.98  2580   60.95  3326   78.57
N09C011  3685292  4018   39   3005183  291    7.24  2875        71.55  2345   58.36  3036   75.56
N09C014  3820497  3778   39   3126189  129    3.41  2977        78.8   2564   67.87  3115   82.45
P01M009  3774243  3682   37   3041956  126    3.42  2788        75.72  2368   64.31  2943   79.93
R08M008  3910928  3890   39   3184359  130    3.34  3031        77.92  2591   66.61  3175   81.62
S44C017  3783562  3827   39   3092571  153    4     2956        77.24  2501   65.35  3102   81.06
S44C019  3431684  3707   52   2948335  146    3.94  2809        75.78  2301   62.07  2954   79.69
S44C021  3381731  3649   52   2913195  147    4.03  2776        76.08  2286   62.65  2916   79.91
S44D013  3732525  3715   39   3059727  130    3.5   2918        78.55  2518   67.78  3057   82.29
S48D018  3783578  3810   39   3092671  154    4.04  2949        77.4   2504   65.72  3096   81.26
T02C003  3793414  3763   39   3100364  147    3.91  2947        78.32  2527   67.15  3107   82.57
T02C029  3796369  3769   39   3102783  149    3.95  2951        78.3   2528   67.07  3110   82.52
U22C014  3396178  3423   40   2831104  127    3.71  2694        78.7   2281   66.64  2846   83.14
U22D017  3910723  3899   39   3187882  131    3.36  3034        77.81  2591   66.45  3178   81.51
U22M431  3910864  3891   39   3187585  132    3.39  3030        77.87  2591   66.59  3174   81.57
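
The percentage columns in Appendix B appear to be each count expressed as a share of Gene Count. This relationship is inferred from the tabulated values, not stated in the text, so the following is only a minimal sketch under that assumption. It uses the A07M350 row; the function name `percent_of_genes` is hypothetical.

```python
# Values copied from the A07M350 row of Appendix B.
gene_count = 3718
counts = {
    "RNA": 127,
    "Function prediction": 2802,
    "COG": 2365,
    "Pfam": 2970,
}


def percent_of_genes(count, total):
    """Return count as a percentage of total, rounded to two decimals.

    Assumption: IMG-ER reports each percentage relative to the total
    gene count. This is inferred from the table, not documented here.
    """
    return round(100 * count / total, 2)


percentages = {name: percent_of_genes(n, gene_count) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

For this row the computed values agree with the tabulated RNA %, Func Pred %, COG %, and Pfam % entries (3.42, 75.36, 63.61, and 79.88).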

APPENDIX C

Isolates sorted by clades from MLSA. Table also shows genomic G+C content and the closest megablast isolate to that clade. Where no close sequence was available we consider that clade novel.

Isolate  G+C %  Clades  Closest megablast Isolate
H20C002  37     I       Novel
P01M009  39     I       Novel
H20C009  39     I       Novel
H01C001  56     I       Novel
E08C011  40     II      Bacillus smithii strain DSM 4216
K49C015  42     II      Bacillus smithii strain DSM 4216
J18C011  46     II      Bacillus smithii strain DSM 4216
J18C022  40     III     Bacillus licheniformis strain HRBL-15TDI7
E08C020  42     IV      Anoxybacillus flavithermus WK1
G08C001  44     V       Geobacillus thermoglucosidasius DSM 2542
J19C022  52     V       Geobacillus thermoglucosidasius DSM 2542
S44C021  39     VI      Geobacillus sp. LC300
G13D016  39     VI      Geobacillus sp. LC300
F02C013  52     VI      Geobacillus sp. LC300
S44C019  52     VI      Geobacillus sp. LC300
J20M030  40     VI      Geobacillus sp. LC300
U22C014  39     VII     Novel
J04M017  52     VII     Novel
G08C008  37     VIIIa   Novel
N09C011  39     VIIIa   Novel
G19C023  39     VIIIa   Novel
E07C003  39     VIIIa   Novel
E08C017  39     VIIIa   Novel
E08D002  39     VIIIa   Novel
G23C002  39     VIIIa   Novel
F09D005  39     VIIIb   Novel
U22M431  39     VIIIb   Novel
U22D017  39     VIIIb   Novel
G23D015  39     VIIIb   Novel
J18D015  39     VIIIb   Novel
S48D018  39     VIIIb   Novel

Appendix C Cont.

Isolate  G+C %  Clades  Closest megablast Isolate
F05M388  39     VIIIb   Novel
G08C006  39     VIIIb   Novel
T02C003  39     VIIIb   Novel
H01M105  37     VIIIb   Novel
G13D043  39     VIIIb   Novel
G13D038  39     VIIIb   Novel
G09D026  39     VIIIb   Novel
K49M014  39     VIIIb   Novel
G24C011  39     VIIIb   Novel
S44D013  39     VIIIb   Novel
M24C029  39     VIIIb   Novel
G23C019  45     VIIIb   Novel
H20D004  46     VIIIb   Novel
N09C014  46     VIIIb   Novel
G13D008  52     VIIIb   Novel
S44C017  52     VIIIb   Novel
