
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Comparative Genomics and Epigenomics of Sporosarcina ureae

A thesis submitted in partial fulfillment of the requirement

for the degree of Master of Science in Biology

By

Andrew Oliver

August 2016

The thesis of Andrew Oliver is approved by:

______

Sean Murray, Ph.D. Date

______

Gilberto Flores, Ph.D. Date

______

Kerry Cooper, Ph.D., Chair Date

California State University, Northridge

Acknowledgments

First and foremost, a special thanks to my advisor, Dr. Kerry Cooper, for his advice and, above all, his patience. If I can be half the scientist you are someday, I would be thrilled.

I would also like to thank everyone in the Cooper lab, especially my colleagues Courtney Sams and Tabitha Bayangnos. It was a privilege to work alongside you.

More thanks to my committee members, Dr. Gilberto Flores and Dr. Sean Murray. Dr. Flores, you were instrumental in guiding me to ask the right questions regarding bacterial . Dr. Murray, your contributions to my graduate studies would make this section run on for pages. I thank you for taking me under your wing from the beginning.

Acknowledgement and thanks to the Baresi lab, especially Dr. Larry Baresi and Tania Kurbessoian, for their partnership in this research. Also to Bernardine Pregerson for all the work that lies at the foundation of this study.

This research would not be what it is without the help of my childhood friend, Matthew Kay. You wrote programs, taught me coding languages, and challenged me to go digging for answers to very difficult questions. I wish we could always collaborate.

To Dr. Melanie Oaks and the sequencing core at the University of California, Irvine: thank you for the help in generating all this data and getting this project off the ground.

Lastly, to my loving parents, Dan and Evelyn, and brother, Kevin. I am so lucky to call you family.

Dedication

This thesis is dedicated to my parents, Dan and Evelyn. I put years of work into this, but it pales in comparison to the work you put into raising me. I love you both.

Table of Contents

Signature page

Acknowledgments

Dedication

List of Figures/Tables

Abstract

Comparative Genomics and Epigenetics of Sporosarcina ureae

Introduction

Methods

Results

Discussion

Tables and Figures

References Cited

Appendix 1: Supplemental Tables and Figures


List of Figures/Tables

Figure 1: 16S rRNA tree of family Planococcaceae

Table 1: General genome characteristics of Sporosarcina ureae

Figure 2: COG graph of Sporosarcina ureae

Figure 3A: 16S rRNA tree of Sporosarcina

Figure 3B: Core genome tree of genus Sporosarcina

Figure 4: ACT plot of six strains of Sporosarcina ureae

Figure 5: Identity heatmap of Sporosarcina ureae

Figure 6: Core genome tree of Sporosarcina ureae

Figure 7: Circos plot of Sporosarcina ureae S204

Figure 8: Epigenome map of Sporosarcina ureae

Figure 9: Spore heatmap of Sporosarcina ureae

Figure 10: Urease loci alignment

Figure 11A: MUSCLE alignment of mreB gene

Figure 11B: MUSCLE alignment of rodA gene

Figure S1: MATLAB model for core genome

Table S1: Methylation data on the six strains of Sporosarcina ureae

Table S2: COG groups of the six strains of Sporosarcina ureae

Figure S2: Python script for COG generation

Abstract

Comparative Genomics and Epigenetics of Sporosarcina ureae

By

Andrew Oliver

Master of Science in Biology

Sporosarcina ureae is an aerobic, motile, spore-forming, Gram-positive coccus that was originally isolated in the early 20th century from soil enrichments with elevated levels of urea. The species is unique in that it is the only known spore-forming coccus, and it is currently placed in a genus otherwise composed exclusively of bacilli. Current research has focused on the biotechnological potential of its unique outer surface layer (S-layer) and its ability to efficiently convert urea into ammonia. Specifically, researchers are using organisms that hydrolyze urea in applications such as self-healing concrete, biofuel production, and more efficient fertilizer production. The goal of this study is to use Pacific Biosciences (PacBio) DNA sequencing technology to generate complete genome sequences and to investigate genetic and epigenetic variation between strains of S. ureae that differ in their spatial and temporal isolation. We have sequenced the first six complete genomes and methylomes of S. ureae. Genomes were assembled using PacBio SMRT Analysis (v2.3.0) and Geneious (Biomatters; v9.0.4) software, and annotated using the NCBI Prokaryote Genome Automatic Annotation Pipeline (PGAAP). The average S. ureae genome is 3.3 Mb in size and contains an average of 3160 CDS, 66 tRNAs and 8 rRNAs, while only one of the strains contains a plasmid (64 kb). Epigenetic analysis of the strains, using SMRT Analysis and REBASE (New England Biolabs), provided evidence of several novel adenine and cytosine methylases in S. ureae.

All six strains met the species requirement of 97% sequence identity across the 16S rRNA gene. However, further analysis using in silico DNA-DNA hybridization (DDH), average nucleotide identity (ANI), and additional core- and pan-genome analysis indicated either a highly divergent species or that some of the strains may represent a subspecies or a new species. Further genetic analysis of the entire genus is needed to determine exactly how S. ureae, a spore-forming coccus, relates to the other spore-forming species in the genus Sporosarcina. Our genomic analysis has begun to clarify the makeup of the genus, and also found that there may be spore-forming cocci other than S. ureae.

Comparative Genomics and Epigenomics of Sporosarcina ureae

Introduction

Sporosarcina ureae is an aerobic, motile, spore-forming, Gram-positive coccus that was originally isolated in the early 20th century from soil enrichments with elevated levels of urea. Although it was discovered more than a century ago, its phylogenetic placement relative to other closely related species remains contentious. Martinus Beijerinck originally isolated a motile coccus that clustered in packets and had the ability to form spores, which he named Planosarcina ureae (Beijerinck 1901). In 1911, Lohnis suggested a name change to Sarcina ureae (Pijper, Crocker et al. 1955), while Orla-Jensen in 1909 and Kluyver and van Niel in 1936 both proposed Sporosarcina ureae (Orla-Jensen 1909) (Kluyver and van Niel 1936). In 1963, Kocur and Martinec concluded that a new genus, Sporosarcina, should be created in the family Bacillaceae, and that the genus should contain the species S. ureae (Kocur and Martinec 1963). Further research into various molecular and biochemical properties supported the move into the family Bacillaceae (Kluckhohn 1986); however, further phylogenetic analysis has moved S. ureae into its current family, Planococcaceae.

Exactly how Sporosarcina ureae fits into the genus Sporosarcina is still somewhat of an enigma. According to the most recent edition of Bergey’s Manual of Systematic Bacteriology, the genus Sporosarcina is composed of nine species, eight of which are bacilli (Vos, Garrity et al. 2011). Indeed, S. ureae is currently the only member of the genus that is a coccus and that organizes itself in a sarcina grouping. All members of the genus form endospores and are catalase positive and motile, and all but one species can hydrolyze urea, although Bergey’s Manual suggests that the strongest evidence linking these organisms is the similarity of the 16S rRNA gene (Vos, Garrity et al. 2011).

The rod morphology is a well-studied phenotype. Although cell shape depends on many proteins working together, rodA and mreB are two of the best-studied genes implicated in rod-shaped morphology (Henriques, Glaser et al. 1998, Daniel and Errington 2003). rodA in B. subtilis and the rodA-pbp2 operon in E. coli are partially responsible for the rod shape of those organisms, and when they are underexpressed or knocked out, the cells become spherical (Henriques, Glaser et al. 1998). Another gene, the actin homologue mreB, is necessary for the rod shape of Caulobacter crescentus (Figge, Divakaruni et al. 2004). Other proteins such as mreC and mreD are also thought to be involved in cellular morphology; however, their exact functions remain unknown (Errington 2015). Additionally, it has been postulated that mbl, along with another protein, mreBH, has some functional similarity to mreB, though more research is needed to elucidate their exact mechanisms (Abhayawardhane and Stewart 1995, Kawai, Asai et al. 2009). Analysis of genes involved in cell shape in S. ureae or other potentially coccoid Sporosarcina sp. may yield clues as to why they are phylogenetically grouped with rod-shaped species.

S. ureae is currently the only known spore-forming coccus; it sporulates in harsh conditions and can remain viable for up to a year (Claus, Fritze et al. 2006). Data on the sporulation process of S. ureae are currently very limited, although the process appears to closely resemble that of rod-shaped organisms in the genus Bacillus. The one major difference is how the cell differentiates into a spore: Bacillus differentiates asymmetrically, while S. ureae differentiates symmetrically (Zhang, Higgins et al. 1997). Further work is needed to truly understand how cells of different morphology differentiate into endospores.

Another key characteristic of S. ureae is the ability to produce at least one urease, an enzyme that converts urea to ammonia, which studies have shown has higher activity than other purified bacterial ureases (McCoy, Cetin et al. 1992). Despite similarities in urease production and sporulation, different strains that were designated as S. ureae using Bergey’s Manual vary considerably in their overall physiology (Pregerson 1973). Pregerson (1973) observed that this variation could be seen even among strains isolated from the same confined location; it is therefore possible that there is a large amount of genetic variability among these strains, or that the strains in fact represent novel species. Whole genome sequencing (WGS) allows us to examine the genetic variability of these different strains and to compare genotype to phenotype.

Although it has been studied by several different laboratories, very little overall research has been done on S. ureae to date. Recently, however, interest has increased because of its potential biotechnological applications. The ability of S. ureae to convert urea to ammonia has important potential applications in the production of biofuels and fertilizers. Ammonia is being actively researched as a carbon-free alternative fuel source. Its high octane rating (110-130) and relative safety compared to gasoline make it a candidate replacement for gasoline (Zamfirescu and Dincer 2009). Furthermore, the infrastructure needed to transport and store ammonia in quantities of 100 million tons or more already exists (Zamfirescu and Dincer 2009). Traditional methods of generating ammonia for fertilizer rely heavily on natural gas; it has been estimated that producing the ammonia needed to meet current fertilizer demand accounts for 2% of the entire world’s energy consumption (Zamfirescu and Dincer 2009). There is a great need for more environmentally friendly alternative fuels and ammonia production, and S. ureae may help in developing these methods.

Another application of S. ureae research focuses on the unique outer cell surface layer (S-layer). S-layers are composed of single proteins that form a predictable lattice structure and have potential applications in nanoelectronics, medicine, and biosensors (Ilk, Egelseer et al. 2008). Much of the research examines the self-assembly property of S-layers which, when bound to certain antibodies, may advance vaccine development and diagnostic testing (Ilk, Egelseer et al. 2011). Furthermore, the S-layer may have a role in binding heavy metals that are found in contaminated water supplies (Ilk, Egelseer et al. 2011).

Additional S. ureae work is being done at the National Aeronautics and Space Administration (NASA) Ames Research Center, examining organisms that convert urea to ammonium. A presentation by Lynn Rothschild indicated that some of the first colonizers of Mars might use these organisms to convert human waste to ammonium and subsequently use the ammonium to lower the pH of the Mars soils in order to make calcium carbonate cement (Rothschild 2012). This cement could then be used to make bricks and other building materials. S. ureae therefore holds the potential to address some major issues facing humankind in the future.

In the current strain collection at California State University, Northridge (CSUN) there are over 50 different isolates from numerous soil samples around the world that were characterized and identified by Bernadine Pregerson as potentially S. ureae. Additionally, Pregerson’s analysis of soils containing these isolates concluded that the bacterium is most commonly present in soils that reflect high activity of dogs and humans. The research also found that many of the isolates can survive in relatively high concentrations of urea, but that the isolates vary significantly in their ability to actively grow in the presence of 10% urea (Pregerson 1973). In 1996, prior to WGS, Risen studied a number of genetic loci in many of the isolates. Risen found that, in addition to the phenotypic variation Pregerson observed, there is extensive genetic variation, and concluded that these isolates were non-clonal (Risen 1996).

In addition to the genomic and phenotypic variation already observed in these strains, we will examine the variation between epigenomes by analyzing genomic methylation patterns. Epigenomic analysis has become possible with the advent of single-molecule real-time (SMRT) sequencing, which detects modifications by recording kinetic changes in the DNA polymerase during sequencing (Chen, Jeannotte et al. 2014). The epigenome is known to have a role in the bacterial immune system, virulence, and gene expression (Chen, Jeannotte et al. 2014, Mou, Muppirala et al. 2014). Comparative epigenomics has also shown methylation to be useful in distinguishing closely related species (Pirone-Davies, Hoffmann et al. 2015).

To date there exists only a single annotated draft genome of S. ureae (strain DSM 2281). There has been no analysis or research utilizing this genome; thus very little is known about the overall genomics of Sporosarcina sp., particularly S. ureae. Comparative genomics of the species will allow us to clarify the phylogenetic uncertainty, and begin to understand its morphological and physiological uniqueness relative to the genus Sporosarcina. Furthermore, it will allow us to begin to understand the sporulation genetics behind the only known spore-forming coccus. The goal of this study is to use comparative genomics and epigenomics to investigate the genetic differences between strains of S. ureae that differ in their spatial and temporal isolation.

For the study, I will use Pacific Biosciences (PacBio) sequencing to generate complete genome sequences for all of the selected isolates. In 2010, PacBio introduced the Single Molecule Real Time (SMRT) sequencing technology, which is part of the third generation of DNA sequencing platforms (McCarthy 2010). The benefit of PacBio sequencing, especially for de novo sequencing applications, is its ability to generate long reads. On average, PacBio generates DNA sequencing reads longer than 10 kb (N50 mean read length using their P6-C4 chemistry is 20 kb), which helps span difficult repeat regions in the genome where assemblies based on short-read data are typically problematic (Rhoads and Au 2015). Overall, these longer reads yield significantly fewer contigs (often a single contig) and more accurate assemblies, particularly when a reference genome is lacking. PacBio sequencing is also able to detect base modifications and construct the methylome of an organism for comparative epigenomic analysis. This is accomplished by using real-time imaging to record the kinetics of the polymerase during sequencing (Chen, Jeannotte et al. 2014); each type of base modification produces a characteristic change in the polymerase kinetics.


Materials and Methods

DNA Extraction

Using a protocol adapted from Cooper et al., each strain of S. ureae was grown in triplicate in 5 ml of tryptic soy yeast broth on a rotator at 30°C overnight (Cooper, Mandrell et al. 2014). The replicates were then combined, pelleted (12,000 x g; 10 min), re-suspended in Tris-sucrose (10% sucrose, 50 mM Tris, pH 8.0) and diluted to an optical density of 1.6-1.8. Cells were lysed with lysozyme (100 mg/ml in 50 mM Tris, pH 8.0) and 10% SDS. EDTA (100 mM EDTA, pH 8.0) was used to buffer the suspended DNA. RNase One (10 mg/ml) was added and the strains were incubated at 37°C for 24 hours to ensure total RNA removal. Next, proteinase K (20 mg/ml) was added to remove any remaining proteins, and 3M sodium acetate (pH 5.5) and absolute ethanol were added to precipitate the DNA. After the DNA was re-suspended in EB buffer (Qiagen), it was incubated at 30°C overnight. Next, 400 µl of phenol:chloroform:isoamyl alcohol was added, followed by centrifugation (12,000 x g; 5 min), and the aqueous layer was transferred to a new tube. To remove any traces of phenol from the solution, 400 µl of chloroform was added and mixed by inverting three times, the sample was centrifuged (12,000 x g; 3 min), the top layer was transferred to a new tube, and the DNA was precipitated again with absolute ethanol. The resulting DNA was re-suspended in EB buffer. Quality, size and quantity of DNA were confirmed with a Nanodrop spectrophotometer (260/280 = 1.8-2.0), gel electrophoresis (high single band, little smearing) and a Picogreen dsDNA assay (Life Technologies’ Quant-iT Picogreen dsDNA kit) per the manufacturer’s instructions, respectively.


Genome Sequencing

The DNA obtained was sent to the UC Irvine Genomic High Throughput Facility for library preparation and PacBio sequencing. For library preparation, 15 µg of genomic DNA was sheared into 20 kb fragments using a Covaris S2 Focused Acoustic Shearer, and these fragments were used to generate the PacBio sequencing libraries. One SMRT cell per strain allowed for >150x coverage per strain, ample coverage for the construction of de novo genomes. In total, six S. ureae genomes (all except DSM 2281) were sequenced using PacBio technology, which resulted in an average of 67,798 reads, an average read length of 13.57 kb, and an average of 166x coverage per genome (Table 1).

Assembly and Annotation

All genomes were assembled using SMRTanalysis software (v2.3.0.1, http://www.pacb.com/wp-content/uploads/2015/09/SMRT-Pipe-Reference-Guide.pdf). Any genomes that needed further assembly were finished in silico using Geneious software (Biomatters, v9.0.0) by mapping the corrected reads to the contigs and subsequently linking the contigs into a complete genome sequence (Kearse, Moir et al. 2012). Initially, genomes were annotated using the Rapid Annotation using Subsystem Technology (RAST) server (Aziz, Bartels et al. 2008). Although the RAST annotations were of good quality, the genomes were also run through NCBI’s own Prokaryote Genome Automatic Annotation Pipeline (PGAAP), and these annotations were compared to the RAST annotations. Any questionable annotations were manually checked; however, the annotation calls were found to be nearly identical. The genomes are deposited on NCBI under their respective accession numbers (Table 1).

Phage and Insertion Sequence Analysis

The six sequenced S. ureae genomes were run through the online phage detection tool PHASTER (PHAge Search Tool – Enhanced Release) under default settings, and manually checked in Geneious (Arndt, Grant et al. 2016). Only phage regions identified as “Intact” were used in this analysis. Insertion sequences were identified using the web-based tool ISfinder, and an E-value cutoff of 1e-4 was applied to the BLAST results (Siguier, Pérochon et al. 2006).

16S rRNA Analysis

For each of the six strains that were sequenced, we created a strain-specific consensus sequence for the 16S rRNA gene from the different copies of the gene in each genome. Because copies of the gene are known to vary slightly, even within the same genome, a consensus of all copies within the genome helps capture the most accurate single sequence to use in a comparative analysis (Acinas, Marcelino et al. 2004). After PGAAP annotated the strains, each genome was BLASTed, using the BLAST plugin in Geneious, with a copy of its own 16S rRNA gene to verify that the genomes did not contain unidentified rRNA loci or genes. A total of 77 16S rRNA gene sequences were used in the construction of the family 16S rRNA phylogenetic tree.
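The consensus step described above can be illustrated with a minimal sketch. The thesis does not state the exact rule used to build the consensus, so the example below assumes a simple per-column majority vote over copies that have already been aligned to equal length; the function name `consensus`, the tie-handling rule (ties become N), and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def consensus(aligned_copies):
    """Majority-vote consensus of equal-length, pre-aligned sequences.

    Assumption: ties at a column are reported as 'N' instead of being
    resolved arbitrarily.
    """
    if not aligned_copies:
        raise ValueError("need at least one sequence")
    length = len(aligned_copies[0])
    if any(len(seq) != length for seq in aligned_copies):
        raise ValueError("sequences must be aligned to the same length")

    result = []
    for column in zip(*aligned_copies):
        counts = Counter(base.upper() for base in column)
        top = max(counts.values())
        winners = [base for base, n in counts.items() if n == top]
        result.append(winners[0] if len(winners) == 1 else "N")
    return "".join(result)


# Toy data standing in for several 16S rRNA gene copies from one genome.
copies = ["ACGTAC", "ACGTAC", "ACCTAC"]
print(consensus(copies))  # ACGTAC
```

A majority rule keeps the sketch short; an ambiguity-code consensus (for example IUPAC codes at variable sites) would be an equally reasonable choice.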

Included in the analysis were 16S rRNA gene sequences of 10 other species in the genus Sporosarcina, in addition to the 6 sequenced strains, and 61 other species from 18 genera in the family Planococcaceae, which are publicly available on the Ribosomal Database Project (RDP) (Maidak, Cole et al. 2001). The average length of the 16S rRNA gene sequences used for analysis was 1497 bp. The sequences were aligned using the SILVA Incremental Aligner (SINA, www.arb-silva.de), an alignment tool that takes ribosomal secondary structure into account when aligning sequences (Pruesse, Peplies et al. 2012). MEGA 7.0 was used to analyze the alignment file and determine the best phylogenetic tree model by way of maximum likelihood, Bayesian information criterion, and Akaike information criterion (Kumar, Stecher et al. 2016). The maximum likelihood phylogenetic trees were constructed in MEGA 7.0 using the Kimura 2-parameter distance model (+Gamma, +Invariant) with 1,000 bootstraps.
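For readers unfamiliar with the Kimura 2-parameter model named above, the following sketch computes the plain pairwise K2P distance, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. It is an illustration only: it omits the +Gamma and +Invariant corrections and is not the maximum likelihood procedure MEGA performs. The function name and the toy sequences are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")

    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument <= 0).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Toy example: one transition in ten sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))  # 0.1116
```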

Whole Genome Alignment

The online tool WebACT (http://www.webact.org/WebACT/home) was utilized to generate the whole genome comparison files for each of the six genomes against each other (Abbott, Aanensen et al. 2008). Using the default settings, the six genomes were aligned with the Artemis Comparison Tool (ACT, v13.0.0) software (Carver, Rutherford et al. 2005).

Core Genome analysis

The seven genomes used for analysis are publicly available on NCBI under their respective accession numbers (Table 1). There are currently no exactly defined parameters for delineating the core genome of related species; we therefore used percent amino acid sequence identity (PI), percent query coverage (PC), and E-value as core genome parameters, and set the cutoffs to the conservative values of >90% PI, >90% PC and <1e-10 E-value (Kuhn and Teixeira 2004). For the genus core genome, the parameters were set to >75% PI, >75% PC and <1e-10 E-value. Models using the raw BLAST data (described below) were generated with MATLAB (MathWorks, v.2014a) to predict the number of genes at various parameter cutoffs. Our core genomes for the species and genus agreed with the predicted models (Figure S1).

To generate the raw sequence comparison data, I created a protein BLAST database of all protein sequences from the 7 genomes. Next, the protein sequences of each genome were individually compared against the created protein BLAST database using the blastp command from the BLAST+ software (Camacho, Coulouris et al. 2009). The output showed whether the genes in the query genome were present in all the other database genomes, and how related they were to each other. This resulted in a raw data .csv (comma separated values) file for each genome that would contribute to the core genome. Because this step generates a large dataset, a Python program called Geneparser (https://github.com/mmmckay/geneparser) was written to parse the .csv files and search for core genes in each genome. Geneparser uses the organism-specific amino acid sequence files and generates concatenated gene sequences of all the shared genes for each genome. The concatenated sequences were imported into Geneious software and exported as a FASTA file for MUSCLE alignment (v3.8.31) using the default settings (Edgar 2004). Using the resulting alignment file, a phylogenetic tree was constructed with MEGA using the LG (+F,G,I) model and bootstrapped 1,000 times.
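The cutoff logic described in this section can be sketched as follows. This is not the Geneparser program; it is a minimal illustration that assumes each BLAST hit has already been reduced to a tuple of (query gene, subject genome, PI, PC, E-value). The tuple layout, the function name `core_genes`, and the toy hits are hypothetical.

```python
from collections import defaultdict


def core_genes(hits, genomes, min_pi=90.0, min_pc=90.0, max_evalue=1e-10):
    """Return query genes with a passing hit in every genome.

    hits: iterable of (query_gene, subject_genome, pi, pc, evalue).
    A hit passes when PI > min_pi, PC > min_pc and E-value < max_evalue.
    """
    found = defaultdict(set)
    for gene, genome, pi, pc, evalue in hits:
        if pi > min_pi and pc > min_pc and evalue < max_evalue:
            found[gene].add(genome)
    wanted = set(genomes)
    return sorted(gene for gene, seen in found.items() if wanted <= seen)


# Toy hits for two genomes; the values are invented for illustration.
toy_hits = [
    ("geneA", "S204", 99.0, 100.0, 1e-50),
    ("geneA", "P8", 95.0, 98.0, 1e-40),
    ("geneB", "S204", 99.0, 100.0, 1e-50),
    ("geneB", "P8", 80.0, 98.0, 1e-40),
]
print(core_genes(toy_hits, ["S204", "P8"]))                          # ['geneA']
print(core_genes(toy_hits, ["S204", "P8"], min_pi=75, min_pc=75))    # ['geneA', 'geneB']
```

Relaxing the thresholds from the species values to the genus values, as in the second call, admits more divergent genes into the core set.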

Epigenetic Analysis

Using SMRTanalysis software, the original PacBio sequencing reads were mapped to the complete genome sequence and any base modifications were determined. I compared the locations of methylation associated with different motifs between all 6 genomes. Additionally, I ran each motif through the REBASE database (http://rebase.neb.com/rebase/rebase.html, New England Biolabs) to check whether the motifs were associated with any known restriction enzymes and their associated organisms. Only methylation sites with a Phred-like Quality Value (QV) score of 30 or greater are presented in this study (Pirone-Davies, Hoffmann et al. 2015). The modifications were plotted using Circos (v.0.69) (Krzywinski, Schein et al. 2009).
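To make the QV threshold concrete, the sketch below applies the standard Phred relation, p = 10^(-QV/10), under which QV 30 corresponds to a probability of 0.001, and filters a list of modification calls. The record layout (dictionaries with `position`, `motif` and `qv` keys) is a hypothetical stand-in for the SMRTanalysis output, and the toy records are invented.

```python
def qv_to_probability(qv):
    """Convert a Phred-scaled quality value to a probability."""
    return 10 ** (-qv / 10)


def filter_modifications(records, min_qv=30):
    """Keep modification calls whose QV is at or above the threshold."""
    return [rec for rec in records if rec["qv"] >= min_qv]


# Toy records; positions, motifs and scores are invented for illustration.
calls = [
    {"position": 1520, "motif": "GATC", "qv": 87},
    {"position": 4410, "motif": "GATC", "qv": 22},
    {"position": 9031, "motif": "GATC", "qv": 30},
]
kept = filter_modifications(calls)
print([rec["position"] for rec in kept])  # [1520, 9031]
```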

BLASTp Ring

Strains P8, P17a, P32a, P33, P37, DSM 2281, and S. newyorkensis were protein BLASTed against strain S204, and protein hits with 50% sequence identity or greater were plotted on a circular Circos heatmap, where color indicates the level of sequence identity.

COG Analysis

Reverse Position Specific-BLAST (RPS-BLAST) was used to find the Cluster of Orthologous Groups (COG) data for the genomes. The query proteins were BLASTed against NCBI’s Conserved Domain Database (CDD) using RPS-BLAST (Marchler-Bauer, Lu et al. 2011). CDD contains well-annotated multiple sequence alignment models for ancient domains and full-length proteins, allowing for fast identification of conserved domains in the query proteins. After matching the query proteins to the well-annotated CDD proteins with RPS-BLAST, a short Python script (Figure S2) was written to pair the correct COG information to each matched query protein. NCBI provides the COG data for each of the genomes contained in the CDD, and this was cross-referenced with the BLAST results to obtain COG information about the query proteins.
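The pairing step can be pictured with a short sketch. This is not the script shown in Figure S2; it assumes the RPS-BLAST results have already been reduced to one best COG identifier per protein and that a lookup table from COG identifier to functional category letters is available. The identifiers below are placeholders; the category letters E, J and L are the standard COG codes for amino acid transport and metabolism; translation, ribosomal structure and biogenesis; and replication, recombination and repair.

```python
from collections import Counter


def assign_cog_categories(best_hits, cog_to_category):
    """Count COG functional categories for a set of proteins.

    best_hits: {protein_id: cog_id} (hypothetical, pre-parsed input).
    cog_to_category: {cog_id: category letters, e.g. 'E' or 'EJ'}.
    Returns (Counter of category letters, list of unassigned proteins).
    """
    counts = Counter()
    unassigned = []
    for protein, cog_id in best_hits.items():
        letters = cog_to_category.get(cog_id)
        if letters is None:
            unassigned.append(protein)
            continue
        # A COG may belong to more than one functional category.
        for letter in letters:
            counts[letter] += 1
    return counts, unassigned


# Placeholder identifiers, invented for illustration.
lookup = {"COG_X1": "E", "COG_X2": "J", "COG_X3": "L", "COG_X4": "EJ"}
hits = {"prot1": "COG_X1", "prot2": "COG_X4", "prot3": "COG_X3", "prot4": "COG_X9"}
counts, unassigned = assign_cog_categories(hits, lookup)
print(dict(counts))   # {'E': 2, 'J': 1, 'L': 1}
print(unassigned)     # ['prot4']
```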

Sporulation Analysis

The S. ureae and Bacillus megaterium genes associated with sporulation were searched for using Geneious, and the identified spore genes of the six S. ureae genomes were then individually BLASTed against the known spore genes of Bacillus megaterium using BLASTp. The percent identity of the hits was then used to create a circular Circos heatmap of gene comparisons between S. ureae and Bacillus megaterium. Only hits of 50% amino acid identity or greater are shown. The resulting hits for each genome were binned into one of 11 spore gene categories that better describe the role of the gene in sporulation.
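The thresholding and binning described above can be sketched in a few lines. The 11 categories used in the thesis are not listed in this section, so the category labels below are hypothetical, as are the identity values; only the 50% identity cutoff comes from the text.

```python
from collections import defaultdict


def bin_spore_hits(hits, min_identity=50.0):
    """Group BLASTp hits by sporulation category, keeping hits >= cutoff.

    hits: iterable of (gene_name, category, percent_identity).
    """
    binned = defaultdict(list)
    for gene, category, identity in hits:
        if identity >= min_identity:
            binned[category].append((gene, identity))
    return dict(binned)


# Toy hits; categories and identities are invented for illustration.
spore_hits = [
    ("spo0A", "initiation", 82.0),
    ("sigF", "sigma factors", 61.5),
    ("cotE", "coat", 43.0),
]
binned = bin_spore_hits(spore_hits)
print(binned)
```

The resulting per-category lists of (gene, identity) pairs are the kind of table a heatmap tool would consume, one row per gene and one value per genome.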

Urease Analysis

All six sequenced strains of S. ureae were inoculated on urea slants (Hardy Diagnostics Urea Agar) and incubated for 24 hrs. To test their growth in 10% urea, strains were grown in TSY broth (27.5 g tryptic soy broth and 5.0 g yeast extract per liter) supplemented with 10% filter-sterilized urea. The inoculated broths were incubated at 30°C and the optical density was read every 24 hrs for 3 days to test for growth. To examine urease genetics, a MUSCLE alignment was performed, using default parameters, on the urease loci of the six strains of S. ureae.

Morphogene Analysis

The Sporosarcina genomes (including S. koreensis, S. newyorkensis, and S. ureae DSM 2281), along with Bacillus subtilis and Escherichia coli, were searched for the presence of genes involved in cell morphology determination, specifically mreB and rodA (or the rodA-pbp2 operon). The amino acid sequences were aligned using MUSCLE, and pairwise alignment percentages were generated to show sequence relatedness. Seven other genes (ftsZ, mbl, mreC, mreD, minC, minD, minE) were screened for their presence in the Sporosarcina genomes because of their known role in cell morphology.
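A pairwise identity percentage of the kind mentioned above can be computed from two rows of an alignment. The sketch below is an illustration under one stated convention: identical residues are counted over all columns in which at least one sequence has a residue, so gap-to-residue columns count as mismatches. Other tools use different denominators, so values are not guaranteed to match those reported by any particular program. The toy sequences are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    matches = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column is a gap in both sequences; ignore it
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned protein fragments: 3 identical columns out of 5.
print(percent_identity("MKV-L", "MRVAL"))  # 60.0
```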

Results

Family Planococcaceae

16S rRNA gene analysis shows that the six sequenced strains of Sporosarcina ureae group in the genus Sporosarcina, within the family Planococcaceae (Figure 1). The family is in the order Bacillales and the phylum Firmicutes (Vos, Garrity et al. 2011). Members of Planococcaceae are Gram-variable and display various morphologies (Shivaji, Srinivas et al. 2014). Most species of Planococcaceae are aerobic.

General Characteristics

The average S. ureae genome is 3.33 Mb in size and encodes 3,160 proteins, 8 ribosomal loci, and 67 tRNAs (Table 1). Only strains P33 and P37 contain intact phage regions, and the average S. ureae genome contains 21 insertion sequences (Table 1). Major clusters of orthologous groups (COG) that are represented in the genome are amino acid transport and metabolism; replication, recombination and repair; and translation, ribosomal structure and biogenesis (Figure 2, Table S2). The GC content ranges from 41.2-44.7% and averages 42%. Of the six genomes sequenced, only P37 contained a plasmid (Table 1).

Genus Sporosarcina

To refine the 16S rRNA phylogenetic tree (Figure 3A), a core genome for the genus was constructed using those Sporosarcina species that have been genome sequenced. In total 74 genes were labeled as core, present in all members of the genus at cutoff values of >75% PI, >75% PC, and <1e-10 E-value. The maximum likelihood tree based on the core genome slightly changed the topology compared to the 16S rRNA tree by switching P17a, P8 and P32a (Figure 3B).

Sporosarcina ureae

When the 16S rRNA gene is aligned using MUSCLE, S204 and DSM 2281 share 100% pairwise identity, which is reflected in their proximity to each other in the phylogenetic tree (Figure 3A). P8, P17a, and P32a cluster close to S204, with P32a being the most divergent of these strains at 99.9% pairwise identity. Notably, both P33 and P37 are highly divergent from the other four strains, sharing 100% pairwise identity with each other, but at best only 98.3% pairwise identity with the next closest relative, P32a (Figure 3A).

Although there is extensive synteny between genomes of S. ureae, S204 has a single large inversion (Figure 5). Strains P33 and P37 have more small genomic rearrangements relative to the other genomes (Figure 4). Although the gene order is mostly conserved, sequence identity varies widely across all genes (Figure 5). The two most closely related strains (S204 and DSM 2281) share 98% average amino acid identity (AAI), while the most distantly related strains (S204 and P37) share only 79% AAI.

Core genome analysis of all members of the species using cutoff values of >90% PI, >90% PC, and <0.0001 E-value revealed a core genome of 874 genes. The phylogenetic tree using these conserved genes shows relationships among the strains with more resolving power than the 16S rRNA gene, which fails to accurately resolve the relationship between P17a, P32a and P8 (Figure 6).

On average, each strain of Sporosarcina contains 485 unique genes; however, the number varied widely between the strains. P32a contains 1131 unique genes, more than a third of its total number of genes (35%), while strain S204 contains only 62 unique genes, barely 5% of its total number of genes (Figure 7).

Methylome Analysis (Epigenomics)

At the epigenomic level there were significant differences among all six genomes, as none of them shared common methylases (Figure 8). Interestingly, S204 is the only strain to contain a Dam methylase, and it is the only active methylase in the genome. All the other strains contain multiple methylases, and P17a, P32a, and P33 all contain both m6A and m4C methylation. P37 lacks any apparent cytosine methylation; additionally,

P8 has no currently identified adenine or cytosine methylation, but did have several active for unknown types of base modifications.

Sporulation

All six strains have the ability to form endospores, as verified using phase contrast microscopy. The Sporosarcina strains have on average 52 identified sporulation genes (1.6% of total genome), compared to Bacillus megaterium, a well-studied spore-forming bacillus, which has 123 genes (2.4% of total genome). Overall, the cocci have 57% fewer sporulation genes than the bacillus. In terms of amino acid sequence identity, on average only 27 known B. megaterium spore genes are found in the Sporosarcina strains at 50% amino acid identity or greater (Figure 9). An amino acid alignment showed 64.4% pairwise identity across all sporulation genes between the six Sporosarcina strains.
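The reduction in sporulation gene count reported above can be checked with simple arithmetic. The sketch below uses only the two counts given in the text (52 and 123); the function name is hypothetical.

```python
def percent_fewer(count: float, reference: float) -> float:
    """Percent reduction of `count` relative to `reference`."""
    return 100.0 * (reference - count) / reference


sporosarcina_mean_spore_genes = 52   # average across the six strains (from the text)
b_megaterium_spore_genes = 123       # from the text

reduction = percent_fewer(sporosarcina_mean_spore_genes, b_megaterium_spore_genes)
print(round(reduction, 1))
```

The unrounded value is about 57.7%, consistent with the 57% figure quoted in the text.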

Urease

When inoculated on a urea slant, all six strains showed a positive result for the production of a urease. Growth in tryptic soy yeast broth supplemented with 10% urea varied considerably between strains. P8, P17a and P37 actively grew in the presence of 10% urea, although at different rates, while P32a, P33 and S204 did not grow. Analysis of the urease loci failed to show gene changes that might explain these differences; however, P17a and P37 do have a large truncation of the urea transporter gene in the urease loci (Figure 10).

Morphogene Analysis

The rodA gene was found in the six sequenced genomes in addition to B. subtilis and E. coli. On average, the six newly sequenced genomes shared 35.7% amino acid identity with B. subtilis and 30.4% with E. coli. B. subtilis and E. coli are even less related, sharing 27.5% identity with each other across rodA. Notably, aside from E. coli, no other genome in this study contained the rodA-pbp2 operon. S. ureae DSM 2281, S. koreensis, and S. newyorkensis did not have a copy of rodA, but when BLASTed did hit ftsW, a close homolog of rodA. Compared to the mreB gene found in E. coli, B. subtilis, S. ureae DSM 2281, S. koreensis, and S. newyorkensis, the six strains shared amino acid identities of 52.4%, 48.9%, 96.4%, 76.2%, and 78.6%, respectively. Several other genes involved in cellular morphology are also present in the Sporosarcina strains, such as the mreB paralog mbl, mreC and mreD, the septum-site determining proteins minC and minD, and the tubulin homolog ftsZ.

Discussion

As more genomes are sequenced, the definition of what constitutes a prokaryotic species is being challenged. Up until now, a prokaryotic species has been defined as strains (including the type strain) characterized by certain phenotypic consistency, 70% DNA-DNA hybridization (DDH) and over 97% identity of the 16S rRNA gene (Gevers, Cohan et al. 2005). With the advent of affordable whole genome DNA sequencing (WGS), the ability to study organisms at the individual nucleotide level allows for refining phylogenetic relationships based on the classic polyphasic approach. There is currently a push to include parameters derived from whole genome sequencing, such as average nucleotide identity (ANI) or average amino acid identity (AAI), to delineate species (Richter and Rosselló-Móra 2009).

In this study we show that six strains of Sporosarcina are much less related to each other than the polyphasic metrics would have one believe. It has long been recognized that single-gene trees have limited resolving power, and this was further supported when Hug et al. compared a tree of life built from 16S rRNA gene sequences to one built from 16 conserved ribosomal proteins and found that using more genes resolved relationships that were ambiguous when using just one gene (Hug, Baker et al. 2016). Furthermore, there seems to be some disagreement about what pairwise identity cutoff constitutes the same species. Traditionally, organisms that share at least 97% 16S rRNA sequence identity, along with other phenotypic markers, were considered the same species (Stackebrandt and Goebel 1994). Current research suggests that moving the cutoff to 98.65%, when combined with other genomic metrics such as ANI, might accelerate the process of distinguishing novel species (Kim, Oh et al. 2014). At 98.65%, P33 and P37 would not be considered the same species as the other four strains (Figure 3A), which is further supported by ANI and AAI. Therefore, I propose that P33 and P37 are in fact a new species of Sporosarcina.
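The effect of the two 16S rRNA cutoffs discussed above can be illustrated with the pairwise identities reported in the Results (100% for S204 vs DSM 2281, 100% for P33 vs P37, and at best 98.3% for P33 or P37 vs P32a). This is only a sketch of the threshold comparison; it treats 16S identity in isolation, whereas the argument in the text also relies on ANI and AAI. The function and variable names are hypothetical.

```python
CLASSIC_CUTOFF = 97.0    # Stackebrandt and Goebel 1994
STRICTER_CUTOFF = 98.65  # Kim, Oh et al. 2014


def same_species_by_16s(identity_percent: float, cutoff: float) -> bool:
    """True if a 16S rRNA pairwise identity meets the given species cutoff."""
    return identity_percent >= cutoff


# Pairwise identities quoted in the Results section.
pairwise_16s = {
    ("S204", "DSM 2281"): 100.0,
    ("P33", "P37"): 100.0,
    ("P33", "P32a"): 98.3,
}

for (strain_a, strain_b), identity in pairwise_16s.items():
    classic = same_species_by_16s(identity, CLASSIC_CUTOFF)
    stricter = same_species_by_16s(identity, STRICTER_CUTOFF)
    print(f"{strain_a} vs {strain_b}: {identity}% | "
          f"same species at 97%: {classic} | at 98.65%: {stricter}")
```

Only the P33 vs P32a comparison changes its call between the two cutoffs, which is the basis for separating P33 and P37 from the other strains.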

Inferring bacterial relationships based on whole genome DNA sequences is a difficult endeavor due to the vast amount of sequence shared through horizontal gene transfer (HGT). To counter this, studies indicate that using a smaller subset of “core” genes minimizes the effect of HGT skewing phylogenetic analysis (Uchiyama 2008). However, this presents a contentious problem: what parameters define a core genome? Kuhn and Teixeira define a core gene as one that shares 90% amino acid identity and a BLAST E-value of 1e-10 or less (Kuhn and Teixeira 2004). Konstantinidis and Tiedje determined that the 70% DDH species cutoff corresponds with an average amino acid identity (AAI) of 95-96% (Wayne, Brenner et al. 1987, Konstantinidis and Tiedje 2005). In 2010, Ventura et al. defined a core genome at the genus level to be 50% pairwise identity across 50% of the query sequence. Liu et al. used the 50-50 criteria as well when studying Comamonas testosteroni strains, though they employed an additional level of filtering by setting the BLAST E-value cutoff to 1e-5 (Liu, Zhu et al. 2015). Using a series of highly conserved genes, such as those found in a core genome, we are able to resolve the phylogenetic relationships of very closely related strains. A core genome of the genus yielded 74 genes (or about 2% of the genome), which is lower than what most analyses find (den Bakker, Cummings et al. 2010). This is likely the result of using a 75% identity cutoff value. Using the popular, less stringent cutoff values of >50% PI and >50% PC, our supplemental models suggest that up to 15% of the genome could be labeled core. A phylogenetic tree built using the core genome changes the branching order of our strains by switching strains P17a, P32a and P8 (Figure 6). This is likely the most parsimonious phylogeny.

The strains were previously found to vary in their phenotypic characteristics (Pregerson 1973). Although they all produce an active urease, they were found to grow differently in the presence of 10% urea (Pregerson 1973). Since no obvious genetic difference was found at the urease loci that might explain these differences, variations in transcription or post-translational modifications are likely the cause of these differences in phenotype (Figure 10). Additional RNA-seq and ChIP-seq analysis may be needed to elucidate the cause of this variation.

Oddly, the genomes of the Sporosarcina strains, which are cocci, encode many morphogenes, such as rodA and mreB, which are thought to play a major role in imparting a rod-shaped phenotype (Jones, Carballido-López et al. 2001). I chose to bioinformatically analyze these two genes due to their important role in cellular morphology; however, clues as to the strains’ coccoid phenotype may also be found in the 7 other screened genes with known roles in cell shape, warranting future analysis. In Caulobacter crescentus, a rod-shaped bacterium, mreB is implicated in coordinating the machinery used in cellular elongation (Divakaruni, Loo et al. 2005). At this time, it is not known whether mreB in Sporosarcina is expressed; however, at the genetic level, there is variation in a highly conserved residue (230, histidine) of the gene in S. ureae, which might explain its coccoid morphology. We show that rod-shaped bacteria of the genus Sporosarcina, along with Bacillus megaterium, all share an arginine at this position. Jones et al. showed that this arginine residue is conserved in B. subtilis, Streptomyces coelicolor, E. coli, Chlamydia trachomatis, and Helicobacter pylori (Jones, Carballido-López et al. 2001). Further, when aligned on NCBI’s CDD website (ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi), residue 230 is marked as a putative protofilament interaction site. Because histidine contains an aromatic ring, which sets it apart from all other residues seen at that position, I suggest that this position, and its potential interaction with protofilaments, be investigated further.

rodA is another morphogene of interest because of its role in the rod-shape phenotype in both B. subtilis and E. coli (in the rodA-pbp2 operon) (Henriques, Glaser et al. 1998). Although the six sequenced strains of Sporosarcina share less than 40% amino acid identity with the two bacilli, the presence of the gene at all is puzzling. No obvious residue changes could be attributed to the lack of a rod phenotype in the Sporosarcina strains. Future work should examine the gene’s expression levels in Sporosarcina, and attempt to clone in a known functional copy of the gene so that the capacity for rod morphology can be examined.

Sporulation is another intriguing aspect of this organism. A popular scientific consensus is that gene expression in a sporulating cell is linked to the different cell volumes between the mother cell and the pre-spore (Stragier and Losick 1996). However, when researching the sporulation process in S. ureae, Zhang et al. found that the symmetrical division of S. ureae did not influence the expression of ftsZ, an important gene regulating the division of a cell during vegetative and sporulating conditions (Zhang, Higgins et al. 1997). The current literature offers few examples of sporulating cocci. Halobacillus halophilus, once a member of the genus Sporosarcina, was originally described as a Gram-positive spore-forming coccus. However, it differs significantly in its 16S rRNA similarity from Sporosarcina ureae (Spring, Ludwig et al. 1996), and is currently considered a spore-forming bacillus (Hänelt and Müller 2013). Additional research is needed to answer two questions: 1) how does symmetrical cell division influence spore formation, and 2) how does Sporosarcina ureae accomplish the sporulation process with nearly 60% fewer known sporulation genes than Bacillus megaterium? One area of interest for future research is the Stage 2 sporulation genes, which are known to regulate the asymmetrical division that leads to sporulation (Hilbert and Piggot 2004). Five known Stage 2 sporulation genes found in Bacillus megaterium are absent in S. ureae at a 50% identity cutoff. Analysis of these genes may shed light on how S. ureae accomplishes a symmetrical sporulation.

Beyond sequence variation, all six strains have very different epigenomic profiles (Figure 8). In fact, none of the strains appear to share any methylation patterns when grown under identical conditions. Strain S204 appears to contain only the Dam methylase, whereas all other strains contain multiple methylases. P37 has adenine methylation but lacks the cytosine methylation that is present in P17a, P32a, and P33. Puzzlingly, strain P8 has no methylation data that can be attributed to adenine or cytosine modifications, even though the genome is heavily modified. More research is needed to examine how strains P33 and P37, which contain nearly identical core genomes, can have such vast variation in their epigenomes. Although these are closely related strains based on the comparative genomics, the epigenomics demonstrates that they are different strains. I hypothesize that variations in the epigenomes allow closely related bacteria to adapt to different environments or slightly different ecological niches.

Even though these strains likely share a very similar ecological niche, soil with high levels of urea (Claus, Fritze et al. 2006), there is significant genomic and epigenomic variation between members of this species. Within the context of the modern discussion regarding what demarcates bacterial species, the strains of Sporosarcina ureae likely represent multiple different ecotypes within the species. Ecotypes have been defined as populations that differ slightly in their genetic makeup, thus allowing them to thrive in a different niche, but that overall genetically and ecologically resemble the species (Konstantinidis and Tiedje 2005). These researchers postulate that even strains that share 99% or greater ANI may be considered different ecotypes if genetic differences allow them to survive in different environments. This may explain why, in general, we see genetically diverse strains occupying the same niche around the world. It may also explain how P33 and P37, which share almost 100% ANI, were isolated 6,700 miles apart from one another.

Here we present the taxonomy of a genus and species based on modern metrics such as average nucleotide identity, average amino acid identity, core genome analysis, and a stricter 16S rRNA gene identity cutoff (98.65%). When applied to the species S. ureae, we show that there is enough genetic variation among the strains to suggest a restructuring of the genus by breaking up S. ureae into new species or subspecies. Future researchers will likely answer the question of whether bacteria can be classified into discrete groups or whether they instead exist in a “continuum of genetic diversity” (Varghese, Mukherjee et al. 2015). Until then, our research suggests that the taxonomy of the species Sporosarcina ureae, likely among many other species, could benefit from the resolving power of data derived from WGS.

Tables and Figures

Figure 1: 16S rRNA gene tree of the 6 sequenced Sporosarcina strains, 71 type-species from the family Planococcaceae, and Bacillus megaterium as an outgroup, totalling 77 gene sequences. The K2+G+I model was used in MEGA 7.0, and 1000 pseudoreplicates were done. Node support values are displayed.

Table 1: General characteristics of the six sequenced Sporosarcina strains used in this study.

Figure 2: Bar graph showing major COG (Clusters of Orthologous Groups) groups in six strains of Sporosarcina. The COGs are replication, recombination and repair (red), inorganic ion transport and metabolism (blue), lipid transport and metabolism (green), carbohydrate transport and metabolism (purple), coenzyme transport and metabolism (gold), cell wall/membrane/envelope biogenesis (black), and amino acid transport and metabolism (teal). The axis shows the number of genes.

Figure 3A: 16S rRNA gene tree of the 6 sequenced Sporosarcina ureae strains, 10 type strain species from the genus Sporosarcina, and Bacillus megaterium as an outgroup. The K2+G+I model was used in MEGA 7.0, and 1000 pseudoreplicates were done. Node support values are displayed.

Figure 3B: Core genome phylogenetic tree of the genus Sporosarcina using a MUSCLE alignment of 74 conserved amino acid sequences. The maximum likelihood tree was built in MEGA7.0 using the LG+F+G+I model. Nine publicly available whole and draft genomes, in addition to the six sequenced for this study, were used. 1000 pseudoreplicates were done and the node support values are displayed.

Figure 4: ACT (Artemis Comparison Tool) alignment plot of six strains of Sporosarcina. Bands indicate shared genes. Red bands are genes shared in the same direction and blue bands are genes shared in reverse directions. The genomes are, from top: S204, P17a, P8, P32a, P33, P37.

Figure 5: Circos-generated heatmap, using percent amino acid identity, of Sporosarcina S204 vs (top to bottom): S. newyorkensis, P37, P33, P32a, P8, P17a, and DSM 2281.

Figure 6: Core genome phylogenetic tree of the species Sporosarcina ureae using a MUSCLE alignment of 874 conserved amino acid sequences. The maximum likelihood tree was built in MEGA7.0 using the LG+F+G+I model. 1000 pseudoreplicates were done and the node support values are displayed.

Figure 7: Circos generated genome map of Sporosarcina ureae S204. The rings represent 1) COG groups, 2) reverse genes, 3) forward genes, 4) ribosomal loci (red), sporulation genes (blue), and urease loci (purple), 5) GC content.


Figure 8: Circos plot showing the type and location of epigenetic modifications in six strains of Sporosarcina. Line color indicates the type of modification: adenine (red), cytosine (blue), and unknown (green). From outside: P17a, P32a, S204, P8, P37, and P33.


Figure 9: Circos heatmap plot of all spore genes found in the six strains of Sporosarcina compared to all spore genes found in Bacillus megaterium. The rings are, from outside: 1) P37, 2) P33, 3) P32a, 4) P8, 5) P17a, 6) S204.

Figure 10: MUSCLE alignment of the urease loci in 6 strains of Sporosarcina. No obvious changes were found in residues that would explain the differences in urease kinetics.

Figure 11A: MUSCLE amino acid alignment of the rodA gene. Black highlights indicate 100% sequence identity at that residue.

Figure 11B: MUSCLE alignment of the mreB gene. Black highlights indicate 100% sequence identity at that residue.

References Cited

Abbott, J. C., D. M. Aanensen and S. D. Bentley (2008). "WebACT." Comparative

Genomics: 57-74.

Abhayawardhane, Y. and G. C. Stewart (1995). "Bacillus subtilis possesses a second

determinant with extensive sequence similarity to the Escherichia coli mreB

morphogene." Journal of Bacteriology 177(3): 765-773.

Acinas, S. G., L. A. Marcelino, V. Klepac-Ceraj and M. F. Polz (2004). "Divergence and

Redundancy of 16S rRNA Sequences in Genomes with Multiple rrn Operons."

Journal of Bacteriology 186(9): 2629-2635.

Arndt, D., J. R. Grant, A. Marcu, T. Sajed, A. Pon, Y. Liang and D. S. Wishart (2016).

"PHASTER: a better, faster version of the PHAST phage search tool." Nucleic

acids research: gkw387.

Aziz, R. K., D. Bartels, A. A. Best, M. DeJongh, T. Disz, R. A. Edwards, K. Formsma, S.

Gerdes, E. M. Glass, M. Kubal, F. Meyer, G. J. Olsen, R. Olson, A. L. Osterman,

R. A. Overbeek, L. K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G. D.

Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke and O. Zagnitko

(2008). "The RAST Server: Rapid Annotations using Subsystems Technology."

BMC Genomics 9: 75.

Beijerinck, M. W. (1901). "Anhäufungsversuche mit ureumbakterien. Ureumspaltung

durch urease und durch katabolismus." Zentralbl. Bakteriol. Parasitenkd.

Infektionskr. Hyg. II Abt 7: 33-61.

Camacho, C., G. Coulouris, V. Avagyan, N. Ma, J. Papadopoulos, K. Bealer and T. L.

Madden (2009). "BLAST+: architecture and applications." BMC bioinformatics

10(1): 1.

Carver, T. J., K. M. Rutherford, M. Berriman, M.-A. Rajandream, B. G. Barrell and J.

Parkhill (2005). "ACT: the Artemis comparison tool." Bioinformatics 21(16):

3422-3423.

Chen, P., R. Jeannotte and B. C. Weimer (2014). "Exploring bacterial epigenomics in the

next-generation sequencing era: A new approach for an emerging frontier."

Trends in Microbiology 22(5): 292-300.

Claus, D., D. Fritze and M. Kocur (2006). Genera Related to the Genus Bacillus—

Sporolactobacillus, Sporosarcina, , Filibacter and Caryophanon.

The , Springer: 631-653.

Cooper, K. K., R. E. Mandrell, J. W. Louie, J. Korlach, T. A. Clark, C. T. Parker, S.

Huynh, P. S. Chain, S. Ahmed and M. Q. Carter (2014). "Comparative genomics

of enterohemorrhagic Escherichia coli O145:H28 demonstrates a common

evolutionary lineage with Escherichia coli O157:H7." BMC genomics 15: 17.

Daniel, R. A. and J. Errington (2003). "Control of cell morphogenesis in bacteria: Two

distinct ways to make a rod-shaped cell." Cell 113(6): 767-776.

den Bakker, H. C., C. A. Cummings, V. Ferreira, P. Vatta, R. H. Orsi, L. Degoricija, M.

Barker, O. Petrauskene, M. R. Furtado and M. Wiedmann (2010). "Comparative

genomics of the bacterial genus Listeria: Genome evolution is characterized by

limited gene acquisition and limited gene loss." BMC genomics 11(1): 688.

Divakaruni, A. V., R. R. O. Loo, Y. Xie, J. A. Loo and J. W. Gober (2005). "The cell-

shape protein MreC interacts with extracytoplasmic proteins including cell wall

assembly complexes in Caulobacter crescentus." Proceedings of the National

Academy of Sciences of the United States of America 102(51): 18602-18607.

Edgar, R. C. (2004). "MUSCLE: Multiple sequence alignment with high accuracy and

high throughput." Nucleic Acids Research 32(5): 1792-1797.

Errington, J. (2015). "Bacterial morphogenesis and the enigmatic MreB helix." Nature

Reviews Microbiology 13(4): 241-248.

Figge, R. M., A. V. Divakaruni and J. W. Gober (2004). "MreB, the cell-shape

determining bacterial actin homolog, coordinates cell wall morphogenesis in."

Genomics 51: 1-47.

Gevers, D., F. M. Cohan, J. G. Lawrence, B. G. Spratt, T. Coenye, E. J. Feil, E.

Stackebrandt, Y. V. D. Peer, P. Vandamme, F. L. Thompson and J. Swings

(2005). "Re-evaluating prokaryotic species." Nature Reviews Microbiology

3(September): 733-739.

Henriques, A. O., P. Glaser, P. J. Piggot and C. P. Moran (1998). "Control of cell shape

and elongation by the rodA gene in Bacillus subtilis." Molecular microbiology

28(2): 235-247.

Hilbert, D. W. and P. J. Piggot (2004). "Compartmentalization of Gene Expression

during Bacillus subtilis Spore Formation." Microbiology and Molecular Biology

Reviews 68(2): 234-262.

Hug, L. A., B. J. Baker, K. Anantharaman, C. T. Brown, A. J. Probst, C. J. Castelle, C. N.

Butterfield, A. W. Hernsdorf, Y. Amano, I. Kotaro, Y. Suzuki, N. Dudek, D. A.

Relman, K. M. Finstad, R. Amundson, B. C. Thomas and J. F. Banfield (2016).

"A new view of the tree and life's diversity." (April): Manuscript submitted for

publication.

Hänelt, I. and V. Müller (2013). "Molecular mechanisms of adaptation of the moderately

halophilic bacterium Halobacillis halophilus to its environment." Life 3(1): 234-

243.

Ilk, N., E. M. Egelseer, J. Ferner-Ortner, S. Küpcü, D. Pum, B. Schuster and U. B. Sleytr

(2008). "Surfaces functionalized with self-assembling S-layer fusion proteins for

nanobiotechnological applications." Colloids and Surfaces A: Physicochemical

and Engineering Aspects 321(1-3): 163-167.

Ilk, N., E. M. Egelseer and U. B. Sleytr (2011). "S-layer fusion proteins-construction

principles and applications." Current Opinion in Biotechnology 22(6): 824-831.

Jones, L. J. F., R. Carballido-López and J. Errington (2001). "Control of cell shape in

bacteria: Helical, actin-like filaments in Bacillus subtilis." Cell 104(6): 913-922.

Kawai, Y., K. Asai and J. Errington (2009). "Partial functional redundancy of MreB

isoforms, MreB, Mbl and MreBHp in cell morphogenesis of Bacillus subtilis."

Molecular Microbiology 73(4): 719-731.

Kearse, M., R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A.

Cooper, S. Markowitz and C. Duran (2012). "Geneious Basic: an integrated and

extendable desktop software platform for the organization and analysis of

sequence data." Bioinformatics 28(12): 1647-1649.

Kim, M., H. S. Oh, S. C. Park and J. Chun (2014). "Towards a taxonomic coherence

between average nucleotide identity and 16S rRNA gene sequence similarity for

species demarcation of prokaryotes." International Journal of Systematic and

Evolutionary Microbiology 64(Pt 2): 346-351.

Kluckhohn, L. (1986). "A Bacteriophage of Sporosarcina ureae." California State

University Northridge Masters Thesis.

Kluyver, A. J. and C. B. van Niel (1936). "Prospects for a natural system of classification

of bacteria." Zentralblatt fur Bakteriologie, Parasitenkunde, Infektionskrankheiten

und Hygiene 94: 369-403.

Kocur, M. and T. Martinec (1963). "The taxonomic status of Sporosarcina ureae

(Beijerinck) Orla-Jensen." International Bulletin of Bacteriological Nomenclature

and Taxonomy 13(4): 201-209.

Konstantinidis, K. T. and J. M. Tiedje (2005). "Genomic insights that advance the species

definition for prokaryotes." Proceedings of the National Academy of Sciences of

the United States of America 102(7): 2567-2572.

Konstantinidis, K. T. and J. M. Tiedje (2005). "Towards a genome-based taxonomy for

prokaryotes." Journal of Bacteriology 187(18): 6258-6264.

Krzywinski, M., J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones and

M. A. Marra (2009). "Circos: an information aesthetic for comparative

genomics." Genome research 19(9): 1639-1645.

Kuhn, J. and M. Teixeira (2004). "Towards a "core" genome: pairwise similarity searches

on interspecific genomic data." 2010: 2-5.

Kumar, S., G. Stecher and K. Tamura (2016). "MEGA7: Molecular Evolutionary

Genetics Analysis version 7.0 for bigger datasets." Molecular Biology and

Evolution: msw054.

Liu, L., W. Zhu, Z. Cao, B. Xu, G. Wang and M. Luo (2015). "High correlation between

genotypes and phenotypes of environmental bacteria Comamonas testosteroni

strains." BMC genomics 16: 110.

Maidak, B. L., J. R. Cole, T. G. Lilburn, C. T. Parker Jr, P. R. Saxman, R. J. Farris, G. M.

Garrity, G. J. Olsen, T. M. Schmidt and J. M. Tiedje (2001). "The RDP-II

(ribosomal database project)." Nucleic acids research 29(1): 173-174.

Marchler-Bauer, A., S. Lu, J. B. Anderson, F. Chitsaz, M. K. Derbyshire, C. DeWeese-

Scott, J. H. Fong, L. Y. Geer, R. C. Geer and N. R. Gonzales (2011). "CDD: a

Conserved Domain Database for the functional annotation of proteins." Nucleic

acids research 39(suppl 1): D225-D229.

McCarthy, A. (2010). "Third generation DNA sequencing: Pacific biosciences' single

molecule real time technology." Chemistry and Biology 17(7): 675-676.

McCoy, D. D., A. Cetin and R. P. Hausinger (1992). "Characterization of urease from

Sporosarcina ureae." Archives of Microbiology 157(5): 411-416.

Mou, K. T., U. K. Muppirala, A. J. Severin, T. A. Clark, M. Boitano and P. J. Plummer

(2014). "A comparative analysis of methylome profiles of Campylobacter jejuni

sheep abortion isolate and gastroenteric strains using PacBio data." Frontiers in

Microbiology 5(DEC): 1-15.

Orla-Jensen, S. (1909). "Die Hauptlinien des natürlichen Bakteriensystems." Zentralbl.

Bakteriol. Parasitenkd. Infektionskr. Hyg. Abt 2(22): 305-346.

Pijper, A., C. G. Crocker and N. Savage (1955). "Sarcinae: motility, kind of flagella, and

specific agglutination." Journal of bacteriology 69(2): 151.

Pirone-Davies, C., M. Hoffmann, R. J. Roberts, T. Muruvanda, R. E. Timme, E. Strain,

Y. Luo, J. Payne, K. Luong, Y. Song, Y. C. Tsai, M. Boitano, T. A. Clark, J.

Korlach, P. S. Evans and M. W. Allard (2015). "Genome-wide methylation

patterns in Salmonella enterica subsp. enterica Serovars." PLoS ONE 10(4): 1-13.

Pregerson, B. (1973). "The Distribution and Physiology of Sporosarcina ureae."

California State University Northridge Masters Thesis.

Pruesse, E., J. Peplies and F. O. Glöckner (2012). "SINA: accurate high-throughput

multiple sequence alignment of ribosomal RNA genes." Bioinformatics 28(14):

1823-1829.

Rhoads, A. and K. F. Au (2015). "PacBio Sequencing and Its Applications." Genomics,

Proteomics and Bioinformatics 13(5): 278-289.

Richter, M. and R. Rosselló-Móra (2009). "Shifting the genomic gold standard for the

prokaryotic species definition." Proceedings of the National Academy of Sciences

of the United States of America 106(45): 19126-19131.

Risen, L. P. (1996). Multilocus Genetic Structure in Populations of Sporosarcina ureae

and the Assessment of Hexose Utilization, California State University,

Northridge.

Rothschild, L. (2012). Innovation in Space

Discoveries: Is There Life Out There? Horizon Lectures, Norway, Bergen.

Shivaji, S., T. N. R. Srinivas and G. S. N. Reddy (2014). The Family Planococcaceae.

The Prokaryotes, Springer: 303-351.

Siguier, P., J. Pérochon, L. Lestrade, J. Mahillon and M. Chandler (2006). "ISfinder: the

reference centre for bacterial insertion sequences." Nucleic acids research

34(suppl 1): D32-D36.

Spring, S., W. Ludwig, M. C. Marquez, A. Ventosa and K. H. Schleifer (1996).

" gen. nov., with Descriptions of Halobacillus litoralis sp. nov. and

Halobacillus trueperi sp. nov., and Transfer of Sporosarcina halophila to

Halobacillus halophilus comb. nov." International Journal of Systematic

Bacteriology 46(2): 492-496.

Stackebrandt, E. and B. M. Goebel (1994). "Taxonomic Note: A Place for DNA-DNA

Reassociation and 16S rRNA Sequence Analysis in the Present Species Definition

in Bacteriology." International Journal of Systematic Bacteriology 44(4): 846-

849.

Stragier, P. and R. Losick (1996). "Molecular genetics of sporulation in Bacillus subtilis."

Annual review of genetics 30(1): 297-341.

Uchiyama, I. (2008). "Multiple genome alignment for identifying the core structure

among moderately related microbial genomes." BMC genomics 9: 515.

Varghese, N. J., S. Mukherjee, N. Ivanova, K. T. Konstantinidis, K. Mavrommatis, N. C.

Kyrpides and A. Pati (2015). "Microbial species delineation using whole genome

sequences." Nucleic acids research 43(14): gkv657.

Vos, P., G. Garrity, D. Jones, N. R. Krieg, W. Ludwig, F. A. Rainey, K.-H. Schleifer and

W. Whitman (2011). Bergey's Manual of Systematic Bacteriology: Volume 3:

The Firmicutes, Springer Science & Business Media.

Wayne, L. G., D. J. Brenner, R. R. Colwell, P. A. D. Grimont, O. Kandler, M. I.

Krichevsky, L. H. Moore, W. E. C. Moore, R. Murray and E. Stackebrandt

(1987). "Report of the ad hoc committee on reconciliation of approaches to

bacterial systematics." International Journal of Systematic and Evolutionary

Microbiology 37(4): 463-464.

Zamfirescu, C. and I. Dincer (2009). "Ammonia as a green fuel and hydrogen source for

vehicular applications." Fuel Processing Technology 90(5): 729-737.

Zhang, L., M. L. Higgins and P. J. Piggot (1997). "The division during bacterial

sporulation is symmetrically located in Sporosarcina ureae." Molecular

microbiology 25(6): 1091-1098.

Appendix A: Supplemental Tables and Figures

Figure S1: MATLAB models of the core gene count in the species (left) and genus (right). The models are based on two variables: gene percent identity and coverage.

Table S1: Raw methylation data for the six strains of Sporosarcina ureae that are presented in Figure 8.

Table S2: COG groups and gene counts for six strains of Sporosarcina.

import glob
import os
from collections import defaultdict

import matplotlib.pyplot as plt
import pandas as pd

base = os.getcwd()

# Create COGs with aligned fasta
os.chdir(base + '/raw')
files = glob.glob('*.csv')
dict_list = []

for file in files:
    input_file = pd.read_csv(file, names=['concat', 'CDD', 'eval'])
    concat_set = list(set(input_file['concat'].tolist()))
    cog_tuples = []

    for concat in concat_set:
        # Keep the hit with the lowest E-value for each query
        temp = input_file[input_file['concat'] == concat]
        best_eval = min(temp['eval'].tolist())
        cdd_id = temp[temp['eval'] == best_eval]['CDD'].tolist()[0].split('|')[2]

        # Map CDD id -> COG id -> category letter -> category description
        with open('cddid.tbl') as cdd_file, open('whog.txt') as whog, open('fun.txt') as fun:
            for line in cdd_file:
                if cdd_id in line:
                    cog = line.split(' ')[1]
                    for line in whog:
                        if cog in line:
                            letter = line.split(' ')[0]
                            for line in fun:
                                if letter in line:
                                    function = line.split('] ')[1]
                                    function = function.split('\n')[0]
                                    tup = (concat, cog, letter, function)
                                    cog_tuples.append(tup)
                                    break
                            break
                    break

    # Count the genes assigned to each COG category in this file
    cog_dict = defaultdict(int)
    for tup in cog_tuples:
        cog_dict[tup[3]] += 1
    dict_list.append(cog_dict)

# Compare the category counts of the first two files
perc = []
for key in dict_list[0]:
    reference = dict_list[1][key]
    if reference == 0:
        continue  # category absent from the second file; avoid dividing by zero
    diff = abs(dict_list[0][key] - reference)
    similarity = 1 - diff / reference
    perc.append((similarity, key))
    print(key, reference, dict_list[0][key], diff, similarity)

perc = sorted(perc, key=lambda tup: tup[0])

x = []
y = []
for n in range(len(perc)):
    y.append(perc[n][0])
    x.append(n)

plt.plot(x, y)
plt.show()

Figure S2: Short Python script showing how the COG information was generated.