
COMPLETE GENOME SEQUENCE OF THE HYPERTHERMOPHILIC BACTERIUM THERMOTOGA SP. STRAIN RQ7

Rutika Puranik

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

May 2015

Committee:

Zhaohui Xu, Advisor

Scott Rogers

George Bullerjahn

© 2015

Rutika Puranik

All Rights Reserved

ABSTRACT

Zhaohui Xu, Advisor

The genus Thermotoga is one of the deep-rooted genera in the phylogenetic tree of life and has been studied for its thermostable enzymes and its hydrogen production at high temperatures. The current study focuses on the complete genome sequencing of T. sp. strain RQ7 in order to identify, through comparative genomics, the properties that are conserved and those that vary between this strain and other members of its genus. A pipeline was developed to assemble the complete genome from next generation sequencing (NGS) data. The pipeline successfully combined computational approaches with wet lab experiments to deliver a completed genome of T. sp. strain RQ7, which has a size of 1,851,618 bp and a GC content of 47.1%. The genome has been submitted to GenBank under accession CP007633. Comparative genomic analysis of this genome with three other Thermotoga strains helped identify putative natural transformation and competence genes, as well as the absence of the TneDI restriction-modification system in T. sp. strain RQ7. Genome analysis also helped identify the unique genes of T. sp. strain RQ7 and its CRISPR/Cas system. This strain has 8 CRISPR loci and an array of Cas-coding genes across the genome. The genome sequence of this strain provides a platform for developing genetic tools, which would make these strains industrially applicable for biofuel generation.

I dedicate this work to my parents Mr. Rajiv Puranik and Mrs. Radhika Puranik.

ACKNOWLEDGMENTS

I sincerely thank my advisor Dr. Zhaohui Xu for constantly guiding me in my entire course of Master’s program. I am grateful for her support and patience. I wish to thank my committee members Dr. Scott Rogers and Dr. George Bullerjahn for accepting to be on my thesis committee and their valuable inputs. I appreciate the enormous support and help of my lab members Dr. Dongmei Han, Hui Xu and Dr. Uksha Saini and all my friends in the department. I am especially thankful to my husband Akshay Joshi for motivating me and believing in me.


TABLE OF CONTENTS

I. INTRODUCTION

Genus Thermotoga

Genome Sequencing

Genome Annotation

II. MATERIALS AND METHODS

Growth conditions and DNA isolation

Genome sequencing by BGI America’s

Genome assembly and analysis

Overview of completing the genome

Primer Walking and PCR

Genome annotation

III. RESULTS

Completion of Thermotoga sp. strain RQ7 genome

Genome details and features

Features analyzed in the genome

Unique genes in T. sp. strain RQ7

Natural Transformation

The Type II Restriction-Modification system TneDI

CRISPRs

IV. DISCUSSION

V. CONCLUSIONS

VI. REFERENCES

LIST OF FIGURES

1 Phylogenetic tree of life based on the small subunit rRNA sequences

2 Small subunit rRNA phylogeny of different Thermotoga strains

3 The principle of Sanger sequencing

4 Schematic representation of steps involved in Illumina sequencing technology

5 The approach of Paired-End reads

6 Multistep annotation process

7 Dataflow schematic for genome annotation

8 Method used by BGI America’s for data production and quality control

9 The pipeline of genome assembling and gap closure

10 Schematic overview of GapFish algorithm

11 Schematic representation of the steps performed during primer walking

12 Diagram summarizing the overall outline of prokaryotic genome annotation pipeline

13 Multiple genome alignment of four different Thermotoga isolates using Mauve alignment tool

14 Amplification of randomly selected genes in the big gap of T. sp. strain RQ7

15 Wet lab approach to confirm the existence of the minigaps

16 Differences in obtaining PCR products for easy genes and difficult genes found in the big gap of T. sp. strain RQ7

17 The approach of Nested PCR for amplifying the difficult genes in the big gap of T. sp. strain RQ7

18 Nested PCR with internal primers for obtaining the final PCR product of difficult genes

19 Metabolic reaction of sulfate reduction pathway

20 CDD (Conserved Domain Database) analysis using amino acid sequences of the PilZ protein in V. cholerae, P. aeruginosa and T. sp. strain RQ7

21 CDD analysis using amino acid sequences of PilB protein in V. cholerae, P. aeruginosa and T. sp. strain RQ7

22 CDD analysis of PilQ amino acid sequence in N. gonorrhoeae showing the presence of important conserved secretin domain

23 Conserved domains of PilC protein in P. stutzeri and putative PilC in T. sp. strain RQ7 analyzed by CDD

24 CDD analysis using amino acid sequences outlines the conserved domains of PilD in N. gonorrhoeae, V. vulnificus and putative PilD in T. sp. strain RQ7

25 Outline of the conserved domains of PilT protein in N. gonorrhoeae and putative PilT in T. sp. strain RQ7

26 CDD analysis using amino acid sequences of PilE in P. aeruginosa and N. gonorrhoeae and putative PilE in T. sp. strain RQ7

27 CDD analysis using amino acid sequence of ComM in H. influenzae and putative ComM in T. sp. strain RQ7

28 CDD analysis of putative ComE in T. sp. strain RQ7 and ComE in the reference organism B. subtilis

29 CDD analysis using amino acid sequence of ComEC protein in B. subtilis and putative ComEC in T. sp. strain RQ7

30 CDD analysis using amino acid sequence of putative ComFC in T. sp. strain RQ7 and ComFC in B. subtilis

31 CDD analysis using amino acid sequences of DprA protein in H. influenzae and putative DprA protein in T. sp. strain RQ7

32 Deletion of TneDI R-M system in T. sp. strain RQ7

33 Diagrammatic representation of CRISPR/Cas system in T. sp. strain RQ7

LIST OF TABLES

1 Comparison of different Next generation sequencing platforms

2 Illumina libraries for the genome of T. sp. strain RQ7

3 List of all the primers used to sequence the big gap of ~36 kb in the genome of T. sp. strain RQ7

4 Comparisons of assemblies using different methods

5 Comparison of the big gap region among different Thermotoga genomes

6 Statistics of assembling process

7 Sanger sequences of T. sp. strain RQ7 available online from other studies

8 ORFs differentially annotated in the complete genome

9 List of unique genes in this genome

10 Putative competence genes in the four Thermotoga strains

11 Total number of CRISPR loci and spacers among the four Thermotoga strains after comparative genomics

12 Overview of each CRISPR locus in T. sp. strain RQ7 together with its position on the genome, number of total spacers, sequences of direct repeats and homology of the repeats to other Thermotogales


I. INTRODUCTION

Genus Thermotoga consists of hyperthermophilic, anaerobic, Gram-negative, rod-shaped bacteria that possess a unique outer membrane known as the “toga” [1]. The optimum growth temperature of Thermotoga is about 80 °C [2]. The organisms of this genus are well known for their fermentative hydrogen production [3]. Thermotoga sp. strain RQ7 was isolated from geothermally heated seafloor at Ribeira Quente, the Azores, and has an optimum growth temperature of 76-82 °C [1]. To date (03/14/2015), 12 complete genomes of this genus are available. T. sp. strain RQ7 harbors an 846 bp plasmid, pRQ7, which replicates by a rolling-circle mechanism [4-6]. An earlier small subunit rRNA gene phylogeny of all Thermotoga strains clustered T. sp. strain RQ7 with T. neapolitana [7]. The complete genome of T. sp. strain RQ7 was sequenced in the current study, and its analysis supported the conclusion that this strain is most closely related to T. neapolitana.

1.1 Genus Thermotoga

Thermotoga are known to inhabit many different environmental niches, such as geothermally heated seafloors [1], solfataric springs (T. thermarum, T. neapolitana) [8], and oil reservoirs (T. petrophila) [9], which they share with many archaea. These bacteria are well known for their ability to utilize a variety of sugars, to produce hydrogen, and to act as sources of thermostable enzymes [10]. Because the reported hydrogen yield approaches the Thauer limit for anaerobic fermentation (4 mol H2/mol glucose), this genus has attracted interest for biohydrogen production [3]. The first organism of this genus, T. maritima MSB8, was isolated in 1986 from the geothermally heated sea floors of Italy and grows at temperatures between 55 °C and 90 °C [1]. This genus is represented by anaerobic, rod-shaped, non-spore-forming, motile organisms with a monotrichous flagellum [1, 11]. Small subunit rRNA analysis has placed this genus at the root of the phylogenetic tree of life (Figure 1) [11, 12].

Figure 1: Phylogenetic tree of life based on the small subunit rRNA sequences [12]

Understanding the metabolic and evolutionary significance of this genus began with the completion of the T. maritima MSB8 genome sequence in 1999 [11]. This project revealed the existence of lateral gene transfer events and the presence of archaeal genes in Thermotoga. Most of the laterally transferred archaeal genes in Thermotoga belong to the family of ABC transporters [13]. Many groups have studied the phylogenetic position of this genus, as its members share 7.7-11% of their genes with archaea and 42.3-48.2% with Firmicutes [3]. Mongodin et al. conducted a comparative genomic hybridization study of nine sequenced Thermotoga strains against T. maritima in order to understand the extent of genomic diversity among the organisms of this genus. The study also included a phylogenetic analysis of the Thermotoga strains based on small subunit rRNA (Figure 2). The phylogenetic position and gene transfer events of these hyperthermophilic organisms have been pursued as a means of studying the evolution of this genus and its adaptation to extreme temperatures.

Figure 2: Small subunit rRNA phylogeny of different Thermotoga strains. Relatedness was studied with data from a comparative genomic hybridization (CGH) study using T. maritima as the reference genome. *, strains compared in the CGH study [7]

1.2 Genome Sequencing

Sanger sequencing using chain-terminating inhibitors is one of the oldest methods for determining the sequence of DNA fragments (Figure 3). In 1977, Sanger sequenced the genome of bacteriophage ɸX174 using the chemistry of 2’,3’-dideoxy and arabinonucleoside analogues [14]. Craig Venter’s project to sequence the complete genome of Haemophilus influenzae using new computational approaches marked the beginning of a new era in biology, as this was the first bacterial genome to be completed.


Figure 3: The principle of Sanger sequencing. Single stranded DNA is used as the template with four fluorescently labeled dNTPs (http://www.jenabioscience.com).

The rapid development of genome sequencing technology has led to the quick generation of numerous sequences, helping researchers better understand the variety of life forms from bacteria to eukaryotes. Bacterial genomes are smaller than those of eukaryotes, which makes them relatively easy to sequence and makes the resulting data easier to handle.

Table 1: Comparison of different Next generation sequencing platforms [15]

(CCD – charge-coupled device; dNTP – deoxynucleoside triphosphate; PCR – polymerase chain reaction; SNP – single nucleotide polymorphism)

Next generation sequencing (NGS) has revolutionized the field of genomics. Platforms that perform massively parallel sequencing create millions of DNA fragments from a small amount of sample [16]. Automation allows projects to be completed in a matter of hours or days, with higher throughput and at lower cost. Several platforms use NGS technology to produce large amounts of genomic data (Table 1). High-throughput Illumina sequencing is one of the platforms commonly used for complete genome sequencing projects. An overview of the protocol is represented diagrammatically in Figure 4.

Figure 4: Schematic representation of steps involved in Illumina sequencing technology [15]

The sequencing process begins with fragmentation of genomic DNA (gDNA) into small fragments to form libraries, followed by parallel sequencing to produce millions of reads. These reads are then assembled into a scaffold, either by using a reference genome or, in the absence of one, de novo (http://www.illumina.com/technology/next-generation-sequencing.html). De novo assemblies have large numbers of gaps in the final genome as a result of the short read lengths. Thus, many NGS platforms provide paired-end sequencing, in which the gDNA fragments are read from both ends, so that each region is covered more times (Figure 5).

Figure 5: The approach of Paired-End reads. Paired-end reads allow a DNA fragment to be sequenced from both ends, making it easier to align to the reference sequence, especially in difficult regions of DNA where repeats are present (http://www.illumina.com/technology/next-generation-sequencing.html).

As shown in Figure 4, the adapter-ligated DNA library is anchored to the flow cell lanes. Bridge amplification using unlabelled nucleotides creates millions of copies of the library, which are then used as templates for the next step, sequencing by synthesis using fluorescently labelled dNTPs. The fluorescent nucleotides act as reversible terminators. Thus, NGS platforms have made it convenient to obtain reliable and accurate data within a short period of time, helping researchers study a wide array of new genomes.

1.3 Genome Annotation

Genome sequencing generates millions of base pairs of data, which need to be interpreted so as to describe the physiological, ecological and functional properties of the organism under consideration. This task of making sense of the large raw data set is primarily performed by genome annotation. It is a multistep process, as shown in Figure 6, that mainly includes nucleotide-level annotation, protein-level annotation and process-level annotation. Gene mapping and gene finding are part of nucleotide-level annotation, in which ORFs (open reading frames) are identified and annotated [17]. Computational approaches are used for all levels of annotation. A large number of software tools are currently available to complete these different levels of annotation.

Figure 6: Multistep annotation process. The process begins with nucleotide- level annotation followed by protein- level annotation and ends with process-level annotation [17]

Automated genome annotation utilizes a wide range of methods to predict the final set of genes, but manual curation is frequently required as well. Genome-scale comparisons and experimental evidence are used as the basis for manual intervention. The process of annotation follows a data flow scheme that is applied after genome sequencing (Figure 7). Statistical gene prediction and general database searches are integral to predicting gene functions and form the backbone of this flow. Specialized database searches may include domain-prediction databases such as CDD (Conserved Domain Database), Pfam (Protein families database) and SMART (Simple Modular Architecture Research Tool), or genome-based databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) and COG (Clusters of Orthologous Groups of proteins) [18].

Figure 7: Dataflow schematic for genome annotation [18]. FB: feedback.

With the availability of these computational platforms and automated NGS technologies, a large number of bacterial genomes are being made available.


II. MATERIALS AND METHODS

2.1 Growth conditions and DNA isolation

T. sp. strain RQ7 was cultivated in SVO medium developed by van Ooteghem et al. [19]. The genomic DNA of T. sp. strain RQ7 was prepared by phenol extraction. The protocol for handling Thermotoga cultures and for large-scale DNA extraction, performed by Dr. Dongmei Han, was detailed previously [20]. Genomic DNA extracted at large scale, referred to as old DNA, was stored at -20 °C for up to a year and was used to sequence the genome of T. sp. strain RQ7 by Illumina sequencing as well as to amplify some of the gaps during the later part of this study. In the small-scale protocol, cells were collected from 10 ml of overnight Thermotoga culture by centrifuging at 12,000 x g for 1 min and washed with 800 µl of STE buffer. After centrifuging again at 12,000 x g for 1 min, the cell pellet was resuspended in 180 µl of Solution I (50 mM Tris adjusted to pH 8.0 with HCl, 10 mM EDTA). Then 120 µl of 10% SDS was added and mixed well, and the tube was incubated at 60 °C for 30 min. Next, 25 µl of 5 M NaCl was added and mixed well, and the tube was centrifuged at 15,000 x g for 6-8 min. The supernatant was transferred into a new tube, and an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1, v/v) was added to remove protein. This step was repeated 2-3 times until no protein layer was observed. Once a protein-free solution was obtained, two volumes of ethanol were added to precipitate the total DNA. The tube was incubated at -20 °C for 15 min and then centrifuged for 8-10 min to collect the total DNA. The DNA was washed with 70% ethanol, dried, and dissolved in ddH2O, and RNase was added to a final concentration of 10 µg/ml. DNA was extracted every 3-5 days (referred to as fresh DNA) to sequence the difficult regions of the genome, as they required freshly prepared DNA.

2.2 Genome sequencing by BGI America’s

The genome was sequenced by BGI America’s (Cambridge, MA) under a service agreement, using high-throughput Illumina technology to conduct paired-end sequencing of our DNA sample.

Table 2: Illumina libraries for the genome of T. sp. strain RQ7

Three libraries of different sizes were generated for T. sp. strain RQ7: 500 bp, 2000 bp and 5000 bp (Table 2). Paired-end sequencing produced reads of 90 bp with the 500 bp library and reads of 49 bp with the 2000 bp and 5000 bp libraries, generating 400 Mb of clean data in total. Clean data were generated from the raw sequencing data in several steps, which removed adapter contamination, reads containing more than a certain proportion of N’s, and low-quality reads (Figure 8).

Figure 8: Method used by BGI America’s for data production and quality control. This process ensures generation of clean data from raw data by setting cut-offs
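The clean-up outlined in Figure 8 can be sketched as a simple read filter. This is a minimal illustration and not the procedure actually run by the sequencing provider: the function name `is_clean`, the placeholder adapter string, and all three cut-off values are assumptions chosen for the example, because the production thresholds are not stated here.

```python
# Hypothetical read filter: every name and threshold below is an assumption.
ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter fragment, for illustration only


def is_clean(read, quals, adapter=ADAPTER,
             max_n_frac=0.10, min_qual=20, max_low_frac=0.50):
    """Return True if a read passes three illustrative clean-up filters.

    read  -- nucleotide string
    quals -- list of per-base quality scores, same length as read
    """
    if not read:
        return False
    # 1. Discard reads contaminated with adapter sequence.
    if adapter in read:
        return False
    # 2. Discard reads with too large a proportion of undetermined bases.
    if read.count("N") / len(read) > max_n_frac:
        return False
    # 3. Discard reads in which too many bases are of low quality.
    low = sum(1 for q in quals if q < min_qual)
    if low / len(read) > max_low_frac:
        return False
    return True
```

In a paired-end setting one would usually drop both mates when either one fails, so that the pairing information used later for scaffolding stays consistent.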


2.3 Genome assembly and analysis

SOAPdenovo was used for the initial assembly of the large amount of data generated [21, 22]. This was followed by initial gap filling and single-base correction, performed with SOAPaligner (http://soap.genomics.org.cn/soapaligner.html). The output comprised 56 scaffolds, of which only one was a first draft of the genome. Initial gene prediction was performed using Glimmer 3.0.

2.4 Overview of completing the genome

A pipeline designed for scaffold assembling and gap closure is described in detail in our recent publication [23]. The genome was completed by combining seven different steps that are outlined in Figure 9.

Step 1. De novo assembling using the SOAP package as described above

Step 2. Comparative genomics and wet lab validations

This step not only helped identify the closest relative of T. sp. strain RQ7 but also validated the initial assembly of the genome. In wet lab validations, PCR (polymerase chain reaction) and Sanger sequencing were the primary techniques used.

Step 3. Integration of the assembly based on a reference genome

Using the commercially available CLC Genomics Workbench package, an independent assembly was undertaken based on the reference genome identified in Step 2. Combining this assembly with the data from Step 1 produced a hybrid assembly, which was further revised in the following steps.

Figure 9: The pipeline of genome assembling and gap closure. Clean data were subject to de novo assembling, initial gap filling, and single base corrections with SOAPdenovo and SOAPaligner; the resulting assembly was used for comparative genomics studies and also provided guidance for wet lab validations. Meanwhile, the clean data were assembled separately based on a reference genome using CLC Genomics Workbench. This second assembly was integrated into the first one to yield a hybrid assembly, which was then updated with public data, GapFish results, and Sanger sequencing data until the genome sequence was complete [23]

Step 4. Integration of public data

Sequences obtained for T. sp. strain RQ7 by searching public databases were useful not only for confirming the hybrid assembly but also for filling some small gaps.

Step 5. “Bait-and-Fish”

This step was implemented using GapFish, a program developed in house and written in Python (Figure 10). It assisted in closing the gaps remaining in the assembly. The program works on the concept of “Bait-and-Fish” (or seed and extend). The input is the sequence located immediately upstream of the gap, which acts as the “bait”, while the output is the sorted list of “fish”, the sequences that are located in the same reads as the “bait” but immediately downstream of it [23]. A consensus sequence is identified from the output list, and a new “bait” is chosen, namely the second or third longest “fish”. This process is repeated until the gap is completely filled (Figure 10). The gap is considered closed when the new sequence matches the sequence on the other side of the gap and the length of the new sequence roughly matches the original size of the gap.

Step 6. Closing the remaining gaps with primer walking and Sanger sequencing

This step involved primer design, optimization of PCR conditions and DNA preparation. Even though this was a labor-intensive and costly step, it helped fill the remaining gaps and validate the gaps closed earlier.

Step 7. Final review by GapFish

Other than its use in filling the gaps, GapFish proved to be extremely useful in crosschecking and correcting the final completed assembly especially the regions where sequences replaced the N’s. It also helped in circularizing the genome by identifying and removing the overlapping sequences at the ends.
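The circularization mentioned in Step 7 amounts to finding the stretch at the end of the linear assembly that repeats its beginning and removing one copy. The following is a minimal sketch of that idea under simplifying assumptions: the function name `trim_circular_overlap` and the `min_overlap` parameter are hypothetical, and the overlap is required to match exactly, whereas real data may call for tolerance to sequencing errors and manual review.

```python
def trim_circular_overlap(seq, min_overlap=20):
    """Remove a terminal duplication from a linear assembly of a circular genome.

    Looks for the longest suffix of seq (at least min_overlap long, at most
    half the sequence) that is identical to the prefix of the same length.
    Returns (trimmed_sequence, overlap_length); overlap_length is 0 if no
    such overlap exists.
    """
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return seq[:-k], k
    return seq, 0


# Toy example: the first 8 nt reappear at the end of the sequence.
toy = "ACGTTGCA" + "GGGCCCTTTAAA" + "ACGTTGCA"
circular, overlap = trim_circular_overlap(toy, min_overlap=4)
print(circular, overlap)
```

Searching from the longest candidate downward means a short chance match is not picked when a longer genuine overlap is present.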

2.5 Primer Walking and PCR

Primers were designed based on the genomes of T. maritima, T. neapolitana and T. sp. strain RQ7, wherever applicable. Table 3 lists all the primers used to amplify the big gap of ~36 kb in T. sp. strain RQ7. Primers used in all PCR reactions were designed with the assistance of Primer3 [24, 25] and Clone Manager (http://www.scied.com/pr_cmbas.htm).


Figure 10: Schematic overview of GapFish algorithm. (a) A segment upstream of a gap is used as the “bait” to search against all Illumina reads that are 90 nt long. If the “bait” is found in a read, GapFish excises the sequence fragment adjacent to the “bait” in the 3’ direction and returns the result to the console. At the end of each search, all identified fragments are sorted and saved into a text file. (b) An example output of GapFish when searching with “bait” = 'GAGGCTCCTCAGGCGGTTGTGGAGGGCAATCCCAGAAACTCCG' (total 43 nt). Sequencing errors are apparent in the results, such as the 3rd position (G -> C) in the second line and the 8th position (T -> G) in the fifth line (underlined). Errors of this type could have caused the SOAPdenovo assembly to collapse, leaving a gap behind. Resolving such complications has required GapFish-assisted human intervention. The second-to-last sequence (underlined) is used as the “bait” for the next round of search [23]
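The “Bait-and-Fish” search described above can be sketched in a few lines of Python. This is not the GapFish source code: the function names `fish` and `consensus` are hypothetical, only the forward strand and the first occurrence of the bait in each read are examined, and the per-position majority vote is an assumed stand-in for the manual inspection of the sorted output described in the text.

```python
from collections import Counter


def fish(bait, reads):
    """Return the sorted fragments that lie immediately 3' of bait in reads."""
    catches = []
    for read in reads:
        pos = read.find(bait)
        if pos != -1:
            fragment = read[pos + len(bait):]
            if fragment:  # ignore reads that end exactly at the bait
                catches.append(fragment)
    return sorted(catches)


def consensus(fragments):
    """Majority base at each position across the fragments (assumed rule)."""
    longest = max((len(f) for f in fragments), default=0)
    bases = []
    for i in range(longest):
        column = Counter(f[i] for f in fragments if len(f) > i)
        bases.append(column.most_common(1)[0][0])
    return "".join(bases)


# Toy reads; the third one carries a G -> C error in its downstream fragment.
reads = ["TTGGCAATCCCAGAAAC", "GGCAATCCCAGA", "AGGCAATCCCACAAACTC", "ACGTACGT"]
catches = fish("GGCAATCC", reads)
print(catches)
print(consensus(catches))
```

Sorting places fragments that share a prefix next to each other, so a read that deviates at a single position, like the third toy read, stands out from its neighbours.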


Table 3: List of all the primers used to sequence the big gap of ~36 kb in the genome of T. sp. strain RQ7

Region amplified
    Primer            Primer sequence

Patching TRQ7_09460 (int) with TRQ7_09455
    RP0968-INT-F      CGTGGAGCTTGATCTTGT
    RP0968-INT-R      CAACAGCGAAGGAATGTC

TRQ7_09455 to TRQ7_09440
    Tm0968-0972 F     TAGGGGGATGCAGCATTAAC
    Tm0968-0972 R     ATCCACTCCCAGTTCATTCG

TRQ7_09435 to TRQ7_09425
    Tm 0973-0975 F    GCTCTTCGGATTTTGCAGTG
    Tm 0973-0975 R    GGTCTCAGTGGGTTTATTCG

TRQ7_09425 to TRQ7_09400
    R7579F            CTTCTCTGGCTCCTTTGACGAACTGAGTGT
    R7881R            GCTATAAAGATCGCTGAAGCCGCAATGG

TRQ7_09395 to TRQ7_09390
    TM0982-0983 F     GAATCCGCGAAGAGAAAGAG
    TM0982-0983 R     CAGAACATGAAGCCAGGAGA

Part of TRQ7_09390 and TRQ7_09385
    RP8283F           TGACAGTTTCCGGGATTC
    RP0984R1          CCTTGAAACGTAGAGCCTCCAGTAATGC

TRQ7_09385
    RP0984F2          ACGTTCTGCTCGACACGAATTTCACCAC
    RP0984R2          CTTCCAGTACTTGTACGGCAGGATGTAGAG
    RQ7_m984F         GAAAGAGGCTGTGAGCGAAGGTTTGGAGAT
    RQ7_m984R         CCAGCTCAAGGTTGATCATCTCTTCCATCG

TRQ7_09380 to TRQ7_09375
    Tm 0985-0986 F    AGCTTTGACTACGTCTGGAG
    Tm0985-0986 R     TCGGAGTATCTGACGAACCT

TRQ7_09375 and TRQ7_09370 connection
    Out_m986F         GAGTTCGTCACAAAGAGTGTCTCGGAAG
    Out_m987R         GGTAGAAGTTCAACAGCCGTGACATCAAGG

TRQ7_09370
    Tn_m987F          ACACCCCACAGAGAGAACCCTCAAATTC
    Tn_m987R          GTCCTTTTCCAGCTCCCTGTCTATCTCTGA

TRQ7_09370 and part of TRQ7_09365
    Out_m987F         CAGAAAAATGTGGGCTGGTGAGGTAGAAGAGG
    Tm0988R           CATCTGTGGTTCCGACAAAC

TRQ7_09365
    Tn_m88 F          GGGAGGACGTTCTGGATGAGAGTTTTGA
    Tn_m88-R          CCACTCCTTCGAAGTCCTCTGGTAGAGATT

TRQ7_09360 and TRQ7_09355
    Tm0989-0990 F     GAATACATTGCCGGTGAG
    Tm0989-0990 R     TTCTGTCCGAGCGAAATG
    Tn_m90F           CCGCTCTCTTTTATTCCTCGCTGACTTCTC
    Tn_m90R           CCTCCTGCGGTATATTTTCTGTCCGAGT

TRQ7_09350
    Tm0991F           AGATCTTCAGCCCGTTCGAT
    Tm0991R           GCGGTGAAATAGGTCTTCCT

TRQ7_09350 and TRQ7_09345
    Out_m991R         GATAGTGCGGTGTTCACCAGGGAAACCATT
    OUT92-R           CAAGGTGGTGGGATTTGAAACGCAGGATAG

TRQ7_09345
    Out92F            GGTTAGACGCTCGTCCTCTTGACAAATC
    NewRQ7_92R        TAGCGACGCTTTGCGGTATCGCTTAAACAC
    Tn_m992-F1        CTTTCTCAAGGTGAGCGCGGTACAGAACATAG
    Tn_m992-R1        CAGGTGAAAGGCTGAATCTATGGGTGGGATAC

TRQ7_09340 to TRQ7_09325
    Tm0993-0997 F     GCGGGTCGTTTGAATAAAGC
    Tm0993-0997 R     TACCTGCTTCACGAATACGC

TRQ7_09320 to TRQ7_09305
    Tm0998-1002 F     TTTGAGCATTACTGATATCGTTCC
    Tm0998-1002 R     TTCAGGGAGAGGGTAGCTGA

TRQ7_09305 patching to TRQ7_09285 (AraC)
    RP1002-AraC-F     GTCCAGATGGCAATGTCT
    RP1002-AraC-R     GTCTCGAACCAGCGAATA

Primers were ordered from Integrated DNA Technologies (IDT). PCR was performed using a Bio-Rad C1000 Touch thermal cycler. Desired PCR products were gel purified using the Geneclean Turbo kit from MP Biomedicals (Santa Ana, CA).

The unknown sequence of the ~36 kb big gap was amplified by combining primer walking and nested PCR, using the primers listed above. PCR programs were optimized for each primer pair. Figure 11 is a schematic representation of this strategy.

Figure 11: Schematic representation of the steps performed during primer walking. F1/R1 and F2/R2 are designed based on the conserved sequence in the reference genome, followed by a round of PCR that generates new sequence. The third pair, F3/R3, is designed based on the Sanger sequence data and closes the gap completely. F, forward primer; R, reverse primer
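When primer pairs are chosen for each round of walking, basic properties such as length, GC content and an approximate melting temperature are usually checked. The sketch below is only an illustration: the function name `primer_stats` is hypothetical, and the Wallace rule, Tm = 2(A+T) + 4(G+C), is a common rule of thumb for short oligonucleotides that is assumed here for simplicity. It is not claimed to be the calculation performed by Primer3 or Clone Manager.

```python
def primer_stats(seq):
    """Return length, GC percentage and Wallace-rule Tm for a primer."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return {
        "length": len(seq),
        "gc_percent": round(100 * gc / len(seq), 1),
        # Wallace rule: a rough estimate, intended for short primers only.
        "wallace_tm": 2 * at + 4 * gc,
    }


# Primer RP0968-INT-F from Table 3.
print(primer_stats("CGTGGAGCTTGATCTTGT"))
```

The rule becomes unreliable for the longer primers in Table 3 (around 30 nt), for which nearest-neighbour methods are generally preferred.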

2.6 Genome annotation

Three different pipelines were used to annotate the genome: BGI America’s, RAST (Rapid Annotation using Subsystem Technology) and NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP). The final genome annotation is the result of comparing the output of these three pipelines, followed by manual curation. The accepted annotation for T. sp. strain RQ7 is the one submitted to GenBank, which is publicly available on NCBI’s website under accession CP007633. Figure 12 outlines the general protocol for annotating a prokaryotic genome with NCBI’s PGAP.

The current version of NCBI’s PGAP (http://www.ncbi.nlm.nih.gov/books/NBK174280/), released in 2013, incorporates many changes compared with its previous version. To annotate structural RNAs (5S, 16S, 23S), the current version uses a nucleotide BLAST (BLASTn) search against the reference set available for these structural RNAs in the NCBI RefSeq collection.

Figure 12: Diagram summarizing the overall outline of the prokaryotic genome annotation pipeline. *, the target set comprises clade-specific core proteins, universally conserved ribosomal proteins and curated bacteriophage protein clusters, while the search set consists of all automatic clusters, including curated and non-curated protein clusters, curated bacteriophage protein clusters and all bacterial UniProtKB/Swiss-Prot proteins (http://www.ncbi.nlm.nih.gov/books/NBK174280/)


For annotation of tRNAs, tRNAscan-SE is employed. ProSplign is used for annotation of protein-coding genes; it relies on the target set and search set of proteins described in Figure 12. ProSplign is an application developed by NCBI for working with partial-frame and spliced protein alignments. The pipeline is also efficient in predicting CRISPRs (clustered regularly interspaced short palindromic repeats), which are commonly found in prokaryotic genomes. PGAP employs two different tools for CRISPR prediction: the CRISPR Recognition Tool (CRT) and PILER-CR. Overall, PGAP combines a similarity-based gene detection approach with a computational gene prediction algorithm to annotate prokaryotic genomes.


III. RESULTS

3.1 Completion of Thermotoga sp. strain RQ7 genome

The genome was completed by combining computational approaches with wet lab experiments. Illumina sequencing generated 400 Mb of clean data, covering the genome ~200 times, which were assembled and aligned with SOAPdenovo and SOAPaligner (Step 1). The resulting genome was still in the form of a scaffold, with strings of “N’s” in the regions where gaps were present. Nevertheless, the SOAP package was efficient, as it covered ~98% of the genome and provided a scaffold of 1,822,593 bp including 14,240 N’s (Table 4).

Table 4: Comparisons of assemblies using different methods [23]

Methods       Scaffold size (including ‘N’s)  # of ‘N’s  Total nt assembled  Coverage*  # of gaps  Max gap
SOAP package  1,822,593                       14,240     1,808,353           97.7%      28         ~36 kb
CLC package   1,884,513                       201,850    1,682,663           90.9%      380        ~21 kb
This pipeline 1,851,618                       0          1,851,618           100%       0          0

*Coverage was calculated by comparing the total number of nucleotides (nt) assembled (without ‘N’s) to the size of the complete genome of T. sp. strain RQ7
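The figures quoted for sequencing depth and for the coverage column of Table 4 follow from simple arithmetic, which the short sketch below reproduces. The function names are hypothetical and serve only to make the calculation explicit; the input numbers are those reported in the text and in Table 4.

```python
GENOME_SIZE = 1_851_618  # bp, size of the completed genome


def assembly_coverage(scaffold_size, n_count, genome_size=GENOME_SIZE):
    """Return (nt assembled without N's, percent of the complete genome)."""
    assembled = scaffold_size - n_count
    return assembled, 100 * assembled / genome_size


def sequencing_depth(total_bases, genome_size=GENOME_SIZE):
    """Average number of times each genome position was sequenced."""
    return total_bases / genome_size


for name, size, n in [("SOAP package", 1_822_593, 14_240),
                      ("CLC package", 1_884_513, 201_850)]:
    assembled, pct = assembly_coverage(size, n)
    print(f"{name}: {assembled:,} nt assembled, {pct:.1f}% coverage")

# 400 Mb of clean data over a ~1.85 Mb genome
print(f"depth: {sequencing_depth(400_000_000):.0f}x")
```

The depth comes out at roughly 216-fold, consistent with the "~200 times" stated in the text.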

In addition to the ~36 kb big gap, this scaffold had 27 gaps (ranging from 1 bp to 3.2 kb), referred to as minigaps because they could be filled by PCR. The next step was to compare this assembly of T. sp. strain RQ7 with all the publicly available Thermotoga genomes. As mentioned earlier (Figure 2), small subunit rRNA analysis has placed T. sp. strain RQ7 near T. neapolitana [7], which was also observed in our genome comparison (Step 2). As these two strains are closely related, we used the T. neapolitana genome as the reference genome for the independent assembly using CLC Genomics Workbench (Step 3) and also during the later part of the study.


Table 5: Comparison of the big gap region among different Thermotoga genomes

T. maritima  T. sp. strain RQ2  T. neapolitana       T. sp. strain RQ7       Annotation
TM0968       TRQ2_1822          CTN_1608             TRQ7_09455              hypothetical protein
TM0969       TRQ2_1821          CTN_1607             TRQ7_09450              hypothetical protein
TM0970       Absent             CTN_1606             Disrupted               hypothetical protein
TM0971       TRQ2_1821          Present*             TRQ7_09441              hypothetical protein
TM0972       TRQ2_1820          CTN_1605             TRQ7_09440 (Disrupted)  conserved hypothetical protein, GGDEF domain
TM0973       TRQ2_1819          CTN_1604             TRQ7_09435              hypothetical protein
TM0974       TRQ2_1818          CTN_1603             TRQ7_09430              hypothetical protein
TM0975       Absent             CTN_1602             TRQ7_09425 (Disrupted)  hypothetical protein
TM0976       Absent             Present*             TRQ7_09421              hypothetical protein
TM0977       Absent             CTN_1601             TRQ7_09420              hypothetical protein
TM0978       TRQ2_1817          CTN_1600             TRQ7_09415              hypothetical protein
TM0979       TRQ2_1816          CTN_1599             TRQ7_09410              hypothetical protein
TM0980       TRQ2_1815          CTN_1598             TRQ7_09405              hypothetical protein
TM0981       TRQ2_1814          CTN_1597             TRQ7_09400 (Disrupted)  hypothetical protein
TM0982       TRQ2_1813          CTN_1596             TRQ7_09395              hypothetical protein
TM0983       TRQ2_1812          CTN_1595             TRQ7_09390 (Disrupted)  hypothetical protein
TM0984       TRQ2_1811          CTN_1594             TRQ7_09385 (Disrupted)  hypothetical protein
TM0985       TRQ2_1810          CTN_1593             TRQ7_09380              hypothetical protein
TM0986       TRQ2_1809          CTN_1592 & CTN_1591  TRQ7_09375              hypothetical protein
TM0987       TRQ2_1808          CTN_1590             TRQ7_09370 (Disrupted)  hypothetical protein
TM0988       TRQ2_1807          CTN_1589             TRQ7_09365 (Disrupted)  hypothetical protein
TM0989       TRQ2_1806          CTN_1588             TRQ7_09360              hypothetical protein
TM0990       TRQ2_1805          CTN_1587             TRQ7_09355 (Disrupted)  hypothetical protein
TM0991       TRQ2_1804          CTN_1586             TRQ7_09350 (Disrupted)  hypothetical protein
TM0992       Absent             CTN_1585             TRQ7_09345 (Disrupted)  hypothetical protein
TM0993       Absent             CTN_1584             TRQ7_09340              hypothetical protein
TM0994       Absent             CTN_1583             TRQ7_09332              hypothetical protein
TM0995       Absent             CTN_1582             TRQ7_09331              hypothetical protein
TM0996       TRQ2_1803          CTN_1581             TRQ7_09330              hypothetical protein
TM0997       TRQ2_1802          CTN_1580             TRQ7_09325 (Disrupted)  hypothetical protein
TM0998       TRQ2_1801          CTN_1579             TRQ7_09320              transcriptional regulator, ArsR family
TM0999       Disrupted          Present*             TRQ7_09316              hypothetical protein
TM1000       Absent             CTN_1578             TRQ7_09315              hypothetical protein
TM1001       Absent             CTN_1577             TRQ7_09310              hypothetical protein
TM1002       TRQ2_1800          CTN_1576             TRQ7_09305 (Disrupted)  hypothetical protein
TM1003       Absent             CTN_1575             Absent                  hypothetical protein
TM1004       TRQ2_1800          CTN_1573             Absent                  hypothetical protein

Comparative analysis of the ~36 kb big gap region among four Thermotoga genomes, showing the synteny and conservation. This region was missing from the initial assembly of the T. sp. strain RQ7 genome but was included after combining the data from the CLC assembly and the primer walking effort. The asterisks (*) indicate the presence of a homolog that is not annotated as a gene in a particular genome. “Absent” indicates the complete deletion of the ORF sequence in a particular genome.


Complete genome comparisons highlighted not only the strong synteny shared by these two genomes but also a small number of insertions, deletions and rearrangements. Mauve, a multiple genome alignment tool [26], was used to identify genome-scale differences and similarities among the four Thermotoga isolates (Figure 13).

Figure 13: Multiple genome alignment of four different Thermotoga isolates using Mauve alignment tool. The genomes of T. neapolitana and T. maritima are linearized at DnaA (chromosomal replication initiator protein) for proper alignment purposes.

In addition, the most striking observation made during this step was that the T. sp. strain RQ7 assembly was missing a region of ~36 kb. Further investigation showed that this region is highly conserved among Thermotoga genomes (Table 5). Because the random deletion of such a large conserved region was not convincing, we conducted wet lab experiments to validate this finding in the T. sp. strain RQ7 genome. We selected random genes in that region and PCR amplified them using the genomic DNA of T. maritima and T. sp. strain RQ7 (Figure 14).

Figure 14: Amplification of randomly selected genes in the big gap of T. sp. strain RQ7. PCR using the total DNA from two different strains as templates. The primers Tm0998-1002 F/R were used for obtaining products of orthologous genes of TM0998, TM0999, TM1000, TM1001 and TM1002. Five µl of PCR product was loaded on the gel. Marker used is Hind-III digested lambda DNA ladder. Product size is 2.1 kb. T. maritima (Tm) and T. sp. strain RQ7 (T. RQ7). Primers were designed based on the sequence of these ORFs in T. maritima.

We successfully obtained PCR products from T. sp. strain RQ7 using primers designed on conserved regions of other Thermotoga genomes. Sanger sequencing of the PCR product from T. sp. strain RQ7 confirmed the existence of those genes. This finding suggested that the ~36 kb region was not absent from the genome but had instead been missed in the initial assembly. Thus, in addition to the 27 minigaps, the assembly contained one large gap of ~36 kb.


Table 6: Statistics of assembling process [23]

Statistic | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7
# Minigaps | 27 | 27 | 15 | 13 | 1 | 0 | 0
Assembled nt in the big gap | 0 | 0 | 12,511 | 12,511 | 12,511 | 35,746 | 35,746
Total nt assembled* | 1,808,353 | 1,808,353 | 1,828,147 | 1,828,363 | 1,832,588 | 1,851,716 | 1,851,618

*“N”s are not counted. Assemblies in Steps 1-6 have overlapping end sequences (terminal redundancy). As a result, the assembly in Step 6 appeared to be slightly bigger than the final assembly.

In the quest to reduce the number of gaps, the commercially available CLC Genomics Workbench was employed to reassemble the genome from the original Illumina reads (Step 3). As T. sp. strain RQ7 is most closely related to T. neapolitana, the T. neapolitana genome was used as the reference. As shown in Table 4, CLC Genomics Workbench produced a scaffold of 1,884,513 bp, including 201,850 “N”s, with 380 gaps. This assembly completely filled 12 of the 27 minigaps that were present in the initial SOAP assembly. Thus, a hybrid assembly of 1,823,180 bp (including 7,544 “N”s) was generated after introducing the sequences of the 12 gaps. Furthermore, 7 fragments totaling 12,511 bp were found to patch into the big gap (Table 6), bringing the total assembled nucleotides to 1,828,147 bp.
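The total reported for Step 3 follows from simple bookkeeping: the “N”s are subtracted from the hybrid scaffold and the fragments placed in the big gap are added. A minimal Python sketch of this arithmetic is given here. The variable names are illustrative and are not taken from the original pipeline.

```python
# Values quoted in the text for the hybrid assembly after Step 3.
hybrid_scaffold_bp = 1_823_180   # hybrid SOAP/CLC assembly, 'N's included
hybrid_n_count = 7_544           # 'N's still present in the hybrid assembly
big_gap_patched_bp = 12_511      # 7 CLC fragments placed in the big gap

# Called bases in the hybrid scaffold (positions that are not 'N').
called_bases = hybrid_scaffold_bp - hybrid_n_count

# Total assembled nucleotides once the big-gap fragments are added.
total_after_step3 = called_bases + big_gap_patched_bp

print(called_bases, total_after_step3)
```

The result, 1,828,147 nt, matches the Step 3 entry for total nucleotides assembled in Table 6.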

In the fourth step, a search of publicly available sequences for this strain yielded 13 Sanger sequences submitted to GenBank by other researchers (Table 7). Of these, 2 helped us patch minigaps while 11 matched regions that we had already assembled. Before moving to the wet lab approach, the in-house program GapFish was used to close the minigaps (Step 5). This step used the Illumina reads and successfully patched 12 minigaps. One minigap (minigap 8) could not be filled using GapFish because it lies in a highly repetitive region of the genome. This gap was filled by a wet lab approach combining PCR and Sanger sequencing (Figure 15).
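The progression of open minigaps through Steps 3 to 5 can be checked against Table 6 with a short tally. This is a sketch in Python for illustration only; the function name and the step labels are assumptions, while the counts are those stated in the text.

```python
def track_minigaps(start, closures):
    """Return (step, open minigaps) after applying each step's closures in order."""
    remaining = start
    history = []
    for step, closed in closures:
        remaining -= closed
        history.append((step, remaining))
    return history


# Number of minigaps closed at each in silico step, as stated in the text.
closures = [
    ("Step 3: CLC Genomics Workbench", 12),
    ("Step 4: public Sanger sequences", 2),
    ("Step 5: GapFish", 12),
]

history = track_minigaps(27, closures)
for step, remaining in history:
    print(f"{step}: {remaining} minigaps open")
```

The tally gives 15, 13 and 1 open minigaps after Steps 3, 4 and 5, the same values listed in Table 6, leaving a single minigap for the wet lab step.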


Table 7: Sanger sequences of T. sp. strain RQ7 available online from other studies

Sequence tag | Size (bp) | GenBank accession number
T. sp. RQ7 partial 16S rRNA gene, strain RQ7 | 1511 | AJ401023.1
T. sp. RQ7, DNA joint III-III' sequence | 966 | DQ352550.1
T. sp. RQ7, DNA joint I-I' sequence | 1242 | DQ352547.1
T. sp. RQ7, DNA joint XIII-XIII' sequence | 1964 | DQ352559.1
T. sp. RQ7, DNA joint XII-XII' sequence | 1584 | DQ352556.1
T. sp. RQ7, DNA joint X-X' sequence | 1382 | DQ352553.1
T. sp. RQ7 partial gltB gene for glutamate synthase large subunit | 925 | AJ400999.1
T. sp. RQ7 partial 23S rRNA gene, Group I, i-Tna1931b gene, and partial 23S rRNA gene strain RQ7 | 922 | AJ556790.1
T. sp. RQ7 partial ino1 gene for myo-inositol 1P synthase | 1006 | AJ401012.1
T. sp. RQ7 16S rRNA gene, 23S rRNA gene, 5S rRNA gene, folC gene, deoB gene, leuS gene, ORF4, ORF5, ORF6, ahcY gene, topG gene, hppa gene, acp gene, ORF11, ORF12, priA gene, ORF13, ORF14, ORF15, ORF16, ORF17, mrsA gene, ORF19, rrp2 gene, fsr gene, ORF2 | 36,963 | AJ872270.1
T. sp. RQ7 partial int gene for integrase-recombinase | 571 | AM072709.1
T. sp. RQ7 partial estA gene for esterase | 621 | AM072720.1
T. sp. RQ7 partial tm0938 gene | 621 | AM072729.1


Figure 15: Wet lab approach to confirm the existence of the minigaps. MG4, minigap 4; primers used were minigap 4 F/R; the expected PCR product size is 850 bp if the gap is absent and 1.6 kb if the gap is present. MG8, minigap 8; primers used were minigap 8 F/R; the expected PCR product size is 835 bp if the gap is absent and 1.3 kb if the gap is present. Five µl of PCR samples were loaded. Templates were T. sp. strain RQ7 DNA extracted within 3-5 days (Fresh DNA) and T. sp. strain RQ7 DNA stored at -20 °C for a year (Old DNA).

One of the minigaps (minigap 4) was confirmed by wet lab approaches to be absent from the genome. In addition, GapFish helped us correct 4 locations that had been misassembled by either the SOAP package or CLC Genomics Workbench. Because GapFish could not find any sequences in the Illumina reads to patch into the big gap, it was clear that this region was underrepresented in the Illumina sequencing.

As the in silico options were exhausted, we had to use wet lab approaches. Thus, in Step 6, primer walking, PCR amplification and Sanger sequencing were adopted to sequence the last minigap, to fill the big gap and to validate a few minigaps. Closure of the big gap was achieved by using genomic DNA freshly extracted from T. sp. strain RQ7 (every 3-5 days).

Figure 16: Differences in obtaining PCR products for easy genes and difficult genes found in the big gap of T. sp. strain RQ7. (a) PCR using the total DNA from four different isolates as templates with the primers Tm0985 - Tm0986 F/R for obtaining products of orthologous genes of TM0985 and TM0986. Five µl of the PCR product was loaded on the gel. Marker used is 2-log ladder. Product size is 1.37 kb. T. maritima (Tm), T. neapolitana (Tn), T. sp. RQ2 (T. RQ2) and T. sp. strain RQ7 (T. RQ7). (b) PCR using the total DNA from four different isolates as templates with the primers Tn_m88 F/R for obtaining products of orthologous genes of TM0988. Five µl of the PCR product was loaded on the gel. Marker used is 2-log ladder. Product size is 3 kb. T. maritima (Tm), T. neapolitana (Tn), T. sp. RQ2 (T. RQ2) and T. sp. strain RQ7 (T. RQ7). Template DNA used for T. sp. strain RQ7 was old DNA stored at -20 °C. Primers were designed in the conserved region of these ORFs in T. maritima and T. neapolitana.

The need for fresh DNA indicated physical instability of this region of the genome, which might have led to its underrepresentation in the Illumina reads. This step required optimization of PCR conditions, analysis of the Sanger sequencing data and primer design. Closure of the big gap was the most time-consuming and most critical step before completing the genome. Some of the ORFs were easy to amplify and sequence (easy genes), while a few required multiple rounds of nested PCR to amplify completely (difficult genes). A few of the ORFs could be sequenced from old DNA (~1 year old), while other regions required freshly extracted DNA.

Figures 16(a) and 16(b) exemplify the difference in obtaining PCR products with old DNA and freshly extracted DNA for some parts of the big gap. The ORFs that were difficult to amplify in one round were sequenced using primer walking and nested PCR. These ORFs required multiple rounds of PCR and Sanger sequencing before they could be completely patched to the neighboring region. While designing the primers, it was critical to keep sufficient overlap between the primers in order to assemble the Sanger sequencing data.
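The role of the overlap can be illustrated with a toy example. The sketch below (Python) joins a newly obtained read to an already known flanking sequence only if the end of one exactly matches the start of the other over a minimum length. The sequences, the function name merge_by_overlap and the min_overlap threshold are hypothetical and were not used in this study. Exact matching is also a simplification, since assembling real Sanger reads requires tolerance for base-calling errors, which this sketch does not attempt.

```python
def merge_by_overlap(left, right, min_overlap=20):
    """Join two sequences if a suffix of `left` equals a prefix of `right`.

    The longest exact overlap of at least `min_overlap` bases is used.
    Returns the merged sequence, or None if no such overlap exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None  # not enough shared sequence to join the reads safely


# Hypothetical sequences: a known flank and a new read sharing 10 bases.
known_flank = "ATGGCGTACCTGAAGGTTCAGCTA"
new_read = "GGTTCAGCTAGGCATTCCGA"

merged = merge_by_overlap(known_flank, new_read, min_overlap=8)
print(merged)
```

If successive reads share too little sequence, the function returns None, which mirrors the situation in which a new primer has to be designed closer to the known region.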

Figure 17: The approach of nested PCR for amplifying the difficult genes in the big gap of T. sp. strain RQ7. (a) Strategy used for sequencing difficult genes in T. sp. strain RQ7. The longer PCR product was used as template for the next round of nested PCR. The smaller product was Sanger sequenced and assembled with the overlap of the neighboring known sequence. The primers used were designed on the sequence of either T. maritima or T. neapolitana. (b) PCR using freshly extracted T. sp. strain RQ7 total DNA as template for obtaining products of orthologous genes of TM0988 using the primers Tm0988 F/R. Five µl of the PCR product was loaded on the gel. Marker used is lambda DNA-HindIII digest. Product size is 3.4 kb. T. sp. strain RQ7 (T. RQ7). NA – Not applicable here.


Figure 17(a) explains the strategy while 17(b) exemplifies one of the ORFs in the big gap, which was sequenced and assembled after multiple rounds of nested PCR. The desired PCR products were gel purified and sent for Sanger sequencing (Figure 18). These samples were required to be pure to avoid double sequences.

Figure 18: Nested PCR with internal primers for obtaining the final PCR product of a difficult gene. (a) PCR using the product of Tm0988 F/R as template for obtaining products of the orthologous gene TM0988 using the primers Tn_m88 F/R. Five µl of the PCR product was loaded on the gel. Marker used is 2-log ladder. Product size is 3 kb. T. sp. strain RQ7 (T. RQ7). (b) Purified PCR product obtained in (a). This was sent for Sanger sequencing. Five µl was loaded on the gel.

The total size of the big gap was found to be 35,746 bp. Table 5 lists the ORFs in the big gap across the four genomes that we compared. This region has been analyzed previously and was predicted to undergo recombination events [27]. The final review (Step 7) helped us determine the total size of the genome to be 1,851,618 bp with a GC content of 47.1%. Table 6 summarizes the statistics of our assembly process. Because the genome was completed by combining in silico and wet lab approaches, our pipeline was effective. The genome was also evaluated multiple times to deliver a good quality genome. Closing the gaps might alter the start and stop codons in that region or in the sequence downstream of the gap, thus affecting the initial annotation. Compared with the initial SOAP assembly, the complete genome of T. sp. strain RQ7 gained 42 genes that were initially missing, changed the sizes of 12 genes and discarded 9 false open reading frames (ORFs) (Table 8).

Table 8: ORFs differentially annotated in the complete genome [23]

Category | # of affected ORFs | Putative functions
Size variation | 12 | ABC-type sugar transport and utilization machinery; chemotaxis protein; RNA/DNA processing
Recovered ORFs | 42 | ABC-type sugar transport and utilization machinery, transcriptional regulators, metabolism system, DNA/RNA helicases, DNA methylases
Dismissed ORFs | 9 |

The affected genes mainly function in ABC-type sugar transport, chemotaxis, transcriptional regulation, RNA/DNA processing and sugar utilization, which are key metabolic functions. False or missing annotation of these genes would lead to a wrong understanding of the biology of the organism and could also affect any study that relies on this incorrect information. Thus, the final review and re-annotation were crucial for this study.

3.2 Genome details and features

The GenBank accession number for this genome is CP007633, the BioSample number is SAMN02744159 and the BioProject number is PRJNA246218. The total number of genes is 1901, which includes 1833 protein coding genes, 50 RNA genes (3 rRNA genes, 46 tRNA genes, and 1 ncRNA gene), and 18 pseudogenes. The genome also has 8 CRISPR loci.
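The gene counts above are internally consistent, which can be confirmed with a short tally. The sketch below is in Python; the variable names are illustrative only, and the numbers are those reported in this section.

```python
# Gene counts reported for the T. sp. strain RQ7 genome.
protein_coding = 1833
rna_genes = {"rRNA": 3, "tRNA": 46, "ncRNA": 1}
pseudogenes = 18

total_rna = sum(rna_genes.values())
total_genes = protein_coding + total_rna + pseudogenes

print(f"RNA genes: {total_rna}, total genes: {total_genes}")
```

The categories add up to 50 RNA genes and 1901 genes in total, as stated.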

3.3 Features analyzed in the genome

After complete sequencing, assembly and annotation of the T. sp. strain RQ7 genome, it was analyzed to report the features that were observed in the genome. Each of the features is described below.

3.3.1. Unique genes in T. sp. strain RQ7

After comparing the genomes available in the NCBI database (as of 03-03-2015), six genes were identified as unique to T. sp. strain RQ7.

Table 9: List of unique genes in this genome

Locus tag | Start | Stop | Product
TRQ7_01575 | 291243 | 292688 | Hypothetical protein
TRQ7_01585 | 293031 | 294398 | Hypothetical protein
TRQ7_01610 | 299894 | 298668 | Hypothetical protein
TRQ7_01615 | 301168 | 299891 | Hypothetical protein
TRQ7_01700 | 314483 | 313635 | Hypothetical protein
TRQ7_01710 | 316166 | 315774 | Hypothetical protein
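The coordinates in Table 9 also give the length and likely orientation of each unique gene. The Python sketch below rests on two assumptions that are not stated in the table: that the coordinates are 1-based and inclusive, and that a Start value larger than the Stop value denotes a gene on the reverse strand. The function name is illustrative.

```python
# (locus tag, start, stop) as listed in Table 9.
unique_genes = [
    ("TRQ7_01575", 291243, 292688),
    ("TRQ7_01585", 293031, 294398),
    ("TRQ7_01610", 299894, 298668),
    ("TRQ7_01615", 301168, 299891),
    ("TRQ7_01700", 314483, 313635),
    ("TRQ7_01710", 316166, 315774),
]


def describe_gene(start, stop):
    """Return (length in bp, strand, whether the length is a multiple of 3)."""
    length = abs(stop - start) + 1          # assumes inclusive coordinates
    strand = "+" if start < stop else "-"   # assumes start > stop means reverse
    return length, strand, length % 3 == 0


summary = {tag: describe_gene(start, stop) for tag, start, stop in unique_genes}
for tag, (length, strand, in_frame) in summary.items():
    print(f"{tag}: {length} bp, strand {strand}, whole codons: {in_frame}")
```

Under these assumptions, two genes lie on the forward strand and four on the reverse strand, and every length is a multiple of three, as expected for intact ORFs.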

Homologs of TRQ7_01625, coding for sulfate adenylyltransferase (EC 2.7.7.4), were identified in Petrotoga mobilis SJ95 (Pmob_0713) and Defluviitoga tunisiensis L3 (DTL3_1219), while a homolog of TRQ7_01650, coding for adenylylsulfate kinase (EC 2.7.1.25), was identified in P. mobilis SJ95 (Pmob_0712). These two genes are involved in the assimilatory sulfate reduction pathway (as shown in Figure 19 below).

Figure 19: Metabolic reaction of sulfate reduction pathway. Figure outlines the steps at which sulfate adenylyltransferase and adenylylsulfate kinase function.

In addition, TRQ7_01705 was found to be shared exclusively with Mesoaciditoga lauensis within the order Thermotogales.

3.3.2. Natural Transformation

Thermotoga are known to undergo lateral gene transfer (LGT) events. One way this could happen is via natural transformation. T. sp. strain RQ7 [28] and T. sp. strain RQ2 [29] were recently identified as naturally transformable. Thus, whole genome sequencing of this organism provided an opportunity to perform a comparative analysis of the competence protein coding genes across the genus Thermotoga. Using experimentally characterized competence proteins from different bacteria as references, potential genes that might play a role in the natural competence of Thermotoga were identified (Table 10). These putative competence genes in Thermotoga strains were predicted on the basis of conserved domains that are also found in the experimentally characterized proteins of other bacteria.

The type IV pilus gene coding for PilZ was characterized in 1995 in Pseudomonas aeruginosa and was predicted to play a role in fimbrial expression, twitching motility and DNA uptake as part of the fimbrial biogenesis system [30]. Experimental evidence for the role of this protein in DNA binding in P. aeruginosa was provided in 2005 [31]. Type IV pili have been shown to play an important role in natural transformation in other bacteria such as Neisseria gonorrhoeae [32] and Vibrio cholerae [33]. Sequence analysis and experiments in previous studies with multiple genomes showed an association of cyclic-di-GMP function with PilZ domain proteins, which might enable them to perform multiple functions in a cell [34, 35]. One example is the co-occurrence of the YcgR domain with PilZ in V. cholerae [33], where cyclic-di-GMP binding has been demonstrated experimentally (Figure 20).


Table 10: Putative competence genes in the four Thermotoga strains

TRQ7 | Gene name* | Putative function | Tm | Tn | TRQ2

DNA uptake and translocation
TRQ7_00110 | PilZ (Pa, Vc) | Type IV pilus biogenesis and twitching motility [30, 31, 33] | TM0905 | CTN_1670 | TRQ2_0022
TRQ7_00455 | PilB (Pa, Vc) | Type II secretion system (T2SS), Type IV fimbrial assembly NTPase [36-38] | TM0837 | CTN_1739 | TRQ2_0090
TRQ7_01410, TRQ7_04530, TRQ7_08710 | PilQ (Nm, Tt) | Secretin/TonB forms gated channel for extrusion of assembled pilin [39-41] | TM1117, TM0088 | CTN_1450, CTN_1933, CTN_0604 | TRQ2_1699, TRQ2_0859
TRQ7_04500 | PilC (Ps, Ng) | Type II secretory pathway, component PulF / Type IV fimbrial assembly protein [42, 43] | TM_0094 | CTN_0598 | TRQ2_0853
TRQ7_05855 | PilD (Vv, Ng) | Type IV prepilin peptidase, processes N-terminal leader peptides for prepilins [44-47] | TM1696 | CTN_0883 | TRQ2_1138
TRQ7_06260 | ComEC (Bs) | Putative channel protein, transports DNA across the cell membrane [48, 49] | TM1775 | CTN_0965 | TRQ2_1049
TRQ7_07315 | ComFC (Bs, Hi) | Phosphoribosyltransferase [50-52] | TM1584 | CTN_1168 | TRQ2_1247
TRQ7_07650 | PilT (Ng) | Motility protein [53] | TM1362 | CTN_1229 | TRQ2_1467
TRQ7_07980 | PilE (Ng, Pa) | Type IV pilin; major structural component of Type IV pilus [53-55] | TM1271 | CTN_1301 | TRQ2_1548
TRQ7_09065 | ComEA (Bs) | High affinity DNA-binding periplasmic protein [56-59] | TM1052 | CTN_1515 | TRQ2_1756

Post-translocation
TRQ7_02260 | ComM (Hi) | Promotes the recombination of the donor DNA into the chromosome [60] | TM0513 | CTN_0158 | TRQ2_0424
TRQ7_03645 | DprA (Hi) | DNA protecting protein [61, 62] | TM0250 | CTN_0436 | TRQ2_0698

(*Gene names have been given after the experimentally characterized proteins in Ng, Neisseria gonorrhoeae; Nm, Neisseria meningitidis; Hi, Haemophilus influenzae; Bs, Bacillus subtilis; Pa, Pseudomonas aeruginosa; Ps, Pseudomonas stutzeri; Tt, Thermus thermophilus; Vv, Vibrio vulnificus; Vc, Vibrio cholerae. Tm, T. maritima; Tn, T. neapolitana; TRQ2, T. sp. strain RQ2; TRQ7, T. sp. strain RQ7)

Figure 20: CDD (Conserved Domain Database) analysis using amino acid sequences of the PilZ protein in V. cholerae, P. aeruginosa and T. sp. strain RQ7

Pilus biogenesis has been a topic of interest for many years. In 1990, three accessory proteins, PilB, PilC and PilD, were studied for their role in pilus biogenesis in P. aeruginosa [37]. Later, in 2008, the functional role of conserved PilB residues was studied in the same organism. The study revealed the exact roles of the Walker A, Walker B, Asp Box and His Box motifs in the motor function of PilB. This protein belongs to the NTPase superfamily and plays an important role in powering pilus extension [36]. In addition, PilB in V. cholerae has been shown experimentally to play a role in DNA uptake during natural competence [38]. This protein is an important part of the DNA uptake machinery that is present in naturally competent cells and helps in the successful intake of foreign DNA in V. cholerae. Figure 21 outlines the CDD analysis of PilB in P. aeruginosa, V. cholerae and T. sp. strain RQ7. PilQ is an important secretin protein that has been shown to mediate exogenous DNA uptake through DNA binding in N. meningitidis [39].

Figure 21: CDD analysis using amino acid sequences of PilB protein in V. cholerae, P. aeruginosa and T. sp. strain RQ7

PilQ is a secretin family protein in the outer membrane and plays an important role in the secretion of type IV pili in N. meningitidis [41]. This secretin has also been isolated and identified in the thermophilic bacterium Thermus thermophilus HB27, where it was shown to bind dsDNA and to span the outer membrane and periplasmic space [40]. PilQ homologs are also predicted in the T. sp. strain RQ7 genome based on conserved domains (Figure 22).


Figure 22: CDD analysis of PilQ amino acid sequence in N. gonorrhoeae showing the presence of important conserved secretin domain. In T. sp. strain RQ7 three proteins were predicted to be putative PilQ proteins

Another important accessory protein that plays a role in type IV pilus formation and DNA transformation is PilC (Figure 23). PilC is involved in pilus biogenesis, DNA uptake and protein secretion in bacteria such as Pseudomonas sp. and Haemophilus sp. [42].

Figure 23: Conserved domains of PilC protein in P. stutzeri and putative PilC in T. sp. strain RQ7 analyzed by CDD

PilD is the pre-pilin peptidase required during pilus biogenesis. In N. gonorrhoeae, mutants lacking PilD were unable to form pilin from pre-pilin subunits and were non-piliated [44]. In H. influenzae, pilD is part of the pil locus, codes for the pre-pilin peptidase and affects natural transformation [63] (Figure 24).

Figure 24: CDD analysis using amino acid sequences outlines the conserved domains of PilD in N. gonorrhoeae, V. vulnificus and putative PilD in T. sp. strain RQ7

PilT proteins are a large family of proteins sharing consensus nucleotide-binding motifs (Figure 25) that assist in transporting macromolecular complexes across membranes. The energy for this transport was hypothesized to come from the intrinsic ATPase/kinase activities of these proteins. PilT protein coding genes have been identified in many organisms, such as Pseudomonas sp. and N. gonorrhoeae [53]. Mutations in PilT in N. gonorrhoeae affected competence for natural transformation at the level of DNA uptake.

Figure 25: Outline of the conserved domains of PilT protein in N. gonorrhoeae and putative PilT in T. sp. strain RQ7

PilE protein has been identified as a part of Type IV pilus biogenesis (Figure 26) in N. gonorrhoeae [32]. This protein plays an important role in pilin subunit-subunit interactions and its absence affects the twitching motility and pilus biogenesis in P. aeruginosa [55]. The exact role of this protein in natural transformation is unknown but it is predicted to affect the uptake by playing a role together with other type IV pilus proteins [64].


Figure 26: CDD analysis using amino acid sequences of PilE in P. aeruginosa and N. gonorrhoeae and putative PilE in T. sp. strain RQ7

Competence protein ComM was identified in the mutant strain RJ248 of H. influenzae Rd, which showed a transformation frequency 300 times lower than that of the wild type [60]. This work showed that ComM is induced during competence development in the wild type strain and probably acts late in transformation, without affecting DNA uptake and transport. ComM is a member of the AAA+ (ATPase associated with various cellular functions) superfamily, whose proteins can bind and hydrolyze nucleotides. This superfamily has a general structure consisting of a conserved αβα core, Walker A and Walker B motifs, and an arginine finger [65] (Figure 27). It has been suggested that arginine residues might play diverse roles, such as ATP hydrolysis, binding, sensing and inter-subunit interaction [65, 66].
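As an illustration of how a motif such as Walker A can be located in a protein sequence, the Python sketch below scans for the widely used P-loop consensus [AG]-x(4)-G-K-[ST]. This pattern is a general textbook consensus and an assumption here; it is not taken from the CDD analysis in this study, which relies on domain models rather than simple patterns. The example sequence is invented and is not the ComM protein of any organism.

```python
import re

# Common Walker A (P-loop) consensus: [AG], any four residues, then G-K-[ST].
WALKER_A = re.compile(r"[AG].{4}GK[ST]")


def find_walker_a(protein):
    """Return (1-based position, matched residues) for each Walker A-like hit."""
    return [(m.start() + 1, m.group()) for m in WALKER_A.finditer(protein)]


# Hypothetical peptide used only to demonstrate the search.
toy_protein = "MKLVLIGPPGSGKTTLARALA"

hits = find_walker_a(toy_protein)
print(hits)
```

A pattern match of this kind only flags candidate sites; a short consensus can occur by chance, so hits would still need to be checked against domain-level evidence such as that shown in Figure 27.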

Figure 27: CDD analysis using amino acid sequence of ComM in H. influenzae and putative ComM in T. sp. strain RQ7

The comE operon has been characterized in B. subtilis; it consists of three ORFs that play an important role in DNA binding at the cell surface and DNA uptake during competence [49]. ComEA is predicted to bind DNA non-specifically using its helix-turn-helix motifs (Figure 28) and thus assists in DNA binding, uptake and transport during natural transformation in B. subtilis [59, 67]. In N. gonorrhoeae, similarity searches using ComEA from B. subtilis identified a ComE protein with the same functions and domains [56].

Figure 28: CDD analysis of putative ComE in T. sp. strain RQ7 and ComE in the reference organism B. subtilis

ComEC (Figure 29) is a putative membrane-spanning channel protein, which is hydrophobic in nature and is probably part of the uptake machinery for natural transformation in B. subtilis [48]. This gene is the last of the three genes in the comE operon of B. subtilis.


Figure 29: CDD analysis using amino acid sequence of ComEC protein in B. subtilis and putative ComEC in T. sp. strain RQ7

The exact function of ComFC is still unknown, but in B. subtilis this protein has been observed at the cell poles together with other Com machinery proteins and is predicted to be involved in DNA uptake [68] (Figure 30). This protein was first characterized in H. influenzae as Com101 [50] and later identified in B. subtilis as ComFC, which showed 22% identity to Com101 [51].

Figure 30: CDD analysis using amino acid sequence of putative ComFC in T. sp. strain RQ7 and ComFC in B. subtilis

DprA, also known as DNA protecting protein, is predicted to play an important role during natural transformation in H. influenzae. It is thought to be a part of the DNA translocation or recombination apparatus in this organism [61] (Figure 31).


Figure 31: CDD analysis using amino acid sequences of DprA protein in H. influenzae and putative DprA protein in T. sp. strain RQ7

3.3.3 The Type II Restriction-Modification system TneDI

Bacterial restriction-modification (R-M) systems are one of the many defense mechanisms that have evolved to protect the integrity of the host's genetic material from invading DNA. A Type II R-M system, TneDI, has been characterized in T. neapolitana and overexpressed in Escherichia coli [2, 20, 69]. Orthologs of this R-M system are also found in T. maritima, T. sp. RQ2 and other Thermotogales [2]. Analysis of the T. sp. RQ7 genome revealed that this strain has a deletion of these genes, making the host genome susceptible to restriction by R.TneDI, as shown in Figures 32(a), 32(b) and 32(c). The absence of the TneDI R-M system in T. sp. strain RQ7 might be one of the factors contributing to the high natural transformation efficiency of this isolate.

Figure 32: Deletion of TneDI R-M system in T. sp. strain RQ7. (a) Diagrammatic representation of the R-M system locus in T. sp. strain RQ7 compared to T. neapolitana, highlighting the deletion in T. sp. strain RQ7. RQ7_3940 F/R is the primer pair designed to amplify this locus in T. sp. strain RQ7. (b) Experimental confirmation of the deletion of R-M system coding genes in T. sp. strain RQ7 by PCR using the primers RQ7_3940 F/R. T. neapolitana (Tn) genomic DNA was used as positive control to show the presence of the complete R-M system genes. Expected product size is 1831 bases in Tn and 503 bases in T. sp. strain RQ7. Five µl of PCR product was loaded on the agarose gel. (c) Gel showing the experimental evidence of an active R-M system in the host T. neapolitana and an inactive system in T. sp. RQ7

3.3.4 CRISPRs (clustered regularly interspaced short palindromic repeats)

CRISPRs provide bacteria with a form of adaptive immunity against invading phages and horizontally transferred mobile DNA elements such as plasmids in a sequence-specific manner [70-72]. This system utilizes non-coding crRNA together with a set of Cas proteins to target the invading nucleic acid (DNA/RNA). The NCBI annotation pipeline identified 6 CRISPR loci in T. sp. RQ7, while the CRISPRfinder database [73-75] helped us recognize 2 more. Of the 8 CRISPR loci in the genome, only 2 are associated with cas genes. Table 12 compiles the details of all the CRISPR loci in the genome of T. sp. strain RQ7. CRISPR #1 in the table was classified as a questionable locus by CRISPRfinder because it has only two spacers and the sequence of the direct repeat varies after each spacer. Figures 33(a) and 33(b) summarize the arrangement of the loci in T. sp. strain RQ7 along with the associated cas genes.

Figure 33: Diagrammatic representation of the CRISPR/Cas system in T. sp. strain RQ7. (a) Outline of the 8 CRISPR loci in T. sp. strain RQ7 (1,851,618 bp); drawn to scale using Clone Manager. (b) The arrangement of the loci is shown across the linear chromosome along with the Cas genes; not to scale. The genome is linearized between DnaA (TRQ7_00005) and Ferredoxin (TRQ7_09610)


CRISPRs were recently reported to influence natural transformation in Streptococcus sp. [76]. A comparative genomics approach was used to analyze the CRISPR/Cas regions in T. sp. strain RQ7, T. maritima, T. neapolitana and T. sp. RQ2. T. sp. strain RQ7 has 95 spacers and 2 loci of DNA-targeting cas genes. Earlier studies had found the RNA-targeting Cmr genes Cmr1-Cmr6 in addition to DNA-targeting Cas genes in T. maritima and T. sp. RQ2 [77, 78], and our analysis revealed the presence of more spacers in these two strains (Table 11). These differences may contribute to the differences in natural transformation efficiency among the four Thermotoga strains.

Table 11: Total number of CRISPR loci and spacers among the four Thermotoga strains after comparative genomics. CRISPRfinder was used to perform this analysis [73-75]

Genome | Total # of CRISPR loci | Total # of spacers
T. maritima | 8 | 106
T. sp. strain RQ2 | 8 | 129
T. neapolitana | 8 | 60
T. sp. strain RQ7 | 8 | 95


Table 12: Overview of each CRISPR locus in T. sp. strain RQ7 together with its position on the genome, number of total spacers, sequences of direct repeats and homology of the repeats to other Thermotogales. Analysis of CRISPRs and the direct repeats was performed by the CRISPRfinder [73-75]

CRISPR locus  Co-ordinates in T. sp. strain RQ7 genome  Repeats                          No. of spacers  cas genes  Homology of repeats in other Thermotogales
1*            553829 - 553993                           GTTTCAATCCTTCCTTAGAGGTATGGAAACA  2               No         None
                                                        GTTTCAATACTTCCTTAGAGGTATGGAAACA
                                                        GTTTCAATACTTCCTTTGAGGTATGAAAACA
2             594480 - 594906                           TTTCCTATACCTCTAAGAAAGGATTGAAAC   6               No         T. maritima MSB8, T. sp. RQ2, T. neapolitana DSM 4359
                                                        GTTTCCATACCTCTAAGGAAGTATTGAAAC
3*            975170 - 975400                           GGTTTCAATACTTCCTTTGAGGTATGGAAAC  3               Yes        None
                                                        CGTTTCAATACTTCCTTAGAGGTATGGAAAT
                                                        AGTTTCAATACATCCTCAGAGGTATGATTTA
4             983576 - 986934                           GTTTTTATCTTCCTAAGAGGAATATGAAC    51              Yes        T. petrophila, T. neapolitana DSM 4359, T. naphthophila RKU-10
                                                        GTTTTTATCTTCCTAAGAGGAATATAGTA
5             1011390 - 1012080                         GTTTCAATACTTCCTTTGAGGTATGGAAAC   10              No         T. neapolitana DSM 4359
                                                        GTTTCAATATTTCCTTATAGGTACAAACCC
6             1090292 - 1090660                         GTTTCAATACTTCCTTAGAGGTATGGAAAC   5               No         T. maritima MSB8, T. sp. RQ2, T. neapolitana DSM 4359
7             1233629 - 1233857                         GTTTCCATACCTCTAAGGAAGTATTGAAAC   3               No         T. maritima MSB8, T. sp. RQ2, T. neapolitana DSM 4359
8             1422791 - 1423488                         GTTTCAATACTTCCTTTGAGGTATGGAAAC   10              No         T. neapolitana DSM 4359

*, only identified by CRISPRfinder
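Because a CRISPR locus can lie on either strand, repeats that look different in Table 12 may be the same sequence read in opposite orientations. The following is a minimal sketch of such a strand-aware comparison, not the procedure used by CRISPRfinder. The function names are hypothetical, and the sequences are one representative repeat each from loci 5 to 8 of Table 12.

```python
# Illustrative sketch only: helper names are hypothetical and this is not
# the CRISPRfinder algorithm. Sequences are copied from Table 12.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def same_repeat(a: str, b: str) -> bool:
    """True if two repeats are identical on the same or the opposite strand."""
    return a == b or a == reverse_complement(b)


# One representative direct repeat per locus (locus number -> repeat).
REPEATS = {
    5: "GTTTCAATACTTCCTTTGAGGTATGGAAAC",
    6: "GTTTCAATACTTCCTTAGAGGTATGGAAAC",
    7: "GTTTCCATACCTCTAAGGAAGTATTGAAAC",
    8: "GTTTCAATACTTCCTTTGAGGTATGGAAAC",
}


def matching_pairs(repeats: dict) -> list:
    """List locus pairs whose repeats match on either strand."""
    loci = sorted(repeats)
    return [
        (i, j)
        for k, i in enumerate(loci)
        for j in loci[k + 1:]
        if same_repeat(repeats[i], repeats[j])
    ]


if __name__ == "__main__":
    print(matching_pairs(REPEATS))
```

For these four sequences, loci 5 and 8 match on the same strand, and loci 6 and 7 match as reverse complements of each other. Exact matching is used here for simplicity; tolerating mismatches would require an alignment step.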


IV. DISCUSSION

The current work was initiated with the aim of completely sequencing the genome of the hyperthermophilic bacterium T. sp. strain RQ7. This project effectively combined wet lab approaches with in silico computational methods to generate a completely assembled bacterial genome. The work started with de novo assembly of Illumina reads, which led to the initial draft of the genome. The draft was used for comparative analysis, followed by identification of a reference genome and of a big gap. A wet lab approach helped verify the absence of a hidden big gap. The genome was then improved by incorporating publicly available sequences for T. sp. strain RQ7 and reassembling the genome with CLC Genomics Workbench and the in-house program GapFish. Once in silico sources were exhausted, wet lab approaches were again considered. In this way, the amount of work requiring wet lab assistance was reduced to a minimum. At the end, in silico analysis was employed to review the genome and conclude the project. The pipeline developed in this work was efficient because of the close coupling of in silico and wet lab approaches: information from computational analysis guided the next step of work, whether wet lab or in silico. Several software tools, including SOAPaligner, SOAPdenovo, CLC Genomics Workbench, and the in-house GapFish, were used in this study.
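The hand-off between in silico and wet lab work depends on knowing where the assembly is still open. As a minimal sketch, assuming unresolved regions in a scaffold are written as runs of N (a common convention in scaffolded assemblies), the remaining gaps can be listed as follows. This is not the GapFish algorithm; the function name and the toy sequence are hypothetical.

```python
import re
from typing import List, Tuple


def find_gaps(scaffold: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals for runs of N.

    Assumption: gaps are encoded as consecutive N (or n) characters.
    Runs shorter than min_len are ignored.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    # Toy scaffold, not real sequence data.
    toy = "ACGT" + "N" * 5 + "GGCC" + "NN" + "TTAA"
    for start, end in find_gaps(toy):
        print(f"gap at {start}-{end} ({end - start} bp)")
```

Each interval is a candidate for closure, either computationally with additional reads or by PCR with primers designed in the flanking sequence. Note that a scan of this kind only reports gaps the assembler has marked; it cannot detect a region that is missing from the assembly altogether.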

Once completed, the genome was analyzed for unique features that would help us understand the physiology of the organism. The identification and successful closure of the big gap emphasized the importance of human intervention in analyzing computationally generated data. The strategy of using freshly extracted genomic DNA and designing primers in the conserved regions of the ORFs of other Thermotoga strains was the most precise way of handling an underrepresented region of the genome. The genes unique to this genome were identified successfully. The most important finding was the absence of the genes coding for the TneDI restriction-modification system in T. sp. strain RQ7 (Figure 30). The absence of this system opens an opportunity to investigate whether T. sp. strain RQ7 uses a different system to restrict foreign DNA. Additionally, the genome-wide comparison of putative competence genes (Table 10) in four Thermotoga strains provides a foundation for studying the existence of a functional natural transformation system in this genus. The organisms of this genus are excellent sources of thermostable enzymes and biohydrogen, which makes them important for industrial applications. Thus, studying the natural transformation and competence machinery of this genus would give a new boost to projects that aim to make these organisms industrially suitable. Identifying the CRISPR/Cas system and the individual CRISPR loci could later be useful for detecting phages in the niche of these hyperthermophilic bacteria, as well as other life forms that co-exist with T. sp. strain RQ7 at high temperatures.


V. CONCLUSIONS

The complete genome of T. sp. strain RQ7, consisting of 1,851,618 bp, was sequenced in this study. The genome has shed light on important features pertaining to various aspects of lateral gene transfer in the genus, promoting genetic and genomic studies of these organisms. The study also successfully developed a pipeline for genome assembly that uses commonly available tools and resources. The pipeline demonstrates the importance of combining wet lab and in silico approaches in biological systems. The methods and principles of this pipeline are applicable to similar studies involving both prokaryotes and eukaryotes.


VI. REFERENCES

1. Huber R, Langworthy TA, Konig H, Thomm M, Woese CR, Sleytr UB, Stetter KO: Thermotoga-Maritima Sp-Nov Represents a New Genus of Unique Extremely Thermophilic Eubacteria Growing up to 90-Degrees-C. Archives of Microbiology 1986, 144(4):324-333.
2. Xu Z, Han D, Cao J, Saini U: Cloning and characterization of the TneDI restriction-modification system of Thermotoga neapolitana. Extremophiles 2011, 15(6):665-672.
3. Frock AD, Notey JS, Kelly RM: The genus Thermotoga: recent developments. Environ Technol 2010, 31(10):1169-1181.
4. Harriott OT, Huber R, Stetter KO, Betts PW, Noll KM: A cryptic miniplasmid from the hyperthermophilic bacterium Thermotoga sp. strain RQ7. J Bacteriol 1994, 176(9):2759-2762.
5. Yu J: Genetic tools for the study of Thermotoga. Storrs, Connecticut UMI Dissertations Publishing; 1998.
6. Yu JS, Noll KM: Plasmid pRQ7 from the hyperthermophilic bacterium Thermotoga species strain RQ7 replicates by the rolling-circle mechanism. J Bacteriol 1997, 179(22):7161-7164.
7. Mongodin EF, Hance IR, Deboy RT, Gill SR, Daugherty S, Huber R, Fraser CM, Stetter K, Nelson KE: Gene transfer and genome plasticity in Thermotoga maritima, a model hyperthermophilic species. J Bacteriol 2005, 187(14):4935-4944.
8. Windberger E, Huber R, Trincone A, Fricke H, Stetter K: Thermotoga thermarum sp. nov. and Thermotoga neapolitana occurring in African continental solfataric springs. Archives of Microbiology 1989, 151(6):506-512.
9. Takahata Y, Nishijima M, Hoaki T, Maruyama T: Thermotoga petrophila sp. nov. and Thermotoga naphthophila sp. nov., two hyperthermophilic bacteria from the Kubiki oil reservoir in Niigata, Japan. Int J Syst Evol Microbiol 2001, 51(Pt 5):1901-1909.
10. Latif H, Lerman JA, Portnoy VA, Tarasova Y, Nagarajan H, Schrimpe-Rutledge AC, Smith RD, Adkins JN, Lee DH, Qiu Y et al: The genome organization of Thermotoga maritima reflects its lifestyle. PLoS Genet 2013, 9(4):e1003485.
11. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA et al: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399(6734):323-329.
12. Oren A: Microbial life at high salt concentrations: phylogenetic and metabolic diversity. Saline Systems 2008, 4:2.
13. Nanavati DM, Thirangoon K, Noll KM: Several archaeal homologs of putative oligopeptide-binding proteins encoded by Thermotoga maritima bind sugars. Appl Environ Microbiol 2006, 72(2):1336-1345.
14. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463-5467.
15. Ross JS, Cronin M: Whole genome sequencing by next-generation methods. Am J Clin Pathol 2011, 136(4):527-539.
16. Grada A, Weinbrecht K: Next-generation sequencing: methodology and application. J Invest Dermatol 2013, 133(8):e11.
17. Stein L: Genome annotation: from sequence to biology. Nat Rev Genet 2001, 2(7):493-503.
18. Koonin EV, Galperin MY: Sequence Evolution Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.
19. Van Ooteghem SA, Beer SK, Yue PC: Hydrogen production by the thermophilic bacterium Thermotoga neapolitana. Appl Biochem Biotechnol 2002, 98-100:177-189.
20. Han D, Norris SM, Xu Z: Construction and transformation of a Thermotoga-E. coli shuttle vector. BMC Biotechnol 2012, 12:2.
21. Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y et al: The sequence and de novo assembly of the giant panda genome. Nature 2010, 463(7279):311-317.
22. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20(2):265-272.
23. Puranik R, Quan G, Werner J, Zhou R, Xu Z: A pipeline for completing bacterial genomes using in silico and wet lab approaches. BMC Genomics 2015, 16 Suppl 3:S7.
24. Koressaar T, Remm M: Enhancements and modifications of primer design program Primer3. Bioinformatics 2007, 23(10):1289-1291.
25. Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG: Primer3--new capabilities and interfaces. Nucleic Acids Res 2012, 40(15):e115.
26. Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14(7):1394-1403.
27. Nesbo CL, Dlutek M, Doolittle WF: Recombination in Thermotoga: implications for species concepts and biogeography. Genetics 2006, 172(2):759-769.
28. Han D, Xu H, Puranik R, Xu Z: Natural transformation of Thermotoga sp. strain RQ7. BMC Biotechnol 2014, 14:39.
29. Xu H, Han D, Xu Z: Expression of heterologous cellulases in Thermotoga sp. Strain RQ2. BioMed research international 2015 In Press.
30. Alm RA, Bodero AJ, Free PD, Mattick JS: Identification of a novel gene, pilZ, essential for type 4 fimbrial biogenesis in Pseudomonas aeruginosa. J Bacteriol 1996, 178(1):46-53.
31. van Schaik EJ, Giltner CL, Audette GF, Keizer DW, Bautista DL, Slupsky CM, Sykes BD, Irvin RT: DNA binding: a novel function of Pseudomonas aeruginosa type IV pili. J Bacteriol 2005, 187(4):1455-1464.
32. Aas FE, Wolfgang M, Frye S, Dunham S, Lovold C, Koomey M: Competence for natural transformation in Neisseria gonorrhoeae: components of DNA binding and uptake linked to type IV pilus expression. Mol Microbiol 2002, 46(3):749-760.
33. Pratt JT, Tamayo R, Tischler AD, Camilli A: PilZ domain proteins bind cyclic diguanylate and regulate diverse processes in Vibrio cholerae. J Biol Chem 2007, 282(17):12860-12870.
34. Amikam D, Galperin MY: PilZ domain is part of the bacterial c-di-GMP binding protein. Bioinformatics 2006, 22(1):3-6.
35. Guzzo CR, Salinas RK, Andrade MO, Farah CS: PILZ protein structure and interactions with PILB and the FIMX EAL domain: implications for control of type IV pilus biogenesis. J Mol Biol 2009, 393(4):848-866.
36. Chiang P, Sampaleanu LM, Ayers M, Pahuta M, Howell PL, Burrows LL: Functional role of conserved residues in the characteristic secretion NTPase motifs of the Pseudomonas aeruginosa type IV pilus motor proteins PilB, PilT and PilU. Microbiology 2008, 154(Pt 1):114-126.
37. Nunn D, Bergman S, Lory S: Products of three accessory genes, pilB, pilC, and pilD, are required for biogenesis of Pseudomonas aeruginosa pili. J Bacteriol 1990, 172(6):2911-2919.
38. Seitz P, Blokesch M: DNA-uptake machinery of naturally competent Vibrio cholerae. Proc Natl Acad Sci U S A 2013, 110(44):17987-17992.
39. Assalkhou R, Balasingham S, Collins RF, Frye SA, Davidsen T, Benam AV, Bjoras M, Derrick JP, Tonjum T: The outer membrane secretin PilQ from Neisseria meningitidis binds DNA. Microbiology 2007, 153(Pt 5):1593-1603.
40. Burkhardt J, Vonck J, Averhoff B: Structure and function of PilQ, a secretin of the DNA transporter from the thermophilic bacterium Thermus thermophilus HB27. J Biol Chem 2011, 286(12):9977-9984.
41. Collins RF, Davidsen L, Derrick JP, Ford RC, Tonjum T: Analysis of the PilQ secretin from Neisseria meningitidis by transmission electron microscopy reveals a dodecameric quaternary structure. J Bacteriol 2001, 183(13):3825-3832.
42. Graupner S, Frey V, Hashemi R, Lorenz MG, Brandes G, Wackernagel W: Type IV pilus genes pilA and pilC of Pseudomonas stutzeri are required for natural genetic transformation, and pilA can be replaced by corresponding genes from nontransformable species. J Bacteriol 2000, 182(8):2184-2190.
43. Rudel T, Facius D, Barten R, Scheuerpflug I, Nonnenmacher E, Meyer TF: Role of pili and the phase-variable PilC protein in natural competence for transformation of Neisseria gonorrhoeae. Proc Natl Acad Sci U S A 1995, 92(17):7986-7990.
44. Freitag NE, Seifert HS, Koomey M: Characterization of the pilF-pilD pilus-assembly locus of Neisseria gonorrhoeae. Mol Microbiol 1995, 16(3):575-586.
45. Nunn DN, Lory S: Product of the Pseudomonas aeruginosa gene pilD is a prepilin leader peptidase. Proc Natl Acad Sci U S A 1991, 88(8):3281-3285.
46. Paranjpye RN, Lara JC, Pepe JC, Pepe CM, Strom MS: The type IV leader peptidase/N-methyltransferase of Vibrio vulnificus controls factors required for adherence to HEp-2 cells and virulence in iron-overloaded mice. Infect Immun 1998, 66(12):5659-5668.
47. Paranjpye RN, Strom MS: A Vibrio vulnificus type IV pilin contributes to biofilm formation, adherence to epithelial cells, and virulence. Infect Immun 2005, 73(3):1411-1422.
48. Draskovic I, Dubnau D: Biogenesis of a putative channel protein, ComEC, required for DNA uptake: membrane topology, oligomerization and formation of disulphide bonds. Mol Microbiol 2005, 55(3):881-896.
49. Hahn J, Inamine G, Kozlov Y, Dubnau D: Characterization of comE, a late competence operon of Bacillus subtilis required for the binding and uptake of transforming DNA. Mol Microbiol 1993, 10(1):99-111.
50. Larson TG, Goodgal SH: Sequence and transcriptional regulation of com101A, a locus required for genetic transformation in Haemophilus influenzae. J Bacteriol 1991, 173(15):4683-4691.
51. Londono-Vallejo JA, Dubnau D: comF, a Bacillus subtilis late competence locus, encodes a protein similar to ATP-dependent RNA/DNA helicases. Mol Microbiol 1993, 9(1):119-131.
52. Tomb JF, el-Hajj H, Smith HO: Nucleotide sequence of a cluster of genes involved in the transformation of Haemophilus influenzae Rd. Gene 1991, 104(1):1-10.
53. Wolfgang M, Lauer P, Park HS, Brossay L, Hebert J, Koomey M: PilT mutations lead to simultaneous defects in competence for natural transformation and twitching motility in piliated Neisseria gonorrhoeae. Mol Microbiol 1998, 29(1):321-330.
54. Kline KA, Criss AK, Wallace A, Seifert HS: Transposon mutagenesis identifies sites upstream of the Neisseria gonorrhoeae pilE gene that modulate pilin antigenic variation. J Bacteriol 2007, 189(9):3462-3470.
55. Russell MA, Darzins A: The pilE gene product of Pseudomonas aeruginosa, required for pilus biogenesis, shares amino acid sequence identity with the N-termini of type 4 prepilin proteins. Mol Microbiol 1994, 13(6):973-985.
56. Chen I, Gotschlich EC: ComE, a competence protein from Neisseria gonorrhoeae with DNA-binding activity. J Bacteriol 2001, 183(10):3160-3168.
57. Inamine GS, Dubnau D: ComEA, a Bacillus subtilis integral membrane protein required for genetic transformation, is needed for both DNA binding and transport. J Bacteriol 1995, 177(11):3045-3051.
58. Provvedi R, Dubnau D: ComEA is a DNA receptor for transformation of competent Bacillus subtilis. Mol Microbiol 1999, 31(1):271-280.
59. Takeno M, Taguchi H, Akamatsu T: Role of ComEA in DNA uptake during transformation of competent Bacillus subtilis. J Biosci Bioeng 2012, 113(6):689-693.
60. Gwinn ML, Ramanathan R, Smith HO, Tomb JF: A new transformation-deficient mutant of Haemophilus influenzae Rd with normal DNA uptake. J Bacteriol 1998, 180(3):746-748.
61. Karudapuram S, Zhao X, Barcak GJ: DNA sequence and characterization of Haemophilus influenzae dprA+, a gene required for chromosomal but not plasmid DNA transformation. J Bacteriol 1995, 177(11):3235-3240.
62. Mortier-Barriere I, Velten M, Dupaigne P, Mirouze N, Pietrement O, McGovern S, Fichant G, Martin B, Noirot P, Le Cam E et al: A key presynaptic role in transformation for a widespread bacterial protein: DprA conveys incoming ssDNA to RecA. Cell 2007, 130(5):824-836.
63. Dougherty BA, Smith HO: Identification of Haemophilus influenzae Rd transformation genes using cassette mutagenesis. Microbiology 1999, 145(Pt 2):401-409.
64. Hamilton HL, Dillard JP: Natural transformation of Neisseria gonorrhoeae: from DNA donation to homologous recombination. Mol Microbiol 2006, 59(2):376-385.
65. Snider J, Houry WA: AAA+ proteins: diversity in function, similarity in structure. Biochem Soc Trans 2008, 36(Pt 1):72-77.
66. Yamasaki T, Nakazaki Y, Yoshida M, Watanabe YH: Roles of conserved arginines in ATP-binding domains of AAA+ chaperone ClpB from Thermus thermophilus. FEBS J 2011, 278(13):2395-2403.
67. Dubnau D: Binding and transport of transforming DNA by Bacillus subtilis: the role of type-IV pilin-like proteins--a review. Gene 1997, 192(1):191-198.
68. Kaufenstein M, van der Laan M, Graumann PL: The three-layered DNA uptake machinery at the cell pole in competent Bacillus subtilis cells is a stable complex. J Bacteriol 2011, 193(7):1633-1642.
69. Xu H, Han D, Xu Z: Overexpression of a lethal methylase, M.TneDI, in E. coli BL21(DE3). Biotechnol Lett 2014, 36(9):1853-1859.
70. Bhaya D, Davison M, Barrangou R: CRISPR-Cas systems in bacteria and archaea: versatile small RNAs for adaptive defense and regulation. Annu Rev Genet 2011, 45:273-297.
71. Mojica FJ, Diez-Villasenor C, Garcia-Martinez J, Soria E: Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 2005, 60(2):174-182.
72. Richter C, Chang JT, Fineran PC: Function and regulation of clustered regularly interspaced short palindromic repeats (CRISPR) / CRISPR associated (Cas) systems. Viruses 2012, 4(10):2291-2311.
73. Grissa I, Vergnaud G, Pourcel C: CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 2007, 35(Web Server issue):W52-57.
74. Grissa I, Vergnaud G, Pourcel C: The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics 2007, 8:172.
75. Grissa I, Vergnaud G, Pourcel C: CRISPRcompar: a website to compare clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 2008, 36(Web Server issue):W145-148.
76. Bikard D, Hatoum-Aslan A, Mucida D, Marraffini LA: CRISPR interference can prevent natural transformation and virulence acquisition during in vivo bacterial infection. Cell Host Microbe 2012, 12(2):177-186.
77. Haft DH, Selengut J, Mongodin EF, Nelson KE: A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 2005, 1(6):e60.
78. Zhang J, Rouillon C, Kerou M, Reeks J, Brugger K, Graham S, Reimann J, Cannone G, Liu H, Albers SV et al: Structure and mechanism of the CMR complex for CRISPR-mediated antiviral immunity. Mol Cell 2012, 45(3):303-313.