Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172

by

Soulbee Jin

A thesis submitted in conformity with the requirements for the degree of Master of Science Ecology and Evolutionary Biology University of Toronto

© Copyright by Soulbee Jin, 2010.

Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172

Soulbee Jin

Master of Science

Ecology and Evolutionary Biology University of Toronto

2010

Abstract

The genome of B. phytofirmans OLGA172 has been sequenced using two Next Generation sequencing platforms, the Illumina (Solexa) Genome Analyzer and the Roche 454 GS-FLX. Through various bioinformatic and molecular work, over 42 kbp of its genome surrounding its 3CBA degradative genes was assembled and annotated. The most important approach used here was synteny analysis: conserved gene order implies homology between the genes and descent from a common ancestor (Guttman, 2008). The conserved gene order between B. phytofirmans PsJN, B. xenovorans LB400, and OLGA172 was used to confirm the annotation obtained through BLASTn, enabled closing of the gaps in the NextGen sequencing data, and allowed prediction of genes further downstream. Although the whole genome has not been assembled, a very significant region that carries a concentrated area of mobile genetic elements (MGE) has been found to surround the degradative genes, tfdCIDIEIFI, in OLGA172. This thesis details the sequence evidence that, upon examination of closely related strains, OLGA172 and its related strains from pristine soils may be the ancestral chlorobenzoate degraders.


Acknowledgments

I would like to thank Nicole Ricker (Ph.D. candidate), Jackie Goordial, Cindy Bongard, and especially Dr. Roberta Fulthorpe for all their effort and time. I really could not have asked for a better supervisor over the years of my research at U of T. She was always there to answer my countless emails, phone calls, and any small questions I had along the way. She was always available when I needed her, not only for the research advice, but also for personal issues, always providing me with thoughtful advice and guidance. She has become more of a friend to me and I would really like to thank her for that.

Jackie, Nicole, and Cindy helped me so much in clarifying many issues surrounding this thesis, and were a great source of moral support. They have helped me stay composed and calm throughout the stress and anxiety that comes with fast approaching deadlines. I would like to thank each one of them, as it would have been very difficult to complete this project without them. All of their input and comments have made a positive impact on me, and it is really great to have such good friends as my lab mates.

Lastly, I would like to thank all of my family and friends for all their support, as well as Second Cup for providing me with great coffee and a nice place to write.


Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Abbreviations and Acronyms

List of Appendices

Chapter 1

1 Introduction to B. phytofirmans OLGA172 and General Literature Review

1.1 Background information

1.2 Mobile Genetic Elements

1.3 Chloro-aromatic degraders from contaminated/industrial sites

1.4 Spontaneous phenotypic instability in other strains

1.5 Hypotheses and Structure of the Thesis

Chapter 2

2 Next Generation Sequencing

2.1 Methods

2.1.1 Purification and DNA extraction of B. phytofirmans OLGA172

2.1.2 Confirming the identity and purity of OLGA172 DNA

2.1.3 Next Generation Sequencing and preliminary assembly

2.1.4 GC content and overall Genome Matches

2.1.5 Multiple sequence alignment and phylogenetic trees

2.2 Results

2.2.1 Isolation and confirmation of OLGA172

2.2.2 Total genome homology and GC content and analysis

2.2.3 Summary of Data Generated by Solexa Genome Analyzer and 454 GS-FLX

2.2.4 There are no evidence of plasmids in OLGA172

2.2.5 Gene annotation

2.2.6 Sequence analysis of the integrases

2.3 Discussion

2.3.1 Advantages and Disadvantages of NextGen Sequencing

2.3.2 Possible HGT of the tfd catabolic operon in OLGA172

Chapter 3

3 Extension of regions flanking the catabolic operon

3.1 Methods

3.1.1 Sequence linkage via PCR based on synteny analysis

3.1.2 Linkage of sequences via Thermal Asymmetric InterLaced PCR (TAIL PCR)

3.1.3 Secondary structure analysis and GC content calculation

3.2 Results

3.2.1 OLGA172 shares conserved gene order with PsJN and LB400

3.2.2 TAIL PCR

3.2.3 Sequence extension beyond RIT BphO1

3.2.4 Secondary structure analysis and GC content calculation

3.3 Discussion

3.3.1 Shared homology between OLGA, PsJN, and LB400

3.3.2 Confirmation of “junkyard” presence in OLGA172

3.3.3 DNA sequencing interrupted by secondary structure

Chapter 4

4 Analysis of OLGA172 and related strains from pristine soils

4.1 Methods

4.1.1 The ability of OLGA172 and its related strains from pristine soils to break down and grow in the presence of CBA

4.1.2 Lysis of cells and amplification of CBA degradative genes

4.1.3 Phylogenetic analysis

4.2 Results

4.2.1 Growth of OLGA172 and its related pristine isolates on CBA and their possession of degradative genes

4.2.2 Phylogenetic analysis of the degradative genes, tfdCI and tfdDI

4.3 Discussion

4.3.1 OLGA172 is a representative CBA degrader from pristine soils

Chapter 5

5 General Discussion

5.1 Future Work

References

Appendices

6 Appendix

6.1 Total genome homology

6.2 GC content calculation

6.3 Total number of N’s in Solexa sequencing data

6.4 Multiple sequence alignment of the RIT elements found in OLGA172 (RIT BphO1), CH34, H16, and H1

6.5 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying pristine soil isolates (Figure 4-4)

6.6 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying strains available in GenBank (Figure 4-4)


List of Tables

Chapter Table No. Title

2 2-1 Primers used in PCR in Ch. 2

2 2-2 Thermal cycle conditions for PCR in Ch. 2

2 2-3 Total amount of bp (%) LB400 and PsJN genome that is overlapped by that of OLGA172

2 2-4 GC content of OLGA172, LB400, JMP134, and PsJN genomes and their degradative genes

2 2-5 Output of Solexa vs. 454

2 2-6 Genes involved in chromosome and plasmid replication found in OLGA172

2 2-7 Annotation of genes neighboring the degradative genes in OLGA172. Refer to Figure 2-3.

3 3-1 Primers used in PCR in Ch. 3

3 3-2 Thermal cycle conditions for PCR in Ch. 3

3 3-3 TAIL PCR thermal cycles

3 3-4 Annotation of genes that share their conserved gene order with PsJN and LB400

3 3-5 Annotation of genes found in NODE_2930

4 4-1 Primers used in PCR in Ch. 4

4 4-2 Thermal cycle conditions for PCR in Ch. 4

4 4-3 The pristine isolates’ ability to grow on CBA media

4 4-4 The pristine isolates’ possession of the degradative genes


List of Figures

Chapter Figure No. Title

1 1-1 A map showing the locations from which OLGA172 and its related strains from pristine soils were isolated

1 1-2 Modified ortho-pathway of 3-chlorobenzoate.

1 1-3 A schematic diagram of OLGA172 sequence as found in GenBank – tfdT, tfdC, tfdD, and partial site specific recombinase

2 2-1 a) BOX fingerprint of OLGA172; b) 16S ARDRA profile of OLGA172

2 2-2 PCR amplified tfdC, int-λ gene, and 4525 bp of OLGA172 sequence as found in GenBank (Acc. No. AY168634)

2 2-3 27 kbp Sequencher alignment of the contigs surrounding OLGA172’s degradative operon

2 2-4 CLUSTAL alignment of amino acid sequence of int-λ gene found in OLGA172, CH34, H16 and H1.

2 2-5 A neighbor-joining phylogenetic tree of the RIT element found in OLGA172, CH34, H16 and H1

3 3-1 A schematic diagram of the region surrounding the degradative genes in OLGA172

3 3-2 A schematic diagram of the degradative genes and associated RIT BphO1 in OLGA172

3 3-3 The region of the 3rd int gene from which TAIL PCR primers were designed

3 3-4 Details of TAIL PCR protocol and expected products

3 3-5 a) A schematic diagram of the region where gene annotation was terminated in Ch. 2; b) gap filling method through conserved gene order between OLGA172, PsJN and LB400

3 3-6 HGT mechanism by which the degradative genes may have introduced themselves into OLGA172.

3 3-7 Gel electrophoreses of TAIL PCR reactions

3 3-8 TAIL PCR aligned with OLGA172’s 3rd int gene. A chromatogram provided of the product.

3 3-9 The connection between NODE_318, NODE_444, contig00091, and NODE_2930

3 3-10 A diagram showing the region of 3rd int gene where the sequence of the TAIL PCR product was terminated – higher level of GC content is predicted.

3 3-11 Possible secondary structures of OLGA172’s genome where the 6 bp palindrome exists

4 4-1 A diagram showing the sequential plating of the pristine isolates on CBA and R2A media

4 4-2 Alignment of tfdDI sequences in order to design primers targeting tfdDI gene

4 4-3 A picture of two CBA plates differing in colour due to the (in)ability of the isolates to degrade CBA

4 4-4 A phylogenetic tree of a) chlorocatechol-1,2-dioxygenase

4 4-4 A phylogenetic tree of b) chloromuconate cycloisomerase


List of Abbreviations and Acronyms

Abbreviations and acronyms

2,4-D: 2,4-dichlorophenoxyacetic acid

3-CBA: 3-chlorobenzoate

AAB: Acetic acid bacterium

APS: Adenosine 5’ phosphosulfate

CCD: Charge coupled device

CCD operon: Chlorocatechol degrading operon

clc genes: Chlorocatechol degrading genes

emPCR: Emulsion PCR

GA: Genome Analyzer

HGT: Horizontal gene transfer

HTR: Hyper mutable tandem repeats

ICE: Integrative and conjugative elements

int gene: Integrase gene

IR: Inverted repeat

IS: Insertion Sequence

MOCP: Modified ortho-cleavage pathway

PCB: Polychlorinated biphenyl

PCR: Polymerase Chain Reaction

PPi: Pyrophosphate

RIT: Recombinases in Trios

TAIL PCR: Thermal Asymmetric InterLaced PCR

tcb genes: Chlorobenzene degrading genes

tfd genes: 2,4-dichlorophenoxyacetic acid degrading genes

TR: Tandem repeats


List of Appendices

Chapter Appendix No. Title

6 6.1 Perl script of total base homology calculation

6 6.2 Perl script of GC content calculation

6 6.3 CLUSTAL alignment of the RIT element found in OLGA172, CH34, H16, and H1

6 6.4 Multiple sequence alignment of the RIT elements found in OLGA172 (RIT BphO1), CH34, H16, and H1

6 6.5 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying pristine soil isolates (Figure 4-4)

6 6.6 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying strains available in GenBank (Figure 4-4)


Chapter 1

1 Introduction to B. phytofirmans OLGA172 and General Literature Review

1.1 Background information

Bioaccumulation of man-made halogenated aromatics has been an increasing concern worldwide. Many of these compounds are used as fire retardants, paints, varnishes, solvents, herbicides, and pesticides (Ghosal et al, 1985). Some soil microorganisms have developed the ability to biodegrade these xenobiotic pollutants through adaptation (Tan, 1999). Such adaptive responses include induction of enzymes that increase degradative capacity, and selection of mutated enzymes with altered specificity directed towards those compounds (van der Meer et al, 1992). Genetic diversity arising from adaptive processes has expanded the substrate specificity and utilization patterns of the soil microflora, and enabled the maintenance of their metabolic activities under such environmental stress (Cavalca et al, 1999). Through natural selection under such conditions, species that did not evolve the ability to break down these compounds remained vulnerable to them, while those that acquired the ability to use these compounds can grow.

Polychlorinated biphenyls (PCBs) were widely used as transformer oils until it was discovered that they were accumulating in the food chain, due to their biological stability and their lipophilicity (Barrie et al, 1992). One example of this is the accumulation of PCBs in the tissues of polar bears, where they were first discovered in the 1970s. PCBs have been found to pass from a mother to her cubs, and are known to cause reproductive disorders (Andersen et al, 2001). Some PCBs can be degraded via an aerobic degradation pathway that proceeds through chlorobenzoate intermediates. Details of these pathways have been intensively studied, and many 3-chlorobenzoate (3CBA) degraders have been isolated.


Figure 1-1. The stars indicate the locations from which the pristine soil bacterial strains were collected. These locations are Saskatchewan, California, South Western Australia, Central Chile, S. Africa, and Western Russia (Fulthorpe, 1998).

Burkholderia phytofirmans OLGA172 is representative of a large group of 3CBA degraders isolated from pristine soils with no known previous exposure to chemical pollution. These strains were collected from all around the world: Chile, Russia, S. Africa, California, Saskatchewan, and Australia (Fulthorpe, 1998; Figure 1-1). OLGA172 was isolated from Russia (38°0’E, 60°0’N). The breakdown pathway of 3CBA is similar to that of 2,4-dichlorophenoxyacetic acid, a widely used herbicide for the destruction of broad leaf weeds (and also a major component of Agent Orange). Both compounds are metabolized through chlorocatechol. Chlorocatechol is a central intermediate in the biodegradation of various chlorinated aromatic compounds (Liu et al, 2001), which proceeds via the modified ortho-cleavage pathway (MOCP) under aerobic conditions. This pathway has been previously described in various species including Pseudomonas sp. strain B13 and Cupriavidus necator JMP134 (Weisshaar et al, 1987; Ghosal et al, 1985; Cavalca et al, 1999). The steps in this pathway are shown in Figure 1-2. Chlorocatechol 1,2-dioxygenase catalyzes the formation of 2-chloromuconate from 3-chlorocatechol. Chloromuconate cycloisomerase then carries out the cycloisomerization of 2-chloromuconate into trans-dienelactone. Dienelactone hydrolase transforms the dienelactone into maleylacetate, which is then reduced by maleylacetate reductase into 3-oxoadipate. 3-oxoadipate enters the TCA cycle to form succinate (Solyanikova and Golovleva, 2004). Chlorocatechol 1,2-dioxygenase, chloromuconate cycloisomerase, dienelactone hydrolase, and maleylacetate reductase are encoded by the genes of the chlorocatechol degradative (CCD) operon. These genes have been given various names, depending on the strain they were studied in. In the chlorobenzoate degraders Pseudomonas sp. st. B13, Burkholderia sp. st. NK8, and Cupriavidus necator NH9, they are referred to as clcA, B, C, and D (Ravatn et al, 1998), cbeA, B, C, and D (Perigio et al, 2001), and cbnA, B, C, and D (Ogawa and Miyashita, 1999), respectively. In trichlorobenzene degraders, they are referred to as tcbC, D, E, and F (Klemba et al, 2000), and in 2,4-D degraders as tfdC, D, E, and F (Laemmli et al, 2000). In this thesis I use the tfd notation for the degradative genes present in OLGA172, because these genes in OLGA172 are most closely related to the tfd genes found in the 2,4-D degrading strain Cupriavidus necator JMP134. The aromatic ring cleavage of chlorocatechol is the rate limiting step, and in strains that lack this ability, the buildup of chlorocatechol is fatal to the cell (Cavalca et al, 1999).

Figure 1-2. Modified ortho-cleavage pathway of 3-chlorobenzoate degradation. The gene clusters and the enzymes responsible for each step are labeled. The tfdCDEF gene cluster is used by B. phytofirmans OLGA172.

Figure 1-3. A schematic diagram of B. phytofirmans OLGA172 DNA sequence as found in GenBank (Accession Number AY168634.1).


In B. phytofirmans OLGA172, a part of the CCD operon has already been sequenced (GenBank Accession Number AY168634.1). It consists of a partial λ family site specific recombinase gene, the LysR type transcriptional regulator tfdT, a second LysR type transcriptional regulator tfdT2 that is possibly inactivated by a frameshift mutation, chlorocatechol 1,2-dioxygenase tfdC, and chloromuconate cycloisomerase tfdD (Figure 1-3).

Based on observations in the laboratory, OLGA172 frequently shows phenotypic variation in its ability to completely degrade 3CBA. On occasion, partial degradation is observed as a brown/black pigment in liquid cultures or in agar plates containing 3CBA, due to the accumulation of oxidized and polymerized chlorocatechol, which has been found to be toxic to bacterial cells (Fava et al, 1993). The brown pigment is a convenient visual cue that OLGA172 is degrading 3CBA incompletely or inefficiently. OLGA172 also shows variation in the concentrations of 3CBA that it is able to tolerate and degrade. In some instances, OLGA172 was not able to grow in concentrations of 3CBA above 3 mM or 5 mM; in others, it grew well in concentrations of 3CBA up to 9 mM. In some cases, OLGA172 was not found to grow on 3CBA until 17 days had passed, although typically a growth period of 5-6 days was observed. When grown on non-selective media, OLGA172 is more reliable in its growth patterns. Causes for this phenotypic instability when grown on 3CBA can include the loss or movement of the catabolic genes within the genome (genetic instability), or non-optimal transcriptional control of the genes (Goordial, 2009).

This thesis details how I extended the short OLGA172 sequence available in GenBank and placed it in the context of its genome, in order to determine the chromosomal or plasmid location of the degradative operon and to identify its neighboring genes. This may allow us to determine what causes the phenotypic instability of OLGA172 in degrading 3CBA. The possible origin of the CCD operon and its relatedness to homologous genes in other pristine soil 3CBA degraders are also considered.

1.2 Mobile Genetic Elements

Many of the genes responsible for the degradation of petroleum based pollutants have been shown to be located on plasmids in many bacterial strains, and they have historically been considered non-mobile when found on chromosomes (van der Meer and Sentchilo, 2003). Upon further study of the various strains possessing the ability to degrade these compounds, many degradative genes were shown to be flanked by numerous recombinases, repeat elements, and transposable elements. These are strongly suggestive of the past and future mobility of these genes.

In most strains found in contaminated sites, the evolution of catabolic pathways for xenobiotic compounds does not involve de novo generation of new enzymes with specific capacities to break down the novel compounds. Evidence shows that the genes are drawn from pre-existing genetic material through rearrangements of gene fragments from different microorganisms (Fulthorpe and Top, 2009; van der Meer et al, 2001). Since some chlorinated compounds are naturally produced (de Jong et al, 1994), the enzymes required to metabolize them may have been present in these strains long before the introduction of xenobiotic analogues into the environment. This pre-existing genetic material is available for recruitment into new metabolic pathways (Fulthorpe and Top, 2009; van der Meer et al, 2001). The substrate specificity of an enzyme depends on subtle differences in its amino acid sequence. For instance, Liu et al (2005) studied the activity of cbnA from the 3CBA degrader C. necator NH9 and tcbC from the 1,2,4-trichlorobenzene degrader Pseudomonas sp. st. P51. cbnA has high activity against 3,5-dichlorocatechol and low activity against 3,4-dichlorocatechol, whereas the specific activity of tcbC is the opposite. Amino acids Val-48 and Ala-52 are of critical importance in determining this substrate specificity; otherwise, the sequences of cbnA and tcbC are almost identical (12 out of 251 amino acids differ). Thus, small mutations in an enzyme can alter its specificity towards degrading a new synthetic but structurally analogous substrate.

Gene dosage effects may also play a critical role in the degradation of these otherwise non-degradable compounds. These compounds may be chemically analogous to a naturally occurring chlorinated substrate. Thus, the kinetic limitations of breaking down the synthetic compound may be overcome by enzyme overproduction through a gene dosage effect (Ghosal et al, 1985). The gene dosage effect is also used to deal with toxic intermediates produced in the pathway that breaks down CBA. Several studies have shown that in Pseudomonas sp. st. B13, a single copy of the clc element carrying the genes necessary to break down 3CBA is in fact insufficient, and a higher copy number of these genes is required for growth in the presence of 3CBA (Plumeier et al, 2002; Perez-Pantoja et al, 2003; Laemmli et al, 2004).


Although many microorganisms may have developed the ability to break down synthetic chlorinated compounds via vertical transmission of mutated genes, others may just as easily have acquired these genes through horizontal gene transfer (HGT). HGT can occur through three different mechanisms: transformation, transduction, and conjugation. Transformation requires the uptake of naked DNA from the environment, without a donor cell. In transduction, DNA is introduced by a bacteriophage that has replicated and packaged either DNA fragments or the DNA adjacent to its attachment site. Transduction also does not require the presence of a donor cell; phage-encoded proteins can deliver and integrate the DNA into the recipient’s chromosome and protect it from the host’s endonucleases. Conjugation requires physical contact between the donor and the recipient cells. The DNA is transferred via self-transmissible or self-integrating plasmids and also by conjugative transposons (Ochman et al, 2000). Compared to vertical transfer of genes from a parent strain to its daughter strain, HGT allows quantum leaps in evolutionary time, bypassing the time required for successful growth and harboring of the genes between each generation (Hacker and Carniel, 2001). Thus, in environments where pollution by man-made compounds can provide adequate carbon for growth, HGT of these genes allows for rapid genetic adaptation to changing environmental conditions (Burrus and Waldor, 2004).

There are various elements which aid in HGT of genes between bacteria. There are three classes of bacterial transposons, differentiated by their genetic organization and transposition mechanisms. Class I elements include insertion sequences (IS), which carry only the genes necessary for their transposition, and composite transposons, which carry genetic traits unrelated to transposition flanked by two copies of an IS. These ISs can be in a direct or an inverted orientation and they are very similar to one another, if not identical (Tsuda et al, 1999). Composite transposons are the key vehicles for the worldwide spread of many adaptive genes, such as those for antibiotic resistance and metal resistance, as well as catabolic genes. Composite transposons consist of two insertion sequences, each made of two short inverted repeats bracketing one or more transposases, flanking the adaptive genes of interest (Wagner, 2006). The flanking ISs act cooperatively in order to mobilize the intervening genes. Transposons can ‘hop’ into phages and plasmids and can be transferred with them into other cells (Frost et al, 2005).

Class II transposons carry two short terminal inverted repeats and use a replicative mode of transposition that usually results in a 5 bp duplication of the target sequence. They each carry two recombinase genes, a transposase gene, tnpA, and a resolvase gene, tnpR. tnpA catalyzes the formation of the co-integrate between the donor and the target DNA molecules, which contains two directly repeated copies of the transposon. tnpR then catalyzes the site specific resolution between the two res sites located within the transposon copies.
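The defining structural feature described above, a pair of short terminal inverted repeats, can be checked computationally. The following is a minimal Python sketch, not a method used in this thesis: it tests whether the first bases of a candidate element equal the reverse complement of its last bases. The function names and the toy sequence are hypothetical, and the check requires a perfect match, whereas real inverted repeats are often imperfect.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(element: str, ir_length: int) -> bool:
    """True if the first ir_length bases are the exact reverse
    complement of the last ir_length bases (perfect repeats only)."""
    if ir_length <= 0 or 2 * ir_length > len(element):
        return False
    left = element[:ir_length].upper()
    right = element[-ir_length:].upper()
    return left == reverse_complement(right)


# Hypothetical toy element: 9 bp inverted repeats around an arbitrary core.
toy_element = "GGGGTCTGA" + "ATGCATGCATGC" + "TCAGACCCC"
print(has_terminal_inverted_repeats(toy_element, 9))
```

A practical search would tolerate mismatches between the two repeats and scan a range of repeat lengths rather than a single fixed value.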

Class III transposons are conjugative transposons that can transfer genetic materials via conjugation. They do not possess any inverted repeats and do not generate duplication of the target sequence. They transpose by specific excision of their transposon which circularizes to form a non-replicative intermediate. This is nicked at its oriT site, and a single stranded DNA is transferred to a recipient cell. The single stranded intermediates in the donor and the recipient cells are duplicated to form a double stranded DNA and integrate into the chromosome or the plasmid (Tsuda et al , 1999).

Site specific recombinases also aid in horizontal gene transfer events. There are three known types of recombinases: integrases, resolvases, and invertases. Integrases catalyze the integration and excision of circular DNA molecules, resolvases aid in the resolution of co-integrates, and invertases aid in the inversion of specific DNA fragments (Burrus et al, 2002). There are two major families of site specific recombinases: the tyrosine (integrase) family and the serine (resolvase, invertase) family. The tyrosine class of proteins forms covalent DNA-protein linkages between a C-terminal tyrosine and a 3’-phosphate (Kornberg and Baker, 1992). The λ integrase family of site specific recombinases belongs to the tyrosine class of recombinases, and the recombinase found in B. phytofirmans OLGA172 falls into this category. The λ family integrases are found in many diverse organisms. The C-terminal region of 180 amino acids has been determined to contain all the catalytic residues needed for cleavage and ligation. The absolutely conserved residues are R-(x)n-H-(x)2-R-(x)25-37-Y, including the tyrosine residue that is required for the covalent attachment of this recombinase to DNA (Nunes-Duby et al, 1998).
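The conserved residue pattern above lends itself to a simple pattern search. The sketch below is illustrative only and is not the analysis performed in this thesis: it translates R-(x)n-H-(x)2-R-(x)25-37-Y into a regular expression and searches the C-terminal 180 residues of a protein. Because the text does not give a value for n, the sketch assumes n is any length of at least one residue; the function name and the toy sequence are hypothetical.

```python
import re
from typing import Optional

# R, then n residues (assumed: one or more, matched non-greedily),
# then H, two residues, R, 25 to 37 residues, and the tyrosine Y.
MOTIF = re.compile(r"R[A-Z]+?H[A-Z]{2}R[A-Z]{25,37}Y")


def find_tyrosine_recombinase_motif(
    protein: str, c_terminal_length: int = 180
) -> Optional[str]:
    """Return the first motif match within the C-terminal region, or None."""
    tail = protein.upper()[-c_terminal_length:]
    match = MOTIF.search(tail)
    return match.group(0) if match else None


# Hypothetical toy protein built to contain the pattern once.
toy_protein = "M" + "A" * 10 + "R" + "A" * 5 + "H" + "AA" + "R" + "A" * 30 + "Y" + "AAA"
print(find_tyrosine_recombinase_motif(toy_protein))
```

A match of this kind is only a first screen; a real integrase assignment would rest on a full alignment against known family members.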

As more mobile elements are discovered, the terminology has become muddled in the literature. In the past, conjugative transposons, integrative plasmids, and genomic islands were considered separate and distinct entities. Now, they are collectively called Integrative and Conjugative Elements, ICE for short (van der Meer and Sentchilo, 2003). ICEs are defined as elements that encode proteins that facilitate their own transfer, and occasionally that of other genomic DNA, from a donor to a recipient cell (Frost et al, 2005). ICEs consist of three functional modules: maintenance, dissemination, and regulation modules. These three function as a scaffold that, together with other genes, enables their movement. As an example of the importance of these elements, SXT of Vibrio cholerae and R391 of Providencia rettgeri share 65 kb of nearly identical backbone modules for regulation, conjugative transfer, integration, and excision, and yet each harbors DNA insertions within this region that give each strain specific properties such as antibiotic/mercury resistance (Burrus and Waldor, 2004).

Generally, transposition activities are maintained at a low level, because of the accompanying mutagenic effects of genome rearrangement, such as spontaneous deletion of genes (Mahillon and Chandler, 1998). Therefore, genome stability depends on the balance between mutation and selection of required genes. Genome organization is shaped by gene expression: functionally related genes (those involved in the same metabolic pathway) are clustered together into operons, allowing optimization of gene regulation, which in turn benefits the bacterium and allows it to become fitter in its environment (Rocha, 2006). Under selective pressure where such a pathway is beneficial, these genes are positively selected for and propagated through HGT to other strains of bacteria. This is also true for the genes responsible for the degradation of man-made chlorinated aromatic compounds.

B. phytofirmans OLGA172 and its sibling 3CBA degrading strains collected from around the world originated from pristine soils. Since 3CBA is not a naturally occurring compound, and the sites of origin of these strains have had no previous exposure to 3CBA, this suggests that 3CBA is not their primary source of carbon. As stated earlier, it is very likely that 3CBA is structurally and/or chemically analogous to a naturally occurring chlorinated compound that the strains were originally degrading in their pristine environments. Since OLGA172 and its sibling pristine isolates were found in widely separated locations, I hypothesize that they may be the original source of the 3CBA degradative genes, which were horizontally transferred into strains found in contaminated sites through the various mechanisms of HGT and the different mobile elements explained above. These genes were then most likely duplicated and/or mutated in the recipient strains to better degrade 3CBA in their habitats. Evidence of such mobile elements is searched for throughout this study in order to test this hypothesis.


1.3 Chloro-aromatic degraders from contaminated/industrial sites

Numerous chloroaromatic degrading bacteria have been intensively studied. For the most part, these strains have been isolated from contaminated systems, i.e. agricultural zones or places receiving industrial waste. The knowledge gained from these studies forms the basis for the study of the catabolic genes in OLGA172 and its related strains. Analyzing the organization of mobile elements associated with the degradative genes in other organisms will help to determine if similar mechanisms exist in OLGA172.

One of the major model organisms studied in great depth for its ability to degrade chlorobenzoate and its possession of a site-specific integrase is Pseudomonas sp. strain B13. It was isolated from a sewage system and was the first pseudomonad described in the literature that used 3CBA as a sole carbon and energy source; however, it took many years for details of its operon to be revealed (Ravatn et al, 1998). It uses the same modified ortho-cleavage pathway as OLGA172, and its genes involved in degradation (clcABDE) lie on a 105 kb integrative and conjugative element (ICE) called the clc (chlorocatechol) element (Sentchilo et al, 2003). OLGA172’s degradative genes tfdCDEF are not highly similar to B13’s clcABDE, though both sets of genes enable degradation of 3CBA. The clc element exists in two forms: a circular, free plasmid form and a form integrated in the chromosome. Like many other elements of this sort, it integrates into a specific target sequence, the 3’ end of a tRNA structural gene (the glyV tRNA gene in B13), using its site-specific integrase, intB13 (Burrus et al, 2002). intB13 is also of the tyrosine class, retaining the active site residues R-(x)n-H-(x)2-R-(x)25-37-Y in the C-terminal region. Recombination takes place between an 18 bp sequence of the attachment site, attP, of the clc element and the identical 18 bp at the 3’ end of the tRNA-Gly attachment site, attB, and the integration results in duplication of the 18 bp sequence (van der Meer et al, 2001). At this time, it is unknown whether or not OLGA172 carries a plasmid. However, finding recognition sites such as attP and attB in OLGA172 would suggest that OLGA172 too may bear such an ICE, in which its degradative operon may lie.
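Because attP and attB share an identical 18 bp core, one simple way to look for candidate attachment sites is to list the 18-mers that two sequence regions have in common. The sketch below is a hypothetical illustration in Python, not a procedure from this thesis; the function name and all sequences are invented for the example, and exact matching on one strand is assumed.

```python
def shared_kmers(seq_a: str, seq_b: str, k: int = 18) -> set:
    """Return every k-mer that occurs in both sequences (same strand, exact match)."""
    a, b = seq_a.upper(), seq_b.upper()
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return kmers_a & kmers_b


# Hypothetical toy data: the same 18 bp core embedded in different flanks.
core = "ACGTTGCAAGCTTGCATG"
toy_attP_region = "TTTTT" + core + "GGGGG"
toy_attB_region = "CCCCC" + core + "AAAAA"
print(shared_kmers(toy_attP_region, toy_attB_region))
```

On genome-scale data the same idea is usually delegated to an alignment tool, and the reverse strand would also need to be searched.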

Another strain that has a CCD operon is Cupriavidus necator JMP134. It was originally isolated in Australia and has the ability to degrade 2,4-dichlorophenoxyacetic acid (2,4-D), as well as 3CBA. The metabolism of 3CBA is initiated by genes located on its chromosome, encoding benzoate dioxygenase and 1,2-dihydro-1,2-dihydroxybenzoate dehydrogenase (Perez-Pantoja et al, 2000). The genes responsible for the rest of chlorocatechol degradation, tfdCIDIEIFI, lie on a 22 kb fragment of the pJP4 plasmid (Laemmli et al, 2000). 2,4-D and 3CBA are both metabolized through a chlorocatechol intermediate. 2,4-D has two initial steps before reaching the chlorocatechol intermediate: α-ketoglutarate dioxygenase, encoded by tfdA, carries out the conversion of 2,4-D to 2,4-dichlorophenol, and 2,4-dichlorophenol hydroxylase, encoded by tfdB, carries out the conversion of 2,4-dichlorophenol into 3,5-dichlorocatechol. Both genes, tfdA and tfdB, are also located on pJP4. The tfdR and tfdS genes encode identical LysR type transcriptional regulators of the degradative pathway. The tfdT gene encodes another regulatory protein of the pathway, but it is nonfunctional due to a C-terminal deletion caused by ISJP4 (Laemmli et al, 2000).

In addition to tfdCIDIEIFI (module I), a second set of degradative genes, tfdCIIDIIEIIFII (module II), also lies on pJP4. Perez-Pantoja et al (2000) cloned each module into a medium copy number plasmid vector to determine the activity of the genes of the two modules. They discovered that although both sets of genes are functional and highly expressed, tfdEII and tfdFI had very low expression. They also found that module I resulted in more efficient degradation of 3CBA than module II, which may be explained by the low expression of tfdEII. Conversely, the enzymes of module II have higher efficiency towards 2,4-dichloromuconate, an intermediate in the breakdown of 2,4-D, and low efficiency towards 2-chloromuconate, an intermediate formed in the catabolism of 3CBA. In the study by Laemmli et al (2000), when the amino acid sequences of the genes of the two modules were aligned, it was clear that the two modules were not duplicates of one another. The amino acid percent identity between corresponding genes of the two modules varied significantly: tfdEI and tfdEII shared only 15% identity, whereas tfdCI and tfdCII shared 60% identity. Also, the gene cluster organization of module I and module II differed in the order of tfdC and tfdD. From this evidence, they determined that it was more likely that the two modules had different evolutionary origins.

According to the available sequence data in GenBank and earlier work in our lab, B. phytofirmans OLGA172 does not seem to have the module II set of degradative genes, nor does it possess the tfdA and tfdB genes required for the breakdown of 2,4-D. Further sequencing data will confirm this and reveal any other genes present in OLGA172 that may be related to degradation of 3CBA.

There are a few more strains worth mentioning with respect to their similarities with OLGA172. Burkholderia xenovorans LB400 is a very effective polychlorinated biphenyl (PCB) degrader that oxidizes more than 20 different PCBs. It was isolated from a contaminated landfill in New York State. It shares ~97% identity in its 16S sequence with OLGA172. It has one of the two largest known bacterial genomes, 9.73 Mbp, comprising two chromosomes and a megaplasmid (Chain et al, 2006). Burkholderia phytofirmans PsJN is not a chloroaromatic degrader, but it is the closest relative of OLGA172 in terms of 16S DNA sequence, displaying ~98% similarity. It is a Plant Growth Promoting Rhizobacterium (PGPR) that establishes rhizosphere and endophytic populations associated with a number of different plants such as potato and tomato. It is also known to stimulate plant growth (Sessitsch et al, 2005). Because of their similarity in 16S sequence, the genome sequences of PsJN and LB400 were used for direct comparison with that of OLGA172.

1.4 Spontaneous phenotypic instability in other strains

Ralstonia eutropha H1 and Acetobacter pasteurianus are not close relatives of OLGA172, but they show phenotypic instability similar to that of OLGA172. In the plasmid pAEI, 3122kb in size, carried by R. eutropha H1, a spontaneous deletion of a 93kb region (D-region) occurs at a very high rate. Through a restriction endonuclease mapping study, Chow et al (1995) discovered that direct (R1) and inverted (R2) repeat sequences flank the region. The R1 sequence contained two open reading frames, one of which had significant homology to the λ family of site-specific integrases. Within the D-region, three copies of the insertion element ISAE1 were also present. They determined that the spontaneous deletion of the D-region is due to recombination between the two R1 sequences, though site-specific recombination activity is also suspected.

A. pasteurianus is a vinegar-producing acetic acid bacterium (AAB), a divergent group of the alpha-proteobacteria. It shows a high rate of physiological instability, which includes loss of acetic acid resistance and deficiencies in ethanol oxidation and bacterial cellulose synthesis. Whole genome sequencing analysis by Azuma et al (2009) showed that the A. pasteurianus genome contains more than 280 transposons and harbors 6 plasmids. The combination of the large number of plasmids and transposons is one of the main reasons for the hyper-mutability of A. pasteurianus. Also, a number of tandem repeats (TR) were found in its genome, including hyper-mutable tandem repeats (HTR) that cause genome hyper-variation. HTRs can expand or contract by their repeat units, causing frame shifts, and may be highly targeted sites for large deletion or transfer events in the genome, which in turn can contribute to the genome instability of AAB (Azuma et al, 2009).

1.5 Hypotheses and Structure of the Thesis

As stated above, I suspect that OLGA172 and its related pristine soil isolates are the original sources of the genes responsible for the degradation of 3CBA in the strains found in contaminated sites. To help clarify and answer this issue, I address the following hypotheses.

1. 3CBA degradative genes are associated with mobile genetic elements.

2. OLGA172 retains its 3CBA degradative genes on its chromosome, rather than a plasmid.

3. The genes responsible for 3CBA degradation are highly similar between OLGA172 and its related pristine soil isolates, and they all carry the module I set of the genes.

In chapter 2, I explain the details of the Next Generation sequencing methods, Illumina Solexa and Roche 454 sequencing, used for the whole genome sequencing of OLGA172. The benefits and drawbacks of both methods are addressed. The initial assembly of the sequencing data surrounding the 3CBA degradative genes is explained, and the discovery of a novel “Recombinase in Trios” element is described.

In chapter 3, the techniques used to further extend sequencing of the region carrying the degradative genes are described to show its chromosomal location as well as a ‘junkyard region’ containing mostly partial mobile elements and remnants of other genes.

In chapter 4, the phylogenetic analyses of the degradative genes of OLGA172 and its related pristine isolates are presented. The ability to degrade 3CBA is also summarized for each of the strains.

Chapter 5 briefly concludes this thesis, indicating where my findings do or do not support my hypotheses. I end with suggestions for very interesting future work.

Chapter 2

2 Next Generation Sequencing

Whole genome sequencing of B. phytofirmans OLGA172 was carried out using the Illumina Solexa Genome Analyzer and Roche 454 GS-FLX methods. The Solexa platform depends on sequencing millions of short reads, so that each nucleotide base in the genome is sequenced several times (Hernandez et al, 2008). The 454 platform prepares its samples in vitro and miniaturizes the sequencing chemistries, which enables massively parallel sequencing reactions (Rothberg and Leamon, 2008).

During library preparation for Solexa sequencing, the DNA sample is sheared using a compressed air device called a nebulizer into pieces of about 800 bp on average. The fragmented pieces of DNA are adenylated, and two different adaptors are ligated, one to each end. Each piece of DNA is amplified through Cluster Generation by bridge amplification. The flow cell surface is coated with single stranded oligonucleotides that are complementary to the sequences of the two adaptors used during library preparation. The fragmented DNA strands are denatured, and these single stranded adaptor-ligated DNA pieces are allowed to bind to the corresponding oligonucleotides on the flow cell surface. Each strand is amplified using bridge amplification, in which the free distal adaptor of the bound strand is primed by the complementary oligo on the planar and optically transparent flow cell surface. This results in localized amplification of each strand across millions of locations on the flow cell surface. Reverse strands from the amplification are cleaved and washed away. The ends of the amplified strands are blocked, and sequencing primers are hybridized to the DNA templates.

Simultaneous sequencing by synthesis of each DNA template is carried out using the Genome Analyzer (GA). Each strand is sequenced base by base using four fluorescently labeled, reversibly terminated nucleotides. After each round of base addition, the clusters are excited by a laser, causing each cluster to emit a colour that identifies the nucleotide base that was incorporated. The fluorescent label and the blocking agent are then removed from the nucleotide, which allows the next nucleotide to bind (Illumina Inc., 2009). In this way, Solexa generates millions of short reads, which are either assembled via alignment to a reference sequence or assembled de novo. However, the output of Solexa data is not optimized for de novo assembly, as short 30-40 bp reads allow only very small overlaps between the generated reads.

Conversely, 454 sequencing generates longer reads, averaging 420 bp. Library preparation begins when the genomic DNA is randomly fragmented into pieces of 400-600 bp. Two unique adaptors, A and B, are attached to the ends of the fragments, which are then denatured into single strands. Clonal amplification is carried out by emulsion PCR (emPCR), in which adaptor-ligated DNA library fragments, micron-sized capture beads, and enzyme reagents in water are injected into a small plastic tube with synthetic emulsion oil. Vigorous shaking causes the water mixture to form droplets around the beads, creating micro-reactors in which emPCR takes place with one strand of template DNA per bead. Through PCR, each strand in the emulsion is amplified into millions of copies immobilized on the bead. The beads are recovered from the oil and cleaned. Beads without any DNA, or those that hold more than one unique DNA sequence, are filtered out during the sequence signal processing procedure.

DNA sequencing is carried out through pyrosequencing, which also employs the sequencing by synthesis method, using PicoTiter plates. Each plate contains 1.6 million wells, the size of which allows only one bead to fit per well. The single stranded DNA on the beads is hybridized to a sequencing primer. After the beads from emPCR, each carrying its immobilized amplified DNA, are loaded onto the plate, the plate is loaded into the GS-FLX. The four nucleotides are sequentially washed over the plate along with DNA polymerase, ATP sulfurylase, luciferase, luciferin, and adenosine 5’ phosphosulfate (APS). If the nucleotide is complementary to the base on the template, it is incorporated by DNA polymerase. This releases pyrophosphate (PPi), the amount of which is proportional to the number of nucleotides incorporated (i.e. if there are 3 consecutive G’s in the DNA template sequence, 3 C’s will be incorporated and 3 PPi’s released). ATP sulfurylase converts PPi to ATP in the presence of APS. This ATP is used to convert luciferin to oxyluciferin in the presence of luciferase. The reaction generates visible light, and the amount of light produced is again proportional to the amount of ATP and therefore PPi. This light signal is detected by a charge coupled device (CCD) camera. A Flowgram is generated for each well in the PicoTiter plate, and the strength of each signal is proportional to the number of nucleotides incorporated. Apyrase degrades unincorporated nucleotides and ATP after each wash. When degradation is complete, the next nucleotide wash is carried out (Roche Diagnostics Corp., 1996-2008). Because of its long reads, 454 is better suited to de novo assembly of the genome when a reference sequence is not available.
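The proportionality between signal strength and homopolymer length can be illustrated with an idealized flowgram. The Python sketch below is an illustration only: it assumes noise-free signals, a cyclic flow order of T, A, C, G (an assumption, not taken from this thesis), and a template string given in the order in which it is copied.

```python
def simulate_flowgram(template, flow_order="TACG", n_cycles=4):
    """Return idealized flow signals as (nucleotide, count) pairs.

    Each count is the number of bases incorporated during that flow, i.e. the
    length of the homopolymer run extended by the flowed nucleotide.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # Bases the polymerase must add, in order, to copy the template.
    to_add = [complement[base] for base in template.upper()]
    pos = 0
    signals = []
    for _ in range(n_cycles):
        for nucleotide in flow_order:
            count = 0
            # Incorporate as long as the flowed nucleotide is the next one needed.
            while pos < len(to_add) and to_add[pos] == nucleotide:
                count += 1
                pos += 1
            signals.append((nucleotide, count))
    return signals


# Three consecutive G's in the template give a C signal of strength 3.
flows = simulate_flowgram("GGGAT", n_cycles=2)
```

Real flowgrams are noisy, which is why long homopolymer runs are a known source of insertion and deletion errors in pyrosequencing data.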

The combination of short and long read lengths of Solexa and 454, respectively, should allow better assembly of the contigs through higher levels of overlap and increased gap closure. The two sets of sequencing data should allow scaffolding to one another, as the short and long reads should complement each other very well.

In this chapter, I describe the NextGen sequencing of the OLGA172 genome using both Illumina Solexa and Roche 454 technologies. The initial assembly of this newly acquired data, the annotation of the region surrounding the chlorocatechol operon, and the search for mobile elements in the genome are all described below.

2.1 Methods

2.1.1 Purification and DNA extraction of B. phytofirmans OLGA172

In order to purify colonies of B. phytofirmans OLGA172, it was sequentially plated on selective 3mM chlorobenzoate (CBA) plates and non-selective R2A plates at 28 °C. The 3mM CBA plate was made up of 100ml of 10x phosphate buffer that was 10mM and pH 7 (in 1L: K2HPO4·3H2O – 17.1g (0.075M); NaH2PO4·H2O – 3.4g (0.024M)), 10ml of 100x Mg/Ammonium Solution (in 1L: (NH4)2SO4 – 33g (0.25M); MgSO4·7H2O – 24.6g (0.1M)), and 1ml of 1000x trace element solution with pH of 7 (in 1L: Na2EDTA·2H2O – 12g (0.036M); NaOH – 2g (0.05M); ZnSO4·7H2O – 0.4g (1.4x10^-3 M); MnSO4·4H2O – 0.4g (1.8x10^-3 M); CuSO4·5H2O – 0.1g (1.0x10^-4 M); FeSO4·7H2O – 3g (0.1M); Na2SO4 – 5.2g (0.04M); NaMoO4·2H2O – 0.1g (4.6x10^-4 M)). Then 5mg of yeast extract, 889ml of distilled H2O, 16g of agar, and 20ml of 50mM 3-CBA solution were added for 1mM final concentration. 50mg of the pH indicator bromothymol blue was also added to the CBA media.

Non-selective R2A media was made using Difco R2A Agar (Difco Laboratories, Becton, Dickinson and Company, Ref. no. 218263) according to the manufacturer’s instructions. From a -80 °C glycerol stock, OLGA172 was initially plated on a 3mM CBA plate to select the degrading phenotype of OLGA172 away from any contaminants that may have been present. A single colony was then picked and plated on a non-selective R2A plate. Plating on R2A was carried out once or twice more in order to purify OLGA172 from the contaminants, and the strain was then re-plated on CBA media to confirm that it could still degrade CBA. In order to grow large quantities of OLGA172, purified OLGA172 was grown in 250ml of liquid CBA media (same recipe as above without the addition of agar). The cells were then centrifuged (13000rpm for 1 minute) to collect the pellets, and genomic DNA was extracted using the bacterial protocol of the DNeasy Blood & Tissue kit (Qiagen, cat. no. 69504). This kit was used because large amounts of DNA (~1ug) could be obtained with each extraction.

The DNA concentration in ng/ul was measured using the NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific ©2008). The 260/280 ratio is the ratio of absorbance at 260nm to that at 280nm, which estimates the amount of protein in the DNA sample. A pure DNA sample has a 260/280 ratio of 1.8. DNA submitted for Solexa sequencing is required to have a 260/280 ratio between 1.8 and 2.0, and the extracted DNA of OLGA172 used for sequencing was within this range.

2.1.2 Confirming the identity and purity of OLGA172 DNA

Before sending OLGA172’s genomic DNA for NextGen Sequencing, the identity of OLGA172 had to be confirmed using 16S ARDRA (Amplified Ribosomal DNA Restriction Analysis) profiles, BOX fingerprinting, and PCR amplifications of genes from its known catabolic operon (chlorocatechol-1,2-dioxygenase, tfdC, λ family site specific recombinases). The full length 4525 bp region of OLGA172 that is available in GenBank (Accession No. AY168634.1) was also amplified using PCR.

The primers used in the PCR reactions are shown in Table 2-1. All primers were ordered from Invitrogen Life Sciences. The CCDb/e primers were designed by Leander et al (1998).

The HotStart Taq Kit (Qiagen, Cat. no. 203645) was used for every polymerase chain reaction, following the manufacturer’s instructions. For a 20ul amplification reaction, 10ul of Mastermix (provided by the kit, consisting of 2.5 units HotStarTaq DNA polymerase, 1x PCR buffer (10x concentrated, containing Tris-Cl, KCl, (NH4)2SO4, 15mM MgCl2, pH 8.7), and 200uM of each dNTP), 8.6ul of nano pure ultra filtered water, 0.4ul each of forward and reverse primers at 50uM, and less than 0.4ug of template OLGA172 DNA were mixed well together. The mixture was then placed in a thermocycler under the conditions stated in Table 2-2. The annealing temperature used in each case was approximately 5 °C below the Tm of the primers.

The extension time varied depending on the expected product size. For those below 1 kbp, 1 minute was used and for those above 1 kbp, 1.5-3 minutes was used.

Table 2-1. Primers used in polymerase chain reactions in Chapter 2.

| Primer name | Targeted gene | Seq 5’→3’ | Tm (°C) | Anneal. temp (°C) | Expected product size in OLGA172 (bp) |
|---|---|---|---|---|---|
| 27F / 1492R | 16S ribosomal DNA | AGAGTTTGATCMTGGCTCAG / GGTTACCTTGTTACGACTT | 56 | 52 | 1365 |
| BOX | Conserved repetitive DNA seq. | CTACGGCAAGGCGACGCTGACG | 55 | 50 | |
| CBA | λ site specific recombinase to the end of tfdD (flanking tfdC) | F: TCAGCAGTTGCAATCAGACC / R: ACGGAACCGTCGAATATGAG | F: 64.1 / R: 63.7 | 57 | 4525 |
| CCD b/e | tfdC | F: GTITGGCAYTCIACICCIGAYGG / R: CCICCYTCGAAGTAGTAYICIGT | F: 70 / R: 62 | 48 | 268 |
| Recom F/R | Site Specific Recombinase | F: GATGTGATTCCGGATCGTCT / R: CCGTGTTACGGTCGTTTCTT | F: 60 / R: 60 | 55 | 250 |

The 16S ribosomal DNA sequences of OLGA172 were cut with the enzymes HaeIII, HhaI and AluI from Invitrogen to confirm its ARDRA profiles. 5ul of undiluted REact buffer (10X concentrate assay buffer), 1 unit of enzyme per ug of DNA, and dH2O to make up a final reaction volume of 50ul were mixed and placed in a thermocycler at 37 °C overnight. REact 2 buffer, made up of 50mM Tris-HCl (pH 8.0), 10mM MgCl2, and 50mM NaCl in the final assay mixture, was used with HhaI and HaeIII. REact 1 buffer, made up of 50mM Tris-HCl (pH 8.0) and 10mM MgCl2, was used with AluI. Products were visualized after electrophoresis on a 1.5% agarose gel.
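The fragment pattern expected from an ARDRA digest can be predicted in silico from a known 16S sequence. The Python sketch below is a minimal illustration, assuming a linear amplicon and the standard recognition sites of the three enzymes (HaeIII GG^CC, HhaI GCG^C, AluI AG^CT); the function name and example sequence are hypothetical, and this was not part of the wet-lab procedure described above.

```python
import re

# Recognition site and top-strand cut offset within the site for each enzyme.
ENZYMES = {
    "HaeIII": ("GGCC", 2),  # GG^CC
    "HhaI": ("GCGC", 3),    # GCG^C
    "AluI": ("AGCT", 2),    # AG^CT
}


def digest(seq, enzyme):
    """Return fragment lengths (bp), in order, from cutting a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = seq.upper()
    # A lookahead finds every site occurrence, including overlapping ones.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical 20 bp sequence with one HaeIII site and one AluI site.
example = "AAAAGGCCTTTTTTAGCTCC"
haeiii_fragments = digest(example, "HaeIII")
alui_fragments = digest(example, "AluI")
```

Comparing the predicted fragment sizes with the bands on the 1.5% agarose gel is one way to check that an amplicon matches the expected 16S sequence.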

Table 2-2. Thermal cycle conditions for all polymerase chain reactions carried out.

| Reaction | Cycle no. | Thermal condition (°C) | Time (min) |
|---|---|---|---|
| Primary denaturation | 1 | 95 | 5 |
| Denaturation | 35 | 95 | 1 |
| Annealing | 35 | 55 | 1 |
| Extension | 35 | 72 | 1 |
| Final extension | 1 | 72 | 10 |

The BOX fingerprint of OLGA172 was generated using the HotStart Taq kit with the primers designed by Versalovic et al (1994). The procedure was as outlined in Table 2-2, with 50 °C as the annealing temperature.

A 0.8% (w/v) agarose gel was used to run genomic DNA, and a 1.5% (w/v) gel was used to run PCR products, restricted 16S rRNA bands, and BOX DNA fingerprints. Ultrapure Agarose was added to 0.5X TBE buffer. Ethidium bromide was added directly into this mix (0.13-0.15ul/ml), which was then poured into a cast to solidify. The GeneRuler 1kb DNA Ladder (Fermentas Life Sciences) was used for all gel electrophoresis. 100ul of DNA ladder, provided at a concentration of 0.5ug/ul, 100ul of 6X Loading Dye Solution (100mM Tris-HCl pH 7.6, 0.03% bromophenol blue, 0.03% xylene cyanol FF, 60% glycerol and 60mM EDTA), and 400ul of deionized water were mixed together, and 3ul of this DNA ladder mixture was loaded on each gel.

2.1.3 Next Generation Sequencing and preliminary assembly

Illumina Solexa sequencing on the Genome Analyzer II with paired-end capabilities was used to sequence the entire genome of OLGA172. Approximately 5ug of pure genomic DNA was sent out for sequencing. The sequencing was carried out at the Centre for the Analysis of Genomic Evolution and Function (CAGEF) at the University of Toronto. The Solexa sequencing was repeated using a second flow cell, with the contigs generated from the first round of sequencing serving as a reference sequence. A de novo assembly of the raw reads into contigs (named NODE_###) was carried out at CAGEF. 454 sequencing on the Roche Genome Sequencer FLX was also used to sequence the genome of OLGA172. This was carried out at The Genome Quebec Innovation Centre at McGill University. Approximately 4ug of genomic DNA was sent out for 454 sequencing. A de novo assembly, rather than a reference assembly, was carried out with the raw reads to generate contigs (named contig###).

Sequencher 4.1.4 by Gene Codes Corporation (©1991-2002) was used to assemble the two sets of Solexa sequencing data, each from a separate flow cell run, together with the 454 sequencing data. As mentioned, the DNA sequences from Solexa/Genome Analyzer are named ‘NODE###’ and those from 454/GS-FLX are named ‘contig###’. Assembly in this program was focused on connecting the pieces that either contained or neighbored the degradative genes of B. phytofirmans OLGA172. The existing sequence of B. phytofirmans OLGA172 in GenBank (GenBank Acc. No. AY168634) was used in the assembly as well.

For primary assembly of the three sets of data in Sequencher, the Dirty Data algorithm was used, which is intended for unedited sequence data. Ambiguous base calls (bases that were not A, T, C, or G) were considered poor matches to exact base calls (A, T, C, or G). ReAligner (Anson and Myers, 1997) was used to optimize gaps for small inserts and double called bases: instead of placing gaps arbitrarily within a small region, ReAligner aligned the gaps together to optimize the alignment. The minimum match percentage was set to 80, and the minimum overlap required between alignments of contig sequences was 20 bases. When searching for insertion sequences and their repeats, a 90% match percentage was used with a 20-25 bp overlap. Solexa sequencing data often contain mis-called bases due to errors in the sequencing reaction, denoted as N’s in the DNA sequence. As a result, genes may have N’s in the middle of their sequences, which makes it impossible for Sequencher to align the contigs/nodes containing these N’s with other reads using high match percentage parameters. In such cases, a 65-75% match percentage was used with a 30-40 bp overlap. By overlapping DNA sequences from all three data sets, some N’s were eliminated and others were kept as place holders.
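The effect of the minimum overlap and minimum match percentage parameters can be illustrated with a simple ungapped suffix-prefix comparison. The Python sketch below is not Sequencher's Dirty Data algorithm, which is proprietary and handles gaps; it is a simplified illustration in which ambiguous bases always count as mismatches, and the function name and example sequences are hypothetical.

```python
def best_overlap(left, right, min_overlap=20, min_match_pct=80.0):
    """Find the longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, match_percent) for the longest ungapped overlap
    that meets both thresholds, or None if there is none. Any base other than
    A, C, G, or T (for example N) is counted as a mismatch.
    """
    left, right = left.upper(), right.upper()
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        matches = sum(a == b and a in "ACGT" for a, b in zip(suffix, prefix))
        pct = 100.0 * matches / length
        if pct >= min_match_pct:
            return length, pct
    return None


# Hypothetical contigs sharing a 24 bp overlap, with one N in the second contig.
shared = "ACGTACGGTCATGCATTGCAAGCT"
node = "TTTTTTTTTT" + shared
contig = shared[:5] + "N" + shared[6:] + "GGGGGGGGGG"
result = best_overlap(node, contig)
```

Lowering `min_match_pct` in this sketch mirrors the relaxed 65-75% setting described above: each N lowers the match percentage, so N-rich contig ends need a more permissive threshold before they will join.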

CLC Genomics Workbench 3.6.5 by CLCBio ©2009 was employed to assemble the data against the reference sequences from its close relatives, PsJN and LB400. The short 38 bp and the long 420 bp raw reads of Solexa and 454 Sequencing, respectively, were imported for use. For long reads, mismatch cost of 2, insertion cost of 3, deletion cost of 3, length fraction of 0.5 and similarity of 0.8, were applied. For short reads, mismatch cost of 2 and a cost limit of 8 were set. Fast Ungapped Alignment was chosen where in cases of conflicts, Vote of A, C, T, or G was taken, and for non specific matches (such as repeats), Random placement was chosen rather than choosing the Ignore option.

2.1.4 GC content and overall Genome Matches

A perl script was designed to calculate the GC content of large segments of DNA or of the whole genome (see Appendix 6.1). With it, the GC content of the whole genome, of each degradative gene (e.g. tfdC, tfdD, tfdE, tfdF), and of the entire chlorocatechol degradative operon was calculated for the following organisms: B. phytofirmans OLGA172, B. xenovorans LB400, B. phytofirmans PsJN, and C. necator JMP134.
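The calculation itself is straightforward. The following Python sketch is an independent illustration of a GC content calculation, not the perl script in Appendix 6.1; it assumes FASTA input and ignores ambiguous bases such as N, and the function names are hypothetical.

```python
def gc_content(seq):
    """Percent G+C among unambiguous bases; N and other codes are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Applying `gc_content` to each record from `read_fasta` gives per-gene or per-replicon values; ignoring N's matters for the Solexa contigs, which contain many ambiguous calls.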

Another perl script (see Appendix 6.2) was designed to determine the total amount of base overlap between OLGA172 and each of LB400 and PsJN. It takes the contigs generated by NextGen Sequencing, scaffolds them to the reference genome, and calculates the total number of base matches.
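One way to compute such an overlap is to merge the reference coordinates of all alignments and count the bases covered at least once. The Python sketch below illustrates that idea under stated assumptions; it is not the perl script in Appendix 6.2, and the function name and coordinate convention (1-based, inclusive, either orientation) are assumptions.

```python
def percent_covered(intervals, reference_length):
    """Percent of reference bases covered by at least one alignment.

    `intervals` holds 1-based inclusive (start, end) pairs on the reference.
    Reverse-strand hits (start > end) are normalized before merging.
    """
    spans = sorted((min(s, e), max(s, e)) for s, e in intervals)
    covered = 0
    cur_start = cur_end = None
    for start, end in spans:
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / reference_length
```

Merging before counting prevents bases hit by several overlapping contigs from being counted more than once, which would otherwise inflate the percentage.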

Using the B. phytofirmans PsJN genome as the query, StandAlone BLAST (Basic Local Alignment Search Tool) was used to align the Solexa sequencing data of B. phytofirmans OLGA172 against it. The expected threshold value was 1e-20, with a gap opening penalty of -5 and a gap extension penalty of -3. A special focus was given to the homologous genes that OLGA172 shares with B. phytofirmans PsJN and B. xenovorans LB400. PsJN was chosen because it shares the highest sequence similarity (98%) with OLGA172 with respect to their 16S ribosomal DNA sequences. LB400 was chosen because of its high (97%) similarity to OLGA172’s 16S ribosomal sequence and because it has the ability to degrade CBA in addition to various PCBs.

BLAST was used extensively to annotate the genes and to compare the genes of OLGA172 to those of other closely related organisms. It was also used to align the 16S ribosomal DNA sequence of OLGA172 with those of B. phytofirmans PsJN (Acc. No. CP001052.1) and B. xenovorans LB400 (Acc. No. NC007951) in order to confirm their high similarities. Megablast was used with a word size of 28, an expected threshold of 20, match/mismatch scores of 1 and -2, respectively, and a linear gap cost. ORF Finder was used to analyze various contigs from the NextGen Sequencing data in order to determine the presence of likely protein coding regions and putative genes. BLASTn was used to predict the likely genes in the assembled contigs of the sequencing data. The contigs were aligned against the nucleotide collection database in search of related organisms carrying similar genes. A word size of 11 was used with an expected threshold of 20, match/mismatch scores of 2 and -3, and gap existence and extension scores of 5 and 2, respectively. The genes in the region neighboring the degradative genes, tfdCIDIEIFI, in OLGA172 were inferred from homologous BLASTn matches. Only matches with 90% or greater identity were taken into consideration when annotating these genes.
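The identity cutoff described above can be applied programmatically to tabular BLAST output. The Python sketch below is an illustration only: it assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score), and the function name and the subject identifiers in the example are hypothetical.

```python
def filter_hits(lines, min_identity=90.0, max_evalue=20.0):
    """Keep tabular BLAST hits that pass the identity and E-value thresholds."""
    kept = []
    for line in lines:
        # Skip blank lines and comment lines from commented tabular output.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        pident, evalue = float(fields[2]), float(fields[10])
        if pident >= min_identity and evalue <= max_evalue:
            kept.append({
                "query": fields[0],
                "subject": fields[1],
                "identity": pident,
                "length": int(fields[3]),
                "evalue": evalue,
            })
    return kept


# Hypothetical hits: the first passes the 90% cutoff, the second does not.
example_lines = [
    "NODE_101\tsubjA\t95.50\t500\t20\t2\t1\t500\t1000\t1499\t1e-150\t800\n",
    "contig01458\tsubjB\t85.00\t300\t45\t0\t1\t300\t1\t300\t2e-60\t300\n",
]
hits = filter_hits(example_lines)
```

Passing an open file handle as `lines` works the same way, so the filter can be run directly on a saved results file.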

When there were no significant matches using BLASTn, BLASTx and BLASTp were used to infer function from the conceptually translated amino acid sequences. A lower percent identity threshold (30-50%) was used when annotating these genes, because at the amino acid level a relatively small number of conserved residues is required to encode a functional protein (Guttman, 2008). The SWISS-PROT database and the BLOSUM62 matrix were used, with an expected threshold value of 10 and a word size of 3, filtered for low complexity regions.

2.1.5 Multiple sequence alignment and phylogenetic trees

CLUSTALW is a web-based program and was used to carry out global multiple sequence alignment of mobile genetic elements. CLUSTALX2 is a computer installed program with a graphical user interface, which allows the user to save the alignments in various formats. The alignments of the degradative genes were made here and saved in a Nexus format. A neighbor joining tree with bootstrapping for 1000 runs was made in SplitsTree 4.10 (Huson and Bryant, 2006). A neighbor joining tree was chosen because it does not assume that all evolutionary events occur at the same rate (Molecular Clock hypothesis) and its branch lengths are relative to evolutionary time.
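Neighbor joining operates on a matrix of pairwise distances derived from the alignment. As a simplified illustration of that input, the Python sketch below computes uncorrected p-distances from aligned sequences; this is an assumption made for illustration, as SplitsTree computes its own distances and the distance model it used here is not stated. The function names are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for all pairs in a {name: seq} dict."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}
```

Bootstrapping repeats this calculation on alignments built by resampling columns with replacement, and then reports how often each grouping recurs among the resulting trees.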

2.2 Results

2.2.1 Isolation and confirmation of OLGA172

Prior to sending the DNA of OLGA172 out for genome sequencing, its source and purity had to be confirmed. I used BOX PCR to fingerprint the genome and compare it to the original, and used ARDRA to confirm the 16S ribosomal sequence. Results are shown in Figure 2-1.

Figure 2-1. BOX fingerprint (left) and 16S ARDRA profile (right) of the B. phytofirmans OLGA172 genome. The restriction digests were carried out with AluI, HhaI, and HaeIII.

To confirm that the DNA for submission had not lost the catabolic operon, several PCRs were carried out to amplify chlorocatechol-1,2-dioxygenase, tfdC, λ family site specific recombinase, and the 4525 bp stretch of DNA covering tfdD to the λ family site specific recombinases (sequence available in GenBank). As seen in Figure 2-2, products of the expected sizes were amplified using all sets of primers: a 250 bp product with the RecombF/R primers, a 268 bp product with the CCDb/e primers, and a 4525 bp product with the CBAF/R primers.

Figure 2-2. PCR products using the indicated primers and OLGA172 template genome are shown. The correct sized bands for each reaction were produced.

2.2.2 Total genome homology and GC content analysis

Sequencher was used to search for genes expected to be present in OLGA172. Fragments of its 16S ribosomal genes were found on NODE_101, contig01458, NODE_2566, NODE_2982, and NODE_411, confirming the identity of OLGA172 that was sequenced by NextGen Sequencing. When the 16S ribosomal DNA sequence of OLGA172 was imported into BLASTn, it was confirmed that PsJN shares 98% identity with that of OLGA172.

Using StandAlone Blast (Blastall), the contigs of OLGA172 that showed significant homology to PsJN and LB400 were identified. Using perl scripting, the output was converted into an excel file. This program calculated the total amount of base overlap between the genome of OLGA172 and these strains. The amount of overlap was measured in percentages (Table 2-3).

Table 2-3. Total amount of bp (%) of the LB400 and PsJN genomes that overlaps with that of OLGA172 (calculated via perl scripting; see Appendix 6.2).

| Reference replicon | Bp (%) covered by OLGA172 |
|---|---|
| LB400 chrom. 1 | 56% |
| LB400 chrom. 2 | 35.3% |
| LB400 chrom. 3 | 7% |
| PsJN chrom. 1 | 60.9% |
| PsJN chrom. 2 | 36.4% |
| PsJN plasmid pBPHYT01 | 6.9% |

A large difference in the GC% content of the whole genome and the degradative genes would suggest a past horizontal gene transfer event of these genes. The GC contents of each chromosome and plasmid of the reference strains, as well as that of each gene responsible for CBA degradation, were determined from GenBank data and our own sequence (Table 2-4). The difference in the GC content of the chromosome and the degradative genes is approximately 5% in OLGA172, 2% in LB400, and 0.1-10% for JMP134. Though PsJN is the closest relative of OLGA172 in terms of their 16S ribosomal DNA sequences, PsJN does not possess CBA degradative genes.

Table 2-4. GC% content of B. phytofirmans OLGA172, B. xenovorans LB400, C. necator JMP134, and B. phytofirmans PsJN genomes and their chlorocatechol degradative genes.

| GC% | B. phytofirmans OLGA172 | B. xenovorans LB400 | C. necator JMP134 | B. phytofirmans PsJN |
|---|---|---|---|---|
| Chrom. GC% | 61.23 | 1: 62.75; 2: 62.84; megaplasmid: 61.73 | 1: 64.7; 2: 65.2; pJP4: 64.66 | 1: 62.58; 2: 62.1; pBPHYT01: 58.3 |
| The degradative operon | 56.4 | 60.8 | 56.4 | |
| Chlorocatechol-1,2-dioxygenase (tfdC) | 57.5 | 61.4 | I: 56.5; II: 64.4 | |
| Chloromuconate cycloisomerase (tfdD) | 57.1 | 61.5 | I: 57.8; II: 66.3 | |
| Dienelactone hydrolase (tfdE) | 55.0 | 59.8 | I: 54.8; II: 65.3 | |
| Maleylacetate reductase (tfdF) | 56 | 60.7 | I: 56.4; II: 69.8 | |

2.2.3 Summary of Data Generated by Solexa Genome Analyzer and 454 GS-FLX

Table 2-5 shows the quantitative data obtained from both sequencers. Solexa sequencing using two flow cells achieved very high genome coverage of 45X. However, the very short length of each read made whole genome assembly very difficult; therefore 454 sequencing was required. The 10X coverage from 454 seemed sufficient to support the Solexa data, with each data set serving as a scaffold for the other.

All data sets were imported into Sequencher 4.1.4 for further assembly. The long read lengths of 454 allowed for better assembly and connection between the short Solexa reads. Together, the two data sets gave a higher level of overlap between the contigs, yielding a smaller number of longer contigs (Table 2-5). However, the high proportion (19%) of N’s present in the Solexa data interrupted a large number of gene sequences. The N’s also prevented the assembly of these contigs via sequence overlaps, as many of them occurred at the terminal ends of the contigs.

Table 2-5. Comparison between Solexa and 454 sequencing methods in sequencing the genome of B. phytofirmans OLGA172.

Measure | Solexa | 454
Total number of quality reads | 9,597,854 | 202,326
Total number of bp sequenced | 364,718,452 | 75,636,539
Avg length of each read | 38 bp | 414 bp
Coverage based on 8 Mbp genome | 45X | 10X
# of contigs | 5380 contigs from the 1st run; 6525 contigs from the 2nd run | 1377 contigs generated by reference (Solexa) and de novo assembly (454)
Total # of contigs generated by combining Solexa and 454 data | 820 contigs in total
Total amount of N’s in Solexa data | 19%
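The coverage figures in Table 2-5 follow from dividing the total number of sequenced bases by the genome size. The short Python sketch below reproduces that arithmetic; the function name is hypothetical and the 8 Mbp genome size is the estimate stated in the table.

```python
def fold_coverage(total_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bp / genome_size_bp


GENOME_SIZE_BP = 8_000_000  # 8 Mbp estimate used in Table 2-5

solexa_coverage = fold_coverage(364_718_452, GENOME_SIZE_BP)
ls454_coverage = fold_coverage(75_636_539, GENOME_SIZE_BP)

print(f"Solexa: {solexa_coverage:.1f}X, 454: {ls454_coverage:.1f}X")
# prints "Solexa: 45.6X, 454: 9.5X"
```

These values round to the 45X and 10X reported in the table.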


CLC Genomics Workbench 3.6.5 was also employed for the primary assembly step. Using the combined raw data from Solexa and 454, a reference assembly was carried out against chromosome 1 of B. xenovorans LB400. However, over 21 million raw reads remained unassembled. Longer contigs were obtained in Sequencher by combining the contigs generated from Solexa and 454, and the assembly was easier to visualize in Sequencher. Therefore, Sequencher was used as the primary assembly program for the rest of the study. Nevertheless, CLC Workbench should be explored further and used for the reference assembly of OLGA172’s genome against those of its close relatives.

2.2.4 There is no evidence of plasmids in OLGA172

The genes associated with chromosomal and plasmid origins of replication in strains such as PsJN, R. eutropha H16, and JMP134 were imported into Sequencher 4.1.4. Each gene was used in turn as a reference sequence, and all of the contigs that make up OLGA172’s genome were aligned to it in search of close sequence homology. A minimum match of 80% with an overlap of at least 30 bp was required for these alignments.
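The two acceptance criteria can be expressed as a simple check on a pair of already aligned sequences. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the two strings are assumed to be the aligned overlap region of equal length, and Sequencher's own matching algorithm is not reproduced here.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned positions that carry the same base ('-' never matches)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        x == y and x != "-"
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


def passes_criteria(aligned_a: str, aligned_b: str,
                    min_identity: float = 80.0, min_overlap: int = 30) -> bool:
    """Apply the 80% match and 30 bp minimum overlap thresholds."""
    long_enough = len(aligned_a) >= min_overlap
    return long_enough and percent_identity(aligned_a, aligned_b) >= min_identity
```

For example, a 40 bp overlap with three mismatches (92.5% identity) passes, whereas a perfect 29 bp overlap does not.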

The genes associated with plasmid origins of replication, repA, repB, parA, parB (using versions of the genes found in PsJN and H16’s plasmid pHG1), could not be found in any of OLGA172’s contigs (Table 2-6).

Since catabolic genes in bacteria are often located on IncP plasmids (Fulthorpe and Top, 2009), the conserved genes of the IncP backbone were searched for in OLGA172’s genome. These genes include trfA1 and trfA2 (replication proteins), korA (involved in replication control and repression of the trfA genes), and traG (DNA transport protein; Gotz et al., 1996). IS1071 is also commonly associated with catabolic genes and was therefore searched for in OLGA172’s genome as well. However, none of these genes could be found in the genome of OLGA172.

Several genes associated with chromosomal origins of replication were found (Table 2-6). The lack of evidence for genes associated with plasmids suggests that OLGA172 does not possess a plasmid. The degradative genes responsible for the breakdown of 3CBA in OLGA172 are therefore most likely located on its chromosome.


Table 2-6. B. phytofirmans PsJN genes involved in chromosome and plasmid replication, and whether they are found in OLGA172.

Genes of PsJN (unless otherwise stated) | GenBank Accession No. | Related to chromosome or plasmid replication? | Found in OLGA172?
Plasmid replicating gene repA (in pHG1) | AY305378.1 | Plasmid | No
Plasmid replicating gene repB (in pHG1) | AY305378.1 | Plasmid | No
Plasmid partitioning gene parA (in pHG1) | AY305378.1 | Plasmid | No
Plasmid partitioning gene parB (in pHG1) | AY305378.1 | Plasmid | No
trfA1 and trfA2 (replication proteins: IncP backbone) | AY365053.1 | Plasmid | No
korA (replication control: IncP backbone) | AY365053.1 | Plasmid | No
traG (transport protein: IncP backbone) | AY365053.1 | Plasmid | No
IS1071 (commonly found with many catabolic genes) | AY365053.1 | Plasmid | No
DNA gyrase subunit B | CP001052.1 | Chromosome | Yes
aspartyl/glutamyl tRNA amidotransferase subunit A (in PsJN) | CP001052.1 | Chromosome | Yes
aspartyl/glutamyl tRNA amidotransferase subunit B (in PsJN) | CP001052.1 | Chromosome | Yes
aspartyl/glutamyl tRNA amidotransferase subunit C (in PsJN) | CP001052.1 | Chromosome | Yes
ParA family ATPase (in PsJN) | CP001052.1 | Chromosome | Yes
putative primosomal assembly protein PriA (in PsJN) | CP001052.1 | Chromosome | Yes
putative primosomal replication protein PriB (in PsJN) | CP001052.1 | Chromosome | Yes
Chromosome replication initiator-prot DnaA (in PsJN) | CP001052.1 | Chromosome | Yes
DnaB (in PsJN) | CP001052.1 | Chromosome | Yes

2.2.5 Gene annotation

With the aid of the OLGA172 CCD operon sequence already available (GenBank Accession No. 168634), the CBA degradative genes, tfdC IDIEIFI, on NODES_1207, NODE_4848, and contig00154 were linked. The original sequence was used to link these contigs with those carrying the three integrases, NODE_318, NODE_444 and contig00091. This gave 27 kbp of consensus sequence that carried the degradative genes as well as their surrounding genes (Figure 2-3). It was then submitted to BLASTn for annotation. There was no quality overlap between the terminal ends of this 27 kbp piece and any other contigs or nodes (Figure 2-3).

Figure 2-3. A schematic diagram of the 27 kb region surrounding the 3CBA degradative genes in OLGA172. The black boxes flanking the three integrases represent inverted repeat sequences.

Table 2-7 lists the genes present within this consensus sequence, in the order shown in Figure 2-3. The degradative operon is flanked on one side by genes with high homology to genes from PsJN and LB400, with the exception of a partial integrase that lies adjacent to tfdF. At the other end, three recombinase/integrase genes are found. These three genes are flanked by a 28 bp inverted repeat. This 28 bp stretch of DNA was also found on NODE_2930, but the overlap was not considered long enough to justify confident linkage.
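Two flanking sequences form an inverted repeat when one is the reverse complement of the other. The following Python sketch shows one way such a pair could be checked. The function names and the short example sequences are hypothetical; the actual 28 bp repeat of OLGA172 is not reproduced here.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_inverted_repeat(left: str, right: str, max_mismatches: int = 0) -> bool:
    """True if `right` is the reverse complement of `left`,
    allowing up to `max_mismatches` differing positions."""
    candidate = reverse_complement(right).upper()
    left = left.upper()
    if len(left) != len(candidate):
        return False
    mismatches = sum(x != y for x, y in zip(left, candidate))
    return mismatches <= max_mismatches


# Hypothetical 8 bp flanks used only to demonstrate the check.
left_flank = "ACGGTTCA"
right_flank = "TGAACCGT"
flanks_are_inverted = is_inverted_repeat(left_flank, right_flank)
```

The `max_mismatches` argument is a design choice that lets imperfect repeats be accepted, since inverted repeats flanking mobile elements are not always identical.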

Table 2-7. Annotation of the genes surrounding the degradative genes (Figure 2-3). The annotation proceeds from left to right in Figure 2-3. A ditto mark (") indicates the same entry as in the row above.

Gene annotation (from left to right in Figure 2-3) | BLASTn or BLASTx | The organism with the highest match | GenBank Accession No. | % identity
Phage integrase | BLASTn | R. metallidurans CH34 | CP000352.1 | 83
Phage integrase | " | " | " | "
λ family site specific recombinases | " | " | " | "
Partial transposase ISPpu14 | " | R. eutropha H16 | AY305378.1 | 78
Lys-R family transcriptional regulator, tfdT2 | " | B. sp. st. NK8 | AB050198.1 | 83
Lys-R family transcriptional regulator, tfdT1 | " | B. sp. st. NK8 | AB050198.1 | 75
Lys-R family transcriptional regulator (possible inactivation by frameshift mutation) | BLASTn | C. necator JMP134 pJP4 | AY365053.1 | 79
Chlorocatechol-1,2-dioxygenase | " | " | " | 85
Chloromuconate cycloisomerase | " | " | " | 90
Dienelactone hydrolase | " | " | " | 88
Maleylacetate reductase | " | " | " | 86
Partial integrase | BLASTn, BLASTx | C. taiwanensis st. LMG19424 | CU633749.1 | 69
Ribonuclease Rne/Rng family | BLASTn | PsJN, LB400 | CP001052.1, CP000270.1 | 91
Maf protein | " | PsJN, LB400 | CP001052.1, CP000270.1 | 91
Protein of unknown function DUF163 | " | PsJN | CP001052.1 | 89
Iojap-like protein | " | " | " | "
Nicotinate nucleotide adenylyltransferase | " | " | " | "
Coproporphyrinogen oxidase | " | " | " | "
Phosphoribosylamine/glycine | " | LB400 | CP000270.1 | 88
Protein of unknown function DUF28 | " | " | " | "
Uracil phosphoribosyltransferase | " | PsJN | CP001052.1 | 91
Putative NADPH-quinone reductase | BLASTn | LB400 | CP000270.1 | 86
Methylglyoxal synthase | " | " | " | "
Putative 3-oxoacyl-acyl carrier protein | " | " | " | "
Conserved hypothetical protein | " | " | " | "
K+ transporting ATPase, F subunit | " | " | " | "
K+ transporting ATPase, A subunit | " | " | " | "
K+ transporting ATPase, B subunit | " | " | " | "
K+ transporting ATPase, C subunit | " | " | " | "

The degradative genes in OLGA172 share the highest similarity with those of C. necator JMP134. The integrases share high similarity with those of R. metallidurans CH34. The remaining genes found downstream of the degradative genes share high similarity with genes found in B. phytofirmans PsJN and B. xenovorans LB400.

2.2.6 Sequence analysis of the integrases

The mobile genetic elements had the highest percentage match to those found in strains such as R. metallidurans CH34 and R. eutropha H16. To confirm that OLGA172’s λ family site specific recombinase belongs to the same family of integrases as those of these strains, the C-terminal amino acid sequences of the known tyrosine class integrases from R. metallidurans CH34, R. eutropha H16, and A. eutrophus H1 were aligned with that of OLGA172 in CLUSTALW (Figure 2-4).


R.E.H16   GWMKNRNIDLIDLDESVTARFMNRMIDASRDRVQRARPTLRQFLAYLRAEAIVCSPTLGG
A.E.H1    GWMKNRNIDLIDLDESVTARFMNRMIDASRDRVQRARPTLRQFLAYLRAEAIVCSPTLGG
R.M.CH34  GWMKHRNIDLIDLDESVTARFMKRMIDASRDRVQRARPTLRQFLAYLRAEAIVCSPTLGS
OLGA172   RWMKSTNVGLVDLDESATACFTERLTDAPEARVQFELAVLRSFLAYLRDEAIVLSSTLGD
           ***  *:.*:*****.** * :*: **.. ***   ..**.****** **** *.***.

R.E.H16   QSEIARIYRRYLDHLRQDRGLAKNSLLVYGPFIRDFLDSHSANDGTILADAFCAVTIRDH
A.E.H1    QSEIARIYRRYLDHLRQDRGLAKNSLLVYGPFIRDFLDSHSANDGTILADAFCAVTIRDH
R.M.CH34  QSAIAHTYRRYLDYLRQDRGLAKNSLLVYGPFIRDFLDSHSAGDGSLLPDAFDAVTIRNH
OLGA172   QSAITHIYERYLDYLRQDRGLAKNSVLVYGPFIRDFLNSQDVGDGDILPDAFDAMTIRNH
          ** *:: *.****:***********:***********:*:...** :*.*** *:***:*

R.E.H16   FLTYSEGRSAEYTRLMAVALRSFCHFLFLRGDTARDLYESVPSVRKWRQSTVPTFLTPEQ
A.E.H1    FLTYSEGRSAEYTRLMAVALRSFCHFLFLRGDTARDLYESVPSVRKWRQSTVPTFLTPEQ
R.M.CH34  LLARSKGRSAEYTRLMAVALRSFCHFLFLRGDTARDLAGSVPSVRKWRQSTVPTFLTPEQ
OLGA172   ILTRSKGRSAEYTRLMTVALRSFCHFLFLHGETARDLYESVPSVRKWRQSTVPTFLTPEQ
          :*: *:**********:************:*:*****  *********************

R.E.H16   QEALIASADRSTPTGRRDYAILLLLARLGLRAGEIVAMQLDDIHWRSGELVVHGKGQMVE
A.E.H1    QEALIASADRSTPTGRRDYAILLLLARLGLRAGEIVAMQLDDIHWRSGELVVHGKGQMVE
R.M.CH34  QEALIASADRSTPTGLRDYAILLLLARLGLRAGEIIEIELDDIHWRSGELVVHGKGQMVE
OLGA172   EEVLIATADRSTPRGSRDYAVLLLLARLGLRAGEIVALELGDIHWRSGELVVHGKGQMVE
          :*.***:****** * ****:**************: ::*.*******************

R.E.H16   HVPLPSEVGAAIATYLRDGRGASASRHVFLRRLAPRVGLAGPAAIGKIVCQAFARAGFRP
A.E.H1    HVPLPSEVGAAIATYLRDGRGASASRHVFLRRLAPRVGLAGPAAIGKIVCQAFARAGFRP
R.M.CH34  HVPLSSEVGAAIATYLRDGRGASASRRVFLRRLAPRVGLAGPAAIGKIVCQAFARVGFRP
OLGA172   HLPLPSEVGEAIAMYLRDDRGASASRRVFLRMWAPRVGLAGPAAIGHIVRLAFARAGFRP
          *:**.**** *** ****.*******:****  *************:**  ****.****

R.E.H16   ACRGSAHLFRHGLATTMIRHGASIAEIAEVLRHRSPDSTAIYAKVAFEDLRGVARSWPTA
A.E.H1    ACRGSAHLFRHGLATTMIRHGASIAEIAEVLRHRSPDSTAIYAKVAFEDLARGSRARGPR
R.M.CH34  ACRGAAHLFRHGLATTMIRHGASMAEIAEVLRHRSPDSTAIYAKVAFEDLRGVARSWPTA
OLGA172   ACRGAAHLFRHGLATTMIRHGASIAEIAEVLRHRSQDSTAIYAKVAFEDLRRVARPWPTT
          ****:******************:*********** **************   :*.  .

R.E.H16   GGAI-----
A.E.H1    REVQYDFDP
R.M.CH34  GGAI-----
OLGA172   GGAI-----
            .

Figure 2-4. CLUSTALW (1.81) multiple sequence alignment of the C-terminal regions of λ family site specific recombinases. R.E.H16, A.E.H1, and R.M.CH34 represent R. eutropha H16, Alcaligenes eutrophus H1, and R. metallidurans CH34, respectively. The absolutely conserved catalytic site residues are shown (red). In the consensus line, * marks positions where the residue is identical in all sequences, and : and . mark positions where the residue is partially conserved.

The conserved residues R-(x)n-H-(x)2-R-(x)25-37-Y (Nunes-Duby et al., 1998) were found in all of the strains, and OLGA172 seems to have an intact catalytic region of the recombinase. Moreover, the partial integrase (Figure 2-3) located downstream of the degradative operon, next to tfdF, also carried the conserved residues of its catalytic region (results not shown). This means that the degradative genes in OLGA172 are flanked on either side by integrase genes encoding proteins with conserved catalytic residues. It is unknown whether these integrases are still functional.
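The spacing pattern of the conserved residues can be screened for with a regular expression. The Python sketch below is a loose pattern check only, not the procedure used in this study: the pattern and function name are hypothetical, the first spacer (x)n is left open because n is not specified, and a match does not show that a recombinase is functional. The test sequence is the OLGA172 row of the last three full blocks of Figure 2-4.

```python
import re

# R-(x)n-H-(x)2-R-(x)25-37-Y, with the first spacer of unspecified length.
CATALYTIC_MOTIF = re.compile(r"R.+?H.{2}R.{25,37}Y")


def has_catalytic_motif(protein: str) -> bool:
    """True if the sequence contains residues spaced as in the motif.

    Alignment gap characters are removed before searching.
    """
    ungapped = protein.upper().replace("-", "")
    return CATALYTIC_MOTIF.search(ungapped) is not None


# OLGA172 C-terminal residues as shown in Figure 2-4.
OLGA172_CTERM = (
    "RAGEIVALELGDIHWRSGELVVHGKGQMVE"
    "HLPLPSEVGEAIAMYLRDDRGASASRRVFLRMWAPRVGLAGPAAIGHIVRLAFARAGFRP"
    "ACRGAAHLFRHGLATTMIRHGASIAEIAEVLRHRSQDSTAIYAKVAFEDLRRVARPWPTT"
)

olga172_has_motif = has_catalytic_motif(OLGA172_CTERM)
```

In this sequence the H-(x)2-R element is HLFR, and 31 residues separate that arginine from the tyrosine, which falls within the 25-37 range.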


Figure 2-5. A neighbor-joining phylogenetic tree of the recombinases found in OLGA172, R. metallidurans CH34, A. eutrophus H1 pAE1 and R. eutropha H16, made in SplitsTree (Huson, D.H., and Bryant, D., 2006). Bootstrapping with 1000 replicates was carried out; the results are shown on each branch.

Using CLUSTAL, the DNA sequences of the second (int2OLGA) and the third (int3OLGA) integrases found in tandem with the first site-specific recombinase (intlambdaOLGA) were aligned with the three int genes also found in R. metallidurans CH34, R. eutropha H16, and A. eutrophus H1. A neighbor-joining tree was generated from these alignments using the SplitsTree program (Huson, D.H., and Bryant, D., 2006; Figure 2-5). Bootstrapping with 1000 replicates was carried out in order to assess the support for each branch. That is, the program resamples the columns of the alignment with replacement 1000 times and rebuilds the tree from each resampled data set. The number on each branch represents the percentage of replicates in which that branch was recovered. This showed that the three integrases found within each organism have little sequence similarity to each other. Rather, the 1st integrases of all four organisms show high sequence similarity to one another, and the same is true for the 2nd and the 3rd integrases, as confirmed by the bootstrap value of 100% for each of the three main branches. The CLUSTALW alignment for each of the three integrases in all of the organisms can be found in Appendix 6.3.
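The resampling step behind a bootstrap value can be sketched in a few lines of Python. This is a deliberately simplified illustration and not what SplitsTree does: instead of rebuilding a neighbor-joining tree for each replicate, it asks only whether a chosen sequence keeps the same nearest neighbour by Hamming distance. All names and the toy alignment are hypothetical.

```python
import random


def bootstrap_columns(alignment: dict, rng: random.Random) -> dict:
    """One bootstrap replicate: sample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(alignment: dict, name: str) -> str:
    """Name of the sequence closest to `name` (first one wins on ties)."""
    others = (other for other in alignment if other != name)
    return min(others, key=lambda other: hamming(alignment[name], alignment[other]))


def bootstrap_support(alignment: dict, name: str, expected: str,
                      replicates: int = 1000, seed: int = 0) -> float:
    """Percentage of replicates in which `expected` is nearest to `name`."""
    rng = random.Random(seed)
    hits = sum(
        nearest_neighbour(bootstrap_columns(alignment, rng), name) == expected
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Toy alignment: two "first" integrases and two "second" integrases.
TOY_ALIGNMENT = {
    "int1_A": "ACGTACGTAC",
    "int1_B": "ACGTACGTAT",
    "int2_A": "TTGCATGCAA",
    "int2_B": "TTGCATGCAG",
}

support = bootstrap_support(TOY_ALIGNMENT, "int1_A", "int1_B")
```

A full analysis would rebuild the tree for every replicate and count how often each branch reappears; the nearest-neighbour test here stands in for that step only to keep the sketch short.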

Another set of 3 integrases was also found elsewhere in the genome of OLGA172 using StandAlone BLAST. StandAlone BLAST is operated from the Command Prompt and computes local alignments much faster than the regular NCBI server-based BLAST. It lets the user select a single reference genome. OLGA172’s genome was aligned to the sequence of PsJN’s plasmid pBPHYT01 using StandAlone BLAST. A number of contigs came up as homologous matches to the sequences of pBPHYT01: contig1438, contig1564, contig1540, contig 1509, contig01424, NODE_4034, NODE_2710, NODE_2805, NODE_7, NODE_2763, NODE_349, NODE_1682, NODE_1271, NODE_13, NODE_358, NODE_2375, NODE_138. When each of these contigs was submitted to BLASTn, it matched integrases found in PsJN. The integrases in PsJN were also arranged in tandem, with 3 bp overlaps between adjacent integrases. The sequence of these integrases was imported into Sequencher and the matching contigs listed above were aligned to it. These contigs of OLGA172 aligned almost perfectly with the three integrases of PsJN. This set of integrases is not the same as the three found neighboring the degradative genes in OLGA172: when the two sets of three integrases were aligned in CLUSTAL, the nucleotide match was very poor, and no inverted repeat sequences were found flanking the second set of integrases (not shown).

2.3 Discussion

2.3.1 Advantages and Disadvantages of NextGen Sequencing

High throughput sequencing produces drastically increased amounts of data compared with ordinary capillary sequencing based on the Sanger method (Kato, 2009; Pop, 2009). A microbial genome can be sequenced in a matter of days. The Solexa and 454 runs used here took only 5-7 days to sequence OLGA172’s genome, disregarding the time required for the preparation and confirmation of the genome and the “wait” times for machine access. Within this period, Solexa and 454 generated over 364 Mbp and 75 Mbp of data, with 45X and 10X coverage, respectively (Table 2-5). At a moderate cost, NextGen sequencing saved the several years it would have taken to sequence the entire genome of this strain by the Sanger method. The key improvements of these methods include the PicoTitre titanium plates used by 454, cluster amplification by bridge PCR in Solexa, and the use of fluorescent tags by both methods (Rothberg and Leamon, 2008). Together, these allowed a significant reduction in the cost and time of whole genome sequencing.

Despite these advantages, there were a number of challenges in employing the NextGen methods. The first obvious problem was the short read length. However broad the coverage offered by Solexa, it does not compensate for the short read lengths of the raw sequence data. Although these short reads may be adequate for studying a small region, they are simply too short for de novo assembly of an entire genome (MacLean et al., 2009). This was especially true for the reads generated by Solexa: raw reads averaged 38 bp when OLGA172 was sequenced, and 70 bp after end repair of the reads. This presents a serious problem, as thousands of contigs remain unassembled. Only short overlaps between the reads are possible, and de novo assemblies constructed from such short reads remain highly fragmented (Pop, 2009).

The complexity of whole genome assembly increases dramatically when a large number of repeats is combined with a large number of short reads. During the assembly of the contigs in Sequencher, several contigs were found to contain identical 30-50 bp repeat sequences and therefore aligned to the same region. This generates artificial ‘hot’ and ‘cold’ spots at regions of deep or shallow coverage (MacLean et al., 2009). Assembly programs such as CLC Genomics Workbench placed non-specific sequence matches such as repeats arbitrarily, which resulted in many false gaps in the genome. Sequencher 4.1.4, used in this study, uses a greedy algorithm. This means that assembly decisions are made by optimizing a local objective function, which may not give the best solution in global terms when assembling the entire genome (Pop, 2009). It always processes the best overlap first, which can easily misassemble contigs containing repeats by placing them at the first matching sequence and ignoring the true match, which it may not have encountered yet. Another problem arises when two contigs overlap only by a repeat sequence. As mentioned above, this occurred when assembling NODE_318, NODE_444 and contig00091, which carry the three integrases flanked by inverted repeats. Their sequences terminated at the repeat and overlapped with NODE_2930 only by the repeat sequence. This represents a false positive connection between the contigs, which must be broken before the mistake propagates through further assembly. Assembly projects such as this require additional information from a reference sequence of a closely related strain.
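The behaviour of a greedy assembler can be illustrated with a small Python sketch. This is a generic textbook-style greedy merge written under stated assumptions, not Sequencher's implementation; the function names and all sequences are hypothetical. At each step it merges the pair of contigs with the longest exact suffix-prefix overlap.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list, min_overlap: int = 3) -> list:
    """Repeatedly merge the pair with the best overlap until none remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:  # first best overlap wins on ties
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three reads with unique overlaps assemble into a single contig.
simple = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])

# Two contigs (y and z) begin with the same repeat that ends contig x.
REPEAT = "TGCATCGA"
x = "GGATTACA" + REPEAT
y = REPEAT + "CCGGAA"
z = REPEAT + "TTAACC"
ambiguous = greedy_assemble([x, y, z], min_overlap=5)
```

In the second case both y and z overlap x by the repeat alone. The greedy procedure joins x to whichever candidate it meets first and leaves the other unconnected, even though the data support both joins equally, which is the kind of repeat-driven false connection described above.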

The combination of 454 and Solexa data reduced the number of stray contigs and improved the overall assembly. The 454 reads closed many gaps in the Solexa reads, and larger contigs were produced as a result. However, thousands of contigs still remained disconnected from the others. The problem lies in the number of N’s in the Solexa sequencing data. The N’s resulted from mis-called bases due to errors in the sequencing reaction. Many were eliminated by using all three data sets: where a contig contained a string of N’s, another contig covering the same region of the genome often allowed the N’s to be replaced with the correct nucleotide bases. However, this was not always the case, as there was an overwhelming number of N’s in many of the contigs, totaling up to 19% of the whole genome. The number of N’s in a contig ranged from one to a string of a few hundred. Many occurred in the middle of gene sequences or near the ends of the contigs, which made these contigs impossible to connect with others.

Pyrosequencing also had its own challenges. It has been documented many times that 454 technology has difficulty sequencing long homopolymeric stretches of nucleotides. When such a region is sequenced, the amount of light produced is proportional to the number of identical nucleotides occurring in tandem. However, when more than 3-4 identical nucleotides are incorporated, the light response becomes non-linear, making it difficult to determine the number of incorporated nucleotides (Ronaghi et al., 1999); 454 would often read n nucleotides as n-1 nucleotides (Pop, 2009; MacLean et al., 2009). Therefore, determining the exact number of nucleotides in a long homopolymeric region can be a problem, especially for regions such as poly-A tails. In turn, this can decrease the percent identity between two contigs overlapping a homopolymeric region if the number of nucleotides was not called correctly in one of the contigs.
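Regions at risk of this kind of miscall can be located by scanning a contig for homopolymer runs. The Python sketch below is one simple way to do so; the function name is hypothetical and the default threshold of 4 is an assumption based on the 3-4 nucleotide limit mentioned above.

```python
import re


def homopolymer_runs(seq: str, min_length: int = 4) -> list:
    """Return (start, base, length) for each run of identical bases
    that is at least `min_length` long. Positions are 0-based."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # A base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{" + str(min_length - 1) + r",}")
    return [
        (match.start(), match.group(1), match.end() - match.start())
        for match in pattern.finditer(seq.upper())
    ]


runs = homopolymer_runs("ACGTTTTTGAAAC")
# runs == [(3, "T", 5)]; the run of three A's is below the threshold
```

Flagged positions could then be checked against the Solexa reads covering the same region, since those reads are not subject to the same homopolymer error.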

The main downfall of Solexa and 454 comes from the presence of secondary structures in sequences. Hairpin structures are extremely stable even at high temperatures and inhibit extension by the polymerase; their stability depends on the number of G’s and C’s in the stem and the number of nucleotides in the loop (Ronaghi et al., 1999). Secondary structures such as these, as well as repeat sequences, may prevent hybridization during the emulsion PCR step of pyrosequencing. This results in a poor signal to noise ratio due to inefficient on-bead amplification (Diehl et al., 2006). Possible evidence of this comes from NODE_318, NODE_444, and contig00091, which cover the region of tandem integrases flanked by inverted repeats (Figure 2-3). These three contigs cover the same region, but their sequences terminate at exactly the same positions at either end, and it has been very difficult to find the neighboring contig. Sequencing most likely terminated because of secondary structures that were not resolved, so that the DNA could not be amplified beyond this sequence during library preparation. This is explored in the next chapter.

Considering all of the challenges above, 454 is better suited than Solexa to genomes with a high number of repeat sequences (Hernandez et al., 2008). Nevertheless, NextGen sequencing data alone are not sufficient to assemble an entire genome that is rich in repeat sequences and secondary structures, and further information about the organism is required in order to infer the organization of its genes.

2.3.2 Possible HGT of the tfd catabolic operon in OLGA172

The genome sequencing allowed us to uncover evidence that supports the possibility of an HGT event involving the tfd catabolic genes. The first line of evidence comes from an examination of the GC content of the genome of OLGA172 and its relatives.

The GC content of the whole genomes of PsJN and LB400 was compared with that of OLGA172. The GC content of PsJN is 60.99%, only 0.24% different from that of OLGA172’s genome, whereas LB400’s GC content is 62.44%, differing by 1.21%. Although the difference is small, the GC content of OLGA172 is closer to that of PsJN, which does not possess any catabolic genes related to chlorobenzoate degradation, suggesting HGT of these genes into OLGA172.

Also, LB400 has a degradative operon whose GC content is consistent with that of its genome, but OLGA172 does not. The GC content of LB400’s degradative operon as a whole is very similar to that of its chromosomes and megaplasmid, differing by no more than 2%. This suggests that the degradative operon of LB400 was not recently acquired from another organism by horizontal gene transfer, but has existed in its genome for a very long time. LB400 was found in a contaminated site, and its ability to degrade such a range of aromatic compounds strongly suggests that it has long been adapted to its complex niche.

In addition, the GC content of OLGA172’s degradative operon is almost identical to that of JMP134’s Module I degradative operon. The GC content of JMP134’s Module I degradative operon is significantly lower than that of its chromosomes, differing by 8-10%. However, the GC content of JMP134’s Module II degradative genes is quite similar to that of its chromosomes. This suggests first that the two modules had different evolutionary origins (Laemmli et al., 2000), and second that Module II may have arisen before Module I. Bacterial species display a large range in GC content, but the genes within one organism show a fairly similar base composition, and hence a very similar GC content (Ochman et al., 2000). This means that new genes received from another organism will have a GC content distinguishable from that of the recipient’s surrounding genome. As the Module I degradative genes of JMP134 have a different GC content from the rest of its genome, it is highly likely that the Module I operon was recently transferred into its genome by HGT. Since OLGA172’s degradative operon has a GC content almost the same as that of JMP134’s Module I, this also suggests HGT of these genes into OLGA172.

That said, the GC content of OLGA172’s entire genome differs from that of its degradative operon by approximately 5%. This may suggest that its degradative genes have existed in OLGA172’s genome for a longer period of time than Module I has in JMP134, as the difference of 5% is not as great as the difference seen for Module I of JMP134. However, this GC content difference is greater than the difference seen between the degradative operon and the entire genome of LB400. It cannot be concluded from this evidence alone that OLGA172’s degradative genes are of foreign origin. With no evidence of genes related to a plasmid origin of replication, and with chromosomal genes present next to its degradative genes, it is most likely that OLGA172’s genes responsible for the degradation of CBA are located on its chromosome.

The second line of evidence for HGT of the catabolic genes in OLGA172 comes from the association of the catabolic region with mobile elements. Upon examination of OLGA172’s genome, it was very clear that its degradative operon is associated with a number of mobile genetic elements (MGE). The most obvious set of MGE was the 3 integrases in tandem found next to the degradative genes, tfdC IDIEIFI. These are likely involved in the integration and excision of circular DNA molecules (Burrus et al., 2002). There are two major classes of site specific recombinases: the tyrosine (integrase) and serine (resolvase, invertase) families. The tyrosine class proteins form covalent DNA-protein linkages through a C-terminal tyrosine with a 3’-phosphate (Kornberg and Baker, 1992). The λ integrase family of site specific recombinases belongs to the tyrosine class, and the recombinase found in OLGA172 falls into this category. The C-terminal region of 180 amino acids carries all of the catalytic residues required for cleavage and ligation. The absolutely conserved residues are R-(x)n-H-(x)2-R-(x)25-37-Y, including the tyrosine residue that is required for the attachment of this recombinase (Nunes-Duby et al., 1998; Figure 2-4).

The discovery that integrases commonly occur as pairs or trios is a recent one. A study of R. metallidurans CH34 by Houdt et al. (2009) coined the term Recombinases in Trios, RIT for short. They revealed that a number of RIT elements, RIT Cme1 and RIT Cme2, are present in CH34 and A. eutrophus H16, which carry 21 RIT-like elements, and also in various beta-Proteobacteria and alpha-Proteobacteria, though none have been found in gamma-Proteobacteria so far. Three full RIT elements found in H16 are orthologous to RIT Cme1 in CH34, but each of those in H16 seems to have undergone a rearrangement (Houdt et al., 2009). I apply this terminology to the two sets of 3 integrases found in B. phytofirmans OLGA172: RIT BphO1 (associated with the CCD operon) and RIT BphO2 (homologous to the trio found in pBPHYT01). Following these authors, I have also adopted the common naming system for tyrosine-based site specific recombinases and call them int genes, as this name is most frequently used. The λ family site specific recombinase of OLGA172 will be named int-λ. As is the case with RIT BphO1 and RIT BphO2 in OLGA172, RIT Cme1 and RIT Cme2 of CH34 also lacked high percentage identity to each other, but shared higher identities with trios from various other organisms (Appendix 6.3; Houdt et al., 2009). This was evident when the integrases found in the RIT elements of OLGA172, CH34, H16, and H1 were aligned together. The first int genes of all four organisms were highly homologous to one another, with low homology to the other two int genes. The same is true for the other two int genes (Figure 2-5). It was also very interesting that when RIT BphO1 of OLGA172 was aligned to RIT Cme1, the alignment was almost flawless, and RIT Cme1 was also flanked by inverted repeat sequences (see Appendix 6.3). In addition, the RIT element in CH34 has a nearby region carrying remnants of integrases and transposable elements. This area was given the name “junkyard” by the authors. Such regions, which seem to store the remnants of mobile genetic elements, are prevalent in many organisms such as H16 (Schwartz et al., 2003). They are suspected to aid in modulating lithoautotrophy in CH34 via movement and integration of these genes, acting alongside the RIT elements. It is therefore likely that a similar “junkyard” region also exists in OLGA172 near its RIT element, and that it may act as a driving force for the excision and integration of its CBA degradative genes.


Chapter 3

3 Extension of regions flanking the catabolic operon

In the previous chapters I explained how genome sequencing allowed a short stretch of catabolic operon sequence to be extended into a much larger region that revealed the flanking areas. I was also able to infer possible functions of the mobile elements from their close homology to such elements in other species. However, due to regions of high secondary structure, our sequences terminate at the RIT element, RIT BphO1, when simple assembly methods are applied.

In this chapter I explain how I used regions of high homology shared between OLGA172 and its relatives to find putative flanking contigs and to test for their presence. I used three methods: primer design and PCR based on synteny, a primer walking method involving arbitrary primers (TAIL PCR), and primer design based on contigs having a minimal overlap.

Figure 3-1. A schematic diagram of the region surrounding the degradative genes, tfdC IDIEIFI, in B. phytofirmans OLGA172. The degradative genes and RIT BphO1 are enlarged to show their details.

Synteny methods are based on the fact that closely related organisms share a conserved gene order. We assume that a stretch of conserved genes will be followed by a region of continued gene conservation. It was evident from the last chapter that the closest relatives of OLGA172 are B. xenovorans LB400 and B. phytofirmans PsJN. The genes surrounding the degradative genes in OLGA172 had the highest homology to genes found in PsJN and LB400. OLGA172 also shared a conserved gene order with PsJN and LB400 in the region preceding the degradative genes, tfdC IDIEIFI. However, this gene order terminated once the degradative genes and their flanking integrases appeared in OLGA172. The genes that shared a conserved gene order with OLGA172 lie on chromosome 1 of PsJN and LB400; in fact, these genes are usually found on the chromosomes of other organisms that carry them. On the other hand, the genes tfdC IDIEIFI responsible for degradation of CBA are usually found on plasmids in the strains that carry them, such as C. necator JMP134 (Figure 3-1). Because genes that are usually chromosomal and genes that are usually plasmid-borne were found next to each other in OLGA172, this implied that the degradative genes had undergone a horizontal gene transfer (HGT) event. This transfer event may have been mediated by the integrases found associated with the degradative genes. Further sequencing past the integrases was required in order to determine the genes that may act in concert with the integrases to carry out this transfer event.

It was predicted that the genomic sequence of OLGA172 would resume high homology to PsJN and LB400 beyond the mobile element, RIT BphO1. This allowed prediction of the subsequent genes in OLGA172’s genome where contig assembly terminated for lack of quality sequence overlap between two contigs.
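The synteny-based prediction described above can be illustrated with a minimal Python sketch. The gene labels are hypothetical placeholders, not the annotated OLGA172 genes, and the function `predict_next_genes` is an illustration only; the work in this thesis relied on BLAST and Sequencher rather than on a script of this kind.

```python
def predict_next_genes(query_order, reference_order, n=2):
    """Match the longest run of genes at the end of query_order against a
    contiguous run in reference_order, then return the length of that run
    and the next n genes that follow it in the reference."""
    for k in range(len(query_order), 0, -1):          # try the longest suffix first
        suffix = query_order[-k:]
        for i in range(len(reference_order) - k + 1):
            if reference_order[i:i + k] == suffix:
                end = i + k
                return k, reference_order[end:end + n]
    return 0, []                                       # no shared gene order found


# Hypothetical gene labels, for illustration only
reference = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
query = ["geneX", "geneB", "geneC", "geneD"]

shared, predicted = predict_next_genes(query, reference)
print(shared, predicted)
```

In this toy case the last three genes of the query follow the reference order, so the two reference genes after that run are the candidates one would design primers against.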

When the adjacent genes cannot be predicted because no reference is available, a primer walking method can be used to extend sequencing outward from a known sequence in the genome. One of the primer walking methods used in this chapter is Thermal Asymmetric InterLaced (TAIL) PCR. It uses three specific nested primers with very high melting temperatures and one random primer with a significantly lower melting temperature. During a TAIL PCR reaction, three different products may be amplified: the product between the specific and the random primer (type I), the product between two specific primers (type II), and the product between two non-specific primers (type III; Figure 3-4). Of the many places where the random primer binds, one should lie not far from the RIT BphO1, allowing amplification of the desired region in between. The undesired type II and type III products are eliminated over the successive reactions by using the next nested primer in each reaction (Liu et al., 1995).

Using the TAIL PCR method and other sequence extension methods, the 27 kbp sequence of OLGA172’s genome previously annotated in Chapter 2 was further extended. This revealed the degradative operon’s neighboring genes and the presence of extensive mobile elements that may be responsible for the horizontal gene transfer of the degradative genes.


3.1 Methods

3.1.1 Sequence linkage via PCR based on synteny analysis

BLAST was used extensively to annotate the genes of OLGA172 and to compare them with those of closely related organisms. BLASTn was used to annotate the genes found in the assembled contigs of the sequencing data. The contigs were aligned against the nucleotide collection database to search for related organisms carrying similar genes. A word size of 11 was used with an expect threshold of 20, match/mismatch scores of 2 and -3, and gap existence and extension costs of 5 and 2, respectively.
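For readers who wish to reproduce a comparable search locally, the BLASTn parameters listed above map onto options of the NCBI BLAST+ `blastn` program. The sketch below only assembles the argument list; it is an assumption that the search would be run from the command line at all, since the text does not state which BLAST interface was used, and the query file name is hypothetical.

```python
def blastn_command(query_fasta, database="nt"):
    """Assemble a blastn argument list using the parameter values given in the text."""
    return [
        "blastn",
        "-query", query_fasta,      # FASTA file of assembled contigs (hypothetical name)
        "-db", database,            # "nt" is the nucleotide collection database
        "-word_size", "11",
        "-evalue", "20",            # expect threshold
        "-reward", "2",             # match score
        "-penalty", "-3",           # mismatch score
        "-gapopen", "5",            # gap existence cost
        "-gapextend", "2",          # gap extension cost
    ]


cmd = blastn_command("olga172_contigs.fasta")
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` on a machine where BLAST+ and a copy of the database are installed.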

The contigs that had no significant hits using BLASTn were analyzed using BLASTp and BLASTx. This was especially the case for regions carrying mobile genetic elements, which seemed to lack high-similarity matches in other strains. The SWISS-PROT database and BLOSUM62 matrix were used, with an expect threshold of 10 and a word size of 3, filtered for low complexity regions. Strains possessing genes highly similar to those of OLGA172 were analyzed for their gene order relative to OLGA172, in order to determine the homology of these genes. The conserved gene order served as a predictor of the subsequent genes in OLGA172 when contig assembly came to a halt.

From the initial assembly of the contigs from NextGen Sequencing in Chapter 2, over 27 kbp of sequence was assembled. However, this assembly terminated due to the lack of quality overlap at either end of the assembled piece. Assuming conserved gene order between closely related organisms, gene sequences of B. xenovorans LB400 and B. phytofirmans PsJN were used to fill in the gaps between the contigs generated by genome sequencing and to extend the sequence completed in Chapter 2. This was carried out by assembly in Sequencher 4.1.4 (Gene Codes Corporation).

Figure 3-2. A schematic diagram of the degradative genes and the neighboring RIT BphO1 that were found by assembling the raw sequence data in Sequencher 4.1.4. The pink block represents the unknown sequence.


Sequencing also terminated at the other end of this 27 kbp piece, where the RIT BphO1 was located (Figure 3-1). Solexa and 454 Sequencing generated contigs of the same sequence that terminated at the exact same nucleotide base. There was a lack of quality overlap with other contigs in this region, and the sequence assembly stopped here.

Table 3-1. Primers used in polymerase chain reactions in Chapter 3.

| Primer name | Targeted gene | Sequence (5’→3’) | Tm (°C) | Annealing temp. (°C) | Expected product size in OLGA172 (bp) |
| --- | --- | --- | --- | --- | --- |
| 318L | 3rd phage integrase neighboring tfdC IDIEIFI in OLGA172 | ATCTCAGTGCCGATGCTCTT | 60 | 55 | unknown |
| 4403R_unk | NODE_4403 containing diguanylate cyclase | AGGACGTCACAGGTGGTTTC | 60 | 55 | unknown |
| 4403RC_digua | NODE_4403 containing diguanylate cyclase | GAACAGATTGTGCTGCTGGA | 60 | 55 | unknown |
| 4897R_hyp | NODE_4897 containing glycosyl transferase | CGGAGGGCATCTATATCAGC | 60 | 55 | unknown |
| 4897RC_glyc | NODE_4897 containing glycosyl transferase | GAGGACTTCGAAGGCGTTTT | 60.7 | 55 | unknown |
| TAIL-1 | 3rd phage integrase neighboring tfdC IDIEIFI in OLGA172 | GTATCAACCAGGCGAACCAGAT | 62.5 | 57 | unknown |
| TAIL-2 | 3rd phage integrase neighboring tfdC IDIEIFI in OLGA172 | AAAAATCAGTAACGCCCCACAC | 62.3 | 57 | unknown |
| TAIL-3 | 3rd phage integrase neighboring tfdC IDIEIFI in OLGA172 | TCGGACATGAATCATCTGAGAC | 60 | 55 | unknown |
| Arb-deg | | CAWCGICNGAIASGAA | 47-48 | 42 | unknown |
| 2930R | | CACATGGAGCAGATACGTGTAAGG | 63 | 57 | 569 bp with 318L |

In order to extend sequencing from these integrases, the conserved gene order had to be followed. The genes present in OLGA172 downstream of the degradative genes shared a conserved gene order with those of B. xenovorans LB400 and B. phytofirmans PsJN. This conserved order terminated at the CCD operon, but I hypothesized that it would resume beyond it, in the unknown region shown in Figure 3-2. In LB400 and PsJN, this putatively conserved area contains diguanylate cyclase and glycosyl transferase. Homologous genes were found on NODE_4403 and NODE_4897 of OLGA172. Thus, reverse primers were designed targeting these contigs, one reading out of each end of the two nodes, and a forward primer, 318L, was designed to target the 3rd int gene of the RIT BphO1. The reverse primers are 4403R_unk, 4403RC_digua, 4897R_hyp, and 4897RC_glyc.

The information regarding these primers is found in Table 3-1; all were ordered from Invitrogen. PCR reactions were carried out and products were sequenced when appropriate. The HotStart Taq Kit (Qiagen, Cat. no. 203645) was used for every PCR. Each 20ul amplification reaction contained 10ul of Mastermix (provided with the kit and consisting of 2.5 units HotStarTaq DNA polymerase, 1x PCR buffer (10x concentrated, containing Tris-Cl, KCl, (NH4)2SO4, 15mM MgCl2, pH 8.7), and 200uM of each dNTP), 8.6ul of nano pure ultra filtered water, 0.4ul each of forward and reverse primers at 50uM, and less than 0.4ug of template OLGA172 DNA, mixed well. The reaction was then run in a thermocycler following the conditions stated in Table 2-2. The annealing temperature used was approximately 5 °C below the Tm of the primers. The extension time varied depending on the expected product size: 1 minute for products below 1 kbp and 1.5-3 minutes for those above 1 kbp.
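The reaction set-up rules above can be checked with a short Python sketch. The volumes and concentrations are those given in the text; taking the lower of the two primer Tm values as the basis for the annealing temperature, and treating a product of exactly 1 kbp as "above 1 kbp", are assumptions made for illustration.

```python
def reaction_summary(mastermix_ul=10.0, water_ul=8.6, fwd_ul=0.4, rev_ul=0.4,
                     total_ul=20.0, primer_stock_um=50.0):
    """Return (volume left for template in ul, final concentration of each primer in uM)."""
    template_ul = total_ul - (mastermix_ul + water_ul + fwd_ul + rev_ul)
    final_primer_um = primer_stock_um * fwd_ul / total_ul
    return round(template_ul, 2), round(final_primer_um, 2)


def annealing_temp(tm_forward, tm_reverse, offset=5.0):
    """Annealing temperature about `offset` deg C below the lower primer Tm (assumption)."""
    return min(tm_forward, tm_reverse) - offset


def extension_minutes(product_bp):
    """(min, max) extension time: 1 min below 1 kbp, otherwise the 1.5-3 min range."""
    return (1.0, 1.0) if product_bp < 1000 else (1.5, 3.0)


print(reaction_summary())          # volume for template, final primer concentration
print(annealing_temp(60, 63))      # Tm values of 318L and 2930R from Table 3-1
print(extension_minutes(569))      # expected 318L/2930R product size
```

With the stated volumes, 0.6ul remains for template DNA and each primer ends up at 1uM in the 20ul reaction.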

Table 3-2. Thermal cycle conditions for all polymerase chain reactions carried out.

| Reaction | Cycle no. | Thermal condition (°C) | Time (min) |
| --- | --- | --- | --- |
| Primary denaturation | 1 | 95 | 5 |
| Denaturation | 35 | 95 | 1 |
| Annealing | 35 | 55 | 1 |
| Extension | 35 | 72 | 1 |
| Final extension | 1 | 72 | 10 |

A 0.8% (w/v) agarose gel was used to run genomic DNA and a 1.5% (w/v) gel was used to run PCR products, restricted 16S rRNA bands, and DNA BOX fingerprints. Ultrapure Agarose was added to 0.5X TBE buffer. Ethidium bromide was added directly to this mix (2ul in 15ml or 4ul in 30ml), which was then poured into a cast to solidify. The GeneRuler 1kb DNA Ladder (Fermentas Life Sciences) was used for all gel electrophoresis. 100ul of DNA ladder, provided at a concentration of 0.5ug/ul, 100ul of 6X Loading Dye Solution (100mM Tris-HCl, pH 7.6, 0.03% bromophenol blue, 0.03% xylene cyanol FF, 60% glycerol and 60mM EDTA), and 400ul of deionized water were mixed together, and 3ul of this ladder mixture was loaded on each gel.


3.1.2 Linkage of sequences via Thermal Asymmetric InterLaced PCR (TAIL PCR)

A different approach from the one described above involved amplifying sequence outward from the RIT BphO1 using primers targeting known sequence combined with an arbitrary primer. TAIL PCR uses three specific nested primers with very high melting temperatures and one random primer with a significantly lower melting temperature. The thermal conditions are precisely controlled so that the unwanted products are diluted out and the concentration of the desired product is drastically elevated.

Figure 3-3. The region of the 3rd int gene from which the three primers for TAIL PCR were designed (red circle). The sequence of this region is shown, as well as the primer target sites in order (red). The Tm of each primer is shown.

Figure 3-3 shows the region of the 3rd int gene from which the three nested specific primers were designed. The pink region neighboring the integrase is the target sequence. The cycles used for TAIL PCR are shown in Table 3-3, as conducted in the study by Liu and Whittier (1995). The detailed mechanism of the reaction is shown in Figure 3-4. During the primary reaction, the first specific primer, TAIL-1, and a random primer are used. Here, the first 5 high stringency cycles allow the initial binding of TAIL-1, and the 1 low stringency cycle allows the initial binding of the random primer to many regions of the genome. Amplification is then carried out by interlacing high and low stringency thermal cycles (the TAIL process). During high stringency cycles, only specific primers can bind, linearly amplifying the target product. During the low stringency cycle, both primers anneal to the template. The single stranded target DNA produced during the high stringency cycles is replicated, becomes double stranded, and increases in number several fold. This provides a significant increase in the desired target template for the secondary reaction, where linear amplification of the target DNA is carried out using nested primers. In the subsequent secondary and tertiary reactions, the unwanted products are eliminated using internally nested primers, and the TAIL process is repeated to lower the background products (Liu and Whittier, 1995).

Table 3-3. TAIL PCR thermal cycles (Liu and Whittier, 1995)

| Reaction | File no. | Cycle no. | Thermal condition |
| --- | --- | --- | --- |
| Primary | 1 | 1 | 92 °C (2 min), 95 °C (1 min) |
| | 2 | 5 | 94 °C (30 s), 57 °C (1 min), 72 °C (2 min) |
| | 3 | 1 | 94 °C (30 s), 30 °C (3 min), ramping to 72 °C over 3 min, 72 °C (2 min) |
| | 4 | 10 | 94 °C (30 s), 42 °C (1 min), 72 °C (2 min) |
| | 5 | 12 | 94 °C (30 s), 57 °C (1 min), 72 °C (2 min); 94 °C (30 s), 57 °C (1 min), 72 °C (2 min); 94 °C (30 s), 42 °C (1 min), 72 °C (2 min) |
| | 6 | 1 | 72 °C (5 min) |
| Secondary | 7 | 10 | 94 °C (30 s), 57 °C (1 min), 72 °C (2 min); 94 °C (30 s), 57 °C (1 min), 72 °C (2 min); 94 °C (30 s), 42 °C (1 min), 72 °C (2 min) |
| | 6 | 1 | 72 °C (5 min) |
| Tertiary | 8 | 20 | 94 °C (30 s), 42 °C (1 min), 72 °C (2 min) |
| | 6 | 1 | 72 °C (5 min) |
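The interlacing of high and low stringency cycles in Table 3-3 can be made explicit with a small Python sketch that encodes the program as data and counts the annealing steps of each kind. The temperatures and cycle numbers are taken from Table 3-3; the classification threshold of 57 °C and all names in the code are assumptions introduced for illustration.

```python
HIGH = (94, 57, 72)       # denature, anneal, extend (deg C): high stringency
LOW = (94, 42, 72)        # reduced stringency
VERY_LOW = (94, 30, 72)   # single low stringency cycle of the primary reaction

# Each entry: (file number, repeats, steps within one repeat).
# Files 1 (initial denaturation) and 6 (final extension) are omitted
# because they contain no annealing step.
PROGRAM = {
    "primary": [(2, 5, [HIGH]), (3, 1, [VERY_LOW]), (4, 10, [LOW]),
                (5, 12, [HIGH, HIGH, LOW])],
    "secondary": [(7, 10, [HIGH, HIGH, LOW])],
    "tertiary": [(8, 20, [LOW])],
}


def count_annealing_steps(reaction, threshold=57):
    """Return (high, low) counts of annealing steps at or above / below threshold."""
    high = low = 0
    for _file_no, repeats, steps in PROGRAM[reaction]:
        for _denature, anneal, _extend in steps:
            if anneal >= threshold:
                high += repeats
            else:
                low += repeats
    return high, low


for name in PROGRAM:
    print(name, count_annealing_steps(name))
```

The counts show that each super cycle of files 5 and 7 contributes two high stringency steps for every low stringency step, which is what favours the product primed by the specific primer.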

The PCR products resulting from the above reactions were purified with the QIAquick PCR Purification Kit (Qiagen, Cat. No. 28104), according to the manufacturer’s instructions. DNA elution was carried out using filter sterilized dH2O. When a PCR reaction yielded more than one product, the reaction product was run on an agarose gel and all of the bands (or the desired band of the expected size) were extracted. The QIAquick Gel Extraction Kit (Qiagen, Cat. No. 20021) was used to extract specific DNA bands from agarose gels. DNA was eluted with filter sterilized dH2O. DNA quantification was carried out using a NanoDrop™ 1000 Spectrophotometer (Thermo Fisher Scientific, ©2008). The DNA concentration in ng/ul was measured via absorbance at 260nm before samples were sent for sequencing at The Centre for Applied Genomics (TCAG) at The Hospital for Sick Children. When sequencing regions containing integrases and highly repetitive sequences, the ‘Repetitive’ and ‘Secondary Structure’ options were chosen.
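The conversion from absorbance at 260nm to DNA concentration follows the standard relationship that one absorbance unit (normalized to a 10 mm path) corresponds to about 50 ng/ul of double-stranded DNA. A minimal sketch is given below; the absorbance reading and elution volume are hypothetical values, not measurements from this study.

```python
def dsdna_concentration(a260, dilution_factor=1.0, ng_per_ul_per_a260=50.0):
    """Estimate double-stranded DNA concentration (ng/ul) from absorbance at 260 nm."""
    return a260 * ng_per_ul_per_a260 * dilution_factor


def total_yield_ng(a260, elution_volume_ul):
    """Total DNA (ng) recovered in the given elution volume."""
    return dsdna_concentration(a260) * elution_volume_ul


# Hypothetical reading of 0.5 absorbance units in a 30 ul eluate
print(dsdna_concentration(0.5), total_yield_ng(0.5, 30))
```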

Figure 3-4. A detailed schematic representation of TAIL PCR protocol and the expected products during each reaction.

3.1.3 Secondary structure analysis and GC content calculation

As mentioned above, NextGen Sequencing methods revealed three different contigs that carry the RIT element, all of which terminated at the same sequence. Also, when TAIL PCR was used to extend sequencing of this region, the amplified product could not be sequenced fully. The web-based programs GeneBee-NET and Mobyle (Brodsky et al., 1995; Neron et al., 2009) were employed to predict secondary structures in the region of the RIT BphO1. The GC content of this region was also calculated with the Molecular Biology Core Facilities (MBCF) web program, copyrighted © 1992 - 2009 by Paul Morrison at the Dana-Farber Cancer Institute and/or Harvard University. This was carried out to determine whether this region has a higher GC content, which may explain the stability of the possible secondary structures found here.
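The GC content calculation itself is simple enough to sketch in Python. This is not the MBCF web program used in this study, only an equivalent illustration; the function names and the sliding-window parameters are assumptions. The example sequence is the 28 bp repeat reported in Section 3.2.3.

```python
def gc_content(seq):
    """Percent G+C of a DNA sequence, ignoring case and non-ACGT characters."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def windowed_gc(seq, window=50, step=10):
    """GC percentage in sliding windows, returned as (start, percent) pairs."""
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window])) for i in range(0, last_start, step)]


repeat_28bp = "CGCGGCATAATCCGGGAATCGGCATAAC"
print(round(gc_content(repeat_28bp), 1))   # 16 of 28 bases are G or C -> 57.1
```

A sliding window such as `windowed_gc` is one way to locate a local rise in GC content along a contig, which a single whole-region percentage cannot show.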

3.2 Results

3.2.1 OLGA172 shares conserved gene order with PsJN and LB400

Figure 3-5a shows the region that was annotated as described in Chapter 2. The terminal region of the last contig encoded K+ channel histidine kinase (red circle, Figure 3-5a), which had the highest percentage match with that of PsJN and LB400. Thus, this gene’s sequence was retrieved from the genomes of PsJN and LB400, imported into Sequencher, and aligned to the end of the contig that carried the partial sequence of this gene. The K+ channel histidine kinase sequence of either PsJN or LB400 was then made a reference sequence, to which all of the other contigs had to align (Figure 3-5b).

Figure 3-5. a) The arrow shows the direction in which the genes were further annotated. The red circle shows where the gene K+ channel histidine kinase is located. b) Method of using the conserved gene order to fill in the gaps using the genes of PsJN and LB400.

In this way, the contig that carried the rest of the sequence for this gene in OLGA172 was aligned to this reference sequence. Only then could these contigs be assembled together, with the gene sequences of PsJN and LB400 filling in the gaps that existed between them. This gap filling method was applied in the direction of the arrow in Figure 3-5a. Figure 3-5b shows the contig assembly carried out using the gap filling method.

Using the above gap-filling method following conserved gene order, the genes were further annotated in the direction shown by the arrow in Figure 3-5a, starting from the segment of the genome that had already been annotated in Chapter 2. The putative annotation of these genes and their highest % identity match from BLASTn are shown in Table 3-4. In Figure 3-5a, the genes are listed in order from left to right. The genes annotated in Table 3-4 follow the genes that were annotated in Table 2-7.

Table 3-4. Annotation of the genes that continued to share their conserved order with those of PsJN and LB400. Looking at Figure 3-5a, the annotation is carried out from the left side to the right.

| Gene annotation | Organism with the highest match | GenBank Accession No. | % identity |
| --- | --- | --- | --- |
| osmosensitive K+ channel signal transduction histidine kinase | PsJN, LB400 | CP001052.1, CP000270.1 | 89 |
| two component transcriptional regulator – winged helix family | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |
| small multidrug resistance protein | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |
| transmembrane protein | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |
| Putative ABC lysophospholipase exporter, fused innermembrane subunits, FtsX | PsJN | CP001052.1 | 84 |
| oxygen binding protein | PsJN | CP001052.1 | 84 |
| Protein of unknown function – DUF924; putative transmembrane protein | PsJN | CP001052.1 | 84 |
| glucose-1-dehydrogenase | PsJN | CP001052.1 | 87 |
| conserved hypothetical protein | LB400 | CP000270.1 | 91 |
| transcriptional regulator of tetR family | LB400 | CP000270.1 | 91 |
| diacylglycerol kinase | PsJN | CP001052.1 | 87 |
| glycosyl transferase group 1 | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |
| Metallophosphoesterase | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |
| transmembrane protein | PsJN, LB400 | CP001052.1, CP000270.1 | 86 |

After the above annotation, over 46 kbp of OLGA172’s genome had been assembled, including the degradative operon (Figure 3-6). However, the sequence of this region came to a halt once again after the inverted repeat sequence of the RIT BphO1. Following the synteny method, I assumed that the conserved gene order between OLGA172, PsJN, and LB400 would continue after the degradative genes (Figure 3-6).

Figure 3-6. The horizontal gene transfer mechanism by which the degradative genes may have been introduced into or transferred out of OLGA172’s genome. The contigs in which these genes occur are shown, as well as the regions from which the primers were designed.

The two conserved genes targeted by the reverse primers were diguanylate cyclase and glycosyl transferase. When PCR was carried out using the forward primer, 318L, and the reverse primers, 4403R_unk, 4403RC_digua, 4897R_hyp, and 4897RC_glyc (Figure 3-6), it did not yield any specific PCR products; only non-specific binding of the primers occurred. None of the PCR products carried the sequences from which the primers were designed.


3.2.2 TAIL PCR

The next approach taken to extend sequencing past the RIT BphO1 sequence was Thermal Asymmetric InterLaced (TAIL) PCR. When this protocol was applied to amplify the region downstream of the RIT element, two non-specific products and one specific product were produced in the primary and secondary reactions, as expected (Figure 3-7). The specific product amplified by TAIL-3 (the third specific primer) and the arbitrary degenerate primer was highly detectable in the tertiary reaction, as expected. However, it was only about 350 bp in size, and when this band was extracted and sequenced, only 114 bp of sequence was obtained.

Figure 3-7. Gel electrophoresis of the TAIL PCR reactions. The first lane in each gel picture is the Fermentas GeneRuler 1kb Ladder. The second lane in each gel shows the reaction product(s).

This 114 bp sequence was submitted to BLASTn to find out which gene it may encode. Its closest match was the phage integrase of R. metallidurans CH34 (83% identity, GenBank Accession No. CP000352.1). This short sequence aligned well to the 3rd int gene, but was too short to provide linkage to any new contigs. The product was re-sequenced, but the sequence was again too short. Figure 3-8 shows the chromatogram of this product aligned with OLGA172’s genome in Sequencher. The sequence of the product is indicated by the red box, and the sequence of the 3rd specific primer, TAIL-3, is underlined.

As shown, the product was amplified from exactly the expected location, and its chromatogram shows very clear peaks for each sequenced base with almost no background contamination. This confirms that the product is, in fact, the desired product amplified from the correct target DNA sequence. However, the read does not extend beyond the known sequence of the 3rd int gene; in fact, it stops 120 bp before the end of the int gene. The product is not long enough to determine the sequence beyond this region. Secondary structures are predicted to be present here and may be causing the termination of sequencing, as was predicted for the Next Generation Sequencing methods.

Figure 3-8. The region in which the 3rd specific primer (underlined) of the TAIL PCR reaction was designed, and the sequence of the TAIL PCR product. The product is indicated by the red box, and its chromatogram is shown below.

3.2.3 Sequence extension beyond RIT BphO1

The last method attempted to extend sequencing past the RIT BphO1 was another PCR reaction. NODE_2930 (1688 bp) overlaps NODE_318 (carrying RIT BphO1) by a mere 28 bp. Moreover, this 28 bp represents a repeat sequence, CGCGGCATAATCCGGGAATCGGCATAAC. Nonetheless, since this was the only contig that had any overlap, a reverse primer, 2930R, was designed from NODE_2930. The 2930R primer and the 318L forward primer were used in a PCR reaction, with an annealing temperature of 55 °C and standard conditions (see Methods, Table 3-2), to determine whether the two nodes are in fact connected. A high quality product of 569 bp resulted, which was gel extracted and sent to TCAG for sequencing. The resulting sequence (Figure 3-9) overlaps NODE_318 by over 300 bp and overlaps NODE_2930 by over 100 bp. This confirms the placement of the NODE_2930 sequence adjacent to NODE_318. The PCR reaction was performed in duplicate to confirm the validity of the product sequences. The products are named 318L2930RDR and 318L2930R-R, the two middle sequences seen in Figure 3-9.

Figure 3-9. The connection between NODE_318, NODE_444 and contig00091 (top 3) carrying RIT BphO1 and NODE_2930. The PCR products are named 318L2930RDR and 318L2930R-R (middle 2). The repeat sequence is underlined, and the 6 bp palindrome is highlighted.
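The kind of minimal end-to-end overlap that linked NODE_2930 to NODE_318 can be detected with a simple suffix-prefix comparison, sketched below in Python. The 28 bp repeat is the sequence given in the text; the flanking bases on either side are hypothetical, the orientation of the two contigs is assumed for illustration, and a real search would also have to consider reverse complements.

```python
def suffix_prefix_overlap(left, right, min_len=20):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):   # longest candidate first
        if left[-k:] == right[:k]:
            return k
    return 0


REPEAT = "CGCGGCATAATCCGGGAATCGGCATAAC"   # 28 bp repeat from the text

# Toy contigs: only the shared repeat is real; the flanks are hypothetical.
contig_a = "ATCTCAGTGC" + REPEAT
contig_b = REPEAT + "TTGACCA"

print(suffix_prefix_overlap(contig_a, contig_b))
```

Because an overlap this short that also lies in a repeat could arise by chance at more than one locus, the PCR across the junction described above is what actually confirms the join.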

When NODE_2930 was annotated using BLASTn, only a few significant hits were found, so BLASTx was employed to better annotate this region. In doing so, a number of mobile elements were found (Table 3-5). Moreover, a partial gene sequence of diguanylate cyclase was found amidst the mobile elements. Recall that this was the gene targeted when sequence extension was attempted with the primers 318L and 4403RC_digua, a PCR reaction that generated no specific product. This is most likely because the gene was not intact, as only a partial sequence was found here.

Table 3-5. Annotation of the genes found in NODE_2930 using BLASTn and BLASTx.

| Gene annotation | BLAST program | Organism with the highest match | GenBank Accession No. | % identity (protein if BLASTx used, nucleotide if BLASTn used) |
| --- | --- | --- | --- | --- |
| Partial transposase IS66 | BLASTn | R. eutropha JMP134 | CP000092.1 | 80% |
| Partial transposase IS66 | BLASTx | B. glumae BGR1 | CP001503.1 | 84% |
| ISPsy5 transposase | BLASTx | R. metallidurans CH34 | CP000352.1 | 59% |
| Transposase and inactivated derivatives | BLASTn | B. glumae BGR1 | CP001503.1 | 81% |
| 1683 bp of 5’ end – acetoacetyl CoA reductase; 839 bp of 3’ end – diguanylate cyclase | BLASTn | B. phytofirmans PsJN | CP001052.1 | 79% |

3.2.4 Secondary structure analysis and GC content calculation

Sequencing of the Solexa and 454 generated contigs and of the TAIL PCR product all terminated in the same region of the RIT BphO1. It was predicted that extensive secondary structures present in this region may be causing this difficulty in sequencing. These structures may also be stabilized by the higher GC content of this region.

When the region where sequencing of the TAIL PCR product terminated was examined in detail, a high GC level was evident immediately after the point of termination (Figure 3-10).

Figure 3-10. A schematic diagram (not to scale) showing the region of the 3rd int gene where a higher level of GC content and secondary structures are predicted to be present.

Using the web-based programs GeneBee-NET and Mobyle (Brodsky et al., 1995; Neron et al., 2009), secondary structures were predicted in the TAIL PCR product and in the sequence that connected NODE_318 and NODE_2930.

The blue circle in Figure 3-11b indicates the region where sequencing of the TAIL PCR product terminated. This region is full of stem-loop structures that are likely stabilized by the high GC content here, and this is most likely the reason for the termination of sequencing. Figure 3-11a and the red circle in Figure 3-11b show the region surrounding the 6 bp palindrome (underlined) found when NODE_318 and NODE_2930 were connected.

Though a number of different stem-loop structures were predicted for this region, one conclusion could be drawn from them: there is a stem-loop structure that begins its formation in the centre of or very near this 6 bp palindrome that may be a reason for the termination of sequencing.
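Palindromes and inverted repeats of the kind that can seed a stem-loop can be located with an exact-match scan, sketched below in Python. This is far simpler than the structure prediction performed by GeneBee-NET and Mobyle and is not a substitute for it; the function names, the loop-length limits, and both test sequences are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_palindromes(seq, length=6):
    """0-based start positions of sites equal to their own reverse complement."""
    seq = seq.upper()
    return [i for i in range(len(seq) - length + 1)
            if seq[i:i + length] == reverse_complement(seq[i:i + length])]


def find_inverted_repeats(seq, stem=6, min_loop=3, max_loop=8):
    """(start, loop_length) pairs where a stem-length segment is followed, after
    a loop, by its exact reverse complement, so a hairpin could form."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - stem + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == target:
                hits.append((i, loop))
    return hits


print(find_palindromes("AAGAATTCTT"))                  # GAATTC starts at index 2
print(find_inverted_repeats("AAGGCCGATTTTTCGGCCAA"))   # 6 bp stem, 4 bp loop
```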


Figure 3-11. a) Possible secondary structures generated by Mobyle surrounding the 6 bp palindrome (underlined). b) A secondary structure generated by GeneBee. The 6 bp palindrome region (red circle) and the region where sequencing of the TAIL PCR product terminated (blue circle) are indicated.

3.3 Discussion

3.3.1 Shared homology between OLGA, PsJN, and LB400

Upon annotation via BLASTn, it was noticed that the order of the chromosomal genes was perfectly conserved between OLGA172 and both of its reference strains, PsJN and LB400. Gene synteny is the best indicator of homology between genes whose ancestral history may not be known (Guttman, 2009; Houdt et al., 2009). Thus, it is fair to say that the OLGA172 genes that share a perfectly conserved gene order with those of PsJN and LB400 are orthologs, sharing the highest sequence similarities. They most likely share a common ancestor and, most likely, the same function. Moreover, these genes are located on chromosome 1 of both PsJN and LB400, which increases the likelihood that the same genes in OLGA172, as well as the degradative genes that neighbor them, are also located on its chromosome 1.


The synteny method used here allowed assembly of these contigs and prediction of subsequent genes in the region surrounding the degradative genes. This would not have been possible without the conserved gene order between OLGA172, PsJN, and LB400.

3.3.2 Confirmation of “junkyard” presence in OLGA172

The first PCR reaction attempting to extend sequencing past the RIT BphO1 assumed that the conserved gene order between OLGA172, PsJN, and LB400 would resume after the RIT element. However, this PCR reaction generated no specific product. There are a few possible reasons. First, a large number of mobile genetic elements may have been inserted upstream of the RIT BphO1, which may have drastically enlarged the gap between the forward and reverse primers. The sequence lying between these primers may have been too long to be amplified to a detectable concentration with a conventional PCR method. Such is the case with R. eutropha H16. This strain possesses a self-transmissible megaplasmid, pHG1, which carries the genes responsible for facultative lithoautotrophy, denitrification and mineralization of aromatic compounds. It contains an extensive “junkyard” region, over 72 kbp in size, carrying 17 remnants of mobile elements and 22 partial or intact genes encoding phage integrases (Schwartz et al., 2003). Within this region are three phage type integrases found in tandem (RIT elements), as in OLGA172’s genome. If such a large region of mobile genetic elements also exists near the RIT BphO1 in OLGA172, this would explain why the PCR reactions carried out above did not succeed.

Another reason for the failure of the above PCR could be interruption, by an insertion sequence or another transposable element, of the two genes that were targeted with the reverse primers 4403R_unk, 4403RC_digua, 4897R_hyp, and 4897RC_glyc. If gene transposition, recombination, or horizontal gene transfer did take place in this region, it is highly possible that an associated transposable element disrupted the gene sequence of diguanylate cyclase and/or glycosyl transferase, causing the PCR reaction to fail.

When NODE_318 and NODE_2930 were finally connected, it was evident that the region downstream of the RIT BphO1 (NODE_2930) was a hotspot of partial mobile genetic elements, just as in pHG1 of R. eutropha H16. It was thus determined that OLGA172, too, has a “junkyard” carrying partial sequences of transposase IS66, partial integrases, and partial transposable elements (Table 3-5). More interestingly, BLASTn revealed a match between this “junkyard” region and a part of the Burkholderia glumae BGR1 genome, with 80% nucleotide identity, where it encodes a transposase and its inactivated derivatives. This may be evidence of non-mobility of the degradative operon in OLGA172. Though its associated integrases show that at some point in the past this operon most likely moved from its original location, the presence of genes that share high similarity with the inactivated transposase derivatives in BGR1 may suggest that these mobile elements in OLGA172 are no longer mobile, or are non-autonomous.

Furthermore, BLASTn also found a homologous match (79%) to a diguanylate cyclase gene in PsJN, though only a fragment of the gene was found in this contig. This was the gene targeted with the primer 4403RC_digua in a previous PCR reaction, predicted by synteny to be located upstream of RIT BphO1 (Figure 3-6). However, that PCR reaction was unsuccessful: no specific product was amplified, and only non-specific binding resulted. With the discovery of this small piece of the gene in this region, it is most likely that the gene order is conserved between OLGA172, PsJN, and LB400 as predicted but that, as stated earlier, the diguanylate cyclase gene was interrupted by the mobile genetic elements in this region. This most likely contributed to the failure of the previous PCR reaction, in addition to the presence of the “junkyard” neighboring the degradative genes in OLGA172. NODE_2930, which carries these partial mobile elements, is 1688 bp long, but at the terminal region of this contig, assembly to other contigs was again not possible due to a lack of quality overlap. Therefore, the “junkyard” region in OLGA172 may be much larger than what is present in NODE_2930 and may extend further into the genome.

3.3.3 DNA sequencing interrupted by secondary structure

Hairpin structures are very stable even at high temperatures, which prevents extension by Taq polymerase. The stability of a stem-loop structure depends on the GC content of the stem and the number of nucleotides in the loop (Ronaghi et al., 1999).

Figure 3-11 shows the possible secondary structures present within this region. Panel a) and the red circled region in b) show where sequencing by the NextGen Sequencing methods terminated. The blue circle in b) shows where sequencing of the TAIL PCR product terminated, which is very close to where the NextGen reads terminated. Both regions show extensive secondary structures that most likely caused the termination of sequencing. Such secondary structures can inhibit hybridization of DNA polymerase to the template DNA, which can lead to inefficient amplification of the desired products during the emulsion PCR step of 454 Sequencing (Diehl et al., 2006). This supports the earlier prediction made in Chapter 2 that Next Generation Sequencing can be vulnerable to the presence of secondary structures in the genome. To further support this point, the GC content of the 300 bp region surrounding the 6 bp palindrome was also calculated. The GC content of this region is 63%, which is higher than the 61% GC content of OLGA172’s whole genome. Figure 3-10 also illustrates the region of very high GC content where sequencing of the TAIL PCR product terminated. Though a difference of 2 percentage points is not drastic, the elevated GC content of this region may be enough to stabilize the secondary structures present here, disrupting the sequencing reactions attempted in this region.


Chapter 4

4 Analysis of OLGA172 and related strains from pristine soils

A total of 610 3CBA-degrading bacterial strains were isolated from pristine soils around the world by Fulthorpe et al. (1998; Figure 1). These strains came from soils that were able to degrade CBA without any lag period (Fulthorpe et al., 1996). However, as explained in the introduction, expression of the 3CBA degrading phenotype is erratic in OLGA172 and many other strains in this collection. We were curious to see whether other strains from pristine soils carry the same catabolic genes and mobile elements as those found in OLGA172. This information would tell us how widespread the putatively mobile catabolic element of OLGA172 is. The genome sequences surrounding OLGA172’s degradative genes made this possible.

Alternatively, the other "pristine isolates" could be using entirely different genes for the same function. Recall that two different forms of chlorocatechol degradation have been found: module I and module II. Both of these modules are carried by pJP4 (the catabolic plasmid from C. necator JMP134), and both are functional and highly expressed (Laemmli et al., 2000). However, the two modules share low sequence homology, and some think they have different evolutionary origins. For these reasons, it is necessary to test for genes of both modules.

Approximately 30 of the strains above, including OLGA172, were purified from their glycerol cultures through sequential growth on selective (3CBA) and non-selective (R2A) media. In this chapter, we looked for both variants of tfdC and tfdD, and for evidence of recombinases, in these other strains. Phylogenetic analysis of their degradative genes is carried out here as well, to determine their evolutionary relationship with other strains isolated from contaminated sites that are also capable of degrading xenobiotic compounds, such as 3CBA.

4.1 Methods

4.1.1 The ability of OLGA172 and its related strains from pristine soils to break down and grow in the presence of CBA

30 bacterial strains from pristine soils, including OLGA172, were taken from their glycerol stocks at -80 °C. In order to purify these strains, they were sequentially plated on a selective 3mM chlorobenzoate (CBA) plate and a non-selective R2A plate and incubated at 28 °C (Figure 4-1). The 3mM CBA plate was made up of the following:

- 100ml of 10x phosphate buffer that was 10mM and pH of 7 (in 1L: K2HPO4·3H2O – 17.1g (0.075M); NaH2PO4·H2O – 3.4g (0.024M))
- 10ml of 100x Mg/Ammonium Solution that has 100mM of Mg and 250mM of N (in 1L: (NH4)2SO4 – 33g (0.25M); MgSO4·7H2O – 24.6g (0.1M))
- 1ml of 1000x trace element solution with pH of 7 (in 1L: Na2EDTA·2H2O – 12g (0.036M); NaOH – 2g (0.05M); ZnSO4·7H2O – 0.4g (1.4x10^-3 M); MnSO4·4H2O – 0.4g (1.8x10^-3 M); CuSO4·5H2O – 0.1g (1.0x10^-4 M); FeSO4·7H2O – 3g (0.1M); Na2SO4 – 5.2g (0.04M); NaMoO4·2H2O – 0.1g (4.6x10^-4 M))

Then 5mg of yeast extract, 889ml of distilled H2O, 16g of agar and 20ml of 50mM 3-CBA solution were added for 1mM final concentration. 50mg of the pH indicator bromothymol blue was also added to the CBA media.
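The final 3-CBA concentration that results from the volumes listed above can be checked with the usual dilution relationship (C1 x V1 = C2 x V2). This is a minimal sketch only: it assumes that the listed liquid volumes are simply additive and ignores any volume contributed by the agar and other solids. The function and variable names are hypothetical.

```python
def final_concentration_mM(stock_mM: float, stock_ml: float, total_ml: float) -> float:
    """Dilution formula C2 = C1 * V1 / V2, with concentrations in mM and volumes in ml."""
    if total_ml <= 0:
        raise ValueError("total volume must be positive")
    return stock_mM * stock_ml / total_ml


# Liquid components of the CBA medium, in ml, as listed in the recipe.
components_ml = {
    "10x phosphate buffer": 100,
    "100x Mg/ammonium solution": 10,
    "1000x trace element solution": 1,
    "distilled water": 889,
    "50 mM 3-CBA stock": 20,
}

total_ml = sum(components_ml.values())
cba_mM = final_concentration_mM(50, components_ml["50 mM 3-CBA stock"], total_ml)

print(f"total volume: {total_ml} ml")
print(f"final 3-CBA concentration: {cba_mM:.2f} mM")
```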

The non-selective R2A media was made with Difco™ R2A Agar (Difco Laboratories, Becton, Dickinson and Company, Ref. no. 218263) according to the manufacturer's instructions. From the -80 °C glycerol stock, OLGA172 was initially plated on a 3mM CBA plate to select for the degrading phenotype of OLGA172 from any contaminants that may have been present. A single colony was then picked and plated on a non-selective R2A plate. Plating on R2A was carried out once or twice more in order to purify OLGA172 from the contaminants. These purified strains were then plated again on CBA media in order to confirm their ability to degrade CBA.

The growth and the colour change of the CBA media were recorded over several weeks, in order to determine which strains retained their ability to degrade CBA.

Figure 4-1. The order of selective (3mM CBA) and non-selective (R2A) media used to grow OLGA172 and its related strains from pristine soils, in order to purify each strain's colonies.

4.1.2 Lysis of cells and amplification of CBA degradative genes

For quick access to the DNA of bacterial cells for PCR, 2-3 bacterial colonies from pure cultures grown on R2A media were harvested and placed in 50 µl of dH2O in 1.5 ml microcentrifuge tubes. The tubes were floated in boiling water for 5 minutes in order to lyse the cells and expose their DNA. These lysates were used as templates for DNA amplification.

PCR was carried out in order to detect the tfdTI, tfdCI and tfdDI genes and int-λ, as well as the module II genes tfdCII and tfdDII. The stretch of genes from tfdTI to tfdDI was also amplified. The HotStarTaq Kit (Qiagen, Cat. no. 203645) was used for every polymerase chain reaction, following the manufacturer's instructions. For each 20 µl amplification reaction, 10 µl of Mastermix (provided with the kit, consisting of 2.5 units of HotStarTaq DNA polymerase, 1x PCR buffer (10x concentrated, containing Tris-Cl, KCl, (NH4)2SO4, 15 mM MgCl2, pH 8.7), and 200 µM of each dNTP), 8.6 µl of nanopure ultra-filtered water, 0.4 µl each of forward and reverse primers at 50 µM, and less than 0.4 µg of template OLGA172 DNA were mixed well together. The reaction was then run in a thermocycler under the conditions stated in Table 4-2.

Table 4-1. Primers used in polymerase chain reaction in Chapter 4.

Primer name     Targeted gene               Sequence (5' → 3')             Tm (°C)   Anneal temp (°C)   Expected product size in OLGA172 (bp)
CCDb/e          tfdC                        F: GTITGGCAYTCIACICCIGAYGG     70        48                 268
                                            R: CCICCYTCGAAGTAGTAYICIGT     62
Recom F/R       Site Specific Recombinase   F: GATGTGATTCCGGATCGTCT        60        55                 250
                                            R: CCGTGTTACGGTCGTTTCTT        60
tfdCII F/R      tfdCII                      F: GGCTGCCGGCATCACGAG          68        57                 377
                                            R: GTCGACTACTACCGGGGC          58
tfdDII F/R      tfdDII                      F: TCAGGCGCTGTGACGATCGAT       68        58                 1188
                                            R: AGCCATTGCCGACAGCCCCAA       73
tfdDIall_F/R    tfdDI                       F: TCGTCAAGCGTCATGCCCGTGAC     71        65                 909
                                            R: GGAGCGCRGARTGTGCGGAGA       72
tfdTIallF,      tfdTI                       F: GGATTCGGGCGITTCTATCCGGT     71        65                 321
tfdTI2all_R                                 R: ATGTGCGACTGAYGCSGGYACC      70

All primers were ordered from Invitrogen Life Sciences. The CCDb/e primers were designed by Leander et al. (1998). The annealing temperature used in each case was approximately 5 °C below the Tm of the primers (Table 4-1). The extension time varied depending on the expected product size: 1 minute was used for products below 1 kbp, and 1.5-3 minutes for those above 1 kbp.
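The reaction set-up described above lends itself to a small worked calculation. The sketch below scales the per-reaction volumes to a shared mix for several reactions, checks the final primer concentration, and encodes the extension-time rule. It is an illustration only: the function names, the 10% pipetting overage, and the treatment of the remaining volume as template are assumptions made here, not part of the protocol.

```python
REACTION_VOLUME_UL = 20.0

# Per-reaction volumes (µl) taken from the reaction description.
PER_REACTION_UL = {
    "HotStarTaq master mix": 10.0,
    "water": 8.6,
    "forward primer (50 uM)": 0.4,
    "reverse primer (50 uM)": 0.4,
}


def scale_mix(n_reactions: int, overage: float = 0.10) -> dict:
    """Volumes (µl) of each shared component for n reactions plus an assumed overage."""
    factor = n_reactions * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in PER_REACTION_UL.items()}


def primer_final_uM(stock_uM: float = 50.0, volume_ul: float = 0.4) -> float:
    """Final primer concentration in one 20 µl reaction."""
    return stock_uM * volume_ul / REACTION_VOLUME_UL


def extension_minutes(product_bp: int) -> tuple:
    """Extension time range (min): 1 min below 1 kbp, 1.5-3 min above."""
    return (1.0, 1.0) if product_bp < 1000 else (1.5, 3.0)


shared = sum(PER_REACTION_UL.values())
print(f"volume left for template per reaction: {REACTION_VOLUME_UL - shared:.1f} ul")
print(f"final primer concentration: {primer_final_uM():.1f} uM")
print(scale_mix(10))
print(extension_minutes(909), extension_minutes(1188))
```

With the volumes given, each primer ends up at roughly 1 µM in the reaction, and about 0.6 µl remains for the template lysate.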


Table 4-2. Thermal cycle conditions for all polymerase chain reactions carried out.

Reaction               Cycle no.   Thermal condition (°C)   Time (min)
Primary denaturation   1           95                       5
Denaturation           35          95                       1
Annealing              35          55                       1
Extension              35          72                       1
Final extension        1           72                       10

By aligning the sequences of tfdTI (not shown) and tfdDI of C. necator JMP134, Burkholderia sp. NK8, and B. phytofirmans OLGA172, which all possess the module I version of the tfdT and tfdD genes, I designed a forward and a reverse primer for each of the genes. They were designed from the most conserved sequences of the genes (Figure 4-2). For the tfdTI gene, nucleotide base discrepancies between the three strains were very frequent. The reverse primer was therefore designed with a higher degree of degeneracy in the primer sequence, to give a product of 321 bp. The standard codes used for redundant positions in the primers were as follows: S = C + G; R = A + G; Y = C + T; I = deoxyinosine; N = A + C + T + G. Because of the degenerate bases used in designing these primers, the melting temperatures of both primer sets were kept very high (approximately 70 °C) to avoid non-specific binding.

Figure 4-2. Alignment of the tfdDI sequences (output from Sequencher). In both a) and b), the sequences shown are (from top to bottom): OLGA172 (top 3 sequences, different contigs covering the same region), Burkholderia sp. str. NK8, and C. necator JMP134. The last sequences shown are the tfdDIall_F primer (a) and the tfdDIall_R primer (b). The dots along the bottom indicate discrepancies in the aligned sequences.
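The effect of the redundancy codes listed for the primers can be made concrete with a short example. The sketch below counts how many distinct oligonucleotides a degenerate primer represents and tests whether a primer is compatible with a target of the same length. It rests on stated assumptions: deoxyinosine (I) is treated as a single base that tolerates any partner, so it does not add to the pool size, and the target sequences are hypothetical and chosen only to exercise the code.

```python
# Bases allowed at each primer position. "I" (deoxyinosine) is assumed to
# tolerate any base in the target.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "CG", "R": "AG", "Y": "CT",
    "N": "ACGT", "I": "ACGT",
}


def pool_size(primer: str) -> int:
    """Number of distinct oligos in the synthesized pool (inosine counts once)."""
    size = 1
    for base in primer.upper():
        if base != "I":
            size *= len(CODES[base])
    return size


def is_compatible(primer: str, target: str) -> bool:
    """True if every target base is allowed by the primer code at that position."""
    if len(primer) != len(target):
        return False
    return all(t in CODES[p] for p, t in zip(primer.upper(), target.upper()))


tfdDIall_R = "GGAGCGCRGARTGTGCGGAGA"   # reverse primer from Table 4-1
ccd_forward = "GTITGGCAYTCIACICCIGAYGG"  # CCDb/e forward primer from Table 4-1

print(pool_size(tfdDIall_R), pool_size(ccd_forward))
# Hypothetical targets, written in the same orientation as the primer.
print(is_compatible(tfdDIall_R, "GGAGCGCAGAGTGTGCGGAGA"))
print(is_compatible(tfdDIall_R, "GGAGCGCCGAGTGTGCGGAGA"))
```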


PCR product purification was carried out using the QIAquick PCR Purification Kit (Qiagen, Cat. No. 28104). DNA elution was carried out using filter-sterilized dH2O. When more than one product was amplified in a PCR reaction, the desired PCR product bands were extracted from the agarose gels using the QIAquick Gel Extraction Kit (Qiagen, Cat. No. 20021). DNA was eluted with filter-sterilized dH2O. The manufacturer's instructions were followed for both the QIAquick PCR Purification Kit and the QIAquick Gel Extraction Kit. All PCR products were sequenced at The Centre for Applied Genomics (TCAG) DNA Sequencing Facility at The Hospital for Sick Children (MaRS Centre, East Tower, 101 College Street, Room 14-601, Toronto, ON, M5G 1L7).

4.1.3 Phylogenetic analysis

Using the keyword 'chlorocatechol' in GenBank, the genes for chlorocatechol degradation were located (refer to Appendix 6.6 for detail) and their sequences aligned using CLUSTALX2. Complete alignment was carried out for the chlorocatechol-1,2-dioxygenase gene, tfdC, and the chloromuconate cycloisomerase gene, tfdD. The tfdC sequences of the strains from pristine soils were obtained from the study by Leander et al. (1998). The PCR reactions using the tfdDIall_F/R primers amplified the tfdD gene from these strains. The tfdC and tfdD sequences of these strains from pristine soils were added to the alignment with the rest of the strains found in GenBank.

The alignments made in CLUSTALX2 (Larkin et al., 2007) of all the chlorocatechol degraders available in GenBank, plus those from the pristine soils, were saved as NEXUS files, and phylogenetic neighbor-joining trees were made using SplitsTree (Huson and Bryant, 2006). The trees were left unrooted, as it is unknown which organism is the ancestor of the rest.
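Neighbor-joining methods start from a matrix of pairwise distances between the aligned sequences. The sketch below is not the CLUSTALX2/SplitsTree workflow used in this study; it only illustrates, under simple assumptions, the uncorrected p-distance (the proportion of differing sites, skipping any column where either sequence has a gap). The strain labels and sequences are hypothetical toy data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Hypothetical aligned toy sequences, not real tfdC or tfdD data.
alignment = {
    "strainA": "ATGCCGTA-A",
    "strainB": "ATGCCGTATA",
    "strainC": "ATGACGTTTA",
}

distances = {
    (s1, s2): p_distance(alignment[s1], alignment[s2])
    for s1, s2 in combinations(sorted(alignment), 2)
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

A matrix of this kind is the usual input to a neighbor-joining algorithm; corrected distance models are commonly preferred for divergent sequences.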

4.2 Results

4.2.1 Growth of OLGA172 and its related pristine isolates on CBA and their possession of degradative genes

The main results of the strains’ growth and amplification of their degradative genes are shown in Table 4-3 and Table 4-4.


Figure 4-3. A picture of two 3mM CBA plates. The media on the left is streak plated with the strain WV7-1, which is unable to break down CBA. The plate to the right has the strain R1-13-1, growing well and breaking down CBA.

The first obvious observation from Table 4-3 is that many strains that were able to degrade 3CBA on the first plate could not do so on the second, after sequential plating on non-selective R2A media.

Table 4-3. The sibling pristine isolates of B. phytofirmans OLGA172 and their ability to grow on 3CBA media.

Strain     Origin         Colour change of    Colour change of    Growing on 1st   Growing on 2nd
                          1st CBA plate       2nd CBA plate       CBA plate?       CBA plate?
HH82       South Africa   Yes                 Yes                 Y                Y
HH83       "              Yes                 slight              Y                Y
HHDI       "              Yes                 Yes                 Y                Y
JMP        Australia      Yes                 Yes                 Y                Y
MBG3       South Africa   Yes                 Yes                 Y                Y
MR10-1     "              Yes                 slight              Y                Y
MB16-3     "              Yes                 Yes                 Y                Y
PMPI       "              slight              Yes                 Y                Y
R1131      Russia         slight              Yes                 Y                Y
OLGA172    "              Yes                 Yes                 Y                Y
R3321      "              Yes                 Yes                 Y                Y
WV71       Saskatchewan   Yes                 Yes                 Y                Y
WG14-4     South Africa   Yes                 slight              Y                Y
WGH1       "              Yes                 Yes                 Y                Y
WGM1       "              Yes                 slight              Y                Y
HH12-4     "              Yes                 slight              Y                N
HH44       "              Yes                 No                  Y                N
WK112      Saskatchewan   Yes                 No                  Y                N
WK33       "              Yes                 No                  Y                N
WV151      "              Yes                 No                  Y                N
R4121      Russia         No                  No                  N                N
BTB1A      Saskatchewan   No                  No                  N                N
BTI2       "              No                  No                  N                N
BTK2       "              No                  No                  N                N
R2181      Russia         No                  No                  N                N
R2191      "              No                  No                  N                N
R261       "              No                  No                  N                N
RC13-1     Chile          No                  No                  N                N
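The growth phenotypes recorded in Table 4-3 can be tallied with a short script. This is only an illustrative sketch: the dictionary transcribes the two "Growing on CBA plate?" columns of the table, and the category labels ("retained", "lost", "never") are names chosen here for convenience.

```python
from collections import Counter

# (growth on 1st CBA plate, growth on 2nd CBA plate), transcribed from Table 4-3.
GROWTH = {
    "HH82": "YY", "HH83": "YY", "HHDI": "YY", "JMP": "YY", "MBG3": "YY",
    "MR10-1": "YY", "MB16-3": "YY", "PMPI": "YY", "R1131": "YY",
    "OLGA172": "YY", "R3321": "YY", "WV71": "YY", "WG14-4": "YY",
    "WGH1": "YY", "WGM1": "YY",
    "HH12-4": "YN", "HH44": "YN", "WK112": "YN", "WK33": "YN", "WV151": "YN",
    "R4121": "NN", "BTB1A": "NN", "BTI2": "NN", "BTK2": "NN",
    "R2181": "NN", "R2191": "NN", "R261": "NN", "RC13-1": "NN",
}

LABELS = {"YY": "retained", "YN": "lost", "NN": "never"}

counts = Counter(LABELS[pattern] for pattern in GROWTH.values())
grew_first = counts["retained"] + counts["lost"]

print(dict(counts))
print(f"{counts['lost']} of {grew_first} strains that grew on the 1st plate "
      f"did not grow on the 2nd ({100 * counts['lost'] / grew_first:.0f}%)")
```

The tally gives 15 strains that grew on both plates, 5 that lost growth on the second plate, and 8 that never grew, for 28 strains in total.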

From these results, it is clear that most of the strains that were able to grow on CBA media possessed both tfdC and tfdD of module I, as OLGA172 does, instead of module II. Those that were not able to grow on the 1st and/or 2nd CBA media only displayed amplification of tfdDI or tfdDII, or partial amplification (*) of tfdC. Apart from OLGA172, int-λ was only amplified in two of the strains, R1-13-1 and R3-32-1, both of which originate from Russia.

Table 4-4. The related strains of B. phytofirmans OLGA172: their possession of the degradative genes.

Columns: CI, DI, CII, DII, int-λ, TI, TI-DI, main product sizes of TI-DI (bp)

Strain     Entries (in column order; empty cells omitted)
HH82       * Y Y
HH83       Y Y Y 800
HHDI       Y Y Y 800
JMP        Y Y Y Y Y 1200
MBG3       Y Y Y Y Y 800
MR10-1     Y
MB16-3     Y Y Y Y 800
PMPI       * Y Y Y Y 2700
R1131      Y Y Y Y Y 2700
OLGA172    Y Y Y Y Y 2700
R3321      Y Y Y Y Y 2700
WV71       Y Y Y Y
WG14-4     Y Y Y Y 700, 900, 1500
WGH1       Y Y Y Y 900, 1500, 2000
WGM1       Y Y Y Y Y Y 900, 1100
HH12-4     * Y
HH44       Y Y Y
WK112      * Y Y
WK33       * Y Y
WV151      Y
R4121      * Y
BTB1A      *
BTI2       * Y
BTK2       * Y
R2181      Y Y
R2191      Y
R261
RC13-1     Y Y


Using the tfdDIall_F and tfdTI2all_R primers, the stretch of DNA carrying tfdTI, tfdCI, and tfdDI was amplified. The 2700 bp products of R3321, R1131, and OLGA172 were sequenced, and they were confirmed to have the same gene arrangement of tfdTI-CI-DI. The other products amplified with these primers have not yet been sequenced.

Partial amplification, indicated above by an asterisk (*), means that a product of the expected size was amplified, but at a very low concentration. This suggests that the primers were able to amplify the target sequence from the template DNA, but that the sequence identity was not high enough to produce strong binding between the template DNA and the primer, which in turn produced a low level of product. The products that were only partially amplified could not be sequenced, because gel extraction of products at such low concentrations was very difficult, even with an increased PCR reaction volume.

4.2.2 Phylogenetic analysis of the degradative genes, tfdCI and tfdDI

The tfdC and tfdD sequences obtained in this part of the study were compared to those found in GenBank and those found in Leander et al. (1998; Appendix 6.5 and 6.6). Figure 4-4 shows the phylogenetic analysis of all available chlorocatechol dioxygenase and chloromuconate cycloisomerase genes. OLGA172 and most of its related strains from pristine soils encode module I degradative genes and form a distinct cluster with genes from other members of the beta-proteobacteria carrying module I genes. Though these pristine strains were found widely spread around the globe, it is very significant that their degradative genes are more closely related to each other than to the genes of strains found in contaminated sites, such as B. xenovorans LB400, RASC and EST4011.

The trees for tfdC and tfdD are congruent in spite of the smaller amount of data available for tfdD. All of the "pristine" strain genes form a cluster distinct from the module II genes and from those of the alpha-proteobacteria and Gram-positive strains. This suggests independent origins of these iso-functional genes in these phylogenetic groups. However, the beta-proteobacteria are clearly carrying two different sets of these degradative genes.

Figure 4-4. Phylogenetic neighbour-joining trees of the genes chlorocatechol-1,2-dioxygenase (a; tfdC in OLGA172) and chloromuconate cycloisomerase (b; tfdD in OLGA172), made in SplitsTree 4.10 (Huson and Bryant, 2006).


4.3 Discussion

4.3.1 OLGA172 is a representative CBA degrader from pristine soils

During this sequential plating of the strains, the 1st plating on the CBA media was not enough to conclude that the colonies were pure, for the following reasons. Since the strains were originally taken from -80 °C glycerol stock, it was possible that some bacterial strains were growing on glycerol that was carried over with them onto the selective media, and were therefore using glycerol instead of CBA as their carbon and energy source. It was also possible that some contaminants were growing within a CBA-degrading colony. Such impure colonies may be similar in morphological appearance, making the growth appear pure to the naked eye. Only a subsequent streaking on R2A can reveal the presence of contaminating strains.

Moreover, as stated earlier, OLGA172 and its related strains from pristine soils have been very unstable in their ability to degrade CBA in the laboratory environment. Therefore, the inability of a strain to degrade CBA may not mean that it is a contaminant, but rather that it is a strain of interest that has lost its ability to degrade CBA (5 strains lost their ability to degrade). These are the reasons why subsequent PCR reactions were carried out in order to confirm the presence of the genes responsible for CBA degradation.

From all of the above reactions, it is very evident that the rest of the strains isolated from pristine soils act very much like OLGA172. When grown in the presence of 3-chlorobenzoate (CBA), OLGA172 often spontaneously loses its ability to degrade CBA in the lab, and this occurred with several of these strains when grown on CBA media. Some grew successfully on both the 1st and 2nd CBA media, such as HH82, and some were never able to grow on CBA media, such as BTK2. However, some lost their ability to degrade CBA when re-plated onto a 2nd CBA plate after a series of re-platings to purify their colonies. This is very interesting because it did not happen to just one arbitrary strain, but to 5 of the strains above in a matter of 2 weeks. OLGA172 has shown this phenotypic instability very often over the last several years of study. So, if this instability can appear in 5 out of the 28 strains studied here within two weeks, it is very probable that it would appear in more of the degraders if they were grown and analyzed for longer periods of time.


It is very evident for these strains originating from pristine soils that module I of the degradative genes is more prevalent than module II. As expected, both tfdC and tfdD are required for growth on CBA media. However, the int-λ gene was only amplified in the strains R3321, R1131, and OLGA172. This shows that the tfdTI-CI-DI genes travel as one unit, associated with the int-λ gene, only in this geographical region, as these three strains were all isolated from Russia. The stretch of DNA from tfdTI to tfdDI was also amplified only in these three strains and the South African strain PMP1 (PMP1 amplified the correct sized product; however, it has not yet been sequenced). tfdT was found in the strains HH82, MBG3, PMP1, R1131, R3321, WV71, WG144, WGM1, WGH1, HH44, WK112, WK33, R4121, BTK2, R2181 and RC131, but of these, most did not yield a product when amplified from tfdT to tfdD, indicating that the gene order is not always conserved in the collection, nor is tfdT always present.

All of the strains isolated from pristine soils that carry the module I set of degradative genes, including OLGA172, grouped together in the same clade for both genes, independent of their geographic origin. There is some variation between the tfdC and tfdD genes from these pristine strains, which may be related to the phylogenetic position of the strains. While previous work has shown that these strains are primarily Burkholderia and Ralstonia strains (Fulthorpe et al., 1998), their 16S genes have not been fully sequenced. This is recommended for future work. If the C and D genes are congruent with the 16S genes, this will support the idea that they are non-mobile and ancient. If they are incongruent with the 16S genes, this will suggest some degree of mobility.
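One common way to quantify congruence between two trees is to compare the splits (bipartitions of the taxa) that each tree implies; the number of splits found in only one of the two trees is the Robinson-Foulds distance. The sketch below is an illustration of that idea and not a method used in this study. Trees are written as nested tuples, and the taxon labels and topologies are hypothetical.

```python
def leaves(tree) -> frozenset:
    """Set of taxon labels under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree) -> set:
    """Non-trivial splits of the tree, each stored as the side lacking a reference taxon."""
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    found = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if reference in clade else clade
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b) -> int:
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical topologies for five taxa A-E.
gene_tree = ((("A", "B"), "C"), ("D", "E"))
rrna_same = (("D", "E"), ("C", ("B", "A")))
rrna_diff = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(gene_tree, rrna_same))
print(rf_distance(gene_tree, rrna_diff))
```

A distance of zero means the two trees imply the same unrooted topology, which is what congruence between the 16S tree and a degradative gene tree would look like; larger values indicate conflicting groupings.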

Thus, it is fair to say that OLGA172 is a representative organism of these bacterial strains that originated in pristine soils, insofar as the strains carry highly similar degradative genes. However, only two of the other strains carry OLGA172's int-λ gene, and all three of these strains were isolated from the same geographic region, Russia. This element was not present in most of the other strains, so the widespread distribution of the strains and their general instability may not be related to any activity of the int-λ gene or the RIT element.


Chapter 5
5 General Discussion

To date, many biodegradative bacteria have been isolated from environments polluted with xenobiotics. Burkholderia phytofirmans OLGA172 and its related 3CBA-degrading strains were the first to originate from pristine soils and yet be able to break down CBA at their first encounter with the compound.

A very important 47 kbp region in OLGA172 has been assembled and annotated during this study. This region shows that the 3CBA degradative genes of OLGA172 are flanked on either side by mobile genetic elements, such as RIT BphO1. A "junkyard" of partial integrases and transposable elements has also been found upstream of RIT BphO1, which suggests past movement of these genes. It is not known definitively whether these genes were transferred out of OLGA172's genome into that of others, into the genome of OLGA172 from other organisms, or perhaps both. There is clear evidence that bacterial strains exposed to novel substrates can evolve efficient degradative pathways through the acquisition and rearrangement of genes via horizontal gene transfer (Fulthorpe and Top, 2009). However, CBA-degrading genes exist within OLGA172 and its related "pristine" strains without any known previous exposure to CBA or PCB contamination. OLGA172 and its relatives may be among the original sources of valuable chloro-aromatic genes. Below, I discuss the evidence for this.

1. These strains were isolated from the locations shown in Figure 1-1. They were not found in contaminated sites, but in pristine environments where adaptation to break down these xenobiotic compounds was not necessary. Nevertheless, when exposed to CBA, they required no lag time and immediately started to break it down (Fulthorpe et al., 1996). It is most likely that these strains retained the genes in order to degrade naturally occurring chlorinated compounds that are structurally and/or chemically analogous to CBA. If this is true, then the likelihood that these genes are of ancient origin is high.

2. Highly similar, but not identical, genes are part of the well known and globally dispersed pJP4 plasmids that encode degradation of 3CBA and 2,4-D. pJP4 cannot be an original source of these genes, as it is clearly the product of multiple gene transfer events (Fulthorpe and Top, 2009; Fulthorpe et al., 1995).


3. The CBA degradative genes are chromosomally located. The degradative operon is located in a long stretch of DNA that is highly homologous to a portion of chromosome 1 of both PsJN and LB400. We could not find any evidence of a plasmid in the genome of OLGA172. Chromosomally located genes are less dispensable than those located on a plasmid and are usually essential for the survival of the cell.

4. The GC content of the degradative operon in OLGA172 differs from that of the whole genome by only 5%. This difference is half of that seen in C. necator JMP134 between its genome and its module I degradative genes. The module I set of JMP134 lies on its pJP4 plasmid, which was constructed through numerous gene transfer events, as stated earlier, and therefore cannot be an ancestral source of the CBA degradative genes. This is clearly reflected in the large GC content difference between its module I degradative genes and its entire genome. In turn, this suggests that the degradative genes in OLGA172 have probably resided in its chromosome for a long time, in order to degrade naturally occurring chlorinated compounds.

5.1 Future Work

Until the assembly of the NextGen Sequencing contigs is complete, many questions remain unanswered. Throughout this study, much evidence was found that supports the hypothesis stated above. To definitively prove the hypothesis, the following are suggestions for future work.

1. The complete assembly of OLGA172’s genome and verification of the degradative operon’s location is critical. This will close all of the gaps that exist now in the contigs and confirm whether or not the rest of the genes in OLGA172 follow the conserved gene order of PsJN and LB400.

2. The lack of a plasmid in OLGA172 has to be confirmed. The genes that are specific to each plasmid known to carry catabolic genes must be examined and searched for in OLGA172's genome. As many different mechanisms of transfer exist for the movement of these genes, looking for only the conserved plasmid genes from the IncP group may not be sufficient.


3. Other mobile elements in OLGA172 and its related pristine isolates must be identified. Determining the presence and the location of all the mobile genetic elements in these genomes will show how they may act in concert with one another in order to move the associated genes from one location to another. It will also allow comparison with other organisms that may carry similarly organized mobile elements, which will shed light on how the degradative genes in OLGA172 may be transposed.

4. The remaining degradative genes, tfdE and tfdF, must be looked for in the pristine isolates. The sequence homology of these genes will show whether all of the genes involved in CBA degradation in these strains arose at the same time as a unit, or came from different sources.

5. The sequence of 16S ribosomal genes from the pristine soil isolates has to be determined. A phylogenetic relationship can be drawn from their 16S sequence homologies, from which the congruency between the 16S genes and the degradative genes can be determined. If the phylogenetic relationship between the 16S sequences of the pristine isolates is congruent or in agreement with the phylogenetic relationship between the degradative genes of the pristine isolates, then we can conclude that these strains did not obtain these genes through a HGT event and that they have existed within these strains for a long period of time.

6. Using simple mating experiments, it has to be determined whether or not the degradative operon in OLGA172 is able to transfer horizontally via conjugation. This will determine whether the associated mobile elements are still active.

7. Houdt et al. (2009) have determined that the RIT elements are limited to bacterial strains of the alpha- and beta-proteobacteria. However, this is a very broad distinction. Further study of the RIT elements, their functions and their specific host ranges must be carried out in order to determine their specific relationships with other genes.

8. Search for all the catabolic genes, remnants of the RIT element, “junkyard” region and other mobile elements in OLGA172’s related pristine strains that have stopped degrading 3CBA.


We cannot say definitively whether the genes responsible for 3CBA degradation carried by OLGA172 and its pristine soil isolate relatives preceded the presence of the xenobiotic compounds or evolved in response to them, i.e. whether the genes are ancestral to the many recently evolved pathways that include chlorocatechol degradation. However, there is very strong evidence of past movement of these genes in OLGA172. This includes the presence of the RIT element and partial remnants of mobile genetic elements that are not normally found on the chromosomal scaffold surrounding the catabolic region. This could mean that these genes were laterally transferred into OLGA172 from a donor strain. However, it could also demonstrate that the genes have the capacity to be transferred out, or to a conjugative element that would allow widespread transfer. If the chlorocatechol genes were found independent of any mobile genetic elements, in roughly the same chromosomal location and in a similar organization in a number of pristine relatives, this would provide stronger evidence that these Burkholderia strains have evolved these genes over time in response to a widespread natural selection factor.

While the larger question remains unanswered, this work has uncovered a number of interesting things. The most important of these is probably the presence in the OLGA172 genome of RIT elements that are conserved across several other species. This raises important questions about the role of these elements in the mobility of the associated genes, their range, and the associated mechanisms.


References

Achaz, G., Coissac, E., Netter, P., and Rocha, E. P. C. (2003). Associations between inverted repeats and the structural evolution of bacterial genomes. Genetics. 164: 1279-1289.

Achaz, G., Rocha, E. P. C., Netter, P., and Coissac, E. (2002). Origin and fate of repeats in bacteria. Nucl. Acid. Res. 30 (13): 2987-2994.

Andersen, M., Lie, E., Derocher, A. E., Belikov, S. E., Bernhoft, A., Boltunov, A. N., Garner, G. W., Skaare, J. U., and Wiig, Ø. (2001). Geographic variation of PCB congeners in polar bears (Ursus maritimus) from Svalbard east to the Chukchi Sea. Polar Biol. 24 (4): 231-238.

Anson, E., and Myers, E. (1997). ReAligner: A program for refining DNA sequence multi-alignments. J. Comp. Biol. 4 (3): 369-383.

Azuma, Y., Hosoyama, A., Matsutani, M., Furuya, N., Horikawa, H., Harada, T., Hirakawa, H., Kuhara, S., Matsushita, K., Fujita, N., and Shirai, M. (2009). Whole-genome analyses reveal genetic instability of Acetobacter pasteurianus. Nucl. Acid. Res. 37 (17): 5768-83.

Barrie, A., Gregor, B., Hargrave, C., Lake, D., Muir, E., Shearer, F., Tracey, G., and Bidleman, A. (1992). Arctic contaminants: sources, occurrence and pathways. Sci. Total Environ. 122: 1-74.

Brodsky, L. I., Ivanov, V. V., Kalaydzidis, Ya. L., Leontovich, A. M., and Nikolaev, V. K. (1995). GeneBee-NET: Internet-based server for analyzing biopolymers structure. Biochem. 60 (8): 923-928.

Buchrieser, C., Brosch, R., Bach, S., Guiyoule, A., and Carniel, E. (1998). The high-pathogenicity island of Yersinia pseudotuberculosis can be inserted into any of the three chromosomal asn tRNA genes. Mol. Microbiol. 30: 965-978.

Burrus, B., and Waldor, M. (2004). Shaping bacterial genomes with integrative and conjugative elements. Res. Microbiol. 155: 376-385.

Campbell, A. (1992). Minireview: Chromosomal insertion sites for phages and plasmids. J. Bacteriol. 174 (23): 7495-7499.

Cavalca, L., Hartmann, A., Rouard, N., and Soulas, G. (1999). Diversity of tfdC genes: distribution and polymorphism among 2,4-dichlorophenoxyacetic acid degrading soil bacteria. FEMS Microbiol. Ecol. 29: 45-58.


Chain, P., Denef, V., Konstantinidis, K., Vergez, L., Agullo, L., Reyes, V., Hauser, L., Cordova, M., Gomez, L., Gonzalez, M., Land, M., Lao, V., Larimer, F., LiPuma, J., Mahenthiralingam, E., Malfatti, S., Marx, C., Parnell, J., Ramette, A., Richardson, P., Seeger, M., Smith, D., Spilker, T., Sul, W., Tsoi, T., Ulrich, K., Zhulin, I., and Tiedje, J. (2006). Burkholderia xenovorans LB400 harbors a multi-replicon, 9.73-Mbp genome shaped for versatility. Proc. Natl. Acad. Sci. 103 (42): 15280-15287.

Cheng, C., Kussie, P., Pavletich, N., and Shuman, S. (1998). Conservation of structure and mechanism between eukaryotic topoisomerase I and site-specific recombinases. Cell. 92: 841-850.

Chow, W. Y., Wang, C. K., Lee, W. L., Kung, S. S., and Wu, Y. M. (1995). Molecular characterization of a deletion-prone region of plasmid pAE1 of Alcaligenes eutrophus H1. J. Bacteriol. 177 (14): 4157-4161.

De Jong, E., Field, J. A., Spinnler, H.-E., Wijnberg, J. B. P. A., and de Bont, J. A. M. (1994). Significant biogenesis of chlorinated aromatics by fungi in natural environments. Appl. Environ. Microbiol. 60: 264-270.

Diehl, F., Li, M., He, Y., Kinzler, K. W., Vogelstein, B., and Dressman, D. (2006). BEAMing: single-molecule PCR on microparticles in water-in-oil emulsions. Nat. Methods. 3 (7): 551-559.

Fava, F., Gioia, D., and Marchetti, L. (1993). Characterization of a pigment produced by Pseudomonas fluorescens during 3-chlorobenzoate co-metabolism. Chemosphere. 27 (5): 825-835.

Frost, L., Leplae, R., Summers, A., and Toussaint, A. (2005). Mobile genetic elements: The agents of open source evolution. Nat. Rev. Microbiol. 3: 722-732.

Fulthorpe, R. R., McGowan, C., Maltseva, V., Holben, W. E., and Tiedje, J. M. (1995). 2,4-Dichlorophenoxyacetic acid-degrading bacteria contain mosaics of catabolic genes. Appl. Environ. Microbiol. 61 (9): 3274-3281.

Fulthorpe, R. R., Rhodes, A. N., and Tiedje, J. M. (1998). High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Appl. Environ. Microbiol. 64 (5): 1620-1627.

Fulthorpe, R. R., and Top, E. M. (2009). Evolution of new catabolic functions through gene assembly by mobile genetic elements. Springer-Verlag, Berlin Heidelberg.

Ghosal, D., You, I. S., Chatterjee, D. K., and Chakrabarty, A. M. (1985). Microbial degradation of halogenated compounds. Science. 228 (4696): 135-228.

Godde, J. S., and Bickerton, A. (2006). The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J. Mol. Evol. 62: 718-729.


Goordial, J. (2009). Characterization of a novel 3-chlorobenzoate degrading bacterium isolated from a pristine environment; Burkholderia phytofirmans str. OLGA172. Thesis not yet complete.

Guo, H., and Xiong, J. (2006). A specific and versatile genome walking technique. Gene. 381 : 18-23.

Guttman, D. (2009). Computational genomics and bioinformatics – JBZ1492. Lecture Material. University of Toronto.

Hacker, J., and Carniel, E. (2001). Ecological fitness, genomic islands and bacterial pathogenicity. A Darwinian view of the evolution of microbes. EMBO Rep. 2 : 376-381.

Han, G. G., Shiga, Y., Tobe, T., Sasakawa, C., and Ohtsubo, E. (2001). Structural and functional characterization of IS679 and IS66-family elements. J. Bacteriol. 183 (14): 4296-4304.

Hernandez, D., Francois, P., Farinelli, L., Osteras, M., and Schrenzel, J. (2008). De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res. 18 : 802-809.

Hickman, A. B., Waninger, S., Scocca, J. J., and Dyda, F. (1997). Molecular organization in site-specific recombination: the catalytic domain of bacteriophage HP1 integrase at 2.7 Å resolution. Cell. 89 : 227-237.

Van Houdt, R., Monchy, S., Leys, N., and Mergeay, M. (2009). New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrences in other bacteria. Antonie van Leeuwenhoek. 96 : 205-226.

Huson, D. H., and Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Molec. Biol. Evol. 23 (2):254-267.

Innis, M. A., Myambo, K. B., Gelfand, D. H., Brow, M. D. (1988). DNA sequencing with Thermus aquaticus DNA polymerase and direct sequencing of polymerase chain reaction-amplified DNA. Proc. Natl. Acad. Sci. USA. 85 : 9436-9440.

Kato, K. (2009). Impact of the next generation DNA sequencers. Int. J. Clin. Exp. Med. 2 : 193-202.

Klarmann, G. J., Schauber, C. A., and Preston, B. D. (1993). Template-directed pausing of DNA synthesis by HIV-1 reverse transcriptase during polymerization of HIV-1 sequences in vitro. J. Biol. Chem. 268 (13): 9793-9802.

Kwon, H. J., Tirumalai, R., Landy, A., and Ellenberger, T. (1997) Flexibility in DNA recombination: structure of lambda integrase catalytic core. Science. 276 : 126-131.

Kornberg, A., and Baker, T. (1992). DNA replication. 2nd Edition. Chapter 2: Repair, recombination, transformation, restriction and modification.


Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J., and Higgins, D. G. (2007). Clustal W and Clustal X version 2.0. Bioinformatics. 23 : 2947-2948.

Laemmli, C. M., Leveau, J. H. J., Zehnder, A. J. B., and van der Meer, J. R. (2000). Characterization of a second tfd gene cluster for chlorophenol and chlorocatechol metabolism on plasmid pJP4 in Ralstonia eutropha JMP134 (pJP4). J. Bacteriol. 182 (15): 4165-4172.

Laemmli, C., Werlen, C., and van der Meer, J. R. (2004). Mutation analysis of the different tfd genes for degradation of chloroaromatic compounds in Ralstonia eutropha JMP134. Arch. Microbiol. 181 : 112-121.

Lawrence, J. G., and Roth, J. R. (1996). Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics. 143 : 1843-1860.

Leander, M., Vallaeys, T., and Fulthorpe, R. (1998). Amplification of putative chlorocatechol dioxygenase gene fragments from alpha- and beta-proteobacteria. Can. J. Microbiol. 44 : 482-486.

Levano-Garcia, J., Verjovski-Almeida, S., and da Silva, C. R. (2005). Mapping transposon insertion sites by touchdown PCR and hybrid degenerate primers. BioTech. 38 (2): 225-229.

Liu, S., Ogawa, N., Senda, T., Hasebe, A., and Miyashita, K. (2005). Amino acids in positions 48, 52, and 73 differentiate the substrate specificities of the highly homologous chlorocatechol-1,2-dioxygenases cbnA and tcbC. J. Bacteriol. 187 (15): 5427-5436.

Liu, Y., and Whittier, R. F. (1995). Thermal Asymmetric Interlaced PCR: Automatable amplification and sequencing of insert end fragments from P1 and YAC clones for chromosome walking. Genomics. 25 : 674-681.

MacLean, D., Jones, J. D. G., and Studholme, D. J. (2009). Application of 'next-generation' sequencing technologies to microbial genetics. Nat. Rev. Microbiol. 7 : 287-296.

Mahillon, J., and Chandler, M. (1998) Insertion sequences. Microbiol. & Mol. Biol. Rev. 62 (3): 725-774.

McGowan, C., Fulthorpe, R., Wright, A., and Tiedje, J. M. (1998). Evidence for interspecies gene transfer in the evolution of 2,4-dichlorophenoxyacetic acid degraders. Appl. Environ. Microbiol. 64 (10): 4089-4092.

McMurray, C. T. (1999). DNA secondary structure: a common and causative factor for expansion in human disease. Proc. Natl. Acad. Sci. USA. 96 : 1823-1825.


Néron B., Ménager, H., Maufrais, C., Joly, N., Maupetit, J., Letort, S., Carrere, S., Tuffery, P., and Letondal, C. (2009). Mobyle: a new full web bioinformatics framework. Bioinform. 25 (22): 3005–3011.

Nunes-Duby, S., Kwon, H. J., Tirumalai, R. S., Ellenberger, T., and Landy, A. (1998). Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucl. Acids. Res. 26 (2): 391-406.

Ochman, H., Lawrence, J., Groisman, E. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature. 405 : 299-304.

Ogawa, N., Miyashita, K. (1995). Recombination of a 3-chlorobenzoate catabolic plasmid from Alcaligenes eutrophus NH9 mediated by direct repeat elements. Appl. and Environ. Microbiol. 61 (11): 3788-3795.

Perez-Pantoja, D., Guzman, L., Manzano, M., Pieper, D. H., and Gonzalez, B. (2000). Role of tfdC(I)D(I)E(I)F(I) and tfdD(II)C(II)E(II)F(II) gene modules in catabolism of 3-chlorobenzoate by Ralstonia eutropha JMP134 (pJP4). Appl. Environ. Microbiol. 66 (4): 1602-1608.

Francisco, P. B., Jr., Ogawa, N., Suzuki, K., and Miyashita, K. (2001). The chlorobenzoate dioxygenase genes of Burkholderia sp. strain NK8 involved in the catabolism of chlorobenzoates. Microbiol. 147 : 121-133.

Plumeier, I., Perez-Pantoja, D., Heim, S., Gonzalez, B., and Pieper, D. H. (2002). Importance of different tfd genes for degradation of chloroaromatics by Ralstonia eutropha JMP134. J. Bacteriol. 184 (15): 4054-4064.

Pop, M. (2009). Genome assembly reborn: recent computational challenges. Brief. Bioinform. 10 (4): 354-366.

Ravatn, R., Studer, S., Zehnder, A. J. B., and van der Meer, J. R. (1998b). Int-B13, an unusual site-specific recombinase of the bacteriophage P4 integrase family, is responsible for chromosomal insertion of the 105-kilobase clc element of Pseudomonas sp. strain B13. J. Bacteriol. 180 (21): 5505-5514.

Ravatn, R., Zehnder, A. J. B., and van der Meer, J. R. (1998a). Low-frequency horizontal transfer of an element containing the chlorocatechol degradation genes from Pseudomonas sp. strain B13 to Pseudomonas putida F1 and to indigenous bacteria in laboratory-scale activated-sludge microcosms. Appl. Environ. Microbiol. 64 : 2126-2132.

Rocha, E. P. C. (2006). Inference and analysis of the relative stability of bacterial chromosomes. Mol. Biol. Evol. 23 (3): 513-522

Rocha, E. P. C., Danchin, A., and Viari, A. (1999). Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol. Biol. Evol. 16 (9): 1219-1230.


Ronaghi, M., Nygren, M., Lundeberg, J., and Nyren, P. (1999). Analyses of secondary structures in DNA by pyrosequencing. Anal. Biochem. 267: 65-71.

Rothberg, J. M., and Leamon, J. H. (2008). The development and impact of 454 sequencing. Nature Biotech. 26 (10): 1117-1124.

Schlomann, M. (1994). Evolution of chlorocatechol catabolic pathways. Biodeg. 5 : 301-321.

Schwartz, E., Henne, A., Cramm, R., Eitinger, T., Friedrich, B., and Gottschalk, G. (2003). Complete nucleotide sequence of pHG1: A Ralstonia eutropha H16 megaplasmid encoding key enzymes of H2-based lithoautotrophy and anaerobiosis. J. Mol. Biol. 332 : 369-383.

Sentchilo, V., Zehnder, A. J. B., and van der Meer, J. R. (2003). Characterization of two alternative promoters for integrase expression in the clc genomic island of Pseudomonas sp. strain B13. Molec. Microbiol. 49 (1): 93-104.

Sessitsch, A., Coenye, T., Sturz, A. V., Vandamme, P., Ait Barka, E., Salles, J. F., Van Elsas, J. D., Faure, D., Reiter, B., Glick, B. R., Wang-Pruski, G., and Nowak, J. (2005). Burkholderia phytofirmans sp. nov., a novel plant-associated bacterium with plant-beneficial properties. Intern. J. Sys. Evol. Microbiol. 55 : 1187-1192.

Solyanikova, I. P., and Golovleva, L.A. (2004). Bacterial degradation of chlorophenols: pathways, biochemical, and genetic aspects. J. Environ. Sci. Health. B. 39 (3): 333-51.

Tan, H. M. (1999). Bacterial catabolic transposons. Appl. Microbiol. Biotech. 51 : 1-12.

Trefault, N., Clement, P., Manzano, M., Pieper, D. H., and Gonzalez, B. (2002). The copy number of the catabolic plasmid pJP4 affects growth of Ralstonia eutropha JMP134 (pJP4) on 3-chlorobenzoate. FEMS Micro Lett. 202 : 95-100.

Tsuda, M., Tan, H. M., Nishi, A., and Furukawa, K. (1999). Mobile catabolic genes in bacteria. J. Biosci. Bioeng. 87 (4): 401-410.

Van der Meer, J. R., and Sentchilo, V. (2003). Genomic islands and the evolution of catabolic pathways in bacteria. Curr. Opin. Biotech. 14 : 248-254.

Van der Meer, J. R., Ravatn, R., and Sentchilo, V. (2001). The clc element of Pseudomonas sp. strain B13 and other mobile degradative elements employing phage-like integrases. Arch. Microbiol. 175 : 79-85.

Versalovic, J., Schneider, M., De Bruijn, F. J., and Lupski, J. R. (1994). Genomic fingerprinting of bacteria using repetitive sequence-based polymerase chain reaction. Appl. Environ. Microbiol. 62 (7): 2621-2628.

Wagner, A. (2006) Cooperation is fleeting in the world of transposable elements. PLoS Comp. Biol. 2(12): 1522-1529


Weisshaar, M. P., Franklin, C. H., and Reineke, W. (1987). Molecular cloning and expression of the 3-chlorobenzoate-degrading genes from Pseudomonas sp. strain B13. J. Bacteriol. 169 (1): 394-402.

Wells, R. D. (1996). Molecular basis of genetic instability of triplet repeats. J. Biol. Chem. 271(6): 2875-2878.

Wyndham, R. C., Cashore, A. E., Nakatsu, C. H., and Peel, M. C. (1994). Catabolic transposons. Biodeg. 5 : 323-342.


Appendices

6 Appendix

6.1 Total genome homology

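#!/usr/bin/perl
use strict;
use warnings;

# this program adds up the base pairs that match in a blastall output file
# (assumes tab-delimited tabular output; lines starting with "#" are comments)

my $count = 0;

print "what is the name of the data file? ";
chomp(my $source = <STDIN>);
open(my $cont, "<", $source) or die "Could not open data\n";

# name file for output
print "what name do you want for the output file? ";
chomp(my $results = <STDIN>);
open(my $output, ">", $results) or die "Could not open output file\n";

while (defined(my $line = <$cont>)) {
    next if $line =~ m/^#/;            # skip comment lines
    chomp $line;
    my @info = split /\t/, $line;      # split the tab-delimited fields
    next unless defined $info[3] && $info[3] =~ /^\d+$/;
    $count = $count + $info[3];        # fourth field: alignment length
}

# report the running total once all hits have been read
print $output "$count\n";
print "$count\n";
close($cont);
close($output);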


6.2 GC content calculation

#!/usr/bin/perl
use strict;
use warnings;

# this program finds the GC content of genes in a fasta format file

# identify file with input sequence file
print "what is the name of the fasta file? ";
chomp(my $source = <STDIN>);
open(my $cont, "<", $source) or die "Could not open sequences\n";

print "what name do you want for the output file? ";
chomp(my $results = <STDIN>);
open(my $output, ">", $results) or die "Could not open output file\n";

$/ = ">";                                   # read one fasta record at a time
while (defined(my $line = <$cont>)) {
    chomp $line;                            # drop the trailing ">" separator
    next unless $line =~ /\S/;              # skip the empty record before the first ">"
    my $nodel = index($line, "\n");         # determine where first line stops
    next if $nodel < 0;                     # header without any sequence
    my $node = substr($line, 0, $nodel);    # take the first line for node name
    my $seq  = substr($line, $nodel + 1);   # define sequence stretch
    $seq =~ s/\s//g;                        # remove line characters
    $seq =~ s/N//g;                         # remove enns
    my $newtot = length($seq);              # determine sequence length
    next if $newtot == 0;                   # avoid dividing by zero
    my $count = 0;
    while ($seq =~ m/[CGcg]/g) {            # count G and C bases
        $count = $count + 1;
    }
    my $GC = ($count * 100) / $newtot;
    print $output ">$node\t$GC\n";
    print ">$node\t$GC\n";
}
close($cont);
close($output);


6.3 Total number of N’s in Solexa sequencing data

#!/usr/bin/perl
use strict;
use warnings;

# this program counts Ns in sequencing data

# declare variables
my $count = 0;

# identify file with input sequence file
print "what is the name of the fasta file? ";
chomp(my $source = <STDIN>);
open(my $cont, "<", $source) or die "Could not open sequences\n";

while (defined(my $line = <$cont>)) {
    next if $line =~ m/^>/;     # skip header lines so Ns in sequence names are not counted
    while ($line =~ m/N/g) {
        $count++;
    }
}
close($cont);

print "total number of Ns = $count\n";


6.4 Multiple sequence alignment of the RIT elements found in OLGA172 (RIT BphO1), CH34, H16, and H1.

pHG1   ------
2pHG1  ------
pAE1   TTATGCCGCGTCGTGCACGCGCATGACGGTGGGTATGGCCTTTCATGATCGCTGTGTCGG
CH34   ------
OLGA   ------

pHG1 ------ATGAATGC 2pHG1 ------ATGAATGC pAE1 CATAAACATATGATGTCTCAGGCTGGCTAATGCGCTGGCCAGGAGAGATCAGATGAATGC CH34 ------ATGAATGC OLGA ------ATGAAGGC ***** ** pHG1 AACACGCACAGTCAGCGAATCCGGCGGGCTGCCGGCCCATCACATCGATGCATTTCTTGA 2pHG1 AACACGCACAGTCAGCGAATCCGGCGGGCTGCCGGCCCATCACATCGATGCATTTCTTGA pAE1 AACACGCACAGTCAGCGAATCCGGCGGGCTGCCGGCCCATCACATCGATGCATTTCTTGA CH34 GACACGCACAGTCAGCGAATCCGGCGGGCTACCGGCCCATCACATCGATGCATTTCTTGA OLGA GATACACACAGTCAGCGCATCGGGCGGGCTGCCCGCCCGTCACATTGATACATTCCTTGA * ** *********** *** ******** ** **** ****** *** **** ***** pHG1 TCGTCTACGGACGGCACACTATTCCGAGGTATCGCTTCGCAAGAAACGAAGAGTCCTGTG 2pHG1 TCGTCTACGGACGGCACACTATTCCGAGGTATCGCTTCGCAAGAAACGAAGAGTCCTGTG pAE1 TCGTCTACGGACGGCACACTATTCCGAGGTATCGCTTCGCAAGAAACGAAGAGTCCTGTG CH34 TCGTCTACGGACGGCTCACTACTCCGAGGTATCGCTTCGCAAGAAACGAAGAGTCCTGTG OLGA TCGTCTACGGACGGCACGTTATTCCGAAGTAACGCTTCTCAAGAAACGAAGGGTTTTGTC *************** * ** ***** *** ****** ************ ** *** pHG1 CGTGTTCTCCGGGTGGATGAAGAACAGGAACATTGACCTGATCGATCTCGATGAGTCTGT 2pHG1 CGTGTTCTCCGGGTGGATGAAGAACAGGAACATTGACCTGATCGATCTCGATGAGTCTGT pAE1 CGTGTTCTCCGGGTGGATGAAGAACAGGAACATTGACCTGATCGATCTCGATGAGTCTGT CH34 TGCGTTCTCTGGGTGGATGAAGCACAGGAACATCGACCTGATTGACCTCGATGAGTCTGT OLGA TGCCTTCTCTCGGTGGATGAAGAGCACGAACGTCGGACTGGTTGACCTTGATGAGTCCGC * ***** *********** ** **** * * *** * ** ** ******** * pHG1 CACGGCTCGTTTTATGAATCGCATGATCGACGCTTCAAGAGACCGCGTCCAGCGTGCGCG 2pHG1 CACGGCTCGTTTTATGAATCGCATGATCGACGCTTCAAGAGACCGCGTCCAGCGTGCGCG pAE1 CACGGCTCGTTTTATGAATCGCATGATCGACGCTTCAAGAGACCGCGTCCAGCGTGCGCG CH34 CACGGCTCGTTTTATGAAGCGCATGATCGACGCATCACGAGACCGCGTCCAGCGTGCTCG OLGA TACGGCTTGTTTCACGGAGCGCCTGACCGACGCTCCTGAAGCGCGTGTTCAGTTCGAGCT ****** **** * * * *** *** ****** * ** ** ** *** * * pHG1 TCCCACCTTACGGCAGTTTCTTGCCTATCTGCGCGCGGAAGCCATTGTGTGTTCGCCGAC 2pHG1 TCCCACCTTACGGCAGTTTCTTGCCTATCTGCGCGCGGAAGCCATTGTGTGTTCGCCGAC pAE1 TCCCACCTTACGGCAGTTTCTTGCCTATCTGCGCGCGGAAGCCATTGTGTGTTCGCCGAC CH34 TCCCACATTGCGGCAGTTTCTCGCCTATCTGCGCGCGGAAGCCATTGTGTGTTCGCCGAC OLGA 
TGCCGTGTTACGGTCGTTTCTTGCCTATCTGCGTGACGAAGCCATCGTGCTCTCGTCGAC * ** ** *** ****** *********** * ******** *** *** ****

85 pHG1 GTTGGGCGGCCAGTCCGAGATTGCGCGCATTTATCGCCGATACCTGGACCATCTGAGGCA 2pHG1 GTTGGGCGGCCAGTCCGAGATTGCGCGCATTTATCGCCGATACCTGGACCATCTGAGGCA pAE1 GTTGGGCGGCCAGTCCGAGATTGCGCGCATTTATCGCCGATACCTGGACCATCTGAGGCA CH34 ATTAGGCAGCCAGTCCGCGATCGCGCACACTTATCGCCGATACTTGGACTATCTGAGGCA OLGA ATTGGGCGACCAATCTGCGATCACCCATATCTACGAGCGGTACCTGGACTACCTGCGGCA ** *** *** ** * *** * * * ** ** *** ***** * *** **** pHG1 GGATCGTGGACTCGCGAAGAACTCGCTGCTCGTCTACGGCCCCTTCATTCGCGACTTTCT 2pHG1 GGATCGTGGACTCGCGAAGAACTCGCTGCTCGTCTACGGCCCCTTCATTCGCGACTTTCT pAE1 GGATCGTGGACTCGCGAAGAACTCGCTGCTCGTCTACGGCCCCTTCATTCGCGACTTTCT CH34 GGATCGTGGACTCGCGAAGAACTCTCTGCTCGTCTACGGCCCGTTCATTCGCGACTTTCT OLGA GGATCGTGGACTTGCGAAGAACTCGGTGCTCGTCTACGGGCCCTTCATTCGCGACTTCCT ************ *********** ************* ** ************** ** pHG1 CGACAGCCACTCGGCCAACGACGGAACGATATTGGCAGATGCATTTTGCGCCGTAACGAT 2pHG1 CGACAGCCACTCGGCCAACGACGGAACGATATTGGCAGATGCATTTTGCGCCGTAACGAT pAE1 CGACAGCCACTCGGCCAACGACGGAACGATATTGGCAGATGCATTTTGCGCCGTAACGAT CH34 CGACAGTCACTCGGCCGGCGACGGAAGTTTATTGCCAGATGCATTTGACGCGGTAACGAT OLGA CAACAGCCAGGATGTCGGCGACGGCGACATATTGCCCGATGCATTCGACGCCATGACGAT * **** ** * * ****** ***** * ******** *** * ***** pHG1 CCGAGATCATTTCCTTACCTACAGCGAAGGTCGATCGGCGGAGTACACGCGGCTGATGGC 2pHG1 CCGAGATCATTTCCTTACCTACAGCGAAGGTCGATCGGCGGAGTACACGCGGCTGATGGC pAE1 CCGAGATCATTTCCTTACCTACAGCGAAGGTCGATCGGCGGAGTACACGCGGCTGATGGC CH34 CCGGAATCATCTTCTTGCCCGCAGCAAAGGCCGATCGGCGGAATACACGCGGCTGATGGC OLGA CCGGAATCACATCCTTACCCGCAGCAAAGGCCGGTCGGCGGAGTACACGCGGCTGATGAC *** **** * *** ** **** **** ** ******** *************** * pHG1 AGTTGCGCTTCGCTCGTTCTGCCATTTCCTCTTTCTGCGCGGCGATACGGCCCGAGACCT 2pHG1 AGTTGCGCTTCGCTCGTTCTGCCATTTCCTCTTTCTGCGCGGCGATACGGCCCGAGACCT pAE1 AGTTGCGCTTCGCTCGTTCTGCCATTTCCTCTTTCTGCGCGGCGATACGGCCCGAGACCT CH34 AGTTGCGCTTCGCTCGTTCTGCCATTTCCTCTTTCTGCGTGGCGATACGGCCCGAGACCT OLGA GGTGGCCCTTCGCTCGTTTTGCCATTTCCTCTTTCTGCATGGCGAGACGGCACGAGACCT ** ** *********** ******************* ***** ***** ******** pHG1 
GTATGAGTCAGTGCCGTCAGTTCGTAAGTGGCGACAGTCAACTGTACCGACGTTCCTCAC 2pHG1 GTATGAGTCAGTGCCGTCAGTTCGTAAGTGGCGACAGTCAACTGTACCGACGTTCCTCAC pAE1 GTATGAGTCAGTGCCGTCAGTTCGTAAGTGGCGACAGTCAACTGTACCGACGTTCCTCAC CH34 GGCTGGGTCAGTGCCCTCGGTTCGTAAGTGGCGACAGTCGACTGTGCCGACGTTCCTCAC OLGA GTATGAGTCGGTGCCTTCGGTTCGCAAGTGGCGGCAGTCAACTGTGCCAACGTTCCTCAC * ** *** ***** ** ***** ******** ***** ***** ** *********** pHG1 GCCTGAGCAGCAAGAAGCTCTCATTGCGTCCGCAGACCGGTCGACTCCGACTGGGCGCCG 2pHG1 GCCTGAGCAGCAAGAAGCTCTCATTGCGTCCGCAGACCGGTCGACTCCGACTGGGCGCCG pAE1 GCCTGAGCAGCAAGAAGCTCTCATTGCGTCCGCAGACCGGTCGACTCCGACTGGGCGCCG CH34 GCCTGAGCAGCAAGAAGCTCTCATTGCATCTGCAGACCGGTCGACTCCGACTGGGCTCCG OLGA GCCTGAGCAGGAAGAGGTCCTGATTGCAACTGCTGATCGGTCGACTCCACGCGGGAGCCG ********** **** * ** ***** * ** ** *********** *** *** pHG1 TGACTACGCAATCCTGCTGTTGTTGGCGCGGCTCGGTCTACGTGCCGGAGAAATAGTTGC 2pHG1 TGACTACGCAATCCTGCTGTTGTTGGCGCGGCTCGGTCTACGTGCCGGAGAAATAGTTGC pAE1 TGACTACGCAATCCTGCTGTTGTTGGCGCGGCTCGGTCTACGTGCCGGAGAAATAGTTGC CH34 TGACTATGCAATCCTGCTGTTGTTGGCGAGGCTCGGCTTACGTGCCGGAGAAATCATCGA OLGA TGATTATGCCGTCCTGCTGTTGCTGGCGCGGCTCGGTTTGCGTGCCGGAGAGATCGTCGC *** ** ** *********** ***** ******* * *********** ** * *

86 pHG1 CATGCAGCTCGACGACATTCACTGGCGTTCGGGGGAACTCGTCGTTCATGGCAAGGGGCA 2pHG1 CATGCAGCTCGACGACATTCACTGGCGTTCGGGGGAACTCGTCGTTCATGGCAAGGGGCA pAE1 CATGCAGCTCGACGACATTCACTGGCGTTCGGGGGAACTCGTCGTTCATGGCAAGGGGCA CH34 GATCGAGCTCGACGACATTCACTGGCGTTCGGGGGAACTCGTCGTTCATGGCAAGGGGCA OLGA GCTTGAACTAGGCGACATCCACTGGCGTTCCGGAGAACTCGTCGTTCATGGTAAGGGGCA * * ** * ****** *********** ** ***************** ******** pHG1 GATGGTGGAGCACGTCCCCCTGCCATCGGAGGTCGGAGCAGCAATCGCTACATATCTCCG 2pHG1 GATGGTGGAGCACGTCCCCCTGCCATCGGAGGTCGGAGCAGCAATCGCTACATATCTCCG pAE1 GATGGTGGAGCACGTCCCCCTGCCATCGGAGGTCGGAGCAGCAATCGCTACATATCTCCG CH34 AATGGTCGAGCACGTGCCGCTCTCATCGGAGGTTGGAGCAGCAATCGCAACATATCTCCG OLGA AATGGTCGAGCATCTCCCGCTGCCATCCGAGGTTGGAGAGGCGATTGCCATGTACCTTCG ***** ***** * ** ** **** ***** **** ** ** ** * ** ** ** pHG1 CGATGGTCGCGGAGCAAGTGCATCGCGGCACGTATTCCTTCGTAGATTGGCACCTCGGGT 2pHG1 CGATGGTCGCGGAGCAAGTGCATCGCGGCACGTATTCCTTCGTAGATTGGCACCTCGGGT pAE1 CGATGGTCGCGGAGCAAGTGCATCGCGGCACGTATTCCTTCGTAGATTGGCACCTCGGGT CH34 CGATGGTCGCGGAGCGAGTGCATCGCGTCGCGTCTTCCTTCGCAGATTGGCGCCTCGAGT OLGA CGACGATCGAGGTGCGAGCGCATCGCGACGGGTCTTCCTTCGCATGTGGGCACCGCGCGT *** * *** ** ** ** ******** * ** ******** * * *** ** ** ** pHG1 TGGTCTGGCGGGACCGGCGGCGATTGGCAAGATTGTTTGTCAGGCCTTCGCACGTGCAGG 2pHG1 TGGTCTGGCGGGACCGGCGGCGATTGGCAAGATTGTTTGTCAGGCCTTCGCACGTGCAGG pAE1 TGGTCTGGCGGGACCGGCGGCGATTGGCAAGATTGTTTGTCAGGCCTTCGCACGTGCAGG CH34 TGGTTTGGCGGGCCCGGCGGCGATTGGCAAGATTGTTTGTCAGGCCTTCGCACGTGTTGG OLGA CGGTCTGGCGGGACCGGCGGCGATTGGCCACATTGTTCGTCTGGCTTTCGCTCGTGCCGG *** ******* *************** * ****** *** *** ***** **** ** pHG1 TTTCCGCCCCGCGTGCCGTGGTTCCGCACATCTGTTCCGTCACGGTCTGGCGACGACGAT 2pHG1 TTTCCGCCCCGCGTGCCGTGGTTCCGCACATCTGTTCCGTCACGGTCTGGCGACGACGAT pAE1 TTTCCGCCCCGCGTGCCGTGGTTCCGCACATCTGTTCCGTCACGGTCTGGCGACGACGAT CH34 TTTCCGCCCCGCGTGCAGGGGCGCTGCACATCTGTTCCGTCACGGTCTGGCGACGACGAT OLGA ATTCCGTCCCGCGTGCCGTGGCGCCGCGCATCTATTCCGCCACGGTCTGGCGACGACGAT ***** ********* * ** * ** ***** ***** ******************** pHG1 
GATTCGCCACGGGGCCTCGATCGCAGAAATAGCTGAGGTCTTGCGGCACCGCTCACCGGA 2pHG1 GATTCGCCACGGGGCCTCGATCGCAGAAATAGCTGAGGTCTTGCGGCACCGCTCACCGGA pAE1 GATTCGCCACGGGGCCTCGATCGCAGAAATAGCTGAGGTCTTGCGGCACCGCTCACCGGA CH34 GATTCGCCACGGGGCCTCGATGGCAGAAATCGCCGAGGTCTTGCGGCACCGCTCACCCGA OLGA GATTCGCCATGGTGCGTCGATCGCGGAAATCGCTGAGGTCTTACGGCACCGCTCACAGGA ********* ** ** ***** ** ***** ** ******** ************* ** pHG1 CAGTACCGCGATCTATGCAAAGGTCGCGTTCGAGGACCT-GCGCGGGG-TCGCGCGCTCG 2pHG1 CAGTACCGCGATCTATGCAAAGGTCGCGTTCGAGGACCT-GCGCGGGG-TCGCGCGCTCG pAE1 CAGTACCGCGATCTATGCAAAGGTCGCGTTCGAGGACCTCGCTCGGGGGTCGCGCGCTCG CH34 CAGTACCGCGATCTACGCAAAGGTCGCGTTTGAGGACCT-GCGCGGTG-TCGCGCGTTCG OLGA CAGTACCGCGATCTATGCAAAGGTTGCGTTTGAGGATCT-GCGCAGGG-TAGCACGCCCG *************** ******** ***** ***** ** ** * * * * ** ** ** pHG1 TGGCCCACGGCGGGAGGTGCAATATGACTTCGATCCGTGACTCCCTCGCTCGGTACGTGG 2pHG1 TGGCCCACGGCGGGAGGTGCAATATGACTTCGATCCGTGACTCCCTCGCTCGGTACGTGG pAE1 TGGCCCACGGCGGGAGGTGCAATATGACTTCGATCCGTGACTCCCTCGCTCGGTACGTGG CH34 TGGCCCACGGCAGGAGGTGCAATATGACTGCGATCCACGAGTCTCTCGCCCAGTACGTGG OLGA TGGCCCACGACGGGAGGTGCAATATGACTGCGATCCGTGACTCCCTCGCTCGGTACGTCG ********* * ***************** ****** ** ** ***** * ****** *

87 pHG1 CGGTCCGCCGCGCTCTCGGGGCATCATTCTATGAACCTGCATTGGCACTCGGTCATTTCG 2pHG1 CGGTCCGCCGCGCTCTCGGGGCATCATTCTATGAACCTGCATTGGCACTCGGTCATTTCG pAE1 CGGTCCGCCGCGCTCTCGGGGCATCATTCTATGAACCTGCATTGGCACTCGGTCATTTCG CH34 CAGTCCGCCGTGCTCTCGGGGCGTCATTCTATGAGCCTGCGTTGGCACTCGGTCATTTCG OLGA CAGTGCGCCGGGCGCTCGGGGCGAAATTCTATGAACCTGCATTGGCACTCGGTCATTTCG * ** ***** ** ******** ********* ***** ******************* pHG1 TTGATCTTCTGGAACATGAAGACGCAGAGTTCATTACTACCGATCTGGCTCTGCGCTGGG 2pHG1 TTGATCTTCTGGAACATGAAGACGCAGAGTTCATTACTACCGATCTGGCTCTGCGCTGGG pAE1 TTGATCTTCTGGAACATGAAGACGCAGAGTTCATTACTACCGATCTGGCTCTGCGCTGGG CH34 TTGATCTGCTGGAACGCGAAGGCGCCGAGTTCATTACAACCGATCTGGCTCTGCGCTGGG OLGA TTGATCTTCTGGAACACGAAGGTGCCGAGTTCATCACCACCGATCTGGCTCTTCGCTGGG ******* ******* **** ** ******** ** ************** ******* pHG1 CGATGACGCCCGCACTCGTCGAACGCGCCACCTGGGGGCGGCGCCTCTCTCAAGTGAGAG 2pHG1 CGATGACGCCCGCACTCGTCGAACGCGCCACCTGGGGGCGGCGCCTCTCTCAAGTGAGAG pAE1 CGATGACGCCCGCACTCGTCGAACGCGCCACCTGGGGGCGGCGCCTCTCTCAAGTGAGAG CH34 CGACGACGCCCGAACTCGTCGAACGCGCTACGTGGGGGCGGCGCCTCTCTCAAGTGAGAG OLGA CGACGACGCCCGTACTCGTCGAACGCGCTACCTGGGGGCGGCGTCTCTCCCAGGTGAGAG *** ******** *************** ** *********** ***** ** ******* pHG1 GATTCGCCAGATGGATGAACGTCATTGACGGTCGAAACCAGATTCCTCCAGCAGGACTCC 2pHG1 GATTCGCCAGATGGATGAACGTCATTGACGGTCGAAACCAGATTCCTCCAGCAGGACTCC pAE1 GATTCGCCAGATGGATGAACGTCATTGACGGTCGAAACCAGATTCCTCCAGCAGGACTCC CH34 GATTTGCCAGATGGATGAACGTGATTGACAGTCGAAACCAGATTCCTCCAGCAGGACTCC OLGA GATTCGCCAAGTGGATGAACGCCATCGACAGTCGGAATGAGATTCCTCCAGCAGGACTCC **** **** ********** ** *** **** ** ********************* pHG1 TGAGTGCCCGCAGACGGCGCAATGCCCCGCATATTTACACGGAGCAGGAAATTGATCTGC 2pHG1 TGAGTGCCCGCAGACGGCGCAATGCCCCGCATATTTACACGGAGCAGGAAATTGATCTGC pAE1 TGAGTGCCCGCAGACGGCGCAATGCCCCGCATATTTACACGGAGCAGGAAATTGATCTGC CH34 TGAGTGCCCGCAGACGGCGCAACGCCCCGCATATTTACACTGAGCAGGAAATTGATCGGC OLGA TGAGTGCCCGCCGACGGCGCAATCCTCCGCATATTTACACGGAACAGGAAATTACCCTGC *********** ********** * ************** ** ********* * ** pHG1 
TTATGACCCGCGCCGCTCAACTGCGATCCCGAACCGGCATGCGAGCACTGACCTATTCGA 2pHG1 TTATGACCCGCGCCGCTCAACTGCGATCCCGAACCGGCATGCGAGCACTGACCTATTCGA pAE1 TTATGACCCGCGCCGCTCAACTGCGATCCCGAACCGGCATGCGAGCACTGACCTATTCGA CH34 TTATGACCCGCGCCTCTCAACTGCGATCCCGAACTGGCATCCGAGCACTGACCTATTCGA OLGA TCATGACTCATGCTGCGCGGCTACGCTCGCGCACAGGCTTGCGAGCACTGGCCTATACGA * ***** * ** * * ** ** ** ** ** *** * ********* ***** *** pHG1 CGCTCATCGGGCTTCTTGTAGCGACGGGCCTCAGGCCAGGCGAAGCGCTTCGGCTCGACC 2pHG1 CGCTCATCGGGCTTCTTGTAGCGACGGGCCTCAGGCCAGGCGAAGCGCTTCGGCTCGACC pAE1 CGCTCATCGGGCTTCTTGTAGCGACGG-CCTCAGGCCAGGCGAAGCGCTTCGGCTCGACC CH34 CGCTCATAGGGCTTCTTGTAGCGACGGGCCTCAGGCCAGGCGAGGCGCTCCGGCTCGACC OLGA CGCTCATCGGGCTTCTCGCAGCGACCGGCCTCAGACCAGGCGAAGCCCTTTCGCTCGACC ******* ******** * ****** * ****** ******** ** ** ******** pHG1 GGTCCGACGTTGACCTCGTCAGCGGGATACTCTCCATCCGGGAATCGAAGTTCGGCAAAT 2pHG1 GGTCCGACGTTGACCTCGTCAGCGGGATACTCTCCATCCGGGAATCGAAGTTCGGCAAAT pAE1 GGTCCGACGTTGACCTCGTCAGCGGGATACTCTCCATCCGGGAATCGAAGTTCGGCAAAT CH34 GGTCCGACGTTGACCTCGTCAGCGGGATACTCTCCATTCGGGAATCGAAGTTCGGCAAAT OLGA GGTGCGACGTTGATCTCGTGAACGGAATACTCTCCGTTCGGGAATCGAAGTTCGGCAAAT *** ********* ***** * *** ********* * **********************

88 pHG1 CGCGCTTTGTTCCTGTAGCAGAGTCGACCCGGGTGGCACTCGAACACTATGCCAAGAAAC 2pHG1 CGCGCTTTGTTCCTGTAGCAGAGTCGACCCGGGTGGCACTCGAACACTATGCCAAGAAAC pAE1 CGCGCTTTGTTCCTGTAGCAGAGTCGACCCGGGTGGCACTCGAACACTATGCCAAGAAAC CH34 CCCGCTTTGTTCCAGTAGCAGAGTCTTCCCGGGTGGCGCTCGAACACTATGCCCGGAAAC OLGA CGCGCTTTGTTCCTGTCGAAGAGTCGACCCGCGAGGCACTCGAACGCTACGCGCAGAGTC * *********** ** * ****** **** * *** ******* *** ** ** * pHG1 GCGATCAACTCTGTCCTTCACGATTGAGCGAGGCGTTCCTGGTTAGTGAGCGCGGCAAGC 2pHG1 GCGATCAACTCTGTCCTTCACGATTGAGCGAGGCGTTCCTGGTTAGTGAGCGCGGCAAGC pAE1 GCGATCAACTCTGTCCTTCACGATTGAGCGAGGCGTTCCTGGTTAGTGAGCGCGGCAAGC CH34 GCGATCAACTCTGTCCTGTACGATTGAGCGAAGCGTTCCTGGTTAGTGAGCACGGCAAGC OLGA GCGACCAACTCTGCCCTCTACGGGTGAGTGAAGCGTTCCTGGTGGGTGAGCGTGGTATTA **** ******** *** *** **** ** *********** ****** ** * pHG1 GATTGAAGGCCGGAACTGCACGAAGCATGTTCGTCAGAATGTCGCGCGCTGTCGGTCTGC 2pHG1 GATTGAAGGCCGGAACTGCACGAAGCATGTTCGTCAGAATGTCGCGCGCTGTCGGTCTGC pAE1 GATTGAAGGCCGGAACTGCACGAAGCATGTTCGTCAGAATGTCGCGCGCTGTCGGTCTGC CH34 GACTGAAGGCAGGCACTGCACGAAGCATGTTCGTCAGAATGTCGCGCGCTGTTGGTCTGC OLGA GACTGAACGCCAGCGCTGTGCGCAACATGTTCGTCAGAATGTCGCGTGCGGTCGGTCTAC ** **** ** * *** ** * ********************* ** ** ***** * pHG1 GATCGGCGACAGAGGATGGGCGCGATGGTTACGGCCCGCGCCTCCAGGACTTCCGGCATA 2pHG1 GATCGGCGACAGAGGATGGGCGCGATGGTTACGGCCCGCGCCTCCAGGACTTCCGGCATA pAE1 GATCGGCGACAGAGGATGGGCGCGATGGTTACGGCCCGCGCCTCCAGGACTTCCGGCATA CH34 GATCGGCGACAGAGGGCGGGCGCGATGGTTACGGCCCGCGCCTGCAGGACTTCCGGCATA OLGA GGCCAGCGACAAAGGATGGCCGGGCAGGTTACGGCCCACGGCTCCAGGACTTTCGACACA * * ****** *** ** ** * *********** ** ** ******** ** ** * pHG1 GCTTCGCGACGGGAAGGCTGGTCGAATGGTATCGCGCCGGTCTGGACGTAAGTCGAGAAC 2pHG1 GCTTCGCGACGGGAAGGCTGGTCGAATGGTATCGCGCCGGTCTGGACGTAAGTCGAGAAC pAE1 GCTTCGCGACGGGAAGGCTGGTCGAATGGTATCGCGCCGGTCTGGACGTAAGTCGAGAAC CH34 GCTTCGCGACTGGAAGACTGGTCGAATGGTATCGCGCCGGTCTGGACGTAAGTCGGGAAT OLGA GTTTCGCGACTGGAAGGCTGGTTGCATGGTATCGCGCCGGACTGGACGTGAGCCGGGAAT * ******** ***** ***** * *************** ******** ** ** *** pHG1 
TGCCGAAACTTGCCGCCTACCTCGGGCATGTCAACATCGGTCTTACGTACTGGTACATCG 2pHG1 TGCCGAAACTTGCCGCCTACCTCGGGCATGTCAACATCGGTCTTACGTACTGGTACATCG pAE1 TGCCGAAACTTGCCGCCTACCTCGGGCATGTCAACATCGGTCTTACGTACTGGTACATCG CH34 TGCCGAAACTTGCCGCCTACCTCGGGCATGTCAACATCGGCCTTACGTACTGGTATATCG OLGA TGCCGAAACTTGCCGCCTACCTTGGACATGTCAACGTTGGTCTTACGTACTGGTACATCG ********************** ** ********* * ** ************** **** pHG1 AAGCGGTTCCTGAGTTGCTTGAACTCGCGGCAGCCTATCTCGACAAGGACTGTCCGGGAG 2pHG1 AAGCGGTTCCTGAGTTGCTTGAACTCGCGGCAGCCTATCTCGACAAGGACTGTCCGGGAG pAE1 AAGCGGTTCCTGAGTTGCTTGAACTCGCGGCAGCCTATCTCGACAAGGACTGTCCGGGAG CH34 AATCGGTTCCTGAGTTGCTGGAACTCGCGGCAGCCTATCTCGACAAGGACTGTCCGGGAG OLGA AAGCGGTTCCGGAGTTGCTTGAACTTGCGGCAGGCTATCTCAGCAGGAACTGTCCGGGAG ** ******* ******** ***** ******* ******* ** * ************

Interruption of RIT by conserved hypothetic protein in pHG1 pHG1 AACGGCCGTGAATGAGATGCACTCCGCTGCGACCGGATGTAGTTGCCGTCCTGAAACAGT 2pHG1 AACGGCCGTGAATGAGATGCACTCCGCTGCGACCGGATGTAGTTGCCGTCCTGAAACAGT pAE1 AACGGCCGTGAGCGCCGCCAGCCTCCCAGC--CCTGGTTCAGC--GCTTCTTCACCCAGC CH34 AACGGACGTGAGCGCCGCCAGTCTCCCAGC--CCTCGTTCAGC--GCTTCTTCACCCAGC OLGA AGTGGCCATGAACGCCGCCGGCCTTCCATC--CCTCGTCCAGC--GTTTCTTTACCCAGC * ** * *** * * * ** * ** ** * * ***

89 pHG1 GGCTGCTGTATCAACCAGGCGAACCAGGTGATCCGGTCTTCCCCAGTTCGCGCGGT--GG 2pHG1 GGCTGCTGTATCAACCAGGCGAACCAGGTGATCCGGTCTTCCCCAGTTCGCGCGGT--GG pAE1 GCCTGCTCGAGCAGCAAGGT---CTGAGTTCGCATACGGTGGCAAGTTACCGTGACACGT CH34 GCCTGCTCGAGCAGCAAGGT---CTGAGTTCGCATACGGTGGCAAGTTACCGTGACACGT OLGA GCTTGCTCGAGCAGCAAGGT---CTGAGTTCGCACACGGTGGCGAGTTACCGCGACACGT * **** * ** * *** * ** * * * **** ** * * pHG1 TCATCTCAGTGCCGATGCTCTTCAGCGACTCGTTT---CGCGCAACGCTGAGATCGCGCG 2pHG1 TCATCTCAGTGCCGATGCTCTTCAGCGACTCGTTT---CGCGCAACGCTGAGATCGCGCG pAE1 TCCGGCTGCTGCTGGCGTTCGCCACTAAGCATATCGGGCGCGCGCCGTCAAAGCTCCGAA CH34 TCCGGCTGTTGCTGGCGTTCGCCACGAAGCATATAGGGCGCGCGCCGTCAAAGCTGCGAA OLGA TCCGGTTGCTGCTGGCGTTCGCCATGAAGCATATCGGGCGCGTGCCGTCAAAACTGCGGA ** *** * * ** ** * * **** ** * ** pHG1 CCT----CTCGTGCCCCTCGCTGAAGAAGAAAT--CAATAACGCCTCATACGCTTCGGCA 2pHG1 CCT----CTCGTGCCCCTCGCTGAAGAAGAAAT--CAATAACGCCTCATACGCTTCGGCA pAE1 TCGAGGACTTCGACGTGTCGTTGATCGAGGAATTCCTGCAGCACCTCGAACACGGCAGAG CH34 TCGAGGACTTCGACGTGTCGTTGATCGAGGAATTCCTGCAGCACCTCGAACACGGCAGAG OLGA TCGAAGACTTCGACGCGTCGTTGATCGAGAAATTCCTTCAGCACCTTGAACAGGACAGGG * ** * *** *** ** *** * * * *** ** * * pHG1 CTATATGCCCTTCCCGACTATGTCTTGTAATCGAAAGA-TAGCTCTCTATTTCATG-GGC 2pHG1 CTATATGCCCTTCCTGACTATATCGGGTAATGAGGGAT-TTGCCTCCGTGCTCGTCCGAT pAE1 GCA-ATTCTGTGCGCACACGCAACACGCGGCTTGCCGC-GTGCATGCC--TTCTTCCGGT CH34 GCA-ATTCGGTGCGCACACGCAACACGCGGCTTGCCGCCGTGCATGCC--TTCTTCCGGT OLGA GAA-ATTCAGTGCGGACACGTAATACGCGCCTCGCCGCTGTGCATGCC--TTCTTCCGGT * ** * * * * ** * ** * * pHG1 AAACGGATGC-ACGGAGATAGAGATAGTC------TTGACTCCTTCCTTCAGCCCCA- 2pHG1 GCGCACTTGTTGTGGAGCCATCCCCCGTCACTAC---CCGACATAGTTTCGTGGCCCTA- pAE1 TCGTCGCAGTCAGCGAGCCTGCGTTGTTTCTGCAGTGTCAGCGCATTCTTGCAATCCCAT CH34 TCGTCGCGGTCAGCGAGCCTGCGTTGTTTCTGCTGTGTCAGCGCATTCTTGCAATCCCAT OLGA TCGTCGCGGTCAGCGAGCCCGCGCTGTTCCTGCAATGTCAGCGCATCCTTGCGATTCCAT * *** * * * * pHG1 GAG--CATCCGGGAAG-----GAGTCCATCATGAGAT--ACACCA----TCCATGA---- 2pHG1 CAGTCCATTCAGGAAGCCGAGAAGTTCGTCTGCCGGCCGGTAGCG----TCCTGGCTTGC pAE1 
CCAAACGCTGCGAGCACGGCCCAGTTGAGTTTCTGACCGAGAGCGAGGCCGCCGCTCTGG CH34 CCAAACGCTGCGAGCACGGCCCGGTTGAGTTTCTGACCGAGAGCGAGGCCGCCGCTCTGG OLGA CCAAACGCTGTGAACACGGCCCCGTCGAGTTTCTCACCGAGAGCGAGGCCGCATCCCTGG * * ** * * * pHG1 ------2pHG1 CGTGAGCCAGAGATGTCTTCGCAAGAGCCTGCTCCTTC-ATCGCAAGTGTCGCTTCGAGA pAE1 TAGGGGCGCCTGACGTACGAAACTGGATCGGCAACCGCGACCGAACGCTGCTTCTTGTAG CH34 TAGGGGCGCCTGACGTACGAAACTGGATCGGCAACCGCGACCGAACGCTGCTTCTTGTAG OLGA TAGCTGCACCCGATGTACGAACATGGATCGGCAACCGCGACCGAACGCTGCTTCTTGTAG

pHG1   ------
2pHG1  TAGATCTGCGTGGTTTCCACCGATTCATGCCCGAGCCACAGTGCAATCACGGAACGATCA
pAE1   CGGTTCAAACGGGTCTGCGCAATAGCGAATTGACCGCACTCCGGCGTCAGGATGTGACGC
CH34   CGGTTCAAACGGGTCTGCGCAATAGCGAATTGACCGCACTCCGGCGCCAGGATGTGGCGC
OLGA   CGGTTCAAACGGGTCTGCGCAACAGCGAACTGACCTCTCTCAGGCGTCAGGATGTGGTGC

pHG1   ------
2pHG1  ACGCCCGCCTGCAGAAGGTCCATGGCCATCGTATGCCTCAAACGGTGAACGGTGACCTGT
pAE1   TCGGCACAGGCGCCCACGTTCGTTGCCTTGGCAAAGGCAGAAAGATGAGATGCACTCCGC
CH34   TCGGCACAGGCGCCCACGTTCGTTGCCTCGGCAAAGGCAGAAAGATGAGATGCACTCCGC
OLGA   TCGGCACAGGCGCCCACGTCCGTTGCCTCGGCAAAGGCAGAAAGATGCGATGCACTCCGC

pHG1   ------
2pHG1  TTCTGCTTCAACGATGGACAAATTTTTGA------
pAE1   TGCGACCGGATGTAGTTGCCGTCCTGAAACAGTGGCTGCTGTATCAACCAGGCGAACCAG
CH34   TGCGACCGGATGTAGTTGCCGTCCTGAAACAGTGGCTGCTGTATCAACCAGGCGAGCCAG
OLGA   TGCGACCGGATGTCGTTGCCGTCCTGAAAGAATGGCTACGGTATCAACCAGGCGAACCAG

pHG1   ------
2pHG1  ------
pAE1   GTGATCCGGTCTTCCCCAGTTCGCGCGGTGGTCATCTCAGTGCCGATGCTCTTCAGCGAC
CH34   GCGATCCGGTCTTCCCCAGTTCGCGCGGCGGTCATCTCAGTGCCGATGCTCTTCAGCAAC
OLGA   ATGATCCTGTCTTCCCCAGTTCGCGCGGCGGTCATCTCAGTGCCGATGCTCTTCAGCGAC

pHG1   ------
2pHG1  ------
pAE1   TCGTTTCGCGCAACGCTGAGATCGCGCGCCTCTCGTGCCCCTCGCTGAAGAAGAAATCAA
CH34   TCGTGTCGCGCAACGCCGAAACCGCGCGCCTCTCGTGCCCCTCGCTGAAGAAGAAATCAA
OLGA   TCGTGTCACGCAACGTCGAAATCGCGCGTCTCTCGTGCCCCTCGCTGAAGAAAAAATCAG

pHG1   ------
2pHG1  ------
pAE1   TAACGCCTCATACGCTTCGGCA------
CH34   TAACGCCTCATACGCTTCGACACACGGCCGCGATGAGCCTGATGCATCACGGCGTCGACC
OLGA   TAACGCCCCACACGCTTCGGCACGCGGCCGCGATGAGCCTGTTGCATCACGGCGTAGACC

pHG1   ------
2pHG1  ------
pAE1   ------
CH34   TGACCGTGATCGCACTCTGGCTCGGGCACGAGTCCTCCGAGACTACTCAGATCTATTTGC
OLGA   TGACCGTGATCGCGCTCTGGCTCGGACATGAATCATCTGAGACGACCCAGATCTACCTGC

pHG1   ------
2pHG1  ------
pAE1   ------
CH34   ACGCCGACATGCAGCTCAAGGAACGCGCGCTCGCGCACGCGACGGCGAGTGGTGTTGCAC
OLGA   ACGCCGACATGCGGCTCAAGGAACGCGCGCTTGCGCACGCCAATGCGAGCGGCATTGCGC

pHG1   ------
2pHG1  ------
pAE1   ------
CH34   CGACACGCTACAAACCTCCAGATCCGTTGCTCGCCTTCCTGGAGGCCCTCTGA------
OLGA   CGACGCGTTACAAACCTCCCGACCCCTTGCTTGCCTTCCTGGAGGGCCTTTGATAATGCC

pHG1   ------
2pHG1  ------
pAE1   ------
CH34   ------
OLGA   GACAATCCGAGCGACCCGGAACAGGGATCCGCCCAACTGATGCGCCCGTGGCTTCGGGAC

pHG1   ------
2pHG1  ------
pAE1   ------
CH34   ------
OLGA   GCGGCATAATCCGGGAATCGGCATAAC
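As an illustration of how rows of the alignment above could be compared, the following is a minimal sketch in Perl, assuming two rows of equal aligned length are supplied as strings. The subroutine name percent_identity is hypothetical and is not part of the original analysis; columns in which either row carries a gap ("-") are skipped. The two example rows are the first aligned bases of CH34 and OLGA shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percent identity between two aligned rows, ignoring gapped columns.
# Hypothetical helper for illustration only; not taken from the thesis.
sub percent_identity {
    my ($row_x, $row_y) = @_;
    die "aligned rows must have equal length\n"
        unless length($row_x) == length($row_y);
    my ($same, $compared) = (0, 0);
    for my $i (0 .. length($row_x) - 1) {
        my $x = uc substr($row_x, $i, 1);
        my $y = uc substr($row_y, $i, 1);
        next if $x eq '-' || $y eq '-';    # skip columns with a gap
        $compared++;
        $same++ if $x eq $y;
    }
    # return 0 when no ungapped column was available for comparison
    return $compared ? 100 * $same / $compared : 0;
}

# first aligned bases of CH34 and OLGA from the alignment
my $ch34 = "ATGAATGC";
my $olga = "ATGAAGGC";
printf "%.1f%% identity\n", percent_identity($ch34, $olga);
```

Skipping gapped columns is only one possible convention; counting gaps as mismatches would give lower values for rows such as pHG1 that are interrupted.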


6.5 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying pristine soil isolates (Figure 4-4)

Pristine Strains | Origin | Environment | GenBank #
BTB1A | Saskatchewan | White spruce, poplar, birch | Leander et al (1998) and this study
BTI2 | " | " | "
BTK2 | " | " | "
CHU2 | California | Eucalyptus | "
CLAb3 | " | Oak, pine meadow | "
CLC3 | " | " | "
GE51 | Australia | Acacia, meadow | "
GED1 | " | " | "
HH12-4 | South Africa | Eucalyptus over renosterveld | "
HH44 | " | " | "
HH82 | " | " | "
HH83 | " | " | "
HHDI | " | " | "
LCG2 | Chile | Cryptocaria, acacia, lithraea, chusquea | "
MB16-3 | South Africa | Renosterveld | "
MBG3 | " | " | "
MR10-1 | " | " | "
NP131 | Saskatchewan | Jack pine, white spruce | "
NPC2 | " | " | "
OLGA172 | Russia | Spruce, birch | AY168634
PMPI | South Africa | Fynbos | Leander et al (1998) and this study
R1131 | Russia | Spruce, birch | "
R2181 | " | " | "
R2191 | " | " | "
R2381 | " | " | "
R261 | " | " | "
R3321 | " | Pine, spruce, birch | "
R4121 | " | " | "
RC13-1 | Chile | Acacia, cryptocaria, lithraea | "
WG14-4 | South Africa | Fynbos | "
WGH1 | " | " | "
WGM1 | " | " | "
WK112 | Saskatchewan | Jack pine, birch, white spruce | "
WK33 | " | " | "
WV1 | " | White spruce, poplar, jack pine, birch | "
WV151 | Saskatchewan | White spruce, poplar, jack pine, birch | "
WV71 | " | " | "


6.6 Chlorocatechol-1,2-dioxygenase and/or chloromuconate cycloisomerase carrying strains available in GenBank (Figure 4-4)

Strain                                    Origin           Environment                                              GenBank #
A. xylosoxidans denitrificans EST4002     Estonia          N/A                                                      AY540995
Achromobacter xylosoxidans plasmid pA81   Czech Republic   Soil contaminated with PCBs                              AJ515144
Acidovorax sp. JS773                      N/A              N/A                                                      DQ146633
Acidovorax sp. JS776                      "                "                                                        DQ146634
Acidovorax sp. JS777                      "                "                                                        DQ146635
Alcaligenes sp. CS1 plasmid pCS1          Devon, UK        Agricultural soils                                       AF235015
Alcaligenes sp. CS3 plasmid pCS3          "                "                                                        AF235014
Alcaligenes paradoxus JMP133              Australia        "                                                        AF041364
Alcaligenes sp. NyZ215                    China            Activated sludge in a nitrophenol-manufacturing factory  EF544605
B. xenovorans LB400                       New York         Landfill                                                 CP000270.1
Burkholderia cepacia                      Japan            onion rot                                                AF029344
Burkholderia cepacia strain WZ1           N/A              Pesticide manufactory soil                               EU586138
Burkholderia sp. 3CB-1                    N/A              N/A                                                      DQ146637
Burkholderia sp. C308                     Japan            paddy fields - 2,4-D contaminated soil                   AB212983
Burkholderia sp. Ff54                     "                "                                                        AB212988
Burkholderia sp. H801                     "                "                                                        AB212987
Burkholderia sp. K301                     "                "                                                        AB212974
Burkholderia sp. M701                     "                "                                                        AB212986
Burkholderia sp. NK8 plasmid pNK8         "                3CBA enriched soil                                       AB050198
Burkholderia sp. RASC                     "                "                                                        AF043451
Burkholderia sp. T201                     "                "                                                        AB212984
Burkholderia sp. T201                     "                "                                                        AB212984
Burkholderia sp. T301                     "                "                                                        AB212985
Burkholderia sp. Y212                     "                "                                                        AB212982
C. necator JMP134                         Australia        Agriculture
Comamonas acidovorans strain MC1          N/A              Herbicide contaminated building rubble                   AF077917
Defluvibacter lusatiensis                 "                Industrial wastewater                                    AJ536297
Delftia acidovorans                       "                organochlorine contamination                             AY078159
Delftia acidovorans strain CA28           Austria          chloroaniline treated soil                               DQ146631
Pandoraea pnomenusa                       N/A              chlorobenzene contaminated soil                          EF600715
Pseudomonas aeruginosa                    "                N/A                                                      AF164958
Pseudomonas aeruginosa strain J5-2        "                "                                                        EF111021
Pseudomonas chlororaphis                  "                "                                                        AJ132716
Pseudomonas chlororaphis RW71             "                pesticide/1,2,3,4-tetrachlorobenzene contaminated soil   AJ271325
Pseudomonas nitroreducens                 Japan            oil brine                                                EF108314
Pseudomonas putida                        Illinois, USA    polluted creek                                           AJ617740
Pseudomonas sp. GT241-1                   China            2,4-D producing factory drainage                         AY493510
Ralstonia solanacearum strain IPO1609     Netherlands      Potato                                                   CU914168
Ralstonia solanacearum strain MolK2       Philippines      banana                                                   CU695238
Ralstonia sp. CS2 plasmid pCS2            Devon, UK        Agricultural soils                                       AF235016
Ralstonia sp. G                           N/A              N/A                                                      AF077916
Ralstonia sp. I502                        Japan            paddy fields - 2,4-D contaminated soil                   AB212979
Ralstonia sp. JS704                       "                "                                                        DQ146622
Ralstonia sp. K101                        "                "                                                        AB212973
Ralstonia sp. K401 (modI)                 "                "                                                        AB212975
Ralstonia sp. K401 (modII)                "                "                                                        AB212975
Ralstonia sp. T101                        "                "                                                        AB212978
Ralstonia sp. Y103                        "                "                                                        AB212981
Rhodococcus opacus                        N/A              2,4-D contaminated site                                  AF003948
Rhodococcus opacus strain 1CP             "                "                                                        DQ146627
Rhodococcus sp. UFZ-B518                  "                "                                                        DQ146628
Rhodococcus sp. UFZ-B521                  "                "                                                        DQ146629
Rhodococcus sp. UFZ-B528                  "                "                                                        DQ146630
Rhodoferax sp. P230                       "                Herbicide contaminated building rubble                   AF176243
soil bacterium C1CL                       Dijon, France    grassland soil, no prev. appl. of 2,4-D                  AF047032
Sphingomonas herbicidovorans (dccAI)      N/A              dichloroprop-degrading soil column                       AJ628862
Sphingomonas herbicidovorans (dccAII)     "                "                                                        AJ628863
Sphingomonas sp. I602                     Japan            paddy fields - 2,4-D contaminated soil                   AB212989
Sphingomonas sp. tfd44                    Montana          Herbicide wastewater                                     AY598949
Uncultured bacterium                      N/A              N/A                                                      AB478351
Uncultured soil bacterium PLAE6           Citeaux, France  no prev. appl. of 2,4-D                                  AF035159
Variovorax paradoxus                      Korea            Greenhouse soil                                          AF044314