Recombinase in Trio (RIT) Elements in Bacterial Genomes: Assessing the Distribution and Mobility of a Novel yet Widespread Set of Mobile Genes.

by

Nicole Dorothy Ricker

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Physical and Environmental Sciences University of Toronto Scarborough

© Copyright by Nicole Ricker 2016


Abstract

The research performed over the course of my doctoral training outlines the environmental distribution, mobility, expression and potential role of a newly described family of mobile elements, and provides valuable information on the challenges and potential benefits of environmental metagenomics. Sequencing technologies have evolved considerably over the course of this work, and evaluating the limitations and opportunities provided by these evolving technologies has formed a significant portion of my thesis work. The remainder of the work has been dedicated to understanding the distribution and mechanisms of Recombinase in Trio (RIT) elements, a previously underappreciated mobile element found in a large diversity of strains, but predominantly in non-pathogenic bacteria. RIT elements contain three tyrosine-based site-specific recombinases and display a characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van Houdt et al. 2006; Van Houdt et al. 2012; Ricker et al. 2013). RIT elements have been postulated to be mobile due to the occurrence of multiple identical copies within individual genomes, and are commonly found on plasmids and in genomic islands, including plant symbiosis and catabolic islands. The ability of RIT elements to excise and relocate themselves was tested using a variety of mating experiments. Although the determination of a potential target site sequence was initially elusive, the discovery that the RIT element also included a 20 bp palindrome adjacent to one of the terminal inverted repeats allowed for the alignment of the target genes and revealed the original target site sequence. Subsequently, RIT element mobility was observed during conjugation and the transformants analyzed provided some insight into the mechanism of recombination. Finally, environmental sampling was performed on Southern Ontario streams in order to develop a methodology for evaluating the mobilome of bacterial communities.

Acknowledgments

No great achievement is accomplished without having a thousand people to thank. It would be impossible to list all the people that have helped, supported and encouraged me over the years and I hope that you truly understand my gratitude for each and every one of you. I want to especially thank my outstanding supervisor, Roberta Fulthorpe, for all of your amazing mentorship over the past 6 years. You have provided me with encouragement and support when I felt unsure, clarity and direction when I was muddled, and a firm kick when I was stalled. Not to mention physical labour and beautiful lake scenery for balance, and the insight to recognize an amazing opportunity when it came knocking. I could not ask for a better supervisor for my PhD, or a better mentor for my career. I would also like to thank my committee members (Don Jackson and William Navarre) for their outstanding insight and recommendations throughout the project, as well as their patience and encouragement.

To my husband Toby – you have been my rock throughout my PhD and have done so much more than I ever could have asked of you. From sampler design, creation and installation to learning site specific recombination mechanisms and moving to Belgium (twice!), you’ve put in the blood, sweat and tears of this PhD and I am truly blessed to have such a wonderful partner in my life. Thanks also to my Mom for her endless support including hopping on a plane last minute to help with the first move to Belgium – and for helping to make sure I didn’t fall apart once we got there. Thanks to my Dad for assisting with all the reference site samplers and reminding me why I’m in this field by constantly making me defend science; and to my siblings for keeping me grounded while also reminding me that I could do this.

My time at UTSC has been filled with amazing people and opportunities that I had never anticipated. I want to thank everyone at the Fulthorpe lab (past and present) for all your support and encouragement, and for putting up with endless talks about RIT elements. I especially have to thank Tony, Roxana and Rosemary for all your dedication and friendship. Last but not least, I am so grateful for having had the opportunity to work with Rob Van Houdt and Bernard Hallet, as well as Ann Provoost, Kristel Mijnendonckx and all the other members of the SCK•CEN and to the W. Garfield Weston Foundation for providing funding for this international collaboration.


Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Appendices

Chapter 1 Introduction

1.1 References

Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation

2 Horizontal Gene Transfer

2.1 Intracellular MGEs

2.2 Intercellular MGEs

2.3 Impact on Genome Evolution

2.4 References

Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution

3 Introduction

3.1 Methods

3.2 Results

3.2.1 Assembly Quality for Cupriavidus metallidurans CH34

3.2.2 Contigs terminate at repeated elements and mobile elements

3.2.3 Fragmentation is greatest at genomic island sites

3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains

3.2.5 Fragmentation Evident in Real Data

3.3 Discussion

3.4 Acknowledgements

3.5 References

Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements

4 Introduction

4.1 Methods

4.2 Results and Discussion

4.2.1 Abundance and Occurrence in Database

4.2.2 RIT Structure and Organization

4.2.3 Inferred RIT Functionality

4.2.4 Evidence for RIT Mobility Within Closely Related Strains

4.2.5 Similarities between RIT elements and evidence for broad distribution

4.2.6 RIT Classification

4.3 Conclusions

4.4 Acknowledgements

4.5 References

Chapter 5 The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as revealed through PacBio Single-Molecule Sequencing

5 Introduction

5.1 Materials and Methods

5.1.1 Short read NGS sequencing

5.1.2 PacBio Single Molecule Sequencing

5.1.3 Assembly of Short Read Technologies and PacBio corrected reads

5.1.4 Gene Annotation and Contig Validation

5.1.5 Comparisons to Related Finished Genomes

5.1.6 Large Plasmid Extraction

5.2 Results

5.2.1 Overall Genome Analysis

5.2.2 Biological consistency of the Assembly

5.2.3 Capacity of the PacBio Assembly for comparative studies

5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon

5.2.5 Limitations of the PacBio Assembly

5.3 Discussion

5.4 Acknowledgements

5.5 References

Chapter 6 Expression and Activity of RIT Elements

6 Introduction

6.1 Materials and Methods

6.1.1 Growth of Bacterial Strains

6.1.2 Construct creation

6.1.3 Mating-out Assays

6.1.4 Conjugation Experiments

6.1.5 Expression Experiments

6.2 Results

6.2.1 No evidence of Intra-cellular mobility without a target site

6.2.2 Target site identification

6.2.3 Sequencing analysis of transconjugants

6.2.4 Application of these Results to other RIT Elements

6.3 Discussion

6.4 References

Chapter 7 Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization

7 Introduction

7.1 Materials and Methods

7.1.1 Sampling locations and collection of benthic invertebrates

7.1.2 Sampler Design

7.1.3 Bacterial Community Assessment

7.1.4 Quantitative PCR

7.2 Results

7.2.1 Macroinvertebrate metrics of ecosystem health

7.2.2 Community diversity measures

7.2.3 Quantitative PCR

7.2.4 Correlations between bacterial communities and water quality parameters

7.2.5 Primer design specific to RIT elements

7.3 Discussion

7.3.1 Biomonitoring

7.3.2 Bacterial community assessment

7.4 Acknowledgements

7.5 References

Chapter 8 Conclusions and Future Directions

8 References

9 Appendix 1 Extra Tables

Appendix 2 Sampler Construction and Site Information

List of Tables

Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four replicons in C. metallidurans CH34 using Velvet and ABySS genome assembly software.

Table 3.2.2: Details on the terminal regions for 7 large contigs.

Table 3.2.3: Genomic islands found on chromosome 1 of CH34.

Table 3.2.4: Velvet assembly metrics of the 5 genomes compared.

Table 4.2.1: Summary of information of putative RIT elements found in this study.

Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted repeats.

Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons.

Table 5.2.2: Comparison of assembled genome of Burkholderia sp. str. OLGA172 with other closely related Burkholderia strains.

Table 6.1.1: List of strains used in this study.

Table 6.1.2: List of constructs created during this study.

Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG.

Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria containing RIT elements.

Table 7.1.1: Sampling locations for river assessments.

Table 7.1.2: Primers for quantitative PCR.

Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during this study.

Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time PCR.

Table 7.2.3: Water quality parameters for each site.

Table 7.2.4: Correlations of the bacterial communities to available water quality data.

List of Figures

Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C. metallidurans CH34.

Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing genomic islands in C. metallidurans CH34.

Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the genome) and three parameters thought to influence assembly quality.

Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig length, N50 and N50 as percent of longest replicon) and number of genomic islands as predicted by IslandViewer.

Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data (Salzberg et al. 2012).

Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the abundance of the same taxonomic grouping in the NCBI genome database.

Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families.

Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top) and Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives.

Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31.

Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT elements obtained in this study (B).

Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of RIT elements.

Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio sequencing.

Figure 5.2.2: Large plasmid extraction.

Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains.

Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str. OLGA172 and comparison to homologous regions of related strains.

Figure 6.1.1: Constructs used in the final conjugation experiment.

Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-K31A-C expression vectors.

Figure 6.2.2: PCR amplification using primers designed to amplify out from the kanamycin gene.

Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction of the target gene DUF1738.

Figure 6.2.4: Final experimental design.

Figure 6.2.5: Reversal of RIT element in positive transconjugants.

Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and pACYC-TSV1.

Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline resistance.

Figure 6.2.8: Sequencing results of co-integrate structure of clone I4.

Figure 6.3.1: Model for RIT element mobility based on experimental results.

Figure 7.1.1: Map of sampling locations.

Figure 7.1.2: Aquatic environment bacterial community samplers.

Figure 7.2.1: Lake Simcoe region samplers after retrieval.

Figure 7.2.2: Cluster analysis of T-RFLP data showing within sampler variation.

Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates.

Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community compositions revealed by 16S pyrosequencing data.

List of Appendices

Appendix 1: Extra tables

Table S1: Primers used in this study

Table S2: Dissolved oxygen values by month

Table S3: RIT elements determined to date

Appendix 2: Sampler construction and site information

Chapter 1 Introduction

My research relates to understanding the mechanisms of bacterial adaptation, and particularly how bacteria acquire and distribute genes through horizontal gene transfer (HGT). This is a topic that impacts every field of biology from ecology to medicine due to the ubiquity of bacteria and the range of diverse traits that they acquire through HGT, including pathogenesis, antibiotic resistance, root nodulation and xenobiotic compound degradation (Springael and Top, 2004; Frost et al. 2005; Siefert 2009). For this reason, understanding the mechanism and regulation of the genes involved in HGT provides universally applicable benefits. The goal of my graduate research has been to better characterize the genes involved in creating diversity within individual bacterial genomes, and to make progress towards investigating the effects that exposure to environmental pollutants has on the abundance and activity of mobile genetic elements (MGEs). This work was inspired by similar research into the distribution and expression of integrons (Wright et al. 2008; Koenig et al. 2009) and plasmids (Smalla and Sobecky, 2002; Springael and Top, 2004). To assist the reader, the next chapter (Chapter 2) summarizes our understanding of the range of MGEs found in bacteria and gives some details on their agents of mobility.

The main focus of my work has been devoted to understanding a previously uncharacterized set of mobility genes termed a Recombinase in Trio (RIT) element (Van Houdt et al. 2009; Ricker et al. 2013). At the start of my project, a former master’s student had recently discovered a recombinase in a chlorobenzoate degrader designated Burkholderia sp. str. OLGA172 (Jin, 2010) that later proved to be a RIT element. The Fulthorpe lab has studied this strain as the representative of a larger collection of chlorobenzoate degraders isolated from pristine sites during a biogeography survey (Fulthorpe et al. 1998). These pristine isolates are of particular interest since their chromosomally located chlorobenzoate degradation genes may be ancestral to widely disseminated plasmid-borne catabolic genes that are highly active in contaminated sites. In OLGA172, a RIT element was found lying just upstream from the catabolic genes and I had an interest in determining if it had a role in the movement of the catabolic operon, with a view to the larger interest of understanding the overall role of RIT elements and a possible link to the evolution of catabolic traits.


As next generation sequencing was becoming common at that time, OLGA172 was submitted for Illumina sequencing and subsequently for 454 sequencing in order to assemble the complete genome and provide context to the RIT element and adjacent catabolic genes. Unfortunately, all contigs containing the RIT element were disconnected because the element is present in multiple copies within the genome. The bioinformatic community has long acknowledged this technical drawback of short read technology, but its importance to assembly quality and our understanding of bacterial evolution had been underestimated. I document these issues in Chapter 3. I recognized the potential of longer read technologies in understanding our strain and submitted it for sequencing on the PacBio RSII platform. The longer reads allowed for the creation of a closed genome of OLGA172, which can subsequently be used to address specific questions regarding the role of RIT elements in the evolution of this strain. I detail the larger implications of the fundamental improvements achieved through the introduction of high throughput long read sequencing technologies in Chapter 5 using OLGA172 as an example. This chapter describes the closed genome of OLGA172 achieved using the PacBio sequencing technology, and compares the genomic context surrounding the catabolic genes (and RIT element) found in this strain with other fully sequenced relatives.

In Chapter 4, I discuss the distribution and organization of Recombinase in Trio (RIT) elements, a previously underappreciated mobile element found in a large diversity of strains, but predominantly in non-pathogenic bacteria. Based on in-depth in silico searching, I outline overall RIT element organization and distribution in currently sequenced genomes, and highlight individual strains harboring multiple identical copies of the same RIT element. Rob Van Houdt of the Belgian Nuclear Research Centre (SCK•CEN) was the first author on the original paper recognizing and naming the RIT elements (Van Houdt et al. 2009). On reading a poster abstract I published on the distribution of RIT elements, he contacted me and we established a collaboration. I travelled to Belgium for 3 months in 2011 and again for 9 months the following year after securing a fellowship in order to investigate the activity of RIT elements in his lab. The experimental evidence I gathered supporting the intracellular mobility of these elements is presented in Chapter 6.

At the outset of my PhD work my intention was to investigate the "mobilome" of bacterial communities exposed to low levels of environmental contamination. Other researchers have asserted that environmental pollutants are increasing the ‘evolvability’ of bacterial communities by increasing their capacity for horizontal gene transfer (Baquero, 2009; Gillings and Stokes, 2012). In order to properly address this question of innate evolvability within a bacterial community, I wanted to investigate the impact of environmental pollutants on the mobile elements themselves separately from co-selection by the resistance genes being mobilized. The cost of investigating an environmental ‘mobilome’ necessitates the prudent identification of appropriate sites to be characterized. Accordingly, I surveyed the suitability of several stream sites in Ontario for this kind of work and eventually designed and sampled several of them. I also examined in detail our current ability to quantify MGEs via various methods. Chapter 7 details this sampling strategy and the molecular characterizations I was able to perform on the bacterial communities, with several interesting results.

1.1 References

Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl.1):5-10.

Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. 2005 Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.

Fulthorpe, R. R., Rhodes, A. N., & Tiedje, J. M. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Applied and Environmental Microbiology, 64(5), 1620-1627.

Gillings, M. R., & Stokes, H. W. 2012. Are humans increasing bacterial evolvability?. Trends in ecology & evolution, 27(6), 346-352.

Jin S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172, M.Sc. Thesis (2010) Dept. Ecology and Evolutionary Biology, University of Toronto.

Koenig, J.E., C. Sharp, M. Dlutek, B. Curtis, M. Joss, Y. Boucher and W.F. Doolittle. 2009. Integron Gene Cassettes and Degradation of Compounds Associated with Industrial Waste: The Case of the Sydney Tar Ponds. PLOS One 4: 1-9.

Ricker, N., Qian, H. and Fulthorpe, R.R. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid. 70(2):226-239.

Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Smalla, K. and P.A. Sobecky. 2002. The prevalence and diversity of mobile genetic elements in bacterial communities of different environmental habitats: insights gained from different methodological approaches. FEMS Microbiol. Ecol. 42:165-175.

Springael, D. and E.M. Top. 2004. Horizontal gene transfer and microbial adaptation to xenobiotics: new types of mobile genetic elements and lessons from ecological studies. Trends in Microbiol. 12(2):53-58.

Van Houdt, R.V, S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.

Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2: 417-428.


Chapter 2 The Role of Mobile Genetic Elements in Prokaryotic Adaptation

2 Horizontal Gene Transfer

Bacterial evolution is a dynamic process involving gene mutation, inversion, exchange, deletion and acquisition of exogenous DNA (Snyder and Champness, 2007). Each of these processes varies in rate of occurrence and the scope of possible outcomes (Brüssow, 2008) and the selection pressures shaping these outcomes are applied from multiple levels – gene, group, population, or community. Horizontal gene transfer (HGT), also known as lateral gene transfer (LGT), has been shown to play an important role in the evolutionary dynamics at all of these levels, and is arguably the most important evolutionary mechanism working at the population and community levels. HGT allows individual genomes to remain compact while providing access to a larger pool of potentially beneficial genes maintained within the community (Darmon and Leach, 2014). The study of the genes involved in horizontal gene transfer (also known as the mobilome) has received substantial attention due to the increasing prevalence of antibiotic resistance. Antibiotics and antibiotic resistance genes are common contaminants from wastewater, agriculture and aquaculture (Perry and Wright, 2013; Gillings et al. 2015) and the abundance of individual resistance genes in soil environments is increasing over time (Knapp et al. 2010). Understanding the dynamics of gene movement within environmental communities is therefore fundamental to establishing how existing resistances will be disseminated and in anticipating sources of new resistance genes.

A mobile genetic element (MGE) is defined as any discrete segment of DNA that can move within or between genomes (Siefert, 2009) and is inclusive of plasmids, phages, integrative conjugative elements (ICEs), transposons and the myriad of smaller elements capable of inter- or intra-cellular movement (for a complete review, see Bellanger et al. 2014). The distinction between these different categories is often blurred for various reasons including the modular nature of mobile element evolution, and the enormous time scale at which these elements have been evolving (Lawrence and Hendrickson, 2008; Siguier et al. 2014). Although all of these elements fit into the classification of transposable elements (Toussaint and Merlin, 2002; Curcio and Derbyshire, 2003; Roberts et al., 2008), the term transposable element inherently suggests that the mobility of the elements is through transposition. Since transposition and site-specific recombination are fundamentally different processes (biochemically), it is preferable to use the more inclusive term of Mobile Genetic Elements (MGEs) to refer to the full spectrum of genes involved in HGT. The term genomic island is also sometimes seen as equivalent to transposable element; however, there are a variety of definitions for this particular term, many of which overlap with current definitions for other mobile elements. For the purposes of this thesis, the term genomic island will be used to refer to regions of a genome that are not shared with close relatives of the isolate, regardless of any evidence regarding current mobility.

There are three mechanisms for the acquisition of exogenous DNA into a bacterial cell. These are conjugation (formation of a junction between two cells for genetic exchange), transduction (movement of bacterial genes mediated by phage infection) and transformation (direct uptake of DNA from the surrounding environment by competent cells) (Olendzenski and Gogarten, 2009). On entry into a new cell, an MGE can be degraded by nucleases, maintained exogenously (in the case of most plasmids and some phages) or become integrated into the genome of the new organism through either homologous or illegitimate recombination (Lawrence and Retchless, 2009). There are also a number of mobile elements that integrate into the genome independently, in either a random or site-specific manner (Hallet et al. 2004; Siguier et al. 2014). The roles that MGEs fulfill in a bacterial genome are varied and poorly understood. They are most frequently studied for their role in the acquisition or dissemination of selectable traits such as antibiotic resistance, symbiosis, pathogenicity or catabolism. Many confer no such useful functions and are commonly considered a form of ‘selfish’ DNA. However, these elements can be a key component of genome flexibility as they mediate deletions and inversions both through their own activity and by providing homologous regions within the genome. Many MGEs have also been found to affect expression of surrounding genes through the presence of outward facing promoters or the production of molecules involved in regulation (Darmon and Leach, 2014). The presence of previously mobile elements (e.g. defective prophages) or partial remnants of MGEs can likewise impact bacterial adaptation by providing sites of homology or through in trans activity by intact elements. There are many categories into which MGEs can be sub-divided; however, for the purposes of this chapter they will be described according to the degree of their mobility.


This chapter aims to clarify the individual terms used for the different classes of mobile elements that are capable of independent movement, including both movement within a cell and movement between bacterial cells. This is in no way an exhaustive description of the elements involved in bacterial adaptation, and the reader is directed to recent excellent reviews for further information (Bellanger et al. 2014; Darmon and Leach, 2014; Siguier et al. 2014).

2.1 Intracellular MGEs

By definition, intracellular mobile elements are those capable of transfer to different locations in a chromosome or between different replicons within a bacterial cell (between multiple chromosomes or from a chromosome to a plasmid). These elements can only be transferred horizontally (between cells) when they become associated with the larger, self-transmissible elements described in section 2.2. The simplest form of MGE has traditionally been the insertion sequence (IS), although smaller non-autonomous mobile elements have recently been described. ISs generally range in size from 700-2500 bp, can facilitate their own movement and contain only the genes required for transposition (1-3 ORFs coding for transposase enzymes and regulatory genes) with flanking inverted repeats (Mahillon and Chandler, 1998; Siefert, 2009). Insertion sequences are grouped into individual families (www-is.biotoul.fr) based on several shared characteristics. The most important of these characteristics is similarity in the primary sequence of their encoded transposases (Siguier et al., 2014) but family members also share other features including the organization of open reading frames, target site preferences, and similarities in the length and sequence of both their short terminal inverted repeats and the direct repeats generated upon insertion (Siefert, 2009; Siguier et al. 2014). The majority of insertion sequences are mobilized by a DDE transposase (where DDE refers to the conserved Asp, Asp, Glu residues in the active site) and there are several large families of these enzymes that have been further divided into subgroups (Siguier et al. 2014). There are also several other transposase chemistries that have been identified including enzymes with a DEDD catalytic motif (related to Holliday junction resolvases) and the HUH (two histidine residues separated by a large hydrophobic residue) enzymes utilized by both IS200/IS605 and IS91 related elements (Siguier et al. 2015).
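The structural features described above (a size of roughly 700-2500 bp and flanking terminal inverted repeats) can be illustrated with a minimal Python sketch. This is not a method used in this thesis. The function names, the perfect-match requirement and the toy sequence are assumptions made for illustration only, and real terminal inverted repeats are often imperfect, so dedicated annotation tools should be preferred for actual analyses.

```python
# Minimal sketch: measure a perfect terminal inverted repeat (TIR) on a
# candidate element. All names here are hypothetical, for illustration only.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq: str, max_len: int = 50) -> int:
    """Length of the longest perfect TIR, capped at max_len.

    The left end of an element carries an inverted repeat of the right end
    when the first k bases equal the reverse complement of the last k bases.
    The reverse complement of the whole sequence starts with the reverse
    complement of the right end, so this reduces to a common-prefix search.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    k = 0
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k


def within_typical_is_size(seq: str, min_bp: int = 700, max_bp: int = 2500) -> bool:
    """Check the general IS size range quoted in the text (700-2500 bp)."""
    return min_bp <= len(seq) <= max_bp


if __name__ == "__main__":
    # Toy element: a 7 bp TIR flanking an arbitrary 20 bp interior.
    toy = "GGCATTC" + "A" * 20 + "GAATGCC"
    print(terminal_inverted_repeat_length(toy))
    print(within_typical_is_size(toy))
```

Requiring a perfect match keeps the sketch short; tolerating mismatches would need an alignment-based comparison of the two ends instead of a simple prefix scan.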

Transposons have traditionally been distinguished from ISs due to the presence of accessory genes (also called passenger genes or cargo) that serve purposes not related to transposition (Siguier et al. 2014). However, since related transposase enzymes have been found in both ISs and transposons, this naming system conflicts with the homology based families that have been defined. In addition, there are transposons that are created through the coordinated movement of two flanking ISs (composite transposons) and these are separate from the unit transposons that have a mobility gene at one end of the element. Unit transposons are sometimes mobilized through the action of a site-specific recombinase, and these enzymes are referred to as either transposases or recombinases depending on whether authors are referring to the MGE being mobilized (Siguier et al. 2015) or to the phylogenetic relatedness of the enzyme to other proteins (Carraro and Burrus, 2015). As we progress towards the metagenomic age it is preferable to group MGEs according to the phylogeny of the mobility enzyme since this is more amenable to computational analysis and also speaks to the actual mechanism of mobility. Forming families based on the homology of the mobility enzymes should still be viewed primarily as a ‘grouping by descent’ method rather than an attempt to define the limitations of individual families since the acquisition of accessory genes may not be a fixed feature of the family.

Many transposons encode separate integration and resolution systems, and the resolution mechanisms are commonly performed by site-specific recombinases. Site-specific recombinases (SSRs) can be divided into two unrelated families based on the use of either a tyrosine or a serine residue in the recombination event (Schumann, 2006). These enzymes are commonly used by bacteriophages to integrate their genomes into the host chromosome when the virus enters lysogeny, and many are also able to excise when circumstances dictate entry into the lytic phase (Hirano et al. 2011). However, members of both classes of SSRs have been found in a variety of recombination reactions involving viral and bacterial DNA, including integration, excision, inversion, control of plasmid copy number and movement of transposons (Nunes-Düby et al. 1998; Hallet et al. 2004; Mazel, 2006; Siguier et al. 2015).

In addition to the recombinases responsible for mobilizing phage genomes and transposons, integron integrases are a sub-family of tyrosine-based site-specific recombinases (TBSSRs) that have been found to be functionally distinct from all other characterized tyrosine recombinases (Mazel, 2006). Integrons consist of the integron integrase (the TBSSR), a primary recombination site (attI) and an outward-facing promoter, and are responsible for the acquisition of gene cassettes (individual genes that have an appropriate attC site for integration) in a non-disruptive and functional orientation (Hall and Collis, 1995; Mazel, 2006). The incorporation of gene cassettes in this manner allows for the immediate use of the newly acquired genes; however, integrons also serve as storage for additional genes, since the existing gene cassettes are maintained in an array which can be composed of hundreds of individual genes (Cambray et al. 2010; Domingues et al. 2012). Gene cassettes generally decrease in activity with distance from the primary promoter, but can be shuffled under stressful circumstances since the integron integrase is activated by the SOS response (Guerin et al. 2009). This provides a pool of potential genes that do not pose a transcriptional burden to the cell but are available if necessary (Cambray et al. 2011; Darmon and Leach, 2014).

Integrons have not been found to be mobile themselves, but are commonly mobilized by other MGEs (Collis et al. 2002; Hall and Collis, 1995; Mazel, 2006; Boucher et al. 2007). They are sometimes referred to as mobile genetic elements (or components of the mobilome) due to their role in horizontal gene transfer by integrating gene cassettes (Ragan and Beiko, 2009; Olschlager and Hacker, 2008; Taylor et al., 2011). Integron classes are created based on homology of the tyrosine recombinase and attI site, and not by the function of the genes associated with them, in recognition of the transient nature of these associations. Integrons are of great interest due to the unique adaptive capacity they provide, and are understandably among the best studied of the mobile element classes due to their strong association with antibiotic resistance genes.

In addition to direct impacts related to their mobility, ISs and transposons are also involved in gene activation and regulation and can promote genome rearrangements either directly or by providing scattered regions of homology (Curcio and Derbyshire, 2003). Recombination may also occur at homologous regions within transposable element sequences, resulting in greater diversity (Ling and Cordaux, 2010). In recent years it has become apparent that IS elements have a dramatic impact on genome evolution, ranging from inactivation and regulation of individual genes to the complete re-organization of genomes through IS expansion and subsequent genome streamlining (Siguier et al. 2014; Darmon and Leach, 2014).

2.2 Intercellular MGEs

Intercellular MGEs are similar to those described above except that, in addition to the genes required for integration/replication, they also carry all the genes necessary for facilitating their own movement between bacterial cells. Plasmids by classical definition are maintained independently of the chromosome within a cell, and are therefore distinguished from transposons since the latter are integrated into the host chromosome (Siefert, 2009). Both plasmids and certain types of transposons can move between cells by conjugation.

Plasmids are often thought of as small extra-chromosomal DNA elements that carry non-essential traits and are therefore easily lost when not needed. However, as the number of characterized genomes has increased, it has become clear that many bacteria maintain plasmids of considerable size (up to 2 Mb) and complexity. In some highly stressful environments up to 78% of culturable bacteria have been observed to carry plasmids, most of them large (>50 kb) (Fulthorpe et al. 1993). Presumably these plasmids carry traits that confer a fitness benefit outweighing the reproductive pressure to maintain small plasmid sizes, either by supplying access to a unique niche for the individual strain or through an increased adaptive potential inherent to the presence of the plasmid itself. These plasmids have indeed been found to contain niche-specific attributes such as symbiosis or catabolic pathways, and to be maintained as stable components of the genome (Konstantinidis and Tiedje, 2004). Moreover, it is now established that some replicons previously characterized as either megaplasmids (1-2 Mb) or second chromosomes are more accurately a combination of both elements. The term chromid has been used to describe second or third chromosomes that utilize a plasmid partitioning mechanism but contain genes essential to the survival of the cell (Harrison et al. 2010). In addition to megaplasmids and chromids, there have also been plasmids isolated that are capable of integrating into the chromosome of their host (Osborn and Boltner, 2002).

Bacteriophage represent some of the most abundant replicating genetic structures known, probably exceeding 10^29 in the ocean alone (Schumann, 2006). Lytic phage immediately commence phage production upon entry into the bacterial cell, resulting in lysis of the bacterial cell and extinction of that particular cell lineage. However, temperate phage, under favorable conditions, will instead integrate into the chromosome of the host bacterium and may be maintained for multiple generations until the phage is induced to enter its lytic lifestyle. Induction is often the result of chemical or nutritional stress threatening the survival of the host bacterium, but can also respond to a number of environmental triggers (Schumann, 2006). Occasionally, bacterial DNA is accidentally packaged into the phage in addition to, or instead of, phage DNA. This process, referred to as transduction, allows the bacterial DNA to be transferred to a new host and has been observed with a number of virulence and pathogenicity traits (Lima et al., 2008). Gene transfer agents (GTAs) are an extreme example of transduction in that these elements exclusively package random fragments of bacterial DNA into the phage capsid. Since there is no phage DNA, these capsids are not infective but are instead a genetically stable component of the bacterial genome (Lang and Beatty, 2007).

Integrated phages, termed prophages, are prevalent in many bacteria, averaging one per genome sequenced, and the advent of high throughput sequencing has provided a number of assembled bacteriophage genomes for analysis. Comparisons of these sequences have revealed that the current genomes are the products of extensive non-homologous recombination events, the result of both very frequent recombination between phage genomes and the enormity of the evolutionary time scale on which these events have been taking place (Hendrix and Casjens, 2008). The role of bacteriophage in HGT through transduction is well documented; however, the role they play in directly introducing beneficial genes as a means of ensuring vertical inheritance is less studied. Metagenomic analyses of phage communities have revealed that bacteriophage contain an unprecedented diversity of genetic sequences that are readily exchanged between different phage genomes and that are equally available to the bacterial hosts of these genomes due to the ease with which genes are transferred (Hendrix and Casjens, 2008).

Most transposons are not self-transferable between cells and are therefore covered in Section 2.1 on Intracellular MGEs. Some transposons however, including Tn916, have acquired genes that provide the capability of intercellular movement, and were therefore named ‘conjugative transposons’ (hence Tn916 is often referred to as CTn916). However, many of the conjugative transposons move entirely through the action of a site-specific recombinase instead of a canonical DDE transposase, leading to the re-classification of these elements as Integrative and Conjugative Elements (ICEs, also synonymous with the retired terminology of constin) (Rowland and Stark, 2005; Wozniak and Waldor, 2010). ICEs can be distinguished from many transposons by the non-random insertion mediated by the site-specific recombinase, and commonly form an excised circular intermediate that does not replicate autonomously prior to transfer to the recipient cell. However, many ICEs were originally defined as genomic islands and named according to the traits that they conferred (symbiosis islands, pathogenicity islands, etc.), and therefore the terms genomic island, conjugative transposon and ICE have been used interchangeably (Burrus et al. 2002; Juhas et al. 2009; Roberts et al. 2008; Wozniak and Waldor, 2010; Siguier et al. 2015). Inactivation of mobility genes or physical separation from the conjugation machinery can negate the intercellular mobility of a transposon, so that the role of the MGE in that cell differs from the role of its homologues in other cells. Studies have also revealed that MGEs can be mobilized in trans by other mobile elements in the cell, and that MGE resolution systems can rescue plasmid resolution functions (Hallet et al. 2004). This illustrates the interconnectedness of the different MGE categories.

2.3 Impact on Genome Evolution

Antibiotic resistance is arguably the largest human health concern of our century, as evidenced by a call from the World Health Organization that all governments should prepare a comprehensive national plan for surveillance and mitigation of antibiotic resistance (Leung et al. 2011). However, in addition to monitoring and limiting the distribution of current resistance genes in pathogens, it is becoming increasingly apparent that environmental reservoirs serve as an important source of available resistance genes that can be acquired by human pathogens (Finley et al. 2013; Forsberg et al. 2012; Pruden et al. 2012; Perry and Wright, 2013). This has led to a flood of studies investigating the presence of resistance genes in different environmental reservoirs, including pristine and/or ancient soils where antibiotic exposure could not have contributed to the observed resistance (Allen et al. 2009; D’Costa et al. 2011). Whether these resistance genes pose a tangible threat to human health depends on the ease with which they could be acquired by pathogens, and therefore it is no longer sufficient to investigate merely the presence of these genes in environmental samples. The context of resistance genes, including both the strains harboring them and their potential for mobility, has become the new focus of environmental studies on antibiotic resistance. Looking ahead, quantifying the likelihood of new combinations of mobile elements and resistance genes emerging from a given environment will require a greater understanding of the mobilome of different environments. This has previously been unfeasible; however, the scope of environmental metagenomics is rapidly expanding with the advent of low cost, high throughput sequencing technologies. It is now increasingly common to analyze complete assembled metagenomes, highlighting the importance of developing standardized methods that can be applied to environmental samples. 
However, our ability to locate and potentially identify mobile genetic elements will only be useful if we can also confirm the functions of putative mobile elements.

Antibiotic resistance genes coming from clinical sources have been likened to an invasive species (Pruden et al. 2012; Gillings et al. 2015). They are introduced into the environment in wastewater and agriculture in the same way as chemical pollutants, but they pose far different kinds of threats since they are present in replicative organisms and on self-transmissible elements. They may also form new combinations that aid in their dissemination or maintenance within a population. The aggressiveness of their dissemination is determined by the nature of the mobile genetic element with which they are associated, which is why it is so vitally important that we improve our understanding of the nature and diversity of these mobile elements in bacterial communities. Our current level of understanding of the transposable elements, and tyrosine-based site-specific recombinases in particular, is akin to that of an uninitiated gardener: we can group elements based on shared characteristics, and can recognize some known weeds, but are left in awe of the diversity that we have yet to explore.

The recognition that environmental bacterial communities serve as a reservoir of resistance genes has important implications for managing antibiotic resistance in pathogens. There are many mechanisms by which antibiotic resistance genes can be maintained in a complex microbial community in the absence of selection pressure. One is co-selection by other environmental pollutants, as has already been seen with heavy metals (Stepanauskas et al. 2006; McArthur et al. 2011; Wright et al. 2006; Wright et al. 2008). Co-selection has undoubtedly impacted antibiotic resistance gene maintenance, given the high concentration of heavy metals relative to antibiotics in the environment (Stepanauskas et al. 2006). Heavy metal resistance genes are commonly found on the same transmissible plasmids and transposons carrying antibiotic resistance genes, and the class 1 integrons are commonly associated with resistance to disinfectants in addition to both heavy metals and antibiotics (Gillings et al. 2015). 
This highlights the role that seemingly unrelated environmental pollutants may play in the maintenance and dissemination of antibiotic resistance in bacterial communities.

Secondly, it has been shown that some resistance genes can be silenced by regulatory mechanisms that allow the gene to be maintained within the population, or can evolve from genes that serve alternative purposes in the absence of antibiotic pressure (proto-resistance genes). Movement of these genes into other genomic locations or other strains can create or restore the antibiotic resistance phenotype when selection is applied (Perry and Wright, 2013). Antibiotic resistance genes that are incorporated into integrons can likewise be inactivated and therefore present a reduced burden to the host strain. Since gene cassettes are promoter-less, the genes contained within the integron cassette array are prone to severe polar effects, with the genes farthest from the integron integrase rarely transcribed. Since SOS induction can result in the re-organization of the cassettes maintained within the array, integrons represent an ideal storage site where potentially useful genes can be maintained (Cambray et al. 2011).

Thirdly, sub-inhibitory concentrations of antibiotics have been shown to increase the potential for evolution of new traits within populations, through increased mutation rates and horizontal transfer (Baquero 2009; Gillings and Stokes, 2012). This highlights the important distinction between minimum inhibitory concentrations (toxicity) of a chemical and minimal selective concentrations, a distinction not currently addressed in regulations designed to determine appropriate limitations on release of chemicals to the environment. Whether there are other environmental pollutants that specifically impact the ‘evolvability’ of bacterial communities remains an open question (Gillings and Stokes, 2012).

It is important to realize that although the established mechanisms of HGT generally refer to the acquisition of novel genes or transposable elements from other organisms, the mobile elements themselves (plasmids, phages, ICEs) evolve over time through the transfer or rearrangement of modules between MGEs within a bacterial cell. IS density has been shown to be higher in bacterial plasmids than in their host chromosomes, which may be the result of preferential targeting by some transposable elements into plasmids using rolling-circle replication (Siguier et al. 2014). The modular nature of plasmids and phages has been well established (Hendrix et al. 2000; Toussaint and Merlin, 2002), as evidenced by the broad diversity of accessory genes that are commonly found on plasmids with homologous replication systems (Heuer and Smalla, 2012). IS elements and other intracellular MGEs can facilitate transfers of gene segments, and can also serve to recombine different MGEs, and therefore the categories established for the different elements should be considered fluid (Osborn and Boltner, 2002; Toussaint and Merlin, 2002). 
Insertion sequences interspersed throughout a genome can be beneficial to the bacteria for the purposes of incorporating exogenous DNA or disabling the ability of a phage to excise from a genome in order to preserve beneficial genes (which the phage had been using as a selective agent to ensure inheritance). There is therefore a complex balancing act between the risks involved in maintaining potentially mobile genes, and the benefit derived from the genome plasticity these genes enable.

The distribution of IS elements in a genome is non-random, resulting in regions of the genome where insertion of a new element is less likely to be detrimental (Plague, 2010). As a result, mobile elements often invade each other (Darmon and Leach, 2014). This can result in new chimeric mobile elements, and fragments of inactivated MGEs that can serve as homologous regions for further rearrangements. These genomic regions have alternatively been referred to as genomic islands (Langille and Brinkman, 2009), regions of genome plasticity (RGPs) (Ogier et al. 2010), or ‘junkyards’ of MGEs (Schwartz et al. 2003), but they serve an important role by providing relatively safe regions for the acquisition of incoming mobile elements.

2.4 References

Allen, H. K., Moe, L. A., Rodbumrer, J., Gaarder, A., & Handelsman, J. (2009). Functional metagenomics reveals diverse β-lactamases in a remote Alaskan soil. The ISME Journal, 3(2), 243-251.

Baquero, F. 2009. Environmental stress and evolvability in microbial systems. Clin. Microbiol. Infect. 15(Suppl.1):5-10.

Bellanger, X., Payot, S., Leblond-Bourget, N., & Guédon, G. (2014). Conjugative and mobilizable genomic islands in bacteria: evolution and diversity. FEMS microbiology reviews, 38(4), 720-760.

Boucher, Y., Labbate, M., Koenig, J. E., & Stokes, H. W. (2007). Integrons: mobilizable platforms that promote genetic diversity in bacteria. Trends in microbiology, 15(7), 301-309.

Brussow, H. 2008. Phage-bacterium co-evolution and its implication for bacterial pathogenesis. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 49-77. Cambridge University Press, New York, NY, USA.

Burrus, V., G. Pavlovic, B. Decaris and G. Guedon. 2002. Conjugative transposons: the tip of the iceberg. Molecular Microbiology 46(3): 601-610.

Cambray, G., A-M. Guerout and D. Mazel. 2010. Integrons. Annual Review of Genetics. 44:141–166.

Cambray, G., N. Sanchez-Alberola, S. Campoy, E. Guerin, S. Da Re, B. Gonzalez-Zorn, M-C. Ploy, J. Barbe, D. Mazel and I. Erill. 2011. Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons. Mobile DNA 2(1):6

Carraro N. and Burrus V. 2015. Biology of Three ICE Families: SXT/R391, ICEBs1, and ICESt1/ICESt3, p 289-309. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0008-2014

Collis, C.M., Kim, M., Stokes, H.W. and R.M. Hall. 2002. Integron-encoded IntI integrases preferentially recognize the adjacent cognate attI site in recombination with a 59-be site. Molecular Microbiology 46(5): 1415-1427.

Curcio, M. J., & Derbyshire, K. M. 2003. The outs and ins of transposition: from mu to kangaroo. Nature Reviews Molecular Cell Biology, 4(11), 865-877.

Darmon, E. and D.R.F. Leach. 2014. Bacterial Genome Instability. Microbiology and Molecular Biology Reviews. 78(1):1-39.

Domingues S., G.J. da Silva, K. M. Nielsen. 2012. Integrons: vehicles and pathways for horizontal dissemination in bacteria. Mob. Genet. Elements 2:211-223.

D’Costa, V. M., King, C. E., Kalan, L., Morar, M., Sung, W. W., Schwarz, C., ... & Wright, G. D. (2011). Antibiotic resistance is ancient. Nature, 477(7365), 457-461.

Finley, R. L., Collignon, P., Larsson, D. J., McEwen, S. A., Li, X. Z., Gaze, W. H., ... & Topp, E. (2013). The scourge of antibiotic resistance: the important role of the environment. Clinical Infectious Diseases, cit355.

Forsberg, K. J., Reyes, A., Wang, B., Selleck, E. M., Sommer, M. O., & Dantas, G. (2012). The shared antibiotic resistome of soil bacteria and human pathogens. Science, 337(6098), 1107-1111.

Frost, L. S., Leplae, R., Summers, A. O. & Toussaint, A. 2005 Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3:722-732.

Fulthorpe, R. R., Liss, S. N., & Allen, D. G. (1993). Characterization of bacteria isolated from a bleached kraft pulp mill wastewater treatment system. Canadian journal of microbiology, 39(1), 13-24.

Gillings, M. R., & Stokes, H. W. 2012. Are humans increasing bacterial evolvability?. Trends in ecology & evolution, 27(6), 346-352.

Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K., Tiedje, J.M. and Zhu, Y.-G. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME Journal doi:10.1038/ismej.2014.226

Guerin, E., G. Cambray, N. Sanchez-Alberola, S. Campoy, I. Erill, S. Da Re, B. Gonzalez-Zorn, J. Barbé, M.C. Ploy and D. Mazel. 2009. The SOS response controls integron recombination. Science 324:1034.

Hall, R.M. and C.M. Collis. 1995. Mobile gene cassettes and integrons: capture and spread of genes by site-specific recombination. Molecular Microbiology 15(4): 593-600.

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips ASM Press, Washington, D.C. USA

Harrison, P. W., Lower, R. P., Kim, N. K., & Young, J. P. W. (2010). Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends in microbiology, 18(4), 141-148.

Hendrix, R. W., Lawrence, J. G., Hatfull, G., and Casjens, S. 2000. The origins and ongoing evolution of viruses. Trends Microbiol. 8, 504–508.

Hendrix, R.W. and S.R. Casjens. 2008. The Role of Bacteriophages in the Generation and Spread of Bacterial Pathogens. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 79-112. Cambridge University Press, New York, NY, USA.

Heuer, H., & Smalla, K. (2012). Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS microbiology reviews, 36(6), 1083-1104.

Hirano, N., Muroi, T., Takahashi, H., & Haruki, M. (2011). Site-specific recombinases as tools for heterologous gene integration. Applied microbiology and biotechnology, 92(2), 227-239.

Juhas, M., van der Meer, J. R., Gaillard, M., Harding, R. M., Hood, D. W., & Crook, D. W. (2009). Genomic islands: tools of bacterial horizontal gene transfer and evolution. FEMS microbiology reviews, 33(2), 376-393.

Knapp, C.W., J. Dolfing, P.A. Ehlert and D.W. Graham. 2010. Evidence of increasing antibiotic resistance gene abundances in archived soils since 1940. Environ. Sci. Technol. 44:580–587. doi:10.1021/es901221x

Konstantinidis, K. T., & Tiedje, J. M. (2004). Trends between gene content and genome size in prokaryotic species with larger genomes. Proceedings of the National Academy of Sciences of the United States of America, 101(9), 3160-3165.

Lang, A.S. and J.T. Beatty. 2007. Importance of widespread gene transfer agent genes in alpha-proteobacteria. Trends in Microbiology 15:54-62.

Lawrence, J.G. and H. Hendrickson. 2008. Genomes in Motion: Gene Transfer as a Catalyst for Genome Change. In: Horizontal Gene Transfer in the Evolution of Pathogens. pp. 3-22. Cambridge University Press, New York, NY, USA.

Langille, M.G.I. and F.S.L. Brinkman, IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics (2009) Jan. 16 (EPub). PMID: 19151094

Lawrence, J.G. and A.C. Retchless. 2009. The Interplay of Homologous Recombination and Horizontal Gene Transfer in Bacterial Speciation. In: Horizontal Gene Transfer: Genomes in Flux. pp. 29-54. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Leung, E., Weil, D. E., Raviglione, M., & Nakatani, H. (2011). The WHO policy package to combat antimicrobial resistance. Bulletin of the World Health Organization, 89(5), 390-392.

Lima, W.C., A.C.M. Paquola, A.M. Varani, M-A. Van Sluys and C.F.M. Menck. 2008. Laterally transferred genomic islands in Xanthomonadales related to pathogenicity and primary metabolism. FEMS Microbiology Letters 281:87–97.

Ling A, Cordaux R (2010) Insertion Sequence Inversions Mediated by Ectopic Recombination between Terminal Inverted Repeats. PLoS ONE 5(12): e15654. doi:10.1371/journal.pone.0015654

Mahillon, J. and M. Chandler. 1998. Insertion sequences. Microbiol. Mol. Biol. Rev. 62:725-774.

Mazel, D. 2006. Integrons: agents of bacterial evolution. Nat. Rev. Microbiol. 4:608-620.

McArthur, J. V., Tuckfield, R. C., Lindell, A. H., & Baker-Austin, C. (2011). When rivers become reservoirs of antibiotic resistance: industrial effluents and gene nurseries.

Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T. and Landy, A. 1998. Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic Acids Res. 26:391-406.

Ogier, J. C., Calteau, A., Forst, S., Goodrich-Blair, H., Roche, D., Rouy, Z., ... & Gaudriault, S. (2010). Units of plasticity in bacterial genomes: new insight from the comparative genomics of two bacteria interacting with invertebrates, Photorhabdus and Xenorhabdus. BMC genomics, 11(1), 568.

Olendzenski, L. and J.P. Gogarten. 2009. Gene Transfer: Who Benefits? In: Horizontal Gene Transfer: Genomes in Flux. pp. 3-12. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Olschlager, T. and J. Hacker. 2008. Genomic Islands in the Bacterial Chromosome – Paradigms of Evolution in Quantum Leaps. In: Horizontal Gene Transfer in the Evolution of Pathogenesis. pp. 113-134. Cambridge University Press, New York, NY, USA.

Osborn, A. M. and D. Boltner. 2002. When phage, plasmids and transposons collide: genomic islands, and conjugative- and mobilizable-transposons as a mosaic continuum. Plasmid 48: 202-212.

Perry, J. and G.D. Wright. 2013. The antibiotic resistance “mobilome”: searching for the link between environment and clinic. Frontiers in Microbiology. 4:1-7.

Plague, G.R. 2010. Intergenic transposable elements are not randomly distributed. Genome Biol. Evol. 2:584-590.

Pruden, A., & Arabi, M. 2012. Quantifying anthropogenic impacts on environmental reservoirs of antibiotic resistance. Antimicrobial Resistance in the Environment, 173-202.

Ragan M.A. and R.G. Beiko. 2009. Lateral Gene Transfer: Open Issues. Philosophical Transactions of the Royal Society B. 364: 2241–2251.

Roberts, A.P., M. Chandler, P. Courvalin, G. Guedon, P. Mullany, T. Pembroke, J.I. Rood, C.J. Smith, A. O. Summers, M. Tsuda and D. E. Berg. 2008. Revised nomenclature for transposable genetic elements. Plasmid. 60: 167-173.

Rowland, S.J. and W.M. Stark. 2005. Site-specific recombination by the serine recombinases. In: The Dynamic Bacterial Genome pp. 83-120. Cambridge University Press, NY, NY, USA.

Schwartz, E., A. Henne, R. Cramm, T. Eitinger, B. Friedrich and G. Gottschalk. 2003. Complete Nucleotide Sequence of pHG1: A Ralstonia eutropha H16 Megaplasmid Encoding Key Enzymes of H2-based Lithoautotrophy and Anaerobiosis. J. Mol. Biol. 332: 369–383.

Schumann, W. 2006. Sequence specific recombination classes. In: Dynamics of the Bacterial genome pp. 97-98 John Wiley & Sons.

Siefert, J.L. 2009. Defining the Mobilome. In: Horizontal Gene Transfer: Genomes in Flux. pp. 13-27. Ed. M.B. Gogarten, J.P. Gogarten and L. Olendzenski. Humana Press. New York, NY, USA.

Siguier, P., E. Gourbeyre and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38: 865-891.

Siguier P, Gourbeyre E, Varani A, Ton-Hoang B, Chandler M. 2015. Everyman’s Guide to Bacterial Insertion Sequences, p 555-590. In Craig N, Chandler M, Gellert M, Lambowitz A, Rice P, Sandmeyer S (ed), Mobile DNA III. ASM Press, Washington, DC. doi: 10.1128/microbiolspec.MDNA3-0030-2014

Snyder, L. and W. Champness. 2007. Molecular Genetics of Bacteria.

Taylor, N.G.H., D.W. Verner-Jeffreys and C. Baker-Austin. 2011. Aquatic systems: maintaining, mixing and mobilizing antimicrobial resistance? Trends in Ecology and Evolution 26(6): 278-284.

Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules, Plasmid 47 (2002) 26-35.

Van Houdt, R.V, S. Monchy, N. Leys and M. Mergeay. 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96:205-226.

Wozniak, R. A., & Waldor, M. K. (2010). Integrative and conjugative elements: mosaic mobile genetic elements enabling dynamic lateral gene flow. Nature Reviews Microbiology, 8(8), 552-563.

Wright, M. S., Peltier, G. L., Stepanauskas, R., & McArthur, J. V. (2006). Bacterial tolerances to metals and antibiotics in metal-contaminated and reference streams. FEMS microbiology ecology, 58(2), 293-302.

Wright, M.S., Baker-Austin, C., Lindell, A.H., Stepanauskas, R., Stokes, H.W. and J.V. McArthur. 2008. Influence of industrial contamination on mobile genetic elements: class 1 integron abundance and gene cassette structure in aquatic bacterial communities. ISME Journal 2: 417-428.

Chapter 3 The Limitations of Draft Assemblies for Understanding Prokaryotic Adaptation and Evolution

Acknowledgements and Contributions: This chapter is reproduced as published in Genomics (Ricker, N., Qian, H., & Fulthorpe, R. R. 2012. The limitations of draft assemblies for understanding prokaryotic adaptation and evolution. Genomics, 100(3), 167-175 doi:10.1016/j.ygeno.2012.06.009) with minor modifications.

3 Introduction

Next generation sequencing (NGS) platforms have revolutionized how we obtain genetic information, leading to rapid advances in the fields of genomics and metagenomics. These methods rely on newer sequencing chemistries and highly parallel operations that result in high yields at low costs per read but so far produce considerably shorter reads (in the range of 35-500 nucleotides) than Sanger sequencing (600 to 1500 nucleotides; Sanger et al. 1977). Shorter reads increase the required complexity of the assembly algorithms (Miller et al. 2010), although the ability to sequence to very high coverage can overcome many of the original issues in genome assembly, including read errors and coverage gaps (Wetzel et al. 2011). The utility of next generation sequencing has been demonstrated in examining new variants, or very close relatives, of previously sequenced strains (reviewed in MacLean et al. 2009). This type of assembly, known as a reference or mapping assembly, is relatively straightforward provided that the two strains share high sequence identity across their genomes. However, in many bacterial species the ‘core’ genes that are shared between closely related strains are supplemented by a significant fraction of ‘dispensable’ genes that vary between the strains of a given species (Medini et al. 2005). Assembling these sections of sequence data, or entire genomes, in the absence of a suitable reference strain (referred to as de novo assembly) is a far more difficult task (Pop 2009). Whole genome shotgun assemblies using traditional Sanger sequencing have been utilized for many years for this purpose, but the cost and effort required to do this type of intense sequencing has been prohibitive for all but the largest laboratories (MacLean et al. 2009). The advent of NGS platforms promises to alleviate the financial and technical demands of obtaining high quality sequence data; however, repetitive elements in genomic sequence remain a confounding issue in genome assembly that is difficult to resolve through coverage alone (Wetzel et al. 2011).

Many assembly programs for NGS data utilize de Bruijn graphing techniques (see Miller et al. 2010) to perform de novo assemblies of the high number of reads produced, with the goal of finding the shortest path through the sequence data that includes as much of the sequence data as possible. For genomes with a high content of repetitive sequences, some assembly programs will produce an overly compressed alignment, and possible mis-assemblies, when multiple copies of a repeat are collapsed to one location (Chevreux et al. 1999; Phillippy et al. 2008). Accurate graphs (those that do not collapse repetitive elements) will often form a ‘frayed rope’ pattern in repetitive sections, whereby paths converge at the repeat and then diverge again (multiple paths leading into the repeat and multiple paths leading out of it), since there are multiple true alignments possible. Some assembly programs specifically search for the characteristic features that repetitive elements create within a graph, such as convergent, divergent or cyclic paths (Miller et al. 2010), and therefore terminate at these repetitive elements to ensure that they are not overly compressed in the final assembly. This results in a more fractured assembly, but prevents the errors introduced by arbitrarily collapsing the repeats to one location.
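The ‘frayed rope’ pattern described above can be reproduced on a toy sequence. The following is a minimal sketch and not the algorithm implemented in Velvet or ABySS: it builds a de Bruijn graph from error-free reads of a hypothetical 31 bp sequence that carries two copies of an 8 bp repeat, and then reports the nodes that have more than one incoming or more than one outgoing edge. The function names, the toy sequence and the choice of k = 5 are assumptions made purely for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one predecessor or more than one successor."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] += 1
    nodes = set(graph) | set(indegree)
    return {n for n in nodes if len(graph.get(n, ())) > 1 or indegree[n] > 1}


# Hypothetical sequence: three unique flanks separated by two copies of a repeat.
REPEAT = "ACGTTGCA"
genome = "TTAGC" + REPEAT + "GATCT" + REPEAT + "TTGAA"

# Error-free 10 bp reads starting at every position of the toy sequence.
reads = [genome[i:i + 10] for i in range(len(genome) - 9)]

graph = de_bruijn_edges(reads, k=5)
print(sorted(branching_nodes(graph)))
```

In this toy case the branching nodes are the first and last 4-mers of the repeat: two paths converge on the repeat and two paths leave it, which is where an assembler that refuses to collapse repeats would be expected to end its contigs.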

Realistically, the assembly software is not expected to produce a perfectly aligned genome but rather to reduce the sequencing reads into a manageable number of contigs (‘contiguous sequence’ – the sequence produced by the assembly of multiple overlapping reads) for finishing. ‘Finishing’ is the process of closing all contig gaps, correcting introduced errors, and confirming low coverage regions of the assembly through PCR and cloning experiments at the bench. These experiments can still be expected to take months to years, even with excellent sequence data and the best software currently available (Nagarajan et al. 2010). For this reason, complete genome finishing is rarely carried out both due to the effort required, and because the aim of many sequencing projects is limited to looking for a small number of differences between the new strain and a previously sequenced close relative. The resulting genome projects are often submitted as unfinished draft assemblies, or as ‘assembled with likely errors’ (Phillippy et al. 2008).

Although not as repetitive as eukaryotic genomes, prokaryotic genomes contain a variety of repeated elements ranging in size from 1-6 bp microsatellites (Ellegren 2004; Mayer et al. 2010) to larger elements such as transposons, insertion sequences, rRNA operons, tRNA genes, and rhs family genes (Lupski and Weinstock 1992). The computational issues that repetitive genomes pose to NGS assembly have been discussed in other recent papers (Miller et al. 2010; Wetzel et al. 2011; MacLean et al. 2009; Zhang et al. 2011), but there has been remarkably little emphasis on the relative value of the portion of the genome that remains fragmented in these draft assemblies. To this end, we performed an in silico experiment using simulated long and short read data for the fully sequenced genome of Cupriavidus metallidurans CH34 (hereafter simply referred to as CH34). This organism was sequenced by the Joint Genome Institute (JGI) using whole genome shotgun cloning (WGS) with a combination of three randomly sheared libraries (3, 8 and 40 kb insert sizes) and an additional 3,752 individual Sanger reads for finishing (Van Houdt et al. 2009). It was chosen for this study because of the high quality finishing and annotation that have been performed (Van Houdt et al. 2009; Janssen et al. 2010) as well as the nature of the genome, which contains two large chromosomes and many types of mobile elements. We anticipated that the repetitive elements contained within this genome would be a hindrance to assembly, and that this simulation would serve to illustrate the portions of the genome that are inherently resistant to automated assembly. Four additional strains (Caulobacter sp. K31, Gramella forsetii KT0803, Rhodobacter sphaeroides 2.4.1 and Bordetella bronchiseptica RB50) were also included, which varied in G+C content, number of replicons, repeat content of the genome and percentage of genes annotated as involved in mobility (plasmids, phages and transposons). A detailed analysis of each individual strain was not performed since the genomic islands have not been characterized; however, genomic island predictions were available from the IslandViewer website (Langille and Brinkman, 2009), which utilizes multiple software programs to predict genomic islands from the completed sequence. The predicted genomic islands in these strains were considerably smaller than those determined in CH34, so it is expected that some of the predicted islands may actually be components of one larger island.

Only two assembly programs were utilized since the presence of repeated elements is a commonly acknowledged issue in assembly algorithms (Pevzner and Tang, 2001; Kingsford et al. 2010), and a comparison of computational effectiveness was outside the scope of this study.

Our intent was rather to illustrate the biological significance of the regions most likely to remain unassembled by the nature of their sequence. The Velvet assembler was chosen because the algorithms have been improved to prevent over-collapsing of repeats (Zerbino et al. 2009). The ABySS assembler (Simpson et al. 2009) was utilized to determine whether the results were specific to the Velvet algorithms. Our goal for this project was to use the well-annotated CH34 genome to better understand the biological relevance of the sections of the genome left unassembled and to examine which aspects of genome complexity would be most problematic to assemble into large contigs given ideal data. This serves to illustrate the inherent issues in draft assemblies of prokaryotic genomes, which we also show are only further complicated by the use of real data.

3.1 Methods

All genomes were obtained from the NCBI website (www.ncbi.nlm.nih.gov) with the following GenBank accession numbers: Cupriavidus metallidurans CH34 (CP000352-CP000355), Caulobacter sp. K31 (CP000927.1-CP000929.1), Gramella forsetii KT0803 (CU207366.1), Bordetella bronchiseptica RB50 (BX470250.1) and Rhodobacter sphaeroides 2.4.1 (CP000143.1-CP000147.1, DQ232586.1, DQ232587.1). These files were used to create error-free simulated long read (400 bp length at 10x coverage) and short read (75 bp length at 45x coverage) data for assembly in Velvet using a custom-made Python program (available on request). These datasets were assembled using Velvet version 1.1.05 (Zerbino and Birney, 2008) using the max_kmer and big_assembly settings, as these settings gave the best assembly statistics (N50 and max contig). The final graph of the Velvet assembly for C. metallidurans CH34 used 4,260,497 of the 4,265,686 (99.9%) simulated reads and resulted in a total of 139 contigs. The maximum contig length was 674,170 bp and the N50 value for the assembly (size of contig for which 50% of assembled reads are in a contig of that size or larger) was 159,531 bp. The median coverage was calculated as 11.8. The N50 and longest contig statistics for the other genomes are listed in Table 3.2.4. Paired-end libraries with 100 bp reads were also created for two different insert distances (180 and 3000). The paired-end dataset for C. metallidurans CH34 was assembled in ABySS version 1.1.3 (Simpson et al. 2009) with a final N50 value of 36,682 bp and maximum contig size of 166,493 bp.
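The custom program used to generate the simulated reads is not reproduced in this chapter. As a rough illustration of what error-free read simulation at a target coverage involves, the sketch below samples fixed-length substrings from a sequence until the requested depth is reached. It is a minimal sketch under stated assumptions: start positions are drawn uniformly, only the forward strand of a linear sequence is sampled, and the function name, seed and toy sequence are hypothetical rather than taken from the original program.

```python
import random


def simulate_reads(genome, read_length, coverage, seed=0):
    """Sample error-free reads of a fixed length from a linear sequence.

    The number of reads is chosen so that
    reads * read_length is approximately coverage * genome length.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)
    n_reads = round(coverage * len(genome) / read_length)
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads


# Toy stand-in for a replicon (assumption: random 10 kb sequence).
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))

long_reads = simulate_reads(toy_genome, read_length=400, coverage=10)
short_reads = simulate_reads(toy_genome, read_length=75, coverage=45)
print(len(long_reads), len(short_reads))
```

Because the reads are exact substrings of the reference, any fragmentation in a subsequent assembly can be attributed to the structure of the sequence rather than to sequencing error, which is the property the simulation in this chapter relies on.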

Assembled contigs were aligned to reference sequences using Geneious Pro version 5.5.2 (Drummond et al. 2010). Despite the error-free nature of the simulated data, alignments were performed at 98% identity since imperfect repeats (repeats with a small number of single base pair differences) could be seen as sequencing errors by the assembler and would be incorrectly collapsed thereby introducing errors into the final contigs. Coverage statistics included were those determined by the Geneious program and therefore represent coverage of reference by unique contigs only, with no allowance for contig repetition, instead of true coverage of the reference genome if all repeats could be accounted for. Examination of genes adjacent to the ends of contigs was performed using the NCBI Blast tool (Altschul et al. 1990), and the Genbank entries for each replicon (www.ncbi.nlm.nih.gov). Repeat content of the genomes was estimated by calculating the uniqueness of each genome at k-mer lengths of 31 and 1000 and then taking the average of these two calculations. Assembly files from the GAGE study (Salzberg et al. 2012) were downloaded and aligned by the same metrics, or by the addition of a maximum 500 bp gap parameter as necessary.
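The repeat-content estimate described above can be sketched as follows. The exact definition of "uniqueness" is not spelled out in the text, so this example assumes it to be the percentage of k-mer positions whose k-mer occurs exactly once in the sequence, averaged over the chosen k values. The function names are hypothetical, and the small k values in the usage lines are only there to keep the toy example readable; the chapter used k = 31 and k = 1000.

```python
from collections import Counter


def unique_kmer_percent(sequence, k):
    """Percentage of k-mer positions whose k-mer occurs exactly once."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    unique = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique / total


def mean_uniqueness(sequence, ks=(31, 1000)):
    """Average the unique k-mer percentage over several k-mer lengths."""
    return sum(unique_kmer_percent(sequence, k) for k in ks) / len(ks)


# Toy sequence in which the trinucleotide ACG occurs twice.
toy = "ACGTACGA"
print(unique_kmer_percent(toy, 3))
print(mean_uniqueness(toy, ks=(2, 3)))
```

Storing every k-mer as a string is adequate for a toy sequence; for k = 1000 on a genome of several megabases, a hashed or suffix-based representation would be needed to keep memory use reasonable.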

3.2 Results

3.2.1 Assembly Quality for Cupriavidus metallidurans CH34

CH34 has 4 large replicons (Table 3.2.1) and a multitude of well-annotated smaller mobile elements including genomic islands, transposons and insertion sequences (Van Houdt et al. 2009; Monchy et al. 2007). On the two chromosomes, there are four sets of 5S, 16S and 23S rRNA genes (2 sets on each) and 62 tRNA genes (8 of which are duplicates found on the second chromosome) (Janssen et al. 2010). There are 16 documented genomic islands (11 on chromosome 1, none on chromosome 2, 3 on pMOL30 and 2 on pMOL28), as well as 57 insertion sequences and 19 other transposable elements distributed across the four replicons (Janssen et al. 2010).

Table 3.2.1: Number of contigs aligning and coverage statistics for each of the four replicons in C. metallidurans CH34 using Velvet and ABySS genome assembly software.

Assembly  Replicon  Size (bp)  Number of contigs aligned at 98%  Largest contig (bp)  Total bases in contigs longer than 10 kb  Total bases in contigs longer than 5 kb  Total bases in contigs longer than 1 kb
Velvet    Chr 1     3,928,089  75                                674,226 (17.2%)      3,786,365 (96.4%)                         3,835,365 (97.6%)                        3,875,047 (98.6%)
Velvet    Chr 2     2,580,084  63                                541,760 (21.0%)      2,466,986 (95.6%)                         2,504,599 (97.1%)                        2,532,450 (98.2%)
Velvet    pMOL30    233,720    18                                58,279 (24.9%)       212,451 (90.9%)                           212,451 (90.9%)                          230,532 (98.6%)
Velvet    pMOL28    171,459    9                                 101,867 (59.4%)      156,377 (91.2%)                           156,377 (91.2%)                          171,008 (99.7%)
ABySS     Chr 1     3,928,089  470                               166,493 (4.2%)       3,435,784 (87.5%)                         3,669,623 (93.4%)                        3,784,364 (96.3%)
ABySS     Chr 2     2,580,084  212                               107,711 (4.2%)       2,242,357 (86.9%)                         2,416,359 (93.6%)                        2,511,515 (97.3%)
ABySS     pMOL30    233,720    59                                29,993 (12.8%)       155,523 (66.5%)                           190,452 (81.5%)                          219,738 (94.0%)
ABySS     pMOL28    171,459    36                                50,670 (29.6%)       135,295 (78.9%)                           140,927 (82.2%)                          162,258 (94.6%)

From our simulated dataset (see methods), an assembly of 139 contigs was created after assembly in Velvet. This assembly was aligned to the reference sequence of each of the four replicons (Table 3.2.1) in Geneious version 5.5.2 (Drummond et al. 2010). Several of the contigs were found to align to multiple replicons (Figure 3.2.1), including one that aligned to all four replicons (corresponding to Tn6049). The largest contig that was shared in more than one location was contig 152 (length 10,403 bp), which is found on both chromosome 1 and 2 and corresponds to Tn6048. Likewise, a single contig of 5471 bp corresponded to the 4 rRNA operons that are evenly divided between the two chromosomes. All contigs mapped to the reference genome at 98% identity. The genome was also assembled using ABySS version 1.3.3 (Simpson et al. 2009). This assembly was considerably more fragmented than the Velvet assembly (Table 3.2.1) and had two small contigs (915 bp and 740 bp) that did not align with any of the replicons at 98% identity. Due to the considerably larger number of fragments from this assembly, the causes of contig termination were not determined for the contigs produced by the ABySS software; however, both software programs had greater difficulty assembling the genomic island rich pMOL30 compared to pMOL28 and showed similar contig distribution patterns (Figure 3.2.2).

Figure 3.2.1: Number of assembled contigs in Velvet aligning to replicons in C. metallidurans CH34. Venn diagram is based on 98% sequence identity. It is important to note that there are no shared contigs found solely between chromosome 1 and pMOL30, or solely between chromosome 2 and pMOL28.

Figure 3.2.2: Geneious alignment of assembled contigs to two key regions containing genomic islands in C. metallidurans CH34. Top two images are from the Velvet assembly, bottom images are the same regions from the ABySS assembly. The grey bar indicates region coverage, and the black lines are reference sequence (solid line) and location of the contigs with respect to the reference. The top alignment for each assembler includes the GI rich region ranging from approximately 1.2 Mb to 1.8 Mb on chromosome 1 and contains the two largest genomic islands. The bottom alignment is to the full length of pMOL28, with the heavy metal resistance island highlighted (location is as indicated in Monchy et al. 2007). There are more contigs listed in Table 3 than are visible on the figure since contigs mapping to repetitive elements can only be mapped onto the chromosome once.

3.2.2 Contigs terminate at repeated elements and mobile elements

The large contigs from the Velvet assembly were examined to determine the genomic determinants that had caused their termination (see Table 3.2.2). It was our anticipation that the known repeated elements would be the main cause of termination in our error-free dataset, and this was primarily found to be the case. 7 of the largest contigs (4 from chromosome 1 and one each from the other replicons) were investigated and of the 14 terminal regions, 12 were found to have terminated at a previously documented mobile element. The other two corresponded with genes that would not be expected to be mobile. One of these genes was found to have an internal repeat structure that interfered with assembly, and the other was found to have a second copy of the same gene present on both chromosome 1 and chromosome 2 (at 99% identity). When all contigs greater than 1 kb in length from chromosome 1 were included in the analysis (data not shown), 75% (35/46) of the termination points were from documented mobile elements. All other termination points were from duplicate genes found on multiple replicons with the exception of a shared gene cluster between CMGI-2 and CMGI-3 which are both located on chromosome 1 (discussed in section 3.2.3) and the rRNA operons for which there are two copies on each chromosome. Of the mobile elements in this genome, Tn6049 and ISRme3 were found in the highest abundance (12 copies and 10 copies, respectively), and Tn6049 was the only element found on all four replicons.

Table 3.2.2: Details on the terminal regions for 7 large contigs. For simplicity, only the four largest contigs from chromosome 1 and the largest single contig from each of the other replicons are included. The gene or mobile element responsible for the contig termination is listed along with the number of times that element occurs in the total genome.

Contig Name  Size (bp)  Replicon  5' terminus        # in genome                    3' terminus                          # in genome
Contig 17    674226     chr 1     ISRme4             2 (both on chr1)               sodium sulphate symporter            2 (100% to chr1, 99% to chr2)
Contig 113   358847     chr 1     ISRme7             2 (both on chr1)               IS1087B                              2 (both on chr1)
Contig 125   309700     chr 1     Tn6049             12 (across all four replicons) IS1090                               4 (all on chr1)
Contig 143   302838     chr 1     IS1087B            2 (both on chr1)               Tn6049                               12 (across all replicons)
Contig 220   541760     chr 2     Tn6048             3 (1 on chr1, 2 on chr2)       ISRme3                               10 (across all but pMOL28)
Contig 252   58279      pMOL30    merE from Tn4380   2 (1 on each plasmid)          repeated sequence within copB gene   1
Contig 239   101867     pMOL28    merE from Tn4380   2 (1 on each plasmid)          IS1086                               3 (across all but pMOL30)

3.2.3 Fragmentation is greatest at genomic island sites

Interestingly, although there were long contigs distributed across all of the replicons in the Velvet assembly, the distribution of the smaller contigs was not found to be uniform. Instead there were regions on each of the replicons that were markedly fragmented, with small to medium (61-5000 bp) contigs arranged in a pattern of small overlaps or with gaps between them (Figure 3.2.2). Recognizing that genomic islands frequently contain smaller embedded transposable elements and therefore many repeated elements, we overlaid the known genomic island co-ordinates with the assembled fragments for chromosome 1. As noted earlier, chromosome 1 contains 11 of the 16 genomic islands found in CH34. A sequential ordering of the longest contigs corresponding to chromosome 1 revealed that only one of the genomic islands (CMGI-9) was fully captured in a large contig. This is not surprising as this island has no documented repetitive elements (not even terminal repeats) that would have interfered with assembly. Since the genomic islands appeared to be linked with the prevalence of fragments, we aligned the contigs to each of the chromosome 1 islands individually in Geneious. In general, the larger genomic islands aligned to higher numbers of contigs (Table 3.2.3). The four largest islands each had a minimum of 2 contigs longer than 5 kb, representing accessory genes that were contiguous without interference from mobile elements or repeated segments. However, the termination points of these contigs serve to highlight the difficulties of obtaining complete assemblies of even these relatively small regions (compared to the genome). As would be expected, many of the contigs terminated at a documented insertion sequence or transposon that was found at another location in the genome (sometimes within the same genomic island). Tn6049 (with a length of 3461 bp) is a very promiscuous transposable element found in 12 locations in the genome, including on 5 of the 11 genomic islands, and it terminated assembly in each of the locations where it was found. In addition, there were other genes that were present on more than one of the genomic islands and therefore interfered with proper assembly. CMGI-2 and CMGI-3 share several homologous gene clusters (see Table 3.2.3) and have similar conjugal transfer genes. Two of these genes (trbB and trbF) share high sequence identity across their length (97 and 92%, respectively) and were found to cause the termination of contigs containing the conjugal transfer genes in both of these islands. CMGI-3 also has multiple copies of IS1071, and in some cases this element appears to have been responsible for the mobilization of fragments of adjacent genes, which are then also repeated within the island, further fragmenting the assembly.

CH34 is most noted for its ability to withstand heavy metals (Janssen et al. 2010) and many of the genes conferring these abilities are contained within three genomic islands distributed on the two plasmids pMOL30 and pMOL28. The two large islands account for almost the full length of pMOL30 and approximately a third of the length of pMOL28. Each island also contains “nested” islands with partial or complete mobile elements that separate different functional modules (Van Houdt et al. 2009). An examination of the Geneious alignments for both pMOL28 and the region of chromosome 1 containing two genomic islands conferring such notable phenotypes as hydrogenotrophy and metabolism of aromatic compounds revealed that these regions are highly fragmented in comparison to surrounding regions in both the Velvet and ABySS assemblies (Figure 3.2.2).

Table 3.2.3: Genomic islands found on chromosome 1 of CH34. Naming, sizes and content information are derived from previous characterization (Van Houdt et al. 2009). Contig information is solely from the Velvet assembly for simplicity.

Name of Element  Size        Content Information                                                        Contigs aligned within region (a,b)  Size Range of Aligned Contigs (b)
CMGI-1           109,598 bp  Tn6049; closely related to pathogenicity island in P. aeruginosa           3                                    1-5 kb: 2; >10 kb: 1
CMGI-2           101,637 bp  Tn4371 family integrase, hydrogenotrophy, metabolism of aromatic compounds 12                                   <1 kb: 6; 1-5 kb: 2; 5-10 kb: 1; >10 kb: 3
CMGI-3           97,042 bp   Tn4371 family integrase, carbon fixation, hydrogenotrophy                  16                                   <1 kb: 7; 1-5 kb: 6; 5-10 kb: 1; >10 kb: 2
CMGI-4           56,529 bp   Tn4371 family integrase, Tn6048                                            1                                    >10 kb: 1
CMGI-5           25,423 bp   63 bp direct repeats                                                       3                                    1-5 kb: 3
CMGI-6           17,638 bp   Tn6049                                                                     3                                    1-5 kb: 3
CMGI-7           15,362 bp   Tn6049                                                                     1                                    1-5 kb: 1
CMGI-8           12,257 bp   Tn6049, IS1087                                                             3                                    1-5 kb: 3
CMGI-9           20,648 bp   Integrase, no direct repeats                                               0                                    Contained within large contig
CMGI-10          20,947 bp   3 insertion sequences                                                      5                                    1-5 kb: 4; 5-10 kb: 1
CMGI-11          10,824 bp   Flanked by ISCme7                                                          3                                    <1 kb: 2; 5-10 kb: 1

(a) These numbers are an approximation because the alignments were performed at 98% and therefore some of the small contigs align to multiple places where imperfect repeats occur.
(b) Numbers are only for contigs completely covered by the genomic island; each island generally aligns to the ends of two larger contigs that are not included in these numbers.

3.2.4 Investigating the relative contribution of multiple replicons or presence of documented mobility genes by comparison with other strains

In addition to our in-depth analysis of CH34, we simulated datasets for an additional 4 genomes that varied in overall genome size, G+C content, number of replicons and predicted mobile element content. The metrics for all 5 genomes assembled using simulated unpaired long and short read datasets are summarized in Table 3.2.4. The Velvet assembly data included in Table 3.2.4 are based on alignment to the reference genome at 98% nucleotide identity with no gaps (see methods), and there were no significant errors in the contigs that would limit their ability to align with these restrictions. As was expected, the large genomes consistently produced a larger number of contigs after assembly, and the assembly quality in terms of both N50 value and maximum contig size relative to largest chromosome decreased with increasing genome size. In order to assess the causes of fragmentation for large genomes, we specifically included strains with variations in both the number of replicons and the number of genes annotated as related to horizontal gene transfer by the JCVI Comprehensive Microbial Resource (JCVI-CMR; http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Based on overall genome size, number of replicons and k-mer repetitiveness, it was expected that CH34 would have the poorest (most fragmented) assembly. However, Caulobacter sp. K31 fared the worst by each of the common metrics listed in Table 3.2.4. Interestingly, the best N50 and maximum contig sizes were obtained for Rhodobacter sphaeroides 2.4.1 despite the fact that this genome is composed of 7 different replicons (Figure 3.2.3). Furthermore, although CH34 was the second poorest assembly in terms of number of contigs and N50 value, Bordetella bronchiseptica RB50 had a smaller maximum contig size. Despite the large overall size of the B. bronchiseptica genome, this result was unexpected given the nature of the genome. It had specifically been chosen because only 0.37% of its gene content has been attributed to mobile functions (plasmids, phages and transposons) by the JCVI-CMR (http://cmr.jcvi.org/cgi-bin/CMR) and it contains only one replicon. It also had the lowest percentage of repetitive k-mers by our calculations (see methods) and should theoretically assemble more easily.
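Since N50 is the main metric used to compare these assemblies, a short worked example may help. The sketch below uses the common base-weighted convention: the length of the contig at which the running total of contig lengths, taken from largest to smallest, first reaches half of the total assembled bases. That convention is an assumption here, as the methods phrase the definition in terms of assembled reads, and the function name and contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Base-weighted N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that takes the running total to half or more.
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: 300 bases in total, so the halfway point is 150 bases.
toy_contigs = [40, 100, 20, 80, 60]
print(n50(toy_contigs))
```

In the toy assembly the two largest contigs (100 and 80) together pass the halfway point, so the N50 is 80. A small number of very long contigs can therefore give a high N50 even when many short fragments remain, which is why the number of contigs and the maximum contig size are reported alongside it.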

To compare these results to the findings described for the well-annotated genome CH34 in the absence of characterized genomic islands, these strains were evaluated according to genomic islands predicted by programs contained within IslandViewer (Langille and Brinkman 2009). Although the precise number or size of the individual islands has not been verified (and is overestimated in CH34), the total number of predicted genomic islands significantly correlates with the maximum contig size, N50 value and N50 as a percentage of longest replicon. As had been seen in CH34, the most fragmented portion of the Bordetella bronchiseptica genome corresponded to a 22 kb segment of repeated gene content shared between two predicted genomic islands (99% nucleotide identity), and likewise the Caulobacter sp. K31 assembly also had a large (10.5 kb) segment that was perfectly repeated between two predicted genomic islands.

Table 3.2.4: Velvet assembly metrics of the 5 genomes compared. Unique k-mer percentage was calculated as described in the methods. Mobile gene numbers were obtained from the JCVI-CMR (http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi). Coverage calculation is defined as total number of reference bases covered by unique contigs at 98% nucleotide identity without gaps or repeating of individual contigs. SIGI and DIMOB are the individual programs that IslandViewer (Langille and Brinkman, 2009) utilizes to predict genomic islands.

Metric                          Caulobacter sp. K31  Cupriavidus metallidurans CH34  Bordetella bronchiseptica RB50  Gramella forsetii KT0803  Rhodobacter sphaeroides 2.4.1
Genome size (Mb)                5.89                 6.91                            5.34                            3.8                       4.6
No. replicons                   3                    4                               1                               1                         7
GC content (%)                  66.3                 62                              68.1                            36.6                      68.8
Unique k-mers (%)               98.55                98.2                            99.5                            99.09                     99.18
Contigs                         151                  139                             104                             90                        99
N50 (bp)                        155,182              159,531                         261,616                         564,738                   740,045
Longest contig (bp)             495,932              674,226                         550,697                         899,275                   1,010,805
N50 vs. longest replicon (%)    2.83                 2.91                            4.78                            10.31                     13.51
Mobile genes                    162                  164                             19                              49                        103
Mobile genes (% of genome)      2.96                 2.65                            0.37                            1.36                      2.46
Islands by SIGI-HMM only        13                   3                               9                               2                         6
Islands by DIMOB only           3                    12                              1                               7                         1
Predicted by both               9                    5                               5                               1                         2
Total # of islands              25                   20                              15                              10                        9
Coverage percentage             98.8                 98.6                            98.9                            99                        99.7

Figure 3.2.3: Relationship between N50 (as percentage of the largest replicon in the genome) and three parameters thought to influence assembly quality. Top: genome size, r = -0.81 (ns); Middle: percent unique K-mers, r = 0.54 (ns); and Bottom: Number of Replicons, r = 0.42 (ns).

Figure 3.2.4: Relationship between three measures of assembly quality (maximum contig length, N50 and N50 as percent of longest replicon) and number of genomic islands as predicted by IslandViewer. The Pearson correlations between N50 or N50 as percent of longest replicon and number of predicted islands are statistically significant (p<0.05) but are also clearly curvilinear.
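The Pearson correlation reported for N50 can be checked directly from the values in Table 3.2.4. The sketch below is an illustrative recalculation rather than the original analysis: it takes the total number of predicted islands and the N50 for the five genomes, in the column order of the table, and computes r by hand. The function and variable names are hypothetical. For five data points (three degrees of freedom), the two-tailed 5% critical value of |r| is about 0.878.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Values transcribed from Table 3.2.4 (K31, CH34, RB50, KT0803, 2.4.1).
total_islands = [25, 20, 15, 10, 9]
n50_bp = [155182, 159531, 261616, 564738, 740045]

r = pearson_r(total_islands, n50_bp)
CRITICAL_R = 0.878  # two-tailed, alpha = 0.05, n = 5
print(round(r, 2), abs(r) > CRITICAL_R)
```

The coefficient is strongly negative and its magnitude exceeds the critical value, in line with the caption. With only five genomes and a visibly curved relationship, the linear coefficient is best read as a rough indicator of the trend.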

3.2.5 Fragmentation Evident in Real Data

The benefit of using simulated ideal data for this type of analysis is that patterns can be detected that may otherwise be masked by variations in sequencing coverage, the introduction of sequencing-specific errors and the high number of contigs produced by real sequencing projects. In order to compare our findings to real sequencing scenarios, we examined the assembly data from Rhodobacter sphaeroides 2.4.1. This strain was utilized in the Genome Assembly Gold-Standard Evaluations (GAGE) study that compared the assembly efficiency of 7 different open access software programs (Salzberg et al. 2012). The assembled contigs from that study are freely available. We downloaded the contigs from the GAGE Velvet assembly of R. sphaeroides 2.4.1 and aligned them to the finished genome in the same way that we compared CH34 contigs generated from simulated sequence to its final genome. When the R. sphaeroides 2.4.1 contigs from the GAGE assembly were mapped to its finished chromosome 1 in Geneious, only 454 fragments (of a total of 1192 contigs and scaffolds) could be aligned at 98% identity, resulting in only 65% coverage of the chromosome. This indicated that the assembled contigs contained internal errors, so we allowed for up to 500 bp gaps in the Geneious alignment. This improved the coverage of chromosome 1 from 65% to 96.3%. Regardless of whether gaps were allowed or not, the density of small contigs was greatly increased in regions predicted to be genomic islands (Figure 3.2.5). For the alignment without gaps, only one of the predicted genomic islands was assembled, whereas 4 out of 9 of the islands were assembled when gaps were allowed. The two islands predicted by both programs in IslandViewer had a large number of fragments for their relative size (13 fragments for 12.5 kb and 12 fragments for 7.5 kb).

Figure 3.2.5: Geneious alignment of real contigs obtained from the GAGE assembly data (Salzberg et al. 2012). Top alignment is at 98% identity with no gaps allowed, bottom alignment is 98% identity with up to 500 bp gaps allowed. The region shown includes 3 putative genomic islands that are all clearly visible by the increased occurrence of small contigs in these regions. These islands occur at 216-228 kb, 550-557 kb and 632-648 kb and are roughly indicated with curved brackets. Since these are only predicted islands, the precise borders may not be accurate and individual islands could be components of a larger combined island.

3.3 Discussion

The Genomes OnLine Database (v. 3.0; http://genomesonline.org accessed 19th March 2012) lists 3532 completed genomes of which 1045 are listed as permanent draft assemblies. The status of permanent draft implies that finishing experiments to verify or extend the existing contigs are not expected to be performed, and the draft status is likely to be related to repeated elements that cannot be resolved by computerized means. Contrary to the early view that many of these smaller repeated elements represent “junk DNA” (Mayer et al. 2010), microsatellites in the form of tandem repeats and transposable elements such as insertion sequences have both been found to regulate transcription of adjacent genes (Versalovic et al. 1991; Mahillon and Chandler, 1998). These repetitive elements also function as important components of genome plasticity by mediating DNA re-arrangements including chromosomal deletions, duplications and inversions (Lupski and Weinstock, 1992; Touchon and Rocha 2007). Larger transposable elements such as transposons and integrative conjugative elements (ICEs) can also be found in multiple copies within a genome, particularly if there are multiple large replicons as is commonly found in certain bacterial families such as the Burkholderiaceae (Janssen et al. 2010; Amadou et al. 2008; Tuanyok et al. 2008). Reaching the stage of a draft genome is sufficient if the goal is to discover interesting and novel genes or operons that do not contain repeated elements, with the consequence that many genome projects are being published at the draft assembly stage and then terminated (Nagarajan et al. 2010). These draft assemblies can have a number of errors including collapsed repeats, rearrangements and inversions (Phillippy et al. 2008; Salzberg et al. 2012; Narzisi and Mishra, 2011) as well as having an unknown fraction of the genome unaccounted for. 
In this study, we used simulated NGS data to confirm that currently available software programs are capable of accurately recognizing repeated segments in the DNA and that these repeats would be the primary cause of contig termination in the assembly. Having established the causes of termination, we wanted to better understand the nature of the fragmented regions of draft assemblies since the relative importance of these unassembled regions has to our knowledge never been addressed.

An examination of the genes adjacent to the termination points for the longest contigs (Table 3.2.2) clearly confirmed that the assemblies were terminated due to the presence of repeated elements. These repeated elements included known mobile elements and genes containing internal repeat structures (as expected), but also genes that were repeated in more than one genomic location (commonly on two separate replicons within this genome). This type of repetition (within or between replicons) is important in the evolution of novel traits since one copy of the gene can be free to evolve without risking functional impairment to the host cell due to the other preserved copy (the duplicate gene hypothesis; Ohno, 1970). Some transposable elements have been found to specifically target transmissible plasmids and the subsequent plasmid-chromosome exchanges facilitate assembly of genes into modules (Siguier et al. 2006), with the result that individual genomes will commonly have identical transposable elements and accessory genes distributed on both the main chromosome and some or all of the associated plasmids (as was seen here). Likewise, the findings from both B. bronchiseptica RB50 and Caulobacter sp. K31 illustrated that predicted genomic islands within the same chromosome can carry repetitive gene content which can interfere with assembly in the absence of repeats across different replicons. Neither of these two large repeated segments contained any insertion sequences or transposons; both were composed almost exclusively of hypothetical proteins. The hypothetical nature of these genes prevents an estimation of the causes of gene duplication in these strains, although one copy of the 22 kb portion of B. bronchiseptica RB50 is contained within an intact phage documented by the BacMap Genome Atlas website (Stothard et al. 2005). The second copy in this strain and both repeated segments in Caulobacter sp. K31 were not part of any documented phage (intact or otherwise), but their presence in two separate predicted islands within the chromosome could facilitate genomic island evolution.

It was expected that the number of genomic islands would have correlated with the percentage of genes annotated as involved in mobility, but this was not found to be the case. Rhodobacter sphaeroides 2.4.1 had 2.46% of the genes attributed to mobility functions, yet had a smaller number of islands than other strains with this percentage of mobile gene content and a more successful assembly in terms of N50 and maximum contig size. Given the high number of plasmids found in this strain it is reasonable that this high percentage of mobility functions relates directly to plasmid genes. These would not be expected to interfere with assembly since incompatibility prevents plasmids with highly similar transfer genes from co-existing within cells. It was interesting to note that although both of the mobility related metrics (% mobile genes and predicted genomic islands) correctly predicted Caulobacter sp. K31 to be the most difficult to assemble, the number of genomic islands was a better indicator of assembly complexity for Bordetella bronchiseptica RB50 than mobile gene content. In addition, the most logical genome characteristic to interfere with assembly would be repetitiveness (measured as % unique k-mers) but this also was not an invariant predictor of the ease of assembly.
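The two assembly metrics referred to above, N50 and the percentage of unique k-mers, can be illustrated with a minimal Python sketch. The function names are hypothetical, and the k-mer definition used here (distinct forward-strand k-mers that occur exactly once, as a share of all distinct k-mers) is an assumption for illustration only; the exact definition used to generate the values in this chapter may differ.

```python
from collections import Counter


def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


def percent_unique_kmers(sequence, k):
    """Percentage of distinct k-mers that occur exactly once.

    Assumption: forward strand only, no reverse-complement collapsing.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if not counts:
        raise ValueError("sequence is shorter than k")
    singletons = sum(1 for c in counts.values() if c == 1)
    return 100.0 * singletons / len(counts)
```

For example, contigs of 80, 70, 50, 40, 30, 20 and 10 kb sum to 300 kb, and the two longest contigs together reach half of that total, so the N50 is 70 kb.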

The validity of this work rests on the assumption that the simulated reads generated from the genomic data could be accurately assembled. There were no errors evident in any of the alignments performed from the Velvet unpaired simulated data when using a kmer length of 57, although there had been a number of single base pair errors introduced when using the standard settings and there were substantial SNPs introduced in the ABySS contigs (data not shown). This illustrates the high level of accuracy that Velvet can achieve with non-repetitive elements, as well as the high quality repeat recognition of this particular software program. In examining the distribution of the long reads from the Velvet assembly against chromosome 1 of CH34, the unassembled fragments tended to group together and these regions showed a clear association with the prevalence of repeated elements in the genomic islands. It is important to recognize that in an actual sequencing project the reconstruction of the genome would be further complicated by the presence of sequencing errors and variations in the level of coverage due to decreased amplification robustness, the latter of which may be more prominent in repetitive stretches due to the secondary structure formed by palindromic repeats (Jin, 2010). In comparing our simulated assemblies to the data available from the GAGE Velvet assembly of R. sphaeroides 2.4.1, it was clear that our correlation between the distribution of small contigs and the location of genomic islands was still valid when using real data.
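The link between repeats, k-mer length and contig termination can be sketched with a toy de Bruijn graph. This is a simplified illustration only and is not Velvet's implementation: it uses the forward strand of a single error-free sequence, and the function name and example sequences are hypothetical. Nodes are (k-1)-mers; a node with more than one successor or predecessor marks a point where an assembler cannot extend a contig unambiguously.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mers that have more than one distinct successor
    or predecessor in a forward-strand de Bruijn graph of the sequence."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        left, right = kmer[:-1], kmer[1:]
        successors[left].add(right)
        predecessors[right].add(left)
    nodes = set(successors) | set(predecessors)
    return sorted(
        node for node in nodes
        if len(successors[node]) > 1 or len(predecessors[node]) > 1
    )


# The 4 bp repeat ACGT occurs twice in different contexts.
toy_genome = "GGACGTCCACGTAA"
print(branching_nodes(toy_genome, 4))  # nodes inside the repeat branch
print(branching_nodes(toy_genome, 6))  # a longer k spans the repeat
```

In this toy case the repeat is shorter than k-1 when k = 6, so the graph is unbranched. Repeats the size of insertion sequences or duplicated gene clusters are far longer than any practical k-mer length, which is consistent with the contig terminations described in this chapter.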

Draft genome assemblies may lead us to unintentionally disregard the most important parts of prokaryotic genomes. Although eukaryotic genomes are more repeat rich than prokaryotic genomes, the reasons for this repetitiveness are vastly different between the kingdoms. In prokaryotic organisms, horizontal gene transfer is a prominent means of acquiring novel genes and rearrangements facilitated by mobile elements increase diversity. Insertion sequences can spread to high prevalence within a genome, and their activity may be specifically increased in response to changing environmental conditions. Since their behavior is strongly linked to adaptation, these elements are of great interest (Dobrindt et al. 2004). Larger mobile elements are primarily assimilations of smaller elements (Toussaint and Merlin, 2002) or serve as recombination sites for incoming genetic information (Coleman et al. 2006; Penn et al. 2009), with the result that genomic islands and large transposable elements are inherently resistant to computerized assembly. These regions are full of complete or partial mobile genetic elements and are therefore problematic for genome assembly, but ironically they are the most likely to carry the genes responsible for any novel traits under investigation, particularly if they were acquired horizontally. Assembly software alone is capable of reconstructing genes and complete operons, provided they are not interrupted by repetitive sequences or present in more than one copy within the genome (i.e. on separate replicons). In one study it was determined that the majority of genes can be reconstructed from even very short reads (25 bp); however, genes containing repeats (primarily intergenic repeats or mobile elements such as transposons, IS elements and prophages) account for the vast majority of the unassembled genome (Kingsford et al. 2010). In our study, 40 of the 75 contigs corresponding to chromosome 1 of CH34 were fully contained within genomic islands (Table 3.2.3) and an additional 16 contigs were found to overlap with the edge of a genomic island. Many of the functional genes contained within these genomic islands were assembled, indicating that examining the mid-range contigs (5-50 kb) of a draft assembly may be more informative in terms of recently acquired content. The genomic context of these newly acquired genes is lost when the data is left as a draft assembly, and the utility of the public databases is decreased by the introduction of incomplete or incorrect data.
As an example, the largest genomic island in CH34 (CMGI-1) is almost identical to a pathogenicity island (PAGI-2C) found in Pseudomonas aeruginosa clone C, indicating recent transfer between industrially contaminated sites and nosocomial pathogens (Van Houdt et al. 2009). Based on our Velvet assembly simulation, a draft assembly of CH34 would have left this island in pieces and evidence of this important transfer event would remain hidden. In our own laboratory, we have discovered a Recombinase in Trio (RIT) element adjacent to the chlorobenzoate degrading genes of Burkholderia sp. R172 (Accession number AY168634.1) that is homologous to one of the RIT elements found in CMGI-2 of CH34 (Van Houdt et al. 2009). This association was determined through Sanger sequencing, and was not apparent from the next-generation sequencing reads alone, which were generated by both Solexa (Illumina) and Roche 454 sequencing (Jin, 2010). Other sequenced strains available in the GenBank database reveal that this is not an isolated event. For example, there are two other homologous RIT elements found in the draft assembly of the PAH degrading strain Burkholderia sp. Ch1-1.


Prior to additional work that has recently improved the quality of this assembly, the contigs containing each of the RIT elements in this strain terminated at the edges of these elements, revealing absolutely no genomic context.

The role of genomic islands in bacterial adaptation is becoming increasingly clear, yet many of the genes contained within these islands have not been characterized (Penn et al. 2009). Indeed, a defining feature of genomic islands is a high abundance of conserved hypothetical proteins (Van Houdt et al. 2009). Understanding the possible roles of the multitude of currently hypothetical genes will require intensive experiments, and the development of these experiments may be hampered by the incomplete information included in draft assemblies (Phillippy et al. 2008). With decreasing sequencing costs, initial draft genomes are going to increase in prevalence, inundating the public databases with incomplete or fragmented genome projects which decrease the overall utility of these databases for other analyses particularly those relating to horizontal gene transfer. This issue has been addressed in a number of publications, and there are validation tools available that can aid in distinguishing mis-assemblies (Phillippy et al. 2008). We submit that many of the genes responsible for prokaryotic adaptation will be present in these highly recombinational or potentially mobile regions that are inherently resistant to automated assembly, and that therefore the necessity of extensive finishing experiments to not only close the created contigs but also to correct the introduced errors should be an important focus of any sequencing project. Furthermore, the very elements disrupting the automated assembly have a wealth of information to provide regarding the evolution and transferability of these genes, and also may have a role in the regulation of these important genomic regions. As technological improvements become available to ease the assembly of bacterial genomes, recognizing the high relative importance of these regions will be key to creating the incentive needed to pursue novel ways of finishing genomes - and improve our knowledge of bacterial adaptation.

3.4 Acknowledgements

Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D Scholarship to NR is gratefully acknowledged. The funding agency had no role in this study.


3.5 References

Sanger, F., S. Nicklen and A.R. Coulson. 1977. DNA sequencing with chain-terminating inhibitors, Proc. Natl. Acad. Sci. USA 74:5463-5467.

Miller, J.R., S. Koren and G. Sutton. 2010. Assembly algorithms for next-generation sequencing data, Genomics 95:315-327.

Wetzel, J., C. Kingsford and M. Pop. 2011. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies, BMC Bioinformatics 12:95. http://www.biomedcentral.com/1471-2105/12/95

MacLean, D., J.D.G. Jones and D.J. Studholme. 2009. Application of ’next-generation’ sequencing technologies to microbial genetics, Nat. Rev. Microbiol. 7: 287-296.

Medini, D., C. Donati, H. Tettelin, V. Masignani and R. Rappuoli. 2005. The Microbial Pan-Genome, Curr. Opin. Genet. Dev. 15: 589-594.

Pop, M. 2009. Genome assembly reborn: recent computational challenges, Briefings Bioinf. 10(4):354-366.

Chevreux, B., T. Wetter and S. Suhai. 1999. Genome sequence assembly using trace signals and additional sequence information, Comput. Sci. Biol.: Proc. German Conference on Bioinformatics GCB'99 GCB:45–56.

Phillippy, A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly, Genome Biol. 9:R55 (doi:10.1186/gb-2008-9-3-r55)

Nagarajan, N.C., M.D. Cook, H. G. Bonaventura, A. Richards, K.A. Bishop-Lilly, R. DeSalle, T.D. Read and M. Pop. 2010. Finishing genomes with limited resources: lessons from an ensemble of microbial genomes, BMC Genomics 11:242.

Ellegren, H. 2004. Microsatellites: Simple Sequences with Complex Evolution, Nat. Rev. Genet. 5:435-445.

Mayer, C., F. Leese and R. Tollrian. 2010. Genome-wide analysis of tandem repeats in Daphnia pulex – a comparative approach. BMC Genomics 11:277 (http://www.biomedcentral.com/1471-2164/11/277)

Lupski, J.R. and G.M. Weinstock. 1992. Short, Interspersed Repetitive DNA Sequences in Prokaryotic Genomes, J. Bact. 174(14):4525-4529.

Zhang, W., J. Chen, Y. Yang, Y. Tang, J. Shang and B. Shen. 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies, PLoS ONE 6(3): e17915. doi:10.1371/journal.pone.0017915

Van Houdt, R., S. Monchy, N. Leys and M. Mergeay. 2009. New mobile elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria, Antonie van Leeuwenhoek 96:205-226.

Janssen, P.J., R. Van Houdt, H. Moors, P. Monsieurs, N. Morin, A. Michaux, M.A. Benotmane, N. Leys, T. Vallaeys, A. Lapidus, S. Monchy, C. Medigue, S. Taghavi, S. McCorkle, J. Dunn, D. van der Lelie and M. Mergeay. 2010. The Complete Genome Sequence of Cupriavidus metallidurans Strain CH34, a Master Survivalist in Harsh and Anthropogenic Environments, PLoS ONE 5(5):e10433. Doi:10.1371/journal.pone.0010433.

Langille, M.G.I. and F.S.L. Brinkman. 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands, Bioinformatics. Jan. 16 (EPub). PMID: 19151094

Pevzner, P.A. and H. Tang. 2001. Fragment assembly with double-barreled data, Bioinformatics 17:S225–S233.

Kingsford, C., M.C. Schatz and M. Pop. 2010. Assembly complexity of prokaryotic genomes using short reads, BMC Bioinformatics 11:21 (http://www.biomedcentral.com/1471-2105/11/21)

Zerbino, D.R., G.K. McEwen, E.H. Margulies and E. Birney. 2009. Pebble and Rock Band: Heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler, PLoS ONE 4(12):e8407. Doi:10.1371/journal.pone.0008407

Simpson, J.T., K. Wong, S.D. Jackman, J.E. Schein, S.J.M Jones and I. Birol. 2009. ABySS : A parallel assembler for short read sequence data structures, Genome Research 19:1117-1123.

Monchy, S., M.A. Benotmane, P. Janssen, T. Vallaeys, S. Taghavi, D. van der Lelie and M. Mergeay. 2007. Plasmids pMOL28 and pMOL30 of Cupriavidus metallidurans are specialized in the maximal viable response to heavy metals, J. Bact. 189(20):7417-7425.

Drummond, A.J., B. Ashton, S. Buxton, M. Cheung, A. Cooper, C. Duran, M. Field, J. Heled, M. Kearse, S. Markowitz, R. Moir, S. Stones-Havas, S. Sturrock, T. Thierer and A. Wilson. 2010. Geneious v5.5, Available from http://www.geneious.com

Salzberg, S.L., A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marçais, M. Pop and J. A. Yorke. 2012. GAGE: A critical evaluation of genome assemblies and assembly algorithms, Genome Research 22:557-567.

Versalovic, J., T. Koeuth and J.R. Lupski. 1991. Distribution of Repetitive DNA Sequences in Eubacteria and Application to Fingerprinting of Bacterial Genomes. Nucleic Acids Res. 19(24):6823-6831.

Mahillon, J. and M. Chandler. 1998. Insertion sequences, Microbiol. Mol. Biol. Rev. 62:725-774.

Touchon, M. and E.P.C. Rocha. 2007. Causes of Insertion Sequences Abundance in Prokaryotic Genomes, Mol. Biol. Evol. 24(4):969-981.

Amadou, C., G. Pascal, S. Mangenot, M. Glew, C. Bontemps, D. Capela, S. Carrere, S. Cruveiller, C. Dossat, A. Lajus, M. Marchetti, V. Poinsot, Z. Rouy, B. Servin, M. Saad, C. Schenowitz, V. Barbe, J. Batut, C. Medigue and C. Masson-Boivin. 2008. Genome Sequence of the β-rhizobium Cupriavidus taiwanensis and comparative genomics of rhizobia, Genome Res. 18:1472-1483.

Tuanyok, A., B.R. Leadem, R.K. Auerbach, S.M. Beckstrom-Sternberg, J.S. Beckstrom-Sternberg, M. Mayo, V. Wuthiekanun, T.S. Brettin, W.C. Nierman, S.J. Peacock, B.J. Currie, D.M. Wagner and P. Keim. 2008. Genomic Islands from Five Strains of Burkholderia pseudomallei. BMC Genomics 9:566. doi:10.1186/1471-2164-9-566

Narzisi, G. and B. Mishra. 2011. Comparing De Novo Genome Assembly: The Long and Short of It.,PLoS ONE 6(4):e19175. doi:10.1371/journal.pone.0019175

Ohno, S. 1970. Evolution by gene duplication, Springer-Verlag, New York.

Siguier, P., J. Filee and M. Chandler. 2006. Insertion sequences in prokaryotic genomes. Curr. Opin. Microbiol. 9:526-531.

Stothard, P., G. Van Domselaar, S. Shrivastava, A. Guo, B. O'Neill, J. Cruz, M. Ellison and D.S. Wishart. 2005. BacMap: an interactive picture atlas of annotated bacterial genomes, Nucleic Acids Research 33:D317-D320.

Jin, S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172, M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.

Dobrindt, U., B. Hochhut, U. Hentschel and J. Hacker. 2004. Genomic islands in pathogenic and environmental microorganisms, Nat Rev Microbiol 2: 414–424.

Toussaint, A. and C. Merlin. 2002. Mobile Elements as a Combination of Functional Modules, Plasmid 47:26-35.

Coleman, M.L., M.B. Sullivan, A.C. Martiny, C. Steglich, K. Barry, E.F. DeLong and S.W. Chisholm. 2006. Genomic Islands and the Ecology and Evolution of Prochlorococcus. Science 311:1768 (doi: 10.1126/science.1122050)

Penn, K., C. Jenkins, M. Mett, D.W. Udwary, E.A. Gontang, R.P. McGlinchey, B. Foster, A. Lapidus, S. Podell, E.E. Allen, B.S. Moore and P.R. Jensen. 2009. Genomic islands link secondary metabolism to functional adaptation in marine Actinobacteria. ISME J. 3:1193-1203.

Zerbino, D.R. and E. Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res. 18:821-829.

Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool, J. Mol. Biol. 215:403-410.


Chapter 4 Phylogeny and Organization of Recombinase in Trio (RIT) Elements

Acknowledgements and Contributions: This chapter is reproduced as published in Plasmid, with some modifications (Ricker, N., Qian, H., & Fulthorpe, R. R. 2013. Phylogeny and organization of recombinase in trio (RIT) elements. Plasmid, 70(2), 226-239 doi:10.1016/j.plasmid.2013.04.003).

4 Introduction

A mobile genetic element (MGE) is defined as any discrete segment of DNA that can move within or between genomes (Frost et al., 2005) and is inclusive of plasmids, phages, integrative conjugative elements (ICEs), and the myriad of smaller elements capable of inter- or intra-cellular movement (classified as transposable elements; see Roberts et al., 2008). The mobility of some of these elements occurs through the action of site-specific recombinases (SSRs), which are divided into two classes defined by an absolutely conserved residue integral to the active site (tyrosine or serine). Tyrosine recombinases (TBSSRs, often just referred to as integrases) are extremely diverse, sharing only 3 absolutely conserved residues among all members described to date (Nunes-Düby et al., 1998); nevertheless, 24 sub-families are described in the NCBI conserved domain database (Marchler-Bauer et al., 2005). A recent in-depth analysis of TBSSRs has instead divided the known representatives into 56 families of 4 or more elements, which were found to correlate with type of mobile genetic element (plasmid, phage or prophage) in 87% of the families (Van Houdt et al., 2012). This analysis suggests that the functional roles of these elements may be different between the families and may be directly related to the nature of the mobile elements they are associated with.

Recombinase in Trio (RIT) elements were first defined in 2009 in Cupriavidus metallidurans CH34 (Van Houdt et al., 2009). The original description noted the common occurrence of conserved elements comprising three TBSSRs with overlapping open reading frames. The three tyrosine recombinases in these elements were all of similar size, generally with the largest enzyme first and the smallest enzyme in the middle. These elements were postulated as being independently mobile for three reasons: the diversity of organisms found to be harboring homologous elements, specific gene interruptions implying targeted integration, and the presence of highly similar elements in more than one location in the same genome, as is often seen in transposons. After discovering a RIT element in a chlorobenzoate degrading strain in our own lab, we decided to further investigate the distribution of these elements in currently available genomes in order to characterize their associations and potential for mobility.

4.1 Methods

We used the NCBI databases and BLAST analysis tools (Altschul et al., 1997) to obtain progressively less homologous sequences to the two original RIT elements found in C. metallidurans CH34 (Van Houdt et al., 2009). All similarity matches that still conformed to a three adjacent recombinase format were utilized for additional searches. The three recombinases from each of the intact elements were analyzed through BlastP comparison to the Conserved Domain Database (CDD) (Marchler-Bauer et al., 2005) and the highest scoring matches were consistently to the pAE1, SG4 and SG5 sub-families of tyrosine recombinases, respectively (in order of transcription). Therefore all members of these sub-families from the conserved domain database were investigated for inclusion as RIT elements. A random sampling of enzymes from other subfamilies was also investigated in order to determine the ubiquity of the triad arrangement. Organization into clusters and identification of key features were based on BLAST homology. Automated multiple alignments were performed using the Muscle alignment program (Edgar, 2004) within Geneious (Drummond et al., 2011). Neighbour joining trees were also prepared in Geneious, using Jukes-Cantor models and bootstrap re-sampling with 100 replicates. For nucleotide comparisons a 70% support threshold was used (no outgroup for full RIT element trees; delta-Proteobacteria outgroup for 16S since there was only one representative from this class). Amino acid comparisons were prepared using an 80% support threshold, using a RIT element from Acidithiobacillus as the outgroup to anchor the trees.
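As an illustration of the distance model named above, the Jukes-Cantor correction for nucleotide sequences converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The following is a minimal sketch under stated assumptions rather than a description of the Geneious implementation: the function names are hypothetical, the input sequences are assumed to be already aligned, and alignment columns containing a gap in either sequence are simply skipped.

```python
import math


def p_distance(aligned_a, aligned_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor_distance(aligned_a, aligned_b):
    """Jukes-Cantor distance for nucleotides: d = -3/4 * ln(1 - 4/3 * p)."""
    p = p_distance(aligned_a, aligned_b)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

A matrix of such pairwise distances is the input to neighbour joining. The corrected distance is always at least as large as the raw proportion of differences, because it allows for multiple substitutions at the same site.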

4.2 Results and Discussion

4.2.1 Abundance and Occurrence in Database

Through our homology searches of the NCBI database, we were able to find 148 sequences containing three adjacent tyrosine recombinases that we classified as putative RIT elements. These elements were separated into groups based on homology to the third recombinase (see section 4.2.3), and the information for these groups is listed in Supplemental Table S3. These putative RIT elements were obtained from 63 different genera across 7 phyla of bacteria and this is not expected to be an exhaustive list given the diversity of the elements found. As summarized in Table 4.2.1, the Proteobacteria accounted for the majority of the strains (25, 17, 7 and 1 strains from the alpha through delta classes, respectively, representing 59.5% (50/84) of the total strains). This was a significant divergence from the expected representation both for Proteobacteria in general (which represent 42% of the genomes in the NCBI database) as well as for the alpha-, beta- and gamma-Proteobacteria individually. The gamma-Proteobacteria are the most abundant in the database; however, both alpha- and beta-Proteobacteria had higher representation in the strains harbouring RIT elements (Figure 4.2.1). There is the possibility that this is an artifact of beginning the homology search with a beta-Proteobacteria representative, but this would not fully account for the abundance of alpha-Proteobacteria found. It is likely, however, that the majority of these RIT elements are connected by plasmid distribution and that the small number of isolated elements from particular phyla represent rare transfer events. This is supported by the fact that searches initiated from the gamma-Proteobacteria and other poorly represented groups consistently returned results from the alpha- or beta-Proteobacteria representatives. There could potentially be other RIT elements that are more broadly distributed among gamma-Proteobacteria or other phyla that we were not able to detect since they were not homologous enough to the RIT elements found to date.
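The comparison between the observed share of Proteobacteria among RIT-bearing strains (50 of 84) and their share of the database (42%) can be illustrated with an exact two-sided binomial test. This is a sketch only: the statistical test actually used for Figure 4.2.1 is not specified here, so the choice of a binomial test and the function name are assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(successes, trials, expected_proportion):
    """Exact two-sided binomial p-value.

    Sums the probabilities of every outcome that is no more likely than
    the observed one under the expected proportion.
    """
    q = 1.0 - expected_proportion
    pmf = [
        comb(trials, i) * expected_proportion**i * q**(trials - i)
        for i in range(trials + 1)
    ]
    # Small tolerance so that outcomes tied with the observed one are included.
    threshold = pmf[successes] * (1.0 + 1e-9)
    return min(1.0, sum(prob for prob in pmf if prob <= threshold))


# Values taken from the text: 50 of 84 strains are Proteobacteria,
# against an expected 42% based on the NCBI genome database.
observed_share = 50 / 84
p_value = binomial_two_sided_p(50, 84, 0.42)
print(f"observed share = {observed_share:.3f}, p = {p_value:.4g}")
```

Under these assumptions the observed share of 59.5% lies well above the expected 42%, and the p-value falls below the 0.05 threshold used in Figure 4.2.1.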

Table 4.2.1: Summary of information of putative RIT elements found in this study.

| Phylogeny – Taxonomic Grouping | No. RIT Elements | No. Strains | pAE1 range (aa) | SG4 range (aa) | SG5 range (aa) | Gene adjacent/interrupted |
|---|---|---|---|---|---|---|
| alpha-Proteobacteria | 45 | | | | | |
| Caulobacterales | 3 | 1 | 403 | 313 | 330 | DUF1738 |
| Rhizobiales | 20 | 10 | 305-425 | 304-373 | 281-362 | variable |
| Rhodospirillales | 11 | 7 | 228-508 | 303-454 | 329-335 | hypothetical, methylase |
| Sphingomonadales | 12 | 7 | 331-515 | 301-455 | 324-348 | hypothetical, methylase |
| beta-Proteobacteria | 35 | | | | | |
| Burkholderiales | 29 | 15 | 348-425 | 308-457 | 329-349 | variable (IS66, RadC, transposase, integrase) |
| Rhodocyclales | 5 | 1 | 411-417 | 310-325 | 294-332 | integrase |
| unclassified | 1 | 1 | 318 | 324 | 337 | hypothetical |
| gamma-Proteobacteria | 12 | | | | | |
| Acidithiobacillales | 2 | 1 | 414 | 311 | 331 | integrase catalytic unit |
| Alteromonadales | 6 | 2 | 321-417 | 312-322 | 327-335 | variable |
| Enterobacteriales | 2 | 2 | 315, 419 | 308, 330 | 337, 338 | RadC, methylase |
| Legionellales | 1 | 1 | 418 | 332 | 335 | hypothetical |
| Pseudomonadales | 1 | 1 | 411 | 323 | 354 | trbI conjugative genes |
| delta-Proteobacteria | 1 | | | | | |
| Desulfobacteriales | 1 | 1 | 409 | 338 | 337 | reverse transcriptase |
| Acidobacteria | 3 | | | | | |
| Solibacterales | 3 | 1 | 412, 710 | 314, 452 | 332, 336 | integrase, hypothetical |
| Actinobacteria | 19 | | | | | |
| Actinomycetales | 5 | 5 | 304-511 | 308-332 | 329 | integrase, transposase, DNA directed reverse transcriptase |
| Bifidobacteriales | 14 | 6 | 400 | 321 | 351 | transposase, integrase |
| Bacteroidia | 7 | | | | | |
| | 5 | 5 | 407-426 | 313-341 | 336-343 | hypothetical |
| Flavobacteriales | 2 | 1 | 425 | 330 | 337 | RadC |
| Cytophagales | 1 | 1 | 422 | 327 | 337 | DNA repair protein |
| Firmicutes | 13 | | | | | |
| Clostridiales | 12 | 11 | 404-537 | 283-334 | 337-342 | variable |
| Bacillales | 3 | 2 | 407-413 | 327-329 | 338-340 | IstB domain-containing protein ATP-binding protein |
| Verrucomicrobia | 4 | 1 | | | | |
| Opitutales | 4 | 1 | 432 | 330 | 336 | MerR regulator |
| Planctomycetes | 1 | | | | | |
| Planctomycetales | 1 | 1 | 419 | 321 | 330 | hypothetical |


Figure 4.2.1: Comparison of the taxonomic representation of our RIT collection with the abundance of the same taxonomic grouping in the NCBI genome database. The NCBI numbers included both completed genomes and incomplete sequencing projects. Significant differences (α = 0.05) are indicated with a double asterisk.

4.2.2 RIT Structure and Organization

As mentioned in the Introduction, the NCBI Conserved Domain Database currently has 24 described subfamilies of tyrosine recombinases. All of the elements had one gene from each of the three subfamilies pAE1, SG4 and SG5, and these were always found in the same order and orientation (Figure 4.2.2; discussed in section 4.2.3). This pattern was also confirmed in the recent work examining the distribution of tyrosine-based site-specific recombinases on different types of mobile elements (Van Houdt et al., 2012). In that work the three families of tyrosine-based site-specific recombinase specifically involved in the formation of RIT elements were designated FamilyIntegrase (FamInt) 1, 5 and 2 (also in order of transcription) and were documented as having 64, 54 and 63 members, respectively. The number of included elements was more conservative than in our study, as inclusion was based on confidence in family membership for each individual recombinase. In our study we used the trio arrangement of these recombinases as the hallmark of these novel elements, and so we included elements that had individual genes for which there was lower confidence in the family designation (see section 4.2.3). In addition to the 148 putative RIT elements (i.e. genes in trios), we found only 15 sequences that corresponded with an individual recombinase from one of these sub-families but were not found in a trio. In addition, there were 20 putatively degraded RIT elements. The latter were distinguished by the presence of one or two documented recombinases and pseudogenes or small ORFs in the remainder of the corresponding region. There was no pattern evident in terms of which recombinase was missing in these degraded structures. These recombinases may be RIT remnants due to inactivation or ancient distribution of these elements but may also indicate that some or all of these subfamilies can function outside of the RIT arrangement.

As can be seen in Table 4.2.1, there is also a wide range of sizes observed for each of the recombinases. This is particularly evident in the pAE1 (Int1) recombinase, which varies from 305 to 710 amino acids in length. The SG4 (Int2) recombinase is less variable (283-457 amino acids), and the third (SG5, Int3) even less so (281-351 amino acids), but the individual variations may also be an artifact of automated annotations. The pattern of sizes originally described for these elements (largest first, smallest in the middle) is also variable. The largest recombinase is in the middle position for 10 elements and in the third position for 6 elements. Although these RIT elements cannot be assumed to be active, the presence of 6 similar elements with Int2 as the largest enzyme in 5 different members of the Sphingomonadales suggests that variation in the pattern of sizes is tolerated.

Figure 4.2.2: Names and arrangements of tyrosine recombinase sub-families. Names are families according to the NCBI conserved domain database for the three integrases that comprise the RIT elements (see section 4.3.3). Arrows indicate the direction of transcription for Int1-3 (in order of transcription). The inverted repeats (IR) have only been confirmed in a small number of the putative RIT elements (see section 4.4.3).


4.2.3 Inferred RIT Functionality

Within our putative RIT elements we saw a broad diversity of recombinase sizes and amino acid sequences, but conservation was always highest in the C-terminal region of each enzyme. This is commonly found in the tyrosine recombinases and in other characterized phage integrases in which the N-terminal region is involved in site recognition and the C-terminal region contains all of the catalytic sites (Esposito and Scocca, 1997). The consistency of this finding across the intact RIT elements examined in this study implies that all three of these recombinases are being selectively maintained in this arrangement. The CDD utilizes these conserved regions to support inclusion of novel phage integrases into each of the currently outlined sub-families. If the novel enzyme contains sufficient conserved residues to surpass a pre-determined domain specific threshold then it is designated as specific to that particular sub-family. For the tyrosine recombinases, we have found no literature to date that investigates the functionality of the individual sub-families, which limits the utility of these classifications for evaluating whether each recombinase is functional. However, of the 148 RIT elements included in our assessment, 93% have at least one recombinase that meets the specific criteria for inclusion in the designated subfamily. Int3 (SG5) is the least divergent of the three recombinases (Figure 4.2.3A,B), and 105 of the RIT elements (71%) have domain specific SG5 genes. 66% of the elements have domain specific pAE1 (Int1) genes, while only 25% have domain specific SG4 genes (Int2). In 36%, both pAE1 and SG5 are domain specific. Only 17 (11%) contain the amino acid residues required for designation of all three integrases as pAE1, SG4 and SG5 by the CDD. For the remainder, the top (most significant E-value) non-specific sub-family hit was consistently to the expected group based on position within the RIT (i.e. Int1, 2 or 3).

If we infer functionality by the presence of duplicate elements in a genome, we can postulate whether those recombinases lacking the threshold number of conserved residues for subfamily designation may still be active. There are 19 species containing more than one identical RIT element within their genome, only 4 of which have all three subfamily specific integrases. There are 6 instances of genomes with identically duplicated RIT elements that are lacking subfamily specific SG4 genes (section 4.2.4.1) and also 6 closely related strains with duplicated elements lacking a subfamily specific SG5 gene (section 4.2.4.2). There is a single instance of identical elements in a genome without a subfamily specific pAE1 recombinase (in Sinorhizobium meliloti 1021 pSymA) and a separate genome (Dinoroseobacter shibae DFL12) has identical elements where only the SG5 recombinase is subfamily specific. This may imply that all three enzymes are not strictly required for mobility, or that one or more of the residues currently used to delineate a subfamily specific enzyme are not necessary for this function.

Figure 4.2.3: Comparison of conservation between the Int1 (pAE1) recombinases (A - top) and Int3 (SG5) recombinases (B - bottom) from 40 divergent representatives. Level of conservation is illustrated through shading (dark lines represent conserved amino acids). As can be seen, both enzymes increase in conservation towards the C-terminal, and conservation is higher in the third recombinase.

4.2.4 Evidence for RIT Mobility Within Closely Related Strains

For the purposes of this discussion, we are making the assumption that identical RIT sequences in the same strain imply mobility, and that high levels of similarity between RIT elements located in different strains or species are evidence of horizontal transfer, likely via an intermediary replicon. In their paper originally defining RIT elements, Van Houdt et al. (2009) described two non-homologous RITs in Cupriavidus metallidurans CH34. The first of these RIT elements (RITCme1) bears high nucleotide identity (greater than 90%) to truncated RIT elements in Cupriavidus necator H16 pHG1 (two identical inverted RIT element fragments close together with integrase remnants in between) and to a degraded RIT element in Burkholderia sp. str. CCGE1002 (with only Int2 still listed as intact and the others listed as pseudogenes). In addition, RITCme1 shares 84% nucleotide identity with two identical RIT elements in the unassembled whole genome sequence data of the PAH degrading strain Burkholderia sp. Ch1-1 and with our newly identified element RITBphyt1 (Jin, 2010). RITBphyt1 was discovered in a chlorobenzoate degrading Russian soil isolate designated Burkholderia phytofirmans OLGA172 (formerly R172). In this strain, the RIT element is found adjacent to the chlorocatechol degradative operon (Jin, 2010). There are no chlorocatechol degrading genes found in C. metallidurans CH34, indicating that the genes adjacent to the RIT element are not shared between these two strains; however, each of these strains does have partial IS66 elements overlapping the RIT elements, which may represent the target site for insertion.

Our dataset did reveal two clusters of RITs sharing greater than 85% nucleotide identity over their full lengths: one cluster of RITs found in Acidiphilium/Caulobacter strains, and another cluster of RITs from Bifidobacterium longum. Each of these shows evidence of recent mobility in that 1) 100% identical sequences are found in different locations within individual strains, 2) 80-100% identical sequences occur in separate species, and 3) the RIT elements share higher identity than the surrounding genes. Interestingly, although gene synteny appears to be conserved in many members of the Bifidobacterium cluster, the adjacent genes in the Acidiphilium/Caulobacter cluster are highly variable, indicating that the RIT elements have not been mobilized as part of a larger element. Details on these two informative groups are given below.
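The identity thresholds used to define these clusters can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity over ungapped columns only; the function name and this gap convention are our assumptions for illustration, not the procedure used to generate the values reported here. The example input is the pair of 25 bp repeat regions listed for B. phytofirmans OLGA172 and C. metallidurans CH34 in Table 4.2.2, not full-length RIT elements.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# 25 bp repeat regions taken from Table 4.2.2 (short illustrative input only).
olga172 = "TTATGCCGATTCCCGGATTATGCCG"
ch34 = "TTATGCCGACTCCCCGATTATGCCG"
identity = percent_identity(olga172, ch34)
print(f"{identity:.1f}% identity; above the 85% cluster threshold: {identity > 85.0}")
```

For full-length elements the same calculation would be applied to a pairwise or multiple alignment of the complete RIT sequences.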

4.2.4.1 Caulobacter/Acidiphilium cluster

This cluster of highly similar RITs comes from the genomes of three strains and the plasmids they contain. Two of the strains are from the genus Acidiphilium, while the third is from the genus Caulobacter; the two genera share approximately 86% 16S rRNA sequence identity.

Caulobacter sp. K31 is a chlorophenol degrader isolated from groundwater in Finland (Männisto et al., 2001). There are two identical RIT elements on the K31 chromosome, and another identical copy on one of the two plasmids in this strain (pCAUL02, length 178 kb). Acidiphilium multivorum AIU301 is an aerobic, anoxygenic and phototrophic bacterium from pyritic acid mine drainage, well known for its metal tolerance (www.bio.nite.go.jp/dogan/project/view/AM1). A. multivorum AIU301 carries one RIT element on the chromosome and 2 identical copies on one of its 8 plasmids (pACMV1, length 272 kb). Acidiphilium cryptum JF-5 is a facultative iron-respiring strain isolated from coal mine lake sediment (www.ncbi.nlm.nih/bioproject/58447). A. cryptum JF-5 shows high gene synteny with A. multivorum AIU301 except for a 225 kb region from AIU301 that is a probable genomic island (www.bio.nite.go.jp/dogan/project/view/AM1). There is no RIT element on the A. cryptum JF-5 chromosome; however, this strain also carries 8 plasmids. One of these (pACRY01, 203 kb) carries a RIT element that is identical to those found in A. multivorum AIU301, except that one of the inverted repeats is only 97% similar. A second plasmid in this same strain (pACRY03, 89 kb) carries a RIT element that bears 84% nucleotide identity with the RIT element on pACRY01, and 82% sequence identity across 92% of the length of the RIT elements in Caulobacter sp. K31 (no significant alignment to 238 bp of int1).

The RIT elements in this cluster are clearly moving as one intact unit, since within each individual strain the nucleotide identity is 100% for the entire RIT element, including the three recombinase genes and the additional sequence between the recombinase genes and the inverted repeats. Similarity in the gene fragments surrounding the RIT elements is suggestive of specific target genes for integration, in this case the DUF1738 gene (also sometimes annotated as an anti-restriction protein; Figure 4.2.4). This is supported by the fact that the target gene is also consistent on both the pACRY01 and pACRY03 plasmids, and there is no copy of the interrupted gene (DUF1738) on the A. cryptum JF-5 chromosome or any of the other 6 plasmids in that strain, consistent with the RIT element occurrence. Interestingly, as outlined in Figure 4.2.4 for the Caulobacter sp. K31 RIT elements, although the elements appear to have integrated into homologous genes, the orientation of the RIT element relative to the target gene is not always consistent and has affected the gene annotation.

The RIT elements found in A. multivorum AIU301 share 83% nucleotide identity with those found in Caulobacter sp. K31, and the terminal inverted repeats are almost identical between the two genera. The Caulobacter strain shows perfect 34 bp repeats for each of the RIT elements, whereas the Acidiphilium RIT elements have a SNP in the 5’ repeat and the repeats do not extend the full 34 bp (they therefore form imperfect repeats of 30-33 bp in length). Despite the decreased identity, all of these inverted repeats have 8 bp regions that are absolutely conserved (discussed in section 4.2.4.4). The interrupted genes in A. multivorum AIU301 are all annotated as hypothetical proteins; however, the protein upstream of the RIT element found on the chromosome has 73% and 70% homology, respectively, with the DUF1738 protein fragments found surrounding RIT1 and RIT2 on the K31 chromosome.

Figure 4.2.4: Arrangement of RIT elements on the chromosome of Caulobacter sp. K31. The two RIT elements and inverted repeats are identical and are found within the same gene (DUF1738); however, the orientation is reversed and the DUF1738 nucleotide identity is not as high as within the RIT element. When inserted in the correct orientation, the RIT element appears to restore the DUF1738 sequence, but this is not the case when the orientation is reversed.


4.2.4.2 Bifidobacterium longum cluster

The Bifidobacterium longum cluster consists of 15 RIT elements sharing 99-100% nucleotide identity distributed across 7 strains of these intestinal bacteria. These RIT elements have been previously characterized as Mobile Integrase Cassettes (MIC) (Lee et al., 2008), and a search of other intestinal bacteria led those researchers to suggest that these elements may be unique to the Bifidobacteria. Five of these strains contain multiple copies, and almost all copies are flanked by similar transposases that range from 68 to 100% nucleotide identity. In the strains with more than one RIT element, one of the elements is commonly found in the reverse orientation with respect to the direction of transcription of the transposase gene, which is consistent with the duplicate RIT elements in both Caulobacter sp. K31 and A. multivorum AIU301. The combination of reversed relative orientation and the decreased level of nucleotide identity between the transposases implies that the transposases may be a target of the RIT elements and not responsible for their movement. However, unlike the Caulobacter/Acidiphilium cluster, within these genomes there are other homologous transposases (up to 99% nucleotide identity) that have not been interrupted by a RIT element.

The Bifidobacterium longum strains are all very similar (99% nucleotide identity for 16S), suggesting that the similarities observed in the RIT elements may simply be associated with vertical transmission rather than duplication and mobility. In some circumstances there is a high degree of surrounding gene synteny, which supports this interpretation. There is, however, also evidence for significant genetic rearrangements specific to the RIT elements themselves. B. longum DJ010A has three RIT elements. One of these elements, RIT1, is surrounded by genes that are 99% conserved in the other B. longum strains, and the genes are found in the same order. The genes surrounding RIT2 and RIT3 are conserved in other B. longum strains as well; however, the gene synteny is not preserved, as these genes are found scattered throughout the other genomes. The only strain that does show high gene synteny with the genes surrounding RIT2 is B. longum F8; however, the RIT element itself is annotated as occurring in the opposite orientation in relation to the surrounding genes.

4.2.4.3 Target Sites

Although hampered by issues with incomplete annotations due to gene interruptions, an examination of the full collection of RIT elements clearly indicates that there are specific genes that serve as target sites for integration. The genes immediately adjacent to or interrupted by the elements are commonly the same in cases of multiple identical elements within one strain, and between different strains harboring elements with >65% SG5 protein identity. In addition to the genes described in sections 4.2.4.1 and 4.2.4.2, clusters of RIT elements were also found associated with IS66, RadC, methylase/helicase genes and integrase genes. There is even a RIT element in Aromatoleum aromaticum EbN1 which appears to have interrupted a second RIT element. Whether the variability in gene targets stems from sequence evolution or from the lack of the original target site in individual strains has not been investigated, and a specific target sequence within these genes could not be determined.

4.2.4.4 Inverted Repeats

The tyrosine recombinases are a highly diverse family of proteins, with variable complexity in both their DNA binding sites and their requirement for accessory proteins (Azaro and Landy, 2002; Rajeev et al., 2009). The presence of multiple identical RIT elements in different parts of the genomes of some strains revealed terminal features that may be involved in recombinase binding or regulation. Alignment of the Bifidobacterium longum RIT sequences identified a 97 bp inverted repeat that is absolutely conserved and always 41 bp from Int1 and 3 bp from Int3. The inverted repeats identified in the Caulobacter and Acidiphilium strains were only 30-34 bp in length, followed by a section of presumably non-coding sequence between the inverted repeats and the recombinase genes (illustrated in Figures 4.2.2 and 4.2.4). Alignment of the inverted repeats and non-coding sequence from the Caulobacter and Acidiphilium strains with the long inverted repeats from the Bifidobacterium revealed an interesting pattern of smaller repeats that may serve as recognition or regulatory sites for the recombinases (Table 4.2.2). Within the inverted repeats, there were two highly conserved direct repeats of T(A/T)ATGCCG with a 9 bp intervening sequence. Furthermore, an inverted copy of this repeat was found at an interval of 48-49 bp towards the recombinase genes. This pattern was also confirmed at both ends of the RIT elements for our B. phytofirmans OLGA172 strain, C. metallidurans CH34, and Burkholderia sp. Ch1-1. Whether this indicates relatedness of these RIT elements to the Caulobacter/Acidiphilium cluster is not clear. In addition, 12 bp direct repeats separated by 5 bases were found in both Candidatus Solibacter usitatus Ellin6076 and Gramella forsetii KT0803 (inverted copy at a distance of 43 bp in both cases). Similar partial patterns were found in other strains as well, but more information is needed to determine their relevance.
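The repeat architecture described above (two T(A/T)ATGCCG direct repeats separated by 9 bp, with an inverted copy 48-49 bp further towards the recombinase genes) can be expressed as a simple sequence scan. The following is a minimal sketch rather than the alignment-based procedure used in this study. The function name and the toy test sequence are hypothetical, and the choice to compare the downstream copy against the reverse complement of the second direct repeat is an assumption based on the entries in Table 4.2.2.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# Two T(A/T)ATGCCG repeats with a 9 bp intervening sequence. The lookahead
# lets overlapping candidate sites be reported as well.
DIRECT_REPEATS = re.compile(r"(?=(T[AT]ATGCCG)[ACGT]{9}(T[AT]ATGCCG))")


def find_repeat_sites(seq: str, gaps=(48, 49)):
    """Return (start, first_repeat, second_repeat, gap) for every site where the
    second direct repeat is followed, after `gap` bases, by its reverse complement."""
    seq = seq.upper()
    hits = []
    for match in DIRECT_REPEATS.finditer(seq):
        end = match.start() + 25  # 8 bp repeat + 9 bp spacer + 8 bp repeat
        inverted = reverse_complement(match.group(2))
        for gap in gaps:
            if seq[end + gap:end + gap + len(inverted)] == inverted:
                hits.append((match.start(), match.group(1), match.group(2), gap))
    return hits


# Hypothetical toy sequence: the Caulobacter sp. K31 repeat region from
# Table 4.2.2, 49 bp of arbitrary filler, then the inverted copy.
toy = "GG" + "TAATGCCGCGATCCGGATTATGCCG" + "A" * 49 + "CGGCATAA" + "GG"
print(find_repeat_sites(toy))
```

A scan of this kind would miss the Bifidobacterium longum variant (TTAAGCCG) and the 12 bp repeats of the Solibacter and Gramella elements, so the motif and spacing would need to be relaxed for a broader search.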

There is evidence of RIT mobilization of adjacent genes in two bacteria. Opitutus terrae PB90-1 and Dinoroseobacter shibae DFL-12 each have identical sequences that extend beyond the RIT element but do not have any other mobile elements associated with them. In O. terrae PB90-1, this identical sequence (including the RIT) was found in four copies in the genome; the region extending beyond the RIT element was 1.6 kb in length and contained a merR regulator, a heavy metal transport/detoxification gene and a hypothetical protein. In D. shibae DFL-12, there were two copies of the RIT and 2.7 kb of additional sequence, including a gene annotated as a type III restriction protein subunit, found on two different plasmids within this strain. In both of these circumstances, the copied regions are flanked by inverted repeats. In O. terrae PB90-1, this 37 bp inverted repeat had 9 bp imperfect direct repeats ((A/T)G(T/C)TATGTG) separated by 8 bp and an inverted copy at a distance of 49 bp, consistent with the pattern observed in the other strains. In D. shibae DFL-12, the region was flanked by larger inverted repeats of 124 bp (bringing the upstream repeat to within 2 bp of the start codon for the RIT element). An imperfect direct repeat separated by 9 bp was found within this region ((A/T)TATGC(C/G)G); however, no clear inverted version was identified.

Table 4.2.2: Potential recognition or regulatory sites contained within terminal inverted repeats. The sites occur at a precise distance between the repeats and the coding sequence for the recombinase genes. Bolded bases are direct repeats contained within the terminal inverted repeats and for which there is an inverted copy at a precise distance in the direction of the recombinase genes.

Strain                               5’ Sequence
Burkholderia phytofirmans OLGA172    TTATGCCGATTCCCGGATTATGCCG..49..CGGCATAA
Cupriavidus metallidurans CH34       TTATGCCGACTCCCCGATTATGCCG..49..CGGCATAA
Burkholderia sp. Ch1-1               TTATGCCGACTTCCCGATTATGCCG..49..CGGCATAA
Caulobacter sp. K31                  TAATGCCGCGATCCGGATTATGCCG..49..CGGCATAA
Acidiphilium multivorum AIU301       TAATGCCGAGATCCGGATTATGCCG..49..CGGCATAA
Bifidobacterium longum NCC2705       TTAAGCCGGGTTTGTTGTTAAGCCG..48..CGGCTTAA
Frankia sp. EAN1pec                  TTATGCCGAGGGCCGGGTTATGCCG..49..CGGCATAA
Novosphingobium sp. PP1Y             TAATGCCGTGACCCGGATTATGCCG..49..CGGCATAA
Candidatus S. usitatus Ellin6076     ACTATGCCGCGTCCCGGACTATGCCGCGT..43..ACGCGGCATAGT
Gramella forsetii KT0803             ATTATGTAAAGTAAATTATTATGTAAAGT..43..ACTTTACATAAT
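The internal consistency of the entries in Table 4.2.2 can be checked directly: in each row the downstream copy should be the reverse complement of the second direct repeat. The sketch below is a simple illustrative check written for this purpose; the helper names are ours, and the sequences are copied from the table as printed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# (strain, 5' repeat region, gap in bp, downstream inverted copy) from Table 4.2.2
TABLE_4_2_2 = [
    ("Burkholderia phytofirmans OLGA172", "TTATGCCGATTCCCGGATTATGCCG", 49, "CGGCATAA"),
    ("Cupriavidus metallidurans CH34", "TTATGCCGACTCCCCGATTATGCCG", 49, "CGGCATAA"),
    ("Burkholderia sp. Ch1-1", "TTATGCCGACTTCCCGATTATGCCG", 49, "CGGCATAA"),
    ("Caulobacter sp. K31", "TAATGCCGCGATCCGGATTATGCCG", 49, "CGGCATAA"),
    ("Acidiphilium multivorum AIU301", "TAATGCCGAGATCCGGATTATGCCG", 49, "CGGCATAA"),
    ("Bifidobacterium longum NCC2705", "TTAAGCCGGGTTTGTTGTTAAGCCG", 48, "CGGCTTAA"),
    ("Frankia sp.", "TTATGCCGAGGGCCGGGTTATGCCG", 49, "CGGCATAA"),
    ("Novosphingobium sp. PP1Y", "TAATGCCGTGACCCGGATTATGCCG", 49, "CGGCATAA"),
    ("Candidatus S. usitatus Ellin6076", "ACTATGCCGCGTCCCGGACTATGCCGCGT", 43, "ACGCGGCATAGT"),
    ("Gramella forsetii KT0803", "ATTATGTAAAGTAAATTATTATGTAAAGT", 43, "ACTTTACATAAT"),
]


def check_row(left: str, right: str):
    """Return (spacer length, mismatches between the two direct repeats,
    whether `right` is the reverse complement of the second direct repeat)."""
    n = len(right)
    first, second = left[:n], left[-n:]
    spacer = len(left) - 2 * n
    mismatches = sum(a != b for a, b in zip(first, second))
    return spacer, mismatches, reverse_complement(second) == right


for strain, left, gap, right in TABLE_4_2_2:
    spacer, mismatches, inverted_ok = check_row(left, right)
    print(f"{strain}: spacer={spacer} bp, repeat mismatches={mismatches}, "
          f"gap={gap} bp, inverted copy matches={inverted_ok}")
```

Under this check the 8 bp repeats all carry a 9 bp spacer and the 12 bp repeats a 5 bp spacer, in agreement with the description in the text.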

4.2.5 Similarities between RIT elements and evidence for broad distribution.

Of the 148 putative RIT elements, 64 were chosen for phylogenetic analysis based on the nucleotide sequence of all three recombinases. These were chosen on the basis of having come from completed genomes, and were spread across 38 different genera. Only one representative was included if there were multiple identical elements within an individual strain (19 instances), and only one representative was used from the 17 nearly identical RIT elements in the Bifidobacterium cluster since the 16S sequences of these strains were also 99% identical. The 16S sequence from the main chromosome was used as a proxy for strain phylogeny, even in circumstances where the RITs were solely present on plasmids. As discussed in section 4.2.1, the RIT elements that we found were largely from Proteobacteria (Figure 4.2.5A); however, the presence of RIT elements from the Actinobacteria, Firmicutes, and Acidobacteria suggests that these elements are not restricted to any particular phylum of bacteria, and the diversity implies that they are an ancient feature of bacterial genomes.
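The rule of keeping a single representative for identical elements within a strain can be sketched as a simple de-replication step. This is an illustration of the selection logic only; the record layout, the element identifiers and the placeholder sequences below are hypothetical and are not data from this study.

```python
def pick_representatives(elements):
    """Keep the first element seen for each (strain, sequence) pair.

    `elements` is an iterable of (element_id, strain, sequence) tuples.
    """
    seen = set()
    kept = []
    for element_id, strain, sequence in elements:
        key = (strain, sequence.upper())
        if key not in seen:
            seen.add(key)
            kept.append(element_id)
    return kept


# Hypothetical records with placeholder sequences (not real RIT sequences).
records = [
    ("strainA_RIT1", "strain A", "ACGTACGTAA"),
    ("strainA_RIT2", "strain A", "ACGTACGTAA"),  # identical copy in the same strain
    ("strainA_RIT3", "strain A", "ACGTTCGTAA"),  # divergent element in the same strain
    ("strainB_RIT1", "strain B", "ACGTACGTAA"),  # same sequence, different strain
]
print(pick_representatives(records))
```

Collapsing near-identical elements across highly similar strains, as was done for the Bifidobacterium cluster, would additionally require a strain-similarity criterion (here 99% 16S identity) that this sketch does not model.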

A neighbour joining tree of all the putative RIT elements yields a tree with very long branches (Figure 4.2.5B), reflecting the deep divergence in the RIT elements obtained in this study. Most clusters harbour elements from more than one genus. Only three of these clusters are completely contained within one taxon (two clusters from the alpha-Proteobacteria and one from Actinomycetes). The other clusters are dominated by one taxon (commonly alpha- or beta-Proteobacteria, consistent with their abundance in the dataset) with unexpected additions from other classes or even other phyla.

The presence of multiple, diverse RIT elements in individual strains was a common finding. As highlighted in Figure 4.2.5B, Burkholderia phymatum STM815, C. metallidurans CH34, C. necator H16 pHG1, Mesorhizobium loti MAFF303099, Novosphingobium sp. PP1Y and Candidatus Solibacter usitatus Ellin6076 each have more than one RIT element assigned to very different RIT clusters. In all except Novosphingobium sp. PP1Y and C. metallidurans CH34 there are plasmids harboring RIT elements that could account for more recent transfer events. Many of the other RIT elements (including RITCme1 and RITCme2 from C. metallidurans CH34 and several RIT elements from the Rhizobiales) are contained within genomic islands; however, the lack of genomic island information for the majority of these strains prevented a full analysis of this relationship. An examination of the putative RIT elements from genomes that have been analyzed on the IslandViewer website (Langille, 2009) revealed that presence within a putative genomic island was common for RIT elements. Of the 20 strains that corresponded with our list and had been analyzed in IslandViewer, only 4 had their RIT elements separated from the putative islands (Acidithiobacillus ferrooxidans ATCC 53993, Aromatoleum aromaticum EbN1, Sinorhizobium fredii NGR234 and Opitutus terrae PB90-1). All of the other RIT elements analyzed were found within predicted islands or within 1 kb of a predicted island. The small islands predicted by IslandViewer can prove to be components of one larger island, as is found in Mesorhizobium loti MAFF303099. A 610 kb region of this chromosome has been documented as a symbiosis island (Uchiumi et al., 2004); however, the prediction programs have instead illustrated it as 10-12 smaller islands. All three chromosomal RIT elements from this strain are contained within this symbiosis island, although only one of them is documented as occurring directly within the islands predicted by IslandViewer. Interestingly, the regions of the symbiosis island where the RIT elements occur are also high in transposases from a variety of families. In this strain there is a fourth RIT element found on the plasmid pMLa, and similarly in both Burkholderia phymatum STM815 and Burkholderia phytofirmans PsJN there are identical RIT elements that are present on both a genomic island on the chromosome and one or more plasmids, suggesting a possible route of intragenomic variability within these islands.
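The criterion applied above (a RIT element lying within a predicted island or within 1 kb of one) amounts to an interval-distance test. The following minimal sketch assumes half-open style coordinates given as (start, end) pairs on the same replicon; the function names and the coordinates in the example are hypothetical and are not taken from IslandViewer output.

```python
def distance_to_interval(start: int, end: int, other_start: int, other_end: int) -> int:
    """Gap in bp between two intervals on the same replicon (0 if they overlap)."""
    return max(0, other_start - end, start - other_end)


def near_predicted_island(element, islands, max_distance=1000):
    """True if the element overlaps, or lies within `max_distance` bp of, any island."""
    start, end = element
    return any(
        distance_to_interval(start, end, i_start, i_end) <= max_distance
        for i_start, i_end in islands
    )


# Hypothetical coordinates for illustration only.
predicted_islands = [(10_000, 20_000), (150_000, 162_000)]
print(near_predicted_island((12_000, 15_300), predicted_islands))  # inside an island
print(near_predicted_island((20_500, 23_800), predicted_islands))  # 500 bp from an island
print(near_predicted_island((25_000, 28_300), predicted_islands))  # 5 kb from the nearest island
```

Circular replicons and islands that wrap around the origin are not handled in this sketch.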

The environmental distribution of RIT elements was found to be quite broad, with representatives from environments as diverse as the head of an off-shore oil producing well and an intracellular amoebal pathogen. There was an even contribution of strains from each type of environment: 32% each from the combined soil/sediment/sludge environments and from all combined water environments (freshwater, seawater, groundwater and wastewater). In addition, 18% were specifically identified as plant-associated strains, and 16% were animal associated, including a small number of pathogens. Disregarding the animal associated and seawater environments (for which there is no straightforward delineation as clean or contaminated), almost half of the isolates (42%) have been isolated from contaminated environments. We note this could be due to a bias towards the study of these environments, particularly in light of the increased abundance of alpha- and beta-Proteobacteria in our RIT element collection compared to the NCBI database.


Figure 4.2.5: Phylogenetic analysis by 16S (A) and nucleotide sequence of the RIT elements obtained in this study (B). Only one representative is included for identical elements found within individual or highly similar strains (99% 16S identity). Scale bars are at the bottom of each figure and the re-sampling percentage is indicated at each node. Symbols in Figure 4.2.5B are used to illustrate different RIT elements found in individual strains.


4.2.6 RIT Classification

Van Houdt et al. (2012) created a classification system consisting of 11 types of RIT elements, based on assignment of the three recombinases to the same NCBI protein clusters. Four of these types (3, 4, 5 and 7) were further subdivided since one of the recombinases (commonly Int2) was associated with recombinases from more than one protein cluster, suggesting the possibility of modular evolution. We wanted to evaluate whether our larger collection of RIT elements showed congruent evolution of all three recombinase genes in these four groups of RIT elements.

Phylogenetic trees were constructed from the amino acid sequences of each of the three recombinases from 40 individual RIT elements. These sequences corresponded to the members of types 3A/B, 4A/B/C, 5A/B and 7A/B/C, as well as the other RIT elements from our collection that were found to cluster with these groups based on homology to the third recombinase. This resulted in the inclusion of the type 10 RIT element (Gramella forsetii KT0803), since it was found to cluster with the type 7 elements in our analysis. A single type 2 element (from Acidithiobacillus) was also included and used as an outgroup to root the three trees.

In our analysis (Figures 4.2.6A-C), it is clear that although the clustering occurs at different levels of similarity for the three recombinases, congruency is evident for each of the members of types 3, 4, 5 and 7. As the protein cluster memberships are based solely on modified BLAST scores (Klimke et al., 2009), they do not necessarily reflect phylogenetic relationships made evident by neighbour joining analysis. As an example, for each of the elements in type 3 and type 4, the Int2 recombinases are listed as being from the same protein cluster and the Int1 recombinases separate into different protein clusters. Yet in Figures 4.2.6A and 4.2.6B it is clear that the branching order relationships are preserved between the type 3 and type 4 proteins for both Int1 and Int2. The type 7 elements are a much more diverse group of recombinases, and from our analysis it would appear that the type 10 elements actually form one of several sub-clusters within this group, but the three recombinases show congruency nonetheless. More research is needed in order to determine an appropriate clustering cutoff for designation of RIT element types.


Figure 4.2.6: Individual congruency trees for each of the recombinases in a selection of RIT elements. The individual trees correspond to Int1 (A), Int2 (B) and Int3 (C). Number and letter designations are according to the types defined in Van Houdt et al. (2012).

4.3 Conclusions

In this analysis, we expand upon the findings of Van Houdt et al. (2012) by evaluating a more diverse collection of RIT elements. Through this collection we are able to confirm that the integrases are consistently from three subfamilies of tyrosine site-specific recombinases (pAE1, SG4 and SG5) in the same order and orientation. Although not all of these enzymes surpassed the specific domain threshold for inclusion in the individual subfamilies, the highest scoring hits were consistently to these groups, and our protein neighbour joining trees provide evidence that the genes have evolved together. Recognizing their association is very informative for furthering our understanding of these three sub-families of TBSSRs. Functionality (mobility) could be inferred for a small number of elements that are copied within a genome or in closely related strains. It should be noted that the intention of this study is not to suggest that all of the putative RIT elements we have outlined are active, but rather to examine the distribution of these elements in nature in the hopes of better understanding their role in bacterial genomes. Examining elements fitting within the description of a recombinase in trio allows for a better understanding of the overall distribution of these elements. The maintenance of three integrases belonging to specific sub-families of the tyrosine recombinases in the same arrangement and in a potentially active form in such a wide diversity of organisms clearly suggests that there is some benefit to their presence in genomes. It has yet to be determined whether their structural organization is due to high levels of co-transfer or is evidence of functional interdependency. It is our hope that the terminal repeat sequences outlined in this study will prove useful in furthering an understanding of the mode of action of these recombinases. Any discussion of the putative role of these repeats would be preliminary given the limited knowledge on the mode of action of these specific sub-families of tyrosine recombinases, together or in combination. There are no current models that are consistent with a role for all three recombinases.

RIT elements bearing high similarity (greater than 80% nucleotide identity) across their full length can be found within closely related genera (such as Burkholderia and Cupriavidus) in the absence of any other gene synteny. This suggests the elements in these strains can be mobilized as an intact unit rather than as a component of a larger transposable element. We expect that any horizontal transfer events are mediated by plasmid movement between these closely related strains, a logical supposition supported by the prevalence of identical RIT copies on both the chromosome and one or more plasmids within the same strain. Plasmids may also be responsible for transport of these elements over greater phylogenetic distances; however, the broad divergence of these elements suggests ancient origins.

It is clear from this assessment that RIT elements are capable of coordinated intracellular movement. It is not clear if the triad structure, though most common, is a strict requirement for the functioning of these enzymes since a small number (11%) of seemingly intact recombinases from the pAE1, SG4 and SG5 subfamilies were found by themselves or just in pairs. Many questions remain to be answered. It is clear however that several years of genome sequencing have brought to light a new element that adds to the astounding potential for bacterial adaptation through recombination.


4.4 Acknowledgements

Funding in the form of a NSERC Discovery Grant to RF and a NSERC CGS-D Scholarship to NR is gratefully acknowledged. The funding agency had no role in this study.

4.5 References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389-402.

Azaro, M.A., Landy, A., 2002. λ Integrase and the λ Int family, in: Craig, N.L., Craigie, R., Gellert, M. (Eds.), Mobile DNA II. pp. 118-148.

Drummond, A.J., Ashton, B., Buxton, S., Cheung, M., Cooper, A., Heled, J. et al., 2011. Geneious v5.0.4. Available at: http://www.geneious.com.

Edgar, R.C., 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 32, 1792-7.

Esposito, D., Scocca, J.J., 1997. The integrase family of tyrosine recombinases: evolution of a conserved active site domain. Nucleic acids research 25, 3605-14.

Frost, L.S., Leplae, R., Summers, A.O., Toussaint, A., 2005. Mobile genetic elements: the agents of open source evolution. Nature reviews. Microbiology 3, 722-32.

Jin, S., 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172. University of Toronto M. Sc. Thesis.

Klimke, W., Agarwala, R., Badretdin, A., Chetvernin, S., Ciufo, S., Fedorov, B., Kiryutin, B., O’Neill, K., Resch, W., Resenchuk, S., Schafer, S., Tolstoy, I., Tatusova, T., 2009. The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic acids research 37, 216-223.

Langille, M.G.I., Brinkman, F.S.L., 2009. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics Jan. 16 (E.

Lee, J.-H., Karamychev, V.N., Kozyavkin, S.A., Mills, D., Pavlov, A.R., Pavlova, N.V., Polouchine, N.N., Richardson, P.M., Shakhova, V.V., Slesarev, A.I., Weimer, B., O’Sullivan, D.J., 2008. Comparative genomic analysis of the gut bacterium Bifidobacterium longum reveals loci susceptible to deletion during pure culture growth. BMC genomics 9, 247.

Marchler-Bauer, A., Anderson, J.B., Cherukuri, P.F., DeWeese-Scott, C., Geer, L.Y., Gwadz, M., He, S., Hurwitz, D.I., Jackson, J.D., Ke, Z., Lanczycki, C.J., Liebert, C.A., Liu, C., Lu, F., Marchler, G.H., Mullokandov, M., Shoemaker, B.A., Simonyan, V., Song, J.S., Thiessen, P.A., Yamashita, R.A., Yin, J.J., Zhang, D., Bryant, S.H., 2005. CDD: a Conserved Domain Database for protein classification. Nucleic acids research 33, D192-6.

Männisto, M.K., Salkinoja-Salonen, M.S., Puhakka, J.A., 2001. In situ polychlorophenol bioremediation potential of the indigenous bacterial community of boreal groundwater. Water research 35, 2496-504.

Nunes-Düby, S.E., Kwon, H.J., Tirumalai, R.S., Ellenberger, T., Landy, A., 1998. Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic acids research 26, 391-406.

Rajeev, L., Malanowska, K., Gardner, J.F., 2009. Challenging a paradigm: the role of DNA homology in tyrosine recombinase reactions. Microbiology and molecular biology reviews : MMBR 73, 300-9.

Roberts, A.P., Chandler, M., Courvalin, P., Guédon, G., Mullany, P., Pembroke, T., Rood, J.I., Smith, C.J., Summers, A.O., Tsuda, M., Berg, D.E., 2008. Revised nomenclature for transposable genetic elements. Plasmid 60, 167-73.

Uchiumi, T., Ohwada, T., Itakura, M., Nukui, N., Dawadi, P., Kaneko, T., Tabata, S., Yokoyama, T., Tejima, K., Saeki, K., Omori, H., Hayashi, M., Sriprang, R., Murooka, Y., Tajima, S., Simomura, K., Nomura, M., Suzuki, A., Shimoda, Y., Sioya, K., Uchiumi, T., Ohwada, T., Itakura, M., Mitsui, H., Nukui, N., 2004. Expression Islands Clustered on the Symbiosis Island of the Mesorhizobium loti Genome. Society.

Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.

Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3, 6. doi:10.1186/1759-8753-3-6


Chapter 5 The Chlorocatechol Degradative Operon in Burkholderia sp. strain OLGA172 Resides in Chromosomal Area of Genome Plasticity as revealed through PacBio Single-Molecule Sequencing

Acknowledgements and Contributions: This chapter has been submitted for consideration to the journal Genomics and represents the compilation of efforts of previous graduate students as well as my own work. Contributors to this chapter (besides myself and Roberta Fulthorpe) are as follows: Jackie Goordial (investigation of CC operon), Soulbee Jin (primer walking and assembly of CC operon region), Heng (Tony) Qian (short read data assemblies), Shu Yi (Roxana) Shen (comparison of chromosome 1 breakpoint, assistance with figures and bioinformatics support), Rosemary Saati (qPCR for copy number analysis, PCR confirmation and small plasmid extraction).

5 Introduction

Next generation sequencing has revolutionized our approach to the study of genomes. However, due to the short read lengths of the initial next generation sequencers, the complete assembly of even the smallest of these genomes into a manageable number of contigs from sequence data alone is problematic, particularly in repetitive regions of the genome (Phillipy et al. 2008; Ricker et al. 2012; Ghodsi et al. 2013; Barbosa et al. 2014). Many researchers have called for increased efforts in completing sequencing projects (Parkhill, 2000; Phillipy et al. 2008; Klassen 2012); however, the high costs and time requirements have simply made this goal unachievable for many labs. As a result, incomplete genome projects and permanent draft genomes are increasingly dominating databases. The number of genomes deposited in the Genomes Online Database (version 5.0; https://gold.jgi-psf.org accessed 31 October 2014) has grown from 3532 genomes in 2012 (Ricker et al., 2012) to 53,514 genomes, 73% (38,962) of which are bacterial genomes. Of these, only 12% (6648) of the bacterial genomes have been finished and 44% (23,551) of the remainder are listed as permanent draft (up from 29% permanent drafts in 2012). This increase in permanent draft genomes results not only from the practical and financial obstacles faced in finishing a genome, but also from the limited scope of the original project. Many sequencing projects are initiated in order to determine whether a discrete set of known genes is present in an organism, or to investigate differences between closely related strains. For each of these types of projects, a permanent draft genome is sufficient, since short read sequencing technologies can assemble the individual genes (Branscomb and Predki, 2002) and reference-based assembly to close relatives can reveal strain-specific gene content. However, the prevalence of permanent draft genomes in the publicly available databases is problematic for researchers looking to understand overall genome organization and the impacts of horizontal gene transfer on prokaryotic evolution.
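The database proportions quoted above can be re-derived from the reported counts. The short calculation below is a sketch of that arithmetic only; the variable names are ours, and the choice of denominators shown is an observation from the numbers rather than a statement of how the percentages were originally computed.

```python
# Counts as reported for the Genomes Online Database (accessed 31 October 2014).
total_genomes = 53_514
bacterial = 38_962
finished = 6_648
permanent_draft = 23_551


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print("bacterial / all genomes:       ", pct(bacterial, total_genomes))
print("finished / all genomes:        ", pct(finished, total_genomes))
print("finished / bacterial genomes:  ", pct(finished, bacterial))
print("permanent draft / all genomes: ", pct(permanent_draft, total_genomes))
print("permanent draft / unfinished bacterial genomes:",
      pct(permanent_draft, bacterial - finished))
```

With these counts, the quoted 12% and 44% are reproduced when the finished and permanent draft counts are divided by all 53,514 deposited genomes, whereas dividing by the bacterial subset gives larger values.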

Although smaller than eukaryotic genomes, many prokaryotes are still challenging to assemble due to the presence of multiple replicons and highly repetitive elements shared within and between these replicons. Any repeat longer than the sequenced read results in an ambiguity that cannot be resolved by assembly algorithms due to the simultaneous existence of more than one true path through the data (Ghodsi et al. 2013). The ribosomal operons, typically ranging from 5-7 kbp, represent the largest repeat class found in the majority of bacteria (Koren et al. 2013) and therefore read lengths on a kilobase scale are required to produce an ungapped or ‘closed’ assembly. PacBio SMRT sequencing produces read lengths of up to 40 kb as well as a large volume of shorter reads to be used for error correction, simultaneously providing the benefits of hybrid short and long read libraries without the additional cost and effort of using two separate libraries or sequencing technologies. Most importantly, PacBio SMRT sequencing does not suffer from amplification biases or low coverage in regions containing secondary structure (due to palindromes or GC content), but rather produces random errors that can be detected and algorithmically managed (Chin et al. 2013; Koren et al. 2013).
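The ambiguity caused by repeats longer than the read length can be illustrated with a small sketch. The sequences below are hypothetical toy strings (not taken from OLGA172), and the reads are idealized as every error-free substring of a fixed length. Two different arrangements of the same unique segments around a three-copy repeat yield exactly the same set of reads until the read length exceeds the repeat length plus one flanking base on each side.

```python
from collections import Counter

def read_spectrum(genome: str, read_len: int) -> Counter:
    """Multiset of every error-free read of length read_len from a linear genome."""
    return Counter(genome[i:i + read_len] for i in range(len(genome) - read_len + 1))

# Hypothetical toy sequences: an 8 bp repeat R present in three copies,
# separated by unique segments that contain no T.
R = "TTTTTTTT"
A, B, C, D = "ACGA", "CCAG", "GGAC", "CAGG"

genome_1 = A + R + B + R + C + R + D  # one arrangement
genome_2 = A + R + C + R + B + R + D  # segments B and C swapped

# Reads no longer than the repeat cannot distinguish the two arrangements.
short_reads_agree = read_spectrum(genome_1, 8) == read_spectrum(genome_2, 8)
# Reads spanning the repeat plus one flanking base on each side can.
long_reads_agree = read_spectrum(genome_1, 10) == read_spectrum(genome_2, 10)

print(short_reads_agree, long_reads_agree)
```

In this idealized setting the short reads are consistent with both arrangements, which is the "more than one true path" situation described above. Real reads add sequencing errors and uneven coverage, which this sketch ignores.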

Genome organization has been linked with important lifestyle and evolutionary traits (Slater et al. 2009; Harrison et al. 2010; Heuer and Smalla, 2012), and consequently understanding the nature and size of individual replicons can be very informative. Fragmented draft genomes contain anywhere from tens to hundreds of individual contigs, and relevant information on replicon characteristics is therefore lost. Prokaryotic genomes range from 200 to over 9000 genes, and there is a strong link between environment and genome complexity (Konstantinidis and Tiedje, 2004; Cordero and Hogeweg 2009). Since horizontal gene transfer is a prominent mechanism of bacterial adaptation, the genomic context of individual genes is important for discerning the evolutionary origins and transferability of these traits. Of the 4394 completed genomes available in the NCBI genome database (accessed 27 Feb 2015), 33% (1474 genomes) contain more than one replicon. Plasmid localization not only implies potential for dissemination, but can also give insight into gene expression (Lopez-Leal et al. 2014) and the rate of evolution, since existence on a discretely replicating plasmid is effectively equivalent to a gene duplication event equal to the copy number of the plasmid (Norman et al. 2009). Similarly, nucleotide substitution rates have also been observed to be higher on secondary chromosomes and smaller replicons than on the primary replicon, as have rates of recombination and rearrangements (Chain et al. 2006). Understanding the genomic context of genes found on chromosomes is also important since outward-facing promoters that are contained within adjacent mobile elements may directly regulate adaptive traits. As an example, in Bacteroides fragilis, antimicrobial resistance genes exhibit increased expression from insertion sequence (IS) elements found upstream (Sóki 2013). In a recent study using whole genome short read sequencing of resistant isolates, Sydenham et al. (2015) found that the majority of contigs bearing resistance genes terminated within 200 bp of the gene in question and therefore information on the upstream gene content was lost.

In this paper, we have assembled the complete genome of Burkholderia sp. str. OLGA172 (formerly Burkholderia sp. R172), a 3-chlorobenzoate (CBA) mineralizer isolated from an uncontaminated Boreal forest soil in northwestern Russia as part of a global survey of CBA and 2,4-D degrading soil bacteria (Fulthorpe et al. 1998). Chlorocatechol (CC) is a central intermediate in the CBA degradation pathway, as well as in the degradation of several other chlorinated aromatic chemicals of environmental concern, such as chlorophenols, chlorobenzenes and chlorobiphenyls (Schlomann, 1994). Understanding the evolution and distribution of CC degradative pathways therefore has benefits for the remediation of a variety of anthropogenic compounds. The catabolic genes for CC degradation are usually found in an operon, and often, but not always, located on catabolic plasmids (Liu et al. 2005; van der Meer et al. 1992; Fulthorpe et al. 1995; Leveau & van der Meer, 1996) or on mobile elements such as integrative and conjugative elements (ICEs) (Sentchilo et al. 2009). In Burkholderia sp. str. OLGA172, the complete modified ortho pathway for catechol degradation has been identified (Jin, 2010; Accession number: AY168634.1); however, numerous Sanger and short read sequencing approaches have failed to reveal the genomic context of the region surrounding this operon. In this paper, we illustrate how the presence of a large repeated element, termed a Recombinase in Trio (RIT) element, next to the operon interfered with this analysis and how the use of PacBio single molecule sequencing overcame these issues to produce a non-fragmented draft assembly.

5.1 Materials and Methods

5.1.1 Short read NGS sequencing

Sequencing was carried out using an Illumina (Solexa) Genome Analyzer (GA) II at the Centre for the Analysis of Genomic Evolution and Function (CAGEF) at the University of Toronto, using two flow cells. A Roche 454 Genome Sequencer FLX (GS-FLX) was also used to sequence the genome at The Genome Quebec Innovation Centre at McGill University, using one quarter of a PicoTitre Plate, and de novo assembly of the raw reads was carried out to generate contigs. Illumina and 454 sequencing gave 364,718,452 bp and 75,636,539 bp respectively. Hybrid assembly of the trimmed reads from both datasets was performed using Velvet version 1.2.08 (Zerbino and Birney, 2008) and resulted in 1508 contigs with a maximum contig length of 89,084 bp (mean value 5186 bp, N50 of 12,043 bp). BLAST (Altschul et al. 1990) was employed to determine the identity of the other genes present on the contig surrounding the CC degradative genes, and they were confirmed through synteny analysis of closely related species and PCR amplification.
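For readers unfamiliar with the contig statistics quoted above, the following is a minimal sketch of how maximum length, mean length and N50 are commonly derived from a list of contig lengths. The lengths used here are hypothetical and are not the Velvet output; the definition of N50 assumed is the usual one (the contig length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the assembled bases).

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")

# Hypothetical contig lengths in bp (not data from this study).
contig_lengths = [100, 80, 60, 40, 20]

max_length = max(contig_lengths)
mean_length = sum(contig_lengths) / len(contig_lengths)
n50_length = n50(contig_lengths)

print(max_length, mean_length, n50_length)
```

A highly fragmented assembly has an N50 that is small relative to the replicon sizes, which is why the metric is a convenient shorthand for contiguity.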

5.1.2 PacBio Single Molecule Sequencing

Sequencing was performed by Genome Quebec Innovation Centre using 8 SMRT cells in a PacBio RSII sequencer. There were a total of 736,020 raw subreads with an average length of 4,949 bp and a maximum length of 23,822 bp. Contigs were assembled at the Innovation Centre through the Hierarchical Genome Assembly Process (HGAP) workflow (Chin et al., 2013) including pre-assembly error correction, Celera assembly, and polishing with Quiver. The corrected long reads produced after the pre-assembly error correction process were obtained from the Innovation Centre and utilized to examine coverage and disagreements in the final assembly using Geneious (version 6.0.3, http://www.geneious.com, Kearse et al., 2012) and in hybrid assemblies. Alignments were performed with a minimum overlap of 200 bp, a minimum overlap consensus of 98%, and a maximum of 30% errors throughout the read alignment and 10% gaps. The permissive error and gap rates were utilized due to the expected high rate of random errors in the individual PacBio reads.

5.1.3 Assembly of Short Read Technologies and PacBio corrected reads

In order to determine whether the PacBio assembly could be improved through the addition of short read sequencing data, hybrid assemblies of the PacBio corrected long reads and the Illumina and 454 short reads (individually and combined) were performed using Mira (Chevreaux et al. 1999) on default settings. The contigs generated were compared to the unitigs from the PacBio-only assembly using the Mauve (Darling et al. 2004) whole genome alignment option in Geneious.

5.1.4 Gene Annotation and Contig Validation

Annotation was performed automatically using the RAST server (Aziz et al., 2008) utilizing Glimmer3 with no backfilling. The beginning sequence of each PacBio unitig was compared to the complete unitig using Blast (Altschul et al. 1990) and regions repeated at both ends of the unitigs were trimmed from the final replicons. Identification of individual mobile elements was also performed using Blast, accessed through the ISFinder website (www-is.biotoul.fr; Siguier et al. 2006). GC skew and coding density were determined and visualized using DNAPlotter (Carver et al. 2009). Highly similar repeats (>98% nucleotide ID) within individual replicons were determined using REPuter (https://bibiserv2.cebitec.uni-bielefeld.de/reputer; Kurtz et al. 2001). Repeated elements greater than 1000 bp between replicons were determined using Blast alignment with match/mismatch scores of 1/-3. Primers were designed to confirm placement of the RIT element adjacent to the tfd operon on chromosome 1 and a second identical RIT element found on chromosome 2. Primers were also designed to confirm placement of the 191 kb fragment that was removed from chromosome 2 in the hybrid assembly. Primer sequences are listed in Appendix 1.
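The trimming of regions repeated at both ends of a unitig can be sketched as follows. This is a simplified illustration only: it looks for an exact match between the start and end of a sequence, whereas the Blast comparison described above tolerates mismatches and gaps. The function names, the minimum overlap length and the toy sequence are all hypothetical.

```python
def terminal_overlap(seq: str, min_len: int = 5) -> int:
    """Length of the longest suffix of seq that exactly matches its prefix.

    Only overlaps of at least min_len and at most half the sequence are considered.
    """
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0

def trim_circular(seq: str, min_len: int = 5) -> str:
    """Remove the redundant copy of a terminal overlap from a circular unitig."""
    k = terminal_overlap(seq, min_len)
    return seq[:len(seq) - k] if k else seq

# Hypothetical circular replicon, and a unitig that runs past the origin by 6 bp.
replicon = "ACGTTGCAAGGCTTAGC"
unitig = replicon + replicon[:6]

trimmed = trim_circular(unitig)
print(len(unitig), len(trimmed))
```

Assemblers often report circular replicons as linear sequences whose ends overlap, so the redundant copy is removed before the replicon size is reported.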

5.1.5 Comparisons to Related Finished Genomes

MAUVE alignments (Darling et al. 2004) of individual replicons from Burkholderia phytofirmans PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1001, 1002 and 1003 were performed using Geneious version 6.0.3 (http://www.geneious.com, Kearse et al. 2012), using the Muscle (Edgar, 2004) alignment option and a minimum local co-linear block (LCB) value of 400. Putative genomic islands on chromosome 1 were determined using IslandViewer (Dhillon et al. 2013; www.pathogenomics.sfu.ca/islandviewer).

5.1.6 Large Plasmid Extraction

Large plasmid extraction was based on the method of Andrup et al. (2008). The samples were run on 0.5% Megabase agarose (BioRad) for 21.5 hours at 4-6 °C before staining for 1.5 hours in ethidium bromide and destaining for 5 days at 4 °C. Plasmid sizes were estimated using the BAC Tracker Supercoiled DNA ladder (Epicentre) as well as previously sequenced plasmids in Cupriavidus metallidurans CH34 and B. phytofirmans PsJN.

5.2 Results

5.2.1 Overall Genome Analysis

The PacBio assembly for Burkholderia sp. str. OLGA172 produced 11 unitigs, 4 of which were discarded as small contigs of vector control sequence and one of which was found to contain a partial 16S sequence. There was also one unitig (9570 bp) that appears to be valid sequence (top matches are to Burkholderia strains) but that did not have any features consistent with being an independent replicon. This sequence was found to align with one of the larger unitigs, and to terminate in a mobile element (IS66) that is repeated in multiple locations within the genome. The remaining 5 unitigs were trimmed for overlapping terminal repeats (see methods) and retained for further analysis, providing an estimated total genome size of 8,574,889 bp. Each of these unitigs was aligned with the corrected long reads provided by the Genome Quebec Innovation Centre, and 19% (4697/24528) could not be aligned at 98% nucleotide identity. A selection of the unmatched reads was analyzed by Blast comparison to the 5 unitigs, and manual inspection indicated that there were small gaps (10-20 bp on average) that prevented alignment. However, as can be seen in Table 5.2.1, depth of coverage for each unitig ranged from 6x to 19x from the corrected long reads alone, which provides a reasonable level of confidence in the assembly. Coverage for total reads obtained through PacBio sequencing ranged from 49x for the unitig designated plasmid 3 to 255x for the largest unitig. In order to determine the overall complexity of this genome, highly similar repeated elements greater than 1 kb were quantified for the three largest unitigs (see methods). There were a total of 105 repeated elements on the three largest unitigs (including the 16S operons) and 56 elements were found to have highly similar copies on both chromosome 1 and chromosome 2. The distribution of repeated elements classifies Burkholderia sp. str. OLGA172 as at least a class II difficulty of assembly (Koren et al. 2013) due to having a large number of repeated elements in addition to 16S rDNA operons (more than 100 repeats greater than 500 bp) and potentially a class III difficulty due to the presence of two repeats greater than 7 kb. For this reason it is not surprising that our previous sequencing efforts using a combination of Illumina and Roche 454 sequencing had been unsuccessful in producing a reasonable assembly of this genome (Jin, 2010). As we had the additional short read data from both the Illumina and Roche 454 sequencing platforms available for this genome, we also performed hybrid assemblies using all three sets of sequencing reads. All assemblies that incorporated the Illumina reads failed to produce an adequate assembly, presumably due to the low quality of the original Illumina data. The assembly of the PacBio and 454 data resulted in considerably more contigs than the PacBio assembly, only 11 of which were larger than 2 kb.
The alignment of these 11 large contigs from the hybrid assemblies agreed very well with the unitigs from the original PacBio assembly (Table 5.2.1), with the exception of the removal of a 191,800 bp fragment from the PacBio unitig corresponding with chromosome 2 and the creation of an additional contig of 79,317 bp. Primers were designed targeting the regions flanking the 191 kb fragment within the PacBio chromosome 2 unitig, and products of the expected size were obtained, confirming the original PacBio placement of this fragment (data not shown). The 79 kb contig was discarded as it was highly fragmented, with strings of uncalled bases (N’s).

Table 5.2.1: Statistics of PacBio unitigs assigned as putative replicons. Alignments were performed using only the corrected long reads from the PacBio SMRT sequencing aligned on unitigs after trimming of repetitive terminal regions.

                   Number of        Pairwise        Mean        Standard
                   aligned reads    Identity (%)    coverage    Deviation
Chromosome 1       12,027           99.3            19.7        4.2
Chromosome 2       9129             99.3            19.7        3.9
Plasmid pOLGA1     458              99.3            11.6        2.8
Plasmid pOLGA2     195              99.3            10.1        4.2
Plasmid pOLGA3     22               99.5            5.9         1.7
Total aligned reads    21,831
Unaligned reads        4,697 (19%)
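The mean coverage and standard deviation columns summarize the per-base depth of aligned reads along each unitig. A minimal sketch of that calculation is given below, assuming alignments are available as 0-based, end-exclusive (start, end) intervals. The intervals are hypothetical, and the use of the population standard deviation is an assumption, since the estimator used by the alignment software is not stated here.

```python
from statistics import mean, pstdev

def depth_profile(length, alignments):
    """Per-base depth from (start, end) read alignments, 0-based and end-exclusive."""
    delta = [0] * (length + 1)
    for start, end in alignments:
        delta[start] += 1   # a read begins covering at start
        delta[end] -= 1     # and stops covering at end
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth

# Hypothetical alignments on a 10 bp sequence (not data from this study).
alignments = [(0, 6), (2, 8), (4, 10)]
depth = depth_profile(10, alignments)

mean_coverage = mean(depth)
coverage_sd = pstdev(depth)  # population SD; an assumption for this sketch
print(depth, mean_coverage)
```

Low mean depth combined with a comparatively large standard deviation, as for the smallest plasmid, signals regions where the consensus rests on very few reads.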

5.2.2 Biological consistency of the Assembly

In addition to using traditional assembly metrics such as size and coverage, we wanted to specifically investigate the quality of our assembly in terms of biologically relevant features. To perform this investigation we compared our PacBio assembly of Burkholderia sp. str. OLGA172 with the available completed genomes of other Burkholderia strains (Table 5.2.2) including Burkholderia sp. str. CCGE1001 (unpublished), CCGE1002 (Ormeño-Orrillo et al., 2012), CCGE1003 (unpublished), B. phytofirmans PsJN (Weilharter et al., 2011) and B. xenovorans LB400 (Chain et al., 2006), all of which bear 16S nucleotide identities of 97% or greater to our strain. There were 8853 genes annotated in our assembled genome, including 7 complete rRNA operons and 64 tRNA genes. The rRNA operons were distributed between the two chromosomes, 3 on the largest chromosome and 4 on the second chromosome, which is not common but is also documented in Burkholderia sp. CCGE1002. The number of tRNA genes was consistent with the other genomes, as were the estimated sizes of the chromosomes. The gene distribution for chromosome 1 is illustrated in Figure 5.2.1.

Figure 5.2.1: Chromosome 1 of Burkholderia sp. str. OLGA172 as determined by PacBio sequencing. Chromosome is illustrated after manual trimming of ends (see methods). Rings correspond to the following (moving from outside towards the middle): coding sequences (CDS) in the forward direction; CDS in the reverse direction; rRNA operons (black) and tRNA genes (grey); all CDS annotated as mobile elements (transposons, phage integrases, transposition helper proteins); GC plot; GC skew. The two innermost rings are black for above average values and grey for below average values. Image created in DNAPlotter (Carver et al. 2009).
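As background to the GC skew ring, a minimal sketch of the conventional calculation, (G - C) / (G + C) per window, is shown below. The window size, the use of non-overlapping windows and the toy sequence are assumptions made for illustration; the exact windowing applied by DNAPlotter is not described here.

```python
def gc_skew(seq: str, window: int):
    """(G - C) / (G + C) for consecutive non-overlapping windows of a sequence."""
    values = []
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C are assigned a skew of zero.
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values

# Hypothetical 12 bp sequence: a G-rich window followed by a C-rich window.
toy_sequence = "GGGCAT" + "CCCGAT"
skew = gc_skew(toy_sequence, 6)
print(skew)
```

In many bacterial chromosomes the sign of the skew switches near the replication origin and terminus, so a profile with two clean switches is one indication that a circular replicon has been assembled without large inversions or misjoins.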

Table 5.2.2: Comparison of the assembled genome of Burkholderia sp. str. OLGA172 with other closely related Burkholderia strains. Strain names listed are Burkholderia sp. str. unless a species name has been formally accepted (listed in italics with the strain name).

Strain name             Chromosome 1    Chromosome 2    Plasmids                  rRNA    tRNA    Accession Numbers
OLGA172                 4.65 Mb         3.50 Mb         271 kb, 137 kb, 23 kb     21      64
B. phytofirmans PsJN    4.47 Mb         3.63 Mb         121 kb                    18      63      NC_010676.1, NC_010679.1, NC_010681.1
B. xenovorans LB400     4.90 Mb         3.36 Mb         1.47 Mb*                  18      65      NC_007951.1, NC_007952.1, NC_007953.1
CCGE1001                4.06 Mb         2.77 Mb         none                      18      61      NC_015136.1, NC_015137.1
CCGE1002                3.52 Mb         2.59 Mb         1.28 Mb*, 489 kb          21      73      NC_014117.1, NC_014120.1
CCGE1003                4.08 Mb         2.97 Mb         none                      18      63      NC_014539, NC_014540.1

* indicates a third chromosome as opposed to a plasmid, as annotated in the Genbank entry.

For each of the unitigs, replication and partitioning genes were located and verified to be consistent with designated replicons (data not shown). As can be seen in Table 5.2.2, the number and size of additional replicons can be quite variable in Burkholderia strains, and we therefore experimentally verified our genome expectations by performing a large plasmid extraction on Burkholderia sp. str. OLGA172 (Figure 5.2.2). This indicated two large plasmids with sizes consistent with those obtained through the PacBio assembly as well as two smaller plasmids. The larger of these two smaller plasmids is consistent with the unitig designated plasmid 3 in the assembly (23 kb). Although none of the close relatives examined in this study had a similarly sized plasmid, the replication and partitioning genes found on this replicon had highest homology with those from a 45 kb plasmid found in Burkholderia sp. KJ006 (76% protein homology with 77% coverage). With the exception of a mobile element found in multiple locations within the genome (IS66), the majority of the coding sequences identified on the plasmid corresponded to hypothetical or conserved hypothetical genes. A contig corresponding with the 8 kb plasmid was not found. There is one unitig that is approximately the right size (9570 bp); however, there were no plasmid replication genes or other identifying features to suggest that this corresponds to the small plasmid visible on the gel. As discussed above, a comparison of this unitig with the complete assembly suggests that it is already accounted for in the consensus assembly. The nature of the 8 kb plasmid has not been determined, but it is possible that it corresponds to an excised mobile element that is contained within the assembly.

Figure 5.2.2: Large plasmid extraction. The first lane contains the BAC Tracker supercoiled DNA ladder (Epicentre), however as the plasmids were outside of the ideal range their size was instead estimated based on comparison to plasmids from previously sequenced strains. Samples are Cupriavidus metallidurans CH34 (lane 2; plasmid sizes 233 kb and 171 kb), Burkholderia sp. str. OLGA172 (lane 3; PacBio sizes listed as 271 kb and 137 kb) and B. phytofirmans PsJN (lane 4; 121 kb plasmid). The smaller plasmid in Burkholderia sp. str. OLGA172 is also visible below the non-chromosomal marker band of the ladder, and a smaller element is visible at the 8 kb marker.

5.2.3 Capacity of the PacBio Assembly for comparative studies

Although the related strains have less than 99% 16S identity to our strain, the chromosome 1 genes were expected to show evolutionary conservation with other sequenced Burkholderia genomes. We therefore aligned our assembled chromosome with chromosome 1 from 6 Burkholderia strains using the MAUVE feature in Geneious (see methods). As can be seen in Figure 5.2.3, gene order and organization from our assembly were comparable with those seen for the other Burkholderia strains, with large local co-linear blocks (LCBs) shared between all 6 strains. There was greater similarity in the genomic arrangement of our strain with B. phytofirmans PsJN than with the other strains; however, as expected, there were large gaps both within and between LCBs corresponding to strain specific genomic islands. For the largest of these gaps the sequence was inspected to identify whether the break was specific to our assembly or was also observed in the other strains. In each of these cases, there was a consistent break point where regions of strain specificity were observed in all 6 genomes, including several islands with clear insertions following tRNA genes, consistent with expectations for genomic island or prophage insertion sites.

Figure 5.2.3: MAUVE alignment of chromosome 1 from six Burkholderia strains. Local Colinear Blocks (LCB) are denoted by rectangles and level of identity within those blocks is illustrated by the height of the vertical lines contained within the rectangles. Genomes are (from top to bottom): OLGA172, B. xenovorans LB400, CCGE1001, CCGE1002, B. phytofirmans PsJN and CCGE1003. LCBs drawn below the horizontal refer to inverted segments.

5.2.4 Highlighting a region of Strain Specificity – The Chlorocatechol (CC) Degradative Operon

Burkholderia sp. OLGA172 is a 3-chlorobenzoate (CBA) degrader representative of a large collection of unstable CBA degraders isolated from pristine environments. Previously in our lab, primers targeting chlorocatechol dioxygenase (CCD) genes (Leander et al. 1998) were used to confirm the presence of a modified ortho pathway of chlorocatechol (CC) degradation genes in Burkholderia sp. str. OLGA172 (Genbank: AY168634). Primer walking techniques were used to determine a 10 kb genomic region surrounding the CC degradative operon that included the full operon and a set of three tyrosine site specific recombinases (Jin, 2010). This set of recombinases is now recognized as a Recombinase in Trio (RIT) element (Van Houdt et al. 2009). Full genome sequencing using both the Illumina and 454 platforms was utilized to assemble the complete genome and provide context for the CC degradative operon (Jin, 2010), and a hybrid assembly of these two sequencing technologies produced a 14 kb contig that connected the CC degradative operon to a segment of chromosome 1 from a number of strains from the “plant-beneficial-environmental” (PBE) clade of Burkholderia (Suarez-Moreno et al. 2012).

Chlorocatechol degradation genes that encode an ortho cleavage pathway have been repeatedly identified in proteobacteria isolated from various contaminated systems. Separate isolations resulted in the use of different notations for the catabolic genes: the clc genes from chlorobenzoate degrading Pseudomonas knackmussii B13 (Chatterjee et al. 1981), the tfd genes from 2,4-D degrading Cupriavidus pinatubonensis JMP134 (Don and Pemberton, 1981; Don et al. 1995), the tcb genes from trichlorobenzene degrader Pseudomonas sp. P51 (van der Meer et al. 1991), and the tft genes from trichlorophenol degrading Burkholderia phenoliruptrix AC1100 (Hubner et al. 1998). In spite of the different notations, the operons share sequence similarities consistent with divergent evolution from a common ancestor. In most instances these CC operons are located on IncP1 plasmids, suggestive of a rapid evolutionary response to the introduction of novel and abundant anthropogenic chloroaromatics to the environment. For instance, there is a very high degree of sequence similarity between tfd genes located on different plasmids in strains isolated all over the world. Carriage on plasmids not only facilitates rapid distribution but also has the added benefit of providing an increased copy number of the degradative genes, which is important due to the toxicity of the intermediate degradation products (van der Meer, 2003). There are two known instances where the CC genes were not found on plasmids. The clc genes originally described in B13 are contained within an integrative conjugative element that has been shown to be self-transmissible (Sentchilo et al. 2009; Gaillard et al. 2006), and are also found in Burkholderia LB400. The tfd genes of Burkholderia sp. str. RASC (aka TFD3, isolated from sewage sludge in Oregon, USA) are reported to be chromosomal (Suwa et al. 1994, Tonso et al. 1995). However, Sakai et al. (2014) recently isolated 7 Burkholderia and one Cupriavidus 2,4-D degrading strains from paddy fields with CC genes highly homologous to those of RASC, and have shown them to be located on a group of megaplasmids 580-900 kb in size.

In spite of OLGA's isolation on chlorobenzoate as a selective substrate, and its close phylogenetic relationship to LB400, the CC operon is highly homologous to the tfd rather than the clc genes (85% nucleotide identity to the tfdC gene of JMP134, 79% to Burkholderia sp. NK8 (Liu et al. 2001)). There is high homology amongst the CC genes from OLGA and other strains isolated from pristine sites around the world, but there is no evidence of plasmid locations (Leander et al. 1998). For all these reasons, confirming the genomic context in this strain would support our hypothesis that these genes are ancestral, and aid in understanding the evolution of the chlorocatechol degradation operon. The contig generated through the hybrid Illumina and 454 assembly illustrated that the CBA genes were likely to be located on chromosome 1 due to the presence of typical chromosome 1 genes adjacent to one end of the operon (Figure 5.2.4). However, at the other end of the operon the contig terminated within the RIT element, and therefore could not provide additional information to confirm the genes present beyond that point. Amplification using Thermal Asymmetric Interlaced PCR (TAIL-PCR) from the RIT element to other possible contigs that overlap with this region revealed a ‘junkyard’ of complete and partial mobile elements and hypothetical proteins in the region adjacent to the RIT element (Jin, 2010). With the PacBio assembly we were able to assemble this highly fragmented region of the original genome assembly and connect it to chromosome 1 genes conserved with the other Burkholderia strains. The assembly also revealed the presence of another copy of the RIT element on the second chromosome that is identical in sequence to the end of the inverted repeats (3393 bp). It is therefore likely that this large repeated element hindered PCR confirmation of the CCD operon location, and our previous next generation sequence assemblies based on shorter reads.

Amplification and sequencing were performed from tfdC to the ribonuclease G gene in order to confirm the placement of the CC degradative operon on chromosome 1. Alignment of this region to other related strains indicated that the CC degradative operon is the starting point for a strain specific region of genome plasticity (RGP) that extends for 52 kb. There is no tRNA flanking the region and it does not begin with an integrase or transposase, although there are a number of mobile element proteins contained within. Comparison of this region to the other related Burkholderia strains does indicate, however, that this segment of the chromosome is highly strain specific. In each of these strains there is high homology and gene synteny for 90 kb leading up to the region where the CC degradative operon is found in Burkholderia sp. str. OLGA172, and gene synteny ends in all of these strains after the ribonuclease G protein (Figure 5.2.4). The portion of the chromosome leading up to this break in synteny is also conserved in other, more distantly related, species including Cupriavidus and Ralstonia strains. In Burkholderia sp. CCGE1001 there is a 62 kb genomic island documented in this site and the genomic island integrase is the first gene after the ribonuclease. Although none of the other strains have documented genomic islands in this location, this region is clearly involved in strain specificity due to the complete disruption of gene synteny. Some of these strains also contain highly homologous RIT elements (>80% nucleotide identity) to OLGA172; however, none of these RIT elements occur in the same genomic location. The reasons for the lack of synteny following the RNase G in each of these strains are not clear at this time.
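The notion of synteny that is conserved up to an anchor gene and then lost can be sketched with ordered gene lists. The gene labels below are hypothetical placeholders rather than annotations from these genomes, and real comparisons rely on sequence alignment rather than on matching names; the sketch only illustrates the logic of locating a break point.

```python
def shared_run_before(anchor, reference, other):
    """Number of consecutive genes directly upstream of anchor that match in both lists."""
    i, j = reference.index(anchor), other.index(anchor)
    run = 0
    while (i - run - 1 >= 0 and j - run - 1 >= 0
           and reference[i - run - 1] == other[j - run - 1]):
        run += 1
    return run

def synteny_continues_after(anchor, reference, other):
    """True if the gene immediately downstream of anchor is the same in both lists."""
    i, j = reference.index(anchor), other.index(anchor)
    return (i + 1 < len(reference) and j + 1 < len(other)
            and reference[i + 1] == other[j + 1])

# Hypothetical gene orders: shared genes g1-g3, an anchor gene, then strain specific genes.
strain_a = ["g1", "g2", "g3", "rnaseG", "tfdC", "tfdD", "rit"]
strain_b = ["g1", "g2", "g3", "rnaseG", "integrase", "hyp1", "hyp2"]

upstream_shared = shared_run_before("rnaseG", strain_a, strain_b)
downstream_shared = synteny_continues_after("rnaseG", strain_a, strain_b)
print(upstream_shared, downstream_shared)
```

A long shared run upstream combined with no shared gene downstream is the pattern described above for the region following ribonuclease G.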

Figure 5.2.4: Genomic arrangement of chromosome 1 genes from Burkholderia sp. str. OLGA172 and comparison to homologous regions of related strains. Note that the grey arrows do not indicate a type of gene but instead indicate that the genes found in this genomic location are not shared among the different strains included in the analysis. Note also that the complete genome for Burkholderia sp. NK8 has not been completed and therefore only the plasmid was available for comparison.

5.2.5 Limitations of the PacBio Assembly

The RIT element on chromosome 2 is flanked on both sides by at least 15 kb of complete and partial mobile genetic elements, including several that are also repeated in the ‘junkyard’ region adjacent to the RIT element on chromosome 1. Interestingly, the TAIL-PCR sequencing results and the PacBio assembly disagreed on the nature of the ‘junkyard’ region flanking each of the RIT elements. Primers were designed that spanned the entire distance from tfdC to the opposite end of the RIT element on chromosome 1 based on both the PacBio genome assembly and the TAIL-PCR sequencing results (which PacBio places adjacent to the chromosome 2 RIT element). PCR products of the expected size were produced from both primer sets; however, sequencing revealed that the PacBio product had multiple peaks indicative of a likely PCR chimera. The original TAIL-PCR primer set produced good quality sequencing results, suggesting that the PacBio assembly had mis-assembled the two ‘junkyard’ regions. There was no homologous RIT element found on any of the plasmid sequences.

5.3 Discussion

Historically, bacterial genomes have been defined as one large circular chromosome with additional information carried on transient plasmids that were not a defining feature of the species. However, the discovery of secondary chromosomes, or chromids (Harrison et al. 2010), and other stable replicons that contribute important lifestyle characteristics has necessitated a closer inspection of bacterial replicon dynamics. Besides the mobility associated with gene occurrence on a plasmid, these studies have also revealed differences in gene regulation and in the rates of recombination, rearrangement or mutation for both plasmid and chromid genes, as well as a separation of core and secondary functions between different replicons (Chain et al. 2006). Many genome projects and assembly platforms discuss assembly metrics with the goal of creating one large contig without considering the prevalence of multiple replicons in environmental isolates. Of the 4386 completed genomes available through the NCBI genome database (accessed 01 March 2015), 1306 (30%) contain multiple replicons, of which almost half (644) contain more than 2 replicons. The use of PacBio SMRT sequencing allows for the primary goal of gene identification while also providing important genome characterization and a putative location for those genes that can be experimentally validated.

As with many genome sequencing projects, the initial goal of this work was to investigate the genomic context of the chlorocatechol degradation operon in our strain. However, the presence of a large repeated element directly adjacent to the operon, and a copy of it present on the second chromosome, resulted in a fragmented assembly that could not be reconciled through short read sequencing or via different PCR methods. PacBio SMRT sequencing allowed us to locate our operon in a region of strain specificity that was particularly difficult to assemble due to the presence of multiple mobile genes and gene fragments. Although sequencing revealed that the mobile element ‘junkyard’ surrounding the RIT element had been misassembled, the overall organization of the PacBio assembly agrees well with our experimental results. The assembly also has the added benefit of providing a putative assembly of the difficult regions, from which primers can be designed and tested.

Only one of the closely related Burkholderia strains used in this study (B. xenovorans LB400) contained CC degradative genes, and these genes occur in a well-documented integrative conjugative element (ICEclc) located on chromosome 1 (Pradervand et al. 2014). The region surrounding the CC degradative operon in OLGA172 was designated as a potential island by the IslandViewer website (Dhillon et al. 2013; www.pathogenomics.sfu.ca/islandviewer) however a close inspection of the genes present does not support mobility of this region. Therefore we have used the term region of genome plasticity (RGP) to describe the genomic context. There are no conjugation or transfer genes to suggest that this is an integrative conjugative element (ICE) or prophage, and no flanking repeats were identified to suggest a transposon or genomic island. The CC degradative genes found in our strain were not localized to the same region of the genome as those found in LB400, showed no evidence of being contained in an ICE, and bore limited protein identity with the clc genes from LB400 (50-65% protein ID). There is only low similarity (tfdC has only 56% protein identity) to the genes carried on megaplasmids described by Sakai et al. (2014). However the region directly adjacent to the CC degradative operon on these megaplasmids corresponds with a conserved region referred to as the chromid region due to its homology with genes occurring on the second chromosome (or ‘chromid’; Harrison et al. 2010) of Burkholderia phytofirmans PsJN, Burkholderia xenovorans LB400 and Burkholderia sp. CCGE1002. The authors therefore suggested that the acquisition of the degradative genes may have been the result of insertion of the ancestor plasmid into a mobile element adjacent to the genes on the chromid (designated Tn6233) and subsequent acquisition of the genes and one copy of the mobile element on resolution (Sakai et al. 2014). 
Not surprisingly, this chromid region is also homologous to genes found on the second chromosome of Burkholderia sp. OLGA172. There are no tfd genes found on the second chromosome of our strain, however the presence of identical RIT elements on both chromosomes provides an opportunity for these genes to be transferred through homologous recombination between replicons. In the case of our strain, it is clear that the CCD operon is located on the primary chromosome. Due to the pristine nature of the soil environment where this strain was isolated, it is possible that this operon is utilized for a different purpose in the natural environment of OLGA172. This is further supported by the variable degradation ability observed in this strain, which is likely the result of transcriptional and biochemical inefficiencies on this substrate (Goordial, 2010). This is consistent with published findings indicating that toxic intermediates in the chlorocatechol pathway can accumulate, and that this toxicity is evident when only one copy of the degradative operon is maintained in the cell (Perez-Pantoja et al. 2003).

Although not within the scope of this project, the source and role of the third plasmid (pOLGA_3, 23 kb in length) would also make an interesting study. The plasmid replication gene for plasmid 3 was only strongly homologous (>75% nucleotide identity) to two other Burkholderia strains, however it showed lower homology (~ 30%) to plasmids from a diversity of sources. Included among these were two very small plasmids, a 12 kb plasmid found in Ralstonia solanacearum and a 3.2 kb plasmid isolated from Laribacter hongkongensis. In addition to these, the replication gene was also 41% homologous to a P2-like phage isolated from Burkholderia cepacia complex, and this phage was unique as it was the sole representative from that study for which the prophage is maintained as a plasmid within the cell (Lynch et al. 2010). The majority of the remaining matches corresponded to whole genome shotgun sequences and therefore the evolution of this particular replicon cannot be further investigated at this time. It would be interesting to further examine the relationship between Burkholderia phages and plasmid evolution.

While ease of assembly is often a key factor in sequencing decisions, it has been our experience that hesitations in adopting the use of PacBio SMRT sequencing have been attributed to cost per base comparisons to other available technologies. Certainly for projects sequencing a number of isolates or for routine testing of clinically relevant strains, the cost of PacBio sequencing is still prohibitively expensive. However we submit that the benefit of obtaining not only the functional gene content but also the number of individual replicons and the intact assembly of mobile genetic elements contained within the assembly provides a tangible benefit to current and future comparative studies that justifies the increased investment.

This represents a reasonably priced option whereby the immediate goals of any individual sequencing project can be achieved without contributing to the increased abundance of fragmented genomes in the public databases.

5.4 Acknowledgements

The authors gratefully acknowledge Eric Collins at the University of Alaska Fairbanks and Tony (Heng) Qian for assistance with genome assembly and annotation. NR is grateful to Ann Provoost and Kristel Mijnendonckx at the Belgian Nuclear Research Centre (SCK·CEN) for providing C. metallidurans CH34 DNA and for assistance and guidance for the large plasmid extractions. B. phytofirmans PsJN was kindly provided by Angela Sessitsch of the Austrian Institute of Technology. Funding in the form of a NSERC Discovery Grant to RF and a NSERC CGS-D Scholarship and Michael Smith Foreign Study Supplement to NR are gratefully acknowledged. The funding agency had no role in this study.

5.5 References

Altschul, S.F., W. Gish, W. Miller, E.W. Myers and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Aziz R.K., D. Bartels, A.A. Best, M. DeJongh, T. Disz, R.A. Edwards, K. Formsma, S. Gerdes, E.M. Glass, M. Kubal, F. Meyer, G.J. Olsen, R. Olson, A.L. Osterman, R.A. Overbeek, L.K. McNeil, D. Paarmann, T. Paczian, B. Parrello, G.D. Pusch, C. Reich, R. Stevens, O. Vassieva, V. Vonstein, A. Wilke and O. Zagnitko. 2008. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 9:75. doi:10.1186/1471-2164-9-75

Andrup, L., K.K. Barfod, G.B. Jensen, and Smidt, L. 2008. Detection of large plasmids from the Bacillus cereus group. Plasmid 59(2):139-143. doi: 10.1016/j.plasmid.2007.11.005.

Barbosa, E.G.V, F.F. Aburialle, R.T.J. Ramos, A.R. Carneiro, Y.L. Loir, J.B.A. Miyoshi, A. Silva and V. Azevedo 2014. Value of a newly sequenced bacterial genome. World J Biol Chem 5(2):161-168.

Branscomb, E. and P. Predki. 2002. On the high value of low standards. J. Bact. 184(23):6406-6409.

Carver, T., N. Thomson, A. Bleasby, M. Berriman and J. Parkhill. 2009. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 25(1):119-120.

Chatterjee D.K., S.T. Kellogg, S. Hamada, A.M. Chakrabarty. 1981. Plasmid specifying total degradation of 3-chlorobenzoate by a modified ortho pathway. J Bacteriol 146(2):639–646.

Chain, P.S., V.J. Denef, K.T. Konstantinidis, L.M. Vergez, L. Agullo, V.L. Reyes, L. Hauser, M. Cordova, L. Gomez, M. Gonzalez, M. Lan, V. Lao, F. Larimer, J.J. LiPuma, E. Mahenthiralingam, S.A. Malfatti, C.J. Marx, J.J. Parnell, A. Ramette, P. Richardson, M. Seeger, D. Smith, T. Spilker, W.J. Sul, T.V. Tsoi, L.E. Ulrich, I.B. Zhulin and J.M. Tiedje. 2006. Burkholderia xenovorans LB400 harbors a multi-replicon, 9.73-Mbp genome shaped for versatility. Proc Natl Acad Sci U.S.A. 103(42):15280-7.

Chevreux, B., T. Wetter and S. Suhai. 1999. Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.

Chin, C.-S., D.H. Alexander, P. Marks, A.A. Klammer, J. Drake, C. Heiner, A. Clum, A. Copeland, J. Huddleston, E.E. Eichler, S.W. Turner and J. Korlach. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10:563-569. doi:10.1038/nmeth.2474

Cordero, O. X. and P. Hogeweg. 2009. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proc Natl Acad Sci U.S.A. 106(51):21748-21753.

Darling, A. C., B. Mau, F.R. Blattner and N.T. Perna. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome research 14(7):1394-1403.

Dhillon, B.K., T.A. Chiu, M.R. Laird, M.G.I. Langille, and F.S.L. Brinkman. 2013. IslandViewer update: improved genomic island discovery and visualization. Nucleic Acids Res 41(Web server issue):W129-132. PMID: 23677610

Don, R. H. and J.M. Pemberton. 1981. Properties of six pesticide degradation plasmids isolated from Alcaligenes paradoxus and Alcaligenes eutrophus. J Bacteriol 145(2):681-686.

Don, R.H., A.J. Weightman, H.J. Knackmuss and K.N. Timmis. 1995. Transposon mutagenesis and cloning analysis of the pathways for degradation of 2,4-dichlorophenoxyacetic acid and 3-chlorobenzoate in Alcaligenes eutrophus JMP134(pJP4). J Bacteriol 161(1):85–90.

Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.

Fulthorpe, R.R., C. McGowan, O.V. Maltseva, W.E. Holben, and J.M. Tiedje. 1995. 2,4-Dichlorophenoxyacetic acid-degrading bacteria contain mosaics of catabolic genes. Appl Environ Microbiol 61(9):3274-3281.

Fulthorpe, R.R., A.N. Rhodes and J.M. Tiedje. 1998. High levels of endemicity of 3-chlorobenzoate-degrading soil bacteria. Appl Environ Microbiol 64(5):1620-1627.

Gaillard, M., T. Vallaeys, F.J. Vorhölter, M. Minoia, C. Werlen, V. Sentchilo, A. Pühler and J.R. van der Meer. 2006. The clc element of Pseudomonas sp. strain B13, a genomic island with various catabolic properties. J Bacteriol 188:1999-2013.

Ghodsi, M., C.M. Hill, I. Astrovskaya, H. Lin, D.D. Sommer, S. Koren and M. Pop. 2013. De novo likelihood-based measures for comparing genome assemblies. BMC Research Notes 6:334.

Goordial, J. 2010. Characterization of a Novel Chlorobenzoate Degrading bacterium: Burkholderia phytofirmans OLGA172, Isolated from a Pristine Environment. M.Sc. Thesis Dept. Ecology and Evolutionary Biology, University of Toronto.

Harrison, P.W., R.P. Lower, N.K. Kim and J.P.W. Young. 2010. Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends Microbiol 18(4):141-148.

Heuer, H. and K. Smalla. 2012. Plasmids foster diversification and adaptation of bacterial populations in soil. FEMS Microbiol Rev 36(6):1083-1104.

Hubner A, C.E. Danganan, L. Xun, A.M. Chakrabarty and W Hendrickson. 1998. Genes for 2,4,5-trichlorophenoxyacetic acid metabolism in Burkholderia cepacia AC1100: characterization of the tftC and tftD genes and locations of the tft operons on multiple replicons. Appl Environ Microbiol 64:2086–2093.

Jin S. 2010. Evidence of Mobility of the 3-Chlorobenzoate Degradative Genes in a Pristine Soil Isolate, Burkholderia phytofirmans OLGA172, M.Sc. Thesis. Dept. Ecology and Evolutionary Biology, University of Toronto.

Kearse, M., R. Moir, A. Wilson, S. Stones-Havas, M. Cheung, S. Sturrock, S. Buxton, A. Cooper, S. Markowitz, C. Duran, T. Thierer, B. Ashton, P. Mentjies and A. Drummond. 2012. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647-1649.

Klassen J.L., and C.R. Currie. 2012. Gene fragmentation in bacterial draft genomes: extent, consequences and mitigation. BMC Genomics. 13:14.

Konstantinidis, K.T. and J.M. Tiedje. 2004. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci U.S.A. 101(9):3160-3165.

Koren, S., G. P. Harhay, T. P. Smith, J. L. Bono, D. M. Harhay, S. D. Mcvey, D. Radune, N. H. Bergman, and A. M. Phillippy. 2013. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol 14(9): R101.

Kurtz, S., J.V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye and R. Giegerich. 2001. REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res 29(22):4633-4642.

Leander, M., T. Vallaeys, and R. Fulthorpe. 1998. Amplification of putative chlorocatechol dioxygenase gene fragments from α-and β-Proteobacteria. Can J Microbiol 44(5): 482-486.

Leveau, J.H.J., C. Werlen, and J.R. van der Meer. 1996. Molecular mechanisms of genetic adaptation to xenobiotic compounds. International Biodeterioration & Biodegradation 37(3):252.

Liu, S., N. Ogawa and K. Miyashita. 2001. The chlorocatechol degradative genes, tfdT-CDEF, of Burkholderia sp. strain NK8 are involved in chlorobenzoate degradation and induced by chlorobenzoates and chlorocatechols. Gene, 268(1):207-214.

Liu, S., N. Ogawa, T. Senda, A. Hasebe and K. Miyashita. 2005. Amino acids in positions 48, 52 and 73 differentiate the substrate specificities of the highly homologous chlorocatechol 1,2-dioxygenases CbnA and TcbC. J. Bact 187(15):5427-5436.

López-Leal, G., M.L. Tabche, S. Castillo-Ramírez, A. Mendoza-Vargas, M.A. Ramírez-Romero and G. Dávila. 2014. RNA-Seq analysis of the multipartite genome of Rhizobium etli CE3 shows different replicon contributions under heat and saline shock. BMC genomics 15(1):770.

Lynch, K. H., P. Stothard, and J. J. Dennis. 2010. Genomic analysis and relatedness of P2-like phages of the Burkholderia cepacia complex. BMC genomics 11(1): 599.

Norman, A., L.H. Hansen and S.J. Sørensen. 2009. Conjugative plasmids: vessels of the communal gene pool. Philos Trans R Soc London [Biol] 364(1527):2275-2289.

Ormeño-Orrillo, E., M.A. Rogel, L.M.O. Chueire, J.M. Tiedje, E. Martínez-Romero and M. Hungria. 2012. Genome Sequences of Burkholderia sp. Strains CCGE1002 and H160, Isolated from Legume Nodules in Mexico and Brazil. J Bacteriol 194(24):6927. doi:10.1128/JB.01756-12.

Parkhill J. 2000. In defense of complete genomes. Nat Biotechnol.18:493–494.

Perez-Pantoja, D., T. Ledger, D.H. Pieper and B. Gonzalez. 2003. Efficient turnover of chlorocatechols is essential for growth of Ralstonia eutropha JMP134 (pJP4) in 3-chlorobenzoic acid. J Bacteriol 185(5):1534-1542.

Phillippy A.M., M.C. Schatz and M. Pop. 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9:R55.

Pradervand, N., S. Sulser, F. Delavat, R. Miyazaki, I. Lamas, and J.R. van der Meer. 2014. An Operon of Three Transcriptional Regulators Controls Horizontal Gene Transfer of the Integrative and Conjugative Element ICEclc in Pseudomonas knackmussii B13. PLoS Genetics. doi:10.1371/journal.pgen.1004441

Ricker, N., H. Qian and R.R. Fulthorpe. 2012. The limitations of draft assemblies for understanding prokaryotic adaptation and evolution. Genomics, 100(3):167-175.

Sakai, Y., N. Ogawa, Y. Shimomura and T. Fujii. 2014. A 2, 4-dichlorophenoxyacetic acid degradation plasmid pM7012 discloses distribution of an unclassified megaplasmid group across bacterial species. Microbiology 160(3):525-536.

Schlömann, M. 1994. Evolution of chlorocatechol catabolic pathways. Biodegradation 5(3-4):301-321.

Sentchilo, V., K. Czechowska, N. Pradervand, M. Minoia, R. Miyazaki and J.R. van der Meer. 2009. Intracellular excision and reintegration dynamics of the ICEclc genomic island of Pseudomonas knackmussii sp. strain B13. Mol Microbiol 72(5):1293-1306.

Siguier, P., Pérochon, J., Lestrade, L., Mahillon, J. and M. Chandler. 2006. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34(suppl 1), D32-D36.

Slater, S.C., B.S. Goldman, B. Goodner, J.C. Setubal, S.K. Farrand, E.W. Nester, T.J. Burr, L. Banta, A.W. Dickerman, I. Paulsen, L. Otten, G. Suen, R. Wench, N.F. Almeida, F. Arnold, O.T. Burton, Z. Du, A. Ewing, E. Godsy, S. Heisel, K.L. Houmiel, J. Jhaveri, J. Lu, N.M. Miller, S. Norton, Q. Chen, W. Phoolcharoen, V. Ohlin, D. Ondrusek, N. Pride, S.L. Sticklin, J. Sun, C. Wheeler, L. Wilson, H. Zhu and D.W. Wood. 2009. Genome Sequences of Three Agrobacterium Biovars Help Elucidate the Evolution of Multichromosome Genomes in Bacteria. J. Bact. 191(8):2501-2511. doi:10.1128/JB.01779-08

Sóki, J. 2013. Extended role for insertion sequence elements in the antibiotic resistance of Bacteroides. World J Clin Infect Dis 3, 1-12.

Suárez-Moreno, Z. R., Caballero-Mellado, J., Coutinho, B. G., Mendonça-Previato, L., James, E. K. and V. Venturi. 2012. Common features of environmental and potentially beneficial plant-associated Burkholderia. Microb Ecol 63(2):249-266.

Suwa, Y., W. E. Holben, and L. J. Forney. 1994. Cloning of a novel 2,4-D catabolic gene isofunctional to tfdA from Pseudomonas sp. strain TFD3, abstr. Q-403, p. 459. In Abstracts of the 94th General Meeting of the American Society for Microbiology 1994. American Society for Microbiology, Washington, D.C

Sydenham, T. V., Sóki, J., Hasman, H., Wang, M. and U.S. Justesen. 2015. Identification of antimicrobial resistance genes in multidrug-resistant clinical Bacteroides fragilis isolates by whole genome shotgun sequencing. Anaerobe 31:59-64.

Tonso, N. L., V. G. Matheson, and W. E. Holben. 1995. Polyphasic characterization of a suite of bacterial isolates capable of degrading 2,4-D. Microb. Ecol. 30:1–22.

van der Meer, J.R., A.R. van Neerven, E.J. de Vries, W.M. de Vos and A.J. Zehnder. 1991. Cloning and characterization of plasmid-encoded genes for the degradation of 1,2-dichloro-, 1,4-dichloro-, and 1,2,4-trichlorobenzene of Pseudomonas sp. strain P51. J Bacteriol 173(1):6–15.

van der Meer, J.R., W.M. De Vos, S. Harayama and A.J.B. Zehnder. 1992. Molecular Mechanisms of Genetic Adaptation to Xenobiotic Compounds. Microbiol Rev 56(4):677-694.

van der Meer, J. R. 2003. Evolution of metabolic pathways for degradation of environmental pollutants. Encyclopedia of Environmental Microbiology.

Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.

Weilharter, A., B. Mitter, M.V. Shin, P.S. Chain, J. Nowak and A. Sessitsch. 2011. Complete genome sequence of the plant growth-promoting endophyte Burkholderia phytofirmans strain PsJN. Journal of Bacteriology 193(13):3383-4. doi: 10.1128/JB.05055-11.

Zerbino, D. R., & Birney, E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research 18(5), 821-829.

Chapter 6 Expression and Activity of RIT Elements

Acknowledgements and Contributions: This work was performed in collaboration with the researcher who originally described the RIT elements, Rob Van Houdt, at the Belgian Nuclear Research Center (SCK•CEN), due to his shared interest in elucidating RIT mobility and expression characteristics. The experiments were carried out over two research terms at the SCK•CEN for a total of almost 12 months over a two-year period. Experimental design, training and project supervision at the SCK•CEN was provided by Rob Van Houdt. Wietse Heylen and Ann Provoost provided technical assistance.

6 Introduction

As discussed in Chapter 4, RIT elements contain three TBSSRs and display a characteristic gene order and repeat architecture that is conserved across 7 bacterial phyla (Van Houdt et al. 2009; Van Houdt et al. 2012; Ricker et al. 2013). Since the recombinases of the RIT elements belong to sub-families of TBSSRs that have been commonly annotated as integrases, the two terms will be considered equivalent and used interchangeably. As I have shown, RIT elements can occur as multiple identical copies within individual genomes and are commonly found on plasmids and in genomic islands, including plant symbiosis and catabolic islands. These observations support the idea that they are mobile and that their role in genetic rearrangement/movement is likely to be a universal one. In this chapter, I describe the series of experiments performed to look for mobility of RIT elements. For these experiments, I obtained Caulobacter sp. K31 (generously donated by Craig Stephens of Santa Clara University, Santa Clara, California, USA) since the presence of multiple identical RIT copies in this genome is strongly indicative of their putative mobility in this strain. Many of the experiments were also performed in parallel with our strain of interest, Burkholderia sp. OLGA172. The ability of RIT elements to excise and relocate was tested using a variety of mating experiments ranging from non-specific intracellular mobility to site-specific targeting during conjugation. The general strategy used was to separate the three recombinase genes from their flanking inverted repeats to induce the recombinases (on one vector) to move a selectable marker (kanamycin resistance) which has been inserted between the inverted repeats (RIT::Km cassette) on a separate vector.

In the initial experiments, a conjugative plasmid (pOX38) was also added to the cells and conjugation occurred after induction. If the RIT::Km has been mobilized to the conjugative plasmid then it will escape the original cell and be detected in the recipient (as evident by gained kanamycin resistance in the recipient). In the second set of experiments, the RIT::Km cassette was carried by a suicide construct – a vector capable of its own conjugative transfer but that cannot be maintained in the recipient cell. The expression vector was contained within the recipient cell, along with a target site plasmid and induction occurred during conjugation so that the recombinases would be active when the suicide construct entered the cell.

6.1 Materials and Methods

6.1.1 Growth of Bacterial Strains

All E. coli cultures were grown in LB media supplemented with antibiotics when appropriate (kanamycin 50 µg/mL, ampicillin 100 µg/mL, tetracycline 20 µg/mL, streptomycin 50 µg/mL, chloramphenicol 30 µg/mL). M9 media with and without the addition of 1 mM leucine was utilized for differentiating auxotrophs, and LB with 0.3 mM diaminopimelic acid (DAP) was utilized for growing the MFDpir strain (provided by Jean Marc Ghigo from the Institut Pasteur, Paris, France). Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 (provided by Craig Stephens from Santa Clara University, Santa Clara, California, USA) were grown in Pseudomonas F media and Peptone Yeast Extract (PYE), respectively.

Table 6.1.1: List of strains used in this study.

Donor and Recipient Strains | Genotype
E. coli DG1 | mcrA Δ(mrr-hsdRMS-mcrBC, modification-, restriction-) φ80lacZΔM15 ΔlacX74 recA1 araD139 Δ(ara-leu)7697 galU galK rpsL endA1 nupG
E. coli DH5α | dlacZ ∆M15 ∆(lacZYA-argF) U169 recA1 endA1 hsdR17(rK-mK+) supE44 thi-1 gyrA96 relA1
E. coli S17-1 λpir | TpR SmR recA, thi, pro, hsdR-M+RP4: 2-Tc:Mu: Km Tn7 λpir
E. coli MFDpir | MG1655 RP4-2-Tc::[ΔMu1::aac(3)IV-ΔaphA-Δnic35-ΔMu2::zeo] ΔdapA::(erm-pir) ΔrecA
E. coli HB101 | F- mcrB mrr hsdS20(rB- mB-) recA13 leuB6 ara-14 proA2 lacY1 galK2 xyl-5 mtl-1 rpsL20(SmR) glnV44 λ-

Table 6.1.2: List of constructs created during this study.

Constructs | Resistances | Description
pKK223-K31IntExp | Amp | Open reading frames for K31 RIT element TBSSRs inserted downstream of tac promoter
pKK223-OlgaIntExp | Amp | Open reading frames for Olga RIT element TBSSRs inserted downstream of tac promoter
pACYC184-RIT::Km | Tc, Km | Kanamycin cassette inserted between the inverted repeats of the RIT element from K31; courtesy of Wietse Heylen
pTrc99-K31RITA-C | Amp | K31 RIT element recombinases inserted in pTrc99 backbone
pACYC-TSV1 | Tc | Target site 1 from K31 DUF1738 gene inserted in pACYC184 backbone
pACYC-TSV2 | Tc | Target site 2 from DUF1738 gene inserted in pACYC184 backbone
pSF100-RIT::Km | Km | Suicide construct containing Km cassette flanked by RIT element repeats

6.1.2 Construct creation

Expression constructs were created by inserting only the open reading frames for each recombinase (individually or combined as a single transcript) using primers designed to be compatible with the cloning sites of both the pKK223.1 and pTrc99A vector backbones. Donor plasmids were created by first inserting a complete RIT element into the pACYC184 vector backbone and then amplifying a new backbone that contained the flanking sequences (including the inverted repeats) but without the open reading frames for the integrase genes. This new backbone was then ligated with a kanamycin gene cassette in order to create the pACYC184-RIT::Km donor plasmid. Target sequence oligos were designed with compatible ends to the cloning site in pACYC184 and directly ligated to create the target site 1 and target site 2 plasmids (pACYC-TSV1 and –TSV2, respectively). Plasmids were cloned into chemically competent DG1 cells, selected on appropriate antibiotics, and confirmed by restriction digest analysis and sequencing. The suicide vector was created by amplifying the kanamycin cassette flanked by the RIT element sequence from pACYC184-RIT::Km and then ligating the sequence into a pSF100 suicide vector backbone. This backbone contains the R6K origin of replication and therefore was maintained in S17-1 λpir host prior to conjugations.

Figure 6.1.1: Constructs used in the final conjugation experiment. Diagrams were created using pDRAW software. The target site in pACYC184-target1 is indicated in orange. The kanamycin cassette in pSF100-Km is flanked by the approximately 300 bp of RIT element sequence flanking the recombinases which includes the inverted repeats.

Since the recipient cells required the addition of two separate plasmids (expression vector and target site vector), E. coli DH5α cells containing the expression plasmid were made competent by washing as follows: overnight cultures of bacteria were diluted 1/100 in fresh media with antibiotics and grown for 2-3 hours to an OD600 of approximately 0.4. All tubes, solutions and cultures were chilled on ice for 30 minutes, with occasional swirling. Cells were pelleted at 5000 rpm for 5 minutes in a centrifuge at 4 °C. Supernatant was removed and ice cold sterile Milli-Q water was used to re-suspend the cells. Pelleting and re-suspension of cells was repeated using sequentially smaller volumes of water until a final volume of 100 µL remained per tube. Electroporation of the target plasmid was performed in 1 mm cuvettes using 50 µL of competent cells (1.8 kV).

6.1.3 Mating-out Assays

Expression plasmid (pKK223 backbone) and pACYC184-RIT::Km donor plasmid were transformed into the same DH5α cell. A conjugative plasmid, pOX38, was also introduced through conjugation. Confirmation of the maintenance of all three plasmids was determined by plasmid specific PCR and visualization of the individual plasmids after DNA extraction (Promega Wizard Miniprep kit, according to the manufacturer’s instructions). After confirmation that all three plasmids were present, cells were conjugated with E. coli HB101. As HB101 is streptomycin resistant, transconjugants resistant to both streptomycin and kanamycin would be evidence of pOX38 mediated transfer of the RIT::Km cassette into the recipient strain.

6.1.4 Conjugation Experiments

Cultures were inoculated from single colonies into LB with appropriate antibiotics (3 mL) and grown at 37 °C for 6 hours. Cells were then washed with saline to remove antibiotics. Donor and recipient cells were re-suspended in appropriate volumes to create approximately the same cell density for each. Matings were performed by mixing cultures on filters on plain LB media (for uninduced) or LB + 0.2 mM IPTG (induced) and incubated overnight at 37 °C. The following day the filters were removed and placed in microcentrifuge tubes containing 1 mL saline and vortexed. Undiluted and 1/10 diluted cultures were then plated on selective media to search for transconjugants. Both pre-mating and post-mating cultures were also serially diluted to determine total counts of donor and recipient cells. Putative transconjugants were inoculated with toothpicks into PCR grade water and streaked onto selective media for confirmation, as well as utilized for colony PCR. Based on colony PCR results, individual colonies from the selective media were grown for plasmid extractions and further confirmed through restriction digests and sequencing. Conjugation experiments were also varied to include stationary phase cultures or greater density log phase cultures, as well as including pre-induced cells (induction for 2 hours prior to mating) and using pACYC-TSV2 in the recipient cells. Finally, due to the high prevalence of false positives in the initial suicide construct mating experiments, conjugations were also performed using MFDpir with pSF100-RIT::Km as the donor cell with the same recipients as earlier experiments (separate matings with target 1 and target 2 recipients) and these matings were performed on LB + 0.3 mM diaminopimelic acid (DAP) to support growth of the MFDpir strain.
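The plate counts described above are typically converted to viable counts per mL and, where transconjugants are recovered, to a transfer frequency. The following is a minimal sketch of that calculation; the function names, the colony counts, the plated volume of 0.1 mL and the choice to normalize per recipient cell are all illustrative assumptions and are not taken from the experiments reported here.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """Viable count of the undiluted suspension.

    dilution_factor is the fold dilution of the plated sample
    (1 for undiluted, 10 for a 1/10 dilution, and so on).
    """
    if colonies < 0 or dilution_factor <= 0 or plated_volume_ml <= 0:
        raise ValueError("counts, dilution and volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def transfer_frequency(transconjugants_per_ml, recipients_per_ml):
    """Transconjugants per recipient cell (one common normalization)."""
    if recipients_per_ml <= 0:
        raise ValueError("recipient count must be positive")
    return transconjugants_per_ml / recipients_per_ml


# Hypothetical plate counts, for illustration only.
recipients = cfu_per_ml(colonies=120, dilution_factor=1e6, plated_volume_ml=0.1)
transconjugants = cfu_per_ml(colonies=6, dilution_factor=10, plated_volume_ml=0.1)
freq = transfer_frequency(transconjugants, recipients)

print(f"Recipients: {recipients:.2e} CFU/mL")
print(f"Transconjugants: {transconjugants:.2e} CFU/mL")
print(f"Transfer frequency: {freq:.1e} per recipient")
```

Normalizing per donor cell instead would only require passing the donor count as the denominator.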

6.1.5 Expression Experiments

Expression experiments with and without induction were performed on E. coli cells containing pKK223-OlgaA-C and pKK223-K31A-C. RNA extractions were performed using Trizol and RNA was treated with deoxyribonuclease I (Invitrogen) according to the manufacturer’s instructions. PCR amplification was performed on DNase-treated samples to ensure no DNA contamination in RNA samples. DNase-treated RNA was used as template for first-strand cDNA synthesis. RNA, 50 ng/µL random hexamers, 10 mM dNTP mix and diethylpyrocarbonate (DEPC) treated water were incubated at 65 °C for 5 minutes, and 4 °C for 1 minute. 40 units of RNase inhibitor (RNaseOUT, Invitrogen) and 200 units Superscript III reverse transcriptase (Invitrogen) were used for each sample, incubated at 25 °C for 5 minutes and then heated to 50 °C for 1.5 hours. The reaction was stopped by heating to 70 °C for 15 minutes. Quantitative PCR (qPCR) was performed using SYBR Green I technology on an ABI 7300 Sequence Detection System (Applied Biosystems). A master mix for each PCR run was prepared with SYBR Green PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The following amplification program was used: 95 °C for 2 min, then 40 cycles of 95 °C for 15 s followed by 58 °C for 35 s. A dissociation step was added (95 °C for 15 s, 60 °C for 30 s, 95 °C for 15 s) to produce melting curves of products that could be analyzed for primer dimers and PCR artifacts. Representative samples were run on a 1% agarose gel to confirm that products were the expected size. Dilutions of genomic DNA from Burkholderia sp. str. OLGA172 ranging from 10^1 to 10^-4 ng/µL total DNA were included to create a standard curve for each primer set and dilutions of cDNA ranging from 50 to 1 ng were used to analyze primer efficiency on transcripts. PCR efficiencies for all primers used were in the range of 90-110%. A standardized threshold setting of 0.8 units above the background level was utilized for every experiment for consistency.
Each sample was normalized against 16S using the comparative ΔΔCt method (relative expression = 2^(-ΔΔCt)). Results were considered significant if there was a minimum of 2 fold difference in expression between treated and control samples (t-test, α = 0.05).
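The two calculations underlying these qPCR analyses can be sketched as follows. This is a minimal illustration that assumes the standard conventions: primer efficiency is derived from the slope of Ct against log10 template quantity as E = 10^(-1/slope) - 1, and relative expression is 2^(-ΔΔCt) with ΔCt taken as Ct(target) minus Ct(16S). The function names and all Ct values below are hypothetical and are not data from this study.

```python
def standard_curve_slope(log10_quantities, ct_values):
    """Least-squares slope of Ct versus log10(template quantity)."""
    n = len(log10_quantities)
    if n != len(ct_values) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(log10_quantities) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantities, ct_values))
    return sxy / sxx


def efficiency_percent(slope):
    """Amplification efficiency; a slope near -3.32 gives about 100%."""
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change by the comparative Ct method, 2^(-ddCt)."""
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_treated - delta_control)


# Hypothetical standard curve over a 10^1 to 10^-4 ng/uL dilution series.
log10_ng = [1, 0, -1, -2, -3, -4]
cts = [15.00, 18.32, 21.64, 24.96, 28.28, 31.60]
slope = standard_curve_slope(log10_ng, cts)
eff = efficiency_percent(slope)

# Hypothetical Ct values: target gene and 16S, induced versus uninduced.
fold = relative_expression(20.0, 10.0, 22.0, 10.0)

print(f"slope = {slope:.2f}, efficiency = {eff:.1f}%")
print(f"fold change = {fold:.1f}")
```

Under these assumptions, a primer set would pass the 90-110% efficiency criterion when the standard curve slope lies between roughly -3.6 and -3.1, and a fold change of at least 2 corresponds to a ΔΔCt of -1 or lower.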

6.2 Results

Most of this work was performed over two separate research terms at the Belgian Nuclear Research Centre (SCK•CEN) and the results will therefore be discussed chronologically. In the initial research term, the goal of the experiments was to determine whether the RIT element could excise and insert into a conjugative plasmid without the addition of a specified target site. For this reason, a mating-out assay was designed. This was necessary since the project was limited in duration (3 months) and a target site was not readily apparent from the examinations performed at that stage. It was also anticipated that strong induction could potentially overcome the need for a specific target site. For ascertaining mobility potential of RIT elements, I created an expression plasmid containing only the open reading frames for the three recombinases that make up the RIT element, under the control of an IPTG-inducible promoter. Initial expression constructs were created using the recombinase open reading frames from each of Burkholderia sp. OLGA172 and Caulobacter sp. K31 in the pKK223.1 vector backbone (to create pKK223-K31A-C and pKK223-OlgaA-C). The complete RIT element in which the recombinase open reading frames were replaced with a single kanamycin resistance gene (RIT::Km cassette) was inserted into a pACYC184 vector backbone to create the donor vector. A conjugative plasmid (pOX38) was also present in the donor strain to act as a recipient of the mobilized kanamycin gene and to facilitate transfer to the recipient cells. Donor cells containing all three plasmids in an E. coli DH5α (nalidixic acid resistant) background were mated with E. coli HB101 cells (streptomycin resistant). Providing the kanamycin resistance gene had been transferred to the pOX38 conjugative plasmid, transconjugants would be selected on LB-Km-Sm. All constructs were confirmed initially by restriction digest analysis and then sent for sequencing to confirm key regions of the sequence. For the pKK223 expression plasmids, the matings were performed prior to receiving sequence confirmation due to time restrictions.

6.2.1 No evidence of intracellular mobility without a target site

Although there was a surprisingly high level of spontaneous double-resistant mutants in the pKK223-K31A-C mating, there were no confirmed instances of a kanamycin gene being transferred to the recipient. There were no spontaneous mutants observed for the pKK223-OlgaA-C mating. The first round of sequencing suggested a possible point mutation in the first recombinase for each of the expression constructs (both Olga and K31), which may have rendered the proteins non-functional. These mutations were originally disregarded due to their proximity to the sequencing primers, but were confirmed with subsequent re-sequencing. In addition, expression experiments revealed that recombinase expression occurred both with and without induction. For this reason, it was determined that the pKK223 constructs did not provide sufficient control over the recombinase genes.

There were significant differences in the expression of individual integrase genes in the expression constructs derived from K31 and Olga (Figure 6.2.1). Although recombinase expression was only slightly increased upon induction, it was clear that the first recombinase of each RIT element had the highest expression, presumably due to the presence of the pKK223 promoter upstream. The expression of the second recombinase was significantly decreased relative to the first gene (p<0.05 for K31 and p<0.005 for Olga). However, expression of the third recombinase decreased in the K31 construct but increased in the Olga construct (p<0.05 for LB and p<0.005 for IPTG) compared to expression of the first recombinase. PCR products were obtained from the cDNA using primers designed to amplify from the first to the second recombinase, indicating that these were transcriptionally linked; however, no product was produced from primers linking the second and third recombinase genes. As these constructs were not going to be utilized for further studies, the reasons for increased int3 expression in the Olga vector were not further investigated.

Figure 6.2.1: Expression of recombinase genes from pKK223-OlgaA-C and pKK223-K31A-C expression vectors. Values are relative abundance of integrase expression to 16S expression. Expression did not differ significantly between induced and un-induced cultures. Significant differences between individual recombinase genes (both within and between species) are discussed in the text.

Table 6.2.1: Decrease in optical density of cell cultures after induction with IPTG.

Construct                 Uninduced (OD600)   1 mM IPTG (OD600)
pTrc99A (empty vector)    0.901 (0.052)       0.842 (0.007)
pTrc99K31A-C              0.736 (0.057)       0.213 (0.001)
pTrc99K31-RITA            0.695 (0.044)       0.342 (0.006)
pTrc99K31-RITB            0.787 (0.027)       0.664 (0.016)
pTrc99K31-RITC            0.666 (0.014)       0.663 (0.017)

Upon returning for a second research term, it was decided that new constructs would be created in the pTrc99A vector backbone such that the recombinase genes would be under the control of the more stringent lacIq regulator. Initial experiments again took place without target site sequences in the form of a conjugative mating-out assay with the original donor plasmid. Although induction of the recombinase genes had a clear impact on cell density (see Table 6.2.1), there were no transfers of kanamycin resistance to recipient cells, and no evidence for recombination or rearrangement between the plasmids found in the donor cells. PCR amplification from primers designed to amplify outwards from the kanamycin cassette suggested that the RIT element was being excised (Figure 6.2.2); however, attempts to confirm the presence of a restored backbone lacking the kanamycin cassette were unsuccessful both by PCR and based on plasmid isolations. Therefore, if the RIT element is excised in the absence of a specific target site, it occurs at levels below the detection limit for these methods.
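To give a sense of the scale of the growth effect summarized in Table 6.2.1, the relative decrease in OD600 upon induction can be computed directly from the tabulated means. The following is a minimal sketch and was not part of the original analysis; the helper name percent_decrease is hypothetical, and the bracketed values in the table are deliberately left out of the calculation.

```python
# Mean OD600 values copied from Table 6.2.1: (uninduced, 1 mM IPTG).
# The bracketed values reported in the table are not used here.
OD600 = {
    "pTrc99A (empty vector)": (0.901, 0.842),
    "pTrc99K31A-C": (0.736, 0.213),
    "pTrc99K31-RITA": (0.695, 0.342),
    "pTrc99K31-RITB": (0.787, 0.664),
    "pTrc99K31-RITC": (0.666, 0.663),
}


def percent_decrease(uninduced, induced):
    """Hypothetical helper: percent drop in OD600 relative to the uninduced culture."""
    return 100.0 * (uninduced - induced) / uninduced


for construct, (uninduced, induced) in OD600.items():
    print(f"{construct}: {percent_decrease(uninduced, induced):.1f}% decrease")
```

On these means, the construct carrying all three recombinases shows the largest drop (about 71%), compared with about 6.5% for the empty vector.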

Figure 6.2.2: PCR amplification using primers designed to amplify out from the kanamycin gene. Lanes 1 and 7 are GeneRuler 1 kb plus ladder. The 700 bp product seen in lane 6 is consistent with that expected if the kanamycin cassette were being excised, and the large bright band evident in lanes 2, 4 and 5 is consistent with the complete plasmid backbone with kanamycin still present. Note that the smaller product is also evident in lane 2, which contains only the p184::Km donor plasmid without the expression plasmid. Lane 3 contains only the expression plasmid, and lanes 4 and 5 have both a donor and an expression plasmid present.

6.2.2 Target site identification

The determination of a potential target site sequence was initially elusive. As identified in chapter 4, RIT elements found in multiple copies within a strain are commonly identical to the ends of the 30-38 bp inverted repeats presumed to designate the ends of the element. Despite the fact that all 3 RIT elements found in Caulobacter sp. K31 were 100% identical and had targeted the same gene (DUF1738) for insertion, in silico removal of the RIT sequences to the end of the inverted repeats did not result in the reconstruction of the original genes. Further investigation revealed that there was an additional sequence, a perfect 20 bp palindrome, adjacent to one of the terminal inverted repeats. Whether this palindrome occurred upstream or downstream of the RIT element (relative to recombinase transcription) was not consistent. It was determined that the location of the palindrome correlated with the direction of transcription of the target gene (in this case the DUF1738 gene) as opposed to the RIT element recombinases (Figure 6.2.3). A BLAST search of the 20 bp palindrome revealed that the sequence did not exist in other DUF1738 genes lacking a RIT element, but instead revealed additional RIT elements that had not been previously identified. Therefore, it was determined that this palindrome sequence must be a component of the RIT element. Further, I hypothesized that an inversion of the RIT element relative to the palindrome must occur either during or after integration in the target site.
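The description of the sequence as a "perfect" palindrome can be checked computationally: a DNA palindrome is a sequence that equals its own reverse complement. The sketch below is illustrative only and uses the K31 palindrome sequence as listed in Table 6.2.2; the function names are hypothetical and were not part of the original workflow.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (case-insensitive input)."""
    complement = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(complement[base] for base in reversed(seq.lower()))


def is_perfect_palindrome(seq):
    """A perfect DNA palindrome reads the same 5' to 3' on both strands."""
    return seq.lower() == reverse_complement(seq)


# 20 bp palindrome listed for Caulobacter sp. K31 in Table 6.2.2.
k31_palindrome = "ttatgccgatatcggcataa"
print(len(k31_palindrome), is_perfect_palindrome(k31_palindrome))  # 20 True
```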

Figure 6.2.3: Orientation of RIT elements in Caulobacter sp. K31 relative to the direction of the target gene DUF1738. Genomic locations are given to the left of each diagram. ‘IR’ designates the inverted repeats that occur at each end of the recombinases.

Removal of the complete RIT element including the palindrome sequence allowed for the perfect reconstruction and alignment of the DUF1738 target genes of K31 and revealed the original target site sequence. Alignment of the same region of DUF1738 from other strains in which there was no evidence of RIT element insertion was used to determine a second potential target site. The latter differed from the K31-derived site by 4 bp (gtcg vs. gggc). With this information, two target site vectors, termed pACYC-TSV1 and pACYC-TSV2, were created to act as recipients for the mobilized kanamycin gene. Each contained a 45 bp target sequence in a pACYC184 vector backbone containing only tetracycline resistance for selection. Each target site plasmid was electroporated separately into a strain containing the pTrc99-K31A-C expression plasmid to create recipient strains with a putative target site and inducible recombinase genes. The suicide construct (pSF100-RIT::Km) was introduced by conjugation, and the final experimental design is illustrated in Figure 6.2.4. Transconjugants capable of growth in both kanamycin and tetracycline were tested for kanamycin insertion in the target site via PCR.

Maintenance of the suicide construct within the recipient cell, and thus the detection of a high number of false positive (TcR and KmR) cells, proved to be a feature of this experimental design. Thinking this might be specific to an active Mu phage carried by the S17-1 donor strain facilitating recombination between the replicons (Ferrières et al. 2010), conjugation experiments were also performed in a Mu-free donor strain (MFDpir); however, high false positive rates were still observed in this strain. Nevertheless, movement of the kanamycin cassette specifically into the target vector was confirmed by sequencing of two positive clones: TSV1A, resulting from the original S17-1 mating with the target site 1 recipient, and TSV2A, resulting from the MFDpir mating with the target site 2 recipient.

Figure 6.2.4: Final experimental design. The recombinase enzymes are represented by blue circles labeled A, B and C although there is no evidence yet to suggest that all three are required or where they may bind. The sequences for the palindrome are written above and the putative binding sites are in bold font. Induction of the expression plasmid would result in production of the three recombinase proteins which would then be free to act on the inverted repeats flanking the kanamycin gene and mobilize it into the target site plasmid.

6.2.3 Sequencing analysis of transconjugants

Analysis of the sequence surrounding the kanamycin cassette after insertion in the target sites shed some light on the mechanism of insertion. For both the TSV1A and TSV2A recombinants, the kanamycin gene has been inserted in the opposite orientation relative to the palindrome when compared to the original suicide vector. Using two different target sites illustrated that the target site sequence itself was unchanged when the element inserted. With only the first target site this could not be determined, since the sequence flanking the kanamycin cassette matched TSV1 (see Figure 6.2.5). From these recombinants, it is clear that target site sequences are unchanged with RIT insertion, and that the 4 bp sequences on each end of the palindrome have been altered. Therefore, it would appear that the strand exchange occurs at both ends of the palindrome, as opposed to in the centre of the palindrome as would be expected based on other known cross-over regions (Hallet et al. 2004).

Figure 6.2.5: Reversal of RIT element in positive transconjugants. Labels are included on the left. The first 4 bases correspond to the portion of the target site that differs between target site 1 and 2 (gtcg/gggc respectively), and the last 4 bases (cact) correspond to the continuation of the target site. ‘IR’ represents the inverted repeats that flank the kanamycin cassette. The palindrome and inverted repeat sequences are unchanged after recombination.

As illustrated in Figure 6.2.6, the TSV1A recombinant has a plasmid that is larger than the donor plasmid (pSF100-RIT::Km), which is 4.4 kb. The original target site plasmid was 2.3 kb, and therefore the addition of the kanamycin cassette should have resulted in a plasmid of 3.2 kb. Sequencing was inconclusive; however, restriction digestion suggests that TSV1A has two copies of the original pACYC184 backbone connected by the kanamycin cassette. This can be seen in Figure 6.2.6, as the HindIII digestion contains all the original bands from the target site plasmid and two additional bands consistent with the kanamycin cassette inserted into the target site plasmid. It is possible that both the original target plasmid and the recombinant were present in the same cell; however, there is no original target plasmid visible on the mini-prep gel, and the digest bands are of equal intensity, which suggests equal amounts of both plasmids. Therefore, if both versions had been present in the cell, they should have been visible prior to digestion.
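The size argument above can be laid out as simple arithmetic. The sketch below is illustrative only: the cassette size is inferred from the two plasmid sizes quoted in the text rather than measured, and the double-backbone structure with a single cassette is the hypothetical arrangement suggested by the digest, not a confirmed sequence.

```python
# Plasmid sizes in kb, as stated in the text.
DONOR_KB = 4.4             # pSF100-RIT::Km
TARGET_KB = 2.3            # original target site plasmid
EXPECTED_INSERT_KB = 3.2   # target plasmid carrying the kanamycin cassette

# Assumption: cassette size inferred from the two sizes above, not measured.
cassette_kb = round(EXPECTED_INSERT_KB - TARGET_KB, 1)

# Hypothetical structure suggested by the HindIII digest:
# two copies of the target backbone joined by one kanamycin cassette.
double_backbone_kb = round(2 * TARGET_KB + cassette_kb, 1)

print(cassette_kb, double_backbone_kb, double_backbone_kb > DONOR_KB)  # 0.9 5.5 True
```

Under these assumptions, a double-backbone recombinant of roughly 5.5 kb would indeed be expected to run above the 4.4 kb donor plasmid.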

Figure 6.2.6: Mating results for the recipient strain containing pTrc99-K31A-C and pACYC-TSV1. The miniprep results show the number of plasmids in each strain (A) and the HindIII digest illustrates that the recombinant (TSV1A, labeled +I in the figure) has maintained the original recipient plasmid bands and also acquired two new bands. The uninduced strain (-I) has bands corresponding to all three of the plasmids. Labels are listed in the legend and the first lane of each gel has the GeneRuler 1 kb plus ladder.

By contrast, the TSV2A recombinant plasmid runs much farther on the gel when undigested than even the original constructs (Figure 6.2.7). However, digestion and sequencing confirmed it to have the expected size, and the sequence corresponded to one copy of the pACYC184 backbone with the kanamycin cassette inserted specifically in the target site. Digestion of the TSV1A and TSV2A plasmids using an enzyme that should produce a single linear band (BamHI) confirmed these differences. For the TSV2 plasmid there is a single band consistent with expectations for the recombinant plasmid; however, the target site 1 recombinant plasmid gave bands of both the original plasmid and the recombinant plasmid (data not shown).

Figure 6.2.7: Target site 1 transconjugants retaining both kanamycin and tetracycline resistance. Lane 1 contains the GeneRuler 1 kb plus ladder. Lane 2 is the TSV2A clone. The small plasmid is the pACYC-TSV2 plasmid with the kanamycin cassette inserted. The larger band was lost from the strain after sub-culturing. Lane 3 is the TSV1A plasmids (expression plasmid and recombined pACYC-TSV1 with Km) and lane 4 is the original recipient strain.

It is important to note that RIT element mobility was only observed during conjugation, as this may be important to understanding the mechanism of mobility. Although the high false positive rate made it difficult to find positive recombinants, it also provided an opportunity to try inducing the plasmids when they were all present in one strain. This induction was performed on both the TSV1A positive recombinant clone and the previously uninduced clone that had retained all three original plasmids (designated ‘–I’ in Figure 6.2.6). Plasmids were collected from a large volume of the induced cells, and no rearrangements of any kind were observed in the subsequent plasmids either by gel or PCR analysis.

In the collection of KmR/TcR clones, there were a number of potential recombinants in addition to TSV1 and TSV2. These appear to have larger plasmids, perhaps as a result of unresolved co-integrate structures (similar to the plasmid visible in Figure 6.2.7 above the TSV2 recombinant plasmid). Primers designed to amplify across the target site revealed a putative co-integrate that had an altered target site. Sequencing analysis of this clone (I4) confirmed it to be a co-integrate of the donor and target plasmids. Sequencing out from the kanamycin gene revealed the presence of both the donor (pSF100) and target (pACYC184) backbones, but no palindrome was found adjacent to either inverted repeat. Beyond each of the inverted repeats, the sequence matched the same half of the target site sequence. Clone I4 was a result of the target site 1 mating, and therefore the target site sequence was identical to the sequence flanking the kanamycin cassette (28 bp on one end and 17 bp on the other end), which explains the presence of two copies of the target site sequence. Sequencing from the vector backbones revealed the palindrome sequence to be separate from the kanamycin cassette and flanked on either side by the other half of the target site (illustrated in Figure 6.2.8).

Figure 6.2.8: Sequencing results of co-integrate structure of clone I4. The original target site and suicide vectors are shown at the top. The coloured boxes represent the sequences in common between the two (less than 30 bp for each). The bottom figure shows a simplified version of the co-integrate illustrating the location of the target site sequences and palindrome at the junction of the two plasmids.

6.2.4 Application of these Results to other RIT Elements

In recognizing the importance of the palindrome sequence to the mechanism of these novel elements, I performed an in silico search specific to the palindrome/inverted repeat arrangement. This search led me to identify additional RIT elements in the database (included in supplemental table 1). This suggests that RIT elements may be grouped according to conservation of their palindromes. Palindrome conservation groups span a wide range of species. The palindrome sequences were most variable in the centre region, and in some cases this central core was no longer a perfect palindrome sequence, suggesting that conservation of the key motifs is functionally important as opposed to maintenance of a perfect palindrome structure. As can be seen in Table 6.2.2, the conserved sequences in the palindromes also correspond with conserved sequences in the inverted repeats (the presumed binding sites identified in chapter 4). Whether these homologous sequences correspond to binding sites or facilitate the creation of a stem-loop structure has not been determined.

Table 6.2.2: Conserved sequences found in a variety of alpha- and beta-Proteobacteria containing RIT elements.

Strain                                  Palindrome Sequence               Inverted Repeat Sequence
Caulobacter sp. K31                     ttatgccgatatcggcataa              cataatgccgcgatccggattatgccg
Sinorhizobium medicae WSM419 pSMED02    ttatgccgatatcggcataa              cattatgccgtacgccggattatgccgcatggcc
Acidiphilium cryptum JF-5 pACRY03       ttatgccgatatcggcataa              cataatgccgtgattcggattatgccgcatgacc
Novosphingobium PP1Y                    ttatgccgatatcggcataa              taatgccgtgacccggattatgccg
Acidiphilium multivorum AIU301          tgccccttatgccgacatcggcataaggggca  taatgccgagatccggattatgccg
Frankia sp. EAN1pec                     ttatgccgacgtcggcataag             ttatgccgagggccgggttatgccg
Cupriavidus metallidurans CH34 RITCme1  catgccgctagcggcatg                ttatgccgactccccgattatgccg
Burkholderia sp. Ch1-1                  cctgtcatgccgctagcggcatgacagg      ttatgccgacttcccgattatgccg
Mesorhizobium loti st. NZP2037          ttatgccgacgtcggcataa              ttatgccgatgtccggattatgccg
Phaeobacter gallaeciensis DSM26640      ttatgccgacatcggcataagg            cataatgccgatgttcagattatgccgcg
Acidovorax sp. KKS102                   cgctgcttatggagagctctccataagcagcg  gcagcgttatgcacagcacgcagttatgcacagttgg
Leptothrix cholodnii SP6                ctgcttatggagagctttccataagcag      gcagcgttatgcacagcacgcagttatgcacagt
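The correspondence between conserved sequences in the palindrome and in the inverted repeat can be illustrated by searching for the longest stretch of sequence that the two share. The following is a minimal sketch rather than the search actually used in this work: the function name is hypothetical, only same-strand matches are considered, and the example takes the palindrome and the inverted repeat listed for Caulobacter sp. K31 in Table 6.2.2.

```python
def longest_shared_motif(a, b):
    """Hypothetical helper: longest substring of `a` that also occurs in `b`.

    On ties, the first such substring encountered in `a` is returned.
    Only same-strand matches are considered.
    """
    best = ""
    for start in range(len(a)):
        # Only try candidates longer than the current best.
        for end in range(start + len(best) + 1, len(a) + 1):
            candidate = a[start:end]
            if candidate in b:
                best = candidate
            else:
                break  # a longer candidate from this start cannot match either
    return best


# Sequences listed for Caulobacter sp. K31 in Table 6.2.2.
k31_palindrome = "ttatgccgatatcggcataa"
k31_inverted_repeat = "cataatgccgcgatccggattatgccg"

print(longest_shared_motif(k31_palindrome, k31_inverted_repeat))  # ttatgccg
```

For K31, the shared stretch is the 8 bp motif ttatgccg, which forms the first half-site of the palindrome and the end of the inverted repeat.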

6.3 Discussion

Tyrosine-based site-specific recombinases (TBSSRs) are a broad group of enzymes which perform conservative DNA recombination through the coordinated breakage, exchange and resealing of all 4 DNA strands (Hallet et al. 2004). There are approximately 400,000 TBSSRs in the NCBI database, many of which can be assigned to one of 24 sub-families based on conserved domains in the C-terminal catalytic domain (www.ncbi.nlm.nih.gov/cdd). Those associated with mobile genes have been further divided into putative role-specific sub-families (Van Houdt et al. 2012). Only a small number of these enzymes have been characterized biochemically, and they exhibit extensive diversity in both their recombination mechanisms and the nature of the attachment sites. This is not unexpected given the varied roles that these enzymes perform in the cell. These functions can be separated into three categories: chromosome or plasmid maintenance (by ensuring correct separation of multimers), intercellular distribution (phages, ICEs and genomic islands) and intracellular generation of diversity (phase switching and cassette integration) (Hallet et al. 2004; Subramanya et al. 1997; Tribble et al. 2000; Tirumalai et al. 1997; Guo et al. 1997; Cheetham and Katz 1995; Rowe-Magnus and Mazel, 2001).

Tyrosine-based site-specific recombinases are essential for the correct separation of circular replicons. The best studied representative is the XerCD/dif system which functions to resolve chromosomal dimers produced during replication (Hallet et al. 2004). These recombinases are distinguished from those utilized in homologous recombination since they require only short (~30 base pair) sequences to perform recombination. These sites are referred to as the “core” or “cross-over” site and usually possess dyad symmetry that facilitates the binding of the recombinases to recognition motifs (Hallet et al. 2004). The DNA strands are cut and exchanged at the borders of the central region separating the recognition motifs (Hallet et al. 2004).

Large conjugative plasmids and other mobile elements that are self-transmissible commonly encode their own site-specific recombinase adjacent to the recombination site for integration (Hallet et al. 2004). These sites can consist of only the core site (as in the Cre/loxP system), or be more complex. The relative positioning of recombination sites specifies whether the recombination reaction will result in integration, excision or inversion of the intervening DNA. When the sites are located on a single replicon, directly repeated recombination sites will cause excision, while inverted repeated sites cause inversion (Hallet et al. 2004). Tyrosine recombinase systems can be specific to individual mobile elements or can be provided in trans from the host chromosome (Huber and Waldor, 2002).
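The dependence of the outcome on site orientation can be illustrated with a toy model of a single replicon. This sketch is a deliberate simplification and makes several assumptions: the replicon is treated as a linear string, the recombination site is a short made-up asymmetric sequence (not a real core site), and the details of strand cleavage within the site are ignored. The function and argument names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a lower-case DNA sequence."""
    pairs = {"a": "t", "t": "a", "g": "c", "c": "g"}
    return "".join(pairs[base] for base in reversed(seq))


def recombine(left, site, middle, right, inverted):
    """Toy model of recombination between two sites on one replicon.

    The starting replicon is left + site + middle + second_site + right, where
    the second site is a direct copy of `site`, or its reverse complement when
    `inverted` is True. Returns (replicon_after, excised_circle_or_None).
    """
    if inverted:
        # Inverted sites: the intervening DNA is flipped in place.
        flipped = reverse_complement(middle)
        return left + site + flipped + reverse_complement(site) + right, None
    # Direct repeats: the intervening DNA leaves as a circle carrying one
    # copy of the site, and the other copy stays behind in the replicon.
    return left + site + right, middle + site


# Made-up sequences for illustration only.
print(recombine("aaa", "ggtac", "ttgca", "ccc", inverted=False))
print(recombine("aaa", "ggtac", "ttgca", "ccc", inverted=True))
```

In this model the direct-repeat case returns a shortened replicon plus an excised circle, while the inverted-repeat case returns a replicon of unchanged length in which the intervening segment is reverse-complemented.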

Many transposons encode separate integration and resolution systems. Two examples using a site-specific resolution system are the Tn3 family and “Mu-like transposons” including Tn552 and the Tn5053/Tn402 family (Hallet et al. 2004). For both of these systems, the (usually DDE) transposase initiates the creation of a co-integrate which joins the donor and target sequence through two directly repeating copies of the transposon. These co-integrates are then resolved by the site-specific recombinase through intra-molecular recombination between the two copies at the transposon resolution site (res), resulting in one copy of the transposon in each location (Hallet et al. 2004). For the majority of Tn3 members, the resolution occurs through the action of a serine SSR commonly referred to as the resolvase. The res sites of the Tn3 members that have been characterized contain three 12 bp inversely oriented binding sites, the first of which is the recombination core site, while the other two correspond to accessory elements required for recombination to proceed (Hallet et al. 2004). The sequence identity and spacing between these three sites vary somewhat between members, and it has been determined that some elements (including Tn552, ISXc5, Tn1546 and TnXO1) may each contain direct repeats instead of inversely oriented motifs at one of the two accessory binding sites (Hallet et al. 2004). There is a sub-family of the Tn3 elements (including Tn4430, Tn5401 and the Tn4651/Tn5041 families) that utilize a tyrosine-based site-specific recombinase (TBSSR) for the resolution of the transposase-driven co-integrates (Hallet et al. 2004).

These experiments demonstrated the movement of a Km cassette carried within a RIT structure to a recipient plasmid harboring either of two closely related target sites, but only during conjugative events. The target site was identified after careful examination of the gene sequence uncovered the presence of a palindrome that mobilized as part of the RIT element. This finding provides clues to the integration event mechanism.

The results obtained so far suggest that RIT elements can be transferred between replicons within a bacterial cell during the process of conjugation and that integration occurs at the ends of the palindrome sequence. This suggests that either conjugation or the presence of single-stranded DNA is central to RIT activity. Analysis of the sequence surrounding the kanamycin cassette after insertion in each target site shed some light on the mechanism of insertion. In both cases the palindrome appears downstream of the kanamycin gene in terms of transcription, whereas it is upstream in the original construct in pSF100. This suggests that the palindrome location is determined by the target site sequence, that the palindrome may serve as the attachment and integration site, and that the RIT::Km is inverted either during or after integration. Previously characterized tyrosine recombinases (such as XerC/D and Cre) bind to sites exhibiting dyad symmetry, and crossover occurs at the centre of this symmetry (Hallet et al. 2004). As can be seen in Table 6.2.2, there are complementary sequences found in both the palindrome and the inverted repeats. It is therefore proposed that the RIT element recombinases may bind to one half of the palindrome sequence and to the complementary sequence within the inverted repeats, and that crossover occurs between the palindrome and the inverted repeat (Figure 6.3.1). If the crossover occurs between the palindrome and the inverted repeats, then the core sites are also more consistent with those seen for XerC/D, since the string of A/T is internal and the G/C is external to the crossover region.

Figure 6.3.1: Model for RIT element mobility based on experimental results. IR indicates the locations of the inverted repeats. Illustration of palindrome direction relative to kanamycin transcription in pSF100-RIT::Km suicide construct (top) and in the transconjugants obtained (middle). Bottom picture is proposed binding sites and crossover regions in RIT integration involving a circular intermediate. Key residues predicted to be involved in binding are shown in bold and strand exchange occurs between the palindrome and the inverted repeats.

Site-specific recombination events were detected in these experiments, but many more may have been found if not for the high rate of false positives. These were due to the independent maintenance of the suicide construct within the recipient cell, or to recombination events that appear to be separate from RIT activation. As discussed in the results, substituting the MFDpir strain for S17-1 did not eliminate the false positive issue, suggesting that the source of the issue with pSF100 is not the active Mu phage described in the latter strain. There were significant regions of homology (~200 bp) between the suicide construct and both the donor and expression plasmids found in the recipient cell. This should not be an issue in a recA1 mutant background, but it has been shown that ATP-independent re-annealing of single-stranded substrates can still occur although strand exchange is eliminated (Bryant and Lehman 1986). I therefore cannot preclude the possibility that the plasmids are becoming integrated with each other and that the strong selection maintains these co-integrate structures. In addition, the ATP-dependent functions of the RecA1 protein can be partially restored at a pH of 6.5 or lower (Kawashima et al. 1984). As alterations in the pH during conjugation were not monitored, there is the possibility that homologous recombination accounts for a significant fraction of the false positives observed. A third possible source of this issue is cross-reactivity with the active XerC and XerD homologs found in the recipient strain. There have been phage elements described that do not carry their own site-specific recombinases but rather depend on the action of the host recombination machinery to facilitate their integration (Huber and Waldor, 2002). As these recombinases are essential to chromosome separation and cell reproduction, it is not feasible to perform the experiments in a XerC/D deficient background in order to determine whether these genes are contributing to the high false positive rate.

Although there is currently insufficient evidence to speculate on the role that RIT elements may play in the cell, the results obtained in this study indicate that they may be specifically active during the process of conjugation. The lack of kanamycin movement upon induction of the recombinases in cells already possessing all three plasmid constructs (expression, target and suicide construct) was in sharp contrast to the diversity of arrangements obtained when induction occurred as the RIT::Km cassette was conjugating in. This is consistent with the data obtained in Chapter 4 indicating that RIT elements are commonly associated with one or more plasmids in an individual strain. This activation could be similar to that of integrons, where a single-stranded substrate is necessary for integration of gene cassettes to occur, or could be indicative of a role for these genes in the acquisition of genes directly from incoming plasmids, regardless of the ability of that plasmid to be maintained long term within the recipient cell. In this manner, having RIT elements specifically active during conjugation events would be a powerful means of generating diversity from transient plasmid associations.

6.4 Acknowledgements

This project was funded through the W. Garfield Weston Foundation Doctoral Fellowship Program. Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D Scholarship to NR is also gratefully acknowledged. The funding agencies had no role in this study.

6.5 References

Bryant, F.R. and Lehman, I.R. 1986. ATP-independent renaturation of complementary DNA strands by the mutant recA1 protein from Escherichia coli. The Journal of Biological Chemistry 261(28):12988-12993.

Cheetham, B. F., & Katz, M. E. (1995). A role for bacteriophages in the evolution and transfer of bacterial virulence determinants. Molecular microbiology, 18(2), 201-208.

Ferrières, L., G. Hémery, T. Nham, A.M. Guérout, D. Mazel, C. Beloin and J.M. Ghigo. 2010. Silent mischief: Bacteriophage Mu insertions contaminate E. coli random mutagenesis performed using suicidal transposon-delivery plasmids mobilized by broad-host range RP4 conjugative machinery. J. Bacteriol. 192(24):6418-27.

Grindley, N.D.F., Whiteson, K.L., and Rice. P.A. 2006. Mechanisms of Site-Specific Recombination. Annu. Rev. Biochem. 75:567-605.

Guo, F., Gopaul, D. N., & Van Duyne, G. D. (1997). Structure of Cre recombinase complexed with DNA in a site-specific recombination synapse. Nature, 389(6646), 40-46.

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology pp. 145-180. Ed. B.E. Funnell and G.J. Phillips ASM Press, Washington, D.C. USA

Huber, K. E., & Waldor, M. K. (2002). Filamentous phage integration requires the host recombinases XerC and XerD. Nature, 417(6889), 656-659.

Kawashima, H., Horii, T., Ogawa, T. and Ogawa, H. 1984. Functional domains of Escherichia coli recA protein deduced from the mutational sites in the gene. Mol Gen Genet (molecular and general genetics) 193(2):288-292.

Ricker, N., Qian, H., and Fulthorpe, R. 2013. Phylogeny and Organization of Recombinase in Trio (RIT) Elements. Plasmid. 70(2):226-239.

Rowe-Magnus, D. A., & Mazel, D. (2001). Integrons: natural tools for bacterial genome evolution. Current opinion in microbiology, 4(5), 565-569.

Siguier, P. Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev. 38(5):865-891.

Subramanya, H. S., Arciszewska, L. K., Baker, R. A., Bird, L. E., Sherratt, D. J., & Wigley, D. B. (1997). Crystal structure of the site‐specific recombinase, XerD. The EMBO Journal, 16(17), 5178-5187.

Tirumalai, R. S., Healey, E., & Landy, A. (1997). The catalytic domain of λ site-specific recombinase. Proceedings of the National Academy of Sciences, 94(12), 6104-6109.

Tribble, G., Ahn, Y. T., Lee, J., Dandekar, T., & Jayaram, M. (2000). DNA recognition, strand selectivity, and cleavage mode during integrase family site-specific recombination. Journal of Biological Chemistry, 275(29), 22255-22267.

Van Houdt, R., Monchy, S., Leys, N., Mergeay, M., 2009. New mobile genetic elements in Cupriavidus metallidurans CH34, their possible roles and occurrence in other bacteria. Antonie van Leeuwenhoek 96, 205-26.

Van Houdt, R., Leplae, R., Mergeay, M., 2012. Towards a more accurate annotation of tyrosine-based site-specific recombinases in bacterial genomes. Mobile DNA 3(6) doi:10.1186/1759-8753-3-6

Chapter 7 Developing a standardized method for analyzing gene content of bacterial communities in streams with varying degrees of urbanization

7 Introduction

A key challenge in characterizing the mobilome of environmental samples is the ability to draw comparisons between diverse environments. The ideal study involves collecting samples before and after an environmental disturbance; however, this is limited to anticipated point source contamination events. Moreover, the information obtained can only be utilized in drawing comparisons specific to that location and time point. Unfortunately, environmental pollution is not limited to these discrete and known point source events. Anthropogenic pollutants from domestic, industrial and agricultural settings contribute a diverse array of chemical compounds to the environment (Gillings et al. 2015). Increased urbanization and decreased vegetation likewise contribute to increased levels of environmental pollutants (particularly polyaromatic hydrocarbons) surrounding human activities (Johnsen and Karlson, 2007).

Despite the inherent issues in drawing comparisons between sites with highly variable anthropogenic impacts, a baseline community mobilome needs to be established from which future studies can draw comparisons. There are a variety of metrics currently available for classifying anthropogenic impacts on freshwater streams. In our region, the Ontario Benthos Biomonitoring Network (OBBN) coordinates efforts to monitor impacts to both lakes and streams and has developed appropriate methods for comparing the benthic invertebrate populations between sites to establish anthropogenic impacts (Jones et al. 2007). The analysis of benthic macroinvertebrate populations is a well-established biomonitoring tool for comparing cumulative impacts of human activities in river systems (Rosenberg and Resh 1993; Wright et al. 2000). However, the impact on bacterial communities at these sites cannot be inferred directly from these macro-organism metrics. In order to determine which environmental pollutants cause changes in bacterial diversity or gene content, there must be a standardized bacterial community on which to perform testing. This standardized community must account for varying bacterial populations within individual spatial niches, as these can be expected to respond differently to selection pressures. Gene transfer mechanisms are particularly proficient in biofilm communities; therefore, obtaining samples through filtering of stream water is less than ideal since it minimizes the genetic contribution of these important communities. However, it is equally difficult to account for differences in sediment composition when comparing bacterial communities between streams, and this difficulty increases when comparing relatively pristine (reference) sites with more channelized urban streams. Moreover, individual sediment samples can be impacted by variations in groundwater inputs, which can be a source of various pollutants. Finally, in order to be used in a risk assessment framework, the bacterial community should represent a reasonable route of exposure for individuals either through direct contact or downstream water usage. For these reasons, we chose to utilize columns filled with a standardized substrate which the bacterial community could colonize for a pre-determined length of time. This allows bacteria that are present intermittently in the water column to colonize the columns in addition to the ubiquitous water-inhabiting bacterial members.

I designed sand-filled columns to capture and integrate the bacterial communities of streams for study. These columns were attached to flotation devices and floated in the water column at a shallow depth from the surface in 6 streams in southern Ontario, Canada. Two of the chosen streams are minimally impacted by human activities, and the other four streams are moderately impacted by a variety of pollutants (see Table 7.1.1). The sources of anthropogenic stress included in this study (urbanization, wastewater outflows, agricultural practices and landfill leachate) were chosen in order to avoid strong selection by any particular pollutant and instead focus on circumstances where the communities are exposed to a variety of stressors.

7.1 Materials and Methods

The reference site samplers were placed in rivers contained within the Saugeen Valley Conservation Authority (SVCA) at sites chosen based on the 2010 water quality monitoring report from this agency (SVCA, 2010) and are both located in streams actively monitored by the Provincial Water Quality Monitoring Network (PWQMN). This region has a low road density and includes the provincial reference sites utilized for the Ontario Benthos Biomonitoring Network (OBBN) assessments (Jones, 2006). Sampler placement was chosen based on ease of accessibility; however, neither sampler was placed in the precise location of the PWQMN station for the reference sites due to concerns that the locations provided public access and may lead to tampering. The impacted sites were chosen in the more urbanized Lake Simcoe watershed, based on recommendations from the Lake Simcoe Regional Conservation Authority (LSRCA) staff and a 2004 study of contaminants found in the rivers of the Lake Simcoe watershed (LSRCA, 2004). All impacted sites show significant accumulations of polyaromatic hydrocarbons (PAHs), which is to be expected in an urbanized watershed. The Uxbridge Brook site was chosen due to its use as a discharge stream for a wastewater treatment plant in the area (at a distance of approximately 2.5 km from discharge to sampling site), and is the only site that is located precisely at the PWQMN site. The Maskinonge River site did not have any contamination that exceeded the provincial limits according to the 2004 report but was chosen due to its location downstream of an intensive sod farm. According to the 2004 contamination study, the West Holland River was heavily impacted by organochlorine pesticides, including DDT and its breakdown product DDE, among other contaminants. The Dyment’s Creek location was not highlighted in the 2004 study but was included due to the availability of concurrent chemical analyses performed by researchers at Environment Canada. At this location, chemical screening has been performed on the groundwater flowing beneath the stream and results from previous years have been published (Roy and Bickerton 2011). Contaminants found in this location are diverse and include volatile organic chemicals, metals and petroleum products.

7.1.1 Sampling locations and collection of benthic invertebrates

Chemical data and site characteristics are listed in Table 7.1.1. Provincial water quality monitoring data were available for all streams except for Dyment’s Creek; however, stream and sediment data for this latter site were provided for 2011 by researchers at Environment Canada. For each of the sampling locations, benthic communities were sampled according to the OBBN protocols (Jones et al. 2007) and animals were preserved in 70% ethanol for transportation. Organic material was isolated by density separation in concentrated salt solution (if necessary) and animals were classified using microscope-assisted identification to at least the 27-group level (details in OBBN protocol, Jones et al. 2007). When possible, Trichoptera, Ephemeroptera and Coleoptera were identified to the family level in order to utilize a more accurate tolerance value for the coarse Hilsenhoff biotic index calculations. Benthic sampling was not performed on the West Holland Canal and Maskinonge River sites due to the depth of the river at these locations, and samples were not collected from the North Saugeen River site in the fall of 2012 due to the presence of clam beds that should not be disturbed. Although samples were collected in both the fall and the summer across multiple years for individual sites, only the fall counts were used for analysis for consistency. Benthic counts were also obtained from each of the conservation authorities in order to supplement the available data. Anthropogenic impact was determined using the coarse Hilsenhoff Biotic Index (cHBI; a modification of the HBI developed by Hilsenhoff, 1987) and Simpson’s diversity index (Simpson, 1949), as well as the percentage of animals recovered that belong to relatively intolerant groups (%EPT; the combination of Ephemeroptera, Plecoptera and Trichoptera).
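The three benthic metrics named above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the biotic index is the abundance-weighted mean of per-taxon tolerance values and that Simpson's diversity is reported as 1 - sum(p_i^2), which is consistent with scores approaching 1 for diverse communities. The counts and tolerance values below are hypothetical placeholders, not OBBN values or data from this study.

```python
from typing import Dict

EPT_ORDERS = ("Ephemeroptera", "Plecoptera", "Trichoptera")


def biotic_index(counts: Dict[str, int], tolerance: Dict[str, float]) -> float:
    """Abundance-weighted mean tolerance over taxa that have a tolerance value."""
    scored = {taxon: n for taxon, n in counts.items() if taxon in tolerance}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("no animals with an assigned tolerance value")
    return sum(n * tolerance[taxon] for taxon, n in scored.items()) / total


def simpson_diversity(counts: Dict[str, int]) -> float:
    """Simpson's index of diversity, 1 - sum(p_i^2) (assumed form)."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def percent_ept(counts: Dict[str, int]) -> float:
    """Percentage of all animals belonging to the three EPT orders."""
    total = sum(counts.values())
    return 100.0 * sum(counts.get(taxon, 0) for taxon in EPT_ORDERS) / total


# Hypothetical fall count for one site (placeholder numbers).
counts = {"Ephemeroptera": 10, "Plecoptera": 5, "Trichoptera": 5,
          "Chironomidae": 60, "Oligochaeta": 20}
# Placeholder tolerance values, for illustration only.
tolerance = {"Ephemeroptera": 5, "Plecoptera": 1, "Trichoptera": 4,
             "Chironomidae": 7, "Oligochaeta": 8}

print(round(biotic_index(counts, tolerance), 2))
print(round(simpson_diversity(counts), 3))
print(round(percent_ept(counts), 1))
```

Taxa without a tolerance value are excluded from the index denominator in this sketch; a real calculation would follow whichever convention the protocol specifies.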

Table 7.1.1: Sampling locations for river assessments. Those in bold exceed the available recommended limits (Provincial Water Quality Objectives or Canadian Water Quality Guidelines for Protection of Aquatic Life (SVCA, 2010)); There is no listed guideline for PHCs; Abbreviations - PAH: polyaromatic hydrocarbon, OC: organochlorine, PHC: petroleum hydrocarbon, VOCs: volatile organic compounds.

| Site | Bank Width | Depth (avg) | Surrounding Region | Chemicals of Concern |
|---|---|---|---|---|
| Hamilton Creek | 11.9 m | 0.41 m | Forest | |
| North Saugeen River | 15.2 m | 0.46 m | Forest | |
| Uxbridge Brook | 6.5 m | 0.40 m | Downstream of sewage outflow | PAHs, Phenols, PHCs, Cr, Cu |
| Maskinonge River | 9 m | 0.40 m | Downstream of sod farming | Pesticides, Cr, Cu |
| West Holland River | Not determined | Not determined | Within Holland Marsh (agriculture) | PAHs, Phenols, OC pesticides, PHCs, Hg, Cd, Cr, Pb, Cu |
| Dyment’s Creek | 2-6 m | 5-50 cm | Historic landfill turned residential | VOCs |

Figure 7.1.1: Map of sampling locations. The two reference sites (orange triangles) are North Saugeen (NS) and Hamilton Creek (HC), which are located within the SVCA. The four impacted sites (red triangles) are Dyment’s Creek (DC), West Holland Canal (WH), Maskinonge Creek (MC) and Uxbridge Brook (UX) and are all located within the LSRCA.

7.1.2 Sampler Design

All samplers were created from 1 ½” (inner diameter) polycarbonate tubes cut to one foot in length and fitted with 1 ½” copper to DWM pipe adapters machined internally to fit to the pipe. The ends of the adapters were fitted with screening material and nylon in order to prevent the entry of invertebrates or litter from the stream. All samplers were filled with autoclaved, fine-grained sand. Samplers were floated mid-stream within 2 inches of the stream surface, and were sub-sampled monthly throughout the 4-month exposure time. At the end of four months the samplers were retrieved and replicate samples were obtained from each portion of the sampler (inflow end, center, outflow end) to determine within-sampler variation.

Figure 7.1.2: Aquatic environment bacterial community samplers. Constructed samplers (A) were attached to 2 L bottles used as flotation devices. The devices were attached to cinder blocks to keep them in place in the stream (B).

7.1.3 Bacterial Community Assessment

DNA extraction of sampler soil was performed using a PowerSoil extraction kit (MoBio). Terminal restriction fragment length polymorphism (T-RFLP) analysis using fluorescently labeled 16S primers was used to compare the sampler diversity, and pyrosequencing was performed on selected samplers to examine bacterial community diversity.

Initial T-RFLP comparisons between the inflow, center and outflow sub-samples were performed to determine whether the bacterial communities were consistent throughout the length of the sampler. Fluorescently labeled 16S primers (27F-FAM and 1492R-HEX) were used for amplification and digestion was performed using AluI. Digested samples were analyzed by the Guelph Molecular Supercenter Laboratory Services Division and statistical analysis was performed using R (R-project.org). Sub-samples were subsequently combined (3 separate replicates of sub-samples where possible) for between-sampler comparisons, also by T-RFLP using the same methods. Principal coordinate analysis of the T-RFLP data from the combined samples (separate analyses for Bray-Curtis and Jaccard distance measures) was performed using the pco command in the Ecodist package of R. Principal coordinate scores were compared to water quality parameters, benthic invertebrate metrics and genetic (qPCR) data. Correlations were automatically generated using the corr function.
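The two distance measures used for the principal coordinate analyses differ in whether peak abundance is considered. The following Python sketch illustrates only the distance step (the analyses in this chapter were run in R, and the ordination itself is not reproduced here). The peak profiles are hypothetical, with each position in the list representing one fragment size.

```python
from typing import Sequence


def bray_curtis(a: Sequence[float], b: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / (sum a + sum b)."""
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def jaccard_binary(a: Sequence[float], b: Sequence[float]) -> float:
    """Binary Jaccard distance: 1 - shared peaks / peaks present in either."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


# Hypothetical peak heights for two profiles over the same four fragment sizes.
profile_1 = [50, 30, 20, 0]
profile_2 = [50, 0, 20, 30]

print(bray_curtis(profile_1, profile_2))
print(jaccard_binary(profile_1, profile_2))
```

Because the binary Jaccard measure ignores peak heights, a single dominant fragment influences it far less than it influences Bray-Curtis, which is why the two measures can group the same samples differently.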

Pyrosequencing was performed on one replicate of combined samples, using a Roche 454 FLX Titanium instrument (MR DNA Molecular Research LP). Primers utilized were provided by the facility (27Fmod and 530R) and targeted the V1-V3 16S region (Yarza et al. 2014). Data analysis was performed using programs in the QIIME pipeline (Caporaso et al. 2010), including Denoiser (Reeder and Knight, 2010) and UCLUST (Edgar, 2010). Sequences were rarefied to 1455 reads per sample (corresponding to the lowest read count obtained), OTUs were grouped based on 97% similarity, and taxonomy was assigned according to the Greengenes Database (DeSantis et al. 2006) files from May 2013. Beta diversity was evaluated using the vegan package in R with either Bray-Curtis or Binary Jaccard settings.
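Rarefaction to a common depth amounts to drawing the same number of reads, without replacement, from each sample. The sketch below illustrates the idea in Python; it is not the QIIME implementation, and the OTU names and read counts are hypothetical. Only the depth of 1455 reads is taken from the text.

```python
import random
from collections import Counter
from typing import Dict


def rarefy(otu_counts: Dict[str, int], depth: int, seed: int = 0) -> Dict[str, int]:
    """Subsample `depth` reads without replacement from one sample's OTU table."""
    # Expand the table into one entry per read, then draw a random subset.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if len(reads) < depth:
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return dict(Counter(rng.sample(reads, depth)))


# Hypothetical sample with 3000 reads spread over three OTUs.
sample = {"OTU_1": 2000, "OTU_2": 900, "OTU_3": 100}
rarefied = rarefy(sample, depth=1455)
print(sum(rarefied.values()))
```

Samples with fewer reads than the chosen depth cannot be rarefied, which is why the depth is set by the smallest sample retained.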

Table 7.1.2: Primers for quantitative PCR.

| Primer Name | Sequence | Amplicon Size | Annealing Temp (°C) | Reference |
|---|---|---|---|---|
| qPCR-intI1F | ACCAACCGAACAGGCTTATG | ~286 bp | 62 | Nemergut et al. (2004) |
| qPCR-intI1R | GAGGATGCGAACCACTTCCAT | ~286 bp | 62 | Nemergut et al. (2004) |
| qPCR-16S-338F | ACTCCTACGGGAGGCAGCAG | ~200 bp | 63 | Fierer et al. (2005) |
| qPCR-16S-518R | ATTACCGCGGCTGCTGG | ~200 bp | 63 | Fierer et al. (2005) |
| sulI-F | CACCGGAAACATCGCTGCA | 158 bp | 55 | Cheng et al. 2013 |
| sulI-R | AAGTTCCGCCGCAAGGCT | 158 bp | 55 | Cheng et al. 2013 |
| IncP1 korA-F | TCATCGACAACGACTACAACG | 117 bp | | Smalla et al. 2013 |
| IncP1 korA-R | TTCTTCTTGCCCTTCGCCAG | 117 bp | | Smalla et al. 2013 |
| IS1071_qPCR-F | GCACCAAGTCTGGGAATGAT | ~200 bp | 60 | This study |
| IS1071_qPCR-R | ACGGGCATAGTGTTTCTTGG | ~200 bp | 60 | This study |
| IR_Olga | TTATGCCGATTCCCGGATTATGCCG | 3.5 kb | 54 | This study |
| IR_K31 | TAATGCCGCGATCCGGATTATGCCG | 3.5 kb | 54 | This study |
| IR_ambig | TWATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study |
| IR_less_ambig | TTATGCCGIIIYCCSGATTATGCCG | 3.5 kb | 54 | This study |

7.1.4 Quantitative PCR

Quantitative PCR (qPCR) was performed using SYBR Green I technology on an ABI 7300 Sequence Detection System (Applied Biosystems). A master mix for each PCR run was prepared with SYBR Green PCR Master Mix (KAPA Biosystems) and 0.5 μM primers. The following amplification program was used: 95°C for 2 min, then 40 cycles of 95°C for 15 s followed by 60°C for 35 s. A dissociation step was added (95°C for 15 s, 60°C for 30 s, 95°C for 15 s) to analyze the melting curves of products for primer dimers and PCR artifacts. Representative samples were run on a 1% agarose gel to confirm that products were the expected size. Primers used for MGE comparisons between samplers are listed in Table 7.1.2. Primer efficiencies were between 93% and 109%.

7.1.5 RIT inverted repeat primer design and PCR

The inverted repeats from the strains listed in Table 6.2.2 were aligned and used to design ambiguous primers targeting the inverted repeats flanking their respective RIT elements. Since the primers were designed to target the inverted repeats, the same primer would be expected to anneal at each end and amplify the full RIT element. Two specific (non-ambiguous) primers were also created targeting the inverted repeats for Burkholderia sp. OLGA172 and Caulobacter sp. K31 individually. The two specific primers were tested to verify that they were strain specific and the ambiguous primers were shown to amplify the RIT elements in both strains. The ambiguous primers were tested on sampler DNA to search for RIT elements bearing comparable inverted repeats that could be amplified. PCR was carried out using HotStar Taq at 54 °C plus 1 µL BSA per reaction, with an extension time of 3 minutes and 30 seconds.
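The logic of an ambiguous primer can be illustrated with a short Python sketch that collapses aligned sequences into IUPAC degenerate codes and checks whether a primer pattern is compatible with a target. This is an illustration rather than the design procedure used here: it uses only the two control-strain primer sequences from Table 7.1.2 (the published ambiguous primers were derived from more strains), it treats inosine (I) as compatible with any base, which is a simplification, and it tests exact pattern compatibility, which says nothing about whether a mismatched primer would still amplify in practice.

```python
from typing import Sequence

# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
EXPAND = {code: set(bases) for bases, code in IUPAC.items()}
EXPAND["I"] = set("ACGT")  # simplification: inosine treated as matching any base


def degenerate_consensus(seqs: Sequence[str]) -> str:
    """Collapse equal-length aligned sequences into one degenerate sequence."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(IUPAC[frozenset(column)] for column in zip(*seqs))


def primer_matches(primer: str, target: str) -> bool:
    """True if every target base is allowed by the primer code at that position."""
    return len(primer) == len(target) and all(
        base in EXPAND[code] for code, base in zip(primer, target)
    )


# Sequences as listed in Table 7.1.2.
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

print(degenerate_consensus([IR_OLGA, IR_K31]))
print(primer_matches(IR_AMBIG, IR_OLGA), primer_matches(IR_AMBIG, IR_K31))
print(primer_matches(IR_LESS_AMBIG, IR_OLGA), primer_matches(IR_LESS_AMBIG, IR_K31))
```

The two-strain consensus is more restrictive than the listed ambiguous primers at some positions, as expected when additional strains contribute further variation to the alignment.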

7.2 Results

7.2.1 Macroinvertebrate metrics of ecosystem health

Prior to examining the microbial community from the samplers, the overall health of each stream was estimated based on biomonitoring of benthic communities from the stream sediment. Where possible, benthic animals were collected directly from the sites using the OBBN-approved kick and sweep method. Abundance and identification data were also obtained from the relevant conservation authorities and these data were used to supplement the benthic monitoring data acquired during this study. Table 7.2.1 shows the results of the biotic indices for the four sites at which benthic animals could be obtained concurrent with sampling. Biotic indices fluctuate seasonally; therefore, the data used for calculating these biotic indices corresponded to the fall counts for all sites, which also coincided with the sampling season for the conservation authorities.

Table 7.2.1: Comparison of field sites based on biotic indices of benthics obtained during this study. The coarse Hilsenhoff Biotic Index (cHBI) ranks sites with a score below 5 as healthy and increasingly polluted above 5. Other indicators of a healthy benthic community include a high Simpson’s diversity score (approaching 1) and high abundance of species known to be intolerant to pollutants (%EPT).

| Site | cHBI | cHBI Rating | Simpson's Diversity | %EPT |
|---|---|---|---|---|
| Dyment’s Creek | 6.93 | Poor | 0.48 | 5.69 |
| North Saugeen River | 5.33 | Fair | 0.63 | 14.18 |
| Uxbridge Brook | 5.23 | Fair | 0.79 | 16.75 |
| Hamilton Creek | 6.12 | Fairly poor | 0.60 | 13.25 |

The SVCA sites historically show lower degrees of impact than the LSRCA sites (as indicated by a lower cHBI value); however, the cHBI values calculated in this study were higher than had been observed in previous years by the SVCA. The most recent benthic data provided by this conservation authority were from 2007, and therefore recent trends could not be identified for these sites. However, land usage in this region has not changed in that time period, and a 2010 water quality status report published by the conservation authority also confirmed that these particular sites had retained excellent water quality (SVCA, 2010). This was supported by the data available from the PWQMN, which also indicate that there has not been a drastic change in water quality during this period. For the LSRCA sites, only the Uxbridge Brook site and the Dyment’s Creek site could be sampled by the kick and sweep method due to the depth of the river at the other two locations. A grab sample taken from the sediment at the Maskinonge River site was devoid of benthic organisms, and the West Holland Canal was too deep for a grab sample to be attempted. However, benthic counts were obtained from the conservation authority for the Maskinonge River for the 2011, 2012 and 2014 fall sampling events at a nearby sampling location used by the Provincial Water Quality Monitoring Network (PWQMN). The average Hilsenhoff Family Biotic Index (FBI, the family-level version of the cHBI) and %EPT for this site were 6.35 and 2.43%, respectively (averaged across the three years; standard deviations of 0.32 (FBI) and 0.92 (%EPT)). Since the other sites had not been analyzed to the family level, the family-level benthic data obtained from the LSRCA were collapsed to the same level of identification used for the other sites, and a modified cHBI value of 5.90 was calculated. Therefore, when the benthics were analyzed to only the 27-group level, both Uxbridge Brook and the Maskinonge River gave better cHBI ratings than Hamilton Creek (Table 7.2.1). Uxbridge Brook also had the highest percentage of sensitive species of benthic invertebrates (Ephemeroptera, Plecoptera and Trichoptera) of any of the sites analyzed, which generally indicates a healthy river ecosystem. There were no data available from the LSRCA pertaining to biomonitoring in the West Holland Canal due to the depth of this waterway.

The samplers recovered from each of the impacted sites were visually distinguishable from each other and from the reference sites, indicative of the varying nature of the ecosystems (Figure 7.2.1). The Maskinonge River sampler was the least changed visually from the reference sites, but was coated in duckweed (an aquatic plant of the Lemnoideae, a subfamily within the Araceae), likely as a result of slow water movement coupled with high phosphorus levels. The Uxbridge Brook and West Holland Canal sites each had substantial green algae coating the samplers, and all three of these streams have high phosphorus levels according to the PWQMN data (see Table 7.2.3). The Dyment’s Creek sampler was thickly coated in an unknown chemical substance that had badly discoloured the sampler and could not be removed. The sand inside the samplers could also be distinguished visually between sites, indicating that there had been substantial input of substrate while in situ, likely as a result of sediment deposition during rain events.

Figure 7.2.1: Lake Simcoe region samplers after retrieval. Samplers shown (bottom to top) are Maskinonge River, West Holland Canal and Dyment’s Creek.

7.2.2 Community diversity measures

Individual extractions from the front, center and back portions of each sampler from the 2012 season were sent for T-RFLP analysis. Any T-RFLP sample that had a total peak height of less than 3000 was removed from the analysis. This excluded the centre samples from the North Saugeen (NS), Dyment’s Creek (Barrie), West Holland Canal (CF) and Hamilton Creek (HC) samplers as well as the outflow sample from the North Saugeen (leaving only the inflow for that sampler). For the two samplers with good band representation in all three sampler regions, Uxbridge Brook (UX) and Maskinonge Creek (SOD), the three sub-samples grouped together, as did the front and back samples from the remaining samplers (Figure 7.2.2).
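The quality filter described above can be expressed as a simple rule on total peak height. The Python sketch below applies the 3000 threshold from the text and, as an added assumption not stated in the text, converts the retained profiles to relative abundances so that samples with different signal strengths are comparable. Sample names and peak heights are hypothetical.

```python
from typing import Dict

Profile = Dict[int, float]  # fragment size (bp) -> peak height


def filter_and_normalize(profiles: Dict[str, Profile],
                         min_total: float = 3000) -> Dict[str, Profile]:
    """Drop profiles below the total peak-height threshold; scale the rest to sum to 1."""
    kept = {}
    for name, peaks in profiles.items():
        total = sum(peaks.values())
        if total < min_total:
            continue  # too little signal to be reliable
        kept[name] = {size: height / total for size, height in peaks.items()}
    return kept


# Hypothetical profiles: one passes the threshold, one does not.
profiles = {
    "sampler_A_front": {190: 1200, 250: 2400, 310: 400},
    "sampler_B_center": {190: 900, 250: 600},
}
result = filter_and_normalize(profiles)
print(sorted(result))
print(result["sampler_A_front"][190])
```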

Figure 7.2.2: Cluster analysis of T-RFLP data showing within-sampler variation. Samples were obtained from the F, C, B (inflow (front), center and outflow (back)) portions of the samplers after retrieval for comparison of the bacterial community heterogeneity within the length of the sampler. Abbreviations are as follows: Dyment’s Creek (Barrie), West Holland (CF), North Saugeen (NS), Hamilton Creek (HC), Maskinonge Creek (SOD) and Uxbridge Brook (UX). The top diagram is the binary Jaccard comparison for the within-sampler variation. The bottom diagram illustrates the improved clustering of the samplers when abundance is included (Bray-Curtis); however, the West Holland Canal (CF) samples group with HC instead of with each other.

Although no conclusions could be drawn for the North Saugeen sampler due to poor amplification, each of the other samplers harbored its own unique community. When phylotype abundances were included in the statistical analyses, the bacterial communities showed high similarity between the different locations within the samplers, except that the West Holland inflow and outflow samples (CF-F and CF-B) did not cluster together. When abundance is disregarded, however, the West Holland samples do cluster, albeit with only slightly greater similarity to each other than to the two reference sites.

In order to provide a greater quantity of DNA for further analysis, the front (inflow), centre and back (outflow) DNA from each sampler was combined and purified using GE Healthcare S200 purification columns (in triplicate), and the pooled DNA was used for subsequent analysis. Two replicates of pooled samples from each sampler, along with two DNA extractions from the sediment at the Barrie site for comparison, were analyzed via T-RFLP. Although the sampler replicates generally gave good agreement by T-RFLP analysis, the West Holland River sampler (CF) and Hamilton Creek sampler again showed greater similarity between the two sites than between the two replicates. Further investigation revealed that the primary driver of the clustering was the abundance of one particular band (190 bp), which accounted for close to 50% of the total T-RFLP peaks in all of the samples except for Maskinonge (SOD), Dyment’s Creek (BH-1, BL-1) and Uxbridge Brook (UX). For this reason, the principal coordinate analysis was performed separately for the Bray-Curtis distance and Jaccard distance between sites. As can be seen in Figure 7.2.3, the West Holland Canal was the only sampler that segregated differently by the two distance calculations. However, by both of these analyses the West Holland sampler bacterial community was more similar to the reference sites than to any of the other impacted sites.

Figure 7.2.3: Principal coordinate analysis of T-RFLP results from sampler replicates. Samples in both graphs are numbered as follows: Dyment’s Creek (1,2), West Holland Canal (3,4), Hamilton Creek (5,6), North Saugeen (7,8), Maskinonge Creek (9,10) and Uxbridge Brook (11,12). The Uxbridge Brook sites are overlapping in both plots and therefore the individual numbers are not visible.

In addition to the T-RFLP analysis, pyrosequencing was also performed on the combined sampler DNA; however, for the Dyment’s Creek sample there was insufficient DNA for the analysis. The Uxbridge Brook sample was sequenced twice from the same DNA to serve as technical replicates. An examination of the pyrosequencing data revealed that the primary driving force for the observed similarities between the reference sites and the West Holland Canal site (CAR) was the prevalence of one particular species at a number of the sites. This was consistent with the T-RFLP analysis, where it had also been determined that a single dominant band was influencing the observed clustering. In the pyrosequencing data, three of the samples analyzed showed a clear dominance of Pseudomonas fluorescens. Two of these samples corresponded with reference locations: the North Saugeen sample had 41% P. fluorescens, and the same species accounted for 45% of the reads at the Hamilton Creek location. It also comprised 51% of the reads from the West Holland Canal site. Using the available 16S sequence for P. fluorescens PC17 from the NCBI GenBank database, it was determined that the dominant band observed in the T-RFLP data was consistent with this species, and therefore dominance of this species could be established in the T-RFLP data as well. In contrast, the remaining three sites (Uxbridge Brook, Maskinonge River and Dyment’s Creek) had considerably lower prevalence of P. fluorescens based on both the pyrosequencing reads and the observed T-RFLP band. For the Uxbridge Brook sample and the Maskinonge Creek sample, P. fluorescens was still the most abundant single OTU in the pyrosequencing dataset; however, this species accounted for only 18.5% and 19.4% of reads, respectively. The Dyment’s Creek samples had insufficient DNA for pyrosequencing; however, the T-RFLP data indicate that P. fluorescens was effectively absent from this site both within the sampler DNA and also in the analyzed sediment samples (less than 0.05% in any sample).
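Matching a T-RFLP band to a known 16S sequence relies on predicting the labeled terminal fragment that a restriction digest would produce. A minimal Python sketch of that prediction follows. It assumes the fragment is measured from the 5' end of the forward-labeled strand and uses the AluI recognition site (AGCT, cut between G and C); the amplicon shown is a short made-up sequence for illustration, not the P. fluorescens 16S gene.

```python
def terminal_fragment_length(amplicon: str, site: str = "AGCT", cut_offset: int = 2) -> int:
    """Length of the 5' terminal fragment after cutting at the first recognition site.

    cut_offset is the number of bases of the site left on the 5' fragment
    (2 for AluI, which cuts AG^CT). Returns the full amplicon length if uncut.
    """
    position = amplicon.upper().find(site)
    if position == -1:
        return len(amplicon)
    return position + cut_offset


# Made-up amplicon: first AGCT begins at index 12, so the fragment is 14 bases.
toy_amplicon = "ACGTACGTACGT" + "AGCT" + "TTTTAGCTGG"
print(terminal_fragment_length(toy_amplicon))
```

A predicted length is compared with the observed band size; in practice, observed sizes can deviate slightly from the prediction because of fluorophore and sizing effects, so agreement is judged within a small tolerance.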

Figure 7.2.4: Principal coordinate analysis (PCoA) of the bacterial community compositions revealed by 16S pyrosequencing data. Beta diversity is determined using Bray-Curtis distances. UX12 and UX122 are technical replicates of the same DNA. Abbreviations are as follows: Uxbridge Brook (UX), North Saugeen (NS), West Holland Canal (CAR), and Hamilton Creek (HC).

7.2.3 Quantitative PCR

Quantitative PCR results for sampler DNA are shown in Table 7.2.2. DNA yields from the sampler sands were typically low, so pooled samples were used. Each biological replicate is a pool of inflow, center and outflow extractions. However, the low concentration of mobile genes relative to total 16S gene copies required that different amounts of DNA be used for 16S analysis compared to target gene analysis in order to stay within the linear range of the respective primers. Data were not converted to estimates of copy number since the target gene amplifications were still at or approaching the limit of linear amplification despite the larger volume of DNA used for the target genes relative to the 16S genes. Results given are therefore cycle threshold values (Ct) normalized to the 16S amplification for the same sample (deltaCt) but not converted to actual gene copies, since a doubling at each cycle cannot be assumed. Results are averages of at least two independent biological replicates.

Table 7.2.2: DeltaCt comparison of environmental samplers by quantitative real-time PCR. The deltaCt is determined by subtracting the 16S cycle threshold (Ct) value from the target gene Ct value. A low deltaCt value is therefore indicative of a high concentration of target gene since there was a smaller difference between the target and the 16S gene cycle thresholds. The melting temperature (Tm) of the resulting PCR product is included as this differed according to site for some primer sets.

| Site | IS1071 deltaCt | IS1071 Tm | sulI deltaCt | sulI Tm | IncP deltaCt | IncP Tm |
|---|---|---|---|---|---|---|
| Uxbridge Brook | 4.93 | 88 | 6.76 | 85.4 | 9.65 | 85.2 |
| Dyment's Creek | 5.88 | 87.7 | 9.67 | 85.4 | 8.95 | 85.7 |
| West Holland Canal | 9.52 | 87.7 | 8.78 | 88 | 9.52 | 88.65 |
| Maskinonge Creek | 9.43 | 87.7 | 10.03 | 87.1 | 9.92 | 89.35 |
| Hamilton Creek | 10.22 | 87.7 | 9.45 | 88 | 9.20 | 88.65 |
| North Saugeen | 12.84 | 87.7 | 9.45 | 87.1 | 8.78 | 89.1 |
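The normalization behind these values can be sketched in Python as follows. The sketch takes deltaCt as the target gene Ct minus the 16S Ct for the same sample, so that a lower value means relatively more target gene, and averages across biological replicates. It deliberately stops at deltaCt and does not convert to fold differences or copy numbers, since that would require assuming a doubling at each cycle. The Ct values shown are hypothetical.

```python
from typing import List, Tuple


def delta_ct(target_ct: float, reference_ct: float) -> float:
    """Target gene Ct minus 16S Ct for one sample (lower = more target)."""
    return target_ct - reference_ct


def mean_delta_ct(replicates: List[Tuple[float, float]]) -> float:
    """Average deltaCt over (target_ct, 16S_ct) pairs from biological replicates."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    values = [delta_ct(target, reference) for target, reference in replicates]
    return sum(values) / len(values)


# Hypothetical Ct values for two biological replicates of one sampler.
replicates = [(27.0, 17.0), (27.4, 17.0)]
print(round(mean_delta_ct(replicates), 2))
```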

In addition to low amplification, for all except the IS1071 primers the environmental samples gave multiple and/or broad peaks, indicating that more than one product was produced. These peaks had a higher melting temperature than would be expected for primer dimers, and when representative samples were analyzed by gel electrophoresis, a single band was observed (data not shown). This suggests that the multiple peaks on the melting curve analysis are the result of similarly sized PCR products with varying G+C content. In support of this assertion, the melting temperature of the qPCR product was consistent between different replicates of the same sampler, as would be expected if the differences reflect site-specific variation in the target sequence. The results included in Table 7.2.2 correspond to the primer sets that gave good reproducibility across multiple biological replicates, including the IS1071 primers designed in this study and primers from other published studies that targeted the IncP plasmid backbone (Smalla et al. 2013) and the sulI gene conferring resistance to sulfonamide antibiotics (Cheng et al. 2013).

In addition to the primer sets listed in Table 7.1.2, a number of other mobile element primer sets from the existing literature were tested, including primers designed to target the Tn3 and Tn21 classes of transposons (Gotz et al. 1996). These primers were not originally designed for qPCR analysis; however, the amplified product was an appropriate size for this type of analysis. The deltaCt values obtained with these primer sets were quite low (ranging from 3 to 8), indicating high abundance of these transposons at all sites; however, both peak quality and reproducibility between replicates were insufficient for the results to be included. Conversely, the IS1071 primers were both highly specific and highly reproducible between biological replicates (within 0.5 Ct after normalization), consistent with their design to target only one specific member of the Tn3 family of transposons.

7.2.4 Correlations between bacterial communities and water quality parameters

With the exception of Dyment’s Creek, water quality parameters were available from the Provincial Water Quality Monitoring Network (PWQMN) for each of the streams used in this project. For Uxbridge Brook the sampling site corresponded to the precise location of the PWQMN monitoring station; however, for accessibility reasons the other sampling sites were located in other regions of the same streams. For Dyment’s Creek, water quality parameters were available from Environment Canada for 2011 and these values were used to approximate the conditions in 2012. For each of the streams, the water quality parameters were averaged over the full year and the values are included in Table 7.2.3.

Table 7.2.3: Water quality parameters for each site. Data for all but the Dyment’s Creek site are the averages from the 2012 data available through the PWQMN database. Data for Dyment’s Creek are the averages from the stream water chemistry data provided by Environment Canada for the 2011 field season.

| Site | Chloride (mg/L) | Phosphorus (mg/L) | DO (mg/L) | EC (µS/cm) | pH | Fe (µg/L) | Nitrates (mg/L) |
|---|---|---|---|---|---|---|---|
| Maskinonge River | 92.7 | 0.1755 | 8.12 | 831 | 7.66 | 550 | 0.77 |
| Uxbridge Brook | 45.16 | 0.1154 | 10.07 | 538 | 7.87 | 545 | 2.25 |
| Hamilton Creek | 4.8 | 0.0035 | 9.03 | 441 | 8.36 | 24 | 0.526 |
| North Saugeen | 5.53 | 0.003 | 10.22 | 427 | 8.21 | No data | 0.382 |
| Dyment's Creek | 245.08 | 0.024 | 7.60 | 1022 | 7.73 | 169 | 2.76 |
| West Holland River | 92.55 | 0.14 | 9.26 | 735.06 | 7.74 | 294 | 1.01 |

The principal coordinates derived from both the T-RFLP and the pyrosequencing data were compared to the available water quality parameters, and some significant correlations were observed (Table 7.2.4). The T-RFLP data contained more replicates (since there were duplicates of each sample) and included the Dyment’s Creek site; they are therefore the primary data discussed. Principal coordinate 1 (PC1) showed the strongest correlations with %EPT and dissolved oxygen, indicating that high biological oxygen demand accounted for much of the variability between these sites. PC1 also correlated with chloride levels, which are a dominant feature of all the contaminated sites. PC2 did not correlate with any of the water quality parameters but was correlated with the abundance of both IS1071 and sulI in the population. IS1071 showed opposite correlations with nitrate concentrations and P. fluorescens abundance. Since gene abundance is given as deltaCt, a high value corresponds to low abundance of the gene, and therefore high IS1071 gene abundance was correlated with high nitrate concentrations. Conversely, IS1071 was not as abundant in the sites that had substantial P. fluorescens present.

Table 7.2.4: Correlations of the bacterial communities to available water quality data. Only those correlations that were found to be significant are included in this table. PC numbers refer to the first and second principal coordinates from the T-RFLP analysis (Bray-Curtis) and from separate analyses of the pyrosequencing data by Bray-Curtis (BC) or Jaccard (J). DO is dissolved oxygen and DO4 is the average dissolved oxygen over the summer months (June to September). Gene abundances from qPCR analysis are listed by their gene name (sulI and IS1071).

Parameter              Correlates        Pearson's R   Degrees of freedom   p<
T-RFLP-PC1             PyroPC2-BC        0.936         4                    0.01
                       PyroPC2-J         0.934         5                    0.01
                       %EPT              0.900         4                    0.02
                       DO                0.855         5                    0.02
                       Chloride          -0.798        5                    0.05
T-RFLP-PC2             PyroPC1-BC        0.955         4                    0.01
                       PyroPC1-J         -0.820        5                    0.05
                       IS1071            -0.879        5                    0.01
                       sulI              -0.883        5                    0.01
PyroPC1-Bray_Curtis    PyroPC1-J         -0.989        4                    N/A
                       Simpsons I        0.997         2                    0.01
                       IS1071            -0.868        4                    0.05
                       Nitrates          0.895         4                    0.02
                       P. fluorescens    -0.922        4                    0.01
PyroPC2-Bray_Curtis    PyroPC2-J         0.929         4                    N/A
                       %EPT              0.977         3                    0.01
                       DO                0.844         4                    0.05
PyroPC1-Jaccard        IS1071            0.843         5                    0.02
                       P. fluorescens    0.932         5                    0.01
PyroPC2-Jaccard        %EPT              0.959         5                    0.001
                       sulI              -0.777        5                    0.05
                       DO                0.838         8                    0.01
CHBI                   DO                -0.905        7                    0.001
                       Chloride          0.874         7                    0.01
Simpsons I             sulI              -0.929        3                    0.05
                       Total P           0.882         7                    0.01
EPT                    DO                0.793         7                    0.02
                       DO4               0.845         7                    0.01
IS1071                 Nitrates          -0.932        5                    0.01
                       Ps. fluorescens   0.764         5                    0.02
IncP                   Total P           0.936         6                    0.001
DO                     Chloride          -0.808        8                    0.01
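
As an illustration of how entries such as those in Table 7.2.4 are obtained, the following minimal sketch computes Pearson's R, its degrees of freedom (n - 2) and the corresponding t statistic for chloride against dissolved oxygen, using only the six annual site means from Table 7.2.3. The function name is hypothetical and the software actually used for the thesis correlations is not stated here; because the tabulated DO/chloride correlation is based on more observations (8 degrees of freedom), this example is not expected to reproduce the tabulated value exactly.

```python
import math


def pearson_r(x, y):
    """Return (r, degrees_of_freedom, t_statistic) for two paired series."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length series with at least 3 points")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    df = n - 2
    # t statistic used to look up significance for Pearson's R
    t = r * math.sqrt(df / (1.0 - r * r))
    return r, df, t


# Annual means from Table 7.2.3, in the site order given there.
chloride = [92.7, 45.16, 4.8, 5.53, 245.08, 92.55]          # mg/L
dissolved_oxygen = [8.12, 10.07, 9.03, 10.22, 7.60, 9.26]   # mg/L

r, df, t = pearson_r(chloride, dissolved_oxygen)
print(f"r = {r:.3f}, df = {df}, t = {t:.2f}")
```

The sign of the result agrees with the negative DO/chloride relationship reported in Table 7.2.4.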

7.2.5 Primer design specific to RIT elements

In order to expand the current study to include RIT elements, it was necessary to develop primers that could be used to search for novel elements in environmental samples. Although the third integrase in the RIT element is more highly conserved than the first two integrases, alignments of the nucleotide sequences from our RIT collection showed no promising candidate regions from which to design primers. To address this lack of conservation, the RIT elements were divided into sub-groups sharing higher homology, and primers were designed specific to conserved regions within the third integrase for these groups. Although specific primers could be designed for genus-level groups such as Sinorhizobium and Acidiphilium, there was insufficient conservation to design primers aimed at broader groups, making qPCR for RIT elements in environmental samples unfeasible. However, the discovery of conserved sequences within the inverted repeats from RIT elements found in 10 different genera (see Table 6.2.2) highlighted another route by which RIT element distribution in environmental samples could be evaluated. Primers were designed that could bind to these conserved regions in the terminal inverted repeats and would therefore amplify the complete RIT element (because the repeats are inverted, the forward and reverse primers are identical). Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 were used as the control strains, and primers were designed that would match each of the strains specifically. These primers were shown to have no cross-amplification with the alternate control. An alignment of the inverted repeats from all 10 strains was then used to design two ambiguous primer sets: the first had degenerate bases in all locations where there were disagreements (RIT_ambig), and the second kept some of the original bases found in OLGA172 in case the fully ambiguous primer lacked specificity (RIT_less_ambig). Both Burkholderia sp. str. OLGA172 and Caulobacter sp. K31 produced bands of the expected size with the fully ambiguous primers, and this primer set was therefore used to search for similar RIT elements in the environmental samplers. Faint positive bands of the expected size were amplified from sampler DNA (specifically the Uxbridge Brook and Dyment’s Creek sites); however, cloning and characterization of these products was beyond the scope of this project.
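
The logic behind the two ambiguous primer sets can be sketched as follows. This is a minimal illustration only: the short sequences are hypothetical placeholders rather than the actual inverted repeats, the function names are invented for the example, and an ungapped alignment of equal-length sequences is assumed. Each alignment column is collapsed to its IUPAC ambiguity code (the fully ambiguous design), and selected positions can then be reset to the bases of a reference strain (the less ambiguous design).

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
CODE_SIZE = {code: len(bases) for bases, code in IUPAC.items()}


def degenerate_consensus(aligned):
    """Collapse an ungapped alignment into a fully degenerate primer."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = zip(*(s.upper() for s in aligned))
    return "".join(IUPAC[frozenset(col)] for col in columns)


def keep_reference_bases(consensus, reference, positions):
    """Replace degenerate codes at the given 0-based positions with reference bases."""
    return "".join(
        reference[i].upper() if i in positions else code
        for i, code in enumerate(consensus)
    )


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    total = 1
    for code in primer:
        total *= CODE_SIZE[code]
    return total


# Hypothetical aligned repeat fragments (placeholders, not real RIT sequences).
repeats = ["ACGTTGCA", "ACGTTGTA", "ATGTTGCA"]
ambig = degenerate_consensus(repeats)
less_ambig = keep_reference_bases(ambig, repeats[0], {1})
print(ambig, degeneracy(ambig))            # fully degenerate design
print(less_ambig, degeneracy(less_ambig))  # reference base restored at position 1
```

Restoring reference bases lowers the degeneracy of the primer pool, which is the trade-off between breadth and specificity described above.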

7.3 Discussion

7.3.1 Biomonitoring

The use of benthic invertebrates to establish the health of a stream ecosystem is well established (Rosenberg and Resh, 1993; Jones et al. 2007). Although the reference sites historically performed better by these metrics than the impacted sites, the biomonitoring scores for the reference sites were less favorable in this analysis. In particular, the coarse Hilsenhoff biotic index (cHBI) ranked the reference streams as ‘fair’, which was unexpected but is consistent with the observations by the SVCA that biomonitoring scores are changing. The exact reasons for this trend have not been elucidated, since land use and water quality parameters have not been declining for these streams, but it has been suggested that it could be the result of increasing temperatures due to changing climate conditions (SVCA, 2010). This highlights an important caveat with using the cHBI value for determining the overall health of an ecosystem, since higher temperatures can mimic the stress effects of the organic pollutants for which this method was originally designed (Hilsenhoff, 1987). For the impacted sites, Dyment’s Creek performed the worst in terms of cHBI score, with an overall ranking of ‘poor’. Pollution entering this stream is one potential reason for the poor biomonitoring ranking; however, the lack of gravel substrate is likely to be a strong contributing factor, as the sediment consisted primarily of sand and debris from the landfill including tires, metal rims and plastic garbage bags. Trichoptera were abundant at this site (leading to a high %EPT score); however, they were found to be exclusively from the Hydropsychidae family, which is known to be tolerant of degraded conditions. The biomonitoring ranking for the Uxbridge site was much better than anticipated, and even exceeded the cHBI ranking for the reference sites. This site is 2.5 km downstream of a wastewater outflow and was expected to show some anthropogenic impact due to the presence of elevated levels of PAHs and other contaminants. However, the stream itself has extensive riparian vegetation, good shading and a varied sediment composition that would be expected to support a diversity of benthic organisms. It is also likely that the wastewater outflow adds nutrients, as seen by the elevated phosphorus and nitrate concentrations. All of these factors may have contributed to a high overall biomonitoring ranking despite the presence of organic contaminants. This site also had particularly prolific algae growth and a correspondingly high abundance of Coleoptera (grazers) that may have impacted the overall ranking. These were exclusively of the Elmidae family, which is known to be more tolerant of organic pollutants than other beetle families. The LSRCA benthic data are analyzed at the family level, and do show a slightly higher FBI value of 5.78 (‘fairly poor’) for the 2012 sampling. However, the Uxbridge Brook site also had the highest %EPT values obtained in this study, and although the Trichoptera were all Hydropsychidae (and therefore more tolerant than other families), the Ephemeroptera individuals came from families that are known to be quite sensitive to organic pollution. The Simpson’s diversity metric also ranked the Uxbridge Brook site as healthier than the reference sites (Table 7.2.1). This suggests that from a biomonitoring point of view, the Uxbridge Brook site is maintaining a healthy benthic community.
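
For readers unfamiliar with the metrics discussed above, the sketch below shows how a Hilsenhoff-type biotic index (an abundance-weighted mean of taxon tolerance values), %EPT and a Simpson diversity value can be computed from a table of counts. The counts and tolerance values are hypothetical placeholders chosen for illustration and are not the data or the published tolerance scores used in this study; the Simpson value is computed here as 1 minus the sum of squared proportions, which is an assumption since several forms of the index are in use.

```python
def hilsenhoff_index(counts, tolerance):
    """Abundance-weighted mean tolerance: sum(n_i * t_i) / N over scored taxa."""
    scored = {taxon: n for taxon, n in counts.items() if taxon in tolerance}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("no individuals with a tolerance value")
    return sum(n * tolerance[taxon] for taxon, n in scored.items()) / total


def percent_ept(counts, ept_taxa):
    """Percentage of individuals in Ephemeroptera, Plecoptera or Trichoptera taxa."""
    total = sum(counts.values())
    return 100.0 * sum(n for taxon, n in counts.items() if taxon in ept_taxa) / total


def simpson_diversity(counts):
    """Simpson diversity as 1 - sum(p_i^2) (assumed form of the index)."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical family-level counts and placeholder tolerance values.
counts = {"Hydropsychidae": 40, "Elmidae": 30, "Baetidae": 20, "Chironomidae": 10}
tolerance = {"Hydropsychidae": 4, "Elmidae": 4, "Baetidae": 5, "Chironomidae": 7}
ept_families = {"Hydropsychidae", "Baetidae"}

hbi = hilsenhoff_index(counts, tolerance)
ept = percent_ept(counts, ept_families)
simpson = simpson_diversity(counts)
print(f"biotic index = {hbi:.2f}, %EPT = {ept:.1f}, Simpson = {simpson:.2f}")
```

The example also makes the caveat in the text concrete: a sample dominated by a single tolerant EPT family can score well on %EPT while the tolerance-weighted index is unchanged.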

7.3.2 Bacterial community assessment

The placement of the samplers directly in the water column was important for two reasons: first, to minimize between-site differences in bacterial communities specific to the nature of the river substrate; and second, to specifically characterize the members of the bacterial community that were most accessible either by direct contact with the river, or through downstream applications such as irrigation or drinking water sources. In this manner, the samplers can be used as a proxy for the indigenous bacterial community and utilized for quantitative PCR comparison of mobile genes. The use of a sand substrate for colonization inside the samplers was considered preferable to simply filtering water samples, since this could allow for the establishment of biofilm communities that may not be evident otherwise. These communities therefore represent an accumulated population from the 4-month sampling period as opposed to a single sampling event. Although not originally foreseen, this method had a particular advantage over single time point sampling due to the presence of sediment washed into the samplers after rain events. This was originally seen as a disadvantage, since the goal was to maintain a standardized substrate across all streams. In reality, however, these transient communities that enter the water column after rain events are also accessible through direct contact or downstream applications, making their inclusion in the samples beneficial. The samplers still provide a standardized substrate for colonization, which minimizes the variation and allows for comparison of streams regardless of sediment composition. In this way, the samplers are designed specifically to capture the bacterial members that are transiently present in the water column as well as providing a suitable substrate for the establishment of biofilm communities that would normally be established within the sediment. Although the low DNA yields obtained from the samplers made analysis difficult, distinctive communities were identified for each of the sites.

The sampler diversity was examined by T-RFLP, and bands were found to correlate with between-stream differences as opposed to within-sampler differences. This suggests that the water movement through the samplers was sufficient to create homogeneous conditions in terms of oxygen and nutrient distributions. An added benefit of this consistency is that the large volume of soil present in the samplers can be frozen and used for additional analyses at a later date. The strong abundance of one particular species in the majority of the samplers was unexpected, and could be an indication that the length of time that the samplers are in the stream may need to be increased. It is possible that the initial colonization of the new substrate is accomplished by P. fluorescens and that over a longer period the bacterial community would diversify to more closely resemble the sediment community for that particular stream. This issue would need to be addressed before this method could be utilized in further studies, since the increased abundance of this one species interferes with subsequent analysis. Interestingly, the only sample from the contaminated sites that showed an overwhelming abundance of P. fluorescens was the West Holland Canal sampler; however, several factors may explain this. First, this is the site for which the available water quality measurements were taken at a great distance (8.4 km) from the sampler. The actual PWQMN site was not accessible for a sampler due to both the dimensions of the river and the level of public presence. The sampler was placed upstream of the PWQMN site, and the stretch of the canal between the sampling site and the PWQMN site is entirely agricultural; it would therefore be expected that the PWQMN data represent the worst-case scenario for agricultural inputs to this river, and the conditions at the actual sampling site may be less severe. Second, due to the volume of water moving through the canal, the dissolved oxygen levels are likely to be quite variable in this stream. Finally, this location was chosen based on pollutants found in the sediment of the West Holland River; however, the levels of contamination in the water itself were not found to exceed provincial guidelines (LSRCA, 2004). Given the depth of the canal, it is likely that the bacterial community in the water column (and therefore in the sampler) is rarely in contact with these contaminants.

The Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samples all had lower than 20% P. fluorescens abundance. In addition to the contaminants listed in Table 7.1.1, there are a couple of water quality parameters that could account for the lack of this species at the LSRCA sites. First, the LSRCA sites all have significantly higher levels of phosphorus and chloride, higher conductivity readings and lower pH values than the SVCA sites. Second, the dissolved oxygen (DO) values during the sampling period (June to September) were very low for both the Maskinonge Creek and Dyment’s Creek locations (supplemental table S2). The Uxbridge Brook location did not have DO levels below those of the reference sites; however, comparison with the North Saugeen site shows that the period of low dissolved oxygen was much longer at Uxbridge Brook (around 8 mg/L for all four months as opposed to only 2 months), and therefore the DO levels during the sampling time were lower than the annual averages would suggest. Since P. fluorescens is a strictly aerobic organism, the extended low dissolved oxygen levels could certainly account for the decreased abundance of this species at the contaminated sites.

Since the Maskinonge Creek, Uxbridge Brook and Dyment’s Creek samplers all segregated from the other sites in terms of phylogenetic diversity, it was expected that these three sites would also segregate from the others in terms of quantitative PCR results. The Dyment’s Creek and Uxbridge Brook samples both had significantly higher IS1071 abundance than the other sites, and all sites were significantly higher than the North Saugeen River. The West Holland Canal and Maskinonge Creek samples were not significantly different from each other based on IS1071 abundance, and the West Holland Canal was only marginally higher than the Hamilton Creek sample (P=0.047). Given the known association of IS1071 with catabolic operons, it was expected that the highest abundances would occur in the sites with complex contaminants, and they did. However, I had also expected a high abundance of IS1071 at the two agricultural sites relative to Uxbridge Brook, which was not observed. The higher abundance at Uxbridge Brook and Dyment’s Creek could be the result of the strong association of IS1071 with IncP plasmids (Dennis, 2005; Dunon et al. 2013), which would enter these streams through wastewater and landfill leachate, respectively. The increased abundance of the sulI antibiotic resistance gene solely at the Uxbridge Brook site is also consistent with the expectations from wastewater outflow. All sites had comparable numbers for IncP backbone abundance, but with different melting temperatures of the qPCR products, suggesting that there is a much greater diversity of these plasmids than only the ones carrying the sulI genes. However, it should be noted that these primers were designed to be used in conjunction with a specific probe, and therefore the results obtained may simply be an artifact of the method used.
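
Because gene abundances in this chapter are reported as deltaCt, where a higher value means a less abundant gene, the following minimal sketch shows how such values translate into relative abundances. It assumes that deltaCt is the target gene Ct minus the 16S rRNA gene Ct and that amplification efficiency is ideal (a doubling per cycle); both are assumptions made for illustration, and the Ct values are hypothetical rather than measurements from this study.

```python
def delta_ct(ct_target, ct_reference):
    """deltaCt: target Ct minus reference (assumed 16S rRNA gene) Ct."""
    return ct_target - ct_reference


def relative_abundance(dct, efficiency=2.0):
    """Target abundance relative to the reference, assuming a fixed per-cycle efficiency."""
    return efficiency ** (-dct)


# Hypothetical Ct values for one target gene at two sites (same 16S Ct at both).
site_a = delta_ct(ct_target=28.0, ct_reference=18.0)  # deltaCt = 10
site_b = delta_ct(ct_target=24.0, ct_reference=18.0)  # deltaCt = 6

fold_difference = relative_abundance(site_b) / relative_abundance(site_a)
print(f"site B carries {fold_difference:.0f}-fold more target per 16S copy than site A")
```

The inverse relationship is why a negative correlation between IS1071 deltaCt and nitrate corresponds to a positive association between IS1071 abundance and nitrate.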

The broad distribution of IS1071 found in this study suggests that continued studies on this particular element could prove interesting. Most notably, the increased abundance of IS1071 at the Dyment’s Creek location, coupled with the known diversity of contaminants entering this stream from the groundwater, suggests that this would be an interesting location for a more detailed study. There are a number of known contaminants in the Uxbridge Brook location, yet the biomonitoring data do not indicate a decrease in overall ecosystem health. However, the bacterial community at the Uxbridge Brook site was clearly altered in comparison to the SVCA sites, which merits further investigation of the impacts that this alteration has on the overall community dynamics. The Uxbridge Brook site also had unexpectedly high levels of IS1071, as well as high levels of sulI gene abundance. This site would therefore also be an interesting location for a mobile element study in order to determine whether biomonitoring using macroinvertebrates is informative for understanding the bacterial community response to environmental contaminants. The qPCR results obtained in this study suggest that the bacterial community is enriched in genetic elements commonly associated with IncP plasmids, including IS1071 and sulI, which may indicate that this community represents an increased risk of resistance gene transmission. However, primers targeting the class 1 integrase gene (intI1) did not detect this gene in any of the environmental samplers, which was unexpected since the sulI gene is a known component of the class 1 integrons commonly found on IncP plasmids (Schlüter et al. 2007). The reasons for the absence of intI1 in the samplers are unclear, as the primers showed equal efficiency to the IS1071 primers on control DNA and a sub-sample taken from the samplers earlier in the season had been positive for intI1. This is not likely to be due to total bacterial abundance, as the 16S results were comparable between the sub-sample and the final sampler extractions; however, it could be indicative of a change in the bacterial population that resulted in a decrease in intI1 relative to the total community. The highly specific and reproducible results obtained with the IS1071 primers suggest that these primers may be better suited to identifying impacted bacterial communities when DNA concentration is a limiting factor. As IS1071 is commonly found in multiple copies in the genome, it makes for a more robust target for qPCR analysis. IS1071 is also associated with several catabolic plasmids and transposons (Dunon et al. 2013; Van Houdt et al. 2000; Top and Springael, 2003) and therefore provides a broader target than solely antibiotic resistance plasmids.

There are two major challenges in examining mobile elements in community samples beyond the integron and plasmid replication genes that are currently used. The first challenge is diversity of the nucleotide sequence of the target genes. For plasmids, the essential nature of the replication genes provides sufficient conservation for primer design (Smalla et al. 2013; Götz et al. 1996), and these primers have been utilized on environmental samples, albeit often in association with a secondary probe for increased specificity. Class 1 integrons in particular are highly conserved, specifically due to selection for the antibiotic resistance genes with which they are associated (Gillings et al. 2015). This conservation is not typical of mobile elements found in the environment, with the result that testing for other families of mobile elements through targeted qPCR or microarray approaches is not possible except in cases where the goal is to track the abundance and distribution of a previously determined individual element. This was illustrated in this study by the inconsistent results obtained with the Tn3 and Tn21 primers, as well as the inability to target groups of RIT elements. Although global distribution of individual mobile elements has not been commonly observed, in this study we designed qPCR primers specifically targeting IS1071 and found this particular element to have a broad distribution in environmental samples.

The goal of this study was to develop a reproducible method of analyzing the anthropogenic impacts on freshwater stream bacterial communities, with the aim of classifying reference and impacted locations for further characterization. These samplers can be used to draw comparisons between different locations in a manner that is not dependent on sediment quality (or availability) and is not impacted by spatial variations caused by groundwater inflow. This characterization can also point directly to individual elements that may warrant further investigation in the impacted sites. In this way, sites can be chosen for which metagenomic characterization would be informative, and this information can be accumulated and stored for future analysis of general trends.

7.4 Acknowledgements

The following people are gratefully acknowledged for their insight and contributions: Jim Roy, Alex Fitzgerald and Lee Grapentine (Environment Canada), Dave Lembcke and Rob Wilson (LSRCA), Martha Nicol and Shaun Anthony (SVCA), Chris Jones (OBBN), landowners for access especially Brouwer Sod Farms in Keswick and Jason Verkaik at Carron Farms, Shu Yi (Roxana) Shen, Rosemary Saati and the other summer students for assistance, Toby Ricker for design and construction of samplers, Ross Reid for assistance with sampler access and placement. Funding in the form of a NSERC Discovery Grant to RF and a NSERC PGS-D Scholarship to NR is also gratefully acknowledged. The funding agency had no role in this study.

7.5 References

Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., ... & Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature methods, 7(5), 335-336.

Cheng, W., Chen, H., Su, C., & Yan, S. (2013). Abundance and persistence of antibiotic resistance genes in livestock farms: a comprehensive investigation in eastern China. Environment international, 61, 1-7.

Dennis, J. J. (2005). The evolution of IncP catabolic plasmids. Current opinion in biotechnology, 16(3), 291-298.

DeSantis, T. Z., Hugenholtz, P., Keller, K., Brodie, E. L., Larsen, N., Piceno, Y. M., ... & Andersen, G. L. (2006). NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic acids research, 34(suppl 2), W394-W399.

Dunon, V., Sniegowski, K., Bers, K., Lavigne, R., Smalla, K., & Springael, D. (2013). High prevalence of IncP-1 plasmids and IS1071 insertion sequences in on-farm biopurification systems and other pesticide-polluted environments. FEMS microbiology ecology, 86(3), 415-431.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-1797.

Fierer, N., Jackson, J. A., Vilgalys, R., & Jackson, R. B. (2005). Assessment of soil microbial community structure by use of taxon-specific quantitative PCR assays. Applied and environmental microbiology, 71(7), 4117-4120.

Gillings, M.R., Gaze, W.H., Pruden, A., Smalla, K., Tiedje, J.M. and Zhu, Y.-G. 2015. Using the class 1 integron-integrase gene as a proxy for anthropogenic pollution. ISME journal doi:10.1038/ismej.2014.226

Götz, A., Pukall, R., Smit, E., Tietze, E., Prager, R., Tschäpe, H., ... & Smalla, K. (1996). Detection and characterization of broad-host-range plasmids in environmental bacteria by PCR. Applied and Environmental Microbiology, 62(7), 2621-2628.

Hilsenhoff, W.L. 1987. An improved biotic index of organic stream pollution. Great Lakes Entomology 20: 31-39.

Jechalke, S., Dealtry, S., Smalla, K., & Heuer, H. (2013). Quantification of IncP-1 plasmid prevalence in environmental samples. Applied and environmental microbiology, 79(4), 1410-1413.

Johnsen, A. R., & Karlson, U. (2007). Diffuse PAH contamination of surface soils: environmental occurrence, bioavailability, and microbial degradation. Applied Microbiology and Biotechnology, 76(3), 533-543.

Jones F C, Somers KM, Craig B, Reynoldson TB (2007) Ontario Benthos Biomonitoring Network: Protocol Manual. Queen’s Printer for Ontario.

LSRCA, 2004. Lake Simcoe Watershed Toxic Pollutant Screening Program 2004 Report. Lake Simcoe Region Conservation Authority. Drafted July 2005.

Nemergut, D. R., Martin, A. P., & Schmidt, S. K. (2004). Integron diversity in heavy-metal-contaminated mine tailings and inferences about integron evolution. Applied and environmental microbiology, 70(2), 1160-1168.

Reeder, J., & Knight, R. (2010). Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nature methods, 7(9), 668-669.

Rosenberg DM, Resh VH (1993) Freshwater biomonitoring and benthic macroinvertebrates. Chapman and Hall, New York

Roy, J. W., & Bickerton, G. (2011). Toxic groundwater contaminants: an overlooked contributor to urban stream syndrome? Environmental science & technology, 46(2), 729-736.

Schlüter, A., Szczepanowski, R., Pühler, A., & Top, E. M. (2007). Genomics of IncP-1 antibiotic resistance plasmids isolated from wastewater treatment plants provides evidence for a widely accessible drug resistance gene pool. FEMS microbiology reviews, 31(4), 449-477.

Simpson, E.H. 1949. Measurement of diversity. Nature (London) 163:688.

SVCA, 2010. Saugeen Conservation Water Quality Status Report. Saugeen Valley Conservation Authority. Drafted March 2011.

Top, E. M., & Springael, D. (2003). The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Current Opinion in Biotechnology, 14(3), 262-269.

Van Houdt, R., Toussaint, A., Ryan, M. P., Pembroke, J. T., Mergeay, M., & Adley, C. C. (2000). The Tn4371 ICE family of bacterial mobile genetic elements.

Wright JF, Sutcliffe DW, Furse MT (2000) Assessing the biological quality of fresh waters: RIVPACS and other techniques. Freshwater Biological Association, Ambleside

Yarza, P, P. Yilmaz, E. Pruesse, F.O. Glöckner, W. Ludwig, K-H Schleifer, W.B. Whitman, J. Euzéby, R. Amann and R. Rosselló-Móra. 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology 12:635-645. doi:10.1038/nrmicro3330

Chapter 8 Conclusions and Future Directions

The majority of my PhD research has been dedicated to better understanding the nature and potential role of Recombinase in Trio (RIT) elements, a novel mobile element that involves three linked tyrosine-based site-specific recombinases from separate sub-families. From this work it is evident that in certain ways RIT elements could be considered comparable to insertion sequences, MGEs that encode genes for their own movement but carry no other functional information. From my intensive search of the extant sequence databases, it is clear that in the vast majority of cases the RIT elements contain only the open reading frames coding for the individual recombinase proteins that are presumably responsible for their mobility. These open reading frames are flanked by inverted repeats that are evidently involved in the excision and re-integration of these elements. However, there are no IS families that are known to be mobilized solely through the activity of a TBSSR, and unlike insertion sequences, which can occur in high numbers in an individual genome (Siguier et al. 2014), multiple identical RIT elements within an individual genome are relatively rare. As was described in Chapter 4, copy numbers of RIT elements identified to date range from 1 to 5, and the majority occur in only one copy in the genome. Therefore their primary role is not likely to be to provide homologous regions for genome rearrangements. IS elements also work in concert to mobilize larger segments of DNA (composite transposons), which has not been seen in RIT elements. There were only two instances identified in Chapter 4 where a RIT element appears to have mobilized adjacent genes. In each of these instances the additional genes are found between the RIT element and one of the inverted repeats. This suggests a mechanism more consistent with the transposons and ICEs that utilize a site-specific recombinase at one end of the element. In addition, since mobility of the RIT element was only observed during conjugation, it is possible that their role in the cell may be specific to MGE evolution, or movement of genes between replicons, as opposed to larger genome rearrangements. The prevalence of RIT elements in multi-replicon genomes is also consistent with a potential role in plasmid evolution.

The experiments performed here allow for some speculation on how RIT elements may function. The first issue to be addressed is the presence of three TBSSRs, since this is not consistent with current knowledge of tyrosine recombinases. Since the recombination reaction proceeds in a very symmetrical manner, a role for three separate enzymes is difficult to envisage. However, site-specific recombination generally utilizes two copies of each recombinase, and therefore a symmetrical arrangement is possible. This is supported by the presence of three putative binding sites at each end of the RIT element (two within the inverted repeat and a third within the palindrome sequence). This is also consistent with other mobile elements such as phage λ and Tn4430, both of which have complex binding requirements for the recombination reactions (Hallet et al. 2004). However, as only two cut sites are created in the crossover reaction, it is still unclear whether all three recombinases would be active simultaneously or whether they facilitate different reactions (integration vs. excision or inter- vs. intra-molecular recombination).

The second issue to be addressed from these experiments is the apparent need for conjugation in order for the RIT element to be mobilized. Since the recombinases were separated from the complete RIT element and were already being induced in the recipient cell, the experiment performed in Chapter 6 should not be interpreted as indicating that RIT elements are likely to be mobilized while conjugating into a new host. The conjugation experiment was chosen since it allowed for a single-stranded conformation, which has been shown to be necessary for some integron recombination reactions (Loot et al. 2010). Within the cell, single-stranded DNA is produced prior to conjugation and also by replicons that undergo rolling circle replication. The large plasmids carrying RIT elements that I have identified are not likely to replicate in this manner; however, this type of replication has been associated with some integrated conjugative elements (Wright et al. 2015). As RIT elements are generally found within genomic islands (some of which may actually be ICEs), this provides an opportunity whereby RIT elements could be activated specifically during either replication of these elements or preparation for conjugative transfer. RIT elements could effectively be silent when the ICE is integrated into the chromosome and become active when the ICE is stimulated to excise from the chromosome. This would allow for a brief time when RIT elements could mobilize and have an impact on other targets within the cell.

There are many aspects of RIT element mobility that have yet to be determined. Most importantly, the contribution of each individual recombinase is still an open question. Expression plasmids containing each recombinase separately have already been prepared in the pTrc99 backbone; however, these experiments have not been performed due to the false positive issues with the experimental design. Purification of the recombinase proteins is also ongoing at the SCK•CEN by Rob Van Houdt in order to perform binding assays on the target site, inverted repeats and the palindrome sequence. The presence of a palindrome sequence beyond the inverted repeats is a unique feature compared to other MGEs. Whether this sequence provides accessory binding sites for regulation or forms an important secondary structure for activity has yet to be determined.
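
The two sequence features referred to above, terminal inverted repeats and a palindrome, can both be described in terms of reverse complements, as the following minimal sketch illustrates. The sequences are short hypothetical placeholders, not the RIT repeats or the 20 bp palindrome themselves, the function names are invented for the example, and only perfect matches are considered.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (unambiguous bases only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindrome(seq):
    """True if the sequence reads the same as its reverse complement."""
    return seq.upper() == reverse_complement(seq)


def terminal_inverted_repeat_length(seq):
    """Length of the perfect inverted repeat shared by the two ends of seq."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    length = 0
    # Compare the left end with the reverse complement of the right end,
    # stopping at the first mismatch or at the midpoint of the sequence.
    for left, right in zip(seq[: len(seq) // 2], rc):
        if left != right:
            break
        length += 1
    return length


# Hypothetical element: 7 bp inverted repeats flanking a placeholder interior.
element = "GGCATTC" + "AAAAAAAAAA" + "GAATGCC"
print(terminal_inverted_repeat_length(element))
print(is_palindrome("GAATTC"), is_palindrome("GAATGC"))
```

A palindrome in this sense is simply a sequence that forms an inverted repeat with itself, which is why it could in principle offer either paired binding sites or a hairpin-forming secondary structure.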

Chapter 4 was published in 2012, and 148 RIT elements had been obtained through in silico searching at that time. Subsequent BLAST searches from known elements, and also specifically from the palindrome sequences identified, have brought this number to 183 (Supplemental Table 1); however, this is still clearly not an exhaustive list, as numbers continue to increase with each subsequent search. Unfortunately, the prevalence of draft genomes continues to limit our ability to examine the genomic context for many of these strains. It is important to note, however, that despite their broad distribution there are remarkably few instances of RIT elements shared between close relatives. This strongly suggests that RIT elements are distributed between strains on other mobile elements and rarely incorporated into more stable regions of the genome.

The inverted repeat primers developed in this project could be used to isolate additional RIT elements from environmental samples, and additional primers could be designed based on conserved repeats flanking other RIT elements in the collection. However, the most interesting products of these primers would be the very rare instances where the RIT element is mobilizing more genes than just the three recombinases. These could be preferentially retained through size selection in order to separate them from the amplicons containing only the recombinase genes. Of course understanding the abundance and diversity of RIT elements containing highly similar repeats would also be informative. Although a distinctive role for the RIT elements has not yet been established, the work described here has revealed a great deal about their distribution, their associations and their mobility. It is an important part of the larger goal of improving our overall understanding of the nature of the many MGEs that have yet to be characterized.
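As an illustration of how the degenerate inverted repeat primers relate to the two element-specific primers, the following minimal sketch expands the IUPAC ambiguity codes in the primer sequences listed in Table S1 and checks which repeats each primer would match exactly. It is an assumption of this sketch that inosine (I) can be treated as pairing with any base; the helper names are hypothetical, and exact string matching is only a rough stand-in for primer annealing.

```python
import re

# IUPAC ambiguity codes needed for the primers; "I" (inosine) is treated
# as matching any base, which is a simplifying assumption of this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "Y": "[CT]", "R": "[AG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]", "I": "[ACGT]",
}

def primer_to_regex(primer):
    """Translate a degenerate primer into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

def primer_matches(primer, target):
    """Return True if the degenerate primer matches the whole target sequence."""
    return primer_to_regex(primer).fullmatch(target.upper()) is not None

# Primer sequences as listed in Table S1
IR_OLGA = "TTATGCCGATTCCCGGATTATGCCG"
IR_K31 = "TAATGCCGCGATCCGGATTATGCCG"
IR_AMBIG = "TWATGCCGIIIYCCSGATTATGCCG"
IR_LESS_AMBIG = "TTATGCCGIIIYCCSGATTATGCCG"

for name, primer in (("IR_ambig", IR_AMBIG), ("IR_less_ambig", IR_LESS_AMBIG)):
    for target_name, target in (("IR_Olga", IR_OLGA), ("IR_K31", IR_K31)):
        print(name, "matches", target_name, ":", primer_matches(primer, target))
```

Under these assumptions IR_ambig covers both the OLGA172 and K31 repeats, whereas IR_less_ambig covers only the OLGA172 repeat because its second position is fixed as T.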

At the outset of this work, many approaches were considered to quantify the mobilome of stream environments at different levels of contamination, including microarrays and a comprehensive set of qPCR assays. My conclusion after investigating the degree of conservation of even related elements was that approaches based on sequence similarity or conserved primers were unfeasible for many MGE families. Fortunately, the decreasing price of sequencing makes metagenomics of environmental samples a viable option for the near future. This approach was not used in this study for two reasons. The first is computational cost. Although read lengths of Illumina next-generation sequencing have increased from only 35 bp at the beginning of this project to as much as 250 bp currently, full genes would still need to be assembled in order for individual mobile elements to be assigned to families. This is completely feasible for genomic samples, and has been performed on a small number of metagenomic samples, but the cost is significant since it requires enough sequencing depth to assemble an entire community (3-4 Gbp of sequence, depending on quality), as well as sufficient computing power to perform the assembly (Luo et al. 2012).
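To give a sense of scale for the sequencing depth mentioned above, the short sketch below converts a target yield of 3-4 Gbp into an approximate number of reads at the two read lengths quoted (35 bp and 250 bp). It assumes single-end reads and ignores quality trimming and uneven coverage, so the figures are rough lower bounds rather than a sequencing plan; the function name is hypothetical.

```python
def reads_needed(target_bp, read_length_bp):
    """Smallest number of reads whose combined length reaches target_bp."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_bp // read_length_bp)

for gbp in (3, 4):
    for read_length in (35, 250):
        n = reads_needed(gbp * 10**9, read_length)
        print(f"{gbp} Gbp at {read_length} bp reads: about {n / 1e6:.1f} million reads")
```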

The second issue with performing metagenomics currently is the annotation of the assembled mobility genes. Automated annotation pipelines such as MG-RAST annotate according to the function of the closest homologues; however, functional annotations are limited to general terms such as ‘integrase’, ‘mobile element protein’ or ‘transposase’. As described in Chapter 2, these terms are uninformative given the diversity of enzymes they cover. Moreover, the RIT element recombinases are often not annotated as mobile elements at all, but rather as either hypothetical proteins or components of ‘recombination and repair’ functions. Therefore, the annotated TBSSR component of the metagenome can be found in any of three different annotation categories: mobile element proteins, phage integrases, or recombination and repair. Since the majority of the TBSSR families have not been investigated, there is little to be gained currently from metagenomics for these enzymes. This is clearly illustrated by the unique adaptive role played by integrons. The type and extent of bacterial adaptation potential provided by integrons was completely unknown to science until their association with antibiotic resistance justified a more detailed investigation of their activity. However, without information on the mechanisms and distribution of other families of tyrosine recombinases, we have a very limited ability to predict how unique these capabilities may be. In order to better understand the potential roles of the many components of the mobilome, we must begin to fill the knowledge gaps that clearly exist. More precise annotations, including genomic context, will be necessary to fully understand the diversity of MGEs in environmental samples and to begin to examine their abundance and distribution in contaminated vs. reference ecosystems. This knowledge is important for determining the role that these MGEs play in both individual genomes and bacterial communities, and will be essential to developing risk assessment frameworks for monitoring both the current and developing risks of antibiotic resistance genes in the environment. As we move into the inevitable age of environmental metagenomics, annotation of available sequence data will continue to be the largest obstacle to understanding bacterial evolution. The improvements in both sequencing technologies and assembly algorithms have been important first steps; however, funding for experimental characterization of putative functions is still a limiting factor.

8 References

Hallet, B., Vanhooff, V. and F. Cornet. 2004. DNA Site-Specific Resolution Systems. In: Plasmid Biology, pp. 145-180. Ed. B.E. Funnell and G.J. Phillips. ASM Press, Washington, D.C., USA.

Loot, C., Bikard, D., Rachlin, A. and D. Mazel. 2010. Cellular pathways controlling integron cassette site folding. The EMBO Journal 29(15):2623-2634.

Luo, C., Tsementzi, D., Kyrpides, N., Read, T. and K. T. Konstantinidis. 2012. Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. PLoS ONE 7(2):e30087.

Siguier, P., Gourbeyre, E. and M. Chandler. 2014. Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev 38:865-891.

Wright, L.D., Johnson, C.M. and A.D. Grossman. 2015. Identification of a Single Strand Origin of Replication in the Integrative and Conjugative Element ICEBs1 of Bacillus subtilis. PLoS Genet 11(10):e1005556. doi:10.1371/journal.pgen.1005556


9 Appendix 1 Extra Tables

Table S1: Primers used in this study

Primer Name | Sequence | Reference
qPCR-intI1F | ACCAACCGAACAGGCTTATG | Nemergut et al. (2004) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-intI1R | GAGGATGCGAACCACTTCCAT | Nemergut et al. (2004) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-16S-338F | ACTCCTACGGGAGGCAGCAG | Fierer et al. (2005) as quoted in Wright et al. 2008 ISME 2:417-428.
qPCR-16S-518R | ATTACCGCGGCTGCTGG | Fierer et al. (2005) as quoted in Wright et al. 2008 ISME 2:417-428.
sulI-F | CACCGGAAACATCGCTGCA | Luo et al. 2010. Environ. Sci. Technol. 44:7220–7225
sulI-R | AAGTTCCGCCGCAAGGCT | Luo et al. 2010. Environ. Sci. Technol. 44:7220–7225
IncP1 korA-F | TCATCGACAACGACTACAACG | Jechalke et al 2013 Appl. Environ. Microbiol. 79(4):1410-1413.
IncP1 korA-R | TTCTTCTTGCCCTTCGCCAG | Jechalke et al 2013 Appl. Environ. Microbiol. 79(4):1410-1413.
IS1071_qPCR-F | GCACCAAGTCTGGGAATGAT | This study
IS1071_qPCR-R | ACGGGCATAGTGTTTCTTGG | This study
IR_Olga | TTATGCCGATTCCCGGATTATGCCG | This study
IR_K31 | TAATGCCGCGATCCGGATTATGCCG | This study
IR_ambig | TWATGCCGIIIYCCSGATTATGCCG | This study
IR_less_ambig | TTATGCCGIIIYCCSGATTATGCCG | This study
184circle-F | CCTCGCTAACGGATTCACCA | This study
184circle-R | TGGTGAATCCGTTAGCGAGG | This study
pTrc99-up_XbaI | CTTATCTAGAGTGAAATTGTTATCCGCTCACAATTCCAC | This study
pTrc99-dn_HindIII | ATGCAAGCTTGGCTGTTTTGGCGGATGAGAGAAG | This study
pTrc99-RitA-F | cttatctagacaggaaacagatcATGATTACGTGCGGGCCATTC | This study
pTrc99-RitA-R | ctagaagcttcgttgctagccTCATAGCGTGCCTCCCGCA | This study
pTrc99-RitB-F | cttatctagacaggaaacagatcATGAGCCTCACCGACCAGCTC | This study
pTrc99-RitB-R | ctagaagcttcgttgctagccTCATTGCACAGCTTCCCGGC | This study
pTrc99-RitC-F | cttatctagacaggaaacagatcATGAGCGCCGCCGCCTT | This study
pTrc99-RitC-R | tagaagcttacgttgctagcaTTAGAGACCTTCCAAGAACGCGAG | This study
K31RitA-up | CGATGATCGTCCGAGTCTGG | This study
K31RitC-dn | CACCACGGCGTCGATCCAGC | This study
Olga-RITA-up | CGTCCGTAGACGATCAAGG | This study
Olga-RITC-dn | GGACATGAATCATCTGAGACG | This study
Target1-FOR | AATTCCACCGCCCTGCACGAGCTGTCGCACTGGACGGGCTGCA | This study
Target1-REV | GCCCGTCCAGTGCGACAGCTCGTGCAGGGCGGTGG | This study
Target2-FOR | AATTCCACCGCCCTGCACGAGCTGGGCCACTGGACGGGCTGCA | This study
Target2-REV | GCCCGTCCAGTGGCCCAGCTCGTGCAGGGCGGTGG | This study
Target-up | CGACAGCTCGTGCAGGGC | This study
Target1-down | TGCACGAGCTGTCGCACTGGACGGG | This study
Target2-down | GTCCAGTGGCCCAGCTCGTGCAGGG | This study
Olga_RITBup | GCACTGCGACGTACCGAGC | This study
Olga_RITBdn | GCTATCTCAGCAGGAACTGTCC | This study
K31_RITBup | CAGGAACAGCGGCGTGTC | This study
K31_RITBdn | CTCCAACACGTACTGGTATCTGG | This study
pSF100-FOR-PstI | ATAACTGCAGATACCCACGCCGAAACAAG | This study
pSF100-REV-EcoRI | CGTCGAATTCATCGCTAGTTTGTTTTGACTCC | This study
Ampli-tet-5_out | GACGATGAGCGCATTGTTAG | K. Mijnendonckx
Ampli-tet-3_out | TCAGGGACAGCTTCAAGGAT | K. Mijnendonckx

Table S2: Dissolved oxygen values by month. Values for the Dyment’s Creek location are averages of multiple values taken each month during the 2011 field season. Values for all other sites are the corresponding 2011 value for that month from the PWQMN database. All values are in mg/L.

Site | June | July | August | September
Maskinonge River | 6.38 | -- | 5.23 | --
Uxbridge Brook | 8.32 | 8.09 | 8.08 | 8.95
Hamilton Creek | -- | -- | 8.66 | 9.95
North Saugeen | 10.4 | 8.07 | 8.53 | 10.17
Dyment's Creek | 5.9 | 6.52 | 6.69 | 6.8


Table S3: RIT Elements documented to date.

Strain | GenBank Accession | Location | (Phylum - if other than Proteobacteria); Class; Order
Burkholderia phytofirmans OLGA172 RITBphyt01 | | | beta; Burkholderiales
Cupriavidus metallidurans CH34 RITCme1 chromosome 1 | CP000352 | 1393469-1396637 | beta; Burkholderiales
Burkholderia sp. Ch1-1 | NZ_JH603161.1 | scaff_3 4103936-4107104 | beta; Burkholderiales
Burkholderia sp. Ch1-1 | NZ_JH603161.1 | scaff_3 368714-371882 | beta; Burkholderiales
Novosphingobium sp. PP1Y (RIT1) | NC_015580.1 | 1558240-1561090 | alpha; Sphingomonadales

Acidiphilium multivorum AIU301 | NC_015186.1 | 449594-452732 | alpha; Rhodospirillales
Acidiphilium multivorum AIU301 pACMV1 (RIT1) | NC_015178.1 | 223812-226950 | alpha; Rhodospirillales
Acidiphilium multivorum AIU301 pACMV1 (RIT2) | NC_015178.1 | 253938-257076 | alpha; Rhodospirillales
Acidiphilium cryptum JF-5 pACRY01 | NC_009467.1 | 175771-178909 | alpha; Rhodospirillales
Caulobacter sp. K31 chromosome (RIT1) - NC_010338.1 | NC_010338.1 | 2151285-2154423 | alpha; Caulobacterales
Caulobacter sp. K31 chromosome (RIT2) | NC_010338.1 | 2422880-2426018 | alpha; Caulobacterales
Caulobacter sp. K31 pCAUL02 RIT1 - NC_010333.1 | NC_010333.1 | 57564-60702 | alpha; Caulobacterales
Novosphingobium sp. PP1Y (RIT1) | NC_015580.1 | 1558240-1561090 | alpha; Sphingomonadales
Acidiphilium cryptum JF-5 pACRY03 | NC_009469.1 | 38619-41757 | alpha; Rhodospirillales
Sinorhizobium medicae WSM419 pSMED02 | NC_009621.1 | 369069-372212 | alpha; Rhizobiales

Cupriavidus metallidurans CH34 RITCme2 | CP000352 | 1362583-1365838 | beta; Burkholderiales
Brenneria sp. EniD312 | NZ_CM001230.1 | 3719322-3722580 | gamma; Enterobacteriales
Bordetella petrii strain DSM 12804 (RIT1) | NC_010170.1 | 1547845-1551100 | beta; Burkholderiales
Burkholderia sp. YI23 plasmid byi-1p | CP003090.1 | 1702794-1706046 | beta; Burkholderiales

Burkholderia phymatum STM815 chromosome 2 | CP001044.1 | 1976580-1979793 | beta; Burkholderiales
Marinobacter sp. ELB17 | NZ_AAXY01000001.1 | 415435-418639 | gamma; Alteromonadales
Marinobacter sp. ELB17 | NZ_AAXY01000001.1 | 536430-539634 | gamma; Alteromonadales
Marinobacter sp. ELB17 | NZ_AAXY01000007.1 | 167649-170853 | gamma; Alteromonadales
Marinobacter sp. ELB17 | NZ_AAXY01000009.1 | 44734-47938 | gamma; Alteromonadales

Aromatoleum aromaticum EbN1 | NC_006513.1 | 281475-284649 | beta; Rhodocyclales
Aromatoleum aromaticum EbN1 (plasmid 2) | NC_006824.1 | 105863-109037 | beta; Rhodocyclales
Aromatoleum aromaticum EbN1 (plasmid 2) | NC_006824.1 | 129992-133166 | beta; Rhodocyclales
Cupriavidus necator H16 pHG1 (RIT1) | AY305378 | 32312-35249 | beta; Burkholderiales
Sinorhizobium fredii NGR234 pNGR234a (RIT2) | NC_000914.2 | 233129-236294 | alpha; Rhizobiales
Thauera sp. MZ1T | CP001281.2 | 419038-422197 | beta; Rhodocyclales
Candidatus Solibacter usitatus Ellin6076 (RIT1) | NC_008536.1 | 4188436-4191597 | (Acidobacteria); Solibacteres; Solibacterales
Candidatus Solibacter usitatus Ellin6076 (RIT2) | NC_008536.1 | 9597168-9600446 | (Acidobacteria); Solibacteres; Solibacterales
Mesorhizobium loti MAFF303099 (RIT1) | NC_002678.2 | 4814973-4818126 | alpha; Rhizobiales
Mesorhizobium loti MAFF303099 (RIT2) | NC_002678.2 | 4880343-4883496 | alpha; Rhizobiales
Cupriavidus necator H16 pHG1 (RIT2) | AY305378 | 40646-43658 | beta; Burkholderiales

Acidovorax sp. NO-1 | NZ_AGTS01000021.1 | WGS ctg22 | beta; Burkholderiales
Leptothrix cholodnii SP-6 (RIT2) | NC_010524.1 | 837820-841107 | beta; Burkholderiales
Candidatus Accumulibacter phosphatis clade IIA str. UW-1 pAph01 | NC_013193.1 | 125367-128300 | beta; unclassified
Leptothrix cholodnii SP-6 (RIT1) | NC_010524.1 | 828122-831358 | beta; Burkholderiales
Burkholderia sp. Ch1-1 | NZ_JH603159.1 | scaff_1 776303-779530 | beta; Burkholderiales
Mesorhizobium loti MAFF303099 pMLa | NC_002679.1 | 283344-286544 | alpha; Rhizobiales
Thiomonas sp str. 3As | FP475956.1 | 1980031-1983223 | beta; Burkholderiales

Bifidobacterium longum NCC2705 (RIT1) | NC_004307.2 | 1146734-1149950 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum NCC2705 (RIT2) | NC_004307.2 | 1151346-1154562 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum NCC2705 (RIT3) | NC_004307.2 | 1510118-1506902 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum F8 | FP929034.1 | 977730-981079 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT1) | NC_010816.1 | 35114-38464 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT2) | NC_010816.1 | 389423-392773 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT3) | NC_010816.1 | 2152995-2156345 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum DJ010A (RIT4) | NC_010816.1 | 1541436-1544786 | (Actinobacteria); Bifidobacteriales
Bifidobacterium longum infantis 157F (RIT1) | NC_015052.1 | 117374-120729 | (Actinobacteria); Bifidobacteriales
B. longum subsp. longum JCM1217 (RIT1) | NC_015067.1 | 998527-1001743 | (Actinobacteria); Bifidobacteriales
B. longum subsp. longum JCM1217 (RIT2) | NC_015067.1 | 1356344-1352994 | (Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT1) | NC_014169.1 | 516654-519870 | (Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT2) | NC_014169.1 | 894276-897626 | (Actinobacteria); Bifidobacteriales
B. longum subsp. longum JDM301 (RIT3) | NC_014169.1 | 2274116-2270900 | (Actinobacteria); Bifidobacteriales

Burkholderia phymatum STM815 pBphy01 | NC_010625.1 | 1602828-1606047 | beta; Burkholderiales
Burkholderia phymatum STM815 pBphy02 (RIT3) | NC_010627.1 | 393453-396672 | beta; Burkholderiales
Mesorhizobium loti MAFF303099 (RIT3) | NC_002678.2 | 5046735-5049426 | alpha; Rhizobiales
Pseudomonas aeruginosa NCM1179 | DF126593.1 | WGS 3897228-3900501 | gamma; Pseudomonadales
Acidovorax sp. NO-1 | NZ_AGTS01000037.1 | WGS ctg39 | beta; Burkholderiales

Burkholderia phymatum STM815 pBphy02 (RIT1) | NC_010627.1 | 228640-231843 | beta; Burkholderiales
Burkholderia sp. Ch1-1 | NZ_JH603161.1 | scaff_3 250576-253779 | beta; Burkholderiales
Burkholderia sp. Ch1-1 | NZ_JH603161.1 | scaff_3 905113-908316 | beta; Burkholderiales
Singulisphaera acidiphila DSM 18658 | YP_007202199.1 | 2680371-2683568 | Planctomycetes; Planctomycetales

Burkholderia phymatum STM815 pBphy02 (RIT2) | NC_010627.1 | 367807-370963 | beta; Burkholderiales
Mesorhizobium amorphae CCNWGS0123 | NZ_AGSN01000188.1 | WGS ctg00205 | alpha; Rhizobiales
Polaromonas sp. JS666 | NC_007948.1 | 723577-726741 | beta; Burkholderiales
Acidithiobacillus ferrooxidans ATCC 53993 | NC_011206.1 | 309036-312209 | gamma; Acidithiobacillales
Acidithiobacillus ferrooxidans ATCC 53993 | NC_011206.1 | 619670-622843 | gamma; Acidithiobacillales
Burkholderia sp. H160 | NZ_ABYL01000018.1 | WGS ctg00540; 70230-73427 | beta; Burkholderiales
Marinobacter ELB17 | NZ_AAXY01000003.1 | 128404-131292 | gamma; Alteromonadales
Klebsiella pneumoniae 342 | NC_011283.1 | 1834119-1836998 | gamma; Enterobacteriales
Bacteroides fragilis 3.1.12 | NZ_EQ973215.1 | WGS 152668-155859 | (Bacteroidetes); Bacteroidales
Paracoccus sp. TRP | NZ_AEPN01000095.1 | WGS ctg 98 | alpha; Rhodobacterales
Bacteroides sp. 2_2_4 | NZ_EQ973384.1 | superctg1.30 | (Bacteroidetes); Bacteroidales

Opitutus terrae PB90-1 | NC_010571.1 | 1510958-1514235 | (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 | NC_010571.1 | 3830062-3833339 | (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 | NC_010571.1 | 5652242-5565519 | (Verrucomicrobia); Opitutales
Opitutus terrae PB90-1 | NC_010571.1 | 5678108-5681385 | (Verrucomicrobia); Opitutales

Novosphingobium pentaromativorans US6-1 | NZ_AGFM01000100.1 | WGS ctg00100 | alpha; Sphingomonadales
Sphingomonas sp. SKA58 | NZ_AAQG01000023.1 | WGS 84-3662 | alpha; Sphingomonadales
Novosphingobium pentaromativorans US6-1 pLA1 (RIT2) | NZ_AGFM01000122.1 | 77388-80973 | alpha; Sphingomonadales
Novosphingobium nitrogenifigens DSM 19370 | NZ_AEWJ01000060.1 | WGS ctg00067 | alpha; Sphingomonadales
Novosphingobium sp. PP1Y (RIT2) | NC_015580.1 | 2444845-2448439 | alpha; Sphingomonadales
Roseobacter litoralis Och 149 | NC_015730.1 | 1129515-1133109 | alpha; Rhodobacterales
Verminephrobacter aporrectodeae subsp. tuberculatae At4 | NZ_AFAL01000379.1 | WGS ctg385 | beta; Burkholderiales
Candidatus Solibacter usitatus Ellin6076 (RIT3) | NC_008536.1 | 3194175-3198639 | (Acidobacteria); Solibacteres; Solibacterales
Burkholderia phytofirmans PsJN chromosome 1 (RIT2) | NC_010681.1 | 259065-262635 | beta; Burkholderiales

Mesorhizobium amorphae CCNWGS0123 | NZ_AGSN01000034.1 | WGS ctg00035 | alpha; Rhizobiales
Sinorhizobium fredii NGR234 pNGR234a (RIT1) | NC_000914.2 | 230024-233072 | alpha; Rhizobiales

Sphingopyxis alaskensis RB2256 | NC_008048.1 | 446826-450282 | alpha; Sphingomonadales
Sphingomonas sp. KA1 pCAR3 | NC_008308.1 | 193546-197002 | alpha; Sphingomonadales
Novosphingobium pentaromativorans US6-1 pLA1 (RIT1) | NZ_AGFM01000122.1 | 71938-75388 | alpha; Sphingomonadales
Novosphingobium sp. PP1Y (RIT3) | NC_015580.1 | 328363-331816 | alpha; Sphingomonadales
Erythrobacter sp. SD-21 | NZ_ABCG01000002.1 | WGS 66491-69569 | alpha; Sphingomonadales
Dinoroseobacter shibae DFL 12 pDSHI01 | NC_009955.1 | 100303-103738 | alpha; Rhodobacterales
Dinoroseobacter shibae DFL 12 pDSHI03 | NC_009957.1 | 67841-71276 | alpha; Rhodobacterales
Mesorhizobium alhagi CCNWXJ12-2 | NZ_AHAM01000339.1 | WGS ctg361 | alpha; Rhizobiales
Sinorhizobium meliloti 1021 plasmid pSymA (RIT1) | NC_003037.1 | 901120-904243 | alpha; Rhizobiales
Sinorhizobium meliloti 1021 plasmid pSymA (RIT2) | NC_003037.1 | 1225061-1228184 | alpha; Rhizobiales
Sulfitobacter sp. NAS-14.1 | NZ_AALZ01000022.1 | WGS 5396-7976 | alpha; Rhodobacterales
Pelagibacterium halotolerans B2 | NC_016078.1 | 3574154-3577274 | alpha; Rhizobiales
Agrobacterium vitis S4 pTiS4 | NC_011982.1 | 78540-81684 | alpha; Rhizobiales
Roseovarius sp. 217 | NZ_AAMV01000021.1 | 32068-35074 | alpha; Rhodobacterales
Novosphingobium sp. PP1Y pMpl | NC_015583.1 | 864650-867611 | alpha; Sphingomonadales
Methylocystis sp. ATCC 49242 | | ctg206: 54584-57731 | alpha; Rhizobiales

Rhodococcus opacus B4 pROBO2 | NC_012521.1 | 80391-83520 | (Actinobacteria); Actinomycetales
Mycobacterium vanbaalenii PYR-1 | NC_008726.1 | 6334046-6337148 | (Actinobacteria); Actinomycetales
Burkholderia sp. Ch1-1 | NZ_JH603161.1 | scaf_3 240108-243225 | beta; Burkholderiales
Frankia sp. EAN1pec | NC_009921.1 | 495942-498771 | (Actinobacteria); Actinomycetales

Marinobacter aquaeolei VT8 pMAQU02 | NC_008739.1 | 113911-117085 | gamma; Alteromonadales

Rhizobium leguminosarum bv. viciae plasmid pRL11 | NC_008384.1 | 376296-379437 | alpha; Rhizobiales
Rhizobium leguminosarum bv. viciae plasmid pRL10 | NC_008381.1 | 53267-56093 | alpha; Rhizobiales
Mesorhizobium loti R7A symbiosis island | AL672113.1 | 127743-130902 | alpha; Rhizobiales

Syntrophobotulus glycolicus DSM 8271 | NC_015172.1 | 48211-51807 | (Firmicutes); Clostridiales

Desulfotomaculum gibsoniae DSM 7213 | NZ_AGJQ01000018.1 | WGS 99329-102528 | (Firmicutes); Clostridiales
Dehalobacter sp. DCA | NC_018866.1 | 1225943-1229157 | (Firmicutes); Clostridiales
Desulfobacterium autotrophicum HRM2 | NC_012108.1 | 3001915-3005127 | delta; Desulfobacteriales
Lentibacillus sp. Grb1 | NZ_AGAV01000005.1 | contig005 | (Firmicutes); Bacillus
Clostridium saccharolyticum WM1 | NC_014376.1 | 3030356-3033574 | (Firmicutes); Clostridiales
Bacillus sp. 10403023 (RIT2) | NZ_HE610986.1 | 726773-729999 | (Firmicutes); Bacillus
Sulfobacillus acidophilus TPY (RIT2) | NC_015757.1 | 2174723-2177995 | (Firmicutes); Clostridiales
Sulfobacillus acidophilus TPY (RIT1) | NC_015757.1 | 489401-492631 | (Firmicutes); Clostridiales
Legionella drancourtii LLAP12 | JH413829.1 | scaffold37 | gamma; Legionellales
Heliobacterium modesticaldum Ice1 | NC_010337.2 | 188072-191303 | (Firmicutes); Clostridiales
Desulfosporosinus sp. OT | NZ_AGAF01000082.1 | WGS TOU assembly 178 | (Firmicutes); Clostridiales
Acetivibrio cellulolyticus CD2 | NZ_AEDB02000020.1 | WGS scf3_ctg20 16804-20008 | (Firmicutes); Clostridiales
mossii DSM 22836 | NZ_ADLW01000025.1 | WGS ctg1.25 | (Bacteroidetes); Bacteroidales
Gramella forsetii KT0803 | NC_008571.1 | 1284588-1287826 | (Bacteroidetes); Flavobacteriales
Gramella forsetii KT0803 | NC_008571.1 | 1401240-1404478 | (Bacteroidetes); Flavobacteriales
Cupriavidus necator H16 pHG1 (RIT3) | NC_005241.1 | 51476-54721 | beta; Burkholderiales
Echinicola vietnamensis DSM 1752 | NC_019904.1 | 24444-27688 | (Bacteroidetes); Cytophagia
Bordetella petrii strain DSM 12804 (RIT2) | NC_010170.1 | 1107785-1111030 | beta; Burkholderiales
Bordetella petrii strain DSM 12804 (RIT3) | NC_010170.1 | 1365410-1368586 | beta; Burkholderiales
Johnsonella ignava ATCC 51276 | NZ_ACZL01000056.1 | WGS ctg1.56 | (Firmicutes); Clostridiales
Roseburia inulinivorans DSM 16841 | NZ_ACFY01000116.1 | WGS ctg476.1 | (Firmicutes); Clostridiales
Bacteroides finegoldii DSM 17565 | NZ_ABXI02000050.1 | WGS ctg8.4 46223-49494 | (Bacteroidetes); Bacteroidales
Bacillus sp. 10403023 (RIT1) | NZ_HE610986.1 | 730450-733658 | (Firmicutes); Bacillus
Prevotella buccae D17 | NZ_GG739978.1 | WGS superctg1.53 | (Bacteroidetes); Bacteroidales

Nocardioides sp. JS614 | NC_008699.1 | 634392-637611 | (Actinobacteria); Actinomycetales
Intrasporangium calvum DSM43043 | NC_014830.1 | 948667-952111 | (Actinobacteria); Actinomycetales
Mesorhizobium alhagi CCNWXJ12-2 | NZ_AHAM01000340.1 | WGS ctg362 | alpha; Rhizobiales

Aromatoleum aromaticum EbN1 | NC_006513.1 | 284698-287802 | beta; Rhodocyclales
Sphaerochaeta pleomorpha str. Grapes | NC_016633.1 | 1,216,600-1,219,767 | Sphaerochaeta
Corynebacterium halotolerans YIM 70093 = DSM 44683 | CP003697.1 | 126,593-129,791 | Actinobacteria; Corynebacteriales
Mycobacterium kansasii ATCC 12478 | CP006835.1 | 3,763,863-3,767,049 | Actinobacteria; Corynebacteriales
Clostridium difficile QCD-63q42 | | |
Microlunatus phosphovorus NM-1 | AP012204.1 | 5,536,190-5,539,379 | Actinobacteria; Propionibacteriales
Prevotella oralis ATCC 33269 | | |
Thermoanaerobacterium thermosaccharolyticum M0795 | NC_019970.1 | 425,454-428,667 | Firmicutes; Clostridia
Sphingomonas echinoides ATCC 1482 | PRJNA76627 | | Alpha-Proteobacteria; Sphingomonadales
Sphingobium yanoikuyae XLDN2-5 | PRJNA71691 | | Alpha-Proteobacteria; Sphingomonadales
Pseudomonas extremaustralis 14-3 substr. 14-3b strain 14-3 | PRJNA77729 | | Gamma-Proteobacteria; Pseudomonadales
Gordonia rhizosphera NBRC 16068 | PRJDB4 | | Actinobacteria; Corynebacteriales
Thioalkalivibrio nitratireducens DSM 14787 | PRJNA178382 | | Gamma-Proteobacteria; Chromatiales
Bacillus sp. 10403023 RIT2 | PRJEA70827 | | Firmicutes; Bacilli
Singulisphaera acidiphila DSM 18658 | PRJNA82973 | | Planctomycetes; Planctomycetales
felis ATCC 53690 | PRJNA52159 | | Alpha-Proteobacteria; Rhizobiales
Burkholderia terrae BS001 | PRJNA157903 | | Beta-Proteobacteria; Burkholderiales
Acidocella sp. MX-AZ02 | PRJNA171232 | | Alpha-Proteobacteria; Rhodospirillales
Celeribacter baekdonensis B30 | PRJNA170411 | | Alpha-Proteobacteria; Rhodobacterales
Candidatus Microthrix parvicella RN1 - WGS contig 2605_44 | CANL01000039.1 | | Actinobacteria; Candidatus Microthrix
Roseivivax atlanticus strain 22II-s10s contig25 | AQQW01000025.1 | 10,337-13,415 | Alpha-Proteobacteria; Rhodobacterales
Phaeobacter gallaeciensis DSM26640 pGal_B134 | NC_023148.1 | |
Phaeobacter gallaeciensis DSM26640 | NC_023137.1 | | Alpha-Proteobacteria; Rhodobacterales
Bradyrhizobium sp. STM 3809 | PRJEA72433 | | Alpha-Proteobacteria; Rhizobiales
Draconibacterium orientale strain FH5T | CP007451.1 | 3,816,933-3,819,813 | Bacteroidetes; Bacteroidales
Photorhabdus temperata subsp. temperata Meg1 | PRJNA217865 | | Gamma-Proteobacteria; Enterobacteriales
Burkholderia glathei | PRJEB6934 | | Beta-Proteobacteria; Burkholderiales
acid mine drainage metagenome | | |
Mycobacterium austroafricanum strain DSM 44191 | PRJEB5747 | | Actinobacteria; Corynebacteriales
Ferrovum myxofaciens strain P3G Contig179 | PRJNA255880 | | Beta-Proteobacteria; Ferrovales
Acidovorax sp. KKS102 RIT1 | CP003872.1 | 1283645-1286932 | Beta-Proteobacteria; Burkholderiales
Acidovorax sp. KKS102 RIT2 | CP003872.1 | 1297698-1300985 |
Acidovorax sp. KKS102 RIT3 | CP003872.1 | 1302872-1306159 |
Acidovorax sp. KKS102 RIT4 | CP003872.1 | 2254833-2258120 |
 | | 837820-841107 |
Arthrobacter sp. Soil736 | LMSB01000017.1 | 64,829-67,968 | Actinobacteria; Micrococcales
Thalassobacter stenotrophicus | CYRX01000028.1 | WGS | Alpha-Proteobacteria; Rhodobacterales
Thioclava dalianensis strain DLFJ1-1 | JHEH01000032.1 | WGS | Alpha-Proteobacteria; Rhodobacterales
Paenirhodobacter enshiensis strain DW2-9 contig24_scaffold11 | JFZB01000023.1 | 52,270-55,423 | Alpha-Proteobacteria; Rhodobacterales
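Since most Location entries in Table S3 are coordinate spans, the length of each element can be derived from them, which is the quantity a size-selection step would act on when looking for RIT elements carrying additional genes. The sketch below is a minimal illustration using four spans copied from the table; the function name and the 4000 bp cutoff are hypothetical choices made for this example, not values established in this work.

```python
def element_length(location):
    """Length in bp of a 'start-end' span.

    Thousands separators are ignored, and spans written end-first
    (elements on the reverse strand) are handled by taking the
    absolute difference.
    """
    start, end = (int(part.replace(",", "")) for part in location.split("-"))
    return abs(end - start) + 1

# Spans copied from Table S3
examples = {
    "Cupriavidus metallidurans CH34 RITCme1": "1393469-1396637",
    "Bifidobacterium longum NCC2705 (RIT3)": "1510118-1506902",
    "Candidatus Solibacter usitatus Ellin6076 (RIT3)": "3194175-3198639",
    "Sphaerochaeta pleomorpha str. Grapes": "1,216,600-1,219,767",
}

SIZE_CUTOFF_BP = 4000  # hypothetical cutoff, for illustration only

for strain, span in examples.items():
    length = element_length(span)
    flag = " (longer than cutoff)" if length > SIZE_CUTOFF_BP else ""
    print(f"{strain}: {length} bp{flag}")
```

Entries that list only a contig name or a BioProject identifier carry no coordinates and would need to be skipped before applying such a calculation.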


Appendix 2 Sampler Construction and Site Information

River Samplers

1 ½” x 1 ft length clear polycarbonate tube - cut with bandsaw, deburred and belt sanded

End caps 1 ½” copper to DWM adapters for fittings – machined inside to fit and adhered with 100% polyurethane construction adhesive; window screening and nylon adhered internally to fitting ends with hot glue

Filled with fine grain sand and attached with 90 lb threaded marine approved rope to two 4L pop bottles for floatation (2” distance between sampler and pop bottle lid). Tied with the same rope to cement block with minimum of 2 ft length of rope.

2011 Site assessment

June 30, 2011: Hamilton Creek and Rocky Saugeen samplers installed

Hamilton Creek – West Back Line, just north of Chatsworth Rd. 24 (east of Williamsford)
GPS: 44°24’18”N 80°47’47”W
Upstream of bridge (east of road)
Main channel width – 11.9 m (39 ft)
Water depth and hydraulic head: 1/3 – 35 cm (no head); midchannel – 48 cm (2 cm head); 2/3 – 42 cm (1 cm head)
Velocity – 10 m travelled in 15 seconds (0.667 m/s)
Temperature – 16 °C
Transparency – clear to bottom
Sediment – silt and boulders ranging from 7 - 30 cm diameter; boulders look like pieces of cement and are covered in gravel type substance (and red on the bottom)
Surrounding vegetation – grasses from edge, lots of variety
Other notes: very wide with many dead trees – swamp? Perhaps this is low water for the region; lots of fish with stripe down the side (up to 7 cm in length)

Rocky Saugeen – 8th concession west of Traverston Rd (south of Grey Rd 12)
GPS: 44°27’35”N 81°2’42”W
Downstream of bridge (north of road)
Main channel width – 6.8 m (22 ft), narrows upstream to approx. 5.5 m
Water depth and hydraulic head measurements (taken upstream of bridge!): 1/3 – 19 cm (no head); midchannel – 24 cm (5 cm head); 2/3 – 30 cm (4 cm head)
Velocity – 10 m travelled in 16.7 seconds (0.599 m/s)
Temperature – 16 °C
Transparency – clear to bottom
Sediment – rocks ranging from 10 - 30 cm diameter
Surrounding vegetation – overhanging willow (upstream) and cedar trees; lots of riparian vegetation
Other notes: no fish observed; in a cedar forest with lots of trees; old sign hanging by river ‘Markdale fire truck water load area’; difficulty placing sampler – rocky bottom therefore hard to find solid place without tilting, and water turbulent and pulling sampler down (about 20-30 cm below top of the water and only about 10 cm above rocks)

July 1, 2011

North Saugeen – 8th sideroad between conc. 4 & 6 (East of Moorsburg)
GPS: 44°20’16”N 80°56’0”W
Runs parallel to road
Main channel width – approx. 15 m (50 ft)
Water depth and hydraulic head: 1/3 – 40 cm (2 cm head); midchannel – 38 cm (6 cm head); 2/3 – 30 cm (2 cm head)
Velocity – 11 m travelled in 25 seconds (0.44 m/s)
Temperature – 20 °C
Transparency – clear to bottom
Sediment – rocks (up to approx. 20 cm diameter) and pebble; some boulders look like pieces of cement and are covered in gravel type substance (as seen at Hamilton Creek); moss on rocks
Surrounding vegetation – all cedar trees; many dead cedars as far as visible – widening of river?
Other notes: placed sampler in a pocket 68 cm deep and behind a tree to minimize visibility to kayakers

July 8, 2011

East Holland River at Green Lane
Width (estimated from bridge): 14.9 m at widest; 9 m under bridge
Depth: 50 cm (1/3 channel); 58 cm mid-channel
Velocity: 1 min 14 sec for 10 m
Temperature: 24 °C
Turbidity: about 10 cm; very muddy therefore quite turbid
Sediment: mud and rock (sand and boulders)
Vegetation: lots of grasses; healthy riparian region therefore turbidity not likely from erosion in this section of the river
Other observations: crayfish observed and turtle; much deeper than previous visit; green tinge to the water
Placed sampler under bridge but midstream to discourage interference – gone after one month. Could have been taken or could have been washed away in a rain event.


July 8, 2011

Uxbridge Brook at Davis Drive

Depth: 40 cm (1/3 near bank), 53 cm mid-stream, 26 cm (1/3 opposite bank)
Width: 6.5 m, 6.1 m
Velocity: 12.9 seconds for 10 m
Temperature: 21 °C
Turbidity: clear to bottom
Vegetation: lots of riparian vegetation; grasses and ferns and trees
Sediment: pebbles at edges, medium boulders covered with green plants in center
Other observations: lots of dragonflies and butterflies
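The velocities quoted in the site notes above follow from dividing the measured distance by the travel time, and the average depths from the mean of the cross-channel readings. The following minimal sketch reproduces that arithmetic from the recorded values; the function and variable names are illustrative only, and the East Holland time of 1 min 14 sec has been converted to 74 seconds.

```python
from statistics import mean

def surface_velocity(distance_m, time_s):
    """Mean velocity (m/s) over a measured reach, from distance and travel time."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return distance_m / time_s

# (distance in m, travel time in s) as recorded in the site notes
float_runs = {
    "Hamilton Creek": (10, 15.0),
    "Rocky Saugeen": (10, 16.7),
    "North Saugeen": (11, 25.0),
    "East Holland River": (10, 74.0),
    "Uxbridge Brook": (10, 12.9),
}

velocities = {site: surface_velocity(d, t) for site, (d, t) in float_runs.items()}

# Depth readings (cm) at 1/3, midchannel and 2/3 for Hamilton Creek
hamilton_depths_cm = [35, 48, 42]
hamilton_mean_depth = mean(hamilton_depths_cm)

for site, v in velocities.items():
    print(f"{site}: {v:.3f} m/s")
print(f"Hamilton Creek mean depth: {hamilton_mean_depth:.1f} cm")
```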

Location | Hamilton Creek | Rocky Saugeen/Gypsy Creek | North Saugeen | East Holland | Uxbridge Brook
Date installed | 30-Jun | 30-Jun | 01-Jul | 08-Jul | 08-Jul
Date collected | 08-Nov | | | |
Location | house #443843, West Back Line | 8th concession, west of Traverston | east of Moorseburg; 8th sdrd b/t conc. 4 & 6 | Green Lane/Rogers Reservoir | at Church Road; Street at Davis Drive
Placement in stream | upstream of bridge | downstream from road crossing | closest to far bank | under bridge, midstream | downstream of road crossing
GPS | 44°24'18"N 80°47'47"W | 44°27'35"N 81°2'42"W | 44°20'16"N 80°56'0"W | |
Bank width | 11.9 m | 6.8 m (narrows to 5.5 m) | ~50 ft but narrows up and downstream | 9 m | 6.5 m
Water depth (cm) | 35 | 19 | 40 | 58 | 40
 | 48 | 24 | 38 | 50 | 53
 | 42 | 30 | 30 | --- | 26
avg water depth (cm) | 41.7 | 24.3 | 36.0 | 54.0 | 39.7
Hydraulic head (cm) | 0.0 | 0 | 2 | |
 | 2.0 | 5 | 6 | |
 | 1.0 | 4 | 2 | |
avg hydraulic head (cm) | 1.0 | 3.0 | 3.3 | |
Velocity | 0.67 m/s | 0.60 m/s | 0.44 m/s | 0.13 m/s | 0.77 m/s
Temperature (Celsius) | 16 | 16 | | 24 | 21
Transparency | clear to bottom | clear to bottom | clear to bottom | about 10 cm; very muddy | clear to bottom except in deep pools
Sediment | silt and boulders | rocky (10-30 cm diameter) | rocks with moss and pebble | mud and rock | sand and pebble at edge, bigger rocks in middle with attached algae and green plants; deep sand in pool
Surrounding vegetation | varied; grasses from edge | overhanging willow and cedars; lots of riparian veg. | widening section therefore dead and dying cedars along length | lots of grasses; healthy riparian region therefore turbidity not likely due to erosion in this section | a lot of grass; healthy riparian vegetation - grasses & ferns etc.
Comments | boulders have gravel type substance coating; lots of 5-7 cm fish (stripe on side); very wide - not a straight channel (swamp at low tide?); many dead trees | scared a deer away; no fish observed; heavily wooded area; Markdale firetruck water load area (old sign); off of Traverston Rd | placed sampler in 68 cm deep pocket | cray fish observed and turtle; much deeper than previous visit; green tinge to water | a lot of dragonflies and butterflies

Water samples collected | 18-Jul | 18-Jul | 18-Jul | 04-Aug | 04-Aug
Water depth (cm) | 28 | 19 | 32 | 57 | 42
 | 43.5 | 34 | 28 | 59 | 59
 | 53 | 28 | 52 | 57 |
avg water depth (cm) | 41.5 | 27 | 37.3 | 58 |
Hydraulic head (cm) | 0 | 1 | 3 | 0 | 3
 | 0 | 0 | 5 | 0 | 3
 | 0 | 0 | 5 | 2 |
avg hydraulic head (cm) | 0.0 | 0.3 | 4.3 | 0 |
Wetted bank width | 14.3 m | 4.6 m | --- | 8.5 m | ---
Hydrolab - Temp | 26.16 | 23.38 | 25.62 | 20 | 17.62
SpC (mS/cm) | 0.422 | 0.45 | 0.436 | 0.68 | 0.521
Dissolved Oxygen | 6.1 mg/L | 6.55 mg/L | 9.11 mg/L | 7.28 mg/L | 7.89 mg/L
pH | 7.84 | 7.76 | 7.36 | 7.53 | 7.64
Total Dissolved Solids | 0.3 g/L | 0.3 g/L | 0.3 g/L | 0.4 g/L | 0.3 g/L
DO% | 80 | 83 | 120.2 | 87.7 | 90.9
BOD when sampled | 6.4 mg/L | 6.5 mg/L | 8.60 mg/L | 7.68 mg/L | 7.36 mg/L
BOD after 5 days at 20C | 6.64 mg/L | 6.59 mg/L | 6.84 mg/L | --- | ---

iPhone GPS: 44°15'56"N 80°44'23"W; 44°22'7"N 80°53'37"W
NOTE: rain event sampling; lots of little fish, crayfish; dark, small crayfish