AN ABSTRACT OF THE DISSERTATION OF

Edward W. Davis II for the degree of Doctor of Philosophy in Molecular and Cellular Biology presented on June 7, 2017.

Title: Phylogeny and Evolution of Gall-Associated Plant Pathogenic .

Abstract approved:

______Jeff H. Chang

Gall-associated phytopathogens have unique evolutionary histories that have shaped both their modes of infection and genomic structures. Pathogenicity of the gall- associated plant pathogens of the , , and genera is mediated by horizontally acquired virulence loci. The relative ease of gain and loss of the virulence loci has confounded accurate characterization of these bacteria, especially those characterizations made prior to the use of molecular markers for genotyping. Work presented in this thesis uses whole genome guided approaches to discern the intra- and inter- genetic diversity in the three genera described.

Rhodococcus fascians is the causal agent of leafy gall disease. We tested the hypothesis that R. fascians is comprised of a single species level group and show that two distinct clades with up to six species level groups are able to cause symptoms consistent with leafy gall disease. These data reveal previously unknown chromosomal diversity within this group of phytopathogenic bacteria. Four lines of evidence that

make use of whole genome sequences indicate that the species are acted on by distinct selective pressures and have unique evolutionary histories. Further, a set of horizontally acquired virulence loci is correlated with the pathogenic phenotype, regardless of chromosomal lineage. These data suggest that delimitation of phytopathogenic

Rhodococcus isolates requires careful consideration with respect to both the chromosomal genotype as well as the presence of virulence loci.

Rathayibacter spp. are vector mediated plant pathogens that infect galls in the seed heads of several grass species. Rathayibacter toxicus, a Select Agent in the U.S., has been shown to produce corynetoxins, nucleoside antibiotics that also cause neurotoxic effects in mammals, when infected by a bacteriophage. As R. toxicus encodes a clustered regularly interspersed short palindromic repeats (CRISPR) system that acts as an adaptive immune system against foreign DNA in bacteria, we studied the CRISPR array evolution for insights into the ecology of bacteria-phage interactions.

We predicted that over half of the spacers encoded in R. toxicus are functional in targeting the phage, while the bacteria still show susceptibility to the phage. These data suggest a complex homeostatic interaction whereby neither bacteria nor phage populations are eliminated. Further, our results challenge past conclusions on the efficacy of CRISPR-mediated immunity.

Lastly, we generated a set of tools to facilitate comparative genomics research and accurate species and strain identification. Data are integrated into a website

(http://gall-id.cgrb.oregonstate.edu/) to facilitate their use by all researchers, regardless of computational proclivity.

©Copyright by Edward W. Davis II June 7, 2017 All Rights Reserved

Phylogeny and Evolution of Gall-Associated Plant .

by

Edward W. Davis II

A DISSERTATION

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Presented June 7, 2017 Commencement June 2017

Doctor of Philosophy dissertation of Edward W. Davis II presented on June 7, 2017

APPROVED:

______Major Professor, representing Molecular and Cellular Biology

______Director of the Molecular and Cellular Biology Program

______Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

______Edward W. Davis II, Author

ACKNOWLEDGEMENTS

I owe so much to so many people; I was certainly very fortunate to receive all of the support that I did throughout my time as a graduate student. I would like to thank my advisor, Jeff Chang, for his guidance, support, and the opportunity to make mistakes. I have gained more insight from Jeff over the past five years than I could have ever imagined. I would also like to thank my committee members Dee Denver,

Taifo Mahmud, Tom Sharpton, and Gary DeLander. Thank you to the BPP department, the MCB program and the CGRB. Thank you to Joyce Loper and Tom Wolpert for giving me excellent advice throughout my graduate student career. I need to extend a special thank you to Melodie Putnam and members of the OSU Plant Clinic; I would not have worked on Rhodococcus, Rathayibacter, or Agrobacterium without you.

Thank you to Marilyn Miller and Walt Ream; my time working with you on

Agrobacterium was an pleasure. Thanks to my lab mates from the Chang lab, past and present: Allie Creason, Elizabeth Savory, Alex Buchanan, Jason Cumbie, Meesha

Pena, Alex Weisberg, Chih-Feng Wu, Skylar Fuller, Michael Belcher, Dani Stevens, and others. It was certainly a team effort, and I’m glad to have had your help and friendship. Last, and most importantly, I would like to thank my family. My children,

Lexie and Landon, are what keep me going every day. And without my wife, Marie, I would not be here. She is my rock and this is all possible because of her support.

CONTRIBUTION OF AUTHORS

Chapter 1. EWD wrote the chapter.

Chapter 2. EWD and ALC contributed equally. EWD, ALC, MLP, OMV and JHC contributed to the conception and design of the work, analysis of data, and wrote and approved the manuscript. ALC, EWD, MLP, and OMV also contributed to the acquisition of isolates and/or genome sequences.

Chapter 3. EWD and AJW contributed equally. EWD, AJW, JFT, NJG and JHC contributed to the conception and design of the work, analysis of the data, and wrote and approved the manuscript. EWD wrote software to collect sequences, do analyses, and generate the sequence database. AJW and JFT designed the website interface.

EWD and AJW contributed to acquisition of isolates and/or genome sequences, and wrote software for WGS pipeline. AJW added virulence gene search, and generated databases for the website.

Chapter 4. EWD, MLP, LDL, MSB, AJS, BKS, TDM, DGL, WLS, EER, and JHC contributed to the conception and design of the work and wrote and approved the manuscript. EWD, LDL, MSB, and JHC contributed to analysis of data. EWD, MLP,

MSB, AJS, BKS, TDM, DGL, WLS, EER also contributed to the acquisition of isolates and/or genome sequences.

Chapter 5. EWD wrote the chapter

TABLE OF CONTENTS Page

Chapter 1 – Introduction: Reconciliation of Species Concepts, Horizontal Gene Transfer, and Genome Evolution ...... 1 SPECIES CONCEPTS AND APPLICATION TO BACTERIA ...... 2 BACTERIAL SPECIES DELIMITATION ...... 5 OPERATIONAL METHODS FOR CLASSIFYING BACTERIA ...... 10 CORE, FLEXIBLE, AND PAN-GENOME ...... 11 HORIZONTAL GENE TRANSFER AND SYMBIOSIS ...... 13 TAXONOMIC CLASSIFICATION AND NOMENCLATURE...... 16 BIBLIOGRAPHY ...... 17

Chapter 2 – Use of whole genome sequences to develop a molecular phylogenetic framework for Rhodococcus fascians and the Rhodococcus ...... 22 ABSTRACT ...... 23 INTRODUCTION ...... 23 MATERIAL AND METHODS ...... 27 Isolation of phytopathogenic Rhodococcus...... 27 Nucleic acid preparations ...... 27 Next-generation sequencing, assembly, and annotation ...... 27 Phylogenetic analyses ...... 28 Bioinformatic analyses ...... 30 RESULTS ...... 31 Whole genome-based phylogeny supports multiple lineages of plant pathogenic Rhodococcus...... 31 Alternative whole-genome based analyses support the multiple lineages of Rhodococcus...... 33 A molecular phylogeny based on whole genome sequences provides a framework for resolving the Rhodococcus genus ...... 35 ANI provides a framework for resolving the Rhodococcus genus ...... 37 DISCUSSION ...... 40 ACKNOWLEDGEMENTS...... 44 BIBLIOGRAPHY ...... 46

Chapter 3 – Gall-ID: tools for genotyping gall-causing phytopathogenic bacteria .... 61 ABSTRACT ...... 62 INTRODUCTION ...... 63 Diagnostics ...... 63 Agrobacterium spp...... 65 Pseudomonas savastanoi...... 66 Pantoea agglomerans ...... 67 Rhodococcus spp...... 68

TABLE OF CONTENTS (Continued) Page

MATERIAL AND METHODS ...... 69 Website framework and bioinformatics tools ...... 69 Datasets ...... 73 Bacterial strains, growth conditions, nucleic acid extraction, and genome sequencing ...... 74 RESULTS AND DISCUSSION ...... 75 Gall Isolate Typing ...... 75 Phytopath-Type ...... 76 Vir-Search ...... 76 Databases of Gall-ID ...... 77 Whole Genome Analysis ...... 79 Validation of tools available in Gall-ID ...... 82 Conclusions ...... 85 ACKNOWLEDGEMENTS...... 85 BIBLIOGRAPHY ...... 86

Chapter 4 – Phylogenomic analyses of the evolution of Rathayibacter toxicus and its ecological interactions with bacteriophage CS14 ...... 109 ABSTRACT ...... 110 SIGNIFICANCE STATEMENT ...... 111 INTRODUCTION ...... 112 RESULTS ...... 116 The Rathayibacter genus consists of at least ten species ...... 116 Rathayibacter toxicus forms three clades ...... 116 R. toxicus evolved via genome reduction ...... 117 The type VIIb secretion system-encoding locus was lost from R. toxicus ...... 121 R. toxicus acquired a type I-E CRISPR locus ...... 123 Amelioration of the type I-E CRISPR locus ...... 124 The acquired CRISPR system primed adaptation of spacers to CS14 ...... 125 DISCUSSION ...... 126 MATERIALS AND METHODS ...... 131 DNA sequencing ...... 131 Genome sequence assembly ...... 131 Reference genome sequences ...... 131 Construction of phylogenetic trees ...... 132 SNP identification and phylogenetic inference ...... 133 Calculating ANI ...... 133 Identifying orthologous clusters ...... 133 Identifying core functions lost by R. toxicus ...... 134 Identifying repeat regions...... 134

TABLE OF CONTENTS (Continued) Page

Identification of Mobile Genetic Elements ...... 134 Predicting HGT ...... 134 Codon usage ...... 135 Manual curation of key loci ...... 135 ACKNOWLEDGEMENTS...... 135 BIBLIOGRAPHY ...... 137

Chapter 5 – Conclusions and Future Directions ...... 168 CONCLUSIONS ...... 169

LIST OF FIGURES

Figure Page

Figure 2.1. Neighbor-joining tree based on 1727 homologous genes...... 53

Figure 2.2. Cloud plot of ANIb vs TETRA for all possible pairwise comparisons. ... 54

Figure 2.3. Heatmaps of codon usage preference and homology...... 55

Figure 2.4. Multi-locus sequence analysis maximum likelihood tree of the Actinobacterium phylum...... 56

Figure 2.5. Average nucleotide identity dendogram for 59 isolates of Rhodococcus.. 58

Supplemental Figure 2.1. Genome alignments of four members of phytopathogenic Rhodococcus...... 59

Figure 3.1. Overview of Gall-ID diagnostic tools...... 96

Figure 3.2. Flowchart for the WGS Pipeline...... 97

Figure 3.3. Validation of the Agro-type and Vir-Search tools...... 98

Figure 3.4. Maximum likelihood tree based on vertically inherited polymorphic sites core to 20 Rhodococcus isolates...... 100

Supplemental Figure 3.1. Validation of the Phytopath-type tool...... 105

Supplemental Figure 3.2. The SPAdes assembler produces high quality assemblies. 106

Supplemental Figure 3.3. Whole genome alignment of 13-626 assemblies to a close reference sequence indicates collinearity of genomes ...... 107

Figure 4.1. Rathayibacter spp. form a distinct phylogenetic unit comprised of 10 species-level groups...... 146

Figure 4.2. R. toxicus experienced massive changes to its genome...... 147

Figure 4.3. Rathayibacter encode homologs of the T7SS-associated EssC protein.. 148

Figure 4.4. R. toxicus strains encode a type I-E CRISPR-Cas system...... 149

Figure 4.5. R. toxicus has continually acquired spacers that target CS14...... 150

LIST OF FIGURES (Continued)

Figure Page

Supplemental Figure 4.1. Average nucleotide identity analysis supports 10 species- level groups within Rathayibacter...... 151

Supplemental Figure 4.2. Enrichment analyses for Rathayibacter compared to related genera...... 152

Supplemental Figure 4.3. Depiction of algorithm designed to predict functional outcomes of spacer-protospacer interactions...... 153

Supplemental Figure 4.4. Rathayibacter toxicus and CS14 have similar dinucleotide frequency patterns. …………………………...... 155

LIST OF TABLES

Table Page

Table 2.1. Isolates of Rhodococcus selected for whole genome sequencing...... 52

Table 3.1. Manually curated datasets developed for Gall-ID...... 101

Table 3.2. Statistics for the WGS Pipeline...... 103

Table 3.3. Strain identity of 14 isolates associated with crown gall...... 104

Supplemental Table 3.1. Comparison of Velvet to SPAdes assemblies of 14 isolates associated with crown gall...... 108

Supplemental Table 4.1. Rathayibacter isolates sequenced in this study...... 157

Supplemental Table 4.2. Core functions of Rathayibacter absent from R. toxicus. .. 159

Supplemental Table 4.3. Accessions of genome sequences included in this study. .. 165

Supplemental Table 4.4. Accessions/locus tags of EssC protein sequences used in phylogenetic tree...... 166 1

Chapter 1 – Introduction: Reconciliation of Species Concepts, Horizontal Gene Transfer, and Genome Evolution

Edward W. Davis II

1

2

SPECIES CONCEPTS AND APPLICATION TO BACTERIA

Species are one of the most basic units of biology and, along with genes, proteins, and organisms, are the basis by which we understand evolution. Debate over the concept of what defines species dates back to the introduction of biological and the binomial nomenclature by Carl Linnaeus. Linnaeus devised a system by which organisms could be classified into discrete groups (genera). One important criterion for classification was that each member of a group must share identifiable phenotypes that are distinct from members of other groups, e.g., differentia that define the species as a distinct group within a genus (Cain, 1958). Linnaeus’ species definition also included a requirement of true breeding, whereby the noted natural, ‘essential’ phenotypes did not change through time (Cain, 1958). Linnaeus was focused on defining discrete ‘natural’ characteristics of organisms and not using artificially derived categories, which is of special consideration even to present day (Brenner et al., 2005;

Cain, 1958).

Understandably, Linnaeus underestimated the natural variation within species, and therefore attempted to define species in a way that is inconsistent with the amount of possible intraspecies variation. In present day, we can reasonably demonstrate exceptions to Linnaeus’ criteria for delineations of species, especially in light of

Darwin’s theory of evolution. Selective pressure and genetic drift shape the evolution of all living organisms. While Linnaeus’ definition of species placed too strong an emphasis on phenotypic traits and lacked the flexibility to account for certain types of intraspecies diversity, the idea that we can use a set of criteria to define organisms into discrete units was essential for future research in biology and evolution. Linnaeus

2

3 provided the basis for defining species, and since his time, researchers have attempted to provide a coherent and distinct set of features to use for defining species.

Over 24 species concepts have been proposed since Linnaeus’ original definition (de Queiroz, 2007). Broadly, the concepts fall into categories based upon meeting criteria framed around biological (reproduction), phenetic, phylogenetic, and ecological features, among others (de Queiroz, 1998). These species concepts have a common, core criterion of a shared gene pool within a species that is not freely shared with members outside the species group, but each concept differs on what other features

(termed secondary characteristics) should also be used to delimit species members (de

Queiroz, 2007). Therefore, due to having the absolute requirement of secondary characteristics as intrinsic to species, the various species concepts are inherently incompatible and result in differences in classifications. These differences have led to disagreements about what evolutionary processes influence species (de Queiroz, 2007).

Furthermore, not all secondary characteristics, such as the production of viable offspring through sexual reproduction, can be applied to all domains of life, such as bacteria with an asexual mode of reproduction.

To unify the species concepts for both sexually and asexually reproducing organisms, a metapopulation lineage species concept was proposed (de Queiroz, 1998,

2007; Simpson, 1951). A metapopulation lineage is a segment of an ancestor- descendent series with a shared gene pool. An important distinction of species as a segment of a metapopulation lineage is that species can, and do, arise from other species, and also give rise to new species as time passes (de Queiroz, 2007). In a broad sense, this unification of species concepts shifts the focus of the species concept to the

3

4 processes of evolution and genetic variation at the population level and away from attempting to identify the most important characteristics that define individual species

(Hart, 2010). Since selection has no given order in which differential traits arise to define species, examining speciation as a segment of a metapopulation lineage allows for flexibility yet specificity for recognizing new species.

The metapopulation lineage concept unifies other species concepts, but the concept in itself, by definition, does not include a method to delineate species. The challenge shifts from the accurate and appropriate definition of species boundaries based on a singular concept to the methodological definition of a segment of a metapopulation lineage that represents a species (i.e. a shift from a conceptual to methodological disagreement) (de Queiroz, 2007). One benefit to this approach is that novel methods of species delimitation do not invalidate nor conflict with previously defined methods of species delimitation. These novel characteristics are especially relevant to many microbial species definitions since many do not follow the same rules of inheritance that govern macroscopic, biparental eukaryotic organisms. The methods that define a bacterial species do not have to be directly comparable to the methods used to define a eukaryotic species. While not wholly conclusive to the debate over bacterial species concepts, the metapopulation lineage species concept provides a framework to determine if bacteria form species on the basis of the particular processes and selection pressures that led to the speciation event, rather than a previously defined set of traits.

4

5

BACTERIAL SPECIES DELIMITATION

Although bacteria can conceptually be included in the metapopulation lineage species concept, there is debate on whether bacteria form discrete species (Doolittle and Papke, 2006). The core tenet of the metapopulation lineage species concept is a defined gene pool that has intraspecies sharing, but not interspecies sharing, and is transmitted through vertical inheritance from ancestor to descendent. Although, as previously stated, one could conceptually include bacteria in the metapopulation lineage species concept, several innate processes of bacteria support the exclusion of bacteria from the species concept.

Bacteria engage in horizontal gene transfer (HGT), with mechanisms including natural transformation, phage-mediated transduction, and plasmid-mediated conjugation. More recently, transfer of plasmids between cells has been demonstrated in a conjugation independent manner, including through intercellular nanotubes as well as outer membrane vesicles (Dubey and Ben-Yehuda, 2011; Fulsundar et al., 2014).

Bacteria also encode a variety of vehicles for transferring DNA between cells, including transposons, integrative conjugative elements (ICEs), and integrons. These mechanisms for horizontal acquisition of DNA are in opposition to the vertical inheritance necessary for cohesion within a species.

The process of HGT between distantly related organisms could limit bacterial cohesion and effectively abolish the idea of distinct bacterial species. Some have argued that, in theory, a particular organism could be part of two separate species, depending on the amount and timing of HGT events (Doolittle and Papke, 2006).

Despite the possibility of relatively free-flowing gene content between bacterial cells,

5

6 there is a fitness cost to HGT. In order to test the feasibility of the free-flowing gene hypothesis, one can use models to measure the predicted outcomes of HGT events. One such study showed that, when bacteria are in stable populations (i.e. low rate of gene loss), HGT has more detrimental effects than beneficial (Vogan and Higgs, 2011).

Therefore, selection would act to limit the HGT within the bacterial population and promote genetic cohesion. The authors suggest this result is the likely outcome in most extant bacterial populations. Conversely, early in evolution of the bacterial of life, predicted gene loss events and genome changes were much higher, and higher rates of HGT would have a net benefit to the bacterial population (Vogan and Higgs, 2011).

Therefore, the genomic and ecological contexts in which the HGT events occur affect the outcome of such events. If there is selective pressure against HGT in a particular environment, then, indeed, the model of free-flowing gene content will not prevail.

In addition, the process of recombination has been shown to promote cohesion within bacterial groups (Marttinen et al., 2015). Successful recombination events depend on some level of sequence homology. In a population of bacteria, members that are more closely related will have a larger proportion of homologous sequences than bacteria that are more distantly related, and therefore have more substrate for successful recombination events. However, sequence homology is not the only determinant of the recombination rate. Bacteria face ecological barriers to recombination and HGT as well, with bacteria in different environments encountering different types of foreign

DNA at different rates. If two bacterial species are never in the same ecological context, then the chances for HGT between the species are diminishingly low. Furthermore, codon preference and tRNA availability also potentially limit or encourage successful

6

7

HGT events (Tuller et al., 2011). The empirical rates and measured HGT events do not support the hypothesis that bacteria are panmictic, and, ultimately, selective pressure is the primary determinant of both successful recombination and HGT events. While

DNA can be shared between distantly related bacteria, the likelihood of selection favoring the transfer of chromosome sized fragments of DNA between bacteria is strikingly low; if selection favors these types of events, we can identify them through whole genome sequencing, such as in the case of species convergence between

Campylobacter jejuni and Campylobacter coli (Sheppard et al., 2008).

A recent publication by Bobay and Ochman (2017) supports the hypothesis that bacteria form distinct biological units, and even posits that the biological species concept can be applied to bacteria. The authors measured gene flow within and between species, to test two of the hypotheses proposed above: 1) Sequence similarity within species groups encourages intraspecies recombination events. 2) Barriers to recombination are sufficient to allow classification of bacteria into discrete species. In short, the authors compared the expected number of homoplasic sites to the observed number of homoplasic sites to identify polymorphisms likely to have arose by recombination events rather than convergent mutation. Results indicate that there are quantifiably different amounts of recombination events within a species threshold than between species thresholds, and the species threshold does not always correlate with genetic distance. Furthermore, the rates are not consistent with techniques typically used to define species, including 16S rRNA identity or average nucleotide identity. At most distant, Prochlorococcus marinus has identifiable homoplasies between species members with only 72% average nucleotide sequence identity. These data indicate that

7

8 sequence identity is likely not an accurate proxy for measuring the resulting genetic diversity defined by natural barriers to exchange of DNA between separate species.

Furthermore, the evidence of homoplasic sites in even distant members of the same species indicates that the recombination encourages genetic cohesion within species, rather than eroding the boundaries between species.

Several genome defense mechanisms, other than recombination, are employed by bacteria to exclude foreign DNA and promote intraspecies cohesion. The two best characterized mechanisms of genome defense are restriction modification (R-M) systems and clustered regularly interspaced short palindromic repeat (CRISPR) loci.

R-M systems are often found in pairs (Ishikawa et al., 2010). The restriction endonuclease is responsible for degrading targeted DNA, typically unmethylated DNA that was acquired from a foreign host. The paired methyltransferase, in contrast, protects the host DNA by methylating the specific recognition sites of the restriction enzyme, blocking its activity (Vasu and Nagaraja, 2013). R-M systems promote genetic cohesion by limiting horizontal gene transfer events to only organisms with the same

R-M systems, likely only members of the same species (Vasu and Nagaraja, 2013).

In contrast to R-M systems, which have shared features to an innate immune system, CRISPR loci are analogous to an adaptive immune system. CRISPR arrays are comprised of short, palindromic repeats interspersed with spacer sequences that are derived from foreign DNA (Rath et al., 2015). There are three described types of

CRISPR systems, and all CRISPR types share the three stages of defense. The first is analogous to an immunization event. Once foreign DNA enters the cell, the CRISPR- associated Cas1 and Cas2 proteins can incorporate a small segment of the foreign DNA

8

9 into the CRISPR locus (Rath et al., 2015). In this way, a bacterium can build a repertoire of spacers to a family of related phage over time. The second step of CRISPR immunity is expression. The entire CRISPR array is transcribed and processed into individual crRNA that act as the guides for targeting foreign DNA (Rath et al., 2015). The third step is interference of the target horizontally acquired elements. In type I systems, a protein complex, called Cascade, binds the crRNA and uses it to target the foreign

DNA, termed the protospacer. Cas3 is recruited upon recognition of the target DNA, and then unwinds and cleaves the DNA (Sinkunas et al., 2011). In the two other systems, a single protein, either Cas9 or Csm/Cmr, binds the crRNA and is responsible for target cleavage (Jinek et al., 2012; Rouillon et al., 2013; Staals et al., 2013). The

CRISPR systems allow the bacteria to adapt and selectively target foreign DNA for degradation such that not all horizontal gene transfer is blocked.

In total, the current data suggests that selective pressures maintain genetic cohesion within species despite the possibility of extensive horizontal transfer of DNA.

While DNA can be transferred between various distantly related bacteria, indiscriminate transfer and acquisition of DNA has associated fitness costs that are directly correlated with the expression level of the horizontally acquired DNA (Millan et al., 2015; Park and Zhang, 2012). If the benefits of acquisition do not outweigh the costs of expressing the novel DNA, then selection will favor the genotypes in which horizontal acquisition of DNA did not occur. A hypothetical situation as above

(Doolittle and Papke, 2006) whereby a bacterium encodes large segments of DNA from two distinct bacteria species would only be plausible in a scenario where selection pressures are very weak, with large population sizes and unlimited resources, or where

9

10 population sizes are very small leading to genetic drift and/or the founder effect after a population bottleneck event. Empirical data supports that, in most bacterial species, selective pressures maintain a cohesion between members of the same species, and therefore limit the amount of realized horizontal gene transfer over time.

OPERATIONAL METHODS FOR CLASSIFYING BACTERIA

In the genomics era, several different methods for classifying bacteria into groups have emerged, with increasing resolution as technologies improved. One of the defining moments of species classification was the development of using the 16S rRNA gene for bacterial taxonomy (Ludwig and Klenk, 2001; Woese and Fox, 1977). Since the 16S rRNA gene is universal, has both conserved and variable regions, and is slow to change, it is a suitable marker for comparison, especially between organisms with large genetic distances. With the low cost and availability of the polymerase chain reaction and Sanger sequencing techniques, amplification and sequencing of the 16S rRNA gene became the benchmark by which all bacteria and archaea were compared

(Rosselló-Móra and Amann, 2015). Presently, the 16S rRNA gene sequence is still used as the starting point for routine bacterial identification as well as comparative genomics approaches. Furthermore, the 16S rRNA gene is widely used in microbiome studies whereby questions focus on defining the composition of, and changes to, microbial communities rather than in identifying the individual species (de la Cuesta-

Zuluaga and Escobar, 2016). Although the 16S rRNA gene is multi-copy and may have sequence divergence of over 5%, in practice the majority of bacteria encode 16S rRNA

genes that are below the 1-1.3% threshold for the operational definition of species (Pei 10

11 et al., 2010). Therefore, for diagnostic and operational use where a small number of mismatches are unimportant to the conclusions derived from the putative identity of the 16S rRNA gene, it still remains a useful marker for bacterial taxonomic identification.

The so-called gold standard for accurate species delimitation is the use of DNA-

DNA hybridization (Wayne et al., 1987). Delimitation of novel species using DNA-

DNA hybridization requires comparisons between the isolate of interest and all potential species and related species of which the isolate might be a member. In the genomics era, use of average nucleotide identity (ANI) can be done completely in silico and has been proposed as an alternative method to DNA-DNA hybridization

(Konstantinidis and Tiedje, 2005; Richter and Rosselló-Móra, 2009). Cut-offs for species identification using ANI (~94-96%) were calibrated to the corresponding 70%

DNA-DNA hybridization value, in an attempt to standardize the new method yet allow for comparisons to data generated through DNA-DNA hybridization assays. Using this method allows for standard comparisons across different comparative genomics datasets as well as provides a baseline for species definition and subsequent comparative analyses using comparable species-level groups.

CORE, FLEXIBLE, AND PAN-GENOME

Common hypotheses that are tested using comparative genomics approaches often rely on accurate classification of the core, variable, and pan genomes of a species.

The core genome includes genes that are encoded in all members of a species. The core

genome typically includes functions that are essential for bacterial survival, including 11

12 biosynthesis, DNA replication, RNA transcription, protein translation, central metabolism, and other functions that are defining features of a particular species

(Tettelin et al., 2005). Comparative analyses of core genome sizes indicate a general negative correlation between the number of sequenced isolates of the species and the core genome size. This negative correlation typically follows an asymptotic function to the true, minimal core genome (Tettelin et al., 2005).

The flexible genome includes genes that are present in only a subset of representatives of a species. Typically the flexible genome includes functions that may enhance fitness in some, but not all, environmental contexts (Hacker and Carniel,

2001). The bacterial pan-genome is the sum total of the core and flexible genomes of a species (Tettelin et al., 2005). Bacterial pan-genomes are predicted to fall into two primary types, ‘open’ and ‘closed.’ An open pan-genome increases in size indefinitely as more isolates from the same species are considered. The implication of an open pan genome is that bacteria have access to a wide array of genes that may be present in any one genome only transiently. As selection acts on those genes that are beneficial in a specific ecological niche, the genes will become enriched for presence in a genome and become a part of the core genome, possibly leading to speciation events over time.

While the concept of an open pan genome seems counter to within species cohesion and seemingly an argument against any species concept, selective pressures maintain intraspecies cohesion. Selection during ecological change allows for cohesion yet flexibility within species. Conversely, a closed pan-genome ceases to increase in size after a certain number of isolates within a species have been identified. Closed pan

genomes are only found in recently evolved species that have a tightly defined niche 12

13 and/or a very specific function that defines the species, such as Bacillus anthracis

(Bentley, 2009). Other closed pan-genomes are found in host associated bacteria, such as Staphylococcus aureus (Tettelin et al., 2008). Having a closed pan genome is not necessary correlated with a small/large core genome, only with the observation that additional members of the same species do not add pan-genomic diversity to the species. Comparing the relative sizes of core and pan-genomes allows a rough estimation of the genetic diversity within and between species.

Interrogating the core and pan-genomes between species allows identification of species-specific gene clusters. When the gene cluster data is paired with predicted functional annotation, identification of the functional differences within and between species can be made (Wei et al., 2002). In examining statistically significant differences in presence/absence profiles of both orthologous clusters and functional categories, one can test hypotheses about functional differences between species. Hypothesis-guided investigation of the functional differences allows for development of testable models of both evolutionary history as well as present-day adaptations and ecological requirements.

HORIZONTAL GENE TRANSFER AND SYMBIOSIS

While data suggest that barriers between bacterial species are maintained despite the action of HGT, HGT has had an undeniable impact on the evolution of many host-microbe interactions. HGT often plays an important role in expanding and/or altering niche availability in the recipient bacteria. Horizontal acquisition of genomic

islands, plasmids, or other mobile genetic elements can enable traditionally 13

14 environmentally associated bacteria to interact with a novel host. The interaction with the host can be mutually beneficial, as in the case of the symbiosis island of

Mesorhizobium loti and its interaction with its Lotus hosts (Sullivan et al., 2002). Non- symbiotic Mesorhizobium isolates, which may have a plant-associated lifestyle, can evolve a symbiotic lifestyle after acquisition of the symbiosis island and can effectively fix nitrogen within its novel legume host. This symbiosis island is an example of an

ICE that integrates into the bacterial host genome after acquisition through the enzymatic activity of an integrase protein (Ramsay et al., 2006). Transfer is regulated through a quorum sensing mechanism, and depends on a widely encoded relaxase protein (Ramsay et al., 2006). In this way, the transfer of the ICE to certain hosts is a dead-end path (i.e. those that do not encode the relaxase), yet any recipient bacteria of the ICE are able to inhabit a new, isolated niche, likely leading to an overall increase in reproductive fitness.

In some bacterial lineages, horizontally acquired elements modulate the outcomes of the potential host-microbe interactions. Specifically, within the

Agrobacterium/ group, tumor-inducing (Ti)/ root-inducing (Ri) and sym plasmids encode genes that are necessary and sufficient for conversion from a free- living to pathogenic or mutualistic lifestyle, respectively. In the absence of the plasmids that mediate symbiotic interactions with plant hosts, bacteria in this family can live as plant-associated commensals, within the soil, and as antagonists to pathogenic agrobacteria (Martins et al., 2013; Mondy et al., 2013; Moore and Warren, 1979). Upon acquisition of a Ti/Ri plasmid, the bacterium gains the unique ability of interkingdom

DNA transfer whereby the bacterium can re-program the host to form a novel niche, 14

15 the crown gall or hairy root environment, which only the bacterium can utilize.

Therefore, the acquisition of the horizontally transferred DNA is the primary determinant of the lifestyle outcomes of the bacterium (Bourras et al., 2015).

In contrast to the pathogenic lifestyle conferred by the Ti/Ri plasmids, the sym plasmids confer the abilities of root nodulation and nitrogen fixation to Rhizobium and

Sinorhizobium spp. in association with specific plant hosts (Hooykaas et al., 1981). The plasmids encode proteins responsible for the production of nod factors that allow association and effective nodulation with plant hosts, as well as a collection of enzymes required for effective nitrogen fixation within the plant nodule environment. In the absence of the plasmid, the bacterium does not have the appropriate machinery to induce nodule formation by the plant hosts. Recent proteomic analysis of a root nodulating bacterium indicates that there are marked differences between the proteins expressed in the free-living state compared to the root nodule state (Tatsukami et al.,

2013). This indicates that although the bacterium requires the horizontally acquired genes for its host-associated lifestyle, certain baseline metabolic capabilities must be available to the organism or the host-associated lifestyle will not be effective and therefore will have selective pressures that eliminate the phenotype from the environment. Therefore, although the HGT provided the bacterium the capability to access a new niche, the bacterium encodes core chromosomal proteins that perform necessary functions for successful interaction with its plant host in the new niche.

15

16

TAXONOMIC CLASSIFICATION AND NOMENCLATURE

As described, within the Agrobacterium/Rhizobium group of bacteria, HGT is necessary for successful host-microbe interactions. In the past, these host-associated phenotypes were the primary factor in species identification and designation.

Understandably, when these phenotypes were first described, bacteria exhibiting the phenotypes were grouped into the same genus. Root nodulating organisms were, and to some extent still are, referred to as rhizobia (in the Rhizobium genus), while the plant pathogenic bacteria were all referred to as Agrobacterium and given a species name based on the phenotype of the host upon infection (Holmes and Roberts, 1981; Laranjo et al., 2014). Once molecular techniques were used in attempts to reconcile the differences between taxonomy and evolutionary history, several new genera and species were proposed (Chen et al., 1988; Jarvis et al., 1997; Jordan, 1982; Young et al., 2001). Still, the scientific community has not agreed upon proper nomenclature for certain species and genus level groups within this bacterial family (Farrand et al.,

2003). Regardless of the name given to the bacteria within this group, appropriate and definable species names are essential for clear communication. Biological data is valid only in the context of proper controls (i.e. compared to a null hypothesis). However, when making comparisons between groups of bacterial genomes, using species name alone can lead to inappropriate conclusions, and discontinuity over time about exactly which groups of bacteria are under comparison (Baltrus, 2016). While the taxonomic nomenclature of a particular bacterium is mostly an academic pursuit, appropriate naming has consequences outside of the scientific community, especially when federal

departments place restrictions on access to particular organisms due to health risks to 16

17 the general public. Stable nomenclature rules for bacteria are essential for continuity in the scientific literature as well as public policy.

BIBLIOGRAPHY

Baltrus, D. A. (2016). Divorcing Strain Classification from Species Names. Trends Microbiol. 24, 431–439. doi:10.1016/j.tim.2016.02.004.

Bentley, S. (2009). Sequencing the species pan-genome. Nat. Rev. Microbiol. 7, 258– 259. doi:10.1038/nrmicro2123.

Bobay, L.-M., and Ochman, H. (2017). Biological Species Are Universal across Life’s Domains. Genome Biol. Evol. 9, 491–501. doi:10.1093/gbe/evx026.

Bourras, S., Rouxel, T., and Meyer, M. (2015). Agrobacterium tumefaciens Gene Transfer: How a Plant Pathogen Hacks the Nuclei of Plant and Nonplant Organisms. Phytopathology 105, 1288–1301. doi:10.1094/PHYTO-12-14-0380- RVW.

Brenner, D. J., Staley, J. T., and Krieg, N. R. (2005). Classification of Procaryotic Organisms and the Concept of Bacterial Speciation. Bergey’s Man. Syst. Bacteriol. Vol. Two , Part A Introd. Essays, 27–38. doi:10.1007/0- 387-28021-9_4.

Cain, A. J. (1958). Logic and Memory in Linnaeus’s System of Taxonomy. Proc. Linn. Soc. London 169, 144–163. doi:10.1111/j.1095-8312.1958.tb00819.x.

Chen, W. X., Yan, G. H., and Li, J. L. (1988). Numerical Taxonomic Study of Fast- Growing Soybean Rhizobia and a Proposal that Rhizobium fredii Be Assigned to Sinorhizobium gen. nov. Int. J. Syst. Bacteriol. 38, 392–397. doi:10.1099/00207713-38-4-392. de la Cuesta-Zuluaga, J., and Escobar, J. S. (2016). Considerations For Optimizing Microbiome Analysis Using a Marker Gene. Front. Nutr. 3, 26. doi:10.3389/fnut.2016.00026. de Queiroz, K. (1998). “The General Lineage Concept of Species, Species Criteria, and the Process of Speciation,” in Endless Forms: species and speciation, eds. D. J. Howard and S. H. Berlocher (Oxford: Oxford University Press), 57–75. de Queiroz, K. (2007). Species Concepts and Species Delimitation. Syst. Biol. 56, 879– 886. doi:10.1080/10635150701701083.

17

Doolittle, W. F., and Papke, R. T. (2006). Genomics and the bacterial species problem. Genome Biol. 7, 116. doi:10.1186/gb-2006-7-9-116.

18

Dubey, G. P., and Ben-Yehuda, S. (2011). Intercellular Nanotubes Mediate Bacterial Communication. Cell 144, 590–600. doi:10.1016/j.cell.2011.01.015.

Farrand, S. K., Berkum, P. B. van, and Oger, P. (2003). Agrobacterium is a definable genus of the family . Int. J. Syst. Evol. Microbiol. 53, 1681–1687. doi:10.1099/ijs.0.02445-0.

Fulsundar, S., Harms, K., Flaten, G. E., Johnsen, P. J., Chopade, B. A., and Nielsen, K. M. (2014). Gene transfer potential of outer membrane vesicles of Acinetobacter baylyi and effects of stress on vesiculation. Appl. Environ. Microbiol. 80, 3469– 83. doi:10.1128/AEM.04248-13.

Hacker, J., and Carniel, E. (2001). Ecological fitness, genomic islands and bacterial pathogenicity: A Darwinian view of the evolution of microbes. EMBO Rep. 2, 376–381. doi:10.1093/embo-reports/kve097.

Holmes, B., and Roberts, P. (1981). The Classification, Identification and Nomenclature of Agrobacteria. J. Appl. Bacteriol. 50, 443–467. doi:10.1111/j.1365-2672.1981.tb04248.x.

Hooykaas, P. J. J., van Brussel, A. A. N., den Dulk-Ras, H., van Slogteren, G. M. S., and Schilperoort, R. A. (1981). Sym plasmid of Rhizobium trifolii expressed in different rhizobial species and Agrobacterium tumefaciens. Nature 291, 351–353. doi:10.1038/291351a0.

Ishikawa, K., Fukuda, E., and Kobayashi, I. (2010). Conflicts targeting epigenetic systems and their resolution by cell death: Novel concepts for methyl-specific and other restriction systems. DNA Res. 17, 325–342. doi:10.1093/dnares/dsq027.

Jarvis, B. D. W., Van Berkum, P., Chen, W. X., Nour, S. M., Fernandez, M. P., Cleyet- Marel, J. C., et al. (1997). Transfer of Rhizobium loti, Rhizobium huakuii, Rhizobium ciceri, Rhizobium mediterraneum, and Rhizobium tianshanense to Mesorhizobium gen. nov. Int. J. Syst. Bacteriol. 47, 895–898. doi:10.1099/00207713-47-3-895.

Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., and Charpentier, E. (2012). A Programmable Dual-RNA – Guided. 337, 816–822. doi:10.1126/science.1225829.

Jordan, D. C. (1982). Transfer of Rhizobium japonicum Buchananm 1980 to Bradyrhizobium japonicum gen. Nov., a genus of slow growing root nodule bateria. Int. J. Syst. Bacteriol., 378–380.

Konstantinidis, K. T., and Tiedje, J. M. (2005). Genomic insights that advance the species definition for . Proc. Natl. Acad. Sci. U. S. A. 102, 2567–2572.

doi:10.1073/pnas.0409727102. 18

Laranjo, M., Alexandre, A., and Oliveira, S. (2014). Legume growth-promoting

19

rhizobia: An overview on the Mesorhizobium genus. Microbiol. Res. 169, 2–17. doi:10.1016/j.micres.2013.09.012.

Ludwig, W., and Klenk, H.-P. (2001). “Overview: A Phylogenetic Backbone and Taxonomic Framework for Procaryotic Systematics,” in Bergey’s Manual® of Systematic Bacteriology (New York, NY: Springer New York), 49–65. doi:10.1007/978-0-387-21609-6_8.

Martins, G., Lauga, B., Miot-Sertier, C., Mercier, A., Lonvaud, A., Soulas, M.-L., et al. (2013). Characterization of epiphytic bacterial communities from grapes, leaves, bark and soil of grapevine plants grown, and their relations. PLoS One 8, e73013. doi:10.1371/journal.pone.0073013.

Marttinen, P., Croucher, N. J., Gutmann, M. U., Corander, J., and Hanage, W. P. (2015). Recombination produces coherent bacterial species clusters in both core and accessory genomes. Microb. Genomics 1. doi:10.1099/mgen.0.000038.

Millan, A. S., Toll-Riera, M., Qi, Q., and MacLean, R. C. (2015). Interactions between horizontally acquired genes create a fitness cost in Pseudomonas aeruginosa. Nat. Commun. 6, 6845. doi:10.1038/ncomms7845.

Mondy, S., Lalouche, O., Dessaux, Y., and Faure, D. (2013). Genome Sequence of the Quorum-Sensing-Signal-Producing Nonpathogen Agrobacterium tumefaciens Strain P4. Genome Announc. 1. doi:10.1128/genomeA.00798-13.

Moore, L. W., and Warren, G. (1979). Agrobacterium Radiobacter Strain 84 and Biological Control of Crown Gall. Annu. Rev. Phytopathol. 17, 163–179. doi:10.1146/annurev.py.17.090179.001115.

Park, C., and Zhang, J. (2012). High expression hampers horizontal gene transfer. Genome Biol. Evol. 4, 523–532. doi:10.1093/gbe/evs030.

Pei, A. Y., Oberdorf, W. E., Nossa, C. W., Agarwal, A., Chokshi, P., Gerz, E. A., et al. (2010). Diversity of 16S rRNA genes within individual prokaryotic genomes. Appl. Environ. Microbiol. 76, 3886–97. doi:10.1128/AEM.02953-09.

Ramsay, J. P., Sullivan, J. T., Stuart, G. S., Lamont, I. L., and Ronson, C. W. (2006). Excision and transfer of the Mesorhizobium loti R7A symbiosis island requires an integrase IntS, a novel recombination directionality factor RdfS, and a putative relaxase RlxS. Mol. Microbiol. 62, 723–734. doi:10.1111/j.1365- 2958.2006.05396.x.

Rath, D., Amlinger, L., Rath, A., and Lundgren, M. (2015). The CRISPR-Cas immune system: Biology, mechanisms and applications. Biochimie 117, 119–128. doi:10.1016/j.biochi.2015.03.025.

19

Richter, M., and Rosselló-Móra, R. (2009). Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl. Acad. Sci. 106, 19126–19131.

20

doi:10.1073/pnas.0906412106.

Rosselló-Móra, R., and Amann, R. (2015). Past and future species definitions for Bacteria and Archaea. Syst. Appl. Microbiol. 38, 209–216. doi:10.1016/j.syapm.2015.02.001.

Rouillon, C., Zhou, M., Zhang, J., Politis, A., Beilsten-Edmands, V., Cannone, G., et al. (2013). Structure of the CRISPR interference complex CSM reveals key similarities with cascade. Mol. Cell 52, 124–134. doi:10.1016/j.molcel.2013.08.020.

Sheppard, S. K., McCarthy, N. D., Falush, D., and Maiden, M. C. J. (2008). Convergence of Campylobacter Species: Implications for Bacterial Evolution. Science (80-. ). 320, 237–239. doi:10.1126/science.1155532.

Simpson, G. G. (1951). The Species Concept. Evolution (N. Y). 5, 285. doi:10.2307/2405675.

Sinkunas, T., Gasiunas, G., Fremaux, C., Barrangou, R., Horvath, P., and Siksnys, V. (2011). Cas3 is a single-stranded DNA nuclease and ATP-dependent helicase in the CRISPR/Cas immune system. EMBO J. 30, 1335–1342. doi:10.1038/emboj.2011.41.

Staals, R. H. J., Agari, Y., Maki-Yonekura, S., Zhu, Y., Taylor, D. W., vanDuijn, E., et al. (2013). Structure and Activity of the RNA-Targeting Type III-B CRISPR-Cas Complex of Thermus thermophilus. Mol. Cell 52, 135–145. doi:10.1016/j.molcel.2013.09.013.

Sullivan, J. T., Trzebiatowski, J. R., Cruickshank, R. W., Gouzy, J., Brown, S. D., Elliot, R. M., et al. (2002). Comparative sequence analysis of the symbiosis island of Mesorhizobium loti strain R7A. J. Bacteriol. 184, 3086–95. doi:10.1128/JB.184.11.3086-3095.2002.

Tatsukami, Y., Nambu, M., Morisaka, H., Kuroda, K., and Ueda, M. (2013). Disclosure of the differences of Mesorhizobium loti under the free-living and symbiotic conditions by comparative proteome analysis without bacteroid isolation. BMC Microbiol. 13, 180. doi:10.1186/1471-2180-13-180.

Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., et al. (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome.” Proc. Natl. Acad. Sci. 102, 13950–13955. doi:10.1073/pnas.0506758102.

Tettelin, H., Riley, D., Cattuto, C., and Medini, D. (2008). Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol. 11, 472–477.

doi:10.1016/j.mib.2008.09.006. 20

Tuller, T., Girshovich, Y., Sella, Y., Kreimer, A., Freilich, S., Kupiec, M., et al. (2011).

21

Association between translation efficiency and horizontal gene transfer within microbial communities. Nucleic Acids Res. 39, 4743–55. doi:10.1093/nar/gkr054.

Vasu, K., and Nagaraja, V. (2013). Diverse Functions of Restriction-Modification Systems in Addition to Cellular Defense. Microbiol. Mol. Biol. Rev. 77, 53–72. doi:10.1128/MMBR.00044-12.

Vogan, A. A., and Higgs, P. G. (2011). The advantages and disadvantages of horizontal gene transfer and the emergence of the first species. Biol. Direct 6, 1. doi:10.1186/1745-6150-6-1.

Wayne, L. G., Brenner, D. J., Colwell, R. R., Grimont, P. A. D., Kandler, 0, Krichevsky, M. I., et al. (1987). Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Int. J. Syst. Bacteriol., 463–464.

Wei, L., Liu, Y., Dubchak, I., Shon, J., and Park, J. (2002). Comparative genomics approaches to study organism similarities and differences. J. Biomed. Inform. 35, 142–150. doi:10.1016/S1532-0464(02)00506-3.

Woese, C. R., and Fox, G. E. (1977). Phylogenetic structure of the prokaryotic domain: The primary kingdoms (archaebacteria/eubacteria/urkaryote/16S ribosomal RNA/molecular phylogeny). 74, 5088–5090. Available at: http://www.pnas.org/content/74/11/5088.full.pdf [Accessed May 23, 2017].

Young, J. M., Kuykendall, L. D., Martínez-Romero, E., Kerr, A., and Sawada, H. (2001). A revision of Rhizobium Frank 1889, with an emended description of the genus, and the inclusion of all species of Agrobacterium Conn 1942 and Allorhizobium undicola de Lajudie et al. 1998 as new combinations: Rhizobium radiobacter, R. rhizogenes, R. rubi,. Int. J. Syst. Evol. Microbiol. 51, 89–103. doi:10.1099/00207713-51-1-89.

21

22

Chapter 2 – Use of whole genome sequences to develop a molecular phylogenetic framework for Rhodococcus fascians and the Rhodococcus genus

Edward W. Davis II†, Allison L. Creason†, Melodie L. Putnam, Olivier M. Vandeputte, and Jeff H. Chang

† These authors contributed equally

Frontiers in Plant Pathology

2014 9(7): e101996 22 PMCID: PMC4154481

23

ABSTRACT

The accurate diagnosis of diseases caused by pathogenic bacteria requires a stable species classification. Rhodococcus fascians is the only documented member of its ill-defined genus that is capable of causing disease on a wide range of agriculturally important plants. Comparisons of genome sequences generated from isolates of

Rhodococcus associated with diseased plants revealed a level of genetic diversity consistent with them representing multiple species. To test this, we generated a tree based on more than 1700 homologous sequences from plant-associated isolates of

Rhodococcus, and obtained support from additional approaches that measure and cluster based on genome similarities. Results were consistent in supporting the definition of new Rhodococcus species within clades containing phytopathogenic members. We also used the genome sequences, along with other rhodococcal genome sequences to construct a molecular phylogenetic tree as a framework for resolving the

Rhodococcus genus. Results indicated that Rhodococcus has the potential for having

20 species and also confirmed a need to revisit the taxonomic groupings within

Rhodococcus.

INTRODUCTION

Defining bacteria into stable and coherent genetically similar species has many practical implications. However, multiple factors including effective population size, horizontal gene transfer and bacterial recombination, and their barriers, affect cohesiveness of different groups of bacteria to varying degrees (Doolittle and

Zhaxybayeva, 2009). As a consequence, a unifying concept for bacterial species has 23

24 yet to be adopted, which has made it difficult to develop criteria and thresholds that can be generally applied for defining bacterial species.

Traditional polyphasic approaches define bacterial species as a monophyletic group with at least one discriminative phenotypic trait. Though pragmatic and widely adopted, the traditional approaches are weighted towards phenotypic traits and cannot keep pace with the rate in which new genotypes are being discovered and sequenced.

With major advances in contemporary methods in sequencing, operational criteria based on whole genome sequences have been developed and adopted to assist in resolving bacterial phylogeny (Konstantinidis et al., 2006). Multi-locus sequence analysis (MLSA) and trees based on whole genome sequences are powerful methods for inferring evolutionary relationships (Staley, 2009). Alternative criteria based on the degree of similarities in genome signatures have also been developed (Konstantinidis et al., 2006; Bohlin et al., 2008; Richter and Rosselló-Móra, 2009). Average nucleotide identity (ANI), for example, is a simple measure of genetic relatedness based on sequences conserved among compared genomes and has gained acceptance as a method for defining bacterial species (Konstantinidis et al., 2006; Chan et al., 2012; Kim et al.,

2014). ANI has also been developed as a method for codifying bacteria based on genome similarity (Marakeby et al., 2014).

Genome-enabled comparisons and the recognition of environmental niches in structuring gene flow have revealed a diversity of population structures for groups of plant-associated bacteria. Pseudomonas fluorescens, for example, occupies multiple niches and has a level of heterogeneity consistent with limited gene flow between sub-

clades that challenge their taxonomy (Loper et al., 2012). Likewise, a change to the 24

25 taxonomy of Agrobacterium tumefaciens has been proposed to reflect genome-enabled discovery of clade-specific traits (Lassalle et al., 2011). The ANI method has been used to assign newly discovered isolates to known plant-associated species and discover new species of plant pathogens (Dudnik et al., 2014; Durán et al., 2014; van der Wolf et al.,

2014).

The Gram-positive Rhodococcus genus is a member of the family and forms a distinct group with 30 valid species published (Jones and

Goodfellow, 2012). The genus has diverse members that inhabit a wide range of terrestrial as well as aquatic habitats and are renowned for their catabolic functions and ability to degrade a large number of organic compounds (Larkin et al., 2005).

Additionally, members of Rhodococcus have been recovered from extreme environments such as the deep-sea, oil-contaminated soils, and freeze-thaw tundra on glacial margins (Sheng et al., 2011; Shevtsov et al., 2013; Konishi et al., 2014). Because of their biotechnological applications and potential in bioremediation, there has been a dramatic increase in the number of sequenced Rhodococcus genomes. Their genomes are high in GC content and range in size from 4.3 megabases (Mb) to over 10 Mb. Most genomes are larger than 5 Mb and their large sizes have been attributed to both horizontal gene transfer and gene duplication (Letek et al., 2010). Partly due to the historical reliance on phenotypic traits and use of 16S rDNA sequence information, the

Rhodococcus phylogeny still remains poorly resolved (Gürtler et al., 2004).

To date, Rhodococcus fascians and are the only two members of the genus that are well documented as being pathogenic (Bargen and Haas,

2009; Stes et al., 2011). R. fascians can infect a broad range of plants. After breaching 25

26 the plant cuticle, the pathogen collapses the epidermal layer, and forms ingression sites beneath epiphytic colonies (Cornelis et al., 2001). R. fascians then grows inside the host tissue and provokes cell differentiation and de novo organogenesis, resulting in proliferations and abnormal growths called witches’ brooms or leafy galls (Putnam and

Miller, 2007).

To gain insights into the mechanisms and evolution of virulence, we determined the genome sequences for 20 isolates of Rhodococcus (Creason et al., 2014). Like R. equi, R. fascians has few horizontally-acquired virulence genes, which are predicted to be augmented by co-option of core genes, that contribute to the ability of the bacterium to infect and cause disease (Crespi et al., 1992; Letek et al., 2008; 2010; Creason et al.,

2014). Because of this mechanism of virulence evolution, phytopathogenicity is not expected to be a distinguishing trait suitable for classifying these Rhodococcus isolates.

In this study, we tested the hypothesis that leafy gall disease is caused by members of multiple species of Rhodococcus. Results from four independent methods were consistent and supported the hypothesis. Analysis of the 20 genome sequences showed the isolates formed two well supported clades, with one consisting of 16 isolates and having complex substructure indicative of multiple species. Analysis of the Rhodococcus genus associated four isolates collected from extreme environments or found in association with healthy plants to the two clades of Rhodococcus with plant- pathogenic members. Lastly, the need for revision of taxonomic grouping in

Rhodococcus is suggested, as determined based on ANI distances calculated for all possible pairwise comparisons between members with available genome sequences.

26

27

MATERIAL AND METHODS

Isolation of phytopathogenic Rhodococcus

Symptomatic tissue of Leucanthemum x superbum ‘Becky’, received by the

Oregon State University Plant Clinic, was washed, macerated in sterile saline, and incubated at room temperature for 30 minutes. Rhodococcus cells were selected for by culturing on semi-selective D2 media (Kado and Heskett, 1970). Isolate A22b was selected and verified as phytopathogenic based on its ability to cause leafy gall disease on pea seedlings and positive amplification for the fasA gene.

Nucleic acid preparations

A22b was grown in LB at 28ºC with shaking (Bertani, 1951). Genomic DNA from A22b was extracted from cells grown directly from stocks. The Wizard Genomic

DNA Purification Kit was used, according to the instructions of the manufacturer, to extract genomic DNA (Promega Corporation, Madison, WI, USA).

Next-generation sequencing, assembly, and annotation

Library construction and sequencing on an Illumina MiSeq were done in the

Center for Genome Research and Biocomputing at Oregon State University. The A22b genome was assembled using Velvet (v1.2.08), with a hash length of 125 (Zerbino and

Birney, 2008). The insert size was determined based on the estimated fragment size of the library preparation. Multiple assemblies were done, in which coverage cutoff, expected coverage, and hash length parameters were changed (Creason et al., 2014).

The highest quality assembly was identified based on the number of contigs and having 27

a sum total size between 5-6 Mb. Contigs were reordered using the genome sequence

28 of R. fascians A44a as a reference and the Mauve Contig Mover (Rissman et al., 2009).

The genome was annotated using Prokka (Seemann, 2014). As part of the Prokka pipeline, CDSs were annotated in part, based on BLASTP analysis and a database of genomes core to the Rhodococcus genus, including whole-genome assemblies from

Rhodococcus jostii RHA1, B4, Rhodococcus erythropolis PR4,

R. equi 103S and the R. fascians linear plasmid, pFiD188 (Na et al., 2005; McLeod et al., 2006; Sekine et al., 2006; Letek et al., 2010; Francis et al., 2012). The whole genome shotgun project for A22b has been deposited at DDJB/EMBL/GenBank under the accession JOKB00000000 (BioProject PRJNA252927, BioSample

SAMN02864791). The version described in this paper is version JOKB01000000. The

A22b short reads and annotated genome sequences are available for download

(SRS641819, http://dx.doi.org/10.7267/N9PN93H8). In order to be consistent, publicly available wgs sequences used in this study were similarly annotated. Their genome annotations are available upon request.

Phylogenetic analyses

We used Hal (-a muscle and –y 100 settings) to construct the multi-gene tree of the 20 isolates and the farcinica type strain as the outgroup (Ishikawa et al.,

2004; Robbertse et al., 2011).

Sequences for the maximum-likelihood MLSA tree were gathered from the

NCBI nt and wgs databases, using FtsY, InfB, RpoB, RsmA, SecY, TsaD, and YchF from Rhodococcus jostii RHA1 and Bifidobacterium longum subsp. infantis ATCC

15697 as queries in TBLASTN+ (v2.2.29) searches (with default settings; (Adékambi 28

et al., 2011)). The query sequences were selected to provide coverage of the

29

Actinobacteria phylum. Duplicate results from the two TBLASTN+ (v2.2.29) results, and strains in which all seven translated sequences were not detected, were filtered out.

A total of 1316 strains passed filter.

The filtered sequences were aligned using the L-INS-i algorithm in MAFFT

(v7.149b) with the –legacygappenalty flag. Gblocks (v0.91b) was used to trim the alignments prior to concatenation with half gapped positions allowed (-b5=h setting;

(Castresana, 2000)). Concatenated sequences with 100% identity, excluding those in the Rhodococcus genus, were collapsed into one entry, resulting in 961 sequences as input for tree generation. The most appropriate models of substitution for each gene were selected using the ProteinModelSelection.pl script provided with RAxML (v8;

(Katoh and Standley, 2013; Stamatakis, 2014)). Trees were generated using RAxML

(v8), based on the guidelines provided in the users manual (Stamatakis, 2014). Briefly, five starting parsimony trees were generated using the –y option; fixed initial arrangements were run on the five trees separately with the –i 10 setting. Automatic initial arrangements were also run on the five trees. The best log likelihood scores were used to choose the proper initial arrangement setting for further tree generation (-i 10 was the best for the dataset). A total of 500 rapid bootstraps (–x setting with –N 500) were performed on this dataset, and 10 distinct (–f d setting with –N 10) trees were generated. Bootstrap values were mapped on the best of the 10 distinct trees using the

–f b setting. See Supplemental Data 1 for the full tree, accession values, and duplicated sequences that were removed.

Alignments were visualized using Belvu (Sonnhammer and Hollich, 2005).

Images were generated using the iTOL (Letunic and Bork, 2011). 29

30

Bioinformatic analyses

The progressiveMauve (v2.3.1) alignment was produced using default settings and as input, the chromosomal sequences for isolates D188, A21d2, 05-339-1 and A44a

(Darling et al., 2010).

JSpecies with whole genome FASTA sequences as input, was used to calculate average nucleotide identities (BLAST; ANIb) and do pairwise comparisons of tetranucleotide frequencies (TETRA; (Richter and Rosselló-Móra, 2009)). Codon usage tables were constructed using EMBOSS cusp and sum difference statistics were calculated using EMBOSS codcmp (default settings; http://emboss.sourceforge.net/apps/cvs/emboss/apps/cusp.html).

The ANI values used to generate the distance dendrogram were calculated using published methods (Konstantinidis and Tiedje, 2005). The following were automated using ad hoc scripts. Genome sequences were split into 1020-nucleotide long segments.

The genome segments were used as queries in BLASTN+ (v2.2.27) searches against all other complete genomes in an all-by-all pairwise analysis. BLASTN+ (v2.2.27) was used, with the extra settings, “blastn –task blastn –dust no –xdrop_gap 150 –penalty -

1 –reward 1 –gapopen 5 –gapextend 2”, for the searches (Camacho et al., 2009).

Sequences with less than 70% coverage and 30% identity were filtered out, the number of results above the cutoffs were counted, and the average nucleotide identity of the resulting sequences were calculated. Results were comparable to those calculated using jSpecies and were better at handling the larger number of samples (Richter and

Rosselló-Móra, 2009). 30

31

The distance dendrogram was generated using the all-by-all pairwise ANI divergence values as input, which is defined as 100% - ANI (Chan et al., 2012). The hcluster Python package was installed along with all dependencies (http://scipy- cluster.googlecode.com/). IPython, in interactive mode, was used to generate the dendogram (Perez and Granger, 2007). The matplotlib library was also required

(ipython –matplotlib; (Hunter, 2007); http://matplotlib.org). The pdist() function from hcluster was used to calculate the Euclidean distance between the ANI divergence values, and the complete linkage on the distance matrix was calculated using the complete() function. The dendrogram was generated using the dendrogram() function of hcluster. Input data and the resulting script from the interactive IPython session can be found in Supplemental Data 2.2.

Graphs were generated in R (R Core Team, 2013). The 3-D scatter plot was generated using plot3d{rgl} and quads3d{rgl}. Heatmaps were generated using heatmap.2{gplots}.

RESULTS

Whole genome-based phylogeny supports multiple lineages of plant pathogenic Rhodococcus

In our first sequencing effort, we used hybrid approaches to generate high quality assemblies for isolates D188 and A44a (Creason et al., 2014). Unexpectedly, initial attempts to align the genome sequences were challenging, leading us to hypothesize that the two isolates represented different species of Rhodococcus. In order to test this hypothesis, we determined the genome sequences for 18 additional isolates

31 of Rhodococcus identified from diseased plants or initially typed as R. fascians (Table

32

2.1; (Miteva et al., 2004)). The alignment of four genome sequences shows conservation of collinear blocks, with A44a being the most disparate in respect to the level of conservation and number and size of gaps between blocks (Supplemental

Figure 2.1).

We constructed a multi-gene phylogenetic tree for the 20 isolates of

Rhodococcus based on the whole genome sequences to infer the evolutionary relationships. Homologous sequences were identified from all 20 and from the type strain of Nocardia, which we used as an outgroup (Ishikawa et al., 2004). Clusters with paralogs, but not those with sequences potentially acquired via recombination, were filtered out, leaving 1727 clusters. The corresponding sequences from each isolate were concatenated, aligned, and used to derive a neighbor-joining (NJ) tree. The isolates formed two distinct clades (Figure 2.1). Clade I has substructure, with the largest and deepest branching sub-clade i consisting of the type strain LMG3623, D188, the two glacial ice core isolates, three other culture collection isolates, and three isolated from diseased plants. Sub-clade ii includes three isolates obtained directly from diseased plants and one obtained from a culture collection. Sub-clade iii is represented by just two isolates, 05-339-1 and A76, collected from diseased plants. As indicated by the longer branch lengths, the two isolates are more diverged. Clade II consists of four isolates from the US that includes A44a.

We projected the presence/absence of a linear virulence plasmid and trait of phytopathogenicity on to the phylogenetic tree (Figure 2.1; (Creason et al., 2014)).

Phytopathogenicity is not exclusive to any clade or sub-clade. The five non-pathogenic

isolates clustered in Clade I with four and one found in sub-clades i and ii, respectively. 32

33

Sub-clade ii is the most variable in respect to virulence loci structure, as phytopathogenic A21d2 and A25f lack a linear virulence plasmid and A21d2 also lacks the entire fas operon (Creason et al., 2014). These data are consistent with our hypothesis that leafy gall disease is caused by multiple species of Rhodococcus and explain why initial attempts in aligning the whole genome sequences of D188 and A44a were difficult.

Alternative whole-genome based analyses support the multiple lineages of Rhodococcus

To support the existence of distinct groups of plant pathogenic Rhodococcus, we used alternative methods to cluster the bacteria based on similarities in their genome sequence features. We determined the average nucleotide identity values (ANIb; calculated with the BLAST algorithm) and tetranucleotide usage patterns (TETRA) for all pairwise comparisons (Teeling et al., 2004; Konstantinidis and Tiedje, 2005; Richter and Rosselló-Móra, 2009). A plot of ANI versus TETRA formed three distinct clouds

(Figure 2.2). The plots of the comparisons of isolates within sub-clades i and ii as well as Clade II coalesced into cloud 1. These comparisons were wholly within the calibrated and strictest thresholds of 96% ANI and 0.997 TETRA values that are recommended for circumscribing prokaryotic taxa (Richter and Rosselló-Móra, 2009).

Results were identical when we relaxed ANI thresholds to 94% (data not shown). The reciprocal comparisons between isolates 05-339-1 and A76 of sub-clade iii associated with Cloud 1 but fell below ANI thresholds, regardless of which strictness level was used. The failure to exceed threshold is consistent with the greater divergence between

33 these isolates, as observed in the NJ tree. Cloud 2 represented all possible comparisons

34 between isolates in different sub-clades of Clade I. This cloud spanned the TETRA threshold value but was well below the ANI threshold values (Richter and Rosselló-

Móra, 2009). Cloud 3 contains the most dissimilar comparisons between isolates of

Clades I and II. The values within this cloud are similar to those derived from a comparison between R. equi 103S and an environmental isolate of Rhodococcus

(McLeod et al., 2006; Letek et al., 2010). Thus, not only was the structure of the

Rhodococcus samples supported by analysis with ANI and TETRA, but the consistency in results reconfirmed the use of ANI for inferring genetic relatedness, as was previously demonstrated by others (Goris et al., 2007; Richter and Rosselló-Móra,

2009).

The genetic code is nearly universal but synonymous codons are not used with equal frequencies across species because of a complex balance of multiple forces

(Plotkin and Kudla, 2011). Because this codon bias can be used to distinguish between dissimilar groups of organisms, we compared all possible pairwise combinations between the 20 isolates and displayed the similarity values as a heat map (Sharp and

Li, 1987). Codon usage preferences clearly differentiated the isolates of Clade I from those of Clade II (Figure 2.3A). Codon usage preferences also revealed a pattern consistent with the substructure observed in the NJ tree and ANI versus TETRA analysis (Figures 2.1 and 2.2). The three sub-clades were easily discernable, though the relationships within sub-clade i differed slightly from those inferred from the NJ tree.

The core genome hypothesis suggests that coherent clusters of bacteria have a core set of functions, and each member augments the core with a variable accessory

genome that contributes functions for niche adaptation (Lan and Reeves, 1996). The 20 34

35 sequenced isolates share a core of 3,063 genes. However, sample size indubitably affects estimation of core genome identities and sizes. Given the small sample size of our collection and the imbalance in numbers of isolates between clades, we elected to cluster based on the percent of shared homologs rather than on core genomes (Figure

2.3B). The members of Clades I and II separated into distinct clusters, which could be taken as evidence for limited gene flow between clades. Relative to results from other approaches, the relationship of the isolates within Clade I were noticeably different, as sub-clade ii and sub-clade iii were not clearly demarcated. Regardless of these subtle differences, the results were entirely consistent in separating the 20 isolates into two groups of phytopathogenic Rhodococcus, with one also having evidence for substructure.

A molecular phylogeny based on whole genome sequences provides a framework for resolving the Rhodococcus genus

To develop a framework for resolving the Rhodococcus phylogeny, we constructed a multi-locus sequence analysis (MLSA) maximum likelihood (ML) tree based on 961 concatenated sequences representing 1316 members of the phylum (Figure 2.4). We used seven marker genes that were previously identified as conserved and informative for the subclass (Adékambi et al., 2011).

Rhodococcus and Nocardia formed sister groups and the members of the Rhodococcus genus formed a well-supported phylogenetically coherent cluster (bootstrap percentage of 98%). We were able to identify two relatively defined groups and one small, less defined, group within the Rhodococcus genus (Figure 2.4; marked as a, b, and c). The

35 two larger groups were consistent with the two clades previously described in a

36 phylogeny based on 16S rDNA sequences (Jones and Goodfellow, 2012). In contrast, in the MLSA ML tree, the smaller R. equi clade is within the larger Rhodococcus clade

(bootstrap percentage of 93%), unlike previous studies, which associated the R. equi clade with Nocardia.

The 20 isolates of interest in this study formed a distinct subgroup (bootstrap percentage of 100%) within the MLSA ML tree and also included five other isolates

(Figure 2.4). Phytopathogenic isolate A22b, which was identified from a diseased plant and sequenced independently from the 20 isolates, clustered in Clade II. Two of the isolates that clustered with the 20 of interest in this study, were identified independent of plants and in extreme environments. Rhodococcus sp. JG-3 was isolated from permafrost (GenBank BioProject PRJNA195882) and Rhodococcus sp. AW25M09 was isolated from the stomach of an Atlantic Hagfish (Hjerde et al., 2013). Two others,

Rhodococcus spp. 29MFTsu3.1 and 114MTsu3.1 were found associated with the rhizosphere or endosphere of Arabidopsis thaliana (GenBank BioProject

PRJNA201196).

The topology of the tree outside of the Rhodococcus genus was similar to previously reported trees and revealed inconsistencies in currently defined taxonomic groups, as previously observed (Adékambi et al., 2011; Gao and Gupta, 2012; Jones and Goodfellow, 2012; Verma et al., 2013). The Micrococcineae formed a polyphyletic group, with three to four distinct clades, depending on the tree and tree generation method (bootstrap percentages of 88-100%). We also had difficulties in accurately placing the Frankineae into a discrete group, as they were found throughout the

phylogeny. The Actinomyces neuii species formed a separate, but somewhat poorly 36

37 supported (bootstrap percentage of 63%) clade with Mobiluncus curtisii, as was the case in a phylogeny based on 16S rDNA sequences (Jones and Goodfellow, 2012). We observed a cryptic relationship with Actinopolysporineae included within the

Pseudonocardineae clade. However, some of the branches within the

Pseudonocardineae were poorly supported (bootstrap percentages <50%), indicating poor resolution with the clade as a whole. One key addition of the MLSA ML tree was that the branching of Actinomyces was well supported and is consistent with there being two large clades (bootstrap percentages of 90% and 100%).

ANI provides a framework for resolving the Rhodococcus genus

Because of the relatively few informative sites, bootstrap values at the tips of the MLSA ML tree were often low and insufficient for resolving the Rhodococcus genus; compare for example, the two clades of Rhodococcus with plant pathogenic members (Figures 2.1 and 2.4). Therefore, to develop a molecular framework for the

Rhodococcus genus, we used ANIb as a tool for inferring similarity. The values were calculated for 3422 pairwise comparisons of the 59 Rhodococcus isolates, compiled into a distance matrix, and used to generate a divergence dendogram (Figure 2.5; (Chan et al., 2012)). Seven distinct clusters formed with inter-group comparisons that exceeded ANI values of 70-75%, a range that is typically found between members of the same genus.

The two clades of phytopathogenic Rhodococcus spp. represent seven different species, when using ANI and a 94% threshold. As previously observed, Clade I has

complex substructure and represents four different species (Figures 2.1, 2.2, 2.3 and 37

2.5). Like conclusions described above, subclade i represents the originally named R.

38 fascians species, as it includes the type strain LMG3623, along with the most sequenced isolates. Isolates in subclade iii were not as closely grouped, which was also consistent with results from the NJ tree and analysis of codon usage. In fact, based on the single criterion of ANI, 05-339-1 and A76 should be considered as separate species.

Subclade iv is represented by a single isolate, Rhodococcus sp. AW25M09. In Clade

II, isolate A22b is just below the 94% ANI cutoff and may represent a separate subspecies from those that formed this second major clade in previously described analyses (Figures 2.1, 2.2, 2.3, and 2.5).

The other isolates of Rhodococcus, which are not known as plant-associated, formed five additional clades. Within these clades, and including single isolates

(singletons), there were 13 smaller groups defined by ANI values of 94% ANI or greater. These relationships can be used to infer species groups.

Clade III consists mostly of isolates named as Rhodococcus erythropolis. The group also included Rhodococcus qinshengii BKS 20-40,

ATCC 17895, and Rhodococcus sp. P27, which based on their association to the clade, could be considered members of the R. erythropolis species.

Clade IV has two subgroups and two singletons. The first subgroup is represented by Rhodococcus jostii RHA1 and also included Rhodococcus spp. DK17 and JVH1. The second subgroup varied in terms of named species and consisted of two isolates of Rhodococcus opacus, the type strain of Rhodococcus imtechensis and

Rhodococcus wratislaviensis IFP 2016. Because the latter strain did not cluster with the type strain of R. wratislaviensis, NBRC 100605, we suggest that IFP 2016 is not a

member of the R. wratislaviensis species and it belongs to a different species of 38

39

Rhodococcus. The other singleton isolate was R. opacus B4. A whole genome sequence of the R. opacus type strain is unavailable. Thus, we cannot suggest whether the subgroup or the singleton should be designated as the R. opacus species. Furthermore,

16S rDNA sequences of the three named R. opacus isolates are too similar for resolving this issue. Interestingly, R. opacus PD630 is most similar to R. wratislaviensis NBRC

100605 in respect to ANI values, but the two did not associate with one another in the divergence dendogram.

Clade V contains a tightly clustered group with members of the Rhodococcus equi species. Its placement within Rhodococcus was consistent with the MLSA ML tree.

Clade VI consists of two smaller subgroups. The first subgroup has R. rhodochrous ATCC 21198 and two undesignated isolates (EsD8 and BCP1). The low

ANI of 72% placed R. rhodochrous ATCC 21198 in a clade separate from isolate R. rhodochrous BKS6-46. To further investigate this discrepancy, we used the 16S rDNA sequence from R. rhodochrous ATCC 21198 as a query in a BLASTN+ search. The sequence identified corresponding sequences from Rhodococcus aetherivorans (100% identity, 100% subject coverage), including the type strain DSM 44752. When we used the 16S rDNA sequence from the type strain of R. rhodochrous as a query, it showed greater similarity to the corresponding sequence of R. rhodochrous BKS6-46 rather than ATCC 21198. In all, these data suggest the identity of ATCC 211983 should be revisited. The second subgroup of Clade VI consists of two Rhodococcus ruber isolates

(BKS 20-38 and Chol-4) and another undesignated isolate (P14). Thus, we suggest that

P14 is a member of the R. ruber species. 39

40

Clade VII has one cluster of defined species and two singletons. Two isolates of Rhodococcus pyridinivorans (AK37 and SB3094) and Rhodococcus sp. R04 clustered, which we suggest represents the R. pyridinivorans species. R. rhodochrous

BKS6-46 is a singleton in this clade but its precise placement within this clade of the dendrogram is somewhat misleading. The ANI values for the pairwise comparisons of

R. rhodochrous BKS6-46 to the two named R. pyridinivorans, but not Rhodococcus sp.

R04, exceeded the 94% threshold used to define a species relationship. The values derived from comparison with Rhodococcus sp. R04 likely caused R. rhodochrous

BKS6-46 to form its own branch. The last singleton is Rhodococcus sp. R1101, which had ANI values around 90% relative to the other isolates of Clade VII.

There were two outliers in the dendogram. Rhodococcus rhodnii and

Rhodococcus triatomae were identified as singletons and placed in a clade closest to

Clade V (R. equi). These isolates had low ANI values (between 71% and 75%) in comparison to all of the sequenced Rhodococcus isolates.

DISCUSSION

Leafy gall disease is a substantial economic problem for the horticultural industry. The pathogen has an extensive host range that includes most plants important to the industry (Putnam and Miller, 2007). Current management strategies rely on visual inspection and the only method of control is the destruction of infected plant material. While visual inspection of plants for disease is superficially trivial, the absence of fundamental information on its epidemiology and the lack of robust, on-site

diagnostics contribute to make disease management challenging. Whole genome 40

41 sequencing is a cost-effective and facile approach for studying bacterial species and has important practical implications for diagnosis of disease. In this study, we used whole genome sequences to analyze the genetic diversity of plant pathogenic

Rhodococcus as a first step towards the development of better management strategies for this pathogen.

Twenty isolates, many of which were identified from diseased plants, were previously selected for whole genome sequencing (Creason et al., 2014). Based on a tree derived from 1727 homologous genes, we demonstrated that the 20 isolates separated into distinct clades and sub-clades (Figure 2.1). Similarities in genome features, including ANI, TETRA, codon usage preference, and degree of genome homology, were all consistent in clustering the isolates into distinct and coherent groups (Figures 2.2 and 2.3). The similarity in results between the tree and ANI was encouraging and gave us confidence in using the latter for inferring evolutionary relationships of isolates within a larger dataset (Figure 2.5).

The phytopathogenic isolates formed a subgroup distinct from other members of the Rhodococcus genus, which could be taken as evidence for cohesion (Figures 2.4 and 2.5). However, within this subgroup, phytopathogenicity is not a discriminative trait (Figures 2.1, 2.4, and 2.5). Although, we speculate that members of this subgroup are potentiated towards phytopathogenicity. Virulence evolution in Rhodococcus has been modeled according to a mechanism of gene co-option whereby limited, but key horizontally acquired virulence genes, trigger the co-option of core genes for virulence

(Letek et al., 2010; Creason et al., 2014). As few as four functions, most often conferred

by a cluster of genes vectored by a conjugative virulence plasmid, are hypothesized to 41

42 be sufficient for phytopathogenicity for members of these genetically diverse clades of

Rhodococcus.

The non-pathogenic isolates recovered from a Greenland glacier ice core,

GIC26 and GIC36, as well as Rhodococcus sp. JG-3 from permafrost and AW25M09 from the stomach of the Atlantic Hagfish, clustered with phytopathogenic isolates in

Clade I (Miteva et al., 2004; Hjerde et al., 2013). Rhodococcus spp. 29MFTsu3.1 and

114MTsu3.1, both of which associate with plants, clustered in Clade II. Inspection of the genome sequences of isolates JG-3, AW25M09, 29MFTsu3.1 and 114MTsu3.1, failed to reveal any of the virulence genes known to be necessary for Rhodococcus to cause leafy gall disease. We did detect a linear plasmid-like sequence in the draft genome of Rhodococcus sp. AW25M09, but it lacks genes known to be necessary for virulence towards plants (Hjerde et al., 2013). It would be interesting to test whether these isolates, upon acquisition of genes that confer the four key virulence functions, gain the ability to infect and cause disease to plants.

Our genome sequencing effort of plant-associated isolates contributed to increasing the number of sequenced Rhodococcus isolates by 50%. Not since a decade ago, have genomic features been used to address the Rhodococcus genus (Gürtler et al.,

2004). We therefore used these genome sequences, an additional sequence we generated for phytopathogenic isolate A22b, along with most currently available rhodococcal genome sequences, to construct a molecular phylogeny to help resolve the genus and shed light on the phylum Actinobacteria as a whole. MLSA provided a framework for defining genus level relationships that can be further explored (Figure

2.4). In this study, the R. equi species was placed within the Rhodococcus genus, 42

43 consistent with previously reported results derived from whole-genome based approaches, and in contrast to phylogenetic analysis based on 16S rDNA sequences

(Letek et al., 2010; Jones and Goodfellow, 2012). Use of MLSA to infer phylogeny revealed inconsistencies in the placement of Frankineae, in contrast to a 16S rDNA- based phylogeny, which formed a Frankineae cluster near the root of the phylum (Jones and Goodfellow, 2012). However, MLSA was not sufficient for resolving some of the branches within the Pseudonocardineae (bootstrap percentages <50%). More genome sequences, or informative marker sequences for members of Actinopolysporineae might help discern the finer details within the suborder. Other minor discrepancies in single isolate naming were also noted (Supplemental Data 1).

With ANI, we were able to place the 59 Rhodococcus isolates with sequenced genomes into seven large groups. Results were consistent with the MLSA ML tree but

ANI provided greater resolution of their relationships (Figure 2.5). We were also able to infer at least 20 different species from these 59 isolates. Some conflicts between clustering and species naming were noted and suggest a need to revisit their taxonomical groupings. However, we recognize that the use of ANI to infer taxonomical groupings merely provides a framework and is not sufficient, by itself, for defining bacterial species. At present, in order to validly designate a species, it needs to be further characterized for at least one discriminative trait. Moreover, the designation of a species would also require inclusion of their corresponding type strain.

The methods that were used were congruent in supporting the existence of multiple species of Rhodococcus. However, each of the methods has limitations. The

tree developed based on whole genome sequences was computationally and time 43

44 intensive and precluded us from using methods with stronger statistical frameworks for phylogenetic reconstruction and also became too time-intensive for larger datasets.

MLSA is limited by the need for a sufficient set of generalizable and informative marker genes. Despite the reduced number of sequences, the MLSA approach was still computationally and time intensive with larger datasets. ANIb can be limiting in projecting relationships in an evolutionary context (Figure 2.2). Though the dendrogram constructed based on Euclidean distances was a convenient way for visualizing dissimilarities, there were nevertheless some discrepancies between placement of isolates and their measured ANI values (Figure 2.5). For example, R. opacus PD630 and R. wratislaviensis NBRC 100605 have an ANI value that warranted consideration as a species, but their dissimilarities to other isolates prevented the two from clustering in the dendogram (Figure 2.5). Lastly, similarities based on codon usage preferences and genome inventories were inadequate for resolving relationships between isolates with highly similar genome signatures (Figure 2.3). Nevertheless, when multiple methods were coupled, we were able to make strong inferences regarding taxonomical relationships.

In summary, we used whole genome sequences to resolve the phytopathogenic members of Rhodococcus into multiple sister species and developed a dataset that contributes to reconstructing the phylogeny of the Rhodococcus genus.

ACKNOWLEDGEMENTS

We would like to thank Alex Buchanan, Ethan Chang, and Tyler Horton, of the

Chang lab for their assistance. We would also like to thank Joyce Loper for critical 44

45 reading of this manuscript. This work is supported by a grant from the Agricultural

Research Foundation awarded to JHC. EWD is supported by a Provost’s Distinguished

Graduate Fellowship awarded by Oregon State University. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1314109 to EWD. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the National Science Foundation. This work is also supported in part by the Society of American Florists and a cooperative agreement with

USDA-ARS awarded to MLP. OMV was supported by grants from FRS-FNRS and the

Fonds David et Alice Van Buuren (Belgium). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

45

46

BIBLIOGRAPHY

Adékambi, T., Butler, R. W., Hanrahan, F., Delcher, A. L., Drancourt, M., and Shinnick, T. M. (2011). Core Gene Set As the Basis of Multilocus Sequence Analysis of the Subclass Actinobacteridae. PLoS ONE 6, e14792. doi:10.1371/journal.pone.0014792.t003.

Bargen, von, K., and Haas, A. (2009). Molecular and infection biology of the horse pathogen Rhodococcus equi. FEMS Microbiol Rev 33, 870–891. doi:10.1111/j.1574-6976.2009.00181.x.

Bertani, G. (1951). Studies on lysogenesis. I. The mode of phage liberation by lysogenic Escherichia coli. J Bacteriol 62, 293–300.

Bohlin, J., Skjerve, E., and Ussery, D. W. (2008). Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes. BMC Genomics 9, 104. doi:10.1186/1471-2164-9-104.

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421. doi:10.1186/1471-2105-10-421.

Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17, 540–552.

Chan, J. Z.-M., Halachev, M. R., Loman, N. J., Constantinidou, C., and Pallen, M. J. (2012). Defining bacterial species in the genomic era: insights from the genus Acinetobacter. BMC Microbiol 12, 302. doi:10.1186/1471-2180-12-302.

Cornelis, K., Ritsema, T., Nijsse, J., Holsters, M., Goethals, K., and Jaziri, M. (2001). The plant pathogen Rhodococcus fascians colonizes the exterior and interior of the aerial parts of plants. Molecular Plant-Microbe Interactions 14, 599–608. doi:10.1094/MPMI.2001.14.5.599.

Creason, A. L., Vandeputte, O. M., Savory, E. A., Davis, E. W., Putnam, M. L., Hu, E., Swader-Hines, D., Mol, A., Baucher, M., Prinsen, E., et al. (2014). Analysis of genome sequences from plant pathogenic Rhodococcus reveals genetic novelties in virulence Loci. PLoS ONE 9, e101996. doi:10.1371/journal.pone.0101996.

Crespi, M., Messens, E., Caplan, A. B., Van Montagu, M., and Desomer, J. (1992). Fasciation induction by the phytopathogen Rhodococcus fascians depends upon a linear plasmid encoding a cytokinin synthase gene. EMBO J 11, 795–804.

Darling, A. E., Mau, B., and Perna, N. T. (2010). progressiveMauve: multiple genome

alignment with gene gain, loss and rearrangement. PLoS ONE 5, e11147. 46

doi:10.1371/journal.pone.0011147.

47

Doolittle, W. F., and Zhaxybayeva, O. (2009). On the origin of prokaryotic species. Genome Res 19, 744–756. doi:10.1101/gr.086645.108.

Dudnik, A., Bigler, L., and Dudler, R. (2014). The Endophytic Strain Rhizobium sp. AP16 Produces the Proteasome Inhibitor Syringolin A. Appl Environ Microbiol. doi:10.1128/AEM.00395-14.

Durán, D., Rey, L., Mayo, J., Zuñiga-Davila, D., Imperial, J., Ruiz-Argüeso, T., Martinez Romero, E., and Ormeño-Orrillo, E. (2014). Bradyrhizobium paxllaeri sp. nov. and Bradyrhizobium icense sp. nov., nitrogen-fixing rhizobial symbionts of Lima bean (Phaseolus lunatus L.) in Peru. International Journal Of Systematic And Evolutionary Microbiology. doi:10.1099/ijs.0.060426-0.

Francis, I., De Keyser, A., De Backer, P., Simón-Mateo, C., Kalkus, J., Pertry, I., Ardiles-Diaz, W., De Rycke, R., Vandeputte, O. M., Jaziri, El, M., et al. (2012). pFiD188, the Linear Virulence Plasmid of Rhodococcus fascians D188. Molecular Plant-Microbe Interactions 25, 637–647. doi:10.1094/MPMI-08-11- 0215.

Gao, B., and Gupta, R. S. (2012). Phylogenetic framework and molecular signatures for the main clades of the phylum Actinobacteria. Microbiol Mol Biol Rev 76, 66– 112. doi:10.1128/MMBR.05011-11.

Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., and Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal Of Systematic And Evolutionary Microbiology 57, 81–91. doi:10.1099/ijs.0.64483-0.

Gürtler, V., Mayall, B. C., and Seviour, R. (2004). Can whole genome analysis refine the taxonomy of the genus Rhodococcus? FEMS Microbiol Rev 28, 377–403.

Hjerde, E., Pierechod, M. M., Williamson, A. K., Bjerga, G. E. K., Willassen, N. P., Smalås, A. O., and Altermark, B. (2013). Draft Genome Sequence of the Actinomycete Rhodococcus sp. Strain AW25M09, Isolated from the Hadsel Fjord, Northern Norway. Genome Announc 1, e0005513. doi:10.1128/genomeA.00055- 13.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 90–95.

Jones, AL, and Goodfellow, ML. 2012. "Genus IV. Rhodococcus (Zopf 1891) emend Goodfellow, Alderson and Chun 1998a" in Bergey’s Manual of Systematic Bacteriology 2nd Edition Volume Five The Actinobacteria, Part A eds Goodfellow M, Kampfer P, Busee, HJ, Trujillo, ME, Suzuki K, Ludwig W, and Whitman, WB (New York, NY: Springer), 437-477.

47

Ishikawa, J., Yamashita, A., Mikami, Y., Hoshino, Y., Kurita, H., Hotta, K., Shiba, T., and Hattori, M. (2004). The complete genomic sequence of

48

IFM 10152. Proc Natl Acad Sci USA 101, 14925–14930. doi:10.1073/pnas.0406410101.

Kado, C. I., and Heskett, M. G. (1970). Selective media for isolation of Agrobacterium, , Erwinia, Pseudomonas, and Xanthomonas. Phytopathology 60, 969–976.

Katoh, K., and Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol 30, 772–780. doi:10.1093/molbev/mst010.

Kim, M., Oh, H.-S., Park, S.-C., and Chun, J. (2014). Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. International Journal of Systematic And Evolutionary Microbiology 64, 346–351. doi:10.1099/ijs.0.059774-0.

Konishi, M., Nishi, S., Fukuoka, T., Kitamoto, D., Watsuji, T.-O., Nagano, Y., Yabuki, A., Nakagawa, S., Hatada, Y., and Horiuchi, J.-I. (2014). Deep-sea Rhodococcus sp. BS-15, Lacking the Phytopathogenic fas Genes, Produces a Novel Glucotriose Lipid Biosurfactant. Mar. Biotechnol. doi:10.1007/s10126-014-9568-x.

Konstantinidis, K. T., and Tiedje, J. M. (2005). Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA 102, 2567–2572. doi:10.1073/pnas.0409727102.

Konstantinidis, K. T., Ramette, A., and Tiedje, J. M. (2006). The bacterial species definition in the genomic era. Philos Trans R Soc Lond, B, Biol Sci 361, 1929– 1940. doi:10.1098/rstb.2006.1920.

Lan, R., and Reeves, P. R. (1996). Gene transfer is a major factor in bacterial evolution. Mol Biol Evol 13, 47–55.

Larkin, M. J., Kulakov, L. A., and Allen, C. C. R. (2005). Biodegradation and Rhodococcus--masters of catabolic versatility. Curr Opin Biotechnol 16, 282– 290. doi:10.1016/j.copbio.2005.04.007.

Lassalle, F., Campillo, T., Vial, L., Baude, J., Costechareyre, D., Chapulliot, D., Shams, M., Abrouk, D., Lavire, C., Oger-Desfeux, C., et al. (2011). Genomic species are ecological species as revealed by comparative genomics in Agrobacterium tumefaciens. Genome Biol Evol 3, 762–781. doi:10.1093/gbe/evr070.

Letek, M., González, P., Macarthur, I., Rodríguez, H., Freeman, T. C., Valero-Rello, A., Blanco, M., Buckley, T., Cherevach, I., Fahey, R., et al. (2010). The genome of a pathogenic Rhodococcus: cooptive virulence underpinned by key gene acquisitions. PLoS Genet 6, e1001145. doi:10.1371/journal.pgen.1001145.

48

Letek, M., Ocampo-Sosa, A. A., Sanders, M., Fogarty, U., Buckley, T., Leadon, D. P., González, P., Scortti, M., Meijer, W. G., Parkhill, J., et al. (2008). Evolution of

49

the Rhodococcus equi vap pathogenicity island seen through comparison of host- associated vapA and vapB virulence plasmids. J Bacteriol 190, 5797–5805. doi:10.1128/JB.00468-08.

Letunic, I., and Bork, P. (2011). Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res 39, W475–8. doi:10.1093/nar/gkr201.

Loper, J. E., Hassan, K. A., Mavrodi, D. V., Davis, E. W., Lim, C. K., Shaffer, B. T., Elbourne, L. D. H., Stockwell, V. O., Hartney, S. L., Breakwell, K., et al. (2012). Comparative Genomics of Plant-Associated Pseudomonas spp.: Insights into Diversity and Inheritance of Traits Involved in Multitrophic Interactions. PLoS Genet 8, e1002784. doi:10.1371/journal.pgen.1002784.

Marakeby, H., Badr, E., Torkey, H., Song, Y., Leman, S., Monteil, C. L., Heath, L. S., and Vinatzer, B. A. (2014). A system to automatically classify and name any individual genome-sequenced organism independently of current biological classification and nomenclature. PLoS ONE 9, e89142. doi:10.1371/journal.pone.0089142.

McLeod, M. P., Warren, R. L., Hsiao, W. W. L., Araki, N., Myhre, M., Fernandes, C., Miyazawa, D., Wong, W., Lillquist, A. L., Wang, D., et al. (2006). The complete genome of Rhodococcus sp. RHA1 provides insights into a catabolic powerhouse. Proc Natl Acad Sci USA 103, 15582–15587. doi:10.1073/pnas.0607048103.

Miteva, V. I., Sheridan, P. P., and Brenchley, J. E. (2004). Phylogenetic and physiological diversity of microorganisms isolated from a deep Greenland glacier ice core. Appl Environ Microbiol 70, 202–213.

Na, K.-S., Kuroda, A., Takiguchi, N., Ikeda, T., Ohtake, H., and Kato, J. (2005). Isolation and characterization of benzene-tolerant Rhodococcus opacus strains. Journal of Bioscience and Bioengineering 99, 378–382. doi:10.1263/jbb.99.378.

Perez, F., and Granger, B. E. (2007). IPython: a system for interactive scientific computing. Computing in Science & Engineering 9, 21–29.

Plotkin, J. B., and Kudla, G. (2011). Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet 12, 32–42. doi:10.1038/nrg2899.

Putnam, M. L., and Miller, M. L. (2007). Rhodococcus fascians in herbaceous perennials. Plant Dis 91, 1064–1076. doi:10.1094/PDIS-91-9-1064.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Available at: http://www.R-project.org.

Richter, M., and Rosselló-Móra, R. (2009). Shifting the genomic gold standard for the

49

prokaryotic species definition. Proceedings of the National Academy of Sciences 106, 19126–19131. doi:10.1073/pnas.0906412106.

50

Rissman, A. I., Mau, B., Biehl, B. S., Darling, A. E., Glasner, J. D., and Perna, N. T. (2009). Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 25, 2071–2073. doi:10.1093/bioinformatics/btp356.

Robbertse, B., Yoder, R. J., Boyd, A., Reeves, J., and Spatafora, J. W. (2011). Hal: an Automated Pipeline for Phylogenetic Analyses of Genomic Data. PLoS Curr 3, RRN1213. doi:10.1371/currents.RRN1213.

Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069. doi:10.1093/bioinformatics/btu153.

Sekine, M., Tanikawa, S., Omata, S., Saito, M., Fujisawa, T., Tsukatani, N., Tajima, T., Sekigawa, T., Kosugi, H., Matsuo, Y., et al. (2006). Sequence analysis of three plasmids harboured in Rhodococcus erythropolis strain PR4. Environ Microbiol 8, 334–346. doi:10.1111/j.1462-2920.2005.00899.x.

Sharp, P. M., and Li, W. H. (1987). The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281–1295.

Sheng, H. M., Gao, H. S., Xue, L. G., Ding, S., Song, C. L., Feng, H. Y., and An, L. Z. (2011). Analysis of the composition and characteristics of culturable endophytic bacteria within subnival plants of the Tianshan Mountains, northwestern China. Curr Microbiol 62, 923–932. doi:10.1007/s00284-010-9800-5.

Shevtsov, A., Tarlykov, P., Zholdybayeva, E., Momynkulov, D., Sarsenova, A., Moldagulova, N., and Momynaliev, K. (2013). Draft Genome Sequence of Rhodococcus erythropolis DN1, a Crude Oil Biodegrader. Genome Announc 1. doi:10.1128/genomeA.00846-13.

Sonnhammer, E. L. L., and Hollich, V. (2005). Scoredist: a simple and robust protein sequence distance estimator. BMC Bioinformatics 6, 108. doi:10.1186/1471- 2105-6-108.

Staley, J. T. (2009). The phylogenomic species concept for Bacteria and Archaea. Microbe 4, 361–365.

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30, 1312–1313. doi:10.1093/bioinformatics/btu033.

Stes, E., Vandeputte, O. M., Jaziri, El, M., Holsters, M., and Vereecke, D. (2011). A successful bacterial coup d'état: how Rhodococcus fascians redirects plant development. Annu Rev Phytopathol 49, 69–86. doi:10.1146/annurev-phyto- 072910-095217.

50

Teeling, H., Waldmann, J., Lombardot, T., Bauer, M., and Glöckner, F. O. (2004). TETRA: a web-service and a stand-alone program for the analysis and comparison

51

of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5, 163. doi:10.1186/1471-2105-5-163. van der Wolf, J. M., Nijhuis, E. H., Kowalewska, M. J., Saddler, G. S., Parkinson, N., Elphinstone, J. G., Pritchard, L., Toth, I. K., Lojkowska, E., Potrykus, M., et al. (2014). Dickeya solani sp. nov., a pectinolytic plant-pathogenic bacterium isolated from potato (Solanum tuberosum). International Journal Of Systematic And Evolutionary Microbiology 64, 768–774. doi:10.1099/ijs.0.052944-0.

Verma, M., Lal, D., Kaur, J., Saxena, A., Kaur, J., Anand, S., and Lal, R. (2013). Phylogenetic analyses of phylum actinobacteria based on whole genome sequences. Res Microbiol. doi:10.1016/j.resmic.2013.04.002.

Zerbino, D. R., and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–829. doi:10.1101/gr.074492.107.

51

52

Table 2.1. Isolates of Rhodococcus selected for whole genome sequencing Geographic Isolate* Source Year§ location Group† Greenland glacial GIC26 Greenland >120,000 years Sub-clade i ice core Greenland glacial GIC36 Greenland >120,000 years Sub-clade i ice core Lavandula 05-561-1 angustifolia ‘Violet Washington, USA 2005 Sub-clade i Intrigue’ Chrysanthemum x LMG3605 United unknown Sub-clade i morifolium Chrysanthemum x D188 Europe 1984 Sub-clade i morifolium Moerbeke, LMG3602 Lilium longiflorum unknown Sub-clade i Belgium LMG3623 Lathyrus odoratus USA unknown Sub-clade i (Tilford's strain) Heliopsis A3b helianthoides Michigan, USA 2005 Sub-clade i 'Loraine Sunshine' LMG3616 Lathyrus odoratus United Kingdom unknown Sub-clade i Leucanthemum x A78 Pennsylvania, USA 2002 Sub-clade i superbum 'Becky' Oenothera speciosa A21d2 Michigan, USA 2002 Sub-clade ii 'Siskiyou'

Aster x 'Woods 04-516 Florida, USA 2004 Sub-clade ii Pink'

A25f Nemesia x 'Natalie' Washington, USA 2002 Sub-clade ii

LMG3625 Lathyrus odoratus United Kingdom 1958 Sub-clade ii

Hosta 'Blue 05-339-1 Michigan, USA 2005 Sub-clade iii Umbrellas'

Veronica spicata A76 Michigan, USA 2002 Sub-clade iii 'Royal Candles'

Veronica spicata A44a Oregon, USA 2002 Clade II ‘Minuet' Campanula x 02-815 Michigan, USA 2002 Clade II 'Sarastro' Viola x 'Purple 02-816c Michigan, USA 2002 Clade II Showers' Aster amellus A73a Pennsylvania, USA 2003 Clade II ‘Violet Queen' *Isolates designated with LMG were obtained from Belgium co-ordinated collection of micro- organisms (BCCM); GIC isolates are from a Greenland glacial ice core; remaining isolates were obtained from diseased plants submitted to the Oregon State University (OSU) Plant Clinic. Italicized isolates = first sequenced using a hybrid approach; bold = type strain. §Year deposited (BCCM), isolated (OSU plant clinic), or trapped in ice (GIC isolates). †Group designation is based on this study; sub-clades i-iii all belong to Clade I. 52

53

Figure 2.1. Neighbor-joining tree based on 1727 homologous genes. A rooted neighbor-joining tree was constructed using translated sequences from 1727 genes present in all 20 isolates and the Nocardia farcinica type strain (not shown). Clade and sub-clade designations are indicated at the corresponding node. Scale bar = number of amino acid substitutions per site; only branches with lengths greater than

zero are indicated. Isolates with a linear plasmid are denoted with a diagonal bar; empty 53

boxes = absence of linear plasmid. Phytopathogenic isolates are indicated with, “+”; non-pathogenic isolates are indicated with, “-”.

54

Figure 2.2. Cloud plot of ANIb vs TETRA for all possible pairwise comparisons.

Average nucleotide identity (ANIb; x-axis) and tetranucleotide (TETRA; y-axis) usage patterns were calculated and plotted as a factor of isolate grouping (z-axis). All pairwise comparisons, including reciprocal comparisons are presented, with colors assigned based on the clade membership of the isolate being compared to. Gray colored areas demark 96% ANI and 0.997 TETRA thresholds. Clouds are demarcated by dotted lines. Circles with black borders are below the TETRA threshold. Circles with yellow borders exceed the TETRA threshold but not the ANIb threshold. The black circle represents a comparison between R. jostii RHA1 and R. equi 103S.

54

55

Figure 2.3. Heatmaps of codon usage preference and homology.

(A) Codon usage preference similarities were calculated for all possible pairwise comparisons. Lower values indicate fewer differences. Isolates were ordered according to their phylogenetic relationships. (B) Reciprocal BLASTP analysis was used to determine percent homology for all possible pairwise comparisons. Larger values indicate greater similarities. Isolates were ordered according to their phylogenetic relationships.

55

56

Figure 2.4. Multi-locus sequence analysis maximum likelihood tree of the Actinobacterium phylum.

Translated sequences for ftsY, infB, rpoB, rsmA, secY, tsaD, and ychF from 1316 members were identified using TBLASTN, aligned, and used to generate a multi-locus maximum likelihood tree. A total of 961 sequences were used as input for tree generation. The 21 Rhodococcus isolates sequenced by our group are shown in bold and the two clades that include phytopathogenic Rhodococcus spp. are indicated. The R. erythropolis, R. rhodochrous, and R. equi clades, previously identified based on a 16S rDNA phylogeny of Rhodococcus are labeled with a, b, and c, respectively. Type strains are indicated with a superscript “T”. Branches outside of the Rhodococcus genus were collapsed at the genus, family, suborder, order, and subclass level, as appropriate, with the corners of the triangle indicating the shortest and longest total branch lengths for the members of the collapsed clade. A total of 500 rapid bootstraps were performed on this dataset, and branches are colored on a gradient to indicate bootstrap percentage (Green-Cyan-Red, with cutoffs of 100-75-50 and below, respectively). Scale bar = mean number of amino acid substitutions per site.

56

57

Figure 2.4. Multi-locus sequence analysis maximum likelihood tree of the Actinobacterium phylum.

57

58

Figure 2.5. Average nucleotide identity dendogram for 59 isolates of Rhodococcus.

Complete genome sequences for 59 isolates of Rhodococcus were used to generate an ANI matrix. The matrix was used to calculate an ANI divergence dendrogram. Groups are color coded according to groups represented in the MLSA ML tree. Branches are colored using cutoffs for pairwise comparisons of all taxa after the nodal point. The 21 Rhodococcus isolates sequenced by our group are shown in bold; type strains are designated with a superscript “T”. Clades and sub-clades of the phytopathogenic Rhodococcus isolates are labeled at the corresponding node. *Indicates conflict between placement within the dendogram and calculated ANI values (see corresponding text for details).

58

59

Supplemental Figure 2.1. Genome alignments of four members of phytopathogenic Rhodococcus.

Sequences of the chromosomes of isolates D188, A21d2, 05-339-1, and A44a were aligned using progressiveMauve. Each colored square represents a block of sequences that is collinear to a corresponding block of sequences in another genome sequence; linear collinear blocks (LCBs). The extent of homology within LCBs is represented by the height of the colored plot in each block.

59

60

Supplemental Data 2.1. Compressed folder with MLSA tree in Newick tree format file and its associated data. The phylogenetic tree of the entire Actinobacteria phylum is available as a Newick file. Also included are three comma separate files. 1) SD1_identical_seqs.csv: lists the names of the duplicate sequences that were removed prior to generating the final tree. The names of the species used in generating the tree are listed in the first column. The remaining columns contain the names of the species that were filtered out based on having an identical sequence. 2) SD1_names_identifiers.csv: associates genome sequence identifiers (either GI number for complete genomes, or 4 letter code for wgs sequences) to corresponding isolates in the tree. 3) SD1_taxonomy.csv: lists the taxonomy values for each of the taxa.

Supplemental Data 2.2. Compressed folder with Python script and input data for ANI dendrogram generation. A Python script file is included that can generate an ANI dendrogram from the two input files, one that includes the ANI divergence matrix (SD2_ani_data.txt), and one that includes the names of the genomes included in the analysis (SD2_ani_names.txt).

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fpls.2014.00406/abstract

60

61

Chapter 3 – Gall-ID: tools for genotyping gall-causing phytopathogenic bacteria

Edward W. Davis II†, Alexandra J. Weisberg†, Javier F. Tabima, Niklaus J. Grünwald, and Jeff H. Chang

† These authors contributed equally

PeerJ

2016; 4: e2222. 61

PMCID: PMC4958008

62

ABSTRACT

Understanding the population structure and genetic diversity of plant pathogens, as well as the effect of agricultural practices on pathogen evolution, are important for disease management. Developments in molecular methods have contributed to increase the resolution for accurate pathogen identification but those based on analysis of DNA sequences can be less straightforward to use. To address this, we developed Gall-ID, a web-based platform that uses DNA sequence information from 16S rDNA, multilocus sequence analysis and whole genome sequences to group disease-associated bacteria to their taxonomic units. Gall-ID was developed with a particular focus on gall-forming bacteria belonging to Agrobacterium, Pseudomonas savastanoi, Pantoea agglomerans, and Rhodococcus. Members of these groups of bacteria cause growth deformation of plants, and some are capable of infecting many species of field, orchard, and nursery crops. Gall-ID also enables the use of high- throughput sequencing reads to search for evidence for homologs of characterized virulence genes, and provides downloadable software pipelines for automating multilocus sequence analysis, analyzing genome sequences for average nucleotide identity, and constructing core genome phylogenies. Lastly, additional databases were included in Gall-ID to help determine the identity of other plant pathogenic bacteria that may be in microbial communities associated with galls or causative agents in other diseased tissues of plants. The URL for Gall-ID is http://gall-id.cgrb.oregonstate.edu/.

62

63

INTRODUCTION

Diagnostics

Determining the identity of the disease causing pathogen, establishing its source of introduction, and/or understanding the genetic diversity of pathogen populations are critical steps for containment and treatment of disease. Proven methods for identification have been developed based on discriminative phenotypic and genotypic characteristics, including presence of antigens, differences in metabolism, or fatty acid methyl esters, and assaying based on polymorphic nucleotide sequences (Alvarez

2004). For the latter, polymerase chain reaction (PCR) amplification-based approaches for amplifying informative regions of the genome can be used. These regions should have broadly conserved sequences that can be targeted for amplification but the intervening sequences need to provide sufficient resolution to infer taxonomic grouping.

The 16S rDNA sequence is commonly used for identification (Stackebrandt &

Goebel 1994). Because of highly conserved regions in the gene sequence, a single pair of degenerate oligonucleotide primers can be used to amplify the gene from a diversity of bacteria, and allow for a kingdom-wide comparison. In general, the sequences of the amplified fragments have enough informative polymorphic sites to delineate genera, but do not typically allow for more refined taxonomic inferences at the sub-genus level

(Janda & Abbott 2007). Multilocus sequence analysis (MLSA) leverages the phylogenetic signal from four to ten genes to provide increased resolution, and can distinguish between species and sometimes sub-species (Wertz et al. 2003; Zeigler

2003). MLSA however, is more restricted than the use of 16S rDNA sequences and 63

may not allow for comparisons between members of different genera. MLSA also

64 requires more time and effort to identify informative and taxon-specific genes as well as develop corresponding oligonucleotide primer sets.

Whole genome sequences can also be used. This is practical because of advances in next generation sequencing technologies. The key advantages of this approach is that availability of whole genome sequences obviates the dependency on a priori knowledge to provide clues on taxonomic association of a pathogen and the need to select and amplify marker genes. Also, whole genome sequences provide the greatest resolution in terms of phylogenetic signal, and sequences that violate assumptions of phylogenetic analyses (e.g., not vertically inherited) can be removed from studies to allow for robust analyses. Briefly, sequencing reads of genome(s) are compared to a high quality draft or finished reference genome sequence to identify variable positions between the genome sequences (Pearson et al. 2009). The positions core to the compared genome sequences are aligned and used to generate a phylogenetic tree.

Alternatively, whole genome sequences can be used to determine average nucleotide identities (ANI) between any sufficiently similar pair, e.g., within the same taxonomic family, of genome sequences to determine genetic relatedness (Goris et al. 2007; Kim et al. 2014). ANI can be used to make taxonomic inferences, as a 95% threshold for

ANI has been calibrated to those used to operationally define bacterial species based on 16S rDNA (> 94% similarity) and DNA-DNA hybridization (DDH; 70%) (Goris et al. 2007). Finally, the whole genome sequences can be analyzed to inform on more than just the identity of the causative agent and provide insights into mechanisms and evolution of virulence. However, a non-trivial trade-off is that processing, storing, and

64

65 analyzing whole genome sequence data sets require familiarity with methods in computational biology.

Agrobacterium spp.

Members from several taxa of Gram-negative bacteria are capable of causing abnormal growth of plants. Members of Agrobacterium are the most notorious causative agents of deformation of plant growth. These bacteria have been classified according to various schemes that differ in the phenotypic and genetic characteristics that were used. Its taxonomic classification has been a subject of multiple studies

(Young et al. 2001; Farrand et al. 2003; Young 2003). Here, we will use the classification scheme that is based on disease phenotype and more commonly encountered in the literature. Within Agrobacterium there are four recognized groups of gall-causing bacteria. A. tumefaciens (also known as Rhizobium radiobacter (Young et al. 2001), formerly A. radiobacter, genomovar G8 forms A. fabrum (Lassalle et al.

2011)) can cause crown gall disease that typically manifests as tumors on roots or stem tissue (Gloyer 1934; Kado 2014). A. tumefaciens can infect a wide variety of hosts and the galls can restrict plant growth and in some cases kill the plant (Gloyer 1934; Schroth

1988). Other gall causing clades include A. vitis (restricted to infection of grapevine),

A. rubi (Rubus galls), and A. larrymoorei, which is sufficiently different based on results of DNA-DNA hybridization studies to justify a species designation (Hildebrand

1940; Ophel & Kerr 1990; Bouzar & Jones 2001). The genus was also traditionally recognized to include hairy-root inducing bacteria belonging to A. rhizogenes, as well

as non-pathogenic biocontrol isolates belonging to A. radiobacter (Young et al. 2001; 65

66

Velázquez et al. 2010). Members of Agrobacterium are atypical in having multipartite genomes, which in some cases include a linear replicon (Allardet-Servent et al. 1993).

A Ti (tumor-inducing) plasmid imparts upon members of Agrobacterium the ability to genetically modify its host and cause dysregulation of host phytohormone levels and induce gall formation (Sachs 1975). The Ti plasmid contains a region of

DNA (T-DNA) that is transferred and integrated into the genome of the host cell (Van

Larebeke et al. 1974; Chilton et al. 1977; Thompson et al. 1988; Ward et al. 1988;

Broothaerts et al. 2005). Conjugation is mediated by a type IV secretion system, encoded by genes located outside the borders of the T-DNA on the Ti plasmid

(Thompson et al. 1988). Within the T-DNA are key genes that encode for auxin, cytokinin, and opine biosynthesis (Morris 1986; Binns & Costantino 1998). Expression of the former two genes in the plant leads to an increase in plant hormone levels to cause growth deformation whereas the latter set of genes encode enzymes for the synthesis of modified sugars that only organisms with the corresponding opine catabolism genes can use as an energy source (Bomhoff et al. 1976; Montoya et al.

1977). The latter genes are located outside the T-DNA borders on the Ti plasmid (Zhu et al. 2000).

Pseudomonas savastanoi

Pseudomonas savastanoi (formerly Pseudomonas syringae pv. savastanoi,

(Gardan et al. 1992)) is the causal agent of olive knot disease, typically forming as aerial tumors on stems and branches. Phytopathogenicity of P. savastanoi is dependent

on the hrp/hrc genes located in a pathogenicity island on the chromosome (Sisto et al. 66

2004). These genes encode for a type III secretion system, a molecular syringe that

67 injects type III effector proteins into host cells that collectively function to dampen host immunity (Chang et al. 2014). Phytopathogenicity of P. savastanoi is also associated with the production of phytohormones. Indole-3-acetic acid may have an indirect role as a bacterial signaling molecule (Aragon et al. 2014). A cytokinin biosynthesis gene has also been identified on plasmids in P. savastanoi and strains cured of the plasmid caused smaller galls but were not affected in growth within the galls (Iacobellis et al.

1994; Bardaji et al. 2011).

Pantoea agglomerans

Pantoea agglomerans (formerly Erwinia herbicola) is a member of the

Enterobacteriaceae family. P. agglomerans can induce the formation of galls on diverse species of plants (Cooksey 1986; Burr et al. 1991; Opgenorth et al. 1994; DeYoung et al. 1998; Vasanthakumar & McManus 2004). Phytopathogenicity is dependent on the pPATH plasmid (Manulis & Barash 2003; Weinthal et al. 2007). This plasmid contains a pathogenicity island consisting of an hrp/hrc cluster and operons encoding for the biosynthesis of cytokinins, indole-3-acetic acid, and type III effectors (Clark et al.

1993; Lichter et al. 1995; Nizan et al. 1997; Mor et al. 2001; Nizan-Koren et al. 2003;

Barash & Manulis 2005; Barash et al. 2005; Barash & Manulis-Sasson 2007). As is the case with P. savastanoi, mutants of the hrp/hrc genes abolish pathogenicity whereas mutations in the phytohormone biosynthesis genes led to galls of reduced size (Manulis et al. 1998; Mor et al. 2001; Nizan-Koren et al. 2003; Barash & Manulis-Sasson 2007).

67

68

Rhodococcus spp.

Gram-positive bacteria within the Rhodococcus genus can cause leafy gall disease to over 100 species of plants (Putnam & Miller 2007). The phytopathogenic members of this genus belong to at least two genetically distinct groups of bacteria, with R. fascians (formerly Corynebacterium fascians) being the original recognized species (Goodfellow 1984; Creason et al. 2014a). It is suggested that R. fascians upsets levels of phytohormones of the plant to induce gall formation. However, unlike Ti plasmid-carrying Agrobacterium, it is hypothesized that R. fascians directly synthesizes and secretes the cytokinin phytohormone (Stes et al. 2013; Creason et al.

2014b). Phytopathogenicity is most often associated with a linear plasmid, which carries a cluster of virulence loci, att, fasR, and fas (Creason et al. 2014b). The functions for the translated products of att are unknown but the sequences have homology to proteins involved in amino acid and antibiotic biosynthesis (Maes et al. 2001). The fasR gene is necessary for full virulence; the gene encodes a putative transcriptional regulator (Temmerman et al. 2000). Some of the genes within the fas operon are necessary for virulence, as many of the fas genes encode proteins with demonstrable functions in cytokinin metabolism (Crespi et al. 1992). In rare cases, the virulence loci, or variants therein, are located on the chromosome (Creason et al. 2014b).

We developed Gall-ID to aid in determining the genetic identity of gall-causing members of Agrobacterium, Pseudomonas, Pantoea, and Rhodococcus. Users can provide sequences from 16S rDNA or gene sets used in MLSA, and Gall-ID will automatically query curated databases and generate phylogenetic trees to group the

query isolate of interest and provide estimates of relatedness to previously 68

characterized species and/or genotypes. Users can also submit short reads from whole

69 genome sequencing projects to query curated databases to search for evidence for known virulence genes of these gall-causing bacteria. Finally, users can download tools that automate the analysis of whole genome sequencing data to infer genetic relatedness based on MLSA, average nucleotide identity (ANI), or single nucleotide polymorphisms (SNPs).

MATERIAL AND METHODS

Website framework and bioinformatics tools

The Gall-ID website and corresponding R shiny server backend are based on the Microbe-ID platform (Tabima et al. 2016) but includes major additions and modifications: Auto MLSA, Auto ANI, BLAST with MAFFT, and the WGS Pipeline.

The MLSA framework website was extended to support building Neighbor-Joining trees using incomplete distance matrices (NJ*) using the function njs() in the R package

PHYLOCH (Paradis et al. 2004). The MLSA framework was also modified to use the multiple sequence alignment program MAFFT using the R package PHYLOCH (Katoh

& Toh 2008; Heibl 2013; Katoh & Standley 2013). This allows user-submitted sequences to be added to pre-existing sequence alignments using the MAFFT "--add" function, to dramatically reduce the computational time required for analysis.

The server hosting the Gall-ID tools is running Centos Linux release 6.6,

MAFFT version 7.221, SRST2 version 0.1.5, Bowtie 2 version 2.2.3, and Samtools version 0.1.18. Gall-ID uses R version 3.1.2 with the following R packages: Poppr version 1.1.0.99 (Kamvar et al. 2014), Ape version 3.1-1 (Paradis et al. 2004),

PHYLOCH version 1.5-5, and Shiny version 0.8.0. 69

70

The Auto MLSA tool was developed previously (Creason et al. 2014a). Briefly,

Auto MLSA does the following: BLAST (either TBLASTN or BLASTN) to query

NCBI user-selectable databases and/or local databases and retrieve sequences, filter out incomplete sets of gene sequences, align gene sequence individually, concatenate aligned gene sequences, determine the best substitution model (for amino acid sequences), filter out identical sequences, append key information to sequences, and generate a partition file for RAxML (Stamatakis 2014). Auto MLSA also has the option of using Gblocks to trim alignments (Castresana 2000). Auto MLSA was modified to use the NCBI E-utilities, implemented in BioPerl, to associate accession numbers with taxon IDs, species names, and assembly IDs (Stajich 2002). For organisms without taxon identifiers, Auto MLSA will attempt to extract meaningful genus and species information from the NCBI nucleotide entry. Gene sequences are linked together using assembly IDs, which allows for genomes with multiple chromosomes to be compared, without having to rely on potentially ambiguous organism names. When assembly IDs are unavailable, whole genome sequences are linked using the four letter WGS codes, and, as a last resort, sequences will be associated using their nucleotide accession number. The disadvantage of using the latter approach is that organisms with multiple replicons, each with its own accession number, will be excluded from analysis. Auto

MLSA is available for download from the Gall-ID website. Detailed instructions for using the tools are provided.

The Auto ANI script automates the calculations of ANI for all pairwise combinations for any number of input genome sequences. Each of the supplied genome

sequences are chunked into 1020 nt fragments and used as queries in all possible 70

71 reciprocal pairwise BLAST searches. Parameters for genome chunk size, percent identity, and percent coverage have default values set according to published guidelines but can be changed by the user (Goris et al. 2007; Creason et al. 2014a). BLAST version

2.2.31+ was used with recommended settings and previously described in Creason et al. (2014): –task blastn –dust no –xdrop_gap 150 – penalty −1 –reward 1 –gapopen 5

–gapextend 2 (Goris et al. 2007). BLAST hits above the user-specified cut-offs (30% identity, 70% coverage, by default) are averaged to calculate the pairwise ANI values.

BLAST+ version 2.2.27 has been tested and works, but this version is currently unsupported. Versions 2.2.28-2.2.30 of BLAST+ have an undocumented bug that prevents efficient filtering using -max_hsps and -max_target_seqs and precludes their use in ANI calculation. Hence BLAST 2.2.31+ is the preferred and recommended version.

Sequences downloaded from NCBI are linked using assembly IDs. All accession types from NCBI are supported, assuming accession numbers are provided in the header line of the FASTA file. Locally generated genome sequences are also supported, in FASTA format, provided they follow the specified header format listed in the user guide. Alternatively, an auxiliary script is provided to rename headers within user-generated FASTA files to the supported format.

The WGS Pipeline was written in bash shell script and Perl. Paired Illumina sequencing reads located in the "reads" folder of the pipeline are processed in pairs.

The program SMALT (Ponstingl, 2013) is used to align reads to a reference genome and produce CIGAR format output files (Ponstingl 2013). The SSAHA_pileup

program converts the CIGAR format files into individual pileup files (Ning et al. 2001). 71

72

The pileup output is then combined with any additional supplied pre-computed pileup files and used to produce a core alignment of sites shared by 90% of the represented isolates. The optional "remove_recombination.sh" script runs the program Gubbins

(Croucher et al. 2014) to remove sites identified as potentially acquired by recombination. Finally, the program RAxML is used to produce a maximum-likelihood phylogenetic tree with non-parametric bootstrap support (Stamatakis 2014). By default

20 maximum likelihood tree searches are performed, and the "autoMRE" criterion is used to determine the number of non-parametric bootstrap replicates.

The WGS Pipeline test analysis was performed and benchmarked using 10 cores of a cluster server running Centos Linux release 6.6 and containing four AMD

Opteron™ 6376 2.3 Ghz processors (64 cores total) and 512 GB of RAM (Table 3.2).

The versions of the tools used in tests of this pipeline were Perl version 5.10.1, SMALT version 0.7.6, SSAHA_pileup version 0.6, Gubbins version 1.1.2, and RAxML version

8.1.17. The default parameters for WGS Pipeline were used (20 ML search trees,

"autoMRE" cutoff for bootstrap replicates) with the exception that the maximum- allowed percentage gaps in the Gubbins recombination analysis was increased to 50% in order to retain strain D188. The WGS Pipeline scripts were also modified to not ask for user input on the command line in order to run in a Sun Grid Engine (SGE) cluster environment.

Vir-Search uses the program SRST2, which employs Bowtie 2 and Samtools, with the “--gene_db” function to align the reads to custom databases of the virulence genes (Li et al. 2009; Inouye et al. 2012; Langmead & Salzberg 2012; Inouye et al.

2014). The identity of the virulence genes that the reads input by the user align to, the 72

73 read coverage and depth, and the name of the strain corresponding to the most similar allele are parsed from the SRST2 output and reported to the user as a static webpage.

Users are emailed a link to results once the analysis is complete. The submitted sequencing reads are deleted from the server immediately after completion, and results are available only to those with a direct link to the results webpage.

Datasets

The 16S and MLSA gene sequences were downloaded from the genome sequences of the following reference strains: Agrobacterium strain C58, Rhodococcus strain A44a, P. savastanoi pv. phaseolicola 1448A, and P. agglomerans strain LMAE-

2, C. michiganensis subsp. nebraskensis NCPPB 2581, D. dadantii strain 3937, P. atrosepticum strain 21A, R. solanacearum strain GMI1000, X. oryzae pv. oryzicola strain CFBP2286, and X. fastidiosa subsp. fastidiosa GB514 (NCBI assembly ID:

GCF_000092025.1, GCF_000760735.1, GCF_000012205.1, GCF_000814075.1,

GCF_000355695.1, GCF_000147055.1, GCF_000740965.1, GCF_000009125.1,

GCF_001042735.1, and GCF_000148405.1, respectively). The gene sequences were used as input for the Auto MLSA tool in BLAST searches carried out against complete genome sequences in the NCBI non-redundant (nr) and whole genome sequence (wgs) databases. The Auto MLSA parameters were: minimum query coverage of 50% (90% for the 16S plant pathogen dataset) and e-value cutoffs of 1e-5 for nr and 1e-50 for wgs. BLAST searches were limited to the genus for the bacteria of interest, with the exceptions of Agrobacterium, which was limited to Rhizobiaceae, and P. savastanoi,

which was limited to the P. syringae group. BLAST searches were completed in August 73

of 2015. The Auto MLSA tool uses MAFFT aligner to produce multiple sequence

74 alignments for each gene (Katoh & Standley 2013). The Gblocks trimmed alignment output of Auto MLSA was not used because Gall-ID aligns user-submitted gene sequences to each full gene alignment (Castresana 2000).

Bacterial strains, growth conditions, nucleic acid extraction, and genome sequencing

Cultures of Agrobacterium were grown overnight in Lysogeny Broth (LB) media at 28°C, with shaking at 250 rpm (Table 3.3). Cells were pelleted by centrifugation and total genomic DNA was extracted using a DNeasy Blood and Tissue kit (Qiagen, Venlo, Netherlands). DNA was quantified using a QuBit Fluorometer

(Thermo Fisher, Eugene, Oregon) and libraries were prepared using an Illumina

Nextera XT DNA Library Prep kit, according to the instructions of the manufacturer, with the exception that libraries were normalized based on measurements from an

Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Each library was assigned an individual barcode using an Illumina Nextera XT Index kit. Libraries were multiplexed and sequenced on an Illumina MiSeq to generate 300 bp paired-end reads.

Sequencing was done in the Center for Genome Research and Biocomputing Core

Facility (Oregon State University, Corvallis, OR). Sickle was used to trim reads based on quality (minimum quality score cutoff of 25, minimum read length 150 bp after trimming) (Joshi & Fass 2011). Read quality was assessed prior to and after trimming using FastQC (Andrews 2010). Paired reads for each library were de novo assembled using Velvet version 1.2.10 with the short paired read input option (“-shortPaired”), estimated expected coverage (“-exp_cov auto”), and default settings for other

74 parameters (Zerbino & Birney 2008). Genome sequences were assembled using a range

75 of input hash lengths (k-mer sizes), and the final assembly for each isolate was identified based on those with the best metrics for the following parameters: total assembly length (5.0~7.0 Mb), number of contigs, and N50. Paired reads for each library were error corrected and assembled using SPAdes versions 3.6.2 and 3.7.0, with the careful option (“--careful") and kmers 21, 33, 55, 77, and 99. Scaffolds shorter than

500bp and with coverage less than 5X were removed from the SPAdes assemblies prior to analysis.

RESULTS AND DISCUSSION

Gall-ID (http://gall-id.cgrb.oregonstate.edu/) is based on the Microbe-ID platform and uses molecular data to determine the identity of plant pathogenic bacteria

(Tabima et al. 2016). Gall-ID is organized into modules shown as tabs that allow users to choose from one of four options for analyzing data (Figure 3.1).

Gall Isolate Typing

The “Gall Isolate Typing” tab provides online tools to use molecular data, either

16S or sequences of marker genes used for MLSA, to group isolates of interest into corresponding taxonomic units that include gall-causing pathogens. Users must first select the appropriate taxonomic group, Agrobacterium, Pseudomonas, Pantoea, or

Rhodococcus for comparison. For some of these taxonomic groups, multiple gene sets used in MLSA are available, and the user must therefore select the appropriate set for analysis. FASTA formatted gene sequences are input, and after selecting the

appropriate options for alignment and tree parameters, a phylogenetic tree that includes 75

the isolate of interest is generated and displayed. The tree parameters include choice of

76 distance matrix, tree generating algorithm (Neighbor-Joining or UPGMA), and number of bootstrap replicates. A sub-clade of the tree containing only the isolate of interest and its nearest sister taxa is displayed to the right of the full tree. The tree can be saved as a Newick file or as a PDF. An example sequence from Agrobacterium can be loaded by clicking the "Demo" button located in the Agro-type tab.

Phytopath-Type

The "Phytopath-Type" tab provides online tools for the analysis of other non- gall-causing pathogens important in agriculture (Mansfield et al. 2012). This tool is similar in function to the "Gall Isolate Typing" tools, except it is not limited to a single taxon of pathogen. A database of 16S rDNA sequences from genera of important bacterial phytopathogens (Pseudomonas syringae group, Ralstonia, Agrobacterium,

Rhodococcus, Xanthomonas, Pantoea, Xylella, Dickeya, Pectobacterium, and

Clavibacter) is available for associating a bacterial pathogen to its genus. Additionally, for Clavibacter, Dickeya, Pectobacterium, Ralstonia, Xanthomonas, and Xylella, the user can use MLSA to genotype isolates of interest. As is the case with Gall Isolate

Typing, a phylogenetic tree will be generated and displayed, associating the isolate of interest to the most closely related genotype in the curated databases.

Vir-Search

The "Vir-Search" tab provides an online tool for using user-input read sequences of a genome to search for the presence of homologs of known virulence

genes. Users select a taxonomic group (Agrobacterium, P. savastanoi, P. agglomerans, 76

or Rhodococcus) to designate the set of virulence genes to search against. Users also

77 determine a minimum percent gene coverage and maximum allowed percent identity divergence, and upload single or paired read files in FASTQ format. The user-supplied read sequences are then aligned to the chosen virulence gene dataset on the Gall-ID server. Once the search is complete, a link to the final results is sent to a user-provided email address. Results display the percent coverage of the virulence genes and the percent similarity of the covered sequences. If the query identifies multiple alleles of virulence genes from different sequenced strains, the Vir-Search tool will report the strain name associated with the best-mapped allele. User-submitted data and results are confidential and submitted sequencing reads are deleted from the Gall-ID server upon completion of the analysis.

Databases of Gall-ID

A key component of the tools associated with the aforementioned tabs is the manually curated databases of gene sequences. The literature was reviewed to identify validated taxon-specific sets of genes for MLSA of taxa with gall-causing bacteria as well as other pathogens that affect agriculture (Table 3.1) (Sarkar & Guttman 2004;

Hwang et al. 2005; Castillo & Greenberg 2007; Alexandre et al. 2008; Young et al.

2008; Delétoile et al. 2009; Kim et al. 2009; Adékambi et al. 2011; Jacques et al. 2012;

Parker et al. 2012; Marrero et al. 2013; Pérez-Yépez et al. 2014; Tancos et al. 2015).

The sequences for the corresponding genes were subsequently extracted from the whole genome sequences of reference strains. Auto MLSA was employed to use the gene sequences as queries in BLAST searches. Auto MLSA is based on a previously

developed set of Perl scripts to automate retrieving, filtering, aligning, concatenating, 77

determining of best substitution models, appending of key identifiers to sequences, and

78 generating files for tree construction (Creason et al. 2014a). Gene sets in which there were less than 50% query sequence coverage for all of the genes were excluded to ensure that the databases contained only taxonomically informative sequences. Each gene set database was manually checked for duplicate strains, large gaps in gene sequences, poor sequence alignment, and mis-annotated taxonomic information. Each of the MLSA databases used in the Gall-ID tools is also available for download on the

“Database Downloads" tab of the Gall ID website. The 16S rDNA databases were populated in a similar manner, with one small exception. For the Phytopath-Type tool, the sequence of the 16S rRNA-encoding gene from C58 of Agrobacterium was used as a query to retrieve corresponding sequences from 345 isolates distributed across the different genera of plant pathogenic bacteria.

To populate the database for virulence genes, the literature was reviewed to identify genes with demonstrably necessary functions for the pathogenicity of

Agrobacterium spp., R. fascians, P. savastanoi, and P. agglomerans (Thompson et al.

1988; Ward et al. 1988; Lichter et al. 1995; Nizan et al. 1997; Manulis et al. 1998; Zhu et al. 2000; Maes et al. 2001; Vereecke et al. 2002; Nizan-Koren et al. 2003; Sisto et al. 2004; Nissan et al. 2006; Barash & Manulis-Sasson 2007; Matas et al. 2012). The gene sequences were downloaded from corresponding type strains in the NCBI nucleotide (nr) database or from nucleotide sequences in the NCBI nr database. The downloaded virulence gene sequences were then used as input for the Auto MLSA tool to retrieve sequenced alleles from other isolates of the same taxa. The downloaded alleles were manually inspected to ensure only pathogenic strains were represented.

The database was formatted for SRST2. 78

79

Whole Genome Analysis

The analyses of whole genome sequence datasets can be computationally intensive, which is prohibitive for online tools. Therefore, the "Whole Genome

Analysis" tab provides downloadable software pipeline tools for users to employ their institutional infrastructure or a cloud computing service to analyze whole genome sequencing reads (Illumina HiSeq or MiSeq). There are two options in this tab, the first,

"WGS Pipeline: Core Genome Analysis," provides a download link and instructions for using the WGS Pipeline tool to generate a phylogeny based on the core genome sequence or core set of single nucleotide polymorphisms (SNPs). The second option,

“Auto ANI: Average Nucleotide Identity Analysis," provides a download link for the

Auto ANI tool and detailed instructions for its use in calculating all possible pairwise

ANI within a set of genome sequences.

The WGS Pipeline is a set of scripts that automates the use of sequences from

Illumina-based paired reads derived from whole genome sequencing projects to determine core genome sequences or core SNPs and generate phylogenetic trees

(Figure 3.2). This pipeline uses SMALT and SSAHA2 pile-up pipeline to align sequencing reads to an indexed reference genome sequence and generate a pileup file, respectively (Ning et al. 2001; Ponstingl 2013). The WGS Pipeline then combines the pileup files along with other pre-computed pileup files to derive a core genome alignment defined based on regions that are shared between at least 90% of the compared genome sequences. Users have the option of using Gubbins to remove regions that are flagged as potentially derived from recombination (Croucher et al.

2014). Invoking Gubbins will also remove all non-polymorphic sites from the 79

alignments, thus yielding a SNP alignment that is based only on polymorphic sites that

80 are identified as vertically inherited and shared between at least 90% of the compared genome sequences. Finally, the user can use either the core genome sequence or core

SNP alignment and RAxML to generate a maximum likelihood (ML) phylogeny

(Stamatakis 2014).

Users must place their input files in the correspondingly named folders in order to run the WGS pipeline. The pipeline down weights reads with a Q score of < 30, requires a minimum depth of 12 and relies on a minimum threshold of 75% for consensus base calling. Users concerned with sequencing quality may, prior to running the WGS pipeline, run programs such as FastQC, Trimmomatic, Sickle, and/or BBDuk to filter reads based on quality threshold (Andrews 2010; Joshi & Fass 2011; Bolger et al. 2014; Bushnell 2016). Paired read sequences for each genome are read from the

"reads" folder, while a SMALT index named as "reference" and placed in the "index" folder will be used as a reference to align to. Identifiers are taken from the prefix of the read pair file names and used to name the output pileup files and taxa in the phylogeny.

The read pair file names must have the suffixes ".1.fastq" and ".2.fastq" for files with forward and reverse read sequences, respectively. The read sequences must be in

FASTQ format and because of requirements of the SMALT program, each paired read name must end in ".p1k" and ".q1k" for forward and reverse reads, respectively. If the input read sequences are not in the proper format, the user may run the included optional script "prepare_for_pileup.sh" to format read names. If the user has pre- computed SMALT pileup files prepared using the same SMALT index, the files may be placed in the "pileup" folder and will also be included in the analysis. The user may

be prompted to input the length of the inserts for each sequencing library. Users also 80

81 have the option of changing the number of ML searches or non-parametric bootstrap replicates when building a phylogeny (default values are 20 ML searches, autoMRE cutoff criterion for bootstrap replicates).

Pre-built SMALT indices for reference genome sequences from strain C58 of

Agrobaterium and strain A44a of Rhodococcus, as well as pre-computed pileup files for 17 publicly available Rhodococcus genome sequences, are available for download on the Gall-ID website. Detailed usage instructions and download links for the pipeline scripts are located in the "WGS Pipeline: Core Genome Analysis" tool in the "Whole

Genome Analysis" tab of Gall-ID.

Previously developed scripts for ANI analysis were rewritten and named Auto

ANI. The current version of these scripts alleviates dependencies on our institutional computational infrastructure and increases the scalability of analyses (Creason et al.

2014a). Results are saved in a manner that enables analyzing additional genomes without having to re-compute ANI values for previously calculated comparisons. All

BLAST searches are done in a modular manner and can be modified to run on a computer cluster with a queuing system such as the Sun Grid Engine. There are no inherent restrictions on the numbers of pairwise comparisons that can be performed.

The data are output as a tab delimited matrix of all pairwise comparisons and can also easily be sorted and resorted based on any reference within the output. Additionally, genome sequences with evidence for poor quality assemblies can be easily filtered out.

A distance dendrogram based on ANI divergence can also be generated; a python script is available for download (Chan et al. 2012; Creason et al. 2014a).

81

82

Validation of tools available in Gall-ID

We validated the efficacy of the online tools available from the Gall Isolate

Typing, and Vir-Search tabs. DNA from 14 isolates were prepared, barcoded, and sequenced on an Illumina MiSeq (Table 3.3). Of these isolates, the identities of 11 were previously verified as Agrobacterium. The remaining three were associated with plant tissues showing symptoms of crown gall disease but were not tested or had results inconsistent with being a pathogenic member of the Agrobacterium genus. The reads were trimmed for quality and first de novo assembled within each library using the

Velvet assembler (Zerbino & Birney 2008). The 16S gene sequences were identified and extracted from the assemblies and used as input for the Agro-type tool. The 16S gene sequences from each of the 11 isolates originally typed as Agrobacterium clustered accordingly; isolate 13-2099-1-2 is shown as an example (Figure 3.3A). The

16S sequence from isolates AC27/96, AC44/96, and 14-2641 were more distant from the 16S sequences of Agrobacterium (Table 3.3). The isolates AC27/96 and AC44/96 grouped more closely with various Rhizobium species, while subsequent analysis using the Phytopath-Type tool suggested isolate 14-2641 was more closely related to members of Erwinia, Dickeya, and Pectobacterium (Table 3.3, Supplementary Figure

3.1). A search against the NCBI nr database revealed similarities to members of

Serratia.

The trimmed read sequences were used as input for the Vir-Search tool as an additional step to confirm the identity of these isolates. Paired read sequences for each of the 14 isolates were individually uploaded to the Gall-ID server. The Agrobacterium

virulence gene database was selected, with the minimum gene length coverage set to 82

80% and maximum allowed sequence divergence set to 20%. The time for each Vir-

83

Search analysis ranged from 2-5 minutes. Results suggested that the genome sequences for nearly all of the Agrobacterium isolates had homologs of virulence genes demonstrably necessary for pathogenicity by Agrobacterium, while the genome sequences for the isolates AC27/96, AC44/96, and 14-2641 did not (Figure 3.3B, data for isolate 13-2099-1-2 shown). Contrary to the results from molecular diagnostics tests, the reads from isolate 13-626 failed to align to any virulence genes except for two

(nocM, nocP) involved in nopaline transport. This isolate had the fewest number of useable sequencing reads and the highest number of contigs compared to the others, and results could have been a consequence of a poor assembly of the Ti plasmid.

Indeed, the qualities of the 14 assemblies were highly variable, likely reflecting the multi-partite structure of the agrobacterial genomes, presence of a linear replicon, and/or variation in depth of sequencing. We therefore used SPAdes v. 3.6.2 to de novo assemble each of the genome sequences, with the exception of isolate 14-2641

(Bankevich et al. 2012). The total sizes of the assemblies were similar to those generated using Velvet and the qualities of the assemblies were high. But assemblies generated using SPAdes had proliferations in errors with palindromic sequence that appeared to be unique to isolates expected to have linear replicons. We informed the developers of the SPAdes software who immediately resolved the issue in SPAdes

3.7.0. Inspection of the summary statistics of the assemblies derived using this latest version of SPAdes suggested that relative to Velvet-based assemblies, there were improvements to all assemblies, witht the most dramatic to those with the lowest read coverage (Supplementary Table 3.1, Supplementary Figure 3.2). To further verify the

quality of assemblies generated using SPAdes 3.7.0, we used Mauve to align Velvet 83

84 and SPAdes assembled genome sequences of isolate 13-626 to the finished genome sequence of the reference sequence of A. radiobacter K84 (Darling et al. 2004; Slater et al. 2009). The SPAdes-based assembly was superior in being less fragmented and we were able to elevate the quality of the assembly from “unusable” to “high quality”

(Supplementary Figures 2 and 3). Therefore, there is greater confidence in concluding that isolate 13-626 lacks the vir genes and T-DNA sequence. It does however have an

~200 kb plasmid sequence which encodes nocM and nocP; this contig also encodes sequences common to replication origins of plasmids. We therefore suggest that because an isolate from the same pear gall sample originally tested positive for virD2, we mistakenly sequenced a non-pathogenic isolate.

To validate the WGS Pipeline tool of Gall-ID, Illumina paired end read sequences derived from previously generated genome sequences of 20 Rhodococcus isolates were used to construct a phylogeny based on SNPs (Creason et al. 2014a).

Using default parameters, the entire process, from piling up reads to generating the final phylogenetic tree, took 16 hours (Table 3.2). A total of 855,355 sites (out of a total of 5,947,114 sites in the A44a reference sequence) were shared in at least 18 of the 20 Rhodococcus genome sequences. Of the shared sites, 177,961 sites were polymorphic, of which 3,142 were removed because they were identified as potentially acquired by recombination. The final core SNP alignment was therefore represented by

174,819 polymorphic sites and used to construct a maximum likelihood tree (Figure

3.4). Most of the nodes were well supported, with all exceeding 68% bootstrap support and most having 100% support. The topology of the tree was consistent with that

derived from a multi-gene phylogeny (Creason et al. 2014a). As previously reported, 84

85 the 20 isolates formed two well-supported and distinct clades, and could explain the relatively low number of shared SNPs. The substructure that was previously observed in clade I was also evident in the ML tree based on core SNPs. The one noticeable difference between the trees was that the tips of the tree based on the core SNPs had substantially more resolution, and in particular, revealed a greater genetic distance between isolates A76 and 05-339-1, than previously appreciated based on the multi- gene phylogeny.

The amount of time to run Auto ANI was determined by comparing genome assemblies of the same 20 Rhodococcus isolates. The entire process was completed in five hours.

Conclusions

Gall-ID provides simplified and straightforward methods to rapidly and efficiently characterize gall-causing pathogenic bacterial isolates using Sanger sequencing or Illumina sequencing. Though Gall-ID was developed with a particular focus on these types of bacteria, it can be used for some of the more common and important agricultural bacterial pathogens. Additionally, the downloadable tools can be used for any taxa of bacteria, regardless of whether or not they are pathogens.

ACKNOWLEDGEMENTS

We thank Melodie Putnam for providing the 14 bacterial isolates and for critical reading of the manuscript. We thank Dr. Pankaj Jaiswal for organizing and inviting us

to participate in the STEM DNA Biology and Bioinformatics summer camps (Oregon 85

State University). Camp participants, Ana Bechtel, Mason Hall, Reagan Hunt, Pranav

86

Kolluri, Benjamin Phelps, Joshua Phelps, Aravind Sriram, Megan Thorpe, and eight others, prepared genomic DNA and libraries for whole genome sequencing and analyzed the data. We thank Charlie DuBois of Illumina for providing kits for library preparation as well as sequencing, and Mark Dasenko, Matthew Peterson, and Chris

Sullivan of the Center for Genome Research and Biocomputing for sequencing, data processing, and computing services. Finally, we thank the Department of Botany and

Plant Pathology for supporting the computational infrastructure. Any opinion, findings, and conclusions or recommendations expressed in this material are those of the authors(s) and do not necessarily reflect the views of the U. S. Department of

Agriculture or National Science Foundation.

BIBLIOGRAPHY

Adékambi T, Butler RW, Hanrahan F, Delcher AL, Drancourt M, and Shinnick TM. 2011. Core gene set as the basis of multilocus sequence analysis of the subclass Actinobacteridae. PLoS One 6:e14792. 10.1371/journal.pone.0014792

Alexandre A, Laranjo M, Young JP, and Oliveira S. 2008. dnaJ is a useful phylogenetic marker for . International Journal of Systematic and Evolutionary Microbiology 58:2839-2849. 10.1099/ijs.0.2008/001636-0

Allardet-Servent A, Michaux-Charachon S, Jumas-Bilak E, Karayan L, and Ramuz M. 1993. Presence of one linear and one circular chromosome in the Agrobacterium tumefaciens C58 genome. Journal of Bacteriology 175:7869-7874.

Alvarez AM. 2004. Integrated approaches for detection of plant pathogenic bacteria and diagnosis of bacterial diseases. Annual Review of Phytopathology 42:339-366. doi:10.1146/annurev.phyto.42.040803.140329

Andrews S. 2010. FastQC A quality control tool for high throughput sequence data. Available at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

Aragon IM, Perez-Martinez I, Moreno-Perez A, Cerezo M, and Ramos C. 2014. New insights into the role of indole-3-acetic acid in the virulence of Pseudomonas savastanoi pv. savastanoi. FEMS Microbiol Letters 356:184-192. 10.1111/1574- 86

6968.12413

87

Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov A, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev M, and Pevzner P. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455-477. 10.1089/cmb.2012.0021

Barash I, and Manulis S. 2005. Hrp-dependent biotrophic mechanism of virulence: How has it evolved in tumorigenic bacteria? Phytoparasitica 33:317-324. 10.1007/BF02981296

Barash I, and Manulis-Sasson S. 2007. Virulence mechanisms and host specificity of gall-forming Pantoea agglomerans. Trends in Microbiology 15:538-545. 10.1016/j.tim.2007.10.009

Barash I, Panijel M, Gurel F, Chalupowicz L, and Manulis S. 2005. Transformation of Pantoea agglomerans into a tumorigenic pathogen. In: Sorvari S, and Toldi O, editors. Proceedings of the 1st International Conference on Plant-Microbe Interactions: Endophytes and Biocontriol Agents. Lapland, Finland: Saariselka. p 10-19.

Bardaji L, Perez-Martinez I, Rodriguez-Moreno L, Rodriguez-Palenzuela P, Sundin GW, Ramos C, and Murillo J. 2011. Sequence and role in virulence of the three plasmid complement of the model tumor-inducing bacterium Pseudomonas savastanoi pv. savastanoi NCPPB 3335. PLoS One 6:e25705. 10.1371/journal.pone.0025705

Binns AN, and Costantino P. 1998. The Agrobacterium oncogenes. In: Spaink HP, Kondorosi A, and Hooykaas PJJ, eds. The Rhizobiaceae. Netherlands: Springer, 251-166.

Bolger A, Lohse M, and Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120. 10.1093/bioinformatics/btu170

Bomhoff G, Klapwijk PM, Kester HC, Schilperoort RA, Hernalsteens JP, and Schell J. 1976. Octopine and nopaline synthesis and breakdown genetically controlled by a plasmid of Agrobacterium tumefaciens. Molecular & General Genetics 145:177- 181.

Bouzar H, and Jones JB. 2001. Agrobacterium larrymoorei sp. nov., a pathogen isolated from aerial tumours of Ficus benjamina. International Journal of Systematic and Evolutionary Microbiology 51:1023-1026. 10.1099/00207713-51- 3-1023

Broothaerts W, Mitchell HJ, Weir B, Kaines S, Smith LMA, Yang W, Mayer JE, Roa- Rodríguez C, and Jefferson RA. 2005. Gene transfer to plants by diverse species

of bacteria. Nature 433:629-633. 10.1038/nature03309 87

Burr T, Katz B, Abawi G, and Crosier D. 1991. Comparison of tumorigenic strains of

88

Erwinia herbicola isolated from table beet with E. h. gypsophilae. Plant Disease 75:855-858.

Bushnell B. 2016. BBMap. Available at http://sourceforge.net/projects/bbmap/.

Castillo JA, and Greenberg JT. 2007. Evolutionary dynamics of Ralstonia solanacearum. Applied and Environmental Microbiology 73:1225-1238. 10.1128/AEM.01253-06

Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17:540-552.

Chan JZ-M, Halachev MR, Loman NJ, Constantinidou C, and Pallen MJ. 2012. Defining bacterial species in the genomic era: insights from the genus Acinetobacter. BMC Microbiology 12:302. 10.1186/1471-2180-12-302

Chang J-M, Di Tommaso P, and Notredame C. 2014. TCS: A new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution 31:1625-1637. 10.1093/molbev/msu117

Chilton MD, Drummond MH, Merio DJ, Sciaky D, Montoya AL, Gordon MP, and Nester EW. 1977. Stable incorporation of plasmid DNA into higher plant cells: the molecular basis of crown gall tumorigenesis. Cell 11:263-271.

Clark E, Manulis S, Ophir Y, Barash I, and Y G. 1993. Cloning and characterization of iaaM and iaaH from Erwinia herbicola pathovar gypsophilae. Phytopathology 83:234-240. 10.1094/Phyto-83-234

Cooksey DA. 1986. Galls of Gypsophila paniculata caused by Erwinia herbicola. Plant Disease 70:464. 10.1094/PD-70-464

Creason AL, Davis EW, Putnam ML, Vandeputte OM, and Chang JH. 2014a. Use of whole genome sequences to develop a molecular phylogenetic framework for Rhodococcus fascians and the Rhodococcus genus. Frontiers in Plant Science 5:406. 10.3389/fpls.2014.00406

Creason AL, Vandeputte OM, Savory EA, Davis EW, Putnam ML, Hu E, Swader- Hines D, Mol A, Baucher M, Prinsen E, Zdanowska M, Givan SA, Jaziri ME, Loper JE, Mahmud T, and Chang JH. 2014b. Analysis of genome sequences from plant pathogenic Rhodococcus reveals genetic novelties in virulence loci. PLoS One 9:e101996. 10.1371/journal.pone.0101996

Crespi M, Messens E, Caplan AB, van Montagu M, and Desomer J. 1992. Fasciation induction by the phytopathogen Rhodococcus fascians depends upon a linear plasmid encoding a cytokinin synthase gene. The EMBO Journal 11:795-804.

88

Croucher NJ, Page AJ, Connor TR, Delaney AJ, Keane JA, Bentley SD, Parkhill J, and

89

Harris SR. 2014. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research:gku1196-. 10.1093/nar/gku1196

Darling A, Mau B, Blattner F, and Perna N. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394-1403. 10.1101/gr.2289704

Delétoile A, Decré D, Courant S, Passet V, Audo J, Grimont P, Arlet G, and Brisse S. 2009. Phylogeny and identification of Pantoea species and typing of Pantoea agglomerans strains by multilocus gene sequencing. Journal of Clinical Microbiology 47:300-310. 10.1128/JCM.01916-08

DeYoung RM, Copeman RJ, and Hunt RS. 1998. Two strains in the genus Erwinia cause galls on Douglas-fir in southwestern British Columbia. Canadian Journal of Plant Pathology 20:194-200. 10.1080/07060669809500427

Farrand SK, Van Berkum PB, and Oger P. 2003. Agrobacterium is a definable genus of the family Rhizobiaceae. International Journal of Systematic and Evolutionary Microbiology 53:1681-1687. 10.1099/ijs.0.02445-0

Gardan L, Bollet C, Abu Ghorrah M, Grimont F, and Grimont PAD. 1992. DNA relatedness among the pathovar strains of Pseudomonas syringae subsp. savastanoi Janse (1982) and proposal of Pseudomonas savastanoi sp. nov. International Journal of Systematic Bacteriology 42:606-612. 10.1099/00207713- 42-4-606

Gloyer WO. 1934. Crown gall and hairy root of apples in nursery and orchard. Geneva, NY: New York Agricultural Experiment Station Bulletin 638.

Goodfellow M. 1984. Reclassification of Corynebacterium fascians (Tilford) Dowson in the genus Rhodococcus, as Rhodococcus fascians comb. nov. Systematic and Applied Microbiology 5:225-229. 10.1016/S0723-2020(84)80023-5

Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, and Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology 57:81-91. 10.1099/ijs.0.64483-0

Heibl C. 2013. PHYLOCH: interfaces and graphic tools for phylogenetic data in R. Available at http://www.christophheibl.de/Rpackages.html.

Hildebrand EM. 1940. Cane gall of Brambles caused by Phytomonas rubi n.sp. Journal of Agricultural Research 61:685--696 pp.

Hwang MSH, Morgan RL, Sarkar SF, Wang PW, and Guttman DS. 2005. Phylogenetic

89

characterization of virulence and resistance phenotypes of Pseudomonas syringae. Applied and Environmental Microbiology 71:5182-5191.

90

10.1128/AEM.71.9.5182-5191.2005

Iacobellis NS, Sisto A, Surico G, Evidente A, and DiMaio E. 1994. Pathogenicity of Pseudomonas syringae subsp. savastanoi mutants defective in phytohormone production. Journal of Phytopathology 140:238-248. 10.1111/j.1439- 0434.1994.tb04813.x

Inouye M, Conway TC, Zobel J, and Holt KE. 2012. Short read sequence typing (SRST): multi-locus sequence types from short reads. BMC Genomics 13:338. 10.1186/1471-2164-13-338

Inouye M, Dashnow H, Raven L-A, Schultz MB, Pope BJ, Tomita T, Zobel J, and Holt KE. 2014. SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Medicine 6:90. 10.1186/s13073-014-0090-6

Jacques M-A, Durand K, Orgeur G, Balidas S, Fricot C, Bonneau S, Quillévéré A, Audusseau C, Olivier V, Grimault V, and Mathis R. 2012. Phylogenetic analysis and polyphasic characterization of Clavibacter michiganensis strains isolated from tomato seeds reveal that nonpathogenic strains are distinct from C. michiganensis subsp. michiganensis. Applied and Environmental Microbiology 78:8388-8402. 10.1128/AEM.02158-12

Janda JM, and Abbott SL. 2007. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. Journal of Clinical Microbiology 45:2761-2764. 10.1128/JCM.01228-07

Joshi N, and Fass J. 2011. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files. Available at https://github.com/najoshi/sickle.

Kado CI. 2014. Historical account on gaining insights on the mechanism of crown gall tumorigenesis induced by Agrobacterium tumefaciens. Frontiers in Microbiology 5:340. 10.3389/fmicb.2014.00340

Kamvar ZN, Tabima JF, and Grünwald NJ. 2014. Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2:e281. 10.7717/peerj.281

Katoh K, and Standley DM. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30:772-780. 10.1093/molbev/mst010

Katoh K, and Toh H. 2008. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics 9:286-298. 10.1093/bib/bbn013

Kim H-S, Ma B, Perna NT, and Charkowski AO. 2009. Phylogeny and virulence of naturally occurring type III secretion system-deficient Pectobacterium strains.

90

Applied and Environmental Microbiology 75:4539-4549. 10.1128/AEM.01336- 08

91

Kim M, Oh H-S, Park S-C, and Chun J. 2014. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. International Journal of Systematic and Evolutionary Microbiology 64:346-351. 10.1099/ijs.0.059774-0

Langmead B, and Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357-359. 10.1038/nmeth.1923

Lassalle F, Campillo T, Vial L, Baude J, Costechareyre D, Chapulliot D, Shams M, Abrouk D, Lavire C, Oger-Desfeux C, Hommais F, Guéguen L, Daubin V, Muller D, and Nesme X. 2011. Genomic species are ecological species as revealed by comparative genomics in Agrobacterium tumefaciens. Genome Biology and Evolution 3:762-781. 10.1093/gbe/evr070

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and Genome Project Data Processing S. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-2079.

Lichter A, Barash I, Valinsky L, and Manulis S. 1995. The genes involved in cytokinin biosynthesis in Erwinia herbicola pv. gypsophilae: characterization and role in gall formation. Journal of Bacteriology 177:4457-4465.

Maes T, Vereecke D, Ritsema T, Cornelis K, Thu HN, Van Montagu M, Holsters M, and Goethals K. 2001. The att locus of Rhodococcus fascians strain D188 is essential for full virulence on tobacco through the production of an autoregulatory compound. Molecular microbiology 42:13-28.

Mansfield J, Genin S, Magori S, Citovsky V, Sriariyanum M, Ronald P, Dow M, Verdier V, Beer SV, Machado MA, Toth I, Salmond G, and Foster GD. 2012. Top 10 plant pathogenic bacteria in molecular plant pathology. Molecular Plant Pathology 13:614-629. 10.1111/j.1364-3703.2012.00804.x

Manulis S, and Barash I. 2003. The molecular basis for transformation of an epiphyte into a gall-forming pathogen as exemplified by Erwinia herbicola pv. gypsophilae. Plant-Microbe Interactions 6:19-52.

Manulis S, Haviv-Chesner A, Brandl MT, Lindow SE, and Barash I. 1998. Differential involvement of indole-3-acetic acid biosynthetic pathways in pathogenicity and epiphytic fitness of Erwinia herbicola pv. gypsophilae. Molecular Plant-Microbe Interactions 11:634-642. 10.1094/MPMI.1998.11.7.634

Marrero G, Schneider KL, Jenkins DM, and Alvarez AM. 2013. Phylogeny and classification of Dickeya based on multilocus sequence analysis. International Journal of Systematic and Evolutionary Microbiology 63:3524-3539. 10.1099/ijs.0.046490-0

91

Matas IM, Lambertsen L, Rodríguez-Moreno L, and Ramos C. 2012. Identification of novel virulence genes and metabolic pathways required for full fitness of

92

Pseudomonas savastanoi pv. savastanoi in olive (Olea europaea) knots. New Phytologist 196:1182-1196. 10.1111/j.1469-8137.2012.04357.x

Montoya AL, Chilton MD, Gordon MP, Sciaky D, and Nester EW. 1977. Octopine and nopaline metabolism in Agrobacterium tumefaciens and crown gall tumor cells: role of plasmid genes. Journal of Bacteriology 129:101-107.

Mor H, Manulis S, Zuck M, Nizan R, Coplin DL, and Barash I. 2001. Genetic organization of the hrp gene cluster and dspAE/BF operon in Erwinia herbicola pv. gypsophilae. Molecular Plant-Microbe Interactions 14:431-436. 10.1094/MPMI.2001.14.3.431

Morris RO. 1986. Genes specifying auxin and cytokinin biosynthesis in phytopathogens. Annual Review of Plant Physiology 37:509-538. 10.1146/annurev.pp.37.060186.002453

Ning Z, Cox AJ, and Mullikin JC. 2001. SSAHA: a fast search method for large DNA databases. Genome Research 11:1725-1729. 10.1101/gr.194201

Nissan G, Manulis-Sasson S, Weinthal D, Mor H, Sessa G, and Barash I. 2006. The type III effectors HsvG and HsvB of gall-forming Pantoea agglomerans determine host specificity and function as transcriptional activators. Molecular microbiology 61:1118-1131. 10.1111/j.1365-2958.2006.05301.x

Nizan R, Barash I, Valinsky L, Lichter A, and Manulis S. 1997. The presence of hrp genes on the pathogenicity-associated plasmid of the tumorigenic bacterium Erwinia herbicola pv. gypsophilae. Molecular Plant-Microbe Interactions 10:677-682. 10.1094/MPMI.1997.10.5.677

Nizan-Koren R, Manulis S, Mor H, Iraki NM, and Barash I. 2003. The regulatory cascade that activates the Hrp regulon in Erwinia herbicola pv. gypsophilae. Molecular Plant-Microbe Interactions 16:249-260. 10.1094/MPMI.2003.16.3.249

Opgenorth DC, Hendson M, and Clark E. 1994. First report of bacterial gall of Wisteria sinensis caused by Erwinia herbicola pv. milletiae in California. Plant Disease 78:1217C. 10.1094/PD-78-1217C

Ophel K, and Kerr A. 1990. Agrobacterium vitis sp. nov. for Strains of Agrobacterium biovar 3 from Grapevines. International Journal of Systematic Bacteriology 40:236-241. 10.1099/00207713-40-3-236

Paradis E, Claude J, and Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20:289-290. 10.1093/bioinformatics/btg412

92

Parker JK, Havird JC, and De La Fuente L. 2012. Differentiation of Xylella fastidiosa strains via multilocus sequence analysis of environmentally mediated genes

93

(MLSA-E). Applied and Environmental Microbiology 78:1385-1396. 10.1128/AEM.06679-11

Pearson T, Okinaka RT, Foster JT, and Keim P. 2009. Phylogenetic understanding of clonal populations in an era of whole genome sequencing. Infection, genetics and evolution : journal of molecular epidemiology and evolutionary genetics in infectious diseases 9:1010-1019. 10.1016/j.meegid.2009.05.014

Pérez-Yépez J, Armas-Capote N, Velázquez E, Pérez-Galdona R, Rivas R, and León- Barrios M. 2014. Evaluation of seven housekeeping genes for multilocus sequence analysis of the genus Mesorhizobium: Resolving the taxonomic affiliation of the Cicer canariense rhizobia. Systematic and Applied Microbiology 37:553-559. 10.1016/j.syapm.2014.10.003

Ponstingl H. 2013. SMALT. Available at https://www.sanger.ac.uk/resources/software/smalt/.

Putnam ML, and Miller ML. 2007. Rhodococcus fascians in herbaceous perennials. Plant Disease 91:1064-1076. 10.1094/PDIS-91-9-1064

Sachs T. 1975. Plant tumors resulting from unregulated hormone synthesis. Journal of Theoretical Biology 55:445-453.

Sarkar SF, and Guttman DS. 2004. Evolution of the core genome of Pseudomonas syringae, a highly clonal, endemic plant pathogen. Applied and Environmental Microbiology 70:1999-2012.

Schroth MN. 1988. Reduction in yield and vigor of grapevine caused by crown gall disease. Plant Disease 72:241. 10.1094/PD-72-0241

Sisto A, Cipriani MG, and Morea M. 2004. Knot Formation caused by Pseudomonas syringae subsp. savastanoi on olive plants is hrp-dependent. Phytopathology 94:484-489. 10.1094/PHYTO.2004.94.5.484

Slater S, Goldman B, Goodner B, Setubal J, Farrand S, Nester E, Burr T, Banta L, Dickerman A, Paulsen I, Otten L, Suen G, Welch R, Almeida N, Arnold F, Burton O, Du Z, Ewing A, Godsy E, Heisel S, Houmiel K, Jhaveri J, Lu J, Miller N, Norton S, Chen Q, Phoolcharoen W, Ohlin V, Ondrusek D, Pride N, Stricklin S, Sun J, Wheeler C, Wilson L, Zhu H, and Wood D. 2009. Genome sequences of three agrobacterium biovars help elucidate the evolution of multichromosome genomes in bacteria. J Bacteriol 191:2501-2511. 10.1128/JB.01779-08

Stackebrandt E, and Goebel BM. 1994. Taxonomic Note: A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. International Journal of Systematic Bacteriology 44:846-849.

10.1099/00207713-44-4-846 93

Stajich JE. 2002. The Bioperl Toolkit: Perl modules for the life sciences. Genome

94

Research 12:1611-1618. 10.1101/gr.361602

Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30:1312-1313. 10.1093/bioinformatics/btu033

Stes E, Francis I, Pertry I, Dolzblasz A, Depuydt S, and Vereecke D. 2013. The leafy gall syndrome induced by Rhodococcus fascians. FEMS Microbiol Lett 342:187- 195. 10.1111/1574-6968.12119

Tancos MA, Lange HW, and Smart CD. 2015. Characterizing the genetic diversity of the Clavibacter michiganensis subsp. michiganensis population in New York. Phytopathology 105:169-179. 10.1094/PHYTO-06-14-0178-R

Temmerman W, Vereecke D, Dreesen R, Van Montagu M, Holsters M, and Goethals K. 2000. Leafy gall formation is controlled by fasR, an AraC-type regulatory gene in Rhodococcus fascians. Journal of Bacteriology 182:5832-5840.

Thompson DV, Melchers LS, Idler KB, Schilperoort RA, and Hooykaas PJ. 1988. Analysis of the complete nucleotide sequence of the Agrobacterium tumefaciens virB operon. Nucleic Acids Research 16:4621-4636.

Van Larebeke N, Engler G, Holsters M, Van den Elsacker S, Zaenen I, Schilperoort RA, and Schell J. 1974. Large plasmid in Agrobacterium tumefaciens essential for crown gall-inducing ability. Nature 252:169-170.

Vasanthakumar A, and McManus PS. 2004. Indole-3-acetic acid-producing bacteria are associated with cranberry stem gall. Phytopathology 94:1164-1171. 10.1094/PHYTO.2004.94.11.1164

Velázquez E, Palomo JL, Rivas R, Guerra H, Peix A, Trujillo ME, García-Benavides P, Mateos PF, Wabiko H, and Martínez-Molina E. 2010. Analysis of core genes supports the reclassification of strains Agrobacterium radiobacter K84 and Agrobacterium tumefaciens AKE10 into the species Rhizobium rhizogenes. Systematic and Applied Microbiology 33:247-251. 10.1016/j.syapm.2010.04.004

Vereecke D, Cornelis K, Temmerman W, Jaziri M, Van Montagu M, Holsters M, and Goethals K. 2002. Chromosomal locus that affects pathogenicity of Rhodococcus fascians. Journal of Bacteriology 184:1112-1120.

Ward JE, Akiyoshi DE, Regier D, Datta A, Gordon MP, and Nester EW. 1988. Characterization of the virB operon from an Agrobacterium tumefaciens Ti plasmid. Journal of Biological Chemistry 263:5804-5814.

Weinthal DM, Barash I, Panijel M, Valinsky L, Gaba V, and Manulis-Sasson S. 2007. Distribution and replication of the pathogenicity plasmid pPATH in diverse

94

populations of the gall-forming bacterium Pantoea agglomerans. Applied and Environmental Microbiology 73:7552-7561. 10.1128/AEM.01511-07

95

Wertz JE, Goldstone C, Gordon DM, and Riley MA. 2003. A molecular phylogeny of enteric bacteria and implications for a bacterial species concept. Journal of Evolutionary Biology 16:1236-1248.

Young JM. 2003. Classification and nomenclature of Agrobacterium and Rhizobium - a reply to Farrand et al. (2003). International Journal of Systematic and Evolutionary Microbiology 53:1689-1695. 10.1099/ijs.0.02762-0

Young JM, Kuykendall LD, Martínez-Romero E, Kerr A, and Sawada H. 2001. A revision of Rhizobium Frank 1889, with an emended description of the genus, and the inclusion of all species of Agrobacterium Conn 1942 and Allorhizobium undicola de Lajudie et al. 1998 as new combinations: Rhizobium radiobacter, R. rhizogenes, R. rubi, R. undicola and R. vitis. International Journal of Systematic and Evolutionary Microbiology 51:89-103. doi:10.1099/00207713-51-1-89

Young JM, Park D-C, Shearman HM, and Fargier E. 2008. A multilocus sequence analysis of the genus Xanthomonas. Systematic and Applied Microbiology 31:366-377. 10.1016/j.syapm.2008.06.004

Zeigler DR. 2003. Gene sequences useful for predicting relatedness of whole genomes in bacteria. International Journal of Systematic and Evolutionary Microbiology 53:1893-1900.

Zerbino DR, and Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18:821-829. 10.1101/gr.074492.107

Zhu J, Oger PM, Schrammeijer B, Hooykaas PJ, Farrand SK, and Winans SC. 2000. The bases of crown gall tumorigenesis. Journal of Bacteriology 182:3885-3895.

95

96

Figure 3.1. Overview of Gall-ID diagnostic tools.

DNA sequence information can be used to reveal the identity of the causative agent (unknown isolate) of disease. Tools associated with "Gall Isolate Typing" and “Phytopath-type” use 16S rDNA or pathogen-specific MLSA gene sequences to infer the identity of the isolate by comparing the sequences to manually curated sequence databases. Tools associated with “Whole Genome Analysis” and “Vir-Search” use Illumina short sequencing reads to characterize pathogenic isolates. The former tab provides downloadable tools to infer genetic relatedness based on SNPs (WGS Pipeline) or average nucleotide identity (Auto ANI). The “Vir-Search” tab provides an on-line tool to quickly map short reads against a database of sequences of virulence genes.

96

97

Figure 3.2. Flowchart for the WGS Pipeline.

Scripts and the programs that each script runs are boxed and presented along the left. The logic flow of the WGS Pipeline tool is presented along the right. Rectangles with rounded corners = inputs and outputs; boxes outlined in red = processes. The inputs, outputs, and processes are matched to the corresponding script and program.

97

98

Figure 3.3. Validation of the Agro-type and Vir-Search tools.

A) An unrooted Neighbor Joining phylogenetic tree based on 16s rDNA sequences from Agrobacterium spp. The 16S rDNA sequence was identified and extracted from the genome assembly of Agrobacterium isolate 13-2099-1-2 and analyzed using the tool available in the Agro-type tab. The isolate is labeled in red, as “query_isolate”; inset shows the clade that circumscribes the isolate. B) Screenshot of output results from Vir-Search. Paired 2x300 bp MiSeq short reads from Agrobacterium isolate 13- 2099-1-2 were analyzed using the Vir-Search tool in Gall-ID. Reference virulence gene sequences that were aligned are indicated with a green plus ("+") icon and the lengths and depths of the read coverage are reported (must exceed user-specified cutoffs, which were designated as 90% minimum coverage and 20% maximum sequence divergence). Virulence genes that failed to exceed user-specific cutoffs for read alignment parameters are indicated with a red "X". Virulence genes are grouped into categories based on their function in virulence.

98

Figure 3.3. Validation of the Agro-type and Vir-Search tools. 99

100

Figure 3.4. Maximum likelihood tree based on vertically inherited polymorphic sites core to 20 Rhodococcus isolates.

WGS Pipeline was used to automate the processing of paired end short reads from 20 previously sequenced Rhodococcus isolates, and generate a maximum likelihood unrooted tree. Sequencing reads were aligned, using R. fascians strain A44a as a reference. SNPs potentially acquired via recombination were removed. The tree is midpoint-rooted. Scale bar = 0.05 average substitutions per site; non-parametric bootstrap support as percentages are indicated for each node. Major clades and sub- clades are labeled in a manner consistent with previous labels.

100

101

Table 3.1. Manually curated datasets developed for Gall-ID

Database Bacterial group # of isolates used References in Gall-ID "Agro-type" tool (Agrobacterium) MLSA (dnaK, glnA, Rhizobiaceae 199 Perez-Yepez gyrB, recA, rpoB, et al., thrA, truA) MLSA (atpD, gapA, Rhizobiaceae 188 Alexandre et gyrB, recA, rplB) al., 2008 dnaJ Rhizobiaceae 198 Alexandre et al., 2008 16S rDNA Rhizobiaceae 245 "Rhodo-type" tool (Rhodococcus) MLSA (ftsY, infB, Rhodococcus 85 Adekambi et rpoB, rsmA, secY, al., 2011 tsaD, ychF) 16S rDNA Rhodococcus 66 "Panto-type" tool (Pantoea agglomerans) MLSA (fusA, gyrB, Pantoea, Erwinia 356 Delétoile et leuS, pyrG, rplB, al., 2009 rpoB) 16S rDNA Pantoea, Erwinia 352 "Pseudo-type" tool (Pseudomonas savastanoi) MLSA (gapA, gltA, Pseudomonas syringae 158 Hwang et al., gyrB, rpoD) 2005 MLSA (acnB, fruK, Pseudomonas syringae 153 Sarkar et al., gapA, gltA, gyrB, pgi, 2004 rpoD) 16S rDNA Pseudomonas syringae 161

101

102

Table 3.1. (Continued) Manually curated datasets developed for Gall-ID

"Phytopath-type" tool 16S rDNA Rhodococcus, 345 Agrobacterium, Pseudomonas syringae, Ralstonia, Xanthomonas, Pantoea, Erwinia, Xylella, Dickeya, Pectobacterium, Clavibacter, Rathayibacter MLSA (atpD, dnaK, Clavibacter 7 Jacques et al., gyrB, ppK, recA, 2012 rpoB) MLSA (dnaA, gyrB, Clavibacter 7 Tancos et al., kdpA, ligA, sdhA) 2015 MLSA (dnaA, dnaJ, Dickeya 40 Marrero et al., dnaX, gyrB, recN) 2013 MLSA (acnA, gapA, Pectobacterium 54 Kim et al., icdA, mdh, pgi) 2009 MLSA (adk, egl, fliC, Ralstonia 28 Castillo et al., gapA, gdhA, gyrB, 2007 hrpB, ppsA) MLSA (dnaK, fyuA, Xanthomonas 348 Young et al., gyrB, rpoD) 2008 MLSA (acvB, copB, Xylella 17 Parker et al., cvaC, fimA, gaa, 2012 pglA, pilA, rpfF, xadA)

102

103

Table 3.2. Statistics for the WGS Pipeline

WGS Pipeline step Statistic Value Number of input paired read 19 sets

generate_pileup.sh Average runtime per pileup 00:42:01 (1 cpu) (hh:mm:ss)

Total runtime (hh:mm:ss) 13:18:14

Total pileup alignment length 5,947,114 bp

generate_core_alignment.sh 90%-shared core alignment 855,355 bp (1 cpu) length

Total runtime (hh:mm:ss) 00:15:32

Number of core polymorphic 177,961 bp sites core SNP alignment length (w/o putative recombinant 174,819 bp remove_recombination.sh SNPs) (10 cpus) Computational time (hh:mm:ss) 04:25:32 Actual runtime (hh:mm:ss) 00:29:28 Figure output runtime 00:13:03 (hh:mm:ss) Time to optimize RAxML 00:02:32 parameters (hh:mm:ss) Time to compute 20 ML 00:34:53 generate_phylogeny.sh searches (hh:mm:ss) (raxmlHPC-PTHREADS- Number of bootstrap replicates AVX, 50 (RAxML autoMRE) 10 cpus) Time to compute 50 bootstrap 01:02:09 searches (hh:mm:ss) Total runtime (hh:mm:ss) 01:39:34 All Total runtime (hh:mm:ss) 15:55:51

103

104

Table 3. Strain identity of 14 isolates associated with crown gall

# high # of Isolate Positive ID quality Clade (based virulence Host name based on read on 16S rDNA) genes pairs ID’ed 13-2099- Quaking Aspen virD2 PCR 1,244,074 Agrobacterium 63 1-2 2 (nocM, 13-626 Pear virD2 PCR 220,903 Agrobacterium nocP) AC27/96 Pieris Not pathogenic 826,690 Rhizobium 1 (tssD) No reaction to AC44/96 Pieris hybridization 1,404,002 Rhizobium 0 probes Peach/Almond Pathogenicity B131/95 539,283 Agrobacterium 46 Rootstock assay Peach/Almond Pathogenicity B133/95 1,199,902 Agrobacterium 46 Rootstock assay Response to 20 different Peach/Almond biochemical B140/95 448,314 A. tumefaciens 51 Rootstock and physiological tests Response to 20 different biochemical N2/73 Cranberry gall 1,345,404 A. tumefaciens 64 and physiological tests Response to 20 different biochemical W2/73 Euonymus 1,244,159 A. rubi 51 and physiological tests 15-1187- Yarrow virD2 PCR 508,223 A. tumefaciens 39 1-2a 15-1187- Yarrow virD2 PCR 299,970 A. tumefaciens 38 1-2b 14-2641 Rose No data 698,756 Serratia 0 Colony 15-172 Leucanthemum morphology on 384,308 A. tumefaciens 56 selective media Colony 15-174 Leucanthemum morphology on 753,570 A. tumefaciens 58 selective media

104

105

Supplemental Figure 3.1. Validation of the Phytopath-type tool.

The 16S rDNA sequence from isolate originally labeled as Agrobacterium isolate 14- 2641 was analyzed using the Phytopath-type tool. The isolate is labeled in red, as “query_isolate”; inset shows the clade that circumscribes the isolate.

105

106

Supplemental Figure 3.2. The SPAdes assembler produces high quality assemblies.

Genome sequences assembled using Velvet v. 1.2.10 (blue) and SPAdes v. 3.7.0 (orange) were compared based on scaffold N50 and number of scaffolds greater than 1kb in size.

106

107

Supplemental Figure 3.3. Whole genome alignment of 13-626 assemblies to a close reference sequence indicates collinearity of genomes.

The A. radiobacter K84 genome sequence (top) and assemblies of 13-626 generated using SPAdes v. 3.7.0 (middle; 47 scaffolds) and Velvet (bottom; 527 scaffolds) were aligned using Mauve. Shared locally collinear blocks (LCBs) between the reference and the two assemblies are color-coded and connected by color-coded lines. Scaffold and replicon ends are depicted as vertical red lines.

107

108

Supplemental Table 3.1. Comparison of Velvet to SPAdes assemblies of 14 isolates associated with crown gall.

Velvet Assembly SPAdes Assembly

# of Assembly # of Assembly Isolate N50 N50 scaffolds length scaffolds length name (bp) (bp) > 1kb (bp) > 1kb (bp) 13-2099- 52 275,131 5,552,211 36 305,993 5,655,819 1-2

13-626 579 21,870 6,842,659 47 520,764 6,958,258

AC27/96 154 96,285 7,277,651 62 304,507 7,294,256

AC44/96 47 329,597 6,092,333 42 411,299 6,103,740

B131/95 113 172,630 7,277,855 56 477,595 7,297,525

B133/95 78 301,459 7,163,747 57 478,752 7,169,549

B140/95 153 65,958 5,712,082 39 261,962 5,717,522

N2/73 52 299,053 5,744,801 28 583,418 5,768,133

W2/73 53 247,839 5,490,237 25 437,031 5,510,033

15-1187- 87 146,955 5,982,904 30 349,571 6,003,102 1-2a

15-1187- 141 89,835 5,979,368 33 349,600 6,003,936 1-2b

14-2641 168 101,816 6,253,349 NA NA NA

15-172 106 108,118 5,420,030 48 324,414 5,448,190

15-174 59 247,284 6,032,397 30 609,451 6,054,965

108

109

Chapter 4 – Phylogenomic analyses of the evolution of Rathayibacter toxicus and its ecological interactions with bacteriophage CS14

Edward W. Davis II, Melodie L. Putnam, Lucas Dantas Lopes, Michael S. Belcher, Aaron J. Sechler, Brenda K. Schroeder, Timothy D. Murray, Douglas G. Luster, William L. Schneider, Elizabeth E. Rogers, and Jeff H. Chang

109

110

ABSTRACT

Rathayibacter toxicus is a species of nematode-vectored bacteria that causes diseases to grass. This species also produces a lethal neurotoxin, corynetoxin, frequently when associated with bacteriophage CS14. Phylogenomic analyses of members of the genus Rathayibacter, collected over time and space, provided an evolutionary contrast for an archeological investigation into the steps that led to the co- existence of R. toxicus and CS14 populations. During its evolution, R. toxicus experienced a genome reduction, changes in nucleotide composition, and a change in codon use preference that is more similar to that of CS14. R. toxicus is the only species in the genus to have acquired a clustered regularly interspaced short palindromic repeat

(CRISPR) locus. CRISPR systems are adaptive immune systems that maintain a chronological memory of spacers that inform on past infections to protect against subsequent invasions. One of the two original horizontally acquired CRISPR spacers has sufficient similarity to have primed or mediated interference of CS14. Despite the acquisition of the CRISPR locus early in its evolution, R. toxicus had a massive proliferation of spacers, predicted to be efficacious in targeting an extant CS14 bacteriophage genome. This suggests the CRISPR provides very robust immunity and simulations predict CS14 populations should not be capable of evading immunity.

Nevertheless, the acquisitions of spacers continued up until contemporary time. This points to a persistent interaction between bacteria and bacteriophage and suggests that despite the immune system, in natural ecosystems the two antagonistic populations co- exist.

110

111

SIGNIFICANCE STATEMENT

Clustered regularly interspaced short palindromic repeat (CRISPR) systems are adaptive immune systems of prokaryotes. CRISPRs have been intensively studied and engineered for genome editing. What remains unresolved is the ecological consequences of CRISPR immunity. Phylogenomic analyses of field-collected

Rathayibacter toxicus revealed co-evolution between R. toxicus and a parasitic bacteriophage that is implicated in provoking the bacteria to produce a lethal neurotoxin. Different populations of the bacteria, separated in time and space, continually acquired spacers against the bacteriophage. Thus, in natural ecosystems, the CRISPR is ineffective at driving the phage to extinction and the antagonistic partners continue to adapt to each other.

111

112

INTRODUCTION

Bacteria inhabit practically all niches and provide essential services to the hosts they engage with and the ecosystems in which they persist. Bacteriophages are the most abundant entities in the biosphere. These obligatory parasitic vehicles of nucleic acids have key functions in driving bacterial evolution and influencing the fitness and ecology of bacterial populations (Clokie et al., 2011). The infection of bacteria can result in horizontal gene transfer (HGT) and the innovation of bacterial genomes. In contrast, the fitness costs associated with hosting bacteriophage can drive diversification of genome defense mechanisms, and counter adaptions by the bacteriophages, and engage antagonistic populations in an evolutionary arms race

(Stern and Sorek, 2011).

The clustered regularly interspaced short palindromic repeat (CRISPR) systems are an adaptive immunity that protects genomes. Defense occurs via a three-stage process (Amitai and Sorek, 2016). In the adaptation stage, CRISPR-associated (Cas) proteins identify and acquire target DNAs called protospacers, and integrate them as spacers. This represents a form of immunological memory in which spacers are stored in chronological order and interdigitated between direct repeats within an array

(Barrangou et al., 2007). In the expression stage, the array is transcribed and spacers are processed into small RNA units that interact with Cas proteins. In the interference stage, the processed units target via base-pairing, invading nucleic acids, leading to their Cas-mediated cleavage and degradation.

CRISPR loci are often acquired via HGT (Godde and Bickerton, 2006). The events that follow the initial acquisition are unclear and maintenance of CRISPR loci 112

113 will only occur if benefits outweigh costs, which can be quite severe. Laboratory experiments and inspection of environmental samples indicate that discrimination between self and non-self by CRISPRs is not infallible, as spacers corresponding to the host genome can be found and are predicted to cause autoimmunity (Burstein et al.,

2016; Fineran et al., 2014; Modell et al., 2017; Vercoe et al., 2013; Wei et al., 2015).

CRISPRs have also been implicated in limiting HGT of potentially beneficial traits

(Marraffini and Sontheimer, 2008; Westra et al., 2016). In addition, in populations of

Pseudomonas aeruginosa, isolates with CRISPR loci have smaller genomes relative to those without CRISPR loci (van Belkum et al., 2015). In the absence of a fitness benefit, the CRISPR locus may not be retained in the population.

There are two currently recognized classes and multiple subdivisions of

CRISPRs (Makarova et al., 2015). Class 1, type I systems are defined by the cas3 gene that encodes a helicase, often fused to an endonuclease domain (Sinkunas et al., 2011).

The functionality of this type of CRISPR depends upon the presence of a protospacer- adjacent motif (PAM; 2-5 nucleotides) proximal to the protospacer, essential for acquisition and interference. The type I CRISPR also has an additional mode of priming adaptation where a previously assimilated spacer guides the biased and enhanced acquisition of new spacers from molecules with similar sequences (Sternberg et al.,

2016). Priming is favored over interference when mismatches between spacer and target compromise the efficiency of interference (Fineran et al., 2014; Semenova et al.,

2016).

Bacterial lifestyle and genome size are correlated and hypothesized to reflect the stability of the environment in which bacteria persist (Koonin and Wolf, 2008). 113

114

Bacteria with larger genomes have greater flexibility in metabolism and signaling networks, and are better adapted for success in dynamic environments replete with competing bacteria (Konstantinidis and Tiedje, 2004). Bacteria with small genome sizes tend to be restricted to stable environments with fewer opportunities to interact with other bacteria, conditions that constrain population sizes. These bottlenecks increase susceptibility to the non-adaptive effects of genetic drift. Consequently, seemingly catastrophic events, such as loss of otherwise essential functions and massive attrition of the genome occur yet do not lead to extinction (McCutcheon and

Moran, 2012).

Genome size in bacteria mostly reflects the balance of gene gains and losses.

HGT can expand genomes and diversify populations. However, the inherent deletion bias of bacterial genomes oppose gains (Kuo and Ochman, 2009). Additionally, higher mutation rates, loss of DNA repair, proliferation of selfish elements, and genetic drift, can have strong effects on bacteria with small populations and result in massive reductions in genomes (Parkhill et al., 2003; van Ham et al., 2003). Hallmarks of genome reduction include a high A + T bias, degradation in functional capacity, and fixation of deleterious mutations (McCutcheon and Moran, 2012). But some taxa of free-living bacteria with large effective population sizes have members with small genomes devoid of hallmarks consistent with non-adaptive effects (Boscaro et al.,

2013; Giovannoni et al., 2005). It is hypothesized their streamlined genomes reflect a selection for small cells that confers a benefit in their environment (Giovannoni et al.,

2014).

114

115

Rathayibacter is a genus within the phylum Actinobacteria. These plant pathogenic bacteria must be vectored by Anguina species of and rely on the nematode to first colonize and cause disease to the seedhead (McKay and Ophel, 1993;

Fattah and Al-Assas, 2010; Murray et al., 2017). Rathayibacter toxicus is the only species of the genus known to synthesize corynetoxin, a tunicamycin and class of antibiotic that inhibits cell wall biosynthesis (Price and Tsvetanova, 2007).

Tunicamycins also interfere with early steps of protein glycosylation and are thus lethal neurotoxins to eukaryotes (Jago et al., 1983). The production of the toxin has been associated with bacteriophage CS14 but whether the phage is necessary is unclear

(Kowalski et al., 2007; Ophel et al., 1993).

We characterized the genome sequences from approximately 40 Rathayibacter isolates. Phylogenetic analyses supported their circumscription by the genus and supported the existence of at least 10 species. R. toxicus forms the earliest diverging and longest branch within the clade. The reconstruction of gene gains and losses, in addition to its 1~2 Mb smaller genomes, suggested an evolutionary process of genome reduction. Notably, R. toxicus is the only species within the genus to have acquired a type I-E CRISPR locus. The collection of bacteria, its record of spacers, coupled to the availability of the sequence of an extant CS14, form a unique dataset to retrospectively investigate the effects that acquiring the immunity locus have on genome evolution and its function on the ecology of bacteria-phage interactions.

115

116

RESULTS

The Rathayibacter genus consists of at least ten species

We determined the genome sequences for 40 isolates of Rathayibacter

(Supplemental Table 4.1). The assembled genomes ranged from 2.3 to 4.4 Mb, and the

G + C content ranged from 61% to 72%. The high G + C bias is consistent with levels previously described for members of Actinobacteria (Hershberg and Petrov, 2010). We generated a maximum-likelihood phylogenetic tree that also included publically available Rathayibacter strains as well as those representing genera anticipated to be closely related to Rathayibacter. The Rathayibacter genus formed a well-supported clade (bootstrap value of 100) that is distinct from, but closely related to, the

Clavibacter clade of plant pathogenic bacteria (Figure 4.1A). Within Rathayibacter, there are ten clades, which include eight previously named species; most of the inter- genus nodes were well-supported, having bootstrap values >75.

To delinate species, we calculated and analyzed all pairwise average nucleotide identity (ANI) values between genome sequences (Supplemental Figure 4.1; (Goris et al., 2007). Intra-clade diversity is low, with pairwise ANI values exceeding 99%. Most inter-clade comparisons had ANI values of ~75-80%. We operationally grouped the isolates into species units (using a 94% ANI cut-off; (Konstantinidis and Tiedje,

2005)), but recognized that many of the clades are represented by few isolates and require deeper sampling for increasing confidence in species identities. R. tritici-like and unnamed isolates of Rathayibacter (Leaf185/294) are partitioned into previously undescribed groups, and according to these analyses represent new species.

116

Rathayibacter toxicus forms three clades

117

Twenty-two of the sequenced isolates belong to the R. toxicus species

(Agarkova et al., 2006). The clades based on a whole-genome single nucleotide polymorphism (SNP) tree were consistent with previously published grouping schemes, but had miniscule within-clade genetic distances (Figure 4.1B; (Agarkova et al., 2006; Arif et al., 2016)). There is also spatial structuring, with clade A isolates all collected from Western Australia, and with clade B isolates all from South Australia.

Within clade A, there is a temporal effect, with the slight increase in genetic distance between a pair of isolates from the rest of group A, justifying recognition of sub-clade

A2 (1991 and 2001) separate from sub-clade A1 (1973-1983). The deepest branching clade includes FH100, which was collected from annual rabbitsfoot grass. The implications of this are unknown but may reflect potential host effects.

R. toxicus evolved via genome reduction

Members of R. toxicus have the smallest genomes (~2.3 Mb in size) that are between 1~2 Mb smaller than those of other Rathayibacter species (Figure 4.2A).

Evidence is consistent with an evolutionary process of genome reduction in R. toxicus.

The long branch length in the MLSA tree is consistent with greater divergence in coding sequences of R. toxicus, suggestive of an accelerated rate of change (Figure

4.1A). Genome-wide divergence is supported by the high inter-clade divergence in ANI between all pairwise comparisons with R. toxicus (Supplemental Figure 4.1).

Comparisons between the near-finished genome sequences show that despite the gross changes, the genomes are co-linear, with indels being the primary difference, though some inversions were detected. Importantly, the two compared R. toxicus genomes are 117

118 very similar in structure and order, with few indels that discriminate between the two isolates.

Within the genus, the genomes of R. toxicus have the lowest G + C content of

61%. R. toxicus, like all members of Rathayibacter and many Actinobacteria, lack homologs of mutS, necessary for mismatch repair. But, all Rathayibacter, including R. toxicus, have homologs of the non-canonical mismatch repair pathway, described in

Mycobacterium (Castañeda-García et al., 2017). Additionally, all examined genome sequences have other repair pathways such as homologous recombination, base excision repair, and nucleotide excision repair. But, R. toxicus lacks homologs of both ku and ligD while R. rathayi lacks ku and isolate FH6 of R. iranicus has a frameshift in ligD. The ku and ligD genes are necessary in , Bacillus, and few other taxa of bacteria for a non-homologous end joining pathway, a repair mechanism that is uncommon to bacteria (Gong et al., 2005). It is difficult to envision a scenario where loss of this repair pathway could lead to genome-wide changes in nucleotide composition. Hence, it is likely other forces have contributed to the increase in A + T content.

Expansions in selfish genetic elements and repeat sequences are implicated in genome reduction. Searches revealed few sequences with homology to transposase. R. tritici FH211 had the most with 75 and only one other genome sequence exceeded 50 putative transposase-encoding genes. Additionally, we identified a ~75 nucleotide long palindromic repeat sequence in the genomes of Rathayibacter. While R. festucae, which has the largest genome, has 99 repeats, others varied with 15-50 and showed no

118

119 obvious correlation to genome size; R. toxicus has 19. Thus, if mobile elements and repeats contributed to reduction, they have been culled from the genomes.

We reconstructed gene gains and losses in Rathayibacter (Figure 4.2B). Three species have large genome sizes and the reconstruction model is consistent in predicting more gains than losses, especially at the tips of the trees. R. festucae was predicted to have experienced large-scale changes with a strong bias towards gains. All other species of Rathayibacter were predicted to have sustained losses closer to the base of the tree with variable gains and losses near the tips of the tree. R. toxicus is remarkable in having the number of predicted genome changes being massive and on the same scale as that predicted between the two genera, Rathayibacter and

Clavibacter. R. toxicus is heavily biased towards losses and assuming an average of 1.0 kb per coding sequences, the predicted losses account for over 1 Mb of DNA sequence.

We next asked whether R. toxicus suffered losses in functions that are encoded by its core genome (Figure 4.2C). The core genome of the Rathayibacter genus is predicted to include 1,134 orthologous clusters. Compared to just the “small genome”

Rathayibacter, with a slightly larger predicted core genome size of 1,300 orthologous clusters, predictions revealed 276 orthologous clusters that are associated with functions absent from R. toxicus. The number of absent functions represents a substantial portion of the core, but manual inspection revealed few that would be considered potentially essential, with the exceptions being ribosomal protein L36, and prolyl-tRNA editing protein (Supplemental Table 4.2). But, while mutants with deletions in homologous genes of Escherichia coli are compromised, they are nonetheless viable (Ikegami et al., 2005; Ruan and Söll, 2005). Moreover, functions 119

120 that are enriched but not core in Rathayibacter, such as those involved in modifying the cell wall, protecting against environmental stresses, synthesizing vitamin B, resisting antibiotics, and natural competence, many of which were previously described, are conserved in R. toxicus (Supplemental Figure 4.2; manuscript in preparation). However, approximately 150 of the enriched orthologous clusters are absent from R. toxicus.

While loss was prominent, R. toxicus nonetheless has 180 orthologous clusters, of which 48 have annotated functions, that are core and unique to its species. The most distinguishable predicted functions are the CRISPR associated proteins. HGT was also predicted, and an average of 20 regions representing nearly 200kb were identified.

Isolate-specific HGTs were also predicted for just R. toxicus FH232 and FH79, the two with finished genome sequences. Predicted regions were further filtered to exclude those that were present in the genome sequences of both isolates and those with low confident predictions, e.g., similar in sequence and structure to regions not predicted to be acquired via HGT in other R. toxicus sequences. Isolate FH232 was predicted to have independently acquired at least five regions, totaling an estimated 55,000 kb while

FH79 had one region of approximately 12,500 kb. Few of the acquired genes have annotated functions but several of the predicted proteins have signatures of being secreted or associated with the membrane; putative models for the secretion signal and transmembrane domains often overlap, conflating the two. Each strain encodes a unique set of rearrangement hotspot (Rhs) family proteins, with an associated endoU nuclease predicted at the C-terminus. Together, these predicted functions reveal a potential for R. toxicus to have acquired the ability to secrete nucleases and engage in 120

121 inter-bacterial competition. Isolate FH232 is also predicted to have acquired barnase and barstar, a secreted ribonuclease and its cognate inhibitor.

The type VIIb secretion system-encoding locus was lost from R. toxicus

In comparison to Gram-negative bacteria, Gram-positive bacteria require structurally different apparatuses for secreting proteins into the extracellular milieu.

The type VII secretion system (T7SS) has demonstrable functions in virulence towards eukaryotic hosts, and more recently, for competition with other prokaryotes (Cao et al.,

2016; Flint et al., 2004; Gray et al., 2013; Gröschel et al., 2016). All members of

Rathayibacter encode at least one homolog of EssC/EccC. This is a SpoIIIE-FtsK-like

ATPase that interacts with cargo and, with three other core proteins, forms the membrane-associated complex of the T7SS (Figure 4.3A; Pallen, 2002; Zoltner et al.,

2016). However, the essC genes of R. toxicus are not linked to genes normally associated with the T7SS. Instead, essC is linked to one gene with a translated sequence that has very weak similarity to the cell wall anchored GspB, necessary for

Streptococcus gordonii to efficiently bind to platelets (Bensing and Sullam, 2002;

Xiong et al., 2008). The translated sequence of the other gene is similar to an extracellular solute-binding protein. We therefore suggest this locus of Rathayibacter may be involved in adhesion and/or recognition and not T7SS-mediated secretion.

Except for R. toxicus, species encode additional copies of EssC/EccC that clustered in the phylogenetic tree with homologs of Staphylococcus aureus and

Clostridium acetobutylicum (Figure 4.3A). In the genome sequences, essC/eccC are linked to genes that encode a EsxA/EsxB-related protein, which is a member of the 121

WXG100 superfamily secreted via the T7SS, and four more proteins that are core and

122 necessary for secretion by a T7SSb (Burts et al., 2005; Huppert et al., 2014; Pallen,

2002). The essC/eccC loci of Rathayibacter share a common organization and have high identity to the T7SSb of . Moreover, within Rathayibacter, the loci are syntenic, flanked by transposase sequences, and have a lower GC% than genome averages. Therefore, the putative T7SSb locus of Rathayibacter was horizontally acquired by a common ancestor and lost from R. toxicus.

In firmicutes, the T7SSb loci include a single toxin gene that encodes a LXG family domain in the N-terminus of the translated sequence, as well as a cognate antitoxin gene (Cao et al., 2016). The LXG domain is suggested to be recognized by the T7SS apparatus (Zhang et al., 2011). The T7SSb loci of Rathayibacter spp. are novel in having repetitive, “exported toxin-antitoxin modules” and additional modules in regions distal to the T7SSb loci (Figure 4.3B). Most of the translated sequences of the putative toxin-encoding genes have identifiable LXG domains and the genes are immediately followed by a gene that likely encodes a cognate antitoxin (immunity protein). In monoderm firmicutes, the T7SSb loci include toxin genes that encode LXG family domains in the N-termini of the translated sequences, as well as cognate antitoxin genes. Interestingly, we noted that some genes encoding putative toxin proteins lack the N-terminal LXG domain, but are still followed by cognate antitoxin gene inferred to be functional. The modular pairs of toxin and antitoxin is unique to

Rathayibacter and has not been previously described for T7SS and T7SSb-encoding loci. This structure and variation is reminiscent of type VI secretion loci, which encodes machinery for inter-bacterial competition by Gram-negative bacteria (Chang et al.,

2014). 122

123

R. toxicus acquired a type I-E CRISPR locus

Based on the inventory and organization of cas genes, the locus encodes a type

I-E CRISPR system (Figure 4.4A). There are two attributes that precluded us from confidently prediction a leader sequence. In R. toxicus, there are only 58 nucleotides between cas2 and the most recently acquired spacer. This is unusual because leader sequences are typically between 81 and 599 nucleotides long, though sequences as short as 60 nucleotides have been experimentally shown to be sufficient (Alkhnbashi et al., 2016; Yosef et al., 2012). Secondly, while leader sequences tend to be conserved within CRISPR types, there are no clear patterns across all CRISPR types (Alkhnbashi et al., 2016).

The CRISPR arrays have between 130 to 149 spacers. The direct repeat sequence between spacers in R. toxicus is similar to the cluster 2 repeats of E. coli

(Kunin et al., 2007). The bases that are predicted to form the stem are the most conserved. There are only four unique sets of CRISPR arrays and they have a strong phylogenetic signal (Figures 4.1B and Figure 4.4B). The CRISPR array in clade C is the most divergent. Although the array in clade C is divergent, it nevertheless independently acquired several spacers that are similar to those found in the other arrays. Those in (sub)clades A1, A2, and B, which share a more recent common ancestor, are largely subsets of each other, with few minor exceptions. Within clade B,

FH147 has a deletion of one spacer and between clades A and B, there are an additional six internal deletions. Clades A and B differ in their collection of recently acquired spacers. Additionally, four spacers in sub-clade A1 and eight spacers in sub-clade A2 123

differ. We noted that several spacers within the arrays are similar in sequence,

124 suggesting the same, but polymorphic, regions were acquired during separate infections. Some of these spacer pairs with similar sequences were acquired at separate periods that almost span the entire duration of time in which R. toxicus had the CRISPR locus.

Amelioration of the type I-E CRISPR locus

The two terminal spacers are common in the four CRISPR loci, and flanked by degenerate direct repeat sequences, a phenomenon that was previously noted (Horvath et al., 2008). The conservation of two spacers can be explained by their horizontal acquisition, along with the functional set of cas genes, by the most recent common ancestor of R. toxicus. We therefore posited the transition between these two original spacers and the spacers acquired subsequently, would provide clues to how acquisition of a CRISPR locus ameliorates. To this end, we wrote and applied a computer algorithm to assess targeting of spacers to the extant CS14 and Rathayibacter genome sequences (Figure 4.5A and Supplemental Figure 4.3). Using a stricter interpretation of the rules defined from analyzing the type I-E CRISPR of Escherichia coli, R. toxicus spacers were binned into the categories, interference, primed adaptation, stable, other, and none (Supplemental Figure 4.3; (Fineran et al., 2014)). Because the diversity of past and present target sequences are not known, the categories can only be used to qualitatively assess the potential for targeting.

The first two spacers had little evidence for having the potential to target any of the extant Rathayibacter genome sequence. There was also no strong evidence, based on mapping to any of the extant genome sequences, to suggest any subsequently 124

125 acquired spacer conferred autoimmunity. The enumeration of dinucleotide sequences and clustering based on relative frequency show R. toxicus has the highest relative frequencies of the PAM sequence (TT), relative to members of sister species, and is ostensibly more susceptible to targeting by its CRISPR (Supplemental Figure 4.4). This was expected, as R. toxicus has the highest A + T content of the genus. The observation that R. toxicus clustered closer to CS14 than to members of sister taxa prompted us to compare codon usages. As expected, CS14 was far more similar to R. toxicus

(distance of codon usage of 6.46 ± 0.03) in its codon usage than to any other species of

Rathayibacter (distance of codon usage of 14.21 ± 1.77). When codon usage of R. toxicus was compared to sister species, the distance values had a tight range centered approximately around 7.8.

The acquired CRISPR system primed adaptation of spacers to CS14

Five patterns emerged when the computer algorithm was used to bin spacers based on their comparisons to CS14 and the categories were mapped onto the CRISPR arrays (Figure 4.5B). The first and most terminal spacer was categorized as other, implying a potential to have once been capable of conferring interference or primed adaptation against an ancestral CS14. Second, and most striking is that ~53% of the

300 unique spacers were categorized as perfect, high quality, high quality priming, or priming and inferred to have likely targeted CS14. Third, these 159 predicted targeting spacers are distributed throughout the CRISPR arrays and throughout the bacteriophage genome sequence. Four, between sub-clades A and B, the most recently

125

acquired spacers varied with evidence for more interactions between members of clade

126

A and CS14. Even more interesting is that isolates in sub-clade A2 collected as little as 8 years after those of sub-group A1 and differing by <500 SNPs, differ by the 12 most recently acquired spacers, with six predicted to target CS14. A fifth pattern is revealed when attention is directed to the 67 spacers that were predicted to provide interference. In clade C, its 34 interference spacers are distributed fairly evenly throughout the array. In contrast, in clades A/B, the distribution of intereference spacers is more periodic, with a bias towards more recent acquisitions, especially in the arrays from clade A. The arrays within clade A share in common seven interference spacers, three of which match perfectly to CS14, that are absent from the array in clade B and predicted to target CS14.

A search failed to identify anti-CRISPR candidates in the CS14 sequence (data not shown; (Pawluk et al., 2016)).

DISCUSSION

We used a phylogenomics approach to gain insights into the evolution and ecology of Rathayibacter. Results show CS14 has had a profound impact on R. toxicus. Relative to sister species, R. toxicus is unique in having experienced a genome reduction during its evolution, though the attrition is not to the extreme of obligate endosymbionts (Figure 4.2). Reduction is indicative of a small population size and implies that Rathayibacter experienced a bottleneck. The host-associated phases of the

Rathayibacter lifestyle are likely stable environments but genome analyses suggest

Rathayibacter also competes in dynamic environments, which select for traits such as

126

tolerance to environmental stresses, natural transformation, immunity to antibiotics,

127 and production of antibiotics (Supplemental Figure 4.2). R. toxicus suffered a key loss of its T7SSb-encoding locus, predicted to be used by other Rathayibacter species in inter-bacterial competition, but maintains a functional corynetoxin locus which has analogous function in synthesizing an antibiotic that is also a lethal neurotoxin of eukaryotes (manuscript in preparation). R. toxicus is also predicted to have acquired additional features that may participate in inter-bacterial competition.

The genome reduction in R. toxicus is correlated with an increase in A + T content. Similarities in codon usage between bacteriophage and bacterial hosts have been previously reported and described as an adaptation of the parasite to the host

(Bahir et al., 2009; Lucks et al., 2008). The observations presented here raise an alternative possibility that genome-wide changes in nucleotide composition and codon usage patterns in R. toxicus was a fortuitous event for CS14 that expanded the host range of the bacteriophage. However, while all species of Rathayibacter encode multiple restriction-modification systems, R. toxicus is the only species that additionally encode a CRISPR and Abi-like proteins, which could be involved in abortive infection, another mechanism of defense against bacteriophages (Dy et al.,

2014).

The observation that spacers targeting CS14 were acquired during the last eight years in which R. toxicus were collected show the CRISPR locus, despite having an unusually short leader sequence, is functional. The immunity that CRISPRs provide has been repeatedly demonstrated in laboratory settings and empirical studies show that spacer diversity increases the likelihood for extinction of bacteriophage (Morley et al.,

127 2017; van Houte et al., 2016). This is counter to observations in the environmentally-

128 sampled R. toxicus, in which most of over 300 unique spacers were predicted to target

CS14 and are distributed in a manner indicative of continual and recent infections

(Figure 4.5). Results also revealed spatially and temporally varied interactions between the populations of R. toxicus and CS14 (Figures 4.1B and 4.4).

Models that predict high frequencies of interactions under nutrient limiting conditions and high bacteriophage diversity can compromise efficacy of immunity

(Weinberger et al., 2012; Westra et al., 2015). The distribution of spacers is consistent with a high frequency of interactions between R. toxicus and CS14. In fact, bacteriophage are vectored by nematodes (Dennehy et al., 2006). The likelihood that

CS14 persists in other species of Rathayibacter is blunted by observations that its codon usage is dramatically different. Secondly, given the limits of having only one extant bacteriophage sequence, the number of spacers predicted to be involved in primed adaptation is better viewed as an indicator of high bacteriophage diversity

(Weinberger et al., 2012; Westra et al., 2015). The efficacy of CRISPRs in controlling populations of bacteriophage is thus influenced by diversity in populations and evolutionary time, and also include infection dynamics, spatial structure, alternative hosts, and nonadaptive effects (Grenfell et al., 2004).

The observations regarding immunity to bacteriophage also explains conflicts regarding the effects of CRISPRs on HGT. There is a tension that must be balanced by genome defenses and immune systems of multicellular eukaryotes. Both function to protect against adapted pathogenic entities but must also grant access to entities that are of benefit and once granted, control their proliferation (Xin et al., 2016). A critical

128

difference between CRISPRs and eukaryotic immunity, is in the latter there is signaling

129 to enable tolerance of commensal or beneficial microorganisms (Perez-Lopez et al.,

2016). This is impossible for CRISPRs because naked nucleic acids do not express phenotypes for signaling or for discrimination. The inability to discriminate has prompted hypotheses regarding evolutionary trade-offs of CRISPRs (Bikard et al.,

2012; Marraffini and Sontheimer, 2008). The Rathayibacter dataset provided a contrast at a timescale of species divergence and genetic diversification to test the trade-off hypothesis. Results showed that CRISPRs are not inhibitory to HGT, as each of the clonal groups of R. toxicus independently acquired regions. Evolutionary and ecological parameters all likely contribute to obscure the short-term and individual effects of CRISPR. When viewed on an even larger evolutionary timescale, CRISPRs have no discernable effects on HGT (Gophna et al., 2015).

There is one other observation consistent with the perceived low effectiveness of CRISPR immunity in natural settings. Three strains previously characterized for their interactions with a corynetoxin-provoking bacteriophage are part of our dataset.

Only a subset of infected cells survived infection (Ophel et al., 1993). Assuming the bacteriophage previously used are similar to the sequenced CS14 and at least one of the spacers to the phage confers immunity, we would have predicted that most cells would have been resistant. The phage in the surviving cells behaved as if in pseudolysogeny, a state of arrested development, often in nutrient deprived carrier cells

(Cenens et al., 2013). It is unknown whether a pseudolysogenic state enables avoidance of CRISPR-mediated extinction but it is nonetheless an indicator of a potential nutrient- depauperate condition, which is modeled to compromise immunity (Westra et al.,

129 2015). Importantly, these reported observations point towards the possibility of a trade-

130 off. CS14 may exert a positive fitness effect by influencing the production of corynetoxin by R. toxicus, a dual weapon that could provide a competitive advantage against eukaryotes and competing bacteria.

There are some striking analogies to plant-pathogen interactions, which may be used to resolve the discord between previous observations and those described herein.

Knowledge of plant disease resistance is largely learned from studies of model and crop plants in experimental designs similar to those used to study CRISPRs. In agricultural settings, single dominant disease resistance genes are frequently deployed in monocultures (Boyd et al., 2013). But the selective pressures inevitably result in diversification and adaptation of pathogen populations such as the case of Puccinia graminis tritici race Ug99 (Singh et al., 2011). Indeed, models predict that intermediate to high immune pressures lead to the highest rates of pathogen adaptation (Grenfell et al., 2004). This cycle gives a superficial appearance of a co-evolutionary arms race, but one that is caused by anthropogenic activities. When studied in native contexts, plants and pathogens are in a trench warfare. Disease is always present but rarely reach epidemics, not because of fixation of strong immunity, but because of population-level resistance and tolerance (Segal et al., 1980). Host populations maintain quantitative resistance and their individuals vary in the effectiveness of immunity. Likewise, pathogen populations persist and are diverse in virulence. We therefore suggest that, within a bacterial population or even within extensive CRISPR arrays, the spacers provide population-level resistance that contribute to stabilizing diverse bacteriophage populations and promoting coexistence.

130

131

MATERIALS AND METHODS

DNA sequencing

Rathayibacter were grown in Lysogeny Broth (Bertani, 1951). DNA was extracted using the DNeasy Blood and Tissue kits (QIAGEN, Venlo, Netherlands), following instructions for Gram positive bacteria provided by the manufacturer. A

NanoDrop ND-1000 UV-Vis Spectrophotometer and Qubit 2.0 fluorometer

(ThermoFisher Scientific, Waltham, MA, USA) were used to measure the quality and quantity, respectively, of the DNA. Libraries were prepared, using the Illumina Nextera

XT library preparation protocol and sequenced on the Illumina HiSeq 300 by the Center for Genome Research and Biocomputing (CGRB at Oregon State University; Illumina

Inc., San Diego, CA, USA) or on a PacBio SMRT sequencing instrument (Pacific

Biosciences, Menlo Park, CA, USA).

Genome sequence assembly

Prior to assembly, BBDUK (http://jgi.doe.gov/data-and-tools/bbtools/), with minimum trimmed read length of 100 was used to trim Nextera XT adapter sequences from sequencing reads. SPAdes v. 3.7, with the --careful and -k 21,33,55,77,99 setting, was used to correct for errors and assemble trimmed reads (Bankevich et al., 2012).

Blobtools was used to assess and guide removal of contigs potentially derived from contaminating bacteria (Kumar et al., 2013). Mauve v. 2.4.0 and the finished sequences were used to order contigs and the draft genome sequences were annotated using

Prokka v 1.11 (Darling et al., 2010; Seemann, 2014).

131

Reference genome sequences

132

Genome sequences from Rathayibacter, Clavibacter, Curtobacterium,

Pseudoclavibacter, Leifsonia, Frigoribacterium, and Gulosibacter were extracted from the NCBI nucleotide databases (Supplemental Table 4.3). Genomes were required to be of ‘chromosome’ or ‘scaffold’ assembly quality.

Construction of phylogenetic trees

The Multi-Locus Sequence Analysis (MLSA) phylogenetic tree was generated using autoMLSA (Davis et al., 2016). Reference sequences from six housekeeping genes (ftsY, ychF, rpoB, secY, tsaD, infB) were extracted from the R. toxicus FH142 genome sequence (NCBI Assembly database accession GCF_000986985.1). These were used as queries in TBLASTN v2.5.0+, searches to identify orthologous sequences from other Rathayibacter genome sequences (Camacho et al., 2009).

The EssC sequence from R. rathayi FH95 was used as a query in BLASTP v2.5.0+ searches. The Swiss-Prot database was queried to identify sequences from characterized EssC proteins encoded in the Mycobacterium ,

Staphylococcus aureus, and Bacillus subtilis genomes (Supplemental Table 4.4).

Sequences above a 50% coverage cut-off were used in final tree generation.

Multiple sequence alignments were done using MAFFT v. 7.305b (Katoh and

Standley, 2013). Maximum likelihood phylogenetic tree inference was done using

RAxML v. 8.2.8 with 100 maximum likelihood tree searches (Stamatakis, 2014).

Bootstraps were assessed using the automatic majority rule extended (autoMRE) bootstopping criterion. Phylogenetic trees were visualized using iTOL v3 (Letunic and

Bork, 2016). 132

133

SNP identification and phylogenetic inference

The Bacterial and Archaeal Genome Analyser (BAGA) was used to identify

SNPs in members of the R. toxicus species, which were used to generate a phylogenetic tree (https://github.com/daveuu/baga).

Calculating ANI

Pairwise Average Nucleotide identities were calculated using autoANI according to the guidelines previously proposed (Davis et al., 2016; Goris et al., 2007).

Identifying orthologous clusters

We used GET_HOMOLOGUES v. 2.0 and the OrthoMCL algorithm (-M), with a minimum sequence identity cut-off of 35% (-S 35) to identify orthologous proteins clusters (Contreras-Moreira and Vinuesa, 2013). To determine the functional assignment of protein clusters, a representative ortholog was randomly sampled and analyzed, using interproscan v. 5.23 (Jones et al., 2014).

The orthologous cluster matrix from the GET_HOMOLOGUES output and the phylogenetic tree from autoMLSA were used as inputs for the gain loss mapping engine

(GLOOME) to determine the predicted gains and losses for all strains and phylogenetic clades (Cohen et al., 2010).

The significantly enriched clusters in the Rathayibacter genus were identified using a one-sided Welch’s t-test, with a Bonferonni corrected p-value of 0.01. The t.test and p.adjust methods implemented in R were used (R Core Team, 2017).

The Canberra distance function and Ward’s hierarchical clustering method 133

(ward.D2), implemented in the hclust package of R, was used for hierarchical cluster

134 analysis of the presence/absence matrix (R Core Team, 2017). Heatmaps were plotted using the gplots package implemented in R (R Core Team, 2017).

Identifying core functions lost by R. toxicus

Predicted functions of the core genome of Rathayibacter, excluding R. festucae and R. caricis were determined, once including and once excluding R. toxicus. The lists of functions were compared.

Identifying repeat regions

Regions were identified using the RepeatScout software (Price et al., 2005).

Repeats shared between genomes were clustered using CD-HIT-EST (Fu et al., 2012).

Identification of Mobile Genetic Elements

Orthologous clusters predicted to have either transposase or integrase function were identified using the interproscan output. All predicted transposases and integrases were used as queries in BLASTP v. 2.5.0+ searches. All hits above a 1e-5 cut-off were included.

Predicting HGT

GI_SVM was used to predict HGT (Lu and Leong, 2016). Identified regions that had greater than 95% total coverage to regions not predicted to have been acquired horizontally by other Rathayibacter isolates were filtered out.

134

135

Codon usage

Cusp and codcmp, in the EMBOSS collection of programs, were used to determine codon usage and distances of codon usage, respectively (Rice et al., 2000).

Manual curation of key loci

The interproscan output was inspected to identify proteins putatively associated with the T7SS (e.g. FtsK/SpoEIII domain (PF01580), Ess/Ecc family proteins). Results from the interproscan search were confirmed using BLASTP to query sequences from organisms with experimentally characterized T7SSs. Putative nuclease effectors were identified based on the presence of the LXG motif at the N-terminus and putative nuclease domains at the C-terminus. Esx family proteins were identified based on homology, a length threshold (~ 100aa), and the presence of the WXG motif.

CRISPR spacers were identified using the CRISPR-finder web software (Grissa et al., 2007). The CRISPR loci were oriented relative to the Cas-encoding genes.

Protospacers were predicted based on BLASTN v 2.5.0+ searches to the Rathayibacter phage genome (KX911187.1) with the settings –gapopen 10 –gapextend 2 –reward 1 – penalty -1 –word_size 5, modeled after the CRISPRTarget software (Biswas et al.,

2013). The CRISPRhit software uses rules previously developed and was written in python and depends on the Biopython package (Cock et al., 2009; Fineran et al., 2014).

CRISPRhit can be accessed at https://github.com/osuchanglab/crisprhit.

ACKNOWLEDGEMENTS

We thank Dr. Tal Pupko and Dr. Chris Mundt for their insightful comments, 135

assistance with data analysis and critical reading of this manuscript. We thank Heidi

136

Lederhos for her assistance as well as the members of the CGRB for their assistance with whole genome sequencing. EWD is supported by a Provost’s Distinguished

Graduate Fellowship awarded by Oregon State University. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1314109 to EWD. This was supported by a 2015 Farm Bill grant,

Section 10201 administered through the United States Department of Agriculture,

Animal and Plant Health Inspection Service (15-8130-0608-CA) to JHC and MLP.

This work was also supported by United States Department of Agriculture (USDA)

Agricultural Research Service appropriated project 8044-22000-040-00D and from two

2008 Farm Bill grants, Section 10201 administered through the United States

Department of Agriculture, Animal and Plant Health Inspection Service (13-8130-

0247-CA and 14-8130-0367-CA) to BKS, TDM, DGL, WLS, and EER. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript

136

137

BIBLIOGRAPHY

Agarkova, I. V., Vidaver, A. K., Postnikova, E. N., Riley, I. T., and Schaad, N. W. (2006). Genetic Characterization and Diversity of Rathayibacter toxicus. Phytopathology 96, 1270–1277. doi:10.1094/PHYTO-96-1270.

Alkhnbashi, O. S., Shah, S. A., Garrett, R. A., Saunders, S. J., Costa, F., and Backofen, R. (2016). Characterizing leader sequences of CRISPR loci. Bioinformatics (Oxford, England) 32, i576–i585. doi:10.1093/bioinformatics/btw454.

Amitai, G., and Sorek, R. (2016). CRISPR-Cas adaptation: insights into the mechanism of action. Nat Rev Micro 14, 67–76. doi:10.1038/nrmicro.2015.14.

Arif, M., Busot, G. Y., Mann, R., Rodoni, B., Liu, S., and Stack, J. P. (2016). Emergence of a New Population of Rathayibacter toxicus: An Ecologically Complex, Geographically Isolated Bacterium. PLoS ONE 11, e0156182. doi:10.1371/journal.pone.0156182.

Bahir, I., Fromer, M., Prat, Y., and Linial, M. (2009). Viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences. Mol Syst Biol 5, 311. doi:10.1038/msb.2009.71.

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477. doi:10.1089/cmb.2012.0021.

Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., et al. (2007). CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712. doi:10.1126/science.1138140.

Bensing, B. A., and Sullam, P. M. (2002). An accessory sec locus of Streptococcus gordonii is required for export of the surface protein GspB and for normal levels of binding to human platelets. Mol Microbiol 44, 1081–1094.

Bertani, G. (1951). Studies on lysogenesis. I. The mode of phage liberation by lysogenic Escherichia coli. J Bacteriol 62, 293–300.

Bikard, D., Hatoum-Aslan, A., Mucida, D., and Marraffini, L. A. (2012). CRISPR interference can prevent natural transformation and virulence acquisition during in vivo bacterial infection. Cell Host & Microbe 12, 177–186. doi:10.1016/j.chom.2012.06.003.

Biswas, A., Gagnon, J. N., Brouns, S. J. J., Fineran, P. C., and Brown, C. M. (2013). CRISPRTarget: bioinformatic prediction and analysis of crRNA targets. RNA Biol 10, 817–827. doi:10.4161/rna.24046. 137

Boscaro, V., Felletti, M., Vannini, C., Ackerman, M. S., Chain, P. S. G., Malfatti, S.,

138

et al. (2013). Polynucleobacter necessarius, a model for genome reduction in both free-living and symbiotic bacteria. Proc Natl Acad Sci USA 110, 18590–18595. doi:10.1073/pnas.1316687110.

Boyd, L. A., Ridout, C., O'Sullivan, D. M., Leach, J. E., and Leung, H. (2013). Plant- pathogen interactions: disease resistance in modern agriculture. Trends Genet 29, 233–240. doi:10.1016/j.tig.2012.10.011.

Burstein, D., Sun, C. L., Brown, C. T., Sharon, I., Anantharaman, K., Probst, A. J., et al. (2016). Major bacterial lineages are essentially devoid of CRISPR-Cas viral defence systems. Nat Commun 7, 10613. doi:10.1038/ncomms10613.

Burts, M. L., Williams, W. A., DeBord, K., and Missiakas, D. M. (2005). EsxA and EsxB are secreted by an ESAT-6-like system that is required for the pathogenesis of Staphylococcus aureus infections. Proc Natl Acad Sci USA 102, 1169–1174. doi:10.1073/pnas.0405620102.

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421. doi:10.1186/1471-2105-10-421.

Cao, Z., Casabona, M. G., Kneuper, H., Chalmers, J. D., and Palmer, T. (2016). The type VII secretion system of Staphylococcus aureus secretes a nuclease toxin that targets competitor bacteria. Nat Microbiol 2, 16183. doi:10.1038/nmicrobiol.2016.183.

Castañeda-García, A., Prieto, A. I., Rodríguez-Beltrán, J., Alonso, N., Cantillon, D., Costas, C., et al. (2017). A non-canonical mismatch repair pathway in prokaryotes. Nat Commun 8, 14246. doi:10.1038/ncomms14246.

Cenens, W., Makumi, A., Mebrhatu, M. T., Lavigne, R., and Aertsen, A. (2013). Phage- host interactions during pseudolysogeny: Lessons from the Pid/dgo interaction. Bacteriophage 3, e25029. doi:10.4161/bact.25029.

Chang, J. H., Desveaux, D., and Creason, A. L. (2014). The ABCs and 123s of Bacterial Secretion Systems in Plant Pathogenesis. Annu Rev Phytopathol 52, 317–345. doi:10.1146/annurev-phyto-011014-015624.

Clokie, M. R., Millard, A. D., Letarov, A. V., and Heaphy, S. (2011). Phages in nature. Bacteriophage 1, 31–45. doi:10.4161/bact.1.1.14942.

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England) 25, 1422–1423. doi:10.1093/bioinformatics/btp163.

138

Cohen, O., Ashkenazy, H., Belinky, F., Huchon, D., and Pupko, T. (2010). GLOOME: gain loss mapping engine. Bioinformatics (Oxford, England) 26, 2914–2915.

139

doi:10.1093/bioinformatics/btq549.

Contreras-Moreira, B., and Vinuesa, P. (2013). GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl Environ Microbiol 79, 7696–7701. doi:10.1128/AEM.02411-13.

Darling, A. E., Mau, B., and Perna, N. T. (2010). progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE 5, e11147. doi:10.1371/journal.pone.0011147.

Davis, E. W., II, Weisberg, A. J., Tabima, J. F., Grünwald, N. J., and Chang, J. H. (2016). Gall-ID: tools for genotyping gall-causing phytopathogenic bacteria. PeerJ 4, e2222. doi:10.7717/peerj.2222.

Dennehy, J. J., Friedenberg, N. A., Yang, Y. W., and Turner, P. E. (2006). Bacteriophage migration via nematode vectors: host-parasite-consumer interactions in laboratory microcosms. Appl Environ Microbiol 72, 1974–1979. doi:10.1128/AEM.72.3.1974-1979.2006.

Dy, R. L., Richter, C., Salmond, G. P. C., and Fineran, P. C. (2014). Remarkable Mechanisms in Microbes to Resist Phage Infections. Annu Rev Virol 1, 307–331. doi:10.1146/annurev-virology-031413-085500.

Fattah, F. A., and Al-Assas, K. (2010). Histopathological comparison of galls induced by Anguina tritici with galls subsequently colonised by in . Nematologia Mediterranea 38, 173–177.

Fineran, P. C., Gerritzen, M. J. H., Suárez-Diez, M., Künne, T., Boekhorst, J., van Hijum, S. A. F. T., et al. (2014). Degenerate target sites mediate rapid primed CRISPR adaptation. Proc Natl Acad Sci USA 111, E1629–38. doi:10.1073/pnas.1400071111.

Flint, J. L., Kowalski, J. C., Karnati, P. K., and Derbyshire, K. M. (2004). The RD1 virulence locus of Mycobacterium tuberculosis regulates DNA transfer in Mycobacterium smegmatis. Proc Natl Acad Sci USA 101, 12598–12603. doi:10.1073/pnas.0404892101.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152. doi:10.1093/bioinformatics/bts565.

Giovannoni, S. J., Cameron Thrash, J., and Temperton, B. (2014). Implications of streamlining theory for microbial ecology. The ISME Journal 8, 1553–1565. doi:10.1038/ismej.2014.60.

Giovannoni, S. J., Tripp, H. J., Givan, S., Podar, M., Vergin, K. L., Baptista, D., et al. 139

(2005). Genome streamlining in a cosmopolitan oceanic bacterium. Science 309, 1242–1245. doi:10.1126/science.1114057.

140

Godde, J. S., and Bickerton, A. (2006). The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J. Mol. Evol. 62, 718–729. doi:10.1007/s00239-005-0223-z.

Gong, C., Bongiorno, P., Martins, A., Stephanou, N. C., Zhu, H., Shuman, S., et al. (2005). Mechanism of nonhomologous end-joining in mycobacteria: a low- fidelity repair system driven by Ku, ligase D and ligase C. Nat Struct Mol Biol 12, 304–312. doi:10.1038/nsmb915.

Gophna, U., Kristensen, D. M., Wolf, Y. I., Popa, O., Drevet, C., and Koonin, E. V. (2015). No evidence of inhibition of horizontal gene transfer by CRISPR-Cas on evolutionary timescales. The ISME Journal 9, 2021–2027. doi:10.1038/ismej.2015.20.

Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., and Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57, 81–91. doi:10.1099/ijs.0.64483-0.

Gray, T. A., Palumbo, M. J., and Derbyshire, K. M. (2013). Draft Genome Sequence of MKD8, a Conjugal Recipient Mycobacterium smegmatis Strain. Genome Announc 1, e0014813. doi:10.1128/genomeA.00148-13.

Grenfell, B. T., Pybus, O. G., Gog, J. R., Wood, J. L. N., Daly, J. M., Mumford, J. A., et al. (2004). Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332. doi:10.1126/science.1090727.

Grissa, I., Vergnaud, G., and Pourcel, C. (2007). CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 35, W52–7. doi:10.1093/nar/gkm360.

Gröschel, M. I., Sayes, F., Simeone, R., Majlessi, L., and Brosch, R. (2016). ESX secretion systems: mycobacterial evolution to counter host immunity. Nat Rev Micro 14, 677–691. doi:10.1038/nrmicro.2016.131.

Hershberg, R., and Petrov, D. A. (2010). Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet 6, e1001115. doi:10.1371/journal.pgen.1001115.

Horvath, P., Romero, D. A., Coûté-Monvoisin, A.-C., Richards, M., Deveau, H., Moineau, S., et al. (2008). Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J Bacteriol 190, 1401–1412. doi:10.1128/JB.01415- 07.

Huppert, L. A., Ramsdell, T. L., Chase, M. R., Sarracino, D. A., Fortune, S. M., and Burton, B. M. (2014). The ESX system in Bacillus subtilis mediates protein 140

secretion. PLoS ONE 9, e96267. doi:10.1371/journal.pone.0096267.

141

Ikegami, A., Nishiyama, K.-I., Matsuyama, S.-I., and Tokuda, H. (2005). Disruption of rpmJ encoding ribosomal protein L36 decreases the expression of secY upstream of the spc operon and inhibits protein translocation in Escherichia coli. Bioscience, Biotechnology, and Biochemistry 69, 1595–1602. doi:10.1271/bbb.69.1595.

Jago, M. V., Payne, A. L., Peterson, J. E., and Bagust, T. J. (1983). Inhibition of glycosylation by corynetoxin, the causative agent of annual ryegrass toxicity: a comparison with tunicamycin. Chem Biol Interact 45, 223–234.

Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics (Oxford, England) 30, 1236–1240. doi:10.1093/bioinformatics/btu031.

Katoh, K., and Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol 30, 772–780. doi:10.1093/molbev/mst010.

Konstantinidis, K. T., and Tiedje, J. M. (2004). Trends between gene content and genome size in prokaryotic species with larger genomes. Proc Natl Acad Sci USA 101, 3160–3165. doi:10.1073/pnas.0308653100.

Konstantinidis, K. T., and Tiedje, J. M. (2005). Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA 102, 2567–2572. doi:10.1073/pnas.0409727102.

Koonin, E. V., and Wolf, Y. I. (2008). Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 36, 6688–6719. doi:10.1093/nar/gkn668.

Kowalski, M. C., Cahill, D., Doran, T. J., and Colegate, S. M. (2007). Development and application of polymerase chain reaction-based assays for Rathayibacter toxicus and a bacteriophage associated with annual ryegrass (Lolium rigidum) toxicity. Aust. J. Exp. Agric. 47, 177–183. doi:10.1071/EA05162.

Kumar, S., Jones, M., Koutsovoulos, G., Clarke, M., and Blaxter, M. (2013). Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots. Front Genet 4, 237. doi:10.3389/fgene.2013.00237.

Kunin, V., Sorek, R., and Hugenholtz, P. (2007). Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol 8, R61. doi:10.1186/gb-2007-8-4-r61.

Kuo, C.-H., and Ochman, H. (2009). Deletional bias across the three domains of life. Genome Biol Evol 1, 145–152. doi:10.1093/gbe/evp016. 141

Letunic, I., and Bork, P. (2016). Interactive tree of life (iTOL) v3: an online tool for

142

the display and annotation of phylogenetic and other trees. Nucleic Acids Res 44, W242–5. doi:10.1093/nar/gkw290.

Lu, B., and Leong, H. W. (2016). GI-SVM: A sensitive method for predicting genomic islands based on unannotated sequence of a single genome. J Bioinform Comput Biol 14, 1640003. doi:10.1142/S0219720016400035.

Lucks, J. B., Nelson, D. R., Kudla, G. R., and Plotkin, J. B. (2008). Genome landscapes and bacteriophage codon usage. PLoS Comp Biol 4, e1000001. doi:10.1371/journal.pcbi.1000001.

Makarova, K. S., Wolf, Y. I., Alkhnbashi, O. S., Costa, F., Shah, S. A., Saunders, S. J., et al. (2015). An updated evolutionary classification of CRISPR-Cas systems. Nat Rev Micro 13, 722–736. doi:10.1038/nrmicro3569.

Marraffini, L. A., and Sontheimer, E. J. (2008). CRISPR interference limits horizontal gene transfer in staphylococci by targeting DNA. Science 322, 1843–1845. doi:10.1126/science.1165771.

McCutcheon, J. P., and Moran, N. A. (2012). Extreme genome reduction in symbiotic bacteria. Nat Rev Micro 10, 13–26. doi:10.1038/nrmicro2670.

McKay, A. C., and Ophel, K. M. (1993). Toxigenic Clavibacter/Anguina associations infecting grass seedheads. Annu Rev Phytopathol 31, 151–167. doi:10.1146/annurev.py.31.090193.001055.

Modell, J. W., Jiang, W., and Marraffini, L. A. (2017). CRISPR-Cas systems exploit viral DNA injection to establish and maintain adaptive immunity. Nature 544, 101–104. doi:10.1038/nature21719.

Morley, D., Broniewski, J. M., Westra, E. R., Buckling, A., and van Houte, S. (2017). Host diversity limits the evolution of parasite local adaptation. Molecular Ecology 26, 1756–1763. doi:10.1111/mec.13917.

Murray, T. D., Schroeder, B. K., Schneider, W., Luster, D., Sechler, A., Rogers, E. E., et al. (2017). Rathayibacter toxicus, other Rathayibacter species inducing Bacterial Head Blight of Grasses, and the Potential for Livestock Poisonings. Phytopathology. doi:10.1094/PHYTO-02-17-0047-RVW.

Ophel, K. M., Bird, A. F., and Kerr, A. (1993). Association of Bacteriophage Particles with Toxin Production by Clavibacter toxicus, the Causal Agent of Annual Ryegrass Toxicity. Phytopathology 83, 676–681.

Pallen, M. J. (2002). The ESAT-6/WXG100 superfamily -- and a new Gram-positive secretion system? Trends Microbiol 10, 209–212.

142 Parkhill, J., Sebaihia, M., Preston, A., Murphy, L. D., Thomson, N., Harris, D. E., et

al. (2003). Comparative analysis of the genome sequences of Bordetella pertussis,

143

Bordetella parapertussis and Bordetella bronchiseptica. Nat Genet 35, 32–40. doi:10.1038/ng1227.

Pawluk, A., Staals, R. H. J., Taylor, C., Watson, B. N. J., Saha, S., Fineran, P. C., et al. (2016). Inactivation of CRISPR-Cas systems by anti-CRISPR proteins in diverse bacterial species. Nat Microbiol 1, 16085. doi:10.1038/nmicrobiol.2016.85.

Perez-Lopez, A., Behnsen, J., Nuccio, S.-P., and Raffatellu, M. (2016). Mucosal immunity to pathogenic intestinal bacteria. Nat Rev Immunol 16, 135–148. doi:10.1038/nri.2015.17.

Price, A. L., Jones, N. C., and Pevzner, P. A. (2005). De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, i351–8. doi:10.1093/bioinformatics/bti1018.

Price, N. P. J., and Tsvetanova, B. (2007). Biosynthesis of the tunicamycins: a review. J. Antibiot. 60, 485–491. doi:10.1038/ja.2007.62.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Available at: http://www.R-project.org.

Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, 276–277.

Ruan, B., and Söll, D. (2005). The bacterial YbaK protein is a Cys-tRNAPro and Cys- tRNA Cys deacylase. J Biol Chem 280, 25887–25891. doi:10.1074/jbc.M502174200.

Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069. doi:10.1093/bioinformatics/btu153.

Segal, A., Manisterski, J., Fischbach, G., and Wahl, I. (1980). “How Plant Populations Defend Themselves in Natural Ecosystems,” in Plant Disease: An Advanced Treatise, ed.J. G. Horsfall (Academic Press), 76–98.

Semenova, E., Savitskaya, E., Musharova, O., Strotskaya, A., Vorontsova, D., Datsenko, K. A., et al. (2016). Highly efficient primed spacer acquisition from targets destroyed by the Escherichia coli type I-E CRISPR-Cas interfering complex. Proc Natl Acad Sci USA 113, 7626–7631. doi:10.1073/pnas.1602639113.

Singh, R. P., Hodson, D. P., Huerta-Espino, J., Jin, Y., Bhavani, S., Njau, P., et al. (2011). The emergence of Ug99 races of the stem rust is a threat to world wheat production. Annu Rev Phytopathol 49, 465–481. doi:10.1146/annurev- phyto-072910-095423.

143 Sinkunas, T., Gasiunas, G., Fremaux, C., Barrangou, R., Horvath, P., and Siksnys, V.

(2011). Cas3 is a single-stranded DNA nuclease and ATP-dependent helicase in

144

the CRISPR/Cas immune system. EMBO J 30, 1335–1342. doi:10.1038/emboj.2011.41.

Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30, 1312–1313. doi:10.1093/bioinformatics/btu033.

Stern, A., and Sorek, R. (2011). The phage-host arms race: shaping the evolution of microbes. Bioessays 33, 43–51. doi:10.1002/bies.201000071.

Sternberg, S. H., Richter, H., Charpentier, E., and Qimron, U. (2016). Adaptation in CRISPR-Cas Systems. Mol Cell 61, 797–808. doi:10.1016/j.molcel.2016.01.030. van Belkum, A., Soriaga, L. B., LaFave, M. C., Akella, S., Veyrieras, J.-B., Barbu, E. M., et al. (2015). Phylogenetic Distribution of CRISPR-Cas Systems in Antibiotic-Resistant Pseudomonas aeruginosa. MBio 6, e01796–15. doi:10.1128/mBio.01796-15. van Ham, R. C. H. J., Kamerbeek, J., Palacios, C., Rausell, C., Abascal, F., Bastolla, U., et al. (2003). Reductive genome evolution in Buchnera aphidicola. Proc Natl Acad Sci USA 100, 581–586. doi:10.1073/pnas.0235981100. van Houte, S., Ekroth, A. K. E., Broniewski, J. M., Chabas, H., Ashby, B., Bondy- Denomy, J., et al. (2016). The diversity-generating benefits of a prokaryotic adaptive immune system. Nature 532, 385–388. doi:10.1038/nature17436.

Vercoe, R. B., Chang, J. T., Dy, R. L., Taylor, C., Gristwood, T., Clulow, J. S., et al. (2013). Cytotoxic chromosomal targeting by CRISPR/Cas systems can reshape bacterial genomes and expel or remodel pathogenicity islands. PLoS Genet 9, e1003454. doi:10.1371/journal.pgen.1003454.

Wei, Y., Terns, R. M., and Terns, M. P. (2015). Cas9 function and host genome sampling in Type II-A CRISPR-Cas adaptation. Genes Dev 29, 356–361. doi:10.1101/gad.257550.114.

Weinberger, A. D., Wolf, Y. I., Lobkovsky, A. E., Gilmore, M. S., and Koonin, E. V. (2012). Viral diversity threshold for adaptive immunity in prokaryotes. MBio 3, e00456–12. doi:10.1128/mBio.00456-12.

Westra, E. R., Dowling, A. J., and Broniewski, J. M. (2016). Evolution and Ecology of CRISPR. Annu Rev Ecol Evol Syst 47, 307–331. doi:10.1146/annurev-ecolsys- 121415-032428.

Westra, E. R., van Houte, S., Oyesiku-Blakemore, S., Makin, B., Broniewski, J. M., Best, A., et al. (2015). Parasite Exposure Drives Selective Evolution of

Constitutive versus Inducible Defense. Curr Biol 25, 1043–1049. 144

doi:10.1016/j.cub.2015.01.065.

145

Xin, X.-F., Nomura, K., Aung, K., Velásquez, A. C., Yao, J., Boutrot, F., et al. (2016). Bacteria establish an aqueous living space in plants crucial for virulence. Nature 539, 524–529. doi:10.1038/nature20166.

Xiong, Y. Q., Bensing, B. A., Bayer, A. S., Chambers, H. F., and Sullam, P. M. (2008). Role of the serine-rich surface glycoprotein GspB of Streptococcus gordonii in the pathogenesis of infective endocarditis. Microb. Pathog. 45, 297–301. doi:10.1016/j.micpath.2008.06.004.

Yosef, I., Goren, M. G., and Qimron, U. (2012). Proteins and DNA elements essential for the CRISPR adaptation process in Escherichia coli. Nucleic Acids Res 40, 5569–5576. doi:10.1093/nar/gks216.

Zhang, D., Iyer, L. M., and Aravind, L. (2011). A novel immunity system for bacterial nucleic acid degrading toxins and its recruitment in various eukaryotic and DNA viral systems. Nucleic Acids Res 39, 4532–4552. doi:10.1093/nar/gkr036.

Zoltner, M., Ng, W. M. A. V., Money, J. J., Fyfe, P. K., Kneuper, H., Palmer, T., et al. (2016). EssC: domain structures inform on the elusive translocation channel in the Type VII secretion system. Biochem J 473, 1941–1952. doi:10.1042/BCJ20160257.

145

146

Figure 4.1. Rathayibacter spp. form a distinct phylogenetic unit comprised of 10 species-level groups.

A) Multi-locus sequence analysis (MLSA) maximum likelihood phylogenetic tree of Rathayibacter spp. and related genera. The MLSA tree is based on six housekeeping genes. Isolates with identical sequences were collapsed into a single branch. Bootstrap values are reported for the main branches in the trunk of the tree (inner numbers) as well as species level branches (outer numbers). The Rathayibacter genus split is well supported (100%, fuschia). All main splits within Rathayibacter are supported with >50% support values, excluding the split between the small and large Rathayibacter genomes (marked with *, 37% support). Groups are color coded based on species (within Rathayibacter) or genus (other genera). B) Maximum likelihood phylogenetic tree derived from whole genome SNP data from R. toxicus. Labels (A, B, C) correspond to the three sub-specific groups of R. toxicus.

146

147

Figure 4.2. R. toxicus experienced massive changes to its genome.

A) Co-linearity of genomes. Co-linear blocks are indicated with red lines and shaded according to percent identity (lightest red = 70% identity, darkest red = 99.5% identity). Regions with signatures indicative of HGT are marked in gray. Other features are also labeled; blue = transposase, orange = phage integrase, cyan = RHS family protein. The CRISPR locus is labeled in green. Sizes (Mb) are listed above and below the figure. Only near finished genes were included in the analyses. B) Reconstruction of gene gains (green) and losses (red) for Rathayibacter. The predicted numbers of gains and losses are shown for each node. Genus level comparisons were between Rathayibacter and Clavibacter. Isolates are color-coded according to operational classification of species, as in Figure 4.1. C) Stacked bar plot of shared gene content. Orthologous clusters were determined and classified based on categories listed in the key.

147

148

Figure 4.3. Rathayibacter encode homologs of the T7SS-associated EssC protein.

A) Maximum likelihood phylogenetic tree of translated essC/eccC sequences. Clades are color-coded according to predicted functions. Most nodes had bootstrap values > 90; only those less than 75 are indicated. B) Structures of representative loci. Arrows representing coding sequences, with directions indicating strand; colors: red = genes common to T7SS- and T7SSb-loci, green – demonstrably necessary for T7SS; blue = demonstrably necessary for T7SSb. Representative toxin-antitoxin modules are depicted; Ψ-toxin = incomplete toxin proteins.

148

149

Figure 4.4. R. toxicus strains encode a type I-E CRISPR-Cas system.

A) Representative cas locus structure from R. toxicus (top) compared to the locus of E. coli K-12 (bottom). Red blocks indicate homologous translated sequences, with shading indicating percent identity (darkest = 40% identity, lightest = 27% identity).

B) Alignment of the direct repeat from the R. toxicus CRISPR array (top) to the direct repeat from the E. coli CRISPR array (bottom). C) The structure of the four CRISPR arrays. Spacers are predicted to target (interference, priming, stable, and other) to CS14 (red) or an unknown molecule (white). Spacers that are identical between CRISPR arrays are connected via gray lines. Spacers that are identical within a CRISPR array are connected via red lines.

149

Figure 4.5. R. toxicus has continually acquired spacers that target CS14.

A) Alignment of non-redundant spacer (300 total unique spacers) to CS14. Categories were predicted using the computer algorithm (see Supplemental Figure 4.3). Arrows represent the coding sequences of CS14. B) Mapping of spacers acquired over time. The arrays were flipped, with the most ancient spacers located on the left; most recent common ancestor (MRCA) and the most contemporary spacer on the right; * = identical to CS14; filled triangles = deletion of spacer in one of the clonal groups, open triangle = deletion of spacer in one isolate. Divergence in spacer arrays are indicated with arrows that are labeled with clonal groups and sub-groups. Categories were predicated using the computer algorithm (see Supplemental Figure 4.3).

150

151

Supplemental Figure 4.1. Average nucleotide identity analysis supports 10 species-level groups within Rathayibacter.

Isolates are color coded by species (within Rathayibacter) or broader group (other strains). Species-level groups were defined based on an ANI threshold of 94% ANI cut-off.

151

152

Supplemental Figure 4.2. Enrichment analyses for Rathayibacter compared to related genera.

Clusters of orthologous genes were generated. The heatmap shows presence (colored boxes) or absence (white) of orthologous clusters. Dendograms along the top and side show relationships between isolates, based on similarities in presence/absence patterns, and Ward’s hierarchical clustering method based on the Canberra distances between presence/absence for each strain, respectively.

152

153

Supplemental Figure 4.3. Depiction of algorithm designed to predict functional outcomes of spacer-protospacer interactions.

Predicted outcomes of spacer-protospacer interactions are binned into direct interference, priming, and stable categories. Spacer-protospacer interactions that cannot be categorized are placed into ‘other’. Key positions and mismatch types that guided categorization are indicated in the box (upper right). Perfect and high quality matches were predicted to result in direct interference (DI). Interactions were binned according to five different filtering steps. The percent of the data explained by each filter is shown, based on the dataset from Fineran et al. (2014). Relative confidence of predictions is indicated, with the predictions going from most to least confident as interactions pass through the filters. Algorithm and benchmarking data can be found at https://github.com/osuchanglab/crisprhit

153

Supplemental Figure 4.3. Depiction of algorithm designed to predict functional outcomes of spacer-protospacer interactions. 154

155

Supplemental Figure 4.4. Rathayibacter toxicus and CS14 have similar dinucleotide frequency patterns.

Strains are arranged in rows, and each of the sixteen dinucleotides is arranged as a column. The data are scaled based on the standard deviation from the column average (Z score), and color coded by Z score. Rows and columns were individually clustered based on the Euclidean distance between the rows and columns, respectively.

155

156

156

Supplemental Figure 4.4. Rathayibacter toxicus and CS14 have similar dinucleotide frequency patterns.

Supplemental Table 4.1. Rathayibacter isolates sequenced in this study.

Genome Coding Strain Chang Strain Abbr. Species Sequenced by? Technology Size CDS Density GC% Reference (Mb) (CDS/kb) CA4 CA4 Rag agropyri USDA-ARS PacBio 3.07 2906 0.947 68.03 T DSM 15933 DSM 15933 Rca caricis USDA-ARS PacBio 4.36 3842 0.933 71.35 T DSM 15932 DSM 15932 Rfe festucae USDA-ARS PacBio 4.12 3943 0.904 72.3 DSM 26693 Z040 Rfe festucae Chang Illumina NA NA NA NA FH6 FH6 Rir iranicus USDA-ARS PacBio 2.36 3317 0.965 67.11 FH154 RFBF1 Rir iranicus Chang Illumina 2.31 3034 0.953 67.22 TRS5 RFBH1 Rir iranicus Chang Illumina 2.31 3037 0.953 67.22 T DSM 7484 Z041 Rir iranicus Chang Illumina 2.31 3237 0.955 67.16 T DSM 7485 FH95 Rra rathayi USDA-ARS PacBio 2.31 2989 0.963 69.38 CS6 RFBH4 Rra rathayi Chang Illumina 2.32 3075 0.964 69.26 15-709-1-1a Z007 Rra rathayi Chang Illumina 2.31 2908 0.952 69.61 16-R75-3a Z015 Rra rathayi Chang Illumina 2.31 3007 0.954 69.37 16-586-2b Z022 Rra rathayi Chang Illumina 2.31 2860 0.949 69.54 FH100 FH100 Rto toxicus Chang Illumina 2.31 2285 0.967 61.38 Agarkova et al. 2006 FH128 FH128 Rto toxicus Chang Illumina 2.31 2201 0.951 61.45 Agarkova et al. 2006 FH137 FH137 Rto toxicus Chang Illumina 2.31 2207 0.956 61.5 Agarkova et al. 2006 FH138 FH138 Rto toxicus Chang Illumina 2.32 2208 0.955 61.49 Agarkova et al. 2006 FH139 FH139 Rto toxicus Chang Illumina 3.18 2201 0.952 61.49 Agarkova et al. 2006 FH140 FH140 Rto toxicus Chang Illumina 2.40 2198 0.949 61.45 Agarkova et al. 2006 FH141 FH141 Rto toxicus Chang Illumina 3.44 2200 0.952 61.46 Agarkova et al. 2006 FH142 FH142 Rto toxicus Chang Illumina 2.31 2196 0.951 61.5 Agarkova et al. 2006 FH145 FH145 Rto toxicus Chang Illumina 2.34 2199 0.952 61.5 Agarkova et al. 2006

157

Supplemental Table 4.1. (Continued) Rathayibacter isolates sequenced in this study.

FH146 FH146 Rto toxicus Chang Illumina 2.31 2197 0.950 61.49 Agarkova et al. 2006 FH147 FH147 Rto toxicus Chang Illumina 2.31 2199 0.953 61.49 Agarkova et al. 2006 FH149 FH149 Rto toxicus Chang Illumina 2.31 2207 0.954 61.45 Agarkova et al. 2006 FH183 FH183 Rto toxicus Chang Illumina 2.31 2204 0.950 61.4 Agarkova et al. 2006 FH232 FH232 Rto toxicus USDA-ARS PacBio 2.31 2302 0.958 61.38 FH78 FH78 Rto toxicus Chang Illumina 3.10 2209 0.957 61.49 Agarkova et al. 2006 T DSM 7488 FH79 Rto toxicus USDA-ARS PacBio 2.31 2225 0.949 61.47 FH81 FH81 Rto toxicus Chang Illumina 3.43 2206 0.955 61.5 Agarkova et al. 2006 FH82 FH82 Rto toxicus Chang Illumina 3.46 2202 0.953 61.48 Agarkova et al. 2006 FH85 FH85 Rto toxicus Chang Illumina 3.26 2197 0.951 61.49 Agarkova et al. 2006 FH86 FH86 Rto toxicus Chang Illumina 3.18 2206 0.954 61.49 Agarkova et al. 2006 FH89 FH89 Rto toxicus Chang Illumina 3.19 2195 0.950 61.5 Agarkova et al. 2006 FH99 FH99 Rto toxicus Chang Illumina 3.19 2198 0.952 61.49 Agarkova et al. 2006 FH211 FH211 Rtr tritici USDA-ARS PacBio 3.17 2991 0.939 69.76 CT104 RFBD7 Rtr tritici Chang Illumina 3.06 3016 0.924 69.67 T DSM 7486 RFBH6 Rtr tritici Chang Illumina 3.15 2956 0.931 69.84 TRS19 RFBC6 Rtr2 tritici-like Chang Illumina 3.01 3155 0.920 72.71 NCPPB 2253 RFBD1 Rtr2 tritici-like Chang Illumina 3.39 3175 0.916 72.62

158

159

Supplemental Table 4.2. Core functions of Rathayibacter absent from R. toxicus.

Pfam/ Orthologous TIGRFAM Predicted Function Cluster Accession 30 PF07920 Protein of unknown function DUF1684 98 PF13622 Thioesterase-like superfamily 100 TIGR01722 Methylmalonate-semialdehyde dehydrogenase 139 PF00343 Glycosyl transferase, family 35 139 PF11897 Glycogen phosphorylase, domain of unknown function DUF3417 139 TIGR02094 Alpha-glucan phosphorylase 297 PF01810 Lysine-type exporter protein (LYSE/YGGA) 331 PF12320 Nuclease SbcCD subunit D, C-terminal domain 331 TIGR00619 Nuclease SbcCD subunit D 388 PF01544 Mg2+ transporter protein, CorA-like/Zinc transport protein ZntB 388 TIGR00383 Magnesium/cobalt transport protein CorA 441 PF03176 Membrane transport protein, MMPL domain 465 PF06224 Protein of unknown function DUF1006 478 PF03819 NTP pyrophosphohydrolase MazG, putative catalytic core 540 PF01028 DNA topoisomerase I, catalytic core, eukaryotic-type 719 PF01047 MarR-type HTH domain 731 PF03109 UbiB domain Conserved hypothetical protein CHP03858, luciferase-like 746 TIGR03858 monooxygenase, putative 754 PF02733 DhaK domain 754 TIGR02363 Dihydroxyacetone kinase DhaK, subunit 1 761 PF01447 Peptidase M4 domain 761 PF02868 Peptidase M4, C-terminal 770 PF16242 Pyridoxamine 5'-phosphate oxidase like 774 TIGR00040 Phosphodiesterase MJ0936/Vps29 798 PF01047 MarR-type HTH domain 946 PF03972 MmgE/PrpD 947 TIGR02317 2-methylisocitrate lyase 953 TIGR01800 2-methylcitrate synthase/citrate synthase type I 957 PF03190 Domain of unknown function DUF255 988 PF03180 Lipoprotein NlpA family 1045 PF03575 Peptidase S51 1053 PF04012 PspA/IM30 1054 PF04536 TPM domain 1081 PF01841 Transglutaminase-like 1117 PF12073 Protein of unknown function DUF3553 1218 PF04402 Protein of unknown function DUF541 159

160

Supplemental Table 4.2. (Continued) Core functions of Rathayibacter absent from R. toxicus.

1269 TIGR03558 Luciferase family oxidoreductase, group 1 1270 TIGR00045 Glycerate kinase 1361 PF13223 Domain of unknown function DUF4031 1397 PF01654 Cytochrome ubiquinol oxidase subunit 1 1398 PF02322 Cytochrome ubiquinol oxidase subunit 2 1398 TIGR00203 Cytochrome ubiquinol oxidase subunit 2 1401 PF11160 Protein of unknown function DUF2945 1402 PF00724 NADH:flavin oxidoreductase/NADH oxidase, N-terminal 1468 PF14013 Antitoxin Rv0909/MT0933 1508 PF00444 Ribosomal protein L36 1508 TIGR01022 Ribosomal protein L36 1593 PF04978 Protein of unknown function DUF664 1689 PF01741 Large-conductance mechanosensitive channel 1689 TIGR00220 Large-conductance mechanosensitive channel 1752 PF06197 Protein of unknown function DUF998 1753 PF13602 Zinc-binding dehydrogenase 1813 PF05768 Glutaredoxin-like 1814 PF02386 Cation transporter 1831 PF02913 FAD-linked oxidase, C-terminal 1940 PF02127 Peptidase M18 Nitrilotriacetate monooxygenase component A/pristinamycin IIA 1970 TIGR03860 synthase subunit A 1992 PF00392 Transcription regulator HTH, GntR 1999 PF01212 Aromatic amino acid beta-eliminating lyase/threonine aldolase 2003 PF13376 Bacteriocin-protection, YdeI or OmpD-Associated 2032 PF13563 2'-5' RNA ligase superfamily 2053 PF12867 DinB-like domain 2071 PF05656 Protein of unknown function DUF805 2246 PF16657 Maltogenic Amylase, C-terminal 2246 TIGR02456 Trehalose synthase/alpha-amylase, N-terminal 2276 PF02274 Amidinotransferase 2277 TIGR01885 Ornithine aminotransferase 2315 PF00596 Class II aldolase/adducin N-terminal 2318 PF14403 Circularly permuted ATP-grasp type 2 2319 PF04168 Domain of unknown function DUF403 2320 PF01841 Transglutaminase-like 2320 PF08379 Bacterial transglutaminase-like, N-terminal 2349 PF03992 Antibiotic biosynthesis monooxygenase domain

2417 PF00255 Glutathione peroxidase 160 2445 PF12146 Serine aminopeptidase, S33

161

Supplemental Table 4.2. (Continued) Core functions of Rathayibacter absent from R. toxicus.

2449 PF11209 Protein of unknown function DUF2993 2510 PF04029 ComB-like 2526 PF07978 NIPSNAP 2641 PF13515 Fusaric acid resistance protein-like 2728 PF04978 Protein of unknown function DUF664 2829 PF10646 GerMN domain 2829 PF10647 Lipoprotein LpqB, beta-propeller domain, C-terminal 2888 PF08028 Acyl-CoA dehydrogenase, C-terminal domain 2948 PF01841 Transglutaminase-like 3039 PF13476 AAA domain 3050 PF00251 Glycosyl hydrolase family 32, N-terminal 3050 PF08244 Glycosyl hydrolase family 32, C-terminal 3073 PF01841 Transglutaminase-like ABC transporter, CydDC cysteine exporter (CydDC-E) family, 3098 TIGR02857 permease/ATP-binding protein CydD ABC transporter, CydDC cysteine exporter (CydDC-E) family, 3099 TIGR02868 permease/ATP-binding protein CydC 3152 PF02190 ATP-dependent protease La (LON), substrate-binding domain 3176 TIGR02033 Hydantoinase/dihydropyrimidinase 3177 TIGR03842 F420-dependent oxidoreductase-predicted, CPS4043 3187 PF11533 Protein of unknown function DUF3225 3190 TIGR01879 Amidase, carbamoylase-type 3194 PF09349 Oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase 2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase, 3194 TIGR03180 type 2 3195 PF00576 Transthyretin/hydroxyisourate hydrolase, superfamily 3195 TIGR02962 Hydroxyisourate hydrolase 3197 PF01014 Uricase 3197 TIGR03383 Uricase 3217 PF03724 Domain of unknown function DUF306, Meta/HslJ 3254 TIGR00011 Prolyl-tRNA editing protein, YbaK/EbsC Mycothiol-dependent maleylpyruvate isomerase, metal-binding 3269 PF11716 domain 3269 TIGR03083 Conserved hypothetical protein CHP03083, actinobacterial-type 3269 TIGR03086 Conserved hypothetical protein CHP03086 3290 PF04122 Putative cell wall binding repeat 2 3312 PF03594 Benzoate transporter 3312 TIGR00843 Benzoate transporter

161

162

Supplemental Table 4.2. (Continued) Core functions of Rathayibacter absent from R. toxicus.

3375 PF02452 mRNA interferase PemK-like protein 3503 PF04248 Domain of unknown function DUF427 3580 TIGR00765 Virulence factor BrkB 3667 PF01068 DNA ligase, ATP-dependent, central 3667 PF04679 DNA ligase, ATP-dependent, C-terminal 3669 PF10013 Uncharacterised conserved protein UCP037205 3669 PF11625 Protein of unknown function DUF3253 3678 PF14016 Protein of unknown function DUF4232 3681 PF08327 Activator of Hsp90 ATPase homologue 1-like 3931 PF06877 Regulator of ribonuclease activity B domain 3952 PF09084 SsuA/THI5-like 3960 PF13409 Glutathione S-transferase, N-terminal 3960 PF13410 Glutathione S-transferase, C-terminal domain 3994 PF13380 CoA-binding 4024 PF02113 Peptidase S13, D-Ala-D-Ala carboxypeptidase C 4024 TIGR00666 Peptidase S13, D-Ala-D-Ala carboxypeptidase C 4060 PF13482 Ribonuclease H-like domain 4060 PF13604 AAA domain 4060 TIGR03491 RecB family nuclease, TM0106, putative 4136 PF11303 Protein of unknown function DUF3105 4137 PF03713 Domain of unknown function DUF305 4149 PF01047 MarR-type HTH domain 4187 PF07264 Etoposide-induced protein 2.4 (EI24) 4267 PF10990 Protein of unknown function DUF2809 4274 PF13788 Domain of unknown function DUF4180 4365 PF03663 Glycoside hydrolase, family 76 Nitrilotriacetate monooxygenase component A/ 4372 TIGR03860 pristinamycin IIA synthase subunit A 4376 PF01047 MarR-type HTH domain 4418 PF02610 L-arabinose isomerase 4418 PF11762 L-arabinose isomerase, C-terminal 4424 PF00933 Glycoside hydrolase, family 3, N-terminal 4424 PF01915 Glycoside hydrolase family 3 C-terminal domain 4424 PF14310 Fibronectin type III-like domain 4467 PF06283 ThuA-like domain 4581 PF13527 Acetyltransferase (GNAT) domain 4581 PF13530 Enhanced intracellular survival protein domain 4587 TIGR03057 xxxLxxG heptad repeat

4649 PF09084 SsuA/THI5-like 162 4649 TIGR01728 Aliphatic sulfonates-binding protein

4669 PF01522 NodB homology domain

163

Supplemental Table 4.2. (Continued) Core functions of Rathayibacter absent from R. toxicus.

Signal transduction histidine kinase, 4745 PF02702 osmosensitive K+ channel sensor, N-terminal 4745 PF13493 Sensor protein KdpD, transmembrane domain Conserved hypothetical protein CHP03704, 4773 TIGR03704 protein-(glutamine-N5) methyltransferase 4810 PF13275 S4 domain 4847 PF01047 MarR-type HTH domain 4867 PF01384 Phosphate transporter 4869 PF11829 Protein of unknown function DUF3349 4918 PF03773 Predicted permease DUF318 4919 PF09323 Protein of unknown function DUF1980 4919 TIGR03943 Protein of unknown function DUF1980 4958 PF02515 CoA-transferase family III 4999 PF00801 PKD domain 4999 PF03422 Carbohydrate binding module family 6 4999 PF07995 Glucose/Sorbosone dehydrogenase 4999 PF16640 Bacterial Ig-like domain (group 3) 5002 PF06030 Protein of unknown function DUF916, cell surface putative 5004 PF01658 Myo-inositol-1-phosphate synthase, GAPDH-like 5004 PF07994 Myo-inositol-1-phosphate synthase 5023 PF07859 Alpha/beta hydrolase fold-3 5027 PF04230 Polysaccharide pyruvyl transferase 5082 PF13828 Domain of unknown function DUF4190 5108 PF00874 PRD domain 5108 PF05043 Mga helix-turn-helix domain 5135 PF14013 Antitoxin Rv0909/MT0933 5158 PF04248 Domain of unknown function DUF427 5161 PF07920 Protein of unknown function DUF1684 5205 PF00989 PAS fold 5211 PF00724 NADH:flavin oxidoreductase/NADH oxidase, N-terminal 5231 PF13768 von Willebrand factor, type A 5240 PF07905 Purine catabolism PurC-like domain 5243 PF12773 Double zinc ribbon 5244 PF16918 Protein kinase G, tetratricopeptide repeat containing domain 5244 PF16919 Protein kinase G, rubredoxin domain 5250 PF05816 Toxic anion resistance 5252 PF09844 Protein of unknown function DUF2071 5268 PF00392 Transcription regulator HTH, GntR 5289 PF08922 Protein of unknown function DUF1905 5289 PF13376 Bacteriocin-protection, YdeI or OmpD-Associated 163

5295 PF13668 Ferritin-like domain

164

Supplemental Table 4.2. (Continued) Core functions of Rathayibacter absent from R. toxicus.

5299 PF01161 Phosphatidylethanolamine-binding protein 5299 TIGR00481 YbhB/YbcL 5302 TIGR02823 Acrylyl-CoA reductase AcuI 5304 PF01258 Zinc finger, DksA/TraR C4-type 5378 TIGR03666 PPOX class F420-dependent enzyme, family Rv2061, putative 5393 PF13602 Zinc-binding dehydrogenase

164

165

Supplemental Table 4.3. Accessions of genome sequences included in this study.

Strain Full Name Accession DSM 17612 Agrococcus lahaulensis DSM 17612 GCF_000425105.1 PF008 Clavibacter michiganensis subsp. capsici GCF_001280205.1 R1-1 Clavibacter michiganensis subsp. insidiosus GCF_000958465.1 Clavibacter michiganensis subsp. michiganensis NCPPB 382 NCPPB 382 GCF_000063485.1 Clavibacter michiganensis subsp. nebraskensis NCPPB 2581 NCPPB 2581 GCF_000355695.1 ATCC 33113 Clavibacter michiganensis subsp. sepedonicus GCF_000069225.1 UCD-AKU Curtobacterium flaccumfaciens UCD-AKU GCF_000349565.1 BH-2-1-1 Curtobacterium sp. BH-2-1-1 GCF_001806325.1 Leaf183 Curtobacterium sp. Leaf183 GCF_001424385.1 MR MD2014 Curtobacterium sp. MR MD2014 GCF_000772085.1 S6 Curtobacterium sp. S6 GCF_000710345.2 UNCCL17 Curtobacterium sp. UNCCL17 GCF_000686685.1 Leaf164 Frigoribacterium sp. Leaf164 GCF_001423665.1 Leaf186 Frigoribacterium sp. Leaf186 GCF_001423905.1 Leaf254 Frigoribacterium sp. Leaf254 GCF_001422145.1 Leaf415 Frigoribacterium sp. Leaf415 GCF_001424645.1 Leaf44 Frigoribacterium sp. Leaf44 GCF_001421865.1 Leaf8 Frigoribacterium sp. Leaf8 GCF_001421165.1 DSM 13485 Gulosibacter molinativorax DSM 13485 GCF_000425685.1 Leaf325 Leifsonia sp. Leaf325 GCF_001423545.1 Root112D2 Leifsonia sp. Root112D2 GCF_001424905.1 Root60 Leifsonia sp. Root60 GCF_001427485.1 356 LXYL Leifsonia xyli GCF_001063195.1 SE134 Leifsonia xyli GCF_001647635.1 DSM 46306 Leifsonia xyli subsp. cynodontis DSM 46306 GCF_000470775.1 CTCB07 Leifsonia xyli subsp. xyli str. CTCB07 GCF_000007665.1 DSM 23366 Pseudoclavibacter soli DSM 23366 GCF_000423505.1 Leaf185 Rathayibacter sp. Leaf185 GCF_001423885.1 Leaf294 Rathayibacter sp. Leaf294 GCF_001423005.1 Leaf296 Rathayibacter sp. Leaf296 GCF_001423045.1 Leaf299 Rathayibacter sp. Leaf299 GCF_001423055.1 VKM Ac- 2596 Rathayibacter tanaceti GCF_001634385.1 DSM 7488 Rathayibacter toxicus DSM 7488 GCF_000425325.1 70137 Rathayibacter toxicus GCF_000875675.1 FH142 Rathayibacter toxicus GCF_000986985.1 WAC3373 Rathayibacter toxicus GCF_001465855.1

165 NCPPB 1953 Rathayibacter tritici GCF_001641005.1 B187 Zimmermannella faecalis ATCC 13722 GCF_000381765.1

166

Supplemental Table 4.4. Accessions/locus tags of EssC protein sequences used in phylogenetic tree.

Category Accession/Locus_Tag Label sp|P9WNA5| T7SS ECCC5_MYCTU M. tuberculosis H37Rv sp|B2HST4| T7SS ECCC5_MYCMM M. marinum ATCC BAA-535 sp|P9WNA9| T7SS ECCC3_MYCTU M. tuberculosis H37Rv sp|P9WNA8| T7SS ECCC3_MYCTO M. tuberculosis CDC 1551 sp|D1A4G7| T7SS ECCC_THECD T. curvata ATCC 19995 T7SS WP_055824665.1 Leifsonia sp. Root60/Root1293 T7SS WP_056168985.1 Leifsonia sp. Leaf325 Novel T7SS WP_056124274.1 Curtobacterium sp. Leaf183 Novel T7SS WP_056044173.1 Frigoribacterium sp. Leaf8 Novel T7SS WP_055975712.1 Frigoribacterium sp. Leaf254/415 Novel T7SS RFBA6_03488 RFBA6 Novel T7SS RFBB5_00909 RFBB5 Novel T7SS RFBG4_03381 RFBG4 Novel T7SS RFBH5_03440 RFBH5 Novel T7SS RFBI5_03502 RFBI5 Novel T7SS Z016_03453 Z016 T7SSb sp|A4IKE7|ECCC_GEOTN G. thermodenitrificans NG80-2 T7SSb sp|C0SPA7|YUKB_BACSU B. subtilis 168 T7SSb sp|Q6GK24|ESSC_STAAR S. aureus MRSA252 T7SSb sp|P0C048|ESSC_STAAE S. aureus Newman T7SSb sp|Q932J9|ESSC_STAAM S. aureus Mu50 T7SSb sp|Q6GCI5|ESSC_STAAS S. aureus MSSA476 T7SSb sp|Q04351|Y3709_CLOAB C. acetobutylicum ATCC 824 T7SSb CA4_02599 R. agropyri CA4 T7SSb FH6_01868 R. iranicus FH6 T7SSb Z015_00415 R. rathayi Z015 T7SSb Z022_00429 R. rathayi Z022 T7SSb FH95_00440 R. rathayi FH95 T7SSb Z007_00518 R. rathayi Z007 Cryptic T7SSb WP_056059032.1 Frigoribacterium sp. Leaf164 1 Cryptic T7SSb WP_030016421.1 Curtobacterium sp. S6 Cryptic T7SSb WP_048680654.1 Leifsonia xyli 356_LXYL

Cryptic T7SSb FH137_01769 R. toxicus FH137 166 Cryptic T7SSb FH232_01825 R. toxicus FH232

Cryptic T7SSb FH211_01709 R. tritici FH211

167

Supplemental Table 4.4. (Continued) Accessions/locus tags of EssC protein sequences used in phylogenetic tree.

Cryptic T7SSb RFBD7_01686 R. tritici RFBD7 Cryptic T7SSb FH6_01584 R. iranicus FH6 Cryptic T7SSb CA4_00336 R. agropyri CA4 Unknown WP_051224508.1 P. soli DSM 23366 1 Unknown FH232_00251 R. toxicus FH232 Unknown FH79_00095 R. toxicus FH79 Unknown WP_066653334.1 Curtobacterium sp. MR_MD2014 Unknown WP_056259549.1 Frigoribacterium sp. Leaf186 2 Unknown WP_055811453.1 Leifsonia sp. Root112D2 Unknown Z040_00900 Z040 Unknown DSM_15932_02232 R. festucae DSM_15932 Unknown WP_051224845.1 P. soli DSM 23366 2 Unknown WP_017888161.1 C. flaccumfaciens UCD-AKU Unknown WP_056259520.1 Frigoribacterium sp. Leaf186 1 Unknown AOX66010.1 Curtobacterium sp. BH-2-1-1 Unknown WP_056059083.1 Frigoribacterium sp. Leaf164 2 Unknown WP_056043412.1 Frigoribacterium sp. Leaf8 Unknown WP_056228629.1 Frigoribacterium sp. Leaf44 Unknown WP_055974898.1 Frigoribacterium sp. Leaf254/415

167

168

Chapter 5 – Conclusions and Future Directions

Edward W. Davis II

168

169

CONCLUSIONS

The work presented in this thesis focused on the interactions between hosts and their pathogenic entities. The work included the development of tools for categorizing microorganisms, with the goal of identifying and typing pathogenic bacteria that afflict agriculturally important crop plants. The work also relied on genome-enabled approaches to test hypotheses on the evolution and ecology of host-pathogen interactions. The use of whole genome sequences has the additive advantage of generating hypotheses, many of which can be experimentally tested.

We tested the hypothesis that disease ascribed to the Gram-positive plant pathogen, Rhodococcus fascians, is indeed caused by multiple species of bacteria. Four individual lines of evidence supported the hypothesis. Both a multi-locus sequence analysis phylogenetic tree and a whole genome-based phylogenetic tree indicated that the strains identified as R. fascians have distinct phylogenetic histories. We identified two broadly defined clades through this work, with one of the clades having substructure. An analysis of the whole genome relatedness using average nucleotide identity showed a similar pattern of two broadly defined groups. ANI analysis also suggested the presence of six species level groups above the ANI cutoff of 94-96%.

We examined the possibility of species cohesion by measuring the number of shared orthologous proteins as well as distances of codon bias. These measures do not have the same power of resolution of ANI for minor differences between isolates, but did, however, indicate differences between members of the two broadly defined clades as

in the phylogenetic analyses. These data all indicate that the organism identified as R. 169

fascians using phenotypic characteristics has a diverse evolutionary history that spans

170 across multiple species. All organisms identified as pathogenic do have evidence for

HGT that is responsible for the pathogenic lifestyle. In the majority of pathogenic isolates, pathogenicity requires a large linear plasmid. Since the genetic determinants for disease within this group of bacteria are horizontally acquired, yet all sequenced isolates are members of the two sister clades that we identified, we hypothesized that some shared chromosomal features are necessary for the disease trait.

We hypothesize that the core genes encoded across both species potentiate the pathogenic lifestyle which is activated when the bacteria horizontally acquire DNA encoding the virulence loci. In order to examine this possibility, we determined the putative functions of each of the virulence loci that are shared by all pathogenic isolates.

We identified fasR, a transcriptional regulator, as one candidate that could mediate the interaction with housekeeping genes that have been co-opted for virulence. The regulon of fasR is currently unknown. We hypothesize that fasR mediates the interaction between the horizontally acquired virulence genes and the chromosomal factors through transcriptional regulation of one or more chromosomally encoded genes. A transcriptomic study of differential regulation, comparing the wild type strain to the fasR mutant, would allow us to test this hypothesis. If chromosomal genes were identified through the transcriptomic study, we would generate in-frame deletion mutants of each locus, and test the mutant bacteria in pathogenicity assays on plants.

We would control the experiment using the wild type strain as the positive control and an isogenic strain that is lacking the virulence loci as the negative control. If the mutant

behaved as the wild type pathogenic strain, then we would conclude that the 170

chromosomal gene does not play a role in virulence. However, if the mutant was

171 compromised or nonpathogenic, we would conclude the gene has a role in virulence.

We would rule out pleiotropic effects through complementation of the deletion mutant and restoration of the wild-type phenotype. We would then be able to generate and test new hypotheses about the progression of host-microbe interactions that lead to disease symptoms.

The current model for R. fascians disease implicates cytokinins directly interacting with the plant host and perturbing plant hormone equilibrium, but little is known about other genetic loci in the bacteria that are essential for virulence. Most

Rhodococcus spp. employ an environmental lifestyle unassociated with any plant host.

Can any member of the Rhodococcus genus become pathogenic when supplied with the virulence plasmid from R. fascians? Future work to test this hypothesis would be to introduce the virulence plasmid into an environmental isolate and challenge a susceptible plant with the transformed bacteria. The experiment would be controlled using the wild type R. fascians as the positive control and isogenic R. fascians line lacking the virulence loci as well as the wild type environmental isolate would be the negative controls. If the plants infected with the transformed environmental isolate had disease symptoms that showed a statistically significant increase in disease symptoms compared to plants infected with the negative controls and no significant difference compared to the positive control, then we would conclude that the plasmid can confer the pathogenicity trait to the environmental isolate. Conversely, if the disease symptoms were statistically insignificant compared to the negative controls, then we

would conclude that the plasmid in unable to confer pathogenicity to the environmental 171

Rhodococcus species. These results would provide insight on how specific the

172 interaction is between the horizontally acquired virulence genes, the bacterium, and the plant hosts.

We used the Rathayibacter toxicus-bacteriophage interaction as a model for studying CRISPR array evolution. Based on the current literature, effective targeting of a phage by a CRISPR locus leads to extinction of the phage. In order to test this model, we determined which CRISPR spacers from R. toxicus target the Rathayibacter phage and predicted what type of interaction would occur between each targeting spacer-protospacer pair. We found that not only did R. toxicus continue to acquire spacers to the phage across time, over half of all of the unique spacers were predicted to result in targeting and degradation of the phage DNA. These data are contrary to what one would expect if complete immunity was indeed conferred by acquisition of a single spacer to a target and challenge the current model of perfect CRISPR-mediated immunity.

Rathayibacter are a difficult genera of bacteria to study. The species are thus far, recalcitrant to genetic modifications and have few available tools. Moreover, R. toxicus is a Select Agent, which precludes its study except by those with sufficient biosafety clearance. Future studies on CRISPR evolution could be designed for model systems in CRISPR research. Ideally, a long term evolution experiment would be the most analogous system that one could use in a laboratory setting. In replicate, the model

E. coli with an intact CRISPR system could be challenged multiple times over the course of the experiment with phage. After each challenge with phage, surviving

bacteria could be stored as culture stocks for later sequencing of the CRISPR array. An 172

ideal experiment would include a genetically diverse population of phages. These

173 phages would provide diversity that is more similar to what is seen in a natural setting.

Over time, the E. coli would be challenged with the populations of phages. At the end of the experiment, sequencing of the CRISPR, one would expect to find a single immunization event if complete immunity is conferred by a single CRISPR spacer.

Conversely, as hypothesized based on the R. toxicus data, if multiple CRISPR spacers are acquired over time, then incomplete immunity would be suggested. One conclusion would be that the bacteria and phage co-evolve over time, leading to an equilibrium with presumably minor fitness cost to the host while still allowing infection of the phage. A result of equilibrium is more consistent and analogous with most infections of humans by viruses; most viruses do not completely kill the human host, and, additionally, re-immunization is often required in order to maintain any defense against the viruses across time. The R. toxicus strains have apparently a long evolutionary history with the phage, indicative of constant interaction and incomplete immunization against the phage.

Lastly, we generated multiple automated pipelines for whole genome sequence analysis and multi-locus sequence analysis to facilitate the use of these methods by other scientists. We also generated a website that can be used by applied scientists from a diagnostic perspective. These tools are valuable for the broader scientific community because, as the cost of genome sequencing diminishes, the availability of whole genome sequencing data dramatically increases to scientists with broadly different backgrounds and expertise. If we enable the scientists the opportunity to use these high

quality and robust techniques, as well as the proper methods for interpretation of the 173

data, then we have eliminated some of the barriers to high quality comparative

174 genomics research. Furthermore, these methods help facilitate the designation of appropriate phylogenetic and species-level groups for all comparative approaches that follow.

174

175

175