
Phylogenetic Analysis of a Group of Enteric Bacteria Based on 16S rDNA Gene Sequencing

A thesis submitted to the Miami University Honors Program in partial fulfillment of the requirements for University Honors

by

Katie Lynn Ziegler

May 2004 Oxford, Ohio


ABSTRACT

PHYLOGENETIC ANALYSIS OF A GROUP OF ENTERIC BACTERIA BASED ON 16S rDNA GENE SEQUENCING

By Katie Lynn Ziegler

It has been suggested that phylogenies derived specifically from 16S rDNA gene sequencing provide reasonable evidence for describing major evolutionary lineages (Brown, 1997). Because of the critical role they play in protein synthesis and thus survival, rDNA sequences are highly conserved and tend to resist significant change over time. Thus, they are of prime interest when studying evolutionary relatedness. In this study, we sequenced a portion of the 16S rDNA gene from the amplified PCR products of enteric bacterial samples. The resulting DNA sequences were edited and aligned to develop a phylogeny based on our comparative data. We developed phylogenies using parsimony analysis and bootstrapping and compared our results to expected data. We hypothesized that species within the same genus would contain the most shared derived traits and therefore be most closely related. However, our phylogeny provided limited resolution of some of our strains. While a few strains resolved well (e.g. Serratia and some Enterobacter strains), weak bootstrap support characterized much of the phylogeny. Ribosomal DNA contains conserved, variable, and highly variable regions. The highly variable regions are most prone to mutation, and thus may not provide good phylogenetic signal. For this reason the hyper-variable regions not only fail to provide meaningful data, but can also destroy signal that is already present.


Phylogenetic Analysis of a Group of Enteric Bacteria Based on 16S rDNA Gene Sequencing

by Katie Lynn Ziegler

Approved by:

______, Advisor Dr. Gary Janssen, Ph.D.

______, Reader Dr. Luis Actis, Ph.D.

______, Reader Dr. Francisco B.-G. Moore, Ph.D.

Accepted by:

______, Director, University Honors Program


Acknowledgements

I would first like to thank Dr. Francisco Moore (Paco) and the University of Akron Biology Department for their immense help with this project. Paco, thank you for taking a chance on the random undergrad who emailed you in October about summer work! Thank you for designing this project for me to work on and for keeping the lab stocked with everything I needed throughout the process. Thank you for your guidance when I ran out of my own troubleshooting ideas. Thank you for serving as one of my readers. And thank you for responding so quickly to all of my questions as I wrote this thesis. I don’t know what I would have done without your input.

I would also like to thank Dr. R. Joel Duff from the University of Akron for allowing me to use his lab when I needed it and for answering some of my queries along the way.

Thanks to Traci Branch for running sequencing reactions while I was in Oxford. Thank you for staring aimlessly at the alignments with me until we figured out what the heck to do with them! I really appreciate that you dedicated so much of your time to helping me finish “on time”. Thanks also to Lauren Smith for her support and for sending me files to include in my thesis.

I would like to thank Dr. Gary Janssen from Miami University for serving as my advisor throughout this process. Thank you for putting up with my ever-shrinking time table and for continuing to support me throughout the process.

Thanks to Dr. Luis Actis for serving as my other thesis reader and for helping me with my proposal. Thanks also to Dr. Joe Carlin for granting me computer access and letting me download phylogeny software onto the departmental computer.

Finally, thanks to my family and friends for putting up with me when I was stressed out (aka the last month!) I don’t know what I would have done without all of you!


Table of Contents

Introduction

Materials and Methods

Results

Discussion

References Cited

Appendix


List of Figures

Figure 1. Schematic illustration of primer binding sites on the 16S rDNA gene

Figure 2. Agarose gel electrophoresis illustrating a representative gel of PCR amplified 16S rDNA gene products

Figure 3. New England BioLabs TriDye 1 kb DNA ladder used to estimate the size of the 16S rDNA gene after PCR amplification

Figure 4. Most parsimonious phylogeny detailing the relatedness of the bacterial species under study

Figure 5. Bootstrap 50% majority-rule consensus tree of 2000 replicates


List of Tables

Table 1. Bacterial strains used for rDNA sequence determination (by LAB Reference Number)

Table 2. Bacterial rDNA sequences obtained from the Ribosomal Database Project (by corresponding GenBank entry number)

Table 3. Primer sequences used for PCR and sequencing reactions


Introduction

Developments in the field of molecular systematics have expanded the realm of knowledge available to us regarding bacterial classification and relatedness. Since DNA is the cellular component responsible for bacterial inheritance, and evolution is the change in the genetic composition of a population over time, variation in DNA sequences should provide hints toward solving evolutionary questions (Mindell & Honeycutt, 1990). It has been suggested that phylogenies derived specifically from 16S rDNA gene sequencing provide reasonable evidence for describing major evolutionary lineages (Brown et al., 1997). Ribosomal RNA genes encode the rRNAs, which bind ribosomal proteins and associate to form complete ribosome complexes. Ribosomal RNAs are universally present throughout all species, and ribosomes are essential for cellular function and survival. Because of the critical role they play, rDNA sequences are highly conserved and tend to resist significant change over time (Wertz et al., 2003). Thus, they are of prime interest when studying evolutionary relatedness.

In addition to obtaining phylogenies from rDNA sequence data, research has indicated a number of other areas in which such data are useful. For example, the core genome hypothesis proposes the existence of core genes and auxiliary genes. Core genes include regulatory genes present in a full (or almost full) copy in each isolate of a species. They rarely undergo lateral gene transfer because, theoretically, there should be no selective advantage conferred by doing so (Lan & Reeves, 2001). Auxiliary genes, on the other hand, may include pathogenicity islands, resistance genes, novel metabolic functions, toxin genes, etc. These are significantly more likely to undergo lateral gene transfer in response to environmental or selective pressures (Lan & Reeves, 2001). In an attempt to define a bacterial species using a molecular approach, the core genome hypothesis acknowledges that because there is little or no selective advantage to acquiring core genes from other species, the core genes’ sequences tend to diverge between species. As a result, altered core genes pose a barrier to interspecies recombination (Lan & Reeves, 2001). If these core genes truly do not recombine, we may be able to define bacterial species more concretely based on conserved molecular sequence data from the core genes (e.g. 16S rDNA).

The benefits of using genotypes rather than phenotypes to classify bacteria include faster and more precise development of an objective and accurate species concept. Thus genotyping may be useful for effectively identifying pathogens, in particular those that grow slowly or those that cannot be cultivated in vitro (Harmsen & Karch, 2004). While sequence databanks facilitate this endeavor, none are perfect. As a result, sequence data might be most effective in distinguishing between species when they are too similar to be differentiated otherwise. For example, in the case of bioterrorist attacks, rapid identification of Bacillus anthracis could prove critical in differentiating it from other similar Bacillus species before rendering appropriate treatment (Sacchi et al., 2002).

This study involves sequencing a portion of the rDNA gene from a subset of Gram-negative enteric bacteria and generating a phylogenetic tree based on the sequence data obtained. I will first provide a brief description of the organisms included in the study.

Serratia marcescens is a rod-shaped facultative anaerobe of the family Enterobacteriaceae. Once considered nonpathogenic, it is now known to cause some opportunistic infections, especially nosocomial infections in immunocompromised patients (Hume et al., 1999). Members of the genus Serratia can be distinguished from others belonging to the Enterobacteriaceae by their production of three special enzymes: DNase, lipase, and gelatinase (http://medic.med.uth.tmc.edu/path/00001521.htm).

Enterobacter aerogenes is associated with nosocomial infections (http://www.diseasesdatabase.com) and is considered to be closely related to Klebsiella pneumoniae, another Gram-negative facultative anaerobe. Klebsiella are also members of the Enterobacteriaceae. Enterobacter taylorae has also caused severe nosocomial infections in several patients (Rubinstein et al., 1993).

Citrobacter freundii is known to cause varying types of infections ranging from gastroenteritis to haemolytic uraemic syndrome (Tschape et al., 1995). Salmonella typhimurium is also an enteric bacterium; some strains cause severe foodborne infections, and many are resistant to anti-infectious agents including certain antibiotics (http://www.ifst.org/hottop20.htm).

Similarly, the Erwinia species studied (E. carotovora subsp. betavasculorum, E. carotovora subsp. carotovora, E. herbicola, and E. amylovora) are members of the Enterobacteriaceae. Some Erwinia species are also phytopathogenic, causing plant infections that lead to rotting (Basset et al., 2000). E. carotovora subsp. betavasculorum and E. carotovora subsp. carotovora are members of the “pectolytic soft rot group”, E. herbicola is a part of the “yellow pigmented group”, and E. amylovora is a member of the “white nonpectolytic wilt-causing group” (http://grove.ufl.edu/~jbjones/FacultativelyAnaerobicGramlecture2004.doc).

These enteric bacteria are all Gram-negative bacilli, and most are capable of causing some type of infectious disease, ranging from mild to severe. It is for this reason that we are interested in studying them. As previously mentioned, genotyping may be useful for effectively identifying pathogens. Thus, the 16S rDNA sequences obtained may be useful for determining which enteric bacterium is causing symptoms in an infected individual.

Finally, Vibrio fischeri was used as an outgroup in this experiment. V. fischeri is a free-living marine bacterium that colonizes special light organs of some fish and allows them to bioluminesce. The genus Vibrio provides a good outgroup because it is classified outside the enteric bacteria that are being examined in this study, but it remains a closely related sister group. As an outgroup, its trait values will function as default “primitive characters”. It is thus an ideal starting point from which to compare the enteric bacteria under study.

For this study, we developed a phylogeny based on partial 16S rDNA gene sequencing of the bacterial strains discussed above. We hypothesized that species within the same genus would contain the most shared sequence changes and therefore be most closely related. While comparisons of some samples from within our data set support this notion, overall our results generate weak support for this hypothesis. Some samples show no apparent relationship to other samples, even those from within the same genus. These apparently incorrect results lead us to explore possible ways in which we are being misled by our data.

Materials and Methods

Bacterial Strains

A total of 16 strains of enteric bacteria were used in this study. These strains included samples from the following species: Enterobacter aerogenes, Enterobacter taylorae, Klebsiella pneumoniae, Erwinia carotovora subsp. carotovora, Erwinia carotovora subsp. betavasculorum, Erwinia herbicola, Erwinia amylovora, Serratia marcescens, Salmonella typhimurium, and Citrobacter freundii. Vibrio fischeri was included as an outgroup for rooting the phylogenetic tree. The strains were obtained from a variety of sources as listed in Tables 1 and 2. Nine strains were sequenced and analyzed in our laboratory. For the remaining seven, we obtained sequence data from the Ribosomal Database Project for comparative analysis and outgroup rooting (http://rdp.cme.msu.edu/html/). Table 1 shows the strains analyzed in our laboratory according to their “LAB” number. DNA samples were prepared by 16S rDNA gene amplification (PCR), and the PCR products were sequenced. Table 2 lists strains retrieved from the Ribosomal Database Project according to their corresponding GenBank entry numbers.


Table 1. Bacterial strains used for rDNA sequence determination (by LAB Reference Number)

LAB #     STRAIN #     SA #    SPECIES                                     SOURCE
2         PBZ 2        ~       Enterobacter aerogenes                      M. Kory, U. of Akron
5         PBZ 5        ~       Serratia marcescens                         M. Kory, U. of Akron
20 (a)    ATCC13048    5589    Enterobacter aerogenes                      SGSC (b)
25        37055        5283    Enterobacter taylorae                       SGSC
26        ECB11129     5324    Erwinia carotovora subsp. betavasculorum    SGSC
27        ECC71        5322    Erwinia carotovora subsp. carotovora        SGSC
29        EH105        5326    Erwinia herbicola                           SGSC
44        ATCC8100     5581    Serratia marcescens                         SGSC
46 (c)    M15          3361    Salmonella Typhimurium                      SGSC

(a) Int. J. Syst. Bacteriol. (1980) 30: 225-420.
(b) Salmonella Genetic Stock Centre (http://www.ucalgary.ca/~kesander/index.html)
(c) Imai et al. (1977) Inst. Ferm. Res. Comm. (Osaka) 8:63.

Table 2. Bacterial rDNA sequences obtained from the Ribosomal Database Project (by corresponding GenBank entry number)

SPECIES                    GENBANK NUMBER
Erwinia amylovora          AF141894
Serratia marcescens 1      AF124040
Serratia marcescens 2      AJ297946
Citrobacter freundii       AF025365
Enterobacter aerogenes     AF395913
Klebsiella pneumoniae      Y17656
Vibrio fischeri            AY292919

Strain Preparation

Bacterial strains were received as stabs in 1.5 ml microcentrifuge tubes. Upon receipt of these strains from the sources listed in Table 1, we prepared stocks for our use according to the following protocol: streak bacterial sample uniformly onto an LB agar slant (10 ml agar in a 25 ml sterile tube), incubate overnight at 37°C; add 2.5 ml LB broth to the slant, vortex to suspend bacteria until broth is turbid, add 300 µl 80% glycerol to a sterile storage tube, transfer 1.5 ml of turbid LB broth from slant into the tube containing glycerol (under sterile hood), vortex to suspend in glycerol, incubate 20-30 min at room temperature, store in liquid nitrogen. Each prepared sample contained 1.8 ml.

Polymerase Chain Reaction (PCR)

An approximately 1,550 base pair fragment of DNA was amplified from whole cell suspensions of all 9 laboratory strains using the U 16S F and U 16S R forward and reverse primers (Table 3). Each final PCR reaction was 25 µl in volume and contained 0.25 µl each of 1.25 µg/µl forward and reverse primers (0.3125 µg each of U 16S F and U 16S R), 0.25 µl of whole cell bacterial culture prepared as described above (DNA source), 2.5 µl 10x buffer, 0.5 µl of Taq DNA polymerase (New England BioLabs) at 5,000 units/ml (2.5 units per reaction), 20.75 µl nuclease free water, and 0.5 µl of 100 mM dNTPs (Eppendorf) containing 25 mM each of dATP, dGTP, dTTP, and dCTP. All oligonucleotide primers were purchased through Sigma Genosys. Reactions were first incubated for 3 min at 94°C. Then 32 cycles were run according to the following protocol: 45 s at 94°C, 45 s at an annealing temperature of 50°C (the temperature varied for some trials), and 1 min 30 s at 72°C. Reactions were then incubated at 72°C for an additional 5 min final extension before being held at 4°C. PCR products were identified using agarose gel (1% w/v) electrophoresis and purified using a Qiaquick PCR purification kit (Qiagen). Negative controls containing all reagents except the bacterial sample were included.
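The reaction composition above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original protocol; the dictionary keys and function name are illustrative, and the final dNTP and buffer concentrations are derived here from the stated volumes rather than reported in the text.

```python
# Component volumes (µl) as stated in the PCR protocol.
components_ul = {
    "forward primer": 0.25,
    "reverse primer": 0.25,
    "whole-cell culture": 0.25,
    "10x buffer": 2.5,
    "Taq polymerase": 0.5,
    "nuclease-free water": 20.75,
    "dNTP mix": 0.5,
}

PRIMER_STOCK_UG_PER_UL = 1.25  # stated primer stock concentration
TAQ_UNITS_PER_ML = 5000        # stated Taq stock concentration
DNTP_EACH_MM = 25              # each dNTP within the 100 mM mix


def reaction_summary(components):
    """Return total volume and per-reaction amounts derived from the volumes."""
    total_ul = sum(components.values())
    return {
        "total_ul": total_ul,
        "primer_ug_each": components["forward primer"] * PRIMER_STOCK_UG_PER_UL,
        "taq_units": components["Taq polymerase"] / 1000 * TAQ_UNITS_PER_ML,
        # Derived values (not stated in the text): dilution into the final volume.
        "dntp_each_final_mM": DNTP_EACH_MM * components["dNTP mix"] / total_ul,
        "buffer_final_x": 10 * components["10x buffer"] / total_ul,
    }


if __name__ == "__main__":
    for key, value in reaction_summary(components_ul).items():
        print(f"{key}: {value:g}")
```

The volumes sum to the stated 25 µl, and the primer mass and Taq units agree with the figures given in parentheses in the protocol.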

Table 3. Primer Sequences used for PCR and Sequencing Reactions

Primer       Gene   Sequence (5' --> 3')                                Fwd/Rev
U 16S F      16S    CCG AAT TCG TCG ACA ACA GAG TTT GAT CCT GGC TCA G   forward
U 16S R      16S    CCC GGG ATC CAA GCT TAC GGC TAC CTT GTT ACG ACT T   reverse
16S_790_F    16S    ATT AGA TAC CCT GGT AG                              forward
16S_790_R    16S    CTA CCA GGG TAT CTA AT                              reverse
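The two internal primers in Table 3 anneal at the same site on opposite strands, so each should be the reverse complement of the other. A short Python sketch (the function name is illustrative, not from the original work) confirms this for the sequences as listed:

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, ignoring spaces."""
    cleaned = seq.replace(" ", "").upper()
    return cleaned.translate(COMPLEMENT)[::-1]


# Internal primer sequences copied from Table 3.
primer_790_f = "ATT AGA TAC CCT GGT AG"
primer_790_r = "CTA CCA GGG TAT CTA AT"

if __name__ == "__main__":
    same_site = reverse_complement(primer_790_f) == primer_790_r.replace(" ", "")
    print("16S_790_R is the reverse complement of 16S_790_F:", same_site)
```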

The annealing positions for each primer are illustrated in Figure 1. The U 16S F and U 16S R primers annealed at the ends of the gene and were used both in PCR to amplify the entire 16S rDNA gene and in sequencing reactions. The 16S_790_F and 16S_790_R internal primers annealed near base 790 and were used in sequencing reactions to generate double stranded sequence data for analysis.

Base 1                        Base 790                        Base 1,550

                            16S rDNA gene

→------(1)------                                      ------(2)------←
               ------(3)------← →------(4)------

→: Forward primer (annealing location)
←: Reverse primer (annealing location)
(1): Sequence generated from U 16S F primer
(2): Sequence generated from U 16S R primer
(3): Sequence generated from 16S_790_R primer
(4): Sequence generated from 16S_790_F primer

Figure 1. Schematic illustration of primer binding sites on the 16S rDNA gene.

Purification of PCR Products

A Qiaquick PCR purification kit (Qiagen) was used to purify the amplified PCR products. Binding buffer was added to the sample, and the mixture was transferred to a Qiaquick column before spinning at 13,000 rpm for 1 min. Flow-through was discarded, and 750 µl of wash buffer was added to the column and centrifuged for 1 min. Flow-through was again discarded and the column was centrifuged an additional minute. The column was transferred to a clean 1.5 ml microcentrifuge tube, and 25 µl nuclease free water was added to the center of the membrane. After incubation at room temperature for 1 min, the tube was centrifuged for an additional 1 min. Eluate was saved and run on a 1% agarose electrophoresis gel made with TBE buffer. The agarose contained ethidium bromide, so purified DNA could be viewed under ultraviolet light.

16S rDNA Sequencing Reaction

The purified PCR products of approximately 1,550 bp were prepared for sequencing in a sequencing reaction using 16S rDNA forward and reverse sequencing primers (from the U 16S set or the 16S_790 set; Table 3) and fluorescently-tagged dideoxyribonucleotides. The U 16S F primer is a forward primer that anneals to the beginning of the gene and sequences in the forward direction. U 16S R anneals to the end of the gene (around base 1,550) and sequences in the reverse direction. The 16S_790_F primer anneals near base 790 of the gene and sequences in the forward direction; 16S_790_R anneals near base 790 and sequences in the reverse direction. Thus, the U 16S F and 16S_790_R primers will generate double stranded sequences near the beginning of the gene. The 16S_790_F and U 16S R primers will generate double stranded sequences near the end of the gene. Figure 1 summarizes this information.

Each sequencing reaction contained 0.5 µl of 1.25 µg/µl primer (0.625 µg of either forward or reverse primer), 2.3 µl ABI Big Dye reaction mix (containing dNTPs, ddNTPs, buffer, and enzyme), and 5 µl total of DNA plus nuclease free water. DNA concentration was estimated based on purification results, and we approximated how much to use in our reaction based on the band strength of the sample in the gel (approximately 55-65 ng of DNA per sequencing reaction). This estimation was based on a comparison of the purified sample band to the brightness of a ladder containing known size standards. We used 2 µl of 1 kb TriDye ladder and 2 µl of purified sample in our gel electrophoresis analysis of purified PCR products, and our bands were approximately the same brightness as the 3.0 kb reference band, which contained 125 ng DNA. Because 2 µl of sample corresponded to approximately 125 ng of DNA, 1 µl of sample provided the approximately 55-65 ng of template DNA needed for each sequencing reaction. Reactions were subjected to 30 cycles of the following: 20 s at 96°C, 10 s at 44°C, 4 min at 60°C; then held at 4°C.
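The template estimate above is a simple proportion, sketched here in Python for clarity. The function names are illustrative; the only inputs are the values stated in the text (a band matching the 125 ng reference from 2 µl of sample, and a 55-65 ng target), and band brightness is assumed to scale linearly with DNA mass.

```python
def concentration_ng_per_ul(band_ng, loaded_ul):
    """Estimate sample concentration from a band of known mass equivalent."""
    return band_ng / loaded_ul


def template_mass_ng(volume_ul, conc_ng_per_ul):
    """Mass of template delivered by a given sample volume."""
    return volume_ul * conc_ng_per_ul


if __name__ == "__main__":
    conc = concentration_ng_per_ul(band_ng=125, loaded_ul=2)  # 62.5 ng/µl
    mass = template_mass_ng(volume_ul=1, conc_ng_per_ul=conc)
    print(f"Estimated concentration: {conc} ng/µl")
    print(f"1 µl delivers {mass} ng; within 55-65 ng target: {55 <= mass <= 65}")
```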

Cleanup of Sequencing Reaction Products

Sequencing reaction products were prepared for sequence analysis by one of the following two protocols: (1) add 32 µl 75% isopropanol to the reaction product, incubate 15 min at room temperature, centrifuge 20 min at 14,000 rpm, remove supernatant, add 180 µl 75% isopropanol to the pellet, centrifuge 5 min and remove supernatant again, place sample in Speed Vac for 10 min at medium heat to fully dry the sample, resuspend in 27 µl ABI Prism Template Suppression reagent (TSR), heat to 95°C for 2 min, chill on ice. (2) add 2 µl 125 mM EDTA and 20 µl of 100% ethanol to reaction product, incubate at room temperature for 15 min, centrifuge at 14,000 rpm for 30 min, remove supernatant, add 30 µl 70% ethanol, centrifuge 5 min at 14,000 rpm, remove supernatant again, place sample in Speed Vac for 15 min on medium heat to fully dry the sample, add 25 µl TSR, heat at 95°C for 2 min, chill on ice.

Sequence Analysis

Prepared 16S rDNA sequencing reaction products were processed on an ABI 310 Sequencer. Chromatogram results were checked and edited using Sequencing Analysis Alias software. I carefully examined the nucleotide sequences and corrected any misread bases. Forward and reverse products were then aligned in Megalign software using the Clustal W alignment algorithm (Dnastar, Inc.). This involved aligning sequence data generated from the U 16S F and 16S_790_R primers to obtain a double stranded sequence near the beginning of the gene and sequence data from the 16S_790_F and U 16S R primers to obtain a double stranded sequence near the end of the gene (see Figure 1). Only regions containing complete double stranded sequence data were considered in the analysis. Once double stranded sequences were generated for each of the nine laboratory strains, we aligned the sequences from the different strains, again using the Clustal W algorithm. Nearly 950 bases of data were available for comparison after aligning sequence data from all strains and removing single stranded regions. This number includes data from the regions at the beginning and end of the gene. These bases are located at positions 125-596 and 943-1421 in the 16S rDNA gene.
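The rule of keeping only double stranded positions can be illustrated with a small Python sketch. This is not the Megalign workflow: it assumes the forward read and the reverse-complemented reverse read have already been placed on the same coordinates, with '-' marking positions a read does not cover. The function name, the toy reads, and the use of 'N' to flag disagreements are all assumptions made for illustration (in this study, disagreements were resolved by inspecting the chromatograms).

```python
def double_stranded_consensus(forward, reverse_rc):
    """Keep only positions covered by both reads.

    Both inputs are equal-length strings on the same strand; '-' means the
    read has no data at that position. Disagreements are flagged as 'N'.
    """
    if len(forward) != len(reverse_rc):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for f_base, r_base in zip(forward, reverse_rc):
        if f_base == "-" or r_base == "-":
            continue  # single stranded position: excluded from analysis
        consensus.append(f_base if f_base == r_base else "N")
    return "".join(consensus)


if __name__ == "__main__":
    # Hypothetical toy reads, not data from this study.
    forward_read = "ACGTAC--"
    reverse_read = "--GTACGG"
    print(double_stranded_consensus(forward_read, reverse_read))
```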


Phylogenetic Analysis

Phylogenetic analysis was performed using PAUP software (Swofford, 1997). In our parsimony analysis we performed a heuristic search, and gaps were treated as a fifth base. We also generated a 50% majority-rule bootstrap consensus tree of 2,000 replicates from the results.
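To make the parsimony criterion concrete, the sketch below scores a fixed tree with Fitch's algorithm, counting the minimum number of state changes per alignment column and treating the gap character '-' as an ordinary fifth state. It is an illustration only: it does not reproduce PAUP's heuristic tree search, and the taxon names, toy sequences, and tree topologies are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    `tree` is a taxon name (leaf) or a 2-tuple of subtrees; `states` maps
    each taxon name to its character at this column ('-' counts as a state).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch cost over every column of the alignment."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    total = 0
    for i in range(n_sites):
        column = {name: alignment[name][i] for name in names}
        total += fitch_site(tree, column)[1]
    return total


if __name__ == "__main__":
    # Hypothetical toy alignment; "out" plays the role of the outgroup.
    toy = {"out": "AC-T", "a": "ACGT", "b": "AGGT", "c": "AGGA"}
    tree_1 = ("out", ("a", ("b", "c")))
    tree_2 = ("out", ("b", ("a", "c")))
    print("tree_1 length:", parsimony_length(tree_1, toy))
    print("tree_2 length:", parsimony_length(tree_2, toy))
```

Under the parsimony criterion, the topology with the smaller total length is preferred; a heuristic search explores many candidate topologies in this way.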

Results

Polymerase Chain Reaction (PCR)

PCR amplification was successful as indicated by the bands present under UV light. All bands representing the 16S rDNA gene product were of approximately equal length, corresponding to about 1,550 bases when compared with a TriDye 1 kb DNA ladder (New England BioLabs). Figure 2 illustrates this result.

[Gel image: two rows (Row 1 and Row 2) of lanes numbered 1-8]

Figure 2. Agarose gel electrophoresis illustrating a representative gel of PCR amplified 16S rDNA gene products. Bands are approximately 1,550 base pairs in size when compared to the size marker standards (lane 4). Template DNA from LAB strains (some of which are described in Table 1) is as follows:

Row 1: LAB 20* (lane 1), LAB 21 (lane 2), LAB 22 (lane 3), Ladder (lane 4), LAB 23 (lane 5), LAB 24 (lane 6), LAB 25* (lane 7)
Row 2: LAB 26* (lane 1), LAB 27* (lane 2), LAB 28 (lane 3), Ladder (lane 4), LAB 29* (lane 5), LAB 30 (lane 6), LAB 31 (lane 7), negative control (lane 8)
* Denotes strains from which rDNA sequence was obtained for this study

The sizes of the amplified gene products were estimated based on comparison with known fragment lengths in the TriDye ladder. A schematic of the ladder used is shown in Figure 3. The 16S rDNA fragment is slightly larger than the 1.5 kb marker band of the ladder.


Figure 3. New England BioLabs TriDye 1 kb DNA ladder used to estimate the size of the 16S rDNA gene after PCR amplification.

Purification of PCR Products

After the purification process, gel electrophoresis was again performed to confirm that the amplified 16S rDNA gene product was still present and had been purified. Results closely resemble those in Figure 2 (data not presented). Again, all samples appeared under UV light to be approximately 1,550 bases in size.

We electrophoresed only 2 µl of purified sample (compared to 4 µl of PCR product), but obtained results of comparable brightness. This indicates that the purified 16S rDNA is more concentrated. The amount present after purification was estimated from agarose gels as described previously and was used to determine how much of the sample to use in the sequencing reactions.


16S rDNA Sequencing Reactions and Analysis

The sequencing reactions yielded DNA fragments polymerized to varying lengths with fluorescently labeled nucleotides incorporated at their 3’ ends. A purification procedure removed the enzyme, buffer, and additional dNTPs and ddNTPs from the sequencing reaction product so we could isolate only the 16S rDNA gene fragments for analysis. This procedure also resulted in single stranded DNA fragments, and the template suppression reagent suppressed the full-length template strands present. This allowed the ABI 310 Sequencer to read and to generate sequence based solely on the fluorescently labeled fragments.

Results of sequencing reactions were presented as chromatograms matched with their corresponding nucleotides. After checking the validity of the sequence output and correcting any incorrectly reported bases, we aligned the sequence data from the forward primers with the reverse complement of the overlapping sequence data from the reverse primer (Figure 1). Only double stranded regions were considered in the final analysis. The aligned sequence data can be viewed in the Appendix. A total of 949 base pairs of double stranded data were available after aligning all strains and removing single stranded sequence data. The single stranded data omitted from the analysis corresponded to regions where confirmatory double strand sequence data was not available. The bases remaining for analysis (the 949 bases of complete overlap) are located at positions 125-596 and 943-1,421 in the 16S rDNA gene. Sequence data from these two regions was analyzed as one large combined data set because PAUP parsimony settings analyze each character independently of every other character.

Phylogenetic Analysis

A parsimony analysis treating gaps as a “fifth base” and setting V. fischeri as a monophyletic outgroup produced the phylogeny shown in Figure 4.

Figure 4. Most parsimonious phylogeny detailing the relatedness of the bacterial species under study.

The S. marcescens samples are clustered together, as are the various Erwinia species samples. As predicted, these strains are most closely related to others within the same genus and species. E. aerogenes (LAB2 and LAB20) are also identified as closely related. However, the E. aerogenes strain compared from the Ribosomal Database Project is grouped more closely to the Erwinia species than to the other Enterobacter species. Similarly, the Enterobacter taylorae sample could not be resolved in terms of its relatedness to the other Enterobacter strains or to any other strains involved. Instead, Citrobacter freundii is placed most closely to the laboratory strains of E. aerogenes. Finally, S. typhimurium does not seem to be very closely related to any of the other strains under study. This may be partly due to low resolution within the phylogeny, or it may be because Salmonella truly is more distantly related than the other enteric bacteria.

A 50% majority-rule consensus tree was generated via bootstrap analysis using 2000 replicates and treating gaps as a “fifth base”. This yielded the phylogeny shown in Figure 5. V. fischeri was again defined as an outgroup. Bootstrap support values are indicated in the figure.
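The resampling step behind a bootstrap analysis can be sketched as follows. Each replicate draws alignment columns at random, with replacement, until it has as many columns as the original alignment; a tree is then built from each replicate and the percentage of replicates recovering a clade is its support. This Python sketch covers only the resampling and a labeling of support values. The function names and toy sequences are hypothetical, the 50% and 75% cutoffs are the thresholds used in this study, and the label "strong" for 75% or more is a shorthand adopted here.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def classify_support(percent):
    """Label a bootstrap percentage using the thresholds applied in this study."""
    if percent < 50:
        return "unresolved"  # collapsed in a 50% majority-rule consensus
    if percent < 75:
        return "weak"
    return "strong"


if __name__ == "__main__":
    toy = {"a": "ACGT", "b": "AGGT"}  # hypothetical toy alignment
    print(bootstrap_replicate(toy, random.Random(0)))
    for value in (98, 99):  # support values reported for Figure 5
        print(value, classify_support(value))
```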


Figure 5. Bootstrap 50% majority-rule consensus tree of 2000 replicates.

Weak bootstrap support for some of the clades that do exist (e.g. the E. carotovora grouping) indicates that the data, in their current form, cannot resolve these taxa. Anything less than 75% bootstrap support constitutes weak support. The four Serratia strains are clearly related, with strong (98%) bootstrap support, but are not well resolved within their group. Similarly, the two Enterobacter aerogenes laboratory strains (LAB20 and LAB2) resolved very well, with 99% bootstrap support. On the other hand, E. aerogenes (RD) could not be resolved using our data. The relationship between the two E. carotovora subspecies (LAB26 and LAB27) is also characterized by weak bootstrap support. Other strains show less than 50% bootstrap support, and their relatedness is therefore unresolved. While E. herbicola (LAB29) and E. amylovora (RD) appeared to be closely related in Figure 4, bootstrap analysis does not support this. In fact, Figure 5 shows that E. herbicola does not even have enough support to be classified with the other enteric bacteria. This is unexpected, but may be explained by the recent reclassification of E. herbicola as Pantoea agglomerans (http://www.plantpath.wisc.edu/fpath/do-practices-promote-resistance.htm). Pantoea species can be differentiated from Erwinia because they produce a yellow pigment, do not degrade pectate and do not require growth factors, among other things. These distinctions may result in the unexpected classification of E. herbicola.

Wertz et al. (2003) have developed molecular phylogenies of enteric bacteria based on several different genes. They found the relationships between species to be inconsistent, with different relationships established for different genes. However, they were able to develop a composite tree that included sequence data from all of the genes in one large meta-analysis. Our data would likely reflect greater resolution if we also incorporated different genes into our analysis. Similarly, it would provide a more inclusive result in terms of illustrating which species truly are most closely related on an evolutionary scale.


Discussion

Sequence data is useful for classifying bacteria and for deriving a more objective species concept. Thus genotyping may be useful to rapidly and effectively identify the strains used in this study.

We hypothesized that species within the same genus would contain the most shared derived traits and therefore be most closely related. While comparisons of some samples from within our data set support this notion, others do not. Enterobacter taylorae, for example, shows no apparent relationship to the Enterobacter aerogenes samples when analyzed by parsimony or bootstrapping.

These inconsistent results lead us to explore potential sources of error in our data. The quality of the sequence data is not in question, because sequences were checked both after sequencing and during alignment. Because only full consensus sequences were used for comparison, we knew that the integrity of the sequence data had been checked many times over. On the other hand, we cannot guarantee the quality of the sequence data taken from the Ribosomal Database Project. This data included some unresolved bases, which may have skewed the analysis.

Of the 949 total characters compared among the strains, 789 were constant across all samples. Of the 160 variable characters, 84 were parsimony-uninformative and 76 were parsimony-informative. Thus, 8% of the bases studied (76 of 949) provided meaningful data for comparison, and these were the bases considered by PAUP for determining a strain’s position on the phylogenetic tree.
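The three character classes tallied above follow the usual parsimony definitions, which a minimal Python sketch can illustrate. It uses a toy alignment (not the thesis data) and a hypothetical helper, classify_sites. Gaps and ambiguity codes are simply ignored here, which is a simplifying assumption and may differ from how PAUP treats them.

```python
from collections import Counter


def classify_sites(alignment):
    """Tally constant, parsimony-uninformative and parsimony-informative columns.

    `alignment` is a list of equal-length aligned sequences (toy data).
    Assumption: only unambiguous bases A, C, G, T are counted; gaps and
    ambiguity codes such as N are skipped.
    """
    counts = {"constant": 0, "uninformative": 0, "informative": 0}
    for column in zip(*alignment):
        states = Counter(b.upper() for b in column if b.upper() in "ACGT")
        # A state only helps parsimony if at least two taxa share it.
        shared_states = sum(1 for n in states.values() if n >= 2)
        if len(states) <= 1:
            counts["constant"] += 1
        elif shared_states >= 2:
            counts["informative"] += 1
        else:
            counts["uninformative"] += 1
    return counts


# Toy alignment of four hypothetical taxa, five columns each.
toy = ["ACGTA", "ACGAA", "ACTAC", "ACTTA"]
print(classify_sites(toy))
# → {'constant': 2, 'uninformative': 1, 'informative': 2}

# Arithmetic behind the reported proportion: 76 informative of 949 characters.
print(round(100 * 76 / 949, 1))  # → 8.0
```

In the toy data, the last column is variable but uninformative because only one taxon carries the second base. A change seen in a single taxon fits every tree equally well, so it cannot favor one tree over another.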

This should be a large enough data set to resolve at least the major clades within the study group. Because this was not always the case, we surmise that the failure to detect significant grouping of clades indicates an unusual distribution of data. As a result, PAUP likely could not find a relationship between samples that was clearly better than all other possibilities (i.e., bootstrap support was weak for any given branching pattern). This means, for example, that E. taylorae is not necessarily less related to other Enterobacter strains than it is to the other strains studied, but simply that the data was not informative enough for PAUP to determine where E. taylorae fits best in the phylogeny.
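The idea behind bootstrap support can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxon names and sequences are hypothetical, and a simple nearest-neighbor test on Hamming distance stands in for the parsimony tree search that PAUP performs on each replicate.

```python
import random


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def nearest_neighbor_support(alignment, taxon, partner, replicates=200, seed=1):
    """Fraction of replicates in which `partner` alone is closest to `taxon`.

    Assumption: this distance-based check is a stand-in for the grouping
    recovered by a full tree search; it is not PAUP's procedure.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = bootstrap_replicate(alignment, rng)
        dists = {n: hamming(rep[taxon], s) for n, s in rep.items() if n != taxon}
        best = min(dists.values())
        if [n for n, d in dists.items() if d == best] == [partner]:
            hits += 1
    return hits / replicates


# Hypothetical toy alignment; names and sequences are illustrative only.
toy = {
    "strain_A": "ACGTACGTAC",
    "strain_B": "ACGTACGTTC",
    "strain_C": "ACTTACCTAC",
    "strain_D": "GCTTACCTAA",
}
print(nearest_neighbor_support(toy, "strain_A", "strain_B"))
```

When few columns separate the candidates, many replicates end in ties or favor a different partner, and the support value drops. That is the pattern the weak bootstrap values above suggest.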

Ribosomal DNA is made up of conserved, variable, and highly variable regions (Harmsen & Karch, 2004). Highly variable regions, which consist primarily of the loops and turns in the rDNA secondary structure, are under weak selection since they do not have to bind to an opposing strand. They are therefore prone to mutation. These regions, however, do not provide good phylogenetic signal since the rate of substitution is so high that changes in two lineages do not represent shared changes (synapomorphies). Instead, they generally represent independently derived traits (homoplasy). In addition, when synapomorphies do occur in a hyper-variable region, one of the lineages frequently mutates a second time, thereby hiding the shared trait. For this reason the hyper-variable regions not only fail to provide meaningful data, but can also destroy signal that is already present.

In describing species, phylogeneticists expect to compare as complete a data set as possible, and this would include the full gene sequence. On the other hand, studies have found that the information at the 5’ end of the 16S rDNA gene is sufficient for identifying most bacterial species (Harmsen et al., 2003).

Harmsen et al. (2003) sequenced the 5’ end of the 16S rDNA gene from positions 54 to 510 in a group of Mycobacteria. Our study sequenced positions 125-596, which overlap substantially with the region sequenced in Mycobacteria. However, our data could not be resolved as clearly as the mycobacterial data apparently could. It is possible that if we exclude the hyper-variable sites in our sequences, we would get a more consistent signal from our data and thus better resolution in our phylogeny. Research typically excludes these regions since the high substitution rate at these sites tends to produce convergent changes and renders the data somewhat uninformative for evolutionary studies. Some hyper-variable regions that we could remove from our data set in future analyses include the following bases: 173-191, 211-231, 452-479, 1002-1029, and 1130-1143.
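A minimal Python sketch of how such an exclusion might be set up is given below. It assumes that the listed regions and the sequenced window (positions 125-596) use the same gene numbering, and that each sequence is ungapped so that its first base corresponds to position 125. Neither assumption is stated in the text, and the helper names are hypothetical.

```python
# Regions proposed for exclusion and the window sequenced in this study
# (1-based, inclusive; assumed to share one numbering scheme).
HYPERVARIABLE = [(173, 191), (211, 231), (452, 479), (1002, 1029), (1130, 1143)]
WINDOW = (125, 596)


def clip(region, window):
    """Return the part of `region` inside `window`, or None if they are disjoint."""
    start, end = max(region[0], window[0]), min(region[1], window[1])
    return (start, end) if start <= end else None


def excluded_positions(regions, window):
    """Set of gene positions inside `window` that belong to any listed region."""
    positions = set()
    for region in regions:
        clipped = clip(region, window)
        if clipped:
            positions.update(range(clipped[0], clipped[1] + 1))
    return positions


def mask_sequence(seq, window_start, excluded):
    """Drop bases whose gene position is excluded (assumes an ungapped sequence)."""
    return "".join(
        base for offset, base in enumerate(seq) if window_start + offset not in excluded
    )


excluded = excluded_positions(HYPERVARIABLE, WINDOW)
placeholder = "A" * (WINDOW[1] - WINDOW[0] + 1)  # stand-in for a real 472-base read
print(len(excluded), len(mask_sequence(placeholder, WINDOW[0], excluded)))  # → 68 404

# Overlap with the window used by Harmsen et al. (2003), positions 54-510.
print(clip((54, 510), WINDOW))  # → (125, 510)
```

Under these assumptions, only the first three listed regions fall inside positions 125-596 (68 positions in total). Regions 1002-1029 and 1130-1143 lie beyond the window and would come into play only once the complete gene is sequenced.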

It is understood that rDNAs from different species can be very similar, and that closely related species can lack substantial sequence identity (Harmsen & Karch, 2004). Because of this, it can be difficult to establish an objective measure for how similar rDNA sequences have to be in order to be considered the same species. While genotyping is a significantly more objective technique for bacterial classification than phenotyping, the classification still depends on “How close is close enough?” and is still somewhat arbitrary (Harmsen & Karch, 2004).

In future analyses, we will remove the 16S rDNA hyper-variable regions from our sequences to ensure that only conserved and variable regions containing meaningful information are included. We will also incorporate additional genes (possibly ompA) into our analysis for even better phylogenetic resolution within genera or species. Despite the fact that 16S rDNA gene sequences can be easily acquired for a large number of species through the Ribosomal Database Project, it remains important for us to sequence the gene directly from our laboratory strains so that we can continue investigating the evolution of the strains. In addition, sequencing the 16S rDNA genes ourselves eliminates questions regarding potential errors in the databases. After sequencing the complete 16S rDNA gene from the laboratory strains, we will have the necessary initial data to begin future evolutionary experiments, possibly looking at long-term evolution within the strains.

References Cited

Basset, A., Khush, R.S., Braun, A., Gardan, L., Boccard, F., Hoffmann, J.A. & Lemaitre, B. “The phytopathogenic bacteria Erwinia carotovora infects Drosophila and activates an immune response.” Proceedings of the National Academy of Sciences. 2000. 97: 3376-3381.

Brown, J.R., Douady, C.J., Italia, M.J., Marshall, W.E. & Stanhope, M.J. “Universal trees based on large combined protein sequence data sets”. Nature Genetics. 2001. 28: 281-285.

Harmsen, D. & Karch, H. “16S rDNA for Diagnosing Pathogens: a Living Tree”. ASM News. 2004. 70: 19-24.

Harmsen, D., Dostal, S., Roth, A., Niemann, S., Rothganger, J., Sammeth, M., Albert, J., Frosch, M. & Richter, E. “Comprehensive and public sequence database for identification of Mycobacterium species.” BioMed Central Infectious Diseases. 2003. 3: 26. [online] in

Hume, E.B.H., Conerly, L.L., Moreau, J.M., Cannon, B.M., Engel, L.S., Stroman, D.W., Hill, J.M. & O’Callaghan, R.J. “Serratia marcescens keratitis: Strain-specific corneal pathogenesis in rabbits.” Current Eye Research. 1999. 19: 525-532.

Lan, R. & Reeves, P.R. “When does a clone deserve a name? A perspective on bacterial species based on population genetics.” Trends in Microbiology. 2001. 9: 419-424.

Mindell, D.P. & Honeycutt, R.L. “Ribosomal RNA in Vertebrates: Evolution and Phylogenetic Applications”. Annual Review of Ecology and Systematics. 1990. 21: 541-566.

Rubinstein, E.M., Klevjer-Anderson, P., Smith, C.A., Drouin, M.T. & Patterson, J.E. “Enterobacter taylorae, a new opportunistic pathogen: report of four cases.” Journal of Clinical Microbiology. 1993. 31: 249-254.

Sacchi, C.T., Whitney, A.M., Mayer, L.W., Morey, R., Steigerwalt, A., Boras, A., Weyant, R.S. & Popovic, T. “Sequencing of the 16S rRNA gene: A rapid tool for identification of Bacillus anthracis”. 2002. From CDC [online] in .

Sanderson, M.J. & Shaffer, H.B. “Troubleshooting Molecular Phylogenetic Analysis”. Annual Review of Ecology and Systematics. 2002. 33: 49-72.

Swofford, D. “PAUP*: Phylogenetic Analysis Using Parsimony (* and other methods)”. 1997. Sinauer Associates, Sunderland, MA.

Tschape, H., Prager, R., Streckel, W., Fruth, A., Tietze, E. & Bohme, G. “Verotoxinogenic Citrobacter freundii associated with severe gastroenteritis and cases of haemolytic uraemic syndrome in a nursery school: green butter as the infection source”. Epidemiology and Infection. 1995. 114: 441-450.

Wertz, J.E., Goldstone, C., Gordon, D.M. & Riley, M.A. “A molecular phylogeny of enteric bacteria and implications for a bacterial species concept”. Journal of Evolutionary Biology. 2003. 16: 1236-1248.