
Characteristics of genome evolution in obligate symbionts, including the description of a recently identified obligate extracellular symbiont.

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Laura Jean Kenyon, B.S.

Graduate Program in Evolution, Ecology, and Organismal Biology

The Ohio State University

2015

Thesis Committee:

Norman Johnson

Andy Michel

Kelly Wrighton

Zakee L. Sabree, Advisor

Copyright by

Laura Jean Kenyon

2015

Abstract

Animal-bacterial symbioses have shaped the evolution of all eukaryotic organisms. All symbioses have in common a long-term association and therefore provide valuable insight into the evolution and diversification of both partners. Insect-bacterial mutualisms represent the most extreme natural partnerships known, showing evidence of coevolution and obligate interdependence between the partners. Early investigations of plant sap-feeding insects, in particular, revealed tissues of unknown function inhabited by bacteria, and insects deprived of their symbionts showed reduced fitness compared to symbiotic insects. Due to the obligate nature of the relationships, endosymbiotic bacteria are uncultivable, but complete genome sequencing suggests the bacterial mutualists' metabolic capabilities and their likely contribution to the mutualisms, typically nutrient provisioning. A well-supported pattern of bacterial genome evolution for obligate mutualists is extreme genome reduction. This is likely due to relaxed selection upon genes that are not required in the stable environment of the host, leading to an accumulation of deleterious mutations in these genes and eventually to their complete loss, leaving only those genes that are required for the relatively stable lifestyle and for maintenance of the symbiosis.

Furthering these studies, this work includes a comprehensive study of protein length evolution in obligate insect endosymbionts compared to their free-living relatives, testing the long-held assumption that size reductions in individual genes due to small-scale deletions have impacted genome reduction. This was tested using orthologous protein sets from the Flavobacteriaceae (phylum: Bacteroidetes) and Enterobacteriaceae (subphylum: Gammaproteobacteria) families, each of which includes some of the smallest known genomes. Upon examination of protein lengths, we found that proteins were not uniformly shrinking with genome reduction, but instead increased in length variability. Additionally, because complete gene loss also contributes to overall genome shrinkage, we examined the largest proteins in the proteomes of non-host-restricted bacteroidetial and gammaproteobacterial species and found that they were often inferred to be involved in secondary metabolic processes or extracellular sensing, or to be of unknown function. These proteins were absent in the proteomes of obligate insect endosymbionts. Therefore, loss of large proteins not required for host-restricted lifestyles in obligate endosymbiont proteomes likely contributes to extreme genome reduction to a greater degree than protein shrinkage.

To further test these conserved patterns of genome evolution in insect mutualists and to gain insight into a newly described insect-bacterial symbiosis, we sequenced the complete genome of the obligate bacterial symbiont of Halyomorpha halys, an invasive pest in the US. Many phytophagous stink bugs, including H. halys, harbor gammaproteobacterial symbionts necessary for host development. "Candidatus Pantoea carbekii" is the primary occupant of gastric caeca lumina flanking the distal midgut of H. halys insects and is vertically transmitted. To infer contributions of "Ca. P. carbekii" to H. halys, the complete genome was sequenced and annotated from a North American H. halys population. Overall, the "Ca. P. carbekii" genome is nearly one-fourth the size (1.2 Mb) of those of free-living congenerics, yet retains genes encoding many functions that are potentially host-supportive, including nutrient provisioning genes, similar to other mutualists of plant-feeding insects. These genomic resources aid in the continued exploration of insect-bacterial mutualisms.


This document is dedicated to my family and significant other.


Acknowledgments

In addition to the helpful anonymous reviewers of the published work presented here, I would like to thank Franklin Sun and Rafah Asadi for developing Java programs and Python scripts, respectively, used in analyses described in Chapter 1. I am extremely grateful for the time and effort dedicated by my vibrant, inspiring, and supportive committee, including Andy Michel, Kelly Wrighton, Norman Johnson, and my advisor, Zakee Sabree. Zakee is an outstanding mentor who is constantly excited about science, encourages me to excel in my field, is exceptionally creative, and is dedicated to his mentorship. I am proud to have been Zakee’s student and will continue to emulate his intelligence and enthusiasm for science.


Vita

2010 ...... B.S. Zoology, University of Florida

2011 to present ...... Graduate Student, Department of Evolution, Ecology, and Organismal Biology, The Ohio State University

Publications

LJ Kenyon, T Meulia, and ZL Sabree. "Habitat visualization and genomic analysis of 'Candidatus Pantoea carbekii', the primary symbiont of the brown marmorated stink bug." Genome Biology and Evolution (in revision).

LJ Kenyon and ZL Sabree. "Obligate insect endosymbionts exhibit increased ortholog length variation and loss of large accessory proteins concurrent with genome shrinkage." Genome Biology and Evolution 6.4 (2014): 763-775.

RD Denton, LJ Kenyon, KR Greenwald, and HL Gibbs. "Evolutionary basis of mitonuclear discordance between sister species of mole salamanders (Ambystoma sp.)." Molecular Ecology 23.11 (2014): 2811-2824.

F Michonneau, K Netchy, J Starmer, S McPherson, CA Campbell, SG Katz, L Kenyon et al. "Mitochondrial markers reveal many species complexes and non-monophyly in aspidochirotid holothurians." Echinoderms in a Changing World: Proceedings of the 13th International Echinoderm Conference, January 5-9 2009, University of Tasmania, Hobart Tasmania, Australia. CRC Press, 2012.

Fields of Study

Major Field: Evolution, Ecology, and Organismal Biology


Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Publications

Fields of Study

Table of Contents

List of Tables

List of Figures

Chapter 1: Obligate insect endosymbionts exhibit increased ortholog length variation and loss of large accessory proteins concurrent with genome shrinkage

I. Introduction

II. Methods

A. Selection of taxa used in study

B. Detection of orthologous proteins and domain regions

C. Calculating gap frequencies in orthologous protein alignments

D. Evaluation of the maximum protein length in Gammaproteobacteria and Bacteroidetes proteomes

III. Results

A. Orthologous proteins and bacterial lineages used in this study

B. Obligate insect endosymbiont orthologs exhibit increased protein length variation

C. Indel mutations at protein terminuses

D. Large proteins involved in secondary cellular processes are absent in OIE proteomes

IV. Discussion

A. Increased length variability, and not uniform shrinkage, typifies endosymbiont orthologs

B. No sacred ground: functional domains and linker regions of OIE proteins both exhibit elevated length variability

C. Use it or lose it: the loss of genes encoding large, non-essential proteins contributes to genome shrinkage in endosymbiotic lineages

V. Concluding Remarks

Chapter 2: Habitat visualization and genomic analysis of "Candidatus Pantoea carbekii", the primary symbiont of the brown marmorated stink bug

I. Introduction

II. Methods

A. Genome sequencing and annotation

B. Comparative analysis

C. Molecular and phylogenetic reconstruction

D. SNP and genomic synteny analysis

E. H. halys gut microbiome analysis

F. Fluorescence in situ hybridization and electron microscopy

G. Protein analysis of egg lavage

III. Results

A. P. carbekii dominates the H. halys crypt-bearing midgut and is abundant on egg surfaces

B. P. carbekii exhibits genome shrinkage and other consequences of host restriction

C. P. carbekii metabolism and putative role in H. halys physiology

D. P. carbekii plasmids encode genes important for nitrogen assimilation and thiamine biosynthesis

E. Degradation of DNA replication and repair mechanisms and abundant SNPs between P. carbekii strains

F. Degraded division genes may contribute to elongated cell morphology

G. Stress tolerance in P. carbekii

IV. Concluding Remarks

Bibliography

Appendix A: Supplemental Materials for Chapter 1

Appendix B: Supplemental Materials for Chapter 2


List of Tables

Table 1.1: Variability of OIE (obligate insect endosymbiont) protein lengths compared to nonOIE (lifestyle other than OIE) protein lengths. “Range” indicates the range without outliers (as determined by box and whisker plots). “StDev” refers to the standard deviation. “IQR” indicates the interquartile range. “Number” is the number of orthologous sets falling into a certain category. “Proportion” is the Number out of the total number of orthologous sets (82 for Flavobacteriaceae; 71 for Enterobacteriaceae).

Table 1.2: Two-sample unequal variance T-Tests and Mann-Whitney U Tests comparing variability of OIE (obligate insect endosymbiont) versus nonOIE (lifestyle other than OIE) protein lengths. “Range” indicates the range excluding outliers (as determined by plotting). “StDev” is the standard deviation. “IQR” is the interquartile range.

Table 2.1: Genome characteristics compared between Pantoea carbekii and related free-living and symbiotic organisms with varying genome sizes. a: genes encoding the 5S, 16S and 23S ribosomal RNAs were counted. b: genome size is the sum of the chromosome and plasmid(s). c: in addition to seven 5S-23S-16S rRNA operons, an additional 5S rRNA coding region has been annotated. d: two 23S rRNA-5S rRNA operons are present in addition to four 5S-23S-16S rRNA operons. e: two 5S-23S-16S rRNA operons and an additional 16S-23S rRNA operon have been annotated. f: three complete ribosomal RNA operons are annotated. g: a 23S-5S rRNA operon and a separate 16S rRNA gene have been annotated in the genome.

Table 2.2: Brief descriptions, KEGG KO assignments, and emPAI values for proteins of note detected on the egg lavage surface (using LC/MS/MS followed by a Mascot analysis). Note: the emPAI values cannot be compared to one another when the databases used in the Mascot analysis are different. a: pBMSBPS1 database used; b: pBMSBPS2-4 database used; Eubacteria ‘nr’ database used for all others.

Table A.1: All species included in the ortholog analysis, including the abbreviations used in the text, genome size, life style, and accession numbers for Enterobacteriaceae species.

Table A.2: All species included in the ortholog analysis, including the abbreviations used in the text, genome size, life style, and accession numbers for Flavobacteriaceae species.

Table A.3: Description of the dataset used in the orthologous protein analysis, including the number of taxa used in each life style group and the COG distribution of the orthologs used.

Table A.4: List of orthologous proteins used in this study (represented by gene names and descriptions).

Table A.5: Data table listing the proteomes used in the maximum protein length study and descriptions of proteomes, including life style, genome size, total number of proteins, average, maximum, and minimum protein lengths for all Gammaproteobacteria species included.

Table A.6: Data table listing the proteomes used in the maximum protein length study and descriptions of proteomes, including life style, genome size, total number of proteins, average, maximum, and minimum protein lengths for all Bacteroidetes species included.

Table A.7: Average standard deviations for all orthologous protein groups in each family and ANOVAs comparing standard deviations across orthologous protein groups.

Table A.8: Standard deviations (“StDev”) of lengths of proteins shared between Flavobacteriaceae and Enterobacteriaceae.

Table B.1: Single nucleotide polymorphisms in Pantoea carbekii genes.

Table B.2: Quantity of nonsynonymous SNPs in protein-coding genes for genes with more than one nonsynonymous SNP in Pantoea carbekii compared to P. carbekii JPN.

Table B.3: Species names, NCBI accession numbers, and genome sizes for organisms used in the phylogenetic analysis.

Table B.4: Orthologs used in the phylogenetic analysis, represented by their gene name abbreviations.

Table B.5: Presence and absence table comparing the gene content of Pantoea carbekii to Escherichia coli strain K-12 (U00096.3), Pantoea ananatis (AP012032.1), the symbiont of Plautia stali (AP012551.1), "Candidatus Ishikawaella capsulata" (AP010872.1), and Buchnera aphidicola strain APS (NC_002528) (S1F). Presence was determined by pairwise comparisons of proteomes using the blastp program (parameters: e-value: 1e-10) and manual inspection of genome annotations. a: the complete P. carbekii DNA polymerase I is encoded by two adjacent genes.

Table B.6: KEGG KO frequencies in LC/MS/MS Mascot data analysis. Note: proteins could be grouped into more than one KEGG category.


List of Figures

Figure 1.1: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous protein lengths as a function of genome size. The variance for the OIE data points is 1.56 and 2.91 for Enterobacteriaceae and Flavobacteriaceae, respectively. For the nonOIE it is 0.35 and 0.38, respectively.

Figure 1.2: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous domain lengths as a function of genome size. The variance for the OIE data points is 0.63 and 2.29 for Enterobacteriaceae and Flavobacteriaceae, respectively. For the nonOIE it is 0.26 and 0.87, respectively.

Figure 1.3: Average orthologous domain lengths (A, C) and average orthologous linker lengths (B, D) as functions of average orthologous protein lengths for Enterobacteriaceae and Flavobacteriaceae, respectively.

Figure 1.4: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous linker lengths as a function of genome size.

Figure 1.5: Bubble plots of orthologous proteins representing proteins with OIE lengths more variable than nonOIE lengths. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents (A-H). The key shows what various proportions look like as bubbles.

Figure 1.6: Plots showing the maximum protein lengths in the proteomes of 42 Bacteroidetes species (A) and 79 Gammaproteobacteria species (B). In each plot, a reference line is drawn at the cut-off for the maximum protein length found in OIE species (y=1507 a.a. in Bacteroidetes, y=1420 a.a. in Gammaproteobacteria).

Figure 1.7: Obligate insect endosymbionts have lost large proteins nonessential for host-restricted mutualisms. Proteins from free-living and non-obligate host-associated Bacteroidetes (A) and Enterobacteriaceae (B) species that are longer than the longest obligate insect endosymbiont proteins are grouped into those involved in key cellular functions (i.e. core cellular processes) or secondary metabolic processes/pathogenesis/unknown functions (i.e. secondary processes and poorly defined). Parenthesized values represent the number of proteins that were assigned to each category. Length on the y-axis is the number of amino acid (a.a.) residues.

Figure 2.1: Pantoea carbekii is present within and beneath egg extrachorion matrix. A) SEM of the H. halys egg surface with the extrachorion matrix peeling back to reveal an abundance of P. carbekii cells. Scale bar: 0.3 mm. B-C) P. carbekii cells intercalated in extrachorion matrix. B-Scale bar: 5 µm; C-Scale bar: 20 µm. D) TEM imaging of P. carbekii cells within the V4 midgut. Scale bar: 4 µm. E) FISH microscopy of the P. carbekii present in extrachorion matrix lavages with P. carbekii-specific FISH probes. Scale bar: 10 µm.

Figure 2.2: Pantoea carbekii clusters with other gammaproteobacterial symbionts. Maximum Likelihood-based phylogenetic reconstruction of Pantoea carbekii and other gammaproteobacteria using concatenated and aligned orthologous proteins was performed in RAxML. Support values were generated from 100 bootstrap trees and are indicated at branch points. Parenthesized values next to species names are genome sizes in megabases; asterisks indicate estimated sizes for draft genomes. Taxa used, accession numbers, genome sizes, and orthologs used are reported in Appendix B.1.

Figure 2.3: Metabolic reconstruction of Pantoea carbekii. Pathways predicted from the genome that are capable of generating essential (large blue boxes) and nonessential (small blue boxes) amino acids, vitamins (yellow boxes) and metabolites (green boxes) are indicated. Absent genes and products not predicted to be generated by P. carbekii through canonical biosynthesis pathways are indicated in red type and boxes with broken outlines, respectively. Asterisks indicate the presence of genes that encode biosynthetic enzymes predicted to replace missing canonical enzymes. 3PG, 3-phosphoglycerate; P5P, pyridoxal 5-phosphate; BiCrb, bicarbonate; R5P, ribose-5-phosphate; E4P, erythrose-4-phosphate; Rb5P, ribulose-5-phosphate; F6P, fructose-6-phosphate; FRPP, 5-phosphoribosyl 1-pyrophosphate; Gly3P, glyceraldehyde-3-phosphate; FMN, flavin mononucleotide.

Figure A.1: Phylogenetic tree of Flavobacteriaceae created with RAxML using all of the orthologs shared between Flavobacteriaceae and Enterobacteriaceae. Shigella flexneri 2002017 is the outgroup for the Flavobacteriaceae tree. Pink colors indicate obligate insect endosymbionts and blue indicates bacteria that are not obligate insect endosymbionts.

Figure A.2: Phylogenetic tree of Enterobacteriaceae created with RAxML using all of the orthologs shared between Flavobacteriaceae and Enterobacteriaceae. Flavobacteriaceae bacterium 3519 is the outgroup for Enterobacteriaceae. Pink colors indicate obligate insect endosymbionts and blue indicates bacteria that are not obligate insect endosymbionts.

Figure A.3: Bubble plots of all the orthologous proteins for Enterobacteriaceae. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents.

Figure A.4: Bubble plots of all the orthologous proteins for Flavobacteriaceae. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents.

Figure A.5: Plot showing the gap distributions among alignments of orthologous proteins in Enterobacteriaceae (a) and Flavobacteriaceae (b). The 11th bin contains any residues that could not be equally distributed among bins 1-10 and are at the end of the alignment. The data points shown represent averages over three different alignments per alignment algorithm, with the sequences shuffled differently in each of the three alignments.


Chapter 1: Obligate insect endosymbionts exhibit increased ortholog length variation and loss of large accessory proteins concurrent with genome shrinkage.

I. Introduction

Currently, the majority of the smallest known bacterial genomes are borne by host-restricted mutualists that reside within specialized host tissues. These bacteria are vertically inherited and resistant to in vitro cultivation, which is likely the result of their long-term (e.g. tens of millions of years), intracellular host associations (reviewed in Moran et al. 2008). The extreme reduction of these genomes has been the topic of many studies and there are several known contributing factors. Low effective population sizes and frequent bottlenecks, which occur every generation when bacteria are acquired by the offspring of their insect hosts, are prominent factors (Moran et al. 2008, Wernegreen 2002, Moran and Baumann 2000). These lead to an increase in the influence of genetic drift and a relaxation of selection for genes no longer vital in a stable habitat (Wernegreen 2002, Moran and Baumann 2000). Thus, during the initial lifestyle change from free-living to endosymbiotic, genes that are not essential in maintaining the endosymbiotic relationship quickly acquire mutations, become pseudogenized, and are lost. The environmental and nutritional stability within host cells precludes the necessity for many of the enzymes that provide metabolic flexibility and resilience to the diverse conditions experienced by related, free-living taxa. In addition, the opportunities to acquire new genetic material from other bacteria are limited due to isolation within host tissues. Genome reduction is impacted by relaxed selection and by the loss of DNA repair and recombination mechanisms, which contribute to the high mutation rate observed in these tiny genomes (Moran 1996). Additionally, there is a mutational bias towards deletions observed in bacteria (Mira et al. 2001). Yet another factor contributing to the initial genome reduction after a lifestyle change could be the loss of accessory (non-essential) proteins early in genome reduction due to selection for this loss, rather than to drift (Lee and Marx 2012). These factors together likely lead to the extreme reduction in genome size witnessed in insect endosymbionts.

Another potential factor influencing the genome size of bacterial insect endosymbionts is protein length. Previous research has suggested that there may be selection for the smallest possible proteins that are still able to maintain function in bacterial cells, since smaller proteins would be less metabolically expensive (e.g. Wang et al. 2011). This pressure would theoretically be highest in smaller genomes, such as those of obligate endosymbionts. In fact, Charles et al. (1999) showed that many orthologous proteins were shorter in the obligate endosymbiont of aphids, Buchnera aphidicola, than in Escherichia coli, a free-living relative. Similarly, Wang et al. (2011) showed that proteins are decreasing in length in bacterial lineages. However, others have noted that this may not be the case. For instance, Kuo et al. (2009) pointed out that there is no strong association between genome size and doubling time of bacteria and that the obligate endosymbionts (with the smallest genomes) do not have a lifestyle that promotes selection for rapidly dividing cells. In fact, recent evidence suggests that cell division in endosymbionts, for example Rhizobium in legumes (Mergaert et al. 2006) and SOPE in cereal weevils (reviewed in Login and Heddi 2012), might be under host control. Additionally, many insect endosymbionts have reduced ability to undergo cytokinesis and are often present as polyploids, with many chromosome copies contained within a single gigantic cell (reviewed in Baumann 2005; Komaki and Ishikawa 2000; Komaki and Ishikawa 1999). Thus, selection for rapidly dividing cells in endosymbionts appears to be absent or, at best, extremely weak.

Previous work has suggested differential selection within protein sequences depending on the necessity of specific residues for protein function. Specifically, protein regions that perform catalytic functions (i.e. “domains”) are under greater selection than portions of the proteins that do not (i.e. “linker regions”) (e.g. Wang et al. 2007, Wang and Caetano-Anollés 2009). Wang et al. (2011) found that across Bacteria, as proteome diversity (i.e. number of protein-coding genes) decreased, the size of the proteins decreased, the domain lengths stayed the same, and the linker lengths decreased. If proteins were getting shorter, we would expect domains to vary little in length, due to selective pressures, and the linkers to be degraded. Using genome size as an indicator of proteome diversity, orthologous protein sets were identified in two distinct bacterial lineages that included taxa exhibiting extreme genome reduction. Orthologs with shared functional domain architectures were determined, and the hypothesis that, in highly reduced genomes, linker lengths shortened while domain lengths remained fixed in orthologs was examined.

II. Methods

A. Selection of taxa used in study

Orthologs were identified from two bacterial families, Flavobacteriaceae (phylum: Bacteroidetes) and Enterobacteriaceae (subphylum: Gammaproteobacteria), that contained both obligate insect endosymbionts (OIE) and non-obligate insect endosymbionts (nonOIE) with complete genome sequences available. These two families were chosen as the focal groups because they contain the majority of the known OIEs with highly reduced genomes that are fully sequenced and annotated, and the genera included in our analyses were limited to those for which at least two complete genomes were available (e.g. Blattabacterium sp. BPLAN and Blattabacterium sp. Bge). While OIEs are found in other lineages, e.g. Wolbachia in Alphaproteobacteria and Tremblaya in Betaproteobacteria, potential biases associated with having only one genus of OIE per phylum or subphylum to compare with nonOIE relatives were avoided. Finally, similar numbers of species representing OIE and nonOIE lifestyles within each of the two families were chosen to allow balanced statistical comparisons; 20 OIE and 27 nonOIE Enterobacteriaceae genomes and 11 OIE and 11 nonOIE Flavobacteriaceae genomes were included (Appendix A.1). The selected nonOIE species included non-obligately host-associated and free-living taxa and were chosen based on the diversity of their habitats (see Appendix A.1). Proteomes were downloaded from the National Center for Biotechnology Information ‘nr’ database (http://www.ncbi.nlm.nih.gov) using custom, in-house informatics tools.

B. Detection of orthologous proteins and domain regions

OrthoMCL (Chen et al. 2006) and custom Perl scripts were used to identify orthologous proteins, and the Pfam web interface (Punta et al. 2012) was used (e-value = 1.0, default) to search the Pfam database for domain identification. Orthologous protein sets were retained if their members shared identical domain architecture (i.e. number and annotation of domains) and protein annotation (if available), and if the standard deviation of total protein length was less than 30. Preliminary inspection of the protein lengths revealed that standard deviations of more than 30 within orthologous groups were almost always due to gene fusions in one of the bacterial lineages. Since including gene-fusion events in this study could be misleading, these groups were removed. Proteins with multiple overlapping domains were not included unless one domain entirely encompassed a smaller domain, in which case the length of the larger domain was used. Upon visual inspection of individual MAFFT alignments of orthologous protein sets (Katoh and Standley 2013), all outliers due to annotation errors were edited or excluded. Pfam was used to identify functional domain region lengths, and linker regions were defined as protein regions not annotated as part of a functional domain. Linker length in each protein was determined by subtracting the sum of the functional domain lengths from the total protein length. Statistical analyses were performed within R (http://www.r-project.org). RAxML was used within the CIPRES portal to generate maximum likelihood trees for each family from concatenated orthologous protein sequences, using the cpREV protein matrix (Stamatakis et al. 2008; Miller et al. 2010; Adachi et al. 2000).
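The domain, linker, and length-filter rules described above can be illustrated with a minimal Python sketch. This is not the custom code used in the study; the function names, the 1-based inclusive domain coordinates, and the use of the sample standard deviation are assumptions made only for illustration.

```python
from statistics import stdev


def domain_and_linker_lengths(protein_length, domains):
    """Return (total_domain_length, linker_length), or None for excluded proteins.

    `domains` is a list of (start, end) pairs, 1-based and inclusive
    (an assumed convention, not taken from the thesis).
    """
    kept = []  # outermost, non-overlapping domains accepted so far
    # Sort by start, longest first on ties, so an enclosing domain is seen
    # before any smaller domain that it contains.
    for start, end in sorted(domains, key=lambda d: (d[0], -d[1])):
        if kept and start <= kept[-1][1]:
            if end <= kept[-1][1]:
                continue  # fully nested: only the larger domain is counted
            return None  # partial overlap: the protein is not included
        kept.append((start, end))
    domain_total = sum(end - start + 1 for start, end in kept)
    # Linker length = total protein length minus summed domain lengths.
    return domain_total, protein_length - domain_total


def passes_length_filter(lengths, max_sd=30):
    """Keep an orthologous set only if the standard deviation of its lengths is below max_sd.

    The sample standard deviation is an assumption; the text does not say which was used.
    """
    return stdev(lengths) < max_sd
```

For a hypothetical 250-residue protein with domains at 10-100, 20-50, and 150-200, the nested domain is ignored, giving 142 domain residues and 108 linker residues.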

C. Calculating gap frequencies in orthologous protein alignments

To evaluate which region(s) of the proteins (terminuses versus central portions) tended to undergo more insertion or deletion mutations (i.e. mutations that would affect protein length), a custom Python (http://python.org) script was implemented. The script divided each protein in an alignment into 10 equal sections, plus an 11th section when the alignment could not be divided evenly (containing the remaining residues that did not fit into the first 10 sections), and counted the gaps (“-”) present in each section. Alignments were performed in ClustalW (Larkin et al. 2007), MAFFT (parameters: E-INS-i and L-INS-i algorithms), and MUSCLE (Edgar 2004) for each orthologous protein set. Each alignment algorithm was run in triplicate with randomly shuffled sequences for each run, to avoid bias based on sequence input order, and averages of these three runs are reported.
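The gap-counting step can be sketched as follows. This is an illustrative reconstruction, not the original script: the function names are hypothetical, and this version always returns 11 counts, with the 11th equal to zero when the alignment length divides evenly by 10.

```python
def gap_counts_by_section(aligned_seq, n_sections=10):
    """Count gap characters ('-') in each of n_sections equal slices of one aligned sequence.

    The final (n_sections + 1)-th count covers the leftover columns at the end
    of the alignment that do not fit into the equal slices.
    """
    size = len(aligned_seq) // n_sections
    counts = [
        aligned_seq[i * size:(i + 1) * size].count("-")
        for i in range(n_sections)
    ]
    counts.append(aligned_seq[n_sections * size:].count("-"))
    return counts


def alignment_gap_profile(aligned_seqs, n_sections=10):
    """Sum the per-section gap counts over every sequence in an alignment."""
    totals = [0] * (n_sections + 1)
    for seq in aligned_seqs:
        for i, count in enumerate(gap_counts_by_section(seq, n_sections)):
            totals[i] += count
    return totals
```

Averaging such profiles over the three shuffled runs of each aligner would then give one value per section, as described in the text.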

D. Evaluation of the maximum protein length in Gammaproteobacteria and Bacteroidetes proteomes

Proteomes from 79 Gammaproteobacteria and 42 Bacteroidetes species occupying different habitats (including species outside of the two focal families used in the previous ortholog domain and linker analyses) were used to evaluate maximum protein lengths, and total proteomes were examined (Appendix A.2). Representative proteomes were downloaded from the NCBI database and ‘infoseq’ (EMBOSS, Rice et al. 2000) was used to determine protein lengths. A custom Python script extracted protein sequences with lengths longer than the desired cut-offs. BLASTP searches of the selected proteins against the Clusters of Orthologous Groups (COG) database (Tatusov et al. 1997) were performed to assign these proteins to functional groups. The KAAS - KEGG Automatic Annotation Server (Moriya et al. 2007) was also used to assign the proteins to KEGG (Kyoto Encyclopedia of Genes and Genomes) Orthology groups (http://www.kegg.jp).
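The length cut-off step can be sketched in a few lines of Python. This is an assumed, simplified stand-in for the custom script mentioned above: it reads FASTA-formatted lines directly rather than 'infoseq' output, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def proteins_longer_than(lines, cutoff):
    """Return (header, length) for every protein strictly longer than cutoff residues."""
    return [
        (header, len(seq))
        for header, seq in read_fasta(lines)
        if len(seq) > cutoff
    ]
```

With a proteome file opened as `handle`, a call such as `proteins_longer_than(handle, 1420)` would list the proteins longer than 1420 residues, a cut-off of the kind used for the Gammaproteobacteria proteomes.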

III. Results

A. Orthologous proteins and bacterial lineages used in this study

A total of 82 and 71 orthologs were drawn from the proteomes of 22 Flavobacteriaceae and 47 Enterobacteriaceae species, respectively, to determine if analogous or phylum-specific trends related to bacterial lifestyle could be detected (Appendix A.1). Genome sizes for these taxa spanned 0.24-6.1 Mb, and taxa were classified based on their habitat (i.e. obligate insect endosymbionts, ‘OIE’, or non-obligate insect endosymbionts, ‘nonOIE’). Orthologous protein sets were used to test hypotheses about the impact of bacterial species lifestyle on genome reduction and protein length in two bacterial lineages.

B. Obligate insect endosymbiont orthologs exhibit increased protein length variation

Evidence from previous studies has suggested that protein length decreases with total CDS length (e.g. Wang et al. 2011) and that orthologous proteins of OIE genomes are smaller than those of nonOIE relatives (e.g. Charles et al. 1999). The hypothesis that average protein length decreases as genome size decreases in two divergent bacterial lineages was tested. Since total CDS length was highly correlated with genome size in both families (p < 2.2e-16, R² > 0.97), only genome size was used. On average, OIE orthologs decreased in length and exhibited greater total length variation in both Enterobacteriaceae (p=2.6e-6, R²=0.39) (Figure 1.1a) and Flavobacteriaceae (p=0.014, R²=0.27) (Figure 1.1b) lineages. The increased variability of average OIE protein lengths likely reflects elevated mutation rates commonly observed in endosymbiotic lineages (e.g. Moran et al. 2008). Additionally, OIE lineages had longer branch lengths than nonOIE lineages in maximum-likelihood trees based on concatenated orthologs (Appendix A.3), supporting greater sequence divergence in obligate host-associated taxa.
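For readers unfamiliar with the R² values reported above, the following sketch fits an ordinary least-squares line of average ortholog length on genome size and returns the coefficient of determination. The genome sizes and lengths are hypothetical and are not data from this study, the function name is illustrative, and the significance test behind the reported p-values is not reproduced.

```python
def least_squares_fit(x, y):
    """Return (slope, intercept, r_squared) for the least-squares line y = a*x + b."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Hypothetical values: genome size (Mb) and average ortholog length (a.a.).
genome_mb = [0.5, 1.0, 4.0, 5.0]
avg_length = [300.0, 302.0, 306.0, 308.0]
slope, intercept, r_squared = least_squares_fit(genome_mb, avg_length)
print(round(slope, 3), round(r_squared, 3))
```

A low R², as reported for both families, means that genome size accounts for only a modest share of the variation in average ortholog length even when the slope is significantly different from zero.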


Figure 1.1: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous protein lengths as a function of genome size. The variance for the OIE data points is 1.56 and 2.91 for Enterobacteriaceae and Flavobacteriaceae, respectively. For the nonOIE it is 0.35 and 0.38, respectively.

If functional domains are under greater selection than linker regions due to their role in activity, then, as genome sizes decrease, average protein lengths were expected to decrease while average functional domain lengths remained fixed. However, average orthologous domain lengths were positively correlated (p<0.05) with genome size in both families (Figure 1.2a-b) and the variance of average OIE domain lengths was greater than that of nonOIE average domain lengths. Both bacterial lineages exhibited significant positive correlations between the average total protein lengths and the average domain lengths (Figure 1.3a,c). Additionally, several domains showed significantly (T-Test p<0.05) greater lengths in the OIE proteins than in the nonOIE proteins. Therefore, domain lengths did not remain fixed as genome size varied in either lineage (Figure 1.2a-b). Instead, these data show that domain lengths in OIE proteins are more variable than in nonOIE orthologs.


Figure 1.2: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous domain lengths as a function of genome size. The variance for the OIE data points is 0.63 and 2.29 for Enterobacteriaceae and Flavobacteriaceae, respectively. For the nonOIE it is 0.26 and 0.87, respectively.


Figure 1.3: Average orthologous domain lengths (A, C) and average orthologous linker lengths (B, D) as functions of average orthologous protein lengths for Enterobacteriaceae and Flavobacteriaceae, respectively.


Figure 1.4: Enterobacteriaceae (A) and Flavobacteriaceae (B) average orthologous linker lengths as a function of genome size.


The expectation that protein linker regions are less sensitive to indel mutations, due to lower functional constraints, and exhibit greater length variation than functional domains in reduced genomes was explored. Positive correlations between average linker length and both genome size (Figure 1.4a) and average protein length (Figure 1.3b) were observed for Enterobacteriaceae orthologs, with OIE protein linker lengths differing by 4 amino acids and average total protein lengths differing by ~7 amino acids. These results suggest that, on average, both linker and domain length variations are contributing to total protein length variation. Average linker lengths for Flavobacteriaceae orthologs did not vary significantly with genome size (Figure 1.4b) or total protein length (Figure 1.3d).

Length distributions for orthologous protein sets were investigated individually to detect whether OIE proteins were more variable in length than nonOIE proteins and to exclude the possibility of a few outliers biasing the trends observed when using averages. Although the genomes of OIE species were generally smaller (<1 Mb) than those of nonOIE species (>1 Mb), OIE orthologs were not uniformly smaller, but instead more variable in length than their orthologous counterparts in nonOIE species. In fact, over half of the OIE orthologous protein sets in both families were, on average, longer than orthologs in nonOIE species, and the distributions of ortholog lengths for >75% of the OIE orthologous protein sets had ranges that were larger than (excluding outliers) and significantly different (T-Test and Mann-Whitney p<0.001) from those of nonOIE orthologs (Tables 1.1-1.2). Additional metrics of differences (e.g. standard deviations and interquartile ranges) also showed significantly greater (p<0.001, Table 1.2) length variation for OIE orthologs vs. nonOIE orthologs for nearly all orthologous protein sets for both bacterial families.
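The three variability metrics compared here (range excluding outliers, standard deviation, and interquartile range) can be computed for one orthologous protein set as sketched below. The outlier rule is an assumption: box-and-whisker outliers are taken to be values more than 1.5 interquartile ranges beyond the quartiles, which is the common convention but is not stated in the text. The function name and the length values are hypothetical.

```python
import statistics


def length_variability(lengths):
    """Return (range_without_outliers, sample_stdev, iqr) for protein lengths."""
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed whisker rule: keep values within 1.5 * IQR of the quartiles.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    kept = [v for v in lengths if low_fence <= v <= high_fence]
    return max(kept) - min(kept), statistics.stdev(lengths), iqr


# Hypothetical lengths (a.a.) of one ortholog in OIE and nonOIE proteomes.
oie_lengths = [300, 301, 303, 305, 310, 340]
non_oie_lengths = [302, 302, 302, 303, 303, 303]

oie_range, oie_sd, oie_iqr = length_variability(oie_lengths)
non_range, non_sd, non_iqr = length_variability(non_oie_lengths)
print(oie_range, oie_iqr, non_range, non_iqr)
```

In this toy set the value 340 is treated as an outlier and excluded from the range, but it still contributes to the standard deviation, which is why the three metrics can classify the same protein set differently in Table 1.1.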

Bubble plots for each orthologous set were generated to illustrate these differences (Figure 1.5a-h, Appendix A.4). Finally, the variability of domain versus linker or total protein lengths was compared for each individual orthologous protein set, with the prediction that linker lengths would have greater variability than domain lengths due to greater selection upon functional domains. No significant differences between the standard deviations of the overall protein, domain, and linker lengths were observed for the OIE and nonOIE orthologs of either family (Appendix A.5), which does not support the prediction.


Table 1.1: Variability of OIE (obligate insect endosymbiont) protein lengths compared to nonOIE (lifestyle other than OIE) protein lengths. “Range” indicates the range without outliers (as determined by box and whisker plots). “StDev” refers to the standard deviation. “IQR” indicates the interquartile range. “Number” is the number of orthologous sets falling into a certain category. “Proportion” is the Number out of the total number of orthologous-sets (82 for Flavobacteriaceae; 71 for Enterobacteriaceae).

Flavobacteriaceae
Length Variability      Range (%)      Standard Deviation (%)   IQR (%)
No variability          5 (6.10%)      2 (2.44%)                5 (6.10%)
Variable in both        4 (4.88%)      0 (0.00%)                4 (4.88%)
nonOIE more variable    11 (13.41%)    14 (17.07%)              16 (19.51%)
OIE more variable       62 (75.61%)    66 (80.49%)              57 (69.51%)

Enterobacteriaceae
Length Variability      Range (%)      Standard Deviation (%)   IQR (%)
No variability          11 (15.49%)    1 (1.41%)                11 (15.49%)
Variable in both        1 (1.41%)      0 (0.00%)                1 (1.41%)
nonOIE more variable    4 (5.63%)      10 (14.08%)              5 (7.04%)
OIE more variable       55 (77.46%)    60 (84.51%)              54 (76.06%)

Table 1.2: Two-sample unequal variance T-Tests and Mann-Whitney U Tests comparing variability of OIE (obligate insect endosymbiont) versus nonOIE (lifestyle other than OIE) protein lengths. “Range” indicates the range excluding outliers (as determined by plotting). “StDev” is the standard deviation. “IQR” is the interquartile range.

Flavobacteriaceae                         Range       StDev       IQR
nonOIE average                            3.87        2.48        2.80
OIE average                               12.15       6.68        5.82
T-Test P-value (nonOIE vs. OIE)           2.20E-06    1.43E-06    0.00033
Mann-Whitney P-value (nonOIE vs. OIE)     8.74E-09    2.00E-09    2.98E-06

Enterobacteriaceae                        Range       StDev       IQR
nonOIE average                            1.87        1.68        0.70
OIE average                               7.04        3.07        3.06
T-Test P-value (nonOIE vs. OIE)           1.59E-06    0.00016     6.99E-07
Mann-Whitney P-value (nonOIE vs. OIE)     1.12E-10    4.50E-09    4.71E-11


Figure 1.5: Bubble plots of orthologous proteins representing proteins with OIE lengths more variable than nonOIE lengths. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents (A-H). The key shows what various proportions look like as bubbles.


C. Indel mutation bias at protein terminuses

Previous work has suggested that protein mutations are more likely to occur at the N- or C-terminuses rather than in the central core of the protein (Charles et al. 1999, Kurland et al. 2007); thus, the hypothesis that mutations would occur at the terminuses, irrespective of domain location, was tested. In both the Enterobacteriaceae and Flavobacteriaceae orthologs studied, most of the gaps occurred at the terminuses when using four different alignment algorithms and three different sequence orders for each algorithm (Appendix A.6).

D. Large proteins involved in secondary cellular processes are absent in OIE proteomes

Examination of maximum protein lengths within Gammaproteobacteria and Bacteroidetes OIE proteomes revealed that none were larger than 1507 amino acids long, while the nonOIE members of these groups had maximum protein lengths ranging from 1408 a.a. (Serratia symbiotica str. ‘Cinara cedri’, RNA polymerase beta’ subunit, NCBI accession# AEW44827) up to 10,708 a.a. (Xanthomonas albilineans non-ribosomal peptide synthase; NCBI accession# YP_003375559; Figure 1.6). OIE genomes are enriched in genes encoding enzymes involved in central metabolic processes, namely ATP and nutrient-generating pathways, while lacking those genes encoding enzymes involved in secondary metabolism and cell division. To this end, the annotated functions for all enzymes longer than the longest OIE proteins (hereafter referred to as ‘large proteins’) for the two bacterial groups (i.e. >1420 a.a. for Gammaproteobacteria and >1507 a.a. for Bacteroidetes) were investigated to determine what proportion of these proteins could be classified as participating in a) core cellular processes (e.g. and energy production; amino acid and vitamin biosynthesis; DNA replication, transcription and translation), or b) secondary/‘conditional’ processes (e.g. toxin and self-defense compound production; extracellular sensing) or unknown functions. Large proteins involved in core cellular processes were generally the same length and <1,600 residues long, while those assigned to secondary processes or of unknown function exceeded 1,600 residues in length and were, on average, significantly longer (Mann-Whitney test: gammaproteobacterial: χ²=73.9, df=1, p<0.0001; bacteroidial: χ²=22.6, df=1, p<0.0001) than proteins assigned to the core cellular processes (Figure 1.7a-b).

The relatively few large proteins assigned to the core processes category were involved in DNA replication (e.g. bifunctional DNA polymerase III subunit alpha/DNA polymerase III, epsilon subunit), RNA transcription (e.g. RNA polymerase beta’ subunit), recombination (e.g. V and helicase) and purine assembly (e.g. phosphoribosylformyl-glycineamide synthetase). Shorter orthologs of these proteins (as defined by ‘blastp’ alignment resulting in >65% amino acid sequence identity over >80% of the length of OIE proteins) were detected in OIE proteomes. Between 87% and 96% of the large proteins were involved in secondary processes, of which 18% and 48% of the gammaproteobacterial and bacteroidial large proteins, respectively, were conserved hypothetical proteins up to 7986 residues long. Intracellular invasion and survival, polyketide and nonribosomal peptide synthesis, cellulose degradation, radical oxygen species scavenging and biofilm formation were among the many cellular operations in which large proteins assigned to secondary processes were engaged.
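The ortholog criterion stated above (>65% amino acid identity over >80% of the length of the OIE protein) can be expressed as a simple filter on alignment results. The sketch below is an assumed reading of that criterion: it treats each protein pair as a single alignment, and the function name, argument names, and example values are hypothetical rather than taken from the analysis.

```python
def is_shorter_ortholog(pident, oie_start, oie_end, oie_length,
                        min_identity=65.0, min_coverage=0.80):
    """Return True if a hit exceeds both thresholds: percent identity and the
    fraction of the OIE protein covered by the alignment."""
    aligned_residues = abs(oie_end - oie_start) + 1
    coverage = aligned_residues / oie_length
    return pident > min_identity and coverage > min_coverage


# Hypothetical hits against a 1400-residue OIE protein.
full_hit = is_shorter_ortholog(70.2, 1, 1300, 1400)     # ~93% coverage
partial_hit = is_shorter_ortholog(70.2, 1, 1000, 1400)  # ~71% coverage
weak_hit = is_shorter_ortholog(60.0, 1, 1400, 1400)     # identity too low
print(full_hit, partial_hit, weak_hit)
```

Requiring coverage of the OIE protein, rather than of the longer nonOIE protein, allows a shortened endosymbiont ortholog to qualify even when it aligns to only part of its larger counterpart.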

Figure 1.6: Plots showing the maximum protein lengths in the proteomes of 42 Bacteroidetes species (A) and 79 Gammaproteobacteria species (B). In each plot, a reference line is drawn at the cut-off for the maximum protein length found in OIE species (y=1507 a.a. in Bacteroidetes, y=1420 a.a. in Gammaproteobacteria).


Figure 1.7: Obligate insect endosymbionts have lost large proteins nonessential for host-restricted mutualisms. Proteins from free-living and non-obligate host-associated Bacteroidetes (A) and Enterobacteriaceae (B) species that are longer than the longest obligate insect endosymbiont proteins are grouped into those involved in key cellular functions (i.e. core cellular processes) or secondary metabolic processes/pathogenesis/unknown functions (i.e. secondary processes and poorly defined). Parenthesized values represent the number of proteins that were assigned to each category. Length on the y-axis is the number of amino acid (a.a.) residues.

IV. Discussion

A. Increased length variability, and not uniform shrinkage, typifies endosymbiont orthologs

Based on previous observations that protein length reduction is positively correlated with proteome size (e.g. Wang et al. 2011, Kurland et al. 2007) and that protein lengths for obligate insect endosymbiotic lineages (OIE) are shorter than those seen in their free-living relatives (e.g. Charles et al. 1999), the hypothesis that protein length decreases with genome size was tested in two divergent lineages of bacteria. OIE orthologs were more variable in length, and not uniformly smaller, than those in their free-living relatives (nonOIE) for both bacterial families. Genome size and average ortholog lengths were significantly correlated for the Flavobacteriaceae and Enterobacteriaceae, but the correlation coefficients were low (R² < 0.40) and several average OIE ortholog lengths were larger than averages of nonOIE ortholog lengths. In both families, the variance of average OIE ortholog lengths was greater than that of nonOIE orthologs.

Patterns of individual orthologous protein length variance reflected overall average ortholog length variance in that OIE ortholog lengths varied more than nonOIE ortholog lengths. Examining ortholog length distributions for individual orthologous protein sets excluded the possibility that averaging protein lengths across all proteins per habitat, per lineage, masked distinct length distribution patterns. For example, some orthologous protein sets may have proteins that are getting smaller with genome reduction, but overall average lengths would not reflect this if proteins in other orthologous protein sets were getting larger. However, most of the orthologs did not uniformly decrease in size with genome reduction, since the average lengths of the OIE proteins were larger than the averages of the nonOIE proteins for over half of the orthologous protein sets in both bacterial families.


Increased ortholog length variability, rather than ortholog length reduction, in OIE taxa did not reflect initial predictions but is not without precedent. For instance, although Kurland et al. (2007) suggested that there is pressure for minimal size of proteins in Archaea and Bacteria, they also found that there was only a weak positive correlation between average protein length and genome size. Further, they noted that the inclusion of pseudogenes could make it appear as though proteins are getting smaller, when, in actuality, many of them are no longer functional (Kurland et al. 2007). This is perhaps the case in previous studies that have detected a decrease in ortholog length with genome size, especially considering the increased pseudogenization of genes that occurs during the evolution of obligate mutualisms between insects and endosymbionts. The data examined here only included orthologous proteins with the same domain architecture, to the exclusion of annotated pseudogenes. Also, previous work has suggested that there may be selection pressures in bacteria for minimally sized proteins in large, complex proteomes with large population sizes; however, this pressure is likely weak in OIE lineages due to their simple, small proteomes and small effective population sizes (Kurland et al. 2007). Thus, the extreme genome reduction observed in OIEs is attributed not to selection for smaller proteins, but to the loss of entire genes resulting from the loss of DNA repair mechanisms, lack of opportunities to acquire new genetic information, relaxed selection on proteins no longer necessary in their stable, intracellular environments, and the bottlenecks experienced every generation (e.g. Wernegreen and Moran 1999, Wernegreen 2002). In fact, Charles et al. (1999) noted that although they found that Buchnera genes were significantly smaller than E. coli genes (85 protein genes examined), this reduction in protein lengths could only account for <0.005% of the genome size reduction in Buchnera, and thus it is unlikely that ortholog length reduction decreases the energetic cost of protein synthesis enough to be strongly selected for.

Further experimental work is needed to address how the observed length variability in OIE orthologs impacts their biosynthetic functions within their specific host contexts and if adaptive advantages exist. Some evidence based on well-studied proteins such as the lac repressor, T4 lysozyme, λ Cro, λ repressor, and Staphylococcus , suggests that many amino acid substitutions have neutral or nearly neutral impacts on the structural stability of proteins (Kurland 1992). However, it is unclear what impact indels have on the structural stability of OIE proteins. Lambert and Moran (1998) showed that the secondary structures of Domain I of the 16S rRNA in OIE genomes were less stable than their nonOIE counterparts and that the OIE 16S rRNA stabilities varied more than the nonOIE stabilities (which showed very little variation). Thus, if proteins exhibit patterns similar to those of the 16S rRNA, it seems likely that the increased variability observed in OIE orthologous protein lengths may reflect decreased stability in these proteins. Supporting this hypothesis is the observed overexpression of the chaperone GroEL in Buchnera cells, perhaps reflecting an increased need for protein folding error-correction by the chaperone due to a greater number of misfolded proteins (e.g. Baumann et al. 1996, Fares et al. 2004, Moran 1996).


B. No sacred ground: functional domains and linker regions of OIE proteins both exhibit elevated length variation

The variability observed in OIE ortholog lengths is due to variation in both the linker and the domain regions of the orthologs, contrary to the expectation that domain lengths would remain fixed throughout genome reduction. Protein linker regions are assumed to lack a stable tertiary structure and are involved in binding and recognition of a diversity of molecules to assist in complex formation, whereas the domain regions are involved in reaction and are thus functionally conserved evolutionary units (Wang et al. 2011, Kurland et al. 2007). Conserved regions were predicted to remain unchanged during the extreme genome reduction seen in OIE species. However, the average lengths of domains varied with genome size and there were significant positive correlations between the average domain and average ortholog lengths, suggesting that domain lengths were not fixed. As observed with overall ortholog lengths, domain lengths were also more variable in OIE orthologs than in nonOIE orthologs and were not uniformly smaller in OIE orthologs. Ortholog linker lengths varied, as expected, in Enterobacteriaceae, with positive correlations between average linker lengths and genome size, and between average linker lengths and average ortholog lengths. Although the two bacterial families chosen were expected to show converging patterns of protein length evolution, similar patterns were not observed in orthologs of Flavobacteriaceae taxa, suggesting that the variation observed in overall ortholog lengths in this family was due to variation in the domain regions, not the linker regions. This could be due to several factors. For instance, the divergence times between the OIE and nonOIE organisms in each of the families are not the same and thus we could be looking at different snapshots of evolutionary time when comparing the two families (and therefore witnessing slightly different patterns). Alternatively, since more orthologs were used in the Flavobacteriaceae studies at the cost of studying fewer organisms, this could be influencing the results. Lastly, perhaps the same evolutionary rules do not apply to both families: within Enterobacteriaceae we see the patterns we might expect, with protein length decreasing as genome size decreases, but this is not the case in Flavobacteriaceae due to unknown differences.

Since the orthologous domain lengths vary more than expected (in that no variation was expected), the hypothesis that more indel mutations occur at the terminuses of the proteins, rather than the central core, irrespective of domain location, was tested. This has been suggested in previous studies (e.g. Charles et al. 1999, Kurland et al. 2007) and was observed in this study. Previous work has suggested that amino acid substitutions with the largest effect on the physical stability of the protein are usually not in the active sites, but rather in the hydrophobic core of the protein (reviewed in Kurland 1992). In addition to affecting protein stability significantly, mutations in the central core might simply be less likely than mutations of the peripherally exposed residues (Chothia and Gough 2009). Furthermore, Chothia and Gough (2009) suggested that mutations in areas outside of the core are four times more likely than those inside of the core. Thus, the lack of conservation of domain lengths observed might be because the domains are not always located in the central core of the proteins, and thus are not always the regions that affect protein stability significantly. Another potential cause for the observed domain length variance in OIE proteins could be that some of the OIE genomes are still in the process of reduction and have not yet reached their minimal genome. In other words, these data may reflect varying levels of genome reduction within the OIEs, thus leading to varying domain lengths. Although less likely due to the strong bottlenecks experienced by OIEs at each generation, domains may be under positive selection where they exhibit dramatically longer or shorter lengths, but further evidence from molecular or enzymological investigations is needed to support this hypothesis.

Protein, domain, and linker lengths appear to vary amongst OIE proteomes, but it is unclear whether an amino acid length difference of only 1 or 2 residues would impact the function, and therefore evolution, of the protein. While some single mutations cause no measurable changes in fitness, others can change the function completely or make the protein nonfunctional. Only 30% of amino acid mutations in TEM1 β-lactamase lead to a decrease in fitness (reviewed in Soskine and Tawfik 2010), yet mutation analyses in other reported loci reveal a range of effects of amino acid substitutions and indels on protein function. Single amino acid changes in the Salmonella mannose-specific type I fimbrial adhesin protein FimH have been shown to alter its mannose binding ability and dramatically impact serovar pathogenicity (Kisiela et al. 2012). Similarly, uropathogenic Escherichia coli strains have a structural point mutation in fimH, which encodes a FimH protein with elevated monomannose binding affinity in the bladder (Sokurenko et al. 2004). Another example of how the addition of a few amino acids to a protein can lead to a new function is the GroEL generated by Enterobacter aerogenes symbionts of predatory larval antlions (family Myrmeleontidae, order Neuroptera). While this GroEL functions as a protein-folding chaperone, it is also employed by the host as a toxin that is secreted in its saliva and paralyzes prey. Toxicity of the GroEL produced by the antlion symbiont has been linked to 4 residues, which are not found in nontoxic GroEL produced by related enterobacteria (Yoshida et al. 2001). These few examples provide some evidence that relatively few residue changes can significantly alter protein function.

C. Use it or lose it: the loss of genes encoding large, non-essential proteins contributes to genome shrinkage in endosymbiotic lineages

Increased gene length variability, and not uniform length reduction relative to genome size, was observed for OIE orthologs, which suggests that the contribution of ortholog length changes to overall genome size reduction is minimal. Gene deletions likely have a greater impact on overall genome size reduction than genic indels. Genes encoding functions not involved in core cellular processes like cell maintenance and central metabolism are often absent in the genomes of obligate intracellular taxa, and the loss of these genes, especially if they are generally longer, would contribute to rapid overall genome size reduction. Therefore, nonOIE gammaproteobacterial and bacteroidial proteins longer than the longest OIE proteins were functionally categorized (i.e. core cellular processes or secondary processes/unknown function) to determine a) how these proteins were distributed between these categories and b) if these proteins had smaller orthologs amongst the OIE proteins. For both bacterial groups the majority of large proteins were >1,600 residues long and categorized as poorly defined or involved in secondary processes (Figure 1.7a-b), and these proteins were, on average, significantly longer than those involved in core cellular processes. Overall, few of the large nonOIE proteins had orthologs in the OIE proteomes, and these were limited to enzymes involved in DNA replication, RNA transcription, recombination and purine assembly. Specific functions of enzymes that fell within the secondary processes category were secondary metabolite production (e.g. nonribosomal peptide synthases, polyketide synthetases and RTX toxins) (Hur et al. 2012, Hamdache et al. 2013, Marahiel et al. 1997), virulence (e.g. Rhs family proteins and adhesins), extracellular sensing (e.g. kinases) and helicase activity. Large, extracellular, alpha-helical proteins, which include horizontally acquired alpha-2-macroglobulins that are typically found in pathogenic or saprophytically colonizing species and likely assist in colonization (Budd et al. 2004), comprised, along with uncharacterized membrane proteins, the majority of large bacteroidial proteins. Additionally, Rhs family proteins were among the most abundant large proteins missing in OIEs of either phylum. These poorly characterized proteins have been implicated in inflammasome activation in Pseudomonas (Kung et al. 2012) and in signaling peptide transport across the inner membrane (Hill et al. 1994). Conserved hypothetical proteins of unknown function were also abundant among the largest nonOIE proteins and absent in OIE proteomes. The absence of large proteins in OIE proteomes can be explained by the loss of lifestyle-specific proteins, present as individual loci or as genomic islands (e.g. Frank et al. 2002), due to relaxed selection for these functions under the constant, intracellular host conditions. Alternatively, there might have been strong selection for the loss of these large accessory proteins early in genome reduction (Lee and Marx 2012), although the cost of having these large proteins for OIEs is not clear.

Obviously, deletion of long genes nonessential for intracellular habitation and mutualism with the host will contribute to rapid genome reduction more so than loss of shorter genes, but it is not clear, nor is it within the scope of this study to examine, why genes encoding enzymes involved in core cellular processes were generally shorter than those of unknown function or involved in secondary cellular processes.

V. Concluding Remarks

Protein lengths do not scale with genome reduction as expected, but genetic drift and relaxed selection explain the observed length variations in OIE orthologs. As additional genomes for obligate insect endosymbionts from other bacterial phyla become available, expansion of this study will be possible to ascertain the robustness of our results across diverse bacterial lineages. This work sets the stage for further characterization of the impact of the indels observed in OIE orthologs on their functions. In light of observed increased constitutive GroEL expression in Buchnera (Fares et al. 2004), indels in endosymbiont orthologs may have led to decreased protein stability that GroEL overexpression may compensate for. Alternatively, indels in endosymbiont orthologs may impact their substrate-binding capacities, leading to promiscuous binding and possible additional functional capabilities (e.g. Kelkar and Ochman 2013), representing adaptations of endosymbionts to different host lifestyles and/or their massive gene losses, but additional experimental analyses are required. Finally, genome-wide relaxed selection for genes in OIE genomes, which has led to an observed increase in ortholog length variation, also contributes to rapid accumulation of inactivating mutations in loci tangential to the intracellular lifestyle. These mutations eventually lead to loss of these genes, potentially with little consequence to the mutualism, and contribute to overall genome reduction.


Chapter 2: Habitat visualization and genomic analysis of "Candidatus Pantoea carbekii", the primary symbiont of the brown marmorated stink bug.

I. Introduction

Insect-bacterial mutualisms are widespread in insects and have had significant impacts on the evolution and diversification of insects, one of the most diverse groups of (e.g. Baumann 2005; Moran et al. 2008; McFall-Ngai et al. 2013; Douglas 2014). Many insects that feed exclusively on plant vascular fluids (e.g. xylem or phloem) or parenchymal cell contents can successfully exploit these abundant and nutritionally-imbalanced diets with the assistance of bacterial symbionts that can provision essential amino acids and vitamins to their host (Moran et al. 2008; Douglas 2013). Complete genome sequencing and annotation of Buchnera aphidicola revealed a highly reduced genome lacking many genes typically present in free-living relatives of B. aphidicola, namely Escherichia coli, yet retaining nutrient-yielding biosynthetic pathways. This pattern of gene loss/retention suggests that in spite of dramatic genome reduction due to gene loss, B. aphidicola maintains a genic repertoire that is comprised largely of genes encoding processes generative of host-supportive nutrients (Shigenobu et al. 2000), and the genomes of other intracellular bacterial mutualists of evolutionarily distant insects, including , cicadas, and carpenter ants, exhibit similar patterns of genome streamlining and host-supportive genic repertoires (reviewed in Moran et al. 2008). With a few notable exceptions, available genomes of bacterial mutualists of insects have been largely derived from transovarially-transmitted, intracellularly-incarcerated species.

Amongst heteropteran insects, alternative means of intergenerational transmission and domiciling of symbionts have been observed (reviewed in Hosokawa et al. 2013). Herbivorous females of the and Plataspidae have been observed to deposit gammaproteobacterial symbiont-laden gut secretions that are either smeared on eggs or encapsulated and positioned proximally to eggs (Fukatsu and Hosokawa 2002; Hosokawa et al. 2005; Tada et al. 2011; Kikuchi et al. 2012). Unlike intracellular mutualists that are present within immature tissues prior to emergence, these symbionts persist in an unknown state of activity outside of host tissues prior to nymphal acquisition by oral consumption of maternal secretions and are presumed to travel to and fill the extracellular lumina of gastric ceca located on the distal midgut region. For example, Megacopta punctatissima (Plataspidae) nymphs acquire the gammaproteobacterial symbiont, "Candidatus Ishikawaella capsulata", by consuming maternally-generated capsules affixed to eggs, while Plautia stali (Pentatomidae) nymphs receive an inoculum of an unnamed gammaproteobacterial symbiont by consuming maternal secretions smeared on eggs, and both insects domicile their symbionts in the ceca of specialized crypts (Abe et al. 1995, Fukatsu and Hosokawa 2002; Hosokawa et al. 2006). Preventing either species from acquiring its symbionts resulted in delayed growth, retarded development and reduced fecundity (Abe et al. 1995; Fukatsu and Hosokawa 2002; Hosokawa et al. 2008).

While the occurrence of gammaproteobacterial symbionts inhabiting specialized midgut ceca of stink bugs has been well-documented, relatively few complete genomes for these symbionts are currently available (e.g. Nikoh et al. 2011; Brown et al. 2014; unpublished Plautia stali symbiont genome GenBank: AP012551.1) to assist in inferring the specific contributions of these symbionts to their hosts (e.g. vitamins, essential amino acids, etc.) or possible genomic 'consequences' (e.g. reduction, skewed genic profile, A+T-bias) of their host associations.

In this regard, the complete sequencing of the primary gammaproteobacterial symbiont, "Candidatus Pantoea carbekii" (hereafter referred to as P. carbekii; Bansal et al. 2014), of the invasive and highly polyphagous pentatomid, Halyomorpha halys, commonly known as the brown marmorated stink bug (Hoebeke and Carter 2003), is reported. While the P. carbekii genome is reduced relative to free-living gammaproteobacteria, it encodes enzymes that can generate essential nutrients potentially limited in the host's diet and enzymes that may assist symbiont survival on the egg surfaces prior to nymphal consumption and infection of the distal midgut. As in the aforementioned stink bugs, P. carbekii is domiciled within the lumina of deeply pigmented distal midgut gastric ceca and is obtained by nymphs when they consume maternal egg secretions following hatching (Taylor et al. 2014). Prevention of symbiont acquisition by nymphs via surface-sterilization of H. halys eggs also results in aberrant nymph behavior and developmental delays (Taylor et al. 2014). To detail the trans-generational symbiont transmission strategy in pentatomids, in situ electron and fluorescence microscopy was used to obtain high-definition spatial and taxon-specific imagery of P. carbekii, yielding a more detailed description of the location and structural characteristics of the egg surface inhabited by P. carbekii.

Collectively, these data suggest that genomic characteristics typically observed in insect mutualists are evident in P. carbekii, the symbiont may play an essential role in H. halys development and support the host through nutrient-provisioning, and interfering with symbiont acquisition by nymphs may present a new option for H. halys management.

II. Methods

A. Genome sequencing and annotation

Complete genome sequencing of P. carbekii was performed using DNA extracted from two adult H. halys that were collected in Wooster, Ohio, USA in 2013 and briefly maintained in a lab colony. DNA was extracted from the V4 region of the midgut using the DNeasy Blood and Tissue kit (QIAGEN). The Illumina MiSeq sequencing platform with the v2 reagent kit was used to generate 7.7 million 250 bp paired-end reads with an expected 250 bp insert size. Reads were quality trimmed (parameter: base calls Phred score were trimmed and reads <150 bp were excluded) and assembled within the CLC Genomics Workbench (version 6, CLC Bio, parameters: mapping mode = map reads back to contigs, automatic bubble size = yes, minimum contig length = 200, automatic word size = yes, scaffolding = yes, auto-detect paired distances = yes, mismatch cost = 2, deletion cost = 3, length fraction = 0.5, similarity fraction = 0.8), generating 111,569 contigs, with 829 contigs >2 kb long, using a word size = 23 and bubble size = 241. The contigs were initially screened to remove insect-related sequences by first isolating contigs with ORFs (detected with Prodigal; Hyatt et al. 2010) and then identifying contigs of putative Enterobacteriaceae origin. 337 contigs were determined to be of Enterobacteriaceae origin based on best BLAST hits of predicted ORFs to a custom database (BLASTP, e-value 0.001, NC_013971.1, NC_017531.1, NC_020064.1). These contigs were then compared to the NCBI-nt database (tBLASTx, e-value 0.0001) to identify their top-hits. The 322 contigs that had significant hits were then categorized based on the origin of the top-hit. 53 contigs had top-hits to an unpublished H. halys symbiont genome (see below), but only five of these contigs were >1,000 bp and four (contig_3, contig_28, contig_43, and contig_139) were >20 kb (range: 65-792 kb, average: 284 kb). 39 had top-hits to Serratia spp., but none of these contigs were over 1,000 bp. Similarly, three and nine contigs were of putative Alphaproteobacteria and Betaproteobacteria origin, respectively, but all were under 1,000 bp. Most of the contigs were of eukaryotic origin (203), with one contig representing the H. halys mitochondrial chromosome (contig_619). 12 contigs were of Gammaproteobacteria origin (but not H. halys symbiont or Serratia spp.) and four of these were >1,000 bp. These four Gammaproteobacterial contigs (contig_85, contig_578, contig_131, and contig_74) were annotated (ORFs detected using Prodigal, BLASTP and Pfam searches) as containing RepA (plasmid replication initiation protein A) and were considered putative plasmids.
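The contig triage described above (categorize each contig by the origin of its top hit, then ask which contigs in each category exceed a length cutoff) can be sketched in a few lines of Python. This is only an illustration, not the procedure used in this study: the record layout, the function name `summarize_contigs`, and the example contig names and lengths are all hypothetical.

```python
from collections import Counter, defaultdict


def summarize_contigs(records, min_len=1000):
    """Tally contigs per top-hit category and list those longer than min_len.

    records: iterable of (contig_name, length_bp, category) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    counts = Counter()
    long_contigs = defaultdict(list)
    for name, length, category in records:
        counts[category] += 1
        # Strictly greater than the cutoff, mirroring the ">1,000 bp" wording.
        if length > min_len:
            long_contigs[category].append(name)
    return counts, dict(long_contigs)


# Made-up records for illustration only; they are not data from this study.
example = [
    ("ctg_a", 120000, "H. halys symbiont"),
    ("ctg_b", 850, "H. halys symbiont"),
    ("ctg_c", 640, "Serratia"),
    ("ctg_d", 17000, "other Gammaproteobacteria"),
]
counts, long_contigs = summarize_contigs(example)
```

Raising `min_len` to 20000 would reproduce the second cutoff mentioned in the text (>20 kb) on the same records.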

Concurrent with our P. carbekii assembly efforts, an unpublished H. halys symbiont genome from H. halys specimens of unknown origin was released in GenBank in October 2013 (GenBank: NC_022547) by researchers at the Nihon University School of Pharmacy in Japan (hereafter called P. carbekii JPN). Pairwise alignment of the four large de novo-generated contigs with the P. carbekii JPN genome was performed using BLASTN (parameters: e-value 0.001) to obtain a putative genome scaffold. The contig arrangement relative to P. carbekii JPN and the sequence annotations at the ends of the four large contigs indicated that ribosomal RNA (rRNA) operons likely lay between each of these contigs and that assembly of a single contiguous chromosome was failing due to the considerable sequence conservation between rRNA regions. Specifically, putative 5S and partial 23S sequences were annotated at both ends of contig_3, a partial 16S sequence was detected at one end of contig_43, a partial 16S sequence was detected at one end of contig_28, and a partial 23S and was detected at one end of contig_139 while a partial 16S sequence was detected at the opposite end of this contig. Outward-oriented PCR primers were designed to amplify sequence spanning the gaps between contigs using Primer3Plus (Untergasser et al. 2012) with the ends of the four large contigs as templates.

All primers used in this study are reported in Appendix B.1. Long-range PCR reactions capable of generating amplicons >5 kb were performed with these primers using Invitrogen Platinum High Fidelity Taq Polymerase under the recommended thermocycler conditions, with both combinatorial and guided (by the results of contig alignments to P. carbekii JPN) primer pairing. Successful PCR amplifications yielded amplicons that were approximately the expected size of rRNA operons based on the lengths of related Pantoea spp. rRNA operons (region that spans complete 23S, 5S, and 16S sequences; Pantoea ananatis LMG20103, NC_013956.2 average: 5191 bp, range: 5143-5377 bp; Pantoea vagans C9-1, NC_014562.1 average: 5177 bp, range: 5028-5327 bp) and suggested that the alignment showed the appropriate orientation of the four large contigs. PCR products were purified using the Beckman Coulter Agencourt AMPure XP purification system and Sanger sequencing of all amplicons was performed at The Ohio State University Plant-Microbe Genomics Facility (PMGF). Sequences were manually edited and quality-checked in Geneious (version 7.0.4, Biomatters Limited). A Sanger-based, primer-walking strategy was used to sequence gap-spanning amplicons and yield a single, contiguous chromosome.

MiSeq-generated reads were quality-filtered using FASTX-Toolkit v 0.0.13 (Gordon and Hannon 2010; ‘fastq_quality_trimmer’ was used to trim with quality, ‘artifacts_remover’ was implemented to remove reads with only 3 of the 4 possible bases, ‘fastq_quality_filter’ was used to remove reads with less than 80% of bases with >Q28, and ‘fastq_to_fasta’ was used to remove reads with any ‘N’ ambiguous base calls). These high-quality reads were mapped to the P. carbekii USA chromosome using BWA (default parameters; Li and Durbin 2009) with an average per base read coverage of 745X as determined by the ‘genomeCoverageBed’ program in BEDTools (Quinlan and Hall 2009). The mapped reads were visually inspected using the Integrative Genomics Viewer (Thorvaldsdóttir et al. 2012) to detect any inaccurate base calls in the P. carbekii USA genome. Base calls supported by <99% of mapped reads were manually edited in Artemis Genome Viewer (Rutherford et al. 2000), unless they were supported by the Sanger sequencing. Only ten polymorphic sites were detected, where two different bases were supported nearly equally by the reads. In these ten cases, the base supported by the majority (>50%) of the reads was called.
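The per-site rule described above (flag base calls supported by fewer than 99% of mapped reads, and call the majority base at polymorphic sites) can be illustrated with a small Python sketch. It assumes the reads covering one site have already been reduced to a string of base calls; the function name `call_site` and this input format are hypothetical, and the manual editing and Sanger cross-checks described in the text are not modeled.

```python
from collections import Counter


def call_site(bases, flag_below=0.99):
    """Return (majority_base, support_fraction, needs_review) for one site.

    bases: string (or list) of base calls from the reads covering the site;
    this input format is an assumption of the sketch.
    needs_review is True when the top base has support below flag_below.
    """
    counts = Counter(b.upper() for b in bases)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this site")
    # most_common(1) returns the best-supported base; exact ties resolve
    # to the base encountered first.
    base, n = counts.most_common(1)[0]
    support = n / total
    return base, support, support < flag_below


# A well-supported site and a polymorphic site (made-up pileups).
clean = call_site("A" * 995 + "G" * 5)
mixed = call_site("A" * 55 + "G" * 45)
```

In this sketch the polymorphic site is still assigned its majority base, but `needs_review` marks it for the kind of manual inspection the text describes.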

Open reading frames (ORFs) were determined with Prodigal, tRNAs were called using tRNAscan-SE and Aragorn, and tmRNAs and rRNAs were detected using Rfam (Hyatt et al. 2010; Schattner et al. 2005; Laslett and Canback 2004; Burge et al. 2012). Annotation was completed using Pfam functional domain determination, COG assignments, KEGG Orthology Groups, TIGRFAM, and BLAST comparisons of ORFs to the ‘ECO’, ‘nr’, and ‘nt’ databases (Punta et al. 2012; Tatusov et al. 1997; Moriya et al. 2007; Haft et al. 2003; Altschul et al. 1997; Zhou and Rudd 2013; BLASTP, e-value 0.001). Artemis Genome Viewer and DNAPlotter were used to assist with annotation and detecting G/C skew (Rutherford et al. 2000; Carver et al. 2009). Metabolic reconstruction was done using KEGG-KAAS with information from EcoCyc and MetaCyc (Moriya et al. 2007; Keseler et al. 2013; Caspi et al. 2008). P. carbekii proteins that were fragmented due to internal stop codon(s) and/or frameshift mutation(s) were annotated as pseudogenes.

Finally, the four large contigs (contig_578, contig_74, contig_85, and contig_131) that did not assemble de novo with the P. carbekii chromosome and contained repA genes were confirmed to be plasmids. Outward-oriented PCR primers were designed using sequences at the ends of each contig and generated single PCR products that linked the ends of the respective contigs. The average read coverage for each of the contigs was as follows: contig_578 (‘pBMSBPS1’): 2,960X, contig_74 (‘pBMSBPS2’): 7,487X, contig_85 (‘pBMSBPS3’): 1,412X, and contig_131 (‘pBMSBPS4’): 610X. The lengths of the plasmids are: pBMSBPS1: 14,562 bp, pBMSBPS2: 6,287 bp, pBMSBPS3: 17,880 bp, and pBMSBPS4: 7,593 bp. The plasmids were annotated using the top-hits from a BLASTP ‘nr’ search, Pfam domain search, and KEGG KO classification after ORF detection with Prodigal (default parameters). General features of the plasmids are listed in Table 2.1.

B. Comparative analyses

Biosynthesis-related gene content and general genomic characteristics of P. carbekii were compared to those of the following Gammaproteobacteria inhabiting various ecological niches (NCBI GenBank accession numbers are parenthesized): Escherichia coli str. K12 (NC_000913), Pantoea ananatis (AP012032.1), Plautia stali symbiont (AP012551.1), "Candidatus Ishikawaella capsulata" symbiont of Megacopta punctatissima (AP010872.1), and Buchnera aphidicola str. APS (NC_002528) (hereafter referred to by their putative generic names). General genome statistics were compared across these genomes as in Moran et al. 2008. Maximal protein length was determined using the ‘infoseq’ program in EMBOSS (Rice et al. 2000). Pseudogenes in the genomes of the abovementioned taxa were reported based on GenBank-reported genome annotations.
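For readers without EMBOSS at hand, the maximal protein length statistic can be approximated with a short Python sketch that scans a protein FASTA file. This is a stand-in for illustration, not the ‘infoseq’ procedure used here; the function name `max_protein_length` and the toy sequences are hypothetical.

```python
def max_protein_length(fasta_text):
    """Return (record_id, length) for the longest sequence in FASTA text.

    Lengths count residues only; a trailing '*' stop symbol is ignored.
    """
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            parts = line[1:].split()
            current = parts[0] if parts else "record_%d" % (len(lengths) + 1)
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line.rstrip("*"))
    if not lengths:
        raise ValueError("no FASTA records found")
    best = max(lengths, key=lengths.get)
    return best, lengths[best]


# Toy proteome for illustration only.
toy = ">p1 example protein\nMKT\nLL\n>p2\nMK*\n"
longest = max_protein_length(toy)
```

Applied to a whole-proteome FASTA file (read with `open(path).read()`), the function returns the identifier and length of the longest annotated protein.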

C. Molecular phylogenetic reconstruction

Orthologous protein groups from the proteomes of related Gammaproteobacteria with complete and draft genomes were determined using OrthoMCL (Fischer et al. 2011). Orthologs and the sourced proteomes used for phylogenetic reconstruction are reported in Appendix B.1. 80 orthologs from 47 taxa were individually aligned in MAFFT using the L-INS-i algorithm (Katoh 2002) and gap-containing columns were removed using a custom Perl script. Aligned proteins were concatenated using a custom Perl script and maximum likelihood trees were inferred using a web-based implementation of RAxML (parameters: rapid bootstrapping under the LG protein model and DAYHOFF substitution matrix; 100 bootstrap trees; CIPRES Science Gateway; Stamatakis et al. 2008; Miller et al. 2010) and the best-supported Maximum Likelihood tree was reported. FigTree v1.3.1 and MEGA5.1 were used to prepare trees for publication (Rambaut and Drummond 2010; Tamura et al. 2011).
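The two alignment-processing steps mentioned above, removing gap-containing columns and concatenating the aligned orthologs, were done with custom Perl scripts that are not reproduced here. A minimal Python sketch of the same idea follows; the function names and the dictionary-based alignment format are assumptions made for illustration.

```python
def strip_gap_columns(alignment):
    """Drop every column that contains a gap ('-') in any taxon.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    """
    taxa = list(alignment)
    lengths = {len(alignment[t]) for t in taxa}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_cols = lengths.pop()
    keep = [i for i in range(n_cols)
            if all(alignment[t][i] != "-" for t in taxa)]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


def concatenate(alignments):
    """Join per-ortholog alignments end to end, taxon by taxon."""
    if not alignments:
        raise ValueError("no alignments given")
    taxa = list(alignments[0])
    for aln in alignments:
        if set(aln) != set(taxa):
            raise ValueError("every alignment must contain the same taxa")
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}


# Two toy ortholog alignments for two taxa (made-up sequences).
ortholog_1 = strip_gap_columns({"taxon_x": "MK-L", "taxon_y": "MKAL"})
ortholog_2 = strip_gap_columns({"taxon_x": "AC", "taxon_y": "-C"})
supermatrix = concatenate([ortholog_1, ortholog_2])
```

Stripping columns before concatenation keeps every taxon the same length in the final supermatrix, which is what tree-inference programs expect.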

D. SNP and genomic synteny analysis

The P. carbekii chromosome generated from an H. halys population collected in Ohio, United States, was compared to the P. carbekii JPN genome to identify single nucleotide polymorphisms (SNPs) using BWA (Li and Durbin 2009) and the Genome Analysis Tool Kit (‘GATK’; McKenna et al. 2010). Specifically, BWA-aligned genomes were subjected to an additional local alignment within GATK to improve the accuracy of SNP calling by locally realigning areas where there might be insertions or deletions (McKenna et al. 2010). GATK and Samtools (Li et al. 2009) were both used to call SNPs and only SNPs detected by both were considered. To eliminate possible within-strain polymorphisms in the US P. carbekii genome, reads were mapped back to all SNPs in the US strain and only those sites for which >99% of the reads had identical base calls at the SNP site were considered in the analysis. Per base read depth for the SNPs averaged 857.5 reads (range: 51X-1,784X), as determined by the BEDTools ‘genomeCoverageBed’ program with the “-d” flag implemented for per base read depth calls (Quinlan and Hall 2009). SNPs were determined to be transitions or transversions within Microsoft Excel. For SNPs detected in coding regions, the ‘diffseq’ program within EMBOSS was used to detect SNPs resulting in nonsynonymous or synonymous changes (Rice et al. 2000). SNPs resulting in nonsynonymous mutations in coding regions were categorized using the COG database (Tatusov et al. 1997). Structural rearrangements between the two genomes were detected using the ‘nucmer’ program within the MUMmer package (Delcher et al. 2002; default parameters except the -maxmatch option) and visualized using the Genome Synteny Viewer (Revanna et al. 2011).
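Two of the steps above lend themselves to a compact illustration: keeping only SNPs reported by both callers, and labeling each SNP as a transition or transversion. The Python sketch below assumes each caller's output has already been reduced to (position, reference base, alternate base) tuples; that representation and the function names are hypothetical, and the study itself performed the transition/transversion step in Microsoft Excel.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Label a single-base change as a 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def consensus_snps(calls_a, calls_b):
    """Keep only SNPs reported identically by both callers, sorted by position."""
    return sorted(set(calls_a) & set(calls_b))


# Made-up call sets standing in for the two callers' outputs.
caller_1 = {(10, "A", "G"), (20, "C", "A")}
caller_2 = {(10, "A", "G"), (30, "T", "C")}
shared = consensus_snps(caller_1, caller_2)
labels = [classify_snp(ref, alt) for _, ref, alt in shared]
```

Requiring agreement on position and on both alleles is a stricter intersection than matching on position alone; which definition was applied in the study is not stated.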

E. H. halys gut microbiome analysis

Three adults from an in-house H. halys colony were euthanized in 70% ethanol and immediately dissected in 1X phosphate buffered saline (PBS). DNA was extracted separately from the V1-V2, V3 and V4 midgut tissues of the digestive tract (for details, see Bansal et al. 2014) using the QIAGEN DNeasy Blood and Tissue Kit and submitted to the Institute for Genomics and Systems Biology Next Generation Sequencing Core (Argonne National Laboratory, Argonne, IL) for 16S rRNA amplicon library preparation and Illumina MiSeq 2x251 bp paired-end sequencing. Illumina-generated 16S rRNA sequence reads (hereafter referred to as 'iTags' after Degnan and Ochman 2012) were assembled using Pandaseq (Masella et al. 2012), and trimmed to remove base calls with 1 errors in barcodes or primers and/or were <230 bp or >260 bp in length were removed) using the CLC Genomics Workbench. Quality filtered reads were preprocessed using the mothur software package (version 1.29; Schloss et al. 2009) and operational taxonomic units (OTUs) were clustered at >95% identity using usearch (version 7; Edgar 2010). Sequences representing OTUs were taxonomically assigned by BLAST (program: blastn, parameters: default) searches of SILVA 16S small-subunit 'SSU' (version 115; Pruesse et al. 2007) and NCBI GenBank 'nt' (downloaded on March 18, 2014) databases. Best hits within these databases that aligned with >99% of the OTU sequences informed OTU genus designations.

F. Fluorescence in situ hybridization (FISH) and electron microscopy

1. FISH. Egg lavages or homogenates were prepared from the following tissue types: 1) the extrachorion matrix surrounding freshly laid eggs, 2) crushed freshly laid eggs with the extrachorion matrix removed, 3) freshly laid eggs soaked in 10% bleach for 10 minutes and rinsed in dH2O, 4) eggs extracted from a gravid female, and 5) V4 midgut crypts. Lavages were prepared by placing a single egg in 200 μL 1X PBS in a 1.5 mL centrifuge tube, vortexing for 10-15 seconds and the extrachorion matrix was further manually disrupted by pipetting ~10 times, while being careful not to puncture the egg.

Following the removal of the lavage, eggs were further rinsed with 1X PBS to remove any residual extrachorion matrix and homogenized in 1X PBS using a combination of vigorous vortexing, pipetting, and manual crushing with a sterile pestle until little or no intact egg fragments were visible. Eggs from gravid females and bleached eggs underwent a similar process (vigorous vortexing, pipetting, and manual crushing with a sterile pestle in 1X PBS) upon collection or after treatment (i.e. bleach treatment). V4 midgut sections were dissected in 1X PBS from colony-maintained adults euthanized in 70% ethanol and immediately transferred to a 1.5 mL microcentrifuge tube containing 200 μL 1X PBS. V4 tissues were homogenized until little or no intact pieces remained by vigorous vortexing, pipetting, and crushing with a sterile pestle. Preparation for FISH followed Osborn and Smith (2005). Specifically, ~10-15 μL of each sample (egg lavage, crushed egg or V4 midgut homogenates) was placed on microscope slides, air-dried in a vacuum hood, and fixed over an open flame for 1-3 seconds. 40 μL 4% paraformaldehyde was immediately added to each sample, covered with a cover-slip, and incubated for 24 hours at 4°C. The samples were then dehydrated in a series of ethanol washes for 3 minutes at each of the following concentrations of ethanol: 50%, 80%, and 95%. 1 μL (50 ng/μL) of each probe was warmed to 48°C and mixed with 10 μL pre-warmed hybridization buffer (180 μL 5M NaCl; 20 μL Tris-HCl; 200 μL formamide; 599 μL ddH2O; 1 μL 10% sodium dodecyl sulfate, SDS) for each sample. 10 μL of the hybridization buffer/probe solution (final probe concentration: 5 ng/μL) was added to each sample, covered with a cover-slip, and incubated for 24 hours at 46°C. Cy3-labeled Enterobacteriaceae-specific ('ENT1251': 5'-Cy3-TGCTCTCGCGAGGTCGCTTCTCTT-3', 50 ng/µL; Ootsubo et al. 2012) and TYE-563-labeled P. carbekii-specific ('Crbck-TEX': 5'-TYE-ATGCTGCCGTTCGACTT-3', 50 ng/µL; this study) FISH probes were used separately. Following incubation, samples were washed with washing buffer (1 mL Tris-HCl; 2.15 mL 5M NaCl; 0.5 mL 0.5M ethylenediaminetetraacetic acid (EDTA); 46.35 mL ddH2O) for 3 minutes with gentle agitation, then dipped in ice-cold ddH2O and air-dried in a vacuum hood. A drop of mounting buffer (10% in 1X PBS) was placed on each slide and covered with a cover-slip to view with a Nikon Eclipse Ti inverted epifluorescence microscope.
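As a check on the recipes above, the final concentrations implied by the listed volumes can be worked out with the usual dilution relation (stock concentration × stock volume / total volume). The Python sketch below is illustrative arithmetic only. It assumes the formamide stock is undiluted (100%), which the text does not state, and it omits Tris-HCl because no stock concentration is given; the helper name `final_concentration` is hypothetical.

```python
def final_concentration(stock_conc, stock_vol, total_vol):
    """Dilution relation: C_final = C_stock * V_stock / V_total."""
    return stock_conc * stock_vol / total_vol


# Hybridization buffer volumes (µL) as listed in the text.
hyb_volumes = {"NaCl_5M": 180, "Tris_HCl": 20, "formamide": 200,
               "ddH2O": 599, "SDS_10pct": 1}
hyb_total = sum(hyb_volumes.values())                       # total volume in µL
hyb_nacl_M = final_concentration(5.0, 180, hyb_total)       # molar NaCl
hyb_formamide_pct = final_concentration(100.0, 200, hyb_total)  # assumes neat formamide
hyb_sds_pct = final_concentration(10.0, 1, hyb_total)       # percent SDS

# Washing buffer volumes (mL) as listed in the text.
wash_total = 1 + 2.15 + 0.5 + 46.35
wash_nacl_M = final_concentration(5.0, 2.15, wash_total)
wash_edta_M = final_concentration(0.5, 0.5, wash_total)

# Probe: 1 µL at 50 ng/µL mixed with 10 µL of hybridization buffer.
probe_ng_per_uL = final_concentration(50.0, 1, 1 + 10)
```

Under these assumptions the hybridization buffer works out to roughly 0.9 M NaCl, 20% formamide and 0.01% SDS, the washing buffer to roughly 0.215 M NaCl and 5 mM EDTA, and the probe to about 4.5 ng/μL, close to the nominal 5 ng/μL.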

2. TEM. Midgut gastric caeca (V4) tissues were dissected from colony-maintained adult insects in ice-cold fixative (3% glutaraldehyde, 1% paraformaldehyde in 0.1 M potassium phosphate buffer, pH 7.2) in preparation for transmission electron microscopy (TEM), as described in Bansal et al. 2014. Briefly, samples were fixed for 3 hours on a rocker at 23°C, washed with 0.1 M potassium phosphate buffer (0.1 M KH2PO4, 0.1 M K2HPO4, pH 7.2), and treated with a post-fixative solution (1% osmium tetroxide, 1% uranyl acetate in distilled water) at 23°C for 1 hour. Tissues were dehydrated through a graded ethanol series, resin infiltrated using a propylene oxide-resin series and then embedded in EM Bed-812 resin (Electron Microscopy Sciences, Hatfield, PA). Ultrathin sections were prepared using a Leica EM UC6 ultra-microtome. After staining with 3% aqueous uranyl acetate for 20 minutes, followed by Reynolds’ lead citrate for 10 minutes (Reynolds 1963), sections were imaged using a Hitachi H-7500 transmission electron microscope and images were recorded with an Optronics QuantiFire S99835 (SIA) digital camera.

3. SEM. Freshly laid eggs (within one day of oviposition) were subjected to one of the following treatments prior to scanning electron microscopy (SEM) imaging: 1) 10-15 seconds vortexing in 1X PBS, 2) soaking in 9% bleach for 2 minutes, followed by 3 washes with 1X PBS, 3) soaking in 10% SDS for 10 minutes, followed by 3 washes with 1X PBS, and 4) no treatment. Treated and control eggs were fixed in 3% glutaraldehyde, 2% paraformaldehyde, in 0.1M potassium phosphate buffer pH 7.2 (PB) overnight, and subsequently rinsed three times for 10 minutes with PB and post-fixed in 1% OsO4 in PB for 1 hour. After two PB washes, samples were dehydrated through washes in 50%, 75%, 95%, and 100% ethanol for 15 minutes each. Following dehydration, samples were dried in a critical-point dryer, sputter-coated with platinum and viewed on the Hitachi S-3500N scanning electron microscope.

G. Protein analysis of egg lavage

To determine what proteins, if any, could be detected on the egg surface, the Campus Chemical Instrument Center (CCIC) and Mass Spectrometry and Proteomics Facility at the Ohio State University performed a shotgun peptide analysis using an egg lavage preparation. Briefly, lavages were prepared by placing an entire egg cluster (15-20 eggs) in 200 μL 1X PBS in a 1.5 mL centrifuge tube, vortexing for 10-15 seconds, and the extrachorion matrix was further manually disrupted by pipetting ~10 times. The eggs were then removed from the lavage preparation. The lavage was determined by the CCIC to contain 4.71 μg/μL protein. This lavage was then digested with trypsin and analyzed by LC/MS/MS on a LTQ Orbitrap mass spectrometer at the CCIC. The sequence information from the MS/MS data was analyzed by searching the ‘Eubacteria’ portion of the NCBI ‘nr’ database (version 20140427) and custom databases, including the genome reported here, using Mascot Daemon (Matrix Science, v. 2.2.1) on a 16 node IBM blade system at CCIC, accepting proteins with a Mascot score of 50 or higher, with a minimum of two unique peptides from one protein having a -b or -y ion sequence tag of five residues or better. These protein hits and identifications were manually screened to detect erroneous hits or misidentifications (i.e. hits to proteins from bacteria not found in the egg lavage). The relative abundance of the proteins detected was determined by comparing the Mascot-reported exponentially modified protein abundance indices (‘emPAI’).
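The acceptance thresholds and emPAI-based ranking described above can be expressed as a simple filter-and-sort. The Python sketch below is a hypothetical illustration: the dictionary keys (`protein`, `score`, `unique_peptides`, `empai`), the function names, and the example values are assumptions rather than the Mascot export format, and the -b/-y ion sequence-tag criterion and the manual screening step are not modeled.

```python
def accept_hit(hit, min_score=50, min_unique_peptides=2):
    """True if a protein hit meets the score and unique-peptide thresholds."""
    return (hit["score"] >= min_score
            and hit["unique_peptides"] >= min_unique_peptides)


def rank_accepted(hits):
    """Return accepted hits ordered from highest to lowest emPAI."""
    accepted = [h for h in hits if accept_hit(h)]
    return sorted(accepted, key=lambda h: h["empai"], reverse=True)


# Made-up hits for illustration only.
hits = [
    {"protein": "p1", "score": 60, "unique_peptides": 3, "empai": 0.5},
    {"protein": "p2", "score": 45, "unique_peptides": 4, "empai": 2.0},
    {"protein": "p3", "score": 80, "unique_peptides": 1, "empai": 1.5},
    {"protein": "p4", "score": 50, "unique_peptides": 2, "empai": 1.2},
]
ranked_names = [h["protein"] for h in rank_accepted(hits)]
```

In the toy data, p2 fails the score threshold and p3 fails the unique-peptide threshold, so only p4 and p1 remain, in that order.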

III. Results and Discussion

A. P. carbekii dominates the H. halys crypt-bearing midgut and is abundant on egg surfaces

iTags were generated from adult Ohio H. halys gut tissues to assess the diversity of bacteria comprising the microbiome of the V4 gut region, and a location-specific, relatively high concentration of P. carbekii was detected in H. halys V4 midgut tissues. >98% of ~23,000 high-quality iTags for all of the V4-midgut region tissue samples were unambiguously assigned to P. carbekii. In contrast, <1% of the iTags generated from all of the V1-V2 tissue samples and in most of the V3 tissue samples could be assigned to this symbiont. Additionally, P. carbekii has been detected in DNA preparations from H. halys egg clusters (Bansal et al. 2014; Taylor et al. 2014) and histological methods were used to localize egg-associated P. carbekii cells. H. halys eggs are typically laid in clusters, affixed to one another with a maternally-secreted extrachorion matrix (Figure 2.1E). Inspection of eggs in which the extrachorion matrix had been disrupted by brief vortexing in sterile, distilled 1X PBS (Figure 2.1A) revealed dense patches of P. carbekii on the egg surface (Figure 2.1B). Rod-shaped bacterial cells were observed beneath the matrix (Figure 2.1C) that were morphologically similar to those imaged by TEM of thin-sections of the V4 midgut gastric caeca (Figure 2.1D). Eggs treated with 10% bleach (Figure 2.1F) or 10% SDS (data not shown) lacked bacterial cells and the extrachorion matrix observed in untreated or vortexed eggs. To confirm that the bacterial cells present on egg surfaces were P. carbekii, FISH microscopy was performed on lavages prepared from the egg extrachorion matrix (Figure 2.1G) and of V4 midgut sections using a P. carbekii-specific FISH probe. Rod-shaped cells that exhibited a strong fluorescence signal and were morphologically similar (Figure 2.1G) to those in the SEM (Figure 2.1C) and TEM (Figure 2.1D) were observed. Lavages prepared from eggs that were bleached or pre-washed multiple times in 1X PBS, or extracted from gravid females, yielded no observable bacterial cells by FISH or light microscopic methods. Although pentatomomorphans exhibit a few types of symbiont acquisition and transmission mechanisms, varying from recruitment of environmental Burkholderia strains by each generation of alydids (Kikuchi et al. 2007) to vertical transmission of Ishikawaella via consumption of symbiont-filled capsules by plataspids (Nikoh et al. 2011), maternal egg smearing of gut fluids and subsequent nymphal feeding on these eggs has been documented in other pentatomids examined, such as Nezara viridula (Prado et al. 2006), Eurydema spp. (Kikuchi et al. 2012b) and Sibaria englemani (Bistolas et al. 2014), and appears to be the mode of symbiont transfer in H. halys (Taylor et al. 2014).

Figure 2.1: Pantoea carbekii is present within and beneath egg extrachorion matrix. A) SEM of the H. halys egg surface with the extrachorion matrix peeling back to reveal an abundance of P. carbekii cells. Scale bar: 0.3 mm. B-C) P. carbekii cells intercalated in extrachorion matrix. B-Scale bar: 5 µm; C-Scale bar: 20 µm. D) TEM imaging of P. carbekii cells within the V4 midgut. Scale bar: 4 µm. E) FISH microscopy of the P. carbekii present in extrachorion matrix lavages with P. carbekii-specific FISH probes. Scale bar: 10 µm.

B. P. carbekii exhibits genome shrinkage and other consequences of host-restriction

The P. carbekii genome is significantly reduced in size compared to congenerics (i.e. Pantoea spp.; Figure 2.2) and other Gammaproteobacteria (Table 2.1), consisting of only 1,150,626 nucleotides, which is roughly one-fourth the size of the genomes of these related species. Although this represents significant genome reduction, many anciently-associated insect symbionts (i.e. tens to hundreds of millions of years of association), such as Buchnera, have further reduced genomes, with many being <1 Mb (Moran et al. 2008), suggesting that the H. halys-P. carbekii association may be more recent. The P. carbekii chromosome encodes 797 protein-coding genes, 2 transfer-messenger RNAs (tmRNAs), and 40 tRNAs. Four plasmids were detected, all encoding RepA, and one has gene content similar to that of the plasmid detected in Ishikawaella. Additionally, the genome was annotated to reflect the detection of two 5S-23S-16S rRNA operons and a 16S-23S operon lacking an identifiable 5S. 12 pseudogenes were predicted and annotated (Table 2.1) and roughly 40 additional proteins appear truncated, but not pseudogenized, relative to orthologs in other Pantoea species. With the 40 tRNA species, P. carbekii is able to decode all 20 amino acids (Table 2.1), including the translation initiating N-formyl-methionyl-tRNA (BMSBPS_622), and an isoleucine-charged tRNA (BMSBPS_773) that, following posttranscriptional lysylation of a position 34 cytidine by TilS, changes the anticodon from CAU to AUA to prevent misreading of AUG as isoleucine (Soma et al. 2003). Multiple copies of genes encoding tRNAs that recognize the mRNA codons “AUG” and “GAA” (which correspond to methionine and glutamate, respectively) and “AUC” (isoleucine) are present. As with other primary symbionts of insects, the P. carbekii genome exhibits reduced G+C% (30.57%) compared to free-living relatives such as P. ananatis (53.76%), which is a hallmark feature associated with genome reduction that accompanies elevated fixation of mutations under relaxed selection in endosymbionts with stable host-restricted lifestyles (Moran et al. 2008). Although the genome is reduced, P. carbekii still encodes metabolic pathways for the production of peptidoglycan, generation of ATP by aerobic respiration, and other primary metabolic processes.
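The G+C percentage quoted above is a simple ratio that can be recomputed from any genome sequence. The Python sketch below shows one way to do so; the function name `gc_percent` is hypothetical, and the choice to exclude ambiguous bases (e.g. 'N') from the denominator is an assumption, since the text does not say how such positions were treated.

```python
def gc_percent(sequence):
    """Percent G+C over unambiguous bases (A, C, G, T) in a nucleotide string."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / unambiguous


# Toy sequences for illustration only.
balanced = gc_percent("ATGC")
at_rich = gc_percent("AATTNNGC")
```

For a multi-replicon genome the function can be applied to the chromosome and to each plasmid separately, which is how Table 2.1 reports G+C content.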

Another trait associated with a host-restricted lifestyle is a reduction in the maximal protein length in comparison to free-living relatives, such as P. ananatis and E. coli (Table 2.1). Some proteins that are involved in toxin production, secondary metabolic processes, virulence, or extracellular sensing, or that are of unknown function, were shown to be, on average, larger than those involved in primary metabolic processes like DNA replication, transcription, translation and essential amino acid biosynthesis. The former, being nonessential for long-standing host-restricted lifestyles, are largely missing from insect endosymbiont genomes and their absence likely contributes to the reduced genome sizes (Kenyon and Sabree 2014). P. carbekii exhibits a somewhat similar genic profile in that the longest protein encoded by its genome is the 1,413-amino-acid beta' subunit of RNA polymerase (RpoC), which is important for transcription, and it is one-third the length of the longest protein encoded in the genome of a free-living relative, P. ananatis AJ13355 (putative secondary metabolite biosynthesis protein YP_005934773: 4,385 a.a.), which is of unknown function.

Table 2.1: Genome characteristics compared between Pantoea carbekii and related free-living and symbiotic organisms with varying genome sizes. a: genes encoding the 5S, 16S and 23S ribosomal RNAs were counted. b: genome size is the sum of the chromosome and plasmid(s). c: in addition to seven 5S-23S-16S rRNA operons an additional 5S rRNA coding region has been annotated. d: two 23S rRNA-5S rRNA operons are present in addition to four 5S-23S-16S rRNA operons. e: two 5S-23S-16S rRNA operons are annotated and an additional 16S-23S rRNA operon has been annotated. f: three complete ribosomal RNA operons are annotated. g: a 23S-5S rRNA operon and a separate 16S rRNA gene have been annotated in the genome.

| Characteristic | Escherichia coli K-12 | Pantoea ananatis | Symbiont of Plautia stali | Pantoea carbekii | Ishikawaella capsulata | Buchnera aphidicola APS |
| Genome size (bp) | 4,641,652 | 4,877,280 [b] | 4,092,852 [b] | 1,150,626 [b] | 754,729 [b] | 655,725 [b] |
| Plasmids | - | 1 | 2 | 4 | 1 | 2 |
| Chromosomal CDS | 4,140 | 4,038 | 5,122 | 797 | 623 | 564 |
| Plasmid CDS | - | 278 | 26; 55 | 9; 5; 11; 6 | 8 | 3; 7 |
| Ribosomal RNA coding genes [a] | 22 [c] | 22 [c] | 16 [d] | 8 [e] | 9 [f] | 3 [g] |
| tRNAs | 89 | 78 | 59 | 40 | 37 | 32 |
| Pseudogenes | 184 | N.A. | N.A. | 12 | 35 | 13 |
| Maximum protein size (a.a.) | 2,358 | 4,385 | 1,843 | 1,413 | 1,415 | 1,407 |
| G+C content (%) | 51 | 54 (52) | 57 (48; 49) | 31 (30; 27; 26; 24) | 30 (28) | 26 (27; 31) |
| Habitat | Free-living/Enteric | Plant Pathogen | Insect Symbiont | Insect Symbiont | Insect Symbiont | Intracellular Insect Symbiont |
| Host insect | - | - | Plautia stali (‘Brown-winged Green Bug’) | Halyomorpha halys (‘Brown Marmorated Stink Bug’) | Megacopta punctatissima (Japanese Common Plataspid Stink Bug) | Acyrthosiphon pisum (Pea Aphid) |
| NCBI Accession | NC_000913.3 | AP012032.1, AP012033.1 | AP012551.1, AP012552.1, AP012553.1 | | AP010872.1, AP010873.1 | NC_002528.1, NC_002253.1, NC_002252.1 |

Figure 2.2: Pantoea carbekii clusters with other gammaproteobacterial symbionts. Maximum Likelihood-based phylogenetic reconstruction of Pantoea carbekii and other gammaproteobacteria using concatenated and aligned orthologous proteins was performed in RAxML. Support values were generated from 100 bootstrap trees and are indicated at branch points. Parenthesized values next to species names are genome sizes in megabases; asterisks indicate estimated sizes for draft genomes. Taxa used, accession numbers, genome sizes, and orthologs used are reported in Appendix B.1.

C. P. carbekii metabolism and putative role in H. halys physiology

Phytophagous diets are limited in essential amino acids and some vitamins and, based on its genome content, P. carbekii appears capable of supplementing the diet of H. halys with these nutrients. While the genome shows evidence of reduction, P. carbekii encodes canonical pathways, like those typically observed in free-living gammaproteobacteria, namely E. coli and P. ananatis, for the biosynthesis of all essential and non-essential amino acids (Figure 2.3; Appendix B.1) except for proline, isoleucine, leucine, and valine. The proline biosynthesis genes proA and proB, which encode glutamate-5-semialdehyde dehydrogenase and glutamate 5-kinase, respectively, are both absent from the P. carbekii genome, and the ornithine-to-proline cyclodeaminase gene, ocd, is also missing. Since no alternative pathways or functionally equivalent genes appear to be present, P. carbekii may not be able to synthesize proline de novo. The isoleucine, leucine, and valine biosynthesis pathways are complete except for the branched-chain amino acid aminotransferase, IlvE, which is required for the final step in branched-chain amino acid biosynthesis. Absence of this gene has been observed in both Ishikawaella and Buchnera, and host participation in the production of these essential amino acids has been suggested (Nikoh et al. 2011; Shigenobu et al. 2000).

Although the genome of P. carbekii lacks argD (acetylornithine aminotransferase), which catalyzes amination steps in the lysine and arginine biosynthesis pathways (Ledwidge and Blanchard 1999), and argI (ornithine carbamoyltransferase), which catalyzes the sixth step of arginine biosynthesis (Glandsdorff et al. 1967), appears pseudogenized (frameshift around an adenosine 9-mer), the functional role of ArgD in arginine and lysine biosynthesis could be replaced by AstC (Kim and Copley 2007), which is encoded on pBMSBPS1 and similarly on the Ishikawaella plasmid (Nikoh et al. 2011). Although the gene encoding asparagine synthetase A (asnA) is missing from the asparagine biosynthesis pathway, asnB, which encodes asparagine synthetase B and has a homologous function in E. coli, is present. However, asnB appears pseudogenized by a frameshift around an adenosine 8-mer, although previous work has shown that a subset of transcripts from frameshift-based pseudogenes can encode intact enzymes due to transcriptional slippage at homopolymeric sites (Tamas et al. 2008). ArgI production may also be rescued in a similar manner if transcriptional slippage around the adenosine 9-mer corrects the internal frameshift. If these genes pseudogenized by frameshifts are nonfunctional, then host-encoded enzymes may complement the functions of the missing enzymes.

P. carbekii encodes complete or near-complete canonical pathways for the production of several vitamins and co-factors, including folate (vitamin B9), riboflavin (vitamin B2; although ribC is absent), pyridoxal-5’-phosphate (vitamin B6), glutathione, iron-sulfur clusters, and lipoate (Figure 2.3). Unlike Ishikawaella, P. carbekii is likely unable to synthesize biotin due to the complete absence of this pathway (Appendix B.1). Like Ishikawaella, it is missing genes encoding enzymes involved in molybdopterin biosynthesis (moaA, moaB, and moaC), but the corresponding enzymes or metabolites may be supplied by other P. carbekii biosynthetic pathways, the host, or the host's diet.
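The homopolymer-associated frameshifts noted above for asnB and argI suggest a simple screen: scan a coding sequence for long single-base runs where transcriptional slippage could plausibly occur. The following Python sketch is illustrative only. The function name find_homopolymer_runs, the 8-nt threshold, and the toy sequence are assumptions made for this example and are not part of the annotation pipeline used in this study.

```python
import re


def find_homopolymer_runs(seq, base="A", min_len=8):
    """Return (start, length) for every run of `base` that is at least
    `min_len` nucleotides long. Coordinates are 0-based.

    Hypothetical helper for illustration; the threshold of 8 mirrors the
    adenosine 8-mer and 9-mer described in the text.
    """
    # Builds a pattern such as "A{8,}" (eight or more consecutive A's).
    pattern = re.compile(f"{re.escape(base)}{{{min_len},}}")
    return [(m.start(), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]


# Toy sequence (not a real P. carbekii gene): one 9-mer and one shorter run.
toy_gene = "ATGGCT" + "A" * 9 + "GCTTGA" + "A" * 5 + "CC"
runs = find_homopolymer_runs(toy_gene)

for start, length in runs:
    print(f"poly-A run of {length} nt starting at position {start}")
```

Runs flagged this way would only be candidates; whether slippage actually restores the reading frame would need transcript-level evidence.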

Figure 2.3: Metabolic reconstruction of Pantoea carbekii. Pathways predicted from the genome that are capable of generating essential (large blue boxes) and nonessential (small blue boxes) amino acids, vitamins (yellow boxes) and metabolites (green boxes) are indicated. Absent genes and products not predicted to be generated by P. carbekii through canonical biosynthesis pathways are indicated in red type and boxes with broken outlines, respectively. Asterisks indicate the presence of genes that encode biosynthetic enzymes predicted to replace missing canonical enzymes. 3PG, 3-phosphoglycerate; P5P, pyridoxal 5-phosphate; BiCrb, bicarbonate; R5P, ribose-5-phosphate; E4P, erythrose-4-phosphate; Rb5P, ribulose-5-phosphate; F6P, fructose-6-phosphate; FRPP, 5-phosphoribosyl 1-pyrophosphate; Gly3P, glyceraldehyde-3-phosphate; FMN, flavin mononucleotide.

D. P. carbekii plasmids encode genes important for nitrogen assimilation and thiamine biosynthesis

pBMSBPS1 encodes nine proteins, including the entire arginine succinyltransferase (AST) pathway (astCADEB), glutamate dehydrogenase (gdhA), replication protein A (repFIB), a small heat shock protein (ibpB), and 3-octaprenyl-4-hydroxybenzoate decarboxylase (ubiD). The AST pathway allows E. coli to use arginine as a nitrogen source under nitrogen-starvation conditions and can supply the total carbon requirement of Klebsiella aerogenes (reviewed in Cunin et al. 1986). GdhA, glutamate dehydrogenase, is part of the ATP-dependent glutamate synthase pathway, catalyzing amination of alpha-ketoglutarate to glutamate (Veronese et al. 1975), and is one of the two pathways by which ammonium nitrogen is assimilated in bacteria to produce glutamate. Only the glutamine-yielding glutamine synthetase part of the GS-GOGAT ammonia assimilation pathway is present; therefore, GdhA is likely the primary mechanism for glutamate production, and it was detected in the Mascot protein analysis of the egg lavage. Lastly, AstC might replace the catalytic role of ArgD, which is missing from the P. carbekii genome but is necessary for arginine and lysine biosynthesis (reviewed in Cunin et al. 1986).

pBMSBPS3 encodes several proteins that are involved in thiamine biosynthesis, although the role of the plasmid is not entirely clear. Of the 11 proteins encoded on pBMSBPS3, four are putatively involved in thiamine (vitamin B1) biosynthesis. These include the ThiS adenylyltransferase ThiF, the ThiG component of thiazole synthase, the sulfur-carrier protein ThiS, and the FAD-dependent oxidase ThiO.

Thiamine biosynthesis requires the separate formation of a thiazole moiety and a pyrimidine moiety. In E. coli, the thiazole moiety is derived from condensation of cysteine, tyrosine, and 1-deoxy-D-xylulose-5-phosphate (‘DXP’; Du et al. 2011), which involves thiazole synthase ThiH, which is not identifiable in the P. carbekii genome (including all plasmids). However, ThiO, or glycine oxidase, is encoded on pBMSBPS3; it is typically utilized by Bacillus subtilis in place of ThiH, but it uses glycine instead of tyrosine in the formation of the thiazole moiety (Du et al. 2011). The pyrimidine moiety could be produced via proteins encoded by P. carbekii (ThiC, ThiD/E, and ThiL). However, two proteins that are essential for thiamine biosynthesis from cysteine, glycine, or DXP are encoded by neither the P. carbekii chromosome nor its plasmids. The sulfur transferase ThiI is required for the de novo synthesis of thiamine from cysteine (Du et al. 2011), making it unclear whether this pathway is utilized in P. carbekii or whether intermediates are scavenged from the host. The thiazole tautomerase, or transcription regulator, TenI is also lacking, yet it is required for de novo biosynthesis of thiamine from cysteine, tyrosine, glycine, and DXP (Du et al. 2011). Although the biosynthesis of thiamine (thiamine phosphate and thiamine diphosphate) appears to be possible via the purine metabolism pathway, the proteins required are entirely encoded on the P. carbekii chromosome (ThiCDEL), making the presence of the pBMSBPS3 plasmid unnecessary if all of the thiamine is derived from this pathway and if the plasmid has no other function. Additionally, there is no ABC transporter for thiamine encoded by the genome, suggesting that if thiamine is being synthesized and distributed to the BMSB host, there must be an alternative way of transporting this vitamin. Since insects are not able to synthesize thiamine de novo (Sweetman and Palmer 1928; Craig and Hoskins 1940) and considering the nutrient-deprived diet of BMSB, it seems likely that the pBMSBPS3 plasmid is involved in thiamine supplementation to BMSB.

It is not clear what the potential roles of pBMSBPS2 and pBMSBPS4 are, since many of the proteins they encode are hypothetical proteins with unknown functions. Of the six proteins pBMSBPS4 encodes, only one (RepA) is not classified as a hypothetical protein. pBMSBPS2 encodes five proteins, two of which could potentially be involved in stress tolerance: the DNA mismatch repair protein MutT and the monophosphatase SuhB. A mutT deletion in E. coli increases the spontaneous occurrence of A:T to C:G transversions a thousandfold over the wild type and increases transcriptional errors (Shimokawa et al. 2000; Dukan 2000). Although the physiological role of SuhB is not well understood, suhB mutants are sensitive to cold, suggesting a potential role in cold tolerance (Chen and Roberts 2000; Matsuhisa et al. 1995). No proteins from pBMSBPS2 or pBMSBPS4 were detected in the egg lavage in the Mascot analysis, suggesting that they are either in low copy number or not expressed when P. carbekii is on the egg surface.

E. Degradation of DNA replication and repair mechanisms and abundant SNPs between P. carbekii strains

A few genes encoding products involved in DNA repair, transcription, homologous recombination and metabolite conversions exhibit abnormal gene morphologies and/or harbor indel mutations, resulting in premature stop codons, frameshifts or truncations. Notably, DNA polymerase I, which is involved in DNA repair (Glickman 1975; Sharon et al. 1975; Smith et al. 1975), is present in the Buchnera, Ishikawaella, and P. ananatis genomes as a single locus encoding a ~900 amino acid enzyme, but in P. carbekii it is split into two protein-coding regions (polA1 and polA2) in different reading frames separated by stop codons. While the frameshifts and stop codons may interfere with production of a classical DNA polymerase I, the presence of intact functional domains within polA1 and polA2 suggests that, together, the products of these two genes may be capable of performing the functions of PolA. Additionally, the NAD-dependent DNA ligase, LigA, which joins DNA fragments and is active during DNA replication, recombination, and repair, closely resembles orthologs in Ishikawaella, Buchnera and other Pantoea species except that it lacks ~90 C-terminal residues that comprise a BRCT domain (Pfam: PF00533), which binds DNA but is not essential for DNA-joining activity in E. coli (Wilkinson et al. 2005).

Several genes involved in DNA repair are missing from the P. carbekii genome. Absent are phr, which encodes a deoxyribodipyrimidine photolyase that acts in a light-dependent reaction to split pyrimidine dimers after UV radiation exposure (Keseler et al. 2013), and the genes xth (exonuclease III) and rep (Rep helicase), whose products are also involved in DNA repair. Xth is an exonuclease that repairs DNA where damaged bases have been removed or lost in E. coli (Gossard and Verly 1978). The loss of the regulation of xth results in the hypersensitivity of E. coli mutants to UV radiation (Sak et al. 1989). Rep helicase is required for replication in E. coli, preventing double-stranded breaks and acting in a replication fork restart pathway (Michel et al. 1997; Sandler 2000). The absence of rep in E. coli results in severe growth problems, namely the accumulation of DNA within a single cell (Trun 2003). The products of mutM and mutT (formamidopyrimidine DNA glycosylase and 8-oxo-dGTP diphosphatase, respectively) work together in the base excision repair (BER) pathway, yet mutM is missing and mutT is present on pBMSBPS2. The absence of mutM in E. coli leads to an increase in GC→TA transversions (Cabrera et al. 1988; Cox 1976). Lastly, recG (RecG DNA helicase) is absent from P. carbekii, but is also missing in Ishikawaella and Buchnera (Appendix B.1). E. coli mutants lacking RecG show a requirement for Pol I DNA polymerase activity (Hong et al. 1995), yet polA appears significantly altered in P. carbekii.

Pairwise comparison of the reported P. carbekii genome (P. carbekii US) and P. carbekii JPN highlights areas of the genome that may have accumulated recent mutations, assuming that the former is recently (i.e. in the last ~30 years) derived from a related native regional Asian H. halys population. While the overall gene order between the genomes of the two P. carbekii strains was identical, with only two inversions in intergenic spaces (76 and 66 bp) and no rearrangements, some nucleotide variations were detected in the SNP analysis and a 100 bp deletion was detected in an intergenic region of the P. carbekii US genome.

SNP detection is useful for identifying genes in closely related organisms (i.e. at the strain level) that are experiencing rapid mutation and are either 1) the target of strong positive selection or 2) no longer under selection because they are not required for maintenance of the organism (e.g. Brown et al. 2014). An average of 1 SNP per kb was observed. Of the 1,144 total SNPs observed in P. carbekii, 511 were in genic regions (Appendix B.1), and less than half (471/1,144) increased the AT bias in the US genome. Most of the SNPs were transitions (724) rather than transversions (420). Over half of the SNPs in genic regions (293/511) were nonsynonymous, and these SNPs were distributed over 170 unique genes, with some genes having up to 7 nonsynonymous mutations, but none resulted in an observable loss of function due to new stop codons (Appendix B.1). Determining the impact of these nonsynonymous changes on enzyme function is part of ongoing investigations.
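The SNP summary above can be checked with a few lines of arithmetic. The Python sketch below recomputes the reported proportions from the counts given in the text and shows how a single substitution would be classified as a transition or a transversion and as AT-increasing. The helper names are hypothetical, and using the total genome size from Table 2.1 (chromosome plus plasmids) as the denominator for SNP density is an assumption.

```python
# Counts as reported in the text; genome size from Table 2.1 (assumed denominator).
GENOME_BP = 1_150_626
TOTAL_SNPS = 1_144
GENIC_SNPS = 511
NONSYN_SNPS = 293
AT_INCREASING_SNPS = 471
TRANSITION_COUNT = 724
TRANSVERSION_COUNT = 420

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def increases_at(ref, alt):
    """True when a G or C is replaced by an A or T."""
    return ref.upper() in {"G", "C"} and alt.upper() in {"A", "T"}


summary = {
    "snps_per_kb": TOTAL_SNPS / (GENOME_BP / 1000),
    "genic_fraction": GENIC_SNPS / TOTAL_SNPS,
    "nonsynonymous_fraction_of_genic": NONSYN_SNPS / GENIC_SNPS,
    "at_increasing_fraction": AT_INCREASING_SNPS / TOTAL_SNPS,
    "transition_transversion_ratio": TRANSITION_COUNT / TRANSVERSION_COUNT,
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The transition and transversion counts sum to the reported total of 1,144, and the density comes out just under 1 SNP per kb, consistent with the text.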

Genes with more than one nonsynonymous SNP included ytfN (‘tamB’), uvrA, surA and polA. ytfN (‘tamB’) encodes the integral inner membrane portion of the translocation and assembly module ‘TAM’, which was recently found to promote efficient secretion of autotransporters in proteobacterial species (Selkrig et al. 2012). Along with TamA (also encoded within the P. carbekii genome), the integral outer membrane protein in this complex, TAM allows for the efficient translocation of Antigen 43 and EhaA, two autotransporter proteins that are involved in biofilm formation and pathogenesis in E. coli (Selkrig et al. 2012; Danese et al. 2000; Wells et al. 2008). Interestingly, Antigen 43 and EhaA are not encoded by the P. carbekii genome. This suggests that TAM might translocate other, not yet described proteins in P. carbekii that may share similar features with these biofilm-formation proteins. uvrA, which encodes a subunit of the UvrABC nucleotide excision repair complex, a generalized DNA repair system, also appears to have acquired new mutations (Kenyon and Walker 1981). This is notable since uvrA is involved in the SOS response in E. coli and therefore might play a similar role in P. carbekii, which may experience environmental stress during its time on the H. halys egg surface. surA is another gene in which four substitution mutations are observed, but it is unclear how these may affect the ability of its product to fold outer membrane proteins or respond to high or low pH (Dartigalongue et al. 2001).

F. Degraded cell division genes may contribute to elongated cell morphology

SEM, TEM, and FISH imaging revealed somewhat elongated cells (Figure 2.1), also observed in the symbiont of M. histrionica (Prado et al. 2010), that may be the outcome of impaired cell division due to the loss of the Rep helicase (rep) and inactivating mutations in proteins involved in cytokinesis. Mutations in rep in E. coli resulted in severe growth defects and a near-doubling of the amount of chromosomal DNA present within each cell (Trun 2003). Many of the “filamentous temperature sensitive” (fts) genes that have been experimentally shown to participate in cell division via Z-ring formation are present in the P. carbekii genome, yet they exhibit truncations or abnormal gene morphologies. FtsK directs DNA translocation and chromosome segregation during cytokinesis in E. coli, and the P. carbekii ortholog (932 a.a.) is ~80% of the length of FtsK in P. ananatis (1112 a.a.) and the P. stali symbiont (1143 a.a.). While it lacks the DNA-binding C-terminal gamma domain (Pfam: PF01580), it retains the putative transmembrane FtsK-SpoIIIE domain that is implicated in DNA translocation in E. coli (Begg et al. 1995) and B. subtilis (Fleming et al. 2010). E. coli FtsK proteins with substitutions in the C-terminal domain show impaired DNA binding but remain capable of functioning in DNA translocation and chromosome segregation (Sivanathan et al. 2006). Additionally, E. coli with FtsK lacking the C-terminal domain exhibit a higher proportion of filamentous cells compared to the wild type (Sivanathan et al. 2006). Sequence analysis alone cannot determine whether P. carbekii FtsK performs similarly to orthologs in P. ananatis or E. coli, or whether its lack of a C-terminal domain contributes to the observed P. carbekii cell morphology.

Although 387 nt coding for an intact N-terminal portion of the protein are detected, ftsN has been annotated as a pseudogene because an in silico translation of an adjacent 382 nt reveals a region coding for the peptidoglycan-binding SPOR domain found within the C-terminal portion of FtsN that is peppered with stop codons and frameshifts. FtsN is important in cell division as it interacts with another key late-stage cytokinesis protein, FtsA, to trigger septation, yet FtsA has been shown to function independently of FtsN (Bernard et al. 2007). While ftsN is absent in the Buchnera genome, truncated and near-full-length orthologs were annotated in the Ishikawaella (170 a.a.) and the Plautia stali symbiont (279 a.a.) genomes, respectively. Consequences of these mutations could be nonlethal impairment of cytokinesis within P. carbekii, leading to the observed elongated cell morphology, but further experimental work to confirm the expression and functionality of P. carbekii FtsK and FtsN is needed.

G. Stress tolerance in P. carbekii

During the period between oviposition and consumption by emerged nymphs (~1 wk; Nielsen and Hamilton 2009), P. carbekii lives outside of host gastric tissues within a maternally secreted matrix on the egg surface. While beneath and intercalated within the extrachorion matrix of H. halys eggs, P. carbekii is exposed to fluctuations in environmental conditions (e.g. temperature fluctuations, UV radiation and desiccation) and biotic interactions (e.g. competition with other microbes or predation) that may result in stress in P. carbekii cells. A number of putative stress-response genes were detected in the P. carbekii genome and are discussed briefly. Protein denaturation, misfolding and aggregation can be detrimental to cells following heat stress, and sigma-32, which is encoded by rpoH, is at the hub of rapid responses to heat stress through the expression of enzymes involved in refolding and stabilizing denatured proteins (chaperones: GroESL, ClpB, IbpAB and DnaKJ) or degrading irrecoverably misfolded proteins (proteases: ClpAP, ClpXP, HslUV and FtsH) (reviewed in Rosen and Ron 2002; Gunesekere et al. 2006). P. carbekii encodes all of the aforementioned enzymes, with ibpB present on pBMSBPS1. IbpB binds aggregated or denatured proteins and has been shown to increase in expression in response to temperature up-shifts and during biofilm formation in E. coli (Lasokowska et al. 1996; Ren et al. 2004). Some of its functions are dependent on IbpA (Kuczyńska-Wiśnik et al. 2002), which is encoded on the P. carbekii chromosome and is present in both Ishikawaella and Buchnera. In Buchnera, IbpA significantly impacts heat tolerance in pea aphids, as shown by a single nucleotide deletion in ibpA that resulted in sharp fitness declines when pea aphids harboring Buchnera ∆ibpA were maintained at elevated temperatures (25-30°C) (Dunbar et al. 2007). The presence of ibpA and the pBMSBPS1-encoded ibpB in P. carbekii may indicate that it also confers elevated temperature tolerance to H. halys. Additionally, host factor one (hfq) is encoded on the P. carbekii chromosome; it has RNA chaperone activity and regulates the expression of, among other stress response genes, the stationary phase and environmental stress response regulator, RpoS (Muffler 1996; Moll 2003; Battesti et al. 2011).

Many of the aforementioned genes are not present in Ishikawaella, which has a similar mode of inheritance to P. carbekii but is packaged in symbiont capsules that are deposited next to the eggs (Fukatsu and Hosokawa 2002) rather than smeared on the egg surface. Ishikawaella might not have the same exposure to abiotic and biotic pressures as P. carbekii does within the extrachorion matrix. It has been suggested that the conditions within the Ishikawaella capsule mimic those of the host midgut, protecting the symbiont from environmental fluctuations (Fukatsu and Hosokawa 2002; Hosokawa et al. 2005). As a result, the environmental conditions for Ishikawaella may not necessitate retention of the same repertoire of stress response genes as P. carbekii.

Supporting these hypotheses, many of the aforementioned proteins were detected on the egg surface through the LC/MS/MS Mascot analysis (i.e. GroEL, GroES, ClpB, IbpAB, DnaKJ, ClpXP, HslU). GroEL and GroES were the two most abundant proteins according to the emPAI metric (Appendix B.1; Table 2.2) in the search against the bacterial ‘nr’ NCBI database, whereas IbpB (pBMSBPS1) had the highest emPAI in the plasmid protein custom database (Table 2.2). Only 169 proteins were detected in the egg lavage by the Mascot analysis and, based on a KEGG KO classification, two of the five most enriched categories were ko01110 ‘biosynthesis of secondary metabolites’ and ko01120 ‘microbial metabolism in diverse environments’ (Appendix B.1), further supporting the hypothesis that P. carbekii responds to biotic and abiotic stress on the egg surface. However, further work will be needed to compare the proteins detected in the egg lavage with those in the V4 midgut crypts.
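For readers unfamiliar with the emPAI values referred to above, the following Python sketch shows how the index is commonly defined: ten raised to the ratio of observed to observable peptides for a protein, minus one. This is an illustration only. The function name and the peptide counts are hypothetical, and the way Mascot determines the number of observable peptides for each protein is not reproduced here.

```python
def empai(n_observed, n_observable):
    """Exponentially modified protein abundance index, as commonly defined:
    emPAI = 10 ** (observed peptides / observable peptides) - 1.

    Both arguments are peptide counts for a single protein. The counts used
    below are made up for illustration and are not values from this study.
    """
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    if n_observed < 0:
        raise ValueError("n_observed must be non-negative")
    return 10 ** (n_observed / n_observable) - 1


# Hypothetical protein with 12 observable peptides, 6 of which were detected.
example_value = empai(6, 12)
print(f"emPAI = {example_value:.2f}")
```

Because the index depends on which peptides are counted as observed and observable in a given search, values are best interpreted relative to other proteins from the same database search.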

Other proteins detected on the egg surface that could be involved in stress tolerance include the iron-sulfur cluster scaffold protein NfuA, subunit II of the cytochrome bo terminal oxidase CyoA, periplasmic trehalase TreA, the histone-like nucleoid structuring protein Hns, a putative cold shock protein, cold shock protein CspE, heat shock chaperone protein GrpE, heat shock protein HtpG, outer membrane proteins X and A, thioredoxin reductase TrxB, cold shock DEAD-box protein A DeaD, and four proteins from pBMSBPS3 (Table 2.2). E. coli nfuA mutants show a severe growth defect under oxidative and iron stress (Angelini et al. 2008). CyoA and TrxB are thought to be members of the network of genes that can promote stress-induced mutagenesis in E. coli (Al Mamun et al. 2012). TrxB may also have some chaperone function, renaturing thermally induced aggregated proteins (Kern et al. 2003). TreA expression is increased with high osmolarity (Boos et al. 1987; Repolia and Gutierrez 1991) and mutants accumulate high amounts of extracellular trehalose under osmotic stress (Styrvold and Strom 1991). HtpG expression is induced in heat shock and acid shock (Heitzer et al. 1990; Heyde and Portalier 1990) and can bind to sigma factor RpoH (heat shock sigma factor 32; Nadeau et al. 1993). OmpA promotes biofilm formation by suppressing cellulose production (Barrios et al. 2006) and ompA mutants form sticky colonies due to an accumulation of cellulose (Barrios et al. 2006; Ma and Wood 2009). Interestingly, bacterium-specific polymorphisms in OmpA were recently implicated in host tolerance of the facultative symbiont Sodalis glossinidius by its tsetse fly host (Weiss et al. 2014), suggesting its potential importance in this symbiosis. OmpX expression changes with osmolarity and pressure in E. coli and is induced under basic and acidic conditions (Nakashima et al. 1995; Stancik et al. 2002). DeaD expression is induced by cold shock (Jones et al. 1996) and a deletion of deaD results in a growth defect at low temperatures in E. coli (Jones et al. 1996; Charollais et al. 2004).

The four pBMSBPS3-encoded proteins that were detected are poorly characterized, with top BLASTP hits to NCBI ‘nr’ being hypothetical proteins, but are putatively involved in stress response or cell transport. For instance, one protein detected has multiple BLASTP ‘nr’ hits to a “stress-induced protein”, although no conserved domains were detected and no KEGG KO was assigned. Hypothetical protein YciE was also detected and is poorly characterized, with a PFAM domain of unknown function (PF05974.7), yet it is expressed under osmotic stress induced by NaCl in E. coli (Weber et al. 2006). The last two proteins detected from pBMSBPS3 are both putative periplasmic proteins, one being a periplasmic ligand-binding sensor protein (BLASTP ‘nr’ hits; PF09849.4; K09945) and the other an OB-fold protein (BLASTP ‘nr’ hits; PF04076.8) that may bind to proteins or other small molecules (Ginalski et al. 2004). Both proteins could be involved in transportation of molecules across the P. carbekii membrane, potentially important under environmental stress.

Table 2.2: Brief descriptions, KEGG KO assignments, and emPAI values for proteins of note detected on the egg lavage surface (using LC/MS/MS followed by a Mascot analysis). Note: the emPAI values cannot be compared to one another when the databases used in the Mascot analysis are different. a: pBMSBPS1 database used; b: pBMSBPS2-4 database used; Eubacteria ‘nr’ database used for all others.

| Gene | emPAI | KEGG KO | Description |
|---|---|---|---|
| groES | 8.42 | K04078 | chaperone |
| groEL | 7.71 | K04077 | chaperone |
| elaB | 7.32 | K05594 | cellular response to DNA damage |
| trxA | 4.13 | K03671 | protein folding catalyst thioredoxin 1 |
| dnaK | 2.15 | K04043 | molecular chaperone/heat shock protein |
| lpp | 1.94 | - | murein lipoprotein |
| rpsU | 1.75 | - | 30S subunit ribosomal protein S21 |
| sod2 | 1.73 | K04564 | superoxide dismutase |
| ihfA | 1.72 | K04764 | integration host factor subunit alpha |
| tuf | 1.65 | K02358 | elongation factor Tu |
| hns | 1.59 | K03746 | histone-like nucleoid structuring protein |
| - | 1.48 | - | BLASTP hits to "stress-induced protein" |
| cspE | 1.11 | K03704 | cold shock protein |
| grpE | 0.9 | K03687 | heat shock protein |
| ibpA | 0.71 | K04080 | small heat shock protein A |
| htpG | 0.66 | K04079 | high temperature protein G |
| hsp60 | 0.61 | K04077 | heat shock chaperone protein Hsp60 |
| clpP | 0.46 | K01358 | ClpP protein |
| dnaJ | 0.43 | K03686 | DnaJ protein |
| ompX | 0.36 | - | outer membrane protein X |
| ompA | 0.35 | K16191 | outer membrane protein A |
| clpB | 0.29 | K03695 | protein disaggregation chaperone |
| yhgI/nfuA | 0.15 | K07400 | iron-sulfur cluster scaffold protein |
| cyoA | 0.1 | K02297 | subunit II of the cytochrome bo terminal oxidase complex |
| treA | 0.1 | K01194 | periplasmic trehalase |
| trxB | 0.09 | K00384 | thioredoxin reductase |
| deaD | 0.05 | K05592 | cold-shock DEAD-box protein A |
| ibpB | 29.56 (a) | K04081 | small heat shock protein B |
| pBMSBPS3_8 | 3231.42 (b) | - | hypothetical protein; stress-induced protein |
| pBMSBPS_11 | 5.3 (b) | - | putative periplasmic bacterial OB fold (BOF) protein |
| pBMSBPS3_12 | 1.82 (b) | K09945 | putative periplasmic ligand-binding-sensor protein |
| pBMSBPS_9 | 3.09 (b) | - | conserved hypothetical protein YciE |

IV. Concluding Remarks

We report the first ultrastructural characterization of P. carbekii within the extrachorion matrix of brown marmorated stink bug eggs by SEM, and provide the complete genome sequence of the primary symbiont of this agricultural pest. Detection of the symbiont within this extrachorion matrix confirms that H. halys shares its primary symbiont transmission modality with other phytophagous stink bugs that exhibit egg-smearing behavior and harbor gammaproteobacterial symbionts within the gastric caeca of the distal midgut.

Elucidating the biochemical composition of the extrachorion matrix may reveal chemical attractants that stimulate nymphal feeding behavior immediately following emergence, as well as compounds that improve the survival of P. carbekii outside of host tissues following oviposition and prior to consumption by nymphs.

Detailed genomic analysis of P. carbekii indicates that it has the potential to provision a wide range of dietary supplements, namely essential amino acids and vitamins, to its herbivorous host, and that the genome displays hallmarks of long-term host association, including a low G+C% and a reduced genic repertoire and genome size. If P. carbekii is provisioning nutrients to its host insect, then it would support the ability of H. halys to exploit a wide range of host plants and would explain the breadth of >150 host plants that H. halys is known to feed upon (Bergmann et al. 2014). The P. carbekii genome is reduced in size relative to known non-host-restricted Pantoea sp., but many gammaproteobacterial intracellular symbionts have genomes <1 Mb in size. Genes involved in peptidoglycan and cell wall biosynthesis and in stress response, which are absent from many gammaproteobacterial intracellular mutualist genomes, are retained and are among those contributing to the relatively modest size reduction of the P. carbekii genome. The multiphasic lifestyle of P. carbekii (e.g. within the extrachorion matrix, in the insect gut during migration to the gastric caeca, and dwelling in intraluminal crypts) may necessitate a broader genic repertoire than that of bacterial symbionts that are strictly associated with, and often within, host tissues.

While native to the Asian continent, H. halys was introduced into the United States in the 1990s and has since spread across much of the North American continent, and it has been detected in Switzerland and Hungary (Hoebeke and Carter 2003; Wermelinger et al. 2008; Fogain and Graff 2011; Zhu et al. 2012; Leskey et al. 2012; Vetek et al. 2014; Xu et al. 2014). H. halys has few natural enemies in these regions and is capable of feeding on over 150 different species of plants, including major food crops, ornamental plants, and fruit trees (Bergmann et al. 2014), resulting in losses of >$10M annually (Seetin 2011). Current pest management strategies depend on heavy insecticide use, which is of limited effectiveness due to the rapid emergence of resistant strains, disrupts natural predator-prey relationships, and raises concerns about food safety and environmental pollution; therefore, alternative strategies for managing H. halys are warranted (Leskey et al. 2012).

Bibliography

Abe Y, Mishiro K and Takanashi M. 1995. Symbiont of brown-winged green bug, Plautia stali Scott. Appl. Entomol. Z. 39:109-115.

Adachi J, Waddell PJ, Martin W, Hasegawa M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50:348-358.

Al Mamun AAM et al. 2012. Identity and function of a large gene network underlying mutagenic repair of DNA breaks. Science. 338:1344-1348.

Altschul SF et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

Angelini S et al. 2008. NfuA, a new factor required for maturing Fe/S proteins in Escherichia coli under oxidative stress and iron starvation conditions. J. Biol. Chem. 283:14084-14091.

Bansal R, Michel A and Sabree ZL. 2014. The crypt-dwelling primary bacterial symbiont of the polyphagous pentatomid pest Halyomorpha halys (Hemiptera: Pentatomidae). Environ. Entomol. 43:617-625.

Barrios AFG, Zuo R, Ren D and Wood TK. 2006. Hha, YbaJ, and OmpA regulate Escherichia coli K12 biofilm formation and conjugation plasmids abolish motility. Biotechnol. Bioeng. 93:188-200.

Battesti A, Majdalani N and Gottesman S. 2011. The RpoS-mediated general stress response in Escherichia coli. Annu. Rev. Microbiol. 65:189-213.

Baumann P, Baumann L, Clark M. 1996. Levels of Buchnera aphidicola chaperone GroEL during growth of the aphid Schizaphis graminum. Curr. Microbiol. 32:279-285.

Baumann P. 2005. Biology of bacteriocyte-associated endosymbionts of plant sap-sucking insects. Annu. Rev. Microbiol. 59:155-189.

Begg KJ, Dewar SJ and Donachie WD. 1995. A new Escherichia coli cell division gene, ftsK. J Bacteriol. 177:6211–6222.

Bergmann E et al. 2014. Host plants of the brown marmorated stink bug in the U.S. http://www.stopbmsb.org/where-is-bmsb/host-plants. (29 April 2014).

Bernard CS, Sadasivam M, Shiomi D and Margolin W. 2007. An altered FtsA can compensate for the loss of essential cell division protein FtsN in Escherichia coli. Mol Microbiol. 64:1289–1305.

Bistolas K, Sakamoto R, Fernandes J and Goffredi S. 2014. Symbiont polyphyly, co-evolution, and necessity in pentatomid stinkbugs from Costa Rica. Front Microbiol. 5.

Boos W, Ehmann U, Bremer E, Middendorf A and Postma P. 1987. Trehalase of Escherichia coli. Mapping and cloning of its structural gene and identification of the enzyme as a periplasmic protein induced under high osmolarity growth conditions. J. Biol. Chem. 262:13212-13218.

Brown A, Huynh LY, Bolender CM, Nelson KG and McCutcheon JP. 2014. Population genomics of a symbiont in the early stages of a pest invasion. Mol. Ecol. 23:1516-1530.

Budd A, Blandin S, Levashina EA, Gibson TJ. 2004. Bacterial alpha-2-macroglobulins: Colonization factors acquired by horizontal gene transfer from the metazoan genome? Genome Biol. 5:R38.

Burge SW et al. 2012. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41:D226-D232.

Cabrera M, Nghiem Y and Miller JH. 1988. mutM, a second mutator locus in Escherichia coli that generates G.C→T.A transversions. J. Bacteriol. 170:5405-5407.

Carver T, Thomson N, Bleasby A, Berriman M and Parkhill J. 2009. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics. 25:119-120.

Caspi R et al. 2008. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 36:D623-D631.

Charles H, Mouchiroud D, Lobry J, Goncalves I, Rahbe Y. 1999. Gene size reduction in the bacterial aphid endosymbiont, Buchnera. Mol. Biol. Evol. 16:1820-1822.

Charollais J, Dreyfus M and Iost I. 2004. CsdA, a cold-shock RNA helicase from Escherichia coli, is involved in the biogenesis of 50S ribosomal subunit. Nucleic Acids Res. 32:2751-9.

Chen F, Mackey AJ, Stoeckert Jr CJ and Roos DS. 2006. OrthoMCL-DB: Querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 34.

Chen L and Roberts MF. 2000. Overexpression, purification, and analysis of complementation behavior of E. coli SuhB protein: comparison with bacterial and archaeal inositol monophosphatases. Biochem. 39:4145-4153.

Chothia C, Gough J. 2009. Genomic and structural aspects of protein evolution. Biochem. J. 419:15-28.

Cox EC. 1976. Bacterial mutator genes and the control of spontaneous mutation. Annu. Rev. Genet. 10:135-156.

Craig R and Hoskins WM. 1940. Insect biochemistry. Annu. Rev. Biochem. 9:617-640.

Cunin R, Glandsdorff N, Piérard A and Stalon V. 1986. Biosynthesis and metabolism of arginine in bacteria. Microbiol. Rev. 50:314-352.

Danese PN, Pratt LA, Dove SL and Kolter R. 2000. The outer membrane protein, Antigen 43, mediates cell to cell interactions within Escherichia coli biofilms. Mol. Microbiol. 37:424-432.

Dartigalongue C, Missiakas D and Raina S. 2001. Characterization of the Escherichia coli sigma E regulon. J. Biol. Chem. 276:20866-20875.

Degnan PH and Ochman H. 2012. Illumina-based analysis of microbial community diversity. ISME J. 6:183-194.

Delcher AL, Phillippy A, Carlton J and Salzberg SL. 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30:2478-2483.

Douglas AE. 2013. Microbial brokers of insect-plant interactions revisited. J. Chem. Ecol. 39:952-961.

Douglas AE. 2014. Lessons from studying insect symbioses. Cell Host Microbe. 10:359-367.

Du Q, Wang H, and Xie J. 2011. Thiamin (vitamin B1) biosynthesis and regulation: a rich source of antimicrobial drug targets? Int. J. Biol. Sci. 7:41-52.

Dukan S et al. 2000. Protein oxidation in response to increased transcriptional or translational errors. Proc. Natl. Acad. Sci. 97:5746-5749.

Dunbar HE, Wilson AC, Ferguson NR and Moran NA. 2007. Aphid thermal tolerance is governed by a point mutation in bacterial symbionts. PLoS Biol. 5:1006-1015.

Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 19:2460-2461.

Fares MA, Moya A and Barrio E. 2004. GroEL and the maintenance of bacterial endosymbiosis. Trends Genet. 20:413-416.

Fischer S et al. 2011. Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinformatics. 35:6.12.1–6.12.19.

Fleming TC et al. 2010. Dynamic SpoIIIE assembly mediates septal membrane fission during Bacillus subtilis sporulation. Genes Dev. 24:1160–1172.

Fogain R and Graff S. 2011. First records of the invasive pest, Halyomorpha halys (Hemiptera: Pentatomidae), in Ontario and Quebec. J. Entomol. Soc. Ontario. 142:45-48.

Frank CA, Amiri H, Andersson SGE. 2002. Genome deterioration: Loss of repeated sequences and accumulation of junk DNA. Genetica. 115:1-12.

Fukatsu T and Hosokawa T. 2002. Capsule-transmitted gut symbiotic bacterium of the Japanese common plataspid stinkbug, Megacopta punctatissima. Appl. Environ. Microb. 68:389-396.

Ginalski K, Kinch L, Rychlewski L and Grishin NV. 2004. BOF: a novel family of bacterial OB-fold proteins. FEBS Lett. 567:297-301.

Giovannoni SJ et al. 2005. Genome streamlining in a cosmopolitan oceanic bacterium. Science 309:1242-1245.

Glandsdorff N, Sand G and Verhoef C. 1967. The dual genetic control of ornithine transcarbamylase synthesis in Escherichia coli K12. Mutat. Res-Fund. Mol. M. 4:743-751.

Glickman BW. 1975. The role of DNA Polymerase I in excision-repair. Basic Life Sci. 5A:213-228.

Goldman N and Yang Z. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.

Gordon A and Hannon GJ. 2010. "Fastx-toolkit." FASTQ/A short-reads preprocessing tools (unpublished) http://hannonlab.cshl.edu/fastx_toolkit (visited 1 June 2014).

Gossard F and Verly WG. 1978. Properties of the main endonuclease specific for apurinic sites of Escherichia coli (endonuclease VI). Eur. J. Biochem. 82:321-332.

Gunesekere IC et al. 2006. Comparison of the RpoH-dependent regulon and general stress response in Neisseria gonorrhoeae. J. Bacteriol. 188:4769-76.

Haft DH, Selengut JD and White O. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res. 31:371-373.

Hamdache A, Azarken R, Lamarti A, Aleu J and Collado IG. 2013. Comparative genome analysis of Bacillus spp. and its relationship with bioactive nonribosomal peptide production. Phytochem Rev DOI 10.1007/s11101-013-9278-4.

Heitzer A, Mason CA, Snozzi M and Hamer G. 1990. Some effects of growth conditions on steady state and heat shock induced htpG gene expression in continuous cultures of Escherichia coli. Arch. Microbiol. 155:7-12.

Heyde M and Portalier R. 1990. Acid shock proteins of Escherichia coli. FEMS Microbiol. Lett. 69:19-26.

Hill CW, Sandt CH and Vlazny DA. 1994. Rhs elements of Escherichia coli: A family of genetic composites each encoding a large mosaic protein. Molecular Microbiology 12:865-871.

Hoebeke E and Carter ME. 2003. Halyomorpha halys (Stål) (Heteroptera: Pentatomidae): a polyphagous plant pest from Asia newly detected in North America. P. Entomol. Soc. Wash. 105:225-237.

Hong X, Cadwell GW and Kogoma T. 2005. Escherichia coli RecG and RecA proteins in R-loop formation. EMBO J. 14:2385-2392.

Hosokawa T et al. 2013. Diverse strategies for vertical symbiont transmission among subsocial stinkbugs. PLoS ONE 8:e65081.

Hosokawa T, Kikuchi Y, Meng XY and Fukatsu T. 2005. The making of symbiont capsule in the plataspid stinkbug Megacopta punctatissima. FEMS Microbiol. Ecol. 54:471-477.

Hosokawa T, Kikuchi Y, Shimada M and Fukatsu T. 2008. Symbiont acquisition alters behaviour of stinkbug nymphs. Biol. Lett. 4:45–48.

Hur GH, Vickery CR and Burkart MD. 2012. Explorations of catalytic domains in non-ribosomal peptide synthetase enzymology. Nat. Prod. Rep. 29:1074-1098.

Hyatt D et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 11:119.

Jones PG, Mitta M, Kim Y, Jiang W and Inouye M. 1996. Cold shock induces a major ribosomal-associated protein that unwinds double-stranded RNA in Escherichia coli. Proc. Natl. Acad. Sci. 9:76-80.

Katoh K, Misawa K, Kuma KI and Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.

Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30:772-780.

Kelkar Y and Ochman H. 2013. Genome reduction promotes increase in protein functional complexity in bacteria. Genetics 193:303-307.

Kenyon CJ and Walker GC. 1981. Expression of the E. coli uvrA gene is inducible. Nature Lett. 289:808-810.

Kenyon LJ and Sabree ZL. 2014. Obligate insect endosymbionts exhibit increased ortholog length variation and loss of large accessory proteins concurrent with genome shrinkage. Genome Biol. Evol. 6:763-775.

Kern R, Malki A, Holmgren A and Richarme G. 2003. Chaperone properties of Escherichia coli thioredoxin and thioredoxin reductase. Biochem. J. 371:965-972.

Keseler IM et al. 2009. EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res. 37:D464-D470.

Kikuchi Y et al. 2012. Primary gut symbiont and secondary, Sodalis-allied symbiont of the Scutellerid stinkbug Cantao ocellatus. Appl. Environ. Microb. 76:3486-3494.

Kikuchi Y, Hosokawa T and Fukatsu T. 2007. Insect-microbe mutualism without vertical transmission: a stinkbug acquires a beneficial gut symbiont from the environment every generation. Appl. Environ. Microb. 73:4308-4316.

Kikuchi Y, Hosokawa T, Nikoh N and Fukatsu T. 2012. Gut symbiotic bacteria in cabbage bugs Eurydema rugosa and Eurydema dominulus (Heteroptera: Pentatomidae). Appl. Entomol. Zool. 47:1-8.

Kim J and Copley SD. 2007. Why metabolic enzymes are essential or nonessential for growth of Escherichia coli K12 on glucose. Biochemistry. 46:12501-12511.

Kisiela DI et al. 2012. Evolution of Salmonella enterica virulence via point mutations in the fimbrial adhesin. PLoS Pathogens. 8:e1002733.

Komaki K and Ishikawa H. 1999. Intracellular bacterial symbionts of aphids possess many genomic copies per bacterium. J. Mol. Evol. 48:717-722.

Komaki K and Ishikawa H. 2000. Genomic copy number of intracellular bacterial symbionts of aphids varies in response to developmental stage and morph of their host. Insect Biochem. Molec. Biol. 30:253-258.

Kuczyńska-Wiśnik D et al. 2002. The Escherichia coli small heat-shock proteins IbpA and IbpB prevent the aggregation of endogenous proteins denatured in vivo during extreme heat shock. Microbiology. 148:1757-1765.

Kung VL, Stehlik C, Bacon EM, Hughes AJ and Hauser AR. 2012. An rhs gene of Pseudomonas aeruginosa encodes a virulence protein that activates the inflammasome. Proc. Natl. Acad. Sci. 109:1275-1280.

Kuo C, Moran NA and Ochman H. 2009. The consequences of genetic drift for bacterial genome complexity. Genome Res. 19:1450-1454.

Kurland CG. 1992. Translational accuracy and the fitness of bacteria. Annu. Rev. Genet. 26:29-50.

Kurland CG, Canback B and Berg OG. 2007. The origins of modern proteomes. Biochimie. 89:1454-1463.

Lambert JD and Moran NA. 1998. Deleterious mutations destabilize ribosomal RNA in endosymbiotic bacteria. Proc. Natl. Acad. Sci. 95:4458-4462.

Larkin MA et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.

Laslett D and Canback B. 2004. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32:11-16.

Lasokowska E, Wawrzynow A and Taylor A. 1996. IbpA and IbpB, the new heat-shock proteins, bind to endogenous Escherichia coli proteins aggregated intracellularly by heat shock. Biochimie. 78:117-122.

Ledwidge R and Blanchard JS. 1999. The dual biosynthetic capability of N-acetylornithine aminotransferase in arginine and lysine biosynthesis. Biochemistry. 38:3019-3024.

Lee M and Marx CJ. 2012. Repeated, selection-driven genome reduction of accessory genes in experimental populations. PLoS Genet. 8:e1002651.

Leskey T et al. 2012. Pest status of the brown marmorated stink bug, Halyomorpha halys in the USA. Outlooks on Pest Management. 23:218-226.

Li H and Durbin R. 2010. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 26:589-595.

Li H et al. 2009. The sequence alignment/map format and SAMtools. Bioinformatics. 25:2078-2079.

Login H and Heddi A. 2012. Insect immune system maintains long-term resident bacteria through a local response. J. Insect Physiol. 2:232-239.

López-Sánchez MJ et al. 2009. Evolutionary convergence and nitrogen metabolism in Blattabacterium strain Bge, primary endosymbiont of the cockroach Blattella germanica. PLoS Genet. 5:e1000721.

Ma Q and Wood TK. 2009. OmpA influences Escherichia coli biofilm formation by repressing cellulose production through the CpxRA two component system. Environ. Microbiol. 11:2735-2746.

Maniloff J. 1996. The minimal cell genome: "on being the right size". Proc. Natl. Acad. Sci. 93:10004-10006.

Marahiel MA, Stachelhaus T and Mootz HD. 1997. Modular peptide synthetases involved in nonribosomal peptide synthesis. Chem. Rev. 97:2651-2673.

Masella AP, Bartram AK, Truszkowski JM, Brown DG and Neufeld JD. 2012. PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics. 13:31.

Matsuhisa A, Suzuki N, Noda T and Shiba K. 1995. Inositol monophosphatase activity from the Escherichia coli suhB gene product. J. Bacteriol. 177:200-205.

McFall-Ngai M et al. 2013. Animals in a bacterial world, a new imperative for the life sciences. Proc. Natl. Acad. Sci. USA. 110:3229-3236.

McKenna A et al. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297-1303.

Mergaert P et al. 2006. Eukaryotic control on bacterial cell cycle and differentiation in the Rhizobium-legume symbiosis. Proc. Natl. Acad. Sci. 103:5230-5235.

Michel B, Ehrlich SD and Uzest M. 1997. DNA double-strand breaks caused by replication arrest. EMBO J. 16:430-438.

Miller MA, Pfeiffer W and Schwartz T. 2010. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. In Gateway Computing Environments Workshop (GCE) (pp. 1-8). IEEE.

Mira A, Ochman H and Moran NA. 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet. 17:589-596.

Moll I, Afonyushkin T, Vytvytska O, Kaberdin VR and Blasi U. 2003. Coincident Hfq binding and RNase E cleavage sites on mRNA and small regulatory RNAs. RNA. 9:1308-1314.

Moran NA. 1996. Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc. Natl. Acad. Sci. 93:2873-2878.

Moran NA and Baumann P. 2000. Bacterial endosymbionts in animals. Curr. Opin. Microbiol. 3:270-275.

Moran NA, McCutcheon JP and Nakabachi A. 2008. Genomics and evolution of heritable bacterial symbionts. Annu. Rev. Genet. 42:165-190.

Moriya Y, Itoh M, Okuda S, Yoshizawa A and Kanehisa M. 2007. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 35:W182-W185.

Muffler A, Fischer D and Hengge-Aronis R. 1996. The RNA-binding protein HF-1, known as host factor for phage Q-beta RNA replication, is essential for rpoS translation in Escherichia coli. Gene Dev. 10:1143-1151.

Nadeau K, Das A and Walsh CT. 1993. Hsp90 chaperonins possess ATPase activity and bind heat shock transcription factors and peptidyl prolyl isomerases. J. Biol. Chem. 268:1479-1487.

Nakashima K, Horikoshi K and Mizuno T. 1995. Effect of hydrostatic pressure on the synthesis of outer membrane proteins in Escherichia coli. Biosci. Biotechnol. Biochem. 59:130-2.

Nielsen AL and Hamilton GC. 2009. Life history of the invasive species Halyomorpha halys (Hemiptera: Pentatomidae) in Northeastern United States. Ann. Entomol. Soc. Am. 102:608-616.

Nikoh N, Hosokawa T, Oshima K, Hattori M and Fukatsu T. 2011. Reductive evolution of a bacterial genome in insect gut environment. Genome Biol. Evol. 3:702-714.

Ootsubo MT et al. 2002. Oligonucleotide probe for detecting Enterobacteriaceae by in situ hybridization. J. Appl. Microbiol. 93:60-68.

Osborn AM and Smith CJ. 2005. Molecular Microbial Ecology. Garland Science.

Prado SS, Hung KY, Daugherty MP and Almeida RPP. 2010. Indirect effects of temperature on stink bug fitness, via maintenance of gut-associated symbionts. Appl. Environ. Microb. 76:1261-1266.

Prado SS, Rubinoff D and Almeida RPP. 2006. Vertical transmission of a pentatomid caeca-associated symbiont. Biol. 99:577-585.

Pruesse E et al. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35:7188-7196.

Punta M et al. 2012. The Pfam protein families database. Nucleic Acids Res. 40:D290-D301.

Quinlan AR and Hall IM. 2009. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26:841-842.

Rambaut A and Drummond A. 2010. FigTree v1.3.1. Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom.

Ren D, Bedzyk LA, Thomas SM, Ye RW and Wood TK. 2004. Gene expression in Escherichia coli biofilms. Appl. Microbiol. Biot. 64:515-524.

Repoila F and Gutierrez C. 1991. Osmotic induction of the periplasmic trehalase in Escherichia coli K12: characterization of the treA gene promoter. Mol. Micro. 5:747-755.

Revanna KV, Chiu CC, Bierschank E and Dong Q. 2011. GSV: A web-based genome synteny viewer for customized data. BMC Bioinformatics. 12:316.

Reynolds ES. 1963. The use of lead citrate at high pH as an experimental study. J. Cell Biol. 17:208-212.

Rice P, Longden I and Bleasby A. 2000. EMBOSS: the European open software suite. Trends Genet. 16:276-277.

Rosen R and Ron EZ. 2002. Proteome analysis in the study of the bacterial heat-shock response. Mass Spectrom. Rev. 21:244-265.

Rutherford K et al. 2000. Artemis: sequence visualization and annotation. Bioinformatics. 16:944-945.

Sak BD, Eisenstark A and Touati D. 1989. Exonuclease III and the catalase hyperperoxidase II in Escherichia coli are both regulated by the katF gene product. Proc. Natl. Acad. Sci. USA. 86:3271-3275.

Sandler SJ. 2000. Multiple genetic pathways for restarting DNA replication forks in Escherichia coli K-12. Genetics. 155:487-497.

Schattner P, Brooks AN and Lowe TM. 2005. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33:W686-W689.

Schloss PD et al. 2009. Introducing Mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75:7537-7541.

Schoeffler AJ, May AP and Berger JM. 2010. A domain insertion in Escherichia coli GyrB adopts a novel fold that plays a critical role in gyrase function. Nucleic Acids Res. 38:7830-7844.

Seetin M. 2011. News release: losses to mid-Atlantic apple growers at $37 million from brown marmorated stink bug. http://www.usapple.org/index.php?option=com_content&view=article&id=160:bmsb-loss-midatlantic&catid=8:media-category. (April 12, 2011). U.S. Apple Association.

Selkrig J et al. 2012. Discovery of an archetypal protein transport system in bacterial outer membranes. Nat. Struct. Mol. Biol. 19:506-510.

Sharon R, Miller C and Ben-Ishai R. 1975. Two modes of excision repair in toluene-treated Escherichia coli. J. Bacteriol. 123:1107-1114.

Shigenobu S, Watanabe H, Hattori M, Sakaki Y and Ishikawa H. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature. 407:81-86.

Shimokawa H, Fujii Y, Furuichi M, Sekiguchi M and Nakabeppu Y. 2000. Functional significance of conserved residues in the phosphohydrolase module of Escherichia coli MutT protein. Nucleic Acids Res. 28:3240-3249.

Sivanathan V et al. 2006. The FtsK γ domain directs oriented DNA translocation by interacting with KOPS. Nat. Struct. Mol. Biol. 13:965-972.

Smith DW, Tait RC and Harris AL. 1975. DNA repair in DNA-polymerase-deficient mutants of Escherichia coli. Basic Life Sci. 5B:473-481.

Sokurenko EV et al. 2004. Selection footprint in the FimH adhesin shows pathoadaptive niche differentiation in Escherichia coli. Mol. Biol. Evol. 21:1373-1383.

Soma A et al. 2003. An RNA-Modifying enzyme that governs both the codon and amino acid specificities of isoleucine tRNA. Mol. Cell. 12:689–698.

Stamatakis A, Hoover P and Rougemont J. 2008. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57:758-771.

Stancik LM et al. 2002. pH-dependent expression of periplasmic proteins and amino acid catabolism in Escherichia coli. J. Bacteriol. 184:4246-58.

Styrvold OB and Strøm AR. 1991. Synthesis, accumulation, and excretion of trehalose in osmotically stressed Escherichia coli K-12 strains: influence of amber suppressors and function of the periplasmic trehalase. J. Bacteriol. 173:1187-1192.

Suyama M, Torrents D and Bork P. 2006. PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34.

Sweetman MD and Palmer LS. 1928. Insects as test animals in vitamin research: I. Vitamin requirements of the flour beetle, Tribolium confusum duval. J. Biol. Chem. 77:33-52.

Tada A et al. 2011. Obligate association with gut bacterial symbiont in Japanese populations of the southern green stinkbug Nezara viridula (Heteroptera: Pentatomidae). Appl. Entomol. Zool. 46:483-488.

Tamas I et al. 2008. Endosymbiont gene functions impaired and rescued by polymerase infidelity at poly(A) tracts. Proc. Natl. Acad. Sci. USA. 105:14934-14939.

Tamura K et al. 2011. MEGA5: Molecular Evolutionary Genetics Analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28:2731-2739.

Tatusov RL, Koonin EV and Lipman DJ. 1997. A genomic perspective on protein families. Science. 24:631-637.

Taylor CM, Coffey PL, DeLay BD and Dively GP. 2014. The importance of gut symbionts in the development of the brown marmorated stink bug, Halyomorpha halys (Stål). PLoS ONE. 9:e90312.

Thorvaldsdóttir H, Robinson JT and Mesirov JP. 2012. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14:179-192.

Trun N. 2003. Mutations in the E. coli Rep helicase increase the amount of DNA per cell. FEMS Microbiol. Lett. 226:187-193.

Untergasser A et al. 2012. Primer3--new capabilities and interfaces. Nucleic Acids Res. 40:e115.

Vetek G, Papp V, Haltrich A and Redei D. 2014. First record of the brown marmorated stink bug, Halyomorpha halys (Hemiptera: Heteroptera: Pentatomidae), in Hungary, with description of the genitalia of both sexes. Zootaxa. 3780:194-200.

Wang M and Caetano-Anolles G. 2009. The evolutionary mechanics of domain organization in the proteomes and the rise of modularity in the protein world. Structure 17:66-78.

Wang M et al. 2007. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Research. 17:1572-1585.

Weber A, Kögl SA and Jung K. 2006. Time-dependent proteome alterations under osmotic stress during aerobic and anaerobic growth in Escherichia coli. J. Bacteriol. 188:7165-7175.

Weiss BL, Wu Y, Schwank JJ, Tolwinski NS and Aksoy S. 2008. An insect symbiosis is influenced by bacterium-specific polymorphisms in outer-membrane protein A. Proc. Natl. Acad. Sci. 105:15088-15093.

Wells TJ et al. 2008. EhaA is a novel autotransporter protein of enterohemorrhagic Escherichia coli O157:H7 that contributes to adhesion and biofilm formation. Environ. Microbiol. 10:589-604.

Wermelinger B, Wyniger D and Forster B. 2008. First records of an invasive bug in Europe: Halyomorpha halys Stål (Heteroptera: Pentatomidae), a new pest on woody ornamentals and fruit trees? Mit. Sch. Ges. 81:1.

Wernergreen JJ and Moran NA. 1999. Evidence for genetic drift in endosymbionts (Buchnera): Analysis of protein-coding genes. Mol. Biol. Evol. 16:83-97.

Wernergreen JJ. 2002. Genome evolution in bacterial endosymbionts of insects. Nature 3:850-861.

Wilkinson A et al. 2005. Analysis of ligation and DNA binding by Escherichia coli DNA ligase (LigA). Biochim Biophys Acta. 1749:113-122.

Xi J, Ge Y, Kinsland C, McLafferty FW, and Begley TP. 2001. Biosynthesis of the thiazole moiety of thiamin in Escherichia coli: identification of an acyldisulfide-linked protein–protein conjugate that is functionally analogous to the ubiquitin/E1 complex. Proc. Natl. Acad. Sci. 98:8513-8518.

Xu J, Fonseca DM, Hamilton GC, Hoelmer KA and Nielsen AL. 2014. Tracing the origin of US brown marmorated stink bugs, Halyomorpha halys. Biol. Invasions. 16:153-166.

Yang Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586-1591.

Yoshida N et al. 2001. Protein function: Chaperone turned insect toxin. Nature. 411:44.

Zhou J and Rudd KE. 2013. EcoGene 3.0. Nucleic Acids Res. 41:D613-D624.

Zhu G, Bu W, Gao Y and Liu Guonquing. 2012. Potential geographic distribution of brown marmorated stink bug invasion (Halyomorpha halys). PLoS ONE. doi:10.1371/journal.pone.0031246.

Appendix A: Supplemental Materials for Chapter 1

Enterobacteriaceae  Species                                     Genome Size (Mb)  Life Style  NCBI Accession
BaJF98              Buchnera aphidicola JF98                    0.64              OIE         NC_017254.1
baJF99              Buchnera aphidicola JF99                    0.64              OIE         NC_017253.1
bapAPS              Buchnera aphidicola APS                     0.66              OIE         NC_002528.1
baph5A              Buchnera aphidicola 5A                      0.64              OIE         NC_011833.1
baphAk              Buchnera aphidicola Ak                      0.65              OIE         NC_017256.1
baphBp              Buchnera aphidicola Bp                      0.62              OIE         NC_004545.1
baphCc              Buchnera aphidicola Cc                      0.42              OIE         NC_008513.1
baphSg              Buchnera aphidicola Sg                      0.64              OIE         NC_004061.1
baphTu              Buchnera aphidicola Tuc7                    0.64              OIE         NC_011834.1
baphUa              Buchnera aphidicola Ua                      0.63              OIE         NC_017259.1
baumHc              Baumannia cicadellinicola Hc                0.69              OIE         NC_007984.1
blochc              Candidatus Blochmannia chromaiodes 640      0.79              OIE         NC_020075.1
blochf              Candidatus Blochmannia floridanus           0.71              OIE         NC_005061.1
blochp              Candidatus Blochmannia pennsylvanicus BPEN  0.79              OIE         NC_007292.1
blochv              Candidatus Blochmannia vafer BVAF           0.72              OIE         NC_014909.2
dicdad              Dickeya dadantii Ech703                     4.68              nonOIE      NC_012880.1
diczea              Dickeya zeae Ech1591                        4.81              nonOIE      NC_012912.1
entclo              Enterobacter cloacae ATCC 13047             5.6               nonOIE      NC_014121.1
erwamy              Erwinia amylovora CFBP1430                  3.83              nonOIE      NC_013961.1
erwEjp              Erwinia Ejp617                              3.96              nonOIE      NC_017445.1
erwpyr              Erwinia pyrifoliae Ep1 96                   4.07              nonOIE      NC_012214.1
glapsy              Glaciecola psychrophila 170                 5.41              nonOIE      NC_020514.1
morPCI              Candidatus Moranella endobia PCIT           0.54              OIE         NC_015735.1
morPCV              Candidatus Moranella endobia PCVAL          0.54              OIE         NC_021057.1
panvag              Pantoea vagans C9 1                         4.89              nonOIE      NC_014562.1
pecatr              Pectobacterium atrosepticum SCRI1043        5.06              nonOIE      NC_004547.2
peccar              Pectobacterium carotovorum PC1              4.86              nonOIE      NC_012917.1
pecwas              Pectobacterium wasabiae WPP163              5.06              nonOIE      NC_013421.1
raquCI              Rahnella aquatilis ATCC 33071               5.45              nonOIE      NC_016818.1

Table A.1: All species included in the ortholog analysis, including the abbreviations used in the text, genome size, life style, and accession numbers for Enterobacteriaceae species.

continued

Table A.1 continued

Enterobacteriaceae  Species                                                          Genome Size (Mb)  Life Style  NCBI Accession
raquHX              Rahnella aquatilis HX2                                           5.66              nonOIE      NC_017047.1
riesia              Candidatus Riesia pediculicola USDA                              0.58              OIE         NC_014109.1
salbon              Salmonella bongori NCTC 12419                                    4.46              nonOIE      NC_015761.1
salent              Salmonella enterica serovar Dublin CT 02021853                   4.92              nonOIE      NC_011205.1
sermar              Serratia marcescens FGI94                                        4.86              nonOIE      NC_020064.1
serpro              Serratia proteamaculans 568                                      5.5               nonOIE      NC_009832.1
sersym              Serratia symbiotica Cinara cedri                                 1.76              nonOIE      NC_016632.1
shigbo              Shigella boydii CDC 3083 94                                      4.87              nonOIE      NC_010658.1
shigdy              Shigella dysenteriae Sd197                                       4.56              nonOIE      NC_007606.1
shigfl              Shigella flexneri 2002017                                        4.89              nonOIE      NC_017328.1
shigso              Shigella sonnei 53G                                              5.22              nonOIE      NC_016822.1
sodglm              Sodalis glossinidius morsitans                                   4.29              nonOIE      NC_007712.1
wigbre              Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis  0.7               OIE         NC_004344.2
wigmor              Wigglesworthia glossinidia endosymbiont of Glossina morsitans    0.72              OIE         NC_016893.1
xenbov              Xenorhabdus bovienii SS 2004                                     4.23              nonOIE      NC_013892.1
xennem              Xenorhabdus nematophila ATCC 19061                               4.59              nonOIE      NC_014228.1
yerent              Yersinia enterocolitica palearctica Y11                          4.63              nonOIE      NC_017564.1
yerpes              Yersinia pestis Angola                                           4.69              nonOIE      NC_010159.1

Table A.2: All species included in the ortholog analysis, including the abbreviations used in the text, genome size, life style, and accession numbers for Flavobacteriaceae species.

Flavobacteriaceae  Species                                          Genome Size (Mb)  Life Style  NCBI Accession
bge                Blattabacterium Blattella germanica              0.64              OIE         NC_013454.1
bgiga              Blattabacterium giganteus                        0.63              OIE         NC_017924.1
bor                Blattabacterium Blatta orientalis Tarazona       0.634             OIE         NC_020195.1
bpaa               Blattabacterium Panesthia angustipennis spadica  0.63              OIE         NC_020510.1
bplan              Blattabacterium Periplaneta americana            0.64              OIE         NC_013418.2
cpunc              Blattabacterium Cryptocercus punctulatus         0.61              OIE         NC_016621.1
fbal               Flavobacteriaceae bacterium 3519 10              2.80              nonOIE      NC_013062.1
fbra               Flavobacterium branchiophilum FL 15              3.56              nonOIE      NC_016001.1
fcol               Flavobacterium columnare ATCC 49512              3.16              nonOIE      NC_016510.2
find               Flavobacterium indicum GPTSA100 9                2.99              nonOIE      NC_017025.1
fjoh               Flavobacterium johnsoniae UW101                  6.10              nonOIE      NC_009441.1
fpysch             Flavobacterium psychrophilum JIP02 86            2.86              nonOIE      NC_009613.1
madar              Blattabacterium Mastotermes darwiniensis         0.59              OIE         NC_016146.1
r15868             Riemerella anatipestifer ATCC 11845              2.16              nonOIE      NC_014738.1
ranagd             Riemerella anatipestifer RA GD                   2.17              nonOIE      NC_017569.1
ranaym             Riemerella anatipestifer RA YM                   2.13              nonOIE      AENH01
ranch1             Riemerella anatipestifer RA CH 1                 2.31              nonOIE      NC_018609.1
ranch2             Riemerella anatipestifer RA CH 2                 2.17              nonOIE      NC_020125.1
smcari             Candidatus Sulcia muelleri CARI                  0.28              OIE         NC_014499.1
smdim              Candidatus Sulcia muelleri DMIN                  0.24              OIE         NC_014004.1
SMDSEM             Candidatus Sulcia muelleri SMDSEM                0.276984          OIE         NC_013123.1
SMGWSS             Candidatus Sulcia muelleri GWSS                  0.24553           OIE         NC_010118.1

Table A.3: Description of the dataset used in the orthologous protein analysis, including the number of taxa used in each life style group and the COG distribution of the orthologs used.

                                         Flavobacteriaceae  Enterobacteriaceae
No. obligate insect endosymbionts (OIE)  11                 20
No. all other lifestyles (nonOIE)        11                 27
Total no. orthologs                      82                 71
min-max protein lengths                  62-1331 a.a.       51-812 a.a.

COG Category of Orthologs
C                                        11                 2
D                                        -                  1
E                                        7                  -
FE                                       1                  1
FJ                                       -                  1
G                                        1                  5
H                                        2                  -
IQ                                       -                  1
J                                        44                 40
K                                        3                  4
L                                        2                  4
M                                        1                  1
MU                                       1                  -
O                                        4                  5
OU                                       -                  1
R                                        4                  2
S                                        -                  1
U                                        1                  2

Enterobacteriaceae Orthologs  Description
gidA                          tRNA uridine 5-carboxymethylaminomethyl modification enzyme
gyrB                          DNA gyrase subunit B
dnaN                          DNA polymerase III subunit beta
trmE                          tRNA modification GTPase TrmE
groES                         co-chaperonin GroES
efp                           elongation factor P
rplL                          50S ribosomal protein L7/L12
rplJ                          50S ribosomal protein L10
rplA                          50S ribosomal protein L1
rplK                          50S ribosomal protein L11
secE                          preprotein translocase subunit SecE
rpmG                          50S ribosomal protein L33
fabB                          3-oxoacyl-ACP synthase
rplT                          50S ribosomal protein L20
dnaK                          molecular chaperone DnaK
prsA                          ribose-phosphate pyrophosphokinase
ydhD                          hypothetical protein
pth                           peptidyl-tRNA hydrolase
truA                          tRNA pseudouridine synthase A
lpdA                          dihydrolipoamide dehydrogenase
yadR                          iron-sulfur cluster insertion protein
mraW                          S-adenosyl-methyltransferase MraW
rpsB                          30S ribosomal protein S2
frr                           ribosome recycling factor
dnaQ                          DNA polymerase III subunit epsilon
smpB                          SsrA-binding protein
yfhC                          hypothetical protein
rnc                           ribonuclease III
mnmA                          tRNA-specific 2-thiouridylase MnmA
suhB                          extragenic suppressor protein SuhB
gapA                          glyceraldehyde 3-phosphate dehydrogenase A
fldA                          flavodoxin FldA
gpmA                          phosphoglyceromutase
tpiA                          triosephosphate isomerase

Table A.4: List of orthologous proteins used in this study (represented by gene names and descriptions).

continued

106

Table A.4 continued tatD hypothetical protein asnC asparaginyl-tRNA synthetase rpsO 30S ribosomal protein S15 greA transcription elongation factor GreA yrbA hypothetical protein rpmA 50S ribosomal protein L27 obgE GTPase ObgE rpsI 30S ribosomal protein S9 rplM 50S ribosomal protein L13 rpsP 30S ribosomal protein S16 rplS 50S ribosomal protein L19 eno phosphopyruvate hydratase clpP ATP-dependent Clp protease proteolytic lon ATP-dependent protease LA cysS cysteinyl-tRNA synthetase fmt methionyl-tRNA formyltransferase rplQ 50S ribosomal protein L17 rpoA DNA-directed RNA polymerase subunit rpsD 30S ribosomal protein S4 rpsK 30S ribosomal protein S11 rpsM 30S ribosomal protein S13 secY preprotein translocase subunit SecY rpsE 30S ribosomal protein S5 rplR 50S ribosomal protein L18 rplF 50S ribosomal protein L6 rpsH 30S ribosomal protein S8 rpsN 30S ribosomal protein S14 rplE 50S ribosomal protein L5 rplX 50S ribosomal protein L24 rplN 50S ribosomal protein L14 rplP 50S ribosomal protein L16 rplV 50S ribosomal protein L22 rplD 50S ribosomal protein L4 rpsJ 30S ribosomal protein S10 rpsG 30S ribosomal protein S7 rpsR 30S ribosomal protein S18 rpsF 30S ribosomal protein S6 continued

107

Table A.4 continued.

Flavobacteriaceae Description Orthologs aroB 3-dehydroquinate_synthase aroC chorismate_synthase aroD 3-dehydroquinate_dehydratase aroK shikimate_kinase asd aspartate-semialdehyde_dehydrogenase aspC aspartate_transaminase atpA ATP_synthase_F1_subunit_alpha atpD ATP_synthase_F1_subunit_beta atpE ATP_synthase_F0_subunit_c atpF ATP_synthase_F0_subunit_b dapF diaminopimelate_epimerase dnaK chaperone_DnaK fusA translation_elongation_factor_G gapA glyceraldehyde_3-phosphate_dehydrogenase groEL chaperone_GroEL groES chaperone_GroES gyrB DNA_gyrase_subunit_B infA translation_initiation_factor_IF-1 ispB polyprenyl_synthetase ksgA dimethyladenosine_transferase lgt prolipoprotein_diacylglyceryl_transferase lipB lipoate-protein_ligase lpdA dihydrolipoyl_dehydrogenase lspA signal_peptidase_II miaB MiaB_family_tRNA_modification_enzyme mutS DNA_mismatch_repair_protein_MutS obg GTP-binding_protein pdhA pyruvate_dehydrogenase_E1 _subunit_alpha pdhB pyruvate_dehydrogenase_E1_subunit_beta pnp polyribonucleotide_nucleotidyltransferase prfA peptide_chain_release_factor_1 prsA ribose-phosphate_diphosphokinase rpIK 50S_ribosomal_protein_L11 continued

108

Table A.4 continued

rplA 50S_ribosomal_protein_L1 rplB 50S_ribosomal_protein_L2 rplC 50S_ribosomal_protein_L3 rplD 50S_ribosomal_protein_L4 rplE 50S_ribosomal_protein_L5 rplF 50S_ribosomal_protein_L6 rplL 50S_ribosomal_protein_L7+L12 rplN 50S_ribosomal_protein_L14 rplO 50S_ribosomal_protein_L15 rplP 50S_ribosomal_protein_L16 rplQ 50S_ribosomal_protein_L17 rplV 50S_ribosomal_protein_L22 rplY 50S_ribosomal_protein_L25 rpmA 50S_ribosomal_protein_L27 rpmE 50S_ribosomal_protein_L31 rpmI 50S_ribosomal_protein_L35 rpoA RNA_polymerase_subunit_alpha rpoB RNA_polymerase_subunit_beta rpoD RNA_polymerase_sigma-70_factor rpsB 30S_ribosomal_protein_S2 rpsC 30S_ribosomal_protein_S3 rpsD 30S_ribosomal_protein_S4 rpsE 30S_ribosomal_protein_S5 rpsF 30S_ribosomal_protein_S6 rpsG 30S_ribosomal_protein_S7 rpsH 30S_ribosomal_protein_S8 rpsI 30S_ribosomal_protein_S9 rpsJ 30S_ribosomal_protein_S10 rpsK 30S_ribosomal_protein_S11 rpsL 30S_ribosomal_protein_S12 rpsM 30S_ribosomal_protein_S13 rpsN 30S_ribosomal_protein_S14 rpsO 30S_ribosomal_protein_S15 rpsQ 30S_ribosomal_protein_S17 rsmI 16S rRNA methyltransferase serS serine-tRNA_ligase continued

109

Table A.4 continued

smpB SsrA-binding_protein sucA 2-oxoglutarate_dehydrogenase sucB dihydrolipoyllysine-residue_succinyltransferase sucC succinyl-CoA_ligase_subunit_beta sucD succinyl-CoA_ligase_subunit_alpha sufE cysteine desulfuration protein SufE tatC Sec-independent_protein_translocase_TatC trmE tRNA_modification_GTPase_TrmE trpS tryptophan-tRNA_ligase truA tRNA_pseudouridine_synthase_A tuf translation_elongation_factor_Tu tyrS tyrosine-tRNA_ligase rplT 50S_ribosomal_protein_L20

Table A.5: Data table listing the proteomes used in the maximum protein length study and descriptions of proteomes, including life style, genome size, total number of proteins, average, maximum, and minimum protein lengths for all Gammaproteobacteria species included.

Species | Life Style | Genome Size (Mb) | Total # Proteins | Avg. Protein Size | Max. Protein Size | Min. Protein Size
Buchnera aphidicola JF98 Acyrthosiphon pisum | OIE | 0.64 | 477 | 254.25 | 955 | 23
Candidatus Carsonella ruddii CE isolate Thao2000 | OIE | 0.16 | 190 | 268.50 | 1284 | 37
Candidatus Carsonella ruddii CS isolate Thao2000 | OIE | 0.16 | 190 | 268.74 | 1284 | 37
Candidatus Carsonella ruddii HC isolate Thao2000 | OIE | 0.17 | 192 | 273.38 | 1287 | 37
Candidatus Carsonella ruddii HT isolate Thao2000 | OIE | 0.16 | 180 | 274.01 | 1287 | 37
Candidatus Carsonella ruddii PC isolate NHV | OIE | 0.16 | 182 | 275.73 | 1288 | 37
Candidatus Carsonella ruddii PV | OIE | 0.16 | 182 | 274.30 | 1292 | 37
Candidatus Portiera aleyrodidarum BT-Q-AWRs | OIE | 0.35 | 280 | 285.10 | 1372 | 37
Candidatus Portiera aleyrodidarum BT B-HRs | OIE | 0.35 | 273 | 287.24 | 1372 | 37
Candidatus Portiera aleyrodidarum BT QVLC | OIE | 0.36 | 246 | 326.85 | 1372 | 37
Candidatus Portiera aleyrodidarum BT-B | OIE | 0.36 | 256 | 314.14 | 1372 | 29
Candidatus Portiera aleyrodidarum TV | OIE | 0.28 | 269 | 327.04 | 1376 | 37
Candidatus Riesia pediculicola USDA | OIE | 0.58 | 544 | 283.90 | 1392 | 19
Candidatus Moranella endobia PCIT | OIE | 0.54 | 406 | 339.67 | 1402 | 33
Candidatus Moranella endobia PCVAL | OIE | 0.54 | 411 | 336.36 | 1402 | 33
Buchnera aphidicola Bp Baizongia pistaciae | OIE | 0.62 | 504 | 329.66 | 1404 | 38
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis | OIE | 0.70 | 611 | 329.99 | 1405 | 38
Wigglesworthia glossinidia endosymbiont of Glossina morsitans Yale colony | OIE | 0.72 | 618 | 326.67 | 1405 | 38
Buchnera aphidicola Ua Uroleucon ambrosiae | OIE | 0.63 | 529 | 331.86 | 1406 | 38
Buchnera aphidicola JF99 Acyrthosiphon pisum | OIE | 0.64 | 590 | 308.21 | 1407 | 30
Buchnera aphidicola 5A Acyrthosiphon pisum | OIE | 0.64 | 555 | 330.47 | 1407 | 38
Buchnera aphidicola Tuc7 Acyrthosiphon pisum | OIE | 0.64 | 553 | 329.81 | 1407 | 38
Buchnera aphidicola Ak Acyrthosiphon kondoi | OIE | 0.65 | 559 | 327.67 | 1407 | 38
Buchnera aphidicola APS Acyrthosiphon pisum | OIE | 0.66 | 564 | 328.27 | 1407 | 38
Buchnera aphidicola LL01 Acyrthosiphon pisum | OIE | 0.64 | 577 | 300.35 | 1407 | 33
Buchnera aphidicola TLW03 Acyrthosiphon pisum | OIE | 0.64 | 573 | 300.33 | 1407 | 30
Baumannia cicadellinicola Hc | OIE | 0.69 | 595 | 327.82 | 1408 | 31
Serratia symbiotica Cinara cedri | nonOIE | 1.76 | 672 | 338.49 | 1408 | 38
Buchnera aphidicola Cc Cinara cedri | OIE | 0.42 | 357 | 329.73 | 1410 | 38
Buchnera aphidicola Sg Schizaphis graminum | OIE | 0.64 | 546 | 326.16 | 1413 | 38
Candidatus Blochmannia vafer BVAF | OIE | 0.72 | 587 | 334.27 | 1416 | 38
Candidatus Blochmannia chromaiodes 640 | OIE | 0.79 | 609 | 330.96 | 1416 | 38
Candidatus Blochmannia Pennsylvanicus BPEN | OIE | 0.79 | 610 | 330.76 | 1416 | 38
Candidatus Blochmannia floridanus | OIE | 0.71 | 583 | 334.59 | 1420 | 38
Buchnera aphidicola Cinara tujafilina | OIE | 0.44 | 360 | 320.03 | 1420 | 38
Sodalis glossinidius morsitans | nonOIE | 4.29 | 2432 | 290.05 | 1484 | 33
Shigella flexneri 2002017 | nonOIE | 4.89 | 4372 | 286.20 | 1517 | 24
Shigella dysenteriae Sd197 | nonOIE | 4.56 | 4270 | 262.20 | 1588 | 14
Shigella boydii CDC 3083 94 | nonOIE | 4.87 | 4246 | 276.67 | 1653 | 21
Shigella sonnei 53G | nonOIE | 5.22 | 5139 | 279.11 | 1653 | 19
Psychrobacter cryohalolentis K5 | nonOIE | 3.10 | 2467 | 342.94 | 2301 | 38
Pseudoxanthomonas suwonensis 11 1 | nonOIE | 3.42 | 3070 | 330.05 | 2310 | 31
Alteromonas macleodii Black Sea 11 | nonOIE | 4.48 | 3772 | 343.85 | 2877 | 33
Pseudoxanthomonas spadix BD a59 | nonOIE | 3.45 | 3149 | 318.56 | 2887 | 26
Glaciecola nitratireducens FR1064 | nonOIE | 4.13 | 3654 | 332.37 | 2955 | 35
Glaciecola psychrophila 170 | nonOIE | 5.41 | 5618 | 266.72 | 3094 | 37
Yersinia pestis Angola | nonOIE | 4.69 | 3832 | 300.04 | 3163 | 37
Yersinia enterocolitica palearctica Y11 | nonOIE | 4.63 | 4349 | 293.49 | 3245 | 37
Candidatus Hamiltonella defensa 5AT (Acyrthosiphon pisum) | nonOIE | 2.11 | 2094 | 269.14 | 3259 | 30
Serratia marcescens FGI94 | nonOIE | 4.86 | 4361 | 319.06 | 3285 | 25
Methylomicrobium alcaliphilum | nonOIE | 4.67 | 3981 | 318.35 | 3437 | 20
Xylella fastidiosa 9a5c | nonOIE | 2.73 | 2766 | 268.25 | 3455 | 30
Enterobacter cloacae ATCC 13047 | nonOIE | 5.60 | 5120 | 303.05 | 3546 | 37
Thalassolituus oleivorans MIL 1 | nonOIE | 3.92 | 3662 | 320.79 | 3596 | 31
Serratia proteamaculans 568 | nonOIE | 5.50 | 4891 | 323.04 | 3602 | 29
Xanthomonas campestris vesicatoria 85 10 | nonOIE | 5.42 | 4487 | 335.15 | 3709 | 30
Colwellia psychrerythraea 34H | nonOIE | 5.37 | 4910 | 307.98 | 3758 | 15
Dickeya zeae Ech1591 | nonOIE | 4.81 | 4163 | 325.55 | 3796 | 30
Xanthomonas oryzae oryzicola BLS256 | nonOIE | 4.83 | 4474 | 306.25 | 3874 | 26
Serratia plymuthica AS9 | nonOIE | 5.44 | 4952 | 318.01 | 4169 | 30
Rahnella aquatilis CIP 78 65 ATCC 33071 | nonOIE | 5.45 | 4344 | 324.23 | 4595 | 29
Rahnella aquatilis HX2 | nonOIE | 5.66 | 4443 | 319.21 | 4599 | 30
Xanthomonas citri Aw12879 | nonOIE | 5.40 | 4675 | 326.31 | 4743 | 30
Pectobacterium carotovorum PC1 | nonOIE | 4.86 | 4246 | 328.81 | 4874 | 30
Xanthomonas axonopodis citrumelo F1 | nonOIE | 4.97 | 4181 | 338.34 | 5092 | 30
Salmonella bongori NCTC 12419 | nonOIE | 4.46 | 3863 | 329.14 | 5556 | 25
Salmonella enterica serovar Dublin CT 02021853 | nonOIE | 4.92 | 4513 | 296.08 | 5559 | 37
Erwinia Ejp617 | nonOIE | 3.96 | 3600 | 305.51 | 5951 | 37
Pectobacterium wasabiae WPP163 | nonOIE | 5.06 | 4437 | 319.64 | 5981 | 30
Xenorhabdus nematophila ATCC 19061 | nonOIE | 4.59 | 4298 | 285.62 | 5994 | 20
Pantoea vagans C9 1 | nonOIE | 4.89 | 3664 | 316.14 | 6003 | 23
Alteromonas macleodii Deep ecotype | nonOIE | 4.45 | 4084 | 313.75 | 6572 | 21
Psychrobacter arcticus 273 4 | nonOIE | 2.65 | 2120 | 335.09 | 6715 | 38
Dickeya dadantii Ech703 | nonOIE | 4.68 | 3970 | 333.80 | 6876 | 30
Erwinia amylovora CFBP1430 | nonOIE | 3.83 | 3677 | 293.97 | 7025 | 17
Erwinia pyrifoliae Ep1 96 | nonOIE | 4.07 | 3645 | 311.71 | 7028 | 38
Pectobacterium atrosepticum SCRI1043 | nonOIE | 5.06 | 4472 | 323.38 | 7523 | 31
Xenorhabdus bovienii SS 2004 | nonOIE | 4.23 | 4258 | 282.25 | 9647 | 20
Xanthomonas albilineans GPE PC73 | nonOIE | 3.85 | 3114 | 353.13 | 10708 | 24
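
The per-proteome columns of Table A.5 (total number of proteins and average, maximum, and minimum protein length) could be computed along the following lines. This is only a minimal sketch, assuming each proteome is available as protein sequences in FASTA format; the function names `read_fasta_lengths` and `summarize_proteome` and the toy sequences are hypothetical and are not taken from the thesis pipeline.

```python
from statistics import mean


def read_fasta_lengths(lines):
    """Return the length (in amino acids) of each sequence in FASTA-formatted lines."""
    lengths = []
    current = None  # running length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Assumption: a trailing '*' marks a stop codon and is not counted.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_proteome(lengths):
    """Summarize one proteome with the same columns as Table A.5."""
    return {
        "total_proteins": len(lengths),
        "avg_protein_size": round(mean(lengths), 2),
        "max_protein_size": max(lengths),
        "min_protein_size": min(lengths),
    }


# Toy proteome with three made-up sequences (not real data).
toy_fasta = [">p1", "MKT", "AAL", ">p2", "MS*", ">p3", "MKKLLPTAAAG"]
toy_lengths = read_fasta_lengths(toy_fasta)
toy_summary = summarize_proteome(toy_lengths)
print(toy_summary)
```

In practice the input would be the lines of a proteome file, for example `read_fasta_lengths(open(path))` for some local file path, with one summary produced per species.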

Table A.6: Data table listing the proteomes used in the maximum protein length study and descriptions of proteomes, including life style, genome size, total number of proteins, average, maximum, and minimum protein lengths for all Bacteroidetes species included.

Species | Life Style | Genome Size (Mb) | Total # Proteins | Avg. Protein Size | Max. Protein Size | Min. Protein Size
Candidatus Sulcia muelleri CARI | OIE | 0.28 | 246 | 339.76 | 1407 | 38
Candidatus Uzinura diaspidicola ASNER | OIE | 0.26 | 227 | 333.63 | 1429 | 38
Blattabacterium Mastotermes darwiniensis MADAR | OIE | 0.59 | 537 | 341.74 | 1432 | 38
Blattabacterium Cryptocercus punctulatus Cpu | OIE | 0.61 | 545 | 340.63 | 1432 | 29
Blattabacterium Panesthia angustipennis spadica BPAA | OIE | 0.63 | 575 | 342.09 | 1433 | 38
Blattabacterium Blaberus giganteus | OIE | 0.63 | 572 | 346.90 | 1433 | 35
Blattabacterium Blattella germanica Bge | OIE | 0.64 | 586 | 344.01 | 1434 | 35
Blattabacterium Periplaneta americana BPLAN | OIE | 0.64 | 577 | 344.54 | 1435 | 30
Blattabacterium Blatta orientalis Tarazona | OIE | 0.64 | 572 | 343.77 | 1435 | 38
Candidatus Sulcia muelleri SMDSEM | OIE | 0.28 | 242 | 343.41 | 1450 | 38
Candidatus Sulcia muelleri DMIN | OIE | 0.24 | 226 | 338.17 | 1483 | 30
Candidatus Sulcia muelleri GWSS | OIE | 0.25 | 227 | 331.11 | 1507 | 38
Bacteroides vulgatus ATCC 8482 | nonOIE | 5.16 | 4066 | 373.52 | 1892 | 31
Bacteroides salanitronis DSM 18170 | nonOIE | 4.24 | 3553 | 343.75 | 1943 | 30
Bacteroides helcogenes P 36-108 | nonOIE | 4.00 | 3244 | 364.84 | 1954 | 30
Bacteroides fragilis NCTC 9343 | nonOIE | 5.21 | 4184 | 364.70 | 1957 | 30
B. fragilis YCH46 | nonOIE | 5.28 | 4577 | 345.28 | 1957 | 38
Bacteroides xylanisolvens XB1A | nonOIE | 5.98 | 4407 | 363.93 | 2045 | 14
Bacteroides fragilis 638R | nonOIE | 5.37 | 4290 | 367.12 | 2048 | 30
Bacteroides thetaiotaomicron VPI-5482 | nonOIE | 6.26 | 4778 | 390.20 | 2183 | 32
Riemerella anatipestifer ATCC 11845 DSM 15868 | nonOIE | 2.16 | 1941 | 317.55 | 2323 | 37
Riemerella anatipestifer RA CH 1 | nonOIE | 2.31 | 2186 | 325.88 | 2340 | 30
Flavobacterium indicum GPTSA100 9 | nonOIE | 2.99 | 2671 | 334.87 | 2372 | 38
Zunongwangia profunda SM A87 | nonOIE | 5.13 | 4653 | 319.24 | 2385 | 33
Flavobacteriaceae bacterium 3519 10 | nonOIE | 2.77 | 2534 | 328.58 | 2407 | 30
Riemerella anatipestifer RA GD | nonOIE | 2.17 | 2044 | 317.41 | 3045 | 37
Riemerella anatipestifer RA CH 2 | nonOIE | 2.17 | 1972 | 313.98 | 3045 | 37
Riemerella anatipestifer RA YM | nonOIE | 2.13 | 1985 | 315.37 | 3045 | 25
Flavobacterium branchiophilum FL 15 | nonOIE | 3.56 | 2867 | 342.27 | 3174 | 33
Flavobacterium psychrophilum JIP02 86 | nonOIE | 2.86 | 2412 | 333.58 | 3325 | 27
Psychroflexus torquis ATCC 700755 | nonOIE | 4.32 | 3526 | 330.19 | 4408 | 31
Robiginitalea biformata HTCC2501 | nonOIE | 3.53 | 3209 | 335.17 | 4484 | 20
Gramella forsetii KT0803 | nonOIE | 3.80 | 3584 | 318.69 | 4528 | 30
Muricauda ruestringensis DSM 13258 | nonOIE | 3.84 | 3432 | 336.21 | 4538 | 30
Flavobacterium columnare ATCC 49512 | nonOIE | 3.16 | 2642 | 339.34 | 4739 | 32
Cellulophaga lytica DSM 7489 | nonOIE | 3.77 | 3284 | 346.63 | 4782 | 31
Capnocytophaga canimorsus Cc5 | nonOIE | 2.57 | 2404 | 314.53 | 5042 | 37
Krokinobacter 4H 3 7 5 | nonOIE | 3.39 | 2978 | 341.50 | 5218 | 31
Capnocytophaga ochracea DSM 7271 | nonOIE | 2.61 | 2171 | 345.15 | 5298 | 31
Cellulophaga algicola DSM 14237 | nonOIE | 4.89 | 4163 | 341.67 | 6081 | 30
Flavobacterium johnsoniae UW101 | nonOIE | 6.10 | 5017 | 352.67 | 6497 | 30
Polaribacter irgensii 23 P | nonOIE | 2.75 | 2635 | 347.60 | 7986 | 35

Figure A.1: Phylogenetic tree of Flavobacteriaceae created with RAxML using all of the orthologs shared between Flavobacteriaceae and Enterobacteriaceae. Shigella flexneri 2002017 is the outgroup for the Flavobacteriaceae tree. Pink colors indicate obligate insect endosymbionts and blue indicates bacteria that are not obligate insect endosymbionts.

[Tree image not reproduced. Tip labels as printed: smdsem, smdmin, smgwss, smcari, bgerma, bgigan, bpaasp, madarw, cpunct, bplane, borien, fcolum, fbra15, fjohns, fpsych, findic, fbal38, ranaym, ranch1, r15868, ranagd, ranch2, shigfl. Scale bar: 0.2.]

Figure A.2: Phylogenetic tree of Enterobacteriaceae created with RAxML using all of the orthologs shared between Flavobacteriaceae and Enterobacteriaceae. Flavobacteriaceae bacterium 3519 is the outgroup for Enterobacteriaceae. Pink colors indicate obligate insect endosymbionts and blue indicates bacteria that are not obligate insect endosymbionts.

[Tree image not reproduced. Tip labels as printed: glapsy, raquCI, raquHX, sersym, sermar, serpro, yerent, yerpes, pecwas, pecatr, peccar, diczea, dicdad, panvag, erwamy, erwpyr, erwEjp, entclo, salbon, salent, shigbo, shigfl, shigdy, shigso, blochf, blochv, blochp, blochc, baphBp, baphCc, baph5A, bapAPS, baJF99, BaJF98, baphTu, baphAk, baphUa, baphSg, riesia, wigbre, wigmor, baumHc, morPCI, morPCV, sodglm, xennem, xenbov, fbal38. Scale bar: 0.3.]

Figure A.3: Bubble plots of all the orthologous proteins for Enterobacteriaceae. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents.

[Plot images not reproduced. Each panel has one row of bubbles for OIE and one for nonOIE. Panels, in printed order: yfhC, rpsH, rpsN, rpmG, rpsI, rplL, rpsD, rpsM, groES, rpsO, rpoA, rpsK, rplM, rplS, rplE, rplT, yadR, rplJ, efp, frr, rpmA, rplF, rplK, secE, greA, gyrB, rpsR, ydhD, truA, dnaK, fabB, prsA, dnaN, mnmA, trmE, dnaQ, gpmA, lpdA, asnC, mraW, gapA, gidA, obgE, rplR, pth, fldA, rpsP, rplQ, ycfH, fmt, cysS, lon, tpiA, smpB, eno, rpsE, rnc, rpsB, secY, rplA, yrbA, clpP, rpsF, suhB, rplD, rplN, rplP, rplV, rplX, rpsG, rpsJ.]

yrbA clpP OIE OIE

nonOIE nonOIE

78 79 80 81 82 83 84 85 86 87 88 89 90 91 193 195 197 199 201 203 205 207 209 211 213 215

rpsF suhB OIE OIE

nonOIE nonOIE

111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 255 257 259 261 263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293 295

rpsH rplD OIE OIE

nonOIE nonOIE

129 130 131 132 133 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214

rplN rplP OIE OIE

nonOIE nonOIE

121 122 123 124 125 126 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138

rplV rplX OIE OIE

nonOIE nonOIE

108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107

rpsG rpsJ OIE OIE

nonOIE nonOIE

155 156 157 158 102 103 104 105 Enterobacteriaceae yfhC rpsH OIE OIE

nonOIE nonOIE

129 130 131 132 133 141 144 147 150 153 156 159 162 165 168 171 174 177 180 183 186 189 192 195 198 201

rpsN rpmG OIE OIE

nonOIE nonOIE

100 101 102 50 51 52 53 54 55 56 57 58

rpsI rplL OIE OIE

nonOIE nonOIE

119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 119 120 121 122 123 124 125 126 127 128 129 130 131

rpsD rpsM OIE OIE

nonOIE nonOIE

205 206 207 208 117 118 119 120 121 122

groES rpsO OIE OIE

nonOIE nonOIE

95 96 97 98 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91

rpoA rpsK OIE OIE

nonOIE nonOIE

325 326 327 328 329 330 331 332 333 334 335 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148

rplM rplS OIE OIE

nonOIE nonOIE

133 134 135 136 137 138 139 140 141 142 143 144 145 113 114 115 116 117 118 119 120 121

rplE rplT OIE OIE

nonOIE nonOIE

177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 116 117 118 119 120 121 122 123 124 125 126 127

yadR rplJ OIE OIE

nonOIE nonOIE

110 111 112 113 114 115 116 117 118 119 120 160 161 162 163 164 165 166 167 168 169 170 171 172 173

efp frr OIE OIE

nonOIE nonOIE

186 187 188 189 190 191 192 143 146 149 152 155 158 161 164 167 170 173 176 179 182 185 188

rpmA rplF OIE OIE

nonOIE nonOIE

66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

rplK secE OIE OIE

nonOIE nonOIE

140 141 142 143 144 145 146 147 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133

greA gyrB OIE OIE

nonOIE nonOIE

157 158 159 160 161 162 163 796 797 798 799 800 801 802 803 804 805 806 807 808 809

rpsR ydhD OIE OIE

nonOIE nonOIE

74 75 76 77 78 79 80 101 103 105 107 109 111 113 115 117 119 121 123 125 127 129

truA dnaK OIE OIE

nonOIE nonOIE

250 253 256 259 262 265 268 271 274 277 280 283 286 289 292 295 617 619 621 623 625 627 629 631 633 635 637 639 641 643 645 647

fabB prsA OIE OIE

nonOIE nonOIE

402 403 404 405 406 407 408 409 410 411 412 413 414 415 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318

dnaN mnmA OIE OIE

nonOIE nonOIE

365 366 367 368 369 370 371 372 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377

trmE dnaQ OIE OIE

nonOIE nonOIE

452 454 456 458 460 462 464 466 468 470 472 474 229 231 233 235 237 239 241 243 245 247 249 251 253 255 257 259 261

gpmA lpdA OIE OIE

nonOIE nonOIE

133 139 145 151 157 163 169 175 181 187 193 199 205 211 217 223 229 235 241 247 253 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483

asnC mraW OIE OIE

nonOIE nonOIE

460 461 462 463 464 465 466 467 468 469 470 471 268 271 274 277 280 283 286 289 292 295 298 301 304 307 310 313 316 319 322 325 328 331

gapA gidA OIE OIE

nonOIE nonOIE

329 330 331 332 333 334 335 336 337 617 619 621 623 625 627 629 631 633 635 637

obgE rplR OIE OIE

nonOIE nonOIE

239 247 255 263 271 279 287 295 303 311 319 327 335 343 351 359 367 375 383 391 114 115 116 117 118 119 120 121 122 123

pth fldA OIE OIE

nonOIE nonOIE

175 177 179 181 183 185 187 189 191 193 195 197 199 201 203 205 207 209 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180

rpsP rplQ OIE OIE

nonOIE nonOIE

77 78 79 80 81 82 83 84 85 86 87 88 89 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139

ycfH fmt OIE OIE

nonOIE nonOIE

256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 309 311 313 315 317 319 321 323 325 327 329 331 333 335 337 339 341 343 345

cysS lon OIE OIE

nonOIE nonOIE

456 458 460 462 464 466 468 470 472 474 476 727 732 737 742 747 752 757 762 767 772 777 782 787 792 797 802 807 812

tpiA smpB OIE OIE

nonOIE nonOIE

246 248 250 252 254 256 258 260 262 264 266 268 270 272 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164

eno rpsE OIE OIE

nonOIE nonOIE Figure A.3 continued 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 163 165 167 169 171 173 175 177 179 181 183 185

rnc rpsB OIE OIE

nonOIE nonOIE

224 225 226 227 228 229 230 231 232 233 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245

secY rplA OIE OIE

nonOIE nonOIE

430 432 434 436 438 440 442 444 446 448 450 452 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243

yrbA clpP OIE OIE

nonOIE nonOIE

78 79 80 81 82 83 84 85 86 87 88 89 90 91 193 195 197 199 201 203 205 207 209 211 213 215

rpsF suhB OIE OIE

nonOIE nonOIE

111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 255 257 259 261 263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293 295

rpsH rplD OIE OIE

nonOIE nonOIE

129 130 131 132 133 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214

rplN rplP OIE OIE

nonOIE nonOIE

121 122 123 124 125 126 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138

rplV rplX OIE OIE

nonOIE nonOIE

108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107

continued rpsG rpsJ OIE OIE 122

nonOIE nonOIE

155 156 157 158 102 103 104 105 Enterobacteriaceae yfhC rpsH OIE OIE

nonOIE nonOIE

129 130 131 132 133 141 144 147 150 153 156 159 162 165 168 171 174 177 180 183 186 189 192 195 198 201

rpsN rpmG OIE OIE

nonOIE nonOIE

100 101 102 50 51 52 53 54 55 56 57 58

rpsI rplL OIE OIE

nonOIE nonOIE

119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 119 120 121 122 123 124 125 126 127 128 129 130 131

rpsD rpsM OIE OIE

nonOIE nonOIE

205 206 207 208 117 118 119 120 121 122

groES rpsO OIE OIE

nonOIE nonOIE

95 96 97 98 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91

rpoA rpsK OIE OIE

nonOIE nonOIE

325 326 327 328 329 330 331 332 333 334 335 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148

rplM rplS OIE OIE

nonOIE nonOIE

133 134 135 136 137 138 139 140 141 142 143 144 145 113 114 115 116 117 118 119 120 121

rplE rplT OIE OIE

nonOIE nonOIE

177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 116 117 118 119 120 121 122 123 124 125 126 127

yadR rplJ OIE OIE

nonOIE nonOIE

110 111 112 113 114 115 116 117 118 119 120 160 161 162 163 164 165 166 167 168 169 170 171 172 173

efp frr OIE OIE

nonOIE nonOIE

186 187 188 189 190 191 192 143 146 149 152 155 158 161 164 167 170 173 176 179 182 185 188

rpmA rplF OIE OIE

nonOIE nonOIE

66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

rplK secE OIE OIE

nonOIE nonOIE

140 141 142 143 144 145 146 147 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133

greA gyrB OIE OIE

nonOIE nonOIE

157 158 159 160 161 162 163 796 797 798 799 800 801 802 803 804 805 806 807 808 809

rpsR ydhD OIE OIE

nonOIE nonOIE

74 75 76 77 78 79 80 101 103 105 107 109 111 113 115 117 119 121 123 125 127 129

truA dnaK OIE OIE

nonOIE nonOIE

250 253 256 259 262 265 268 271 274 277 280 283 286 289 292 295 617 619 621 623 625 627 629 631 633 635 637 639 641 643 645 647

fabB prsA OIE OIE

nonOIE nonOIE

402 403 404 405 406 407 408 409 410 411 412 413 414 415 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318

dnaN mnmA OIE OIE

nonOIE nonOIE

365 366 367 368 369 370 371 372 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377

trmE dnaQ OIE OIE

nonOIE nonOIE

452 454 456 458 460 462 464 466 468 470 472 474 229 231 233 235 237 239 241 243 245 247 249 251 253 255 257 259 261

gpmA lpdA OIE OIE

nonOIE nonOIE

133 139 145 151 157 163 169 175 181 187 193 199 205 211 217 223 229 235 241 247 253 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483

asnC mraW OIE OIE

nonOIE nonOIE

460 461 462 463 464 465 466 467 468 469 470 471 268 271 274 277 280 283 286 289 292 295 298 301 304 307 310 313 316 319 322 325 328 331

gapA gidA OIE OIE

nonOIE nonOIE

329 330 331 332 333 334 335 336 337 617 619 621 623 625 627 629 631 633 635 637

obgE rplR OIE OIE

nonOIE nonOIE

239 247 255 263 271 279 287 295 303 311 319 327 335 343 351 359 367 375 383 391 114 115 116 117 118 119 120 121 122 123

pth fldA OIE OIE

nonOIE nonOIE

175 177 179 181 183 185 187 189 191 193 195 197 199 201 203 205 207 209 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180

rpsP rplQ OIE OIE

nonOIE nonOIE

77 78 79 80 81 82 83 84 85 86 87 88 89 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139

ycfH fmt OIE OIE

nonOIE nonOIE

256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 309 311 313 315 317 319 321 323 325 327 329 331 333 335 337 339 341 343 345

cysS lon OIE OIE

nonOIE nonOIE

456 458 460 462 464 466 468 470 472 474 476 727 732 737 742 747 752 757 762 767 772 777 782 787 792 797 802 807 812

tpiA smpB OIE OIE

nonOIE nonOIE

246 248 250 252 254 256 258 260 262 264 266 268 270 272 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164

eno rpsE OIE OIE

nonOIE nonOIE

429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 163 165 167 169 171 173 175 177 179 181 183 185

rnc rpsB OIE OIE

nonOIE nonOIE

224 225 226 227 228 229 230 231 232 233 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245

secY rplA OIE OIE

nonOIE nonOIE

430 432 434 436 438 440 442 444 446 448 450 452 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243

yrbA clpP OIE OIE

nonOIE nonOIE

78 79 80 81 82 83 84 85 86 87 88 89 90 91 193 195 197 199 201 203 205 207 209 211 213 215

rpsF suhB OIE OIE

nonOIE nonOIE

111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 255 257 259 261 263 265 267 269 271 273 275 277 279 281 283 285 287 289 291 293 295

rpsH rplD OIE OIE

nonOIE nonOIE

129 130 131 132 133 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214

rplN rplP OIE OIE

nonOIE nonOIE

121 122 123 124 125 126 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138

rplV rplX OIE OIE

nonOIE nonOIE

Figure108 109 A.3110 111 continued112 113 114 115 116 117 118 119 120 121 122 123 124 125 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107

rpsG rpsJ OIE OIE

nonOIE nonOIE

155 156 157 158 102 103 104 105

123

Figure A.4: Bubble plots of all the orthologous proteins for Flavobacteriaceae. The x-axis depicts protein length in amino acids. The bubbles are scaled by the number of proteomes with a protein at a particular length out of the total number of proteomes. Each table is labeled with the orthologous protein set it represents.

[Bubble-plot panels for Flavobacteriaceae; the plots themselves did not survive text extraction, and axis tick values are omitted. Each panel has one row labeled OIE and one row labeled nonOIE. Panels: aroC, aroB, aroD, aroK, asd, aspC, atpA, atpD, atpE, atpF, dapF, dnaK, fusA, gapA, groEL, groES, gyrB, infA, ispB, ksgA, lgt, lipB, lpdA, lspA, miaB, mutS, obg, rsmI, pnp, pdhA, pdhB, prfA, prsA, rplK, rplA, rplB, rplC, rplD, rplE, rplF, rplL, rplN, rplO, rplP, rplQ, rplT, rplV, rplY, rpmA, rpmE, rpmI, rpoA, rpoB, rpoD, rpsB, rpsC, rpsD, rpsE, rpsF, rpsG, rpsH, rpsI, rpsJ, rpsK, rpsL, rpsM, rpsN, rpsO, rpsQ, serS, smpB, sucA, sucB, sucC, sucD, sufE, tatC, trmE, trpS, truA, tuf, tyrS.]

pdhB prfA OIE OIE

nonOIE nonOIE

324 325 326 327 328 346 348 350 352 354 356 358 360 362 364 366 368 370 372 374 376

prsA rpIK OIE OIE

nonOIE nonOIE

277 279 281 283 285 287 289 291 293 295 297 299 301 303 305 307 309 311 313 315 317 319 139 140 141 142 143 144 145 146 147 148

rplA rplB OIE OIE

nonOIE nonOIE

170 174 178 182 186 190 194 198 202 206 210 214 218 222 226 230 234 238 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290

rplC rplD OIE OIE

nonOIE nonOIE

204 205 206 207 208 209 210 211 212 213 214 203 204 205 206 207 208 209 210 211

rplE rplF OIE OIE

nonOIE nonOIE

181 182 183 184 185 186 187 188 189 190 179 180 181 182 183

rplL rplN OIE OIE

nonOIE nonOIE

120 122 124 126 128 130 132 134 136 138 140 142 144 119 120 121 122 123 124 125

rplO rplP FigureOIE A.4 contiued OIE

nonOIE nonOIE

135 137 139 141 143 145 147 149 151 153 155 157 159 161 132 133 134 135 136 137 138 139 140 141 142

rplQ rplT OIE OIE

nonOIE nonOIE

120 123 126 129 132 135 138 141 144 147 150 153 156 159 162 165 111 112 113 114 115 116 117 118 119 120 121

rplV rplY OIE OIE

nonOIE nonOIE

118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216

rpmA rpmE OIE OIE

nonOIE nonOIE

83 84 85 86 87 88 89 90 91 81 82 83 84 85 86 87 88 89 90 91 92 93

rpmI rpoA OIE OIE

nonOIE nonOIE

61 62 63 64 65 66 234 240 246 252 258 264 270 276 282 288 294 300 306 312 318 324 330 336 342 348 354

rpoB rpoD OIE OIE

nonOIE nonOIE

1244 1250 1256 1262 1268 1274 1280 1286 1292 1298 1304 1310 1316 1322 1328 1334 284 288 292 296 300 304 308 312 316 320 324 328 332 336 340 344 348 352

rpsB rpsC OIE OIE

nonOIE nonOIE

215 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269 221 223 225 227 229 231 233 235 237 239 241 243 245 247 249 251 253 255

rpsD rpsE OIE OIE

nonOIE nonOIE

193 195 197 199 201 203 205 207 209 211 213 215 217 219 221 223 225 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245

rpsF rpsG OIE OIE continued

nonOIE nonOIE 127 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 153 154 155 156 157 158 159

rpsH rpsI OIE OIE

nonOIE nonOIE

126 127 128 129 130 131 132 133 134 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137

rpsJ rpsK OIE OIE

nonOIE nonOIE

94 95 96 97 98 99 100 101 102 103 104 119 120 121 122 123 124 125 126 127 128 129 130 131

rpsL rpsM OIE OIE

nonOIE nonOIE

123 124 125 126 127 128 129 130 131 132 133 134 135 136 122 123 124 125 126 127 128 129 130 131 132 133 134 135

rpsN rpsO OIE OIE

nonOIE nonOIE

88 89 90 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98

rpsQ serS OIE OIE

nonOIE nonOIE

82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 338 343 348 353 358 363 368 373 378 383 388 393 398 403 408 413 418 423 428

smpB sucA OIE OIE

nonOIE nonOIE

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 894 896 898 900 902 904 906 908 910 912 914 916 918 920 922 924 926 928

sucB sucC OIE OIE

nonOIE nonOIE

367 370 373 376 379 382 385 388 391 394 397 400 403 406 409 412 415 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410

sucD sufE OIE OIE

nonOIE nonOIE

289 290 291 292 293 294 295 296 297 126 128 130 132 134 136 138 140 142 144 146 148

tatC trmE OIE OIE

nonOIE nonOIE

247 249 251 253 255 257 259 261 263 265 267 269 271 273 275 277 453 455 457 459 461 463 465 467 469 471 473

trpS truA OIE OIE

nonOIE nonOIE

314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 242 244 246 248 250 252 254 256 258 260 262 264 266 268 270 272 274 276

tuf tyrS OIE OIE

nonOIE nonOIE

366 368 370 372 374 376 378 380 382 384 386 388 390 392 394 396 398 357 361 365 369 373 377 381 385 389 393 397 401 405 409 413 417 421 425 429 433 437 Flavobacteriaceae aroC aroB OIE OIE

nonOIE nonOIE

332 334 336 338 340 342 344 346 348 350 352 354 356 358 360 362 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366

aroD aroK OIE OIE

nonOIE nonOIE

136 137 138 139 140 141 142 143 144 145 146 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173 177

asd aspC OIE OIE

nonOIE nonOIE

328 329 330 331 332 333 334 335 336 337 338 339 340 341 392 393 394 395 396 397 398 399

atpA atpD OIE OIE

nonOIE nonOIE

523 524 525 526 527 528 529 530 492 495 498 501 504 507 510 513 516 519 522 525 528 531 534 537 540 543 546 549 552

atpE atpF OIE OIE

nonOIE nonOIE

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 157 161 165 169 173 177 181 185 189 193 197 201 205 209 213 217 221 225

dapF dnaK OIE OIE

nonOIE nonOIE

252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637

fusA gapA OIE OIE

nonOIE nonOIE

689 691 693 695 697 699 701 703 705 707 709 711 713 715 717 719 721 331 332 333 334 335 336 337 338 339 340 341

groEL groES OIE OIE

nonOIE nonOIE

540 541 542 543 544 545 546 547 548 90 91 92 93 94

gyrB infA OIE OIE

nonOIE nonOIE

623 626 629 632 635 638 641 644 647 650 653 656 659 662 665 668 70 71 72

ispB ksgA OIE OIE

nonOIE nonOIE

316 317 318 319 320 321 322 323 324 325 326 327 250 252 254 256 258 260 262 264 266 268 270 272 274

lgt lipB OIE OIE

nonOIE nonOIE

244 248 252 256 260 264 268 272 276 280 284 288 292 296 300 304 308 312 316 320 324 328 210 212 214 216 218 220 222 224 226 228 230 232 234 236 238 240 242

lpdA lspA OIE OIE

nonOIE nonOIE

459 460 461 462 463 464 465 466 467 468 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215

miaB mutS OIE OIE

nonOIE nonOIE

428 430 432 434 436 438 440 442 444 446 448 799 804 809 814 819 824 829 834 839 844 849 854 859 864 869 874 879 884 889 894

obg rsmI OIE OIE

nonOIE nonOIE

318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 217 219 221 223 225 227 229 231 233 235 237 239

pnp pdhA OIE OIE

nonOIE nonOIE

689 691 693 695 697 699 701 703 705 707 709 711 713 715 717 719 721 723 725 727 729 731 323 324 325 326 327 328 329 330 331 332 333 334 335

pdhB prfA OIE OIE

nonOIE nonOIE

324 325 326 327 328 346 348 350 352 354 356 358 360 362 364 366 368 370 372 374 376

prsA rpIK OIE OIE

nonOIE nonOIE

277 279 281 283 285 287 289 291 293 295 297 299 301 303 305 307 309 311 313 315 317 319 139 140 141 142 143 144 145 146 147 148

rplA rplB OIE OIE

nonOIE nonOIE

170 174 178 182 186 190 194 198 202 206 210 214 218 222 226 230 234 238 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290

rplC rplD OIE OIE

nonOIE nonOIE

204 205 206 207 208 209 210 211 212 213 214 203 204 205 206 207 208 209 210 211

rplE rplF OIE OIE

nonOIE nonOIE

181 182 183 184 185 186 187 188 189 190 179 180 181 182 183

rplL rplN OIE OIE

nonOIE nonOIE

120 122 124 126 128 130 132 134 136 138 140 142 144 119 120 121 122 123 124 125

rplO rplP OIE OIE

nonOIE nonOIE

135 137 139 141 143 145 147 149 151 153 155 157 159 161 132 133 134 135 136 137 138 139 140 141 142

rplQ rplT OIE OIE

nonOIE nonOIE

120 123 126 129 132 135 138 141 144 147 150 153 156 159 162 165 111 112 113 114 115 116 117 118 119 120 121

rplV rplY OIE OIE

nonOIE nonOIE

118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216

rpmA rpmE OIE OIE

nonOIE nonOIE

83 84 85 86 87 88 89 90 91 81 82 83 84 85 86 87 88 89 90 91 92 93

rpmI rpoA OIE OIE

nonOIE nonOIE

61 62 63 64 65 66 234 240 246 252 258 264 270 276 282 288 294 300 306 312 318 324 330 336 342 348 354

rpoB rpoD OIE OIE

nonOIE nonOIE

1244 1250 1256 1262 1268 1274 1280 1286 1292 1298 1304 1310 1316 1322 1328 1334 284 288 292 296 300 304 308 312 316 320 324 328 332 336 340 344 348 352

rpsB rpsC OIE OIE

nonOIE nonOIE

215 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269 221 223 225 227 229 231 233 235 237 239 241 243 245 247 249 251 253 255

rpsD rpsE OIE OIE

nonOIE nonOIE

Figure193 195 197A.4199 continued201 203 205 207 209 211 213 215 217 219 221 223 225 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245

rpsF rpsG OIE OIE

nonOIE nonOIE

107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 153 154 155 156 157 158 159

rpsH rpsI OIE OIE

nonOIE nonOIE

126 127 128 129 130 131 132 133 134 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137

rpsJ rpsK OIE OIE

nonOIE nonOIE

94 95 96 97 98 99 100 101 102 103 104 119 120 121 122 123 124 125 126 127 128 129 130 131

rpsL rpsM OIE OIE

nonOIE nonOIE

123 124 125 126 127 128 129 130 131 132 133 134 135 136 122 123 124 125 126 127 128 129 130 131 132 133 134 135

rpsN rpsO OIE OIE

nonOIE nonOIE

88 89 90 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98

rpsQ serS OIE OIE

nonOIE nonOIE

82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 338 343 348 353 358 363 368 373 378 383 388 393 398 403 408 413 418 423 428

smpB sucA OIE OIE

nonOIE nonOIE

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 894 896 898 900 902 904 906 908 910 912 914 916 918 920 922 924 926 928

sucB sucC OIE OIE continued

nonOIE nonOIE

367 370 373 376 379 382 385 388 391 394 397 400 403 406 409 412 415 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 128

sucD sufE OIE OIE

nonOIE nonOIE

289 290 291 292 293 294 295 296 297 126 128 130 132 134 136 138 140 142 144 146 148

tatC trmE OIE OIE

nonOIE nonOIE

247 249 251 253 255 257 259 261 263 265 267 269 271 273 275 277 453 455 457 459 461 463 465 467 469 471 473

trpS truA OIE OIE

nonOIE nonOIE

314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 242 244 246 248 250 252 254 256 258 260 262 264 266 268 270 272 274 276

tuf tyrS OIE OIE

nonOIE nonOIE

366 368 370 372 374 376 378 380 382 384 386 388 390 392 394 396 398 357 361 365 369 373 377 381 385 389 393 397 401 405 409 413 417 421 425 429 433 437 Flavobacteriaceae aroC aroB OIE OIE

nonOIE nonOIE

332 334 336 338 340 342 344 346 348 350 352 354 356 358 360 362 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366

aroD aroK OIE OIE

nonOIE nonOIE

136 137 138 139 140 141 142 143 144 145 146 105 109 113 117 121 125 129 133 137 141 145 149 153 157 161 165 169 173 177

asd aspC OIE OIE

nonOIE nonOIE

328 329 330 331 332 333 334 335 336 337 338 339 340 341 392 393 394 395 396 397 398 399

atpA atpD OIE OIE

nonOIE nonOIE

523 524 525 526 527 528 529 530 492 495 498 501 504 507 510 513 516 519 522 525 528 531 534 537 540 543 546 549 552

atpE atpF OIE OIE

nonOIE nonOIE

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 157 161 165 169 173 177 181 185 189 193 197 201 205 209 213 217 221 225

dapF dnaK OIE OIE

nonOIE nonOIE

252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637

fusA gapA OIE OIE

nonOIE nonOIE

689 691 693 695 697 699 701 703 705 707 709 711 713 715 717 719 721 331 332 333 334 335 336 337 338 339 340 341

groEL groES OIE OIE

nonOIE nonOIE

540 541 542 543 544 545 546 547 548 90 91 92 93 94

gyrB infA OIE OIE

nonOIE nonOIE

623 626 629 632 635 638 641 644 647 650 653 656 659 662 665 668 70 71 72

ispB ksgA OIE OIE

nonOIE nonOIE

316 317 318 319 320 321 322 323 324 325 326 327 250 252 254 256 258 260 262 264 266 268 270 272 274

lgt lipB OIE OIE

nonOIE nonOIE

244 248 252 256 260 264 268 272 276 280 284 288 292 296 300 304 308 312 316 320 324 328 210 212 214 216 218 220 222 224 226 228 230 232 234 236 238 240 242

lpdA lspA OIE OIE

nonOIE nonOIE

459 460 461 462 463 464 465 466 467 468 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215

miaB mutS OIE OIE

nonOIE nonOIE

428 430 432 434 436 438 440 442 444 446 448 799 804 809 814 819 824 829 834 839 844 849 854 859 864 869 874 879 884 889 894

obg rsmI OIE OIE

nonOIE nonOIE

318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 217 219 221 223 225 227 229 231 233 235 237 239

pnp pdhA OIE OIE

nonOIE nonOIE

689 691 693 695 697 699 701 703 705 707 709 711 713 715 717 719 721 723 725 727 729 731 323 324 325 326 327 328 329 330 331 332 333 334 335

pdhB prfA OIE OIE

nonOIE nonOIE

324 325 326 327 328 346 348 350 352 354 356 358 360 362 364 366 368 370 372 374 376

prsA rpIK OIE OIE

nonOIE nonOIE

277 279 281 283 285 287 289 291 293 295 297 299 301 303 305 307 309 311 313 315 317 319 139 140 141 142 143 144 145 146 147 148

rplA rplB OIE OIE

nonOIE nonOIE

170 174 178 182 186 190 194 198 202 206 210 214 218 222 226 230 234 238 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290

rplC rplD OIE OIE

nonOIE nonOIE

204 205 206 207 208 209 210 211 212 213 214 203 204 205 206 207 208 209 210 211

rplE rplF OIE OIE

nonOIE nonOIE

181 182 183 184 185 186 187 188 189 190 179 180 181 182 183

rplL rplN OIE OIE

nonOIE nonOIE

120 122 124 126 128 130 132 134 136 138 140 142 144 119 120 121 122 123 124 125

rplO rplP OIE OIE

nonOIE nonOIE

135 137 139 141 143 145 147 149 151 153 155 157 159 161 132 133 134 135 136 137 138 139 140 141 142

rplQ rplT OIE OIE

nonOIE nonOIE

120 123 126 129 132 135 138 141 144 147 150 153 156 159 162 165 111 112 113 114 115 116 117 118 119 120 121

rplV rplY OIE OIE

nonOIE nonOIE

118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216

rpmA rpmE OIE OIE

nonOIE nonOIE

83 84 85 86 87 88 89 90 91 81 82 83 84 85 86 87 88 89 90 91 92 93

rpmI rpoA OIE OIE

nonOIE nonOIE

61 62 63 64 65 66 234 240 246 252 258 264 270 276 282 288 294 300 306 312 318 324 330 336 342 348 354

rpoB rpoD OIE OIE

nonOIE nonOIE

1244 1250 1256 1262 1268 1274 1280 1286 1292 1298 1304 1310 1316 1322 1328 1334 284 288 292 296 300 304 308 312 316 320 324 328 332 336 340 344 348 352

rpsB rpsC OIE OIE

nonOIE nonOIE

215 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269 221 223 225 227 229 231 233 235 237 239 241 243 245 247 249 251 253 255

rpsD rpsE OIE OIE

nonOIE nonOIE

193 195 197 199 201 203 205 207 209 211 213 215 217 219 221 223 225 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245

rpsF rpsG OIE OIE

nonOIE nonOIE

107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 153 154 155 156 157 158 159

rpsH rpsI OIE OIE

nonOIE nonOIE

126 127 128 129 130 131 132 133 134 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137

rpsJ rpsK OIE OIE

nonOIE nonOIE

94 95 96 97 98 99 100 101 102 103 104 119 120 121 122 123 124 125 126 127 128 129 130 131

rpsL rpsM OIE OIE

nonOIE nonOIE

123 124 125 126 127 128 129 130 131 132 133 134 135 136 122 123 124 125 126 127 128 129 130 131 132 133 134 135

rpsN rpsO OIE OIE

nonOIE nonOIE

88 89 90 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98

rpsQ serS OIE OIE

nonOIE nonOIE

82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 338 343 348 353 358 363 368 373 378 383 388 393 398 403 408 413 418 423 428

smpB sucA OIE OIE

nonOIE nonOIE

Figure141 142A.4143 continued144 145 146 147 148 149 150 151 152 153 154 155 156 894 896 898 900 902 904 906 908 910 912 914 916 918 920 922 924 926 928

sucB sucC OIE OIE

nonOIE nonOIE

367 370 373 376 379 382 385 388 391 394 397 400 403 406 409 412 415 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410

sucD sufE OIE OIE

nonOIE nonOIE

289 290 291 292 293 294 295 296 297 126 128 130 132 134 136 138 140 142 144 146 148

tatC trmE OIE OIE

nonOIE nonOIE

247 249 251 253 255 257 259 261 263 265 267 269 271 273 275 277 453 455 457 459 461 463 465 467 469 471 473

trpS truA OIE OIE

nonOIE nonOIE

314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 242 244 246 248 250 252 254 256 258 260 262 264 266 268 270 272 274 276

tuf tyrS OIE OIE

nonOIE nonOIE

366 368 370 372 374 376 378 380 382 384 386 388 390 392 394 396 398 357 361 365 369 373 377 381 385 389 393 397 401 405 409 413 417 421 425 429 433 437

Table A.7: Average standard deviations for all orthologous protein groups in each family and ANOVAs comparing standard deviations across orthologous protein groups.

                 Flavobacteriaceae   Enterobacteriaceae
Proteins         6.22                3.50
Domains          4.78                2.34
Linkers          5.95                3.42
ANOVA P-value    0.226               0.134

Table A.8: Standard deviations (“StDev”) of lengths of proteins shared between Flavobacteriaceae and Enterobacteriaceae.

                  Enterobacteriaceae        Flavobacteriaceae
                  OIE         nonOIE        OIE         nonOIE
Average StDev     2.83        0.77          5.25        1.49
T-Test p-value    4.38E-06**                0.0009**

Figure A.5: Plot showing the gap distributions among alignments of orthologous proteins in Enterobacteriaceae (a) and Flavobacteriaceae (b). The 11th bin contains any residues that could not be equally distributed among bins 1-10 and are at the end of the alignment. The data-points shown represent averages over three different alignments per alignment algorithm, with the sequences shuffled differently in each of the three alignments.
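The binning described in this caption can be sketched as follows. This is a minimal illustration, not the script used for the thesis: the function names, the use of integer division to set the bin width, and the toy sequences are all assumptions made here for clarity. Each aligned sequence is split into ten bins of equal width, and any columns left over at the end of the alignment fall into an eleventh bin.

```python
from typing import List


def gap_counts_by_bin(aligned_seq: str, n_bins: int = 10, gap_char: str = "-") -> List[int]:
    """Count gap characters per bin for one aligned sequence.

    Columns are split into n_bins bins of equal width (integer division).
    Columns left over at the end of the alignment go into one extra bin.
    """
    bin_width = len(aligned_seq) // n_bins
    counts = [0] * (n_bins + 1)
    for pos, residue in enumerate(aligned_seq):
        if residue != gap_char:
            continue
        if bin_width == 0:
            # Alignment shorter than n_bins: every column is a leftover column.
            index = n_bins
        else:
            index = min(pos // bin_width, n_bins)
        counts[index] += 1
    return counts


def gap_profile(alignment: List[str], n_bins: int = 10) -> List[int]:
    """Sum the per-bin gap counts over all sequences in an alignment."""
    totals = [0] * (n_bins + 1)
    for seq in alignment:
        for i, count in enumerate(gap_counts_by_bin(seq, n_bins)):
            totals[i] += count
    return totals


# Hypothetical 23-column alignment: bins 1-10 hold 2 columns each,
# and the 11th bin holds the last 3 columns.
toy_alignment = [
    "A-CG--TACGTACGTACGT-A--",
    "ACCGTATACGTACGTACGTAA-C",
]
print(gap_profile(toy_alignment))
```

Averaging such profiles over the three shuffled alignments per algorithm would then give one data point per bin, as plotted in the figure.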

Appendix B: Supplemental Materials for Chapter 2

Table B.1: Single nucleotide polymorphisms in Pantoea carbekii genes.

COGs for all genic regions

COG Category  Freq.  Description
E             26     Amino acid transport and metabolism
J             25     Translation, ribosomal structure, and biogenesis
M             24     Cell wall/membrane/envelope biogenesis
H             21     coenzyme transport and metabolism
R             21     General function prediction only
C             18     energy production and conversion
O             18     post-translational modification, protein turnover, and chaperones
I             13     lipid transport and metabolism
G             12     carbohydrate transport and metabolism
L             12     replication, recombination, and repair
S             12     function unknown
U             6      intracellular trafficking, secretion, and vesicular transport
K             5      transcription
P             5      inorganic ion transport and metabolism
V             4      defense mechanisms
EH            4      Amino acid transport and metabolism; coenzyme transport and metabolism
D             3      cell cycle control, cell division, chromosome partitioning
EP            3      Amino acid transport and metabolism; inorganic ion transport and metabolism
F             3      nucleotide transport and metabolism
MG            3      Cell wall/membrane/envelope biogenesis; carbohydrate transport and metabolism
KT            2      transcription; signal transduction mechanisms
Q             2      secondary metabolites biosynthesis, transport, catabolism
EM            1      Amino acid transport and metabolism; Cell wall/membrane/envelope biogenesis
GEPR          1      carbohydrate transport and metabolism; General function prediction only; inorganic ion transport and metabolism; Amino acid transport and metabolism
IQR           1      lipid transport and metabolism; General function prediction only; secondary metabolites biosynthesis, transport, catabolism
MU            1      Cell wall/membrane/envelope biogenesis; intracellular trafficking, secretion, and vesicular transport
OC            1      post-translational modification, protein turnover, and chaperones; energy production and conversion

COGs for nonsynonymous SNPs

COG Category  Freq.  Description
E             18     Amino acid transport and metabolism
M             17     Cell wall/membrane/envelope biogenesis
H             14     coenzyme transport and metabolism
O             13     post-translational modification, protein turnover, and chaperones
R             12     General function prediction only
J             9      Translation, ribosomal structure, and biogenesis
G             8      carbohydrate transport and metabolism
I             8      lipid transport and metabolism
L             7      replication, recombination, and repair
S             7      function unknown
C             3      energy production and conversion
MG            3      Cell wall/membrane/envelope biogenesis; carbohydrate transport and metabolism
U             3      intracellular trafficking, secretion, and vesicular transport
D             2      cell cycle control, cell division, chromosome partitioning
EP            2      Amino acid transport and metabolism; inorganic ion transport and metabolism
P             2      inorganic ion transport and metabolism
V             2      defense mechanisms
EH            1      Amino acid transport and metabolism; coenzyme transport and metabolism
GEPR          1      carbohydrate transport and metabolism; General function prediction only; inorganic ion transport and metabolism; Amino acid transport and metabolism
K             1      transcription
MU            1      Cell wall/membrane/envelope biogenesis; intracellular trafficking, secretion, and vesicular transport

Table B.2: Quantity of nonsynonymous SNPs in protein-coding genes for genes with more than one nonsynonymous SNP in Pantoea carbekii compared to P. carbekii JPN.

Protein-coding Gene Description                                         # Nonsynonymous SNPs
periplasmic protein YtfN                                                7
2-oxoglutarate dehydrogenase E1 component SucA                          6
L-aspartate oxidase NadB                                                5
phosphoenolpyruvate carboxykinase [ATP] PckA                            5
UvrABC system protein A UvrA                                            5
bifunctional folylpolyglutamate synthase/dihydrofolate synthase         4
chaperone SurA precursor                                                4
DNA polymerase I PolA                                                   4
glutaminyl-tRNA synthetase                                              4
GTP pyrophosphokinase RelA                                              4
ribulose-phosphate 3-epimerase Rpe                                      4
alanyl-tRNA synthetase AlaS                                             3
anaerobic C4-dicarboxylate transporter DcuA                             3
bifunctional PutA protein PutA                                          3
D-3-phosphoglycerate dehydrogenase                                      3
dihydroxy-acid dehydratase IlvD                                         3
DNA polymerase III alpha subunit DnaE                                   3
DNA translocase FtsK                                                    3
exodeoxyribonuclease V gamma chain RecC                                 3
FKBP-type peptidyl-prolyl cis-trans isomerase FkpA precursor FkpA       3
histidinol-phosphate aminotransferase HisC                              3
leucyl-tRNA synthetase                                                  3
MscS mechanosensitive ion channel YjeP                                  3
oligopeptidase A PrlC                                                   3
organic solvent tolerance protein precursor Lmp                         3
phosphoenolpyruvate-protein phosphotransferase PtsP                     3
3-isopropylmalate dehydratase large subunit                             2
5-methyltetrahydropteroyltriglutamate/homocysteine S-methyltransferase  2
aconitate hydratase 1 AcnA                                              2
ADP-L-glycero-D-manno-heptose-6-epimerase HldD                          2
anthranilate synthase component II TrpD                                 2
arginyl-tRNA synthetase ArgS                                            2
biosynthetic arginine decarboxylase                                     2
biosynthetic arginine decarboxylase SpeA                                2
BirA bifunctional protein BirA                                          2
CDP- synthetase                                                         2
cell division protein                                                   2
cell division protein FtsW FtsW                                         2
chromosomal replication initiator protein DnaA                          2
dipeptide ABC transporter permease                                      2
dipeptide transport system permease DppC                                2
DNA mismatch repair protein MutL                                        2
DNA polymerase III delta subunit HolA                                   2
DNA primase DnaG                                                        2
DNA topoisomerase I TopA                                                2
DNA-directed RNA polymerase subunit beta''                              2
exodeoxyribonuclease V beta chain RecB                                  2
/ synthesis protein PlsX                                                2
glucosamine--fructose-6-phosphate aminotransferase [isomerizing] GlmS   2
glucose-6-phosphate 1-dehydrogenase                                     2
glucose-inhibited division protein B                                    2
glycyl-tRNA synthetase beta chain GlyS                                  2
GTP-binding protein Era                                                 2
GTP-binding protein LepA                                                2
hypothetical protein                                                    2
Imidazole glycerol phosphate synthase subunit HisH HisH                 2
inner membrane transport protein YajR                                   2
lipoprotein releasing system                                            2
lipoprotein YfgL precursor                                              2
lipoyl synthase LipA                                                    2
N-carbamoyl-L-amino acid hydrolase AmaB                                 2
outer membrane protein A precursor OmpA                                 2
p-hydroxybenzoic acid efflux pump subunit AaeB                          2
penicillin-binding protein 1B MrcB                                      2
peptide chain release factor 2                                          2
periplasmic trehalase precursor TreA                                    2
phosphoserine aminotransferase                                          2
protease EcfE                                                           2
rod shape-determining protein MreC                                      2
S-adenosylmethionine synthetase MetK                                    2
sugar kinase YjeF                                                       2
sulfate ABC transporter permease                                        2
sulfite reductase [NADPH] flavoprotein alpha-component CysJ             2
transaldolase B TalB                                                    2
tryptophan synthase subunit beta TrpB                                   2
UDP-N-acetylenolpyruvoylglucosamine reductase MurB                      2
UTP--glucose-1-phosphate uridylyltransferase GalU                       2
valyl-tRNA synthetase ValS                                              2

Table B.3: Species names, NCBI accession numbers, and genome sizes for organisms used in the phylogenetic analysis.

Species Name                                      NCBI Accession  Genome Size (Mb)
Buchnera_aphidicola_JF99_Acyrthosiphon_pisum      NC_017253.1     0.64
Buchnera_aphidicola_APS_Acyrthosiphon_pisum       NC_002528.1     0.66
Buchnera_aphidicola_5A_Acyrthosiphon_pisum        NC_011833.1     0.64
Buchnera_aphidicola_Ak_Acyrthosiphon_kondoi       NC_017256.1     0.65
Buchnera_aphidicola_Bp_Baizongia_pistaciae        NC_004545.1     0.62
Buchnera_aphidicola_Cc_Cinara_cedri               NC_008513.1     0.42
Buchnera_aphidicola_Sg_Schizaphis_graminum        NC_004061.1     0.64
Buchnera_aphidicola_Tuc7_Acyrthosiphon_pisum      NC_011834.1     0.64
Buchnera_aphidicola_Ua_Uroleucon_ambrosiae        NC_017259.1     0.63
Baumannia_cicadellinicola_Hc                      NC_007984.1     0.69
Candidatus_Blochmannia_chromaiodes_640            NC_020075.1     0.79
Candidatus_Blochmannia_floridanus                 NC_005061.1     0.71
Candidatus_Blochmannia_Pennsylvanicus             NC_007292.1     0.79
Candidatus_Blochmannia_vafer                      NC_014909.2     0.72
Dickeya_dadantii                                  NC_012880.1     4.68
Dickeya_zeae                                      NC_012912.1     4.81
Enterobacter_cloacae                              NC_014121.1     5.6
Erwinia_amylovora                                 NC_013961.1     3.83
Erwinia_Ejp617                                    NC_017445.1     3.96
Erwinia_pyrifoliae                                NC_012214.1     4.07
Glaciecola_psychrophila                           NC_020514.1     5.41
Candidatus_Moranella_endobia_PCIT                 NC_015735.1     0.54
Candidatus_Moranella_endobia_PCVAL                NC_021057.1     0.54
Pantoea_vagans                                    NC_014562.1     4.89
Pectobacterium_atrosepticum                       NC_004547.2     5.06
Pectobacterium_carotovorum                        NC_012917.1     4.86
Pectobacterium_wasabiae                           NC_013421.1     5.06
Rahnella_aquatilis_CIP                            NC_016818.1     5.45
Rahnella_aquatilis_HX2                            NC_017047.1     5.66
Candidatus_Riesia_pediculicola_USDA               NC_014109.1     0.58
Salmonella_bongori                                NC_015761.1     4.46
Salmonella_enterica                               NC_011205.1     4.92
Serratia_marcescens                               NC_020064.1     4.86
Serratia_proteamaculans                           NC_009832.1     5.5
Serratia_symbiotica_Cinara_cedri                  NC_016632.1     1.76
Shigella_boydii                                   NC_010658.1     4.87
Shigella_dysenteriae                              NC_007606.1     4.56
Shigella_flexneri                                 NC_017328.1     4.89
Shigella_sonnei                                   NC_016822.1     5.22
Sodalis_glossinidius_morsitans                    NC_007712.1     4.29
Wigglesworthia_glossinidia__Glossina_brevipalpis  NC_004344.2     0.7
Wigglesworthia_glossinidia__Glossina_morsitans    NC_016893.1     0.72
Xenorhabdus_bovienii                              NC_013892.1     4.23
Xenorhabdus_nematophila                           NC_014228.1     4.59
Yersinia_enterocolitica_palearctica               NC_017564.1     4.63
Yersinia_pestis                                   NC_010159.1     4.69
Candidatus_Ishikawaella_capsulata_Mpkobe          AP010872.1      0.74
Symbiont_of_Plautia_stali                         NC_022546.1     4.04
Pantoea_agglomerans                               ANKX01          scaffolds
Pantoea_stewartii                                 AHIE01          scaffolds
Pantoea_ananatis                                  NC_017531.1     4.56
Erwinia_billingiae                                NC_014306       5.1
Erwinia_tasmaniensis                              CU468135.1      3.88
Buchnera_aphidicola_Cinara_tujafilina             CP001817.1      4.45
Buchnera_aphidicola_LL01_Acyrthosiphon_pisum      CP002300.1      0.64
Buchnera_aphidicola_TLW03_Acyrthosiphon_pisum     CP002301.1      0.64
Serratia_plymuthica                               CP006250.1      5.33
Pseudomonas_aeruginosa                            NC_002516.2     6.26

Table B.4: Orthologs used in the phylogenetic analysis, represented by their gene name abbreviations.

asnC gapA rplD rpsE tpiA clpP gidA rplE rpsF trmE cysS gpmA rplF rpsG truA dnaK greA rplJ rpsH yadR dnaN groES rplK rpsI ydhD dnaQ gyrB rplL rpsJ yfhC efp lon rplM rpsK yrbA eno lpdA rplN rpsM thrS fabB mnmA rplP rpsN lepA fldA mraW rplQ rpsO rluD fmt obgE rplR rpsP alaS frr prsA rplS rpsR rpsA dnaX pth rplT secE holB rplW rnc rplV secY rplC rplA rplX smpB hemK infB rpmA tatD valS mesJ rpmG rpsS aceF rpsL rpoA rplB rplI trpS rpsB fabg1 dnaB ybeY rpsD adk

Symbiont of P. E. coli K-12 P. ananatis Ishikawaella Buchnera Function Gene Plauti stali carbekii 4.64 Mb 4.56 Mb 4.04 Mb 1.15 Mb 0.75 Mb 0.64 Mb

cysN P P P P P P

cysD P P P P P P

cysC P P P P P P

cysH P P P P P P

cysI P P P P P P

cysJ P P P P P P Sulfur assimiliation & cysteine cysQ P P P P P P

cysE P P P P P P

cysK P P P P P P

aroGH P P P P P P

aroB P P P P P P

aroD P P P P P P

aroE P P P P P P

aroK P P P P P P Chorismate aroA P P P P P P

aroC P P P P P P

pheA P P P P P P

aspC P P P P P - Phenylalanine trpD P P P P P P

trpG a P P P - P

trpE P P P P P P

trpC P P P P P P

Tryptophan trpA P P P P P P

trpB P P b P P P

metA P P P P P -

Methionine metB P P P P P - Table B.5: Presence and absence table comparing the gene content of Pantoea carbekii to Escherichia coli strain K-12 (U00096.3), Pantoea ananatis (AP012032.1), the symbiont of Plautia stali (AP012551.1), "Candidatus Ishikawaella capsulata" (AP010872.1), and Buchnera aphidicola strain APS (NC_002528) (S1F). Presence was determined by pairwise comparisons of proteomes using the blastp program (parameters: e-value: 1e-10) and manual inspection of genome annotations. a: the complete P. carbekii DNA polymerase I is encoded by two adjacent genes. continued

140

Table B.5 continued

metC P P P P P -

metE P P P P P P

asd P P P P P P

dapA P P P P P P

dapB P P P P P P

dapD P P P P P P

dapE P P P P P P

Lysine dapF P P P P P P

lysA P P P P P P

asd P P P P P P

thrA P P P P P P

thrB P P P P P P

thrC P P P P P P ilvA P P P P P -

ilvH P P P P P P

ilvI P P P P P P

ilvC P P P P P P

ilvD P P P P P P

ilvE P P P - - -

leuA P P P P P P

leuC P P P P P P Leucine, Isoleucine, Valine leuD P P P P P P

leuB P P P P P P

hisA P P P P P P

hisB P P P P P P

hisC P P P P P P

hisD P P P P P P

hisF P P P P P P

Histidine hisG P P P P P P

hisH P P P P P P

hisI P P P P P P continued

141

Table B.5 continued

phr P P P - P P

ung P P P P P P

nfo P P P P - P

xth P P P - P -

rep P P P - - P

mfd P P P P P P

nth P P P P P P

ligA P P P ψ P P

mutH P P P P P -

mutL P P P P P P

mutM P P P - P -

mutS P P P P P P Recombination and mutT P P P - - P Repair mutY P P P P - P

recA P P P P P -

recB P P P ψ P P

recC P P P P P P

recD P P P P P P

recG P P P - - -

recJ P P P P ψ -

rmuC P P P ψ - -

ruvA P P P P P -

ruvB P P P P P -

ruvC P P P ψ P -

uvrD P P P P P -

dnaA P P P P P P

dnaE P P P P P P

dnaN P P P P P P

dnaQ P P P P P P Replication dnaX P P P P P P

holA P P P P P P

holB P P P P P P

holC P P P P - - continued

142

Table B.5 continued

holD P P P P - -

holE P P P - - -

polA P P P Pa P P

priA P P P P P P

ssb P P P P P P

fabA P P P P - -

fabB P P P P P P

fabF P P P - - -

acpP P P P P P P

acpS P P P P P P Fatty acids fabD P P P P P -

fabZ P P P P P -

fabI P P P P P P

fabG P P P P P P

fabH P P P P - -

plsB P P P ψ - -

plsC P P P P - -

pssA P P P P - -

psd P P P P - P cdsA P P P P ψ -

pgsA P P P P ψ -

pgpA P P P P - -

cls P P P P P P

glmU P P P P - P

murA P P P P P P

murB P P P P P P

murC P P P P P P

murD P P P P P P

Peptidoglycan murE P P P P P P

murF P P P P P P

mraY P P P P P P

murG P P P P P P

murI P P P P ψ P

Lipid A lpxA P P P P - -

lpxC P P P P - -

lpxD P P P P - -

lpxB P P P P - -

lpxH P P P P - -

lpxK P P P P - -

waaA P P P P - -

lpxL P - P - - -

lpxM P P P P - -

Rod Shape mreB P P P P P -

mreC P P P ψ P -

mreD P P P P P -
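The presence calls in Table B.5 rest on pairwise blastp comparisons at an e-value cutoff of 1e-10, followed by manual inspection. The Python sketch below illustrates only the automated step, and only under stated assumptions: that blastp output was saved in the default 12-column tabular layout (`-outfmt 6`), and that reference proteins named by gene were used as queries against each genome's proteome. Neither detail is given in the thesis. The function names, helper, and toy records are hypothetical, and pseudogene (ψ) calls, which required manual inspection, are not reproduced.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # cutoff stated in the Table B.5 caption


def queries_with_hits(blast_tsv, cutoff=EVALUE_CUTOFF):
    """Return the query IDs that have at least one hit at or below the cutoff.

    Assumes the default 12-column tabular layout of `blastp -outfmt 6`,
    where column 1 is the query ID and column 11 is the e-value.
    """
    present = set()
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        if float(row[10]) <= cutoff:
            present.add(row[0])
    return present


def presence_table(genes, hits_by_genome):
    """Map each gene to a list of 'P' / '-' calls, one per genome, in order."""
    return {
        gene: ["P" if gene in hits else "-" for hits in hits_by_genome.values()]
        for gene in genes
    }


def _row(query, subject, evalue):
    # Hypothetical helper: builds one 12-column line; only the query ID
    # and the e-value matter for this sketch.
    return "\t".join([query, subject, "50.0", "100", "50", "0",
                      "1", "100", "1", "100", evalue, "100"])


# Toy input (invented for illustration, not data from the thesis).
demo_genome_a = "\n".join([_row("metB", "A_0001", "3e-120"),
                           _row("recG", "A_0002", "0.5")])
demo_genome_b = _row("recG", "B_0001", "1e-40")

hits = {
    "genome_A": queries_with_hits(demo_genome_a),
    "genome_B": queries_with_hits(demo_genome_b),
}
table = presence_table(["metB", "recG"], hits)
```

In this toy case the weak recG hit in genome_A (e-value 0.5) fails the cutoff, so recG is scored as absent there. In practice such borderline cases are where manual inspection of the annotation would decide the call.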


Table B.6: KEGG KO frequencies in LC/MS/MS Mascot data analysis. Note: proteins could be grouped into more than one KEGG category.

KEGG KO  Description  Abundance
ko03010  Ribosome  31
ko01110  Biosynthesis of secondary metabolites  23
ko01230  Biosynthesis of amino acids  16
ko01120  Microbial metabolism in diverse environments  14
ko01200  Carbon metabolism  11
ko00230  Purine metabolism  7
ko00010  Glycolysis / Gluconeogenesis  6
ko00240  Pyrimidine metabolism  5
ko03018  RNA degradation  5
ko00020  TCA cycle  4
ko00630  Glyoxylate and dicarboxylate metabolism  4
ko00680  Methane metabolism  4
ko02020  Two-component system  4
ko00190  Oxidative phosphorylation  3
ko00250  Alanine, aspartate and glutamate metabolism  3
ko00330  Arginine and proline metabolism  3
ko00620  Pyruvate metabolism  3
ko00720  Carbon fixation pathways in prokaryotes  3
ko03430  Mismatch repair  3
ko04146  Peroxisome  3
ko00030  Pentose phosphate pathway  2
ko00195  Photosynthesis  2
ko00270  Cysteine and methionine metabolism  2
ko00300  Lysine biosynthesis  2
ko00312  beta-Lactam resistance  2
ko00400  Phenylalanine, tyrosine and tryptophan biosynthesis  2
ko00450  Selenocompound metabolism  2
ko00480  Glutathione metabolism  2
ko00500  Starch and sucrose metabolism  2
ko00520  Amino sugar and nucleotide sugar metabolism  2
ko00540  Lipopolysaccharide biosynthesis  2
ko00710  Carbon fixation in photosynthetic organisms  2
ko00970  Aminoacyl-tRNA biosynthesis  2
ko01210  2-Oxocarboxylic acid metabolism  2
ko02010  ABC transporters  2
ko03030  DNA replication  2
ko03440  Homologous recombination  2
ko04068  FoxO signaling pathway  2
ko04626  Plant-pathogen interaction  2
ko00051  Fructose and mannose metabolism  1
ko00052  Galactose metabolism  1
ko00061  Fatty acid biosynthesis  1
ko00253  Tetracycline biosynthesis  1
ko00260  Glycine, serine and threonine metabolism  1
ko00310  Lysine degradation  1
ko00340  Histidine metabolism  1
ko00380  Tryptophan metabolism  1
ko00640  Propanoate metabolism  1
ko00750  Vitamin B6 metabolism  1
ko00760  Nicotinate and nicotinamide metabolism  1
ko00790  Folate biosynthesis  1
ko00910  Nitrogen metabolism  1
ko01212  Fatty acid metabolism  1
ko02060  Phosphotransferase system  1
ko03020  RNA polymerase  1
ko03060  Protein export  1
ko03070  Bacterial secretion system  1
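Because a protein can belong to more than one KEGG category, the abundances in Table B.6 can sum to more than the number of identified proteins. The Python sketch below shows one way such a tally could be computed. It is not the pipeline used in the thesis: the function name, the protein identifiers, and the toy assignments are hypothetical, and the sort order (descending abundance, then KO identifier) is chosen only because it appears to match the ordering of the table.

```python
from collections import Counter


def ko_frequencies(protein_to_kos):
    """Count identified proteins per KEGG category.

    A protein assigned to several categories is counted once in each,
    so the counts may sum to more than the number of proteins.
    """
    counts = Counter()
    for kos in protein_to_kos.values():
        counts.update(set(kos))  # set() guards against duplicate assignments
    # Sort by descending abundance, then by KO identifier.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy input (invented for illustration, not data from the thesis).
demo_assignments = {
    "prot_1": ["ko03010"],
    "prot_2": ["ko03010"],
    "prot_3": ["ko01230", "ko01110"],
    "prot_4": ["ko01230"],
}

demo_counts = ko_frequencies(demo_assignments)
for ko, abundance in demo_counts:
    print(ko, abundance)
```

Here four proteins yield five category counts, since prot_3 contributes to two categories.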
