
The Organization and Evolution of Biological Networks in Bacteria: Elucidating Biological Pathways and Complexes Involved in Bacterial Survival and Environmental Adaptation

by

Cedoljub Bundalovic-Torma

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Department of Biochemistry University of Toronto

© Copyright by Cedoljub Bundalovic-Torma 2018

The Organization and Evolution of Biological Networks in Bacteria: Systematic Exploration of Large-Scale Biological Networks in E. coli and Exopolysaccharide Biosynthetic Machineries.

Cedoljub Bundalovic-Torma

Doctor of Philosophy

Department of Biochemistry University of Toronto

2018

Abstract

Bacteria inhabit diverse environmental niches and employ various functional repertoires encoded in their genomes relevant for survival and adaptation. It has long been proposed that gene duplication plays an important role in bacterial adaptation; however, systematic experimental study of the functional roles of duplicates in bacteria has been lacking. From decades of small-scale experimental work with the model bacterium Escherichia coli, our view of the bacterial cell has expanded to encompass the concept of biological networks, a wiring diagram of the cell representing diverse functional associations of genes. To help define these complex functional relationships, large-scale genetic and protein interaction screens have been devised and applied to E. coli, providing datasets for deriving novel and biologically meaningful associations. In this work I present a study of gene duplication in the context of several large-scale biological networks. First (Chapter 2), I investigate two recently published E. coli genetic interaction (GI) networks and find that duplicates are likely to contribute to increased robustness through epistatic buffering or integration in the context of biological pathways and protein complexes with broad biological roles [1] and DNA damage and repair response pathways [2]. Next (Chapter 3), I further investigate the implications of gene duplication in the context of physical protein interactions, based on a recent mapping of cell-envelope complexes generated for E. coli [3], revealing several instances where the acquisition of novel physical interactions has likely led to neofunctionalization of duplicates involved in environmental adaptation. Finally (Chapter 4), in a first systematic survey of biofilm secretion machineries, I combine a novel phylogenetic clustering approach and genome proximity networks to identify the impact of duplication on the evolution of operons as phylum-specific niche adaptations. In summary, this work expands our knowledge of how the rewiring of physical, epistatic, and genomic associations of duplicates has shaped biological pathways of adaptive significance in bacteria.


Acknowledgments

This work would not have been possible without the guidance of my supervisor, Dr. John Parkinson, as well as the contributions of collaborators who have provided invaluable opportunities for me to accomplish the work presented in this thesis.

Above all, I would like to thank my parents, Zoran Torma and Branka Bundalovic, and my dear friend Mike Travis for their unwavering support during my studies.


Table of Contents

Abstract

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Appendices

List of Supplemental Files

Chapter 1 Background

1 Charting Cellular Complexity in Bacteria

1.1 Overview

1.2 Escherichia coli as a Model Organism of Gram-negative Bacterial Biology

1.3 Biological Networks – Wiring Diagrams of Biological Processes

1.3.1 What are Biological Networks?

1.3.2 Protein Interaction Networks

1.3.3 Genetic Interaction Networks

1.3.4 Computationally Derived Functional Interaction Networks Inferred Through Genomic Context Methods

1.3.5 Integrative Approaches to Boosting Reliability of Physical Interactions with Predicted Functional Assignments

1.4 Comparative Genomics and its Application Toward the Study of Biological Networks

1.4.1 What is Comparative Genomics?

1.4.2 Bacterial Genomic Evolution

1.4.3 Defining the Evolutionary Relationships of Genes – Orthology and Paralogy

1.4.4 Computational Approaches for Functional Annotation of Protein Sequences from Sequence Homology

1.4.4.1 Sequence Homology Based on Conservation of Structural Domains: Protein Families

1.4.4.2 Sequence Homology Based on Inference of Evolutionary Relationships: Gene Families

1.4.5 Application of Biological Networks and Comparative Genomics Approaches to Study the Evolution of Bacterial Biological Processes

1.4.5.1 Robustness and Evolvability

1.4.5.2 Defining Modularity in Biological Networks

1.4.5.3 Functional and Evolutionary Modularity

1.5 Project Goals and Rationale

Chapter 2 Investigation of the Evolution of Diverse Biological Pathways and Functional Divergence of Paralogs in E. coli Genetic Interaction Networks

2 E. coli Genetic Interaction Networks

2.1 Materials and Methods

2.1.1 Source of Datasets

2.1.2 Enrichment of Genetic Interactions in Functional Modules

2.1.3 Determination of Evolutionary Co-Conservation of Broad-GI Genes using Mutual Information of Phylogenetic Profiles

2.2 Results

2.2.1 E. coli Functional Modules Demonstrate an Enrichment in Genetic Interactions

2.2.2 Phylogenetic Conservation of GI Networks Provides Evolutionary Insights into the Functional Integration and Divergence of Biological Processes

2.3 Discussion and Conclusions

Chapter 3 Investigation of the Organization of Physical Complexes and Functional Divergence of Paralogs in the E. coli Cell Envelope

3 The E. coli Cell Envelope Protein-Protein Interaction Network

3.1 Materials and Methods

3.1.1 Sources of Data: E. coli Cell Envelope Associated Proteome Physical Interaction Network

3.1.2 Prediction and benchmarking of predicted E. coli CE-PPI protein complexes with Markov Clustering

3.1.3 Integration of CE-PPI and previously published E. coli Genetic Interaction Networks

3.1.4 Analysis of Functional Divergence of Paralogs by Differences in PPI Overlap and Functional Enrichment of Paralog Physical Interactions

3.2 Results

3.2.1 Analysis of the functional organization of the CE-PPI Network Identifies Novel Interactors with Complexes Involved in Cell Growth, Division, Nutrient Transport, and Environmental Sensing

3.2.1.1 Gold-Standard PPI Curation and Benchmarking Enables the Generation of the First Large-Scale AP-MS E. coli CE-PPI Network and Systematic Exploration of Functional Organization of the Bacterial Cell Envelope

3.2.1.2 Cell-Envelope Clusters Represent Known Complexes with Novel Interactors of Diverse Biological Roles and Dynamic Properties

3.2.1.3 Integrated Network Analysis of Physical and Genetic Interactions Reveals Functional Cross-Talk Enriched Between Diverse CE Associated Biological Processes

3.2.2 CE paralogs possess variability in shared physical interactors reflecting specialized roles in diverse biological processes

3.3 Discussion and Conclusions

Chapter 4 Systematic Prediction and Classification of Operon-Associated Bacterial Exopolysaccharide Secretion Machineries

4 Systematic Study of the Evolution of Bacterial Synthase-Dependent Exopolysaccharide System (EPS) Machineries

4.1 Materials and Methods

4.1.1 Data Sources

4.1.2 Generation of EPS operon hidden-Markov models (HMMs)

4.1.3 Prediction of putative EPS operon loci

4.1.4 A Genomic-Context Based Approach for EPS Operon Prediction

4.1.5 Classification of EPS Loci and Definition of Operon Clades Using a Novel Protein Sequence Evolutionary Distance Clustering Approach

4.1.6 Construction of EPS Operon Genomic-Proximity Networks

4.2 Results

4.2.1 A Comprehensive Survey of Bacterial EPS Operons Reveals Functional EPS Systems Across Bacteria of Diverse Lifestyles and Environmental Niches

4.2.2 Evolution of EPS Operons is Driven by Gene Duplication, Loss and Rearrangements

4.2.3 Systematic Phylogenetic Distance-Based Clustering of EPS Operon Loci and Genomic-Proximity Networks Identifies Evolutionarily Distinct Operon Clades

4.2.4 Functional Implications of Sequence Divergence and Operon Evolutionary Events in Bacterial EPS Systems Revealed by Phylogenetic Clustering and Genomic-Proximity Networks

4.3 Discussion and Conclusions

Chapter 5 - Conclusions and Future Directions

5 Summary

5.1 E. coli Broad and DNA-Repair GI networks

5.2 E. coli Cell-Envelope PPI Network

5.3 Bacterial Exopolysaccharide Secretion Machineries

References

Appendices

Appendix 1: Summary Table of CE-PPI Network Paralog PPI Enrichment

Appendix 2: Sequence Variability of Phylogenetic Clusters Reveals Different Degrees of Structural Conservation of Cellulose Biosynthesis Machinery

Appendix 3: Divergence of PNAG Phylogenetic Sequence Clusters Elucidates Structural Differences Related to Biofilm Secretion and Modification Across Diverse Bacterial Phyla

Appendix 4: Phylogenetic Clustering Indicates Increased Divergence of Loci with Functionally Linked Roles in Regulation of Alginate Secretion


List of Tables

Table 3-1: Functional Enrichment of Cell-Envelope Paralog Physical Interactions

Table 4-1: EPS Systems Surveyed and Functional Description of Loci

Table 4-2: Summary of EPS Operon Evolutionary Events


List of Figures

Figure 1-1: Overview of PPI Detection using Tandem Affinity Purification

Figure 1-2: Illustration of Genomic Context Measures

Figure 2-1: Overview of E. coli Broad-GI and DDR-GI Datasets Investigated

Figure 2-2: Overview of Functional and Evolutionary Analyses of E. coli GI Networks

Figure 2-3: Distribution of Broad-GIs Within and Between E. coli Functional Modules

Figure 2-4: Functional Modules Enriched with Broad-GI Epistatic Interactions

Figure 2-5: Summary of E. coli Functional Modules Epistatically Enriched in the DDR-GI Network

Figure 2-6: DDR-GI Enriched Functional Modules Reveal Integration of DNA Repair, Nucleotide Metabolism, and Cell Division Pathways Involved in DNA Damage Response

Figure 2-7: Distribution of Broad-GIs Between Paralogous and Non-Paralogous Genes

Figure 2-8: Comparison of Phylogenetic Conservation and GI-Profile Correlations of Broad-GI Genes, Based on Functional and Evolutionary Relationships

Figure 2-9: Broad-GI Network Functionally Related Genes Examined by Level of Phylogenetic Conservation and Functional Annotation

Figure 2-10: Functionally Correlated and Phylogenetically Co-Conserved Complexes and Pathways identified from the Broad-GI Network

Figure 2-11: Phylogenetic Conservation of Epistatic Interactions across Diverse Pathways Involved in DNA Damage Repair and Response in E. coli

Figure 2-12: Distribution of Paralog and Non-Paralog DDR-GIs Between UT and MMS Conditions

Figure 2-13: Average Difference in Number of GI Interactions and Average GI-Profile Correlation Measures of Paralogous and Non-Paralogous Genes in DDR-GI Networks

Figure 2-14: Paralogous Genes and GI-Profile Correlations Compared Between UT and MMS DDR-GI Networks

Figure 2-15: Paralogous DEAD-Box RNA Helicase GI-Profiles Compared Across UT, MMS, and DF DDR-GI Networks

Figure 3-1: E. coli Gold Standard PPI Curation and CE-PPI Network Benchmarking

Figure 3-2: CE-PPI Network Iterative MCL Clustering – Benchmarking Against Known E. coli Protein Complexes

Figure 3-3: Exploration of CE-PPI Network Defined Clusters

Figure 3-4: Integration of CE-PPI with Previously Published E. coli Genetic Interaction (GI) Datasets

Figure 3-5: GI Enrichment Among CE-Clusters with Diverse Biological Roles

Figure 3-6: CE-PPI Network Paralogs

Figure 3-7: Summary of CE-Paralog PPI Overlap Compared to Paralog % Sequence Similarity and Normalized Protein Abundance

Figure 3-8: Examining Paralog PPI and Functional Divergence for Selected CE-Paralog PPI Subnetworks

Figure 4-1: Computational Prediction and Reconstruction of Bacterial EPS Operons, Phylogenetic Clustering and Identification of Operon Clades

Figure 4-2: Summary of Predicted Bacterial EPS Operons by Genomic-Proximity Reconstruction

Figure 4-3: Lifestyle and Niche Distribution of Predicted EPS Operons

Figure 4-4: Evaluation of Phylogenetic Clustering Quality Scores: Phylogenetic Clusters Identified and Cluster Coverage of Cellulose Operon Sequences

Figure 4-5: Phylogenetically Clustered Cellulose Operon Genomic-Proximity Networks using Different Cluster Quality Metrics

Figure 4-6: Summary of EPS Operon Phylogenetic Sequence Clustering

Figure 4-7: Genomic-Proximity Network of Phylogenetically Clustered Cellulose Operons. Identifying Cellulose Operon Clades Distinguished by Diverse Genomic Evolutionary Events

Figure 4-8: Phylogenetic Clustering Identifies HGT Through Divergence and Rearrangement of Two Cellulose Gamma-Proteobacterial Operon Clades

Figure 4-9: Cellulose Operon Clade with Distinct BcsC OMP Phylogenetic Cluster Reveals HGT and Operon Divergence among Gamma and Beta Proteobacterial Species

Figure 4-10: Phylogenetically Clustered PNAG Operons and Selected Examples of Divergent Operon Clades

Figure 4-11: Phylogenetic Clustering Reveals Structural Evolution of PNAG PgaB Periplasmic Modifying Distinguishing Gram-Negative and Gram-Positive PNAG Operon Clades

Figure 4-12: Genomic-Proximity Network of Pel Operons and Identification of Novel Gram-Positive Pel Operon Clades

Figure 4-13: Identification of Gram-positive Pel Clades; Iterative HMM Searching Reconstructs Operons with Divergent Loci Leading to Experimental Validation of Novel Gram-positive Biofilm Operon in B. cereus ATCC 10987

Figure 4-14: Genomic-Proximity Network of Alginate EPS Operons Indicating Rearrangement and Divergence Driving the Emergence of Distinct Operon Clades in Pseudomonas spp.

Figure 4-15: Genomic-Context Network of Acetylated-Cellulose EPS Operons


List of Appendices

Appendix 1: Summary Table of CE-PPI Network Paralog PPI Enrichment

Appendix 2: Sequence Variability of Phylogenetic Clusters Reveals Different Degrees of Structural Conservation of Cellulose Biosynthesis Machinery

Appendix 3: Divergence of PNAG Phylogenetic Sequence Clusters Elucidates Structural Differences Related to Biofilm Secretion and Modification Across Diverse Bacterial Phyla

Appendix 4: Phylogenetic Clustering Indicates Increased Divergence of Loci with Functionally Linked Roles in Regulation of Alginate Secretion


List of Supplemental Files

File Name | Description | File Extension | File Size
File: SF1 | Broad-GI Functional Module Top 5 Percentile Summary Table | MS Excel | 102 KB
File: SF2 | DDR-GI Functional Module Top 5 Percentile Summary Table | MS Excel | 48 KB
File: SF3 | Broad-GI Paralog and Non-Paralog GI Summary Tables | MS Excel | 149 KB
File: SF4 | DDR-GI Paralog Summary Table | MS Excel | 17 KB
File: SF5 | CE-PPI Predicted Clusters (MCL I=2.0) | MS Excel | 130 KB
File: SF6 | CE-PPI MCL Cluster Top 5 Percentile GI Enrichment | MS Excel | 55 KB
File: SF7 | CE-PPI Paralog Summary Table | MS Excel | 20 KB
File: SF8 | EPS Loci HMM Seed Sequences | MS Excel | 18 KB
File: SF9 | Bacterial Genomes Used for EPS Prediction | Text | 81 KB
File: SF10 | Bacterial Genomes Lifestyle and Niche Metadata | MS Excel | 105 KB
File: SF11 | EPS Reconstructed Operons | MS Excel | 166 KB
File: SF12 | EPS Locus Phylogenetic Clusters | Zip(Fasta) | 988 KB
File: SF13 | EPS Genomic-Proximity Networks | Zip(Cytoscape) | 1,363 KB


Chapter 1 Background

Sections of the following text are excerpted from: Bundalovic-Torma, C. & Parkinson, J. “Comparative Genomics and Evolutionary Modularity of Prokaryotes.” in Prokaryotic Systems Biology (2015), Adv Exp Med Biol. vol. 883, pp. 77-96. Springer Publishing.

1 Charting Cellular Complexity in Bacteria

1.1 Overview

Elucidating the means by which bacteria respond to and influence their environments provides valuable insights into their roles in human health and disease. Doing so requires detailed knowledge of the complex interactions that organize the thousands of genes and their protein products encoded in a bacterial genome into the biological processes mediating growth and survival under diverse environmental conditions. In this chapter I will introduce relevant concepts and methodologies pertaining to the generation of bacterial biological networks. These approaches are being increasingly employed in the study of the organization and evolution of the bacterial cell. For instance, recently generated large-scale biological networks in the model Gram-negative bacterium Escherichia coli have been applied toward elucidating the function of genes and proteins in the context of physical complexes and diverse biological pathways. There is further potential in applying these datasets to investigate how evolutionary processes have shaped the biological roles of genes and proteins in the bacterial cell. I will then conclude by presenting the rationale and major aims of my research: to apply biological networks to elucidate the functional organization of the bacterial cell and thereby to systematically examine the adaptive significance of gene duplications within diverse biological contexts.


1.2 Escherichia coli as a Model Organism of Gram-negative Bacterial Biology

For the Gram-negative model bacterium Escherichia coli, detailed biochemical characterization studies accumulated over the past half-century have greatly expanded our fundamental knowledge of the function of the genes that comprise biological pathways involved in bacterial growth and survival. A long lineage of studies (reviewed in [4]) has focused on the large-scale organization of the E. coli proteome on the basis of physically interacting complexes, biological pathways, and co-regulated genes. Yet despite decades of accumulated research, the first annotated genomic sequence of the model E. coli strain K-12 MG1655 [5] revealed, surprisingly, numerous putative protein-encoding genes lacking experimental characterization [6]. A more recent bioinformatics survey examining several online E. coli knowledgebases has shown that around 41% of the E. coli K-12 proteome (1745 / 4220 proteins) lacks an experimentally validated function [7]. Of these, approximately 73% (1278 / 1745 proteins) have only a tentative functional assignment through computational prediction, resulting in 11% (480 / 4220 proteins) with unknown function.

Elucidating how individual proteins act together as part of integrated cellular processes provides a valuable key to understanding how organisms adapt to and survive in dynamically changing environments. In bacteria, the functional repertoires of genes encoded by the genome are dynamically shaped not only through sequence divergence, but also through the evolution of genomic organization and through gene duplication or loss. These factors are very likely to be reflected in changes in the functional properties of the genes and proteins involved in the biological pathways or complexes that drive adaptations. Genes or proteins which function as part of complexes or biological pathways can be identified through physical and genetic interactions, respectively. Physical interactions result from the association of proteins via binding interfaces which are required for their coordinated function in a protein complex. Genetic interactions, on the other hand, result from statistically significant deviations from the expected fitness of an organism when genes are deleted in combination, which can indicate functional relationships among genes that perform redundant biological roles and/or are integral members of biological complexes or pathways.
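The idea of a genetic interaction as a deviation from expected double-mutant fitness can be illustrated with a minimal sketch. It assumes a multiplicative null model, in which the expected fitness of a double mutant is the product of the two single-mutant fitnesses. That model, the function names, the fixed cutoff (a stand-in for a proper statistical test), and all fitness values are illustrative assumptions and are not taken from the screens discussed in this thesis.

```python
def epistasis_score(w_a: float, w_b: float, w_ab: float) -> float:
    """Observed double-mutant fitness minus the multiplicative expectation."""
    return w_ab - (w_a * w_b)


def classify(score: float, cutoff: float = 0.1) -> str:
    """Label a score; the fixed cutoff is an assumed stand-in for a significance test."""
    if score <= -cutoff:
        return "aggravating"   # double mutant is sicker than expected
    if score >= cutoff:
        return "alleviating"   # double mutant is fitter than expected
    return "neutral"


# Hypothetical relative fitness values (wild type = 1.0): (w_a, w_b, w_ab)
examples = {
    "redundant pair": (0.9, 0.9, 0.3),
    "same pathway": (0.6, 0.6, 0.6),
    "unrelated pair": (0.9, 0.8, 0.72),
}

for label, (w_a, w_b, w_ab) in examples.items():
    score = epistasis_score(w_a, w_b, w_ab)
    print(f"{label}: score={score:+.2f} ({classify(score)})")
```

Under these assumptions, a strongly negative score is what would be expected for genes with redundant roles, while a positive score is consistent with two genes acting in the same pathway or complex.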


The recent advent of large-scale interaction screening approaches is providing a novel means of charting the organization of the bacterial cell on an unprecedented scale. A new vista of exploration has thus opened for elucidating the organization of bacterial cellular processes; however, extracting biological meaning from these vast and complex datasets increasingly requires the development of novel integrative computational approaches. In recent years, the field of network biology has provided a valuable set of methodologies that have been successfully applied to investigate such datasets. With the recent availability of large-scale physical and genetic interaction networks in E. coli, there is great promise for gaining novel insights into the functional integration and evolutionary organization of the bacterial cell.

1.3 Biological Networks – Wiring Diagrams of Biological Processes

1.3.1 What are Biological Networks?

Proteins do not function in isolation but typically form parts of integrated biological systems such as metabolic pathways, signaling networks, and protein complexes. Much of the current knowledge of biological systems is derived from experimentally tractable and well-characterized model organisms such as E. coli and yeast [8,9]. Allied to these investigations has been the establishment of reference resources providing curated information on biochemical pathways and complexes, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [10], MetaCyc [11] and MIPS [12].

Traditional low-throughput methods employed to predict gene function have typically relied on disrupting a gene in an organism of interest, either through directed knockout or random mutagenesis, and correlating its function to a change in phenotype [13–16]. One disadvantage of such approaches is that generating gene knockouts is a labor-intensive process which is challenging to implement in a large-scale fashion. To overcome this challenge, several high-throughput interaction screening methodologies have been devised to facilitate the assignment of gene function in an unbiased fashion and on a genomic scale. These methods can essentially be divided into two kinds, physical and functional, based on the mode of interaction assessed for any given pair of genes or proteins. Physical interaction screens have been devised to detect protein-protein interactions (PPI) and infer the binding partners which comprise protein complexes, utilizing a variety of approaches including affinity-tag based pulldowns [17], two-hybrid screens [18] and BioID [19]. Functional interaction screens identify genes involved in similar biological processes and pathways, and have employed technologies such as gene expression microarrays, which identify genes with correlated expression profiles across temporal or physiological states [20], and genetic interaction (GI) screens, which identify gene knockouts that display significant epistatic buffering affecting organism growth, indicating functionally redundant or related roles across biological processes [7,21]. Such approaches typically provide functional relationships between genes or proteins as a list of binary interactions, with an associated metric assessing the statistical significance of each interaction.
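A list of scored binary interactions maps naturally onto a weighted, undirected network. The sketch below shows one possible in-memory representation. The gene names, scores, and cutoff are placeholders chosen for illustration and do not correspond to data from any screen.

```python
from collections import defaultdict

# Hypothetical scored binary interactions: (gene_a, gene_b, score)
interactions = [
    ("geneA", "geneB", 0.92),
    ("geneA", "geneC", 0.15),
    ("geneB", "geneD", 0.78),
    ("geneC", "geneD", 0.81),
]


def build_network(edges, min_score):
    """Keep interactions scoring at or above min_score as an undirected adjacency map."""
    network = defaultdict(dict)
    for a, b, score in edges:
        if score >= min_score:
            network[a][b] = score
            network[b][a] = score  # undirected: store both directions
    return dict(network)


network = build_network(interactions, min_score=0.5)

# Degree = number of retained interaction partners per gene
degree = {gene: len(partners) for gene, partners in network.items()}
print(degree)
```

Storing each edge in both directions trades memory for constant-time neighbour lookup, which is convenient for the degree and neighbourhood queries that network analyses typically start from.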

Gene expression microarrays represent one of the earliest large-scale functional screening approaches [22], based on the measurement of mRNA transcript abundances. The approach relies upon the hybridization of fluorescently labeled mRNA transcripts onto specially designed slides (microarrays) containing complementary oligonucleotide “probes” for known genes of an organism of interest [23]. Transcript abundance is then inferred by measuring the intensity of fluorescence. The capacity of microarrays to assess the expression of thousands of genes in a single experiment is a powerful means of identifying those with likely roles in biological pathways mediating the response of an organism to environmental change, genetic perturbation, or a phenotype of interest. This can be done either by examining the fold-expression change of individual genes across particular conditions, or by combining the expression levels of genes across multiple microarray experiments to construct gene expression profiles, to which clustering algorithms can be applied to identify groups of genes with correlated expression patterns across multiple conditions [24]. However, establishing thresholds for distinguishing genes with true expression signals from off-target cross-hybridization noise remains a challenge, and thus genes with low expression levels may be excluded from analysis. At present, RNA-Seq, a microarray-independent method which employs next-generation sequencing to directly measure RNA transcript levels, is frequently employed as a substitute for traditional microarray approaches [25]. This approach enables the identification of expressed genes for any organism of interest without requiring a pre-defined set of gene probes. However, transcript detection is more sensitive to sample preparation and sequencing depth.
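The two analyses described above, fold-change between conditions and correlation of expression profiles, can be sketched as follows. The expression values, gene names, and the correlation cutoff of 0.9 are hypothetical, and Pearson correlation is used here only as one common choice of similarity measure; it is not claimed to be the measure used in the cited studies.

```python
import math
from itertools import combinations


def log2_fold_change(treated: float, control: float) -> float:
    """Log2 ratio of expression between two conditions."""
    return math.log2(treated / control)


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression levels of three genes across four conditions
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],
    "geneB": [2.0, 4.0, 8.0, 16.0],
    "geneC": [8.0, 4.0, 2.0, 1.0],
}

# Gene pairs whose profiles correlate at or above an assumed cutoff
coexpressed = [
    (a, b)
    for a, b in combinations(profiles, 2)
    if pearson(profiles[a], profiles[b]) >= 0.9
]
print(coexpressed)
print(log2_fold_change(treated=8.0, control=2.0))
```

In practice the pairwise correlations would feed a clustering algorithm rather than a fixed cutoff; the cutoff is used here only to keep the example short.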


1.3.2 Protein Interaction Networks

Two-hybrid screens, initially developed for the model eukaryote yeast S. cerevisiae [26], rely upon a protein fragment complementation assay to infer physical interactions, which involves the reconstitution of a gene promoter system between directly interacting hybrid bait and prey proteins, resulting in the expression of a reporter gene [27,28]. This approach enables direct as well as transient (dissociation constant in the μM range [29]) protein interactions to be detected. In practice, the bait and prey fusion constructs are generated in separate yeast strains and conjugated, enabling high-throughput automated screening of interactions, although bacterial-based conjugation systems have also been developed [30,31]. The methodology is prone to false-positive interactions resulting from auto-activation of promoter expression by certain protein baits or from non-specific interaction of promiscuously interacting preys, which require particular consideration [27]. In addition, the extra protein domain required to generate hybrid fusion bait and prey proteins may disrupt potential interactions, resulting in false-negative interactions, which may require the generation of both N- and C-terminal protein fusion strains. Furthermore, owing to essential biological differences, two-hybrid screens of bacterial proteins performed in yeast can also introduce false-positive interactions arising from the removal of temporal or spatial constraints, for instance among proteins whose gene expression patterns are under condition-specific regulatory control [11], or proteins which are localized to different regions of the cell [32]. In a recent two-hybrid screen of E. coli proteins performed in S. cerevisiae, ~70% of the E. coli proteome was screened, encompassing ~3,606 proteins, of which ~35% were found to be involved in a binary interaction [18]. The authors of this study also noted the low overlap between the high-confidence interactions identified by two-hybrid (2,234) and those identified by previous AP-MS interaction screens, which is likely due to underlying experimental artifacts introduced by each approach [33].

Traditionally, PPIs have been assessed through small-scale co-immunoprecipitation (co-IP) experiments, which suffer a disadvantage compared to tagging-based approaches in that they require the development of specific antibodies for each protein of interest and that purification is performed under non-native conditions requiring gene overexpression [34]. To overcome these limitations, affinity-tag based PPI pulldowns have been devised which utilize collections of genetically modified bacterial strains. In each strain an individual gene is modified with an affinity-tag sequence, enabling its protein product, known as the bait protein, to bind effectively to an affinity column and purify along with its potential interactors, or prey proteins, which are identified through subsequent mass-spectrometry peptide identification [17] (Figure 1-1AB). One important caveat for both antibody and affinity-tag mass-spectrometry (AP-MS) based pull-down proteomics screens is that a bait protein will potentially be identified with multiple preys, not all of which may be valid, due either to indirect association or to experimental artifacts such as non-specific binding of proteins to the column, or over-expressed “background noise” proteins (e.g. the ribosome). To reduce the recovery of these so-called false-positive interactions, bait-prey protein interactions are subject to a PPI scoring metric [35–40] (Figure 1-1C), which indicates the statistical significance of an interaction based on the co-occurrence of prey and bait peptides identified across multiple purification experiments. Interactions can be inferred using two underlying models: a spoke model, where the bait is assumed to interact directly with each co-purified prey, or a matrix model (Figure 1-1D), where the co-purified preys are also assumed to interact with one another based on their frequency of co-occurrence in independent purification experiments. The choice of model is relevant, as the two may yield different sets of interactions, possibly rejecting those that are genuine [41]. Tandem Affinity Purification (TAP) of tagged protein baits [38] is a commonly employed proteomics approach for identifying the interacting partners of an individual protein under native expression. Several large-scale AP-MS screens have been performed in recent years for the de novo identification of soluble protein complexes in E. coli, and have consequently been applied to the functional annotation of uncharacterized proteins through their association with members of previously identified complexes [43,44]. Such studies have varied in their coverage of potential interactions, owing to limitations in the number of baits screened, and also in their ability to detect transient interactions, i.e. those dependent on the availability of a or chemical modification, such as phosphorylation, which plays an important role in numerous bacterial signaling processes [45]. However, subsequent AP-MS studies have attempted to address these shortcomings through the application of mass spectrometers of increasing sensitivity, refined purification approaches, integration of different probability scoring metrics, and additional filtering of biologically valid interactions using a set of gold-standard reference PPIs derived from additional small-scale interaction experiments (see section 1.3.3).
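The difference between the spoke and matrix models can be made concrete with a short sketch. It assumes the commonly used definitions: the spoke model links each bait only to its co-purified preys, whereas the matrix model additionally links all proteins recovered in the same purification to one another. Scoring by co-occurrence frequency is omitted, and the bait and prey names are hypothetical.

```python
from itertools import combinations

# Hypothetical purifications: bait -> preys recovered with it
purifications = {
    "bait1": ["preyX", "preyY"],
    "bait2": ["preyY", "preyZ"],
}


def spoke_model(purifications):
    """Link each bait only to the preys recovered in its own purification."""
    edges = set()
    for bait, preys in purifications.items():
        for prey in preys:
            if prey != bait:
                edges.add(frozenset((bait, prey)))
    return edges


def matrix_model(purifications):
    """Link every pair of proteins (bait and preys) recovered together."""
    edges = set()
    for bait, preys in purifications.items():
        members = sorted({bait, *preys})
        for a, b in combinations(members, 2):
            edges.add(frozenset((a, b)))
    return edges


spoke = spoke_model(purifications)
matrix = matrix_model(purifications)

# Interactions that only the matrix model proposes (prey-prey pairs)
matrix_only = matrix - spoke
print(len(spoke), len(matrix), sorted(sorted(edge) for edge in matrix_only))
```

Every spoke interaction is also a matrix interaction, so the matrix model can only add candidate prey-prey pairs. This is why the two models can lead to different interaction sets once a score cutoff is applied.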

Figure 1-1. Overview of PPI Detection using Tandem Affinity Purification, adapted from 46. A – Cellular lysates from an organism of interest expressing a C- or N-terminally TAP-tagged bait protein of interest are subject to a two-step purification procedure resulting in the enrichment of stably bound, physically interacting protein preys, while contaminants and non-specific interactors are removed. B – The composition of purified complexes of bait and prey proteins is identified using MS for multiple purification experiments. C – Scores are assigned to assess the probability of interaction for a list of all pairs of bait and prey proteins identified across multiple MS experiments. D – A graphical representation of the protein complex is reconstructed using either the spoke or matrix model of interactions derived from the probabilistic scoring procedure.

In addition to the widely employed AP-MS and two-hybrid approaches, several novel methods are being employed to assess large-scale PPI in bacteria and eukaryotes. In mammalian cell culture systems, a modification of the AP-based approach called luminescence-based mammalian interactome mapping (LUMIER) utilizes engineered bait proteins fused to a luciferase tag to efficiently identify interactions with a FLAG-tagged prey protein of interest47. In contrast, in the BioID interaction screening approach the protein of interest is fused to a promiscuous biotin ligase, which biotinylates proximal proteins for identification through biotin affinity capture; this has the advantage of enabling transient and weak interactions, as well as insoluble proteins, to be detected48. More recently, co-elution profiling has been introduced as a label-free approach which relies upon multiple HPLC fractionations of whole-proteome lysates to detect potential interactors based on overlapping co-elution patterns, and offers an advantage in requiring less time, labour and expense in strain generation, sample preparation and purification49,50.

1.3.3 Genetic Interaction Networks

The apportioning of genes into physical complexes and biological pathways represents one level of functional organization of a genome. Synthetic Genetic Array analysis (SGA) represents a novel approach to assess how complexes and pathways are functionally integrated into higher-order biological processes based on epistatic fitness effects of pairwise gene deletions. Epistasis occurs when the fitness effect of an allele at one locus is masked by another at a distinct locus, and is commonly referred to as a genetic interaction (GI)1,51. One form of epistasis, called synthetic lethality, occurs when genes that are dispensable for survival when deleted in isolation result in a lethal phenotype when deleted in combination. Synthetic lethality is understood as a hallmark of genes with biologically redundant functions, and can be used to identify potential biological roles for uncharacterized genes52 as well as novel functional relationships between well-established complexes53. Based on this principle, the SGA methodology was initially developed using the budding yeast, S. cerevisiae, as a means to generate and screen for synthetic lethality phenotypes for pairs of gene deletion mutants54. A major advantage of SGA has been its amenability to automation, enabling the large-scale screening of pairwise epistatic interactions between genes in vivo, including those that are essential to viability. Since SGA analysis is based on quantitative phenotypic traits, it can also enable biologists to extract a potentially vast amount of functional information from a biological system that cannot otherwise be gleaned from methods such as large-scale protein affinity pulldowns and genomic context methods.

An extension of the SGA method, called Epistatic Miniarray Profiling (E-MAP)55, has also been developed as a quantitative approach to detect functional interactions between genes with non-lethal epistatic fitness effects. In the E-MAP methodology the GI-epistasis score for a given gene pair is determined from the difference between the observed growth phenotype of a double-deletion mutant and the fitness expected from the single-deletion mutant genotypes55. Consequently, epistatic interactions can be of an aggravating or alleviating nature depending on whether the observed fitness is worse or better than expected, respectively. The degree to which a GI is alleviating or aggravating can be quantified using the multiplicative model of epistasis, but other models can also be employed56. In this manner a variety of functional relationships can be inferred among epistatically linked pairs of genes: those demonstrating aggravating epistatic relationships can be inferred to possess a biologically redundant function, whereas alleviating epistasis can implicate genes which function as members of the same protein complex or metabolic pathway, or as mediators of transcriptional expression57,58. As a result, this approach allows functional annotations to be transferred between genes, or protein complexes and biological pathways to be deduced through the correlation and clustering of GI profiles21, which can ultimately motivate more focused small-scale validation experiments.
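As a minimal sketch of the multiplicative model, the epistasis score can be written as the observed double-mutant fitness minus the product of the two single-mutant fitnesses. The function names, the fitness values and the neutrality tolerance `tol` below are illustrative assumptions; published scoring schemes typically add normalization and variance estimates, which are omitted here.

```python
def epistasis_score(w_a, w_b, w_ab):
    """Observed double-mutant fitness minus the multiplicative expectation.

    w_a, w_b: relative fitness of each single-deletion mutant (wild type = 1.0)
    w_ab:     relative fitness of the double-deletion mutant
    """
    return w_ab - w_a * w_b


def classify(eps, tol=0.05):
    """Label an interaction; tol is an arbitrary illustrative threshold."""
    if eps < -tol:
        return "aggravating"
    if eps > tol:
        return "alleviating"
    return "neutral"


# Hypothetical fitness measurements: (w_a, w_b, w_ab).
examples = {
    "redundant pair": (0.9, 0.8, 0.2),
    "same-complex pair": (0.7, 0.7, 0.7),
    "independent pair": (1.0, 0.5, 0.5),
}

for label, (w_a, w_b, w_ab) in examples.items():
    eps = epistasis_score(w_a, w_b, w_ab)
    print(label, round(eps, 2), classify(eps))
```

The second example reflects the reasoning in the text: if two genes act in the same complex, deleting the second adds little further cost, so the double mutant is fitter than the multiplicative expectation.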

A myriad of S. cerevisiae E-MAP studies have demonstrated the usefulness of the approach in inferring novel functional connections among well-defined biological systems, such as chromosomal organization55,59, RNA processing60, the early secretory pathway61 and a global interaction network covering 75% of the S. cerevisiae genome62. E-MAPs have also been used in conjunction with protein-interaction datasets to reveal novel functional connections between known protein complexes involved in chromosome biology63,64. In response to the valuable biological insights that can be gained through SGA, analogous techniques have also been devised for E. coli, named E. coli Synthetic Genetic Array analysis (eSGA)21 and Genetic Interaction Analysis Technology for E. coli (GIANT-coli)65.

However, eSGA studies remain few in number and limited in the extent of genes screened compared to S. cerevisiae. Aside from a proof of concept, the only other eSGA study performed thus far has assessed genes involved in membrane biogenesis processes under nutrient-rich and nutrient-limiting culture conditions66. The genetic interaction network of E. coli therefore remains limited, and its further elucidation can provide crucial insight into the large-scale functional organization of bacterial processes.

1.3.4 Computationally Derived Functional Interaction Networks Inferred Through Genomic Context Methods

As a complement to experimental approaches, a set of metrics derived from fully sequenced bacterial genomes, falling under the heading of Genomic-Context (GC) approaches, has also been developed to predict physical or functional interactions of genes and proteins in both model and understudied organisms. Unlike the experimental approaches previously described, GC methods infer functional associations for a given gene pair of interest based solely on features calculated from genome sequences. The features typically examined include phylogenetic profiling67,68, conservation of gene-order69–71, chromosomal proximity72, and gene-fusion (Rosetta Stone)73 (Figure 1-2). It has been shown that, in general, the correlation of these features across different bacterial species indicates a co-evolutionary relationship between genes that are also likely to have related biological functions or to physically interact. Each method and selected studies will be briefly summarized below.

Figure 1-2. Illustration of Genomic Context Measures, adapted from 74. A – Phylogenetic profiling tracks the presence or absence of orthologs of genes from a reference organism of interest across other distinct genomes to determine phylogenetic co-conservation. B – Gene ordering compares the conservation of the order of orthologous loci across species genomes. C – Chromosomal proximity compares the intergenic distances (indicated by dashed lines) across species genomes. D – Rosetta stones are identified by the occurrence of fusion events between neighbouring loci in distinct species genomes.

Phylogenetic profiling determines whether a pair of genes is likely to be functionally related based on their correlated patterns of presence or absence in other organisms, given that co-conserved genes tend to function as members of protein complexes or metabolic pathways75,76. Thus for each protein in a genome of interest a phylogenetic profile is constructed, represented by a vector of the presence or absence of orthologs (i.e. genes that have arisen from the evolution of a common ancestral sequence through speciation - see section 1.4.3) within the proteomes of a set of compared genomes. The degree of functional interaction of a given pair of proteins is then determined by calculating the correlation between their phylogenetic profiles (Pearson Correlation, Jaccard Coefficient, or Mutual Information). It is important to note that species selection in the construction of phylogenetic profiles must be carefully considered72, that gene duplication (paralogy – genes which have arisen through duplication prior to or following speciation – see section 1.4.3) may lead to false-positive predictions73, and that the approach tends to be less effective for highly conserved proteins.
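The three profile-similarity measures named above can be sketched for binary presence/absence vectors as follows. The gene names and the eight-genome profiles are hypothetical, and the functions are plain textbook definitions rather than the implementation used in any of the cited studies.

```python
from math import log2, sqrt


def jaccard(p, q):
    """Genomes containing both genes divided by genomes containing either."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def pearson(p, q):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / sqrt(var_p * var_q)


def mutual_information(p, q):
    """Mutual information (in bits) between two binary profiles."""
    n = len(p)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_xy = sum(1 for a, b in zip(p, q) if a == x and b == y) / n
            p_x = sum(1 for a in p if a == x) / n
            p_y = sum(1 for b in q if b == y) / n
            if p_xy > 0:
                mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical profiles across eight genomes (1 = ortholog present).
profile_a = [1, 1, 0, 0, 1, 0, 1, 1]
profile_b = [1, 1, 0, 0, 1, 0, 1, 0]
profile_c = [0, 0, 1, 1, 0, 1, 0, 0]

print("A vs B:", jaccard(profile_a, profile_b), pearson(profile_a, profile_b))
print("A vs C:", jaccard(profile_a, profile_c), pearson(profile_a, profile_c))
```

Note that mutual information is high for strongly anti-correlated profiles as well as for correlated ones, whereas the Jaccard coefficient rewards only shared presence; this is one reason the choice of measure matters.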

The conservation of gene-ordering and direction of transcription can also be utilized to discern groups of genes in bacteria that are likely to comprise co-transcribed units, or operons, which typically encode members of protein complexes or biological pathways77. Importantly, gene-ordering enables a greater resolution of subsets of functionally related genes that may be missed by phylogenetic profiles. However, as has been well documented, different bacterial species undergo different rates of chromosomal rearrangement78, and thus in a genome of interest not all functionally related genes may possess a conserved gene-ordering. Examining genes based on their general chromosomal proximity can therefore be utilized to account for variability of operon organization and potentially identify operons with novel compositions. In more extreme cases, a co-conserved gene pair may also have undergone a fusion event in a given species, which is found to occur commonly for proteins known to physically interact79. Such gene-fusion events occurring in distantly related species were originally termed Rosetta-Stones, as they provide a key for deciphering the function of uncharacterized genes based on the known functional annotation of their fused partners80.
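As a rough illustration of the proximity idea, the sketch below groups consecutive genes on the same strand whose intergenic gap falls below a cutoff. It is a deliberately simplified single-genome heuristic: the GC methods described above compare proximity across many genomes, and the gene coordinates, the 100 bp cutoff and the function name are assumptions made for this example only.

```python
def candidate_operons(genes, max_gap=100):
    """Group adjacent same-strand genes separated by at most max_gap bases.

    genes: list of (name, start, end, strand) tuples sorted by start.
    Returns lists of gene names for groups containing two or more genes.
    Genome circularity is ignored in this sketch.
    """
    groups = []
    current = []
    for gene in genes:
        if current:
            previous = current[-1]
            gap = gene[1] - previous[2]
            if gene[3] == previous[3] and gap <= max_gap:
                current.append(gene)
                continue
            groups.append(current)
        current = [gene]
    if current:
        groups.append(current)
    return [[g[0] for g in group] for group in groups if len(group) > 1]


# Hypothetical annotation: (name, start, end, strand).
genes = [
    ("g1", 100, 1000, "+"),
    ("g2", 1020, 1900, "+"),
    ("g3", 1950, 2500, "+"),
    ("g4", 3500, 4200, "+"),
    ("g5", 4250, 5000, "-"),
]

operons = candidate_operons(genes)
print(operons)
```

Here g4 is excluded because of its large upstream gap, and g5 because it lies on the opposite strand, even though it is physically close to g4.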

1.3.5 Integrative Approaches to Boosting Reliability of Physical Interactions with Predicted Functional Assignments

GC methods can be employed to boost the reliability of PPI through integration with datasets generated from high-throughput screening approaches. Together with small-scale experimental studies, such integrated datasets can also be used in the functional prediction of uncharacterized proteins, and are made available for numerous bacterial species through public online databases such as STRING81. Similar integrative approaches have also been applied to large-scale interaction screening studies performed specifically for E. coli. For example, one such study82 implemented an integrative machine learning approach to refine an initial screen of 5,993 physical interactions (PPI) of 1,757 soluble proteins (with a 75% confidence of interaction), with 74,776 functional interactions computationally inferred utilizing genomic context (GC) datasets (with an 80% confidence of interaction) (see section 1.3.3). Little overlap was reported between PPI derived from small- and large-scale experiments and GC interactions (6.6%), which the authors interpreted as an indicator of the complementarity of these datasets. Indeed, the authors were able to demonstrate the validity of integrative approaches for inferring protein functional assignment by identifying and experimentally confirming roles of uncharacterized genes involved in cell envelope biogenesis, DNA replication, translation, flagellar biogenesis and cell division. Expanding upon this work, another group devised a Bayesian integration approach to generate a high-confidence functional network encompassing 1,941 E. coli proteins, utilizing interactions derived from large-scale AP-MS pulldowns, GC functional interactions, and additional interaction datasets curated from small-scale interaction experiments and literature curation74. 
A network clustering approach identified 316 functional modules showing a greater degree of functional enrichment than previous efforts82 and comprising a diverse catalogue of functionally distinct and biologically relevant complexes and pathways in bacteria.

1.4 Comparative Genomics and its Application Toward the Study of Biological Networks

1.4.1 What is Comparative Genomics?

Bacteria represent a phylogenetically diverse group of organisms which demonstrate a remarkable variety of lifestyles and strategies, ranging from free-living in aquatic or terrestrial environments to intimate associations (symbiosis) with other organisms with neutral, beneficial, or harmful (i.e. pathogenic) consequences for their hosts. Furthermore, host associations are reflective of bacterial taxa that have evolved specific adaptations to a particular host environment, e.g. the skin or the gut83–85. With recent advances in next-generation genome sequencing technologies resulting in the generation of thousands of bacterial genomes covering diverse phyla86, opportunities are emerging to understand the underlying genetic mechanisms that facilitate these diverse lifestyle strategies. For example, recent sequencing initiatives such as the Genomic Encyclopedia of Archaea and Bacteria87 have uncovered a vast pool of novel uncharacterized prokaryotic genes from previously neglected prokaryotic phyla. Such genes offer enormous potential in driving the evolution of distinct lifestyle strategies. Consequently there has been much interest in the development and application of computational methods to annotate the ever-increasing resource of sequence data, which represents a new avenue for research and discovery of novel bacterial functional repertoires (section 1.4.3).

1.4.2 Bacterial Genomic Evolution

A number of evolutionary forces play an important role in the emergence and elaboration of genes that mediate bacterial adaptation. For instance, spontaneous mutations arising during bacterial genomic replication contribute to sequence divergence of genes which function in biological processes responsible for nutrient acquisition and metabolism, cell growth and replication, which have important implications in the adaptation of bacteria to diverse environments or lifestyles as well as the development of antibiotic resistance88–93. Adaptation can also occur through the gain or loss of genetic material, which is reflected in differences in the composition of bacterial genomes93. Gene gain can occur through the genesis of genetic loci de novo from previously non-coding sequence94, through the duplication of pre-existing genetic loci, where through sequence divergence one duplicate copy may evolve a novel function (neofunctionalization) or the ancestral function may be divided between both duplicate copies (subfunctionalization)95, or through the acquisition of genes via horizontal transfer from one bacterial species to another. Loss of genetic loci can also occur through sequence divergence, by either the silencing of gene expression or mutation of a gene to encode a non-functional product (pseudogenization), or through the loss of loci in their entirety. Culture-based experimental studies of Salmonella enterica have indicated that gene duplication and loss occur spontaneously, with their rates likely to be influenced by a variety of factors, such as selection pressure and the presence of repeating DNA elements flanking the locus in question96–99. The application of comparative genomic approaches has greatly facilitated the large-scale investigation of the significance of these evolutionary forces as drivers of bacterial adaptation. However, the systematic exploration of their functional implications has been limited to well-characterized genes derived largely from experimental studies performed in model organisms. It has recently become possible to systematically investigate the functional consequences of these forces in the context of biological pathways defined utilizing large-scale biological networks generated for E. coli. In the following sections I will describe the basis of comparative genomics and functional prediction based on the resolution of the ancestral relationships of genes through sequence homology approaches and their application toward the study of biological networks.

1.4.3 Defining the Evolutionary Relationships of Genes – Orthology and Paralogy

In identifying gene duplications from bacterial genomic sequences two key terms are employed, orthology and paralogy. Orthologous genes are those which have arisen through a speciation event, while paralogous genes have arisen through a duplication event either pre- (out-paralog) or post- (in-paralog) speciation100. This evolutionary framework has important implications in inferring the biological roles of duplicates based on sequence homology. Orthologous genes found among diverse species are more likely to have conserved their biological roles. For paralogous genes, on the other hand, one copy may retain its original function while the other can accumulate changes in its coding sequence and evolve a modified biological role or become a non-functional pseudogene. A recently performed survey of 13 phylogenetically diverse species genomes revealed a significantly greater tendency for orthologous genes to share experimentally supported functional annotations compared to paralogs, supporting the notion that orthologs are more likely to have conserved biological roles101. Furthermore, the average rate of functional divergence (inferred from comparing the proportion of non-synonymous to synonymous substitutions) has also been shown to be significantly greater for paralogous than orthologous protein sequences across diverse bacterial taxa102, supporting the view that gene duplication contributes to the evolution of novel biological functions. In the following sections I will describe various approaches that can be applied toward predicting the biological roles of orthologous and paralogous genes.

1.4.4 Computational Approaches for Functional Annotation of Protein Sequences from Sequence Homology

Among the more widely adopted methods of predicting gene function are those that rely on sequence similarity searches that attempt to identify putative homologs of previously characterized genes. Such approaches range from the naive use of an established tool such as BLAST103, to more sophisticated tools that facilitate the concurrent detection of orthologous proteins across species104. Indeed, numerous pipelines now exist that facilitate automated functional annotation of novel genome sequences, two of the more notable BLAST-based approaches being Rapid Annotation using Subsystem Technology (RAST)105 and the NCBI Prokaryotic Genome Automatic Pipeline106. Further tools allow the prediction of specialized protein properties, such as cellular localization107 and enzymatic function108–110.

1.4.4.1 Sequence Homology Based on Conservation of Structural Domains: Protein Families

Alternatives to BLAST-based sequence searching approaches are profile-based sequence similarity searching methods, which are utilized by online databases such as PROSITE111 and the Protein Families Database (PFAM)112. Such approaches exploit the concept of heterogeneous selection pressure across protein sequences, and assign putative functions to unannotated proteins based on locally conserved regions within their protein sequence that are likely to correspond to distinct functional (motif) or structural (domain) features. Identifying specific combinations of motifs and domains that are conserved among phylogenetically diverse bacterial genomes can be used to define evolutionarily and functionally related lineages of protein sequences, called protein families. Based on manually curated sets of protein sequence alignments, the PFAM database provides specific sequence profiles for thousands of protein families, which can be employed in the functional annotation of bacterial genomes through hidden Markov model based searches using the HMMER software suite113.

1.4.4.2 Sequence Homology Based on Inference of Evolutionary Relationships: Gene Families

Utilizing a combination of the approaches mentioned above, it is possible to elucidate changes in the overall functional capacity of bacterial genomes through comparative genomics studies of conserved orthologs, as well as the diversity of protein families by examining the divergence in the function of single protein-encoding genes and their composition of known motifs or domains114. Previous comparative genomics based surveys have been largely limited in their findings in two important respects: first, by the scope of available fully sequenced bacterial genomes115,116, and second, by biological insights constrained to comparisons of the overall conservation of broadly defined functional classes of well-characterized proteins, which lack resolution of their functional relationships with other proteins involved in specific biological pathways117.

However, functional prediction based on sequence similarity can be compromised by the presence of gene-duplication events, making it difficult to discriminate the roles of duplicate, i.e. paralogous, genes111. Also, through horizontal gene transfer (HGT) a functionally redundant duplicate, known as a xenolog118, may be present in a genome. All of these aspects of prokaryotic evolution are known challenges for the automated functional annotation of novel prokaryotic genomes104, and numerous specialized methods have been devised for resolving sequences based on the inference of phylogeny-based orthology relationships119. A common approach in these methods is to leverage the available sequence diversity across hundreds of genomes and apply a clustering approach to define gene families consisting of sequences likely to share a common evolutionary descent (COG120; InParanoid121; ORTHOLUGE122; eggNOG123; ORTHOMCL124).

Elucidating potential functional divergence between orthologous and paralogous sequences is crucial to investigating the evolution of bacterial biological processes across phylogenetically diverse bacterial species. For instance, it is possible that a potential orthologous sequence may not be detected because of overall low sequence identity between two phylogenetically distinct species125,126. Of further biological relevance, although it is possible to infer orthologous sequences with common evolutionary descent, it is likely that under different environmental contexts selection pressure may cause orthologs in different species to become divergent in function. In specific instances this is seen to be important for the fine-tuning of metabolic efficiency, as in genomically reduced endosymbionts and parasites127, but it is also likely to have broad implications in bacteria for the elaboration of novel metabolic pathways and the organization of pre-existing biological pathways114.

1.4.5 Application of Biological Networks and Comparative Genomics Approaches to Study the Evolution of Bacterial Biological Processes

One of the major aims of systems biology is to determine how the complex behaviours of a cell are carried out via underlying interactions of the multitude of genes and proteins encoded by a genome. In the previous sections I have discussed some of the main approaches to tackling this challenge through the generation of large-scale interaction datasets, which aim to place individual genes into their functional context. However, it is evident that these datasets can be further exploited to gain important insight into how genes and proteins evolve within the context of a biological system and enable adaptation of prokaryotes to distinct lifestyles. The integration, analysis, and visualization of such vast amounts of data require the extensive use of computational tools and of the concepts developed over the past decade of work in the field of network biology.

The elucidation of the organizational principles of complex biological systems and their evolution is a goal of network biology. The analysis of biological networks and their evolution through comparative genomics can lead to an important understanding of how protein complexes and biochemical pathways function in redundant biological processes, and how changes/rewiring of these pathways may support bacterial adaptability to diverse environments or lead to changes in bacterial lifestyle. The primary mode of exploring and analyzing such networks involves graph theoretical approaches.

In network graphs the genes or proteins are depicted as individual nodes linked by edges indicating shared biological relationships. The meaning of these relationships depends on the process described by the network. For example, nodes and edges in physical interaction networks represent proteins and their physical associations, respectively, such as those found in protein complexes. Metabolic networks can further be represented as a bi-partite graph, where nodes might represent a metabolic enzyme or metabolite, and the edges connecting them represent an enzymatic reaction. Features that can be abstracted from network graphs are the number of interactors connected to a given node, i.e. node degree, and the number of edges that connect any two nodes in the network, i.e. a path. From these basic features, topological properties of a network can be deduced and utilized to examine the organizational principles of the biological processes represented in a biological network. In both biological and non-biological networks studied to date, the distribution of node degree typically follows a power-law function, where the majority of nodes are sparsely connected with the exception of a few nodes, called hubs, that are highly connected. This distribution is termed scale-free, suggesting that the distribution of physical associations reflects an important property of the functional organization of biological systems, and thus has not evolved through random processes43,44,128. For instance, in both eukaryotic (S. cerevisiae) and prokaryotic (E. coli) physical interaction networks, hub nodes have been noted to interact with proteins with related biological roles and are enriched in those essential for survival129.
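The basic graph features described above (node degree, path length and the degree distribution) can be computed with a short sketch using only an adjacency dictionary. The edge list and node names are hypothetical, and the example is far too small to say anything about power-law behaviour; it only shows how the quantities are obtained.

```python
from collections import Counter, deque


def build_graph(edges):
    """Undirected graph as a dictionary mapping node -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degrees(graph):
    """Node degree: the number of interactors connected to each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def shortest_path_length(graph, source, target):
    """Number of edges on a shortest path, found by breadth-first search."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no path connects the two nodes


# Hypothetical interaction network with one highly connected node.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("d", "e")]
graph = build_graph(edges)
node_degrees = degrees(graph)
degree_distribution = Counter(node_degrees.values())

print(node_degrees)
print(dict(degree_distribution))
print(shortest_path_length(graph, "a", "e"))
```

In a real network the degree distribution would be examined on logarithmic axes, where an approximately straight line is the usual informal indication of power-law behaviour.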

1.4.5.1 Robustness and Evolvability

The concept of robustness in biological systems has received a significant degree of study in the field of theoretical population genetics. One pertinent question of this field involves resolving the mechanisms by which organisms maintain their fitness and survive in the face of deleterious mutations that arise spontaneously through errors in replication. This question stems from an important theoretical proposition, named Muller’s Ratchet, which states that under the effects of decreased population size and genetic drift, organisms should not be able to survive due to the irreversible accumulation of random deleterious mutations130. However, the impact of a mutation can vary substantially depending on several factors, one of which is the variability of environmental conditions, i.e. selective pressures, which may mitigate the potentially detrimental fitness effect of a mutation127,131. Additional experimental research is providing support for the notion that mutations can be exploited to facilitate adaptation to novel environments and the development of robust biological responses. An example of this is provided by the well-noted phenomenon of genetic epistasis as described above (section 1.3.3); a number of large-scale gene deletion studies in E. coli as well as S. cerevisiae attest to the remarkable dispensability of the majority of genes in bacteria and eukaryotes, which is the result of functional redundancy selected for in biological networks to compensate for the potentially detrimental effects of spontaneous mutations132,133. Ongoing directed selection experiments in E. coli have further sought to disentangle the acquisition of mutations and their epistatic effects in an experimentally reproducible fashion134. In one study, 12 replicate E. coli populations were grown for 20,000 generations under glucose limitation and 5 initial growth benefit-conferring mutations were identified in the genes regulating the utilization of alternate sugar sources135. Although nearly all mutations, singly and in combination, were found to confer relative fitness increases when reconstituted in the ancestral genotype, the authors noted a decreasing trend in the benefit of mutations acquired over time, which they concluded was a consequence of aggravating epistatic interactions producing “genetic drag” as population fitness optima are reached. These results have further implications toward reconciling the well-noted topological features of biological networks and can be exploited to investigate the evolution of genes and proteins that shape biological processes relevant for bacterial adaptation and survival.

1.4.5.2 Defining Modularity in Biological Networks

Modules are groups of biological entities that can perform a distinct function in isolation136, and in biological networks correspond to subsets of nodes, called clusters, which share a greater density of edges with one another than with the rest of the network. Examples of biological modules range from prokaryotic membrane transporters involved in nutrient acquisition and antibiotic efflux, to large molecular machines such as the bacterial flagellum, redundant iron-sulfur biogenesis clusters, and cell-envelope biogenesis pathways21,128,137.

A study of the evolutionary conservation of metabolism across bacterial, archaeal, and eukaryotic genomes76 identified a core of widely conserved enzymes belonging to essential metabolic pathways, with a periphery of enzymes of limited conservation likely representing adaptations of particular phyla. By constructing phylogenetic profiles of enzymes mapped to defined KEGG pathways and calculating their phylogenetic co-conservation using a Jaccard coefficient (measuring the extent to which the phylogenetic profiles for a pair of genes overlap), the authors were able to calculate the evolutionary modularity of various pathways by summing the co-conservation of all enzyme pairs belonging to a pathway of interest. Given that many enzyme classes belong to highly conserved gene families, and that species selection can bias phylogenetic profiling approaches, the statistical significance of pathway modularity scores was assessed using distributions of shuffled enzyme phylogenetic profiles. From this approach, the authors were able to identify highly co-conserved submodules of enzymes within pathways, and also an appreciable extent of shared enzyme memberships across related pathways, indicating a degree of flexibility in metabolism across life.
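The pathway modularity calculation summarized above can be sketched as a sum of pairwise Jaccard coefficients with a permutation-based significance estimate. The shuffling scheme used here (independently permuting each enzyme's profile across genomes, which preserves how widely each enzyme is conserved) is an assumption made for illustration, as are the function names and the toy profiles; the cited study may have used a different null model.

```python
import random
from itertools import combinations


def jaccard(p, q):
    """Genomes containing both enzymes divided by genomes containing either."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def pathway_modularity(profiles):
    """Sum of pairwise co-conservation over all enzyme pairs in a pathway."""
    return sum(jaccard(p, q) for p, q in combinations(profiles, 2))


def permutation_p_value(profiles, n_shuffles=1000, seed=0):
    """Fraction of shuffled pathways scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = pathway_modularity(profiles)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = []
        for profile in profiles:
            permuted = list(profile)
            rng.shuffle(permuted)  # keeps each enzyme's overall conservation
            shuffled.append(permuted)
        if pathway_modularity(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical pathway of three enzymes conserved in the same three genomes.
pathway = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
]

score = pathway_modularity(pathway)
p_value = permutation_p_value(pathway, n_shuffles=200)
print(score, p_value)
```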

Identifying modules in biological networks is not a trivial task and numerous clustering approaches have been developed138. With regard to module identification in physical interaction networks, some commonly utilized clustering algorithms include Molecular Complex Detection (MCODE)139, Affinity Propagation (AP)140, and Markov Clustering (MCL)141, of which MCL has been shown to perform with the greatest accuracy and robustness in recapitulating known biological complexes from noisy or sparse datasets138,142. Each method identifies modules based on different aspects of the underlying network topology and can be easily implemented in the popular network visualization tool, Cytoscape143,144.
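To give a sense of how Markov Clustering operates, the following is a compact pure-Python sketch of its core loop: a column-normalized adjacency matrix with self-loops is repeatedly squared (expansion) and raised element-wise to a power before renormalization (inflation). It is intended only to illustrate the principle; the inflation value, tolerances, cluster read-out and toy graph are assumptions, and real analyses would use the MCL program itself or an established implementation.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: element-wise power, then renormalize columns."""
    return normalize_columns([[value ** power for value in row] for row in m])


def mcl(nodes, edges, inflation=2.0, max_iter=100, tol=1e-8):
    """Return a set of clusters (frozensets of node names)."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    # Adjacency matrix with self-loops on the diagonal.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        m[index[a]][index[b]] = 1.0
        m[index[b]][index[a]] = 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        updated = inflate(expand(m), inflation)
        change = max(abs(updated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = updated
        if change < tol:
            break
    # Rows with a non-zero diagonal act as attractors; the non-zero
    # entries of an attractor row give the members of its cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6))
    return clusters


# Hypothetical network: two triangles joined by a single bridging edge.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
print(mcl(nodes, edges))
```

Raising the inflation parameter generally yields finer-grained clusters, which is the main tuning decision when applying the method to interaction data.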

Modularity is understood to be an organizational principle of biological systems that enables the development of complex behaviours, such as those required during the response to environmental change. Such behaviours are mediated through the expression of smaller functional units, which may represent components that physically interact to form a macromolecular machine, or complements of enzymes forming metabolic pathways that allow substrates to be channeled toward a variety of products depending on need. For example, modularity is a feature observed among enzymes in bacterial metabolic networks, which are found to comprise clusters involved in the processing of distinct classes of substrates, e.g. carbohydrates, lipids, amino acids, nucleic acids and co-enzymes145. Support for a modular organization of biological networks is also exemplified by the successful application of modular principles in synthetic biology to construct engineered bacterial strains through the combination of “biological parts” that represent components of metabolic pathways or distinct complexes146. Furthermore, because prokaryotes are capable of growth, reproduction, and survival under diverse environments, it follows that the genes or proteins that comprise the biological modules underlying these processes have been selected and refined by evolution to function efficiently together.

1.4.5.3 Functional and Evolutionary Modularity

It is well established that horizontal gene transfer is an important factor in the evolution of prokaryotes, as supported by an increasing corpus of work147 which is furthering our understanding of the contribution of modularity to prokaryotic evolution. For instance, when genes are transferred between distinct bacterial species, it is commonly understood that their likelihood of being retained depends on their ability to function as a discrete functional unit, i.e. modularly, and therefore independently of their genetic context. This notion of biological modularity has been demonstrated by the identification of “pathogenicity islands”, which are horizontally acquired genomic regions identified through comparative genomics studies of closely related strains of prokaryotes148. The extensive genomic sequencing of Yersinia strains has recently led to the discovery that the independent acquisition of plasmids and pathogenic determinants resulted in the emergence of the human pathogens Yersinia pseudotuberculosis and Yersinia pestis from distinct environmental non-pathogenic lineages, contrary to the former notion of their divergence from a common ancestor149.

The bacterial genome itself is partitioned into co-regulated and co-transcribed sets of genes, called operons, which typically encode members of physically interacting protein complexes and metabolic pathways11,150. Comparative genomics studies of experimentally characterized operons in E. coli have further examined the general evolutionary forces responsible for the organization of operon structure across prokaryotes, revealing significant roles for operon gene ordering and intergenic distance as a means to ensure effective regulation of the expression of lowly expressed metabolic pathways and to influence the assembly of protein complexes151–155. Operons illustrate an important aspect of the principle of functional modularity and its exploitation by bacteria to adapt to environmental challenges. Instances of horizontal transfer of partial to entire gene neighbourhoods, likely corresponding to bacterial operons, have been identified across phylogenetically diverse prokaryotes, representing complexes and pathways as diverse as the ribosome, lipid biosynthesis and NADH oxidoreductase156. In addition, bacteria possess a remarkable array of operon-encoded transport systems comprising diverse protein families derived from duplication events. Well studied examples include the class of ATP-Binding Cassette (ABC) transporters, of which 65 have been experimentally characterized in E. coli157. These possess a general architecture comprising: two cytoplasmic ATP hydrolyzing domains; two transmembrane domains making up the transporter pore; and either a periplasmic substrate binding protein, specifically associated with transporters with import functions, or a periplasmic adapter protein, which facilitates substrate export via the outer membrane channel TolC. E. coli also possesses additional operon-encoded multiple drug export systems belonging to the major facilitator (MFS) and resistance-nodulation-cell-division (RND) protein superfamilies, which follow a similar theme of elaboration following duplication158. The vast array of transporters demonstrates the important role played by duplication in enabling the survival of bacteria under varying environmental conditions and cellular stress. Recently, a directed evolution study of Burkholderia RND transporters has provided additional experimental evidence supporting the importance of gene duplication and subfunctionalization in facilitating bacterial adaptation to environmental stress, particularly as a means to enable the fine-tuning and regulation of duplicate transporter operons159.

One of the first large-scale PPI networks, generated by Butland et al. in E. coli43, examined the network properties of both broadly conserved (across three domains of life) and E. coli specific protein baits (~648 proteins in total). Pairs of proteins with a high degree of conservation, based on the number of genomes containing detectable orthologs, show an increased likelihood of physically interacting, and form a core network involved in essential bacterial processes. From the standpoint of prokaryotic evolution, the bias between ortholog conservation and physical interaction suggests that certain non-essential protein complexes detected in the E. coli PPI network may have evolved different interaction partners in other bacterial phyla, developing novel functional modules, and that the adaptability of bacteria is likely to be reflected in the evolutionary rewiring of PPI networks.

In the large-scale functional network of E. coli generated by Hu et al.82, clustering of binary interactions having overlapping support from multiple datasets enabled 97 distinct functional neighbourhoods (modules) to be delimited, containing both uncharacterized proteins and those with consistent biological roles. The phylogenetic distribution of functional neighbourhoods did not appear to be phylum-specific, suggesting that different biological processes in prokaryotes may consist of a core set of proteins with different extents of elaboration as phylum-specific innovations. Among the predictions that were experimentally validated were novel components involved in several important prokaryotic biological processes, such as DNA replication, cell-envelope biogenesis, and antibiotic resistance. In a follow-up study, analysis of the large-scale integrative functional E. coli network generated by Peregrin-Alvarez et al.74 also found that genes predicted to have originated from horizontal transfer were less well connected, or peripheral, to the network and were hypothesized to form distinct functional modules. Many of the functional modules consisting of horizontally derived genes are indicative of roles in environmental sensing and possess interactions with native E. coli proteins with alternate functional roles, e.g. iron-acquisition operons and iron-siderophore precursor biogenesis, supporting the importance of functional modularity in bacterial adaptation.

The study of modules is not limited to PPI networks alone, but can be applied to metabolic networks, networks of genetic interactions and networks of gene regulation. In the recent study of Babu et al.1, GIs showed enrichment between functional modules corresponding to the subunits of distinct complexes with related biological roles, such as DNA polymerase and DNA repair exonucleases, and iron-sulfur and ferric enterobactin biogenesis. Other studies have examined the role of modularity in the evolution of novel bacterial adaptations, with interesting insights. Broadly conserved prokaryotic stress responses, such as chemotaxis and spore formation, can be resolved into distinct sub-modules based on the biological functions of their components (structural proteins, environmental sensing, cell-signaling, pathway cross-talk), which strongly correlate with the lifestyles of different prokaryotic species160. Not surprisingly, the components of these modules showing the greatest evolutionary divergence are involved either in direct environmental sensing or in the last stage of internal cellular signaling cascades, indicating their roles as adaptations to the nutrient availabilities of specific environmental niches. Modularity has also been shown to play an important role in the evolution of phylum-specific traits. For example, differences in the localization of stalk formation between the closely related Alphaproteobacteria Caulobacter crescentus and Asticcacaulis species were shown to be driven by the protein SpmX, which evolved from a cell-development regulator in C. crescentus into a stalk localization determinant in Asticcacaulis. This demonstrates the remarkable modularity of the conserved prokaryotic peptidoglycan synthesis machinery, which enables its adaptability to distinct biological contexts161. Prokaryotes also display the potential to adapt broadly conserved protein complexes to exploit novel lifestyle niches.
For example, the twin-arginine export system (Tat) is one of two essential protein secretory pathways found across prokaryotes, and is involved in the transport of folded proteins across the bacterial membrane162. The Tat complex in the majority of prokaryotes comprises the three proteins TatA, TatB, and TatC, likely originating from the ancient duplication and sequence diversification of the ABC transporter family163. In a recent study, Jiang and Fares demonstrated that functional divergence of various subunits of the Tat complex was significantly increased in prokaryotic phyla containing pathogens (e.g. Neisseria, Bartonella, Salmonella) and species adapted to extreme environments (Halobacteria)164. Further analysis of the predicted complement of Tat-dependent substrates in these species identified ribosomal proteins that may influence host immune response, and inorganic ion transporters that may ensure ionic equilibrium in high-salt environments. These results suggest that duplication has served to broaden the potential substrates transported by the Tat complex, demonstrating its functional modularity through participation in diverse pathways that underlie bacterial lifestyle adaptations across phylogenetically diverse species. These few examples illustrate the means by which modularity can facilitate prokaryotic evolution and adaptation, through the combination of distinct pathways or processes, or the evolution of components therein, to generate novel environmental adaptations.

1.5 Project Goals and Rationale

Large-scale protein-protein and genetic interaction methodologies are becoming a standard approach for the systematic study of the bacterial cell. The complexity of such datasets contrasts with, as well as complements, the traditional small-scale reductionist approaches to elucidating gene and protein function. Through the application of network biology approaches, protein and genetic interaction networks in E. coli have been shown to reveal unprecedented insights into the organization of genes and proteins into functional modules comprising diverse biological processes; however, the studies to date are relatively limited, with many aspects of bacterial biology remaining to be explored. In addition, novel functional insights derived from the study of biological networks, combined with comparative genomics analyses, hold great promise for addressing questions of evolutionary significance that have traditionally been within the realm of theoretical speculation, such as how the expansion of protein families through duplication, or the gain of novel genes, contributes to bacterial adaptations and functional novelty. Therefore, assessing biological networks from both functional and evolutionary perspectives shows great potential for identifying genes which play crucial roles in bacterial adaptation to diverse environments. To achieve this overall goal, in the following chapters of my thesis I will present my research integrating network biology and comparative genomics approaches in three different contexts:

1) In Chapter 2, I present results of my analyses of two recently published E. coli GI networks, investigating the epistatic integration of genes involved in both broad biological functions and, more specifically, DNA damage and repair response pathways1,2. The functional organization of GIs is examined through the enrichment of epistatic interactions bridging genes belonging to a diverse set of functional modules in E. coli, as well as among co-conserved genes, together with the contribution of paralogs toward biological robustness through the elaboration of biological processes and complexes.

2) In Chapter 3, I shift focus to defining protein complexes in a recently published large-scale AP-MS interactome of the E. coli cell-envelope associated proteome3. Applying a clustering algorithm, I identify cell-envelope associated protein complexes to investigate the role of PPIs in the functional partitioning of the cell envelope. Integrating previously published GI datasets, I also examine the functional integration of cell-envelope complexes into biological processes involved in bacterial adaptation and survival. I end with an application of this dataset to the systematic examination of functional divergence of paralogs, with a focus on utilizing the rewiring of physical interactions to inform their contribution to novel adaptive environmental responses.

3) In Chapter 4, I describe a novel approach incorporating genomic-context methods and evolutionary distance-based sequence clustering, which is applied toward the study of operon-encoded bacterial synthase-dependent exopolysaccharide (EPS) biosynthesis machineries. Through this approach I systematically examine the functional consequences of operon reorganization, gene duplication, and loss to understand how, by elaborating on a common mechanism of exopolysaccharide transport, bacteria have evolved a means to produce distinct EPS biofilms that serve as important adaptations across diverse phyla.


Chapter 2 Investigation of the Evolution of Diverse Biological Pathways and Functional Divergence of Paralogs in E. coli Genetic Interaction Networks

Portions of the following analyses and figures have been adapted with permission from:

Babu M, Arnold R, Bundalovic-Torma C, Gagarinova A, Wong KS, et al. 2014. Quantitative genome-wide genetic interaction screens reveal global epistatic relationships of protein complexes in Escherichia coli. PLoS Genetics. 10(2):e1004120.

Kumar A., Beloglazova N., Bundalovic-Torma C., Phanse S., Deineko V., et al. Conditional Epistatic Interaction Maps Reveal Global Functional Rewiring of Genome Integrity Pathways in Escherichia coli. Cell Reports. 14(3):648-661.

Attributions

I conceived and performed all of the analyses presented in this chapter. Selection of query and recipient knockout strains for E. coli synthetic genetic array (eSGA) screening, epistasis scoring, GI network generation and computation of GI-profile Pearson correlations were performed and provided by collaborators Dr. Mohan Babu/Dr. Andrew Emili (University of Regina/University of Toronto & Boston University). E. coli functional module annotations utilized for GI enrichment analyses were derived from a previously generated E. coli functional interaction network74.


2 E. coli Genetic Interaction Networks

2.1 Introduction

The overarching goal of my thesis is an elucidation of the functional organization of the bacterial cell and how this organization has been shaped through evolutionary innovations by gene duplication. GI networks provide a valuable approach for identifying novel functional relationships among genes which act to coordinate diverse biological processes of the bacterial cell, which can further aid our understanding of the biological roles of paralogous genes. In this chapter I present an investigation of two recently published E. coli GI networks: the first representing a screen of 163 genes against an array of over 4000 single gene knockouts to assess functional cross-talk of biological pathways and protein complexes with broad biological roles under normal laboratory growth conditions1; and the second a conditional screen of 549 genes with roles in DNA damage and repair response pathways, for the elucidation of functional relationships specifically recruited in response to DNA damaging conditions2. I first examine the ability of GIs to capture functional relationships by integration with a previously predicted set of E. coli functional modules defining genes with functionally related roles in protein complexes and metabolic pathways, which revealed epistatic enrichment occurring between functional modules with biologically related roles. For example, functional modules enriched in the Broad-GI network were found to bridge complexes and pathways involved in antibiotic resistance and cell division, as well as cell wall stability sensing through the phage-shock response pathway and anaerobic metabolism, while DDR-GIs were enriched between pathways involved in stalled replication fork progression and enzymes responsible for the removal of alkylated DNA. Having established the ability of GI networks to represent the functional relationships among genes, I then applied them toward investigating how evolutionary processes have shaped the organization of diverse biological pathways in E. coli. Combining the correlation of GI profiles, which infers functionally related genes that possess similar patterns of epistatic interactions, with a phylogenetic profiling approach, I investigate the relationship between the conservation of genes and their functional relatedness. In the Broad-GI network it was found that highly conserved genes tend to possess highly correlated GI profiles, reflective of their key roles in cellular growth, which have become elaborated by the acquisition and epistatic integration of novel genes, such as those involved in iron import and metabolism pathways. Additionally, low correlation of GI profiles was found to indicate functional divergence among novel components of the flagellum. In both the Broad-GI and DDR-GI networks, the functional diversification of paralogs appears to play an important role overall in increasing the robustness of E. coli to survive under DNA damaging conditions, as revealed by functional divergence among paralogs with specialized roles in the sensing of stress responses, as seen among duplicate lysine tRNA synthetases and an expanded family of RNA helicases. Together these results demonstrate that GI networks serve as a valuable approach toward elucidating the functional integration of diverse biological processes in E. coli and the contribution of paralogs toward increased biological robustness to diverse environmental conditions.

2.1 Materials and Methods

E. coli synthetic genetic array (eSGA) screens and scoring of GIs were provided by collaborators Dr. Mohan Babu (post-doctoral fellow) and Dr. Andrew Emili (lab of Dr. Emili, University of Toronto) following a previously published methodology21. In brief, following bacterial conjugation, homologous recombination, double-antibiotic selection for successful double-knockout transformants, and duplicate plating and growth, normalized single and double knockout strain fitnesses were determined based on colony growth size. A normalized epistatic score (S-Score) for each double knockout was then calculated using a multiplicative fitness model, subtracting the product of the single knockout fitnesses from the fitness of the respective double knockout strain. Double knockouts were filtered if they occurred in a genomic window >= 10 kbp on either side of the donor strain. A distribution of double knockout S-Scores was generated, a Z-score transformation was applied, and a significance threshold (p-value <= 0.05, 2-tailed) was used to select statistically significant GIs.
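The scoring logic described above can be sketched as follows. This is a simplified illustration only: it computes the raw multiplicative-model deviation rather than the full normalized S-Score of the published eSGA pipeline, and the function names, the use of a standard normal cutoff, and the sign convention (negative scores treated as aggravating, positive as alleviating) are assumptions introduced here.

```python
from statistics import NormalDist, mean, pstdev


def epistasis_score(w_a, w_b, w_ab):
    """Multiplicative model: observed double-knockout fitness minus the
    product of the two single-knockout fitnesses."""
    return w_ab - w_a * w_b


def significant_interactions(scores, alpha=0.05):
    """Z-transform a dict of {gene_pair: score} and keep two-tailed hits.

    Assumed convention: z < 0 is aggravating, z > 0 is alleviating.
    """
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    calls = {}
    for pair, score in scores.items():
        z = (score - mu) / sigma
        if abs(z) >= cutoff:
            calls[pair] = (z, "aggravating" if z < 0 else "alleviating")
    return calls
```

Under this model, a double knockout growing exactly as well as expected from its two single knockouts scores zero, and only pairs deviating strongly from the bulk of the score distribution are retained as GIs.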

Differences in the selection of donor query and recipient knockouts, modifications of screening procedures, and the derivation of additional measures (Pearson correlation coefficients of GI profiles, condition-dependent differential GIs) for the Broad-GI and DDR-GI datasets analyzed in this chapter are described below.


2.1.1 Source of Datasets

2.1.1.1 Genetic Interaction Datasets

Dataset 1: Genetic Interaction Network of Genes Encompassing Broad Biological Processes (Broad-GI)

The Broad-GI dataset comprises 163 query genes selected for eSGA screening to represent broad biological processes, falling into the following functional categories derived through literature curation and previously assigned functional annotation terms: cell envelope (28), Fe-S cluster biosynthesis (22), general chaperones and proteases (12), metabolism (33), transcription (18), translation (7), antiviral defense (6), transport (5), and unclear or orphan (25) (Figure 2-1A). Following the eSGA procedure outlined above, significant GIs were determined, resulting in a total set of 42,705 (25,239 aggravating and 17,466 alleviating) GIs (Figure 2-1B).

Dataset 2: Conditional Genetic Interaction Networks of Genes Encompassing DNA Damage and Repair related Processes (DDR-GI)

The DDR-GI dataset comprises 549 genes selected through literature curation for involvement in DNA damage and repair response pathways, which were further classified, based on literature curation, into the following biological processes: DNA damage/general stress or defense response (56), cell division (46), DNA repair (43), DNA replication (39), metabolism (36), ribosome biogenesis/translation/RNA related (34), DNA recombination (27), folding/degradation (23), transcription (21), transporter (21), other processes (52), and uncertain or unclear function (149) (Figure 2-1A). Genetic interaction screens of reciprocal donor-recipient knockout pairs were performed according to previously described methodology21 under two separate treatment conditions: rich medium (LB; untreated, UT) and rich medium in the presence of the DNA alkylating agent methyl methanesulfonate (MMS). A differential (DF) GI network was derived by taking the difference of normalized double mutant growth scores between the MMS and UT networks. The resulting statistically significant GIs for each network (UT: 23,648 interactions; MMS: 28,885 interactions; DF: 8,227 interactions), and the calculated GI-profile Pearson correlation coefficients for UT and MMS GIs, were utilized for further analysis (Figure 2-1B).
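The two derived measures mentioned above, differential scores and GI-profile correlations, can be illustrated with a minimal sketch. The data layout (dictionaries keyed by gene pair or by gene), the function names, and the minimum of three shared partners are assumptions made for illustration; the published pipeline may differ in normalization and filtering.

```python
from math import sqrt


def differential_scores(mms, ut):
    """Difference (MMS minus UT) for gene pairs scored under both conditions."""
    return {pair: mms[pair] - ut[pair] for pair in mms.keys() & ut.keys()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def profile_correlation(profiles, gene_a, gene_b, min_shared=3):
    """Correlate two GI profiles ({partner: score}) over their shared partners.

    Returns None when fewer than min_shared partners are shared (assumed cutoff).
    """
    shared = sorted(profiles[gene_a].keys() & profiles[gene_b].keys())
    if len(shared) < min_shared:
        return None
    return pearson([profiles[gene_a][p] for p in shared],
                   [profiles[gene_b][p] for p in shared])
```

Two genes with highly correlated profiles interact epistatically with the same partners in the same direction, which is the basis for inferring that they are functionally related.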

Figure 2-1. Overview of E. coli Broad-GI and DDR-GI Datasets Investigated. A – Query genes selected in Broad-GI and DDR-GI eSGA screens by functional category (Broad-GI) and Biological Process (DDR-GI). B – GI datasets employed in heatmap form, depicting the number of significant donor and recipient pairwise knockouts identified and epistatic interactions by kind (aggravating GI – red cells, alleviating GI – green cells) for Broad-GI (1) and DDR-GI (2) datasets, along with corresponding GI-profiles (3).


2.1.1.2 Functional Modules Dataset

Epistatic enrichment of Broad-GI and DDR-GI networks was assessed by utilizing a set of predicted functional modules in E. coli74. Functional modules were derived from a large-scale functional network encompassing 1784 proteins (43% of the 4145 known protein coding genes of the E. coli proteome), which were further assigned using the Markov Clustering (MCL) algorithm141 into 316 distinct functional modules representing proteins of related function (Peregrin-Alvarez et al., 2009). Functional interactions were computationally inferred through Bayesian integration of large and small scale PPI experimental datasets, interactions derived from genomic context predictions, and literature curation. Based on the common COG membership of gene clusters, this network was found to have the highest overall recall and precision when compared to previously generated networks44,72,81,82.

2.1.1.3 Evolutionary Datasets

Orthology predictions of protein-coding sequences corresponding to genes represented in the E. coli Broad-GI network were obtained from the eggNOG online database ver. 3.0165, and utilized to construct gene-specific phylogenetic profiles encompassing 233 fully sequenced γ-proteobacterial species (gamma-proteobacterial non-supervised orthologous groups, NOGs) consisting of: 29 E. coli serotypes, 64 enterobacterial and 140 gamma-proteobacterial species. For the DDR-GI network, NOGs defined across bacterial species (bacterial NOGs), encompassing 747 species from 11 major bacterial phyla, were extracted from the eggNOG database. Paralogs were further identified based on co-membership in the same NOG.
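The construction of phylogenetic profiles and the identification of paralogs by NOG co-membership can be sketched as below. The input mappings (`gene_to_nog`, `nog_to_species`) are a hypothetical, simplified representation and do not reflect the actual eggNOG file formats; gene, NOG, and species identifiers in any usage are toy values.

```python
from collections import defaultdict
from itertools import combinations


def build_profiles(gene_to_nog, nog_to_species, species):
    """Binary presence/absence vector per gene over an ordered species list.

    A gene is scored present (1) in a species if the gene's orthologous
    group has at least one member in that species.
    """
    return {
        gene: [1 if sp in nog_to_species.get(nog, set()) else 0 for sp in species]
        for gene, nog in gene_to_nog.items()
    }


def paralog_pairs(gene_to_nog):
    """Pairs of genes from the same genome assigned to the same NOG."""
    by_nog = defaultdict(list)
    for gene, nog in gene_to_nog.items():
        by_nog[nog].append(gene)
    return sorted(
        pair
        for genes in by_nog.values()
        for pair in combinations(sorted(genes), 2)
    )
```

Because presence is recorded as a binary state, a species carrying several copies of a gene contributes the same profile entry as a species carrying one.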

2.1.2 Enrichment of Genetic Interactions in Functional Modules

To assess the significance of GIs occurring within and between functional modules, I implemented a permutation test to identify those functional module pairs (intra- and inter-module) which were statistically enriched in GIs, i.e. containing more GIs than expected at random, which would suggest possible connections of biological relevance for further exploration. The permutation test was performed as follows: interaction partners for each gene participating in an inter- or intra-module GI were randomly exchanged. After randomizing every interacting gene pair, the total number of inter- or intra-module GIs found between functional module pairs was recalculated. This procedure was repeated for 1000 iterations, after which the average and standard deviation of inter- and intra-module GIs were calculated and utilized for significance testing based on Z-scores. Z-scores were ranked from highest to lowest and significant inter- and intra-module interactors were defined as those with Z-scores falling within the top 5%.
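A minimal sketch of this permutation procedure is given below. The randomization shown (shuffling the list of interaction partners across all GIs, which preserves how often each gene appears) is one plausible reading of "randomly exchanged" and is an assumption, as are the function names and the data layout.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def module_pair_counts(gis, module_of):
    """Count GIs per unordered module pair; intra-module GIs map to (m, m)."""
    counts = Counter()
    for a, b in gis:
        if a in module_of and b in module_of:
            counts[tuple(sorted((module_of[a], module_of[b])))] += 1
    return counts


def enrichment_z_scores(gis, module_of, n_iter=1000, seed=0):
    """Z-score of the observed GI count per module pair against a null built
    by shuffling interaction partners among the query genes."""
    rng = random.Random(seed)
    observed = module_pair_counts(gis, module_of)
    queries = [a for a, _ in gis]
    partners = [b for _, b in gis]
    null = {pair: [] for pair in observed}
    for _ in range(n_iter):
        rng.shuffle(partners)
        permuted = module_pair_counts(zip(queries, partners), module_of)
        for pair in observed:
            null[pair].append(permuted.get(pair, 0))
    z_scores = {}
    for pair, obs in observed.items():
        mu, sigma = mean(null[pair]), pstdev(null[pair])
        z_scores[pair] = (obs - mu) / sigma if sigma > 0 else 0.0
    return z_scores


def top_fraction(z_scores, fraction=0.05):
    """Module pairs ranked by Z-score, keeping the top fraction (at least one)."""
    ranked = sorted(z_scores, key=z_scores.get, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]
```

Selecting the top 5% of ranked Z-scores, as described in the text, is a rank-based cutoff; it does not by itself guarantee a fixed false-positive rate for any individual module pair.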

2.1.3 Determination of Evolutionary Co-Conservation of Broad-GI Genes using Mutual Information of Phylogenetic Profiles

Phylogenetic profiles were then applied to calculate the evolutionary co-conservation between each pair of genes (Figure 2-1B) using a mutual information (MI) score166:

$$MI(A, B) = \sum_{i} \sum_{j} p_{i,j}(A, B) \log \frac{p_{i,j}(A, B)}{p_i(A)\, p_j(B)}$$

where A and B represent the phylogenetic profiles of a recipient gene pair, and i and j index the “state” of genes A and B in a genome, i.e. presence (Ai, Bj = 1) or absence (Ai, Bj = 0). For a pair of genes, four possible states (N = 4) therefore emerge, which represent the information from which co-conservation between a pair of genes can be calculated: (1) when genes A and B are mutually present in a given genome (Ai & Bj = 1); (2) when genes A and B are mutually absent (Ai & Bj = 0); (3) when gene A is present but B is absent; and (4) vice versa (Ai = 1 or 0; Bj = 1 or 0; Ai != Bj).

It is important to note that in the calculation of gene co-conservation using MI, we only consider the binary state of either the presence or absence of a given E. coli recipient gene in a gamma-proteobacterial genome, and do not take into account whether a gene may be present in multiple copies, i.e. paralogs. Thus pi(A) and pj(B) represent the frequencies of presence or absence of genes A and B, respectively, across all genomes of their respective phylogenetic profiles, while pi,j(A, B) represents the frequency of each of the four co-conservation patterns described above. MI scores calculated for each pair of genes in the Broad-GI network were further normalized to 1 by dividing by the maximal MI score of the set.
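As an illustration, the MI calculation and normalization can be sketched as follows, assuming the standard definition of mutual information between two binary profiles, i.e. the sum over joint states of pi,j(A, B) log[pi,j(A, B) / (pi(A) pj(B))]. The base-2 logarithm and the function names are assumptions; since scores are normalized by the maximum of the set, the choice of base does not affect the normalized values.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two binary presence/absence profiles."""
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of the joint states
    marg_a = Counter(profile_a)                 # counts of presence/absence of A
    marg_b = Counter(profile_b)                 # counts of presence/absence of B
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * log2(p_ij / ((marg_a[i] / n) * (marg_b[j] / n)))
    return mi


def normalized_mi(profiles, pairs):
    """MI for each gene pair, divided by the maximal MI score of the set."""
    raw = {pair: mutual_information(profiles[pair[0]], profiles[pair[1]])
           for pair in pairs}
    top = max(raw.values(), default=0.0)
    return {pair: (value / top if top > 0 else 0.0) for pair, value in raw.items()}
```

Note that MI is high for any strong statistical dependence between profiles, so two genes with perfectly complementary presence and absence patterns score as highly as two genes that always co-occur.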

2.1.3.1 Evolutionary Conservation of Genomic Integrity Pathways

To determine the overall conservation of DDR-GI networks, a phylogenetic profiling approach was utilized. In brief, a genetic interaction was considered to be conserved in a given bacterial species if predicted orthologs of both interacting genes, derived from bacterial non-supervised orthologous groups (BactNOGs) extracted from the eggNOG database165, were detected in its genome. The total number of conserved GIs was calculated for each species, and used to calculate the average conservation of GIs across each phylum. Gene pairs were further assigned to the same (intra-process) or different (inter-process) biological processes (as described previously in 2.1.2). Only the top 10th percentile of conserved inter-process GIs found in proteobacteria (360 species) were utilized for further analysis.
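A minimal sketch of this conservation measure is shown below, under the assumption that a GI counts as conserved in a species when both interacting genes have a predicted ortholog in that genome. The input mappings (`orthologs_in`, `phylum_of`) and function names are hypothetical simplifications introduced for illustration.

```python
from collections import defaultdict


def conserved_gi_counts(gis, orthologs_in):
    """Number of conserved GIs per species.

    orthologs_in maps each species to the set of E. coli genes having a
    predicted ortholog in that species.
    """
    return {
        species: sum(1 for a, b in gis if a in present and b in present)
        for species, present in orthologs_in.items()
    }


def mean_conservation_by_phylum(counts, phylum_of, n_gis):
    """Average fraction of GIs conserved across the species of each phylum."""
    fractions = defaultdict(list)
    for species, count in counts.items():
        fractions[phylum_of[species]].append(count / n_gis)
    return {phylum: sum(v) / len(v) for phylum, v in fractions.items()}
```

This measures the co-occurrence of the interacting genes only; whether the epistatic relationship itself is maintained in another species cannot be inferred from ortholog presence alone.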

Figure 2-2. Overview of Functional and Evolutionary Analyses of E. coli GI Networks Performed. A – Integration of GI datasets with predicted E. coli functional modules74 utilized to examine epistatic enrichment among functionally diverse subsets of genes. B – Phylogenetic profiling employed for investigating the organization of E. coli GI networks across evolutionarily co-conserved subsets of genes.


2.2 Results

2.2.1 E. coli Functional Modules Demonstrate an Enrichment in Genetic Interactions

To establish the functional relevance of GIs in examining the evolutionary trajectories of genes, I first performed an analysis to examine the ability of GIs to recapitulate known functional relationships by integrating GI networks with a previously generated E. coli functional network (2.1.3) (Figure 2-2A). Two sets of GI networks were explored: the first representing diverse biological processes in the E. coli genome (Broad-GI), and the second targeting 549 genes with known roles in DNA damage repair and response pathways (DDR-GI), assessed in normal rich medium (UT) and in the presence of the DNA damaging agent methyl methanesulfonate (MMS). Following the rationale that GIs are likely to occur among genes with related biological roles, each genetically interacting gene pair was assigned to its corresponding functional module memberships, and a permutation testing approach was applied to identify functional modules significantly enriched in GIs, occurring either among genes belonging to the same functional module (intramodule GIs) or between distinct functional module pairs (intermodule GIs). In the following sections I present examples of GI enrichment among functional modules which revealed the organization of diverse biological processes, as well as the conditional rewiring of specific pathways involved in DNA damage response.

2.2.1.1 Enrichment of Genetic Interactions Among E. coli Functional Modules in the Broad-GI Network Identifies Epistatic Crosstalk Among Diverse Biological Processes

To uncover novel insights into the functional integration of diverse biological processes in the E. coli genome, the total set of 42,705 (25,239 aggravating and 17,466 alleviating) Broad-GIs (Figure 2-1C) were assigned to their corresponding memberships in a previously defined set of E. coli functional modules (Figure 2-2A). The majority of enriched GIs were found to occur between modules (11,018 inter-module vs. 57 intra-module GIs), possibly the result of biases in the Broad-GI screen toward genes belonging to diverse biological processes. Interestingly, intra-module GIs showed a significantly greater propensity to be aggravating compared to inter-module GIs (Figure 2-3), suggesting that although the modules defined comprise proteins with related functions, they are likely to function in an integrated manner to ensure bacterial growth and survival.

Figure 2-3. Distribution of Broad-GIs Within (Intra-) and Between (Inter-) E. coli Functional Modules.

Permutation testing was then performed to assess GI enrichment both within (intra-module) and between (inter-module) functional modules (2.1.3). To elucidate the biological implications of inter-module GI enrichment, I mapped GIs to their corresponding functional module memberships. Based on Z-score ranking, the top 5% of module pairs with significantly enriched GIs (p-value <= 0.05) were selected for the following analyses (Supplementary File 1). Consistent with the greater proportion of inter-module GIs present in the Broad-GI dataset, pairs of distinct functional modules were predominant in the enriched set of module pairs (302 / 6212 of all module pairs tested), and are likely to correspond to functionally relevant links among diverse biological processes.

Enriched inter-module pairs with >= 3 GIs were selected to generate an integrated E. coli functional-GI network, which represents diverse complexes and pathways with functionally related roles in bacterial survival (Figure 2-4). Of the 133 module pairs associated with an enrichment in inter-module GIs (p-value <= 0.05), a particularly high number of interaction partners was observed for modules comprising genes of the iron-sulfur cluster formation Suf operon (11 inter-module partners) and genes encoding the Psp phage shock proteins (8 inter-module partners) (Figure 2-4, examples 1 and 2, respectively).

Figure 2-4. Functional Modules Enriched with Broad-GI Epistatic Interactions. A - Top 5% of E. coli functional modules enriched in epistatic interactions. Nodes in the overview network represent functional modules defined in74; pie chart colours represent the functional annotation (COG supercategory) of genes from the Broad-GI network mapped to a given functional module. Edges represent pairs of modules with enrichment of inter-module GIs (P-value < 0.05), coloured according to the proportion of aggravating and alleviating GIs, represented as a red-to-green gradient, respectively. B - Selected examples described further in the text are indicated by numbered circles, with corresponding subnetworks depicted in the panels below. Networks visualized using Cytoscape 3.5.1167.

The inter-module interactors of the Suf operon (sufABCDSE) were enriched in aggravating GIs with genes with related roles in iron-sulfur cluster homeostasis. Iron-sulfur clusters play an essential role in prokaryotes and eukaryotes as key co-factors in the biogenesis and function of proteins involved in respiration, DNA repair, central metabolism and oxidative stress response168 (Figure 2-4 – Example 1). These interactors include members of the functionally redundant Isc Fe-S cluster operon (iscRSUA-hscBA)169, as well as the periplasmic heme transport chaperone ccmC; also identified were members of an operon encoding the Btu Vitamin B-12 ABC transporter, btuDE170, of which btuE encodes a thioredoxin/glutathione peroxidase, implicating a related role in the repair of Fe-S clusters, which are sensitive to damage under oxidative stress conditions171,172. Aggravating GIs were also found to be enriched among functional modules defining genes with diverse roles in cellular homeostasis dependent on ATP availability, such as regulation of translation efficiency via tRNA editing under oxidative stress (pheT)165, a chromosomally neighbouring gene mediating glycolytic flux under slow growth (pfkB)166, progression of DNA replication forks and relaxation of DNA supercoiling (gyrB)167, and anaerobic energy production via fatty acid oxidation (ydiQ)168. Recently, links have been established between high cellular redox activity and the arrest of bacterial growth leading to persistence phenotypes following stationary phase growth169. Therefore, these enriched aggravating inter-module GIs encompassing the suf operon and functional modules involved in DNA replication, metabolism, and respiration may also correspond to biological pathways relevant for mediating persistence, which could have important implications for devising therapeutic strategies to overcome bacterial antibiotic resistance.

A distinct subset of epistatically enriched functional modules identified important links among genes with roles in membrane stability, transport, and survival under anaerobic growth. In E. coli the Phage Shock Protein (pspABCDE) operon was identified in response to infection by filamentous phage f1, and it is primarily believed to be involved in mediating cell responses to cell envelope instability, maintaining respiratory chains and proton-motive force, and mediating bacterial persistence170. Consistent with these findings, all members of the Psp operon and the regulator pspF were found to possess extensive aggravating interactions (Figure 2-4 – Example 2), indicating functional redundancy with a variety of functional modules with related roles in the maintenance of membrane stability and antibiotic resistance, such as subunits of multidrug efflux pumps, mdtBC-D173, and the sap peptide transporter, sapADF174. Furthermore, GI enrichment (p-value < 0.05) was observed for functional modules involved in anaerobic respiration, including the Mgl galactose transporter, galactose catabolism enzymes galEKT, the anaerobic galactosidase complex ebgAC, subunits of fumarate reductase frdBCD, and the TMAO reductase I complex torAC175. A number of alleviating interactions, indicative of genes with highly co-ordinated roles within biological pathways176, were noted in particular among this latter category of genes, consistent with recent findings on the role of Psp proteins in inducing anaerobic metabolic pathways in response to cellular stress177.

A number of additional epistatically enriched functional module pairs were also identified which support the extensive integration of diverse biological processes in E. coli. These include inter-module interactions enriched particularly in alleviating GIs, which comprised the small heat shock chaperones ibpAB, the proton-driven multidrug efflux transporter subunits acrAB-tolC, and subunits of the NADH dehydrogenase, nuoEG, suggesting interrelated functions of these proteins in bacterial stress response pathways178–180 (Figure 2-4 – example 3). In addition, both ibpAB and acrAB-tolC have been shown to play important roles in the phenotypic heterogeneity of replicating E. coli cells by preferentially partitioning toward the cell poles of the mother cell during cell division179,180, indicating that these epistatically enriched functional modules reflect a crucial functional link between the sequestering of age-related protein aggregates and the mitigation of damage from antibiotic exposure in ensuring successful reproduction of daughter progeny.

Enriched aggravating GIs were also observed for two functional modules comprising genes involved in DNA repair, recA and recBCD, respectively (Figure 2-4 – examples 4 and 5). Although recA and recBCD are known to play a concerted role in homologous recombination-mediated repair of double-stranded DNA breaks, they were found to possess GIs enriched for genes belonging to distinct functional modules, reflecting their further specialized roles in DNA repair. For instance, recBCD, which encodes the helicase/exodeoxyribonuclease V complex180, was found to interact with modules with related roles in DNA replication and repair, nucleotide metabolism, and cell-envelope biogenesis, which include: aggravating GIs with the DNA binding helicase, rep, which is known to mediate replication fork movement during DNA replication181, fuc operon members (fucAIOPU) associated with anaerobic L-fucose/D- utilization182, and an alleviating interaction with the relB anti-toxin protein involved in cell growth inhibition183. The recA DNA binding chaperone was found to interact with subunits of the replicative DNA polymerase III complex (holABCQX), consistent with its known role in binding DNA lesions or exposed single-stranded DNA at stalled replication forks184, as well as the class Ib ribonucleotide reductase complex subunits nrdEF, implicated in deoxynucleotide biosynthesis during early to mid-log growth and in nutrient-limited conditions185. Together these findings provide further insight into the functional integration of distinct complexes and pathways involved in the coordination of DNA replication, repair and nutrient utilization, which play crucial roles in regulating bacterial growth.

Enrichment in aggravating interactions was found linking genes with roles in the utilization of nitrogen-containing compounds, both through import mechanisms of branched-chain amino acids (livFGHKM)186, and energy-related pathways involved in nitrate assimilation and reduction (napABD, napCFGH nitrate reductases)187,188 and menaquinone biosynthesis (menBCD)189 (Figure 2-4 – example 6). The clpP serine protease and its chaperone regulator clpA, which are involved in protein degradation and folding, respectively, were also found to be enriched in GIs representing distinct functional modules involved in cellular motility and adhesion (Figure 2-4 – examples 7 and 8). Interestingly, despite their coordinated roles, clpA and clpP were found to possess distinct inter-module enrichment patterns of entirely aggravating and entirely alleviating GIs (p-values < 0.05), respectively, which suggests their further involvement in distinct biological processes. Alleviating inter-module interactions with clpP chaperone involve a number of chaperones involved in pilin and fimbriae formation, and nitrate reductase maturation190, indicating their possible co-ordination in protein homeostasis. On the other hand, aggravating inter-module GIs involving the protease clpP and subunits of the flagellum are likely to be indicative of its functionally redundant role in flagellar post-translational regulation191. Having used GIs to elucidate the integration of distinct functional modules in E. coli, I next present work investigating how GIs reveal novel functional insights in the context of DNA damage repair and response (DDR) pathways.

2.2.1.2 Enrichment of GIs among E. coli Functional Modules Reveals Functional Integration of Diverse Biological Processes in Response to DNA Damage

The DDR-GI dataset represents a focused screen of 549 genes with known roles in DNA damage repair and response pathways, assessed under two growth conditions: rich medium (UT) and in the presence of the DNA damaging agent methyl methanesulfonate (MMS). In addition, a third set of differential GIs (DF) was derived by subtracting epistasis scores between gene pairs found in the MMS and UT networks. Together these condition-dependent GIs extend previous findings from the Broad-GI network by enabling a focused exploration of specific pathways relevant to E. coli adaptability and survival under DNA damage-induced stress.
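The derivation of differential scores can be sketched in a few lines of Python. This is an illustration only: the function name, the use of unordered gene pairs as dictionary keys, and the sign convention (MMS minus UT) are assumptions, and any significance filtering applied to the resulting DF scores is not shown because it is not specified in this section.

```python
def differential_scores(ut_scores, mms_scores):
    """Return MMS-minus-UT epistasis scores for gene pairs scored in both networks.

    Both inputs map an unordered gene pair (a frozenset of two gene names)
    to its epistasis score. The MMS - UT direction is an assumed convention.
    """
    diff = {}
    for pair, ut_score in ut_scores.items():
        if pair in mms_scores:
            diff[pair] = mms_scores[pair] - ut_score
    return diff
```

Under this convention, a negative DF score corresponds to a decrease in epistasis score under MMS relative to UT, and a positive score to an increase.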

As before, to determine how DNA damage response pathways are functionally reorganized in response to DNA damaging conditions, the statistically significant GIs for each network (UT 23,648 GIs, MMS 28,885 GIs, DF 8,227 GIs) were mapped to a set of predicted E. coli functional modules, as described above, and grouped into those occurring within modules (intra-module GIs) and those occurring between modules (inter-module GIs). To determine which functional modules were statistically enriched for GIs in the UT, MMS, and DF networks, permutation testing was performed through the random reassignment of GI functional module memberships, and a Z-score was calculated to identify module pairs with significant enrichment for GIs. Functional module pairs ranked in the top 5% were selected for further analysis (Supplementary File 2).

Figure 2-5. Summary of E. coli Functional Modules Epistatically Enriched in the DDR-GI Network. A to D – Intra-module (A and B) and inter-module (C and D) interactions, by number of GIs and distinct functional module pairs, respectively, identified as significantly enriched in epistatic interactions in UT, MMS, and DF DDR-GI networks. E – Overlap of significantly enriched functional module pairs compared between UT, MMS, and DF DDR-GI networks.

In UT and MMS GI networks, inter-module GIs were found to be enriched, with a greater degree of module pair enrichment observed in MMS conditions (Figure 2-5A-D), consistent with the increased number of GIs observed in MMS compared to UT. Low overlap was also observed among enriched module pairs across UT, MMS, and DF conditions, suggesting that the differences in GIs between DDR-GI networks reflect the recruitment of different pathways in response to DNA damage (Figure 2-5E). To investigate the changing patterns of GI enrichment in response to DNA damage in their relevant biological context, a network of enriched module pairs identified from the MMS GI network was generated. Several examples were identified which revealed a significant degree of epistatic cross-talk between genes with related functional roles in DNA recombination and repair, nucleotide metabolism, stress-related mutagenic responses and cell-division inhibition (Figure 2-6).

Figure 2-6. DDR-GI Enriched Functional Modules Reveal Integration of DNA Repair, Nucleotide Metabolism, and Cell Division Pathways Involved in DNA Damage Response. A – Overview network of E. coli functional modules, represented by blue nodes, with solid edges indicating MMS-enriched inter-module GIs and dashed edges indicating enriched UT interactions; the proportion of aggravating-to-alleviating inter-module GIs is coloured red-to-green, respectively. B – Selected subnetwork examples of functional modules enriched in MMS GIs corresponding to complexes and processes with related roles in E. coli DNA damage repair and response. Node colour indicates genes that are members of the same functional module, red and green edges correspond to aggravating and alleviating GIs respectively, and edge type corresponds to the presence of a GI under MMS (solid line), UT (thin dotted line), or UT and MMS (dashed line). Networks visualized using Cytoscape 3.5.1167.

Enrichment of both aggravating and alleviating GIs was observed between proteins involved in DNA recombination-mediated repair processes, such as recA, its functional modulators dinI and recX, the recBCD homologous recombination repair complex genes, hupB involved in stationary phase adaptive mutation192, and the DNA endonucleases nfi and nei (Figure 2-6B – Example 1). The aggravating GIs involving recA are consistent with its known multiple roles in double-strand break repair and the bypass of mutagenic lesions in the rescue of stalled replication forks184, and further suggest functional redundancy of DNA repair pathways in maintaining genomic integrity. In addition, complementary changes in GIs between UT and MMS conditions were observed for the recA regulators rdgC (loss of aggravating GI with recA) and dinI (gain of alleviating GI with recC), which are likely indicative of their respective inhibitory and activating roles in recA DNA binding193,194. Also observed were strong MMS-specific alleviating interactions between the endonucleases nfi and nei and the alkA and ada alkylation response genes, which are further supported by their integrated and complementary roles in the removal of damaged DNA bases195–197.

Enriched alleviating interactions were observed between functional module pairs involved in the maintenance of cellular nucleotide pools, consisting of the nudE Nudix family ADP-sugar diphosphatase and the nrdAB and nrdEF ribonucleoside-diphosphate reductase complexes (Figure 2-6B – Example 2). The MMS-specific aggravating GIs observed between nudE and nrdEF, and the alleviating interactions with nrdAB, are possibly indicative of specialized roles of the ribonucleoside-diphosphate reductase complexes under differing cellular stress conditions198,199. Further enrichment of alleviating GIs was also identified between functional modules with roles in regulation of cell division, growth and DNA repair (Figure 2-6B – Example 3), including interactions between the yafNO toxin-antitoxin pair and the dinB error-prone DNA pol IV, which play roles in translation inhibition through mRNA cleavage and error-prone DNA repair under DNA damaging conditions192,200, respectively, and the minCDE complex responsible for contractile Z-ring formation essential for septum formation201.

As for the Broad-GI network, here I establish the DDR-GI network as a useful resource for exploring functional relationships. In comparison with the Broad-GI network, the specific querying of known DNA damage repair and response genes, and the assessment of their epistatic enrichment across both normal growth and DNA damaging conditions, enabled a focused examination of the functional integration of diverse biological complexes and pathways underlying DNA damage repair and response in E. coli, revealing their remarkable robustness and functional adaptability in response to environmental change. In section 2.2.3.2, I show how differences in GIs between normal and DNA damaging conditions can be applied to explore evolutionary relationships in the context of the functional adaptability of DNA damage response pathways and increased robustness through the acquisition of novel genes and the functional divergence of gene duplications.

2.2.2 Phylogenetic Conservation of GI Networks Provides Evolutionary Insights into the Functional Integration and Divergence of Biological Processes

Having established that GI-networks provide a useful resource for exploring novel functional relationships among previously defined functional modules in E. coli, I next turned to exploring GIs from the perspective of the evolutionary relationships of genes and their respective roles in biological complexes and pathways. To this end I employed a phylogenetic-profile based approach (2.1.4) to study how evolutionary relationships of duplicated genes and co-conserved genes have contributed to the functional organization of E. coli Broad-GI and DDR-GI networks.

2.2.2.1 Broad-GIs are Significantly Enriched in Aggravating Interactions for Paralogous and Non-Paralogous Genes

Utilizing predicted orthologous relationships from the eggNOG database (2.1.2.3) I identified 225 groups of 512 paralogous genes (Supplementary File 3) in the Broad-GI network. Although paralogs were found to be significantly enriched in aggravating compared to alleviating GIs (~57% aggravating vs. 42% alleviating; p-value 1.29e-24), no significant difference was observed when compared to non-paralogous genes (~59% aggravating vs. 40% alleviating; p-value 3.8e-109) (Figure 2-7). This finding therefore reflects an overall enrichment of aggravating GIs in the Broad-GI network. To better understand the functional implications of paralogous and non-paralogous genes, I next applied the Broad-GI network to investigate how evolution has influenced the functional integration of genes in known biological complexes and pathways.

Figure 2-7. Distribution of Broad-GIs Between Paralogous and Non-Paralogous Genes. Significant enrichment in aggravating GIs (red asterisk, p-value < 0.05) is observed among both paralogous and non-paralogous genes in the Broad-GI network.

2.2.2.2 Broad-GI Network Reveals Significant Functional Correlation of Highly-Conserved Essential Genes and Co-Conserved Complexes and Pathways with Related Roles in Survival

In the Broad-GI network, genes with lower degrees of overall conservation tended to comprise diverse biological processes and pathways of adaptive significance. Among this set of genes, a significant increasing trend was observed between the degree of co-conservation of within-complex (F-statistic p-value 5.4e-06) and within-pathway (F-statistic p-value 1.15e-14) members and the correlation of their GI-profiles, likely indicative of genes with essential roles in complex/pathway function (Figure 2-8AB). The Broad-GI network also reveals that paralogous genes possess a higher average phylogenetic co-conservation and average GI-profile correlation compared to non-paralogous genes (Figures 2-8CD), although the differences were not statistically significant. However, the average GI-profile correlations of paralogs (~0.05) were found to be significantly lower (p < 1e-8) than those of genes functioning within the same complex or pathway (~0.15 and ~0.1 respectively), suggesting an overall tendency for paralogs to perform divergent biological roles in the Broad-GI network.

Figure 2-8. Comparison of Phylogenetic Conservation and GI-Profile Correlations of Broad-GI Genes, Based on Functional and Evolutionary Relationships. A and B – Histogram of co-conservation (mutual information - MI) and GI-profile correlations (Pearson correlation coefficient – PCC) of Broad-GI genes belonging to reference EcoCyc complexes and pathways. C – Average GI-Profile correlations and D – Average Co-Conservation of paralogous and non-paralogous genes in the Broad-GI network, with significant differences in means (t-test p-value < 0.05) indicated with a red asterisk.

Comparing the level of phylogenetic conservation of all Broad-GI network genes against their GI-profile Pearson correlations (a standard approach for determining functional relationships between genes in GI screens202) revealed interesting trends. A greater average GI-profile correlation was noted among gene pairs involving one member with low conservation (orthologs identified in < 25% of the Gamma-proteobacterial genomes investigated), which suggests functional integration between widely-conserved core bacterial processes and those acquired exclusively among E. coli and other closely-related enterobacterial species. However, a different trend was found when comparing the phylogenetic conservation of genes with highly-correlated GI-profiles (Pearson correlation >= 0.3), where genes with intermediate conservation (25% to < 50% of Gamma-proteobacterial genomes) were found to predominate (Figure 2-9A). The biological roles of these genes were found to be particularly enriched in metabolism and transport (Figure 2-9B), indicating their importance in maintaining growth under changing nutrient availability.
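A GI-profile correlation of the kind referred to above can be sketched as follows. This is a minimal illustration rather than the pipeline used here: each gene's GI-profile is assumed to be a mapping from interaction partner to epistasis score, correlations are computed only over shared partners, and the `min_shared` cut-off is a hypothetical parameter introduced for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, reported as 0 here
    return cov / sqrt(var_x * var_y)


def gi_profile_correlation(profile_a, profile_b, min_shared=3):
    """Correlate two GI-profiles ({partner_gene: epistasis_score}) over shared partners.

    Returns None when fewer than `min_shared` partners are shared
    (hypothetical cut-off, not taken from the text).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if len(shared) < min_shared:
        return None
    return pearson([profile_a[g] for g in shared], [profile_b[g] for g in shared])
```

Gene pairs with a returned value >= 0.3 would correspond to the highly-correlated class discussed above, and values < 0 to anti-correlated profiles.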

Figure 2-9. Broad-GI Network Functionally Related Genes Examined by Level of Phylogenetic Conservation and Functional Annotation. A – Conservation of gene pairs with differing levels of phylogenetic conservation (% of species with orthologs detected) and GI-profile correlation. B – Level of phylogenetic conservation of Broad-GI genes compared across their respective functional annotations (COG supercategory). C – An integrated network of co-conserved Broad-GI network genes; node colouring corresponds to gene phylogenetic conservation (% Gamma-proteobacterial species with an ortholog detected), edges connect co-conserved gene pairs (MI >= 0.2) which are functionally correlated (PCC >= 0.3). Network visualized using Cytoscape 3.5.1167.

Although it has been well established that genes that are known to function together as part of biological complexes or pathways also tend to be phylogenetically co-conserved72, the availability of large-scale GI networks provides a wealth of novel functional relationships and an unprecedented opportunity to investigate evolutionary co-conservation of functionally related genes on a vast scale. To achieve this goal, gene co-conservation was calculated for every pair of genes represented in the Broad-GI network using a mutual information (MI) score, which assigns a higher value for genes that are frequently both present or both lost (thus co-conserved) across bacterial genomes (2.1.4.2). Using a co-conservation MI score of 0.2 as the threshold for defining significantly co-conserved gene pairs (based on the average MI score for EcoCyc co-complex pairs with PCC >= 0.3 – see Figure 2-8A), a Broad-GI functional co-conservation network was generated (Figure 2-9C), which revealed a significant degree of functional integration among genes with broad levels of phylogenetic conservation.

Further examination of functionally correlated genes with differing degrees of phylogenetic conservation revealed biologically insightful conclusions. For instance, genes with highly correlated GI-profiles also tended to be highly conserved (>= 90%) across Gamma-Proteobacteria. These genes largely represented core biological processes whose integrated function is essential for bacterial survival, such as tRNA charging, cell envelope biosynthesis, gluconeogenesis pathways, as well as components of DNA polymerase, RNA polymerase, and Ribosome complexes (Figure 2-10A). However, even within these essential and highly-coordinated machineries interesting distinctions in GI-profile correlation patterns were noted, such as a predominance of anti-correlated GI-profiles (Pearson correlation < 0), which indicate genes that are likely to play distinct biological roles, involving the paralogous lysine aminoacyl-tRNA synthetases, lysU and lysS, and pheST encoding the phenylalanine tRNA ligase complex (Figure 2-10B – Example 1), consistent with the previously noted functional specialization of these genes in regulating mischarged tRNAs under heat shock or oxidative stress conditions, respectively203–205. In addition, both highly-correlated and anti-correlated GIs were also identified among biological complexes and pathways with subunits of varying conservation, which may suggest the evolutionary expansion of complex or pathway function through the acquisition of novel subunits. For example, although the majority of the components of the flagellum were found to possess significantly correlated GI-profiles and be co-conserved (Figure 2-10B – Example 2), anti-correlated GI-profiles were identified particularly involving subunits with lower phylogenetic conservation (<50%), possibly reflective of their specialized roles in flagellum function or assembly206.
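The co-conservation score can be illustrated with a short Python sketch of mutual information between two binary phylogenetic profiles. The exact formulation used in this work is given in 2.1.4.2 and may differ (for example in normalization or logarithm base); the function name and the choice of base-2, unnormalized MI are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(profile_x, profile_y):
    """Mutual information (in bits) between two presence/absence profiles.

    Each profile is a sequence of 0/1 values, one per genome, in the same
    genome order. Base-2, unnormalized MI is an assumed formulation.
    """
    n = len(profile_x)
    if n == 0 or n != len(profile_y):
        raise ValueError("profiles must be non-empty and of equal length")
    joint = Counter(zip(profile_x, profile_y))
    marg_x = Counter(profile_x)
    marg_y = Counter(profile_y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error
        mi += p_ab * log2(count * n / (marg_x[a] * marg_y[b]))
    return mi
```

One design point worth noting: MI is symmetric in presence and absence, so a perfectly complementary pair of profiles scores as highly as a perfectly matching pair, and in practice the score would be combined with a functional measure such as the PCC >= 0.3 criterion used above.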

Furthermore, co-conservation of significantly correlated GI-profiles was also found to occur among genes comprising distinct complexes and pathways, indicative of related roles in growth under nutrient limitation and antibiotic resistance (Figure 2-10B – Example 3). One example identified co-conservation and significant correlation of GI-profiles among the enterobactin transporter and biosynthesis genes (fepBCDG and entBE), tRNA thiolation complex (tusBCD), xanthine dehydrogenase complex (xdhABC), thiamine coenzyme metabolism (thiDEM), and glycerol-3-phosphate dehydrogenase complex (glpABC)207–211. It is also noteworthy that between these distinct biological processes, significantly correlated GI-profiles are found to occur among genes with moderate-to-low phylogenetic conservation (< 30% orthologs detected in Gamma-proteobacterial genomes), which provides novel insight into the functional consequences and modular nature of genomic evolution via the acquisition of novel genes which enable the elaboration of previously existing complexes and pathways.

Figure 2-10. Functionally Correlated and Phylogenetically Co-Conserved Complexes and Pathways Identified from the Broad-GI Network. A – Essential genes (as defined by the PEC database212) with overall high phylogenetic conservation (>= 90% of Gamma-proteobacteria with ortholog detected) and highly correlated GI-profiles (PCC >= 0.5). B – Selected examples of genes associated with diverse biological processes in E. coli illustrating: 1) Anti-correlation of highly-conserved paralogous lysRS and pheST complex genes with other members of the aminoacyl-tRNA synthetases; 2) Anti-correlation involving flagellum complex subunits with low phylogenetic conservation (~ 10% of Gamma-proteobacterial species with ortholog detected); 3) Functional correlation among distinct E. coli complexes and pathways with roles in bacterial environmental adaptation. Networks visualized using Cytoscape 3.5.1167.

2.2.2.3 Epistatically Interacting Gene Pairs under DNA Damage Show Varying Degrees of Conservation among Bacteria

DDR-GI datasets can also be applied to gain a better understanding of how the acquisition and epistatic integration of novel genes have contributed to the evolution of DDR pathways in E. coli. I therefore compared the extent of conservation of pairs of genetically interacting genes in E. coli DDR-GI networks generated under normal (UT) and DNA damaging growth conditions (MMS), as well as gene pairs which show a significant change in epistasis between conditions (DF), across phylogenetically diverse bacteria. The conservation of DDR-GI gene pairs with significantly-scoring GIs was assessed using predicted bacterial non-supervised orthology groups (BactNOGs) extracted from the EggNOG database157. A phylogenetic profile was then constructed for each gene represented in the DDR-GI networks, representing the presence or absence of predicted orthologs across 747 species from 11 major bacterial phyla (2.1.4.1).
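The construction of phylogenetic profiles and the conservation of an interacting gene pair can be sketched as below. This is illustrative only: the input mappings (`gene_to_og`, `og_to_species`), the function names, and the definition of pair conservation as the fraction of species in which orthologs of both genes are detected are assumptions for the example and do not describe the actual EggNOG file formats or the procedure in 2.1.4.1.

```python
def build_profiles(gene_to_og, og_to_species, species):
    """Build a binary presence/absence profile per gene, ordered as `species`.

    gene_to_og:    {gene: orthologous group identifier} (hypothetical input)
    og_to_species: {orthologous group identifier: set of species with a member}
    species:       ordered list of species to profile
    """
    profiles = {}
    for gene, og in gene_to_og.items():
        present = og_to_species.get(og, set())
        profiles[gene] = [1 if sp in present else 0 for sp in species]
    return profiles


def pair_conservation(profiles, gene_a, gene_b):
    """Fraction of species in which orthologs of both genes are detected."""
    prof_a, prof_b = profiles[gene_a], profiles[gene_b]
    both = sum(1 for a, b in zip(prof_a, prof_b) if a == 1 and b == 1)
    return both / len(prof_a)
```

Restricting `species` to a single phylum would give phylum-level conservation values of the kind compared across phyla in the analysis that follows.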

Figure 2-11. Phylogenetic Conservation of Epistatic Interactions across Diverse Pathways Involved in DNA Damage Repair and Response in E. coli. A – Distribution of UT, MMS, and DF DDR-GIs within and between defined biological processes related to DNA damage repair and response. B – Phylogenetic co-conservation (MI scores) of DDR-GI network genes. C – Phylogenetic Co-Conservation of UT, MMS, and DF GIs occurring among genes within and between biological processes. D – Between-process UT, MMS, and DF GI co-conservation by major bacterial phyla. E – Network illustrating the top 10% of conserved within- and between-process DF GIs among Proteobacterial genomes; node size depicts the degree of conservation of E. coli genes annotated to a given process; edge thickness indicates the proportion of DF GIs conserved across Proteobacterial species (% of genomes where orthologs of an interacting gene pair are identified); edge colour indicates a decrease (red) or increase (blue) of GI scores between MMS and UT conditions. Network visualized using Cytoscape 3.5.1167.

For UT, MMS and DF networks (i.e. GIs assessed under normal growth, DNA damaging conditions, and those specifically altered in DNA damage response, respectively), GIs occurring between genes annotated to different DDR processes greatly outnumbered GIs occurring within the same processes and also showed greater phylogenetic conservation overall (Figure 2-11A). When comparing between conditions, DF GIs, which showed significant changes in epistasis in response to DNA damage, also showed a markedly greater degree of conservation across bacterial phyla compared to UT and MMS-specific GIs (P-value < 0.05) (Figure 2-11B-C). This trend was also found to be consistent across diverse bacterial phyla, including the extremophile phyla Deinococcus and Thermotoga, suggesting that DF GIs may define an integral set of conserved processes critical for bacterial DNA integrity (Figure 2-11D), and that, as for E. coli, other bacterial species are likely to possess their own particular complements of genes involved in DDR.

To investigate the degree to which novel genes have contributed to the organization of the E. coli DDR-GI networks, I next examined the conservation of intra- and inter-process DF genetically interacting gene pairs specifically among proteobacterial genomes (Figure 2-11E). Distinct patterns of GI conservation were observed among distinct DDR processes. For instance, processes with low overall conservation across proteobacterial species (< 50% of genomes examined) were found to comprise genes of uncertain function, metabolism, and, surprisingly, DNA replication, repair, and recombination. DF GIs among these processes also tended to show an increase in epistasis scores under DNA damaging conditions, likely reflecting their integrated roles in DDR. In contrast, processes showing higher conservation of genetically interacting gene pairs (> 75% of genomes examined) were annotated to core biological pathways, e.g., DNA replication, translation, and cell-envelope biogenesis, and showed decreasing epistasis scores under MMS. The significance of these processes identified under MMS conditions reflects the increasing importance of genes with roles in DNA damage repair responses which are otherwise more expendable under normal growth conditions. These findings suggest that epistatic integration of genes has enabled the elaboration of the E. coli DNA damage response, resulting in increased co-ordination of processes involved in DNA replication, transcription, translation, and cell division.

2.2.2.4 Correlation of GI-Profiles Correspond to Functional Divergence of Paralogous Genes in DDR-GI Networks

Large-scale GI networks also provide a useful source of functional relationships that can be utilized toward interpreting the functional consequences of gene duplications. In contrast to the Broad-GI network, the 45 paralogs identified in the DDR-GI network (Supplementary File 4) showed an opposite tendency toward alleviating GIs under both UT and MMS conditions (~ 54% and ~ 59% of UT and MMS GIs, respectively), which was also observed among the 497 non-paralogous groups (~ 60% and ~51% of UT and MMS GIs, respectively) (Figure 2-12). Although these differences were not found to be statistically significant, the tendency of DDR-GIs toward alleviating epistatic interactions likely reflects the coordinated roles played by both paralogs and non-paralogs in DNA damage repair and response pathways.

Figure 2-12. Paralogs and Non-Paralogs in the DDR-GI Network Show a Tendency for Alleviating GIs in UT and MMS Conditions. Changes in average proportions of GIs by type (green – alleviating, red – aggravating) were compared between UT and MMS conditions for paralogs (A) and non-paralogs (B) represented in the DDR-GI network.

I next examined differences between the correlations of GI-profiles, as an indicator of paralog functional relatedness, between UT and MMS conditions. Interestingly, paralogs were found to gain a significantly greater number of interactions under MMS conditions compared to non-paralogs (35.13 vs. 17.644, respectively; t-test P-value < 1.23e-8), resulting in a significant increase in GI-profile correlations (P-value < 0.016). This suggests that paralogs contribute to the elaboration of biological processes involved in DNA damage repair and response, reflected through functional reorganization of the DDR-GI network (Figure 2-13). Furthermore, a low number of paralogous groups (UT – 6/20; MMS – 7/20) were found to show significantly correlated GI-profiles under MMS, which further indicates the overall divergent roles of duplicate genes in response to DNA damaging conditions.
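As a minimal sketch of how a GI-profile correlation between two paralogs can be computed, the following assumes that each profile is a mapping from partner gene to epistasis score and that the correlation is a Pearson coefficient taken over the partners screened against both genes. The function names and the dictionary layout are hypothetical and are not taken from the analysis code used in this thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def profile_correlation(profile_a, profile_b):
    """Correlate two GI profiles (dict: partner gene -> epistasis score).

    Only partners present in both profiles are compared (an assumption).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    return pearson([profile_a[g] for g in shared],
                   [profile_b[g] for g in shared])
```

Under these assumptions, a pair of paralogs would be called correlated under a given condition when `profile_correlation` for that condition exceeds the chosen cutoff.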

Figure 2-13. Average Difference in Number of GI Interactions (A) and Average GI-Profile Correlation Measures of (B) Paralogous and Non-Paralogous Genes in DDR-GI Networks.


Figure 2-14. Paralogous Genes and GI-Profile Correlations Compared Between UT and MMS DDR-GI Networks. Paralogous DDR genes with corresponding GI-profile correlations derived from UT and MMS conditions; node colours correspond to paralogs identified as members of the same EggNOG165 orthologous group; paralogs with significant positively correlated GI-profiles are indicated with blue edges, red edges indicate negatively correlated, or anti-correlated, paralogs. Network visualized using Cytoscape 3.5.1167.

Paralogs with significantly correlated GI-profiles (PCC > 0.3) under MMS conditions were also found to have similar biological roles (Figure 2-14). Examples include the DNA polymerase IV encoding dinB and the DNA polymerase V subunit encoding umuC, which are both known to be involved in mutagenic DNA lesion repair and have been shown to be expressed during the SOS response213, and the structurally related murein DD-carboxypeptidases dacA and dacC, which are involved in peptidoglycan processing and maintenance of cell morphology and whose catalytic domains appear to be tuned for cell shape maintenance under normal and acidic conditions, respectively214,215. In contrast, the methyltransferase encoding paralogs ogt and ada, which possess correlated GI-profiles under UT, were found to be uncorrelated under MMS conditions, reflective of divergent roles in DNA damage repair and response. Although ogt and ada utilize a similar mechanism to repair methylated guanine and thymine bases via methyl transfer to a cysteine residue, Ada is known to possess a dual role as a transcriptional activator, which is carried out by a secondary methyltransferase domain that serves as an electrostatic switch, increasing DNA binding affinity and resulting in the expression of adaptive response pathway genes216–218.


A final example is the DEAD-Box family of paralogous RNA helicases dbpA, deaD, srmB, rhlB, and rhlE, which show distinct GI-profile correlation patterns between UT and MMS conditions (Figure 2-14). DEAD-Box RNA helicases are widespread across the three kingdoms of life and have been implicated in a variety of processes219. To date, their major role in bacteria appears to be restricted to ribosome biogenesis and RNA decay processes. Generally, DEAD-Box helicases consist of a helicase-core motif, a DEAD domain, and a variety of other motifs with varied levels of conservation220. Direct comparison of the GI profiles of the five paralogous RNA helicases reveals notable alterations in GI profile patterns between UT and MMS conditions (Figure 2-15). dbpA transitioned from largely aggravating to alleviating GIs under treatment with MMS, resulting in low GI-profile correlations with all other DEAD-Box helicase paralogs. In addition, a change from positive to negative correlation was observed between rhlB and rhlE GI-profiles under MMS. On the other hand, although rhlE and srmB retained their distinct patterns of alleviating and aggravating GIs in UT and MMS, reciprocal reduction in their respective GIs in response to MMS resulted in significant GI-profile correlation. This was similarly observed for deaD, which was also found to be significantly correlated with srmB under MMS.


Figure 2-15. Paralogous DEAD-Box RNA Helicase GI-Profiles Compared Across UT, MMS, and DF DDR-GI Networks. Heatmap rows correspond to the GI-profile for an individual DEAD-Box RNA helicase paralog, columns correspond to individual genes screened in UT, MMS, and DF DDR networks. Colour indicates type of epistatic interaction in red (aggravating GI) and green (alleviating GI) for UT and MMS conditions and resulting change in epistasis score in orange (increase) and purple (decrease) for DF interactions.

I sought to investigate whether the differences noted in GI-profile correlation between DEAD-Box helicase paralogs could further inform their distinct biological roles in ribosome biogenesis. The maturation of the 50S large-subunit of the ribosome is understood to occur through the sequential binding of different maturation factors that facilitate its proper folding. srmB plays an important role in the early stages of 50S maturation in bacteria, whereas deaD and dbpA are largely implicated in later stages219. The fundamental role of ribosome biogenesis and translation in the cell is suggested by the GI profiles of srmB, deaD and dbpA. For instance, the particular reduction of aggravating epistatic interactions for srmB may be indicative of the background activity of deaD and dbpA under MMS exposure. In addition, the significant number of aggravating GIs observed for srmB in contrast to other DEAD-Box RNA helicases could be a further indication of its additional role as a general RNA chaperone221, implicating it in a variety of translation-related processes that are likely to have a pronounced effect on cell growth. Although the precise role of rhlE is not as well known, recent evidence has implicated its function as a regulatory protein mediating different routes of 50S maturation via srmB or deaD222. As rhlE is the only DEAD-Box encoding gene to possess predominantly alleviating GIs in MMS, its deletion may provide a suppression effect, enabling either srmB or deaD to function in tandem to ensure 50S maturation and relieving fitness defects caused by deletions of genes involved in other aspects of DNA damage and repair. In contrast, the striking difference between the GI profile of rhlB and those of the other DEAD-Box genes under MMS may be accounted for by its quite distinct role as a crucial subunit of the RNA degradosome complex223.

Together, the examples presented in this study further suggest that gene duplication makes an important contribution to the organization of the E. coli DDR-GI networks between UT and MMS conditions. That the majority of paralogs identified in this study appear to possess uncorrelated GI-profiles further indicates that duplication and evolution of paralogs to novel functional roles also contributes toward the increased robustness of DNA repair processes.


2.3 Discussion and Conclusions

In this Chapter I present an analysis of two E. coli genetic interaction networks generated from two recently published epistatic genetic interaction screens. The first network is derived from a genome-wide screen of 163 genes of broad biological function (Broad-GI) and comprises 42,705 GIs. The second represents a focused study of 549 genes involved in DNA damage repair and response (DDR-GI), which were assessed under normal (UT) and DNA damaging (MMS) growth conditions, resulting in 23,648 GIs (UT) and 28,885 GIs (MMS), from which a set of 8,227 differential (DF), i.e. DNA damage dependent, GIs were derived. In general, aggravating interactions were found to predominate across diverse biological processes in both Broad-GI and DDR-GI networks, further reinforcing the importance of functional robustness as a contributor to bacterial adaptation and survival under diverse conditions and cellular stresses.

Through the examination of GI enrichment among previously defined E. coli functional modules, I demonstrate the ability to derive biologically meaningful insights from Broad-GI and DDR-GI networks. In the Broad-GI network the enrichment of GIs among functional modules revealed integral physiological relationships between iron-sulfur cluster biogenesis suf pathway and genes with roles in cellular redox sensing, as well as membrane stability sensing psp genes, antibiotic transporters and anaerobic respiratory pathways, indicating pathways that are likely critical for bacterial survival under stationary phase growth. Comparing the differences in functional modules enriched in DDR-GI networks enabled the further exploration of the dynamic reorganization of biological processes between normal and DNA damaging growth, illustrating the co-ordinated function of DNA recombination, replication and repair, as well as nucleotide metabolism and cell-division and stress response pathways.

Phylogenetic analyses enabled the exploration of the evolutionary factors that are likely to have contributed to the organization of E. coli Broad-GI and DDR-GI networks. In the Broad-GI network, genes with high co-conservation tend to comprise members of physically interacting complexes or biological pathways with highly correlated GI-profiles. Through the analysis of phylogenetic co-conservation of GI interactions and correlated GI-profiles derived from Broad-GI and DDR-GI networks, I identified numerous examples of functional integration among diverse biological processes. In the Broad-GI network, genes belonging to essential biological processes were found to possess highly correlated GI-profiles and were also highly conserved across Gamma-proteobacteria, while co-conserved genes with variable phylogenetic conservation were predominantly found to coincide with complexes and pathways likely to facilitate survival in diverse environments, such as the utilization of essential nutrients such as iron and sulfur, which play important roles in respiratory and translation processes. Interestingly, in the case of the flagellum, it was observed that components with low phylogenetic conservation possessed anti-correlated GI-profiles, which may underlie distinct roles in flagellum assembly or regulation224,225 and which may be further investigated for their effects on bacterial chemotaxis under different nutrient availabilities or growth conditions. In the DDR-GI network, gene pairs involved in dynamic response to DNA damage were also found to possess a significantly higher degree of co-conservation across a wide array of bacterial phyla, which likely represents core biological processes required for bacterial replication and growth. Together, these results from the phylogenetic analysis of E. coli genetic interaction networks suggest that genes unique to E. coli have been acquired through their integration with core processes, and that the extent of this integration can vary remarkably. This raises the possibility that in other proteobacterial species additional gene complements may also have accrued to enable particular adaptations to different lifestyles and environments.

Robustness is an important property of complex biological systems which both provides resilience to deleterious mutation and facilitates the evolution of adaptive responses to changing environmental conditions226. Such a property is attributed to the functional redundancy of genes, wherein the loss of function in one gene can be compensated by the presence of another with a similar biological role227. An important source of functional redundancy as well as biological innovation is gene duplication, whereby sequence divergence of duplicate, i.e. paralogous, genes can result in robust phenotypic responses, either through selection of a paralog for expression under a particular environmental stress (sub-functionalization), or through the evolution of novel functions (neo-functionalization)228.

Biological networks have provided increasing insights in understanding the contribution of duplication toward increased biological robustness in the context of Saccharomyces cerevisiae genetic, physical, and expression networks229–231. A recently performed laboratory evolution experiment in S. cerevisiae found an increased mutational robustness in ancient duplicates compared to non-duplicated singletons after 2200 generations of growth under deleterious mutation accumulating conditions229. Additionally, duplicates were found to show significant changes in expression patterns, further supporting the ability of duplicates to facilitate adaptation to changing environmental conditions and stresses. This finding was also further supported by a study of the expression of S. cerevisiae ancient duplicates, which showed notable differences in their expression patterns compared to singleton genes assessed under five different environmental stresses230. Interestingly, utilizing data from a large-scale screen of GIs in S. cerevisiae62, the authors were also able to investigate different evolutionary trajectories of paralogs as a consequence of adaptation to stress. Paralogs with different transcriptional profiles were more likely to possess a greater proportion of aggravating GIs when compared to singleton genes, indicating retention of ancestral functions and sub-functionalization through altered expression. However, paralogs that were both upregulated under stress were less likely to share aggravating GIs, thus indicating neo-functionalization and evolutionary selection for divergent roles.

To date, studies of biological robustness and functional divergence of duplicates in bacteria have largely comprised general surveys of broadly defined functional classifications and comparative genomics analyses116,232,233, or literature-curated regulatory networks234,235. However, findings from these studies indicate the important role of gene duplication as a means of bacterial adaptation and survival according to the requirements of their particular lifestyles or environments. For instance, an early survey of 106 fully sequenced bacterial genomes revealed the frequent occurrence of paralogs annotated to COG functional categories involving the metabolism of amino acids, inorganic ions, carbohydrates, defense mechanisms and energy production116. A more recent comparative genomics survey examined the occurrence of paralogs distributed not only by functional category but also in relation to their phylogenetic distribution among 200 fully sequenced bacterial genomes233. To identify paralogs that are likely to play a role as specific bacterial lifestyle adaptations, the authors applied a network theoretical approach to cluster bacterial species on the basis of statistical enrichment of paralogs with related GO terms. From this analysis several examples were identified where duplication has contributed to the species-specific expansion of distinct biological processes, for example, of energy production pathways in the host-associated Gamma-proteobacterium Salmonella enterica, iron-sulfur cluster mediated processes in Escherichia coli, and DNA-mediated transposition in Yersinia pestis.


Alternatively, the authors also discovered that gene duplications were indicative of common adaptive mechanisms in phylogenetically distant species which reside in similar environments. For example, duplications with roles in copper binding were identified in the aquatic species Dinoroseobacter shibae, Mycobacterium gilvum, and Nitrobacter hamburgensis, which are likely to reflect increased requirements for copper scavenging mechanisms in aquatic environments where copper is likely to be scarce.

With the recent advent of GI screening approaches in E. coli, it is possible to systematically chart functional relationships on a genome-scale for a model bacterium. Thus the analysis presented in this study represents the first application of GI networks in E. coli to systematically investigate the contribution of gene duplication and functional divergence in the context of the organization of biological pathways and biological robustness. In the Broad-GI network, the average correlation of GI-profiles was found to be significantly lower among paralogous genes than for highly conserved genes belonging to essential biological processes. This finding suggests that the decrease in evolutionary constraints resulting from duplication has contributed to paralog neo-functionalization and selection for divergent biological roles in E. coli. Furthermore, it was seen that paralogs in the Broad-GI network possessed a slightly greater average correlation of GI-profiles than non-duplicates, which may indicate that paralogs have retained some degree of their ancestral function(s). In contrast, paralogs in the DDR-GI network were only seen to show an increase in their GI-profile correlations under DNA damaging conditions, suggesting the increased importance of paralogs in mediating survival in response to environmental challenges. Further analysis of Broad-GI and DDR-GI networks identified several examples of functional divergence of paralogs and their significance as a means of bacterial adaptation to diverse environmental conditions. In the Broad-GI network, anti-correlated GI-profiles were noted among the well-conserved paralogous lysine tRNA synthetases, indicating neo-functionalization following duplication, which is further supported by their specialized roles and altered expression under different environmental conditions. Furthermore, in the DDR-GI networks several cases were noted of correlated paralog GI-profiles specifically under MMS conditions, such as the DEAD-Box RNA helicases involved in ribosome biogenesis and other aspects of RNA processing, as well as the DNA repair polymerases umuC and dinB. Together, these results demonstrate that GI networks provide valuable insights into the biological significance of paralogs, enabling a greater understanding of their roles in diverse biological processes and bacterial adaptation to diverse environmental conditions.

In this chapter I show that GI networks can be applied to derive novel insights enabling the inference of the underlying functional relationships among diverse biological processes, including protein complexes and pathways. I further demonstrate that GI networks can be a valuable resource in addressing important questions of functional divergence of gene duplications and their role in the elaboration of biological networks and increasing bacterial robustness. Although the number of GI screens performed in E. coli is limited, the utility of this approach in generating genome-scale functional networks, enabling the investigation of pathways relevant for bacterial survival under diverse growth conditions, shows great promise for the future investigation of the evolution of gene duplications and their adaptive significance.


Chapter 3 Investigation of the Organization of Physical Complexes and Functional Divergence of Paralogs in the E. coli Cell Envelope

Portions of analyses and figures have been adapted with permission from:

Babu M, Bundalovic-Torma C, Calmettes C, Phanse S, Zhang Q, et al. 2018. Global landscape of cell envelope protein complexes in Escherichia coli. Nature Biotechnology. 36(1):103-112.

Attributions

Datasets including subcellular localization annotations of the E. coli K12 MG1655 proteome and cell envelope predicted protein-protein interactions generated by affinity purification mass- spectrometry (AP-MS) under multiple detergent conditions were provided by Mohan Babu (Assistant Professor, University of Regina).

Probabilistic scoring of AP-MS cell envelope PPIs, log-likelihood integration of multiple scoring metrics, and final benchmarking to define the high-confidence CE-PPI used in this analysis were performed by Dr. Cindy Jiang (former post-doctoral fellow, lab of Dr. John Parkinson, Senior Scientist, Hospital for Sick Children). To help benchmark these data, I performed a systematic curation to define a Gold Standard set of E. coli K12 MG1655 physical interactions.


3 The E. coli Cell Envelope Protein-Protein Interaction Network

In chapter 2 I demonstrate the value of GI networks in exploring the organization of biological pathways in E. coli in the context of functional modules, as well as investigating the role of gene duplication in shaping the diverse biological roles of paralogs. PPI networks provide an additional complementary dataset for interpreting the impact of duplication on gene function. PPIs have a further advantage over GIs in that they identify proteins with related biological roles through their physical associations into complexes with distinct biological functions, providing valuable insight into how proteins function to coordinate biological processes in the bacterial cell. Thus, similar to the approach taken in chapter 2, in this chapter I first demonstrate the functional relevance of PPIs in elucidating the organization of the E. coli cell envelope (CE-PPI) by identifying complexes comprised of proteins with related biological roles. Then I utilize CE-PPIs to systematically investigate the functional divergence of paralogs by comparing differences in their physical interactors and associated biological roles. Overall I show that the CE-PPI network, comprising 14,376 interactions encompassing 932 cell-envelope proteins of diverse topology, is biologically meaningful and recapitulates known complexes which mediate diverse physiological processes in E. coli. I also highlight several examples where novel physical associations suggest dynamic integration of complexes involved in environmental sensing pathways. Furthermore, integration of CE-PPI with previously published GI networks reveals limited overlap between PPI and GI. I demonstrate that GIs reveal epistatic integration of CE complexes into biological pathways with roles in coordinating cell growth and survival. Having established the biological validity of the CE-PPI network, I use it to investigate the consequences of gene duplication in the functional divergence of CE paralogs. I found that although paralogs possess a greater propensity for shared interactions than non-paralogs, indicative of retention of their ancestral functions, nearly all paralogs possess unique sets of interactors indicative of neo-functionalization. I end with a presentation of examples illustrating instances of functional enrichment among paralog interactors and discuss their consequences in the evolution of duplicates and overall importance as a means by which bacteria adapt to diverse environmental conditions.

3.1 Materials and methods

3.1.1 Sources of Data: E. coli Cell Envelope Associated Proteome Physical Interaction Network

Selection of cell-envelope target proteins, affinity tagging, plasmid-based transformation in E. coli strains, growth, screening of multiple-detergent buffers for enrichment of membrane protein fractions, and AP-MS experiments were performed by postdoc Dr. Mohan Babu in the lab of collaborator Dr. Andrew Emili (University of Toronto).

AP-MS experiments were performed for targeted TAP-tagged cell envelope bait proteins which were extracted employing a multiple-detergent solubilization procedure (octaethylene glycol monododecyl ether, DDM, and Triton X-100) optimized for membrane proteins while minimizing the disruption of physical interactions236. MS peptide identifications of potential interacting target and bait proteins from multiple-detergent replicated AP-MS experiments were then utilized to derive a finalized set of high-confidence CE-PPI (performed by postdoc Dr. Cindy Jiang, lab of Dr. John Parkinson, Hospital for Sick Children) through a Bayesian integration approach. In brief, two standard protein interaction probability-based scoring algorithms developed for AP-MS datasets were used to maximize the recovery of potential bait-prey interactions. First, the hypergeometric (HGScore) scoring scheme39 assumes a bait-prey and prey-prey matrix model and uses normalized protein MS spectral abundance factors to measure bait and prey protein co-occurrence across all AP-MS experiments performed, with the probability of physical interaction determined using a hypergeometric distribution function. Second, the Comparative Proteomic Analysis Software Suite (CompPASS) score237 assumes a spoke model and uses average spectral counts of bait-prey proteins to determine interaction specificity based on the frequency with which a prey is found co-purified with a particular bait.

After HGScore and CompPASS scoring was applied to MS purification experiment data, an integrated log-likelihood score (LLS) for identified bait and prey protein pairs was determined based on the summation of the combined likelihoods of each score to recover previously-known physical interactions derived from a Gold-Standard set of E. coli cell envelope associated PPI. To derive this Gold-Standard set of known, experimentally confirmed interactions, I collated interactions involving at least a single cell-envelope protein from the following sources: small-scale experiments found in the iRefWeb database236 (excluding previously published large-scale AP-MS studies) and well-studied cell-envelope complexes and pathways, e.g., ABC transporters, phospho-tyrosine signaling system components, and secretion systems, defined by the EcoCyc150 and KEGG237 databases. Based on the calculation of the area under the Receiver Operating Characteristic (ROC) curve, which compares the proportion of Gold-Standard interactions (true positives) against that of a defined set of non-interacting protein pairs (false positives) captured in the scored CE-PPI dataset, a final LLS cutoff was chosen based on the maximal recovery of Gold-Standard interactions. For detailed information on the scoring and benchmarking approach, please see the supplementary information of reference 3. The resultant dataset contained 14,376 high-confidence CE-PPI encompassing 2118 proteins, 870 of which are annotated or predicted cell envelope proteins.
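The ROC-based benchmarking described above can be illustrated with a small sketch. It assumes that each scored protein pair carries a single numeric score (here standing in for the LLS) and is labelled either as a Gold-Standard interaction or as a presumed non-interacting pair; the function names are hypothetical, and the pairwise-comparison formula for the area under the curve is a generic technique rather than the procedure used for the published dataset.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen Gold-Standard pair
    scores higher than a randomly chosen negative pair (ties count 0.5).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def rates_at_cutoff(positive_scores, negative_scores, cutoff):
    """True-positive and false-positive rates for pairs scoring >= cutoff."""
    tpr = sum(s >= cutoff for s in positive_scores) / len(positive_scores)
    fpr = sum(s >= cutoff for s in negative_scores) / len(negative_scores)
    return tpr, fpr
```

Scanning `rates_at_cutoff` over candidate cutoffs gives the points of the ROC curve from which a working threshold could be read off.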

3.1.2 Prediction and benchmarking of predicted E. coli CE-PPI protein complexes with Markov Clustering

To predict protein complexes, a Markov Clustering (MCL) algorithm134 was applied to the list of CE-PPI network interactions (2118 proteins, 14,376 interactions). Repeated rounds of clustering were performed by varying the MCL inflation parameter value (I) in the range of 1.2 – 5.2 by increments of 0.2. To benchmark complex predictions, I compared clusterings based on their coverage of reference EcoCyc complexes, calculated as the number of EcoCyc co-complex subunits identified as members of the same CE-cluster divided by the total number of subunits comprising the EcoCyc complex. The inflation parameter yielding the greatest coverage of reference complexes was then selected as the optimal clustering, defining the final set of CE-complexes for use in subsequent analyses.
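The coverage criterion used to choose the inflation parameter can be sketched as follows. The sketch assumes that clusterings have already been produced by MCL for each inflation value, that a complex's coverage is taken from the single cluster containing the most of its subunits, and that coverage is averaged over reference complexes; these aggregation choices and all function names are assumptions for illustration.

```python
# Inflation values scanned: 1.2 to 5.2 in steps of 0.2 (21 values).
INFLATION_GRID = [round(1.2 + 0.2 * k, 1) for k in range(21)]


def complex_coverage(reference_complex, clustering):
    """Fraction of a reference complex's subunits found in one cluster.

    Assumption: the cluster with the largest overlap is used.
    """
    subunits = set(reference_complex)
    best = max((len(subunits & set(cluster)) for cluster in clustering),
               default=0)
    return best / len(subunits)


def mean_coverage(reference_complexes, clustering):
    """Average coverage over all reference complexes (assumed aggregate)."""
    return sum(complex_coverage(ref, clustering)
               for ref in reference_complexes) / len(reference_complexes)


def pick_inflation(clusterings_by_inflation, reference_complexes):
    """Return the inflation value whose clustering has the highest coverage.

    clusterings_by_inflation: dict mapping inflation value -> list of clusters,
    where each cluster is an iterable of protein identifiers.
    """
    return max(clusterings_by_inflation,
               key=lambda i: mean_coverage(reference_complexes,
                                           clusterings_by_inflation[i]))
```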

3.1.3 Integration of CE-PPI and previously published E. coli Genetic Interaction Networks

As an extension of the analyses presented in Chapter 2, where the functional significance of GIs was examined based on enrichment among defined sets of E. coli functional modules, three previously published E. coli GI datasets were integrated with the CE-PPI network to investigate functional relationships among CE-clusters. These three datasets comprise a set of 821 cell envelope associated genes screened under normal and minimal medium growth conditions66, as well as a whole genome screen of 163 genes with broad biological functions (Broad-GI) screened under normal growth1, and 456 genes with roles in DNA damage repair and response (DDR-GI) screened under normal and DNA damaging conditions2. For each GI dataset I mapped proteins represented in the CE-PPI network to their corresponding genetically interacting loci. To account for experimental variability across different E. coli GI datasets, only GIs determined under normal laboratory growth conditions were considered. For protein pairs with GIs contributed by multiple datasets, only GIs with a consistent epistasis direction were considered, i.e., either a) alleviating (fitness better than expected when the corresponding gene pair is knocked out) or b) aggravating (fitness worse than expected). This collected set of E. coli GIs, comprising 4018 genes and 158,162 GIs, was then mapped to the corresponding CE-cluster memberships in the CE-PPI network. Mapping genes to their corresponding proteins and cluster memberships in the CE-PPI network generated a final integrated CE-GI network subset of 2020 genes (~95% coverage of CE-PPI network proteins) and 23,875 GIs; overlap between PPI and GIs was low (~6%) (Figure 3-2A). To examine the functional integration of GIs among CE-clusters, genetically interacting gene pairs were then assigned to their corresponding CE-cluster memberships and categorized according to their occurrence within (intra-cluster) or between (inter-cluster) CE-clusters, which resulted in 10,873 GIs representing 261 CE-clusters. Using a permutation testing approach I previously applied for identifying GI enrichment across E. coli functional modules74 (see Chapter 2), Z-scores were computed for each genetically interacting CE-cluster pair (intra- and inter-). CE-cluster pairs were then ranked according to Z-score and only the top 5th percentile of CE-cluster pairs (543) showing significant enrichment of GIs were selected for downstream analysis.
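A permutation-based Z-score for GI enrichment between a pair of CE-clusters can be sketched as below. The null model shown here, which shuffles cluster labels among genes while keeping the GI list fixed, is an assumption chosen for illustration; the exact randomization scheme is the one described in Chapter 2, and the function name and arguments are hypothetical.

```python
import random
from statistics import mean, pstdev


def gi_enrichment_z(gi_pairs, cluster_of, cluster_a, cluster_b,
                    n_perm=1000, seed=0):
    """Z-score for the number of GIs linking cluster_a and cluster_b.

    gi_pairs:   iterable of (gene1, gene2) genetic interactions
    cluster_of: dict mapping gene -> cluster identifier
    Setting cluster_a == cluster_b scores intra-cluster enrichment.
    """
    target = {cluster_a, cluster_b}
    gi_pairs = list(gi_pairs)

    def count(labels):
        n = 0
        for g1, g2 in gi_pairs:
            c1, c2 = labels.get(g1), labels.get(g2)
            if c1 is None or c2 is None:
                continue  # gene not assigned to any cluster
            if {c1, c2} == target:
                n += 1
        return n

    observed = count(cluster_of)
    genes = list(cluster_of)
    labels = [cluster_of[g] for g in genes]
    rng = random.Random(seed)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # assumed null: random reassignment of labels
        null_counts.append(count(dict(zip(genes, labels))))
    sd = pstdev(null_counts)
    if sd == 0:
        return 0.0
    return (observed - mean(null_counts)) / sd
```

Cluster pairs could then be ranked by this score and the highest-scoring fraction retained, as described above.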

3.1.4 Analysis of Functional Divergence of Paralogs by Differences in PPI Overlap and Functional Enrichment of Paralog Physical Interactions

3.1.4.1 Selection of Paralogous CE Proteins

To understand the role of gene duplication in the functional organization of the CE-PPI network, paralogs were obtained through the integration of two complementary orthology prediction databases, Orthologous Matrix (OMA)238 and EggNOG165. Sets of predicted orthology groups were downloaded from the then current releases of OMA (release 17 ver., downloaded May 11, 2015) and EggNOG (release 4 ver., downloaded May 6, 2015) and used to assign ortholog memberships to CE-PPI network proteins. Paralogous CE proteins were identified based on membership in the same EggNOG and OMA orthology groups. Given that EggNOG provides orthologous group predictions at different taxonomic resolutions, the degree of overlap was compared between OMA and three sets of EggNOG orthology groups, representing fully sequenced bacterial (bactNOG), proteobacterial (proNOG), and Gamma-proteobacterial (gproNOG) genomes. proNOG displayed the greatest degree of overlap with OMA ortholog group predictions, resulting in 91 paralogs for subsequent analysis.
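The selection of paralogs supported by both orthology resources can be sketched as below. The sketch assumes that each database has been reduced to a mapping from E. coli protein to orthologous group identifier, and that overlap between two databases is measured as the Jaccard index of their implied paralog pairs; the overlap measure, function names, and group identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def paralog_pairs(group_of):
    """Protein pairs that share an orthologous group.

    group_of: dict mapping protein -> orthologous group identifier.
    """
    members = defaultdict(list)
    for protein, group in group_of.items():
        members[group].append(protein)
    return {frozenset(pair)
            for proteins in members.values()
            for pair in combinations(sorted(proteins), 2)}


def consensus_paralogs(eggnog_groups, oma_groups):
    """Pairs called paralogous by both EggNOG and OMA group assignments."""
    return paralog_pairs(eggnog_groups) & paralog_pairs(oma_groups)


def overlap_fraction(eggnog_groups, oma_groups):
    """Jaccard overlap of paralog pairs (an assumed agreement measure)."""
    a, b = paralog_pairs(eggnog_groups), paralog_pairs(oma_groups)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Computing `overlap_fraction` once per EggNOG taxonomic level would allow the level agreeing best with OMA to be chosen.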

3.1.4.2 Calculation of paralog CE protein PPI overlap

Physical interaction subnetworks for each identified CE paralog were extracted from the finalized CE-PPI network and used to construct a set of paralog PPI profiles, where each profile represents a list of all interacting proteins associated with a given paralog. The proportion of physical interactions shared between paralogous CE protein pairs was calculated using the Jaccard Index of their respective PPI profiles. For proteins A and B, the Jaccard Index of their physical interaction profiles, J_{PPI-AB}, is calculated by:

J_{PPI-AB} = Overlap_{PPI-AB} / (Overlap_{PPI-AB} + Unique_{PPI-A} + Unique_{PPI-B})
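The Jaccard Index defined above translates directly into a few lines of code. This is a minimal sketch in which each PPI profile is a collection of interactor identifiers; the function name is hypothetical, and returning 0.0 when neither protein has interactors is an assumption.

```python
def jaccard_ppi(interactors_a, interactors_b):
    """Jaccard Index of two PPI profiles.

    shared / (shared + unique to A + unique to B), i.e. the size of the
    intersection divided by the size of the union of the two profiles.
    """
    a, b = set(interactors_a), set(interactors_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither protein has interactors
    return len(a & b) / len(union)
```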

Individual paralog PPI subnetworks extracted from the CE-PPI network were visualized in Cytoscape167 (version 3.2.0).

3.1.4.3 Functional Enrichment of Paralog Physical Interactions

To examine how physical interactions have contributed to the functional divergence of paralogs in the CE-PPI network, functional enrichment of paralog physical interactions was assessed using a hypergeometric-testing approach. It is important to note that although a hypergeometric model was also utilized in the HGScore method for the probabilistic scoring of the CE-PPI network (see 3.1.1), that model disregards the functional annotation of co-purified proteins and assigns their interaction probability solely based on MS peptide spectral count information, and is therefore not expected to contribute a methodological bias in the present analysis.


First, all interactors associated with a given group of CE paralogs were assigned to their corresponding experimentally supported COG functional categories (designated by letter codes)241, grouped into 7 major biological processes: transcription & translation (J, A, K, L, O), cell cycle & defense (D, V, T), membrane or membrane-associated (M, N, U), energy production (C), metabolism & transport (G, E, F, H, I, P, Q), unknown (R, S) and multiple. To identify paralogs which show enrichment in PPI associated with a particular biological process, I compared the functional distributions of paralog interactors using hypergeometric testing based on:

1) Number of interactions unique to a single member of a paralogous group (uniquecog); and

2) Number of interactions shared by two or more members of a paralogous group (sharedcog).

Hypergeometric tests were performed using a custom-written Perl script to test for statistical enrichment of interactions involving uniquecog and sharedcog COG functions compared to the distribution of COG function interactions encompassing the entire CE-PPI network (totalcog). A Bonferroni correction (for an alpha value of 0.05) was then applied to adjust for Type I errors associated with multiple hypothesis testing. Functional divergence between members of a paralogous group was inferred if at least one member showed a statistical enrichment in PPI belonging to COG categories associated with a particular biological function as defined above.
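The enrichment test described above can be sketched as follows. This is a Python illustration under stated assumptions, not the custom Perl script used in this work: the function names, the dictionary layout, the example counts, and the choice to set the number of tests equal to the number of COG categories are all hypothetical. The subset counts stand in for uniquecog or sharedcog, and the network counts stand in for totalcog; the subset is assumed to be no larger than the network.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def cog_enrichment(subset_counts, network_counts, alpha=0.05):
    """Bonferroni-adjusted enrichment p-value for each COG category.

    subset_counts:  interactions per category for one paralog (assumed layout)
    network_counts: interactions per category across the whole network
    """
    n_subset = sum(subset_counts.values())
    n_network = sum(network_counts.values())
    n_tests = len(network_counts)  # assumption: one test per category
    results = {}
    for category, in_network in network_counts.items():
        k = subset_counts.get(category, 0)
        p = hypergeom_upper_tail(k, n_network, in_network, n_subset)
        p_adj = min(1.0, p * n_tests)  # Bonferroni correction
        results[category] = (p_adj, p_adj < alpha)
    return results


# Hypothetical counts, for illustration only.
network = {"cell envelope": 20, "energy production": 30, "metabolism & transport": 50}
paralog = {"cell envelope": 8, "energy production": 1, "metabolism & transport": 1}
enrichment = cog_enrichment(paralog, network)
```

Each entry of `enrichment` holds the adjusted p-value and a flag indicating whether it falls below alpha.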


3.2 Results

3.2.1 Analysis of the functional organization of the CE-PPI Network Identifies Novel Interactors with Complexes Involved in Cell Growth, Division, Nutrient Transport, and Environmental Sensing

3.2.1.1 Gold-Standard PPI Curation and Benchmarking Enables the Generation of the First Large-Scale AP-MS E. coli CE-PPI Network and Systematic Exploration of Functional Organization of the Bacterial Cell Envelope

The use of a gold-standard set of experimentally validated PPI is an essential requirement for the benchmarking and generation of accurate large-scale PPI networks. For example, the Cyc2008 database provides gold-standard complexes curated from an extensive literature of small-scale experiments239 and has been employed for the scoring of S. cerevisiae physical interaction networks. However, an equivalent resource does not exist for E. coli membrane proteins. Therefore, I curated a high-quality set of E. coli membrane complexes by reviewing three online databases, EcoCyc150, iRefWeb236, and KEGG237 (see section 3.1.1), resulting in a gold-standard dataset of 3384 E. coli cell envelope associated PPI (Figure 3-1A). Subcellular localization of these PPI (Figure 3-1B) revealed a notable bias toward cytoplasmic interactions (1805 cytoplasmic PPI vs. 1575 CE PPI), owing to the known limitations of traditional aqueous-based experimental approaches in studying non-soluble proteins7,240. This gold-standard set was applied to determine meaningful PPI probabilistic scoring cutoffs to reduce the contribution of spurious interactions detected during AP-MS purifications (Figure 3-1C). Following integration of the two probabilistic scoring metrics applied to score co-purified bait-prey proteins identified through AP-MS pulldown experiments (HGScore and CompPASS), a final score cutoff of 5.27 was chosen based on the recovery of interactors belonging to reference EcoCyc complexes, resulting in 14,376 high-confidence CE-PPI, the majority of which represent novel interactions preferentially involving proteins of the cell envelope (8214 cytoplasmic PPI vs. 12640 CE PPI; Figure 3-1D).


Figure 3-1. E. coli Gold Standard PPI Curation and CE-PPI Network Benchmarking. A - Summary of experimentally curated gold-standard E. coli PPI and B - corresponding distribution of PPI by subcellular localization (CY – cytoplasm, MA – membrane associated, IM – inner membrane, LPI – inner membrane lipoprotein, PE – periplasm, LPO – outer membrane lipoprotein, OM – outer membrane, EC – extra-cellular, NA – localization unknown). C - Application of gold-standard PPI for benchmarking of CE-PPI using two probability scoring metrics (HGScore and CompPASS) and their log-likelihood integration score (LLS); selection of LLS score cutoff for defining the finalized CE-PPI network. D - Corresponding distribution of CE-PPI network by subcellular localization.


3.2.1.2 Cell-Envelope Clusters Represent Known Complexes with Novel Interactors of Diverse Biological Roles and Dynamic Properties.

To examine the functional organization of the CE-PPI network, I employed MCL clustering with the aim of predicting clusters corresponding to protein complexes defined by the E. coli CE-PPI network. Systematic investigation of the clustering inflation parameter, which influences the granularity or total number of clusters generated, yielded I=2.0 as delivering an optimal clustering, corresponding to an enrichment in distinct clusters comprising proteins belonging to known E. coli protein complexes (Figure 3-2AB and Supplemental File 5).
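To make the role of the inflation parameter concrete, the following is a minimal pure-Python sketch of the Markov Cluster (MCL) procedure: alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalization). It is an illustration under assumptions and not the MCL software used in this work; the function names, the dense-matrix representation, the fixed expansion power of 2, and the toy graph are all hypothetical, and the weighting of edges by interaction score is omitted.

```python
def _normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def _matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Self-loops are added before normalization, a common MCL convention.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalize_columns(m)
    for _ in range(max_iter):
        expanded = _matmul(m, m)  # expansion: flow spreads along paths
        inflated = _normalize_columns(  # inflation: strong flow is reinforced
            [[value ** inflation for value in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that retains flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Toy graph (hypothetical): two triangles joined by a single bridging edge.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
toy_clusters = mcl(toy, inflation=2.0)
```

Raising `inflation` tends to yield more, smaller clusters, which is why a range of values is typically scanned against reference complexes before one is chosen.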

Figure 3-2. CE-PPI Network Iterative MCL Clustering – Benchmarking Against Known E. coli Protein Complexes. A – Summary of CE-PPI clusters generated over a range of MCL inflation parameters with overall number of distinct EcoCyc150 reference complexes represented (blue dots), and the number of distinct MCL clusters possessing at least one EcoCyc complex subunit (red dots). Green line indicates chosen level of clustering (Inflation = 2.0) for defining CE-clusters indicating maximum portioning of EcoCyc complexes into discrete clusters. B – Corresponding proportion of coverage of EcoCyc co-complex members in discrete MCL clusters over the range of inflation parameters examined, selected clustering (Inflation = 2.0) indicated in green.

Investigation of the clustered CE-PPI network identified several instances of known complexes having novel interactors with related biological roles (Figure 3-3). Examples include: co-clustered members of the Tol-Pal membrane-spanning system with IM subunits of the anaerobic succinate dehydrogenase (SdhAB) (Figure 3-3C – Example 1), consistent with the requirement of proton motive force in mediating interactions of the Tol-Pal complex assembly241; shared interactions among homologous subunits of formate dehydrogenase complexes O and N, which may be the result of their common mode of biogenesis through interaction with the accessory protein FdhE242 (Figure 3-3C – Example 2); and interactions between soluble proteins with related roles in sulfur utilization, including periplasmic sulfonate and sulfate transport (SsuA, CysP), cytosolic sulfonate utilization dehydrogenase complex members (SsuDE), iron acquisition from heme (YfeX), and the iron-sulfur cluster containing superoxide dismutase (SodB), possibly reflecting related roles in element scavenging and protection during anaerobic respiration related nutrient limitation and stress (Figure 3-3C – Example 3).


Figure 3-3. Exploration of CE-PPI Network Defined Clusters. Top panel illustrates a topological comparison of the CE-PPI network: A) unclustered, and B) with nodes arranged according to their MCL cluster membership. Nodes are coloured according to subcellular localization. C) CE-Clusters identifying novel physical associations between known EcoCyc complexes and non-complex members with related biological roles. Proteins with distinct CE-cluster memberships are outlined in blue. Edge width and color correspond to the likelihood of physical association. Networks visualized using Cytoscape 3.5.1167.

An additional cluster identified all subunits of the NADH dehydrogenase complex along with a number of novel interactors (Figure 3-3C – Example 4). These include a variety of proton or ATP dependent transporters with related roles in anaerobic respiration, such as HyfH, a periplasmic iron-sulfur cluster containing protein associated with the Hyf operon encoding formate dehydrogenase 4 complex243. Investigating this intriguing association between energy production and transport, I identified further physical associations between subunits of the NADH dehydrogenase and CE-clusters defining: components of two ATP-dependent dipeptide transport systems, DppBCDF and OppBCDF, together with the PE peptide binding protein OppA, which shows a preference for binding basic peptides246; a cluster of copper transport and chaperone proteins (CusAB, CopA), previously shown to be critical for anaerobic growth of E. coli under amino acid limitation through the protection and biogenesis of iron-sulfur containing proteins247; and the adenylosuccinate lyase PurB, which is essential for ATP-linked acid resistance in E. coli248. Taken together, these findings support an important role for physical associations in the integration of distinct complexes in cellular pH homeostasis processes.

A final example further illustrates potentially novel insights that can be gained from the study of CE-PPI clusters comprising chemotaxis related proteins. Two clusters were identified encompassing core inner membrane subunits of the flagellar motor-switch complex and basal body, as well as the glucose/galactose/ribose sensing methyl-accepting chemotaxis protein (MCP) Trg (Figure 3-3C – Example 5). Interestingly, both flagellar motor complex subunits comprise a single cluster of interactions with the dynamically interacting chemotaxis sensory protein Trg and the CheY response regulator protein, despite their known distinct subcellular localizations244, which suggests that these are likely to represent indirect interactions, as further reflected by their lower PPI association scores. Trg was surprisingly found to physically associate with an extensive array of proteins, unreported in the literature, involved in biological and metabolic processes. Examining the functional roles of these interactors revealed related roles in cellular growth and regulation of chemotaxis. These consist of cell envelope biosynthesis (MltA, WecC), cell division and chemotaxis inhibition proteins (ZapA, an early stage cell division indicator, and GlgS, respectively), regulation of chemotaxis (AckA), biogenesis of periplasmic glucans (OpgB and EptA), gluconeogenesis (GlpX), repression of glucuronic acid utilization (KdgR), nucleotide biosynthesis and transport (Cmk, Tmk, CytR), and the NarZ subunit of the anaerobic nitrate reductase Z complex. Interestingly, previous studies performed in Pseudomonas aeruginosa have shown that mutations which impair the function of nitrate reductase A (NarG), as well as nitrite reductase, are associated with impaired motility and swarming phenotypes required for biofilm formation and dispersal245. Examining this potential link between nitrate reduction and flagellum regulation, I also identified additional inter-cluster interactions between Trg and both NarG and the periplasmic nitrite reductase NrfA. Although these findings are largely speculative, they suggest a means by which chemotaxis signaling may be dynamically regulated through binding with additional proteins which serve to sense nutrient availabilities and thus coordinate cellular responses to diverse environmental conditions. Furthermore, this analysis demonstrates that the CE-PPI network reflects biologically meaningful relationships, which will be applied in section 3.2.2 toward elucidating how evolutionary processes such as gene duplication have contributed to the functional diversification of paralogous proteins.

3.2.1.3 Integrated Network Analysis of Physical and Genetic Interactions Reveals Functional Cross-Talk Enriched Between Diverse CE Associated Biological Processes.

In the previous section, I demonstrated the utility of applying a clustering approach to physical interactions to reveal potentially novel biological insights among functionally diverse protein complexes. In the previous chapter, I showed how genetic interactions can reveal genes which act in a functionally redundant or co-dependent manner in mediating cell growth and survival. The application of this approach can thereby increase our understanding of the functional integration of genes belonging to distinct complexes and pathways. Furthermore, I showed that GIs can also be applied toward understanding how processes such as gene gain and functional diversification through duplication have contributed toward bacterial adaptation through the elaboration of pre-existing biological functions. Therefore, I devised an integrated approach to understand how these complexes might be further integrated into biological pathways as revealed through enrichment of epistatic relationships.

Mapping genes to their corresponding proteins and cluster memberships in the CE-PPI network generated an integrated CE-GI network, comprising 2020 genes (~95% coverage of CE-PPI network proteins) and 23875 GIs. GIs show little overlap with CE-PPI (~6%) (Figure 3-4A). Overall, intra-cluster GIs were found to be largely alleviating (635 / 873 = ~73%), while a slight majority of inter-cluster GIs were found to be aggravating (34096 / 66404 = ~51%) (Figure 3-4B). Inter-cluster GIs were also found to be more common than intra-cluster pairs (Figure 3-4C), which indicates that GIs provide novel information on the functional relationships of CE-complexes, which may either buffer one another or play co-ordinated roles in pathways relevant for bacterial growth and survival. Using the permutation testing-based approach I previously employed in Chapter 2, I generated a functional network of CE-complex pairs enriched in GIs for further examination.
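A permutation test of inter-cluster GI enrichment can be sketched as below. The exact procedure is the one described in Chapter 2 and is not reproduced in this section, so this Python sketch is a generic label-shuffling scheme offered as an assumption: the function names, the number of permutations, the fixed random seed, and the toy gene and cluster labels are all hypothetical. The Z-score compares the observed number of GIs linking two clusters against counts obtained after randomly reassigning cluster labels among genes.

```python
import random
from statistics import mean, pstdev


def count_inter_cluster_gis(gi_pairs, membership, cluster_a, cluster_b):
    """Count GIs with one gene in cluster_a and the other in cluster_b."""
    wanted = {cluster_a, cluster_b}
    return sum(
        1 for g1, g2 in gi_pairs
        if {membership.get(g1), membership.get(g2)} == wanted
    )


def permutation_z_score(gi_pairs, membership, cluster_a, cluster_b,
                        n_perm=1000, seed=0):
    """Z-score of the observed GI count against a label-shuffled null."""
    rng = random.Random(seed)
    observed = count_inter_cluster_gis(gi_pairs, membership, cluster_a, cluster_b)
    genes = list(membership)
    labels = [membership[g] for g in genes]
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # cluster sizes are preserved by shuffling labels
        shuffled = dict(zip(genes, labels))
        null_counts.append(
            count_inter_cluster_gis(gi_pairs, shuffled, cluster_a, cluster_b))
    mu, sigma = mean(null_counts), pstdev(null_counts)
    z = (observed - mu) / sigma if sigma > 0 else 0.0
    return observed, z


# Toy data (hypothetical): every GI links a gene in cluster A to one in cluster B.
toy_membership = {f"g{i}": "A" for i in range(1, 5)}
toy_membership.update({f"g{i}": "B" for i in range(5, 9)})
toy_membership.update({f"g{i}": "C" for i in range(9, 13)})
toy_gis = [(f"g{i}", f"g{j}") for i in range(1, 5) for j in range(5, 9)]
toy_observed, toy_z = permutation_z_score(toy_gis, toy_membership, "A", "B", n_perm=200)
```

Cluster pairs could then be ranked by Z-score and the top percentile retained, with aggravating and alleviating GIs counted separately if their proportions are of interest.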

Figure 3-4. Integration of CE-PPI with Previously Published E. coli Genetic Interaction (GI) Datasets. A – Number of unique and overlapping CE-PPI and CE-GIs. B – Distribution of GIs within (intra) and between (inter) CE-Clusters by type: aggravating GI (red) and alleviating GI (green). C – Comparison of average inter CE-cluster PPI and GIs.

CE-cluster pairs ranked in the top 5th percentile of Z-scores were selected for further examination as being CE complexes with statistically significant GI enrichment (p-value <= 0.05; Supplemental File 6). Among the 543 enriched CE cluster pairs, distinct patterns of GI enrichment were observed among clusters with related biological roles (Figure 3-5A). Notably, a few CE clusters were identified possessing a large degree of connectivity, which may indicate important roles as “hubs” coordinating integrated biological processes. A cluster hub with 12 interactors was identified representing FhuBCD ferric-hydroxamate transporter subunits (Figure 3-5B – Example 1), which showed particular enrichment in aggravating GIs (with a proportion of inter-cluster aggravating GIs >= 75%) with several clusters comprised of proteins involved in energy production, charged substrate and nucleic acid import. In contrast, a cluster representing subunits of the Fep enterobactin iron-siderophore transporter formed a hub of alleviating interactions with clusters involved in exopolysaccharide biosynthesis and secretion, cell envelope integrity (Tol proteins), as well as with the FhuBCD transporter (Figure 3-5B – Example 2). These interactions suggest a functional specialization between these related iron-siderophore import systems251 which may be crucial for coordinating distinct cellular responses depending on extracellular iron availability. A further example of a cluster serving as a hub of aggravating GIs was observed for the multi-drug efflux protein MacA and the flagellar protein FliL, which were seen to interact with clusters with roles related to cell-envelope biogenesis and multi-drug efflux (Figure 3-5B – Example 3). Interestingly, although the proportions of alleviating and aggravating inter-cluster GIs are roughly equal overall (49% and 51%, respectively), the majority of GIs enriched among CE-clusters appear to be alleviating (>= 75%). These alleviating GIs occur predominantly among clusters comprised of proteins involved in cell division, protein modification and homeostasis, and cell envelope protein biogenesis complexes, and are likely reflective of integrated roles in the overall maintenance of cell envelope integrity. Together, these results demonstrate that an integrative examination of the organization of physical and genetic interaction networks can serve as a useful basis for further elucidating biologically meaningful pathways in the bacterial cell.


Figure 3-5. GI Enrichment Among CE-clusters with Diverse Biological Roles. A – Top 5th percentile of integrated CE-clustered network (having three or more GIs) with inter-cluster GI enrichment. Nodes correspond to CE-clusters, with pie charts indicating the functional distribution of cluster members according to COG superfamily annotation. Edges indicate proportion of aggravating or alleviating inter-cluster GIs. Dashed circles indicate CE-cluster hubs enriched in aggravating and alleviating GIs, with corresponding subnetworks highlighted in panel B. Networks visualized using Cytoscape 3.5.1167.

3.2.2 CE paralogs possess variability in shared physical interactors reflecting specialized roles in diverse biological processes

In this section I focus on understanding the organization of the CE-PPI network through an evolutionary perspective, by investigating functional divergence of paralogous proteins. Based on the notion that functional divergence following gene duplication arises through differential loss or gain of interactors between paralogous proteins, I hypothesize that CE paralogs that have undergone functional divergence are more likely to possess fewer physical interactions in common and as a result belong to distinct CE-clusters. Furthermore, functional divergence among paralogs will also be reflected by the enrichment of interactors with distinct biological roles.

Using information extracted from orthology databases (eggNOG165 and OMA238), paralogous protein groups (>= 2 members) associated with the E. coli CE-PPI network were identified, comprising 91 paralogs in total (Figure 3-6A). The Jaccard index was employed to measure overlap of PPI between paralog pairs (3.1.4); paralogs were further annotated by biological roles (COG categories), paralog pairwise % sequence similarities were calculated through Needleman-Wunsch global alignments, and paralog protein abundances were downloaded from the PaxDb database246 in order to examine trends between paralog sequence divergence, protein abundance and their influence on paralog PPI overlap (Supplemental File 7).
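For readers unfamiliar with Needleman-Wunsch global alignment, a minimal Python sketch follows. It is not the alignment software used in this work, which is not named in this section. The scoring values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for simplicity, and the sketch reports percent identity over the alignment length, whereas percent similarity additionally requires an amino acid substitution matrix that is omitted here.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment of seq_a and seq_b."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Substitution score for aligning seq_a[i-1] with seq_b[j-1].
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover the alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


def percent_identity(seq_a, seq_b):
    """Percent of alignment columns with identical residues."""
    aligned_a, aligned_b = needleman_wunsch(seq_a, seq_b)
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Short hypothetical sequences, for illustration only.
example_alignment = needleman_wunsch("ACGT", "AGT")  # ("ACGT", "A-GT")
example_identity = percent_identity("ACGT", "AGT")   # 3 of 4 columns match: 75.0
```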


Figure 3-6. CE-PPI Network Paralogs: A - Selection of CE-Paralogs from EggNOG and OMA Orthology Databases; B - CE-Paralog COG Functional Distribution; C - Comparison of Average PPI Overlap of CE-Paralogs vs. Non-Paralogous Protein Pairs; and D – Comparison of GIs involving CE-Paralogs vs. Non-Paralogous proteins in the CE-PPI network.

When examining the distribution of biological roles of identified CE paralogs compared to non-paralogs, cell-envelope related roles were found to be particularly over-represented in paralogs, comprising the following COG super-categories: “Cell Envelope-Associated”, “Cell Cycle & Defense”, and “Energy Production” (Figure 3-6B). Although the majority of paralogous CE proteins possessed unique PPI (89/91 paralogs), ~1/3 of paralogous CE protein pairs shared at least one interactor (19/57 pairs). Comparison of the overlap of paralog PPI profiles further showed that paralog pairs have a significantly greater PPI overlap overall, as indicated by average Jaccard index values, than non-paralog pairs (Figure 3-6C). Furthermore, CE paralogs were also found to contribute a greater number of GIs than non-paralogs, with a significant enrichment for alleviating GIs (p-value < 0.05). In S. cerevisiae, duplicate genes were also found to possess an elevated degree of GIs, suggesting that they contribute to increased robustness of biological pathways253. Therefore, the increased number of GIs and divergence of PPI overlap observed may reflect neofunctionalization of paralogs mediated through physical integration into novel biological contexts (Figure 3-6D). This finding was further supported by the identification of numerous CE paralogous protein pairs statistically enriched in interactions implicating their involvement in a broad range of biological processes (Table 3-1 and Appendix 1).

| eggNOG_OMA Group | Protein | Transcription & Translation | Cell Cycle & Defense | Cell Envelope Associated | Energy Production | Metabolism & Transport | Multiple |
|---|---|---|---|---|---|---|---|
| proNOG00040_4833 | acrB__b0462 | 1 / 1 | 4 / 4 | 10 / 14 | 3 / 3 | 6 / 12 | 5 / 12 |
| proNOG00040_4833 | acrF__b3266 | 0 / 1 | 0 / 4 | 0 / 14 | 0 / 3 | 2 / 12 | 0 / 12 |
| proNOG00040_4833 | mdtF__b3514 | 0 / 1 | 0 / 4 | 4 / 14 | 0 / 3 | 4 / 12 | 7 / 12 |
| proNOG00096_8637 | tap__b1885 | 0 / 13 | 0 / 4 | 0 / 21 | 1 / 6 | 3 / 30 | 1 / 17 |
| proNOG00096_8637 | tar__b1886 | 1 / 13 | 1 / 4 | 1 / 21 | 0 / 6 | 3 / 30 | 1 / 17 |
| proNOG00096_8637 | trg__b1421 | 12 / 13 | 3 / 4 | 19 / 21 | 5 / 6 | 23 / 30 | 14 / 17 |
| proNOG00096_8637 | tsr__b4355 | 0 / 13 | 0 / 4 | 1 / 21 | 0 / 6 | 1 / 30 | 1 / 17 |
| proNOG02873_5521 | gfcE__b0983 | 2 / 7 | 2 / 3 | 2 / 9 | 1 / 3 | 1 / 5 | 1 / 4 |
| proNOG02873_5521 | wza__b2062 | 5 / 7 | 1 / 3 | 7 / 9 | 2 / 3 | 4 / 5 | 3 / 4 |
| proNOG03846_5478 | etk__b0981 | 0 / 0 | 1 / 1 | 2 / 4 | 0 / 0 | 0 / 0 | 0 / 0 |
| proNOG03846_5478 | wzc__b2060 | 0 / 0 | 0 / 1 | 2 / 4 | 0 / 0 | 0 / 0 | 0 / 0 |

Table 3-1. Functional Enrichment of Cell-Envelope Paralog Physical Interactions. The distribution of a selected subset of CE-PPI network paralog physical interactions is broken down according to 6 major COG functional annotations (see 3.1.4.3). For each paralog, the number of physical interactions found in the CE-PPI network is represented as a fraction of the total number of interactions possessed by all other proteins belonging to the same paralogous group. Paralogs found by hypergeometric testing to be enriched in interactions for a given COG annotation are highlighted in violet; lack of interactions is highlighted in grey.

I next examined the contributions of paralog sequence divergence and protein abundances as potential factors in shaping the differences in paralog PPI profile overlap. A significant positive correlation (p-value < 0.05) was observed between paralog % sequence similarity and PPI overlap, likely the consequence of the conservation of interaction interfaces among proteins having high sequence similarity (Figure 3-7AB). Conversely, a negative correlation was found between the normalized difference of paralog protein abundances and PPI overlap (Figure 3-7C); this abundance difference was not significantly correlated with % sequence similarity. Previous studies have noted that divergence of expression patterns of duplicates in S. cerevisiae231,254 is an important factor contributing toward the specialization of paralogs for novel biological roles. In addition to these findings, the analysis presented in this chapter suggests that the tuning of paralog expression differences may also contribute to differences in paralog function through the alteration of paralog PPIs, possibly by reducing competition between paralogs for interaction partners. Next, I present selected examples which highlight the potential functional consequences of gene duplication as revealed through the investigation of paralog physical interactors in the CE-PPI network and the biological processes in which they operate.


Figure 3-7. Summary of CE-Paralog PPI Overlap Compared to Paralog % Sequence Similarity and Normalized Protein Abundance. A – Scatterplot of PPI overlap (Jaccard index) vs. pairwise % global sequence similarity for CE paralog pairs; coloured diamonds indicate selected paralogous groups described in text; blue crosses indicate paralogs enriched in interactors belonging to distinct COG categories. B – Scatterplot of PPI degree difference among CE paralog pairs vs. paralog pairwise global sequence similarity. Selected examples (yellow circles) highlight differing magnitudes of PPI degree difference corresponding to paralogs with differing degrees of sequence similarity, e.g.: MCP chemotaxis (~ 55% – 73% sequence similarity, > 70 PPIs) and RND multidrug efflux (~ 81% – 87% sequence similarity, > 20 PPIs) paralogous protein families. C – Scatterplot of paralog pair PPI overlap plotted against normalized fold change of protein abundances downloaded from the PaxDb database246 and calculated as the absolute difference of paralog protein abundances divided by the maximum abundance: |AbundanceParalog1 – AbundanceParalog2| / max{AbundanceParalog1, AbundanceParalog2}.
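The normalized abundance difference given in the caption of Figure 3-7C can be written as a short Python function. This is a minimal sketch: the function name and the example values are hypothetical, and returning 0.0 when both abundances are zero is an assumption made to avoid division by zero.

```python
def normalized_abundance_difference(abundance_1, abundance_2):
    """|a1 - a2| / max(a1, a2), ranging from 0 (equal) to 1 (one paralog absent)."""
    largest = max(abundance_1, abundance_2)
    if largest == 0:
        return 0.0  # assumption: two undetected paralogs are treated as equal
    return abs(abundance_1 - abundance_2) / largest


# Hypothetical abundances, for illustration only.
difference = normalized_abundance_difference(100.0, 25.0)  # 75 / 100 = 0.75
```

The measure is symmetric in its two arguments and independent of the abundance units, provided both paralogs are reported on the same scale.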


Figure 3-8. Examining Paralog PPI and Functional Divergence for Selected CE-Paralog PPI Subnetworks. Physical interactions integrate CE paralogs in distinct biological processes. MCP chemotaxis proteins (~ 55% – 73% sequence similarity); RND multidrug transport subunits (~ 81% - 87% sequence similarity); paralogous dehydrogenases complex subunits (~ 97% sequence similarity); paralogous capsule exopolysaccharide secretion complex subunits (~ 69% and 81% sequence similarity). Networks visualized using Cytoscape 3.5.1167.

Among the 91 CE paralogs identified, those involved in environmental sensing showed the greatest number of interactions. Among these are methyl-accepting chemotaxis proteins (MCPs) and resistance-nodulation-division (RND) efflux pumps (Figure 3-7B). MCPs are integral inner membrane proteins which play an essential role in the bacterial chemotaxis response pathway. They are comprised of a periplasmic ligand binding domain and a cytoplasmic signaling domain that dimerize and undergo methylation upon the binding of environmental attractants or repellents, thereby facilitating the binding and activation of downstream signaling effector molecules of the flagellar motor247. In the present analysis, 4 MCP paralogs were identified in the CE-PPI network (global sequence similarity: ~55-73%). Interestingly, Trg, which is involved in flagellum regulation in response to galactose and ribose chemo-attractants248, showed the greatest sequence divergence compared to other MCPs (~55%) and also showed the greatest number of interactors enriched in cell-envelope-associated processes compared to the paralogous MCPs Tap (dipeptide taxis), Tar (aspartate taxis), and Tsr (serine taxis), consistent with the known roles of MCPs in regulating chemotaxis in E. coli in response to differing nutrient availabilities (Figure 3-8A). Trg is also known to be a low-abundance member of MCP chemotactic arrays249. Furthermore, previous experimental work has demonstrated that the chemotaxis response in E. coli is primarily carried out by the signaling domains of Tsr and Tar250, suggesting that Trg may possess additional accessory functions that have not been previously investigated. In this study, the striking difference in the number of PPI observed between Trg and other MCPs provides tentative support for functional specialization, possibly under the nutrient limited stationary phase growth conditions in which AP-MS purifications were performed.

When examining unique subsets of interactors for other MCPs, Tar was identified as interacting with an extracellular subunit of the flagellum, FlgK. Clustering identified FlgK as a member of a putative complex involving another extracellular component of the flagellum, FliC, the IM NarX nitrate signal transduction histidine kinase, as well as cytosolic proteins involved in metabolism of nitrogen-containing molecules, such as CynS cyanate lyase and YaeW carnitine monooxygenase, recently identified as an enzyme involved in trimethylamine production in gut dietary metabolism251. Given previous results revealing physical associations between Trg and IM subunits of nitrate reductases (see 3.2.1.2), I examined the possible biological implications of these interactions with Tar. Topological and functional resemblances have been noted between MCPs and signal transduction proteins. Experiments involving the generation of a chimeric fusion protein comprised of the nitrate binding PE domain of NarX and the CY signaling domain of Tar have shown a resulting chemorepellent response to nitrate in E. coli, confirming a conserved mechanism of action between MCPs and signal transduction histidine kinases252. Furthermore, studies have shown that biofilm formation is regulated in Pseudomonas aeruginosa through the periplasmic sequestering of FliC via a complex formed by the NirS PE nitrite reductase and the chaperone DnaK (not detected in the CE-PPI network)253. These various lines of experimental evidence, along with the physical associations identified in the CE-PPI network, suggest the intriguing possibility that functional diversification of MCP paralogs may serve to dynamically bridge chemotaxis response, nitrate sensing, and flagellum assembly to regulate complex cellular responses.

IM subunits of RND efflux complexes play a critical role in microbial drug resistance through the binding and active export of antimicrobial molecules from the periplasm, which is achieved through physical interaction with a cognate periplasmic adaptor protein and the outer membrane channel TolC262. Several paralogs of IM subunits of RND efflux pumps were identified in the CE-PPI network, AcrB, AcrF, and MdtF (mid-to-high sequence similarity, ~81-87%), and showed differing overlap in their shared interactors, revealing integration with diverse biological processes in the cell envelope. Both AcrB and MdtF shared a number of interactions with a subset of Na+/H+ symporters and diverse charged-substrate transporters, implying a possible energy coupling mechanism to enable efficient regulation of membrane charge and transport, or to prevent the possible diffusion of antibiotics through membrane pores263,264 (Figure 3-8B). However, AcrB, MdtF and AcrF were identified as members of distinct CE clusters, warranting further investigation of the possibility of their specialization into different biological roles. AcrB was found to possess a number of unique interactions with proteins involved in cell-envelope biogenesis, including PE peptidoglycan modifying and binding enzymes involved in cell division (AmiC, FtsN), LPS polysaccharide transporter subunits (MlaAE, LptG) and ECA modification (WecH). Given that peptidoglycan biosynthesis pathways are well-known targets of antibiotics265,266 and LPS modification plays an important role in decreasing outer membrane antibiotic permeability267,268, these results suggest that physical interactions are an important means by which a variety of antibiotic resistance mechanisms may be coordinated to ensure bacterial survival. Further supporting the link between antibiotic efflux and cell envelope biogenesis pathways, the essential outer membrane biogenesis protein BamA was also found to interact with AcrB, which has been previously demonstrated to play an important role in mediating contact dependent growth inhibition269. MdtF interactors, on the other hand, were found to be distinctly enriched among a number of multi-drug efflux PE membrane fusion family (MSF) adaptor proteins. Interestingly, one of the interactors identified was AcrA, the cognate interactor of AcrB, which has been shown in a previous study to complement the export function of MdtF270 and may have important implications in affecting transporter efficiency or substrate range271.

Investigating paralogs with roles in energy production, I found that subunits of the DMSO reductase and putative selenite reductase complexes, DmsB and YnfG, which display high sequence similarity (~97%), possess a high overlap of interactions. Interactors comprised their respective co-complex members, DmsA and YnfF, with all four proteins assigned as members of the same CE cluster (Figure 3-8C). Consistent with the physical associations observed, previous research has demonstrated that YnfG can serve as a functional replacement for DmsB and support growth on DMSO272; however, YnfF was not able to complement the function of DmsA. Given the strong physical association identified between YnfF and DmsB, a likely interpretation is that differing patterns of subfunctionalization of DmsA and YnfF have altered binding sites, affecting substrate binding specificity between these closely related complexes.

In a final example, I examined two paralogous exopolysaccharide secretion complexes involved in Group 1 and Group 4 colanic acid capsule biosynthesis and transport273, consisting of an IM periplasmic polysaccharide polymerase, GfcE-Etk, and a cognate OM secretin-like pore, Wza-Wzc (~70-80% sequence similarity) (Figure 3-8D). The two cognate pairs of paralogs were found to be members of distinct CE-clusters and possessed interactions with proteins having related functions in exopolysaccharide transport. For example, GfcE-Etk were found to interact with Etp, GfcD, CpsB (known capsule transport and colanic acid biosynthesis proteins), YghQ (putative exopolysaccharide flippase) and NfrA (bacteriophage receptor); Wza-Wzc were found to interact with Gmd (colanic acid biosynthesis protein), LptD, WzzB (LPS transport and modification), WzyE (ECA polymerization), ElfD (fimbriae transport), and GspD (Type II secretion system secretin subunit). These distinct subsets of physical associations between paralogous capsule transport machineries not only support the notion that each may play a specialized role in capsule transport274,275, but also illustrate how sequence divergence of OM pore subunits may have served as an evolving scaffold for the emergence of distinct exopolysaccharide secretion processes.

3.3 Discussion and Conclusions

Here I present an analysis of a recently generated physical interaction network of the E. coli cell-envelope associated proteome (CE-PPI), comprising 14,376 interactions among 604 cell envelope associated proteins of diverse subcellular localization. Physical interaction networks generated through large-scale physical interaction screening approaches provide a valuable tool toward understanding the complex and dynamic associations of proteins that enable bacteria to survive and adapt to diverse environments. However, the resulting complexity of such datasets poses a significant interpretive challenge necessitating the development of novel bioinformatics approaches to infer biologically meaningful conclusions.

To achieve this goal, I devised an integrated approach to investigate the functional organization of the E. coli CE-PPI network and performed the following three analyses: 1) by examining the composition of predicted protein complexes (CE-clusters), I investigated whether predicted CE-clusters comprise proteins with related biological roles; 2) by integrating previously generated E. coli genetic interaction (GI) networks with CE-clusters, I examined whether CE-clusters may have independent biological roles, or whether they may function together to coordinate bacterial survival; and 3) I sought to understand how gene duplication has contributed to the evolution of biological processes in the CE-PPI network, in particular whether the divergence of PPIs is reflective of functional divergence among CE paralogs. In the resulting analyses I identified numerous examples of known protein complexes with novel interactors with related functional roles, epistatic integration of CE-clusters mediating distinct aspects of E. coli physiology, and paralogous proteins with varying degrees of shared interactors. In the CE-PPI network, the majority of paralogs identified were found to possess unique sets of interactors, highlighting instances which may indicate specialized biological roles. I provide examples where the divergence of paralog physical interactions among chemotaxis sensory proteins, antibiotic resistance exporter subunits, and biofilm transport machineries is likely to have important implications for environmental adaptation in E. coli.

The value of exploratory analyses of high-throughput interaction networks is demonstrated by their potential to yield novel biological insights into well-characterized complexes and pathways. Clustering of the CE-PPI network enabled the discovery of novel physical associations among well-studied bacterial membrane complexes, integrating proteins with roles in diverse biological processes, such as chemotaxis, cell division, energy homeostasis, and environmental sensing. For example, although the mechanism of chemotaxis regulation has been well-studied in E. coli276, the majority of the interactions identified in the CE-PPI network involving chemotaxis proteins are novel, and were found to occur particularly among MCP chemotaxis sensing proteins. In the analysis presented, several interactions were noted for the MCP Trg (ribose and galactose sensing) involving proteins with roles in metabolism, cell division, and anaerobic nitrogen sensing, suggesting a means of integrating diverse environmental cues to regulate cellular motility. In addition, clustering also identified a number of interactions between the NADH dehydrogenase complex involved in proton gradient generation and several peptide transporter complexes, OppBCDF and DppADF. Interestingly, Opp, which is involved in scavenging peptides from peptidoglycan recycling277, has also been shown to have a preference for the import of positively-charged peptides246, suggesting a crucial link between periplasmic pH homeostasis and the regulation of transport processes involved in cellular growth. These findings could be validated experimentally by investigating NADH production efficiency in opp or dpp deletion strains of E. coli grown under differing pH conditions.

In chapter 2 I presented an analysis of two previously generated E. coli GI networks and demonstrated their utility in understanding how functionally distinct gene modules are organized into biological pathways relevant for growth and survival. In the present chapter, I present novel findings from integrating GI datasets with the CE-PPI network to identify CE-clusters that participate within distinct biological pathways. GIs were found to possess low overlap with CE-PPI, and showed distinct patterns of enrichment among CE-clusters, identifying those with related biological roles. For instance, clusters mediating responses to environmental change (transport, energy production, multidrug resistance processes, cell envelope integrity) were seen to be enriched in aggravating GIs, indicating robustness in sensing a diverse array of environmental conditions. In contrast, CE-clusters corresponding to integrated processes involved in the maintenance of cellular growth (cell-envelope biogenesis and cell division) were found to be enriched in alleviating GIs. For example, differing patterns of GI enrichment were found between subunits of two iron transporter complexes, Fhu and Fep, implicating them in distinct aspects of E. coli physiology. The FhuBCD iron hydroxamate transporter was found to possess predominantly aggravating GIs, notably with components of cell envelope biogenesis machineries, namely the SecDF-YajC accessory complex of the Sec IM protein translocase and the BamB subunit of the Bam OM beta-barrel assembly complex, consistent with their roles in the biogenesis of proteins involved in a vast array of biological processes, with important implications in pathogenesis278–280. In contrast, the FepDG-EntAB enterobactin transporter was found to be enriched in alleviating GIs with CE-clusters involved in the biogenesis and export of colanic acid (Wza-Wzc), cellulose (BcsA-BcsB), fatty acids (AccA-AccC), lipopolysaccharide (LptC), as well as cell envelope integrity complexes (TolBRQ, MreCD), which are likely to reflect the integration of pathways important for regulating bacterial growth and survival in nutrient limiting environments281–283. Follow-up experiments could seek to validate these functional links by investigating, for instance, whether fep genes are necessary for biofilm production in E. coli under iron-limiting growth.

The final analysis focused on the functional implications of gene duplication and divergence in the context of both the CE-PPI network and GIs. Duplication has been emphasized as an important factor in the evolution of biological networks by enabling the rewiring of physical associations among paralogs284,285. Two scenarios have been proposed that may influence the functional divergence of paralogs as reflected by changes in their physical interaction patterns: each duplicate may retain different portions of the ancestral function through the differential loss of interactors (subfunctionalization), such as the differential interactions of RND antibiotic transporters with their periplasmic adaptor proteins, or may develop novel functions through the gain of novel interactors (neofunctionalization), as suggested by the paralogous colanic acid biosynthesis machineries highlighted by my study. Gene duplication has been the subject of much examination in the context of S. cerevisiae PPI and GI networks231,287,288, yet similar studies in E. coli are largely lacking. In this first systematic study into the functional divergence of paralogs in the CE-PPI network several interesting trends were observed. Although paralogs showed varying degrees of shared interactors, the majority were unique, a feature shared with duplicates in S. cerevisiae and C. elegans PPI networks289, further indicating an extensive degree of neofunctionalization among paralogs290. It was also found that CE-paralogs were statistically enriched in GIs overall compared to non-paralogs, and that these GIs tended to be alleviating. This finding suggests that gene duplications contribute to the adaptive potential of bacteria by performing functions which facilitate the integration or buffering of diverse biological pathways. To better understand how CE-paralogs have acquired novel biological roles I next examined functional enrichment among their interactors in the CE-PPI network. For example, of all the paralogs identified, those associated with environmental sensing, e.g. chemotaxis-associated MCPs, and RND multiple drug efflux pump subunits, showed the greatest number of PPIs overall, and were notably associated with different subsets of interactors, albeit with related biological functions. From further investigation, many novel interactors were found to have supporting experimental evidence of biological validity, suggesting a remarkable adaptability of paralogs to evolve novel functions by altering their physical associations. The diversification of interactors for MCP proteins, particularly involving nitrate reductases, is likely to be relevant for mediating adaptation to host environments where nutrient and oxygen availabilities are scarce250, while the distinct interactions observed between RND efflux pumps involving periplasmic adaptor proteins involved in drug export, as well as charged substrate transporters and cell envelope biogenesis pathways, may serve to effectively maintain membrane impermeability to diverse antimicrobial molecules. Future investigation of how physical associations change in response to different environmental conditions is likely to provide a greater understanding of the evolutionary potential of duplication in the organization of biological networks.

Several important caveats regarding AP-MS based large-scale interaction screens are likely to limit the conclusions of this study: the inability to distinguish directly interacting proteins, in contrast to two-hybrid based approaches; the disruption of transient interactions or introduction of non-biologically relevant interactions due to multiple detergent solubilization; and the possibility of false-positive interactions assigned to closely related paralogous proteins.

Yet, despite these limiting factors, the integration of multiple PPI probability scoring metrics and the use of a well curated set of gold standard PPI to facilitate benchmarking resulted in the recovery of many previously reported PPI (~80% of reference PPI derived from EcoCyc). The quality of recovered complexes is further supported by additional benchmarking I performed on CE-clusters, indicating that the interactions generated are able to recapitulate a significant number of previously known cell envelope complexes. Regarding the possible misidentification of paralogous protein interactions, peptides were assigned from MS/MS spectra having a protein identification probability score of 90% or higher3. The majority of paralogous proteins examined in this study were found to be below 90% sequence similarity (56/57 paralogous pairs), indicating that peptide misidentification is unlikely. In the case of the recently duplicated DMSO reductase and putative selenite reductase subunits (DmsB and YnfG: > 97% sequence identity), shared interactors were found to be supported by previous experimental work254, which lends support to their biological validity. Given that the majority of the high-confidence physical associations in the CE-PPI network are novel, the interpretation of their biological significance presents the next significant challenge. To mitigate the effect of spurious interactions on these analyses, the benchmarking of clusters against known protein complexes, permutation testing for enrichment of GIs between CE-clusters, and functional enrichment of paralog interactors based on COG functional categories were employed. Although high throughput AP-MS screens represent a valuable approach for charting the functional organization of membrane proteins, there still remains a gap in our knowledge regarding the detailed spatial and temporal organization of proteins of the bacterial cell envelope291.
However, several lines of evidence are beginning to emerge which suggest that interactions between lipids, proteins, and DNA are likely to be key factors in the partitioning of the cell envelope into functionally related “hyperstructures”292.

The work presented in this chapter reveals a key value of high-throughput interaction screening approaches not only as a means of elucidating novel relationships between members of known protein complexes, but as a starting point toward elucidating how biological processes are spatially localized and coordinated within the cell envelope.

Chapter 4 Systematic Prediction and Classification of Operon-Associated Bacterial Exopolysaccharide Secretion Machineries

Attributions

This study was initially conceived through discussions between myself, Dr. Lynne Howell (Senior Scientist, Hospital for Sick Children, Toronto Ontario Canada) and Dr. John Parkinson.

Curated seed protein sequences for alginate, acetylated-cellulose, cellulose, pel, and PNAG operons, and bacterial genome niche and lifestyle metadata curation, were provided by Greg Whitfield and Lindsey Marmont (graduate students in the lab of Dr. Lynne Howell, Hospital for Sick Children).

The ΔpelF Bacillus cereus ATCC 10987 strain used for validation of the predicted Gram-positive pel operon was generated by Greg Whitfield. Scanning electron microscope images of wildtype and ΔpelF B. cereus ATCC 10987 were provided courtesy of Dr. Lynne Howell (Hospital for Sick Children), Dr. Elyse Roach (University of Guelph) and Dr. Cezar Khursigara (University of Guelph).

A joint publication to Nature Biofilms is currently in preparation.

4 Systematic Study of the Evolution of Bacterial Synthase-Dependent Exopolysaccharide System (EPS) Machineries

In the previous chapters I employed large-scale genetic and physical interaction networks to explore the impact of gene duplication on the organization of biological pathways and complexes in E. coli. While these studies provide a view of evolutionary processes from a global perspective of the bacterial cell, in this chapter I focus on a well-defined system of protein complexes associated with bacterial biofilm formation to gain more detailed mechanistic insights into the role of duplication on function. The study of synthase-dependent exopolysaccharide (EPS) systems has gained attention in recent years through the discovery of the important roles they play in bacterial biofilm production, resulting in antibiotic resistance and pathogen persistence255. As these systems are prime candidates for the development of therapeutics, understanding their mechanism of function has become a priority. All synthase-dependent EPS systems studied to date (cellulose, acetylated-cellulose, PNAG, pel, and alginate) are encoded by genomically-neighbouring loci in bacterial genomes, defining an operonic organization of co-transcribed functionalities. Interestingly, general functional homologies have been noted among individual genetic loci of different EPS systems, which comprise inner membrane polysaccharide synthase and co-polymerase subunits, periplasmic polysaccharide modification enzymes, an outer membrane pore, along with variably conserved accessory proteins255. There is potential for sequence analysis approaches to provide further insights into how these functionalities have evolved specialized roles in the production of different kinds of EPS. Previous attempts have been made to survey EPS operons from available fully sequenced bacterial genomes256, but as yet a comprehensive systematic analysis of operon evolution is lacking.
For instance, it is not well understood how evolutionary factors influencing operon evolution, such as locus sequence divergence, gene rearrangements, and locus duplications or losses, might influence species-specific differences in biofilm production. Here I introduce a systematic approach for classifying EPS operons based on the underlying evolutionary relationships among EPS operon encoded loci identified across phylogenetically diverse bacterial species. I employ an intuitive graphical visualization approach through the construction of genomic-proximity networks to investigate the complex evolutionary relationships among identified EPS operons and address the role gene duplication has played in their evolution across diverse bacterial phyla.

4.1 Materials and Methods

4.1.1 Data Sources

Selection of EPS operons and sources of operon locus reference protein sequences

NCBI Reference Sequence database identifiers (Refseq IDs) used to retrieve protein coding sequences corresponding to experimentally characterized EPS operon loci involved in the production of cellulose, acetylated-cellulose, alginate, pel, and PNAG were provided by collaborators Greg Whitfield and Lindsey Marmont (Graduate Students of the lab of Dr. Lynne Howell, Hospital for Sick Children, Toronto Ontario, Canada) (Supplemental File 8 & Table 4-1).

Cellulose
Locus    Function                   PFAM Domains Predicted
BcsA     Polysaccharide Synthase    Glyco_tranf_2_3 + PilZ
BcsB     Co-Polymerase              BcsB
BcsZ                                Glyco_hydro_8
BcsC     Outer Membrane Pore        BCSC_C + TPR_19

Acetylated-Cellulose
Locus    Function                   PFAM Domains Predicted
WssB     Polysaccharide Synthase    Glyco_tranf_2_3 + PilZ
WssC     Co-Polymerase              BcsB
WssD     Hydrolase                  Glyco_hydro_8
WssE     Outer Membrane Pore        BCSC_C + TPR_19
WssF     Modification               None
WssG     Acetylation                AlgF
WssH     Acetylation                MBOAT
WssI     Acetylation                ALGX
WssJ     Localization               CBP_BcsQ

PNAG
Locus    Function                   PFAM Domains Predicted
PgaA     Outer Membrane Pore        TPR_19
PgaB     Hydrolase + Deacetylase    GHL13 + Polysacc_deac_1
PgaC     Polysaccharide Synthase    Glyco_tranf_2_3
PgaD     Co-Polymerase              None

Pel
Locus    Function                   PFAM Domains Predicted
PelA     Hydrolase                  Glyco_hydro_114
PelB     Outer Membrane Pore        TPR_15
PelC     OM Lipoprotein             None
PelD     C-di-GMP Binding           PelD_GGDEF
PelE     Co-Polymerase              None
PelF     Polysaccharide Synthase    Glyco_trans_1_4
PelG     Co-Polymerase              PelG

Alginate
Locus    Function                   PFAM Domains Predicted
AlgD     Precursor Biosynthesis     UDPG_MGDP_dh
Alg8     Polysaccharide Synthase    Glyco_tranf_2_3
Alg44    Co-Polymerase              PilZ + HlyD_3
AlgK     AlgE Associated            None
AlgE     Outer Membrane Pore        Alginate_exp
AlgG     Epimerase                  NosD
AlgX     Acetylation                ALGX + CBM
AlgL     Lyase                      Alginate_lyase
AlgI     Acetylation                MBOAT
AlgJ     Acetylation                ALGX
AlgF     Acetylation                AlgF
AlgA     Precursor Biosynthesis     MannoseP_isomer

Legend: Inner Membrane Polysaccharide Biogenesis and Transport; Periplasmic Modification; Outer Membrane Transport; Additional Function/Unknown

Table 4-1. EPS Systems Surveyed and Functional Description of Loci.

Selection of Fully Sequenced Bacterial Genomes Utilized for EPS Operon Prediction

Both protein coding sequence (.faa) and genbank genome annotation (.gbk) files were downloaded from the NCBI ftp database for 1861 completely sequenced reference and representative bacterial genomes (retrieved April 20th 2015: Supplemental File 9).

Bacterial Genome Metadata Annotation (Niche and Lifestyle Classification Extracted from Available Genome Sequencing Publications)

For each bacterial strain predicted to possess an EPS operon, metadata corresponding to niche (host-associated or environmental) and lifestyle (pathogenic or non-pathogenic) annotations were extracted from available literature, either through associated genomic sequence publications and/or additional supporting literature by querying species against the NCBI PubMed database (Supplemental File 10).

4.1.2 Generation of EPS operon protein family hidden-Markov models (HMMs)

To identify putative EPS operons I applied an HMM-based sequence similarity profiling strategy. To ensure that the HMMs capture sufficient variation of EPS loci to enable detection of EPS operons in phylogenetically divergent bacteria, I first defined initial seed sets of candidate EPS protein sequences. These were selected from a preliminary round of HMM-based searches of reference EPS sequences against a set of 1861 bacterial genome sequences. From the resulting matches, I selected the 20 highest scoring sequences ranked according to ascending e-values, with the additional proviso that they possessed between 90% and 97% sequence similarity to eliminate redundancy from closely related bacterial species. For each EPS gene family, these sequences represent an initial set of “seed sequences” that were subsequently used to generate EPS locus-specific HMMs, which model the possible sequence of amino-acid states representative of the evolution of a given EPS protein family. For each set of seed sequences corresponding to a given EPS operon locus, i.e. a given protein family with an established functional role in EPS production, a multiple sequence alignment (MSA) was generated using MUSCLE257 (v.3.8.1551, with default settings), from which HMMs were built using the “hmmbuild” program from the HMMER package258 (v. 3.1b2, with default settings).

4.1.3 Prediction of putative EPS operon loci

Sets of EPS loci hits used for genomic-context reconstruction of EPS operons were predicted by performing sequence homology searches using the “hmmsearch” program from the HMMER package258 (v. 3.1b2, with default settings). Hits were retrieved as significant matches (e-value <= 1e-5) of the generated EPS locus HMMs against all protein coding sequences identified from the 1861 completely sequenced reference bacterial genomes (Figure 4-1 – Step 1.2). For subsequent genomic-context reconstruction of bacterial EPS operons, RefSeq identifiers for all EPS hits associated with a given bacterial genome were used to retrieve locus start and stop positions from the corresponding genbank file, which provides a list of all predicted coding sequences and relevant annotation metadata associated with a given species genomic sequence.
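The hit-filtering step described above can be sketched as follows. This is a minimal illustration in Python, not the scripts used in this work: it assumes the search results were saved in HMMER's tabular per-sequence format (as written by the `--tblout` option), and the function name `filter_hits` and the example rows are hypothetical.

```python
def filter_hits(tblout_lines, max_evalue=1e-5):
    """Keep tabular hmmsearch rows whose full-sequence e-value is <= max_evalue.

    Returns (target name, query name, e-value) tuples sorted by ascending e-value.
    """
    hits = []
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HMMER comment/header lines
        fields = line.split()
        # Per-sequence table columns: 0 target name, 1 target accession,
        # 2 query name, 3 query accession, 4 full-sequence E-value, ...
        evalue = float(fields[4])
        if evalue <= max_evalue:
            hits.append((fields[0], fields[2], evalue))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical rows; identifiers and scores are made up for illustration.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "prot_A  -  BcsA  -  1.2e-50  170.3  0.1",
    "prot_B  -  BcsA  -  3.0e-04   18.9  0.0",
    "prot_C  -  BcsA  -  1.0e-05   25.4  0.2",
]
print(filter_hits(example))
```

The retained target identifiers would then be looked up in the corresponding genbank file to obtain start and stop positions.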

Figure 4-1. Computational Prediction and Reconstruction of Bacterial EPS Operons, Phylogenetic Clustering and Identification of Operon Clades. A schematic overview of a general approach for the systematic prediction and classification of operon sequences, applied to the study of bacterial EPS machineries. From HMM searches for a curated set of core EPS operon protein families (Step 1), protein sequences corresponding to putative operon loci are identified on the basis of genomic-proximity (Step 2), and utilized for ML phylogenetic tree reconstruction to define phylogenetic clusters based on the selection of an optimal evolutionary distance threshold (Step 3). Clustered EPS operon loci protein sequences are applied for the analysis of EPS operon evolution through the generation of genomic-proximity graphs and definition of operon clades (Step 4).

4.1.4 A Genomic-Context Based Approach for EPS Operon Prediction

Genomic-proximity of bacterial genes has been employed as a reliable predictor of operon membership in bacterial genomes259,260, which is often indicative of genes with biologically significant relationships, e.g. members of protein complexes or biological pathways8. Therefore employing this operon prediction approach provides a powerful means of inferring functional relationships for putative hits likely to be associated with EPS production, enabling a systematic study of operon evolution across diverse bacterial phyla.

The basic approach for the reconstruction of EPS operons first requires using retrieved locus start and stop positions to order all EPS loci hits according to their corresponding positions in a given bacterial genome. Next, the sequences of all genomically proximal EPS hits are concatenated and designated as a putative EPS operon. These operon reconstruction steps were implemented through custom Perl scripts which take as input a list of the genomic start and stop positions of predicted EPS loci hits identified in a given bacterial genome. Based on the calculated inter-genic distances of predicted EPS loci, defined as the number of bases separating successive loci, putative operons were defined as follows:
o Predicted loci must occur within a range of twice the size of a reference EPS operon, and inter-genic distances of individual loci must be <= 5 Kbp.
o Putative operons must possess at least two distinct loci, one of which encodes a putative polysaccharide synthase subunit.
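The grouping rules above can be sketched as follows. This is an illustrative Python re-implementation under simplifying assumptions, not the Perl scripts used in this work: the function name, the tuple layout and the example coordinates are hypothetical, strand is ignored, and the "twice the size of a reference operon" range check is omitted for brevity.

```python
def reconstruct_operons(loci, synthases, max_gap=5000):
    """Group EPS loci hits from one genome into putative operons.

    loci: list of (locus family, start, stop) tuples.
    synthases: set of family names that encode a polysaccharide synthase.
    max_gap: largest allowed inter-genic distance (bases) between successive loci.
    """
    ordered = sorted(loci, key=lambda locus: locus[1])  # order by start position
    groups, current = [], []
    for locus in ordered:
        # Start a new group when the gap to the previous locus exceeds max_gap.
        if current and locus[1] - current[-1][2] > max_gap:
            groups.append(current)
            current = []
        current.append(locus)
    if current:
        groups.append(current)

    operons = []
    for group in groups:
        families = {name for name, _, _ in group}
        # Require at least two distinct loci, one of them a synthase.
        if len(families) >= 2 and families & synthases:
            operons.append(group)
    return operons


# Hypothetical coordinates for illustration only.
hits = [("PgaB", 9000, 11000), ("PgaC", 1000, 2300),
        ("PgaD", 2350, 2800), ("PgaA", 11100, 13500)]
print(reconstruct_operons(hits, synthases={"PgaC"}))
```

In this toy input the PgaC and PgaD hits form a putative operon, while the PgaB and PgaA hits are grouped together but discarded because no synthase lies within 5 Kbp of them.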

Putative EPS operons derived from this first pass prediction step were subject to a second pass reconstruction if they were found with absent/undetected loci. New HMM models were iteratively constructed using first-pass predicted EPS operon protein sequences and an additional round of HMM searching was performed. From this additional set of putative EPS loci hits, the operon reconstruction procedure was performed according to the same steps listed above.

Based on this model, a database of reconstructed operons for each EPS system was generated (Figure 4-1 – Step 2), indicating for each bacterial genome the relative ordering within the chromosome of gene loci identified by HMM-searches, along with their start and stop positions and predicted EPS locus identity. This information was utilized in subsequent steps for phylogenetic clustering and construction of EPS operon genomic-proximity networks.

4.1.5 Classification of EPS Loci and Definition of Operon Clades Using a Novel Protein Sequence Evolutionary Distance Clustering Approach

An overview of the process applied to systematically classify EPS operon loci is provided in Figure 4-1. First, each set of protein sequences corresponding to a given EPS locus was clustered using CD-HIT261 with default settings (version 4.6.3; global sequence identity threshold 0.9; word length 5). This is equivalent to the selection of a single sequence representative for a given species genome and reduces sequence redundancy resulting from overrepresentation of closely related strains (note that this process may still result in the inclusion of divergent paralogs from the same species genome). Initial multiple sequence alignments (MSAs) were then generated using MUSCLE and trimmed using trimAl (version 1.2rev59 using -automated1 setting)262. The resulting alignment was then used for phylogenetic tree construction using the PhyML package263, with default parameters (version 3.0 using LG substitution model) and 1000 bootstrap replicates.

4.1.5.1 Deriving an Overall Cluster Quality Scoring Metric to Define Evolutionarily Distinct EPS Loci Sequence Clusters

To define an optimal set of clusters for a set of predicted EPS operon loci sequences, I began by extracting pairwise evolutionary distances (proportion of amino acid substitutions per site) for representative sequences associated with each EPS locus phylogenetic tree previously generated. For each EPS locus, I next defined different sets of sequence clusters by traversing through the phylogenetic tree, beginning from tree tips (no clustering), to increasingly deeper branches resulting in clusters comprising sequences of increasing evolutionary divergence (Figure 4-1 – Step 3). Sequences were assigned members of the same phylogenetic cluster if their evolutionary distances were less than or equal to the distance cutoff chosen. For each set of clusters, three metrics were calculated and used to define an optimal clustering that maximizes the proportion of sequences clustered and their within and between cluster evolutionary distances. The metrics employed were:
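As a rough illustration of how a distance cutoff partitions sequences, the sketch below groups sequences whose pairwise evolutionary distance is at or below the cutoff and reports the proportion clustered. It is a simplification: it applies single-linkage grouping to a pairwise distance table rather than traversing the phylogenetic tree as described above, and the function names, sequence labels and distances are hypothetical.

```python
def cluster_by_cutoff(names, dist, cutoff):
    """Single-linkage grouping: join sequences with pairwise distance <= cutoff.

    dist maps (name_a, name_b) tuples to evolutionary distances.
    Returns a sorted list of sorted clusters (singletons included).
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), d in dist.items():
        if d <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def proportion_clustered(clusters):
    """Fraction of sequences that fall in a cluster with at least two members."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) for c in clusters if len(c) > 1) / total


# Hypothetical distances (substitutions per site) for four sequences.
names = ["s1", "s2", "s3", "s4"]
dist = {("s1", "s2"): 0.1, ("s3", "s4"): 0.3, ("s2", "s3"): 0.8,
        ("s1", "s3"): 0.9, ("s2", "s4"): 1.4, ("s1", "s4"): 1.5}
for cutoff in (0.0, 0.2, 0.5):
    clusters = cluster_by_cutoff(names, dist, cutoff)
    print(cutoff, clusters, proportion_clustered(clusters))
```

Raising the cutoff corresponds to moving from the tree tips toward deeper branches: more sequences become clustered, at the cost of greater within-cluster divergence.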

• Proportion of Sequences Clustered
  o The number of sequences found in a cluster / the total number of sequences represented in the phylogenetic tree.
• Average Silhouette Width (wellness of clustering of individual sequences)
  o For each sequence (i), I defined its average distance to all other sequences assigned to its cluster (ai), as well as its lowest average distance to any other cluster (bi), i.e. its nearest neighbouring cluster. The silhouette width for sequence i is then given by: (bi - ai) / max{ai, bi}. Thus, the best clustering will result in a maximum silhouette width, when sequences within clusters are closely related to one another, i.e. when ai is closest to 0.
  o A limitation of the Silhouette metric is its strong bias for highly related sequences, which is likely to occur when sequences are unclustered (ai = 0). However, the monotonically decreasing nature of the score can provide a useful means of selecting against clusters with highly divergent sequences.
• Dunn Index (overall separation of clusters)
  o The Dunn index is calculated by identifying the pair of sequence clusters with the minimum inter-cluster distance (min-inter), and dividing this value by the value for the cluster with the largest intra-cluster distance (max-intra), giving: Dunn Index = min-inter / max-intra. The best overall clustering is identified as that which gives the largest value of the Dunn Index.
  o An important limitation of this metric is that it may be influenced by disparities in the distribution of distances, particularly if there are a few highly divergent sequences (a non-informative clustering that would obscure the evolutionary relationships among related operon lineages). Yet the metric may still be a useful indicator of transitions in the clustering landscape that correspond to the phylogenetic separation between distinct bacterial taxonomic divisions, provided these are adequately represented to limit potential biases which may result from species with elevated sequence divergence due to genetic drift in limited population sizes or adaptations to extreme environments.
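The silhouette and Dunn calculations above can be sketched as follows. This is a minimal Python illustration using a hypothetical pairwise distance table, not the code used in this work; as an assumption made to match the remark above, unclustered (singleton) sequences are given ai = 0, whereas other implementations commonly assign them a silhouette width of 0.

```python
from itertools import combinations
from statistics import mean


def _d(dist, a, b):
    """Look up the distance between two sequences in a frozenset-keyed table."""
    return dist[frozenset((a, b))]


def average_silhouette(clusters, dist):
    """Mean of (bi - ai) / max(ai, bi) over all sequences; needs >= 2 clusters."""
    widths = []
    for idx, members in enumerate(clusters):
        for seq in members:
            own = [m for m in members if m != seq]
            # ai: average distance to the other members of the same cluster.
            a = mean(_d(dist, seq, m) for m in own) if own else 0.0
            # bi: lowest average distance to the members of any other cluster.
            b = min(mean(_d(dist, seq, m) for m in other)
                    for j, other in enumerate(clusters) if j != idx)
            denom = max(a, b)
            widths.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(widths)


def dunn_index(clusters, dist):
    """min inter-cluster distance / max intra-cluster distance.

    Requires at least two clusters and one cluster with two or more members.
    """
    min_inter = min(_d(dist, a, b)
                    for c1, c2 in combinations(clusters, 2)
                    for a in c1 for b in c2)
    max_intra = max(_d(dist, a, b)
                    for c in clusters for a, b in combinations(c, 2))
    return min_inter / max_intra


# Hypothetical distances (substitutions per site) for four sequences.
pairs = {("s1", "s2"): 0.1, ("s3", "s4"): 0.3, ("s2", "s3"): 0.8,
         ("s1", "s3"): 0.9, ("s2", "s4"): 1.4, ("s1", "s4"): 1.5}
dist = {frozenset(k): v for k, v in pairs.items()}
clusters = [["s1", "s2"], ["s3", "s4"]]
print(average_silhouette(clusters, dist), dunn_index(clusters, dist))
```

For this toy partition the closest pair of sequences from different clusters is 0.8 apart and the widest cluster spans 0.3, so the Dunn index is 0.8 / 0.3.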

In consideration of the features weighed above and the reciprocal nature of the Silhouette and Dunn indices, three overall clustering quality scoring functions (Q) were compared. The ideal scoring function was found to yield the greatest proportion of sequences assigned into evolutionarily distinct clusters:

Q = Proportion Sequences Clustered + Average Silhouette + Average Dunn
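As an illustration only, the three terms of the scoring function above can be computed from a table of pairwise evolutionary distances and a cluster assignment. The sketch below is not the implementation used in this work: the function names, the data layout (a dictionary keyed by unordered sequence pairs) and the handling of degenerate cases (singleton clusters, a single cluster) are assumptions. Because the text does not specify what the Dunn term is averaged over, a single Dunn value per clustering is used here.

```python
from itertools import combinations


def _clusters(labels):
    """Group sequence ids by cluster id, skipping unclustered (None) sequences."""
    groups = {}
    for seq, lab in labels.items():
        if lab is not None:
            groups.setdefault(lab, []).append(seq)
    return groups


def _d(dist, s, t):
    """Look up the distance between two sequences (order-independent key)."""
    return dist[frozenset((s, t))]


def silhouette_widths(dist, labels):
    """Silhouette width (b_i - a_i) / max(a_i, b_i) for every clustered sequence."""
    groups = _clusters(labels)
    widths = {}
    for lab, members in groups.items():
        others = [m for other, m in groups.items() if other != lab]
        for seq in members:
            own = [s for s in members if s != seq]
            if not own or not others:
                widths[seq] = 0.0  # assumed convention for degenerate cases
                continue
            a = sum(_d(dist, seq, s) for s in own) / len(own)
            b = min(sum(_d(dist, seq, s) for s in m) / len(m) for m in others)
            widths[seq] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return widths


def dunn_index(dist, labels):
    """Minimum inter-cluster distance divided by maximum intra-cluster distance."""
    groups = list(_clusters(labels).values())
    intra = [_d(dist, s, t) for m in groups for s, t in combinations(m, 2)]
    inter = [_d(dist, s, t)
             for m1, m2 in combinations(groups, 2) for s in m1 for t in m2]
    if not inter or not intra or max(intra) == 0:
        return 0.0  # assumed convention when the ratio is undefined
    return min(inter) / max(intra)


def quality_q(dist, labels):
    """Q = proportion of sequences clustered + average silhouette + Dunn index."""
    widths = silhouette_widths(dist, labels)
    p_clust = len(widths) / len(labels)
    avg_sil = sum(widths.values()) / len(widths) if widths else 0.0
    return p_clust + avg_sil + dunn_index(dist, labels)
```

Note that the Dunn index is unbounded above, whereas the other two terms cannot exceed 1, so in an additive score of this form a well-separated clustering can be dominated by the Dunn term.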

4.1.6 Construction of EPS Operon Genomic-Proximity Networks

To capture evolutionary and genomic organization relationships of predicted EPS operons, genomic proximity networks were generated (Figure 4-1 – Step 4), where each node represents a given EPS locus phylogenetic sequence cluster, and edges connecting pairs of nodes represent the average genomic distances between loci represented by each node. An advantage of this approach is that it provides a visually intuitive means of identifying “operon clades” comprised of distinct subsets of genomically co-occurring sequence clusters, and representing a wide array of operon evolutionary events contributing to the divergence of operon clades across potentially hundreds of bacterial genomes.

A genomic-proximity network is generated as follows: from phylogenetic clustering of EPS loci, the genomic co-occurrence of EPS locus phylogenetic clusters is determined by reference to the reconstructed operon database. In addition, for each pair of genomically co-occurring phylogenetic clusters, the inter-genic distances of their corresponding sequences are calculated and averaged over the number of genomes in which they occur. The resulting genomic-proximity network is then visualized using Cytoscape (version 3.5)167. Additionally, the phylogenetic distribution (phyla-level) based on NCBI taxonomic assignments of species represented in each EPS locus sequence cluster, represented as pie charts, was calculated and imported into Cytoscape for visualization of network nodes.
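The edge construction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the pipeline used here: the input layout (one tuple per locus giving genome, phylogenetic cluster, start and end coordinates), the cluster names in the usage example, and the choice of the smallest gap when a cluster pair occurs more than once in a genome are all hypothetical. Loci are assumed to lie on the same replicon. The colour bins follow the edge legend used for the network figures in this chapter (<= 0.1 Kb, between 0.1 and 5 Kb, >= 5 Kb).

```python
from collections import defaultdict
from itertools import combinations


def intergenic_gap(span_a, span_b):
    """Distance in bp between two loci given as (start, end); 0 if they overlap."""
    (_, end_first), (start_second, _) = sorted([span_a, span_b])
    return max(0, start_second - end_first)


def build_proximity_network(loci):
    """Return {(cluster_1, cluster_2): mean intergenic distance over genomes}.

    loci: iterable of (genome, cluster_id, start, end) tuples.
    """
    by_genome = defaultdict(list)
    for genome, cluster, start, end in loci:
        by_genome[genome].append((cluster, (start, end)))

    # For each cluster pair, keep one gap per genome in which both co-occur.
    gaps = defaultdict(dict)
    for genome, entries in by_genome.items():
        for (c1, span1), (c2, span2) in combinations(entries, 2):
            if c1 == c2:
                continue
            pair = tuple(sorted((c1, c2)))
            gap = intergenic_gap(span1, span2)
            previous = gaps[pair].get(genome)
            # Assumption: use the closest pair of loci within a genome.
            gaps[pair][genome] = gap if previous is None else min(previous, gap)

    return {pair: sum(g.values()) / len(g) for pair, g in gaps.items()}


def edge_colour(mean_gap_bp):
    """Bin a mean intergenic distance into the three legend categories."""
    if mean_gap_bp <= 100:
        return "red"
    if mean_gap_bp < 5000:
        return "blue"
    return "grey"


if __name__ == "__main__":
    example = [
        ("genome_1", "BcsA_1", 100, 2000),
        ("genome_1", "BcsB_1", 2050, 4000),
        ("genome_2", "BcsA_1", 500, 2400),
        ("genome_2", "BcsB_1", 2550, 4500),
    ]
    for pair, mean_gap in build_proximity_network(example).items():
        print(pair, mean_gap, edge_colour(mean_gap))
```

The resulting edge table (cluster pair, mean distance, colour bin) is the kind of tabular input that can then be imported into a network viewer such as Cytoscape.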


4.2 Results

4.2.1 A Comprehensive Survey of Bacterial EPS Operons Reveals Functional EPS Systems Across Bacteria of Diverse Lifestyles and Environmental Niches

Biofilms produced by synthase-dependent EPS systems have been shown to be crucial for survival in various environmental niches and also serve as important pathogenic adaptations255. However, to date a comprehensive survey of known EPS producing bacteria has been lacking. Furthermore, such a dataset can enable the investigation of how gene duplication has contributed to the functional adaptation of a well-defined biological system and its potential implications as a mechanism for the regulation of biofilm production and its impact on bacterial survival. To address these challenges, I have devised a systematic bioinformatics-based approach and a novel phylogenetic distance-based sequence clustering method for the comprehensive survey and classification of all known bacterial EPS systems to date. In the following sections I demonstrate the utility of these datasets in revealing how evolutionary processes have contributed to the emergence of distinct EPS operon clades across diverse bacterial phyla, as well as in evaluating the impact of gene duplication on EPS biofilm production.

Five known EPS operons responsible for cellulose, acetylated-cellulose, PNAG, pel, and alginate (Table 4-1) production were identified through large-scale HMM-searches and genomic-proximity based reconstruction across a total of 1861 complete reference and representative bacterial genomes (as defined by the NCBI genomes database). A total of 407 cellulose, 321 PNAG, 146 pel, 64 alginate, and 4 acetylated-cellulose EPS operons were identified comprising at least 2 “core” operon loci (Figure 4-2A, Supplemental File 11). Core EPS operon loci were defined as encoding a polysaccharide synthase subunit as well as one additional locus involved in EPS modification or transport as identified by a previous review255. Of the total number of unique bacterial species found to possess an EPS operon, cellulose was found among 367 species (~ 1.11 operons per species), PNAG was found in 288 species (~ 1.11 operons per species), pel was found in 140 species (~ 1.04 operons per species), alginate was found in 60 species (~ 1.06 operons per species), and acetylated-cellulose was found in 4 species (1 operon per species).


Figure 4-2. Summary of Predicted Bacterial EPS Operons by Genomic-Proximity Reconstruction. A - Number of bacterial genomes predicted to possess acetylated-cellulose, cellulose, PNAG, pel, and alginate operons are summarized by bacterial lifestyle (pathogen, non-pathogen, unknown) and corresponding niche (host-associated, environmental/other, unknown). Highlighted fields with asterisks indicate statistical enrichment of EPS operons found among pathogens or non-pathogens, and host-associated or environmental niches (one-tailed T-test, p <= 0.05 with Bonferroni correction). B – Krona visualizations264 of relative taxonomic distributions of bacteria with a predicted EPS operon.

PNAG was found to be enriched in a greater proportion of pathogen genomes (161/289 ~ 56%; T-test p-value ~ 0.002), while conversely pel (84/140 = 60%; T-test p-value < 9.5e-10), alginate (39/60 = 65%; T-test p-value < 8.8e-4), and cellulose (187/367 ~ 51%; T-test p-value 3.6e-5) were enriched in a greater proportion of non-pathogen genomes (Figure 4-2A). The few acetylated-cellulose operons predicted were equally represented in pathogens and non-pathogens. Furthermore, cellulose and PNAG operons were found to be statistically enriched in bacterial species annotated to host-associated niches (T-test p-values 2.22e-16 and ~0, respectively).

Interestingly, an appreciable number of species were found to possess multiple EPS operons (Figure 4-3), in some instances distinguishing species with differing environmental niches. Alginate and pel, a well-known example of co-occurring EPS operons, were found to be statistically enriched (T-test p-value ~ 1e-3) among pathogenic human-host associated bacteria, which is largely due to the over-representation of P. aeruginosa strains in the genomes surveyed in this study. In contrast, other alginate operon combinations, e.g. with pel and/or PNAG, were identified only in other non-pathogenic Pseudomonas spp. predominantly in host-associated plant and rhizosphere niches, exclusively representing P. fluorescens, P. brassicacearum, and P. poae strains. Interestingly, the majority of acetylated-cellulose operons (3/4) were identified in Pseudomonas spp. possessing an additional alginate operon, suggesting the possibility of the derivation of the acetylated-cellulose acetylation machinery from the alginate operons (see section 2.6.6 for additional analysis). Species possessing only alginate operons were found to reside in diverse niches, including diverse aqueous and soil environments.


Figure 4-3. Lifestyle and Niche Distribution of Predicted EPS Operons. Numbers of bacteria with multiple or single predicted EPS operons are represented by their distribution (% bacterial genomes) across lifestyle and niche annotations. Asterisks indicate statistically significant enrichment of single or multiple EPS operon combinations among pathogenic (red asterisks) or non-pathogenic bacteria (green asterisks) (one-sided T-test p <= 0.05 with Bonferroni correction).

Although individual cellulose operons are found equally distributed in human and non-human hosts (30% vs. 34%), the combination of cellulose and pel operons was found to be enriched in non-pathogenic bacteria (70%; p-value 0.02), many found in plant-associated niches and aquatic environments, e.g. Burkholderia, Pelagibacterium, Rhodobacter, Pandoraea, while cellulose and PNAG combinations were enriched in pathogens (66%; p-value 4.4e-06), largely represented by closely related strains from the genera Yersinia, Escherichia and Klebsiella.


4.2.2 Evolution of EPS Operons is Driven by Gene Duplication, Loss and Rearrangements

The processes underlying EPS operon evolution across diverse bacterial phyla are poorly understood. I hypothesize that adaptation to novel lifestyles or niches is driven, at least in part, by changes in EPS operon composition. These changes consist of single locus or whole operon duplications, corresponding to dosage effects altering the level of export or modification of EPS; locus losses, which may indicate a reduced level of export or modification, or suggest supplementation of the lost function in one species with a novel gene; operon rearrangements, which may also affect the regulation of EPS production through the order of expression of individual EPS system components; and gene-fusions resulting in enhanced co-expression of interacting subunits enhancing EPS production.

Table 4-2. Summary of EPS Operon Evolutionary Events. For all predicted EPS operons the following evolutionary events were tallied: locus losses, representing the total number of core EPS operon loci not detected by HMM searches; locus duplications, representing the total number of operon associated sequences identified as a significant hit to the same HMM; locus fusions, representing loci which were found as significant hits in multiple HMM searches; operon rearrangements, if the resulting locus ordering of a predicted operon was altered relative to a pre-defined reference; and large-scale operon duplications, which were defined as species possessing more than one predicted operon occurring > 10kb apart.

For each set of predicted EPS operons, the resulting number of operon evolutionary events, e.g. duplications, losses, rearrangements, and fusions, were assessed relative to the locus composition and ordering of reference Gram-negative experimentally characterized operons (Figure 4-4). Locus losses (~75% of operons lacking one or more reference loci) were found to be the most frequent evolutionary event occurring across operons overall, with the exception of acetylated-cellulose, and occurred with the greatest frequency among pel operons with an average of 2.9 loci lost per operon (reference pel operon size - 7 loci). Locus losses were largely found to occur for outer-membrane pore encoding loci among Gram-positive species, which is consistent with the lack of an outer-membrane bilayer in Gram-positive membrane architectures. Operon rearrangements were the next most frequent evolutionary events (~10%), most often occurring in ~25% of cellulose operons. Duplications of loci and whole operons were the least frequent of operon evolutionary events, with locus duplication outnumbering operon duplications. However, a notable exception to this trend was observed for cellulose operons, which show a greater frequency of whole operon duplication events compared to single locus duplications (30 vs 18). Interestingly, all duplicated operons were found to be separated by a genomic distance > 10 kb, and have likely not arisen through tandem duplication events. This suggests the possibility that whole-operon duplications may have arisen through horizontal transfer events147,156. This hypothesis is supported through further analysis which revealed that duplicate operons have distinct evolutionary histories (see section 4.2.4.2).
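The event definitions given in the caption of Table 4-2 can be expressed as a short tallying routine. The following is a sketch under explicit assumptions, not the code used to produce the reported counts: each predicted operon is represented as a list of loci in transcription order, each locus as the set of HMM names it significantly matches, and the function names and the 10 kb default are illustrative. Rearrangement is judged only from the order in which reference loci first appear.

```python
from itertools import combinations


def tally_events(operon, reference):
    """Tally evolutionary events for one predicted operon.

    operon:    list of loci in transcription order; each locus is a set of
               HMM names with a significant hit (more than one = fusion).
    reference: list of HMM names in reference operon order.
    """
    rank = {name: i for i, name in enumerate(reference)}

    # Flatten hits, ordering names inside a fused locus by reference rank so
    # that a fusion alone is not mistaken for a rearrangement.
    hits = []
    for locus in operon:
        hits.extend(sorted((h for h in locus if h in rank), key=rank.get))

    losses = [name for name in reference if name not in hits]
    duplications = sum(hits.count(name) - 1 for name in set(hits))
    fusions = sum(1 for locus in operon if len(set(locus) & set(reference)) > 1)

    first_seen = list(dict.fromkeys(hits))  # order of first appearance
    rearranged = first_seen != sorted(first_seen, key=rank.get)

    return {
        "losses": losses,
        "duplications": duplications,
        "fusions": fusions,
        "rearranged": rearranged,
    }


def has_distant_operons(spans, min_gap_bp=10_000):
    """True if any two operons, given as (start, end), lie > min_gap_bp apart."""
    for first, second in combinations(sorted(spans), 2):
        if second[0] - first[1] > min_gap_bp:
            return True
    return False


if __name__ == "__main__":
    reference = ["BcsA", "BcsB", "BcsZ", "BcsC"]
    print(tally_events([{"BcsA"}, {"BcsB"}, {"BcsC"}, {"BcsZ"}], reference))
    print(has_distant_operons([(0, 5000), (20000, 26000)]))
```

A locus ordering of BcsABCZ scored against the reference BcsABZC is reported as a rearrangement with no losses, duplications or fusions, and two operons whose nearest ends are 15 kb apart are flagged as a candidate non-tandem operon duplication.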

These results suggest that operon evolutionary events are likely to reflect important adaptations of EPS systems for biofilm production across diverse bacterial phyla, which is in contrast to well conserved operon organizations observed for bacterial ABC transporter systems265. Similarly, in previous analyses of E. coli GI and PPI networks, variable conservation and expansion of gene function via duplication was also found to occur among diverse complexes and pathways, which are likely to be of adaptive significance. In the following section, I will present additional analyses to gain further insights into how evolutionary processes such as gene duplication, loss and rearrangement have influenced EPS operon function. To do this I will present the development of a novel phylogenetic clustering approach for classification of EPS operon loci and a graphical visualization approach utilizing genomic-proximity networks to identify operon clades.

4.2.3 Systematic Phylogenetic Distance-Based Clustering of EPS Operon Loci and Genomic-Proximity Networks Identifies Evolutionarily Distinct Operon Clades

Although functional homologies have been noted among the loci of several EPS systems255, for example, Glycosyl 2 protein super-families identified in cellulose, acetylated-cellulose, alginate, and PNAG polysaccharide synthase encoding loci, HMM-based searches detected very few sequences as significant hits to multiple distinct EPS systems, with the exception of acetylation machineries of acetylated-cellulose and alginate. These findings suggest that over time distinct EPS systems are likely to have evolved through sequence divergence of general catalytic functions and the operonic acquisition of unique polysaccharide modifying enzymes to function as distinct molecular machineries in biofilm production.

A study of how components of EPS machineries evolve through sequence divergence and operon arrangement can shed further light on their mechanism of function and potential roles in facilitating lifestyle and niche adaptations among phylogenetically diverse bacteria, for instance by indicating divergence or conservation of domains of enzymatic importance, or by changing the order of locus expression affecting the assembly of EPS complexes153. Although other approaches have been devised for defining groups of evolutionarily related sequences, e.g. through clustering of proteins based on pairwise sequence identities (EggNOG165, OrthoMCL266) or evolutionary distances (OMA238), they do not provide adequate resolution of sequence diversity within clades, which can provide important insights into the function of proteins in their biological contexts. Therefore, in order to relate the evolutionary impact of EPS locus sequence divergence, duplication, loss, and rearrangement events across phylogenetically diverse bacteria, I devised a systematic method utilizing a phylogenetic distance based clustering approach to enable the classification of predicted EPS operon loci into evolutionarily related sequence clusters. In brief, for each EPS operon locus, evolutionary relationships among all identified protein sequences were determined through the construction of phylogenetic trees, which compare the divergence (# of changes per residue/entire sequence length) among all pairs of sequences on a position-by-position basis from a multiple sequence alignment.


Figure 4-4. Evaluation of Phylogenetic Clustering Quality Scores: Phylogenetic Clusters Identified and Cluster Coverage of Cellulose Operon Sequences. A - Phylogenetic clustering depicted on ML phylogenetic trees generated for predicted cellulose operon sequences, BcsABZC. B - Resulting evolutionary distance cutoffs selected to generate phylogenetic clusters based on the optimization of three different cluster quality scores: (Q1 = Silhouette + Dunn Index; Q2 = Q1 + Proportion of Sequences Clustered; Q3 = Q1 * Proportion of Sequences Clustered; Q0 = No Clustering). Optimal clustering chosen for further analysis is indicated in pink (Q2).

To identify the optimal set of clusters for each EPS system locus, three cluster quality metrics were explored, investigating different combinations of Silhouette (Sil), Dunn Index (Dunn) and proportion of sequences clustered (Pclust) (Figure 4-4). The distance cutoff which maximizes a given Q score function was chosen as defining the optimal clustering, and clustering was evaluated through inspection of EPS locus phylogenetic trees (Figure 4-4) as well as operon genomic-proximity graphs (Figure 4-5). In the case of the cellulose operon, the optimal clustering which maximizes the sum of the Sil and Dunn scores (Q1 = Sil + Dunn) resulted in a variable number of clusters across operon loci, which appeared to be indicative of genus-level taxonomic divergence patterns (Figures 4-4 and 4-5 – example Q1). However, when these metrics of cluster evolutionary distance are additively combined with proportion of sequences clustered (Q2 = Pclust + Sil + Dunn), an optimal clustering was identified which generally results in clusters of increased size encompassing species with distinct operon organizations and compositions (Figures 4-4 and 4-5 – example Q2). However, these distinctions were eliminated when employing a multiplicative relationship between the proportion of sequences clustered and Sil and Dunn scores (Q3 = Pclust * (Sil + Dunn)), as the result of clustering highly divergent sequences (Figure 4-5 – example Q3). Therefore, for the purposes of deriving functional insight from processes which contribute to EPS operon evolution, Q2 was chosen for all subsequent clustering of EPS loci, summarized in Figure 5.
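The cutoff selection described above amounts to scanning candidate distance cutoffs and keeping the one that maximizes the chosen Q function. The sketch below shows how the three combinations (Q1, Q2, Q3) can select different cutoffs from the same metric values. The numbers in the usage example are invented purely for illustration and are not results from this work, and the function and variable names are hypothetical.

```python
# The three scoring functions compared in the text, as functions of
# (Pclust, Sil, Dunn) evaluated at a given distance cutoff.
SCORES = {
    "Q1": lambda pclust, sil, dunn: sil + dunn,
    "Q2": lambda pclust, sil, dunn: pclust + sil + dunn,
    "Q3": lambda pclust, sil, dunn: pclust * (sil + dunn),
}


def best_cutoff(metrics, score_name):
    """Return the cutoff whose metrics maximize the named Q function.

    metrics: dict mapping distance cutoff -> (Pclust, Sil, Dunn).
    """
    score = SCORES[score_name]
    return max(metrics, key=lambda cutoff: score(*metrics[cutoff]))


if __name__ == "__main__":
    # Hypothetical metric values for three candidate cutoffs (illustration only).
    example_metrics = {
        0.1: (0.30, 0.80, 0.90),  # tight clusters, few sequences clustered
        0.5: (0.70, 0.60, 0.80),  # intermediate
        1.0: (0.95, 0.45, 0.62),  # broad clusters, most sequences clustered
    }
    for name in ("Q1", "Q2", "Q3"):
        print(name, best_cutoff(example_metrics, name))
```

With these made-up values, Q1 favours the tightest cutoff, Q3 the broadest, and Q2 the intermediate one, which mirrors the qualitative behaviour reported for the cellulose operon loci.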


Figure 4-5. Phylogenetically Clustered Cellulose Operon Genomic-Proximity Networks using Different Cluster Quality Metrics. A comparison of phylogenetically clustered EPS operon genomic-context networks resulting from differing degrees of clustering, indicated by the cluster quality score (Q) utilized (see Figure 4). Optimal clustering is indicated by Q2 (pink border). Each node in the network corresponds to a distinct cellulose locus phylogenetic cluster; nodes are arranged in rows and organized, top to bottom, according to locus ordering of the E. coli K12 MG1655 cellulose operon, BcsABZC. Node size indicates the relative number of sequences per cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge color indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey).

After applying the phylogenetic distance based clustering approach illustrated above, a consistent trend was observed between the average numbers of sequence clusters generated and the total number of operons predicted per EPS operon (Figure 4-6A, Supplemental File 12). Furthermore, for each EPS system the variability of the number of sequence clusters predicted per locus (Figure 4-6B) suggests differing degrees of locus evolution that are likely to be the result of different structural and functional constraints. In the next section I validate this approach by deriving functional insights from EPS operon evolution through the construction of EPS operon genomic-proximity networks.

Figure 4-6. Summary of EPS Operon Phylogenetic Sequence Clustering. A – The average number of sequence clusters generated per EPS system decreases according to the number of operons predicted, indicated in parentheses. B – The average evolutionary distance of EPS loci sequences is variable and correlates with the number of phylogenetic clusters generated.

4.2.4 Functional Implications of Sequence Divergence and Operon Evolutionary Events in Bacterial EPS Systems Revealed by Phylogenetic Clustering and Genomic-Proximity Networks.

Genomic-context networks were constructed for predicted cellulose, PNAG, pel, alginate and acetylated-cellulose operons, where nodes represent phylogenetically clustered groups of related EPS loci sequences and edges represent the average inter-genic distances between loci (Figure 4-1 – Step 4, Supplemental File 13). From these networks, I was able to identify operon clades defined by subsets of connected EPS sequence clusters, which revealed a diverse array of phyla-specific evolutionary events encompassing locus duplication, divergence, loss and rearrangement.

In the following sections I present selected examples of EPS operon evolution, focusing in particular on the contribution of gene duplication events ranging from whole operon duplications of distinct cellulose EPS operon clades arising from HGT among Gamma-Proteobacteria; tandem duplication of PNAG synthase PgaC loci resulting in the identification of a novel PNAG operon clade in Gamma-Proteobacteria; emergence of distinct alginate operon clades derived through whole-operon duplication and rearrangement in Pseudomonas spp.; and duplication of the co-polymerase wssC locus in an acetylated-cellulose operon identified in Bordetella avium. In addition I will demonstrate the utility of this approach toward predicting a novel pel EPS operon clade in Gram-positives, leading to the subsequent experimental validation of pel biofilm production in Bacillus cereus. Furthermore, I show that employing this novel phylogenetic clustering approach in combination with available protein structural information provides a valuable means to interpret the potential functional consequences of EPS evolution across phylogenetically diverse bacterial species.

4.2.4.1 Cellulose Operon Networks Reveal Distinct Operon Clades Corresponding to Divergence and Rearrangement of BcsB, BcsZ, and BcsC Loci

Examination of the cellulose EPS operons revealed a greater number of sequence clusters for loci encoding the inner-membrane cellulose co-polymerase (BcsB), outer membrane polysaccharide transport pore (BcsC) and periplasmic glycosyl hydrolase (BcsZ), indicating greater degrees of divergence among the sequences included in this analysis; conversely, fewer sequence clusters were identified for polysaccharide synthase subunit BcsA, suggesting a higher overall degree of sequence conservation. The differences in sequence divergence observed between BcsA and BcsB loci were found to be indicative of their known roles in the inner-membrane cellulose synthase complex, revealing that sequence clustering is informative of functional relationships among EPS system loci (see Appendix 2). Evidence that operon clusters correspond to phyla-specific cellulose operon organizations is illustrated by four selected examples (Figure 4-7 panel insets): 1) Rearrangement of BcsA in Beta-proteobacterial species and novel locus gain in Burkholderia spp.; 2) Sequence divergence in Zymomonas spp. identifying rearrangement and fusion of BcsA, BcsB and BcsC, BcsZ loci; 3) An operon cluster consisting of Alpha-proteobacterial species with divergent BcsB loci lacking the BcsC outer membrane pore; and 4) Elaboration of operon organization and composition through horizontal transfer (HGT) in Gamma-proteobacterial species (Figure 4-8).


Figure 4-7. Genomic-Proximity Network of Phylogenetically Clustered Cellulose Operons. Identifying Cellulose Operon Clades Distinguished by Diverse Genomic Evolutionary Events. Phylogenetically clustered operon loci are arranged according to a canonical cellulose operon ordering shown in the side grey panel. Inset boxes depict selected examples of cellulose operon clades distinguished by evolutionary events: 1) Inversion of cellulose synthase BcsA locus among Beta-Proteobacteria; 2) Rearrangement and locus fusions of inner membrane cellulose synthase complex subunits (BcsAB) and periplasmic hydrolase and outer membrane pore (BcsZC) loci identified in Zymomonas spp.; 3) Loss of BcsC outer membrane pore and divergence of BcsB inner membrane cellulose co-polymerase loci in Alpha-Proteobacterial spp.; 4) Whole operon duplications identified in Gamma-Proteobacteria derived from the horizontal transfer of two cellulose operon clades; 5) A distinct operon clade of Gamma- and Beta-Proteobacterial species defined by a shared BcsC phylogenetic cluster resulting from HGT. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge color indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey). Network visualized using Cytoscape 3.5.1167.

4.2.4.2 Cellulose Whole Operon Duplications Identified in Gamma-Proteobacteria Arising from HGT of Two Distinct Operon Clades

In the following sections I elaborate on how genomic-proximity networks can be utilized to interpret sequence divergence in the context of operon evolution, with two examples of HGT events identified among cellulose operon lineages. From an initial examination of the cellulose genomic-proximity network, phylogenetic clusters identified two distinct operon clades which co-occur in a number of genomes with inter-genic distances > 10 kb, indicating likely HGT events. Closer investigation of the operonic arrangements of species possessing single copies of a given operon clade revealed two distinct locus orderings: the first representing the canonical cellulose locus order (clade A), BcsABZC, found among Escherichia coli and Salmonella enterica strains, and the second a non-canonical ordering resulting from a rearrangement of the periplasmic glycosyl-hydrolase BcsZ (clade B), BcsABCZ, found largely among Dickeya, Erwinia and Pantoea spp. (Figure 4-8 – first panel). Interestingly, Enterobacter and Klebsiella spp. were also found to possess duplicated cellulose operons (clades A + B) which likely originated through horizontal transfer (Figure 4-8 – second panel). Furthermore, two additional BcsB phylogenetic clusters associated with the B operon clade were identified: the first corresponding to a novel operon arrangement (clade B1) via the acquisition of additional loci in Proteus spp. (Figure 4-8 – third panel), with roles pertaining to cellulose precursor biogenesis and a cellulose lyase, while the second revealed a single Enterobacter sp. genome possessing three cellulose operons (clades A + B + B1, respectively) likely to have been acquired through multiple horizontal transfer events (Figure 4-8 – fourth panel).


Figure 4-8. Phylogenetic Clustering Identifies HGT Through Divergence and Rearrangement of Two Cellulose Gamma-Proteobacterial Operon Clades. An inset subgraph from the cellulose EPS operon genomic-proximity network (Figure 4-7 – ex. 4) is shown with illustrations highlighting specific operon organization for two distinct gamma proteobacterial operon clades, A (canonical BcsABZC) and B (BcsABC-Z). Arrows depict the order of transcription loci and are coloured according to intergenic distance (Kbp). Top panel depicts two examples of species which possess either a single A or B1 operon clade. Second panel depicts an example where both A + B operons are found in species indicating an HGT event. Third panel depicts an example of divergence of BcsB locus which identifies a rearrangement and novel locus acquisition of the B operon (B1). Fourth panel depicts an example where further divergence of BcsB identifies a multiple HGT event of three distinct operon organizations, A + B + B1.

4.2.4.3 Phylogenetic Clustering Identifies HGT of Cellulose Operons Between Environmental Gamma- and Beta-Proteobacterial Species and Divergence of BcsC Outer Membrane Pore

To further explore the extent to which HGT has contributed to cellulose operon evolution, I sought to identify further cases of HGT events indicated by phylogenetic clusters encompassing species belonging to diverse bacterial phyla. One such example was found of an operon clade of related BcsC OMP loci (Figure 4-7 – panel 5) shared among Gamma- and Beta-Proteobacterial species, e.g. Xanthomonas and Burkholderia spp. respectively. From preliminary comparison of the operon organizations from these species (Figure 4-9), it was observed that their BcsC loci were of a greater size compared to related Proteobacterial operon clades, with an average difference of 313 amino acids from canonical cellulose operon BcsC sequences. Based on this distinctive expansion of the BcsC sequence, an additional instance of HGT could be identified in Frateuria aurantia DSM 6220, resulting in the acquisition of two distinct operon lineages (Figure 4-9C). The finding that this particular operon clade is represented among plant and opportunistic human pathogens suggests that the evolution of cellulose export function may serve a role in niche adaptation.

To investigate the potential functional consequences of BcsC divergence, sequence alignments of these non-canonical BcsC loci were examined, identifying three distinct insertion events occurring in predicted TPR and OMP domains (Figure 4-9D). One of these insertions (Figure 4-9D – box A) is found to be conserved among Xanthomonas and Burkholderia spp., indicating a common origin of this operon lineage through inter-phyla HGT followed by genus specific diversification. Although cellulose production has not previously been demonstrated for Xanthomonas or Burkholderia, TPR domains are known to be important mediators of physical interactions in cellulose and other EPS systems267–269, and their extensive enlargement may serve as an important means of increasing domain flexibility270 for the scaffolding of novel accessory proteins271–273.


Figure 4-9. Cellulose Operon Clade with Distinct BcsC OMP Phylogenetic Cluster Reveals HGT and Operon Divergence among Gamma and Beta Proteobacterial Species. An inset subgraph from the cellulose EPS operon genomic-proximity network (Figure 4-7 – ex. 5) is shown identifying a subset of operons with a shared BcsC phylogenetic group (BcsC_G3-3). Illustrations highlighting genus specific operon organizations and locus divergence among operon clade genomes: A) A gene fission event of the BcsC outer membrane pore locus identified in a Xanthomonas citri strain; B) Divergence and enlargement of BcsB polysaccharide co-polymerase locus in Burkholderia spp.; and C) A secondary cellulose operon originating by HGT in Frateuria aurantia and accompanying operon evolutionary events including loss of the BcsZ hydrolase and a complementary tandem duplication. Panel D shows a multiple sequence alignment of BcsC_G3-3 phylogenetic group sequences and corresponding predicted PFAM domains against E. coli MG1655 K12 BcsC (BcsC_G1-1) with key insertion events identified by green boxes: 1) A common insertion following the BcsC periplasmic TPR domain; 2) An additional insertion unique to Burkholderia spp.; and 3) An insertion in the BcsC outer membrane pore domain unique to Xanthomonas spp. Multiple sequence alignment was visualized using Geneious 10.2.2274.

From the results of the foregoing analyses, 33 species are estimated to possess cellulose operons derived from an HGT event, 24 of which were identified as non-tandem whole operon duplications of two distinct operon clades occurring mainly among human associated Enterobacterial spp. and 9 of which have resulted from the transfer of an evolutionarily divergent operon clade across environmental Gamma- and Beta-Proteobacterial spp. In contrast, only 5 tandem non-HGT operon duplications were identified, all of which are accompanied by differential loss of loci between duplicates, suggesting that duplicate operons may be expressed together, enabling the reconstitution of a functioning cellulose synthase complex, or that lost loci may be functionally compensated by the acquisition of novel loci.

4.2.4.4 Genomic-Proximity Networks of PNAG Operons Reveal Differences in Locus Divergence Compared to Cellulose EPS Operons

The PNAG exopolysaccharide is a homopolymer comprised of Beta-1-6 linked N-acetyl-D-glucosamine. Its production has been identified in a number of Gram-negative and Gram-positive species (Escherichia coli, Yersinia pestis, Actinobacillus pleuropneumoniae, Bordetella bronchiseptica, Staphylococcus epidermidis, Staphylococcus aureus, and Bacillus subtilis), where it is carried out by the proteins encoded by the pgaABCD and icaADBC operons, respectively255. PNAG has been shown to play an important role in intercellular adhesion and persistence during pathogenic infection275–277, which is associated with the degree of N-acetylation enzymatic modification of the polysaccharide278.

The genomic-context network of Gram-negative PNAG EPS operons (Figure 4-11) provided further insight into the evolution of PNAG functionalities when compared to cellulose EPS operons. For example, the polysaccharide synthase encoding locus, PgaC, like BcsA, appears to be highly evolutionarily conserved, as indicated by fewer phylogenetic clusters encompassing diverse bacterial phyla, while conversely the outer membrane pore encoding locus, PgaA, like BcsC, appears more divergent among Gram-negative bacterial genera, i.e. represented by a greater number of phylogenetic groups, and is absent in Gram-positives. The greatest divergence among Gram-negatives is observed for the PgaD locus, which comprises a unique protein family among EPS systems thought to be involved in assisting PgaC in the biosynthesis and transport of PNAG275. Interestingly, the dual periplasmic glycosyl hydrolase and de-acetylase PgaB appears relatively well conserved among Gram-negatives, unlike its partial functional homolog hydrolase BcsZ, which shows considerable variation among Gram-negative phyla. Phylogenetic clustering of PNAG operon sequences further identifies a distinct taxonomic division between homologous Gram-negative pgaBC and Gram-positive icaAB PNAG loci. In addition, a novel operon clade consisting of multiple PgaB and PgaC loci arising through tandem duplication and fusion events was found to co-occur in a number of Gram-negative genomes, assigned to distinct phylogenetic clusters indicating sequence divergence following locus duplication events.

Figure 4-10. Phylogenetically Clustered PNAG Operons and Selected Examples of Divergent Operon Clades. Phylogenetically clustered operon loci are arranged according to a canonical PNAG operon ordering shown in the side grey panel. Inset boxes depict selected examples of PNAG operon clades distinguished by evolutionary events: 1) Divergence of PgaD corresponding to related enterobacterial species, including pathogen-specific losses of PgaA and PgaB loci critical for PNAG export; 2) Operon duplications occurring in aquatic niche dwelling bacteria, including a partial duplication of the PNAG operon specific to the opportunistic pathogen Acinetobacter baumannii spp. and a whole operon duplication identified in Methylotenera versatilis; 3) A unique PNAG operon organization among environmental bacteria lacking a PgaD locus; 4) Gram-positive PNAG operons with divergent PgaB loci, resulting from novel domain acquisitions; 5) A novel PNAG derived operon resulting from multiple tandem duplications of the PgaC polysaccharide synthase and lacking a detectable PgaA outer membrane pore and PgaD. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey). Network visualized using Cytoscape 3.5.1 (ref. 167).
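As a rough illustration of how a network of the kind described in this legend could be assembled, the following Python sketch connects phylogenetic clusters that co-occur in the same genome and bins each edge by inter-locus distance using the thresholds given in the legend. The input record layout, the function names, the use of the gap between locus coordinates as the distance, and the choice of the minimum distance observed across genomes are all assumptions made for illustration; this is not the procedure used to generate the figure.

```python
from itertools import combinations

def proximity_colour(distance_kb):
    """Bin an inter-locus distance (in Kb) using the legend thresholds."""
    if distance_kb <= 0.1:
        return "red"
    if distance_kb < 5:
        return "blue"
    return "grey"

def build_proximity_network(loci):
    """Build edges between clusters that co-occur in the same genome.

    loci: iterable of (genome, cluster_id, start_bp, end_bp) tuples
          (a hypothetical layout chosen for this sketch).
    Returns {(cluster_a, cluster_b): {"genomes", "min_kb", "colour"}}.
    """
    by_genome = {}
    for genome, cluster, start, end in loci:
        by_genome.setdefault(genome, []).append((cluster, start, end))

    edges = {}
    for genome, members in by_genome.items():
        for (c1, s1, e1), (c2, s2, e2) in combinations(members, 2):
            if c1 == c2:
                continue  # self-loops (same-cluster duplicates) are ignored here
            # Gap between the two loci; overlapping loci get a distance of 0.
            gap_kb = max(0, max(s1, s2) - min(e1, e2)) / 1000
            edge = edges.setdefault(tuple(sorted((c1, c2))),
                                    {"genomes": set(), "min_kb": gap_kb})
            edge["genomes"].add(genome)
            edge["min_kb"] = min(edge["min_kb"], gap_kb)

    for edge in edges.values():
        edge["colour"] = proximity_colour(edge["min_kb"])
    return edges

# Toy coordinates, invented purely for demonstration.
loci = [
    ("genome_1", "PgaA_G1", 1000, 3400),
    ("genome_1", "PgaB_G1", 3450, 5400),
    ("genome_1", "PgaC_G1", 7400, 8700),
    ("genome_2", "PgaB_G1", 200, 2100),
    ("genome_2", "PgaC_G1", 2150, 3400),
]
network = build_proximity_network(loci)
for pair, edge in sorted(network.items()):
    print(pair, sorted(edge["genomes"]), edge["min_kb"], edge["colour"])
```

The resulting edge dictionary could then be exported as a table and loaded into a network viewer, with node size and colouring supplied as separate per-cluster attributes.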

4.2.4.5 Comparisons of PNAG Operon Clades Across Bacterial Phyla Distinguish Gram-negative and Gram-positive Operon Architectures and Reveal a Novel Instance of Gram-Negative PNAG Operons

To examine how locus duplication, loss, and rearrangement events have contributed to the evolution of PNAG operons across bacterial phyla, selected examples of PNAG operon clusters were identified and compared (Figure 4-10 panels). For example, within a group of enterobacteria possessing related PgaD loci, there exist a number of closely related pathogenic enterobacteria that have lost PgaA (E. coli ETEC H10407), as well as PgaB (Shigella flexneri 5 str. 8401), suggesting the recent loss of the ability to produce PNAG (Figure 4-10 – Ex.1); in the case of S. flexneri this loss may be due to an adaptation to an intracellular mode of infection279. Similarly, no PNAG operons were detected among the Salmonella spp. genomes surveyed in this study, consistent with the loss of PNAG production previously associated with an intracellular pathogenic lifestyle280.

Divergence of PgaA also enabled the identification of an operon clade consisting of a partial duplication in the opportunistic Gamma-Proteobacterial pathogen A. baumannii and a full operon duplication in an environmental Beta-Proteobacterial Methylotenera versatilis strain, which suggests the evolution of biofilm production as a consequence of adaptation to differing lifestyles (Figure 4-10 – Ex. 2). Divergent PgaA sequence groups also identify rearrangements of PgaC loci associated with environmental taxa, and in some instances losses of PgaB due to pseudogenization (Figure 4-10 – Ex.3).

Divergent PgaB sequence groups identified among Gram-positive bacteria were found to define a distinct PNAG operon organization, lacking PgaA and PgaD loci and having an inversion of the PgaB and PgaC loci. These operons were found to be annotated as intercellular adhesion (ICA) operons, which have been previously studied in Staphylococcus spp.281 (Figure 4-10 – panel 4, example A), and are defined by the lack of a C-terminal glycosyl-hydrolase domain in their PgaB loci, distinguishing them from Gram-negative PgaB homologs282. Although the organization of Gram-positive PNAG operons appears to be similar (PgaC-PgaB), their PgaB sequences were assigned to distinct phylogenetic clusters, revealing many cases corresponding to locus-specific sequence expansion events (Figure 4-10 – panel 4 examples B and C).

A clade of Gram-negative PNAG operons was identified possessing varying numbers of divergent PgaC loci resulting from repeated tandem duplication events. Although these gene clusters lack a detectable PgaA locus, their possible role in EPS production was investigated. One member of this operon clade, Thauera sp. MZ1T, was identified as a known abundant producer of EPS responsible for viscous bulking in activated sludge wastewater treatment processes283. Furthermore, a recent mutagenesis study284 demonstrated that biofilm-formation defective Thauera mutants could be rescued by complementation of the PgaB deacetylase locus, which is predicted to reside in the operon identified in the present study. This analysis suggests that the EPS produced by Thauera is a modified form of PNAG, and that the absence of an outer membrane PgaA locus may indicate the employment of a different mode of EPS outer membrane transport.

Given the diversity of species and PNAG operon compositions, phylogenetic clustering suggests that PNAG production among Gram-negative and Gram-positive species requires the core conserved functionality of the PgaC polysaccharide synthase and the PgaB deacetylase domain. Around this core, PNAG production appears to have become elaborated through the acquisition of novel protein domains and loci. Further structural analyses of divergent PgaA and PgaB loci provided added insights into the evolution of PNAG production and secretion among Gram-negative and Gram-positive species. Specifically, sequence clusters of the PgaA outer membrane pore identified divergence of negatively charged residues in the PNAG binding pocket, suggesting altered capacities for PNAG secretion (see Appendix 3), while PgaB sequence groups are defined by several indel events likely to be involved in co-ordinating the actions of the de-acetylase and glycosyl hydrolase domains involved in PNAG modification (see following section). These results suggest an evolutionary scenario in which the gain of novel loci, or the duplication and divergence of existing functionalities, has enabled bacteria to exploit novel biofilm production capabilities in adaptation to diverse environments.

4.2.4.6 Divergence of PNAG PgaB Phylogenetic Sequence Clusters Elucidates Structural Differences Related to Biofilm Modification Across Diverse Bacterial Phyla

Phylogenetic clusters of representative PgaB sequences reveal remarkable evolutionary diversity of the de-acetylase domain, which is essential for PNAG secretion (Section 4.2.4.4 Figure 10A). Significant events include the gain of a glycosyl hydrolase domain associated with Gram-negative PNAG operon lineages and its loss in Gram-positive ICA operon lineages (Figure 4-10 – panels 1 and 4a). Beyond these events, a number of N- and C-terminal domain fusions were identified which further delineate the distinct PNAG operon clades (Figure 4-11A).

In addition to these novel domain fusion events, PgaB phylogenetic clusters also define distinct events affecting the evolution of the deacetylase domain across different operon clades. Using the E. coli K12 MG1655 sequence of the largest PgaB phylogenetic cluster as a reference, multiple sequence alignments identified several indel regions which, when mapped to the crystal structure of PgaB (4F9D), appear to correspond to distinct structural elements surrounding the conserved deacetylase core. Indel regions were numbered according to their order of appearance in the multiple sequence alignment of PgaB glycosyl hydrolase domains, and could be categorized into two types (Figure 4-11B). The first two, indel regions 1 and 2, occur in the N-terminal region of the reference E. coli PgaB sequence and correspond to beta-strands flanking the conserved residues His55, Asp114, and Asp115. They comprise insertions of ~10 aa in Staphylococcus aureus VC40, Bacillus infantis NRL B-14911, Lactobacillus plantarum 16, and Leptospirillum ferriphilum ML-04, and of ~77 aa in Geobacter metallireducens GS-15, Crinalium epipsammum PCC 9333, and Colwellia psychrerythraea 34H. The latter three, indel regions 3, 4 and 5, occur in a region oriented away from the deacetylase active site, corresponding to two beta-turn motifs and an alpha-helix cap, respectively (Figure 4-11C).

Regions 3 and 5 appear to be lost in the majority of other phylogenetic sequence groups, while in region 4 a 29 amino acid insertion was identified in Lachnoclostridium phytofermentans ISDg, which may compensate for the loss of 9 aa in region 3. These latter evolutionary events appear to be unique to PgaB sequence clusters possessing an additional C-terminal domain, and suggest a possible role in maintaining their proper spatial orientation relative to the N-terminal de-acetylase domain.
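To make the notion of an indel region defined relative to a reference sequence more concrete, the following Python sketch scans two aligned rows (a reference and a query, as they would appear in a multiple sequence alignment) and reports runs of insertions or deletions in reference coordinates. The function name, the minimum run length, and the toy sequences are assumptions made for illustration only; this is not the procedure used to define the five PgaB indel regions.

```python
from itertools import groupby

def indel_regions(ref_aln, query_aln, min_len=3, gap="-"):
    """Report indel runs in a query relative to a reference.

    Returns a list of (kind, ref_pos, length) tuples. For an "insertion"
    (query residues opposite reference gaps), ref_pos is the 1-based index of
    the last reference residue before the run. For a "deletion" (reference
    residues opposite query gaps), ref_pos is the first deleted residue.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")

    columns = []
    ref_pos = 0  # number of reference residues seen so far
    for r, q in zip(ref_aln, query_aln):
        if r != gap:
            ref_pos += 1
        if r == gap and q != gap:
            kind = "insertion"
        elif r != gap and q == gap:
            kind = "deletion"
        else:
            kind = None  # aligned residues, or a gap in both rows
        columns.append((kind, ref_pos))

    regions = []
    # Consecutive columns of the same kind form one candidate region.
    for kind, run in groupby(columns, key=lambda col: col[0]):
        run = list(run)
        if kind is not None and len(run) >= min_len:
            regions.append((kind, run[0][1], len(run)))
    return regions

# Toy aligned rows, invented purely for demonstration.
reference = "MKT----AYLGDEHKL"
query     = "MKTPPGSAYL---HKL"
print(indel_regions(reference, query))
```

Reporting positions in reference coordinates is what allows regions found in different sequence clusters to be compared against the same structure.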

Figure 4-11. Phylogenetic Clustering Reveals Structural Evolution of PNAG PgaB Periplasmic Modifying Enzyme Distinguishing Gram-Negative and Gram-Positive PNAG Operon Clades. A - Multiple sequence alignment of representative sequences comprising all PgaB phylogenetic clusters. Global sequence conservation compared against E. coli MG1655 K12 PgaB, phylogenetic cluster PgaB_G1, indicates presence of the polysaccharide deacetylase domain (blue box) but an absence of the glycosyl-hydrolase domain in non-PgaB_G1 sequences. Red arrows indicate phylogenetic group specific N-terminal domain fusions predicted by PFAM searches; C-terminal domain fusions were identified (red box) but were not detected by PFAM searches. B - A close up view of sequence conservation of PgaB polysaccharide deacetylase domains with indel events highlighted: green boxes indicate insertions identified in non-PgaB_G1 sequences; teal boxes indicate insertions in the PgaB_G1 sequence residing in the C-terminal alpha-helix cap (yellow box). C – Crystal structure of E. coli PgaB (4F9D) indicating conservation of the deacetylase domain catalytic core. D – Deacetylase domain with indel regions indicated according to the colour scheme described for panel B. E – C-terminal alpha helical cap region of the PgaB deacetylase domain indicating insertions of the PgaB_G1 region that are spatially proximal to an N-terminal region of the hydrolase domain (purple); comparison of the same regions with PgaB_G1 sequence conservation indicated. The multiple sequence alignment was generated and visualized using Geneious 10.2.2 (ref. 274); the protein structure was visualized using Chimera 1.11.2 (ref. 285).

To further understand the biological import of the identified PgaB indel regions, I examined regions 3, 4, and 5 in the context of Gram-negative PNAG modification. In E. coli K12 MG1655 PgaB, region 3 encompasses a beta-turn with an elongated loop, which is situated spatially proximal to an N-terminal region of the PgaB glycosyl transferase domain, corresponding to a disordered loop and alpha helix (pos. 367-392 of E. coli PgaB); both regions contain polar and electrostatically charged residues which are highly conserved in PgaB_G1 sequences (Figure 4-2D right panel). These results suggest a variety of scenarios by which the function of the PgaB de-acetylase domain has evolved to accommodate a variety of domain fusions, leading to phylum-specific modifications of PNAG286.

4.2.4.7 Genomic-Proximity Networks of Pel Operons Reveal Locus Rearrangements Across Bacterial Phyla

Genomic-proximity networks of phylogenetically clustered Pel operon loci identify distinct operon organizations across phylogenetically divergent bacteria (Figure 4-12). These include rearrangements which frequently accompany novel locus acquisitions (Figure 4-12 - Ex.2, Ex.3, Ex.4) and a novel Pel operon cluster identifying, for the first time, the potential for c-di-GMP regulated biofilm production in Gram-positive bacteria, a prediction further validated by collaborator Greg Whitfield (Graduate Student in the lab of Dr. Lynne Howell, Hospital For Sick Children, Toronto, Ontario, Canada) (Figure 4-12 – Ex. 5, described further in section 2.6.4.1).

As seen in the previous summary analysis of EPS operon evolutionary events, genomic rearrangements were often identified among pel operon clades, and involve the PelA (precursor biogenesis) and PelB (outer membrane transport lipoprotein) loci. Many species possessing these alternate operon arrangements appear to be associated with diverse environments. The ordering of operon loci has been shown to play an important role in the assembly of macromolecular complexes153, suggesting that the reorganization of operons may affect how assembly of the EPS secretion complex is coordinated with the loci responsible for biosynthesis of the pel polysaccharide.

I also observed a high degree of overall conservation among components which are known to play key roles in Pel biogenesis, such as the putative inner-membrane polysaccharide synthase (PelF), inner-membrane transporter subunit (PelG), periplasmic deacetylation enzyme (PelA) and c-di-GMP binding regulator (PelD). In contrast, a greater degree of divergence can be seen among inner-membrane (PelE) and outer-membrane (PelB, PelC) transport associated loci, which appear to follow a consistent pattern of clustering across bacterial phyla, suggesting co-evolution of potentially physically interacting components.

Figure 4-12. Genomic-Proximity Network of Pel Operons and Identification of Novel Gram-Positive Pel Operon Clades. Phylogenetically clustered operon loci are arranged according to a canonical pel operon ordering shown in the side grey panel. Inset boxes depict selected examples of pel operon clades distinguished by evolutionary events: 1) Canonical pel operon organization; 2) Duplication of pel identified in Nitrosospira multiformis and consequent operon evolution through rearrangement, novel locus acquisition, and locus loss; 3) A related operon clade defined by a fission of the PelB TPR coding locus and operon rearrangement in aquatic-dwelling thermophilic species; 4) A potentially novel duplicated pel operon identified in Leptospirillum ferrooxidans comprised of divergent PelA and PelF loci; 5*) Gram-positive pel operon clades detected possessing a cyclic di-GMP regulatory locus, PelD. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey). Network visualized using Cytoscape 3.5.1 (ref. 167).

4.2.4.8 Experimental Validation of Novel Gram-positive Pel Production in Bacillus cereus ATCC 10987

Genomic proximity network reconstruction of pel operons identified two distinct clades comprising several Gram-positive species (Figure 4-12 – Ex. 5). Of the synthase dependent EPS operons known to date, only PNAG production has been previously characterized in Gram-positives281. Operon reconstruction from initial HMM searches identified several Gram-positives with putative pel operons comprised of the PelF polysaccharide synthase and the PelG putative transport protein (Figure 4-13A). To determine whether these were bona fide pel operons with additional loci, more sensitive HMM searches were required. HMM models for pel loci were regenerated using sequences from predicted pel operons (with sequences having <= 97% sequence similarity and using the approach described in section 4.1.2). With these HMM models, iterative HMM searches were performed (as described in section 4.1.3), which enabled the detection of additional loci including PelD, a c-di-GMP binding regulator of pel secretion (Figure 4-13B). Given that c-di-GMP signaling in Gram-positive bacteria is relatively poorly understood287, this result may provide further evidence for its role in regulating biofilm formation. To experimentally verify this novel finding, biofilm impairment was examined in a model Gram-positive through the generation of a gene deletion knockout (performed by Greg Whitfield, graduate student in the lab of Dr. Lynne Howell, Senior Scientist, Hospital for Sick Children, Toronto, Ontario) of the predicted pelF polysaccharide synthase encoding locus of Bacillus cereus ATCC 10987, a known pellicle forming Gram-positive288. Scanning electron microscope images comparing the resulting phenotypes of wildtype B. cereus ATCC 10987 and the pelF knockout strain (Figure 4-13C) revealed ablation of biofilm formation, confirming the presence of a novel synthase dependent EPS operon in a Gram-positive bacterium. In addition, a recently performed transposon mutagenesis screen in B. cereus ATCC 10987 confirmed a loss of biofilm formation corresponding to disruptions occurring in the PelADE-FG region identified in this analysis289, providing additional validation of this prediction. However, the authors did not provide any further comment as to the nature of the biofilm produced.
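The iterative search strategy mentioned in this section can be pictured as a loop: build a profile from the current member sequences, search with it, add any new hits passing the e-value threshold, and repeat until no new sequences are recovered. The Python sketch below is a minimal illustration under stated assumptions. `build_profile` and `search` are hypothetical callables standing in for profile construction and database searching (for instance, wrappers around an HMM toolkit), the 1e-5 threshold is taken from the Figure 4-13 legend, and the toy motif-based scoring exists only so that the loop can be executed. It should not be read as the procedure of section 4.1.3.

```python
def iterative_search(seed_ids, build_profile, search,
                     evalue_cutoff=1e-5, max_rounds=10):
    """Rebuild a profile and search repeatedly until no new hits are found.

    build_profile(member_ids) -> profile             (hypothetical callable)
    search(profile) -> iterable of (seq_id, evalue)  (hypothetical callable)
    Returns (members, history); history[i] lists the ids added in round i + 1.
    """
    members = set(seed_ids)
    history = []
    for _ in range(max_rounds):
        profile = build_profile(members)
        hits = {sid for sid, evalue in search(profile) if evalue < evalue_cutoff}
        new_hits = hits - members
        if not new_hits:
            break  # converged: the profile recovers nothing new
        members |= new_hits
        history.append(sorted(new_hits))
    return members, history

# Toy stand-ins, invented for demonstration: each "sequence" is a set of motifs.
toy_db = {
    "seqA": {"m1", "m2", "m3"},
    "seqB": {"m2", "m3", "m4"},
    "seqC": {"m3", "m4", "m5"},  # too divergent from seqA to be found in round 1
    "seqD": {"m8", "m9"},        # unrelated
}

def toy_build_profile(member_ids):
    # The toy "profile" is simply the union of motifs seen in current members.
    return set().union(*(toy_db[m] for m in member_ids))

def toy_search(profile):
    # Pseudo e-value: significant only when at least two motifs are shared.
    for seq_id, motifs in toy_db.items():
        yield seq_id, (1e-10 if len(motifs & profile) >= 2 else 1.0)

members, history = iterative_search({"seqA"}, toy_build_profile, toy_search)
print(sorted(members), history)
```

In the toy run the divergent sequence is only recovered once the profile has been rebuilt with the first-round hit, which is the behaviour that makes iterative searching more sensitive than a single pass.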

Figure 4-13. Identification of Gram-positive Pel Clades; Iterative HMM Searching Reconstructs Operons with Divergent Loci Leading to Experimental Validation of Novel Gram-positive Biofilm Operon in B. cereus ATCC 10987. A - Subnetwork depicting Gram-positive pel operon clades with varying numbers of loci identified as significant (e-value < 1e-5) hits in first-pass (unfilled nodes) and iterative HMM searches (grey nodes). Selected examples shown: 1) PelA-PelFG identified by first-pass HMM hits; 2*) Iterative HMM searches identifying additional PelA loci in B. cereus ATCC 10987, a known pellicle producing Gram-positive; 3) Additional PelD loci identified by iterative HMM; 4 and 5) Gram-positive pel operons with only PelF and PelG loci identified. B – Operon organizations of selected examples of Gram-positive pel operons and additional loci identified (red boxes: hits above HMM e-value threshold of 1e-5). C – Scanning electron microscope images comparing biofilm formation in wildtype B. cereus ATCC 10987 and predicted pelF deletion knockout strain (ΔpelF). Images provided courtesy of Dr. Lynne Howell (Hospital for Sick Children), Dr. Elyse Roach (University of Guelph) and Dr. Cezar Khursigara (University of Guelph).

4.2.4.9 Genomic-Proximity Network of Alginate Operons Reveals Distinct Clades Among Pseudomonas spp.

Although alginate operons were predicted largely among Pseudomonas spp. genomes (Figure 4-2B), phylogenetic clustering and genomic-proximity network reconstruction revealed an array of events influencing alginate operon evolution. For example, two distinct alginate operon clades were identified among Pseudomonas spp., defined by whole operon duplication and rearrangement of alginate polysaccharide modification loci (Figure 4-14 – panel 1 and panel 2). Also identified were divergent, “atypical” alginate operons comprising extensive rearrangements and losses of functionally related subsets of alginate loci, e.g. outer-membrane transport loci (AlgKE) and polysaccharide modification machinery (AlgGXLIJF), which may either be too divergent to detect or be functionally substituted by novel loci (Figure 4-14 – panel 3). Closer examination of the alginate genomic-proximity network also indicated a greater number of clusters for Alg44 and AlgX loci, reflecting increased divergence among distinct alginate operon clades. Given that both loci play related roles in the regulation and assembly of the alginate EPS secretion machinery, these results provide an avenue for future research toward elucidating how species may modify alginate production to adapt to diverse environmental niches (see Appendix 4).

Figure 4-14. Genomic-Proximity Network of Alginate EPS Operons Indicating Rearrangement and Divergence Driving the Emergence of Distinct Operon Clades in Pseudomonas spp. Phylogenetically clustered operon loci are arranged according to a canonical alginate operon ordering shown in the side grey panel. Inset boxes depict selected examples of alginate operon clades distinguished by evolutionary events: 1) Canonical alginate operon organization with a partial operon duplication event identified in Pseudomonas resinovorans resulting in the loss of alginate acetylation machinery (1b – indicated by A*); 2) A distinct alginate operon clade (2a) identified by rearrangement of acetylation machinery (indicated by B*) as well as HGT events with species possessing canonical alginate operons; 3) Atypical alginate operons involving loss of outer membrane transport loci or portions of acetylation machinery in deep sea dwelling bacteria. Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey). Network visualized using Cytoscape 3.5.1 (ref. 167).

4.2.4.10 Genomic-Proximity Network of Acetylated-Cellulose Operons Reveals Divergent Operon Clades in Pseudomonas spp. and Bordetella and Orthologous Relationships with Alginate Acetylation Machinery

From the genome sequences surveyed, only four species were identified possessing acetylated-cellulose operons, comprising two distinct operon clusters with differing operon constitutions among 3 Pseudomonas spp. and Bordetella avium 197N, suggesting a recent emergence, possibly resulting from the acquisition of alginate acetylation machinery (Figure 4-15). In contrast to cellulose phylogenetic clusters, the polysaccharide synthase, WssB, was divided into distinct Gamma- and Beta-proteobacterial clusters. Also noted was a distinct phylogenetic cluster identifying a unique tandem duplication of WssC in Bordetella avium 197N, which was not observed among orthologous cellulose BcsB copolymerase loci (Figure 4-15 – Panel 2 red asterisk). This observation might suggest a divergent mechanism of action of cellulose inner-membrane transport. As noted in a previous section (4.2.1 – Figure 4-2B), 3 out of 4 of the predicted acetylated-cellulose operons were found to co-occur with alginate operons. In examining the question of the origin of acetylated-cellulose operons further, HMM searches identified significant hits for acetylated-cellulose WssBCDE operon sequences to BcsABZC loci, as well as significant hits between the acetylation machinery of acetylated-cellulose and alginate operons (WssH – AlgI; WssI – AlgJ/AlgX, see also Table 4-1), indicating the likelihood that acetylated cellulose production evolved through the duplication and operonic acquisition of alginate acetylation machinery loci. These results further suggest that a degree of evolutionary adaptability may exist between core EPS operon transport machinery loci, likely due to similarities in the substrates and enzymatic domains utilized across EPS systems. They also hint at a model by which such loci may evolve: first through their initial horizontal transfer into a novel genomic context, and then, following subsequent processes of genomic rearrangement and sequence divergence, through co-adaptation with distinct sets of accessory loci, thereby increasing the adaptive potential of bacteria in a particular environmental niche.

Figure 4-15. Genomic-Context Network of Acetylated-Cellulose EPS Operons. Phylogenetically clustered operon loci are arranged according to a canonical acetylated-cellulose operon ordering shown in the side grey panel. Inset panels identify 3 acetylated-cellulose operons identified in Pseudomonas spp. (panel 1) and a single Bordetella avium genome possessing a duplicated polysaccharide co-polymerase WssC locus (panel 2 - indicated by red asterisk). Node size indicates the relative number of sequences per phylogenetic cluster; node colouring represents the phyletic distribution of species for a given cluster; edges connect clusters which co-occur in the same genome(s); edge colour indicates the genomic-proximity of phylogenetic clusters: <= 0.1 Kb (red); > 0.1 & < 5 Kb (blue); and, >= 5 Kb (grey). Network visualized using Cytoscape 3.5.1 (ref. 167).

4.3 Discussion and Conclusions

In this chapter I describe a novel and general approach for the systematic clustering and classification of protein families associated with bacterial operons. Protein families are defined as a set of closely related groups of sequences which share a common evolutionary ancestor; conserved regions unique to members of a given protein family are often indicative of particular structural and functional adaptations that can be utilized to determine their biological roles290. For example, the PFAM database utilizes curated sets of protein family sequences, differentiated by specific combinations of protein domains or motifs, in the generation of profile hidden Markov models (HMMs)112. However, the identification of protein families is further complicated by genomic evolution through duplication, protein fusions, and horizontal transfer. Thus, methods have also been employed to classify protein families based on the resolution of evolutionary relationships of sequences, either by graphical clustering of pair-wise similarities of protein sequences (COG291, OrthoMCL266, EggNOG165), or by hierarchical evolutionary distances of gene sequences and construction of phylogenetic trees (TreeFAM292, TreeCL293). However, these methods do not provide further resolution of sequence diversity within protein families that can provide additional insights into how members have adapted to function in different biological contexts.

The approach I have presented integrates profile HMMs for the initial identification of protein families, prediction of operon associated protein families, and subsequent protein phylogenetic tree reconstruction and evolutionary distance clustering to identify evolutionarily related sequence clusters. To select an optimal evolutionary distance cutoff for defining protein clusters, an overall cluster quality metric was devised. By combining two clustering quality metrics (Silhouette and Dunn index) with the proportion of sequences clustered, I was able to classify a diverse array of operon-associated protein families into taxonomically consistent and functionally informative sub-clusters. Other agnostic clustering methods to define sub-clusters of evolutionarily related protein families have ranged from phylogenetic tree reconstructions of evolutionary relationships within the SNF2 family of sequences involved in eukaryotic DNA damage repair and transcriptional regulation pathways294 to hierarchical clustering of pairwise global alignments of SRS domains involved in regulating invasion and virulence of the apicomplexan parasite Toxoplasma gondii295. The approach presented in this chapter provides a novel extension of previous work, enabling the systematic and unbiased identification of sub-clusters comprising diverse protein families involved in an integrated biological process. Genomic-proximity networks were constructed to provide an intuitive means of utilizing phylogenetic clusters to examine diverse mechanisms of operon evolution across taxonomically diverse bacterial genomes. Although genomic-proximity networks have been previously utilized for functional prediction296, for understanding mechanisms underlying bacterial genomic organization into functionally related gene clusters297, and for studying transcriptional regulation of bacterial operons298, this work represents a novel application of genomic-proximity networks for the systematic exploration of operon evolution and the definition of operon clades resulting from locus divergence, loss, duplication, and rearrangement events.
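One way to picture the cluster quality score mentioned earlier in this discussion is as a combination of the mean Silhouette width, the Dunn index, and the proportion of sequences placed in clusters, evaluated for the clustering obtained at a given distance cutoff. The pure-Python sketch below is a minimal illustration under stated assumptions: the way the three terms are combined (a simple product), the treatment of unclustered sequences (label `None`), the function names, and the toy distance matrix are all hypothetical, and the exact formulation used in this work may differ.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def cluster_quality(dist, labels):
    """Combine Silhouette, Dunn index and proportion clustered into one score.

    dist   : symmetric matrix (list of lists) of pairwise evolutionary distances
    labels : cluster label for each sequence, or None if it is unclustered
    """
    clusters = {}
    for i, lab in enumerate(labels):
        if lab is not None:
            clusters.setdefault(lab, []).append(i)
    proportion = sum(len(m) for m in clusters.values()) / len(labels)
    if len(clusters) < 2:
        # Silhouette and Dunn index are undefined for fewer than two clusters.
        return {"silhouette": 0.0, "dunn": 0.0,
                "proportion": proportion, "score": 0.0}

    # Mean Silhouette width over clustered sequences.
    widths = []
    for lab, members in clusters.items():
        for i in members:
            if len(members) == 1:
                widths.append(0.0)  # usual convention for singleton clusters
                continue
            a = mean(dist[i][j] for j in members if j != i)
            b = min(mean(dist[i][j] for j in other)
                    for other_lab, other in clusters.items() if other_lab != lab)
            widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    silhouette = mean(widths)

    # Dunn index: smallest between-cluster distance / largest cluster diameter.
    labs = list(clusters)
    min_between = min(dist[i][j]
                      for x in range(len(labs)) for y in range(x + 1, len(labs))
                      for i in clusters[labs[x]] for j in clusters[labs[y]])
    max_diameter = max(dist[i][j] for m in clusters.values() for i in m for j in m)
    dunn = min_between / max_diameter if max_diameter > 0 else float("inf")

    # Assumed combination: a plain product of the three terms.
    return {"silhouette": silhouette, "dunn": dunn, "proportion": proportion,
            "score": silhouette * dunn * proportion}

# Toy distances, invented for demonstration: two tight clusters and one outlier.
dist = [
    [0.0, 0.1, 1.0, 1.0, 2.0],
    [0.1, 0.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 0.2, 2.0],
    [1.0, 1.0, 0.2, 0.0, 2.0],
    [2.0, 2.0, 2.0, 2.0, 0.0],
]
labels = ["A", "A", "B", "B", None]
quality = cluster_quality(dist, labels)
print(quality)
```

Under this reading, the score would be computed for the clustering obtained at each candidate distance cutoff, and the cutoff giving the highest score retained.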

I have applied this approach toward the classification and analysis of the evolution of 5 bacterial EPS operon machineries, namely cellulose, acetylated-cellulose, PNAG, alginate and pel. To my knowledge there has been only one previous attempt to classify EPS operons, focusing specifically on the cellulose system256. In that study, cellulose operons were categorized into four major types, based on the presence or absence of experimentally validated accessory loci involved in cellulose production. Here, I chose to base my analysis on the four core operon loci, BcsABZC, deemed essential for cellulose production, as a useful basis for large-scale operon prediction and classification. Cellulose operon clades identified in my study showed little consistency with the four major cellulose operon types presented in256, suggesting that the conservation of accessory loci is more variable across bacterial species compared to loci encoding core EPS functionalities. However, one of these operon types was identified in this analysis, representing the loss of the BcsC outer membrane transporter among a subset of Alpha-Proteobacterial genomes and suggesting a novel mechanism of cellulose export (Figure 4-6 – Ex. 4). Moreover, by focusing on only the core loci, my method was able to demonstrate that the loss of BcsC has resulted in an increased divergence of BcsB loci, which highlights the key role of BcsB as an intermediary between cellulose biogenesis and periplasmic transport (Figure 4-8).

Combining phylogenetic clustering of core EPS loci with the visualization of operon organizations through genomic-proximity networks, I was able to systematically examine the evolution of EPS systems through locus divergence, duplication, loss, and rearrangement across diverse bacterial phyla. In general, inner membrane components involved in EPS polymerization were found to be relatively conserved across all systems examined, while periplasmic and outer-membrane components showed a relatively increased degree of evolutionary divergence, which is likely to have important functional implications. For instance, although Gram-negative and Gram-positive PNAG PgaB sequences share a glycosyl-hydrolase domain, phylogenetic clustering identified a significant evolution of PgaB function in Gram-negatives involving an N-terminal fusion with a polysaccharide de-acetylase domain. A close examination of the structural conservation of the hydrolase domain further identified a number of unique insertion events giving rise to structural features that likely play an important functional role in regulating the function of Gram-negative PgaB. In addition, in cellulose and pel operons, rearrangement involving the functionally homologous periplasmic glycosyl-hydrolase (BcsZ) and glycosyl-hydrolase/de-acetylase (PelA) loci was found to be a defining feature of several operon clades. It is interesting to note that these rearrangements have resulted in a change in the ordering of BcsZ and PelA relative to their respective outer-membrane transport pores, consistent with their known roles in regulating EPS transport across the Gram-negative outer membrane268,299. These findings suggest that locus rearrangement and ordering serve as an important means of regulating EPS transport and may enable further EPS modification by accessory loci to occur300. Furthermore, identifying operon clades through a phylogenetic approach elucidated numerous instances of cellulose whole operon duplications arising from HGT of evolutionarily distinct operon clades. What might be the significance of these observed HGT events? It is possible that such large-scale duplications may serve as a dosage response to given environmental stressors, as observed in the duplication of bacterial multiple-drug transporter operons301. However, given that in these instances horizontally transferred cellulose operon clades possess distinct operon organizations, loss of polysaccharide modification loci, or incorporation of novel loci, it is possible that each operon contributes toward fine-tuning of biofilm production through differential modifications of cellulose structure. Interestingly, representative species of the two cellulose operon lineages identified in HGT events, namely the plant and human pathogens D. dadantii and S. enterica, respectively, are known to produce structurally distinct forms of cellulose with different properties and involvements in pathogenesis302,303. Furthermore, BcsB divergence was also seen to accompany the rearrangement or horizontal transfer of these operons, which further suggests that it may play a key role in this fine-tuning of cellulose production by coordinating the export of growing cellulose chains through the periplasm.

Synthesizing observations made through this systematic study of EPS evolution can also shed further light on the mechanisms of bacterial adaptation and the evolutionary mechanisms which shape biological networks. The essential polysaccharide synthase encoding loci for the cellulose (BcsA), acetylated-cellulose (WssB), PNAG (PgaC), alginate (Alg8), and pel (PelF) EPS systems all share sequence homology with the broadly distributed glycosyl-transferase protein superfamily304, indicating a common evolutionary origin. Therefore, it is likely that EPS systems have emerged through gene duplication and subsequent neofunctionalization of pre-existing domain folds into novel functional contexts. Subsequently, genomic rearrangements enable the recruitment of additional functionalities, e.g. additional inner-membrane components, periplasmic polysaccharide modification enzymes, and outer-membrane transport pores, into a co-transcribed operonic unit, comprising the nascent EPS operon systems. Through subsequent locus evolution, protein domain fusions, and genomic rearrangements, the efficiency of EPS production and regulation could be fine-tuned, akin to the processes shaping the operonic organization of protein complexes and metabolic pathways153,154, resulting in the establishment of a core set of operon loci. By providing a selective advantage to bacteria living in a given environmental niche, the chance of operon transmission through horizontal transfer to co-habiting bacteria is likely to increase, as a means of establishing communities for efficient exchange of metabolites and effective intercellular communication305–308. Once the operon becomes established in a novel genomic context155,309,310, further elaboration may occur, i.e. through the acquisition of additional loci via genomic rearrangement, locus sequence divergence, duplication, or loss, reflected by the emergence of distinct operon clades.

Chapter 5 - Conclusions and Future Directions

5 Summary

A central postulate of systems biology is that the discrete biological processes which comprise complex biological systems, e.g. the bacterial cell, can be understood in terms of the organization of underlying interactions which mediate gene or protein function. In turn, this approach enables the researcher to discover potentially novel roles played by genes and proteins by determining their manner of association in differing experimental contexts.

In this work I have endeavored to demonstrate that network-biological approaches can yield fruitful results in charting the organization of the bacterial cell, as shown through analyses of biological networks derived from large-scale proteomics, genetic screening, and genomic-context methodologies. The overarching aim of my work has been to utilize these complex datasets to systematically examine the effect of duplication on the functional diversification of genes and proteins involved in a broad range of biological processes. To achieve this aim, I developed and applied a series of integrated network-biological and comparative genomics approaches toward the analysis of a diverse array of biological networks. I showed that these diverse datasets each provide unique and complementary information through which the functional significance of gene duplication can be examined. Furthermore, biological networks can serve as a valuable starting point for further examination of how evolutionary processes shape gene and protein function in a variety of biological contexts and their consequent adaptive significance for bacterial survival in diverse environments.

5.1 E. coli Broad and DNA-Repair GI Networks

In this set of analyses I focused on investigating the functional organization of GI networks in E. coli by examining differences of patterns of epistatic enrichment at the level of functional modules, as well as phylogenetically co-conserved and paralogous gene sets, encompassing roles in diverse biological processes as well as DNA damage response.

Utilizing a set of previously defined E. coli functional modules74 I was able to identify several biologically relevant examples of complexes and processes enriched in GI cross-talk. In the context of the Broad-GI network assessed under normal growth conditions, several core biological processes were found to be enriched in aggravating interactions, including iron-cluster biogenesis operons, the phage shock operon and complexes involved in membrane stability as well as anaerobic respiration, and a distinct functional partitioning of GIs was observed between components of the DNA repair recombination system. Alleviating interactions were found to be enriched between genes encoding RND antibiotic resistance pumps and protein aggregation pathways that have been separately described by previous experimental studies. Overall, in E. coli, genes involved in metabolism and transport were found to be particularly correlated in their GI profiles and phylogenetically co-conserved, which tentatively supports a model by which genomic evolution in bacteria is driven by epistatic buffering of acquired genes with previously existing biological processes. Similarly, DNA repair processes, which were found to be significantly conserved across bacterial phyla, showed patterns of GI enrichment that were significantly altered under DNA-damaging conditions.

Paralogous DNA repair proteins also showed dramatic changes in their patterns of GIs dependent upon DNA-damaging conditions, which suggests that further condition-dependent GI screens will be a valuable means of elucidating the functional specialization of paralogous genes. Building upon the implications derived from analysis of the CE-PPI network, described below, where physical interactions were found to play an important role in dynamically integrating cell envelope associated proteins involved in environmental responses, GI networks further emphasize that genes and proteins show functional adaptability in mediating survival under diverse environmental conditions. Building upon these studies, GI networks could in future be applied to investigate the functional divergence of paralogs by screening under additional conditions, such as antibiotic exposure or growth under nutrient limitation. Additionally, GI networks hold great promise for further examining how operon organization has influenced the evolution of biological pathways and contributed to biological robustness.

5.2 E. coli Cell-Envelope PPI Network

PPI networks provide a complementary approach to GI networks as a means of elucidating the organization of the bacterial cell mediated through physically interacting protein complexes. However, the traditional means of discovery and characterization of such complexes remain limited owing to biases in observable phenotypes under physiologically relevant experimental conditions, as well as limited amenability to traditional aqueous-based approaches7. This limitation is further exacerbated by the ever-increasing number of fully sequenced bacterial genomes, which provide a steadily growing number of uncharacterized proteins311. In an important step towards overcoming this challenge, a large-scale AP-MS physical interaction network of the E. coli cell-envelope associated proteome (CE-PPI) was generated to further elucidate the roles of proteins involved in mediating bacterial survival and environmental sensing3. Utilizing a graphical clustering approach, I was able not only to identify many previously characterized protein complexes, but also to identify novel physical interactions among proteins with roles in diverse biological processes. Analysis of the CE-PPI network further suggests that well-known protein complexes may serve additional roles during bacterial growth by forming scaffolds of physical interactions, resulting in dynamic functional partitioning of the cell envelope. Thus these findings may correspond to the “physically linked groups of macromolecules (alias hyperstructures)”312,313 posited by the transertional hypothesis to be driven by the “coupled transcription, translation and insertion of nascent proteins into and through the membrane”312. A number of single-molecule fluorescence and in silico simulation experiments are providing support for the notion that the combination of genomic organization and the temporal and physico-chemical dynamics of gene expression may result in the heterogeneous partitioning of cytosolic and membrane proteins, e.g. the polar localization of cell-division machineries and dynamic recycling of RNA polymerases314–319. Expanding on these findings, future work could investigate whether correlations exist between gene co-expression and genomic proximity among interactors identified in the CE-PPI network, as well as within CE-PPI clusters.

On the other hand, integration of previously published genetic interaction (GI) networks revealed an additional level of functional integration across CE-PPI defined clusters with related functions. Although a greater number of GIs were found to occur between distinct clusters, a greater proportion of alleviating interactions were found to occur among proteins belonging to the same cluster, suggesting possible integration through physical interactions into biological pathways. Testing for statistical enrichment of GIs between CE-PPI clusters revealed an interesting trend in the distribution of GIs, with aggravating interactions partitioned between clusters involved in environmental response and nutrient acquisition, and alleviating interactions partitioned between clusters involved in cell envelope biogenesis and cell-division pathways.
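As an illustration of what such an enrichment test involves, the following is a minimal sketch of a one-sided hypergeometric test for an excess of GIs between a pair of clusters. It is not necessarily the statistical procedure used in this work; the function name, its parameters, and the counts in the usage note are hypothetical.

```python
from math import comb


def gi_enrichment_p(k, n, K, N):
    """One-sided hypergeometric tail probability P(X >= k).

    Hypothetical parameterization (an assumption for illustration):
      N: total number of gene pairs tested in the GI screen
      K: number of those pairs scored as a GI of the type of interest
      n: number of tested pairs that span the two clusters being compared
      k: number of those cluster-spanning pairs scored as a GI
    """
    upper = min(n, K)
    # Sum the probability of observing k, k+1, ..., upper GIs by chance.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)
```

For example, with made-up counts, `gi_enrichment_p(2, 2, 3, 10)` gives the probability that both of two cluster-spanning pairs are GIs when 3 of 10 tested pairs are GIs overall. In practice, p-values computed over many cluster pairs would also require correction for multiple testing.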

Previous research has reported a preponderance of duplicated cell envelope complexes associated with multidrug resistance and other functional categories associated with the cell envelope114. However, owing to the lack of available data, it has not been possible to systematically examine the functional implications of sequence divergence and functional diversification of paralogs. Thus, I performed an analysis of paralogous proteins identified in the CE-PPI network, spanning diverse biological functions and degrees of sequence identity. In general, the overlap of interactions among paralogous proteins appears to follow a continuum influenced by biological function. For instance, a trend of increasing number of interactions was observed for proteins which play crucial roles in sensing environmental signals (methyl-accepting chemotaxis proteins) and antibiotics (resistance-nodulation-division proteins), yet the degree of interaction overlap varied extensively among these members, which may reflect specialization of their roles under specific environmental conditions. It is interesting to note that the Trg galactose/glucose/ribose sensing MCP protein was found to be enriched for interactions with proteins involved in membrane biogenesis pathways, which may indicate functional coupling between cell growth and nutrient sensing, analogous to the previously noted sensing of electron transport and proton motive force for MCPs involved in aerotaxis320. It is also likely that divergence in protein interactions is influenced by the relative differences in gene expression under specific environmental conditions and phases of growth, as in the case of MCP and RND proteins321,322. The changing composition of MCP arrays could be an important contributor to the regulation of flagellar biosynthesis, resulting in the increased motility observed during post-exponential phase growth323. In the case of duplicate iron-sulfur containing DMSO/selenate reductase complex subunits, a high degree of overlap was observed, indicating a high degree of retention of interactions consistent with experimentally validated functional complementarity254. This provides an example of paralogous sub-functionalization in the emergence of novel energy production pathways, where a conserved catalytic activity has been adapted for novel substrate specificities mediated through additional complex subunits. In contrast, a lack of overlapping interactions was observed for inner membrane synthase and outer membrane pore subunits of paralogous colanic acid transporters, suggesting neofunctionalization through the gain of novel interactors with additional roles in exopolysaccharide export. Whether these novel interactors play a direct role in colanic acid biosynthesis and export remains to be elucidated, but may suggest a means by which exopolysaccharide biosynthesis programmes may be integrated to generate complex biofilm architectures324.
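The notion of interaction overlap between paralogs can be made concrete with a simple set-based measure. The sketch below uses the Jaccard index as one possible overlap score; this choice is an assumption for illustration and may differ from the measure used in the analysis above, and the function name and partner labels are hypothetical.

```python
def interaction_overlap(partners_a, partners_b):
    """Jaccard index of two paralogs' interaction partner sets.

    Returns 1.0 when both paralogs share all partners (full retention)
    and 0.0 when they share none (complete divergence of interactions).
    """
    a, b = set(partners_a), set(partners_b)
    union = a | b
    if not union:
        # Neither paralog has any detected partner; define the overlap as 0.
        return 0.0
    return len(a & b) / len(union)
```

For example, `interaction_overlap({"p1", "p2", "p3"}, {"p2", "p3", "p4"})` evaluates to 0.5, since two of the four distinct partners are shared. Under this measure, the high-overlap case would sit near 1.0 and the no-overlap case at 0.0, with other paralog pairs distributed along the continuum between them.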

Physical interactions can reveal novel insights into the spatial co-ordination of macromolecular assemblies and functional partitioning of the cell-envelope associated proteome. As further revealed by the foregoing analysis of the CE-PPI network, distinct CE-PPI clusters are also functionally integrated at the level of epistatic growth effects, which provides a valuable perspective on the co-ordination of biological processes directly impacting bacterial fitness. Furthermore, this work demonstrates that physical interaction screens provide a valuable approach for investigating the functional implications of gene duplication. Future work can expand our understanding of the adaptive significance of the CE-PPI network through integration with gene co-expression datasets, elucidating the roles of PPIs that are likely to be relevant in coordinating physiological responses to diverse environmental perturbations.

5.3 Bacterial Exopolysaccharide Secretion Machineries

Genetic and physical interaction networks provide a wealth of potential functional associations for genes and proteins that can be leveraged to understand the different biological roles of paralogs in diverse biological processes. However, the majority of these interactions are novel and require further experimental elucidation. Therefore, in my final chapter I focused on examining duplication in the context of a specific and well-defined biological process. Bacterial exopolysaccharide (EPS) secretion machineries play a crucial role in the formation of bacterial biofilms, which mediate survival in diverse environments, survival of antibiotic exposure, and pathogenesis255,278,280. I presented a focused examination of gene duplication and its functional implications in the context of five defined EPS machineries known to date, which are responsible for cellulose, acetylated-cellulose, PNAG, pel, and alginate biofilm production. The resulting work represents the first large-scale systematic survey of EPS operons undertaken, which enabled me to identify instances where gene duplication and other evolutionary events have led to the emergence of distinct operon organizations across hundreds of bacteria of diverse phylogeny and lifestyle.

To tackle the challenge of examining operon evolutionary events in a systematic and unbiased manner, I developed a novel phylogenetic-tree clustering approach to assign EPS operon encoded genes to evolutionarily related phylogenetic clusters. By visualizing phylogenetic clusters with genomic-proximity networks, I identified operon clades of bacterial species that share evolutionarily related operon loci with unique operon organizations shaped by gene duplication, gene loss and gene rearrangement. Several examples were presented highlighting instances in which gene duplication has contributed to the evolution of EPS operon systems, such as: whole duplications of cellulose operons among enterobacterial species, which were found to have originated through HGT events; tandem duplication events of the PgaC locus leading to the emergence of a novel PNAG operon clade; and whole duplications of alginate operons among Pseudomonas spp. Gene losses were the most frequently observed evolutionary event among EPS operons and often corresponded to the absence of outer membrane associated proteins in Gram-positive bacteria. In addition, genus-specific rearrangement was seen to be a key contributor to genus-specific clades for cellulose, alginate, pel and PNAG operons, which is likely to have important implications for EPS production. For example, in cellulose operons the rearrangement of the periplasmic endoglucanase encoding gene bcsZ occurs most often and is consistent with mediating the down- and up-regulation of cellulose production necessary for the virulence phenotypes of the animal and plant pathogens S. enterica and E. amylovora, respectively325,326. Previous findings that gene ordering is crucial for the regulation of metabolic pathways and the assembly of protein complexes153,154 further suggest that rearrangement is also likely to play an important role in regulating EPS biofilm production by coordinating the distinct steps of polysaccharide biosynthesis, modification and export. Beyond gene duplication, losses, and rearrangements, my analyses also revealed instances of gene fusion where distinct EPS operon functionalities were found to be encoded by a single locus. Gene fusion events can often occur between two genes when their protein products physically interact327. Here I found that a fusion event has occurred in cellulose operons between bcsZ and the outer-membrane pore encoding locus bcsC, in the cellulolytic bacterium Z. mobilis NCIMB 11163, which suggests a means by which the function of bcsZ may be modulated through physical interaction325. My work demonstrates that the operon clades identified are likely to be of adaptive significance and can greatly aid future research into the means by which bacteria tailor biofilm production toward their specific lifestyles or habitats.
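The general idea of relating phylogenetic cluster assignments to operon organization can be sketched in a few lines. The example below is a simplified illustration only, not the implementation developed in this work: it assumes each genome's operon is already given as an ordered list of phylogenetic cluster labels, and all function names and labels are hypothetical.

```python
from collections import Counter, defaultdict


def proximity_edges(operons):
    """Count how often two cluster labels are adjacent across all operons.

    `operons` maps a genome identifier to the ordered list of cluster
    labels of its operon genes. Each unordered adjacent pair becomes a
    weighted edge of a genomic-proximity network.
    """
    edges = Counter()
    for order in operons.values():
        for left, right in zip(order, order[1:]):
            edges[tuple(sorted((left, right)))] += 1
    return edges


def group_by_organization(operons):
    """Group genomes that share an identical ordered operon organization."""
    groups = defaultdict(list)
    for genome, order in operons.items():
        groups[tuple(order)].append(genome)
    return dict(groups)
```

With toy input such as `{"g1": ["A", "B", "C"], "g2": ["A", "B", "C"], "g3": ["A", "C", "B"]}`, the first function weights the A-B adjacency twice and the B-C adjacency three times, while the second separates `g3` from the other two genomes because its gene order differs. A duplicated locus would appear as a repeated label, a loss as a missing label, and a rearrangement as a changed order.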

One important limitation of this study is that only genes known to be essential for EPS production were considered. It is known that bacteria may employ additional accessory loci in regulating EPS machineries. For instance, a recent electron-microscopy study of the cellulose synthase complex indicates that the core bcsABZC loci form a “macrocomplex” along with accessory loci (bcsGFQRE) that were not examined in this present study328. Therefore, the dataset I have generated will provide a valuable starting point for a future survey of additional loci associated with identified EPS operon clades, which will lead to a greater understanding of the evolution of EPS machineries.

In summary, this work holds great promise for generating novel hypotheses to guide future experimental studies of the roles and mechanisms by which bacteria modulate EPS production to adapt to diverse environmental niches. For instance, it has been shown that duplication has enabled the adaptation of cellulose production during anaerobic growth in Enterobacter sp. FY-07329, and that the plant pathogen Dickeya dadantii and the human pathogen Salmonella enterica produce cellulose nanofibers with different widths and branching patterns, which are likely due to differences in cellulose operon organization302. Furthermore, this study provides a novel method and approach which can be applied toward the systematic study of bacterial operons in general, generating novel insights into how the expansion of protein families through duplication has contributed to the elaboration of protein complexes and metabolic pathways across diverse bacterial phyla.

References

1. Babu, M. et al. Quantitative Genome-Wide Genetic Interaction Screens Reveal Global Epistatic Relationships of Protein Complexes in Escherichia coli. PLoS Genet. 10, e1004120 (2014).

2. Kumar, A. et al. Conditional Epistatic Interaction Maps Reveal Global Functional Rewiring of Genome Integrity Pathways in Escherichia coli. Cell Rep. 14, 648–661 (2016).

3. Babu, M. et al. Global landscape of cell envelope protein complexes in Escherichia coli. Nat. Biotechnol. (2017). doi:10.1038/nbt.4024

4. Han, M.-J. & Lee, S. Y. The Escherichia coli Proteome: Past, Present, and Future Prospects. Microbiol. Mol. Biol. Rev. 70, 362–439 (2006).

5. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–62 (1997).

6. Riley, M. et al. Escherichia coli K-12: a cooperatively developed annotation snapshot--2005. Nucleic Acids Res. 34, 1–9 (2006).

7. Díaz-Mejía, J. J., Babu, M. & Emili, A. Computational and experimental approaches to chart the Escherichia coli cell-envelope-associated proteome and interactome. FEMS Microbiol. Rev. 33, 66–97 (2009).

8. Keseler, I. M. et al. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res. 39, D583-90 (2011).

9. Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700-5 (2012).

10. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).

11. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 44, D471-80 (2016).

12. Mewes, H. W. et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 30, 31–4 (2002).

13. Smith, V., Botstein, D. & Brown, P. O. Genetic footprinting: a genomic strategy for determining a gene’s function given its sequence. Proc. Natl. Acad. Sci. U. S. A. 92, 6479–83 (1995).

14. Wagner, C. et al. Genetic analysis and functional characterization of the Streptococcus pneumoniae vic operon. Infect. Immun. 70, 6121–8 (2002).

15. Bernhardt, T. G. & de Boer, P. A. J. Screening for synthetic lethal mutants in Escherichia coli and identification of EnvC (YibP) as a periplasmic septal ring factor with murein hydrolase activity. Mol. Microbiol. 52, 1255–69 (2004).

16. Buchanan, G., Sargent, F., Berks, B. C. & Palmer, T. A genetic screen for suppressors of Escherichia coli Tat signal peptide mutations establishes a critical role for the second arginine within the twin-arginine motif. Arch. Microbiol. 177, 107–12 (2001).

17. Babu, M. et al. in Methods in molecular biology (Clifton, N.J.) 564, 373–400 (2009).

18. Rajagopala, S. V et al. The binary protein-protein interaction landscape of Escherichia coli. Nat. Biotechnol. 32, 285–290 (2014).

19. Roux, K. J., Kim, D. I. & Burke, B. BioID: a screen for protein-protein interactions. Curr. Protoc. protein Sci. 74, Unit 19.23. (2013).

20. Richmond, C. S., Glasner, J. D., Mau, R., Jin, H. & Blattner, F. R. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res. 27, 3821–35 (1999).

21. Butland, G. et al. eSGA: E. coli synthetic genetic array analysis. Nat. Methods 5, 789–95 (2008).

22. Lockhart, D. J. & Winzeler, E. A. Genomics, gene expression and DNA arrays. Nature 405, 827–36 (2000).

23. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–70 (1995).

24. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95, 14863–8 (1998).

25. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).

26. Young, K. H. Yeast two-hybrid: so many interactions, (in) so little time... Biol. Reprod. 58, 302–11 (1998).

27. Criekinge, W. & Beyaert, R. Yeast two-hybrid: State of the art. Biol. Proced. Online 2, 1–38 (1999).

28. Uetz, P. et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000).

29. Perkins, J. R., Diboun, I., Dessailly, B. H., Lees, J. G. & Orengo, C. Transient Protein-Protein Interactions: Structural, Functional, and Network Properties. Structure 18, 1233–1243 (2010).

30. Joung, J. K., Ramm, E. I. & Pabo, C. O. A bacterial two-hybrid selection system for studying protein-DNA and protein-protein interactions. Proc. Natl. Acad. Sci. 97, 7382–7387 (2000).

31. Clarke, P., Cuív, P. O. & O’Connell, M. Novel mobilizable prokaryotic two-hybrid system vectors for high-throughput protein interaction mapping in Escherichia coli by bacterial conjugation. Nucleic Acids Res. 33, e18–e18 (2005).

32. Laddomada, F., Miyachiro, M. M. & Dessen, A. Structural Insights into Protein-Protein Interactions Involved in Bacterial Cell Wall Biogenesis. Antibiot. (Basel, Switzerland) 5, 14 (2016).

33. Wuchty, S. & Uetz, P. Protein-protein Interaction Networks of E. coli and S. cerevisiae are similar. Sci. Rep. 4, 7187 (2014).

34. Monti, M., Orrù, S., Pagnozzi, D. & Pucci, P. Interaction Proteomics. Biosci. Rep. 25, 45–56 (2005).

35. Pardo, M. & Choudhary, J. S. Assignment of Protein Interactions from Affinity Purification/Mass Spectrometry Data. J. Proteome Res. 11, 1462–1474 (2012).

36. Armean, I. M., Lilley, K. S. & Trotter, M. W. B. Popular Computational Methods to Assess Multiprotein Complexes Derived From Label-Free Affinity Purification and Mass Spectrometry (AP-MS) Experiments. Mol. Cell. Proteomics 12, 1–13 (2013).

37. Choi, H. et al. SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat. Methods 8, 70–3 (2011).

38. Collins, S. R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell. Proteomics 6, 439–50 (2007).

39. Sowa, M. E., Bennett, E. J., Gygi, S. P. & Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138, 389–403 (2009).

40. Guruharsha, K. G. et al. A protein complex network of Drosophila melanogaster. Cell 147, 690–703 (2011).

41. Hakes, L., Robertson, D. L., Oliver, S. G. & Lovell, S. C. Protein interactions from complexes: a structural perspective. Comp. Funct. Genomics 2007, 49356 (2007).

42. Zeghouf, M. et al. Sequential Peptide Affinity (SPA) system for the identification of mammalian and bacterial protein complexes. J. Proteome Res. 3, 463–8

43. Butland, G. et al. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 433, 531–7 (2005).

44. Arifuzzaman, M. et al. Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res. 16, 686–91 (2006).

45. Bren, A. & Eisenbach, M. How signals are heard during bacterial chemotaxis: protein-protein interactions in sensory signal propagation. J. Bacteriol. 182, 6865–73 (2000).

46. Viala, J. P. M. & Bouveret, E. Protein-Protein Interaction: Tandem Affinity Purification in Bacteria. Methods Mol. Biol. 1615, 221–232 (2017).

47. Blasche, S. & Koegl, M. Analysis of protein-protein interactions using LUMIER assays. Methods Mol. Biol. 1064, 17–27 (2013).

48. Varnaitė, R. & MacNeill, S. A. Meet the neighbors: Mapping local protein interactomes by proximity-dependent labeling with BioID. Proteomics 16, 2503–2518 (2016).

49. Kristensen, A. R., Gsponer, J. & Foster, L. J. A high-throughput approach for measuring temporal changes in the interactome. Nat. Methods 9, 907–9 (2012).

50. Wan, C. et al. Panorama of ancient metazoan macromolecular complexes. Nature 525, 339–44 (2015).

51. Dixon, S. J., Costanzo, M., Baryshnikova, A., Andrews, B. & Boone, C. Systematic Mapping of Genetic Interaction Networks. Annu. Rev. Genet. 43, 601–625 (2009).

52. Tong, A. H. Y. et al. Global Mapping of the Yeast Genetic Interaction Network. Science 303, 808–813 (2004).

53. Le Meur, N. & Gentleman, R. Modeling synthetic lethality. Genome Biol. 9, R135 (2008).

54. Tong, A. H. Y. & Boone, C. Synthetic genetic array analysis in Saccharomyces cerevisiae. Methods Mol. Biol. 313, 171–92 (2006).

55. Collins, S. R., Schuldiner, M., Krogan, N. J. & Weissman, J. S. A strategy for extracting and analyzing large-scale quantitative epistatic interaction data. Genome Biol. 7, R63 (2006).

56. Boone, C., Bussey, H. & Andrews, B. J. Exploring genetic interactions and networks with yeast. Nat. Rev. Genet. 8, 437–449 (2007).

57. Kelley, R. & Ideker, T. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol. 23, 561–6 (2005).

58. Snitkin, E. S. & Segrè, D. Epistatic Interaction Maps Relative to Multiple Metabolic Phenotypes. PLoS Genet. 7, e1001294 (2011).

59. Collins, S. R. et al. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 446, 806–810 (2007).

60. Wilmes, G. M. et al. A Genetic Interaction Map of RNA-Processing Factors Reveals Links between Sem1/Dss1-Containing Complexes and mRNA Export and Splicing. Mol. Cell 32, 735–746 (2008).

61. Schuldiner, M. et al. Exploration of the Function and Organization of the Yeast Early Secretory Pathway through an Epistatic Miniarray Profile. Cell 123, 507–519 (2005).

62. Costanzo, M. et al. The Genetic Landscape of a Cell. Science 327, 425–431 (2010).

63. Ulitsky, I., Shlomi, T., Kupiec, M. & Shamir, R. From E-MAPs to module maps: dissecting quantitative genetic interactions using physical interactions. Mol. Syst. Biol. 4, 209 (2008).

64. Roguev, A. et al. Conservation and Rewiring of Functional Modules Revealed by an Epistasis Map in Fission Yeast. Science 322, 405–410 (2008).

65. Typas, A. et al. High-throughput, quantitative analyses of genetic interactions in E. coli. Nat. Methods 5, 781–7 (2008).

66. Babu, M. et al. Genetic interaction maps in Escherichia coli reveal functional crosstalk among cell envelope biogenesis pathways. PLoS Genet. 7, e1002377 (2011).

67. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U. S. A. 96, 4285–8 (1999).

68. Enault, F., Suhre, K., Abergel, C., Poirot, O. & Claverie, J.-M. Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics 19 Suppl 1, i105-7 (2003).

69. Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324–8 (1998).

70. Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U. S. A. 96, 2896–901 (1999).

71. Korbel, J. O., Jensen, L. J., von Mering, C. & Bork, P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotechnol. 22, 911–7 (2004).

72. Yellaboina, S., Goyal, K. & Mande, S. C. Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: comparison with high-throughput experimental data. Genome Res. 17, 527–35 (2007).

73. Marcotte, E. M. et al. Detecting protein function and protein-protein interactions from genome sequences. Science 285, 751–3 (1999).

74. Peregrín-Alvarez, J. M., Xiong, X., Su, C. & Parkinson, J. The Modular Organization of Protein Interactions in Escherichia coli. PLoS Comput. Biol. 5, e1000523 (2009).

75. Caufield, J. H., Abreu, M., Wimble, C. & Uetz, P. Protein complexes in bacteria. PLoS Comput. Biol. 11, e1004107 (2015).

76. Peregrín-Alvarez, J. M., Sanford, C. & Parkinson, J. The conservation and evolutionary modularity of metabolism. Genome Biol. 10, R63 (2009).

77. Salgado, H. et al. RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 32, D303-6 (2004).

78. Rocha, E. P. Order and disorder in bacterial genomes. Curr. Opin. Microbiol. 7, 519–527 (2004).

79. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999).

80. Date, S. V & Marcotte, E. M. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol. 21, 1055–62 (2003).

81. Szklarczyk, D. et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39, D561-8 (2011).

82. Hu, P. et al. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 7, e96 (2009).

83. Silva, M. T. Classical labeling of bacterial pathogens according to their lifestyle in the host: inconsistencies and alternatives. Front. Microbiol. 3, 71 (2012).

84. Tett, A. et al. Unexplored diversity and strain-level structure of the skin microbiome associated with psoriasis. NPJ biofilms microbiomes 3, 14 (2017).

85. Clemente, J. C., Manasson, J. & Scher, J. U. The role of the gut microbiome in systemic inflammatory disease. BMJ 360, j5145 (2018).

86. Pagani, I. et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 40, D571-9 (2012).

87. Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).

88. Stewart, P. S. Mechanisms of antibiotic resistance in bacterial biofilms. Int. J. Med. Microbiol. 292, 107–13 (2002).

89. Maiden, M. C. Horizontal genetic exchange, evolution, and spread of antibiotic resistance in bacteria. Clin. Infect. Dis. 27 Suppl 1, S12-20 (1998).

90. Sachs, J. L., Skophammer, R. G., Bansal, N. & Stajich, J. E. Evolutionary origins and diversification of proteobacterial mutualists. Proceedings. Biol. Sci. 281, 20132146 (2014).

91. Eiler, A. et al. Tuning fresh: radiation through rewiring of central metabolism in streamlined bacteria. ISME J. 10, 1902–14 (2016).

92. Adams, M. D., Chan, E. R., Molyneaux, N. D. & Bonomo, R. A. Genomewide analysis of divergence of antibiotic resistance determinants in closely related isolates of Acinetobacter baumannii. Antimicrob. Agents Chemother. 54, 3569–77 (2010).

93. Toft, C. & Andersson, S. G. E. Evolutionary microbial genomics: insights into bacterial host adaptation. Nat. Rev. Genet. 11, 465–75 (2010).

94. Andersson, D. I., Jerlström-Hultqvist, J. & Näsvall, J. Evolution of new functions de novo and from preexisting genes. Cold Spring Harb. Perspect. Biol. 7, a017996 (2015).

95. Taylor, J. S. & Raes, J. Duplication and divergence: the evolution of new genes and old ideas. Annu. Rev. Genet. 38, 615–43 (2004).

96. Adler, M., Anjum, M., Berg, O. G., Andersson, D. I. & Sandegren, L. High Fitness Costs and Instability of Gene Duplications Reduce Rates of Evolution of New Genes by Duplication-Divergence Mechanisms. Mol. Biol. Evol. 31, 1526–1535 (2014).

97. Reams, A. B., Kofoid, E., Savageau, M. & Roth, J. R. Duplication frequency in a population of Salmonella enterica rapidly approaches steady state with or without recombination. Genetics 184, 1077–94 (2010).

98. Anderson, R. P. & Roth, J. R. Tandem Genetic Duplications in Phage and Bacteria. Annu. Rev. Microbiol. 31, 473–505 (1977).

99. Koskiniemi, S., Sun, S., Berg, O. G. & Andersson, D. I. Selection-Driven Gene Loss in Bacteria. PLoS Genet. 8, e1002787 (2012).

100. Innan, H. & Kondrashov, F. The evolution of gene duplications: classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010).

101. Altenhoff, A. M., Studer, R. A., Robinson-Rechavi, M. & Dessimoz, C. Resolving the Ortholog Conjecture: Orthologs Tend to Be Weakly, but Significantly, More Similar in Function than Paralogs. PLoS Comput. Biol. 8, e1002514 (2012).

102. Esposito, M. & Moreno-Hagelsieb, G. Non-synonymous to synonymous substitutions suggest that orthologs tend to keep their functions, while paralogs are a source of functional novelty. bioRxiv 354704 (2018). doi:10.1101/354704

103. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

104. Kuzniar, A., van Ham, R. C. H. J., Pongor, S. & Leunissen, J. A. M. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–51 (2008).

105. Overbeek, R. et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 42, D206–D214 (2014).

106. Angiuoli, S. V et al. Toward an online repository of Standard Operating Procedures (SOPs) for (meta)genomic annotation. OMICS 12, 137–41 (2008).

107. Yu, N. Y. et al. PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26, 1608–15 (2010).

108. Claudel-Renard, C., Chevalet, C., Faraut, T. & Kahn, D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res. 31, 6633–9 (2003).

109. Hung, S. S., Wasmuth, J., Sanford, C. & Parkinson, J. DETECT--a density estimation tool for enzyme classification and its application to Plasmodium falciparum. Bioinformatics 26, 1690–8 (2010).

110. Karp, P. D. et al. Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief. Bioinform. 11, 40–79 (2010).

111. Sigrist, C. J. A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3, 265–74 (2002).

112. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222-30 (2014).

113. Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–63 (1998).

114. Serres, M. H., Kerr, A. R. W., McCormack, T. J. & Riley, M. Evolution by leaps: gene duplication in bacteria. Biol. Direct 4, 46 (2009).

115. Kondrashov, F. A., Rogozin, I. B., Wolf, Y. I. & Koonin, E. V. Selection in the evolution of gene duplications. Genome Biol. 3, RESEARCH0008 (2002).

116. Gevers, D., Vandepoele, K., Simillon, C. & Van de Peer, Y. Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol. 12, 148–54 (2004).

117. Montaner, D., Minguez, P., Al-Shahrour, F. & Dopazo, J. Gene set internal coherence in the context of functional profiling. BMC Genomics 10, 197 (2009).

118. Gabaldón, T. & Koonin, E. V. Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. 14, 360–366 (2013).

119. Wall, D. P., Fraser, H. B. & Hirsh, A. E. Detecting putative orthologs. Bioinformatics 19, 1710–1 (2003).

120. Tatusov, R. L., Koonin, E. V & Lipman, D. J. A genomic perspective on protein families. Science 278, 631–7 (1997).

121. Remm, M., Storm, C. E. & Sonnhammer, E. L. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–52 (2001).

122. Fulton, D. L. et al. Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics 7, 270 (2006).

123. Powell, S. et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42, D231-9 (2014).

124. Chen, F., Mackey, A. J., Stoeckert, C. J. & Roos, D. S. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 34, D363-8 (2006).

125. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).

126. Ponting, C. P. Biological function in the twilight zone of sequence conservation. BMC Biol. 15, 71 (2017).

127. Moran, N. A. Accelerated evolution and Muller’s rachet in endosymbiotic bacteria. Proc. Natl. Acad. Sci. U. S. A. 93, 2873–8 (1996).

128. Milo, R. et al. Network motifs: simple building blocks of complex networks. Science 298, 824–7 (2002).

129. Zotenko, E., Mestre, J., O’Leary, D. P. & Przytycka, T. M. Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput. Biol. 4, e1000140 (2008).

130. Muller, H. J. Some Genetic Aspects of Sex. Am. Nat. 66, 118–138 (1932).

131. LaBar, T. & Adami, C. Evolution of drift robustness in small populations. Nat. Commun. 8, 1012 (2017).

132. Baba, T., Huan, H.-C., Datsenko, K., Wanner, B. L. & Mori, H. The applications of systematic in-frame, single-gene knockout mutant collection of Escherichia coli K-12. Methods Mol. Biol. 416, 183–94 (2008).

133. Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901–6 (1999).

134. Elena, S. F. & Lenski, R. E. Evolution experiments with microorganisms: the dynamics and genetic bases of adaptation. Nat. Rev. Genet. 4, 457–69 (2003).

135. Khan, A. I., Dinh, D. M., Schneider, D., Lenski, R. E. & Cooper, T. F. Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332, 1193–6 (2011).

136. Sauro, H. M. Modularity defined. Mol. Syst. Biol. 4, 166 (2008).

137. Silhavy, T. J., Kahne, D. & Walker, S. The bacterial cell envelope. Cold Spring Harb. Perspect. Biol. 2, a000414 (2010).

138. Brohée, S. & van Helden, J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488 (2006).

139. Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003).

140. Frey, B. J. & Dueck, D. Clustering by Passing Messages Between Data Points. Science 315, 972–976 (2007).

141. van Dongen, S. & Abreu-Goodger, C. in Methods in molecular biology (Clifton, N.J.) 804, 281–295 (2012).

142. Vlasblom, J. & Wodak, S. J. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics 10, 99 (2009).

143. Killcoyne, S., Carter, G. W., Smith, J. & Boyle, J. Cytoscape: a community-based framework for network modeling. Methods Mol. Biol. 563, 219–39 (2009).

144. Saito, R. et al. A travel guide to Cytoscape plugins. Nat. Methods 9, 1069–76 (2012).

145. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabási, A. L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–5 (2002).

146. Porcar, M., Latorre, A. & Moya, A. What Symbionts Teach us about Modularity. Front. Bioeng. Biotechnol. 1, 14 (2013).

147. Koonin, E. V., Makarova, K. S. & Aravind, L. Horizontal Gene Transfer in Prokaryotes: Quantification and Classification. Annu. Rev. Microbiol. 55, 709–742 (2001).

148. Hacker, J. & Kaper, J. B. Pathogenicity Islands and the Evolution of Microbes. Annu. Rev. Microbiol. 54, 641–679 (2000).

149. Reuter, S. et al. Parallel independent evolution of pathogenicity within the genus Yersinia. Proc. Natl. Acad. Sci. 111, 6768–6773 (2014).

150. Karp, P. D. et al. The EcoCyc Database. EcoSal Plus 6, (2014).

151. Bratlie, M. S., Johansen, J. & Drabløs, F. Relationship between operon preference and functional properties of persistent genes in bacterial genomes. BMC Genomics 11, 71 (2010).

152. Pannier, L., Merino, E., Marchal, K. & Collado-Vides, J. Effect of genomic distance on coexpression of coregulated genes in E. coli. PLoS One 12, e0174887 (2017).

153. Wells, J. N., Bergendahl, L. T. & Marsh, J. A. Operon Gene Order Is Optimized for Ordered Protein Complex Assembly. Cell Rep. 14, 679–685 (2016).

154. Kovács, K., Hurst, L. D. & Papp, B. Stochasticity in protein levels drives colinearity of gene order in metabolic operons of Escherichia coli. PLoS Biol. 7, e1000115 (2009).

155. Price, M. N., Huang, K. H., Arkin, A. P. & Alm, E. J. Operon formation is driven by co-regulation and not by horizontal gene transfer. Genome Res. 15, 809–19 (2005).

156. Omelchenko, M. V, Makarova, K. S., Wolf, Y. I., Rogozin, I. B. & Koonin, E. V. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biol. 4, R55 (2003).

157. Moussatova, A., Kandt, C., O’Mara, M. L. & Tieleman, D. P. ATP-binding cassette transporters in Escherichia coli. Biochim. Biophys. Acta 1778, 1757–71 (2008).

158. Munita, J. M. & Arias, C. A. Mechanisms of Antibiotic Resistance. Microbiol. Spectr. 4, 481–511 (2016).

159. Perrin, E. et al. Subfunctionalization influences the expansion of bacterial multidrug antibiotic resistance. BMC Genomics 18, 834 (2017).

160. Singh, A. H., Wolf, D. M., Wang, P. & Arkin, A. P. Modularity of stress response evolution. Proc. Natl. Acad. Sci. 105, 7500–7505 (2008).

161. Jiang, C., Brown, P. J. B., Ducret, A. & Brun, Y. V. Sequential evolution of bacterial morphology by co-option of a developmental regulator. Nature 506, 489–93 (2014).

162. Yuan, J., Zweers, J. C., van Dijl, J. M. & Dalbey, R. E. Protein transport across and into cell membranes in bacteria and archaea. Cell. Mol. Life Sci. 67, 179–199 (2010).

163. Saurin, W., Hofnung, M. & Dassa, E. Getting in or out: early segregation between importers and exporters in the evolution of ATP-binding cassette (ABC) transporters. J. Mol. Evol. 48, 22–41 (1999).

164. Jiang, X. & Fares, M. A. Functional diversification of the twin-arginine translocation pathway mediates the emergence of novel ecological adaptations. Mol. Biol. Evol. 28, 3183–93 (2011).

165. Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286-93 (2016).

166. Jothi, R., Przytycka, T. M. & Aravind, L. Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment. BMC Bioinformatics 8, 173 (2007).

167. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–504 (2003).

168. Roche, B. et al. Iron/sulfur proteins biogenesis in prokaryotes: formation, regulation and diversity. Biochim. Biophys. Acta 1827, 455–69 (2013).

169. Lu, J., Yang, J., Tan, G. & Ding, H. Complementary roles of SufA and IscA in the biogenesis of iron–sulfur clusters in Escherichia coli. Biochem. J. 409, 535–543 (2008).

170. Hvorup, R. N. et al. Asymmetry in the Structure of the ABC Transporter-Binding Protein Complex BtuCD-BtuF. Science 317, 1387–1390 (2007).

171. Arenas, F. A. et al. The Escherichia coli btuE gene, encodes a glutathione peroxidase that is induced under oxidative stress conditions. Biochem. Biophys. Res. Commun. 398, 690–4 (2010).

172. Py, B., Moreau, P. L. & Barras, F. Fe–S clusters, fragile sentinels of the cell. Curr. Opin. Microbiol. 14, 218–223 (2011).

173. Nagakubo, S., Nishino, K., Hirata, T. & Yamaguchi, A. The putative response regulator BaeR stimulates multidrug resistance of Escherichia coli via a novel multidrug exporter system, MdtABC. J. Bacteriol. 184, 4161–7 (2002).

174. Parra-Lopez, C., Lin, R., Aspedon, A. & Groisman, E. A. A Salmonella protein that is required for resistance to antimicrobial peptides and transport of potassium. EMBO J. 13, 3964–72 (1994).

175. Silvestro, A., Pommier, J. & Giordano, G. The inducible trimethylamine-N-oxide reductase of Escherichia coli K12: biochemical and immunological studies. Biochim. Biophys. Acta - Protein Struct. Mol. Enzymol. 954, 1–13 (1988).

176. Mani, R. et al. Defining genetic interaction. Proc. Natl. Acad. Sci. U. S. A. 105, 3461–6 (2008).

177. Jovanovic, G., Lloyd, L. J., Stumpf, M. P. H., Mayhew, A. J. & Buck, M. Induction and function of the phage shock protein extracytoplasmic stress response in Escherichia coli. J. Biol. Chem. 281, 21147–61 (2006).

178. Messaoudi, N. et al. Global stress response in a prokaryotic model of DJ-1-associated Parkinsonism. J. Bacteriol. 195, 1167–78 (2013).

179. Voskuil, M. I., Bartek, I. L., Visconti, K. & Schoolnik, G. K. The response of Mycobacterium tuberculosis to reactive oxygen and nitrogen species. Front. Microbiol. 2, 105 (2011).

180. Al Mamun, A. A. M. et al. Identity and function of a large gene network underlying mutagenic repair of DNA breaks. Science 338, 1344–8 (2012).

181. Guy, C. P. et al. Rep provides a second motor at the replisome to promote duplication of protein-bound DNA. Mol. Cell 36, 654–66 (2009).

182. LeBlanc, D. J. & Mortlock, R. P. Metabolism of D-arabinose: origin of a D-ribulokinase activity in Escherichia coli. J. Bacteriol. 106, 82–9 (1971).

183. Galvani, C., Terry, J. & Ishiguro, E. E. Purification of the RelB and RelE proteins of Escherichia coli: RelE binds to RelB and to ribosomes. J. Bacteriol. 183, 2700–3 (2001).

184. Courcelle, J. & Hanawalt, P. C. RecA-dependent recovery of arrested DNA replication forks. Annu. Rev. Genet. 37, 611–46 (2003).

185. Monje-Casas, F., Jurado, J., Prieto-Alamo, M. J., Holmgren, A. & Pueyo, C. Expression analysis of the nrdHIEF operon from Escherichia coli. Conditions that trigger the transcript level in vivo. J. Biol. Chem. 276, 18031–7 (2001).

186. Wood, J. M. Leucine transport in Escherichia coli. The resolution of multiple transport systems and their coupling to metabolic energy. J. Biol. Chem. 250, 4477–85 (1975).

187. González, P. J., Correia, C., Moura, I., Brondino, C. D. & Moura, J. J. G. Bacterial nitrate reductases: Molecular and biological aspects of nitrate reduction. J. Inorg. Biochem. 100, 1015–23 (2006).

188. Moreno-Vivián, C., Cabello, P., Martínez-Luque, M., Blasco, R. & Castillo, F. Prokaryotic nitrate reduction: molecular properties and functional distinction among bacterial nitrate reductases. J. Bacteriol. 181, 6573–84 (1999).

189. Jiang, M., Chen, M., Guo, Z.-F. & Guo, Z. A bicarbonate cofactor modulates 1,4-dihydroxy-2-naphthoyl-coenzyme A synthase in menaquinone biosynthesis of Escherichia coli. J. Biol. Chem. 285, 30159–69 (2010).

190. Blumer, C. et al. Regulation of type 1 fimbriae synthesis and biofilm formation by the transcriptional regulator LrhA of Escherichia coli. Microbiology 151, 3287–98 (2005).

191. Kitagawa, R., Takaya, A. & Yamamoto, T. Dual regulatory pathways of flagellar gene expression by ClpXP protease in enterohaemorrhagic Escherichia coli. Microbiology 157, 3094–103 (2011).

192. Williams, A. B. & Foster, P. L. The Escherichia coli histone-like protein HU has a role in stationary phase adaptive mutation. Genetics 177, 723–35 (2007).

193. Drees, J. C., Chitteni-Pattu, S., McCaslin, D. R., Inman, R. B. & Cox, M. M. Inhibition of RecA protein function by the RdgC protein from Escherichia coli. J. Biol. Chem. 281, 4708–17 (2006).

194. Lusetti, S. L., Drees, J. C., Stohl, E. A., Seifert, H. S. & Cox, M. M. The DinI and RecX proteins are competing modulators of RecA function. J. Biol. Chem. 279, 55073–9 (2004).

195. Guo, G., Ding, Y. & Weiss, B. nfi, the gene for endonuclease V in Escherichia coli K-12. J. Bacteriol. 179, 310–6 (1997).

196. Dizdaroglu, M. et al. Substrate specificity and excision kinetics of Escherichia coli endonuclease VIII (Nei) for modified bases in DNA damaged by free radicals. Biochemistry 40, 12150–6 (2001).

197. Wyatt, M. D. & Pittman, D. L. Methylating agents and DNA repair responses: Methylated bases and sources of strand breaks. Chem. Res. Toxicol. 19, 1580–94 (2006).

198. Guarino, E., Jiménez-Sánchez, A. & Guzmán, E. C. Defective ribonucleoside diphosphate reductase impairs replication fork progression in Escherichia coli. J. Bacteriol. 189, 3496–501 (2007).

199. Martin, J. E. & Imlay, J. A. The alternative aerobic ribonucleotide reductase of Escherichia coli, NrdEF, is a manganese-dependent enzyme that enables cell replication during periods of iron starvation. Mol. Microbiol. 80, 319–34 (2011).

200. Christensen-Dalsgaard, M., Jørgensen, M. G. & Gerdes, K. Three new RelE-homologous mRNA interferases of Escherichia coli differentially induced by environmental stresses. Mol. Microbiol. 75, 333–48 (2010).

201. Arumugam, S., Petrašek, Z. & Schwille, P. MinCDE exploits the dynamic nature of FtsZ filaments for its spatial regulation. Proc. Natl. Acad. Sci. U. S. A. 111, E1192-200 (2014).

202. Deshpande, R., Vandersluis, B. & Myers, C. L. Comparison of profile similarity measures for genetic interaction networks. PLoS One 8, e68664 (2013).

203. Clark, R. L. & Neidhardt, F. C. Roles of the two lysyl-tRNA synthetases of Escherichia coli: analysis of nucleotide sequences and mutant behavior. J. Bacteriol. 172, 3237–43 (1990).

204. Onesti, S., Miller, A. D. & Brick, P. The crystal structure of the lysyl-tRNA synthetase (LysU) from Escherichia coli. Structure 3, 163–76 (1995).

205. Bullwinkle, T. J. et al. Oxidation of cellular amino acid pools leads to cytotoxic mistranslation of the genetic code. Elife 3, (2014).

206. Chilcott, G. S. & Hughes, K. T. Coupling of flagellar gene expression to flagellar assembly in Salmonella enterica serovar typhimurium and Escherichia coli. Microbiol. Mol. Biol. Rev. 64, 694–708 (2000).

207. Raymond, K. N., Dertz, E. A. & Kim, S. S. Enterobactin: an archetype for microbial iron transport. Proc. Natl. Acad. Sci. U. S. A. 100, 3584–8 (2003).

208. Laxman, S. et al. Sulfur amino acids regulate translational capacity and metabolic homeostasis through modulation of tRNA thiolation. Cell 154, 416–29 (2013).

209. Du, Q., Wang, H. & Xie, J. Thiamin (vitamin B1) biosynthesis and regulation: a rich source of antimicrobial drug targets? Int. J. Biol. Sci. 7, 41–52 (2011).

210. Xi, H., Schneider, B. L. & Reitzer, L. Purine catabolism in Escherichia coli and function of xanthine dehydrogenase in purine salvage. J. Bacteriol. 182, 5332–41 (2000).

211. Spoering, A. L., Vulic, M. & Lewis, K. GlpD and PlsB participate in persister cell formation in Escherichia coli. J. Bacteriol. 188, 5136–44 (2006).

212. Yamazaki, Y., Niki, H. & Kato, J. in Methods in molecular biology (Clifton, N.J.) 416, 385–389 (2008).

213. Shee, C., Gibson, J. L., Darrow, M. C., Gonzalez, C. & Rosenberg, S. M. Impact of a stress-inducible switch to mutagenic repair of DNA breaks on mutation in Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 108, 13659–64 (2011).

214. Nelson, D. E., Ghosh, A. S., Paulson, A. L. & Young, K. D. Contribution of membrane-binding and enzymatic domains of penicillin binding protein 5 to maintenance of uniform cellular morphology of Escherichia coli. J. Bacteriol. 184, 3630–9 (2002).

215. Peters, K. et al. The Redundancy of Peptidoglycan Carboxypeptidases Ensures Robust Cell Shape Maintenance in Escherichia coli. MBio 7, e00819-16 (2016).

216. Potter, P. M., Kleibl, K., Cawkwell, L. & Margison, G. P. Expression of the ogt gene in wild-type and ada mutants of E. coli. Nucleic Acids Res. 17, 8047–60 (1989).

217. He, C. et al. A methylation-dependent electrostatic switch controls DNA repair and transcriptional activation by E. coli Ada. Mol. Cell 20, 117–29 (2005).

218. Lindahl, T., Sedgwick, B., Sekiguchi, M. & Nakabeppu, Y. Regulation and expression of the adaptive response to alkylating agents. Annu. Rev. Biochem. 57, 133–57 (1988).

219. Iost, I. & Dreyfus, M. DEAD-box RNA helicases in Escherichia coli. Nucleic Acids Res. 34, 4189–97 (2006).

220. Redder, P., Hausmann, S., Khemici, V., Yasrebi, H. & Linder, P. Bacterial versatility requires DEAD-box RNA helicases. FEMS Microbiol. Rev. 39, 392–412 (2015).

221. Zhao, X. & Jain, C. DEAD-box proteins from Escherichia coli exhibit multiple ATP-independent activities. J. Bacteriol. 193, 2236–41 (2011).

222. Jain, C. The E. coli RhlE RNA helicase regulates the function of related RNA helicases during ribosome assembly. RNA 14, 381–9 (2008).

223. Liou, G.-G., Chang, H.-Y., Lin, C.-S. & Lin-Chao, S. DEAD box RhlB RNA helicase physically associates with exoribonuclease PNPase to degrade double-stranded RNA independent of the degradosome-assembling region of RNase E. J. Biol. Chem. 277, 41157–62 (2002).

224. McCarter, L. L. Regulation of flagella. Curr. Opin. Microbiol. 9, 180–6 (2006).

225. Zhao, K., Liu, M. & Burgess, R. R. Adaptation in bacterial flagellar and motility systems: from regulon members to ‘foraging’-like behavior in E. coli. Nucleic Acids Res. 35, 4441–52 (2007).

226. Félix, M.-A. & Barkoulas, M. Pervasive robustness in biological systems. Nat. Rev. Genet. 16, 483–96 (2015).

227. Wagner, A. Genetic redundancy caused by gene duplications and its evolution in networks of transcriptional regulators. Biol. Cybern. 74, 557–67 (1996).

228. Kitano, H. Towards a theory of biological robustness. Mol. Syst. Biol. 3, 137 (2007).

229. Keane, O. M., Toft, C., Carretero-Paulet, L., Jones, G. W. & Fares, M. A. Preservation of genetic and regulatory robustness in ancient gene duplicates of Saccharomyces cerevisiae. Genome Res. 24, 1830–41 (2014).

230. Mattenberger, F., Sabater-Muñoz, B., Toft, C. & Fares, M. A. The Phenotypic Plasticity of Duplicated Genes in Saccharomyces cerevisiae and the Origin of Adaptations. G3 (Bethesda). 7, 63–75 (2017).

231. Guan, Y., Dunham, M. J. & Troyanskaya, O. G. Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175, 933–43 (2007).

232. Kuo, C.-H. & Ochman, H. The fate of new bacterial genes. FEMS Microbiol. Rev. 33, 38–43 (2009).

233. Bratlie, M. S. et al. Gene duplications in prokaryotes can be associated with environmental adaptation. BMC Genomics 11, 588 (2010).

234. Martínez-Núñez, M. A., Pérez-Rueda, E., Gutiérrez-Ríos, R. M. & Merino, E. New insights into the regulatory networks of paralogous genes in bacteria. Microbiology 156, 14–22 (2010).

235. Teichmann, S. A. & Babu, M. M. Gene regulatory network growth by duplication. Nat. Genet. 36, 492–6 (2004).

236. Turner, B. et al. iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database (Oxford). 2010, baq023 (2010).

237. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457-62 (2016).

238. Altenhoff, A. M. et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res. 43, D240-9 (2015).

239. Pu, S., Wong, J., Turner, B., Cho, E. & Wodak, S. J. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 37, 825–31 (2009).

240. Weiner, J. H. & Li, L. Proteome of the Escherichia coli envelope and technological challenges in membrane proteome analysis. Biochim. Biophys. Acta 1778, 1698–713 (2008).

241. Cascales, E., Gavioli, M., Sturgis, J. N. & Lloubès, R. Proton motive force drives the interaction of the inner membrane TolA and outer membrane Pal proteins in Escherichia coli. Mol. Microbiol. 38, 904–15 (2000).

242. Lüke, I. et al. Biosynthesis of the respiratory formate dehydrogenases from Escherichia coli: characterization of the FdhE protein. Arch. Microbiol. 190, 685–96 (2008).

243. Skibinski, D. A. G. et al. Regulation of the hydrogenase-4 operon of Escherichia coli by the sigma(54)-dependent transcriptional activators FhlA and HyfR. J. Bacteriol. 184, 6642–53 (2002).

244. Wadhams, G. H. & Armitage, J. P. Making sense of it all: bacterial chemotaxis. Nat. Rev. Mol. Cell Biol. 5, 1024–37 (2004).

245. Van Alst, N. E., Picardo, K. F., Iglewski, B. H. & Haidaris, C. G. Nitrate sensing and metabolism modulate motility, biofilm formation, and virulence in Pseudomonas aeruginosa. Infect. Immun. 75, 3780–90 (2007).

246. Wang, M., Herrmann, C. J., Simonovic, M., Szklarczyk, D. & von Mering, C. Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell-lines. Proteomics 15, 3163–3168 (2015).

247. Hazelbauer, G. L., Falke, J. J. & Parkinson, J. S. Bacterial chemoreceptors: high-performance signaling in networked arrays. Trends Biochem. Sci. 33, 9–19 (2008).

248. Kondoh, H., Ball, C. B. & Adler, J. Identification of a methyl-accepting chemotaxis protein for the ribose and galactose chemoreceptors of Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 76, 260–4 (1979).

249. Studdert, C. A. & Parkinson, J. S. Crosslinking snapshots of bacterial chemoreceptor squads. Proc. Natl. Acad. Sci. U. S. A. 101, 2117–22 (2004).

250. Grebe, T. W. & Stock, J. Bacterial chemotaxis: the five sensors of a bacterium. Curr. Biol. 8, R154-7 (1998).

251. Koeth, R. A. et al. γ-Butyrobetaine is a proatherogenic intermediate in gut microbial metabolism of L-carnitine to TMAO. Cell Metab. 20, 799–812 (2014).

252. Ward, S. M., Bormans, A. F. & Manson, M. D. Mutationally altered signal output in the Nart (NarX-Tar) hybrid chemoreceptor. J. Bacteriol. 188, 3944–51 (2006).

253. Borrero-de Acuña, J. M. et al. A Periplasmic Complex of the Nitrite Reductase NirS, the Chaperone DnaK, and the Flagellum Protein FliC Is Essential for Flagellum Assembly and Motility in Pseudomonas aeruginosa. J. Bacteriol. 197, 3066–75 (2015).

254. Lubitz, S. P. & Weiner, J. H. The Escherichia coli ynfEFGHI operon encodes polypeptides which are paralogues of dimethyl sulfoxide reductase (DmsABC). Arch. Biochem. Biophys. 418, 205–16 (2003).

255. Whitney, J. C. & Howell, P. L. Synthase-dependent exopolysaccharide secretion in Gram-negative bacteria. Trends Microbiol. 21, 63–72 (2013).

256. Römling, U. & Galperin, M. Y. Bacterial cellulose biosynthesis: diversity of operons, subunits, products, and functions. Trends Microbiol. 23, 545–57 (2015).

257. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–7 (2004).

258. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).

259. Price, M. N., Huang, K. H., Alm, E. J. & Arkin, A. P. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res. 33, 880–92 (2005).

260. Salgado, H., Moreno-Hagelsieb, G., Smith, T. F. & Collado-Vides, J. Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl. Acad. Sci. U. S. A. 97, 6652–7 (2000).

261. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–2 (2012).

262. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–3 (2009).

263. Guindon, S., Delsuc, F., Dufayard, J.-F. & Gascuel, O. in Methods in molecular biology (Clifton, N.J.) 537, 113–137 (2009).

264. Ondov, B. D., Bergman, N. H. & Phillippy, A. M. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).

265. Davidson, A. L., Dassa, E., Orelle, C. & Chen, J. Structure, Function, and Evolution of Bacterial ATP-Binding Cassette Systems. Microbiol. Mol. Biol. Rev. 72, 317–364 (2008).

266. Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–89 (2003).

267. Zeytuni, N. & Zarivach, R. Structural and functional discussion of the tetra-trico-peptide repeat, a protein interaction module. Structure 20, 397–405 (2012).

268. Marmont, L. S. et al. PelA and PelB proteins form a modification and secretion complex essential for Pel polysaccharide-dependent biofilm formation in Pseudomonas aeruginosa. J. Biol. Chem. 292, 19411–19422 (2017).

269. Keiski, C.-L. et al. AlgK is a TPR-containing protein and the periplasmic component of a novel exopolysaccharide secretin. Structure 18, 265–73 (2010).

270. Nojima, S. et al. Crystal structure of the flexible tandem repeat domain of bacterial cellulose synthesis subunit C. Sci. Rep. 7, 13018 (2017).

271. Cescutti, P., Cuzzi, B., Herasimenka, Y. & Rizzo, R. Structure of a novel exopolysaccharide produced by Burkholderia vietnamiensis, a cystic fibrosis opportunistic pathogen. Carbohydr. Polym. 94, 253–60 (2013).

272. Cuzzi, B. et al. Versatility of the Burkholderia cepacia complex for the biosynthesis of exopolysaccharides: a comparative structural investigation. PLoS One 9, e94372 (2014).

273. Ferreira, A. S., Silva, I. N., Oliveira, V. H., Cunha, R. & Moreira, L. M. Insights into the role of extracellular polysaccharides in Burkholderia adaptation to different environments. Front. Cell. Infect. Microbiol. 1, 16 (2011).

274. Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–9 (2012).

275. Itoh, Y. et al. Roles of pgaABCD genes in synthesis, modification, and export of the Escherichia coli biofilm adhesin poly-beta-1,6-N-acetyl-D-glucosamine. J. Bacteriol. 190, 3670–80 (2008).

276. Parise, G., Mishra, M., Itoh, Y., Romeo, T. & Deora, R. Role of a putative polysaccharide locus in Bordetella biofilm development. J. Bacteriol. 189, 750–60 (2007).

277. Wang, X., Preston, J. F. & Romeo, T. The pgaABCD locus of Escherichia coli promotes the synthesis of a polysaccharide adhesin required for biofilm formation. J. Bacteriol. 186, 2724–34 (2004).

278. Whitfield, G. B., Marmont, L. S. & Howell, P. L. Enzymatic modifications of exopolysaccharides enhance bacterial persistence. Front. Microbiol. 6, 471 (2015).

279. Xu, D., Zhang, W., Zhang, B., Liao, C. & Shao, Y. Characterization of a biofilm-forming Shigella flexneri phenotype due to deficiency in Hep biosynthesis. PeerJ 4, e2178 (2016).

280. Echeverz, M. et al. Lack of the PGA exopolysaccharide in Salmonella as an adaptive trait for survival in the host. PLoS Genet. 13, e1006816 (2017).

281. Cue, D., Lei, M. G. & Lee, C. Y. Genetic regulation of the intercellular adhesion locus in staphylococci. Front. Cell. Infect. Microbiol. 2, 38 (2012).

282. Little, D. J. et al. Structural basis for the De-N-acetylation of Poly-β-1,6-N-acetyl-D-glucosamine in Gram-positive bacteria. J. Biol. Chem. 289, 35907–17 (2014).

283. Jiang, K. et al. Complete genome sequence of Thauera aminoaromatica strain MZ1T. Stand. Genomic Sci. 6, 325–35 (2012).

284. Prombutara, P. & Allen, M. S. Flocculation-Related Gene Identification by Whole-Genome Sequencing of Thauera aminoaromatica MZ1T Floc-Defective Mutants. Appl. Environ. Microbiol. 82, 1646–52 (2015).

285. Pettersen, E. F. et al. UCSF Chimera--a visualization system for exploratory research and analysis. J. Comput. Chem. 25, 1605–12 (2004).

286. Little, D. J. et al. Modification and periplasmic translocation of the biofilm exopolysaccharide poly-β-1,6-N-acetyl-D-glucosamine. Proc. Natl. Acad. Sci. U. S. A. 111, 11013–8 (2014).

287. Purcell, E. B. & Tamayo, R. Cyclic diguanylate signaling in Gram-positive bacteria. FEMS Microbiol. Rev. 40, 753–73 (2016).

288. Wijman, J. G. E., de Leeuw, P. P. L. A., Moezelaar, R., Zwietering, M. H. & Abee, T. Air-liquid interface biofilms of Bacillus cereus: formation, sporulation, and dispersion. Appl. Environ. Microbiol. 73, 1481–8 (2007).

289. Okshevsky, M. et al. A transposon mutant library of Bacillus cereus ATCC 10987 reveals novel genes required for biofilm formation and implicates motility as an important factor for pellicle-biofilm formation. Microbiologyopen e00552 (2017). doi:10.1002/mbo3.552

290. Buljan, M. & Bateman, A. The evolution of protein domain families. Biochem. Soc. Trans. 37, 751–5 (2009).

291. Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–6 (2000).

292. Li, H. et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34, D572-80 (2006).

293. Gori, K., Suchan, T., Alvarez, N., Goldman, N. & Dessimoz, C. Clustering Genes of Common Evolutionary History. Mol. Biol. Evol. 33, 1590–605 (2016).

294. Eisen, J. A., Sweder, K. S. & Hanawalt, P. C. Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. Nucleic Acids Res. 23, 2715–23 (1995).

295. Wasmuth, J. D. et al. Integrated bioinformatic and targeted deletion analyses of the SRS gene superfamily identify SRS29C as a negative regulator of Toxoplasma virulence. MBio 3, e00321-12 (2012).

296. Huynen, M., Snel, B., Lathe, W. & Bork, P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10, 1204–10 (2000).

297. Fang, G., Rocha, E. P. C. & Danchin, A. Persistence drives gene clustering in bacterial genomes. BMC Genomics 9, 4 (2008).

298. Junier, I. & Rivoire, O. Conserved Units of Co-Expression in Bacterial Genomes: An Evolutionary Insight into Transcriptional Regulation. PLoS One 11, e0155740 (2016).

299. Castiblanco, L. F. & Sundin, G. W. Cellulose production, activated by cyclic di-GMP through BcsA and BcsZ, is a virulence factor and an essential determinant of the three-dimensional architectures of biofilms formed by Erwinia amylovora Ea1189. Mol. Plant Pathol. 19, 90–103 (2018).

300. Zaslaver, A., Mayo, A., Ronen, M. & Alon, U. Optimal gene partition into operons correlates with gene functional order. Phys. Biol. 3, 183–9 (2006).

301. Sandegren, L. & Andersson, D. I. Bacterial gene amplification: implications for the evolution of antibiotic resistance. Nat. Rev. Microbiol. 7, 578–88 (2009).

302. Jahn, C. E., Selimi, D. A., Barak, J. D. & Charkowski, A. O. The Dickeya dadantii biofilm matrix consists of cellulose nanofibres, and is an emergent property dependent upon the type III secretion system and the cellulose synthesis operon. Microbiology 157, 2733–44 (2011).

303. MacKenzie, K. D., Palmer, M. B., Köster, W. L. & White, A. P. Examining the Link between Biofilm Formation and the Ability of Pathogenic Salmonella Strains to Colonize Multiple Host Species. Front. Vet. Sci. 4, 138 (2017).

304. Campbell, J. A., Davies, G. J., Bulone, V. & Henrissat, B. A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. Biochem. J. 326 (Pt 3), 929–39 (1997).

305. Sonnenburg, J. L., Angenent, L. T. & Gordon, J. I. Getting a grip on things: how do communities of bacterial symbionts become established in our intestine? Nat. Immunol. 5, 569–73 (2004).

306. Agapakis, C. M., Boyle, P. M. & Silver, P. A. Natural strategies for the spatial optimization of metabolism in synthetic biology. Nat. Chem. Biol. 8, 527–35 (2012).

307. Ponomarova, O. & Patil, K. R. Metabolic interactions in microbial communities: untangling the Gordian knot. Curr. Opin. Microbiol. 27, 37–44 (2015).

308. Prindle, A. et al. Ion channels enable electrical communication in bacterial communities. Nature 527, 59–63 (2015).

309. Lawrence, J. Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr. Opin. Genet. Dev. 9, 642–8 (1999).

310. Homma, K., Fukuchi, S., Nakamura, Y., Gojobori, T. & Nishikawa, K. Gene cluster analysis method identifies horizontally transferred genes with high reliability and indicates that they provide the main mechanism of operon gain in 8 species of gamma-Proteobacteria. Mol. Biol. Evol. 24, 805–13 (2007).

311. Land, M. et al. Insights from 20 years of bacterial genome sequencing. Funct. Integr. Genomics 15, 141–61 (2015).

312. Matsumoto, K., Hara, H., Fishov, I., Mileykovskaya, E. & Norris, V. The membrane: transertion as an organizing principle in membrane heterogeneity. Front. Microbiol. 6, 572 (2015).

313. Norris, V. et al. Functional taxonomy of bacterial hyperstructures. Microbiol. Mol. Biol. Rev. 71, 230–53 (2007).

314. Laloux, G. & Jacobs-Wagner, C. How do bacteria localize proteins to the cell pole? J. Cell Sci. 127, 11–9 (2014).

315. Yao, Y., Fan, L., Shi, Y., Odsbu, I. & Morigen. A Spatial Control for Correct Timing of Gene Expression during the Escherichia coli Cell Cycle. Genes (Basel). 8, 1 (2016).

316. Chai, Q. et al. Organization of ribosomes and nucleoids in Escherichia coli cells during growth and in quiescence. J. Biol. Chem. 289, 11342–52 (2014).

317. Feig, M., Yu, I., Wang, P.-H., Nawrocki, G. & Sugita, Y. Crowding in Cellular Environments at an Atomistic Level from Computer Simulations. J. Phys. Chem. B 121, 8009–8025 (2017).

318. Bakshi, S., Choi, H. & Weisshaar, J. C. The spatial biology of transcription and translation in rapidly growing Escherichia coli. Front. Microbiol. 6, 636 (2015).

319. Hacker, W. C., Li, S. & Elcock, A. H. Features of genomic organization in a nucleotide-resolution molecular model of the Escherichia coli chromosome. Nucleic Acids Res. 45, 7541–7554 (2017).

320. Edwards, J. C., Johnson, M. S. & Taylor, B. L. Differentiation between electron transport sensing and proton motive force sensing by the Aer and Tsr receptors for aerotaxis. Mol. Microbiol. 62, 823–37 (2006).

321. Liu, J. et al. Molecular architecture of chemoreceptor arrays revealed by cryoelectron tomography of Escherichia coli minicells. Proc. Natl. Acad. Sci. U. S. A. 109, E1481-8 (2012).

322. Parkinson, J. S., Hazelbauer, G. L. & Falke, J. J. Signaling and sensory adaptation in Escherichia coli chemoreceptors: 2015 update. Trends Microbiol. 23, 257–66 (2015).

323. Amsler, C. D., Cho, M. & Matsumura, P. Multiple factors underlying the maximum motility of Escherichia coli as cultures enter post-exponential growth. J. Bacteriol. 175, 6238–44 (1993).

324. Prigent-Combaret, C. et al. Developmental pathway for biofilm formation in curli-producing Escherichia coli strains: role of flagella, curli and colanic acid. Environ. Microbiol. 2, 450–64 (2000).

325. Halan, B., Buehler, K. & Schmid, A. Biofilms as living catalysts in continuous chemical syntheses. Trends Biotechnol. 30, 453–65 (2012).

326. Ahmad, I. et al. BcsZ inhibits biofilm phenotypes and promotes virulence by blocking cellulose production in Salmonella enterica serovar Typhimurium. Microb. Cell Fact. 15, 177 (2016).

327. Wilkens, S. Structure and mechanism of ABC transporters. F1000Prime Rep. 7, 14 (2015).

328. Krasteva, P. V. et al. Insights into the structure and assembly of a bacterial cellulose secretion system. Nat. Commun. 8, 2065 (2017).

329. Ji, K. et al. Bacterial cellulose synthesis mechanism of facultative anaerobe Enterobacter sp. FY-07. Sci. Rep. 6, 21863 (2016).

330. Morgan, J. L. W. et al. Observing cellulose biosynthesis and membrane translocation in crystallo. Nature 531, 329–34 (2016).

331. Du, J., Vepachedu, V., Cho, S. H., Kumar, M. & Nixon, B. T. Structure of the Cellulose Synthase Complex of Gluconacetobacter hansenii at 23.4 Å Resolution. PLoS One 11, e0155886 (2016).

332. Morgan, J. L. W., Strumillo, J. & Zimmer, J. Crystallographic snapshot of cellulose synthesis and membrane translocation. Nature 493, 181–6 (2013).

333. Little, D. J. et al. The structure- and metal-dependent activity of Escherichia coli PgaB provides insight into the partial de-N-acetylation of poly-β-1,6-N-acetyl-D-glucosamine. J. Biol. Chem. 287, 31126–37 (2012).

334. Oglesby, L. L., Jain, S. & Ohman, D. E. Membrane topology and roles of Pseudomonas aeruginosa Alg8 and Alg44 in alginate polymerization. Microbiology 154, 1605–15 (2008).

335. Fata Moradali, M., Donati, I., Sims, I. M., Ghods, S. & Rehm, B. H. A. Alginate Polymerization and Modification Are Linked in Pseudomonas aeruginosa. MBio 6, e00453-15 (2015).

336. Riley, L. M. et al. Structural and functional characterization of Pseudomonas aeruginosa AlgX: role of AlgX in alginate acetylation. J. Biol. Chem. 288, 22299–314 (2013).

Appendices

Appendix 1: Summary Table of CE-PPI Network Paralog PPI Enrichment

eggNOG_OMA         eggNOG_OMA   Transcription   Cell Cycle  Cell      Energy      Transport &  Multiple
Group              Paralog      & Translation-  & Defense   Envelope  Production  Metabolism
                                Associated

proNOG00040_4833   acrB__b0462  1 / 1           4 / 4       10 / 14   3 / 3       6 / 12       5 / 12
                   acrF__b3266  0 / 1           0 / 4       0 / 14    0 / 3       2 / 12       0 / 12
                   mdtF__b3514  0 / 1           0 / 4       4 / 14    0 / 3       4 / 12       7 / 12
proNOG00096_8637   tap__b1885   0 / 13          0 / 4       0 / 21    1 / 6       3 / 30       1 / 17
                   tar__b1886   1 / 13          1 / 4       1 / 21    0 / 6       3 / 30       1 / 17
                   trg__b1421   12 / 13         3 / 4       19 / 21   5 / 6       23 / 30      14 / 17
                   tsr__b4355   0 / 13          0 / 4       1 / 21    0 / 6       1 / 30       1 / 17
proNOG00160_4491   malK__b4035  0 / 1           2 / 5       1 / 6     1 / 3       4 / 13       1 / 1
                   ugpC__b3450  1 / 1           3 / 5       5 / 6     2 / 3       9 / 13       0 / 1
proNOG00283_8619   atoB__b2224  0 / 1           0 / 0       3 / 3     0 / 0       0 / 0        0 / 0
                   paaJ__b1397  0 / 1           0 / 0       0 / 3     0 / 0       0 / 0        0 / 0
                   yqeF__b2844  1 / 1           0 / 0       0 / 3     0 / 0       0 / 0        0 / 0
proNOG00284_5978   frmB__b0355  4 / 4           0 / 0       0 / 0     1 / 1       1 / 5        0 / 0
                   yeiG__b2154  0 / 4           0 / 0       0 / 0     0 / 1       4 / 5        0 / 0
proNOG00360_5573   tktA__b2935  4 / 6           3 / 4       1 / 1     1 / 3       8 / 13       1 / 4
                   tktB__b2465  2 / 6           1 / 4       0 / 1     2 / 3       5 / 13       3 / 4
proNOG00415_5876   degQ__b3234  0 / 4           3 / 3       2 / 2     0 / 1       5 / 5        1 / 2
                   degS__b3235  4 / 4           0 / 3       0 / 2     1 / 1       0 / 5        1 / 2
proNOG00547_5982   rfbA__b2039  8 / 9           0 / 0       1 / 2     0 / 0       6 / 6        3 / 3
                   rffH__b3789  1 / 9           0 / 0       1 / 2     0 / 0       0 / 6        0 / 3
proNOG00602_5678   atoC__b2220  0 / 0           0 / 0       0 / 0     0 / 0       1 / 2        0 / 0
                   norR__b2709  0 / 0           0 / 0       0 / 0     0 / 0       1 / 2        0 / 0
proNOG00717_5870   rhsC__b0700  1 / 1           0 / 0       0 / 0     0 / 0       0 / 2        0 / 0
                   rhsD__b0497  0 / 1           0 / 0       0 / 0     0 / 0       2 / 2        0 / 0
proNOG00751_4485   aroP__b0112  0 / 0           0 / 1       0 / 1     0 / 0       1 / 2        0 / 0
                   pheP__b0576  0 / 0           1 / 1       1 / 1     0 / 0       1 / 2        0 / 0
proNOG00963_5294   dtpA__b1634  0 / 1           0 / 0       0 / 0     0 / 0       1 / 1        0 / 0
                   dtpB__b3496  1 / 1           0 / 0       0 / 0     0 / 0       0 / 1        0 / 0
proNOG01012_4931   paoC__b0284  1 / 1           0 / 0       0 / 0     0 / 2       0 / 3        0 / 0
                   xdhA__b2866  0 / 1           0 / 0       0 / 0     1 / 2       2 / 3        0 / 0
                   xdhD__b2881  0 / 1           0 / 0       0 / 0     1 / 2       1 / 3        0 / 0
proNOG01159_11263  narH__b1225  0 / 0           1 / 1       1 / 1     0 / 3       5 / 9        0 / 0
                   narY__b1467  0 / 0           0 / 1       0 / 1     3 / 3       4 / 9        0 / 0
proNOG01495_4597   hsrA__b3754  0 / 2           0 / 3       0 / 0     0 / 0       1 / 3        0 / 1
                   mdtD__b2077  2 / 2           3 / 3       0 / 0     0 / 0       2 / 3        1 / 1
proNOG01499_3925   rbbA__b3486  3 / 4           0 / 2       2 / 2     2 / 2       5 / 6        1 / 2
                   ybhF__b0794  1 / 4           2 / 2       0 / 2     0 / 2       1 / 6        1 / 2
proNOG01548_6074   araD__b0061  0 / 0           0 / 0       0 / 1     0 / 0       1 / 1        0 / 0
                   sgbE__b3583  0 / 0           0 / 0       1 / 1     0 / 0       0 / 1        0 / 0
proNOG01571_5121   lysS__b2890  0 / 1           0 / 2       2 / 4     0 / 1       1 / 5        0 / 1
                   lysU__b4129  1 / 1           2 / 2       2 / 4     1 / 1       4 / 5        1 / 1
proNOG01770_8619   hycC__b2723  0 / 0           0 / 0       0 / 0     3 / 6       0 / 1        0 / 0
                   hyfB__b2482  0 / 0           0 / 0       0 / 0     3 / 6       1 / 1        0 / 0
proNOG01915_4684   talA__b2464  0 / 0           0 / 1       0 / 4     0 / 0       5 / 6        1 / 2
                   talB__b0008  0 / 0           1 / 1       4 / 4     0 / 0       1 / 6        1 / 2
proNOG02649_4441   ddpA__b1487  7 / 8           2 / 2       1 / 1     1 / 1       6 / 8        1 / 1
                   sapA__b1294  1 / 8           0 / 2       0 / 1     0 / 1       2 / 8        0 / 1
proNOG02678_5519   metL__b3940  1 / 2           0 / 1       0 / 1     0 / 1       6 / 15       1 / 1
                   thrA__b0002  1 / 2           1 / 1       1 / 1     1 / 1       9 / 15       0 / 1
proNOG02687_4395   fdnH__b1475  4 / 4           1 / 2       0 / 0     2 / 2       5 / 5        1 / 1
                   fdoH__b3893  0 / 4           1 / 2       0 / 0     0 / 2       0 / 5        0 / 1
proNOG02873_5521   gfcE__b0983  2 / 7           2 / 3       2 / 9     1 / 3       1 / 5        1 / 4
                   wza__b2062   5 / 7           1 / 3       7 / 9     2 / 3       4 / 5        3 / 4
proNOG02880_4580   yegH__b2063  0 / 0           0 / 0       0 / 0     0 / 2       1 / 1        0 / 0
                   yoaE__b1816  0 / 0           0 / 0       0 / 0     2 / 2       0 / 1        0 / 0
proNOG02888_5676   livJ__b3460  3 / 13          0 / 2       0 / 3     0 / 3       2 / 8        0 / 1
                   livK__b3458  10 / 13         2 / 2       3 / 3     3 / 3       6 / 8        1 / 1
proNOG03466_5570   dacA__b0632  4 / 4           1 / 4       4 / 6     1 / 2       6 / 13       0 / 0
                   dacC__b0839  0 / 4           3 / 4       2 / 6     1 / 2       7 / 13       0 / 0
proNOG03752_11262  fsaA__b0825  0 / 2           0 / 2       0 / 0     0 / 0       1 / 3        0 / 1
                   fsaB__b3946  2 / 2           2 / 2       0 / 0     0 / 0       2 / 3        1 / 1
proNOG03846_5478   etk__b0981   0 / 0           1 / 1       2 / 4     0 / 0       0 / 0        0 / 0
                   wzc__b2060   0 / 0           0 / 1       2 / 4     0 / 0       0 / 0        0 / 0
proNOG04044_4928   fumA__b1612  2 / 2           0 / 0       0 / 0     1 / 1       4 / 6        1 / 2
                   fumB__b4122  0 / 2           0 / 0       0 / 0     0 / 1       2 / 6        1 / 2
proNOG04100_6271   argT__b2310  4 / 8           0 / 0       1 / 1     0 / 1       2 / 2        1 / 1
                   hisJ__b2309  4 / 8           0 / 0       0 / 1     1 / 1       0 / 2        0 / 1
proNOG04149_6169   ompF__b0929  0 / 2           0 / 3       1 / 3     0 / 0       0 / 2        0 / 3
                   ompN__b1377  1 / 2           3 / 3       2 / 3     0 / 0       1 / 2        3 / 3
                   phoE__b0241  1 / 2           0 / 3       0 / 3     0 / 0       1 / 2        0 / 3
proNOG04150_11263  narG__b1224  0 / 0           3 / 4       1 / 1     2 / 2       7 / 8        2 / 3
                   narZ__b1468  0 / 0           1 / 4       0 / 1     0 / 2       1 / 8        1 / 3
proNOG07629_5250   ldtB__b0819  1 / 2           0 / 0       0 / 1     1 / 2       1 / 4        1 / 1
                   ldtC__b1113  0 / 2           0 / 0       0 / 1     1 / 2       0 / 4        0 / 1
                   ldtE__b1678  1 / 2           0 / 0       1 / 1     0 / 2       3 / 4        0 / 1
proNOG07970_10398  chbF__b1734  1 / 1           0 / 0       1 / 1     0 / 0       2 / 2        1 / 2
                   melA__b4119  0 / 1           0 / 0       0 / 1     0 / 0       0 / 2        1 / 2
proNOG10369_17816  agp__b1002   3 / 4           0 / 0       0 / 0     0 / 4       0 / 2        0 / 1
                   appA__b0980  1 / 4           0 / 0       0 / 0     4 / 4       2 / 2        1 / 1
proNOG11279_5974   yaaU__b0045  3 / 4           1 / 1       0 / 0     0 / 0       3 / 4        2 / 2
                   ygcS__b2771  1 / 4           0 / 1       0 / 0     0 / 0       1 / 4        0 / 2
proNOG33075_8920   nlpD__b2742  3 / 3           0 / 2       5 / 6     2 / 2       4 / 5        1 / 1
                   ygeR__b2865  0 / 3           2 / 2       1 / 6     0 / 2       1 / 5        0 / 1
proNOG41960_5432   iscA__b2528  4 / 9           0 / 0       1 / 1     0 / 1       4 / 6        0 / 1
                   sufA__b1684  5 / 9           0 / 0       0 / 1     1 / 1       2 / 6        1 / 1
proNOG52556_4541   ridA__b4243  1 / 1           0 / 0       3 / 6     1 / 2       4 / 4        0 / 1
                   tdcF__b3113  0 / 1           0 / 0       3 / 6     1 / 2       0 / 4        1 / 1
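
As a reading aid for the table above, the following minimal Python sketch parses cells of the form `k / n` and reports each paralog's share of its group's interactions in one functional category. It assumes that each cell gives the paralog's PPI count over the group total for that category, which the column sums suggest but the table does not state explicitly. The function names `parse_cell` and `paralog_shares` are hypothetical and are not part of the thesis pipeline.

```python
from fractions import Fraction

def parse_cell(cell):
    """Parse a table cell such as '12 / 13' into (count, total)."""
    count, total = (int(part) for part in cell.split("/"))
    return count, total

def paralog_shares(cells):
    """Map paralog -> share of the group's PPIs in one category.

    `cells` maps paralog name -> 'k / n' string. The share is None when
    the group has no PPIs in the category (n == 0).
    """
    shares = {}
    for paralog, cell in cells.items():
        count, total = parse_cell(cell)
        shares[paralog] = Fraction(count, total) if total else None
    return shares

# First category column of group proNOG00096_8637, copied from the table.
chemoreceptors = {
    "tap__b1885": "0 / 13",
    "tar__b1886": "1 / 13",
    "trg__b1421": "12 / 13",
    "tsr__b4355": "0 / 13",
}
shares = paralog_shares(chemoreceptors)
# Paralog holding the largest share of the group's PPIs in this category.
dominant = max(shares, key=lambda name: shares[name])
print(dominant, float(shares[dominant]))
```

Exact fractions are used so that shares within a group can be compared without rounding; a share close to 1 for a single paralog marks the skewed distributions visible in several rows of the table.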

Appendix 2: Sequence Variability of Phylogenetic Clusters Reveals Different Degrees of Structural Conservation of Cellulose Biosynthesis Machinery

Recently, a crystal structure has been solved for the BcsA-BcsB inner membrane complex responsible for cellulose biosynthesis and transport330; therefore, I chose to examine how sequence variability of BcsA and BcsB protein sequences affects structural conservation.

Four examples of cellulose operon clusters identified above were selected to represent distinct operon organizations: rearrangement of bcsA in Burkholderia and Pantoea spp., locus fusions of bcsA and bcsB in Zymomonas spp., loss of bcsC among Alpha Proteobacterial spp., and whole-operon duplications in Gamma Proteobacterial spp. (see 4.2.4.1 and corresponding Figure 7, examples 1-4, respectively). From each, BcsA and BcsB protein sequences from a single representative species (chosen by longest protein sequence length) were selected. Multiple sequence alignments were generated and residue conservation was mapped onto the BcsA-BcsB complex crystal structure (4HG6). Figure 1 reveals that the high degree of conservation of BcsA protein sequences across phylogenetically diverse bacteria corresponds to the glycosyl-transferase (GT) domain responsible for cellulose polymerization, and particularly encompasses regions forming a cleft where a UDP carrier moiety is bound and oriented through a conserved QxxRW motif to enable polymerization of glucose monomers onto the growing cellulose chain330. In contrast, the PilZ domain of BcsA, involved in regulating GT function in response to cyclic-di-GMP (c-di-GMP) levels, shows low conservation overall, except for a subset of residues corresponding to c-di-GMP binding sites. The periplasmic region of BcsB shows remarkably low sequence conservation overall, aside from a number of highly conserved residues in its carbohydrate binding and ferredoxin domains, one of which is a putative cellulose binding residue oriented toward the growing cellulose chain near the exit of the BcsA IM translocation channel.
These results reveal that structural features of the cellulose synthase complex directly involved in cellulose polymerization and guidance of the growing polymer are highly conserved among cellulose-producing species, whereas regulation of cellulose production appears to be a more variable factor across phyla. Furthermore, the greater variability observed for BcsB, which also plays an essential role in cellulose production by guiding the polymer through the periplasm for export and additional processing331, is likely the result of its residence in the periplasmic space, where it has undergone significant divergence as a consequence of adaptation to diverse environmental niches.
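
The mapping of alignment conservation onto a reference structure described above can be illustrated with a minimal Python sketch. The thesis does not state which conservation score Geneious or Chimera computed, so the score used here (the fraction of sequences sharing the most common residue in a column) is an assumption chosen for simplicity. The toy alignment and the function names are hypothetical.

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """Fraction of sequences sharing the most common residue per column.

    Gaps ('-') count as mismatches: they stay in the denominator but are
    never chosen as the consensus residue.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores

def map_to_reference(ref_aligned, scores, first_residue=1):
    """Return {reference residue number: score}, skipping reference gaps."""
    mapping = {}
    number = first_residue
    for res, score in zip(ref_aligned, scores):
        if res != "-":
            mapping[number] = score
            number += 1
    return mapping

# Hypothetical toy alignment; the first row plays the role of the
# structural reference (for example, the sequence of a 4HG6 chain).
alignment = [
    "MQ-TRW",
    "MQATRW",
    "LQ-SRW",
    "MQ-TKW",
]
scores = column_conservation(alignment)
per_residue = map_to_reference(alignment[0], scores)
print(per_residue)
```

The per-residue dictionary is the form a structure viewer typically needs, since columns where the reference has a gap have no counterpart in the crystal structure and are dropped.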

Figure 1. Phylogenetic Sequence Clustering Reflects Differences in Structural Conservation Between Cellulose Synthase Complex Subunits BcsA and BcsB. A - Sequence conservation was mapped onto the cellulose synthase complex, BcsA-BcsB (4HG6 – Rhodobacter sphaeroides ATCC 17025), using sequences from eight species representing distinct cellulose operon clades (Figure 4-7 examples 1, 2, 3, and 4). B, C - Structural and multiple sequence alignments indicate a high degree of conservation corresponding to the BcsA glycosyl-transferase catalytic core domain and regions of the cellulose translocation channel (panel B), while low overall sequence conservation is found among the carbohydrate binding and ferredoxin domains (CBD1-2 and FD1-2) of BcsB sequences, except for a highly conserved putative cellulose binding residue in CBD-2 (panel C). The translocated cellulose polymer is indicated in green. BcsA domains were identified using PFAM predictions for the R. sphaeroides reference sequence; BcsB domains were assigned according to ref. 332. The multiple sequence alignment was visualized using Geneious 10.2.2 (ref. 274); the protein structure was visualized using Chimera 1.11.2 (ref. 285).

Appendix 3: Divergence of PNAG Phylogenetic Sequence Clusters Elucidates Structural Differences Related to Biofilm Secretion and Modification Across Diverse Bacterial Phyla

When examining the structural implications of sequence divergence across representative genomes of PgaA sequence clusters, low overall conservation was seen for the outer-membrane embedded beta-barrel and periplasmic TPR domains, as reflected by the greater number of phylogenetic groups observed in the PNAG genomic-context network (Figure 1). Mapping residue conservation onto the C-terminal PgaA beta-barrel domain crystal structure (4Y25) identified a patch of conserved residues localized in a periplasmic-proximal region of the pore, which forms a binding pocket responsible for PNAG secretion. Surprisingly, closer examination of the region revealed low conservation, particularly of the glutamate and aspartate residues (Glu741, Asp777, Glu800, Asp802) experimentally determined to be critical for PNAG secretion in E. coli K12 MG1655. Because PNAG is a positively charged polysaccharide, this finding suggests that the loss of negatively charged residues in the binding pocket of PgaA may either indicate operon clades with attenuated PNAG production, or be compensated for by differing degrees of PNAG de-acetylation, which may result in an EPS with greater positive charge333.
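
The check described above, whether acidic residues at specific reference positions are retained across clusters, can be sketched in Python as follows. This is an illustrative sketch only: the toy alignment, the residue numbering, and the function names are hypothetical, and real input would be the PgaA alignment with the E. coli sequence as reference.

```python
def alignment_column(ref_aligned, residue_number, first_residue=1):
    """Index of the alignment column holding a given reference residue."""
    number = first_residue - 1
    for index, res in enumerate(ref_aligned):
        if res != "-":
            number += 1
            if number == residue_number:
                return index
    raise ValueError(f"residue {residue_number} not in reference")

def acidic_fraction(aligned_seqs, column):
    """Fraction of sequences with Asp (D) or Glu (E) in a column."""
    hits = sum(1 for seq in aligned_seqs if seq[column] in "DE")
    return hits / len(aligned_seqs)

# Hypothetical toy alignment; the first row is the reference and its
# first residue is assumed to be numbered 10.
alignment = [
    "GE-SDA",
    "GEASNA",
    "GQ-SDA",
    "GK-SDA",
]
for resnum in (11, 13):
    col = alignment_column(alignment[0], resnum, first_residue=10)
    print(resnum, alignment[0][col], acidic_fraction(alignment, col))
```

A low acidic fraction at a position such as Glu741 would correspond to the loss of negative charge discussed in the text, while substitutions between Asp and Glu are counted as conserving charge.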

Figure 1. Variable Conservation of the Electro-Negative PNAG Binding Pocket of the PNAG OMP Pore, PgaA, Revealed by Phylogenetic Clustering. A - Multiple sequence alignment of representative sequences from all PgaA phylogenetic clusters indicates a low degree of overall sequence conservation. The boxed region indicates conserved residues delimiting the PNAG binding pocket of PgaA. Electronegative residues required for in-vivo PNAG export in E. coli (alignment positions indicated in blue) show low conservation. B - Sequence conservation mapped onto the crystal structure of the Escherichia coli PgaA outer-membrane pore domain (4Y25), with the PNAG binding pocket and electro-negative residues indicated (yellow outline). PgaA beta-barrel and TPR domains were identified using PFAM predictions for the E. coli MG1655 K12 reference sequence. The multiple sequence alignment was visualized using Geneious 10.2.2 (ref. 274); the protein structure was visualized using Chimera 1.11.2 (ref. 285).

Appendix 4: Phylogenetic Clustering Indicates Increased Divergence of Loci with Functionally Linked Roles in Regulation of Alginate Secretion

Based on the number of phylogenetic clusters identified, increased divergence relative to other components of the inner-membrane alginate synthase complex was indicated for two physically linked subunits, the c-di-GMP binding alginate co-polymerase Alg44 and the periplasmic alginate acetylase AlgX (see Section 4.2.4.8 - Figure 14). Examination of phylogenetic trees and multiple sequence alignments of Alg44 and AlgX phylogenetic clusters indicates surprisingly low overall sequence conservation of Alg44 and AlgX loci, with duplicate and atypical operon loci appearing to show the greatest divergence (Figures 1AB and 2AB).

Mapping conservation onto the structure of the Alg44 N-terminal PilZ domain (4RT0) revealed a high degree of conservation, particularly of residues forming the c-di-GMP binding pocket, together with a highly variable loop region (Figure 1C). However, one of these residues (Arg-95) appears to have changed to a negatively charged Glu or non-polar Leu residue in clade 2 sequences and atypical operons, suggesting possible alteration in the regulation of alginate transport in clade 2 operons.

It has recently been demonstrated that, in addition to regulating alginate polymerization by the polysaccharide synthase Alg8334, Alg44 also serves as a periplasmic scaffold that recruits AlgX to modify newly synthesized alginate polymers, likely through its C-terminal periplasmic HlyD_3 domain335. Furthermore, the binding of AlgX to newly synthesized alginate protects it against degradation by the periplasmic alginate lyase AlgL336. Highly conserved residues were observed in the C-terminal HlyD_3 domain as well as the upstream trans-membrane region of Alg44 (Figure 1A); these may play critical roles in other aspects of its functional association with the Alg8 polysaccharide synthase, possibly by orienting the alginate polymer as it emerges into the periplasm for enzymatic modification by AlgX.

The structure of AlgX (4KNC) reveals low conservation of surface-exposed regions, with conserved regions corresponding largely to the acetylase domain active site residues and to residues in the Carbohydrate Binding Module (CBM) domain (Figure 2C). Given that co-ordination of alginate production with periplasmic acetylation levels has been shown to result in dramatically different biofilm architectures and motility phenotypes in-vivo335, these results suggest that the divergence of Alg44 and AlgX is likely to be of physiological relevance, and that elucidating its implications for biofilm production across Pseudomonas spp., which inhabit diverse environmental niches, is worthy of further experimental investigation.

Figure 1. High Degree of Sequence Divergence of Alg44 Phylogenetic Clusters. A - Multiple sequence alignments of representative sequences of all Alg44 phylogenetic clusters, with sequence conservation compared against Pseudomonas aeruginosa PAO1. Sequences are designated according to their association with clade 1 (C1), clade 2 (C2), or atypical (AT) operons. Red arrows indicate the PFAM predicted cyclic-di-GMP (c-di-GMP) binding PilZ domain and the periplasmic HlyD_3 domain. The inset panel shows the PilZ domain alignment, with conserved c-di-GMP binding positions indicated by blue rectangles. B – Neighbour joining tree of aligned Alg44 protein sequences, with boxes indicating C2 and AT associated loci; note the distinctly greater degree of divergence between C1 and C2 sequences (teal) compared to duplicate C1 Alg44 sequences (purple). C – Structural conservation of the P. aeruginosa PAO1 Alg44 PilZ domain (4RT0) corresponding to c-di-GMP binding residues. The multiple sequence alignment was visualized and the protein tree generated (Jukes Cantor, Neighbour Joining, 100 bootstrap replicates) using Geneious 10.2.2 (ref. 274); the protein structure was visualized using Chimera 1.11.2 (ref. 285).

Figure 2. High Degree of Divergence among AlgX Phylogenetic Clusters. A - Multiple sequence alignments of representative sequences of all AlgX phylogenetic clusters, with sequence conservation compared against Pseudomonas aeruginosa PAO1 AlgX. Sequences are designated according to their association with clade 1 (C1), clade 2 (C2), or atypical (AT) operons. Red arrows indicate the PFAM predicted polysaccharide deacetylase (AlgX) and carbohydrate binding module (CBM_26) domains. The inset panel highlights the alignment region spanning conserved alginate deacetylase domain catalytic residues, indicated by blue rectangles. B – Neighbour joining tree of aligned AlgX protein sequences, with boxes indicating C2 and AT associated loci; as with Alg44, C1 and C2 sequences (teal) show greater divergence than duplicate C1 AlgX sequences (purple). C – Structural conservation of P. aeruginosa PAO1 AlgX (4KNC) is greatest for residues comprising the catalytic core and involved in alginate binding, and lowest for surfaces with greater periplasmic exposure. The multiple sequence alignment was visualized and the protein tree generated (Jukes Cantor, Neighbour Joining, 100 bootstrap replicates) using Geneious 10.2.2 (ref. 274); the protein structure was visualized using Chimera 1.11.2 (ref. 285).
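
The captions above cite Jukes Cantor distances as input to the Neighbour Joining trees. As a hedged illustration, the Python sketch below computes the generalised Jukes-Cantor correction for an alphabet of 20 amino acids from the observed proportion of differing sites. Whether Geneious applies exactly this form to protein alignments is an assumption; tree building and bootstrapping are not shown, and the function names and toy sequences are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p, states=20):
    """Jukes-Cantor corrected distance for an alphabet of `states` letters.

    d = -((s - 1) / s) * ln(1 - p * s / (s - 1)), with s = states.
    """
    scale = (states - 1) / states
    if p >= scale:
        return math.inf  # saturated: the correction is undefined
    return -scale * math.log(1 - p / scale)

# Hypothetical aligned protein fragments differing at one of five sites.
p = p_distance("MQTRW", "MQSRW")
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as the raw proportion of differences, which is why highly divergent clusters appear disproportionately far apart in a distance-based tree.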