Understanding the Encapsulins:

Prediction and Characterization of Phage Capsid-like Nanocompartments in Prokaryotes

by

Devon Radford

A thesis submitted in conformity with the requirements for the degree of Doctorate of Philosophy in Molecular Genetics, Department of Molecular Genetics, Faculty of Medicine, University of Toronto

© Copyright by Devon Radford 2015

Abstract

Encapsulins are a distinct family of prokaryotic compartments, composed of an icosahedral shell encapsulating a variety of enzymes. The encapsulin shell protein is similar in sequence and structure to the capsid protein of dsDNA tailed bacteriophages. Although found across diverse phyla of Bacteria and Archaea, few encapsulins were previously characterized.

My thesis work utilized a bioinformatic approach to identify and investigate the functions of all encapsulins found in bacteria and Archaea. I found 590 encapsulins showing definite sequence similarity to the previously described encapsulins, which I refer to as “classical encapsulins”.

I identified four new enzyme families strongly predicted to be classical encapsulin cargo enzymes: ferredoxins, rubrerythrin-like, Dps-bacterioferritin-like, and hemerythrin-like proteins.

While most species encode encapsulin cargo proteins adjacent to the encapsulin gene, I discovered 113 species encoding cargo proteins in genomic regions far removed from the encapsulin gene. I also found 118 genomes encoding two or three different enzymes strongly predicted to be targeted to the same encapsulin. Genes encoding three protein families, radical SAM oxidoreductases, CutA1-like proteins and metalloproteases, are highly enriched near encapsulin genes, but without targeting motifs for direct encapsulation.

In addition to classical encapsulins, I discovered 1060 novel putative encapsulins related to diverse families of phage capsids. The largest family of these novel encapsulins, with 986 proteins, is larger than the classical encapsulin family and conserves a sulphur metabolism operon encoding cysteine desulfurases, acetyltransferases, and rhodaneses. These activities are predicted to relieve sulphur restriction and the toxicity of cyanide and other oxidants.

Overall, novel and classical encapsulins and their encapsulated enzymes are predicted to contribute to polyvalent cation dependent catalytic activities, and to relieve heavy metal and cyanide toxicity. Phylogenies predicted that capsids and encapsulins share intermingled ancestry.

My work shows that encapsulins perform more diverse functions and are much more widely distributed than any other prokaryotic compartment.

Acknowledgments

I would like to thank the Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council Alexander Graham Bell Doctoral Canada Graduate Scholarship program, and the Ontario Graduate Scholarship program for providing funding for this project.

High performance computing resources were supplied by the Sharcnet and SciNet academic computing clusters.

Many thanks are also due to the technical and administrative staff of the University of Toronto, Department of Molecular Genetics. Special thanks to Diane Bona, Bianca Garcia, and Yurima Hidalgo specifically for their help and assistance throughout this project. I would also like to thank all members of the Davidson and Maxwell labs, past and present, for their input and constructive suggestions. Special thanks to Dr. Lia Cardarelli, Sarah Chan, Vuk Pavlovic, Dr. Lisa Pell, and Dr. Zhou Yu for their contributions to the phage and prophage database. Thanks are also due to Kris Hon and Kelly Reimer for assistance in optimizing bench work experimental designs.

I would also like to thank the members of my supervisory committee, who showed continued interest, critical insight and assistance throughout the project. In addition to my immediate supervisor, Dr. Alan Davidson, I would like to acknowledge the insights and guidance provided by Dr. Karen Maxwell during the course of this project. Special mention is made of Linda Trouten-Radford and Don Radford for their tireless moral support, without which this project would have been impossible. I would also like to acknowledge the editorial assistance of Alan Davidson, my committee, and Linda Trouten-Radford in the revision and improvement of this thesis.

Table of Contents

Acknowledgments
List of Tables
List of Figures
List of Appendices
List of Definitions
Chapter 1: Introduction and Background
1.I Review of characterized prokaryotic compartments
1.II Review of known properties of published encapsulins
1.III Properties, similarities and differences of encapsulin and capsid-like structures
1.IV Bacteriophages profoundly influence prokaryotic evolution and physiology
1.V Objectives
Chapter 2: Identification of comprehensive population of encapsulins related to published examples
2. Chapter Abstract
2. Introduction
2. Methods
2. Results
2. Discussion
Chapter 3: Characterization of classical encapsulin-like families
3. Chapter Abstract
3. Introduction
3. Methods
3. Results
3. Discussion
Chapter 4: Families of distinct novel prokaryotic capsid-like compartments
4. Chapter Abstract
4. Introduction
4. Methods
4. Results
4. Discussion
Chapter 5: Conserved mechanisms of protein interaction produce the structural and functional similarities and differences existing between phage capsids and prokaryotic encapsulins
5. Chapter Abstract
5. Introduction
5. Methods
5. Results
5. Discussion
Chapter 6: Caudovirales phage capsids share intermingled ancestry and descent with prokaryotic encapsulins
6. Chapter Abstract
6. Introduction
6. Methods
6. Results
6. Discussion
Chapter 7: Conclusions
7.2 Future Directions
8. References
9. Formulas
Appendix 1: Other enzyme families enriched with encapsulins, conserving distinct association motifs indicative of non-targeted accessory proteins
Appendix 2: Other significantly enriched properties of encapsulin encoding organisms
Appendix 3: Prediction of horizontal transfer paths from nearest neighbour Karlin distance and Bayesian Inference phylogeny
Appendix 4: Other predicted novel encapsulin families suggest roles in resistance to toxic disruption of diverse carbon and phosphate metabolisms

List of Tables

Table 2-1: Sequence identity between published encapsulins used as initial pblast queries
Table 2-2: List of predicted classical encapsulin shell and cargo enzymes
Table 2-3: Systematic quantification of proteins with conserved encapsulin association specific terminal motifs by MEME/MAST
Table 2-4: Average percent identity within encapsulin associated protein families
Table 2-5: Co-encoded CEAS motifs are significantly more similar to each other than otherwise independent CEAS motifs
Table 2-6: Sequence similarity comparison of CEAS motifs relative to encapsulin relatedness
Table 2-7: Amino acid frequency composition of classical encapsulin associated enzyme linker regions preceding the CEAS motifs
Table 2-8: Example VADAR 61 3D profile quality index statistics for CEAS motifs
Table 3-1: Pearson product-moment correlation matrix between continuous numerical features of bacteria in the combined PATRIC and MiST phenotypic annotation dataset
Table 3-2: Goodman-Kruskal Tau-B association matrix of non-ordinal non-continuous features of bacterial genomes in the combined PATRIC and MiST dataset
Table 3-3: Features of significant difference between encapsulin encoding genomes and background of all genomes in PATRIC dataset
Table 3-4: Associations and dependencies between significantly enriched properties of genomes encoding encapsulin associated iron-dependent peroxidases
Table 3-5: Statistical associations and dependencies between significantly enriched properties of genomes encoding encapsulin associated iron-binding proteins (, rubrerythrin, bacterioferritin, and hemerythrin)
Table 3-6: Categorized distribution of Karlin distances between encapsulin genes and current genomes
Table 3-7: Statistical comparison of dinucleotide frequencies between four predicted Burkholderia encapsulins and the four genomes in which they are currently encoded
Table 3-8: Statistical comparison of dinucleotide frequencies between four Burkholderia genomes predicted to encode four encapsulins
Table 3-9: Standardized statistical comparison of dinucleotide frequency between four predicted Burkholderia encapsulins and the three most similar Burkholderia phage genomes
Table 3-10: Distribution of Karlin difference classes across datasets
Table 4-1: Distribution of phage-like segments in prokaryotes
Table 4-2: Distribution of novel encapsulin families
Table 4-3: List of predicted novel encapsulins and associated enzymes
Table 5-1: Prior phenotypic amino acid substitutions identified in Bacteriophage lambda
Table 5-2.i: Significance of average percent solvent accessibility in phage lambda gpE protein monomer
Table 5-2.ii: Significance of average percent solvent accessibility in phage lambda gpE protein asymmetric unit
Table 5-2.iii: Significance of average percent solvent accessibility in phage lambda gpE protein pentamer
Table 5-2.iv: Significance of average percent solvent accessibility in capsid embedded phage lambda gpE protein asymmetric unit
Table 5-3: Site specific divergence and conservation between capsid and encapsulin structures fit model of conservation and divergence in physical properties
Table 5-4: Diverse capsids sharing structural features conserve positions predicted by the model to confer those features

List of Figures

Figure 1-1: Standard domain architecture of capsid-like fold
Figure 1-2: Structural comparison of the T. maritima encapsulin and Bacteriophage HK97 capsid monomer structures
Figure 1-3.i: Encapsulin and capsid shell proteins organize into very similar multimer structures
Figure 1-3.ii: Encapsulin and capsid multimers organize into closed icosahedral shells using pentameric vertices and hexameric planar faces
Figure 1-4: Genome context differences between encapsulins and prophage capsids
Figure 2-1: ROC curves comparing classifier performance for transitive homology annotation and HMM matching methods, predicting Caudovirales major capsid proteins
Figure 2-2: New classical encapsulins identified in 14 Bacterial and 3 Archaeal phyla
Figure 2-3: Distinct sequence identity clusters/families for six encapsulin associated enzyme classes
Figure 2-4: Venn diagram of co-encoding of 6 classes of CEAS enzymes in predicted multi-target or hetero-compartment encapsulin systems
Figure 2-5: Overlay of candidate C-terminal conserved encapsulin association specific motifs in diverse predicted cargo enzymes
Figure 2-6: Conserved encapsulin association specific consensus motif
Figure 2-7: Example separately encoded targeted peroxidase, encapsulin pair
Figure 2-8: MAFFT sample alignment of related bacterioferritins with and without targeting motifs
Figure 2-9: MAFFT sample alignment of iron-dependent peroxidases with and without targeting motifs
Figure 2-10: Encapsulin associated ferredoxins from the Bacilli phyla encoded N-terminal motif similar to co-encoded C-terminal CEAS motifs in
Figure 2-11: Example multi-enzyme encapsulin operon
Figure 2-12: CEAS motif pairwise local percent similarity between co-encoded targeted enzymes vs. randomly chosen CEAS motifs
Figure 2-13: Example paired alignments of co-conserved positions in co-encoded CEAS motifs vs. non-co-encoded CEAS
Figure 2-14: T. maritima published binding interface for CEAS motif
Figure 2-15.i: Conserved predicted binding interface for consensus CEAS motif
Figure 2-15.ii: Top view of polar contacts and hydrophobic patch residues in Catenulispora acidiphila DSM 44928 peroxidase encapsulin, predicted targeting motif and binding surface interaction
Figure 2-15.iii: Two-dimensional simplified view of interacting positions in Catenulispora acidiphila DSM 44928 peroxidase encapsulin, predicted targeting motif and binding surface interaction
Figure 2-16: Diagrammed model of covariant positions between paired targeting motifs and classical encapsulins
Figure 2-17: Alignment of conserved encapsulin and capsid inner surface positions within 4 Å of threaded CEAS motif peptide
Figure 3-1: Decision tree of functional annotation of genes flanking predicted classical and novel encapsulin genes
Figure 3-2: Encapsulin encoding organisms occupy distinct cargo specific phenotypic niches
Figure 3-3: Encapsulin related resistance coupled to diverse operons
Figure 3-4.i: Coexpression with the M. tuberculosis encapsulin shell protein relieves growth defect from expression of M. tuberculosis iron-dependent peroxidase
Figure 3-4.ii: Electron micrograph of peroxidase packaging M. tuberculosis encapsulins expressed in E. coli BL21
Figure 3-5: Encapsulin shell and cargo enzymes show strong signal of horizontal transfer into current genomes
Figure 3-6: Neighbour Joining Tree comparing select Burkholderia encapsulins and Burkholderia phage capsids
Figure 3-7: GC content plot of Actinomyces oris K20 encapsulin and targeted enzyme region relative to total genome
Figure 3-8: Co-encoded encapsulin shells and cargo enzymes as similar in nucleotide signature as functionally coupled horizontally transferred PVC operon genes
Figure 3-9: Subsets of encapsulin shells compartmentalize cargo enzymes from non-parallel lineages, adopted from lineage swapping recombination
Figure 3-13: Condensed Consensus Bayesian Inference phylogeny of representative rubrerythrins
Figure 3-14: Condensed Consensus Bayesian Inference phylogeny of representative ferritin-like proteins
Figure 3-15: Condensed Consensus Bayesian Inference phylogeny of representative iron-dependent peroxidases
Figure 3-16: Condensed Consensus Bayesian Inference phylogeny of representative prokaryotic hemerythrins
Figure 3-17: Condensed Consensus Bayesian Inference phylogeny of representative bacterioferritin-like proteins
Figure 3-18: Condensed Consensus Bayesian Inference phylogeny of representative classical encapsulins
Figure 4-1.i: Example operon diagram for rhodanese/desulfurase/MMPI encapsulin/acetyltransferase systems
Figure 4-1.ii: Example operon diagram for FNR-sensor domain fusion MMPI encapsulin systems
Figure 4-2: Sample motif maps of desulfurase proteins encoded adjacent to MMPI-like novel encapsulin genes
Figure 4-3.i: N-terminal encapsulin association specific desulfurase motif
Figure 4-3.ii: Internal encapsulin association specific desulfurase motif
Figure 4-3.iii: Encapsulin association specific motifs and validated ligand binding and active sites conserved in novel encapsulin associated predicted desulfurases
Figure 4-3.iv: Predicted structure of novel encapsulin associated cysteine desulfurase, maintains enzymatic active site, and presents encapsulin association specific motifs exposed toward predicted encapsulin shell inner surface
Figure 4-4: Significantly enriched desulfurase association specific encapsulin N-terminal motif
Figure 4-5.i: Experimentally validated canonical FNR-sensor ligand binding sites conserved in novel FNR sensor-like domain fusion encapsulins
Figure 4-5.ii: Predicted structure of novel encapsulin fused FNR sensor-like domain, maintains exposure of FNR sensor domain ligand binding site, and presents FNR-like domain on inner surface of predicted compartment
Figure 5-1: Wide polyhead mutant in the Phage lambda gpE major capsid protein
Figure 5-2: Narrow polyhead mutant in the Phage lambda gpE major capsid protein
Figure 5-3: Small prohead mutant in the Phage lambda gpE major capsid protein
Figure 5-4: DNA packaging defective mutant in the Phage lambda gpE major capsid protein
Figure 5-5: Structural map of phenotypic amino acid substitution sites, and interaction interfaces on the predicted structure of the Escherichia coli siphophage lambda gpE major capsid monomer
Figure 5-6.i: Escherichia siphophage HK97 gp5 A147S, predicted narrow polyhead mutant produces aberrant complexes consistent in size with narrow polyheads, on glycerol gradient
Figure 5-6.ii: Wildtype Escherichia siphophage HK97 prohead I and prohead II
Figure 5-6.iii: Escherichia siphophage HK97 gp5 A147S, predicted narrow polyhead mutant, produces polyhead related particles under transmission electron microscopy
Figure 6-1: Consensus tree from Bayesian Inference, Maximum Likelihood phylogenetic prediction analysis of capsids and encapsulins clustered in proximity to classical encapsulins
Figure 6-2: Phage BcepMu capsid family derives from encapsulin ancestors and is more similar to novel encapsulins than phage capsids outside of this family
Figure 6-3: Prophage rubrerythrin encapsulin operon intermediate in Anaerofustis stercorihominis DSM 17244

List of Appendices

Appendix 1: Other enzyme families enriched with encapsulins, conserving distinct association motifs indicative of non-targeted accessory proteins

Figure A1-1: Conserved, encapsulin association specific, CutA1-like dication tolerance protein consensus motif
Figure A1-2: MAFFT alignment of representative CutA1-like copper tolerance enzymes with and without encapsulin association specific motif
Figure A1-3: Condensed Bayesian Inference Phylogeny 46 of non-redundant representative CutA1-like metal binding proteins
Figure A1-4: Conserved encapsulin association specific radical SAM oxidoreductase consensus motif
Figure A1-5: Condensed Bayesian Inference Phylogeny 46 of non-redundant representative predicted radical SAM oxidoreductases
Figure A1-6.i: Conserved encapsulin association specific fruiting body metalloprotease consensus distal internal motif
Figure A1-6.ii: Conserved encapsulin association specific fruiting body metalloprotease consensus proximal internal motif
Figure A1-7: Example genomic context of untargeted enriched encapsulin accessory proteins

Appendix 2: Other significantly enriched properties of encapsulin encoding organisms

Figure A2-1: Boxplot comparing percent GC content between encapsulin encoding genomes and the overall distribution of GC content across PATRIC dataset genomes
Figure A2-2: Boxplot of distributions of proteome size between all PATRIC dataset genomes and those encoding encapsulin associated iron-dependent peroxidases
Figure A2-3: Boxplot of distributions of genome length between all PATRIC annotated genomes and those encoding encapsulin associated iron-dependent peroxidases

Appendix 3: Prediction of horizontal transfer paths from nearest neighbour Karlin distance and Bayesian Inference phylogeny

Figure A3-1: Overall horizontal transfer path of classical encapsulins through Archaea and Eubacteria, based on nearest match nucleotide signature Karlin distances
Figure A3-2: Representative example from Burkholderia derived encapsulin systems in which multiple sequential horizontal transfer events yield traceable co-transmission of encapsulins and enzymes with moderately diverged nucleotide signatures

Appendix 4: Other predicted novel encapsulin families suggest roles in resistance to toxic disruption of diverse carbon and phosphate metabolisms

Figure A4-1: Conserved encapsulin association specific motif from novel alpha-ketoglutarate decarboxylase/2-oxoglutarate dehydrogenases
Figure A4-2: Example novel encapsulated ketoglutarate decarboxylase/dihydrodipicolinate synthase operon
Figure A4-3.i: C-terminal conserved encapsulin association specific motif from inorganic pyrophosphatase
Figure A4-3.ii: Conserved encapsulin association specific motif internally encoded in inorganic pyrophosphatase
Figure A4-4.i: Candidate pyrophosphatase novel encapsulin targeting motifs are surface exposed
Figure A4-4.ii: Candidate pyrophosphatase novel encapsulin targeting motifs as exposed in biological assembly hexamer
Figure A4-5: Conserved encapsulin association specific motif from predicted heme-dependent cytochrome-like monooxygenase hydroxylases
Figure A4-6: C-terminal conserved encapsulin association specific motif from predicted undecaprenyl-pyrophosphate binding proteins

List of Definitions

Association:

Genomic association: Genes are genomically associated if they are encoded in close proximity to each other in a genomic region.

Metabolic association: Proteins or cofactors are metabolically associated if they contribute to the same metabolic pathway or function.

Statistical association: Variables are statistically associated if the value of the first can be predicted to some fraction from knowing the value of the second. Measured by Goodman-Kruskal tau association metric (Formula 3-1).

Bacteriocin: Non-replicating compound produced by an organism causing death or growth inhibition of one or more bacterial strains.

Bactericidal: Causing death of prokaryotic cells.

Bacteriostatic: Preventing growth of prokaryotic cells without causing cell death.

Bacteriophage: Viruses infecting prokaryotes. The most common order of these viruses is the Caudovirales, recognized for packaging parasitically replicating double stranded DNA genomes within protein based tailed particles with icosahedral heads. A large subset of these viruses can persist as symbiotic genomic segments within host prokaryotic genomes, as a separate life cycle to lytic replication. Also known as phage.

Bonferroni Correction: Method of correcting for testing multiple hypotheses in search of one or more significantly different comparisons. Defines the overall P-value cutoff necessary to control the familywise error rate, such that the overall likelihood of a false positive remains α despite multiple tests. The cutoff is calculated as the acceptable error rate for a single test (α) divided by the number of tests.

Candidate: Protein or gene under consideration for a specific classification or function, without sufficient accumulated evidence as yet to confidently assert classification or function.

Capsid: Virally encoded large hollow icosahedral homomeric complex packaging the viral genome and accessory proteins for transmission between hosts.

Coencoded: Genes occurring together in the same genome. Usually refers to genes in close physical proximity to each other. This concept may be extended to more distantly encoded gene products within the same genome, in the presence of strong evidence of cofunctionality (e.g. physical interaction between coexpressed proteins).

Coexpressed: Genes expressed at the same time, or predicted to be under the same transcriptional regulatory control based on genomic context.

Cofunctional: Proteins or riboproteins which act together to perform a function not performed in the absence of one of the partners.

Copackaged: Proteins or cofactors compartmentalized into the same compartment.

Copurify: Molecules which under specific purification conditions occur together in the same fraction. Consistent with interacting molecules, but generally not sufficient proof of interaction.

Cotargeted: Proteins sharing the same targeting signal for compartmentalization into the same compartment population, either within the same compartments or into discrete subpopulations of a single compartment population.

Correlation: Statistically determinable relationship between two continuous variables, such that the values of one are related to the values of the other. Often determined by the Pearson Product-Moment correlation coefficient 51.

Covariant: A property of some proteins, whereby specific amino acid differences at one or more positions co-occur with specific amino acid differences at one or more other positions. Covariant positions usually result from physical interactions between the positions.

Encapsulated: Protein, substrate or cofactor packaged within an encapsulin shell.

Encapsulin: Prokaryotically encoded large hollow icosahedral homomeric protein compartment, packaging prokaryotic enzymes and substrates for physiological function. Structurally very similar to capsids from the Caudovirales order of viruses.

Classical Encapsulin: Encapsulins with non-negligible sequence identity to one or more of the three originally published encapsulins (T. maritima, P. furiosus, B. linens).

Novel Encapsulin: Encapsulins without detectable sequence identity to any of the three originally published encapsulins (T. maritima, P. furiosus, B. linens).

Enrichment: Greater than expected frequency of members of a gene family, components of a metabolic pathway, physiological properties, etc. in a subset of interest relative to another subset or the full population. Evaluated using the hypergeometric test with correction for multiple hypothesis testing.

Eukaryote: Cell compartmentalized by membrane bound organelles with the genome sequestered within a nucleus.

Heteromeric: Protein or riboprotein complex composed of copies of two or more different proteins and/or RNAs.

Hidden Markov Model: A position weighted substitution matrix derived from a multiple sequence alignment for use in the prediction of sequence similarity 68. Abbreviated HMM.

Homolog: A gene or protein related to a second gene or protein by similarity in structure, sequence and function suggesting descent from a common ancestral nucleotide sequence.

Analog: A gene or protein with similar function or structure, but not necessarily closely related phylogenetically.

Ortholog: A gene or protein from one species related to a gene or protein in a second species with the assumption of inheritance from the same gene in a common ancestor.

Paralog: A gene or protein related to a second gene or protein within the same species with the assumption of gene duplication of one copy from the other.

Xenolog: A gene or protein of distinct lineage but advantageous function which replaces a preexisting ancestral gene, which may or may not have been homologous.

Homomeric: Protein complex composed of two or more copies of a single protein.

in cis: Functional components are encoded together within a single genomic region. This layout is seen in the published encapsulin examples in which the shell and cargo enzyme genes are encoded as a fusion or adjacently in the same operon.

in trans: Functional components are encoded in multiple separate genomic regions. This layout is seen in the beta-carboxysome compartments.

Linocin: Encapsulin compartments encoded by Brevibacterium linens. Co-expressed and initially co-purified with a bactericidal factor, leading to classification as a bacteriocin. Packages an iron-dependent peroxidase. Shell and peroxidase transgenically expressed in E. coli do not act as a bacteriocin.

Microcompartments: Near micrometer scale protein compartments comprised of shells packaging enzymes, substrates and cofactors involved in biological function. Similar to nanocompartments but larger in size. Generally greater than 100 nm in diameter.

Nanocompartments: Nanometer scale protein compartments comprised of shells packaging enzymes, substrates and cofactors involved in biological function. Similar to microcompartments but smaller in size. Generally less than 100 nm in diameter.

Phage: see Bacteriophage.

Predicted: Classification confidently supported by available information but not tested experimentally.

Prohead: Immature precursor structure formed from phage capsid multimers prior to DNA packaging, expansion, cementing, and maturation. Capsid structure most contextually similar to the expected formation conditions of encapsulins.

Prokaryote: Microbial organism from either the Bacterial or Archaeal kingdom. Genome is not sequestered within a nucleus.

Prophage: Viral genome incorporated into a host prokaryotic genome either directly or as a plasmid. Degradation of these prophages can produce genomic segments encoding proteins with similarity to phage components, but without the ability to self replicate parasitically.

Pseudo-paradoxically: A state of reality or aspect of reality which appears to be a contradiction, but is not in fact contradictory.

Putative: Classification supported by tests to date, but not of the same confidence as predicted.

Secretion system: System of proteins enabling compounds within a prokaryotic cell to be excreted outside the cell.

Type III: Protein based system for excreting compounds from within a Gram-negative cell into the cytosol of another cell through a membrane. Resembles a needle extending out of the Gram-negative double membrane. Structurally related to the bacterial flagella.

Type IV: Protein based system for transmitting small molecules, proteins and DNA between prokaryotes. The conjugation pilus is a special case of this family. Core machinery composed of a series of aligned membrane spanning channel pores. Other structural features denote subtypes and specializations.

Type VI: Protein based system for excreting compounds from within a prokaryotic cell into the cytosol of another cell through a membrane. These systems are similar in sequence and structure to Caudovirales phage tails.

Stationary growth: State of a prokaryotic population in which the replication rate approximately equals the death rate for cells. Induced by a shortage of nutrients or an excessive quorum of bacteria.

Ultra-megadalton: Protein or nucleotide complexes with mass greater than 1,000,000 Daltons. Transcripts, viruses and compartments fit into this mass category.

Vegetative growth: State of a prokaryotic population in which replication is actively occurring, and the replication rate exceeds the death rate, producing an increasing concentration of cells over time. Divided into lag and log phases and ends with the transition to stationary phase growth.
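
The Bonferroni Correction and Enrichment definitions above can be illustrated with a short calculation. The following is a minimal sketch only, not the analysis code used in this thesis: the function names and the toy counts are hypothetical, and the use of an upper-tail (greater than or equal to the observed count) hypergeometric probability is an assumption about how enrichment would be scored.

```python
from math import comb


def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test P-value cutoff: acceptable error rate alpha divided by the number of tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def hypergeom_enrichment_p(population: int, successes: int, draws: int, observed: int) -> float:
    """Upper-tail hypergeometric probability P(X >= observed).

    population: total number of items (e.g. all genomes in a dataset)
    successes:  items in the population carrying the feature of interest
    draws:      size of the subset of interest (e.g. encapsulin encoding genomes)
    observed:   items in the subset carrying the feature
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(max(observed, 0), upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical toy counts, chosen only to make the arithmetic easy to follow.
    p_value = hypergeom_enrichment_p(population=20, successes=5, draws=4, observed=2)
    cutoff = bonferroni_cutoff(alpha=0.05, n_tests=10)
    print(f"P-value: {p_value:.4f}; Bonferroni cutoff: {cutoff}; significant: {p_value <= cutoff}")
```

In this sketch a feature counts as enriched only when its hypergeometric P-value falls at or below the Bonferroni cutoff, so testing more features makes the threshold for any single feature stricter.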

Chapter 1: Introduction and Background

This thesis investigates families of prokaryotic compartments, collectively termed encapsulins, with significant similarity to capsids from Caudovirales viruses. The concept of compartmentalization is relatively novel in prokaryotes. While some compartments have been well characterized, this particular family has not.

Historically, bacteria were considered simple, primitive, ill-organized bags of protein. As knowledge of these ubiquitous organisms grew, this naive view was steadily disproven. Bacteria specifically, and prokaryotes in general, were shown to be far more complex and well adapted than was initially thought. While plants are recognized as biochemical adaptive specialists, and animals are credited with adaptive mobility, prokaryotes excel in both skill sets. A lingering shortfall in adaptive excellence between prokaryotes and eukaryotes, including both plants and animals, remained compartmentalization, with which eukaryotes were thought to be exclusively endowed.

Recent studies have shown a number of examples of protein-based subcellular compartmentalization in prokaryotes 1. These compartments confer many of the advantages of the smaller eukaryotic organelles. For example, carboxysomes compartmentalize carbon dioxide fixation in cyanobacteria, via reactions restricted to chloroplasts in eukaryotes 1. The enterobacterial ethanolamine and related 1,2-propanediol utilization compartments compartmentalize catabolism of these otherwise toxic carbon sources 1. Similar reactions in eukaryotes are restricted to endosomes and vacuoles. Lumazine synthase from Bacillus subtilis and Salmonella typhimurium forms a trimer-based icosahedral shell, with targeted cargo riboflavin synthase-forming bifunctional enzyme complexes, catalyzing formation of 6,7-dimethyl-8-ribityllumazine, a precursor to riboflavin 1,17. This use of a shell is distinct from the structurally solved Mycobacterial lumazine synthase, which catalyzes the same reaction but forms a distinct decameric globular complex 18. Thus, compartmentalization is an important but not exclusive solution to a diverse range of metabolic needs.

These prokaryotic compartments bear no similarity in sequence or structure to eukaryotic protein compartments, such as the ribonucleoprotein vaults or iron-cage ferritins. The vaults are large protein transport containers, composed of cargo peptides, short structural RNAs and multiple copies of three conserved proteins: a major vault protein, the poly(ADP-ribose) polymerase VPARP, and the telomerase associated protein TEP1 19. The 96 major vault proteins form the majority of the vault shell structure and over 70% of its mass, while the RNAs and enzymes facilitate the dynamics of assembly, cargo loading, and disassembly 19. These components form an oblong, hollow, closed cylinder with protruding caps, and a slender pipe 19. The eight VPARP proteins localize to the pipe, while the two TEP1 proteins and at least six RNAs localize to the caps 19. The number and size of RNAs vary by species 19.

Eukaryotic ferritin, conversely, is a structurally conserved homomultimer, composed of 24 copies of the ferritin helix bundle protein, forming a dynamic octahedral cage. This enzymatic cage regulates iron availability by sequestering free iron into its lumen 2. These ferritin cages are similar in size to some prokaryotic compartments, but distinct in sequence, symmetry, and structure. The prokaryotic compartments are generally more symmetric and consistent in form and composition than vaults, while the eukaryotic ferritins utilize a simpler symmetry.

1.I Review of characterized prokaryotic compartments

1.I.a Carboxysomes

The following section will describe the carboxysomes, ethanolamine utilization compartments, two forms of propanediol utilization compartments, and the lumazine synthases. Carboxysomes exhibit diameters between 800 Å and 1400 Å, visible by confocal microscopy, and are encoded in cyanobacteria and some chemoautotrophs 1. Although entirely protein based, due to the large compartment size, carboxysomes often fractionate with the cell membrane and membrane proteins, complicating purification and characterization of the true composition of these compartments 6. These microcompartments are composed of a thin outer layer of protein surrounding a large population of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCo) and carbonic anhydrase. Carbonic anhydrase catalyzes the equilibrium between bicarbonate and aqueous carbon dioxide 5. Bicarbonate is more soluble than carbon dioxide, but cannot be directly fixed in the same way. The anhydrase catalysis is dependent on a coordinated zinc ion in the reaction centre 5. RuBisCo converts ribulose-1,5-bisphosphate and carbon dioxide to two molecules of 3-phosphoglycerate as the first step in the process of carbon fixation. Together these two enzymes provide carboxysome encoding organisms an efficient, steady supply of reduced carbon, otherwise inaccessible within carbon-limiting environments 5. Relative to plant RuBisCo, the prokaryotic homolog has been found to have low efficiency, requiring compartmentalization to restrict substrate diffusion.
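For reference, the two reactions described above can be written as balanced equations. This is a standard textbook formulation, not one taken from the cited sources; the water and proton terms do not appear in the text and are included here only so that the equations balance.

```latex
% Carbonic anhydrase: interconversion of aqueous carbon dioxide and bicarbonate
\[
\mathrm{CO_2} + \mathrm{H_2O} \;\rightleftharpoons\; \mathrm{HCO_3^-} + \mathrm{H^+}
\]

% RuBisCo carboxylation: first step of carbon fixation
\[
\text{ribulose-1,5-bisphosphate} + \mathrm{CO_2} + \mathrm{H_2O}
\;\longrightarrow\; 2\,(\text{3-phosphoglycerate}) + 2\,\mathrm{H^+}
\]
```

Written this way, the coupling is explicit: the anhydrase supplies the carbon dioxide that RuBisCo consumes, drawing on the more soluble bicarbonate pool.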

The carboxysome-encoding Prochlorococcus genus dominates the temperate oceanic carbon cycle, contributing over 50% of global carbon fixation 6. These bacteria inhabit photosynthetic niches to depths of over 200 meters in the water column, utilizing a broad range of wavelengths and intensities. In this genus the carboxysome forms heterogeneous 70-100 nm particles composed of six distinct proteins: the anhydrase CsoSCA, the RuBisCo subunits CbbL and CbbS, and the shell proteins CsoS2, CsoS1 and CsoS1D 6. Of the shell proteins, CsoS1D has the features of a gated pore, regulating substrate and product diffusion. In vitro these six proteins are sufficient to catalyze both the anhydrase and carboxylase activities from purified compartments 6. Along with four regulatory genes, this appears to be the minimal carboxysome compartment operon, since other species encode as many as eighteen operon components 6. These more complex carboxysomes are less uniform in shape, and less resilient to environmental stress, but more catalytically efficient 6.

At the phylogenetic level, two clusters of carboxysomes have been identified, alpha and beta 7. The distinction between the two relates to differences in compartmentalized enzyme families, bacterial lineage, and operon structure 7. The alpha carboxysomes are better studied and more prevalent among both alpha-cyanobacteria and other chemoautotrophic lineages. These compartments follow the canonical patterns of regulators, enzymes and shell proteins encoded in single, tightly coupled operons. In contrast, the beta carboxysomes are largely restricted to beta-cyanobacteria and, unlike most compartments, derive components from multiple separately encoded genomic regions. The shell and RuBisCo protein subunits from both families are well conserved in sequence, while the regulators, anhydrases, and accessory proteins are more variable 7.

A number of genes encoding non-compartmentalized proteins are enriched with the alpha carboxysome family genes. These include bacterioferritins, uncharacterized pterin-4a-carbinolamine dehydratase-like enzymes, CbbQ ATP-binding proteins, CbbO which activates CbbQ, and CobQ cobyrinic acid a,c-diamide synthases 7. CbbO and CbbQ regulate RuBisCo through an ATP-dependent mechanism. CobQ uses ammonia or glutamine to generate cobalamin cofactors. Bacterioferritin protects from Fenton toxicity by sequestering iron 7. The dehydratase-like enzymes are predicted to catalyze the conversion of 4a-hydroxy-tetrahydrobiopterin to dihydrobiopterin. The link between these two activities and carboxysomes is uncharacterized, though both contribute to aromatic amino acid hydroxylase activity, suggesting carboxysomes may be linked to more than carbon fixation 7.

1.I.b Ethanolamine utilization (Eut) and propanediol utilization (Pdu) compartments

The ethanolamine utilization (Eut) compartments are encoded in related Enterobacteriaceae and some Firmicutes 1. These heteromeric compartments are encoded by a four-protein operon, whose products are similar to each other in structure, but distinct in sequence and function. EutS and EutM conserve a core mixed alpha/beta fold comprising the majority of the compartment shell. EutL acts as a gated pore restricting substrate transport. EutK is a two-domain protein in which the carboxy-terminus is ligand binding and the amino-terminus is enzymatic. Together these proteins form a compartment that restricts diffusion of the toxic intermediate acetaldehyde produced during the conversion of ethanolamine to ethanol and acetyl phosphate or acetyl CoA 1,7. EutBC converts ethanolamine to acetaldehyde and ammonia. EutG converts acetaldehyde to ethanol. EutE converts acetaldehyde to acetyl CoA. EutD converts acetyl CoA to acetyl phosphate 7.
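The enzymatic steps listed above can be summarized as a small reaction graph. The following is a minimal illustrative sketch, not an analysis performed in this thesis: the dictionary encodes only the four conversions stated in the text, and the names `EUT_REACTIONS` and `routes` are hypothetical helpers introduced here.

```python
from collections import deque

# Reactions exactly as stated in the text: enzyme -> (substrate, products).
EUT_REACTIONS = {
    "EutBC": ("ethanolamine", ["acetaldehyde", "ammonia"]),
    "EutG": ("acetaldehyde", ["ethanol"]),
    "EutE": ("acetaldehyde", ["acetyl-CoA"]),
    "EutD": ("acetyl-CoA", ["acetyl phosphate"]),
}


def routes(start, reactions):
    """Return (end metabolite, enzyme path) for every route from `start`.

    An end metabolite is one that no listed enzyme consumes.
    """
    # Index the reactions by the metabolite they consume.
    by_substrate = {}
    for enzyme, (substrate, products) in reactions.items():
        by_substrate.setdefault(substrate, []).append((enzyme, products))

    found = []
    queue = deque([(start, [])])
    while queue:
        metabolite, path = queue.popleft()
        steps = by_substrate.get(metabolite, [])
        if not steps and path:
            found.append((metabolite, path))
        for enzyme, products in steps:
            for product in products:
                queue.append((product, path + [enzyme]))
    return found


if __name__ == "__main__":
    for end_product, path in routes("ethanolamine", EUT_REACTIONS):
        print(" -> ".join(path), "yields", end_product)
```

Every carbon-containing route in this graph passes through acetaldehyde immediately after EutBC, which is the intermediate whose diffusion the shell is described as restricting.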

Regulation of this compartment occurs in two forms depending on lineage. Enterobacteriaceae utilize EutR, an ethanolamine and adenosylcobalamin responsive regulator. Firmicutes encode a two protein system of EutVW, a histidine kinase and response regulator 7.

The related propanediol utilization (Pdu) compartment in Salmonella enterica converts 1,2-propanediol to propionyl-CoA, while restricting diffusion of the toxic intermediate propionaldehyde 8. This shell is composed of eight related subunits PduABB'JKNTU forming a closed polyhedron. The amino-terminal region of the PduD subunit encodes a nineteen amino acid targeting motif, directing the major enzymatic components, PduCDE, to the compartment shell 8. This nineteen amino acid segment was found sufficient to induce de novo targeting of green fluorescent protein or glutathione S-transferase to the Pdu compartment 8.

Five other proteins are also packaged as four enzymatic complexes in this compartment 7. PduS, PduO, and PduGH catalyze regeneration of the B12 cofactor for PduCDE. PduP catalyzes the final step in the formation of propionyl-CoA. Also encoded in this pathway, though not compartmentalized, are PduL, PduQ, and PduW. PduQ consumes NADH to convert propionaldehyde to propanol. PduL and PduW convert propionyl-CoA to propionic acid 7.

Similar to alpha carboxysomes, Pdu operons are enriched for co-occurrence with cobalamin cofactor metabolism. CobU, CobS, and CobC synthesize the lower ligand of B12, required for PduCDE activity. These proteins are not compartmentalized with the Pdu components, but are instead co-regulated through PocR, to provide B12 to that compartment 7.

Similarly, a putative glycyl radical-based, B12-independent, propanediol utilization compartment has been proposed to exist based on gene enrichment analysis, although this compartment has yet to be observed in vivo 7. This theoretical compartment utilizes ten conserved genes, many of which are related to those used by Eut, Pdu and carboxysome compartments. These ten genes subdivide into four clusters. PduO-like enzymes co-cluster with a MIP channel. PduO is an ATP/cobalamin adenosyltransferase involved with B12 synthesis, while the MIP channel transports small neutral metabolites 7. Two drug-resistance proteins, two TetR-like regulatory proteins, and a phosphotransacylase related to PduL group into another cluster 7. The third cluster includes a histidine kinase and response regulator similar to those of Eut. The largest cluster centres around a predicted glycyl radical dependent dehydratase, a glycyl radical activator protein, an aldehyde dehydrogenase, a phosphotransacylase, an alcohol dehydrogenase, four predicted paralogous shell proteins, a EutN-like shell vertex protein, and a EutJ homolog 7.

The glycyl radical dependent dehydratase is predicted to convert 1,2-propanediol to propionaldehyde. From this point two reactions are predicted: aldehyde dehydrogenase converting propionaldehyde to propionyl-CoA, and alcohol dehydrogenase converting propionaldehyde to propanol. Propionyl-CoA can then be converted to propionyl-phosphate by phosphotransacylase 7. N-terminal extensions are observed in the phosphotransacylases, PduO-like proteins, and predicted dehydratase, which may encode some form of compartmentalization signals. No significantly conserved motifs were found within these extensions, suggesting the extensions may have an alternative purpose or utilize an undetectable interaction domain 7.

Eight homologs of one of these predicted shell proteins encode uncharacterized C-terminal extensions between 80 and 100 amino acids in length, in addition to the canonical shell domain 7. These extensions are predicted to be dynamic and flexible without fixed structure, suggestive of high specificity protein-protein interaction domains. These predicted shell proteins diverge most strongly from both the Eut and Pdu analogs. In Desulfovibrio, these proteins occur as tandem gene duplications 7. The functional significance of tandem duplication in these systems remains unclear. This type of duplication is atypical of compartments in this and other families.

Escherichia coli overexpression of the Pectobacterium wasabiae homolog of this family produced a soluble brown protein with strong light absorption across broad peaks at 330 and 420 nm, suggesting the coordination of one or more iron-sulfur clusters 7. Visualized by native gel electrophoresis, this protein behaves like a hexamer similar to the Eut and Pdu analogs, but without formation of ultra-megadalton complexes or pseudo-aggregates expected for a microcompartment 7. Thus, if these hexamers contribute to compartments in vivo, some component absent from the E. coli expression condition is required, such as one or more other paralogs or vertex proteins from the operon.

In Roseburia inulinovorans this operon also includes an anaerobic aldolase, required for the conversion of six-carbon sugars to 1,2-propanediol 7. Under anaerobic conditions, when grown on fucose or rhamnose, components of this operon comprise 12 of the 20 most highly upregulated genes 7. This confirms that these genes are co-expressed, but does not demonstrate formation of a compartment.

1.I.c Lumazine synthase

Lumazine synthases catalyze the formation of 6,7-dimethyl-8-ribityllumazine 1. This is achieved through compartmentalization in Bacillus subtilis, Aquifex aeolicus and Salmonella typhimurium 1, while in Mycobacterium this reaction is uncompartmentalized and the enzyme is globular 18.

The B. subtilis compartment forms small 60mer complexes from twenty trimers, and packages trimeric riboflavin synthases 1. This synthase converts the 6,7-dimethyl-8-ribityllumazine produced by the enzymatic domain of the shell to riboflavin and 5-amino-6-ribitylamino-2,4(1H,3H)-pyrimidinedione.

The A. aeolicus lumazine synthase produces two different sizes of compartments from the same monomers 9. While the larger form incorporates 180 monomers, the smaller uses only 60 monomers in the shell 9. Expression of the A. aeolicus lumazine synthase gene in E. coli produces analogous compartments, despite the differences between the two bacteria. These compartments spontaneously package positively charged enzymes 9. Thus a deca-arginine tag is sufficient to induce packaging of HIV protease as a de novo cargo enzyme in these compartments. This initial pairing provided inefficient packaging, only partially relieving the protease-dependent toxicity. Directed evolution from this low specificity compartment/cargo system successfully produced a super-negative lumazine synthase with improved packaging efficiency, which completely relieved protease-dependent toxicity in this system 9. Thus even low specificity compartmentalization can be a selectively advantageous adaptation.

These compartments vary in size, structure, phylogenetic distribution, genomic structure, and specific functions. More generally, carboxysome, Eut, Pdu and lumazine synthase compartments all facilitate specialized carbon metabolisms, with potentially problematic substrates or products. In the case of carboxysomes this involves concentrating low-solubility carbon dioxide in proximity to RuBisCo 5. In the cases of Eut, Pdu and lumazine synthase this involves restricting the diffusion of potentially toxic metabolic pathway intermediates 1. Both N-terminal and C-terminal targeting motifs are predicted for these compartments, but not both in the same systems. Thus prokaryotic compartments in general defy blanket generalization, with diversity in form and function between families, and restricted specific features within families.

1.II Review of known properties of published encapsulins

Overall most prokaryotic compartment families are restricted to a small number of lineages; compartmentalize a single set of enzymes for a distinct function; and are structurally distinct from other protein families including other compartments 1,7. While the families mentioned above are well characterized in mechanisms, another distinct family of compartments has been identified, both under-characterized and appearing to be the exception to these generalizations.

These compartments are the encapsulins 1,2, forming the focus of this project. The five published examples of this compartment family consist of packaged enzymes and homomeric shell complexes, composed of multiples of 60 monomers, organized as pentamers and hexamers, to form hollow icosahedral protein spheroids 1. Previously, only encapsulins from Thermotoga maritima 2, Brevibacterium linens 21, Rhodococcus jostii 20 and Pyrococcus furiosus 4 had been investigated in any depth. During the preparation of this thesis a fifth encapsulin was published from Myxococcus xanthus 22. These five studied encapsulins occur in three highly divergent phyla spanning both Eubacteria and Archaea, suggesting a much broader dissemination of encapsulins than has yet been reported. With known encapsulins identified in such divergent organisms, an intrinsic goal of this project was to investigate the prevalence and distribution of encapsulins across all sequenced prokaryotes.

Unlike other compartments, encapsulin shells bear strong similarity in structure to phage capsids from the order Caudovirales. Electron microscopy (EM) studies have confirmed the formation of capsid-like compartments for the T. maritima, B. linens and P. furiosus encapsulins in vivo 2,4. X-ray crystallography structures of the T. maritima encapsulin and P. furiosus virus-like particle have been solved 2,4. Recombinant expression of the B. linens and R. jostii encapsulins in E. coli has confirmed the transferability of these compartments, and provided preliminary characterization of in vivo mechanistic properties 2,20.

The B. linens encapsulin was the first encapsulin identified, leading to this family of compartments being initially termed linocins, with this specific complex termed linocin M18 2,21. Initial purification studies attributed bactericidal activity to this compartment, but this was later shown to result from a co-purified but indirectly related factor 2,44. On its own this linocin has 10 to 100 fold bacteriostatic activity, but not bactericidal activity 21. The precise mechanism of this activity has yet to be determined 21. From B. linens, the compartment purifies as a greater than two megadalton assembly, found in both the intracellular and extracellular fractions, suggesting either excretion, or prolonged stability post cell lysis. When recombinantly expressed in E. coli, the compartments were only detected intracellularly 2, suggesting a distinction in how Brevibacterium and Escherichia target linocin, a difference in extracellular stability, or an absence of E. coli cell lysis. E. coli cell lysis is well characterized and inducible by multiple mechanisms, yet evidently did not produce the extracellular population of linocins 2. No known mechanism of excreting an intact ultra-megadalton complex exists in bacteria 2, suggesting that encapsulins, along with other very large extracellular prokaryotic factors 66, either are involved in a realm of physiology as yet poorly characterized, or persist after cell lysis. Coexpression of the B. linens encapsulin shell and cargo peroxidase in E. coli produced compartments showing extra EM density, consistent with peroxidase hexamers, bound at regular intervals within the shell 2. In E. coli these encapsulins were 22 nm in diameter, containing 60 monomers 2.

The T. maritima encapsulin is 265 amino acids in length, forming a 240 Å icosahedral compartment composed of 60 shell monomers, resolved to 3.1 Å by X-ray crystallography 1. The shell is 20-25 Å thick depending on cross section 1. The five-fold, three-fold and two-fold symmetry sites form small pores through the shell 1. The three-fold site pore is positive, while the two-fold site is negatively charged and the five-fold pore is uncharged 1. These may facilitate semi-selective diffusion of small cofactors and substrates into and out of the compartment, but are too small to transport known larger substrates/products, such as coordinated iron-phosphate granules 1. Unassigned extra densities at the five-fold symmetry sites are proposed to be metal ions coordinated by backbone oxygens 1. Whether these putative ions are ferric, ferrous, or some other element is unclear. Packaging of the ferritin-like protein into this encapsulin shell is supported by the presence of a co-crystallized eight amino acid peptide matching a sequence in the C-terminus of the enzyme; the observation of copurification with iron giving a dark yellow colouration to the purified encapsulin crystals; and detection of multiple ferritin-like peptides during mass spectrometry of resuspended crystals 1. This compartment was purified from bulk T. maritima culture under standard thermophilic growth conditions, suggesting in vivo basal encapsulin expression is considerable 1.

Prior to this project, the protein that I classify as the P. furiosus encapsulin was annotated as a viral particle 4, but it actually exhibits the functional characteristics and genome context of an encapsulin. The P. furiosus encapsulin is 345 amino acids long including the ferritin-like domain, which comprises the first 96 residues of the N-terminus. This encapsulin forms 30-36 nm diameter, 3.3 nm thick icosahedral shells composed of 180 monomers, organized as 12 pentamer vertices and 20 hexamers, with the ferritin-like domains facing inward 4. The shell domains were resolved to 3.6 Å. However, the structure of these ferritin-like sub-complexes has yet to be directly determined, because the flexibility of this domain relative to the more rigidly symmetrical shell prevented crystallographic resolution of the first 109 amino acids of the protein 4. Non-encapsulated ferritin-like homologs form pentameric complexes, which when superimposed into the inside of the shell, fit well as extensions from the pentameric subunits of the shell 1. How these domains might interact as hexamers is less clear. The overall compartments were 7 MDa in mass 4, much larger than the other encapsulins. To improve resolution, this ferritin-like domain was deleted from some crystallography constructs, producing a PfVΔFlP compartment 4. Both PfVΔFlP and the structurally archetypical capsid from bacteriophage HK97 exhibit an overall negative charge on the inner surface of the compartment 4. In HK97 this charge density facilitates induced expansion via repulsion from the negatively charged genome being forcibly packaged 4. Conversely, PfV under all conditions tested was nucleotide free, and showed no evidence of an expanded state 4. Instead the negative charge density may improve packaging of iron cations, which strongly copurified with PfV 4.
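The subunit counts and mass quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions not stated in the text: that a closed icosahedral shell has exactly 12 pentameric vertices with all remaining monomers in hexamers (the Caspar-Klug description, with triangulation number T equal to the monomer count divided by 60), and that an average amino acid residue weighs roughly 110 Da. The function names are hypothetical, and the first function does not check whether T is geometrically allowed.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # assumption: rough average residue mass


def shell_composition(monomers):
    """Return (T, pentamers, hexamers) for an icosahedral shell.

    Assumes 12 pentameric vertices, with every other monomer in a hexamer.
    """
    if monomers < 60 or monomers % 60 != 0:
        raise ValueError("expected a positive multiple of 60 monomers")
    t_number = monomers // 60
    pentamers = 12
    hexamers = (monomers - 5 * pentamers) // 6
    return t_number, pentamers, hexamers


def estimated_shell_mass_mda(residues_per_monomer, monomers):
    """Estimate shell mass in megadaltons from sequence length alone."""
    return residues_per_monomer * monomers * AVERAGE_RESIDUE_MASS_DA / 1e6


if __name__ == "__main__":
    # P. furiosus: 345 residues per monomer, 180 monomers per shell.
    print(shell_composition(180))
    print(round(estimated_shell_mass_mda(345, 180), 2), "MDa")
```

For 180 monomers this gives T = 3 with 12 pentamers and 20 hexamers, matching the organization described above, and a protein-only mass of about 6.8 MDa, which is in line with the reported 7 MDa before any copurified iron is counted.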

The R. jostii encapsulin was purified from its in vivo context as a 1.8 megadalton complex, disassembled and reconstituted in vitro, both with and without its cargo enzyme 20. Initial purification occurred at neutral pH in phosphate buffer. While disassembly was achieved at pH 3.0 in acetate buffer producing dimers, return to pH 7.0 restored nanocompartment formation 20. This R. jostii encapsulin gene encodes a 269 amino acid protein, which in vitro oligomerizes to form icosahedral compartments 22-30 nm in diameter with 60 monomers predicted per shell 20. Larger species were detected during purification, but were not followed up on and were not observed in vitro 20. This suggests the 60mer form is a lowest energy state, but may not fully represent the biologically relevant forms. Mixing disassembled encapsulin shells with the R. jostii iron-dependent peroxidase, followed by a return to neutral pH, produced packaged peroxidase compartments. This co-assembly was somewhat inefficient, as evidenced by large subsets of peroxidase proteins remaining unpackaged with sub-optimal filling of predicted binding sites, despite expression at non-saturating concentrations 20. The co-assembled population yielded a mean ratio of 8.6 encapsulin shell monomers per peroxidase monomer 20, or 10.32 shell pentamers per hexameric peroxidase complex. This supports in vivo encapsulin assembly benefiting from more complex factors than existed in vitro, and potentially producing a larger major compartment similar to the B. linens encapsulin. Based on microscopy and structural predictions, the shell is predicted to encode pores less than 5 Å in diameter 20, suggesting the larger substrates required for the known encapsulin activities must enter through some other uncharacterized mechanism.
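The conversion between the two ratios quoted above is straightforward arithmetic, sketched below. The only inputs are values given in the text (8.6 shell monomers per peroxidase monomer, hexameric peroxidase, pentameric shell units, 60 monomers per shell); the function names are hypothetical, and the per-shell figure is an average implied by those numbers, not a measured occupancy.

```python
PEROXIDASE_COMPLEX_SIZE = 6  # peroxidase monomers per hexameric complex
SHELL_PENTAMER_SIZE = 5      # shell monomers per pentamer
SHELL_MONOMERS = 60          # predicted shell monomers per compartment


def pentamers_per_hexamer(shell_per_cargo_monomer):
    """Shell pentamers available per hexameric peroxidase complex."""
    shell_monomers_per_hexamer = shell_per_cargo_monomer * PEROXIDASE_COMPLEX_SIZE
    return shell_monomers_per_hexamer / SHELL_PENTAMER_SIZE


def hexamers_per_shell(shell_per_cargo_monomer):
    """Mean peroxidase hexamers per 60-monomer shell at the same ratio."""
    return SHELL_MONOMERS / (shell_per_cargo_monomer * PEROXIDASE_COMPLEX_SIZE)


if __name__ == "__main__":
    print(round(pentamers_per_hexamer(8.6), 2))
    print(round(hexamers_per_shell(8.6), 2))
```

At the reported ratio, 8.6 × 6 / 5 reproduces the 10.32 pentamers per hexamer stated above, and implies a mean of only about 1.2 peroxidase hexamers per 60-monomer shell.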

During the preparation of this thesis, a paper was published identifying an encapsulin in Myxococcus xanthus 22. This encapsulin forms a 32 nm compartment shell of 180 monomers of a protein termed EncA, packaging multiple copies of three small proteins termed EncB, EncC, and EncD 22. EncB and EncC encode predicted ExxH iron-binding sites. Prevalence of these M. xanthus encapsulin compartments was found to increase under amino acid starvation conditions, with few compartments noted during vegetative growth, despite similar total concentrations of EncA peptide 22. Deletion of EncA causes increased sensitivity to hydrogen peroxide 22. Under energy-dispersive X-ray spectroscopy and inductively coupled plasma mass spectrometry these encapsulin compartments exhibit an internal iron and phosphorus rich core particle, with a 4:1 ratio of iron to phosphorus 22. Consistent with the observations from other encapsulins, expression of the M. xanthus EncA in E. coli produces a heterogeneously sized population with 80% of compartments with 32 nm diameter and 20% with 18 nm diameter 22. A 17 Å resolution structural model was derived for the native compartment, based on cryo-EM, icosahedral averaging, and homology modeling based on the T. maritima and P. furiosus encapsulins 22. More precise models of the structures of the M. xanthus protein monomers and central particle were derived based on cryo-electron tomography 22. The iron-phosphate particle was found to be an aggregate of 11-19 granules, 5 to 6 Å in diameter 22. Based on these observations it is proposed that this encapsulin functions as an iron sink during starvation to minimize peroxide dependent iron toxicity 22.

Among the four encapsulin examples published at the time of preparation of this thesis, two distinct non-co-occurring cargo enzymes have been identified: a family of ferritin-like proteins and a family of iron-dependent peroxidases, each independently targeted or fused to the encapsulin shell. Unlike the non-encapsulin compartments discussed above, these two enzymes are the only known in vivo targets to the encapsulin shell, and no systematic investigation of associated genes had been performed in these systems. Microcompartments compartmentalize one or more enzymes within single pathways. Thus even with the limited scope of prior study, encapsulin compartments show cargo diversity atypical of other compartment families.

These encapsulin associated, ferritin-like proteins are short iron-coordinating proteins, with moderate sequence similarity throughout a large family. The Thermotoga maritima encapsulated ferritin-like protein is 114 amino acids long, of which 103 are predicted to compose the iron-binding four helix bundle structure, and the C-terminus encodes a short extension segment 1. In solved structures from this family, the helices are organized as two longer antiparallel helices (~30 aa) and two shorter helices (~10-15 aa) 40. The extension segment is hypothesized to confer targeting through interaction with sites on the inner surface of the encapsulin shell 1. A peptide matching eight residues of this extension co-crystallized with the T. maritima encapsulin shell protein monomer, supporting this hypothesis 1. This peptide forms two hydrophobic contacts with the encapsulin inner surface, burying a four residue narrow hydrophobic patch, consistent with anchoring the enzyme to the shell, hence the use of the term 'anchor' peptide 1. The Pyrococcus furiosus ferritin-like domain is divergent from the T. maritima example, and occurs as an amino-terminal fusion to the PfV virus-like particle protein 4. Both ferritin-like domains copurify with iron 2,4, and are predicted based on other ferritin-like proteins to catalyze oxidation of iron from Fe2+ to Fe3+.

Genomically adjacent to the encapsulin shell genes from Brevibacterium linens and Rhodococcus jostii are genes encoding experimentally observed encapsulated iron-dependent peroxidases 1,20. This peroxidase family is characterized by the ability to catalyze the decolouration of a family of aromatic dyes, such as 2,2′-azinobis-3-ethylbenzo-6-thiazolinesulfonic acid, through the breakdown of hydrogen peroxide 2. Encapsulation enhances this decolourant peroxidase activity in Rhodococcus jostii by 70-75% per peroxidase monomer 20. These dyes are not native to bacterial environments, and bear no significant similarity to known bacterial substrates. As such, a generalized biological role for these peroxidases has remained unclear. Distinct from the T. maritima encapsulin, the B. linens encapsulin has been linked to bacteriostatic activity in vivo against a broad range of bacteria, including Listeria and Bacillus, through an unknown mechanism 21. The assays to refute linocin dependent bactericidal activity have been repeated, but those to confirm or refute the separate bacteriostatic activity have not been repeated more recently 2,21. Initial evidence of bactericidal activity from this compartment has been reassigned to an independent co-expressed factor that copurifies under certain bulk fractionation conditions 21. The R. jostii encapsulated peroxidase has been demonstrated to degrade the plant fibre lignin in vitro 20, with an eight-fold increase in effective activity when encapsulated. The biological significance of this activity, and the biological function of these encapsulins, have yet to be determined. In both published cases, the peroxidase gene encodes a variable length C-terminal extension, thought to confer targeting through a mechanism similar to the encapsulin associated ferritin-like proteins 2.

It has been proposed, but not previously confirmed, that encapsulins sequester otherwise toxic, conditionally important enzymes, allowing survival under stress 2. Mycobacterium tuberculosis excretes proteins with high sequence similarity to the B. linens encapsulin, but these were previously uncharacterized as compartments 3. A number of bacteriocins with features and properties similar to the B. linens encapsulins are reported in the pre-genomics literature 21,23, but were never revisited once it became possible to sequence the encoding strains and more definitively characterize these factors. Thus bacteriostatic and potentially bactericidal encapsulins may be more widespread than the more recent literature recognizes 23.

1.III Properties, similarities and differences of encapsulin and other capsid-like structures

The proteins of these encapsulin shells are highly similar at all levels of structure to the major capsid proteins of Caudovirales bacteriophages and, to a lesser extent, herpesviruses 4. Quinternary structure, also termed high level quaternary structure, is moderately variable among capsids, such that encapsulins and the most common icosahedral super-structures of capsids are far more similar to each other than to the diverse minority super-structures observed (e.g. extended T4-like capsids). The quaternary structure is effectively identical, with the same organization of pentamers and hexamers contributing to the same closed icosahedral structure. The tertiary structures of herpesvirus capsids, Caudovirales capsids and encapsulins all conserve the same mixed alpha-beta fold composed of three subdomains (Figure 1-1): axial (A), peripheral (P), and the extension (E) loop 1,4.

Figure 1-1: Standard domain architecture of capsid-like fold

Encapsulin shell proteins from T. maritima used as a representative example. Red: domain A. Green: domain P. Blue: E-loop.

The P domain is the largest, making up the majority of the fold and contributing all the residues of the multimer-multimer interfaces. This domain is characterized by a hydrophobic core formed between a long alpha helix underlain by a series of beta sheets and peripheral alpha helices. In most capsids this domain includes the N-terminus, partially exposed on the lumenal surface of the compartment 2. The A domain is smaller, composed of a three to six strand beta sheet and two to four alpha helices depending on the capsid 2. This domain provides the basis for the central pore and monomer-monomer interfaces within hexamers and pentamers, and includes the C-terminus, again facing into the compartment lumen 2. The A domain and P domain connect through two to four semi-flexible strands. Finally, the E-loop is situated below this joint and to one side of the P domain, forming a two stranded beta-sheet or unstructured extended loop 4. This loop facilitates interaction between adjacent monomers across the two-fold symmetry sites between multimers, cementing the super structure into a resilient complex 2. Variation in length and relative position of this E-loop is a major source of structural variation between capsids.

Unlike encapsulins, Caudovirales capsids package double stranded DNA (dsDNA) in addition to proteins. Most of the internal proteins are in fact degraded and displaced in the process of genome packaging into these capsids. The mechanisms of these processes are well studied in a number of viruses, though some questions remain to be resolved 24. In most bacteriophages, the head scaffold protein, in parallel with the major capsid protein, forms a highly ordered, generally homogeneous population of preliminary capsid complexes. The scaffold proteins form an internal lattice centred around the dodecameric portal vertex, and the major capsid proteins form the characteristic closed icosahedral shell 24,37. Once this is achieved, a protease also targeted to the interior of the capsid becomes active, digesting the scaffold and then itself into small peptides. These fragments then diffuse out through small pores in the capsid 37.

In Enterobacteria phage P22, the scaffold and the major capsid proteins interact through the N-terminus of the capsid proteins and a helix-loop-helix motif in the C-terminus of the scaffold proteins 37. This C-terminal motif is in the same relative position as the anchor peptide in the T. maritima encapsulin adjacent ferritin-like protein 2. This proteolytic degradation of internalized proteins or domains leaves a stable, empty, near symmetric capsid. The major asymmetry is the dodecameric portal complex, which incorporates in place of one of the vertices, forming a channel permissive to the passage of DNA. To this portal attaches a terminase complex, with large and small subunits 24. This terminase acts as an ATP dependent packaging motor, ratcheting the DNA through the terminase and portal into the empty capsid, and as a nuclease, clipping genome concatemers into single packaged genomes, either in response to a head full signal or on recognition of a specific cleavage site 24. This produces a near crystalline DNA density with high self repulsive potential energy, contributing to ejection of at least the beginning of the genome during infection 24. Thus capsids must resist high internal pressure, while encapsulins lack this constraint.


Comparison of the protein structure of the Thermotoga maritima encapsulin with those of the virus-like particle PfV of Pyrococcus furiosus and the capsid protein, gp5, of bacteriophage HK97 gives root mean squared deviations (RMSD) of only 2.39 Å over 178 aa (Figure 1-2) and 2.65 Å over 199 aa respectively 2. The major source of this structural variation is a 60° rotation of the T. maritima E-loop (Figure 1-2), and a 15° rotation of the PfV E-loop, relative to the E-loop in the HK97 protein 2. The HK97 E-loop is involved in a non-canonical form of capsid cross linking, and thus may not be representative of E-loops in general 4. These monomers interact similarly in the formation of hexamers and pentamers, which further interact to form the hollow, closed icosahedral shells of both the encapsulins and capsids (Figure 1-3.i, Figure 1-3.ii).
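As a reading aid for the RMSD values quoted above, the quantity can be illustrated with a minimal sketch. It assumes the two structures have already been superimposed and that equivalent atoms have been paired; the function name and the toy coordinates are hypothetical and are not taken from the software used for the published comparison.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two paired coordinate lists.

    Assumes both lists hold (x, y, z) tuples in angstroms, that the
    structures are already superimposed, and that item i in one list
    is the structural equivalent of item i in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical toy coordinates: every atom is displaced by 2 angstroms in z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(structure_a, structure_b))
```

A real comparison would first compute the optimal superposition and the set of equivalent residues, which is why each RMSD above is reported together with the number of aligned amino acids.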

My comparisons reveal stronger sequence similarity between known encapsulins and some phage capsids than was evident between divergent phage capsids within the same taxonomic family, suggesting a shared history between the two protein families. The connection to herpes virus capsids is as yet primarily structural, with insufficient sequence similarity to draw confident relationships at the genomic level. While phylogenies linking Eukaryotes, Bacteria, and Archaea can be constructed based on rRNA and certain housekeeping genes 4, the mechanisms underlying viral and prokaryotic pan-genomics call into question how meaningful these phylogenies truly are for other specific genes 15,25.

The genome contexts of the five published encapsulins are isolated from other evidence of prophage, supporting functional distinction from prophage capsids, despite the similarity in structure and sequence. Prophages conserve gene complement and gene order in well-defined patterns, including the capsid proteins (Figure 1-4). Genes encoded surrounding the encapsulins are dissimilar from all known phage proteins, and do not maintain prophage-like gene order (Figure 1-4).


Figure 1-2: Structural comparison of the T. maritima encapsulin and Bacteriophage HK97 capsid monomer structures


Figure 1-3.i: Encapsulin and capsid shell proteins organize into very similar multimer structures

Structural model of Escherichia coli siphophage lambda capsid pentamer (left). Structural model of Thermotoga maritima encapsulin pentamer (right).


Figure 1-3.ii: Encapsulin and capsid multimers organize into closed icosahedral shells using pentameric vertices and hexameric planar faces

Model of the formation of the P. furiosus encapsulin from the combination of pentameric vertices and hexameric planar faces.

Figure 1-4: Genome context differences between encapsulins and prophage capsids

Genome contexts of encapsulins and prophage capsids differ distinctly. Prophage capsids are associated with prophage structural and morphogenetic genes. Encapsulins are associated with non-structural, statistically enriched prokaryotic cargo enzymes.


1.IV Bacteriophages profoundly influence prokaryotic evolution and physiology

Prokaryotes and their viruses (phages) intrinsically share their genomes, due to the life cycle of phages. Among phages, the Caudovirales order is the most common and most influential 15. This order is characterized by a double stranded DNA genome, packaged into a protein based particle with an icosahedral head and tubular tail. Nearly every bacterial genome examined shows extensive evidence of phage infection by this order, ranging from nucleotide signature transfection artifacts in horizontal gene transfer, to complete phage genomes (prophages) integrated into the host genomes 15. Research has found bacterial protein complexes utilizing phage-like proteins autonomous of prophages. These include components of the type VI, type IV, and type III secretion systems, various bacteriocins, diverse pathogenicity islands, and the encapsulins 1,15. These components of the type IV and III secretion systems contribute to the needle and channel structures, sharing sequence similarity with structurally analogous components of the phage tail. Although encapsulins are poorly characterized as a protein family, the few studied suggest a strong potential for applications in bioengineering and for enhancing the understanding of bacterial physiology and genome history. The strong history of research in viral capsids, which share nearly identical protein structure with encapsulins, can offer insight into encapsulins.

Caudovirales phage genomes show extremely high sequence diversity, despite high conservation of gene order, 'body' plan, and protein structure at all levels. This is especially true of the major capsid proteins of these viruses, which preserve highly similar structures in proteins without directly detectable alignable primary sequence. Pseudo-paradoxically, mutational studies in Escherichia phage lambda and HK97 have identified a number of single amino acid substitutions which radically alter the structure of the capsid complexes 10,11,12,13,63. A number of these altered forms phenocopy functional differences observed between these phage capsids and the solved encapsulin structures, including compartment size, DNA intolerance, and non-expandability. Thus, encapsulins can be investigated as a special case of the capsid-like structure, which itself is a special case of a homomeric nanocomplex 14.

Caudovirales phage capsids closely obey well defined geometric constraints determining the sizes and shapes of compartments formed 27. These are the mathematical laws of Platonic icosahedral symmetry, whereby any closed symmetric icosahedral body must contain twelve vertices, thirty edges and twenty triangular faces. Each vertex is the intersection of five edges, in capsids forming a pentamer. Each face is minimally formed from three subunits of adjacent pentamers, but can be enlarged while maintaining symmetry, with the insertion of planar hexamers between vertices 27. Thus all capsids in this order contain a multiple of 60 monomers, and this multiplier is referred to as the triangulation number (T) 27. The smallest possible capsid, T=1, contains 60 monomers forming twelve pentamers and zero hexamers 27. This is the geometry seen in the T. maritima, and recombinantly expressed B. linens and R. jostii encapsulins 1,20. The T=3 capsid fold, as seen in the P. furiosus encapsulin, maintains the twelve pentamer vertices, with the addition of hexamers between them 4. Below T=7, only one symmetric organization can be made within these constraints for each triangulation number 27. T=7 can in theory form a left-handed icosahedron or a right-handed icosahedron depending on how the hexamers pack 27. The majority of natural capsids form the left-handed version for that size. This geometrically predictable quinternary structure in capsids has been valuable in predicting and modeling capsid properties, even as improved microscopy resolution has revealed capsids are not perfectly symmetric. Instead capsids are under tension and strain from tiny asymmetries between monomers and subunits. These same constraints are conserved throughout the capsids examined to date, supporting the existence of uncharacterized conserved mechanisms directing form and function across all capsid-like compartments.
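The subunit arithmetic described above can be made explicit with a short sketch. It assumes only the standard relationships for icosahedral shells (60 monomers per unit of T, always twelve pentamers, and the remaining monomers grouped into hexamers); the function name is illustrative.

```python
def capsid_counts(t_number):
    """Return (monomers, pentamers, hexamers) for an icosahedral shell.

    Assumes an idealized shell with 60 * T monomers, twelve pentameric
    vertices, and all remaining monomers arranged as hexamers.
    """
    if t_number < 1:
        raise ValueError("triangulation number must be at least 1")
    monomers = 60 * t_number
    pentamers = 12
    hexamers = 10 * (t_number - 1)
    # Consistency check: pentamers and hexamers account for every monomer.
    assert 5 * pentamers + 6 * hexamers == monomers
    return monomers, pentamers, hexamers


# Euler's formula for the underlying icosahedron: V - E + F = 2.
vertices, edges, faces = 12, 30, 20
assert vertices - edges + faces == 2

for t in (1, 3):
    print(t, capsid_counts(t))
```

For T=1 this gives 60 monomers with no hexamers, as in the T. maritima encapsulin, and for T=3 it gives 180 monomers with twenty hexamers, as in the P. furiosus encapsulin.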


In addition to direct phage derived proteins, phages also contribute considerably to horizontal gene transfer, acting as shuttles for genomic fragments and encoding many of the activities key to the formation and dissemination of mobile genetic elements 67. A significant proportion of prokaryotic genes are predicted to have undergone horizontal transfer at some point, either between related strains, or across differing kingdoms 67. Bacteria and Archaea within extreme environments share pan-genomes despite their differences in core replicative genes 67.

Thermotoga maritima is an archetypical example of this exchange, having among the highest prevalence of Archaea derived genes observed in sequenced bacteria 67. Similarly, many parasitic bacteria have adopted host genes and genomic fragments as a mechanism of evading immune and antibacterial defenses 67. Three classes of horizontal transfer exist: acquisition of novel genes, transfer of paralogs to existing genes, or replacement of existing genes with xenologs of compatible function from another lineage 67. Bacteriophages can facilitate all three classes, though they are most influential in the acquisition of novel genes, as they encode intrinsic homology independent insertion mechanisms, and proteins to deactivate host rejection mechanisms 53. An average of 3% of bacterial genes show evidence of relatively recent cross kingdom transfer, while 8% of archaeal genes show evidence of similar transfer 67. For bacteria 2% are paralog duplication or exchange products, and 1% are novel genes, while in Archaea the trend is similar with 5% paralogs and 3% novel genes 67. Thus horizontal transfer, and likely transfection, has contributed significantly to the development of modern prokaryotic genomes.

1.V Objectives

The encapsulins studied to date support a number of areas of further inquiry. 1) How many encapsulins exist across prokaryotes? As the limited number studied previously separately encapsulate two distinct enzymes, 2) how many other enzyme families are independently targeted? Beta-carboxysomes compartmentalize components encoded in separate genomic regions. 3) Do any encapsulins utilize a similar distributed encoding schema? Most other prokaryotic compartments are encoded with multiple enzymatic genes from a functional pathway, and package more than one of these functionally coupled enzymes. 4) Do encapsulins package multiple proteins in a similar manner? 5) Is the interaction mechanism proposed for the T. maritima ferritin-like protein representative of all encapsulated enzymes? 6) What historical processes have led to the modern dissemination of encapsulins? 7) What ecological, metabolic, and physiological roles are encapsulins enabling for prokaryotes? 8) Are the published example encapsulins representative of all capsid-like prokaryotic compartments, or do novel, previously unidentified encapsulins exist? Given the structural similarities between capsids and encapsulins, 9) what are the mechanistic, physiological, and phylogenetic relationships between capsids and encapsulins? Computational analyses can elucidate or strongly predict the answers to these broadly reaching questions.

Elucidating the structural mechanisms and physiological roles of encapsulins throughout the prokaryotic kingdoms will illuminate prokaryotic physiology and enable the improved design of engineered protein compartments for biotechnology, medicine, and bio-remediation.

Understanding the origins and functions of the extended family of encapsulins related to the published examples will reveal the importance of this family across prokaryotes, and the factors guiding their development. The identification of novel families of encapsulins may expand the population of capsid-like compartments to match the sequence diversity seen in capsids, and will better reveal the relationship between this structure as it occurs in both super-kingdoms.

Investigating both the classical and novel encapsulins addresses the breadth of functions for which encapsulins have been adapted. Examination of the conserved and covariant features of capsid and encapsulin sequence, structure, and phenotype elucidates the key mechanisms governing the formation and function of this compartment class, across the diversity of sequences that exists among capsids and encapsulins. This also illuminates the factors that could enable and govern the interconversion of capsids and encapsulins between packaging genomes and proteins.

Establishing these mechanisms, it becomes informative to investigate the phylogenetic relationships between capsids and encapsulins to determine which form and function fits best as ancestral, or what interconversions are implied by this phylogenetic data.


Chapter 2: Identification of comprehensive population of encapsulins related to published examples

2. Chapter Abstract:

Three of the five published encapsulins are very different from each other in sequence, suggesting they represent only a small subset of the true population of encapsulins present and active in prokaryotes, even within that one family of similar sequences. By identifying and comparing the members of the larger and more complete family, it becomes possible to predict the roles these encapsulins fulfill in biology. Using a systematic BLAST approach combining multiple searches from divergent but related queries, 590 new encapsulins were identified in this family. Seven new enzyme families were enriched surrounding these new encapsulins. Along with the ferritin-like proteins and peroxidases, four of these families significantly conserved a distinctive motif found exclusively in encapsulin associated enzymes, and shown to be sufficient for targeting into encapsulin shells in one in vitro system. Similar to beta-carboxysomes, our data show that many encapsulin shell and cargo enzyme genes are encoded in separate operons, while retaining the same targeting motifs. This in trans targeting was previously unrecognized in encapsulins. Several organisms encode two or more of these predicted targeted enzymes, suggesting encapsulins may package multiple distinct enzymes, with potentially synergistic activities, similar to other microcompartments. For most other compartments, the packaged components are constant while the non-compartmentalized proteins may vary. With multi-enzyme encapsulins, however, the packaged contents vary depending on niche, while the external proteins may be conserved. This suggests the encapsulin compartments themselves are more adaptable and generalizable than other prokaryotic compartment families.


2. Introduction:

Three of the five published encapsulins differ greatly in sequence from each other, and occur in three radically different taxa, indicating representation of a very small subset of the true prokaryotic encapsulin population. Identification and comparison of a more complete sampling may enable prediction of the biological roles of encapsulation. The published encapsulins independently compartmentalize two families of enzymes, iron-dependent peroxidases and ferritin-like proteins 1. Given that two distinct families are targeted to divergent but related encapsulins, questions of what other enzyme families may be encapsulated remain open.

Most prokaryotic compartments studied to date encode all targeted and accessory enzymes within single operons, while a significant subset of beta-carboxysomes utilizes components encoded in multiple separate genomic locations within bacteria 7. Published encapsulins follow the former trend, but a more complete survey of encapsulins may reveal examples of the latter genomic architecture. Most other compartments utilize N-terminal targeting domains 7, while the published encapsulins show C-terminal extensions related to encapsulation 2. Four examples are insufficient to gauge if this C-terminal targeting is representative or if N-terminal targeting is utilized for some encapsulins. Proteins targeted to the same compartment are expected to share unique features responsible for that targeting. The motifs within the C-terminal extension segment, observed to co-crystallize with the T. maritima encapsulin shell 2, may represent such targeting motifs if conserved and covariant with features of the encapsulin shell. Examination of such co-variation and conservation of positions in pairs of both enzymes and encapsulin shell proteins could reveal key interaction residues and predict interaction mechanisms. Other prokaryotic compartments function as part of larger metabolic pathways and operons 7, yet the functional operons of encapsulins have not been previously investigated. This section aims to address the true prevalence of this classical family of encapsulins, and to elucidate several of the mechanisms and features conferring enzyme/compartment interaction at the genomic, biophysical, and metabolic levels.

2. Methods:

2.I.i Identifying the more complete population of encapsulin shells and associated enzymes

To identify a more complete set of encapsulins, exhaustive protein BLAST (pblast) 90 and selective position specific iterative BLAST (psiblast) searches 91 were performed on the January 12, 2012 version of the NCBI prokaryotic non-redundant database, using the P. furiosus virus-like particle [NP_578920], T. maritima [NP_228594] and B. linens [ZP_05915299] encapsulins, and the enzyme proteins encoded adjacent to these genes [NP_578921, NP_228595 and ZP_05915298 respectively], as starting queries. These encapsulin genes are fairly divergent in sequence (Table 2-1), and thus were expected to yield broad coverage of the intervening population of related encapsulins. Pblast results were compared to define a reproducible, consistent set of proteins with sequence similarity to the known encapsulin shells and associated enzyme proteins.


Table 2-1: Sequence identity between published encapsulins used as initial pblast queries

              T. maritima            R. jostii              P. furiosus
T. maritima   -                      36.6 %id over 253aa    23.2 %id over 265aa
R. jostii     36.6 %id over 253aa    -                      23.2 %id over 220aa
B. linens     33.8 %id over 265aa    52.5 %id over 264aa    17.1 %id over 249aa

The resulting sets were refined to more confidently predicted encapsulins and associated proteins based on multiple criteria. In both sets, false matches had high E-values, low percent similarities and identities, short BLAST alignment lengths, inconsistent relative positions of aligned regions, dissimilar protein lengths between query and predicted homolog, or were already shown experimentally not to be encapsulins. For encapsulins, it was also necessary to determine that these were not prophage capsids. To distinguish encapsulins from prophage capsids, surrounding phage-like structural proteins were identified by psiblast, as per Waller et al. (2011) 25. This involved systematic exhaustive psiblast searching starting from every sequenced bacteriophage morphogenetic protein, against the above set of prokaryotic genomes. Those encapsulin-like candidates with surrounding genes similar to these phage structural genes were excluded from further focus as candidate encapsulins.
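The refinement criteria above can be pictured as a simple filter over BLAST hits. The following is only a sketch: the dictionary keys, the function name, and every threshold are hypothetical placeholders rather than the values applied in this work, and the criteria on relative alignment position and prior experimental evidence are omitted for brevity.

```python
def passes_filters(hit, max_evalue=1e-5, min_identity=20.0,
                   min_query_coverage=0.7, max_length_ratio=1.5):
    """Return True if a BLAST hit (a dict) passes every illustrative filter.

    All thresholds are hypothetical placeholders chosen for illustration.
    """
    # Fraction of the query covered by the alignment.
    coverage = hit["alignment_length"] / hit["query_length"]
    # Ratio of the longer protein to the shorter one.
    longer = max(hit["query_length"], hit["subject_length"])
    shorter = min(hit["query_length"], hit["subject_length"])
    length_ratio = longer / shorter
    return (
        hit["evalue"] <= max_evalue
        and hit["percent_identity"] >= min_identity
        and coverage >= min_query_coverage
        and length_ratio <= max_length_ratio
        # Candidates flanked by phage-like structural genes are treated
        # as prophage capsids and excluded.
        and not hit["phage_genes_nearby"]
    )


# Hypothetical example hits.
good = {"evalue": 1e-40, "percent_identity": 35.0, "alignment_length": 250,
        "query_length": 265, "subject_length": 270, "phage_genes_nearby": False}
prophage_like = dict(good, phage_genes_nearby=True)
weak = dict(good, evalue=0.5, alignment_length=60)

print([passes_filters(h) for h in (good, prophage_like, weak)])
```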


By looking at three diverse known encapsulin queries representing the same function, true homologs could be accurately discriminated even at higher E-values. Both divergence and false matches produce high E-values. True encapsulins may be hit by more than one of these queries at differing levels of similarity, while false hits should be at best equally dissimilar, and more likely will be only weakly hit by a single query. Follow-up pblast searches with divergent, confidently predicted encapsulins were repeated until no further confident homologs could be identified.
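The reasoning above, that true homologs tend to be recovered by more than one divergent query, can be sketched as a count of distinct supporting queries per candidate. The tuple layout, function names, and identifiers below are assumptions made for illustration only.

```python
from collections import defaultdict


def queries_per_subject(hits):
    """Map each subject protein to the set of queries that hit it.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples,
    a hypothetical simplification of tabular BLAST output.
    """
    support = defaultdict(set)
    for query_id, subject_id, _evalue in hits:
        support[subject_id].add(query_id)
    return dict(support)


def multiply_supported(hits, min_queries=2):
    """Return subjects hit by at least `min_queries` distinct queries."""
    support = queries_per_subject(hits)
    return sorted(s for s, queries in support.items() if len(queries) >= min_queries)


# Hypothetical hits from three divergent queries.
example_hits = [
    ("query_A", "cand1", 1e-30), ("query_B", "cand1", 1e-8), ("query_C", "cand1", 1e-20),
    ("query_A", "cand2", 0.5),
    ("query_B", "cand3", 1e-4), ("query_C", "cand3", 1e-6),
]
print(multiply_supported(example_hits))
```

In this toy input, cand2 is weakly hit by a single query and so is not retained, matching the expectation stated above for false matches.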

BLAST based methods are transparent to critical evaluation with minimal expert knowledge or hidden information. HMM methods represent an attempt to consolidate the information content of multiple sequences into a single searchable consensus, used in a manner similar to a single sequence in a BLAST search 68. This consolidation produces a grey box model, in which the user cannot directly and critically evaluate the model’s validity, its applicability to a classification task, or alternative explanations for the observed sequence similarity distribution. The sequence alignments and algorithms used to generate and evaluate published HMMs are publicly available, but evaluating these components requires more expertise in both algorithm theory and biology than is needed to interpret pairwise sequence similarity. Quantified sequence similarity between two proteins has a discrete, definitive meaning. Quantified sequence similarity between a protein and an HMM depends on the structure, composition, and alignment quality of the model, as well as raw sequence similarity. Thus HMMs can be used to perpetuate and expand previously defined families with intrinsic unreported constraints. These constraints may or may not be relevant to the biological questions being addressed.

2.I.ii Identifying enriched enzymes associated with encapsulins as candidate metabolic or physical interaction partners


Proteins encoded surrounding predicted encapsulins were clustered by sequence identity using the CD-HIT method 31, to identify protein families encoded near multiple predicted encapsulins. BLAST searches were performed from representative sequences in these families to define the set of similar enzymes throughout prokaryotes, to evaluate enrichment in proximity to predicted encapsulins. Enrichment was assessed using hypergeometric tests. BLAST searches from the families represented by the Escherichia coli 8.0416 iron-dependent peroxidase [ZP_16247603], Gordonia rubripertincta NBRC 101908 iron-dependent peroxidase [ZP_11244791]; the Nitrosomonas europaea ferritin [PDB accession: 3K6C], Thermotoga maritima ferritin-like protein [NP_228595], Brevibacillus brevis NBRC 100599 ferritin-like protein [YP_002773661]; the Brucella melitensis biovar Abortus 2308 bacterioferritin [PDB accession: 3FVB], Hydrogenobaculum sp. Y04AAS1 bacterioferritin [YP_002121978], Stigmatella aurantiaca DW4/3-1 bacterioferritin [ZP_01463086]; the Desulfovibrio vulgaris hemerythrin [PDB accession: 2AWY], Pseudonocardia dioxanivorans CB1190 hemerythrin [AEA25308]; the Thermotoga maritima radical SAM oxidoreductase [NP_227933], Sulfolobus islandicus M.16.4 radical SAM oxidoreductase [YP_002915824]; the Thermocrinis albus DSM 14484 CutA1 divalent ion tolerance protein [YP_003473646]; the DK 1622 metalloprotease [YP_631747], Stigmatella aurantiaca DW4/3-1 metalloprotease [ZP_01464279]; and the Burkholderia pseudomallei rubrerythrin [PDB accession: 4DI0], and the N-terminal domains of the Ferroglobus placidus DSM 10642 [YP_003436670] and Sulfolobus acidocaldarius DSM 639 [YP_254943] encapsulin fusions, proved most relevant to the primary families discussed below. All enrichment analyses were evaluated using hypergeometric enrichment tests for sequence similarity families, correcting for multiple hypothesis testing using the Bonferroni correction for critical value (P-value < 6.04E-06).
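To illustrate the enrichment statistics named above, the following sketch computes a one-sided hypergeometric P-value and a Bonferroni corrected cutoff using only the Python standard library. The function names and all counts in the example are hypothetical, and the number of tests shown is not the number used to derive the cutoff reported in this chapter.

```python
from math import comb


def hypergeom_enrichment_p(k, population, successes, draws):
    """P(X >= k) for a hypergeometric distribution.

    population: total proteins considered
    successes:  members of the protein family in the population
    draws:      proteins encoded near predicted encapsulins
    k:          family members observed among those draws
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def bonferroni_cutoff(alpha, n_tests):
    """Per-test critical value after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical counts: 20 family members among 1000 proteins, of which
# 10 fall within the 50 proteins encoded near predicted encapsulins.
p_value = hypergeom_enrichment_p(10, population=1000, successes=20, draws=50)
cutoff = bonferroni_cutoff(0.05, n_tests=1000)
print(p_value < cutoff)
```

The exact binomial coefficients avoid floating point underflow for small examples; for proteome-scale counts a dedicated statistics library would be a more practical choice.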


2.II Identifying conserved encapsulin association specific (CEAS) motifs and associated conserved encapsulin surface sites

Proteins often interact through motifs, short regions of conserved amino acids. The discriminative MEME algorithm 26 was used to identify conserved peptide motifs exclusively present when encapsulins and the enzymes from section 2.I co-occur, and absent when the enzymes were on their own. A non-redundant representative subset of encapsulin co-encoded proteins from the enzyme families of interest, noted in the previous section, was used as the positive set, and homologs to these proteins occurring in genomes without detectable encapsulins were used as the negative set. The reciprocal comparison was also performed to identify and exclude encapsulin independent motifs. The predicted encapsulin independent proteins were also examined for motifs enriched in the absence of encapsulins, since such motifs might provide activities compensating for the lack of compartmentalization. No such motifs were identified.

Control analyses were performed to account for alternative sources of apparent exclusive motif conservation. The length and distribution of motifs was considered, since very long motifs or motifs conserved in all homologs independent of genomic associations could be explained as enzymatic, rather than related to encapsulin function. Test datasets were generated from randomly shuffled sequences with the same amino acid distribution, to determine the E-value cutoff at which random artifacts of amino acid distribution begin to appear as motifs. Protein motifs with E-values below this cutoff are unlikely to be such artifacts and must be explained by other processes, including biological function.
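The shuffled control described above can be sketched as follows. This is an assumption about one simple way to generate such a dataset (shuffling residues within each sequence, which preserves each sequence's length and amino acid composition); the function names, seed, and example sequences are hypothetical.

```python
import random
from collections import Counter


def shuffle_sequence(sequence, rng):
    """Return the residues of `sequence` in random order."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def shuffled_dataset(sequences, seed=0):
    """Shuffle every sequence independently, keeping composition and length."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return [shuffle_sequence(seq, rng) for seq in sequences]


# Hypothetical peptide sequences.
originals = ["MKTAYIAKQRQISFVK", "GGSLEHHHHHHTVGSL"]
controls = shuffled_dataset(originals, seed=1)

# Composition is preserved even though residue order is randomized.
print(all(Counter(a) == Counter(b) for a, b in zip(originals, controls)))
```

Motif discovery would then be run on the control set, and the best E-value obtained there taken as the threshold below which motifs in the real data are treated as non-random.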

The MAST motif search algorithm 26 was used to screen the NCBI non-redundant database, including these enzyme families, identifying proteins exhibiting the predicted motifs. Predicted motifs not exclusive to encapsulin associated enzymes, or not conserved in multiple encapsulin associated enzymes based on MEME and MAST analyses, were excluded from further consideration. Dataset sequences were scrambled, while maintaining amino acid frequency, before repeating MEME prediction, thus defining the dataset specific expected false positive E-value threshold. To confirm that MEME analysis does not intrinsically identify conserved specific motifs for any grouping, the same analyses were independently applied to two families of enzymes sometimes encoded in proximity to encapsulins, and with similar protein length and cluster sizes. MEME identified distinct motifs for both families, but no motifs exclusive to encapsulin association, as seen for the predicted targeted families.

2.II.i Identification of binding site composition on encapsulin shells

With the prediction of interaction motifs in the enzymes, a next goal was to determine the sites of associated interaction on the encapsulin shell proteins. Making use of the T. maritima encapsulin protein structure [3DKT], homology modeling threading was performed with four representative predicted encapsulins and predicted enzyme encapsulin association motifs.

Threading is a process of replacing the amino acid sequence of a known protein structure with the sequence of a similar homolog as a method of structure prediction. MAFFT multiple sequence alignment 28 was also performed with that protein sequence and the full set of predicted encapsulins. These two methods were used to identify conserved positions and sites of co-variation between the predicted enzyme motifs and encapsulins. Positions in the threaded encapsulin structures with atoms within 4.0 Å of the threaded motif structure were considered as potential interaction partners. These were compared with conserved and variable positions in the multiple sequence alignments.
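The 4.0 Å contact criterion above can be illustrated with a minimal sketch. The data layout (residue identifier paired with atom coordinates), the function name, and the coordinates are assumptions made for the example and do not reflect the file formats or software used for the threaded models.

```python
import math


def residues_in_contact(shell_atoms, motif_atoms, cutoff=4.0):
    """Return shell residue ids with any atom within `cutoff` of a motif atom.

    shell_atoms: list of (residue_id, (x, y, z)) for the encapsulin shell
    motif_atoms: list of (x, y, z) for the threaded motif
    Distances are in angstroms.
    """
    contacts = set()
    for residue_id, atom in shell_atoms:
        if any(math.dist(atom, motif_atom) <= cutoff for motif_atom in motif_atoms):
            contacts.add(residue_id)
    return sorted(contacts)


# Hypothetical coordinates in angstroms.
shell = [
    (10, (0.0, 0.0, 0.0)),
    (10, (1.0, 0.0, 0.0)),
    (42, (10.0, 0.0, 0.0)),
    (77, (20.0, 0.0, 0.0)),
]
motif = [(0.0, 0.0, 3.0), (7.0, 0.0, 0.0)]
print(residues_in_contact(shell, motif))
```

The resulting residue list is what would then be compared against conserved and variable columns of the multiple sequence alignment.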

Markov Chain Clustering analysis 29 was performed on genes up to 10 ORFs up- and downstream of each predicted encapsulin to identify potential operons and infer functional units. These flanking gene clusters were then tested for enrichment relative to the full proteomes of these genomes and assigned statistical significances based on hypergeometric tests and a Bonferroni corrected cutoff of P < 6.04E-6. Conserved operons and metabolic pathways were recognized based on literature searches and the BioGraph database of biomedical relations 30.
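The flanking window used as input to the clustering above can be sketched as a simple slice over an ordered list of ORFs. The sketch assumes a linear contig with ORFs listed in genomic order and does not wrap around circular chromosomes; the function name and identifiers are hypothetical.

```python
def flanking_orfs(ordered_orfs, target, window=10):
    """Return up to `window` ORFs on each side of `target`.

    ordered_orfs: ORF identifiers in genomic order on one contig
    target:       identifier of the predicted encapsulin gene
    The window is truncated at contig ends rather than wrapped.
    """
    index = ordered_orfs.index(target)
    start = max(0, index - window)
    upstream = ordered_orfs[start:index]
    downstream = ordered_orfs[index + 1:index + 1 + window]
    return upstream + downstream


# Hypothetical contig of 30 ORFs.
contig = [f"orf{i}" for i in range(30)]
print(len(flanking_orfs(contig, "orf15")))
print(flanking_orfs(contig, "orf2")[:3])
```

Truncation at contig ends is one reason very short contigs provide too little context to assess, since few or no flanking ORFs are available for a target near an end.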

2. Results:

2.I Identification of a more complete population of encapsulin shells and associated enzymes

To identify encapsulin homologs across all sequenced eubacterial and archaeal species, I conducted exhaustive BLAST searches using three of the previously characterized encapsulins as queries (see 2.I Methods). No further confident encapsulin homologs could be identified from subsequent searches with divergent homologs. This suggests the new list includes the full range of classical encapsulins within the database at the time of the analysis. Systematic exhaustive psiblast searches starting from every sequenced bacteriophage morphogenetic protein, against the above set of prokaryotic genomes, were also required to discriminate prophage capsids from predicted encapsulins. Those proteins with sequence similarity to encapsulins, but with surrounding genes similar to these phage structural genes, were excluded from further focus as candidate encapsulins. This two tiered method of using primary sequence similarity to systematically classify genome context, as well as to identify functional homologs, outperformed traditional sequence similarity classification methods (Figure 2-1). Due to the limited number of previously validated encapsulins to use as a benchmark, this method was validated on the annotation of phage proteins among phage and prophage genomes, with the most relevant case being its ability to predict capsids with a high true positive rate while maintaining a low false positive rate.


Figure 2-1: ROC curves comparing classifier performance for transitive homology annotation and HMM matching methods, predicting Caudovirales major capsid proteins.

[Plot: ROC curve of capsid prediction in phage. Horizontal axis: False Positive Rate (0 to 1). Vertical axis: True Positive Rate (0 to 1). Series: Random; HMM match; Transitive homology; Transitive homology + context; single iteration pblast.]

Receiver operating characteristic (ROC) curve comparing annotation performance of the method combining genome context and systematic pairwise sequence similarity (transitive homology + regional context [SITHA]), relative to systematic pairwise sequence similarity alone (transitive homology [THA]), and traditional sequence similarity classification by pairwise sequence similarity (pblast) or HMM model matching. Context informed systematic pairwise sequence similarity outperforms the other methods in predicting capsid proteins in Caudovirales genomes. False positive rates were calculated from phage or prophage genes annotated as capsids that are definitively known not to be capsids. True positive rates were calculated from phage or prophage genes annotated as capsids that either are known capsids or, when considered in the context of surrounding annotations and the absence of an annotated capsid in that genome, would be reasonable capsid candidates not contradicted by any other available data.
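For readers unfamiliar with how such a curve is assembled, the following sketch computes ROC points and the area under the curve from scored predictions. It assumes each prediction carries a numeric confidence score and a known true or false label; the function names and the toy data are hypothetical and unrelated to the benchmark set used for this figure.

```python
def roc_points(scored_labels):
    """Return (false positive rate, true positive rate) points.

    scored_labels: list of (score, is_true_capsid) pairs, where a higher
    score means a more confident capsid prediction. The threshold is
    swept from the highest score to the lowest, handling ties together.
    """
    positives = sum(1 for _, label in scored_labels if label)
    negatives = len(scored_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(scored_labels, key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy predictions in which positives outrank negatives.
toy = [(0.9, True), (0.8, True), (0.3, False), (0.1, False)]
print(area_under_curve(roc_points(toy)))
```

A classifier no better than chance traces the diagonal with an area near 0.5, which corresponds to the Random series in the figure.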

Using this method, I identified 723 proteins with sequence similarity to the published encapsulin shell proteins, of which 590 were encoded in isolation from any phage-related proteins. These predicted encapsulin shells showed: significant sequence similarity to the published examples, no other proteins with sequence similarity to phage morphogenetic proteins encoded nearby, and sufficient genome context to determine if these were prophage or encapsulin/capsid-like proteins in isolation from prophage. This third point excluded contigs too short or unique to assess context. This defined a confident set of predicted encapsulin homologs found in 573 genomes, approximately 22% of all genomes examined (Table 2-2). Due to similarity to the published encapsulins, I term this group of proteins classical encapsulins. In proximity to these encapsulin shell genes were a series of predicted cargo enzymes. Cargo proteins are enzymes coencoded with predicted encapsulins and either from families known to be targeted to encapsulins (ferritin-like proteins or iron-dependent peroxidases), or enriched adjacent to encapsulins and exhibiting features supportive of targeting. 471 (79.8%) of these were peroxidases and ferritin-like proteins, defining a confident initial expanded set. Additional cargo families were predicted by data that will be discussed subsequently. 72.6% of predicted shell and cargo proteins were hit by two or more different queries of the same functions (encapsulin shell or enriched coencoded enzyme families), indicating strong overlapping coverage of the respective sampled sequence spaces, and confidence in these functional annotations.
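The context filter described above can be sketched as follows. This is a simplified illustration only, not the thesis pipeline: the `Gene` record, the window of five flanking genes and the minimum of two neighbours required to judge context are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    locus: str
    encapsulin_like: bool = False  # similar to a published encapsulin shell
    phage_like: bool = False       # similar to a phage morphogenetic protein


def classify_hit(contig, index, window=5, min_neighbours=2):
    """Classify an encapsulin-like hit by the genes flanking it.

    contig is an ordered list of Gene records; index points at the hit.
    window and min_neighbours are illustrative assumptions.
    """
    upstream = contig[max(0, index - window):index]
    downstream = contig[index + 1:index + 1 + window]
    neighbours = upstream + downstream
    if any(g.phage_like for g in neighbours):
        return "prophage capsid"
    if len(neighbours) < min_neighbours:
        # Contig too short to assess context; excluded from the confident set.
        return "insufficient context"
    return "encapsulin candidate"


# Hypothetical seven-gene contig with the shell hit in the middle.
contig = [Gene(f"g{i}") for i in range(7)]
contig[3].encapsulin_like = True
print(classify_hit(contig, 3))
```

Checking for phage neighbours before checking contig length means that a hit on a short contig is still recognised as a prophage capsid whenever a phage structural gene is visible.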

These new classical encapsulins were encoded in organisms spanning an extreme breadth of taxa. Encapsulins were identified in 14/33 eubacterial phyla and 3/5 archaeal phyla (Figure 2-2). This far exceeds the distribution of any other prokaryotic compartment family. These phyla were also the largest, most clinically and environmentally relevant, and most extensively studied phyla in both kingdoms, and thus the most extensively sequenced, suggesting encapsulins contribute important physiological advantages in a broad range of organisms.

Table 2-2: List of predicted classical encapsulin shell and cargo enzymes

Available on the appended CD or online at https://www.dropbox.com/s/pl9pczqojx0tb78/radfordThesis_classicEncapsulinList.xls

Figure 2-2: New classical encapsulins identified in 14 Bacterial and 3 Archaeal phyla.

Classical encapsulin homologs were identified in 14/33 eubacterial phyla, and 3/5 archaeal phyla. They were distributed across most classes of the largest phyla; Firmicutes, , and Actinobacteria. All marked phyla (black) include species with (green) and without (red) encoded classical encapsulins.

Clustering and enrichment analyses identified several protein families closely genomically associated with these encapsulins, eight of which proved of particular interest (Table 2-3). Two of these families were the previously identified iron-dependent peroxidases and ferritin-like proteins. Of the 590 predicted encapsulins identified, 351 had iron-dependent peroxidases encoded within five open reading frames (ORFs). I found 120 encapsulins in close genomic association with ferritin-like proteins, of which 20 were ferritin-encapsulin fusion proteins, similar in architecture to PfV. Thus, 471 of the 590 predicted encapsulins (79.8%) fit the pairings established by the previously published encapsulins, though with far greater sequence diversity. This represents a confident expanded set, very likely similar in form and function to the previously published encapsulins. This confident set includes extension segments unique to the encapsulin-associated homologs and qualitatively similar to the T. maritima anchor peptide, which will be analyzed in the following section to characterize potential targeting motifs. The remaining predicted encapsulins, which are comparably similar in sequence to the published encapsulins and to this expanded set, were encoded without peroxidase or ferritin-like sequences in proximity. Instead, members of three enriched sequence families were associated with nearly all of the predicted encapsulins in this second subset.

Table 2-3: Systematic quantification of proteins with conserved encapsulin association specific terminal motifs by MEME/MAST

The columns total, fusion, adjacent, separate and multiple predicted targets count proteins encoding the encapsulin association specific motif (predicted targets). The final column counts homologs with no unique motifs (untargeted).

class              total   fusion   adjacent   separate      multiple predicted targets   no unique motifs (untargeted)

encapsulin         590     45       504        41 (no adj)   118                          2

encapsulated
peroxidase         373     0        351        22            30                           426
ferritin-like      128     20       102        6             32                           189
rubrerythrin       79      25       36         18            23                           64
bacterioferritin   18      0        7          11            6                            546
hemerythrin        34      0        13         21            27                           213
ferredoxin         14      0        14         0             14                           304

associated (non-targeting motif encoding)
CutA1-like         23      0        4          19            19                           103
radical SAM        44      0        12         32            44                           84
metalloprotease    10      0        10         0             10                           494
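The counts in Table 2-3 can be cross-checked with a few lines of arithmetic. The sketch below is an illustrative consistency check rather than part of the original analysis. It assumes that, in each row, the total equals the sum of the fusion, adjacent and separate columns; the dictionary layout and function name are my own.

```python
# Rows of Table 2-3 as (total, fusion, adjacent, separate).
TABLE_2_3 = {
    "encapsulin": (590, 45, 504, 41),
    "peroxidase": (373, 0, 351, 22),
    "ferritin-like": (128, 20, 102, 6),
    "rubrerythrin": (79, 25, 36, 18),
    "bacterioferritin": (18, 0, 7, 11),
    "hemerythrin": (34, 0, 13, 21),
    "ferredoxin": (14, 0, 14, 0),
    "CutA1-like": (23, 0, 4, 19),
    "radical SAM": (44, 0, 12, 32),
    "metalloprotease": (10, 0, 10, 0),
}


def inconsistent_rows(table):
    """Return the classes whose total differs from fusion + adjacent + separate."""
    return [
        name
        for name, (total, fusion, adjacent, separate) in table.items()
        if fusion + adjacent + separate != total
    ]


# Adjacently encoded enzymes from the five families discussed in the text.
five_families = ["peroxidase", "ferritin-like", "rubrerythrin",
                 "bacterioferritin", "hemerythrin"]
adjacent_total = sum(TABLE_2_3[name][2] for name in five_families)

print(inconsistent_rows(TABLE_2_3), adjacent_total)
```

Under this reading every row is internally consistent, and the adjacent column for the five families sums to the 509 adjacently encoded enzymes reported in the text.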

In attempting to assign meaningful descriptions to these enriched sequence families, comparisons of annotation within sequence families revealed inconsistency in the naming conventions of bacterial iron-binding proteins. This inconsistency is a product of the historical reuse of the same terms to describe different levels of generality and relatedness 59. For instance, "ferritin" has referred to: a specific family of dodecahedral cage forming proteins found in eukaryotes; the overall group of prokaryotic or eukaryotic cages with this quaternary structure; the superclass of iron-sequestering helix bundle proteins; and the generic term for small uncharacterized prokaryotic iron-binding proteins, which may or may not be distantly related to each other or the eukaryotic ferritins which originally received the name 59. Subsets of multiple sequence families are assigned the same name, and related members of the same sequence families assigned different names, to the extent that the same protein in different sequencing projects of the same species may be annotated radically differently. These inconsistencies have also translated into the composition of certain annotated ferritin HMMs, which poorly distinguish the breadth and demarcations of true families.

Thus sequence similarity rather than annotation nomenclature was used to define families, and suspect hybrid HMMs were not used as evidence for annotation. Pfam defines protein families as sets of protein regions that share a significant degree of sequence similarity 92. Families are therefore defined by sequence similarity either to individual proteins or, in the case of HMMs, to consensuses of multiple sequence alignments. As families are by definition sets of similar sequences, HMM and BLAST based sequence similarity methods would both define families. Any similarities and differences between the families they define depend on technical differences, not biology. Neither method utilizes biological or functional information, because such information is only intermittently available. The Methods explain the rationale for using pblast methods rather than HMMs (Methods 2.I.i).

The ferritin-like family was defined as the set of proteins similar in sequence to the Nitrosomonas europaea ferritin [PDB accession: 3K6C], Thermotoga maritima ferritin-like protein [NP_228595], and/or Brevibacillus brevis NBRC 100599 ferritin-like protein [YP_002773661]. Pfam HMM matches to this family are particularly inconsistent due to the diversity of proteins represented. Members very similar to those used to define Pfam HMMs, such as NP_228595 with PF02915, were hit. More distant homologs with clear sequence similarity but low identity, such as EIJ81053, were not matched by these HMMs. The Dps-like bacterioferritin-like family was defined as the set of proteins similar to the Brucella melitensis biovar Abortus 2308 bacterioferritin [PDB accession: 3FVB], Hydrogenobaculum sp. Y04AAS1 bacterioferritin [YP_002121978], and/or Stigmatella aurantiaca DW4/3-1 bacterioferritin [ZP_01463086]. Subsets of this family include partial matches to the Pfam HMMs: PF00210, PF02915, PF13668, and PF05067. No published HMM accounts for the full length of these proteins. The hemerythrin-like family was defined as the set of proteins similar to the Desulfovibrio vulgaris hemerythrin [PDB accession: 2AWY], and/or Pseudonocardia dioxanivorans CB1190 hemerythrin [AEA25308]. Subsets of this family are significantly matched by the Pfam HMM, PF01814, to varying degrees. Lastly the rubrerythrin-like family was defined as the set of proteins similar to the Burkholderia pseudomallei rubrerythrin [PDB accession: 4DI0], and N-terminal domains of the Ferroglobus placidus DSM 10642 encapsulin fusion [YP_003436670], or Sulfolobus acidocaldarius DSM 639 encapsulin fusion [YP_254943]. Members of this family are present as contaminants in the Pfam HMMs, PF02915 and PF05067. The peroxidases were more consistent in naming convention and were defined based on sequence identity to the Brevibacterium linens BL2 DyP-peroxidase [ZP_05915298], DyP-type peroxidase family HMM (PF04261), Escherichia coli 8.0416 iron-dependent peroxidase [ZP_16247603], and/or Gordonia rubripertincta NBRC 101908 iron-dependent peroxidase [ZP_11244791]. These proteins were chosen as representatives based on availability of solved structure, and distribution within the sequence space covered by each family. Other representatives from each family were examined and produced the same family definitions. HMMs were derived from multiple sequence alignments of non-redundant representative sequences from each family and found to match the members of the families exclusively.

In the case of the rubrerythrin-like family, it was the conserved N-terminal iron-binding domain which was representative. Classical rubrerythrin encodes a C-terminal rubredoxin domain unconserved in our dataset, or in many other rubrerythrin-like prokaryotic proteins 59. These family names were used as best approximation, place holders for reference purposes, as published definitions of ferritin, rubrerythrin, hemerythrin, bacterioferritin, and ferredoxin vary from kingdom to kingdom, and were either too narrow or too broad to fit our enzyme families.

Among the set of predicted encapsulins, I identified 25 confident predicted encapsulins containing fused rubrerythrin-like domains, while in 36 cases a rubrerythrin family member was encoded adjacent to the encapsulin gene. Dps-bacterioferritin-like proteins and bacterial hemerythrin-like proteins were similarly found in significant enrichment near encapsulin genes (Table 2-3), suggesting targeted enzymes for another 20 predicted encapsulins. Five genomes encode homologs from two different families near the same encapsulin genes, for a total of 509 enzymes adjacently encoded with 504 encapsulins. Forty-three predicted encapsulins were found in the absence of any homologs of these families, but in forty-one of these cases highly similar homologs were found elsewhere in the same genome. In the two remaining cases, the predicted encapsulins were encoded near the end of short contigs in incomplete genome assemblies, suggesting the adjacent contigs, expected to encode a targeted enzyme, were not sequenced or assembled.

These three new encapsulin associated families were distinct in sequence from each other, and from both the iron-dependent peroxidases and the family of ferritin-like proteins, with no significant identity between members of different families based on BLAST (Figure 2-3). Clean HMMs can be generated, which also exclusively hit single families without cross hits by hmmer 68. Internally, each of these six encapsulin-associated enzyme families exhibits an approximately normal distribution with overall mean sequence identity of 48.1% and a standard deviation of 16.5%, indicative of moderately diverse, isolated families (Table 2-4). Between enzyme families only nominal sequence identity is detected, below the 20% threshold of confidence, confirming these are distinct families. Similarly, the predicted new encapsulins show a mean percent identity of 39.9%, with a standard deviation of 17.2%. As such, encapsulins and functionally diverse associated enzymes cluster into diverse sequence families, suggesting that covariant positional sequence diversity, rather than overall conservation, may contribute to interaction.
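Summary statistics of the kind reported in Table 2-4 can be illustrated with the minimal sketch below. It assumes sequences that have already been aligned to equal length, with gaps written as "-", and it counts identity over columns where neither sequence has a gap. That denominator, the function names and the toy sequences are assumptions for the example, since the exact identity definition is not restated here.

```python
from itertools import combinations
from statistics import mean, median, stdev


def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Columns where either sequence has a gap ('-') are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def family_identity_summary(aligned):
    """Summarise all pairwise identities within one family (needs >= 3 sequences)."""
    values = [percent_identity(a, b) for a, b in combinations(aligned, 2)]
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "minimum": min(values),
        "maximum": max(values),
    }


# Hypothetical four-column alignment of three family members.
print(family_identity_summary(["ACDE", "ACDF", "ACGF"]))
```

Comparing the same summary between members of different families would show whether between-family identity falls below the within-family range, as stated above for these enzyme families.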

Figure 2-3: Distinct sequence identity clusters/families for six encapsulin associated enzymes classes.

Clustered based on global pairwise sequence identity using the CD-HIT method 31. All cluster members minimally greater than 21% identical in overall sequence to at least one other member. Internal nodes symbolize proteins chosen as representative of connected leaf nodes based on sequence identity.

Table 2-4: Average percent identity within encapsulin associated protein families

class              mean    median   mode    standard deviation   minimum   maximum

encapsulin         39.92   32.74    58.49   17.22                20.05     99.64

encapsulated
peroxidase         58.05   55.46    55.46   9.531                40.56     99.73
ferritin-like      43.52   43.90    50.00   13.70                20.14     99.49
rubrerythrin       48.03   43.10    66.7    16.3                 20.56     98.36
bacterioferritin   46.06   47.20    49.38   16.66                20.11     99.43
hemerythrin        39.59   27.27    26.10   17.52                20.12     94.68
ferredoxin         44.03   36.13    34.35   21.15                17.24     99.29

associated
CutA1-like         44.30   39.80    37.90   15.30                26.20     100.0
radical SAM        33.66   31.50    33.33   8.871                21.38     99.15
metalloprotease    31.62   28.45    25.00   11.81                20.06     99.85

2.I.i Many encapsulins are encoded with multiple predicted targeted cargo enzymes

Among the genomes encoding predicted encapsulins, 118 encoded two or more enzymes from the predicted cargo families discussed above (Figure 2-4), usually with one encoded proximally and one or more encoded separately in the genome. All five families had members encoded both near to and separately from predicted encapsulins. Only ferritin-like and rubrerythrin-like domains were encoded as fusions with encapsulins. This was the first evidence of separately encoded (in trans) encapsulin components, as previously published encapsulated peroxidase and ferritin-like proteins were encoded adjacent to shell genes. While uncommon in prokaryotes, and previously only linked with beta-carboxysome compartmentalization, natural separately encoded (in trans) bacterial systems have been observed, and are extensively utilized in molecular biology (e.g. plasmid complementation of genomic gene knockout).

The Myxococcus xanthus encapsulin system 22, published during the preparation of this thesis, confirms several of these computational and statistical predictions. The uncharacterized encapsulated protein EncD is a member of my Dps-type bacterioferritin-like protein family.

Both of these proteins are encoded separately from the encapsulin shell gene and adjacently encoded, co-compartmentalized ferritin-like protein, termed EncB.

Figure 2-4: Venn diagram of co-encoding of 6 classes of CEAS enzymes in predicted multi- target or hetero-compartment encapsulin systems

Numbers in the overlapping circles mark pairs of proteins (eg. 24 peroxidases and 24 hemerythrins occur together). Numbers in non-overlapping sections represent single proteins not co-encoded with any other targets (eg. 7 hemerythrins occur alone with encapsulins). 39 genomes encode two targeted genes from the same enzyme family.

This new 590 operon set is a confident gold standard classical encapsulin group, extending the confident population over 100-fold, from the five previously studied examples. This gold standard set meets the criteria of confident similarity of the predicted shell genes to one of the published encapsulins and significant conservation of at least one predicted cargo enzyme encoding an example of the conserved encapsulin association specific (CEAS) motif. These proteins are significantly similar to the published encapsulins, while covering the breadth of prokaryotic diversity. By defining a robust population such as this, it becomes possible to answer many of the unresolved general and systematic questions about encapsulins. The Linocin_M18 hidden Markov model, PF04454, matches 93.1% of the new encapsulins across the NCBI nr non-redundant database at the time of the analysis. The proteins not hit by this HMM comprise the more divergent encapsulins within confident encapsulin operons, including the presence of confident targeted enzymes with CEAS motifs. Encapsulins were exclusively hit up to E-values < 1.0E-52, and no encapsulins were hit with E-value > 4.5E-16. In this intermediate zone between E-values >= 1.0E-52 and E-values <= 4.5E-16 both prophage capsids and canonical encapsulins were matched. At E-values > 4.5E-16 phage and prophage capsids were further matched with highly significant E-values. Thus the Pfam Linocin_M18 HMM has a high true positive rate, but also a high false positive rate, in regard to discriminating prophage capsids from encapsulins. This is due in part to the use of prophage capsids in the HMM seed alignment. An HMM derived from alignment of my confident gold standard classical encapsulin set improves this discrimination (no capsids hit up to E-value < 0.01).
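The three E-value zones described above for the Pfam Linocin_M18 HMM can be written as a simple decision rule. The sketch below only restates the two thresholds reported in this section; the function name and zone labels are hypothetical, and it is not a substitute for the genome context analysis needed to resolve the intermediate zone.

```python
# Thresholds as reported for the Pfam Linocin_M18 HMM (PF04454) in this section.
ENCAPSULIN_ONLY_BELOW = 1.0e-52
MIXED_UP_TO = 4.5e-16


def linocin_hit_zone(evalue):
    """Map a Linocin_M18 match E-value to the zone described in the text."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue < ENCAPSULIN_ONLY_BELOW:
        return "encapsulin-only"   # only encapsulins were hit in this range
    if evalue <= MIXED_UP_TO:
        return "mixed"             # encapsulins and prophage capsids both matched
    return "capsid-only"           # no encapsulins were hit above this E-value


for e in (1e-60, 1e-30, 1e-10):
    print(e, linocin_hit_zone(e))
```

Hits in the mixed zone cannot be assigned by E-value alone, which is why surrounding gene content was used to separate prophage capsids from encapsulins.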

2.I.b Similarity between encapsulins and phage capsids extends to the primary sequence level

In addition to this expanded population of encapsulin operons, BLAST hit upon diverse phage and prophage operons with similarity to the query encapsulin proteins. Capsid hits from the encapsulin queries ranged between 21% and 43% identity, with E-values as low as 0.01, better than divergent capsids in the same taxonomic family, Myoviridae, and infecting the same host species, such as the Listeria myophage B054 and Listeria myophage P100 which bear nominal sequence identity with a minimum E-value of 2.5. Outside of the context of bacterial or viral genome comparison this level of similarity would be unremarkable. However, it is widely recognized that sequence diversity of phage and phage-related structural proteins is extremely high, despite strong structural conservation 60. While individual proteins chosen at random often bear little or no detectable sequence identity, most members of phage protein functional classes can be connected into coherent families through moderate sequence similarity to intermediate proteins. These similarities are notably stronger and longer than the random, short, very weak similarities between arbitrary proteins. Thus the structural similarity between capsids and encapsulins extended even to the primary sequence level, with encapsulins showing greater similarity to specific subfamilies of capsids than those capsids do to divergent capsids in the same taxonomic family.

For example, the Listeria phage B054 capsid, YP_001468712, was 46% similar and 20% identical over 295 amino acids to the predicted encapsulin from Aquifex aeolicus VF5. This capsid protein was also similar to 19 other encapsulins including the T. maritima encapsulin at 41% similar and 22% identical over 145 amino acids. The A. aeolicus VF5 predicted encapsulin is encoded separately from a predicted targeted Dps-bacterioferritin-like protein, while the T. maritima encapsulin genomic region encodes an adjacent ferritin-like protein. The B054 phage is in the Myoviridae family of double-stranded DNA bacteriophages with long contractile tails. Several phages in this family utilize capsids with nominal sequence similarity to the B054 capsid, yet perform the same function, are in the same taxonomic family, Myoviridae, and can be connected to this capsid through a chain of intermediate similar phage capsids. Over 32% of all predicted encapsulins showed sequence similarity to one or more phage capsids. Some of these similarities were low, but in light of conservation of structure, not negligible. The remaining 68% bore significant similarity to these 'capsid'-similar encapsulins, supporting a gradient of similarity, very much like what was observed among phage capsids. This degree of diversity was comparable to a diverged capsid subfamily, rather than a hyper-ancient distant relative to phage capsids. Within the context of sequence similarity searches using the Pfam Linocin_M18 HMM, between E-values >= 1.0E-52 and E-values <= 4.5E-16 both prophage capsids and confident canonical encapsulins were matched. At E-values > 4.5E-16 phage and prophage capsids were matched with highly significant E-values, supporting this similarity.

In the context of pairwise sequence similarity, most phage capsids were likewise moderately similar to close relatives, and had low similarity to more diverged relatives. The low magnitude of this similarity would be atypical of eukaryotic structural families, but is well established in phage 15. Thus the vertical inheritance from a proto-prokaryote, proposed in the literature 2, was not supported by the sequence similarity distributions. These prophage and phage capsids with BLAST matches to published encapsulins were surrounded by other phage structural genes, and lack homologs to the enzyme families discussed above. Thus capsids and encapsulins show differences in genome context and function, but not segregation by structural or sequence similarity features.

2.II.i Six distinct enzyme families conserve the same consensus CEAS motif family

The T. maritima encapsulin co-crystallizes with a short peptide matching the adjacently encoded ferritin-like protein 2. This peptide is called the anchor peptide, as it was hypothesized to mediate binding between the ferritin-like protein and the encapsulin shell 2. In prokaryotes, enriched co-encoded genes imply co-expression, co-regulation and usually co-functionality 32,33.

A strong marker of interaction and co-functionality is the presence and conservation of motifs conserved uniquely to the set of co-encoded proteins and absent from homologs not encoded together, in this case conserved encapsulin association specific (CEAS) motifs. Conservation alone can be explained by independent motif function. However, co-functionality is required to explain conservation exclusively observed in the presence of a specific protein family, and absent in the absence of those proteins. If the T. maritima anchor peptide is such a CEAS motif, conserved across all predicted targeted enzymes, it would support its role in targeting.

Initially the sets of diverse, encapsulin adjacent peroxidases and ferritin-like proteins were separately aligned with homologs from genomes not encoding encapsulins. These alignments indicated the presence of C-terminal peptide extensions beyond the enzymatic region in the encapsulin associated homologs. Alignments of the new encapsulin associated protein families also followed this trend for four other families. Manual alignments of these extensions highlight plausible conserved motifs qualitatively similar to the T. maritima anchor peptide (Figure 2-5).

Due to the small size of these motifs, standard large scale alignment methods did not produce confident alignment in these terminal regions, making manual alignment more reliable for aligning the candidate motifs, without claiming statistical significance. Proteins with these extensions are predicted encapsulin targeted proteins, based on the published criterion 2. This criterion is the presence of C-terminal peptide extensions seen specifically in the enzymes encoded adjacent to the predicted encapsulin shell gene, while absent from homologous enzymes encoded in regions without encapsulin shell genes.

To avoid observer bias and quantify the significance of these potentially conserved motifs, I performed discriminative MEME analysis on an unaligned balanced subsample (171 sequences) of these six enzyme families encoded adjacent to predicted encapsulins, identifying a significant CEAS motif present in all homologs encoded adjacent to predicted encapsulins (Figure 2-6). This subsample was balanced by combining the 26-30 most diverse examples of each adjacently encoded enzyme class. Control analyses (see Methods) identified no motifs comparable in significance, scope, exclusivity, or diversity of associated enzyme families as this CEAS motif model. Alignment of segments encoding this motif above, agrees with the statistical validation in this paragraph (Figure 2-5). These motifs are unlikely to be enzymatic due to their short nature and the fact these motifs were not found in homologs in encapsulin free genomes. These motifs were not random artifacts of amino acid distribution, since equally significant motifs were not found in randomly shuffled sequences with same amino acid distribution. The statistical significance of conservation of these motifs was several hundred orders of magnitude stronger than any motif found in the randomized dataset with the same amino acid distribution (E > 440), confirming this motif is not random.
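The shuffled-sequence control mentioned above can be illustrated as follows. This is a minimal sketch of one way to build such a control set, in which each sequence is permuted independently so that its length and amino acid composition are preserved exactly. The function name, the fixed seed and the example sequences are hypothetical, and the motif search itself (MEME) is not reproduced here.

```python
import random
from collections import Counter


def shuffled_control(sequences, seed=0):
    """Return composition-preserving shuffles of each input sequence."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    controls = []
    for seq in sequences:
        residues = list(seq)
        rng.shuffle(residues)  # permutes residues in place
        controls.append("".join(residues))
    return controls


# Hypothetical C-terminal fragments, for illustration only.
originals = ["MKTAYIAKQRGAPLDGSLTIGSLKG", "MSEQNNTEMTFQGAPKDGSLGVGSLKG"]
controls = shuffled_control(originals)
for before, after in zip(originals, controls):
    # Composition is unchanged even though residue order is randomised.
    print(Counter(before) == Counter(after), len(before) == len(after))
```

Any motif that reflects residue order rather than composition alone should lose its significance in such a control set.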

Figure 2-5: Overlay of candidate C-terminal conserved encapsulin association specific motifs in diverse predicted cargo enzymes

Overlaid C-terminal 18-32 aa of encapsulin adjacent candidate cargo enzymes. Families marked in differing colours: Dps-bacterioferritin-like proteins (green), iron-dependent peroxidases (blue), hemerythrin-like proteins (red), ferritin-like proteins (light yellow), rubrerythrin-like proteins (gold).

Figure 2-6: Conserved encapsulin association specific consensus motif.

Regular expression consensus: xAPxDGSL[TG][IV]GSLKG. Logo representation of unbiased, statistically optimized, significant conserved enzyme encapsulin association specific motif (MEME conservation E-value = 2.0e-405). Predicted by MEME from non-redundant, functionally balanced, 171 protein representative sample of predicted encapsulin associated enzymes. Motif found exclusively encoded in termini of proteins genomically associated with encapsulins. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to positional information content in the motif. Grey vertical brackets (I) demark small sample correction error bars.
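For readers who want to scan sequences for exact matches to the consensus in Figure 2-6, the expression can be translated into a Python regular expression as sketched below. This assumes that "x" denotes any single residue. An exact pattern of this kind is far stricter than the position-specific MEME/MAST model used in this work and will miss divergent instances of the motif; the function name and test sequence are hypothetical.

```python
import re

# xAPxDGSL[TG][IV]GSLKG, with 'x' read as any one amino acid letter (assumption).
CEAS_CONSENSUS = re.compile(r"[A-Z]AP[A-Z]DGSL[TG][IV]GSLKG")


def find_consensus_matches(sequence):
    """Return (start, matched_text) for each exact consensus match."""
    return [(m.start(), m.group()) for m in CEAS_CONSENSUS.finditer(sequence.upper())]


# Hypothetical sequence carrying the consensus at its C-terminus.
example = "MSTNQ" + "GAPLDGSLTIGSLKG"
print(find_consensus_matches(example))
```

Because the match position is returned, the same helper can be used to note whether a hit lies near the N-terminus or the C-terminus of a protein.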

This consensus CEAS motif model (Figure 2-6) identifies and discriminates 99.6% of predicted encapsulin targeted proteins with a false discovery rate (FDR) less than 0.140%. This included separately encoded predicted encapsulin targeted proteins. Focused analyses of unhit sequences revealed conserved motifs similar to the core of the consensus CEAS model in the encapsulin associated subset, but not the homologs from genomes without predicted encapsulins. The high E-value of these false negatives is due to subfamily positional variation in the segment upstream of the motif. For example, the Anaeromyxobacter dehalogenans 2CP-1 predicted encapsulin adjacent ferritin-like protein encodes a motif matching the consensus CEAS model weakly (E-value = 2.4), but strongly matching a similar model optimized for encapsulin adjacent ferritins (MAST search E-value = 5.3E-30). These two models describe the same motif, but differ slightly in length and composition of the amino acids upstream of this motif. Thus my gold standard predicted encapsulin set is supported by a conserved encapsulin association exclusive motif in every predicted targeted enzyme, and absent in all other homologs to these enzymes.

Taking both the consensus CEAS motif model and moderate family specific conservation in the amino acids flanking this region into account, I was able to perfectly discriminate the set of encapsulin co-encoded enzymes from the homologs not encoded with encapsulins (TPR = 100%, FPR = 0.00%). This included the remaining subset of separately encoded predicted cargo enzymes, indicating that similar to eukaryotic organelles, encapsulins can compartmentalize components encoded separately (Figure 2-7), linked by shared regulation and targeting. The conservation of the CEAS motif in these enzymes strongly suggests encapsulin packaging functions with both in cis and in trans encoded targets, which had not previously been recognized for encapsulins, nor observed in the majority of other prokaryotic compartment operons. This data suggests a conserved interaction mechanism for all encapsulins in this family, with target family specific secondary optimization. It is possible that these family-optimized flanking regions confer orientational specificity or modulate features like compartment size or geometry.

Figure 2-7: Example separately encoded targeted peroxidase, encapsulin pair

Select Escherichia coli strains encode predicted classical encapsulin shells and targeted peroxidases in separate genomic regions. The motif exclusively found in targeted encapsulin associated enzymes is conserved in these separately encoded peroxidases. Similar separated encapsulin components are observed throughout Bacterial and Archaeal taxa, as alternatives to encapsulin components being encoded together as seen in the previously published four examples. These include all six predicted encapsulated enzyme families.

The case of the Methanobacterium sp. SWAN-1 predicted classical encapsulin and adjacent hemerythrin was the most extreme instance of this co-divergence of an encapsulin shell and associated targeting motif. This predicted encapsulin, YP_004520445, showed only 26-31 % identity to the published encapsulins. The adjacently encoded hemerythrin exhibited an enzymatic region similar to others in that family, but with a C-terminus comparably divergent to the encapsulin, encoding a weak match to the consensus CEAS motif model. Closer examination of this pair revealed covariant, predicted compensatory changes in both the encapsulin and CEAS motif sequence, suggesting encapsulins and their associated targeting motif have co-evolved.

All previously identified CEAS motifs were located at the C-terminus of encapsulated proteins, following variable length diverse linkers, extending beyond the enzymatic fold regions (Figure 2-8, Figure 2-9). Conversely, in several ferredoxin-like proteins from diverse species in the Bacillaceae family, strong matches to the CEAS consensus (0.0087 <= MAST E-value <= 0.15) were found encoded in amino-terminal extension segments (Figure 2-10). These N-terminal motifs closely match the C-terminal motifs of cargoes in the same species. This is the first evidence of amino-terminally encoded interaction motifs for encapsulated cargo enzymes, and the first instance of the same family of targeting motifs being encoded on either terminus. Many phage proteins and other bacterial compartments utilize distinct N-terminal targeting motifs 3,8.

In addition, experiments by Snijder (2014) confirmed the T. maritima instance of the CEAS motif is sufficient to induce encapsulation of novel cargo into the T. maritima encapsulin shell, when TFP was fused to the C-terminus 39. Overall, the conservation of this motif in four new families of enzymes, in addition to the previously published ferritin-like protein and peroxidase families confirms the potential for in vivo encapsulation of six distinct cargo enzyme families.

Figure 2-8: MAFFT sample alignment of related bacterioferritins with and without targeting motifs

Figure 2-9: MAFFT sample alignment of iron-dependent peroxidases with and without targeting motifs

Figure 2-10: Encapsulin associated ferredoxins from the Bacilli phyla encode an N-terminal motif similar to co-encoded C-terminal CEAS motifs in ferritins

Short ferredoxin domain proteins were found exclusively associated with encapsulins in this phylum and conserve N-terminal matches to the CEAS motif. These motifs are the first instance of the same family of targeting motif encoded at either terminus of a genomically associated candidate cargo enzyme.

2.II.b Co-encoded cargo and encapsulin shell variation in sequence correlates with conservation of CEAS motif variable sites

Thirty-six CEAS motifs from enzymes encoded in the same genome with a predicted encapsulin shell and another CEAS motif encoding enzyme from another enzyme family, matched the predicted motif of the other enzyme family with very low E-value (Figure 2-11, Figure 2-12, Figure 2-13). Sixty-four additional genomes co-encoded CEAS motif containing proteins from different families, offering the potential for targeting of products from two genes into one shell.

If these CEAS motifs co-target, co-encoded CEAS motifs should be more similar to each other on average than to non-co-targeting CEAS motifs, such as those from other genomes. If all CEAS motifs are equally similar, then no selection would be evident in promoting co-targeting, suggesting co-targeting may be a horizontal transfer artifact rather than biologically relevant. If these co-encoded motifs are significantly more similar, it would indicate a selective advantage exists in co-targeted enzyme pairs, forming the basis for further hypotheses.

Direct Smith-Waterman similarity comparison showed a significantly higher average sequence identity and similarity between co-encoded CEAS motifs, relative to randomly paired CEAS motifs (Figure 2-12, Table 2-5). For example, three encapsulin co-encoded bacterioferritins, not hit by the MEME predicted bacterioferritin optimized CEAS motif, match the rubrerythrin optimized CEAS motif well (E-value = 1.5E-5). Bacterioferritins have no significant sequence similarity to rubrerythrins in the enzymatic fold. Yet they were co-encoded in the same genomes with rubrerythrins, which encode similar CEAS motifs. The species in question were Chondromyces apiculatus DSM 436, Corallococcus coralloides DSM 2259 and Stigmatella aurantiaca DW4/3-1 respectively. Similar cases exist with co-encoded hemerythrins and peroxidases. In total 118 encapsulin shells were co-encoded with two or more CEAS motif containing enzymes within these families, suggesting multiple different enzymes may be targeted to the same encapsulin. Other prokaryotic compartments are known to package multiple enzymes, but previously this had not been indicated for encapsulins.
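The pairwise comparison described above can be illustrated with a minimal Smith-Waterman sketch. This is not the implementation used in this work: the function name `smith_waterman_identity` and the simple match/mismatch/linear-gap scores are assumptions chosen for brevity, whereas a real analysis would typically use a substitution matrix such as BLOSUM62 with affine gap penalties. Only percent identity is computed here; percent similarity would additionally require a substitution matrix to define which non-identical pairs count as similar.

```python
def smith_waterman_identity(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, percent identity over the aligned columns).

    Scoring values are illustrative assumptions, not those used in the thesis.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic programming matrix, flooring every cell at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    identical = aligned = 0
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        aligned += 1
    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return best, percent_identity


# Example: compare the T. maritima anchor peptide with itself.
print(smith_waterman_identity("GGDLGIRK", "GGDLGIRK"))
```

Applying such a function to every co-encoded motif pair and to an equal number of randomly drawn pairs yields the two distributions whose means and medians are compared in Table 2-5 and Figure 2-12.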

Figure 2-11: Example multi-enzyme encapsulin operon

A number of Frankia species encode distinct bacterioferritin-like proteins and ferritin-like proteins, both with conserved predicted targeting motifs, flanking predicted encapsulin shell genes, in a predicted coexpressed, cotargeted compartment system. More broadly 118 species encode two or more distinct targeted or fused enzymes from different families. These systems either co-target or differentially express these targeted components, under co-expression with the shell gene.

Figure 2-12: CEAS motif pairwise local percent similarity between co-encoded targeted enzymes vs. randomly chosen CEAS motifs

Co-encoded enzymes exhibit significantly greater similarity of CEAS motifs than motifs compared from randomly chosen pairs (Wilcoxon rank-sum test for difference in median, P-value << 0.0001; two-tailed Student two-sample T-test for differences of means of unequal variance, P-value << 0.0001).

Figure 2-13: Example paired alignments of Co-conserved positions in co-encoded CEAS motifs vs. non-co-encoded CEAS motifs.

Co-encoded motif pairs boxed together as pairs or triplets (red). CEAS motif from non-co-encoded enzymes chosen at random and compared (blue).

Table 2-5: Co-encoded CEAS motifs are significantly more similar to each other, than otherwise independent CEAS motifs

| | Co-encoded CEAS motifs and linkers | Independently encoded CEAS motifs and linkers | Two-tailed two sample Student T-test for means of unequal variance P-value |
|---|---|---|---|
| Mean percent sequence identity | 73 | 67 | 1.91E-17 |
| Mean percent sequence similarity | 92 | 84 | 5.59E-26 |

Segments matching the consensus CEAS motif model plus 10 aa upstream for context were aligned by Needleman-Wunsch algorithm. N=160 paired co-encoded motif segments, and 160 randomly selected non-co-encoded motif segments.
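As an illustration of the two significance tests named for these comparisons, the sketch below computes a Welch (unequal variance) t statistic with its Welch-Satterthwaite degrees of freedom, and a two-sided Wilcoxon rank-sum P-value by normal approximation. It uses only the Python standard library; the function names and the toy input values are hypothetical and are not data from this work. The rank-sum approximation uses mid-ranks for ties but omits the tie and continuity corrections, so it is a simplification of what a statistics package would report.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P-value (normal approximation, mid-ranks)."""
    values = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    w = sum(rank_of[v] for v in x)              # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2                 # expected rank sum under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # its standard deviation
    z = (w - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Toy percent-identity values (hypothetical, for illustration only).
co_encoded = [70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
random_pairs = [60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
print(welch_t(co_encoded, random_pairs))
print(rank_sum_p(co_encoded, random_pairs))
```

Reporting both a parametric and a rank-based test, as done here, guards against the percent-identity distributions being non-normal.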

Further supportive of CEAS motifs facilitating interaction with encapsulins, CEAS motifs from enzymes associated to similar encapsulins were significantly more similar than those associated with dissimilar encapsulins or encapsulins chosen at random (two-tailed Student two-sample T-test for means of unequal variance, P-value = 0.00, Table 2-6). This evidence of distinct co-encoded enzymes conserving significantly similar motifs, correlated with encapsulin variation, suggests coordinated heterogeneous or multi-enzyme encapsulins. Based on the biochemical properties of unencapsulated homologs, if compartmentalized together, these enzymes may function synergistically, catalyzing complementary reactions.

For example, copackaging hemerythrin with iron-dependent peroxidase brings two forms of dioxide radical degradation into a single compartment. Iron-dependent peroxidase degrades hydrogen peroxide using aromatic compounds as redox acceptors, while hemerythrin degrades aromatic R-O2H radicals, as well as hydrogen peroxide and nitric oxide 2,34. Hemerythrin can facilitate oxidation of iron from Fe+2 to Fe+3, the state required for peroxidase function, potentially recovering viability of reduced peroxidase 35,36. Bacillus subtilis and Pseudomonas putida MET94 iron-dependent peroxidases catalyze this oxidation independently and lack encapsulation, but show susceptibility to temperature and chemical denaturation 36, potentially offset by packaging. Similar synergy was supported for the predicted pairings between the other families.

Table 2-6: Sequence similarity comparison of CEAS motifs relative to encapsulin relatedness

| | Within Encapsulin Clusters, Same Enzyme Function (N = 28707 pairs) | Between Encapsulin Clusters, Same Enzyme Function (N = 3905 pairs) | Two-tailed two sample Student T-test for means of unequal variance P-value |
|---|---|---|---|
| Mean Local Identity | 91% +/- 11 | 74% +/- 15 | 0.00 |
| Mean Local Similarity | 95% +/- 6.8 | 85% +/- 9.9 | 0.00 |
| Mean Global Identity | 80% +/- 13 | 62% +/- 15 | 0.00 |
| Mean Global Similarity | 83% +/- 11 | 71% +/- 12 | 0.00 |

Encapsulin sequences were clustered with a cutoff of 60% global identity yielding 84 subfamilies. CEAS motifs from associated proteins for each cluster were compared for sequence similarity manually and using Smith-Waterman and Needleman-Wunsch automated alignment methods. A sample of between cluster motifs was compared to measure average 'between cluster' similarities. An equal number of randomly chosen pairs of motifs from each enzyme type were also compared to define a background similarity. Significance determination was consistent between two-tailed two-sample Student and Welch T-tests for difference of means with unequal variance, and Wilcoxon rank-sum test with continuity correction for difference in median (applicable to non-normal data). Similar results were found with clustering encapsulins at 40%, 50%, 70%, or 80% identity. Use of less than 60% identical encapsulin clusters began to limit the analysis by unifying enzyme types into single clusters. Use of greater than 60% identical encapsulin clusters began to limit the analysis by restricting clusters to tight lineages with nominal sequence variance.

2.II.c Encapsulin cargo binding surface may be more extensive than previously observed

With this much larger set of encapsulins and cargo enzymes, it becomes possible to better characterize what conserved and co-variant features define the interaction between specific encapsulin shells and cargos. Conservation, co-variation, and non-conservation reveal which features are important to encapsulin function, which are primarily dependent on complementary interaction between shell and cargo proteins, and which are purely eccentricities of one genome or another. Studies on single encapsulins lack this capability and resolution.

The linker and motif family described in the preceding sections includes the previously published T. maritima ferritin-like protein linker and anchor motif (GGDLGIRK) 1, but is not fully represented by that example alone. The regions of unconserved amino acids between the enzymatic domains and the CEAS motif were highly variable in length among proteins and families, ranging from three to ninety-three amino acids with a mean of 22.8 amino acids. A small subset of bacterioferritins, in place of these linker regions, encoded a second distinct bacterioferritin-like fold of 140-157 amino acids between the canonical bacterioferritin fold and the CEAS motif. Overall, the linkers were heavily enriched for alanine (16.9%) and proline (15.2%), moderately enriched for negatively charged residues (15.7%), and slightly depleted in glycine (4.1%), arginine (4.3%), and lysine (3.6%). The overall frequency of hydrophobic residues does not significantly differ from expected frequencies. This amino acid distribution suggests flexible linkers, with a negative charge density conducive to non-specific interaction with an overall positively charged encapsulin inner shell surface (Table 2-7). The T. maritima linker is enriched for glycine, providing greater flexibility and reduced bulkiness, in place of polar/charge interactions predicted in the majority of CEAS motifs, and as observed in other compartments 9. Flexibility is supported by the irresolvability of the ferritin-like protein in the T. maritima encapsulin crystal, despite being inferred to be present based on mass spectrometry, crystal colour, and resolution of the GGDLGIRK peptide.

Table 2-7: Amino acid frequency composition of classical encapsulin associated enzyme linker regions preceding the CEAS motifs

| Amino Acid | Frequency |
|---|---|
| Ala (A) | 16.9% |
| Arg (R) | 4.30% |
| Asn (N) | 2.10% |
| Asp (D) | 6.80% |
| Cys (C) | 0.01% |
| Gln (Q) | 5.00% |
| Glu (E) | 8.90% |
| Gly (G) | 4.10% |
| His (H) | 1.10% |
| Ile (I) | 4.00% |
| Leu (L) | 7.00% |
| Lys (K) | 3.60% |
| Met (M) | 0.60% |
| Phe (F) | 0.60% |
| Pro (P) | 15.2% |
| Ser (S) | 6.80% |
| Thr (T) | 6.20% |
| Trp (W) | 0.20% |
| Tyr (Y) | 2.00% |
| Val (V) | 4.70% |
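A composition table of this kind can be derived from a set of linker sequences with a simple residue count. The sketch below is illustrative only: the function name `composition` and the toy linker strings are hypothetical, while the `TABLE_2_7` dictionary restates the frequencies from Table 2-7 so that the 15.7% figure quoted for negatively charged residues (Asp plus Glu) can be checked against it.

```python
from collections import Counter


def composition(linkers):
    """Percent frequency of each residue across all linker sequences pooled."""
    counts = Counter("".join(linkers))
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


# Frequencies restated from Table 2-7 (percent).
TABLE_2_7 = {
    "A": 16.9, "R": 4.3, "N": 2.1, "D": 6.8, "C": 0.01, "Q": 5.0, "E": 8.9,
    "G": 4.1, "H": 1.1, "I": 4.0, "L": 7.0, "K": 3.6, "M": 0.6, "F": 0.6,
    "P": 15.2, "S": 6.8, "T": 6.2, "W": 0.2, "Y": 2.0, "V": 4.7,
}

# Negatively charged fraction quoted in the text: Asp + Glu.
negative = TABLE_2_7["D"] + TABLE_2_7["E"]
print(round(negative, 1))

# Toy linkers (hypothetical sequences, for illustration only).
toy = composition(["AAPE", "AD"])
print(toy["A"])
```

The table values sum to slightly more than 100% because of rounding in the individual entries.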

Multiple sequence alignment and structural threading between the T. maritima GGDLGIRK peptide structure 2 and representative CEAS motifs revealed position specific conservation and covariation not evident from the T. maritima peptide alone (Figure 2-14). Homology modeled threading involves predicting the structure of a sequence of interest based on the known structures of one or more similar sequences. The more similar the sequence of interest is to the known structures, the more confident the structure prediction can be. This confidence was assessed using standard structural quality measurements calculated by the VADAR structural comparison and evaluation suite 61.

Threading of these motifs was confident (Table 2-8), and highlighted a number of additional conserved surface features potentially contributing to interaction (Figure 2-15.i-iii, Figure 2-16, Figure 2-17), beyond the previously published four residue hydrophobic patch, Phe30, Leu34, Leu233 and Ile249 2. Paired predicted encapsulin targeting motifs fit proposed binding interfaces better than the negative control. The C. acidiphila DSM 44928 motif was the best fit for all shell binding interfaces tested, due to its closest match to the consensus motif. The A. metalliredigens QYMF motif and shell binding interface fit each other well, but poorly fits the non-interacting C. acidiphila DSM 44928 shell interface, consistent with co-optimization.

Table 2-8: Example VADAR 61 3D profile quality index statistics for CEAS motifs

| Encapsulin shell | Negative control: AAAAAAAA | Positive control: T. maritima encapsulated ferritin motif; GGDLGIRK | Catenulispora acidiphila DSM 44928 encapsulated peroxidase motif; SLGLGSLK | Alkaliphilus metalliredigens QYMF ferritin motif; GLNIGNLK |
|---|---|---|---|---|
| T. maritima encapsulin | 44 | 56 | 56 | 56 |
| Catenulispora acidiphila DSM 44928 encapsulin | 36 | 54 | 56 | 38 |
| Alkaliphilus metalliredigens QYMF encapsulin | 47 | 56 | 62 | 57 |

These measures quantify quality of fit, and should not be extrapolated to affinity, specificity or biological packaging efficiency.

Figure 2-14: T. maritima published binding interface for CEAS motif

The last eight out of ten positions in the T. maritima CEAS motif, GGDLGIRK, were previously co-crystallized with that organism's encapsulin 2. This peptide forms a salt bridge between the Asp and Arg, and makes hydrophobic contacts with four residues highlighted in purple. Due to the flexibility conferred by the leading glycine, positions further upstream could not be resolved.

Figure 2-15.i: Conserved predicted binding interface for consensus CEAS motif

Encapsulin shell residues within 4 Å of threaded CEAS motif peptide (grey) highlighted with carbon atoms in green, nitrogen atoms in blue, and oxygen atoms in red. Surface view on threaded encapsulin shell. Mesh view on threaded CEAS peptide. The highlighted residues were conserved or covariant with the targeting motif and were close enough for physical interaction. This surface expands on the four amino acid binding site predicted from the T. maritima encapsulin system alone. New encapsulin shell and CEAS motif peptide from Catenulispora acidiphila DSM 44928 shown as examples. VADAR 61 validates this homology modeled threaded structure as similar in quality to the published T. maritima structure of the same regions, with the same 3D profile quality score.

Figure 2-15.ii: Top view of polar contacts and hydrophobic patch residues in Catenulispora acidiphila DSM 44928 peroxidase encapsulin, predicted targeting motif and binding surface interaction

Purple residues form one hydrophobic pocket and two patches, complementary to the hydrophobic residues in the targeting motif. The approximate silhouette of these patches was outlined in purple. Yellow dashed lines mark polar contacts. No salt bridges were predicted in this example.

Figure 2-15.iii: Two-dimensional simplified view of interacting positions in Catenulispora acidiphila DSM 44928 peroxidase encapsulin, predicted targeting motif and binding surface interaction.

The motif peptide was marked in bold, with the proposed interacting encapsulin surface residues arrayed around this in plain text. The first two positions in the motif, boxed in grey, were not resolvable in the T. maritima targeting peptide. Thus interaction with those residues was inferred based on continuation of the motif peptide within the constraints imposed by the resolved encapsulin surface and monomer/monomer interface in this region. The four residues highlighted in blue were previously implicated in interaction with the T. maritima targeting peptide. The additional residues were conserved and positioned to interact with the connected motif positions. Green dashed lines denote polar contacts. Purple dashed lines denote hydrophobic interactions. Purple boxes group the residues forming the large hydrophobic patch and pocket.

Figure 2-16: Diagrammed model of covariant positions between paired targeting motifs and classical encapsulins

Several positions in all classical encapsulin association specific motifs co-vary with positions on the inner surface of paired encapsulins (N = 646). Blocks in the same colour mark interacting positions which co-vary between paired targeting motifs and encapsulin shell proteins. Positions connected by solid lines were co-conserved for interaction, based on proximity and orientation when threaded to the T. maritima encapsulin and peptide structure.

Figure 2-17: Alignment of conserved encapsulin and capsid inner surface positions within 4 Å of threaded CEAS motif peptide

20 sites within 4 Å of threaded CEAS motif peptide structure show strong conservation, and covariance with variant positions of the CEAS motif. Based on conservation, physical proximity, and complementary amino acid properties I propose these sites contribute to physical interaction between the encapsulin shell and cargo proteins. The bottom seven sequences in the alignment (boxed in red, background blue), are capsids with similarity to encapsulins, conserving different amino acids in these 20 positions. Blue vertical lines mark collapsed alignment columns.

Within the threaded structural models of the new CEAS motifs, the CEAS motifs contact several residues along a conserved cleft on the inner surface of the encapsulin shell. This cleft involves 20 positions on the encapsulin shell and 11 positions on the cargo enzyme in the CEAS motif (Figure 2-16, Figure 2-17). The CEAS motif residues occur consecutively in a continuous peptide, while the interacting encapsulin shell positions are distributed discontinuously in three segments of the shell protein. The T. maritima anchor peptide CEAS motif contacts a subset of these 20 positions, including but not limited to the four positions published previously 2.

References to residues in the CEAS motifs or encapsulin shells are represented in the form CEAS_Aaa# or Enc_Aaa# respectively, with Aaa standing for the three letter amino acid code and # marking the position, numbered from 1 to 11 starting at the aspartate in the consensus CEAS motif, or from the start codon in the T. maritima encapsulin.
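The naming convention can be made concrete with a short helper. This is only an illustrative sketch: the function name `label_positions` is hypothetical, and `DGSLGIGSLKG` is one arbitrary instance of the consensus CEAS motif DGSL[TGN][IV]GSLKG given in this section. The T. maritima peptide GGDLGIRK is numbered from position 3, following the statement in this section that its first glycine corresponds to the serine at consensus position three.

```python
# One letter to three letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def label_positions(seq, prefix, start=1):
    """Label each residue as prefix_Aaa#, numbering from `start`."""
    return [f"{prefix}_{THREE_LETTER[aa]}{i}" for i, aa in enumerate(seq, start)]


# One instance of the consensus motif, numbered 1 to 11 from the aspartate.
consensus_labels = label_positions("DGSLGIGSLKG", "CEAS")
# The solved T. maritima peptide covers consensus positions 3 to 10.
tmaritima_labels = label_positions("GGDLGIRK", "CEAS", start=3)

print(consensus_labels)
print(tmaritima_labels)
```

Under this numbering the terminal lysine of GGDLGIRK falls at position ten, matching the CEAS_Lys10 label used for the consensus motif.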

Relative to the consensus CEAS motif, DGSL[TGN][IV]GSLKG, the hydroxyl group of the serine in position three (CEAS_Ser3), corresponding to the first glycine of GGDLGIRK, was positioned to interact with the nitrogen of the amide bond of the methionine in position four of the encapsulin (Enc_Met4). These first two glycines of the T. maritima peptide lack the interaction potential of consensus amino acids in these positions. CEAS_Leu4 fit into a conserved pocket between the Enc_Leu7 and Enc_Arg9 residues, using CEAS_Gly5 as a flex point to improve this fit. If position five in the CEAS motif was instead threonine, asparagine, or similar residues, these side chains fit near and potentially interact with the conserved Enc_Gln231 and Enc_Asp232 positions (Figure 2-15.i, Figure 2-17). In sequence alignments (Figure 2-17), encapsulin position 231 varied primarily (80%) between glutamine, threonine, and glutamate and these residues correlated with variations in the third position of the CEAS motif. These interactions compensated for a less tight fit of the CEAS_Leu4 position, caused by bulkier hydrophobic substitutions tolerated at this site. CEAS_Ile6, CEAS_Ser8 and CEAS_Lys10 faced an extended recessed interface including the hydrophobic patch of Enc_Phe30, Enc_Leu34, Enc_Leu233, and Enc_Ile249. Positions six and eight of the consensus CEAS motif models fit in this conserved four residue patch, while position ten extended beyond it. The polar atoms in position eight were accommodated by the additional conserved positions Enc_Ala26, Enc_Arg27, and Enc_Ser234 forming a polar patch adjacent to the hydrophobic patch. In 3.0% of encapsulin alignments Leu34 was replaced with serine, or Ile249 was replaced with histidine, accommodating these polar/polar interactions. In the rarer cases of hydrophobic residues in motif position eight, position 27 of the encapsulin instead encoded small hydrophobic residues. The seventh position was very highly conserved as glycine acting as a kink in the peptide structure, positioning the surrounding interaction sites. The tenth position, CEAS_Lys10, was situated to interact with the oxygen of the amide bond between Enc_Lys31 and Enc_Thr32 in the T. maritima context. Aligned residues at these encapsulin positions were moderately conserved (31: 56% K or R, 28% S or T; 32: 68% polar). If packed as a different rotamer, lysine or arginine in this motif position fit placing the positive charge near the oxygen of the amide bond between the conserved Enc_Lys38 and Enc_Phe39. CEAS_Leu9 fit into a side pocket formed by Enc_Val42, Enc_Gly44, Enc_Pro45 and Enc_Leu229 likely forming hydrophobic interactions. Of these positions 42, 45, and 229 were conserved, and 44 was more variable.

Three strongly conserved motif positions extended beyond the solved peptide, specifically CEAS_Asp1, CEAS_Gly2, and CEAS_Gly11. The two glycines likely confer flexibility relative to the preceding and following peptide chain. The precise location of CEAS_Asp1 was unresolved in the T. maritima structure, possibly due to atypical flexibility from the two leading glycine residues in the T. maritima CEAS motif. However, the presence of a polar cleft in line with where the aspartic acid would occur if the peptide continued N-terminally suggested a possible interaction site involving the peptide backbone of Enc_Glu5, Enc_Leu7 and the amide bond between Enc_Lys8 and Enc_Arg9. Variations in the CEAS motif correlated with complementary variations in these encapsulin surface positions (Figure 2-16, Figure 2-17). As an example, replacement of a polar CEAS position with an aromatic was accommodated by replacement of the polar encapsulin surface pocket with a pocket composed of small hydrophobic residues. Conversions of hydrophobic to polar co-variation were also observed between the encapsulin and CEAS motif positions. Sequence and threaded structures supported several cases of the salt bridge observed in T. maritima's anchor peptide being substituted for added hydrophobic interactions, allowing for its absence in the consensus.

Similar features were evident at the analogous positions on the inner surface of capsids close enough in sequence to enable confident alignment to encapsulins. Conservation of these positions was variable depending on the capsid compared, with different capsids conserving different subsets of these positions and conservation trends of different properties (paired hydrophobics, charge pairs, polar/polar interactions) in some capsid families (Figure 2-17). The small sample size of capsids with high enough similarity to classical encapsulins prevents these capsid conservation trends from being statistically validated, but the observed trends fit a hypothesis of encapsulin/cargo interaction through these sites diverging to produce the capsid/accessory protein interactions at work in modern phage. For instance the Burkholderia phage BcepF1 conserves the encapsulin-like hydrophobic patch, but not the hydrophobic pocket, and a subset of the covariant positions. These capsid surfaces may contribute to binding of the capsid scaffold or head peptidase 37.

The highly variable positions upstream of the core CEAS motif remained variable at the cluster level, suggesting these positions do not contribute specifically to interaction/function with the shell monomers. Extrapolating based on the peptide structure and viable bond angles for continuing the peptide, these positions occurred extending into the compartment lumen or following the curvature of the multimer/multimer interface. These submotif positions may serve as subfamily specific features optimally positioning the CEAS motif relative to the enzymatic fold, accommodating eccentricities on the specific extended encapsulin surfaces, or contributing non-specific interactions with the compartment inner surface or adjacent cargo. Alternatively upstream residues, along with linker features, may alter the optimal geometry of the multimer/multimer interfaces, influencing the size of the compartment, similar to the effect of capsid scaffold proteins on some phage heads 37. Thus most CEAS motifs and linkers likely interact over a larger area than the T. maritima example.

2. Discussion

In summary, these segments of my project identified classical encapsulins in much broader roles than was previously recognized, occurring in fourteen bacterial and three archaeal phyla, and including 590 gene clusters. This includes the identification of four distinct new targeted cargo enzyme families, and three enriched accessory protein families. These targeted cargo enzymes all conserve a single family of encapsulin association specific motifs, uniquely found at either terminus of encapsulin associated enzymes. These also include the first instances of in vivo encoded genomically separated compartment components in encapsulins. These targeted enzymes also presented the first instances of more than one distinct type of enzyme predicted for targeted encapsulation by the same shell protein gene product.

This previously unrecognized complexity may explain certain observational inconsistencies from the B. linens encapsulin. Purified B. linens peroxidase encapsulins were initially shown to confer bacteriostatic activity against diverse Gram positive bacteria 41. In B. linens this functions in synergy with the bactericidal activity of a co-expressed but distinct membrane depolarizing factor 4,42. Sutter et al. 2008 2 reconfirmed the absence of bactericidal activity in the B. linens encapsulin, concluding annotation as a bacteriocin was erroneous, but did not address bacteriostatic activity, which was the more recent basis for annotation of the B. linens compartment as a bacteriocin 41. Sutter recombinantly expressed and purified the B. linens encapsulin shell and CEAS peroxidase in E. coli, and found no lysis activity against a confluent bacterial lawn of susceptible Listeria. This same strain was also not lysed when exposed to increasing purity and concentration of the native B. linens M18 encapsulin 2. The use of a confluent lawn precludes testing non-lytic killing, log phase dependent killing, or bacteriostatic activity. These other activities can be addressed through growth curve assays, spotting assays with log phase bacteria grown in the presence of linocin, or comparative growth assays 4. Given the larger operon of other enzymes significantly enriched around this encapsulin, plausibly the two genes expressed by Sutter et al. 2008 2 were not the complete bacteriostasis operon, and thus would not have confirmed the effect observed previously 21. Purification methods may also have allowed for diffusion of the bacteriostatic factor out of the compartment. Given the known in vitro substrates of these peroxidases 20, bacteriostasis may be induced through oxidative modification of susceptible cell wall components, similar to lignin 20 or the decoloured aromatic dyes 36 for which the peroxidase family is named. The trend of ex situ encapsulin expression forming compartments with subtly different properties 2,20,22 further supports that encapsulins are not the simple independent two component systems initially identified. My findings indicate diverse factors may also contribute to regulated formation and function of these compartments.

These findings implicate encapsulins as more widespread and diverse in components than previously recognized. These compartments can be confidently considered to functionally associate with six different enzyme families, including the two previously known. Evidence was found supporting three other encapsulin associated families, with distinct encapsulin association specific motifs, but not the targeting CEAS motif (appendix 1). The six families conserving the canonical CEAS motif are confidently considered to be encapsulated on the inner surface of these encapsulins, based on conservation, prior experiments and unpublished communications.

These are the iron-dependent peroxidases, ferredoxins, ferritin-like proteins, Dps-bacterioferritin-like proteins, hemerythrin-like proteins, and rubrerythrin-like proteins. The other three associated families, CutA1-like proteins, metalloproteases, and radical SAM oxidoreductases, are not internalized by the same mechanism, and in the case of the metalloprotease may function in a regulatory role rather than metabolically. These enzymes show organization into more complex multi-component systems with the potential for metabolic synergy.

Chapter 3. Characterization of classical encapsulin-like families

3. Chapter Abstract:

While the number and diversity of encapsulin encoding prokaryotes indicates these compartments are biologically important, the exact functions of these classical encapsulins remained unclear. The previously published data was from a too diverse, limited set of encapsulins to draw any significant conclusions regarding general functions, niches, or ecological importance. Thus, enrichment analyses of genes, phenotypes, ecological niches and taxonomic distributions associated with the larger encapsulin family identified above, revealed new insights into the conserved roles and functions performed by encapsulins in general. Three niches of significantly enriched encapsulin function were identified corresponding to distinct enriched operon components and targeted enzymes. Peroxidase and hemerythrin associated encapsulin genes were enriched in mesophilic, obligate aerobic, non-aquatic pathogens and in operons encoding molybdenum dependent metabolisms, otherwise sensitive to reactive oxygen.

Ferritin-like protein, radical SAM oxidoreductase, and rubrerythrin associated encapsulin genes were enriched in thermophilic, obligate anaerobic, specialized aquatic, non-pathogens and in operons encoding reactive nitrogen, iron, copper, and arsenic resistance. Bacterioferritin and CutA1-like protein associated encapsulin genes were enriched in mesophilic, aerobic, motile, non-spore forming bacteria and in operons encoding molybdenum dependent metabolism and reactive nitrogen resistance.

Quantification and characterization of horizontal transfer signals also supported these significant patterns. Encapsulin shell and targeted enzyme genes followed the same predicted Karlin distributions of horizontal transfer, and showed signals of horizontal entry into most of the genomes currently encoding these compartments. In select cases these signals can be traced back to phage as the closest origins, while in others direct cross kingdom transmission between Archaea and bacteria is predicted. In general, the encapsulin shell genes are predicted to acclimate closer to the nucleotide signature of the current genome faster than the co-encoded genes producing the targeted enzymes. Despite this, the nucleotide signatures of the various encapsulin operon components relate to each other more closely than the positive control of Photorhabdus-like virulence cassette genes, suggesting the encapsulin components are at least as tightly coupled in transfer as Photorhabdus-like virulence cassette genes, and likely more so.

More precise and robustly accurate phylogenetic methods support these relationships and highlight the predicted origin of the encapsulin targeting motif as lying with the ferritin-like proteins. The other targeted enzymes are predicted to derive their copy of this motif from recombination.

3. Introduction:

Published classical encapsulins occur in diverse niches, and divergent lineages without a readily apparent conserved function. The enzymes known to be encapsulated in these systems are also distinct, based on inferred aromatic and ionic ligands, potentially involved in multiple functional pathways. A systematic analysis of the genes encoded around a large population of classical encapsulins will aid in predicting the functional operons, and pathways to which encapsulins contribute. Analysis of the phenotypic, metabolic, taxonomic and ecological features enriched for encapsulin encoding organisms will elucidate the niches in which encapsulation of these pathways confers selective advantages. The PATRIC 58 and MiST 56 datasets are ideal for this analysis, as well curated and validated sources of phenotypic, ecological, taxonomic and environmental features of diverse bacteria. These datasets are also large enough to allow robust statistical evaluation of trends to determine significance.

Horizontal transfer is a reasonable mechanism to consider relative to the dissemination of encapsulins to the diverse lineages in which they currently occur. Characterization and quantification of these transfer signals, and the phylogenetic signal overlying them, will clarify how the modern distribution most likely arose. Karlin distance metrics 64, Bayesian inference 46, maximum likelihood 45, and Bayesian co-estimation 48,49 are all validated effective tools for assessing these signals, and deriving meaningful interpretations.

In this chapter I used these tools to predict: the impact of horizontal transfer on encapsulin dissemination; whether the genes encoding encapsulin shells and cargo enzymes were transferred together or have more independent histories; and the biological functions of encapsulins overall. These analyses are primarily statistical in nature, aiming to encompass the breadth of encapsulins rather than focusing on specific cases, as has been attempted with individual encapsulins and other proteins in the past.


3. Methods:

3.I Detection of significantly enriched phenotypic and taxonomic classifications among encapsulin encoding organisms

Phenotypic enrichment analyses based on the PATRIC 58 and MiST 56 datasets were used to identify conserved physiological niches of encapsulin function. A set of 36 phenotypic, taxonomic, ecological, and environmental features was analyzed for significant enrichment for subsets of 41773 specific values. These significantly enriched features marked taxonomic, ecological and physiological conditions where encoding encapsulins was significantly selectively advantageous. Because physiological and ecological features are interdependent, these advantages may not relate directly to each enriched feature independently, but may instead be driven by one or more features strongly associated with the others. To address the role of interdependency in enrichment, these features were then analyzed for inter-correlation and statistical association, using Pearson product-moment correlation coefficients 51 and Goodman-Kruskal tau association coefficients 43,50,54, to determine which significantly enriched fields were significantly linked to or dependent on others. Pearson product-moment correlation is only applicable to comparing continuous numerical variables (e.g. genome length), while Goodman-Kruskal tau association is applicable to comparing non-continuous non-ordinal variables (e.g. motility). These analyses utilized the implementation of the Goodman-Kruskal tau association test presented in Formula 3-1, along with explanation of the theory behind the test 43.
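The directional association described above can be illustrated with a minimal sketch of the textbook Goodman-Kruskal tau for two categorical variables. This is not claimed to reproduce the implementation in Formula 3-1; the function name, the convention for a constant response, and the toy annotations are assumptions made for illustration only.

```python
from collections import Counter


def goodman_kruskal_tau(predictor, response):
    """Return tau(response | predictor) for two equal-length categorical sequences."""
    n = len(predictor)
    if n == 0 or n != len(response):
        raise ValueError("inputs must be non-empty and of equal length")

    joint = Counter(zip(predictor, response))
    predictor_counts = Counter(predictor)
    response_counts = Counter(response)

    # Variation in the response when the predictor is ignored (Gini index).
    v_total = 1.0 - sum((c / n) ** 2 for c in response_counts.values())
    if v_total == 0.0:
        return 0.0  # constant response: nothing to predict (convention assumed here)

    # Expected variation in the response once the predictor value is known.
    v_conditional = 1.0 - sum(
        c * c / (n * predictor_counts[x]) for (x, _), c in joint.items()
    )
    return (v_total - v_conditional) / v_total


# Toy annotations (hypothetical, not taken from PATRIC or MiST).
shape = ["rod", "rod", "coccus", "coccus", "spiral", "spiral"]
motility = ["yes", "yes", "no", "no", "no", "no"]

print(goodman_kruskal_tau(shape, motility))  # motility predicted from shape
print(goodman_kruskal_tau(motility, shape))  # shape predicted from motility
```

In this toy case shape fully determines motility, while motility only partly determines shape, so the two directions give different values (1.0 and 0.5, up to floating point rounding).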

The genomes encoding each of the predicted major encapsulin/enzyme system types were similarly compared to the full PATRIC genome set as background, identifying significant differences in association between features relating to the presence of encapsulins (Formula 3-2). Background correlations or associations greater than or equal to 0.50 were given further consideration as sources of dependency between features. Those features which were independent of statistical association with other features were assessed for independent roles in encapsulin function. While weaker correlations and associations could be statistically significant, depending on sample size, their influence would not explain the majority of shifts in feature distributions. These smaller significant dependencies were considered only when relevant to a significantly enriched or depleted feature. Changes in feature association between encapsulin datasets and the background were evaluated based on statistical significance. Genes encoded in proximity to these predicted encapsulins were examined for enriched function in association with the various clusters, groupings, and enriched niches identified above. Predicted function was systematically assigned following the decision tree presented in Figure 3-1.

The predicted M. tuberculosis encapsulin shell and adjacent peroxidase gene were cloned into co-expressible IPTG inducible plasmids under the control of the T7 promoter. Protein expression in E. coli BL21 was confirmed by SDS-PAGE. Starting cultures were inoculated to an initial OD595 of 0.1 in LB broth across three 96 well plates for five replicates. Growth was measured over the course of 19 hours using a TECAN incubated shaking plate reader, and compared between strains induced with IPTG to express either the shell alone, peroxidase alone, or shell and peroxidase together. Reciprocal empty plasmids were co-transformed in the shell alone and peroxidase alone strains, to avoid plasmid specific effects. Filtered cell lysates were visualized by transmission electron microscopy for the presence or absence of encapsulin-like particles after induction of plasmids.


Figure 3-1: Decision tree of functional annotation of genes flanking predicted classical and novel encapsulin genes


3.2 Quantification of horizontal transfer signals

Measuring horizontal transfer signal depends on comparing the properties of one protein, gene, gene segment, genomic region, or genome to another. The properties chosen affect the quality of the transfer predictions that can be made. Codon bias, and by extension nucleotide frequency, is one of the best properties for this purpose, as it is not under direct selection and is the product of such complex processes as to be effectively stochastic. χ² test analysis of mononucleotide frequencies provides only coarse distinction in codon bias, due to a number of factors skewing mononucleotide frequency. This gives the χ² test on mononucleotide frequencies high false positive and false negative rates when classifying horizontally transferred regions. Trinucleotide frequency is strongly determined by protein sequence and thus selection for function, making it also non-ideal as a measure of neutrally transferred signal. χ² testing on dinucleotide frequencies provides a less noisy test for genome signal and codon bias since, as Karlin (1998) 64 and others have shown, dinucleotide frequency largely corrects for amino acid composition by preferentially sampling {wobble,1} pairs across adjacent codons.

However, this fails to account for the relationship between mononucleotide and dinucleotide composition, reducing sensitivity to signature differences. The Karlin δ* difference metric accounts for this (Formula 3-3), taking into account both double stranded mononucleotide and dinucleotide frequencies defining a more sensitive measure of nucleotide signature difference 64.
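As an illustration of the signature difference described above, the following is a minimal sketch of the δ* calculation as it is commonly stated in the literature: each sequence is combined with its reverse complement, each dinucleotide frequency is divided by the product of its component mononucleotide frequencies, and the absolute differences between the two sequences are averaged over the sixteen dinucleotides. It is not claimed to match Formula 3-3 term for term; the function names and the toy sequences are assumptions for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def rho_star(seq):
    """Strand-symmetrized dinucleotide odds ratios for one DNA sequence."""
    seq = seq.upper()
    mono = dict.fromkeys("ACGT", 0)
    di = dict.fromkeys(DINUCLEOTIDES, 0)
    # Count over the sequence and its reverse complement (both strands).
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:  # skips pairs containing ambiguous bases such as N
                di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0 or min(mono.values()) == 0:
        raise ValueError("sequence too short or missing a nucleotide")
    return {
        pair: (di[pair] / n_di)
        / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair in DINUCLEOTIDES
    }


def karlin_delta_star(seq_a, seq_b):
    """Mean absolute difference of the sixteen odds ratios."""
    rho_a, rho_b = rho_star(seq_a), rho_star(seq_b)
    return sum(abs(rho_a[p] - rho_b[p]) for p in DINUCLEOTIDES) / 16.0


# Hypothetical toy sequences, far shorter than any real gene or genome.
gene = "ATGCGCGCTAGCTAGGCTAACGT" * 3
genome_sample = "AATTTTAAATATGCAATTAACG" * 3
print(karlin_delta_star(gene, genome_sample))
```

Because both strands are counted, a sequence and its reverse complement give the same signature, so the result does not depend on which strand a gene is annotated on.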

The enrichment of each dinucleotide couple was calculated using expected frequencies based on mononucleotide frequencies. The sixteen dinucleotide couples have been shown to be predictive of lineage specific nucleotide signatures 64. The average difference in these signatures of dinucleotide enrichment is a validated measure of lineage difference 64. The magnitude of these differences has been compared between a variety of genes from diverse and related organisms across all kingdoms, and with a breadth of well defined relatedness relationships 64. Their observations define a validated scale of interpretation for evidence of horizontal transfer.

To understand the underlying structure of the nucleotide signature differences observed, further Karlin analyses were performed comparing the genes and genomes of interest with each other, and with control sets focusing on DNA polymerase III subunits and Photorhabdus-like virulence cassette genes (PVC) 66. These analyses included deriving predicted horizontal transfer paths for the transmission of encapsulins and enzymes, suggesting causes for the similarities and divergences observed 67. Polymerase genes served as a negative control for nominally horizontally transmitted genes, as they depend on a tightly coupled multi-component complex required for DNA replication intrinsically optimized for the current genomic signature, supporting the assumption they rarely produce selectively advantageous horizontal transfers 55,67. PVC genes comprise a tightly coupled, highly conserved, horizontally transferred operon similar in gene order and product structure to bacteriophage tails 55,66, serving as positive controls.

3.3 Phylogenetic analyses of encapsulins and associated enzymes

Bayesian co-estimation (StatAlign) 48,49, Bayesian inference (MrBayes 3.2.1) 46, and maximum likelihood (Phyml 3.0) 45 phylogenetic methods were applied to predict ancestry of each family containing homologs, with and without significantly conserved encapsulin association specific (CEAS) motifs. To the extent possible, outgroups for each dataset were chosen from related divergent proteins with distinct function, but retaining sufficient identity for alignment. Initial input protein multiple sequence alignments were produced using the MAFFT algorithm 28. Alignments were then optimized using the Bayesian co-estimation method. The optimized alignments were used as input for the three above phylogeny prediction methods, generating 'best fit' predicted phylogenies, which were compared for similarities and disagreements.

Interpretation of phylogeny topology and relative positions of homologs encoding the encapsulin associated exclusive motifs, relative to homologs without these motifs, were used to predict how these motifs arose and were distributed.

Bayesian inference was performed with two parallel runs of five chains of inferred phylogeny trees, with heating on four of the five chains introducing variations. The amino acid substitution rates were modeled allowing for a mixture of models with fixed rate matrices assuming a gamma distribution with each state modeled separately. Trees in each chain were sampled every hundredth iteration for swapping between chains. These analyses were allowed to run to convergence, defined as an average standard deviation of split frequencies less than 0.050.
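The convergence criterion above can be illustrated with a small sketch of how an average standard deviation of split frequencies could be computed from per-run split frequencies. This is not MrBayes code: the function name, the input layout (one dictionary of split frequencies per run), the use of the sample standard deviation, and the minimum frequency filter of 0.10 are all assumptions made for illustration. Only the 0.050 threshold comes from the text.

```python
from statistics import stdev


def average_sd_split_frequencies(run_frequencies, min_frequency=0.10):
    """Average, over retained splits, of the standard deviation across runs.

    run_frequencies: one dict per independent run, mapping a split label to
    the fraction of sampled trees in that run containing the split.
    """
    if len(run_frequencies) < 2:
        raise ValueError("at least two independent runs are required")

    all_splits = set().union(*run_frequencies)
    deviations = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in run_frequencies]
        # Ignore splits that are rare in every run (assumed filter).
        if max(freqs) >= min_frequency:
            deviations.append(stdev(freqs))
    return sum(deviations) / len(deviations) if deviations else 0.0


# Hypothetical split frequencies from two runs of a four-taxon analysis.
run_1 = {"AB|CD": 0.90, "AC|BD": 0.05}
run_2 = {"AB|CD": 0.80, "AD|BC": 0.05}

asdsf = average_sd_split_frequencies([run_1, run_2])
print(asdsf, "converged" if asdsf < 0.050 else "not converged")
```

In this toy case only the split AB|CD passes the frequency filter, and the two runs disagree on it by 0.10, so the value stays above the 0.050 threshold and the runs would be continued.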

Phyml maximum likelihood prediction was performed using the Le-Gascuel model of amino acid substitution, and a discrete gamma distribution 62. The initial tree topology was derived based on the BioNJ neighbour joining algorithm 45. Subsequent topology space searching used the NNIs methodology 45. All other parameters used default values. Two independent runs were performed for each sequence family, and the resulting trees compared for inconsistencies.

3. Results:

3.I Classical Encapsulins have been optimized for distinct roles in multiple niches through the combination of different cargo enzyme families

The analyses discussed above predicted encapsulins in a wide range of organisms living in various niches with diverse metabolism and biochemistry. At the same time, facilitated by horizontal transfer, encapsulins occurred in subsets of lineages, rather than being uniformly distributed as expected for vertical inheritance. For instance, encapsulins are prevalent in Bacteroidetes but absent from related Chlorobi and Ignavibacteriae, while also present among Acidobacteria but absent from related Fibrobacteres and Marinimicrobia (Figure 2-2). Even within some genera some species encoded encapsulins while others did not. Among Agrobacterium, A. radiobacter, A. rhizogenes and A. tumefaciens encode encapsulins while A. fabrum, A. sp. H13-3 and A. vitis do not. This suggests that encapsulins either function in a single widely but sporadically conserved metabolic role, function in a sporadically conserved niche, or fulfill multiple distinct functions in different niches.

If encapsulation were not advantageous it would be selected against, due to the effect of restricting diffusion and the added resource cost of producing such large complexes. Thus some advantage must exist to offset the increased metabolic cost. Either the advantage is universal, in which case encapsulin usage is only restricted by availability, or the advantage is specific to a subset of niches. Thus, a single function that utilizes encapsulation in some lineages but not in related lineages would not in fact be the same function in both; rather, one role would be restricted to niches where encapsulation is not advantageous, and the other to niches where it is.

Non-advantageous parasitic elements, as transposons are considered by some to be, lack the sequence conservation seen in this family. True parasites are actively degraded by hosts, producing high rates of divergence and short range diffusion between lineages. This produces the strain to strain diversity seen in prophages. These encapsulins show the reverse of these expectations, with conservation across a very large range of organisms. Sutter et al. (2008) 2 noted ferritins in thermophilic anaerobes and hypothesized that they may play a role in protection from oxygen exposure, but they did not quantify the observation or test the hypothesis. One way to elucidate what physiological roles encapsulins serve was to determine what niches encapsulins were significantly enriched in, and what operons they were found co-encoded with. Thus, I performed a systematic analysis to address these issues.

The distributions of phenotypic features common to bacteria were analyzed to elucidate which phenotypic niches showed statistically independent, significant enrichment for encoding encapsulins. Based on the guilt-by-association principle 32, phenotypes enriched in encapsulin encoding strains are predicted to be functionally linked to the biological role of encapsulins. As definitions and metrics can vary between sources, two standardized datasets of bacterial phenotypic annotations were compared and analyzed for trends and significant differences between bacteria in general and encapsulin encoding organisms specifically. These datasets were the Pathosystems Resource Integration Center database (PATRIC) 58 and the Microbial Signal Transduction database (MiST) 56. PATRIC is a curated collection of characterized pathogenic and non-pathogenic bacterial strains annotated and cross-referenced for a range of physical, ecological, pathological, genomic, and phenotypic features 58. MiST comprises a similar dataset with a focus toward bacteria studied for signal transduction 56, and annotates using a feature set and classification system that overlap with those of PATRIC. Statistical inter-dependence between variable ecological, genomic, and phenotypic features was quantified using Pearson product-moment correlation coefficients 51 and Goodman-Kruskal tau association coefficients 43,50,54. The resulting correlation and association matrices are available in Tables 3-1 and 3-2 respectively.


Table 3-1: Pearson product-moment correlation matrix between continuous numerical features of bacteria in the combined PATRIC and MiST phenotypic annotation dataset. Available on the appended CD or online at https://www.dropbox.com/s/czjcls4xairwqrz/radfordThesis_correlationAndAssociationMatrices.xls

Table 3-2: Goodman-Kruskal Tau-B association matrix of non-ordinal non-continuous features of bacterial genomes in the combined PATRIC and MiST dataset. Available on the appended CD or online at https://www.dropbox.com/s/czjcls4xairwqrz/radfordThesis_correlationAndAssociationMatrices.xls

These analyses identified three largely mutually exclusive niches of encapsulin function specific to which major CEAS enzyme classes were co-encoded with the encapsulins (Figure 3-2). The statistically defined specific significant features of these niches are presented in Table 3-3. These three niches and the diverse operons associated with encapsulins in each are discussed in the subsections below. Three prominent examples of these niches are diagrammed in Figure 3-3. Additional less prominent niches are also suggested to exist, but lack sufficient size for statistical confidence.

Cell shape, temperature tolerance, oxygen tolerance, habitat, and specific forms of pathogenicity differed significantly in distribution and enrichment between encapsulin encoding and background genomes. Different combinations of cargo enzymes showed different enrichment profiles for phenotype, physiological niche, and stress tolerances. This suggests related encapsulins are fulfilling distinct functions in different niches, through packaging different cargo combinations. Similarly, related cargo proteins are suggested to fulfill additional functions when co-encoded with different enzymes.


Table 3-3: Features of significant difference between encapsulin encoding genomes and background of all genomes in PATRIC dataset

Feature | Value | f(x) encapsulin | f(x) background | Hypergeometric test P-value | Bonferroni correction cutoff
Cell shape | Cocci | 0.02404 | 0.2661 | 0 | 1.877E-05
Cell shape | Bacilli | 0.8558 | 0.6138 | 2.17E-15 | 1.877E-05
Temperature tolerance | Mesophilic | 0.8374 | 0.9435 | 7.32E-09 | 1.877E-05
Temperature tolerance | Thermophilic | 0.1084 | 0.035859 | 1.15E-06 | 1.877E-05
Temperature tolerance | Hyperthermophilic | 0.04926 | 0.007781 | 8.95E-07 | 1.877E-05
Oxygen tolerance | Facultative | 0.105263 | 0.393359 | 0 | 1.877E-05
Oxygen tolerance | Aerobic | 0.642105 | 0.334375 | 0 | 1.877E-05
Habitat | Terrestrial | 0.165714 | 0.074608 | 1.26E-05 | 1.877E-05
Isolation Source | Sputum | 0.207407 | 0.015312 | 0 | 1.877E-05
Isolation Source | Bronchial alveolar lavage | 0.037037 | 0.00264 | 1.72E-06 | 1.877E-05
Isolation Source | Groundwater | 0.037037 | 0.00264 | 1.72E-06 | 1.877E-05
Disease | Necrotizing pneumonia, chronic infections | 0.052083 | 0.003005 | 5.78E-07 | 1.877E-05
Disease | Hemoptoic pneumonia | 0.052083 | 0.003606 | 3.28E-06 | 1.877E-05
Disease | Tuberculosis | 0.34375 | 0.019832 | 0 | 1.877E-05

Enriched features underlined and bolded in one of the two frequency columns depending if the feature was positively enriched in the encapsulin associated set or the background set.
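The kind of test summarized in Table 3-3 can be sketched as a one-sided hypergeometric enrichment test compared against a Bonferroni-corrected cutoff. The sketch below is illustrative only: the genome counts are invented, the function names are hypothetical, and the number of tests is an assumed value chosen so that the cutoff is of the same order as the one in the table. Depleted features would use the lower tail instead.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genomes are sampled without replacement from
    `population` genomes, of which `successes` carry the feature value."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff under Bonferroni correction."""
    return alpha / n_tests


# Invented counts: 1000 background genomes, 50 with the feature value;
# 100 encapsulin encoding genomes, 20 of which carry the feature value.
p_value = hypergeom_upper_tail(k=20, population=1000, successes=50, draws=100)
cutoff = bonferroni_cutoff(alpha=0.05, n_tests=2664)  # n_tests is an assumption

print(p_value, cutoff, p_value < cutoff)
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to intermediate rounding; very small probabilities may still print as 0 when reported to limited precision, as in several rows of the table.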


Figure 3-2: Encapsulin encoding organisms occupy distinct cargo specific phenotypic niches

Blue ovals represent encapsulin encoding prokaryotes enriched with specific cargo enzymes statistically and hypothetically functionally associated with specific ecological and physiological niches. Niches represented diagrammatically.

Figure 3-3: Encapsulin related resistance coupled to diverse operons


3.I.a: Niches of peroxidase encapsulin function

A significant subset of peroxidase packaging encapsulins from diverse species showed significant enrichment for several families of flanking genes involved in molybdenum metabolism 30. The most prominent of these enzymes were families of anaerobic molybdopterin-Fe4S4-dependent oxidoreductases and formate dehydrogenases (Hypergeometric test P = 2.36E-11). Several other families of molybdenum biosynthesis proteins were also enriched in these regions, though to a lesser degree. Non-encapsulin associated homologs of these enzyme families occur in obligate anaerobes, hence the annotation as anaerobic enzymes. Yet all of these candidate molybdopterin oxidoreductase/peroxidase/encapsulin operons were found in obligate aerobic organisms. Coordinated molybdenum in enzymatic centres is particularly susceptible to oxygen 30. Thus these encapsulins may protect the metal cluster from oxygen damage or Fenton toxicity during aerobic growth.

Enrichment analysis of the PATRIC dataset of annotated bacterial physiological features revealed that peroxidase/encapsulin encoding organisms were significantly enriched in non-aquatic, obligate aerobic, rod shaped, mesophilic bacteria with high guanine and cytosine (GC) content and large proteomes/genomes. Also significantly enriched were bacteria isolated from sputum and bronchial alveolar lavage, and diverse bacteria causing three invasive lower respiratory diseases: chronic necrotizing pneumonia, tuberculosis, and hemoptoic pneumonia. Identification of multiple enriched features in this way raised questions of interdependency and linkage, in which one feature of an organism's physiology or taxonomy was associated with another feature. These associations and correlations were statistically determined using the Goodman-Kruskal tau association metric (GKtau) 50,54 and Pearson product-moment correlation coefficient 51, with the relevant significant interdependencies between features outlined in Table 3-4.


Table 3-4: Associations and dependencies between significantly enriched properties of genomes encoding encapsulin associated iron-dependent peroxidases

Enriched field: value | Significantly associated/disassociated fields | Total association (GKtau) | Δ tau | P-value
GC content: 60-70% | Gram Stain [+] | 0.0473 | 0.0257 | 8e-07
GC content: 60-70% | Chromosome number [>background] | 0.0393 | 0.0336 | 2.37e-16
GC content: 60-70% | Plasmid number [


Only the significantly associated features are included in Table 3-4. Only pairs of continuous variables, such as genome length, proteome size, and optimal growth temperature, can be tested for correlation. No significant correlations were detected. Non-continuous variables, such as cell shape, aerobic tolerance, and pathology, are assessed for statistical association, analogous to correlation but less restricted in application (Formula 3-2). The scale and meaning are also similar to correlation, with an association of 1.0 signifying the value of the first variable is entirely predicted by the second variable, and an association of 0.0 signifying the value of the first variable is entirely unpredictable from knowledge of the second variable. One difference from correlation is that association has direction, such that one variable can be predictive of another without that variable being predictive of the first. Unlike simple correlation, an association of 0.5 is not generally assumed to be random, but signifies 50% of the variation in the first variable can be predicted based on the values of the second variable. Continuous variables can also be assessed for association, under the assumption that values are distinct (e.g. optimal growth at 37 °C is different from optimal growth at 40 °C). This allows comparison of continuous and non-continuous variables, but may give a skewed sense of dependence for continuous variables with smoothly distributed values due to the assumption above. As with correlation, association does not prove causation, only how well one variable predicts the value of another.

As I am comparing encapsulin encoding organisms to the background of bacteria in general, it is also possible to quantify how the relationships between features differ between encapsulin encoding bacteria and the background of bacteria overall. This statistic is termed Δ tau, as it is the difference in association between two variables, between the two sets (Formula 3-3). Both association, tau, and change in association, Δ tau, follow defined probability distributions 54, allowing for estimation of the probability of a given association, similar to determining the significance of a correlation. As in the case of correlation, a significant association is not necessarily a prominent association, but an insignificant association is unlikely to be important.
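A change in association of this kind can be sketched by computing tau from a contingency table for each set and taking the difference. The sketch below is an illustration under stated assumptions, not the thesis formula: the function names are hypothetical, the counts are invented, the sign convention (encapsulin set minus background) is assumed, and the significance test for the difference is not reproduced.

```python
def tau_from_table(table):
    """Goodman-Kruskal tau for predicting the column variable from the row
    variable, given a contingency table of counts (list of rows)."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]

    # Variation in the column variable ignoring the row variable.
    v_total = 1.0 - sum((c / n) ** 2 for c in col_totals)
    if v_total == 0.0:
        return 0.0  # constant column variable (convention assumed here)

    # Expected variation in the column variable within each row.
    v_conditional = 1.0 - sum(
        cell * cell / (n * sum(row))
        for row in table
        if sum(row) > 0
        for cell in row
    )
    return (v_total - v_conditional) / v_total


def delta_tau(subset_table, background_table):
    """Association in the subset minus association in the background."""
    return tau_from_table(subset_table) - tau_from_table(background_table)


# Invented counts. Rows: aerobic, anaerobic. Columns: rod, coccus.
background = [[30, 20], [25, 25]]
encapsulin_set = [[18, 2], [3, 7]]

print(tau_from_table(background))
print(tau_from_table(encapsulin_set))
print(delta_tau(encapsulin_set, background))
```

A positive value here means oxygen tolerance predicts cell shape better within the toy encapsulin set than in the toy background, which is the sense in which the text below reports increased association.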

Obligate aerobic growth showed strong significantly increased association with mesophilic temperature tolerance (tau = 0.57, Δtau = 0.14, P = 1.0E-11), rod cell shape (tau = 0.55, Δtau = 0.17, P = 8.3E-7), and non-aquatic habitat (tau = 0.57, Δtau = 0.17, P = 1.5E-10). Enrichment in rod shaped bacteria was driven by significantly increased association with mesophilic temperature tolerance (tau = 0.70, Δtau = 0.42, P = 1.8E-78), obligate aerobic growth (tau = 0.52, Δtau = 0.23, P = 1.6E-25), and non-aquatic habitats (tau = 0.49, Δtau = 0.25, P = 3.2E-31).

Thus all together these features were predicted to interact in encapsulin encoding organisms to a significantly greater extent than in prokaryotes in general. The absolute association between rod cell shape and obligate aerobic growth was almost equal in both directions, with encapsulins increasing the association of shape to aerobic growth more than the opposite direction. The association of both those features to non-aquatic habitats was effectively one way in both cases.

This implies that the advantage of this function of encapsulins was optimized by both aerobic growth and rod shape, but this advantage was moderately linked to non-aquatic growth. Aerobic rod cells experience more uniform exposure to oxidative toxicity than do other cell shapes of equal volume, or in anaerobic niches. Non-aquatic growth also presents distinctive toxic stress conditions.

The question of how mesophilic temperature tolerance relates to this niche remained. The majority of enrichment for mesophiles was due to enrichment for optimal growth at 37 °C. In bacteria overall, optimal growth temperature was not strongly correlated or associated with any other features. In the set of peroxidase/encapsulin encoding organisms, optimal growth temperature showed significantly increased association with larger proteome size, higher GC content and a number of taxonomic classifications. These were the orders Actinomycetales, Burkholderiales, and Corynebacterineae, and the families Mycobacteriaceae, Burkholderiaceae, and Nocardiaceae. Of these, Burkholderiales are evolutionarily very distant from the other orders, which occur in the Actinomycetes phylum. Further, it was the members of these taxa with higher GC content and larger proteomes which exhibited encapsulins. Thus the enrichment for mesophiles was not significantly dependent on oxygen tolerance, cell shape, or habitat, but did relate to overall genomic composition. This suggests the selective advantage causing enrichment of encapsulins in rod shaped aerobes was linked to mesophilic growth as well, but in a sub-niche influencing genome composition. Mesophilic niches are more chemically diverse than other niches, with greater potential for exposure to environmental toxins, which could be sequestered or excluded using encapsulation. Additional enriched features are discussed in appendix 2.

Overall, these enrichments and associations (Table 3-4) suggest the peroxidase/encapsulin system serves to tolerate metal and radical oxygen related toxic interference with metabolisms particularly susceptible to damage or dysfunction due to the niche inhabited. These niches appear to vary somewhat, ranging from an active host immune response generating peroxide and radical oxygen attempting to kill the encapsulin encoding pathogen, to the passive toxicity of terrestrial habitats rich or depleted in essential but potentially toxic metals including iron, copper, zinc, and others with affinity for the encapsulin associated operon components. A compartmentalization system such as encapsulins can in principle reduce the rate of damage in niches with limiting metal availability, or protect enzymes from toxic interaction with overabundant ions. Molybdenum dependent metabolisms were particularly enriched, with the implication that encapsulated peroxidases, along with co-targeted enzymes, act to confer increased resistance to reactive oxygen and toxin derivative ions.

Encapsulins in this family can also prevent direct enzymatic toxicity. When the M. tuberculosis CEAS motif encoding iron-dependent peroxidase gene was expressed in E. coli BL21 from a plasmid, a significant growth defect was observed, even at low expression levels without IPTG induction (Figure 3-4.i). This defect was aggravated by induction of expression. However, this defect was not seen when the encapsulin shell was expressed alone, or when the encapsulin and peroxidase were co-expressed together from the same T7 promoter plasmid systems (Figure 3-4.i). Expression was confirmed by SDS-PAGE for all constructs. Under TEM, the samples co-expressing peroxidase and encapsulin shell show encapsulin-like compartments of uniform size (Figure 3-4.ii). Similar to the R. jostii and B. linens encapsulins 2,20, the expression of the shell alone produced fewer compartments on TEM, despite comparable expression levels by SDS-PAGE. These observations support a protective role for encapsulation, restricting the reaction environment of these iron-dependent peroxidases and preventing these deleterious growth defects.

Comparing the effects of expressing mutants of this peroxidase without the CEAS motif, or without the active site, would address this hypothesis. This predicted function may act through excluding access to off-target reactants, or restricting diffusion of toxic products and intermediates, as seen in the Eut and Pdu compartments 1. A possible mechanism for bacteriostatic encapsulins is suggested, in which encapsulins accumulate inhibitory reaction byproducts, which are released into the media by cell death and pH dependent loss of containment, becoming inhibitory to non-encapsulin encoding prokaryotes. This form of pH dependent disintegration is documented for immature capsids, the structural stage most similar to encapsulins 80. Encapsulin encoding organisms might be protected by transporting these byproducts into their own encapsulins.


Figure 3-4.i: Coexpression with the M. tuberculosis encapsulin shell protein relieves growth defect from expression of M. tuberculosis iron-dependent peroxidase

Average growth of E. coli BL21 over time under 500 µg/ml Amp selection and 100 µg/ml IPTG induction in shaking LB broth. Incubated and measured at 37 °C in a TECAN 96-well plate reader. N=5.

Figure 3-4.ii: Electron micrograph of peroxidase packaging M. tuberculosis encapsulins expressed in E. coli BL21


Encapsulin associated bacterioferritin and iron-dependent peroxidase co-encoding genomes were significantly enriched for aerobic growth, and occurred exclusively in the taxonomic family Acetobacteraceae. Acetobacteraceae dependent features of GC content, proteome size, and plasmid number also showed enrichment. Due to the limited size of this co-encoding subsample and evidence of localized vertical inheritance no further significant enriched fields could be elucidated. Similar trends and limitations were found for the set of encapsulin associated hemerythrin and iron-dependent peroxidase co-encoding genomes. Specifically GC content between 65-70%, and genomes from the Nocardiaceae and Mycobacteriaceae taxonomic families were significantly enriched.

3.I.b: Niches of iron-binding encapsulin function

The ferritin-like, rubrerythrin-like, Dps-bacterioferritin-like and hemerythrin-like proteins share a subset of predicted enzymatic functions, although each family also exhibits distinct functional features. These four families are all iron-binding and have iron-sequestration potential. These iron-binding proteins showed overlapping similar trends of which some were significant, while others were too small for significance.

Hemerythrin on its own shows significant enrichment for GC content between 65-70%, and genomes in the Pseudonocardiaceae, Nocardiaceae and Mycobacteriaceae families. These genomes showed a trend toward terrestrial habitat and obligate aerobic growth. This was similar to the niches enriched for peroxidases, which was to be expected given the prevalence of co-encoding between these classes. These niches were also distinct from those of the other cargo enzyme families.


Bacterioferritin shows significant enrichment for GC content between 50-55%, motility, aerobic growth, and genomes in the Aquificaceae, Rhodospirillales, Acetobacteraceae, and Hydrogenothermaceae lineages. Further, these genomes show a trend toward above background enrichment for thermophiles. This places the bacterioferritin niche intermediate between the two extremes. This agreed with the co-encoding observations, which marked bacterioferritins as separately co-encoded with peroxidases, ferritins, and rubrerythrins. Thus encapsulin associated bacterioferritins either contributed to a distinct niche function, or expanded the aerobic mesophile and anaerobic thermophile niche functions of encapsulated peroxidases, ferritins, and rubrerythrins to a broader niche. The enrichment for motility may be informative here, as active motility is thought to aggravate oxidative stress, the common metabolic target of all six consensus CEAS motif enzymes.

Rubrerythrin shows significant enrichment for isolation from groundwater, optimal growth at 60 °C, and obligate anaerobic bacteria. These genomes were enriched for bacteria from the families Clostridiaceae, Veillonellaceae, Coriobacteriaceae, and Brachyspiraceae.

Rubrerythrin encapsulin systems were encoded in operons involving diverse carbohydrate, and modified carbohydrate metabolisms. These include Clostridial short chain dehydrogenase/oxidoreductases, 3-oxoacyl-(acyl-carrier-protein) synthase III, and amidophosphoribosyltransferase, which were also associated to Mycobacterial peroxidase encapsulins. Thus rubrerythrin encapsulins may specialize in carbohydrate redox biochemistry, and overlap with encapsulated peroxidases for some of these functions. These reactions involved oxidation susceptible metal coordinating enzymes, suggesting a link to rubrerythrin iron metabolism.

The largest of the individual iron-binding families, ferritin-like proteins, showed significant enrichment for fresh water isolation, hyperthermophilic temperature tolerance, obligate anaerobic growth, non-pathogens, and genomes in the Synergistaceae, Thermotogaceae, Myxococcaceae, Halanaerobiaceae and Frankiaceae lineages. Ferritin-like proteins were also significantly depleted for mesophilic temperature tolerance, facultative aerobic growth and genomes in the Enterobacteriaceae lineage. This niche is very different from that enriched for peroxidases or hemerythrin-like proteins above, supporting different niches for subsets of the different cargo families.

Among these ferritin-like protein/encapsulin pairs was a group of bacteria from the order Myxococcales. In a type strain from this group, Myxococcus xanthus, knockout of the gene I predicted to encode an encapsulin shell results in a mutant defective in fruiting body formation 38.

These organisms encode three dissimilar CEAS motif enzymes: the downstream adjacently encoded ferritin-like protein, a separately encoded 41% similar ferritin-like protein with a nearly identical CEAS motif, and a separately encoded bacterioferritin-like protein with a nearly identical CEAS motif and nominal sequence similarity in the enzymatic region. A recent study has confirmed that all three of these proteins are packaged in vivo 22. This confirmed my prediction that this is a multi-target encapsulin system, explaining why the adjacent ferritin-like proteins are non-essential to fruiting body formation, while the encapsulin shell gene itself is essential. Loss of one of these ferritin-like proteins does not abolish fruiting, because the other ferritin-like protein and the bacterioferritin-like protein continue to be encapsulated and functional.

Given the low diversity of these three CEAS motif proteins, one of the non-adjacent proteins likely contributes to the fruiting body function, while the adjacent ferritin-like protein is retained to fulfill a different role. Alternatively, the two ferritin-like proteins might function redundantly despite the low degree of similarity, and thus be required only under extreme excess iron or peroxidase.

Three genes encoded upstream of and adjacent to the encapsulin gene were also implicated in fruiting body formation 38. These were annotated as an NtrC-type response regulator ATPase (MXAN3555), a predicted Zn-dependent metalloprotease (MXAN3554), and a conserved hypothetical protein of unknown function (MXAN3553). The metalloprotease and encapsulin were required for fruiting body formation, while deletion of the ATPase or hypothetical protein delayed fruiting body formation but did not abolish it 38. Deletion of other genes had no effect on compartment formation and stability. Significantly enriched, unrelated metalloprotease and regulator ATPase families were observed near many other encapsulins, supporting coupling of these regulatory mechanisms with diverse encapsulin functions.

Another group of ferritin-like protein/encapsulin pairs, encoded in diverse Actinobacteria, Firmicutes, Fusobacteria, Proteobacteria, and Bacteroidetes, was associated with nitrogen metabolism, specifically involving urea, ammonium and nitrite. The enzymes implicated were urate oxidases, four families of urea ABC transporter components, a Bacillus sp. 1NLA3E ammonium transporter, a Tistrella mobilis KA081020-065 nitrate ABC transporter substrate-binding component, several predicted (2Fe-2S)-binding NAD(P)H-nitrite reductases and diverse nitroreductases. These cover a broad range of reactive nitrogen pathway components. This strain of Bacillus was isolated from nitrate and heavy metal contaminated subsurface, providing a possible pressure for encapsulin dependent resistance.

Taken together, the co-encoded encapsulated iron-binding enzymes (ferritin-like, rubrerythrin-like, Dps-bacterioferritin-like) and radical SAM oxidoreductases were significantly enriched in specialized or aquatic, anaerobic, thermophilic, moderate GC bacteria, with significant depletion for pathogens compared to background. This was distinctly different from the features observed for peroxidase associated encapsulins, suggesting a distinctive functional niche. Pooling the iron-binding protein families increased noise somewhat, but yielded sufficient sample size to evaluate shared trends. Relevant interdependencies are outlined in Table 3-5.


Table 3-5: Statistical associations and dependencies between significantly enriched properties of genomes encoding encapsulin associated iron-binding proteins (ferritin, rubrerythrin, bacterioferritin, and hemerythrin).

Enriched field: value   Significantly associated/disassociated fields             Total association (GKtau)   Δ tau      P-value
GC content:             Plasmid number [background]                               0.0250                      0.0193     1.28e-05
                        Taxonomic phylum [Thermotogae and Synergistetes] *        0.0836                      0.061247   2.47e-11
                        Taxonomic class [Clostridia and Thermotogae] *            0.176                       0.12984    3.36e-29
                        Taxonomic order [Rhodospirillales and Thermotogales] *    0.277                       0.193564   1.32e-29
                        Taxonomic family [Acetobacteraceae] *                     0.319                       0.201288   1.23e-08
Habitat                 Plasmid number [


Enrichment for specialized or aquatic habitats showed no significant increase in association with other features. This supports these encapsulins functioning in response to external environmental factors, aggravated by but not fully dependent on temperature, oxygen, motility or gross cell wall structure. It also indicates that the relationship between these encapsulins and non-pathogenesis was not due to habitat.
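The total association values in Table 3-5 are labelled GKtau. Assuming this denotes Goodman and Kruskal's tau, an asymmetric measure of how much knowing one categorical field reduces the error in predicting another, the quantity can be sketched as follows. The function name and the toy inputs are hypothetical and are not taken from the thesis pipeline.

```python
from collections import Counter

def goodman_kruskal_tau(x, y):
    """Goodman-Kruskal tau(x -> y): the fraction of the variability in the
    categorical field y that is explained by the categorical field x.
    Illustrative helper; the name and interface are assumptions."""
    n = len(x)
    x_counts = Counter(x)
    y_counts = Counter(y)
    joint = Counter(zip(x, y))
    # Gini variability of y on its own.
    v_y = 1.0 - sum((c / n) ** 2 for c in y_counts.values())
    if v_y == 0.0:
        return 0.0  # y is constant, so there is nothing to explain
    # Expected variability of y remaining once x is known.
    v_y_given_x = 1.0 - sum(c * c / (x_counts[xv] * n)
                            for (xv, _), c in joint.items())
    return (v_y - v_y_given_x) / v_y

# Toy example: taxonomic family fully determines oxygen requirement here,
# but oxygen requirement only partly determines family.
family = ["A", "B", "C", "D"]
oxygen = ["anaerobe", "anaerobe", "aerobe", "aerobe"]
forward = goodman_kruskal_tau(family, oxygen)
backward = goodman_kruskal_tau(oxygen, family)
```

Because the measure is directional, tau(family -> oxygen) and tau(oxygen -> family) generally differ, which is one reason to report associations field by field.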

Thermophilic and hyperthermophilic temperature tolerance showed large significant decreases in association with gram staining, motility, oxygen requirement/tolerance, habitat, and pathogenicity. Optimal growth at 60 °C was significantly enriched, with non-significant secondary maxima at higher temperatures. Decreases in association indicate that encapsulin-encoding organisms are enriched for growth at high temperature with a greater physiological diversity than otherwise exists among these extremophiles. Thus encapsulation of these enzymes is predicted to enable growth at higher temperatures without as stringent a need for other high temperature adaptations. This suggests these encapsulin systems improve stability or fidelity of temperature sensitive metabolisms, or protect from higher enthalpy off-target reactions.

Obligate anaerobic growth was significantly enriched without an increase in association above 15% with any other feature. A 35% decrease in association with pathogenicity was found, suggesting these encapsulins do not affect virulence. This supports anaerobic metabolism as significantly involved in a functional role of these encapsulins, tying in with high temperature tolerance. As the previously proposed roles for encapsulation of ferritin-like proteins focused on oxygen toxicity, enrichment for obligate anaerobic growth suggests alternative functions. Encapsulins in these systems cannot be selectively justified solely as protection against oxygen, as these organisms cannot grow with oxygen despite encoding encapsulins.


Non-disease classification was enriched in ferritin-like protein encapsulating genomes without strong association to other factors. This suggests, unlike encapsulated peroxidases, some of the functional advantages of iron-sequestering encapsulin systems are not selected for in pathogens.

Pathogenic niches offer a number of advantages and lack a number of risks associated with independent lifestyles. These encapsulins would thus be predicted to either provide a function pathogens can parasitize from their hosts, or protection from an environmental threat not prevalent in hosts. With the independent enrichment for thermophiles and obligate anaerobes, the latter seems more reasonable. Few studied hosts maintain thermophilic conditions even when feverish, and offer radically different anaerobic niches than exist elsewhere.

The above feature associations and disassociations (Table 3-5) suggest the functions of these encapsulins are more essential at high temperatures, involve anaerobic metabolism, and occur in aquatic or specialized environments, which is a distinct niche from that of genomes encapsulating peroxidases. This suggests functions in restricting temperature sensitive off-target reactions and preserving stability, consistent with the predicted functions of flanking gene families. Predicted functions included protecting reactive iron-sulphur clusters, restricting arsenate-like metal toxicities, and sequestering potentially toxic environmental reactive oxygen or nitrogen products. Unlike the peroxidase encapsulins, the functions involved in these compartments are suggested either not to promote pathogenesis, or not to be advantageous for pathogens.

Specific taxonomic phyla, classes, orders and families were significantly enriched: the phyla Aquificae, Synergistetes and Thermotogae; the classes Clostridia, Synergistia, Delta-Proteobacteria, Aquificae, and Thermotogae; the orders Rhodospirillales, Synergistales, Myxococcales, Aquificales, Cystobacterineae, and Thermotogales; and the families Acetobacteraceae, Synergistaceae, Myxococcaceae, Thermotogaceae, and Halanaerobiaceae. Given the niche, it is unsurprising these taxa comprise many specialized, extremophilic, aquatic organisms colonizing otherwise toxic environments.

Conversely, two lineages were significantly depleted: the Bacilli and Gamma-Proteobacteria classes, the Lactobacillales and Enterobacteriales orders, and the families Enterobacteriaceae and Streptococcaceae. These two families were significantly depleted for both iron-binding and peroxidase encapsulins, suggesting either nominal exposure to encapsulin encoding organisms for horizontal transfer, or the absence of selective pressure to utilize encapsulins. Both families are commensal to the mammalian gut, which was significantly depleted for organisms encoding classical encapsulins, limiting opportunity for horizontal exchange. The healthy mammalian gut is also anaerobic, low in oxidative stress, nutrient rich and low in toxic metals. As such, most of the proposed functions for encapsulins would be of nominal advantage in this niche.

3.1.c Overarching features of encapsulin encoding organisms

A few features were enriched independent of CEAS enzyme type. Overall, molybdopterin/molybdate/molybdenum metabolism proteins were enriched near peroxidase, hemerythrin, and ferritin-like protein associated encapsulins. A family of amino acid/urea/molybdate ABC transporters shared across Archaea, Proteobacteria and Actinobacteria was the most widely distributed of these functions. This transporter subfamily was shared between all three of these encapsulin associated enzyme families. A family of TOBE-like molybdenum-pterin binding proteins was similarly co-encoded with ferritin-like protein and peroxidase associated encapsulins in several divergent Proteobacteria species. Seven other molybdenum related activities were also encoded in proximity 30.


Also prominent across multiple CEAS enzyme associated families were transporters and enzymes involved in arsenic toxicity. These included examples of the complete canonical arsenic resistance pathway: phosphate/arsenate ABC transporters, arsenate reductases [ArsC], glutaredoxin, arsenical export pumps [ArsA, ArsB], and arsenic response transcriptional regulators [ArsR] (hypergeometric enrichment test P = 1.85E-8) 52. These regions also encode various enzymes confirmed or predicted to be inhibited by arsenic, either through replacement of phosphate and/or interaction with thiol. Examples include diverse pyruvate dehydrogenases, 6-phosphogluconate dehydrogenase, FAD-dependent oxidoreductases, pyridoxamine 5'-phosphate oxidases, and radical SAM oxidoreductases, among others. In total, 18 families of arsenic resistance/susceptibility genes were observed near encapsulins from five cargo enzyme classes and four phyla: Actinobacteria, Firmicutes, Proteobacteria, and Synergistetes. Many of these were widely conserved; some were singular instances. No single enzyme family was conserved in all cases, suggesting encapsulin facilitated arsenic resistance is not specialized to a particular susceptible pathway. Arsenic is linked to increased reactive oxygen production 52, which would be mitigated by the known activities of the CEAS enzyme families.
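The hypergeometric enrichment P-values quoted in this chapter ask how likely it is to see at least the observed number of genes from a family near encapsulins if genes were drawn at random from the background. A minimal sketch of that upper-tail calculation is given here; the function name and the toy counts are hypothetical, and the actual population and sample sizes behind the reported P-values are not stated in this section.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, observed):
    """Upper-tail hypergeometric P-value: the probability of drawing at
    least `observed` annotated genes when `sample` genes are drawn without
    replacement from `population` genes, of which `annotated` carry the
    annotation. Illustrative helper; the interface is an assumption."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(observed, upper + 1))
    return tail / total

# Toy numbers only: 40 of 5000 background genes belong to a family, and
# 6 of the 100 genes flanking encapsulins belong to that family.
p_toy = hypergeom_enrichment_p(population=5000, annotated=40,
                               sample=100, observed=6)
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to rounding before the final division.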

While only peroxidases showed a significant enrichment for rod-shaped bacteria, and both the ferritin-like protein and peroxidase subsets showed significant depletion for cocci, all the subsets tested showed trends in those directions. As such, the overall set was significantly enriched in rod-shaped cell morphology, and significantly depleted for spherical bacteria. As discussed above, cell shape was statistically associated with a number of other features, but this enrichment supports a selective advantage to encapsulins in rods that was not as prevalent in spheres. Despite the subset-specific significant enrichments for obligate aerobic and obligate anaerobic oxygen tolerances, the overall set shows a significant depletion for facultative anaerobic bacteria. This suggests encapsulins function primarily at the two extremes of oxygen toxicity and have not been generally optimized for the oxygen-variable or microaerobic niches favoured by facultative anaerobic metabolism. The set co-encoding bacterioferritin or hemerythrin plus another CEAS enzyme shows the least depletion for these facultative anaerobes, and may include applications of encapsulin functionality in facultative niches.

Salinity tolerance, isolation source, host species, host gender, age, or health, sampling site or anatomical subsite, taxonomic classifiers, Gram stain, sporulation potential and isolation methodologies showed no trends of enrichment or depletion overall or for any tested cluster, grouping, or class. This supports the broad role of encapsulins in one or more conserved functions across a breadth of prokaryotic niches.

3.2 Encapsulin encoding genomic regions exhibit distinctly different nucleotide compositions relative to surrounding genomes

Given the diversity of encapsulin-encoding organisms and sequence similarity to phage, I investigated the role of horizontal transfer in encapsulin origins. Inheritance leaves consistent signatures depending on whether it occurs vertically or through horizontal transfer 64. Codon usage and GC content establish a pattern of dinucleotide frequencies specific to each lineage at a given time. Over time these signatures drift, such that closely related strains will have similar but not identical signatures. Horizontal transfer places genes into genomes with different signatures. This feature can be used to determine how encapsulins were distributed and infer where they came from, directly addressing the hypotheses of ancient pre-prokaryotic origin, or more recent origin and dissemination. Comparison of mononucleotide and dinucleotide frequencies showed notable differences between most encapsulin genes and the genomes currently encoding them (Table 3-6), based on simple χ² tests and the more robust Karlin δ*-difference metric for double stranded DNA 64. Based on these distance measures, 84% of encapsulins were predicted to have been transmitted to modern genomes through horizontal transfer between different prokaryotic families (Table 3-6, Figure 3-5), including several transfers between Archaea and Eubacteria. Among these 84% were four of the five previously published encapsulins. This undercuts the hypothesis that encapsulins were all vertically inherited from a hyper-ancient common ancestor. Instead, a hypothesis that horizontal transfer mechanisms, such as phage transfection, contributed to the modern encapsulin distribution better fits these new data.
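For orientation, the Karlin δ* difference can be sketched as below. This assumes the commonly used definition: dinucleotide relative abundances ρ*(XY) = f*(XY) / (f*(X) f*(Y)), with frequencies pooled over a sequence and its reverse complement, and δ* taken as the mean absolute difference in ρ* over the 16 dinucleotides. The function names are hypothetical, and details of the thesis implementation (such as handling of ambiguous bases) are not stated here.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def rho_star(seq):
    """Strand-symmetrized dinucleotide relative abundances for one sequence.
    Assumes the sequence contains at least one unambiguous dinucleotide."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement):
        mono.update(base for base in strand if base in "ACGT")
        di.update(a + b for a, b in zip(strand, strand[1:])
                  if a in "ACGT" and b in "ACGT")
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for a, b in product("ACGT", repeat=2):
        expected = (mono[a] / n_mono) * (mono[b] / n_mono)
        # A base that is absent gives an undefined ratio; report 0 instead.
        rho[a + b] = (di[a + b] / n_di) / expected if expected else 0.0
    return rho

def karlin_delta(seq_a, seq_b):
    """Mean absolute difference in rho* across all 16 dinucleotides."""
    rho_a, rho_b = rho_star(seq_a), rho_star(seq_b)
    return sum(abs(rho_a[key] - rho_b[key]) for key in rho_a) / 16
```

In use, `karlin_delta(gene_sequence, genome_sequence)` would give a gene-to-genome distance of the kind tabulated in this section, with larger values indicating a signature less like that of the host genome.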

Table 3-6: Categorized distribution of Karlin distances between encapsulin genes and current genomes

Karlin class         Lineage transfer distance   δ* range                  Gene count (% of total)
very similar         same species                0.0316 <= δ* <= 0.0434    4 (1.0%)
moderately similar   same genus/family           0.0511 <= δ* <= 0.0847    67 (16%)
weakly similar       different family            0.0851 <= δ* <= 0.120     140 (33%)
distantly similar    different order/class       0.120 <= δ* <= 0.145      106 (25%)
distant              different phyla             0.146 <= δ* <= 0.180      88 (21%)
very distant         different kingdom           0.180 <= δ* <= 0.228      24 (5.6%)

Karlin distance metrics were calculated by comparing each predicted encapsulin gene to the overall nucleotide signature of the chromosome or genome in which it currently occurs. The density and frequency of each Karlin distance class were calculated to generate the table above.


Figure 3-5: Encapsulin shell and cargo enzymes show strong signal of horizontal transfer into current genomes


Unexpectedly, the population of DNA polymerase III subunits showed evidence of horizontal transfer based on Karlin distances. This indicates that these genes have been horizontally transmitted and conserved despite being coupled with distributed DNA replication machinery, and despite being used as a validated negative control in other, less quantitative horizontal transfer studies 55,67. This suggests horizontal transfer over the scope of prokaryotic history has been more tolerated and influential than may be widely appreciated.

Three of the four encapsulins 'very similar' to their current genomes occur in strains of Burkholderia multivorans, and the fourth in Burkholderia sp. TJI49. These encapsulins are expected either to have arisen de novo in an ancestral Burkholderia, or to have been transferred from a lineage with very similar genomic features, such as an ancestor of one of the families of modern Burkholderia phages. Comparison between these four encapsulins and genomes shows the B. multivorans strains to be 'very similar' relative to each other and 'distantly similar' to Burkholderia sp. TJI49 (Table 3-7, Table 3-8). Based on the pattern of statistical relationships between the three B. multivorans genes and genomes, B. multivorans CGD2 was closest to the strain originating these encapsulins, with the other two strains transferring or inheriting it from that ancestor. These patterns suggest the Burkholderia sp. TJI49 encapsulin, while very similar to the multivorans group, entered separately from a similar source rather than through transfer from the B. multivorans lineage directly. Phylogenetic methods also predicted this relationship between these encapsulins (Figure 3-6).


Table 3-7: Statistical comparison of dinucleotide frequencies between four predicted Burkholderia encapsulins and the four genomes in which they are currently encoded

gene vs. genome:                 B. multivorans    B. multivorans    B. multivorans    Burkholderia
Karlin δ* (χ² test P-value)      CGD1              CGD2              CGD2M             sp. TJI49
Burkholderia multivorans CGD1    0.0434 (1.00)     0.0338 (1.00)     0.0338 (0.994)    0.141 (0.960)
Burkholderia multivorans CGD2    0.0372 (1.00)     0.0352 (1.00)     0.0352 (0.999)    0.146 (0.914)
Burkholderia multivorans CGD2M   0.0452 (1.00)     0.0395 (1.00)     0.0395 (0.999)    0.143 (0.914)
Burkholderia sp. TJI49           0.119 (0.894)     0.129 (0.863)     0.129 (0.926)     0.0316 (1.00)

Table 3-8: Statistical comparison of dinucleotide frequencies between four Burkholderia genomes predicted to encode four encapsulins

genome vs. genome:               B. multivorans    B. multivorans    B. multivorans    Burkholderia
Karlin δ* (χ² test P-value)      CGD1              CGD2              CGD2M             sp. TJI49
Burkholderia multivorans CGD1    -                 0.0213 (1.00)     0.0150 (1.00)     0.127 (0.860)
Burkholderia multivorans CGD2    0.0213 (1.00)     -                 0.0313 (1.00)     0.130 (0.871)
Burkholderia multivorans CGD2M   0.0150 (1.00)     0.0313 (1.00)     -                 0.123 (0.814)
Burkholderia sp. TJI49           0.127 (0.860)     0.130 (0.871)     0.123 (0.814)     -


Figure 3-6: Neighbour Joining Tree comparing select Burkholderia encapsulins and Burkholderia phage capsids

Lineages of the same colour are predicted to be more closely related than groups of different colour. A red line is used to highlight the split between groups. The central wheat yellow group contains four predicted Burkholderia encapsulins, and the four Burkholderia phage capsids which are predicted as closest relatives. The other coloured groups are more distantly related phage capsids.


Relative to sequenced Burkholderia phages, neighbour joining on a standard BLOSUM62 matrix assigned these encapsulins into a parsimonious lineage, more closely related to three phage capsids than those capsids were to the other Burkholderia phage. Amino acid similarity persists far longer than nucleotide similarity, such that Karlin analysis, being more sensitive to nucleotide drift, classified these B. multivorans encapsulins as 'weakly similar' to those phage, with non-significant differences in dinucleotide frequencies (Table 3-9). The TJI49 encapsulin was least similar to the phages in dinucleotide frequencies, and 'very distant' in dinucleotide signature (Table 3-9). These similarities support the hypothesis that this family of encapsulins and the family of BcepMu-like phage capsids both arose from an ancient Burkholderia prophage capsid or encapsulin. While phage infection is a parsimonious mechanism for transmission and dissemination of this family, other methods of horizontal transfer cannot be fully discounted.

The 'very distant' difference class suggests transfers between organisms in very different lineages, such as different phyla or kingdoms. Given that encapsulins were found in both archaeal and eubacterial lineages, but not eukaryotic lineages, the more extreme difference scores likely represent transfers between these two kingdoms. One such distantly similar example was the encapsulin gene from Actinomyces oris K20 which differs notably from its surrounding genome as seen in Figure 3-7. This encapsulin is 59% identical in amino acid sequence to the B. linens encapsulins over 262 aa, which supports considering it representative of encapsulins in general. This example also highlights the difference in conservation at the nucleotide vs. amino acid level. While guanine and cytosine (GC) content is an imperfect predictor of horizontal transfer, in the context of my aforementioned data, the region of low GC in this plot, encoding the predicted targeted enzyme and encapsulin genes, may correspond to an insertion region from horizontal transfer.


Table 3-9: Standardized statistical comparison of dinucleotide frequency between four predicted Burkholderia encapsulins and the three most similar Burkholderia phage genomes

gene vs. genome:                 Burkholderia phage   Burkholderia phage   Burkholderia phage
Karlin δ* (χ² test P-value)      BcepMu               phiE255              KS10
Burkholderia multivorans CGD1    0.0968 (0.0538)      0.105 (0.0658)       0.122 (0.0317)
Burkholderia multivorans CGD2    0.0966 (0.0712)      0.103 (0.0864)       0.121 (0.0405)
Burkholderia multivorans CGD2M   0.0966 (0.0712)      0.103 (0.0864)       0.121 (0.0405)
Burkholderia sp. TJI49           0.179 (0.0673)       0.191 (0.0525)       0.194 (0.0146)


Figure 3-7: GC content plot of Actinomyces oris K20 encapsulin and targeted enzyme region relative to total genome

Top panel is a plot of average % GC over windows of 250 nucleotides. Middle panel highlights predicted and confirmed gene products within this region. Bottom panel highlights genes encoded within this region.
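The windowed GC profile described for the top panel can be sketched as follows. The 250 nucleotide window comes from the legend above; the use of non-overlapping windows, the function name, and the toy sequence are assumptions made for illustration only.

```python
def gc_percent_windows(seq, window=250, step=250):
    """Return (start, percent GC) for each full window along a sequence.
    Non-overlapping windows are an assumption; the thesis does not state
    the step size used for Figure 3-7."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        profile.append((start, 100.0 * gc_count / window))
    return profile

# Toy sequence: a GC-rich stretch followed by an AT-rich insertion.
toy_region = "GC" * 250 + "AT" * 250
toy_profile = gc_percent_windows(toy_region)
```

A run of consecutive windows well below the genome-wide mean, as in the toy AT-rich stretch, is the kind of pattern read in the text as a candidate insertion region.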


Most of the predicted targeted enzymes follow a similar Karlin distribution to the encapsulins (Kolmogorov-Smirnov test P > 0.05, no significant difference in distribution) when nucleotide signature was compared to their current genomes (Figure 3-5, Table 3-10). The exception was a second mode arising from the intermixing of two populations. Splitting the distributions into targeted enzymes encoded adjacent to encapsulins and those encoded separately partially resolves this bimodality. Separately encoded enzymes tend to be more similar in signature to the current genome on average, while adjacently encoded enzymes tend to be less similar to the current genomes on average (Table 3-6). This fits the expectation that separately encoded enzymes have had enough time in the current lineage to undergo internal transposition, and thus would be more acclimated to the genome signature. Conversely, the targeting signal alone may have undergone transposition or recombination, fusing to a separately transferred enzyme from the six targeted families.
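The two-sample Kolmogorov-Smirnov comparison of Karlin distance distributions can be sketched as below. The statistic D is the largest gap between the two empirical cumulative distributions. The P-value here uses a widely quoted asymptotic series approximation, which is an assumption and may differ from the exact routine used for the reported results; the function names and toy distances are hypothetical.

```python
import math
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_p_value(d, n_a, n_b):
    """Approximate two-sided P-value from the asymptotic Kolmogorov
    distribution with a small-sample correction (an approximation)."""
    n_eff = n_a * n_b / (n_a + n_b)
    root = math.sqrt(n_eff)
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.3:
        return 1.0  # the series converges poorly here and P is ~1
    series = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                 for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))

# Toy Karlin distances for two gene sets (values are illustrative only).
encapsulin_d = [0.05, 0.09, 0.11, 0.13, 0.16, 0.19]
enzyme_d = [0.06, 0.08, 0.12, 0.14, 0.15, 0.21]
d_toy = ks_statistic(encapsulin_d, enzyme_d)
p_toy_ks = ks_p_value(d_toy, len(encapsulin_d), len(enzyme_d))
```

A P-value above 0.05, as in this toy case, is read in the text as no significant difference between the two distance distributions.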

Table 3-10: Distribution of Karlin difference classes across datasets

Karlin difference class (same lineage long term <-> transferred from distant lineage)

Comparison           very similar   moderately similar   weakly similar   distantly similar   distant     very distant
encapsulin/genome    0.9% (4)       16% (67)             32.6% (140)      24.7% (106)         21% (88)    5.6% (24)
enzyme/genome        0.5% (1)       14% (28)             26% (54)         28% (57)            13% (26)    11% (28)
encapsulin/enzyme    0.5% (1)       1% (2)               8.0% (15)        12% (23)            39% (73)    39% (74)


The remaining bimodality in these enzyme distributions (Figure 3-5) corresponds to a subset of bacterially encoded enzymes significantly enriched in Archaea-like ferritin-like protein and rubrerythrin genes (hypergeometric tests P = 6.61E-4 and P = 1.96E-6 respectively), and depleted for peroxidase (hypergeometric test P = 2.01E-15). These enzymes, while maintaining the same highly conserved predicted CEAS targeting motifs, tended to be very different in nucleotide signature from both the current genomes and the genomically associated encapsulins.

Similar encapsulins are associated with targeted enzymes with distinctly different, more bacteria-like nucleotide signatures, suggesting these are xenologs, proteins of distinct lineage but advantageous function which replaced an ancestral gene 67. The protein sequences of these xenologs deviate from the phylogeny of the encapsulins, as will be discussed shortly. This same predicted xenolog group accounts for a large segment of the bimodality of the distribution of enzyme versus encapsulin gene Karlin delta distances (Figure 3-8), with the same significant enrichment for ferritin and rubrerythrin, and significant depletion for peroxidase (hypergeometric tests P = 1.37E-4, P = 4.28E-4, and P = 2.02E-11 respectively). Thus two families of Archaea derived iron-binding enzymes were predicted to have been adopted by some extremophilic bacteria, as an alternative to or potentially in place of bacterial homologs. This form of xenolog replacement is well established in extremophilic communities 67.


Figure 3-8: Co-encoded encapsulin shells and cargo enzymes as similar in nucleotide signature as functionally coupled horizontally transferred PVC operon genes


The prevalence and frequency of horizontal transfer was high, such that co-transferred enzymes and encapsulins did not necessarily share the same nucleotide signature (Figure 3-8). The comparison of predicted encapsulated enzymes relative to their encapsulins was consistent with these gene segments passing through multiple horizontal transfer events, with selective amino acid conservation promoting different rates of adaptation to the intermediate hosts. One member of a co-transferred pair may lag in converting from the signature of a prior horizontal transfer donor, or in conforming to the current recipient genome signature (Figure 3-8, Table 3-7). The distribution for separately encoded predicted targeted enzymes was similar, but with slightly greater average distance (Figure 3-8, Table 3-8). This was consistent with my hypothesis that these genes became separated from the encapsulin operon by transposition or recombination within the species. The average difference in this distribution was not significantly different from the Karlin distribution between DNA polymerase III subunits in the same genomes (Student two-sample t-test with unequal variances, P-value = 0.22), or from the Karlin distribution between PVC subunits in the same genomes (Student two-sample t-test with unequal variances, P-value = 0.76). Thus encapsulin shells and predicted targeted enzymes, both adjacent and separately encoded, were as similar in nucleotide signatures as expected for functionally coupled, co-transferred genes. The average distances of the polymerase III and PVC distributions differ significantly from each other (Student two-sample t-test with unequal variances, P-value = 9.09E-18), suggesting the encapsulin/enzyme systems were intermediate between the two states (tightly coupled essential genes vs. functionally coupled opportunistic virulence genes), while closer to the transferred PVC cases.
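The two-sample t-test with unequal variances used for these comparisons is the Welch form of the test. A minimal sketch of its statistic and approximate degrees of freedom follows; the function name and toy distances are hypothetical. Converting t and the degrees of freedom to a P-value needs the Student t distribution, for example through scipy.stats.ttest_ind with equal_var=False where SciPy is available.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and the Welch-Satterthwaite
    approximation to its degrees of freedom. Illustrative helper."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample mean.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Toy gene-to-gene Karlin distances for two groups (illustrative values).
pair_distances = [0.10, 0.14, 0.12, 0.18, 0.16]
control_distances = [0.11, 0.15, 0.13, 0.17, 0.19]
t_toy, df_toy = welch_t(pair_distances, control_distances)
```

The unequal-variance form is the safer choice when the two groups differ in size and spread, as gene sets drawn from different operons typically do.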

The same trends were seen for the genes of the phage tail-like Photorhabdus virulence cassettes (PVC). These operons conserve more than a dozen proteins, with more strongly conserved gene order and amino acid sequences, and as virulence cassettes are archetypical of horizontally transferred genes. Yet these operons yield gene to gene divergence in Karlin distance comparable to that seen between encapsulins and associated enzymes. In fact, the overall gene to gene divergence trend between PVC genes is slightly higher than between the majority of encapsulin CEAS enzyme pairs. The exception is the families of Archaea-like ferritin-like and rubrerythrin proteins, which appear to have replaced a family of eubacterial ferritin-like proteins in association with two small subgroups of encapsulins. The PVC gene to genome Karlin distances were on average in the extreme range (δ* > 0.22), in agreement with these being recently horizontally transferred genes.

Similar trends were found comparing the family of radical SAM oxidoreductases genomically linked with encapsulins (Table 3-9). Given the variability in size of these radical SAM enzymes, some of these differences relative to the encapsulins may be inflated, but on average the mononucleotide and dinucleotide frequencies of radical SAM domains were more like the genomes than the encapsulin genes (mononucleotide mean χ² Pgeno = 0.525, Penc = 0.489; dinucleotide mean χ² Pgeno = 0.583, Penc = 0.156). These radical SAM oxidoreductases were likely introduced into most of these genomes through horizontal transfer over time, but not necessarily preferentially from the same sources as the currently co-encoded encapsulins. Instead, this family may have been selectively pressured to associate with encapsulins, leading to the observed conserved motif and the significant enrichment near encapsulins.

One notable exception is the encapsulin/radical SAM oxidoreductase pair from Pyrobaculum islandicum DSM 4184, in which the radical SAM enzyme and encapsulin were more similar in signature to each other than either is to the genome. This radical SAM enzyme, YP_931416, has a 'very distant' signature from the rest of the genome and significantly different mononucleotide and dinucleotide frequencies (mononucleotide χ² P = 6.51E-11, dinucleotide χ² P = 7.57E-08). This organism's encapsulin, YP_930919, is much more similar in nucleotide frequencies to this enzyme (mononucleotide χ² P = 0.905, dinucleotide χ² P = 0.559). When compared to the rest of this genome, the encapsulin is 'weakly similar', with significantly different mononucleotide and dinucleotide frequencies (mononucleotide χ² P = 3.04E-11, dinucleotide χ² P = 1.17E-06). As such, the SAM enzyme and encapsulin were likely co-transferred into this genome, and in similar cases into other Pyrobaculum species.

The high nucleotide frequency similarities between genes, and the closer Karlin distances between these encapsulins and genomes, indicate the encapsulin conforms more closely to the P. islandicum genome signature than the radical SAM enzyme does, which suggests a difference in transfer timing or signature drift rate. The pressure to maintain encapsulin shell nucleotide signature is expected to be weaker than for the enzymatic radical SAM oxidoreductase, given the known high level of sequence variability in the capsid-like folds, and the more tightly constrained sequence and low mutation tolerance exhibited by globular enzymes. This is one explanation of the apparent difference in signature acclimation rate between these two compartment components.

Many confident lineages in the phylogenetic trees discussed previously show phylogenetic topological differences between paired components (Figure 3-9). This implies that, in addition to transfer events moving compartments and enzymes as one unit, a subset of CEAS motif enzymes were recruited to encapsulins from independent transfer events, and converted by fusion with a compatible CEAS motif.

Figure 3-9: Subsets of encapsulin shells compartmentalize cargo enzymes from non-parallel lineages, adopted from lineage swapping recombination

Topological graphs of consensus Bayesian Inference phylogenies of representative encapsulins and targeted enzymes, connected based on genomic association. The encapsulin lineage occupies the center and provides the hypothetical expected topology of descent of targeted cargo enzymes. Solid lines between graphs mark cargo proteins which differ from the expected lineage topology. These enzymes target encapsulins that are not the closest relatives of the encapsulin paired with the enzyme's closest relative. Light purple lines mark enzymes targeted to an encapsulin, with closest relatives encapsulating a different family of enzymes.

The rubrerythrin-like family provides the most striking example of this cargo swapping, in that all seven lineages represented in this phylogeny are linked to encapsulins with closest relatives associated with a different cargo enzyme class (Figure 3-9). This group is also the set affected by the major horizontal transfer event from Archaea to bacteria discussed above. The other cargo families also demonstrate predicted swapping events, both within families, as marked by the many solid blue lines, and in the one lineage which replaced the expected peroxidase partner with pairing to ferritin-like proteins (Figure 3-9). This suggests strong selective advantages to compartmentalizing these enzymes, despite the expected temporary fitness cost of adapting genes from disparate lineages. The precise biochemical mechanisms providing these advantages can be inferred based on my earlier findings in this chapter, but remain to be experimentally validated.

3.3 Encapsulin shells and associated enzymes predicted to derive from the same and parallel phylogenetic paths with intermittent exchange

In light of the high sequence diversity and strong evidence of horizontal transfer among encapsulin shell and cargo proteins, addressing questions of ancestral relationships between homologs, and the origins of the targeting CEAS motifs, requires robust techniques. Bayesian phylogenetic methods were applied to elucidate how the histories of the major CEAS encoding enzyme families relate to each other, to homologs without CEAS motifs, and to the encapsulins. Bayesian inference analysis predicts the four types of association (fusion, adjacent encoded CEAS, separately encoded CEAS, and no CEAS motif) occur intermixed non-parsimoniously among the five enzyme classes examined, and among the encapsulins (Figures 3-13, 3-14, 3-15, 3-16, 3-17, 3-18). For four of the five enzyme families the root ancestral state was predicted to lack the CEAS motif (posterior probability [pP] = 1.0). The exception was the ferritin-like lineage, for which the CEAS encoding state was predicted to be the family’s ancestral state (pP = 1.0) (Figure 3-14). Thus if the families diverge at similar rates, the CEAS motif is predicted to have first arisen in ferritin-like proteins, followed by transfer and fusion into the other enzyme families.

The diversity of organisms in both kingdoms encoding predicted encapsulated ferritin-like proteins fits with this family being sufficiently prevalent to act as such an origin for this motif.

The ferritins and rubrerythrins were more parsimoniously split than the peroxidases, with most subtrees above the root containing only one association class (adjacently or separately encoded targeted, or non-targeted homologs), though confidently assigned exceptions exist. Among iron-dependent peroxidases no encapsulin fusions were observed, but the other three association classes were widely intermixed, with motif comparison suggesting regular exchange and deletion of CEAS motifs between sub-lineages. The hemerythrins were the smallest and most parsimonious of examined enzyme classes yet still intermixed targeted and untargeted homologs in some lineages (Figure 3-16). The predicted best-fit phylogenies of bacterioferritin and radical SAM oxidoreductases also follow these trends.

Legend 3-13, 3-14, 3-15, 3-16, 3-17, 3-18: Figure legend for the following six Bayesian Inference or Bayesian Co-estimation phylogenies

Figure 3-13: Condensed Consensus Bayesian Inference phylogeny of representative rubrerythrins

Consensus Bayesian inference and co-estimation phylogeny of bacterial rubrerythrins in relation to genomic association to encapsulins. The divergent rubrerythrin, YP_004798835, from Caldicellulosiruptor lactoaceticus 6A was chosen as an outgroup. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

Figure 3-14: Condensed Consensus Bayesian Inference phylogeny of representative ferritin-like proteins

Consensus Bayesian Inference and co-estimation phylogeny of bacterial ferritin-like proteins in relation to genomic association to encapsulins. A set of three ferritin-like inorganic pyrophosphatases were used as the outgroup. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

Figure 3-15: Condensed Consensus Bayesian Inference phylogeny of representative iron-dependent peroxidases

Consensus Bayesian Inference and co-estimation phylogeny of bacterial Dyp-peroxidases in relation to genomic association to encapsulins. Three related divergent iron-dependent peroxidases from Acinetobacter serve as the outgroup. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

Figure 3-16: Condensed Consensus Bayesian Inference phylogeny of representative prokaryotic hemerythrins

Consensus Bayesian Inference and co-estimation phylogeny of bacterial hemerythrins in relation to genomic association to encapsulins. The outgroup was chosen as the divergent hemerythrin-like protein ABP87146 from Pseudomonas mendocina ymp. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

Figure 3-17: Condensed Consensus Bayesian Inference phylogeny of representative bacterioferritin-like proteins.

Consensus Bayesian Inference and co-estimation phylogeny of Dps-like bacterioferritin in relation to genomic association to encapsulins. The outgroup was chosen as the divergent bacterioferritin-like protein from Clostridium bolteae ATCC BAA-613. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

Figure 3-18: Condensed Consensus Bayesian Inference phylogeny of representative classical encapsulins

Consensus Bayesian Inference phylogeny of representative predicted classical encapsulins. The outgroup was chosen as the divergent encapsulin-like protein from Coprothermobacter proteolyticus DSM 5265. Nodes representing multiple proteins were labeled with taxa names and numbers of species represented.

The predicted ferritin-like protein consensus tree will be discussed in more detail as an example.

It centres on a single join for 40 lineages, with subtrees of variable size and composition meeting at this node (Figure 3-14). The majority of these lineages do not encode the CEAS motif, so are secondary to questions of its origins and distribution. Four sub-lineages of ferritin-like proteins co-cluster fused or adjacently encoded CEAS enzymes with unassociated enzymes, supporting a mixed history between the CEAS motif and unassociated classes (Figure 3-14).

The separately encoded CEAS ferritin-like proteins cluster as a parsimonious group, meeting at the central join. Closest to the root group were four CEAS ferritins, followed by the central join node. Along with the topologies of the mixed groups, these observations suggest CEAS ferritins were the ancestral form, but CEAS motifs were not conserved in most lineages, nor regained after loss in any lineage. Instead divergent motifs, distinct from the ancestral CEAS motif, were present in the sole lineage where a definitively unassociated derived lineage may be reverting to the CEAS motif. The predicted CEAS motifs in these examples conform to the consensus, but were divergent from other ferritins or known encapsulin CEAS enzymes.

The peroxidase phylogeny segregated clearly and confidently, intermixing encapsulin associated and non-encapsulin associated forms throughout the tree (Figure 3-15). Separately encoded and encapsulin adjacent CEAS peroxidases occurred together in confident clusters (pP = 1.0), with unassociated homologs confidently embedded among CEAS examples (pP = 0.99). The same trends were true of the rubrerythrin phylogeny. The topology of both the rubrerythrin and the peroxidase trees predicted the unassociated form without the CEAS motif was ultimately ancestral, with similar CEAS motif variants transferred in and deleted multiple times.

Streptomyces albus J1074 encoded both a CEAS and non-CEAS iron-dependent peroxidase, each related to distinct CEAS encoding horizontal transfer lineages. Similar cases occurred throughout the sub-tree containing CEAS peroxidases, demonstrating regular conversion between the CEAS encoding and unassociated forms, perfectly correlated with the presence or absence of an encapsulin. When compared to the phylogeny of encapsulins (Figure 3-18), the sub-families of encapsulins associated with these two peroxidase lineages did not match the peroxidase topology, but instead combined encapsulins and peroxidases from separate sections of their respective trees. The presence of both a CEAS and non-CEAS encoding peroxidase in the same genome suggests a positive function of compartmentalized peroxidases, rather than only a protective role. Alternatively, the two peroxidases from S. albus J1074 may be conditionally expressed to use the compartment when needed, and the free enzyme when preferable.

The bacterioferritin-like protein phylogeny segregates well, but with moderately low posterior probabilities throughout the lower levels of the optimal topology (Figure 3-17). This phylogeny exhibited trends of intermixing of CEAS and unassociated homologs, as well as adjacently and separately encoded CEAS homologs. Overall intermixing was low with a single major lineage of adjacent encoded CEAS bacterioferritins, and representatives of several smaller families of CEAS proteins embedded with moderate confidence in the other unassociated lineages. The unassociated form of bacterioferritins was confidently predicted as ancestral (pP = 1.0). The adjacently encoded CEAS bacterioferritins cluster parsimoniously along with the largest subset of separately encoded CEAS bacterioferritins. Other CEAS bacterioferritins were encoded separately from their predicted target shell, with diverged CEAS motifs, suggesting this motif entered the main lineage once, followed by separate recombination into the other lineages.

Thus the five families examined all convey similar patterns. The CEAS motif has entered and moved within these lineages through both vertical inheritance and separate horizontal transfer coupled with the presence of encapsulin shell genes. This left the question of whether patterns existed in the phylogeny of encapsulins packaging adjacently or separately encoded proteins.

Mapping the enzyme CEAS classes onto the co-encoded encapsulin phylogeny revealed intermixing of separate and adjacent encapsulins throughout the tree (Figure 3-18). All but one of the ferritin-like and rubrerythrin encapsulin fusions group together as monophyletic lineages in their respective trees. The exception was the diheme binding domain encapsulin fusion from Candidatus Kuenenia stuttgartiensis, which represents a divergent family of ferritin-like domains dissimilar from the other fusions, though still within the ferritin-like sequence family.

The three encapsulins without detectable CEAS enzymes were distributed without any notable trend. The incompleteness of some of these genome sequences and the confidence with which these proteins cluster with CEAS encoding homologs (0.97 ≤ pP ≤ 1.0) suggest CEAS enzymes do, or until relatively recently did, exist for these encapsulins, but were not detected due to incomplete sequencing, nucleotide deletion/mutation, or divergence in excess of the similarity search cutoffs. The Coprothermobacter proteolyticus DSM 5265 encapsulin/rubrerythrin system appears to be an example of just such a deletion. No genes in the vicinity of the encapsulin exhibit motifs matching the CEAS motif pattern. Elsewhere in the genome, a short 102 aa rubrerythrin-like protein was encoded, 73% identical from position 12 to 101 to the CEAS rubrerythrin from Methanolinea tarda NOBI-1, but the sequence stops short of where the CEAS motif starts in homologs.

Thus the distribution of CEAS motif dependent targeting is best explained as the product of short range vertical inheritance, and two partially independent levels of horizontal transfer transmitting 1) shells and enzymes, and 2) CEAS motifs. Many CEAS motifs show evidence of transfer together with both shell and enzyme, while others were transferred separately from one or the other of the components.

3. Discussion

In summary, these targeted enzyme families and associated accessory genes showed enrichments for distinct physiological niches. Peroxidase and hemerythrin associated encapsulins were enriched for reactive oxygen resistance in molybdenum dependent biochemistry in non-aquatic, aerobic, mesophilic, rod-shaped and large genomed bacteria, and invasive lung pathogens. Ferritin-like protein, rubrerythrin-like protein and radical SAM oxidoreductase associated encapsulins were enriched for reactive nitrogen and polyvalent cation resistance biochemistry, mainly heavy metal ions, in specialized aquatic, anaerobic, thermophilic, medium genomed prokaryotes, with a significant enrichment for non-pathogens.

Bacterioferritin associated encapsulins were enriched for iron/copper tolerance, molybdenum protection, and reactive nitrogen resistance in facultative aerobes and motile aerobic bacteria.

This suggests prokaryotes have adopted encapsulin based compartmentalization for a diverse set of moderately related yet distinct niche functions.

These encapsulin shells and cargo enzymes show strong signals of horizontal transfer. Nearly all cargo enzymes and shell genes show nucleotide signature similarities consistent with co-transfer, with signals indicative of multiple sequential horizontal transfer events. Phylogeny and nucleotide signature distributions both predict a subset of encapsulins combine shell genes and cargo enzymes from distinct lineages. The most notable of these is a family of Archaea-like rubrerythrin-like genes replacing bacterial ferritin-like cargo proteins. This sort of component swapping is common among phage genomes and to a lesser extent among extremophilic prokaryotic systems.

Phylogeny also predicts several CEAS motifs have been transferred and fused to new lineages of cargo enzymes, through recombination from transferred encapsulins. Thus shell genes, cargo enzymes, and the CEAS motifs which facilitate interaction of the former two all occur with distributions and genomic architectures indicative of extensive horizontal transfer.

Preliminary experiments support the idea that encapsulation of peroxidases protects bacteria from the toxic effects of these enzymes, which would otherwise retard growth rates. Further experiments are required to definitively determine what property of these peroxidases causes this growth defect and how encapsulins prevent this toxicity. Co-expressing a truncated form of this enzyme, lacking the C-terminal CEAS targeting motif, alongside the wildtype encapsulin shell would confirm or refute the hypothesis that encapsulation is guided by this motif. Expression of a mutant of this peroxidase with the predicted enzymatic active site replaced by inactive residues would confirm or refute the role of peroxidase activity in compartment function and toxicity.

Thus overall encapsulins serve diverse roles in multiple distinct niches of ecology, morphology, and phenotype, and are predicted to utilize distinct subsets of enzymes in distinct contexts.

Chapter 4: Families of distinct novel prokaryotic capsid-like compartments

4. Chapter Abstract:

Surrounding the diversity of classical encapsulins lies the larger diversity of capsid-like proteins. Investigating this larger sequence space of capsid-like folds revealed 1060 predicted novel encapsulins in 24 divergent capsid-like compartment families, distinctive in context from prophage. These predicted novel encapsulins were highly diverse, comparable to the breadth of diversity observed in Caudovirales capsids, though not the density of capsids. One family was very large with 986 proteins, and distributed across 15 bacterial phyla. Within bacterial taxa this distribution exceeds the diversity of even the classical encapsulins. The remaining predicted novel encapsulin families were small, with between 11 and 30 members, and occurred in restricted lineages similar to most other prokaryotic compartments, with the exception of classical encapsulins. An additional exception was a family of encapsulin-like predicted inorganic pyrophosphatase compartments which occurred in multiple phyla.

Clustering and enrichment analyses similar to those used for the expanded classical encapsulin set were applied to the novel predicted encapsulins to predict compartment functions. The large novel encapsulin family of 986 proteins subdivides into two distinct but closely related forms of compartment, based on this clustering and predicted function. The larger subfamily of 644 proteins conserves significantly enriched genomic association with four genes involved in cyanide resistance, iron-sulfur cluster formation, and thiol production.

Together these enzymes are predicted to maintain a stress and toxin resistant source/sink for sulfur, similar to one of the functions of ferritin in iron metabolism. The second subfamily of 342 of these compartments occurs as non-identical tandem compartment paralog pairs, with amino-terminally fused CRP/FNR-sensor-like small oxidant binding domains. This subfamily is predicted to sequester CO, O2, NO, cyanide, and toxic divalent heavy metals. The two subfamilies share significantly similar sequence in the predicted shell fold, despite the differences in fusion domains and genomic contexts. The smaller families are predicted to fulfill diverse functions in resistance to toxic carbon, phosphate, and heavy metal interactions.

4. Introduction:

The diversity of encapsulins related to the five published examples establishes a large family of diverse capsid-like compartments, as discussed in the preceding two chapters. Yet in the scope of sequence diversity seen in capsids, this family is comparatively conservative. Thus it is reasonable to investigate if more divergent compartment families exist related to more divergent capsids, in a manner similar to the relationship observed between Listeria phage B054 and the Thermotoga maritima encapsulin. As I determined above, this encapsulin and phage capsid share 41% sequence similarity, relatively high in the context of phage capsids, which often conserve no detectable similarity despite strong structural conservation.

The full spectrum of phage-like operons in prokaryotes has also never been fully addressed.

Prophage encoded effectors and phage-similar prokaryotic structural components, such as encapsulins, certain bacteriocins, and secretion systems are known to play important roles in select systems 15. However, the broad effects of these systems on a kingdom level have not been elucidated. These operons were originally identified by non-systematic methods, thus may not represent the full population of phage-like or phage derived prokaryotic operons. Systematic approaches, based on sequence similarity and genomic context analysis, will address these questions and provide a more definitive description of the true population of prophage and phage-like operons.

Characterization of such novel capsid-like compartments may suggest more diverse encapsulin functions, and highlight the limitations and constraints under which encapsulins are beneficial.

The sizes and types of novel encapsulin families identified in this way may shed light on the relationship between encapsulins and phage capsids. While many proteins have been annotated based on partial hidden Markov model matches and sequence similarity matches, a systematic approach looking at encapsulin or capsid-like families may reveal larger novel families not previously properly annotated as compartments.

Compartmentalization confers advantages for diverse prokaryotic metabolisms 1,5,6,9. Even the classical encapsulins compartmentalize multiple distinct enzymes. Thus novel encapsulins may or may not function in the same pathways as classical encapsulins. The classical encapsulins are involved with multiple variations on iron metabolism (Chapter 3), while other compartments serve distinct roles in carbon metabolism 1,5,6,9. Novel encapsulins may function in entirely different niches, or fulfill variations on these roles.

The case of beta-carboxysomes 7 and the classical encapsulins confirms genes encoded separately from predicted novel encapsulins may contribute to novel encapsulin dependent metabolism. However such separately encoded proteins cannot be easily computationally identified on enrichment alone. Instead distinctive motifs, such as the canonical conserved classical encapsulin association specific (CEAS) motifs, would be a more effective predictor of such separately encoded possible targeted encapsulins. Analysis of enzymes encoded adjacent to predicted novel encapsulins, relative to similar proteins encoded in genomes that lack predicted encapsulins, may reveal novel motifs specific to association in these novel encapsulin families.

These association specific motifs will be statistically supported, potential protein interaction sites between these novel compartment operon components. Sequence conservation may lend further predictions as to the sites and residues involved in such predicted interactions. This will reveal if similar mechanisms are at play across capsid-like compartments in general or if the classical CEAS targeting motif is unique to that family of encapsulins.

4. Methods:

4.1 Prediction of novel capsid-like compartment families distinct from classical encapsulins

Sequence similarity and genome context were used to identify a more complete set of isolated phage capsid-like proteins, preliminary to identifying novel sequence families of predicted encapsulins. First, it was essential to define a set of high confidence representative phage capsids to use as queries for capsid-like proteins in bacteria. Second, it was important to distinguish encapsulins and prophage capsids, as both encode the same structure and relatively similar sequence, yet represent distinct functions.

The identification of a confident capsid query set was non-trivial in light of intrinsic issues in the field of phage proteome annotation. Due to the long history of phage typing and annotation methods, the existing experimentally validated annotations were sporadic and biased toward small tractable individuals in various inconsistently supported families. The majority of phage capsid genes, if annotated at all, depended on potentially spurious heuristic assays appropriate to the time of identification, but no longer sufficient in the era of genomics. For example, during the early era of phage identification, purified phages were routinely run on SDS-PAGE gels and the densest band visible was annotated as the capsid protein. In truth the capsid protein is often the most prevalent protein in phage, but not always. In many phages, such as Lactococcus phage P008, the major tail protein is comparable in prevalence and approximate length to the true capsid protein, causing false annotation. To resolve these issues systematic BLAST-based sequence similarity analysis was performed on the set of all sequenced Caudovirales genomes available as of May 5th, 2011 on the NCBI database. The results of these searches were combined to define sequence similarity relationships across all the proteins in that viral order.

Based on these relationships, functional annotation could be mapped outward from the experimentally well validated phages to the larger population of phage genomes. By traversing a virtual graph linking similar sequences based on BLAST result parameters (E-value, hit length, and percent identity and similarity), it was possible to confidently extend annotation of experimentally validated proteins to previously non-confidently annotated proteins connected by similarity to intervening proteins. I term this method transitive homology annotation (THA), as it leverages the graph theory concept of the transitive closure 25.
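The graph traversal behind this idea can be sketched as follows. This is a minimal illustration only: the hit tuple layout, the threshold values, and the function name are assumptions introduced here, not the parameters or code used for THA in this work. Annotations spread breadth-first from experimentally validated seed proteins across similarity edges that pass the cutoffs, and the number of intervening steps is recorded so that long chains can be treated with more caution.

```python
from collections import deque

def transitive_annotation(hits, seeds, max_evalue=1e-5,
                          min_identity=30.0, min_coverage=0.5):
    """Propagate seed annotations across a BLAST similarity graph.

    hits:  iterable of (query, subject, evalue, percent_identity, coverage)
    seeds: dict mapping validated protein ids to their function
    All thresholds are illustrative placeholders.
    """
    # Build an undirected graph from hits that pass every cutoff.
    graph = {}
    for query, subject, evalue, identity, coverage in hits:
        if (evalue <= max_evalue and identity >= min_identity
                and coverage >= min_coverage):
            graph.setdefault(query, set()).add(subject)
            graph.setdefault(subject, set()).add(query)

    annotation = dict(seeds)
    steps = {protein: 0 for protein in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in annotation:  # first (shortest) path wins
                annotation[neighbour] = annotation[current]
                steps[neighbour] = steps[current] + 1
                queue.append(neighbour)
    return annotation, steps
```

Because the search is breadth-first, each protein inherits the annotation of the seed it reaches in the fewest steps; a protein equally close to seeds with conflicting functions would need an explicit tie-breaking rule, which this sketch does not attempt.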

With regard to tailed bacteriophages, this method’s accuracy can be improved by incorporating gene synteny information. Diverse tailed phages retain the same order of structural genes, encoded in roughly the order of expression and assembly. From this background, false hits will stand out as misplaced, while comparably weak true hits occurring in the expected context can be recognized. I term this method synteny informed transitive homology annotation (SITHA).
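A synteny check of this kind might look like the sketch below. The gene order list is a simplified illustration of a typical tailed phage structural module, and both it and the function name are assumptions made for this example; the actual SITHA rules are not reproduced here. A candidate annotation is accepted when the annotated genes on either side of it appear in the expected relative order, on either strand.

```python
# Simplified, illustrative order of structural genes in a tailed phage.
EXPECTED_ORDER = ["terminase", "portal", "protease", "scaffold", "capsid", "tail"]

def synteny_consistent(neighbourhood, candidate_index):
    """Check a candidate annotation against the order of its neighbours.

    neighbourhood:   predicted functions (or None) in genomic order
    candidate_index: position of the candidate gene in that list
    """
    rank = {func: i for i, func in enumerate(EXPECTED_ORDER)}
    candidate = neighbourhood[candidate_index]
    if candidate not in rank:
        return False

    def agrees(funcs, idx):
        for j, func in enumerate(funcs):
            # Ignore the candidate itself, unannotated genes, and same-class genes.
            if j == idx or func not in rank or rank[func] == rank[candidate]:
                continue
            # Upstream genes should rank earlier, downstream genes later.
            if (j < idx) != (rank[func] < rank[candidate]):
                return False
        return True

    forward = agrees(neighbourhood, candidate_index)
    flipped = neighbourhood[::-1]  # same region read from the other strand
    reverse = agrees(flipped, len(neighbourhood) - 1 - candidate_index)
    return forward or reverse
```

A weak capsid hit sitting between portal and tail genes passes this check, while a capsid hit embedded among tail genes does not. A candidate with no annotated neighbours passes trivially, so in practice a minimum number of supporting neighbours would also be required.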

Similar methods addressed the issue of resolving prophage capsids from encapsulins, using the contextual presence or absence of other genes with similarity to phage morphogenetic genes in proximity to capsid-similar genes. All sequenced Bacterial and Archaeal genomes were downloaded from the NCBI database on May 5th, 2011, and searched for sequence similarity to Caudovirales proteins as well as published encapsulin shells and enzymes using single iteration blastp and, in parallel, ten iteration psiblast. In both cases an upper limit of E-value less than 10.0 was applied, with lower effective cutoffs instituted based on genome context, and multiplicity of hits from phage proteins with the same function. Due to the extremely high sequence diversity of phage functional families, preliminary investigation of phage sequence similarity found this cutoff was optimal for reducing false negative rates, without suffering unrecoverable excessive false positive rates. The blastp match dataset provided moderate-coverage, higher-quality identification of proteins similar to phage proteins, while the psiblast dataset provided a higher-coverage, lower-quality survey of the prophage-likeness of genomic regions.

From these datasets, ordered lists of sequence similarity relationships were compiled and sorted to assign predicted functions to phage-like prokaryotic proteins based on sequence similarity, gene length, high scoring pair (HSP) parameters, number of HSPs, and gene order. These comprised the basis of an annotated database of phage-like proteins throughout sequenced prokaryotes. Phage-like regions were subsequently classified into subclasses based on conserved features. Predicted prophage comprised regions encoding six or more predicted open reading frames (ORFs) with similarity to phage morphogenetic genes within 20 ORFs of each other, including matches to components of both the phage head and tail operons. Regions encoding three or more ORFs with sequence similarity to phage tail structural genes within 15 ORFs of each other, while not in proximity to ORFs similar to phage head proteins, were classified as predicted tail-like bacteriocins or secretion systems. Additional sub-features allowed more precise classification of some of these into specific classes of bacteriocins or secretion systems, including pyocin-like genes, Photorhabdus-like virulence cassette genes, and type VI secretion systems. Regions containing one or more ORFs similar to phage head genes, including at least one protein with moderate similarity to either a phage major capsid protein or a published encapsulin, and not in proximity to ORFs similar to non-head phage structural components, were classified as predicted encapsulin regions. Also identified were a variety of other phage similar regions, encoding multiple regions with similarity to phage proteins, but not organized in a recognizable prophage, tail-like, or encapsulin gene content or synteny.
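The classification rules above can be written out as a short decision procedure. The sketch below is a simplification under stated assumptions: each region is passed in as a list of (ORF index, category) pairs, the category labels are invented for this example, and "in proximity" is approximated as "anywhere in the region supplied", since the text does not give a distance for that condition.

```python
HEAD_CATEGORIES = {"capsid", "head"}  # "capsid" = major capsid or encapsulin-like hit

def _dense(indices, min_count, max_span):
    """True if at least min_count indices fall within max_span ORFs of each other."""
    idx = sorted(indices)
    return any(idx[i + min_count - 1] - idx[i] <= max_span
               for i in range(len(idx) - min_count + 1))

def classify_region(hits):
    """Classify a phage-like region from its annotated hits.

    hits: list of (orf_index, category), category in {"capsid", "head", "tail"}
    """
    categories = {cat for _, cat in hits}
    all_idx = [i for i, _ in hits]
    tail_idx = [i for i, cat in hits if cat == "tail"]
    has_head = bool(categories & HEAD_CATEGORIES)
    has_tail = "tail" in categories

    # Six or more morphogenetic hits within 20 ORFs, with head and tail both present.
    if has_head and has_tail and _dense(all_idx, 6, 20):
        return "prophage"
    # Three or more tail hits within 15 ORFs and no head-like hits.
    if not has_head and _dense(tail_idx, 3, 15):
        return "tail-like"
    # A capsid- or encapsulin-like hit with no non-head structural hits.
    if "capsid" in categories and not has_tail:
        return "encapsulin"
    return "other"
```

For example, a lone capsid-like ORF with no neighbouring tail genes is returned as "encapsulin", while the same ORF inside a dense cluster of head and tail hits is returned as "prophage".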

4.2 Characterization of distribution, relatedness and protein associations among predicted encapsulins

The predicted encapsulin genes were clustered based on sequence identity to identify families of potentially related novel compartments. Three different identity cutoffs were applied sequentially using the CD-HIT method 31 to define clusters of decreasing similarity to identify interrelations between potentially historically related families that may have diverged. The chosen cutoffs were 90%, 60% and 30% sequence identity marking closely related, diverse, and divergent encapsulin sequence families.

The genes encoded around the predicted encapsulins were clustered by sequence similarity using Markov Chain clustering (MCL). These predicted protein family clusters were subsequently analyzed for enrichment in GO term ontology, sequence families, and evidence based functionality (Figure 3-1). Such evidence included published experiments, significant similarity to a protein of validated function, and matches to multiple-sequence based models of confident functions (HMMs, COG models). Protein sequence families with significantly greater than expected frequencies near predicted encapsulins relative to the frequencies of those families in the whole genomes were deemed enriched and examined further. Genes encoded up to 5, 10, or 20 ORFs from the predicted shell gene were considered for enrichment. Enrichment was evaluated using hypergeometric enrichment tests, correcting for multiple hypothesis testing using the Bonferroni correction for critical value, giving a cutoff of P-value less than 6.04E-06.
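A hypergeometric enrichment test with a Bonferroni-corrected cutoff can be sketched as below. The mapping of quantities onto the test (all genes as the population, family members as successes, genes near shell genes as the draws) is an assumption made for illustration, as are the function names. The number of tests and the uncorrected alpha are not stated in the text; a cutoff near 6.04E-06 would follow from an alpha of 0.05 spread over roughly 8,280 tests, but that is an inference.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    numerator = sum(comb(successes, i) * comb(population - successes, draws - i)
                    for i in range(k, upper + 1))
    return numerator / comb(population, draws)

def bonferroni_cutoff(alpha, n_tests):
    """Per-test critical value that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def is_enriched(k, population, successes, draws, alpha, n_tests):
    """True if the family is over-represented near encapsulin genes."""
    p_value = hypergeom_sf(k, population, successes, draws)
    return p_value < bonferroni_cutoff(alpha, n_tests)
```

Here k is the number of family members found within the chosen window around shell genes, draws is the total number of genes in those windows, successes is the number of family members across the genomes, and population is the total gene count.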

Enriched candidate encapsulin associated protein families were examined by MEME and MAST motif conservation detection algorithms 26, identifying any motifs conserved in and exclusive to enzymes genomically associated with encapsulins. Discriminative motif discovery was performed using non-redundant representative subsets of encapsulin co-encoded proteins from the families of interest as the positive set, and homologs to these proteins occurring in genomes without detectable encapsulin homologs as the negative set. The resulting predicted 'discriminative' motifs were then tested by MAST against the NCBI non-redundant database and subset sequence files containing only the relevant gene families, including all homologs from predicted encapsulin encoding and encapsulin free genomes. Those motifs not exclusive to the encapsulin co-encoded set of homologs were classed as not encapsulin specific, and excluded from further consideration. Those protein families conserving at least one encapsulin association specific motif were retained for further analysis.

A novel method was developed to predict candidate binding interfaces in predicted encapsulin families. For sufficiently large families with well supported associated enzymes, MEME analyses were applied to the predicted encapsulins in addition to the predicted targets. This provided a set of unbiased significantly conserved motifs for both the enzymes and the predicted encapsulins. Interacting motifs are conserved or diverge in a dependent relationship. Thus, without making any assumptions of mechanism or biology, conserved motifs contributing to selectively advantageous interaction between proteins will be enriched for co-occurring with each other, while those unrelated to co-functionality will be equally prevalent independent of genomic association. Thus pairs of encapsulin and enzyme motifs which co-occur significantly more frequently than expected based on background motif prevalence are potential candidates for physical interaction. Co-enrichment may also be the product of non-physical interaction due to co-functionality. This methodology was validated using the classical encapsulins and targeted enzymes, which conserve a well defined enzymatic targeting motif and encapsulin binding surface as discussed above (Chapter 2.II.c).
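The co-occurrence test can be sketched as a one-sided hypergeometric tail probability. This is an illustrative assumption rather than the exact statistic used here: the text specifies only that observed co-occurrence was compared to background motif prevalence. The function names and the toy operon data are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def motif_pair_enrichment(operons, shell_motif, enzyme_motif):
    """Count co-occurrence of a shell motif and an enzyme motif across operons.

    operons: list of (set of shell motifs, set of enzyme motifs), one per operon.
    Returns (observed co-occurrences, one-sided hypergeometric P-value).
    """
    N = len(operons)
    K = sum(shell_motif in shell for shell, _ in operons)
    n = sum(enzyme_motif in enzyme for _, enzyme in operons)
    k = sum(shell_motif in shell and enzyme_motif in enzyme for shell, enzyme in operons)
    return k, hypergeom_sf(k, N, K, n)


# Toy data: ten operons, both motifs present together in five and absent in five.
toy_operons = [({"shellMotif"}, {"enzymeMotif"})] * 5 + [(set(), set())] * 5
observed, p_value = motif_pair_enrichment(toy_operons, "shellMotif", "enzymeMotif")
print(observed, p_value)
```

A small P-value indicates that the two motifs occur together more often than their individual prevalence would predict; as noted above, this alone does not distinguish physical interaction from co-functionality.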

Structures were predicted for the predicted novel encapsulin associated enzymes and fusion domains using the Phyre structure prediction suite 78. Phyre compares user defined query sequences to the database of solved protein structures to predict the most likely fold of the query protein. Where sequence similarity to a segment of solved structure is high enough, automated homology modeled threading is attempted. For regions without strong matches to solved structure, the predicted structure is derived from secondary structure prediction and the constraints imposed by connected, confidently modeled segments. This produces a best estimate predicted structure, with position specific confidence measurements indicating how likely the query protein is to actually form each segment of the predicted fold 78.

4. Results:

4.1 Diverse forms and functions of phage-like regions predicted

It is important to distinguish capsids from encapsulins, and to sample as completely as possible the true population of encapsulins. By leveraging all the available sequenced members of a functional family, as well as synteny information, SITHA outperforms conventional single origin query BLAST or the existing capsid or encapsulin hidden Markov Models (HMMs) 68 available from TIGR or Pfam when judged by true positive/false positive ratio (Figure 2-1). The limitation was one of speed, as annotation depends on context across a region rather than single sequence comparison. Dr. Davidson, Joe Wu, Tomasz Blazejewski and I are developing more time efficient comparable methods, utilizing novel HMMs derived in our lab. Preliminary results from phage structural protein families suggest comparable annotation quality with better computational efficiency using precompiled HMMs. Thus the SITHA method establishes confident functional annotations for a maximum population of sequenced Caudovirales phage genomes, giving a strong starting point for comparison with proteins, genes and regions of interest in other genomes.

From a starting set of 44 experimentally validated capsids, our SITHA method confidently annotated 489 phage major capsid proteins, along with 4978 other phage structural and genome packaging genes assigned specific functions, in 529 tailed phage species. An additional 32,058 proteins could be confidently classified into more general functions in phage physiology. As phage taxonomy is based on protein encoded morphological features, our method was also able to taxonomically classify 16 previously unclassified sequenced phages.

This expanded set of confident phage annotations allowed the confident specific annotation of 10866 phage-like proteins in 1807 prokaryotic species, and generalized morphological functional characterization of 811,765 phage-like proteins across 2110 prokaryotic strains, and 1319 non-phage viruses. These were classified into predicted prophages, tail-like bacteriocins/secretion systems, encapsulin-like operons, other phage-related operons, and ambiguous phage-like proteins (Table 4-1). A total of 2052 prophage regions, including 1214 prophage capsids, were identified. The remaining 842 prophages either showed complete deletion of the capsid, or were too degraded or divergent to accurately annotate the capsid among multiple candidates. 674 other prophage remnant-like regions were also identified but were too degraded to unequivocally classify as prophage. Also identified were 109 phage tail-like bacteriocin operons, and 50 novel prophage independent phage resistance genes.


Table 4-1: Distribution of phage-like segments in prokaryotes

Region type | Number of regions classified | Total phage-like proteins in regions | Number of species encoding these regions
predicted prophages | 2052 | 28421 | 1807
predicted tail-like secretion systems/bacteriocins | 114 | 8320 | 106
predicted encapsulins | 1479 | 1650 | 1345
other predicted phage related operons | 2326 | 13959 | 1807


Of the predicted encapsulins, 590 showed sequence similarity to at least one of the four encapsulins published by other groups. These are discussed and characterized in the preceding sections of this thesis. The remaining 1060 showed confident signals of sequence similarity and identity to capsids, occurred in contexts distinct from prophage regions without the indicative phage morphogenetic genes, were large enough to encode all the domains of the complete capsid-like fold, and were encoded with one or more enriched enzyme families distinctive to prokaryotic metabolism and not phage replication. These predicted novel encapsulins occur in nineteen distinct phyla of both Eubacteria and Archaea. Out of these, 986 shared sequence similarity with each other and conserved similar enzymes in proximity. The remaining 74 were diverse in sequence and genomically associated enzymes, subdividing into 23 clusters based on greater than 30% global sequence identity. These clusters comprise twelve small significantly conserved functional families, along with eleven divergent candidate encapsulins forming single member clusters (singlets) by sequence. Nine of these singlets were encoded near enzymes sharing function with enzyme families enriched near members of the large clusters, suggesting potential analogous compartment functions despite the divergence in primary sequence. Unequivocally distinguishing these dissimilar singlets from cryptic prophage remnants is not possible, but there is also no evidence to support these regions being prophage related. The conservation of both a candidate encapsulin shell and at least one proximally encoded significantly enriched enzyme family argues against the larger families being non-functional remnants.

Similar to classical encapsulins, these families were enriched in genomic association with conditionally toxic biochemistry involving metal cations, reactive nitrogen, S-adenosylmethionine, peroxides, or oxygen. The bias toward iron-dependent metabolism was less pronounced in these families than with the classical encapsulins, though sulphur remains important. In addition to iron, thiol, NADH and phosphate metabolisms were also very prevalent.


4.II.a Prokaryotes encode multiple capsid-like novel encapsulin families beyond the limits of the classical encapsulin family

The preceding section identified 1060 predicted novel encapsulins distinct in sequence from classical encapsulins, and of previously unknown function. Unlike capsids from prophage or prophage remnants, these genomic loci lacked surrounding genes encoding other phage morphogenetic proteins. Clustering of these sequences into families allows more detailed analyses, including identification of enriched associated enzymes and the prediction of metabolically active operons. With the diversity of sequences found among capsid-like proteins, a robust clustering method, such as Markov Chain clustering (MCL) 29 was preferable. MCL is resistant to input bias, long branch effects, and the formation of clustering artifacts which can limit other methods. Clustering is determined based on proximity/connectedness sampled multiple times from within a sequence identity graph. Proteins more similar to each other have more strong connections, causing them to be sampled more frequently together, giving them a higher confidence of clustering together.

Markov Chain clustering separated the novel encapsulins into 24 clusters with less than 20% sequence identity between clusters. Average sequence identity within clusters varied from cluster to cluster. Most of these encapsulin families comprised fewer than 30 proteins each, with 11 clusters containing only a single candidate encapsulin. Genomic context and enzyme family association further grouped these 24 clusters into eight families plus the eleven singlets (Table 4-2). One family, composed of a single sequence cluster, was much larger than the others, containing 986 predicted novel encapsulins. This cluster was the primary focus of further analysis of novel predicted encapsulins.
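For readers unfamiliar with MCL, the following is a minimal sketch of the general expansion/inflation scheme on a small similarity graph. It is not the MCL implementation used for these analyses; the self-loop weight, inflation value, convergence threshold and toy graph are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix by expansion and inflation."""
    n = len(similarity)
    # Assumption: add self-loops of weight 1.0 before normalizing.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walk
        inflated = normalize_columns([[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row retaining probability mass defines a cluster of its non-zero columns.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: proteins 0-2 are mutually similar, proteins 3-4 are mutually similar,
# and there is no detectable similarity between the two groups.
toy_similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
toy_clusters = mcl(toy_similarity)
print(toy_clusters)
```

Raising the inflation parameter generally yields smaller, tighter clusters, which is the main tuning choice when applying this kind of clustering to divergent sequence families.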


Table 4-2: Distribution of novel encapsulin families

Encapsulin Family | Predicted Cargo (associations) | Protein Count | Species Count
family 1 (linocin-like) | peroxidase, ferritin, hemerythrin, rubrerythrin, bacterioferritin, ferredoxin (CutA1-like, radical SAM oxidoreductase, metalloprotease) | 590 | 577
family 2.i (major membrane protein I-like) | cysteine desulfurase (Ser-O-acetyltransferase, rhodanese) | 644 | 576
family 2.ii (FNR binding domain fusions) | fumarate/nitrate/small oxidants (rhodanese) | 342 | 171
family 3 (lambda capsid-like, 1 cluster) | predicted heme-dependent cytochrome P450-like monooxygenase hydroxylases, undecaprenyl-pyrophosphate binding proteins | 22 | 22
family 4 (Enterobacteria phage phiP27 capsid-like, 4 clusters) | alpha-ketoglutarate decarboxylase/2-oxoglutarate dehydrogenases | 11 | 10
family 5 (6 clusters) | inorganic pyrophosphatases | 11 | 11
family 6 (Clostridium specific metal response compartments, 1 cluster) | metal regulated compartment of unknown function | 6 | 2
family 7 (F10 capsid-like, 3 clusters) | protease/encapsulin fusion | 8 | 6
family 8 (A118 capsid-like, 2 clusters) | divalent cation-dependent hydrolases of diverse function | 5 | 5
singlets | indeterminate | 11 | 11


4.II.b Predicted encapsulated cysteine desulfurase-like enzyme compartments distributed across diverse lineages

592 cysteine desulfurase-like proteins were found encoded within five open reading frames of 576 predicted encapsulins distinct in sequence from the previously published encapsulin family 2,20, which I refer to as classical encapsulins. In most cases the desulfurase was encoded adjacent to the predicted novel encapsulin. 16 of these 592 predicted desulfurases were encoded as homologous non-identical gene pairs or triplets defining 576 predicted desulfurase/encapsulin operons, conserving a set of functionally related enzymes and regulators (Figure 4-1.i). These tandem encoded desulfurase genes may confer slightly different affinities or may be fully redundant. 46 genomes encoded two predicted encapsulins from this family and split the enriched associated enzymes between the two operons. Twenty-two strains encode predicted encapsulins and desulfurases separately. These separate operons share homologous regulators from the same sequence family, suggesting the potential for co-regulation and co-expression.

Together these define 644 predicted encapsulin/desulfurase systems.

342 other predicted encapsulins, with strong sequence similarity, were encoded within related genomes, but without desulfurase association. These genomes were from six of the same phyla and many of the same genera as the predicted desulfurase associated encapsulins. Unlike the majority of encapsulin operons these predicted encapsulins occur in non-identical homologous pairs, defining 171 isolated predicted encapsulin operons with two shell proteins in this family, along with conserved regulatory genes (Figure 4-1.ii). These two-encapsulin operons also encode an additional amino-terminal domain, with significant similarity to the ligand binding domain of CRP/FNR family regulators. These two subfamilies segregate by length, sequence similarity, and genomic association, suggestive of a common origin with divergent adaptation to related functions. The complete list of predicted encapsulins and cargo enzymes in this family is available in Table 4-3.


Figure 4-1.i: Example operon diagram for rhodanese/desulfurase/MMPI encapsulin/acetyltransferase systems

Figure 4-1.ii: Example operon diagram for FNR-sensor domain fusion MMPI encapsulin systems

Table 4-3: List of predicted novel encapsulins and associated enzymes

Available on the appended CD or online at: https://www.dropbox.com/s/hse6n29zk4adbso/radfordThesis_novelEncapsulinList.xls


This novel encapsulin shell family had no detectable sequence similarity to classical encapsulins.

Moderate similarity was evident to several phage capsids (NP_996706, YP_239278, NP_818559, YP_001936108, YP_655396, YP_544727 and YP_002003377), and to capsid derived hidden Markov models. The most capsid similar example in this family was YP_555398, which was 24% identical and 37% similar to the Xanthomonas phage Xp15 capsid, YP_239278, over 222 residues, and 24% identical and 41% similar to the N-terminal third of the Lactococcus phage phiLC3 capsid, NP_996706. While many members of this family are annotated as 'major membrane proteins', ExPASy TMPred 69 predicted nominal transmembrane domain propensity.

The Mycobacterium leprae homolog forms very high molecular weight soluble complexes, which pellet under certain ultracentrifugation conditions 70, similar to the classical encapsulins.

The encapsulin/desulfurase pairing was conserved in diverse representatives of 15 divergent bacterial phyla, including both the more common adjacently encoded form and the less common two operon form. Of these, Proteobacteria, Actinobacteria, Bacteroidetes, and Cyanobacteria were the most prevalent. The cysteine desulfurase sequence family was distributed across all three kingdoms of life with validated diverse functions in sulphur metabolism 71. Along with the desulfurase enrichment (Hypergeometric P-value = 0.00), two other sulfur metabolism components show strong co-enrichment: Serine O-acetyltransferase-like enzymes (Hypergeometric P = 0.00), and rhodanese-like sulfurtransferase enzymes (Hypergeometric P = 0.00). 418 of these encapsulin operons include one or more acetyltransferase homologs, and 193 include a rhodanese homolog. A few of these encapsulin encoding regions include two out of three of these components, with the missing component encoded separately, often with a secondary homologous predicted encapsulin gene (e.g. Saccharopolyspora erythraea).


Several diverse families of transcriptional and protein stability regulators were frequently genomically associated with these operons, though no individual family was significantly enriched. This fits with coordinated regulation with other metabolisms, rather than isolated induction by a unique activator. These regulators included XRE, TetR, ArsR, a family of iron sensors and molybdenum-pterin dependent transcription regulators. The diversity of regulators suggests that the function of this encapsulin family is central to resistance to a variety of toxic stresses.

The breadth and diversity of lineages and the presence of genomically associated transposases in some species supports a mobile history involved in the dissemination of this family.

Cysteine desulfurases are essential for generating free sulfur for iron sulfur cluster and thiol metabolism 71, both major targets of arsenic toxicity. Arsenic ionically strips sulfur from iron sulfur clusters, and binds up thiol 52. This family also contributes to synthesis of various molybdenum dependent cofactors. Free thiol in turn is required for rhodanese dependent detoxification of cyanide. Serine O-acetyltransferases produce O-acetylserine, a precursor to cysteine, and a conditionally strong sulfur acceptor. Packaging of these reactions into encapsulins may protect from arsenic, sequester and detoxify cyanide, and improve reaction efficiency through channeling.

Another genomic architecture is also observed for 342 homologs of these predicted encapsulins, in which two predicted shell genes are encoded adjacently, without strong conservation of desulfurase or acetyltransferase homologs. These paired predicted encapsulins in two-encapsulin operons were significantly longer (Student two-sample T-test P = 1.57E-164). This added length is composed of an additional conserved N-terminal domain, similar to CAP/FNR oxygen binding domains (cd00038). This type of oxygen binding domain is named for and best characterized from a large extended family of transcriptional regulators. These regulators canonically encode two domains: a sensor domain and a DNA binding domain. The sensor domain is the one with similarity to the N-terminal domain found fused in these predicted encapsulins. In the regulators this domain binds diverse small oxidant molecules with high affinity. Structure prediction (Phyre2) 78 and threading confidently resolved complete folds for both the predicted encapsulin domain and the FNR-like domain. Modeling also predicts a flexible hinge and short alpha helix connecting the two domains, exposing the FNR-like fold on the inner surface of the compartment.

This additional N-terminal fusion domain includes the canonical iron-dependent oxidant binding site and linker, but not the DNA binding regulator domain. Only thirty-eight of these FNR-like encapsulins were encoded near desulfurases, and encode a single copy per operon in those cases.

Thus these two-encapsulin operons may be mechanistically and/or functionally distinct from the one encapsulin system homologs. This iron-dependent oxidant binding domain may serve in an alternative biological niche. The namesake family of fumarate and nitrate reductase transcriptional regulators acts in diverse bacteria and some eukaryotes to regulate oxidant sensitive metabolisms. Under ideal environmental conditions the sensor domain binds oxygen and activates expression of nitrate reductase. However, these regulators are a major cause of induced toxic disruption of metabolism, owing to promiscuous affinity for diverse small oxidants, including the native oxygen, as well as carbon monoxide, nitric oxide, and ferricyanide 72. Deletion of these regulators confers increased resistance to these compounds, at the expense of nitrate metabolism 72. This ligand promiscuity, which is so toxic in a regulator, is advantageous for fusion encapsulins, generating a high capacity, low specificity sequestration sink of diverse toxic oxidants. Rhodanese activity, enhanced by desulfurase, can further detoxify bound ferricyanide to usable iron and less toxic thiocyanate 73. This may be the selective pressure promoting enrichment for rhodanese in proximity to this subfamily of predicted novel encapsulins.


4.II.c Predicted novel encapsulated cysteine desulfurase-like enzymes conserve two specific motifs in N-terminal extension domain

In a number of these novel encapsulin associated enzyme families, MEME motif prediction analysis identified significantly conserved motifs, which were exclusive to the homologs co-encoded with encapsulins, and absent from the related homologs not encoded with encapsulins.

These conserved encapsulin association specific motifs were exclusive to each system of shell/enzyme family pairs. None of these motifs match the targeting motif observed for classical encapsulins discussed earlier in this thesis (Chapter 2.II). Instead each was unique to a distinct encapsulin/enzyme family. In all cases the conserved motifs suggest the potential for protein/protein interaction primarily through hydrophobic patch interactions.

In addition to the 644 cysteine desulfurase-like proteins co-encoded with encapsulins, 543 homologs were identified in genomes without predicted novel encapsulins, enabling attempted detection of encapsulin association specific motifs by comparing the two sets through MEME. The predicted encapsulin associated and independent desulfurase-like homologs conserve highly similar motifs throughout the C-terminal domains (mean 400 aa). The encapsulin associated set encodes a number of conserved motifs within an N-terminal extension unique to this set (mean 197 aa) (Figure 4-2). In the non-encapsulin associated set this N-terminal extension was absent. Homologous structures have been solved for the C-terminal fold (PDB ID: 1JF9, 1CL1, 1BJN, 1EGS), but not the N-terminal extension, which was dissimilar to any identified solved structure based on Phyre 78. If the addition of the N-terminal domain does not radically alter the C-terminal fold, encapsulation would orient the catalytic site and dimerization interface toward the centre of the compartment with substrate accessibility unaffected. In contrast to the C-terminal fold, this N-terminal extension would face the shell inner surface.


Figure 4-2: Sample motif maps of desulfurase proteins encoded adjacent to MMPI-like novel encapsulin genes.

N-terminal segment (green background) uniquely encoded in homologs genomically associated to predicted novel encapsulins. C-terminal segment (red background) conserved in all members of predicted desulfurase family. Individual predicted conserved motifs highlighted in different coloured boxes.


Paired motif enrichment analysis predicted that the two significantly conserved desulfurase N-terminal motifs were significantly co-enriched with significantly conserved encapsulin motifs (Figure 4-3.i, Figure 4-3.ii). The larger of these occurred at the N-terminus of the desulfurases (Figure 4-3.i), and was termed the LARLASEF motif from its consensus. The smaller of the motifs (Figure 4-3.ii) was encoded near the boundary of the core desulfurase segment, separated by a short linker-like segment, and was termed the SPYYFLDE motif from its consensus.

MAST search with these motifs exclusively hits encapsulin associated homologs up to E-value < 0.32. These motifs were found in eighteen separately encoded desulfurases occurring in the same genomes as encapsulins, and in 96% of all encapsulin adjacent desulfurases. Neither motif shows matches in homologs from genomes without encapsulins. When aligned with experimentally studied desulfurases, the C-terminus of the encapsulin associated family members conserves all the residues of the validated ligand binding and active sites, as well as the two motifs highlighted above (Figure 4-3.iii). Phyre 78 predicted structures for these enzymes maintain the enzymatic active site, and present the encapsulin association specific motifs exposed toward the predicted encapsulin shell inner surface (Figure 4-3.iv).


Figure 4-3.i: N-terminal encapsulin association specific desulfurase motif

Simplified consensus "LPDAAPPAGLPDPATLARLA[SN]EF[FL]AALPGQ", E-value = 9.1e-410, Log Likelihood Ratio = 1759, Information Content = 95.5 bits, Relative Entropy = 79.3 bits.

Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.

Figure 4-3.ii: Internal encapsulin association specific desulfurase motif

Simplified consensus "[NP][AE]SP[YF]YFL[DG][EGD]", E-value=5.6e-31, Log Likelihood Ratio=392, Information Content= 27.8 bits, Relative Entropy= 28.3 bits. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.


Figure 4-3.iii: Encapsulin association specific motifs and validated ligand binding and active sites conserved in novel encapsulin associated predicted desulfurases.

BLOSUM62 coloured MAFFT multiple sequence alignment of representative encapsulin associated predicted desulfurases (green) and non-encapsulin associated cysteine desulfurases (orange). The two conserved motifs exclusive to desulfurases genomically associated to this encapsulin family are marked in labeled green boxes. The residues of the desulfurase ligand binding and catalytic site are marked in blue boxes.


Class | gene name | species | NCBI accession
encapsulin adjacent desulfurase | SufS subfamily cysteine desulfurase | Nakamurella multipartita | WP_015750276.1
encapsulin adjacent desulfurase | cysteine desulfurase | Hyphomicrobium denitrificans | WP_013214792.1
encapsulin adjacent desulfurase | SufS subfamily cysteine desulfurase | Methylocella silvestris | WP_012589530.1
encapsulin adjacent desulfurase | cysteine desulfurase | Acinetobacter calcoaceticus | WP_017390450.1
encapsulin adjacent desulfurase | cysteine desulfurase-like protein | Synechococcus sp. PCC 6312 | AFY61733.1
encapsulin adjacent desulfurase | hypothetical protein | Methylocystis parvus | WP_016917829.1
encapsulin adjacent desulfurase | nifS-like protein | Mycobacterium avium subsp. paratuberculosis | AAF82074.1
encapsulin adjacent desulfurase | cysteine desulfurase, SufS subfamily | Methylobacterium extorquens CM4 | ACK82221.1
encapsulin adjacent desulfurase | cysteine desulfurases, SufS subfamily | Chlorobium ferrooxidans | WP_006365192.1
non-encapsulin associated desulfurase | cysteine desulfurase, SufS subfamily | Methanosarcina acetivorans C2A | AAM06635.1
non-encapsulin associated desulfurase | nifS protein | Rhodopirellula baltica SH 1 | NP_869536.1
non-encapsulin associated desulfurase | nifS protein, putative | Haemophilus influenzae Rd KW20 | AAC22941.1
non-encapsulin associated desulfurase | aminotransferase NifS | Mycoplasma penetrans HF-2 | NP_758133.1
non-encapsulin associated desulfurase | nifS protein | Borrelia garinii PBi | AAU06940.1


Figure 4-3.iv: Predicted structure of novel encapsulin associated cysteine desulfurase, maintains enzymatic active site, and presents encapsulin association specific motifs exposed toward predicted encapsulin shell inner surface.

blue grey: confidently predicted structure of desulfurase monomeric subunit; grey and yellow: confidently predicted structure of homo-dimeric desulfurase complex; green: pyridoxal 5'-phosphate binding pocket and catalytic site; orange: low confidence predicted structure of N-terminal extension domain; red: predicted location of two encapsulin association specific motifs. Homology modeled threaded sequence of representative encapsulin associated cysteine desulfurase from Acinetobacter calcoaceticus [WP_017390450] displayed.


The Phyre structure prediction suite 78 identified two phage capsids [3J4U 79 94.6% confidence, 1IF0 80 72.9% confidence] as confident predictors of structure for this family. Based on threading from these structures, a significantly conserved co-enriched encapsulin motif (Figure 4-4) occurs in an internally N-terminal sub-domain on the inner encapsulin surface. I refer to these as TALGD motifs based on the consensus. The threaded TALGD motifs form an alpha helix on a flexible loop with low confidence structure prediction. As such the precise position relative to the inner surface cannot be determined. Within the limits imposed by the confident threaded structures, the TALGD motifs were localized similarly to the target binding sites for classical encapsulins 2 as discussed above (Chapter 2.II.i). The classical encapsulin binding site combines conserved residues distributed throughout the protein, while the TALGD motif occurred as a continuous segment. In both cases site specific conservation produced statistically significant signals. Among the desulfurase motifs, the SPYYFLDE site was the most strongly and significantly co-enriched, suggesting functional and perhaps physical interaction.

Structure prediction in the FNR-encapsulin fusions was similarly confident (Figure 4-5.i, Figure 4-5.ii). Confidence was quantified based on sequence similarity and fit quality as determined by the Phyre suite 78. The C-terminal domains with similarity to capsids can be confidently fit to the same Bordetella phage capsid structure predicted above [PDB id: 3J4U 79]. The FNR domain threads confidently to the P. aeruginosa Dnr transcriptional regulator sensor domain [PDB id: 2Z69 81]. The amino acids between the two domains act as a hinged linker with dynamic structure and thus less confident predictability. Based on the threaded model the FNR domain would project into the interior of the compartment, accessible to ligands.


Figure 4-4: Significantly enriched desulfurase association specific encapsulin N-terminal motif

Simplified consensus " xxAQ[TL]ALGD[VN]A ", E-value = 1.8e-187, Log Likelihood Ratio = 1199, Information Content = 30.4 bits, Relative Entropy = 26.2 bits. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.


Figure 4-5.i: Experimentally validated canonical FNR-sensor ligand binding sites conserved in novel FNR sensor-like domain fusion encapsulins.

Clustal coloured MAFFT multiple sequence alignment of representative unfused predicted encapsulins (orange), FNR sensor-like domain fusion encapsulins (green), and FNR regulator sensor domains (cyan). The conserved ligand binding and hinge sites were highlighted in labeled blue and cyan boxes respectively.

ID | gene name | species | NCBI accession
encapsulin: B. ceno | Major membrane protein I | Burkholderia cenocepacia PC184 | YP_002095113.1
encapsulin: A. baum | major membrane protein I (MMP-I) | Acinetobacter baumannii ATCC 17978 | YP_001084436.1
encapsulin: N. euro | major membrane protein I | Nitrosomonas europaea ATCC 19718 | NP_840658.1
encapsulin: M. flag | major membrane protein I | Methylobacillus flagellatus KT | YP_544727.1
FNR-domain_encapsulin: C. acid | Crp/Fnr family transcriptional regulator | Catenulispora acidiphila | WP_015793205.1
FNR-domain_encapsulin: A. benz | Crp/Fnr family transcriptional regulator | Amycolatopsis benzoatilytica | WP_020659342.1
FNR-domain_encapsulin: A. vanc | CRP/FNR family transcriptional regulator | Amycolatopsis vancoresmycina | WP_004561479.1
FNR-domain_encapsulin: A. halo | hypothetical protein | Actinopolyspora halophila | WP_017975495.1
FNR-domain_encapsulin: C. kore | hypothetical protein | Catelliglobosispora koreensis | WP_020522066.1
FNR-domain_encapsulin: A. enza | hypothetical protein | Actinokineospora enzanensis | WP_018685788.1
FNR-domain_encapsulin: C. fusc | Crp/Fnr family protein | Cystobacter fuscus | WP_002627926.1
FNR-domain_encapsulin: A. miru | Crp/Fnr family transcriptional regulator | Actinosynnema mirum | WP_015803726.1
FNR-domain_encapsulin: A. nigr | Crp/Fnr family transcriptional regulator | Amycolatopsis nigrescens | WP_020666522.1
FNR-domain_encapsulin: A. mort | hypothetical protein | Actinopolyspora mortivallis | WP_019853175.1
FNR-domain_encapsulin: A. flav | hypothetical protein | Actinomadura flavalba | WP_018658245.1
FNR-domain_encapsulin: A. acet | Crp/Fnr family protein | Acetobacter aceti | WP_010667751.1
FNR-sensor_domain: M. tube | Crp/FNR family transcriptional regulator | Mycobacterium tuberculosis CDC1551 | NP_336168.1
FNR-sensor_domain: R. sola | FNR regulatory transcription regulator protein | Ralstonia solanacearum GMI1000 | NP_519404.1
FNR-sensor_domain: N. meni | Crp/FNR family transcriptional regulator | Neisseria meningitidis MC58 | NP_273429.1
FNR-sensor_domain: A. fabr | transcriptional activator, Crp/FNR family | Agrobacterium fabrum str. C58 | NP_354595.1
FNR-sensor_domain: M. loti | transcriptional regulator | Mesorhizobium loti MAFF303099 | NP_104749.1
FNR-sensor_domain: L. lact | FNR-like protein | Lactococcus lactis | CAB53581.1
FNR-sensor_domain: D. radi | transcriptional regulator | Deinococcus radiodurans R1 | NP_296083.1
FNR-sensor_domain: B. meli | Crp/Fnr family transcriptional regulator | Brucella melitensis bv. 1 str. 16M | NP_540669.1


Figure 4-5.ii: Predicted structure of novel encapsulin fused FNR sensor-like domain, maintains exposure of FNR sensor domain ligand binding site, and presents FNR-like domain on inner surface of predicted compartment.

blue grey: model of FNR domain predicted from P. aeruginosa Dnr sensor domain [2z69]; purple: model predicted from Bordetella phage capsid [3j4u]; green: CO/O2/NO ligand binding site; orange: low confidence predicted structure of flexible hinge/linker between FNR and encapsulin domains; red: symbolic boundary of internal and external encapsulin surfaces. Homology modeled threaded sequence of representative predicted encapsulin from Acetobacter aceti [WP_010667751] displayed.

4. Discussion

With the exception of the sulfur metabolism desulfurase/acetyltransferase/rhodanese encapsulin family, these novel encapsulin/enzyme families tend to be smaller and more lineage-restricted than the classical encapsulins. Enzymes associated with different encapsulin families show no evident pattern of interfamily conservation of motifs (appendix 4), even though at least one enzyme family genomically associated with each encapsulin family consistently shows significant conservation of motifs specific to encapsulin association. This suggests that multiple, largely independent solutions have arisen to facilitate enzyme/encapsulin interactions in each of these extremely diverged families of encapsulins. Given the considerable distance in sequence space between encapsulin families, different interaction mechanisms for the differing inner surfaces of different encapsulin families were to be expected.

The diversity of capsid interaction motifs in phage scaffold and head protease enzymes further demonstrated the plasticity possible in fulfilling these interactions.

Despite this diversity in physical mechanism and sequence families, the enriched functions implicated in these analyses share biochemical similarities. The conserved cysteine desulfurase and inorganic pyrophosphatase encapsulin systems both contribute to resistance to arsenic and cyanide through thiol and phosphate metabolism 74,75. Predicted undecaprenyl-phosphate phosphatases/dephosphatases are known to be protective against rare earth metal toxicity 76, and as encapsulated systems may also function in remediation of other toxins, including arsenic.

Encapsulation of UDP-glucose/GDP-mannose dehydrogenase, alpha-ketoglutarate decarboxylases/2-oxoglutarate dehydrogenases, predicted heme-dependent cytochrome-like monooxygenase/hydroxylases, and NAD(P)H generating nitrite reductase serves a similar purpose, protecting these otherwise arsenic-sensitive enzymes. Many of these enzymes are susceptible to cyanide toxicity, due to their dependence on coordinated iron in the reaction centres, which can be tightly bound by cyanide, blocking essential reactions. Any enzyme dependent on exposed coordinated iron shares this sensitivity to some degree. Encapsulin-associated rhodanese domain enzymes and cyanide hydratases are predicted to actively degrade these ferricyanide molecules, relieving this toxicity. Coupling this detoxification with compartmentalization would be expected to improve efficiency, concentrate toxic precursors, and control diffusion. Cyanide, being a very simple organic toxin, is produced as a metabolic byproduct and defensive secondary metabolite by diverse organisms. As such, cyanide resistance is advantageous in multiple niches, but not necessary in most broadly colonized environments. This agrees with the more restricted distribution of these predicted encapsulins.

In summary, novel capsid-like compartments are involved in diverse niches of sulfur, iron and phosphate metabolism, potentially improving resistance to heavy metal and organic toxins, and improving the efficiency of sulfur turnover. This places these novel encapsulins in roles distinct from, yet similar to, those of the family of classical encapsulins discussed in Chapters 2 and 3 of this thesis.

The link to toxin resistance fits well with the non-vertical dispersal and retention of encapsulins. Most prokaryotes do not colonize niches rich in broad scope ionic and organic toxins. Without this pressure, encapsulation of enzymes, while conditionally advantageous, is not essential and therefore can be readily lost without loss of viability. These types of toxicity produce an interface zone of inhibitory but non-lethal toxicity, optimal for horizontal transfer into unrelated lineages. The select organisms which receive and retain these conditionally advantageous compartments tend to be the more resistant and robust strains, such as Mycobacterium tuberculosis, multiple drug resistant Pseudomonas, and several hyper-extremophiles.

Chapter 5: Conserved mechanisms of protein interaction produce the structural and functional similarities and differences between capsids and encapsulins

5. Chapter Abstract: Wildtype and mutant capsids share many structural and phenotypic properties with encapsulins, in some cases with encapsulins resembling the wildtype form and in others a mutant form. These properties include icosahedral symmetry, preferred geometry dependent compartment size, structural stability, DNA packaging, and ionic context dependent structural dynamics or rigidity.

Whatever phylogenetic relationship links this conserved encapsulin/capsid-like structure in these three kingdoms must have converted the distinctive properties of one form to become the other, while retaining the shared features. Examining the effects of amino acid substitutions in the capsid of E. coli phage lambda reveals consistent mechanisms linking sequence perturbation to variation in these phenotypes. These same mechanisms were found to predict conservation and specific variation in sequence between the lambda capsid and other capsid-like folds, including but not limited to the encapsulins. The resulting model successfully predicted the structural and phenotypic effect of a novel point mutation in the capsid protein of E. coli phage HK97. These findings establish the basis for directed engineering of capsids and encapsulins for specific biotechnologically relevant purposes. The observed consistency of these mechanisms suggests that capsids and encapsulins should be fairly easily converted from the properties of one to the other with minimal mutagenesis.

5. Introduction:

Among the Caudovirales phage there is very high sequence diversity among major capsid proteins, yet these proteins conserve very similar geometry and structure at all levels. This is similarly true of the encapsulins. The protein-protein interactions required for normal capsid formation can be considered on three levels. First are the core interactions, which form and stabilize the tertiary structure of the functional fold, and are usually hydrophobic. Second are the monomer to monomer interactions, which enable the formation of more complex units. Finally, there are the multimer to multimer interactions, which organize these more complex units into the final superstructure of the functional capsid or encapsulin compartment. Some of the details of this process are described in chapter 1 of this thesis. Overall, monomers organize into either pentamers or hexamers, which then interact to form icosahedrons. The geometry of closed icosahedrons is mathematically defined by minimum energy physics. The size of these icosahedrons is determined by the number of hexamers incorporated between the pentamers, which act as vertices.
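The dependence of compartment size on hexamer number can be illustrated with the standard Caspar-Klug triangulation number, which is introduced here purely for illustration and is not derived in this chapter. The following is a minimal sketch for an idealized closed icosahedral shell; the function names are hypothetical, and the counts ignore the fact that one vertex of a tailed phage capsid is occupied by the portal rather than a pentamer.

```python
def triangulation_number(h: int, k: int) -> int:
    """Caspar-Klug triangulation number T = h^2 + h*k + k^2."""
    if h < 0 or k < 0 or (h == 0 and k == 0):
        raise ValueError("h and k must be non-negative and not both zero")
    return h * h + h * k + k * k


def capsomer_counts(t: int) -> dict:
    """Subunit and capsomer counts for an idealized closed icosahedral shell."""
    return {
        "subunits": 60 * t,        # 60 subunits per unit of T
        "pentamers": 12,           # always one pentamer per vertex
        "hexamers": 10 * (t - 1),  # hexamers fill the space between vertices
    }


# Idealized shells of increasing size: T = 1, 3, 4, 7
for h, k in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    t = triangulation_number(h, k)
    print(f"T={t}", capsomer_counts(t))
```

Under these assumptions, a triangulation number seven shell such as the lambda capsid corresponds to 60 hexamers and 420 subunit positions, whereas the smallest closed shell (T = 1) contains no hexamers at all.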

From early on in the study of virology, Escherichia coli Siphophage lambda has been a focus of diverse study. Specific amino acid substitutions at 63 positions in the lambda major capsid protein (Table 5-1) induced aberrant structural phenotypes, which could be grouped into six categories 10,11,12,13,63,86. First of these were substitutions causing failure to form head structures. Second and third were substitutions which cause the formation of wide polyheads (Figure 5-1), or the formation of narrow tubular polyheads (Figure 5-2), respectively 63. Polyheads are diverse, large, often asymmetric structures formed from large sheets of interconnected capsid subunit multimers. Fourth were substitutions causing the formation of atypically small phage heads (Figure 5-3) 13. These small heads more closely resemble published encapsulins in size and geometry than wildtype capsid heads do. Fifth were substitutions producing proheads which, unlike wildtype capsids, do not undergo expansion during phage maturation. In wildtype phage this expansion involves a dramatic increase in internal volume and external diameter. Observation of studied encapsulins in vivo suggests these compartments behave more like this class of mutants, forming and maintaining a constant diameter and structure. It is possible that the absence of observed expansion in encapsulins is due to a deficit of study under diverse conditions, but it may also indicate the true absence of biologically relevant expansion. The sixth class comprises substitutions forming effectively normal, expanded 'mature' capsid heads, except for a failure to package and retain DNA (Figure 5-4) 11. Published encapsulins appear to share this aversion to DNA packaging, as encapsulins are not copurified with DNA even at low levels 2,4,20,22, unlike many capsids, which non-specifically package nucleotides and small DNA fragments even in the absence of genome packaging machinery 24. This DNA intolerance may not be true of all encapsulins, and where present may also be explained by exclusion of DNA by substrate.

Table 5-1: Prior phenotypic amino acid substitutions identified in Bacteriophage lambda

Class | Mutations
no heads formed | 44 sites
smaller proheads | Ser59Phe, Val78Ile, Thr284Ile
unexpandable proheads | Ala293Thr, Ala324Thr
DNA packaging defective expanded heads | Arg222Cys, Ser226Phe, Tyr245Ser, Tyr245Gln, Gly246Asp
wide polyheads | Pro45Ser, Gly46Glu, Glu229Lys, Ser240Phe, Gly243Glu, Gly243Arg, Pro305Ser
narrow polyheads | Tyr40Ser, Pro75Ser, Pro75Leu, Arg91His, Ser301Phe

Figure 5-1: Wide polyhead mutant in the Phage lambda gpE major capsid protein

(Katsura 1978, fig 2d) 63 Electron micrograph of polyhead with a widening end. Negative staining with 2% uranyl acetate. Magnification: 157,000X.

Figure 5-2: Narrow polyhead mutant in the Phage lambda gpE major capsid protein

(Katsura 1978, fig 2e) 63 Electron micrograph of partially filled narrow polyhead. Negative staining with 2% uranyl acetate. Magnification: 157,000X.

Figure 5-3: Small prohead mutant in the Phage lambda gpE major capsid protein

(Katsura 1983, fig 4) 13 Electron micrograph of small and normal proheads. Negative staining with 2% uranyl acetate. Magnification: 105,000X. Small mature proheads were partially purified from a lysate of W3350(λcIts857 Sam7 EdefK562) and mixed with normal mature proheads purified from a lysate of W3350(λcIts857 Sam7 Aam32). Small proheads (s). Normal proheads (n). Free tails and fragments of E. coli membrane were seen in the micrograph. Among the proheads indicated by arrows, those on the left were completely flattened, whereas those on the right were not.

Figure 5-4: DNA packaging defective mutant in the Phage lambda gpE major capsid protein

(Katsura 1989, fig 1) 11 Electron micrographs of DNA packaging defective mutants; a) expanded empty heads (EdefK560), b) empty proheads without packaging motor (EdefK560 + Aam32), c) unstable expanded empty heads without head stabilizing protein gpD (EdefK511 + Dam15), d) expanded empty head phage particles (EdefK560/pCOS-1 trimer). Negative staining with 2% uranyl acetate. Magnification ~ 170,000X.

When these mutations were originally identified, phage capsid structures were unresolved beyond the limits of the electron microscopy of the time. As such, these observations could not be grounded in the biophysical mechanics at play, nor used to generate hypotheses as to the conserved properties governing the formation and functionality of this class of protein compartment. The following sections place these prior data into the modern structural and functional context, and make testable predictions about the generalized constraints driving formation and functional properties of capsids and encapsulins.

5. Methods:

5. Combining structural and phenotypic data for a unified model of mechanisms governing capsid form and function

The structure of the major lambda capsid protein (gpE) has not yet been determined at atomic resolution. To place the prior mutational studies into the context of the physical changes induced on the structure of the lambda capsid, the sequence of the lambda major capsid protein, gpE, was structurally threaded to the 43% identical solved structure of the Escherichia coli CFT073 prophage capsid (PDB id: 3BQW) 84 based on homology modeling. Threading was applied at the level of the protein monomer, planar hexamer, vertex pentamer, and asymmetric heptamer applying the same symmetry constraints as the CFT073 capsid crystal. Within the full lambda capsid, monomers exist in seven similar but distinct structural contexts. Partial lambda capsid structures were generated for each of these contexts using the VIPERdb database 85, based on the known constraints of icosahedral symmetry for a triangulation number seven capsid.

From these threaded structures position specific solvent accessibility was calculated for the monomer, hexamer, pentamer and full capsid contexts, using the GetArea algorithm 16. The structures and solvent accessibility maps were compared to identify specific residues involved in the various interfaces and interactions. The sites of phenotypic amino acid substitutions were also mapped to the structures to define a composite map (Figure 5-5).
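The comparison of solvent accessibility maps between structural contexts can be sketched as follows. This is an illustrative sketch only, not the procedure used in this work: the residue positions and accessibility values are made-up placeholders, and both the function name and the cutoff of 10 percentage points are assumptions chosen for the example.

```python
def buried_on_assembly(free_acc, bound_acc, min_drop=10.0):
    """Return residues whose percent solvent accessibility drops by at
    least min_drop percentage points between two structural contexts."""
    interface = []
    for residue, acc_free in free_acc.items():
        acc_bound = bound_acc.get(residue)
        if acc_bound is None:
            continue  # residue not modelled in the second context
        if acc_free - acc_bound >= min_drop:
            interface.append(residue)
    return sorted(interface)


# Hypothetical positions 1-4 with placeholder percent accessibilities,
# not values calculated from any structure.
monomer = {1: 62.0, 2: 58.5, 3: 4.0, 4: 71.0}
hexamer = {1: 12.0, 2: 20.5, 3: 3.5, 4: 69.0}

print(buried_on_assembly(monomer, hexamer))  # [1, 2]
```

Residues flagged in this way are candidates for the interface formed in the transition between the two contexts; residues that are already buried, or that remain exposed, in both contexts are not flagged.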

The mature lambda phage capsid is composed of multimers of three different proteins with distinct functions (gpE, gpD, gpB), and depends on temporary interactions with three other proteins (nu1, gpA, gpC) during maturation. Thus many of the phenotypic classes observed could result from either change to the major capsid protein itself, or from changing the interaction with these other minor capsid proteins. To this end the interaction sites for the lambda phage portal (gpB), head-DNA stabilization protein (gpD), and head scaffold/protease protein (gpC) were compared with the sites of phenotypic mutations, using published structures or generating threaded models for proteins without solved structures.

Based on the trends observed from mutagenesis of the lambda capsid, I extrapolated to predict expected amino acid similarities and differences in other capsid and encapsulin structures with similar and different structural and phenotypic properties. Novel phenotypic amino acid substitutions in the lambda capsid along with these other capsids were proposed based on these trends from the lambda capsid. One of these proposed substitutions in the Escherichia coli siphophage HK97 was tested by expression, fractionation and TEM visualization, in parallel with the wildtype capsid.

Figure 5-5: Structural map of phenotypic amino acid substitution sites and interaction interfaces on the predicted structure of the Escherichia coli siphophage lambda gpE major capsid monomer

Peptide backbone cyan ribbon diagram. Important sites denoted as colour coded stick amino acid diagrams; red: wide polyhead, orange: narrow polyhead, forest green: unexpandable heads, pea green: smaller than wildtype heads, cyan: DNA packaging defective, light green: hexamerization interface, yellow: multimer/multimer interface.

5. Results:

5.1 Classes of phenotypic mutations to the Escherichia coli siphophage lambda capsid protein are distributed into discrete structural perturbations

Homology modeled threading yielded good models of the Escherichia coli siphophage lambda gpE capsid overall, with minor differences owing to differences in protein length and sequence. Three elongated surface loops in the CFT073 capsid monomer were predicted to be one amino acid shorter each in the lambda capsid. The reduction occurs between residues E37-K38, P45-G46, and E259-N260 in the lambda sequence. These loop segments have little secondary or tertiary structure, and are not predicted to significantly affect quaternary structure, with the exception of the P45-G46 region. The P45-G46 containing loop contributes to important monomer/monomer interactions, but is predicted to function and fold very similarly to the loop in CFT073.

The six classes of phenotypic mutations occurred in discrete structural positions and showed distinctive alterations to structural regions. While most substitution sites were discrete (Figure 5-5), substitutions within the same phenotypic class altered the gpE structures in similar ways. These trends linking structural perturbation to aberrant phenotype are summarized by the following model:

Model:
1. Mutations abolishing head formation catastrophically disrupt core and interface positions, or bias folding kinetics toward an alternative non-capsid fold or unfolded state.
2. Narrow polyhead mutants alter the monomer/monomer, pentamer/hexamer and/or hexamer/hexamer interfaces, promoting tightly curved hexamer/hexamer sheets by shifting the equilibrium of multimer incorporation toward hexamers and away from pentamers.
3. Wide polyhead mutants promote blunt hexamer/hexamer sheets by altering the monomer/monomer, hexamer/hexamer and/or pentamer/hexamer interfaces to reduce the frequency of pentamer incorporation, and favour more oblique interface angles between hexamers.
4. Small prohead mutants enforce tighter closed icosahedral geometry with bulkier residues at the three-fold symmetry sites, reducing the number of hexamers which can be incorporated.
5. Expansion defective prohead mutants stabilize the prohead conformation at three-fold symmetry sites, resisting induced expansion, by adding additional polar/polar interactions between multimers.
6. DNA packaging defective mutants disrupt DNA coordination on the inner capsid surface, by a combination of exclusion, charge repulsion, and introduction of points of off-pathway charge affinity unacceptable to packaging.

When a threaded portal structure based on the Bacillus phage Spp1 portal (PDB id: 2JES) was modeled with the lambda capsid, the region involved was distinct from any of the regions associated with prohead or head defects in Bacteriophage lambda. As such the packaging motor, also known as the terminases, nu1 and gpA, did not appear to directly interact with any of the sites causing packaging defect.

The small prohead mutant sites (S59F, V78I) projected into binding sites of gpD, the head cementing proteins (PDB id: 1C5E) at the three fold symmetric sites, but are phenotypic prior to stages of maturation where gpD is required. The gpD protein acts to stabilize expanded capsids, while the small prohead phenotype arises at the stage of initial capsid assembly. Thus interaction with gpD may or may not stabilize aberrant sized heads, but cannot be the cause of the phenotype, nor can it be required. Instead, these substitutions shifted the optimal geometry of hexamer/pentamer interaction promoting sharper binding angles enclosing a smaller optimal capsid. Similarly, the site of the T284I small prohead mutant was buried in the interface between asymmetric units altering this geometry.

The narrow polyhead tube mutant sites (Y40S, P75S, P75L, R91H, and S301F) were partially embedded at the inner face of the interface of asymmetric units. These locations contribute to the binding geometry between asymmetric units, and to a limited extent may interact with the internal scaffold protein. The substitutions change these interactions, making hexamer/hexamer interactions more favourable and pentamer incorporation less favourable. The narrow polyhead substitutions also promote a sharper curvature to these hexamer/hexamer sheets, resulting in a narrow tube-like particle. The wide polyhead mutant sites (P45S, G46E, E229K, S240F, G243E, G243R, and P305S) were partially embedded at the inner face of the interfaces between the monomers of the pentamers and hexamers. This altered the same mechanisms as the narrow polyhead mutants, but in the opposite direction, promoting a shallow curvature and giving rise to a wider polyhead phenotype. In both classes coordination with the scaffold was disrupted.

Amino acid substitutions producing expansion defective proheads (A293T, A324T) occurred at sites embedded in the vicinity of three fold quasi-symmetric sites. These substitutions replaced small non-interacting amino acids with polar amino acids with the proximity to form hydrogen bonds between adjacent proteins. This was predicted to stabilize this region, resisting the conformational changes involved in expansion. In the wildtype state genome packaging supplies enough energy to induce expansion, but not enough to fracture the capsid, stall packaging, or break the interaction between the packaging motor and the immature prohead. In the expansion defective proheads however the reverse was true, with insufficient energy to disrupt the three fold quasi-symmetric sites to induce expansion, causing the packaging motor to build up enough repulsive potential energy to detach from the capsid prematurely. This is supported by the observation that these proheads can package shorter artificial genomes without defect, so long as the genome volume is less than that which requires an expanded head 86.

Substitution sites causing normal heads defective for DNA packaging (R222C, S226F, Y245S, and Y245D) were exposed on the inner surface of the hexamer and pentamer. These substitutions remove positive charges, and introduce negative charges, partial negative charges, or bulky aromatic residues, thus altering the interior of the capsid to become intolerant to the bulky negatively charged genome. The increased negative charge repels the DNA and the substitution of phenylalanine at position 226 interferes with DNA neutralization by proximally positioned positive residues. Tyrosine at position 245 is not positioned to interfere in this way.
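The charge component of this argument can be made concrete by tallying the change in formal side-chain charge for each substitution as listed in the paragraph above. This is a rough sketch under a simplified charge table, which is an assumption: Lys and Arg are counted as +1, Asp and Glu as -1, and every other residue (including His) as 0, as at roughly neutral pH. The function name is hypothetical.

```python
# Simplified formal side-chain charges at roughly neutral pH (assumption);
# any residue not listed is treated as uncharged.
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def charge_change(substitution: str) -> int:
    """Formal charge change for a substitution written like 'R222C'."""
    wildtype, mutant = substitution[0], substitution[-1]
    return SIDE_CHAIN_CHARGE.get(mutant, 0) - SIDE_CHAIN_CHARGE.get(wildtype, 0)


for sub in ["R222C", "S226F", "Y245S", "Y245D"]:
    print(sub, charge_change(sub))
```

Under this simplification R222C and Y245D each shift the inner surface by one unit toward negative charge, while S226F and Y245S are formally neutral, so any effect of the latter two would have to arise from steric bulk or partial charges, which this table does not capture.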

5.2 Phenotypic mutations enriched in distinct interfaces affecting different levels of structure

In multimeric complexes, such as capsids, amino acid substitutions can modify multiple properties and produce similar phenotypes. Thus determining where substitutions occur relative to interfaces and the hydrophobic core can be informative as to the mechanisms underlying the observed effects. In the monomer, no significant difference was observed in mean percent solvent accessibility on a per class basis comparing narrow tube (p1), wide tube (p2), small prohead (h1a), DNA packaging defective head (h2), or uncharacterized residues versus the full protein (Table 5-2.i, P > 0.1). These classes thus most likely affect interactions above the level of the monomer. The expansion resistant prohead of normal size class (h1b) showed a significantly lower mean percent accessibility compared to the full protein (P < 2.2 × 10⁻¹⁶), the set of unclassified residues (P < 2.2 × 10⁻¹⁶), and the h2 class (P < 0.05). These residues might thus be expected to alter the monomer fold. Positionally, however, the residues in this class occur on partially buried loops on the surface of the three-fold symmetry site (Figure 5-5), so changes at these positions are unlikely to dramatically alter the monomer fold.
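The class comparisons in this section rest on a two sample t-test of per-residue percent solvent accessibility. The sketch below computes the t statistic and degrees of freedom for the unequal-variance (Welch) form. Whether the Welch or the pooled-variance form applies to the tables in this chapter is not stated, so that choice is an assumption, and the accessibility values are placeholders rather than data from this work. Converting t to a P-value requires the Student t distribution from a statistics package and is omitted to keep the example free of dependencies.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Placeholder percent accessibilities, not measured values.
class_sites = [2.0, 5.0, 3.0, 4.0]
all_sites = [30.0, 45.0, 10.0, 60.0, 25.0, 50.0]

t, df = welch_t(class_sites, all_sites)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A strongly negative t for a phenotypic class relative to all residues corresponds to the situation described for the h1b class, in which the class is on average more buried than the protein as a whole.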

Table 5-2.i: Significance of average percent solvent accessibility in phage lambda gpE protein monomer

P-value of two sample t-test | All residues | Unclassified residues | P1 | P2 | H1a | H1b | H2
All residues | -
Unclassified | 0.9603 | -
P1 | 0.8069 | 0.8113 | -
P2 | 0.899 | 0.8926 | 0.7776 | -
H1a | 0.9593 | 0.9625 | 0.9174 | 0.9166 | -
H1b | <2.2e-16 | <2.2e-16 | 0.1297 | 0.06396 | 0.3044 | -
H2 | 0.7996 | 0.8051 | 0.9733 | 0.7753 | 0.9334 | 0.08501 | -

Statistically significant cells (red) mark classes of phenotypic substitution sites which might alter the capsid structure at this level.

In the asymmetric heptamer, no significant difference was measured in mean percent solvent accessibility comparing small prohead (h1a), DNA packaging defective head (h2), or uncharacterized residues versus all residues (Table 5-2.ii, P > 0.05). Thus the small prohead and DNA packaging defective phenotypes most likely affect interactions above the level of the hexamer or asymmetric heptamer. The expansion defective prohead of normal size class (h1b) again showed a significantly lower mean percent solvent accessibility compared to the full protein (P < 2.2 × 10⁻¹⁶), and all the individual classes (P < 0.05). Based on position, structural changes induced from those sites would affect the three-fold symmetry multimer/multimer interfaces.

The wide tube class (p2) also showed a significantly lower percent accessibility compared to the full protein (P < 10⁻³), the unclassified residues (P < 10⁻³), the DNA packaging defective head class (P < 0.05) and the small prohead class (P < 0.05). The p2 group was composed of seven copies of six residues, suggesting this significance may be robust. The narrow tube class (p1) had significantly lower percent solvent accessibility compared to all sites (P < 0.05), or the unclassified residues (P < 0.01). These two classes thus may act through structurally altering the hexamer. Based on the positions of these substitution sites, such alterations would modify the two-fold symmetry multimer/multimer interfaces (Figure 5-5).

Table 5-2.ii: Significance of average percent solvent accessibility in phage lambda gpE protein asymmetric unit

P-value of two sample t-test | All residues | Unclassified residues | P1 | P2 | H1a | H1b | H2
All residues | -
Unclassified | 0.5169 | -
P1 | 0.01049 | 0.007348 | -
P2 | 9.14e-4 | 0.0006122 | 0.4462 | -
H1a | 0.5454 | 0.5892 | 0.1031 | 0.04327 | -
H1b | <2.2e-16 | <2.2e-16 | 5.025e-6 | 1.445e-4 | 3.116e-4 | -
H2 | 0.7703 | 0.6574 | 0.0791 | 0.01516 | 0.4995 | 4.557e-9 | -

Statistically significant cells (red) mark classes of phenotypic substitution sites which might alter the capsid structure at this level.

Similarly, in the pentamer no significant difference was measured in mean percent solvent accessibility comparing small prohead (h1a), DNA packaging defective head (h2), or uncharacterized residues versus the full protein (Table 5-2.iii, P > 0.05). Once again the expansion defective prohead of normal size class (h1b) showed a significantly lower mean percent solvent accessibility compared to the full protein (P < 2.2 × 10⁻¹⁶), and all the individual classes (P < 0.05). The narrow tube class (p1) had significantly lower percent solvent accessibility compared to the full protein (P < 10⁻³), or the unclassified residues (P < 10⁻³). Thus the narrow polyhead tube class may affect both the hexamer and pentamer fold, based on site position altering the multimer/multimer interfaces.

In the embedded asymmetric heptamer, no significant difference was measured in mean percent solvent accessibility comparing small prohead (h1a), DNA packaging defective head (h2), or uncharacterized residues versus the full protein (Table 5-2.iv, P > 0.05). Thus these classes of mutants either act via alterations of structure below the level of statistical significance, or above the level of the icosahedron. Again the expansion defective class (h1b) showed a significantly lower mean percent solvent accessibility compared to the full protein (P < 2.2 × 10⁻¹⁶), and all the individual classes (P < 0.05). This is the level at which the predicted alterations to the three-fold symmetry site would have the greatest effect, and most likely the level at which this phenotype is achieved. The narrow tube class (p1) had significantly lower percent accessibility compared to the full protein (P < 10⁻⁶), the unclassified residues (P < 10⁻⁷), the defective heads (P < 10⁻³), and the small heads (P < 0.05). The wide tube class (p2) had significantly lower percent accessibility compared to the full protein (P < 0.005), the unclassified residues (P < 10⁻³), the defective heads (P < 0.01), and the small heads (P < 0.05). Thus polyheads result from alterations to interface surfaces at the levels of the hexamer, pentamer and whole capsid. This fits with the dramatic deformation of structure observed in these two classes.

Table 5-2.iii: Significance of average percent solvent accessibility in phage lambda gpE protein pentamer

P-value of two sample t-test | All residues | Unclassified residues | P1 | P2 | H1a | H1b | H2
All residues | -
Unclassified | 0.6274 | -
P1 | 0.0009434 | 0.0006488 | -
P2 | 0.0835 | 0.06796 | 0.5361 | -
H1a | 0.6059 | 0.6388 | 0.1235 | 0.2322 | -
H1b | <2.2e-16 | <2.2e-16 | 2.472e-6 | 2.353e-5 | 0.002643 | -
H2 | 0.4535 | 0.3883 | 0.09585 | 0.4016 | 0.4454 | 1.164e-6 | -

Statistically significant cells (red) mark classes of phenotypic substitution sites which might alter the capsid structure at this level.

Table 5-2.iv: Significance of average percent solvent accessibility in capsid embedded phage lambda gpE protein asymmetric unit

P-value of two sample t-test | All residues | Unclassified residues | P1 | P2 | H1a | H1b | H2
All residues | -
Unclassified | 0.5194 | -
P1 | 1.053e-7 | 5.1e-8 | -
P2 | 0.001037 | 0.0006913 | 0.8564 | -
H1a | 0.3616 | 0.3942 | 0.01850 | 0.02892 | -
H1b | 2.2e-16 | 2.2e-16 | 6.276e-8 | 0.0001855 | 0.0003116 | -
H2 | 0.9953 | 0.8766 | 0.0005005 | 0.008499 | 0.3911 | 1.350e-9 | -

Statistically significant cells (red) mark classes of phenotypic substitution sites which might alter the capsid structure at this level. P-value of change in solvent accessibility between sequential capsid assembly steps; unfolded > monomer > multimer > closed capsid. Narrow tubular polyheads (P1). Wide polyheads (P2). Small heads (H1a). Expansion resistant proheads of normal size (H1b). DNA packaging defective heads (H2).

Within each residue class, percent solvent accessibility did not change significantly between the monomer and the asymmetric unit, or between the monomer and the pentamer (P > 0.05). The specific residues which were buried in these transitions comprise the hexamerization and pentamerization interfaces respectively (Figure 5-5). Most residues in one multimerization interface also contributed to the other. Taken over all residues, those of the asymmetric unit and the pentamer were significantly less accessible to solvent than those of the monomer, as expected given the known stability of the complex (P < 0.05).

Overall the residues of the embedded asymmetric unit were significantly less accessible than those of the free asymmetric unit (P < 0.001), the pentamer (P < 0.01) and the monomer (P < 0.005). This significant difference derives primarily from unclassified residues, which were significantly less accessible in the embedded asymmetric unit than in the free forms (P < 0.05 for monomer and asymmetric unit, P = 0.07406 for pentamer), while all the phenotypic classes were not significantly different (P > 0.05). The specific residues which were buried in the transition from the multimer models to the embedded multimer model comprise the multimer/multimer interfaces (Figure 5-5).

All but the small prohead defect class show obvious shifts in solvent accessibility in a subset of members during the formation of the asymmetric unit and pentamer from the monomer form in pairwise plots. These different levels of burial at the different stages of capsid formation support the model of mechanisms I proposed in the preceding section. Thus the different classes of phenotypic mutations show discrete enrichment in subsets of the interfaces governing the formation of the phage lambda capsid complex.

5.3 Predicted mechanisms governing the structural features of the lambda capsid predict the similarities and differences observed between capsids and encapsulins

The physical features affected by the published phenotypic mutants are conserved across capsids in general, including icosahedral symmetry, variable size, selective packaging, and inducible expansion. Thus findings from study of gpE may be informative to understanding the factors governing other capsid-like compartments, including encapsulins. With the high degree of structural conservation among capsids, and the fundamental nature of these protein interaction mechanisms within that structure, I extrapolated these mechanistic trends to predict site specific similarities and differences in other capsids and encapsulins, based on published phenotypic differences. This tested whether the model could be generalized beyond the gpE capsid.

These other capsids exhibited the predicted conservation and variation in amino acid properties at phenotypic positions identified by my model (Table 5-3). Where capsids or encapsulins were structurally or phenotypically similar to the wildtype or mutant lambda capsids, amino acid properties at those sites were conserved to either the wildtype or mutant states respectively. For instance, small capsids and encapsulins had large hydrophobic residues (T. maritima encapsulin Leu81, Caulobacter sp. gene transfer agent Tyr195) at the position homologous to the Val78Ile mutation inducing small capsids in lambda. In the case of the gene transfer agent, the capsid is as small as possible under the constraints of monomer size and icosahedral geometry. Encapsulins, for which expansion has yet to be observed, encode polar residues (T. maritima Thr252) at the position homologous to the Ala324Thr expansion defective lambda mutation. The reverse is true of the PfV encapsulin, with alanine in the structurally analogous 330 position. This encapsulin shows size variation that may signify previously unrecognized expansion 4, potentially in response to ion load. Thus a subset of encapsulins may undergo functionally non-essential expansion. Enterobacteria phage T5 requires an accessory domain (delta) to prevent polyhead formation, and encodes Thr231 at the position of the lambda polyhead mutation Pro75Ser.

Table 5-3: Site specific divergence and conservation between capsid and encapsulin structures fit model of conservation and divergence in physical properties

| Phenotype | gpE mutation | gpE interface | PfV residue | PfV interface | TmEncap residue | TmEncap interface |
|---|---|---|---|---|---|---|
| No heads | Tyr4Ser | Adj to penta mono/mono interface | Tyr112 | Adj to hexa mono/mono interface | Phe6 | Adj to penta mono/mono interface |
| Wide tube polyhead | Gly46Glu | Penta/hexa interface | Gly154 | Penta/hexa interface | Gly56 | Penta/hexa interface |
| Narrow tube polyhead | Pro75Ser | Core adj to hexa/hexa interface | Pro164 | Core near hexa/hexa interface | Pro77 | Core near hexa/hexa interface |
| Small prohead | Val78Ile | Core near 3 fold sym site | Leu168 | Core near 3 fold sym site | Leu81 | Core near 3 fold sym site |
| Small plaques | Tyr106Ser | Adj to mono/mono interface | Tyr184 | Adj to mono/mono interface | Leu104 | Adj to mono/mono interface |
| DNA packaging defect | Arg222Cys | Loop on inner surface | Glu276 | Loop on inner surface | Glu199 | Helix on inner surface |
| Small prohead | Thr284Ile | Core near 3 fold sym site | Val307 | Core near 3 fold sym site | Ile228 | Core near 3 fold sym site |
| Expansion defective | Ala324Thr | 3 fold sym site | Ala330 | 3 fold sym site | Thr252 | 3 fold sym site |
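The site-by-site comparison summarized in Table 5-3 can be illustrated with a minimal sketch. The residue groupings below (`HYDROPHOBIC`, `LARGE`) and the function names are illustrative assumptions, not the classification scheme used in this work; the example only shows the logic of asking whether a homologous residue resembles the wildtype or the mutant lambda state.

```python
# Coarse residue property classes. These groupings are illustrative
# assumptions, not the property scheme used in the thesis.
HYDROPHOBIC = set("AVLIMFWYC")
LARGE = set("LIMFWYRKHQE")


def properties(residue):
    """Return (polarity class, size class) for a one-letter residue code."""
    polarity = "hydrophobic" if residue in HYDROPHOBIC else "polar"
    size = "large" if residue in LARGE else "small"
    return polarity, size


def classify_site(homolog, wildtype, mutant):
    """Label a homologous residue by which lambda gpE state it resembles.

    If wildtype and mutant share the same coarse properties, the
    wildtype label takes precedence.
    """
    if properties(homolog) == properties(wildtype):
        return "wildtype-like"
    if properties(homolog) == properties(mutant):
        return "mutant-like"
    return "neither"


# Two rows of Table 5-3: (phenotype, gpE wildtype, gpE mutant, PfV, TmEncap)
sites = [
    ("small prohead (Val78Ile)", "V", "I", "L", "L"),
    ("expansion defective (Ala324Thr)", "A", "T", "A", "T"),
]
for phenotype, wt, mut, pfv, tm in sites:
    print(phenotype, "PfV:", classify_site(pfv, wt, mut),
          "TmEncap:", classify_site(tm, wt, mut))
```

Under these assumed classes, both encapsulins resemble the small-capsid mutant state at the Val78Ile site, while only the T. maritima encapsulin resembles the expansion defective state at the Ala324Thr site, matching the pattern described in the text.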


Capsids with shared phenotypic properties share conserved amino acid properties at predicted phenotypic sites. These included the shared features of the capsids of phages T5 [gp149], HK97 [gp5], and lambda [gpE] (Table 5-4). Key interface and core positions were conserved between lambda and HK97 (e.g. gp5_N146 and gp149_E172 align to gpE_E37, where E37F causes lethal disruption of the multimer/multimer interface; gp5_L203 aligns to gpE_Y106 in the hydrophobic cores, where Y106S forms smaller than wildtype plaques and Y106Q fails to produce heads).

Positions in lambda affecting polyhead formation are conserved with the lambda wildtype properties in the HK97 capsid (e.g. gp5_A147 aligns to gpE_Y40, where Y40S produces narrow polyheads and Y40Q fails to form heads; gp5_N306 aligns to gpE_S240, where S240F produces wide polyheads). Positions in lambda affecting capsid size are conserved between lambda and HK97 capsids, which are of similar size (e.g. gp5_E165 aligns to gpE_S59, where S59F produces smaller capsids). Similar trends were observed for the other altered structural properties.

Differences at key positions highlighted by the model explain differences in compartment structural properties between capsids and encapsulins.

Table 5-4: Diverse capsids sharing structural features conserve positions predicted by the model to confer those features. Available on appended CD or online at: https://www.dropbox.com/s/0ozt1kfpfkv0zok/radfordThesis_multiCapsidStructureMapAlignment.xls


5.3.i Validation of a predicted novel phenotypic mutation in the Phage HK97 capsid.

The model predicts 49 additional phenotypic non-lethal amino acid substitutions in lambda and four other tractable capsids/encapsulins. One of these was tested and found to produce unique aberrant capsid particles consistent with the predictions of the model. The model predicted that an Ala147Ser substitution in the Escherichia coli siphophage HK97 capsid protein, gp5, would act as a narrow polyhead mutant producing large asymmetric hexamer enriched particles. When I expressed the HK97 gp5 A147S protein it was moderately soluble, under IPTG induction, in E. coli BL21. This reduced solubility fits with the large asymmetric nature of polyheads, which can behave as structured aggregates and may or may not remain soluble. Alternatively, misfolded protein may also form insoluble aggregates.

Under glycerol gradient fractionation the HK97 primary capsid portal, gp3, and capsid shell protein, gp5, fractionate as a much larger complex in the mutant than observed for wildtype, both in my hands and in the literature (Figure 5-6.i). When the fraction normally containing these proteins was observed under transmission electron microscopy (TEM), few capsid-like particles were present and these showed evidence of disruption and instability. An example wildtype HK97 head preparation is shown for reference in Figure 5-6.ii. When my glycerol gradient fraction containing the larger complex was visualized by TEM, it revealed diverse polyhead-like structures (Figure 5-6.iii). These included very large hemisphere/tube complexes, enlarged pseudo-spherical particles similar to disrupted enlarged capsids, and a unique polyhead intermediate which I term "lemon heads", due to the resemblance to the fruit. These appeared as large oblong hollow particles with an internal tube traversing the long axis. Thus my model is supported as explaining existing known phenotypic differences between capsids and predicting the effects of new amino acid substitutions. Further mutagenesis can strengthen this validation of the applicability of the model to other capsid-like folds, including encapsulins.


Figure 5-6.i: Escherichia siphophage HK97 gp5 A147S, predicted narrow polyhead mutant, produces aberrant complexes consistent in size with narrow polyheads on glycerol gradient.

Ultracentrifugation fractions over a 10-30% glycerol gradient of Escherichia siphophage HK97 gp5 A147S expressed in Escherichia coli BL21.


Figure 5-6.ii: Wildtype Escherichia siphophage HK97 prohead I and prohead II

Transmission electron microscopy of negative stained wildtype prohead morphologies of the Escherichia siphophage HK97 wildtype gp5 capsid expressed from E. coli BL21. Purified by 10-30% glycerol step gradient. Microscopist, Kris Hon of Davidson lab.


Figure 5-6.iii: Escherichia siphophage HK97 gp5 A147S, predicted narrow polyhead mutant, produces polyhead related particles under transmission electron microscopy.

Transmission electron microscopy of negative stained mutant head morphologies of the Escherichia siphophage HK97 gp5 A147S capsid expressed from E. coli BL21. Purified by 10-30% glycerol step gradient.


5. Discussion:

Comparison of capsids and encapsulins reveals many similarities at all levels, while also demonstrating diversity in sequence, form and aspects of functionality. Mutational analysis of the lambda capsid and phylogenetic analyses of other capsids and encapsulins reveal that these similarities are highly consistent with an intertwined shared history. The distinctive geometrically predictable icosahedral structure of both capsids and encapsulins was revealed to be the product of conservation and co-variation of a relatively small set of interface residues, maintained in both kingdoms. The size and even symmetry of this shared structure was revealed to be malleable in response to minor changes to these residues. Differences in these positions were found to be predictive of differences in known and newly mutagenized capsids. The residues determining DNA packaging potential, which observationally differs so markedly between capsids and encapsulins, were also found to be a very small set, the mutation of any of which was sufficient to block DNA packaging. The expansion process is also a prominent property of capsids, not observed in encapsulins. Again, any one of multiple single amino acid substitutions was sufficient to block this expansion and produce a more encapsulin-like capsid.

This position specific sensitivity is particularly striking in the face of the extreme diversity of capsid and encapsulin sequence overall, presenting strong evidence for function driven selective constraints. Capsids are thus highly covariant in sequence, in contrast to the traditional view of proteins as broadly tolerant of mutation 15,80. This model can guide comparison of capsid-like proteins, including encapsulins, by focusing on the phenotypically and structurally important positions while filtering out sequence diversity noise from less influential positions.


Chapter 6: Caudovirales phage capsids share intermingled ancestry and descent with prokaryotic encapsulins

6. Chapter Abstract:

Encapsulins and capsids are highly similar on multiple levels of sequence and structure.

More precise determination of the phylogenetic relationships among these classes clarified how these classes are most likely to relate to each other. Markov Chain clustering, Bayesian inference, Maximum likelihood, and Bayesian co-estimation phylogeny prediction methods consistently predict a shared intermingled phylogeny for encapsulins and capsids. The clustering identifies multiple groupings of capsids and encapsulins as most closely related to each other.

Focusing on the main lineage of classical encapsulins, alignable novel encapsulins and capsids identifies seven predicted conversion events between encapsulins and capsids. These conversions represent three families of predicted inter-related capsids and encapsulins from three distinct phyla. The BcepMu-like phages are predicted to derive their capsids from ancestral phosphate metabolism encapsulins, related to the classical Sulfolobus encapsulin fusion family. A group of novel encapsulins is predicted to be derived from a small family of Mycobacterium phage capsids. As such, the capsids and encapsulins are predicted to historically sample from a shared pool of proteins used and interconverted by both kingdoms.


6. Introduction:

Capsids and encapsulins share structural and sequence similarity. These similarities suggest phylogenetic relationships exist, but do not immediately indicate what these relationships are.

Encapsulins may be a monophyletic offshoot of capsids, or capsids may stand as a descendant adaptation of encapsulins, or capsids and encapsulins may be co-mingled converting between phage and prokaryotic function repeatedly. Due to the diversity of sequences involved, multiple robust methods of clustering or phylogeny prediction are required to assess these relationships.

Examination of the genome context of some intermediate homologs of these proteins gives further insights into the forms and processes that follow these conversions.

Clustering methods, such as Markov chain clustering (MCL), group proteins based on similar features, in this case sequence similarity 29. MCL is an adaptation of connectivity flow models in graph theory, whereby proteins are treated as nodes in a graph, with similarity to other proteins represented by edges of differing thickness. MCL identifies clusters by choosing random starting points and moving from one node to another stochastically based on edge weight, defining a connected path of nodes 29. Closely related proteins belonging to the same cluster will occur very frequently together in these stochastic paths, while more weakly similar proteins will not. Clusters can thus be distinguished based on which nodes occur together with a certain frequency. This method is a robust way to group similar proteins and predict which proteins are most closely related to each other, even among divergent proteins or large datasets.

However, MCL lacks the ability to make fine grain predictions of the magnitudes and directions of relationships between clustered proteins 29. MCL clustered proteins can be considered related with confidence, but how they relate can be less definitive.
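The flow-based clustering described above can be illustrated with a small sketch. In its standard formulation, MCL simulates the random walks deterministically by alternating an expansion step (squaring the column-stochastic flow matrix) with an inflation step (raising entries to a power and renormalizing columns). The pure-Python code below is a toy illustration under that assumption; the function names, the six-protein similarity matrix, and the parameter defaults are hypothetical and are not the software or settings used in this work.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(weights, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster nodes of a weighted similarity graph by simulated flow.

    weights: symmetric n x n list of lists of edge weights (0 = no edge).
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(weights)
    # Self-loops keep flow from oscillating between neighbouring nodes.
    flow = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: two random-walk steps (matrix squaring).
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows and weaken weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tolerance:
            break
    # Rows that retain flow are attractors; their non-zero columns are members.
    clusters = {frozenset(j for j in range(n) if flow[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(cluster) for cluster in clusters if cluster)


# Hypothetical similarities: two tight families joined by one weak edge.
similarity = [
    [0.0, 0.9, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.05, 0.0, 0.0],
    [0.0, 0.0, 0.05, 0.0, 0.9, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.9, 0.0],
]
print(markov_cluster(similarity))
```

Raising the inflation value generally yields finer clusters, which is one way a series of runs with different parameters can be arranged into a hierarchy.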


This is where phylogenetic methods apply, providing pairwise, as well as cluster level, prediction of relationships between proteins of interest. However, reliable, robust phylogenetic prediction by established methods is computationally expensive in time, memory, and complexity. This precludes the detailed evaluation of very large, very diverse datasets. Instead, focused subsets are evaluated rather than attempting to place all proteins from a cross kingdom spanning super-family, such as the encapsulins.

Bayesian Inference (BI) phylogeny 46 approaches the question of relationships from the perspective of posterior probability. Specifically, how probable is the observed set of alignment-based sequence similarity relationships if the true phylogenetic relationships between these proteins follow a particular topology? Initially, a selection of crudely structured, minimally informed trees is considered. The tree giving the strongest probability of producing the observations is retained as a starting point for the next round of possible trees, generated by randomly adding or swapping zero or more branch points 46. This process is repeated until fitting and probability criteria are satisfied, identifying the evolutionary tree which best fits the available data, including predicted distances between sampled proteins and statistical measurements of how confident each predicted relationship between proteins actually is 46.
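As a rough illustration of how sampling yields posterior probabilities, the sketch below runs a Metropolis-Hastings sampler over three candidate topologies. It assumes a uniform prior, so the posterior is proportional to the likelihood, and it proposes topologies uniformly at random, whereas practical Bayesian phylogenetics software proposes local tree rearrangements and also samples branch lengths. The log-likelihood values and all names are hypothetical.

```python
import math
import random


def sample_topologies(log_likelihoods, steps=50000, seed=1):
    """Estimate posterior probabilities of topologies by MCMC sampling.

    log_likelihoods: dict mapping a topology label to log P(data | topology).
    A uniform prior is assumed, so posterior is proportional to likelihood.
    """
    rng = random.Random(seed)
    names = sorted(log_likelihoods)
    current = names[0]
    counts = dict.fromkeys(names, 0)
    for _ in range(steps):
        proposal = rng.choice(names)  # symmetric proposal distribution
        log_ratio = log_likelihoods[proposal] - log_likelihoods[current]
        # Accept better trees always, worse trees with probability exp(ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        counts[current] += 1
    # The fraction of samples spent on a topology estimates its posterior.
    return {name: counts[name] / steps for name in names}


# Hypothetical log-likelihoods for the three rooted trees of three taxa.
log_likelihoods = {"((A,B),C)": -100.0, "((A,C),B)": -101.0, "((B,C),A)": -102.0}
posterior = sample_topologies(log_likelihoods)
print(posterior)
```

The posterior probability reported for a partition in a consensus tree is, in the same spirit, the fraction of sampled trees that contain that partition.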

Maximum Likelihood (ML) phylogeny 45 takes the alternative approach of generating a crude tree from the available sequence similarity information, then refining it to optimize the evolutionary likelihood based on a chosen amino acid substitution model. By iteratively adding, removing, or flipping branch points, the predicted best tree approaches a maximized likelihood within the constraints of the evolutionary model and the alignment data 45. Bootstrapping statistically quantifies the robustness of each branch point in the predicted tree. This involves repeating the analyses with the exclusion of small subsets of the data, producing a population of related but less informed trees. The frequency with which specific groupings occur in these bootstrapped trees defines the confidence of that grouping in the maximum likelihood tree 45.
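The bootstrap procedure can be sketched as follows. The example assumes the common non-parametric form, in which alignment columns are resampled with replacement so that each replicate omits some columns and duplicates others. A trivial "closest pair by mismatch count" rule stands in for the tree search, since a full likelihood optimization is beyond a short example; the sequences and names are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def closest_pair(alignment):
    """Stand-in for tree inference: the pair with the fewest mismatches."""
    names = sorted(alignment)
    best = None
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            distance = sum(x != y for x, y in
                           zip(alignment[first], alignment[second]))
            if best is None or distance < best[0]:
                best = (distance, frozenset((first, second)))
    return best[1]


def bootstrap_support(alignment, grouping, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(closest_pair(resample_columns(alignment, rng)) == grouping
               for _ in range(replicates))
    return hits / replicates


# Hypothetical aligned sequences of equal length.
alignment = {"capsidA": "MKLVAT", "capsidB": "MKLVAT", "encapC": "QQQQQQ"}
print(bootstrap_support(alignment, frozenset(("capsidA", "capsidB"))))
```

A grouping recovered in every replicate receives support of 1.0 (100%), while groupings that depend on only a few alignment columns receive lower values.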

Bayesian co-estimation 48,49 attempts to escape alignment biases, by predicting both the tree topology and multiple sequence alignment concurrently. This method utilizes the same principle of estimating posterior probability, but applied to both linear multiple sequence alignment and two-dimensional directed acyclic graph generation in the form of the phylogenetic tree 49. This generates a tree and alignment co-dependently optimized for best fitting the sequences involved.

The size of dataset applicable to co-estimation is, however, dramatically smaller than for the ML or BI methods, and the proteins must be much more similar, as minimal information can be incorporated into the de novo alignment prediction.

6. Methods:

6. Multiple robust methods were compared to confidently predict the consensus best fit phylogenetic relationship between encapsulins and capsids

Markov Chain Clustering 29 was applied to the combined set of capsids and predicted encapsulins to determine sequence similarity families at the broadest level of sequence diversity.

To avoid alignment artifacts, edge weight was assigned based on global pairwise sequence similarity determined by automated Myers and Miller global alignment, using the EMBOSS stretcher implementation 87. Clustering parameters were varied over multiple runs to define a hierarchical clustering of sequence similarity for all the proteins of interest.
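To make the edge-weighting step concrete, the sketch below derives a weight from an already computed global pairwise alignment. It assumes percent identity over the full alignment length as the similarity measure; the exact score taken from the stretcher output is not specified here, so this choice, the function names, and the example sequences are illustrative assumptions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows.

    Both inputs are rows of one global pairwise alignment, with '-' for gaps.
    Gap columns count toward the length but never as identities.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * identical / len(aligned_a)


def edge_weights(pairwise_alignments):
    """Map each (name_a, name_b) pair to its similarity-based edge weight."""
    return {pair: percent_identity(row_a, row_b)
            for pair, (row_a, row_b) in pairwise_alignments.items()}


# Hypothetical pairwise alignments keyed by protein pair.
alignments = {
    ("encapsulin_1", "capsid_1"): ("MK-LV", "MKALV"),
    ("encapsulin_1", "encapsulin_2"): ("MKLV", "MKLV"),
}
print(edge_weights(alignments))
```

Because each pair is aligned independently, one poorly aligned protein cannot distort the weights of unrelated pairs, as it could in a single multiple sequence alignment.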

From this clustering tree I extracted the major branch encoding the predicted classical encapsulins, together with the co-clustered, most closely similar capsids and predicted novel encapsulins.

These sequences were initially aligned using the MAFFT alignment algorithm 28. Bayesian Inference (BI) 46 and Maximum Likelihood (ML) 45 phylogenetic prediction methods were applied to this mixed alignment of predicted classical encapsulins, novel encapsulins, and capsids to more robustly define best fit phylogenies within this group. Robust phylogeny of the full set was not feasible due to size and limitations in available computational facilities.

Phylogeny prediction was repeated using four different outgroups to account for any artifacts of outgroup choice. The chosen outgroups were the divergent predicted novel encapsulins from Ehrlichia canis str. Jake and Wolbachia endosymbiont of Culex quinquefasciatus Pel, and the capsid-like dihydrodipicolinate reductases of Stigmatella aurantiaca DW4/3-1 and Myxococcus xanthus DK 1622. The BI analyses were run for 3000000 generations each, with 5 parallel chains, and two parallel runs, using the WAG amino acid model 88. The ML analyses were run with 100 bootstrapping cycles, using the WAG amino acid model, and automatically choosing the best of NNI and SPR moves. The WAG model was chosen as it was the predicted best available model for the data according to ProtTest 89. Bayesian Co-estimation (BC) 48 analysis, using the StatAlign implementation 49, was applied to interesting subtrees from the consensus trees predicted in the BI and ML analyses, to account for the potential of alignment artifact bias.


6. Results:

6.1 Phage capsids and encapsulins share an intermingled predicted ancestry, with multiple distinct predicted functional conversions

Due to the diversity of capsid and encapsulin sequences under consideration multiple robust methods are desirable to accurately assess the ancestral relationships between these two kingdoms of capsid-like compartments. Phylogenetic analyses using the methods above consistently revealed a mixed hierarchy of descent for encapsulins and bacteriophage capsids.

Unlike simpler methods such as Neighbour Joining, Maximum Likelihood, Bayesian Inference and Bayesian Co-estimation give confidence measurements for each partition, so that trees can be evaluated clearly and separately at each level and predicted relationship rather than only as a single interdependent whole. Many capsid families were predicted to have arisen from encapsulin ancestors, and a number of encapsulins were predicted to have arisen from ancestral phage capsids (Figure 6-1). Multiple replicates of each method, using different outgroups, yield very similar optimal phylogenies. Long branch effects shift the relative positions of a few nodes without altering the major implications of these trees.


Figure 6-1: Consensus tree from Bayesian Inference, Maximum Likelihood phylogenetic prediction analysis of capsids and encapsulins clustered in proximity to classical encapsulins.

Condensed consensus predicted phylogeny of the major encapsulin lineage including the classical encapsulins, alignable phage capsids and novel encapsulins. Two outgroups shown; Ehrlichia canis str. Jake divergent predicted novel encapsulin shell protein and Wolbachia endosymbiont of Culex quinquefasciatus Pel divergent predicted encapsulin shell protein. Groups labelled above external nodes. All nodes labelled with confidence metrics (Posterior probability:Bootstrapped Maximum Likelihood frequency).


Using the divergent predicted encapsulin from Ehrlichia canis str. Jake or from the Wolbachia endosymbiont of Culex quinquefasciatus Pel as the outgroup yields the same optimal Bayesian Inference (BI) tree topology. Using the capsid-like dihydrodipicolinate reductases of Stigmatella aurantiaca DW4/3-1 or Myxococcus xanthus DK 1622 as outgroups produces a nearly identical optimal tree, with only two regions of topological difference. The first was the relative position of two capsids up or down one level within a large consistent confident capsid family without predicted encapsulins. In all analyses this partition was of moderately low confidence (posterior probability [pP]: 0.71 <= pP <= 0.81) and uninformative to the relationship between capsids and encapsulins. The second difference was the relative order of common ancestry between three small encapsulin families, surrounded by consistent confident encapsulin lineages. This region was also not informative to the relationship between capsids and encapsulins, and contains some low confidence partitions. Between the two topologies the Wolbachia based tree shows more confidence in this region (0.69 <= pPWol <= 0.99 vs. 0.51 <= pPMxan <= 0.80), and thus was likely more representative of the true ancestry.

The Maximum Likelihood (ML) optimal trees for these outgroups were very similar. When comparing PhyML and Bayesian trees, the differences comprise minor rearrangements in relative node positions within the parsimonious lower confidence sections of the tree, with no change to the implications of groups combining encapsulins and capsids. As expected, the likelihood and bootstrapping statistics follow a similar pattern and distribution throughout the tree as the posterior probabilities do in the Bayesian trees, with the maximum likelihood percentage scores being numerically lower overall than the posterior probabilities for the same partitions.

Based on these phylogenies, during the divergence of the 183 proteins examined, capsid and encapsulin ancestors converted at least seven times, generating non-parsimonious groups of varying sizes. These conversions can be collapsed into three major families: the BcepMu-like phage capsids, which were predicted to have derived from encapsulin ancestors (Figure 6-2); a small family of assorted encapsulins predicted to have descended from a prophage ancestor in common with a family of Mycobacteria phage capsids; and the family of Thermotoga encapsulins predicted to share a prophage common ancestor with capsids from a family of Mu-like phage spanning a broad range of hosts. The topologies of these intermixed families were evaluated and confirmed by Bayesian Co-estimation as optimal independent of alignment bias.
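A lower bound on the number of conversions implied by a tree can be illustrated with Fitch parsimony, a standard way to count the minimum number of state changes on a rooted binary tree. This is offered only as a sketch of the counting logic; it is not necessarily the procedure used to arrive at the seven conversions reported here, and the example trees and labels ("E" for encapsulin, "C" for capsid) are hypothetical.

```python
def fitch_changes(tree):
    """Return (possible root states, minimum state changes) for a tree.

    tree: a leaf label such as "E" or "C", or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch_changes(left)
    right_states, right_changes = fitch_changes(right)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no change needed here.
        return shared, left_changes + right_changes
    # Children disagree: one conversion must occur below this node.
    return left_states | right_states, left_changes + right_changes + 1


# Hypothetical topology: encapsulins (E) and capsids (C) intermingled.
tree = (("E", "E"), ("C", ("E", "C")))
states, conversions = fitch_changes(tree)
print(states, conversions)
```

A tree in which the two classes form separate clades needs only one change, so a count well above one indicates that capsids and encapsulins cannot be split into two complete lineages.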

The BcepMu-like family of phage capsids was consistently and confidently predicted to be most similar to and most closely related to novel phosphate metabolism encapsulins, and predicted to have ancestrally arisen from classical rubrerythrin-fusion encapsulins (Figure 6-2). This was the same family of phage implicated in the transfer of Burkholderia classical peroxidase encapsulins (Chapter 3.II). The subtree depicted was predicted as optimal consistently by BI, ML, and BC analyses. The subtree was rooted based on the larger consensus BI/ML phylogeny.

Nearest the BcepMu representative, the Haemophilus parasuis candidate novel encapsulin, ZP_02478675, was predicted as an inorganic pyrophosphatase compartment. No other proteins similar to phage were encoded in proximity. Based on pblast, this gene was similar to three diverse phage capsids (12.4 <= % identity <= 45.0 over 174-266 aa, 5e-63 <= E-value <= 0.001). Although annotated in NCBI as an inorganic pyrophosphatase, no pyrophosphatase features were evident in this sequence, nor was experimental validation listed. This gene was encoded five ORFs away from a confident inorganic pyrophosphatase enzyme, ZP_02478670, with HMM matches and sequence similarity to functional pyrophosphatases, including the predicted active site. This pyrophosphatase also encodes two conserved, encapsulin association specific, surface exposed motifs discussed in appendix 4.


Figure 6-2: Phage BcepMu capsid family derives from encapsulin ancestors and is more similar to novel encapsulins than phage capsids outside of this family.

BcepMu-like phage capsids derive from an encapsulin ancestor and are more closely related to H. parasuis encapsulin than Phage B3 capsid, or others beyond root of subtree. Consensus subtree from Bayesian Inference, Maximum Likelihood, and Bayesian Co-estimation analyses of capsids and encapsulins predicted related to BcepMu-like phage.


The predicted encapsulin/capsid intermediate, YP_531441, from Rhodopseudomonas palustris BisB18 shares properties of both a candidate encapsulin and a prophage capsid. This protein was encoded in close proximity with a number of prokaryotic enzymes of uncharacterized function. The adjacently encoded protein, YP_531442, significantly matches the DUF1018/SHOCT model, PF06252 (E = 3.4E-8). This domain is of uncharacterized function, but is similar to the gas vesicle forming/regulating proteins of the GvpG family 93, PF05120, and regulatory protein binding domain, PF06464 92. In connection with the capsid-like candidate encapsulin protein, YP_531441, this suggests SHOCT proteins may also contribute to compartmentalized function.

Upstream is encoded an uncharacterized protein, YP_531437, with similarity to PF07805, representing the bactericidal HipA toxin. Expression of HipA, without HipB to sequester it, is lethal to bacteria including E. coli 92. In complex with HipB, HipA is an effective regulator of the cell wall during replication 92. Sequestration of YP_531437 by this capsid-like protein may act similarly. These families of proteins are specific to prokaryotes and not encoded in phage.

The genome context of these proteins also includes a degraded tapemeasure-like protein (YP_001111218 34% identity over 150 out of 722aa by pblast), a degraded head protease (NP_050636 21% identity over 183 aa by psiblast), and candidate Mu Mor-like regulator protein (NP_050621 6.7% identity over 118 aa by psiblast; E-value=0.036, 25% identity over 35 aa by pblast) in addition to the candidate encapsulin/capsid intermediate. The sequence divergences of these proteins are consistent with degrading prophage remnants. Blast using pblast identifies three hits to phage capsids (10.8-42.8% identity over 221-305aa, 1e-62 <= E-value <= 0.0002) for this gene. This level of conservation is atypical of a degrading prophage remnant.

Significantly enriched enzymatic and regulatory families, including metallo-beta-lactamase domain-containing protein and regulatory helix-turn-helix proteins, were encoded 20-22 ORFs upstream from this candidate encapsulin. This region thus encodes two prokaryotic proteins related to compartmentalization/sequestration, three prophage remnant proteins, and an atypically conserved capsid-like protein which may function as an encapsulin.

The predicted encapsulin, ZP_02599408, from Bacillus cereus H3081.97 also lacks a definitive metabolic function. Proximity to an enriched PhoH phosphate starvation response operon suggests a role in phosphate metabolism, but the family sample size was too small to declare significance. No other phage-like proteins were found in proximity, distinguishing this from a capsid/encapsulin intermediate. This sequence was highly diverged from other encapsulins and capsids in the group, allowing for divergence in targeting motif and enzyme family. Pblast identifies high scoring pair alignments with the Aeromonas phage phiO18P capsid protein, YP_001285647 (31% identity over 273aa, E-value = 0.055), and the Lactobacillus phage phig1e capsid protein, NP_695160 (30% identity over 74aa, E-value = 0.14). If this is not some form of selectively advantageous novel encapsulin, but instead a cryptic prophage remnant, this level of conservation of the capsid sequence and fold would be unexpected, given the complete absence of other similarity to phage genes surrounding ZP_02599408.

The other nine predicted encapsulins in this group were all classical fusions between N-terminal rubrerythrin-like domains and PfV-like encapsulins. No other phage-like proteins were encoded in proximity to these genes. All these predicted encapsulins were confidently similar to the published Pyrococcus virus-like particle encapsulin (e.g. NP_376906; 25% identity over 208aa, E-value = 1e-8). Their position nearest the root of this subtree suggests that the common ancestor of the full group was likely an encapsulin similar to the classical encapsulins. Rooting is derived from the outgroups discussed above for Figure 6-1.

Thus the representative capsids in the tree (BcepMu, phiE255, etc.) were predicted to descend from a rooted ancestral lineage comprised of encapsulins, followed by the classical fusion encapsulins, and the B. cereus novel encapsulin. Within the capsid lineages are the H. parasuis candidate encapsulin, predicted as derived from a capsid ancestor, and the R. palustris intermediate, which also may have derived from a capsid or, more parsimoniously, an encapsulin ancestor. As such, both encapsulin to capsid and capsid to encapsulin conversions are represented, but in this context the stronger case is for an encapsulin to capsid conversion generating the BcepMu-like capsid family. The predicted H. parasuis and R. palustris conversions are of lower confidence due to the posterior probabilities and restricted distribution of these two candidate encapsulins.

This functional interchange between capsids and encapsulins predicts the existence of intermediate regions with properties of both prophage and encapsulins, although selection for genome efficiency eliminates non-functional remnants and drift abolishes junk signal such as the prophage remnant phage-like genes near the candidate Rhodopseudomonas palustris BisB18 novel encapsulin/capsid intermediate discussed above. Identification of a more confident prophage/encapsulin intermediate in Anaerofustis stercorihominis DSM 17244 strongly confirms this prediction of conversion intermediates between capsid and encapsulin regions (Figure 6-3).


Figure 6-3: Prophage rubrerythrin encapsulin operon intermediate in Anaerofustis stercorihominis DSM 17244

Encapsulin (EN): 25-27% identity over 169-258aa to all three published encapsulins, 12% identity to phage T1 capsid. Targeted rubrerythrin (EE): 35% identity to rubrerythrin, C-terminally 32% identity to Phage phiAsp2 capsid protease. Prophage portal protein (PO): 27% identity to Staphylococcus phage 71 portal. Prophage major tail protein (MT): 16-37% identity to major tail proteins from five phages. Prophage tail assembly components (red). Prophage cell entry or lysis enzymes (orange). Various phage replication enzymes (dark grey).

A. stercorihominis DSM 17244 encodes a confident linocin-type classical encapsulin (ZP_02862955, 25-27% identical to three published encapsulins over 169-258 aa) with an adjacent ferritin-like enzyme (ZP_02862955, 35% identical to PfV ferritin-like domain NP_578921). The C-terminal region encoding the canonical CEAS motif also encodes a short stretch of sequence identity to the C-terminus of the Actinoplanes phage phiAsp2 head protease.

These two genes were themselves encoded in place of the head region of a high confidence partial prophage including a portal gene, major tail tube gene, and a number of other phage assembly and replication components. As such, this region lacks some of the required components for a functional phage, but is a prime candidate for recombination to generate a novel phage, similar in sequence and gene order to Listeria phage A500 or Actinoplanes phage phiAsp2. Alternatively this region may accumulate further deletions and mutations eliminating the prophage signal until only the encapsulin shell and enzymes remain, as seen in encapsulin homologs.


6. Discussion

Comparison of capsids and encapsulins reveals many similarities at all levels, while also demonstrating diversity in sequence, form and aspects of functionality. The observation that minimal mutation induces conversion from the features of a capsid compartment to those of an encapsulin-like compartment is consistent with the phylogenetic relationships predicted between representatives of the two kingdoms. Evolutionary theory traditionally assumes kingdoms follow distinct and separate lineages, owing to the differences in primary features which defined the kingdoms as distinct. However, the physiology of viruses and their hosts lacks the same basis for assuming distinction, as the genomes of both virus and host occupy the same physical compartments during stages of diverse viral life cycles. Phylogenetic predictions consistently support multiple distinct conversion events: some capsids are predicted to descend from encapsulin ancestors, and conversely some novel encapsulins are predicted to descend from capsid ancestors. The acquisition of a minimal number of mutations would enable these predicted conversions, with the generation of novel phage capsids requiring the involvement of a prophage genome to adopt the encapsulin gene in place of its capsid gene.

In the case of the classical encapsulins, this family appears to have arisen from a branch of novel encapsulins, then diversified to package the six cargo enzyme families discussed above in various functional niches. Small families of novel encapsulins and capsids, such as the BcepMu-like phage capsids, are predicted to have arisen out of this family. Other capsid families are predicted to have arisen from a common ancestor of the novel encapsulin branch which produced the classical encapsulin family.


Further toward the root, the group of MMPI-like desulfurase associated novel encapsulins and FNR-sensor-like domain fusion encapsulins branched off, with the largest confident capsid lineages arising between the two large encapsulin families. This largest lineage of capsids within the phylogeny includes small predicted encapsulin families arising out of it. It is thus clear that neither the viral nor the prokaryotic adaptation of the encapsulin-like fold is definitively ancestral to the other. Even without depending on rooting, the phylogenies consistently show that capsids and encapsulins cannot be separated into complete, distinct lineages without conversions between kingdoms and functions. The applications of the capsid/encapsulin fold in viruses and prokaryotes are intrinsically not parsimoniously divisible.


Chapter 7: Conclusions

Capsid-like compartments are widespread across prokaryotes. Taking into account the evidence that phage capsids are directly related to encapsulins, this compartment family spans two super-kingdoms, with three taxonomic kingdoms: Viruses, Archaea and Bacteria. Even in eubacteria, predicted encapsulins are more broadly distributed than any other family of prokaryotic compartments identified. Rather than being restricted to specific taxa in limited niches, classical and novel encapsulins were identified in multiple niches, and are predicted to target multiple distinct enzymes with distinct functions. Thus encapsulins fulfill a more generalized compartmentalization role than other protein shells.

Both encapsulin shells and associated enzymes show strong evidence of horizontal transfer. These signals relate to their current genomes and each other in manners strongly suggestive of multiple horizontal transfer events in their histories. Various encapsulins show evidence of transfection, transposition, and recombination. As such, horizontal transfer is evidenced to have played an extensive role in the history and dissemination of encapsulins. Within this network of transfers, Caudovirales capsids fit as highly successful, polyphyletic specialized applications of the more generalized encapsulin-fold, and not as a distinct lineage unto themselves, nor as a parsimonious ancestral family to the encapsulin-fold. With the continuous high level of sequence diversity among capsids and encapsulins as a structural super-family, encapsulins/capsids exhibit the diversity required to fulfill compartmentalization of multiple functions. Further, this family is only the second form of prokaryotic compartment, after the beta-carboxysomes, to show evidence of natural in vivo separately encoded cargo. Encapsulins are also the first compartment class to utilize the same targeting motif interchangeably at both termini of cargo proteins.


With this level of horizontal transfer, along with the distribution of encapsulin-encoding organisms, selection is expected to have been influential in determining conservation of transferred genes. Analysis of the phenotypic features and environmental conditions under which encapsulin-encoding organisms are observed indicates that retention of encapsulins is coupled to niche-specific selective advantage. The classical encapsulins are enriched in multiple specific physiological niches, together with specific enzyme families sharing similar metabolic properties and pathways. Flanking gene analysis, along with comparative phylogeny, indicates encapsulins have been converted/co-opted to multiple diverse roles in segregation of toxic reactants and protection of sensitive enzymes, across different lineages, environmental niches, and metabolic requirements. This diversity includes six distinct targeted enzyme families among classical encapsulins, a large novel encapsulin family with two forms of thiol metabolism/cyanide resistance, and many smaller predicted novel encapsulin families.

Interactions with iron, molybdenum, copper, phosphate, arsenic and other conditionally toxic reactants are enriched in association with encapsulins of various families.

Phylogenetic analyses, along with select examples, indicate the three kingdoms have interconverted capsids and encapsulins repeatedly, with conversion between functioning as capsids for Caudovirales viruses and functioning as prokaryotic encapsulins in both Archaea and Bacteria. Thus capsids and encapsulins have arisen, developed, and diverged concurrently, with neither modern function confidently distinguishable as penultimately ancestral based on phylogenetic signal.

There is no strong argument for a de novo, ex nihilo origin of the encapsulin-like compartment in either prokaryotes or viruses, as the proteins required are too complex to spontaneously arise fully functional without an ancestor. Prokaryotes are favoured as the original source, based on an intrinsically higher theoretical capacity for generating de novo function, and for tolerating/retaining deleterious or neutral non-functional precursors. Viruses mutate faster, generally generating diversity at a faster rate, but due to this high mutation rate and factors of life cycle they are also less tolerant of maintaining deleterious or neutral non-functional proteins. Thus the data would tentatively best support the Caudovirales capsid originally arising from some form of proto-Archaeal encapsulin conferring resistance to some form of toxic, metallic, multivalent cation. This is purely a logical deduction based on the accepted properties of viruses and prokaryotes, which cannot be empirically supported or refuted.

The same structural mechanisms predicted to govern capsid form and function in the Enterobacteria phage lambda explain the conservation and variation in encapsulin form and function, and in other capsids. Mutagenesis experiments, both in the literature 10,11,12,13,63 and in my hands, demonstrated that minimal mutations are sufficient to convert many of the properties of a capsid to those of an encapsulin. Our mutational map model predicts the physical properties of capsids and encapsulins can be intelligently re-engineered for optimization of compartment size, cargo affinity and type, flexibility, and stability.


7.2 Future Directions

Building on this work will involve experimentally confirming the many computational predictions I made, and directly characterizing the novel encapsulin families I identified. Of particular interest are the members of the MMPI-like desulfurase associated novel encapsulin family and related FNR-sensor-like fusion domain novel encapsulins. This family represents the largest and most broadly disseminated family of encapsulins, and unlike classical encapsulins appears specialized for a single evidently broadly important function.

Further study will elucidate how the packaging of the multiple co-targeted CEAS motif encoding cargo enzymes contributes to function and selective advantage in members of the classical encapsulin family. These studies will also determine if the various classical encapsulin associated proteins identified are packaged through interaction with CEAS motif encoding cargo enzymes, or function externally. The precise mode of these functions will also be determined in this process.

Testing further predicted phenotypic mutations in capsids and encapsulins will further test my model explaining the mechanisms guiding structure and function in capsid-like proteins, and will enable capsids and encapsulins to be intelligently engineered for specific attributes and functional applications. Rather than depending on random mutagenesis and directed evolution techniques, it will be possible with minimal mutagenesis to modulate the size, stability, cargo affinities, and compartment shape for new biotechnological and industrial applications.


8. References

1. Kang S, Douglas T. 2010. Some enzymes need a space of their own. Science Biochem. 327: 42-43.

2. Sutter M, Boehringer D, Gutmann S, Gunther S, Prangishvili D, Loessner M, Setter K, Weber-Ban E, Ban N. 2008. Structural basis of enzyme encapsulation into a bacterial nanocompartment. Nature Struct. Mol. Biol. 9: 939-947.

3. Rosenkrands I, Rasmussen P, Carnio M, Jacobsen S, Theisen M, Andersen P. 1998. Identification and characterization of a 29-kilodalton protein from Mycobacterium tuberculosis culture filtrate recognized by mouse memory effector cells. Infect. and Immu. 66 (6): 2728-2735.

4. Akita F, Chong K, Tanaka H, Yamashita E, Miyazaki N, Nakaishi Y, Suzuki M, Namba K, Ono Y, Tsukihara T, Nakagawa A. 2007. The crystal structure of a virus-like particle from the hyperthermophilic archaeon Pyrococcus furiosus provides insight into the evolution of viruses. J. Mol. Biol. 368 (5): 1469-1483.

5. Badger MG, Price GD. 1994. The role of carbonic anhydrase in photosynthesis. Ann. Rev. Plant Phys. Plant Mole. Biol. 45: 369-392.

6. Roberts EW, Cal F, Kerfeld CA, Cannon GC, Heinhorst S. 2012. Isolation and characterization of the Prochlorococcus carboxysome reveal the presence of the novel shell protein CsoS1D. J. Bacteriology 194 (4): 787-795.

7. Jorda J, Lopez D, Wheatley NM, Yeates TO. 2013. Using comparative genomics to uncover new kinds of protein-based metabolic organelles in bacteria. Protein Science 22: 179-195.

8. Fan C, Bobik T. 2011. The N-terminal region of the medium subunit (PduD) packages adenosylcobalamin-dependent diol dehydratase (PduCDE) into the Pdu microcompartment. J. Bacteriology 193 (20): 5623-5628.

9. Wörsdörfer B, Woycechowsky KJ, Hilvert D. 2011. Directed evolution of a protein container. Science 331: 589-592.

10. Katsura I, Kobayashi H. 1990. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: VII. Molecular Design of the Form-determining Major Capsid Protein. J. Mol. Biol. 213: 503-511.

11. Katsura I. 1989. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: VI. DNA-packaging-defective Mutants in the Major Capsid Protein. J. Mol. Biol. 205: 397-405.

12. Katsura I. 1986. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: V. Amber mutants in Gene E. J. Mol. Biol. 190: 577-586.

13. Katsura I. 1983. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: IV. Small-head Mutants. J. Mol. Biol. 171: 297-317.

14. Tarabout C, Roux S, Gobeaux F, Fay N, Pouget E, Meriadec C, Ligeti M, Thomas D, Ijsselstijn M, Besselievre F, Buisson D-A, Verbavatz J-M, Petittjean M, Valery C, Perrin L, Rousseau B, Artzner F, Paternostre M, Cintrat J-C. 2011. Control of peptide nanotube diameter by chemical modifications of an aromatic residue involved in a single close contact. PNAS 108 (191): 7679-7684.

15. Asadulghani M, Ogura Y, Ooka T, Itoh T, Sawaguchi A, Iguchi A, Nakayama K, Hayashi T. 2009. The defective prophage pool of Escherichia coli O157: prophage-prophage interactions potentiate horizontal transfer of virulence determinants. PLoS Pathog 5 (5): e1000408.

16. Fraczkiewicz R, Braun W. 1998. Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. J. Comp. Chem. 19: 319-333.

17. Kumar P, Singh M, Karthikeyan S. 2011. Crystal structure analysis of icosahedral lumazine synthase from Salmonella typhimurium, an antibacterial drug target. Acta Crystallogr D Biol Crystallogr. 67 (2): 131-9.

18. Morgunova E, Meining W, Illarionov B, Haase I, Jin G, Bacher A, Cushman M, Fischer M, Ladenstein R. 2005. Crystal structure of lumazine synthase from Mycobacterium tuberculosis as a target for rational drug design: binding mode of a new class of purinetrione inhibitors. Biochemistry 44 (8): 2746-2758.

19. van Zon A, Mossink MH, Houtsmuller AB, Schoester M, Scheffer GL, Scheper RJ, Sonneveld P, Wiemer EAC. 2006. Vault mobility depends in part on microtubules and vaults can be recruited to the nuclear envelope. Experimental Cell Research 312: 245-255.

20. Rahmanpour R, Bugg TD. 2013. Assembly in vitro of Rhodococcus jostii RHA1 encapsulin and peroxidase Dyp to form a nanocompartment. FEBS J. 280 (9): 2097-2104.

21. Eppert I, Valdés-Stauber N, Götz H, Busse M, Scherer S. 1998. Growth reduction of Listeria spp. caused by undefined industrial red smear cheese cultures and bacteriocin-producing Brevibacterium linens as evaluated in situ on soft cheese. Appl. Enviro. Microb. 63 (12): 4812-4817.

22. McHugh CA, Fontana J, Nemecek D, Cheng N, Aksyuk AA, Heymann JB, Winkler DC, Lam AS, Wall JS, Steven AC, Hoiczyk E. 2014. A virus capsid-like nanocompartment that stores iron and protects bacteria from oxidative stress. EMBO Journal 33 (17): 1896-1911.

23. Joerger MC, Klaenhammer TR. 1986. Characterization and purification of Helveticin J and evidence for a chromosomally determined bacteriocin produced by Lactobacillus helveticus 481. J. Bacteriology 167 (2): 439-446.

24. Knobler CM, Gelbart WM. 2009. Physical chemistry of DNA viruses. Ann. Rev. Phys Chem. 60: 367–83.

25. Waller A, Hug L, Mo K, Radford D, Maxwell K, Edwards E. 2012. Transcriptional analysis of a Dehalococcoides-containing microbial consortium reveals prophage activation. App. Env. Micro. 78 (4): 1178-1186.

26. Bailey TL, Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California.

27. Janner A. 2006. Towards a classification of icosahedral viruses in terms of index polyhedra. Acta Crystallographica A62: 319-330.

28. Katoh K, Standley D. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30: 772-780.

29. van Dongen S. 2000. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

30. Liekens A, De Knijf J, Daelemans J, Goethals B, De Rijk P, Del-Favero J. 2011. BioGraph: Unsupervised Biomedical Knowledge Discovery via Automated Hypothesis Generation. Genome Biology 12: R57.

31. Huang Y, Niu B, Gao Y, Fu L, Li W. 2010. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26: 680-682.

32. Aravind L. 2000. Guilt by association: contextual information in genome analysis. Geno. Res 10: 1074-1077.

33. Korbel J, Jensen L, von Mering C, Bork P. 2004. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nature Biotech. 22 (7): 911-17.

34. Overton TW, Justino MC, Li Y, Baptista JM, Melo AM, Cole JA, Saraiva LM. 2008. Widespread distribution in pathogenic bacteria of di-iron proteins that repair oxidative and nitrosative damage to iron-sulfur centers. J. Bacteriology 190 (6): 2004-2013.

35. Kurtz DM. 2003. "Dioxygen-binding Proteins" in Comprehensive Coordination Chemistry II 8: 229-260. doi:10.1016/B0-08-043748-6/08171-8

36. Santos A, Mendes S, Brissos V, Martins L. 2013. New dye-decolorizing peroxidases from Bacillus subtilis and Pseudomonas putida MET94: towards biotechnological applications. Appl. Microbiol. Biotechnol. 91: epub doi:10.1007/s00253-013-5041-4.

37. Chen D-H, Baker M, Hryc C, DiMaio F, Jakana J, Wu W, Dougherty M, Haase-Pettingell C, Schmid M, Jiang W, Baker D, King J, Chiu W. 2011. Structural basis for scaffolding-mediated assembly and maturation of a dsDNA virus. PNAS 108 (4): 1355-1360.

38. Kim D, Chung J, Hyun H, Lee C, Lee K, Cho K. 2009. Operon required for fruiting body development in Myxococcus xanthus. J. Microbiol Biotechnol. 19 (11): 1288-1294.

39. Snijder J. 2014. Cargo encapsulation induces a stability switch in the bacterial nanocompartment encapsulin. Presented at 2014 Form and Function of Protein Nanoshells Workshop. Lorentz Center, Leiden, the Netherlands. February 3-7, 2014.

40. Chang C, Evdokimova E, Savchenko A, Edwards AM, Joachimiak A. 2011. Crystal Structure of Protein Ne0167 from Nitrosomonas europaea. Unpublished submission to PDB.

41. Nam KH, Xu Y, Piao S, Priyadarshi A, Lee EH, Kim H-Y, Jeon YH, Ha N-C, Hwang KY. 2010. Crystal structure of bacterioferritin from Rhodobacter sphaeroides. Biochem. Biophys. Res. Commun. 391: 990-994.

42. Isaza CE, Silaghi-Dumitrescu R, Iyer RB, Kurtz DM, Chan MK. 2006. Structural basis for O2 sensing by the hemerythrin-like domain of a bacterial chemotaxis protein: substrate tunnel and fluxional N terminus. Biochemistry 45: 9023-9031.

43. Berry K, Mielke P. 1985. Goodman and Kruskal's Tau-B statistic: A nonasymptotic test of significance. Sociological Methods and Research 13 (4): 543-550.

44. Boucabeille C, Letellier L, Simonet J-M, Henckes G. 1998. Mode of Action of Linenscin OC2 against Listeria innocua. Appl. Enviro. Microb. 64 (9): 3416-3421.

45. Guindon S, Gascuel O. 2003. PhyML: A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 52 (5): 696-704.

46. Huelsenbeck JP, Ronquist F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17 (8): 754-755.

47. Li P, Chen B, Song Z, Song Y, Yang Y, Ma P, Wang H, Ying J, Ren P, Yang L, Gao G, Jin S, Bao Q, Yang H. 2012. Bioinformatic analysis of the Acinetobacter baumannii phage AB1 genome. Gene 507: 125-134.

48. Lunter G, Miklós I, Drummond A, Jensen JL, Hein J. 2005. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6: 83-92.

49. Novák Á, Miklós I, Lyngsø R, Hein J. 2008. StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24 (20): 2403-2404.

50. Pearson R. 2011. Exploring Data in Engineering, the Sciences, and Medicine. Oxford University Press, January 02, 2011.

51. Rodgers JL, Nicewander WA. 1988. Thirteen ways to look at the correlation coefficient. The American Statistician, 42 (1): 59-66.

52. Rosen B. 2002. Biochemistry of arsenic detoxification. FEBS Letters 529: 86-92.

53. Navarre WW, Porwollik S, Wang Y, McClelland M, Rosen H, Libby SJ, Fang FC. 2006. Selective silencing of foreign DNA with low GC content by the H-NS protein in Salmonella. Science 313 (5784): 236-238.

54. Somers R. 1962. A new asymmetric measure of association for ordinal variables. American Sociological Review, 27 (6): 799-811.

55. Thomas C, Nielsen K. 2005. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nature Reviews Micro. Biol. 3: 711-721.

56. Ulrich LE, Zhulin IB. 2010. The MiST2 database: a comprehensive genomic resource on microbial signal transduction. Nucleic Acids Res. 38: D401-407.

57. Valdés-Stauber N, Scherer S. 1994. Isolation and characterization of Linocin M18, a bacteriocin produced by Brevibacterium linens. Appl. Enviro. Microb. 60 (10): 3809-3814.

58. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, Machi D, Mao C, Nordberg EK, Olson R, Overbeek R, Pusch GD, Shukla M, Schulman J, Stevens RL, Sullivan DE, Vonstein V, Warren A, Will R, Wilson MJC, Seung Yoo H, Zhang C, Zhang Y, Sobral BW. 2014. PATRIC, the bacterial bioinformatics database and analysis resource. Nucl Acids Res 42 (D1): D581-D591.

59. Andrews SC. 2010. The ferritin-like superfamily: Evolution of the biological iron storeman from a rubrerythrin-like ancestor. Biochimica et Biophysica Acta 1800: 691–705.

60. Cardarelli L, Lam R, Tuite A, Baker LA, Sadowski PD, Radford DR, Rubinstein JL, Chirgadze N, Maxwell KL, Davidson AR. 2009. The crystal structure of Bacteriophage HK97 gp6: Defining a large family of head-tail connector proteins. J. Mol. Biol. 395 (4): 754-768.

61. Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS. 2003. VADAR: a web server for quantitative evaluation of protein structure quality. Nucleic Acids Res. 31 (13): 3316-3319.

62. Le S, Gascuel O. 2008. An improved general amino acid replacement matrix. Mol. Biol. Evo. 25 (7): 1307-1320.

63. Katsura I. 1978. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: I. Polyheads produced by two defective mutants in the major capsid protein. J. Mol. Biol. 121: 71-93.

64. Karlin S. 1998. Global dinucleotide signatures and analysis of genomic heterogeneity. Current Opinion in Microbiology 1: 598-610.

65. Bird LJ, Coleman ML, Newman DK. 2013. Iron and copper act synergistically to delay anaerobic growth of bacteria. App. Enviro. Microb. 79 (12): 3619-3627.

66. Sarris PF, Ladoukakis ED, Panopoulos NJ, Scoulica EV. 2014. A phage tail-derived element with wide distribution among both prokaryotic domains: a comparative genomic and phylogenetic study. Genome Biol. Evol. 6 (7): 1739-1747.

67. Koonin EV, Makarova KS, Aravind L. 2001. Horizontal gene transfer in prokaryotes: Quantification and classification. Annu. Rev. Microbiol. 55: 709-742.

68. Eddy S. 2001. HMMER User's Guide. Biological sequence analysis using profile hidden Markov models. (http://hmmer.wustl.edu/).

69. Hofmann K, Stoffel W. 1993. TMbase - A database of membrane spanning proteins segments. Biol. Chem. Hoppe-Seyler 374: 166.

70. Winter N, Triccas J, Rivoire B, Pessolani MCV, Eiglmeier K, Lim E-M, Hunter SW, Brennan PJ, Britton WJ. 1995. Characterization of the gene encoding the immunodominant 35kDa protein of Mycobacterium leprae. Mol. Micro. 16 (5): 865-876.

71. Trotter V, Vinella D, Loiseau L, Ollagnier de Choudens S, Fontecave M, Barras F. 2009. The CsdA cysteine desulphurase promotes Fe/S biogenesis by recruiting Suf components and participates to a new sulphur transfer pathway by recruiting CsdL (ex-YgdL), a ubiquitin-modifying-like protein. Molecular Microbiology 74 (6): 1527-1542.

72. Unden G, Schirawski J. 1997. The oxygen responsive transcriptional regulator FNR of Escherichia coli: the search for signals and reactions. Molecular Microbiology 25: 205-210.

73. Cipollone R, Ascenzi P, Tomao P, Imperi F, Visca P. 2008. Enzymatic detoxification of cyanide: Clues from Pseudomonas aeruginosa Rhodanese. J. Mol. Microbiol. Biotechnol. 15 (2-3): 199–211.

74. Sekowska A, Kung H-F, Danchin A. 2000. Sulfur metabolism in Escherichia coli and related bacteria: Facts and fiction. J. Mol. Microbiol. Biotechnol. 2 (2): 145-177.

75. Lill R. 2009. Function and biogenesis of iron-sulphur proteins. Nature 460: 831-838.

76. Inaoka T, Ochi K. 2012. Undecaprenyl pyrophosphate involved in susceptibility of Bacillus subtilis to rare earth elements. J. Bact. 194 (20): 5632-5637.

77. Tator LD, Marolda CL, Polischuk AN, van Leeuwen D, Valvano MA. 2007. An Escherichia coli undecaprenyl-pyrophosphate phosphatase implicated in undecaprenyl phosphate recycling. Microbiology 153 (8): 2518-2529.

78. Kelley LA, Sternberg MJE. 2009. Protein structure prediction on the web: a case study using the Phyre server. Nature Protocols 4: 363-371.

79. Zhang X, Guo H, Jin L, Czornyj E, Hodes A, Hui WH, Nieh AW, Miller JF, Zhou ZH. 2013. A new topology of the HK97-like fold revealed in Bordetella bacteriophage by cryoEM at 3.5 A resolution. Elife 2: e01299-e01299.

80. Conway JF, Wikoff WR, Cheng N, Duda RL, Hendrix RW, Johnson JE, Steven AC. 2001. Virus maturation involving large subunit rotations and local refolding. Science 292: 744-748.

81. Giardina G, Rinaldo S, Johnson KA, Di Matteo A, Brunori M, Cutruzzola F. 2008. NO sensing in Pseudomonas aeruginosa: structure of the transcriptional regulator DNR. J. Mol. Biol. 378: 1002-1015.

82. Campbell RE, Sala RF, van de Rijn I, Tanner ME. 1997. Properties and kinetic analysis of UDP-glucose dehydrogenase from group A Streptococci: Irreversible inhibition by UDP-chloroacetol. J. Biol. Chem. 272 (6): 3416-3422.

83. Roychoudhury S, May TB, Gill JF, Singh SK, Feingold DS, Chakrabarty AM. 1989. Purification and characterization of guanosine diphospho-D-mannose dehydrogenase: a key enzyme in the biosynthesis of alginate by Pseudomonas aeruginosa. J. Biol. Chem. 264 (16): 9380-9385.

84. Zhang R, Hatzos C, Abdullah J, Joachimiak A. 2008. The crystal structure of the putative capsid protein of prophage (E. coli CFT073). Unpublished submission to PDB.

85. Carrillo-Tripp M, Shepherd CM, Borelli IA, Venkataraman S, Lander G, Natarajan P, Johnson JE, Brooks CL, Reddy I, Reddy VS. 2009. VIPERdb2: an enhanced and web API enabled relational database for structural virology. Nucleic Acid Research 37: D436-D442.

86. Kawaguchi K, Noda H, Katsura I. 1983. Structure and Inherent Properties of the Bacteriophage Lambda Head Shell: III. Spectroscopic Studies on the Expansion of the Prohead. J. Mol. Biol. 164: 573-587.

87. Myers E, Miller W. 1988. Optimal Alignments in Linear Space. CABIOS 4 (1): 11-17.

88. Whelan S, Goldman N. 2001. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using Maximum-Likelihood Approach. Mol. Biol. Evo. 18 (5): 691-699.

89. Abascal F, Zardoya R, Posada D. 2005. ProtTest: selection of the best-fit models of protein evolution. Bioinformatics 21 (9): 2104-2105.

90. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

91. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.

92. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunesekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. 2010. The Pfam protein families database. Nuc. Acids Res. Database Issue 38: D211-222.

93. Walsby AE. 1994. Gas vesicles. Microbiol. Rev. 58 (1): 94-144.

9. Formulas

Formula 3-1: Goodman-Kruskal tau association metric

The Goodman-Kruskal tau association test 50 determines the degree to which the values of one categorical variable x are predictive of the values of a second categorical variable y, or how much of the variability in y is explained by variation in x. Unlike Pearson product moment correlation, this measure can be asymmetric in which x and y are not equally predictive of each other. For example genus is a very good predictor of Gram stain, but Gram stain is a very poor predictor of genus. Thus comparing the strength of association between variables gives directionality to the relationship. The stronger the association of y to x, the stronger effect a change to x will have on y. Comparing associations within subpopulations can identify relational differences between variables due to population differences.

The first step is to calculate the contingency table between x and y, defining the frequencies of each observed value of x given y and vice versa in the dataset. This is converted to a joint probability matrix estimating the population joint probabilities. Based on this estimate, the marginal probabilities of the values of x and the values of y are calculated. The sample variation of y is derived from these marginal probabilities, as one minus the sum of the squared marginal probabilities of y. The expected conditional variation of y given x is calculated as one minus the sum, over all values of x and y, of the squared joint probability of xiyj divided by the marginal probability of xi. The tau statistic is the difference between the sample variation of y and the expected conditional variation of y given x, divided by the sample variation of y.
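The steps above can be written compactly as follows. This is a sketch that follows the R implementation given with this formula; the symbols $\pi_{ij}$, $\pi_{i+}$ and $\pi_{+j}$ (joint and marginal probability estimates) and $N_{ij}$ (contingency counts) are notation introduced here for illustration and do not appear in the text.

```latex
\pi_{ij} = \frac{N_{ij}}{\sum_{i,j} N_{ij}}, \qquad
\pi_{i+} = \sum_{j} \pi_{ij}, \qquad
\pi_{+j} = \sum_{i} \pi_{ij}

V(y) = 1 - \sum_{j} \pi_{+j}^{2}, \qquad
E\left[V(y \mid x)\right] = 1 - \sum_{i} \sum_{j} \frac{\pi_{ij}^{2}}{\pi_{i+}}

\tau(x, y) = \frac{V(y) - E\left[V(y \mid x)\right]}{V(y)}
```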

Similar to Pearson product moment correlation, association theoretically ranges from -1.0 to 1.0. In practice, association minimizes at zero, as knowing the value of x never gives less information than not knowing x. An association of 0.0 implies no predictive relationship between the variables and an association of 1.0 implies x perfectly predicts y. The case of variables with discrete unique values, such as accessions, demonstrates that strong association does not necessitate causation, only a strong predictive relationship. While the Pearson correlation approximates a t-distribution, tau approximates a chi-squared distribution for large datasets with multiple values for variables 43,50. At smaller sizes and restricted variable values, a non-asymptotic test of significance is appropriate and more accurate 43.


Algorithm implementation for R (Pearson 2011):

    GKtau <- function(x,y){
      # First, compute the IxJ contingency table between x and y
      Nij = table(x,y,useNA="ifany")
      # Next, convert this table into a joint probability estimate
      PIij = Nij/sum(Nij)
      # Compute the marginal probability estimates
      PIiPlus = apply(PIij,MARGIN=1,sum)
      PIPlusj = apply(PIij,MARGIN=2,sum)
      # Compute the marginal variation of y
      Vy = 1 - sum(PIPlusj^2)
      # Compute the expected conditional variation of y given x
      InnerSum = apply(PIij^2,MARGIN=1,sum)
      VyBarx = 1 - sum(InnerSum/PIiPlus)
      # Compute and return Goodman and Kruskal's tau measure
      tau = (Vy - VyBarx)/Vy
      tau
    }
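The asymmetry described above (genus predicts Gram stain well, Gram stain predicts genus poorly) can be illustrated with a small Python sketch of the same calculation. This is a port written for illustration, not the implementation used in this work: the function name `gk_tau` and the six-organism toy dataset are hypothetical, and unlike the R function it raises an error rather than returning NaN when y has only one value.

```python
from collections import Counter


def gk_tau(x, y):
    """Goodman-Kruskal tau for predicting y from x (asymmetric)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))   # contingency counts N_ij
    x_counts = Counter(x)        # row totals, giving the marginals of x
    y_counts = Counter(y)        # column totals, giving the marginals of y

    # Marginal variation of y: one minus the sum of squared marginals of y
    v_y = 1.0 - sum((c / n) ** 2 for c in y_counts.values())
    if v_y == 0.0:
        raise ValueError("y has a single value, so tau is undefined")

    # Expected conditional variation of y given x:
    # one minus the sum of squared joint probability over marginal of x
    v_y_given_x = 1.0 - sum(
        (c / n) ** 2 / (x_counts[xi] / n) for (xi, _), c in joint.items()
    )
    return (v_y - v_y_given_x) / v_y


# Hypothetical toy data: six organisms with genus and Gram stain.
genus = ["Bacillus", "Bacillus", "Listeria",
         "Escherichia", "Escherichia", "Salmonella"]
gram = ["+", "+", "+", "-", "-", "-"]

tau_genus_to_gram = gk_tau(genus, gram)   # how well genus predicts Gram stain
tau_gram_to_genus = gk_tau(gram, genus)   # how well Gram stain predicts genus
print(round(tau_genus_to_gram, 3), round(tau_gram_to_genus, 3))
```

In this toy dataset genus determines Gram stain completely, giving a tau of 1, while Gram stain only partly narrows down genus, giving a tau of 10/26 (about 0.38).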

Formula 3-2: Goodman-Kruskal delta tau association metric

$$\Delta\tau = \tau_a(x, y) - \tau_b(x, y)$$

The Goodman-Kruskal delta tau association test calculates the difference in statistical pairwise feature association between two sets, a and b. For our purposes these sets are encapsulin encoding organisms and the background of all bacteria in the datasets. A significant negative delta tau marks a decoupling of two variables (i.e. y is less dependent on x in encapsulin encoding organisms than among bacteria overall). Such cases suggest encapsulins promote certain values of y apart from the effect on the values of x. A significant positive delta tau marks an increased association between two variables; specifically, the value of y is more dependent on the value of x in encapsulin encoding organisms than among bacteria overall. Such cases suggest encapsulins promote functions dependent on the status of both variables together (e.g. spore formation under high salt). Delta tau elucidates linkages and decouplings related to the encapsulin function even for features with strong or weak absolute association in either dataset.
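As a minimal sketch of the delta tau calculation, the following Python example computes tau separately in two sets and takes the difference. The function names and the four-organism toy datasets (salt tolerance against spore formation) are hypothetical and chosen only to produce an extreme positive delta tau; no significance test is included, since the test procedure is not specified in this formula.

```python
from collections import Counter


def gk_tau(x, y):
    """Goodman-Kruskal tau for predicting y from x."""
    n = len(x)
    joint, x_counts, y_counts = Counter(zip(x, y)), Counter(x), Counter(y)
    v_y = 1.0 - sum((c / n) ** 2 for c in y_counts.values())
    v_y_given_x = 1.0 - sum(
        (c / n) ** 2 / (x_counts[xi] / n) for (xi, _), c in joint.items()
    )
    return (v_y - v_y_given_x) / v_y


def delta_tau(x_a, y_a, x_b, y_b):
    """Difference in association: tau in set a minus tau in set b."""
    return gk_tau(x_a, y_a) - gk_tau(x_b, y_b)


# Hypothetical set a (encapsulin encoding organisms):
# spore formation follows salt tolerance exactly.
salt_enc = ["high", "high", "low", "low"]
spore_enc = ["yes", "yes", "no", "no"]

# Hypothetical set b (background of all bacteria):
# spore formation is independent of salt tolerance.
salt_all = ["high", "high", "low", "low"]
spore_all = ["yes", "no", "yes", "no"]

d_tau = delta_tau(salt_enc, spore_enc, salt_all, spore_all)
print(d_tau)
```

Here tau is 1 in set a and 0 in set b, so delta tau is 1, the strongest possible increase in association.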

Formula 3-3: Karlin δ*-difference metric for comparison of double stranded DNA segments

Karlin δ*-difference metric calculates the average absolute relative difference in nucleotide signature between two double stranded nucleotide segments 64.

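$$\delta^* = \frac{1}{16} \sum_{X,Y} \left| \frac{p(XY)_{enc}}{p(X)_{enc}\, p(Y)_{enc}} - \frac{p(XY)_{geno}}{p(X)_{geno}\, p(Y)_{geno}} \right|$$

where X and Y are nucleotides {A,T,G,C}, p(X) is the prevalence of that nucleotide in both strands of a sequence, p(XY) is the prevalence of the dinucleotide XY, and the subscripts denote the two segments being compared.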

Expected δ* range in prokaryotes: within species δ*≈0.025, between families δ*≈0.06, between kingdoms δ*>=0.22 64.

Eg. Sulfurihydrogenibium yellowstonense SS-5 CutA1-like enzyme

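$$\Delta p^*(AT) = \frac{p(AT)_{enz}}{p(A)_{enz}\, p(T)_{enz}} - \frac{p(AT)_{geno}}{p(A)_{geno}\, p(T)_{geno}} = \frac{0.11}{0.35 \times 0.35} - \frac{0.075}{0.29 \times 0.29} = 0.037$$

$$\delta^* = \frac{|\Delta p^*(AA)| + 0.037 + |\Delta p^*(AC)| + \dots + |\Delta p^*(GG)|}{16} = 0.22$$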

CutA1 gene signature ‘very distant’ from SS-5 genome signature (same distance as between Escherichia coli and Homo sapiens)
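The δ* calculation can be sketched in Python as below. This is an illustrative sketch rather than the code used for the analysis: the names `signature` and `delta_star` and the toy sequences are hypothetical. Two choices here are assumptions: dinucleotides containing ambiguous bases are skipped, and the relative abundance is set to zero when a nucleotide is absent from a segment.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = "ACGT"


def signature(seq):
    """Dinucleotide relative abundances p(XY)/(p(X)p(Y)) over both strands."""
    top = seq.upper()
    bottom = top.translate(COMPLEMENT)[::-1]   # reverse complement strand
    mono, di = Counter(), Counter()
    for strand in (top, bottom):
        mono.update(b for b in strand if b in BASES)
        di.update(
            strand[i:i + 2] for i in range(len(strand) - 1)
            if strand[i] in BASES and strand[i + 1] in BASES
        )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence has no unambiguous dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        # Assumption: relative abundance is 0 when a nucleotide is absent.
        rho[x + y] = (di[x + y] / n_di) / expected if expected > 0 else 0.0
    return rho


def delta_star(seq_a, seq_b):
    """Average absolute difference in signature over the 16 dinucleotides."""
    rho_a, rho_b = signature(seq_a), signature(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16


# Hypothetical toy segments, far shorter than a real gene or genome.
segment = "ATGGCGAAACTGACCGGTATTCTGGCGTAA"
other = "ATATATATATTTTAAATATATATAAATTTA"
print(delta_star(segment, segment), delta_star(segment, other) > 0)
```

Because both strands are counted, a segment and its reverse complement give identical signatures, so the metric does not depend on which strand was sequenced.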


Appendix 1: Other enzyme families enriched with encapsulins, conserving distinct association motifs indicative of non-targeted accessory proteins

Many other prokaryotic compartments have been found to functionally associate with and sometimes package additional enzymes without the canonical packaging signals specific to those compartments 1,5,7,8. This packaging most often occurs through specific interaction with the co-packaged canonically targeted cargo enzyme 1,5,7,8. The gene families enriched around my expanded encapsulin set were examined for conserved motifs and published phenotypic effects to identify candidates for this type of encapsulin associated accessory proteins. Motif searching across the spectrum of sequenced genomes was used to reveal if the CEAS motif is present in any other proteins besides the six encapsulated families identified above, independent of encapsulins, or confirm this motif is unique to encapsulin associated enzymes. MAST searching from the consensus CEAS motif against the NCBI non-redundant nr database identified hits to the six families discussed above, a few unique proteins and no additional protein families above a cutoff of E-value <= 5.4. This suggests these six families represent the extent of cargo families targeted via the CEAS motif mechanism.

Among the few unique proteins encoding a terminal CEAS-like motif were two CutA1-like copper/dication tolerance proteins from Hydrogenobacter thermophilus. This family was also enriched near a subgroup of encapsulins. These predicted motifs have the sequence DESLGISQESS, matching the consensus CEAS motif model with an E-value of 2.20E-09.

Elsewhere in these genomes were encapsulin homologs and CEAS motif encoding bacterioferritins. The presence of serine rather than glycine in the seventh position of the motif is atypical of the consensus, but is also seen in a few ferritin-like enzymes encoded adjacent to encapsulin shells. The series of large polar residues from positions seven through eleven without an intervening small hydrophobic is also atypical, suggesting this motif may interact atypically or weakly with the coencoded encapsulin targeting sequence binding site.

Related CutA1-like homologs lacked the CEAS-like motif but, when co-encoded with a predicted encapsulin and bacterioferritin, encode an alternative encapsulin association specific motif in approximately the same position, present in multiple predicted phylogenetic lineages (Figure A1-1, Figure A1-2, and Figure A1-3). This motif bears weak similarity to the consensus CEAS motif, with a conserved near-central glycine flanked upstream by a bulky residue then a small hydrophobic residue, and downstream by short bulky polar and aromatic residues. This level of similarity may be coincidental or an artifact of extensive divergence from an ancient shared ancestral motif.

Four CutA1-like homologs were more closely linked to the encapsulin gene in the genome than the canonical CEAS motif encoding bacterioferritin. The encapsulin association specific motif in CutA1-like proteins, if involved in protein-protein interaction, is unlikely to bind the same sites as the canonical CEAS motifs used by the bacterioferritins, due to differences between the two motif families. Binding sites would instead be distinct, potentially external to the encapsulin, or involve a bacterioferritin surface. Alternatively, the CutA1-like protein encapsulin association specific motif may function to modify some metabolically associated function not physically linked to encapsulation, such as copper channeling. Between the CutA1-like proteins and bacterioferritin enzymes, the encapsulin association specific motifs show stronger conservation in the bacterioferritins, suggesting iron metabolism is still the dominant factor in these compartments, with synergistic copper metabolism 65 being advantageous but not as essential. Thus the CutA1-like proteins, differing from the iron-binding proteins, may represent a non-essential or conditionally essential function for encapsulins in copper resistance.

Figure A1-1: Conserved, encapsulin association specific, CutA1-like dication tolerance protein consensus motif

Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.
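For readers unfamiliar with sequence logos, the per-position information content plotted on the vertical axis can be sketched as follows. This assumes the conventional sequence-logo definition (the maximum entropy of a 20 letter alphabet, minus the observed entropy, minus an approximate small sample correction of (s - 1)/(2 ln 2 · n)); the exact values drawn by the logo software used for these figures may differ, and `column_information` is a hypothetical helper name.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one aligned motif position.

    `column` is a string of the residues observed at that position,
    one character per aligned sequence.
    """
    n = len(column)
    freqs = [count / n for count in Counter(column).values()]
    entropy = -sum(f * math.log2(f) for f in freqs)
    # Approximate small sample correction; shrinks as more sequences are aligned.
    correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    return max(0.0, math.log2(alphabet_size) - entropy - correction)
```

A fully conserved position in a large alignment approaches log2(20), about 4.32 bits, while a position with few aligned sequences is penalised heavily by the correction term, which is why small subfamilies produce short logos with wide error bars.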

Figure A1-2: MAFFT alignment of representative CutA1-like copper tolerance enzymes with and without encapsulin association specific motif

CutA1-like proteins co-encoded with encapsulins are marked in green. These highlighted proteins conserve the motif, while similar sequences in related bacteria, lacking the encapsulin targeted bacterioferritin pairing, also lack the motif. BLOSUM62 colouring was used to mark positions of conservation across the full sequence. The conserved segment is boxed in black and highlighted using CLUSTAL colour formatting.

Figure A1-3: Condensed Bayesian Inference Phylogeny 46 of non-redundant representative CutA1-like metal binding proteins

Larger leaves represent subfamilies of the same association distance. Smaller leaves represent single genomes. Subfamilies and internal nodes were labeled with posterior probabilities of each partition. The Saccharopolyspora erythraea NRRL 2338 divalent cation tolerance protein was used as the outgroup. The genomically encapsulin associated and coencoded CutA1-like proteins conserve this distinctive motif independent of predicted phylogenetic history.

A second family of non-CEAS encoding accessory enzymes, iron-dependent radical S-adenosyl methionine (SAM) oxidoreductases, was enriched near ferritin-like protein and rubrerythrin associated encapsulins (Hypergeometric test P-value = 4.61E-15). A divergent subfamily of SAM oxidoreductases was encoded near a subset of encapsulin associated bacterioferritin-like enzymes, but the subfamily was too small to declare significance. Some physiological roles of radical SAM oxidoreductases are similar to hemerythrin, but utilizing radical SAM rather than radical dioxide as the reactive intermediate. Analyses by MEME and MAST identified a distinct, significantly conserved, discriminative motif (E-value = 3.9e-279), specific to those homologs co-encoded with encapsulins (Figure A1-4). This suggests functional interaction, with binding of some encapsulin specific metabolic component, but not necessarily the encapsulin shell itself. Function, not just physical interaction, can drive selection for motif conservation.
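The hypergeometric enrichment test quoted for this family can be sketched as an upper tail probability. The thesis does not state the underlying counts, so the parameter names below (`population`, `successes`, `draws`, `observed`) are illustrative assumptions about how such a test is typically framed: the chance of seeing at least `observed` family members among the `draws` genes neighbouring encapsulins, given `successes` family members among `population` genes overall.

```python
from math import comb


def hypergeom_enrichment_p(population, successes, draws, observed):
    """Return P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    # Sum the exact probability mass from the observed count to the maximum possible.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / comb(population, draws)
```

Because the counts are exact integers until the final division, very small tail probabilities of the order reported here remain numerically stable.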

Figure A1-4: Conserved encapsulin association specific radical SAM oxidoreductase consensus motif

Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.

This radical SAM encapsulin associated motif does not closely match the consensus of the six families conserving the canonical CEAS motif. The new motif was encoded in a distinct position, on a mid-sequence variable loop predicted to be surface exposed. Similar to the consensus CEAS motif, the radical SAM encapsulin association specific motif was distinct from the control cases, was not predicted in reciprocal discriminating MEME analysis, and scored several orders of magnitude below the noise cutoff for random artifact motifs (E-value = 5.4).

Phylogenetic analysis 46 suggests proteins conserving this motif were neither monophyletic nor all closest relatives, such that random vertical linear inheritance does not fully explain this conservation (Figure A1-5). Radical SAM oxidoreductases with this motif exclusively co-occurred in subfamilies with CEAS motif encoding ferritin-like, rubrerythrin-like or bacterioferritin-like enzymes, suggesting co-functionality. Thus this motif was predicted to confer co-functionality with encapsulins, while interacting distinctly with a metabolite, or through a distinct site from the consensus CEAS motif, either on the shell or co-encoded consensus CEAS enzyme, or through some other mechanism.

Figure A1-5: Condensed Bayesian Inference Phylogeny 46 of non-redundant representative predicted radical SAM oxidoreductases

Labeled leaves represent subfamilies of the same function, or lineages of importance. Unlabelled leaves represent single genomes. Subfamilies and internal nodes were labeled with posterior probabilities of each partition. Holophaga foetida and Thermobaculum terrenum ATCC BAA-798 homologs used as outgroups.

Comparing the optimized Bayesian coestimation phylogeny for the encapsulin associated family of radical SAM oxidoreductases to the optimized Bayesian inference phylogeny reveals a moderately similar overall topology with multiple differences in partitions (Figure A1-5). Based on posterior probability, the partitions in the inference tree generally had higher confidence than those in the co-estimation tree, and were therefore given more weight.

The overall trends in both analyses agree the unassociated form of radical SAM domain proteins was predicted to be ancestral, with the encapsulin associated motif arising once from an existing exposed surface and being maintained only in proximity to an encapsulin. Unlike the other encapsulin associated enzymes discussed, this conserved motif family shows limited evidence of transfer between lineages. They also show a gradient of homologs without encapsulins retaining motifs similar to the conserved motif but matching more weakly than encapsulin associated proteins (E-value >= 3.0E-9), suggesting degradation of signal in previously encapsulin associated enzymes or adaptation of a pre-existing surface for encapsulin related function.

Similarly, a small subfamily of metalloproteases genomically associated with encapsulins in the order Myxococcales, encoded two conserved exclusive internal motifs only observed in the encapsulin associated members (Figure A1-6.i, Figure A1-6.ii). Both motifs were distinct from the consensus CEAS motif, and the radical SAM specific encapsulin associated motif. In both motifs the E-value falls well below the expected rate of random motif detection (E-value = 5.1).

No matches with an E-value < 0.001 to either motif occur in homologs not associated with encapsulins. Gene knockout studies in one of these strains, Myxococcus xanthus DZ2, revealed both the predicted encapsulin and this metalloprotease were essential to fruiting body formation 38. Other conserved proteins within this region had non-essential phenotypes (Figure A1-7). This indicates a metabolic interaction between the two proteins, strongly suggesting a biochemical interaction such as might occur through one or both of these conserved motifs. Alternatively, one or both of these motifs are involved in the fruiting body specific regulatory biochemistry.

Figure A1-6.i: Conserved encapsulin association specific fruiting body metalloprotease consensus distal internal motif (closer to C-terminus). E-value = 2.7e-125

Figure A1-6.ii: Conserved encapsulin association specific fruiting body metalloprotease consensus proximal internal motif (closer to N-terminus). E-value = 9.1e-40

Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.

Figure A1-7: Example genomic context of untargeted enriched encapsulin accessory proteins

Appendix 2: Other significantly enriched properties of encapsulin encoding organisms

A2.1 Properties of encapsulated peroxidase encoding organisms

In the case of genomes with encapsulated iron-dependent peroxidases, isolation source was statistically independent of other features, indicating a direct encapsulin derived advantage in non-aquatic habitats. Enrichments in bacteria causing chronic necrotizing pneumonia, tuberculosis, or hemoptoic pneumonia were significantly associated with high GC content and isolation from sputum and bronchoalveolar lavage. This suggests an advantage of encoding these encapsulins during invasive lung infection. Invasive lung pathogens with encapsulins were significantly depleted in association with taxonomic phylum or order, implying the encapsulin related advantage is unlikely to be due to some other taxonomically inherited property. Thus specific strains of lung colonizing Mycobacterium and Burkholderia encoding encapsulated peroxidase have developed invasive pathogenesis through a taxonomically independent mechanism, facilitated by the presence of these encapsulins. Further research may clarify whether facilitation is direct or indirect. Peroxidase compartmentalization may be particularly effective for resisting the oxidizing effects of the human immune response, or switching from the high oxygen environment of free lung growth to the near anaerobic conditions of a lung granuloma.

Conversely encapsulin function may enable some shared physiological capability not directly related to pathogenesis. Research into the application of encapsulins in these specific pathologies may elucidate a novel physiological target for treatment against these diseases.

The average GC content of the encapsulin encoding set was significantly higher than the background (64.7% versus 46.5%) as were the average proteome size (4984 CDS versus 2619 CDS) and genome length (5570616 nt versus 3652025 nt) (Figure A2-1, Figure A2-2). These three features were also significantly correlated (GC content versus proteome size: 0.36, GC content versus genome length: 0.57, proteome size versus genome length: 0.67, P < 2.6E-128).

The enrichment for large genomes/proteomes implied encapsulins function in organisms with more complex, or adaptable metabolisms with the potential for optimization or specialization.

Figure A2-1: Boxplot comparing percent GC content between encapsulin encoding genomes and the overall distribution of GC content across PATRIC dataset genomes.

Encapsulin encoding GC content (mean 60.1%, median 64.2%). Background GC content (mean 46.49%, median 44.15%). Welch Two sample t-test P < 2.2e-16.
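The Welch two sample t-test used for these comparisons does not assume equal variances in the two groups. A minimal sketch of the test statistic and its Welch-Satterthwaite degrees of freedom is given below; the function name `welch_t` is hypothetical, and converting the statistic to a P-value (which needs the t distribution) is left to a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    # Squared standard error of each sample mean, from the unbiased variance.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for unequal variances and sample sizes.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df
```

Here `sample_a` would hold, for example, the percent GC values of the encapsulin encoding genomes and `sample_b` those of the background genomes.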

Figure A2-2: Boxplot of distributions of proteome size between all PATRIC dataset genomes and those encoding encapsulin associated iron-dependent peroxidases.

Encapsulin encoding genomes average proteome size (mean: 4984 RefSeq CDS, median: 4849 RefSeq CDS). All PATRIC genomes [background] average proteome size (mean: 2619 RefSeq CDS, median: 2402 RefSeq CDS). Welch 2-sample t-test P < 2.2e-16.

Another factor in the link between larger genomes and encapsulins may be a tendency to accumulate horizontally transferred operons, which could progressively enlarge genome and proteome size. The role of above average GC content in these organisms was less clear, but may relate to this transfer tolerance or an underlying convergent advantage promoting high GC content itself. Although the topic of much theorizing, GC content as a phenotypic feature is not well validated for mechanisms of selective advantage.

Peroxidase/encapsulin encoding organisms were significantly depleted for the following characteristics: isolation from gastrointestinal tract (significant increased association with GC content), sphere cell shape (significant increased association with temperature tolerance, optimal growth temperature, and GC content), facultative and obligate anaerobic growth (significant increased association with GC content, no significant dependence on other fields), GC content between 30-50% (significant increased association with proteome size), and aquatic habitat (significant increased association with high GC content, no significant dependence on other fields). Unless otherwise stated, none of these enriched or depleted features showed significant dependence on taxonomic classifiers, supporting physiological functional relationships between encapsulin function and success in these niches.

Specific taxonomic fields showed statistically significant enrichment. The classes Beta-proteobacteria and Actinobacteridae were enriched. The enriched orders were Rhodospirillales, Burkholderiales, Actinomycetales, Corynebacterineae, and Streptomycineae. The enriched families were Acetobacteraceae, Mycobacteriaceae, Burkholderiaceae, Gordoniaceae, and Nocardiaceae. Statistically dependent significant depletion was also observed in specific taxa. The phyla Firmicutes and Proteobacteria were depleted. The depleted classes were Bacilli, Clostridia, Gamma-proteobacteria, and Epsilon-proteobacteria. The depleted orders were Lactobacillales, Bacillales, Campylobacterales, and Enterobacteriales. The depleted families were Enterobacteriaceae and Streptococcaceae. Thus peroxidase/encapsulins are in significant use sporadically throughout diverse lineages.

A2.2 Properties of encapsulated iron-binding protein encoding organisms

Enrichment for specialized or aquatic habitats among encapsulated iron-binding protein encoding organisms showed no significant increase in association with other features, except a weak statistically significant increase in association with plasmid-less bacteria. Association with gram stain, motility, temperature tolerance, oxygen requirement and disease were all significantly reduced. Groundwater isolation showed a significant 15% increased association with taxonomic class Negativicutes, and 20% increased association with taxonomic order Selenomonadales. Converse associations between taxonomic class/order and isolation source were not significant, suggesting some feature of these lineages leverages the selective advantage of these encapsulins, rather than an intrinsic feature of groundwater habitat.

Thermophilic and hyperthermophilic temperature tolerance showed a significant 30% decrease in association with gram staining, 27% decrease in association with motility, 34% decrease in association with oxygen requirement, 23% decrease in association with habitat, and 30% decrease in association with disease. Optimal growth at 60 °C was enriched, although non-significant secondary maxima existed at higher temperatures.

GC content in this dataset was not significantly different from background, but showed a categorical significant enrichment for 53.1% GC organisms, with weak association to a variety of taxonomic and genome structure features. This suggests iron-binding encapsulin systems are selectively advantageous in a different genome context than peroxidase encapsulins.

Alternatively, this bias toward moderate GC content may be a marker of tolerance for horizontal transfer from moderate GC Archaea, and not have a physiological role.

A2.3 Trends and enriched features of accessory encapsulin associated enzyme families

Only five genomes in the PATRIC dataset co-encode associated radical SAM oxidoreductases and encapsulins, which was insufficient to detect significant enrichment in any phenotypic features. The genomes in question were all moderate GC (45.3-46.9 %), single chromosome, somewhat short length genomes (1824357 nt - 1980592 nt). These bacteria all grow as anaerobic or microaerobic, gram negative, motile rods, without spore formation, occurring in specialized aquatic habitats, including the bacterioferritin-like protein/encapsulin/CutA1-like protein/radical SAM oxidoreductase system of Thermocrinis albus DSM 14484 and related species. None of these genomes were pathogenic. Analysis of surrounding genome context identified a number of phosphate metabolism enzymes, suggesting roles in phosphate protection and arsenic resistance. In addition to iron, the M. xanthus ferritin-like/bacterioferritin/encapsulin compartment was found to also package large quantities of phosphate 22, lending support to a niche for encapsulin associated enzymes in phosphate metabolism.

Genomes co-encoding predicted bacterioferritin/CutA1-like protein encapsulin systems were significantly depleted for mesophilic temperature restriction, and enriched for isolation from hot springs. The depletion for mesophilic growth was more significant than the enrichment for hot springs, and was responsible for 22% of this enrichment. Otherwise the set of PATRIC strains co-encoding these systems of encapsulins and enzymes was too small and diverse to confidently declare more subtle trends significant. As such the niche of predicted bacterioferritin/CutA1 encapsulin systems was less clear than the peroxidase associated, or iron-sequestering encapsulins but was more similar to the iron-sequestering encapsulin niches. When CutA1 co-occurred with an encapsulin targeted with ferritin-like proteins or rubrerythrin the phenotypic trends followed those of the iron-binding proteins alone.

These enrichments and interdependencies support the ferritin-like, CutA1-like, and rubrerythrin-like families of encapsulin systems compartmentalizing specifically anaerobic divalent cation toxicity pathways. Copper and iron, the two most common and most prevalent targets of the enzymes in question, have been recently found to act synergistically to aggravate toxicity under anaerobic conditions 65. This toxicity is independent of oxygen. Instead, toxicity is connected to off-target interference of iron and copper with other metal dependent enzymes, and to a lesser extent direct cation reduction. This is similar to the toxicity of arsenic by phosphate replacement. As the affinities for iron and copper differ considerably in encapsulated enzymes, sequestration might prevent or at least minimize the synergistic toxicity by retaining a pool of either metal separately from the other, or both metals separate from the cytoplasm. The systems which co-encode CutA1-like and ferritin-like enzymes tend to occur in toxically metal rich hot spring environments, such that bulk sequestering of both metals may be advantageous.

A2.4 Overall properties of encapsulin encoding organisms, not restricted to single cargo families

Each of the individual datasets examined showed decreases in association between taxonomic classifications and enriched phenotypic features, relative to the background of all genomes in the PATRIC database. For larger sample sizes these decreases in association were often significant. This supports that encapsulins are significantly contributing to niche level physiological phenotypes, beyond what would be explained by taxonomy level inherited phenotypic conservation. The cases where phenotypic and taxonomic features were more closely statistically associated confirm the expected role of inheritance in retaining advantageous encapsulin function and distributing it to descendants once acquired.

Encapsulin encoding genomes showed significantly higher average GC content than the background (Figure 3-4). The encapsulin genes themselves showed no evident trend in GC content relative to their current hosts, with several examples higher than and lower than the current hosts. As such the GC content difference likely relates to intrinsic properties of higher GC genomes and horizontal transfer rather than a direct effect of encapsulins on GC content.

One explanation is that high GC genomes tolerate more horizontal transfer, potentially through silencing of novel lower GC genes via H-NS-like mechanisms 53. Consistent with this, encapsulin encoding genomes were on average significantly larger (P < 2.2e-16) than bacterial genomes overall (Figure 3-5, Figure 3-7), accumulating more complex genomes through transfer. Encapsulin encoding organisms had a significantly higher optimal growth temperature than bacteria overall (P = 0.007675, encapsulin encoding mean = 41.3 °C, background = 36.5 °C). This supports encapsulation having a stabilizing effect, restricting diffusion or aiding folding under kinetically more labile conditions.

Figure A2-3: Box-plot of distributions of genome length between all PATRIC annotated genomes and those encoding encapsulin associated iron-dependent peroxidases

Encapsulin encoding genome length (mean: 5570616 nt, median: 5078984 nt, min: 255395 nt). Background genome length (mean: 3652025 nt, median: 3434013 nt, min: 5123 nt [Acetobacter aceti]). Welch 2-sample t-test P < 2.2e-16.

Appendix 3: Prediction of horizontal transfer paths from nearest neighbour Karlin distance and Bayesian Inference phylogeny

To address the question of genomes of origin, and better understand the differences in nucleotide signature between paired cargo and shell genes, Karlin analysis was used to build a map of closest genome signature. When overlaid with amino acid sequence phylogeny 45,46 the two methods define a predicted history of transfer. Amino acid phylogenies can determine long distance relationships well, but have difficulty confidently resolving the recent histories of very closely related proteins if there are insufficient differences at the amino acid level. Also selection influences amino acid sequence more strongly than nucleotide sequence preventing the accumulation of distinctive signal. Conversely Karlin distances are best at predicting short range relationships, and the transmission of non-selected low latency signal 64, ideal for tracing ancestry. Further Karlin distance confers directionality to phylogenetic relationships that would otherwise require an outgroup. Together the paths predicted can be more detailed than is possible based on amino acid sequence alone, and more far reaching than possible based on nucleotide signatures alone.
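The nearest neighbour step of this approach can be illustrated with a short sketch. It is one plausible reading of the mapping described here, not the pipeline actually used: it assumes each gene and genome has already been reduced to a dictionary of dinucleotide relative abundances, and the names `delta_star`, `nearest_genome` and the `exclude` option are hypothetical.

```python
def delta_star(rho_a, rho_b):
    """Mean absolute difference between two relative abundance profiles.

    With all 16 dinucleotides present this equals the Karlin delta* distance.
    """
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / len(rho_a)


def nearest_genome(gene_rho, genome_rhos, exclude=None):
    """Return (genome name, distance) for the closest genome signature.

    `exclude` optionally drops one genome (for example the current host)
    so that the next closest candidate donor can be inspected.
    """
    distances = {
        name: delta_star(gene_rho, rho)
        for name, rho in genome_rhos.items()
        if name != exclude
    }
    best = min(distances, key=distances.get)
    return best, distances[best]
```

Linking each gene to its closest genome signature, and overlaying those links on the amino acid phylogeny, yields directed candidate transfer paths of the kind shown in the figures of this appendix.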

Recapitulating the predicted paths of horizontal transfers for encapsulin and enzyme proteins revealed encapsulin/enzyme pairs conform to genomes on the same predicted paths of transfer, often with the same predicted origin, or predicted origins within relatively few transfer steps along the same path (Figure A3-1). A group of encapsulin and associated peroxidase genes horizontally transferred through various Burkholderia species are presented as a focused example (Figure A3-2). Similar patterns existed throughout most encapsulin lineages, linking paired encapsulins and associated enzymes as transmitted via the same paths of horizontal transfer, the simplest interpretation being that they were transferred together.

Figure A3-1: Overall horizontal transfer path of classical encapsulins through Archaea and Eubacteria, based on nearest match nucleotide signature Karlin distances

Nucleotide signatures connect all classical encapsulins back to horizontal transfer events originating in three common ancestors: an Escherichia coli 8.0569-like ancestor, a Corynebacterineae ancestor, and a Proto-Archaeal ancestor. E.g. the Escherichia coli 8.0569-like ancestor was the source of encapsulins for encapsulin encoding lineages of E. coli (peroxidase associated), Desulfovibrio magneticus RS-1 (ferritin associated), Spirochaeta smaragdinae DSM 11293 (rubrerythrin associated), and Bacillus methanolicus MGA3 (ferritin and ferredoxin co-associated).

Figure A3-2: Representative example from Burkholderia derived encapsulin systems in which multiple sequential horizontal transfer events yield traceable co-transmission of encapsulins and enzymes with moderately diverged nucleotide signatures.

Paired encapsulin shell and peroxidase genes connected to current protein targets, and genome with most similar nucleotide signature, defining predicted transfer paths for Burkholderia encapsulin components. Paired enzymes and shells predicted to have been transferred out of the same source genomes.

With multiple predicted transfer events, the nucleotide signals traced back to three genomes of earliest detectable origin for the encapsulin shell (Figure A3-1). Prior to these origins, transfer signals were too extensively degraded to interpret, and phylogenetic methods no longer predicted confident descent. Intermediate ancestors are also informative of the most likely paths encapsulins traveled to reach the genomes they currently function within. A diverse group of encapsulins in Proteobacteria, Firmicutes, and Spirochaetes were predicted to derive from transfer out of an ancestor closest to the modern Escherichia coli 8.0569 nucleotide signature.

Actinobacteria, including Mycobacterium, were predicted to derive encapsulins as transfers from an ancestor closest to the modern Mycobacterium tuberculosis SUMu012 nucleotide signature, which was also the predicted ancestral origin of the peroxidases in all lineages except Proteobacteria. The Burkholderia encapsulins were predicted to have arisen through an ancestor closest to the modern Burkholderia multivorans CGD2 nucleotide signature, which was also the predicted source of Proteobacterial encapsulin targeted peroxidases. An extensive network of encapsulins in Crenarchaeota, Synergistetes, Firmicutes, Thermotogae, and Proteobacteria trace back to origins in an ancestor closest to the modern Thermogladius cellulolyticus 1633 nucleotide signature. These three encapsulin lineages share a predicted ancestor in the early Corynebacterineae, based on the consensus of Maximum likelihood and Bayesian Inference amino acid sequence phylogenies.

A third earliest detectable origin was predicted in ancient Proto-Archaea standing as the source for the Pyrococcus furiosus PfV, and related extremophilic bacterial encapsulins. Encapsulin targeted rubrerythrins were predicted to have entered eubacteria from transfer out of Archaea also in this lineage, replacing an existing encapsulin associated ferritin-like protein in an ancestor closest to Stigmatella aurantiaca DW4/3-1, which also produced a subfamily of divergent non-fusion targeted ferritin-like proteins. Closely related encapsulin encoding species combined other distinct horizontally transferred encapsulin and enzyme lineages (Figure 3-36).

The hypothesis of the PfV compartment representing an ancestral state of encapsulin 2 was not well supported by these analyses. The encapsulin from Pyrococcus furiosus DSM 3638 most closely matched the nucleotide signature of that genome, with a Karlin distance of 0.0523. This could be due to origin in that genome or prolonged presence in that lineage. However, that nucleotide signature was not seen in any other encapsulins, arguing against this lineage being an ancestral origin of any of the other encapsulins. Phylogeny based on amino acid sequence diversity places the P. furiosus lineage as an offshoot of the larger lineage of Archaeal and extremophilic bacterial encapsulins (Figure A3-2). Among the encapsulins this lineage was distant from the other eubacterial encapsulin lineages, not consistent with close proximity to the predicted shared ancestral state.

Appendix 4: Other predicted novel encapsulin families suggest roles in resistance to toxic disruption of diverse carbon and phosphate metabolisms

A4.1 Smaller novel encapsulin families spanning multiple phyla

Other novel encapsulin/enzyme pairings were predicted with diverse function and mechanisms.

A small family of alpha-ketoglutarate decarboxylases/2-oxoglutarate dehydrogenases were significantly enriched in close proximity to eleven predicted novel encapsulins from diverse bacteria in Firmicutes, Proteobacteria, and Actinobacteria (Hypergeometric test P-value = 1.8E-10). Similar to classical peroxidase encapsulins, these genomic regions were enriched for molybdenum biosynthesis metabolism. Unlike classical peroxidase encapsulins, this family was enriched with dihydrodipicolinate synthases and histidine triad regulatory dephosphorylases. Based on BioGraph linkages, the combination of these enzymes suggests a role in lysine biosynthesis 30.

Discriminating MEME analysis of the novel encapsulin associated set identified one significantly conserved N-terminal motif in the novel encapsulin associated decarboxylases (Figure A4-1, Figure A4-2). This motif was very dissimilar to the classical conserved encapsulin association specific motif, coinciding with this smaller group of divergent novel encapsulins interacting through a distinct interaction interface. The motif was 24 amino acids long with a series of conserved specific hydrophobic and polar positions interspersed with less conserved positions. MAST searching yields good discrimination of encapsulin associated homologs relative to homologs from encapsulin free genomes, up to a cutoff of E-value < 0.002. In both the novel and classical encapsulin associated cases these genes were enriched with delta-aminolevulinic acid dehydratase/porphobilinogen synthases, dihydrodipicolinate synthases, and NAD+ synthases, supporting roles in redox carbon metabolism.

Figure A4-1: Conserved encapsulin association specific motif from novel alpha-ketoglutarate decarboxylase/2-oxoglutarate dehydrogenases

Simplified consensus '[TS][CP]L[FA][TG][DG]N[VA][DAL][LVY][IV]E[DE][ILV]Y[KER][KCQR]Y[QEL][KV][DL][NSP]NS', E-value = 3.9e-087, False positive E-value cutoff > 14. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) demark small sample correction error bars.
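Because the simplified consensus is written in bracket notation, it can be used directly as a regular expression to screen protein sequences. The sketch below assumes the space in the printed consensus is a line wrap artifact and removes it; `find_motif` is a hypothetical helper name. An exact pattern match is considerably stricter than the position specific scoring used by MAST, so it will miss weaker matches that MAST would still report.

```python
import re

# Simplified consensus from the caption, with the wrap-induced space removed (assumption).
CONSENSUS = (
    "[TS][CP]L[FA][TG][DG]N[VA][DAL][LVY][IV]E[DE][ILV]"
    "Y[KER][KCQR]Y[QEL][KV][DL][NSP]NS"
)
PATTERN = re.compile(CONSENSUS)


def find_motif(protein):
    """Return (start index, matched segment) for each exact consensus match."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(protein.upper())]


def consensus_length(consensus=CONSENSUS):
    """Count motif positions: each bracketed set or single letter is one position."""
    return len(re.findall(r"\[[A-Z]+\]|[A-Z]", consensus))
```

Counting positions in this way gives 24, in agreement with the motif length stated in the text.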

Figure A4-2: Example novel encapsulated ketoglutarate decarboxylase/dihydrodipicolinate synthase operon


A number of inorganic pyrophosphatases were significantly enriched in proximity to a family of predicted novel encapsulins. Six moderately divergent inorganic pyrophosphatases were encoded within five ORFs of predicted encapsulins in a number of diverse bacteria (Hypergeometric test p-value = 3.26E-5). Five others encode similar encapsulins and pyrophosphatases separately. The average pairwise percent identity between these enzymes was 70.7%. These genome regions were free of any other phage-like proteins. Gordonia alkanivorans NBRC 16433 and Gordonia rubripertincta NBRC 101908 encode ferritin-similar predicted inorganic pyrophosphatase encapsulin fusions. Eighty other genomes encode similar inorganic pyrophosphatases without encapsulins.
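The enrichment p-values quoted in this appendix come from a hypergeometric test. As a minimal sketch of how such a one-sided p-value can be computed, the following uses only the Python standard library. The function name and the counts in the example call are hypothetical and are not the values used in this analysis; how the gene population and neighbourhood window were defined here is an assumption.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric random variable X.

    population: total number of genes considered (assumed background)
    successes:  genes in the population belonging to the enzyme family
    draws:      genes sampled, e.g. those within a window of an encapsulin
    k:          family members observed among the sampled genes
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts, for illustration only.
p_value = hypergeom_upper_tail(k=6, population=40000, successes=91, draws=110)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value indicates that observing at least k family members near encapsulin genes would be unlikely if family members were distributed at random across the sampled genes.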

MEME discriminative motif discovery analysis identified two conserved motifs potentially involved in targeting/binding encapsulins (Figure A4-3.i, Figure A4-3.ii). Homology threading from the solved structure of an E. coli pyrophosphatase [PDB accession: 2EIP] predicts that these motifs form a surface exposed beta strand and alpha helix, not associated with the multimerization interfaces or enzymatic active site. Both motifs contribute to the exposed surface of the enzyme complex, and expose charged, bulky, and hydrophobic residues ideal for interaction with the inner surface of a capsid-like shell (Figure A4-4.i, Figure A4-4.ii). The C-terminal motif only occurred in encapsulin co-encoding genomes up to an E-value < 0.0097. No predicted encapsulin associated pyrophosphatases scored above this threshold.


Figure A4-3.i: C-terminal conserved encapsulin association specific motif from inorganic pyrophosphatase

E-value = 1.7e-504, log likelihood ratio = 1819. The information content and relative entropy of this motif were 96.9 bits and 90.5 bits respectively. Simplified consensus of 'EAGKWVKVEGW[GE][GD]V[END][EA]ARQEIL[ED]SFERAK'. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) mark small sample correction error bars.

Figure A4-3.ii: Conserved encapsulin association specific motif internally encoded in inorganic pyrophosphatase

E-value = 3.5e-163, log likelihood ratio = 840. The information content and relative entropy of this motif were 65.7 bits and 57.7 bits respectively. Simplified consensus of '[IV]GVLxMEDE[AG][GE]xD[AED]K[IL][LI]AVP[HV][DS][KD]'. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) mark small sample correction error bars.
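The simplified consensus strings given in these captions can be read as patterns: a bracketed group lists the residues tolerated at a position, and 'x' is taken here to mean any residue (an assumption about the notation). The sketch below converts the internal pyrophosphatase consensus into a regular expression and scans a sequence for it. This is a much cruder substitute for MAST, which scores sequences against a position-specific scoring matrix; the helper names and the test sequence are hypothetical.

```python
import re

# Simplified consensus of the internal motif, with extraction spaces removed.
INTERNAL_MOTIF = "[IV]GVLxMEDE[AG][GE]xD[AED]K[IL][LI]AVP[HV][DS][KD]"


def consensus_to_regex(consensus):
    """Translate a simplified consensus into a compiled regular expression.

    Bracketed groups already have regex character-class syntax; the
    lowercase 'x' is assumed to stand for any single residue.
    """
    return re.compile(consensus.replace(" ", "").replace("x", "."))


def find_motif(consensus, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical sequence containing one instance of the consensus.
example = "MKIGVLAMEDEAGTDAKILAVPHDKLL"
print(find_motif(INTERNAL_MOTIF, example))
```

An exact consensus match is stricter than a MAST hit at a given E-value, so this is only useful as a quick screen before running a proper profile search.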


Figure A4-4.i: Candidate pyrophosphatase novel encapsulin targeting motifs are surface exposed. red: internal conserved motif, yellow: C-terminal conserved motif.

Figure A4-4.ii: Candidate pyrophosphatase novel encapsulin targeting motifs as exposed in biological assembly hexamer


Threading predicts the internally encoded second motif exposes hydrophobic and hydrophilic patches on the outer surface of the enzyme structure between the two surfaces formed by the C-terminal domain. Positions three to five fall beneath a helix from the C-terminal conserved encapsulin association specific motif, precluding protein-protein interaction, explaining their non-conservation in the motif. The C-terminal motif was more discriminative than the internally encoded motif. Both motifs were conserved among encapsulin co-encoded pyrophosphatases, and not conserved in encapsulin free genomes. MAST analysis discriminated encapsulin co-encoded from encapsulin free homologs at a cutoff of E-value < 4.10E-17. No predicted encapsulin associated pyrophosphatases scored above this threshold. The two motifs together were uniquely conserved in encapsulin associated homologs, without similar motifs present together in encapsulin free homologs. MEME analysis on the amino acid shuffled dataset of this family predicted random motifs with a minimum E-value of 530.
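The shuffled-dataset control can be illustrated with a short sketch. It assumes that shuffling was done within each sequence, so that length and amino acid composition are preserved while residue order is destroyed; the exact shuffling procedure is not specified here, and the function name and input sequences are hypothetical.

```python
import random


def shuffle_sequences(sequences, seed=0):
    """Return a copy of each sequence with its residues randomly permuted.

    Each sequence keeps its own length and composition, so any motif that
    MEME reports on the shuffled set reflects composition alone.
    """
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    shuffled = []
    for seq in sequences:
        residues = list(seq)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


# Hypothetical input sequences, for illustration only.
family = ["MKVLAAGWTRE", "EAGKWVKVEGWGG"]
control = shuffle_sequences(family, seed=1)
print(control)
```

Running motif discovery on the shuffled set gives a baseline E-value against which the E-values from the real dataset can be compared.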

A4.2 Specialized lineage specific predicted compartments

A subset of enriched encapsulin associated protein families were previously annotated and validated with functions, while other families comprised uncharacterized predicted and hypothetical proteins for which I predicted function based on motif and domain conservation, using the method described above (Figure 3-1). One such family of hypothetical proteins was predicted to be heme-dependent cytochrome-like monooxygenase/hydroxylases. This family was highly enriched within six open reading frames of 22 Enterobacterial isolated capsid-like regions including subsets of major capsid proteins, head stabilizing accessory proteins, and protease proteins (Hypergeometric test P-value = 0.0). The inferred function of this associated family was based on similarity to HMMs of cytochrome P450, heme-dependent monooxygenases, and RuBisCo motifs (PF00016: P-value = 7.4E-05). These functions share roles in reactive carbon/oxygen metabolisms. RuBisCo is compartmentalized in carboxysomes and chloroplasts for CO2 fixation. Monooxygenases insert one oxygen atom from O2 into an organic substrate, reducing the other to water. Cytochrome P450 is involved in diverse redox metabolisms, including toxic naphthalene metabolism 30. This encapsulin family was associated with another confirmed and five potential naphthalene metabolism enzymes 30.

The monooxygenase/hydroxylase family was dissimilar to all sequenced bacteriophage proteins, supporting the conclusion that this association is not prophage dependent. Of the enzyme homologs in this family, 89% were encoded in proximity to capsid-like proteins and showed high levels of sequence conservation overall. Two prophage head region remnants encode capsids with similarity to these candidate encapsulins, along with predicted heme-dependent cytochrome-like monooxygenase/hydroxylases. This supports a common history with these phages, selective conservation of the head operon relative to the tail, and conversion of an ancestral prophage capsid to an encapsulin while other remnants of the prophage remained.

MEME analysis of the predicted encapsulin associated heme-dependent cytochrome-like monooxygenase/hydroxylases detects one significantly conserved encapsulin association specific discriminative C-terminal motif (Figure A4-5). MAST search matches all encapsulin associated members of this family, and none of the homologs encoded in genomes without predicted encapsulins. MEME analysis on the amino acid shuffled dataset of this family predicts random motifs with a minimum E-value of 56. This indicates a significant link between this motif and encapsulins, but may or may not represent a direct binding interaction.


Figure A4-5: Conserved encapsulin association specific motif from predicted heme-dependent cytochrome-like monooxygenase hydroxylases

This motif was expected to occur by chance with E-value = 1.1e-081, and log likelihood ratio of 405. The information content and relative entropy of this motif were 31.4 bits and 27.8 bits respectively. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) mark small sample correction error bars.
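The positional information content plotted in these motif logos can be sketched as follows. The code assumes the standard sequence logo definition: for a 20 letter amino acid alphabet, the information at a position is log2(20) minus the Shannon entropy of the observed residues, optionally reduced by the small sample correction (s - 1) / (2 ln(2) n) for n aligned motif instances. Whether the logos in this appendix were drawn with exactly this formula is an assumption, and the function name and aligned instances are hypothetical.

```python
from collections import Counter
from math import log, log2


def column_information(instances, alphabet_size=20, small_sample=True):
    """Return the information content in bits of each aligned motif position."""
    n = len(instances)
    width = len(instances[0])
    if any(len(seq) != width for seq in instances):
        raise ValueError("all motif instances must have the same length")
    # Small sample correction term; zero when the correction is disabled.
    correction = (alphabet_size - 1) / (2 * log(2) * n) if small_sample else 0.0
    bits = []
    for pos in range(width):
        counts = Counter(seq[pos] for seq in instances)
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        bits.append(max(0.0, log2(alphabet_size) - entropy - correction))
    return bits


# Hypothetical aligned motif instances, for illustration only.
aligned = ["SNARA", "SNARA", "SNAKA", "SNARG"]
print([round(b, 2) for b in column_information(aligned, small_sample=False)])
```

Summing the per-position values gives a total information content for the motif, which is the kind of quantity reported in bits in the captions.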


Another family of uncharacterized proteins was also enriched near this encapsulin family. HHPred detects motif level similarity to HMMs for peptidases, pyrophosphatases, heme-binding domains, undecaprenyl-phosphate galactosephosphotransferases, and glycoside hydrolases. Taken together these suggest an enzyme which interacts with phosphorylated carbohydrate substrates, most likely including undecaprenyl-pyrophosphate or a precursor. Undecaprenyl-pyrophosphate is utilized in rare earth metal resistance through an uncharacterized mechanism 76. Undecaprenyl-phosphate galactosephosphotransferase activity is required in the colanic acid biosynthesis pathway, and cell surface glycosylation (EMBL-EBI). Excessive accumulation of undecaprenyl-pyrophosphate derivatives has been linked to growth defects and abnormal cell morphology in E. coli 77. Thus this protein family may contribute to mitigating toxicity from rare earth metals, undecaprenyl-pyrophosphate, or one or more cell wall disruptive antibiotics, through sequestering the enzyme or substrate from toxic metabolism. Fifteen predicted undecaprenyl-phosphate binding enzymes were encoded an average of 7.3 ORFs from predicted encapsulins in various Escherichia and Shigella strains (Hypergeometric test p-value = 0.0). Five additional members of this family were encoded separately from, but in the same genome as, predicted encapsulins in the same family.

In eleven species these encapsulin/enzyme systems co-occur in the same operon with predicted heme-dependent cytochrome-like monooxygenase hydroxylases. 163 homologs of the predicted undecaprenyl-phosphate binding enzymes occur in encapsulin-less genomes. Comparing these two sets using the MEME algorithm identified one significantly conserved encapsulin association specific motif (E-value = 9.8e-177), unique to the encapsulin adjacent set and absent from the separately encoded or encapsulin free set (Figure A4-6). Shuffled MEME analysis on the same dataset yields no motifs with E-value less than 190, confirming the significance of this motif conservation. MAST analysis discriminates enzymes encoded near predicted encapsulins from homologs not encoded with encapsulins, with a threshold of E-value < 3.2E-14. Two intermediate prophage remnant/encapsulin regions encode homologs at this threshold, as do a number of genomes without detected encapsulins, suggesting that maintenance of this encapsulin may be weaker than for other families.
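The way a MAST E-value cutoff separates encapsulin associated homologs from encapsulin free homologs can be summarized by counting hits on either side of the threshold. The sketch below assumes the per-sequence E-values have already been parsed from MAST output; the function name and the E-values listed are hypothetical and are not results from this analysis.

```python
def evaluate_cutoff(associated, free, cutoff):
    """Count sequences whose MAST E-value falls below the cutoff in each set.

    associated: E-values for homologs encoded near predicted encapsulins
    free:       E-values for homologs from genomes without encapsulins
    """
    true_pos = sum(1 for e in associated if e < cutoff)
    false_pos = sum(1 for e in free if e < cutoff)
    return {
        "true_positives": true_pos,
        "false_negatives": len(associated) - true_pos,
        "false_positives": false_pos,
        "true_negatives": len(free) - false_pos,
    }


# Hypothetical E-values, for illustration only.
associated_hits = [1e-30, 5e-22, 8e-18, 2e-15]
free_hits = [4e-14, 1e-9, 0.3, 12.0]
summary = evaluate_cutoff(associated_hits, free_hits, cutoff=3.2e-14)
print(summary)
```

Sweeping the cutoff over a range of values and repeating the count shows where the two sets begin to overlap, which is how a discriminating threshold can be chosen.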

Figure A4-6: C-terminal conserved encapsulin association specific motif from predicted undecaprenyl-pyrophosphate binding proteins

E-value = 9.8e-177, and log likelihood ratio of 736. The information content and relative entropy of this motif were 80.4 bits and 75.9 bits respectively. Simplified consensus of 'SNARARANIQ[KE]LKTM[VI]NGFRG'. Vertical axis denotes positional information content in bits. Horizontal axis denotes relative position within the motif. Letter colour denotes amino acids of similar properties. Letter height denotes contribution to information content of that position in the motif. Grey vertical brackets (I) mark small sample correction error bars.


Alkalilimnicola ehrlichei MLHE-1 encodes a unique predicted encapsulin adjacent UDP-glucose/GDP-mannose dehydrogenase, YP_741640, with an N-terminal motif absent in any other homologs, 'MRKANNRMMDDTPSFP'. This motif was not significantly similar to sequences in any other bacterial proteins (E-value >= 2.8). The adjacently encoded encapsulin in this system was similar to nine phage capsids by BLASTP (22.6 <= % identity <= 38.9 over 130-379 aa, 2e-57 <= E-value <= 0.32). No other UDP-glucose/GDP-mannose dehydrogenases were encoded adjacent to predicted encapsulins, limiting statistical analysis. A candidate degraded large terminase remnant was encoded immediately upstream of this predicted encapsulin (26.2 <= % identity <= 29.7 over 58-106 aa, 5.4 <= E-value <= 5.7). No other proteins with similarity to phage were detected in proximity. Based on hit coverage and identity the encapsulin sequence was approximately 6.58 times better conserved than the terminase, supporting positive selective maintenance. UDP-glucose and GDP-mannose are important intermediates in several pathways. Though neither is toxic, studied homologs of this enzyme are inhibited by various antibiotics, off-target metabolites and environmental contaminants 30,82,83. Encapsulation may serve to protect this enzyme from these inhibitors. The encapsulation of these single enzyme systems is unconfirmed, and will require experimental confirmation. The larger families of novel capsid-like compartments and co-encoded adjacent enzymes are more confidently biologically important, as they show much stronger conservation and diversity.
