Identification of core genes involved in Streptococcus pneumoniae host-pathogen interactions under diverse infections, and their potential as therapeutic targets

Item Type dissertation

Authors D'Mello, Adonis

Publication Date 2021

Abstract Streptococcus pneumoniae (the pneumococcus) is a human-specific opportunistic pathogen that asymptomatically colonizes the nasopharynx. It is the leading cause of otitis media, community-acquired pneumonia, bacteremia and meningitis, as it can sprea...

Keywords in vivo; multi-species; reverse-vaccinology; transcriptomics; Genomics; Host-Pathogen Interactions--genetics; Streptococcus pneumoniae--genetics

Download date 05/10/2021 11:13:30

Link to Item http://hdl.handle.net/10713/15770

Curriculum Vitae

Adonis D’Mello
Ph.D., expected May 20th, 2021
University of Maryland Institute for Genome Sciences
670 W Baltimore St. HSFIII, Baltimore, MD 21201

Education

2015 B.E. (Hons.) Biotechnology, BITS, Pilani University – Dubai International Academic City. Mentor: Dr. Neeru Sood. GPA: 3.1

2021 Ph.D. (anticipated) in Molecular Medicine, University of Maryland Baltimore, Baltimore, MD, USA. GPA: 3.67. Mentor: Dr. Hervé Tettelin

Professional Experience

2013 Data analysis of Newborn screening tests Mentor: Dr Sanjida Ahmed Eastern Biotech & Life Sciences (Dubiotech), Dubai

2016-2021 Graduate Student Mentor: Dr. Hervé Tettelin Institute for Genome Sciences, University of Maryland Baltimore, Baltimore, MD, USA.

Institutional Service

2016 GPILS Molecular Medicine “Big Brother” Incoming Student Support University of Maryland Baltimore, Baltimore, MD, USA.

2017 GPILS Molecular Medicine Student Recruitment University of Maryland Baltimore, Baltimore, MD, USA.

Student Mentorship

2019 Co-mentoring of GPILS rotation student in Molecular Medicine, Bernadine Monari

Publications – Peer-Reviewed Journal Articles

1: D'Mello A, Riegler AN, Martínez E, Beno SM, Ricketts TD, Foxman EF, Orihuela CJ, Tettelin H. An in vivo atlas of host-pathogen transcriptomes during Streptococcus pneumoniae colonization and disease. Proc Natl Acad Sci U S A. 2020 Dec 29;117(52):33507-33518. doi: 10.1073/pnas.2010428117. Epub 2020 Dec 14. PMID: 33318198; PMCID: PMC7777036.

2: D'Mello A, Ahearn CP, Murphy TF, Tettelin H. ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic vaccine candidates. BMC Genomics. 2019 Dec 16;20(1):981. doi: 10.1186/s12864-019-6195-y. PMID: 31842745; PMCID: PMC6916091.

3: Schumacher M, Nicholson P, Stoffel MH, Chandran S, D'Mello A, Ma L, Vashee S, Jores J, Labroussaa F. Evidence for the Cytoplasmic Localization of the L-α-Glycerophosphate Oxidase in Members of the "Mycoplasma mycoides Cluster". Front Microbiol. 2019 Jun 19;10:1344. doi: 10.3389/fmicb.2019.01344. PMID: 31275271; PMCID: PMC6593217.

4: Pettigrew MM, Ahearn CP, Gent JF, Kong Y, Gallo MC, Munro JB, D'Mello A, Sethi S, Tettelin H, Murphy TF. Haemophilus influenzae genome evolution during persistence in the human airways in chronic obstructive pulmonary disease. Proc Natl Acad Sci U S A. 2018 Apr 3;115(14):E3256-E3265. doi: 10.1073/pnas.1719654115. Epub 2018 Mar 19. PMID: 29555745; PMCID: PMC5889651.

5: Lizcano A, Akula Suresh Babu R, Shenoy AT, Saville AM, Kumar N, D'Mello A, Hinojosa CA, Gilley RP, Segovia J, Mitchell TJ, Tettelin H, Orihuela CJ. Transcriptional organization of pneumococcal psrP-secY2A2 and impact of GtfA and GtfB deletion on PsrP-associated virulence properties. Microbes Infect. 2017 Jun;19(6):323-333. doi: 10.1016/j.micinf.2017.04.001. Epub 2017 Apr 10. PMID: 28408270; PMCID: PMC5581956.

Proffered Communications

Proffered Oral Presentations

D'Mello A, Ahearn CP, Murphy TF, Tettelin H. ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates. NIAID Genomic Centers for Infectious Diseases (GCID) Annual Meeting 2020, Virtual, 08/22/2020, Maryland, U.S.A

D'Mello A, Riegler AN, Martínez E, Beno SM, Ricketts TD, Foxman EF, Orihuela CJ, Tettelin H. An in vivo atlas of host-pathogen transcriptomes during Streptococcus pneumoniae colonization and disease. Virtual Streptococcal Seminar Series 2020, Virtual, 09/09/2020, Alabama, U.S.A

Proffered Poster Presentations

D'Mello A, Riegler AN, Martínez E, Beno SM, Ricketts TD, Foxman EF, Orihuela CJ, Tettelin H. An in vivo atlas of host-pathogen transcriptomes during Streptococcus pneumoniae colonization and disease. International Symposium on Pneumococci and Pneumococcal Diseases (ISSPD), Virtual, 06/21/2020, Toronto, Canada

Proffered Poster Presentations (author, but not presenter)

Dual RNAseq of PMNs infected with multi-drug resistant gonococci

Edwards, V, McComb, E, D’Mello, A, Gray, M, Ravel, J., Criss, A., Tettelin, H.

International Pathogenic Neisseria Conferences (IPNC), 09/27/2020, California, U.S.A

Abstract

Title of Dissertation: Identification of core genes involved in Streptococcus pneumoniae host-pathogen interactions under diverse infections, and their potential as therapeutic targets

Adonis D’Mello, Doctor of Philosophy, 2021

Dissertation Directed by: Hervé Tettelin, PhD, Professor, Department of Microbiology & Immunology, Institute for Genome Sciences

Streptococcus pneumoniae (the pneumococcus) is a human-specific opportunistic pathogen that asymptomatically colonizes the nasopharynx. It is the leading cause of otitis media, community-acquired pneumonia, bacteremia and meningitis, as it can spread from the nasopharynx. It has been estimated that ~500,000 children under the age of 5 die annually due to pneumococcal infection, and pneumococcal bacteremia and meningitis have mortality rates of over 60% in the elderly. It has also been established that influenza A co-infections enhance the progression from asymptomatic colonization to invasive disease, with clinically worsened outcomes. Utilization of antibiotics and pneumococcal conjugate vaccines has resulted in the rise of antibiotic resistance and emergence of non-vaccine serotypes, requiring identification of novel therapeutic targets.

We hypothesized that conserved pneumococcal genes exist that are involved in these diverse forms of colonization, infection, and influenza co-infection. We sought to identify such genes using a combination of in silico, in vivo and ex vivo approaches. Through pangenomics and reverse vaccinology (in silico), multi-species transcriptomics on a mouse model of pneumococcal colonization and invasive disease (in vivo), and a primary human lung epithelial model of pneumococcal and influenza co-infection (ex vivo), we observed mechanisms underlying diverse host-pathogen interactions and identified novel potential avenues for therapy from both the host and bacterial perspectives.

Identification of core genes involved in Streptococcus pneumoniae host-pathogen interactions under diverse infections, and their potential as therapeutic targets

by Adonis D’Mello

Dissertation submitted to the faculty of the Graduate School of the University of Maryland, Baltimore in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2021

© Copyright 2021 by Adonis D’Mello

All rights reserved

Acknowledgements

First and foremost, I would like to thank my mentor Dr. Hervé Tettelin for all that he has done over the course of my graduate school career. Working under his tutelage over the past 5 years has taught me so much. Whether it was growing as a scientist, expanding my knowledge, inspiring me to do my best or providing life and career advice, he was always there. I wouldn’t be capable of the things I have learnt over the years without his guidance and support, and for that I am extremely grateful. I would like to thank our collaborators Drs. Timothy Murphy and Christian Ahearn for their assistance developing the reverse vaccinology project. I also want to thank Dr. Carlos Orihuela for his help with the in vivo mouse project and for welcoming me into his home during my period at the University of Alabama, Birmingham (UAB), learning to construct knock-out mutants. I also thank the Orihuela lab, with special thanks to Ashleigh Riegler, Eriel Martinez and Dr Ellen Foxman, for their invaluable assistance on the in vivo mouse project, as well as Drs. Kevin Harrod and Jennifer Tipper for providing us with the ex vivo human lung epithelial cell samples.

I would like to thank Drs. Scott Devine, Robert Ernst, Julie Dunning Hotopp, Kevin McIver and Carlos Orihuela for serving on my thesis committee and helping me strengthen my thesis and projects with their ideas. I also must thank Dr Scott Devine, Dr Toni Antalis and Chelsea Rosenberger for their help in the Molecular Medicine Program.

I would like to thank my family, my parents Abel and Asha and my sister Adora for their continuous support throughout graduate school, without which I wouldn’t have made it to this stage. Finally, I would like to thank the friends I have made during my graduate school career. You all have made the past 5 years fly by with your company and have given me some wonderful unforgettable memories.


Table of Contents

List of tables

List of figures

List of abbreviations

I. Chapter 1: Introduction

A. Biology of Streptococcus pneumoniae

B. Pneumococcal disease

C. Therapeutics

D. Challenges

E. Avenues for discovery of new therapeutics

F. Concluding remarks

II. Chapter 2: Reverse vaccinology and its application to Streptococcus pneumoniae

A. ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates

B. Abstract

C. Background

D. Results

E. Discussion

F. Conclusion

G. Methods

H. Supplementary information

I. Acknowledgements

J. Identification of Streptococcus pneumoniae vaccine candidates with ReVac

K. Assessing pneumococcal genomic diversity

L. Defining a representative pneumococcal sub-population

M. Application of ReVac for prioritization of conserved pneumococcal potential vaccine candidates (PVCs)

III. Chapter 3: An in vivo atlas of host–pathogen transcriptomes during Streptococcus pneumoniae colonization and disease

A. Abstract

B. Significance statement

C. Introduction

D. Results

E. Discussion

F. Materials and methods

G. SI Appendix 1

H. Supplementary data

I. Acknowledgements

IV. Chapter 4: Influenza A virus modulation of Streptococcus pneumoniae infection using an ex vivo human primary lung epithelial cell model

A. Abstract

B. Background

C. Results

D. Discussion

E. Methods

F. Code availability

G. Supplementary data

V. Chapter 5. Conclusions and future directions

VI. References

List of tables

Table 2.1 List of all the programs run in ReVac and their predicted features, with the scoring scheme for each program’s output

Table 2.2 Examples of control used to develop the scoring scheme, and a summary of the outputs from each of ReVac’s components

Table 2.3 Top candidates selected from ReVac’s output for Nontypeable Haemophilus influenzae and M. catarrhalis (M.W/pI represent molecular weights and isoelectric points)

Table 2.4 List of Streptococcus pneumoniae vaccine candidates

Table 5.1 Ten conserved pneumococcal genes of interest across all studies with high ReVac scores

List of figures

Figure 2.1 Schematic of the ReVac workflow, its components, and underlying features

Figure 2.2 A density plot showing the scores for all sequences run through ReVac, and the cutoff for our M. catarrhalis and Nontypeable Haemophilus influenzae datasets

Figure 2.3 ReVac’s CPU time estimates

Figure 2.4 Whole genomes trees of M. catarrhalis and protein alignment trees of ReVac’s M. catarrhalis candidates

Figure 2.5 Whole genome tree of the 270 Nontypeable Haemophilus influenzae genomes used in ReVac

Figure 2.6 Whole genome trees of Streptococcus pneumoniae genomes

Figure 2.7 Protein alignment trees of Streptococcus pneumoniae vaccine candidates

Figure 3.1 Rarefaction curves and estimation of pneumococcal burden

Figure 3.2 Pneumococcal principal component analysis (PCA) and dendrograms

Figure 3.3 Pneumococcal strains D39 and 6A10 are more genetically similar when compared to TIGR4

Figure 3.4 Principal component analysis (PCA) of in vivo and in vitro representative datasets

Figure 3.5 Venn diagrams of pneumococcal differentially expressed (DE) genes detected in vivo and in vitro in mimicking conditions

Figure 3.6 Host principal component analysis (PCA) and dendrogram

Figure 3.7 Differentially expressed (DE) genes detected in the pneumococcus and host

Figure 3.8 Differentially expressed (DE) genes of the pneumococcus and the host

Figure 3.9 Differentially expressed (DE) genes of the host in the nasopharynx

Figure 3.10 Log2 Fold Change (LFC) correlation plots of RNA-seq LFC vs qRT-PCR ∆∆Ct (equivalent to LFC) in pneumococcal infected lung, heart and nasopharynx

Figure 3.11 Log2 Fold Change (LFC) correlation plots of RNA-seq LFC vs qRT-PCR ∆∆Ct (equivalent to LFC) in the mouse lung

Figure 3.12 Pneumococcal and host genes of particular interest

Figure 3.13 In vivo competition experiments between pneumococcal strain TIGR4 and isogenic mutants in highly expressed genes

Figure 3.14 Assessment of the interferon response and its contribution to mouse protection

Figure 3.15 The pneumococcus can rely on the ula operon for growth with ascorbate as a sole carbon source

Figure 3.16 Z-scored heatmap of host TLR 7, 9, 13, and chaperone protein Unc93b1

Figure 4.1 Rarefaction curves and estimation of pneumococcal and viral burdens

Figure 4.2 EF3030 CFUs recovered from an apical wash of the Air Liquid Interface transwell lung epithelial system

Figure 4.3 Differentially Expressed (DE) EF3030 genes

Figure 4.4 Weighted Gene Co-expression Network Analysis (WGCNA) clusters of EF3030 gene expression

Figure 4.5 EF3030 genes of interest

Figure 4.6 Human gene expression profiles and DE genes

Figure 4.7 PCA analysis of expression profiles of 77 EF3030 metabolic genes in our study and previously studied our in vitro and in vivo mouse models

Figure 4.8 Immunofluorescence of Air Liquid Interface cultured nHBEC cells

List of abbreviations

ABC ATP Binding Cassette

ALI Air Liquid Interface

BCAA Branched chain amino acids

BMC Blood mimicking conditions

BSML Bioinformatic Sequence Markup Language

BSR Blast Score Ratio

CDM Chemically Defined Medium

CDP Cytidine diphosphate

CDSs Coding sequences

CFU Colony Forming Unit

COG Clusters of orthologous genes

COPD Chronic obstructive pulmonary disease

CPU Central Processing Unit

CPS Capsular polysaccharide

CSP Competence Stimulating Peptide

DE Differentially Expressed

ECM Extra Cellular Matrix

FDR False Discovery Rate

GFF General Feature Format

GI Genomic islands

GO Gene Ontology

GPS Global Pneumococcal Sequencing

HGT Horizontal Gene Transfer

HMM Hidden Markov Model

HK Histidine kinase

IAV Influenza A virus

IEDB Immune Epitope Database

LFC Log2 Fold Change

LMC Lung mimicking conditions

LTA Lipoteichoic acid

M.W Molecular weight

MHC Major Histocompatibility Complex

MLST Multilocus sequence type

MLVA Multilocus variable number of tandem repeat analysis

NCBI National Center for Biotechnology Information

nHBEC Normalized Human Bronchial Epithelial Cells

NGS Next-generation sequencing

NMC Nasopharynx mimicking conditions

NTHi Nontypeable Haemophilus influenzae

PBS Phosphate Buffered Saline

PCA Principal Component Analysis

PCV7/13 Pneumococcal Conjugate Vaccine 7/13

pI Isoelectric point

PTS Phosphotransferase System

ptt NCBI Protein Table

PVC Potential vaccine candidates

RAM Random Access Memory

RSV Respiratory syncytial virus

RV Reverse vaccinology

SCL Subcellular localization

SPase Signal peptidase

Spn Streptococcus pneumoniae

SSRs Simple sequence repeats

TAP Transporter associated with antigen processing

TCS Two Component Systems

THY Todd-Hewitt with Yeast extract

VST Variance Stabilized Transformation

WGCNA Weighted Gene Co-expression Network Analysis

Abbreviations do not include gene, software, or database names.

I. Chapter 1: Introduction

Streptococcus pneumoniae (Spn), also known as the pneumococcus, is a Gram-positive, non-motile, lancet-shaped bacterium that usually exists as pairs known as diplococci. The pneumococcus is typically a human commensal organism that asymptomatically colonizes the nasopharynx (1). However, in those with aged, underdeveloped, or compromised immune systems, such as the elderly, young children, or those with underlying medical conditions, it can cause opportunistic infections (1). Spn is the main cause of community-acquired pneumonia and meningitis in these individuals. It can also cause other infections as it progresses from the nasopharynx, such as bronchitis, otitis media, or sepsis, to name a few (1). According to the Centers for Disease Control and Prevention (CDC), pneumococcal pneumonia causes approximately 150,000 hospitalizations each year in the United States alone. Globally, pneumonia and bronchiolitis are responsible for over 2 million deaths, of which the pneumococcus was the leading cause of morbidity and mortality in approximately half of the cases (2). This is excluding other forms of pneumococcal infections mentioned above. Therapeutic use of vaccines, such as pneumococcal conjugate vaccines (PCV7 and PCV13, described more in later sections), has significantly reduced pneumococcal disease in the elderly and infants under 12 months of age. However, neonatal infants (<2 months) are still vulnerable as the vaccine is ineffective at this stage. A study conducted in 8 hospitals between 1993 and 2001 found a high incidence of invasive pneumococcal disease in this age group, including pneumonia, meningitis, septicemia, bone and joint infections, and otitis media (3). The pneumococcus is highly specialized at colonizing and infecting its host and has been extensively studied, but it remains a significant pathogen despite current forms of therapy. In this section, we will explore the pneumococcus, its commensal and pathogenic nature, current therapeutics, how we have been studying the bacterium and how we can improve our understanding of the species.

A. Biology of Streptococcus pneumoniae

1. Growth patterns of Streptococcus pneumoniae

As with several other species of bacteria, Streptococcus pneumoniae can alter its growth pattern to suit its environment. When introduced into a rich environment, Spn can typically grow in a ‘free-floating’ or planktonic mode. However, it is also capable of growing independently as a biofilm, as well as with other microbial species in complex multi-species biofilms. Such forms of growth have been observed both in vitro and in vivo (4, 5). Multiple studies have shown that the pneumococcus usually persists as a biofilm during colonization and disease, apart from bacteremia, during which it transitions to planktonic growth (5, 6). However, these modes of growth are not absolute; both forms can exist in differing proportions, depending on the environment during colonization and disease.

These two major contrasting forms of growth can offer a survival advantage to the pneumococcus within its respective environment. For example, during bacteremia or host sepsis, it exists as short free-floating chains of pneumococci. This form allows Spn to evade its host’s immune system by reducing its overall size, which lowers the probability of antibody-based agglutination and subsequent complement-based killing (7). However, in other host anatomical sites, such as the heart, pneumococcal biofilm growth allows for more resistance to antibiotic penetrance and further host immune evasion by killing host macrophages (5). Besides these advantages, when present as biofilms, the pneumococcus demonstrates complex quorum sensing patterns with other bacteria occupying the same niche, including other closely related streptococcal species (8). This allows for genetic exchange to occur through multiple mechanisms, which will be explored later. There also exists strong interspecies competition for , sometimes even through the expression of protein toxins, known as bacteriocins, against the other species (4).

The above-mentioned studies investigating both forms of pneumococcal growth have shown the importance of growth patterns and their link with host colonization and infection. It is important to note that not all strains of Spn can form biofilms and not all biofilms are pathogenic to the infected host, demonstrating a complex role of growth patterns regarding pathogenicity and host adaptation (9).

2. Virulence factors and surface exposed features

The pneumococcus is a complex microorganism that exhibits genetic plasticity and causes multiple disease outcomes. It is therefore almost certain to harbor a diverse panel of virulence factors and surface exposed features involved in host-pathogen interactions. Common classes of virulence factors include its polysaccharide capsule, surface proteins and enzymes, and a single major toxin called pneumolysin. As the polysaccharide capsule arguably stands as a field of study on its own, it will be discussed in the next section. Here we will describe key surface proteins, enzymes, and pneumolysin.

Pneumolysin (PLY) is probably the most well-studied pneumococcal virulence factor. PLY is a 53-kDa protein toxin made by nearly all disease-causing clinical isolates (10). It is part of a family of proteins called hemolysins, which are capable of lysing host red blood cells. Once released by the pneumococcus, PLY binds to the cholesterol present within host cell membranes. Pneumolysin monomers then oligomerize to form circular pores in the membrane, causing cell dysfunction and, at high concentrations, cell death (11). PLY and other hemolysins are immunogenic and can trigger the host complement pathway (11). ΔPLY strains of Spn have reduced virulence and inflammation in mouse lungs (11). They also showed lower bacterial burdens in the mouse respiratory tract (11). Clearly, pneumolysin can play an important role in most pneumococcal infections including bacteremia and even meningitis, albeit to a lower degree (11).

Another extensively studied class of pneumococcal surface proteins, specific to Spn and closely related species (12), is the choline binding proteins (CBPs). These are anchored to the cell wall through the interaction of repeat regions in the CBPs with pneumococcal surface choline. The number of CBPs varies from 13 to 16 among isolates of the pneumococcus (13). Several CBPs have been identified and implicated in virulence, such as four cell wall hydrolytic enzymes (LytA, LytB, LytC, and CbpE), and pneumococcal surface proteins A and C (PspA, PspC) (11).

Other pneumococcal surface virulence factors take the form of cell wall anchored proteins, or LPXTG (Leu-Pro-any-Thr-Gly) motif-containing proteins. Different Spn strains have different numbers and types of these proteins; up to 17 have been discovered (11). Most LPXTG anchored proteins have the motif near the C-terminus, where it acts as a recognition sequence for sortase enzymes, which then covalently bond the threonine in the motif to the cell wall peptidoglycan layer of the pneumococcus. Several of these LPXTG proteins fall under the class of proteases and have been extensively studied (11). A few will be described here, namely, hyaluronidase, neuraminidase, serine protease PrtA, and zinc-metalloproteases.
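As a rough illustration of the motif described above, the following Python sketch scans a protein sequence for LPXTG and flags hits that lie near the C-terminus. The function name `find_lpxtg`, the 50-residue window, and the toy sequences are assumptions made for this example and are not taken from the cited studies; real sortase substrate prediction typically also considers the hydrophobic stretch and charged tail that follow the motif.

```python
import re

# LPXTG sortase recognition motif: Leu-Pro-any residue-Thr-Gly
LPXTG = re.compile(r"LP.TG")

def find_lpxtg(protein, c_terminal_window=50):
    """Return (0-based start, matched text, near_c_terminus) for each LPXTG hit.

    c_terminal_window is a hypothetical cutoff: a hit counts as C-terminal
    if at most this many residues follow the motif.
    """
    seq = protein.upper()
    hits = []
    for match in LPXTG.finditer(seq):
        residues_after = len(seq) - match.end()
        hits.append((match.start(), match.group(), residues_after <= c_terminal_window))
    return hits

# Toy sequences (not real pneumococcal proteins)
c_terminal_example = "M" + "A" * 100 + "LPETG" + "A" * 20
n_terminal_example = "MLPKTG" + "A" * 200

print(find_lpxtg(c_terminal_example))
print(find_lpxtg(n_terminal_example))
```

The second toy sequence places the motif near the N terminus, so its hit is reported but not flagged as C-terminal.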

Hyaluronidase is a highly conserved enzyme, expressed and secreted by 99% of clinical isolates (11). It has been shown to aid bacterial growth and dissemination by breaking down the hyaluronic acid present in host connective tissue and extracellular matrix, a process that induces host inflammatory responses through cytokine secretion triggered by binding to CD44 on host cells (11). This hyaluronic acid can also be used by the pneumococcus as a carbon source for growth (14).

Neuraminidase has been shown to cleave N-acetylneuraminic acid from glycolipids, lipoproteins, and oligosaccharides present on host cell surfaces (11). It has also been shown to cleave off sialic acid residues from host glycoproteins exposing galactose. Both sugars can be utilized by Spn and are known to trigger biofilm formation in the host nasopharynx (15).

PrtA is a part of the subtilase family of serine proteases. Although its exact role is unknown, a ΔPrtA strain of D39 showed attenuation in mice (11).

Zinc metalloproteases are another class of proteases that bind and are regulated by zinc ions, through a conserved HxxHHxH amino acid motif. Four pneumococcal zinc metalloproteases have been identified, namely an IgA1 protease (or ZmpA), ZmpB, ZmpC, and ZmpD, of which the first three have been experimentally characterized. IgA1 protease cleaves host Immunoglobulin A, preventing phagocytosis mediated by capsule-specific host IgA1 (16, 17). ZmpB has been shown to contribute significantly to pneumococcal virulence during pneumonia and bacteremia through knock-out mutant studies, which showed increased mouse survival times and reduced bacterial counts (18). ZmpC, in contrast, is known for ECM remodeling. These zinc metalloproteases contain an LPXTG motif as well; however, it is present at the N terminus of the protein, and their true localization was unknown until recently (16). Some studies have shown that they are secreted by the pneumococcus (19, 20).

Of course, within the species, several other surface factors exist that are relevant to pneumococcal environmental adaptation and survival in infection, such as the adherence proteins and other unclassified lipoproteins. These and the virulence factors described above shed some light on the complexity present across the pneumococcal species in the context of virulence. Not only does the presence of these factors vary across the multiple pneumococcal strains, but the amino acid sequence diversity within each of them across the species is also a major variable in the severity of virulence and resulting host responses.

3. Capsular polysaccharide (CPS)

Most pneumococcal strains, known as encapsulated strains, have an external layer of polysaccharide coating the cell wall. This layer can be up to ~400 nm in thickness and represents over half of the pneumococcal cell volume (21). Studies have shown that this layer is usually covalently linked to the peptidoglycan in the cell wall, at different linkage sites depending on the strain involved (22). Two known exceptions to this are pneumococcal serotypes 3 and 37, which are non-covalently bound using a phosphatidylglycerol group (23, 24).

The CPS is a known key virulence determinant of the pneumococcus (21). Most recovered clinical isolates have a CPS and ~100 different CPS types have been recognized to date. Non-encapsulated versions of these isolates are often entirely avirulent, supporting the importance of the CPS in virulence (21). As a result, CPSs are currently used in pneumococcal vaccines, described in later sections.

Being the outermost surface, the CPS acts as an important pneumococcal defense and host interaction system. The majority of CPSs are highly negatively charged at host physiological pH (7.35-7.45). It is hypothesized that this promotes the repelling of negatively charged, sialic acid rich host glycoproteins in the nasopharynx, preventing adherence, but pneumococcal neuraminidase cleaves these residues, allowing evasion of the nasal mucus and assisting in adherence to epithelial cells (25).

The CPS also fulfills other important functions such as reducing binding to host immunoglobulins, host complement proteins, and C-reactive protein (21). It has been shown that the CPS interacts with C3b/iC3b proteins of the host complement system and the Fc domains of host immunoglobulins, limiting opsonophagocytosis to a significant degree (21).

Studies of Spn mutants producing different amounts of CPS have established a strong relationship between virulence and capsular thickness for a given pneumococcal strain (26). Apart from thickness, differences in serotype structure and composition affect a given pneumococcal strain’s ability to reduce opsonophagocytosis in vitro and induce an immune response in vivo, hence diversifying the invasive potential of different strains (21).

The pneumococcus regulates its capsule quite effectively to suit its host environment, an epigenetic-based process known as phase variation. Studies have shown that lowered capsule production allows for better host cellular adherence and colonization, but poor virulence, and conversely strains that have more capsule tend to be more virulent but are poor host colonizers (27).

Genes involved in CPS synthesis were demonstrated to be present in the same locus of the pneumococcal genome, flanked by the genes dexB and aliA. This holds true for all studied serotypes except type 37 (21). Five genes in the cps locus, cpsABCD and cpsE/wchA, are highly conserved except in serotypes that do not have glucose in their CPS. The latter serotypes are missing the cpsE/wchA gene, which encodes the glucose-1-phosphotransferase that catalyzes glucose addition (21). It has also been shown that cpsCDE genes form two distinct subclasses, which suggests evolution from two ancestral CPS types (21).


4. Two-component signal transduction systems

The pneumococcus, like other members of the Streptococcus genus and other bacteria (28), exhibits a simple form of regulation and environmental sensing through two-component systems (TCSs), also called two-component signal transduction systems. The basic model of the TCS consists of a membrane-associated sensor histidine kinase (HK), typically a homodimer, and a cytoplasmic cognate response regulator (RR) (29). Upon receipt of a specific external stimulus, the kinase domain of the HK sensor protein is activated to auto-phosphorylate conserved histidine residues on the HK dimer (29). This phosphate group is then transferred by the HK to a conserved aspartate residue in its cognate RR (29). The RR then undergoes a conformational change allowing it to regulate gene expression or protein function, usually by acting as a DNA-binding transcriptional regulator (29). At least 13 known TCSs exist for the pneumococcus, most of which have been shown to contribute to virulence in vivo in mice (29). However, other studies have shown that some of these are highly serotype/strain dependent. Indeed, it appears that, at least in the context of virulence, TCS mechanisms are diverse across Spn strains, and that this diversity is a significant variable affecting their function and distribution (29, 30).

5. Competence and transformation

The pneumococcus, with its diverse panel of serotypes and genotypes, is a renowned champion at natural genetic transformation. This natural ability to easily undergo horizontal gene transfer (HGT) is widely conserved in the Streptococcus genus, and as a result has a significant impact on pneumococcal adaptability and evolution (31). It was this property of the pneumococcus that led to the discovery of deoxyribonucleic acid (DNA) as the heritable genetic material (32). The main form of pneumococcal HGT involves competence and uptake of extracellular DNA. An advantage for transformation in the pneumococcus over Gram-negative species is that it only has to take up DNA across a single cytoplasmic membrane (33). However, other forms exist such as conjugation, which is the exchange of bacterial plasmids through specialized pilus protein machinery, or transduction, which involves bacteriophage infection to bring in and integrate foreign genes (34).

Consider, for example, the CPS described in the section above. These polysaccharide capsules are extremely diverse, immunogenic, and key virulence determinants. As a result, the CPS is under significant selective pressure. Through HGT, the pneumococcus is known to exchange parts of the cps locus, as well as the locus in its entirety, giving rise to more CPS variants and ultimately more serotypes (35). Of course, such transformation is not limited to the cps locus but occurs across the genome. Interactions with other streptococcal species also provide a larger pool of DNA for potential genetic exchange (31).

It has been established that such inter-streptococcal and intra-pneumococcal genetic exchange involves a temporary state of natural competence, which lasts up to 40 minutes (31). This competence state is induced by a competence stimulating peptide (CSP), a hydrophobic peptide pheromone secreted by competent cells within the same niche. CSP triggers the expression of various genes and entry into the competent state, allowing uptake of extracellular free DNA as well as induction of bacteriocin expression against other cells in the environment (31).

The mechanism of CSP secretion and the genes involved have also been determined (31). Most pneumococcal strains produce one of two alleles, CSP-1 or CSP-2. Translated pre-CSP is processed and secreted through the ATP binding cassette transporter ComAB (31). The secreted CSP is detected by the histidine kinase ComD on other pneumococci, which phosphorylates ComE (31). Phosphorylated ComE then induces the transcription of comAB, comCDE, and sigX, as well as other genes, initiating a CSP release feedback through comAB. SigX regulates expression of DNA uptake machineries and bacteriocins (31).

Competent pneumococci also carry what is known as a toxin-antitoxin system that brings about pneumococcal “fratricide”. They can produce CibAB, a two-peptide bacteriocin that acts on the membrane of non-competent cells (31), while competent cells are protected from the bacteriocin through the expression of CibC (31). These factors kill neighboring cells, exposing DNA that can be integrated into the pneumococcal genome by homologous recombination. They also trigger biofilm formation and the release of pneumolysin from the non-competent cells, which lyses host cells and leaves them vulnerable to infection by the competent cells (31, 36).

Such CSP-mediated transformation typically involves small segments of DNA. However, larger transformation events have also been observed. Consider the case of PMEN-1, a multidrug-resistant clone of the pneumococcus. One study traced the evolution of this clone using isolates collected from different countries over a period of 65 years (37). It found that large 30-50 kb segments of DNA in the capsule locus came from multiple other pneumococcal strains, yet no markers of conjugation or transduction were found around the capsule locus in PMEN-1 (37). Another study established that intercellular contact between competent and non-competent pneumococci in biofilms can allow large recombination events of such 30 kb magnitudes, often with multiple events occurring simultaneously (38). As we have seen, competence also triggers biofilm formation and fratricide, keeping lysed pneumococci in proximity for DNA uptake before the DNA is degraded. Another example is pneumococcal strain Hungary 19A, which carries donor DNA from S. mitis and S. oralis (31). These results show that S. pneumoniae is not only capable of frequent genetic transformation but also effectively uses these systems for adaptation and evolution.

6.

The pneumococcus, with its two different growth patterns, genetic transformation capabilities, and intraspecies variability, encodes diverse metabolic capabilities. It has evolved to adapt to the environmental conditions of various host anatomical sites, whether to colonize or to cause disease, and as such can take up and process a variety of nutrients for energy, survival, and replication.

Indeed, over 30% of pneumococcal transport mechanisms are involved in the uptake of diverse carbon sources, many of which have not been investigated in detail (39). The majority of pneumococcal transporters are classified as phosphotransferase systems (PTS) and ATP binding cassette (ABC) transporters. At least 21 different PTS and 8 different ABC transporters have been predicted in the pneumococcus, capable of importing at least 32 different carbohydrates (39). It has also been shown that the genes for at least 20 of these carbon transporters are present across all strains within the species (39), suggesting that they offer some selective advantage. Uptake through a PTS is size restricted and results in phosphorylation of the carbohydrate during transport (39). ABC transporters use intracellular ATP as the energy source for uptake of the carbohydrate without phosphorylation and hence require more energy to operate, but they are not as size restrictive (39).

Global nutritional metabolism and regulation are maintained by CodY and the catabolite control protein A (CcpA). CodY has been shown to regulate carbon metabolism, amino acid metabolism, and iron uptake (40), whereas CcpA regulates several metabolic genes directly and indirectly, controlling ~19% of all pneumococcal genes (41). Both studies demonstrated that regulation by these two regulators has an impact on the overall virulence of the pneumococcus.

Amino acid uptake provides nitrogen sources for the pneumococcus and can be an important factor in host-pathogen interactions. It has been shown that glutamate and glutamine have significant roles in pneumococcal metabolism (42, 43). The pneumococcus is auxotrophic for arginine, cysteine, histidine, glycine, glutamine, and branched chain amino acids (BCAA) such as isoleucine, leucine, and valine. BCAA uptake is performed by the LivJHMGF ABC transporter, which also contributes to virulence (42, 43). As such, acquisition of specific nutrients is linked not only to the virulence of the pneumococcus but also to its potential for growth, adaptation, and survival during colonization, transmission, and disease. Hence pneumococcal metabolism is an important field of study.

B. Pneumococcal disease

1. Anatomical site adaptations

As described in the earlier section, the pneumococcus is a highly diverse species, with varied gene presence/absence profiles resulting from its high capacity for natural transformation and recombination. It is widely known that the pneumococcus can asymptomatically colonize the nasopharynx and that colonization can be followed by invasive pneumococcal disease.

The nasopharynx is a vital colonization niche for many bacteria such as Streptococcus pneumoniae, Haemophilus influenzae, and Moraxella catarrhalis. In children, this nasopharyngeal microbiome is populated a few months after birth. Every person is likely to be colonized by these bacteria at some point in life, usually asymptomatically, but sometimes colonization progresses to disease (44).

As mentioned previously, most pneumococci have a specific polysaccharide capsule, which can be highly variable between strains and is an important virulence factor because it protects the pneumococcus from phagocytosis (44). This provides a first line of defense against the host, improving the chances of colonization. Colonization primarily requires pneumococcal adherence to the host epithelium in the upper respiratory tract (44). Adherence is facilitated by binding to epithelial cell surface carbohydrates, such as N-acetyl-glucosamine, access to which depends on the amount of pneumococcal CPS produced. This binding occurs through specialized cell wall-associated proteins, such as pneumococcal surface adhesin A (PsaA) (44).

Another surface protein, choline binding protein A (CbpA), has been shown to bind human cell surface sialic acid and lacto-N-neotetraose. It also interacts directly with the human polymeric Ig receptors (pIgR) on epithelial cells (44). Once attached to host epithelial cells, the pneumococcus can translocate across this barrier using receptor-mediated endocytosis by attaching to pIgR through the YRNYPT motif of CbpA on the apical surface, followed by receptor internalization and recycling to the basolateral surface (44).

Not only does the pneumococcus have specialized genes for adherence, but its nutrient scavenging is also adapted to this environment. It has been shown that pneumococcal Neuraminidase A cleaves host cell sialic acid residues to access the galactose beneath as a carbon source for growth, while also allowing more access for adherence (15). Several other genes, including bgaA, choP, eno, hyl, and pavA, are also specialized and involved in pneumococcal adherence and cleavage of host biomolecules for growth (45). These pneumococcal genes and virulence/surface factors are well suited for human nasopharyngeal colonization, the first step in promoting transmission or transitioning to invasive pneumococcal diseases such as pneumonia, sepsis, or meningitis.

To proceed with infection of the lower respiratory tract and development of pneumonia, the pneumococcus improves its chances of adherence in this region by shedding its capsule, which also improves resistance to antimicrobial peptides, as they would target the shed capsule (46). Capsular shedding has been shown to occur through the bacterial autolytic enzyme LytA. LytA primarily causes bacterial autolysis upon exposure to antibiotics but can also cause capsular shedding without lysis (46). After capsular shedding, pneumococci can infect lower respiratory epithelial cells, causing pneumonia and exposing the extracellular matrix (ECM) in bronchioles and alveoli. Several bacterial surface proteins, such as PepO, PavA, PavB, PfbA, PclA, and PsrP, promote attachment to the ECM, ultimately promoting infection of the alveolar lobes (47).

Here, inflammation arising from pneumococcal infection is severe and is caused by the pneumococcal cell wall and pneumolysin. Cell wall peptidoglycan binds to pattern recognition molecules, which bind to the heterodimers TLR-1/TLR-2 or TLR-2/TLR-6 and signal the secretion of NF-κB-induced proinflammatory cytokines (47). Similarly, pneumolysin binds to the Fc region of antibodies and induces the complement pathway, also triggering inflammation (47). At this stage of infection, alveoli are overrun with pneumococci, edema, and red blood cells encased in a fibrin protein matrix. The lymphatics are also dilated and filled with cells and fibrin (47). The host has now entered a bacteremic stage. Pneumococci most likely enter the bloodstream through a combination of the dilated lymphatics, damage to epithelial and endothelial cells, and/or active invasion of endothelial cells (47). Studies have shown that, much like epithelial cell invasion through host pIgR, pneumococci can invade endothelial cells through the host platelet-activating factor receptor (PAFr), a G protein-coupled receptor (48). Once in the blood, the CPS is again a relevant and necessary virulence factor to prevent complement binding and opsonization. As mentioned earlier, the type and amount of CPS are crucial to the survival of the pneumococcus at this stage. Pneumococcal surface proteins such as PspA, PepO, and others also affect host defenses and have been shown to reduce complement activation on the pneumococcal surface (47).

From the blood, the pneumococcus can gain access to the rest of the body and infect vital organs, the most detrimental being the brain. The pneumococcus is a major cause of bacterial meningitis, causing two-thirds of adult cases in Europe and the USA (49). Access to the brain depends on a pathogen’s ability to cross the endothelial cell blood brain barrier. This is again likely to occur through some form of receptor-mediated endocytosis. The pneumococcus has a variety of surface proteins that can adhere to additional host endothelial cell receptors such as laminin receptor (LR) and Platelet Endothelial Cell Adhesion Molecule-1 (PECAM-1) (49). It has also been shown that these receptors can cooperate in a feedback mechanism, in which inflammation and PAFr activation by pneumolysin and choline binding proteins lead to upregulation of pIgR and PECAM-1, allowing more adherence of the pneumococcus and hence more invasion (49).

Through the blood the pneumococci also gain access to the heart, via the same endothelial receptors involved in crossing the blood brain barrier. Studies have shown that the pneumococcus can invade cells of the myocardium and replicate intracellularly within host vesicles causing cardiac microlesions, rupturing cells, and eventually transitioning into biofilm growth (5). These pneumococcal heart biofilms also release pneumolysin and kill host macrophages (5). Such pneumococcal heart damage has also been shown to be highly strain dependent (50).

Apart from these major common host infection sites, studies have identified Spn at other host sites such as the appendix, middle ear, liver, kidneys, skin, and soft tissues (47, 51-55). It is quite evident that the pneumococcus is specialized in colonizing or infecting many host anatomical sites, harboring genes important for infection shared across all sites, as well as strategies unique to each environment.

2. Influenza virus co-infections

The most common cause of respiratory illness in humans is the influenza A virus (IAV). As is the case with the pneumococcus, IAV is most pathogenic to those with weaker immune systems, such as young children, the elderly, and the immunocompromised (56). Combined infections with IAV and the pneumococcus account for a significant number of infection-related deaths worldwide. It has been shown that these co-infections result in exacerbated symptoms and worse clinical outcomes relative to individual infections by either pathogen (56).

In the 20th century, there were three pandemics of influenza A, in 1918, 1957, and 1968. In all three pandemics, it was observed that IAV infection helped the commensal, colonizing pneumococcus progress into an invasive pathogen. Most deaths were due to secondary bacterial infections, most of which were caused by the pneumococcus (56).

The current proposed model of IAV-induced susceptibility to pneumococcal infection begins with decreased mucociliary velocity of lung epithelial cells upon IAV invasion and infection, which reduces pneumococcal clearance from the mucus. Viral infection increases expression of IL-12 and IFN-γ, which downregulate the host scavenger receptor MARCO (57) and thereby decrease phagocytosis of non-opsonized bacteria. Viral invasion also promotes apoptotic signals (58), such as increased surface expression of CD200 on immune cells, which binds to CD200R on dendritic cells and macrophages and signals lowered responses to bacterial stimuli in these cells. Lastly, type I interferon and corticosteroid expression in response to IAV inhibits the recruitment of PMNs and macrophages (56).

It has also been shown that IAV infection and the host responses that follow can induce the release of bacteria from biofilms during their colonization state. Such released pneumococci have distinct virulence gene expression profiles with an increased ability to spread and cause infection of host anatomical sites (59).

3. Other co-infections

The pneumococcus has been implicated in exacerbations of disease with other viral and bacterial pathogens as well. During the 2009 H1N1 (Swine Flu) pandemic, 13% of postmortem lung samples from patients in the United States showed a concurrent infection with Spn, whereas in Japan and Argentina, approximately 50% of the severe cases were positive for Spn (59).

Similarly, frequent bacterial co-infections were seen with COVID-19 (or severe acute respiratory syndrome due to coronavirus 2 (SARS-CoV-2) infection), in up to 50% of non-survivors (60), as well as in exacerbated acute respiratory infections in children infected with respiratory syncytial virus (RSV) (61). Some of these bacterial co-infections involve the pneumococcus (62).

Apart from viral co-infections, some studies show the pneumococcus interacting with other known lung colonizers, such as nontypeable Haemophilus influenzae. Such interactions are also seen in the context of otitis media and nasopharyngeal colonization, where poly-microbial biofilms have been shown to offer the pneumococcus better protection from antibiotics (63, 64). Some of these co-infections go even further to include viral influences (65).

Clearly, in combination with other pathogens, the pneumococcus is a potentially more lethal threat, and more studies from this perspective need to be conducted.

C. Therapeutics

1. Antibiotics and resistance

Antibiotics have been used as a first line of defense against most bacterial infections. As with any form of selective pressure, their use has made bacterial antibiotic resistance a growing concern. In the case of the pneumococcus, penicillin has been the treatment of choice since the 1940s. As a result, penicillin-resistant isolates were first identified in the mid-1970s and are now widespread globally (66). Usage of many such β-lactam drugs has increased the risk of selection for, and the rate of carriage of, penicillin-resistant pneumococci (67).

Other antibiotics, such as macrolides, fluoroquinolones, vancomycin, trimethoprim, and others, have also selected for antibiotic-resistant pneumococci worldwide (66, 68). It has been established that macrolide resistance in the pneumococcus is mainly due to ribosomal methylation by the product of the erm(B) gene and/or a two-component macrolide efflux pump encoded by the genes mef(E)/mel, carried on the transformable genetic element Mega (69). These resistance genes are known to be acquired from larger transposable elements (69).

Fluoroquinolones operate by targeting DNA gyrase and topoisomerase IV, two enzymes essential for bacterial cellular replication, and thereby directly inhibit DNA synthesis. Studies have found that resistance in some isolates arose from spontaneous mutations in the pneumococcal DNA gyrase gene gyrA and the topoisomerase IV genes parC and parE, which affected the fluoroquinolone binding site (70). Improved fluoroquinolones have been developed to counter some of these mutations; however, it is likely that more mutations in these genes could be selected for and eventually worsen pneumococcal fluoroquinolone resistance overall (70).

The genetic transformation capabilities of the pneumococcus have been implicated in the development of multiple antibiotic resistance phenotypes and of mosaic resistance genes (71-73). On average, approximately 40% of pneumococci exhibit multidrug-resistance phenotypes, with different rates across countries. In the 2000s, among penicillin-resistant isolates, co-resistance to macrolides was seen in 72–75% of isolates, co-resistance to tetracycline in 65–72%, and co-resistance to co-trimoxazole in 69–74% (73, 74).

Fortunately, advanced pneumococcal vaccines introduced in 2010 (such as PCV13, described in the following section) have significantly reduced the spread of most antibiotic-resistant strains in places with higher vaccination rates, according to the CDC's Active Bacterial Core Surveillance reports (2009-2018) (75). Moreover, resistance can also potentially be overcome with higher antibiotic doses (76), as these are usually safe for the host.

The primary therapeutic use of antibiotics for infection was designed from a clinical perspective, to reduce bacterial loads and alleviate symptoms while the host's immune response fends off the infection. Bacterial eradication was not the primary focus, even though it is a main determinant of therapeutic outcome and of preventing the spread of resistant strains (68).

The idea of developing new drugs that not only treat infections but also eradicate the infecting pneumococci is appealing, but it has drawbacks. For example, drug development is a slow process, and resistance can spread significantly faster. Another pitfall would be the selective pressure of new drugs on the bacteria and on commensal organisms, which would not be apparent until several years after their introduction (68). Given this repetitive pattern of treatment and resistance, alternative strategies in the form of vaccines or host-targeted therapeutics are more appealing to develop.

2. Vaccines and escape

As described above, antibiotics leave a lot to be desired as a therapeutic. Vaccines help prevent the rise of antibiotic resistance by lowering the risk of infection from both antibiotic-susceptible and antibiotic-resistant strains. As a result, they also lower the use of antibiotics for treatment, further limiting the potential for development of antibiotic-resistant strains (77).

The earliest forms of pneumococcal vaccines were whole killed pneumococci, but these caused adverse reactions (78). It was later discovered that antibodies generated against CPS were highly protective; however, an increasing number of distinct CPS types was being discovered (78). The vaccines that followed were designed using a mix of different purified CPS. A 23-valent vaccine was licensed in 1983 for older adults at high risk for invasive pneumococcal disease. However, young children were unable to develop protective responses to these vaccines because the CPS did not induce T-cell-dependent antibodies in this age group. This led to the development of CPS-protein conjugate vaccines (78).

The protein conjugate vaccine requires covalent attachment of the CPS to the protein. It was found that these conjugate vaccines induced strong memory antibody responses in infants and children and could be boosted with subsequent immunization (78).

Pneumococcal conjugate vaccine (PCV7), a 7-valent conjugate vaccine (targeting serotypes 4, 6B, 9V, 14, 18C, 19F and 23F), was licensed in 2000 for use in children under 5 (78). However, after its introduction, a significant increase in invasive disease was seen with non-vaccine serotypes, which led to the development of a 10-valent and then a 13-valent conjugate vaccine (serotypes 1, 3, 4, 5, 6A, 6B, 7F, 9V, 14, 18C, 19A, 19F, and 23F) in 2010, including the most invasive of these non-vaccine serotypes (78). Still, more isolates from non-vaccine serotypes are being found causing invasive disease, a phenomenon called serotype replacement or vaccine escape (78).

Another mechanism for vaccine evasion is serotype switching, where the pneumococcus alters its serotype through genetic recombination of the genes involved in capsular production (79). It is increasingly complex to create PCVs with additional serotypes, especially as 100 serotypes are known so far; hence, other forms of vaccines need to be considered.

One potential alternative would be engineered whole cell vaccines. These would consist of killed non-encapsulated pneumococci expressing highly immunogenic proteins on their surface (78). Unlike the earliest attempts at pneumococcal vaccines, these whole cell vaccines could be modified to express inactive versions of highly antigenic genes and/or to remove reactogenic elements, while still being safe and effective (80). Use of whole cell vaccines could also generate immune responses to species-wide, conserved pneumococcal surface proteins, as well as to the non-conserved elements (23). To reduce the chances of serotype replacement, studies have focused on identifying pneumococcal antigens present in all serotypes, i.e., core proteins, and designing a vaccine with those as the primary immunogen.

Several potential virulence factors were investigated, such as pneumococcal surface proteins A and C, pneumolysin, pneumococcal surface antigen A, choline binding proteins, iron transporters, lipoproteins, histidine triad proteins, etc., as well as some combinations of these. Unfortunately, these antigens harbor significant amounts of variation across the alleles from clinical isolates (78, 81).

These classical approaches to vaccine development appear to be inadequate for a genetically diverse pathogen such as the pneumococcus. With the advancement of genomics and even proteomics (81), vaccine candidate identification can now be performed by in silico screening of the entire genome or pangenome of a pathogen (82), a powerful approach to vaccine design known as reverse vaccinology, which will be expanded on in the following sections.

3. Roles of innate and adaptive immunity

Another potential form of therapeutics could be designed from the host’s perspective, based on our knowledge of its response to the pneumococcus. The host immune system is broadly classified into two main branches, innate and adaptive immunity, both of which play an instrumental role in pneumococcal host defenses as well as vaccine efficacy.

The pneumococcus first encounters the host innate immune system through host cell pattern recognition receptors (PRRs) such as Toll-like receptors (TLRs), NOD-like receptors (NLRs), cytosolic DNA sensors, the complement system, and others, which are activated by microbial molecules with conserved motifs known as pathogen-associated molecular patterns (PAMPs) or by pathogen-induced tissue damage (83). Pneumococcal cell wall lipoteichoic acid (LTA) and lipoproteins are recognized by TLR1/2/6, TLR9 recognizes pneumococcal DNA, and TLR4 is activated by PLY (83-86). These induce cytokine production, further stimulating immune cell recruitment and proliferation. Some NLRs, such as NOD1 and NOD2, detect bacterial peptidoglycan and activate NF-κB gene expression (83). Various cytosolic DNA sensors also detect host and lysed bacterial DNA, which can stimulate NF-κB-dependent gene expression and type I interferons (IFN), which in turn stimulate expression of so-called IFN-stimulated genes (ISGs) (87). The type I interferon response is a known player in anti-viral defense, but newer studies have shown its role in bacterial infections (87, 88). Pneumococcal infection activates type I IFN in host macrophages and epithelial cells, as well as during nasopharyngeal colonization (88). The complement system consists of serum and membrane proteins which, when activated, form a cascade of reactions contributing to the elimination of invading microorganisms, for example through opsonophagocytosis, although the pneumococcus has virulence factors to avoid this, as described above (83).

Host adaptive immunity can be further divided into humoral and cell-mediated immunity. Humoral immunity consists of B cells, which produce antibodies specific to the antigens that activate them. At host mucosal surfaces, pneumococcal-specific IgA antibody is produced during colonization (17). Secretory IgA is also involved in opsonization and phagocytosis of the pneumococcus, but again the pneumococcus possesses an IgA1 protease to avoid this (17). Upon such antigen stimulation, B cells differentiate and produce other immunoglobulins needed to fight the pathogen (89). Cell-mediated immunity involves T cells, which are stimulated by antigen-presenting cells (APCs) that harbor major histocompatibility complex (MHC) proteins presenting internalized pneumococcal antigens. Activated T cells ultimately differentiate and produce various cytokines that recruit and activate macrophages, monocytes, and neutrophils to clear pneumococcal infection (89).

The two demographics most vulnerable to pneumococcal disease, the very young and the elderly, are at stages of life where host immunity and defenses are relatively weak. Infants have poor T cell responses because they were not exposed to non-maternal antigens before birth, and their B cells have inherently lower receptor expression, incomplete class switching for immunoglobulin production, and less somatic hypermutation (89). The elderly, as a result of aging, have weaker adaptive immune cells, antibody production and specificity, immunoglobulin class switching, and cell maturation, all of which are more permissive to pneumococcal colonization and disease (89).

D. Challenges

1. Pneumococcal diversity and the pangenome

As was explored in the earlier sections, most pneumococcal strains are adept at taking up foreign DNA and integrating it into their chromosomes through competence and homologous recombination. This results in the acquisition of novel genes or alleles and an increase in overall genomic diversity within the species. The repertoire of all genetic sequences of a particular species has been defined as the pangenome (90). The size of the pangenome depends on the ability of the species to integrate new genetic material and pass it on over generations, as is the case for the pneumococcus. As such, pangenomes can exhibit a finite amount of diversity (closed pangenome) or, like that of Spn, an exceptionally large degree of diversity that is yet to be fully characterized (open pangenome) (91, 92).
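The pangenome definition above can be made concrete with a small sketch. It assumes each genome has already been reduced to a set of gene cluster identifiers; the strain names and cluster names below are hypothetical placeholders, not data from any study. Genes found in every genome form the core, the remainder are accessory, and tracking how the total gene count grows as genomes are added gives a rough sense of whether a pangenome looks open or closed.

```python
from typing import Dict, List, Set, Tuple


def partition_pangenome(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, accessory) gene clusters for a collection of genomes."""
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set()
    pangenome = set().union(*gene_sets)      # every cluster seen in any genome
    core = set.intersection(*gene_sets)      # clusters present in all genomes
    return core, pangenome - core


def pangenome_growth(genomes: Dict[str, Set[str]], order: List[str]) -> List[int]:
    """Cumulative pangenome size as genomes are added in the given order."""
    seen: Set[str] = set()
    sizes = []
    for strain in order:
        seen |= genomes[strain]
        sizes.append(len(seen))
    return sizes


# Hypothetical toy input: three strains described by made-up cluster IDs.
toy_genomes = {
    "strain_A": {"cluster_1", "cluster_2", "cluster_3", "cluster_4"},
    "strain_B": {"cluster_1", "cluster_2", "cluster_3"},
    "strain_C": {"cluster_1", "cluster_2", "cluster_5"},
}

core, accessory = partition_pangenome(toy_genomes)
growth = pangenome_growth(toy_genomes, ["strain_A", "strain_B", "strain_C"])
print(sorted(core), sorted(accessory), growth)
```

In an open pangenome the growth curve keeps rising as more genomes are added, whereas in a closed pangenome it plateaus. Real analyses first need to cluster orthologous genes, which this sketch takes as given.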

Comparative genomics performed on the pneumococcus and closely related streptococci has shown that several genetic acquisitions in the pneumococcus originate from S. mitis and that most acquisitions come only from species living in similar niches (93). Therefore, a virulence gene identified in Spn may also exist in species like S. mitis, S. oralis, and S. infantis, which are not always pathogenic, suggesting that such genes do not independently contribute to virulence (93).

Another study showed that several genes are shared between the pneumococcus and commensal closely related streptococci, but specific gene or allele sets exist in each species which can contribute to either the beneficial commensal relationship or the pathogenic potential of that streptococcal species (82).

Besides pneumococcal competence, other contributors to genetic diversity exist, such as streptococcal genus-wide conjugative transposons (94), which have been shown to bring in antibiotic resistance markers (95, 96). In addition to DNA uptake in the autolysis and uptake cycle, bacteriophages can also contribute to genetic diversity through transduction. Lytic phages can induce cell lysis and allow released DNA to be taken up by other pneumococci in the environment, whereas temperate phages can integrate into the host pneumococcal DNA and potentially disrupt or alter genes and their regulation. Both kinds of phages have been identified in Spn (97).

In light of this extensive genetic diversity, its potential to increase, and the observed rise in vaccine escape, the Global Pneumococcal Sequencing (GPS) project was undertaken (98). The project includes whole genome sequencing of over 20,000 pneumococcal strains from around the globe, including subsets of closely related pneumococci that exhibit multidrug resistance. Other subsets correlated specifically with invasiveness (98). Such studies can have a large impact on future vaccine design. Comparative genomics across these large datasets could eliminate protein vaccine candidates that have multiple alleles or are dispensable, avoiding unnecessary selective pressure and the resulting vaccine escape.
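The filtering idea described above, discarding candidates that are dispensable or that occur as multiple alleles, can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by the GPS project or by any published pipeline: the input format (one allele call per strain, with `None` for an absent gene), the gene names, and the `max_alleles` threshold are all hypothetical.

```python
from typing import Dict, List, Optional


def conserved_candidates(
    allele_calls: Dict[str, List[Optional[int]]],
    max_alleles: int = 1,
) -> List[str]:
    """Keep genes present in every strain with at most `max_alleles` distinct alleles.

    allele_calls maps a gene name to one allele identifier per strain;
    None marks a strain in which the gene is absent (a dispensable gene).
    """
    kept = []
    for gene, calls in allele_calls.items():
        if any(call is None for call in calls):
            continue                      # absent in some strain: not core
        if len(set(calls)) <= max_alleles:
            kept.append(gene)             # core and sufficiently conserved
    return sorted(kept)


# Hypothetical allele calls across three strains.
toy_calls = {
    "gene_A": [1, 1, 1],      # core, single allele
    "gene_B": [1, 2, 3],      # core, highly variable
    "gene_C": [1, None, 1],   # missing from one strain
    "gene_D": [1, 1, 2],      # core, two alleles
}

strict = conserved_candidates(toy_calls)
relaxed = conserved_candidates(toy_calls, max_alleles=2)
print(strict, relaxed)
```

A real screen would also weigh surface exposure, immunogenicity, and sequence-level similarity between alleles, none of which this sketch attempts.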

2. Serotypes and genotypes

With the vast amount of genetic diversity that exists within the pneumococcal species, it is no surprise that this pathogen can exhibit a diverse panel of surface antigens and hence varied levels of pathogenicity. Serotyping classifies isolates based on their surface antigens and their ability to induce specific antibodies; in the case of Spn, the serotype is defined by the capsular polysaccharide. An organism’s genotype, in contrast, is its specific set of heritable genes, which also varies within a species and allows for a different type of classification.

Serotyping was originally performed using the Quellung reaction, in which different antibodies were allowed to bind to the capsule of the pneumococcus, causing capsular swelling that was visible under a microscope (99). Pneumococci were then classified by the specific antisera that caused the swelling. Since then, PCR-based serotyping targeting the cps locus has been developed (100-102).

In 2006, the genes responsible for the synthesis of the capsule were investigated for the 90 serotypes known at the time (35). Today, 100 different known serotypes exist for the pneumococcus, the newer ones discovered being 6C, 11E, 20B, 6D, 6F, 6G, 6H, 35D, 7D, and 10D. Phylogenetic analyses on the newest one, 10D, showed that it acquired a cps gene from an oral streptococcus species (103).

As described in earlier sections, Spn can develop new serotypes through genetic recombination; however, there is poor correlation between the genetic background and the serotype of a pneumococcal strain (79). While the serotype can be a significant determinant of pathogenicity and virulence, the rest of the genotype is also a major contributor to pathogenicity and to the discovery of targeted therapeutics. Several methods exist for pneumococcal genotyping: multilocus sequence typing (MLST), which compares DNA sequence differences among a few housekeeping genes; pulsed-field gel electrophoresis, which compares DNA band patterns on a gel after restriction enzyme digestion; multilocus variable number of tandem repeat analysis (MLVA), which profiles strains by the number of tandem repeats in the DNA; and others. Newer approaches include next-generation sequencing (NGS) and mass spectrometry (104).
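As an illustration of the MLST idea, the following minimal sketch assigns each locus an allele number and looks up the resulting allelic profile in a table of sequence types. The locus names, sequences, allele numbers, and sequence type table are all hypothetical placeholders and do not correspond to any real MLST scheme.

```python
from typing import Dict, Optional, Tuple

# Hypothetical allele databases: locus -> {sequence: allele number}.
# Real schemes use curated fragments of housekeeping genes; these toy
# strings only stand in for them.
ALLELES: Dict[str, Dict[str, int]] = {
    "locusA": {"ATGGCA": 1, "ATGGCG": 2},
    "locusB": {"TTGACC": 1, "TTGACT": 2},
    "locusC": {"GGCAAT": 1},
}

# Hypothetical table mapping an allelic profile to a sequence type (ST).
SEQUENCE_TYPES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1): 10,
    (2, 1, 1): 11,
}


def allelic_profile(isolate: Dict[str, str]) -> Tuple[int, ...]:
    """Allele number per locus, in sorted locus order; 0 marks an unknown allele."""
    return tuple(ALLELES[locus].get(isolate[locus], 0) for locus in sorted(ALLELES))


def sequence_type(isolate: Dict[str, str]) -> Optional[int]:
    """Return the ST for an isolate, or None if its profile is not in the table."""
    return SEQUENCE_TYPES.get(allelic_profile(isolate))
```

Two isolates that differ by a single nucleotide at one locus receive different allele numbers and therefore different sequence types, which is what makes the approach a compact genotype label.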

NGS is now inexpensive, and several thousand pneumococcal genomes have been sequenced in addition to those generated as part of the GPS project. NGS allows both genotyping at the nucleotide level and in silico serotyping through inspection of the sequenced cps locus. Given the pneumococcal DNA uptake and integration mechanisms described earlier and the selective pressure from antibiotics and vaccines, serotype and genotype diversity will inevitably keep growing over time; therapeutic targets defined across the species pangenome are therefore required.


E. Avenues for discovery of new therapeutics

1. Models of experimental study

The pneumococcus causes multiple forms of disease and colonization. In order to study the pneumococcus in its natural ecological niche for a particular disease, in vitro and animal models have been characterized and extensively used.

For example, the pneumococcus has been extensively studied in vitro, both planktonically and as biofilms, using several different bacterial media and supplements, albeit these are not accurate recapitulations of the true in vivo environment (105, 106). In order to better address this issue, some of these and other studies used immortalized human cell lines as in vitro models of pneumococcal infection for study (107, 108). Others have also attempted to chemically recreate host niches in terms of their nutrients and physiological properties (109). While all these studies yield important results and identify potentially useful therapeutic targets, additional in vivo validation is required.

Various in vivo models have also been established. Mouse, rat, and rabbit models of disease exist for pneumococcal pneumonia, septicemia, meningitis, and colonization (110). Such model animals are usually genetically identical, allowing for the generation of high-quality in vivo experimental data. The choice of model usually leans towards mice: although rats and rabbits yield larger quantities of sample per individual, larger animals are more expensive, which results in fewer biological replicates per experiment. With larger sample sizes, however, these models are just as effective.

Other less common models exist, such as chinchillas, gerbils, and guinea pigs for pneumococcal otitis media, but the field is also transitioning to mouse models for this anatomical site (110). Ferret models also exist for studying pneumococcal transmission, and non-human primate models provide the closest representation of human infection (111, 112).

Of all in vivo models, mice are the most common, offering a good balance between cost, representation of the in vivo environment, and knowledge of the model’s biology. The mouse model is also the most studied and best characterized, and extensive methods exist for its genetic manipulation, allowing investigators to test the importance of specific host genes. Other approaches that are gaining traction and could be explored for the pneumococcus include ex vivo models of human primary cells, in which human cells are obtained from organs or tissues and cultured on transwells, and human organoids, which are three-dimensional organ structures cultured from pluripotent stem cells (113, 114).

a) Use of genetics and next generation sequencing technologies

One advantage of studying the pneumococcus experimentally lies in harnessing its natural genetic transformation capabilities. Because of its naturally occurring competence and homologous recombination, it has been investigated using a variety of genetic techniques. This inherent transformability has also been exploited to investigate genes of interest with knockout mutants in several studies (115, 116).

For example, signature tagged mutagenesis was used to generate 1,250 pneumococcal clones tested for survival in a septicemia and pneumonia mouse model, to profile and classify various pneumococcal virulence genes (117). Other studies have also explored the genetic profile of the pneumococcus with mutations and RNA sequencing (118-120).

The availability of extensively used and verified in vitro and in vivo models, combined with the advancement of next-generation sequencing technologies, also provides powerful tools for therapeutic target investigation. As mentioned in earlier sections, several thousand genomes of the pneumococcus have been sequenced, allowing for extensive phylogenetic analysis and comparative genomics of the species. Several essential pneumococcal genes and host-pathogen interactions have also been established through transposon sequencing (Tn-seq), gene silencing (CRISPRi), and transcriptomics (121-124). Some of these studies and others will be explored in more depth in the coming sections of this study.

2. Reverse vaccinology

Reverse vaccinology (RV) is a genomics-based approach to identifying protein vaccine candidates (PVCs) for a bacterial species. It was first implemented to identify PVCs for serogroup B meningococcus, prioritizing bacterial surface proteins as these are likely to be antigenic and capable of inducing humoral antibody responses (125). RV provides the advantage of identifying PVCs without culturing a pathogen and allows for species-wide evaluation of PVCs.

RV protocols follow two main approaches: decision tree (filtering) platforms and machine-learning (classifying) platforms (125). In decision tree programs, a pathogen’s protein sequences are passed through filters that evaluate specific properties of good PVCs, such as adhesin probability, number of transmembrane domains, and subcellular localization. Proteins that do not pass these filters are eliminated until only a subset of the pathogen’s proteins remains as PVCs. Machine-learning programs aggregate measured or predicted protein features from a pathogen’s predicted proteome into a matrix and classify each protein as a PVC or non-PVC using a probability model based on a known training dataset of PVCs and non-PVCs (125). Unlike filtering programs, they do not eliminate any proteins.
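The contrast between the two architectures can be sketched in a few lines. The example below is illustrative only: the feature names, thresholds, and the toy nearest-centroid classifier are assumptions chosen for brevity, and they stand in for the far richer feature sets and probability models used by published RV tools.

```python
import math

# Hypothetical predicted features for three proteins.
proteins = {
    "p1": {"adhesin_prob": 0.9, "tm_helices": 0, "surface": 1},
    "p2": {"adhesin_prob": 0.2, "tm_helices": 5, "surface": 0},
    "p3": {"adhesin_prob": 0.7, "tm_helices": 1, "surface": 1},
}


def filter_pipeline(prots, min_adhesin=0.5, max_tm=1):
    """Decision-tree style: proteins are dropped at each successive filter."""
    kept = {k: v for k, v in prots.items() if v["surface"] == 1}
    kept = {k: v for k, v in kept.items() if v["tm_helices"] <= max_tm}
    kept = {k: v for k, v in kept.items() if v["adhesin_prob"] >= min_adhesin}
    return sorted(kept)


def centroid(rows):
    """Mean feature vector of a list of feature dicts (keys in sorted order)."""
    keys = sorted(rows[0])
    return [sum(r[k] for r in rows) / len(rows) for k in keys]


def classify(prots, positives, negatives):
    """Classifier style: every protein receives a PVC probability; none is removed."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    out = {}
    for name, feats in prots.items():
        x = [feats[k] for k in sorted(feats)]
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        # Softmax over negative distances: closer to the positive centroid
        # means a probability closer to 1.
        out[name] = math.exp(-d_pos) / (math.exp(-d_pos) + math.exp(-d_neg))
    return out
```

Note that `filter_pipeline` returns only the survivors, so a single false negative prediction removes a protein permanently, whereas `classify` returns a value for every input protein.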

Only a handful of pneumococcal RV studies have been performed, usually on only a few pneumococcal genotypes and using a small number of protein feature prediction tools. One study examined 26 pneumococcal genomes representing 13 different serotypes, applying a decision tree RV architecture, and identified 13 PVCs (126). Another evaluated 22 pneumococcal surface factors for their PVC potential, again using a decision tree RV architecture but with different protein features (127). A more recent study evaluated 46 pneumococcal proteins selected from the literature, using a combination of both RV protocols to determine PVC potential, with good results (128). However, many of these RV studies fail to account for the widespread diversity within the pneumococcal species, and RV on the pneumococcus can be greatly improved.

F. Concluding remarks

As we have seen so far, the pneumococcus is an ever-changing species with high diversity and with specialized genes and biological systems for survival and for a panel of human infections. Despite extensive studies, such an evolving pathogen requires a continual supply of novel therapeutic targets to monitor and control its pathogenicity and spread. In the coming chapters of this study, we will approach the pneumococcus from three broad perspectives. We will look at it from an in silico perspective, using pangenomics and improved reverse vaccinology to identify conserved species-wide gene targets that may not have been altered by selective pressure. We will also examine some of these targets, and identify others, from two host-pathogen interaction perspectives, using an in vivo mouse model of colonization and disseminated infection as well as an ex vivo human lung epithelial cell model.

We will accomplish these goals through three specific aims. 1) Construct a reverse vaccinology pipeline to identify novel potential vaccine candidate proteins of Streptococcus pneumoniae.

2) Survey candidates identified through reverse vaccinology with the host-pathogen transcription profiles of invasive pneumococci at various anatomical sites in mice using dual RNA-seq.


3) Validate mice and pneumococcal predicted pathways from in vivo experiments using an ex vivo human lung model infected with S. pneumoniae through dual RNA-seq. Together these studies will shed light on previously unexplored therapeutic targets for the host and the pneumococcus and advance the pneumococcal field.


II. Chapter 2: Reverse vaccinology and its application to Streptococcus pneumoniae

Preface

This chapter has been divided into two parts to accomplish Specific Aim 1: Construct a reverse vaccinology pipeline to identify novel potential vaccine candidate proteins of Streptococcus pneumoniae. The first, published part consists of the design and construction of a prokaryotic reverse vaccinology pipeline originally developed for other lung pathogens, namely Haemophilus influenzae and Moraxella catarrhalis. The second part applies the developed pipeline to prioritize potential vaccine candidates for Streptococcus pneumoniae.

A. ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates. 1

B. Abstract

Background

Reverse vaccinology accelerates the discovery of potential vaccine candidates (PVCs) prior to experimental validation. Current programs typically use one bacterial proteome to identify PVCs through a filtering architecture using feature prediction programs or a machine learning approach.

1 D'Mello, A., Ahearn, C. P., Murphy, T. F., & Tettelin, H. (2019). ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates. BMC genomics, 20(1), 981. https://doi.org/10.1186/s12864-019-6195-y


Filtering approaches may eliminate potential antigens based on limitations in the accuracy of prediction tools used. Machine learning approaches are heavily dependent on the selection of training datasets with experimentally validated antigens (positive control) and non-protective antigens (negative control). The use of one or few bacterial proteomes does not assess PVC conservation among strains, an important feature of vaccine antigens.

Results

We present ReVac, which implements both a panoply of feature prediction programs without filtering out proteins, and scoring of candidates based on predictions made on curated positive and negative control PVC datasets. ReVac surveys several genomes, assessing protein conservation as well as DNA and protein repeats, which may result in variable expression of PVCs. ReVac’s orthologous clustering of conserved genes identifies core and dispensable genome components. This is useful for determining the degree of conservation of PVCs among the population of isolates for a given pathogen. Potential vaccine candidates are then prioritized based on conservation and overall feature-based scoring. We present the application of ReVac to 69 Moraxella catarrhalis and 270 non-typeable Haemophilus influenzae genomes, prioritizing 64 and 29 proteins as PVCs, respectively.

Conclusion

ReVac’s use of a scoring scheme ranks PVCs for subsequent experimental testing. It employs a redundancy-based approach in its predictions of features using several prediction tools. The protein’s features are collated, and each protein is ranked based on the scoring scheme. Multi-genome analyses performed in ReVac allow for a comprehensive overview of PVCs from a pangenome perspective, as an essential pre-requisite for any bacterial subunit vaccine design. ReVac prioritized PVCs of two human respiratory pathogens, identifying both novel and previously validated PVCs.

Keywords: Reverse vaccinology, Vaccines, Antigen scoring, Orthology, Core genome, Bacterial, Pangenome

C. Background

Reverse vaccinology pipelines use genome datasets to identify potential vaccine candidates (PVCs) based on in silico prediction of hallmark features of an ideal vaccine candidate antigen. These features include presence of epitopes exposed on the bacterial surface for host immune recognition, antigenicity, sequence conservation across isolates, and expression during infection (129, 130). Since the development and application of reverse vaccinology to the case of Serogroup B meningococcus (131), its potential for growth has increased significantly with the advent of next-generation sequencing techniques, development of bioinformatic tools for multi-genome analyses, protein functional predictions, and high throughput protein expression platforms (132). These advances in technology offer an opportunity to generate new reverse vaccinology programs that accurately predict candidate bacterial proteins for use in subunit-based vaccines.

Several tools have been developed for antigen prediction and vaccine candidate identification, including NERVE, Jenner-Predict, Vaxign, VaxiJen, VacSol, and Bowman-Heinson (125). These tools typically follow either filtering or machine learning algorithms. The filtering workflows utilize a single program for each feature prediction and filter out proteins at each stage. A limitation of the filtering architecture is that vaccine candidates may be eliminated from further analyses in the event of a false negative prediction by any given bioinformatic tool. The machine learning workflows use datasets of known PVCs and negative controls to classify antigens and non-antigens through a probability score. To date, tools applying either of the two approaches consider protein sequences exclusively. An extensive review of all these workflows can be found in Dalsass et al. (125).

Here we describe ReVac, a computational pipeline for prediction and prioritization of protein-based bacterial vaccine candidates for experimental verification. ReVac surveys several genomes, using multiple independent tools for predictions of the same feature, to assess a large panel of protein features and sequence conservation. ReVac also scans both the protein and DNA sequences of genes for repeat sequences that could mediate phase variation (gene on/off switching) or protein structure variations, attributes that are typically not desirable in a candidate for vaccine development (133). ReVac compiles all data across various features, at the protein and nucleotide level, from several bacterial genomes, into one tab-delimited output file. It also scores each protein based on each individual feature in parallel, without eliminating any candidate from analyses. A general problem in reverse vaccinology is that most workflows predict hundreds of proteins as vaccine candidates, rendering experimental verification assays cumbersome (125). Although some provide a ranking of candidates based on sequence similarity with curated epitopes (134), this approach does not promote the discovery of new types of candidates from different bacteria.
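To make the idea of scanning for repeats concrete, the sketch below finds simple tandem repeats in a DNA string and flags those whose unit length is not a multiple of three, since a change in copy number of such a repeat inside a coding region shifts the reading frame. This is a generic illustration rather than ReVac’s own repeat finder; the function names and the `max_unit` and `min_copies` thresholds are hypothetical.

```python
def find_ssrs(dna, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for tandem repeats, scanning left to right.

    At each position the longest qualifying repeat is kept; ties go to the
    shortest unit. Thresholds are illustrative only.
    """
    dna = dna.upper()
    hits = []
    i = 0
    while i < len(dna):
        best = None
        for k in range(1, max_unit + 1):
            unit = dna[i:i + k]
            if len(unit) < k:
                break  # ran off the end of the sequence
            copies = 1
            while dna[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            span = copies * k
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best:
            hits.append(best)
            i += len(best[1]) * best[2]  # skip past the reported repeat
        else:
            i += 1
    return hits


def may_frameshift(unit):
    """A repeat unit whose length is not divisible by 3 can shift the frame."""
    return len(unit) % 3 != 0
```

For example, a run of eight adenines or four copies of `AT` would be reported and flagged as potentially frameshifting, whereas four copies of `AAT` would be reported but not flagged.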

ReVac uses its own scoring scheme for the output of each feature prediction tool that is part of its workflow. The scoring scheme was developed by manually observing trends in the feature predictions for control datasets of known antigens and non-antigens. These control datasets were obtained from various antigen/epitope databases of predicted and experimentally curated proteins, namely Protegen, AntigenDB, Vaxign’s control datasets, and ePSORTB. We supplemented these publicly available datasets with known antigens from our Moraxella catarrhalis and non-typeable Haemophilus influenzae (NTHi) datasets (135-140). These control datasets consist of DNA and protein sequences from various Gram-positive and Gram-negative species, which were run through ReVac (Additional file 2.1), and the corresponding scoring scheme is shown in Table 2.1 below.

The final output of ReVac consists of a list of predicted vaccine candidates sorted based on their ReVac scores, an aggregate scoring scheme that combines individual feature weights assigned to each of the candidates’ features. This allows the user to consider candidates by perusing those with the highest ReVac scores. Importantly, ReVac accounts for strain-to-strain variation when prioritizing top candidates by generating clusters of orthologous genes across all genomes of the species of interest. ReVac displays average scores of gene conservation for each ortholog cluster to provide an estimate of variation. These two innovations in reverse vaccinology application allow for selection of a manageable number of conserved PVCs for experimental verification and vaccine development.

Table 2.1 List of all the programs run in ReVac and their predicted features, with the scoring scheme for each program’s output.

Additional scoring descriptions based on outputs from multiple programs are listed at the bottom.

| Module (Reference) | Gene property | Evidence | Output | Scoring weight (points) |
|---|---|---|---|---|
| PSORTb* | Surface exposure^ | Sub-cellular localization | Surface localization prediction | +1 if surface exposed; -1 if cytoplasmic |
| LipoP | Surface exposure^ | Lipoprotein motif | Presence or absence of a motif | 1 or 0 |
| TMHMM | Surface exposure^ | Transmembrane spans | Number of helices | If surface exposed: <2: +0.5; 2: 0; 3: -0.2; ≥4: -2. If cytoplasmic: -2 |
| SignalP | Surface exposure^ | Signal peptide | Signal peptide | +1 for presence |
| SPAAN | Surface exposure^ | Adhesin protein | Adhesin protein score | +0.5 if above cutoff score (default 0.75) |
| Surface HMMs [a] | Surface exposure, Function | HMM for motif or function | HMM title and score | 0.5 |
| Antigenic | Antigenicity | Antigenic epitopes | Peptides, scores, protein coverage | 0.5; +0-1 proportional to coverage |
| Bcell Pred | Antigenicity | B cell epitopes, 6 prediction methods combined | Number of peptides, protein coverage | +0-1 proportional to coverage; +0-1 proportional to total number of peptides of a given length per protein |
| MHC class I | Antigenicity | MHC-I epitopes | Number of peptides, protein coverage | +0-1 proportional to coverage if 80-90%; +0-1 proportional to total number of peptides of a given length per protein; +1 if coverage is >=90% |
| NetCTLpan | Antigenicity | MHC-I epitopes | Number of peptides, protein coverage | +0-1 proportional to coverage if 80-90%; +0-1 proportional to total number of peptides of a given length per protein; +1 if coverage is >=90% |
| Immunogenicity (MHC-I) | Antigenicity | MHC-I epitopes immunogenicity | Number of peptides, protein coverage | +0-1 proportional to coverage; +0-1 proportional to total number of peptides of a given length per protein; +1 if coverage is >=10% |
| MHC class II | Antigenicity | MHC-II epitopes | Number of peptides, protein coverage | +0-1 proportional to coverage if 80-90%; +0-1 proportional to total number of peptides of a given length per protein; +1 if coverage is >=90% |
| BLAT (IEDB [b] database*) | Antigenicity | Similarity to curated epitopes from IEDB | Protein coverage | +0-1 proportional to coverage; +1 if coverage is >70% |
| Autoimmunity | Autoimmunity | Similarity to human proteins | Protein coverage | +1 if no autoimmunity; -2 * (0 to 1) proportional to coverage; -2 if coverage is >20% |
| Autoimmunity Commensals | Autoimmunity | Similarity to user-defined commensal organisms’ proteins | Protein coverage | +1 if no autoimmunity; -2 * (0 to 1) proportional to coverage; -2 if coverage is >20% |
| SSR [d] Finder | Variability of expression | Phase variation | Number of simple sequence repeats | +1 if no SSR; -0.5 for each SSR; -0.25 for each SSR in the promoter; -0.5 for each SSR with frameshift potential; -0.01 times the length of the SSR |
| SSR [d] Finder Protein | Variability of expression | Potential conformational shifts | Number of protein tandem repeats | -0.2 for each protein repeat, max penalty of 1 |
| IslandPath | Potential for horizontal gene transfer | Genomic Islands | Presence in a GI | -1 for each protein in a GI; +0.5 for absence |
| Jaccard Clusters† | Conservation | Orthologous clusters | Presence in an orthologous cluster | +1 for each protein in a COG in >90% of genomes; -0.25 for each protein in a COG in <90% of genomes |
| PanOCT† | Conservation | Orthologous clusters | Presence in an orthologous cluster | +1 for each protein in a COG in >90% of genomes; -0.25 for each protein in a COG in <90% of genomes |
| OrthoMCL† | Conservation | Orthologous clusters | Presence in an orthologous cluster | +1 for each protein in a COG in >90% of genomes; -0.25 for each protein in a COG in <90% of genomes |
| LS-BSR† | Conservation | Orthologous clusters | Presence in an orthologous cluster | +1 for each protein in a COG in >90% of genomes; -0.25 for each protein in a COG in <90% of genomes |
| Attributor [c] | Function | Annotation & GO Terms | Annotation & GO Terms | +1 for each GO term in our surface exposed GO db; -1 for each GO term in our non-surface exposed GO db; +1 if presence of surface exposure keywords if predicted periplasmic |

[a] HMM: Hidden Markov Model. This component includes a collection of HMMs selected from the Pfam database for motifs associated with surface proteins.

[b] IEDB: Immune Epitope Database and Analysis Resource

[c] In house Perl/Python script

[d] SSRs: simple sequence repeats

*If PSORTB, LipoP, SignalP and IEDB Database matches are all positive, weight is incremented by 2.

^If all surface exposure tools fail a conclusive prediction, weight is decremented by 2

†Each protein is given an additional 0.1 for >90% presence in each of the clustering algorithms, Jaccard Clusters, PanOCT, OrthoMCL and LS-BSR, and penalized 0.5 for <90% presence or absence of a cluster for each tool.
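Two of the rules in Table 2.1 can be written as small scoring functions, as sketched below. The function names and input layout are hypothetical, not taken from ReVac’s code. The sketch also makes several assumptions: the two-helix case scores 0, an inconclusive localization earns no TMHMM points, and the per-SSR penalties are summed. The maximum penalty thresholds mentioned elsewhere in the text are not modeled.

```python
def tmhmm_points(n_helices, localization):
    """Points for the number of transmembrane helices, given a localization call."""
    if localization == "cytoplasmic":
        return -2.0
    if localization != "surface":
        return 0.0  # assumption: no points when localization is inconclusive
    if n_helices < 2:
        return 0.5
    if n_helices == 2:
        return 0.0  # assumption: two helices are score-neutral
    if n_helices == 3:
        return -0.2
    return -2.0  # four or more helices


def ssr_points(ssrs):
    """Points for simple sequence repeats (SSRs) associated with a gene.

    ssrs: list of dicts with keys 'length' (bp), 'in_promoter' (bool) and
    'frameshift' (bool). Penalties are summed over SSRs (assumption).
    """
    if not ssrs:
        return 1.0
    total = 0.0
    for ssr in ssrs:
        total -= 0.5                     # each SSR
        if ssr["in_promoter"]:
            total -= 0.25                # SSR in the promoter
        if ssr["frameshift"]:
            total -= 0.5                 # SSR with frameshift potential
        total -= 0.01 * ssr["length"]    # length-dependent penalty
    return total
```

A protein’s overall score would then be the sum of such per-module points across all rows of the table, which is why no protein needs to be discarded along the way.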

D. Results

1. ReVac workflow

The ReVac pipeline uses the Ergatis workflow management system to analyze all data on distributed computer clusters (141). Figure 2.1 shows the overall workflow and components of ReVac. Parallel computing allows ReVac to run efficiently while performing predictions on entire collections of input genomes. Analysis is launched using a list of GenBank-formatted genomes as input. ReVac’s foundation components convert the GenBank files to the formats required by each predictive tool, as necessary. Amino acid and nucleotide gene sequence FASTA files, as well as annotation files in General Feature Format (GFF), are created. Their content is then binned into smaller subsets of data that are submitted as parallel batches on the compute cluster.
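The binning step can be pictured with the following minimal sketch, which parses FASTA text and splits the records into fixed-size batches. In ReVac this work is handled by Ergatis and the foundation components; the function names and the batch size here are hypothetical and serve only as an illustration.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) records."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def bin_records(records: List[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

Each batch could then be written to its own file and submitted as an independent job, so that slow prediction tools run on many small inputs at once.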


Figure 2.1. Schematic of the ReVac workflow, its components, and underlying features

Blue arrows indicate the components where control datasets were used to develop the scoring algorithm. Red arrows indicate a user’s input query dataset, which runs through all components and the scoring algorithm, to output a list of prioritized candidates for the supplied species. Scoring based on core genes or orthology components is indicated by the black arrow.


ReVac utilizes several bioinformatic tools for its protein or nucleotide feature predictions (Figure 2.1, Table 2.1, and Methods) that are grouped into the following categories: subcellular localization, antigenicity & immunogenicity, conservation & function, exclusion features, genomic islands, and foundation components. Subcellular localization contains tools predicting overall protein localization from the analyses of lipoprotein signal, transmembrane helices, signal peptide presence, adhesin potential, and HMM (Hidden Markov Model) domains associated with surface exposure. Antigenicity & immunogenicity covers Major Histocompatibility Complex (MHC) class I and II binding capabilities, B-cell epitope presence, overall MHC immunogenicity, and a BLAT (BLAST-Like Alignment Tool) (142) alignment with known experimentally verified epitopes acquired from the Immune Epitope Database & Analysis Resource (IEDB) (143). Conservation & function applies 4 different methods for generating clusters of orthologs and implements a tool that updates annotations and assigns Gene Ontology (GO) terms (144). Exclusion features determine protein similarity to Homo sapiens proteins (risk of autoimmunity) and to proteins of a user-defined list of commensal organisms (to address the risk of depleting the microbiome), as well as the prediction of amino acid and/or nucleotide repeats that mediate phase variation. Genomic Islands (GI) prediction informs whether or not a gene is carried within a putative mobile element and is therefore transmissible between isolates or species. Lastly, foundation components refer to all tools involved in file format conversion, input data generation, and text processing. The implementation of multiple prediction tools and scoring schemes for most of the features considered compensates for each individual tool’s potential for false negative/positive predictions. Given these attributes, ReVac offers an innovative and comprehensive workflow design for reverse vaccinology.


Outputs from ReVac’s components are systematically converted into tab-delimited format and grouped by protein IDs or locus tags derived from the GenBank files. This is achieved using in-house Perl scripts to generate ReVac’s initial gene feature summary table. This table is then parsed using ReVac’s scoring algorithm (Table 2.1) and a final score-sorted summary table is reported. These two tables include results for all genes provided as input without eliminating any potential candidates. To look for highly conserved core vaccine candidates, the scored summary table is further parsed for overall protein conservation, comparing all orthology methods used, across all genomes. ReVac then refines the list to PVCs whose ReVac scores reflect a distribution of ideal PVC features (i.e., where penalties for undesirable features total less than 10% of the overall score). All clusters are then grouped and given an ortholog ID. Their annotation and their average, minimum, and maximum ReVac scores are reported at the ortholog cluster level. Based on the scores observed for the positive and negative controls we used, clusters with an average ReVac score higher than 10 and minimal variation in scores across the cluster (based on the reported average, minimum, and maximum) are ranked as top PVCs. A higher score cutoff can be chosen by the user to further reduce the number of prioritized candidates. Here, 10 was chosen as the cutoff for our NTHi and M. catarrhalis datasets, as it was observed that the frequency of non-antigens was higher below this value (Figure 2.2, left peak of Controls), while the frequency of antigens formed a second distinct peak for scores of 10 and higher (Figure 2.2, right peak of Controls) (see also Additional file 2.1). Implementation of higher cutoffs to focus the list of candidates in a separate small table does not eliminate any candidates from the complete scored table. Other candidates can be selected by scanning the full table, which shows PVCs in ranked order, and evaluating the relative importance of features that may have diminished their overall score.
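The cluster-level prioritization described above can be sketched as follows. This is one possible reading of the text, not ReVac’s implementation: the input layout, the function name, the choice to express the penalty as a fraction of the final score, and the `max_spread` parameter used to quantify "minimal variation" are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def prioritize(proteins, cutoff=10.0, max_penalty_frac=0.10, max_spread=2.0):
    """Rank ortholog clusters by average score.

    proteins: iterable of (protein_id, cluster_id, positive_points, penalty_points),
    where the protein's score is positive_points - penalty_points.
    """
    clusters = defaultdict(list)
    for _pid, cid, positive, penalty in proteins:
        score = positive - penalty
        # Keep proteins whose penalties are below 10% of the overall score.
        if score > 0 and penalty / score < max_penalty_frac:
            clusters[cid].append(score)

    ranked = []
    for cid, scores in clusters.items():
        avg, low, high = mean(scores), min(scores), max(scores)
        # Average above the cutoff and little variation across the cluster.
        if avg > cutoff and high - low <= max_spread:
            ranked.append((cid, round(avg, 3), low, high))
    return sorted(ranked, key=lambda row: row[1], reverse=True)
```

Raising `cutoff` shortens the prioritized list without touching the full scored table, which mirrors the separation between the complete summary table and the focused candidate list.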


Figure 2.2. A density plot showing the scores for all sequences run through ReVac, and the cutoff for our M. catarrhalis and non-typeable Haemophilus influenzae datasets


2. Control datasets used for development of the scoring scheme

The control datasets used in ReVac comprise a total of 564 proteins acquired from Vaxign, Protegen and AntigenDB (135, 136, 139), as well as our manually curated list of NTHi and M. catarrhalis antigens (137, 138). Where possible, protein identifiers (IDs) from these three public databases were systematically converted to Uniprot unique IDs for consistency and ease of access to protein characteristics (Additional file 2.1: Sheet 3). Because ReVac is the first pipeline to consider nucleotide features associated with candidate antigens, we also obtained closely related nucleotide sequences for all public candidates by retrieval of best TBLASTN (145) hits against the National Center for Biotechnology Information (NCBI) nt database of non-redundant nucleotide sequences (all hits were to the respective species). Among other features, nucleotide sequences provided information on simple sequence repeats (SSRs) that may mediate phase variation.

Since these databases contained some of the same sequences or different alleles of the same antigens, we used OrthoMCL (146) to identify their orthologs (Additional file 2.1). Of the 564 proteins, 376 were assigned to 102 clusters by OrthoMCL. As we were interested in the scores across all alleles of an antigen, we included all 564 in our analysis. The 564 proteins were split into 136 Gram-positive and 428 Gram-negative datasets using the species and associated Gram stain information provided by their respective databases. We also used the species hits from the TBLASTN results for this purpose. These two datasets were then run on two pipelines, each with the relevant Gram-positive or Gram-negative parameters required for some of the tools incorporated in ReVac. Of the 564, 41 were unique non-antigens from Vaxign (136) and were included to assess their scores relative to our weighting scheme. All proteins from control datasets were run through the workflow (except orthology, given the wide range of species represented) for development of the scoring scheme (Table 2.1). Inspection of positive and negative control proteins enabled optimization and implementation of score boosting for desired features carried by real antigens, as well as maximum thresholds of penalization in the case of autoimmunity and SSRs, as described in the Methods. Summary tables from ReVac runs on all datasets are available in Additional files 2.1, 2.2 and 2.3.

A subset of the controls used is presented in Table 2.2 to illustrate the process of optimizing feature scoring. The scores for each component were developed by observing trends in the predicted features of all the tools and their correlation with whether the control protein was antigenic or non-antigenic. For example, the first two antigens in Table 2.2, the pertactin autotransporter from Bordetella pertussis and the peptidoglycan-associated outer membrane lipoprotein (P6) from NTHi, have overall subcellular localization predictions suggesting surface exposure, consistent with previous experimental findings (138, 147, 148). The tools that accurately predicted these features were assigned positive weights (shown in Table 2.1) to identify other proteins displaying these features. When multiple tools show strong predictions of surface localization, the ReVac score is boosted, as this pattern was observed in multiple antigens from the dataset and indicates a strong potential vaccine candidate. The tools that predicted no features for these two antigens were not weighted negatively, as those features were not necessary for surface exposure in these two antigens but may be relevant to other proteins. We see this in the case of the Streptococcus agalactiae antigen, C protein alpha-antigen (149), for which transmembrane helices and adhesin features were predicted. These tools were also assigned positive weights for identification of these features in other proteins, based on their observed frequency within the control dataset (Table 2.1). Since some of the tools have no conclusive feature predictions for certain sequences, such antigens have lower overall ReVac scores.


General Information

No.  ReVac Score  Score Breakdown  Organism                             Gram Stain  Type
1    14.853       15.253-0.400     Bordetella pertussis                 -           Antigen
2    13.709       13.709-0.000     Non-typeable Haemophilus influenzae  -           Antigen
3    9.049        9.049-0.000      Moraxella catarrhalis                -           Antigen
4    8.192        8.192-0.000      Streptococcus agalactiae A909        +           Antigen
5    6.791        6.791-0.000      Streptococcus pneumoniae             +           Antigen
6    6.32         6.520-0.200      Neisseria meningitidis LNP21362      -           Antigen
7    5.768        7.768-2.000      Streptococcus pneumoniae             +           Non Antigen
8    2.475        5.542-3.066      Clostridium perfringens str. 13      +           Non Antigen

Surface Exposure Predictions

No.  PSORTB Localization  LipoProtein         Transmembrane Helices  Signal Peptide                      SPAAN adhesin ratio  HMM mapping to surface exposed database  Annotation/GO Terms
1    OuterMembrane        SignalPeptidase I   None                   MNMSLSRIVKAAPLRRTTLAMALGALGAAPAAHA  None                 Positive                                 outer membrane autotransporter barrel|GO:0009405,GO:0015474,GO:0045203,GO:0046819
2    OuterMembrane        SignalPeptidase II  None                   MNKFVKSLLVAGSVAALAACSSSNNDA         None                 Positive                                 peptidoglycan-associated lipoprotein|GO:0009279
3    None                 SignalPeptidase II  None                   MQFSKSIPLFFLFSIPFLA                 None                 Positive                                 Bacterial extracellular solute-binding protein
4    Cellwall             SignalPeptidase I   1                      None                                0.782535             Positive                                 hypothetical protein
5    Extracellular        Intracellular       None                   None                                None                 None                                     Thiol-activated cytolysin family protein|GO:0015485,GO:0009405
6    Periplasmic          SignalPeptidase II  None                   MFKRSVIAMACIFALSACG                 None                 None                                     Transferrin binding family protein|GO:0016020
7    None                 Intracellular       None                   None                                None                 None                                     Capsular polysaccharide synthesis family protein
8    None                 Intracellular       None                   None                                None                 None                                     shikimate dehydrogenase ec::1.1.1.25|GO:0004764,GO:0009423

Antigenicity Predictions*

No.  Antigenicity  B cell epitopes  MHC I binding  MHC II binding  MHC binding + Antigen Processing  Immunogenicity within MHC complex  Alignment to curated epitopes
1    45.05%        15.16%           94.07%         100.00%         61.10%                            13.08%                             26.92%
2    30.72%        5.23%            96.73%         94.12%          79.08%                            17.65%                             99.35%
3    44.02%        16.03%           94.57%         100.00%         69.57%                            23.91%                             None
4    50.40%        38.97%           83.10%         90.46%          43.34%                            1.79%                              None
5    43.74%        13.80%           95.33%         98.94%          73.04%                            12.10%                             22.08%
6    30.33%        15.16%           81.56%         86.68%          46.93%                            1.84%                              None
7    48.94%        4.61%            96.81%         100%            81.91%                            30.85%                             None
8    34.32%        7.75%            96.31%         98.89%          77.49%                            16.61%                             None

Adverse Features

No.  Autoimmunity with humans  Repeat regions genes & copy number  Repeat regions proteins & copy number
1    None                      None                                |APAGGAVPGG 2||PQP 3|
2    None                      None                                None
3    None                      None                                None
4    None                      None                                None
5    None                      None                                None
6    None                      None                                |ARFRRS 2|
7    None                      None                                None
8    3.32%                     None                                None

*Percentages are relative to the length of the amino acid sequence.

Table 2.2 Examples of control proteins used to develop the scoring scheme, and a summary of the outputs from each of ReVac’s components

Certain predicted features among the outputs of these tools were not assigned weights because their predictions did not reliably distinguish PVCs, and we were therefore unable to justify a positive or negative weight. For instance, PSORTB (140) suggests that the heparin binding protein (NHBA) from the Gram-negative bacterium Neisseria meningitidis, currently used in a multicomponent vaccine against meningococcal serogroup B, is localized exclusively in the periplasm. However, this is not consistent with experimental evidence indicating that the protein is exposed on the bacterial surface (150). Thus, PSORTB periplasmic predictions were assigned no negative weight, as some of them may be inaccurate or inconclusive, as in the case of NHBA. To account for this, we used multiple tools for more accurate prediction of subcellular localization. Another example is pneumolysin from Streptococcus pneumoniae, an extracellular virulence factor (5). PSORTB provided a strong extracellular prediction; however, LipoP (151) suggested a cytoplasmic protein. Again, for the same reason, intracellular predictions of LipoP were not penalized. Wherever similar trends were noticed among other tools, the weights were assigned and distributed using similar justifications (described further in Methods). The remaining non-antigens had feature predictions and annotations consistent with intracellular localization across all tools. These proteins were assigned negative weights for each tool suggesting an intracellular localization, as such proteins should be avoided as potential PVCs. A complete list of the weights assigned and the scoring scheme are presented in Table 2.1 and described in the Methods.

Tools comprising the antigenicity prediction features were all assigned positive weights relative to the proportion of antigenic regions within a protein and boosted if the presence of curated epitopes within the sequence was observed. Most of these tools operate by splitting an input protein sequence into individual peptides and analyzing them individually as potential epitopes; all proteins tend to have at least some antigenic regions. As a result, weights relative to percent of antigenic regions were assigned. Lastly, adverse features are those that should be avoided when choosing any PVC, such as repeat regions or similarity to host or commensal organism proteins.

ReVac identified repeats within the B. pertussis pertactin autotransporter and the N. meningitidis heparin binding protein. Such repeats suggest that these antigens may undergo slipped strand mispairing resulting in phase variation of the proteins, a negative feature of vaccine antigens (133). Antigens with sequence repeats in either promoter or protein coding regions are therefore penalized. Additionally, negative scores are given to antigens with similarity to host and commensal proteins, to avoid the adverse effects of cross-reactivity of an immunizing vaccine antigen. When both features are absent, ReVac adds positive weights to the score to raise the ranks of these PVCs above those having these features.


As not all the tools implemented in ReVac could be run for our control dataset, such as those related to protein conservation across their many respective species and genomes, a lower score cutoff of 8 was chosen for these datasets. Using this threshold, 74 of the 136 Gram-positive antigens had a score of at least 8 with no non-antigens in the subset. 182 of 428 Gram-negative antigens had a score of at least 8 with 2 non-antigens in the subset (Table 2.2 and Additional file 2.4). It should be noted that given the breadth of species and the large number of validated antigens and non-antigens included in our control datasets, the scoring scheme we developed should be readily applicable to many bacterial pathogens. The scoring scheme can be applied iteratively to any number of new genomes being added to databases. We anticipate that the number of new genomes of interest will grow much faster than the experimental validation of new candidates that should be added to the control dataset. It is conceivable that many of the new candidates will harbor features similar to those already curated in our dataset and therefore will not change the scoring mechanism. However, when sufficient amounts of truly novel candidates become available in the future, an update to the scoring scheme could be released after some additional manual intervention. The simplest, systematic way of identifying the need for a new release will be to determine when a critical number of important validated candidates fail to be ranked near the top of the ReVac output.
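As an illustration of how the score breakdown and the control cutoff interact, the following minimal Python sketch recomputes final scores from the "Score Breakdown" column of Table 2.2 (positive total minus penalty total) and applies the cutoff of 8. The function and variable names are hypothetical and are not part of ReVac; the values are copied from Table 2.2.

```python
# Illustrative sketch only: names are hypothetical, values are copied from Table 2.2.
CUTOFF = 8.0  # lower score cutoff chosen for the control datasets

# (row number, score breakdown "positive-penalty", type)
TABLE_2_2 = [
    (1, "15.253-0.400", "Antigen"),
    (2, "13.709-0.000", "Antigen"),
    (3, "9.049-0.000", "Antigen"),
    (4, "8.192-0.000", "Antigen"),
    (5, "6.791-0.000", "Antigen"),
    (6, "6.520-0.200", "Antigen"),
    (7, "7.768-2.000", "Non Antigen"),
    (8, "5.542-3.066", "Non Antigen"),
]


def final_score(breakdown: str) -> float:
    """Return the positive total minus the penalty total."""
    positive, penalty = breakdown.split("-")
    return float(positive) - float(penalty)


def passing_rows(rows, cutoff=CUTOFF):
    """Return the row numbers whose final score is at least the cutoff."""
    return [no for no, breakdown, _ in rows if final_score(breakdown) >= cutoff]


if __name__ == "__main__":
    for no, breakdown, kind in TABLE_2_2:
        print(no, kind, round(final_score(breakdown), 3))
    print("Rows at or above cutoff:", passing_rows(TABLE_2_2))
```

In this subset, rows 1 to 4 (all antigens) reach the cutoff, while both non-antigens and the two antigens discussed above as difficult cases (pneumolysin and NHBA, rows 5 and 6) fall below it.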

3. Application of ReVac to non-typeable Haemophilus influenzae (NTHi)

ReVac was run on 270 NTHi genomes, derived from sputum isolates obtained from a 20-year prospective study of adults with chronic obstructive pulmonary disease (COPD) (138, 152). This dataset comprised 477,769 predicted protein encoding genes. Of these, 4477 proteins had scores that were not penalized by more than 10% of their total score; these were grouped into ortholog clusters based on a consensus from 4 orthology prediction methods (see Methods). For each ortholog cluster, the average score and the range of scores within the cluster were calculated. Clusters were then retained only if at least 80% of their proteins were present in the lowly penalized set of 4477 proteins. This yielded 29 high-scoring ortholog clusters (score greater than 10) that included both core and dispensable genes of NTHi. Surveying this list provides a highly prioritized selection of orthologs for consideration and downstream experimental verification of vaccine candidacy potential. Candidates were prioritized based on a candidate cluster being highly conserved within the species (usually > 90%), and at least 80% of the proteins in the cluster being high scoring (to allow for some allelic variation), with a narrow score distribution. Based on these results, some of our top candidates include a hemopexin transporter protein, involved in extracellular transport of hemopexin, and the outer membrane lipoprotein P4 (Table 2.3). The NTHi dataset of the analysis is provided in Additional file 2.2.
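The filtering steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ReVac's implementation: the "total score" used for the 10% penalization filter is assumed to be the positive total before penalties, the 80% criterion is interpreted as the fraction of a cluster's members that pass that filter, and all names and the toy data are hypothetical. The species-level conservation criterion (usually > 90%) is not modeled here.

```python
from collections import defaultdict
from statistics import mean


def lowly_penalized(positive: float, penalty: float, max_fraction: float = 0.10) -> bool:
    """True if the penalty is at most max_fraction of the positive total (assumed definition)."""
    return positive > 0 and penalty <= max_fraction * positive


def high_scoring_clusters(proteins, min_member_fraction=0.80, min_mean_score=10.0):
    """Select ortholog clusters from (cluster_id, positive_total, penalty_total) records.

    Returns {cluster_id: (mean_score, min_score, max_score)} for clusters in which
    at least min_member_fraction of members are lowly penalized and whose mean
    final score over those members exceeds min_mean_score.
    """
    members = defaultdict(list)  # all proteins per cluster
    for cluster_id, positive, penalty in proteins:
        members[cluster_id].append((positive, penalty))

    selected = {}
    for cluster_id, entries in members.items():
        kept = [pos - pen for pos, pen in entries if lowly_penalized(pos, pen)]
        if not kept or len(kept) / len(entries) < min_member_fraction:
            continue
        avg = mean(kept)
        if avg > min_mean_score:
            selected[cluster_id] = (avg, min(kept), max(kept))
    return selected


# Hypothetical toy input, for illustration only.
TOY_PROTEINS = [
    ("clusterA", 12.0, 0.0), ("clusterA", 12.5, 0.5), ("clusterA", 11.5, 0.0),
    ("clusterA", 12.0, 0.2), ("clusterA", 12.0, 6.0),  # 4 of 5 members lowly penalized
    ("clusterB", 13.0, 5.0), ("clusterB", 12.0, 0.0),  # only 1 of 2 members lowly penalized
    ("clusterC", 6.0, 0.0), ("clusterC", 7.0, 0.0),    # lowly penalized but low scoring
]

if __name__ == "__main__":
    print(high_scoring_clusters(TOY_PROTEINS))
```

Reporting the minimum and maximum alongside the mean mirrors the emphasis on a narrow score distribution within a cluster.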

Table 2.3 Top candidates selected from ReVac’s output for Nontypeable Haemophilus influenzae and M. catarrhalis (M.W/pI represent molecular weights and isoelectric points)

All the below candidates were surface exposed, predicted antigenic, conserved core proteins with low autoimmunity and no repeat regions. Further information about these candidates is available in Additional files 2.2 and 2.3 respectively. *Present in M. canis.

Example locus   Amino acid length  M.W./pI     Annotation                                        Gene tag

NTHi
84P48H1_01193   562                62.24/9.43  Hemopexin transporter                             hxuB
84P8H1_00650    274                30.48/8.97  Lipoprotein E precursor                           Hel (P4)

M. catarrhalis
ADC73_RS07905   405                44.01/9.47  Hypothetical protein (porin family)               None
E9Y_00353       537                60.40/8.95  Protein of unknown function (DUF560)              None
M137P16B1_1805  344                38.37/8.95  Gram-negative porin protein                       None
AO373_1452*     331                36.16/9.9   Ferric iron ABC transporter iron-binding protein  None

4. Application of ReVac to Moraxella catarrhalis

The Moraxella catarrhalis dataset consisted of 69 genomes, of which 49 were obtained from NCBI and 20 were newly sequenced by our group. The latter were obtained from sputum isolates from patients with COPD (138, 152, 153). This dataset comprised 130,179 predicted protein encoding genes. Of these, 3995 met the maximum 10% penalization filter and were again filtered based on 80% presence in a cluster. Analyses resulted in 64 high-scoring ortholog clusters (score greater than 10) of core and dispensable genes of M. catarrhalis. Based on these results, the top candidates identified were an iron transporter protein, a Gram-negative porin protein and 2 previously unstudied conserved hypothetical proteins (Table 2.3). The M. catarrhalis dataset of the analysis is provided in Additional file 2.3.


5. ReVac benchmarking and runtime

ReVac was run on 4 major datasets: 2 control datasets of Gram-positive and Gram-negative antigens, used to optimize the scoring algorithm for specific components, and our test datasets of M. catarrhalis and NTHi, used to prioritize potential vaccine candidates. Two smaller datasets of mycoplasma proteins were also run as test data. For benchmarking purposes, proteins were batched into groups of 1000 and 5000 sequences and were run through ReVac’s most time-consuming commands to generate an estimate of CPU (Central Processing Unit) hours (Figure 2.3A). These were run on an isolated, dedicated server with 256GB RAM and 2 CPUs (CPU model: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz), each with 12 cores with hyperthreading (approximately 48 virtual cores). An overall estimate was also made of ReVac’s runtime in hours for our test datasets. This estimate reflects the actual runtime for the entire duration of a run on multiple servers of a compute grid, without CPU usage restrictions and in competition with other, unrelated compute jobs potentially submitted to the same servers by other users (Figure 2.3B).


Figure 2.3 ReVac’s CPU time estimates

A) An estimation of ReVac’s CPU time focused on its rate-limiting steps using batches of 1000 and 5000 proteins. Multiple runs (one for each time point on the figure) were submitted in succession on a single host, using increasing amounts of dedicated cores, each running the same batch of the respective 1000 (solid line) or 5000 proteins (dashed line). The total numbers of proteins analyzed using 1 and 48 cores are provided as labels for comparison to B. B) Real-life CPU time estimates derived from the entire ReVac workflow running on 150–300 compute clusters through Ergatis, each utilizing a single host in most cases.

E. Discussion

ReVac was developed to identify bacterial PVCs by evaluating multiple features based on ideal PVC characteristics and homology to known PVCs. A major aim of ReVac’s development was to run multilayered, redundant analyses in parallel for most of the protein feature predictions, in order to provide additional independent confidence in the respective feature predictions. The various tools employed cover the categories of essential PVC features: subcellular localization, antigenicity and immunogenicity, conservation and function, and genomic islands.

ReVac builds on previously published reverse vaccinology pipelines and includes several major improvements: 1) the analysis of multiple genomes for a given species, enabling assessment of conservation of PVCs (core genome) or, for instance, unique dispensable genome PVCs of hyper-virulent strains; 2) the evaluation of SSRs in both coding and upstream regions to assess phase variable expression of PVCs based on the presence of this nucleotide sequence feature (enabled by the use of multiple genomes); 3) the parallel analysis and summary total of the features for each protein of the input genomes, with positive scores for desirable features versus negative scores for undesirable features, reducing the false negative elimination of candidates; and 4) the flexibility to prioritize or de-prioritize any feature based on, for example, organism-specific considerations of a vaccine development project. The parallel scoring system ensures that a PVC is ranked highly or poorly because of multiple features rather than just a few.

Reverse vaccinology pipelines, such as ReVac, are beneficial for the prioritization of PVCs for vaccine development against bacteria prior to experimentation. Providing short lists of ideal candidates to assay in vitro and in animal models is desirable and is especially powerful where animal models may not be well established. The human-restricted pathogens Moraxella catarrhalis and non-typeable Haemophilus influenzae are examples of such bacteria with underdeveloped animal models (154). The added innovative ability of ReVac to assess conservation and phase variability from multiple genomes is critical for these genetically diverse human pathogens (137, 138).

To provide an estimate of genetic variation among the M. catarrhalis genomes, we used MASH (155) to generate a pairwise genome distance matrix and associated phylogenetic trees (Figures 2.4 and 2.5). We observed the formation of 4 distinct clades. Two of these clades were known to be separated based on sero-susceptibility of those strains (blue and green). Upon surveying the metadata of the genomes, we discovered that one of the other two clades clustered based on date of isolation (orange), and the most distant was another species, M. canis, which was misannotated as M. catarrhalis in NCBI. We attempted to identify similar relationships between strains in our NTHi genomes; however, there appears to be no correlation between the strains apart from clustering based on multilocus sequence types (MLST), as previously reported (138). There was no clear genetic clustering of these strains based on clinical source, geography, duration of persistence, exacerbation vs. colonization, or year of isolation. Upon investigation of our M. catarrhalis candidates, we observed that some of them could, at the protein level, recapitulate the same separation among clades as the whole genome tree (Figure 2.4C and D). It is possible that these proteins are core drivers of inter-strain variation, as not all protein clusters could replicate the whole genome tree.
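For readers unfamiliar with the quantity behind such a distance matrix, the sketch below computes the published Mash distance, D = -(1/k) ln(2j / (1 + j)), from a Jaccard index j of k-mer sets. It is a hedged illustration only: it uses exact k-mer sets on short toy sequences rather than the MinHash sketches that the MASH tool uses to estimate j, and the function names, toy sequences and the choice of k = 5 are assumptions made for this example.

```python
import math


def kmer_set(sequence: str, k: int) -> set:
    """All overlapping k-mers of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), limited to the range [0, 1]."""
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


def distance_matrix(genomes: dict, k: int) -> dict:
    """Pairwise distances keyed by (name_a, name_b); exact Jaccard stands in for MinHash."""
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    return {(a, b): mash_distance(jaccard(sets[a], sets[b]), k)
            for a in sets for b in sets}


# Hypothetical toy sequences, far shorter than real genomes.
TOY_GENOMES = {
    "strain_1": "ATGGCGTACGTTAGCCGATAGC",
    "strain_2": "ATGGCGTACGTTAGCCGTTAGC",
    "strain_3": "TTTTTTTTTTTTTTTTTTTTTT",
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(TOY_GENOMES, k=5).items():
        print(pair, round(dist, 4))
```

A matrix of this kind can then be passed to a standard tree-building method to obtain trees such as those in Figures 2.4 and 2.5.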


Figure 2.4 Whole genome trees of M. catarrhalis and protein alignment trees of ReVac’s M. catarrhalis candidates

A) Whole genome tree of the 69 M. catarrhalis genomes used in ReVac. The four clades are labeled as follows: blue indicates a sero-resistant clade, green indicates a sero-sensitive clade, orange indicates older isolates of M. catarrhalis dating to 1932, and red indicates misannotated M. canis genomes from NCBI. B) Whole genome tree of 128 currently available M. catarrhalis genomes on NCBI, which maintains the same topology as A. C) A protein alignment tree of one of ReVac’s top candidates, which separates the sero-sensitive and sero-resistant clades but is absent in the other two clades (it is also present in the respective clades of B). D) A protein alignment tree of the candidate iron transporter, which replicates the whole genome tree topology.


Figure 2.5 Whole genome tree of the 270 Nontypeable Haemophilus influenzae genomes used in ReVac


Across the overall vaccine candidates from both the NTHi and M. catarrhalis datasets, certain types of high-scoring proteins are prioritized over others. Hemolysins and TonB receptor proteins are identified as quality candidates in both species; however, they are not included in our top candidates as they were not core proteins. Other types include various transporter proteins, outer membrane proteins and porins. This result is in part due to the outer membrane, surface-exposed nature of these classes of proteins. However, ReVac’s comprehensive assessment of immunogenic epitopes and its cross-referencing of proteins against commensal organisms’ genomes indicate that the sequences of these proteins differ between the pathogen and its related commensals.

Whether these proteins might be strong candidates in other species will likely depend on their conservation across those species. The presence of proteins redundant to these will diminish their impact as vaccine candidates outside the two species considered here. The presence of similar types of proteins among the vaccine candidates is interesting, as both species occupy similar niches in the host. In total, ReVac identified a set of surface-exposed proteins in two exclusively human respiratory tract pathogens. The features of these proteins as pathogen-specific PVCs provide a strong rationale to experimentally validate the efficacy of these novel vaccine antigens against the pathogens.

F. Conclusion

The identification of core vaccine candidates provides a path for vaccine development against prokaryotic pathogens. Use of essential core genome components in vaccines reduces the impact of any selective pressure that vaccine use may impose on a pathogen. This phenomenon has been observed in the case of Group A Streptococcus, where several potentially effective vaccine candidates are not a part of the core genome (156). Selection of these non-core candidates could result in non-vaccine strains dominating a now vacant niche. Another strength of ReVac’s prioritization of candidates at the cluster level is that it allows for identification of clade-specific vaccine candidates. This approach could be powerful in situations where a species may be a commensal organism, but certain variants are pathogenic, such as in the case of E. coli (157). A reverse vaccinology analysis at a whole pan-genome level will be able to identify and distinguish these candidates from core candidates.

All vaccine candidates are antigens, but not all antigens are effective vaccine candidates for various reasons. ReVac therefore prioritizes antigenic PVCs using predictions from multiple antigenicity and immunogenicity tools which are becoming more prominent as effective tools for identification of potentially novel epitope regions (158). For example, certain antigens may induce more effective adaptive immune responses than others. It should be noted that ReVac is not an antigen predictor, but a workflow that ranks proteins by their vaccine candidacy potential. Follow-up in vitro and in vivo characterization will therefore be required to assess the validity of these PVCs as antigens. ReVac’s primary mandate is to help identify, as well as reduce the number of candidates that will have to be tested by providing the user with a ranked list of PVCs.

Here we have presented ReVac, a reverse vaccinology pipeline for the identification of bacterial protein PVCs from the input of one or more annotated bacterial genomes. ReVac’s implementation of a parallel scoring scheme for all proteins in an organism’s proteome, together with summary scores, minimizes the elimination of false negative antigens due to potential errors in the prediction tools implemented. False positive identification of antigens is reduced by assigning lower scores to proteins associated with non-antigens, and to those with similarity to human host proteins or commensal organisms’ proteins. The multiple genome assessments performed in ReVac evaluate the conservation of PVCs and identify those subject to phase variable expression in a given population of bacterial strains, to further reduce false positives. The compilation of features integrated into ReVac’s pipeline was tested on sets of positive and negative control antigens to optimize the scoring algorithm. We used ReVac to identify PVCs for the human-restricted pathogens M. catarrhalis and NTHi. We identified both known and novel PVCs for each bacterium, supporting the ability of ReVac to identify experimentally proven PVCs. Furthermore, the identification of previously uncharacterized proteins as PVCs shows the benefit of ReVac for identifying novel protein targets to investigate in future studies. ReVac can be used to identify PVCs of current bacterial pathogens for which vaccines have not yet been developed and can also be used to quickly identify PVCs of emerging pathogens as sequences become available.

G. Methods

ReVac and all its tools were built and integrated together using the open-source Ergatis workflow management system (available at http://ergatis.sourceforge.net) (141). Ergatis allows the user to monitor bioinformatic pipelines, such as ReVac, through its user interface, and allows the parallelization of several analyses on distributed computer clusters (141). Individual tools can be installed as Ergatis components and their analyses then parallelized, for any bioinformatic pipeline. For these reasons, Ergatis was chosen as a streamlined way of performing all ReVac’s computationally intensive multi-genome analyses while allowing for easy access to output data and monitoring.


Here, we provide a description of all the tools incorporated in the ReVac pipeline, including those involved in file format conversions. The complete scoring weight scheme for each component is described in Methods. The rationale for assigning different weights to specific components was derived from the results of ReVac runs on our control datasets of known antigens and non-antigens acquired from various antigen databases from multiple bacterial species (Additional file 2.1 and Figure 2.1). Components are grouped into broad categories (Table 2.1), as follows:

1. Subcellular localization

PSORTb 3.0 (140) – A bacterial protein subcellular localization (SCL) predictor geared for all prokaryotes, including archaea and bacteria with typical and atypical membrane/cell wall topologies (140). PSORTb 3.0 assigns a protein one of seven possible SCLs, namely Extracellular, Cell Wall, Outer Membrane, Periplasmic, Cytoplasmic Membrane, Cytoplasmic and Unknown. We ran PSORTb 3.0 on the control datasets and on the 69 M. catarrhalis and 270 NTHi genomes using default parameters, with a cutoff of 7.5 for the overall prediction score. Proteins that were predicted to localize at multiple sites were also included if their cumulative score was above 7.5. We used the long output-type option provided by PSORTb, and predictions are provided in the summary table. Each protein was given a +1 towards its overall score if it wholly or partially localized to the outer membrane or cell wall, or was predicted to be extracellular. If the protein was predicted to localize only to the cytoplasm or cytoplasmic membrane, it was given a −1. A prediction of Periplasmic was not scored, as proteins from control datasets showed that some outer membrane proteins were often miscalled as periplasmic.
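The PSORTb weighting rule described above can be expressed as a short Python sketch. This is an illustration rather than ReVac's code: the function name is hypothetical, the localization labels follow the spellings shown in Table 2.2 where available (the cytoplasmic labels are assumed), and the input is assumed to be the set of sites that already passed the 7.5 score cutoff.

```python
# Illustrative sketch; label spellings for the cytoplasmic sites are assumptions.
SURFACE_SITES = {"OuterMembrane", "Cellwall", "Extracellular"}
INTERNAL_SITES = {"Cytoplasmic", "CytoplasmicMembrane"}


def psortb_weight(localizations) -> float:
    """Weight contributed by PSORTb for a protein's predicted site(s).

    +1 if any predicted site is surface-associated (wholly or partially),
    -1 if all predicted sites are cytoplasmic or cytoplasmic membrane,
     0 otherwise (for example Periplasmic, Unknown, or no confident call).
    """
    sites = set(localizations)
    if sites & SURFACE_SITES:
        return 1.0
    if sites and sites <= INTERNAL_SITES:
        return -1.0
    return 0.0


if __name__ == "__main__":
    for call in ({"OuterMembrane"}, {"Periplasmic"}, {"Cytoplasmic"},
                 {"CytoplasmicMembrane", "OuterMembrane"}):
        print(sorted(call), psortb_weight(call))
```

Leaving the periplasmic case at zero reflects the observation that some outer membrane proteins, such as NHBA, were called periplasmic.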


LipoP (151) – A hidden Markov model (HMM)-based tool to predict lipoprotein signal peptides in Gram-negative Eubacteria, able to distinguish between lipoproteins (signal peptidase II (SPaseII)-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins (151). Although it was developed for Gram-negatives, it has shown effective performance on the prediction of Gram-positive bacterial lipoproteins. We ran LipoP using default parameters with a cutoff of − 3. Each protein was given a +1 towards its overall score if it was predicted to be cleaved by a signal peptidase or to have transmembrane helices.

TMHMM 2.0 (159) – A membrane protein topology prediction method, based on an HMM. It predicts the number of transmembrane helices present in a protein sequence. TMHMM was run using default parameters and predicted number of helices are displayed in the summary table. Each protein was given + 0.5 if it had a single helix. It is not scored for 2 helices and penalized for more than 2 helices (− 0.2 for 3 and − 2 for 4 or more), as such proteins are harder to purify and less likely to be accessible on the cell surface. If the protein was predicted partially cytoplasmic or in the cytoplasmic membrane, it was penalized an additional − 2.
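The helix-count weighting above amounts to a small lookup, sketched here in Python. The function name and the boolean flag for a cytoplasmic or cytoplasmic membrane prediction are hypothetical, and a protein with no predicted helices is assumed to receive no TMHMM weight.

```python
def tmhmm_weight(n_helices: int, cytoplasmic: bool = False) -> float:
    """Weight contributed by TMHMM for a given number of predicted helices.

    +0.5 for a single helix, 0 for two helices (and, by assumption, for none),
    -0.2 for three helices, -2 for four or more. An additional -2 applies if
    the protein is predicted partially cytoplasmic or in the cytoplasmic membrane.
    """
    if n_helices == 1:
        weight = 0.5
    elif n_helices == 3:
        weight = -0.2
    elif n_helices >= 4:
        weight = -2.0
    else:  # 0 or 2 helices: not scored
        weight = 0.0
    if cytoplasmic:
        weight -= 2.0
    return weight


if __name__ == "__main__":
    for n in range(6):
        print(n, tmhmm_weight(n), tmhmm_weight(n, cytoplasmic=True))
```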

SignalP 4.1 (160) – A program developed for the prediction of signal peptides from amino acid sequences that are targeted to the secretory pathway (160). We ran SignalP using the ‘best’ option for prediction allowing a truncated length of 70 for the peptides. We also disabled its graphical output as we were only interested in presence/absence of signal peptides. Its raw output provides coordinates of a protein’s signal peptide, which we then used to map back the actual signal peptide sequence for the summary table. Each protein was given a +1 if it had a signal peptide.


SPAAN (161) – An artificial neural network developed to predict the probability of a protein being an adhesin. Adhesins are surface exposed proteins which mediate the adhesion of microbial pathogens to host cells (161). We ran SPAAN using a cutoff of 7. Each protein was given a +1 if it was a predicted adhesin.

HMMPFAM 3.0 (162) – A HMMER 3.0, HMM based tool, which queries a given amino acid sequence against multiple relevant subsets of TIGRFAM and PFAM databases (163, 164) namely, surface exposed, signal proteins, secreted proteins, and bacteriocins. This database of motifs was developed from the TIGRfam v15.0 and Pfam v31.0 databases, manually filtered based on keywords in their descriptions, which were relevant to surface exposure. We used default noise threshold cutoffs for all our HMM runs. Each protein was given a +0.5 for any positive hit against motifs in this database.

2. Antigenicity & immunogenicity

Antigenic (http://www.bioinformatics.nl/cgi-bin/emboss/antigenic) – An EMBOSS package utilizing the method of Kolaskar and Tongaonkar to predict antigenic determinants in proteins (165). This tool outputs the antigenic peptide regions within a protein. We used a minimum peptide length of 9. Overlapping peptides are merged into larger regions to calculate overall percent coverage of the whole protein. The total number of antigenic peptides and their percent coverage of the protein are presented in the summary table. Each protein is given + 0.5 for having antigenic regions and an additional score of the ratio of its antigenic regions over its amino acid length.
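The merging of overlapping peptides and the resulting coverage-based weight can be sketched as follows. This is a minimal Python illustration with hypothetical function names; regions are assumed to be given as 1-based inclusive (start, end) coordinates, which is an assumption about the input format rather than a description of the EMBOSS output.

```python
def merge_regions(regions):
    """Merge overlapping (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current region
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def coverage(regions, protein_length: int) -> float:
    """Fraction of the protein covered by the merged regions."""
    covered = sum(end - start + 1 for start, end in merge_regions(regions))
    return covered / protein_length


def antigenic_weight(regions, protein_length: int, min_len: int = 9) -> float:
    """+0.5 for having antigenic regions, plus the covered fraction of the protein."""
    kept = [(s, e) for s, e in regions if e - s + 1 >= min_len]
    if not kept:
        return 0.0
    return 0.5 + coverage(kept, protein_length)


if __name__ == "__main__":
    example = [(1, 10), (5, 20), (30, 40)]
    print(merge_regions(example))          # two merged regions
    print(antigenic_weight(example, 100))  # 0.5 plus 31% coverage
```

The same merge-and-cover step corresponds to the tiling of peptides into consensus regions used by the epitope prediction components described below.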


MHC Class I (143) – An MHC I binding prediction tool acquired from the Immune Epitope Database & Analysis Resource (IEDB), containing 8 different peptide binding prediction methods. Of the 8, we applied the consensus method, using default parameters, across all 78 human MHC I alleles available for this method. The raw outputs assign a rank to each peptide of length 9 from a protein, using a sliding window of a single amino acid. The 99th percentile peptide sequences across all 78 alleles are selected, sorted based on coordinates within a protein, and then tiled together to produce a consensus MHC I binding region. The total number of binding peptides, the number of alleles they bind, their total consensus region, and their percent coverage are presented in the summary table. Each protein is given a score equal to the ratio of its binding regions over its amino acid length, if its binding regions cover 80–95% of the protein. If that ratio is greater than 95% it is given a +1. The weight is also boosted by the ratio of binding peptides of length 9 over the total number of peptides possible for a given protein with a sliding window of 1. The rationale is that when a protein is processed for MHC loading, the more of its possible peptides that can bind, the better the overall binding of the protein.
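A hedged Python sketch of this weighting is given below. The text does not specify how the coverage score and the peptide-ratio boost are combined or what happens below 80% coverage, so this example assumes that the boost is added to the coverage score and that coverage below 80% contributes no base score; these choices and the function names are assumptions. The number of possible peptides for a sliding window of 1 is taken as protein length minus peptide length plus 1.

```python
def possible_peptides(protein_length: int, k: int = 9) -> int:
    """Number of length-k peptides obtainable with a sliding window of 1."""
    return max(0, protein_length - k + 1)


def mhc_weight(coverage: float, n_binding_peptides: int,
               protein_length: int, k: int = 9) -> float:
    """Coverage-based score plus peptide-ratio boost (additive combination assumed).

    coverage: fraction of the protein covered by the consensus binding region.
    """
    total = possible_peptides(protein_length, k)
    if total == 0:
        return 0.0
    if coverage > 0.95:
        base = 1.0
    elif coverage >= 0.80:
        base = coverage
    else:
        base = 0.0  # assumption: coverage below 80% is not scored
    boost = n_binding_peptides / total
    return base + boost


if __name__ == "__main__":
    # A 100-residue protein has 92 possible 9-mers.
    print(possible_peptides(100))
    print(mhc_weight(0.90, 46, 100))
    print(mhc_weight(0.97, 92, 100))
```

Passing k=15 gives the corresponding peptide count for the MHC Class II component, which uses peptides of length 15.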

MHC Class II (143) – Also acquired from IEDB, this tool comprises 5 methods for MHC peptide binding prediction. Here again, we chose the consensus method, using default parameters, across all 63 human MHC II alleles available. As in the MHC Class I component, the raw outputs assign a rank to each peptide of length 15 from a protein, using a sliding window of a single amino acid. Here, the 95th percentile is selected, as MHC class II binding is less efficient due to the MHC II molecules being open at the ends of the peptide binding region. The post-processing of the raw outputs is handled the same as in MHC Class I. Each protein is scored following the scheme used for MHC Class I.

NetCTLpan (143) – Another MHC I binding predictor acquired from IEDB included in our analysis. This tool makes its predictions by also considering the precursory steps involved in MHC peptide binding, such as proteasome cleavage and TAP (Transporter associated with antigen processing). Analysis was conducted across 12 MHC allele super-types available in this package, using a peptide length of 9 with default parameters for thresholds, TAP & cleavage weightage. The post processing of the raw outputs is handled the same as in MHC Class I. Each protein is scored like the scheme followed in MHC Class I.

B Cell Pred (143) – A linear B cell epitope predictor also procured from IEDB. It scores amino acid residues using 6 different scale-based methods. We have applied all 6 methods in our analysis and only those peptide sequences which are predicted to be epitopes across all methods are considered. Analysis was conducted using default cutoffs and parameters across all 6 methods using a peptide length of 7. As in the case of MHC I, peptide epitopes from all 6 methods are tiled together to form consensus predicted B-cell epitope regions. Each protein is given a score of the ratio of its binding regions over its amino acid length. Here again it is boosted by the ratio of binding peptides of length 7, over the total number of peptides possible for a given protein with a sliding window of 1.

Immunogenicity (143) – A Class I Immunogenicity tool from IEDB that uses amino acid properties, as well as their position within the peptide, to predict the immunogenicity of a peptide-MHC (pMHC) complex. The peptides predicted to be effective MHC I binders from MHC Class I & NetCTLpan (the 99th percentile) are passed as inputs here and run using default parameters. Each protein is scored similarly to the scheme followed in MHC Class I, but scores are awarded for 20% coverage or more, as not all MHC-bound peptides induce an immune response.

BLAT (142, 143) – Using the BLAST-like Alignment Tool, amino acid sequences are compared to a database of experimentally curated epitope sequences acquired from IEDB. Using the BLAST output type and default parameters, protein regions mapping to curated epitopes are again tiled to get consensus epitope regions and their percent coverage. Each protein is given a score of the ratio of its homologous regions over its amino acid length. If that ratio is greater than 70% it is given another + 1. Here, if the protein is positive for surface exposure in any 3 among PSORTb, LipoP, SignalP and IEDB, it is given a + 2, as this pattern was observed frequently in positive control datasets.

3. Conservation & function

Jaccard Clusters of Orthologous Genes (COG) Analysis (166) – A two-phase protein clustering algorithm, used to generate protein paralog and ortholog clusters. It parses an all-v-all BLAST (167) output of whole genomes, after which a Jaccard similarity coefficient is calculated for every pair of proteins. Then it performs a bidirectional best hit analysis on the paralog clusters generated by the first phase of the algorithm, rather than on individual proteins to call its orthologs (166). Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods).


PanOCT (168) – A tool for pan-genomic analysis of closely related prokaryotic species or strains using conserved gene neighborhood information to separate recently diverged paralogs into orthologous clusters (168). We elected to use 90% conservation as our cut-off and ran the analysis at default parameters. The PanOCT component was adapted to use a list of GFF files, a multi-FASTA file of all amino acid sequences and an all-v-all BLAST file run using the -m8 output option, to generate its config file and then perform clustering. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods).

OrthoMCL (146) – This program implements a scalable method for constructing orthologous groups using a Markov Cluster algorithm to group (putative) orthologs and paralogs (146). The analysis makes use of a MySQL database in which data are stored during the clustering process, using an all-v-all m8 BLAST file and a multi-FASTA of all proteins, run at default parameters. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods).

LS-BSR (169) – Large Scale Blast Score Ratio (LS-BSR) compares the genetic content of hundreds to thousands of bacterial genomes and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed (169). LS-BSR was run at default parameters, using a list file of strain specific DNA sequences of predicted genes to produce a BSR matrix of proteins that cluster with each other. The default cutoff is 0.7, but we elected to use 0.6, as some proteins that clustered together as core in other tools were barely missing the default cutoff. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods). Here, if a given protein is shown to cluster in 90% of the genomes across multiple methods (described above), it is given an additional + 0.1 for each of those clustering methods.
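As an illustration of how a BLAST score ratio cutoff translates into a core call, the sketch below computes a BSR as the alignment bit score against a genome divided by the self-alignment bit score, then checks presence in at least 90% of genomes. The function names and input layout are hypothetical and do not reflect LS-BSR's actual interface.

```python
def blast_score_ratio(query_bitscore, self_bitscore):
    """BSR: bit score of a sequence against a genome divided by the
    bit score of the sequence aligned against itself."""
    if self_bitscore <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return query_bitscore / self_bitscore


def is_core(bsr_by_genome, bsr_cutoff=0.6, genome_fraction=0.9):
    """Call a gene core if its BSR meets the cutoff in at least
    `genome_fraction` of the genomes (hypothetical helper)."""
    present = sum(1 for bsr in bsr_by_genome.values() if bsr >= bsr_cutoff)
    return present / len(bsr_by_genome) >= genome_fraction


# A gene at BSR 0.65 in 9 of 10 genomes: core at 0.6, not core at 0.7.
example = {f"genome_{i}": 0.65 for i in range(9)}
example["genome_9"] = 0.20
core_at_relaxed_cutoff = is_core(example, bsr_cutoff=0.6)
core_at_default_cutoff = is_core(example, bsr_cutoff=0.7)
```

The example mirrors the situation described above, where a protein narrowly misses the 0.7 cutoff but is retained as core at 0.6.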

Attributor – An in-house developed Python script which refreshes annotations for a FASTA file of amino acid sequences, and assigns GO terms wherever applicable. It accepts inputs from TMHMM, LipoP, RAPSearch2 and HMMPFAM to call a specific annotation for a given protein. Each protein that has a GO term belonging to our database of surface exposed GO terms is scored + 1 for each GO term, and − 1 for each GO term falling in our non-surface exposed GO term database, if none were identified in our surface exposed GO terms. Both GO term databases were constructed from the prokaryotic subset of GO terms, and then manually filtered for relevant surface exposed GO annotations, as well as Attributor GO term predictions for experimentally curated surface and non-surface exposed proteins acquired from the ePSORTB database. Here we also looked to rescue PSORTb predictions calling periplasmic, when proteins are surface exposed based on Attributor, scoring these an additional + 2.

4. Exclusion features

Autoimmunity (149) – Autoimmunity is a Perl script taken from NERVE (149) and adapted to run on Ergatis. It uses BLAST to compare each amino acid sequence as a query against the human proteome, allowing for 3 substitutions and 1 mismatch, over a minimum peptide length of 9. Raw outputs are again tiled together to give consensus regions of autoimmunity against human proteins. Each protein that doesn’t map to the human proteome is given a +1. Any protein that has a positive hit is penalized by 2 times the ratio of its homologous regions over its amino acid length. If that ratio is greater than 20% it is penalized another − 2.
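The autoimmunity scoring rule above can be written as a small function. This is a sketch of the rule as stated in the text, with a hypothetical function name and inputs (the number of residues in the tiled human-homologous regions and the protein length); it is not code taken from ReVac.

```python
def autoimmunity_score(homologous_aa, protein_len):
    """Score a protein from its tiled regions of homology to human proteins.

    homologous_aa: residues covered by consensus homologous regions.
    protein_len: amino acid length of the protein.
    """
    if homologous_aa == 0:
        return 1.0  # no hit against the human proteome
    ratio = homologous_aa / protein_len
    score = -2.0 * ratio  # penalty of 2 times the homologous ratio
    if ratio > 0.20:
        score -= 2.0  # additional penalty above 20% coverage
    return score


no_hit = autoimmunity_score(0, 300)
small_hit = autoimmunity_score(30, 300)
large_hit = autoimmunity_score(90, 300)
```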

Autoimmunity Commensals (149) – This component is an adaptation of the one above that runs against a non-redundant database of a commensal organism of choice. In our case, since we were looking at NTHi & M. catarrhalis, the most closely related species selected were H. haemolyticus and M. bovis, respectively. For NTHi, the database of non-redundant protein amino acid sequences was made from 13 strains of H. haemolyticus acquired from NCBI and then clustered using OrthoMCL. However, only one strain of M. bovis was available on NCBI. Each protein is scored the same as autoimmunity.

SSR_Finder (170) – SSR_Finder is a script developed by Siena et al. (170) which looks for Simple Sequence Repeats, up to 10 base pairs in length, in DNA coding sequences and 500 bp upstream of the gene. SSRs have been shown to contribute to phase variation of proteins, allowing generation of different protein isoforms and mediating on/off translational switching through frameshifts (170). It scans through a list of multi-contig DNA FASTA files for such SSRs. Each gene is given a +1 for absence of an SSR. Presence of a repeat is penalized − 0.5 for each. An additional − 0.25 is received if the repeat is in the promoter region, − 0.5 if the repeat has the potential to cause a frameshift, as well as − 0.01 times the total length of the repeat.
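The SSR penalties listed above can be combined as in the following sketch. The `Repeat` record and its field names are hypothetical, and the sketch assumes that every penalty is applied per repeat and summed, which is one reading of the text rather than a confirmed detail of the ReVac scoring code.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    total_length: int           # total length of the repeat tract, in bp
    in_promoter: bool           # located in the upstream (promoter) region
    frameshift_potential: bool  # could shift the reading frame


def ssr_score(repeats):
    """Return +1 when a gene has no SSR, otherwise the summed penalties."""
    if not repeats:
        return 1.0
    score = 0.0
    for repeat in repeats:
        score -= 0.5  # base penalty for each repeat
        if repeat.in_promoter:
            score -= 0.25
        if repeat.frameshift_potential:
            score -= 0.5
        score -= 0.01 * repeat.total_length
    return score


promoter_repeat = ssr_score([Repeat(12, in_promoter=True, frameshift_potential=False)])
```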


SSR_Finder_Protein (170) – The above script was adapted to run on protein sequences looking for repeats, up to 20 amino acids in length, which would allow conformational changes in the protein.

Each protein is penalized − 0.2 for each protein repeat, up to a maximum penalty of − 1.

5. Genomic islands

IslandPath (171) – IslandPath is a tool developed for the detection of genes of potential horizontal transfer origin, known as Genomic Islands, in prokaryotes (171). IslandPath accepts NCBI Protein Table (ptt) files of the proteome as well as protein & coding sequence FASTA files with their coordinates to identify possible genomic islands. Each protein present within a genomic island is penalized − 0.5.

6. Foundation components

Input data passed through ReVac must be supplied in a variety of formats, depending on the component being invoked, along with supplementary data from other predictive components. ReVac uses the following foundation components, which are provided as a part of the Ergatis workflow (141).

Genbank2bsml – Accepts standard GenBank file formats (.gbk) for conversion to Bioinformatic Sequence Markup Language (BSML), which is an Ergatis standard file format for data in addition to the more common FASTA files.

Bsml2fasta – Accepts the BSML inputs from Genbank2bsml for conversion to FASTA files. This component allows for the generation of single sequence FASTA files or a larger multi-sequence file, providing the option of generating numerical sequence IDs for each sequence as well as filtering files to contain either solely nucleotide or amino acid sequences.

Split_multifasta – Splits the multi-sequence file(s) generated by Bsml2fasta into smaller files containing a user-defined number of sequences, for distribution onto the grid for the later components.

Bsml2ptt – Allows conversion of the BSML files into a tab delimited ptt file. Currently only utilized in the IslandPath component.

Extract_CDS_Features – An in-house developed script to parse Genbank files and extract all features relevant to any coding sequences present. These include contig ID, coordinates, amino acid length, estimated molecular weight, isoelectric point, gene name, and annotation.

Formatdb/Xdformat – These components format and index whole proteome FASTA files for all-v-all BLAST searches.

NCBI-BLASTP/WU-BLASTP – Different components require different types of BLAST outputs; hence both are available for use in ReVac.

RAPSearch2 (172) – A new upgraded protein similarity search tool for next generation sequencing data, which scans the latest UniRef and UniProt databases for use in Attributor.


The ReVac workflow package available on GitHub (https://github.com/admelloGithub/ReVac-package) is designed to work through the Ergatis management system, after all required tools and dependencies are installed.

7. Phylogenetic tree construction

We used MASH (155) with default settings to acquire pairwise distance matrices for our whole genome trees. These were then converted into Newick tree files using the Neighbor-Joining method in MEGA7 (173), and unrooted tree figures were constructed in RStudio using the APE package (174). For our protein ortholog trees, the amino acid FASTA sequences were acquired from the outputs of ReVac’s orthology components and aligned using ClustalW in MEGA7 with default parameters, and Newick tree files were generated using the Neighbor-Joining method, for use in RStudio with the APE package.

H. Supplementary information

The following additional tables are available for download at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6916091/

Additional file 2.1. Control datasets used for development of the scoring scheme.(573K, xlsx)

Additional file 2.2. Nontypeable Haemophilus influenzae dataset.(87M, xlsx)

Additional file 2.3. Moraxella catarrhalis dataset.(25M, xlsx)

Additional file 2.4. Examples of control proteins used for development of the scoring scheme.(20K, xlsx)


I. Acknowledgements

We thank the members of the Institute for Genome Sciences’ Informatics Resource Center for IT and bioinformatics support.

J. Identification of Streptococcus pneumoniae vaccine candidates with ReVac.

K. Assessing pneumococcal genomic diversity

One of the priorities of reverse vaccinology through the ReVac pipeline is the identification of core potential vaccine candidates (PVCs). As we know, the pneumococcus is a highly diverse species with several different genotypes and ~100 different serotypes. With the current vaccines, a significant problem is pneumococcal vaccine escape as described in chapter 1. This makes the selection of the candidate pneumococcal genomes that would be used for reverse vaccinology to identify core antigens, a crucial step of the process. Considering the large amount of pneumococcal diversity, any bias to some strains or some serotypes over others in our initial ReVac input dataset could lead to the identification of false positive core pneumococcal PVCs.

In order to circumvent this possible bias in our input pneumococcal genomes, we first decided to acquire an estimate of pneumococcal genomic diversity. Several thousand pneumococcal genomes have been sequenced over the years and are present within the National Center for Biotechnology Information (NCBI) genome database. We first acquired all assembled and annotated pneumococcal genomes from NCBI (as of 2018), which yielded 7,591 genomes. In order to visualize the overall diversity across these 7,591 genomes, we generated a pairwise distance matrix from the genomic DNA sequences using a nucleotide k-mer based mutation distance estimation tool called MASH (155). This pairwise distance matrix was used to make a phylogenetic tree using the Neighbor-Joining method with the MEGAX tool (175), shown in Figure 2.6A. The phylogenetic tree (Figure 2.6A) clearly showed the formation of several clades of Spn genomes, as expected.
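To make the idea of a k-mer based distance concrete, the sketch below computes a Mash-style distance from the Jaccard index j of two k-mer sets, D = -(1/k) ln(2j / (1 + j)). It is a simplification for illustration: Mash estimates j from MinHash sketches of canonical k-mers, whereas this sketch uses exact k-mer sets on a single strand. The function names are hypothetical, and k = 21 is used as the default here to match Mash's default k-mer size.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (single strand only)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


def mash_distance(seq_a, seq_b, k=21):
    """Mash-style distance from an exact Jaccard index (simplified)."""
    j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
    if j == 0:
        return 1.0  # no shared k-mers: maximum distance
    if j == 1:
        return 0.0  # identical k-mer content
    return -math.log(2 * j / (1 + j)) / k


# One substitution near the 3' end of a short toy sequence.
d = mash_distance("ACGTTGCAAGCT", "ACGTTGCAAGGT", k=4)
```

Applying such a function to every pair of genomes would yield a pairwise distance matrix of the kind used as input for the Neighbor-Joining tree.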

L. Defining a representative pneumococcal sub-population

The 7,591 genomes we acquired contained all pneumococcal genomic diversity, at least among sequenced isolates. However, performing reverse vaccinology on thousands of genomes is a computationally intensive task. Moreover, several of these genomes are potentially genetically similar or identical and would be redundant in the dataset. Due to these two considerations, we elected to subsample the entire genomic dataset to a more computationally manageable number. Given our pre-requisite of identifying core pneumococcal PVCs, we estimated using a sample size calculator (176) that for a genomic population of 7,591 genomes, with 99% confidence, and a confidence interval of <5%, 258 genomes were sufficient to call core antigens present in at least 90% of the sample.

However, this sample size of 258 genomes was estimated based on purely random sampling of the 7,591-genome population. As we have seen in Figure 2.6A, pneumococcal diversity could pose a serious problem with purely random sampling, despite the high statistical likelihood of a PVC still being core in 258 randomly sampled pneumococcal genomes. Due to this, we decided not to randomly select 258 genomes, but instead to identify the optimal 258 clades of the phylogenetic tree in Figure 2.6A, and then select one representative genome from each of the 258 clades as our input ReVac pneumococcal dataset. These genomes would be prioritized based on their genomic quality, such as completeness, fewest contigs, and average pneumococcal GC content.


To determine the 258 clades of the phylogenetic tree (Figure 2.6A), we used k-means clustering on the pairwise distance matrix, with a k of 258. K-means is a hard classification clustering technique (177), which assigns each of the 7,591 genomes to one of 258 clusters, based on its genetic similarity to the other genomes within that cluster, using the pairwise distance matrix. A representation of the 258 clades on the original phylogenetic tree is shown in Figure 2.6B left, and a topology view is shown in Figure 2.6B right. We can clearly see the identification of 258 clusters representing pneumococcal genomic diversity. We then picked the highest quality genome from each cluster (as described above) and constructed a new phylogenetic tree for these 258 genomes (Figure 2.6C). Based on the similarity of this tree to the tree of 7,591 genomes, we concluded that we had a good representation of pneumococcal genomic diversity in our ReVac input genomic dataset.
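The selection of one representative genome per cluster can be sketched as follows. The sketch assumes cluster labels have already been assigned (for example by k-means on the distance matrix), and the record layout, field names, and tie-breaking order (complete genomes first, then fewest contigs, then GC content closest to the dataset average) are assumptions chosen to reflect the quality criteria named in the text.

```python
from statistics import mean


def pick_representatives(genomes):
    """Pick the highest quality genome from each cluster.

    genomes: list of dicts with the hypothetical keys
      'id', 'cluster', 'complete' (bool), 'contigs' (int), 'gc' (percent).
    Returns {cluster_label: genome_id}.
    """
    average_gc = mean(g["gc"] for g in genomes)
    best = {}
    for g in genomes:
        # Lower tuples sort first: complete, fewest contigs, typical GC
        key = (not g["complete"], g["contigs"], abs(g["gc"] - average_gc))
        current = best.get(g["cluster"])
        if current is None or key < current[0]:
            best[g["cluster"]] = (key, g["id"])
    return {cluster: genome_id for cluster, (_, genome_id) in best.items()}


toy_genomes = [
    {"id": "g1", "cluster": 0, "complete": False, "contigs": 40, "gc": 39.6},
    {"id": "g2", "cluster": 0, "complete": True, "contigs": 1, "gc": 39.7},
    {"id": "g3", "cluster": 1, "complete": False, "contigs": 25, "gc": 39.9},
    {"id": "g4", "cluster": 1, "complete": False, "contigs": 12, "gc": 39.5},
]
representatives = pick_representatives(toy_genomes)
```

The GC values in the toy records are illustrative placeholders, not measurements from the dataset.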

Figure 2.6 Whole genome trees of Streptococcus pneumoniae genomes.

A) Whole genome tree of the 7,591 Streptococcus pneumoniae genomes used in ReVac. B) Whole genome tree of 7,591 S. pneumoniae genomes colored by k means clusters (left) and in topology view (right). C) Whole genome tree of the 258 genomes selected from each of the clades (left) and in topology view (right).


Figure 2.7 Protein alignment trees of Streptococcus pneumoniae vaccine candidates

A) Protein alignment trees of ortholog clusters of known vaccine candidates PspA (left) and ZmpB (right) with ~61% and ~66% average pairwise percent identities respectively. B) Protein alignment trees of ortholog clusters of ReVac prioritized candidates O-Gly, CbpD, ABC transporter, Iron ABC transporter, & LysM (left to right), with >95% average pairwise percent identities.


M. Application of ReVac for prioritization of conserved pneumococcal potential vaccine candidates (PVCs)

Based on these 258 pneumococcal genomes, we performed reverse vaccinology using ReVac as described in the previous section (Section 2.1), with the use of Streptococcus mitis as the next closest species for any commensal autoimmunity (See Section 2.1 Methods).

The results of the ReVac run are shown in Additional File 2.5: https://github.com/admelloGithub/Data

A list of the top 5 highest scoring pneumococcal PVCs, as well as 2 known pneumococcal PVCs from literature (178, 179), and their characteristics are shown in Table 2.4. One of the highest scoring pneumococcal antigens within our ReVac control dataset (see Additional File 2.1 for controls, SP_0965/CN52_RS0109140 for ReVac score in Additional File 2.5) was also verified in this dataset. It was a core pneumococcal protein with consistent scores across the orthologs, demonstrating the functionality of the scoring scheme as expected (not prioritized here as higher scoring candidates were identified). Amino acid alignment-based trees of these 7 pneumococcal genes and their orthologs within the 258 genomes are shown in Figure 2.7.

Table 2.4 List of Streptococcus pneumoniae vaccine candidates.

Spn TIGR4 loci | Gene/Product | ReVac prioritized | Comments
SP_0117 | PspA | No | Variable and autoimmune regions. Issues with orthology. Protein SSR repeats.
SP_0664 | ZmpB | No | Variable. Issues with orthology.
SP_0368 | O-Glycosidase | Yes | Highly conserved, cell wall integrated, nasopharynx special.
SP_2201 | CbpD | Yes | Highly conserved, Adhesin features.
SP_2197 | ABC transporter | Yes | Highly conserved, Adhesin features, Signal peptide.
SP_1032 | Fe transporter | Yes | Highly conserved, Signal peptide.
SP_0107 | LysM | Yes | Highly conserved, Adhesin features, Signal peptide.

We can clearly see that for the two known pneumococcal PVCs (Figure 2.7A), PspA and ZmpB, there exists a large amount of amino acid sequence diversity within the orthologs across the pneumococcal isolates. Such PVCs would make the designing of protein vaccines a challenge as few conserved regions would exist. However, the top 5 ReVac prioritized PVCs are not only predicted to be surface exposed and highly antigenic, but also extremely highly conserved, with high amino acid sequence similarities within the 258 orthologs reflected through the low genetic distance in the trees (Figure 2.7B). Such PVCs would allow for the design of highly conserved, species-wide pneumococcal protein vaccines, and demonstrate the power of reverse vaccinology in vaccine design.


III. Chapter 3: An in vivo atlas of host–pathogen transcriptomes during Streptococcus pneumoniae colonization and disease 2

Preface

This chapter fulfills Specific Aim 2: Survey candidates identified through reverse vaccinology with the host-pathogen transcription profiles of invasive pneumococci at various anatomical sites in mice using dual RNA-seq. Here, we will identify relevant host-pathogen gene expression and pathways in the pneumococcus and the mouse, using dual RNA-seq.

A. Abstract

Streptococcus pneumoniae (Spn) colonizes the nasopharynx and can cause pneumonia. From the lungs it spreads to the bloodstream and causes organ damage. We characterized the in vivo Spn and mouse transcriptomes within the nasopharynx, lungs, blood, heart, and kidneys using three Spn strains. We identified Spn genes highly expressed at all anatomical sites and in an organ-specific manner; highly expressed genes were shown to have vital roles with knockout mutants. The in vivo bacterial transcriptome during colonization/disease was distinct from previously reported in vitro transcriptomes. Distinct Spn and host gene expression profiles were observed during colonization and disease states, revealing specific genes/operons whereby Spn adapts to and influences host sites in vivo. We identified and experimentally verified host-defense pathways induced by Spn during invasive disease, including pro-inflammatory responses and the interferon response. These results shed light on the pathogenesis of Spn and identify therapeutic targets.

2 D'Mello, A., Riegler, A. N., Martínez, E., Beno, S. M., Ricketts, T. D., Foxman, E. F., Orihuela, C. J., & Tettelin, H. (2020). An in vivo atlas of host-pathogen transcriptomes during Streptococcus pneumoniae colonization and disease. Proceedings of the National Academy of Sciences of the United States of America, 117(52), 33507–33518. https://doi.org/10.1073/pnas.2010428117

B. Significance statement

D’Mello et al. detail host-pathogen interaction gene expression profiles of Streptococcus pneumoniae (Spn) and its infected host at disease relevant anatomical sites using mice as experimental models. The authors identify the shared and organ-specific transcriptomes of Spn, show that bacterial and host gene expression profiles are highly distinct during asymptomatic colonization versus disease-causing infection, and demonstrate that Spn and host genes with high levels of expression contribute to pathogenesis or host defense, respectively, making them optimal targets for intervention.

C. Introduction

The Gram-positive microbe Streptococcus pneumoniae (Spn) asymptomatically colonizes the nasopharynx (15). It is also an opportunistic pathogen which commonly infects young children, immunodeficient patients and the elderly (180). It was estimated that in 2015, 192,000–366,000 deaths in children under 5 years of age were the result of pneumococcal infections (181). With the advancement of antibiotics and introduction of the first pneumococcal conjugate vaccine in 2000, and subsequent vaccines, deaths attributable to Spn have declined (181, 182). However, increasing antibiotic resistance and serotype replacement have been observed (183). A study conducted from 2006-2016 showed increased resistance to penicillin, amoxicillin, ceftriaxone and meropenem. Implementation of the PCV13 vaccine in 2010 (184) resulted in increased incidence of non-vaccine Spn serotypes (185). Thus, despite major advances in public health, pneumococcal infections continue to be a significant cause of morbidity and mortality, and further investigations are warranted.

The pneumococcus infects a variety of anatomical sites and causes a range of illnesses, some resulting in acute mortality. Spn is a common cause of sinusitis, otitis media, bronchitis, pneumonia, bacteremia, sepsis, and meningitis (180). Our group has previously shown that during bacteremic episodes, pneumococci in the bloodstream can translocate into the heart and kill cardiomyocytes resulting in acute heart dysfunction (5). Invasive pneumococcal disease can also lead to renal complications in pediatric patients and hospitalized adults (186).

Current efforts to reduce the impact of pneumococcus on human health include improving the efficacy of vaccines and identifying host-directed therapies to decrease morbidity and mortality. Over the past 20 years, several studies have examined pneumococcal gene expression and virulence determinant requirements in vivo, with consideration for anatomical-site specific differences. One approach was the use of signature tagged mutagenesis to identify mutants that are unable to replicate in vivo in an anatomical site-specific manner (187). Accompanying studies using microarrays to examine in vivo isolated RNA from the bacteria indicated that Spn gene expression varies according to the host site (188). Other RNA-seq studies have also focused on the host response within the lungs upon pneumococcal infection. These have identified key genes involved in neutrophil recruitment and response to diverse strains and mutants (119, 189). Other forms of sequencing-based studies such as transposon sequencing (Tn-seq) have also been performed on the pneumococcus in multiple contexts, such as in vivo survival or transmission, in multiple hosts (112, 123, 190). These studies, some of which are discussed below in the context of our results, offer an important consideration when identifying vaccine candidates, for which the ideal targets would be universally expressed, essential genes. However, it is important to note that prior studies on the in vivo transcriptome of Spn have employed only a single strain of Spn and used comparisons to in vitro conditions for purposes of normalization, whereas the contrast between in vitro and in vivo transcriptomes reported here suggests a more physiological method of normalization. Lastly, we build and expand upon prior studies to provide information on the coordination and regulation of host pathways responsible for restricting pneumococcal infection at different anatomical sites, a process that remains a largely open question.

Here we explored the transcriptome of the pneumococcus and its host using dual species RNA-seq at diverse host anatomical sites in vivo. Blood, kidneys, lungs, and nasopharynx were harvested from mice colonized or infected with 3 different Spn strains, with uninfected host tissue used as control. These data provide a comprehensive view of dynamic processes impacting bacterial and host biology as the bacterium invades distinct anatomical sites.

D. Results

1. Pneumococcal burden varies in a strain- and site-specific manner within the mouse

To avoid strain bias, we used 3 Spn strains each belonging to a distinct capsular serotype and multilocus sequence type: TIGR4 (serotype 4, ST205), D39 (serotype 2, ST595), and 6A-10 (serotype 6A, ST460). To capture the host and bacterial transcriptomes during infection, mice were challenged with each of these strains and tissue/organs collected from the nasopharynx, lungs, blood, and kidneys (the latter lacking 6A-10 samples). Heart data (TIGR4 only) were included from our previously published study (5).

Figure 3.1 Rarefaction curves and estimation of pneumococcal burden

A. Saturation curves of pneumococcus infected samples. Samples whose curves plateau harbor enough reads mapped across all 1,682 core S. pneumoniae (Spn) genes to achieve saturation. B. Rarefaction curves of infected and uninfected host samples. C. Log10 values of reads mapped (Supplemental Table 3.1) to the pneumococcus infected samples (top), host samples (middle), and relative estimate of pneumococcal burden within each sample (bottom). Horizontal black lines (top & middle) indicate the minimum number of reads required for samples to plateau.

A typical obstacle when performing multispecies RNA-seq is uneven sequence amounts obtained for each species examined. Deeper sequencing is often needed for the minority species, in our case Spn. To attain confidence in the coverage obtained for Spn genes, we generated saturation curves to determine the minimum requirement for reads mapping to the respective Spn and mouse genomes (Figures 3.1A & 3.1B). Samples whose curves plateaued or approached plateauing were deemed adequate. The minimum required number of mapped Spn reads per sample was estimated at ~150,000 reads. Saturation was achieved for all samples except D39-infected kidneys and D39-colonized nasopharynxes that were excluded from all bacterial analyses.
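One way to build a saturation curve of this kind is to subsample mapped reads at increasing depths and count how many genes are detected at each depth. The sketch below is a generic illustration under that assumption; the function name, the input layout (one gene identifier per mapped read), and the detection threshold are hypothetical and do not describe the procedure actually used to produce Figure 3.1.

```python
import random


def saturation_curve(read_genes, depths, min_reads=1, seed=0):
    """Count genes detected at each subsampling depth.

    read_genes: list with one gene identifier per mapped read.
    depths: iterable of read counts to subsample (without replacement).
    min_reads: reads required to call a gene detected (assumed threshold).
    Returns a list of (depth, genes_detected) tuples.
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        sample = rng.sample(read_genes, min(depth, len(read_genes)))
        counts = {}
        for gene in sample:
            counts[gene] = counts.get(gene, 0) + 1
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((depth, detected))
    return curve


# Toy data: 5,000 reads spread evenly over 50 genes.
toy_reads = [f"gene_{i % 50}" for i in range(5000)]
curve = saturation_curve(toy_reads, depths=[10, 100, 1000, 5000])
```

A sample is considered saturated when the number of detected genes stops increasing as depth grows, which is the plateau referred to in the text.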

We estimated the ratios (Figure 3.1C bottom) of the number of Spn reads (Figure 3.1C top) compared to total sequenced mouse reads (Supplemental Table 3.1) at each site. Although these ratios should not be directly compared across multiple anatomical sites, and they depend not only on the bacterial burden but also on the transcriptional activity of Spn (and mouse cells) at each site, the ratios suggest that Spn infection in the lungs is highest with D39, followed by 6A-10 and TIGR4. On the other hand, TIGR4 appears to dominate over the other strains in the nasopharynx, followed by 6A-10, whereas D39 was more successful in thriving in the blood compared to TIGR4 and 6A-10 during bacteremia. Bacterial read proportions for these strains are consistent with their known virulence phenotypes in mice (50). No conclusions could be drawn within the hearts as our previous study only included the TIGR4 strain. Similarly, for kidneys we lacked 6A-10 samples, but TIGR4 appeared to perform better than D39 at this site. A complete overview of sequenced and mapped reads is provided in Supplemental Table 3.1. To determine if actual bacterial burdens varied across anatomical sites, we performed colony forming unit (CFU) counts for TIGR4 infected organs; however, no significant differences were found (Supplemental Table 3.1).

2. The in vivo pneumococcal and host transcriptomes demonstrate differences in gene expression profiles during colonization and infection

To assess the spectrum of core Spn gene expression responses at each anatomical site across our samples, and to verify that replicate samples were similar to each other, we performed principal component analyses (PCA) on PanOCT core gene orthologs (168) across the three strains. The bacterial PCA (Figure 3.2A) was based on DESeq2’s variance stabilized counts of 1,682 core Spn genes (Supplemental Table 3.2, which also includes strain-specific Spn and mouse genes). Spn typically colonizes the nasopharynx without harming the host (15). We observed a distinct gene expression profile in nasopharyngeal samples, which are separated by PC1 away from bacterial profiles seen in the remaining infected organs (Figure 3.2A). This pattern revealed that Spn engages in a different expression program during colonization vs. disease. This grouping of samples from infections also indicated that there exist core Spn genes whose expression reflects an infection state at all diseased host anatomical sites.

Figure 3.2 Pneumococcal principal component analysis (PCA) and dendrograms

A. PCA of S. pneumoniae (Spn) gene expression profiles (Supplemental Table 3.2) across in vivo colonized or infected samples. B. Bootstrapped dendrogram of Spn gene expression profiles showing replicate clustering of in vivo colonized or infected samples. C. PCA of panel A samples with the addition of in vitro grown planktonic and biofilm TIGR4 samples from our previous study (Shenoy et al. 2017). D. Bootstrapped dendrogram with the addition of in vitro samples, again showing replicate clustering.

We observed that PC2 separated the TIGR4-infected samples away from D39 and 6A-10 (Figure 3.2A). This indicated that the strains differ in their gene expression profiles, despite our focus on core genes. There was limited organ-based separation among the infected organs on the PCA because PC2 was more influenced by the TIGR4 variation. A bootstrapped dendrogram of all samples (Figure 3.2B) showed this effect. Samples clustered by organ type among the D39 and 6A-10 strains, but TIGR4 samples clustered independently. This finding most likely reflects the fact that D39 and 6A-10 are slightly more genetically similar to each other (Figure 3.3).

Figure 3.3 Pneumococcal strains D39 and 6A-10 are more genetically similar when compared to TIGR4

A phylogenetic tree based on complete genome sequences of all 3 pneumococcal strains shows a relatively smaller genetic distance between strains D39 and 6A-10. This potentially explains their more similar in vivo transcriptomic profiles relative to TIGR4.

To estimate the similarity between in vivo Spn expression profiles and representative in vitro datasets, we performed a PCA on the 1,682 core genes including Spn planktonic and biofilm in vitro growth datasets (5) (Figure 3.2C). Spn grows planktonically within the blood and TIGR4 has been demonstrated to form biofilms within the heart, yet these samples did not cluster closely with in vitro planktonic and biofilm samples, respectively, on the PCA. There was a clear separation of in vivo and in vitro states along PC1. These data suggest that in vitro Spn expression profiles are not representative of in vivo conditions. TIGR4-infected heart samples (5), which include biofilm pneumococci within cardiac microlesions, clustered more closely to the in vitro TIGR4 biofilm samples on a bootstrapped dendrogram (Figure 3.2D). However, in vitro samples are all clustered in a clade separated from in vivo samples. On this dendrogram, like on the PCA plot, nasopharyngeal samples still clustered independently from the other states, indicating that nasopharyngeal colonization elicits a unique Spn gene expression profile significantly different from their infection phenotypes, irrespective of host anatomical site.

Many previous studies have assayed the pneumococcal transcriptome in vitro. The D39 transcriptional signature was previously investigated under a panel of in vitro conditions designed to mimic anatomical sites (109). A comparison of the gene expression in the D39 in vitro study to in vivo conditions based on core genes in our study displayed a stark distinction between these data sets (Figure 3.4). The differentially expressed (DE) genes acquired in vitro also showed no significant overlap to those derived from in vivo conditions (Figure 3.5).
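The overlap between two DE gene lists, as summarized in the Venn diagrams of Figure 3.5, can be sketched as a simple set calculation. The gene identifiers below are placeholders, and the choice of the union of both lists as the denominator is an assumption made for this example; this section does not state which denominator was used for the reported percentages.

```python
def percent_shared(set_a, set_b):
    """Percentage of the union of two DE gene sets that appears in both.

    Assumption: the denominator is the union of both sets.
    """
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical DE gene lists (placeholder identifiers, not real results).
in_vivo_de = {"g1", "g2", "g3", "g4"}
in_vitro_de = {"g3", "g4", "g5", "g6", "g7"}

overlap = percent_shared(in_vivo_de, in_vitro_de)  # 2 shared out of 7 total
```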

Figure 3.4 Principal component analysis (PCA) of in vivo and in vitro representative datasets

A core Spn gene expression profile PCA of our dataset combined with in vitro data from Shenoy, Brissac et al. (2017) and a D39 study of in vitro conditions mimicking in vivo conditions (Aprianto, Slager et al. 2018). The PCA shows a clear separation between in vitro and in vivo datasets.

Figure 3.5 Venn diagrams of pneumococcal differentially expressed (DE) genes detected in vivo and in vitro in mimicking conditions

Only 14% of Spn DE genes are shared between in vivo blood vs nasopharynx (our study) and in vitro blood mimicking conditions (BMC) vs nasopharynx mimicking conditions (NMC) (Aprianto, Slager et al. 2018). Similarly, only 4.8% of Spn DE genes are shared relative to lung mimicking conditions (LMC).

Figure 3.6 Host principal component analysis (PCA) and dendrogram

A. PCA of host gene expression profiles (Supplemental Table 3.2) across samples colonized or infected with Spn, and uninfected. B. A three dimensional PCA shows complete separation between anatomical site clusters, with further separation of heart and kidney samples (PC3), as well as infected and uninfected invaded organs. C. Bootstrapped dendrogram of host gene expression profiles showing replicate clustering of host samples.

We then generated a PCA for expression profiles of all host genes (Figure 3.6A). Host profiles revealed sample clustering based on anatomical site. This was not surprising given that gene programs reflecting specific organ functions would likely supersede genes involved in the response to infection. PC1 separated the blood away from the solid tissue types, and PC2 clearly separated out the nasopharynx as a colonization state away from infected anatomical sites. The lungs, hearts and kidneys appeared to cluster more closely together. However, the addition of PC3 showed clear separation based on anatomical site (Figure 3.6B). Apart from the nasopharynx, each site further formed sub-clusters of uninfected and infected strain-specific samples. The bootstrapped dendrogram of host samples corroborated the differences between organ types and strains, as well as the differences between infected and uninfected states for nearly all anatomical sites (Figure 3.6C). Based on shorter distances between uncolonized and colonized states of the nasopharynx, there appeared to be fewer differences between host gene expression, relative to other anatomical sites, further demonstrating the host’s tolerance to pneumococcal colonization relative to disease states.

3. The pneumococcus and its host differentially regulate unique gene subsets in a site-specific manner

To determine the major Spn responses to each host site, and each host site’s response to Spn, we queried differentially expressed (DE) genes relative to their respective controls. Statistically significant DE Spn genes, accounting for strain differences, were used to identify key differences between disease states and the colonization state used as baseline. Figure 3.7A highlights 69 Spn DE genes shared among all disease states when compared to nasopharyngeal colonization. Of these, all but ribosomal protein L19 (SP_1293) were either upregulated or downregulated in the same direction across all disease states (Figure 3.7B). Other comparisons revealed 376 DE genes in the blood versus the nasopharynx, 214 DE genes in the heart, and 231 and 390 DE genes in the lungs and kidneys, respectively (Figure 3.8). Most genes not shared between the three Spn strains did not exhibit differential gene expression in disease-associated organs versus the nasopharynx. Some interesting exceptions include the increased expression of adhesin choline binding protein I in strain TIGR4 (SP_0069) and bacteriocin-containing operons in strain 6A-10 (HKM25_525-HKM25_529). These were upregulated in the nasopharynx versus disease sites, supporting the requirement for attachment and intraspecies competition, respectively. The comprehensive list of Spn DE genes for each strain-infected organ versus the nasopharynx is available in Supplemental Table 3.3.
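The counts of DE genes shared among, or unique to, the comparisons above are the kind of quantity an Upset plot displays. A minimal sketch of that bookkeeping follows; the function name `exclusive_intersections` and the gene identifiers are hypothetical, and the toy sets stand in for the real DE lists in the supplemental tables.

```python
from itertools import combinations


def exclusive_intersections(named_sets):
    """Map each combination of set names to items found in exactly those sets."""
    names = sorted(named_sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Items present in every set of this combination...
            inside = set.intersection(*(set(named_sets[n]) for n in combo))
            # ...and absent from every set outside the combination.
            outside = set().union(
                *(set(named_sets[n]) for n in names if n not in combo)
            )
            members = inside - outside
            if members:
                result[combo] = members
    return result


# Placeholder DE gene sets per comparison versus the nasopharynx.
toy_de = {
    "blood": {"a", "b", "c"},
    "heart": {"a", "b", "d"},
    "lungs": {"a", "e"},
}
groups = exclusive_intersections(toy_de)
shared_by_all = groups[("blood", "heart", "lungs")]
```

Here `shared_by_all` plays the role of the 69-gene subset: genes that are DE in every disease comparison.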

To evaluate the host response to Spn infection, we performed similar host DE gene analyses using respective uninfected tissues as baseline. The use of multiple strains allowed for the identification of a shared host infection response to Spn at each anatomical site (Figure 3.7C). The 190 mouse genes that were DE upon invasive infection across 4 sites followed the same upregulated or downregulated trend across all comparisons (Figure 3.7D). Other mouse gene sets of interest include 517 DE genes in infected blood, 622 DE genes in infected hearts, 414 DE genes in infected lungs, and 385 DE genes in infected kidneys (Figure 3.8). Nasopharyngeal colonized samples harbored only a few DE genes (Figure 3.9). Lists of all Spn and mouse DE genes are provided in Supplemental Tables 3.4 & 3.5, respectively.

Figure 3.7 Differentially expressed (DE) genes detected in the pneumococcus and host

A. Upset plot of Spn DE genes (Supplemental Table 3.4). The individual or connected dots represent the various intersections of genes that were either unique to, or shared among comparisons, similar to a Venn diagram. Vertical bars represent the number of DE genes unique to specific intersections. Horizontal bars represent the number of DE genes for each specific comparison. B. Z-scored heatmap of expression levels of 69 Spn DE genes shared across all comparisons (green vertical bar from panel A). C. Upset plot of host DE genes (Supplemental Table 3.5). Horizontal gold bars represent the total number of host DE genes for that comparison. Colored vertical bars represent intersections of interest, i.e. specific to the infection of heart, blood, lungs, and an overall infection response in organs susceptible to invasive disease (orange, red, blue and green vertical bars respectively). D. Z-scored heatmap of expression levels of 190 host DE genes shared across all comparisons except the nasopharynx, i.e. shared across disease anatomical sites (green vertical bar from panel C).

Figure 3.8 Differentially expressed (DE) genes of the pneumococcus and the host

Z-scored heatmaps of expression levels of DE genes from host and pneumococcal Upset plots (Figure 3.7). A. Spn DE genes from Figure 3.7A (horizontal colored bars) in the blood, heart, kidneys, and lungs (left to right, respectively). B. Host DE genes unique to each organ site from Figure 3.7C (vertical colored bars) in the blood, heart, kidneys, and lungs (left to right, respectively).

Figure 3.9 Differentially expressed (DE) genes of the host in the nasopharynx

Z-scored heatmaps of expression levels of DE genes from host nasopharyngeal samples. Relatively fewer genes were shared across all colonized host samples than across diseased samples. Therefore, unlike Figure 3.8, which shows organ-specific subsets of genes that were DE across all 3 pneumococcal strains (TIGR4, D39, 6A-10), the heatmaps presented here encompass all host DE genes for each pneumococcal strain. A. Heatmap of D39 colonized nasopharynx. B. Heatmap of TIGR4 colonized nasopharynx. C. Heatmap of 6A-10 colonized nasopharynx.

To confirm RNA-seq DE gene results, we performed a cross-platform validation of selected DE genes, 10 pneumococcal and 11 mouse genes, using qRT-PCR. A subset of RNA samples that were subjected to RNA-seq, as well as additional infected heart samples, were tested. Results revealed strong correlations with significant p-values between the two platforms (Figures 3.10 and 3.11).

Figure 3.10 Log2 Fold Change (LFC) correlation plots of RNA-seq LFC vs qRT-PCR ∆∆Ct (equivalent to LFC) in pneumococcal infected lung, heart and nasopharynx

Up- and downregulated genes were selected from the 69 genes differentially expressed at all sites relative to the nasopharynx (Figure 3.7B). qRT-PCR was performed on (i) the same RNA samples that were sequenced as validation of the RNA-seq measurements, and (ii) additional TIGR4 and D39 infected hearts that were not subjected to RNA-seq (note that these RNAs were depleted after completion of this experiment). Strong correlations with significant p-values are observed for the 10 genes tested. PspA primers were not specific in 6A-10 and are hence not shown in these plots. Green shadings indicate one standard error from the linear regression. A. RNA-seq LFC values calculated using core pneumococcal genes shared across 3 strains (TIGR4, D39, 6A-10). B. Strain-specific RNA-seq values.
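The cross-platform comparison above amounts to correlating two sets of log2 fold changes, one per platform, for the same genes. The following sketch shows a Pearson correlation and a least-squares line in plain Python. The function names and the fold-change values are invented for illustration and are not measurements from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical log2 fold changes for five genes on each platform.
rnaseq_lfc = [2.0, -1.0, 3.0, -2.5, 0.5]
qpcr_lfc = [1.8, -1.2, 3.3, -2.0, 0.4]

r = pearson_r(rnaseq_lfc, qpcr_lfc)
slope, intercept = fit_line(rnaseq_lfc, qpcr_lfc)
```

A correlation close to 1 with a slope near 1 would indicate that the two platforms agree in both direction and magnitude of change. Significance testing, as reported in the figures, is not covered by this sketch.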

Figure 3.11 Log2 Fold Change (LFC) correlation plots of RNA-seq LFC vs qRT-PCR ∆∆Ct (equivalent to LFC) in the mouse lung

Up- and downregulated genes were selected from the 190 genes differentially expressed at all infected sites relative to uninfected sites, excluding the nasopharynx (Figure 3.7D). qRT-PCR was performed on the same RNA samples that were sequenced as validation of the RNA-seq measurements. Strong correlations with significant p-values are observed for the 11 genes tested for each individual infecting pneumococcal strain. Green shadings indicate one standard error from the linear regression.

4. Key pneumococcal genes expressed during colonization are distinct from those expressed during infection

To identify all potentially relevant DE pathways for our datasets, we used complete DE gene lists (FDR ≤ 0.05) for pathway enrichment analyses. KEGG pathway analyses (191) of Spn DE genes identified several pathways unique to each comparison (Supplemental Table 3.6), two of which were shared across all disease anatomical sites relative to the nasopharynx: the ascorbate and aldarate metabolism pathway and the pentose and glucuronate interconversions pathway. Unsurprisingly, most of the genes in these 2 pathways were present in the 69 DE gene subset shared across disease states. Three genes are shared between these two pathways: SP_2033, SP_2034, and SP_2035. These genes are part of the ula operon that metabolizes ascorbate and aldarate (SP_2031 – SP_2038). An overview of the uptake and conversion of L-ascorbate (L-ascorbic acid), also known as vitamin C, by Spn is shown in Figure 3.12A. L-ascorbate is transported into the bacteria using a phosphotransferase system (PTS) (SP_2036, SP_2037, and SP_2038), where it is ultimately converted into cytidine diphosphate (CDP)-ribitol by other genes within the operon together with genes SP_1983, SP_1270, and SP_1271. Of all the genes within this pathway, only SP_1983 was not DE above cutoff. CDP-ribitol is essential for bacterial cell wall synthesis (192) and therefore, we hypothesize that interruption of the uptake of L-ascorbate could affect nasopharyngeal colonization.

We also interrogated genes highly expressed by Spn at all anatomical sites, as well as those relevant to disease or colonization. Of the 100 most highly expressed Spn genes at each site (Supplemental Table 3.6), 52 were shared across all sites (Figure 3.12B & C). While many of these shared genes were housekeeping genes (cell division, translation, etc.), two were previously studied potential vaccine candidates, lactoferricin-resistance associated PspA and zinc metalloprotease ZmpB (178, 179), and another was an oligopeptide transporter operon (amiACDEF) shown to have relevance to nasopharyngeal colonization (193). The two vaccine candidates and genes amiA, amiC and amiD were also shown to be essential for TIGR4 lung survival in vivo in the signature tagged mutagenesis study (187). The identification of these housekeeping genes, virulence factors, and operon among the highest expressed genes suggests that the pneumococcus invests significant resources to produce these proteins at all anatomical sites.
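The selection of genes that rank among the most highly expressed at every site can be sketched as a per-site ranking followed by a set intersection. In the example below the gene names, expression values, site labels, and the cutoff of 2 are all placeholders; the analysis in the text used the top 100 genes at each site.

```python
def top_n_genes(expression, n):
    """Return the n genes with the highest expression values."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    return set(ranked[:n])


def shared_top_genes(expression_by_site, n):
    """Genes that are in the top n at every site."""
    top_sets = [top_n_genes(expr, n) for expr in expression_by_site.values()]
    return set.intersection(*top_sets)


# Hypothetical normalized expression values per site.
toy_expression = {
    "nasopharynx": {"geneA": 900, "geneB": 800, "geneC": 700, "geneD": 10},
    "blood": {"geneA": 950, "geneB": 600, "geneC": 5, "geneD": 20},
    "lungs": {"geneA": 400, "geneB": 500, "geneC": 3, "geneD": 100},
}

always_high = shared_top_genes(toy_expression, n=2)
```

In this toy data `geneC` is highly expressed only in the nasopharynx, so it is excluded from `always_high` and would instead be a candidate site-specific gene.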

Figure 3.12 Pneumococcal and host genes of particular interest

A. Schematic view of two potentially interlinked pneumococcal KEGG pathways (Supplemental Table 3.6): ascorbate & aldarate metabolism, and pentose & glucuronate interconversions. *SP_1983 was the only gene not differentially expressed (DE) relative to the nasopharynx. B. Upset plot (see Figure 3.7A) of the 100 most highly expressed Spn genes (Supplemental Table 3.6) at each anatomical site (on average). C. Absolute gene expression level heatmap of the 52 Spn genes (Supplemental Table 3.6) that are highly expressed across all sites (black vertical bar from panel B). D. Ingenuity Pathway Analysis Activation Z-score heatmaps (average Z-score < -3 or > 3) of canonical pathways in host DE genes (Supplemental Table 3.7) present in all diseased anatomical sites. Nasopharyngeal pathways are not shown as too few host DE genes were observed (see Figure 3.7C).

5. Host gene signatures reveal induction of the interferon pathway during invasive disease

For host datasets, we used Ingenuity Pathway Analysis (IPA) to study all differentially expressed (DE) gene lists (FDR ≤ 0.05, Log2FoldChange < -2 or > 2). Key differences between host responses during colonization compared to invasive disease were observed. Only ~200 DE genes were observed overall during nasopharyngeal colonization compared to ~6000 DE genes on average in invaded tissues (Figure 3.7C). A minimal response to strain D39 was observed within the nasopharynx, but a robust response from a modest number of genes was mounted with strains TIGR4 and 6A-10. Both strains induced a similar pattern of induction of pro-inflammatory cytokines considered part of the acute phase response, including C-reactive protein, tumor necrosis factor, and interleukin-1. Their expression signature also included genes indicative of infiltration by neutrophils, and possibly other leukocytes, such as expression of chemoattractant receptors found on myeloid cells, including the formyl peptide receptors FPR1 and FPR2, the LTB4 receptor LTB4R, and CXCR2. This finding is consistent with previous reports showing that nasopharyngeal colonization can lead to recruitment of neutrophils (194, 195). These results indicate that nasopharyngeal colonization can induce subclinical inflammation in the nasopharynx, and that bacterial strains differ in the extent to which they alter host biology during colonization.
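The DE gene cutoff stated for the host analysis (FDR ≤ 0.05, Log2FoldChange < -2 or > 2) can be expressed as a small filter. The function name `is_de`, the record layout, and the example values below are hypothetical; only the two thresholds come from the text.

```python
def is_de(fdr, log2_fold_change, fdr_cutoff=0.05, lfc_cutoff=2.0):
    """True if a gene passes both the FDR and the absolute LFC threshold."""
    return fdr <= fdr_cutoff and abs(log2_fold_change) > lfc_cutoff


# Hypothetical result rows: (gene, FDR, log2 fold change).
toy_results = [
    ("geneA", 0.001, 3.2),   # passes both thresholds
    ("geneB", 0.20, 4.0),    # fails on FDR
    ("geneC", 0.01, 1.5),    # fails on fold change
    ("geneD", 0.05, -2.5),   # passes (FDR cutoff is inclusive)
]

de_genes = [gene for gene, fdr, lfc in toy_results if is_de(fdr, lfc)]
```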

As in nasopharyngeal colonization, infection with Spn led to the expected induction of host NF-κB-dependent pro-inflammatory cytokines typical of the acute phase response; however, both the magnitude of induction and the number of transcripts associated with this response were much greater in infected tissues. In addition, during invasive infection, all tissues (lungs, heart, and kidneys) showed enrichment for genes and pathways associated with leukocyte-specific expression, likely indicative of an inflammatory infiltrate. To obtain an overview of the differences in host response during infection compared to nasopharyngeal colonization common to all infected tissues, we focused on the subset of 190 genes induced during infection in blood and all tissues but not during nasopharyngeal colonization (green bar, Figure 3.7C). This list showed significant enrichment of a number of interferon-stimulated genes, genes usually induced following innate immune sensing of viral infection or other abnormal nucleic acids (196), together with typical transcription factors and cytokines associated with the interferon response identified as upstream regulators (Supplemental Table 3.5). Examination of pathways enriched in the 190 gene set or in the organ-specific DE genes (Supplemental Table 3.7) showed enrichment of pathways associated with the interferon response: interferon signaling, role of pattern recognition receptors in recognition of bacteria and viruses, and activation of interferon regulatory factors by cytosolic pattern recognition receptors (Figure 3.12D). Some organ-specific patterns in expression changes during Spn infection were also observed: Spn infection led to enrichment of coagulation-associated genes in the blood, and pathways associated with cytochrome p450-mediated biosynthesis and degradation were more altered in the lung than in other tissues (Supplemental Table 3.7).

6. Deletion of highly expressed pneumococcal genes lowers burden at corresponding anatomical site

To confirm the importance of Spn genes identified as highly expressed (Figure 3.12B, Supplemental Table 3.6) or differentially expressed (Figure 3.7A, Supplemental Table 3.4) in an anatomical site-specific manner (Supplemental Table 3.6), we tested the virulence of their respective isogenic mutants versus wildtype control using a competitive index assay. Three out of four mutants in genes highly expressed only during asymptomatic colonization (ula operon, SP_0368, and SP_1675) were attenuated versus control on day 5 of colonization, each exhibiting an approximate 10-fold reduction in recoverable bacteria from nasal lavage samples. In contrast, mutation of SP_2150 (arcC) had no significant effect, although a general downward trend was observed (Figure 3.13A). The ula operon is responsible for the import and degradation of ascorbic acid (197), SP_0368 encodes an O-glycosidase that may contribute to the scavenging of carbohydrates from host glycoconjugates (198), and SP_1675 encodes a kinase (ROK) family member (199). The latter are commonly involved in the regulation of sugar utilization networks (200). arcC is a member of the arginine deiminase operon, which among other functions allows the pneumococcus to use arginine as a carbon source (201). Mutants deficient in genes identified as highly expressed during pneumonia and invasive disease (SP_0664, SP_1647, SP_1648 and SP_1891) were in turn found to be starkly impaired in their ability to cause bacteremia following pneumonia (Figure 3.13B). These genes encode ZmpB, endopeptidase PepO, manganese transporter PsaB, and AmiA. Our results with these mutants agree with studies that have previously linked these proteins to Spn pathogenicity (18, 193, 202, 203). Importantly, experiments performed with this panel of mutants in vitro found that only PsaB and AmiA were essential for growth in media; thus, the requirement for the remainder of the proteins was specific to the in vivo condition (Figure 3.13C).
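The competitive index used in these experiments is the ratio of mutant CFU to wildtype CFU recovered from the same animal. A minimal sketch of that calculation follows. The CFU counts are invented, and summarizing the per-mouse indices with a geometric mean is an assumption made for this example rather than a step described in the text.

```python
import math


def competitive_index(mutant_cfu, wt_cfu):
    """Ratio of mutant to wildtype CFU recovered from the same sample."""
    if wt_cfu <= 0:
        raise ValueError("wildtype CFU must be positive")
    return mutant_cfu / wt_cfu


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical paired counts per mouse: (mutant CFU, wildtype CFU).
toy_pairs = [(1.0e3, 1.0e4), (2.0e3, 2.0e4), (5.0e2, 5.0e3)]

indices = [competitive_index(m, w) for m, w in toy_pairs]
summary = geometric_mean(indices)
```

An index near 1 means the mutant competes as well as wildtype, while an index near 0.1 corresponds to the approximately 10-fold reduction described above.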

Figure 3.13 In vivo competition experiments between pneumococcal strain TIGR4 and isogenic mutants in highly expressed genes

Mutants were tested using a competitive index assay against wildtype (WT) TIGR4. Paired WT and mutant samples are denoted by dots with a connecting dashed line. A. Bacterial titers in nasal lavage fluid of colonized mice 5 days post-inoculation with WT TIGR4 and Spn deficient in genes highly expressed in the nasopharynx. B. Bacterial titers in the blood of mice 1 day after intratracheal challenge with WT TIGR4 and Spn deficient in genes highly expressed during invasive disease. C. Mutants tested in panels A and B were tested for overall fitness in Todd-Hewitt broth with 0.5% Yeast Extract (THY). Competitive index values were calculated by dividing the numbers of CFU for the isogenic mutants by those for WT. D. A double mutant, deficient in both ula and SP_1675, was tested versus TIGR4 WT in the nasopharynx model of colonization. E. Cross-infection experiments in which mutants in genes highly expressed in the nasopharynx (Δula and ΔSP_1675) were tested in the model of lower respiratory tract infection with bacteremia. F. Mutants in genes highly expressed in disease anatomical sites (∆pepO and ∆zmpB) were tested with the nasopharynx model of colonization. For panels A-F, mutants were compared to WT using a paired Student’s t-test (* = P<0.05; ** = P<0.01; *** = P<0.001; **** = P<0.0001).

Figure 3.14 Assessment of the interferon response and its contribution to mouse protection

A. IFNα and IFNβ levels in the serum, spleen and hearts of mice experiencing invasive pneumococcal disease. Each data point represents an individual mouse with statistical analysis performed using a two-tailed Student’s t-test (* = P<0.05; **** = P<0.0001). B. Mice were administered recombinant IFNβ (30,000 U) in saline or saline alone either intranasally (I.N.; 50 μL volume) or intraperitoneally (I.P.; 100 μL volume) 24 hours prior to challenge intratracheally (I.T.) with 5 x 10^6 CFU or I.P. with 1 x 10^5 CFU of TIGR4 in 100 μL of saline, respectively. Body condition score and Kaplan-Meier survival of mice treated with IFNβ versus control following challenge with TIGR4. Body condition score significance was tested using a repeated measures mixed effects model with Sidak’s multiple comparison post-test of IFNβ versus saline only within each infection group (** = P<0.01); and survival significance was assessed by Log-rank Mantel-Cox test (p-values indicated on the graph; n=10 mice per treatment group). C. Mice were administered 50 μg Poly(I:C) in 100 μL saline I.P. 24 hours prior to challenge I.P. with 1 x 10^5 CFU of TIGR4. Survival significance was assessed by Log-rank Mantel-Cox test (p-value indicated on the graph; n=9 mice per treatment group).

Considering that the genes identified as highly expressed during nasopharyngeal colonization are presumably involved in the scavenging of nutrients, we hypothesized that their combined deletion would have a cumulative effect on Spn growth in the nasopharynx. We observed that Spn can rely on the ula operon for growth with ascorbate as a sole carbon source (Figure 3.15), that a double mutant lacking the ula operon and SP_1675 was starkly attenuated during colonization in mice (Figure 3.13D), and that the double mutant had normal growth in rich medium in vitro (Figure 3.13C). Importantly, cross-infection, i.e. testing of the mutants necessary for colonization in the invasive disease model, and vice versa, revealed that the ula mutant was attenuated for invasive disease whereas the SP_1675 mutant was not (Figure 3.13E). Similarly, the pepO mutant was attenuated for colonization, but the zmpB mutant was not (Figure 3.13F). Thus, for most genes tested, a high level of transcription in the nasopharynx or disease site was indicative of their importance under the same condition. These findings help prioritize new candidate highly expressed Spn genes as targets for intervention.

Figure 3.15 The pneumococcus can rely on the ula operon for growth with ascorbate as a sole carbon source

TIGR4 and its isogenic mutant Δula were inoculated in Chemically Defined Medium (CDM) supplemented with glucose or ascorbic acid as the sole carbon source and allowed to grow for 6 hours. The optical density of bacterial cultures is shown. While both TIGR4 and Δula can grow on glucose at the same rate, Δula was unable to grow on ascorbic acid as a sole carbon source.

7. Interferon contributes to host defense in pneumococcal infection

In general, type I interferon is not considered to be a protective mechanism during pneumococcal infection, although a role in alveolar-capillary barrier defense has been described (204, 205). Given the observed enrichment in interferon-stimulated gene expression in infected vs. uninfected organs (Supplemental Table 3.5), we sought to confirm the production of interferon within tissues during invasive pneumococcal disease and determine its biological consequence. Notably, both IFNα and IFNβ were detected in the spleen of infected mice at levels higher than uninfected controls (Figure 3.14A), with IFNβ also elevated in serum and the heart of infected mice. One-time intranasal pretreatment of mice with IFNβ (isolated from mouse cells, PBL Assay Science) one day prior to intratracheal challenge with Spn TIGR4 reduced morbidity and mortality versus the mock-treated controls (Figure 3.14B). Next, we probed the role of interferon in protection against intraperitoneal infection. IFNβ pretreatment alone did not protect mice that were pretreated and challenged intraperitoneally with TIGR4 (Figure 3.14B). However, pretreatment with Poly(I:C), a potent inducer of both interferon and NFκB-dependent pro-inflammatory cytokines, did enhance survival of intraperitoneally infected mice (Figure 3.14C) (205). From these results we infer that the interferon response alone primarily confers protection of the airway. However, in combination with other activated host defense pathways, it may also confer protection during systemic infection.

E. Discussion

We report for the first time the shared in vivo pneumococcal and host transcriptome during colonization and invasive disease in relevant anatomical sites in mice. We identify relevant highly and differentially expressed bacterial and host genes, explore the difference between asymptomatic pneumococcal colonization and disease, and validate the importance of differentially regulated genes for in vivo survival of the bacteria and host, respectively.

We observed major differences between in vitro and in vivo gene expression results, supporting the use of the in vivo colonization state as the preferred baseline condition for calculating Spn DE genes rather than an in vitro control. Indeed, Spn transcriptional gene signatures investigated under in vitro conditions (5, 109) did not recapitulate our in vivo conditions (Figures 3.4 & 3.5). It is reasonable to assume that the differences between in vitro and in vivo conditions are in large part due to differences in nutrient availability, the complexity of processes required for sequestering nutrients in vivo, oxygen tension, shear forces, and other physical or physiological aspects of in vivo conditions. It is also likely that the in vitro conditions used in prior studies, like those we used for biofilm formation (5), were optimized for Spn growth, and as such were not accurately replicating the physiological environment of the host. The discordance between in vitro and in vivo studies further demonstrates the potential drawbacks of using results solely from in vitro studies to infer in vivo mechanisms.

Different Spn strains not only have varied transcriptomic signatures within the same organs in vivo (Figure 3.2A) but can also result in disparities of Spn activity at distinct sites (Figure 3.1C). This is recapitulated in the host responses to each strain as well (Figure 3.6). The pneumococcus and the host differentially expressed different numbers and types of genes at each anatomical site, as evidenced by our identification of DE gene sets commonly expressed across sites as well as other sets unique to each site, with only a few host genes DE in the nasopharynx (Figure 3.7A & 3.7C). In both of these cases, the Spn strain effect can be visualized (Figure 3.8). Genes that are DE follow similar trends (i.e. up- or downregulated), but to varied strain-dependent magnitudes.

We observed that among Spn genes DE only within the blood, multiple metal ion scavenging genes were upregulated. Ribosomal proteins were downregulated in the heart, possibly suggesting a slowdown of translation as the pneumococcus transitions into a biofilm state as previously described (5, 206). It is tempting to hypothesize that Spn may have evolved mechanisms enabling its transition to a low inflammation-triggering biofilm mode in certain organs. Overall, the pneumococcus is also able to adapt its feeding habits, a forte for Spn, to each anatomical site. We have shown the upregulation of L-ascorbate uptake and degradation in the nasopharynx; however, genes involved in galactose and lactose utilization were also observed. It is also interesting to note that mannose transporter genes were upregulated at all other sites, potentially implicating this feeding mechanism in fitness in disease states.

The identification of targetable highly expressed genes can be particularly useful, especially for the development of vaccines. We detected two previously described pneumococcal antigenic virulence factors, pspA and zmpB, in our highly expressed gene set. Studies have shown that zmpB is conserved across the pneumococcal proteome and is protective in mice (178). However, zmpB has also been shown to be present in closely related commensal streptococcal species colonizing the human host, thereby possibly limiting its vaccine potential (207). PspA is also present in most pneumococcal strains, albeit exhibiting different alleles (208), and this protein has also been demonstrated to be protective against pneumococcal disease (209). The fact that these are core proteins that are surface exposed, protective, and now found to be broadly highly expressed by the pneumococcus in vivo makes them potentially powerful candidates for the development of a serotype-independent pneumococcal vaccine.

Zafar et al. identified 375 Spn genes necessary for shedding and transmission using a Tn-seq approach (210). Of these, 72 genes were upregulated in the nasopharynx, suggesting that these genes could be key factors involved in Spn transmission. Like us, they identified that SP_0368 (O-glycosidase) was upregulated within the nasopharynx, and we demonstrated its importance for in vivo colonization using knockout mutants. The dlt locus (SP_2173-SP_2176) was also identified by Zafar et al. as one of the major determinants of transmission. Similar Tn-seq studies performed in the context of pneumococcal growth in human saliva have identified and validated that the amiACDEF operon, which we found highly expressed at all anatomical sites, was essential for survival (123). This operon was also essential in vivo in mice within the lungs and nasopharynx (190), correlating well with its high expression at all anatomical sites in our study. Other key genes we identified also correlate with survival genes identified by these Tn-seq studies. For example, 3 out of the 8 genes of the vitamin C operon, and SP_2150, which were upregulated in the nasopharynx relative to all other anatomical sites, were specifically essential in the nasopharynx but not in the lung. Similarly, SP_0062, which is part of a phosphotransferase system (PTS), was upregulated in the lung in our dataset and shown to be essential in the lung but not the nasopharynx (190).

Mutants lacking the dltB gene (SP_2175) were shown by Cooper et al. to be selected for during in vivo experimental evolution during colonization, probably because the authors removed the transmission bottleneck by performing multiple experimental passages (211). Cooper et al. hypothesized that the dlt locus was maintained across the pneumococcal population because it is required for transmission to new hosts despite increased adhesion and colonization when the locus was deleted. Along these lines, our results revealed that the dlt locus was not significantly DE across organs relative to the nasopharynx and all genes within the locus were robustly expressed at all sites. These studies suggest a complex regulatory program of the dlt locus during host-pathogen interactions.

Warrier et al. used data from a variety of RNA sequencing techniques focused on Spn strain TIGR4 to characterize the complex operon architecture of the pneumococcus (124). They identified 474 potential multigene operons with experimental support. They also identified a regulatory element, pyrR (SP_1278), essential for virulence and colonization. We found pyrR to be upregulated within the blood, kidneys, and heart relative to the nasopharynx, validating its role in virulence and infection. Our 2 operons (ula SP_2033-SP_2038 & SP_1266-SP_1271) upregulated in the nasopharynx (Figure 3.12A) were also detected in the Warrier et al. dataset. A microarray study involving strains of serotype 4 (WCH43) and serotype 6A (WCH16) also identified the ula operon as upregulated in the nasopharynx relative to the blood, from a total of 62 DE genes. They also showed that a ula knockout mutant had no significant effect on pneumococcal virulence (212). While we also showed that a ula knockout mutant is only mildly attenuated, our double knockout mutant adding the SP_1675 gene was significantly attenuated in colonization. The other operon consists of lic proteins and phosphorylcholine uptake genes (45) (SP_1266-SP_1271) known to be involved in teichoic acid synthesis (192). The link found between the two operons, suggesting activated teichoic acid and bacterial cell wall synthesis, and the resources invested by the pneumococcus for nasopharyngeal colonization, demonstrate the importance of these genes in the initial stage of interactions with the mammalian host.

Figure 3.16 Z-scored heatmap of host TLR 7, 9, 13, and chaperone protein Unc93b1

These host genes are upregulated under conditions of particular interest within the blood and lungs, as compared to the study by Famà et al. 2020.

It has previously been shown that type I interferon decreases expression of Platelet-Activating Factor receptor (PAFr), a pneumococcus binding receptor, and this inhibits pneumococcal transmigration across epithelial and endothelial cell barriers in the airway, preventing invasive disease (204). Our in vivo results revealed a strikingly consistent interferon-stimulated gene expression signature during acute invasive infection. Enhanced survival of intratracheally challenged, IFNβ-pretreated mice following the onset of disease provides further evidence for this protective effect. Several previous studies have noted induction of interferon-stimulated genes during pneumococcal infection and have proposed two main models as a cause. Firstly, the STING/cGAS cytosolic DNA sensing system has been shown to recognize Spn nucleic acids (213). Signaling from well-known bacterial recognition receptors such as Toll-like receptors (TLRs) also leads to activation of both NFκB and interferon regulatory factors (IRFs) that mediate the interferon response. Famà et al. demonstrated the effect of nucleic acid sensing TLRs upon pneumococcal infection in mice (214). TLR7, TLR9, TLR13 and chaperone protein UNC93B1, responsible for TLR trafficking out of the endoplasmic reticulum, were shown to control pneumococcal infection as knockout mice lacking these genes were moderately to highly susceptible. Examining the levels of these genes within our dataset shows their upregulation at different sites, mainly the blood and lungs (Figure 3.16), although not all passed the DE FDR cutoff of ≤0.05 (Supplemental Table 3.5).

Secondly, however, leakage of mitochondrial DNA from damaged mitochondria in stressed or dying cells is also a potential ligand for STING/cGAS (215-217). Since Spn invasion causes cell stress and death, DNA from damaged mitochondria or dying cells would be another potential trigger of the interferon response during invasion. While Skovbjerg et al. observed the upregulation of interferon-related genes in vitro in human monocytes exposed to intact and autolyzed Spn (218), they also demonstrated that it was the intact bacteria that were responsible for interferon stimulation, suggesting that exogenous pneumococcal DNA/RNA from autolyzed bacteria may not be the only stimulant to the interferon pathway. A subset of the host interferon-related genes from a different study (identified by Pattern B in Figure 1, Skovbjerg et al.) were also upregulated across all infected tissues in our study. Notably, we observed that pretreatment with IFNβ alone was not protective in mice challenged intraperitoneally (Figure 3.14C). This may be due to our use of a single treatment regimen, the fact that the bacteria in the peritoneum do not require invasion to gain access to the bloodstream (and intraperitoneal challenge is therefore considered to be a highly lethal infection model), or that the main target of IFNβ is instead the lungs. Yet, we observed that pretreatment with Poly(I:C) conferred protection in this infection model, suggesting that a mixed interferon and NFκB response has additional effects, most likely occurring at the myeloid cell level, that confer protection during systemic disease. In general, our findings corroborate the prevalence of the interferon response to Spn infection in mice and its importance in protecting against onset of lethal pneumococcal pneumonia.

Previous studies have shown interactions between the pneumococcus and host cell receptors, including PAFr, Polymeric Immunoglobulin Receptor (pIgR), and Laminin Receptor (Rspa), that facilitate pneumococcal invasion (219-221). We observed that PAFr and pIgR were upregulated in the lungs, blood, and heart as they encounter the pneumococcus. This could suggest that the pneumococcus initiates a feedback invasion mechanism, whereby host receptors hijacked for invasion are being upregulated, allowing for more invasion. However, for Rspa, changes were only observed for strain TIGR4, with downregulation in TIGR4 infected blood, and upregulation in TIGR4 infected kidneys. Another receptor, CD22, found in B-cells, was shown to be important for host resistance to the pneumococcus (222). It was consistently downregulated in all our infected blood samples but not at other sites, leaving the host susceptible to further infection, and possibly explaining the high titers of Spn we observed in the blood.

In summary, using multiple pneumococcal strains in our in vivo transcriptome dataset, as well as multiple anatomical sites within the host, allowed us to identify shared and strain/organ-specific responses to infection in the pneumococcus and in mice. The focus on core pneumococcal genes shared across the three genetically different strains tested makes it likely that other pneumococcal strains will exhibit similar responses to those identified here when they encounter the host.

Exploration of our key shared and organ specific host genes and pathways, and their human orthologs, will provide a robust rationale for designing follow-up studies in other mammalian models and in humans. We demonstrated that the mechanisms of sensing, adaptation, and response of Spn to its host and the host to different Spn isolates are complex, and strikingly different between colonization and disease states.

While we identified pneumococcal and mouse genes playing key roles in host-pathogen interactions, not all of these may be directly relevant in humans, and other important genes have likely been missed. A number of Spn proteins or components have been shown to be restricted to interactions with human cell moieties (223-228) and will not have been interrogated in our mouse models. Similarly, not all mouse responses are directly applicable to human responses and several human-specific responses were likely not observed in this study. Validation of the bacterial and host genes identified here in the human setting, or in models more closely recapitulating human systems, will be required. Despite these obvious caveats, we believe that the atlas of transcriptional responses during host-pathogen interactions presented here will constitute an essential resource for the pneumococcal and microbial pathogenesis research communities, and serve as a foundation for identification and validation of key host and pneumococcal therapeutic targets in future studies.

F. Materials and methods

All transcriptomics data generated as a part of this study have been deposited at the Gene Expression Omnibus repository under accession number: GSE150788. Additional information about Materials and Data availability, including genome accession numbers, bacterial strains and constructs, mice, reagents, and software packages, is available in Supplemental Table 3.8. Details on bacterial growth, mouse housing and RNA-seq procedures are provided in Appendix 1.

1. Collection of samples for RNA isolation

Mice were anesthetized using vaporized isoflurane at 2.5-3% during experimental challenge. For asymptomatic nasopharyngeal colonization, mice were inoculated with 10⁵ CFU of Spn in 10 µl saline with samples collected on day 7 (194). For pneumonia, mice were challenged with 10⁵ CFU of Spn in 100 µl saline by forced aspiration with lungs collected on day 2 (229). For invasive disease, mice were challenged intraperitoneally with 10⁴ CFU of Spn in 100 µl saline with blood, heart, and kidneys collected between 30-36h post-challenge. Data generated from TIGR4 infected blood and heart samples were from a previously published study by our group, with mice infected in the same fashion (5). Prior to tissue collection, mice were euthanized by inhalation of vaporized isoflurane (5% over oxygen), and death was confirmed by cervical dislocation, bilateral pneumothorax, or cardiac puncture according to approved protocols. For nasopharynx samples, euthanized mice were decapitated, exterior skin removed, brain and cerebral column excised, and the skull defleshed. The skull and oropharynx were subsequently crushed in RNAprotect Bacteria Reagent (Qiagen) and stored at -80°C. For blood, samples were obtained by exsanguination via retro-orbital bleed, mixed with RNAprotect Bacteria Reagent, and stored at -80°C. For lungs, heart, and kidneys, organs were excised, minced down to 1 mm³ fragments with a sterile razorblade, then rinsed with ice-cold PBS and gently pressed, repeating as necessary, until blood no longer effused from the tissue. Samples were then placed into RNAprotect Bacteria Reagent and stored at -80°C.

2. RNA isolation, library construction, sequencing, and transcriptomics data analyses

These procedures were performed as previously described (5), with minor modifications as detailed in Appendix 1. The numbers of reads generated for each sample are provided in Supplemental Table 3.1. Common and unique differentially expressed (DE) genes for both species were determined using gene list intersections and individual heatmaps of DE genes were generated based on Z-scores of Variance Stabilized Transformation counts (230) (Figures 3.7 & 3.12) (Supplemental Table 3.2). Pneumococcal and mouse DE genes are provided in Supplemental Tables 3.3, 3.4 & 3.5. See Supplemental Table 3.8 for package source/documentation & Data and Code Availability for additional parameters used in DE gene estimation or heatmap generation.
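The Z-scoring step behind the heatmaps can be illustrated with a minimal Python sketch. It is not the code used in this study (the analyses were run in R; see Data and Code Availability). The function name, the toy values, and the use of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev

def zscore_rows(vst_counts):
    """Scale each gene (row) to mean 0 and unit sample SD across samples.

    vst_counts maps a gene identifier to a list of VST-like values,
    one per sample. Genes with zero variance are returned as all zeros.
    """
    scaled = {}
    for gene, values in vst_counts.items():
        mu = mean(values)
        sd = stdev(values)  # sample SD (n - 1); an assumption of this sketch
        scaled[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return scaled

# Hypothetical values for two genes across four samples (not study data).
example = {
    "gene_a": [8.0, 8.0, 10.0, 10.0],
    "gene_b": [5.0, 5.0, 5.0, 5.0],
}
scaled = zscore_rows(example)
for gene, row in scaled.items():
    print(gene, [round(z, 3) for z in row])
```

Scaling per gene rather than per sample means that the heatmap colors show where a gene is relatively high or low across conditions, independent of its absolute expression level.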

3. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis

Bacterial DE gene lists were used to determine DE KEGG (191) pathways (with an adjusted p-value <0.05) (Supplemental Table 3.6) using R package KEGGprofile in RStudio (See Supplemental Table 3.8 for package source/documentation & Data and Code Availability for additional parameters).

4. Ingenuity Pathway Analysis (IPA)

Mouse DE gene lists were filtered using a Log2 Fold Change cutoff (LFC >2 or <-2) and uploaded into Ingenuity Pathway Analysis (231) (QIAGEN Inc., https://www.qiagenbioinformatics.com/products/ingenuitypathway-analysis). Core analyses were conducted on each DE gene list per Spn strain using default parameters. Canonical pathways were determined using the mouse database with default parameters, on mouse organ systems. Analyses were then compared across organ types to estimate core pathways, and across Spn strains for organ-specific responses (Figure 3.12). Canonical pathway data were exported (Supplemental Table 3.7) for use in RStudio. Heatmaps of canonical pathways were generated in RStudio for those meeting an average Z-score cutoff of >3 or <-3 (nasopharynx samples did not meet this cutoff) (See Data & Code Availability).
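The fold-change filter applied before upload can be sketched as follows. This is an illustrative Python sketch rather than the procedure actually used; the function name, the tuple layout, and the gene labels are hypothetical, and combining the LFC cutoff with the FDR cutoff of 0.05 in a single step is an assumption.

```python
def filter_by_lfc(de_genes, lfc_cutoff=2.0, fdr_cutoff=0.05):
    """Keep genes with |LFC| strictly above the cutoff and FDR at or below the cutoff.

    de_genes is an iterable of (gene, log2_fold_change, fdr) tuples.
    """
    kept = []
    for gene, lfc, fdr in de_genes:
        if fdr <= fdr_cutoff and abs(lfc) > lfc_cutoff:
            kept.append((gene, lfc, fdr))
    return kept

# Hypothetical rows (not study data).
rows = [
    ("Gene1", 3.1, 0.001),   # kept: strongly up
    ("Gene2", -2.4, 0.01),   # kept: strongly down
    ("Gene3", 1.5, 0.0001),  # dropped: |LFC| too small
    ("Gene4", 4.0, 0.2),     # dropped: FDR too high
    ("Gene5", 2.0, 0.01),    # dropped: cutoff is strict (>2, not >=2)
]
filtered = filter_by_lfc(rows)
print([gene for gene, _, _ in filtered])
```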

5. qRT-PCR cross-platform validation of RNA-seq

For qRT-PCR validation, we selected 10 pneumococcal genes and 11 mouse genes that were up- or downregulated from the 69 Spn shared DE genes at all sites vs. nasopharynx, and 190 host DE genes identified from all diseased sites. Control genes for Spn and host were selected based on high expression at all anatomical sites and low variation in expression (<0.5 standard deviation). Primers for Spn were designed using NCBI-Primer BLAST, and those for the mouse were obtained from PrimerBank (232-234). These primer pairs are listed in Supplemental Table 3.8. We used the Qiagen QuantiTect SYBR® Green RT-PCR Kit and generated correlation curves of qRT-PCR ∆∆Ct values (equivalent to Log2 Fold Change (LFC)) to RNA-seq LFC. These revealed strong correlations with significant p-values between the two platforms (Figures 3.10 and 3.11).
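The cross-platform comparison can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the analysis code of this study: the Ct values and LFC lists are invented, the helper names are hypothetical, and the sign convention (LFC taken as the negative of ∆∆Ct, following the common 2^(-∆∆Ct) formulation with roughly 100% amplification efficiency) is assumed because the text does not state it.

```python
from math import sqrt

def delta_delta_ct(ct_target_test, ct_control_test, ct_target_base, ct_control_base):
    """ddCt = (target - control) in the test condition minus the same in the baseline."""
    return (ct_target_test - ct_control_test) - (ct_target_base - ct_control_base)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical Ct values: the target amplifies 3 cycles earlier in the test
# condition once normalized to the control gene.
ddct = delta_delta_ct(22.0, 18.0, 25.0, 18.0)
qpcr_lfc = -ddct  # assumed sign convention: lower Ct means higher expression

# Hypothetical per-gene LFC values from the two platforms (not study data).
qpcr = [3.0, -1.0, 2.0, 0.5]
rnaseq = [2.8, -1.2, 2.1, 0.4]
r = pearson_r(qpcr, rnaseq)
print(ddct, qpcr_lfc, round(r, 3))
```

A correlation coefficient close to 1 across the selected genes is what agreement between the two platforms would look like in this framing.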

6. In vivo competition experiments and assessment of interferon-mediated protection

Experiments were performed using mouse models of asymptomatic nasopharyngeal colonization, lower respiratory tract infection with bacteremia, and intraperitoneal challenge with disseminated disease. For competition assays, logarithmic growth phase Spn TIGR4 strain and its isogenic mutant were mixed at a 1:1 ratio. In all instances, isoflurane-anesthetized C57BL/6 mice were inoculated either intranasally (10⁵ CFU in 10 µl), intratracheally (10⁵ CFU in 100 µl), or intraperitoneally (10⁴ CFU in 100 µl), with physical assessment using a body condition index (194), and nasal lavage and blood were collected for enumeration of pneumococci at designated time points.

Bacterial titers in the nasopharynx were determined on day 5 post-challenge. For competition experiments involving pneumonia with bacteremia, bacterial titers in the blood were determined 2 days post-challenge. For both nasal lavage samples and blood, the number of bacteria present in collected fluids was determined by serial dilution of the sample, plating on blood agar plates, and extrapolation from colony counts the next day. For in vitro experiments testing fitness, the competitive index of each mutant strain was calculated using the following equation: (mutant output CFU/WT output CFU)/(mutant input CFU/WT input CFU). Centrifuged spleen homogenates from mice challenged intraperitoneally with 10⁴ CFU of TIGR4, or from uninfected control mice, were used to determine the IFNα and IFNβ responses to invasive pneumococcal disease. For experiments with IFNβ, mice were administered 30,000 U of recombinant mouse IFNβ in 50 µl or 100 µl PBS by intranasal instillation or intraperitoneal injection, respectively, 1 day prior to challenge with TIGR4. For Poly(I:C) experiments, mice were administered 50 µg Poly(I:C) in 100 µl PBS by intraperitoneal injection 1 day prior to intraperitoneal challenge with TIGR4. In all experiments, control mice were administered PBS alone without recombinant protein, drug, or bacteria.
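The competitive index equation given above, together with the back-calculation of titers from plate counts, can be written as a short Python sketch. The colony counts, dilution factor, plated volume, and input titers are hypothetical values chosen for illustration, and the helper names are not taken from the study's code.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """Extrapolate the titer of the undiluted sample from one countable plate."""
    return colonies * dilution_factor / plated_volume_ml

def competitive_index(mutant_out, wt_out, mutant_in, wt_in):
    """(mutant output CFU / WT output CFU) / (mutant input CFU / WT input CFU)."""
    return (mutant_out / wt_out) / (mutant_in / wt_in)

# Hypothetical counts from a 1:1000 dilution with 0.1 ml plated (not study data).
wt_out = cfu_per_ml(colonies=50, dilution_factor=1000, plated_volume_ml=0.1)
mut_out = cfu_per_ml(colonies=5, dilution_factor=1000, plated_volume_ml=0.1)

# Inoculum mixed at a 1:1 ratio, as in the competition assays.
ci = competitive_index(mut_out, wt_out, mutant_in=5e4, wt_in=5e4)
print(f"competitive index = {ci:.2f}")
```

A competitive index below 1 indicates that the mutant was recovered less efficiently than the wild type relative to the inoculum, and a value near 1 indicates comparable fitness.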

7. Data and code availability

R code/scripts used to generate the figures and data presented are available on GitHub at https://github.com/admelloGithub/InVivoSpn

G. SI Appendix 1

1. Bacterial strains

Streptococcus pneumoniae (Spn) strains used in this study are listed in the Supplemental Dataset 8. Pneumococci were grown in Todd-Hewitt broth supplemented with 0.5% yeast extract (THY) or on tryptic soy blood agar plates at 37°C in 5% CO₂. Isogenic mutants were created by allelic exchange in strain TIGR4 (235). Mutagenic PCR constructs were generated by amplifying the upstream and downstream DNA fragments flanking the gene(s) of interest followed by the 5’ and 3’ integration of these fragments with the Sweet Janus Cassette using a HiFi assembly master mix (NEB). Transformation of TIGR4 with the mutagenic construct (100ng/ml) was induced using competence-stimulating peptide variant 2 (CSP-2) as described (236). Blood agar plates supplemented with kanamycin (300 mg/L) were used for selection. To create the double mutant, ∆ula/SP_1675, the Sweet Janus Cassette from the ∆ula mutant was replaced by a construct containing a clean deletion of the ula operon. In this instance streptomycin (200 mg/L) was used for selection.

2. Mice

Animal experiments were carried out using male and female mice 6 to 12 weeks of age. Wildtype C57BL/6 mice were supplied from The Jackson Labs (Ellsworth, Maine) and housed at the University of Alabama at Birmingham Animal Facilities for at least 1 week prior to their use. All animal experiments were performed using protocols approved by the University of Alabama at Birmingham IACUC (protocols #21152 and #21231).

3. RNA isolation

On the day of RNA extractions, samples were thawed and spun down to discard the supernatant. Pellets were then incubated in 100 μL of lysis buffer (10 μL of mutanolysin, 20 μL of proteinase K, 30 μL of lysozyme, 40 μL of TE buffer) for 10 minutes. Pellets were mechanically disrupted in 600 μL RLT buffer (RNeasy Mini Kit, Qiagen) containing 1% β-mercaptoethanol, using a motorized pestle for 30 seconds. Samples were then spun through a QIAshredder followed by addition of 590 μL of 80% ethanol, and RNA was captured on the RNeasy Mini Kit columns with DNase treatment on column (Qiagen protocol), followed by additional DNase treatment in solution. Extracted RNA was quantitated using a Bioanalyzer. Ribosomal RNA was depleted using the Ribo-Zero rRNA Removal Kits for Gram-positive bacteria and/or for human/mouse/rat (Illumina).

4. RNA library construction and sequencing

300 bp-insert RNA-seq Illumina libraries were constructed using 0.01 – 2.0 μg of enriched mRNA that was fragmented then used for synthesis of strand-specific cDNA using the NEBNext Ultra Directional RNA Library Prep Kit (NEB-E7420L). The cDNA was purified between enzymatic reactions and the size selection of the library performed with AMPure SpriSelect Beads (Beckman Coulter Genomics). The titer and size of the libraries were assessed on the LabChip GX (Perkin Elmer) and with the Library Quantification Kit (Kapa Biosciences). RNA-seq was conducted on 150 nt paired-end runs of the Illumina HiSeq 4000 platform using two or three biological replicates (organs from different mice) for each condition. The numbers of reads generated for each sample are provided in Supplemental Table 3.1.

5. Transcriptomics data analyses and differential gene expression calculations

FASTQ files were mapped to their respective genomes using HISAT2 (237) for Mus musculus and Bowtie (238) for pneumococcal genomes of D39, TIGR4, and 6A-10 (see reference genome information in the Supplemental Table 3.8). Reads mapped to each respective genome were tabulated in Supplemental Table 3.1. Gene expression counts for all samples were then estimated using HTSeq (239). All sample counts for the mouse data were tabulated in Supplemental Table 3.2. For bacterial counts, core genes without paralogs were determined for D39, TIGR4, and 6A-10 using PanOCT (168). Paralogs were removed because they could lead to potential biases in differential gene expression from a core perspective, due to reads mapping to multiple paralogs in a respective genome. Counts were then tabularized for analysis based on core genes, using TIGR4 locus tags as identifiers in Supplemental Table 3.2. Counts tables were then imported into RStudio for analyses and estimation of differentially expressed (DE) genes. Rarefaction curves in Figure 1 were generated from HTSeq counts data (See Data & Code Availability). Principal Component Analyses (PCAs) and dendrograms in Figures 3.2 & 3.6 were generated using various R packages (see Supplemental Table 3.8) based on normalized Variance Stabilized Transformation (VST) counts acquired using the DESeq2 R package (230). For mouse DE gene estimation, infected/colonized tissues were compared to their respective uninfected tissue control samples as the baseline using DESeq2 and filtered using an FDR cutoff of 0.05. For bacterial DE gene estimation, each infected tissue was compared to the nasopharyngeal colonization state as the baseline, while accounting for strain differences within core genes using DESeq2 and filtered using an FDR cutoff of 0.05. Common and unique DE genes for both species were determined using Upset plots and individual heatmaps of DE genes were generated based on Z-scores of VST counts (Supplemental Table 3.2). Pneumococcal and mouse DE genes are provided in Supplemental Tables 3.3, 3.4 & 3.5. See Supplemental Table 3.8 for package source/documentation & Data and Code Availability for additional parameters used in DE gene estimation or heatmap generation.
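The grouping of DE genes into common and site-unique sets, which is what an UpSet plot displays, can be sketched in plain Python. This is an illustration only: the study used R packages for this step, and the function name, site labels, and gene identifiers below are hypothetical.

```python
def exclusive_intersections(gene_sets):
    """Group genes by the exact combination of sites in which they are DE.

    gene_sets maps a site name to a set of DE gene identifiers. The result
    maps a frozenset of site names to the genes DE in exactly those sites,
    which corresponds to one bar of an UpSet plot.
    """
    membership = {}
    for site, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(site)
    groups = {}
    for gene, sites in membership.items():
        groups.setdefault(frozenset(sites), set()).add(gene)
    return groups

# Hypothetical DE gene lists per anatomical site (not study data).
de = {
    "lung": {"g1", "g2", "g3"},
    "blood": {"g1", "g2", "g4"},
    "heart": {"g1", "g5"},
}
groups = exclusive_intersections(de)
core = set.intersection(*de.values())  # genes DE at every site
print(sorted(core))
print(sorted(groups[frozenset({"lung", "blood"})]))
```

Keeping the groups exclusive (a gene appears in exactly one group) separates genes shared by all sites from those shared by only a subset, which a plain pairwise intersection would conflate.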

H. Supplementary data

Supplementary data, tables and figures are available online at PNAS at https://www.pnas.org/content/117/52/33507 and https://www.pnas.org/content/suppl/2020/12/09/2010428117.DCSupplemental

I. Acknowledgements

This work was supported with funds from: i) Merck Sharp & Dohme Corp. under the Merck Investigator Studies Program (MISP) award IISP ID#: 57329 – Pneumovax entitled: “Dual RNA-seq for characterization of the Streptococcus pneumoniae and host transcriptomes during bacteremia/invasive disease”, and ii) the National Institutes of Health, National Institute of Allergy and Infectious Diseases under the R01AI114800 award entitled: “Cardiac microlesion formation during invasive pneumococcal disease.” We acknowledge essential support provided by members of the Institute for Genome Sciences’ Genomics Resource Center, Genome Informatics Core, and High-Performance Computing Core. Finally, we thank Mogens Kilian and Matthew Chung for critical reading of this manuscript.

IV. Chapter 4: Influenza A virus modulation of Streptococcus pneumoniae infection using an ex vivo human primary lung epithelial cell model

Preface

This chapter fulfills Specific Aim 3: Validate mouse and pneumococcal predicted pathways from in vivo experiments using an ex vivo human lung model infected with S. pneumoniae through dual RNA-seq. Here, we identify host-pathogen gene expression and pathways in the pneumococcus and human lung epithelial cells during Streptococcus pneumoniae infection and co-infection with influenza A virus, using multi-species RNA-seq.

A. Abstract

Streptococcus pneumoniae is a known colonizer of the nasopharynx and can cause pneumonia and disseminated disease at various host anatomical sites. Transition from pneumococcal host colonization to invasive disease is not well understood. Studies have shown that such a transition can occur during influenza virus co-infections. We investigated the pneumococcal and host transcriptomes with and without influenza A infection at this transition, using primary Human Bronchial Epithelial Cells (nHBEC) in a trans-well monolayer model at an Air-Liquid Interface (ALI), with multi-species deep RNA-seq. Distinct pneumococcal gene expression profiles were observed in the presence and absence of influenza. Influenza co-infection allowed for significantly higher pneumococcal growth in the presence of host nHBEC and triggered bacterial expression of differential metabolic profiles. Host nHBEC transcriptomes were not significantly perturbed by a 6h pneumococcal infection. Infection with influenza alone or with the pneumococcus and influenza together induced differential expression of thousands of host genes. Analysis of these host and pneumococcal transcriptomes helps shed light on the transitionary process of pneumococcal invasion upon influenza co-infection.

B. Background

Streptococcus pneumoniae, a known colonizer of the nasopharynx and the most common cause of community-acquired pneumonia, has also been shown to be one of the most frequent causes of bacterial co-infection following influenza virus infection (240). It has been established that influenza infections predispose the host to secondary bacterial infections, including during past influenza pandemics and epidemics (241). The elderly and infants are the most vulnerable to fatal outcomes of influenza infections in the form of respiratory and circulatory deaths (242, 243). These demographics overlap with those susceptible to pneumococcal colonization and invasive disease. These secondary bacterial co-infections have also been associated with increased severity and mortality rates due to bacterial pneumonia (56, 244). However, there is significant overlap in the symptoms of influenza and bacterial co-infections, making such superinfections a challenging field of study (240).

Multiple studies have investigated the role of influenza and the pneumococcus in superinfections. Influenza infections were associated with increased pneumococcal burdens in murine lungs and nasopharynxes (245). Investigations into pneumococcal host-to-host transmission also revealed increased bacterial transmission due to influenza pre-infections in mice (245) but reduced influenza transmission (246). One study has investigated the pneumococcal genes involved in transmission using transposon sequencing (Tn-Seq) in a ferret model, identifying key pneumococcal gene sets important for transmission (112). Yet another investigated pneumococcal virulence factors in influenza superinfections using CRISPRi-seq (based on a doxycycline-inducible CRISPR-Cas9 system that can turn genes on or off), showing involvement of the pneumococcal capsule and the adenylosuccinate synthetase gene purA (247). Other sequencing-based studies, including RNA-seq, have highlighted key genes such as the pneumococcal two-component system SirRH, which increases pneumococcal survival in influenza-infected A549 cells (248). Pre-incubation of influenza with the pneumococcus prior to host infection has also resulted in increased virulence and bacteremia in mice (249). As for the host’s influence on this process, several major genes and chemokines have been identified, such as IL-17, IL-23, IL-12, CD200, type I interferon genes, etc. (56, 250). These studies, some of which are discussed below in the context of our results, offer conclusive pneumococcal and host markers of superinfections. We validated and expanded upon these markers in a human primary lung epithelial cell superinfection model, consisting of human tracheobronchial tissues cultured at ALI, stimulating their differentiation to multiple airway epithelial phenotypes (ciliated, mucous, and non-ciliated, non-mucous club cells) in proportions similar to mature mammalian airways.

Genes involved in the process of the bacterial transition into an invasive state in the lung, using a human lung primary differentiated cell epithelial layer, have not been previously studied. This experimental design provides epithelial cell-specific transcriptomic responses from a host perspective. We hypothesize that influenza infection alters the expression profile of the lung epithelial barrier, rendering it more permissive to pneumococcal invasion and growth. Streptococcus pneumoniae adapts to this new environment by changing its gene expression program to acquire nutrients, invade the epithelium and cause disease. Here we explored the transcriptome of the pneumococcus strain EF3030, a pneumonia-causing strain, and the human lung epithelium in the presence and absence of influenza A virus pH1N1, using multi-species RNA-seq. Our results shed light on the genes involved in the superinfection process from a bacterial and host perspective.

C. Results

1. Streptococcus pneumoniae engages in distinct gene expression profiles on primary human lung epithelial cells with and without influenza

To acquire the human lung epithelial and pneumococcal transcriptomes in the presence or absence of influenza, we infected our primary nHBEC with influenza A virus pH1N1 for 72 hours followed by Spn EF3030 for 6h (see Methods), and performed multi-species RNA-seq. In these samples, the pneumococcal and influenza signals are measured relative to human lung epithelial cell gene expression.

One common caveat of multi-species RNA-seq is a markedly lower number of sequencing reads for the less abundant species (251). To ensure adequate sequencing depth for robust interrogation of the EF3030 transcriptome, we generated saturation curves for all our pneumococcus-infected samples (Figure 4.1A). We observed high coverage of our in vitro EF3030 and Host+EF3030+pH1N1 samples, as those saturation curves plateau for over 95% of EF3030 protein coding genes. Our Host+EF3030 samples were also approaching the same level of saturation for those same genes, allowing us to perform robust analyses of bacterial signals across all conditions tested.
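The idea behind a saturation (rarefaction) curve can be sketched as follows: reads are subsampled at increasing depths and the number of genes detected at each depth is counted. This is only an illustrative sketch under stated assumptions, not the procedure used to produce Figure 4.1A; the function name `saturation_curve`, the `min_reads` detection threshold, the subsampling fractions and the toy data are all hypothetical.

```python
import random
from collections import Counter

def saturation_curve(read_genes, fractions, min_reads=1, seed=0):
    """Return (n_reads, n_genes_detected) pairs for increasing subsample sizes.

    read_genes: list with one gene identifier per mapped read.
    Reads are shuffled once and prefixes are taken, so the curve never decreases.
    """
    reads = list(read_genes)
    random.Random(seed).shuffle(reads)
    curve = []
    for frac in sorted(fractions):
        n = int(round(frac * len(reads)))
        counts = Counter(reads[:n])
        detected = sum(1 for c in counts.values() if c >= min_reads)
        curve.append((n, detected))
    return curve

# Toy data (hypothetical): three genes with very different abundance
toy_reads = ["g1"] * 90 + ["g2"] * 9 + ["g3"] * 1
curve = saturation_curve(toy_reads, [0.1, 0.25, 0.5, 1.0])
```

A curve that flattens before the full read depth is reached indicates that additional sequencing would detect few further genes, which is the sense in which a sample is considered saturated.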

Figure 4.1 Rarefaction curves and estimation of pneumococcal and viral burdens.

A. Saturation curves of EF3030 samples. Samples whose curves plateau or approach plateauing harbor enough reads across >95% of EF3030 genes. B. PCA of EF3030 samples clustering by condition. C. Bootstrapped dendrogram of EF3030 showing replicate clustering of EF3030 samples with relative distances. D. Read mapping percentages averaged across 3 replicates for EF3030 (top) and pH1N1 (bottom).

To acquire an estimate of the overall quality of our EF3030 samples and reproducibility across our biological replicates, we performed a principal component analysis (PCA) and a hierarchical clustering of genome-wide bacterial expression profiles (Figures 4.1B and 4.1C). The PCA (Figure 4.1B) showed the formation of 3 clusters corresponding to our different EF3030 sample types. PC1 captured ~47% of the overall variation in EF3030 gene expression, with the largest difference observed between in vitro EF3030 and EF3030 encountering the human lung epithelium (no influenza: Host+EF3030). However, expression profiles of Host+EF3030+pH1N1 formed a cluster located between in vitro EF3030 and Host+EF3030 along PC1. This suggests that a number of EF3030 genes are in a transitional expression state when compared to in vitro EF3030 and Host+EF3030. PC2 also captured a significant amount of variation at ~25%, separating samples by the presence of pH1N1 infection of the human lung epithelium. This also suggests that certain EF3030 genes might respond to the virus directly or indirectly during the lung epithelial cell infection process. The hierarchical clustering dendrogram (Figure 4.1C) of EF3030 samples showed excellent clustering for each sample type with high bootstrap values and short branch lengths between biological replicates. One of the key observations we made from the reads mapping to the EF3030 genome was that the proportion of reads that mapped to EF3030 was significantly higher for Host+EF3030+pH1N1 relative to Host+EF3030 (Figure 4.1D top), with ~30x more bacterial reads mapped from Host+EF3030+pH1N1 samples. A similar observation was made for pH1N1 reads in Host+EF3030+pH1N1 compared to Host+pH1N1, albeit with a less stark increase of ~2x more reads mapping. These results are likely not directly representative of true measures of CFUs of EF3030 in these samples (Figure 4.1D top), but we observed a 6x increase in bacterial CFUs from an apical wash of infected transwells in our pilot experiments following identical infection protocols (Figure 4.2). This is consistent with known increases in pneumococcal burden upon influenza infection (245).

Figure 4.2 EF3030 CFUs recovered from an apical wash of the Air Liquid Interface transwell lung epithelial system.

Red dotted line is the mean of EF3030 recovered from 1000 CFU inoculated into BLM media (with no epithelial cells). Black dotted lines enclose +/- standard error of mean (SEM; 1.32E6 +/- 2.07E5). Mean and SEM for No IAV CFU (NIC) per apical wash: 4.63E5 +/- 2.75E5. Mean and SEM for + IAV CFU per apical wash: 3.14E6 +/- 4.96E5.

Figure 4.3 Differentially Expressed (DE) EF3030 genes.

A. Upset plot of EF3030 DE genes. The individual or connected dots represent the various intersections of genes that were either unique to, or shared among comparisons, similar to a Venn diagram. Vertical bars represent the number of DE genes unique to specific intersections. Horizontal bars represent the number of DE genes for each specific comparison. B. Z-scored heatmap of expression levels of genes from the red intersection column of panel A. C. Z-scored heatmap of expression levels of genes from the orange intersection column of panel A. D. Z-scored heatmap of expression levels of genes from the blue intersection column of panel A.

2. Statistical analysis of significantly differentially expressed Streptococcus pneumoniae genes reveals different responses to primary human lung epithelial cells with and without influenza

To identify the major EF3030 gene responses to epithelial cells, and epithelial cells with pH1N1, we performed a differential expression analysis using DESeq2 (230). We queried differentially expressed (DE) genes for the comparisons of Host+EF3030 relative to in vitro EF3030, Host+EF3030+pH1N1 to in vitro EF3030, and Host+EF3030+pH1N1 to Host+EF3030, considering only DE genes that passed a DESeq2 FDR cutoff of ≤0.05 and an absolute Log2 Fold Change (LFC) of ≥1. Figure 4.3A highlights 3 sets of DE genes that were shared among certain comparisons. The red column shows 163 DE genes shared between the Host+EF3030 vs EF3030 and Host+EF3030+pH1N1 vs EF3030 comparisons. This suggests that these genes are modulated by EF3030 upon encounter of the human lung epithelium. The orange column consists of 80 genes DE in all 3 comparisons, suggesting regulation of these EF3030 genes constitutes a core host-pathogen interaction process. The blue column identifies 52 DE genes shared between the Host+EF3030+pH1N1 vs EF3030 and Host+EF3030+pH1N1 vs Host+EF3030 comparisons, suggesting these EF3030 genes are regulated in the presence of pH1N1. Heatmaps of normalized gene expression values (Supplemental Table 4.1) of these 3 sets of DE genes are presented in Figures 4.3B, 4.3C and 4.3D, respectively. Figure 4.3B shows a clear differential regulation of the 163 genes, approximately half of which are upregulated on the human lung epithelium regardless of viral infection. Figure 4.3C shows a more complex pattern of expression of the 80 genes shared across comparisons. The majority of these show a pattern of strong upregulation of EF3030 genes upon interaction with the human lung epithelium but a relatively lower upregulation in the presence of influenza co-infection. This trend is reminiscent of the sample clustering observed in the PCA plot (Figure 4.1B) along PC1. Figure 4.3D reveals a clear pattern of gene expression associated with host-pathogen interactions involving the presence of prior viral infection. Entire DE gene lists from Figure 4.3 and the subsets detailed here (columns of Figure 4.3A) are provided in Supplemental Table 4.2.
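The filtering and intersection logic described in this section can be sketched as follows. This is a minimal illustration rather than the code used for Figure 4.3A; the function names, the toy genes and their values are hypothetical, and only the cutoffs (FDR ≤0.05, absolute LFC ≥1) and comparison names are taken from the text.

```python
def significant_genes(results, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with adjusted p-value <= fdr_cutoff and |log2 fold change| >= lfc_cutoff."""
    return {
        gene
        for gene, (lfc, padj) in results.items()
        # An adjusted p-value may be missing (None here) for genes that were filtered out.
        if padj is not None and padj <= fdr_cutoff and abs(lfc) >= lfc_cutoff
    }

def exclusive_intersections(gene_sets):
    """Group genes by the exact combination of comparisons in which they are DE."""
    membership = {}
    for name, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(name)
    groups = {}
    for gene, names in membership.items():
        groups.setdefault(frozenset(names), set()).add(gene)
    return groups

# Hypothetical results: gene -> (log2 fold change, adjusted p-value)
host_vs_vitro = {"g1": (2.1, 0.001), "g2": (-1.5, 0.01), "g3": (0.4, 0.001)}
coinf_vs_vitro = {"g1": (1.2, 0.02), "g2": (-0.3, 0.5), "g4": (3.0, None)}
coinf_vs_host = {"g1": (-1.1, 0.04), "g5": (1.8, 0.03)}

sets = {
    "Host+EF3030 vs EF3030": significant_genes(host_vs_vitro),
    "Host+EF3030+pH1N1 vs EF3030": significant_genes(coinf_vs_vitro),
    "Host+EF3030+pH1N1 vs Host+EF3030": significant_genes(coinf_vs_host),
}
groups = exclusive_intersections(sets)
```

Each key of `groups` corresponds to one column of an Upset plot: the genes that are DE in exactly that combination of comparisons and in no others.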

3. Weighted Gene Co-expression Network Analysis (WGCNA) of bacterial transcriptomes captures additional genes associated with host-pathogen interactions

Figure 4.4 Weighted Gene Co-expression Network Analysis (WGCNA) clusters of EF3030 gene expression.

A. Z-scored heatmap of expression levels of 835 EF3030 genes clustering in four modules (black, blue, dark green & pale violet, from top to bottom). B-E. PCA plots based on the genes in each module (black, blue, dark green & pale violet, respectively).

We implemented WGCNA to supplement DE genes following the 3 major patterns of EF3030 gene expression observed in Figure 4.3. Gene modules were predicted by WGCNA applied to normalized expression values of ~1900 EF3030 genes (Supplemental Table 4.1). WGCNA identified 4 modules of EF3030 gene expression, consisting of 835 genes (Figure 4.4A), that fit patterns of interest observed with DE gene sets. The black module consisted of 330 genes regulated specifically when EF3030 encountered the human lung epithelium. The blue module consisted of 193 genes that were more relevant to EF3030 specifically in the absence of a pH1N1 infection, as they were more similar to the in vitro expression profile. The dark green module encompassed 204 genes regulated differently in pH1N1 presence and absence, following the major transitionary pattern of expression seen in Figure 4.3C. The final pale violet module comprised 108 genes most relevant to host-pathogen interactions in the presence of pH1N1. In order to visualize the relevance of these modules and assign a numerical value to their importance, we generated PCA plots of the gene subsets from each of these modules in Figures 4.4B-4.4E. The 330 EF3030 genes of the black module (~17% of all EF3030 genes) capture ~78% of the variation along PC1 separating in vitro EF3030 and Host+EF3030 (Figure 4.4B), while PC2 only captured subtle differences (~8%) in biological replicates. The 193 EF3030 genes of the blue module (~10% of all EF3030 genes) capture ~84% of the variation along PC1 separating Host+EF3030 interactions specifically in the absence of pH1N1 infection (Figure 4.4C), with PC2 only capturing subtle differences (~8%) in in vitro EF3030 and Host+EF3030+pH1N1 samples. The 204 EF3030 genes of the dark green module (~11% of all EF3030 genes) capture ~89% of the variation along PC1, revealing genes important to both Host+EF3030 and Host+EF3030+pH1N1 states to different degrees, with PC2 accounting for ~4% of replicate variation. Lastly, the 108 genes of the pale violet module (~5.5% of all EF3030 genes) capture ~88% of the variation along PC1, demonstrating genes most important in the context of prior pH1N1 infection, with PC2 again accounting for ~4% of replicate variation. These modules encompassing 835 EF3030 genes (~43% of all EF3030 genes) clearly condense the most important genes involved in these specific contexts of host-pathogen interactions and are more associated with specific conditions (pH1N1, Host, or Host+pH1N1) than other genes in this dataset. However, not all the genes identified in modules are necessarily of relevance to the condition considered, and some spurious correlations may exist. All of these genes and their normalized expression values are listed in Supplemental Table 4.3.
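For readers unfamiliar with co-expression networks, the core idea behind WGCNA can be sketched as follows: pairwise gene correlations are raised to a soft-thresholding power so that strong correlations dominate the network from which modules are then derived. This is only a conceptual sketch and does not reproduce the WGCNA package or the parameters used in this study; the function names, the power `beta=6`, and the toy expression vectors are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant gene is treated as uncorrelated
    return sxy / sqrt(sxx * syy)

def adjacency(x, y, beta=6, signed=False):
    """Soft-thresholded connection strength between two genes.

    Unsigned: |r| ** beta. Signed: ((1 + r) / 2) ** beta.
    """
    r = pearson(x, y)
    base = (1 + r) / 2 if signed else abs(r)
    return base ** beta

# Hypothetical expression of three genes across four samples
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # perfectly co-expressed with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # perfectly anti-correlated with gene_a
```

In an unsigned network `gene_a` and `gene_c` are strongly connected despite opposite trends, whereas a signed network keeps them apart; which variant is appropriate depends on whether inversely regulated genes should share a module.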

4. Streptococcus pneumoniae exhibits different metabolic profiles during human lung epithelium infection compared to co-infection with influenza

Figure 4.5 EF3030 genes of interest.

A. Upset plot of the top 100 most highly expressed genes in each condition. 14 genes are shared in our ex vivo model (black column). B. Z-scored heatmap of expression levels of 125 EF3030 genes of interest identified from DE genes, WGCNA modules and top 100 highly expressed genes.

To identify the genes that the pneumococcus invests in heavily upon lung epithelial cell infection, we also compiled the top 100 most highly expressed genes in each sample type (Figure 4.5A). Fourteen genes were shared when EF3030 encounters the lung epithelium, 13 genes were specific to just EF3030 infection and 11 were specific to co-infection with pH1N1. Lists of differentially expressed (DE) genes, WGCNA genes, and top 100 genes of EF3030 are provided in Supplemental Table 4.4. Inspection of these EF3030 gene sets, together with KEGG pathways enriched among DE genes alone to prioritize the most important pathways (Supplemental Table 4.4), revealed that several bacterial genes involved in host-pathogen interactions are also important in influenza co-infection, albeit to different degrees. Interestingly, several of these genes encode PTS systems and ABC transporters, as well as enzymes mediating uptake and metabolism of diverse nutrients. These operons were further scrutinized and prioritized as potential targets for therapeutic intervention.
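The comparison of most highly expressed genes across conditions can be sketched as follows. This is an illustrative sketch only; the function name `top_n_genes`, the tie-breaking rule, the toy expression values, and the use of `n=2` in place of the top 100 are assumptions made to keep the example small.

```python
def top_n_genes(expression, n=100):
    """Return the n genes with the highest expression value (ties broken by gene name)."""
    ranked = sorted(expression.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gene for gene, _ in ranked[:n]}

# Hypothetical mean normalized expression per condition
in_vitro = {"g1": 50.0, "g2": 40.0, "g3": 5.0, "g4": 1.0}
host = {"g1": 60.0, "g2": 2.0, "g3": 30.0, "g4": 1.0}
coinfection = {"g1": 55.0, "g2": 3.0, "g3": 4.0, "g4": 45.0}

conditions = {
    "in vitro": in_vitro,
    "Host+EF3030": host,
    "Host+EF3030+pH1N1": coinfection,
}
top = {name: top_n_genes(expr, n=2) for name, expr in conditions.items()}

# Genes highly expressed in every condition, and those specific to infection alone
shared_all = set.intersection(*top.values())
host_only = top["Host+EF3030"] - top["in vitro"] - top["Host+EF3030+pH1N1"]
```

Ranking within each condition rather than across conditions means a gene can appear in a top list without being differentially expressed, which is why these lists complement rather than replace the DE and WGCNA gene sets.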

For example, PTS operon EF3030_RS03030-EF3030_RS03040 (galactitol uptake) was specific to Host+EF3030 infection but not Host+EF3030+pH1N1. Three other PTS operons (EF3030_RS00335-EF3030_RS00360 predicted to mediate mannose uptake, EF3030_RS01555-EF3030_RS01575 glucuronic acid uptake, EF3030_RS09945-EF3030_RS09980 ascorbate uptake) were significantly upregulated during both Host+EF3030 and Host+EF3030+pH1N1 but ~2-4 Log2 folds lower in the latter.

With respect to ABC transporters, 3 operons (EF3030_RS10280-EF3030_RS10300 phosphate uptake, EF3030_RS00460-EF3030_RS00470 unknown function, EF3030_RS08970-EF3030_RS08985 iron uptake) were important to both, although phosphate uptake was ~5 Log2 folds higher during infection alone. One large locus of 17 genes (EF3030_RS07890-EF3030_RS07970) also followed this trend, with higher upregulation in Host+EF3030 infection but lesser upregulation (2 Log2 folds lower) during pH1N1 co-infection. It consists of 2 ABC transporter operons (carbohydrate uptake), neuraminidase B, a few enzymes and membrane proteins. Two other ABC operons were specific to just EF3030 infection (EF3030_RS01205-EF3030_RS01210 iron uptake, EF3030_RS06705-EF3030_RS06725 energy coupling factor transporter). In contrast, one ABC transporter involved in amino acid uptake (EF3030_RS05245-EF3030_RS05250) was up in infection and co-infection but much more so in pH1N1 co-infection (1.5 Log2 folds higher).

Carbon processing enzymes, such as those for galactose catabolism (EF3030_RS08885-EF3030_RS08895), were upregulated in both types of infection, but less so during pH1N1 co-infection (~1.5 Log2 folds lower). Two other operons (EF3030_RS05730-EF3030_RS05745, EF3030_RS05470-EF3030_RS0545) involved in glycogen and galactose/tagatose metabolism were mainly up only in EF3030 infection. Another operon, EF3030_RS10795-EF3030_RS10805 for glycerol metabolism, was up in both, but was 2 Log2 folds higher in pH1N1 co-infection. Another large 8 gene locus (EF3030_RS02050-EF3030_RS02090) containing several acetyl-CoA enzymes was significantly downregulated upon pH1N1 co-infection but was expressed at similar levels in vitro and in Host+EF3030. A heatmap of the operons listed above is shown in Figure 4.5B. Log2 fold changes for magnitudes of DE are listed in Supplemental Table 4.4.

Apart from metabolic genes, several genes such as membrane proteins, uncharacterized proteins, competence genes, and bacteriocins were significantly upregulated in Host+EF3030, but overall lower during pH1N1 co-infection. Finally, the CiaR-CiaH two-component regulatory system (EF3030_RS03775-EF3030_RS03780) was also upregulated specifically in Host+EF3030+pH1N1 infection.

5. Influenza virus infection overwhelmingly alters the human lung epithelium transcriptome

Figure 4.6 Human gene expression profiles and DE genes.

A. PCA plot of human lung epithelial samples. B. Bootstrapped dendrogram of host gene expression profiles showing replicate clustering of host samples. C. Upset plot of lung epithelial DE genes. 2335, 761 and 558 DE genes were unique to the comparisons of Host+pH1N1 vs Host, Host+EF3030+pH1N1 vs Host and Host+EF3030+pH1N1 vs Host+EF3030 (red, orange and blue columns, respectively).

We also performed host transcriptome analyses following some of the same approaches as described for the bacterial transcriptome. Host conditions included: control uninfected human lung epithelial cells (Host), Host+EF3030, Host+pH1N1, and Host+EF3030+pH1N1. PCA (Figure 4.6A) revealed a stark separation along PC1 (~85% of the variation) between Host+pH1N1 and Host+EF3030+pH1N1 on one side, and Host and Host+EF3030 on the other side. The next component, PC2, captured only ~5% of variation between biological replicates. Host and Host+EF3030 samples clustered together with almost no difference, suggesting no major influence on the host as a result of just EF3030 infection (6h) of human lung epithelial cells. Similarly, within pH1N1-containing samples, no separation was seen between Host+pH1N1 and Host+EF3030+pH1N1, suggesting that the presence of EF3030 co-infection had no significant effect during pH1N1 infection, potentially due to an overwhelming response to just pH1N1. Hierarchical clustering of expression profiles (Figure 4.6B) again underlined the clear separation of pH1N1-containing samples, but also intermixing of biological replicates within Host and Host+EF3030, and within Host+pH1N1 and Host+EF3030+pH1N1 sample types. However, the fact that bacterial gene transcriptomes showed significant differences between EF3030, Host+EF3030 and Host+EF3030+pH1N1 implies that intermixing of biological replicates from a host perspective was not due to poor sample replication but merely the lack of a detectable response to EF3030 during human epithelial lung cell infection and during pH1N1 co-infection in this experimental design.

Using DESeq2, we were able to identify 12,637 DE genes specific to pH1N1 infection of the human lung epithelium, as well as 2,335, 761 and 558 DE genes unique to the comparisons of Host+pH1N1 vs Host, Host+EF3030+pH1N1 vs Host, and Host+EF3030+pH1N1 vs Host+EF3030, respectively (colored columns of Figure 4.6C). The list of all Host DE genes and subsets is presented in Supplemental Table 4.6. Only a few genes (<3) were DE past an FDR cutoff of ≤0.05 and an absolute Log2 Fold Change (LFC) of ≥1 for the comparisons of Host+EF3030 vs Host, and Host+EF3030+pH1N1 vs Host+pH1N1. However, for the comparisons of Host+pH1N1 vs Host, Host+EF3030+pH1N1 vs Host, and Host+EF3030+pH1N1 vs Host+EF3030, ~13,000-17,000 genes were DE past cutoffs, nearly a quarter of all annotated human genes. Most noticeably, 1,000 more genes were DE in Host+EF3030+pH1N1 vs Host relative to Host+pH1N1 vs Host, and another 1,500 more DE in Host+EF3030+pH1N1 vs Host+EF3030 when compared to Host+EF3030+pH1N1 vs Host. This suggests that subsets of genes might be altered, though not DE, due to the presence of just EF3030 during infection, and some might be due to the presence of just pH1N1 in the context of co-infection. However, it is important to note that because almost no genes were DE for Host+EF3030 vs Host, and Host+EF3030+pH1N1 vs Host+pH1N1, these extra ~1,000-2,000 DE genes, though likely modulated during co-infection, may also reflect altered responses to EF3030 that were not significant enough to meet DE cutoffs in the Host+EF3030 vs Host, and Host+EF3030+pH1N1 vs Host+pH1N1 comparisons.

We also performed Ingenuity Pathway Analysis (IPA) of DE genes for the most relevant comparisons of Host+pH1N1 to Host and Host+EF3030+pH1N1 to Host. Despite the identification of several additional DE genes during co-infection, IPA did not resolve significant differences in pathways identified between pH1N1 infection and EF3030+pH1N1 co-infection (Supplemental Table 4.6). This is likely due to two reasons. Firstly, the majority of DE genes are responding to pH1N1 and no significant bacterial signal is captured during co-infection after 6 hours. Secondly, IPA was optimized to work with much smaller datasets, with an upper limit of ~3,000 DE genes. As we had ~12,000-15,000 genes in our dataset, approximately a quarter of the human genome, it could identify several false positives.

For these reasons, we attempted to identify the broader biological processes involved during both pH1N1 infection and EF3030+pH1N1 co-infection using GO term analysis. We first separated the DE genes into two datasets, those that were upregulated in both comparisons, and those that were downregulated in both comparisons. These two gene subsets were then subjected to ShinyGO (252) analysis, with an FDR cutoff of ≤0.05, which selected protein coding genes and their associated predicted biological processes (Supplemental Table 4.6). For the upregulated gene set, the majority of the responses were in the form of cytokines and immune signaling and activation, responses to chemical/biotic stimuli, cell death responses and secretion. These represented ~50% of the genes of the upregulated dataset. As for the downregulated dataset, the identified processes related to cilial, microtubule and cytoskeletal changes, representing ~33% of the downregulated genes.
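Over-representation analyses of this kind are commonly based on a hypergeometric test per GO term followed by a false discovery rate correction. The sketch below illustrates that general approach under stated assumptions; it is not the ShinyGO implementation, and the function names, term names and gene counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n query genes are drawn from N background genes, K of which carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 200 query genes from a background of 20,000 genes.
# term -> (query genes with the term, background genes with the term)
terms = {"immune signaling": (30, 500), "cilium movement": (3, 300)}
pvals = [hypergeom_pvalue(k, 200, K, 20000) for k, K in terms.values()]
fdr = dict(zip(terms, benjamini_hochberg(pvals)))
enriched = [term for term, q in fdr.items() if q <= 0.05]
```

Because the test depends on the size of the query list and of the background, very large DE gene sets such as those obtained here tend to yield broad, general terms rather than specific pathways.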

D. Discussion

Analysis of the reads mapping to the EF3030 transcriptome in our dataset showed a large increase in bacterial signal (~30x) upon influenza infection (Figure 4.1D). Although this is not conclusive evidence of increased pneumococcal burdens upon pH1N1 infection, others have shown that bacterial CFUs increase significantly and remain elevated upon secondary pneumococcal infection (253). Pneumococcal titers also appear to be serotype-specific. Sharma-Chawla et al. showed that 19F serotypes (the serotype of our Spn strain EF3030) have increased CFUs during secondary pneumococcal infection with influenza in mouse lungs and nasopharynxes, but not in the blood (254). As for the relatively mild increase in pH1N1 reads mapped in our study, one study showed that viral titers initially increase then decline with time, typically after 7 days (253). This could be happening in our dataset as well; however, 72+6 hours post-infection may not be sufficient to capture the eventual viral decline. Interestingly, viral titers during co-infection in vitro did not change (255), so whether viral replication consistently increases may be very dependent on conditions and model systems, and therefore remains inconclusive.

The large increase in pneumococcal burdens during co-infection would require significant resources for pneumococcal growth and multiplication. We see that the EF3030 transcriptome reflects this necessity through the regulation of a diverse panel of genes, but mainly through its differential regulation of nutrient scavenging ABC transporters, PTS systems and metabolic enzymes. Our findings show that upon encountering the human lung epithelium, multiple scavenging and nutrient processing genes are upregulated in order to grow. However, during pH1N1 co-infection, several of these are no longer as important, likely due to the effect of pH1N1 on the human lung epithelium. It is possible that pH1N1 disrupts and/or lyses host cells, resulting in the availability of more readily accessible nutrients in the form of glycerol, glycogen and amino acids, as genes involving the utilization of these were observed to be upregulated significantly more during influenza infection.

Figure 4.7 PCA of expression profiles of 77 EF3030 metabolic genes in our study and in our previously studied in vitro and in vivo mouse models.

Orthologs of these genes are carried by each of strains D39, TIGR4, 6A-10 and EF3030 (core genes among these strains).

Studies have explored pneumococcal metabolic changes and shown stark differences in metabolic profiles between biofilm and planktonic forms of growth (256, 257). Using mouse models of pneumococcal colonization and disease, we previously showed that pneumococcal colonization within the nasopharynx exerts no significant effect on the host (251). As we see metabolic changes in our dataset, we decided to compare these genes and operons with some of our previous in vitro (biofilm and planktonic) and in vivo (mouse infections) work (5, 251). We selected the genes forming operons that were described in the results section above (80 genes) and performed a PCA on the core genes conserved across the pneumococcal strains EF3030, D39, TIGR4 and 6A-10 (77 of 80 genes), shown in Figure 4.7. Despite the caveats of merging datasets from our previous independent studies under diverse in vitro and in vivo conditions, as well as involving different pneumococcal strains, we can see that these 77 metabolic genes capture ~50% of the variation in PC1, which appears to separate invasive disease states on the left from more of a colonization state on the right, in which bacteria grow as biofilms (5). It is possible that human lung epithelial cells cultured in the absence of other lung sub-epithelial and immune cells are more similar to a nasopharyngeal state from an EF3030 infection perspective. PC2, however, captured strong signals of strain-based differences within these 77 genes. Moreover, our influenza-infected samples clustered extremely well with samples of invasive disease for these 77 genes. This suggests that these metabolic genes are differentially regulated specifically during invasive disease. It is likely that, if these studies consisted of a single strain, more of the invasive vs colonization signal would be captured along PC1. These data seem to show that the presence of influenza co-infection triggers a change away from a biofilm state. One study showed that introduction of influenza triggered a dispersion of pneumococcal biofilms in vitro (105). These researchers went on to identify ~70 genes that were transcriptionally regulated specifically to induce dispersion from a pneumococcal biofilm state. Eighteen of these genes were found to follow the same trend in our Host+EF3030+pH1N1 vs Host+EF3030 DE dataset. Yet another study demonstrated this in vivo in mice, where influenza co-infection and dispersion from a biofilm induced hypervirulence and invasive disease in pneumococcal strains EF3030 and D39 (59). Together with our data, this suggests that the pneumococcus exists as a colonizing biofilm on the lung epithelium, but during influenza co-infection, a dispersion of this biofilm occurs, turning it into its invasive state, where it follows a different metabolic profile.

Some pneumococcal virulence factors in influenza superinfections were explored for essentiality in an in vivo CRISPRi-seq model, which showed that the pneumococcal capsule, together with the purA, bacA and pacL genes, was required for transmission, but interestingly not pneumolysin (247). We see in our dataset that in EF3030, most of the genes in the capsular locus (EF3030_RS01695-EF3030_RS01745) are not DE past cutoffs when EF3030 encounters the host lung epithelium, but their expression is elevated and identified in our black WGCNA module (Figure 4.4A). Genes purA and pacL, encoding an adenylosuccinate synthetase and an ATPase-calcium transporter respectively, were among the top 100 most highly expressed genes in all conditions, making them potentially essential genes. We also saw that pneumolysin (EF3030_RS09370) was significantly downregulated when encountering the host lung epithelium, but bacA (EF3030_RS02235) had no changes in expression across all conditions. All the genes found relevant during influenza superinfection in CRISPRi-seq, except for bacA, a bacitracin susceptibility gene (likely due to it being identified in vivo in mice, which could potentially contain other lung microbes), follow suit with our dataset in terms of expression trends. This, combined with known dispersion from biofilm and increased bacterial loads observed in our study and others, supports the possibility that EF3030 switches toward a transmission phenotype during influenza co-infection.

As we established that the pneumococcus adopts an invasive state upon influenza co-infection, as opposed to a colonizing state during infection alone, we investigated whether influenza infection of the host resulted in differential expression of host genes that aid invasive disease. For example, two known host surface receptors for the pneumococcus, PECAM-1 and RPSA (221, 258), which bacteria use to achieve host cell invasion, were upregulated by ~3-5 log2 fold upon influenza infection in our study, potentially aiding the invasive pneumococcus.
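For readers less used to log2 scales, a log2 fold change converts to a linear fold change as 2 raised to that value. The short Python sketch below only illustrates this arithmetic; the function name is ours and is not part of the analysis code.

```python
def log2fc_to_fold(log2fc: float) -> float:
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log2fc


# A log2 fold change of 3 to 5 corresponds to roughly an 8- to 32-fold increase.
for value in (3, 5):
    print(f"log2FC = {value} -> {log2fc_to_fold(value):.0f}-fold")
```

Negative values work the same way: a log2 fold change of -1 corresponds to a halving of expression.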

Several pathways identified by IPA and GO analyses (on upregulated protein-coding genes) mediate the upregulation of diverse immune regulatory systems (Supplemental Table 4.6) to counteract infection, consistent with other studies showing the involvement of dendritic cell-induced cytokine signaling and IL-17 signaling (250, 259). These data suggest that pH1N1 infection and EF3030+pH1N1 co-infection of human lung epithelial cells result in the upregulation of cytokines and signaling to recruit host immune cells. At the other end of the spectrum, we observed in our GO analysis that approximately one-third of all protein-coding downregulated genes (>2500 genes) are related to cilial beating and microtubule-based cell movement, suggesting that this is a significant biological process in the response to influenza infection, with and without EF3030. This phenotype was also observed in our pilot experiments, which followed identical infection protocols and in which a marked decrease in cilia was observed by immunofluorescence (Figure 4.8). Such cilial regulation is consistent with prior influenza and pneumococcal infection studies (260), which also involved upregulation of IL-17 and IFN genes, as seen in our data. Furthermore, cilial beating has been shown to be regulated by cAMP in primary human airway epithelial cells (261), consistent with the activation of cAMP signaling upon influenza infection and superinfection found in our IPA pathway data (Supplemental Table 6).

Figure 4.8 Immunofluorescence of Air Liquid Interface cultured nHBEC cells.

A. Non-IAV infected cells. B. Cells infected with IAV for 72 hours. Red stain represents ciliated cells and green stain represents mucin-producing goblet cells, showing a large loss of cilia and mucin-producing cells post infection.

E. Methods

1. Isolation and culture of differentiated human bronchial epithelial cells and sample generation

Primary human bronchial epithelial cells were obtained from human lung tissue through the tissue donation program of the International Institute for the Advancement of Medicine using a previously published method (262). Cells were harvested from human tracheobronchial tissues and plated onto collagen-coated 100-mm plates (Biocoat; Becton Dickinson) in BEGM medium, grown to ~80% confluency, then passaged onto 6.5-mm Transwell inserts (0.4-μm pore size) with FNC coating (0407, AthenaES) and differentiated at an Air-Liquid Interface (ALI) using Lonza tissue culture reagents, as previously described (113). Differentiation was further enhanced by adding 10 nM recombinant human neuregulin-1α/epidermal growth factor-like domain of heregulin-α (296-HR-050, R&D Systems) to the basolateral ALI media (113). Cells were incubated for approximately two weeks until ciliated cells and mucus production were seen (113). Following culture of differentiated human bronchial epithelial cells on transwells, cells were infected with pH1N1 or mock infected (no-pH1N1 controls) for 72 h (~50% of cells infected with IAV). pH1N1-infected cells were then exposed to 1000 CFUs of Streptococcus pneumoniae strain EF3030 for 6 h. For pneumococcal EF3030 control samples, ~50,000 CFUs of EF3030 were cultured in ALI media for 6 h. 1 mL of RNAprotect was then added to all transwells, which were then stored in 5 mL of RNAprotect Bacteria Reagent in 50 mL Falcon tubes at -80 °C.

2. RNA isolation, library construction, sequencing, and transcriptomics data analyses

On the day of RNA isolation, transwell samples were thawed and the membranes cut out using a scalpel. RNAprotect and isolated membranes were centrifuged at 10,000 rpm to pellet the membrane and any dislodged cells, and the supernatant was discarded. Pellets were then incubated in 100 μL of lysis buffer (10 μL of mutanolysin, 20 μL of proteinase K, 30 μL of lysozyme, 40 μL of TE buffer) for 10 minutes, followed by mechanical disruption in 600 μL of RLT buffer (RNeasy Mini Kit, Qiagen) containing 1% β-mercaptoethanol, using a motorized pestle for 30 seconds (251). RNA was then captured on RNeasy Mini Kit columns with on-column DNase treatment (Qiagen protocol).

Extracted RNA was quantitated using a Bioanalyzer. Ribosomal RNA was depleted using the RiboZero rRNA Removal Kits for Gram-positive bacteria and/or for human/mouse/rat (Illumina). 300 bp-insert RNA-seq Illumina libraries were constructed using ~1.0 μg of enriched mRNA that was fragmented and then used for synthesis of strand-specific cDNA using the NEBNext Ultra Directional RNA Library Prep Kit (NEB-E7420L). The cDNA was purified between enzymatic reactions and size selection of the library was performed with AMPure SpriSelect Beads (Beckman Coulter Genomics). The titer and size of the libraries were assessed on the LabChip GX (Perkin Elmer) and with the Library Quantification Kit (Kapa Biosciences). RNA-seq was conducted on 150 nt paired-end runs of the Illumina HiSeq 4000 platform using two or three biological replicates for each condition (251).

FASTQ files were mapped to their respective genomes using HISAT2 (237) for Homo sapiens and Bowtie (238) for the pneumococcal EF3030 and viral pH1N1 genomes. Gene expression counts for all samples were then estimated using HTSeq (239). Count tables were then imported into RStudio for analyses and estimation of differentially expressed (DE) genes. Rarefaction curves were generated from HTSeq count data. Principal Component Analyses (PCAs) and dendrograms were generated in R (see Code Availability) based on normalized Variance Stabilizing Transformation (VST) counts obtained using the DESeq2 R package (230). For human DE gene estimation, infected samples were compared to their respective uninfected control samples as the baseline using DESeq2 and filtered using an FDR cutoff of ≤0.05 and an absolute Log2 Fold Change cutoff of ≥1. For bacterial DE gene estimation, each EF3030-infected sample was compared to EF3030 grown in ALI media as the baseline using DESeq2 and filtered using an FDR cutoff of ≤0.05 and an absolute Log2 Fold Change cutoff of ≥1. Common and unique DE genes for both species were determined using UpSet plots (R package UpSetR), and individual heatmaps of DE genes were generated based on Z-scores of VST counts (R package DESeq2).
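The filtering step described above can be sketched in a few lines. The analysis itself was run in R (see Code Availability); the Python below is only an illustration of the cutoffs, assuming each comparison is available as a mapping from gene ID to the DESeq2 result columns `padj` and `log2FoldChange`. The function name, gene IDs and values are hypothetical.

```python
from typing import Dict, Optional, Set

Row = Dict[str, Optional[float]]


def filter_de(results: Dict[str, Row], fdr_max: float = 0.05,
              lfc_min: float = 1.0) -> Set[str]:
    """Return gene IDs with padj <= fdr_max and |log2FoldChange| >= lfc_min."""
    kept = set()
    for gene, row in results.items():
        padj, lfc = row.get("padj"), row.get("log2FoldChange")
        if padj is None or lfc is None:  # missing (NA) values are skipped
            continue
        if padj <= fdr_max and abs(lfc) >= lfc_min:
            kept.add(gene)
    return kept


# Hypothetical toy results for two comparisons (not real data).
cmp_a = {
    "geneA": {"padj": 0.001, "log2FoldChange": 2.3},
    "geneB": {"padj": 0.20, "log2FoldChange": 3.0},   # fails the FDR cutoff
    "geneC": {"padj": 0.01, "log2FoldChange": -1.5},
    "geneD": {"padj": None, "log2FoldChange": 4.0},   # NA adjusted p-value
}
cmp_b = {
    "geneA": {"padj": 0.04, "log2FoldChange": 1.0},
    "geneC": {"padj": 0.03, "log2FoldChange": 0.4},   # fails the fold-change cutoff
    "geneE": {"padj": 0.0005, "log2FoldChange": -4.1},
}

de_a, de_b = filter_de(cmp_a), filter_de(cmp_b)
shared = de_a & de_b   # DE in both comparisons
only_a = de_a - de_b   # DE only in the first comparison
only_b = de_b - de_a   # DE only in the second comparison
print(sorted(shared), sorted(only_a), sorted(only_b))
```

The set operations at the end correspond to the shared and unique categories that an UpSet plot visualizes for larger numbers of comparisons.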

3. Bacterial gene Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis

Bacterial DE gene lists were used to determine DE KEGG (191) pathways (with an adjusted P < 0.05) using the R package KEGGprofile in RStudio (see Code Availability).
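As an illustration of what a pathway test with an adjusted P cutoff involves, the sketch below runs a hypergeometric over-representation test followed by Benjamini-Hochberg adjustment. This is a generic approach written in Python for clarity. Whether it matches the internals or settings of KEGGprofile as used here is an assumption, and all function names, pathway names and gene IDs are hypothetical.

```python
from math import comb
from typing import Dict, List, Set


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K belong to the pathway."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_pathways(de_genes: Set[str], pathways: Dict[str, Set[str]],
                      background_size: int, alpha: float = 0.05) -> Dict[str, float]:
    """Pathways whose adjusted p-value is below alpha."""
    names = list(pathways)
    pvals = [
        hypergeom_sf(len(de_genes & pathways[p]), background_size,
                     len(pathways[p]), len(de_genes))
        for p in names
    ]
    padj = benjamini_hochberg(pvals)
    return {p: q for p, q in zip(names, padj) if q < alpha}


# Hypothetical toy example: 100 background genes, 5 DE genes.
de = {f"g{i}" for i in range(1, 6)}
toy_pathways = {
    "pathway_x": {f"g{i}" for i in range(1, 6)},   # all 5 DE genes fall here
    "pathway_y": {f"g{i}" for i in range(6, 26)},  # no overlap with DE genes
}
result = enriched_pathways(de, toy_pathways, background_size=100)
print(sorted(result))
```

The choice of background (all annotated genes in the genome versus only expressed genes) changes the p-values and should be stated alongside any such analysis.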

4. Human gene Ingenuity pathway analysis

Human DE gene lists were uploaded into IPA (231) (Qiagen, https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-ipa/). Core analyses were conducted on each DE gene list comparison using default parameters. Canonical pathways were determined using the human database with default parameters, on epithelial cells.

5. Human gene Gene Ontology (GO) analysis

Human DE genes that were shared between the Host+pH1N1 vs Host and Host+EF3030+pH1N1 vs Host comparisons were filtered for protein-coding genes and uploaded into ShinyGO (252) with default parameters.

F. Code availability

R code/scripts used to generate the figures and data presented are available on GitHub at https://github.com/admelloGithub/ExVivoSpnH1N1.

G. Supplementary data

All supplemental tables are available online at: https://github.com/admelloGithub/Data

V. Chapter 5. Conclusions and future directions

Preface

This chapter will summarize the findings from all previous chapters from a pneumococcal vaccine or therapeutic target perspective. We will examine expression levels of identified potential vaccine candidates (Chapter 2) in S. pneumoniae transcriptomes in different mouse anatomical sites (Chapter 3), as well as in the ex vivo human lung epithelial cell model with influenza (Chapter 4). We will then explore previous studies conducted on these targets to validate their potential, and put these targets into broad perspective with respect to current vaccination technologies and associated challenges.

As we have seen in Chapter 2, applying and advancing pangenomics and reverse vaccinology processes for microbial species can be a powerful approach to identifying species-wide, highly conserved gene targets with potential for vaccine development. For the pneumococcus, we were able to identify highly conserved potential vaccine candidates (PVCs), as well as observe their degree of conservation across the Spn pangenome, together with that of known vaccine candidates PspA and ZmpB. This shows that not only can we identify novel core PVCs by performing reverse vaccinology from a species-wide perspective, but we can also investigate all other genes of interest for a particular microbial species from a pangenomic perspective.

However, despite the identification of antigenic and conserved PVCs for the pneumococcus, they would only be of therapeutic use if they were expressed by the pneumococcus when present in the host, especially during highly invasive disease states. As such, we examined the transcriptomic profiles of three different strains of the pneumococcus, D39, TIGR4 and 6A-10, within different anatomical sites in vivo in mouse models (Chapter 3). This study allowed us to identify a core pneumococcal signature of 69 genes separating nasopharyngeal colonization from invasive disease.

As viral co-infections are also an important aspect of pneumococcal pathology, we also examined the transcriptomic profile of yet another pneumococcal strain, EF3030, as it grows on the human lung epithelium in the presence and absence of influenza pH1N1. This analysis revealed diverse gene expression and metabolic profiles between pneumococcal infection and superinfection with pH1N1, which we condensed to 125 EF3030 genes of interest in Chapter 4 (Figure 4.4).

From these two host-pathogen interaction studies in the mouse and on the human lung epithelium, we discussed several key genes and operons involved in the uptake and processing of metabolites, such as ascorbate and mannose, whose expression changes between colonization, invasive and superinfection states. Moreover, our use of both species-wide reverse vaccinology and host-pathogen interaction transcriptomics adds value to the genes of interest identified by each study. Knowledge of the expression of our 5 identified PVCs helps prioritize them for experimental testing, and knowledge of species-wide conservation of our host-pathogen interaction genes of interest helps select the more important targets for therapeutic intervention, especially as several of those are surface-exposed transporter systems, which would be better vaccine candidates.

From all studies, we were able to combine our top reverse vaccinology candidates, 69 invasive vs colonization genes, and 125 infection vs superinfection genes, to yield a total of 149 core pneumococcal genes (as several were in common across the studies; Supplemental Table 5.1, https://github.com/admelloGithub/Data), and observe their transcriptomic profile, conservation, and PVC potential (ReVac score). Firstly, nearly all 149 genes were highly conserved and present in at least ~90% of the pneumococcal genomes that were used, which would translate to 85%-95% of the entire pneumococcal species when extrapolating from our sample size estimates (Section 2.2). Among these, a total of 10 pneumococcal genes (Table 5.1) had therapeutic relevance, with high ReVac scores as well as specificity to different conditions.
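Because genes shared between the three lists are counted only once, the combined total is smaller than the sum of the list sizes. The Python sketch below illustrates this union step and a simple conservation filter on a gene presence/absence table. The gene and genome IDs are hypothetical placeholders, and the 0.9 threshold mirrors the ~90% figure quoted above rather than any parameter taken from the analysis code.

```python
from typing import Dict, Set


def combine_gene_sets(*gene_sets: Set[str]) -> Set[str]:
    """Union of several gene lists; genes shared between lists are counted once."""
    return set().union(*gene_sets)


def conserved_genes(presence: Dict[str, Set[str]], n_genomes: int,
                    min_fraction: float = 0.9) -> Set[str]:
    """Genes found in at least `min_fraction` of the genomes surveyed."""
    return {
        gene for gene, genomes in presence.items()
        if len(genomes) / n_genomes >= min_fraction
    }


# Hypothetical toy inputs (placeholder IDs, not the thesis gene lists).
revac = {"gene1", "gene2"}
invasive_vs_colonization = {"gene2", "gene3", "gene4"}
infection_vs_superinfection = {"gene4", "gene5"}

core = combine_gene_sets(revac, invasive_vs_colonization,
                         infection_vs_superinfection)
print(len(core))  # 7 list entries collapse to 5 unique genes

genomes = [f"genome{i}" for i in range(10)]
presence = {
    "gene1": set(genomes),      # present in 10 of 10 genomes
    "gene2": set(genomes[:9]),  # present in 9 of 10 genomes
    "gene3": set(genomes[:5]),  # present in 5 of 10 genomes
}
print(sorted(conserved_genes(presence, len(genomes))))
```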

Table 5.1 Ten conserved pneumococcal genes of interest across all studies with high ReVac scores. (*ReVac prioritized candidates, Table 2.4)

TIGR4_ID | EF3030_ID | ReVac_Scores | Annotation | Comment
SP_0092 | EF3030_RS00470 | 11.69 | ABC transporter substrate-binding protein | TIGR4 nasopharynx specific
SP_0107* | EF3030_RS00545 | 13.271 | LysM domain-containing protein | Core candidate (invasive and influenza higher)
SP_0368* | EF3030_RS01780 | 13.941 | Cell wall surface anchor family protein | Nasopharynx Candidate
SP_0664 | EF3030_RS03125 | 11.017 | zinc metalloprotease ZmpB | Core candidate
SP_1032* | EF3030_RS04990 | 12.712 | iron-compound ABC transporter iron compound-binding protein | Core candidate unbiased
SP_1683 | EF3030_RS07930 | 12.459 | sugar ABC transporter substrate-binding protein | Nasopharynx/colonization Candidate
SP_1687 | EF3030_RS07950 | 11.354 | neuraminidase B | TIGR4 specific
SP_1690 | EF3030_RS07965 | 14.338 | ABC transporter substrate-binding protein | TIGR4 specific
SP_2063 | EF3030_RS10100 | 12.725 | LysM peptidoglycan binding domain-containing protein | Blood Kidney Lung Influenza (may be the dispersed state)
SP_2084 | EF3030_RS10280 | 11.478 | phosphate ABC transporter substrate-binding protein | Nasopharynx/colonization Candidate

For example, one of our highest scoring pneumococcal ReVac candidates, SP_1032, an iron ABC transporter, had unbiased above-average expression (average expression is the average of the expression values of all genes in the genome) across all in vivo, in vitro and ex vivo conditions. Another crucial known vaccine candidate, ZmpB, had above-average expression across all conditions, but we have shown that it displays significant allelic variation among isolates of Spn (Chapter 2, Figure 2.6). Targeting these genes would offer significant therapeutic potential due to their conservation, pathogenic regulation, and surface-exposed accessibility. Interestingly, another one of our highest scoring core pneumococcal ReVac candidates, SP_0107, a LysM domain protein, was expressed at significantly higher levels in in vivo invasive states and during influenza superinfection.
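The "above average expression" criterion used here compares a gene with the mean expression of all genes in the genome within a given condition. A minimal Python sketch of that comparison follows. The expression units, condition names, gene IDs and values are assumptions made for illustration and are not taken from the thesis data.

```python
from statistics import mean
from typing import Dict, List

# condition -> gene -> expression value (units are an assumption here)
Expression = Dict[str, Dict[str, float]]


def above_average_conditions(expr: Expression, gene: str) -> List[str]:
    """Conditions in which `gene` exceeds that condition's genome-wide mean."""
    return [
        condition for condition, values in expr.items()
        if values[gene] > mean(values.values())
    ]


# Hypothetical toy expression table with three genes and two conditions.
expr = {
    "nasopharynx": {"g1": 10.0, "g2": 2.0, "g3": 3.0},  # mean = 5.0
    "blood": {"g1": 8.0, "g2": 1.0, "g3": 9.0},         # mean = 6.0
}

for gene in ("g1", "g2", "g3"):
    print(gene, above_average_conditions(expr, gene))
```

In this toy table, `g1` is above average in every condition (the "unbiased" pattern), `g3` only in blood (a condition-specific pattern), and `g2` in none.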

Apart from our core ReVac candidates, some genes that were modulated in vivo and ex vivo also had vaccine candidate potential in a condition- or strain-specific manner. For example, SP_0368, SP_1683 and SP_2084, an O-glycosidase, an uncharacterized ABC transporter, and a phosphate transporter, respectively, had above-average expression specifically during nasopharyngeal colonization and pneumococcal lung epithelial cell infection, suggesting that these could be colonization-specific PVCs. In contrast, SP_2063, a LysM peptidoglycan-binding domain protein, had above-average expression in vivo in the blood, kidney and lung, and ex vivo upon influenza superinfection, but not during colonization and heart infections. This could be related to the planktonic or dispersed states, as we know that the pneumococcus grows as a biofilm during colonization and heart infections (5). Lastly, SP_0092, SP_1690, and SP_1687, two uncharacterized ABC transporters and the known virulence factor neuraminidase B, had above-average expression only in the TIGR4 strain across all conditions, indicating strain specificity.

While the above information provides confidence for the therapeutic potential of these genes, quite a few have been previously studied in the context of antibiotic and immune resistance, as well as vaccine potential. For example, genes SP_0092 and SP_1690 have been linked with pneumococcal resistance to the antibiotic linezolid (263-265). Through a combination of transcriptomics, proteomics and pneumococcal mutants, these studies have established that these two ABC transporters are necessary for the acquisition of nutrients and the generation of the energy requirements for the pneumococcus to resist linezolid-based antimicrobials. These findings also verify that these two genes are indeed surface exposed as potential targets for intervention.

Similarly, SP_0107, a surface-exposed, essential, peptidoglycan-binding protein, has been shown to confer resistance to host-mediated cell lysis (190, 266). SP_1683, also a verified surface lipoprotein, was tested for immunogenicity using a pneumococcal protein array of 95 proteins and human IgG antisera from children with pneumococcal pneumonia, but was found not to be seroprevalent, i.e., it potentially did not induce antibodies (267).

As for the known virulence factors neuraminidase B (SP_1687) and zinc metalloprotease B (SP_0664), both have been shown to have antigenic domains and ZmpB was potentially protective (178, 268), but both exhibit allelic variation (268-270), as we also established in our study for ZmpB (Figure 2.7A, right).

SP_2063 has been previously shown to have antigenic, seroprotective domains, as it was used in the construction of a chimeric fusion protein (together with PspA and PsaA) that was highly immunogenic in mice and prevented lethal infection (271). SP_0368 was also shown to be highly conserved across the pneumococcal species using protein arrays (as we have established as well) (272), and was identified among pneumococcal surface proteins through mass spectrometry (273). In addition, it was capable of inducing antibodies in humans in a potential vaccine candidate study (273).

While these pneumococcal targets are antigenic and conserved, SP_1032 and SP_2084 are the two most exciting. Firstly, both are highly conserved within the pneumococcus (Chapter 2.2), and each is important to different in vivo pneumococcal survival phenotypes (see Table 5.1). A comparative genomic analysis of several streptococcal species, consisting of Streptococcus pneumoniae, S. pseudopneumoniae, S. mitis, S. oralis, and S. infantis, identified ~200 pneumococcal genes that were unique to the species (82). Both SP_1032 and SP_2084 are only present within the pneumococcal species. SP_2084 is a proven surface-exposed protein implicated in penicillin resistance (274, 275), but has been less studied from a vaccine perspective. However, the core candidate iron transporter SP_1032 has been shown to be protective against respiratory infection in mice (276), and also induces antibodies during septicemia (277). In addition, when included as part of a multi-antigen vaccine, it was also protective against lethal pneumococcal infection (278). These studies, together with our findings, support its vaccine potential, making it the most promising species-wide pneumococcal vaccine candidate.

It is well known that vaccine escape is a potential outcome of any vaccine implementation (279). In the case of the pneumococcus, serotype-based vaccines selected for the expansion of non-vaccine serotypes (279). In the case of our protein-based antigens, escape could take the form of mutations within the antigen or its epitopes, a lack of expression or phase variation, or even loss of the antigen’s gene sequence (280). However, the fact that our candidates are highly conserved, with extremely low nucleotide variation within the species, suggests that mutations are not easily tolerated within these genes and are potentially lethal. As our candidates are also highly expressed, some at all anatomical sites, there is likely a lower potential for lack of expression or phase variation. Conservation and high expression also suggest essentiality of these candidates within the pneumococcal species (281). Overall, for these reasons, the potential for vaccine escape is significantly lowered, although not necessarily eliminated.

Although our findings suggest primarily utilizing SP_1032 as a subunit protein-vaccine candidate, such single-component vaccine designs tend to be less immunogenic than whole-cell or multi-antigen-based vaccines (282). As a result, it would be beneficial to include some of the other candidates identified here to produce a multi-antigen vaccine comprising targets specific to the two major phenotypes in vivo, colonization and invasive disease. For example, since SP_2084 is important during nasopharyngeal colonization, it could be a part of the multi-antigen cocktail. Similarly, SP_2063 was more relevant as a candidate during invasive disease. Although, unlike SP_1032 and SP_2084, it is not restricted to the pneumococcal species, unique pneumococcal epitopes from this candidate could also be included as a part of the vaccine design, as previously explored (283). Furthermore, non-core antigens may need to be included to expand the coverage of the vaccine and achieve broad protection, as observed, for instance, in the case of Group B Streptococcus (284).

After candidate identification and selection, the next hurdle comes in the form of generating large quantities of the purified protein candidates. Several genetically engineered protein expression systems exist, based on bacterial, yeast, insect, mammalian and even plant cells (282). For a pneumococcal protein vaccine, a bacterial expression system such as E. coli (285) would be the most fitting, as the other systems are designed to accommodate post-translational modifications that are required, for instance, to produce antigens mimicking properties of eukaryotic pathogens (282).

Assuming production of sufficient amounts of purified protein candidates is achieved, we would then have to focus on vaccine delivery. For subunit protein vaccines, this can be accomplished using nanoparticle carriers (286). These could be organic, such as liposomes, or inorganic, such as metal-based particles (286), harboring the antigens through attachment to the nanoparticle surface (conjugation or adsorption) or encapsulation within the nanoparticle (287). As our antigens are surface-exposed, membrane-bound candidates, a liposomal nanoparticle with candidates bound within its outer phospholipid structure could be a potential route for vaccine delivery.

While the use of these pneumococcal PVCs has been presented as a protein-based vaccine above, nucleic acid-based vaccines are also a strong potential alternative. DNA and RNA-based vaccines have certain advantages over protein vaccines. They are relatively cheaper to produce, easier to design, known to be safe, and stable (286). For these reasons, such vaccines can be produced in large quantities significantly faster than conventional protein vaccines. For example, consider the case of SARS-CoV-2, the RNA virus responsible for the COVID-19 pandemic. Nucleic acid vaccines were quickly designed, tested, mass produced, and distributed. Some were in the form of DNA vaccines, where the DNA sequence of the SARS-CoV-2 spike protein antigen was incorporated into adenoviral vectors, which delivered this DNA into host cell nuclei, producing the antigen within the cell for presentation in MHC complexes and thereby generating strong, lasting immune responses (288). Others were designed as liposomal carriers of the spike protein mRNA, which would deliver this mRNA into the host cell cytoplasm, producing the protein for antigen presentation (289). RNA vaccines also have potential advantages over DNA vaccines. They can be designed to be self-replicating, requiring lower doses (286, 290). They tend to be more immunogenic, because they also trigger several host pathways that are responsive to foreign RNA molecules (286, 291). They can also be produced in cell-free systems, reducing any potential chances of contamination (286, 290). While bacterial formulations of such vaccines are not currently in use, they could be easily developed using similar systems, as the antigen incorporated need not be of viral origin, especially considering that mRNA vaccines are also being tested as cancer therapies (286).

With such designs as potential vaccines, pre-clinical studies would still need to be performed to measure immunogenicity, any potential toxicity and dose ranging, as well as to quantify the vaccines’ protective nature (292) in a primate model with a pneumococcal infection challenge. This would then be reconfirmed in clinical trials in human subjects.

Our findings clearly demonstrate the advantages of implementing a multi-faceted, multi-platform approach to the study of the pneumococcus, from bioinformatics to pangenomics and multi-strain, multi-system experimentation, altogether allowing for the identification of better potential therapeutic targets against pneumococcal disease than those currently being investigated. Future experimental validation of these candidates, as well as establishing the function of those that are uncharacterized and their importance and necessity during specific pathological conditions, will greatly advance the pneumococcal field.

From the host’s perspective, we found that the pneumococcus triggers host interferon signaling during its invasive state. We also saw that treatments with interferon β and type I interferon stimulants result in protection of mice from invasive disease. As for influenza superinfection, the viral presence also triggers interferon signaling (293), with interferon genes upregulated by several log2 fold. Such interferon treatments are likely to also be protective to some degree, but this remains to be validated. On the other hand, during superinfection, we saw a strong downregulation of genes controlling cilial beating, which is regulated by the cAMP signaling pathway. The cAMP signaling pathway was activated during superinfection, yet it was inhibited in our in vivo mouse lung model. This pathway could be regulated specifically due to the presence of influenza, but the difference could also be skewed by the potential for more host cell types to be involved in vivo than the epithelial cells we tested ex vivo. In addition, the in vivo model consisted of entire lungs in a different host species. The effect of cilial beating on influenza and superinfection remains to be determined and validated, for example through drug-based activation or inhibition.

The overarching theme of this study has been to study the pneumococcus and its interactions with the host in order to identify potential new therapeutic targets from a species-wide perspective. This was an attempt to maneuver around the constant cyclical patterns of current pneumococcal treatments, in the form of antibiotics and vaccines, resulting in resistance and vaccine escape, as only select strains are targeted. Although we identified novel conserved core candidates and novel avenues for intervention, they may not be sufficient to prevent such cyclical patterns. However, we do hope that they will help reduce the frequency at which therapeutics become obsolete. Such forms of microbial investigation should be applied to other species as well, utilizing in silico, in vitro and in/ex vivo methods of analysis, as multiple lines of evidence for targets will provide strong confidence for target selection and prioritization.

VI. References

1. D. van de Beek, J. de Gans, A. R. Tunkel, E. F. Wijdicks, Community-acquired bacterial meningitis in adults. N Engl J Med 354, 44-53 (2006).

2. C. Feldman, R. Anderson, Recent advances in the epidemiology and prevention of Streptococcus pneumoniae infections. F1000Res 9 (2020).

3. J. A. Hoffman et al., Streptococcus pneumoniae infections in the neonate. Pediatrics 112, 1095-1102 (2003).

4. J. Galante, A. C. Ho, S. Tingey, B. M. Charalambous, Quorum sensing and biofilms in the pathogen, Streptococcus pneumoniae. Curr Pharm Des 21, 25-30 (2015).

5. A. T. Shenoy et al., Streptococcus pneumoniae in the heart subvert the host response through biofilm-mediated resident macrophage killing. PLoS Pathog 13, e1006582 (2017).

6. M. R. Oggioni et al., Switch from planktonic to sessile life: a major event in pneumococcal pathogenesis. Mol Microbiol 61, 1196-1210 (2006).

7. A. B. Dalia, J. N. Weiser, Minimization of bacterial size allows for complement evasion and is overcome by the agglutinating effect of antibody. Cell Host Microbe 10, 486-496 (2011).

8. E. Shanker, M. J. Federle, Quorum Sensing Regulation of Competence and Bacteriocins in Streptococcus pneumoniae and mutans. Genes (Basel) 8 (2017).

9. K. Blanchette-Cain et al., Streptococcus pneumoniae biofilm formation is strain dependent, multifactorial, and associated with reduced invasiveness and immunoreactivity during colonization. mBio 4, e00745-00713 (2013).

10. K. A. Benton, J. C. Paton, D. E. Briles, Differences in virulence for mice among Streptococcus pneumoniae strains of capsular types 2, 3, 4, 5, and 6 are not attributable to differences in pneumolysin production. Infect Immun 65, 1237-1244 (1997).

11. A. M. Mitchell, T. J. Mitchell, Streptococcus pneumoniae: virulence factors and variation. Clin Microbiol Infect 16, 411-418 (2010).

12. B. Maestro, J. M. Sanz, Choline Binding Proteins from Streptococcus pneumoniae: A Dual Role as Enzybiotics and Targets for the Design of New Antimicrobials. Antibiotics (Basel) 5 (2016).

13. I. P.-D. Sergio Galán-Bartual, Pedro García, Juan A. Hermoso, "Streptococcus Pneumoniae Molecular Mechanisms of Host-Pathogen Interactions", S. H. Jeremy Brown, Carlos Orihuela, Ed. (Academic Press, 2015), https://doi.org/10.1016/B978-0-12-410530-0.00011-9, pp. 207-230.

14. C. Marion et al., Streptococcus pneumoniae can utilize multiple sources of hyaluronic acid for growth. Infect Immun 80, 1390-1398 (2012).

15. K. A. Blanchette et al., Neuraminidase A-Exposed Galactose Promotes Streptococcus pneumoniae Biofilm Formation during Colonization. Infect Immun 84, 2922-2932 (2016).

16. M. Bek-Thomsen, K. Poulsen, M. Kilian, Occurrence and evolution of the paralogous zinc metalloproteases IgA1 protease, ZmpB, ZmpC, and ZmpD in Streptococcus pneumoniae and related commensal species. mBio 3 (2012).

17. E. N. Janoff et al., Pneumococcal IgA1 protease subverts specific protection by human IgA1. Mucosal Immunol 7, 249-256 (2014).

18. C. E. Blue et al., ZmpB, a novel virulence factor of Streptococcus pneumoniae that induces tumor necrosis factor alpha production in the respiratory tract. Infect Immun 71, 4925-4935 (2003).

19. R. Camilli et al., Zinc metalloproteinase genes in clinical isolates of Streptococcus pneumoniae: association of the full array with a clonal cluster comprising serotypes 8 and 11A. Microbiology (Reading) 152, 313-321 (2006).

20. M. Yamaguchi et al., Zinc metalloproteinase ZmpC suppresses experimental pneumococcal meningitis by inhibiting bacterial invasion of central nervous systems. Virulence 8, 1516-1524 (2017).

21. J. C. Paton, C. Trappetti, Streptococcus pneumoniae Capsular Polysaccharide. Microbiol Spectr 7 (2019).

22. U. B. Sorensen, J. Henrichsen, H. C. Chen, S. C. Szu, Covalent linkage between the capsular polysaccharide and the cell wall peptidoglycan of Streptococcus pneumoniae revealed by immunochemical methods. Microb Pathog 8, 325-334 (1990).

23. J. N. Luck, H. Tettelin, C. J. Orihuela, Sugar-Coated Killer: Serotype 3 Pneumococcal Disease. Frontiers in Cellular and Infection Microbiology 10 (2020).

24. J. N. Luck, H. Tettelin, C. J. Orihuela, Sugar-Coated Killer: Serotype 3 Pneumococcal Disease. Front Cell Infect Microbiol 10, 613287 (2020).

25. A. L. Nelson et al., Capsule enhances pneumococcal colonization by limiting mucus-mediated clearance. Infect Immun 75, 83-90 (2007).

26. L. C. Mac, M. R. Kraus, Relation of virulence of pneumococcal strains for mice to the quantity of capsular polysaccharide formed in vitro. J Exp Med 92, 1-9 (1950).

27. J. Li, J. R. Zhang, Phase Variation of Streptococcus pneumoniae. Microbiol Spectr 7 (2019).

28. N. Patenge, T. Fiedler, B. Kreikemeyer, Common regulators of virulence in streptococci. Curr Top Microbiol Immunol 368, 111-153 (2013).

29. G. K. Paterson, C. E. Blue, T. J. Mitchell, Role of two-component systems in the virulence of Streptococcus pneumoniae. J Med Microbiol 55, 355-363 (2006).

30. A. Gomez-Mejia, G. Gamez, S. Hammerschmidt, Streptococcus pneumoniae two-component regulatory systems: The interplay of the pneumococcus with its environment. Int J Med Microbiol 308, 722-737 (2018).

31. G. Salvadori, R. Junges, D. A. Morrison, F. C. Petersen, Competence in Streptococcus pneumoniae and Close Commensal Relatives: Mechanisms and Implications. Front Cell Infect Microbiol 9, 94 (2019).

32. O. T. Avery, C. M. MacLeod, M. McCarty, Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types: Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type III. J Exp Med 79, 137-158 (1944).

33. J. C. Mell, R. J. Redfield, Natural competence and the evolution of DNA uptake specificity. J Bacteriol 196, 1471-1483 (2014).

34. J. P. Claverys, B. Martin, P. Polard, The genetic transformation machinery: composition, localization, and mechanism. FEMS Microbiol Rev 33, 643-656 (2009).

35. S. D. Bentley et al., Genetic analysis of the capsular biosynthetic locus from all 90 pneumococcal serotypes. PLoS Genet 2, e31 (2006).

36. M. S. Gilmore, W. Haas, The selective advantage of microbial fratricide. Proc Natl Acad Sci U S A 102, 8401-8402 (2005).

37. K. L. Wyres et al., The multidrug-resistant PMEN1 pneumococcus is a paradigm for genetic success. Genome Biol 13, R103 (2012).

38. L. A. Cowley et al., Evolution via recombination: Cell-to-cell contact facilitates larger recombination events in Streptococcus pneumoniae. PLoS Genet 14, e1007410 (2018).

39. C. M. Buckwalter, S. J. King, Pneumococcal carbohydrate transport: food for thought. Trends Microbiol 20, 517-522 (2012).

40. W. T. Hendriksen et al., CodY of Streptococcus pneumoniae: link between nutritional gene regulation and colonization. J Bacteriol 190, 590-601 (2008).

41. S. M. Carvalho, T. G. Kloosterman, O. P. Kuipers, A. R. Neves, CcpA ensures optimal metabolic fitness of Streptococcus pneumoniae. PLoS One 6, e26707 (2011).

42. S. Basavanna et al., Screening of Streptococcus pneumoniae ABC transporter mutants demonstrates that LivJHMGF, a branched-chain amino acid ABC transporter, is necessary for disease pathogenesis. Infect Immun 77, 3412-3423 (2009).

43. A. Leonard, M. Lalk, Infection and metabolism - Streptococcus pneumoniae metabolism facing the host environment. Cytokine 112, 75-86 (2018).

44. D. Bogaert, R. De Groot, P. W. Hermans, Streptococcus pneumoniae colonisation: the key to pneumococcal disease. Lancet Infect Dis 4, 144-154 (2004).

45. A. Kadioglu, J. N. Weiser, J. C. Paton, P. W. Andrew, The role of Streptococcus pneumoniae virulence factors in host respiratory colonization and disease. Nat Rev Microbiol 6, 288-301 (2008).

46. C. C. Kietzman, G. Gao, B. Mann, L. Myers, E. I. Tuomanen, Dynamic capsule restructuring by the main pneumococcal autolysin LytA in response to the epithelium. Nat Commun 7, 10859 (2016).

47. A. J. Loughran, C. J. Orihuela, E. I. Tuomanen, Streptococcus pneumoniae: Invasion and Inflammation. Microbiol Spectr 7 (2019).

48. J. N. Radin et al., beta-Arrestin 1 participates in platelet-activating factor receptor-mediated endocytosis of Streptococcus pneumoniae. Infect Immun 73, 7827-7835 (2005).

49. F. Iovino, J. Seinen, B. Henriques-Normark, J. M. van Dijl, How Does Streptococcus pneumoniae Invade the Brain? Trends Microbiol 24, 307-315 (2016).

50. A. T. Shenoy et al., Severity and properties of cardiac damage caused by Streptococcus pneumoniae are strain dependent. PLoS One 13, e0204032 (2018).

51. M. Piastra et al., Kidney injury owing to Streptococcus pneumoniae infection in critically ill infants and children: report of four cases. Paediatr Int Child Health 36, 282-287 (2016).

52. J. M. Garcia-Lechuz et al., Streptococcus pneumoniae skin and soft tissue infections: characterization of causative strains and clinical illness. Eur J Clin Microbiol Infect Dis 26, 247-253 (2007).

53. N. Itoh, I. Kawamura, M. Tsukahara, K. Mori, H. Kurai, Evaluation of Streptococcus pneumoniae in bile samples: A case series review. J Infect Chemother 22, 383-386 (2016).

54. I. Korona-Glowniak et al., Resistant Streptococcus pneumoniae strains in children with acute otitis media- high risk of persistent colonization after treatment. BMC Infect Dis 18, 478 (2018).

55. D. Miron, I. Dashkovsky, M. Zuker, S. Szvalb, C. Cozacov, Primary Streptococcus pneumoniae appendicitis in a child: case report and review. Pediatr Infect Dis J 22, 282-284 (2003).

56. K. R. Short, M. N. Habets, P. W. Hermans, D. A. Diavatopoulos, Interactions between Streptococcus pneumoniae and influenza virus: a mutually beneficial relationship? Future Microbiol 7, 609-624 (2012).

57. S. Ghosh, D. Gregory, A. Smith, L. Kobzik, MARCO regulates early inflammatory responses against influenza: a useful macrophage function with adverse outcome. Am J Respir Cell Mol Biol 45, 1036-1044 (2011).

58. J. M. Shim, J. Kim, T. Tenson, J. Y. Min, D. E. Kainov, Influenza Virus Infection, Interferon Response, Viral Counter-Response, and Apoptosis. Viruses 9 (2017).

59. L. R. Marks, B. A. Davidson, P. R. Knight, A. P. Hakansson, Interkingdom signaling induces Streptococcus pneumoniae biofilm dispersion and transition from asymptomatic colonization to disease. mBio 4 (2013).

60. C. C. Lai, C. Y. Wang, P. R. Hsueh, Co-infections among patients with COVID-19: The need for combination therapy with non-anti-SARS-CoV-2 agents? J Microbiol Immunol Infect 53, 505-512 (2020).

61. J. C. Brealey et al., Streptococcus pneumoniae colonization of the nasopharynx is associated with increased severity during respiratory syncytial virus infection in young children. Respirology 23, 220-227 (2018).

62. J. M. Toombs, K. Van den Abbeele, J. Democratis, A. K. J. Mandal, C. G. Missouris, Pneumococcal coinfection in COVID-19 patients. J Med Virol 10.1002/jmv.26278 (2020).

63. K. A. Murrah et al., Nonencapsulated Streptococcus pneumoniae causes otitis media during single-species infection and during polymicrobial infection with nontypeable Haemophilus influenzae. Pathog Dis 73 (2015).

64. Q. Vermee et al., Biofilm production by Haemophilus influenzae and Streptococcus pneumoniae isolated from the nasopharynx of children with acute otitis media. BMC Infect Dis 19, 44 (2019).

65. V. Avadhanula, Y. Wang, A. Portner, E. Adderson, Nontypeable Haemophilus influenzae and Streptococcus pneumoniae bind respiratory syncytial virus glycoprotein. J Med Microbiol 56, 1133-1137 (2007).

66. T. van der Poll, S. M. Opal, Pathogenesis, treatment, and prevention of pneumococcal pneumonia. Lancet 374, 1543-1556 (2009).

67. F. Baquero, G. Baquero-Artigao, R. Canton, C. Garcia-Rey, Antibiotic consumption and resistance selection in Streptococcus pneumoniae. J Antimicrob Chemother 50 Suppl S2, 27-37 (2002).

68. L. Aguilar, M. J. Gimenez, C. Garcia-Rey, J. E. Martin, New strategies to overcome antimicrobial resistance in Streptococcus pneumoniae with beta-lactam antibiotics. J Antimicrob Chemother 50 Suppl S2, 93-100 (2002).

69. M. R. Schroeder, D. S. Stephens, Macrolide Resistance in Streptococcus pneumoniae. Front Cell Infect Microbiol 6, 98 (2016).

70. R. Cherazard et al., Antimicrobial Resistant Streptococcus pneumoniae: Prevalence, Mechanisms, and Clinical Implications. Am J Ther 24, e361-e369 (2017).

71. L. S. Havarstein, P. Gaustad, I. F. Nes, D. A. Morrison, Identification of the streptococcal competence-pheromone receptor. Mol Microbiol 21, 863-869 (1996).

72. G. Pozzi et al., Competence for genetic transformation in encapsulated strains of Streptococcus pneumoniae: two allelic variants of the peptide pheromone. J Bacteriol 178, 6087-6090 (1996).

73. R. R. Reinert, The antimicrobial resistance profile of Streptococcus pneumoniae. Clin Microbiol Infect 15 Suppl 3, 7-11 (2009).

74. R. R. Reinert, Resistance phenotypes and multi-drug resistance in Streptococcus pneumoniae (PROTEKT years 1-3 [1999-2002]). J Chemother 16 Suppl 6, 35-48 (2004).

75. Anonymous, Active Bacterial Core Surveillance reports (2018).

76. L. Opatowski et al., Antibiotic dose impact on resistance selection in the community: a mathematical model of beta-lactams and Streptococcus pneumoniae dynamics. Antimicrob Agents Chemother 54, 2330-2337 (2010).

77. K. P. Klugman, S. Black, Impact of existing vaccines in reducing antibiotic resistance: Primary and secondary effects. Proc Natl Acad Sci U S A 115, 12896-12901 (2018).

78. D. E. Briles, J. C. Paton, R. Mukerji, E. Swiatlo, M. J. Crain, Pneumococcal Vaccines. Microbiol Spectr 7 (2019).

79. N. J. Croucher et al., Selective and genetic constraints on pneumococcal serotype switching. PLoS Genet 11, e1005095 (2015).

80. Y. J. Lu et al., GMP-grade pneumococcal whole-cell vaccine injected subcutaneously protects mice from nasopharyngeal colonization and fatal aspiration-sepsis. Vaccine 28, 7468-7475 (2010).

81. M. Bittaye, P. Cash, Streptococcus pneumoniae proteomics: determinants of pathogenesis and vaccine development. Expert Rev Proteomics 12, 607-621 (2015).

82. M. Kilian, H. Tettelin, Identification of Virulence-Associated Properties by Comparative Genome Analysis of Streptococcus pneumoniae, S. pseudopneumoniae, S. mitis, Three S. oralis Subspecies, and S. infantis. mBio 10, e01985-01919 (2019).

83. U. Koppe, N. Suttorp, B. Opitz, Recognition of Streptococcus pneumoniae by the innate immune system. Cell Microbiol 14, 460-466 (2012).

84. F. F. Chiu et al., Domain 4 of pneumolysin from Streptococcus pneumoniae is a multifunctional domain contributing TLR4 activating and hemolytic activity. Biochem Biophys Res Commun 517, 596-602 (2019).

85. E. Gowin et al., Analysis of TLR2, TLR4, and TLR9 single nucleotide polymorphisms in children with bacterial meningitis and their healthy family members. Int J Infect Dis 60, 23-28 (2017).

86. M. Wang, L. Wang, L. Fang, S. Li, R. Liu, NLRC5 negatively regulates LTA-induced inflammation via TLR2/NF-kappaB and participates in TLR2-mediated allergic airway inflammation. J Cell Physiol 234, 19990-20001 (2019).

87. E. A. Joyce, S. J. Popper, S. Falkow, Streptococcus pneumoniae nasopharyngeal colonization induces type I interferons and interferon-induced gene expression. BMC Genomics 10, 404 (2009).

88. B. B. Maier et al., Type I interferon promotes alveolar epithelial type II cell survival during pulmonary Streptococcus pneumoniae infection and sterile lung injury in mice. Eur J Immunol 46, 2175-2186 (2016).

89. L. R. K. Brooks, G. I. Mias, Streptococcus pneumoniae's Virulence and Host Immunity: Aging, Diagnostics, and Prevention. Front Immunol 9, 1366 (2018).

90. H. Tettelin et al., Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 102, 13950-13955 (2005).

91. H. Tettelin, D. Medini, The pangenome: diversity, dynamics and evolution of genomes (2020).

92. H. Tettelin, D. Riley, C. Cattuto, D. Medini, Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol 11, 472-477 (2008).

93. C. Donati et al., Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol 11, R107 (2010).

94. N. J. Croucher et al., Role of conjugative elements in the evolution of the multidrug-resistant pandemic clone Streptococcus pneumoniae Spain23F ST81. J Bacteriol 191, 1480-1489 (2009).

95. S. T. Chancey et al., Composite mobile genetic elements disseminating macrolide resistance in Streptococcus pneumoniae. Front Microbiol 6, 26 (2015).

96. V. Ramos, C. Duarte, A. Diaz, J. Moreno, [Mobile genetic elements associated with erythromycin-resistant isolates of Streptococcus pneumoniae in Colombia]. Biomedica 34 Suppl 1, 209-216 (2014).

97. P. Garcia, A. C. Martin, R. Lopez, Bacteriophages of Streptococcus pneumoniae: a molecular approach. Microb Drug Resist 3, 165-176 (1997).

98. R. A. Gladstone et al., International genomic definition of pneumococcal lineages, to contextualise disease, antibiotic resistance and vaccine impact. EBioMedicine 43, 338-346 (2019).

99. A. B. Sabin, Immediate pneumococcus typing directly from sputum by the Neufeld reaction. Journal of the American Medical Association 100, 1584-1586 (1933).

100. D. S. Billal, M. Hotomi, S. Tasnim, K. Fujihara, N. Yamanaka, Evaluation of serotypes of Streptococcus pneumoniae isolated from otitis media patients by multiplex polymerase chain reaction. ORL J Otorhinolaryngol Relat Spec 68, 135-138 (2006).

101. M. H. Leung et al., Sequetyping: serotyping Streptococcus pneumoniae by a single PCR sequencing strategy. J Clin Microbiol 50, 2419-2427 (2012).

102. L. Gonzales-Siles et al., Identification and capsular serotype sequetyping of Streptococcus pneumoniae strains. J Med Microbiol 68, 1173-1188 (2019).

103. F. Ganaie et al., A New Pneumococcal Capsule Type, 10D, is the 100th Serotype and Has a Large cps Fragment from an Oral Streptococcus. mBio 11 (2020).

104. R. E. Rayner, J. Savill, L. M. Hafner, F. Huygens, Genotyping Streptococcus pneumoniae. Future Microbiol 10, 653-664 (2015).

105. Y. Chao, L. R. Marks, M. M. Pettigrew, A. P. Hakansson, Streptococcus pneumoniae biofilm formation and dispersion during colonization and disease. Front Cell Infect Microbiol 4, 194 (2014).

106. F. Cools et al., Streptococcus pneumoniae galU gene mutation has a direct effect on biofilm growth, adherence and phagocytosis in vitro and pathogenicity in vivo. Pathog Dis 76 (2018).

107. T. Brissac, C. J. Orihuela, In Vitro Adhesion, Invasion, and Transcytosis of Streptococcus pneumoniae with Host Cells. Methods Mol Biol 1968, 137-146 (2019).

108. D. K. Hu, Y. Liu, X. Y. Li, Y. Qu, In vitro expression of Streptococcus pneumoniae ply gene in human monocytes and pneumocytes. Eur J Med Res 20, 52 (2015).

109. R. Aprianto, J. Slager, S. Holsappel, J. W. Veening, High-resolution analysis of the pneumococcal transcriptome under a wide range of infection-relevant conditions. Nucleic Acids Res 46, 9990-10006 (2018).

110. D. Chiavolini, G. Pozzi, S. Ricci, Animal models of Streptococcus pneumoniae disease. Clin Microbiol Rev 21, 666-685 (2008).

111. L. F. Reyes et al., A Non-Human Primate Model of Severe Pneumococcal Pneumonia. PLoS One 11, e0166092 (2016).

112. H. M. Rowe et al., Bacterial Factors Required for Transmission of Streptococcus pneumoniae in Mammalian Hosts. Cell Host Microbe 25, 884-891.e6 (2019).

113. J. D. Brand et al., Influenza-mediated reduction of lung epithelial ion channel activity leads to dysregulated pulmonary fluid homeostasis. JCI Insight 3 (2018).

114. F. Schutgens, H. Clevers, Human Organoids: Tools for Understanding Biology and Treating Diseases. Annu Rev Pathol 15, 211-234 (2020).

115. F. Z. Hu et al., Deletion of genes involved in the ketogluconate metabolism, Entner-Doudoroff pathway, and glucose dehydrogenase increase local and invasive virulence phenotypes in Streptococcus pneumoniae. PLoS One 14, e0209688 (2019).

116. I. Schweizer et al., New Aspects of the Interplay between Penicillin Binding Proteins, murM, and the Two-Component System CiaRH of Penicillin-Resistant Streptococcus pneumoniae Serotype 19A Isolates from Hungary. Antimicrob Agents Chemother 61 (2017).

117. A. Polissi et al., Large-scale identification of virulence genes from Streptococcus pneumoniae. Infect Immun 66, 5620-5629 (1998).

118. S. Z. Kimaro Mlacha et al., Phenotypic, genomic, and transcriptional characterization of Streptococcus pneumoniae interacting with human pharyngeal cells. BMC Genomics 14, 383 (2013).

119. N. D. Ritchie, T. J. Evans, Dual RNA-seq in Streptococcus pneumoniae Infection Reveals Compartmentalized Neutrophil Responses in Lung and Pleural Space. mSystems 4 (2019).

120. J. Slager, R. Aprianto, J. W. Veening, Refining the Pneumococcal Competence Regulon by RNA Sequencing. J Bacteriol 201 (2019).

121. C. J. Heath et al., Co-Transcriptomes of Initial Interactions In Vitro between Streptococcus Pneumoniae and Human Pleural Mesothelial Cells. PLoS One 10, e0142773 (2015).

122. X. Liu et al., Exploration of Bacterial Bottlenecks and Streptococcus pneumoniae Pathogenesis by CRISPRi-Seq. Cell Host Microbe 10.1016/j.chom.2020.10.001 (2020).

123. L. M. Verhagen et al., Genome-wide identification of genes essential for the survival of Streptococcus pneumoniae in human saliva. PLoS One 9, e89541 (2014).

124. I. Warrier et al., The Transcriptional landscape of Streptococcus pneumoniae TIGR4 reveals a complex operon architecture and abundant riboregulation critical for growth and virulence. PLoS Pathog 14, e1007461 (2018).

125. M. Dalsass, A. Brozzi, D. Medini, R. Rappuoli, Comparison of Open-Source Reverse Vaccinology Programs for Bacterial Vaccine Antigen Discovery. Front Immunol 10, 113 (2019).

126. A. P. Argondizzo et al., Identification of proteins in Streptococcus pneumoniae by reverse vaccinology and genetic diversity of these proteins in clinical isolates. Appl Biochem Biotechnol 175, 2124-2165 (2015).

127. S. Talukdar, S. Zutshi, K. S. Prashanth, K. K. Saikia, P. Kumar, Identification of potential vaccine candidates against Streptococcus pneumoniae by reverse vaccinology approach. Appl Biochem Biotechnol 172, 3026-3041 (2014).

128. L. D. Mamede et al., Reverse and structural vaccinology approach to design a highly immunogenic multi-epitope subunit vaccine against Streptococcus pneumoniae infection. Infect Genet Evol 85, 104473 (2020).

129. I. Delany, R. Rappuoli, K. L. Seib, Vaccines, reverse vaccinology, and bacterial pathogenesis. Cold Spring Harb Perspect Med 3, a012476 (2013).

130. A. Sette, R. Rappuoli, Reverse vaccinology: developing vaccines in the era of genomics. Immunity 33, 530-541 (2010).

131. R. Rappuoli, Conjugates and reverse vaccinology to eliminate bacterial meningitis. Vaccine 19, 2319-2322 (2001).

132. I. A. Doytchinova, D. R. Flower, VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8, 4 (2007).

133. A. Tan, J. M. Atack, M. P. Jennings, K. L. Seib, The Capricious Nature of Bacterial Pathogens: Phasevarions and Vaccine Development. Front Immunol 7, 586 (2016).

134. V. Jaiswal, S. K. Chanumolu, A. Gupta, R. S. Chauhan, C. Rout, Jenner-predict server: prediction of protein vaccine candidates (PVCs) in bacteria based on host-pathogen interactions. BMC Bioinformatics 14, 211 (2013).

135. H. R. Ansari, D. R. Flower, G. P. Raghava, AntigenDB: an immunoinformatics database of pathogen antigens. Nucleic Acids Res 38, D847-853 (2010).

136. Y. He, Z. Xiang, H. L. Mobley, Vaxign: the first web-based vaccine design program for reverse vaccinology and applications for vaccine development. J Biomed Biotechnol 2010, 297505 (2010).

137. A. C. Perez, T. F. Murphy, A Moraxella catarrhalis vaccine to protect against otitis media and exacerbations of COPD: An update on current progress and challenges. Hum Vaccin Immunother 13, 2322-2331 (2017).

138. M. M. Pettigrew et al., Haemophilus influenzae genome evolution during persistence in the human airways in chronic obstructive pulmonary disease. Proc Natl Acad Sci U S A 115, E3256-E3265 (2018).

139. B. Yang, S. Sayers, Z. Xiang, Y. He, Protegen: a web-based protective antigen database and analysis system. Nucleic Acids Res 39, D1073-1078 (2011).

140. N. Y. Yu et al., PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics 26, 1608-1615 (2010).

141. J. Orvis et al., Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics 26, 1488-1492 (2010).

142. W. J. Kent, BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664 (2002).

143. W. Fleri et al., The Immune Epitope Database and Analysis Resource in Epitope Discovery and Synthetic Vaccine Design. Front Immunol 8, 278 (2017).

144. M. Ashburner et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29 (2000).

145. E. M. Gertz, Y. K. Yu, R. Agarwala, A. A. Schaffer, S. F. Altschul, Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4, 41 (2006).

146. L. Li, C. J. Stoeckert, Jr., D. S. Roos, OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178-2189 (2003).

147. I. G. Charles et al., Molecular cloning and characterization of protective outer membrane protein P.69 from Bordetella pertussis. Proc Natl Acad Sci U S A 86, 3554-3558 (1989).

148. M. De Chiara et al., Genome sequencing of disease and carriage isolates of nontypeable Haemophilus influenzae identifies discrete population structure. Proc Natl Acad Sci U S A 111, 5439-5444 (2014).

149. S. Vivona, F. Bernante, F. Filippini, NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol 6, 35 (2006).

150. D. Serruto, M. J. Bottomley, S. Ram, M. M. Giuliani, R. Rappuoli, The new multicomponent vaccine against meningococcal serogroup B, 4CMenB: immunological, functional and structural characterization of the antigens. Vaccine 30 Suppl 2, B87-97 (2012).

151. A. S. Juncker et al., Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci 12, 1652-1662 (2003).

152. S. Sethi, N. Evans, B. J. Grant, T. F. Murphy, New strains of bacteria and exacerbations of chronic obstructive pulmonary disease. N Engl J Med 347, 465-471 (2002).

153. T. F. Murphy, A. L. Brauer, B. J. Grant, S. Sethi, Moraxella catarrhalis in chronic obstructive pulmonary disease: burden of disease and immune response. Am J Respir Crit Care Med 172, 195-199 (2005).

154. C. P. Ahearn, M. C. Gallo, T. F. Murphy, Insights on persistent airway infection by non-typeable Haemophilus influenzae in chronic obstructive pulmonary disease. Pathog Dis 75 (2017).

155. B. D. Ondov et al., Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016).

156. M. R. Davies et al., Atlas of group A streptococcal vaccine candidates compiled using large-scale comparative genomics. Nat Genet 51, 1035-1043 (2019).

157. A. Leimbach, J. Hacker, U. Dobrindt, E. coli as an all-rounder: the thin line between commensalism and pathogenicity. Curr Top Microbiol Immunol 358, 3-32 (2013).

158. M. D. B. van de Garde, E. van Westen, M. C. M. Poelen, N. Y. Rots, C. van Els, Prediction and Validation of Immunogenic Domains of Pneumococcal Proteins Recognized by Human CD4(+) T Cells. Infect Immun 87 (2019).

159. A. Krogh, B. Larsson, G. von Heijne, E. L. Sonnhammer, Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567-580 (2001).

160. T. N. Petersen, S. Brunak, G. von Heijne, H. Nielsen, SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8, 785-786 (2011).

161. G. Sachdeva, K. Kumar, P. Jain, S. Ramachandran, SPAAN: a software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics 21, 483-491 (2005).

162. A. Prakash, M. Jeffryes, A. Bateman, R. D. Finn, The HMMER Web Server for Protein Sequence Similarity Search. Curr Protoc Bioinformatics 60, 3.15.1-3.15.23 (2017).

163. S. El-Gebali et al., The Pfam protein families database in 2019. Nucleic Acids Res 47, D427-D432 (2019).

164. D. H. Haft et al., TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res 41, D387-395 (2013).

165. P. Rice, I. Longden, A. Bleasby, EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, 276-277 (2000).

166. D. R. Riley, S. V. Angiuoli, J. Crabtree, J. C. Dunning Hotopp, H. Tettelin, Using Sybil for interactive comparative genomics of microbes on the web. Bioinformatics 28, 160-166 (2012).

167. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J Mol Biol 215, 403-410 (1990).

168. D. E. Fouts, L. Brinkac, E. Beck, J. Inman, G. Sutton, PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res 40, e172 (2012).

169. J. W. Sahl, J. G. Caporaso, D. A. Rasko, P. Keim, The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes. PeerJ 2, e332 (2014).

170. E. Siena et al., In-silico prediction and deep-DNA sequencing validation indicate phase variation in 115 Neisseria meningitidis genes. BMC Genomics 17, 843 (2016).

171. M. G. Langille, F. S. Brinkman, IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics 25, 664-665 (2009).

172. Y. Zhao, H. Tang, Y. Ye, RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28, 125-126 (2012).

173. S. Kumar, G. Stecher, K. Tamura, MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets. Mol Biol Evol 33, 1870-1874 (2016).

174. E. Paradis, J. Claude, K. Strimmer, APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289-290 (2004).

175. S. Kumar, G. Stecher, M. Li, C. Knyaz, K. Tamura, MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol Biol Evol 35, 1547-1549 (2018).

176. P. Kadam, S. Bhalerao, Sample size calculation. Int J Ayurveda Res 1, 55-57 (2010).

177. E. Demidenko, The next-generation K-means algorithm. Stat Anal Data Min 11, 153-166 (2018).

178. Y. Gong et al., Immunization with a ZmpB-based protein vaccine could protect against pneumococcal diseases in mice. Infect Immun 79, 867-878 (2011).

179. R. Mukerji et al., The diversity of the proline-rich domain of pneumococcal surface protein A (PspA): Potential relevance to a broad-spectrum vaccine. Vaccine 36, 6834-6843 (2018).

180. D. H. Engholm, M. Kilian, D. S. Goodsell, E. S. Andersen, R. S. Kjaergaard, A visual review of the human pathogen Streptococcus pneumoniae. FEMS Microbiol Rev 41, 854-879 (2017).

181. B. Wahl et al., Burden of Streptococcus pneumoniae and Haemophilus influenzae type b disease in children in the era of conjugate vaccines: global, regional, and national estimates for 2000-15. Lancet Glob Health 6, e744-e757 (2018).

182. P. P. Gounder et al., Impact of the Pneumococcal Conjugate Vaccine and Antibiotic Use on Nasopharyngeal Colonization by Antibiotic Nonsusceptible Streptococcus pneumoniae, Alaska, 2000-2010. Pediatr Infect Dis J 34, 1223-1229 (2015).

183. C. Feldman, R. Anderson, Bacteraemic pneumococcal pneumonia: current therapeutic options. Drugs 71, 131-153 (2011).

184. R. Kaur, M. Pham, K. O. A. Yu, M. E. Pichichero, Rising Pneumococcal Antibiotic Resistance in the Post 13-valent Pneumococcal Conjugate Vaccine Era in Pediatric Isolates from a Primary Care Setting. Clin Infect Dis 10.1093/cid/ciaa157, ciaa157 (2020).

185. J. Sempere, S. de Miguel, F. Gonzalez-Camacho, J. Yuste, M. Domenech, Clinical Relevance and Molecular Pathogenesis of the Emerging Serotypes 22F and 33F of Streptococcus pneumoniae in Spain. Front Microbiol 11, 309 (2020).

186. S. T. Huang et al., Pneumococcal pneumonia infection is associated with end-stage renal disease in adult hospitalized patients. Kidney Int 86, 1023-1030 (2014).

187. D. L. Hava, A. Camilli, Large-scale identification of serotype 4 Streptococcus pneumoniae virulence factors. Mol Microbiol 45, 1389-1406 (2002).

188. C. J. Orihuela et al., Microarray analysis of pneumococcal gene expression during invasive disease. Infect Immun 72, 5582-5596 (2004).

189. V. Minhas et al., In vivo dual RNA-seq reveals that neutrophil recruitment underlies differential tissue tropism of Streptococcus pneumoniae. Commun Biol 3, 293 (2020).

190. T. van Opijnen, A. Camilli, A fine scale phenotype-genotype virulence map of a bacterial pathogen. Genome Res 22, 2541-2551 (2012).

191. M. Kanehisa, Y. Sato, M. Kawashima, M. Furumichi, M. Tanabe, KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44, D457-462 (2016).

192. S. Baur, J. Marles-Wright, S. Buckenmaier, R. J. Lewis, W. Vollmer, Synthesis of CDP-activated ribitol for teichoic acid precursors in Streptococcus pneumoniae. J Bacteriol 191, 1200-1210 (2009).

193. K. E. Bruce, B. E. Rued, H. T. Tsui, M. E. Winkler, The Opp (AmiACDEF) Oligopeptide Transporter Mediates Resistance of Serotype 2 Streptococcus pneumoniae D39 to Killing by Chemokine CXCL10 and Other Antimicrobial Peptides. J Bacteriol 200, e00745-00717 (2018).

194. A. N. Riegler, T. Brissac, N. Gonzalez-Juarbe, C. J. Orihuela, Necroptotic Cell Death Promotes Adaptive Immunity Against Colonizing Pneumococci. Front Immunol 10, 615 (2019).

195. J. K. Lemon, M. R. Miller, J. N. Weiser, Sensing of interleukin-1 cytokines during Streptococcus pneumoniae colonization contributes to macrophage recruitment and bacterial clearance. Infect Immun 83, 3204-3212 (2015).

196. W. M. Schneider, M. D. Chevillotte, C. M. Rice, Interferon-stimulated genes: a complex web of host defenses. Annu Rev Immunol 32, 513-545 (2014).

197. M. Afzal, S. Shafeeq, O. P. Kuipers, Ascorbic acid-dependent gene expression in Streptococcus pneumoniae and the activator function of the transcriptional regulator UlaR2. Front Microbiol 6, 72 (2015).

198. C. Marion, A. M. Burnaugh, S. A. Woodiga, S. J. King, Sialic acid transport contributes to pneumococcal colonization. Infect Immun 79, 1262-1269 (2011).

199. F. Titgemeyer, J. Reizer, A. Reizer, M. H. Saier, Jr., Evolutionary relationships between sugar kinases and transcriptional repressors in bacteria. Microbiology 140 (Pt 9), 2349-2354 (1994).

200. C. Marion et al., Identification of a pneumococcal glycosidase that modifies O-linked glycans. Infect Immun 77, 1389-1396 (2009).

201. C. Schulz et al., Regulation of the arginine deiminase system by ArgR2 interferes with arginine metabolism and fitness of Streptococcus pneumoniae. mBio 5, e01858-01814 (2014).

202. B. A. Eijkelkamp, C. A. McDevitt, T. Kitten, Manganese uptake and streptococcal virulence. Biometals 28, 491-508 (2015).

203. V. Agarwal et al., Streptococcus pneumoniae endopeptidase O (PepO) is a multifunctional plasminogen- and fibronectin-binding protein, facilitating evasion of innate immunity and invasion of host cells. J Biol Chem 288, 6849-6863 (2013).

204. K. S. LeMessurier, H. Hacker, L. Chi, E. Tuomanen, V. Redecke, Type I interferon protects against pneumococcal invasive disease by inhibiting bacterial transmigration across the lung. PLoS Pathog 9, e1003727 (2013).

205. M. F. Mian et al., Length of dsRNA (poly I:C) drives distinct innate immune responses, depending on the cell type. J Leukoc Biol 94, 1025-1036 (2013).

206. C. N. Morra, C. J. Orihuela, Anatomical site-specific immunomodulation by bacterial biofilms. Curr Opin Infect Dis 33, 238-243 (2020).

207. M. Kilian, H. Tettelin, Identification of Virulence-Associated Properties by Comparative Genome Analysis of Streptococcus pneumoniae, S. pseudopneumoniae, S. mitis, Three S. oralis Subspecies, and S. infantis. mBio 10, e01985-01919 (2019).

208. P. A. Knupp-Pereira, N. T. C. Marques, L. M. Teixeira, H. C. C. Povoa, F. P. G. Neves, Prevalence of PspA families and pilus islets among Streptococcus pneumoniae colonizing children before and after universal use of pneumococcal conjugate vaccines in Brazil. Braz J Microbiol 51, 419-425 (2020).

209. E. Akbari et al., Protective responses of an engineered PspA recombinant antigen against Streptococcus pneumoniae. Biotechnol Rep (Amst) 24, e00385 (2019).

210. M. A. Zafar et al., Identification of Pneumococcal Factors Affecting Pneumococcal Shedding Shows that the dlt Locus Promotes Inflammation and Transmission. mBio 10, e01032-01019 (2019).

211. V. S. Cooper et al., Experimental Evolution In Vivo To Identify Selective Pressures during Pneumococcal Colonization. mSystems 5, e00352-20 (2020).

212. L. K. Mahdi, M. B. Van der Hoek, E. Ebrahimie, J. C. Paton, A. D. Ogunniyi, Characterization of Pneumococcal Genes Involved in Bloodstream Invasion in a Mouse Model. PLoS One 10, e0141816 (2015).

213. J. S. Ruiz-Moreno et al., The cGAS/STING Pathway Detects Streptococcus pneumoniae but Appears Dispensable for Antipneumococcal Defense in Mice and Humans. Infect Immun 86, e00849-17 (2018).

214. A. Fama et al., Nucleic Acid-Sensing Toll-Like Receptors Play a Dominant Role in Innate Immune Recognition of Pneumococci. mBio 11, e00415-20 (2020).

215. X. Hu et al., Type I IFN expression is stimulated by cytosolic MtDNA released from pneumolysin-damaged mitochondria via the STING signaling pathway in macrophages. FEBS J 286, 4754-4768 (2019).

216. Y. Gao et al., Mitochondrial DNA Leakage Caused by Streptococcus pneumoniae Hydrogen Peroxide Promotes Type I IFN Expression in Lung Cells. Front Microbiol 10, 630 (2019).

217. A. P. West et al., Mitochondrial DNA stress primes the antiviral innate immune response. Nature 520, 553-557 (2015).

218. S. Skovbjerg et al., Intact Pneumococci Trigger Transcription of Interferon-Related Genes in Human Monocytes, while Fragmented, Autolyzed Bacteria Subvert This Response. Infect Immun 85, e00960-16 (2017).

219. S. D. Shukla et al., An antagonist of the platelet-activating factor receptor inhibits adherence of both nontypeable Haemophilus influenzae and Streptococcus pneumoniae to cultured human bronchial epithelial cells exposed to cigarette smoke. Int J Chron Obstruct Pulmon Dis 11, 1647-1655 (2016).

220. F. Iovino, G. Molema, J. J. Bijlsma, Streptococcus pneumoniae Interacts with pIgR expressed by the brain microvascular endothelium but does not co-localize with PAF receptor. PLoS One 9, e97914 (2014).

221. C. J. Orihuela et al., Laminin receptor initiates bacterial contact with the blood brain barrier in experimental meningitis models. J Clin Invest 119, 1638-1646 (2009).

222. V. E. Fernandes et al., The B-cell inhibitory receptor CD22 is a major factor in host resistance to Streptococcus pneumoniae infection. PLoS Pathog 16, e1008464 (2020).

223. L. Lu et al., Species-specific interaction of Streptococcus pneumoniae with human complement factor H. J Immunol 181, 7138-7146 (2008).

224. J. H. Kim et al., Monoacyl lipoteichoic acid from pneumococci stimulates human cells but not mouse cells. Infect Immun 73, 834-840 (2005).

225. S. Hammerschmidt, M. P. Tillig, S. Wolff, J. P. Vaerman, G. S. Chhatwal, Species-specific binding of human secretory component to SpsA protein of Streptococcus pneumoniae via a hexapeptide motif. Mol Microbiol 36, 726-736 (2000).

226. C. Elm et al., Ectodomains 3 and 4 of human polymeric Immunoglobulin receptor (hpIgR) mediate invasion of Streptococcus pneumoniae into the epithelium. J Biol Chem 279, 6296-6304 (2004).

227. M. R. Batten, B. W. Senior, M. Kilian, J. M. Woof, Amino acid sequence requirements in the hinge of human immunoglobulin A1 (IgA1) for cleavage by streptococcal IgA1 proteases. Infect Immun 71, 1462-1469 (2003).

228. V. Agarwal et al., Enolase of Streptococcus pneumoniae binds human complement inhibitor C4b-binding protein and contributes to complement evasion. J Immunol 189, 3575-3584 (2012).

229. N. Gonzalez-Juarbe et al., Bacterial Pore-Forming Toxins Promote the Activation of Caspases in Parallel to Necroptosis to Enhance Alarmin Release and Inflammation During Pneumonia. Sci Rep 8, 5846 (2018).

230. M. I. Love, W. Huber, S. Anders, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550 (2014).

231. A. Kramer, J. Green, J. Pollard, Jr., S. Tugendreich, Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics 30, 523-530 (2014).

232. A. Spandidos et al., A comprehensive collection of experimentally validated primers for Polymerase Chain Reaction quantitation of murine transcript abundance. BMC Genomics 9, 633 (2008).

233. A. Spandidos, X. Wang, H. Wang, B. Seed, PrimerBank: a resource of human and mouse PCR primer pairs for gene expression detection and quantification. Nucleic Acids Res 38, D792-799 (2010).

234. X. Wang, B. Seed, A PCR primer bank for quantitative gene expression analysis. Nucleic Acids Res 31, e154 (2003).

235. Y. Li, C. M. Thompson, M. Lipsitch, A modified Janus cassette (Sweet Janus) to improve allelic replacement efficiency by high-stringency negative selection in Streptococcus pneumoniae. PLoS One 9, e100510 (2014).

236. A. L. Bricker, A. Camilli, Transformation of a type 4 encapsulated strain of Streptococcus pneumoniae. FEMS Microbiol Lett 172, 131-135 (1999).

237. D. Kim, B. Langmead, S. L. Salzberg, HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357-360 (2015).

238. B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25 (2009).

239. S. Anders, P. T. Pyl, W. Huber, HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166-169 (2015).

240. E. Y. Klein et al., The frequency of influenza and bacterial coinfection: a systematic review and meta-analysis. Influenza Other Respir Viruses 10, 394-403 (2016).

241. C. R. MacIntyre et al., The role of pneumonia and secondary bacterial infection in fatal and serious outcomes of pandemic influenza A(H1N1)pdm09. BMC Infect Dis 18, 637 (2018).

242. S. Clohisey, J. K. Baillie, Host susceptibility to severe influenza A virus infection. Crit Care 23, 303 (2019).

243. W. W. Thompson et al., Mortality associated with influenza and respiratory syncytial virus in the United States. JAMA 289, 179-186 (2003).

244. D. M. Morens, J. K. Taubenberger, A. S. Fauci, Predominant role of bacterial pneumonia as a cause of death in pandemic influenza: implications for pandemic influenza preparedness. J Infect Dis 198, 962-970 (2008).

245. D. A. Diavatopoulos et al., Influenza A virus facilitates Streptococcus pneumoniae transmission and disease. FASEB J 24, 1789-1798 (2010).

246. H. M. Rowe et al., Respiratory Bacteria Stabilize and Promote Airborne Transmission of Influenza A Virus. mSystems 5 (2020).

247. X. Liu et al., Exploration of Bacterial Bottlenecks and Streptococcus pneumoniae Pathogenesis by CRISPRi-Seq. Cell Host Microbe 29, 107-120 e106 (2021).

248. N. M. Reinoso-Vizcaino et al., The pneumococcal two-component system SirRH is linked to enhanced intracellular survival of Streptococcus pneumoniae in influenza-infected pulmonary cells. PLoS Pathog 16, e1008761 (2020).

249. M. M. Pettigrew et al., Dynamic changes in the Streptococcus pneumoniae transcriptome during transition from biofilm formation to invasive disease upon influenza A virus infection. Infect Immun 82, 4607-4619 (2014).

250. G. Ambigapathy et al., Double-Edged Role of Interleukin 17A in Streptococcus pneumoniae Pathogenesis During Influenza Virus Coinfection. J Infect Dis 220, 902-912 (2019).

251. A. D'Mello et al., An in vivo atlas of host-pathogen transcriptomes during Streptococcus pneumoniae colonization and disease. Proc Natl Acad Sci U S A 117, 33507-33518 (2020).

252. S. X. Ge, D. Jung, R. Yao, ShinyGO: a graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628-2629 (2020).

253. A. M. Smith et al., Kinetics of coinfection with influenza A virus and Streptococcus pneumoniae. PLoS Pathog 9, e1003238 (2013).

254. N. Sharma-Chawla et al., Influenza A Virus Infection Predisposes Hosts to Secondary Infection with Different Streptococcus pneumoniae Serotypes with Similar Outcome but Serotype-Specific Manifestation. Infect Immun 84, 3445-3457 (2016).

255. K. Ouyang et al., Pretreatment of epithelial cells with live Streptococcus pneumoniae has no detectable effect on influenza A virus replication in vitro. PLoS One 9, e90066 (2014).

256. C. J. Sanchez et al., Biofilm and planktonic pneumococci demonstrate disparate immunoreactivity to human convalescent sera. BMC Microbiol 11, 245 (2011).

257. R. N. Allan et al., Pronounced metabolic changes in adaptation to biofilm growth by Streptococcus pneumoniae. PLoS One 9, e107015 (2014).

258. F. Iovino, G. Molema, J. J. Bijlsma, Platelet endothelial cell adhesion molecule-1, a putative receptor for the adhesion of Streptococcus pneumoniae to the vascular endothelium of the blood-brain barrier. Infect Immun 82, 3555-3566 (2014).

259. T. Kuri et al., Influenza A virus-mediated priming enhances cytokine secretion by human dendritic cells infected with Streptococcus pneumoniae. Cell Microbiol 15, 1385-1400 (2013).

260. K. S. LeMessurier, M. Tiwary, N. P. Morin, A. E. Samarasinghe, Respiratory Barrier as a Safeguard and Regulator of Defense Against Influenza A Virus and Streptococcus pneumoniae. Front Immunol 11, 3 (2020).

261. A. Schmid et al., Real-time analysis of cAMP-mediated regulation of ciliary motility in single primary human airway epithelial cells. J Cell Sci 119, 4176-4186 (2006).

262. M. L. Fulcher, S. Gabriel, K. A. Burns, J. R. Yankaskas, S. H. Randell, Well-differentiated human airway epithelial cell cultures. Methods Mol Med 107, 183-206 (2005).

263. J. Feng et al., Proteomic and transcriptomic analysis of linezolid resistance in Streptococcus pneumoniae. J Proteome Res 10, 4439-4452 (2011).

264. J. Feng et al., Genome sequencing of linezolid-resistant Streptococcus pneumoniae mutants reveals novel mechanisms of resistance. Genome Res 19, 1214-1223 (2009).

265. C. R. Lee, J. H. Lee, K. S. Park, B. C. Jeong, S. H. Lee, Quantitative proteomic view associated with resistance to clinically important antibiotics in Gram-positive bacteria: a systematic review. Front Microbiol 6, 828 (2015).

266. S. Manna et al., The transcriptomic response of Streptococcus pneumoniae following exposure to cigarette smoke extract. Sci Rep 8, 15716 (2018).

267. A. Olaya-Abril, I. Jimenez-Munguia, L. Gomez-Gascon, I. Obando, M. J. Rodriguez-Ortega, A Pneumococcal Protein Array as a Platform to Discover Serodiagnostic Antigens Against Infection. Mol Cell Proteomics 14, 2591-2608 (2015).

268. N. J. Croucher et al., Diverse evolutionary patterns of pneumococcal antigens identified by pangenome-wide immunological screening. Proc Natl Acad Sci U S A 114, E357-E366 (2017).

269. G. Gamez et al., The variome of pneumococcal virulence factors and regulators. BMC Genomics 19, 10 (2018).

270. Y. C. Hsieh et al., Establishment of a young mouse model and identification of an allelic variation of zmpB in complicated pneumonia caused by Streptococcus pneumoniae. Crit Care Med 36, 1248-1255 (2008).

271. T. Gupalova et al., Development of experimental pneumococcal vaccine for mucosal immunization. PLoS One 14, e0218679 (2019).

272. G. A. Pandya et al., Monitoring the long-term molecular epidemiology of the pneumococcus and detection of potential 'vaccine escape' strains. PLoS One 6, e15950 (2011).

273. A. Olaya-Abril, I. Jimenez-Munguia, L. Gomez-Gascon, I. Obando, M. J. Rodriguez-Ortega, Identification of potential new protein vaccine candidates through pan-surfomic analysis of pneumococcal clinical isolates from adults. PLoS One 8, e70365 (2013).

274. H. Soualhine et al., A proteomic analysis of penicillin resistance in Streptococcus pneumoniae reveals a novel role for PstS, a subunit of the phosphate ABC transporter. Mol Microbiol 58, 1430-1440 (2005).

275. M. A. Zafar et al., Identification of Pneumococcal Factors Affecting Pneumococcal Shedding Shows that the dlt Locus Promotes Inflammation and Transmission. mBio 10 (2019).

276. M. Jomaa et al., Immunization with the iron uptake ABC transporter proteins PiaA and PiuA prevents respiratory infection with Streptococcus pneumoniae. Vaccine 24, 5133-5139 (2006).

277. R. H. Whalan et al., PiuA and PiaA, iron uptake lipoproteins of Streptococcus pneumoniae, elicit serotype independent antibody responses following human pneumococcal septicaemia. FEMS Immunol Med Microbiol 43, 73-80 (2005).

278. W. Y. Chan et al., A Novel, Multiple-Antigen Pneumococcal Vaccine Protects against Lethal Streptococcus pneumoniae Challenge. Infect Immun 87 (2019).

279. B. Kwambana-Adams et al., Rapid replacement by non-vaccine pneumococcal serotypes may mitigate the impact of the pneumococcal conjugate vaccine on nasopharyngeal bacterial ecology. Sci Rep 7, 8127 (2017).

280. K. Hanada, Y. Yamaoka, Genetic battle between Helicobacter pylori and humans. The mechanism underlying homologous recombination in bacteria, which can infect human cells. Microbes Infect 16, 833-839 (2014).

281. X. Li, W. Li, M. Zeng, R. Zheng, M. Li, Network-based methods for predicting essential genes or proteins: a survey. Brief Bioinform 21, 566-583 (2020).

282. M. J. Francis, Recent Advances in Vaccine Technologies. Vet Clin North Am Small Anim Pract 48, 231-241 (2018).

283. N. R. Scott, B. Mann, E. I. Tuomanen, C. J. Orihuela, Multi-Valent Protein Hybrid Pneumococcal Vaccines: A Strategy for the Next Generation of Vaccines. Vaccines (Basel) 9 (2021).

284. D. Maione et al., Identification of a universal Group B streptococcus vaccine by multiple genome screen. Science 309, 148-150 (2005).

285. D. J. Marciani et al., Genetically-engineered subunit vaccine against feline leukaemia virus: protective immune response in cats. Vaccine 9, 89-96 (1991).

286. M. Brisse, S. M. Vrba, N. Kirk, Y. Liang, H. Ly, Emerging Concepts and Technologies in Vaccine Development. Front Immunol 11, 583077 (2020).

287. L. Zhao et al., Nanoparticle vaccines. Vaccine 32, 327-337 (2014).

288. T. R. F. Smith et al., Immunogenicity of a DNA vaccine candidate for COVID-19. Nat Commun 11, 2601 (2020).

289. K. S. Corbett et al., Evaluation of the mRNA-1273 Vaccine against SARS-CoV-2 in Nonhuman Primates. N Engl J Med 383, 1544-1555 (2020).

290. C. Zhang, G. Maruggi, H. Shan, J. Li, Advances in mRNA Vaccines for Infectious Diseases. Front Immunol 10, 594 (2019).

291. J. G. van den Boorn, G. Hartmann, Turning tumors into vaccines: co-opting the innate immune system. Immunity 39, 27-37 (2013).

292. H. H. Mao, S. Chao, Advances in Vaccines. Adv Biochem Eng Biotechnol 171, 155-188 (2020).

293. T. Zangari, M. B. Ortigoza, K. L. Lokken-Toyli, J. N. Weiser, Type I Interferon Signaling Is a Common Factor Driving Streptococcus pneumoniae and Influenza A Virus Shedding and Transmission. mBio 12 (2021).
