Haemophilus influenzae genome evolution during persistence in the human airways in chronic obstructive pulmonary disease

Melinda M. Pettigrew^a, Christian P. Ahearn^b,c, Janneane F. Gent^d, Yong Kong^e,f,g, Mary C. Gallo^b,c, James B. Munro^h,i, Adonis D’Mello^h,i, Sanjay Sethi^c,j,k, Hervé Tettelin^h,i,1, and Timothy F. Murphy^b,c,l,1,2

^aDepartment of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT 06510; ^bDepartment of Microbiology and Immunology, University at Buffalo, The State University of New York, Buffalo, NY 14203; ^cClinical and Translational Research Center, University at Buffalo, The State University of New York, Buffalo, NY 14203; ^dDepartment of Environmental Health Sciences, Yale School of Public Health, New Haven, CT 06510; ^eDepartment of Biostatistics, Yale School of Public Health, New Haven, CT 06510; ^fDepartment of Molecular Biophysics and Biochemistry, Yale School of Medicine, New Haven, CT 06510; ^gW.M. Keck Foundation Biotechnology Resource Laboratory, Yale School of Medicine, New Haven, CT 06510; ^hInstitute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201; ^iDepartment of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201; ^jDivision of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, University at Buffalo, The State University of New York, Buffalo, NY 14203; ^kDepartment of Medicine, Veterans Affairs Western New York Healthcare System, Buffalo, NY 14215; and ^lDivision of Infectious Diseases, Department of Medicine, University at Buffalo, The State University of New York, Buffalo, NY 14203

Edited by Rino Rappuoli, GSK Vaccines, Siena, Italy, and approved February 27, 2018 (received for review November 10, 2017)

Nontypeable Haemophilus influenzae (NTHi) exclusively colonize and infect humans and are critical to the pathogenesis of chronic obstructive pulmonary disease (COPD). In vitro and animal models do not accurately capture the complex environments encountered by NTHi during human infection. We conducted whole-genome sequencing of 269 longitudinally collected cleared and persistent NTHi from a 15-y prospective study of adults with COPD. Genome sequences were used to elucidate the phylogeny of NTHi isolates, identify genomic changes that occur with persistence in the human airways, and evaluate the effect of selective pressure on 12 candidate vaccine antigens. Strains persisted in individuals with COPD for as long as 1,422 d. Slipped-strand mispairing, mediated by changes in simple sequence repeats in multiple genes during persistence, regulates expression of critical virulence functions, including adherence, nutrient uptake, and modification of surface molecules, and is a major mechanism for survival in the hostile environment of the human airways. A subset of strains underwent a large 400-kb inversion during persistence. NTHi does not undergo significant gene gain or loss during persistence, in contrast to other persistent respiratory tract pathogens. Amino acid sequence changes occurred in 8 of 12 candidate vaccine antigens during persistence, an observation with important implications for vaccine development. These results indicate that NTHi alters its genome during persistence by regulation of critical virulence functions primarily by slipped-strand mispairing, advancing our understanding of how a bacterial pathogen that plays a critical role in COPD adapts to survival in the human respiratory tract.

Haemophilus influenzae | chronic obstructive pulmonary disease | whole-genome sequencing | genome evolution | candidate vaccine antigens

Significance

Nontypeable Haemophilus influenzae (NTHi) exclusively colonize and infect humans and play an important role in the course and pathogenesis of chronic obstructive pulmonary disease (COPD). We conducted whole-genome sequencing of 269 NTHi isolates from a 15-y prospective study of COPD to assess in vivo adaptation of NTHi. NTHi uses slipped-strand mispairing in simple sequence repeats to regulate critical virulence functions as the primary mechanism to adapt to survival in the human airways. Analyses of changes in 12 candidate vaccine antigens during persistence provided data with important implications for guiding vaccine development. These results advance understanding of how an exclusively human pathogen alters its genome to adapt to survival in the hostile environment of the human respiratory tract.

Author contributions: M.M.P., S.S., H.T., and T.F.M. designed research; M.M.P., C.P.A., Y.K., M.C.G., J.B.M., A.D., S.S., H.T., and T.F.M. performed research; C.P.A., Y.K., H.T., and T.F.M. contributed new reagents/analytic tools; M.M.P., C.P.A., J.F.G., Y.K., M.C.G., J.B.M., A.D., S.S., H.T., and T.F.M. analyzed data; and M.M.P., C.P.A., M.C.G., H.T., and T.F.M. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Published under the PNAS license.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database. For a list of accession numbers, see SI Appendix, Table S2.

1. H.T. and T.F.M. contributed equally to this work.

2. To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1719654115/-/DCSupplemental.

Nontypeable Haemophilus influenzae (NTHi) are pathobionts that exclusively colonize and infect humans and are adapted to survival in the human respiratory tract, their primary ecological niche. NTHi are critical to the course and pathogenesis of chronic obstructive pulmonary disease (COPD). Approximately 65 million people globally have COPD, which is the fourth leading cause of death worldwide and predicted to be third by the year 2030 (1, 2). NTHi persists in the lower airways of individuals with COPD for extended periods of time and causes inflammation, impaired pulmonary function, and tissue damage that leads to progressive loss of lung function (3–6). COPD is also characterized by acute exacerbations, which are intermittent worsening of symptoms that cause enormous morbidity (4). Approximately half of exacerbations of COPD are caused by bacteria, and NTHi is the most common bacterial cause (3).

NTHi has developed mechanisms to survive and persist in the hostile environment of the human airways (7). Mechanisms of adaptation cannot be accurately studied in vitro or in animal models as a result of the unique physiological and immunological environments encountered by NTHi during colonization and infection in the human host. We conducted a 15-y prospective study of adults with COPD who were followed monthly to collect detailed clinical data and sputum samples that were cultured for bacterial pathogens, including NTHi. We hypothesized that NTHi alters its genome to adapt to survival in the human respiratory tract and that these adaptations facilitate persistence. To test this hypothesis, we conducted whole-genome sequencing (WGS) on this large collection of carefully characterized strains of NTHi. The goals of the present study were to use our unique set of 269 prospectively collected cleared and persistent NTHi strains with corresponding epidemiologic and clinical data to (i) elucidate the phylogeny of NTHi strains from individuals with COPD relative to other strains with publicly available genomes; (ii)

analyze the genomes of persistent strains of NTHi to identify changes that occur with persistence, including phase variation in simple sequence repeats (SSRs), SNPs, genome rearrangements, and gene loss and gene gain; and (iii) evaluate the effect of immune selective pressure and adaptation to the host airway environment. Overall, genetic variation during persistence in the human respiratory tract occurs via multiple mechanisms, with changes in SSRs being especially frequent. NTHi are naturally transformable; however, surprisingly limited gene gain or loss occurred during long-term persistence in the human host. Several candidate vaccine antigens underwent changes that may mediate immune escape during persistence, whereas others remained stable, an observation that has important implications in guiding vaccine development for NTHi.

www.pnas.org/cgi/doi/10.1073/pnas.1719654115 | PNAS Latest Articles

Materials and Methods

The institutional review boards of the University at Buffalo and the Veterans Affairs Western New York Healthcare System approved this study; study participants provided written informed consent before enrollment.

COPD Study Clinic. The present study includes NTHi strains from subjects enrolled during the 15-y period from April 1994 through March 2009 taken from a 20-y prospective study of adults with COPD that was conducted at the Buffalo Veterans Affairs Medical Center from 1994 to 2014 (3, 8). Subjects were seen at monthly clinic visits and at unscheduled visits at the onset of suspected acute exacerbations. Detailed clinical data and expectorated sputum samples were collected at each visit.

Bacterial Strains. NTHi were identified by using standard microbiology techniques, and a P6-specific monoclonal antibody was used to distinguish NTHi from H. haemolyticus (9). Strains that were isolated at a single monthly clinic visit and were not isolated again at subsequent monthly clinic visits were classified as cleared. Strains that were isolated from a study participant at more than one monthly clinic visit and that were the same strain by multilocus sequence typing (MLST) were classified as persistent. We estimated the duration of persistence of individual NTHi strains by calculating the number of days between the first visit date when the strain was detected and the last visit date when the strain was detected, which provides a minimum estimate of the duration of persistence. MLSTs were determined as described previously (10). This study included 67 cleared NTHi strains and the first and last detected NTHi isolates of 101 persistent strains as described in SI Appendix, Table S1.

Genome Sequencing, Assembly, and Annotation. Genomic DNA was extracted from low-passage NTHi strains (three passages from the original isolation), and samples were subjected to 150-bp paired-end sequencing on an Illumina HiSeq 2500 system in the Next-Generation and Expression Analysis Core at the University at Buffalo or 250-bp paired-end sequencing on an Illumina MiSeq system at the Yale Center for Genome Analysis. Btrim was used to filter the raw sequencing data to retain high-quality sequence reads (11). De novo assembly was accomplished by using velvet version 1.2.10 with 99 as the kmer size and default parameters (12). Assembled sequences were annotated by using Prokka (13).

The first and last isolates of four persistent strains and one cleared strain (nine NTHi strains in total) were also sequenced at the Yale Center for Genome Analysis by using the PacBio RS II platform. These strains were sequenced by using P6-C4 chemistry and a targeted library size of 10 kb with one strain per single molecule, real-time cell. The Hierarchical Genome Assembly Process was used to assemble the PacBio genomes (14). The consensus sequence was further corrected by using in-house scripts to iteratively map Illumina reads to the assembly until no differences were found between the assembly and the Illumina reads. The nine finished gap-free genome sequences assembled from PacBio data were trimmed of overlaps at the ends of the circular chromosome and then aligned to the H. influenzae Rd KW20 reference nucleotide sequence (GenBank accession no. NC_000907.1) using NUCmer (15) and rotated to make the first nucleotide the same as Rd KW20. The nine trimmed and rotated sequences were subjected to the CloVR Microbe automated annotation pipeline (16).

Phylogenetic Analyses. Whole genome-level core SNP-based phylogenetic analysis of our 269 genomes together with 134 publicly available H. influenzae genomes [from the National Center for Biotechnology Information (NCBI) as of October 25, 2017] was performed. First, SNPs were identified within the core regions shared by all genomes by using the In Silico Genotyper (ISG) pipeline (17) with default parameters. The ISG output of “clean unique variants,” a concatenated FASTA file of aligned core SNPs, was then used for phylogenetic analysis by using RAxML v8.2.0 (18). Nondefault parameters include using the -d flag for complete random starting tree instead of the default randomized stepwise addition parsimony starting tree, the -f a flag for rapid bootstrap analysis and search for best scoring maximum-likelihood tree, and the --asc-corr=lewis flag to correct the likelihood for ascertainment bias and account for the lack of invariant sites. An initial analysis using the autoMRE option (i.e., extended majority-rule consensus tree criterion) was employed to determine the minimum number of bootstrap replicates needed, which was 50. We then performed three independent analyses with different starting seeds, each with 500 rapid bootstrap inferences followed by a thorough maximum-likelihood search. The SNP matrix consisted of 403 strains, 157,303 positions per strain, and 110,159 distinct alignment patterns. Trees were visualized with v3.4.4 (19) and FigTree v1.4.3.

Genome Alignment and SNP Identification. Reference-free whole-genome multiple alignment-based comparative analyses of diverse subsets (as many as ∼150 genomes) of our H. influenzae genomes and/or annotated publicly available H. influenzae genomes were performed by using the CloVR-Comparative pipeline (20). Among other outputs, this pipeline generates a list of SNPs present within aligned sections of the genomes compared and a list of clusters of syntenic orthologous genes across the genomes (21). It also generates a Sybil interactive Web interface (22) for interrogation of the comparative data. Sybil instances were generated for the comparison of our nine finished genomes among themselves and together with 19 publicly available H. influenzae finished genomes (from NCBI as of October 25, 2017). An independent set of clusters of orthologs, Jaccard-filtered ortholog clusters (JOCs) based on reciprocal best-BLAST matches (23), was generated by performing all-vs.-all searches of all of the genes predicted in our 269 NTHi genomes.

SNPs identified in persistent strains with closed genomes are reported in Dataset S1. JOCs were used to expand analysis of these robust SNPs to the entire set of 269 genomes. When SNP-containing genes were included in JOCs that harbored one whole gene from each of the 269 genomes, the multiple alignment of the gene nucleotide sequences from each JOC was inspected for variations at the SNP position. Counts for each variant are reported in the column labeled “Orthologs (JOCs) in 269 genomes” in Dataset S1. In several cases, SNP-containing genes were undergoing frequent frame shifting (Identification of SSRs) or were present in only a small subset of the 269 strains. These cases resulted in JOCs with too few gene members for expanding the SNP analysis.

Identification of SSRs. A Perl program was used to identify tandemly repeated motifs ranging from 1 to 9 nt in length (24). The minimum number of repeat units required to be present in an uninterrupted tandem arrangement was as follows: nine repetitions for homopolymeric tracts, five for 2-nt repeats, four for 3-nt repeats, and three for 4–9-nt repeats (24). The Perl program was set to include a 500-nt region upstream of genes and not allow any mismatches. SSRs that changed between the first and last isolate in at least one of the four persistent strains with closed genomes were manually curated to determine their position within ORFs or in the promoter of ORFs. Orthologous genes across the first and last isolates of the four strains were defined by using the CloVR-based clusters of syntenic orthologous genes. SSRs in the remaining 97 persistent strains with draft genomes were also identified using the Perl script described above. JOCs were defined across the first and last isolates of the persistent strains (97 persistent strains with draft genomes and four persistent strains with closed genomes).

We used an in-house script to identify SSRs that changed between the first and last isolate of a persistent strain. Briefly, we used an iterative process to search for motifs in the first isolate that matched the motifs in the final isolate of a persistent strain; the program was used to search for the original motif and the reverse complement, and also accounted for potential shifts in the starting base pair of a given motif.

Assessment of Competence of Strains for Transformation. Transformation efficiency of persistent strains of NTHi was determined by using a modification of the M-IV method of Poje and Redfield (25), using a linearized plasmid p572b that contains the gene that encodes peroxiredoxin–glutaredoxin (26). Results are expressed as a percentage of cells transformed.

Analysis of Candidate Vaccine Antigens. Sequences of 12 NTHi candidate vaccine antigens from the first and final isolates of the 101 persistent strains were analyzed for changes during persistence and positive selection. Reference sequences of the 12 NTHi candidate vaccine antigens were aligned to the H. influenzae 86–028NP genome (GenBank accession no. CP000057.2) by using the Burrows–Wheeler Alignment tool (bwa) (27). For a particular

vaccine antigen, the reads that aligned to its genome position were extracted from BAM files using SAMtools (28). An in-house multiple alignment program was used to align the reads and obtain the consensus sequence of the vaccine antigen gene for each strain. The ORFs were translated by using EMBOSS Transeq (European Molecular Biology Laboratory/European Bioinformatics Institute) (29). Multiple sequence alignments of the nucleotide and the amino acid sequences were generated with ClustalW in MacVector using default parameters. The nucleotide and amino acid sequences of persistent strains were analyzed for synonymous and nonsynonymous changes during persistence. The Phylogenetic Analysis by Maximum Likelihood (PAML) codeml program was used to identify amino acids of antigens experiencing positive selection by using previously described methods (SI Appendix, SI Materials and Methods) (30, 31).

P2 and P5 Antigen Membrane Topology Prediction. The membrane topology of the P2 protein of strain 86–028NP (GenBank accession no. AAX87199.1) was predicted by the Prediction of TransMembrane Beta-Barrel Proteins Viterbi method in combination with the predicted P2 membrane topology described by Bagos et al. (32) and Sikkema and Murphy (33). The P5 protein membrane topology prediction of 86–028NP (GenBank accession no. AAX88164.1) was generated by ClustalW alignment and comparison of the 86–028NP sequence to the strain UC19 sequence, the topology of which was determined by Webb and Cripps (34). Transmembrane domains of P2 and P5 proteins were further supported by the BOCTOPUS 2 beta barrel prediction program (35).

Analysis of Amino Acid Sequence Haplotypes of Candidate Vaccine Antigens. Unique haplotype sequences of each candidate antigen were extracted from the amino acid sequences of persistent strains with the uniqHaplo.pl Perl script. Unique amino acid haplotype sequences for each antigen were aligned, and pairwise sequence percent identity scores were calculated by using MacVector. Mean haplotype pairwise diversity (MHPD) values are the calculated average of all pairwise percent identity values between the unique amino acid sequence haplotypes of each candidate antigen.

Expression and Purification of Recombinant Proteins and Mutant Construction. Stop codon mutations in candidate antigen ORF sequences were identified by translation of the nucleotide sequences. Immunoblot assays with antiserum to recombinant protein E were used to determine whether the protein was expressed in strains with the stop codon. Briefly, cloning, expression, and purification of protein E were accomplished by using pCATCH, allowing expression of recombinant lipoprotein (36). A protein E-KO mutant was generated by overlap extension PCR and allelic exchange in NTHi strain 86–028NP as described previously (37). Immunoblot assays of whole-cell lysates of WT, KO mutants, and NTHi COPD strains with and without the stop codon were performed by using polyclonal antiserum to recombinant purified protein E.

Results

Phylogeny of NTHi Genomes from COPD Airways. The genome sequences of 67 cleared NTHi strains and the first and final detected NTHi isolates of 101 persistent strains isolated prospectively over 15 y from adults with COPD were determined. Summary assembly data on the 260 draft and 9 closed (gap-free or finished) genomes are provided in SI Appendix, Table S2. The median number of contigs of draft genomes is 70.5 (range, 26–968). The median genome size is 1,841,659 bp (range, 1,688,785–2,031,876 bp) and the median number of predicted genes is 1,813 (range, 1,599–2,395). These data substantially increase the number of publicly available H. influenzae genomes, including increasing the number of closed genomes from 19 to 28.

We used MLST as an initial step to examine the diversity of our strains. Among the 168 initial acquisition strains, 77 different sequence types (STs) and one (strain 99P15H1) that lacked the fucK1 gene and could not be assigned an ST were represented (SI Appendix, Table S1). The most common STs were 155, 156, and 159, which represented 8, 6, and 11 strains, respectively. Of the 101 persistent strains, two changed ST during persistence (39P25H/39P29H1 and 106P3H1/106P7H1); in both cases, the change resulted from single locus variants caused by base pair changes in the fucK1 gene.

We generated three maximum-likelihood phylogenetic trees based on core genome SNPs of the 269 COPD genomes and 134 publicly available H. influenzae genomes (Fig. 1). Resulting tree topologies were largely congruent with RAxML-computed average relative Robinson–Foulds distances ranging from 1% to 3.25%. Our 269 genomes are annotated with the ST, designation as cleared or persistent, designation as an exacerbation (infection) or colonization (no symptoms) strain, and year of isolation. The genomes of initial acquisition and final isolates of persistent strains group together, consistent with results from MLST. Moreover, STs generally group together; for example, ST155 strains group together in the same clade along with the single-locus variant ST414. The COPD strains are distributed throughout the tree, indicating that there is no clear genetic clustering based on clinical source of the strain, geography, duration of persistence, exacerbation vs. colonization, or year of isolation.

Changes During Persistence. We focused our detailed analyses of changes in NTHi genomes during persistence in the human respiratory tract initially on four persistent strains by analyzing the closed genomes of the first detected isolate and final detected isolate. Strains 5P28H1/5P54H1, 6P24H2/6P32H1, 48P106H1/48P153H1, and 67P38H1/67P56H1 persisted in four different study participants for 819, 252, 993, and 570 d, respectively. We extended these analyses to the first detected isolate and final detected isolate of 97 additional persistent strains with draft genomes to determine whether key observations regarding SSR repeats and SNPs were consistently observed. The 101 persistent strains persisted for a median of 161 d (range, 2–1,422 d).

SSR Analysis. Phase variation mediated by slipped-strand mispairing is an important regulatory mechanism that allows NTHi to adapt to different or changing environments (38, 39). Changes in the number of SSRs may affect gene expression when the SSRs are located in upstream regulatory regions. Alternatively, SSRs within ORFs may affect protein translation. The genomes of the four persistent strains contained unique SSRs, as designated by the SSR coordinate, ranging in number from 72 (48P153H1) to 92 (5P28H1) among the eight initial and final isolates of the four strains (Dataset S2).

Further analysis identified 22 genes in which the number of SSRs changed during persistence in at least one of the four strains, suggesting that NTHi adapts to the environment in the human respiratory tract by regulating the expression of these genes. Each of the SSRs that changed in number was manually evaluated with regard to its position in relation to the ORF, and, in each case, we also determined whether the SSR changed in the other isolate of the persistent strain (Dataset S3). All of the genes that change in SSR copy number during persistence encode potential virulence functions, including adhesins, modifications of lipooligosaccharide, and iron uptake (Table 1). For example, one such gene is a DNA methylase that mediates a rapid and reversible change in the expression of many genes throughout the genome (40). The number of repeats changed in all four persistent strains in the HMW1A/HMW2A-related adhesins and in each of the hemoglobin-haptoglobin–binding proteins. We observed an interrupted repeat in the HMW1A/HMW2A adhesin encoding locus (BV121_1850) in strain 5P28H1 and in the hemoglobin-haptoglobin–binding protein loci (BV121_803) in strain 5P28H1. These interrupted repeats may arise from homologous recombination between paralogous loci, which generates imperfect repeats (39).

The number of unique SSR repeats in 101 persistent strains ranged from 30 to 88 per isolate, with a mean of 64.6 (SD, 8.8; SI Appendix, Table S3). We also determined whether each SSR was present in the final isolate of the persistent strain and whether the matching SSR repeat changed in number. The SSR repeat coordinate, motif, gene name, and number of repeats are provided for SSRs that changed during persistence (Dataset S3). The mean number of repeats that changed in persistent strains was 6.9 (SD, 3.0). The proportion of SSR repeats that changed per persistent strain ranged from 1.9% to 27.4% (SI Appendix, Table S3). We examined the correlation between the number of repeats that changed and the duration of persistence. There was a significant positive correlation between the proportion of SSR

Pettigrew et al. PNAS Latest Articles | 3of10 Downloaded by guest on October 7, 2021 133P17H1_57_88perV50_1_1_2006 13P36H1_57_7perV24_0_0_1997 148P17H1_57_91perV4_0_0_2007 13P24H2_57_7perV36_1_1_1996 133P50H1_57_88perV17_0_0_2007

148P4H1_57_91perV17_1_0_2007

68P37H1_57_clr_1_0_2000

63P35H1_57_clr_1_1_1999

135P29H4_57_clr_1_0_2007

72P12H1_41_53perV14_1_0_1998 39P20H1_13_clr_1_0_1996 72P14H1_41_53perV12_0_1_1998

40P92H1_41_clr_1_1_2003

47P41H1_34_36perV38_0_0_1998

47P38H1_34_36perV41_1_0_1998 2842STDY5882080 177_HINF 2842STDY5882081 552_HINF 536_HINF 116P14H1_34_78perV1_0_0_2003 HI1413 116P1H1_34_78perV14_1_0_2002 PittAA

GE3

CCUG_26214 RMHi93

HI1374 Hi322

56P41H1_2_44perV34_0_1_1999 6P9H1_85_5perV8_0_0_1995 56P34H1_2_44perV41_1_1_1998 126P6H1_2_clr_1_0_2004 49P5H1_85_98perV21_1_0_1995 126P12H2_98_clr_1_0_2005 128P56H1_98_87perV1_0_0_2007 128P1H1_98_87perV56_1_0_2004 96P29H1_98_clr_1_0_2002 91P109H1_98_62perV114_1_1_2007 6P8H2_85_5perV9_1_1_1995 91P114H1_98_62perV109_0_0_2007 49P21H1_85_98perV5_0_0_1997 R2846_ST1622 135P16H1_263_89perV30_1_0_2006 GE49 1209 133P41H1_263_clr_1_0_2007 135P30H1_263_89perV16_0_0_2007 2842STDY5882051 2842STDY5882052 84P36H1_263_59perV37_1_0_2002 31P7H7_156_20perV14_1_1_1995 84P37H2_263_59perV36_0_1_2002 74P1H1_321_104perV15_1_0_1997 31P14H6_156_20perV7_0_0_1996 74P15H1_321_104perV1_0_0_1998 1059_HINF 2019_ST321 84P29H1_156_clr_1_1_2001 MiHi64 6P18H1 1061_HINF 6P5H_349_4perV22_1_1_1994 34P8H9_156_22perV4_0_0_1995 47P68H1_349_clr_1_1_2000 R535 6P22H1_349_4perV5_0_0_1996 34P4H6_156_22perV8_1_0_1995 66P36H1_349_clr_1_1_1999

100 91P3H3_411_61perV4_1_1_1999 100

100

100 100

74 66 91P4H1_411_61perV3_0_0_1999 100 100 C10 100 100 100 3655 37 100 35 30 67P10H5_411_49perV50_1_0_1997 Hi345 67P50H1_411_49perV10_0_0_2001

67P56H1_156_50perV38_0_1_2001 100 100 106P7H1_10001_74perV3_0_1_2001 67P38H1_156_50perV56_1_1_200098P5H3_156_clr_1_0_2000 106P3H1_10001_74perV7_1_0_2001 100 99P15H1_156_clr_1_1_2001 121P2H1_156_clr_1_1_2003 100 55 100 84P5H1_603_55perV7_1_1_1999 100 63 100 84P7H1_603_55perV5_0_0_1999

79 86 93

100 100 100 74P16H1_170_54perV23_1_1_1998 100 100 74P23H1_170_54perV16_0_1_1999 63P46H1_170_clr_1_1_2000 56P54H1_170_clr_1_1_2000 100

100 CGSHiCZ412602_ST712 56P16H1_265_100perV1_0_1_1996 60294N1 10P118H1_244_clr_1_1_2004 100 51P1H2_260_41perV4_1_0_1995 56P1H1_265_100perV16_1_0_1995 100 HI2004 100 51P4H_260_41perV1_0_1_1995 NT127 100 100 100 100 84P15H4_260_clr_1_1_2000 59P7H1_253_101perV16_1_0_1996 100 584 100 100 59P16H1_253_101perV7_0_0_1997 C486_ST244 100 1104 411 100 100 HI1988 100 100 100 100 84P13H1

100 3 84P12H1_630_56perV13_1_0_2000 100 53 100 3P14H1_253_2perV24_1_0_1995 HI2116 100 3 53 100 100 84P15H1_630_clr_1_1_2000 100 3P24H1_253_2perV14_0_0_1996 100 100 PittII 100 100 100 HI1388

100 100 R2866_ST99 100 100 48P160H1_196_38perV162_1_0_2007 100 65373_B_Hi_3 GE42 100 85 100 94P40H1_1334_66perV41_1_0_2003 48P162H1_196_38perV160_0_0_2007118P3H1_196_clr_1_0_2003 100 52 94P41H1_1334_66perV40_0_0_2003 103P85H1_196_71perV89_1_1_2006 HI2192 100 78 79 74P49H2_1331_clr_1_0_2001 100 100 100 73100 100 98 19P81H1_1331_13perV68_0_0_2001 103P89H1_196_71perV85_0_1_2007 84 100 100 100 92 100 19P68H7_1331_13perV81_1_0_2000 56P127H1_1329_clr_1_0_2005 93P6H1_196_63perV1_0_0_2000 60 100 GE71 100 100 124P2H1_1335_80perV1_0_0_2003 100 93P1H1_196_63perV6_1_0_1999 100 63P50H1_196_102perV90_1_1_2000 100 124P1H1_1335_80perV2_1_0_2003 100 94P5H_1373_clr_1_0_2000 63P90H1_196_102perV50_0_0_2003 100 100 1P22H3_1374_clr_1_0_1996

100 121P25H1_1374_clr_1_1_2004 100 100

100 100 51P5H1_1339_42perV10_1_0_1995 100 100 100 100 51P10H1_1339_42perV5_0_0_1996 101P53H1_36_clr_1_0_2005 100 94 100 14P6H_364_clr_1_0_1994 100 100 93 100 94 100 10P129H1_1327_clr_1_1_2005

100 100 84P21H1_1333_57perV23_1_0_2001 12P95H1_414_clr_1_1_2000 45 99 100 100 100 84P23H1_1333_57perV21_0_1_2001 1200 100 84P23H6_414_58perV28_1_1_200199P1H1_414_clr_1_1_2000CCUG_60490 100 14P26H2_1328_clr_1_0_1997 76

84P28H1_414_58perV23_0_0_2001 0 39P18H1_10002_24perV1_0_0_1996

0 100 1 100 100 100 39P1H1_10002_24perV18_1_0_1995 99 100 100 56P68H1_155_clr_1_1_2001 99 100 59P1H1_365_clr_1_0_1997 67P10H4_155_48perV23_1_0_1997 100 39P29H1_10003_25perV25_0_1_1997

[Fig. 1 image: polar core-genome phylogenetic tree. Tip labels and bootstrap values from the image are not reproduced here. Scale bar, 0.05 substitutions per site.]

Fig. 1. Phylogenetic tree of 269 nontypeable H. influenzae COPD genomes from this study together with 134 publicly available H. influenzae genomes. The core-genome SNP-based maximum-likelihood phylogenetic tree is presented in its polar tree layout, it is midpoint-rooted, and nodes are sorted in increasing order. Bootstrap values from 500 iterations are shown on each node, and most show very strong support with values ≥90%. A branch-length scale bar equivalent to 0.05 substitutions per site is provided at the bottom of the figure. The 269 COPD genomes are labeled with strings that include data separated by underscores in the following order: isolate name, MLST ST, paired isolate information for persistent strains (a number representing the clinic visit of isolation, followed by “perV,” followed by the visit number of the paired isolate) or “clr” for cleared strains, first isolate (marked as “1”) or last isolate (“0”), exacerbation (“1”) or colonization (“0”), and year of isolation. Isolates are color-coded in a gradient of red shades based on grouped dates of isolation, from 1995 to 1996 in light red, then darker for 1997–1999, 2000–2001, 2003–2005, and the darkest for 2006–2009. Closed (gap-free) COPD genomes are colored in green. The 19 closed publicly available genomes (per NCBI as of October 25, 2017) are labeled with their strain name followed by MLST type and colored in blue. Strain names for the remaining 115 draft publicly available genomes are in black.
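The label format described in the legend can be read mechanically. The following is a minimal sketch, not part of the study's pipeline: the class name `IsolateLabel`, the function `parse_label`, and the field names are hypothetical, and the number preceding "perV" is stored without interpretation beyond what the legend states.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pairing field for persistent strains, e.g. "70perV3".
_PAIR_RE = re.compile(r"^(\d+)perV(\d+)$")


@dataclass(frozen=True)
class IsolateLabel:
    isolate: str                  # isolate name, e.g. "103P20H1"
    mlst_st: str                  # MLST sequence type, kept as text
    cleared: bool                 # True when the pairing field is "clr"
    per_number: Optional[int]     # number before "perV"; None if cleared
    paired_visit: Optional[int]   # number after "perV"; None if cleared
    first_isolate: bool           # "1" = first isolate, "0" = last isolate
    exacerbation: bool            # "1" = exacerbation, "0" = colonization
    year: int                     # year of isolation


def _flag(value: str, field: str) -> bool:
    """Convert a "0"/"1" field to a bool, rejecting anything else."""
    if value not in ("0", "1"):
        raise ValueError(f"{field} must be '0' or '1', got {value!r}")
    return value == "1"


def parse_label(label: str) -> IsolateLabel:
    """Split a COPD genome label into the six fields named in the legend."""
    parts = label.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 underscore-separated fields: {label!r}")
    isolate, st, pairing, first, exac, year = parts
    if pairing == "clr":
        per_number = paired_visit = None
    else:
        match = _PAIR_RE.match(pairing)
        if match is None:
            raise ValueError(f"unrecognized pairing field: {pairing!r}")
        per_number, paired_visit = int(match.group(1)), int(match.group(2))
    return IsolateLabel(
        isolate=isolate,
        mlst_st=st,
        cleared=(pairing == "clr"),
        per_number=per_number,
        paired_visit=paired_visit,
        first_isolate=_flag(first, "first/last flag"),
        exacerbation=_flag(exac, "exacerbation flag"),
        year=int(year),
    )
```

For a label such as 103P20H1_155_70perV3_0_1_2002, this yields ST 155, a last isolate collected during an exacerbation in 2002. Labels that were truncated or fused in the figure would raise an error rather than be guessed at.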

4 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1719654115 | Pettigrew et al.

Table 1. Genes with SSRs that change frame during persistence in the human respiratory tract

Gene function* | Implications for respiratory tract persistence | Region of gene with SSRs | Strains that change no. of SSRs (of 4)
HMW1A/HMW2A adhesin | Adherence to respiratory mucosal surface | Promoter | 4
Sialyltransferase | Molecular mimicry to evade host immune response | ORF | 4
DNA methylase, type I restriction-modification methylase | Phase variation of multiple virulence genes | ORF | 1
Efflux transporter | Resist toxic substances in airways | Promoter | 1
Glycosyltransferase family | Modifications of LOS on the bacterial surface | ORF, promoter | 4
Hemoglobin-haptoglobin–binding protein | Iron acquisition | ORF | 4
Acetyltransferase | Post translational modification and regulation of multiple proteins | ORF | 1
Outer membrane autotransporter | Secretion of multiple virulence factors | ORF | 1
Phosphotransferase | Transport of carbohydrate and cellular regulatory functions | ORF | 3
TonB dependent receptor family protein | Uptake and transport of nutrients and substrates | Promoter, ORF | 2

LOS, lipooligosaccharide.
*Some gene functions underwent changes in multiple genes.
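For an SSR located inside an ORF, whether a change in repeat copy number shifts the reading frame follows from simple arithmetic: the number of bases gained or lost must not be a multiple of 3. The sketch below illustrates that rule only. The function names are hypothetical, the repeat-unit lengths in the usage note are illustrative rather than taken from the study, and promoter SSRs are outside its scope.

```python
def frame_offset(unit_len: int, copies_before: int, copies_after: int) -> int:
    """Return the reading-frame offset (0, 1, or 2) downstream of the repeat.

    The offset is the net number of bases gained or lost, modulo 3.
    An offset of 0 means the downstream frame is preserved.
    """
    if unit_len < 1 or copies_before < 0 or copies_after < 0:
        raise ValueError("unit length must be >= 1 and copy numbers >= 0")
    delta_bp = (copies_after - copies_before) * unit_len
    return delta_bp % 3  # Python's % is non-negative for a positive modulus


def changes_frame(unit_len: int, copies_before: int, copies_after: int) -> bool:
    """True when the copy-number change shifts the downstream reading frame."""
    return frame_offset(unit_len, copies_before, copies_after) != 0
```

Under this rule, gaining or losing a single copy of a tetranucleotide repeat shifts the frame, whereas any change in a trinucleotide repeat, or a change of three copies of a tetranucleotide repeat, leaves the downstream frame intact.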

repeats that changed during persistence and the duration of colonization in days (Pearson correlation coefficient r² = 0.35, P < 0.0001). Based on the frequency of SSRs that likely result in changes in expression of multiple virulence factors, and the correlation between the number of changes and duration of persistence, we conclude that NTHi uses slipped-strand mispairing as a major mechanism to adapt to microenvironments in the human respiratory tract.

SNPs Occurring During Persistence of NTHi in Human Airways. Analysis of closed genomes of the four persistent strains showed substantial variability in the number of SNPs that occurred in genomic loci during persistence in human airways among the four strains (Table 2 and Dataset S1). For example, strain 67P38H1/67P56H1 underwent 81 SNPs during persistence for 570 d; of these, 11 were intergenic and 70 were in 13 genes, including 59 nonsynonymous and 11 synonymous SNPs. Remarkably, strain 6P24H2/6P32H1 showed a single SNP during persistence for 252 d (Table 2 and Dataset S1). Nonsynonymous SNPs were observed in a wide variety of genes in these four pairs, with the highest number of nonsynonymous SNPs in hemoglobin-haptoglobin–binding protein A, followed by outer membrane protein P2. SNPs in these genes occurred in clusters of nucleotides undergoing variations in regions that we called “hypervariable” (Dataset S1). The hemoglobin-haptoglobin–binding protein A genes also underwent frequent SSR-induced frame-shifting. The SNP identified in ribosomal protein L22, which resulted in a switch from glycine to aspartic acid at amino acid 91 (G91D) in strain 5P28H1/5P54H1, was previously identified as associated with a fourfold increase in azithromycin minimum inhibitory concentration during persistence (10). However, this SNP was observed only twice within the 269-genome dataset.

Genome Rearrangements During Persistence in Human Airways. An inversion of ∼400 kbp in length occurred during persistence in the human respiratory tract in two of the four persistent strains with closed genomes. The inversions depicted in the Sybil (22) synteny gradient image (Fig. 2) were confirmed via PCR amplification of upstream and downstream nonrepetitive flanking regions of the inversion sites for both persistent strains.

The ends of the 400-kbp inversion are located within the loci that encode HMW1 and HMW2, which are key adhesins that mediate adherence of NTHi to human respiratory epithelial cells. The hmw1 and hmw2 structural genes are followed by two additional genes downstream, hmw1B and hmw1C or hmw2B and hmw2C (41). The hmwA genes encode the surface-exposed nonpilus adhesins, and hmwB and hmwC genes encode accessory proteins required for processing and secretion of the structural protein (42). HMW1B/1C and HMW2B/2C are interchangeable, which is consistent with the observed high degree of homology between the two accessory proteins (41). Close examination of the inversion sites demonstrates that the HMW operons are not disturbed, such that each HMW locus still contains one A, one B, and one C gene.

Gene Gain and Loss and Competence in Persistent Strains. Based on our analyses of ortholog clusters, there were no major changes in the genome in terms of gene gain or loss. The size differences

Table 2. SNPs that occur in NTHi genomes during persistence in the human respiratory tract

Persistent strain | Persistence, d | Total SNPs | Genes with SNPs | Genes with nonsynonymous SNPs | Intergenic SNPs
5P28H1/5P54H1 | 819 | 23 | 5 | 5 | 5
6P24H2/6P32H1 | 252 | 1 | 1 | 1 | 0
48P106H1/48P153H1 | 993 | 3 | 2 | 2 | 1
67P38H1/67P56H1 | 570 | 81 | 13 | 13 | 11
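To compare strains that persisted for different lengths of time, the Table 2 counts can be put on a common per-year scale. The short sketch below does only that arithmetic. The names `TABLE2` and `snps_per_year` are hypothetical, a 365-day year is assumed, and the result is a crude observed rate from two time points, not a substitution-rate estimate reported by the study.

```python
# (persistence in days, total SNPs) for each persistent strain, from Table 2.
TABLE2 = {
    "5P28H1/5P54H1": (819, 23),
    "6P24H2/6P32H1": (252, 1),
    "48P106H1/48P153H1": (993, 3),
    "67P38H1/67P56H1": (570, 81),
}

DAYS_PER_YEAR = 365.0  # assumption: calendar year, ignoring leap days


def snps_per_year(days: int, snps: int) -> float:
    """Observed SNPs between first and final isolate, scaled to one year."""
    if days <= 0:
        raise ValueError("persistence must be a positive number of days")
    return snps / days * DAYS_PER_YEAR


rates = {strain: snps_per_year(d, n) for strain, (d, n) in TABLE2.items()}

for strain, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{strain}: {rate:.1f} SNPs per year")
```

On this scale the four strains still differ by more than an order of magnitude, so the variability in SNP counts is not explained by duration of persistence alone.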

between the acquisition and final isolates of persistent strains with closed genomes were minimal: −39 bp, −25 bp, −127 bp, and +76 bp for strains 5P28H1/5P54H1, 6P24H2/6P32H1, 48P106H1/48P153H1, and 67P38H1/67P56H1, respectively (SI Appendix, Table S2). We evaluated the transformation frequency of the acquisition and final isolates of each of the four persistent strains with closed genomes, given the limited gene gain and loss observed during persistence. The average transformation frequencies were 8.55 × 10⁻⁷ for 5P28H1, 3.27 × 10⁻⁶ for 5P54H1, 1.10 × 10⁻⁵ for 48P106H1, 9.61 × 10⁻⁵ for 48P153H1, 6.29 × 10⁻⁶ for 67P38H1, and 5.8 × 10⁻⁷ for 67P56H1. 6P24H2 and 6P32H1 were not transformable by using laboratory conditions that readily resulted in transformation of the other six isolates. The genome of this strain underwent only one coding-region SNP during 272 d of persistence in the human respiratory tract.

Efficient transformation of H. influenzae requires the presence of 9-mer (aagtgcggt) DNA uptake signal sequences (USSs) (43). We reasoned that a noncompetent lineage would experience a divergence or gradual loss of USSs over time. Inspection of the number of uptake sequences in the four persistent strains revealed no difference for 6P24H2 and 6P32H1 compared with the other persistent strains. 6P24H2 and 6P32H1 each had 1,504 uptake sequences, 5P28H1 and 5P54H1 each had 1,508, 48P106H1 and 48P153H1 each had 1,478, and 67P38H1 and 67P56H1 each had 1,479 uptake sequences. Thus, the number of DNA USSs in each isolate of persistent strains remains constant. However, USS sequences are stable for hundreds of millions of years, and loss of USS likely occurs over large time scales (44). In H. influenzae, competence for transformation is controlled by the Sxy-dependent cAMP receptor protein regulon, which is composed of 26 genes that are controlled by the CRP and Sxy regulators (45, 46). We identified orthologs of each of the 17 required competence genes from H. influenzae strain Rd (46) in each isolate of our four persistent strains. There were no obvious mutations that explain the lack of transformability of 6P24H2 and 6P32H1 in vitro. Although isolates 6P24H2 and 6P32H1 were not competent under laboratory conditions, they may be competent under natural conditions.

Fig. 2. (A) Sybil synteny gradient image of nine closed genomes compared with reference strain 86_028NP. Clusters of syntenic orthologous genes were used to draw orthologs of genes predicted in the reference genome (86-028NP). Each ortholog is drawn with the color from the gradient corresponding to its position in its own genome. As a result, breaks in the color gradient represent inversions, translocations, or insertions, and white spaces indicate deletions or regions harboring genes that are not included with reference genes in clusters of orthologs. The three genomes containing the inversion are indicated with asterisks. (B) Schematic of upstream and downstream inversion sites in the first isolate of a persistent strain, 5P28H1, containing the inversion, and the final isolate of the same strain, 5P54H1, with no inversion. PCR primers were designed to span the flanking regions of the inversion break sites, utilizing a common forward primer for the upstream break site and a common reverse primer for the downstream break site. (C) Agarose gel of PCR products demonstrating the specificity of PCR primers surrounding the inversion sites. Common forward primer for the upstream break site “1” amplifies a product with unique reverse primer “3” using 5P28H1 DNA in lane 1, but not with 5P54H1 DNA in lane 5. Common forward primer “1” amplifies a product with unique reverse primer “4” using 5P54H1 DNA in lane 6, but not with 5P28H1 DNA in lane 2. Forward primer “5” with common reverse primer “2” for the downstream break site amplifies a product with 5P28H1 DNA in lane 3, but not with 5P54H1 DNA in lane 7. Forward primer “6” with common reverse primer “2” amplifies a product with 5P54H1 DNA in lane 8, but not with 5P28 DNA in lane 4.

Evaluation of NTHi Candidate Vaccine Antigens. Vaccine development for NTHi is undergoing exciting new developments with the licensing of a pneumococcal conjugate vaccine that contains the conserved NTHi protein, protein D (47). Recognizing the limited efficacy of a vaccine with a single NTHi antigen, this proof-of-principle observation has stimulated active preclinical and clinical development of several additional NTHi vaccine antigens (48, 49).

We conducted two separate analyses of candidate vaccine antigens in 101 persistent strains. Complete ORF sequence data were not available for some antigens in a few isolates; in these instances, they were not included in the analyses (Table 3). We compared multiple sequence alignments of each of the 12 vaccine antigens in the first and final isolates of each strain to identify positively selected codons. We also assessed the extent to which 12 candidate vaccine antigens of NTHi underwent changes in the airways of adults with COPD. Positive selection describes selection in the population of persistent strains. In contrast, our second analysis examined how genes that encode antigens in individual strains change during persistence.

Positive Selection in Vaccine Candidate Antigens. We investigated 12 candidate vaccine antigens for positively selected codons by using the PAML-codeml program based on the sequence alignments of the first and final isolates in the population of the 101 persistent strains. Because a majority of codons in a protein-coding sequence are under physicochemical constraint based on protein structure and function (50), we applied selection models in the codeml program that identified individual codons in antigen sequences under positive selection. Several amino acids of four antigens, P2, P5, Hap, and D15, experienced positive selection in the persistent strains (Table 3 and SI Appendix, Table S4). The surface exposed loops of the outer membrane proteins P2 and P5 are variable among strains and during persistent infection (33, 51, 52). Based on the P2 and P5 predicted membrane topology, positively selected amino acids were located in the surface-exposed regions of both proteins. Fig. 3 highlights the amino acids experiencing positive selection in reference to the 86–028NP antigen sequences.

Changes in Candidate Vaccine Antigens During Persistence in COPD Airways. In addition to sequence diversity identified in the population of strains, we also assessed the extent to which 12 candidate vaccine antigens of NTHi underwent changes during persistence in the airways of adults with COPD. All 101 persistent strains contained the ORFs of the 12 candidate antigens. Of the 12 candidate vaccine antigens analyzed, 36 amino acid sequence changes were observed between the first and final isolates of 32 persistent strains. Two strains contained multiple changes in different antigens. Changes occurred during persistence in eight antigens: P2, P5, PD, Hap, D15, HtrA, P4, and PE (Table 3 and SI Appendix, Table S5). The remaining four antigens

Table 3. Candidate vaccine antigens that undergo selection and change during persistence of NTHi in the human respiratory tract

Candidate vaccine antigen | Biological function | Codons experiencing selection (sequences analyzed)† | No. of strains that change during persistence‡
P6 | Peptidoglycan-associated lipoprotein | None (201) | 0
PilA | Type IV pilus | None (200) | 0*
OMP 26 | Translocase | None (199) | 0*
Protein F | ABC transporter | None (200) | 0*
P4 | Acid phosphatase | None (200) | 1
Protein E | Adhesin | None (200)§ | 1
HtrA | Heat shock protein | None (199) | 2**
Protein D | Phosphodiesterase | None (195) | 4**¶
D15 (BamA-like protein) | Nucleotidyltransferase | Multiple (196) | 2**
Hap | Adhesin | Multiple (186) | 3***#
P5 fimbrin | Fimbrin | Multiple (200)§ | 5*
P2 | Porin | Multiple (187) | 21

†Positive selection analysis was performed on the first and final isolate (202) antigen sequences of the 101 persistent strains. The analysis was performed on the initially extracted ORFs from the genome sequences using bwa.
*n = 100, **n = 99, ***n = 95.
‡Analysis of 101 strains except where noted.
§The nucleotides encoding the PE and P5 stop codons were removed so as to include these strain sequences for detection of positive selection.
¶Two of the four changes were nonsynonymous SNPs.
#One of the three changes was a nonsynonymous SNP.
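The uptake signal sequence counts reported in the competence analysis earlier in this section (roughly 1,500 copies of the 9-mer aagtgcggt per genome) can be reproduced in outline by a motif count over a closed genome. The sketch below is an assumption-laden illustration, not the method used in the study: it counts exact matches on both strands, and the function names are hypothetical.

```python
USS = "AAGTGCGGT"  # the 9-mer uptake signal sequence given in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase or lowercase A/C/G/T string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def _count_overlapping(text: str, motif: str) -> int:
    """Count occurrences of motif in text, allowing overlaps."""
    count = 0
    start = text.find(motif)
    while start != -1:
        count += 1
        start = text.find(motif, start + 1)
    return count


def count_uss(genome: str, motif: str = USS) -> int:
    """Count exact motif matches on the forward and reverse strands."""
    genome = genome.upper()
    motif = motif.upper()
    rc = reverse_complement(motif)
    total = _count_overlapping(genome, motif)
    if rc != motif:  # avoid double counting a palindromic motif
        total += _count_overlapping(genome, rc)
    return total
```

Applied to the first and final isolate of a strain, equal counts would correspond to the constancy reported in the text. Whether the published counts allowed mismatches or treated a circular chromosome specially is not stated here, so exact agreement with those figures should not be expected.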

showed no changes during persistence (Protein F, P6, OMP26, and PilA).

Outer-membrane protein P2 underwent amino acid sequence changes during persistence in 21 of 101 strains. Insertions, deletions, and SNPs all occurred in predicted surface-exposed loops, with loops 4, 5, and 6 having the most frequent changes (Fig. 3). Despite these insertions, deletions, and SNPs, the gene remained in frame in all strains, consistent with the observation that, as the major porin protein of NTHi, P2 is important for viability.

Outer-membrane protein P5 experienced nonsynonymous insertions, deletions, and SNPs during persistence in five of 101 persistent strains. Each of the variations observed during persistence resulted in amino acid changes in regions that are predicted to be surface exposed loops, in particular loops 2 and 3 (Table 3 and Fig. 3). No changes in P5 loops 1 and 4 were observed during persistence, yet sequence diversity was present among the strains. Nonsynonymous SNPs occurred in PD, Hap, D15, HtrA, P4, and PE during persistence in one or two strains each among the 101 persistent strains.

Stop Codons in Candidate Vaccine Antigens. Analysis of genes that encode candidate vaccine antigens of NTHi from 168 COPD strains (67 cleared and 101 initial acquisitions of persistent strains) revealed the presence of stop codons in three genes. The protein E gene of 15 of 101 persistent strains (14.9%) and 12 of 67 cleared strains (17.9%) contained a G169A amber mutation. The mutation was present upon acquisition in all but one strain that had the mutation; the persistent strain 59P3H1/59P6H1 acquired the mutation during persistence in the human respiratory tract. Immunoblot assays with antiserum to recombinant protein E confirmed that the protein was not detected in strains with the stop codon but was detected in other strains (SI Appendix, Fig. S1). Thus, the protein E gene of 27 of 168 strains (16.1%) contained an amber mutation that resulted in the absence of expression of protein E. This amber mutation was observed by Singh et al. (53) in 1.6% of NTHi strains.

The gene that encodes the adhesin Hap in 2 of 168 strains had deletions in the ORF of the strains, both resulting in downstream stop codons. In persistent strain 31P7H7/31P14H6, the Hap gene contained a one-base deletion resulting in a frame-shift mutation and several downstream stop codons. In persistent strain 102P20H1/102P32H1, the final isolate contained a 22-bp deletion in the Hap gene resulting in downstream stop codons.

The P5 gene of strain 18P16H1/18P17H2 had an opal mutation (GAA to TGA) upon acquisition. Another strain, 48P28H1/48P45H1, acquired a 4-bp insertion during persistence that resulted in a stop codon. Immunoblot with antiserum to P5 confirmed that P5 was not expressed in the strains that contained the stop codon but was expressed in other strains.

Amino Acid Sequence Haplotypes of Candidate Vaccine Antigens. We assessed amino acid sequence conservation of the 12 candidate vaccine antigens by determining the pairwise percent identity scores of the unique haplotypes for each antigen and determining the prevalence of each haplotype among the persistent strains. Eight of the 12 antigens had greater than a 95% MHPD of percent identity (Fig. 4A). In addition to the high MHPD for these antigens, they had the fewest number of unique haplotypes, with the exception of the D15 antigen, which had 48 haplotypes (Fig. 4B). The P6 antigen had the fewest unique haplotypes with eight; 92% of persistent strains were represented by one P6 haplotype (Fig. 4B). PilA, P5, P2, and Hap showed the most sequence heterogeneity among strains with an MHPD of percent identity between 90% and 71% and a large number of haplotypes (SI Appendix, Table S6). The P2 antigen had the greatest number of haplotypes (n = 103) and a range of pairwise percent identity comparisons from 99.7% to as low as 42.1%. The Hap antigen had the lowest MHPD of percent identity at 71.4%. This analysis has important implications in assessing the conservation of candidate vaccine antigens to prioritize them further as NTHi vaccine formulations are developed.

Discussion

H. influenzae was the first bacterial species to have its genome sequenced, in 1995 (54). WGS has been used to describe the population structure of H. influenzae (55), provide insight into genomic diversity (56), identify targets for strain differentiation and diagnostics (57), and characterize virulence properties (58–61). Our study is unique in that we examined a large collection of prospectively collected H. influenzae, including strains that persisted for months to years in the human respiratory tract, to answer questions related to in vivo adaptation of this exclusively human pathogen in the airways of its human host. Our analyses indicate that NTHi uses slipped-strand mispairing in SSRs as a major mechanism for survival and pathogenesis in the human airways. Moreover, the propensity for SNPs during persistence

shows a high degree of variability among strains. Our analyses of 12 candidate vaccine antigens indicate that some candidate vaccine antigens are stable (e.g., protein F, P4, P6, and OMP 26). In contrast, others, such as P2, P5, and Hap, undergo changes under immune or airway environment-selective pressure during persistence and exhibit a relatively high level of diversity. We also observed a higher proportion of strains, 16.1% in total compared with 1.6% identified previously, containing a stop codon in the protein E gene (53). In addition to the protein E stop codon, we identified strains containing stop codons in the P5 and Hap genes. Thus, our data also have important and immediate implications in guiding vaccine development.

Results of phylogenetic analyses of our 269 strains are consistent with prior studies using MLST or whole-genome sequences that did not find a strong correlation between the clinical source and geographic origin of the NTHi strains (55, 62). De Chiara et al. (55) defined the population structure of NTHi by using a global collection of 97 strains. They identified six monophyletic clades based on phylogenetic analysis of the core genome. However, the population structure was not related to geography or the clinical source of the isolate.

We observed evidence of a relatively large-scale genome rearrangement occurring during persistence; a 400-kbp inversion occurred between the loci that encode HMW1 and HMW2 in two of the four persistent strains with closed genomes. Inversions may change bacterial phenotypes by altering gene expression patterns or disrupting genes and are therefore thought to help bacteria persist and adapt to changing environments (63, 64). Three of the 19 publicly available closed genomes also have the same inversion: 477 (GenBank accession no. NZ_CP007470.1), 723 (GenBank accession no. NZ_CP007472.1), and C486 (GenBank accession no. NZ_CP007471.1). Our analyses indicate that the genes in the HMW operon are swapped, but operon structures are not disrupted by the inversion. The two swapped HMW operons may change the expression of the adhesins.

Fig. 3. Diagrammatic representation of the predicted P2 (A) and P5 (B) antigen membrane topologies in the 86–028NP reference sequences showing amino acid positions experiencing positive selection and changes in extracellular loops. Positively selected amino acids are highlighted by red circles in the 86–028NP reference sequence. Boxes above the surface-exposed loops represent the loop number, and the coloring scheme indicated by the key represents the number of changes observed in each loop during persistence. Numbers next to an amino acid symbol represent the position in the primary antigen sequence, starting from the amino terminus, of strain 86–028NP. Amino acids in the gray-shaded columns represent those that are transmembrane-spanning. E, extracellular; OM, outer membrane; P, periplasmic regions of the reference sequence.

Fig. 4. A comparison of the 12 NTHi candidate vaccine antigen-translated amino acid sequences observed in the set of persistent strains from the lower airways of adults with COPD. (A) Unique haplotype pairwise comparison of amino acid sequence percentage identity. An “X” in the plot represents the MHPD of percent identity of each antigen. Outlying empty circles indicate pairwise percent identity values that are greater than 1.5 times the interquartile range. (B) The proportion of isolates whose sequences represent each of the unique haplotypes of the total number of sequences analyzed for the candidate vaccine antigens. Numbers above each of the bars indicate the total number of unique haplotypes for each antigen. Colored bar portions represent the individual unique haplotypes. The blue bar portions indicate the haplotype that contains the greatest number of persistent isolate sequences, followed by the orange portion representing the second greatest, and the gray portion for the haplotype with the third most abundant number of isolate sequences within it.
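The two quantities plotted in Fig. 4, pairwise percent identity among unique haplotypes and the share of isolates carrying each haplotype, can be sketched as follows. This is an illustration under stated assumptions rather than the study's procedure: sequences are assumed to be pre-aligned to equal length with "-" for gaps, identity is taken as matching columns over all columns that are not gaps in both sequences, and the function names are hypothetical. The summary statistic labeled MHPD is not computed, because its definition is not given in this section.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns with identical residues.

    Columns where both sequences have a gap are ignored; a gap against
    a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def haplotype_summary(
    aligned: Sequence[str],
) -> Tuple[Dict[str, float], List[float]]:
    """Return haplotype prevalence and pairwise identities of unique haplotypes."""
    counts = Counter(aligned)
    total = len(aligned)
    prevalence = {hap: n / total for hap, n in counts.items()}
    pairwise = [percent_identity(a, b) for a, b in combinations(counts, 2)]
    return prevalence, pairwise
```

For one antigen, `prevalence` corresponds to the bar segments in panel B and `pairwise` to the distribution summarized in panel A. A different identity convention (for example, excluding all gapped columns) would change the values.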

Future experimental work should examine the role of this inversion in altering the phenotype and/or transcriptional profile of NTHi strains.

A key observation from our study indicates that SSRs are a main driver of change during persistence in the respiratory tract. We identified and manually confirmed changes in 22 genes in four persistent strains, and essentially every gene with changing SSRs has virulence implications. In a human challenge model, phase variation of 13 selected NTHi genes was examined in vivo during experimental human colonization of 15 subjects over 6 d (65). Genes encoding a choline kinase (licA) and IgA protease (igaB) were generally phase-off in the inoculated strain and switched to phase-on in a significant number of samples. Power et al. (39) evaluated SSR-mediated phase variation in H. influenzae and determined that there were 56, 60, 53, and 54 SSRs in strains 86–028NP, Rd KW20, R2846, and R2866, respectively. We identified a greater number of SSRs, between 72 and 92, in our closed genomes by using similar bioinformatics thresholds for SSR identification. Our research expands upon the observations from these two studies because we were able to examine the potential for genome-wide phase variation during long periods of in vivo persistence in the human airways by using prospectively collected strains.

We also observed that, as the length of time a patient carried a strain increased, the number of SSRs that changed also increased. This important observation suggests that changes in SSRs, and thus regulation of gene expression by slipped-strand mispairing, are driven by selective pressure in the human airways. This observation is consistent with recent work that shows that passage of NTHi strains through human respiratory epithelial cells selects for expression of IgA proteases that are regulated by slipped-strand mispairing (66).

In several instances, including the LOS sialyltransferase and the hemoglobin-haptoglobin–binding proteins, the SSR was present near the beginning of the ORF such that a change in copy number would engender an early frame shift, resulting in the translation of a very short peptide. It should be noted that most of these N-terminal peptides (short ORFs) were not predicted by the annotation system in our genomes; therefore, we performed manual curation of these peptides by using existing non–frame-shifted protein sequences as a guide (Dataset S2). It should also be noted that our detailed determinations were based on whole-genome sequences of four strains (eight isolates) and may therefore be underestimates of the number of potentially phase-variable genes given that we likely detected the most prevalent variants. Phase variants may exist in mixed populations, and ours were not based on uncultured samples, nor were they confirmed by additional methods (e.g., fragment length analysis of mixed populations) (40, 65). Data from the 97 persistent strains with draft genomes provide valuable information regarding the potential extent of phase variation in NTHi. However, these data should be interpreted with caution; repeat regions are challenging to sequence and assemble, and some of the SSR repeats were near the end of contigs.

We did not observe instances of large-scale gene acquisitions or deletions during persistence of NTHi in human airways in the strains with closed genomes. This is in contrast to Streptococcus pneumoniae, another exclusively human pathogen. Hiller et al. (67) sequenced sequential S. pneumoniae isolates colonizing a child over a 7-mo period and determined that ∼156 kb of the genomic content (7.8% of the genome) was exchanged during multiple horizontal gene-transfer events.

Vaccine development for NTHi is an active, evolving area of research. Immune escape has received little attention in studies of NTHi vaccine development because strains of NTHi that have persistently colonized the human respiratory tract have not been studied. Amino acids of 4 of 12 candidate vaccine antigens in various stages of development experienced positive selection in our 101 persistent strains. Furthermore, several candidate vaccine antigens experienced nonsynonymous changes during persistence in the lower airways, all in predicted surface-exposed regions of the P2 and P5 molecules analyzed. Intriguingly, amino acids experiencing positive selection also underwent changes in surface exposed loops of the P2 and P5 antigens during persistence. This observation supports the view that antigens with surface exposed epitopes are under immune selective pressure and that the resulting antigenic variation that occurs during persistence confers immune evasion and contributes to NTHi persistence. Although recognition by the host immune system is driving diversity in antigens, adaptation to the host lower airway environment is also likely contributing to such changes.

The observation of sequence variations in the final isolate compared with the first isolate of a persistent strain raises the question of whether these mutations were deleterious and enabled clearance of the strain. To this end, we sequenced the P2 gene in intermediate isolates of the patient 19 strain in which the P2 antigen incurred an 18-bp deletion during persistence. This deletion was a sequential process, but more importantly, the strain persisted for several months after the final deletion event occurred. This provides proof of principle that not all mutations observed in final isolates of persistent strains are deleterious and result in clearance of the strain. In future studies, we will investigate the intermediate isolates of persistent strains to assess the extent to which mutations are advantageous, deleterious, or neutral with regard to persistence.

In summary, our results advance the understanding of how an exclusively human pathogen alters its genome to adapt to survival in the human respiratory tract. Identifying changes that occur in the genomes of a unique set of strains of a human pathogen in its natural biological niche advances our understanding of the pathogenesis of bacterial infection in COPD substantially. The observations regarding the impact of selective pressure on candidate vaccine antigens in the human respiratory tract have immediate implications in prioritizing antigens for vaccine development.

ACKNOWLEDGMENTS. We thank Shaun Adkins, Jonathan Crabtree, James Matsumura, Suvarna Nadendla, the Genomics Resource Center, the Bioinformatics Resource Center, and the IT Department at Institute for Genome Sciences, University of Maryland School of Medicine, for their help with data storage, management, analysis, and visualization; Charmaine Kirkham, Aimee Brauer, and Antoinette Johnson (Clinical and Translational Research Center, University at Buffalo) for expert technical assistance in the laboratory; Dr. Lauren Bakaletz (Nationwide Children’s Hospital) and Dr. Joseph St. Geme III (Children’s Hospital of Philadelphia) for their donations of P5 and Hap antiserum as well as KO mutants of genes encoding each antigen. This work was supported by National Institute of Allergy and Infectious Diseases of the National Institutes of Health Grant R01AI019641 (to T.F.M. and M.M.P.), the Department of Veterans Affairs (S.S. and T.F.M.), and National Center for Advancing Translational Sciences Award UL1TR001412 to the University at Buffalo.

1. Mathers CD, Loncar D (2006) Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med 3:e442.
2. Kim V, Criner GJ (2013) Chronic bronchitis and chronic obstructive pulmonary disease. Am J Respir Crit Care Med 187:228–237.
3. Murphy TF, Brauer AL, Schiffmacher AT, Sethi S (2004) Persistent colonization by Haemophilus influenzae in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 170:266–272.
4. Sethi S, Murphy TF (2008) Infection in the pathogenesis and course of chronic obstructive pulmonary disease. N Engl J Med 359:2355–2365.
5. White AJ, et al. (2003) Resolution of bronchial inflammation is related to bacterial eradication following treatment of exacerbations of chronic bronchitis. Thorax 58:680–685.
6. Desai H, et al. (2014) Bacterial colonization increases daily symptoms in patients with chronic obstructive pulmonary disease. Ann Am Thorac Soc 11:303–309.
7. Wong SM, Akerley BJ (2012) Genome-scale approaches to identify genes essential for Haemophilus influenzae pathogenesis. Front Cell Infect Microbiol 2:23.
8. Sethi S, Evans N, Grant BJ, Murphy TF (2002) New strains of bacteria and exacerbations of chronic obstructive pulmonary disease. N Engl J Med 347:465–471.
9. Murphy TF, et al. (2007) Haemophilus haemolyticus: A human respiratory tract commensal to be distinguished from Haemophilus influenzae. J Infect Dis 195:81–89.
10. Pettigrew MM, et al. (2016) Effect of fluoroquinolones and macrolides on eradication and resistance of Haemophilus influenzae in chronic obstructive pulmonary disease. Antimicrob Agents Chemother 60:4151–4158.
11. Kong Y (2011) Btrim: A fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98:152–153.
12. Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829.

13. Seemann T (2014) Prokka: Rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.
14. Chin CS, et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10:563–569.
15. Delcher AL, Phillippy A, Carlton J, Salzberg SL (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30:2478–2483.
16. Angiuoli SV, et al. (2011) CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12:356.
17. Sahl JW, et al. (2015) The in silico genotyper (ISG): An open-source pipeline to rapidly identify and annotate nucleotide variants for applications. bioRxiv, 10.1101/015578.
18. Stamatakis A (2014) RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
19. Huson DH, et al. (2007) Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8:460.
20. Agrawal S, et al. (2017) CloVR-comparative: Automated, cloud-enabled comparative microbial genome sequence analysis pipeline. BMC Genomics 18:332.
21. Angiuoli SV, Dunning Hotopp JC, Salzberg SL, Tettelin H (2011) Improving pan-genome annotation using whole genome multiple alignment. BMC Bioinformatics 12:272.
22. Riley DR, Angiuoli SV, Crabtree J, Dunning Hotopp JC, Tettelin H (2012) Using Sybil for interactive comparative genomics of microbes on the web. Bioinformatics 28:160–166.
23. Crabtree J, Angiuoli SV, Wortman JR, White OR (2007) Sybil: Methods and for multiple genome comparison and visualization. Methods Mol Biol 408:93–108.
24. Siena E, et al. (2016) In-silico prediction and deep-DNA sequencing validation indicate phase variation in 115 Neisseria meningitidis genes. BMC Genomics 17:843.
25. Poje G, Redfield RJ (2003) Transformation of Haemophilus influenzae. Haemophilus influenzae Protocols, eds Herbert MA, Hood DW, Moxon ER (Humana Press, Totowa, NJ), Vol 71, pp 57–70.
26. Murphy TF, Kirkham C, Sethi S, Lesse AJ (2005) Expression of a peroxiredoxin-glutaredoxin by Haemophilus influenzae in biofilms and during human respiratory tract infection. FEMS Immunol Med Microbiol 44:81–89.
27. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760.
28. Li H, et al.; 1000 Data Processing Subgroup (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.
29. Li W, et al. (2015) The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res 43:W580–W584.
30. Suyama M, Torrents D, Bork P (2006) PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34:W609–W612.
31. Jeffares DC, Tomiczek B, Sojo V, dos Reis M (2015) A beginners guide to estimating the non-synonymous to synonymous rate ratio of all protein-coding genes in a genome. Methods Mol Biol 1201:65–90.
32. Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ (2004) PRED-TMBB: A web server for predicting the topology of beta-barrel outer membrane proteins. Nucleic Acids Res 32:W400–W404.
33. Sikkema DJ, Murphy TF (1992) Molecular analysis of the P2 porin protein of nontypeable Haemophilus influenzae. Infect Immun 60:5204–5211.
34. Webb DC, Cripps AW (1998) Secondary structure and molecular analysis of interstrain variability in the P5 outer-membrane protein of non-typable Haemophilus influenzae isolated from diverse anatomical sites. J Med Microbiol 47:1059–1067.
35. Hayat S, Peters C, Shu N, Tsirigos KD, Elofsson A (2016) Inclusion of dyad-repeat pattern improves topology prediction of transmembrane β-barrel proteins. Bioinformatics 32:1571–1573.
36. Yang M, Johnson A, Murphy TF (2011) Characterization and evaluation of the Moraxella catarrhalis oligopeptide permease A as a mucosal vaccine antigen. Infect Immun 79:846–857.
37. Murphy TF, Brauer AL (2011) Expression of urease by Haemophilus influenzae during human respiratory tract infection and role in survival in an acid environment. BMC Microbiol 11:183.
38. Moxon R, Bayliss C, Hood D (2006) Bacterial contingency loci: The role of simple sequence DNA repeats in bacterial adaptation. Annu Rev Genet 40:307–333.
39. Power PM, et al. (2009) Simple sequence repeats in Haemophilus influenzae. Infect Genet Evol 9:216–228.
40. Atack JM, et al. (2015) A biphasic epigenetic switch controls immunoevasion, virulence and niche adaptation in non-typeable Haemophilus influenzae. Nat Commun 6:7828.
41. St Geme JW, 3rd, Grass S (1998) Secretion of the Haemophilus influenzae HMW1 and HMW2 adhesins involves a periplasmic intermediate and requires the HMWB and HMWC proteins. Mol Microbiol 27:617–630.
42. Cholon DM, et al. (2008) Serial isolates of persistent Haemophilus influenzae in patients with chronic obstructive pulmonary disease express diminishing quantities of the HMW1 and HMW2 adhesins. Infect Immun 76:4463–4468.
43. Smith HO, Tomb JF, Dougherty BA, Fleischmann RD, Venter JC (1995) Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science 269:538–540.
44. Redfield RJ, et al. (2006) Evolution of competence and DNA uptake specificity in the . BMC Evol Biol 6:82.
45. Redfield RJ, et al. (2005) A novel CRP-dependent regulon controls expression of competence genes in Haemophilus influenzae. J Mol Biol 347:735–747.
46. Sinha S, Mell JC, Redfield RJ (2012) Seventeen Sxy-dependent cyclic AMP receptor protein site-regulated genes are needed for natural transformation in Haemophilus influenzae. J Bacteriol 194:5245–5254.
47. Prymula R, et al. (2006) Pneumococcal capsular polysaccharides conjugated to protein D for prevention of acute otitis media caused by both Streptococcus pneumoniae and non-typable Haemophilus influenzae: A randomised double-blind efficacy study. Lancet 367:740–748.
48. Murphy TF (2015) Vaccine for nontypeable Haemophilus influenzae: The future is now. Clin Vaccine Immunol 22:459–466.
49. Khan MN, et al. (2016) Developing a vaccine to prevent otitis media caused by nontypeable Haemophilus influenzae. Expert Rev Vaccines 15:863–878.
50. Yang Z, Swanson WJ (2002) Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol 19:49–57.
51. Hiltke TJ, Schiffmacher AT, Dagonese AJ, Sethi S, Murphy TF (2003) Horizontal transfer of the gene encoding outer membrane protein P2 of nontypeable Haemophilus influenzae, in a patient with chronic obstructive pulmonary disease. J Infect Dis 188:114–117.
52. Duim B, et al. (1997) Molecular variation in the major outer membrane protein P5 gene of nonencapsulated Haemophilus influenzae during chronic infections. Infect Immun 65:1351–1356.
53. Singh B, Brant M, Kilian M, Hallström B, Riesbeck K (2010) Protein E of Haemophilus influenzae is a ubiquitous highly conserved adhesin. J Infect Dis 201:414–419.
54. Fleischmann RD, et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512.
55. De Chiara M, et al. (2014) Genome sequencing of disease and carriage isolates of nontypeable Haemophilus influenzae identifies discrete population structure. Proc Natl Acad Sci USA 111:5439–5444.
56. Power PM, Bentley SD, Parkhill J, Moxon ER, Hood DW (2012) Investigations into genome diversity of Haemophilus influenzae using whole genome sequencing of clinical isolates and laboratory transformants. BMC Microbiol 12:273.
57. Coughlan H, et al. (2015) Comparative genome analysis identifies novel nucleic acid diagnostic targets for use in the specific detection of Haemophilus influenzae. Diagn Microbiol Infect Dis 83:112–116.
58. Zhang L, et al. (2012) Nontypeable Haemophilus influenzae genetic islands associated with chronic pulmonary infection. PLoS One 7:e44730.
59. Strouts FR, et al. (2012) Lineage-specific virulence determinants of Haemophilus influenzae biogroup aegyptius. Emerg Infect Dis 18:449–457.
60. Su YC, Resman F, Hörhold F, Riesbeck K (2014) Comparative genomic analysis reveals distinct genotypic features of the emerging pathogen Haemophilus influenzae type f. BMC Genomics 15:38.
61. Spencer-Smith R, Varkey EM, Fielder MD, Snyder LA (2012) Sequence features contributing to chromosomal rearrangements in Neisseria gonorrhoeae. PLoS One 7:e46023.
62. LaCross NC, Marrs CF, Gilsdorf JR (2013) Population structure in nontypeable Haemophilus influenzae. Infect Genet Evol 14:125–136.
63. Kresse AU, Dinesh SD, Larbig K, Römling U (2003) Impact of large chromosomal inversions on the adaptation and evolution of Pseudomonas aeruginosa chronically colonizing cystic fibrosis lungs. Mol Microbiol 47:145–158.
64. Cui L, Neoh HM, Iwamoto A, Hiramatsu K (2012) Coordinated phenotype switching with large-scale chromosome flip-flop inversion observed in bacteria. Proc Natl Acad Sci USA 109:E1647–E1656.
65. Poole J, et al. (2013) Analysis of nontypeable Haemophilus influenzae phase-variable genes during experimental human nasopharyngeal colonization. J Infect Dis 208:720–727.
66. Murphy TF, et al. (2017) Immunoglobulin A protease variants facilitate intracellular survival in epithelial cells by nontypeable Haemophilus influenzae that persist in the human respiratory tract in chronic obstructive pulmonary disease. J Infect Dis 216:1295–1302.
67. Hiller NL, et al. (2010) Generation of genic diversity among Streptococcus pneumoniae strains via horizontal gene transfer during a chronic polyclonal pediatric infection. PLoS Pathog 6:e1001108.
