Strain-resolved Dynamics of the Lung Microbiome in Patients with Cystic Fibrosis

Marija Dmitrijeva Universitat Zurich Christian Kahlert Kantonsspital Sankt Gallen Rounak Feigelman Universitat Zurich Rebekka Kleiner Kantonsspital Sankt Gallen Oliver Nolte Zentrum fur Labormedizin Sankt Gallen Werner Albrich Kantonsspital Sankt Gallen Florent Baty Kantonsspital Sankt Gallen Christian Von Mering (  [email protected] ) Universitat Zurich https://orcid.org/0000-0001-7734-9102

Research

Keywords: cystic fbrosis, longitudinal data, lung sputum, metagenomics, strain-typing

Posted Date: March 12th, 2020

DOI: https://doi.org/10.21203/rs.3.rs-16882/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License

Version of Record: A version of this preprint was published at mBio on April 27th, 2021. See the published version at https://doi.org/10.1128/mBio.02863-20.

Page 1/30 Abstract

Background In cystic fbrosis, dynamic and complex communities of microbial pathogens and commensals colonize the lung. Cultured isolates from lung sputum reveal high inter- and intraindividual variability in pathogen strains, sequence variants, and phenotypes; disease progression likely depends on the precise combination of infecting strains. Routine clinical protocols, however, provide limited overview of the colonizer populations. A more comprehensive and precise identifcation and characterization of infecting strains could therefore assist in making corresponding decisions on treatment. Results Here, we propose a minimally invasive and culture-independent way to monitor microbial lung content. We describe longitudinal tracking of four patients with cystic fbrosis over several months, including repeated sampling and bulk DNA sequencing of sputum. We fnd that the taxonomic identity of individual lineages can be easily established. Crucially, even superfcially clonal pathogens can be subdivided into multiple sub-strains at the sequence level. By tracking individual allelic differences over time, an assembly-free clustering approach allows us to reconstruct multiple strain-specifc genomes with clear structural differences. Conclusions We emphasize the advantages of metagenomics sequencing of longitudinal data for taxonomic identifcation of microbial colonizers in cystic fbrosis lungs. Our study is the frst to track sub-strain pathogen dynamics using a culture-independent approach, with a perspective of high- resolution routine characterization of disease progression that may allow to more accurately monitor treatment outcomes and tailor treatment.

Background

Cystic fbrosis (CF) is a monogenic, autosomal recessive, and life-shortening disease that predominantly affects the Caucasian population [1]. The disease involves multiple organ systems but has its most severe consequences in the airways, where it leads to decreased mucociliary clearance followed by mucus plugging. Subsequently, the mucosal airways of CF patients are chronically infamed and colonized by allochthonous microorganisms. The resulting respiratory symptoms include difculty breathing, persistent cough, expectoration of sputum, and recurrent pulmonary infections. Respiratory failure accounts for more than half of CF patient deaths [2,3]. Nevertheless, improvements in CF management such as antibiotic therapy and administration of mucolytic drugs have increased the median life expectancy for patients, turning CF into a predominantly adult disorder [4]. More recent therapies aim to directly ameliorate residual function of the protein encoded by the CFTR gene locus and have been shown to slow the rate of lung function decline in a subset of CF patients [5]. Chronic respiratory infections, however, seem to persist even though respiratory symptoms improve [6]. Improved characterization of persistent respiratory pathogens is therefore needed to develop tailored therapies that control their composition and abundance.

Several common pathogens colonizing the lungs of CF patients are known. Pseudomonas aeruginosa is predominant in the adult CF population [2,3]. However, aggressive antimicrobial therapies aimed at preventing initial colonization by P. aeruginosa [7,8] have recently led to a decline in its prevalence [2]. Another key pathogen in CF is Staphylococcus aureus, which accounts for the majority of infections in

Page 2/30 young patients and has become increasingly more prevalent amongst all CF patients [2,3]. Other pathogens recognized in CF include members of the Burkholderia cepacia complex, mycobacteria such as Mycobacterium avium and Mycobacterium abscessus, Stenotrophomonas maltophilia, and members of the Achromobacter genus [2,3,9]. Although these pathogens are present in a small fraction of CF patients, they are often multi-drug resistant and thus challenging with regard to the treatment options in the clinic. Finally, anaerobic such as members of the Prevotella genus have also been identifed in CF patient sputum using specialized culture techniques [10], but these are typically not assessed during routine clinical diagnosis and their role as pathogens in CF patients has yet to be defned.

Culture-independent approaches are increasingly complementing and expanding on the fndings of traditional microbiology approaches. For instance, studies using sequencing to characterize the lung microbiome have noted the presence of anaerobic bacteria not recognized as typical CF pathogens, such as Prevotella, Veillonella, and , in a sizable portion of the patients [11–13]. In addition, culture-independent approaches uncovered a high level of variability across the lung microbiomes of CF patients [13–18]. In late-stage patients, however, the microbiome generally tends to be lower in diversity and becomes dominated by one or a few of the commonly recognized CF pathogens [12–14,17]. Several efforts have compared patient-matched samples from different clinical states, but haven’t found signifcant reproducible changes between samples taken at baseline and at exacerbation, which suggests that the CF lung microbiome is resilient over time [11,12,14–16]. Most culture-independent studies of the lung microbiome in CF, however, have been performed using 16S rRNA sequencing, and thus provide only limited insights into the functions or strain identities of lung microbial communities.

Whole-genome sequencing (WGS) and metagenome sequencing improve on the limited taxonomic and functional resolution of 16S rRNA sequencing. Metagenomics allows to survey bacterial, viral, and fungal populations at once, giving a more complete picture of microbial relative abundances in the CF lung microbiome [17,19]. Consequently, a larger portion of microbiome inhabitants can be classifed at the species level [17], and prominent CF pathogens have been classifed at the strain level [17,18,20]. Moreover, multiple subpopulations of specifc pathogens have been detected in CF through metagenome sequencing [17,18]. However, so far only single reference points per patient were typically sequenced, limiting haplotype deconvolution and preventing insights into the temporal dynamics of these strains.

Here, we describe longitudinal sputum sampling in CF patients over the course of one and a half years, conducting metagenomics sequencing of spontaneously expectorated sputum at multiple time points. The aim of the study was to investigate the advantages of collecting longitudinal data of CF patients for monitoring and characterizing strain successions in situ. We successfully classifed most of the lung microbiome members at the species and genus level and confrmed the presence of pathogens identifed during routine clinical diagnosis. Importantly, we show how longitudinal metagenomics data can be used to deconvolute distinct strain variants of the same species within a given patient. We introduce an assembly-free approach that can delineate near-complete, strain-specifc genomes even when their sequence divergence is fairly low. Our study introduces culture-independent methods that can be used in the future for monitoring and management of pathogen strains in CF.

Page 3/30 Results Patient-specifc lung microbiomes

We followed four CF patients over the course of 19 months (Fig. 1, Additional File 1: Figs. S1, S2, S3). A summary of patient information and clinical parameters is available in Additional File 2: Table S1. An overview of prescribed medication is available in Additional File 3: Table S2. During the course of our study, the patients produced sputum spontaneously, either during routine clinical check-ups or during exacerbations. In total, 25 samples were collected. We extracted total DNA from the collected sputum, enriched for non-methylated DNA, and sequenced it using the Illumina HiSeq 4000 platform. Sequencing depth varied between 21 million reads and 179 million reads (Fig. 1f, Additional File 1: Figs. S1f, S2f, S3f). Human DNA constituted between 70% and 93% of all reads. We observed no signifcant associations between the total DNA concentration in the sample, the sequencing depth, or the fraction of non-human reads. Non-human DNA predominantly originated from bacteria; DNA viruses and fungi accounted for at most 0.75% of the profled microbiome (Additional File 5: Table S4). More than 95% percent of the bacteria at each time point could be identifed at least at the genus level using the mOTUs software [21] (Fig. 1d, Additional File 1: Figs. S1d, S2d, S3d , Additional File 4: Table S3), indicating that the lung microbiomes of these patients largely consisted of previously characterized bacterial clades.

Lung microbiome compositions showed marked differences between the four patients. For instance, patient CFR06 showed high abundance of oral anaerobes such as Prevotella, Parvimonas, and Fusobacterium, and was the only patient devoid of any Pseudomonas (Additional File 1: Fig. S1d). This patient was several years younger than the other subjects and displayed a milder form of CF, with better lung function and higher lung microbiome diversity (Additional File 1: Fig. S1a, S1e). Around seven months into the study, the patient had an exacerbation that was treated with a combination of piperacillin/tazobactam and intravenous tobramycin (Additional File 1: Fig. S1b). Following this event, the lung microbiome composition of CFR06 experienced a major shift - Prevotella buccae, S. aureus, and Achromobacter xylosoxidans/insuavis all decreased in relative abundance. Concomitantly, Prevotella oris, Fusobacterium nucleatum, and morbillorum increased in relative abundance (Additional File 1: Fig. S1d).

By contrast, the lungs of patients CFR07 and CFR09 were colonized predominantly by P. aeruginosa (Additional File 1: Figs. S2d, S3d). Patient CFR07 displayed a stable clinical phenotype, with no exacerbations recorded during the course of the study, and retained lung function at 30% (Additional File 1: Fig. S2a, S2b). P. aeruginosa accounted for more than 90% of all bacteria in this patient’s lung microbiome, resulting in the lowest microbiome diversity of all examined patients (Additional File 1: Fig. S2e). The lung microbiome of patient CFR09 contained, in addition to P. aeruginosa, high fractions of the genera Prevotella and Veillonella, and various low abundant genera that comprised up to 35% of the lung microbiome (Additional File 1: Fig. S3d).

Page 4/30 Finally, patient CFR11 presented the most severe course of disease (Fig. 1) and died shortly after study completion. The patient experienced multiple exacerbations, accompanied by high levels of infammation, with lung function gradually declining from 36% to 28% (Fig. 1a, 1b). The lung microbiome of CFR11 was dominated by P. aeruginosa and the oral anaerobes P. oris and F. nucleatum (Fig. 1d). In addition, Parvimonas micra, Streptococcus intermedius and A. xylosoxidans/insuavis were present in lower relative abundances, with the fraction of Achromobacter increasing at later time points (Fig. 1d). Thus, we observed a unique lung microbiome composition for each of our four patients. In three of the patients, the most abundant bacteria remained the same throughout the course of the study, and only CFR06 displayed a major shift in lung microbiome composition.

Differences between clinical microbiology and sequence- based identifcation

Next, we validated the lung microbiome compositions obtained via sequencing with pathogen identifcation performed by the external clinical microbiology laboratory (see “Methods”, Fig. 1c, 1d, Additional fle 1: Figs. S1-S4). Pathogen identifcation was restricted to aerobic and facultative anaerobic bacteria considered to be relevant pathogens in CF. Bacteria not considered clinically relevant were summarized as respiratory fora, and anaerobic bacteria were not cultured. Culture assessment confrmed the presence of P. aeruginosa at least once in patients CFR07, CFR09, and CFR11, in whom it was identifed as the main pathogen in accordance with the sequencing data. A. xylosoxidans was confrmed for CFR06 and CFR11, although with some temporal discrepancies between the two methods. Finally, we confrmed the presence of S. aureus in CFR06 throughout the entire course of study and the presence of Haemophilus infuenzae in CFR06 towards the end of the study. Overall, we report good agreement between the sequencing data and the microbiological culture assessment results; discrepancies were observed mostly for transient community members and/or those of low abundance.

Sequence-based taxonomic profling identifed more bacteria that were either not grown or not identifed in the clinical microbiology laboratory, providing a more complete picture of the lung microbiome in our CF patients. Notably, anaerobic bacteria that were not identifed clinically accounted for the majority of identifcations in patients CFR06 (Additional fle 1: Fig. S5a) and CFR11 (Additional fle 1: Fig. S5d), and up to 47% of the lung microbiome composition in patient CFR09 (Additional fle 1: Fig. S5c). From all identifed bacteria, the most relevant from a clinical perspective were A. xylosoxidans (identifed in two patients: CFR06 and CFR11) and P. aeruginosa (identifed in three patients: CFR07, CFR09, and CFR11). We therefore set out to assess to what extent cultivation-independent sequencing would allow to classify these pathogens in more detail.

Strain-level classifcation of Achromobacter

Page 5/30 Seven of the sputum samples exhibited a notable fraction of sequence reads that could be mapped to A. xylosoxidans (one sample from CFR06 and six samples from CFR11), and the assembled Achromobacter contigs showed more than 80% expected genome completeness according to CheckM [22] (Additional fle 6: Table S5). From a selection of 22 fully-sequenced A. xylosoxidans reference strains, A. xylosoxidans FDAARGOS_147, a strain isolated from a patient at Children’s National Hospital in Washington DC, was the only strain whose gene content overlapped with that of the patients’ strains by more than 50% (Additional fle 1: Fig. S6). We therefore decided to use a phylogenetic tree-based marker gene sequence identity approach to place our samples within the Achromobacter genus.

Of a total of 144 Achromobacter genomes from the NCBI genome database (November 2018) [23], Achromobacter genus trees were constructed using either the sequences from the ten single-copy genes used by mOTUs, standard Achromobacter MLST sequences [24], or pairwise genome average nucleotide identities (see “Methods”). All three genus trees were largely consistent with each other, but revealed some discrepancies with the NCBI consensus (Fig. 2, NCBI columns). As has been noted previously, taxonomic classifcation within the Achromobacter genus has inconsistencies [25]. We noted that strains assigned to various Achromobacter species were often not monophyletic; instead, our analyses confdently clustered together strains assigned to different species by the NCBI taxonomy. For example, one well-supported cluster contained genomes from A. ruhlandii, denitrifcans, and xylosoxidans (Fig. 2a). The Genome Taxonomy Database (GTDB) is a recent effort to redefne prokaryotic taxonomy and improve on such inconsistencies in species assignment [26]. To determine whether this approach could be of help here, we downloaded the species assignments from GTDB (as of April 2019). Indeed, the GTDB assignments of the Achromobacter species formed more consistent clusters within our tree (Fig. 2, GTDB columns).

On all three Achromobacter genus trees, we observed that our patient-derived Achromobacter genomes and A. xylosoxidans FDAARGOS_147 clustered together with two strains of A. insuavis. The strain in patient CFR11 seemed nearly identical to A. insuavis AXX-A, a strain observed at the Laboratory of Bacteriology at the Faculty of Medicine in Dijon, France (Fig. 2a, zoom-in). The strain in patient CFR06 clustered together with A. xylosoxidans FDAARGOS_147, but was not identical to it (Fig. 2a, zoom-in). Taken together, these results indicated that the clinical laboratory misidentifed this pathogen, stating A. xylosoxidans instead of A. insuavis.

Strain-level classifcation of P. aeruginosa

Next, we asked how well strain identifcation performs in the case of P. aeruginosa, a species for which much more comprehensive reference information is available. Patients CFR07, CFR09, and CFR11 all carried Pseudomonas in sufcient amounts to allow 99% estimated genome coverage (Additional fle 6: Table S5). To identify the strains, we compared our samples to a representative set of 359 P. aeruginosa strains using the PanPhlAn software [27] and the mOTUs single-copy gene sequences (Additional fle 1: Fig. S7). Both revealed that patients CFR09 and CFR11 harbour strains very similar to P. aeruginosa

Page 6/30 PAER4_119 (frst sampled in Poland), but the samples from CFR07 were distinct and clustered with other reference strains. Indeed, the tree revealed that CFR07 is infected by a new strain not present in the representative set (Additional fle 1: Fig. S7b).

Our work with Pseudomonas indicated that more than one strain variant was likely present in some of the patient samples, most notably in CFR11. Both PanPhlAn and CheckM presented corresponding warnings, but could not delineate these variants.

Delineation of P. aeruginosa strain variants without cultivation or genome assembly

To distinguish and track P. aeruginosa strain variants in patient CFR11, we took advantage of the repeated samplings in our time-series data. Under the assumption that the relative abundances of competing P. aeruginosa populations in a patient would vary over time, any population-specifc sequence variants should similarly vary over time. This would allow reconstructing constituent genomes through clustering sequence variants by their shared temporal behaviour. To test this approach, we aligned all apparent Pseudomonas reads from patient CFR11 to the closest reference genome (P. aeruginosa PAER4_119), which served as a scaffold (Fig. 3.1, 3.2). We then called single nucleotide variants (SNVs) using metaSNV [28] and determined their allele frequency at each time point (Fig. 3.2, 3.3). Finally, we determined clusters of SNVs displaying similar changes in allele frequencies over time (Fig. 3.4).

A total of 3451 SNVs were called, of which 3079 had appreciable allele frequencies at every time point in CFR11 (see “Methods”). Repeated t-SNE clustering runs at slightly varying settings consistently yielded seven distinct clusters of SNVs, in addition to a pool of lower-frequency SNVs that could not be reliably clustered (Fig. 4a). Of the seven clusters, each showed a clearly distinctive pattern of allele frequencies over time (Fig. 4b). Cluster 3 appeared to be a linear sum of clusters 2 and 6, and seemed inversely correlated with cluster 1. Clusters 6 and 7 followed a somewhat shared pattern over time, with the exception of the frst time point, where they differed markedly. Unexpectedly, clusters 4 and 5 exhibited very little change over time. To investigate the latter more carefully, we plotted the spatial position of the cluster-specifc SNVs in the reference genome (Additional fle 1: Figure S8). Whereas SNVs belonging to clusters 1, 2, 3, 6, and 7 were homogeneously positioned across the genome as expected, SNVs from clusters 4 and 5 exhibited a strong focus on specifc regions of the genome. Together with the fact that these clusters do not vary over time, we interpret such SNVs to refect intragenomic polymorphisms in relation to the reference genome (e.g. at tandem-repeat regions) rather than diagnostic variants in the patient. Clusters 4 and 5 were hence discarded. Considering the remaining clusters, the most parsimonious interpretation of the data appears to be the presence of three distinct P. aeruginosa strain variants in the patient (refected in clusters 1, 2 and 6, respectively). In this scenario, cluster 3 would consist of SNVs that are ancestrally shared between two of the variants, and whose frequencies hence refects the sum of their relative abundances.

Page 7/30 Because t-SNE is a non-deterministic algorithm, we sought to validate our observations with another method and thus performed principal component analysis (PCA) on the same selection of SNVs. The frst three components explained 97% of the total variation, with each subsequent component contributing less than 1%. We then performed hierarchical clustering with eight clusters on the frst three components (Additional fle 1: Fig. 9a, only frst and second component shown). When projecting these clusters onto our t-SNE plot (Additional fle 1: Fig. 9b), the results were largely consistent (the only deviation being a merging of clusters 6 and 7).

We also validated our results using DESMAN, a tool developed for grouping SNVs into haplotypes [29]. DESMAN indeed confrmed our grouping, yielding three haplotypes that strongly coincided with clusters 1, 2, and 6 (Fig. 4b, Additional fle 1: Fig. 9c, 9d). We note that it would have been impossible to distinguish these strain variants using the 16S rRNA gene alone (Additional fle 1: Fig. 10). Likewise, at the observed pairwise divergence of less than 0.1% between the variant genomes, traditional genome assembly approaches would also likely not be able to distinguish these [30].

Table 1. Number of SNVs consistently clustered by three different approaches (t-SNE, PCA, and DESMAN). # SNVs Strain variant #1 specific 213 Strain variant #2 specific 105 Strain variant #3 specific 186 Shared between #2 and #3 67

Subsequently, we focused on SNVs that were assigned to the same cluster or haplotype by all three methods. At least 100 SNVs separated each strain variant from the other (Table 1). The rate of mutations in P. aeruginosa has been estimated to be at most 5.5 SNVs per year [31–33], unless a hypermutator phenotype develops [32,34]. To test for potential hypermutator mutations, we recruited reads to the six genes DNA repair genes known to be affected in hypermutator strains: mutS, mutL, uvrD, mutM, mutY, and mutT [33–36]. We detected only one mutation that might potentially disrupt gene function – a frameshift deletion in the mutS gene. However, this mutation was observed only in reads corresponding to strain variant #1 (not variants #2 or #3), and the mutation was positioned towards the 3’-end of the gene, close to the stop-codon of the predicted open reading frame.

Detection of P. aeruginosa variant-specifc structural genome differences

Page 8/30 To determine whether longitudinal metagenomics sequencing provides sufcient evidence to detect large- scale genomic variation between strain variants, we recruited reads against the reference P. aeruginosa PAER4_119 genome and calculated the average coverage for windows of 1000 base pairs along the genome. After suitable normalization (see “Methods”), any large-scale genomic differences between strain variants would become evident as time-point dependent deviations in read coverage.

Indeed, seven regions of the genome displayed obvious deviations in read coverage over multiple time points (Fig. 5, circos diagram). Of these, the regions located around 4.4 Mbp, 4.9 Mbp, and 5.5 Mbp coincided with predicted phage sequences; structural variations at phage insertions are to be expected. A fourth region around 2.7 Mbp showed inconsistent and overall low coverage in our sample data and was not considered further. This left three regions of interest. Manual inspection allowed to pinpoint region borders more precisely at 1,639,504 bp – 1,649,747 bp and 3,024,303 bp – 3,088,958 bp for two of the regions. The third region was revealed to be composite, its borders at 3,228,483 bp – 3,235,055 bp and 3,241,118 bp - 3,243,808 bp. We further refer to these selected regions as region #1, #2, and #3.

Each of the three regions showed signifcant correlations to the relative abundance profles of one of our inferred strain variants (Fig. 5, line-plots). Region #1 highly correlated with the relative abundance profle of variant #1 (95% CI: 0.9731 < r < 0.9986, p-value = 6.349e-09), indicating that it was present there, but absent in variants #2 and #3. The genes contained in region #1 encoded multiple metabolism-related genes, but its overall functional signifcance was difcult to assess. Region #2 correlated with the relative abundance profle of variant #3 (95% CI: 0.8447 < r < 0.9913, p-value = 8.316e-06). This region contained pyoverdine synthetases and genes from the type VI secretion system. We observed that two SNVs in the region mapped to variant #1, suggesting that it is present in all three strain variants, but amplifed at least two-fold in variant #3. Several dozen paired-end reads spanned from the end of the region back to its start, suggesting the additional copies in variant #3 are either arranged as tandem duplications or form an excised plasmid. Finally, the average coverage of region #3 correlated very well with the combined relative abundance profle of variants #2 and #3 (95% CI: 0.9713 < r < 0.9985, p-value = 8.240e-09), indicating that it was absent in variant #1. This region contained multiple genes of the psl operon, which plays a role in bioflm generation in P. aeruginosa. Taken together, it becomes clear that the information contained in the relative read coverage over time, when correlated against the relative proportions of SNVs, allows precise and confdent mapping of large-scale genomic structure variants to their respective strain variants.

Discussion

In this study, we sought insights into the temporal dynamics of the lung microbiome in CF by using non- invasive DNA sequencing of lung sputum. By repeated sampling of the lung microbiome over several months, we were able to distinguish persistent pathogens from other, more transient community members. The taxonomic identifcation of pathogens from sequencing data was generally in line with the clinical microbiology laboratory reports, albeit in the case of Achromobacter and Pseudomonas, the sequence-based identifcations proved to be more precise. Moreover, with P. aeruginosa in patient CFR11,

Page 9/30 at least three distinct strain variants were observed. Importantly, distinguishing these variants would not have been possible without the longitudinal repeated samplings, showing that tracking patients over time provides valuable added information. To our knowledge, only two other shotgun metagenomics studies with multiple reference points per CF patient have been published so far [20,37]. Neither, however, explores longitudinal genomic variation on a sub-strain level within a patient.

The youngest of our patients (CFR06) showed the highest diversity in the lung, with designated pathogens accounting for no more than 21% of the microbiome. A recent study performed in younger CF patients argues that an abundance of non-typical microbes in the lung microbiome may be contamination from the oral and upper-airway microbiome and is not observed in patients with increased bacterial burden [38]. However, we observed a similarly high content of non-typical microbes in CFR11, a patient who was at a much later stage in the disease.

Our study detected one apparent case of clinical species mis-identifcation: in both patients CFR06 and CFR11, the observed pathogen strain likely belongs to A. insuavis, not A. xylosoxidans. This refects general issues in the taxonomy of this genus. The Achromobacter genus was expanded with new species over the past decade [39–41], thus misidentifcation of Achromobacter species by conventional clinical methods is not uncommon [25,42–44]. Certain species of Achromobacter cannot be distinguished based on 16S rRNA sequence alone [25,39,40] and lack of representative spectra in MALDI-TOF databases commonly used by clinical microbiology laboratories [43,45]. To counter this issue, genotyping methods such as multi-locus strain-typing [25] and nrdA sequencing [42] have been developed. Genotyping of several CF patient cohorts has revealed A. insuavis was the second most prevalent species after A. xylosoxidans [43,44,46–48] or at least accounted for a considerable fraction of Achromobacter infections [42]. A. insuavis is also one of the few Achromobacter species capable of chronic infection [44,47,48], and our observations in patient CFR11 are in line with previous fndings. Our results show that clear and reliable pathogen identifcation is possible without the need for cultivation, based on community-wide sequencing data alone.

For one of the best-covered pathogens, P. aeruginosa, the longitudinal data allowed us to delineate at least three distinct strain variants in patient CFR11. These displayed distinct temporal dynamics and were genotypically more similar to each other than to the closest reference genome. Multiple studies have described genotypically and/or phenotypically distinct P. aeruginosa subpopulations in the lung by culturing isolates from CF patients [36,49–55]. Deep sequencing approaches have indicated that P. aeruginosa is polymorphic in some patients [17,18], as are some other CF pathogens [17,18,56]. Extensive longitudinal sampling, however, was performed in few prior studies [20,51]. Ours is thus the frst study to use longitudinal metagenomics data for describing the dynamics of multiple strain variants of a species in the same CF patient.

Emergence of phenotypically and genotypically distinct subpopulations of P. aeruginosa in CF has previously been shown to be driven by spatial heterogeneity [52,57]. Lung regions differ in oxygen and carbon dioxide concentrations [58], patterns of ventilation and deposition [59], and disease burden [60].

Page 10/30 General microbial community composition differs depending on lung region as well [61,62]. Nevertheless, other studies have found no clustering of P. aeruginosa isolates based on region of isolation [63] and showed identical phenotypes and genotypes in upper and lower airways [64,65]. Although sputum sequencing does not provide us with information on the spatial distribution of our strains within the lung, the fact that we observed relatively strong temporal changes in the relative strain abundances would argue against a purely location-driven strain differentiation.

Longitudinal microbiome tracking in CF also still has a number of limitations. First, both taxonomic identifcations as well as sub-strain detection requires sufcient coverage in terms of number of timepoints and in terms of sequence read coverage depth. For example, the smaller number of time points from patients CFR07 and CFR09 prevented us from performing strain deconvolution on P. aeruginosa in a manner similar to CFR11. Likewise, Achromobacter in patient CFR06 was not sufciently covered in most time points to allow precise strain identifcation.

The limited availability of genetic data for characterization was partly due to an excess of human DNA; up to 93% of generated reads mapped to the human genome, which is not unexpected in studies of lung sputum [17,18]. In order to enrich for non-human material, we performed depletion of methylated DNA. The depletion worked to a certain extent based on data from paired samples, and we obtained more than 25% of non-human reads in select samples, which is more than a two-fold improvement to the numbers from previous studies [17,18]. However, it did not work equally well for each sample. A recent assessment of human DNA depletion methods in human saliva samples showed the limited effectiveness of currently available kits and introduced a new depletion method that decreased the fraction of human reads to 8.53% [66]. This method has yet to be applied to sputum. Another recent study proposed a microfuidics- based method to enrich microbial DNA in samples from human airways [67]. The implementation of methods enriching for non-host material in oral and sputum samples looks promising, as this would lead to a decrease in sequencing costs and provide more sequencing material to study the less abundant bacteria.

We were unable to fully exclude the possibility that the P. aeruginosa in CFR11 displays a hypermutator phenotype [68], although this seems unlikely given that the DNA repair genes seemed unaltered. Still, we cannot infer whether the presence of multiple strain variants indicate multiple infection events or whether they might track back to a single, fast mutating ancestral strain. In general, due to lack of absolute abundance data, we also cannot be certain whether the observed change in the relative abundance of a specifc bacterium could be in response to other bacteria growing and dying. Absolute quantifcation has already provided novel insights into the gut microbiome [69], and, more recently, the application of qPCR for the absolute quantifcation of bacteria CF lung microbiome has challenged the existence of a CF lung microbiome in early childhood [38]. Combined with WGS sequencing, these quantitative approaches therefore present a promising venue to increase interpretability in future studies.

Conclusions

Page 11/30 We have demonstrated that metagenomic sequencing of time-series data in CF patients can complement routine clinical diagnostics. Non-invasive, whole-genome sequencing of sputum can provide better taxonomic resolution for pathogens than the current methods routinely used in the clinic. Unlike 16S rRNA sequencing, classifcation can be made on a sub-strain level. In addition, by using data from multiple time points, multiple strain variants of the same species can be tracked within a given patient, including the assignment of variant-specifc SNVs and variant-specifc large-scale genomic changes.

Given recent advances in targeted depletion of human material in samples [66,67], we believe that sequencing costs might soon sink enough to allow routine use of workfows like ours in the clinic; a recent case report estimated a similar procedure would take less than 48 hours [37]. When coupled to a growing database of previously observed strains (ideally including the results of past antibiotics resistance testing), this should enable continuous improvements in monitoring and managing pulmonary infections in CF.

Methods Sputum sample collection

All study participants provided informed consent. The study was approved by the Cantonal Ethics Committee, St. Gallen (EKSG 13/112). For the study, patients with CF spontaneously produced sputum. The samples were collected at the Cantonal Hospital St. Gallen. Sputum samples were weighed and aliquoted into sterile tubes.

Clinical microbiology pathogen identifcation

All samples were subjected to standard clinical microbiology procedures used for CF sputum in a ISO 15089 certifed laboratory. Sputum samples were pre-processed with a liquefying agent (Copan SL- solution, RUWAG, Bettlach, Switzerland) before streaking on agar plates. Columbia agar, chocolate, MacConkey- and CNA-agar (Becton Dickinson, Allschwil, Switzerland) were streaked to support growth of the bacterial spectrum present in the upper airways. For the specifc detection of CF-associated pathogens, selective chromogenic plates (bioMérieux, Genève, Switzerland) were incubated: PAID agar for P. aeruginosa, SAID agar for S. aureus, and BCSA for Burkholderia species (Achromobacter species usually grow well on this agar as well). All plates were visually inspected after 16-24 hrs of incubation at 36°C with or without 5% CO2 as per standard protocol, followed by a second inspection after another day of incubation. Colonies suggestive of CF-associated pathogens or showing indicative growth on selective media were subjected to MALDI-TOF analysis on a Bruker MALDI Biotyper (Bruker Daltonics, Bremen, Germany) using the standard direct smear protocol. As per manufacturer recommendations, species identifcation was considered reliable at a score above 2.000. In case no CF-associated pathogen was seen after both inspections, the culture was usually reported as respiratory tract fora.

Page 12/30 DNA extraction, treatment, and sequencing

After dilution in Sputolysin (Calbiochem Corp., San Diego, CA, USA), total DNA was extracted using the High Pure PCR Template Preparation Kit (Roche, Basel, Switzerland) as per the manufacturer’s instructions. DNA concentration was measured using the ACTgene UV99 spectro-photometer at a wavelength of 260 nm, and samples were stored at −20°C. As the starting material was not limiting, and sufcient amounts of DNA were available, no extra amplifcation step was deemed necessary and no extraction blanks for PCR/sequencing contamination control were processed.

After DNA isolation, samples were subjected to methylated DNA depletion using the NEBNext Microbiome Enrichment kit (New England Biolabs Inc., Ipswich, MA. USA) to enrich for microbial DNA. As a control, we included day 0 samples from all patients without performing depletion. Depletion of methylated DNA did not have a consistent effect on the total number of reads obtained (Additional File 1: Fig. S11a). Relative microbial DNA content increased in three out of four patients by up to 2.3-fold, but did not exceed 27% (Additional File 1: Fig. S11b).

Next-generation sequencing libraries were prepared using the TruSeq DNA Nano Library Prep kit (Illumina Inc., California, USA) as per the manufacturer’s instructions. The libraries were sequenced using the Illumina HiSeq 4000 platform (Illumina Inc., California, USA) in paired-end mode (2 X 125 bp). Reads were quality-checked with FastQC [70].

Removal of the host genome reads, contig assembly and annotation

Reads were aligned to human genome build 38 [71] using BowTie2 (version 2.3.1) [72]. Reads that did not align concordantly to the human genome were used for downstream analysis and assembly. We assembled reads into contigs using metaSPAdes (version 3.10.1) [73]. The contigs were then searched against the NCBI nucleotide database (as of June 24th 2017) using blastn (version 2.6.0) [74]. During the search, an e-value cut-off of e-15 was used and the fve closest matching sequences were retained. For taxonomic annotation, we only considered matching sequences that had a bit score within 10% range of the maximum scoring match. Contigs were assigned to the most recent common ancestor of the considered matches. Assembly completeness and contamination was assessed using CheckM [22]. Phages and viruses were largely excluded from this analysis due to their poor representation in databases and lack of a standardized taxonomy.

Taxonomic profling and diversity estimation

Raw reads were trimmed and fltered based on quality using sickle (version 1.33) [75]. Trimmed and fltered reads were profled using mOTUs (version 2.0.1) [21] and MetaPhlAn (version 2.7.1) [76]. For MetaPhlAn input, all trimmed and fltered reads were pooled in the same fle. The two methods had

Page 13/30 several disagreements in species delineation, but the generated taxonomic profles (compared on a sample by sample basis) highly correlated on genus level (Additional fle 7: Table S6).

To determine the aerobe and anaerobe content, detected species were mapped to oxygen tolerance data from BacDive (as of August 2019) [77]. Unclassifed species from a known genus were labelled as aerobe or anaerobe only when all species of this genus were labelled as aerobes or anaerobes. Otherwise, the label “unknown” was assigned.

Diversity was calculated based on relative abundances obtained from mOTUs, using Shannon’s diversity index.

Strain identifcation with PanPhlAn

For A. xylosoxidans, a total of 22 genomes and their annotations were downloaded from the Integrated Microbial Genomes & Microbiomes Database (as of May 2018) [78]. These genomes were used to create a pangenome using PanPhlAn (version 1.2.3.6) [27]. We used the pooled trimmed and fltered reads as input to the PanPhlAn software to generate gene presence and absence profles.

For P. aeruginosa, a total of 2226 genomes were downloaded from the Pseudomonas Genome Database (as of July 2018) [79]. Because of the high number of genomes, we could not use the complete set of genomes for PanPhlAn and had to generate a set of representative genomes. Pairwise genomic distances were calculated using Mash (version 2.0) [80]. We discarded outlier genomes with an average distance of more than 0.1 and with less than 90% estimated completeness according to BUSCO (version 3.0.2) and their Gammaproteobacteria database [81,82]. Remaining genomes were clustered at a distance threshold of 0.005. From each cluster, we selected the genome with the smallest average distance to all other cluster members, yielding 359 representative genomes. The representative genomes were annotated using Prokka (version 1.12) [83] and used to generate the pangenome using PanPhlAn (version 1.2.3.6) [27]. We used the pooled trimmed and fltered reads as input to the PanPhlAn software to generate gene presence and absence profles.

Phylogenetic tree generation

A total of 145 genomes from the Achromobacter genus were downloaded from the NCBI Genome database (as of November 2018) [23]. We searched these genomes using blastn (version 2.6.0) [74] against the mOTUs database (version 2.0.1) and against the PubMLST database of the Achromobacter genus (as of July 2017). For our samples, we searched the contigs assigned to the Achromobacter genus against the same databases. We used the coordinates output by the search to extract the corresponding gene sequences from the genomes. If the extracted gene sequence was shorter than the sequences of this gene in the databases, we padded the gene sequence with “X”. If a gene was absent from the genome, we introduced a string of “X” that was the length of this gene.

Page 14/30 We then produced two types of composite sequences. Based on the mOTUs database search, sequences were created by concatenating the single-copy genes COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0541, COG0533, and COG0552. Based on the PubMLST database search, sequences were created by concatenating the housekeeping genes eno, gltB, lepA, nrdA, nuoL, nusA, and rpoB. Composite sequences that contained more than half of “X” were omitted from further analysis. The remaining sequences were aligned using MUSCLE (version 3.8.1551) [84]. Two maximum likelihood trees were constructed using RAxML (version 8.2.10) [85] under the GTRCAT model. Bordetella pertussis (NC_002929.2) was used as an outgroup to root the trees. One hundred bootstraps were performed on the trees to estimate branch confdence.

For the third tree, the downloaded Achromobacter genomes and/or sample contigs were annotated using Prokka (version 1.12) [83]. The obtained gene sequences were used as input for the ANIcalculator (version 1.0) [86]. Genes annotated as rRNA, tRNA, or tmRNA were ignored during the calculation. Based on the calculated pairwise average nucleotide identities, a Euclidean distance matrix was created and genomes were clustered using UPGMA.

The P. aeruginosa tree in (Additional fle 1: Fig. S7b) was generated using the same procedure as for the Achromobacter mOTUs tree, albeit the tree was not rooted. All trees were visualized using iTOL [87].

Identifcation of strain variants

Filtered and trimmed sample reads were recruited to the P. aeruginosa PAER4_119 (CP013113.1) and A. xylosoxidans FDAARGOS_147 (CP014060.1, data not used further) genomes using the ngless framework provided by the developers of metaSNV (version 0.8.1) [88–91]. The framework fltered out reads that did not map uniquely, mapped at an identity less than 97%, or had less than a 45 base pair match with the reference. SNVs were called using metaSNV (version 1.0.3) [28].

Prior to running DESMAN (version 2.1.1) [29], we fltered out SNVs that clustered together (more than 8 per 1000 kb) because these could have biased the relative abundance calculations. DESMAN was run on a selection of 1287 SNVs to determine the relative abundance for three haplotypes. Ten runs using different random seeds were performed.

Analysis of P. aeruginosa genome coverage

Read recruitment was performed as described in “Identifcation of strain variants”. Average read coverage was calculated for windows of 1000 base pairs. To account for higher coverage near origins of replication [92], we ft a quadratic polynomial function to every sample’s coverage profle. Each window’s coverage was then normalized to the value output by the function at the chromosome position in the middle of the window.

Page 15/30 Locations of phage sequences on the reference chromosome were determined using the online tool PHASTER (as of March 2019) [93].

Abbreviations

CF: cystic fbrosis; CFTR: cystic fbrosis transmembrane conductance regulator; CRP: C-reactive protein; DNA: deoxyribonucleic acid; FEV: forced expiratory volume; MALDI: matrix-assisted laser desorption/ionisation; MLST: multi-locus sequence typing; PCA: principal component analysis; PCR: polymerase chain reaction; RNA: ribonucleic acid; SNV: single-nucleotide variation; TOF: time of fight; t- SNE: t-distributed stochastic neighbour embedding; UPGMA: unweighted pair group method with arithmetic mean; WGS: whole-genome sequencing.

Declarations

Ethics approval and consent to participate

The study was approved by the Cantonal Ethics Committee, St. Gallen (EKSG 13/112). All study participants provided informed consent.

Consent for publication

Written informed consent was obtained from all study participants.

Availability of data and material

All sample reads post-fltering of human material have been deposited on the European Nucleotide Archive under the project identifer PRJEB32062.

Competing interests

The authors declare that they have no competing interests.

Funding

MD, RF, and CVM were supported by the Swiss National Science Foundation (grant #31003A_160095). This work was partly funded by an internal grant from the Cantonal Hospital St. Gallen.

Authors’ contributions

RF, CRK, and CVM conceived and planned the study. CRK, RLK, WCA, and FB selected suitable patients, obtained informed consent, and processed the samples. ON performed the clinical microbiology analysis. MD and RF performed the bioinformatics analysis. MD and CVM wrote the manuscript, with the input from all co-authors. All authors read and approved the fnal manuscript.

Page 16/30 Acknowledgements

We thank João F. Matias Rodriguez and Janko Tackmann for helpful discussions on procedures and the manuscript. The Functional Genomics Centre Zurich is gratefully acknowledged for DNA library preparation, sequencing, and initial data processing.

References

1. Elborn JS. Cystic fbrosis. The Lancet. 2016;388:2519–31. 2. Cystic Fibrosis Foundation. Cystic Fibrosis Foundation Patient Registry 2017 Annual Data Report. 2018 p. 96. https://www.cff.org/Research/Researcher-Resources/Patient-Registry/2017-Patient- Registry-Annual-Data-Report.pdf. Accessed 23 January 2020. 3. Zolin A, Orenti A, Naehrlich L, van Rens J, Fox A, Krasnyk M, et al. ECFSPR Annual Report 2017. 2019 p. 149. https://www.ecfs.eu/sites/default/fles/general-content-images/working-groups/ecfs-patient- registry/ECFSPR_Report2017_v1.3.pdf. Accessed 23 January 2020. 4. Burgel P-R, Bellis G, Olesen HV, Viviani L, Zolin A, Blasi F, et al. Future trends in cystic fbrosis demography in 34 European countries. European Respiratory Journal. 2015;46:133–41. 5. Burgener EB, Moss RB. Cystic fbrosis transmembrane conductance regulator modulators: precision medicine in cystic fbrosis. Current Opinion in Pediatrics. 2018;30:372. 6. Hisert KB, Heltshe SL, Pope C, Jorth P, Wu X, Edwards RM, et al. Restoring Cystic Fibrosis Transmembrane Conductance Regulator Function Reduces Airway Bacteria and Infammation in People with Cystic Fibrosis and Chronic Lung Infections. Am J Respir Crit Care Med. 2017;195:1617– 28. 7. Hansen CR, Pressler T, Høiby N. Early aggressive eradication therapy for intermittent Pseudomonas aeruginosa airway colonization in cystic fbrosis patients: 15 years experience. Journal of Cystic Fibrosis. 2008;7:523–30. 8. Mogayzel PJ, Naureckas ET, Robinson KA, Brady C, Guill M, Lahiri T, et al. Cystic Fibrosis Foundation Pulmonary Guideline. Pharmacologic Approaches to Prevention and Eradication of Initial Pseudomonas aeruginosa Infection. Annals ATS. 2014;11:1640–50. 9. Gilligan PH. Infections in Patients with Cystic Fibrosis: Diagnostic Microbiology Update. Clinics in Laboratory Medicine. 2014;34:197–217. 10. Tunney MM, Field TR, Moriarty TF, Patrick S, Doering G, Muhlebach MS, et al. Detection of Anaerobic Bacteria in High Numbers in Sputum from Patients with Cystic Fibrosis. Am J Respir Crit Care Med. 2008;177:995–1001. 11. Fodor AA, Klem ER, Gilpin DF, Elborn JS, Boucher RC, Tunney MM, et al. The Adult Cystic Fibrosis Airway Microbiota Is Stable over Time and Infection Type, and Highly Resilient to Antibiotic Treatment of Exacerbations. PLOS ONE. 2012;7:e45001. 12. Carmody LA, Zhao J, Schloss PD, Petrosino JF, Murray S, Young VB, et al. Changes in Cystic Fibrosis Airway Microbiota at Pulmonary Exacerbation. Annals ATS. 2013;10:179–87.

Page 17/30 13. Coburn B, Wang PW, Diaz Caballero J, Clark ST, Brahma V, Donaldson S, et al. Lung microbiota across age and disease stage in cystic fbrosis. Scientifc Reports. 2015;5:10241. 14. Zhao J, Schloss PD, Kalikin LM, Carmody LA, Foster BK, Petrosino JF, et al. Decade-long bacterial community dynamics in cystic fbrosis airways. PNAS. 2012;109:5809–14. 15. Price KE, Hampton TH, Gifford AH, Dolben EL, Hogan DA, Morrison HG, et al. Unique microbial communities persist in individual cystic fbrosis patients throughout a clinical exacerbation. Microbiome. 2013;1:27. 16. Carmody LA, Zhao J, Kalikin LM, LeBar W, Simon RH, Venkataraman A, et al. The daily dynamics of cystic fbrosis airway microbiota during clinical stability and at exacerbation. Microbiome. 2015;3:12. 17. Losada PM, Chouvarine P, Dorda M, Hedtfeld S, Mielke S, Schulz A, et al. The cystic fbrosis lower airways microbial metagenome. ERJ Open Research. 2016;2:00096–2015. 18. Feigelman R, Kahlert CR, Baty F, Rassouli F, Kleiner RL, Kohler P, et al. Sputum DNA sequencing in cystic fbrosis: non-invasive access to the lung microbiome and to pathogen details. Microbiome. 2017;5:20. 19. Hauser PM, Bernard T, Greub G, Jaton K, Pagni M, Hafen GM. Microbiota Present in Cystic Fibrosis Lungs as Revealed by Whole Genome Sequencing. PLOS ONE. 2014;9:e90934. 20. Bacci G, Taccetti G, Dolce D, Armanini F, Segata N, Cesare FD, et al. Taxonomic and functional dynamics of lung microbiome in cystic fbrosis patients chronically infected with Pseudomonas aeruginosa. bioRxiv. 2019;609057. 21. Milanese A, Mende DR, Paoli L, Salazar G, Ruscheweyh H-J, Cuenca M, et al. Microbial abundance, activity and population genomic profling with mOTUs2. Nature Communications. 2019;10:1014. 22. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–55. 23. NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2018;46:D8–13. 24. Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. 25. Spilker T, Vandamme P, LiPuma JJ. A Multilocus Sequence Typing Scheme Implies Population Structure and Reveals Several Putative Novel Achromobacter Species. Journal of Clinical Microbiology. 2012;50:3010–5. 26. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996–1004. 27. Scholz M, Ward DV, Pasolli E, Tolio T, Zolfo M, Asnicar F, et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nature Methods. 2016;13:435–8.

Page 18/30 28. Costea PI, Munch R, Coelho LP, Paoli L, Sunagawa S, Bork P. metaSNV: A tool for metagenomic strain level analysis. PLOS ONE. 2017;12:e0182392. 29. Quince C, Delmont TO, Raguideau S, Alneberg J, Darling AE, Collins G, et al. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biology. 2017;18:181. 30. Ayling M, Clark MD, Leggett RM. New approaches for metagenome assembly with short reads. Brief Bioinform. 2019. https://doi.org/10.1093/bib/bbz020. 31. Smith EE, Buckley DG, Wu Z, Saenphimmachak C, Hoffman LR, D’Argenio DA, et al. Genetic adaptation by Pseudomonas aeruginosa to the airways of cystic fbrosis patients. Proceedings of the National Academy of Sciences. 2006;103:8487–92. 32. Cramer N, Klockgether J, Wrasman K, Schmidt M, Davenport CF, Tümmler B. Microevolution of the major common Pseudomonas aeruginosa clones C and PA14 in cystic fbrosis lungs. Environmental Microbiology. 2011;13:1690–704. 33. Marvig RL, Johansen HK, Molin S, Jelsbak L. Genome Analysis of a Transmissible Lineage of Pseudomonas aeruginosa Reveals Pathoadaptive Mutations and Distinct Evolutionary Paths of Hypermutators. PLOS Genetics. 2013;9:e1003741. 34. Feliziani S, Marvig RL, Luján AM, Moyano AJ, Rienzo JAD, Johansen HK, et al. Coexistence and Within-Host Evolution of Diversifed Lineages of Hypermutable Pseudomonas aeruginosa in Long- term Cystic Fibrosis Infections. PLOS Genetics. 2014;10:e1004651. 35. Ciofu O, Mandsberg LF, Bjarnsholt T, Wassermann T, Høiby N. Genetic adaptation of Pseudomonas aeruginosa during chronic lung infection of patients with cystic fbrosis: strong and weak mutators with heterogeneous genetic backgrounds emerge in mucA and/or lasR mutants. Microbiology. 2010;156:1108–19. 36. Williams D, Evans B, Haldenby S, Walshaw MJ, Brockhurst MA, Winstanley C, et al. Divergent, Coexisting Pseudomonas aeruginosa Lineages in Chronic Cystic Fibrosis Lung Infections. Am J Respir Crit Care Med. 2015;191:775–85. 37. Güemes AGC, Lim YW, Quinn RA, Conrad DJ, Benler S, Maughan H, et al. Cystic Fibrosis Rapid Response: Translating Multi-omics Data into Clinically Relevant Information. mBio. 2019. https://doi.org/10.1128/mBio.00431-19. 38. Jorth P, Ehsan Z, Rezayat A, Caldwell E, Pope C, Brewington JJ, et al. Direct Lung Sampling Indicates That Established Pathogens Dominate Early Infections in Children with Cystic Fibrosis. Cell Reports. 2019;27:1190-1204.e3. 39. Vandamme P, Moore ERB, Cnockaert M, De Brandt E, Svensson-Stadler L, Houf K, et al. Achromobacter animicus sp. nov., Achromobacter mucicolens sp. nov., Achromobacter pulmonis sp. nov. and Achromobacter spiritinus sp. nov., from human clinical samples. Systematic and Applied Microbiology. 2013;36:1–10. 40. Vandamme P, Moore ERB, Cnockaert M, Peeters C, Svensson-Stadler L, Houf K, et al. Classifcation of Achromobacter genogroups 2, 5, 7 and 14 as Achromobacter insuavis sp. nov., Achromobacter

Page 19/30 aegrifaciens sp. nov., Achromobacter anxifer sp. nov. and Achromobacter dolens sp. nov., respectively. Systematic and Applied Microbiology. 2013;36:474–82. 41. Vandamme PA, Peeters C, Inganäs E, Cnockaert M, Houf K, Spilker T, et al. Taxonomic dissection of Achromobacter denitrifcans Coenye et al. 2003 and proposal of Achromobacter agilis sp. nov., nom. rev., Achromobacter pestifer sp. nov., nom. rev., Achromobacter kerstersii sp. nov. and Achromobacter deleyi sp. nov. International Journal of Systematic and Evolutionary Microbiology. 2016;66:3708–17. 42. Spilker T, Vandamme P, LiPuma JJ. Identifcation and distribution of Achromobacter species in cystic fbrosis. Journal of Cystic Fibrosis. 2013;12:298–301. 43. Coward A, Kenna DTD, Perry C, Martin K, Doumith M, Turton JF. Use of nrdA gene sequence clustering to estimate the prevalence of different Achromobacter species among Cystic Fibrosis patients in the UK. Journal of Cystic Fibrosis. 2016;15:479–85. 44. Edwards BD, Greysson-Wong J, Somayaji R, Waddell B, Whelan FJ, Storey DG, et al. Prevalence and Outcomes of Achromobacter Species Infections in Adults with Cystic Fibrosis: a North American Cohort Study. Journal of Clinical Microbiology. 2017;55:2074–85. 45. Dupont C, Michon A-L, Jumas-Bilak E, Nørskov-Lauritsen N, Chiron R, Marchandin H. Intrapatient diversity of Achromobacter spp. involved in chronic colonization of Cystic Fibrosis airways. Infection, Genetics and Evolution. 2015;32:214–23. 46. Barrado L, Brañas P, Orellana MÁ, Martínez MT, García G, Otero JR, et al. Molecular Characterization of Achromobacter Isolates from Cystic Fibrosis and Non-Cystic Fibrosis Patients in Madrid, Spain. Journal of Clinical Microbiology. 2013;51:1927–30. 47. Amoureux L, Bador J, Bounoua Zouak F, Chapuis A, de Curraize C, Neuwirth C. Distribution of the species of Achromobacter in a French Cystic Fibrosis Centre and multilocus sequence typing analysis reveal the predominance of A. xylosoxidans and clonal relationships between some clinical and environmental isolates. Journal of Cystic Fibrosis. 2016;15:486–94. 48. Gade SS, Nørskov-Lauritsen N, Ridderberg W. Prevalence and species distribution of Achromobacter sp. cultured from cystic fbrosis patients attending the Aarhus centre in Denmark. Journal of Medical Microbiology,. 2017;66:686–9. 49. Foweraker JE, Laughton CR, Brown DFJ, Bilton D. Phenotypic variability of Pseudomonas aeruginosa in sputa from patients with acute infective exacerbation of cystic fbrosis and its impact on the validity of antimicrobial susceptibility testing. J Antimicrob Chemother. 2005;55:921–7. 50. Ashish A, Paterson S, Mowat E, Fothergill JL, Walshaw MJ, Winstanley C. Extensive diversifcation is a common feature of Pseudomonas aeruginosa populations during respiratory infections in cystic fbrosis. Journal of Cystic Fibrosis. 2013;12:790–3. 51. Caballero JD, Clark ST, Coburn B, Zhang Y, Wang PW, Donaldson SL, et al. Selective Sweeps and Parallel Pathoadaptation Drive Pseudomonas aeruginosa Evolution in the Cystic Fibrosis Lung. mBio. 2015;6:e00981-15. 52. Jorth P, Staudinger BJ, Wu X, Hisert KB, Hayden H, Garudathri J, et al. Regional Isolation Drives Bacterial Diversifcation within Cystic Fibrosis Lungs. Cell Host & Microbe. 2015;18:307–19.

Page 20/30 53. Marvig RL, Dolce D, Sommer LM, Petersen B, Ciofu O, Campana S, et al. Within-host microevolution of Pseudomonas aeruginosa in Italian cystic fbrosis patients. BMC Microbiology. 2015;15:218. 54. Winstanley C, O’Brien S, Brockhurst MA. Pseudomonas aeruginosa Evolutionary Adaptation and Diversifcation in Cystic Fibrosis Chronic Lung Infections. Trends in Microbiology. 2016;24:327–37. 55. Sherrard LJ, Tai AS, Wee BA, Ramsay KA, Kidd TJ, Zakour NLB, et al. Within-host whole genome analysis of an antibiotic resistant Pseudomonas aeruginosa strain sub-type in cystic fbrosis. PLOS ONE. 2017;12:e0172179. 56. Lieberman TD, Flett KB, Yelin I, Martin TR, McAdam AJ, Priebe GP, et al. Genetic variation of a bacterial pathogen within individuals with cystic fbrosis provides a record of selective pressures. Nature Genetics. 2014;46:82–7. 57. Markussen T, Marvig RL, Gómez-Lozano M, Aanæs K, Burleigh AE, Høiby N, et al. Environmental Heterogeneity Drives Within-Host Diversifcation and Evolution of Pseudomonas aeruginosa. mBio. 2014;5:e01592-14. 58. Martin CJ, Marshall H, Cline F. Lobar Alveolar Gas Concentrations: Effect of Reduced Lung Volumes. Journal of Applied Physiology. 1953;6:209–12. 59. Brown JS, Zeman KL, Bennett WD. Regional Deposition of Coarse Particles and Ventilation Distribution in Healthy Subjects and Patients with Cystic Fibrosis. Journal of Aerosol Medicine. 2001;14:443–54. 60. Gurney JW, Habbe TG, Hicklin J. Distribution of Disease in Cystic Fibrosis: Correlation with Pulmonary Function. Chest. 1997;112:357–62. 61. Willner D, Haynes MR, Furlan M, Schmieder R, Lim YW, Rainey PB, et al. Spatial distribution of microbial communities in the cystic fbrosis lung. The ISME Journal. 2012;6:471–4. 62. Goddard AF, Staudinger BJ, Dowd SE, Joshi-Datar A, Wolcott RD, Aitken ML, et al. Direct sampling of cystic fbrosis lungs indicates that DNA-based analyses of upper-airway specimens can misrepresent lung microbiota. PNAS. 2012;109:13769–74. 63. Sommer LM, Marvig RL, Luján A, Koza A, Pressler T, Molin S, et al. Is genotyping of single isolates sufcient for population structure analysis of Pseudomonas aeruginosa in cystic fbrosis airways? BMC Genomics. 2016;17:589. 64. Mainz JG, Naehrlich L, Schien M, Käding M, Schiller I, Mayr S, et al. Concordant genotype of upper and lower airways P aeruginosa and S aureus isolates in cystic fbrosis. Thorax. 2009;64:535–40. 65. Ciofu O, Johansen HK, Aanaes K, Wassermann T, Alhede M, von Buchwald C, et al. P. aeruginosa in the paranasal sinuses and transplanted lungs have similar adaptive mutations as isolates from chronically infected CF lungs. Journal of Cystic Fibrosis. 2013;12:729–36. 66. Marotz CA, Sanders JG, Zuniga C, Zaramela LS, Knight R, Zengler K. Improving saliva shotgun metagenomics by chemical host DNA depletion. Microbiome. 2018;6:42. 67. Shi X, Shao C, Luo C, Chu Y, Wang J, Meng Q, et al. Microfuidics-Based Enrichment and Whole- Genome Amplifcation Enable Strain-Level Resolution for Airway Metagenomics. mSystems. 2019;4:e00198-19. Page 21/30 68. Oliver A, Cantón R, Campo P, Baquero F, Blázquez J. High Frequency of Hypermutable Pseudomonas aeruginosa in Cystic Fibrosis Lung Infection. Science. 2000;288:1251–3. 69. Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, et al. Quantitative microbiome profling links gut community variation to microbial load. Nature. 2017;551:507–11. 70. Andrews S, Krueger F, Segonds-Pichon A, Biggins L, Krueger C, Wingett S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 30 January 2020. 71. Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen H-C, Kitts PA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–64. 72. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9. 73. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–34. 74. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. 75. Joshi N, Fass J. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ fles (Version 1.33). 2011. https://github.com/najoshi/sickle. Accessed 20 September 2019. 76. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profling using unique clade-specifc marker genes. Nat Methods. 2012;9:811–4. 77. Reimer LC, Vetcininova A, Carbasse JS, Söhngen C, Gleim D, Ebeling C, et al. BacDive in 2019: bacterial phenotypic data for High-throughput biodiversity analysis. Nucleic Acids Res. 2019;47:D631–6. 78. Chen I-MA, Markowitz VM, Chu K, Palaniappan K, Szeto E, Pillay M, et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res. 2017;45:D507–16. 79. Winsor GL, Grifths EJ, Lo R, Dhillon BK, Shay JA, Brinkman FSL. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database. Nucleic Acids Res. 2016;44:D646–53. 80. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. 81. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210– 2. 82. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018;35:543–8. 83. Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9.

Page 22/30 84. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7. 85. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–3. 86. Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 2015;43:6761–71. 87. Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44:W242–5. 88. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:13033997 [q-bio]. 2013. http://arxiv.org/abs/1303.3997. 89. Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, et al. MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. PLOS ONE. 2012;7:e47656. 90. Kultima JR, Coelho LP, Forslund K, Huerta-Cepas J, Li SS, Driessen M, et al. MOCAT2: a metagenomic assembly, annotation and profling framework. Bioinformatics. 2016;32:2520–3. 91. Coelho LP, Alves R, Monteiro P, Huerta-Cepas J, Freitas AT, Bork P. NG-meta-profler: fast processing of metagenomes using NGLess, a domain-specifc language. Microbiome. 2019;7:84. 92. Korem T, Zeevi D, Suez J, Weinberger A, Avnit-Sagi T, Pompan-Lotan M, et al. Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science. 2015;349:1101–6. 93. Arndt D, Grant JR, Marcu A, Sajed T, Pon A, Liang Y, et al. PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 2016;44:W16–21. 94. Gu Z, Gu L, Eils R, Schlesner M, Brors B. circlize implements and enhances circular visualization in R. Bioinformatics. 2014;30:2811–2.

Additional Materials

Additional File 1: Supplementary Figures. Figure S1. Study report of patient CFR06 displaying the dynamics of multiple parameters over time. Figure S2. Study report of patient CFR07 displaying the dynamics of multiple parameters over time. Figure S3. Study report of patient CFR09 displaying the dynamics of multiple parameters over time. Figure S4. Relative abundance profles generated by MetaPhlAn for each patient. Figure S5. Relative abundance profles of aerobes and anaerobes for each patient. Figure S6. Identifcation of A. xylosoxidans strains using PanPhlAn. Figure S7. Identifcation of P. aeruginosa strains using PanPhlAn and gene marker sequences. Figure S8. Positional distribution of SNVs from the Fig. 4 clusters in the P. aeruginosa genome. Figure S9. Identifcation of P. aeruginosa strain variants in patient CFR11. Figure S10. SNV detection in the 16S rRNA sequence of P. aeruginosa. Figure S11. Comparison of Day 0 samples that have been subjected to methylated DNA depletion or not. (PDF)

Additional File 2: Table S1. Overview of information gathered during the study for each patient. (XLSX)

Page 23/30 Additional File 3: Table S2. Overview of medication prescribed to each studied patient. (XLSX)

Additional File 4: Table S3. Complete relative abundance profles (mOTUs). (XLSX)

Additional File 5: Table S4. Complete relative abundance profles (MetaPhlAn). (XLSX)

Additional File 6: Table S5. Overview of contig assembly quality reports (CheckM). (XLSX)

Additional File 7: Table S6. Genus level taxonomic profle correlations between mOTUs and MetaPhlan for each sample. (XLSX)

Figures

Page 24/30 Figure 1

Study report of patient CFR11 displaying the dynamics of multiple parameters over time. (A) percentage forced expiratory volume (FEV) (black) and concentration of C-reactive protein (CRP) (blue). The actual measurements are shown as dots. (B) Medication assigned to the patient during the course of the study and recorded exacerbation events (in red). (C) Bacteria identifed in the clinical microbiology laboratory. (D) Relative abundance profles generated by mOTUs. Species with more than 5% relative abundance at

Page 25/30 at least one time-point are shown. Genera that contain at least one species with more than 5% relative abundance at at least one time-point are shown. Colours represent selected species and genera. Less abundant species are grouped into “Others” (grey). Taxa that could not be identifed by mOTUs on the genus level are grouped into “Unknown Genus” (white). The actual measurements are shown as bars. (E) Shannon’s entropy calculated based on the relative abundance profles generated by mOTUs. The actual measurements are shown as dots. (F) Number of reads per sample. Human reads are indicated in black. Reads that did not concordantly map to the human genome are depicted in red. (G) Concentration of total DNA isolated from patient sputum. The actual measurements are shown as dots.

Figure 2

Strain-typing of patient-specifc Achromobacter in the context of a molecular-based phylogeny of the genus. (A) Maximum-likelihood tree based on sequences of ten single-copy genes used by mOTUs. Colours on the right of the tree depict species assignments according to NCBI and GTDB taxonomies and apply to (B) and (C) as well. Blue circles indicate branch confdence values (>= 90) based on 100 bootstraps of the tree; (B) Maximum-likelihood tree based on sequences of seven housekeeping genes

Page 26/30 from the Achromobacter MLST database; (C) UPGMA clustering of pairwise average nucleotide identities of the comprising Achromobacter genomes.

Figure 3

Strain variant identifcation from metagenomics data through assessing temporal changes in SNV allele frequencies. (1) Selection of a reference genome based on generated gene presence and absence profles. (2) Recruitment of reads to the reference genome (NZ_CP013113). A pile-up of a selected region

Page 27/30 containing an SNV (1,959,777 - 1,959,791 base pairs) is shown for every time point. Reference sequence displayed on the bottom. Grey indicates reads that are identical to the reference sequence. Orange indicates reads in which a substitution to guanine has occurred. (3) The change in allele frequency over time for the selected SNV. (4) A group of SNVs that show a similar pattern of changes in allele frequency over time. The selected SNV is depicted in orange.

Figure 4

Clustering of SNVs based on their change in allele frequency over time. (A) A t-SNE plot depicting the clustering pattern. Most SNVs occur at low allele frequencies (grey). Remaining SNVs form seven distinctly visible genotypes that are labelled and coloured accordingly. (B) Changes in the allele frequencies (p) of SNVs belonging to each distinct genotype over time. The coloured line indicates mean allele frequency of the genotype. Dark grey ribbon indicates 95% confdence intervals of the mean.

Page 28/30 Figure 5

Selected regions of the P. aeruginosa genome show variant-specifc coverage profles. The circos diagram provides an overview of the genome coverage profles. The outer circle depicts regions of the reference genome that have a coverage of at least 5% from the average coverage at at least one time point. The second outermost circle depicts potential phage sequences as predicted by PHASTER [93]. The remaining circles depict normalized coverage profles for each of the ten time-points. Orange regions indicate lower than average coverage and blue regions indicate higher than average coverage. Zoom ins of three regions displaying variant-specifc coverage are shown. Each variant is depicted in a distinct colour and the

Page 29/30 average coverage of the selected region is depicted in grey. The circos diagram was generated using the R package circlize [94].

Supplementary Files

This is a list of supplementary fles associated with this preprint. Click to download.

additionalfle1supplementaryfgures.pdf additionalfle7tables6.xlsx additionalfle3tables2.xlsx additionalfle6tables5.xlsx additionalfle4tables3.xlsx additionalfle5tables4.xlsx additionalfle2tables1.xlsx

Page 30/30