
bioRxiv preprint doi: https://doi.org/10.1101/2020.01.21.914820; this version posted January 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

An enhanced characterization of the human skin microbiome: a new biodiversity of microbial interactions

Akintunde Emiola1, Wei Zhou1, Julia Oh1*

1The Jackson Laboratory for Genomic Medicine, Farmington, Connecticut, USA
*Corresponding author. [email protected]

ABSTRACT

The healthy human skin microbiome is shaped by skin site physiology and individual-specific factors, and is largely stable over time despite significant environmental perturbation. Studies identifying these characteristics used shotgun metagenomic sequencing for high-resolution reconstruction of the bacteria, fungi, and viruses in the community. However, these conclusions were drawn from a relatively small proportion of the total sequence reads analyzable by mapping to known reference genomes. ‘Reference-free’ approaches, based on de novo assembly of reads into genome fragments, are also limited in their ability to capture low-abundance species and small genomes, and to discriminate between more similar genomes. To account for the large fraction of non-human unmapped reads on the skin, referred to as microbial ‘dark matter’, we used a hybrid de novo and reference-based approach to annotate a metagenomic dataset of 698 healthy human skin samples. This approach reduced the overall proportion of uncharacterized reads from 42% to 17%. With our refined characterization, we revisited assumptions about the skin microbiome, and demonstrated higher biodiversity and lower stability, particularly in dry and moist skin sites. To investigate hypotheses underlying stability, we examined growth dynamics and interspecies interactions in these communities.
Surprisingly, even though most skin sites were relatively stable, many dominant skin microbes, including staphylococci, were actively growing in the skin, with poor or no relationship between growth rate and relative abundance, suggesting that host selection or interspecies competition may be important factors maintaining community homeostasis. To investigate other mechanisms facilitating adaptation to a specific skin site, we identified Staphylococcus epidermidis genes that are likely involved in stress response and provide mechanisms essential for growth in oily sites. Finally, horizontal gene transfer (another mechanism of competition, by which strains may swap antagonistic or virulent coding regions) was relatively limited in healthy skin, but suggested exchange of different metabolic and environmental tolerance pathways. Altogether, our findings underscore the value of a combined reference-based and de novo approach to provide significant new insights into microbial composition, physiology, and the interspecies interactions that maintain community homeostasis in the healthy human skin microbiome.

BACKGROUND
Deep metagenomic shotgun sequencing is a powerful tool to interrogate composition and function of complex microbial communities. Microbial communities offer the potential for discovery of a tremendous suite of previously unknown biological functions, for example, new bioactive compounds, antimicrobials, virulence factors, or metabolic pathways. Such discovery has relied on the ability to survey and deconvolute species from mixed microbial consortia. Advances in next-generation sequencing and computational analyses have, in recent years, greatly furthered efforts to reconstruct microbial communities at the species [1,2], strain [2,3], and even single nucleotide polymorphism level [3,4], examining function, transmission, and stability of the resident microbes.

However, interpretations of many metagenomic datasets are limited by the inability to characterize a large fraction of the total microbial reads present in the original sample [5,6]. This uncharacterized sequence space, or microbial ‘dark matter’ [7], typically results from the inability to map a sequence read to a known microbial reference genome and can exceed 96% of sequence reads within a sample [5]. Such ‘reference-based’ approaches, whether mapping reads to complete genomes [8] or marker genes [9], have high sensitivity and discriminatory ability between even very similar genomes [8]. However, microbes with no representative reference, or those with significant pangenomic variation, which can account for considerable within-species diversity in gene content [10], are not captured. Conversely, reference-free approaches, based on de novo assembly to aggregate reads into longer stretches of contiguous DNA sequence, can aid in the identification and characterization of new genomes. However, de novo assembly-based approaches are less effective in capturing small genomes (e.g., viruses) and low-abundance microbes, and in discriminating between very similar genomes.

By combining both approaches into a holistic framework, we aimed to reduce the proportion of uncharacterized sequence space in a metagenomics dataset, and thus provide new insights into the biological function and interspecies interactions of these microbial communities.
We used a hybrid de novo and reference-based approach aimed at characterizing microbial dark matter in the skin metagenome. Our previous analyses of this dataset (698 samples), which were exclusively reference-based, showed that the skin microbiome is defined primarily by the physiological characteristics of the skin site (e.g., whether it was a sebaceous, moist, dry, or foot site), then by host-intrinsic factors that confer individuality in strain representation and the presence of low-abundance and transient organisms [5,6]. More intriguing was the observation that the skin microbiome is remarkably stable even over years, despite the exposure of skin to different hygiene practices and the external environment [6]. However, our conclusions were based on an incomplete portrait with, on average, half of each sample remaining uncharacterized by our reference-based analyses [5]. By incorporating additional information from microbial dark matter, we stood to gain significant new insights into the landscape of skin biodiversity and microbial stability.

Leveraging our integrated approach, we uncovered previously unaccounted-for biodiversity and reduced microbial stability in the skin microbiome. We used this refined characterization to more deeply probe interspecies interactions, identifying intra-species diversity and mechanisms underlying stability and inter-species interactions in the skin, including new assessments of growth rate and horizontal gene transfer. Our results demonstrate the highest resolution analysis of the skin microbiome to date, and provide new hypotheses for how skin microbes interact and compete to maintain homeostatic community conditions.


RESULTS
A hybrid de novo and reference-based microbial community analysis
To address the significant uncharacterized sequence space (mean ± sd 42% ± 24%) in our initial analysis of a 698-sample longitudinal skin metagenomic dataset (Supplementary Fig. 1), we used reference-independent approaches to reconstruct composition. With the improvement of de novo assembly algorithms to handle large datasets [11], we concatenated our samples and assembled iteratively, resulting in 75% ± 19% of reads incorporated into the assemblies (Supplementary Fig. 2). The 1,037,465 resultant contigs > 1 kb were then grouped into genome ‘bins’ based on co-abundance clustering and nucleotide composition. However, because this approach is limited in its ability to recover small genomes and low-abundance species and to ascertain precise taxonomic classifications, we speculated that integrating reference-based analyses (Fig. 1A) would further reduce dark matter beyond the 33% ± 21% reduction observed by mapping reads to our de novo reference set only (Fig. 1B). Microbial reads unmappable to our de novo reference catalogue were annotated by mapping to a reference database of fungal, bacterial, and viral genomes. Reference-based and de novo classifications were integrated with a normalization step that took into consideration the total proportion of reads derived from each approach (Supplementary Fig. 2). While using de novo references significantly aided reconstruction of microbiota, our hybrid approach most considerably reduced the proportion of dark matter (16% ± 17%; Fig. 1B and C).

A new biodiversity of the human skin metagenome
Our new compositional analysis was largely concordant with our previous findings that the skin microbiome is predominated by Staphylococcus, Cutibacterium (formerly Propionibacterium), Corynebacterium, and Malassezia species [5,6] (Fig. 1C, Fig. 1D).
However, we uncovered considerably more diversity of Staphylococcus, Corynebacterium, Proteobacteria, and fungal genomes at most skin sites (Fig. 1C, Supplementary Fig. 3A). For example, we identified a previously uncharacterized, but abundant, Lactobacillales colonizing the external auditory canal (Ea) as Alloiococcus otitis. Of the reads mapped to bins, 9% ± 13% were unclassifiable based on BLASTn alignment to the NCBI nt database. These likely represented contigs that belong either to uncharacterized genomes or to undiscovered pan-genomic variation of a lower-abundance strain that could not be binned with its species unit. These otherwise unclassifiable bins were most frequently bacterial, underscoring that the majority of undiscovered biodiversity in the skin is not fungal or viral (Supplementary Fig. 3B).

Our revised classification showed lower representation of C. acnes than previous analyses, and markedly lower Propionibacterium phage representation in sebaceous regions. The de novo-only approach did not capture viral contigs; these were most accurately recovered with the combined use of reference genomes, for example in the alar crease (Al) (Fig. 1C).

With our integrated classification, we re-examined our conclusions on diversity and stability at the community level. All skin sites except the ear and foot had higher diversity than previously reported, which is expected given the identification and inclusion of more genomes/genome bins than in our original analyses (Fig. 1E). Because resolved dark matter only represented a few genomes in the external auditory canal (Ea) (mostly Lactobacillales) and foot sites (mostly Staphylococcus and Corynebacterium), diversity was unchanged in these regions. Since previous analyses correlated increased diversity with decreased stability [6], we re-evaluated our conclusions on stability over the ~month (T2-T3) and ~year (T1-T2) intervals collected in this study. Most sites were less stable than originally defined (Fig. 1F), for example the hand and forearm (dry) sites, which are highly exposed to the external environment (Fig. 1F, Supplementary Fig. 4A). This may be due to behavioral patterns like hand-washing or increased acquisition of transient, environmental microbes. Longitudinal tracking of individual species’ dynamics over time showed that community stability is driven by specific microbes (Supplementary Fig. 4B). For example, newly identified Corynebacterium species (e.g., C. jeikeium) were associated with stability, whereas staphylococci showed more fluctuation over time.

Mechanisms underlying interspecies interactions: growth dynamics of skin microbiota
A significant limitation of previous skin microbiome studies is the absence of information on microbial activity, which is necessary to understand potential underlying homeostatic mechanisms, as microbes are unlikely to be completely inactive in the skin. We reasoned that the contributions of viable populations to overall microbial abundance and functional community potential could be inferred by assessing bacterial growth rate from the skin metagenome. This would also allow us to estimate the ratio of rapidly growing to dead/stationary cells of common skin microbes, which would provide additional insights into antagonistic interspecies interactions.

To achieve this, we used the peak-to-trough ratio (PTR) method [12], implemented in GRiD [13], which maps metagenomic reads to a bacterial reference genome to calculate coverage drops across the genome.
Because most bacteria harbor a single circular chromosome replicated bi-directionally from the origin of replication (ori) to the terminus (ter) region [14], rapidly growing cells will have a higher coverage at ori vs ter.

First, we generalized our analysis to define the steady-state growth dynamics of dominant skin microbes. Defining a microbe with a GRiD score > 1 as being in exponential phase, we identified the proportion of bacteria that were actively growing across different skin sites. Most examined microbes were most active in dry sites (i.e., palm and forearm) (Fig. 2A), which is interesting because these regions are often perturbed (e.g., by hand washing) and biomass is low. Such factors could affect microbial viability and growth rate to replenish the endogenous community. In contrast, sebaceous sites, which typically harbor higher biomass, least supported rapid growth, which could reflect the relatively specialized physiologic growth conditions, including anoxia, which would limit or slow growth of many microbes [15]. Foot sites favored growth of S. epidermidis and other non-lipophilic microbes at the expense of sebum-metabolizing C. acnes, which was least active at that site (Fig. 2A). Strikingly, growth rates of these microbes were stable over multiple timepoints (Fig. 2B).

We also asked if increased growth rate resulted in increased relative abundance. Relative abundance was positively correlated with growth rate only for certain species (Fig. 2C). For example, C. ureicelerivorans’ relative abundance was strongly correlated with growth rate in all skin sites (Fig. 2C). This suggests that for select species, microbial abundance may be controlled by how rapidly cells are dividing. In contrast, C. acnes relative abundance and growth rate were strongly anti-correlated in sebaceous sites (Fig. 2C), suggesting that C. acnes’ growth rate may be modulated by the presence or absence of competition, or regulated by its quorum sensing mechanisms.

Finally, we examined the microbial growth dynamics in patients with primary immunodeficiency syndrome, characterized by eczematous lesions [5], as rapid growth might identify potential pathogens. Common skin commensals such as S. epidermidis and C. ureicelerivorans had significantly decreased growth rate in these patients (Fig. 2D). The decreased growth rate of S. epidermidis is surprising because these patients are unusually prone to staphylococcal infections. This observation may be due to the lack of correlation between growth rate and abundance for S. epidermidis in most skin sites (Fig. 2C).

Structural variants of Staphylococcus epidermidis
Community homeostasis is maintained by host factors, interspecies interactions, and acquisition of genes, the latter of which can play an important role in modulating adaptability to a particular environmental niche. These genes can encode proteins that influence signaling, virulence, and antimicrobial properties [16,17]. Consequently, we examined microbial structural variants that may harbor genes that play a role in strain adaptability to a specific skin site. We focused our analysis on S. epidermidis because, unlike other staphylococci, it thrives in multiple skin sites (Fig. 2A).

We retrieved pangenome sequences from panDB, a database which assembles non-redundant genomic regions from multiple sequenced strains [18], split sequences into 1 kb fragments, and determined the enrichment of fragments across samples. Because staphylococci other than S. epidermidis thrive mostly in foot sites, and less in sebaceous regions (Fig. 2A), we investigated structural variants that might be associated with S. epidermidis adaptability in sebum-rich sites. We identified 6 S.
epidermidis fragments that were always enriched in sebaceous sites when compared with other sites (Fig. 3A), which suggests that these may harbor genes essential for survival in sebaceous regions.

In addition, we hypothesized that if these fragments are indeed associated with adaptability, homologues in S. capitis, a closely related species, will also be associated with adaptability in the latter. Similar to S. epidermidis, we identified 11 S. capitis fragments that were always differentially enriched in sebaceous sites (Fig. 3A). Interestingly, 4 of the 7 genes harbored in S. epidermidis fragments had homologues in S. capitis (Fig. 3B), suggesting an underlying importance in sebaceous sites.

Next, we examined if these candidate genes could be located in mobile genetic elements, which may suggest inter/intra-species transmission. Using BLASTn alignment to the NCBI nt database, all candidate genes could be identified in previously annotated plasmids or phages (Fig. 3C). We conducted additional benchmark analysis to minimize false positives by determining the correlation between candidate genes and S. epidermidis abundance or growth rate. Unsurprisingly, all fragments containing genes with homologues in S. capitis were positively correlated with both abundance and growth rate in sebaceous sites. Most of these genes encode hypothetical proteins; however, one candidate (S_epi_13619) encodes a stress response protein, suggesting that mechanisms to adapt to otherwise unfavorable conditions may play an important role in survival.

Indirect interactions: horizontally-transferred genes in skin microbes
To further investigate the genetic basis of interspecies interactions using our refined metagenomic analysis, we investigated horizontal gene transfer (HGT). HGT is a mechanism by which a microbe can acquire genetic material that may confer an increased survival or competitive advantage within a community.

We developed an HGT prediction pipeline (Supplementary Fig. 5) and, based on the simulated dataset, the pipeline was able to identify 31% to 51% of the simulated HGT genes. We found that the prediction sensitivity was limited by the sensitivity of the metagenomic assembler (i.e., whether a gene was fully assembled) and the sensitivity of the gene predictor (i.e., whether an open reading frame was correctly annotated as a gene) (Supplementary Fig. 6), but not by the synonymous distance-based algorithm presented in this study. The predicted HGT genes exhibited a large variation of synonymous distances, representing both recent HGT events and more ancient HGT events during the diversification of the microbial species (Supplementary Fig. 6). Importantly, the HGT genes with the lowest synonymous distances (synonymous distance < 0.1), which correspond to the most recent HGT events, almost exclusively matched the simulated HGT genes (94% to 100%, Supplementary Fig. 6), demonstrating the high specificity of the prediction pipeline.

To predict HGT among microbes within the skin microbiome, we developed a pipeline using our existing metagenomic data. In each pair of assembled genomes, we identified HGT candidates by looking for pairs of genes that are significantly more similar than immobile genes (Fig. 4A, Supplementary Fig. 5).
Our HGT identification pipeline is a parametric version of a previously described robust method that searched for identical or near-identical gene pairs in distantly related microbial genomes [19]. Consistent with previous reports, functional annotation of HGT candidates showed a distribution over a wide functional spectrum [19] (Fig. 4B). The most overrepresented functional category among the predicted HGT candidates was transporters, highlighting the potential of the microbes to acquire, through HGT events, the ability to take up environmental nutrients and extrude harmful molecules (Fig. 4B). Although most types of transporters were uniformly distributed in the mobile gene pool, the microbiome at sebaceous sites demonstrated enrichment of transporters for metallic cations (Fig. 4B and 4C), including iron, manganese, zinc, cobalt, and nickel, as well as biotin. These results suggest the existence of a pool of (conditionally) mobile genes exerting a multitude of biological functions, with enrichment of specific functions observed at specific skin site types.

Finally, we constructed a network to reflect the top HGT events among microbial species in each type of skin site (Fig. 4D). Across skin sites, HGT events were identified as a function of abundant species. For example, HGT candidates identified at a sebaceous site predominantly came from bacteria in the genera Propionibacterium, Corynebacterium, and Staphylococcus (Fig. 4D). Due to its dominance in many skin sites, C. acnes was central in networks corresponding to all skin site types except toenails, in which Actinobacteria (C. acnes, Corynebacterium singulare, Micrococcus luteus, and Kocuria rhizophila) and multiple Staphylococcus species formed disconnected networks, strongly suggesting that HGT events at a body site were driven by the microbiome composition at the site. Overall, this pipeline characterizes statistically likely HGT candidates directly from shotgun metagenomic data, and is useful to simultaneously estimate the functional and taxonomic distribution of the mobile genes between genera.

DISCUSSION
The human skin, our largest organ and first line of defense, is home to a diversity of microorganisms. These microbes play an essential role in influencing metabolic processes, immune system modulation, and antagonism of potentially transient pathogens [15]. Alterations in community composition have been associated with a number of skin diseases like atopic dermatitis, psoriasis, and eczema [20,21]. Numerous host-intrinsic (e.g., genetics, immune competence, skin barrier conditions), environmental or lifestyle factors (e.g., hygiene, exposure to different microbes), as well as microbiome-intrinsic factors affect disease predilection. A deep understanding of these factors, as well as the ecological complexity of the skin’s microbiota, is needed to understand factors that influence its homeostasis and predisposition to disease.

Large-scale studies have aimed to characterize the skin microbiota using deep shotgun metagenomic sequencing, yet conclusions drawn from those previous analyses were limited in that a majority of sequence reads could not be mapped to any known genome [5,6]. To address this limitation, we used an integrated approach that incorporated de novo assembly and binning with reference-based analyses to identify and quantify new microbial skin occupants. This dual approach provided a key improvement in the resolution of the community.
While we observed a significant reduction in uncharacterized sequence space using de novo approaches, viral genomes and low-abundance genomes were poorly captured but could be resolved with reference genomes.

High-level analyses of the defining characteristics of the skin microbiome were largely concordant with previous findings. However, our hybrid approach provided important new insights into skin community structure, including increased diversity, reduced stability across many sites, dominance of previously uncharacterized microbes in certain skin sites like the external auditory canal, and lower representation of phage than previously believed. With few exceptions, we found that previous reports overestimated stability of the skin microbiome, likely because they lacked deeper classification of additional genomes, particularly for staphylococci and coryneforms. Subsequently, we focused our analysis, continuing to interleave reference-based and de novo approaches, on potential factors that could underlie community stability and homeostasis, including the growth rate of different community members and interspecies interactions.

Leveraging metagenomic data to predict growth rate of dominant skin species, we measured marked variance in which species were actively growing in the skin, and how skin site could affect activity. For example, most microbes in dry sites were actively dividing compared to relatively few in sebaceous sites. Moreover, S. epidermidis appeared to grow exponentially in all sites, at similar growth rates, suggesting that the skin environment generally provides adequate nutrients to support its rapid growth. Alternatively, S. epidermidis strains may have acquired specific genes modulating adaptability to each site. For example, we identified numerous genes present in mobile genetic elements that were associated with adaptability to sebaceous sites. Yet, strikingly, the relative abundance of S. epidermidis was not correlated with growth rate in most skin sites, suggesting an equally rapid killing or cell death. In this case, the lack of correlation between growth rate and relative abundance indicated that competitiveness within the community is largely independent of growth rate. Other skin sites/species showed a different relationship between growth rate and relative abundance. A positive correlation suggests that strains out-compete the rest of the community during exponential growth, as in the case of C. ureicelerivorans. Conversely, a negative correlation may reflect an active quorum sensing mechanism involved in the regulation of growth rate. For example, we observed a negative correlation for C. acnes that is restricted to sebaceous skin, suggesting that the microenvironment may play a role in regulating growth rate.

In addition, we observed that several types of transporters, especially the metal transporters, were highly abundant in the HGT gene pools at the sebaceous sites, potentially reflecting the importance of transporting a vast variety of lipid-soluble metals diffusing through the permeable cells of the sebaceous glands and follicular walls, a major absorption pathway of metals, including metallic toxicants.
Additionally, the enrichment of metal transporters at the sebaceous sites paralleled the over-representation of Actinobacteria species at those skin sites, raising the possibility that the dissemination of metal-transporting ability among Actinobacteria at the sebaceous sites may be important to metal balance and consequently the health of the host.

In conclusion, we present a new landscape of the skin microbiome, providing the highest resolution reconstruction of microbial community composition and biodiversity in the skin to date. Importantly, we have used new approaches and analyses in a multifaceted investigation of functional elements and interspecies interactions underlying stability and community homeostasis. Our findings have generated testable hypotheses to interrogate interspecies and/or interstrain inhibition. Finally, our analyses are broadly applicable to investigate these mechanisms in skin disease, which ultimately can provide clues as to sources of antimicrobials directed towards strain-specific pathogens.

METHODS
Sample datasets
We retrieved 698 metagenomic shotgun skin samples from our previous work [5,6], which have been quality filtered for the presence of human DNA. The majority of the samples (n = 594) in this dataset were derived from longitudinal sampling of 12 individuals at 3 different time points with sampling intervals of 10-30 months (“long”) and 5-10 weeks (“short”). Twenty-four samples from this collection were also derived from 2 individuals with hyper IgE syndrome. The remaining samples represent a single timepoint from three additional healthy individuals.

Taxonomic classification of skin microbes
To classify skin microbes, we used a hybrid de novo-based and reference-based characterization technique (Supplementary Fig. 2). All samples were concatenated and sequence reads were assembled into contigs using MEGAHIT 1.0.5 (--k-min 37 --k-max 67 --k-step 10 -m 0.99 --kmin-1pass --continue) [11], which was used for its ability to handle large datasets with relatively low memory requirement and short run time. We discarded contigs < 1 kb, mapped each individual sample’s reads back to the contig catalog using bowtie2 2.2.8 (--sensitive) [22], and extracted unmapped reads. However, read coverage from our derived contig catalog was relatively low (Supplementary Fig. 2). To re-assemble unmapped reads, we concatenated a subset of unmapped reads from randomly selected samples (due to the high memory requirement of SPAdes [23]) and re-performed previous steps using SPAdes 3.7.1 (--meta) for de novo assembly, which was better able to resolve scaffolds from contigs. Newly extracted contigs/scaffolds were merged with the previous catalog to produce a non-redundant contig catalog. We repeated the mapping of reads to the catalog to retrieve unmapped reads, randomly concatenated a small subset of unmapped reads, and performed assembly using SPAdes. We repeated this iterative step until no additional improvement in read coverage from our contigs/scaffolds catalog was observed (Supplementary Fig. 2), obtaining a total of 1,037,465 contigs/scaffolds > 1 kb.

Contigs/scaffolds were then grouped into genome bins using MetaBAT (--sensitive, -m 1500) [24], which resulted in 556 bins. Bins were taxonomically classified using MEGAN 4.70.4 [25] (Supplementary Table 1). We excluded 22 bins which were of non-microbial origin and further evaluated the quality of each bin using CheckM 1.0.6 [26].
For stringent annotation, we required that 373 ≥ 65% of contigs/scaffolds present in a bin were assigned to the lowest level ; the sole 374 exception being the kingdom-level taxonomy where our requirement was 40%. Bins were labeled 375 “uncharacterized” if they were unable to be assigned to at least a kingdom, although by further 376 relaxing our requirements to exclude contigs/scaffolds with no hits enabled characterization of 377 those bins to at least the kingdom level (Supplementary Fig. 3B). 378 379 Each sample was then mapped back to the genome bins using bowtie2. Unmapped reads were 380 subsequently characterized by a reference-based approach using reprDB18 and assigned to a 381 genome using Pathoscope 2.08. Reads unassignable by either method were categorized as “dark 382 matter”. 383 384 Bacterial growth rate estimation from skin samples 385 We estimated bacterial growth rate using GRiD (v1.3)13 using a coverage cutoff of 0.2. We created 386 custom GRiD database using species of C. acnes, S. epidermidis, S. aureus, C. jeikeium, and C. 387 ureicelerivorans. 388 389 Structural variant analysis 390 We retrieved pangenome sequences for S. epidermidis and S. capitis from panDB18, split 391 genomes into 1 Kb fragments, and predicted genes using prokka27. We mapped reads using 392 bowtie2 to the fragment pool of each species, estimated reads count across samples using 393 featureCounts28, and DESeq229 to infer genes differentially enriched between skin sites. We 394 filtered candidates using a adjusted p-value of 0.05 and log2 fold change of 1. We aligned genes 395 from S. epidermidis and S. capitis to generate a phylogenetic tree using MAFFT30.
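As an illustration of the fragmenting and filtering steps in the structural variant analysis above, the following minimal Python sketch splits a sequence into 1 Kb fragments and applies the two cutoffs to a differential-enrichment table. It is not the code used in this study: the fragment naming scheme, the `DiffResult` row layout, the function names, and the direction and strictness of each inequality (adjusted p-value < 0.05, absolute log2 fold change ≥ 1) are assumptions made for illustration.

```python
from typing import Iterator, List, NamedTuple, Tuple

FRAGMENT_LEN = 1000  # 1 Kb fragments, as described in the text


def split_into_fragments(name: str, seq: str,
                         size: int = FRAGMENT_LEN) -> Iterator[Tuple[str, str]]:
    """Yield (fragment_id, fragment_sequence) pairs.

    IDs use 1-based inclusive coordinates, e.g. "genome_1-1000"
    (naming scheme assumed for illustration).
    """
    for start in range(0, len(seq), size):
        end = min(start + size, len(seq))
        yield f"{name}_{start + 1}-{end}", seq[start:end]


class DiffResult(NamedTuple):
    """One row of a differential-enrichment table (hypothetical layout)."""
    feature: str
    log2_fold_change: float
    padj: float


def filter_candidates(rows: List[DiffResult],
                      padj_max: float = 0.05,
                      min_abs_lfc: float = 1.0) -> List[DiffResult]:
    """Keep rows passing both cutoffs (strictness of each inequality is assumed)."""
    return [r for r in rows
            if r.padj < padj_max and abs(r.log2_fold_change) >= min_abs_lfc]


if __name__ == "__main__":
    toy_genome = "ACGT" * 625  # 2,500 bp toy sequence
    for frag_id, frag in split_into_fragments("toy", toy_genome):
        print(frag_id, len(frag))

    rows = [
        DiffResult("toy_1-1000", 2.3, 0.001),
        DiffResult("toy_1001-2000", 0.4, 0.001),  # fold change too small
        DiffResult("toy_2001-2500", -1.8, 0.20),  # adjusted p-value too large
    ]
    print([r.feature for r in filter_candidates(rows)])
```

In practice the table would come from the DESeq2 output rather than being typed by hand; rows with a missing adjusted p-value (NaN) fail the comparison and are dropped by this filter.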


Horizontal gene transfer prediction
By definition, HGT genes that transferred between two lineages have a shorter evolutionary distance than immobile genes (those that diverged when the lineages diverged). Therefore, HGT genes tend to be more similar [19]. For a pair of genomes, one reliable method of identifying HGT genes is to search for pairs of genes that are more similar to each other than the immobile genes, which reflect the evolutionary distance between the genomes. Genomes assembled from metagenomic shotgun data are inevitably incomplete and inaccurate, especially at the strain level. Therefore, we assembled genomes at the species level and proceeded only with genomes that had at least 25% completeness. To do this, metagenomic shotgun reads were first assembled using MEGAHIT. The resulting contigs were then assigned a taxonomic label using Kraken v0.10.6 [31]; all contigs assigned to the same species were combined to represent the species draft genome. Genes were predicted from each species draft genome using prodigal [32], and completeness of the assemblies was assessed using BUSCO v2 [33].

We identified potential HGT genes in each pair of species genomes. To do this, we first assembled a set of immobile genes from the genome pair to compute a null distribution of sequence similarity. Immobile genes were identified using the bacteria-specific universal single-copy orthologs (USCOs) annotated by BUSCO. Because USCOs are universally present across bacterial lineages and exist only in single copy, their horizontal transfer is unlikely. Second, all gene sequences from a pair of species genomes were clustered using uclust [34] at a 0.5 similarity cut-off to reduce complexity. If any pair of genes within a cluster and from different species genomes had a significantly higher similarity than the immobile genes, the gene pair was identified as horizontally transferred. To mitigate the effect of natural selection, we computed synonymous distance (the number of synonymous changes per synonymous site) as the test statistic for similarity. Each pair of protein-coding genes was first reverse-aligned using the seqinr package [35], after which synonymous distance was computed using PAML [36], which implements an ad hoc method that corrects for codon frequency bias. Finally, to further lessen the influence of purifying selection, we removed HGT candidates that represented essential genes (i.e., genes with positive hits to the DEG 10 database [37] using ublast [34] with e-value < 10^-9) and ribosome genes (i.e., genes corresponding to KEGG BRITE ko03009 and ko03011).

We applied the prediction pipeline to three simulated microbial communities for validation. The simulated communities were generated using HgtSIM [38]. Briefly, each community included three common skin bacterial species: C. acnes, S. epidermidis, and Streptococcus mitis. Each species was represented by three sequenced strain genomes, including one RefSeq representative strain and two other strains from RefSeq. All 9 strains were mixed at equal abundances in each simulated community. From the representative strain, 5 genes were randomly selected and horizontally transferred to all other strains in the community (a total of 15 HGT genes in each community). The three microbial communities differed in the number of mutations accumulated in the HGT genes: the HGT genes were allowed to accumulate 0%, 5%, and 10% mutated bases in the recipient strains for the three communities, respectively. Three million paired-end metagenomic shotgun reads were sampled from each community as part of the HgtSIM pipeline, and subsequently processed using the HGT prediction pipeline described above.
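The comparison of test gene pairs against the immobile-gene null distribution can be sketched as follows. The text does not state which significance test was applied, so this minimal Python sketch assumes a one-sided empirical p-value with a pseudocount and a significance level of 0.05; the function names and all distance values are hypothetical and serve only to illustrate the idea that an HGT candidate has a much smaller synonymous distance than the USCO pairs from the same two genomes.

```python
from typing import List, Sequence, Tuple


def empirical_p_value(test_ds: float, null_ds: Sequence[float]) -> float:
    """One-sided empirical p-value: share of null distances <= the test distance.

    A +1 pseudocount keeps the estimate above zero, so at least 20 null
    values are needed before any pair can reach p < 0.05.
    """
    if not null_ds:
        raise ValueError("null distribution is empty")
    n_le = sum(1 for d in null_ds if d <= test_ds)
    return (n_le + 1) / (len(null_ds) + 1)


def call_hgt_candidates(test_pairs: Sequence[Tuple[str, float]],
                        null_ds: Sequence[float],
                        alpha: float = 0.05) -> List[Tuple[str, float, float]]:
    """Return (pair_id, synonymous_distance, p) for pairs with p < alpha."""
    hits = []
    for pair_id, ds in test_pairs:
        p = empirical_p_value(ds, null_ds)
        if p < alpha:
            hits.append((pair_id, ds, p))
    return hits


if __name__ == "__main__":
    # Toy synonymous distances for immobile (USCO) gene pairs of one genome pair.
    null_ds = [0.47, 0.48, 0.52] * 10
    # Toy test pairs: one far below the null, one inside it.
    test_pairs = [("geneA|geneB", 0.08), ("geneC|geneD", 0.50)]
    for pair_id, ds, p in call_hgt_candidates(test_pairs, null_ds):
        print(pair_id, ds, round(p, 4))
```

A parametric alternative (for example, fitting a distribution to the null distances) would serve the same purpose; the empirical form is used here only because it needs no distributional assumption.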


Statistics
We conducted all statistical analyses using R software. Spearman correlation was used for all correlation coefficient analyses, while statistical differences between population groups were determined using the Wilcoxon rank-sum test. Where multiple measurements (e.g., timepoint, skin site within an individual) were used for correlation analyses, partial Spearman correlation was used, adjusting for multiple measurements. To assess microbial stability over time, we used the Yue-Clayton theta index, which calculates the distance between communities based on the relative abundance of species in the population [39]. Community diversity was determined using the Shannon diversity index, which measures both species richness and evenness.

DECLARATIONS
Ethics approval and consent to participate. Not applicable.
Consent for publication. All authors have approved submission of this manuscript.
Availability of data and material. The data used in this analysis are available in SRA under BioProject 46333.
Competing interests. All authors declare that they have no competing interests.
Funding. This work was funded by the National Institutes of Health (DP2 GM126893-01 and K22 AI119231-01). JO is additionally supported by the National Institutes of Health (1U54NS105539, 1 U19 AI142733, 1 R21 AR075174, 1 R43 AR073562), the Department of Defense (W81XWH1810229), the National Science Foundation (1853071), the American Cancer Society, and Leo Foundation.
Authors' contributions. AE and JO conceived the project. WZ contributed scripts and analysis. AE and JO analyzed data. AE and JO wrote the manuscript.
Acknowledgements. We would like to thank the Oh lab for critical reading of the manuscript.

REFERENCES
1. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 32, 822-828 (2014).
2. Cleary, B. et al. Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning. Nat Biotechnol. 33, 1053-1060 (2015).
3. Luo, C. et al. ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol. 33, 1045-1052 (2015).
4. Tsai, Y. C. et al. Resolving the Complexity of Human Skin Metagenomes Using Single-Molecule Sequencing. MBio. 7, e01948-15 (2016).
5. Oh, J. et al. Biogeography and individuality shape function in the human skin metagenome. Nature. 514, 59-64 (2014).
6. Oh, J. et al. Temporal Stability of the Human Skin Microbiome. Cell. 165, 854-866 (2016).
7. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 499, 431-437 (2013).
8. Hong, C. et al. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2, 33 (2014).
9. Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 9, 811-814 (2012).
10. Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A. 102, 13950-13955 (2005).
11. Li, D., Liu, C. M., Luo, R., Sadakane, K., Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 31, 1674-1676 (2015).
12. Korem, T. et al. Growth dynamics of gut microbiota in health and disease inferred from single metagenomic samples. Science. 349, 1101-1106 (2015).
13. Emiola, A., Oh, J. High throughput in situ metagenomic measurement of bacterial replication at ultra-low sequencing coverage. Nat Commun. 9, 4956 (2018).
14. Wang, J. D., Levin, P. A. Metabolism, cell growth and the bacterial cell cycle. Nat Rev Microbiol. 7, 822-827 (2009).
15. Grice, E. A., Segre, J. A. The skin microbiome. Nat Rev Microbiol. 9, 244-253 (2011).
16. Sharon, G. et al. Specialized metabolites from the microbiome in health and disease. Cell Metab. 20, 719-730 (2014).
17. Donia, M. S., Fischbach, M. A. HUMAN MICROBIOTA. Small molecules from the human microbiota. Science. 349, 1254766 (2015).
18. Zhou, W., Gay, N., Oh, J. ReprDB and panDB: minimalist databases with maximal microbial representation. Microbiome. 6, 15 (2018).
19. Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature. 535, 435-439 (2016).
20. Zeeuwen, P. L., Kleerebezem, M., Timmerman, H. M., Schalkwijk, J. Microbiome and skin diseases. Curr Opin Allergy Clin Immunol. 13, 514-520 (2013).
21. Oh, J. et al. The altered landscape of the human skin microbiome in patients with primary immunodeficiencies. Genome Res. 23, 2103-2114 (2013).
22. Langmead, B., Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods. 9, 357-359 (2012).
23. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 19, 455-477 (2012).
24. Kang, D. D., Froula, J., Egan, R., Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 3, e1165 (2015).
25. Huson, D. H., Auch, A. F., Qi, J., Schuster, S. C. MEGAN analysis of metagenomic data. Genome Res. 17, 377-386 (2007).
26. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043-1055 (2015).
27. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30, 2068-2069 (2014).


28. Liao, Y., Smyth, G. K., Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30, 923-930 (2013).
29. Love, M. I., Huber, W., Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
30. Katoh, K., Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 30, 772-780 (2013).
31. Wood, D. E., Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
32. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 11, 119 (2010).
33. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 31, 3210-3212 (2015).
34. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 26, 2460-2461 (2010).
35. Charif, D., Lobry, J. R. in Structural Approaches to Sequence Evolution: Molecules, Networks, Populations (Bastolla, U., Porto, M., Roman, H. E., Vendruscolo, M. eds.) (Springer, 2007).
36. Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24, 1586-1591 (2007).
37. Luo, H., Lin, Y., Gao, F., Zhang, C. T., Zhang, R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res. 42, D574-D580 (2013).
38. Song, W., Steensen, K., Thomas, T. HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities. PeerJ. 5, e4015 (2017).
39. Yue, J. C., Clayton, M. K. A similarity measure based on species proportions. Commun Stat A-Theor. 34, 2123-2131 (2005).

FIGURE LEGENDS
Figure 1. Hybrid de novo and reference-based approach resolves dark matter in skin metagenome.
(A) Simplified flowchart of the hybrid de novo and reference-based characterization. (B) Boxplots show the fraction of uncharacterized sequences when using the reference database from Oh et al. [6], de novo genome bins only, or the hybrid approach. Black lines indicate the median; boxes indicate the first and third quartiles. (C) Microbial relative abundance across skin sites from a representative individual using the different classification approaches. (D) Heatmap shows microbial relative abundance across all samples from the hybrid de novo and reference-based characterization, segregated by skin site characteristic. Darker colors indicate higher relative abundance. (E) Community diversity using the Shannon diversity index. * = p-value < 0.05, ** = p-value < 0.01, *** = p-value < 0.001, NS = not significant by Wilcoxon rank-sum test. (F) Estimation of community stability using the Yue-Clayton theta index, where θ ~ 1 represents a completely stable community. “Long” refers to the long sampling interval between T1 and T2, while “Short” refers to the short sampling interval between T2 and T3. * = p-value < 0.05, ** = p-value < 0.01, *** = p-value < 0.001, NS = not significant by Wilcoxon rank-sum test.


Figure 2. Growth dynamics of skin bacteria.
Growth rate was calculated using GRiD [13]. (A) Proportion of common skin microbes in exponential vs. stationary growth phase/absent across different skin sites. A microbe with GRiD = 1 was considered to be in stationary phase. C. acnes = Cutibacterium acnes, C. = Corynebacterium, S. = Staphylococcus. (B) Boxplot showing the GRiD score of microbes over time, grouped by skin site characteristics. (C) The heatmap shows partial Spearman correlation coefficient values, correcting for multiple measurements, between growth rate (GRiD) and relative abundance. Black indicates no significant correlation. The scatter plots below show the correlation between C. acnes growth rate (GRiD) and relative abundance from representative skin sites. (D) Boxplot showing microbial growth rate (GRiD) in healthy and primary immunodeficiency cohorts. Significant differences between groups were computed with the Wilcoxon rank-sum test.

Figure 3. Structural variants in Staphylococcus epidermidis.
(A) Venn diagram showing the number of fragments differentially enriched in sebaceous sites when compared with dry, moist, or foot sites. (B) Phylogenetic tree constructed for genes identified in candidate fragments of S. epidermidis and S. capitis. Genes from both species that cluster together are highlighted. (C) Candidate S. epidermidis fragment, corresponding gene, and functional annotation. (D) Spearman correlation between candidate S. epidermidis gene fragments and relative abundance or growth rate (GRiD) in sebaceous and foot sites.

Figure 4. Horizontal transfer of genes in skin community.
(A) Overview of the HGT candidate identification process. Metagenomic reads were assembled and contigs belonging to the same species were pooled into a species draft genome. For each pair of species draft genomes, orthologous gene pairs were predicted. If a gene pair had a significantly smaller synonymous distance than the immobile gene pairs (that is, the universal single-copy orthologs), the pair of genes was identified as an HGT candidate. (B) Distribution of functions (i.e., KEGG BRITE annotations) of all identified horizontal gene transfer (HGT) candidates and of candidates that were annotated as transporters. (C) Detailed functions (i.e., KEGG orthologs) of HGT candidates annotated as metallic cation, iron-siderophore, and vitamin B12 transporters identified in the sebaceous sites. (D) Networks representing the top 10 species pairs for which HGT events were most frequent (i.e., HGT events were identified in the largest number of samples) in each type of skin site. Nodes represent species and edges represent HGT events. In each type of skin site, the size of a node is proportional to the degree of that node.

SUPPLEMENTARY INFORMATION
Supplementary Figure 1. Skin sites, skin physiologic characteristics, and number of samples used. Overview of 698 samples analyzed in this study, encompassing 15 healthy adults and two hyper-IgE patients, three timepoints, and 17 skin sites, representing 4 microenvironments: dry, moist, sebaceous, and foot sites. The numbers adjacent to each site correspond to the total number of samples derived from those sites.

Supplementary Figure 2. Flowchart of hybrid de novo and reference-based approach to resolve microbial dark matter. The inset boxplot represents the percentage of reads mapping back to the contigs/scaffolds (> 1 Kb) catalog after each round of iterative assembly. Black lines indicate the median; boxes indicate the first and third quartiles.

Supplementary Figure 3. Hybrid de novo and reference-based resolution of dark matter.
(A) Scatterplots show concordance in relative abundance between the original classification in Oh et al. [6] and the hybrid approach. Points deviating significantly from the diagonal are those whose relative abundance changed significantly based on the resolution of dark matter. Major genera, phyla, and kingdoms are shown. (B) Annotation of uncharacterized bins. Boxplots show the proportion of uncharacterized bins reassigned to a taxon. We relaxed our initial annotation requirement by excluding contigs/scaffolds with no hits from the MEGAN output and re-ran our annotation pipeline.

Supplementary Figure 4. Higher compositional reconstruction of the skin microbiome.
(A) Community stability across skin sites as calculated by the Yue-Clayton theta index, where θ ~ 1 represents a completely stable community. “Long” refers to the long sampling interval between T1 and T2, while “Short” refers to the short sampling interval between T2 and T3. (B) Heatmap shows partial Spearman correlation values, correcting for multiple measurements, between microbial relative abundance at timepoints T2 vs. T3 (i.e., short time interval) (top) and T1 vs. T2 (i.e., long time interval) (bottom). Black indicates no correlation. C. acnes = Cutibacterium acnes, C. = Corynebacterium, S. = Staphylococcus.

Supplementary Figure 5. HGT candidate identification pipeline.

Supplementary Figure 6. Validation of the HGT prediction pipeline.
(A) Simulated HGT genes that were identified using the pipeline between all species pairs in all simulated communities (0%, 5%, and 10% mutations). (B) Number of HGT genes identified as a function of the synonymous distance of the HGT genes in the species pairs. Identified HGT genes that matched the simulated HGT genes are shown in red.

Supplementary Table 1: Microbial genome bins identified using de novo approach. Bins highlighted in yellow represent non-microbial genomes.

Supplementary Table 2: Bin coverage across samples.

Supplementary Table 3: Community relative abundance resolved using hybrid de novo and reference-based approach.


A De novo Growth rate B approach Reference-based prediction approach Structural variant analysis bioRxiv preprint doi: https://doi.org/10.1101/2020.01.21.914820; this version posted January 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Horizontal gene Percentage of reads Percentage transfer prediction

Abundance, diversity, 0 20 40 60 80 100 stability, co-occurrence Reference v1 Binning only Binning and reference

Bacteria Dry Proteobacteria Moist Hp Vf Ac Ic Id Pc Al Ba Ch Ea Gb Mb Oc Ra Ph Tn Tw Alphaproteobacteria Sebaceous C1.00 Reference v1 Brevundimonas Foot Betaproteobacteria 0.75 Gammaproteobacteria Pseudomonas 0.50 Acinetobacter Moraxella 0.25 Firmicutes Lactobacillales 0.00 Streptococcus Clostridiales 1.00 Bacillus Staphylococcus 0.75 Staphylococcus aureus Binning Staphylococcus epidermidis 0.50 Bacteroides Prevotella 0.25 Actinobacteria Relative abundance Relative 0.00 Gordonia 1.00 Binning and re Propionibacterium Cutibacterium acnes 0.75 Corynebacterineae Corynebacterium 0.50 Corynebacterium jeikeium Corynebacterium ureicelerivorans 0.25 Deinococcus

f. Eukaryota 0.00 Fungi Malasseziaceae T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 T1 T2 T3 Viruses Uncharacterized bins Dark matter E D Dry Moist Sebaceous Foot 3 *** *** *** * *** *** *** *** *** NS *** *** *** *** NSNS NS Proteobacteria Alphaproteobacteria Brevundimonas Betaproteobacteria Dry 2 Moist Gammaproteobacteria Sebaceous Pseudomonas Foot Acinetobacter Moraxella Firmicutes 1 Reference v1 Lactobacillales Binning and reference

Streptococcus index Shannon diversity Clostridiales Bacillus 0 Staphylococcus Hp Vf Ac Ic Id Pc Al Ba Ch Ea Gb Mb Oc Ra Ph Tn Tw Staphylococcus aureus Staphylococcus epidermidis Bacteroides F Dry Moist Prevotella NS * *** 1.00 *** *** *** Actinobacteria Actinomycetales 0.75 Gordonia Propionibacterium 0.50 Cutibacterium acnes 0.25 Corynebacterineae Corynebacterium 0.00 Reference v1 Corynebacterium jeikeium Sebaceous Foot *** *** *** NS NS * Binning and reference Corynebacterium ureicelerivorans 1.00 Theta index Deinococcus 0.75 Eukaryota Fungi 0.50 Malasseziaceae Phages 0.25 Eukaryotic viruses 0.00 Long Short Inter Long Short Inter personal personal A B

Dry Dry Moist Sebaceous Foot Moist bioRxiv preprint doi: https://doi.org/10.1101/2020.01.21.914820; 3.0this version posted January 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. C. jeikeium 2.5

2.0 T1 C. ureicelerivorans 1.5 T2 T3 1.0 GRiD C. acnes Sebaceous Foot 3.0

2.5 S. aureus 2.0

1.5 S. epidermidis 1.0

C. acnes C. jeikeium S. aureus C. jeikeium C. acnesS. aureus Mostly growing Absent/mostly stationary S. epidermidis S. epidermidis C. ureicelerivorans C. ureicelerivorans

C D

C. jeikeium C. acnes S. aureusS. epidermidis NS p = 0.002 NS NS P = 0.013 C. ureicelerivorans Hp Dry Vf Moist Ac Sebaceous 2.5 Ic Foot Id Pc Al Ba Ch Ea 2.0 Gb 0 0.5 healthy Mb Oc hyper_IgE Ra GRiD Ph −0.5

Tn Spearman correlation Tw 1.5 Vf Id Al Tn 0.8 0.006 0.6 0.4 0.75 0.004 0.4 0.50 0.2 0.002 0.2 0.25 0.0 0.000 1.0 0.0 0.00 Relative Abundance Relative 1.02 1.05 1.08 1.11 1.00 1.05 1.10 1.15 1.00 1.05 1.10 1.15 1.005 1.015 1.025 1.035 C. jeikeium C. acnes S. aureus S. epidermidis GRiD C. ureicelerivorans A B enriched in sebaceous sites

v. dry bioRxiv preprint v.doi: foot https://doi.org/10.1101/2020.01.21.914820v. dry v. foot ; this version posted January 23, 2020. The copyright holder for this preprint 7 (which was not certified by peer 65review) is the author/funder. All rights reserved. No reuse allowed without permission. 1637 1925 293 157 Height 6 11

7 28 17 13 S_epi_14090 S_epi_20627 0 1 2 3 S_epi_19594

12 3 S_capitis_03029 S_capitis_03333 S_capitis_03300 S_epi_13620 S_capitis_02938 S_capitis_03512 S_capitis_03299 S_capitis_03321 S_capitis_03320 S_capitis_03325 S_capitis_03332 S_epi_19593 v. moist v. moist S_epi_19640 S_capitis_00325 S_capitis_03322

S. epidermidis S. capitis S_epi_13619 S_capitis_02939

C D Foot site Sebaceous site

Fragment Gene Functional annotation S_epi.fa_11535001−11536000 S_epi.fa_11535001-11536000 S_epi_13619 CsbD stress response protein S_epi.fa_17226001−17227000 S_epi.fa_11535001-11536000 S_epi_13620 hypothetical protein S_epi.fa_12070001-12071000 S_epi_14090 hypothetical protein S_epi.fa_17193001−17194000 S_epi.fa_17193001-17194000 S_epi_19593 hypothetical protein S_epi.fa_12070001−12071000 S_epi.fa_17193001-17194000 S_epi_19594 transcriptional regulator S_epi.fa_17080001−17081000 S_epi.fa_17226001-17227000 S_epi_19640 hypothetical protein S_epi.fa_18242001-18243000 S_epi_20627 hypothetical protein S_epi.fa_18242001−18243000

Plasmid-borne Phages

Spearman correlation cor. with rel abund cor. with rel abund cor. with growth rate cor. with growth rate A C

Metagenomic reads

Species 1 draft genome Species 2 draft genome

Immobile gene pairs

Test gene pairs Distance 0.47 0.48 0.52 P r obability

Distance=0.08 Predicted HGT candidates

0.0 Not identi fied in the HGT

Predicted HGT gene Distance gene pool

B D

Dry C. simulans oteins oteins p r

s oteins oteins

t M. aurum p r i s otein

p r C. ureicelerivorans p r d ligands oteins em ymes is ociated p r ys t ery oteins sociated

C. granulosum esis p r oteins oteins oteins a s n binding e s oteins

s ecombination tors ide biosynthe s es oteins t system nd as s achi n d folding cataly s P. fluorescens d r molecules a n a em p r fera s tem

elated en z M. luteus s a n ial biogenesis defense s e s xins otility p r RNA biogenesis ases hesis p r o C. acnes nsfera s ta s ome P45 0 ome one s r eleton p r aminoglyc a yltra n eplication p r epair a n etide biosynt h porters cription fa c cription m lation facto r fer RNA biogene s aryotic etion sys t -compone n

omosome and L. clevelandensis osome otein ki n enyltr a o k otea s r r r r oly k eptidases ran s ran s ran s ran s ran s w o Phosph a P T Cytoc h Glyco s T Cytos k Photosyn t P P Amino acid- r Bacterial m Bacterial t Sec r P T P Chape r DNA r CD molecules Ubiquitin sy s E x Cell adhesion P Glyco s Lipid biosynthesis p r Lipopolysaccha r T Mitochond r DNA r Ch r T Messenge r T A. oris G. bronchialis 1.00 Dry 0.75 Moist 0.50 C. aurimucosum C. jeikeium 0.25 0.00 C. ureicelerivorans 1.00 L. clevelandensis Moist 0.75 0.50 P. alcaligenes nces M. luteus 0.25 C. acnes 0.00 1.00 G. vaginalis Sebaceous 0.75 S. maltophilia C. granulosum elative abund a

R 0.50 Sebaceous bioRxiv 0.25preprint doi: https://doi.org/10.1101/2020.01.21.914820; this version posted January 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 0.00 C. kroppenstedtii 1.00 C. frankenforstense Toenail 0.75 0.50 C. granulosum C. singulare 0.25 C. acnes 0.00 S. capitis

S. epidermidis L. clevelandensis 1.00 0.75 Dry Toenail 0.50 C. singulare 0.25 0.00 1.00 P. acnes M. luteus Moist K rhizophila 0.75 S. pettenkoferi 0.50 nces 0.25 0.00 S. simulans S. camporealensis 1.00 A. prevotii S. epidermidis 0.75 Sebaceous S. warneri 0.50

elative abund a A. mediterraneensis R 0.25 0.00 1.00 0.75 Toenail 0.50 0.25 0.00 els ers ers t ters ters ers s s riers han n ers sporters r porter ters porters s family anspo r anspo r anspor t porters sporters anspor t anspo r sporter sporte r porters anspor t sporters tra n ran s e system on c a nsporte r ran s otran s es ion c anspo r anspor t s in t r se system el tra n o r nsporte r own t r Drug t r cotra n xtrusion otein t r Metal t r acid t r P Sugar id tra n r ion tra n P nitrite t nsfera s ate t r e elect r nic k other t iven tr a xin e Unk n ansfer a d lipid tran s o ganic ory facto r ganic al solute t r osphate O r d VB12 tr a amino a c Nitrate / osphotr a eut r cces s nd o r membra n type and g and t A e, a n Sodium bile salt c eptide and hosphot r P ganophosp h te and C-2 ran s o r T II in P h and n al potential-d r A B r in P e opho r Multidr u ype II Na+-p h aride, polyol, a n Mineral a T H P te and Phosph a Enzy m ochemi c Sacc h quaporin s on-side r e I and A Elect r Phosph a tion, i r Enzy m Metallic c a bioRxiv preprint doi: https://doi.org/10.1101/2020.01.21.914820; this version posted January 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

40 (Ch) Cheek Alar crease (Al) 41 37 (Gb) Glabella

39 (Ea) External auditory canal

Retroauricular crease (Ra) 44

38 (Mb) Manubrium Occiput (Oc) 41

Back (Ba) 36 43 (Ac) Antecubital fossa

41 (Vf) Volar forearm (Id) Interdigital 35 web

(Pc) Popliteal fossa 43 40 (Hp) Hypothenar palm

43 (Ic) Inguinal crease Toenail (Tn) 35

40 (Tw) Toe web space Plantar heel (Ph) 41

Body views: Front, Back. Site types: Sebaceous, Moist, Dry, Foot.
Additional samples: Nares (13), Axilla (2), Control mock community (6).

[Figure: annotation workflow for the sample pool (n = 698).]

1. Concatenate all samples and run de novo assembly (MEGAHIT) to obtain contigs.
2. Extract contigs > 1 Kb to form the contig pool (R1), map reads with bowtie, and collect the unmapped reads.
3. Concatenate a subset of unmapped reads from selected samples.
4. Assemble the subset (de novo assembly with SPAdes), extract contigs/scaffolds > 1 Kb, and concatenate them with the previous contig pool (i.e. R1). Repeat this step until n = 5.
5. Bin the contigs/scaffolds (Rn) with MetaBAT to obtain genome bins.
6. Extract the unmapped (unbinned) reads and classify them with Pathoscope to obtain Pathoscope-assigned reads.

[Inset plot: percentage of reads mapping (0 to 100) for contig pools R1 to R5; legend: MEGAHIT, SPAdes.]
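The workflow ends by counting as dark matter every read that is neither placed in a genome bin nor assigned by Pathoscope. A minimal sketch of that read accounting follows; the function names and the example counts are hypothetical illustrations, not values or code from the study.

```python
def dark_matter_reads(total_reads: int, binned_reads: int, pathoscope_reads: int) -> int:
    """Reads left uncharacterized after binning and Pathoscope assignment.

    Assumes the two assigned categories do not overlap, i.e. Pathoscope
    is run only on reads that were not binned.
    """
    assigned = binned_reads + pathoscope_reads
    if min(total_reads, binned_reads, pathoscope_reads) < 0:
        raise ValueError("read counts must be non-negative")
    if assigned > total_reads:
        raise ValueError("assigned reads exceed total reads")
    return total_reads - assigned


def dark_matter_fraction(total_reads: int, binned_reads: int, pathoscope_reads: int) -> float:
    """Dark matter reads as a fraction of all reads in the sample."""
    if total_reads == 0:
        raise ValueError("total_reads must be positive")
    return dark_matter_reads(total_reads, binned_reads, pathoscope_reads) / total_reads


# Hypothetical counts for a single sample (illustration only).
example_total = 1000
example_binned = 550
example_pathoscope = 300
example_dark = dark_matter_reads(example_total, example_binned, example_pathoscope)
example_fraction = dark_matter_fraction(example_total, example_binned, example_pathoscope)
```

Computing the fraction per sample, rather than on the pooled reads, would allow the uncharacterized proportion to be compared across skin sites.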

dark matter reads = total number of reads - (number of binned reads + number of Pathoscope-assigned reads)

[Figure, panel A: scatter plots comparing relative abundance under the two annotation approaches. x-axis: Relative abundance (Reference v1); y-axis: Relative abundance (Binning and reference). Columns: Dry, Moist, Sebaceous, Foot. Rows: Eukaryota, Firmicutes, Proteobacteria, Actinobacteria, C. acnes, Corynebacterium, Staphylococcus, S. epidermidis, Viruses.]

[Figure, panel B: stacked relative abundance (0.00 to 1.00) by site type (Dry, Foot, Moist, Sebaceous). Legend: Proteobacteria, Bacteroidetes, Firmicutes, Actinobacteria, Other Bacteria, Eukaryota.]

[Figure, panels A and B: Theta index (0.00 to 1.00) for each skin site (Hp, Vf, Ac, Ic, Id, Pc, Al, Ba, Ch, Ea, Gb, Mb, Oc, Ra, Ph, Tn, Tw) across Short, Long, and Inter-personal comparisons, shown for Reference v1 and for Binning and reference, with sites colored by type (Dry, Moist, Sebaceous, Foot); and a heatmap of Spearman correlation coefficients (color scale 0.2 to 1) with skin sites as rows and taxa as columns: C. acnes, S. aureus, C. jeikeium, Streptococcus, S. epidermidis, Pseudomonas, Staphylococcus, Malasseziaceae, Corynebacterium, Propionibacterium, C. ureicelerivorans.]

[Figure, panels A and B: horizontal gene transfer (HGT) detection on simulated data. One panel shows genes horizontally transferred from C. acnes, S. epidermidis, and S. mitis at 0%, 5%, and 10% mutation for the species pairs C. acnes - S. epidermidis, C. acnes - S. mitis, and S. epidermidis - S. mitis. Legend: successfully identified HGT gene; HGT gene not recognized by Prodigal; HGT gene without fully assembled sequence. The other panel shows counts of HGT against synonymous distance at 0%, 5%, and 10% mutation. Legend: All predicted HGTs; Predicted HGTs that match simulated HGTs.]