Inter-species interactions in microbial communities

Citation Hsu, Tiffany Yeong-Ting. 2018. Inter-species interactions in microbial communities. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:42015251

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Inter-species interactions in microbial communities

A dissertation presented

by

Tiffany Yeong-Ting Hsu

to

The Division of Medical Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Biological and Biomedical Sciences

Harvard University

Cambridge, Massachusetts

October 2017

© 2017 Tiffany Yeong-Ting Hsu

All rights reserved.

Dissertation Advisor: Professor Curtis Huttenhower

Tiffany Yeong-Ting Hsu

Inter-species interactions in microbial communities

Abstract

Microorganisms are omnipresent and exist as communities within and around the human body. These communities, regardless of location, may cause disease: dysbioses within the gut microbiota are associated with obesity and inflammatory bowel disease, while differences in immune development and environmental exposures are linked to atopy and diabetes. It is thus crucial to characterize microbial communities and their interactions to better understand how they are formed, maintained, and manipulated. To better understand the ecology of communities on and around the human body, my work has explored lateral gene transfer (LGT) within human-associated microbial communities and the transfer of microbes between the human body and environmental surfaces.

I developed WAAFLE (a Workflow to Annotate Assemblies and Find LGT Events), the first method for detecting de novo LGT events from metagenomes. I applied WAAFLE to the Human Microbiome Project: LGT frequencies were highest in the gut and oral sites, and lowest in the vaginal and skin microbiomes. High-frequency pairs corresponded with increased taxon abundances and close phylogenetic distances. Taxa found in multiple LGT pairs had strong partner preferences, and several had biases in transfer directionality. Enriched functions in LGT contigs included transposases, phage, and TonB membrane receptors. Taxa in high-frequency LGT pairs may preferentially use LGT as a tool to maintain or change their community status.

I examined cross-talk between human-associated and built-environment microbial communities in heavily trafficked environments, specifically the Boston subway. These areas may facilitate microbial transmission and are ripe for public health interventions such as sanitation or architecture. We used 16S rRNA gene and metagenomic shotgun sequencing to profile microbes on multiple surface types in trains along the red, green, and orange lines, as well as ticketing machines at four train stations. Community structure was dictated by surface type, rather than train line. Common taxa included human skin and oral commensals such as Propionibacterium, Corynebacterium, Staphylococcus, and Streptococcus. Enriched functions were often from Propionibacterium acnes pathways, and few antibiotic resistance genes were observed.

Overall, microbial communities on the Boston subway are likely derived from the rider population and influenced by rider interactions and environmental biochemistry.

Table of Contents

Abstract
Table of Contents
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
Chapter 1: Introduction
  Copyright Disclosure
  Overview
  The significance of lateral gene transfer
  Mechanisms and discovery of lateral gene transfer
  Problems with the prokaryotic “species concept”
  Methods for identifying species and LGT
  LGT in microbial communities
  Transferred functions and their associated costs
  Evolutionary legacy of LGT
  Surveying microbial communities in the built-environment
  Microbial composition of the built-environment
  Applications for the built-environment
  Technical considerations for sampling the built-environment
  The role of DNA sequencing for microbial profiling
  Amplicon Sequencing
  WMS Sequencing
  Contig Assembly
  Summary
Chapter 2: Lateral Gene Transfer in the Human Microbiome
  Attributions
  Introduction
  Results
  Identifying recent LGT events from metagenomic shotgun sequencing
  WAAFLE performance on synthetic data
  Rates of novel LGT events across the human microbiome
  LGT frequency and pair formation are shaped by abundance and phylogeny
  Genera have preferred transfer partners that are shared across similar sites
  Mobile elements and TonB receptors are enriched in LGT contigs
  Discussion
  Methods
Chapter 3: Urban transit system microbial communities differ by surface type and interaction with humans and environment
  Copyright Disclosure
  Attributions
  Abstract
  Importance
  Introduction
  Results
  Sampling microbial communities on the Boston transit system
  Microbial communities are specific to surface types and immediate environment
  Subway microbial communities are largely derived from human skin and oral commensal microbes
  Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial community
  All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and environmental taxa across seats and touchscreens
  Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces
  Minimal pathogenic and antibiotic resistance presence on the Boston transit system
  Discussion
  Materials and Methods
  Acknowledgements
Chapter 4: Conclusions
Appendix I: Supplemental Materials for Chapter 2
Appendix II: Supplemental Materials for Chapter 3
References

Acknowledgements

I came to Harvard determined to learn “computational biology”. Considering that my laboratory experience far exceeded my programming experience (5 years versus 10 weeks), I must first thank Dr. Curtis Huttenhower for taking a chance on me. In my first email to him, I wrote:

“…I am interested in learning how to analyze large datasets and make some sense out of them. I feel that it is no longer sufficient to look at just a few key genes - especially when there are now ways to profile entire genomics, transcriptomes, and proteomes - though all the associations found will still have to be validated molecularly. Still, I think it's exciting that there is a chance to look at the entire network and see how it works.

I was wondering if you took rotation students - or knew of anyone who might train a student to do dry work - since I have a wet lab background. I was also wondering what your opinion was on how much of an "omics" understanding a scientist might need.”

What I have learned during my time in the lab has completely exceeded those expectations. The Huttenhower Lab is a rare place that does not distinguish its bioinformaticians from its experimentalists. Every member is free to learn both, and they often do, through the process of helping each other out. Curtis was also willing to help me take on projects I was initially unqualified for, such as WAAFLE, which was born out of my qualifying exams.

Second, I must thank both past and present members of the Huttenhower Lab. Curtis has assembled a wonderful team of people. To each of you, I would like to say: “You have qualities that I strive to emulate, and skills and knowledge that I still hope to learn some day.” I specifically want to thank two people, Dr. Eric Franzosa and Dr. Regina Joice.

Eric was my mentor throughout my PhD; without him I would not have graduated. The beginning of my PhD was difficult, because the way computational biologists thought and the terms they used were alien to me. It was not always clear what analyses were being suggested or why, and how to carry them out. Eric always took the time to explain these analyses, by breaking down the underlying assumptions and hypotheses. When I had trouble turning those analyses into code, he would show me his code and introduce me to new syntax. Eric was often the first to review my grants and paper drafts: I learned a lot about writing from his revisions. Towards the end of my PhD, when I had trouble mentoring and tutoring students, it was again Eric that I turned to for advice. I hope I will become an equally skilled and kind scientist as I move through my career.

Regina was my mentor throughout the MBTA project. Since she had a wet lab background, she could anticipate my confusion and would help me if she knew the answer, or help me rephrase the question so someone else could. When I got lost in the computational aspects of my work, she would always steer me back to the biological question we were asking. She also freely shared advice when I asked for it: I still remember sidling up and saying, “Regina, I have a science/graduate school/life question, would you have time to talk later?”

Third, I want to thank my scientific colleagues outside the lab, including Dr. Morgan Langille and Dr. Robert Beiko, the WAAFLE co-authors; Dr. Georgina Hold, for involving me in her comparative genomics project; Dr. Wendy Garrett, who gave me access to her laboratory when we didn’t have the right equipment; Dr. Eric Rubin, Dr. Michael Springer, and Dr. Colleen Cavanaugh, my dissertation advisory committee; and Dr. Ting-Ting Wu, my undergraduate research mentor. To Morgan, Rob, and my advisory committee, I have always enjoyed and appreciated your feedback on my projects. I have heard horror stories about collaborators and committees: all five of you were truly a pleasure to work with, and even took time to meet with me one-on-one, whether it was for advice, beer, or while driving me to see Bonnie Bassler. To Ting, despite all your cautionary advice, I still went to graduate school! Without you, I would never have experienced scientific research, and I hope we stay friends and colleagues for the years to come.

Fourth, I must thank the administrative staff, including Nicole Levesque, the Biostatistics Department program coordinator, who magically scheduled me into Curtis’s schedule over the past five years; as well as Kate Hodgins, Anne O’Shea, Danny Gonzalez, and Maria Bollinger, the present and former BBS program administrators, who have always swiftly responded to questions about Harvard and graduate school.

Lastly, I want to thank my family and friends. Both my mother, Lichuan Hsu, and brother, Eric Hsu, have always been there to support me. They have heard more than their fair share of gripes and complaints along the way. My father, Che-Chang Hsu, is no longer here, but I believe he would be proud of my work. As an electrical engineer, he was extremely excited when I told him I was going to learn Python and described Curtis’s work. I am glad he was able to see me start my bioinformatics journey. My partner, Wesley Hong, always gives me new perspectives to consider, and is there to remind me that graduate school is not everything, but a small step towards our aspirations. To my friends, I will remember the late night problem sets, races and shopping trips, and surprise birthday parties: it is you who have made my time here in Boston/Cambridge all the merrier.

List of Figures

Figure 2-1. WAAFLE pipeline overview.

Figure 2-2. WAAFLE parameter evaluation.

Figure 2-3. LGT rates are highest for oral and stool sites.

Figure 2-4. Both abundance and phylogeny affect LGT rates.

Figure 2-5. Taxa degree and differential edges.

Figure 2-6. Enriched functions show taxon and structural similarities across sites.

Figure 3-1. Collection of samples from MBTA trains and stations.

Figure 3-2. Taxonomic composition of subway microbial communities.

Figure 3-3. Putative MBTA microbial community sources.

Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes.

Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate analyses.

Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P. acnes removal.

Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on subway surfaces.

Figure I-1. Filtering potential misassemblies.

Figure I-2. Determining which contig types contain misassemblies.

Figure I-3. Gene call evaluation.

Figure I-4. LGT evaluation with or without missing BLAST hits.

Figure I-5. Selection of k1 and k2.

Figure I-6. Comparison of LGT measures.

Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and technical samples.

Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites.

Figure II-1. Biomass and alpha diversity for train and station samples.

Figure II-2. Ordination of surface data subsets.

Figure II-3. Comparison of antibiotic resistance markers from the ARDB database.

Figure II-4. Letter from the MBTA.

List of Tables

Table I-1. WAAFLE Parameters.

Table II-1. Sample collection and metadata.

Table II-2. 16S and shotgun OTU tables along with taxa present across sequencing plate.

Table II-3. LEfSe and MaAsLin analysis for 16S sequencing.

Table II-4. MaAsLin analysis for shotgun data.

Table II-5. Antibiotic resistance gene and virulence factor markers.

List of Abbreviations

antibiotic resistance (ABR)
antibiotic resistance genes (ARG)
base pair (bp)
biological species concept (BSC)
coding sequence(s) (CDS)
ecological species concept (ESC)
false positive rate (FPR)
gene transfer agent (GTA)
Human Microbiome Project (HMP)
Human Microbiome Project Phase 1-II (HMP 1-II)
interpolated variable order motifs (IVOM)
kilobase (kb)
last universal common ancestor (LUCA)
positive predictive value (PPV)
single nucleotide polymorphisms (SNP)
true positive rate (TPR)
WAAFLE (Workflow to Annotate Assemblies and Find LGT Events)
whole metagenome shotgun (WMS)

Chapter 1:

Introduction

Copyright Disclosure

Portions of this Introduction appear in or are adapted from the following publications:

Franzosa, E.A., T. Hsu, A. Sirota-Madi, A. Shafquat, G. Abu-Ali, X.C. Morgan, C. Huttenhower, Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nature Reviews Microbiology, 2015. 13(6):p. 360-72.

Overview

There are approximately 3.8 × 10¹³ bacterial cells in the average 70 kg man, which is roughly equal to the number of human cells in the body [1]. These bacterial cells are found as microbial communities [2], and may interface with the immediate environment outside the host [3]. Within a microbial community, individual taxa may have different phenotypes as compared to the overall community: some have proposed that an individual microbe may be viewed as a component cell of a multicellular organism, in which components communicate to coordinate growth, movement, and biochemical activities in order to efficiently proliferate, access new resources, and defend against antagonists [4]. It is therefore necessary to study microbial interactions at both the individual and community scale. Furthermore, microbial communities may influence or be influenced by the surrounding environment. Humans emit a detectable microbial cloud into the surrounding air [5], and skin microorganisms are influenced by temperature, moisture, and ultraviolet radiation [6]. Thus, it is important to characterize microbial interactions within a community, as well as microbial interactions with the surrounding environment, in order to understand community formation, maintenance, and function.

Microbial profiling began with Antonie van Leeuwenhoek, who observed microorganisms using a self-built microscope and classified them based on morphology [7]. Louis Pasteur and Robert Koch later popularized the use of what are now considered traditional culture methods to isolate microbes and observe their phenotypes [8]. However, “The Great Plate Count Anomaly” showed that the majority of microbes were not being cultured: Razumov observed that viable plate counts were much lower than microscopic counts [9-11]. The advent of 16S and metagenomic shotgun sequencing partially solved this problem by allowing scientists to identify and classify not-yet-culturable microbes. Coupled with other ‘omics’ data (including transcriptomics, proteomics, and metabolomics) and appropriate study design, researchers can begin to better understand microbial interactions at both the individual and community scale, and across different environments.

In this Introduction, I will first explore lateral gene transfer (LGT), one type of interaction within microbial communities. Specifically, I will discuss its mechanisms, history, and roles in the human microbiota. Next, I will delve into the interactions between human-associated microbial communities and the built-environment, where humans spend the majority of their time. Finally, I will outline the potential and limitations of DNA sequencing approaches for profiling microbial communities.

The significance of lateral gene transfer

Mechanisms and discovery of lateral gene transfer

One of the most important types of interactions within microbial communities has proven to be LGT. LGT occurs when genetic information (or DNA) is passed from a single cell to a neighboring cell (lateral transmission), rather than from parent to offspring (vertical transmission). LGT is primarily known to occur through three mechanisms (transformation, transduction, and conjugation) and via two more recently discovered mechanisms, gene transfer agents (GTA) and cell fusion [12]. Transformation ensues when a bacterium takes up naked DNA from the environment and incorporates it into its own genome. Transduction occurs when a bacteriophage accidentally packages part of the host genome with its own genome, which is then injected and integrated into the next infected bacterium. Conjugation requires physical contact between two bacteria, and involves DNA transfer from one bacterium to the other via a multiprotein apparatus. The different mechanisms of LGT limit both the potential participants and the amount of DNA transferred. For example, transduction restricts LGT partners to those within the same phage host range, and phage can only package a small quantity of DNA. Of the more recently discovered mechanisms, GTA are DNA elements evolved from prophages; they package small pieces of bacterial DNA in capsids and transfer them to nearby hosts [13]. Cell fusion is similar to sexual reproduction in eukaryotes in that microbial cells physically join and may bi-directionally transfer DNA [14].

LGT was initially considered a curiosity, but is now recognized as a potentially strong evolutionary force in prokaryotes. Assuming one LGT event for every 10¹⁰ vertical replications, no gene in any modern genome can be linked to the last universal common ancestor (LUCA) through vertical descent [15]. LGT was first observed in 1928 as transformation: “R” (“rough”, avirulent) Pneumococcus strains alone could not cause disease in mice, but would kill mice if mixed with heat-killed “S” (“smooth”, virulent) Pneumococcus strains [16]. In 1944, Avery, MacLeod, and McCarty reported that the agent of this particular phenomenon (conversion of R strains to S strains) was DNA [17]. In the 1960s, Japanese researchers found that multi-drug resistant Escherichia coli could transfer resistance to drug-sensitive Shigella through conjugation [18-20], elevating LGT to a cause for concern. Finally, in 1999, researchers found that 20-25% of Aquifex aeolicus and Thermotoga maritima genes were more similar to Archaea than Bacteria [21, 22], indicating that LGT can cross domains in the tree of life.

Problems with the prokaryotic “species concept”

Identifying LGT between different species is of particular interest, since these events may increase the fitness of individual microbes, which in turn may alter microbial communities. In both macro- and microbiology, species are defined as clusters of similar organisms, though what drives the separation of these clusters is unclear for microorganisms. Historically, macro-organisms were delineated based on morphology, while microorganisms were classified based on metabolic characteristics [23]. The introduction of the “biological species concept” (BSC) by Ernst Mayr in 1942 attempted to unify existing systematics and the theory of evolution, and stated that “species are groups of actually or potentially interbreeding natural populations, which are reproductively isolated from other such groups” [24, 25]. This definition formalized “species” as a unit of ecology and evolution, and identified “reproductive isolation” as the driver for species formation.

The BSC did not work well for microorganisms or plants, due to LGT and the ability to form hybrids, respectively. Still, several scientists attempted to apply the BSC to bacteria. Ravin searched for similarity between “genospecies”, defined as groups of bacteria that could exchange genes, and “phenospecies”, defined as groups of bacteria that shared metabolic phenotypes. Unfortunately, the two groups did not correlate well, indicating that the ability to exchange genes does not necessarily correspond to phenotype [26]. Dykhuizen and Green proposed defining bacterial species as strains that could undergo recombination with each other but not with other strains [27], which proved to be impractical given the frequency of LGT and the large size of a species’ pan-genome [28].

In 2002, Frederick Cohan argued that ecology was the driver of species clusters in bacteria (as opposed to reproductive isolation). He proposed defining bacterial species as “ecotypes”, which are “…set(s) of strains using the same or similar ecological resources, such that an adaptive mutant from within the ecotype out-competes to extinction all other strains of the same ecotype; an adaptive mutant does not, however drive to extinction strains from other ecotypes” [29]. This definition has also been referred to as the “ecological species concept” (ESC). Cohan’s first model was termed the “stable ecotype model,” which assumed that 1) microorganisms exist as large populations (10¹⁰ cells) and that 2) population genetic diversity is largely controlled by periodic selection, in which a single species consistently sweeps the population [30], rather than genetic drift. He pointed out that the latter was supported by long-term culture experiments, which often gave rise to strains with different phenotypes [31-33]. With this, ecotypes could be detected as sequence clusters due to genome-wide sweeps in microbial populations.

Recent work has observed that gene-specific sweeps, rather than genome-wide sweeps, occur in microbial populations [34-36]. However, previous work has shown that the recombination rate is usually lower than the mutation rate, and thus a gene should not undergo a different rate of selection as compared to its genome [37]. To reconcile these observations, Cohan proposed the “Adapt Globally, Act Locally” model, in which multiple ecotypes adopt the same gene through lateral transfer, but maintain separate evolutionary trajectories [37, 38]. In 2012, Shapiro et al. expanded upon this theory by characterizing two populations of Vibrio cyclitrophicus, in which they found that i) SNPs associated with a specific population were constrained to specific genome regions, and ii) recent recombination was more common within a population than between populations [39]. From this, they proposed that microbes undergo gene transfer, leading to gene sweeps. Since transferred genes are habitat-specific, gene sweeps prompt populations to specialize, which in turn decreases gene flow between different populations and leads to the formation of distinct genomic clusters. Their observations imply that gene-specific sweeps can lead to the formation of new species. More recent work has focused on characterizing conditions under which gene-specific sweeps may occur [40], as well as how gene transfer and genetic drift work together towards speciation [41].

Methods for identifying species and LGT

The BSC and ESC disagree on the force (i.e., reproductive isolation versus ecological specialization) that drives speciation, but both agree that DNA sequence clusters will correspond with species. Compositional biases between species were observed as early as 1959, when the buoyant density of nine different bacterial DNAs was found to be highly correlated with the molar fraction of guanine and cytosine [42]. Microbial species were originally distinguished via DNA-DNA hybridization, in which a single-stranded reference DNA and a single-stranded query DNA are mixed, and the degree of binding between the two molecules is measured [43]. If molecules from the query organism showed ≥70% re-association with the reference DNA molecules, the query and reference organisms were classified as the same species [44]. With the advent of DNA sequencing, scientists began sequencing cultured isolates. In 1995, the first bacterial genome, that of Haemophilus influenzae, was sequenced [45]. The reference genome database grew exponentially: by 2000, 27 microbial genome sequences had been published [46], and by 2005, 220 microbial genomes were sequenced with another 650 in progress [47]. One study utilized this growing set of reference genomes and showed that 50 kilobase (kb) segments of a prokaryotic genome are more similar to each other than to other genomes, and reflect species-specific properties for DNA modification, replication, and repair [48]. Biases in nucleotide composition between species have since been used for genome and metagenome assembly, as well as for LGT detection.
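The idea that segments of one genome resemble each other more than they resemble other genomes can be illustrated with a small compositional comparison. The Python sketch below is only a toy under stated assumptions: it uses tetranucleotide frequencies and Euclidean distance as the similarity measure, which is not necessarily the measure used in the cited study [48], and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

In practice one would compute a profile for each 50 kb segment and compare within-genome distances to between-genome distances; the segment length and k are parameters the reader would need to choose.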

The earliest LGT studies observed the transfer of phenotypes (i.e., “R” Pneumococcus strains becoming virulent, or Shigella acquiring antibiotic resistance), but the majority of new studies utilize computational methods to detect LGT in sequenced genomes. Computational methods usually fall into two bins: tree-based and non-tree-based methods. Tree-based methods involve comparing gene trees to a species tree, in which the species tree is often constructed from a slowly evolving, essential gene such as the 16S rRNA gene or a combination of housekeeping genes [49, 50]. Each phylogenetic tree reflects the evolutionary history of the gene(s) used to construct it. Thus, if the evolutionary history of a gene deviates significantly from that of the species tree, it may be explained by LGT, duplication, gene loss, incomplete lineage sorting, or homologous recombination [51]. Tree-based methods further enable inference of directionality and time of transfer. Directionality may be based on the “out-of-Africa” principle, which assumes that the taxonomic group with the largest representation of the transferred gene is the donor [52, 53].
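As a toy illustration of comparing a gene tree to a species tree, the following Python sketch counts the clades that appear in one rooted tree but not the other (a rooted variant of the Robinson-Foulds distance). This is an assumption-laden simplification: trees are written as nested tuples of hypothetical leaf names, branch lengths and statistical support are ignored, and a nonzero distance only signals discordance, which, as noted above, may also arise from duplication, loss, incomplete lineage sorting, or recombination.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree given as nested tuples.

    Leaves are strings; every internal node contributes the frozenset of
    leaves beneath it to clade_set.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def rooted_rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


# Hypothetical five-taxon example: the gene tree groups A with D,
# conflicting with the species tree, which groups A with B.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_tree = ((("A", "D"), "C"), ("B", "E"))
discordance = rooted_rf_distance(species_tree, gene_tree)
```

A distance of zero means the two topologies agree; larger values flag gene trees worth closer inspection.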

Tree-based methods are considered the gold standard, but are more computationally intensive than non-tree-based methods. Methods that do not require trees can be subdivided into compositional and gene-based methods. Compositional methods search for changes in GC content, oligonucleotide frequencies, or even structural features, such as interaction energies between base pairs or chromatin structure, any of which may have arisen through LGT. In contrast, gene-based methods look for discrepancies between gene distances and phylogenetic distances. Approaches for this include i) searching for similar genes between distantly related species, ii) calculating evolutionary rates for homologous genes and identifying those (potential xenologs) with different evolutionary rates, and iii) identifying strain-specific genes shared with other species but not within species [51]. Both compositional and gene-based methods are limited to detection of relatively recent LGT events, since transferred sequences may ameliorate, or become more similar to the host sequence, over time [54].
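A compositional scan of the kind described above can be sketched in a few lines of Python. The example below is a minimal illustration, not any published tool: it slides a window along a sequence, computes GC content per window, and flags windows whose GC content deviates strongly from the genome-wide mean. The window size, step, z-score cutoff, and the synthetic sequence are all arbitrary assumptions chosen for the demonstration.

```python
import statistics


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def atypical_gc_windows(seq, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc, z) for windows with |z| >= z_cutoff.

    z is the window's GC content standardized against all windows.
    """
    starts = range(0, max(len(seq) - window, 0) + 1, step)
    values = [(s, gc_fraction(seq[s:s + window])) for s in starts]
    fractions = [gc for _, gc in values]
    if len(fractions) < 2:
        return []
    mean = statistics.fmean(fractions)
    sd = statistics.pstdev(fractions)
    if sd == 0:
        return []
    return [(s, gc, (gc - mean) / sd)
            for s, gc in values if abs((gc - mean) / sd) >= z_cutoff]


# Synthetic example: an AT-rich backbone with a GC-rich 1 kb insert at 10 kb.
backbone = ("ATATGC" * 3334)[:20000]
genome = backbone[:10000] + "GC" * 500 + backbone[10000:]
hits = atypical_gc_windows(genome)
```

Flagged windows are only candidates: native genes with unusual composition can also be flagged, and, consistent with the amelioration caveat above, older transfers may no longer stand out.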

LGT in microbial communities

Estimates of LGT frequency were first calculated per taxon, and then per gene family. Compositional methods predicted that 11% [55] to 17% [54] of the Escherichia coli chromosome was acquired through LGT. Later studies compared LGT percentages between taxa: one study found that LGT ranged from 0% of protein-coding genes in Mycoplasma genitalium to 16.6% of protein-coding genes in Synechocystis PCC6803. This study further identified E. coli, Helicobacter pylori, and Archaeoglobus fulgidus as having large proportions of transferred genes associated with plasmid, phage, or transposon sequences [56]. Symbionts and parasites such as Wigglesworthia brevipalpis, Chlamydia, Mycoplasma, Rickettsia, and Borrelia burgdorferi were found to have lower proportions of laterally transferred coding sequences (CDS) [53, 57]. Estimates of LGT percentages across gene families have also been highly variable. Explicit phylogenetic methods have since estimated that anywhere from 2% [58] to 60% [59] of genes are affected by LGT [60].

Forces that drive LGT within communities may include phylogeny, geography, and ecology. Phylogeny is expected to play a strong role: closely related partners in a group will preferentially exchange genes, since they will have shared genomic structure, machinery, and phage host range [61]. One study inferred Bayesian phylogenetic trees for 5282 sets of proteins, and found that Escherichia coli and Shigella have higher rates of gene transfer within phylogenetic groups as compared to between phylogenetic groups [62]. Another study found that integrons in Vibrio cholerae are associated with geography [63]. Lastly, taxa with similar ecological needs may be found in close proximity, which fosters conjugation, cell fusion, or GTAs; in addition, increased LGT via plasmids has been observed in biofilms [64]. One study inferred LGT events between pairs of genomes if they shared 500 bp blocks with 99% similarity: they found that genome pairs from the same environment had the most LGT events, followed by genome pairs with small phylogenetic distances [65].
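The block-sharing criterion in the last study can be made concrete with a simplified Python sketch. The assumptions are substantial: the two sequences are taken to be already aligned position by position with no gaps, whereas a real analysis would first find homologous regions with an alignment tool; the function name is hypothetical, and only the 500 bp and 99% thresholds come from the text.

```python
def shared_blocks(seq_a, seq_b, block=500, min_identity=0.99):
    """Return start offsets of ungapped blocks meeting the identity cutoff.

    Assumes seq_a and seq_b are pre-aligned, so position i in one
    corresponds to position i in the other.
    """
    n = min(len(seq_a), len(seq_b))
    if n < block:
        return []
    matches = [1 if a == b else 0 for a, b in zip(seq_a[:n], seq_b[:n])]
    running = sum(matches[:block])  # matches in the current window
    hits = []
    for start in range(n - block + 1):
        if start > 0:
            # Slide the window by one base: add the new end, drop the old start.
            running += matches[start + block - 1] - matches[start - 1]
        if running / block >= min_identity:
            hits.append(start)
    return hits
```

A pair of genomes would then be scored as having a putative LGT event if any qualifying block is found; overlapping hits would normally be merged before counting events.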

The human microbiota is likely to have high frequencies of LGT. More LGT was found between human-associated microbial genomes, as compared to between human- and non-human-associated microbial genomes, with most transfers occurring in the oral and gut sites [65, 66]. Still, these studies focused on available reference genomes, which represent microbial snapshots in time. Future work utilizing metagenomic contigs and shotgun metagenomic reads across time may better capture de novo LGT events. For example, one study identified mobile gene pools in Fijian and North American microbiomes from single-cell genome sequencing, mapped shotgun metagenomic reads to the genes, and found that mobile gene abundances were associated with diet and Fijian villages [67]. With this, they determined that LGT frequencies are not only determined by microbial characteristics (i.e., phylogeny, geography, and ecology), but may also be driven by host lifestyle and geography.

Transferred functions and their associated costs

There are two leading hypotheses for the types of genes transferred through LGT. The first hypothesis assumes that genes can be divided into two classes: i) “informational” genes, which are utilized for replication, transcription, and translation, and ii) “operational” genes, which are used in metabolism [68]. This hypothesis predicts that the latter gene type is more likely to be transferred, since the former gene type is responsible for cell division, the most fundamental process for life. The second hypothesis is termed “the complexity hypothesis”, and states that genes integrated into large, complex systems (e.g., large signaling pathways) are less likely to be transferred than genes that are part of smaller pathways [69]. These two hypotheses are not mutually exclusive: indeed, some have found that “informational” genes are more likely to be part of complex systems [70]. Accordingly, predicted transfer functions have included “plasmid, phage, and transposon functions”, “cell surface structures”, “surface polysaccharides”, “DNA transformation”, “pathogenesis”, and “toxin production and resistance” [57]. Another study utilizing phylogenetic trees found that “energy metabolism” and “mobile and extrachromosomal element functions” were enriched in discordant phylogenetic trees, whereas “DNA metabolism,” “protein synthesis,” “protein fate,” and “regulatory functions” biosynthesis were depleted [71].

Transferred genes may not be retained even if they are beneficial, since they may incur high costs. Costs include disruption of neighboring genomic features via insertion, consumption of limited resources through transcription and translation, and disruption of interactions within the cellular network [72]. Furthermore, if transferred genes have different codon usage, they may lead to improper expression and/or protein mis-folding [73], which may cause cytotoxicity. Different microbial taxa may have a variety of mechanisms to handle such costs: for example, some taxa harbor H-NS proteins, which bind regions of high AT content and silence expression [74]. Also, most successfully transferred genes eventually ameliorate [54]. The former operates immediately, while the latter takes time, indicating that different mechanisms may operate on different timescales to facilitate and select for gene integration.

Evolutionary legacy of LGT

The significance of LGT on evolution is still being debated today. Scientists found that phylogenetic trees constructed from other “universal” genes such as heat shock protein HSP70 and glutamate dehydrogenase do not agree with the rRNA-based universal phylogenetic tree [54]. Furthermore, informational genes such as aminoacyl-tRNA synthetases (aaRSs), which attach amino acids to the corresponding tRNAs, have evidence of transfer [75]. These discrepancies have led to two hypotheses. The first is the “early massive horizontal transfer hypothesis”, in which LGT occurred early in prokaryotic evolution and created modern cells, after which vertical gene transfer became the dominant evolutionary force (as compared to LGT). The second is the “continual horizontal transfer process”, in which LGT has been a continuous force from early evolution that continues today [68, 70].

Woese has argued in favor of the “early massive horizontal transfer hypothesis” [76]. He argues that the rRNA gene represents cellular information processing systems such as replication, transcription, and translation, which are fundamental to cells and differ between bacteria, archaea, and eukaryotes. This implies that multiple progenitor cells, each with their own information processing systems, must have existed before the division of the three domains. These progenitor cells were not well-developed, which allowed for extensive LGT that may have eventually given rise to the efficient, modular cells seen today. In contrast, Lake has argued for the “continual horizontal transfer process” hypothesis [70]. To test both hypotheses, he classified genes as “informational” or “operational” [68], and then constructed phylogenetic trees for each gene type. He assumed that informational genes were not subject to transfer or were transferred infrequently (which is debatable [69]). If the phylogenetic trees for the two gene types were similar, it would indicate that most LGT had occurred before formation of the three domains, thus supporting the “early massive horizontal transfer hypothesis”. Instead, he found that phylogenetic trees for the two gene types were significantly different, which indicates that LGT is still an ongoing force today, thereby supporting the “continual horizontal transfer process”. Others have argued that i) the observed variation in nucleotide composition across whole genomes and ii) Occam’s Razor support the “continual horizontal transfer process” [60].

Surveying microbial communities in the built-environment

Another set of microbial interactions is between microbial communities and their environment. In 1934, the Dutch microbiologist Lourens G. M. Baas Becking articulated that “everything (microorganisms) is everywhere: but the environment selects” [77, 78]. This statement put forth a hypothesis that has shaped current microbial ecology: microbial distributions were believed to be primarily shaped by dispersal and environment, as opposed to earth history and geography [79]. This hypothesis is demonstrated in the human microbiome, in which microbial communities and their associated functions are often site-specific [80, 81]. In contrast, the built-environment seems to be primarily shaped by dispersal, especially from human-associated communities [82]. To better understand how microbial communities outside the human body affect human health, researchers must first understand these dispersal patterns and then determine how these microorganisms interact with their new environment.

Microbial composition of the built-environment

The built-environment is the ecological habitat of humans, consisting of the physical parts of where we live and work (such as homes, offices, streets) [83]. Humans spend most of their time in the built-environment: one study showed that Americans (across states) spend ~87% of their time indoors and ~6% of their time in an enclosed vehicle (consistently over the past few decades) [84]. As of 2015, buildings were estimated to cover 1.3% to 6% of global ice-free land [85, 86], and are expanding rapidly [87, 88]. Although building temperatures and humidity vary across the world, each unit is enclosed and consistently maintains these variables throughout the day and across seasons [88]. Buildings may also contain a variety of materials and chemicals not found in the natural environment [89]. Accordingly, it is important to identify i) which microbes are in the built-environment, and ii) how they adapt to these environments. Furthermore, distinguishing how microbes, microbial compounds, and man-made chemicals affect human health can result in actionable changes in hygiene and building construction.

Currently, most studies have focused on surfaces in buildings such as homes, restrooms, hospitals, and classrooms. These studies have shown that the majority of microbes in the built-environment are derived from human skin, with some influence from human interaction and the surrounding environment [90]. This is unsurprising, given that humans shed between 2 x 10^8 and 10 x 10^8 skin cells/day [91]. Colonization and de-colonization happen rapidly: the Home Microbiome Study monitored seven families in their homes for six weeks, in which three families had samples taken pre- and post-move into their new homes. For these three families, the differences in microbial community structure between their previous and new homes were insignificant, indicating quick colonization of the new home. Researchers also quantified how much each individual contributed to the microbial signal of the house, and found that an absence of three days led to a smaller contribution [92], indicating quick de-colonization. The effect of human interaction can be observed via microbial community patterns on different surfaces and room types. For example, a study of public restrooms showed that the microbial communities of bathroom floors were likely derived from soil taxa, while communities on toilet seats, handles, and the inside of the stall were derived from gut bacteria and urine [93]. Lastly, the surrounding environment may introduce new members to built-environment communities: one study found that phylogenetic diversity was correlated with ventilation air, airflow rates, humidity, and temperature [94].

These findings indicate that the human microbiome is rarely colonized or altered by built-environment microbial communities. Instead, a person may be primarily exposed to his/her own microbiome, which could then self-perpetuate or perpetuate to other occupants within the building, either to their benefit or detriment [95]. One example is the effect of pets on their owners: some studies found that infants in homes with dog or cat exposure have decreased risk of atopy [96], though other studies identified pets as sources of endotoxins [97, 98]. More work is needed to determine what constitutes a healthy indoor microbiome [3], especially since adverse health effects have been tied to microbial and non-microbial sources. Microbial threats include single pathogens such as Legionella, which may be transferred through water systems and inhalation (if aerosolized), as well as microbial components such as endotoxin, which has been paradoxically linked to promotion of and protection against asthma [99]. Non-microbial threats include damp indoor environments, which have been associated with respiratory diseases, and may further be linked to growth of mold and fungal species [100].

Applications for the built-environment

Since built-environment microbial communities are largely derived from human skin, they may also resemble their occupants, giving rise to forensic applications. The Home Microbiome Study could predict which family belonged to which home using microbial community profiles [92]. Many built-environment studies have also found that occupants of the same space have significantly more similar microbiomes. For example, families not only share microbes with one another, but also with their dogs [101]. Cohabiting couples could be matched based on their skin microbiome samples ~86% of the time [102]. Lastly, one study collected shoe and phone samples from individuals at three different conferences: random forest models could predict which conference each sample was taken from, and distinguish between two individuals’ shoe samples at a single conference [103]. These studies indicate that individuals may be linked to highly-trafficked buildings, as well as to colleagues within the same space [104].

Another potential application is improved protocols for hygiene, especially with the development of the hygiene hypothesis. The hygiene hypothesis was conceived as early as 1989: David Strachan found that high prevalence of hay fever (at ages 23 and 11) and eczema (in the first year of life) was linked to smaller family sizes. He hypothesized that fewer infections early in life (due to lack of disease transmission in smaller families) lead to greater numbers of infection later in life [105]. His hypothesis was replaced by the “Old Friends” hypothesis in 2004: Rook et al. stated that the increase in disease types (such as allergy) in developed parts of the world was due to lack of exposure to “old friends”, which are defined as microbes that co-evolved with humans. These “old friends” facilitate regulatory T cell development, thereby preventing inappropriate immune responses [106]. The “Old Friends” hypothesis has led to the general consensus that increased microbial diversity is favorable, though others argue it is simply a community property [107]. With this, hygiene should be redefined as an effort to select for beneficial bacteria, rather than an attempt at complete sterilization [108, 109]. Suggested interventions include building with materials that select for specific microbes, as well as increasing building ventilation and outdoor green space to boost microbial diversity [3, 110].

Technical considerations for sampling the built-environment

The majority of built-environment samples have been sequenced using 16S rRNA sequencing due to low biomass, which makes them particularly susceptible to batch effects and contamination from sequencing kits and reagents. The former was demonstrated in a study that monitored office buildings in Flagstaff, AZ; San Diego, CA; and Toronto, ON for one year. Samples were grouped and sequenced by season, with eight technical replicates included in each sequencing run. Unfortunately, sequencing run was conflated with seasonality: even technical replicates varied widely across runs. Researchers attempted to eliminate the batch effect by removing highly variable, low-abundance taxa, which worked poorly [111]. Other studies have been affected by contaminants found in sequencing reagents, extraction kits, and PCR reagents [112-114]. For example, one study identified age as the driver of observed trends in the nasopharyngeal microbiome among children in a refugee camp, but another study showed that the driver was kit contaminants [115]. The use of technical replicates, extraction and negative controls, and microbial spike-ins have all been proposed to address the problem of batch effects and contaminants. These technical challenges may be further complicated by the rise of citizen microbiology projects, in which aseptic technique, sample collection logistics, and privacy concerns must be considered [116].

Unfortunately, built-environment studies that rely on DNA sequencing for community profiling cannot distinguish between DNA in live cells and extracellular DNA, which can survive on surfaces for weeks to years [117]. Currently, it is unclear whether most microbes on these surfaces are active, dormant, or dead. Some have described the built-environment as a microbial wasteland, where most microbes are likely dormant or dead [82, 111]. One study found that 40% of prokaryotic and fungal DNA in soil was extracellular or from cells that were not intact [118]. Several methods have been developed that may assist in assessing viability, which primarily function by examining membrane integrity, measuring transcriptional or translational activity, or measuring cellular respiration (through ATP). Still, the majority of these methods are for bacteria, and may not work on viruses or spores [119].

The role of DNA sequencing for microbial profiling

High-throughput DNA sequencing has proven invaluable for investigating diverse environmental and host-associated microbial communities. Sequence-based taxonomic profiling of a microbiome can be carried out using either amplicon (typically the 16S rRNA gene) or whole metagenome shotgun (WMS) sequencing (reviewed in [120-122]). The resulting DNA sequence data are then used to assess the community in at least two ways: taxonomic profiling, which answers, “who is present in the community?” and functional profiling, which answers, “what could they be doing?” Still, there are several limitations to DNA-based approaches. First, the most common taxonomic profilers provide at best species-level taxonomic resolution, whereas many important phenomena occur at the strain level. Second, DNA sequencing cannot directly measure the functional activity of a community under a given set of conditions. While the former has been addressed through sequencing and bioinformatics techniques, the latter may require multi’omic data sets, which include community RNA (transcriptomics), protein (proteomics), and metabolite abundances (metabolomics), preferably in an integrated framework.

Amplicon Sequencing

One common method for profiling a microbial community involves sequencing specific microbial amplicons (predominantly the bacterial 16S rRNA gene). Although amplicon-based sequencing considers only one or a few microbial genes, it may be used for taxonomic, phylogenetic, and even functional profiling. It may also be used to profile low biomass samples, as compared to WMS sequencing. To identify which taxa are present, amplicon sequences are either directly binned to reference taxa [123, 124] by classification or phylogenetic placement, or more commonly they are first clustered into operational taxonomic units (OTUs) sharing a fixed level of sequence identity (often 97%) [125, 126], and then binned as a whole (often by classification of a reference sequence). Functional profiles can be approximated for marker-based samples by associating 16S rRNA or marker genes with annotated reference genomes, aggregating coding sequences (from the reference genomes) into gene families, and then inferring gene family abundances through taxonomic abundances [127].
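As a rough illustration of the OTU clustering step described above, the following Python sketch assigns sequences greedily to cluster representatives at a 97% identity threshold. The function names (`identity`, `greedy_otus`) and the ungapped, position-by-position identity measure are simplifying assumptions made for this example only; published OTU pickers align sequences first and usually process them in order of abundance.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure assumes pre-aligned, equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    centroids = []    # one representative sequence per OTU
    assignments = []  # OTU index for each input sequence, in input order
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                assignments.append(i)
                break
        else:
            # No existing centroid is close enough, so s founds a new OTU.
            centroids.append(s)
            assignments.append(len(centroids) - 1)
    return centroids, assignments
```

With a 100 bp toy amplicon, a copy carrying one mismatch (99% identity) joins the original OTU, whereas a copy carrying four mismatches (96% identity) founds a new one, which mirrors how the fixed cutoff decides membership.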

Unfortunately, the singular use of the 16S rRNA marker gene has several problems. First, some species have multiple copies of the 16S rRNA gene, which in turn have different sequences [125]. Second, the 16S rRNA gene has difficulty resolving species due to its slow evolution: strains with less than 97% 16S rRNA sequence identity are likely to be different species, but strains with more than 97% 16S rRNA sequence identity are not necessarily the same species [49, 128]. The use of the 97% cutoff is also somewhat arbitrary and based on concordance with DNA-DNA hybridizations [129]. In order to improve taxonomic resolution from 16S rRNA sequencing, two techniques have been developed. One recent technique, termed “oligotyping”, uses a sequence entropy-based approach to identify maximally informative sites within the 16S rRNA gene to improve OTU resolution [130]. Oligotyping is advantageous for distinguishing closely related taxa (such as those that differ by a single 16S rRNA nucleotide) and has been applied to study subspecies-level population structure in the vaginal microbiome [131] and to link sewage samples to specific fecal pollution sources [132]. In addition, a new, low-error approach to 16S rRNA gene sequencing, termed LEA-Seq, has been proposed and used to profile stable carriage of host-specific strains in the human gut microbiome [133].
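The entropy-based idea behind oligotyping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reads are already aligned to equal length, and the helper names (`column_entropy`, `informative_positions`, `oligotypes`) are hypothetical and do not correspond to the published oligotyping software.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the nucleotides observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def informative_positions(aligned_reads, top_n=1):
    """Return the indices of the top_n highest-entropy columns, in position order."""
    entropies = [column_entropy(col) for col in zip(*aligned_reads)]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:top_n])


def oligotypes(aligned_reads, positions):
    """Label each read by its nucleotides at the chosen positions."""
    return ["".join(read[i] for i in positions) for read in aligned_reads]
```

Columns where all reads agree have zero entropy and are ignored, so reads that differ at a single variable site are separated by that site alone rather than being merged by an overall identity cutoff.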

WMS Sequencing

WMS sequencing involves sequencing “random” DNA fragments from microbial communities. Taxonomic profiling of metagenomes instead uses some or all shotgun reads to determine membership in a community. This can be done in a number of ways, including metagenomic assembly followed by phylogenetic binning or placement of contigs [134]. More commonly, short reads are profiled directly by comparison to a reference catalogue of microbial genes or genomes. Alternatively, reads can be mapped to a (pre-computed) catalog of clade-specific marker sequences (with [135] or without [136] pre-clustering). Finally, reads may be assigned to species based on agreement with models of genome composition [137] or by exact k-mer matching [138], thus enabling placement of reads or assembled contigs when corresponding reference genomes are not available (which is common for poorly characterized communities).
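To make the exact k-mer matching strategy concrete, the Python sketch below builds a small k-mer index from reference sequences and assigns a read to the taxon supported by the most taxon-specific k-mers. The names (`kmers`, `build_index`, `classify`) and the simple majority vote are assumptions chosen for illustration; they are not taken from any particular profiler cited above.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = defaultdict(set)
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon supported by the most taxon-specific k-mers, or None."""
    votes = Counter()
    for km in kmers(read, k):
        taxa = index.get(km, ())
        if len(taxa) == 1:  # ignore k-mers shared between taxa
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Reads whose k-mers are all absent from the index are left unclassified, which is the expected outcome for organisms without a close reference.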

WMS sequencing is the preferred method for strain-level profiling due to its ability to identify variation throughout microbial genomes. Strains may differ in sequence through loss or gain of genomic regions or through single nucleotide polymorphisms (SNPs), both of which can be identified by mapping shotgun reads to reference genomes. For example, mapping WMS reads from tongue samples to genomes of Streptococcus mitis highlighted the presence and absence of genomic islands in isolates of that species from individuals enrolled in the Human Microbiome Project (HMP) [2]. Genomic islands were shown to contain multiple, functionally coherent genes (e.g., subunits of the V-type H+ ATPase) that were gained and lost together, suggesting a mechanism for individual- and body site-specific functional specialization. Detection of SNP differences requires greater sequencing depth. Existing WMS data from human stool samples have been used to identify reference genomes with high sequencing coverage, which were then scanned for SNPs [135]. This analysis revealed that subject-specific SNP variation tended to remain stable for up to a year and was comparatively more conserved than overall species abundance.

Functional profiling of metagenomic samples typically begins by associating new sequence data with known gene families. This can be accomplished by directly mapping DNA or RNA reads to databases of gene sequences that have been clustered at the family level; such databases include KEGG Orthology [139], COG [140], NOG [141], Pfam [142], and UniRef [143]. Naturally, the number of reads that can be mapped in this manner depends on the completeness of the underlying reference database. Alternatively, reads can be assembled into contigs to determine putative protein-coding sequences (CDSs), and then the CDSs are assigned to gene families following the same or similar methods used for annotating isolate microbial genomes. Both strategies yield profiles of the presence and absence of a gene family as well as the relative abundance of each family within a sample. Functional profiles at the gene family level may contain many thousands of features. Downstream analyses can be made more tractable by further performing per-organism or whole-community pathway reconstruction based on these genes. Although not specifically designed for microbial community analysis, species-specific pathway databases such as KEGG [139], MetaCyc [144], and SEED [145] can be useful for this purpose. Integrated bioinformatics pipelines such as IMG/M [146], MG-RAST [145], MetaPathways [147], and HUMAnN [148] have been developed to streamline the conversion of raw meta’omic sequencing data into more easily-interpreted profiles of microbial community function.
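The read-mapping route to a gene family profile can be summarized with a small Python sketch. It assumes each mapped read has already been reduced to the identifier of the gene it hit, and it normalizes by gene length before converting to relative abundance. The function name `family_relative_abundance` and the specific normalization are illustrative assumptions, not the procedure of any pipeline named above.

```python
from collections import defaultdict


def family_relative_abundance(read_hits, gene_to_family, gene_lengths):
    """Length-normalized relative abundance of each gene family.

    read_hits: one gene identifier per mapped read.
    gene_to_family: gene identifier -> gene family label.
    gene_lengths: gene identifier -> gene length in bp.
    """
    if not read_hits:
        return {}
    coverage = defaultdict(float)
    for gene in read_hits:
        # Longer genes recruit more reads, so each read counts as 1 / length.
        coverage[gene_to_family[gene]] += 1.0 / gene_lengths[gene]
    total = sum(coverage.values())
    return {family: value / total for family, value in coverage.items()}
```

In this toy scheme, two reads on a 1000 bp gene and one read on a 500 bp gene yield equal abundances for their two families, because both genes are covered to the same depth.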

Contig Assembly

Deeper WMS sequencing can facilitate the de novo assembly of contigs and even microbial genomes. Assemblies are generated by connecting overlapping sequencing reads to form longer sequences, which may be represented as an assembly graph in which nodes embody sequence information (such as k-mers) and edges connect adjacent or overlapping sequences. Metagenomic samples come with special challenges that may lead to errors in the assembly graph. These samples contain multiple taxa with differential abundances, leading to uneven coverage and the presence of conserved sequences across taxa, and making it difficult to determine where edges should be drawn. Several tools have been built to address these problems: MetaVelvet-SL generates a single assembly graph, and then uses k-mer coverage to identify sub-graphs that are assumed to be single species; in contrast, both MetaSPAdes and IDBA-UD use multiple k-mer sizes to iteratively improve the assembly graph [149]. General challenges in assembly arise from technical variables, such as sequencing errors, chimeric reads, and read lengths that are shorter than genomic repeats [150], as well as the size of the dataset, which increases computational intensity. Lastly, there is no gold standard to determine if a given assembly is correct. As a result, earlier metagenomic studies that utilized assembly limited analyses to cataloguing genes and functions [151, 152], though one study went further and identified plasmids and scaffold synteny in samples collected from the Sargasso Sea [153].
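The assembly graph described above can be illustrated with a minimal de Bruijn graph in Python. This sketch uses one common formulation in which nodes are (k-1)-mers and each k-mer contributes one edge; the function names (`de_bruijn`, `branching_nodes`) are hypothetical, and real assemblers additionally track coverage, reverse complements, and sequencing errors.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where an assembler must choose a path."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)
```

When two reads from different taxa share a conserved stretch, their paths merge at the shared nodes and then diverge, producing a branching node. This is a small-scale version of the ambiguity that conserved sequences introduce into metagenomic assembly.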

Assembly is crucial to studying microbial communities, and may be used to identify novel sequence elements, generate reference genomes from microorganisms that are uncultivated or poorly represented in reference databases, and characterize the synteny of microbial genes. Improvements in metagenomic contig assembly have led to the recovery of whole microbial genomes from communities [92, 154-156], which was previously only possible in low-complexity communities [157]. One study was able to assemble 31 bacterial genomes after binning assemblies by differential read coverage [158]. Increasing the number of reference genomes across the tree of life may help with discovery of novel gene functions and pathways [159]. Furthermore, assemblies can reveal novel genomic rearrangements and LGT events not in previous reference genomes. For example, one study found that the genomic architecture of mobile genes in human gut samples was specific to individuals, even though individual mobile genes were found universally across U.S. and Fijian cohorts [67].

Summary

The advent of DNA sequencing has made it quicker and easier to profile microbial communities, while further development of tools for analyzing and interpreting sequencing data may reveal community trends and interactions between individual microbes. In Chapter 2, we describe the tool WAAFLE, a workflow that annotates assemblies and finds LGT events from assembled metagenomic contigs. We then apply WAAFLE to the Human Microbiome Project, and find that properties such as phylogenetic relatedness and abundance affect LGT frequency, and that transferred functions are enriched for mobile elements and outer membrane receptors. In Chapter 3, we survey the microbial communities on the Boston subway using 16S rRNA and WMS sequencing. We observe that the microbial community mostly comprises skin microbes, and that overall pathogenic potential is low.

Chapter 2:

Lateral Gene Transfer in the Human Microbiome

Attributions

The contributors to this work include Tiffany Hsu, Eric A. Franzosa, Chengwei Luo, Dennis Wong, Morgan Langille, Robert G. Beiko, and Curtis Huttenhower, in no particular order. T.H. and E.A.F. developed the software implementation, evaluated the method, and applied the tool to the Human Microbiome Project. All authors helped design the method and interpret the data. T.H. wrote the text with feedback from E.A.F., M.L., R.G.B., and C.H.

Introduction

Lateral gene transfer (LGT) is the movement of genetic material between organisms without sexual or asexual reproduction [160]. Its role in microbial communities is not well understood, due to the difficulty in identifying LGT events. First, evolutionarily significant events are difficult to ascertain. These events include ancient LGTs, which have likely ameliorated to the host genome, as well as LGT of homologs, which are conserved across species and difficult to distinguish from orthologs (homologs that arose through speciation) and paralogs (homologs that were duplicated and have a separate evolutionary trajectory) [53]. Second, transient events, in which LGT occurs but the organism does not accept or maintain the transferred sequence, are difficult to measure [161]. Still, LGT has proven to be an important evolutionary force [15], especially with the rise of antibiotic resistance in human-associated microbial communities. LGT events may change the fitness of individual microbes, which may in turn affect microbial community composition and function. These events may eventually give rise to new species, impacting both evolutionary history and phylogeny [70, 76].

Several studies have characterized the quantity of and forces shaping LGT in human-associated microbial communities. For human-associated microbial genomes, most transfers occur in the oral and gut sites [65, 66]. LGT may be shaped by host factors, such as lifestyle and geography, as well as microbial traits, such as phylogeny and ecology. One study found that cultural practices affected LGT rates: mobile gene pool abundances in Fijian and North American microbiomes were associated with diet and Fijian villages [67]. Another study found increased LGT between human-associated isolates as compared to between human-associated and non-human-associated isolates. Isolates with shorter phylogenetic distances and from similar sources (between human and/or non-human) had increased transfer, though the latter had the stronger effect [65]. Accordingly, some have proposed that LGT is a mechanism used between niche-sharing microbes to adapt to changing conditions [162], while others have suggested it as a mechanism to enforce cooperation or competition [163, 164]. This is further supported by the observation that transferred genes are enriched for functions in cell surface, DNA-binding, and pathogenicity, which may be necessary for survival in different environments [57].

Microbial community sequencing has generated 16S rRNA and metagenomic shotgun datasets, yet most software tools available for LGT detection are designed for whole and/or draft genomes [165, 166]. Methodologies for detecting LGT fall roughly into three categories: composition-based, alignment-based, and phylogeny-based approaches. Composition-based methods assume that laterally transferred genes will have distinct nucleotide compositions as compared to the host genome: software such as Alien_Hunter [167] uses interpolated variable order motifs (IVOMs) to find genomic regions with significant shifts in composition. Alignment-based methods look for discrepancies between gene distances and phylogenetic distances: for example, Darkhorse [168] aligns protein sequences (from a single genome) to a reference database and infers LGT using bitscore and phylogeny. In contrast, IslandPick uses genome alignments and comparative genomics to identify LGT in closely related genomes [169]. Phylogeny-based implementations such as rSPR [170], PhylTr [171], and MaxTiC [172] search for incongruence between gene trees and species trees. Only the software Daisy [173] utilizes shotgun reads, but it still requires prior knowledge of donor and recipient genomes.
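The intuition behind composition-based detection can be shown with a deliberately simple Python sketch that flags windows whose GC content departs from the genome-wide mean. This is a stand-in chosen for brevity and is not the IVOM approach used by Alien_Hunter; the function names and the `max_deviation` cutoff are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq."""
    return sum(1 for base in seq if base in "GC") / len(seq)


def atypical_windows(genome, window, step, max_deviation):
    """Return (start, gc) for windows whose GC content differs from the
    genome-wide mean by more than max_deviation."""
    genome_gc = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - genome_gc) > max_deviation:
            flagged.append((start, gc))
    return flagged
```

A GC-rich insert inside an AT-rich genome stands out immediately under this scheme, while an insert from a donor with similar composition would be missed, which is one reason the alignment- and phylogeny-based approaches described above are also used.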

Here, we present WAAFLE, a Workflow to Annotate Assemblies and Find LGT Events, which uses alignment-based methods to detect LGT events in contigs assembled from metagenomic shotgun sequencing sets. A tool that can utilize shotgun sequencing data has several advantages. First, we can potentially find new LGT events that are not yet reflected in reference genomes. Second, since each metagenomic sample represents a snapshot in time, users will have the ability to compare LGT rates between individuals, conditions, and across time. Third, although WAAFLE is limited to fairly recent events, the use of reference databases allows us to identify gene functions and perform taxonomic assignment with higher accuracy, especially in human-associated datasets. In this study, we apply WAAFLE to the Human Microbiome Project Phase 1-II (HMP 1-II) [174] assembled contigs. We quantify LGT frequencies for taxon pairs at the level across six major body sites, which specifically represents the number of unique, novel, and fixed LGT events per sample (which represents a single body site in an individual). We then i) determine how abundance and phylogeny influence LGT frequencies, ii) characterize taxon pair formation and partner preference, and iii) identify functions enriched in LGT contigs.

Results

Identifying recent LGT events from metagenomic shotgun sequencing

In order to detect LGT events from metagenomic shotgun sequencing, we developed WAAFLE, a Workflow to Annotate Assemblies and Find LGT Events (Fig. 2-1A). WAAFLE has one required input, i) assembled metagenomic contigs in FASTA format, and two optional inputs, ii) gene calls for each contig and iii) a nucleotide reference database of genes with taxonomic and functional annotations (down to the species level, and for UniRef50 and UniRef90 terms, respectively). A default reference database of pangenomes, MetaRef [175], is provided. WAAFLE conducts a four-step process to output a single file in which each contig is classified as containing LGT or not, with each gene annotated with a taxon and function. First, contigs are searched against the nucleotide reference database via BLASTN. Second, contigs are annotated with genes, either by connecting overlapping BLAST alignments or using supplied gene calls. Third, contig genes are assigned UniRef50/90 annotations and taxon scores; the latter represent how well a given taxon characterizes a gene. To do this, we bin BLAST hits by gene, and then group BLAST hits within bins by taxonomic annotation. From the BLAST hit bins, we assign the most common UniRef50/90 term to each gene. From the BLAST hit groups, we calculate a single score per taxon per gene using the percent identity and subject coverage. Fourth, contigs are classified as having LGT or not. Using the taxon scores, we determine whether genes across a contig are best explained by two taxa or one (Fig. 2-1A).
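
As a rough illustration of the second step, the sketch below connects overlapping alignment intervals on a single contig into candidate gene regions. It is only one plausible reading of "connecting overlapping BLAST alignments": the function name `merge_hits` and the half-open coordinate convention are assumptions made for this example, and the subject coverage, overlap, and gene length filters described later are not modeled.

```python
def merge_hits(hits):
    """Merge overlapping (start, end) intervals on one contig.

    `hits` is a list of (start, end) tuples with end exclusive
    (a hypothetical representation of BLAST alignment coordinates).
    Returns the merged intervals sorted by start position.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new candidate gene region.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_hits([(0, 100), (50, 150), (200, 300)]))
```

Intervals that merely touch are kept separate here; whether adjacent alignments should be joined is a design choice that the text does not specify.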

Figure 2-1. WAAFLE pipeline overview. A) Within microbial populations, genes can be transferred vertically or laterally, which may confer adaptive traits to individual microbes and affect the community composition and function. B) To understand the impact of LGT, we built WAAFLE, which identifies LGT events within metagenomic contigs using a four-step process. First, WAAFLE searches contigs against a reference species pangenome database, which is generated by downloading NCBI isolate genomes, binning isolate genes by species, and then clustering binned species genes at 97% nucleotide identity. Second, WAAFLE calls genes (if not supplied) by connecting overlapping BLAST hits. Third, WAAFLE assigns each gene a function and taxon scores. To do this, alignments are first binned by genes: the most common UniRef50/90 annotation across hits per gene is assigned as the gene function. Binned alignments are then further grouped by taxa, and taxon scores are calculated using percent identity and subject coverage. Fourth, we classify the contig as having LGT or not. If a single taxon has taxon scores above k1 (blue threshold) across all contig genes, the contig is predicted to not have LGT. Otherwise, if two taxa have taxon scores above k2 (red threshold) across all contig genes, the contig is predicted to have LGT. C) To evaluate WAAFLE and its parameters, we generated synthetic contigs by selecting random donor and recipient genomes at different taxonomic levels. We chose a three-gene region from the recipient genome, and replaced the center gene with a gene from the donor genome. We then truncated the newly formed contig at both ends.

How accurately WAAFLE detects LGT depends on both the contig assembly quality and the WAAFLE parameter settings. False positive LGT calls may arise from contig misassemblies, which we hypothesized would have steep drops in read coverage. To identify misassemblies, we mapped shotgun reads to metagenomic contigs and examined gene junctions, the regions between two contig genes. Contigs that i) had low read coverage at gene junctions relative to flanking genes and ii) lacked paired or single read support for the junction were removed from analysis, regardless of LGT status (Fig. I-1, Fig. I-2). WAAFLE may also call different amounts of LGT depending on its five parameters: subject coverage (s), overlap percentage (o), gene length (g), one-taxon threshold (k1), and two-taxon threshold (k2) (Table I-1). The first three parameters are used to minimize false positive gene calls (step 2), which may lead to increased LGT calls. The last two parameters are employed in LGT classification (step 4). Specifically, WAAFLE identifies a contig as not having LGT if one taxon has taxon scores greater than k1 across all genes. If no single taxon scores above k1, WAAFLE searches for two taxa that collectively have taxon scores greater than k2 across all genes. If the contig contains such a pair, it is classified as LGT; otherwise, it is classified as “ambiguous”. Lowering the k1 threshold or raising the k2 threshold makes it more difficult to call LGT.
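
The classification rule described above can be sketched as follows. This is a minimal illustration rather than WAAFLE's actual implementation: the function name `classify_contig` and the input layout are hypothetical, and "collectively" is interpreted here as meaning that, for every gene, at least one taxon of the pair scores above k2. The default thresholds are the values chosen later in this chapter (k1 = 0.5, k2 = 0.8).

```python
from itertools import combinations


def classify_contig(gene_scores, k1=0.5, k2=0.8):
    """Classify one contig as 'no_lgt', 'lgt', or 'ambiguous'.

    `gene_scores` is a list with one dict per gene, mapping
    taxon name -> taxon score in [0, 1] (hypothetical layout).
    A taxon missing from a gene's dict is treated as scoring 0.
    """
    taxa = sorted({taxon for scores in gene_scores for taxon in scores})

    # One-taxon test: a single taxon scores above k1 on every gene.
    for taxon in taxa:
        if all(scores.get(taxon, 0.0) > k1 for scores in gene_scores):
            return "no_lgt"

    # Two-taxon test (assumed reading): for every gene, the better of
    # the two taxa in the pair scores above k2.
    for a, b in combinations(taxa, 2):
        if all(max(scores.get(a, 0.0), scores.get(b, 0.0)) > k2
               for scores in gene_scores):
            return "lgt"

    return "ambiguous"


if __name__ == "__main__":
    contig = [{"A": 0.9, "B": 0.2}, {"A": 0.1, "B": 0.95}, {"A": 0.9, "B": 0.3}]
    print(classify_contig(contig))
```

Because the one-taxon test runs first, lowering k1 lets more contigs exit as "no_lgt" before the pair search is reached, which matches the statement that a lower k1 makes LGT harder to call.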

WAAFLE performance on synthetic data

To set the default WAAFLE parameters, we generated a synthetic dataset from the NCBI isolate genomes. This dataset consisted of 1000 contigs spanning 8 taxonomic levels, with 25 donor-recipient pairs at each level. Each contig was created by i) selecting a donor-recipient pair with some taxonomic level difference, ii) choosing a recipient genome fragment containing three genes, iii) replacing the center gene (of the fragment) with a random gene from the donor taxon, and iv) truncating the contig ends (Fig. 2-1C). It should be noted that the NCBI isolate genomes used to generate the synthetic dataset are the same genomes used to create WAAFLE’s species pangenome database. Consequently, the species pangenome database contains all the species and genes present in the synthetic contigs. In reality, the reference database will be missing species and genes potentially present in biological data. To simulate missing information in the pangenome database, we removed 20% of the BLAST alignments (to the synthetic contigs) generated from the first step in WAAFLE.
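
The alignment-removal step can be illustrated with a short sketch. It assumes that the 20% of alignments were removed uniformly at random, which the text does not state explicitly; the function name `drop_alignments` and the fixed seed are likewise choices made only for this example.

```python
import random


def drop_alignments(hits, fraction=0.2, seed=0):
    """Return `hits` with roughly `fraction` of entries removed at random.

    `hits` is any list of alignment records (for example, parsed BLAST
    tabular rows). The original order of the surviving records is kept.
    """
    rng = random.Random(seed)  # Fixed seed so the subsample is reproducible.
    n_keep = len(hits) - round(len(hits) * fraction)
    keep_indices = sorted(rng.sample(range(len(hits)), n_keep))
    return [hits[i] for i in keep_indices]


if __name__ == "__main__":
    remaining = drop_alignments(list(range(100)))
    print(len(remaining))
```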

We first evaluated WAAFLE’s ability to call genes by varying three parameters, subject coverage, overlap, and gene length, during step 2 of the WAAFLE pipeline. We compared each set of WAAFLE gene calls to the NCBI gene annotations. True positives were defined as the number of NCBI genes with corresponding WAAFLE genes, while false positives were defined as the number of single NCBI genes with multiple corresponding WAAFLE genes plus the number of WAAFLE genes with no corresponding NCBI genes. The true positive rate (TPR) ranged from 0.691 to 0.841 while the positive predictive value (PPV) ranged from 0.955 to 0.994 (Fig. I-3). Overall, we found that lower overlap, increased subject coverage, and increased gene length increased the PPV, with subject coverage and gene length having the greatest effect. Since increasing the number of genes increases the potential of calling LGT, we conservatively set gene calling parameters at 0.75 for subject coverage, 0.1 for overlap, and base pairs (bp) for minimum gene length.

To evaluate LGT classification, we supplied WAAFLE with the WAAFLE-called genes generated from the default parameters (mentioned above) and filtered for contigs containing at least two genes. We then varied the one-taxon and two-taxon thresholds for step 4 of the WAAFLE pipeline. Synthetic contigs with LGT events at the inter-species level or above were considered true positives when WAAFLE called LGT, and false negatives otherwise. Synthetic contigs with inter-strain LGT events were considered true negatives if WAAFLE classified the contigs as having no LGT, and false positives otherwise. The TPR ranged from 0.513 to 1 and the false positive rate (FPR) ranged from 0 to 0.111, where most false positives arose as a consequence of BLAST hit removal (Fig. I-4). Higher one-taxon thresholds increased both TPR and FPR, while higher two-taxon thresholds decreased both TPR and FPR (Fig. 2-2A, Fig. I-5). As the one-taxon threshold increases, it becomes difficult to classify contigs as not having LGT, which leads to more LGT calls and increases the number of true and false positives. In contrast, increases in the two-taxon threshold make it difficult to classify contigs as having LGT, resulting in fewer true and false positives. As such, we decided to set the one-taxon threshold at 0.5 and the two-taxon threshold at 0.8. To evaluate organism calls, we examined the subset of correctly called LGT contigs (true positives), and identified the taxonomic levels (kingdom through species) at which WAAFLE correctly matches the reference taxa. WAAFLE often correctly annotated taxa down to the family level, but did not always identify the correct genus or species (Fig. 2-2B).
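
The evaluation scheme above can be expressed as a small scoring function. The sketch follows the definitions in the text literally: inter-strain contigs are the negatives, everything at the inter-species level or above is a positive, and any call other than the expected one (including "ambiguous") counts against the classifier. The function name `evaluate_lgt_calls` and the label strings are hypothetical.

```python
def evaluate_lgt_calls(records):
    """Compute (TPR, FPR) from synthetic contig results.

    `records` is an iterable of (level, call) pairs, where `level` is the
    taxonomic level separating donor and recipient (e.g. 'strain',
    'species', 'genus') and `call` is 'lgt', 'no_lgt', or 'ambiguous'.
    """
    tp = fn = tn = fp = 0
    for level, call in records:
        if level == "strain":
            # Inter-strain events are treated as negatives.
            if call == "no_lgt":
                tn += 1
            else:
                fp += 1
        else:
            # Inter-species or above events are treated as positives.
            if call == "lgt":
                tp += 1
            else:
                fn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


if __name__ == "__main__":
    demo = [("species", "lgt"), ("genus", "no_lgt"),
            ("strain", "no_lgt"), ("strain", "lgt")]
    print(evaluate_lgt_calls(demo))
```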

Figure 2-2. WAAFLE parameter evaluation. Using the WAAFLE-called genes (using the default gene calling parameters), we examined how the one-taxon (k1) and two-taxon (k2) thresholds would affect A) LGT classification and B) taxonomic assignment. For the left half of the figure, we set k2 at 0.8 while varying k1 from 0.1 through 0.9. For the right half of the figure, we set k1 at 0.5 while varying k2 from 0.1 through 0.9. A) Colors indicate k2, and the x-axis indicates the taxonomic level difference between the donor and recipient genomes. For example, we observe lower TPR and FPR for inter-species LGT. B) Colors indicate k1, and the taxonomic level at which WAAFLE correctly identified an organism. For example, lower percentages are observed for correctly calling a taxon at the species level.

Rates of novel LGT events across the human microbiome

We used WAAFLE to interrogate the expanded Human Microbiome Project (HMP1-II) [174], a dataset that includes 2,341 shotgun metagenomes sampled from 265 individuals at diverse body sites at up to three time points (http://hmpdacc.org). For quality control, we removed samples with poor assembly (fewer than 1,000 gene calls across contigs) and inconsistent taxonomic profiles (those that appeared as outliers in ordination analyses), and then filtered out contigs that resembled mis-assemblies (described earlier, see Methods). We first set out to develop a measure for LGT frequency, which would allow us to quantify LGT for taxon pairs across body sites. Within a metagenomic assembly, each LGT event detected by WAAFLE is likely to be i) unique; ii) novel, since the use of a reference database should exclude previously detected LGT events; and iii) fixed in the population, since erratic events are likely not assembled. Thus, an increase in LGT frequency (per sample) as measured by WAAFLE represents an increase in unique LGT events, which can further be stratified by taxon pairs.

We generated two measures to quantify LGT frequency: i) gene percentages (the number of genes in LGT contigs normalized by the total number of sample genes) and ii) events per gene (the number of LGT contigs normalized by the total number of sample genes). The two measures may not correspond due to differences in assembly: samples with multiple short contigs may have low gene percentages and high events per gene, while samples with LGT in a few long contigs may have high gene percentages and low events per gene. Still, we found that both measures were highly correlated across body sites (Fig. I-6); thus, increases in either measure generally indicate higher LGT frequencies. We then used gene percentages to determine if WAAFLE is reproducible across technical replicates. As expected, LGT pairs were most similar between technical replicates, followed by intra-individual and then inter-individual samples based on Jaccard and Bray-Curtis distances (Fig. I-7). Distances for LGT pairs were much higher than those for single taxon gene percentages, indicating that similar taxonomic gene profiles still lead to highly variable LGT profiles. For the remainder of the analyses, we used only assemblies unique to an individual, body site, and time point, leaving 1,128 assemblies with 237 from stool, 208 from tongue dorsum, 191 from supragingival plaque, 182 from buccal mucosa, 94 from anterior nares, and 89 from posterior fornix.
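
The two frequency measures defined above can be computed per sample as in the following sketch. The input layout (one dict per contig with `n_genes` and `is_lgt` keys) and the function name are assumptions made for illustration; gene percentages are returned on a 0 to 100 scale and events per gene as a plain ratio.

```python
def lgt_frequencies(contigs):
    """Return (gene_percentage, events_per_gene) for one sample.

    `contigs` is a list of dicts with keys 'n_genes' (int) and
    'is_lgt' (bool); this layout is hypothetical.
    """
    total_genes = sum(c["n_genes"] for c in contigs)
    if total_genes == 0:
        return 0.0, 0.0
    lgt_contigs = [c for c in contigs if c["is_lgt"]]
    # i) Genes in LGT contigs, normalized by all genes in the sample.
    gene_percentage = 100.0 * sum(c["n_genes"] for c in lgt_contigs) / total_genes
    # ii) Number of LGT contigs, normalized by all genes in the sample.
    events_per_gene = len(lgt_contigs) / total_genes
    return gene_percentage, events_per_gene


if __name__ == "__main__":
    sample = [
        {"n_genes": 3, "is_lgt": True},
        {"n_genes": 5, "is_lgt": False},
        {"n_genes": 2, "is_lgt": True},
    ]
    print(lgt_frequencies(sample))
```

In this toy sample the two LGT contigs hold half of the genes, so the gene percentage is high even though only two events occur among ten genes, which is the kind of divergence between measures that assembly differences can produce.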

LGT is an adaptive mechanism that may facilitate microbial survival and maintenance at the individual or community level. Cataloguing high frequency LGT pairs identifies the partners and genes each taxon has access to, and furthers understanding of their interactions.

We thus characterized high frequency LGT pairs across six body sites, and found that they generally fell into three categories: those with high phylogenetic relatedness [61], large joint abundances, and similar functions or niches (Fig. 2-3A). Pairs with closely related taxa included Bacteroides with Parabacteroides (0.746% genes, average phylogenetic distance PD=1.02), Odoribacter (0.0947%, PD=1.77), or Alistipes (0.260%, PD=1.95), all of which were found in the stool and considered inter-family transfers, despite relatively short phylogenetic distances. In contrast, some taxa with high abundances transferred regardless of phylogenetic distance, including Lactobacillus and Gardnerella (0.137%, PD=8.34) in the posterior fornix, and Corynebacterium and Propionibacterium (0.0522%, PD=3.14) in the anterior nares. Lastly, some taxa pairs have overlapping functions or niches. Eubacterium and Roseburia (0.0637%, PD range 0.81 to 6.44) in stool are both butyrate producers that decrease in abundance with lower intake of carbohydrates [176, 177]. Oral taxa have close physical proximity through biofilms; one example includes a corncob structure found in supragingival plaque consisting of Corynebacterium and Streptococcus, with an outer ring of Haemophilus and Aggregatibacter [178], which may explain high frequency transfers for each pair, but not across the two pairs in both buccal mucosa and supragingival plaque.

Figure 2-3. LGT rates are highest for oral and stool sites. A) For each body site, we display LGT between the ten genera with the highest gene percentages via heatmaps. Each row and column represent a single genus and off-diagonal cells represent LGT gene percentages. Colors indicate the number of genes for the row taxa in LGT contigs involving both row and column taxa divided by the total genes per sample, averaged across body site, resulting in an asymmetrical matrix. The histogram above each heatmap shows the average number of genes per sample across body sites. B) Each point represents one sample in the body site. LGT frequencies on the y-axis are measured as the number of LGT contigs divided by the total number of sample genes, plotted on a log2 scale per 1000 genes.

Different environments may also facilitate or hinder LGT. This is evident in the patterns we see for the six different body sites: taxa seem to transfer indiscriminately in the stool and oral sites, but appear more selective in the anterior nares and posterior fornix (Fig. 2-3A). We therefore investigated whether differences in overall LGT frequency are attributable to body site. To do this, we calculated the overall extent of LGT in each body site using events per gene. LGT frequencies were highest in the stool (median m=2.898 events per 1000 genes), followed by multiple oral sites, including the supragingival plaque (m=2.134), tongue dorsum (m=2.129), and keratinized gingiva (m=1.799). Frequencies were lowest in the vaginal and skin sites (Fig. 2-3B). To further understand how technical and biological effects might affect LGT rates, we performed a linear regression using events per gene as the dependent variable, and technical and biological effects as the explanatory variables. Technical effects included the number of contigs per sample and contig size (genes/contigs), while biological effects included genus richness, genera evenness, and body site. Significant predictors of LGT frequency included body site (p<2e-16), average contig size (p=2e-16, positive coefficient), and species evenness (p=2e-16, positive coefficient). These observations indicate that sites with high LGT rates i) are mucosal and ii) have higher alpha diversity, in which evenness plays a larger role than richness.

LGT frequency and pair formation are shaped by abundance and phylogeny

We next set out to characterize the overall effect of phylogeny and taxon abundance on LGT frequencies. To this end, we calculated phylogenetic distances and joint abundances for each LGT taxon pair, and estimated how well each of these variables predicted LGT gene percentages using a nonparametric generalized additive model smoother. Phylogenetic distance was calculated by measuring branch length between two taxa in the PhyloPhlAn phylogenetic tree [50], which represents the average number of nucleotide substitutions between two taxa. Joint abundance was calculated by multiplying one taxon’s abundance by the other: taxon abundance was quantified as the total number of genes for a single taxon (across all contigs regardless of LGT status) divided by the total number of genes per sample, averaged across a body site. We observed an increase in LGT gene percentages at low phylogenetic distances (Fig. 2-4A), and an increase in LGT gene percentages as joint abundances increase (Fig. 2-4B). The former suggests that species level LGT events fix in the population more often than higher level LGT events. Phylogeny is known to affect LGT: closely related partners have shared DNA composition and transcriptional/translational machinery, allowing them to successfully integrate and express transferred genes [61]. The latter suggests that taxonomic abundance leads to increased transfer opportunities and thus higher rates irrespective of phylogenetic distance.
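
The joint abundance calculation described above can be sketched as follows. The input layout (one dict of per-taxon gene counts per sample, all from the same body site) and the function names are assumptions for illustration, and the sketch averages each taxon's abundance across samples before multiplying, which is one reading of the description.

```python
from statistics import mean


def taxon_abundance(samples, taxon):
    """Mean fraction of a sample's genes assigned to `taxon`.

    `samples` is a list of dicts mapping taxon -> gene count for the
    samples of one body site (hypothetical layout).
    """
    return mean(s.get(taxon, 0) / sum(s.values()) for s in samples)


def joint_abundance(samples, taxon_a, taxon_b):
    """Product of the two taxa's site-averaged abundances."""
    return taxon_abundance(samples, taxon_a) * taxon_abundance(samples, taxon_b)


if __name__ == "__main__":
    site = [{"A": 5, "B": 5}, {"A": 2, "B": 8}]
    print(joint_abundance(site, "A", "B"))
```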

Figure 2-4. Both abundance and phylogeny affect LGT rates. For both plots, each point represents a taxa pair, and smoothing functions are fit by a generalized additive model using cubic splines. Only taxa pairs annotated to at least the genus level are included, and taxa pairs found in a single sample (across body sites) are colored in gray. All other pairs are colored by inter-taxon LGT level (i.e., inter-species LGT pairs are red). In A), the x-axis displays the phylogenetic distance between the two taxa, while the y-axis shows the LGT gene percentages, or the average number of LGT genes in a taxa pair divided by total number of genes in a sample. In B), the x-axis shows the joint abundance, and the y-axis is the same as A). Joint abundances are calculated by multiplying one taxon’s gene percentage against another taxon’s gene percentage. Colors are the same as A).

We further examined how phylogeny and taxon abundances influence LGT pair formation, regardless of LGT rate. For phylogenetic distances, we observed that the HMP LGT pairs form bi-modal distributions across body sites. This distribution may indicate selective pair formation at specific distances, or reflect taxonomic bias in NCBI reference genomes. To distinguish between these two hypotheses, we compared the phylogenetic distance distribution from HMP LGT pairs to the distribution from randomly generated LGT pairs. We observed that the phylogenetic distance distributions are significantly different via the Kolmogorov-Smirnov test: randomly generated pairs have on average larger phylogenetic distances than HMP LGT pairs (Fig. I-8A), indicating that LGT preferentially occurs between closely related species. We repeated this analysis for LGT joint abundances, and found that randomly generated taxon pairs had higher joint abundances than HMP LGT pairs (Fig. I-8B). This suggests that LGT pair formation occurs more often than expected between rare taxa, which may be supported by the physical structure and community organization of microbial communities.
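
For readers unfamiliar with the comparison, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only that statistic in plain Python; the function name is hypothetical, the example values are made up, and it is not the code used in this study. In practice a library routine such as `scipy.stats.ks_2samp` would also supply the p-value.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of samples `x` and `y` (both non-empty sequences of numbers).
    """
    xs, ys = sorted(x), sorted(y)
    # The ECDFs only change at observed values, so checking those suffices.
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in points
    )


if __name__ == "__main__":
    observed_pairs = [1.0, 1.5, 2.0, 3.0]   # made-up phylogenetic distances
    random_pairs = [2.5, 4.0, 6.0, 8.0]     # made-up phylogenetic distances
    print(ks_statistic(observed_pairs, random_pairs))
```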

Genera have preferred transfer partners that are shared across similar sites

Individual taxa may vary in partner choice: some may be promiscuous, while others are more selective. We can identify these preferences by representing LGT pairs as a network, in which nodes are genera and edges are unique LGT events. We generated networks for each of the six body sites, and then calculated degree for every node (genus) along with the percentage of genes found in LGT events involving that node (Fig. 2-5A). As expected, genera with higher frequencies of LGT also have large numbers of partners. Interestingly, the majority of LGT events for these genera was accounted for by a small number of partners: for example, 90% of genes for LGT events involving Streptococcus are transferred with 11 (out of 57), 22 (out of 91), and 19 (out of 88) genera in the buccal mucosa, supragingival plaque, and tongue dorsum, respectively. Still, we attempted to identify taxa that i) had a larger number of partners and relatively large number of preferred partners, and ii) had a larger number of partners and relatively small number of preferred partners. The former represent more promiscuous taxa, while the latter may be more selective. The former category included Streptococcus (Fig. 2-5B), Actinomyces, Veillonella, and Haemophilus in the oral sites, as well as Clostridium and Faecalibacterium in stool. The latter category included Aggregatibacter in the oral sites (Fig. 2-5B), Bacteroides in the stool, and Corynebacterium and Propionibacterium in the anterior nares and supragingival plaque. LGT may not be as advantageous for these latter taxa, which might lead to limited transfer abilities.
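
The two quantities compared above, a genus's total number of partners (its degree) and the number of partners needed to account for 90% of its LGT genes, can be sketched as follows. The input layout, the function name, and the greedy largest-first accumulation are assumptions made for illustration.

```python
def partners_to_explain(partner_genes, fraction=0.9):
    """Smallest number of partners covering `fraction` of LGT genes.

    `partner_genes` maps partner genus -> number of genes in LGT events
    shared with the focal genus (hypothetical layout). Partners are
    added from largest to smallest contribution.
    """
    total = sum(partner_genes.values())
    running = 0
    for n, count in enumerate(sorted(partner_genes.values(), reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(partner_genes)


if __name__ == "__main__":
    focal = {"Gemella": 50, "Prevotella": 30, "Veillonella": 15, "Rothia": 5}
    degree = len(focal)                 # total number of partners
    needed = partners_to_explain(focal)  # partners covering 90% of genes
    print(degree, needed)
```

A genus whose `needed` value is small relative to its degree would fall among the more selective taxa in the sense used above.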

Figure 2-5. Taxa degree and differential edges. A) Across the six body sites, we compared the total number of LGT partners for a given genus against the number of partners needed to explain 90% of genes in LGT transfers. Points are colored by the number of genes in LGT events involving the given genus normalized by total sample genes, which are averaged across samples and log2 normalized. Taxa in the upper right corner are more promiscuous: these genera have many partners and need more partners to explain transfer. Taxa in the lower right corner are more selective: they have the ability to transfer with multiple taxa but mostly transfer with a few. Several genera are designated by letters as shown in B). B) We show an example of a promiscuous taxon, Streptococcus, along with a selective taxon, Aggregatibacter. The x-axis displays body site, while the y-axis is the gene percentage for LGT pairs, proportionally scaled to the square root of the total sum gene percentage. C) Arc diagrams display directional transfers in the buccal mucosa, supragingival plaque, and tongue dorsum. Solid black circles represent genera, and circle size indicates the average taxon gene percentage for the corresponding site. Arcs indicate directional transfer between two circles in a counterclockwise fashion: arcs above two circles indicate donation of genes from the right node to the left node, and vice versa for arcs under the two nodes. Arc width indicates the average number of LGT contigs with that direction normalized by total number of genes per sample. Arcs colored in blue are found in all three oral sites, while arcs in red are found in two oral sites.

We next investigated which LGT events were shared across multiple body sites. The networks for anterior nares (nodes=61, edges=166), posterior fornix (n=85, e=342), and buccal mucosa (n=130, e=890) had fewer nodes and edges than those for stool (n=174, e=2698), supragingival plaque (n=242, e=2812), and tongue dorsum (n=188, e=2898). Across all six sites, only 3 edges were shared, including Bacteroides and Parabacteroides, Bacteroides and Capnocytophaga, as well as Peptoniphilus and Streptococcus, while 2212 edges were unique to one site. This is not surprising: the six sites have distinct taxonomic compositions, along with different environments and selective pressures, which leads to different LGT pairs. We therefore focused on the intersection of the three oral networks, which shared 308 pairs, of which 232 pairs were not found in non-oral sites. Some oral pairs were found at differential frequencies across sites: for example, Streptococcus (degree=49) had a higher percentage of transfers with Gemella, Capnocytophaga, and Prevotella in the buccal mucosa, supragingival plaque, and tongue dorsum, respectively (Fig. 2-5B). Despite consistent partners in the oral sites, some of these genera had completely different partners in non-oral sites. For example, Streptococcus paired mostly with Lactobacillus in the posterior fornix and Dolosigranulum in the anterior nares.

Continuing our focus on the oral sites, we looked to see if oral taxon pairs might have preferences in transfer directionality, or if one taxon consistently donates or receives genes from its partner. We assigned directionality to LGT contigs with outer genes annotated as one taxon (designated as the recipient), and inner genes annotated as a different taxon (designated as the donor). We quantified events per gene for directional LGT pairs, and filtered for pairs found in at least 10% of samples across each site. For each pair, we took the maximum directional LGT frequency across oral sites, and then selected for pairs in the 75th percentile or above. We then plotted all edges associated with the 21 genera in the selected pairs (Fig. 2-5C). Across all three oral sites, Streptococcus, Veillonella and Pasteurella preferentially donated to Haemophilus, Rothia and Aggregatibacter preferentially donated to Neisseria, and Simonsiella preferentially donated to Eikenella. Other transfers have no donor or recipient preference: these include Gemella or Granulicatella with Streptococcus, as well as Neisseria and Haemophilus. We hypothesized that recipients may be the more abundant taxon (as compared to donors) within the community. Although recipients often made up a larger portion of the contig in which they are found, they were not consistently the more abundant taxon. Furthermore, some directional transfers were site-specific, indicating that environment may also facilitate donor/recipient dynamics.
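
The directionality rule described above (outer genes from the recipient, inner genes from the donor) can be illustrated with a short sketch. The function name and the handling of edge cases, such as returning no direction when the two outermost genes disagree or when more than one other taxon appears inside, are assumptions made for this example.

```python
def infer_direction(gene_taxa):
    """Infer (donor, recipient) from the ordered gene taxa of one contig.

    `gene_taxa` lists the assigned taxon of each gene in contig order.
    Returns None when the contig does not show a flanked-insert pattern.
    """
    if len(gene_taxa) < 3:
        return None  # Need at least two outer genes and one inner gene.
    recipient = gene_taxa[0]
    if gene_taxa[-1] != recipient:
        return None  # Outer genes must agree on the recipient.
    inner_taxa = set(gene_taxa[1:-1]) - {recipient}
    if len(inner_taxa) != 1:
        return None  # Exactly one other taxon must sit inside.
    return inner_taxa.pop(), recipient


if __name__ == "__main__":
    print(infer_direction(["Haemophilus", "Streptococcus", "Haemophilus"]))
```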

Mobile elements and TonB receptors are enriched in LGT contigs

Laterally transferred gene functions have been shown to be i) for adaptation, rather than for information storage [68], and ii) the outer component of an interaction network (such as a signaling or metabolic pathway), as opposed to a central component [69, 70]. We aimed to determine if such trends persist for novel LGT events in the HMP1-II metagenomes. To do this, we searched for gene functions enriched in LGT contigs. We quantified the number of UniRef90 terms from LGT and non-LGT contigs, aggregated them into Pfam clans [179], and performed Fisher’s Exact Test to identify Pfam clans associated with LGT contigs (as compared to all contigs). Enriched and depleted Pfam clans could be divided into 5 groups: i) DNA-binding proteins such as transposases, ribonucleases, and exo/endonucleases; ii) mobile elements including phage, plasmids, and toxin/antitoxin systems; iii) specific enzymes such as GMP synthase and the FMN-binding split barrel superfamily, the latter mostly consisting of pyridoxine/pyridoxamine 5'-phosphate oxidase; iv) transport systems including ABC transporters and TonB-dependent receptors; and v) antibiotic resistance genes (ARGs) (Fig. 2-6A). As expected, groups i) and ii) were enriched across most body sites, with the exception of plasmid toxin-antitoxin systems, which were enriched in the oral sites, as well as the NUMOD4 motif, which is part of an endonuclease found in Bacteroides [180], and was enriched only in stool. Groups iv) and v) contained mixed results: inner membrane transport proteins, such as ABC transporter permeases, were depleted, while outer membrane beta-barrel proteins and TonB-dependent receptors were enriched.
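
The enrichment test above can be illustrated on a single 2x2 table. The sketch below is a plain-Python version of a one-sided Fisher's Exact Test together with the sample odds ratio. It assumes a table contrasting LGT contigs with non-LGT contigs, which may differ from the exact comparison used in this study, and the function name and example counts are hypothetical. A library routine such as `scipy.stats.fisher_exact` would normally be used instead.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's Exact Test for enrichment in a 2x2 table.

    a: genes of the clan in LGT contigs
    b: other genes in LGT contigs
    c: genes of the clan in non-LGT contigs
    d: other genes in non-LGT contigs
    Returns (odds_ratio, p_value), where p_value is the probability of
    observing `a` or more clan genes in LGT contigs under the
    hypergeometric null.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    # Sum hypergeometric probabilities for tables at least as extreme.
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, tail / total


if __name__ == "__main__":
    print(fisher_enrichment(3, 1, 1, 3))
```

An odds ratio above 1 corresponds to enrichment of the clan in LGT contigs and a value below 1 to depletion, which is the quantity shown on a log2 scale in Fig. 2-6A.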

Figure 2-6. Enriched functions show taxon and structural similarities across sites. A) We searched for Pfam clans enriched and depleted in LGT contigs by aggregating UniRef90 terms for Fisher’s Exact Test. Each cell within the heatmap is colored by the log2 normalized odds ratio, in which a positive value indicates enrichment of the Pfam clan in LGT contigs, whereas a negative value indicates depletion of the Pfam clan in LGT contigs. B) We counted the number of genes for UniRef90 annotations in enriched Pfam clans, specifically plasmid-related genes, transcriptional regulators, TonB receptors, and ISNme transposases (from left to right). The x-axis is labeled using two color bars: the first bar indicates the UniRef90 annotation, while the second bar indicates the body site; colors for the latter correspond to A). The y-axis displays the number of genes found in LGT genes stratified by genus, and is proportionally scaled to the square root of the total number of LGT genes. C) We show a single contig containing a Neisseria and Haemophilus LGT event in the buccal mucosa. From top to bottom, we first show a graph in which the x-axis is the length of the contig and the y-axis is the taxon score. Arrows represent aggregated BLAST hits; those in red are for genus Neisseria while those in blue are for Haemophilus. Below, we display the called genes and their assigned UniRef90 functions. Lastly, we examined other oral sites and searched for contigs with Neisseria-Haemophilus LGT transfers with the UniRef90 term E3D293. These contigs are colored by UniRef90 function and labeled with sample number; many share synteny with the example contig.

We next examined groups with potential adaptive functions, including group iii) with GMP synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase, and group v) ARGs. GMP synthase and pyridoxine/pyridoxamine 5'-phosphate oxidase are likely LGT markers rather than transferred functions: GMP synthase is hypothesized to be part of integration sites [181], and has been found at the 3’ end of integrative and conjugative elements in Staphylococcus aureus, Listeria monocytogenes, Clostridium perfringens, and Enterococcus faecalis, which are four Gram-positive bacteria with low GC content [182]. Pyridoxine/pyridoxamine 5'-phosphate oxidase may also be an LGT marker: manual examination of LGT contigs with this function revealed that it is frequently found in conserved regions near transferred genes. Surprisingly, antibiotic resistance (ABR) was depleted outside of the VOC superfamily, which may contain glyoxalase and bleomycin resistance genes. Depletion may be due to the lack of selection for ABR in healthy human subjects, as well as WAAFLE’s inability to detect ABR LGT events already present in the reference database. Still, many enriched genes within the helix-turn-helix binding proteins (CL0123) are from TetR, AraC, and MerR family transcriptional regulators, of which the first and last may control tetracycline and mercury resistance, respectively. AraC was associated with LGT for iron acquisition regions in cheese (Fig. 2-6B) [183].

Lastly, we searched for functions that were specific to taxa. To do this, we extracted UniRef90 terms from significant Pfam clans, and determined which taxa they were derived from. Examples include the ISNme transposase (CL0219), which was found almost exclusively in Neisseria across oral sites; the plasmid recombination enzyme (CL0169), which was mostly in Streptococcus and Prevotella in oral sites, but spread across Bacteroides, Parabacteroides, Alistipes, and Clostridium in stool; and the TonB-dependent receptor plug and TonB-linked outer membrane protein SusC/RagA family (PF07715, CL0193, CL0287), which were found mostly in Capnocytophaga and Prevotella across oral sites, Prevotella in anterior nares and posterior fornix, and Bacteroides, Parabacteroides, and Alistipes in stool (Fig. 2-6B). We looked specifically at LGT contigs containing ISNme transposons, which occurred almost exclusively between Neisseria and Streptococcus. These contigs contained a conserved structure across oral sites in multiple samples (Fig. 2-6C).

Discussion

LGT is a strong evolutionary force: assuming one LGT event for every 10^10 vertical replications, no gene in any modern genome can be linked to the last universal common ancestor (LUCA) through vertical descent [15]. Most studies and computational tools have focused on whole genomes, which makes characterization of LGT within microbial communities particularly challenging. First, the use of reference genomes removes the microbial community context (i.e., the genome is obtained from culture rather than the community). Second, the assembly of complete genomes from microbial communities is experimentally and computationally challenging, requiring either low diversity communities [157] or single cell genomics [67]. We addressed both limitations by developing WAAFLE, which detects novel LGT events directly from partially assembled metagenomes. With this, we can begin to ask i) whether novel LGT events consistently occur in microbial communities, ii) which biological factors affect LGT frequency, and iii) which taxa and functions are exchanged. In our validation with synthetically generated LGT events, WAAFLE achieved high true positive rates for LGT detection and taxonomic assignment. We then applied WAAFLE to the Human Microbiome Project 1 Phase II and quantified LGT frequencies across multiple body sites.

Increased LGT frequencies were associated with overall community trends such as greater community evenness and body sites (stool and oral), as well as individual taxon pairs with higher community abundances and small phylogenetic distances. We also observed that mobile genetic elements and outer membrane proteins were enriched in LGT contigs. Overall, this demonstrates that WAAFLE can generate biological insights using existing metagenomic data.

It is important to consider the biological interpretation for LGT frequency, which depends on i) the data from which LGT is detected and ii) the quantification method. In WAAFLE specifically, the use of metagenomes means that each detected LGT event is unique to a sample and fixed in the population, while the use of a reference database means that each detected event should not have been previously characterized in reference genomes. Strikingly, our study detected multiple LGT events in six major body sites, demonstrating that LGT is an ongoing process in which events continuously fix in microbial populations. We next quantified LGT frequencies as the i) number of LGT contigs per gene and ii) number of genes in LGT contigs per gene. We hypothesized that higher LGT frequencies as detected by WAAFLE were likely caused by an increased number of unique taxon pair combinations and/or increased fixation rates. Indeed, across all six body sites, we found that higher community evenness, along with larger taxonomic abundances and smaller phylogenetic distances between taxon pairs, led to increased LGT frequency. With this, we propose that LGT occurs universally between taxa, in which greater community evenness increases the number of unique taxon pair combinations, and higher joint taxonomic abundance increases the probability of exchange. Fixation of events is then limited by factors such as phylogenetic distance.

WAAFLE has several limitations that should be taken into account. First, WAAFLE is ultimately affected by the quality of the metagenome assembly, which is in turn influenced by biological factors such as community evenness and richness. As a result, LGT frequencies were difficult to compare across sites: the posterior fornix had fewer contigs, and had close to the highest or lowest frequencies across body sites depending on the measure used. Samples with longer contigs (higher gene to contig ratio) also tended to have increased LGT frequency, especially those in gut and oral sites, though vaginal sites were not affected due to low community diversity. Second, WAAFLE’s parameters are tuned to be conservative with LGT calls (minimizing false positives). As such, WAAFLE underestimates LGT events, especially inter-genus and inter-species events, where LGT is most likely to occur. WAAFLE is also unable to detect strain-level LGT events, as the default reference database is annotated to the species level. Third, WAAFLE lacks the ability to infer donor and recipient for most events. This study briefly identified donor and recipient taxa across oral sites based on taxon gene order within contigs, but did not find consistent relationships between genera. A more focused characterization of donor and recipient taxa using phylogenetic trees may reveal whether specific taxa are prone to donating or receiving genes, and distinguish between transferred and non-transferred gene functions (as opposed to genes on contigs with or without LGT).

Despite these limitations, WAAFLE allowed us to identify patterns for specific taxa and LGT-enriched functions. We found that most taxa across sites were relatively selective about their partners, even if they had the ability to transfer with multiple other taxa. For example, promiscuous taxa such as Streptococcus transfer with many genera, while taxa such as Aggregatibacter primarily transfer with Haemophilus. We also found that metagenomically-enriched LGT functions included mobile genetic elements such as transposons, phage, and plasmids as well as outer membrane proteins, suggesting that 1) LGT events involving mobile elements are ongoing and relatively frequent as compared to transfer of other genes, and 2) mobile elements are pangenome-specific and do not ameliorate. These two points are illustrated in an example showing a Neisseria and Haemophilus transfer, in which the majority of the contig consists of Haemophilus genes with a single gene matching a Neisseria-specific ISNme transposase. This event is consistently detected across samples from the buccal mucosa, supragingival plaque, and tongue dorsum, showing that certain LGT events may be prevalent across individuals and taxon-pair specific.

We anticipate that future work will include generation of new measures for LGT frequency, improved detection of donor and recipient taxa, and further investigation of specific functions and taxa. WAAFLE as is detects novel (not in reference genomes), recent (events without amelioration), and fixed LGT events. Our ability to find these events enables us to i) determine the timescale at which these events occur, through the use of time-series data; ii) quantify the proportion of the microbial or human population that contains specific events, in which LGT sweeps might correspond with strain sweeps [184]; and iii) identify environmental factors that might influence LGT frequencies and transferred functions, through the use of case-control studies. Improved classification of donor and recipient taxa may facilitate discovery of transferred metabolic functions, which may be taxon-pair specific and were not detectable across body sites. Unlike findings based on reference genomes [185], which can infer donor and recipient, we did not see enrichment for antibiotic-resistance genes among LGT events. This may be due to the use of a healthy cohort, rather than one taking antibiotics. More work is needed to quantify the frequency and characteristics of novel LGT events in microbial communities, as well as the variation in transferred functions in different cohorts and in response to selective pressures. WAAFLE represents a step forward in characterizing LGT directly from microbial communities, which will ultimately enable us to understand the roles of LGT for adaptation or speciation in microbial communities.

Methods

Datasets

Metagenomic datasets used in this study were produced through the Human Microbiome Project Phase 1-II [174]. The HMP data are publicly available through the HMP’s public data repository (http://www.hmpdacc.org/). Contigs were assembled via IDBA-UD [186].

The pangenome reference database was generated by downloading NCBI isolate genomes, binning isolate genes by species, and then clustering binned species genes at 97% nucleotide identity [187].

Detecting LGT events from metagenomic shotgun sequencing datasets

WAAFLE takes one required input, i) contigs assembled from metagenomic data in FASTA format, and two optional inputs, ii) gene calls for each contig in Generic Feature Format version 3 (GFF3), and iii) a nucleotide reference database of genes with taxonomic and functional annotations. WAAFLE uses four steps to classify each contig as having LGT or not:

1. Contigs are searched against the ChocoPhlAn pangenome reference database (https://bitbucket.org/biobakery/humann2/wiki/Home) using BLASTN default parameters.

2. If gene calls were not supplied, contigs are annotated with genes using overlapping BLASTN alignments.

3. Within a contig, each gene is assigned multiple taxon scores. BLASTN hits are grouped by taxonomic annotation and gene overlap. We then calculate a score from each group using BLASTN hit percent identity and subject coverage.

4. Each contig is classified as “No LGT”, “LGT”, or “ambiguous” by examining whether all genes across a contig are best explained by one taxon, two, or multiple, respectively (Fig. 2-1).

These steps can be tuned using 5 parameters: subject coverage (s), overlap percentage (o), gene length (l), one taxon score (k1), and two taxon score (k2) (Table I-1). We describe each step in detail below.

Step 2: Calling genes.

If gene calls are not supplied, we combine overlapping BLAST hits to call genes. BLAST hits are first filtered by a subject coverage cutoff (s, default 0.75), where subject coverage is defined as the fraction of the reference gene (subject sequence) that aligned to the contig (query sequence).

For hits that aligned to contig ends, it is not possible for the full gene to align to the contig. We thus calculated subject coverage by dividing the alignment length by the subject gene length that can potentially align to the contig. Specifically, we subtracted the length of the subject gene that ran off the contig from the total subject gene length.
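The contig-end adjustment described above can be sketched in a few lines of Python. This is a minimal illustration rather than WAAFLE's actual code: the function name, the argument names, and the restriction to plus-strand hits with 1-based inclusive coordinates are all assumptions made for the example.

```python
def subject_coverage(sstart, send, slen, qstart, qend, qlen):
    """Fraction of the alignable part of a reference gene covered by one hit.

    Hypothetical helper. Coordinates are 1-based and inclusive, and the hit is
    assumed to lie on the plus strand of both subject (gene) and query (contig).
    """
    aligned = send - sstart + 1
    # Subject bases left unaligned on each side of the hit.
    unaligned_left = sstart - 1
    unaligned_right = slen - send
    # Contig bases available on each side of the hit.
    room_left = qstart - 1
    room_right = qlen - qend
    # Subject bases that run off the contig and therefore can never align.
    overhang = (max(0, unaligned_left - room_left)
                + max(0, unaligned_right - room_right))
    return aligned / (slen - overhang)


# A 1000 bp gene whose last 600 bp align to the first 600 bp of a contig:
# the first 400 bp run off the contig, so the hit covers all alignable bases.
print(subject_coverage(401, 1000, 1000, 1, 600, 5000))
# The same alignment in the middle of the contig is penalized as usual.
print(subject_coverage(401, 1000, 1000, 2001, 2600, 5000))
```

With the default cutoff of 0.75, the first hit would be retained and the second would be filtered out.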

The filtered BLAST hits are then sorted by length and sequentially assigned to groups based on overlap percentage. Hits and groups may be considered nucleotide fragments: overlap percentage is calculated between two nucleotide fragments by dividing the length of the overlap between the two fragments by the length of the shorter fragment. Specifically, each BLAST hit is added to a group if the hit has at least overlap percentage o (o, default 0.1) with any existing group; otherwise, a new group is created. After all BLAST hits have been considered, each group is considered a gene, and the start and end sites are calculated as the minimum start and maximum end of all BLAST hits in the group. The resulting genes are further filtered by length (l, default 200 bp).
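The grouping procedure can be illustrated with the following sketch. It is not the WAAFLE implementation: the function names are hypothetical, hits are reduced to (start, end) intervals on the contig, and two details that the text leaves open are assumed here, namely that hits are processed from longest to shortest and that a hit joins the first group it overlaps sufficiently.

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter fragment."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return max(0, overlap) / shorter


def call_genes(hits, min_overlap=0.1, min_length=200):
    """Merge overlapping hits into genes.

    hits: list of (start, end) contig intervals, 1-based inclusive.
    Assumptions: longest hits are placed first; a hit joins the first
    group that meets the overlap threshold.
    """
    groups = []  # each group is a mutable [start, end] interval
    for hit in sorted(hits, key=lambda h: h[1] - h[0], reverse=True):
        for group in groups:
            if overlap_fraction(hit, group) >= min_overlap:
                # Extend the group to the minimum start and maximum end.
                group[0] = min(group[0], hit[0])
                group[1] = max(group[1], hit[1])
                break
        else:
            groups.append(list(hit))
    # Each group is a gene; drop genes shorter than the length cutoff.
    return sorted(tuple(g) for g in groups if g[1] - g[0] + 1 >= min_length)


hits = [(1, 900), (50, 950), (1000, 1900), (1850, 1880), (3000, 3100)]
print(call_genes(hits))  # [(1, 950), (1000, 1900)]
```

In this toy input the 101 bp group at positions 3000 to 3100 is discarded by the default 200 bp length filter.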

Step 3: Assign taxon scores to genes.

To assign taxon scores, WAAFLE combines the BLASTN results from step 1 and gene annotations called from step 2 or supplied by the user. First, WAAFLE bins all BLAST hits (s, default 0) to genes if they have overlap greater than o (o, default 0.1); it is possible to assign a single hit to multiple genes. The top UniRef term across all BLAST hits assigned to a gene is then annotated as the gene function. Second, for each gene in a contig, WAAFLE further groups the binned BLASTN hits based on taxonomic annotation, which can be performed at different taxonomic levels (such as kingdom, phylum, class, etc.). Each BLAST hit within the group is scored by multiplying its percent identity by its subject coverage. For each nucleotide position within the gene, we allot the maximum score across grouped BLAST hits, or if there were no BLAST hits at that position, allot a score of 0. This results in a vector of scores per taxon per gene, which we average for a single taxon score. Once each taxon has been scored at each gene, each contig can then be represented by a table, S, with N rows (representing taxa) and M columns (representing genes).
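As a concrete illustration of the per-position scoring, the sketch below computes the score of one taxon at one gene. The function name and the tuple layout of a hit are hypothetical, and percent identity and subject coverage are assumed to have been rescaled to the range 0 to 1.

```python
def taxon_score(gene, hits):
    """Average per-base score of one taxon over one gene.

    gene: (start, end) on the contig, 1-based inclusive.
    hits: list of (start, end, identity, subject_coverage) for a single
          taxon, with identity and coverage expressed as fractions
          (an assumption made for this sketch).
    """
    gene_start, gene_end = gene
    length = gene_end - gene_start + 1
    per_base = [0.0] * length  # positions without any hit keep a score of 0
    for hit_start, hit_end, identity, coverage in hits:
        score = identity * coverage
        first = max(gene_start, hit_start)
        last = min(gene_end, hit_end)
        for pos in range(first, last + 1):
            idx = pos - gene_start
            per_base[idx] = max(per_base[idx], score)
    return sum(per_base) / length


# Bases 1-50 score 1.0, bases 51-75 score 0.5, bases 76-100 have no hit.
print(taxon_score((1, 100), [(1, 50, 1.0, 1.0), (26, 75, 0.5, 1.0)]))  # 0.625
```

Repeating this calculation for every taxon and every gene fills the N by M table S used in the classification step.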

Step 4: LGT classification and taxonomic annotation of contigs.

Only contigs with more than 1 gene and more than 1 taxon are considered for LGT. To search for LGT, we loop through seven taxonomic levels, starting at the species level and ending at the kingdom level. The loop is terminated if the contig is i) classified as containing LGT or not containing LGT, and ii) assigned a single taxon pair or taxon, respectively. At each taxonomic level, we perform step 3, in which each contig is represented as table $S$, where each entry $S_{ij}$ contains taxon $i$’s score for the $j$th gene.

Using this table, we define $O(i) = \min_j S_{ij}$ as taxon $i$’s worst single-gene score, and $C(i, i') = \min_j \max(S_{ij}, S_{i'j})$ as the worst single-gene score for the combination of taxa $i$ and $i'$. If $\max_i O(i)$ is larger than the one taxon score threshold ($k_1$, default 0.5), then one taxon explains the entire contig. If $\max_i O(i) < k_1$ and $\max_{i,i'} C(i, i')$ is larger than the two taxon score threshold ($k_2$, default 0.8), then $i$ and $i'$ jointly explain the contig, indicating LGT between taxa $i$ and $i'$. If neither $k_1$ nor $k_2$ is met, the contig is annotated as “ambiguous”. If the contig is annotated as “ambiguous”, the loop continues to a higher taxonomic level. If the contig is determined to contain no LGT or LGT, WAAFLE performs taxonomic assignment.
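The decision rule at a single taxonomic level can be sketched as follows. The function name and the dictionary representation of table S are assumptions made for illustration; ties between taxa or taxon pairs and the loop over taxonomic levels are deliberately left out.

```python
from itertools import combinations


def classify_contig(S, k1=0.5, k2=0.8):
    """Classify one contig at one taxonomic level.

    S: dict mapping taxon name -> list of per-gene scores (one row of the
       table per taxon, all rows the same length).
    Returns (status, taxa). Tie handling is omitted in this sketch.
    """
    # O(i): worst single-gene score of each taxon.
    one = {taxon: min(scores) for taxon, scores in S.items()}
    best_taxon = max(one, key=one.get)
    if one[best_taxon] > k1:
        return "No LGT", (best_taxon,)
    # C(i, i'): worst single-gene score when two taxa are combined.
    two = {
        (a, b): min(max(x, y) for x, y in zip(S[a], S[b]))
        for a, b in combinations(S, 2)
    }
    if two:
        best_pair = max(two, key=two.get)
        if two[best_pair] > k2:
            return "LGT", best_pair
    return "ambiguous", ()


# Taxon A explains genes 1 and 3, taxon B explains gene 2.
print(classify_contig({"A": [0.95, 0.1, 0.9], "B": [0.2, 0.97, 0.1]}))
```

In the example neither taxon alone exceeds k1 (both have a worst score of 0.1), but the pair has a worst combined score of 0.9, which exceeds k2, so the contig is called as LGT between A and B.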

Taxonomic assignment is performed as follows: if the contig is determined to contain no LGT, the contig is annotated with the taxon $i$ for which $O(i) = \max_i O(i)$. If multiple taxa have scores equal to $\max_i O(i)$, we annotate the contig with the term “multiple” (rather than any taxon). If the contig is determined to contain LGT, the contig is annotated with the taxa $i$ and $i'$ for which $C(i, i') = \max_{i,i'} C(i, i')$. If multiple taxon pairs have scores equal to $\max_{i,i'} C(i, i')$, WAAFLE determines whether the multiple pairs share one taxon, indicating that one taxon is known while the other is uncertain. If so, WAAFLE determines the name of the uncertain taxon by identifying the last common ancestor shared between all uncertain taxa, and assigns the contig the taxon pair consisting of the universally shared taxon and the last common ancestor of the uncertain taxa. If all pairs are different, we annotate the contig with the term “multiple”. If a contig is assigned the term “multiple”, the determined LGT status is rejected and the loop continues to a higher taxonomic level. Otherwise, we complete the search and annotate the contig with its LGT status and corresponding taxa.
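The handling of tied taxon pairs can be illustrated with a short sketch. The function names, the lineage dictionary, and the toy taxon labels are all hypothetical; lineages are assumed to be lists ordered from the highest rank to the lowest.

```python
def last_common_ancestor(lineages):
    """Deepest rank name shared by all lineages, or None if none is shared."""
    lca = None
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        lca = names[0]
    return lca


def resolve_tied_pairs(pairs, lineage_of):
    """Resolve two or more taxon pairs that tie for the best combined score.

    pairs: list of (taxon, taxon) tuples with equal scores.
    lineage_of: dict mapping taxon -> lineage list (toy data in this sketch).
    """
    shared = set(pairs[0]).intersection(*map(set, pairs[1:]))
    if len(shared) != 1:
        return "multiple"  # no single taxon is common to every pair
    known = shared.pop()
    uncertain = [t for pair in pairs for t in pair if t != known]
    return known, last_common_ancestor([lineage_of[t] for t in uncertain])


lineage_of = {
    "A": ["Bacteria", "Phylum1", "GenusP", "A"],
    "B1": ["Bacteria", "Phylum2", "GenusQ", "B1"],
    "B2": ["Bacteria", "Phylum2", "GenusQ", "B2"],
    "C": ["Bacteria", "Phylum3", "GenusR", "C"],
}
print(resolve_tied_pairs([("A", "B1"), ("A", "B2")], lineage_of))  # ('A', 'GenusQ')
print(resolve_tied_pairs([("A", "B1"), ("C", "B2")], lineage_of))  # multiple
```

In the first call taxon A is shared by both pairs, so the uncertain partner is reported as the last common ancestor of B1 and B2; in the second call no taxon is shared, so the contig would be labeled “multiple”.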

Other Options: Dealing with Unknown Taxa

First, in the case where there are no BLAST alignments to a gene (due to a user supplying their own gene calls), WAAFLE by default assigns the gene a taxon score of 1 for the taxon “Unknown”. This will result in WAAFLE either i) identifying the contig as an inter-kingdom LGT between one taxon and the “Unknown”, or ii) identifying the contig as “ambiguous” if no two taxa can explain the full contig. Second, users may choose to “spike” an “Unknown” taxon into table S during Step 4, in which the “Unknown” is equal to $1 - \max_i O(i)$ across all genes. Simulation with this flag has shown that WAAFLE will then call LGT between one taxon and an “Unknown” for contigs containing multiple genes with low taxon scores, so caution is advised if using this function.

Tuning parameters through grid search

WAAFLE has 5 parameters: subject coverage (s), overlap percentage (o), gene length (l), one taxon score threshold (k1), and two taxon score threshold (k2). We constructed a set of 1000 synthetic contigs to set these parameters. Contigs were generated through a three-step process. First, we randomly selected donor and recipient genomes that differed across 8 taxonomic levels (kingdom, phylum, class, order, family, genus, species, and strain/no difference). Second, we chose a three-gene region within the recipient genome, and swapped out the center gene with a random donor gene. At each taxonomic level, contigs contained 25 unique donor-recipient pairs with 5 contigs each (for a total of 190 unique donors and 183 unique recipient strains). Third, we truncated the contigs on both ends. After truncation, some contigs were left with only one gene; these were removed, resulting in a different distribution across taxonomic levels.

Gene Calling

We first assessed WAAFLE’s ability to call genes while varying three parameters: i) subject coverage at 0, 0.25, 0.5, 0.75, and 0.9, ii) overlap from 0.1 to 0.5 in 0.1 increments, and iii) gene length at 0, 25, 50, 75, 100, and 200 bp. We then compared each NCBI reference gene to WAAFLE-called genes, and vice versa, to identify true positives, false positives, and false negatives:

1. True positive: A WAAFLE-called gene overlaps the NCBI annotated gene by at least 80%.

2. False positive: A WAAFLE-called gene does not match any NCBI annotated gene, or two or more WAAFLE-called genes match one NCBI annotated gene.

3. False negative: The reference gene does not match any WAAFLE-called gene.

Note that true negatives cannot be assessed meaningfully: these would be regions where NCBI had no annotation, and WAAFLE did not call a gene. With this, we compared TPR against PPV for each set of conditions (Fig. I-3).
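A minimal sketch of this comparison is given below. The function name is hypothetical, and two points that the definitions above leave open are assumed: the 80% overlap is measured relative to the length of the reference gene, and a reference gene matched by two or more called genes contributes a single false positive.

```python
def evaluate_gene_calls(called, reference, min_overlap=0.8):
    """Count TP, FP, FN for called genes and derive TPR and PPV.

    called, reference: lists of (start, end) intervals, 1-based inclusive.
    Assumptions: overlap is a fraction of the reference gene length; a
    reference gene matched by several calls counts as one false positive.
    """
    def matches(call, ref):
        overlap = min(call[1], ref[1]) - max(call[0], ref[0]) + 1
        return overlap / (ref[1] - ref[0] + 1) >= min_overlap

    tp = fp = fn = 0
    matched_calls = set()
    for ref in reference:
        hits = [call for call in called if matches(call, ref)]
        matched_calls.update(hits)
        if len(hits) == 1:
            tp += 1
        elif hits:
            fp += 1
        else:
            fn += 1
    # Calls that match no reference gene are also false positives.
    fp += sum(1 for call in called if call not in matched_calls)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "TPR": tpr, "PPV": ppv}


reference = [(1, 1000), (1200, 2000), (3000, 4000)]
called = [(1, 950), (1250, 1900), (5000, 5500)]
print(evaluate_gene_calls(called, reference))
```

Here two reference genes are recovered, one is missed, and one call has no counterpart, so both TPR and PPV equal 2/3.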

LGT Classification

In order to set parameters k1 and k2, we performed a second grid search to characterize WAAFLE’s ability to call LGT. We only included contigs with at least 2 genes. We varied four parameters: i) subject coverage at 0, 0.25, 0.5, 0.75, and 0.9, ii) overlap from 0.1 to 0.5 in 0.1 increments, iii) k1 from 0.1 to 0.9 in 0.1 increments, and iv) k2 from 0.1 to 0.9 in 0.1 increments. We then assessed positives and negatives as follows:

1. True positive: WAAFLE calls “LGT” for a synthetic contig with an inter-species LGT or above.

2. True negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an inter-strain LGT.

3. False positive: WAAFLE calls “LGT” for a synthetic contig with an inter-strain LGT.

4. False negative: WAAFLE calls “No LGT” or “ambiguous” for a synthetic contig with an inter-species LGT or above.

It should be noted that WAAFLE does not have to call LGT at the correct taxonomic level; thus, this assessment looks specifically at whether WAAFLE can detect LGT, not whether it called the correct taxa.
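The four outcome definitions can be expressed compactly as below. The function names and the use of a string label for the taxonomic level at which donor and recipient differ are assumptions made for this sketch.

```python
from collections import Counter


def score_call(call, truth_level):
    """Map one WAAFLE call on a synthetic contig to TP, TN, FP, or FN.

    call: "LGT", "No LGT", or "ambiguous".
    truth_level: level at which donor and recipient differ, e.g. "strain",
                 "species", "genus" (label format is hypothetical).
    """
    detectable = truth_level != "strain"  # inter-species or above
    if call == "LGT":
        return "TP" if detectable else "FP"
    return "FN" if detectable else "TN"


def tally_calls(results):
    """Count outcomes over (call, truth_level) pairs."""
    return Counter(score_call(call, level) for call, level in results)


print(tally_calls([("LGT", "genus"), ("LGT", "strain"), ("No LGT", "phylum")]))
```

Because only the presence of a call is scored, an “LGT” call made at the wrong taxonomic level still counts as a true positive, consistent with the note above.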

Taxonomic Annotation

For correctly classified contigs, we assessed whether WAAFLE annotated contigs with the correct taxa at each taxonomic level. To compare one taxon call against another, we looked to see whether they had identical names at each phylogenetic level (i.e., same name at kingdom, phylum, class, etc.). At best, two taxa may match across all seven levels; in the worst case, two taxa may not match at all. For a contig without LGT, we compared the WAAFLE taxon to the reference taxon. For a contig with LGT, we compared each WAAFLE taxon to each reference taxon, and selected the combination of pairs with the highest number of matches. We then calculated what percentage of the reference taxa had a correct match at each taxonomic level.
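The level-by-level comparison can be sketched as follows, using toy lineages. The function names are hypothetical, and a call made above the species level is assumed to be represented by a shorter lineage whose missing ranks never match.

```python
from itertools import permutations

LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def level_matches(called, reference):
    """Per-level booleans; ranks missing from the called lineage count as mismatches."""
    padded = list(called) + [None] * (len(LEVELS) - len(called))
    return [c == r for c, r in zip(padded, reference)]


def best_pairing(called_pair, reference_pair):
    """Try both assignments of called taxa to reference taxa; keep the best."""
    candidates = [
        [level_matches(c, r) for c, r in zip(ordering, reference_pair)]
        for ordering in permutations(called_pair)
    ]
    return max(candidates, key=lambda m: sum(map(sum, m)))


def percent_correct(match_vectors):
    """Percentage of reference taxa matched at each level."""
    n = len(match_vectors)
    return [100.0 * sum(column) / n for column in zip(*match_vectors)]


ref_a = ["Bacteria", "P1", "C1", "O1", "F1", "G1", "S1"]
ref_b = ["Bacteria", "P2", "C2", "O2", "F2", "G2", "S2"]
# WAAFLE-style call: one taxon resolved to genus only, listed in swapped order.
matches = best_pairing((ref_b[:6], ref_a), (ref_a, ref_b))
print(dict(zip(LEVELS, percent_correct(matches))))
```

In this example both reference taxa are matched from kingdom through genus, and one of the two is matched at the species level.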

Quality control for the Human Microbiome Project (HMP) assemblies

Samples were filtered out if they 1) were outliers in ordination analyses using MetaPhlAn [186] community profiles or 2) had fewer than 1,000 genes across contigs (definitively annotated as LGT or not). Contigs were then filtered from these samples if they resembled misassemblies, defined here as the erroneous combination of genomic material from two species into a single contig, which will match WAAFLE’s internal model for a biological LGT event and result in false positive LGT calls. To identify and quantify misassembly in contigs from the HMP1-II dataset, we examined recruitment of reads to gene junctions. Contigs that met the two conditions below were removed:

1. The average coverage (reads per nucleotide) of the gene-gene junction is less than half of the average coverage of the flanking genes.

2. There are no single reads or read pairs that support the junction. Single reads may support the junction if they overlap both the junction and flanking genes (single), paired reads may support the junction if i) each read is in a flanking gene (perfect-double) or ii) one read is in one flanking gene, and the other overlaps the other flanking gene and the junction (partial-double) (Fig. I-1).

Both conditions are necessary to remove a contig because contig coverage is highly variable, and read support decreases as junction lengths increase.
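The combined filter can be written as a short predicate. This is a sketch under stated assumptions: the function and argument names are hypothetical, the coverage of the flanking genes is taken as the mean of the two genes, and the number of supporting reads is assumed to be the sum of single, perfect-double, and partial-double support.

```python
def is_misassembly(junction_cov, left_gene_cov, right_gene_cov, supporting_reads):
    """Flag a gene-gene junction as a likely misassembly.

    junction_cov, left_gene_cov, right_gene_cov: average reads per nucleotide.
    supporting_reads: count of single reads plus read pairs spanning the
                      junction (all three support types summed).
    """
    flank_cov = (left_gene_cov + right_gene_cov) / 2  # assumed: simple mean
    low_coverage = junction_cov < 0.5 * flank_cov
    unsupported = supporting_reads == 0
    # Both conditions must hold before the contig is removed.
    return low_coverage and unsupported


print(is_misassembly(2.0, 10.0, 12.0, 0))  # True: low coverage, no support
print(is_misassembly(2.0, 10.0, 12.0, 3))  # False: reads span the junction
print(is_misassembly(8.0, 10.0, 12.0, 0))  # False: coverage is adequate
```

Requiring both conditions keeps contigs whose junction coverage merely fluctuates, as well as long junctions that few reads can span.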

Linear regression for LGT frequency

We performed linear regression with LGT events per gene as the outcome, and number of contigs, gene to contig ratio, alpha diversity, richness, and body site as the regressors. Alpha diversity was calculated using the Gini-Simpson Index [188], which is equal to 1 minus the sum of the squares of each genus’s proportion of genes. Richness was counted as the total number of genera per sample.
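For reference, the two community summaries used as regressors can be computed as in the sketch below, where $p_g$ is the fraction of a sample's genes assigned to genus $g$ and the index is $1 - \sum_g p_g^2$. The function names and the dictionary of per-genus gene counts are assumptions made for illustration.

```python
def gini_simpson(gene_counts):
    """Gini-Simpson index from per-genus gene counts: 1 - sum(p_g ** 2)."""
    total = sum(gene_counts.values())
    return 1 - sum((n / total) ** 2 for n in gene_counts.values())


def richness(gene_counts):
    """Number of genera with at least one gene in the sample."""
    return sum(1 for n in gene_counts.values() if n > 0)


even = {"GenusA": 25, "GenusB": 25, "GenusC": 25, "GenusD": 25}
print(gini_simpson(even), richness(even))  # 0.75 4
print(gini_simpson({"GenusA": 100}))       # 0.0
```

A perfectly even four-genus sample reaches 0.75, while a sample dominated by a single genus approaches 0.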

Determining phylogenetic distance between pairs

We calculated phylogenetic distances between pairs using the PhyloPhlAn tree [50]. If both taxa were annotated to the species level (tree tips or terminal nodes), distances were calculated between terminal nodes. If a taxon was not annotated to the species level, the internal node for the last common ancestor (LCA) was determined after searching the tree for all species that matched the last known level by regular expression. Distances were then calculated between nodes, and adjusted by adding the average distance from the LCA to the terminal nodes.

Functional Analyses

Identifying enriched and depleted Pfam clans

Fisher’s Exact Test was performed both per sample and per body site. For each sample, we counted the total number of UniRef90 genes in contigs with at least 2 genes and WAAFLE classification of “LGT” or “No LGT”. For the body site, we summed the total number of UniRef90 genes in contigs with at least 2 genes and WAAFLE classification of “LGT” or “No LGT”. We then aggregated UniRef90 terms to Pfam clans, and identified Pfam clans that were positively or negatively associated with LGT contigs. A Pfam clan was considered significant if:

1. The site-wide q-value is < 0.01.

2. The difference between the percentage of sample odds ratios (OR) that agreed with the site-wide odds ratio and the percentage of sample odds ratios that disagreed with the site-wide odds ratio is greater than 0.2:

 $\mathrm{OR}_{\mathrm{support}} + \mathrm{OR}_{\mathrm{against}} + \mathrm{OR}_{\mathrm{nan}} = \mathrm{total\_samples}$

 $(\mathrm{OR}_{\mathrm{support}} - \mathrm{OR}_{\mathrm{against}}) / \mathrm{total\_samples} > 0.2$

For the latter condition, 0.2 was chosen because it requires at least 20% of the samples to have an odds ratio, and the worst case scenario involves $\mathrm{OR}_{\mathrm{support}} / \mathrm{total\_samples} = 0.6$ and $\mathrm{OR}_{\mathrm{against}} / \mathrm{total\_samples} < 0.4$.
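The second criterion can be sketched as follows. The function names are hypothetical, and two details not specified above are assumed: an odds ratio is treated as undefined (nan) when its denominator is zero, and a sample odds ratio of exactly 1 is counted as disagreeing with the site-wide direction. The q-value criterion from Fisher's Exact Test is evaluated separately and is not shown.

```python
import math


def odds_ratio(clan_lgt, other_lgt, clan_non, other_non):
    """Odds ratio of a 2x2 table of gene counts; nan if undefined (assumption)."""
    if other_lgt == 0 or clan_non == 0:
        return float("nan")
    return (clan_lgt * other_non) / (other_lgt * clan_non)


def direction_is_consistent(sample_ors, site_or, margin=0.2):
    """Check (OR_support - OR_against) / total_samples > margin."""
    support = against = 0
    for value in sample_ors:
        if math.isnan(value):
            continue  # counted in OR_nan, i.e. only in the denominator
        same_side = (value > 1 and site_or > 1) or (value < 1 and site_or < 1)
        if same_side:
            support += 1
        else:
            against += 1
    return (support - against) / len(sample_ors) > margin


nan = float("nan")
print(direction_is_consistent([2.0, 3.0, 1.5, 0.5, nan], site_or=2.0))  # True
print(direction_is_consistent([2.0, 0.5, nan, nan, nan], site_or=2.0))  # False
```

In the first call three of five samples agree and one disagrees, giving a margin of 0.4; in the second, most samples have no defined odds ratio and the margin is 0, so the clan would not pass.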

Searching for genes within Pfam clans

WAAFLE annotates each gene with a UniRef90 term and taxon, which enables us to examine in more detail which genes and taxa are within enriched Pfam clans. To do this, we quantified the UniRef90 terms from specific Pfam clans and stratified them by taxonomic annotation and LGT status (within an LGT contig or not). UniRef90 terms with similar annotations were collapsed for plotting purposes.

Chapter 3:

Urban transit system microbial communities differ by surface type and interaction with humans and environment

Copyright Disclosure

This Chapter is a reproduction of a published manuscript, in which the * indicates equal contribution:

Hsu T.*, Joice R.J.*, Vallarino J., Abu-Ali G., Hartmann E.M., Shafquat A., DuLong C., Baranowski C., Gevers D., Green J.L., Morgan X.C., Spengler J.D., Huttenhower C. Urban Transit System Microbial Communities Differ by Surface Type and Interaction with Humans and the Environment. mSystems, 2016. 1(3): e00018-16.

Attributions

R.J., J.S., and C.H. designed the study. C.B. optimized the sampling protocol. R.J., T.H., and J.V. collected transit samples, and R.J. and T.H. extracted DNA for 16S and shotgun sequencing at the Broad Institute. R.J. and A.S. performed 16S computational analyses; T.H., A.S., and G.A. performed shotgun computational analyses; E.M.H. and J.L.G. helped interpret taxonomic composition and functional profiling results. R.J., T.H., A.S., C.D., and X.C.M. made figures; X.C.M., D.G., J.D.S., and C.H. provided support throughout the sequencing and analysis process. R.J., T.H., and X.C.M. wrote the manuscript.

Abstract

Public transit systems are ideal for studying the urban microbiome and inter-individual community transfer. In this study, we used 16S amplicon and shotgun metagenomic sequencing to profile microbial communities on multiple transit surfaces across train lines and stations in the Boston metropolitan transit system. The greatest determinant of microbial community structure was the transit surface type. In contrast, little variation was observed between geographically distinct train lines and stations serving different demographics. All surfaces were dominated by human skin and oral commensals such as Propionibacterium, Corynebacterium, Staphylococcus, and Streptococcus. Non-human associated taxa detected included generalists from Alphaproteobacteria, which was especially abundant on outdoor touchscreens. Shotgun metagenomics further identified viral and eukaryotic microbes including Propionibacterium phage and Malassezia globosa. Functional profiling showed that P. acnes pathways such as propionate production and porphyrin synthesis were enriched on train holds, while electron transport chain components for aerobic respiration were enriched on touchscreens and seats. Lastly, the transit environment was not found to be a reservoir of antimicrobial resistance and virulence genes. Our results suggest that microbial communities on transit surfaces are maintained from a metapopulation of human skin commensals and environmental generalists, with enrichments corresponding to local interactions with the human body and environmental exposures.

Importance

Mass transit environments, specifically urban subways, are distinct microbial environments with high occupant densities, diversities, and turnovers, and they are thus especially relevant to public health. Despite this, only three culture-independent subway studies have been performed, all since 2013 and with widely varying designs and differing conclusions. In this study, we profiled the Boston subway system, which provides 238 million trips per year and is operated by the Massachusetts Bay Transportation Authority (MBTA). This yielded the first high-precision microbial survey of a variety of surfaces, ridership environments, and microbiological functions (including tests for potential pathogenicity) in a mass transit environment. Characterizing microbial profiles for multiple transit systems will be increasingly important for biosurveillance of antibiotic resistance genes or pathogens, which can be early indicators for outbreak or sanitation. Understanding how human contact, materials, and the environment affect microbial profiles may eventually allow us to rationally design public spaces to sustain our microbial health.

Introduction

Mass transit systems host large volumes of passengers and facilitate a constant stream of human/human and human/built environment microbial transmission. The largest urban mass transit system in the United States facilitates an average of 11 million trips per weekday (New York). The next four largest systems transport just over 1 million trips per weekday (Washington DC, Chicago, Boston, San Francisco) [189][180][182][181], yet little is known about the mass transit system microbial reservoir. Understanding the associated microbial transmission dynamics between humans and the built environment, and microbial occupation and persistence on different surfaces, can inform decisions regarding public health and safety.

Microbial DNA sequencing-based studies have revealed that microbial communities of the built environment are greatly influenced by their human occupants. Communities within homes showed higher similarity to those of their inhabitants [92], and specific surfaces frequently contacted by human skin, such as keyboards or mobile phones, had microbial communities that reflect those of skin [190, 191]. In restrooms and classrooms, variation in microbial community composition across surface types was associated with variation in human contact with those surfaces: desks contained human skin and oral microbes, while chairs contained intestinal and urogenital-derived microbes [93, 192]. However, a limitation of most built environment microbiome research is that human contact, surface type, and material composition are frequently confounded. For example, in the classroom study described above, different forms of human contact were associated with distinct microbial community profiles; however, the desks and chairs were also constructed from different materials.

Previously observed subway microbial communities comprise both human and environmentally derived microbes. Air samples from within the New York and Hong Kong subway systems included microbes originating from soil and environmental water in addition to human skin [193, 194]. The recent metagenomic study of New York subway stations [195] has been widely criticized [196] and leaves many detailed analysis questions regarding the transit microbiome unanswered, but it has provided an initial reference dataset for further analysis of subway microbiome diversity. In addition, while this study collected surface type information, it did not standardize their characterization or, as a result, investigate surface-specific enrichments for microbial taxa. Understanding the separate influences of human contact, surface type, and surface material would help identify mechanisms through which microbial communities form and persist on surfaces within built environments.

In the present study, we provide the first comprehensive metagenomic profile of microbial communities across multiple surface types and materials in a high-volume public transportation system. Samples were collected from seats, seat backs, walls, vertical and horizontal poles, and hanging grips inside train cars from three subway lines, as well as touchscreens and walls of ticketing machines inside five subway stations. Using a combination of 16S amplicon and shotgun metagenomic sequencing, we characterized the microbial community composition, functional capacity, and pathogenic potential of the Boston mass transit system. In agreement with previous studies, we observed a combination of human-, soil-, and air-derived microbial communities across the system. Taxonomic differences were most strongly associated with surface type, as compared to geographic, train-line, and material differences in a multivariate analysis. The distribution of metabolic functions was dominated by P. acnes, which made up a majority of the community. Minimal antibiotic resistance genes and virulence factors were detected across transit system surfaces. In addition to identifying the most important factors determining microbial colonization, our results may serve as a baseline description of microbes on public transportation surfaces, which will be relevant toward future design of transit environments encouraging microbial health.

Results

Sampling microbial communities on the Boston transit system

We collected samples from train cars and stations (n=73) from the Boston transit system. This system is maintained by the Massachusetts Bay Transportation Authority (MBTA), which operates bus, subway, commuter rail, and ferry routes in the greater Boston area. Our study focused on the subway system, which consists of five lines (red, orange, blue, green, and silver) that extend from downtown Boston into the surrounding suburbs (Fig. 3-1A). Train car samples were collected from the red, orange, and green lines, and comprised 6 surface types, including grips, horizontal and vertical poles, seats, seat backs, and walls (Fig. 3-1B). Station samples were collected from the touchscreens and the sides of fare ticketing machines (Fig. 3-1C). Biomass yields were highest for hanging grips (141.83±92.68 ng/µL), followed by seats (128.1429±49.955 ng/µL) and touchscreens (120.47±73.68 ng/µL), though these differences were not statistically significant (Fig. II-S1A).

Figure 3-1. Collection of samples from MBTA trains and stations. (A) Microbial community samples were collected from the Massachusetts Bay Transit system in the Boston, Massachusetts metropolitan area. Train samples were collected from 6 train car surfaces across 3 locations along 3 train routes; station samples were collected from 5 stations. (B, C) Diagram of the surfaces sampled within train cars (B) and stations (C). Sampled surfaces specifically included seats and seat backs, horizontal and vertical poles, hanging grips, and walls within train cars, as well as the screens and walls of touchscreen machines within stations.

For each sample, we collected metadata describing built environment type, surface type, material composition, as well as collection date (Table II-1). For train car samples, we also recorded the train line, within-train location, and location along the subway route at time of sample collection (nearest subway stop). For station samples, we recorded the station, ticketing machine location, and which side of the touchscreen was swabbed. 16S rDNA amplicon sequence data was generated from most samples (n=72), and a subset (n=24) was subjected to shotgun metagenomic sequencing.

Microbial communities are specific to surface types and immediate environment

The surface type from which microbes were collected proved to be the major determinant of community diversity and structure. Alpha diversity of touchscreen samples was significantly higher than that of all other surface types (p<0.0001, ANOVA comparison of 7 surfaces with Bonferroni correction, Fig. II-1B), and did not correlate with biomass (Spearman’s rho=0.0057, Fig. II-1A). The largest axes of beta diversity separated train holds (horizontal and vertical poles, hanging grips), chairs (seat and seat backs), touchscreens, and walls (Fig. 3-2A).

Train line remained only a minor driver of community structure (Fig. 3-2B), and did not dictate overall community composition for either holds (Fig. II-S2A) or seats, once the material of the latter was taken into account (Fig. II-S2B, II-S2C). In particular, the green line seats were upholstered with vinyl, while seats on the orange and red lines were upholstered with polyester.


Figure 3-2. Taxonomic composition of subway microbial communities. All ordinations are principal coordinate analyses using Bray-Curtis distance among filtered OTUs (see Methods), colored by metadata. (A) Subway data by surface, (B) train car data by train line, and (C) touchscreen data by location of machine. (D) Relative abundances of bacterial families across samples from train cars (see Table II-2 for complete data). (E) Relative abundance of bacterial families within stations (complete data as above). Asterisks indicate that the sample was collected on a separate day during the same month as the remaining samples. For station samples, “W” indicates a sample from a ticketing machine wall; all other samples are from the ticketing machine touchscreens.
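
The ordinations in Figure 3-2 rest on pairwise Bray-Curtis distances between samples. As a minimal sketch of that distance only (not the ordination itself, and not the software actually used in this study), the following computes Bray-Curtis dissimilarity between abundance vectors. The sample names and abundances are hypothetical and chosen purely for illustration.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Two empty samples: treated as identical here (a convention, not a rule).
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(samples):
    """Pairwise Bray-Curtis matrix for a dict of {sample_name: abundances}."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical relative abundances over three OTUs (not data from this study).
example = {
    "grip": [0.7, 0.2, 0.1],
    "seat": [0.2, 0.5, 0.3],
    "touchscreen": [0.7, 0.2, 0.1],
}
names, matrix = distance_matrix(example)
```

A matrix of this kind is what a principal coordinate analysis would take as input; samples with identical profiles have distance 0 and samples sharing no taxa have distance 1.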

The location of ticketing machines (e.g. outdoor, indoor, underground) was a primary source of variation between microbial communities on touchscreens (Fig. 3-2C). Univariate analyses using Linear Discriminant Analysis Effect Size (LEfSe) [197] revealed that indoor touchscreens were characterized by genus Acinetobacter, while underground touchscreens had increased levels of genus Corynebacterium and family Tissierellaceae, specifically genus Finegoldia and genus Anaerococcus. Those with outdoor exposures were enriched for class Alphaproteobacteria, including family Acetobacteraceae and genera Methylobacterium, Sphingomonas, and Blastococcus (Table II-3). These results imply that surface type is a major driver of community composition on transit surfaces, and that indoor versus outdoor exposure detectably influences the resident microbial composition of touchscreen surfaces.

Subway microbial communities are largely derived from human skin and oral commensal microbes

Subway microbial clades were generally those found in typical human skin communities [2, 81] (Fig. 3-3Ai) and were dominated by the phyla Firmicutes, Proteobacteria, and Actinobacteria, each of which comprised over 20% of the microbial community, based on 16S data. The Bacteroidetes were much less abundant with an average community abundance of 6% (Table II-S2). The families with the highest mean relative abundances were Staphylococcaceae and Corynebacteriaceae (Fig. 3-2D-E), also typical of skin commensals. Propionibacterium was not observed due to known primer bias [198] but was confirmed later with shotgun metagenomics. The next most abundant taxa were Micrococcaceae, which included genus Micrococcus (found in hair and skin) and genus Rothia (found in the oral cavity [2, 199]), as well as Streptococcaceae (found in the oral cavity) and Pseudomonadaceae. We also observed low proportions of gut and oral commensals such as Lachnospiraceae, Veillonella, and Prevotella.


Figure 3-3. Putative MBTA microbial community sources. (A) i. Ordination of subway surface data jointly with human skin (anterior nares), oral (mixed sites from within oral cavity) and gut (stool) microbiome data from the Human Microbiome Project (HMP). Principal coordinate analysis was performed with weighted UniFrac distance and calculated using OTU relative abundances. ii-iv. Correlations between subway samples and human body sites [200]: ii. skin, iii. oral, and iv. gut, as well as environmental sites: v. air [201] and vi. soil [202]. The x- and y-axes represent mean relative abundance across each data set with standard error bars. For each plot, subway samples (MBTA) are on the x-axis and the potential source community is on the y-axis. (B, C) Microbial SourceTracker [203] was used to identify possible human and environmental sources of subway (B) train and (C) station communities. Relative estimated contribution of each source is plotted per subway sample.


Highly abundant non-human-associated taxa encompassed the order Burkholderiales (3.25%) as well as class Alphaproteobacteria (9.15%), which contains genera Sphingomonas (1.48%) and Methylobacterium (1.14%) and families (1.48%) and Methylocystaceae (0.447%). These Alphaproteobacteria are widespread environmental bacteria with flexible metabolic regimes; Sphingomonads in particular, including the genera Sphingomonas and Sphingobium, are found in soils and sediments and are best studied for their ability to degrade polyaromatic hydrocarbons [204]. Methylobacterium, primarily M. extorquens, is a genus of plant- and soil-associated facultative methylotrophs; these bacteria are highly prevalent on the surfaces of plants, and their diverse metabolic capabilities make them likely to survive in other environments [205]. Enhydrobacter aerosaccus, which is currently classified as belonging to Moraxellaceae but may more aptly be classified as an Alphaproteobacterium [206], was also prevalent in the subway samples.

To determine the microbial clades driving these patterns, we correlated the abundance of subway microbial genera with their abundance in three human body sites [200] as well as air and soil [201, 202] (Fig. 3-3Aii-vi). As expected, the human skin genera Staphylococcus and Corynebacterium (Fig. 3-3Aii), human oral cavity taxon Streptococcus, and human gut-resident genera Bacteroides and Prevotella were abundant on both the subway and their respective body sites (Fig. 3-3Aii-iv). In addition to human-associated taxa, several genera previously observed in indoor air [201] were also abundant on subway surfaces: Sphingomonas, Methylobacterium, Acinetobacter, Streptococcus, Staphylococcus and Corynebacterium (Fig. 3-3Av). In contrast, typical soil genera were rare on subway surfaces (Fig. 3-3Avi). Microbial SourceTracker [203] confirmed these origins based on overall community composition as compared to a variety of reference environments [207] (Fig. 3-3B-C). Only a subset of touchscreen samples included a substantial proportion of environmental microbes (e.g. air and soil), most notably from the Riverside above-ground outdoor ticketing station (Fig. 3-3C).

Propionibacterium phages and the yeast Malassezia globosa dominate the non-bacterial microbial community

Shotgun metagenomic sequencing, which allowed us to profile viral and eukaryotic microbes that cannot be identified by 16S sequencing as well as bacterial taxa that are poorly amplified by the 16S V4 region primers [198], was performed for 24 mass transit samples including 15 train car samples and 9 station samples. In agreement with previous studies of skin ribotypes [81, 208], the most abundant species across all samples was the facultative anaerobe Propionibacterium acnes (mean 47%, max 81%); its average abundance was 29.8% for chairs, 71.6% for grips and poles, and 43.4% for touchscreen surfaces (Fig. 3-4). Other metagenomically assessed bacterial abundances agreed with 16S data, including high levels of family Micrococcaceae (mean 5.3%), Staphylococcaceae (mean 5.28%), Corynebacteriaceae (mean 4.95%), and Streptococcaceae (mean 3.73%), along with non-human-associated taxa, including the soil taxa Geodermatophilaceae (mean 1.22%) and Acinetobacter (mean 0.70%) (Table II-2).


Figure 3-4. Trans-domain taxonomic profiles from subway shotgun metagenomes. Relative abundances of the twenty microbial species with highest mean across 24 metagenomes from train cars and stations. Among colored metadata annotations, train line (green, orange, or red) is indicated for car surface samples and location (indoor or outdoor) for touchscreens. P. acnes is not amplified by the 16S primers used in this study but readily detectable by shotgun sequencing, as are non-bacteria such as Propionibacterium phage.

Eleven non-bacterial species were present at an abundance of ≥0.1% in at least two samples. The most abundant and prevalent viruses included Propionibacterium bacteriophages and oncovirus Merkel cell polyomavirus (a common respiratory infection [198]). The relative abundances of Propionibacterium bacteriophages P100D and P101A showed patterns similar to those of P. acnes, with lower average abundance on chairs (3.2%) and higher abundances on holds (5.4%) and touchscreens (7.9%), suggesting that phage/host relationships are detectable directly from metagenomics. Remaining viruses were found sporadically (in only 2 samples) or had mean relative abundances less than 0.0006% (Table II-2). Many of these viruses were phage that corresponded to abundant bacterial species, including Pseudomonas phage, Lactobacillus phage, Lactococcus phage, Staphylococcus phage 3A, Staphylococcus phage 80 alpha, and Staphylococcus phage phi2958PVL.

The yeast Malassezia globosa [209] also occurred with abundance patterns similar to those of P. acnes, with lower abundance on chairs (0.03%) and higher abundances on holds (0.25%) and touchscreens (0.1%). Both M. globosa and P. acnes show niche-specific adaptation to metabolism of lipid-rich sebum [209, 210] and are commonly found on sebaceous skin sites, which comprise the chest, back, and face [208]. This may indicate that sebaceous skin taxa more easily transfer or adhere to built environment surfaces.

All surface types are dominated by skin microbes, with smaller proportions of oral, gut, and environmental taxa across seats and touchscreens

To identify differentially abundant taxa across metadata categories, we performed a multivariate analysis using MaAsLin [211], which controls for multiple covariates using a generalized linear model (Table II-4). For 16S data, we accounted for built environment type, surface type, material composition, and sample location. For human-associated taxa, seats were particularly enriched in skin taxon Corynebacterium and vaginal taxon Gardnerella, though all contacted surface types had higher relative abundances of Corynebacterium as compared to train walls (Fig. 3-5A). The skin taxon Staphylococcus was also enriched across all surface types except for touchscreens and train walls, and Corynebacterium was negatively associated with vinyl seats relative to polyester seats. Grips were enriched for oral taxa such as Rothia and Veillonella. For non-human-associated taxa, all grips and vertical poles were depleted in class Alphaproteobacteria, as contrasted to their enrichment on outdoor surfaces at the Riverside station (western suburb). These clades included Methylobacteriaceae (grips and vertical poles) and Methylocystaceae (all holds), as well as family Sphingomonadaceae (grips and vertical poles) and genus Amaricoccus (all holds). Because many of these organisms are likely associated with soil particles, it is reasonable that they should be less abundant on surfaces where soil is unlikely to settle.

Figure 3-5. Enrichment of microbial taxa with respect to metadata using multivariate analyses. Each ring represents significant associations of one metadatum with microbial clades using MaAsLin [211] (FDR q<0.25). (A) 16S data. For location, surface category, surface type, and surface material (inner rings to outer rings), the direction of association between taxa and metadata is indicated in red (positive) or green (negative), relative to Alewife, touchscreens, seat backs, and polyester, respectively. (B) Shotgun metagenomic data; only a simplified surface type was represented by sufficiently many samples for analysis. Horizontal poles, vertical poles, and grips were grouped into “holds”, and seats and seat backs were grouped into “chairs”. The direction of association is again indicated by color. Only taxa with at least one association are shown in each cladogram.


For shotgun data, we again used MaAsLin [211] to identify associations between microbial taxa and a single covariate, surface type (Fig. 3-5B, Table II-4). Due to the small number of samples, surface type metadata were grouped into chairs (seat and seat backs), holds (hanging grips, horizontal and vertical poles), and touchscreens. For human-associated taxa, chairs and touchscreens were enriched in multiple species of Corynebacterium (including C. aurimucosum, genitalium, jeikeium, massiliense, pseudogenitalium, tuberculostearicum, urealyticum) and Staphylococcus (S. caprae capitis, epidermidis, haemolyticus, hominis, pettenkoferi); vaginal taxa Gardnerella vaginalis and Lactobacillus (L. crispatus and L. iners); and gut taxa Ruminococcus bromii, Faecalibacterium prausnitzii, and Eubacterium rectale. Touchscreens were particularly enriched in oral species such as Streptococcus (S. cristatus, gordonii, infantis, mitis/oralis/pneumoniae, parasanguinis, sanguinis, thermophilus, tigurinus), Prevotella (P. copri, melaninogenica), and Rothia aeria (also enriched in holds). For non-human-associated taxa, we saw similar patterns as in the 16S data. Touchscreens were enriched in Methylobacteriaceae, Burkholderiales, Sphingomonadales, and Rhodobacteraceae (also enriched in chairs). Many of the non-human-associated taxa that we identified on surfaces are hardy generalists that survive under harsh conditions [212].

Most Corynebacterium species enriched in both chairs and touchscreens have higher (but not statistically significant) abundances in chairs, with the exception of C. kroppenstedtii and C. matruchotii. The lack of oral species on holds may be due to the newfound detection of P. acnes, which is enriched in holds and may affect the relative abundances of rarer taxa. Generally, skin taxa dominate all surfaces, with P. acnes enriched on holds and Corynebacterium and Staphylococcus on chairs and touchscreens. Oral taxa are present on both holds and touchscreens. Non-human-associated taxa remain enriched on touchscreens, which present more exposed surface areas not enclosed within trains.

Metagenomes reflect dominance of Propionibacterium acnes across subway surfaces

Functional genomic profiling using HUMAnN2 quantified 3,975,869 UniRef50 [143] protein families, which were collapsed into 12,074 KEGG Orthology (KO) [213] families. For hypothesis testing, we focused on 604 KOs with mean abundances greater than the overall median abundance and variance across samples in the 90th percentile. MaAsLin identified 590 KOs significantly associated with surface type (q < 0.05): 360 enriched in holds, 204 depleted in holds, 12 enriched in chairs, 4 depleted in chairs, 5 enriched in touchscreens, and 4 depleted in touchscreens, relative to all other surface types (Table II-4).
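
The pre-filter described above (mean abundance above the overall median, variance in the 90th percentile) can be illustrated with a short sketch. This is one plausible reading of the criteria rather than the code used for the analysis. In particular, it assumes that the "overall median" is taken over all values in the table, and that the variance cutoff is computed among the KOs that pass the mean filter. The function name, percentile convention, and KO identifiers are hypothetical.

```python
import statistics


def filter_kos(table, variance_quantile=0.9):
    """Keep KOs whose mean exceeds the overall median and whose variance is high.

    table: dict mapping KO id -> list of per-sample abundances.
    Assumptions (not confirmed by the text): the overall median is taken over
    every value in the table, and the variance cutoff is computed among the
    KOs that pass the mean filter.
    """
    all_values = [x for row in table.values() for x in row]
    if not all_values:
        return []
    overall_median = statistics.median(all_values)

    passed_mean = {
        ko: row for ko, row in table.items()
        if statistics.mean(row) > overall_median
    }
    if not passed_mean:
        return []

    variances = {ko: statistics.pvariance(row) for ko, row in passed_mean.items()}
    ranked = sorted(variances.values())
    # Nearest-rank style cutoff; other percentile definitions differ slightly.
    cutoff = ranked[min(len(ranked) - 1, int(variance_quantile * len(ranked)))]
    return sorted(ko for ko, var in variances.items() if var >= cutoff)


# Hypothetical KO abundances across three samples (not data from this study).
toy_table = {
    "K_a": [10, 10, 10],  # high mean, zero variance
    "K_b": [1, 20, 40],   # high mean, high variance
    "K_c": [0, 0, 1],     # low mean
    "K_d": [5, 9, 13],    # mean just below the overall median
}
selected = filter_kos(toy_table)
```

Filtering on both abundance and variance before hypothesis testing reduces the multiple-testing burden by discarding features that are too rare or too uniform to show a surface-type association.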

Many of the KOs enriched in holds were genes found in the P. acnes genome [214]. These included systems for anaerobic respiration, lipases and esterases for degrading lipids within sebaceous sites, hyaluronate lyase for digesting the extracellular matrix of skin, and fermentation of pyruvate to propionate (Fig. 3-6A). Production of propionate is catalyzed by methylmalonyl-CoA carboxyltransferase, which is enriched in the holds. Porphyrin synthesis is a major function of several Propionibacterium [215], contributing to a range of physiological activities (e.g. potential keratinocyte damage from free radical release [214, 216]) and industrial uses (e.g. synthesis of vitamin B12 [217]). Here, the pathway was represented by several genes from the hem and cbi/cob gene clusters [217, 218]. To verify that the KOs detected above were indeed specific to P. acnes, we removed its contributions to the overall abundance of each UniRef50 family, renormalized, and again identified KOs enriched on different surface types (see Methods). KOs specific to P. acnes metabolism were no longer enriched on holds, with a few exceptions including iron transport (Fig. 3-6A, Table II-4).
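
The removal-and-renormalization step can be sketched as follows, assuming that each gene family's abundance is available stratified by contributing species for a given sample. This is an illustrative sketch and not the pipeline used in this work; the data layout, function name, and family and species labels are all hypothetical.

```python
def remove_species_and_renormalize(stratified, species):
    """Drop one species' contribution per gene family, then sum-normalize.

    stratified: dict mapping gene family -> {species name: abundance}
                for a single sample (hypothetical layout).
    Returns a dict mapping gene family -> relative abundance without `species`.
    """
    residual = {
        family: sum(v for name, v in by_species.items() if name != species)
        for family, by_species in stratified.items()
    }
    total = sum(residual.values())
    if total == 0:
        # Nothing left after removal; report zeros rather than dividing by zero.
        return {family: 0.0 for family in residual}
    return {family: value / total for family, value in residual.items()}


# Hypothetical stratified abundances for one sample (not data from this study).
sample = {
    "family_A": {"P_acnes": 8.0, "S_epidermidis": 2.0},
    "family_B": {"P_acnes": 0.0, "S_epidermidis": 6.0},
}
without_p_acnes = remove_species_and_renormalize(sample, "P_acnes")
```

In this toy sample, family_A is dominant before removal (10 of 16 units) but falls to one quarter of the residual community afterwards, which is the kind of shift that distinguishes species-driven enrichment from community-wide enrichment.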

Figure 3-6. Enrichment of KEGG Orthology (KOs) across MBTA surfaces before and after P. acnes removal. For all heatmaps, rows represent significantly enriched KOs detected through linear regression with MaAsLin, columns represent samples, and cells are colored by sum-normalized reads per kilobase (RPKs) on a log scale. Further metadata is shown as colored bars below the heatmaps. The first colored bar explains the collapsed surface types (second bar), in which chairs include seats (light blue) and seat backs (dark blue), holds include horizontal poles (red), vertical poles (orange), and grips (yellow), and touchscreens are from Riverside (green), Alewife (red), Forest Hills (orange), and South Station (light blue). KOs annotated with yellow circles are found before and after P. acnes removal. (A) Selected KOs enriched in holds only are specific to and colored by P. acnes metabolic function. (B) Selected KOs specific to oxidative phosphorylation and photosynthesis are shown before (above) and after (below) P. acnes removal. Direction of association between KO abundances and surface types, relative to holds, is shown as green ‘+’ (positive) or red ‘-’ (negative) to the left of the heatmap. Columns are colored by metadata as in Fig. 3-2.

Many KOs associated with oxidative phosphorylation and photosynthesis were enriched in chairs and touchscreens relative to holds before removal of P. acnes. These included NADH dehydrogenase I subunits (EC:1.6.5.3), ferredoxin-NADP+ reductase (involved in photosystem I, EC:1.18.1.2), ATPase subunits (EC:3.6.3.14), and cytochrome c oxidases (EC:1.9.3.1). After depletion of P. acnes-derived processes, ferredoxin-NADP+ reductase and F-type H+-transporting ATPase subunits were enriched only on chairs, while cytochrome c oxidase subunits and NADH dehydrogenase subunit types and Fe-S proteins were enriched only on touchscreens (Fig. 3-6B). Increased numbers of electron transport chain components may indicate more aerobic respiration, or the presence of eukaryotic DNA (as detected by chloroplasts or mitochondria). Notably, high levels are found across all KOs for the horizontal pole from the Red Line and the outdoor touchscreen from Riverside station, although it is unlikely that these trends were completely eukaryotic. Riverside station touchscreen 16S profiles included only 4.04% chloroplast classified sequences, and overall holds included for shotgun sequencing had the highest average proportions of chloroplast, followed by chairs and touchscreens. Thus, presence of more electron transport chain components may also reflect a metabolic strategy enriched among persisters in the built environment, especially relevant to the touchscreens’ Alphaproteobacteria.

Minimal pathogenic and antibiotic resistance presence on the Boston transit system

To detect antibiotic resistance factors in MBTA metagenomes, we used ShortBRED [219] to create high-precision sequence markers from the Comprehensive Antibiotic Resistance Database (CARD) [220]. This resulted in 2,657 antibiotic resistance gene (ARG) markers for 792 ARGs in CARD, but only 46 ARG markers were detected with RPKMs greater than 0 in at least two samples. This is notable because the average read depth of our samples was 9.8×10^6 reads (0.989 Gnt), but the average RPKM per sample for these markers was only 1.172, ranging from 0 to 46.67. Similarly, a low abundance of ARGs (<0.3% of total reads mapped to the Antibiotic Resistance Database) was found in the Home Microbiome Project [92]. Our hits included several resistance mechanisms, including efflux pumps, antibiotic target modification or replacement, antibiotic inactivation, and changes in nucleic acid machinery (rpoB or par genes) (Fig. 3-7A).
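
For readers unfamiliar with the unit, the sketch below shows the generic reads-per-kilobase-per-million (RPKM) calculation and the "detected in at least two samples" criterion used above. It is a generic illustration, not ShortBRED's internal implementation; the function names, marker names, and example values (other than the average read depth quoted in the text) are hypothetical.

```python
def rpkm(read_count, marker_length_nt, total_reads):
    """Reads per kilobase of marker per million reads in the sample."""
    if marker_length_nt <= 0 or total_reads <= 0:
        raise ValueError("marker length and total reads must be positive")
    # 1e9 = 1e3 (nucleotides per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (marker_length_nt * total_reads)


def detected_markers(rpkm_table, min_samples=2):
    """Markers with RPKM > 0 in at least `min_samples` samples.

    rpkm_table: dict mapping marker name -> list of per-sample RPKM values.
    """
    return sorted(
        marker for marker, values in rpkm_table.items()
        if sum(v > 0 for v in values) >= min_samples
    )


# Hypothetical marker: 10 mapped reads on a 500 nt marker, in a sample at the
# average read depth reported in the text (9.8 x 10^6 reads).
example_rpkm = rpkm(10, 500, 9_800_000)

# Hypothetical per-sample RPKMs for three markers across three samples.
toy_rpkms = {
    "marker_1": [0.0, 1.2, 3.4],
    "marker_2": [0.0, 0.0, 5.0],
    "marker_3": [0.0, 0.0, 0.0],
}
kept = detected_markers(toy_rpkms)
```

Normalizing by both marker length and sequencing depth is what allows abundances to be compared across markers of different sizes and samples of different depths.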

Figure 3-7. Quantification of antibiotic resistance marker and virulence factor abundances on subway surfaces. (A) Antimicrobial resistance markers (rows) quantified in metagenomes by ShortBRED [219] and annotated by antibiotic target through the Antibiotic Resistance Ontology in the CARD database. (B) Virulence factors (rows) likewise quantified and manually annotated by virulence function through keywords on the VFDB web site. For both heatmaps, columns (samples) are arranged as in Fig. 3-6.

To contextualize ARG enrichment (or rather depletion) in this environment, we further compared the Boston subway to ARGs in the air microbiome from several other built environments [221] as well as from 552 stool samples from individuals in the United States, China, Malawi, and Venezuela [2, 222, 223]. For consistency with previous surveys, we used ShortBRED to generate 4,132 ARG markers for 849 ARGs in the Antibiotic Resistance Database (ARDB). Both the air microbiome and Boston subway samples had noticeably lower RPKM levels than typical human stool (Fig. II-3). The gut microbiome has repeatedly been observed [224] to be enriched for tetracycline resistance, beta-lactamases, and MFS/RND efflux pumps, whereas none of these were substantially present in the MBTA and only low levels of tetracycline and beta-lactamase resistance were found in indoor air [221].

To similarly assess virulence factors in the MBTA, we created sequence markers from the Virulence Factor Database (VFDB) [225], resulting in 7,869 markers for 2,089 factors. 54 markers were detected with RPKMs greater than 0 in at least two samples. The average RPKM per sample was 0.240, ranging from 0 to 23.74. All of the putative virulence factors, with the exception of the alpha and beta-hemolysin proteins found in S. aureus, are opportunistic factors typical of normal microbial life. For example, many proteins were classified as part of pathogenicity islands; however, most of these proteins are transposases, integrases, and repetitive regions (Fig. 3-7B). Other hits were annotated with functions in adherence, antiphagocytosis, and secretion systems, but consisted of cell surface proteins such as lipopolysaccharides, capsule polysaccharide proteins, and flagellar proteins. This indicates that the real pathogenic potential detected in the Boston subway is very low. Overall, the Boston subway has minimal antibiotic resistance and virulence factor presence.

Discussion

Here, we report on the microbial profile of the Boston metropolitan transit system. Previous studies have characterized the Hong Kong and New York subway aerosol communities [193, 194], as well as surfaces in the New York subway [195], but we believe this to be the first to determine how space utilization by passengers, surface type, and material composition individually affect microbial ecology. We further describe the microbial community metabolic potential across surface types and metagenomically assess the absence of pathogenic potential. The former primarily reflected P. acnes pathways on holds and aerobic respiration on seats and touchscreens; resistance and virulence factors among the latter were depleted relative to environments such as the human microbiome.

Surface type was the major driver of variation in composition, lending support to three potential hypotheses: differences may be driven by 1) human body interactions [192], 2) material composition of these surfaces, which may enhance microbial adherence and growth, or 3) a combination of the two factors. Our data support the third hypothesis. First, we observed a significant enrichment of oral microbes on horizontal poles and grips, which may be higher up and closer to riders’ faces or reflect transfer through skin-mediated contact (Fig. 3-1C). Second, both 16S and shotgun data showed enrichment of vaginal commensals in seat surfaces, which may be transmitted through clothing. Third, we found that seats were enriched in vaginal and oral taxa relative to seat backs, and outdoor touchscreens were enriched in Alphaproteobacteria relative to indoor touchscreens. If surface material were the only driver of microbial composition, seats vs. seat backs and indoor vs. outdoor touchscreens should have similar taxonomic profiles. Surface material certainly plays at least a partial role, however, as we observed decreased Corynebacterium in vinyl seats as compared to polyester seats. Overall, our observations indicate that both human body interactions and surface material shape community composition, with the former as the stronger driver.


Previous studies of the transit microbiome, particularly those of New York [195] and Hong Kong [194], have also observed environmental exposure to be an additional driver of its microbial community composition. Afshinnekoo et al, for example, found that samples’ human DNA reflected census demographics for the surrounding region, although we saw no differentiation at the microbial level among Boston train lines serving suburbs with different ethno-demographics. We primarily observed the impact of environmental exposure on outdoor touchscreens, in agreement with Leung et al’s higher alpha diversities for outdoor stations in Hong Kong. The surfaces we investigated are near-uniformly exposed to high volume and diversity of rider interaction. This frequent human contact could homogenize many potential influences on microbial populations, such as demographics or weather. Since the body sites used for contact, indoor/outdoor location, and material composition remain consistent, these exposures would thus shape the taxonomic differences we observed across the Boston subway.

There are few non-opportunistic pathogens in the built environment outside of hospitals [226]. None were reported for restrooms [93], classrooms [192], or Hong Kong subway aerosols [194], possibly due to lack of phylogenetic resolution with 16S sequencing. During partial genome assembly from home [92] and restroom [227] surface metagenomes, shotgun sequencing facilitated identification of opportunists with pathogenic potential, but even with this increased resolution, outright virulence factors were rare. Robertson et al detected no human pathogens using Sanger and pyrosequencing in New York subway aerosols [193]. Furthermore, although Afshinnekoo et al reported that 12% of taxa represented known pathogens in the National Select Agent Registry and PATRIC database, this database uses an extremely broad definition of “pathogen,” and these results were later refuted [196]. Our study assessed whether typical subway microbial communities were unusual in their carriage or transfer of antibiotic resistance genes and virulence factors. We detected low numbers of these genes, and they were present at drastically lower amounts than observed in the human gut.

One goal of studying the microbiology of the built environment is to establish a baseline against which deviations can be used to detect potential public health threats. As with the human microbiome, however, inter-subject variability appears to be quite high in built environments (e.g. buildings) and in transit systems, and both greater cross-sectional breadth and longitudinal depth are still necessary. All subway microbiome papers to date have detected a high level of skin-associated genera. In addition to this work, Leung et al (Hong Kong subway aerosols) included Micrococcus (4.9%), Enhydrobacter (3.1%), Propionibacterium (2.9%), Staphylococcus, and Corynebacterium (1.5%), while Robertson et al detected high levels of families Staphylococcaceae, Moraxellaceae, Micrococcaceae, Enterobacteriaceae, and Corynebacteriaceae. Afshinnekoo et al in the New York subway is the only major exception, with the most abundant organisms instead including Pseudomonas stutzeri, Acinetobacter, and Stenotrophomonas. If microbes shed from skin (or still resident on shed skin cells) do dominate mass transit environments, it must be determined whether these microbes are deposited, dormant, or actively growing, or whether they can be stably transferred from one individual to another.

As in other built environments, however, human-associated microbes are by no means the only apparently functional community residents even when abundant. Notably, our wall samples, which are not consistently touched but are in the presence of high human density, have lower biomass and different microbial compositions from other train surfaces. Establishing a "typical" microbial baseline for mass transit environments will require thoughtful sample design that controls for local space properties, short- and long-term temporal variation (e.g. time of day and season), and cross-sectional differences within and between cities. It may also prove useful to monitor for a combination of normal versus undesirable organisms and metabolic or functional profiles, as the latter have been observed to be more stable than the former in the human microbiome [2]. In some cases specific pathogens may be easier to detect; in others (e.g. when individual pathogens may be extremely low density), structural, functional, or metabolic shifts may be better indicators of changing transit profiles and, consequently, health hazards. In all such cases, future studies should incorporate expertise from architecture, engineering, public health, microbiology, and ecology, thus allowing both confident and interdisciplinary analyses as well as institutional changes in response to scientific findings.

In conjunction with other published investigations, this work helps to characterize the “urban microbiome” and, in doing so, adds to our understanding of how these microbial communities are formed, maintained, and transferred. Such studies fall in a critical space between environmental and human-associated microbial ecology, and as such must address the challenges of both. These include study designs with rich metadata, including architectural features, human contact, environmental exposure, surface type, and surface material; accounting for a wide range of potential biochemical environments, contaminants, and biomass levels; and the involvement of institutional review boards, city officials, and engineers as appropriate. Future work will help to determine which urban microbes are viable and resident (as opposed to transient), as well as identifying the mechanisms utilized to persist in the built environment. It will also be important to identify microbes that can be transferred between people via specific fomites, since this especially has the potential to inform public health and policy (by monitoring organisms, gene content, or both). A greater understanding of these processes may thus eventually lead to construction of built environments that enhance and maintain human health.

Materials and Methods

Study permissions

The Massachusetts Bay Transportation Authority (MBTA) approved all aspects of transit system sampling and gave permission to the Harvard T.H. Chan School of Public Health to conduct this study (Fig. II-4). Additional support was provided by the MBTA Police, who accompanied the study team during sample collection. A written description of the protocols and study goals was distributed to interested MBTA passengers during sampling.

Sample collection

Samples were collected in 2013 on May 16, May 23, and October 22 from the public transit system serving metropolitan Boston during normal workday hours. Train car sampling began at the outmost termini of train routes (Alewife Station on the Red Line, Riverside Station on the Green Line, and Forest Hills Station on the Orange Line). Trains were sampled as they proceeded inbound towards the city center. Station samples were collected by swabbing the touchscreens and sides of ticket machines at five stations (Fig. 3-1).


For all samples, we recorded the sampling date, outdoor air temperature and relative air humidity, location, surface type (seat, seat back, horizontal pole, vertical pole, hanging grip, wall, or touchscreen), and material composition (polyester and vinyl (seats and seat backs), stainless steel (poles), PVC (grips), combination of wood, engineered wood, extruded thermoplastic, fiber reinforced plastic, aluminum honeycomb panel, melamine-finished aluminum panels reinforced with Kevlar (walls), or coated glass (touchscreens)). For train car samples, we recorded the within-train location of sample collection (end or middle of car), as well as the train line and location along the route when sample was collected. For station samples, we recorded the location of each ticketing machine (indoor, outdoor, underground) and the side of the touchscreen swabbed (right, left, both).

All metadata are described in Table II-1 and, where possible, metadata terms from the Minimum Information Standards for the Built Environment (MIxS-BE) were used [228]. Weather information was compiled from weather archives from the National Oceanic & Atmospheric Administration [229] and Weather Underground (KBOS [230]).

Swab collection and processing

DNA-free cotton swabs (Puritan, Maine, USA) were used for collection in this study. Each swab was dipped into a swabbing solution prepared from 0.15 M NaCl and 0.1% Tween 20, as used in previous studies [81, 192, 201, 231]. All surfaces were swabbed for approximately 15 seconds, and each surface was sampled 2 or 3 times with separate swabs over non-overlapping regions. Swabs were stored together in 15 mL Falcon tubes on ice for no more than one hour before being taken to a central location and stored on dry ice. All samples were transported directly from dry ice to a -80°C freezer for storage.

DNA extraction, 16S amplicon sequencing, and operational taxonomic unit (OTU) calling

Samples were processed using the MoBio PowerLyzer PowerSoil DNA extraction kit (MO BIO Laboratories, Inc.). For each sample, 2 or 3 swabs from the same site were pooled for optimal biomass recovery. Amplification and sequencing by Illumina MiSeq were performed as described previously by Caporaso et al. [232]. OTU tables were constructed with Quantitative Insights into Microbial Ecology (QIIME) software [233] version 1.8 using closed-reference OTU picking (pick_closed_reference_otus.py) with Greengenes reference version 13.5 at the 97% identity level. We filtered low-abundance OTUs (minimum abundance threshold 0.001 in at least one of 72 samples). Because the primers used in the study were designed to amplify bacterial 16S genes, we filtered out OTUs that corresponded to chloroplasts, mitochondria, and archaea. This reduced the dataset to 2,134 unique OTUs representing 501 unique genera. OTU frequencies in samples were then sum-normalized to proportional data (Table II-2). Further details can be found in the Supplemental Information.
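The abundance filter and sum-normalization described above can be illustrated with a minimal Python sketch. This is not the QIIME workflow used in the study. The function names, the toy counts, and the choice to renormalize after filtering are assumptions made for illustration only.

```python
def sum_normalize(counts):
    """Convert one sample's OTU counts to proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {otu: 0.0 for otu in counts}
    return {otu: n / total for otu, n in counts.items()}


def filter_otus(samples, min_abundance=0.001, min_samples=1):
    """Keep OTUs whose relative abundance reaches min_abundance in at
    least min_samples samples, then renormalize each sample (assumed order)."""
    rel = {name: sum_normalize(counts) for name, counts in samples.items()}
    otus = {otu for counts in samples.values() for otu in counts}
    kept = {
        otu for otu in otus
        if sum(1 for r in rel.values() if r.get(otu, 0.0) >= min_abundance)
        >= min_samples
    }
    filtered = {
        name: {otu: n for otu, n in counts.items() if otu in kept}
        for name, counts in samples.items()
    }
    return {name: sum_normalize(counts) for name, counts in filtered.items()}


# Hypothetical toy counts (not data from the study).
toy = {
    "seat_1": {"otu_a": 900, "otu_b": 99, "otu_c": 1},
    "pole_1": {"otu_a": 500, "otu_b": 500, "otu_c": 0},
}
# A stricter threshold than the study's 0.001 is used so the toy data shows an effect.
profiles = filter_otus(toy, min_abundance=0.01, min_samples=1)
```

Here otu_c never reaches the threshold in any sample and is dropped, after which each sample is rescaled to sum to 1. Setting min_samples=7 would mirror the stricter filter applied before the univariate and multivariate tests.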

Analysis methods

Alpha diversity was calculated using the inverse Simpson diversity index in the R package ‘vegan’ [234]. Ordinations were calculated by principal coordinate analysis (PCoA) using Bray-Curtis dissimilarity, unless otherwise noted, on the relative abundance table generated above. For univariate and multivariate tests, we further filtered OTUs (minimum abundance threshold 0.001 in at least seven of 72 samples). A univariate test for taxa differentially abundant with respect to touchscreen location was performed using LEfSe [197]. For this analysis, each metadata category was tested using alpha values of 0.05 for both the Kruskal-Wallis and Wilcoxon tests with one-against-all comparison and an LDA effect size cutoff of 2.0. Significant taxa-metadata univariate associations are listed in Table II-3. Multivariate association tests for taxa that were differentially abundant with respect to metadata were performed using MaAsLin [211]. For this analysis, we used four metadata categories: locale (train or station), surface type (e.g. seat, seat back), surface material (e.g. polyvinyl chloride, carpet), and location (e.g. Forest Hills Station, Orange Line train).

Microbial source prediction was performed using Microbial Sourcetracker [203] with data from human and environmental sites in Hewitt et al. [207]. GraPhlAn [235] was used for visualization of associations and phylogenetic relationships.
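For readers unfamiliar with the two quantities used above, the following Python sketch computes the inverse Simpson index and the Bray-Curtis dissimilarity from their textbook definitions. It is illustrative only. The analysis itself used the R package ‘vegan’, and the toy vectors are hypothetical.

```python
def inverse_simpson(abundances):
    """Inverse Simpson index: 1 / sum(p_i^2), with p_i the proportion of taxon i."""
    total = sum(abundances)
    ps = [a / total for a in abundances if a > 0]
    return 1.0 / sum(p * p for p in ps)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


# Hypothetical relative abundance profiles.
even = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]

d_even = inverse_simpson(even)      # equals the number of taxa when perfectly even
d_skewed = inverse_simpson(skewed)  # approaches 1 when one taxon dominates
bc = bray_curtis(even, skewed)
```

The inverse Simpson index ranges from 1 (a single dominant taxon) to the number of taxa (perfect evenness). Bray-Curtis ranges from 0 (identical profiles) to 1 (no shared taxa).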

Shotgun library sequencing and quality control

DNA was extracted using the MoBio PowerLyzer PowerSoil DNA extraction kit (MO BIO Laboratories, Inc.) as described for 16S sequencing libraries. Only samples with at least 80 ng/µL were selected and sent to the Broad Institute for shotgun library construction. Libraries were constructed using the Illumina Nextera XT method and sequenced on the Illumina HiSeq 2000 platform with 100 bp paired-end (PE) reads. The sequencing depth was 16.7 × 10^6 PE reads per sample. The KneadDATA v0.3 pipeline (http://huttenhower.sph.harvard.edu/kneaddata) was used to remove low quality reads and human host sequences. Further details can be found in the Supplemental Information.


Taxonomic and functional profiling of metagenomes

Pan-microbial (bacterial, archaeal, viral, and eukaryotic) taxonomy was determined using MetaPhlAn2 [136] (http://huttenhower.sph.harvard.edu/metaphlan2). We identified 1,340 microbial clades comprising 499 species (Table II-2) and retained those with relative abundance ≥ 0.1% in at least two samples for downstream multivariate analysis with MaAsLin [211]. For all MaAsLin analyses involving shotgun taxonomic and functional profiles, we used one metadata category: collapsed surface types, which included chairs (seats and seat backs), holds (grips, horizontal and vertical poles), and touchscreens.

Functional genomic profiles were generated with HUMAnN2 version 0.3.0 [148] (http://huttenhower.sph.harvard.edu/humann2), which leverages the UniRef [143] orthologous gene family catalog, along with the MetaCyc [144], UniPathway [236], and KEGG [139] databases. HUMAnN2 gives three outputs: 1) UniRef proteins and their abundances in reads per kilobase (RPK), 2) MetaCyc pathways and their abundances in RPK, and 3) MetaCyc pathways and their coverage, ranging from 0 to 1. HUMAnN2 further calculates the RPK and coverage of each UniRef protein and MetaCyc pathway for each microbial taxon observed by MetaPhlAn2.

To look at the functional profile, we collapsed 3,975,869 UniRef50 protein families into 12,074 KEGG Orthology (KO) numbers. UniRef50 proteins that did not belong to any KOs were not analyzed further. We sum-normalized KO RPKs and focused on KOs with mean abundance greater than the overall median abundance and variances in the 90th percentile. We identified KOs that were significantly enriched in chairs, holds, and touchscreens using MaAsLin [211] with a false discovery rate (FDR) < 0.05. KO differences between surface types were heavily influenced by the presence of Propionibacterium acnes. To remove this influence, we subtracted the P. acnes RPK contribution from each UniRef50 protein and then re-summed the overall UniRef50 RPK from the remaining taxa. UniRef proteins were again collapsed into KOs and subjected to the analysis described above. We then compared KOs that were significantly enriched in chairs, holds, and touchscreens before and after P. acnes removal. Tables with KO RPKs are at http://huttenhower.sph.harvard.edu/MBTA2015.
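The removal of a single taxon's contribution followed by re-collapsing into KOs can be sketched as below. This is a minimal illustration, not the study's actual scripts. The data layout (a per-family dictionary of per-taxon RPKs), the one-to-one family-to-KO mapping, and every family, KO, and function name are hypothetical.

```python
from collections import defaultdict


def collapse_to_ko(stratified_rpk, family_to_ko, exclude_taxon=None):
    """Sum per-taxon RPKs within each gene family (optionally skipping one
    taxon), then add each family total to its KO."""
    ko_rpk = defaultdict(float)
    for family, by_taxon in stratified_rpk.items():
        ko = family_to_ko.get(family)
        if ko is None:
            continue  # families without a KO are not analyzed further
        ko_rpk[ko] += sum(
            rpk for taxon, rpk in by_taxon.items() if taxon != exclude_taxon
        )
    return dict(ko_rpk)


def sum_normalize(values):
    """Rescale values so they sum to 1."""
    total = sum(values.values())
    return {k: (v / total if total else 0.0) for k, v in values.items()}


# Hypothetical stratified RPKs for one sample.
stratified = {
    "family_X": {"Propionibacterium_acnes": 8.0, "Staphylococcus_epidermidis": 2.0},
    "family_Y": {"Propionibacterium_acnes": 1.0, "Streptococcus_mitis": 3.0},
    "family_Z": {"Propionibacterium_acnes": 5.0},  # no KO assigned
}
mapping = {"family_X": "KO_1", "family_Y": "KO_2"}  # hypothetical mapping

with_pacnes = collapse_to_ko(stratified, mapping)
without_pacnes = collapse_to_ko(
    stratified, mapping, exclude_taxon="Propionibacterium_acnes"
)
normalized = sum_normalize(without_pacnes)
```

In this toy sample KO_1 dominates only because of the P. acnes contribution. After exclusion the ranking of the two KOs reverses, which is the kind of shift the before-and-after comparison is designed to expose.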

Identification and quantification of antibiotic resistance and virulence factor gene markers

Antibiotic resistance gene markers were generated with ShortBRED (Short, Better Representative Extract Dataset) [219] from the Comprehensive Antibiotic Resistance Database (CARD) [220] using UniRef90 [237] as a reference. ShortBRED virulence factor markers were generated from the Virulence Factor DataBase (VFDB) [225] using UniRef50 [237] as a reference (due to the availability of a previous version of these markers). ShortBRED maps the shotgun reads against the markers and returns normalized marker abundances as reads per kilobase per million reads (RPKM). We aggregated and annotated antibiotic resistance gene markers using the antibiotic resistance ontology (ARO) numbers in CARD.
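As a point of reference, the generic RPKM normalization and the aggregation of markers into ontology groups can be sketched in Python as follows. This follows the textbook definition of RPKM. ShortBRED's own normalization may differ in detail (for example, in how marker length is adjusted), and the marker names, group labels, and counts below are hypothetical.

```python
from collections import defaultdict


def rpkm(read_count, marker_length_bp, total_reads):
    """Reads per kilobase of marker per million reads in the sample."""
    kilobases = marker_length_bp / 1000.0
    millions = total_reads / 1_000_000.0
    return read_count / kilobases / millions


def aggregate(marker_rpkm, marker_to_group):
    """Sum marker-level RPKMs into groups (e.g. one group per ARO number)."""
    totals = defaultdict(float)
    for marker, value in marker_rpkm.items():
        totals[marker_to_group[marker]] += value
    return dict(totals)


# Hypothetical example: 50 reads hit a 500 bp marker in a 10 million read sample.
example = rpkm(read_count=50, marker_length_bp=500, total_reads=10_000_000)

marker_values = {"marker_1": 10.0, "marker_2": 2.5, "marker_3": 1.0}
groups = {"marker_1": "group_A", "marker_2": "group_A", "marker_3": "group_B"}
by_group = aggregate(marker_values, groups)
```

Normalizing by both marker length and sequencing depth is what allows marker abundances to be compared across samples and, with care, across datasets.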

To facilitate cross-dataset comparison, we also generated 121 bp markers with ShortBRED from the Antibiotic Resistance Database (ARDB) [238] using UniRef50 [237] as a reference and aggregated these markers at the ARDB family level. We compared the distribution of antibiotic resistance gene markers in our dataset to four previously profiled shotgun datasets describing the gut microbiomes of 552 individuals from the United States [2, 223], China [222], Malawi [223], and Venezuela [223], as well as one shotgun dataset profiling air microbiomes in a home, hospital (indoors and outdoors), pier, and offices (indoors and outdoors) [221]. Virulence factors were annotated using VFDB ontologies available at http://www.mgc.ac.cn/VFs/main.htm. ShortBRED results can be found in Table II-5.

Accession numbers

Raw sequence files were deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA301589.

Acknowledgements

We thank the MBTA Transit Police Department, specifically Chief Paul MacMillan and Detective Matthew Haney, for their support of this project. We are also grateful to MBTA police officers Tommy O’Connor and Lieutenant David F. Albanese for their assistance during sample collection. We also thank Sydney Lavoie and Gerrod Voit for additional laboratory and computational assistance, and Boyu Ren and Koji Yasuda for helpful feedback and discussion.

Jessica L. Green would like to disclose her affiliation as CTO of Phylagen, Inc., which does not conflict with the study. The authors declare no conflict of interest.


Chapter 4:

Conclusions

To understand a given microbial community, there are two major questions to be answered: “Who is there?”, followed by “What are they doing?” DNA sequencing has proven to be a powerful tool for answering these questions. It has the capability of surveying thousands of organisms and millions of genes relatively quickly, but is limited in its ability to track microbial activity. In addition, the size of the resulting datasets restricts most analyses to identification of associations between microbial abundances and metadata, or a search for biomarkers or keystone species. Understanding the complexity underlying these trends must begin with i) characterizing the stability of the observed trend and ii) determining its activity and its effect within and outside the microbial community. The former may be established via comprehensive time-series sampling, while the latter may be achieved through the combination of DNA sequencing with other ‘omics’, such as transcriptomics, proteomics, and metabolomics, or through wet laboratory experiments.

In Chapter 2, we introduced WAAFLE, which is the first method for detection of de novo LGT events from metagenomes. A tool that can utilize whole metagenome shotgun (WMS) sequencing data is important, since the majority of tools for LGT detection are optimized for full genomes. With such tools, identifying novel LGT events requires constant sequencing of whole genomes, which is achievable for clinical isolates but difficult for single organisms within a complex community. The direct use of metagenomes allows for LGT profiling of older datasets in the context of a community (as opposed to cultured isolates), which may affect LGT activity. We next demonstrated proof of concept by applying WAAFLE to the Human Microbiome Project Phase 1-II. There are limits to what we can detect: first, potential misassemblies based on read coverage disproportionately affect contigs classified as LGT, and second, short contig lengths limit detection of plasmids, unless there are novel rearrangements within them. Still, we were able to identify high frequency LGT pairs across six major body sites, which increased in frequency with shorter phylogenetic distances and higher taxonomic abundances. Most pairs were also specific to environment (body site), though the buccal mucosa, supragingival plaque, and tongue dorsum shared pairs with differential abundances. As expected, enriched functions in LGT contigs included mobile elements such as transposons and phage, along with GMP synthases and TonB outer membrane receptors.

Immediate next steps include characterizing LGT stability over time, as well as determining how LGT frequency varies with disease and environment. Both approaches require datasets with specific study designs: the former requires time-series data while the latter requires case/control cohorts, or samples collected from the built environment or environmental sources. Applying WAAFLE to these datasets will help quantify LGT rates, which may occur at the scale of minutes, days, or months, as well as determine how LGT rates change with disease, or how they might be associated with cohort metadata (such as dietary intake or drug administration). Analyses of whole genomes have shown that LGT rates are likely higher in human-associated versus non-human-associated environments: further work may identify the taxa and functions responsible for increased LGT. Still, computational detection of events at the DNA level does not indicate active use of transferred genes. To quantify LGT activity, WAAFLE results should be combined with other ‘omics data in order to find actively transcribed or translated LGT products. Results may also be combined with wet laboratory procedures such as qPCR or transformation, to validate the presence or activity of transferred genomic segments, respectively. Furthermore, attempts to induce LGT within culture may help identify conditions such as abiotic/biological stress, specific spatial structure, or proximity of select taxon partners that might favor LGT.

In Chapter 3, we described microbial communities on the Boston subway, which were mostly derived from human skin and oral sites. Samples were collected from trains on the Red, Orange, and Green Lines, as well as ticketing machines at Alewife, Park Street, South Station, Forest Hills, and Riverside. The original intent of the study design was to see if microbial communities might vary based on the demographic served. Instead, microbial communities on trains mostly varied by surface type, likely due to rider interactions such as sitting on seats or touching the ticketing machines, while microbial communities on touchscreens varied mostly by indoor or outdoor location. Functional profiles were dominated by systems for anaerobic respiration and porphyrin synthesis, which reflected the high abundance of Propionibacterium acnes. Overall, the number of antibiotic resistance genes was lower than that found in the human gut.

Future directions include identifying the stability of these high-traffic spaces as well as determining the proportion of live, dormant, and dead microbes. The former will require sampling the subway at regular intervals over a longer period of time. This sampling strategy will enable us to determine if there is a consistent built-environment microbiome: if so, fluctuations may be useful indicators of disease outbreak, or simply indicators of changing seasons, or both. For the latter, microbial viability may be measured using a variety of methods, including sample treatment with propidium monoazide or cell sorting to distinguish between DNA from intact versus dead cells, isolation of RNA rather than DNA for transcriptomic activity, identification of protein synthesis through fluorescent click-chemistry (such as BONCAT), or measurement of cellular activity through ATP assays. Multiple methods will need to be tested, as contamination and low biomass are common problems for built-environment samples. Furthermore, if the majority of microbes in built-environment samples are dead, then profiling should shift from looking at microbial taxa to looking at metabolites, or microbial components such as pathogen-associated molecular patterns (PAMPs), which may stimulate the human immune system.

Long-term goals include understanding how LGT affects microbial evolution and how the built environment influences human health, especially immune development. It is unclear what role LGT plays in speciation and whether that role differs today versus the evolutionary past. Still, it is clear that LGT has a clinical impact, especially in the rise of antibiotic resistance. If we can identify the conditions under which LGT occurs, as well as the specific gene segments and taxa participating in transfer, it may be possible to use LGT to alter microbial community structure or processes, or predict short-term microbial evolution (especially for pathogens). Some work has also suggested that LGT helps maintain bacterial species: thus, a better understanding may help refine the “species concept” for bacteria, leading to better taxonomic assignment and calculation of phylogenetic distance.

In contrast, to characterize the effects of the built environment on human immune development, studies should move beyond single built-environment types, and begin i) comparing same-purpose built-environment structures that lead to differential health outcomes, or ii) comparing different built-environment structures to identify their similarities and differences. An example of the former involves surveying nursing homes with varying survival rates, while an example of the latter would include examining a rural home versus an urban home. Study of the former could identify aspects of building design that might facilitate better health outcomes through microbial community modulation, such as increased ventilation or changes in hygiene. Study of the latter can help establish a baseline built-environment microbiome, likely to be skin and oral microbes, and determine which microbes or PAMPs could potentially be introduced. This is especially important if constant exposure to skin and oral-derived microbes leads to adverse health outcomes. Multiple diseases have been associated with the microbiome, of which a subset is linked to Western lifestyle and diet. This has led to an extensive search for therapeutics to modulate the human microbiome. A better understanding of LGT and the built-environment microbiome may help spur therapeutics, and highlight adaptive mechanisms used by the microbiota and host to adjust to the “new normal.”


Appendix I:

Supplemental Materials for Chapter 2


Supplemental Figures

Figure I-1. Filtering potential misassemblies. To search for misassemblies, shotgun reads were mapped to contigs using Bowtie2. We then examined both read coverage (Step 1) and read support (Step 2) for gene junctions, or the regions between two genes on a contig. Genes containing any single junction that fails both steps 1 and 2 are removed from analysis.


Figure I-2. Determining which contig types contain misassemblies. In A), we show the percentage of contigs filtered out via read mapping, stratified by whether WAAFLE classified them as LGT or not. We find that more LGT contigs are filtered out, as expected. In B), we examine the gene junction type and determine what percentage have read support. Here “AA” junctions are defined as gene junctions between two genes annotated to the same taxon, while “AB” junctions are defined as gene junctions between two genes annotated to different taxa. As expected, junctions between genes annotated to different taxa have less read support.


Figure I-3. Gene call evaluation. To assess how well WAAFLE calls genes, we varied the subject coverage threshold (for including a BLAST hit), gene length threshold (above which to include the gene), and minimum overlap (above which to merge a BLAST hit into a gene group).

Figure I-4. LGT evaluation with or without missing BLAST hits. We show the TPR against the FPR for the LGT evaluation with 20% of BLAST hits removed (on left) versus the evaluation with all BLAST hits (on right).


Figure I-5. Selection of k1 and k2. As in Fig. 2-2, we show the LGT evaluation for WAAFLE with 20% of BLAST hits removed. On the left, we hold k2 at 0.8 while we vary k1 from 0.1 to 0.9 (blue line represents default k1 chosen). On the right, we hold k1 at 0.5 while we vary k2 from 0.1 to 0.9 (red line indicates default k2 chosen). In A), colors indicate the inter-taxon level for LGT, for example, “species” in red shows the TPR and FPR for inter-species LGT across different k1 and k2 values. In B), colors indicate the taxonomic level at which WAAFLE is evaluated for taxonomic assignment. For example, “species” in red indicates the percentage of correct species calls in LGT contigs.


Figure I-6. Comparison of LGT measures. We attempted to quantify LGT frequencies per sample using two methods: 1) the number of LGT contigs divided by the total number of genes, and 2) the number of genes in LGT contigs divided by the total number of genes. Initially, we were concerned that the former might overestimate LGT in samples with many short contigs, while the latter might overestimate LGT in samples with many long contigs. In this plot, each point represents an LGT taxon pair in a body site. The x-axis is the first measure, while the y-axis is the second measure. We found that the two measures were highly correlated within body sites, indicating that higher values in either measure usually point to higher frequencies of LGT. However, when comparing body sites, we observe a different y-axis scale for the posterior fornix: the longer contigs in the posterior fornix may lead to larger LGT frequencies if gene percentages (measure 2) are utilized rather than events per gene.
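The two measures compared in Figure I-6 can be written out as a short Python sketch. The input layout (one tuple per contig holding its gene count and an LGT flag) and the toy samples are assumptions for illustration and do not reflect WAAFLE's output format.

```python
def lgt_measures(contigs):
    """contigs: list of (n_genes, is_lgt) tuples for one sample.
    Returns (LGT contigs per gene, fraction of genes in LGT contigs)."""
    total_genes = sum(n for n, _ in contigs)
    if total_genes == 0:
        return 0.0, 0.0
    lgt_contigs = sum(1 for _, is_lgt in contigs if is_lgt)
    lgt_genes = sum(n for n, is_lgt in contigs if is_lgt)
    return lgt_contigs / total_genes, lgt_genes / total_genes


# Two hypothetical samples with 10 genes each.
short_contigs = [(2, True), (2, True), (2, False), (2, False), (2, False)]
long_contigs = [(8, True), (2, False)]

m_short = lgt_measures(short_contigs)  # two short LGT contigs
m_long = lgt_measures(long_contigs)    # one long LGT contig
```

With equal gene totals, the sample with short contigs scores higher on measure 1 while the sample with one long contig scores higher on measure 2, which is the bias discussed in the caption.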

Figure I-7. Jaccard and Bray-Curtis distances between inter-individual, intra-individual, and technical samples. We looked to see if LGT detection via WAAFLE is reproducible across technical replicates and stable in individuals. We focused on contigs with taxonomy resolved to the genus level and inter-genus LGT events. For each body site, we subsampled half the samples while including all technical replicates. In each sample, gene percentages were quantified for inter-genus taxon pairs (number of genes in LGT pair divided by number of sample genes) and single taxa (number of genes for taxa in sample divided by number of sample genes). We then calculated Jaccard and Bray-Curtis distances between samples from different individuals, the same individual but different time points, and technical replicates.

Figure I-8. Phylogenetic distances computed from random taxa pairs within body sites. For each body site, we (i) randomly chose WAAFLE-called pairs (waafle) or (ii) generated taxon pairs by randomly choosing two taxa weighted by gene percentages (simulated). For the former, up to 1,000 pairs or the total number of taxon pairs were chosen, whichever number is smaller. For the latter, 1,000 unique pairs were generated. We then plotted the A) phylogenetic distance distribution and B) LGT joint abundance distribution. Joint abundances were calculated by multiplying the gene percentage of one taxon (number of genes for a single taxon in a sample divided by total sample genes, averaged across sites) against the other.


Supplemental Tables

Table I-1. WAAFLE Parameters. This table describes the 5 parameters used to tune the WAAFLE pipeline.

Parameter | Definition | WAAFLE steps involved (and default values) | Effect
--- | --- | --- | ---
Subject coverage (s) | Percentage of a reference gene (subject sequence) that aligned to the contig (query sequence) | Step 2: s = 0.75; Step 3: s = 0 | Increasing subject coverage filters out low quality BLAST hits when calling genes (Step 2) and scoring taxa (Step 3). Including higher quality BLAST hits in Step 2 led to more accurate gene calling. Including more BLAST hits in Step 3 led to higher taxon scores.
Overlap percentage (o) | Length of the region by which two nucleotide fragments overlap, divided by the length of the shorter fragment | Steps 2 & 3: o = 0.1 | Lowering overlap percentage allows more hits to be merged into groups, leading to fewer gene calls (Step 2). Inclusion of more BLAST hits per gene for taxon scoring (Step 3) can lead to higher scores.
Gene length (g) | Length of gene called or supplied to WAAFLE | Step 2: g = 200 bp | A higher gene length cutoff prevents LGTs from being called due to spurious gene calls, and leads to lower numbers of genes per contig.
One taxon score (k1) | A single taxon’s minimum score across all genes in a contig | Step 4: k1 = 0.5 | A lower threshold for the one taxon score makes it easier for WAAFLE to annotate a contig as “No LGT”.
Two taxon score (k2) | The minimum score for two taxa after maximizing scores between them across all genes in a contig | Step 4: k2 = 0.8 | A lower threshold for the two taxon score makes it easier for a contig to be called “LGT”.
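The one taxon and two taxon score definitions in Table I-1 can be made concrete with a minimal Python sketch. This is a reading of the table's definitions, not WAAFLE's implementation. The input layout (one dictionary of per-taxon scores per gene), the order of the two checks, the use of greater-than-or-equal comparisons, and the "Unclassified" label are all assumptions.

```python
from itertools import combinations


def classify_contig(gene_scores, k1=0.5, k2=0.8):
    """gene_scores: one {taxon: score} dict per gene in the contig.
    Returns ("No LGT", taxon), ("LGT", (taxon_a, taxon_b)),
    or ("Unclassified", None)."""
    taxa = sorted({t for scores in gene_scores for t in scores})

    # One taxon score: a taxon's minimum score across all genes.
    one = {t: min(s.get(t, 0.0) for s in gene_scores) for t in taxa}
    best = max(one, key=one.get, default=None)
    if best is not None and one[best] >= k1:
        return "No LGT", best

    # Two taxon score: per gene, take the better of the two taxa,
    # then take the minimum across genes.
    best_pair, best_score = None, 0.0
    for a, b in combinations(taxa, 2):
        score = min(max(s.get(a, 0.0), s.get(b, 0.0)) for s in gene_scores)
        if score > best_score:
            best_pair, best_score = (a, b), score
    if best_pair is not None and best_score >= k2:
        return "LGT", best_pair
    return "Unclassified", None


# Hypothetical per-gene scores for three two-gene contigs.
single = classify_contig([{"A": 0.9, "B": 0.1}, {"A": 0.95, "B": 0.2}])
mixed = classify_contig([{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.95}])
weak = classify_contig([{"A": 0.6, "B": 0.1}, {"A": 0.1, "B": 0.6}])
```

In the second contig no single taxon explains both genes (both one taxon scores are 0.1), but the pair A and B explains each gene well (two taxon score 0.9), so it is labeled LGT under the default thresholds. Lowering k1 or k2 widens the corresponding class, as described in the Effect column.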


Appendix II:

Supplemental Materials for Chapter 3


Supplemental Figures

Figure II-1. Biomass and alpha diversity for train and station samples. (A) Biomass from samples collected across the subway system. Each data point represents a pooled sampling strategy in which two or three swabs from the same site were pooled and jointly extracted. DNA yield is plotted in ng/mL. (B) Alpha diversity by surface type, as measured by the inverse Simpson diversity index. In both (A) and (B), colors represent the line from which the sample was derived (red, orange, or green for the corresponding train line or station, and black for samples from within a downtown station).

Figure II-2. Ordination of surface data subsets. (A) Train hold surfaces by train line, (B) train chair surfaces by train line, (C) train chairs by material, and (D) touchscreen surfaces by date. All ordinations are principal coordinates analyses using Bray-Curtis distance, colored by metadata category, calculated using filtered OTU relative abundance table subsets of the relevant samples.


Figure II-3. Comparison of antibiotic resistance markers from the ARDB database. RPKMs of antibiotic resistance gene markers from air microbiomes in New York City (office) and San Diego (hospital, home, pier), the Boston MBTA, and gut microbiomes from 552 individuals from the United States, China, Malawi, and Venezuela.


Figure II-4. Letter from the MBTA. We received MBTA approval, by way of the MBTA Transit Police, to carry out the study prior to grant submission and confirmed detailed sampling plans with the MBTA prior to any public work. Their assistance and input were invaluable both for study design and for safe execution of sample collection. This letter contains the initial approval of the work from Chief MacMillan.


Supplemental Tables

Supplemental Tables are too large to display and are available online at http://huttenhower.sph.harvard.edu/MBTA2015. Captions are included below for reference.

Table II-1. Sample collection and metadata. Includes metadata for all collected samples that were sequenced via 16S amplicon or shotgun sequencing. Abbreviations are defined at the bottom.

Table II-2. 16S and shotgun OTU tables along with taxa present across the sequencing plate. The first tab contains the 16S OTU counts after quality control, stitching, length filtering, removal of chloroplast, mitochondria, and archaea, and filtering for at least 0.1% in 1 sample. The second tab contains the unfiltered MetaPhlAn OTU table with percentages (100 = 100%). Note that additional filtering was performed before LEfSe and MaAsLin runs for both 16S (at least 0.1% in 7 samples) and shotgun (at least 0.1% in 2 samples). The third tab contains our analyses to identify contaminant taxa. As a negative control, we examined all samples present on a sequencing plate containing a subset of MBTA samples, which included touchscreens (n=21), trains (n=6), 30 saliva cultures, 13 skin samples, and 2 macaque tissue samples. Listed taxa were present in 80% of samples with at least 0.00001 abundance, and are shown with their average abundance across all samples. This provides a quality control test for potential contaminant taxa, none of which were nontrivially abundant or significant during our MBTA analyses.

Table II-3. LEfSe and MaAsLin analysis for 16S sequencing. The first tab contains LEfSe results when searching for differentially abundant taxa between touchscreen locations: outdoors (out), underground (under), and indoors near an exit facing an outside environment (inout). Significant results report both logarithmic LDA scores and p-values. The second tab contains results for MaAsLin run with four covariates, including surface category, surface type, surface material, and surface location. Only organisms with q<0.25 are reported.

Table II-4. MaAsLin analysis for shotgun data. MaAsLin analysis was performed to identify differentially abundant taxa (first and second tab) and KOs (third and fourth tab) with respect to surface type. For both, surface type was split into chairs (seat backs and seats), holds (horizontal/vertical poles, grips), and touchscreens. For identifying differentially abundant taxa, we performed MaAsLin with full taxonomies at all levels (first tab) as well as with species only (second tab). All results are reported: we considered organisms with q<0.25 to be significant. For identifying differentially abundant KOs, we performed MaAsLin on KO abundances calculated using all shotgun reads (third tab) and after P. acnes-associated reads were removed (fourth tab). Only significant results are reported; these are KOs with q<0.05.

Table II-5. Antibiotic resistance gene and virulence factor markers. RPKM values for CARD (first tab), VFDB (second tab), and ARDB (third tab). The RPKM values for CARD and VFDB are only for MBTA data; the ARDB data contains values from multiple shotgun datasets.


Supplemental Information

The BioProject number, protocols, raw data tables, and supplemental tables can be downloaded at http://huttenhower.sph.harvard.edu/MBTA2015.

Methods and Materials

DNA extraction, 16S amplification and sequencing. Samples were processed using the MoBio PowerLyzer PowerSoil DNA extraction kit (MO BIO Laboratories, Inc.) with bead-beating homogenization. For each sample, 2 or 3 swabs from the same site were pooled for optimal biomass recovery. Each swab was individually homogenized in a bead-beating tube at 6.0 m/s for 40 seconds on the MP Biomedical FastPrep 24, but subsequent cleanup was pooled over one column. DNA extracts were quantified using a Qubit fluorimeter and sent to the Broad Institute for sequencing. Amplification and sequencing by Illumina MiSeq were performed as described previously [232]. In brief, genomic DNA was subjected to 16S amplification using primers designed to incorporate the Illumina adapters and a sample barcode sequence, allowing directional sequencing covering variable region V4 (primers: 515F [GTGCCAGCMGCCGCGGTAA] and 806R [GGACTACHVGGGTWTCTAAT]). PCR was performed in triplicate with 1 μl of template (1:50), 10 μl of HotMasterMix with the HotMaster Taq DNA Polymerase (5 Prime), and 1 μl of primer mix (for a final concentration of 10 μM). The cycling conditions consisted of an initial denaturation at 94°C for 3 min, followed by 24 cycles of denaturation at 94°C for 45 sec, annealing at 50°C for 60 sec, and extension at 72°C for 5 min, and a final extension at 72°C for 10 min. Amplicons were quantified on the Caliper LabChipGX (PerkinElmer, Waltham, MA), pooled in equimolar concentrations, and size selected (375-425 bp) on the Pippin Prep (Sage Sciences, Beverly, MA) to reduce non-specific amplification products from host DNA. Final library size and quantification were assessed on an Agilent Bioanalyzer 2100 DNA 1000 chip (Agilent Technologies, Santa Clara, CA). Sequencing was performed on the Illumina MiSeq platform (version 2) according to the manufacturer’s specifications with addition of 5% PhiX, and yielded paired-end reads of 150 bp in length in each direction. Total read depth was at least 5,000 reads (up to over 100,000 reads) per sample.

OTU calling. Quantitative Insights into Microbial Ecology (QIIME) software [233] version 1.8 was used for data processing. Paired-end reads (with approximately 97 bp overlap) were stitched and size selected (225-275 bp) to reduce nonspecific amplification products. Operational taxonomic units (OTUs) were called with a closed-reference approach (pick_closed_reference_otus.py) using the Greengenes reference version 13.5 at the 97% identity level, based on the PICRUSt [127] protocol. Using these parameters, we observed 17,954 unique OTUs. We filtered low-abundance OTUs (minimum abundance threshold of 0.001 in at least one of 72 samples); this reduced the dataset to 2,134 unique OTUs representing 501 unique genera. Since the primers used in the study were designed to amplify bacterial 16S genes, we filtered out OTUs that corresponded to chloroplasts, mitochondria, and archaea. OTU frequencies in samples were then sum-normalized to proportional data. The filtered OTU tables can be found in Table II-2.
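The filtering and normalization steps above can be illustrated with a minimal Python sketch. It is not the QIIME workflow used in the study. It assumes that the 0.001 threshold refers to relative abundance, that the table is held as a dictionary of per-sample counts, and that non-target lineages can be recognized by simple substring matches on a taxonomy string. The function names and the `EXCLUDED_TERMS` list are hypothetical.

```python
from typing import Dict, List

# Hypothetical substrings marking lineages to exclude (an assumption, not
# taken from the study's actual QIIME commands).
EXCLUDED_TERMS = ("chloroplast", "mitochondria", "archaea")


def sum_normalize(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Convert counts to proportions so that each sample column sums to 1."""
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        otu: [row[j] / totals[j] if totals[j] > 0 else 0.0 for j in range(n_samples)]
        for otu, row in counts.items()
    }


def filter_otus(counts: Dict[str, List[float]],
                taxonomy: Dict[str, str],
                min_abundance: float = 0.001) -> Dict[str, List[float]]:
    """Abundance filter, lineage filter, then sum-normalization."""
    # 1. Keep OTUs reaching the relative-abundance threshold in >= 1 sample.
    rel = sum_normalize(counts)
    kept = {otu: counts[otu] for otu, row in rel.items() if max(row) >= min_abundance}
    # 2. Drop OTUs annotated as chloroplast, mitochondria, or archaea.
    kept = {
        otu: row for otu, row in kept.items()
        if not any(term in taxonomy.get(otu, "").lower() for term in EXCLUDED_TERMS)
    }
    # 3. Re-normalize the surviving OTUs to proportional data.
    return sum_normalize(kept)
```

Because the table is renormalized after the lineage filter, proportions in the output describe only the retained bacterial OTUs.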

KneadData. KneadData incorporates Trimmomatic [239] and bowtie2 [240] for filtering and human sequence removal, respectively. Reads were scanned with a four-base-wide sliding window and trimmed when the average base Phred score dropped below 20. Trimmed reads shorter than 70 nt were discarded. The UCSC human genome assembly version hg38 was used as the reference for removal of human sequences. The average sequencing depth after quality control was 9.8×10^6 reads per sample.
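To make the trimming rule concrete, the following Python sketch applies a four-base sliding window with a mean Phred cutoff of 20 and a 70 nt minimum length. It is a simplified illustration only: Trimmomatic's own implementation handles edge cases differently, and the function names here are hypothetical.

```python
from typing import List, Optional, Tuple


def sliding_window_trim(quals: List[int], window: int = 4, min_mean_q: float = 20) -> int:
    """Return how many leading bases to keep.

    Scans 5' to 3' and cuts at the start of the first window whose mean
    Phred score is below min_mean_q. Reads with no failing window (or
    shorter than one window) are kept at full length.
    """
    n = len(quals)
    for start in range(max(n - window + 1, 0)):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return start
    return n


def quality_filter(seq: str, quals: List[int], window: int = 4,
                   min_mean_q: float = 20, min_len: int = 70) -> Optional[Tuple[str, List[int]]]:
    """Trim a read, then discard it (return None) if shorter than min_len."""
    keep = sliding_window_trim(quals, window, min_mean_q)
    if keep < min_len:
        return None
    return seq[:keep], quals[:keep]
```

In this sketch a read whose quality collapses after 80 good bases is retained in trimmed form, whereas one that collapses after 60 bases falls below the 70 nt cutoff and is dropped.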

Negative control analyses. Unfortunately, our study did not include negative controls beyond those internal to the sequencing platform. Instead, we took several measures during analysis to test for contamination in the 16S datasets. First, we looked at relative abundances across multiple sets of samples on the same sequencing plate, since taxa present across all samples may indicate contamination (especially since the batch included many non-transit samples). This was possible mainly for the touchscreen samples (n=21) and a few train samples (n=6), which were pooled with 30 saliva cultures, 13 skin samples, and 2 macaque adipose samples. At the species level, we found 42 taxa (of 1647 total) that were present in 80% of samples, with average abundance ranging from 0.018% (Pseudomonas unknown) to 11.1% (Actinomyces unknown). Many of these are skin-associated, including Pseudomonas, Staphylococcus, and Corynebacterium (in increasing abundance), or associated with the oral cavity, including Fusobacterium, Veillonella, Peptostreptococcus, Streptococcus, Prevotella, and Porphyromonas (in increasing order) (Table II-2). It is unclear whether the latter arise from the large number of saliva samples in this dataset or are true contaminants. None of the taxa with lower average abundance are key to our findings.

Chloroplast and mitochondrial sequences were considered a type of contaminant in our study, inasmuch as they essentially represent reads derived from plant and human material. They were found across all touchscreen and surface samples, but at very low levels in adipose tissue (primate, not human-derived) and saliva. Others have claimed that chloroplast DNA may be an artifact of cotton swabs rather than environmental exposures; our skin samples were processed with Copan swabs and yielded 1-2 orders of magnitude fewer chloroplast sequences (<1% maximum). Our standard primer pairs are known to amplify chloroplast and mitochondrial sequences; this is a well-known problem for those who study plant-associated microbial communities [241, 242]. In the touchscreens, chloroplast DNA percentages varied from 1.32% to 6.98%, and mitochondrial DNA percentages from 0.054% to 1.03%. They varied even more in the train data (not pooled with the touchscreens): chloroplast DNA ranged from 0.9% to 62.39%, with especially high levels on the Red line, while mitochondrial DNA varied from 0% to 8.27% on the trains (data available via website). This led to our analysis strategy of treating both sequence types like typical contaminants, discounting their sequence abundance, renormalizing, and analyzing primarily the resulting quality-controlled datasets.

Physical negative controls should be part of future study design, as recommended by Adams et al [114] and Salter et al [115]. Their use, we note, must still be context dependent, as no one blanket analysis is likely to apply to different sample and contaminant types. Some studies have utilized the approach outlined in Flores et al, where OTUs constituting greater than 1% of the total negative control sequences were removed from all samples prior to rarefaction and analyses [243]. Another approach, developed by Meadow et al, involves searching for taxa with high abundance in negative controls relative to samples: this is done by plotting the relative abundances of taxa in negative controls against the relative abundances of taxa in samples and applying a cutoff [190]. Adams et al performed a meta-analysis of built environment studies, and reported phylum Tenericutes as significantly enriched in kit microbiomes, and Cyanobacteria (or chloroplast) as highly abundant in dust but not in kits. They also mention that skin taxa are often found as contaminants, but removing them could remove true signal. Typical kit contaminant taxa were also not significant in our study.
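As an illustration of the first approach described above (removing OTUs that exceed 1% of all negative-control sequences), a minimal Python sketch follows. It was not part of this study, which lacked physical negative controls. The data layout (dictionaries of OTU counts) and the function names are assumptions made for the example.

```python
from typing import Dict, Set


def control_contaminants(control_counts: Dict[str, int],
                         max_fraction: float = 0.01) -> Set[str]:
    """OTUs making up more than max_fraction of all negative-control reads."""
    total = sum(control_counts.values())
    if total == 0:
        return set()
    return {otu for otu, n in control_counts.items() if n / total > max_fraction}


def remove_contaminants(sample_counts: Dict[str, Dict[str, int]],
                        contaminants: Set[str]) -> Dict[str, Dict[str, int]]:
    """Drop flagged OTUs from every sample before rarefaction or analysis."""
    return {
        sample: {otu: n for otu, n in otus.items() if otu not in contaminants}
        for sample, otus in sample_counts.items()
    }
```

A fixed cutoff of this kind removes a flagged OTU from all samples regardless of its abundance there, which is one reason such filters need to be chosen with the sample and contaminant types in mind.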

Comparison to the NY subway study

To expand our comparison with the previous NYC subway study, we downloaded their MetaPhlAn2 tables (provided at the time of the NYC publication by Nicola Segata in collaboration with our group) from their supplementary data. We applied a simple quality control filter by retaining taxa with at least 0.1% abundance in at least 1% of samples (14 samples), and then focused specifically on the samples most similar to ours, i.e. from subway stations or trains.
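The quality control filter described above can be sketched in a few lines of Python. The sketch assumes the table stores relative abundances as percentages (so 0.1% is the value 0.1) and takes the minimum sample count of 14 directly from the text. The dictionary layout and the function name are hypothetical.

```python
from typing import Dict, List


def prevalence_filter(table: Dict[str, List[float]],
                      min_abundance: float = 0.1,
                      min_samples: int = 14) -> Dict[str, List[float]]:
    """Keep taxa with abundance >= min_abundance in >= min_samples samples.

    `table` maps each taxon to its per-sample abundances, in percent.
    """
    return {
        taxon: row for taxon, row in table.items()
        if sum(1 for value in row if value >= min_abundance) >= min_samples
    }
```

For example, `prevalence_filter(table, min_abundance=0.1, min_samples=14)` would reproduce the stated thresholds on a table in this layout; taxa that are abundant in only a handful of samples are dropped along with consistently rare ones.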

In the NYC study, the most abundant taxa in the resulting 1,416 samples included Pseudomonas stutzeri (27.01%), Pseudomonas unclassified (8.66%), Enterobacter cloacae (7.66%), Stenotrophomonas maltophilia (7.10%), and Acinetobacter pittii/calcoaceticus/nosocomialis (3.39%). Neither Yersinia nor Bacillus anthracis was present in any sample. These results are strikingly different from our top species from similarly analyzed metagenomic data, which included Propionibacterium acnes (47.44%), Propionibacterium phage (total ~6%), Micrococcus luteus (2.40%), and Staphylococcus epidermidis (1.98%). This may be due to a combination of factors, most likely the different types of surfaces sampled, but also including the swab protocol development and biomass validation prior to sequencing carried out for our study (see Methods). Most of our samples represent heavily utilized, nonporous, non-sanitized surfaces within train cars or, less often, stations; in contrast, NYC study surfaces include benches (n=326), rails/poles (condensed from other categories, n=468), garbage cans (n=142), kiosks (n=161), turnstiles (n=151), and doors (n=77), with all other surfaces sampled fewer than 24 times.

In support of this hypothesis, the NYC microbiomes at least in part do resemble those of other built environment surfaces and dust. Adams et al [244], for example, collected dust in vacuums or passively (through settlement). The former, which was considered homogenized, had significantly higher levels of Pseudomonadales, Enterobacteriales, and Streptophyta as compared to the latter. Overall, Gammaproteobacteria dominated most samples (average 76.8%), still primarily from Pseudomonadales and Enterobacteriales, and overshadowed the Bacilli (6.68%), Betaproteobacteria (5.02%), Alphaproteobacteria (4.80%), and Actinobacteria (2.48%). The NYC subway had high levels of Enterobacteriales (17.90%) and Pseudomonadales (49.61%), but no Streptophyta (0%, suggesting a possible sampling or extraction bias). However, it is difficult to compare NYC swabbed samples (or our own) to vacuumed or settled dust, given the extreme heterogeneity seen in the latter for distinct space types or time integration periods. The dust profiled by Adams et al, for example, was in turn quite distinct from dust in the International Space Station [245], a mixed-use academic classroom building [201], or house dust [92], none of which significantly resembled our skin-dominated MBTA surfaces.

Taking these unusual features of the NYC subway data as given, however, we sought to determine whether surface material was at least a major determinant of their microbial community composition, as it proved to be for ours. We grouped their sample metadata into four categories: type of object (bench, rail/pole, garbage can, kiosk, turnstiles, etc.), surface material (wood, metal, plastic, etc.), object category (station, train, etc.), and borough (Queens, Brooklyn, Manhattan, etc.). Applying the MaAsLin multivariate linear model to these variables jointly, we found 71 differentially abundant clades at FDR<0.25.
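For readers unfamiliar with false discovery rate thresholds such as FDR<0.25, the following Python sketch shows how per-association p-values can be converted to q-values. It assumes the Benjamini-Hochberg procedure, which the text does not specify, and it does not reproduce MaAsLin's model fitting. The function name is hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # are monotone non-decreasing in p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

Associations whose q-value falls below the chosen threshold (0.25 here) would then be reported; a threshold this permissive is intended for hypothesis generation, which is consistent with the cautious interpretation of small effect sizes in this section.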

Surprisingly, none of these associations were with surface material type; most instead segregated with object type, which may at least be concordant with the much greater diversity of objects sampled in the NYC study. Rails and poles had lower levels of Pseudomonas and Acinetobacter lwoffii as compared to benches, for example, while garbage cans had higher levels of Enterococcus italicus and Leuconostoc. Clostridia and Klebsiella (not marine taxa) were found in the abandoned South Ferry and Penn Station timecourse samples, as well as in trains as compared to all other stations. Lastly, and also surprisingly, some taxa were associated with borough: these include higher levels of Acinetobacter and Moraxellaceae in Manhattan as compared to the Bronx. Without more detail on the study’s exact sampling protocol - which parts of these diverse objects were swabbed, for example, and for how long over what surface area - it is difficult to interpret statistically significant but low effect size differences. It may be useful for future studies to sample fewer, more controlled environments with greater specificity, and of course to assess the results with more careful and targeted metagenomic analyses.

References

1. Sender, R., S. Fuchs, and R. Milo, Revised Estimates for the Number of Human and Bacteria Cells in the Body. PLoS Biol, 2016. 14(8): p. e1002533.

2. The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome. Nature, 2012. 486(7402): p. 207-14.

3. National Academies of Sciences, Engineering, and Medicine, Microbiomes of the Built Environment: A Research Agenda for Indoor Microbiology, Human Health, and Buildings. 2017, Washington, DC: The National Academies Press. 253.

4. Shapiro, J.A., Thinking about bacterial populations as multicellular organisms. Annu Rev Microbiol, 1998. 52: p. 81-104.

5. Meadow, J.F., et al., Humans differ in their personal microbial cloud. PeerJ, 2015. 3: p. e1258.

6. Rosenthal, M., et al., Skin microbiota: microbial community structure and its potential association with health and disease. Infect Genet Evol, 2011. 11(5): p. 839-48.

7. Leewenhoeck, A.v., Observations, Communicated to the Publisher by Mr. Antony van Leewenhoeck, in a Dutch Letter of the 9th of Octob. 1676. Here English'd: concerning Little Animals by Him Observed in Rain-Well-Sea. and Snow Water; as Also in Water Wherein Pepper Had Lain Infused. Philosophical Transactions Royal Society, 1677. 12: p. 821-831.

8. Adler, A. and E. Ducker, When Pasteurian Science Went to Sea: The Birth of Marine Microbiology. J Hist Biol, 2017.

9. Razumov, A., The direct method of calculation of bacteria in water: comparison with the Koch method. Mikrobiologija, 1932. 1: p. 131-146.

10. Staley, J.T. and A. Konopka, Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol, 1985. 39: p. 321-46.

11. Stewart, E.J., Growing unculturable bacteria. J Bacteriol, 2012. 194(16): p. 4151-60.

12. Soucy, S.M., J. Huang, and J.P. Gogarten, Horizontal gene transfer: building the web of life. Nat Rev Genet, 2015. 16(8): p. 472-82.

13. Lang, A.S., O. Zhaxybayeva, and J.T. Beatty, Gene transfer agents: phage-like elements of genetic exchange. Nat Rev Microbiol, 2012. 10(7): p. 472-82.

14. Naor, A., et al., Low species barriers in halophilic archaea and the formation of recombinant hybrids. Curr Biol, 2012. 22(15): p. 1444-8.

15. Zhaxybayeva, O. and W.F. Doolittle, Lateral gene transfer. Curr Biol, 2011. 21(7): p. R242-6.

16. Griffith, F., The Significance of Pneumococcal Types. J Hyg (Lond), 1928. 27(2): p. 113-59.

17. Avery, O.T., C.M. Macleod, and M. McCarty, Studies on the Chemical Nature of the Substance Inducing Transformation of Pneumococcal Types : Induction of Transformation by a Desoxyribonucleic Acid Fraction Isolated from Pneumococcus Type Iii. J Exp Med, 1944. 79(2): p. 137-58.

18. Ochiai, K., et al., Studies on inheritance of drug resistance between Shigella strains and Escherichia coli strains. Nihon Iji Shimpo, 1959. 1861: p. 34-46.

19. Akiba, T.K.T.I.Y., S. Kimura, and T. Fukushima, Studies on the mechanism of development of multiple drug-resistant Shigella strains. Nihon Iji Shimpo, 1960. 1866: p. 45-50.

20. Anderson, E.S., The ecology of transferable drug resistance in the enterobacteria. Annu Rev Microbiol, 1968. 22: p. 131-80.

21. Aravind, L., et al., Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet, 1998. 14(11): p. 442-4.

22. Nelson, K.E., et al., Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature, 1999. 399(6734): p. 323-9.

23. Sokal, R.R. and T.J. Crovello, The Biological Species Concept: A Critical Evaluation. The American Naturalist, 1970. 104(936): p. 127-153.

24. Mayr, E., Systematics and the origin of species, from the viewpoint of a zoologist. 1942: Harvard University Press.

25. de Queiroz, K., Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A, 2005. 102 Suppl 1: p. 6600-7.

26. Ravin, A.W., Experimental Approaches to the Study of Bacterial Phylogeny. The American Naturalist, 1963. 97(896): p. 307-318.

27. Dykhuizen, D.E. and L. Green, Recombination in Escherichia coli and the definition of biological species. J Bacteriol, 1991. 173(22): p. 7257-68.

28. Tettelin, H., et al., Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A, 2005. 102(39): p. 13950-5.

29. Cohan, F.M., What are bacterial species? Annu Rev Microbiol, 2002. 56: p. 457-87.

30. Atwood, K.C., L.K. Schneider, and F.J. Ryan, Periodic selection in Escherichia coli. Proc Natl Acad Sci U S A, 1951. 37(3): p. 146-55.

31. Treves, D.S., S. Manning, and J. Adams, Repeated evolution of an acetate-crossfeeding polymorphism in long-term populations of Escherichia coli. Mol Biol Evol, 1998. 15(7): p. 789-97.

32. Imhof, M. and C. Schlotterer, Fitness effects of advantageous mutations in evolving Escherichia coli populations. Proc Natl Acad Sci U S A, 2001. 98(3): p. 1113-7.

33. Rozen, D.E. and R.E. Lenski, Long-Term Experimental Evolution in Escherichia coli. VIII. Dynamics of a Balanced Polymorphism. Am Nat, 2000. 155(1): p. 24-35.

34. Guttman, D.S. and D.E. Dykhuizen, Detecting selective sweeps in naturally occurring Escherichia coli. Genetics, 1994. 138(4): p. 993-1003.

35. Coleman, M.L. and S.W. Chisholm, Ecosystem-specific selection pressures revealed through comparative population genomics. Proc Natl Acad Sci U S A, 2010. 107(43): p. 18634-9.

36. Papke, R.T., et al., Searching for species in haloarchaea. Proc Natl Acad Sci U S A, 2007. 104(35): p. 14092-7.

37. Cohan, F.M. and E.B. Perry, A systematics for discovering the fundamental units of bacterial diversity. Curr Biol, 2007. 17(10): p. R373-86.

38. Majewski, J. and F.M. Cohan, Adapt globally, act locally: the effect of selective sweeps on bacterial sequence diversity. Genetics, 1999. 152(4): p. 1459-74.

39. Shapiro, B.J., et al., Population genomics of early events in the ecological differentiation of bacteria. Science, 2012. 336(6077): p. 48-51.

40. Takeuchi, N., et al., Gene-specific selective sweeps in bacteria and archaea caused by negative frequency-dependent selection. BMC Biol, 2015. 13: p. 20.

41. Dixit, P.D., T.Y. Pang, and S. Maslov, Recombination-Driven Genome Evolution and Stability of Bacterial Species. Genetics, 2017. 207(1): p. 281-295.

42. Rolfe, R. and M. Meselson, The Relative Homogeneity of Microbial DNA. Proc Natl Acad Sci U S A, 1959. 45(7): p. 1039-43.

43. De Ley, J., H. Cattoir, and A. Reynaerts, The quantitative measurement of DNA hybridization from renaturation rates. Eur J Biochem, 1970. 12(1): p. 133-42.

44. Wayne, L.G., International Committee on Systematic Bacteriology: announcement of the report of the ad hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Zentralbl Bakteriol Mikrobiol Hyg A, 1988. 268(4): p. 433-4.

45. Fleischmann, R.D., et al., Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 1995. 269(5223): p. 496-512.

46. Fraser, C.M., J.A. Eisen, and S.L. Salzberg, Microbial genome sequencing. Nature, 2000. 406(6797): p. 799-803.

47. Ravel, J. and C.M. Fraser, Genome sequencing of microbial species, in Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. 2004, John Wiley & Sons, Ltd.

48. Karlin, S., Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol, 1998. 1(5): p. 598-610.

49. Hanage, W.P., C. Fraser, and B.G. Spratt, Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci, 2006. 361(1475): p. 1917-27.

50. Segata, N., et al., PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat Commun, 2013. 4: p. 2304.

51. Ravenhall, M., et al., Inferring horizontal gene transfer. PLoS Comput Biol, 2015. 11(5): p. e1004095.

52. Cavalli-Sforza, L.L., The DNA revolution in population genetics. Trends Genet, 1998. 14(2): p. 60-5.

53. Koonin, E.V., K.S. Makarova, and L. Aravind, Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol, 2001. 55: p. 709-42.

54. Lawrence, J.G. and H. Ochman, Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol, 1997. 44(4): p. 383-97.

55. Medigue, C., et al., Evidence for horizontal gene transfer in Escherichia coli speciation. J Mol Biol, 1991. 222(4): p. 851-6.

56. Ochman, H., J.G. Lawrence, and E.A. Groisman, Lateral gene transfer and the nature of bacterial innovation. Nature, 2000. 405(6784): p. 299-304.

57. Nakamura, Y., et al., Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet, 2004. 36(7): p. 760-6.

58. Ge, F., L.S. Wang, and J. Kim, The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol, 2005. 3(10): p. e316.

59. Lerat, E., et al., Evolutionary origins of genomic repertoires in bacteria. PLoS Biol, 2005. 3(5): p. e130.

60. Dagan, T. and W. Martin, Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci U S A, 2007. 104(3): p. 870-5.

61. Andam, C.P. and J.P. Gogarten, Biased gene transfer in microbial evolution. Nat Rev Microbiol, 2011. 9(7): p. 543-55.

62. Skippington, E. and M.A. Ragan, Phylogeny rather than ecology or lifestyle biases the construction of Escherichia coli-Shigella genetic exchange communities. Open Biol, 2012. 2(9): p. 120112.

63. Boucher, Y., et al., Local mobile gene pools rapidly cross species boundaries to create endemicity within global Vibrio cholerae populations. MBio, 2011. 2(2).

64. Madsen, J.S., et al., The interconnection between biofilm formation and horizontal gene transfer. FEMS Immunol Med Microbiol, 2012. 65(2): p. 183-95.

65. Smillie, C.S., et al., Ecology drives a global network of gene exchange connecting the human microbiome. Nature, 2011. 480(7376): p. 241-4.

66. Liu, L., et al., The human microbiome: a hot spot of microbial horizontal gene transfer. Genomics, 2012. 100(5): p. 265-70.

67. Brito, I.L., et al., Mobile genes in the human microbiome are structured from global to individual scales. Nature, 2016. 535(7612): p. 435-439.

68. Rivera, M.C., et al., Genomic evidence for two functionally distinct gene classes. Proc Natl Acad Sci U S A, 1998. 95(11): p. 6239-44.

69. Cohen, O., U. Gophna, and T. Pupko, The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol Biol Evol, 2011. 28(4): p. 1481-9.

70. Jain, R., M.C. Rivera, and J.A. Lake, Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A, 1999. 96(7): p. 3801-6.

71. Beiko, R.G., T.J. Harlow, and M.A. Ragan, Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A, 2005. 102(40): p. 14332-7.

72. Baltrus, D.A., Exploring the costs of horizontal gene transfer. Trends Ecol Evol, 2013. 28(8): p. 489-95.

73. Drummond, D.A. and C.O. Wilke, The evolutionary consequences of erroneous protein synthesis. Nat Rev Genet, 2009. 10(10): p. 715-24.

74. Banos, R.C., et al., Differential regulation of horizontally acquired and core genome genes by the bacterial modulator H-NS. PLoS Genet, 2009. 5(6): p. e1000513.

75. Wolf, Y.I., et al., Evolution of aminoacyl-tRNA synthetases--analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res, 1999. 9(8): p. 689-710.

76. Woese, C.R., Interpreting the universal phylogenetic tree. Proc Natl Acad Sci U S A, 2000. 97(15): p. 8392-6.

77. Baas Becking, L.G.M., Geobiologie of inleiding tot de milieukunde. 1934, The Hague, the Netherlands: W.P. Van Stockum & Zoon.

78. de Wit, R. and T. Bouvier, 'Everything is everywhere, but, the environment selects'; what did Baas Becking and Beijerinck really say? Environ Microbiol, 2006. 8(4): p. 755-8.

79. O'Malley, M.A., The nineteenth century roots of 'everything is everywhere'. Nat Rev Microbiol, 2007. 5(8): p. 647-51.

80. Yasuda, K., et al., Biogeography of the intestinal mucosal and lumenal microbiome in the rhesus macaque. Cell Host Microbe, 2015. 17(3): p. 385-91.

81. Grice, E.A., et al., Topographical and temporal diversity of the human skin microbiome. Science, 2009. 324(5931): p. 1190-2.

82. Gibbons, S.M., The Built Environment Is a Microbial Wasteland. mSystems, 2016. 1(2).

83. Impact of the Built Environment on Health. 2011 [cited 09/15/2017]; Available from: https://www.cdc.gov/nceh/publications/factsheets/impactofthebuiltenvironmentonhealth.pdf.

84. Klepeis, N.E., et al., The National Human Activity Pattern Survey (NHAPS): a resource for assessing exposure to environmental pollutants. J Expo Anal Environ Epidemiol, 2001. 11(3): p. 231-52.

85. Kitzes, J., A. Peller, S. Goldfinger, and M. Wackernagel, Current Methods for Calculating National Ecological Footprint Accounts. Science for Environment & Sustainable Society, 2007. 4(1): p. 1-9.

86. Hooke, R.L., J.F. Martín-Duque, and J. Pedraza, Land transformation by humans: A review. GSA Today, 2012. 22(12): p. 4-10.

87. United Nations Department of Economic and Social Affairs, Population Division, World urbanization prospects: the 2011 revision. Vol. ST/ESA/SER.A/322. 2012: United Nations Publications.

88. NESCent Working Group on the Evolutionary Biology of the Built Environment, et al., Evolution of the indoor biome. Trends Ecol Evol, 2015. 30(4): p. 223-32.

89. Dai, D., et al., Factors Shaping the Human Exposome in the Built Environment: Opportunities for Engineering Control. Environ Sci Technol, 2017. 51(14): p. 7759-7774.

90. Kelley, S.T. and J.A. Gilbert, Studying the microbiology of the indoor environment. Genome Biol, 2013. 14(2): p. 202.

91. Milstone, L.M., Epidermal desquamation. J Dermatol Sci, 2004. 36(3): p. 131-40.

92. Lax, S., et al., Longitudinal analysis of microbial interaction between humans and the indoor environment. Science, 2014. 345(6200): p. 1048-52.

93. Flores, G.E., et al., Microbial biogeography of public restroom surfaces. PLoS One, 2011. 6(11): p. e28132.

94. Kembel, S.W., et al., Architectural design influences the diversity and structure of the built environment microbiome. ISME J, 2012. 6(8): p. 1469-79.

95. Lax, S., C.R. Nagler, and J.A. Gilbert, Our interface with the built environment: immunity and the indoor microbiota. Trends Immunol, 2015. 36(3): p. 121-3.

96. Ownby, D.R., C.C. Johnson, and E.L. Peterson, Exposure to dogs and cats in the first year of life and risk of allergic sensitization at 6 to 7 years of age. JAMA, 2002. 288(8): p. 963-72.

97. Park, J.H., et al., Predictors of airborne endotoxin in the home. Environ Health Perspect, 2001. 109(8): p. 859-64.

98. Thorne, P.S., et al., Endotoxin Exposure: Predictors and Prevalence of Associated Asthma Outcomes in the United States. Am J Respir Crit Care Med, 2015. 192(11): p. 1287-97.

99. Liu, A.H., Endotoxin exposure in allergy and asthma: reconciling a paradox. J Allergy Clin Immunol, 2002. 109(3): p. 379-92.

100. Sharpe, R.A., et al., Indoor fungal diversity and asthma: a meta-analysis and systematic review of risk factors. J Allergy Clin Immunol, 2015. 135(1): p. 110-22.

101. Song, S.J., et al., Cohabiting family members share microbiota with one another and with their dogs. Elife, 2013. 2: p. e00458.

102. Ross, A.A., A.C. Doxey, and J.D. Neufeld, The Skin Microbiome of Cohabiting Couples. mSystems, 2017. 2(4).

103. Lax, S., et al., Forensic analysis of the microbiome of phones and shoes. Microbiome, 2015. 3: p. 21.

104. Lax, S. and J. Gilbert, Forensic microbiology in built environments, in Forensic Microbiology, D.O. Carter, J.K. Tomberlin, M.E. Benbow, and J.L. Metcalf, Editors. 2017, John Wiley & Sons, Ltd: Chichester, UK.

105. Strachan, D.P., Hay fever, hygiene, and household size. BMJ, 1989. 299(6710): p. 1259-60.

106. Rook, G.A., et al., Mycobacteria and other environmental organisms as immunomodulators for immunoregulatory disorders. Springer Semin Immunopathol, 2004. 25(3-4): p. 237-55.

107. Shade, A., Diversity is the question, not the answer. ISME J, 2017. 11(1): p. 1-6.

108. Vandegrift, R., et al., Cleanliness in context: reconciling hygiene with a modern microbial perspective. Microbiome, 2017. 5(1): p. 76.

109. Bloomfield, S.F., et al., Time to abandon the hygiene hypothesis: new perspectives on allergic disease, the human microbiome, infectious disease prevention and the role of targeted hygiene. Perspect Public Health, 2016. 136(4): p. 213-24.

110. Rook, G.A., Regulation of the immune system by biodiversity from the natural environment: an ecosystem service essential to health. Proc Natl Acad Sci U S A, 2013. 110(46): p. 18360-7.

111. Chase, J., et al., Geography and Location Are the Primary Drivers of Office Microbiome Composition. mSystems, 2016. 1(2).

112. Mohammadi, T., et al., Removal of contaminating DNA from commercial nucleic acid extraction kit reagents. J Microbiol Methods, 2005. 61(2): p. 285-8.

113. Tanner, M.A., et al., Specific ribosomal DNA sequences from diverse environmental settings correlate with experimental contaminants. Appl Environ Microbiol, 1998. 64(8): p. 3110-3.

114. Adams, R.I., et al., Microbiota of the indoor environment: a meta-analysis. Microbiome, 2015. 3: p. 49.

115. Salter, S.J., et al., Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol, 2014. 12: p. 87.

116. Coil, D., Citizen Microbiology: A Case Study in Space, in The Rightful Place of Science: Citizen Science, D. Cavalier and E.B. Kennedy, Editors. 2016, Consortium for Science, Policy & Outcomes: Tempe, AZ.

117. Nielsen, K.M., et al., Release and persistence of extracellular DNA in the environment. Environ Biosafety Res, 2007. 6(1-2): p. 37-53.

118. Carini, P., et al., Relic DNA is abundant in soil and obscures estimates of soil microbial diversity. Nat Microbiol, 2016. 2: p. 16242.

119. Emerson, J.B., et al., Schrödinger's microbes: Tools for distinguishing the living from the dead in microbial ecosystems. Microbiome, 2017. 5(1): p. 86.

120. Riesenfeld, C.S., P.D. Schloss, and J. Handelsman, Metagenomics: genomic analysis of microbial communities. Annu Rev Genet, 2004. 38: p. 525-52.

121. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res, 2009. 19(7): p. 1141-52.

122. Segata, N., et al., Computational meta'omics for microbial community studies. Mol Syst Biol, 2013. 9: p. 666.

123. McDonald, D., et al., An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J, 2012. 6(3): p. 610-8.

124. Yilmaz, P., et al., The SILVA and "All-species Living Tree Project (LTP)" taxonomic frameworks. Nucleic Acids Res, 2014. 42(Database issue): p. D643-8.

125. Huse, S.M., et al., Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet, 2008. 4(11): p. e1000255.

126. Knights, D., et al., Human-associated microbial signatures: examining their predictive value. Cell Host Microbe, 2011. 10(4): p. 292-6.

127. Langille, M.G., et al., Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol, 2013. 31(9): p. 814-21.

128. Vandamme, P., et al., Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol Rev, 1996. 60(2): p. 407-38.

129. Stackebrandt, E. and B.M. Goebel, Taxonomic Note: A Place for DNA-DNA Reassociation and 16S rRNA Sequence Analysis in the Present Species Definition in Bacteriology. International Journal of Systematic and Evolutionary Microbiology, 1994. 44(4): p. 846-849.

130. Eren, A.M., et al., Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol Evol, 2013. 4(12).

131. Eren, A.M., et al., Exploring the diversity of Gardnerella vaginalis in the genitourinary tract microbiota of monogamous couples through subtle nucleotide variation. PLoS One, 2011. 6(10): p. e26732.

132. McLellan, S.L., et al., Sewage reflects the distribution of human faecal Lachnospiraceae. Environ Microbiol, 2013. 15(8): p. 2213-27.

133. Faith, J.J., et al., The long-term stability of the human gut microbiota. Science, 2013. 341(6141): p. 1237439.

134. McHardy, A.C., et al., Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods, 2007. 4(1): p. 63-72.

135. Schloissnig, S., et al., Genomic variation landscape of the human gut microbiome. Nature, 2013. 493(7430): p. 45-50.

136. Segata, N., et al., Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods, 2012. 9(8): p. 811-4.

137. Brady, A. and S. Salzberg, PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods, 2011. 8(5): p. 367.

138. Wood, D.E. and S.L. Salzberg, Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol, 2014. 15(3): p. R46.

139. Kanehisa, M., et al., Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res, 2014. 42(Database issue): p. D199-205.

140. Tatusov, R.L., E.V. Koonin, and D.J. Lipman, A genomic perspective on protein families. Science, 1997. 278(5338): p. 631-7.

141. Powell, S., et al., eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res, 2012. 40(Database issue): p. D284-9.

142. Punta, M., et al., The Pfam protein families database. Nucleic Acids Res, 2012. 40(Database issue): p. D290-301.

143. Suzek, B.E., et al., UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 2007. 23(10): p. 1282-8.

144. Caspi, R., et al., The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res, 2014. 42(Database issue): p. D459-71.

145. Overbeek, R., et al., The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res, 2005. 33(17): p. 5691-702.

146. Markowitz, V.M., et al., IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Res, 2012. 40(Database issue): p. D123-9.

147. Konwar, K.M., et al., MetaPathways: a modular pipeline for constructing pathway/genome databases from environmental sequence information. BMC Bioinformatics, 2013. 14: p. 202.

148. Abubucker, S., et al., Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol, 2012. 8(6): p. e1002358.

149. Vollmers, J., S. Wiegand, and A.K. Kaster, Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! PLoS One, 2017. 12(1): p. e0169662.

150. Nagarajan, N. and M. Pop, Sequence assembly demystified. Nat Rev Genet, 2013. 14(3): p. 157-67.

151. Gill, S.R., et al., Metagenomic analysis of the human distal gut microbiome. Science, 2006. 312(5778): p. 1355-9.

152. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 2010. 464(7285): p. 59-65.

153. Venter, J.C., et al., Environmental genome shotgun sequencing of the Sargasso Sea. Science, 2004. 304(5667): p. 66-74.

154. Wrighton, K.C., et al., Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science, 2012. 337(6102): p. 1661-5.

155. Castelle, C.J., et al., Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nat Commun, 2013. 4: p. 2120.

156. Di Rienzi, S.C., et al., The human gut and groundwater harbor non-photosynthetic bacteria belonging to a new candidate phylum sibling to Cyanobacteria. Elife, 2013. 2: p. e01102.

157. Tyson, G.W., et al., Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 2004. 428(6978): p. 37-43.

158. Albertsen, M., et al., Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol, 2013. 31(6): p. 533-8.

159. Mukherjee, S., et al., 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat Biotechnol, 2017. 35(7): p. 676-683.

160. Eisen, J.A., Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr Opin Genet Dev, 2000. 10(6): p. 606-11.

161. Hao, W. and G.B. Golding, The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res, 2006. 16(5): p. 636-43.

162. Polz, M.F., E.J. Alm, and W.P. Hanage, Horizontal gene transfer and the evolution of bacterial and archaeal population structure. Trends Genet, 2013. 29(3): p. 170-5.

163. Mitri, S. and K.R. Foster, The genotypic view of social interactions in microbial communities. Annu Rev Genet, 2013. 47: p. 247-73.

164. Smith, J., The social evolution of bacterial pathogenesis. Proc Biol Sci, 2001. 268(1462): p. 61-9.

165. de Carvalho, M.O. and E.L. Loreto, Methods for detection of horizontal transfer of transposable elements in complete genomes. Genet Mol Biol, 2012. 35(4 (suppl)): p. 1078-84.

166. Ragan, M.A., On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett, 2001. 201(2): p. 187-91.

167. Vernikos, G.S. and J. Parkhill, Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics, 2006. 22(18): p. 2196-203.

168. Podell, S. and T. Gaasterland, DarkHorse: a method for genome-wide prediction of horizontal gene transfer. Genome Biol, 2007. 8(2): p. R16.

169. Langille, M.G., W.W. Hsiao, and F.S. Brinkman, Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics, 2008. 9: p. 329.

170. Whidden, C., N. Zeh, and R.G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol, 2014. 63(4): p. 566-81.

171. Tofigh, A., M. Hallett, and J. Lagergren, Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans Comput Biol Bioinform, 2011. 8(2): p. 517-35.

172. Chauve, C., et al., MaxTiC: Fast Ranking Of A Phylogenetic Tree By Maximum Time Consistency With Lateral Gene Transfers. bioRxiv, 2017.

173. Trappe, K., T. Marschall, and B.Y. Renard, Detecting horizontal gene transfer by mapping sequencing reads across species boundaries. Bioinformatics, 2016. 32(17): p. i595-i604.

174. Lloyd-Price, J., et al., Strains, functions and dynamics in the expanded Human Microbiome Project. Nature, in press.

175. Huang, K., et al., MetaRef: a pan-genomic database for comparative and community microbial genomics. Nucleic Acids Res, 2014. 42(Database issue): p. D617-24.

176. Louis, P., G.L. Hold, and H.J. Flint, The gut microbiota, bacterial metabolites and colorectal cancer. Nat Rev Microbiol, 2014. 12(10): p. 661-72.

177. Flint, H.J., et al., Interactions and competition within the microbial community of the human colon: links between diet and health. Environ Microbiol, 2007. 9(5): p. 1101-11.

178. Mark Welch, J.L., et al., Biogeography of a human oral microbiome at the micron scale. Proc Natl Acad Sci U S A, 2016. 113(6): p. E791-800.

179. Finn, R.D., et al., Pfam: clans, web tools and services. Nucleic Acids Res, 2006. 34(Database issue): p. D247-51.

180. Sitbon, E. and S. Pietrokovski, New types of conserved sequence domains in DNA-binding regions of homing endonucleases. Trends Biochem Sci, 2003. 28(9): p. 473-7.

181. Burrus, V., et al., The ICESt1 element of Streptococcus thermophilus belongs to a large family of integrative and conjugative elements that exchange modules and change their specificity of integration. Plasmid, 2002. 48(2): p. 77-97.

182. Burrus, V., et al., Conjugative transposons: the tip of the iceberg. Mol Microbiol, 2002. 46(3): p. 601-10.

183. Bonham, K.S., B.E. Wolfe, and R.J. Dutton, Extensive horizontal gene transfer in cheese-associated bacteria. Elife, 2017. 6.

184. Truong, D.T., et al., Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res, 2017. 27(4): p. 626-638.

185. Stokes, H.W. and M.R. Gillings, Gene flow, mobile genetic elements and the recruitment of antibiotic resistance genes into Gram-negative pathogens. FEMS Microbiol Rev, 2011. 35(5): p. 790-819.

186. Peng, Y., et al., IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 2012. 28(11): p. 1420-8.

187. Konstantinidis, K.T. and J.M. Tiedje, Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A, 2005. 102(7): p. 2567-72.

188. Jost, L., Entropy and diversity. Oikos, 2006. 113(2): p. 363-375.

189. National Transit Database. Monthly Module Raw Data Release. 2015; Available from: http://www.ntdprogram.gov/ntdprogram/data.htm.

190. Meadow, J.F., A.E. Altrichter, and J.L. Green, Mobile phones carry the personal microbiome of their owners. PeerJ, 2014. 2: p. e447.

191. Fierer, N., et al., Forensic identification using skin bacterial communities. Proc Natl Acad Sci U S A, 2010. 107(14): p. 6477-81.

192. Meadow, J.F., et al., Bacterial communities on classroom surfaces vary with human contact. Microbiome, 2014. 2(1): p. 7.

193. Robertson, C.E., et al., Culture-independent analysis of aerosol microbiology in a metropolitan subway system. Appl Environ Microbiol, 2013. 79(11): p. 3485-93.

194. Leung, M.H., et al., Indoor-air microbiome in an urban subway network: diversity and dynamics. Appl Environ Microbiol, 2014. 80(21): p. 6760-70.

195. Afshinnekoo, E., et al., Geospatial Resolution of Human and Bacterial Diversity with City-Scale Metagenomics. Cell Systems, 2015. 1(1): p. 72-87.

196. Ackelsberg, J., et al., Lack of Evidence for Plague or Anthrax on the New York City Subway. Cell Systems, 2015. 1(1): p. 4-5.

197. Segata, N., et al., Metagenomic biomarker discovery and explanation. Genome Biol, 2011. 12(6): p. R60.

198. Nelson, M.C., et al., Analysis, optimization and verification of Illumina-generated 16S rRNA gene amplicon surveys. PLoS One, 2014. 9(4): p. e94249.

199. Segata, N., et al., Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol, 2012. 13(6): p. R42.

200. Costello, E.K., et al., Bacterial community variation in human body habitats across space and time. Science, 2009. 326(5960): p. 1694-7.

201. Kembel, S.W., et al., Architectural design drives the biogeography of indoor bacterial communities. PLoS One, 2014. 9(1): p. e87093.

202. Lauber, C.L., et al., Pyrosequencing-based assessment of soil pH as a predictor of soil bacterial community structure at the continental scale. Appl Environ Microbiol, 2009. 75(15): p. 5111-20.

203. Knights, D., et al., Bayesian community-wide culture-independent microbial source tracking. Nat Methods, 2011. 8(9): p. 761-3.

204. Stolz, A., Molecular characteristics of xenobiotic-degrading sphingomonads. Appl Microbiol Biotechnol, 2009. 81(5): p. 793-811.

205. Peyraud, R., et al., Genome-scale reconstruction and system level investigation of the metabolic network of Methylobacterium extorquens AM1. BMC Syst Biol, 2011. 5: p. 189.

206. Kawamura, Y., et al., Genus Enhydrobacter Staley et al. 1987 should be recognized as a member of the family Rhodospirillaceae within the class Alphaproteobacteria. Microbiol Immunol, 2012. 56(1): p. 21-6.

207. Hewitt, K.M., et al., Bacterial diversity in two Neonatal Intensive Care Units (NICUs). PLoS One, 2013. 8(1): p. e54703.

208. Grice, E.A., et al., A diversity profile of the human skin microbiota. Genome Res, 2008. 18(7): p. 1043-50.

209. Dawson, T.L., Jr., Malassezia globosa and restricta: breakthrough understanding of the etiology and treatment of dandruff and seborrheic dermatitis through whole-genome analysis. J Investig Dermatol Symp Proc, 2007. 12(2): p. 15-9.

210. Zouboulis, C.C., Propionibacterium acnes and sebaceous lipogenesis: a love-hate relationship? J Invest Dermatol, 2009. 129(9): p. 2093-6.

211. Morgan, X.C., et al., Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol, 2012. 13(9): p. R79.

212. Barberan, A., et al., Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J, 2012. 6(2): p. 343-51.

213. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.

214. Bruggemann, H., et al., The complete genome sequence of Propionibacterium acnes, a commensal of human skin. Science, 2004. 305(5684): p. 671-3.

215. Lee, W.L., A.R. Shalita, and M.B. Poh-Fitzpatrick, Comparative studies of porphyrin production in Propionibacterium acnes and Propionibacterium granulosum. J Bacteriol, 1978. 133(2): p. 811-5.

216. Holland, K.T., et al., Propionibacterium acnes and acne. Dermatology, 1998. 196(1): p. 67-8.

217. Roessner, C.A., et al., Isolation and characterization of 14 additional genes specifying the anaerobic biosynthesis of cobalamin (vitamin B12) in Propionibacterium freudenreichii (P. shermanii). Microbiology, 2002. 148(Pt 6): p. 1845-53.

218. Hashimoto, Y., M. Yamashita, and Y. Murooka, The Propionibacterium freudenreichii hemYHBXRL gene cluster, which encodes enzymes and a regulator involved in the biosynthetic pathway from glutamate to protoheme. Appl Microbiol Biotechnol, 1997. 47(4): p. 385-92.

219. Kaminski, J., et al., High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol, in press.

220. McArthur, A.G., et al., The comprehensive antibiotic resistance database. Antimicrob Agents Chemother, 2013. 57(7): p. 3348-57.

221. Yooseph, S., et al., A metagenomic framework for the study of airborne microbial communities. PLoS One, 2013. 8(12): p. e81862.

222. Qin, J., et al., A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 2012. 490(7418): p. 55-60.

223. Yatsunenko, T., et al., Human gut microbiome viewed across age and geography. Nature, 2012. 486(7402): p. 222-7.

224. Hu, Y., et al., Metagenome-wide analysis of antibiotic resistance genes in a large cohort of human gut microbiota. Nat Commun, 2013. 4: p. 2151.

225. Chen, L., et al., VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res, 2005. 33(Database issue): p. D325-8.

226. Li, Y., et al., Role of ventilation in airborne transmission of infectious agents in the built environment - a multidisciplinary systematic review. Indoor Air, 2007. 17(1): p. 2-18.

227. Gibbons, S.M., et al., Ecological succession and viability of human-associated microbiota on restroom surfaces. Appl Environ Microbiol, 2015. 81(2): p. 765-73.

228. Glass, E.M., et al., MIxS-BE: a MIxS extension defining a minimum information standard for sequence data from the built environment. ISME J, 2014. 8(1): p. 1-3.

229. National Centers for Environmental Information & National Oceanic and Atmospheric Administration. Record of Climatological Observations. 8/29/2015; Station: Boston Logan International Airport, MA, US. Available from: http://www.ncdc.noaa.gov/cdo-web/.

230. Weather Underground. Weather History for KBOS. 8/29/2015; Available from: http://www.wunderground.com/history/.

231. Paulino, L.C., et al., Molecular analysis of fungal microbiota in samples from healthy human skin and psoriatic lesions. J Clin Microbiol, 2006. 44(8): p. 2933-41.

232. Caporaso, J.G., et al., Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4516-22.

233. Caporaso, J.G., et al., QIIME allows analysis of high-throughput community sequencing data. Nat Methods, 2010. 7(5): p. 335-6.

234. Oksanen, J., et al., vegan: Community Ecology Package. 2015.

235. Asnicar, F., et al., Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ, 2015. 3: p. e1029.

236. Morgat, A., et al., UniPathway: a resource for the exploration and annotation of metabolic pathways. Nucleic Acids Res, 2012. 40(Database issue): p. D761-9.

237. Suzek, B.E., et al., UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 2015. 31(6): p. 926-32.

238. Liu, B. and M. Pop, ARDB--Antibiotic Resistance Genes Database. Nucleic Acids Res, 2009. 37(Database issue): p. D443-7.

239. Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20.

240. Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat Methods, 2012. 9(4): p. 357-9.

241. Rastogi, G., et al., A PCR-based toolbox for the culture-independent quantification of total bacterial abundances in plant environments. J Microbiol Methods, 2010. 83(2): p. 127-32.

242. Lane, D., 16S/23S rRNA sequencing, in Nucleic acid techniques in bacterial systematics, E. Stackebrandt and M. Goodfellow, Editors. 1991, John Wiley and Sons: Chichester, United Kingdom. p. 115-175.

243. Flores, G.E., J.B. Henley, and N. Fierer, A direct PCR approach to accelerate analyses of human-associated microbial communities. PLoS One, 2012. 7(9): p. e44563.

244. Adams, R.I., et al., Passive dust collectors for assessing airborne microbial material. Microbiome, 2015. 3: p. 46.

245. Checinska, A., et al., Microbiomes of the dust particles collected from the International Space Station and Spacecraft Assembly Facilities. Microbiome, 2015. 3: p. 50.
