THE DEVELOPMENT OF A MICROBIOME REFERENCE THAT SPANS THOUSANDS OF INDIVIDUALS by DANIEL THOMAS MCDONALD B.S., University of Colorado, 2008

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirement for the degree of Doctor of Philosophy Department of Computer Science 2015

This thesis entitled: The Development of a Microbiome Reference that Spans Thousands of Individuals written by Daniel Thomas McDonald has been approved for the Department of Computer Science

Professor Robin Dowell

Professor Ken Krauter

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

IRB protocol # 12-0582


McDonald, Daniel Thomas (Ph.D., Computer Science)

The Development of a Microbiome Reference that Spans Thousands of Individuals

Thesis directed by Professor Rob Knight

The research objective of this thesis is to measure the extent of microbial diversity associated with the human large intestine to an accuracy within the limits of the V4 region of the 16S rRNA gene at 97% similarity. This gene has become a powerful tool for assessing microbiome composition, and in recent years, a significant amount of research has shown an intimate relationship between the microbiome and human health. Unlike the human genome, the bulk of which is shared across the human population, the human microbiome has no common component.

What has been observed is a range of configurations, with factors such as age and BMI being strongly associated with these differences. To date, however, no project has aimed to scope the range of microbiome configurations, and thus our concept of what it means to be healthy (from a microbial perspective) is nonexistent.

International efforts such as the American Gut Project will not only help us to understand more about our microbial constituents, but also pave the way toward understanding how these communities can be manipulated for the benefit of human health.

The structure of this thesis is to first provide background about the microbiome through a commentary on the history of 16S, and a review on microbiome research.


Following this, the next series of chapters is concerned with building the case for large-scale microbiome studies leading up to the American Gut Project. The second half of the thesis emphasizes the computational difficulties of the research, and specific contributions made to the processing and analysis of sequence data that enable insight into the microbiome. These contributions include a file format that is a recognized standard by the Genomic Standards Consortium, a novel method for transferring taxonomies to benefit taxonomic curation, and a practical biological example of the use of reproducible and executable IPython Notebooks. Last, the thesis discusses a software tool that has been useful in the analysis of next-generation sequence data, and a few microbiome analyses.


ACKNOWLEDGEMENTS

Progress in the sciences is contingent on human interaction. Collaboration is essential as it brings in new ideas and bridges gaps in experience. But the human side of science goes beyond collaboration; it is helping others discover the unknown, encouraging people to grow, empathizing with challenges and sacrifices, and of course patience.

Ten years ago, I was in the process of failing out of an undergraduate degree in computer science and was perpetually injured due to snowboarding more than sleeping. I was disillusioned, literally a wreck, and teetering on dropping out. By chance from an emergency room visit, I met Jeremy Widmann, who was a researcher with Rob Knight. Jeremy and I became good friends, and he subsequently introduced me to Rob. We identified a research project to work on based on a mutual interest in networks, and it’s been an amazing ride ever since.

Rob’s commitment and dedication are unparalleled, and he is nothing short of profoundly inspirational. Without a doubt, I would not be putting the final touches on a PhD thesis had it not been for that chance encounter, and Rob’s patience, persistence, encouragement and support.

I thank Robin Dowell, whose advice I’ve sought and valued through graduate school, for offering to co-advise me for the tail end of my graduate career. I also thank Nikolaus Correll, Ken Anderson and Ken Krauter for their service on the thesis committee, and Henry Tufo for his service on the thesis proposal.

Thank you to Phil Hugenholtz, whose painstaking curation efforts and endless ideas for improvement helped to set an initial direction for my graduate studies.

Much of the work in this thesis was done in collaboration with the incredible members of the Knight lab (former and current) and in collaboration with fantastic individuals from around the world in other groups. Their names are included where appropriate in the text. I’d specifically like to thank (in no particular order) Jerry Kennedy whom I shared an office with for years and who would entertain insane ideas or issues as they arose, Ulla Westermann who tirelessly kept reimbursements and purchasing organized, Jeff DeReus for doing wonders to the compute infrastructure, Greg Caporaso for the well-organized and productive code sprints, Justine Debelius for her relentless drive into the American Gut data, Jose Clemente for always being open to bounce ideas off of, Antonio Gonzalez for being perpetually positive and forward thinking, Greg Humphrey and Donna Berg-Lyons who make the wetlab magic happen, Cathy Lozupone for her excitement and detailed insight into the microbiome, Jessica Metcalf for always being willing to give detailed feedback on papers, Se Jin Song for answering my naïve molecular questions, Gail Ackermann for painstakingly managing the American Gut IRB protocols, Julia Goodrich for the late night phylogenetics work, Jeremy Widmann for getting me involved in this madness, Elaine Wolfe for being the public face of the American Gut help account, Adam Robbins-Pianka for the frequent and great late-night conversations and instrumental work on the American Gut website, Josh Shorenstein for the work on the American Gut localization and Vioscreen, Yoshiki Vazquez Baeza for frequent laughs and making Chrome do terrible things, Luke Ursell for helping with the organization of this thesis, Embriette Hyde for maintaining the American Gut blog, Luke Thompson for resolving the LaTeX necessary for participant results, and Jose Navas and Amnon Amir for their investigation into filtering for bloom sequences in the American Gut.

I’d especially like to thank Scott Handley, Karin Rengefors, Naiara Rodríguez-Ezpeleta and Konrad Paszkiewicz who organize and run the Workshop on Genomics in the Czech Republic, a workshop that I’ve taught at the last four years, and which has been one of the most remarkable experiences of my academic career.

Over the last two years, I’ve had the opportunity to begin an investigation into the microbiome of ICU patients made possible thanks to the efforts of Paul Wischmeyer.

For all of the students in the inaugural year of the Interdisciplinary Quantitative Biology program: I think a random walk is a reasonable null model for graduate school.


The Interdisciplinary Quantitative Biology program is now beginning its fourth year, and I would like to thank Andrea Stith, Jana Watson-Capps, Emilia Costales, Kim Kelley, Kim Little and Janice Jones for keeping the program alive and running. And thank you to Tom Cech for making the program happen, and for listening to all the crazy ideas that the students have about it.

Thank you to Rajshree Shrestha and Jackie DeBoard for helping me navigate the surprisingly hard-to-decipher graduate school requirements.

Thank you to my friends, and in particular, my old roommates Aimee d’Emery, Elias Santistevan, and Adam Robbins-Pianka who put up with my odd hours and weird travel schedule.

Finally, thank you to my parents Pam and David McDonald who put me on the path of playing with computers, and my sister Kate and her wife Max for unwavering advice and support. And, last, I’d like to abundantly thank my wife Alina for her incredible support and patience over the last four years.


CONTENTS

CHAPTER

I. Ribosomal RNA, the lens into life

Commentary

Concluding remarks

II. From molecules to dynamic communities

Introduction

From catalogs to robust, reproducible community patterns

How do we know which microbes are present?

Is there a core human microbiome?

Microbial community states associated with disease

Changes in the microbiome over time

Conclusions and outlook

Concluding remarks

III. Context, and the human microbiome

Introduction

Importance of reference sets in human microbiome research

Limitations of existing reference sets

Technical challenges to employing reference sets

Examples of reference set usage

Contributions of the American Gut Project

Conclusion

Concluding remarks


IV. Towards large-cohort comparative studies to define factors influencing the gut microbial community structure of ASD patients

Introduction

The importance of experimental design: cross-sectional versus longitudinal analysis

Cross-sectional study designs

Longitudinal study designs

Comparison of study designs with respect to neuropsychiatric disorders

The importance of controlling for technical variables in traits with small effect size

The influence of diet on the microbiome

Correlation versus causation

Conclusion

Concluding remarks

V. American Gut: an Open Platform for Citizen-Science Microbiome Research

The American gut is diverse, and in some cases extreme

Correlations with participant health, lifestyle, and diet

Power curves and effect sizes

Using the American Gut as a discovery platform

Concluding remarks

VI. The Biological Observation Matrix (BIOM) Format or: how I learned to stop worrying and love the ome-ome

Background

Data description

Analyses

Discussion


Methods

Growth of the ome-ome

BIOM file format

Concluding remarks

VII. An improved Greengenes taxonomy for bacteria and archaea with explicit ranks

Introduction

Materials and methods

16S data compilation and de novo tree inference

Transferring taxonomies to trees (tax2tree)

Taxonomy comparisons

Results and discussion

Construction of the rank-explicit Greengenes taxonomy

The value of accommodating polyphyletic groups in a 16S rRNA-based taxonomy

Reconciliation of NCBI and Greengenes-defined candidate phyla

Final comments and prospectus

Concluding remarks

VIII. Collaborative cloud-enabled tools allow rapid, reproducible biological insights

Article

Concluding remarks

IX. Enabling scientific insight and analysis contributions

Concluding remarks

X. Conclusions

XI. Bibliography


XII. Appendix

A. American Gut Consortium

B. Supplemental Tables and Figures for Chapter V

C. Supplemental Text, Tables and Figures for Chapter VI

D. Supplemental Tables and Figures for Chapter VII

E. Supplemental Figure for Chapter VIII


TABLES

TABLE 1: A COMPARISON OF OTU-PICKING STRATEGIES

TABLE 2: GREENGENES CLASSIFICATIONS


FIGURES

FIGURE 1: RELATIONSHIP BETWEEN SEQUENCING READ-LENGTH AND CLASSIFICATION

FIGURE 2: PREDATOR-PREY DYNAMICS FOR TWO SPECIES X AND Y

FIGURE 3: DIVERSITY OF THE AMERICAN GUT

FIGURE 4: HEALTH AND LIFESTYLE IN THE AMERICAN GUT FECAL SAMPLES

FIGURE 5: STATISTICAL POWER ESTIMATES IN THE AMERICAN GUT

FIGURE 6: ICU FECAL SAMPLES COMPARED AGAINST FECAL SAMPLES FROM SELF-REPORTED HEALTHY AMERICAN GUT PARTICIPANTS

FIGURE 7: GROWTH OF THE “OME-OME”

FIGURE 8: BIOM SIZE COMPARISON

FIGURE 9: TAX2TREE WORKFLOW

FIGURE 10: NCBI VS. GREENGENES

FIGURE 11: INCLUDED DATA AND RESULTS FOR THE IPYTHON PAPER


Chapter 1 Ribosomal RNA, the lens into life

From: McDonald D, Xu Z, Hyde ER, Knight R. (2015) Ribosomal RNA, the lens into life. RNA 21(4):692-4. PMID: 25780194

Ribosomal RNA has had a profound impact on our perspective of life on the planet. From forging our recognition of three domains of life, to providing researchers with a tool for identifying microorganisms that dominate life, the study of these genes, and more recently their utility for community readout, has ushered in a new era of life sciences research. This commentary, published in RNA, provided a brief history of how our perspectives on life have changed since the journal's inception.

Commentary

Our world is, to a first approximation, a microbial world; even our own lives are evolutionarily and molecularly linked to that which we cannot see. This “invisible” world contains nearly unfathomable molecular and genetic diversity. Throughout the 20th century, a metaphor of “war against disease” prevailed: Microbes were hunted under the guise of health and agriculture, the goal being the eradication of “pathogens” at a global scale. Microbes were considered at best a nuisance and at worst a threat: a target for technological solutions. Humanity invested heavily, with the broad support of the scientific community, in the advancement of weapons for the war against microbes. We sought out novel antibiotics and antimicrobials as our enemy mounted resistance.

However, despite the profile of microbial pathogens as sources of human suffering needing to be eradicated, the scientific mindset still had room to respect the unknown. The evidence of what we might miss was there from the beginning. Even Pasteur noted that many microbes were beneficial, especially in industry, and doubted that humans could live without their microbial guests. Beijerinck wrote in the early 1900s that “...in its primitive form life is like fire, like a flame borne by the living substance; like a flame which appears in endless diversity and yet has specificity within it.” Now, by viewing the microbial world through the lens of RNA, we are beginning to see that this flame of diversity can be a powerful ally.

In the 20 years since it was founded, RNA has been a key partner in a dramatic expansion of our ability to see into the microbial world, including the subtle effects of the beneficial microbes that overwhelmingly outnumber pathogens. In particular, using ribosomal RNA not just as an object of study in itself but as a tool to investigate the structure of microbial communities has led to a fuller appreciation that, in nature, no organism lives in isolation. Advances in molecular techniques, improvements to sequencing technologies, expansion of databases, and development of scalable software have all played key roles in characterizing microbial communities in environments ranging from the deep ocean to the air to our own bodies. In many ways, these new insights have come from seeing RNA as a tool, at higher and higher levels of organization.

The first of these conceptual leaps came from the work of Carl Woese and George Fox in the late 1970s, who realized that universally conserved components of the translation apparatus, first the 5S and then the longer small subunit ribosomal RNA gene, could be not just objects of study in the context of their role in translation, but also key markers that could be used to interpret large-scale patterns of the evolution of life. The reorganization of life's diversity from a five-kingdom model (plants, animals, fungi, protists, and bacteria) to a three-domain model (bacteria, archaea, and eukaryotes), with the revelation that most diversity was found not in organisms visible to the naked eye but rather in the microbial world, was in many ways as fundamental to our understanding of our place in the universe as the Copernican shift from a geocentric to a heliocentric model. This idea, that rRNA could be used to read out an organism's place in the tree of life, rapidly led to a proliferation of rRNA sequences in the database, as many investigators sought to place their favorite organisms on this universal tree of life.

The second conceptual leap was to realize that in addition to placing known organisms on the tree of life, rRNA used as a universal marker could also place unknown organisms, and thus act as a tool for identifying novel members of microbial communities. Norm Pace and members of his laboratory developed environmental PCR, which allowed direct investigation of the organisms in an environmental sample without resorting to culture-based techniques that can only grow a small fraction of the microbes in a given sample. Norm, together with David Lane, Mitch Sogin, Phil Hugenholtz, and many others rapidly expanded the tree of life to encompass many previously unsuspected lineages, the vast majority of which still have not been cultured successfully today. As SSU rRNA sequences began to amass from a multitude of environments, identifying the kinds of communities in each of those environments, the need for comprehensive references arose. These resources, including the Ribosomal Database Project, Greengenes, and SILVA, centralized our growing knowledge of the various lineages of microbes that were out there, and especially in the case of Greengenes, highlighted the vast “dark” portions of the tree in which no organisms have been cultured. Of the phyla recognized within the kingdom Bacteria, approximately 30 are represented by organisms that have been successfully cultured, whereas 50–100 candidate phyla have been reported by different authors. As resources grew, taxonomies based on phylogeny highlighted the gross errors present in the published literature, often stemming from earlier classifications based on morphology or biochemistry, which evolve in a far less clock-like fashion than DNA sequences.

The third conceptual leap was to realize that as well as comparing organisms to one another in the context of a single community, the universal tree of life could be used to compare whole communities to each other, thus providing an overall picture of which factors drive microbial community composition. The initial idea for this came from a discussion after one of Norm's lab meetings, where the problem of comparing five communities from one of Jeff Walker's studies on endolithic communities (microbes that live in rocks) arose. Andy Martin had devised a test called the P test, which could use the tree to test whether two communities were identical. The problem was that all the communities were statistically significantly different, and where do you go from there? One of us (Rob Knight) realized that the goal wasn't so much to find out whether the communities were different, because all microbial communities will differ in some respect, but rather to tell how far apart evolutionarily they were from one another. With this distance metric in hand, it would then be possible to perform ecological techniques such as ordination based on distances from the phylogenetic tree, and to find out which environmental factors were most important in determining which microbes live in which communities.

Cathy Lozupone completed a very exciting PhD thesis in Rob's lab developing this distance metric, called UniFrac (for “Unique Fraction”, i.e., the fraction of the evolutionary tree unique to one lineage or another) and set up an easy-to-use web site in conjunction with another graduate student, Micah Hamady. The Knight lab then proved UniFrac's metric properties together with Manuel Lladser. Shortly after developing UniFrac, Cathy set out to ask a simple and straightforward question: Is there a primary environmental factor that differentiates microbial communities? Through an analysis that spanned over 200 papers (although she had to read over 400 to find the ones that had used general-purpose primers and that had deposited their data), Cathy saw that communities associated with saline environments were systematically different than those associated with non-saline, with brackish environments such as estuaries falling between them (this factor, a long way down the list of factors examined, outweighed temperature, pH, etc.).
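The "unique fraction" idea behind UniFrac can be illustrated with a minimal sketch. It assumes the common unweighted formulation: the branch length leading only to members of one community, divided by the branch length leading to members of either. The toy tree, node names, and branch lengths are invented for illustration, and this is not the implementation used in the work described here.

```python
from typing import Dict, Set, Tuple

# Hypothetical rooted tree: node -> (parent, length of the branch to the parent).
# Tips t1..t4 stand in for sequences; every name and length here is made up.
TREE: Dict[str, Tuple[str, float]] = {
    "n1": ("root", 1.0),
    "n2": ("root", 1.0),
    "t1": ("n1", 0.5),
    "t2": ("n1", 0.5),
    "t3": ("n2", 1.0),
    "t4": ("n2", 2.0),
}


def tips_below(tree: Dict[str, Tuple[str, float]], node: str) -> Set[str]:
    """Return the set of tips descending from node (a tip returns itself)."""
    children = [child for child, (parent, _) in tree.items() if parent == node]
    if not children:
        return {node}
    tips: Set[str] = set()
    for child in children:
        tips |= tips_below(tree, child)
    return tips


def unweighted_unifrac(
    tree: Dict[str, Tuple[str, float]],
    community_a: Set[str],
    community_b: Set[str],
) -> float:
    """Fraction of observed branch length that leads to only one community."""
    unique = 0.0
    observed = 0.0
    for node, (_, length) in tree.items():
        tips = tips_below(tree, node)
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            observed += length      # branch is used by at least one community
            if in_a != in_b:
                unique += length    # branch is used by exactly one community
    return unique / observed if observed else 0.0


distance = unweighted_unifrac(TREE, {"t1", "t2"}, {"t2", "t3"})
```

In this sketch, identical communities give a distance of 0 and communities that share no branches give 1. A matrix of such pairwise distances is the kind of input that the ordination techniques mentioned earlier operate on.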

Following this remarkable observation, Ruth Ley asked how communities associated with the mammalian gut fit in, and much to our surprise, the most important factor was whether the community was environmental or host-associated: This was twice as important as the saline/non-saline split in environmental samples. This observation suggests that the selective pressures put forth by the gastrointestinal tract are more important than those present in any other environment.

UniFrac came along at just the right time because following on the heels of the Human Genome Project, “next generation sequencing” instruments enabled a completely new scale of DNA sequencing. Coupled with environmental PCR, we and other labs began to generate tremendous amounts of SSU rRNA amplicon sequence data. For example, in late 2007, the core facility down the hall from the Knight lab was still charging $8 per sequence; shifting to the 454 instrument, in the first deployment of our highly multiplexed barcoded sequencing protocol with Kirk Harris, Jeff Walker, Nick Gold, and Micah Hamady, we collected 500,000 sequences for $12,000. If we had done this the old way, it would have cost $4 million, and they wouldn't be finished with the sequencing yet! Together with other pioneers in amplicon barcoding such as Rick Bushman and Mitch Sogin, and computational tools such as QIIME, developed in my lab by Greg Caporaso, Jesse Stombaugh, Justin Kuczynski, and a host of others, we enabled these protocols to be deployed in a huge range of environmental samples. These sequences came from every conceivable environment on the planet, and previously unobserved life was popping up everywhere.
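The cost comparison above can be checked with a short calculation. The three input figures come from the text; the per-sequence cost and fold reduction are derived here for illustration, and the variable names are hypothetical.

```python
# Figures quoted in the text.
OLD_COST_PER_SEQUENCE = 8.00    # dollars per sequence at the core facility
SEQUENCES_COLLECTED = 500_000   # sequences from the first barcoded 454 run
RUN_COST_454 = 12_000.00        # dollars for that run

# What the same number of sequences would have cost "the old way".
old_way_total = OLD_COST_PER_SEQUENCE * SEQUENCES_COLLECTED

# Derived values (not stated in the text).
cost_per_sequence_454 = RUN_COST_454 / SEQUENCES_COLLECTED
fold_reduction = old_way_total / RUN_COST_454

print(f"Old way: ${old_way_total:,.0f}")
print(f"454 run: ${cost_per_sequence_454:.3f} per sequence")
print(f"Roughly {fold_reduction:.0f}-fold cheaper")
```

The old-way total reproduces the $4 million quoted above, and the derived per-sequence cost works out to a few cents.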

Enter the poop. Through a back-of-the-envelope calculation in the 1970s, Dwayne Savage estimated the total number of microbial cells in and on the human body to be up to 100 trillion; 10-fold more than the number of human cells in a body. The vast majority of these microbial cells reside in the large intestine, and the structure of these communities can be observed (in approximation) by proxy through the end result of last night's dinner. Changes in microbial communities have now been linked to a wide range of conditions, including inflammatory bowel disease, colon cancer, cardiovascular disease, and rheumatoid arthritis in humans, and, in mouse models, multiple sclerosis, depression, and even autism. However, simply observing a community does not provide a mechanistic understanding of the interplay between the microbiome and the host. To gain such an understanding, we need hosts, such as mice, that completely lack microbes and which can be experimentally inoculated with known microbes under controlled conditions. Jeff Gordon's group, through an interest in determining the causes of obesity in the mid-2000s, brought the use of gnotobiotic mice to scale, developing a framework for assessing causality in microbiome studies. Remarkably, phenotypes can be transferred not just from mouse to mouse but even from human to mouse by transferring the microbiome. For example, germ-free mice, when inoculated with the communities from humans who have kwashiorkor, a pervasive wasting disease in developing countries, will develop similar symptoms. Similarly, germ-free mice inoculated with bacteria from an obese human will gain significantly more adipose tissue than those inoculated with the microbiome of a healthy human. These phenotypic changes are coupled to changes in the microbial community, which can be read out from the RNA. These results cemented the idea that the microbiome should be thought of as a vital organ, and, much as we take care to exercise for cardiovascular health, it is imperative that we take care of our microbial friends through eliminating unnecessary antibiotic use, regularly consuming dietary fiber (which feeds the butyrate producers in the large intestine), and other measures.

The fourth, and most recent, conceptual leap is occurring now: Rather than using the microbial communities as objects of study in and of themselves, we can use the communities as a tool to read out environmental or medical conditions. For example, we can tell today if a person is lean or obese with 90% accuracy based on the microbial community in a person's stool: On the one hand, there are easier ways to tell if someone is fat, but on the other we can only do this with 58% accuracy using every human gene ever linked to obesity by genome-wide association studies.


Similar potential has been shown for reading out diabetes and cirrhosis from the human microbiome, and for reading out soil pH and nitrogen content, oil pollution in marine water and sediment, stress in populations of primates, post-mortem interval of a corpse, and a wide range of other conditions. An exciting new frontier is in the dynamics of the microbiome, where it may actually be how your microbiome changes, rather than a static snapshot, that can best be linked to your physiological state.

Taken together, what is remarkable is that in such a short timespan, ribosomal RNA has moved from being an object of study in its biochemical role, to a tool for placing organisms on a phylogenetic tree, to a tool for understanding who lives in a given microbial community, to a tool for relating communities to one another, and most recently to a tool for reading out the properties of an organism or environment from the communities it harbors. As sequencing technologies and software to interpret all those sequences continue to advance, it will be fascinating to see how these applications, and conceptual advances we have not yet even begun to anticipate, will cement the central role of RNA as a marker for the microbial world, an instrument to conserve precious microbial biodiversity, and an enabling technology to improve human and environmental health.

Concluding remarks

The use of ribosomal RNA for understanding differences in communities in relation to environmental changes has been a catalyst for linking disparate fields of study, such as ecology with molecular biology, and both with computer science, which has provided a foundation for manipulating large volumes of data effectively and for gaining biological insight.


Chapter 2 From molecules to dynamic communities

From: McDonald D, Vázquez-Baeza Y, Walters WA, Caporaso JG, and Knight R. (2013) “From molecules to dynamic biological communities.” Biology and Philosophy 28(2):241-259. PMID: 23483075

This published chapter discussed how ribosomal RNA can be used for community readout, some of the insights that have been made, particularly in human-associated environments, as well as some of the present computational challenges with the analysis and interpretation of these datasets.

Introduction

Microbial ecology used to be a small and specialized field that struggled to identify more than a tiny proportion of the Earth’s microbial biodiversity. Part of the problem was due to the prevalence of pure-culture methods, in which microorganisms had to be removed from their natural environments (which included communities of other organisms) and cultured in laboratories. Recent advances in molecular techniques, sequencing technologies and computational methods have enabled researchers to explore the microbial world at unprecedented levels, with a focus on the natural habitats of microorganisms. The combination of these advances has so far produced remarkable insight into the role of microorganisms in human health and their powerful effects on the natural world, while at the same time developing novel evidence about the evolution and diversification of life on Earth. In this article, we discuss how these advances have allowed researchers to create new lines of inquiry, we summarize important biological and philosophical results from recent publications, and we discuss how our improved understanding of microbial ecology may affect our lives in the coming years.

The last decade has seen a transformation and democratization of DNA sequencing [1]. High-throughput sequencing, of the type necessary to characterize the complex microbial communities that inhabit our bodies, used to be the exclusive province of a few large sequencing centers: only research groups with access to substantial resources could engage in sequencing projects. Now, a benchtop machine that fits in an individual investigator’s laboratory can produce billions of 100-nucleotide sequences per month. For comparison, a bacterial genome from the gut is typically about three million nucleotides and the human genome is about three billion nucleotides. However, the number of bacterial genomes that inhabit a human implies that they contribute far more genes than does our human genome [2].
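As a rough worked example of the scale comparison in the paragraph above, the sketch below converts one month of benchtop output into genome equivalents. The figure of exactly one billion reads is an assumption standing in for the text's "billions"; the read length and genome sizes are the approximate values quoted above. These are raw nucleotide totals, not assembled genomes.

```python
# Assumed lower bound for "billions" of reads per month (illustrative only).
READS_PER_MONTH = 1_000_000_000
READ_LENGTH_NT = 100                # read length quoted in the text
BACTERIAL_GENOME_NT = 3_000_000     # "about three million nucleotides"
HUMAN_GENOME_NT = 3_000_000_000     # "about three billion nucleotides"

# Total nucleotides sequenced in one month under the assumption above.
total_nt = READS_PER_MONTH * READ_LENGTH_NT

# How many genomes' worth of raw sequence that represents.
bacterial_equivalents = total_nt / BACTERIAL_GENOME_NT
human_equivalents = total_nt / HUMAN_GENOME_NT

print(f"{total_nt:.1e} nucleotides per month")
print(f"about {bacterial_equivalents:,.0f} gut bacterial genome equivalents")
print(f"about {human_equivalents:.0f} human genome equivalents")
```

Under these assumptions, a single month of output corresponds to tens of thousands of bacterial genome equivalents, which is why a fixed sequencing budget now goes so much further for community surveys.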

Playing music from a digital file once required a high-end workstation but can now be performed on a handheld device because transistors can now be packed more densely onto a microchip. In the same way, characterizing the types (e.g., the strains, species or phyla) of microbes present in a given sample (the microbiota) or the genes present in these microbes (the microbiome) are problems that can be addressed with a fixed amount of sequencing that is rapidly becoming cheaper and more accessible.

These transformations in sequencing technology have correspondingly changed what it means to undertake a sequencing project. When sequences were very expensive (in the late 1970s and early 1980s), it was a substantial accomplishment to sequence even one gene from one species. Correspondingly, the focus was on identifying genes that acted as the best phylogenetic markers. These were short fragments of sequences from which inferences about the patterns of evolution were likely to match the inferred patterns of evolution of the corresponding species.

These markers therefore provided efficient readouts of evolutionary history while minimizing sequencing costs. For example, ribosomal RNA genes, which play essential structural and catalytic roles in the ribosome and are thought to be almost exclusively vertically transmitted [3, 4], have been especially useful for reconstructing phylogenetic trees, including phylogenetic trees of organisms that have not been isolated in pure culture [5]. Initial studies focused on the 5S rRNA gene [6], although expansion to longer rRNA genes, notably the small subunit rRNA, has allowed substantially greater phylogenetic resolution [7, 8]. Here we describe several conceptual changes deeper sequencing has led to already, and will refine in the future.


From catalogs to robust, reproducible community patterns

The initial focus on cataloging the rRNA genes in individual species allowed phylogenies of known taxonomic groups to be reconstructed. This work provided the framework for our initial understanding that life on Earth falls into at least three distinct lineages: the Archaea, the Bacteria, and the Eukarya (initially described as the archaebacteria, the eubacteria, and the urkaryotes, respectively) [6]. These findings, which focused on sequencing DNA from known species, were soon complemented by a radical idea: that these phylogenetic marker genes could be isolated from unknown species via bulk DNA extraction directly from the environment. This technique, pioneered by the Pace lab [9], allowed researchers to start building catalogs of the known and unknown organisms, the DNA of which was present in any given environment. As the cost of sequencing DNA declined, the focus on sequencing single marker genes such as the 16S rRNA gene expanded to include shotgun metagenomic surveys, in which total DNA extracted from a sample is fragmented and sequenced. Both approaches are widely employed today. Marker-gene surveys are used to investigate the microbiota of a sample, and metagenomic surveys are used to investigate both the microbiota and the microbiome of a sample.

These two views of microbial communities can yield different findings, because functional genes are frequently transferred horizontally (i.e., between different lineages), whereas rRNA genes are almost always transferred vertically. However, several recent studies have found similar patterns in both types of data [10-12].

The 26 years of sequencing since Pace’s first community sequencing efforts have revealed a picture of 85+ phyla within the bacteria alone, and in some cases as many as 15 new candidate phyla have been detected in a single study [12, 13]. The bacterial and archaeal census has been estimated to reach as many as 10^6–10^9 species [14], when calculated using sequence similarity criteria. Robust patterns of microbial community composition have now been observed in a wide range of host-associated and free-living contexts. For example, human body sites are highly distinct from one another and highly diverse among individuals [15, 16]. Although any two humans are >99 % identical in their genome composition [17], there are no species-level OTUs (operational taxonomic units) shared across the gut microbial communities of all humans [18]. This lack of shared OTUs suggests that many of the phenotypic differences that we see between humans may arise from differences in our microbiota, rather than differences in our genomes. We suspect that this observation will drive many advances in medicine in the coming years. For example, lean and obese individuals differ systematically in their gut microbial communities [10, 13, 19] but much less so in their genomic composition. Obesity can be identified 90 % of the time using the bacteria in the feces alone [19], but with only 58 % accuracy from variations in the genomes of different individuals [20]. Similarly surprising insights have arisen in environmental microbiology. For example, pH has been found to be the main driver of microbial communities in soil [21-24], and salinity plays a crucial role in structuring both free-living bacterial and archaeal communities across many environments [25-28]. These patterns can be striking: for example, seasonal patterns in marine water microbial diversity are highly reproducible in the Western English Channel [29], with the same organisms dominating the community in a given season year after year. However, most of the organisms present in any given season can be found at a single time-point if more sequences (millions rather than thousands) are collected from the sample [30]. These results suggest that seasonal differences do not arise from the presence or absence of community members, but rather from variations in the abundance of organisms that are always present. This finding reinforces the point that much of what we think we know about the microbial world may be limited by the amount of sequencing that it is cost-effective to perform. The work to catalog Earth’s microbial diversity has thus produced a compendium of rich and detailed observations, and efforts such as the Earth Microbiome Project [31, 32] will round out our encyclopedia of the microbial world. But cataloging alone is insufficient: a list of the species present in a rainforest, for example, speaks little to the interactions, functions or potential of the organisms so listed.

The problem with phylogenetic marker gene surveys, such as the 16S rRNA gene sequencing projects described above, is that they tell us the ‘who’ without the ‘how’, thus failing to answer the most pressing questions. For instance, how can an organism live at pH 0 [33], and what can such capacities teach us about the potential for pollution mitigation or for life on other planets? Endeavors such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) [34] perform whole-genome sequencing on organisms that are as phylogenetically divergent as possible from previously sequenced organisms. Even a small amount of this phylogenetically targeted genome sequencing provides novel gene discovery that greatly outpaces gene discovery from organisms chosen arbitrarily or at random. Targeted sequencing can inform us about the reproducibility of the evolutionary process among organisms from different lineages that adapt to similar environments. For example, comparative genomics based on whole-genome data, and linked to rich evolutionary history and detailed environmental information (derived from marker gene databases and marker gene surveys, respectively), can offer insights into which types of biochemical or regulatory functions are necessary to survive in a given environment. These results enable an understanding of the systems biology of microbial communities, which can ultimately be applied to engineer microbial communities to treat disease, generate electricity, or clean up hazardous waste sites. However, marker gene surveys still improve our understanding of microbial ecology and enable novel findings and technological applications. We will focus on this technique for the remainder of the paper to show how this is the case.

How do we know which microbes are present?

A key problem with studies of the microbiome lies in determining which organisms are present. All stages of the process, including DNA extraction, amplification of specific target genes, clustering of sequences, and identification of taxonomic groups, are prone to both error and bias [35]. As the number of sequences involved in a given study has grown, reliance on advanced computational methods has increased [36]. However, the algorithm that is chosen can have large impacts both on beliefs about what organisms are present in a given environment [37] and on how many kinds of organisms are present [38, 39]. Even defining kinds of organisms is complicated at the microbial level. In lieu of a robust definition of a microbial species [40], the percentage of sequence identity of a marker gene is often used to define operational taxonomic units or OTUs. For example, most 16S rRNA gene-based studies treat a cluster of sequence fragment ‘reads’ (the output of a DNA sequencing instrument, and thus the typical observation in studies of microbial communities) that are >97 % identical to one another as members of the same OTU. An identity of 97 % is treated as a proxy for species-level groupings of organisms, although this definition is known to be problematic for several reasons. One is that the rate of evolution of the 16S rRNA gene differs among taxonomic lineages, so the same number of differences in the sequence may represent different times since divergence from a most recent common ancestor. The choice of algorithm for assigning sequences to OTUs can also have a large impact on which sequences are clustered into the same OTU and on how many OTUs are observed in a study. For example, it is not clear whether a 97 % sequence identity threshold means that each sequence added to an OTU must be 97 % similar to all other sequences in the OTU cluster, or whether each sequence should be 97 % similar to the sequence that defines the center of the cluster (i.e., the cluster centroid) [41, 42]. Because neither laboratory nor computational protocols are standardized, reported differences among studies often stem from differences in methodologies rather than from differences in the underlying biology. And because techniques for performing meta-analyses of microbiome data are still only emerging, it is often difficult to standardize a reanalysis, and comparisons of results across studies and especially among laboratories must be performed with caution.
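
The ambiguity between the two readings of a 97 % threshold can be made concrete with a small sketch. The code below is an illustration only, not the procedure of any published OTU picker: the function names, the greedy strategy, the use of ">=" for the threshold, and the toy pre-aligned sequences are all assumptions introduced here.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cluster_by_centroid(seqs: List[str], threshold: float) -> List[List[str]]:
    """Greedy clustering: a sequence joins the first cluster whose centroid
    (here, simply the cluster's first member) it matches at >= threshold."""
    clusters: List[List[str]] = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def cluster_by_all_members(seqs: List[str], threshold: float) -> List[List[str]]:
    """Greedy clustering: a sequence joins the first cluster in which it
    matches every existing member at >= threshold."""
    clusters: List[List[str]] = []
    for s in seqs:
        for cluster in clusters:
            if all(identity(s, member) >= threshold for member in cluster):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Hypothetical 100-nt aligned sequences: two variants that each differ from
# the seed at 2 positions (98 % identity) but from each other at 4 (96 %).
seed = "A" * 100
variant_1 = "CC" + "A" * 98
variant_2 = "A" * 98 + "CC"
toy_reads = [seed, variant_1, variant_2]

centroid_otus = cluster_by_centroid(toy_reads, 0.97)
all_member_otus = cluster_by_all_members(toy_reads, 0.97)
print(len(centroid_otus), len(all_member_otus))
```

On these toy reads the centroid rule yields a single OTU while the all-members rule yields two, so the same data and the same nominal threshold give different OTU counts depending on which definition is applied.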

Modern marker-gene-based studies often investigate the composition of microbial communities at the OTU level, due to difficulties in relating counts of short DNA sequence fragments to named species. Although short reads of sequences (100–400 bases is currently typical, depending on sequencing platform) from the genomes of well-studied organisms can often be assigned at least to the family level, and sometimes at the genus or species level, many sequences cannot confidently be assigned to known named taxonomic groups. The limitation here is primarily the amount of information present in short reads of marker genes for differentiating closely related taxa. Figure 1 shows that when working with the most informative region of the 16S rRNA gene for broad analyses of bacterial and archaeal communities, the fraction of reads that can be assigned to taxonomic groups increases as expected with the length of the sequence. In real-world experiments (as opposed to the simulation presented in Fig. 1) this effect is exacerbated by PCR and sequencing biases and errors.

Figure 1: Relationship between sequencing read length and classification. Reads were classified using RDP Classifier, a popular taxonomic assignment method based on oligonucleotide frequencies [43]. Simulated sequences were generated from 16S genes to represent the complete sequence between the 515F/806R primers (full amplicon) or shorter 150 and 100 base pair reads from the 515F forward primer.
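
One way to see why shorter reads carry less taxonomic signal for a method based on oligonucleotide frequencies is to count the overlapping words a read contributes. The sketch below is illustrative only: the word size of 8, the helper names, and the read lengths (in particular 250, which is an arbitrary longer length and not the actual amplicon length) are assumptions made here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 8) -> Counter:
    """Count overlapping words of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def n_words(read_length: int, k: int = 8) -> int:
    """Number of overlapping k-length words in a read of the given length."""
    return max(read_length - k + 1, 0)


# Words available to a frequency-based classifier at each read length.
words_by_length = {length: n_words(length) for length in (100, 150, 250)}
print(words_by_length)

# Small worked example with k = 4 on a hypothetical 10-nt fragment.
toy_counts = kmer_counts("ACGTACGTAC", k=4)
print(toy_counts)
```

A 100-base read contributes 93 overlapping 8-base words and a 150-base read contributes 143, so each additional base adds one more feature on which the assignment can draw.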

Our inability to assign detailed taxonomy to short reads is often unimportant for the questions that are interesting to address at the community level. Phylogenetic diversity calculations allow us to determine the relative similarity of microbial communities, using similarity of the fragment of the marker gene as a proxy for the relatedness of the organisms represented by those marker genes. Although in principle horizontal gene transfer, the movement of genes among different genomes, could obscure the phylogenetic pattern, in practice the difference in gene content between two organisms closely tracks the differences in marker genes such as the 16S rRNA gene [44, 45]. However, there are cases in which genomes with identical 16S rRNA genes have markedly different properties (e.g., Bacillus cereus, a harmless soil bacterium, and Bacillus anthracis, the causative agent of anthrax, are almost indistinguishable except for a plasmid that confers pathogenicity [46]). Additionally, our conclusions are limited by our depth of sequencing (i.e., the number of marker gene sequence reads collected from a sample). A study that collects 1,000 sequences per sample will almost certainly miss species that are only present at an abundance of one cell in a million. These limitations to knowledge are widely appreciated by specialists, but are often omitted in popular accounts and in descriptions for non-specialists.
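
The effect of sequencing depth on detection can be estimated with a simple calculation. The sketch below assumes, as a simplification introduced here, that each read is an independent draw and that a taxon is sampled with probability equal to its relative abundance (no extraction, PCR, or sequencing bias); the function names are hypothetical.

```python
import math


def detection_probability(abundance: float, reads: int) -> float:
    """Probability of observing at least one read from a taxon:
    1 - (1 - abundance) ** reads, computed stably for tiny abundances."""
    return -math.expm1(reads * math.log1p(-abundance))


def reads_needed(abundance: float, target: float) -> int:
    """Smallest number of reads giving at least `target` detection probability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log1p(-target) / math.log1p(-abundance))


one_in_a_million = 1e-6
p_shallow = detection_probability(one_in_a_million, 1_000)
p_deep = detection_probability(one_in_a_million, 1_000_000)
n_for_95 = reads_needed(one_in_a_million, 0.95)
print(p_shallow, p_deep, n_for_95)
```

Under these assumptions, 1,000 reads detect a one-in-a-million taxon about 0.1 % of the time, one million reads detect it about 63 % of the time, and roughly three million reads are needed for a 95 % chance of detection.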

Is there a core human microbiome?

Our initial expectations of the microbial diversity living within and on human beings were limited and biased because relatively few microbes can be grown in culture [47, 48] and because many phylogenetically and functionally distinct kinds of microbes are difficult to distinguish by morphological or biochemical characteristics. For instance, Escherichia coli was believed to be a common and abundant gut microorganism inhabiting most members of the human population. However, culture-independent surveys based on 16S rRNA gene sequencing and/or shotgun metagenomic sequencing (in which all the DNA from a given community is extracted and analyzed) typically find it at less than 1 % abundance in the gut of healthy adults [10, 15, 49, 50]. The scientific and medical community sought to determine the “core” microbiome of humans at the level of microbial species shared by everyone [2]. Surprisingly, such a core does not seem to exist at the level of species; instead what appears to be shared are microbial functions [10, 50]. One suggestion is that there might be a few types of common but only partially overlapping (or perhaps non-overlapping) microbial communities. One study found just three “enterotypes” or types of gut bacterial communities in human populations [51], although this simplistic picture appears not to hold when additional subjects and populations are considered [16, 18, 52-55]. However, the idea that human gut microbial communities might be classified into just a few types is conceptually appealing and has received much media attention [56-58], so debate on this topic is likely to continue. The microbial diversity revealed by improvements in culture-independent techniques, driven in part by the vast decrease in sequencing costs noted above, has been remarkable. There are no shared OTUs across the gut communities of all humans, even at a depth of coverage of one million sequences per sample [16]. This unexpected finding has given rise to the idea of microbes as personal identification markers [59]. In addition, because monozygotic twins differ in their microbiota [10, 18], it could be argued that our microbiota are more personally unique than our own genomes.

In some sense, whether or not there is a core microbiome is a purely definitional issue. Finding a core depends on the level at which sequences are aggregated (grouping together more similar or more distantly related groups of organisms, for example), the abundance threshold that may be set deliberately or may be intrinsically limited by technology or study design (for example, if only 1,000 sequences per sample are collected, organisms that are as rare as one in a million microbes will be missed), and the fraction of individuals that a taxon must appear in to be considered “core” (for example, the MetaHIT consortium used a 50 % threshold [50]). Some kind of core can always be defined. A more productive research direction is to ask whether there are systematic differences among the microbial communities of every human that can be correlated with the physiological state of each individual.

Microbial community states associated with disease

Much attention has focused on testing whether differences in microbial diversity correlate with physiological states, especially disease states. For example, Ruth Ley, Peter Turnbaugh and colleagues in the laboratory of Jeffrey Gordon embarked on an exploration of changes in the microbiota associated with obesity in different mouse models. This seminal work revealed robust differences in the gut communities of obese mice compared with lean mice, both in the case of genetically induced obesity in the ob/ob leptin model [60] and in diet-induced obesity [61]. Remarkably, increased adiposity was transmissible to genetically normal mice on a standard, calorie-controlled diet by transferring these microbial communities from the obese mice to the normal mice [61, 62]. The major taxonomic difference between the microbiota of these mice was the relative abundance of the phyla Bacteroidetes and Firmicutes. This finding has been shown to hold for human hosts as well [13], although the same pattern has not been replicated in all human studies [63, 64]. As mentioned above, we can now predict, based on the microbial community composition alone, whether an individual is obese or lean with 90 % accuracy [19], while predictions based on host genomic markers perform little better than chance [20]. Interestingly, these predictions work best when the microbes are classified into broad groups. Clustering the sequences into groups at the 80 % sequence identity level (corresponding approximately to bacterial phyla) actually works better than clustering the sequences into groups at the 97 % sequence identity level (corresponding approximately to bacterial species) for classifying people as lean or obese. These more detailed analyses at the species-proxy level do, however, provide better resolution when classifying multiple samples from the same site [19]. A possible explanation for the improved predictability using phylum-level classification could be that biochemical pathways are differentially represented across phyla but conserved across OTUs within phyla. These biochemical pathways are the primary features that differentiate obese from lean individuals. Models trained on data that are too specific (i.e., clustered at 97 % identity rather than a lower percent identity) are prone to overfitting, and have reduced predictive capacity. But it is important to bear in mind that the phylogenetic levels at which bacteria are associated with particular states may vary considerably, depending on the ecology of the particular phenotype or disease.

Recent large-scale endeavors, such as the Human Microbiome Project [65], the American Gut Project [66] and the Personal Genome Project [67], are opening up new opportunities for analysis because they are building a base of healthy microbiomic data against which disease states (collected by some of these projects) can be contrasted. This is important because of the breadth of diseases associated with the microbiome. Disease states that have been found to be associated with features of the microbiome include inflammatory bowel disease [68, 69], wasting diseases [70], obesity [71], halitosis [72], dental caries [73], and perhaps even autism [74]. The gut microbiome appears to be causal for certain disease states, and is not just a biomarker. Causality can be inferred when, for example, fecal transplantation (and thus microbiota inoculation) in human subjects is used successfully to treat inflammatory bowel disease (IBD; primarily ulcerative colitis) [75] and to improve insulin sensitivity in metabolic syndrome [76]. These results indicate that gut microbes play an active role in these disease states and are not merely effects of the host’s condition. It is possible that in the not-too-distant future a microbiome sample will become a normal component of a health checkup. Microbiome analyses may be used to diagnose disease and could provide possible avenues for the prevention of disease through predictive tests. As we mentioned above, molecular samples from microbial communities may track or predict disease states better than does the human genome.

Changes in the microbiome over time

Microbial ecology shares similarities with traditional ecology, yet there are some important differences. In the ecology of macroorganisms, it is often possible to observe interactions directly, such as predation or competition for resources. Such observations are much more difficult in the microbial world, and ecological interactions must often be inferred from statistical variations in sequence data instead. Species definitions, although notoriously problematic even for macroorganisms, are even more difficult in microbes, and operational definitions based on similarities in DNA sequences must be used instead (as already discussed). Additionally, the cost of DNA sequencing posed a barrier until recently to collecting the detailed time-series and spatial datasets that are necessary for ecological modeling in microorganisms. However, some aspects of microbial ecology are substantially easier than in large-organism ecology. For example, the reliance on DNA sequence data means that with advances in technology, even a deep sampling of the population (millions of individuals) can be performed rapidly, and observation biases are likely to be less profound than when attempting to glimpse rare and elusive insects or mammals. The ability to collect large-scale information about microbial populations is likely to allow classical ecological models to be applied to the microbial world far more effectively than has been possible in macroecology, because more types of microbes can be simultaneously observed with large population sizes and with replicated sampling.

Ecological principles offer more than just ways to stratify the human population (e.g., by disease state). In infancy, our microbial populations go through remarkable changes in structure before coming to resemble most adult communities. Inoculation is not necessarily from our mothers, and is substantially influenced by delivery mode. Microbial communities of children delivered vaginally initially tend to resemble their mother’s vaginal communities, while the microbial communities of children delivered by C-section initially tend to resemble human skin communities. Skin inoculations may be obtained from the mother, the medical staff involved in the delivery, or hospital surroundings (many of which harbor communities resembling human skin) [77, 78]. Stabilization of the microbiota of human children occurs around the third year of life [18], but routine disruptions, adjustments and fluctuations appear to be normal in healthy individuals [15, 79]. Although intra-individual microbiome variation is generally smaller than inter-individual variation, the amount of variability over long time periods [79] gives rise to the idea of microbial “weather” in which microbial communities react to dietary and health conditions (even as they causally affect them). This phenomenon may be especially important in determining the health of the elderly [55].

A revelatory aspect of studies of the microbiome is that classical ecological models and datasets previously only obtainable for a few economically important systems, such as fisheries, are now testable on the microbial scale because of the ability to assess simultaneously the relative abundance of thousands of species in thousands of samples [80]. However, this move towards accounts of microbial communities in terms of alternative stable states and dynamical systems [81-83] is not entirely without peril. In the absence of theories of underlying causes, defining the number and boundaries of these states can be technique-dependent and implicitly theory-laden in ways that are difficult to identify, especially by investigators who are not specialists in the relevant mathematical techniques. With the availability of larger datasets and the ability to track communities over time, key ecological concepts such as resilience, alternative stable states, predator–prey cycling, and bottom-up versus top-down regulation of ecosystems will be increasingly important. However, it is equally important not to forget the lessons learned from past applications of these techniques, especially in traditional ecological modeling. For example, it has been known for almost four decades that Lotka-Volterra predator–prey dynamics with time lags produce patterns that would appear as completely uncorrelated between two species that in fact do interact deterministically (Fig. 2) [84]. However, this fact is routinely ignored in network analyses that seek to find connections among organisms by building a network in which nodes correspond to organisms, and edges correspond to pairs of organisms that are correlated. Correlation is usually assessed by determining whether the abundances of two taxa are correlated across a set of samples, typically using the Pearson correlation coefficient, which assumes that all interactions are linear. In other words, taxa are linked if their correlation coefficient exceeds an arbitrary researcher-defined threshold. These networks are often used to find groups of organisms that “co-occur”, presumably because of shared environmental preferences or because of mutualistic ecological interactions. Hence these network methods, which often rely on linear correlations among organisms to detect relationships [50, 85, 86], would incorrectly conclude that organisms lack ecological connections even when these connections are fully deterministic. This happens simply because the inference procedure requires an understanding of the time-evolution of the system in order to find these causal links.
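
The failure of same-sample correlation to detect a deterministic interaction can be reproduced numerically. The sketch below is a simplified illustration rather than a reconstruction of the analysis in [84]: it assumes the classical Lotka-Volterra equations (in which the predator lags the prey without an explicit delay term), and the parameter values, step size, sampling interval, and function names are all hypothetical choices made here.

```python
import math
from typing import List, Tuple


def simulate_lotka_volterra(x0: float, y0: float, a: float, b: float,
                            c: float, d: float, dt: float,
                            steps: int) -> Tuple[List[float], List[float]]:
    """Integrate dx/dt = a*x - b*x*y (prey) and dy/dt = -c*y + d*x*y (predator)
    with a fixed-step fourth-order Runge-Kutta scheme."""
    def deriv(x: float, y: float) -> Tuple[float, float]:
        return a * x - b * x * y, -c * y + d * x * y

    xs, ys = [x0], [y0]
    x, y = x0, y0
    for _ in range(steps):
        k1x, k1y = deriv(x, y)
        k2x, k2y = deriv(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
        k3x, k3y = deriv(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
        k4x, k4y = deriv(x + dt * k3x, y + dt * k3y)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        xs.append(x)
        ys.append(y)
    return xs, ys


def pearson(u: List[float], v: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical parameters; many oscillation cycles are simulated.
xs, ys = simulate_lotka_volterra(x0=15.0, y0=10.0, a=1.0, b=0.1,
                                 c=1.5, d=0.075, dt=0.01, steps=30000)
xs, ys = xs[::10], ys[::10]  # keep one "sample" every 0.1 time units

# Correlation when X and Y are compared within the same sample.
r_same_time = pearson(xs, ys)


def lagged_correlation(lag: int) -> float:
    """Correlate X at time t with Y at time t + lag (lag in samples)."""
    return pearson(xs[:len(xs) - lag], ys[lag:])


# Search a range of lags for the strongest linear relationship.
best_lag = max(range(1, 60), key=lambda k: abs(lagged_correlation(k)))
r_lagged = lagged_correlation(best_lag)
print(r_same_time, best_lag, r_lagged)
```

In this sketch the same-sample correlation is close to zero even though the two populations are coupled deterministically, whereas shifting one series in time recovers a strong relationship. A correlation network built from unordered samples would therefore omit the edge between X and Y.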

[Figure 2 image: panels (a) and (b); axes: Population X, Population Y, Time.]

Figure 2: Predator-prey dynamics for two species X and Y. Predator-prey dynamics visualized as a scatterplot (relating sampled species abundances) are interpretable when successive time-points are connected (a). If, however, the information about time were not included (b), these dynamics would appear uncorrelated because when X is high, Y can be either high or low, and vice versa. Thus, even in a completely deterministic system, it is impossible to tell whether two species interact with each other simply by examining multiple samples in which both are present. However, this technique is widely used in practice despite its limitations. Figure adapted from [84].

The analysis of time-series in microbial ecology has also been limited because the performance of standard signal processing methods is degraded with uneven sampling periods and small numbers of data points [87-89]. Such degradations have historically been common in microbial ecology datasets due to the cost of obtaining the data. However, we have already obtained valuable information about the temporal dynamics of a few microbial communities, such as the assembly of an infant’s gut microbiome and its transition towards a healthy human adult gut microbiome [90]. In the few cases in which even sampling has been performed or can be assumed, techniques exist to detect abrupt disruptions [91, 92]. In these contexts, such disruptions could reflect one of the interventions that have been shown to have large effects in mice or humans, such as diet change [93] or antibiotic administration [94, 95]. Therefore, as in disease surveillance, choosing a specific analytical approach (for example co-occurrence analysis, clustering analysis, or control systems analysis) depends to a large extent on whether the goal is to monitor a trend, detect an outbreak or provide general awareness of the possibility of change [96].

Conclusions and outlook

Overall, the ability to collect far larger amounts of sequence data has led to much broader and deeper characterizations of the human microbiome and microbial communities in other habitats, especially when linked to rich contextual information about the provenance and status of each sample [32]. In particular, the increased use of time-series studies (enabled by the decline in the cost of sequencing) allows us to apply for the first time a wide range of ecological models to the microbial world. Perturbation experiments are especially important for understanding how microbial communities change and for understanding groups of species that change together and interact in complex ways. However, this expanded body of ecological data introduces substantial epistemic issues, especially in regard to how data are interpreted via models and concepts. For example, the definition of OTUs at both the organism and the gene level (e.g., in the construction of “gene catalogs” [50]) is in many respects a return to phenetic methods, which have been criticized due to their lack of theoretical justification and their instability when more data are added [97]. The methodological principle of clustering sequences at some threshold before analysis is also not well grounded theoretically. One example would be if a single nucleotide change in the 16S rRNA gene of a single species exactly distinguished lean from obese humans, or co-varied perfectly with disease severity in IBD. Such findings would be of enormous importance yet would be missed completely by current techniques. Similarly, we know that because of factors such as horizontal gene transfer, gene- and taxon-level analyses will not map precisely onto each other, yet the data to perform such analyses and the theoretical framework for reconciling differences are at this point largely lacking.
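
The single-nucleotide scenario can be checked with simple arithmetic. In the sketch below the 100-nt reference, the position of the substitution, and the function name are hypothetical and serve only to illustrate the point.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * matches / len(a)


# Hypothetical 100-nt marker fragment and a variant with one substitution.
reference = "ACGT" * 25
variant = reference[:50] + "A" + reference[51:]  # position 50 is "G" in the reference

pid = percent_identity(reference, variant)
same_otu_at_97 = pid >= 97.0
print(pid, same_otu_at_97)
```

A single substitution in a 100-nt fragment leaves the two sequences 99 % identical, so any clustering at a 97 % threshold places them in the same OTU and the difference is invisible to downstream analysis.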

Some of the solutions to these problems are being sought in large-scale projects such as the Earth Microbiome Project [31, 32]. These research consortia are working towards understanding relationships among microbial processes across different systems and timescales. They will be especially important for identifying which theoretical constructs across different scales and levels of analysis are especially useful both for understanding and predicting microbial community responses. And as this article has made clear, the availability of large datasets and the development of new methods with which to analyze them have already produced dramatic changes in how the microbial world is understood, and its relationship to the rest of the biological world. As the many human microbiome studies discussed above show, microbial ecology, and especially molecular microbial ecology even at its relatively crude stage of development, is transforming how human biology itself is understood. This transformation, which we expect to occur not just in human biology but in traditional ecology and biology more broadly, will raise philosophical issues that require the attention of scientists and philosophers. We have indicated just some of these issues, dealing with the units of analysis and the causal powers associated with them, and how imperfect methods and models become more refined and effective in the process of inquiry. Philosophy of biology itself can learn a great deal from these recent and future developments in microbial ecology, as other papers in this special issue demonstrate.

Concluding remarks

Analyzing a community in isolation is like attempting to understand human society by only looking at a single city. While it may be possible to learn something, you may not have confidence in your observation and the risk of overfitting is high. For instance, if your single observation is restricted to a coastal city, you may end up assuming that all cities on the planet embrace fishing. Just as it is unreasonable to base conclusions about a complex system such as human society on one or a few samples, so it is with the study of microbial communities. This article helps to make the case that ecological principles, large sample sizes, and broad study designs are key to scoping diversity and unraveling its relationship to environment.

Chapter 3 Context, and the human microbiome

From: McDonald D, Birmingham A, and Knight R. “Context, and the human microbiome.” (submitted)

This chapter outlines the value of context when analyzing microbiome datasets and the computational challenges posed by the large references that can provide it, discusses the issues with existing reference sets, and covers how the American Gut Project will begin to chip away at these problems.

Introduction

In the last few years, the study of the bacteria, archaea, microbial eukaryotes, and viruses that inhabit the human body (particularly the large intestine) has revealed a remarkable biological and functional diversity [15, 16, 50, 98-100]. These organisms, collectively known as the microbiome, potentially outnumber human cells 10:1 [101] and vastly expand on the functional capabilities provided by our genomes. Disruption in these microbial communities, also known as dysbiosis, has been causally associated with kwashiorkor [102] (a wasting disease endemic to Africa) and obesity [103]. Numerous correlative associations in humans and mouse models have also been observed in a broad spectrum of complex diseases including autism spectrum disorder [104], inflammatory bowel disease [105], type-2 diabetes [106], colorectal cancer [107], depression [108] (see [109] for a detailed review on the brain-gut-microbe axis), and more.

The implications of the microbiome for human health are immense, with prospects for novel medical products including therapeutics and clinical assays. This has led to large investments in both academia [16] and industry [110]. Although such research could have a profound impact on human society both in first- and third-world countries, we are just scratching the surface of understanding the complexity of this vital organ. As such, identifying means to improve the pace of research is arguably a matter of human health on a global scale.

A crucial and missing component of microbiome research is a robust and comprehensive reference set. Such a dataset would characterize what we know about diversity of the human microbiome and its relationship to the health and lifestyle choices of individuals, providing much-needed context against which to compare findings of focused studies such as those on particular disease populations.

This reference would allow researchers to place their study in the framework of what is already known in order to better interpret observed patterns (compelling examples of this can be found in [111] and [112]). It would also enable stringent hypothesis testing and evaluation of effect sizes. A robust reference dataset must be built on top of a cross-sectional study design in order to understand the variation in the population, while also including rich longitudinal components to enable an understanding of how species structure changes over time.

In this review, we highlight the importance of reference sets in human microbiome research; limitations of existing resources; technical challenges to employing reference sets; examples of prototypical reference usages; and contributions of the American Gut Project to addressing some of these issues. Discussion will focus on the 16S rRNA gene, which is a popular locus for use in microbiome studies over a wide range of environment types [113-117] and is the core locus assayed in the American Gut Project. Construction of references based on other loci is important for studying microbial eukaryotes, viruses, and interactions between these organisms, but high-throughput study of these other components of the community is not yet cost-effective.

Importance of reference sets in human microbiome research

The community structure of the human microbiome is the result of a multifactorial process that involves succession over time [90], is influenced by host genetics [118], and is affected by lifestyle choices [52, 119]. Communities are made up of thousands of microbial species, with the predominant microbial biomass residing in the human large intestine. Fascinatingly, within the human gastrointestinal tract, it appears that multiple organisms are capable of fulfilling common ecological niches, leading to remarkably different microbial communities that possess similar functional potential [16]. Furthermore, while variations in the human genome are minute across the population, variations in the human microbiome on geographical and temporal scales are immense [18, 120]. Despite investments of hundreds of millions of dollars, we still don’t understand the distribution of community structures in healthy individuals [121], but we do know that when studies of the microbiome are performed without a concern for integration with existing studies, effects of significant biological importance can be easily missed [122].

A well-characterized reference dataset can be used to test hypotheses, and conversely, to derive testable hypotheses from the reference itself. For instance, inflammatory bowel disease has been observed to be associated with a microbial dysbiosis index (MD-index) [123]; a robust reference set would allow assessment of the hypothesis that diet or lifestyle factors are strongly correlated with this index within the general public as well. In an opposite example, a significant correlation between diversity and season was observed in the American Gut reference set [124]. Because it appears that individuals have a higher diversity during the holiday season in the US, one might hypothesize that it is the holidays and not the season that drives the correlation, possibly due to changes in exercise and diet patterns. This putative effect can then be tested once the project acquires sufficient samples from western countries in the southern hemisphere.

A comprehensive reference dataset will also help researchers make rational decisions about sample size, which can greatly impact the power and analysis of a study [125]. Such a dataset is also crucially necessary to support characterization of the effect sizes of variables (e.g., antibiotic use). Within the microbiome field, effect size for many variables of interest is not yet well understood, and many that are important in diseases with complex etiologies such as autism [104] are likely to be subtle. Well-characterized references offer the possibility for a researcher to expand their own dataset by pulling samples to augment their own [120], particularly when meta-analysis is taken into consideration during the design phase for a study.

Limitations of existing reference sets

The $173 million NIH-initiated Human Microbiome Project (HMP) set out to characterize the human microbiome at a population scale, and to define standard reference datasets to be used for human microbiome research [126]. The resulting 16S rRNA datasets are composed of samples from 242 individuals, all of whom were medical students in the United States and were certified healthy by medical professionals. Thousands of samples were collected from these individuals at one to three time points, covering 15 to 18 sampling sites depending on the sex of the individual. These samples were evaluated using two different regions of the 16S gene (leading to two distinct datasets, V1-3 and V3-5) [122], and were processed at four different sequencing centers. Phenotypic information about the individuals was collected, but while the sequence data associated with the samples are publicly available, access to any de-identified information about the individuals requires rigorous approval mechanisms.


Although the HMP generated an incredible volume of data, numerous design, technical, and access decisions affecting the HMP dataset have made reuse challenging. For instance, the decision to sample a few people extensively rather than a large number of people minimally (i.e., a cross-sectional study design) led to observation of only a small fraction of the diversity present within the population [18] and resulted in small sample sizes for different stratifications in the dataset [127]. The choice to sequence multiple loci within the 16S rRNA gene resulted in data that are impractical to combine due to technical bias, as amplification performance differs between primers [37, 122]. Furthermore, because the study design was not sufficient to elucidate the effect of employing multiple sequencing centers (which has been observed in other contexts; see the Microbiome Quality Control Project, http://www.mbqc.org/), this issue must still be actively evaluated. Host information, such as age and sex, is nearly prohibitively difficult to access, which makes explaining any systematic patterns in the data impossible. The end result is that use of the HMP 16S rRNA data as a robust reference set has proven difficult.

In contrast to the HMP, the Global Gut project [18] set out to characterize microbial diversity at spatial and temporal scales. To do this, the researchers collected samples from three distinct populations (US citizens, Malawians, and Venezuelan Amerindians), the latter two of which are culturally distinct from western populations. Within each population, samples were collected cross-sectionally over an age gradient. Notably, the two non-western populations appear to be completely distinct from the western individuals, suggesting the limited population size and emphasis of the HMP grossly underestimate the variation in community structure across the human race. However, the populations do intersect on samples collected from infants, suggesting that it is potentially lifestyle, diet or environmental choices that shape our microbiomes as we age (including interaction with our genetic predisposition [118]). Although the sequence data are readily available for reuse, the distribution of many of the study variables is not approved, limiting the long-term usefulness of the samples. (It should be noted that the Global Gut did not intend to be a reference for microbiome research, but the populations represented in the dataset are extremely difficult to collect samples from and have been shown to be useful in adding perspective for independent projects [120, 128].)

Lack of access to the full set of metadata variables associated with these earlier studies is crippling, as interpretation of the observational data can only happen within the context of the collected variables. From a practical standpoint, if a systematic pattern is observed in the data, but there aren’t any variables that explain the pattern, then the researcher cannot support a hypothesis about the pattern without collecting new information (which may be impractical or impossible). Similarly, confidence in the face of confounding variables is reduced if only a limited number of variables are tracked. Given that researchers typically do not know the answer in advance of a study, it is imperative that study designs strive to collect as much information as feasible.

Technical challenges to employing reference sets

Even a well-designed and carefully collected reference must be employed with caution in order to minimize spurious variation and contain necessary computational effort. The first of these needs arises since reference-based analyses assume that any systematic compositional differences inherent in the data outweigh any technical variation, which is particularly problematic when combining data generated from different protocols or platforms [122]. In fact, biological conclusions can be driven by technical variation if the researchers are not careful (as in [129], where samples were found to cluster by the extraction kit used), which underscores the need for accepted community standards for sample handling, sequencing, and data analysis. Bioinformatic strategies to mitigate any remaining variation, such as trimming sequences to a common length between studies, have been shown to help normalize platform bias [120]. Sometimes stronger measures are necessary: for example, the American Gut Project received samples from self-reported healthy individuals that contained levels of certain organisms beyond anything previously observed, and it was determined that these artificial blooms likely stemmed from the shipping conditions for some samples. The blooms can be bioinformatically subtracted from the dataset [manuscript in prep] by removing common lab contaminants and organisms observed to bloom during storage in the Microbiome Quality Control Project (MBQC), but as a result, any meta-analysis that leverages the American Gut data must perform this same subtraction in order to control for bias that the filter introduces. Ongoing efforts such as MBQC are explicitly exploring the effect of different storage conditions so that they can be controlled for as necessary in the future.
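The two mitigation steps described above, trimming reads to a common length and subtracting bloom sequences, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by the American Gut Project: the function names, the toy reads, and the one-entry bloom list are all hypothetical.

```python
from collections import Counter


def trim_to_common_length(studies):
    """Trim every read in every study to the shortest read length observed.

    `studies` maps a (hypothetical) study name to a list of read strings.
    """
    common = min(len(read) for reads in studies.values() for read in reads)
    return {name: [read[:common] for read in reads]
            for name, reads in studies.items()}


def subtract_blooms(sequence_counts, bloom_sequences):
    """Drop the counts of any sequence found on the bloom list."""
    return Counter({seq: n for seq, n in sequence_counts.items()
                    if seq not in bloom_sequences})


# Toy data: two studies sequenced to different read lengths.
studies = {
    "study_a": ["ACGTACGTAC", "ACGTTCGTAC"],  # 10 nt reads
    "study_b": ["ACGTACGT", "TTGTACGT"],      # 8 nt reads
}
trimmed = trim_to_common_length(studies)

# After trimming, reads that differ only in length become the same string.
pooled = Counter(trimmed["study_a"] + trimmed["study_b"])

# Hypothetical bloom list, already trimmed to the same common length.
blooms = {"TTGTACGT"}
filtered = subtract_blooms(pooled, blooms)
print(filtered)
```

The point made in the text carries over directly: the same `subtract_blooms` call has to be applied to every dataset in a meta-analysis, otherwise the filter itself becomes a source of between-study differences.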

Once technical variation has been minimized, the comparative analysis can begin. Many researchers, particularly those at remote sites, do not have access to large-scale compute instruments and must rely on commodity hardware for data analysis; this creates the temptation to employ analysis techniques that require as little computation as possible. However, some such techniques are particularly vulnerable to artifacts caused by combining dissimilar datasets.

The primary data type used in analysis of a microbiome study is the OTU table [130, 131]. In order to be comparable, a reference and a study must have their sequence data assigned to a shared set of OTUs (i.e., partitioned into a common set of bins). OTUs themselves are clusters of similar sequences, with the similarity threshold generally set at 97% by sequence identity, and are typically determined in one of three ways as summarized in Table 1 (for a comprehensive review, see [132]; each of these methods is named in terms of its OTU reference, but nota bene that this represents a distinct concept from that of the reference datasets discussed throughout). The first is a closed-reference approach in which all the sequence data for the input study and the microbiome reference set are compared against a curated 16S rRNA database such as Greengenes [133] to identify which known OTUs are represented. This is computationally tractable even for very large studies since the evaluation of every sequence is independent of every other, lending itself to embarrassingly parallel compute strategies, and since the reference dataset’s OTU assignments can be computed just once (and in advance). The second strategy, known as de novo picking, defines novel OTUs based on the sequences in a study. This is computationally expensive, as all the data must be maintained in memory in order to determine the clusters. The third approach, open-reference picking, is a hybrid method in which sequences are first compared to a database of known OTUs as described above, after which those that fail to match to a known OTU are then put through a de novo step.
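The three strategies can be made concrete with a toy sketch. The code below is an illustration under simplifying assumptions rather than a description of any production OTU picker: `identity` is a naive position-by-position comparison of equal-length reads (real tools use alignment-based heuristics), the de novo step is a simple greedy centroid clustering, and all function names and sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length reads (toy metric)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference OTU, if any.

    Every read is scored independently of every other read, which is why
    this step parallelizes trivially. Returns (assignments, failures).
    """
    assignments, failures = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for otu_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = otu_id, score
        if best_id is not None and best_score >= threshold:
            assignments.setdefault(best_id, []).append(read)
        else:
            failures.append(read)
    return assignments, failures


def de_novo(reads, threshold=0.97, prefix="denovo"):
    """Greedy clustering: a read matching no existing centroid seeds a new OTU.

    All centroids must be kept in memory, and the result depends on every
    read seen so far, so combining studies means re-running this step.
    """
    centroids, clusters = {}, {}
    for read in reads:
        for otu_id, centroid in centroids.items():
            if identity(read, centroid) >= threshold:
                clusters[otu_id].append(read)
                break
        else:
            otu_id = f"{prefix}{len(centroids)}"
            centroids[otu_id] = read
            clusters[otu_id] = [read]
    return clusters


def open_reference(reads, reference, threshold=0.97):
    """Closed-reference first, then de novo clustering of the failures."""
    assignments, failures = closed_reference(reads, reference, threshold)
    assignments.update(de_novo(failures, threshold))
    return assignments


# Toy 10 nt reads; a threshold of 0.9 tolerates one mismatch at this length.
reference = {"ref1": "AAAAAAAAAA", "ref2": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "GGGGGGGGGG", "GGGGGGGGGT"]

closed, unassigned = closed_reference(reads, reference, threshold=0.9)
hybrid = open_reference(reads, reference, threshold=0.9)
print(closed, unassigned, hybrid, sep="\n")
```

In this toy input the two G-rich reads have no counterpart in the reference, so the closed-reference step discards them while the open-reference step recovers them as a new cluster, which mirrors the trade-off discussed in the text.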

Studies employing a reference set typically rely on the closed-reference approach to minimize compute, since only the input study need be evaluated and this can be done in an embarrassingly parallel fashion. The closed-reference strategy is unlikely to result in OTUs composed of non-16S sequence, as the reference is expected to contain only 16S sequence. Comprehensive references, like Greengenes, also typically contain only near-full-length reads, allowing researchers to combine data represented by multiple variable regions. In addition, any annotation information about the reference, such as the phylogenetic relationship between the data contained or annotations such as taxonomy, can be used “for free” with the input study data. Unfortunately, this strategy can only classify sequences that are reasonably similar to those in the reference database, leading to the potential for a large study effect (for instance, observing significant patterns in the data that are not driven by the underlying biology). A large study effect may be a problem if combining studies with differential representation in the reference (e.g., samples from different environments).

In contrast to closed-reference, a de novo approach allows a researcher to assign OTUs to as much of the data as possible, but the distinct error profiles of the studies being combined can lead to spurious, study-specific OTUs. The error profiles of a study can stem from the 16S protocol, variation in the mastermix, error profiles of the sequencing instrument used, etc. As a result, a meta-analysis that uses a de novo strategy requires that OTU picking be redone after combining the sequence data from the studies. Contamination in the data (e.g., non-16S sequence such as phiX) will cluster into OTUs unless the contamination is explicitly filtered out prior to OTU picking.

The hybrid open-reference method steers a middle course: since data that are not represented in the reference are recovered, bias driven by differential representation in the reference is reduced. In addition, since the amount of data being fed into the de novo step is minimized, the impact of study-specific error profiles is diminished. Furthermore, open-reference OTU picking can be augmented with techniques such as using a random subsample when constructing the intermediate de novo reference in order to accelerate its performance (details on the procedure can be found in [132]). However, like de novo picking, open-reference OTU picking must be redone when combining studies, has the potential to use non-16S data, and is not practical if the studies being combined sequenced different variable regions.

Closed-reference
Pros:
• Is extremely parallelizable
• Computes reference assignments only once
• Is highly unlikely to retain non-16S sequences
• Supports read fragments from multiple loci
• Gets the phylogeny and taxonomy for free
Cons:
• Is limited to finding diversity present in OTU database
Data combination bias:
• May show large bias if combining studies with differential representation in the reference

De novo
Pros:
• Utilizes all of the sequences
• Requires no OTU database
• Can group organisms distinct from anything seen before
• May produce phylogenies sensitive to subtle differences in OTUs
Cons:
• Must hold all sequence data in memory
• Is very complex to parallelize
• Produces spurious OTUs without pre-filtering
• Must redo OTU picking with all data being combined
Data combination bias:
• May generate spurious OTUs if combining studies with differential error profiles

Open-reference
Pros:
• Leverages an OTU database but also utilizes sequences that do not match to that database
• Is modestly parallelizable
Cons:
• Must redo OTU picking with all data being combined
• Produces spurious OTUs without pre-filtering
• Is infeasible if data are from multiple loci
Data combination bias:
• Shows less bias due to differential diversity representation than closed-reference
• Shows less bias due to differential error profiles than de novo

Table 1: A comparison of OTU-picking strategies.

Examples of reference set usage

One of the first studies to combine multiple microbiome datasets (that these researchers are aware of) was the work by Lozupone and Knight [25], which aggregated sequence data from hundreds of studies in order to determine the environmental factor(s) that explained the observed differences in microbial community structure. They discovered that data from samples collected in the natural environment across a multitude of gradients (e.g., pH, temperature, atmospheric pressure, etc.) separated primarily based on whether the samples originated from saline or non-saline environments, despite the substantial technical differences between studies. Fascinatingly, when these same data were combined with samples collected from vertebrate guts, the primary variation in the data was explained by whether the samples were environmental or host-associated [98], implying that an extremely high degree of specialization has occurred in the microbial communities of vertebrate guts (which is particularly interesting given the difference in evolutionary time that environmental microbial communities have had to specialize relative to the time that vertebrates have existed). While this meta-analysis did not employ a reference set of the type discussed here, it has itself become a de facto reference set that has subsequently been employed for comparison with numerous other studies [134-136].

More recently, a re-evaluation of a longitudinal study aimed at exploring succession in microbial communities within an infant [90] was performed using the HMP as context [122]. While the original work showed a distinct increase in the diversity through the first few years of life, putting its results in context immediately clarified the trajectory of succession by showing that the microbiome moved from resembling a vaginal community (which makes sense given the mode of birth [78]) to resembling a fecal community. Visualizing longitudinal microbiome studies as animations [79], particularly in the context of a reference, has been so useful that the ability was recently added into EMPeror [137], a common visualization tool for ordination plots generated from microbiome data.

Meta-analyses are becoming more widespread as computational power increases, sometimes employing past studies that were not intended as reference sets in that new role. Moeller et al. [120] reused the Global Gut [18] data to paint a compelling picture of the coevolution of hominids and their gut communities, highlighting a departure that humans appear to have taken with respect to our closest living relatives. The data suggest that the rate of change in the human microbiome has been significantly higher since divergence from chimpanzees, particularly in US adults, including a significant decrease in alpha diversity. The motivation to reuse the Global Gut data was access to samples collected from hunter-gatherer groups as well as western adults, enabling the researchers to test the hypothesis that hunter-gatherer groups are more similar from a microbial perspective to our closest living relatives, potentially due to the dramatic dietary differences that exist between these groups and western populations. However, the sample size for any given age group and population combination (e.g., infant Malawians) within Global Gut was relatively small, so it would be interesting to revisit this and see what the pattern of coevolution is against a reference that contains a larger number of samples for different age groups.


Contributions of the American Gut Project

The American Gut Project set out to build a comprehensive open-source and open-access microbiome 16S rRNA reference dataset for the scientific community to use. It relies on a crowd-funding model that allows for broad reach across the US population, and is set up so that virtually anyone can participate (with the exception of convicted felons and children younger than 6 weeks old). Individuals can elect to receive a collection kit in exchange for a donation to the project. Though the sample population is not free from bias (being shifted toward older Caucasians interested in their own health), the variability encompassed by the project vastly exceeds that of the Human Microbiome Project [127]. In addition, the project has recently expanded internationally to the UK and Australia to reduce participant overhead for shipping samples. All participants in the project are consented under protocol #141853 approved by the University of California San Diego’s Human Research Protection Program (HRPP); the protocol specifies that all non-identifying data collected will be deposited into the public domain. Each participant is presented with an HRPP-approved questionnaire that covers diet, lifestyle, and health history, including an NIH-validated food frequency questionnaire [138]. The infrastructure to support electronic consent, questionnaires, localization for international portals, and management of over 22,000 barcoded samples has opened the doors for external researchers and the general public alike to perform their own experiments using the framework of the American Gut Project.


The American Gut Project is a subset of the Earth Microbiome Project (EMP) [113], which has been instrumental in advocating for adherence to the standards of the Genomics Standards Consortium, including Minimum Information about a MARKer gene sequence (MIMARKS) [139], a suite of standards defining variables to be collected within a marker gene survey for virtually any environment imaginable. The EMP and American Gut also follow published sequencing protocols [140] that aim to normalize technical bias for microbiome studies, and employ the BIological Observation Matrix (BIOM) [131] specification as a standard and computationally efficient means to represent the resulting large, sparse -omics datasets and their sample and observation metadata. All data are de-identified and deposited into the public domain as quickly as possible via the European Bioinformatics Institute (EBI), which is part of the International Nucleotide Sequence Database Consortium (INSDC). American Gut has taken a further step by providing executable IPython [141] Notebooks that allow others to reproduce and modify the analyses being performed on the data. All code for the project is hosted on GitHub in the “biocore” organization and is available under the BSD license, and all code and binaries used by the project are open-source.
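To illustrate why a sparse representation suits OTU tables, the sketch below stores only the nonzero counts of a small table. It is a toy model of the general idea and makes no claim about the BIOM file format or the API of any BIOM library; the function names and the example counts are hypothetical.

```python
def to_sparse(dense, otu_ids, sample_ids):
    """Keep only nonzero counts, keyed by (otu_id, sample_id)."""
    return {(otu, sample): count
            for otu, row in zip(otu_ids, dense)
            for sample, count in zip(sample_ids, row)
            if count != 0}


def density(sparse, n_otus, n_samples):
    """Fraction of table cells that hold a nonzero count."""
    return len(sparse) / (n_otus * n_samples)


# Toy OTU table: rows are OTUs, columns are samples.
otu_ids = ["otu1", "otu2", "otu3", "otu4"]
sample_ids = ["s1", "s2", "s3"]
dense = [
    [5, 0, 0],
    [0, 0, 2],
    [0, 0, 0],
    [0, 7, 0],
]

sparse = to_sparse(dense, otu_ids, sample_ids)
print(sparse)
print(density(sparse, len(otu_ids), len(sample_ids)))
```

Because most OTUs are absent from most samples, the number of stored entries grows with the number of observations rather than with the product of OTUs and samples.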

Conclusion

Research is never performed in isolation. It is built upon the foundations laid by prior knowledge, and evaluated in the context of present knowledge. However, if data are not collected with a view towards integration, or if rich reference points do not exist, research is effectively performed in a vacuum. These are some of the challenges that a common reference can help to address, and the American Gut is a widely collaborative, carefully structured project that aims to provide such a reference. The establishment of a comprehensive reference encourages widespread use of standard protocols, since normalization of technical variation is essential when comparing results to the reference and assessing the significance of a study against the background population. Application of context-aware study designs that adhere to community-accepted standards used by references like the American Gut should minimize the time until microbiome research findings become medically actionable.

Concluding remarks

While there is immense value in a rich reference that spans the diversity of microbial life associated with humans, that reference is, by itself, not sufficient to elucidate the relationship between complex disease and the microbiome. References like the American Gut can play a useful role in understanding community structures present in disease states, but ultimately, there is a critical need for focused studies that leverage large cross-sectional or longitudinal study designs (in tandem with a detailed reference) in order to assess the presence of systematic patterns in community structure, and whether those patterns matter.


Chapter 4 Towards large-cohort comparative studies to define factors influencing the gut microbial community structure of ASD patients

From: McDonald D, Hornig M, Lozupone C, Debelius J, Gilbert JA, Knight R. (2015) "Towards large-cohort comparative studies to define the factors influencing the gut microbial community structure of ASD patients." Microbial Ecology in Health and Disease 26:26555 PMID: 25758371

In this published review, a summary of the evidence for the relationship between the microbiome and Autism Spectrum Disorder (ASD) is provided. In addition, the benefits of large-cohort study designs are discussed, which are useful for defining the presence of factors that drive systematic differences in the community structures of affected individuals. Leading up to this review, we created an ASD-focused cohort of the American Gut, which has a few hundred families enrolled, and is targeted at ASD-affected individuals and their neurotypical siblings.

Introduction

Differences in the gut microbiota that inhabit the intestinal tracts and feces of children with autism spectrum disorders (ASD), as compared to neurotypical children, have been reported by several research groups over the past decade [104, 142-145] [for a comprehensive review, see [146]]. The relationship of these differences in the microbiota to dietary practices, the diversity and severity of clinical features, and pathogenesis remains unclear. There is now evidence in animal models [104, 147], as well as from more limited studies in humans, that signaling along the gut-microbiome-brain axis is a critical regulator of both central nervous system and immune function [109, 148]. In addition, some studies suggest that interventions targeting the microbiome (probiotics, fecal transplants) may have utility in the management of other neuropsychiatric disorders [148-154]. Further research to delineate the extent of involvement of gut microbes in autism, and to monitor or even suggest therapies, is therefore promising.

The role of bacteria co-associating with our gastrointestinal tract in physiological development and disease has recently attracted considerable attention, primarily as a result of technological advances associated with sample processing, sequencing of genetic information, and data analytical tools. In the last decade, the revolution in sequencing technology has fundamentally altered our perception of microbial diversity and ecology, by enabling us to process and analyze thousands to tens of thousands of samples in a single study [16, 113, 140]. These advances have allowed us to identify significant trends relating the physiology, environment, and health history of a host and the presence or relative abundance of the bacteria that inhabit the host (e.g. [60, 155, 156]). Many factors affect the colonization and succession of the microbial communities that live within us, and that change over time [18, 90]. It is therefore difficult to capture the combination of events within an individual's life that have resulted in that individual's unique microbial signature. Although some bacterial taxa correlate strongly with specific conditions [60, 155-157], other relationships are less obvious, and may require far larger cohorts of participants to detect [16].

Bacteria have profound influences on key aspects of our immune regulatory network [158], with far-reaching implications for our physiological and even neurological development [159]. Direct association between bacteria and host cells is important for immunological development [160], regulation, and response [161]. However, bacterial biomass in the lumen, including bacteria that do not actively associate directly with host gastrointestinal tissues, might be more important for the production of key metabolites that can have important physiological effects once they cross into our bloodstream [e.g. 4-ethyl phenyl sulfate (4-EPS) production [104]]. Bacteria contribute to circulating blood levels of amino acids such as tryptophan (including synthesis from dietary serine or indoles), thereby affecting levels of key regulatory neurotransmitters, such as serotonin, and also regulate levels of neuroactive metabolites along the tryptophan degradation (kynurenine) pathway both in the intestine and in the blood [109, 162-164]. Although the common method for assessing a gut microbial community is through the feces, in some circumstances such as inflammatory bowel disease (IBD), mucosal biopsies may help identify bacterial associations that may not be evident in fecal samples, especially in cases where mucosally associated bacteria are not dominant in the fecal sample [123].


The importance of experimental design: cross-sectional versus longitudinal analysis

Given the heterogeneity of ASD and the many potential confounding factors that may influence microbial diversity, looking for associations in very large and well-characterized cohorts may be the key to finding associations between the gut microbiota and disease. Large-scale efforts such as the Earth Microbiome Project [113] and American Gut (http://americangut.org) have demonstrated the willingness of large communities of researchers, and even of the general public, to contribute thousands of samples to provide a fuller picture of the microbial diversity of our planet and our bodies. In particular, aggregating longitudinal datasets from different microbial habitats is starting to provide an understanding of dynamics on different timescales [165], and extending these to studies of people with different clinical conditions provides an especially exciting opportunity at present.

Cross-sectional study designs

Cross-sectional study designs are useful for identifying systematic patterns across a population, testing the hypothesis that some component of microbial variation within a population is correlated with a study parameter (e.g. ASD diagnosis).

Applying a cross-sectional study design to very large cohorts, for instance thousands of subjects, may provide the statistical power to elucidate subtle phenomena when faced with many confounding factors, as is common in microbiome studies where lifestyle, diet, age, genetics, and disease play important roles in shaping community structure.


The benefit of a large cross-sectional study design was demonstrated during an early analysis of the American Gut dataset (http://americangut.org). At first, patterns driven by diet and other lifestyle choices were observed, but statistical significance suffered from limited sample sizes within the specific groups of subjects showing interesting trends. As we collected thousands of additional samples, many of these groups reached sample sizes that increased the confidence and significance of the observed patterns. One such pattern was a population-scale seasonal effect, in which samples collected from individuals during the holiday season in the United States tended to have higher diversity within each sample [124]. Empirical power estimations suggest around 100 samples per group are required to reliably observe these subtle differences across seasons, even after matching individuals for a variety of other factors [166]. These more subtle patterns only appeared through the collection of a large number of samples from a broad cross-section of the population, making it possible to detect the signal against high levels of background noise coming from other factors.
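One common way to estimate power empirically is to subsample each group repeatedly at a given size and record how often a statistical test reaches significance. The sketch below shows that general idea with a permutation test on simulated per-sample diversity values. It is an assumption-laden illustration, not the method of [166]: the function names, the group means (5.2 and 5.0), the standard deviation, and the numbers of draws and permutations are invented for the example and are not American Gut results.

```python
import random
from statistics import mean


def permutation_p_value(a, b, rng, n_perm=200):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)


def empirical_power(group_a, group_b, n_per_group, rng, n_draws=20, alpha=0.05):
    """Fraction of random subsamples of size n_per_group that reach significance."""
    significant = 0
    for _ in range(n_draws):
        sub_a = rng.sample(group_a, n_per_group)
        sub_b = rng.sample(group_b, n_per_group)
        if permutation_p_value(sub_a, sub_b, rng) < alpha:
            significant += 1
    return significant / n_draws


# Simulated diversity values for two hypothetical groups with a subtle shift.
rng = random.Random(0)
holiday = [rng.gauss(5.2, 1.0) for _ in range(300)]
rest_of_year = [rng.gauss(5.0, 1.0) for _ in range(300)]

for n in (10, 50, 100):
    print(n, empirical_power(holiday, rest_of_year, n, rng))
```

Plotting the resulting fractions against the subsample size gives an empirical power curve, from which a minimum group size for a desired power can be read off under the assumptions of the simulation.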

Another recent microbiome study that focused on Crohn's Disease patients [123] and relied on a large cross-sectional design also benefitted greatly from a large sample size. Critically, the researchers noted that the number of samples was more important than sequencing depth (the number of sequences collected from each sample) for detecting statistically significant patterns that were apparent in the full dataset. The study design allowed conclusive identification of key taxa, not previously reported as associated with Crohn's, that differentiate Crohn's patients from healthy controls. Interestingly, once the specific taxa were identified, it was then possible to assess whether the metabolic potential of these organisms made sense in the context of the disease. In this case, some of the microbes that are less abundant in Crohn's patients are involved in the production of butyrate, which is a short-chain fatty acid (SCFA). Butyrate is consumed by intestinal epithelial cells [167], which are instrumental in initiating an immune response [168, 169]. In addition, the researchers were able to identify an amplification effect from antibiotic usage, in which individuals who had recently taken antibiotics had a significantly more pronounced increase in the detrimental taxa observed in Crohn's patients, with a corresponding decrease in beneficial taxa. One taxon in particular, Fusobacterium, was recently found to be highly correlated with colorectal cancer [107], which has a higher incidence in Crohn's and IBD patients.

These observations suggest antibiotic usage by this population should attract closer scrutiny due to the increased risk to the patient [although it should be noted that the specific effects of antibiotic usage in healthy individuals are still poorly understood, and appear to be highly variable in different subjects [95, 170]]. A parallel study in ASD, especially one relating differences in the microbiota to common interventions such as drugs targeted at resolving gastrointestinal symptoms, antipsychotics, antidepressants, dietary changes, and other treatments, and with excellent clinical data, could be especially valuable in understanding which changes in the microbiome are likely to be associated with ASD symptoms and which are most likely to be side-effects of treatment.

Longitudinal study designs

Although cross-sectional studies are useful, they cannot provide insight into variation within an individual over time, limiting their power to observe phenomena such as succession and to factor out between-subject variation in diseases with complex etiologies. Such questions can only be addressed with longitudinal study designs, examining multiple timepoints from the same individual. Ecological succession of the gut microbial community is of particular interest in autism because microbial communities play a central role in training the immune system during childhood development [90]. Early antibiotic usage, for example, is associated with an increase in allergies and obesity [171, 172], and may be associated with disrupting the maturation of the microbiome. Within the human microbiome, an infant's initial microbial communities depend on delivery mode [78]: the infant fecal community tends to resemble the mother's vaginal community after vaginal birth, but instead resembles the skin community after C-section. Koenig et al. [90], through a 3-year time series tracking a newborn, monitored this succession, revealing a large amount of change over time, progressing from a vaginal-like community to a community resembling the adult fecal state [122]. One particularly interesting observation was a substantial regression in community state as a result of the child receiving antibiotics. This regression was rapidly ameliorated, suggesting that resilience in the community is acquired relatively early in life.


However, the impact that these types of disruptions can have on the fledgling immune and endocrine systems is not yet known, nor is the magnitude of this impact with respect to other environmental, dietary, and lifestyle factors.

Some important general considerations in longitudinal study designs include how frequently to sample, whether to focus timepoints around defined interventions, and what auxiliary data (e.g. diet or immunological data) need to be collected at each timepoint versus assessed once for each subject. In general, not enough studies have been done in order to provide detailed guidance on these points, and animal model studies can be misleading. For example, on the basis of studies on mice, which respond within 1 day to dietary shifts [93, 173], we performed a parallel dietary intervention study in humans with very little effect after 10 days in an inpatient setting [52]. However, longitudinal studies of the effects of microbiota transfer from humans to mice have been very useful for elucidating effects of microbiomes associated with obesity [103] and malnutrition [102], and the same is likely to be true for autism [174] given the availability of mouse models [104]. Given the established role of gut microbiota in allergen sensitization in mouse models [161], and given high variability among individual animals as well as among individual humans, understanding effects of changes in the microbiome in response to defined perturbations is likely to benefit considerably from animal model work even when details of the timescale or nature of the response differ among species.


Longitudinal studies of the human microbiome to date have typically employed small sample sizes, limited timepoints, or both. For example, the NIH-funded Human Microbiome Project [16] reported data from only two timepoints in each of 250 subjects. Only a couple of daily studies of apparently healthy adults have employed sampling durations as long as a year [79, 175], and a recent study of dozens of healthy students employed only weekly sampling [176]. Nonetheless, it is clear that dynamics are shaping up to be an important aspect of the human gut microbiome, and studies both of the baseline dynamics of the microbiome in ASD subjects, and of dynamics in response to treatment with different interventions aimed at alleviating different ASD symptoms, hold considerable potential both for stratified treatment and for developing a better biological understanding of the underlying mechanisms.

Comparison of study designs with respect to neuropsychiatric disorders

Within the context of neuropsychiatric disorders, cross-sectional designs have been instrumental in recognizing the correlation between the presence of blood markers of inflammation or intestinal barrier compromise and depression [108], bipolar disorder [177], and autism [146, 178]. The pathogenesis of these diseases differs. However, the implication of inflammation in such a broad range of disorders suggests that inflammation, and both its cause and effect, ought to be a focal point for investigation. In particular, inflammation can lead to a permeable gut, thereby allowing metabolites produced by gut inhabitants (and even the inhabitants themselves, or fragments of them) to leak into the bloodstream [179, 180], and some metabolites can even pass the blood/brain barrier [181]. On the other hand, the predominant source of serotonin in the body is within the large intestine, and it is the role of enterochromaffin cells to synthesize serotonin from tryptophan [109].

Dysregulation of the gut microbiome can trigger secondary effects in these cells that alter the rate of serotonin production [182], with significant changes in neuropsychiatrically relevant domains, including mood [183] and satiety [184], and possibly the stereotypic features of autism [185, 186]. Interestingly, some of the classes of drugs prescribed for treatment of neuropsychiatric disorders act on the gastrointestinal tract and may also affect the immune system [182]. One metabolite of interest is 4-EPS, originally observed to be significantly increased in serum in a mouse model of autism [104] [fascinatingly, this model requires stimulating the mother's immune system prior to birth, resulting in offspring with autism-like symptoms [187]]. Anorexia Nervosa is an eating disorder characterized by the inability or unwillingness to gain weight [188, 189]. ASD is a comorbidity for anorexia, and may be reflective of sociocommunicative problems within individuals with anorexia [190-192]. The microbiome plays a role in the pathology of anorexia; the bacterial ClpB heat shock protein can induce anti-α-MSH antibodies, leading to a reduction in appetite, weight loss, and anxiety [193, 194].

The importance of controlling for technical variables in traits with small effect size

The problem of large versus small effect sizes is in some ways analogous to assessing the risk of a campfire sparking a wildfire. If you asked whether campfires are correlated with wildfires, the answer is likely to be yes, based on an analysis of whether wildfires are more likely in proximity to campgrounds. A large amount of variation in the type of camp, its geographic location, and the definition of wildfire can likely be tolerated. In this case, the presence of a fire is a large effect. If instead you asked whether certain personality types are more likely to spark a wildfire from a campfire, then the answer is subtle, necessitating finer control over data collection. For instance, how personality type is assessed is critical in assuring that everyone underwent the same test and that there was no researcher bias in test administration. In addition, controlling for substance use is necessary in order to understand whether it is personality, or say, the presence of alcohol that leads to accidental wildfires. In this case, the potential small effect of personality (which is a large effect in other contexts) requires more careful control, relative to the large effect of simply having a campfire, in order to properly identify if in fact there is an effect mediated by personality.

The complexity of neurological disorders, and the difficulty to date in pinpointing specific causes, suggests that the causes themselves are varied, subtle, and possibly multifactorial. As such, emphasis on controlling for technical variables is essential to minimize noise and maximize signal. For instance, the Human Microbiome Project sequenced two separate regions of the 16S rRNA gene [16] from the same samples, leading to a confounding effect if analyzing both loci together. The end effect was that it was not feasible to compare data from one locus to another, as the noise stemming from the loci masked any usable signals in the data. On a practical level, using the exact same protocols for all samples of a common type is critical in order to limit the impact of technical bias. Frustratingly, there is even variation that is introduced into the data by the site that is processing the samples, though there are ongoing efforts to understand the drivers of this variation so that it can be normalized, something that is necessary for clinical applications of microbiome assays.

Digging deeper, in addition to tightly controlled technical variables and large sample sizes, using a tiered systems approach can substantiate interpretation and validation. This is particularly useful within studies of autism as there is evidence for genetic predispositions that may ‘activate’ through an environmental trigger, where the microbiome is considered part of the environment. The systems approach can greatly improve the understanding of the roles particular organisms are undertaking. From 16S rRNA data, it is possible to predict a likely functional metagenomic profile [195], but it is not feasible to predict the specific metabolites being produced, which will to a certain extent be modulated by the availability of fermentable substrates and other sources of energy for microbes. These metabolites are the vector of communication between microbes, and between microbes and the host. Similarly, knowing the genetic makeup of the host is informative, but it cannot be used to determine a specific immune response. A tiered approach that includes immunological markers, metabolite profiles, and microbial community data sampled at nearly the same time point enables researchers to tightly validate observations between levels, and truly begin to understand the dynamics of disease.

The influence of diet on the microbiome

Perhaps unsurprisingly, what one eats can influence the composition of the microbiome. Long-term diet has one of the largest known effects on the human gut microbiome: in particular, the balance of carbohydrates to animal protein affects the balance of Prevotella sp. to Bacteroides sp., driving the largest component of overall patterns in the human microbiome within healthy adult Western populations [52].

Cross-culturally, societies with high-grain, low-animal diets also tend to have far more Prevotella at the expense of other major gut taxa [18].

Most short-term dietary changes have been far more modest. However, on the extreme end, shifting to a heavy animal product diet characterized by meats and cheese can, on very short time scales, increase abundance of bile-tolerant organisms [114]. The increase in these organisms is negatively correlated with acetate and butyrate, likely stemming from the reduced fiber load available for microbial fermentation. Butyrate has previously been observed to modulate colonic regulatory T-cell differentiation in murine models [196], and is of particular interest within the study of neurological disorders due to the observed relationships with gut inflammation. The role of dietary gluten and casein in the etiology of ASD remains of intense interest to the community, but strong evidence to date has been scarce [reviewed in [197]].


Propionic acid, a SCFA produced by the microbiome from fermentable dietary carbohydrates [198], has been associated with ASD in rat models [181]. ASD-like symptoms, including a neuroinflammatory response, can be induced by intraventricular infusion of propionic acid [181], resulting in significant changes in behavior and social interaction [199]. However, pathology might only occur in individuals with genetic and/or acquired aberrations in metabolism, since in healthy individuals SCFAs are primarily metabolized in the liver [198], again indicating that associations between gut microbiota and ASD may also involve other underlying genetic factors. In healthy individuals, propionic acid can potentially increase feelings of satiety, lower carcinogenesis and cholesterol [200], and possibly have an anti-inflammatory effect [201]. See [200] for an in-depth review of propionate, including a discussion on fermentable substrates.

The possible role of vitamin D in ASD is also intriguing. Dark/yellow skin requires increased ultraviolet (UV) B exposure to induce sufficient vitamin D in low sunlight regions/seasons [202]. This effect is particularly accentuated among Somalian and other immigrants from sub-Saharan Africa with fundamentalist practices leading to full body cover [203]. There are also dietary practices [avoidance of vitamin D-enriched dairy products, and other common staples such as maize [204]] that may exacerbate deficiencies in vitamin D relating to reduced UVB exposure, contributing to Th1 skew/altered intestinal inflammatory state [205] as well as frankly increased risk of some infectious diseases such as tuberculosis [203]. In the context of vitamin D deficiency (and perhaps also with low levels of lithocholic acid), vitamin D receptor (VDR) expression should be increased [206, 207]. VDR has been reported to negatively regulate intestinal NFkB (and therefore downstream innate immune signaling) induced by bacteria, and bacteria also regulate VDR expression [208]. This may be an important mechanism for explaining the role of vitamin D deficiency in autoimmune diseases, perhaps through a Th17 mechanism involving bacteria that play roles similar to the regulatory roles that segmented filamentous bacteria (SFB) play in mice [209-211].

Although we have a limited understanding of how diet influences the microbiome in the short-term [through extreme changes [114]] and more long-term phenomena [212], we do not yet understand how to manipulate diet to guide a microbial community from one state to another (e.g. from disrupted to healthy). One aim of the American Gut Project is to characterize diet and its impact on the microbiome, with the hope of elucidating systematic differences, if they exist, between dietary restrictions (e.g. vegetarians and paleo eaters). Unfortunately, the reliability of dietary data collected from the general public is often low. Even the recall of individuals for meals consumed over the course of a week can be compromised [213].

The first attempt at collecting detailed diet information by the American Gut Project yielded limited results (though a correlation in diversity with the number of different types of plants consumed was observed). Variables such as the approximate percentage of fat consumed over the course of a week had extremely high variance and in many cases took implausible values. In its second attempt at collecting diet data, the American Gut Project took a two-pronged approach: the first prong is a generalized diet questionnaire that lacks free text entry and contains questions about the frequency of consumption (e.g. in an average week, how often do you consume at least 2–3 servings of fruit in a day?), and the second is a validated food frequency questionnaire offered through a professional service called Vioscreen.

Correlation versus causation

As noted above, several intriguing correlations have been observed between ASD and the gut microbiota. However, establishing causality is challenging. In particular, gut barrier dysfunction is a common comorbidity of ASD, but is known itself to affect the microbiome both in humans and mouse models. Therefore, appropriate controls need to be selected carefully so that the effects of ASD itself are not confounded with the effects of gut barrier dysfunction. Longitudinal studies can help resolve these types of issues, for example, by testing whether changes in the microbiota associated with ASD precede or follow gut barrier dysfunction issues within a subject, although very limited data are available at present.

In mouse models causality is easier to establish because symptoms that model ASD can be induced experimentally and the ability to administer microbially based therapies is substantially greater. The best example of this to date is the MIA study described above [104], wherein autism-like symptoms can be traced to specific metabolites produced by the microbiome, and even reversed using probiotics. For other conditions, including malnutrition [102] and obesity [103], causal pathways may be uncovered by transplanting microbes from humans with a different physiological state into mice and demonstrating that aspects of the phenotype can be recaptured, either using fecal samples directly or using large collections of strains of bacteria isolated from individual fecal specimens (the latter providing evidence that only the bacteria are involved, as metabolites, viruses, antibodies, etc. are not transferred in these experimental designs). These types of studies therefore hold considerable promise for unraveling causality in ASD [174].

Conclusion

The lifestyle and dietary choices of individuals affected by ASD span a broad range, complicating analyses. As was the case with Crohn's Disease, the ability to observe subtle and informative patterns depends on large sample sizes that are only feasible in cross-sectional study designs. Longitudinal designs, on the other hand, offer perspective into the change of a community over time, allowing tests of hypotheses about factors leading to a change in state within an individual (e.g. is a measured parameter such as disease severity modulated by changes in the microbiome) and whether a change in the microbial community happens prior to observable changes in individual state (e.g. reported severity) or vice versa, allowing inferences about causality. As we learned with the American Gut Project, the general public is extremely interested in microbiome research at present, and providing appropriate mechanisms to engage the public is an effective means to reach sample sizes with sufficient power to detect subtle differences in the data. Longitudinal studies, which demand a high level of dedication from participants over an extended period and are expensive, cannot reach comparable sample sizes. Given that a large sample size is more difficult to achieve in these designs, strict exclusion criteria must be defined to minimize confounding factors, maximize the signal, and maintain a high probability that the individuals will continue to the end of the study.

Studies of associations between ASD and the microbiome have generated a number of intriguing hypotheses about how microbes could be involved in the etiology of ASD. However, there are many confounding variables, such as diet and gastrointestinal comorbidities, as well as technical variation among studies and background microbiome differences among cohorts, that complicate analysis. Several approaches are likely to be exceptionally valuable in resolving such complexities: 1) access to large, cross-sectional cohort studies that can help generate hypotheses about combinations of factors that may have a strong influence on ASD outcomes, particularly if interacting in a nonlinear way; 2) longitudinal studies that allow high inter-individual variability in the microbiome to be factored out while revealing associations with progression within each individual, potentially helping to get closer to causal pathways; and 3) animal model research including microbiome transfers and administration of candidate probiotics that will facilitate rapid progress toward understanding whether the microbiome plays a causal or contributory role in some subsets of ASD.


Concluding remarks

The creation of the ASD-cohort within the American Gut Project will make headway on the first point in the conclusion, and partially on the second. Specifically, by reaching out to a large number of families, we anticipate that sample sizes will be sufficient to identify significant signals in the microbiome data, which in turn can be used to generate hypotheses for more focused future studies, or form a basis for interventions to be tried in animal models. Since the cohort includes both ASD-affected individuals and any neurotypical siblings, we will begin to chip away at interindividual differences, as each sibling pair will have a partial control. However, the cohort does not include a longitudinal component, limiting the capacity to assess progression of the disease.


Chapter 5 American Gut: an Open Platform for Citizen-Science Microbiome Research

From: McDonald D*, Debelius J*, Metcalf J, American Gut Consortium#, Knight R.

“American Gut: an Open Platform for Citizen-Science Microbiome Research.” (in preparation)

* these authors contributed equally

# full author list in appendix A

The American Gut Project is ongoing, but we’ve reached an important milestone: over 5,000 samples processed. Due to the open nature of the project, we’re still compiling results from collaborators who wanted to contribute an analysis to the manuscript.

Within the American Gut, I’ve undertaken a variety of roles, including being the lead organizer for the bulk of its existence. One of the main computational contributions I’ve made is the development of the primary processing pipeline, which uses reproducible IPython Notebooks and dispatches compute to an HPC environment. Each sequencing run performed (roughly every 1-2 months) triggers a reprocessing of the full dataset, which as of May 2015 requires approximately 10,000 CPU hours. In addition, I led the development of the present participant website that manages our IRB-approved consent form and questionnaires. The site itself is built on top of Tornado, Postgres and Redis, and is designed to support multiple locales for internationalization purposes. It is currently serving both the American Gut and the British Gut (using British English of course).

Introduction

The human microbiome plays an important role in health and disease, a fact that becomes clearer each year as more medical studies using metagenomic techniques are published. However, we still do not know the breadth of diversity that constitutes a healthy human microbiome in western cultures, and how variables such as lifestyle and diet affect this diversity. This knowledge is an essential comparative baseline for medical research and the improvement of personalized medicine, which is highly dependent on a person’s microbes [214]. It is equally important to educate the public about this aspect of their health, which has only consistently reached the mainstream public in the last decade. Therefore, we launched the largest crowd-funded citizen science project to date, The American Gut Project, to simultaneously accomplish the goals of characterizing the diversity of the human microbiome in people living western lifestyles (starting with North America) and empowering participants with data about their own body’s microbes with a broad set of support venues, including an active forum and online class about the human microbiome. This new reference database of human microbiome samples allowed us to characterize the diversity of the American Gut, describe correlations with participant health, lifestyle, and diet, and evaluate/establish the American Gut as a platform for discovery (e.g., substudies, using it for context).


As of May 27th, 2015, the American Gut Project includes microbial sequence data from 5,020 samples from 4,199 human participants, whose microbes are represented by over 130 million 16S rRNA fragments from the V4 region. The primary specimen type is fecal, totalling 4,279 samples from 3,889 participants. Skin and oral body sites are also represented, totalling 343 and 368 samples respectively (additional detail in supplementary table S5.1). Participants range in age from 0 to 93 years and in Body Mass Index from 8.5 to 78.1, and span a multitude of other health and lifestyle variables (supplemental table S5.2). The only exclusion criteria for participants are that the individual must be greater than 6 weeks old, and cannot be a convicted felon. The project is crowdfunded, and contributions to the project can be made via a FundRazr portal (https://fundrazr.com/campaigns/4Tqx5). A demographic breakdown of the participants can be found in supplementary table S5.3. These participants span urban and rural boundaries, race, and ethnicity in greater numbers than found in the HMP. So far, the majority of the samples collected were from Caucasians living in the United States, with Asians and Pacific Islanders in the United States being the next most well represented group.

The American gut is diverse, and in some cases extreme

The goal of the American Gut Project is to characterize the extent of microbial diversity associated with humans, regardless of geographic location, health, and lifestyle choices. The diversity present in a single sample can be described in terms of the abundances of the individual organisms present in the sample. The environment in which the sample is collected has a great impact on the structure of these samples. For instance, skin samples typically contain higher abundances of aerotolerant organisms, while fecal samples contain predominantly anaerobic organisms and are dominated by Firmicutes and Bacteroidetes. However, what constitutes a healthy microbiome is not well understood. In addition, the bounds of diversity associated with humans are not known. The best reference to date is the Human Microbiome Project, which was heavily weighted toward oral and skin samples from only 242 individuals. In contrast, the American Gut is weighted toward fecal samples, which have been collected from 3,889 individuals and encompass a much wider span of diversity within that sample type (figure 3A and 3C). The total number of operational taxonomic units (OTUs) observed in the fecal samples is much higher (figure 3B). Interestingly, the spread of phylum level abundances for fecal samples is even more extreme than observed in the HMP, with some samples being nearly 100% Firmicutes, and others being almost entirely depleted (figure 3D).


Figure 3: Diversity of the American Gut. (a) PCoA of unweighted UniFrac distance for the AGP and HMP for fecal (AGP red, HMP green), oral (AGP blue, HMP purple) and skin (AGP orange, HMP yellow) samples. (b) Distribution of samples and sequencing depth for the AGP (white) and HMP (grey). (c) Beta diversity added with each additional microbial community. The American Gut samples (teal) encompass more diversity than the HMP samples (red). (d) Phylum level distributions for AGP fecal samples. OTUs which did not map to the eight most prevalent phyla were collapsed into the “Other” category.

Correlations with participant health, lifestyle, and diet

We found correlations between microbial community structure and demographics, geography, lifestyle, and health status (Figure 4; Supplemental figures S5.1-3).


Fecal microbial communities do not form distinct clusters in principal coordinates space, reflecting a complex relationship between the host status and their microbial community (Figure 4a). PD whole tree diversity is one of the strongest gradients. No metadata variable alone explains variation in alpha and beta diversity (Supplemental tables S5.4 and S5.5). However, the current American Gut sample is large enough to reveal novel effects.

One of the most striking drivers in community structure was participant age (Figure 4b; Supplemental figure S5.1). The gut microbiome is viewed as highly plastic during the first two to three years of life, but once it settles into an adult configuration, it remains relatively stable and resilient [18, 52, 175]. Within the American Gut population, we see changes in community structure with age (Figure 4b). The microbiome seems to grow with its host long after solid food has been introduced. Furthermore, we find a divergent trajectory for men and women. There was a small difference in the change in alpha diversity between pre- and postmenopausal women (ages 25-45 and 55-75, respectively), but men of the same ages saw a larger difference in their rate of change (Supplemental figure S5.2). During this period, men experience a drop in testosterone, which women do not experience. In mice, fecal transplants were able to modulate sex hormone levels, supporting the hypothesis that changes in the microbiome may alter testosterone [215].

The American Gut Project is one of the few cross-sectional studies that include samples from individuals with one or more medical conditions, including IBD, diabetes, and obesity. IBD had a profound, significant effect on the microbiome (Supplemental figure S5.2). An effect was also observed when comparing lean and obese participants, and Christensellaceae was associated with lean subjects, as seen in Goodrich et al. [216]. The cross-sectional nature of the American Gut allowed us to look for a persistent effect of antibiotics: PD whole tree diversity was reduced and community structure altered, as measured by unweighted UniFrac distance, in participants who reported antibiotic use in the six months to a year prior to sampling compared to those who had not used antibiotics for more than a year.
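For readers unfamiliar with the two phylogenetic measures named here, the sketch below illustrates them on a toy tree. It is a simplified illustration, not the implementation used in the project's pipeline. It assumes a tree can be represented as a list of branches, each with a length and the set of tips descending from it; the tree, the tip names, and the function names are hypothetical.

```python
def faith_pd(branches, sample):
    """Phylogenetic diversity: total length of branches that lead to at
    least one organism observed in the sample."""
    return sum(length for length, tips in branches if tips & sample)


def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length that leads to organisms from
    only one of the two samples (presence/absence only).
    Assumes at least one sample is non-empty."""
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    return unique / (unique + shared)


# Toy tree ((A:1,B:1):1,(C:1,D:1):1) written as (branch length, descendant tips).
toy_tree = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]
sample_1 = {"A", "B"}  # organisms observed in a hypothetical first sample
sample_2 = {"B", "C"}  # organisms observed in a hypothetical second sample

pd_1 = faith_pd(toy_tree, sample_1)
distance = unweighted_unifrac(toy_tree, sample_1, sample_2)
print(pd_1, distance)
```

In this toy case the first sample covers three units of branch length, and three of the five units of branch length observed across the pair are unique to one sample, giving a distance of 0.6. A loss of organisms after antibiotic use would lower the first quantity and shift the second.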

Among healthy adults aged 20 to 69 with BMIs of 18.5 to 30 who had not used antibiotics in the past year and reported no diagnosis of IBD or diabetes, lifestyle strongly influenced the microbiome. The effect tended to result from extreme states, especially deficiencies. Individuals who reported eating fewer than five types of plants in a week had lower alpha diversity than their plant-eating counterparts, even those who reported consuming six to ten types (FDR-corrected p < 0.05) (Supplementary figure S5.3, supplementary table S5.4). PICRUSt [195] metagenomic prediction showed a change in the metagenomic profile for the non-plant eaters, with a tendency toward oxidative pathways (Figure 4d).

The frequency of alcohol consumption was correlated with increased alpha diversity and changes in community structure, measured by unweighted UniFrac distance [217] (Figure 2c, Supplemental figure S5.1). A previous comparison of alcoholics and moderate drinkers found differences in alpha diversity [218], yet these differences reflect seemingly contradictory evidence about alcohol consumption: moderate drinking is seen to reduce inflammatory stress and cardiovascular risks, while binge drinking or addictive behavior may exacerbate the same conditions [219].

Alcohol also reflects a general trend seen in lifestyle variables, including plant consumption, last antibiotic use, and sleep duration, in which the unweighted UniFrac distance between the samples decreases as alpha diversity increases. This is not simply a function of sample number: larger groups do not display the convergent behavior. The trend suggests a dysbiosis gradient in which higher diversity leads to a convergent community, while health or lifestyle choices lead to a loss of species. This is congruent with a metagenomic model of IBD and obesity, where the loss of peripheral pathways was strongly associated with disease [220].

The American Gut is unique among current microbiome datasets in the geographical distribution of samples: participants hail from 49 US states, the District of Columbia and Puerto Rico. However, no spatial autocorrelation or correlation between the microbiome and state of residence or region of the country could be established (Figure 4e, Supplemental figure S5.4). Previous studies have shown that individuals who cohabitate tend to have more similar microbiomes, especially when animals are present in the household [221]. This may suggest that the gut microbiome is altered through close contact, rather than regional patterns.


Figure 4: Health and Lifestyle in the American Gut fecal samples. (a) Fecal samples do not cluster by any metadata variable in unweighted UniFrac PCoA space, although PD whole tree alpha diversity represents a clear gradient. (b) Aging has a significant association with unweighted UniFrac within fecal samples. Each bar is the average distance (± stdev) between the reference group, given by the label for the cluster of bars, and the group described by the bar color. Significance was tested with a one-tailed permutation t-test (+: p < 0.1, *: p < 0.05, **: p < 0.01, ***: p < 0.001). (c) Alcohol consumption and phylogenetic diversity are significantly positively correlated (FDR-corrected p < 0.05). (d) In predicted functional profiles from PICRUSt, people who consume fewer than five types of plants have fecal microbiomes with unique pathways compared to those who consume more plants. (e) No significant spatial autocorrelation between fecal samples was seen at multiple distance scales.


Power curves and effect sizes

Statistical power measures the probability of finding a significant difference between two sets of observations, given a difference in the underlying populations. A Monte Carlo simulation was used to estimate statistical power for PD whole tree alpha diversity and weighted and unweighted UniFrac distance on the most extreme groups in nine metadata categories in the American Gut population, reflecting health, lifestyle, diet, and demographics (Figure 5). Bodysite on the same individual was used as a reference, since it is known to have a strong effect on microbiome composition [16]. Fecal samples were matched for IBD and diabetes diagnosis, last reported antibiotic use, the number of types of plants consumed, age in decades, collection season and sleep duration.
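The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration and not the analysis code used for the American Gut: the function names, the choice of a permutation test on the difference in group means, and the default numbers of draws and permutations are all assumptions made for the example.

```python
import random


def permutation_p_value(a, b, n_perm=200, rng=None):
    """Two-sided permutation test on the difference in means (assumed test)."""
    rng = rng or random.Random(0)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


def estimate_power(group_a, group_b, n_per_group, n_draws=100, alpha=0.05, seed=0):
    """Fraction of random subsamples of size n_per_group that reach significance."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_draws):
        sub_a = rng.sample(group_a, n_per_group)
        sub_b = rng.sample(group_b, n_per_group)
        if permutation_p_value(sub_a, sub_b, rng=rng) < alpha:
            significant += 1
    return significant / n_draws
```

Calling `estimate_power` over a range of `n_per_group` values traces out a power curve of the kind summarized in Figure 5; the inputs would be per-sample values (for example alpha diversity) for the two extreme groups of a category.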

We found that IBD had the strongest effect on fecal microbial communities using unweighted metrics; fewer than 30 samples per group were needed to see a difference in alpha and beta diversity at 80% power. In contrast, a difference in weighted UniFrac required more than 350 samples for the same power, suggesting that the presence or absence of specific taxa is better at discriminating healthy from diseased communities. This was a trend overall: unweighted UniFrac distance was a more powerful metric in human health than weighted UniFrac distance.


Figure 5: Statistical power estimates in the American Gut. Power is shown as the probability of finding a significant (p < 0.05) difference between the two most extreme groups for a category, as a function of the number of samples analyzed in each category. Bodysite (oral vs fecal, gray), IBD diagnosis (teal), antibiotic use in the past month compared to not in the last year (brown), non-drinkers compared to daily drinkers (gold), less than six hours of sleep compared to more than eight (brown), lean vs obese BMI (pink), people in their 20s vs 60s (purple), those who consumed fewer than five types of plants compared to thirty or more (maroon), rare vs daily exercise (orange) and samples collected in winter vs summer (green) were compared. Power was estimated for (a) PD whole tree diversity, (b) unweighted UniFrac distance and (c) weighted UniFrac distance over the extremes of ten categories representing health and lifestyle in the AGP.

Using the American Gut as a discovery platform

As a subproject of the Earth Microbiome Project, all samples have been processed using the standard 16S EMP protocol. The specific intent is to facilitate meta-analyses with minimal technical variability. Any study that conforms to the EMP processing standard can be combined with relative ease for additional insight, or a wider selection of controls. By combining fecal samples from self-reported healthy individuals in the American Gut with fecal samples collected from an Intensive Care Unit microbiome pilot project (conducted as part of the International Nutritional Survey), we were able to observe a marked difference in beta diversity (figure 6a). These data helped to highlight specific taxa that are differential in ICU patients (regardless of the reason for admission), and include previously recognized beneficial organisms like Faecalibacterium prausnitzii (figure 6b).

Figure 6: ICU fecal samples compared against fecal samples from self-reported healthy American Gut participants. a) PCoA of unweighted UniFrac distances comparing ICU and American Gut healthy fecal samples. b) Significantly different organisms. A positive value indicates the taxon is significantly depleted (p < 1e-10, Kruskal-Wallis, FDR corrected, OTUs binned by genus name) in the ICU samples while a negative value indicates the organism is enriched.

The infrastructure built to support the American Gut enables other researchers to perform their own microbiome analyses as components of their work. In effect, the American Gut is a low-cost platform for 16S sequencing. To date, these projects include: a Sloan-funded project examining the spatial and temporal aspects of microbial community establishment in the office environment by monitoring nine offices in Flagstaff, San Diego, and Toronto; an Autism Spectrum Disorder (ASD) cohort of the American Gut (supported by a personal contribution) which is aimed at assessing the relationship between gastrointestinal distress and ASD; and an intensive care unit microbiome pilot run through the International Nutritional Survey which is looking at patients from five different ICUs, three body sites and two time points.

Concluding remarks

To date, the American Gut Project has raised over $1,000,000 from over 8,000 members of the general public and key industry donations. One of the primary goals of the project has been the development of reusable data, metadata and analyses.

All data generated by the project are deposited (de-identified) into the public domain via the European Bioinformatics Institute (EBI). In addition, every bit of source code used has an open source license, and all distributed code from the project has been placed under the BSDv3 license. The goal is open access, open source, and open data.


Chapter 6 The Biological Observation Matrix (BIOM) Format or: how I learned to stop worrying and love the ome-ome

From: McDonald D*, Clemente J*, Kuczynski J, Rideout JR, Stombaugh J, Wendel D, Wilke A, Huse S, Hufnagle J, Meyer F, Knight R and Caporaso JG. (2012). “The Biological Observation Matrix (BIOM) Format or: How I Learned To Stop Worrying and Love the Ome-ome.” GigaScience 1:7. PMID: 23587224. *These authors contributed equally

The Genomic Standards Consortium has recognized the Biological Observation Matrix (BIOM) as a standard way to represent a common datatype within the -omics fields: the observation by sample matrix. BIOM is a general file format that allows for efficient on-disk representation of these matrices through the use of sparse matrix representations, and it includes mechanisms for representing arbitrary metadata associated with the observations and samples. BIOM has become a valuable file format in the last few years, as it has allowed researchers to reuse, instead of redevelop, analytic tools on different datatypes. For instance, the Quantitative Insights into Microbial Ecology package, which was originally developed for the analysis of 16S rRNA gene datasets, was built around the BIOM format and as such has allowed researchers to take advantage of its rich statistical and visualization frameworks to analyze metagenomic, viromic and metabolomic data.


For my part as a co-first author, I conceived of the original concept of BIOM, laid out the prototype which defined the initial file format and the in memory representation, and led the development of the software up to and following publication of the manuscript.

Background

Advances in DNA sequencing have led to exponential increases in the quantity of data available for “comparative omics” analyses, including metagenomics (e.g., [50, 99]), comparative genomics (e.g., [222]), metatranscriptomics (e.g., [223, 224]), and marker-gene-based community surveys (e.g., [79, 225]). With the introduction of a new generation of “benchtop sequencers” [226], accessible to small research, clinical, and educational laboratories, sequence-based comparative omic studies will continue to increase in scale. The rate-limiting step in many areas of comparative omics is no longer obtaining data, but analyzing that data (the “bioinformatics bottleneck”) [227, 228]. One mechanism that will help reduce this “bioinformatics bottleneck” is standardization of common file formats to facilitate sharing and archiving of data [229].

As with the increasing prevalence of high-throughput technologies in the biological sciences, the categories of comparative omics data, which we collectively term the “ome-ome”, are rapidly increasing in number (Figure 7). Researchers are relying on more types of omics data to investigate biological systems, and the coming years will bring increased integration of different types of comparative omics data [99, 230]. A common data format will facilitate the sharing and publication of comparative omics data and associated metadata and improve the interoperability of comparative omics software. Further, it will enable rapid advances in omics fields by allowing researchers to focus on data analysis instead of on formatting data for transfer between different software packages or reimplementing existing analysis workflows to support their specific data types.

[Figure 7 plot: number of unique “-ome” terms in MEDLINE (y-axis, 0 to 250) by year (x-axis, 1940 to 2010).]

Figure 7: Growth of the “ome-ome”. Growth of the “ome-ome”, or the types of “omic” data, over time based on mentions in Medline abstracts. Chao1 analysis indicates that there may be over 3,000 “omes”: however, given the well-known limitations of such nonparametric extrapolation techniques, we can only wonder how many “omes” remain to be discovered as technological advances usher in a new era of “ome-omics”.


Despite the different types of data involved in the various comparative omics techniques (e.g., metabolomics, proteomics, or microarray-based transcriptome analyses), they all share an underlying, core data type: the “sample by observation contingency table”, or the matrix of abundances of observations on a per-sample basis. In marker gene surveys, this table contains counts of OTUs (Operational Taxonomic Units) or taxa on a per-sample basis; in metagenome analyses, counts of orthologous groups of genes, taxa, or enzymatic activities on a per-metagenome basis; in comparative genomics, counts of genes or orthologous groups on a per-genome basis; and in metabolomics, counts of metabolites on a per-sample basis.

Many tools have been developed to analyze these contingency tables, but they are generally focused on a specific type of study (e.g., QIIME for marker gene analysis [130], MG-RAST for metagenome analysis [231], VAMPS for taxonomic analysis [232]). However, many techniques are applicable across data types, for example rarefaction analyses (i.e., collector curves). These are frequently applied in microbiome studies to compare how the rate of incorporation of additional sequence observations affects the rate at which new OTUs are observed. This allows us to determine whether an environment is approaching the point of being fully sampled (e.g., [130]). Rarefaction curves could similarly be applied in comparative genomics to study the rate of discovery of new gene families, as done in [34]; a researcher could compile a contingency table of genomes (samples) by genes (observations) and use a rarefaction curve to determine how quickly new gene families were accumulating as new genome sequences are added. A standard format for biological sample by observation contingency tables will support the use of bioinformatics pipelines for different data types than those they were initially designed for (e.g., QIIME could be applied to generate rarefaction curves for proteomic data, or MG-RAST could output metatranscriptome tables). Adoption of this standard will additionally facilitate the adoption of future analysis pipelines, as users can then directly apply those pipelines to their existing data.
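Because a rarefaction curve only needs a vector of counts per observation, the same routine applies whether the observations are OTUs, gene families, or metabolites. The sketch below illustrates this with repeated subsampling without replacement; the function name and parameters are hypothetical and this is not the QIIME implementation.

```python
import random


def rarefaction_curve(counts, depths, n_iter=10, seed=0):
    """Mean number of distinct observations seen at each subsampling depth.

    counts: one non-negative integer per observation (e.g., per OTU or gene family)
            for a single sample.
    depths: subsampling depths, each no larger than sum(counts).
    """
    rng = random.Random(seed)
    # Expand the count vector into one entry per individual observation event.
    pool = [obs for obs, count in enumerate(counts) for _ in range(count)]
    curve = []
    for depth in depths:
        if depth > len(pool):
            raise ValueError("depth exceeds the total count of the sample")
        richness = [len(set(rng.sample(pool, depth))) for _ in range(n_iter)]
        curve.append(sum(richness) / n_iter)
    return curve
```

A curve that flattens as depth increases suggests the sample is approaching full coverage, while a curve that is still rising indicates that new observations continue to accumulate.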

In many existing software packages (e.g., [130, 231]), contingency tables are represented as tab-separated text, but minor syntactic differences prevent easy exchange of data between tools. For example, differing representation of samples and observations as either rows or columns, and the mechanism for incorporating sample or observation metadata (if possible at all), cause the formats used by different software packages to be incompatible. Additionally, in many of these applications a majority of the values (frequently greater than 90 %) in the contingency table are zero, which is taken to mean that the corresponding “observation” was not observed in the corresponding sample. The fraction of the table that has non-zero values is defined as the “density”, and thus a matrix with a low number of non-zero values is said to have a low density. As data sets continue to increase in size, “dense” representations of these tables, where all values are represented (in contrast to “sparse” representations, where only non-zero values are represented), result in an increasingly inefficient use of disk space. For example, marker gene survey OTU tables with many samples can have as few as 1 % non-zero values. As the collection of samples becomes more diverse, these tables become even sparser and their size (both on disk and in memory) becomes a considerable barrier to performing meta-analyses.
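The distinction between dense and sparse representations, and the definition of density given above, can be made concrete with a small sketch. The helper names are hypothetical, and the table is a plain list of rows rather than any particular library's matrix type.

```python
def density(table):
    """Fraction of entries in a dense table (list of rows) that are non-zero."""
    total = sum(len(row) for row in table)
    nonzero = sum(1 for row in table for value in row if value != 0)
    return nonzero / total


def to_sparse(table):
    """Keep only the non-zero entries as (row, column, value) triplets."""
    return [(i, j, value)
            for i, row in enumerate(table)
            for j, value in enumerate(row)
            if value != 0]


def to_dense(triplets, n_rows, n_cols):
    """Rebuild the full table; positions absent from the triplets are zero."""
    table = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triplets:
        table[i][j] = value
    return table
```

The sparse form stores three numbers per non-zero entry, so it pays off only when the density is low; for a table that is mostly non-zero, the dense form is smaller.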

Sample and observation metadata are essential for the interpretation of omics data, and for facilitating future meta-analyses. Two projects have recently arisen to address the need for metadata standards: MIxS [139], which defines what metadata should be stored for diverse sequence types, and ISA-TAB [229], which defines a file format for storing that metadata. A standard file format for representing sample by observation contingency tables could complement these existing standards by providing a means for associating MIxS-compliant metadata provided in ISA-TAB format with samples and observations.

The Biological Observation Matrix (BIOM, pronounced “biome”) file format has been developed with input from the QIIME, MG-RAST, and VAMPS development groups.

The BIOM file format is based on JSON [233], an open standard for data exchange.

The primary objectives of the BIOM file format are presented in table S6.1. In addition to consolidating data and metadata in a single, standard file format, the BIOM file format supports sparse and dense matrix representations to efficiently store these data on disk. The OTU table with 6,164 samples and 7,082 OTUs described in the Analyses section contains approximately 1 % non-zero values. Because zero values are not included in the sparse BIOM-formatted file, representing the same information in this format requires 14 times less space than with a tab-separated text file. As a sparse matrix increases in size or decreases in density (e.g., in an Illumina sequencing run versus a 454 sequencing run), this difference in file size will further increase.
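To make the sparse JSON representation concrete, the sketch below assembles a small BIOM-style document with Python's standard `json` module. The helper `make_sparse_biom` is hypothetical, the placeholder values for `generated_by` and `date` are invented for the example, and the field names are given as recalled from the version 1.0 format definition; they should be checked against the specification [234] before use.

```python
import json


def make_sparse_biom(observation_ids, sample_ids, triplets, taxonomy=None):
    """Build a dict shaped like a sparse BIOM 1.0 table (field names assumed)."""
    taxonomy = taxonomy or {}
    rows = []
    for obs_id in observation_ids:
        metadata = {"taxonomy": taxonomy[obs_id]} if obs_id in taxonomy else None
        rows.append({"id": obs_id, "metadata": metadata})
    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "format_url": "http://biom-format.org",
        "type": "OTU table",
        "generated_by": "example sketch",   # placeholder value
        "date": "2015-01-01T00:00:00",      # placeholder value
        "rows": rows,                        # observations, with optional metadata
        "columns": [{"id": s, "metadata": None} for s in sample_ids],
        "matrix_type": "sparse",
        "matrix_element_type": "int",
        "shape": [len(observation_ids), len(sample_ids)],
        # Only non-zero entries are stored, as [row index, column index, value].
        "data": [list(t) for t in triplets],
    }


def dump_biom(table):
    """Serialize the table to a JSON string."""
    return json.dumps(table)
```

Observation and sample metadata travel in the same file as the counts, which is the consolidation of data and metadata that the format aims for.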

To support the use of the BIOM file format, the format specifications and an open-source software package, biom-format, are available at http://biom-format.org. Included with the format specification is a format validator, and included in the software package is a script to easily convert BIOM files to tab-separated text representations (which can be useful when working with spreadsheet programs) and Python objects to support working with this data. Supplemental data S6.2 presents a comparison of QIIME software for processing a contingency matrix as a 2D array (derived from QIIME 1.4.0) versus using the biom-format objects (derived from QIIME 1.4.0-dev). The biom-format software package will additionally serve as a repository where other developers can submit implementations of these objects in other languages.

Data description

To compare the relative size of storing sample by observation contingency tables in sparse BIOM-formatted files versus tab-separated files, we extracted 60 QIIME OTU tables from the QIIME database. Each observation (OTU) in these tables contains a single metadata entry corresponding to the taxonomy assigned to the OTU, and the tab-separated files were formatted in “Classic QIIME OTU table” format (i.e., the format generated by QIIME 1.4.0 and earlier). An example file in the BIOM format can be found in [234], and an example of the classic QIIME OTU table can be found in [235].

Analyses

The OTU tables selected for this study ranged in size from 6 samples by 478 OTUs (BIOM size: 0.10 MB; classic QIIME OTU table size: 0.06 MB) up to 6,164 samples by 7,082 OTUs (BIOM size: 12.24 MB; classic QIIME OTU table size: 175.76 MB). In the latter case, at approximately 1 % density there are 100-fold fewer counts in the sparse OTU table, but the file size is only about 14-fold (rather than 100-fold) smaller for BIOM-formatted versus tab-separated text. This discrepancy arises because the matrix positions must be stored with the counts in the sparse representation (as row number, column number, value; see [234]) but are implied in tab-separated text. The file compression ratio (tab-separated text file size divided by BIOM file size) that is achieved when representing contingency tables in sparse versus dense formats is therefore a function of the density of the contingency table. In the data presented in Figure 8 the density ranges from 1.3 % non-zero values to 49.8 % non-zero values, with a median of 11.1 %. The file compression ratio increases with decreasing contingency table density for this data set ($\mathrm{compression\ ratio} = 0.2 \times \mathrm{density}^{-0.8}$; $R^2 = 0.9$; Figure S6.1).
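The arithmetic behind these figures can be checked with a few lines of Python. The function names are hypothetical, and treating density as a fraction between 0 and 1 in the fitted power law is an assumption; it is the reading under which the fitted curve crosses a ratio of 1 close to the break-even density of around 15 % reported for this data set.

```python
def observed_compression_ratio(tsv_mb, biom_mb):
    """Tab-separated file size divided by sparse BIOM file size."""
    return tsv_mb / biom_mb


def fitted_compression_ratio(density, scale=0.2, exponent=-0.8):
    """Power-law fit quoted in the text; density is assumed to be a fraction."""
    return scale * density ** exponent


def break_even_density(scale=0.2, exponent=-0.8):
    """Density at which the fitted compression ratio equals 1."""
    return (1.0 / scale) ** (1.0 / exponent)


# Largest table in the data set: 175.76 MB as text, 12.24 MB as sparse BIOM.
largest_table_ratio = observed_compression_ratio(175.76, 12.24)
```

Under these assumptions the two reported file sizes give a ratio of about 14.4, and the fitted curve reaches a ratio of 1 at a density of roughly 13 %. The fit is approximate, so individual tables can sit well above or below the curve.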


[Figure 8 plot: classic QIIME OTU table file size (MB, y-axis) versus sparse BIOM file size (MB, x-axis), both axes spanning 0.01 to 1000 MB in powers of ten; legend entries: OTU Table, Equal file size.]

Figure 8: BIOM size comparison. Size of sparse BIOM formatted file versus size of QIIME “classic” OTU Table formatted file, for 60 independent microbiome studies currently stored in the QIIME database at http://www.microbio.me/qiime.

At small file sizes, tab-separated text files represent OTU tables more efficiently than BIOM-formatted files, but starting at approximately 0.2 MB the sparse BIOM representation becomes more efficient (Figure 8). This extra overhead incurred with the sparse representation is negligible (on the order of kilobytes) in cases where the dense representation is more efficient. As contingency table density increases, as may be the case with certain types of comparative omics data, users can format their files in dense BIOM format to avoid inefficiencies with sparse representations. We find that dense representations become more efficient than sparse representations at a density of around 15 % (Figure S6.1).

In general, a simple tab-separated format will be slightly more efficient for storage than the dense BIOM file format, but will not provide a standard way to store sample and observation metadata or provide interoperability across comparative omics software packages; thus, the BIOM file format will still be advantageous.

Similarly, compressing tab-separated text files representing sample by observation contingency tables (e.g., with gzip) can result in a similar degree of compression as converting a dense matrix representation to a sparse representation, but would not provide the additional benefits of the BIOM file format.

Discussion

The biom-format software package has been designed with three main objectives: to be a central repository for objects that support BIOM-formatted data in different programming languages, to have minimal external dependencies, and to provide an efficient means for representing biological contingency tables in memory along with convenient functionality for operating on those tables. At present we provide Python 2 (2.6 or greater) objects in both dense and sparse representations to allow for efficient storage across a range of densities of the underlying contingency table data. Our goal is to make the biom-format project an open development effort so that other groups can provide objects implemented in different programming languages (ideally with APIs as similar as possible to the Python API).

Managing a community development effort is a challenge. To address this, we will maintain a code repository on GitHub [236] which is currently used for managing many successful collaborative software projects such as IPython, homebrew, and rails. The core BIOM development group will review new additions (in the form of pull requests) and, when they are fully documented and tested, will merge them into the biom-format repository.

A challenge in achieving community adoption of a new standard is convincing users and developers to overcome the learning curve associated with it. To address this, we have fully documented the BIOM file format standard, as well as the motivations for it, on the BIOM format website [237]. The biom-format software project contains a conversion script that allows users to easily move between BIOM-formatted files and tab-separated text files. This allows users to interact with their data in ways they traditionally have (e.g., in a spreadsheet program). To reduce the barrier-to-entry for using the biom-format software, the Python objects in the biom-format package are designed to be easily installable on any system running Python 2.6 or 2.7. To achieve this, biom-format relies only on the Python Standard Library and NumPy (a common dependency for scientific Python applications which is installed by default on Mac OS X and many versions of Linux).

The introduction and refinement of high-throughput sequencing technology is causing a large increase in both the number of samples and the number of observations involved in comparative omic studies (e.g., [31, 79]), and sparse contingency tables are therefore becoming central data types in these studies. For example, it is not uncommon to find hundreds of thousands of OTUs in modern microbial ecology studies (unpublished observation based on preliminary analysis of the initial Earth Microbiome Project [31] dataset). Whether these observations represent new biological findings or sequencing error is a contested topic [38, 238, 239], but certain poorly characterized environments are hypothesized to contain large reservoirs of yet unknown OTUs [240]. We expect both the number of samples and the number of observations involved in comparative omic studies to continue to grow over the coming years, and an efficient representation of this data that can be easily interrogated across different bioinformatics pipelines will be essential to reducing the bioinformatics bottleneck. Similarly, integrating metadata in BIOM formatted files, ideally based on standards such as MIxS and ISA-TAB, will facilitate meta-analysis across different data types.

The number of categories of comparative omic data (e.g., genomic, metabolomic, pharmacogenomic, metagenomic) is increasing rapidly, and the need to develop software tools specific to each of these data types contributes to the bioinformatics bottleneck. The BIOM file format provides a standard representation of the “sample by observation contingency table”, a central data type in broad areas of comparative omics, providing the means to generally apply tools initially designed for analysis of specific “omes” to diverse “omic” data types. The BIOM file format is currently recognized as an Earth Microbiome Project Standard and a Candidate Standard by the Genomic Standards Consortium, and is being adopted by groups developing comparative omics analysis software. We can embrace the proliferation of omics techniques by using standards such as the BIOM file format to reduce the gap in availability of bioinformatics tools for new domains of omics research. Taken together, these advances are an additional step toward the next phase of comparative omics analysis, in which fundamental scientific findings will increasingly be translated into clinical or environmental applications.

Methods

Growth of the ome-ome

In order to evaluate the growth of the “ome-ome” over time, we searched a local installation of MEDLINE abstracts (through 2010) and tabulated the number of distinct terms ending in “ome” or “omes” on an annual basis. A list of false positive terms was compiled from the Mac OS X 10.7.4 built-in dictionary and from an initial pass over MEDLINE to identify irrelevant terms ending in “ome” that are not part of the standard English lexicon (e.g., “trifluorome”, “cytochrome”, “ribosome”). While some false positives are still present, the number of unique “ome” terms being referenced in the biomedical literature is growing rapidly.
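The tabulation described above can be sketched in a few lines. This is an illustration only and not the script used for Figure 7: the regular expression, the folding of plurals into singular forms, and the input layout of (year, abstract text) pairs are assumptions.

```python
import re
from collections import defaultdict

# Lowercase words ending in "ome" or "omes" with at least one preceding letter.
OME_TERM = re.compile(r"\b[a-z]+omes?\b")


def unique_ome_terms_by_year(abstracts, false_positives):
    """Count distinct "-ome" terms per year.

    abstracts: iterable of (year, text) pairs.
    false_positives: set of singular terms to ignore (e.g., dictionary words).
    Years in which no term survives filtering are omitted from the result.
    """
    seen = defaultdict(set)
    for year, text in abstracts:
        for term in OME_TERM.findall(text.lower()):
            singular = term[:-1] if term.endswith("omes") else term
            if singular not in false_positives:
                seen[year].add(singular)
    return {year: len(terms) for year, terms in sorted(seen.items())}
```

The quality of the count depends almost entirely on the false positive list, which is why a dictionary pass and a manual pass over the matches were both needed.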

BIOM file format

The BIOM file format version 1.0.0 is based on JSON, an open standard for data exchange for which native parsers in several programming languages are available.

JSON was chosen as the basis for the BIOM format as it is a widely accepted and lightweight transmission format used on the Internet since 1999. It is directly translatable into XML if necessary, but embodies less complexity and overhead (in terms of the amount of supporting information that must be included in a valid file).

Several representative BIOM-formatted files and classic QIIME OTU table files used in the analysis presented in Figure 8 are provided in a zip file [241]. A full definition of the BIOM format is available at [234].

The BIOM project consists of two independent components. The first component is the BIOM file format specification, which is versioned and available at [237]. A BIOM validator script is additionally packaged with the format specification, and allows users to determine if their files are in valid BIOM format. The second component of the BIOM format project is the biom-format software package, which contains general-purpose tools for interacting with BIOM formatted files (e.g., the convert_biom.py script, which allows for conversion between sparse and dense BIOM-formatted files, and for conversion between BIOM-formatted files and tab-separated text files), an implementation of support objects for BIOM data in Python, and unit tests for all software. We hope that the development of similar support objects in other programming languages will become a community effort, which we will manage using the GitHub environment.
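The kind of conversion that such a script performs can be illustrated with a short sketch that renders a sparse BIOM-style dictionary as tab-separated text. This is not the code of convert_biom.py: the function `biom_to_tsv` is hypothetical, the dictionary keys are assumed to follow the version 1.0 format definition, and the `#OTU ID` header label is modeled on classic QIIME OTU tables.

```python
def biom_to_tsv(biom):
    """Render a sparse BIOM-style dict as tab-separated text, observations as rows."""
    if biom["matrix_type"] != "sparse":
        raise ValueError("this sketch only handles sparse tables")
    n_rows, n_cols = biom["shape"]
    # Start from an all-zero table and fill in the stored non-zero entries.
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in biom["data"]:
        dense[row][col] = value
    header = ["#OTU ID"] + [column["id"] for column in biom["columns"]]
    lines = ["\t".join(header)]
    for row_info, values in zip(biom["rows"], dense):
        lines.append("\t".join([row_info["id"]] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"
```

The zeros that the sparse file omits reappear explicitly in the text output, which is why the tab-separated form grows so quickly for large, low-density tables.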

Concluding remarks

BIOM was a very different type of project to be involved in. How do you define a format, and then actually get people to use it? First off, it was critical to fill a need within a community. To this end, BIOM standardized the way we represented microbial composition data and allowed researchers to store these often sparse data efficiently. Second, there needed to be wide adoption. With BIOM, we sought early adoption from the metagenomics community through MG-RAST, as well as the broader areas of the microbiome community through VAMPS. And finally, there needed to be a benefit for adoption. What this format allowed us to do early, and rapidly, was to open the statistical and visualization tools developed for microbial community analysis to other fields through QIIME. In addition, it reduced the burden of tiering datasets as all the components speak the same language. BIOM is the primary distributable from the American Gut Project, and has formed a central component of the reusability of the American Gut data.


Chapter 7 An improved Greengenes taxonomy for bacteria and archaea with explicit ranks

From: McDonald D, Price M, Goodrich J, Nawrocki E, DeSantis T, Anderson G, Eddy S, Arkin A, Knight R, Hugenholtz P. (2012). “An improved greengenes taxonomy for bacteria and archaea with explicit ranks.” ISME Journal Mar;6(3):610-8. PMID: 22134646.

This published chapter covers a novel algorithm for transferring taxonomy from one tree to another. This method, which has become an integral component of the Greengenes rRNA reference database, reduced the manual curation effort of the lead curator for Greengenes by hundreds to thousands of hours per release, and additionally made the bulk of the taxonomic placements objective. Greengenes has become a standard reference dataset for use in microbial ecology studies, including the Human Microbiome Project and the Earth Microbiome Project (of which the American Gut is a part), and has been cited 400 times (as of April 3, 2015). To a certain extent, Greengenes shares a goal with the American Gut in that it attempts to be a standard reference database for community use, in order to learn more about the specific communities under study.

Introduction

A robust universal reference taxonomy is a necessary aid to interpretation of high-throughput sequence data from microbial communities [242]. Taxonomy based on the 16S rRNA gene (16S) is the most comprehensive and widely used in microbiology today [243, 244], but has yet to reach its full potential because numerous microbes belong to taxa that have not yet been characterized and because numerous sequences that could be reliably classified remain unannotated. For example, two thirds of 16S sequences in GenBank are only classified to domain (kingdom), that is, Archaea or Bacteria, by the National Center for Biotechnology Information (NCBI) taxonomy: this taxonomy is likely the most widely consulted 16S-based taxonomy, despite its disclaimer that it is not an authoritative source, in part because classifications are maintained up to date through user submissions.

Most of the un(der)classified sequences are from culture-independent environmental surveys; these sequences can swamp BLAST searches, leaving users baffled about the phylogenetic affiliation of their submitted sequences. This shortcoming has been addressed by several dedicated 16S databases, including the Ribosomal Database Project [245], Greengenes [246], SILVA [243] and EzTaxon [247], that classify a higher proportion of environmental sequences. However, improvements are still needed because many sequences remain unclassified and numerous classification conflicts exist between the different 16S databases [246].

Moreover, the emergence of large-scale next-generation sequencing projects such as the Human Microbiome Project [2, 248] and TerraGenome [249], and the availability of affordable sequencing to a wide range of users who have traditionally lacked access, mean that the need to integrate new sequences into a consistent universal taxonomic framework has never been greater.


The Greengenes taxonomy is currently based on a de novo phylogenetic tree of

408 135 quality-filtered sequences calculated using FastTree [250]. De novo tree construction is among the most objective means for inferring sequence relationships, but requires either generation of new taxonomic classifications or transfer of existing taxonomic classifications between iterations of trees as the 16S database expands. Previously, we developed a tool that automatically assigns names to monophyletic groups in large phylogenetic trees [251], which is useful for naming novel (unclassified) clusters of environmental sequences. Here we describe a method to transfer group names from any existing taxonomy to any tree topology that has overlapping terminal node (tip) names. We used this ‘taxonomy to tree' approach to annotate the 408 135 sequence tree with the NCBI taxonomy as downloaded in June 2010 [252], supplemented with the Greengenes taxonomy from the previous iteration [251] and CyanoDB [253]. Explicit rank information, prefixed to group names, was incorporated into the Greengenes taxonomy to help users orient themselves and to improve the consistency of the classification. We assessed the consistency of the resulting classification with the NCBI taxonomy including currently defined candidate phyla (divisions), and present recommendations for consolidation of 34 redundantly named groups and exclusion of one on the basis that its sole representative is chimeric.


Materials and methods

16S data compilation and de novo tree inference

We obtained 16S sequences from the Greengenes database, which extracts these sequences from public databases using quality filters as described previously [246].

We only used sequences that had <1% non-ACGT characters. The sequences were checked for chimeras using UCHIME [254] and ChimeraSlayer [255]. We only removed sequences from named isolates if they were classified as chimeric by both tools; we removed other sequences if they were classified as chimeric by either tool or if they were unique to one study, meaning that no similar sequence (within 3% in a preliminary tree) was reported by another study. Quality filtered 16S sequences were aligned based on both primary sequence and secondary structure to archaeal and bacterial covariance models (ssu-align-0.1) using Infernal [256] with the sub option to avoid alignment errors near the ends. The models were built from structure-annotated training alignments derived from the Comparative RNA Website [257] as described in detail previously [256]. The resulting alignments were adjusted to fit the fixed 7682 character Greengenes alignment through identification of corresponding positions between the model training alignments and the Greengenes alignment. Hypervariable regions were filtered using a modified version of the Lane mask [258]. A tree of the remaining 408 135 filtered sequences (tree_16S_all_gg_2011_1) was built using FastTree v2.1.1, a fast and accurate approximately maximum-likelihood method, with the CAT approximation; branch lengths were rescaled using a gamma model [250]. Statistical support for taxon groupings in this tree was conservatively approximated using taxon jackknifing, in which a fraction (0.1%) of the sequences (rather than alignment positions) is excluded at random and the tree reconstructed. We used these support values to help guide selection of monophyletic interior nodes for group naming during manual curation.
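The sampling half of taxon jackknifing can be illustrated with a minimal sketch. The function name, the seed argument and the rounding rule below are assumptions made for illustration; tree reconstruction for each replicate (done with FastTree in this work) is not shown.

```python
import random


def taxon_jackknife(seq_ids, fraction=0.001, seed=None):
    """Return the sequence IDs kept after excluding a random fraction of them.

    Hypothetical helper: sequences (taxa), not alignment columns, are dropped.
    At least one sequence is always excluded (an assumption of this sketch).
    """
    rng = random.Random(seed)
    seq_ids = list(seq_ids)
    n_drop = max(1, round(len(seq_ids) * fraction))
    dropped = set(rng.sample(seq_ids, n_drop))
    # Preserve the original order of the retained sequences.
    return [s for s in seq_ids if s not in dropped]
```

Each replicate would then be passed to the tree-building step, and a grouping would be counted as supported in that replicate if its retained members still form a single clade.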

For evaluation of NCBI-defined candidate phyla, we added 765 mostly partial-length sequences, which failed the Greengenes filtering procedure but were required for the evaluation, to the alignment using PyNAST ([259]; based on the 29 November 2010 Greengenes OTU templates) and generated a second FastTree tree (tree_16S_candiv_gg_2011_1) using the parameters described above.

Transferring taxonomies to trees (tax2tree)

Having constructed de novo trees, the next key step was to link the internal tree nodes to known, named taxa. We used the NCBI taxonomy [252] as the primary taxonomic source to annotate the trees, supplemented by the previous iteration of the Greengenes taxonomy [251] and cyanoDB [253]. This taxonomic annotation used a new algorithm called tax2tree. Briefly, tax2tree consists of the following steps:

Input consists of a flat file containing the donor taxonomy and an unannotated (no group names) Newick format recipient phylogenetic tree with common sequence (tip) identifiers. The tax2tree donor taxonomy is in a very simple format (Supplementary Figure S7.1) comprising a unique ID followed by a taxonomy string with rank prefixes and was derived from the NCBI taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) using a custom Python script. The tax2tree algorithm first filters out non-informative taxonomic assignments from the donor taxonomy strings, such as those including the words ‘environmental’ or ‘unclassified’, as described previously [251]. After filtering, the remaining assignments at each taxonomic level from domain to species are added to each tip with empty placeholders at levels that are missing taxon names. The result of this phase is a tree in which some of the tips have taxonomic information at some or all ranks, imported directly from a donor taxonomy. In addition, each node in the phylogenetic tree is augmented with a tip start and stop index corresponding to a list that contains tip taxonomy information. This structure allows for rapid lookups of the taxon names present at all of the tips that descend from any internal node.
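The parsing and filtering of a donor taxonomy line can be sketched as follows. This is a minimal illustration and not the tax2tree source: the tab separator between ID and taxonomy string, the `; ` field separator, the single-letter prefixes (with `k__` for domain/kingdom) and the function name are all assumptions.

```python
# Rank prefixes from domain (kingdom) to species; assumed single-letter codes.
RANKS = ["k", "p", "c", "o", "f", "g", "s"]
# Words that mark an assignment as non-informative (from the text above).
UNINFORMATIVE = ("environmental", "unclassified")


def parse_taxonomy_line(line):
    """Split 'ID<TAB>k__A; p__B; ...' into (id, [name or None for each rank])."""
    seq_id, tax_string = line.rstrip("\n").split("\t", 1)
    names = dict.fromkeys(RANKS)  # every rank starts as an empty placeholder
    for field in tax_string.split(";"):
        field = field.strip()
        if len(field) < 3 or field[1:3] != "__":
            continue  # not in prefix__name form
        rank, name = field[0], field[3:].strip()
        if rank not in names or not name:
            continue  # unknown rank, or prefix with no name (e.g. 's__')
        if any(word in name.lower() for word in UNINFORMATIVE):
            continue  # filtered out as non-informative
        names[rank] = name
    return seq_id, [names[r] for r in RANKS]
```

Ranks with no usable name stay as `None`, which plays the role of the empty placeholder described above.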

1 Precision and recall values necessary for the F-measure calculation (see step 3) are calculated and stored on the internal nodes. This caching markedly improves performance on large trees.

2 For each taxonomic rank at each internal node, we determine whether it is safe to hold a name. A taxonomic rank is considered safe if (a) there exists a name at that taxonomic rank that is represented on ⩾50% of the tips that descend from the node and (b) the parent taxonomic rank (for example, phylum for class) is also safe. These names are decorated onto the tree at this point, resulting in a phylogenetic tree containing many duplicate names on internal nodes.

3 The F-measure (F = 2 × ((precision × recall)/(precision + recall))) [260] is then calculated for each internal node for each name at each taxonomic rank in order to determine the optimal internal node for a name. The F-measure is defined as the harmonic mean of precision and recall, and balances false positives and false negatives (precision is the fraction of informative tips with a given name under a given node out of the total count of informative tips under the node; recall is the fraction of informative tips with that name under the node out of all the informative tips in the entire tree that carry the same name). Node references and F-measure scores are then cached in a 2-dimensional Python dictionary external to the tree, keyed by both the rank level and the taxon name. After all nodes are scored, the 2-dimensional dictionary is iterated over and, for each unique name, the internal node with the highest F-measure score is retained. Each name will only be saved on an internal node once; all other internal nodes with that name will be stripped of the name. If a tie is encountered, the internal node with the fewest tips is kept. The result of this phase is that the phylogenetic tree contains many names on the internal nodes, with each name occurring at most a single time. Gaps in the taxon names decorated onto the tree are likely the result of polyphyletic groups.

4 Backfilling is used to fill taxonomic gaps in the unique taxon names left on the phylogenetic tree from the F-measure process. For this procedure, the input taxonomy is transformed into a tree and a Python dictionary is constructed that is keyed by the taxon name and valued by its corresponding node. A gap is defined as missing taxonomy rank name information in the phylogenetic tree between a named interior node and its nearest named ancestor (for example, having phylum and order names but without a name for the intervening class rank). For each internal node and nearest named ancestor pair in the phylogenetic tree in which a gap occurs, the taxon name of the node farthest from the root is identified in the input taxonomy. The input taxonomy is traversed until the nearest ancestor's taxon name is found. The names of the nodes traversed in the input taxonomy tree are then appended to the node farthest from the root of the phylogenetic tree. Following the backfilling procedure, it is possible for the phylogenetic tree to have duplicate taxon names.

5 A back-propagation procedure identifies redundant taxon names in the phylogenetic tree that can be collapsed into a single clade. Here, we test whether any internal node has nearest named descendants at a given rank (for example, phylum) that all share the same name. If so, the name can be removed from the descendants and propagated to the internal phylogenetic node being interrogated.
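The node selection of steps 1 to 3 can be sketched for a single rank as below. This is an illustration under stated assumptions rather than the tax2tree implementation: the tree is a nested tuple instead of a Newick file, only one rank (genus) is handled so condition (b) of step 2 is omitted, and the example topology, tip labels and function names are hypothetical.

```python
from collections import Counter


def index_tree(tree, tips, nodes):
    """Post-order walk recording tip order and a (start, stop) slice per internal node."""
    if isinstance(tree, str):
        tips.append(tree)
        return
    start = len(tips)
    for child in tree:
        index_tree(child, tips, nodes)
    nodes.append((start, len(tips)))


def place_names(tree, tip_names):
    """Return {name: tips under the internal node with the best F-measure}."""
    tips, nodes = [], []
    index_tree(tree, tips, nodes)
    names = [tip_names.get(t) for t in tips]          # None = uninformative tip
    totals = Counter(n for n in names if n is not None)
    best = {}                                         # name -> (F, n_tips, slice)
    for start, stop in nodes:
        under = Counter(n for n in names[start:stop] if n is not None)
        informative = sum(under.values())
        for name, count in under.items():
            precision = count / informative
            if precision < 0.5:                       # step 2(a): name not 'safe' here
                continue
            recall = count / totals[name]
            f = 2 * precision * recall / (precision + recall)
            size = stop - start
            cur = best.get(name)
            # Keep the highest F; on a tie keep the node with the fewest tips.
            if cur is None or f > cur[0] or (f == cur[0] and size < cur[1]):
                best[name] = (f, size, (start, stop))
    return {name: tips[s:e] for name, (_, _, (s, e)) in best.items()}


# Hypothetical example: genus 'Dorea' is polyphyletic (tips a, b, c and tip f).
example_tree = ((("a", "b", "c"), ("d", "e")), ("f", ("g", "h")))
example_genera = {"a": "Dorea", "b": "Dorea", "c": "Dorea",
                  "d": "Blautia", "e": "Blautia",
                  "f": "Dorea", "g": "Roseburia", "h": "Roseburia"}
placed = place_names(example_tree, example_genera)
```

In this example tip f carries the genus Dorea but falls outside the node chosen for that name, which is the kind of gap that backfilling (step 4) later fills.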

Secondary taxonomies were then applied manually to the annotated recipient tree, tree_16S_candiv_gg_2011_1, in ARB [261] using the group tool. For the Greengenes taxonomy, this was achieved by displaying the Greengenes taxonomy field at the tips of the tree and assigning group names missed in the automated taxonomy transfer (mostly higher level ranks associated with candidate phyla). For the cyanoDB taxonomy, manual assignments were based on type species (http://www.cyanodb.cz/valid_genera; 405 instances in tree_16S_candiv_gg_2011_1; Supplementary table S7.1). The manually supplemented taxonomic assignments were then exported from the curated tree as a flat file (Supplementary Figure S7.1) using functionality in the tax2tree pipeline and reapplied to tree_16S_all_gg_2011_1, ensuring manual updates were efficiently propagated to both trees. The tax2tree software is implemented in the Python programming language using the PyCogent toolkit [262], and is available under the open-source General Public license at http://sourceforge.net/projects/tax2tree/.

Taxonomy comparisons

The NCBI and Greengenes taxonomies were compared for each of the 408 135 sequences in tree_16S_all_gg_2011_1, making use of the explicit rank designations. The lowest classified rank for each sequence was determined and compared (Figure 10a) to estimate overall improvements in classification. Note that only contiguous classifications were used in this estimate, that is, all ranks leading to the lowest named rank also had to have names. Taxonomic similarities and differences for each sequence at each rank were also assessed by dividing sequences into five categories: (i) the two taxonomies had equal values (same name) at a given taxonomic rank; (ii) the two taxonomies had unequal values at a given rank; (iii) and (iv) one of the taxonomies lacked a value at a given rank; and (v) both taxonomies lacked a value at a given rank. This provided an indication of the type of changes that had occurred between the NCBI and Greengenes taxonomies (Figure 10b).
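The two per-sequence measurements described above can be written compactly. The sketch below assumes each taxonomy has already been parsed into a list of seven names (domain to species) with `None` for missing ranks; the function names and category labels are illustrative only.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_contiguous_rank(names):
    """Lowest rank reached without a gap, or None if even the domain is missing."""
    lowest = None
    for rank, name in zip(RANKS, names):
        if not name:
            break  # a gap ends the contiguous classification
        lowest = rank
    return lowest


def compare_at_rank(ncbi_name, gg_name):
    """Assign one rank of one sequence to one of the five categories."""
    if ncbi_name and gg_name:
        return "equal" if ncbi_name == gg_name else "unequal"
    if gg_name:
        return "greengenes_only"
    if ncbi_name:
        return "ncbi_only"
    return "both_missing"
```

For instance, a sequence with domain, phylum and order names but no class name counts as classified to phylum only, because the order name is not contiguous with the ranks above it.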

Results and discussion

The rapid accumulation of sequence data from across the tree of life is a boon for molecular taxonomy, but also presents a major barrier to sequence-based taxonomy curators as it is essentially impossible to manually curate trees comprising hundreds of thousands of sequences from scratch. We overcame this difficulty by developing an automated procedure based on F-measures [260] for transferring any

(donor) taxonomy (in a standard flat text format, Supplementary Figure S7.1) to any unannotated (recipient) tree (in Newick format) given common sequence identifiers. The F-measure is most often used to measure the classification performance (precision and recall) of information retrieval processes such as database searches [260]. This approach also has the potential to provide an assessment of the quality of fit between a taxonomy and a tree, which could be used to screen multiple taxonomies and/or trees before manual curation efforts.

Construction of the rank-explicit Greengenes taxonomy

Using quality-filtered sequences from Greengenes [246] aligned with the secondary structure-aware Infernal package [256], we constructed a phylogenetic tree using FastTree2 [250] containing 408 135 sequences (tree_16S_all_gg_2011_1). We inferred confidence estimates using taxon jackknife resampling (in which sequences, rather than positions in the alignment, are resampled) as it provides a conservative guide to group monophyly, which we found greatly assists manual group name curation between tree iterations (see below). We then applied NCBI classifications to this topology using the tax2tree algorithm (Figure 9 and methods), also taking advantage of the explicit rank designations provided by NCBI to include rank prefixes to group names (for example, p(hylum)__, c(lass)__, o(rder)__). Most sequences (69%; 280 488 of 408 135) had uninformative NCBI classifications, that is, no rank information below domain (kingdom; Figure 10a); of these, most were environmental clones designated as ‘unclassified Bacteria’. The informative classifications of the remaining 127 647 sequences were then applied to the tree. This ‘unamended’ approach alone resulted in an improved classification, to at least phylum-level, of nearly all of the taxonomically uninformative sequences (280 452 of 280 488) because most belong to known phyla but were simply deposited without classifications.


[Figure 9 graphic. Panels A) to H) show the worked example on a tree with tips A–L and correspond to parts (i) to (viii) of the caption.
Panel A, input taxonomy: tips A, C, D, E, F: f__Lachnospiraceae; g__Clostridium; s__. Tip B: Unclassified. Tips G, H: f__Lachnospiraceae; g__Dorea; s__. Tips I, J: f__Lachnospiraceae; g__Clostridium; s__Clostridium bolteae. Tips K, L: f__Lachnospiraceae; g__Clostridium; s__Clostridium hylemonae.
Panel B, name legend: f__Lachnospiraceae, g__Clostridium, g__Dorea, s__Clostridium bolteae, s__Clostridium hylemonae.
Panel H, resulting taxonomy: as in panel A, except that tip B is now f__Lachnospiraceae; g__Clostridium; s__.]

Figure 9: tax2tree workflow. (i) The inputs to tax2tree: a taxonomy file that matches known taxonomy strings to identifiers that are associated with tips of (that is, sequences within) a phylogenetic tree. To simplify the diagram, only the family, genus and species are used, although the full algorithm uses all phylogenetic ranks. (ii) The input taxonomy represented as a tree and a taxon name legend for the figure. (iii–v) Nodes chosen by the F-measure procedure at each rank: (iii) species, (iv) genus and (v) family. In this example, the genus Clostridium is polyphyletic, and the F-measure procedure picked the ‘best’ internal node for the name (uniting tips A–F). However, as unique names at a given rank can only be placed once on the tree, this leaves tips I–L without a genus name placed on an interior node. (vi) The backfilling procedure detects that tips I–L have an incomplete taxonomic path (species to family) and prepends the missing genus name (obtained from the input taxonomy) to the lower rank, because this step of the procedure examines only ancestors but not siblings. (vii) The common name promotion step identifies internal nodes in which all of the nearest named descendants share a common name. In this example, the node that is the lowest common ancestor for tips I–L has immediate descendants that all share the same genus name, Clostridium. This name can be safely promoted to the lowest common ancestor (interior node) uniting tips I–L. (viii) The resulting taxonomy. Note that the sequence identified as B was unclassified in the donor taxonomy but is now classified as f__Lachnospiraceae; g__Clostridium; s__.

Figure 10: NCBI vs. Greengenes. A comparison of the NCBI taxonomy to the updated Greengenes taxonomy for sequences in tree_16S_all_gg_2011_1. (a) Lowest taxonomic rank assigned to each sequence; (b) taxonomic differences between NCBI and Greengenes at each rank, showing the percentage of sequences classified to each of five possible categories (see inset legend; GG, Greengenes) highlighting cases where NCBI and Greengenes differ.

We then overlaid additional taxonomic information onto the NCBI-annotated tree by manual group name curation in ARB. This information consisted primarily of candidate phyla and other rank designations for environmental clusters imported from previous iterations of Greengenes (either assigned manually or by GRUNT [251]), and taxonomic information for the Cyanobacteria obtained from cyanoDB [253]. This resulted in more informative classifications for 75% of sequences, by at least one rank and up to six ranks. These increases in classification depth are shown graphically by rank in Figure 10a.


Changes in sequence classifications between NCBI and Greengenes at each rank are summarized in Figure 10b. Most changes were from uninformative (domain/kingdom name only) in NCBI to informative in Greengenes, again reflecting the classification of the large fraction of unclassified environmental sequences in NCBI. The percentage of changes to informative NCBI classifications was relatively low (<7% for all ranks), indicating the degree of congruence between NCBI and Greengenes classifications, achieved in part by accommodating polyphyletic groups (see below). Of these types of changes, many occurred at the higher ranks, particularly in the candidate phyla where Greengenes is manually curated most intensively (see below). A similar comparison with SILVA or RDP was not possible because of a lack of explicit ranks in these taxonomies. However, by comparing the group name immediately following the domain (kingdom) name (either Bacteria or Archaea) in the SILVA taxonomy, we estimated that only 28% of the 408 135 sequences lacked phylum-level classifications in SILVA, as opposed to 68% in NCBI.

Further, the updated Greengenes taxonomy performed well in a test of reference taxonomies using a naïve Bayesian classifier [263]. In that study, we found that retraining the RDP classifier [43] using taxa from the new Greengenes taxonomy resulted in increased classification resolution relative to SILVA or RDP for a range of environments (human body habitats, snake and mouse gut and soils).


The value of accommodating polyphyletic groups in a 16S rRNA-based taxonomy

In principle, every taxon should correspond to a single monophyletic group in the 16S rRNA-based taxonomy, but practical considerations make relaxing this constraint very useful. In our approach, the back-filling step in tax2tree (see methods) allows multiple groups to be given the same rank name. This feature is important for taxonomic groups that are well-established in the literature but polyphyletic in the reference 16S rRNA tree. A prominent example is the class Deltaproteobacteria, which rarely forms a monophyletic group in large 16S rRNA topologies and comprises five groups in the current tree_16S_all_gg_2011_1. This result may indicate that the Deltaproteobacteria do not form an evolutionarily coherent grouping and will need to be subdivided and reclassified. Alternatively, the Deltaproteobacteria may be a monophyletic group not resolved in 16S rRNA trees due to tree inference artifacts, chimeric sequences and/or to limits in the phylogenetic resolution of trees constructed from the 16S rRNA molecule alone.

This issue can best be addressed using ‘whole genome’ tree approaches that have greater phylogenetic resolution than single-gene topologies. Trees based on a concatenation of 31 conserved near ubiquitous single-copy gene families indicate that the small subset of Deltaproteobacteria with genome sequences is monophyletic [34, 264]. A second example, also based on concatenated conserved marker genes, indicates that the first genome sequence representative of the order Halanaerobiales (Halothermothrix orenii) is a member of the Firmicutes phylum [265], which is its (contested) classification based on 16S rRNA trees [266]. Indeed, the Halanaerobiales is separate from the Firmicutes in the current Greengenes topology and is only classified as such because of the back-filling procedure.

Similarly, a number of phylum-level associations have been suggested based on concatenated gene topologies including a relationship between the class

Deltaproteobacteria and phylum Acidobacteria [264]. We predict that at least a subset of the currently defined candidate phyla will coalesce with other phyla once they are adequately represented by genome sequences and whole-genome trees can be constructed. Therefore, current estimates of the number of 16S-based candidate phyla (>50) should only be used as an approximation and may well drop as candidatus genome sequences accrue. However, regardless of absolute number of phyla, there is a strong need for consistent delineation of taxonomic groups between public databases, particularly candidate phyla. Thus, by retaining polyphyletic groups as sets of monophyletic taxa with the same name, we can accommodate the uncertainty in our present knowledge about both the tree and taxonomy and easily propagate whole-genome-based classification improvements in subsequent iterations of de novo 16S rRNA-based trees.
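The back-filling behaviour that lets several separate clades carry the same name can be sketched as a walk up the donor taxonomy. This is a minimal illustration, not the tax2tree code: the donor taxonomy is assumed to be available as a child-to-parent dictionary, and the function name and error handling are hypothetical. The example uses the Clostridium case from Figure 9.

```python
def backfill(node_name, ancestor_name, parent_of):
    """Return the donor-taxonomy names missing between an ancestor and a node.

    node_name:     name on the interior node farthest from the root
    ancestor_name: name on its nearest named ancestor in the phylogenetic tree
    parent_of:     child name -> parent name in the donor taxonomy (assumed input)
    Names are returned highest rank first, ready to be prepended to the node.
    """
    missing = []
    current = parent_of.get(node_name)
    while current is not None and current != ancestor_name:
        missing.append(current)
        current = parent_of.get(current)
    if current is None:
        raise ValueError("ancestor name not found above node in donor taxonomy")
    return list(reversed(missing))


# Donor taxonomy fragment from the Figure 9 example.
donor_parent = {
    "s__Clostridium bolteae": "g__Clostridium",
    "s__Clostridium hylemonae": "g__Clostridium",
    "g__Clostridium": "f__Lachnospiraceae",
}
filled = backfill("s__Clostridium bolteae", "f__Lachnospiraceae", donor_parent)
```

Because the walk looks only at a node's ancestors, two sibling clades with the same gap each receive their own copy of the missing name, which is how a polyphyletic group ends up as several monophyletic taxa sharing one name.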

Reconciliation of NCBI and Greengenes-defined candidate phyla

At the time of writing, the NCBI taxonomy lists 71 candidate phyla (divisions) of which 30 are represented only by partial (<1200 nt) sequences. Therefore, in order to address the classification of these groups we amended tree_16S_all_gg_2011_1 with 765 mostly partial length sequences from GenBank and generated a new de novo tree using FastTree, tree_16S_candiv_gg_2011_1. We found some NCBI candidate phyla to be polyphyletic in tree_16S_candiv_gg_2011_1 because of a small number of submitter misclassifications or chimeric artifacts. In these instances, we reconciled NCBI and Greengenes designations using the majority classification for a given NCBI group. On the basis of tree_16S_candiv_gg_2011_1, we resolved the 71 candidate phyla into 45 monophyletic groups and one chimeric artifact (Table 2).

Many proposed phyla appear to belong to well-established lineages including the Proteobacteria, Firmicutes, Bacteroidetes, Chloroflexi and Spirochaetes. In cases where two or more NCBI candidate phyla were combined and did not cluster with more established groups, we gave priority to the oldest and/or the largest group. For example, we reclassified candidate phylum kpj58rc [267] as OP3 [268] because of the priority of OP3 in the literature and the larger number of representative OP3 sequences in the public databases. We also compared our classifications to SILVA and RDP and in many instances saw consistencies. For example, Greengenes and SILVA both classify candidate phylum CAB-I as Cyanobacteria, and Greengenes and RDP both classify KSA1 in the Bacteroidetes. Conversely, in some instances we saw disagreements, such as candidate phylum GN02 [13] being classified as BD1-5, and WS5 [269] as WCHB1-60, by SILVA (Table 2). This points to the need to consolidate classifications and also to give priority to published group names where possible.


Table 2: Greengenes classifications. Greengenes classifications of NCBI-defined candidate phyla (divisions) based on tree_16S_candiv_gg_2011_1. SILVA_106 and RDP classifications are included for reference


Final comments and prospectus

The new NCBI-reconciled Greengenes taxonomy rescues over 200 000 environmental sequences from unclassified oblivion. Moreover, the tax2tree pipeline will assist in reconciling information among the various 16S rRNA resources (Greengenes, SILVA, RDP and EzTaxon) with phylogenetic trees built using different methods, and will, we hope, make it easier for users to reconcile taxonomic classifications of large data sets obtained using different taxonomic schemes. This is especially important because which taxonomy is used can have a larger effect on the results than which assignment method is used [37]. The new Greengenes taxonomy, along with all intermediate data products including the tree, can be downloaded from the Greengenes web site at http://greengenes.lbl.gov/.

This work, by automating the process of improving the tree and allowing import of taxonomic knowledge from elsewhere, provides the first step toward an automated pipeline that will immensely improve our ability to link organisms to environment and to understand the evolutionary change associated with phenotypic changes such as adaptation to a new host, switching to a new habitat or adapting to use a new substrate. By providing the foundation for organizing microbial knowledge, these expanded taxonomies will greatly expand our ability to understand the microbes that pervade all aspects of life on the Earth.


Concluding remarks

Thanks to the development of tax2tree, we were able to recover at least phylum-level annotations for virtually all of the 16S data deposited in GenBank at the time of publication. That was remarkable, as out of the roughly 408,000 sequences included, about 280,000 were classified only to their domain (e.g., Bacteria or Archaea). The development of the tax2tree algorithm also opened the door for its use as a taxonomic classifier of unknown sequences (in niche cases). Its use as a classifier is described in Harris et al. (ISME 2013), included in chapter 9.


Chapter 8 Collaborative cloud-enabled tools allow rapid, reproducible biological insights

From: Ragan-Kelley B*, Walters W*, McDonald D*, Riley J, Granger BE, Knight R,

Perez F and Caporaso JG. (2012) “Collaborative cloud-enabled tools allow rapid, reproducible biological insights.” ISME Journal 7:461–464 Oct. 25. PMID:

23096404. *These authors contributed equally

This published chapter introduced the Knight Lab to the use of IPython Notebooks for publishing reproducible and executable analyses. The efforts laid here formed the basis for how analyses are distributed in the American Gut Project, offering a means for anyone to be able to reproduce the results and figures on their own systems with the actual code used for the analysis.

As a co-first author, I participated in the discussion of the original concept, implemented the methods that derived the input data, generated phylogenies, and compared the resulting trees. Two distance metrics were used for the tree comparisons: the first was the Robinson-Foulds distance, and the second was derived from the correlation in tip-to-tip branch length distances between the trees.
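To make the first metric concrete, the sketch below computes a clade-based version of the Robinson-Foulds distance on rooted trees written as nested tuples. It is an illustration only: the classical definition is stated in terms of bipartitions of unrooted trees, the code used for the published comparison is not reproduced here, and the function names are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a nested-tuple tree as frozensets of tip names."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    all_tips = walk(tree)
    found.discard(all_tips)  # the root clade is shared by every tree on these tips
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))
```

Two identical topologies give a distance of zero, and each grouping found in only one of the trees adds one to the total, so the measure responds to topology alone and ignores branch lengths, which is why a second, branch-length-aware metric was also used.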

Article

Microbial ecologists today face critical computational barriers. The rapid increase in the quantity of data acquired by modern sequencing instruments makes analysis by hand infeasible, and even software developed just a few years ago cannot scale to modern data sets. As a result, making advanced, scalable algorithms and large-scale computational resources available to end-users is necessary to advancing our understanding of microbial ecology.

One challenge many face when developing software for the first time is the gap between writing a script that can run on a single processor and writing a script that will scale to a larger cluster. A second is that knowledge required for a project is often distributed among many individuals, including software developers, subject matter experts and experts in the use of specific computer systems. Although computation can be a language that bridges many disciplines, additional ‘glue' is often needed to make the requirements mutually comprehensible to diverse members of a project team.

One approach to this ‘glue’ is represented by IPython [270], which provides tools for interactive and parallel computing that support online collaboration. The IPython notebook allows users to combine code, text (including mathematical expressions), figures, and so on, into a single document. These documents are accessed through a web browser and can be simultaneously edited by multiple collaborators. The resulting environment is analogous to Google Docs, but aimed at scientific computation. Beyond document writing, these notebooks can execute arbitrary code in the Python programming language, providing a framework where documentation, software and results are combined in one place, and code can be edited, annotated and re-run dynamically to immediately show how the results change. IPython also provides tools to run computations in parallel, with a high-level interface that eases the transition from a classic serial script to a parallel environment.
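The step from a serial script to a parallel one is often just a change of which `map` is called. The sketch below shows that pattern using Python's standard-library `concurrent.futures` as a stand-in; it is not IPython's own parallel interface, and the function and the toy sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(seq):
    """Fraction of G and C characters in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


sequences = ["ACGT", "GGCC", "ATAT"]

# Serial version: the built-in map applies the function one item at a time.
serial = list(map(gc_fraction, sequences))

# Parallel version: the same call shape, dispatched to a pool of workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(gc_fraction, sequences))
```

Keeping the per-item work in a plain function with no shared state is the design choice that makes the swap possible, whichever parallel back end eventually runs it.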

The power of the IPython approach is especially apparent when it is coupled to cloud computing, which is rapidly increasing in popularity in bioinformatics [271].

Services such as Amazon's Elastic Compute Cloud (EC2) provide on-demand access to large-scale computational resources, allowing anyone to trade (small amounts of) money for (large amounts of) compute time. Although Amazon provides a web-based interface for management, in this project, we used the StarCluster tool

(http://mit.edu/star/cluster) to automate and simplify the process of building, configuring and managing clusters of virtual machines on Amazon's EC2. Using

StarCluster, we can configure a virtual cluster that includes domain-specific libraries (bioinformatics tools in our case), as well as cluster management tools and shared file system configuration, for ‘one-click parallel computing'. Once the

StarCluster configuration has been defined, we can start a virtual cluster in the cloud with a single command. StarCluster will ensure that the cluster nodes start up together, and correctly configured, drastically reducing the time and complexity of setting up a parallel cloud cluster.


These principles were exemplified at the recent NIH ‘Cloud Computing for the Microbiome’ workshop in Boulder, CO, USA, which brought together participants with broad expertise, including developers of IPython, Quantitative Insights Into Microbial Ecology (QIIME) [130] and PrimerProspector [272], contributors to the Greengenes [246] resource, and the author of StarCluster participating remotely from MIT. The IPython and StarCluster authors have backgrounds in physics, whereas the QIIME, PrimerProspector and Greengenes developers come from microbial ecology and bioinformatics; neither group had used the other's tools before this meeting. Initially, a demonstration of IPython had been planned for the workshop based on distributed matrix calculations; however, given the audience, demonstrating how IPython could help tackle a real biological problem using the cloud seemed far more desirable.

After considering several potential problems, we settled on one question of compelling interest and generalizability: as current sequencing technologies generally limit us to sequencing only certain regions of the 16S ribosomal RNA, what region is optimal for recapitulating the 16S phylogeny that would be obtained from sequencing the full gene? Although previous studies have examined the role of the region and read length for taxonomic assignment [43] and community clustering [273], the region that best recaptures the phylogenetic tree reconstructed from full-length sequences has not been recently examined using the full Greengenes alignment. Intuition suggests that a longer sequence would automatically yield a better tree because more characters would be available, but this intuition had been proven wrong in other areas of community analysis. What would happen when short fragments were isolated from the alignment and used in the popular phylogeny package FastTree [250]?
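The idea of isolating short fragments from an alignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `slice_alignment`, the toy sequences and the column positions are hypothetical, and the actual analysis located primer positions in the Greengenes alignment with PrimerProspector and used PyCogent utilities rather than this code.

```python
def slice_alignment(alignment, start, end, max_length=None):
    """Return alignment columns [start, end) for every sequence.

    alignment  -- dict mapping sequence id to an aligned (gapped) string
    max_length -- optional cap on the number of non-gap characters kept,
                  which mimics a read of that length starting at `start`
    """
    sliced = {}
    for seq_id, aligned_seq in alignment.items():
        region = aligned_seq[start:end]
        if max_length is not None:
            kept = []
            bases = 0
            for char in region:
                if char not in "-.":
                    if bases == max_length:
                        break
                    bases += 1
                kept.append(char)
            # Pad with gaps so every sequence keeps the same aligned width.
            region = "".join(kept).ljust(len(region), "-")
        sliced[seq_id] = region
    return sliced


# Hypothetical toy alignment; real input would be the Greengenes alignment.
toy_alignment = {
    "seq1": "AC-GTTA-CG",
    "seq2": "ACCGT-AGCG",
}
amplicon = slice_alignment(toy_alignment, 2, 8)
short_read = slice_alignment(toy_alignment, 2, 8, max_length=3)
```

Each sliced alignment would then be handed to a tree builder such as FastTree, so that trees from different regions and read lengths can be compared against the tree built from the full-length alignment.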

Several key technologies were leveraged to answer this question: (a) the IPython notebook provided a rapid, collaborative environment for execution of code; (b) StarCluster provided an easy way to set up pre-configured clusters with dozens of central processing units on EC2; (c) the Python Comparative Genomics Toolkit (PyCogent) [262] provided a large number of well-tested biological utility functions in Python; (d) Greengenes provided the source alignments and trees; (e) PrimerProspector provided easy ways to locate primer positions in the Greengenes alignment; and (f) QIIME provided visualization routines that could be deployed in a web browser. Ultimately, we succeeded in producing a working demonstration in a total development time of roughly 7 h. The IPython Notebook used for this analysis is ‘NIHCloudDemo (Complete)’ (see ‘Data availability’ section).

Having achieved our educational goal of producing a practical demonstration of cloud computing, we now ask whether our example computation produced results of scientific value. We sliced the alignment of the full-length 16S ribosomal RNA to simulate sequencing of amplicons at different read lengths using a collection of popular primers (Figure 11a). A phylogenetic tree was subsequently constructed from each resulting alignment, and distances were computed between all trees and the tree calculated from the full-length alignment as Pearson correlations in tip-to-tip distances across trees (used in Figure 11; distances computed as 1−r) and Robinson–Foulds distances. Principal coordinates analysis was applied to the Pearson distance matrix to visualize the results, and the Mantel test was applied to confirm that similar results were achieved using the two distance measures.
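The 1−r distance between two trees can be illustrated with a short, self-contained sketch. It assumes that the tip-to-tip (patristic) distances have already been extracted from each tree into dictionaries keyed by sorted tip pairs; the names `pearson` and `tip_to_tip_distance` and the toy distances are hypothetical and are not taken from the notebooks.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tip_to_tip_distance(dists_a, dists_b, tips):
    """Return 1 - r over the tip-to-tip distances of two trees.

    dists_a, dists_b -- dicts mapping a sorted (tip, tip) pair to the
                        distance between those tips in each tree
    tips             -- the tips shared by both trees
    """
    pairs = list(combinations(sorted(tips), 2))
    x = [dists_a[pair] for pair in pairs]
    y = [dists_b[pair] for pair in pairs]
    return 1.0 - pearson(x, y)


tips = ["A", "B", "C", "D"]
# Hypothetical distances standing in for the full-length tree.
full_tree = {("A", "B"): 0.1, ("A", "C"): 0.5, ("A", "D"): 0.6,
             ("B", "C"): 0.5, ("B", "D"): 0.6, ("C", "D"): 0.2}
# Same topology with every branch doubled: correlation is perfect.
scaled_tree = {pair: 2 * d for pair, d in full_tree.items()}
# A tree that groups the tips differently.
other_tree = {("A", "B"): 0.6, ("A", "C"): 0.5, ("A", "D"): 0.1,
              ("B", "C"): 0.2, ("B", "D"): 0.5, ("C", "D"): 0.6}

same = tip_to_tip_distance(full_tree, scaled_tree, tips)
different = tip_to_tip_distance(full_tree, other_tree, tips)
```

Because the measure is correlation-based, it is insensitive to a uniform rescaling of branch lengths, which is useful when trees built from short and long fragments differ in overall scale.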

Coloring the results by the length of the read (Figure 11b), we see some association with the length of read for the V2 region, but essentially no association for other regions. Coloring the results instead by the location of the start point within the 16S ribosomal RNA sequence (Figure 11c), we see that the location within the sequence matters immensely. Thus, we can conclude that choosing the region of the 16S ribosomal RNA wisely is of primary importance for reconstructing a useful phylogeny, such as that required for phylogenetically informed community distance metrics such as UniFrac [217], whereas obtaining longer reads should be treated as a secondary concern. We further illustrate this in Supplementary Figure S8.1, where we performed principal coordinates analysis only on distances between trees generated from the V3 to V4 regions and the full-length sequences (see the ‘V3 and V4 Regions Only’ notebook). This analysis shows that when working with regions of the 16S that best recapitulate phylogeny, longer reads yield trees that are more similar to the full-length trees than shorter reads (Supplementary Figure S8.1a).

Finally, in the ‘Pearson v Robinson–Foulds Distances’ notebook, we compare Pearson distances to Robinson–Foulds distances and show that these distances are significantly correlated (Mantel test: r = 0.77, P < 0.001), as are the principal coordinates analysis plots generated from each distance matrix (Procrustes test: M² = 0.67, P < 0.001). The analyses presented here are easily generalizable: the user can substitute in any input alignment (for example, fungal internal transcribed spacer). The ‘Variable Region Position Boundaries’ notebook describes this process (see ‘Data availability’ section).
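For readers unfamiliar with the Mantel test, the following is a minimal permutation-based sketch. It is not the implementation used in the notebooks; the function name `mantel`, the two-sided test, the number of permutations and the toy matrices are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def _pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel(dm_a, dm_b, permutations=999, seed=0):
    """Two-sided Mantel test between two symmetric distance matrices.

    Rows and columns of dm_b are permuted together; the p-value is the
    fraction of permutations whose |r| is at least the observed |r|.
    """
    n = len(dm_a)
    pairs = list(combinations(range(n), 2))
    x = [dm_a[i][j] for i, j in pairs]
    observed = _pearson(x, [dm_b[i][j] for i, j in pairs])

    rng = random.Random(seed)
    order = list(range(n))
    at_least_as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [dm_b[order[i]][order[j]] for i, j in pairs]
        if abs(_pearson(x, permuted)) >= abs(observed):
            at_least_as_extreme += 1
    p_value = (at_least_as_extreme + 1) / (permutations + 1)
    return observed, p_value


# Hypothetical 4 x 4 distance matrices; dm_b is dm_a scaled by two.
dm_a = [[0, 1, 4, 5], [1, 0, 3, 4], [4, 3, 0, 2], [5, 4, 2, 0]]
dm_b = [[0, 2, 8, 10], [2, 0, 6, 8], [8, 6, 0, 4], [10, 8, 4, 0]]
r, p = mantel(dm_a, dm_b)
```

With only four objects there are just 24 distinct permutations, so the p-value from this toy input cannot be small; matrices of the size used in the analysis give the test far more resolution.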

Figure 11: Included data and results for the IPython paper. (a) Regions of the 16S ribosomal RNA included in this analysis. Start and end positions refer to positions in the Greengenes alignment and v regions indicate the variable regions included in each simulated amplicon. Sliced amplicons that would overlap entirely with other sliced amplicons are not included. Only full-length reads were used in analysis of the V9 region as the full-length amplicon is shorter than 150 bases. (b) Principal coordinates analysis of Pearson correlation coefficients between tip-to-tip distances in pairs of phylogenetic trees constructed from differentially sliced alignments. Points are colored by amplicon length. Points representing trees generated from full-length sequences are circled in white to indicate their position when obscured by other points. (c) Principal coordinates analysis of Pearson correlation coefficients between tip-to-tip distances in pairs of phylogenetic trees constructed from differentially sliced alignments. Points are colored by the first variable region encountered in the differentially sliced alignments. Points representing trees generated from full-length sequences are circled in white to indicate their position when obscured by other points.


We emphasize how this effort generated two outcomes that facilitate validation and replication of our results: both the IPython notebooks we developed and the Amazon Machine Image (AMI) that contains all the necessary biological libraries and IPython/StarCluster support are publicly available (see ‘Data availability’ section). This allows anyone with an Amazon account to repeat our analysis or modify it to address related questions. The cost of the analysis depends on the size of the data set. Using an input alignment with 636 sequences (that is, Greengenes clustered into 82% operational taxonomic units), the cost is $7.40 and the runtime is 5 min on four m2.4xlarge instances (the majority of the cost results from having to pay for a full hour of instance time). The complete analysis used a variety of input alignments (Greengenes operational taxonomic units ranging from 76% to 99%, roughly in steps of 3%, with between 121 and 84,413 sequences, respectively) and cost approximately $180 with a runtime of 25 h on four eight-core/68 GB-RAM instances (that is, the Amazon Web Services m2.4xlarge instance type). Had this analysis not been run in parallel, the results would have required over a month to compute. The ‘Timing’ notebook contains additional details (see ‘Data availability’ section).
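The figures quoted above can be cross-checked with some back-of-the-envelope arithmetic. The sketch below assumes that instances are billed per started hour (as the text implies), that the quoted costs consist only of instance time, and that the workload would scale perfectly linearly across all cores; none of these assumptions is verified against the ‘Timing’ notebook.

```python
import math

INSTANCES = 4
CORES_PER_INSTANCE = 8
RUNTIME_HOURS = 25
TOTAL_COST_USD = 180.0

# Complete analysis: billed instance-hours and the implied hourly rate.
instance_hours = INSTANCES * math.ceil(RUNTIME_HOURS)   # 4 * 25 = 100
implied_rate = TOTAL_COST_USD / instance_hours          # about 1.80 USD

# Serial estimate under the (optimistic) perfect-scaling assumption.
core_hours = INSTANCES * CORES_PER_INSTANCE * RUNTIME_HOURS  # 800
serial_days = core_hours / 24                                # about 33 days

# Small run: 5 minutes of work, but each instance is billed a full hour.
small_run_billed_hours = INSTANCES * math.ceil(5 / 60)  # 4 * 1 = 4
small_run_rate = 7.40 / small_run_billed_hours          # about 1.85 USD
```

Under these assumptions both runs imply a rate of roughly $1.80 to $1.85 per instance-hour, and 800 core-hours corresponds to about 33 days on a single core, which is consistent with the statement that a serial run would have taken over a month.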

In conclusion, we have shown how a team of researchers with radically different backgrounds can leverage cloud resources and open source tools to achieve a new and scientifically interesting result relevant to an important question in microbial ecology, in record time and all the while producing easily reproducible outcomes.


Central to this effort was the use of cloud resources not only to command and deploy a large amount of compute power, but also as an integral part of the development process itself: the team edited the IPython notebooks for this study directly on the cloud servers. This enabled multiple authors to rapidly evolve the initial draft, with each person focusing on a different aspect of the overall computation. As the shared environment provided by the IPython notebook includes code and execution results, the team could rapidly reach a mutual understanding using the shared language of computation and, through rapid, iterative development and visualization processes, achieve the desired result in the same environment meant to perform the final production runs. Cloud-enabled tools thus allow broadly applicable solutions to interesting scientific problems to be rapidly formulated, communicated and reproduced. Microbial ecologists are poised to take advantage of these advances in scientific computing by using tools like the QIIME/StarCluster/IPython pipeline described here, other existing tools such as Galaxy/CloudMan or CloudBioLinux [274], or many new tools that will come online in the coming months and years.

Concluding remarks

The ability to deploy an HPC environment at the whim of a credit card, combined with IPython Notebooks, helps to bring about the democratization of computation within the sciences. Compute clusters within EC2 do not replace the need for specialized instruments, but they do make large-scale computation for many of the problems being tackled within the life sciences available to anyone regardless of institutional support and without the need to have a system administrator (often known as a graduate student) on staff in a lab.


Chapter 9 Enabling scientific insight and analysis contributions

In this chapter, I cover a selection of other published works that I contributed to, which include a method to enable insight, benchmarking of sequence clustering strategies, as well as a few contributions to analyses. For the most part, these were established projects in which my colleagues, advisor or collaborators graciously allowed me to have a role. I am incredibly thankful for these opportunities to participate, and for the efforts of those in the projects who took on the painstaking work of sample collection that formed the basis for these studies.

As a result of my involvement in Greengenes (discussed in chapter 7), I was in an excellent position to contribute to the Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt) project [195], in which the initial goal was the prediction of functional metagenomic profiles from 16S data. Within PICRUSt, I compiled the necessary augmented version of Greengenes with the genomes and annotations available at the time, making it possible to perform the prediction from 16S data alone. In addition, I helped develop the software package, which necessitated the expansion of BIOM (see chapter 6) to better represent collapsed functional pathways. PICRUSt is an open source, collaborative project that includes contributions from geographically distributed developers. On the development side, in order to maintain a robust and flexible codebase, we leveraged unit testing and encouraged test-driven development. PICRUSt lags behind on broader and more rigorous practices that I’ve pushed for, such as regression testing, code coverage, coding sprints, continuous integration, and aggressive documentation and stylistic standards, but is slated to be revamped in the future.

These development practices are highlighted in particular in scikit-bio, http://scikit-bio.org. The adoption of these coding practices has helped to ensure a high-quality code base that we’re confident in, as well as providing a framework that actually helps guide new developers who get involved.

The second work was a benchmarking of different strategies for clustering sequence data [132]. One of the goals was to determine a scalable solution, without sacrificing quality, for large datasets such as the Earth Microbiome Project (EMP). In this project, I performed the closed-reference OTU picking steps and analyzed the resulting data. Due to the size of the dataset (the EMP is the largest 16S dataset that I’m aware of, spanning 15,000 samples at the time of the manuscript), it broke BIOM. To compensate, I coordinated the development necessary to address the scalability issues of BIOM at the time. In brief, this necessitated revisiting the backend in-memory representation, switching to compressed sparse row and column structures, and transitioning the on-disk representation from JSON to HDF5.
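To make the compressed sparse row (CSR) idea concrete, the sketch below stores a small, mostly zero OTU table as three flat lists. It illustrates the general layout only and is not the BIOM implementation; the helper names `to_csr` and `csr_row` and the toy counts are hypothetical, and production code would normally rely on a sparse-matrix library rather than hand-rolled lists.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR (data, indices, indptr).

    data    -- the non-zero values, row by row
    indices -- the column index of each value in `data`
    indptr  -- indptr[i]:indptr[i + 1] spans row i within `data`
    """
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr


def csr_row(data, indices, indptr, row, n_cols):
    """Rebuild one dense row from the CSR lists."""
    dense = [0] * n_cols
    for k in range(indptr[row], indptr[row + 1]):
        dense[indices[k]] = data[k]
    return dense


# Hypothetical table: rows are OTUs, columns are samples.
otu_table = [
    [0, 0, 5, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 7],
]
data, indices, indptr = to_csr(otu_table)
last_otu = csr_row(data, indices, indptr, 3, n_cols=4)
```

Only the four non-zero counts are stored, and any row (here, an OTU) can be recovered without scanning the rest of the table. The compressed sparse column layout is the same idea with the roles of rows and columns swapped, which makes per-sample access cheap instead.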

The first analysis to mention is the Human Microbiome Project (HMP) [16]. In the HMP, samples were collected in the United States from 242 healthy individuals at up to 18 different body sites. The overarching goal of the HMP was the establishment of standard practices for microbiome research and a characterization of the structure, function and diversity of the healthy microbiome. The challenges encountered in this project, and the limited scope of its population, helped to form the impetus for the American Gut Project (see chapter 5). For the HMP, I organized the denoising process of the 16S data, which is a computationally demanding task, particularly for what was at the time a large volume of data generated from over 6,000 samples. Denoising is a process by which flowgrams (fluorescence intensities) are aligned and clustered for the purpose of reducing sequencing errors and other sequencing artifacts (such as homopolymer run errors) that can result from pyrosequencing on the 454 platform [275]. It is an important quality control step (for some questions) when processing data from 454 instruments. How the data are combined can impact the quality of the denoising as well as runtime, and I worked on how to efficiently group samples together so as to make it feasible to denoise the dataset.

The second analysis is the characterization of new-onset Crohn’s disease patients [123]. In this study, the largest pediatric Crohn’s disease cohort to date was compiled, and microbiome samples were collected from multiple locations along the gastrointestinal tract. One of the incredible observations made in the study was that the use of antibiotics appears to have an amplifying effect on the microbial dysbiosis seen in Crohn’s disease patients, where the use of antibiotics increases pro-inflammatory organisms while decreasing anti-inflammatory ones. In this project, I performed a PICRUSt analysis in order to assess whether there were significant differences in the functional profiles between cohorts and against healthy individuals. The motivation was that previous metagenomic studies have observed specific functional differences in the communities of healthy individuals and Crohn’s patients, and thus we wanted to see if we could recover these patterns using only 16S data. The results of this analysis are in supplemental table D of the manuscript and largely correspond to previously observed differences in functional potential from metagenomic data in Crohn’s patients.

The final analysis manuscript is a characterization of what is believed to be the most diverse place on the planet, the Guerrero Negro (GN) hypersaline microbial mat [12]. This was an additional paper in a series of works from the Pace lab focused on the Guerrero Negro mat, and included samples collected at multiple depths using both Sanger and 454 sequencing. At the time, I was fortunate to be sitting next to Dr. Greg Caporaso, then a postdoc in the lab, who was working on this study. We discussed the taxonomic assignment of the reads from the GN, and since the study had a suite of near full-length sequences available, I suggested that we apply the method developed for Greengenes, tax2tree (from chapter 7), in order to classify these environmental sequences. This approach worked surprisingly well (though it is limited in practice as it depends on having near full-length sequences), and the subsequent taxonomy assignments are shown in figure 1 of the manuscript. The added benefit of using tax2tree, beyond taxonomic classification, was the ability to determine the presence of candidate taxa in the dataset as a result of the de novo phylogeny that is required for tax2tree, leading to a series of observations that would not have been possible using traditional taxonomic assignment methods. These data are shown in Table 1 of the manuscript.

Concluding remarks

As a graduate student, I’ve had the opportunity to be involved in a few incredible projects, and to overlap with truly amazing individuals and pioneers in the study of microbial ecology. These experiences have altered my perspective on what it means to be human, my perspective on disease, and my sense of the sheer amount that is unknown in the natural world. The tools that I’ve helped develop, and led development on, have played a useful role in expanding scientific knowledge.


Chapter 11 Conclusions

The exploration of microbial communities, and their relationship to human health, has opened new doors into our understanding of disease, changing our perspective on what it means to be human. In relatively recent times, thanks to improvements in molecular techniques and high-throughput sequencing technology, our ability to gain insight into the composition of these communities has exploded. As researchers began tying composition to disease, it became apparent that there exists a strong relationship between the microbiome and health. However, it is still unknown what a healthy microbiome is, and the extent to which lifestyle and diet factors influence it.

This thesis comes at a time when our ability to generate data about microbial communities is outpacing our ability to interpret and analyze these data. Identifying ways to expand interpretation and increase the rate of scientific progress is extremely important given the potential for broad benefits to human health. One present limitation in the field is the difficulty of comparing studies with one another, which limits interpretation to potentially subjective comparisons against existing literature (assuming comparable methods were used) and observations of the data contained within the study.


Therefore, I worked toward the establishment of a well-characterized microbiome resource focused on reuse, using standards-compliant sequencing protocols, metadata collection and file formats. By using a combination of crowdfunding, localized web resources, and multiplexed 16S sequencing, I was able to amass microbiome samples from a large cross section of the population. These samples represent, to the best of my knowledge, the largest scientific microbiome study performed to date (in terms of the number of participants). In order to minimize barriers to reuse, all deidentified data have been deposited into the public domain prior to publication, and the processing pipeline I developed that generated participant results is BSD-licensed within an executable IPython Notebook on the American Gut GitHub repository. In parallel, the project has generated a significant amount of public interest in the microbiome, including participants using the American Gut for their own research studies. One of the greatest challenges I’ve found has been the interaction with the public, and communicating in such a way as to provide accurate information that is digestible (and also combating the prevalent probiotic snake oils that exist on the market).

The framework I developed for the American Gut is in essence a generalized 16S sequencing pipeline and has enabled other researchers who don’t have access to a sequencing facility or experience generating 16S sequence data to perform studies of interest at a low cost. This pipeline is set up so that it is possible for researchers to take full advantage of the American Gut (e.g., consent and questionnaire), or to perform the sequencing work with separate metadata collection methods. To this end, while we’ve sequenced samples from around 4,200 participants for the American Gut, as part of the framework developed, we’ve actually shipped well over 28,000 samples. These other projects range from studies of the built environment, where samples were collected off of surfaces of occupied spaces, to an ICU microbiome pilot study where samples were collected from patients at five different ICUs from around the world. In addition, I’ve helped to create an autism spectrum disorder (ASD) cohort within the American Gut Project, which is presently collecting samples from individuals with ASD and any neurotypical siblings.

One of the core technologies that back the American Gut came out of my involvement in the development of the Quantitative Insights into Microbial Ecology (QIIME) software package [130]. The central datatype within QIIME is a contingency matrix in which the physical samples collected are represented as columns, and the observations (e.g., the “species” in a sample, typically referred to as operational taxonomic units (OTUs)) are represented as rows. The cells of the matrix represent the sequence counts of a given OTU in a sample, and the bulk of the cells in the matrix tend to be zero (i.e., a sparse matrix). When we originally published QIIME, these data were represented in memory as an unstructured Python tuple, which led to a large amount of duplicated logic across the code base because there was no API for interaction, and it limited our ability to alter the representation of the data in memory because the representation wasn’t abstracted.


I saw an opportunity to carve out this data structure from QIIME, improve the efficiency of its in-memory representation, and provide a rich API for interaction. In doing so, I realized the opportunity to standardize how these data are stored on disk, paying particular attention to being able to represent arbitrary information about the samples or the observations (e.g., the taxonomy of an OTU). The result was the BIOM project, which has since become a Genomic Standards Consortium recognized file format and has allowed QIIME to expand beyond its original intended focus of amplicon sequencing.

The taxonomy used by the American Gut, and the reference used to interpret the OTUs observed in the sequence data, are provided by Greengenes. I originally became involved in Greengenes through exploring how to reduce the time between releases of the resource, which led to the development of tax2tree. This algorithm greatly reduced the amount of manual effort required of the curator and reduced the number of subjective judgements necessary within the curation effort. An unexpected reuse of the algorithm then came about as a means of taxonomic classification, which provided useful insight into an analysis of what is believed to be the most diverse place on the planet.

The involvement in Greengenes paved the way to getting involved in a few other projects; one in particular that was instrumental for the American Gut was a collaboration with the IPython development team. At the time, the IPython group had recently developed a new way of tying documentation and code together within a web browser, as well as a streamlined parallel framework for dispatching arbitrary tasks to different IPython engines (whether local or remote). My initial foray into the IPython Notebook was transformative, particularly the parallel components, and opened my eyes to the possibility of compute becoming a commodity similar to electricity: just open up a web browser, and compute away. The ability to control an arbitrary-sized compute cluster within Amazon’s EC2 from a web browser, executing whatever I wanted, truly felt like the future. This is the future. These are incredible times. The Notebook ended up being an ideal platform for sharing the processing and analyses I was doing in the American Gut, which are now available, BSD-licensed, from the American Gut GitHub repository. The hope there is two-fold: first, to facilitate reproducibility, and second, to offer a stepping stone for anyone who wants to build from the analyses already being undertaken.

The works contained in this thesis have fundamentally shifted my perspective on the natural world, and expanded my own personal understanding of life in general. Through the development and use of computational techniques, I’ve learned about the vast “dark” parts of the microbial world and how we are intimately connected to these organisms, which are billions of years divergent from our species. Plainly put, the bulk of life is single celled, and it’s been shocking to realize how little we know about even the “simplest” of organisms, let alone the interaction of these organisms with their environment or with other organisms. But what’s amazing is the pace at which new technologies are being developed. The state of the microbiome field has shifted from understanding composition and its relationship to host and/or environment, to beginning to develop mechanistic understandings. In particular, the application of a technique called metabolomics has picked up and has allowed researchers to begin to “eavesdrop” on the molecular communication between cells. This pivot has allowed the field to ask not just who’s there, but what they are doing in their environment.

Coming into the life sciences from a computational upbringing is a challenge (as is always the case when stepping into a different field). One of the greatest challenges has been to learn how to effectively communicate across discipline boundaries. When I began exploring options for graduate school, the BioFrontiers Institute was setting up a new educational program called the Interdisciplinary Quantitative Biology program. It seemed like a perfect fit: the goal of the program was to facilitate interdisciplinary research and to teach students how to work with others outside of traditional departmental silos. As luck would have it, I was accepted into the inaugural class, and rapidly thrown into things like Michaelis-Menten kinetics, van der Waals forces, stem cells, Markovian processes and atomic force microscopy. It was a step into the deep end, and helped to open my eyes to some of the challenges faced in a multitude of disciplines. At the minimum, it established an appreciation for some of the complexity that researchers in these areas face.


Because of my involvement in tool development, opportunities arose to teach how to use the software I’d helped develop at workshops around the world to students from all across the spectrum of the sciences. I’ve found that the efforts of the Interdisciplinary Quantitative Biology program improved my effectiveness at communication, which led to some of the most profound experiences of my graduate career, notably teaching at the Workshop on Genomics in Český Krumlov. The workshop is run in the middle of winter for two immersive weeks (from 9am to 10pm each day), and includes roughly 75 students from a range of disciplines, at all stages of their careers, from all over the world (with a high level of diversity in keyboards…). What has become abundantly clear from these experiences is the reminder that science is a human endeavor, which can at times be easily forgotten, but it is through communication that progress is made.

In conclusion, this dissertation has made significant strides in the characterization of 16S in the V4 region, including addressing significant computational and development needs, and provides what has a reasonable chance of being an important dataset that may lead to benefits in human health in the future. In addition, the development of this resource required engagement with broad sections of the general public. The data gathered have enabled novel population-level insights giving rise to a greater understanding of the interplay between the microbiome and human health in the face of the varied lifestyle and dietary choices of the population.


Bibliography

1. Shendure, J. and H. Ji, Next-generation DNA sequencing. Nat Biotechnol, 2008. 26(10): p. 1135-45.

2. Turnbaugh, P.J., et al., The human microbiome project. Nature, 2007. 449(7164): p. 804-10.

3. Lawrence, J.G., Gene transfer, speciation, and the evolution of bacterial genomes. Curr Opin Microbiol, 1999. 2(5): p. 519-23.

4. Amann, R.I., W. Ludwig, and K.H. Schleifer, Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev, 1995. 59(1): p. 143-69.

5. Pace, N.R., A molecular view of microbial diversity and the biosphere. Science, 1997. 276(5313): p. 734-40.

6. Woese, C.R. and G.E. Fox, Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A, 1977. 74(11): p. 5088-90.

7. Lane, D.J., et al., Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A, 1985. 82(20): p. 6955-9.

8. Winker, S. and C.R. Woese, A definition of the domains Archaea, Bacteria and Eucarya in terms of small subunit ribosomal RNA characteristics. Syst Appl Microbiol, 1991. 14(4): p. 305-10.

9. Pace, N., et al., The analysis of natural microbial populations by rRNA sequences. Advances in microbial ecology, 1986. 9: p. 1-55.

10. Turnbaugh, P.J., et al., A core gut microbiome in obese and lean twins. Nature, 2009. 457(7228): p. 480-4.

11. Fierer, N., et al., Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc Natl Acad Sci U S A, 2012. 109(52): p. 21390-5.

12. Harris, J.K., et al., Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME J, 2013. 7(1): p. 50-60.

13. Ley, R.E., et al., Unexpected diversity and complexity of the Guerrero Negro hypersaline microbial mat. Appl Environ Microbiol, 2006. 72(5): p. 3685-95.

14. Schloss, P.D. and J. Handelsman, Status of the microbial census. Microbiol Mol Biol Rev, 2004. 68(4): p. 686-91.


15. Costello, E.K., et al., Bacterial community variation in human body habitats across space and time. Science, 2009. 326(5960): p. 1694-7.

16. Human Microbiome Project, C., Structure, function and diversity of the healthy human microbiome. Nature, 2012. 486(7402): p. 207-14.

17. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51.

18. Yatsunenko, T., et al., Human gut microbiome viewed across age and geography. Nature, 2012. 486(7402): p. 222-7.

19. Knights, D., et al., Human-associated microbial signatures: examining their predictive value. Cell Host Microbe, 2011. 10(4): p. 292-6.

20. Sandholt, C.H., et al., Combined analyses of 20 common obesity susceptibility variants. Diabetes, 2010. 59(7): p. 1667-73.

21. Lauber, C.L., et al., Pyrosequencing-based assessment of soil pH as a predictor of soil bacterial community structure at the continental scale. Appl Environ Microbiol, 2009. 75(15): p. 5111-20.

22. Rousk, J., et al., Soil bacterial and fungal communities across a pH gradient in an arable soil. ISME J, 2010. 4(10): p. 1340-51.

23. Chu, H., et al., Soil bacterial diversity in the Arctic is not fundamentally different from that found in other biomes. Environ Microbiol, 2010. 12(11): p. 2998-3006.

24. Fierer, N., et al., Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients. ISME J, 2012. 6(5): p. 1007-17.

25. Lozupone, C.A. and R. Knight, Global patterns in bacterial diversity. Proc Natl Acad Sci U S A, 2007. 104(27): p. 11436-40.

26. Caporaso, J.G., et al., Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4516-22.

27. Tamames, J., et al., Environmental distribution of prokaryotic taxa. BMC Microbiol, 2010. 10: p. 85.

28. Auguet, J.C., A. Barberan, and E.O. Casamayor, Global ecological patterns in uncultured Archaea. ISME J, 2010. 4(2): p. 182-90.


29. Gilbert, J.A., et al., Defining seasonal marine microbial community dynamics. ISME J, 2012. 6(2): p. 298-308.

30. Caporaso, J.G., et al., The Western English Channel contains a persistent microbial seed bank. ISME J, 2012. 6(6): p. 1089-93.

31. Gilbert, J.A., et al., The Earth Microbiome Project: Meeting report of the "1 EMP meeting on sample selection and acquisition" at Argonne National Laboratory October 6 2010. Stand Genomic Sci, 2010. 3(3): p. 249-53.

32. Knight, R., et al., Unlocking the potential of metagenomics through replicated experimental design. Nat Biotechnol, 2012. 30(6): p. 513-20.

33. Edwards, K.J., et al., An archaeal iron-oxidizing extreme acidophile important in acid mine drainage. Science, 2000. 287(5459): p. 1796-9.

34. Wu, D., et al., A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 2009. 462(7276): p. 1056-60.

35. Hamady, M. and R. Knight, Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res, 2009. 19(7): p. 1141-52.

36. Gonzalez, A. and R. Knight, Advancing analytical algorithms and pipelines for billions of microbial sequences. Curr Opin Biotechnol, 2012. 23(1): p. 64-71.

37. Liu, Z., et al., Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res, 2008. 36(18): p. e120.

38. Kunin, V., et al., Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol, 2010. 12(1): p. 118-23.

39. Quince, C., et al., Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods, 2009. 6(9): p. 639-41.

40. Cohan, F.M., What are bacterial species? Annu Rev Microbiol, 2002. 56: p. 457-87.

41. Schloss, P.D. and J. Handelsman, Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol, 2005. 71(3): p. 1501-6.

42. Schloss, P.D. and S.L. Westcott, Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol, 2011. 77(10): p. 3219-26.

43. Wang, Q., et al., Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol, 2007. 73(16): p. 5261-7.

44. Zaneveld, J.R., et al., Ribosomal RNA diversity predicts genome diversity in gut bacteria and their relatives. Nucleic Acids Res, 2010. 38(12): p. 3869-79.

45. Konstantinidis, K.T. and J.M. Tiedje, Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A, 2005. 102(7): p. 2567-72.

46. Ivanova, N., et al., Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature, 2003. 423(6935): p. 87-91.

47. Rappe, M.S. and S.J. Giovannoni, The uncultured microbial majority. Annu Rev Microbiol, 2003. 57: p. 369-94.

48. Staley, J.T. and A. Konopka, Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol, 1985. 39: p. 321-46.

49. Eckburg, P.B., et al., Diversity of the human intestinal microbial flora. Science, 2005. 308(5728): p. 1635-8.

50. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 2010. 464(7285): p. 59-65.

51. Arumugam, M., et al., Enterotypes of the human gut microbiome. Nature, 2011. 473(7346): p. 174-80.

52. Wu, G.D., et al., Linking long-term dietary patterns with gut microbial enterotypes. Science, 2011. 334(6052): p. 105-8.

53. MacDonald, N.J., D.H. Parks, and R.G. Beiko, Rapid identification of high-confidence taxonomic assignments for metagenomic data. Nucleic Acids Res, 2012. 40(14): p. e111.

54. Jeffery, I.B., et al., Categorization of the gut microbiota: enterotypes or gradients? Nat Rev Microbiol, 2012. 10(9): p. 591-2.

55. Claesson, M.J., et al., Gut microbiota composition correlates with diet and health in the elderly. Nature, 2012. 488(7410): p. 178-84.

56. Keim, B. Gut-bacteria mapping finds three global varieties. 2011 [cited 2013 16 Jan]; Available from: http://www.wired.com/wiredscience/2011/04/gut-bacteria-types/.

57. Yong, E. Gut microbial ‘enterotypes’ become less clear-cut. 2012 [cited 2013 16 Jan]; Available from: http://www.nature.com/news/gut-microbial-enterotypes-become-less-clear-cut-1.10276.

58. Zimmer, C. Bacterial ecosystems divide people into 3 groups, scientists say. 2011 [cited 2013 16 Jan]; Available from: http://www.nytimes.com/2011/04/21/science/21gut.html.

59. Fierer, N., et al., Forensic identification using skin bacterial communities. Proc Natl Acad Sci U S A, 2010. 107(14): p. 6477-81.

60. Ley, R.E., et al., Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A, 2005. 102(31): p. 11070-5.

61. Turnbaugh, P.J., et al., Diet-induced obesity is linked to marked but reversible alterations in the mouse distal gut microbiome. Cell Host Microbe, 2008. 3(4): p. 213-23.

62. Turnbaugh, P.J., et al., An obesity-associated gut microbiome with increased capacity for energy harvest. Nature, 2006. 444(7122): p. 1027-31.

63. Duncan, S.H., et al., Human colonic microbiota associated with diet, obesity and weight loss. Int J Obes (Lond), 2008. 32(11): p. 1720-4.

64. Schwiertz, A., et al., Microbiota and SCFA in lean and overweight healthy subjects. Obesity (Silver Spring), 2010. 18(1): p. 190-5.

65. NIH. Human microbiome project. 2012 [cited 2013 16 Jan]; Available from: http://commonfund.nih.gov/hmp.

66. Human-Food-Project. American Gut - what's in your gut? 2012 [cited 2013 16 Jan]; Available from: http://humanfoodproject.com/american-gut.

67. Personal-Genome-Project. Personal genome project. 2012 [cited 2013 16 Jan]; Available from: http://www.personalgenomes.org.

68. Frank, D.N., et al., Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci U S A, 2007. 104(34): p. 13780-5.

69. Michail, S., et al., Alterations in the gut microbiome of children with severe ulcerative colitis. Inflamm Bowel Dis, 2012. 18(10): p. 1799-808.

70. Gordon, J.I., et al., The human gut microbiota and undernutrition. Sci Transl Med, 2012. 4(137): p. 137ps12.

71. Kallus, S.J. and L.J. Brandt, The intestinal microbiota and obesity. J Clin Gastroenterol, 2012. 46(1): p. 16-24.

72. Kazor, C.E., et al., Diversity of bacterial populations on the tongue dorsa of patients with halitosis and healthy patients. J Clin Microbiol, 2003. 41(2): p. 558-63.

73. Yang, F., et al., Saliva microbiomes distinguish caries-active from healthy human populations. ISME J, 2012. 6(1): p. 1-10.

74. Finegold, S.M., et al., Pyrosequencing study of fecal microflora of autistic and control children. Anaerobe, 2010. 16(4): p. 444-53.

75. Landy, J., et al., Review article: faecal transplantation therapy for gastrointestinal disease. Aliment Pharmacol Ther, 2011. 34(4): p. 409-15.

76. Vrieze, A., et al., Transfer of intestinal microbiota from lean donors increases insulin sensitivity in individuals with metabolic syndrome. Gastroenterology, 2012. 143(4): p. 913-6 e7.

77. Biasucci, G., et al., Mode of delivery affects the bacterial community in the newborn gut. Early Hum Dev, 2010. 86 Suppl 1: p. 13-5.

78. Dominguez-Bello, M.G., et al., Delivery mode shapes the acquisition and structure of the initial microbiota across multiple body habitats in newborns. Proc Natl Acad Sci U S A, 2010. 107(26): p. 11971-5.

79. Caporaso, J.G., et al., Moving pictures of the human microbiome. Genome Biol, 2011. 12(5): p. R50.

80. Gonzalez, A., et al., Our microbial selves: what ecology can teach us. EMBO Rep, 2011. 12(8): p. 775-84.

81. Costello, E.K., et al., The application of ecological theory toward an understanding of the human microbiome. Science, 2012. 336(6086): p. 1255-62.

82. Lozupone, C.A., et al., Diversity, stability and resilience of the human gut microbiota. Nature, 2012. 489(7415): p. 220-30.

83. Gajer, P., et al., Temporal dynamics of the human vaginal microbiota. Sci Transl Med, 2012. 4(132): p. 132ra52.

84. Holling, C.S., Resilience and stability of ecological systems. Annu Rev Ecol Syst, 1973. 4: p. 1-23.

85. Steele, J.A., et al., Marine bacterial, archaeal and protistan association networks reveal ecological linkages. ISME J, 2011. 5(9): p. 1414-25.

86. Barberan, A., et al., Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J, 2012. 6(2): p. 343-51.

87. Moller-Levet, C.S., K.H. Cho, and O. Wolkenhauer, Microarray data clustering based on temporal variation: FCV with TSD preclustering. Appl Bioinformatics, 2003. 2(1): p. 35-45.

88. Mason, J., A new computation for the discrete Fourier transform. IEE Proc G, 1978. 2(1): p. 16-20.

89. Mallat, S., Multifrequency channel decompositions of images and wavelet models. IEEE Trans Acoust Speech Signal Process, 1989. 37(12): p. 2091-2110.

90. Koenig, J.E., et al., Succession of microbial consortia in the developing infant gut microbiome. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4578-85.

91. Beltran, J.R., J. Garcia-Lucia, and J. Navarro. Edge detection and classification using Mallat's wavelet. in Image Processing. 1994. IEEE Conference Publications.

92. Mallat, S. and S. Zhong, Characterization of signals from multiscale edges. IEEE Trans Pattern Anal Mach Intell, 1992. 14(7): p. 710-732.

93. Turnbaugh, P.J., et al., The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med, 2009. 1(6): p. 6ra14.

94. Dethlefsen, L., et al., The pervasive effects of an antibiotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoS Biol, 2008. 6(11): p. e280.

95. Dethlefsen, L. and D.A. Relman, Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4554-61.

96. Robertson, C. and T.A. Nelson, Review of software for space-time disease surveillance. Int J Health Geogr, 2010. 9: p. 16.

97. Queiroz, K.d., Phenetic clustering in biology: a critique. Q Rev Biol, 1997. 72(1): p. 3-30.

98. Ley, R.E., et al., Worlds within worlds: evolution of the vertebrate gut microbiota. Nat Rev Microbiol, 2008. 6(10): p. 776-88.

99. Reyes, A., et al., Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature, 2010. 466(7304): p. 334-8.

100. Andersen, L.O., H. Vedel Nielsen, and C.R. Stensvold, Waiting for the human intestinal Eukaryotome. ISME J, 2013. 7(7): p. 1253-5.

101. Savage, D.C., Microbial ecology of the gastrointestinal tract. Annu Rev Microbiol, 1977. 31: p. 107-33.

102. Smith, M.I., et al., Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science, 2013. 339(6119): p. 548-54.

103. Ridaura, V.K., et al., Gut microbiota from twins discordant for obesity modulate metabolism in mice. Science, 2013. 341(6150): p. 1241214.

104. Hsiao, E.Y., et al., Microbiota modulate behavioral and physiological abnormalities associated with neurodevelopmental disorders. Cell, 2013. 155(7): p. 1451-63.

105. Khor, B., A. Gardet, and R.J. Xavier, Genetics and pathogenesis of inflammatory bowel disease. Nature, 2011. 474(7351): p. 307-17.

106. Everard, A. and P.D. Cani, Diabetes, obesity and gut microbiota. Best Pract Res Clin Gastroenterol, 2013. 27(1): p. 73-83.

107. Kostic, A.D., et al., Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe, 2013. 14(2): p. 207-15.

108. Maes, M., et al., Increased IgA and IgM responses against gut commensals in chronic depression: further evidence for increased bacterial translocation or leaky gut. J Affect Disord, 2012. 141(1): p. 55-62.

109. O'Mahony, S.M., et al., Serotonin, tryptophan metabolism and the brain-gut-microbiome axis. Behav Brain Res, 2014.

110. Reardon, S., Microbiome therapy gains market traction. Nature, 2014. 509(7500): p. 269-70.

111. Gonzalez, A., Y. Vazquez-Baeza, and R. Knight. The assembly of an infant gut microbiome framed against healthy human adults. 2012 [cited 2014 Nov]; Available from: https://www.youtube.com/watch?v=Pb272zsixSQ.

112. Gonzalez, A., Y. Vazquez-Baeza, and R. Knight. Gut Ecosystem Restoration via Fecal Transplantation. 2014 [cited 2014 Nov]; Available from: https://www.youtube.com/watch?v=-FFDqhM4pks.

113. Gilbert, J.A., J.K. Jansson, and R. Knight, The Earth Microbiome project: successes and aspirations. BMC Biol, 2014. 12: p. 69.

114. David, L.A., et al., Diet rapidly and reproducibly alters the human gut microbiome. Nature, 2014. 505(7484): p. 559-63.

115. Lamendella, R., et al., Assessment of the Deepwater Horizon oil spill impact on Gulf coast microbial communities. Front Microbiol, 2014. 5: p. 130.

116. Willing, B., et al., Twin studies reveal specific imbalances in the mucosa-associated microbiota of patients with ileal Crohn's disease. Inflamm Bowel Dis, 2009. 15(5): p. 653-60.

117. Gilbert, J.A., et al., The seasonal structure of microbial communities in the Western English Channel. Environ Microbiol, 2009. 11(12): p. 3132-9.

118. Goodrich, J.K., et al., Human genetics shape the gut microbiome. Cell, 2014. 159(4): p. 789-99.

119. Kang, S.S., et al., Diet and exercise orthogonally alter the gut microbiome and reveal independent associations with anxiety and cognition. Mol Neurodegener, 2014. 9: p. 36.

120. Moeller, A.H., et al., Rapid changes in the gut microbiome during human evolution. Proc Natl Acad Sci U S A, 2014.

121. Knights, D., et al., Rethinking "enterotypes". Cell Host Microbe, 2014. 16(4): p. 433-7.

122. Lozupone, C.A., et al., Meta-analyses of studies of the human microbiota. Genome Res, 2013. 23(10): p. 1704-14.

123. Gevers, D., et al., The treatment-naive microbiome in new-onset Crohn's disease. Cell Host Microbe, 2014. 15(3): p. 382-92.

124. American-Gut-Project. Alpha Diversity Notebook. 2015 [cited 2015 Feb]; Available from: http://nbviewer.ipython.org/github/biocore/American-Gut/blob/master/ipynb/Alpha diversity notebook.ipynb.

125. Kuczynski, J., et al., Direct sequencing of the human microbiome readily reveals community differences. Genome Biol, 2010. 11(5): p. 210.

126. Human Microbiome Project Consortium, A framework for human microbiome research. Nature, 2012. 486(7402): p. 215-21.

127. American-Gut-Project. mod1. 2015 [cited 2014 Nov]; Available from: http://microbio.me/americangut/img/mod1_main.pdf.

128. American-Gut-Project. Website. 2015 [cited 2015 June 10]; Available from: http://americangut.org.

129. Salter, S.J., et al., Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol, 2014. 12(1): p. 87.

130. Caporaso, J.G., et al., QIIME allows analysis of high-throughput community sequencing data. Nat Methods, 2010. 7(5): p. 335-6.

131. McDonald, D., et al., The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience, 2012. 1(1): p. 7.

132. Rideout, J.R., et al., Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2014. 2: p. e545.

133. McDonald, D., et al., An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J, 2012. 6(3): p. 610-8.

134. Sullam, K.E., et al., Environmental and ecological factors that shape the gut bacterial communities of fish: a meta-analysis. Mol Ecol, 2012. 21(13): p. 3363-78.

135. Lozupone, C., et al., Identifying genomic and metabolic features that can underlie early successional and opportunistic lifestyles of human gut symbionts. Genome Res, 2012. 22(10): p. 1974-84.

136. Ley, R.E., et al., Evolution of mammals and their gut microbes. Science, 2008. 320(5883): p. 1647-51.

137. Vazquez-Baeza, Y., et al., EMPeror: a tool for visualizing high-throughput microbial community data. Gigascience, 2013. 2(1): p. 16.

138. Kristal, A.R., et al., Evaluation of web-based, self-administered, graphical food frequency questionnaire. J Acad Nutr Diet, 2014. 114(4): p. 613-21.

139. Yilmaz, P., et al., Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol, 2011. 29(5): p. 415-20.

140. Caporaso, J.G., et al., Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J, 2012. 6(8): p. 1621-4.

141. Pérez, F. and B.E. Granger, IPython: a system for interactive scientific computing. Comput Sci Eng, 2007. 9: p. 21-29.

142. Adams, J.B., et al., Gastrointestinal flora and gastrointestinal status in children with autism--comparisons to typical children and correlation with autism severity. BMC Gastroenterol, 2011. 11: p. 22.

143. Wang, L., et al., Increased abundance of Sutterella spp. and Ruminococcus torques in feces of children with autism spectrum disorder. Mol Autism, 2013. 4(1): p. 42.

144. Williams, B.L., et al., Impaired carbohydrate digestion and transport and mucosal dysbiosis in the intestines of children with autism and gastrointestinal disturbances. PLoS One, 2011. 6(9): p. e24585.

145. Williams, B.L., et al., Application of novel PCR-based methods for detection, quantitation, and phylogenetic characterization of Sutterella species in intestinal biopsy samples from children with autism and gastrointestinal disturbances. MBio, 2012. 3(1).

146. Hsiao, E.Y., Gastrointestinal issues in autism spectrum disorder. Harv Rev Psychiatry, 2014. 22(2): p. 104-11.

147. Foley, K.A., et al., Sexually dimorphic effects of prenatal exposure to lipopolysaccharide, and prenatal and postnatal exposure to propionic acid, on acoustic startle response and prepulse inhibition in adolescent rats: relevance to autism spectrum disorders. Behav Brain Res, 2015. 278: p. 244-56.

148. Hornig, M., The role of microbes and autoimmunity in the pathogenesis of neuropsychiatric illness. Curr Opin Rheumatol, 2013. 25(4): p. 488-795.

149. Aroniadis, O.C. and L.J. Brandt, Fecal microbiota transplantation: past, present and future. Curr Opin Gastroenterol, 2013. 29(1): p. 79-84.

150. Dinan, T.G. and J.F. Cryan, Melancholic microbes: a link between gut microbiota and depression? Neurogastroenterol Motil, 2013. 25(9): p. 713-9.

151. Foster, J.A. and K.A. McVey Neufeld, Gut-brain axis: how the microbiome influences anxiety and depression. Trends Neurosci, 2013. 36(5): p. 305-12.

152. Logan, A.C. and M. Katzman, Major depressive disorder: probiotics may be an adjuvant therapy. Med Hypotheses, 2005. 64(3): p. 533-8.

153. Sha, S., et al., Systematic review: faecal microbiota transplantation therapy for digestive and nondigestive disorders in adults and children. Aliment Pharmacol Ther, 2014. 39(10): p. 1003-32.

154. Vitetta, L., M. Bambling, and H. Alford, The gastrointestinal tract microbiome, probiotics, and mood. Inflammopharmacology, 2014. 22(6): p. 333-9.

155. Chen, Y., et al., Characterization of fecal microbial communities in patients with liver cirrhosis. Hepatology, 2011. 54(2): p. 562-72.

156. Cani, P.D., et al., Changes in gut microbiota control metabolic endotoxemia-induced inflammation in high-fat diet-induced obesity and diabetes in mice. Diabetes, 2008. 57(6): p. 1470-81.

157. Morgan, X.C., et al., Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol, 2012. 13(9): p. R79.

158. El Aidy, S., T.G. Dinan, and J.F. Cryan, Immune modulation of the brain-gut-microbe axis. Front Microbiol, 2014. 5: p. 146.

159. Julio-Pieper, M., et al., Review article: intestinal barrier dysfunction and central nervous system disorders--a controversial association. Aliment Pharmacol Ther, 2014. 40(10): p. 1187-201.

160. Chu, H. and S.K. Mazmanian, Innate immune recognition of the microbiota promotes host-microbial symbiosis. Nat Immunol, 2013. 14(7): p. 668-75.

161. Stefka, A.T., et al., Commensal bacteria protect against food allergen sensitization. Proc Natl Acad Sci U S A, 2014. 111(36): p. 13145-50.

162. Bercik, P. and S.M. Collins, The effects of inflammation, infection and antibiotics on the microbiota-gut-brain axis. Adv Exp Med Biol, 2014. 817: p. 279-89.

163. Clarke, G., et al., The microbiome-gut-brain axis during early life regulates the hippocampal serotonergic system in a sex-dependent manner. Mol Psychiatry, 2013. 18(6): p. 666-73.

164. de Theije, C.G., et al., Altered gut microbiota and activity in a murine model of autism spectrum disorders. Brain Behav Immun, 2014. 37: p. 197-206.

165. Shade, A., et al., A meta-analysis of changes in bacterial and archaeal communities with time. ISME J, 2013. 7(8): p. 1493-506.

166. American-Gut-Project, Power Analysis. 2015.

167. Roediger, W.E., Utilization of nutrients by isolated epithelial cells of the rat colon. Gastroenterology, 1982. 83(2): p. 424-9.

168. Kim, M.H., et al., Short-chain fatty acids activate GPR41 and GPR43 on intestinal epithelial cells to promote inflammatory responses in mice. Gastroenterology, 2013. 145(2): p. 396-406 e1-10.

169. Weng, M., W.A. Walker, and I.R. Sanderson, Butyrate regulates the expression of pathogen-triggered IL-8 in intestinal epithelia. Pediatr Res, 2007. 62(5): p. 542-6.

170. Maurice, C.F., H.J. Haiser, and P.J. Turnbaugh, Xenobiotics shape the physiology and gene expression of the active human gut microbiome. Cell, 2013. 152(1-2): p. 39-50.

171. Chen, Y. and M.J. Blaser, Inverse associations of Helicobacter pylori with asthma and allergy. Arch Intern Med, 2007. 167(8): p. 821-7.

172. Cox, L.M., et al., Altering the intestinal microbiota during a critical developmental window has lasting metabolic consequences. Cell, 2014. 158(4): p. 705-21.

173. Crawford, P.A., et al., Regulation of myocardial ketone body metabolism by the gut microbiota during nutrient deprivation. Proc Natl Acad Sci U S A, 2009. 106(27): p. 11276-81.

174. Gilbert, J.A., et al., Toward effective probiotics for autism and other neurodevelopmental disorders. Cell, 2013. 155(7): p. 1446-8.

175. David, L.A., et al., Host lifestyle affects human microbiota on daily timescales. Genome Biol, 2014. 15(7): p. R89.

176. Flores, G.E., et al., Temporal variability is a personalized feature of the human microbiome. Genome Biol, 2014. 15(12): p. 531.

177. Severance, E.G., et al., Seroreactive marker for inflammatory bowel disease and associations with antibodies to dietary proteins in bipolar disorder. Bipolar Disord, 2014. 16(3): p. 230-40.

178. Kang, V., G.C. Wagner, and X. Ming, Gastrointestinal dysfunction in children with autism spectrum disorders. Autism Res, 2014. 7(4): p. 501-6.

179. Berg, R.D., Bacterial translocation from the gastrointestinal tract. Trends Microbiol, 1995. 3(4): p. 149-54.

180. Turner, J.R., et al., The role of molecular remodeling in differential regulation of tight junction permeability. Semin Cell Dev Biol, 2014. 36: p. 204-12.

181. Thomas, R.H., et al., The enteric bacterial metabolite propionic acid alters brain and plasma phospholipid molecular species: further development of a rodent model of autism spectrum disorders. J Neuroinflammation, 2012. 9: p. 153.

182. Desbonnet, L., et al., The probiotic Bifidobacteria infantis: An assessment of potential antidepressant properties in the rat. J Psychiatr Res, 2008. 43(2): p. 164-74.

183. Flory, J.D., et al., Serotonergic function in the central nervous system is associated with daily ratings of positive mood. Psychiatry Res, 2004. 129(1): p. 11-9.

184. Voigt, J.P. and H. Fink, Serotonin controlling feeding and satiety. Behav Brain Res, 2015. 277: p. 14-31.

185. McDougle, C.J., et al., Effects of tryptophan depletion in drug-free adults with autistic disorder. Arch Gen Psychiatry, 1996. 53(11): p. 993-1000.

186. Yang, C.J., H.P. Tan, and Y.J. Du, The developmental disruptions of serotonin signaling may involved in autism during early brain development. Neuroscience, 2014. 267: p. 1-10.

187. Malkova, N.V., et al., Maternal immune activation yields offspring displaying mouse versions of the three core symptoms of autism. Brain Behav Immun, 2012. 26(4): p. 607-16.

188. Attia, E. and B.T. Walsh, Anorexia nervosa. Am J Psychiatry, 2007. 164(12): p. 1805-10; quiz 1922.

189. Dellava, J.E., et al., Impact of broadening definitions of anorexia nervosa on sample characteristics. J Psychiatr Res, 2011. 45(5): p. 691-8.

190. Anckarsater, H., et al., The sociocommunicative deficit subgroup in anorexia nervosa: autism spectrum disorders and neurocognition in a community-based, longitudinal study. Psychol Med, 2012. 42(9): p. 1957-67.

191. Baron-Cohen, S., et al., Do girls with anorexia nervosa have elevated autistic traits? Mol Autism, 2013. 4(1): p. 24.

192. Gillberg, I.C., M. Rastam, and C. Gillberg, Anorexia nervosa 6 years after onset: Part I. Personality disorders. Compr Psychiatry, 1995. 36(1): p. 61-9.

193. Sinno, M.H., et al., Regulation of feeding and anxiety by alpha-MSH reactive autoantibodies. Psychoneuroendocrinology, 2009. 34(1): p. 140-9.

194. Tennoune, N., et al., Bacterial ClpB heat-shock protein, an antigen-mimetic of the anorexigenic peptide alpha-MSH, at the origin of eating disorders. Transl Psychiatry, 2014. 4: p. e458.

195. Langille, M.G., et al., Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol, 2013. 31(9): p. 814-21.

196. Furusawa, Y., et al., Commensal microbe-derived butyrate induces the differentiation of colonic regulatory T cells. Nature, 2013. 504(7480): p. 446-50.

197. Mari-Bauset, S., et al., Evidence of the gluten-free and casein-free diet in autism spectrum disorders: a systematic review. J Child Neurol, 2014. 29(12): p. 1718-27.

198. Reichardt, N., et al., Phylogenetic distribution of three pathways for propionate production within the human gut microbiota. ISME J, 2014. 8(6): p. 1323-35.

199. MacFabe, D.F., et al., Effects of the enteric bacterial metabolic product propionic acid on object-directed behavior, social behavior, cognition, and neuroinflammation in adolescent rats: Relevance to autism spectrum disorder. Behav Brain Res, 2011. 217(1): p. 47-54.

200. Hosseini, E., et al., Propionate as a health-promoting microbial metabolite in the human gut. Nutr Rev, 2011. 69(5): p. 245-58.

201. Vinolo, M.A., et al., Suppressive effect of short-chain fatty acids on production of proinflammatory mediators by neutrophils. J Nutr Biochem, 2011. 22(9): p. 849-55.

202. Currenti, S.A., Understanding and determining the etiology of autism. Cell Mol Neurobiol, 2010. 30(2): p. 161-71.

203. Ustianowski, A., et al., Prevalence and associations of vitamin D deficiency in foreign-born persons with tuberculosis in London. J Infect, 2005. 50(5): p. 432-7.

204. Maxwell, S.M., S.M. Salah, and J.E. Bunn, Dietary habits of the Somali population in Liverpool, with respect to foods containing calcium and vitamin D: a cause for concern? J Hum Nutr Diet, 2006. 19(2): p. 125-7.

205. Antico, A., et al., Hypovitaminosis D as predisposing factor for atrophic type A gastritis: a case-control study and review of the literature on the interaction of Vitamin D with the immune system. Clin Rev Allergy Immunol, 2012. 42(3): p. 355-64.

206. Smolders, J., et al., Expression of vitamin D receptor and metabolizing enzymes in multiple sclerosis-affected brain tissue. J Neuropathol Exp Neurol, 2013. 72(2): p. 91-105.

207. Joseph, R.W., et al., Vitamin D receptor upregulation in alloreactive human T cells. Hum Immunol, 2012. 73(7): p. 693-8.

208. Wu, S., et al., Vitamin D receptor negatively regulates bacterial-stimulated NF-kappaB activity in intestine. Am J Pathol, 2010. 177(2): p. 686-97.

209. Olliver, M., et al., Immunomodulatory effects of vitamin D on innate and adaptive immune responses to Streptococcus pneumoniae. J Infect Dis, 2013. 208(9): p. 1474-81.

210. Sczesnak, A., et al., The genome of th17 cell-inducing segmented filamentous bacteria reveals extensive auxotrophy and adaptations to the intestinal environment. Cell Host Microbe, 2011. 10(3): p. 260-72.

211. Ruemmele, F.M. and H. Garnier-Lengline, Transforming growth factor and intestinal inflammation: the role of nutrition. Nestle Nutr Inst Workshop Ser, 2013. 77: p. 91-8.

212. Yang, J. and D.J. Rose, Long-term dietary pattern of fecal donor correlates with butyrate production and markers of protein fermentation during in vitro fecal fermentation. Nutr Res, 2014. 34(9): p. 749-59.

213. Freedman, L.S., et al., Pooled results from 5 validation studies of dietary self-report instruments using recovery biomarkers for energy and protein intake. Am J Epidemiol, 2014. 180(2): p. 172-88.

214. Clayton, T.A., et al., Pharmacometabonomic identification of a significant host-microbiome metabolic interaction affecting human drug metabolism. Proc Natl Acad Sci U S A, 2009. 106(34): p. 14728-33.

215. Markle, J.G., et al., Sex differences in the gut microbiome drive hormone-dependent regulation of autoimmunity. Science, 2013. 339(6123): p. 1084-8.

216. Goodrich, J.K., et al., Human genetics shape the gut microbiome. Cell, 2014. 159(4): p. 789-99.

217. Lozupone, C. and R. Knight, UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol, 2005. 71(12): p. 8228-35.

218. Mutlu, E.A., et al., Colonic microbiome is altered in alcoholism. Am J Physiol Gastrointest Liver Physiol, 2012. 302(9): p. G966-78.

219. Poli, A., et al., Moderate alcohol use and health: a consensus document. Nutr Metab Cardiovasc Dis, 2013. 23(6): p. 487-504.

220. Greenblum, S., P.J. Turnbaugh, and E. Borenstein, Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc Natl Acad Sci U S A, 2012. 109(2): p. 594-9.

221. Song, S.J., et al., Cohabiting family members share microbiota with one another and with their dogs. Elife, 2013. 2: p. e00458.

222. Markowitz, V.M., et al., IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res, 2012. 40(Database issue): p. D115-22.

223. Zakrzewski, M., et al., Profiling of the metabolically active community from a production-scale biogas plant by means of high-throughput metatranscriptome sequencing. J Biotechnol, 2012. 158(4): p. 248-58.

224. Helbling, D.E., et al., The activity level of a microbial community function can be predicted from its metatranscriptome. ISME J, 2012. 6(4): p. 902-4.

225. Muegge, B.D., et al., Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science, 2011. 332(6032): p. 970-4.

226. Glenn, T.C., Field guide to next-generation DNA sequencers. Mol Ecol Resour, 2011. 11(5): p. 759-69.

227. Desai, N., et al., From genomics to metagenomics. Curr Opin Biotechnol, 2012. 23(1): p. 72-6.

228. Angiuoli, S.V., et al., Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One, 2011. 6(10): p. e26624.

229. Sansone, S.A., et al., Toward interoperable bioscience data. Nat Genet, 2012. 44(2): p. 121-6.

230. Turnbaugh, P.J. and J.I. Gordon, An invitation to the marriage of metagenomics and metabolomics. Cell, 2008. 134(5): p. 708-13.

231. Meyer, F., et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9: p. 386.

232. VAMPS. Available from: http://vamps.mbl.edu.

233. JSON. Available from: http://www.json.org.

234. McDonald, D. biom-format 1.0 example. 2012 [cited 2015 June 30]; Available from: http://biom-format.org/documentation/format_versions/biom-1.0.html.

235. McDonald, D. classic QIIME OTU table example. [cited 2015 June 30]; Available from: https://github.com/biocore/biom-format/blob/2.1.4/biom/__init__.py#L25.

236. Github. Available from: http://www.github.com.

237. McDonald, D. biom-format.org. 2015 [cited 2015 June 30]; Available from: http://biom-format.org/.

238. Sogin, M.L., et al., Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A, 2006. 103(32): p. 12115-20.

239. Reeder, J. and R. Knight, The 'rare biosphere': a reality check. Nat Methods, 2009. 6(9): p. 636-7.

240. Quince, C., T.P. Curtis, and W.T. Sloan, The rational exploration of microbial diversity. ISME J, 2008. 2(10): p. 997-1006.

241. McDonald, D. biom-format representative files. 2012 [cited 2015 June 30]; Available from: http://www.gigasciencejournal.com/content/supplementary/2047-217x-1-7-s3.zip.

242. Tringe, S.G. and P. Hugenholtz, A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol, 2008. 11(5): p. 442-6.

243. Pruesse, E., et al., SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res, 2007. 35(21): p. 7188-96.

244. Peplies, J., et al., A standard operating procedure for phylogenetic inference (SOPPI) using (rRNA) marker genes. Syst Appl Microbiol, 2008. 31(4): p. 251-7.


245. Cole, J.R., et al., The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res, 2009. 37(Database issue): p. D141-5.

246. DeSantis, T.Z., et al., Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol, 2006. 72(7): p. 5069-72.

247. Chun, J., et al., EzTaxon: a web-based tool for the identification of prokaryotes based on 16S ribosomal RNA gene sequences. Int J Syst Evol Microbiol, 2007. 57(Pt 10): p. 2259-61.

248. NIH HMP Working Group, et al., The NIH Human Microbiome Project. Genome Res, 2009. 19(12): p. 2317-23.

249. Vogel, T.M., et al., TerraGenome: a consortium for the sequencing of a soil metagenome. Nat Rev Microbiol, 2009. 7: p. 252.

250. Price, M.N., P.S. Dehal, and A.P. Arkin, FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One, 2010. 5(3): p. e9490.

251. Dalevi, D., et al., Automated group assignment in large phylogenetic trees using GRUNT: GRouping, Ungrouping, Naming Tool. BMC Bioinformatics, 2007. 8: p. 402.

252. Sayers, E.W., et al., Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 2011. 39(Database issue): p. D38-51.

253. CyanoDB. Available from: http://www.cyanodb.cz.

254. Edgar, R. UCHIME. Available from: http://www.drive5.com/uchime.

255. Haas, B.J., et al., Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res, 2011. 21(3): p. 494-504.

256. Nawrocki, E.P., D.L. Kolbe, and S.R. Eddy, Infernal 1.0: inference of RNA alignments. Bioinformatics, 2009. 25(10): p. 1335-7.

257. Cannone, J.J., et al., The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 2002. 3: p. 2.


258. Lane, D., 16S/23S rRNA sequencing, in Nucleic Acid Techniques in Bacterial Systematics. 1991, John Wiley and Sons: West Sussex.

259. Caporaso, J.G., et al., PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics, 2010. 26(2): p. 266-7.

260. Rijsbergen, C.v., Information Retrieval. 2nd ed. 1979, Boston: Butterworth.

261. Ludwig, W., et al., ARB: a software environment for sequence data. Nucleic Acids Res, 2004. 32(4): p. 1363-71.

262. Knight, R., et al., PyCogent: a toolkit for making sense from sequence. Genome Biol, 2007. 8(8): p. R171.

263. Werner, J.J., et al., Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys. ISME J, 2012. 6(1): p. 94-103.

264. Ciccarelli, F.D., et al., Toward automatic reconstruction of a highly resolved tree of life. Science, 2006. 311(5765): p. 1283-7.

265. Mavromatis, K., et al., Genome analysis of the anaerobic thermohalophilic bacterium Halothermothrix orenii. PLoS One, 2009. 4(1): p. e4192.

266. Ludwig, W. and H.-P. Klenk, Overview: a phylogenetic backbone and taxonomic framework for procaryotic systematics, in Bergey's Manual of Systematic Bacteriology, D.R. Boone, R.W. Castenholz, and G.M. Garrity, Editors. 2001, Springer: New York.

267. Kelly, K.M. and A.Y. Chistoserdov, Phylogenetic analysis of the succession of bacterial communities in the Great South Bay (Long Island). FEMS Microbiol Ecol, 2001. 35(1): p. 85-95.

268. Hugenholtz, P., et al., Novel division level bacterial diversity in a Yellowstone hot spring. J Bacteriol, 1998. 180(2): p. 366-76.

269. Dojka, M.A., et al., Microbial diversity in a hydrocarbon- and chlorinated-solvent-contaminated aquifer undergoing intrinsic bioremediation. Appl Environ Microbiol, 1998. 64(10): p. 3869-77.

270. Perez, F. and B.E. Granger, IPython: a system for interactive scientific computing. Computing in Science and Engineering, 2007. 9: p. 21-29.

271. Stein, L.D., The case for cloud computing in genome informatics. Genome Biol, 2010. 11(5): p. 207.

272. Walters, W.A., et al., PrimerProspector: de novo design and taxonomic analysis of barcoded polymerase chain reaction primers. Bioinformatics, 2011. 27(8): p. 1159-61.


273. Liu, Z., et al., Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res, 2007. 35(18): p. e120.

274. Afgan, E., et al., Using cloud computing infrastructure with CloudBioLinux, CloudMan, and Galaxy. Curr Protoc Bioinformatics, 2012. Chapter 11: p. Unit11 9.

275. Quince, C., et al., Removing noise from pyrosequenced amplicons. BMC Bioinformatics, 2011. 12: p. 38.


Appendix

A. American Gut Consortium

Name | Affiliations
Daniel McDonald | Department of Computer Science, University of Colorado at Boulder | BioFrontiers Institute, University of Colorado at Boulder
Antonio Gonzalez | Department of Pediatrics, UCSD
Yoshiki Vázquez-Baeza | Department of Computer Science, University of California, San Diego
Justine W. Debelius | Department of Chemistry and Biochemistry, University of Colorado at Boulder | Department of Pediatrics, UCSD
Gail Ackermann | Department of Pediatrics, UCSD
Adam Robbins-Pianka | Department of Computer Science, University of Colorado at Boulder | BioFrontiers Institute, University of Colorado at Boulder
Philip P. Strandwitz | Antimicrobial Discovery Center, Northeastern University, Department of Biology, Boston, Massachusetts 02115, USA
Jessica L. Metcalf | BioFrontiers Institute, University of Colorado at Boulder
Joshua Shorenstein | Department of Chemistry & Biochemistry, University of Colorado at Boulder | Department of Pediatrics, UCSD
Jose A. Navas-Molina | Department of Computer Science, University of Colorado at Boulder | Department of Pediatrics, UCSD
Zhenjiang Zech Xu | Department of Pediatrics, UCSD
Dan Knights | University of Minnesota
Jose C. Clemente | Department of Genetics and Genomic Science, Icahn School of Medicine at Mount Sinai | Immunology Institute, Icahn School of Medicine at Mount Sinai
Luke R. Thompson | Department of Pediatrics, UCSD
Gregory Humphrey | Department of Pediatrics, UCSD
James Gaffney | Department of Pediatrics, UCSD
Grant Gogul | Department of Pediatrics, UCSD
Catherine Lozupone | Anschutz Medical Center
Elaine Wolfe | Department of Pediatrics, UCSD
Tim Spector | Kings College London
Victoria Vazquez | Kings College London
Matthew Jackson | Kings College London
Chi Zhang | Department of Genetics, Harvard Medical School
Will Van Treuren | Stanford University, Department of Microbiology and Immunology
Jamie Morton | BioFrontiers Institute, University of Colorado at Boulder
Robert R. Dunn | Department of Biological Sciences, North Carolina State University
Emmanuel Montassier | Thérapeutique Cliniques et Expérimentales des Infections, Faculté de Médecine, University of Nantes
Hannah D. Holscher | Department of Computer Science and Engineering, University of Minnesota
Kelly S. Swanson | Department of Animal Sciences, University of Illinois, Urbana, Illinois 61801

Jan S. Suchodolski | Department of Animal Sciences, University of Illinois, Urbana, Illinois 61801
Cecil M. Lewis Jr. | Gastrointestinal Laboratory, Texas A&M University, College Station, TX 77843
Allison E. Mann | Department of Anthropology, University of Oklahoma, Norman OK 73019
Christina Warinner | Department of Anthropology, University of Oklahoma, Norman OK 73019
J. Gregory Caporaso | Department of Anthropology, University of Oklahoma, Norman OK 73019
John H. Chase | Department of Biological Sciences, Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011
Christopher A. Lowry | Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011
Embriette R. Hyde | Department of Integrative Physiology and Center for Neuroscience, University of Colorado Boulder, Boulder, CO 80309
Rachel S. Park | Pediatrics Department, the University of California, San Diego, San Diego, CA
George M. Church | University of Maryland School of Medicine, Baltimore, MD
Jack Gilbert | Harvard Medical School
Jacques Ravel | Department of Ecology and Evolution, and Department of Surgery, University of Chicago; Institute for Genomics and Systems Biology, Biosciences Department, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, U.S.A.; Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA.
Pawel Gajer | Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD
Lynn M. Schriml | Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD
Sébastien Matamoros | Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD
Patrice D. Cani | Université catholique de Louvain, Louvain Drug Research Institute, WELBIO (Walloon Excellence in Life sciences and BIOtechnology), Metabolism and Nutrition research group, B-1200 Brussels, Belgium
Janet K. Jansson | Université catholique de Louvain, Louvain Drug Research Institute, WELBIO (Walloon Excellence in Life sciences and BIOtechnology), Metabolism and Nutrition research group, B-1200 Brussels, Belgium
Scott T. Kelley | Pacific Northwest National Laboratory, Richland, WA
Kyle Bittinger | Department of Biology, San Diego State University, San Diego, CA
Rick Bushman | University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Jeff Leach | University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Joshua Ladau | Human Food Project, 53600 Hwy 118, Terlingua, Texas 79852
Katherine S. Pollard | The Gladstone Institutes, University of California, San Francisco, CA, USA


B. Supplemental Tables and Figures for Chapter V

Table S5.1. Summary of samples by sample type in the American Gut (AG) as of May 27th, 2015 compared with the Human Microbiome Project (HMP).

Sample type | AG samples | AG participants | HMP samples | HMP participants
Fecal | 4279 | 3889 | 365 | 230
Skin | 343 | 249 | 1367 | 238
Oral | 368 | 337 | 3316 | 234
Vaginal | 10 | 9 | 482 | 109
Nasal | 6 | 6 | 339 | 221
Hair | 5 | 5 | 0 | 0
Blank | 500 | N/A | N/A | N/A

Table S5.2. American Gut data dictionary.

Column name | Data type | Description | Free response?
ACNE_MEDICATION | bool | Use of acne medication | No
ACNE_MEDICATION_OTC | bool | Use of over the counter acne medication | No
AGE | int | Age in years | No
AGE_UNIT | str | Unit of measurement for age | No
ALCOHOL_FREQUENCY | str | Frequency of alcohol consumption | No
ALTITUDE | | A legacy standard field | No
ANIMAL_PER | int | The percentage of protein from animal sources | No


ANONYMIZED_NAME | | The barcode stripped of zeros and run suffix; a standard legacy field | No
ANTIBIOTIC_CONDITION | str | Reason for taking antibiotics | Yes
ANTIBIOTIC_MEDS | str | Antibiotics used | Yes
ANTIBIOTIC_SELECT | str | How recently antibiotics have been used | No
APPENDIX_REMOVED | bool | Whether their appendix has been removed | No
ASSIGNED_FROM_GEO | | A legacy standard field | No
ASTHMA | bool | Does the participant have asthma? | No
BMI | float | Body mass index | No
BODY_HABITAT | str | Body habitat sampled | No
BODY_PRODUCT | str | Product of the body habitat | No
BODY_SITE | str | Body site sampled | No
BarcodeSequence | str | Barcode sequence used; this field may not be supplied for all samples. | No
CARBOHYDRATE_PER | | Percentage of calories from carbohydrates over the week of sampling | No
CAT | bool | Has a pet cat | No
CENTER_NAME | str | A standard field | No
CENTER_PROJECT_NAME | str | A standard field | No
CEPHALOSPORIN | bool | Was treated with antibiotics related to cephalosporin | derived
CHICKENPOX | bool | Had chicken pox | No
CITY | str | City the participant lives in | Yes
COLLECTION_DATE | str | Date of collection | No
COMMON_NAME | str | Type of sample | No
COMMUNAL_DINING | bool | Whether dining tends to be a communal activity (i.e. participant eats in dining hall or cafeteria) | No
CONDITIONS_MEDICATION | bool | Whether the participant has conditions that require medication | No
CONTRACEPTIVE | bool | Whether the participant is taking contraceptives | No
COSMETICS_FREQUENCY | str | Frequency of cosmetics use | No
COUNTRY | str | Participant's country of residence | No
COUNTRY_OF_BIRTH | str | Country of birth | Yes
CSECTION | bool | Whether the participant was born via c-section | No
CURRENT_RESIDENCE_DURATION | str | How long the participant has lived in their present location | No
DECEASED_PARENT | | Relates to informed consent | No
DEODORANT_USE | str | Does the participant use deodorant | No
DEPTH | | A legacy standard field | No
DESCRIPTION | str | A sample description | No
DESCRIPTION_WELL | str | A description of the sample well | No
DIABETES | str | Type of diabetes | No
DIABETES_DIAGNOSE_DATE | str | Date of diabetes diagnosis | No
DIABETES_MEDICATION | bool | Are they on diabetes medication | No
DIABETES_MEDICATIONS | str | The diabetes medications in use | Yes
DIET_RESTRICTIONS | str | Diet restriction type | Yes
DIET_TYPE | str | High level diet categorization | No
DOG | bool | Has a pet dog | No
DOMINANT_HAND | str | The participant's dominant hand | No
DRINKING_WATER_SOURCE | str | Primary source of water | No
Description | | A QIIME-standard field | No
ELEVATION | float | Elevation of the sampling location | No


EMP Status | bool | A standard field | No
ENA-BASE-COUNT | | Lab Processing Information | No
ENA-CHECKLIST | | Lab Processing Information | No
ENA-SPOT-COUNT | | Lab Processing Information | No
ENV_BIOME | str | Associated biome type | No
ENV_FEATURE | str | The environment sampled | No
ENV_MATTER | str | Type of matter sampled | No
EXERCISE_FREQUENCY | str | Frequency of exercise | No
EXERCISE_LOCATION | str | Primary exercise location | No
EXPERIMENT_CENTER | str | Center that processed the sample | No
EXPERIMENT_DESIGN_DESCRIPTION | str | A standard database field | No
EXPERIMENT_TITLE | str | A standard database field | No
EXTRACTIONKIT_LOT | str | The extraction kit lot number | No
EXTRACTION_ROBOT | str | The specific robot being used | No
FAT_PER | | Percentage of calories from fat over the course of the observed week | No
FIBER_GRAMS | | Grams of fiber in the sampled week | No
FLOSSING_FREQUENCY | str | Frequency of flossing | No
FLU_VACCINE_DATE | str | How recently a flu vaccine was taken | No
FOODALLERGIES_OTHER | str | Does the participant have other food allergies? The question was represented in the survey as a checkbox, so a blank may indicate a non answer or negative answer | No
FOODALLERGIES_OTHER_TEXT | str | What the other allergies are | Yes


FOODALLERGIES_PEANUTS | bool | Whether the participant is allergic to peanuts. The question was represented in the survey as a checkbox, so a blank may indicate a non answer or negative answer | No
FOODALLERGIES_SHELLFISH | bool | Whether the participant is allergic to shellfish. The question was represented in the survey as a checkbox, so a blank may indicate a non answer or negative answer | No
FOODALLERGIES_TREENUTS | bool | Whether the participant is allergic to tree nuts. The question was represented in the survey as a checkbox, so a blank may indicate a non answer or negative answer | No
FRAT | bool | Does the participant live in a fraternity house? | No
GENERAL_MEDS | | A description of the medications participants are taking; generally not used | Yes
GLUTEN | bool | The participant does eat a gluten free diet | No
HAS_EXTRACTED_DATA | bool | Lab Processing Information | No
HAS_PHYSICAL_SPECIMIN | bool | Lab Processing Information | No
HEIGHT_IN | int | Participant height in inches | No
HEIGHT_OR_LENGTH | int | Participant height in cm | No
HOST_COMMON_NAME | str | Host type if applicable | No
HOST_SUBJECT_ID | str | Unique participant ID | No
HOST_TAXID | int | NCBI taxon ID | No
IBD | str | Indication of IBD | No
ILLUMINA_TECHNOLOGY | str | Technology used |
KEY_SEQ | | A legacy standard field |

LACTOSE | bool | Is the participant lactose intolerant | No
LAST_TRAVEL | str | When the participant traveled last | No
LATITUDE | float | A latitude in the participant's zipcode | No
LIBRARY_CONSTRUCTION_PROTOCOL | str | Library protocol used | No
LIVINGWITH | bool | Participant lives with other people who have submitted samples | No
LONGITUDE | float | A longitude in the participant's zipcode |

LinkerPrimerSequence | str | The linker primer used | No
MACHINE | str | The sequencing instrument used | No
MACRONUTRIENT_PCT_TOTAL | int | Total percentage of calories consumed from carbohydrates, fat and protein over the course of the week | No
MAINFACTOR_OTHER_1 | | Factors which contribute to participant migraines | Yes
MAINFACTOR_OTHER_2 | | Factors which contribute to participant migraines | Yes
MAINFACTOR_OTHER_3 | | Factors which contribute to participant migraines | Yes
MASTERMIX_LOT | str | The mastermix lot used | No
MIGRAINE | bool | Whether the participant experiences migraines | No
MIGRAINEMEDS | bool | Whether the participant is taking migraine medications | No
MIGRAINE_AGGRAVATION | | The major factor which contributes to migraines | No
MIGRAINE_AURA | bool | If the participant experiences migraines, do they experience an aura | No
MIGRAINE_FACTOR_1 | str | Factors that contribute to migraines | No
MIGRAINE_FACTOR_2 | str | Factors that contribute to migraines | No
MIGRAINE_FACTOR_3 | str | Factors that contribute to migraines | No
MIGRAINE_FREQUENCY | str | Frequency of migraines | No
MIGRAINE_MEDICATIONS | str | Migraine medications in use | Yes
MIGRAINE_NAUSEA | bool | Whether nausea is associated with the migraine | No
MIGRAINE_PAIN | bool | Whether pain is associated with the migraine | No
MIGRAINE_PHONOPHOBIA | bool | Whether there is a sensitivity to sound during a migraine | No
MIGRAINE_PHOTOPHOBIA | bool | Whether there is a sensitivity to light during a migraine | No
MIGRAINE_RELATIVES | str | Whether relatives experience migraines | No
MULTIVITAMIN | bool | Whether the participant takes a multivitamin |
NAILS | bool | Whether the participant bites their nails | No
NITROMIDAZOLE | | Treatment with Nitromidazole (antibiotic) in the past year | derived
NONFOODALLERGIES_BEESTINGS | bool | If the participant is allergic to bee stings. This was a check box, and a blank may indicate a negative answer or no answer | No
NONFOODALLERGIES_DANDER | bool | If the participant is allergic to dander. This was a check box, and a blank may indicate a negative answer or no answer | No
NONFOODALLERGIES_DRUG | | If the participant is allergic to some type of drug. This was a check box, and a blank may indicate a negative answer or no answer | No
NONFOODALLERGIES_NO | | Participant has no non-food allergies. This was a check box, and a blank may indicate a negative answer or no answer | No
NONFOODALLERGIES_POISONIVY | bool | If the participant is allergic to poison ivy. This was a check box, and a blank may indicate a negative answer or no answer | No
NONFOODALLERGIES_SUN | bool | If the participant is allergic to sun. This was a check box, and a blank may indicate a negative answer or no answer | No
ORIG_NAME | | The sample barcode without prefixes or suffixes. Equivalent to ANONYMIZED_NAME |
ORIG_SAMPLE_NAME | | The sample barcode without prefixes or suffixes. Equivalent to ANONYMIZED_NAME |
PCR_PRIMERS | | Lab Processing Information |
PENICILLIN | bool | Participant has been treated with penicillin in the past year | derived


PERCENTAGE_FROM_CARBS | str | The percentage of carbs from processed sources consumed on average | No
PETS | str | List of the participant's pets by common name | Yes
PET_CONTACT | str | How frequently does the participant interact with pets | Yes
PET_LOCATIONS | str | Where pets live | Yes
PKU | bool | Does the participant have phenylketonuria? | No
PLANT_PER | int | The percentage of protein from plants consumed during a week | No
PLATFORM | str | Type of sequencing used; lab processing information | No
PLATING | str | Individual responsible for plating; lab processing information | No
POOL_FREQUENCY | str | How often the participant uses a pool or hot tub | Yes
PREGNANT | bool | Is the participant pregnant | No
PREGNANT_DUE_DATE | date | When pregnant participants are due | No
PRIMARY_CARB | str | The carbohydrate the participant eats most frequently | Yes
PRIMARY_VEGETABLE | str | The vegetable the participant eats most frequently | Yes
PRIMER_DATE | int | When the primer plate was created; lab processing information | No
PRIMER_PLATE | int | Primer plate used for sample processing | No
PROCESSING_ROBOT | str | The robot used for PCR prep; lab processing information | No
PROJECT_NAME | str | A database standard field; the name of the sequencing run | No
PROTEIN_PER | int | Percentage of calories from protein consumed over the course of the studied week | No
PUBLIC | bool | If the sample is publicly available in the database | No
QUINOLINE | bool | Participant has been treated with quinoline in the past year | derived
RACE | str | | No


RACE_OTHER | str | Free text if "Race" is other | Yes
REGION | | A legacy database field | No
RELATIONS | str | Relatives who are also submitting American Gut samples | Yes
REQUIRED_SAMPLE_INFO_STATUS | str | A database field | No
ROOMMATES | str | Number of nonrelated roommates | No
RUN_CENTER | str | Where the sample was sequenced | No
RUN_DATE | date | When the sample was sequenced | No
RUN_PREFIX | str | A description of the sequencing run | No
SAMPLE | str | Another column giving the same barcode without preceding zeros or a run suffix. This is synonymous with ANONYMIZED_NAME, ORIG_NAME, or ORIG_SAMPLE_NAME | No
SAMPLE_CENTER | str | Lab processing field; the location where the sample was processed | No
SAMPLE_DESCRIPTION | str | Lab processing field; a description of the place where the sample was run | No
SAMPLE_ID | str | A numeric sample identifier | No
SAMPLE_NAME | int | A column giving the same barcode without preceding zeros or a run suffix. This is synonymous with ANONYMIZED_NAME, ORIG_NAME, or ORIG_SAMPLE_NAME | No
SAMPLE_PLATE | str | The plate on which the sample was run | No
SAMPLE_TIME | time | When the sample was collected | No
SAMP_SIZE | str | Sample mass used in analysis; lab standard field | No
SEASONAL_ALLERGIES | str | Does the participant have seasonal allergies? | No
SEQUENCING_METH | str | Lab processing field | No
SEQ_RUN | str | The run on which samples were sequenced | No
SEX | str | Participant gender | No
SHARED_HOUSING | str | Does the participant live in communal housing (dorm, frat, senior living) | No
SITE_SAMPLED | str | Where was the sample collected? | No


SKIN_CONDITION | str | If the participant has a skin condition | No
SLEEP_DURATION | str | How long the participant sleeps in the average night | No
SMOKING_FREQUENCY | str | How often the participant smokes | No
SOFTENER | str | Does the participant use fabric softener | No
SPECIAL_RESTRICTIONS | bool | Does the participant have dietary restrictions | No
STATE | str | The state or province where the sample was collected | No
STUDY_CENTER | str | Where the samples were sequenced | No
SULFA_DRUG | bool | If the participant used sulfa drugs in the past year | derived
SUPPLEMENTS | bool | Does the participant take supplements | No
SUPPLEMENTS_FIELDS | str | What supplements does the participant take | Yes
TANNING_BEDS | str | The participant uses a tanning bed or does not tan | No
TANNING_SPRAYS | str | The participant uses a tanning spray or does not tan | No
TARGET_GENE | str | The gene sequenced; a database standard field | No
TARGET_SUBFRAGMENT | str | The part of the gene sequenced; a database standard field | No
TAXON_ID | int | A description of the host taxon | No
TEETHBRUSHING_FREQUENCY | str | Frequency with which participant brushes their teeth | No
TITLE | str | Study name | No
TM1000_8_TOOL | str | Lab processing field | No
TM300_8_TOOL | str | Lab processing field | No
TM50_8_TOOL | str | Lab processing field | No
TONSILS_REMOVED | bool | If participant had their tonsils removed | No
TOT_MASS | int | Participant weight in kilograms | No
TYPES_OF_PLANTS | str | Number of plant species eaten in week of observation | No
WATER_LOT | str | Lab processing field; the water used in sample processing | No
WEIGHT_CHANGE | str | If the participant's weight has changed more than 10 lbs in the past year | No
WEIGHT_LBS | int | Weight in pounds | No
WELL_ID | str | The plate well where the sample was sequenced | No
ZIP | str | Participant zipcode | No
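
Several boolean fields in Table S5.2 come from survey checkboxes, where a blank may mean either "no" or "not answered". A minimal sketch of one way to load such records follows, keeping blanks as None so that the ambiguity is preserved rather than silently coerced to False. The accepted string encodings, the schema excerpt, and the function names are assumptions made for illustration; they are not taken from the American Gut code base.

```python
from typing import Callable, Dict, Optional

def parse_bool(raw: str) -> Optional[bool]:
    """Map a survey string to True/False; blanks stay ambiguous (None)."""
    text = raw.strip().lower()
    if text in {"yes", "true", "1"}:    # assumed encodings of a checked box
        return True
    if text in {"no", "false", "0"}:    # assumed encodings of an explicit "no"
        return False
    return None                         # blank or unrecognized value

def parse_value(raw: str, caster: Callable):
    """Cast one raw field according to its data type in the dictionary."""
    text = raw.strip()
    if caster is bool:
        return parse_bool(text)
    if text == "":
        return None
    return caster(text)

# Hypothetical excerpt of the data dictionary: column name -> data type.
SCHEMA_EXCERPT: Dict[str, Callable] = {
    "AGE": int,
    "BMI": float,
    "CAT": bool,
    "FOODALLERGIES_PEANUTS": bool,
    "CITY": str,
}

def parse_record(row: Dict[str, str], schema=SCHEMA_EXCERPT) -> dict:
    """Return a typed copy of one metadata row, limited to schema columns."""
    return {name: parse_value(row.get(name, ""), caster)
            for name, caster in schema.items()}

example = parse_record({
    "AGE": "34", "BMI": "22.5", "CAT": "yes",
    "FOODALLERGIES_PEANUTS": "", "CITY": "Boulder",
})
print(example)
```

Downstream analyses can then decide explicitly whether None should be treated as a negative answer or excluded.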

Table S5.3. Demographics of US participants in the American Gut.

Category | Group | Counts | Percentage¹ | Published
Sex | Female | 1980 | 55.9 | 50.8
Sex | Male | 1561 | 44.1 | 49.2
Sex | Other | 1 | 0.03 | --²
Race | African American | 26 | 0.68 | 12.6
Race | Asian or Pacific Islander | 135 | 3.53 | 5.0
Race | Caucasian | 3520 | 92.1 | 63.7
Race | Hispanic³ | 67 | 1.75 | 16.3
Race | Other | 77 | 2.01 | 6.2
Body Mass Index⁴ | Underweight | 129 | 3.958269 | 1.8
Body Mass Index⁴ | Normal | 1998 | 61.307149 | 31.2
Body Mass Index⁴ | Overweight | 750 | 23.013194 | 34.0
Body Mass Index⁴ | Obese | 382 | 11.721387 | 33.0
Smoker | I do not smoke | 3644 | 95.819090 | 82.6
Smoker | I smoke | 159 | 4.180910 | 17.4
IBD | I do not have an IBD | 3512 | 95.073091 | 99.6
IBD | I have an IBD | 182 | 4.926909 | 0.4
Diabetes | I do not have diabetes | 3648 | 97.254066 | 90.7
Diabetes | I have diabetes | 103 | 2.745934 | 9.35

¹ Percentage is calculated as the reported group/total reported.
² Not reported.
³ The US census lists Hispanic as an ethnicity, not a race.
⁴ BMI categories were calculated for adults 20 and older.
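
As a quick check of the first footnote, the percentage column can be reproduced from the counts by dividing each group by the total number of participants who reported that variable. The sketch below does this for the Sex rows of Table S5.3; the function name is illustrative only.

```python
def group_percentages(counts):
    """Percentage of each reported group out of all reported responses."""
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}

# Counts copied from the Sex rows of Table S5.3.
sex_counts = {"Female": 1980, "Male": 1561, "Other": 1}
sex_pct = group_percentages(sex_counts)

for group, pct in sex_pct.items():
    print(f"{group}: {pct:.2f}%")
```

Participants who left a question blank are excluded from the denominator, which is why the totals differ between variables.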

Table S5.4. Regression of PD whole tree diversity against lifestyle variables for adults in the Northern Hemisphere with a BMI of less than 40. The best fit for the model was determined using the R² value, AICc, and condition number.

Variable | Coefficient (95% CI) | P
Intercept | 17.22 (13.38, 21.07) | 3.96E-18
Antibiotic Use (In the past year) | -1.93 (-2.83, -1.03) | 2.72E-05
Antibiotic Use (In the past 6 months) | -1.81 (-2.72, -0.90) | 1.02E-04
Antibiotic Use (In the past month) | -2.92 (-4.37, -1.48) | 7.87E-05
IBD Diagnosis | -4.93 (-6.56, -3.29) | 4.49E-09
Antibiotic Use in IBD patients (In the past year) | 3.01 (-0.66, 6.67) | 0.11
Antibiotic Use in IBD patients (In the past 6 months) | 1.11 (-2.66, 4.88) | 0.56
Antibiotic Use in IBD patients (In the past month) | 6.55 (2.31, 10.79) | 2.50E-03
Types of Plants Consumed in a Week (6 to 10) | 1.33 (0.09, 2.56) | 0.04
Types of Plants Consumed in a Week (11 to 20) | 1.91 (0.69, 3.12) | 2.07E-03
Types of Plants Consumed in a Week (21 to 30) | 1.90 (0.65, 3.16) | 3.01E-03
Types of Plants Consumed in a Week (More than 30) | 2.53 (1.23, 3.83) | 1.43E-04
Alcohol Use (Less than once a week) | 1.01 (0.16, 1.86) | 0.02
Alcohol Use (Once or twice a week) | 1.66 (0.76, 2.56) | 2.92E-04
Alcohol Use (Three to five times a week) | 1.37 (0.46, 2.27) | 3.22E-03
Alcohol Use (Daily) | 1.00 (0.003, 2.01) | 0.05
Exercise Frequency (Once or twice a week) | 1.16 (0.11, 2.21) | 0.03
Exercise Frequency (Regularly) | 1.78 (0.79, 2.76) | 4.29E-04
Exercise Frequency (Daily) | 1.71 (0.65, 2.77) | 1.56E-03
Sleep Duration (6-7 hours) | 1.93 (0.86, 3.00) | 4.34E-04
Sleep Duration (7-8 hours) | 1.82 (0.77, 2.88) | 6.77E-04
Sleep Duration (8 or more hours) | 1.76 (0.56, 2.96) | 3.98E-03
Exercise Location (Outdoors) | 0.36 (-0.41, 1.12) | 0.36
Exercise Location (Depends on the season) | -0.95 (-1.89, -0.002) | 0.05
Exercise Location (Both) | -0.18 (-0.96, 0.60) | 0.64
ln(Age in Years) | 2.23 (1.30, 3.17) | 2.91E-06
cos(Collection Month) | -12.04 (-22.24, -1.84) | 0.02
cos(Collection Month)*ln(Latitude) | 3.42 (0.64, 6.20) | 0.02
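
The last two rows of Table S5.4 can be read together: if the interaction enters the model as a product, the net multiplier on cos(Collection Month) is -12.04 + 3.42·ln(Latitude), which changes sign at a particular latitude. The sketch below works through that arithmetic using the point estimates only. It assumes latitude is expressed in degrees and that the interaction is a simple product term; neither detail is stated in the table, and the wide confidence intervals mean the crossover is only indicative.

```python
import math

# Point estimates copied from Table S5.4.
COEF_COS_MONTH = -12.04
COEF_COS_MONTH_X_LN_LAT = 3.42

def seasonal_multiplier(latitude_deg: float) -> float:
    """Net coefficient on cos(Collection Month) at a given latitude.

    Assumes the model term is cos(month) * (b1 + b2 * ln(latitude)).
    """
    return COEF_COS_MONTH + COEF_COS_MONTH_X_LN_LAT * math.log(latitude_deg)

# Latitude at which the seasonal term vanishes: ln(lat) = -b1 / b2.
crossover_latitude = math.exp(-COEF_COS_MONTH / COEF_COS_MONTH_X_LN_LAT)

print(f"crossover latitude: {crossover_latitude:.1f} degrees")
for lat in (30.0, 40.0, 50.0):
    print(f"latitude {lat:.0f}: multiplier {seasonal_multiplier(lat):+.2f}")
```

Under these assumptions the seasonal term has opposite signs on either side of the crossover, so the direction of the collection-month effect depends on latitude.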

Table S5.5. Bloom filter sequences.

>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;;Pseudomonadaceae;Pseudomonas-3007971-Bloom
TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGT
GGTTTGTTAAGTTGAATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCA
AAACTGGCAAGCTAGAGTATGGTAGAGGGTAGTGGAATTTCCTG
>;Root;Bacteria;Firmicutes;Bacilli;Bacillales-4459942-Bloom
TACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGC
GGTCCTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGG
AAACTGGAGGACTTGAGTGCAGAAGAGAAGAGTGGAATTCCACG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae;Escherichia/Shigella-6359652-Bloom
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGC
GGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTG
ATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae;Citrobacter-2376152-Bloom
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGC
GGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCG
AAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae-5235310-Bloom
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGC
GGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCG
AAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas-9016203-Bloom
TACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGT
GGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCA
AAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae;Escherichia/Shigella-8491357-Bloom
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGC
GGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTG
ATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Enterobacteriales";Enterobacteriaceae-7842949-Bloom
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGC
GGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCG
AAACTGGCAGGCTGGAGTCTTGTAGAGGGGGGTAGAATTCCAGG
>;Root;Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus-1963084-Bloom
TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGC
GGTTTTTTAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAAGCGCATCGG
AAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATG
>;Root;Bacteria;"Proteobacteria";Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas-9894753-Contam
TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGT
GGTTCGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCA
AAACTGACGAGCTAGAGTATGGTAGAGGGTGGTGGAATTTCCTG


Figure S5.1. Lifestyle variables have an effect on microbiome community structure. Each bar is the average distance (± stdev) between the reference group, given by the label for the cluster of bars, and the group described by the bar color. Significance was tested with a one-tailed permutation t-test (*, p < 0.05; **, p < 0.01; ***, p < 0.001). A single sample from each individual was used for (a) body mass index and (b) most recent antibiotic dose. In healthy adults, defined as those aged 20-69 with a BMI between 18.5 and 30, no reported history of IBD or diabetes, and no antibiotic use in the past year, community structure changed with (c) alcohol frequency, (d) nightly sleep duration, (e) collection season, and (f) the number of types of plants consumed in a week.
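
The one-tailed permutation t-test named in this caption can be sketched as follows. This is a generic illustration rather than the code used to produce the figure: it uses Welch's t statistic, shuffles group labels 999 times, and runs on made-up distance values. The function names, the number of permutations, and the data are all assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference in means of a and b."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def permutation_t_test(a, b, n_perm=999, seed=0):
    """One-tailed test of mean(a) > mean(b) by shuffling group labels."""
    rng = random.Random(seed)
    observed = welch_t(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count permutations at least as extreme as the observed statistic.
        if welch_t(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical distances: between-group vs. within-reference-group.
between_group = [0.71, 0.68, 0.75, 0.80, 0.77, 0.73]
within_group = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61]

t_stat, p_value = permutation_t_test(between_group, within_group)
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.3f}")
```

Because the null distribution is built from the data themselves, the test does not rely on the distances being normally distributed, which is rarely the case for pairwise community distances.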


Figure S5.2. Sex-specific differences in alpha diversity in younger adults compared to older adults. PD whole tree diversity was regressed against age for (a) all adult participants, (b) women 25 - 45, (c) women 55 - 75, (d) men 25 - 45 and (e) men 55 - 75. There was a significant positive slope (p < 0.001) for all participants 25 to 75, and a positive correlation between age and alpha diversity for younger men (p < 0.01). The slope for younger men was significantly steeper (p < 0.05) than the slope for younger women. No difference was seen in older adults.


Figure S5.3. Co-abundance networks include Christensenellaceae as a hub in the healthy subset of adults. Positive correlations are in blue, negative correlations are in grey and the size of the edge is based on the strength of the correlation. The size of the node is based on the number of correlations associated with the corresponding taxon. The color of the node is based on the phylum level: Actinobacteria (blue), Bacteroidetes (green), Firmicutes (purple), Proteobacteria (yellow), and Tenericutes (grey). The presence of Christensenellaceae as a hub confirms the same finding in a different cohort. We further confirmed the previous finding that Christensenellaceae is associated with protection from obesity. Mean rank of Christensenellaceae relative abundance was significantly lower in lean subjects than in subjects with BMI above 25 (Mann-Whitney U test, p-value = 0.003), and Christensenellaceae was more often present in lean versus overweight subjects (Chi-square test, p-value = 0.020).

Figure S5.4. An example of a participant result. The bar chart shows a phylum level taxonomic summary comparing the American Gut participant to people of similar diet, BMI, gender, and age, and to the microbiome of food writer Michael Pollan. The tables describe the most abundant genera and families in the individual’s sample, as well as taxa enriched in the sample compared to the rest of the American Gut participants. Rare taxa were defined as those found in less than 10% of the American Gut population. The PCoAs across the bottom show the participant’s sample in reference to body site, fecal samples within the global gut, and the American Gut population alone. Results are regenerated as new data are added.

Figure S5.5. Bloom filtering alters the relative abundance of Gammaproteobacteria. There was, on average, a 6.5 fold reduction in Gammaproteobacteria in the mailed American Gut samples, compared to the immediately frozen PGP samples. The correlations between taxon abundance in filtered and unfiltered samples were 0.92, 0.91, and 0.90 respectively for American Gut sequencing runs 1, 2, and 7 and 0.92 for the PGP samples (all permutative p < 0.001). However, filtering reduced the Gammaproteobacteria abundance in the American Gut samples by 6.5 fold.

C. Supplemental Text, Tables and Figures for Chapter VI

Text S6.1 Goals of the BIOM format. The initial goals for the BIOM format and biom-format software project are as follows:

• The format should be fully generalizable to arbitrary biological sample and observation types, not specific to one or a few data types.
• The contingency table should be representable in either sparse or dense matrix format for file size, load time, and runtime memory considerations. These contingency tables tend to be sparse (i.e., containing mostly counts of zero) for many comparative omics fields.
• Values in the contingency table should be representable either as integers or floating point (i.e., real) numbers to support absolute and relative abundances.
• Sample and observation metadata should be optionally representable. Samples generally have metadata describing environmental parameters (such as ‘host-associated’ or ‘free-living’) while observation metadata may describe the taxonomic or functional classification of each observation.
• Information on the data type (e.g., OTU Table, Ortholog Table, Metabolite Table) should be included, and based on terms from a controlled vocabulary. This controlled vocabulary should be easily updatable to support new data types.
• Information on the source of the file should be present in the file including the software package and version that generated the file (e.g., "QIIME version 1.5.0"), and date and time of the file creation.
• The BIOM format should be versioned, and this version information should be included in all BIOM files.
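The goals above can be illustrated with a minimal sketch that turns a dense contingency table into a sparse, self-describing JSON document. The function name `to_sparse_biom` and its parameters are hypothetical, and the field names are modelled on the BIOM 1.0 JSON layout as an assumption; they should be checked against the format specification before being relied on.

```python
import json
from datetime import datetime, timezone


def to_sparse_biom(dense, obs_ids, sample_ids, table_type="OTU table",
                   generated_by="example-script 0.1",
                   obs_md=None, sample_md=None):
    """Build a BIOM-1.0-style dictionary (assumed layout) from a dense matrix.

    dense is a list of rows, one row per observation and one column per sample.
    """
    # Sparse representation: keep only non-zero cells as [row, column, value].
    data = [[i, j, v]
            for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0]
    # Integer counts and floating point abundances are both representable.
    is_int = all(isinstance(v, int) for _, _, v in data)
    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",  # format version
        "type": table_type,                # data type from a vocabulary
        "generated_by": generated_by,      # software package and version
        "date": datetime.now(timezone.utc).isoformat(),
        "matrix_type": "sparse",
        "matrix_element_type": "int" if is_int else "float",
        "shape": [len(obs_ids), len(sample_ids)],
        # Metadata is optional, so missing entries are stored as null.
        "rows": [{"id": o, "metadata": (obs_md or {}).get(o)}
                 for o in obs_ids],
        "columns": [{"id": s, "metadata": (sample_md or {}).get(s)}
                    for s in sample_ids],
        "data": data,
    }


table = to_sparse_biom(
    [[0, 5], [0, 0], [3, 0]],
    obs_ids=["OTU1", "OTU2", "OTU3"],
    sample_ids=["S1", "S2"],
    sample_md={"S1": {"environment": "host-associated"}},
)
print(json.dumps(table, indent=2))
```

Only two of the six cells are stored in this example, which is the file size argument made for sparse tables in the second goal.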

Text S6.2 Comparison of QIIME v1.4.0 with and without biom-format. Comparison of QIIME OTU table collapsing code with native QIIME OTU table data structures (Panels A-D) and biom-format Table objects (Panel E). Given an OTU table and associated sample metadata, this code collapses sets of samples with the same value for a given metadata entry into a single sample. Here we illustrate the vastly reduced complexity of this operation using biom-format Table objects (QIIME 1.4.0-dev svn revision 2770 and later; Panel E) versus native QIIME objects (QIIME 1.4.0 and earlier; Panels A-D). The full version of each example can be found in the QIIME repository using the information in each panel caption.
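The collapsing operation described above can be sketched in plain Python. This is not the QIIME or biom-format code shown in the panels; the function name `collapse_samples_by_category` is hypothetical, and summing the counts of grouped samples is an assumption about how the samples are combined.

```python
from collections import defaultdict


def collapse_samples_by_category(counts, sample_ids, sample_metadata, category):
    """Merge samples that share the same value for one metadata category.

    counts: list of rows (one per OTU), each a list of per-sample counts.
    sample_metadata: dict mapping sample id -> dict of metadata values.
    Returns the sorted group labels and the collapsed table.
    """
    # Map each metadata value to the column indices of its samples.
    groups = defaultdict(list)
    for j, sample_id in enumerate(sample_ids):
        groups[sample_metadata[sample_id][category]].append(j)

    group_ids = sorted(groups)
    # Assumption: grouped samples are combined by summing their counts.
    collapsed = [[sum(row[j] for j in groups[g]) for g in group_ids]
                 for row in counts]
    return group_ids, collapsed


counts = [[1, 2, 3],
          [0, 4, 0]]
sample_ids = ["a", "b", "c"]
metadata = {"a": {"site": "gut"}, "b": {"site": "skin"}, "c": {"site": "gut"}}
group_ids, collapsed = collapse_samples_by_category(
    counts, sample_ids, metadata, "site")
print(group_ids)   # ['gut', 'skin']
print(collapsed)   # [[4, 2], [0, 4]]
```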

Panel A: QIIME 1.4.0: Qiime/scripts/summarize_otu_by_cat.py (prior to switch to biom-format Table objects).

Panel B: QIIME 1.4.0: Qiime/qiime/summarize_otu_by_cat.py (prior to switch to biom-format Table objects).

Panel C: QIIME 1.4.0: Qiime/qiime/summarize_otu_by_cat.py (continued; prior to switch to biom-format Table objects).

Panel D: QIIME 1.4.0: Qiime/qiime/summarize_otu_by_cat.py (continued; prior to switch to biom-format Table objects).

Panel E: QIIME 1.4.0-dev, revision 2770: Qiime/scripts/summarize_otu_by_cat.py Replacement for all code in Panels A-D after switch to biom-format Table objects from native QIIME OTU table data structures.

[Plot: compression ratio (BIOM file size / tab-delimited file size) against matrix density, with axis ticks at 100, 10, 1, 0.1 and 1, 0.1, 0.01; fitted curve y = 0.217x^-0.849, R² = 0.88762.]

Figure S6.1 Matrix density to compression ratio.
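As a worked illustration of the fitted curve, the sketch below evaluates a power law with the coefficients printed on the plot. It assumes the fit reads y = 0.217x^-0.849, with x the matrix density and y the compression ratio; the function names are hypothetical and the code makes no claim beyond the arithmetic of that curve.

```python
def fitted_ratio(density, a=0.217, b=-0.849):
    """Compression ratio predicted by the assumed power-law fit y = a * x**b."""
    return a * density ** b


def density_at_ratio(ratio, a=0.217, b=-0.849):
    """Invert the fit: the matrix density at which the curve reaches `ratio`."""
    return (ratio / a) ** (1.0 / b)


for density in (1.0, 0.1, 0.01):
    print(density, round(fitted_ratio(density), 3))

# Density at which the fitted curve crosses a ratio of 1.
print(round(density_at_ratio(1.0), 3))
```

Under these assumptions the fitted curve crosses a ratio of 1 at a matrix density between 0.16 and 0.17.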

D. Supplemental Tables and Figures for Chapter VII

257086 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
239559 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
413620 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
462557 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
450147 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
315208 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
320577 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
461052 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
462874 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__Pseudomonas cuatrocienegasensis
77091 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
94214 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
69717 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
227277 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
140655 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
364673 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
47897 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
105848 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
303853 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
274226 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
111707 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
557138 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
108393 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
52398 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
164556 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
170325 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
263567 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
103335 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
104171 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
205180 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas macrocytogenes
214868 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas macrocytogenes
173614 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas macrocytogenes
66199 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__
162524 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas insignis
162892 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas insignis
173619 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Azomonas; s__Azomonas insignis
314855 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
339205 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
336279 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
325623 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
304190 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
349981 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
104004 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
343792 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
149280 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
252064 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__; s__
500112 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
500245 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
500159 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
500196 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
499873 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
500267 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
499842 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
499863 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__
499157 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; s__

Figure S7.1. Sample of a donor taxonomy flat text file: each record is a sequence identifier (a Greengenes id in this case) followed by a rank-explicit taxonomic string.
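A record in this layout can be split into its identifier and per-rank names with a few lines of Python. The sketch below is illustrative only: the function name `parse_taxonomy_line` is hypothetical, the identifier is assumed to be separated from the taxonomy string by whitespace, and an empty rank such as `g__` is reported as `None`.

```python
# Rank prefixes as they appear in the sample records (k__, p__, ..., s__).
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_taxonomy_line(line):
    """Split one record into (sequence id, {rank name: taxon or None})."""
    # Assumption: the id and the taxonomy string are separated by whitespace.
    seq_id, taxonomy = line.strip().split(None, 1)
    lineage = {}
    for field in taxonomy.split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing semicolon
        prefix, _, name = field.partition("__")
        # Species names contain spaces, so only the ';' separator is used.
        lineage[RANKS[prefix]] = name or None
    return seq_id, lineage


record = ("413620 k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
          "o__Pseudomonadales; f__Pseudomonadaceae; g__Pseudomonas; "
          "s__Pseudomonas cuatrocienegasensis")
seq_id, lineage = parse_taxonomy_line(record)
print(seq_id, lineage["genus"], lineage["species"])
```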

Table S7.1. NCBI and CyanoDB taxonomy assignments for 405 cyanobacterial type species present in tree_16S_candiv_gg_2011_1 (http://www.cyanodb.cz/valid_genera).

Columns: gg id; genus species; NCBI classification; CyanoDB classification (Type only)

Bacteria; Cyanobacteria; Gloeobacterophycideae, 356 Gloeobacter violaceus Gloeobacteria; Gloeobacterales; Gloeobacterales, 4 PCC 7421 Gloeobacter Gloeobacteraceae Bacteria; Cyanobacteria; Gloeobacterophycideae, 978 Gloeobacter violaceus Gloeobacteria; Gloeobacterales; Gloeobacterales, 40 PCC 7421 Gloeobacter Gloeobacteraceae Bacteria; Cyanobacteria; Gloeobacterophycideae, 108 Gloeobacter violaceus Gloeobacteria; Gloeobacterales; Gloeobacterales, 670 PCC 7421 Gloeobacter Gloeobacteraceae Bacteria; Cyanobacteria; Gloeobacterophycideae, 356 Gloeobacter violaceus Gloeobacteria; Gloeobacterales; Gloeobacterales, 5 PCC 8105 Gloeobacter Gloeobacteraceae

Nostocophycideae, 136 Chlorogloeopsis Bacteria; Cyanobacteria; Nostocales, 186 fritschii PCC 6718 Stigonematales; Chlorogloeopsis Chlorogloeopsidaceae Nostocophycideae, 830 Chlorogloeopsis Bacteria; Cyanobacteria; Nostocales, 35 fritschii PCC 6912 Stigonematales; Chlorogloeopsis Chlorogloeopsidaceae Nostocophycideae, 189 Mastigocladus Bacteria; Cyanobacteria; Nostocales, 042 laminosus Greenland_8 Stigonematales; Mastigocladus Hapalosiphonaceae Nostocophycideae, 106 Westiellopsis prolifica Bacteria; Cyanobacteria; Nostocales, 269 SAG 16.93 Stigonematales; Westiellopsis Hapalosiphonaceae Nostocophycideae, 106 Westiellopsis prolifica Bacteria; Cyanobacteria; Nostocales, 426 SAG 23.96 Stigonematales; Westiellopsis Hapalosiphonaceae 107 Anabaena Bacteria; Cyanobacteria; Nostocophycideae, 976 oscillarioides BECID22 Nostocales; Nostocaceae; Anabaena Nostocales, Nostocaceae 107 Anabaena Bacteria; Cyanobacteria; Nostocophycideae, 977 oscillarioides BECID32 Nostocales; Nostocaceae; Anabaena Nostocales, Nostocaceae Anabaena 107 oscillarioides str. 
'BO Bacteria; Cyanobacteria; Nostocophycideae, 978 HINDAK 1984/43' Nostocales; Nostocaceae; Anabaena Nostocales, Nostocaceae Bacteria; Cyanobacteria; 287 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 777 AB2002/16 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 295 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 002 AB2002/17 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 296 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 149 AB2002/34 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 287 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 412 AB2002/35 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 290 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 475 AB2002/36 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 289 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 237 AB2002/37 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 298 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 390 AB2006/20 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 290 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 545 NIVA-CYA 494 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 300 Anabaenopsis elenkinii Nostocales; Nostocaceae; Nostocophycideae, 296 NIVA-CYA 501 Anabaenopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 247 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 999 aquae Aphanizomenon Nostocales, Nostocaceae 303 Aphanizomenon flos- Bacteria; Cyanobacteria; Nostocophycideae,

561 aquae 1042 Nostocales; Nostocaceae; Nostocales, Nostocaceae Aphanizomenon Bacteria; Cyanobacteria; 332 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 526 aquae 176 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 992 aquae 1tu26s2 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 989 aquae 1tu29s19 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 993 aquae 1tu37s13 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 350 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 292 aquae 617 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 327 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 733 aquae A1 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 344 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 565 aquae A4 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 315 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 342 aquae A5 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 315 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 826 aquae A7 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 348 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 194 aquae A8 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 346 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 100 aquae AFA-1 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 321 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 734 aquae AFA-2 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 341 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 347 aquae AFA-3 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 336 
Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 646 aquae AFA-4 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 304 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 133 aquae AFA-5 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 355 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 112 aquae AFA-6 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 331 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 037 aquae AFA-7 Aphanizomenon Nostocales, Nostocaceae

Bacteria; Cyanobacteria; 231 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 20 aquae NIES-81 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 283 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 60 aquae NIES-81 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 311 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 44 aquae PCC 7905 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 501 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 00 aquae PCC 7905 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 730 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 17 aquae PMC9401 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 374 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 15 aquae PMC9706 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 743 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 63 aquae PMC9707 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 903 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 79 aquae str. 'Aph Inba' Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 899 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 75 aquae str. 'Aph K2' Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 896 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 17 aquae str. 'Aph Ku' Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 883 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 20 aquae str. 'Aph Zayi' Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 716 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 10 aquae TR183 Aphanizomenon Nostocales, Nostocaceae Aphanizomenon flos- Bacteria; Cyanobacteria; 206 aquae var. 
klebahnii Nostocales; Nostocaceae; Nostocophycideae, 81 218 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 401 Aphanizomenon flos- Nostocales; Nostocaceae; Nostocophycideae, 48 aquae var. klebahnii 83 Aphanizomenon Nostocales, Nostocaceae Bacteria; Cyanobacteria; 295 Nostocales; Nostocaceae; Nostocophycideae, 06 Cyanospira rippkae Cyanospira Nostocales, Nostocaceae Bacteria; Cyanobacteria; 310 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 1 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 310 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 2 raciborskii Cylindrospermopsis Nostocales, Nostocaceae 557 Cylindrospermopsis Bacteria; Cyanobacteria; Nostocophycideae, 28 raciborskii Nostocales; Nostocaceae; Nostocales, Nostocaceae

Cylindrospermopsis Bacteria; Cyanobacteria; 560 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 40 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 18 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 21 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 22 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 23 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 24 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 25 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 26 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 27 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 28 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 29 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 30 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 31 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 
Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 32 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 33 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 34 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 35 raciborskii Cylindrospermopsis Nostocales, Nostocaceae 608 Cylindrospermopsis Bacteria; Cyanobacteria; Nostocophycideae,

38 raciborskii Nostocales; Nostocaceae; Nostocales, Nostocaceae Cylindrospermopsis Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 40 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 608 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 43 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 710 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 65 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 726 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 36 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 182 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 333 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 505 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 678 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 836 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 101 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 997 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 102 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 144 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 102 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 287 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 102 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 447 raciborskii 
Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 102 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 603 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 105 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 958 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 105 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 209 raciborskii Cylindrospermopsis Nostocales, Nostocaceae

Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 264 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 371 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 418 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 537 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 692 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 845 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 106 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 989 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 147 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 291 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 440 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 107 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 582 raciborskii Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 309 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 9 raciborskii AWT205 Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 542 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 259 raciborskii BM Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 252 
Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 250 raciborskii FAS-C1 Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 568 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 411 raciborskii KLL07 Cylindrospermopsis Nostocales, Nostocaceae Bacteria; Cyanobacteria; 103 Cylindrospermopsis Nostocales; Nostocaceae; Nostocophycideae, 268 raciborskii PMC98.14 Cylindrospermopsis Nostocales, Nostocaceae Cylindrospermopsis Bacteria; Cyanobacteria; 238 raciborskii Nostocales; Nostocaceae; Nostocophycideae, 157 QHSS/NR/CYL/03 Cylindrospermopsis Nostocales, Nostocaceae 321 Cylindrospermopsis Bacteria; Cyanobacteria; Nostocophycideae, 218 raciborskii T3 Nostocales; Nostocaceae; Nostocales, Nostocaceae

Cylindrospermopsis 199 Bacteria; Cyanobacteria; Nostocophycideae, 48 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 249 Bacteria; Cyanobacteria; Nostocophycideae, 71 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 281 Bacteria; Cyanobacteria; Nostocophycideae, 70 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 321 Bacteria; Cyanobacteria; Nostocophycideae, 23 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 341 Bacteria; Cyanobacteria; Nostocophycideae, 33 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 345 Bacteria; Cyanobacteria; Nostocophycideae, 06 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 354 Bacteria; Cyanobacteria; Nostocophycideae, 91 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 378 Bacteria; Cyanobacteria; Nostocophycideae, 00 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 409 Bacteria; Cyanobacteria; Nostocophycideae, 79 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 445 Bacteria; Cyanobacteria; Nostocophycideae, 12 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 464 Bacteria; Cyanobacteria; Nostocophycideae, 66 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 469 Bacteria; Cyanobacteria; Nostocophycideae, 57 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 490 Bacteria; Cyanobacteria; Nostocophycideae, 64 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 517 Bacteria; Cyanobacteria; Nostocophycideae, 99 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 520 Bacteria; Cyanobacteria; Nostocophycideae, 46 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 703 Bacteria; Cyanobacteria; 
Nostocophycideae, 23 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 715 Bacteria; Cyanobacteria; Nostocophycideae, 21 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 718 Bacteria; Cyanobacteria; Nostocophycideae, 03 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 721 Bacteria; Cyanobacteria; Nostocophycideae, 77 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 732 Bacteria; Cyanobacteria; Nostocophycideae, 30 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 740 Bacteria; Cyanobacteria; Nostocophycideae, 14 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 109 Bacteria; Cyanobacteria; Nostocophycideae, 540 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 109 Bacteria; Cyanobacteria; Nostocophycideae, 893 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 109 Bacteria; Cyanobacteria; Nostocophycideae, 942 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 109 Bacteria; Cyanobacteria; Nostocophycideae, 945 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 110 Bacteria; Cyanobacteria; Nostocophycideae, 334 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae

110 Bacteria; Cyanobacteria; Nostocophycideae, 377 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 110 Bacteria; Cyanobacteria; Nostocophycideae, 628 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 110 Bacteria; Cyanobacteria; Nostocophycideae, 904 Nodularia spumigena Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 186 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 051 CCY9414 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 193 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 916 CCY9414 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 544 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 785 GSL023 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 148 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 253 PCC 73104 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 356 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 05 PCC 73104/1 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 203 Nodularia spumigena Bacteria; Cyanobacteria; Nostocophycideae, 40 PCC 9350 Nostocales; Nostocaceae; Nodularia Nostocales, Nostocaceae 633 Bacteria; Cyanobacteria; Nostocophycideae, 40 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 778 Bacteria; Cyanobacteria; Nostocophycideae, 66 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 831 Bacteria; Cyanobacteria; Nostocophycideae, 83 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 102 Bacteria; Cyanobacteria; Nostocophycideae, 224 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 109 Bacteria; Cyanobacteria; Nostocophycideae, 602 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 110 Bacteria; Cyanobacteria; Nostocophycideae, 216 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 110 Bacteria; Cyanobacteria; Nostocophycideae, 
233 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 311 Bacteria; Cyanobacteria; Nostocophycideae, 353 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 312 Bacteria; Cyanobacteria; Nostocophycideae, 035 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 312 Bacteria; Cyanobacteria; Nostocophycideae, 135 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 327 Bacteria; Cyanobacteria; Nostocophycideae, 757 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 331 Bacteria; Cyanobacteria; Nostocophycideae, 170 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 333 Bacteria; Cyanobacteria; Nostocophycideae, 022 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 346 Bacteria; Cyanobacteria; Nostocophycideae, 283 Nostoc commune Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 148 Nostoc commune Bacteria; Cyanobacteria; Nostocophycideae, 113 '0'Brien 02011101' Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 251 Nostoc commune Bacteria; Cyanobacteria; Nostocophycideae, 316 AHNG0605 Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 812 Nostoc commune Bacteria; Cyanobacteria; Nostocophycideae, 94 UTEX 584 Nostocales; Nostocaceae; Nostoc Nostocales, Nostocaceae 113 Raphidiopsis curvata Bacteria; Cyanobacteria; Nostocophycideae,


438 HB1 Nostocales; Nostocaceae; Nostocales, Nostocaceae Raphidiopsis Bacteria; Cyanobacteria; 141 Nostocales; Nostocaceae; Nostocophycideae, 300 Trichormus variabilis Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 141 Nostocales; Nostocaceae; Nostocophycideae, 547 Trichormus variabilis Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 142 Nostocales; Nostocaceae; Nostocophycideae, 847 Trichormus variabilis Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 143 Nostocales; Nostocaceae; Nostocophycideae, 583 Trichormus variabilis Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 143 Nostocales; Nostocaceae; Nostocophycideae, 648 Trichormus variabilis Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 108 Trichormus variabilis Nostocales; Nostocaceae; Nostocophycideae, 009 str. 'GREIFSWALD' Trichormus Nostocales, Nostocaceae Bacteria; Cyanobacteria; 108 Trichormus variabilis Nostocales; Nostocaceae; Nostocophycideae, 007 str. 'HINDAK 2001/4' Trichormus Nostocales, Nostocaceae 527 Wollea saccata ACCS Bacteria; Cyanobacteria; Nostocophycideae, 300 045 Nostocales; Nostocaceae; Wollea Nostocales, Nostocaceae 105 Nostochopsis lobatus Bacteria; Cyanobacteria; Nostocophycideae, 355 92.1 Stigonematales; Nostochopsis Nostocales, Nostochopsaceae Bacteria; Cyanobacteria; 158 Nostocales; Scytonemataceae; Nostocophycideae, 785 Brasilonema bromeliae Brasilonema Nostocales, Scytonemataceae Bacteria; Cyanobacteria; 136 Nostocales; Scytonemataceae; Nostocophycideae, 174 Scytonema hofmanni Scytonema Nostocales, Scytonemataceae Bacteria; Cyanobacteria; 218 Scytonema hofmanni Nostocales; Scytonemataceae; Nostocophycideae, 884 PCC 7110 Scytonema Nostocales, Scytonemataceae Oscillatoriophycideae, 129 Crocosphaera watsonii Bacteria; Cyanobacteria; , 891 WH 8501 Chroococcales; Crocosphaera Cyanobacteriaceae Oscillatoriophycideae, 137 Crocosphaera watsonii Bacteria; Cyanobacteria; Chroococcales, 516 WH 8501 Chroococcales; Crocosphaera 
Cyanobacteriaceae Oscillatoriophycideae, 323 Cyanobacterium Bacteria; Cyanobacteria; Chroococcales, 1 stanieri PCC 7202 Chroococcales; Cyanobacterium Cyanobacteriaceae Oscillatoriophycideae, 226 Cyanobacterium Bacteria; Cyanobacteria; Chroococcales, 595 stanieri PCC 7202 Chroococcales; Cyanobacterium Cyanobacteriaceae Oscillatoriophycideae, 288 Stanieria cyanosphaera Bacteria; Cyanobacteria; Chroococcales, 76 str. PCC 7437 Pleurocapsales; Stanieria Dermocarpellaceae 482 Stanieria cyanosphaera Bacteria; Cyanobacteria; Oscillatoriophycideae, 80 str. PCC 7437 Pleurocapsales; Stanieria Chroococcales,


Dermocarpellaceae Oscillatoriophycideae, 149 Snowella rosea Bacteria; Cyanobacteria; Chroococcales, 375 1LM40S01 Chroococcales; Snowella Gomphosphaeriaceae Oscillatoriophycideae, 147 Woronichinia Bacteria; Cyanobacteria; Chroococcales, 523 naegeliana 0LE35S01 Chroococcales; Woronichinia Gomphosphaeriaceae Oscillatoriophycideae, 311 Bacteria; Cyanobacteria; Chroococcales, 4 aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 313 Bacteria; Cyanobacteria; Chroococcales, 2 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 316 Bacteria; Cyanobacteria; Chroococcales, 7 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 317 Bacteria; Cyanobacteria; Chroococcales, 1 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 163 Bacteria; Cyanobacteria; Chroococcales, 45 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 176 Bacteria; Cyanobacteria; Chroococcales, 17 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 177 Bacteria; Cyanobacteria; Chroococcales, 55 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 185 Bacteria; Cyanobacteria; Chroococcales, 94 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 186 Bacteria; Cyanobacteria; Chroococcales, 22 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 216 Bacteria; Cyanobacteria; Chroococcales, 64 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 246 Bacteria; Cyanobacteria; Chroococcales, 27 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 267 Bacteria; Cyanobacteria; Chroococcales, 66 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 276 Bacteria; Cyanobacteria; Chroococcales, 06 Microcystis aeruginosa Chroococcales; 
Microcystis Microcystaceae Oscillatoriophycideae, 276 Bacteria; Cyanobacteria; Chroococcales, 62 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 286 Bacteria; Cyanobacteria; Chroococcales, 51 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae 292 Microcystis aeruginosa Bacteria; Cyanobacteria; Oscillatoriophycideae,


30 Chroococcales; Microcystis Chroococcales, Microcystaceae Oscillatoriophycideae, 345 Bacteria; Cyanobacteria; Chroococcales, 24 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 354 Bacteria; Cyanobacteria; Chroococcales, 67 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 378 Bacteria; Cyanobacteria; Chroococcales, 28 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 391 Bacteria; Cyanobacteria; Chroococcales, 02 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 405 Bacteria; Cyanobacteria; Chroococcales, 36 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 405 Bacteria; Cyanobacteria; Chroococcales, 48 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 406 Bacteria; Cyanobacteria; Chroococcales, 47 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 410 Bacteria; Cyanobacteria; Chroococcales, 65 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 494 Bacteria; Cyanobacteria; Chroococcales, 23 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 509 Bacteria; Cyanobacteria; Chroococcales, 35 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 527 Bacteria; Cyanobacteria; Chroococcales, 26 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 168 Bacteria; Cyanobacteria; Chroococcales, 090 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 185 Bacteria; Cyanobacteria; Chroococcales, 182 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 272 Bacteria; Cyanobacteria; Chroococcales, 805 Microcystis aeruginosa Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 142 Microcystis 
aeruginosa Bacteria; Cyanobacteria; Chroococcales, 308 0BB35S02 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 143 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 701 0BF29S01 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 143 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 305 0BF29S03 Chroococcales; Microcystis Microcystaceae


Oscillatoriophycideae, 141 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 906 1BB38S07 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 576 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 831 MCYS-LB01 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 557 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 478 MCYS-LB02 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 545 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 034 NIES-101 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 549 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 713 NIES-298 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 471 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 225 NIES-843 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 471 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 226 NIES-843 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 313 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 6 NIES-89 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 162 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 620 NIES-90 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 311 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 2 NIES-98 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 323 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 6 NIES-98 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 311 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 839 NPCD-1 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 311 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 3 PCC 7005 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 319 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 1 PCC 7806 Chroococcales; 
Microcystis Microcystaceae Oscillatoriophycideae, 477 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 59 PCC 7806 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 324 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 1 PCC 7820 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 395 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 73 PCC 7820 Chroococcales; Microcystis Microcystaceae 311 Microcystis aeruginosa Bacteria; Cyanobacteria; Oscillatoriophycideae, 9 PCC 7941 Chroococcales; Microcystis Chroococcales,


Microcystaceae Oscillatoriophycideae, 714 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 54 PCC 7941 Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 525 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 03 UTEX 'B 2667' Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 168 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 511 UTEX 'LB 2388' Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 487 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 96 UTEX 'LB 2664' Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 292 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 57 UWOCC MRC Chroococcales; Microcystis Microcystaceae Oscillatoriophycideae, 518 Microcystis aeruginosa Bacteria; Cyanobacteria; Chroococcales, 05 UWOCC MRD Chroococcales; Microcystis Microcystaceae 143 Spirulina major Bacteria; Cyanobacteria; Oscillatoriophycideae, 320 0BB22S09 Oscillatoriales; Spirulina Chroococcales, Spirulinaceae 141 Spirulina major Bacteria; Cyanobacteria; Oscillatoriophycideae, 039 0BB36S18 Oscillatoriales; Spirulina Chroococcales, Spirulinaceae Oscillatoriophycideae, 248 Starria zimbabweensis Bacteria; Cyanobacteria; Oscillatoriales, 773 SAG 74.90 Oscillatoriales; Starria Gomontiellaceae Oscillatoriophycideae, 107 Lyngbya cf. 
Bacteria; Cyanobacteria; Oscillatoriales, 506 confervoides VP0401 Oscillatoriales; Lyngbya Oscillatoriaceae Oscillatoriophycideae, 706 Oscillatoria princeps Bacteria; Cyanobacteria; Oscillatoriales, 63 NIVA-CYA 150 Oscillatoriales; Oscillatoria Oscillatoriaceae Oscillatoriophycideae, 340 Bacteria; Cyanobacteria; Oscillatoriales, 24 Microcoleus vaginatus Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 278 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 966 CJI-U2-KK1 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 279 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 336 CJI-U2-KK2 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 278 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 861 CNP3-KK2 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 277 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 899 CSI-U-KK1 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 278 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 248 CSU-U-KK1 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 277 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 831 NV1-KK1 Oscillatoriales; Microcoleus Phormidiaceae


Oscillatoriophycideae, 277 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 630 PBP-D-KK1 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 277 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 798 SAG 2211 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 279 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 350 SEV1-KK3 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 278 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 824 SNM1-KK1 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 279 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 162 SRS1-KK2 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 279 Microcoleus vaginatus Bacteria; Cyanobacteria; Oscillatoriales, 170 UBI-KK2 Oscillatoriales; Microcoleus Phormidiaceae Oscillatoriophycideae, 562 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 32 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 584 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 34 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 584 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 35 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 584 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 36 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 584 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 39 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 584 Planktothricoides Bacteria; Cyanobacteria; Oscillatoriales, 45 raciborskii Oscillatoriales; Planktothricoides Phormidiaceae Oscillatoriophycideae, 306 Bacteria; Cyanobacteria; Oscillatoriales, 6 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 559 Bacteria; Cyanobacteria; Oscillatoriales, 38 
Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 40 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 41 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 42 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae 584 Bacteria; Cyanobacteria; Oscillatoriophycideae, 43 Planktothrix agardhii Oscillatoriales; Planktothrix Oscillatoriales,


Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 92 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 94 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 95 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Bacteria; Cyanobacteria; Oscillatoriales, 96 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 809 Bacteria; Cyanobacteria; Oscillatoriales, 34 Planktothrix agardhii Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 355 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 849 213 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 97 CCAP 1459/11A Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 560 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 79 CCAP 1459/21 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 44 CCAP 1460/5 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 554 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 670 HAB001 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 561 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 941 HAB113 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 533 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 615 HAB1448 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 585 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 820 HAB202 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 540 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 742 HAB203 
Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 576 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 906 HAB204 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 564 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 965 HAB205 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 563 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 822 HAB206 Oscillatoriales; Planktothrix Phormidiaceae 583 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriophycideae,


754 HAB207 Oscillatoriales; Planktothrix Oscillatoriales, Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 615 HAB208 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 568 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 285 HAB209 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 541 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 078 HAB210 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 583 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 208 HAB212 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 571 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 027 HAB217 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 511 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 581 HAB218 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 554 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 001 HAB236 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 538 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 366 HAB237 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 562 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 752 HAB240 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 553 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 162 HAB241 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 585 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 254 HAB243 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 550 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 088 HAB326 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 513 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 856 HAB602 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 590 
Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 948 HAB604 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 540 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 245 HAB605 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 580 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 867 HAB612 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 585 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 528 HAB613 Oscillatoriales; Planktothrix Phormidiaceae


Oscillatoriophycideae, 539 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 699 HAB619 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 553 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 638 HAB631 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 512 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 520 HAB633 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 569 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 665 HAB635 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 561 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 867 HAB638 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 580 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 239 HAB645 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 594 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 150 HAB655 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 87 NIVA-CYA 10 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 89 NIVA-CYA 105 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 83 NIVA-CYA 11 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 86 NIVA-CYA 116 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 85 NIVA-CYA 117/3 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 335 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 34 NIVA-CYA 126 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; 
Cyanobacteria; Oscillatoriales, 84 NIVA-CYA 126 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 537 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 796 NIVA-CYA 126 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 82 NIVA-CYA 127 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 737 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 93 NIVA-CYA 127 Oscillatoriales; Planktothrix Phormidiaceae 257 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriophycideae, 78 NIVA-CYA 128 Oscillatoriales; Planktothrix Oscillatoriales,


Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 79 NIVA-CYA 133 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 78 NIVA-CYA 137 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 73 NIVA-CYA 15 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 74 NIVA-CYA 168 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 70 NIVA-CYA 21 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 68 NIVA-CYA 229 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 66 NIVA-CYA 29 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 67 NIVA-CYA 297 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 65 NIVA-CYA 30 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 64 NIVA-CYA 313 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 61 NIVA-CYA 34 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 60 NIVA-CYA 56/3 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 57 NIVA-CYA 59 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 
Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 55 NIVA-CYA 61/1 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 56 NIVA-CYA 64/6 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 58 NIVA-CYA 65 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 552 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 81 NIVA-CYA 68 Oscillatoriales; Planktothrix Phormidiaceae 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriophycideae,


49 NIVA-CYA 88/3 Oscillatoriales; Planktothrix Oscillatoriales, Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 48 NIVA-CYA 9 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 584 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 46 NIVA-CYA 91 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 574 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 891 PCC 10606 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 547 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 470 PCC 9239 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 538 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 455 PCC 9625 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 572 Planktothrix agardhii Bacteria; Cyanobacteria; Oscillatoriales, 526 PCC 9637 Oscillatoriales; Planktothrix Phormidiaceae Oscillatoriophycideae, 585 Pseudoscillatoria Bacteria; Cyanobacteria; Oscillatoriales, 449 coralii BgP10_4S Oscillatoriales; Pseudoscillatoria Phormidiaceae Oscillatoriophycideae, 305 Trichodesmium Bacteria; Cyanobacteria; Oscillatoriales, 2 erythraeum Oscillatoriales; Trichodesmium Phormidiaceae Oscillatoriophycideae, 137 Trichodesmium Bacteria; Cyanobacteria; Oscillatoriales, 852 erythraeum IMS101 Oscillatoriales; Trichodesmium Phormidiaceae Oscillatoriophycideae, 137 Trichodesmium Bacteria; Cyanobacteria; Oscillatoriales, 926 erythraeum IMS101 Oscillatoriales; Trichodesmium Phormidiaceae 412 Chroococcidiopsis Bacteria; Cyanobacteria; Oscillatoriophycideae,Chrooc 07 thermalis PCC 7203 Pleurocapsales; Chroococcidiopsis occales, Xenococcaceae Synechococcophycideae, 112 Bacteria; Cyanobacteria; , 359 Acaryochloris marina Acaryochloris Acaryochloridaceae Synechococcophycideae, 575 Acaryochloris marina Bacteria; Cyanobacteria; Synechococcales, 49 MBIC11017 Acaryochloris Acaryochloridaceae Synechococcophycideae, 107 
Acaryochloris marina Bacteria; Cyanobacteria; Synechococcales, 807 MBIC11017 Acaryochloris Acaryochloridaceae Synechococcophycideae, 238 Acaryochloris marina Bacteria; Cyanobacteria; Synechococcales, 740 MBIC11017 Acaryochloris Acaryochloridaceae Synechococcophycideae, 238 Acaryochloris marina Bacteria; Cyanobacteria; Synechococcales, 741 MBIC11017 Acaryochloris Acaryochloridaceae Synechococcophycideae, 253 Acaryochloris marina Bacteria; Cyanobacteria; Synechococcales, 957 MBIC11017 Acaryochloris Acaryochloridaceae 253 Acaryochloris marina Bacteria; Cyanobacteria; Synechococcophycideae,


958 MBIC11017 Acaryochloris Synechococcales, Acaryochloridaceae
3335 | Cyanobium gracile PCC 6307 | Bacteria; Cyanobacteria; Chroococcales; Cyanobium | Synechococcophycideae, Synechococcales, Synechococcaceae
49782 | Prochlorococcus marinus str. EQPAC1 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3351 | Prochlorococcus marinus str. MIT 9201 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3352 | Prochlorococcus marinus str. MIT 9202 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3353 | Prochlorococcus marinus str. MIT 9211 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3376 | Prochlorococcus marinus str. MIT 9302 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3347 | Prochlorococcus marinus str. MIT 9303 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3377 | Prochlorococcus marinus str. MIT 9312 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
141567 | Prochlorococcus marinus str. MIT 9312 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
142421 | Prochlorococcus marinus str. MIT 9312 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3348 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
92726 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
93043 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
93066 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
128676 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
563954 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
563955 | Prochlorococcus marinus str. MIT 9313 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
250086 | Prochlorococcus marinus str. NATL2A | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
38890 | Prochlorococcus marinus str. TAK9803-2 | Bacteria; Cyanobacteria; Prochlorales; Prochlorococcaceae; Prochlorococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
159887 | elongatus CCMP1630 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3299 | Synechococcus elongatus PCC 6301 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
110298 | Synechococcus elongatus PCC 6301 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
111342 | Synechococcus elongatus PCC 6301 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
128998 | Synechococcus elongatus PCC 6301 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
129035 | Synechococcus elongatus PCC 6301 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
3303 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
28050 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
142789 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
142872 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
142995 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
142996 | Synechococcus elongatus PCC 7942 | Bacteria; Cyanobacteria; Chroococcales; Synechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
55947 | Thermosynechococcus elongatus BP-1 | Bacteria; Cyanobacteria; Chroococcales; Thermosynechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
107856 | Thermosynechococcus elongatus BP-1 | Bacteria; Cyanobacteria; Chroococcales; Thermosynechococcus | Synechococcophycideae, Synechococcales, Synechococcaceae
239741 | Arthronema africanum SAG 12.89 | Bacteria; Cyanobacteria; Oscillatoriales; Arthronema | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
49604 | Halomicronema excentricum str. TFEP1 | Bacteria; Cyanobacteria; Oscillatoriales; Halomicronema | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
3281 | Leptolyngbya boryana | Bacteria; Cyanobacteria; Oscillatoriales; Leptolyngbya | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
146996 | Leptolyngbya boryana | Bacteria; Cyanobacteria; Oscillatoriales; Leptolyngbya | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
99033 | Limnothrix redekei 007a | Bacteria; Cyanobacteria; Oscillatoriales; Limnothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
99196 | Limnothrix redekei 165a | Bacteria; Cyanobacteria; Oscillatoriales; Limnothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
99344 | Limnothrix redekei 165c | Bacteria; Cyanobacteria; Oscillatoriales; Limnothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
95607 | Limnothrix redekei CCAP 1443/1 | Bacteria; Cyanobacteria; Oscillatoriales; Limnothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
58469 | Limnothrix redekei NIVA-CYA 227/1 | Bacteria; Cyanobacteria; Oscillatoriales; Limnothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
3301 | Prochlorothrix hollandica | Bacteria; Cyanobacteria; Prochlorales; Prochlorothrichaceae; Prochlorothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
3306 | Prochlorothrix hollandica | Bacteria; Cyanobacteria; Prochlorales; Prochlorothrichaceae; Prochlorothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae
219119 | Prochlorothrix hollandica SAG 10.89 | Bacteria; Cyanobacteria; Prochlorales; Prochlorotrichaceae; Prochlorothrix | Synechococcophycideae, Pseudanabaenales, Pseudanabaenaceae


E. Supplemental Figure for chapter VIII

Figure S8.1: Principal Coordinates Analysis of Pearson correlation coefficients between tip-to-tip distances in pairs of phylogenetic trees constructed from differentially sliced alignments, including trees generated from full-length sequences and reads beginning in the V3 and V4 regions only. (A) Points are colored by amplicon length. (B) Points are colored by the first variable region encountered in the differentially sliced alignments. In both panels, points representing trees generated from full-length sequences are circled in white to indicate their position when obscured by other points.
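The comparison summarized in the caption above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to produce the figure: it assumes each tree has already been reduced to a square tip-to-tip distance matrix with tips in the same order, it converts each Pearson coefficient r into a dissimilarity as 1 - r (an assumption; the caption does not state the conversion), and it uses a plain NumPy classical PCoA. The function names `tip_to_tip_correlation` and `pcoa` and the toy matrices are hypothetical.

```python
import numpy as np


def tip_to_tip_correlation(d1, d2):
    """Pearson r between the upper triangles of two tip-to-tip distance matrices.

    Assumes both matrices are square, symmetric, and share the same tip order.
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    upper = np.triu_indices_from(d1, k=1)
    return np.corrcoef(d1[upper], d2[upper])[0, 1]


def pcoa(dissimilarity, n_axes=2):
    """Classical PCoA: Gower-center the squared dissimilarities, then eigendecompose."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # 1 - r need not be Euclidean, so negative eigenvalues are clipped to zero.
    kept = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(kept)


# Toy tip-to-tip distance matrices standing in for three trees (hypothetical data).
tree_a = np.array([[0, 1, 2, 3],
                   [1, 0, 1, 2],
                   [2, 1, 0, 1],
                   [3, 2, 1, 0]], dtype=float)
tree_b = 2.0 * tree_a  # same topology, longer branches
tree_c = np.array([[0, 3, 1, 2],
                   [3, 0, 2, 1],
                   [1, 2, 0, 3],
                   [2, 1, 3, 0]], dtype=float)

trees = [tree_a, tree_b, tree_c]
n_trees = len(trees)
corr = np.array([[tip_to_tip_correlation(trees[i], trees[j])
                  for j in range(n_trees)]
                 for i in range(n_trees)])

# One point per tree; trees with similar tip-to-tip distances land close together.
coords = pcoa(1.0 - corr)
```

Because Pearson correlation is invariant to uniform rescaling of branch lengths, `tree_a` and `tree_b` map to the same point, while `tree_c` falls elsewhere.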