A Population Study of the Upper Airway Microbiota in Busselton, Western Australia

ELENA MIROSLAVOVNA TUREK

National Heart and Lung Institute,

Faculty of Medicine,

Imperial College London

A Thesis Submitted for the Degree of Doctor of Philosophy

Abstract

Whether healthy lungs are sterile has been a subject of much debate. Recent studies employing culture-independent techniques have focused on the diseased lung, with little still being done to characterise the microbial communities which sustain the healthy lung ecosystem.

The Busselton Health Study (BHS) is a long-running epidemiological study that focuses on defining and characterizing common diseases. Oropharyngeal (throat) swabs were recently added to the BHS sample collection. The main aims of this thesis were therefore to assess the upper airway microbiome in healthy individuals and to evaluate changes in bacterial composition under disease and environmental stress (smoking). A random population sample of 529 participants was studied, consisting of 60 current-smokers, 216 ex-smokers and 253 never-smokers, of whom 77 were asthmatics, 46 had gastro-esophageal reflux disease (GERD), 19 had diabetes and 387 were healthy individuals. Bacterial DNA was extracted from all swabs, and 16S rRNA gene qPCR and next generation sequencing were performed.

The results indicated that airways of healthy individuals contain a diverse collection of bacteria. Through application of weighted correlation network analysis (WCNA), these bacteria were seen to work together in tight networks. As the BHS cohort is a general population sample no severe disease phenotypes were noted, hence limited changes in bacterial community structure of participants with disease compared to healthy participants were seen.

Smoke exposure, however, had a profound effect on the microbiome, with current-smokers exhibiting a significant drop in bacterial burden and diversity, and a change in community structure. There was depletion of Haemophilus spp. and Neisseria spp. in current-smokers, and a strong positive relationship with streptococci. Notably, within the first few years after cessation of smoking the bacterial community of the airways appears to regenerate, with an increase in community diversity.

A streptococci specific quantification assay and sequencing method (based on the map gene) was established in order to dissect further the relationship at the species level of the genus with current-smoking. This identified an increased dominance of Streptococcus salivarius (an opportunistic pathogen) in current-smokers.

In conclusion, a healthy airway microbiome contains a rich community of microorganisms. This is significantly impacted by smoking, leading to loss of diversity and a community dominated by streptococci. Diversity however can be restored if smoking cessation occurs.

Declaration of Originality

I declare that I completed this thesis, performed all the experiments and data analysis described herein myself, except where appropriately referenced.

Elena Turek

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Acknowledgements

I would like to begin by thanking my supervisors, Professors Miriam Moffatt and William Cookson. I came to you as a complete novice in everything science related, and with a lot of patience, guidance and support you have moulded me into the person I am today. Someone who has truly suffered my journey through science is Dr Mike Cox and I’m forever grateful.

I would like to thank the Busselton team, Professors Bill Musk, Alan James, Matthew Knuiman and Matthew Hunter, who have provided us with this incredible study cohort, and have been very helpful in getting information needed.

I would like to thank all the other members of the Molgen group, past and present, who have been there in all my experiences. A special credit goes to Dr Phillip James, who taught me how to google and a lot of R, and who still keeps me on my toes when he’s bored in the lab. I would like to thank Dr Saffron Willis Owen for helping me with all the network analysis (Chapter 4, WCNA); your knowledge of statistics and coding is inspiring. I would like to thank Dr Huw Christopher Ellis for keeping me sane, and Dr Claire McBrien for undoing Huw’s hard work and making me sign up for a marathon. Finally, I wish to thank all the people that have made the work environment so enjoyable, for their discussions, time and a lot of cake, in particular Leah, Sharon, Giovanna, Verdiana, Colin and Kenny who never fails to deliver. Over the years you have all become my science family.

Next, I would like to thank my jitsu family, who have kept me broken and were always there when I needed to blow off steam. Constantinos, you are one of my oldest friends; somehow you managed to pick up the pieces of me and guide them through the brown and black belt gradings. Your faith and belief powered me to smash through all the obstacles. Imperial Jitsu, you have all been my light at the end of a long day in the lab. Shaan, you are probably the world’s second-best president (after VVP). Also, Leo, Tilly, Clement, Arnie, Ian B, Alex M, Barry and Neil for being awesome.

I would also like to thank my family. My wonderful husband Dr Vladimir Turek, who is my love, my rock, my soulmate and my best friend. My son Misha, who is my light and the most incredible boy in the world, and with this prepare to have a new bed time story book. My new set of parents (my in-laws) and extended siblings, Yulya, Yura, and especially Tatiana, who have all welcomed me with open arms. Thank you to my wonderful auntie Nadia and sister Catherine. Lastly, a special mention to my best friend, Dr Natalia Yankova, who is like a sister to me.

Finally, I would like to dedicate this doctorate thesis to the greatest woman on earth, who single-handedly brought me up; you are the most inspiring, loving, caring mother and now grandmother. You have always taught me to be my very best, pushed me to work harder than humanly possible and that with hard work anything can be achieved.

I owe everything I am to you and I thank you, моя мамочка.

Table of Contents

ABSTRACT
DECLARATION OF ORIGINALITY
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
FIGURES AND TABLES
ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Background
    1.1.1 The Microbial World
    1.1.2 Sequencing Technology
    1.1.3 Bacterial Identification
1.2 Human Microbiome
    1.2.1 Overview of the Human Microbiome
    1.2.2 Respiratory System
        1.2.2.1 Healthy Respiratory Microbiome
        1.2.2.2 Asthma
        1.2.2.3 Cigarette Smoke
        1.2.2.4 Diabetes
        1.2.2.5 GERD
    1.2.3 Challenges of Respiratory Microbiome Research
1.3 Busselton Health Study
1.4 Thesis Aims
1.5 Hypotheses

CHAPTER 2 MATERIALS AND METHODS
2.1 Materials and Reagents
2.2 Primers for PCR and Sequencing Reactions
    2.2.1 Quantitative PCR
    2.2.2 Quadruplicate PCR
    2.2.3 Illumina MiSeq Sequencing
2.3 16S rRNA Gene Bacterial Sequencing
    2.3.1 Sample Processing
        2.3.1.1 Sample Collection
        2.3.1.2 Bacterial DNA Extraction from Throat Swabs
        2.3.1.3 DNA Quantification using the Nanodrop
        2.3.1.4 Bacterial DNA Quantification by Quantitative PCR
    2.3.2 Library Preparation
        2.3.2.1 Quadruplicate PCR
        2.3.2.2 Gel Electrophoresis
        2.3.2.3 Amplicon Purification
        2.3.2.4 Amplicon Quantification using Picogreen
        2.3.2.5 Amplicon Gel Extraction and Purification
        2.3.2.6 Library Quantification
            (A) BioAnalyser
            (B) Quantitative PCR
            (C) Calculating Library Concentration
    2.3.3 Illumina MiSeq Sequencing
        2.3.3.1 MiSeq Sequencing
2.4 Sequence Processing using QIIME
2.5 Data Analysis
    2.5.1 General Data Processing
    2.5.2 Statistical Analyses: Diversity
        2.5.2.1 Alpha Diversity
            (A) Alpha Diversity Indices
            (B) Data Normality
            (C) Analysis of Variance
        2.5.2.2 Dissimilarity Measures
        2.5.2.3 Net Relatedness Index/ Nearest Taxon Index
    2.5.3 Statistical Analyses: Community Structure
        2.5.3.1 Relative Abundance
        2.5.3.2 Indicator Species Analysis
        2.5.3.3 Differential Expression Analysis for Sequence Count Data

CHAPTER 3 16S RRNA GENE SEQUENCING
3.1 Introduction
3.2 Methods
    3.2.1 Subjects and Oropharyngeal Swabs
    3.2.2 Quantitative PCR
    3.2.3 MiSeq Sequencing
        3.2.3.1 Sequencing Data: Extraction Kits
        3.2.3.2 Sequencing Data: Mock Community
        3.2.3.3 Sequencing Data: Batch Effect
        3.2.3.4 Sequencing Data: Identification of Contaminant OTUs
        3.2.3.5 Sequencing Data: Filtering and Normalisation
3.3 Results
    3.3.1 Demographics
    3.3.2 Characteristics of Population
        3.3.2.1 Common OTUs
        3.3.2.2 Whole Population: Alpha Diversity
        3.3.2.3 Whole Population: Beta Diversity
    3.3.3 Disease Phenotype
        3.3.3.1 Diversity
            (A) Disease Phenotype: Alpha Diversity
            (B) Disease Phenotype: Beta Diversity
            (C) Disease Phenotype: Community Phylogenetic Structure
        3.3.3.2 Community Structure
            (A) Disease Phenotype: Indicator Species Analysis
            (B) Disease Phenotype: DESeq2
    3.3.4 Smoking Phenotype
        3.3.4.1 Diversity
            (A) Smoking Phenotype: Alpha Diversity
            (B) Smoking Phenotype: Beta Diversity
            (C) Smoking Phenotype: Community Phylogenetic Structure
        3.3.4.2 Community Structure
            (A) Smoking Phenotype: Relative Abundance
            (B) Smoking Phenotype: Indicator Species Analysis
            (C) Smoking Phenotype: DESeq2
    3.3.5 Ex-Smoking Individuals
    3.3.6 Smoking Pack Years
3.4 Discussion

CHAPTER 4 WEIGHTED CORRELATION NETWORK ANALYSIS
4.1 Introduction
4.2 Pre-Processing and Details of Statistical Tests Implemented
    4.2.1 Data Transformation
    4.2.2 OTU filtration
    4.2.4 Detection of Outlying Samples
    4.2.5 Phenotype Selection
    4.2.6 Summary of Statistical Tests Implemented in Analysis
4.3 Aim 1: Bacterial Networks in the Airways
4.4 Aim 2: Network Construction and Module Detection
4.5 Aim 3: Module-Trait Relationships
4.6 Aim 4: Identification of Key ‘Hub’ Bacteria
4.7 Aim 5: Functional Inference through PICRUSt
    4.7.1 WCNA Analysis using Greengenes dataset
    4.7.2 Comparison with Original WCNA Results
    4.7.3 PICRUSt Analysis of Whole Dataset
    4.7.4 PICRUSt Analysis of WCNA Modules
4.8 Discussion

CHAPTER 5 STREPTOCOCCUS SPP. ANALYSIS
5.1 Introduction
    5.1.1 Molecular Typing
    5.1.2 Species Classification
5.2 Methods
    5.2.1 Experimental Design
        5.2.1.1 Sample Selection
        5.2.1.2 Primer Design
            (A) Quantitative PCR Primers
            (B) Quadruplicate PCR Primers
            (C) Illumina MiSeq Sequencing Primers
    5.2.2 Streptococcus Specific Sequencing
        5.2.2.1 Quantitative PCR
        5.2.2.2 Mock Community: Candidate Bacterial Species
        5.2.2.3 Library Preparation
        5.2.2.4 MiSeq Sequencing
    5.2.3 Sequence Processing in QIIME
        5.2.3.1 Clustering Level
        5.2.3.2 Streptococcal Specific Classifier
    5.2.4 Data Analysis
5.3 Results
    5.3.1 Quantitative PCR and Sequencing Data
        5.3.1.1 Streptococcal Bacterial Load
        5.3.1.2 Sequencing Quality Statistics
        5.3.1.3 Filtering and Normalization
            (A) Mock Communities
            (B) Oropharyngeal DNA Samples
        5.3.1.4 Summary of Analysis Workflow
    5.3.2 Whole Population Analysis
        5.3.2.1 Whole Population: Alpha Diversity
        5.3.2.2 Whole Population: Beta Diversity
        5.3.2.3 Common Streptococcal OTUs
    5.3.3 Smoking Phenotype
        5.3.3.1 Diversity
            (A) Alpha Diversity
            (B) Beta Diversity
        5.3.3.2 Community Structure
            (A) Relative Abundance Bar Plots
            (B) Indicator Species Analysis
            (C) Differential Expression Analysis for Sequence Count Data (DESeq2)
        5.3.3.3 Weighted Correlation Network Analysis (WCNA)
        5.3.3.4 Phylogenetic Visualisation
5.4 Discussion

CHAPTER 6: GENERAL DISCUSSION AND FUTURE WORKS
BIBLIOGRAPHY
APPENDICES

Figures and Tables

Chapter 1

Figure 1.1 Schematic diagram of the respiratory tract.

Chapter 2

Figure 2.1 Library preparation summary workflow.
Figure 2.2 Illumina MiSeq sequencing by synthesis workflow diagram.
Figure 2.3 QIIME bioinformatics summary workflow.
Figure 2.4 Visual representation of Species Richness and evenness.
Figure 2.5 Schematic representation of the phylogenetic tree.

Table 2.1 16S rRNA gene V4 primer sequences.
Table 2.2 Details of sequencing primers.
Table 2.3 Primers for V4 16S rRNA amplicon sequencing.
Table 2.4 Swab site details.
Table 2.5 Breakdown of the mock community members.
Table 2.6 R Studio packages used in data analysis.

Chapter 3

Figure 3.1 NMDS of samples according to their extraction method.
Figure 3.2 Comparison of samples according to the sequencing run.
Figure 3.3 Bacterial burden when comparing by smoking status.
Figure 3.4 Batch effect, bacterial burden and smoking status.
Figure 3.5 OTUs identified as potential contaminants.
Figure 3.6 Establishing the rarefaction level.
Figure 3.7 Summary of data generation and processing.
Figure 3.8 Top 50 (most abundant) bacterial OTUs found in the population.
Figure 3.9 Histogram of the mean per sample Bray-Curtis dissimilarity.
Figure 3.10 Summary flowchart for PERMANOVA analysis.
Figure 3.11 Comparison of Shannon’s Diversity by disease phenotype.
Figure 3.12 Disease status is related to aspects of community relatedness.
Figure 3.13 DESeq analysis: asthmatic vs. healthy non-smoking Individuals.
Figure 3.14 Venn diagram of OTUs counts for the three smoking phenotypes.
Figure 3.15 Comparison of Shannon’s Diversity by smoking phenotype.
Figure 3.16 Community phylogenetic structure: smoking status.
Figure 3.17 Fifty most abundant OTUs present in the data: smoking phenotypes.
Figure 3.18 DESeq analysis: never-smoking versus smoking individuals.
Figure 3.19 DESeq analysis: ex-smoking versus smoking individuals.
Figure 3.20 Shannon’s Diversity: smoking phenotype (ex-smoking sub-grouped).
Figure 3.21 Fifty most abundant OTUs (ex-smoking sub-grouped).
Figure 3.22 Assessing effects of pack years on Shannon’s Diversity.
Figure 3.23 Assessing effects of Pack Years on abundance of Streptococci Genus.

Table 3.1 Summary of selected study cohort.
Table 3.2 Bacterial burden ranges for each sequencing library batch.
Table 3.3 Sequencing results: quality metrics of the runs.
Table 3.4 Comparison of the mock community members.
Table 3.5 Dissimilarity between seven sequencing runs.
Table 3.6 Sample numbers of the final study cohort.
Table 3.7 Demographics for the final population of 529 individuals.
Table 3.8 List of the tested clinical variables.
Table 3.9 Significant clinical variables that cause variability at P < 0.01.
Table 3.10 PERMANOVA analysis: traits contributing to population variance.
Table 3.11 Alpha diversity across smoking phenotypes (Tukey’s HSD test P-values).
Table 3.12 Number of ex-smoking individuals.
Table 3.13 P-values from Tukey’s HSD test of the alpha diversity indices.
Table 3.14 Ex-smokers: cessation ≤ 10 years further sub divided.

Chapter 4

Figure 4.1 Summary flowchart of WCNA bacterial co-abundance network analysis.
Figure 4.2 Histograms displaying the different transformation methods.
Figure 4.3 Clustering dendrogram of samples based on their Euclidean distance.
Figure 4.4 Bacterial OTU network properties for different soft thresholds.
Figure 4.5 Hierarchical clustering tree of OTUs, with dissimilarity based on TOM.
Figure 4.6 Comparison of Midnightblue module members.
Figure 4.7 Pink Module: comparing smoking vs. non-smoking individuals.
Figure 4.8 Yellow (A) and Red (B) modules: smoking vs. non-smoking individuals.
Figure 4.9 PICRUSt analysis summary workflow.
Figure 4.10 Module-trait associations (Greengenes database).
Figure 4.11 Blue (A) and Yellow (B) modules: comparison between smoking vs. non-smoking individuals.
Figure 4.12 Top 50 (Most Abundant) functions (KEGG 3 level) found in BHS.
Figure 4.13 Abundance of metabolic pathways when comparing the eight individual bacterial modules identified through network analysis.
Figure 4.14 Comparison of the abundance of the twelve functions making up metabolic processes in the Blue and Yellow modules.
Figure 4.15 Heatmap with 46 KEGG3 functions identified as significantly different between Blue and Yellow modules.

Table 4.1 Description of the identified modules.
Table 4.2 Module-trait associations significant at a 5% threshold.
Table 4.3 Summary of the Midnight blue module members (Silva database).
Table 4.4 Summary of Grey60 module members (Silva database).
Table 4.5 Summary of Pink module members (Silva database).
Table 4.6 Summary of Red module members (Silva database).
Table 4.7 Summary of Yellow module members (Silva database).
Table 4.8 Correlation between bacterial abundance and pack years.
Table 4.9 Summary of Blue module members (Greengenes database).
Table 4.10 Summary of Yellow module members (Greengenes database).
Table 4.11 Summary of the WCNA modules (Greengenes database).
Table 4.12 Significant KEGG 2 level pathways: Blue vs. Yellow modules.
Table 4.13 Significant KEGG 3 level functions in relation to current-smoking individuals.

Chapter 5

Figure 5.1 Neighbour joining non-rooted phylogenetic tree of map gene.
Figure 5.2 Streptococcus spp. bacterial burden: smoking phenotypes.
Figure 5.3 Mock community structures across the 6 sequencing runs.
Figure 5.4 Number of sequencing reads for the 50 samples with the lowest abundance of Streptococcal reads.
Figure 5.5 The 25 most abundant (A) and most prevalent OTUs (B).
Figure 5.6 Top 25 OTUs present in the data separated by current-smoking and non-smoking phenotype.
Figure 5.7 DESeq analysis: current non-smoking vs. current-smoking individuals.
Figure 5.8 Module-trait associations based on the streptococcal data.
Figure 5.9 Evolutionary relationships of manually identified OTUs.

Table 5.1 Summary of molecular typing techniques.
Table 5.2 Current classification of streptococci specifically found in humans.
Table 5.3 Streptococcus-specific, map gene primer sequences.
Table 5.4 Details of the barcoded sequencing primers.
Table 5.5 Primers for map gene amplicon sequencing.
Table 5.6 Quality metrics of the six sequencing runs.
Table 5.7 Blast identification of 10 most abundant OTUs of the mock community.
Table 5.8 Significant clinical variables at P < 0.01.
Table 5.9 PERMANOVA analysis: Streptococcal data.
Table 5.10 Species identified to be significantly associated with current-smoking and non-smoking individuals.

Abbreviations

1/D               Inverse Simpson’s Diversity
16S rRNA gene     16S Ribosomal RNA Gene
ANOVA             Analysis of Variance
BAL               Broncho-alveolar Lavage
BH                Benjamini and Hochberg Method
BHS               Busselton Health Study
BMI               Body Mass Index
bp                Base Pairs
bwa               Burrows-Wheeler Alignment Tool
CN                Copy Number
COPD              Chronic Obstructive Pulmonary Disease
copies/ml         Copies Per Millilitre
DESeq             Differential Expression Analysis for Sequence Count Data
DM                Diabetes Mellitus
DNA               Deoxyribonucleic Acid
dNTP              Deoxynucleotide (dNTP) Solution Mix
dsDNA             Double Stranded DNA
DSMZ              Deutsche Sammlung von Mikroorganismen und Zellkulturen
EBT               Elution Buffer with Tris
fM                Femtomolar
g                 Grams
GERD              Gastro-Esophageal Reflux Disease
GG                Greengenes Database
GLM               Generalized Linear Regression Model
GWAS              Genome-Wide Association Studies
H’                Shannon’s Diversity Index
HMP               Human Microbiome Project
ICL               Imperial College London
ISA               Indicator Species Analysis
Kb                Kilobyte
Kbp               Kilo Base Pairs
KEGG              Kyoto Encyclopaedia of Genes and Genomes
ln [S]            Pielou’s Evenness
M                 Molarity
m²                Meters Squared
map               Methionine Aminopeptidase
Mdn               Median
ME                Module Eigengene
ml                Millilitres
MLEE              Multilocus Enzyme Electrophoresis
MLSA              Multilocus Sequence Analysis
MLST              Multi-Locus Sequence Typing
MM                Module Membership
Mn                Mean
MPD               Mean Pair-Wise Phylogenetic Distance
NA                Not Available
NGS               Next Generation Sequencing
NHLI              National Heart and Lung Institute
nM                Nanomolar
NMDS              Non-Metric Multidimensional Scaling
NRI               Net Relatedness Index
NTI               Nearest Taxon Index
°C                Degrees Celsius
OTU               Operational Taxonomic Unit
PCR               Polymerase Chain Reaction
PERMANOVA         Permutational Multivariate Analysis of Variance
PICRUSt           Phylogenetic Investigation of Communities by Reconstruction of Unobserved States
pM                Picomolar
ppFEV             baseline Forced Expiratory Volume
ppFVC             baseline Forced Vital Capacity
PPS               Protein Precipitate Solution
qPCR              Quantitative Polymerase Chain Reaction
RDP               Ribosomal Database Project
rho [ρ]           Spearman’s Correlation Coefficient
rpm               Revolutions per Minute
S                 Species Richness
SD                Standard Deviation
spp.              Species
Subsp.            Sub Species
SV                Silva Database
TE                Tris Borate EDTA Buffer
TOM               Topological Overlap Matrix
Tukey’s HSD test  Tukey’s Honest Significant Differences Test
UV                Ultra Violet
v/v               Volume/Volume %
WCNA              Weighted Correlation Network Analysis
WHO               World Health Organization
α                 Alpha
β                 Beta
µg                Micrograms
µl                Microliter
µM                Micromole
ΦX174             Phi X 174

Chapter 1 Introduction

1.1 Background

This thesis examines the microbial population of the airways in a representative sample of the middle-aged general population. It was undertaken to understand how the microbial community relates to health and to common diseases such as asthma, diabetes and gastro-esophageal reflux. It also tests how the microbiome is modified by the single most important environmental influence on health in this community, cigarette smoking. The introduction provides a brief overview of the development of microbiome research, and then discusses the respiratory system and common diseases in the general population that may be influenced by the respiratory microbiome.

1.1.1 The Microbial World

Life on Earth began from single-celled microorganisms around 3-4 billion years ago 1. Through natural selection these microorganisms grew, multiplied and evolved; some developing into the macro-scale eukaryotes that make up the majority of the visible world around us. Others stayed on the micro-scale to become one of the most diverse and abundant life forms on the planet. The prokaryotic microorganisms are made up of bacteria and archaea, with approximately 4–6 × 10³⁰ cells residing in every part of the biosphere 2 and accounting for more than 50% of the biomass of the earth 3.

Although there have been numerous references throughout history to the existence of microscopic life forms 4, it was Antonie Van Leeuwenhoek in 1675 who, using a self-designed single-lensed microscope, was finally able to discover, describe and study microorganisms 5,6. Despite Leeuwenhoek’s discovery, it was only in the 19th century that Robert Koch made a direct connection between bacteria and disease. Koch established that Bacillus anthracis was a major cause of anthrax in infected cattle 7, that Vibrio cholerae was isolated from cholera sufferers 8, and that slow-growing Mycobacterium tuberculosis was the causative agent of tuberculosis 9. This experimental work underpinned the development of Koch’s Postulates, a set of criteria used for many years to determine a causal link between microorganisms and disease 10.

An important step towards complete understanding of bacteria came with the discovery of nucleic acid in 1869 by the Swiss physician Friedrich Miescher. Nonetheless, it took another century before Watson and Crick could resolve the problem of how the DNA molecule was responsible for coding genetic information. Their experiments (based on the work of Rosalind Franklin 11) answered the question of the nature and function of DNA 12. Carl Woese furthered understanding of the evolutionary background of organisms by constructing a tree of life based on the progressive evolution of ribosomal DNA. Woese showed that the 16S ribosomal subunit of bacteria (rRNA) could divide the prokaryotes into Eubacteriobonta (Eubacteria) and Archaebacteriobonta (Archaebacteria) 13, and proposed the use of a 16S rRNA database to generate phylogenetic trees 14,15. From then on knowledge of the microbial world grew exponentially, with the realisation that only 0.1-1 per cent of bacteria can actually be grown in the laboratory 16, a notion referred to as the Great Plate Count Anomaly.

The 16S rRNA gene (S for Svedberg, a unit of sedimentation rate that provides a measure of particle size under ultracentrifugation) is found in all bacteria, making it a good basis for a universal bacterial identification tool 17. The gene is 1.5 Kilo base pairs (Kbp) in size, making it large enough to provide statistically informative sequencing results. The gene consists of nine constant (C) regions, which are ideal for designing “universal” primers allowing amplification from most bacteria 18, and nine hypervariable regions (V1-V9), which are species-specific sequences that can be utilised to distinguish and identify different bacterial species 19,20. Overall, the size and structure of the 16S rRNA gene made it better suited for sequence analysis compared with the 23S rRNA gene, which was used in early iterations of ribosomal phylogeny by Sanger sequencing 21.

1.1.2 Sequencing Technology

DNA sequencing methods have also advanced a long way in the last forty years. Ray Wu and his colleagues at Cornell University pioneered the idea of using DNA polymerase with specific nucleotide labelling to sequence cohesive ends of lambda phage DNA 22. Frederick Sanger enhanced this method in 1977 by developing a chain termination method of sequencing that provided rapid and reliable results 23, leading to the sequencing of the first full genome, bacteriophage ΦX174 24.

In order to sequence longer DNA fragments (over 1,000 bp and up to whole chromosomes), shotgun sequencing was industrialised. This broke the target DNA into random fragments for individual sequencing, before reassembling the sequences based on overlapping regions 25. As a result scientists were able to use shotgun sequencing to look at whole genomes of the Epstein-Barr virus 26, Escherichia coli 27, Haemophilus influenzae 28 and many others.
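The fragment-and-reassemble idea behind shotgun sequencing can be illustrated with a toy example. The sketch below is not the method used by any of the cited genome projects; it is a minimal greedy overlap merge, assuming error-free reads, a single strand and no repeats. The function names, the minimum overlap of three bases and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the pair of reads sharing the largest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)  # report the longest contig


# Three overlapping reads, supplied out of order (illustrative data only).
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
contig = greedy_assemble(reads)
```

With these three reads the merge recovers the 14-base sequence ATGCGTACGTTAGC. Real assemblers must additionally handle sequencing errors, both strands and repeated regions, which a greedy merge of this kind does not.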

Further advances in techniques led to the development of the second generation of sequencing methods. Pyrosequencing was a long-standing method of choice, as it provided high sequencing accuracy at a reduced cost 29. However, in recent years high-throughput sequencing has outcompeted previous methods. The optimised sequencing by synthesis method has provided low cost, high accuracy and ability to run 500,000 operations in parallel 30,31 (See Chapter 2 Section 2.3.3 on details of the sequencing by synthesis method). Illumina are one of the leading players in the current market 32.

1.1.3 Bacterial Identification

It has now become customary to sequence the 16S rRNA gene when investigating the identity and classification of bacterial species 17, as whole genome sequencing is able to differentiate two species that have over 98.7% similarity 33. The first generation of sequencing methods focused on analysing complete genomes; however, with advances in high-throughput sequencing the work has shifted towards sequencing shorter subunits at greater depth 34,35. The nine hypervariable regions of the 16S rRNA gene act as the sequencing subunits (V1-V2 and V3-V5 regions) and are used for taxonomic classification of bacterial sequences. Sequencing across multiple hypervariable regions can increase specificity and sensitivity of the results, and only reduces the similarity cut-off between species to 97%.
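To make the 97% similarity cut-off concrete, the sketch below computes the percent identity of two sequences and compares it with that threshold. It is an illustration only, not the clustering procedure used later in this thesis: it assumes the two sequences have already been aligned to equal length, and the function names and example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned positions at which two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" and y == "-":
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def within_cutoff(seq_a: str, seq_b: str, cutoff: float = 97.0) -> bool:
    """True if the two sequences meet the similarity cut-off (97% by default)."""
    return percent_identity(seq_a, seq_b) >= cutoff


# Hypothetical 100-base reference and two variants (illustrative data only).
reference = "ACGT" * 25
variant_3 = list(reference)
for pos in (0, 10, 20):          # three substitutions
    variant_3[pos] = "T"
variant_3 = "".join(variant_3)
variant_4 = "T" + variant_3[1:30] + "T" + variant_3[31:]  # a fourth substitution
```

Here `variant_3` differs from the reference at three of 100 positions and so sits exactly at the 97% cut-off, whereas `variant_4` differs at four positions and falls below it. In practice, sequence clustering is carried out with dedicated tools rather than pairwise comparisons of this kind.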

As a result, the number of published species has almost doubled over the last ten years: as of May 2017 the Approved Lists of Bacterial Names had 15,626 published species (http://www.bacterio.net/-number.html), compared to 8,168 species reported in 2007 17. There are a number of widely used bacterial databases: ARB-Silva (http://www.arb-silva.de/) 36, Greengenes (Lawrence Berkeley National Laboratory: http://greengenes.lbl.gov/cgi-bin/nph-index.cgi) 37, Genbank (https://www.ncbi.nlm.nih.gov/genbank/) 38 and the Ribosomal Database Project (RDP, http://rdp.cme.msu.edu/) 39, all of which are free, regularly updated and available to the scientific community as reference databases. Using all the available resources, scientists are now able to examine microbial communities from all ecosystems 40: host-associated (human 41, animal 42, plant 43), environmental (aquatic, terrestrial) 40,44,45 and engineered systems (food production 46, wastewater 47, lab synthesis 48).

1.2 Human Microbiome

1.2.1 Overview of the Human Microbiome

The human body is a perfect example of a specialized environment that harbours a highly diverse ecosystem of microorganisms. A whole array of bacteria, viruses, archaea, protists and fungi reside on and within all human tissues and biofluids. They are collectively referred to as the human microbiome 49. It was previously estimated that the microorganisms outnumbered human cells almost tenfold 50,51; however, this approximation has since been re-evaluated 52,53. Nonetheless, the microbiome still plays a major part in the body, affecting physiology, immune system development and defence, digestion and many other functions that may not yet be defined.

There is still debate about when the human body is first colonised with bacteria. It has been a long-standing assumption that human babies develop in a sterile environment within the womb and that a normal healthy foetus gets its first dose of bacterial exposure upon delivery 54. Recently, however, this idea has been challenged, as studies have identified commensal bacterial species in the amniotic fluid 55 and placental basal plate 56,57, and have shown that dysbiosis of these microbial populations can be associated with pregnancy complications such as preterm births 58,59. The limited knowledge of the structure, function and development of the placenta and its microbiome led to the initiation of the “Human Placenta Project” 60.

One agreed upon notion is that during birth the neonate is exposed to an extensive assortment of maternal microbes and the mode of delivery plays a large part in the initial development of the gut microbiota 61,62. Studies have shown infants who go through a vaginal birth are more likely to be colonized by maternal bacteria from the vagina and gut, so have an increased relative abundance of Prevotella, Sneathia and Lactobacillus genera 54,63. On the other hand, those born by Caesarean-section display a closer association with the maternal skin and oral microbiota, usually dominated by Propionibacterium, Corynebacterium and Streptococcus 57,64. The gut microbial diversity continues to evolve over the next 3-5 years post birth, after which it begins to reflect a more adult-like composition.

There are a number of factors that shape the infant gut microbiota: nutrition, environment, genetics and health status all play a vital role in child growth and development later on in life. Breast milk contains more than 700 species of bacteria 65,66, mainly dominated by Staphylococci, Streptococci, Lactobacilli and Bifidobacterium (commonly seen as “beneficial” gut organisms), so it is not surprising that the gut bacteria of breast-fed children show dominance of the same organisms 67,68. In comparison, the gut microbiota of formula-fed babies are dominated by Enterococci and Clostridia 69,70. The gut microbial diversity is further enriched following weaning and the introduction of solid food, when it begins to reflect a more adult bacterial composition 64,71.

Early exposure to a variety of geographical locations and the immediate family environment (including siblings) plays a significant role in shaping the infant’s bacterial composition. Children who are brought up in more rural environments have been shown to have fewer cases of asthma and hay fever than those living in the city 72-74. This, together with reports of children with older siblings displaying a higher proportion of Bifidobacteria in the gut 75, supports the idea of the Hygiene Hypothesis. The Hygiene Hypothesis states that unhygienic contact with older siblings (or other exposure) might provide protection from the development of allergic illnesses, implying that a more enriched microbiota could be protective against respiratory diseases 76.

Overall, due to the worldwide progressive increase in immune-mediated (allergies and asthma) and metabolic (obesity and diabetes) diseases in children, a theory of ‘microbial deprivation syndrome of affluence’ has been proposed 77,78. This theory aligns with the hygiene hypothesis in suggesting that a sheltered childhood exposes individuals to limited microbial pressures, resulting in insufficient T cell induction to counter the allergy-inducing Th2 responses and therefore leading to an allergy epidemic 79.

In order to fully understand the adult microbiota, in 2008 the National Institutes of Health initiated comprehensive microbiome studies focusing on different locations around the human body. The Human Microbiome Project (HMP) was based on 242 healthy individuals and examined 15-18 different body sites (covering the gut, oral cavity, skin, airways and vagina) to characterize the microbiota in different environments (https://hmpdacc.org) 41,80.

With this foundation many studies now examine changes and links between microbial communities and disease states. This has brought new perspectives into many fields and diseases such as psoriasis 81, diabetes 82, asthma 83, idiopathic pulmonary fibrosis 84, and metabolic disease 85. It is important to note that although the HMP examined a very extensive set of samples, little emphasis was placed on mapping the airway microbiome.

1.2.2 Respiratory System

The lung and airways are responsible for respiration 86,87. The respiratory tract is separated into the upper and lower airways, which contain a number of different means to filter and prevent harmful particles from entering the body and bloodstream (Figure 1.1). Air is breathed in through the mouth (which contains saliva, whose peptides and proteins act as an antimicrobial defence mechanism 88) or the nose (which is lined with hair and mucous membranes that filter large particles at the first point of entry). Air then moves into the pharynx, where the oesophagus and larynx intersect at the back of the throat. The epiglottis (a flap of cartilage) opens and closes to regulate the passage of substances, ensuring that food passes correctly into the oesophagus (Figure 1.1).

Air continues its path into the trachea, a cartilaginous tube that joins the upper and lower airways. The trachea is lined with a layer of pseudo-stratified ciliated columnar epithelium (respiratory epithelial cells) that contains mucus-secreting goblet cells, which assist in trapping foreign particles. Mucus is moved up towards the larynx to be either swallowed or expelled as phlegm, a process known as mucociliary clearance 89,90. The trachea then branches into the bronchi.

Figure 1.1 Schematic diagram of the respiratory tract. Includes all the main sections of the tract, with an expanded section of the alveoli and surrounding capillary network required for the process of gas exchange. Source: Diagram released into the public domain by its author, LadyofHats. https://commons.wikimedia.org/wiki/File:Respiratory_system_complete_en.svg

The bronchi resemble an inverted tree; as they enter the lungs at each hilum they branch into secondary (lobar) bronchi, which further subdivide into narrower tertiary (segmental) bronchi. Ciliated respiratory epithelial cells extend mucociliary clearance from this level.

Finally, the segmental bronchi subdivide into even smaller tubes, collectively known as sub-segmental bronchi, or bronchioles, before terminating at the alveoli, where the process of gas exchange takes place. At this point oxygen from the saturated air diffuses through the epithelial layer into the capillary network, where it is picked up by haemoglobin in the red blood cells. The latter are transported via the pulmonary vein, through the bloodstream to oxygenate every cell around the body. Carbon dioxide, on the other hand, diffuses from the cells into the bloodstream, then into the alveoli via the pulmonary artery, and back up through the respiratory tract to be exhaled out of the body.

There are several hundred million alveoli present within each individual, with a total surface area of around 70 m². The diaphragm, the main muscle of respiration, controls breathing through its contraction and separates the respiratory tract from the abdominal cavity.

An average individual inhales approximately 10 cubic meters of air each day, which consists of not only the required oxygen, but also dust, smoke, chemicals, microorganisms, other pathogens, particles and pollutants 91. These airborne microorganisms may contribute to respiratory disorders in humans including allergies, asthma and pathogenic infections. It is estimated that the concentration of atmospheric bacteria reaches 10⁴-10⁶ bacteria/m³ (over land) 92, and that the airborne community is highly diverse 93 and season dependent 94. Frequently identified genera include Ralstonia, Cupriavidus and Bacillus 95, as well as a number of bacterial pathogens: Streptococcus pneumoniae, Streptococcus pyogenes, Mycoplasma pneumoniae, Haemophilus influenzae, Klebsiella pneumoniae, Pseudomonas aeruginosa and Mycobacterium tuberculosis 96,97. Interestingly, the respiratory system is capable of harbouring these pathogens, amongst other bacteria, in a quiescent state, as they reside in a symbiotic relationship with the host 98.
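As a rough illustration of the scale these figures imply, the daily inhaled bacterial dose can be estimated by multiplying the inhaled air volume by the airborne concentration. The sketch below assumes the two values quoted above (about 10 m³ of air per day, and a concentration read as 10⁴ to 10⁶ bacteria per m³ over land) and further assumes that every airborne bacterium in the inhaled volume enters the airways. The function name is illustrative only and is not taken from any cited study.

```python
def daily_inhaled_bacteria(air_volume_m3, bacteria_per_m3):
    """Estimate bacteria inhaled per day as volume multiplied by concentration."""
    return air_volume_m3 * bacteria_per_m3

AIR_VOLUME_M3_PER_DAY = 10        # approximate daily inhaled volume quoted in the text
CONCENTRATION_RANGE = (1e4, 1e6)  # bacteria per m3 over land, as quoted in the text

# Evaluate the lower and upper bounds of the quoted concentration range.
low, high = (daily_inhaled_bacteria(AIR_VOLUME_M3_PER_DAY, c) for c in CONCENTRATION_RANGE)
print(f"Estimated inhaled bacteria per day: {low:.0e} to {high:.0e}")
```

Under these assumptions the estimate is on the order of 10⁵ to 10⁷ bacteria per day, which gives a sense of the continuous microbial exposure faced by the airways.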

As the anatomy of the airways changes, so does the microbial composition, with distinct niches along the way. The nostrils (anterior nares) are exposed to the surrounding environment and, as highlighted earlier, are equipped with a number of adaptations to trap foreign particles, including a lining of keratinized stratified squamous epithelium, nasal hair and the presence of both serous and sebaceous glands 99. This close contact with skin and environment has led to colonization with some skin-associated bacteria such as Staphylococcus aureus 100, Propionibacterium spp. and Corynebacterium spp. 101,102.

In the nasopharynx a lining of stratified squamous epithelium continues with an occasional scatter of respiratory epithelial cells. The bacterial composition increases in diversity 101, and in addition to the organisms seen in the nostrils, Moraxella spp., Haemophilus spp. and Streptococcus spp. 103,104 are seen.

The oral cavity is colonized by over 600 different taxa 105, with distinct subsets of bacteria in different habitats around the cavity. These habitats include the teeth, gingival sulcus, tongue, cheeks, hard and soft palates and tonsils 106. Studies have reported dominance of Streptococcus spp. and Prevotella spp. 107 in the oral cavity, but many microorganisms are present that can cause oral infections such as tooth decay, periodontitis, root canal infections and tonsillitis 108-111.

The oropharynx extends from the uvula (posterior edge of the soft palate) to the hyoid bone, where it merges into the laryngopharynx. Like the nasopharynx, it is lined with stratified squamous epithelium. An increased bacterial diversity is seen compared to the nose 112, as well as a greater bacterial load (around 10⁶ bacterial copies/ml) 113. Bacteria characteristic of this region include streptococcal species, Neisseria spp., Rothia spp. and anaerobes, such as Veillonella spp., Prevotella spp. and Leptotrichia spp. 103,114-116.

Continuing down the respiratory tree, bacterial concentration decreases in the lower respiratory tract (reported at around 10²-10³ bacterial copies/ml in broncho-alveolar lavage [BAL] samples 117,118). Although the degree of contamination from environmental sources is debated 119, the studies that have been done show similarities in the bacterial composition of the airways across different countries 103,120-122. This consistency (and the effective filtering of bacteria by the upper airways) suggests the presence of a resident respiratory microbiome.
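To make the gradient in bacterial load along the airway concrete, the approximate figures quoted in this section can be compared on a log scale. The sketch below is illustrative only: it assumes loads of around 10⁶ copies/ml for the oropharynx and 10² to 10³ copies/ml for BAL, ignores the dilution inherent in lavage sampling, and uses a hypothetical helper name.

```python
import math

OROPHARYNX_COPIES_PER_ML = 1e6        # approximate oropharyngeal load quoted in the text
BAL_COPIES_PER_ML_RANGE = (1e2, 1e3)  # approximate BAL load range quoted in the text

def log10_fold_difference(upper_load, lower_load):
    """Return by how many orders of magnitude upper_load exceeds lower_load."""
    return math.log10(upper_load / lower_load)

for bal_load in BAL_COPIES_PER_ML_RANGE:
    orders = log10_fold_difference(OROPHARYNX_COPIES_PER_ML, bal_load)
    print(f"BAL at {bal_load:.0e} copies/ml: about {orders:.0f} orders of magnitude below the oropharynx")
```

On these assumed values the lower airway load sits roughly three to four orders of magnitude below that of the oropharynx, which is why lower airway samples are treated as low biomass.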

1.2.2.1 Healthy Respiratory Microbiome

Original examination of healthy human lungs using culture-dependent analysis established the idea of lung sterility 123. Fortunately, with advances in molecular techniques, this idea has now been challenged, with many studies showing how bacteria play a role in sustaining a healthy lung, as well as having profound effects on disease.

The idea of a “core microbiome” has been proposed, as a number of studies have consistently seen the presence of nine genera: Acinetobacter, Fusobacterium, Megasphaera, Prevotella, Pseudomonas, Sphingomonas, Staphylococcus, Streptococcus and Veillonella 103,124,125. Overall, the bacterial composition of the lungs may be determined by three main factors 126. The first is the way in which bacteria enter the airways, including inhalation of air, subclinical microaspiration of upper respiratory tract contents 127,128 and transient spread along the airway mucosa.

A second factor covers how bacteria are eliminated from the body, including mucociliary clearance and host immune defences. The mucosal surfaces are lined with different classes of epithelium-derived antimicrobial peptides, including defensins, cathelicidins and histatins, all of which play a critical role as an additional line of defence against invading pathogens. Defensins are cysteine-rich cationic proteins that disrupt bacterial cell membrane structure. The α-defensins originate from neutrophils, whereas the β-defensins are produced by the respiratory epithelium 129,130. Cathelicidins are a family of polypeptides found in both macrophages and granulocytes, which again work to damage the cell membrane of organisms 131,132. Histatins work together with salivary peptides and proteins to defend against oral microorganisms 133.

The final factor that affects the bacterial composition of the lungs is the relative reproduction rate of community members, which is dependent on growth conditions. These conditions are common to all ecological niches and include nutrient availability, temperature, pH, oxygen tension, as well as the abundance and activation state of host inflammatory cells. The host may provide nutrients for particular bacteria, and the bacteria may compete amongst themselves for nutrients or may actively inhibit each other’s growth. Disease states and loss of diversity may alter these factors, allowing for selective bacterial colonization, which in turn creates a further imbalance influencing both the immigration and elimination of bacteria from the injured respiratory tract.

1.2.2.2 Asthma

Asthma is a common, heterogeneous, chronic inflammatory disorder of the airways 134. The condition causes narrowing of the airways leading to intermittent wheeze, coughing, airflow obstruction and bronchial hyper-responsiveness. Although it is a chronic condition, in its less severe forms it can be reversed. In severe asthma, the airways may become irreversibly remodelled (scarred) with a fixed airflow obstruction that resembles chronic obstructive pulmonary disease (COPD). In 2015 there were an estimated 358 million people with asthma worldwide 135, and this number is predicted to reach 400 million by 2025 136,137.

The severity of asthma relates to the frequency of symptoms and exacerbations, and to objective tests of airflow out of the lungs such as the forced expiratory volume in one second (FEV1) and the peak expiratory flow rate 138. Atopic asthma describes the presence of symptoms that come on as a response to an allergen such as house-dust mite. Asthma is commonly non-atopic in adult patients. Four severity categories of asthma are recognised, each with a different treatment plan. The mildest form is known as intermittent (treated with an inhaled rapid-acting β2-agonist for symptom relief if needed), followed by mild persistent (where patients are preferably on low-dose inhaled corticosteroids), moderate persistent (treated with a low-to-medium dose of inhaled corticosteroids in addition to long-acting β2-agonists), and severe persistent (where patients are on a high dose of inhaled corticosteroids and long-acting β2-agonists in addition to a range of other treatments).

Beta2 (β2) adrenergic receptor agonists cause smooth muscle relaxation, and corticosteroids act as anti-inflammatory and immune-suppressive mediators in the body. Unfortunately, because of the complexity of this disease, it is difficult to precisely categorise individuals and find an optimal drug combination 139.

The origins of the disease are largely unknown and as yet there is no cure 140. It has been recognised that both genetic and environmental pressures influence the development of asthma 141. However, genes do not change over short periods of history, and so environmental factors are likely to be driving the rise in asthma prevalence. Research studies that examine the genetic factors contributing to disease development 142,143 assume that genes have their most profound effect in early childhood onset. The GABRIEL consortium (a multidisciplinary study to identify the genetic and environmental causes of asthma in the European Community) carried out one of the biggest genome-wide association studies (GWAS) for asthma, in which 10,000 asthmatics and 16,000 controls were screened 144. The ten loci identified were likely to be responsible for informing the adaptive immune system about epithelial damage and activating airway inflammation. Loci included IL18R1, IL33 and SMAD3, with the most significant being ORMDL3/GSDMB.

There are numerous environmental factors implicated in asthma 145, including air pollution 146, tobacco smoke 147,148, occupational exposures (which are estimated to account for 5-25% of all adult asthma cases) 149,150 and acute respiratory infections with rhinovirus 151. Conversely, rural environments and a rich microbial environment in childhood seem strongly protective against the development of asthma 72-74.

Identification of known respiratory pathogens such as S. pneumoniae, H. influenzae and Moraxella catarrhalis in nasopharyngeal aspirates from neonates is associated with an increased risk of recurrent wheezing and asthma. The same organisms have been identified using molecular methods in the airways of children and adults with asthma 103 and in wheezing neonates from rural Ecuador 152. S. pneumoniae, H. influenzae and M. catarrhalis may interact with human rhinoviruses to produce wheezing in children 153,154. There is still a divide in the literature about bacterial diversity in the adult asthmatic airways, with some studies reporting a decrease in bacterial diversity 155,156 and others showing an increase in diversity in patients 157,158. Suggestions for prevention of asthma are limited, and mainly focus on the elimination of allergens from the immediate surroundings. In 2015 the UK suffered 1,500 deaths due to asthma, despite the NHS spending £1 billion on treatment and care of patients 159. It seems that manipulation of the infant microbiome could be the best course of action in order to prevent asthma occurrence 160,161. However, before such interventions are possible it is important to understand the natural composition of the healthy airway flora at all ages.

1.2.2.3 Cigarette Smoke

Smoking can be dated back to shamanistic rituals in 5000 BC, yet the first report of damage caused by smoking was not published until 1950 162. Despite increasing awareness over the years of the risks of smoking, there are over 1 billion smokers worldwide today, with tobacco being the most common substance in popular culture. Smoking can be highly addictive, and with every puff, individuals are exposed to an enormous amount of genotoxic chemicals that can cause tremendous damage to the body 163.

In the UK, smoking and second-hand smoke exposure are the leading cause of acute respiratory tract infections, and the primary causal factor for 84% of lung cancer deaths and 83% of deaths from COPD 164. However, smoking does not just affect the respiratory system: it increases the risk of developing coronary heart disease by 24%. In addition, smoking has been strongly linked to a number of other diseases such as type 2 diabetes, colorectal and liver cancer, erectile dysfunction, birth defects, ectopic pregnancy, and immune dysfunction. This in turn puts a £5.7 billion burden on the NHS to provide care and treatment for these individuals.

In the 20th century, one hundred million deaths were caused by tobacco, and if current trends continue this figure could increase to one billion in the 21st century, as reported by the World Health Organization (WHO) 165,166. The excess mortality associated with smoking is attributable to the acquisition of vascular, neoplastic and respiratory disease 167. Continuing to smoke throughout life has been shown to reduce average life expectancy by 10 years compared to lifelong non-smokers 168.

Tobacco is harvested from the Nicotiana plant, which contains a stimulant alkaloid, nicotine. However, modern-day cigarettes are made up of much more than just the tobacco plant, and it is estimated that tobacco smoke contains around 5,000 chemicals 169,170. Several of the tobacco carcinogens (acrolein and formaldehyde) are responsible for the formation of DNA crosslinks leading to cell death; others, such as acrylonitrile, cause oxidative stress as a result of the high concentration of free radicals. Ethylene oxide, 1,3-butadiene and acetaldehyde bind to the DNA molecule, damaging it and forming DNA adducts, leading to mutagenesis and, without adequate DNA repair, to carcinogenesis.

In addition to DNA damage, alterations in the respiratory microbiome have also been noted 171. However, examination of the effects of smoking on the bacterial composition of the airways has not as yet yielded conclusive results. Some studies have shown little difference 120,124 whilst another has suggested enrichment of Veillonella in smoking individuals 113. However, these investigations have all been quite small, and so cannot provide statistically informed results. Given the impact of smoking on human health, this inconclusiveness highlights the need to look at larger population groups to understand the effects of smoking on the respiratory microbiota.

1.2.2.4 Diabetes

Diabetes mellitus (DM) is one of the most common lifelong inflammatory conditions worldwide. It describes a collection of metabolic disorders and occurs as a result of a rise in the individual’s blood sugar level, caused by either limited production of insulin by the pancreas or inhibition of the cell response to the insulin present in the circulation. Common symptoms of a high blood sugar level include polyuria, polydipsia and polyphagia. However, if the disease is left untreated, its progression affects many other systems, including damage to the eyes, kidneys and nerves, as well as potentially leading to heart disease, strokes and even limb amputation 172.

The hormone insulin plays a key role in keeping glucose levels balanced in the body, as it enhances uptake of glucose from the blood into the cells either for energy or to be stored for later use. Glucose is absorbed from food. The body is also able to obtain glucose through the breakdown of glycogen (glucose storage form) and gluconeogenesis, which is the conversion of non-carbohydrate substrates into glucose. Insulin is released into the blood from beta cells (β-cells), found in the islets of Langerhans of the pancreas, as a response to blood glucose levels. When there is an insufficient amount of insulin produced by the pancreas (Type 1 DM), or cells present insulin resistance (Type 2 DM), then the body is unable to properly reabsorb the glucose, resulting in persistently high levels in the blood.

Type 1 DM makes up 5-10% of all diabetes cases; its causes are largely unknown, but both genetic and environmental factors are suspected to be at play. On the other hand, Type 2 DM accounts for 90% of diabetes cases and is a result of physical inactivity, sedentary lifestyle and excess weight. Eighty per cent of those with T2DM are overweight or obese at the time of diagnosis. In 2014 there were 422 million people with diabetes worldwide, and with the rise in obesity levels it is predicted that this figure will double in the next 20 years 173,174. The International Diabetes Federation reported that 2015 saw an estimated 5.0 million deaths caused by diabetes, making it the seventh leading cause of death and costing global health $673 billion 175; research into the causes, cures and methods of prevention is therefore imperative.

There is increasing evidence that the gut microbiota plays a critical role in facilitating a number of gut-associated conditions including obesity and diabetes. On the whole, the gut is dominated by the Firmicutes and Bacteroidetes phyla, and dysbiosis of this bacterial ratio has been directly linked with obesity. Studies have seen a significant decrease in the levels of Bacteroidetes in obese mice and humans 176-178, and demonstrated the reversal of the bacterial proportions following weight loss 179, in this way suggesting that changes in particular bacterial communities could be responsible for fluctuations in the host’s weight.

Weight gain, fat accumulation and subsequent changes in the microbial communities of the gut in turn lead to changes in insulin sensitivity, altered glucose metabolism and the development of T2DM 85,180,181. It has been reported that individuals suffering from T2DM display a decreased microbial diversity, especially among members of the Firmicutes phylum 182. On the other hand, diabetic individuals displayed enrichment of Proteobacteria and members of the Lactobacillus genus 85.

Currently there is no known cure for diabetes, with symptoms being managed by maintaining a healthy lifestyle with a balanced diet, exercise, weight loss, and medication. One of the more extreme therapies currently being explored is faecal microbiome transplant (FMT, bacteriotherapy), which replenishes a patient’s intestinal microbiota from healthy donors. This technique has been highly successful in the treatment of Clostridium difficile infections 183, and has shown promising results in improving insulin sensitivity 184. All current studies focus on the relationship between gut microbiota and diabetes, and there is very little insight on the effects of diabetes on other microbial habitats around the body, such as the airways.

1.2.2.5 GERD

Another common chronic condition is Gastro-Esophageal Reflux Disease (GERD), which affects almost 20% of the Western population 185. The disease is characterised by constant irritation and damage to the mucosal lining of the oesophagus by stomach acid. Prolonged irritation can cause esophagitis, progressing to Barrett’s oesophagus, the endoscopic presence of columnar-lined esophagus, which can lead to malignancies and esophageal adenocarcinomas 186,187. Unfortunately there is no cure for the condition; changes to the patient’s lifestyle are recommended for mild cases, with drug and surgical interventions for more severe cases 188.

GERD has an association with chronic respiratory symptoms 189. Sixty per cent of asthmatics may have signs of gastro-esophageal reflux, and the prevalence of reflux may be directly related to the severity of asthma symptoms 190. GERD is also associated with laryngitis, pharyngitis, sinusitis, idiopathic pulmonary fibrosis and dental erosions 191.

GERD is caused by incomplete closure of the lower esophageal sphincter, which allows duodenal bile, enzymes and stomach acid to reflux upwards, causing burning and inflammation of the surrounding tissues. Individuals are classified according to the presence (erosive esophagitis, EE) or absence (non-erosive reflux disease, NERD) of esophageal mucosal erosions. Major factors that instigate GERD include hernias, hypercalcemia and obesity, with 13% of esophageal acid exposures attributed to changes in body mass index 192.

Both culture-dependent and culture-independent studies have investigated the microbial community of the healthy esophageal mucosa, demonstrating that its content is reflective of that found in the oral microbiome 193. Streptococcus spp. are seen as the dominant and most abundant members of this location. In individuals affected by GERD, the esophageal microbiome displays enrichment in the Gram-negative taxa Veillonella, Prevotella, Haemophilus, Campylobacter, Fusobacterium and Neisseria 194.

Considering the apparent relationship between asthma and GERD, it is interesting to note that there is little known about potential changes to the respiratory microbiome in the presence of GERD symptoms.

It is becoming more evident that the microbiota interacts both positively and negatively with the immune system of its human host. Some studies have suggested an intimate immunological relationship between the gut and respiratory systems. In its simplest form this might be due to direct translocation of gut bacteria into the lungs 195. The potential for a more sophisticated gut-lung axis interaction is a subject of much interest, with the possibility of immune priming by gut antigens as well as the production of immune-modulating metabolites by particular organisms 196-198. Studies of the gut-lung axis are at an early stage, and an examination of how non-respiratory diseases such as diabetes and GERD affect the microbial composition of the airway could give a broader view on the subject.

1.2.3 Challenges of Respiratory Microbiome Research

In order to investigate the lower airways, BAL is an optimal technique that directly samples the location of interest. This method involves passing a bronchoscope through either the mouth or nose into the lungs, wedging it into a segmental bronchus, and then infusing and re-collecting saline from the segment. Although this method is commonly used in clinical practice to diagnose patients suffering from infections and severe disease, it is invasive for the individual and expensive.

BAL samples are not without problems, as they further dilute the relatively low concentration of bacteria that is found in the lower airways (as highlighted in Section 1.2.2). Recent studies have brought to light the great significance of contamination when assessing microbial communities from low biomass samples 119. Potential sources of contaminant bacteria come from laboratory equipment and reagents used in the sample handling, as well as skin, oral, and respiratory microbiota from the investigators collecting the samples 199. Additionally, cross-contamination of the bronchoscope from the upper airway may occur during the procedure.

The Busselton Health Study examines a substantial population of individuals, and performing bronchoscopies would be impractical and prohibitively expensive. In health, a high level of similarity in bacterial composition is present between the lower and upper airways (Section 1.2.2). Therefore, oropharyngeal swab samples were acquired in this study as a way to investigate the upper airway microbiome, and were used to hypothesise the composition of the lower airways.

1.3 Busselton Health Study

The Busselton Health Study (BHS) is a cross-sectional, whole population study. It was initiated in 1966 by Dr Kevin Cullen (the local GP) together with his collaborators. It is administered by the Busselton Population Medical Research Foundation 200 and the School of Population Health, University of Western Australia 201, and is currently one of the longest running epidemiological studies in the world 200.

Over the years, more than 16,000 residents of the Shire of Busselton, South Western Australia have taken part in the continuing collection of health questionnaires and have donated blood and other samples for laboratory tests. These surveys test the local population on a regular basis for a wide range of diseases, looking at existing conditions, disease progression and prediction of early signs of disease.

More than 300 publications have arisen from the BHS cohort 200. There has been a particular focus on specific systems and diseases: respiratory 202-204, obesity 205, cardiovascular 206-208, and cancer 209.

Busselton is a coastal community with a relatively stable population of predominantly European descent. All those who attend the testing centre complete a self-administered questionnaire that collects information on demographics, lifestyle and environment, and medical history, including any illness. Additionally, patients undergo a physical examination. The complete collection of the specimens taken can be found in the paper by James et al. 210 that describes the study. More recently, oropharyngeal (throat) swabs have been obtained.

1.4 Thesis Aims

The BHS provides a unique opportunity to investigate the bacterial community of the airways in a general population representative of middle-aged Australians.

Therefore, the aims of this thesis were to:

1. Establish the constituents and features of the bacterial communities of the airways in a healthy population.
2. Compare the airway bacterial community of healthy individuals to those with the common diseases of asthma, diabetes and gastro-esophageal reflux disease (GERD).
3. Compare the airway bacterial community of never-smoking individuals with current-smoking and ex-smoking individuals.

1.5 Hypotheses

The aims of the thesis were to enable three hypotheses to be addressed:

1. The bacterial community of the airways within the general, healthy, non-smoking population will have a distinctive community structure.
2. There will be significant differences seen between the healthy non-smoking individuals when compared to those who present with a particular disease phenotype, especially asthma.
3. Smoking will have a profound effect on the bacterial community of the airways, significantly altering it when compared to the airways of the non-smoking individuals.

21 CHAPTER 2: MATERIALS AND METHODS

Chapter 2 Materials and Methods

2.1 Materials and Reagents

| Method section | Item | Manufacturer |
|---|---|---|
| 2.3.1.1 Sample Collection | Sterilin Rayon Swabs | Copan Group, Brescia, Italy |
| 2.3.1.2 Bacterial DNA Extraction from Oropharyngeal (Throat) Swabs | FastDNA Spin Kit for Soil | MP Biomedicals, Santa Ana, California, USA |
| | Precellys 24 | Bertin Corp., USA |
| | Spineze Spin Basket | Fitzco Inc, Minnesota, USA |
| | Ethanol Analytical Reagent Grade | Fisher Scientific, UK |
| | Microcentrifuge Tubes (2 ml) | Alpha Labs, UK |
| | Screw Cap Microtubes (0.5 ml) | Alpha Labs, UK |
| | Centrifuge Tubes (15 ml and 50 ml) | Corning Inc, NY, USA |
| | Vortex | Rotamixer, UK |
| | Microcentrifuge (Heraeus Pico 21) | Fisher Scientific, UK |
| | Autoclaved Scissors | |
| 2.3.1.3 DNA Quantification using the Nanodrop | Nanodrop ND-1000 | LabTech Ltd, UK |
| | Nuclease-Free Water | Qiagen, Germany |
| 2.3.1.4 Bacterial DNA Quantification by Quantitative PCR | MicroAmp Optical Adhesive Seal | ThermoFisher Scientific, USA |
| | MicroAmp Fast 96-Well Reaction Plate (0.1 ml) | ThermoFisher Scientific, USA |
| | MicroAmp Adhesive Seal Applicator | ThermoFisher Scientific, USA |
| | MicroAmp Splash Free Support Base | ThermoFisher Scientific, USA |
| | MO BIO PCR Clean Water | Qiagen, Germany |
| | SYBR Fast qPCR Kit Master Mix (2X) Universal | KAPA BioSystems, Roche, Basel, Switzerland |
| | ViiA 7 Real-Time PCR System | ThermoFisher Scientific, USA |
| | Sterile Disposable Reagent Reservoirs | Corning Inc, NY, USA |
| | Rotanta 460 Centrifuge | Hettich, DJB Labcare Ltd, UK |
| | TOPO® TA Cloning® Kit | ThermoFisher Scientific, USA |
| | HotStarTaq DNA Polymerase | Qiagen, Germany |
| | QIAprep Spin Miniprep Kit | Qiagen, Germany |
| 2.3.2.6 Library Quantification: (B) Quantitative PCR | KAPA Library Quantification (Illumina) | KAPA BioSystems, Roche, Basel, Switzerland |
| 2.3.2.1 Quadruplicate PCR Reaction | Q5 Master Mix (1.2 ml) | New England Biolabs, Massachusetts, USA |
| | PCR Plates (Fully Skirted) | ThermoFisher Scientific, USA |
| | UV PCR Cabinet | Ultra-Violet Products Ltd, Upland, CA |
| | Adhesive PCR Film | ThermoFisher Scientific, USA |
| | Peltier Thermal Cycler (PTC-225) | MJ Research, Bio-Rad Laboratories, USA |
| 2.3.2.2 Gel Electrophoresis | Agarose Molecular Grade | Bioline, London, UK |
| | Tris Borate EDTA Buffer (10x Concentrate) | Sigma-Aldrich, Missouri, USA |
| | Gel Red Nucleic Acid Stain | Biotium, Cambridge, UK |
| | Horizontal Gel Tank | Alpha Labs, UK |
| | Gel Loading Dye | New England Biolabs, Massachusetts, USA |
| | Hyperladder 1 kb | Bioline, London, UK |
| | Biodoc-It®2 Imaging Systems | Ultra-Violet Products Ltd, Upland, CA |
| | Microwave | |
| 2.3.2.3 Amplicon Purification | AMPure XP | Beckman Coulter, USA |
| | Magnetic Stand-96 | ThermoFisher Scientific, USA |
| | Dynal, Bead Separation Magnet | ThermoFisher Scientific, USA |
| | 1x TE Buffer | ThermoFisher Scientific, USA |
| | Conical Bfm Plate, 96 Well | ThermoFisher Scientific, USA |
| 2.3.2.4 Amplicon Quantification using PicoGreen | Quant-iT PicoGreen dsDNA Assay Kit | ThermoFisher Scientific, USA |
| | Infinite 200 PRO | Tecan, Switzerland |
| | Black 96 Well Assay Plate | Costar, USA |
| | Eppendorf Tube (1.5 ml) | Alpha Labs, UK |
| 2.3.2.5 Amplicon Gel Extraction and Purification | Gel Cutting Tips | Alpha Labs, UK |
| | QIAquick Gel Extraction Kit | Qiagen, Germany |
| | Isopropanol (≥ 99.5%) | Sigma-Aldrich, Missouri, USA |
| 2.3.2.6 Library Quantification: (A) BioAnalyser | Agilent 2100 Bioanalyzer | Agilent Technologies, USA |
| | High Sensitivity DNA Reagents | Agilent Technologies, USA |
| | High Sensitivity DNA Chip | Agilent Technologies, USA |
| | Chip Priming Station | Agilent Technologies, USA |
| | MS3 Vortexer | IKA, Staufen, Germany |
| 2.3.3 Illumina MiSeq Sequencing | MiSeq® Reagent Kit V3 | Illumina, California, USA |
| | PhiX Control V3 | Illumina, California, USA |
| | Elution Buffer with Tris (EBT) | Illumina, California, USA |
| | MiSeq™ Sequencing Machine | Illumina, California, USA |
| | Sodium Hydroxide Solution (10M) | Sigma-Aldrich, Missouri, USA |


2.2 Primers for PCR and Sequencing Reactions

All primers were sourced from Eurofins Genomics, Germany.

2.2.1 Quantitative PCR

Primers selected targeted the V4 region of the 16S rRNA gene (Table 2.1) for both the quantitative PCR and subsequent quadruplicate PCR (see Section 2.3.2.1). The primers used for quadruplicate PCR had some additional modifications (see Section 2.2.2) to enable subsequent pooling and multiplex sequencing of amplicons generated 211.

Table 2.1 16S rRNA gene V4 primer sequences

Primer Name | Sequence
S-D-Bact-0564-a-S-15 | 5’ AYT GGG YDT AAA GNG 3’
S-D-Bact-0785-b-A-18 | 5’ TAC NVG GGT ATC TAA TCC 3’

2.2.2 Quadruplicate PCR

Barcoded primers targeting the V4 region of the 16S rRNA gene 211-214 were used to generate amplicons for subsequent sequencing (see Section 2.3.2.1). Because each sample carries a unique barcode combination, use of these primers allows 96 samples to be sequenced simultaneously (Table 2.2) 215.

The 5’ end of the 16S rRNA gene forward primer contained a specific adaptor, one of a selection of 8 different index reads, a pad and a link. The 5’ end of the 16S rRNA gene reverse primer was modified to contain a specific adaptor, one of a selection of 12 different index reads, a pad and a link.


Table 2.2 Details of sequencing primers. Adaptors, Index Reads, Pad and Link modification for V4 primers for quadruplicate PCR.

Component | Sequence
Forward Adaptor | AATGATACGGCGACCACCGAGATCTACAC
Forward Index Barcodes | CTCTCTAT, AAGGAGTA, TATCCTCT, CTAAGCCT, GTAAGGAG, CGTCTAAT, ACTGCATA, TCTCTCCG
Forward Pad | TCCGGTCGCG
Forward Link | GCC
Forward Primer | 5’ AYTGGGYDTAAAGNG 3’
Reverse Adaptor | CAAGCAGAAGACGGCATACGAGAT
Reverse Index Barcodes | TAAGGCGA, CTCTCTAC, CGTACTAG, CGAGGCTG, AGGCAGAA, AAGAGGCA, TCCTGAGC, GTAGAGGA, GGACTCCT, GCTCATGA, TAGGCATG, ATCTCAGG
Reverse Pad | AGGCGGCCG
Reverse Link | CGT
Reverse Primer | 5’ TACNVGGGTATCTAATCC 3’

2.2.3 Illumina MiSeq Sequencing

Sequencing of amplicon libraries generated (see Sections 2.3.2 and 2.3.3) involved the addition of three sequencing primers (Table 2.3).

Table 2.3 Primers for V4 16S rRNA amplicon sequencing.

Primer | Sequence
Multiplexing Read 1 Sequencing Primer | 5’ TCCGGTCGCGGCCAYTGGGYDTAAAGNG 3’
Multiplexing Read 2 Sequencing Primer | 5’ GGATTAGATACCCBNGTAACGCGGCCGCCT 3’
Multiplexing Index Read Sequencing Primer | 5’ AGGCGGCCGCGTTACNVGGGTATCTAATCC 3’
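The relationship between the sequencing primers in Table 2.3 and the pad, link and primer elements in Table 2.2 can be checked programmatically. The following is a minimal Python sketch, not part of the original workflow; the helper name `reverse_complement` is introduced here for illustration only, and the sequences are copied from the two tables.

```python
# IUPAC nucleotide codes and their complements (degenerate bases included).
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written in IUPAC codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq.upper()))


# Sequences copied from Tables 2.2 and 2.3.
FORWARD_PAD, FORWARD_LINK, FORWARD_PRIMER = "TCCGGTCGCG", "GCC", "AYTGGGYDTAAAGNG"
READ_1 = "TCCGGTCGCGGCCAYTGGGYDTAAAGNG"
READ_2 = "GGATTAGATACCCBNGTAACGCGGCCGCCT"
INDEX_READ = "AGGCGGCCGCGTTACNVGGGTATCTAATCC"

# Read 1 primer is the forward pad, link and V4 primer joined together.
print(READ_1 == FORWARD_PAD + FORWARD_LINK + FORWARD_PRIMER)
# The index read primer and the read 2 primer are reverse complements.
print(reverse_complement(INDEX_READ) == READ_2)
```

Both comparisons hold for the sequences as listed, which gives a quick consistency check on primer orders before synthesis.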


2.3 16S rRNA Gene Bacterial Sequencing

The following section details sample collection, bacterial DNA extraction and the series of steps carried out prior to and including the sequencing of the 16S rRNA amplicons generated (the library) on the Illumina MiSeq Sequencing Platform (Figure 2.1).

In summary, samples were all collected in Australia, frozen and transported to Imperial College London (ICL) for processing and analysis. DNA was extracted, and bacterial burden was quantified using a 16S rRNA gene specific quantitative PCR method. 16S rRNA amplicons were amplified from samples using specific barcoded primers, so that sequencing reads generated could be traced back to original sample post library sequencing. Amplicon libraries were purified to remove presence of primer dimer before sequencing on the Illumina MiSeq sequencing instrument.

[Figure 2.1: workflow diagram. Step 1, Sample Processing: oropharyngeal swab; DNA extraction with the FastDNA Spin Kit for Soil; 16S rRNA gene qPCR. Step 2, Library Preparation: universal PCR amplification of the 16S rRNA gene.]

Figure 2.1 Library preparation summary workflow. Sample collection, bacterial DNA extraction, preparation and quantification required for sequencing.


2.3.1 Sample Processing

2.3.1.1 Sample Collection

In addition to providing clinical samples and completing a detailed questionnaire 210, participants who attended a 4-hour interview at test centres in Busselton (Australia) had oropharyngeal (throat) swabs taken. Oropharyngeal swabs were obtained using sterile rayon swabs that were rubbed gently, with even pressure, around the oropharynx five times. Swabs were placed back into their collection tubes immediately after sampling, then frozen and stored at -80 °C prior to transportation on dry ice to ICL, UK.

Completed questionnaires contained details about participants’ demographics, lifestyles, medical history including list of medications, biochemical results (blood, respiratory tests), and history of major diseases. A copy of the blank questionnaire is included (Appendix 2).

Upon receipt at ICL, swabs were stored at -80 °C until extraction, which was carried out within six months of sample collection. The goal of sampling was to obtain an oropharyngeal (throat) swab from each individual. This was not always achieved, however, and if the swab came into contact with any other site during sampling (e.g. tongue), a note was made against the individual sample. Taking this information into consideration, 725 of the 1,167 samples taken were deemed exclusively throat swabs, as they did not come into contact with any other area of the oropharynx (Table 2.4).

Table 2.4 Swab site details.

Swab Site | Number
Nasal | 3
Palate | 174
Throat | 725
Throat_C* | 90
Tonsils | 12
Other | 163
Total | 1,167

*Denotes ambiguous samples, which the operators deemed not an exclusive throat swab and which may have touched other areas of the oropharynx.


2.3.1.2 Bacterial DNA Extraction from Throat Swabs

DNA was extracted from 725 throat swabs following the FastDNA Spin Kit for Soil (MPBio) protocol, with some aspects modified from the manufacturer’s recommendations in order to avoid loss of DNA from Gram-positive bacteria. Gram-positive bacteria have thicker cell walls owing to a thick peptidoglycan layer, so it is important to ensure efficient breakdown of the bacterial cell wall during DNA extraction. This can be achieved through a combination of enzymatic lysis and bead beating.

Using individual autoclaved scissors, each swab head was carefully cut off into a Lysing Matrix tube containing lysis solution (both supplied as part of the MPBio FastDNA Spin Kit for Soil) and homogenized using a Precellys®24 bead beating instrument at a speed of 6,800 rpm, with two 30-second cycles. The supernatant was collected. An additional spin down of the swab head using a SpinEze® spin basket was carried out to collect all possible supernatant. Next, 250 μl of PPS (Protein Precipitation Solution) was added to the supernatant to lower the solubility of proteins, and the proteins were pelleted by centrifugation for 5 minutes at 14,000 x g.

The resulting supernatant was transferred into a clean 15 ml tube and mixed with 1 ml of binding matrix, and the solution was left for 30 minutes to allow binding of all free-floating DNA. Next, 850 µl of the solution was put through a silica spin filter and centrifuged for 1 minute at 14,000 x g to capture all bound DNA in the form of a pellet. To remove any unbound reagents, the pellet (set in the spin filter) was washed using 500 μl of an ethanol-based solution (prepared using 12 ml of SEWS-M Wash Solution [supplied in the kit] and 100 ml of 100% ethanol). The pellet was then re-suspended in 100 µl of DNase/pyrogen-free water and incubated for 5 minutes at 55 °C (an optional step in the kit protocol that increases DNA yield). To collect eluted DNA into a clean catch tube, the spin filter was centrifuged for 1 minute at 14,000 x g. Extracted DNA was stored at -80 °C until required.


2.3.1.3 DNA Quantification using the Nanodrop

The Nanodrop (ND-1000) spectrophotometer was used to estimate the concentration of extracted DNA (bacterial and human combined). The Nanodrop passes light through the sample and, because nucleic acids absorb ultraviolet light (maximally at around 260 nm), the concentration of DNA present can be estimated from the absorbance. Two µl of each extracted DNA sample was loaded on to the Nanodrop pedestal and measured, allowing the DNA yield for the sample to be determined.
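The conversion from absorbance to concentration that the instrument performs can be illustrated with a short Python sketch. It assumes the standard conversion factor for double-stranded DNA (an absorbance of 1.0 at 260 nm, normalised to a 10 mm path length, corresponds to roughly 50 ng/µl); the function names and the example absorbance are hypothetical, and the 100 µl elution volume is taken from Section 2.3.1.2.

```python
# Standard conversion for double-stranded DNA at a 10 mm path length.
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dsdna_concentration(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration (ng/µl) from absorbance at 260 nm."""
    if a260 < 0:
        raise ValueError("absorbance cannot be negative")
    return a260 * DSDNA_NG_PER_UL_PER_A260 * dilution_factor


def dna_yield_ng(concentration_ng_per_ul: float, elution_volume_ul: float) -> float:
    """Total DNA yield (ng) in a given elution volume."""
    return concentration_ng_per_ul * elution_volume_ul


# Hypothetical reading: A260 of 0.5 for a sample eluted in 100 µl.
concentration = dsdna_concentration(0.5)
print(concentration, dna_yield_ng(concentration, 100.0))
```

Because the measurement cannot distinguish bacterial from human DNA, the figure is a total yield only, which is why bacterial burden was quantified separately by qPCR.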

2.3.1.4 Bacterial DNA Quantification by Quantitative PCR

16S rRNA gene specific quantitative PCR (qPCR) was used to confirm presence and determine abundance of bacterial DNA in each extracted sample. Quantitation is achieved through use of a non-specific fluorescent dye, SYBR® Green I (asymmetrical cyanine dye 216), that binds to any double-stranded DNA present in a sample. Any resulting DNA-dye complex emits green light (λmax = 520 nm) that is detected by the ViiA™ 7 Real-Time PCR System, with the amount of amplified product then being determined against a set of standards of known concentrations.

Standards used were a dilution series (1 x 10^4 to 1 x 10^8 copies / μl) of a plasmid containing the sequence-verified full-length 16S rRNA gene of Vibrio natriegens Strain DSMZ-759 (DNA of the bacterium obtained from Deutsche Sammlung von Mikroorganismen und Zellkulturen, Braunschweig, Germany). V. natriegens Strain DSMZ-759 is a Gram-negative marine bacterium and consequently not expected to be present in human clinical samples.

To prepare the standards, the full-length 16S rRNA amplicon was generated using a HotStarTaq protocol with a 20 µl reaction containing: 0.1 µl of Taq, 2 µl each of 27F primer (5’ AGAGTTTGATCMTGGCTCAG 3’) 217 and 1492R primer (5’ GGTTACCTTGTTACGACTT 3’) 217, 2 µl of 10 X Buffer, 4 µl of 5 X Q Buffer, 0.4 µl of dNTPs, 5.5 µl of water and 4 µl of the V. natriegens Strain DSMZ-759 DNA. Thermal cycling conditions were 94 °C for 5 minutes, followed by 30 cycles of 1 min at 94 °C, 1 min at 48 °C and 1 min at 72 °C, with a final extension at 72 °C for 10 minutes. Post amplification, the total reaction (20 µl) was run on a 1.5% agarose gel (see Section 2.3.2.2) and the expected product, 1,400 bp in size, was cut from the gel using a sterile gel cutting tip. The gel slice underwent gel extraction (see Section 2.3.2.5) to remove the agarose gel and any potential contaminants or primer dimer. Next, purified DNA was cloned using the TOPO TA Cloning Kit with transformation into Escherichia coli. Positive colonies were subsequently grown up, 15% glycerol stocks prepared, and plasmids purified using the QIAprep Spin Miniprep Kit (Qiagen, Germany). Purified plasmid DNA was sent for Sanger Sequencing (Wellcome Trust Sanger Centre) to confirm successful amplification and cloning of the 16S rRNA gene of V. natriegens DSMZ-759.

Following the sequencing confirmation, plasmid DNA was quantified using PicoGreen (Section 2.3.2.4) and diluted down to a working concentration of 2 x 10^7 copies / μl. The qPCR standards were then generated by preparing a serial dilution from 1 x 10^8 to 1 x 10^4 copies / μl.
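Converting a measured plasmid concentration into copy number, and laying out a ten-fold standard series, can be sketched in Python as follows. This is an illustrative sketch only: it assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA, and the plasmid length and concentration in the example are hypothetical values, since neither is stated in the text.

```python
AVOGADRO = 6.022e23           # molecules per mole
GRAMS_PER_MOL_PER_BP = 660.0  # common approximation for double-stranded DNA


def copies_per_ul(ng_per_ul: float, length_bp: int) -> float:
    """Convert a dsDNA concentration (ng/µl) to molecule copies per µl."""
    grams_per_ul = ng_per_ul * 1e-9
    return grams_per_ul * AVOGADRO / (length_bp * GRAMS_PER_MOL_PER_BP)


def tenfold_series(top_copies_per_ul: float, points: int) -> list:
    """Concentrations of a ten-fold serial dilution, highest first."""
    return [top_copies_per_ul / 10 ** step for step in range(points)]


# Hypothetical plasmid: 5,000 bp (vector plus insert) measured at 1 ng/µl.
print(f"{copies_per_ul(1.0, 5000):.3e} copies/µl")
# Five standards spanning 1 x 10^8 down to 1 x 10^4 copies/µl.
print(tenfold_series(1e8, 5))
```

The plasmid length used must be that of the whole construct (vector plus the cloned 16S rRNA gene insert), not the insert alone.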

Each 15 µl 16S rRNA gene qPCR reaction contained 5 µl of either DNA template, standard or non-template control (PCR grade water), 7.5 µl of SYBR Fast qPCR Kit Master Mix, 0.3 µl of each primer (forward and reverse, Table 2.1; 0.2 µM final concentration) and 1.9 µl of PCR water. All samples, including non-template controls and standards, were run in triplicate. Thermal cycling conditions were optimised to 90 °C for 3 minutes, followed by 40 cycles of 20 seconds at 95 °C, 30 seconds at 50 °C and 30 seconds at 72 °C. In a small number of instances where replicates for a sample were not tightly aligned, the qPCR was repeated for that sample.
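Quantification against the standards amounts to fitting a straight line through cycle threshold (Ct) against log10 copy number and reading unknown samples off that line. The Python sketch below illustrates the calculation with hypothetical Ct values; it is not the instrument software's implementation, and all function names are introduced here for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_standard_curve(standards):
    """standards maps copies/µl to mean Ct; fits Ct against log10(copies)."""
    xs = [math.log10(copies) for copies in standards]
    ys = [standards[copies] for copies in standards]
    return fit_line(xs, ys)


def copies_from_ct(ct, slope, intercept):
    """Interpolate copies/µl for a sample from its mean Ct."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """PCR efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical mean Ct values for the five standards (copies/µl: Ct).
STANDARDS = {1e4: 28.0, 1e5: 24.7, 1e6: 21.4, 1e7: 18.1, 1e8: 14.8}
slope, intercept = fit_standard_curve(STANDARDS)
print(slope, intercept, amplification_efficiency(slope))
print(copies_from_ct(22.0, slope, intercept))
```

A slope close to -3.3 corresponds to an efficiency near 100%; samples whose Ct falls outside the range of the standards should be treated with caution because they are extrapolated rather than interpolated.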

2.3.2 Library Preparation

Each sequencing library consisted of 96 samples in total: 94 throat swab DNA samples, one PCR positive control (further details in Section 2.3.2.1) and one PCR negative control (PCR grade water). Samples were ordered by their bacterial abundance as determined by qPCR copy number. This was because prior sequencing projects within the Genomic Medicine Section (NHLI), using Roche Junior 454 sequencing, had identified a high level of sample drop-out when runs consisted of samples heterogeneous in their bacterial abundance. Ordering by bacterial abundance resulted in batches amounting to seven sequencing runs, with batch (run) one containing the samples with the highest bacterial abundance and run seven containing the samples with the lowest abundance.
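The batching rule described above (rank samples by qPCR copy number, then fill runs of 94 from the top) can be sketched in Python. The sample identifiers and copy numbers below are hypothetical, and `assign_runs` is a name introduced here for illustration.

```python
def assign_runs(copy_numbers: dict, samples_per_run: int = 94) -> dict:
    """Map each sample ID to a run number, highest abundance in run 1."""
    ordered = sorted(copy_numbers, key=copy_numbers.get, reverse=True)
    return {
        sample_id: position // samples_per_run + 1
        for position, sample_id in enumerate(ordered)
    }


# Hypothetical copy numbers; a run size of 2 keeps the example readable.
example = {"S1": 5e6, "S2": 1e8, "S3": 3e4, "S4": 2e7, "S5": 9e5}
print(assign_runs(example, samples_per_run=2))
```

Each run of 94 samples then leaves two wells on the 96-well plate for the positive and negative PCR controls.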

2.3.2.1 Quadruplicate PCR

To generate sufficient product for subsequent sequencing, quadruplicate PCRs (set up as four identical 96-well PCR plates) were done for each sample. PCRs were set up with 16S rRNA gene V4 region barcoded primers (Section 2.2.2). The primers were specifically modified to contain Illumina adaptors, unique index reads, and pad and link sequences, with final primers being 65 bp long (Table 2.2).

The forward primer contained a 29-base Illumina adaptor, whilst the reverse primer was modified with the reverse complement of the Illumina adaptor, comprising 24 bases. These adaptors bind to the flow cell during sequencing. Attached to the Illumina adaptors are unique index reads: 8 different index reads for the forward primer and 12 different index reads for the reverse, which together generate enough barcode combinations to sequence 96 samples simultaneously. The pad sequence is included to prevent formation of hairpins within the primer and to enable an annealing temperature of 50 °C, whilst a linker sequence that is non-complementary to the 16S rRNA gene was added to separate the priming site from the unique barcodes.

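An example of a complete barcoded primer is:

5’ AATGATACGGCGACCACCGAGATCTACAC CTCTCTAT TCCGGTCGCG GCC AYTGGGYDTAAAGNG 3’

The segments are, in order, the adaptor, index, pad, link and the specific V4 forward primer sequence (colour-coded in the original as green, red, blue, orange and black respectively); spaces are shown only to separate the segments.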
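How the primer elements combine, and why 8 forward and 12 reverse indexes are sufficient for a 96-well plate, can be shown with a short Python sketch. The sequences are copied from Table 2.2; the function name `build_primer` and the dictionary layout are illustrative choices, not part of the original protocol.

```python
from itertools import product

# Primer elements copied from Table 2.2.
FORWARD = {
    "adaptor": "AATGATACGGCGACCACCGAGATCTACAC",
    "pad": "TCCGGTCGCG",
    "link": "GCC",
    "primer": "AYTGGGYDTAAAGNG",
}
REVERSE = {
    "adaptor": "CAAGCAGAAGACGGCATACGAGAT",
    "pad": "AGGCGGCCG",
    "link": "CGT",
    "primer": "TACNVGGGTATCTAATCC",
}
FORWARD_INDEXES = [
    "CTCTCTAT", "AAGGAGTA", "TATCCTCT", "CTAAGCCT",
    "GTAAGGAG", "CGTCTAAT", "ACTGCATA", "TCTCTCCG",
]
REVERSE_INDEXES = [
    "TAAGGCGA", "CTCTCTAC", "CGTACTAG", "CGAGGCTG",
    "AGGCAGAA", "AAGAGGCA", "TCCTGAGC", "GTAGAGGA",
    "GGACTCCT", "GCTCATGA", "TAGGCATG", "ATCTCAGG",
]


def build_primer(parts: dict, index: str) -> str:
    """Join adaptor, index, pad, link and gene-specific primer, 5' to 3'."""
    return parts["adaptor"] + index + parts["pad"] + parts["link"] + parts["primer"]


# Every forward/reverse index pairing identifies one well of the plate.
index_pairs = list(product(FORWARD_INDEXES, REVERSE_INDEXES))
print(len(index_pairs))  # 8 x 12 combinations

example_forward = build_primer(FORWARD, FORWARD_INDEXES[0])
print(example_forward, len(example_forward))
```

Pairing indexes combinatorially means only 20 barcoded oligos (8 forward plus 12 reverse) need to be synthesised to label 96 samples uniquely.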

The positive control for each sequencing run was a mock community consisting of bacterial clones of a number of bacteria commonly found in the respiratory tract.


Table 2.5 gives a detailed list of all strains included in the mock community. As well as acting as a positive control, the mock community enables reproducibility across sequencing runs to be assessed.

Each 25 µl sequencing PCR reaction contained 14 µl of Q5® High-Fidelity 2X Master Mix, 5 µl (1.5 µM working stock) of each of the forward and reverse primers, and 1 µl of either DNA template (extracted throat swab DNA), positive control (mock community) or negative control (PCR grade water). Thermal cycling was performed on a Peltier Thermal Cycler (PTC-225) and comprised 2 minutes at 95 °C (initial denaturation), followed by 35 cycles of 20 seconds at 95 °C, 20 seconds at 50 °C and 30 seconds at 72 °C, with a final extension of 2 minutes at 72 °C.

2.3.2.2 Gel Electrophoresis

Agarose gel electrophoresis was used to confirm successful PCR amplification and that no contamination of the PCR had occurred (indicated by the absence of bands for the negative control). To prepare the gel, 2 g of Bioline Agarose (Molecular Grade) was added to 100 ml of Tris Borate EDTA Buffer (1 X concentrate, pH 8.3) and heated until completely dissolved, resulting in a 2% agarose gel solution. When the solution had cooled sufficiently, 3 µl of Gel Red (Nucleic Acid Stain) was added (enabling visualisation of DNA under ultraviolet light), and the gel was poured and allowed to set.

All negative and positive controls from each of the four PCR plates, together with individual samples selected at random from the quadruplicate plates, were checked. A 5 μl aliquot of each was combined with 5 μl of gel loading dye (glycerol-based buffer) and loaded on to the agarose gel, and electrophoresis was performed for 45 minutes at 100 volts. Hyperladder (1 Kb) was added to each agarose gel as a size reference; the expected band size of the amplicon was 350 bp.

Results were visualised on the BioDoc-It®2 Imaging System. Where the negative controls showed no contamination and the positive control confirmed successful amplification, the four replicate PCR plates were pooled prior to amplicon purification.


Table 2.5 Breakdown of the mock community members. Individual type strains in the mock community with their DSMZ strain numbers (Deutsche Sammlung von Mikroorganismen und Zellkulturen, Braunschweig, Germany).

Full Name | Strain Number
Actinomyces odontolyticus | DSMZ-19120
Bifidobacterium dentium | DSMZ-20436
Burkholderia cepacia | DSMZ-7288
Chlamydophila pneumoniae | DSMZ-19748
Corynebacterium pseudodiphtheriticum | DSMZ-44287
Enterococcus faecalis | DSMZ-20478
Escherichia coli | DSMZ-30083
Fusobacterium nucleatum subsp. nucleatum | DSMZ-15643
Granulicatella adiacens | DSMZ-9848
Haemophilus parainfluenzae | DSMZ-8978
Haemophilus influenzae | DSMZ-4690
Leptotrichia buccalis | DSMZ-1135
Moraxella catarrhalis | DSMZ-9143
Mycobacterium bovis | DSMZ-43990
Mycobacterium psychrotolerans | DSMZ-44697
Mycoplasma pneumoniae | DSMZ-22911
Neisseria meningitidis | DSMZ-10036
Neisseria flavescens | DSMZ-17633
Nocardia farcinica | DSMZ-43665
Pasteurella multocida | DSMZ-16031
Pseudomonas fluorescens | DSMZ-50090
Rothia mucilaginosa | DSMZ-20746
Salmonella enterica | DSMZ-17058
Staphylococcus aureus | DSMZ-20231
Streptococcus agalactiae | DSMZ-2134
Streptococcus constellatus | DSMZ-20575
Streptococcus infantis | DSMZ-12492
Streptococcus mitis | DSMZ-12643
Streptococcus parasanguinis | DSMZ-6778
Streptococcus pneumoniae | DSMZ-20566
Streptococcus pseudopneumoniae | DSMZ-18670
Streptococcus pyogenes | DSMZ-20565
Streptococcus sanguinis | DSMZ-20567
Treponema denticola | DSMZ-14222
Veillonella dispar | DSMZ-20735
Vibrio natriegens | DSMZ-759


2.3.2.3 Amplicon Purification

The AMPureXP Kit was used to purify PCR amplicons post pooling. The kit works through the use of paramagnetic beads that reversibly bind DNA of specific length (depending on the DNA to bead ratio used) and in doing so removes any primer dimer, excess oligonucleotides, salts and enzymes that may be present or remaining after the PCR process.

First, the AMPure beads were brought up to room temperature and after ensuring they were fully re-suspended, they were combined and thoroughly mixed with the PCR amplicons in a ratio of 1:0.7. This ratio ensured removal of any fragments below 200 bp in size, as these would remain in the supernatant and not attached to the beads.

The mixture was left at room temperature for 10 minutes to allow DNA binding to the beads. Separation of supernatant from DNA-bead complex was achieved by use of a magnetic block allowing the supernatant to be discarded. The pellet of DNA-bead complex was washed twice with 100 µl of 80% (v/v) ethanol after which washed pellets were air dried for 5 minutes to allow all ethanol to evaporate.

Finally, DNA was eluted by addition of 30 µl of 1 X TE Buffer, and the supernatant containing the eluted DNA was separated from the AMPure beads using the magnetic block.

2.3.2.4 Amplicon Quantification using Picogreen

Purified amplicons (Section 2.3.2.3) were quantified using the Quant-iT PicoGreen® dsDNA assay kit following the manufacturer’s instructions. PicoGreen is an ultra-sensitive fluorescent nucleic acid stain for quantitating double-stranded DNA (dsDNA) in solution.

The Quant-iT PicoGreen® dsDNA assay involves creating a standard curve (0, 1.56, 3.13, 6.25, 12.5, 25, 50 and 100 ng concentrations) by diluting a phage λ DNA Standard (provided in the kit) with 1 X TE Buffer. Next, 100 µl duplicates of each standard dilution were transferred into a black fluorimeter assay plate. The remaining wells were filled with 99 µl of 1 X TE Buffer and 1 µl of amplicon DNA. Subsequently, 100 µl of diluted PicoGreen reagent (1:200 dilution in 1 X TE Buffer) was added to all 96 wells. Fluorescence of the purified PCR amplicons was measured and the DNA concentration of each sample was determined from the standard curve.

After establishing the DNA concentration of all 96 samples and the desired concentration of the equi-molar pooled library, a sample-specific volume of each amplicon was added to the pool so that every sample contributed the same amount of DNA. The equi-molar pool was then purified again with AMPure Beads as per the procedure detailed in Section 2.3.2.3, with the one exception that the DNA-bead complex was washed twice with 200 µl of 80% (v/v) ethanol.
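The sample-specific pooling volume follows directly from each sample's PicoGreen concentration. The Python sketch below illustrates the calculation under the assumption that all amplicons are of similar length, so that equal mass approximates equal molarity; the target mass, sample names and concentrations are hypothetical.

```python
def pooling_volumes(concentrations_ng_per_ul: dict, target_ng: float) -> dict:
    """Volume (µl) of each sample needed to contribute target_ng of DNA."""
    volumes = {}
    for sample_id, concentration in concentrations_ng_per_ul.items():
        if concentration <= 0:
            raise ValueError(f"{sample_id}: concentration must be positive")
        volumes[sample_id] = target_ng / concentration
    return volumes


# Hypothetical PicoGreen readings (ng/µl) and a hypothetical 20 ng target.
readings = {"sample_A": 10.0, "sample_B": 5.0, "sample_C": 40.0}
print(pooling_volumes(readings, target_ng=20.0))
```

In practice the target is limited by the most dilute sample, since the required volume cannot exceed the eluate volume available for that sample.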

2.3.2.5 Amplicon Gel Extraction and Purification

Gel extraction was used to purify the pooled amplicon library further to ensure complete removal of primer dimer. The pooled amplicon library was run on a 1.8% agarose gel at 80 volts for 60 minutes (Section 2.3.2.2). The resultant band (350 bp) was cut out from the gel using gel cutting tips and transferred to a sterile 1.5 ml microcentrifuge tube that had been weighed in advance, so that the weight of the gel slice could be determined and the required volume of Buffer QG calculated.

The QIAquick Gel Extraction Kit was used to extract DNA from the gel slice. The QIAquick system involves a bind-wash-elute procedure. The supplied Buffer QG was added at three volumes per unit weight of gel slice (for a 100 mg slice, 300 µl is added), after which the tube was incubated at 50 °C for 10 minutes, until the gel had completely dissolved. Next, one gel volume of isopropanol was added to the dissolved mixture to precipitate DNA.

Afterwards, the mixture was put through a QIAquick spin column and centrifuged for 1 minute at 14,000 x g. Nucleic acids were adsorbed to the silica membrane in the high-salt conditions provided by the buffer. This was repeated until all flow-through had been run through the column twice to ensure full capture. Impurities were washed away using 750 µl of an ethanol-based buffer (Buffer PE), with centrifugation for 1 minute at 14,000 x g. The column was centrifuged for another minute at the same speed to make sure all ethanol was removed. DNA was eluted by adding 30 µl of Elution Buffer (Buffer EB) to the spin column and incubating for 4 minutes at room temperature, before centrifuging for 1 minute at 14,000 x g. The column was then discarded and the purified library (the retained eluate) was stored until sequencing.

2.3.2.6 Library Quantification

The purified library (Section 2.3.2.5) underwent quality assessment to confirm removal of any trace of primer dimer and to determine the size of the purified DNA library. Two assays were performed. The first used the Agilent High Sensitivity DNA Kit and the Agilent 2100 BioAnalyzer, allowing the size of the library to be determined. The second quantified the library using qPCR on the ViiA7 Real-Time PCR System.

(A) BioAnalyzer

The BioAnalyzer is a nanofluidics device that analyses the DNA library by separating fragments according to size (using electrophoresis). Consequently, it can detect a range of fragment sizes including small fragments (e.g. primer dimer). It also enables the size of DNA fragments to be determined by using a set of internal standards. The protocol was followed according to the manufacturer’s instructions. One µl of amplicon library was loaded together with 5 µl of marker (size standards) onto the chip and then run on the BioAnalyzer.

(B) Quantitative PCR

Quantification of the amplicon library was performed using the KAPA Library Quantification Kit. Primer Premix (10 X, 1 ml) was mixed with the KAPA SYBR® FAST qPCR Master Mix (2 X, 5 ml) before setting up the reaction. The DNA standard (PhiX) used for this experiment was provided in the kit as a 10-fold dilution series (20 pM to 0.2 fM).

One µl of amplicon library was diluted in 999 µl of PCR water, before creating a 2-fold dilution series (1:1000, 1:2000, 1:4000, 1:8000). The 20 µl reaction contained 12 µl of Mastermix, 4 µl of PCR water and 4 µl of template (DNA standard, library dilution or non-template control [PCR grade water]). All samples, including non-template controls and standards, were run in triplicate. The PCR conditions were 5 minutes at 95 °C, followed by 35 cycles of 30 seconds at 95 °C and 45 seconds at 60 °C.

(C) Calculating Library Concentration

Once the purified library had been run on the ViiA™ 7 Real-Time PCR System and the BioAnalyzer, the results were used to work out the concentration (in nanomolar) of the amplicon library:

Library concentration = Average qPCR quantity x (452 / Average DNA length) x Dilution factor

where:
Average qPCR quantity = average of the triplicate qPCRs for the library dilution
452 = length (bp) of the PhiX DNA used in the standards (from qPCR)
Average DNA length = average DNA fragment length value acquired from the BioAnalyzer
Dilution factor = library dilution (1000, 2000, 4000 or 8000)

The highest library concentration value obtained from a dilution that fell below the top standard was then taken as the Library Concentration.
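The size-adjusted calculation can be written out as a short Python sketch. It assumes that the qPCR quantity is reported in pM (the unit of the kit standards), so that dividing by 1,000 gives nM; the function names and the example values are hypothetical.

```python
STANDARD_LENGTH_BP = 452  # length of the DNA standard supplied in the kit


def library_concentration_pm(average_qpcr_pm: float,
                             average_fragment_bp: float,
                             dilution_factor: float) -> float:
    """Size-adjusted concentration (pM) of the undiluted library."""
    size_adjustment = STANDARD_LENGTH_BP / average_fragment_bp
    return average_qpcr_pm * size_adjustment * dilution_factor


def pm_to_nm(concentration_pm: float) -> float:
    """Convert picomolar to nanomolar."""
    return concentration_pm / 1000.0


# Hypothetical values: 2 pM mean of triplicates at the 1:1000 dilution,
# with a BioAnalyzer average fragment length equal to the standard length.
undiluted_pm = library_concentration_pm(2.0, 452, 1000)
print(undiluted_pm, pm_to_nm(undiluted_pm))
```

The size adjustment corrects for the difference in length between the library fragments and the standard, because SYBR-based quantification measures DNA mass rather than molecule number directly.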

2.3.3 Illumina MiSeq Sequencing

Next generation sequencing was performed in house on the MiSeq Sequencing machine (Version 2.5.0.5) using the MiSeq® Reagent Kit V3 and following the manufacturer’s instructions 213,215.

Illumina MiSeq 16S rRNA gene amplicon sequencing is a culture-independent profiling technique (workflow summary in Figure 2.2). During the process, the adaptors that were incorporated into the modified primers, and hence attached to the DNA amplicons, bind to the flow cell, which is covered by a lawn of oligos complementary in sequence to the primer adaptors. Immobilized templates are clonally amplified through bridge amplification, creating millions of molecular clusters each containing around 1,000 copies of the same template. The sequencing of these templates is then carried out through Illumina’s sequencing by synthesis technology. After clonal amplification, reverse strands are cleaved and washed away, leaving just forward strands. Fluorescently labelled nucleotides compete for incorporation based on the sequence of the template. Upon attachment of a nucleotide the cluster is excited; emission intensity and wavelength are detected by the high-resolution camera and base calls determined. After completion of the first read, the product is washed away, the template folds over to the nearest oligo and, with the use of index read two, the template is extended to form a double-stranded bridge. This bridge is linearized, 3’ ends are blocked, and forward strands are cleaved and washed off, allowing the process of sequencing to be repeated on the reverse strand. The sequencing process occurs over 600 cycles and takes around 55 hours to complete.

Figure 2.2 Illumina MiSeq sequencing by synthesis workflow diagram. Source: Figure was recreated based on Illumina help guides.

Each sequencing library loaded onto the MiSeq had a 20% spike-in of Phi X (a proportion previously determined within the Genomic Medicine Section, NHLI). Phi X is a small, well-characterized genome that is easy to align and remove from sequences afterwards. The spike-in is done to introduce diversity into the sequencing process, allowing discrimination of clusters by providing a balanced fluorescent signal at each cycle. It also serves as a technical control to analyse quality metrics of the run in real time by monitoring cluster generation, sequencing, alignment, amount of product yield and any error rates.

2.3.3.1 MiSeq Sequencing

A maintenance wash, following the manufacturer’s instructions, was carried out on the MiSeq instrument before each use. This ensured that any reagent and library carry-over from previous sequencing runs was removed. Next, the mapping file was uploaded onto the instrument’s interface. The mapping file (comma separated value file) contains the entire sample ID list and location on the plate, corresponding index names, sequences, sample concentration and study name.

Denatured DNA library and Phi X control, both at 8 pM concentration, were combined, to result in a 20% Phi X spike: 800 µl of DNA library and 200 µl of Phi X control. DNA library and Phi X mix (600 µl) were loaded onto the MiSeq cartridge. Four µl of each of the three sequencing primers (Section 2.2.3 for further primer details) were added to their pre-assigned dedicated wells on the MiSeq reagent cartridge using extra-long 200 µl gel-loading tips, and the sequencing commenced. The reaction was run for 600 cycles and took approximately 55 hours to complete.

During the sequencing run, the quality metrics recorded included cluster density (the density of clonal clusters, K/mm²) and clusters passing filter (the percentage of clusters passing the internal (RTA) filter, which screens raw data and removes any reads that do not meet the Illumina chastity filter). Another quality metric was the Quality Score (Q30). A Q-score is derived from the estimated probability of an incorrect base call, with a higher Q-score indicating that a base call is more reliable and less likely to be incorrect. The number of reads produced at the end of the sequencing run (millions) was also recorded.


2.4 Sequence Processing using QIIME

Illumina sequences were processed using the available QIIME scripts (Quantitative Insights Into Microbial Ecology, Version 1.9.0) 218. Following completion of each sequencing run, the forward and reverse reads, as well as the index files, were transferred from the MiSeq machine onto the local computer for processing. To extract the data, all the files had to be unzipped (for the complete command workflow refer to the following: [https://goo.gl/Uus1yE]). Figure 2.3 gives an overview of the QIIME workflow.

Figure 2.3 QIIME bioinformatics summary workflow

Initially, the forward and reverse barcodes (8 base pairs from the beginning of the paired reads) had to be formatted using the extract_barcodes.py command, making the reads compatible with downstream analysis (Figure 2.3).

Certain quality control measures were carried out on the data using Trim Galore (Version 0.4.1) 219. These included trimming of both forward and reverse adaptor sequences (15 bp for forward and 18 bp for reverse were removed from the standard library). Additionally, a parallel paired-end validation on both primers was carried out, removing reads with a length shorter than 150 base pairs.

Forward and reverse reads were then joined together on their overlapping ends using the fastq-join method 220,221, with a minimum overlap of 200 base pairs. This step utilised the parsed barcodes created in the first step above. Where the overlapping bases matched, the base with the highest sequence quality was used.

In order to map every read back to its original sample, de-multiplexing was performed. Here all files containing sequence reads, barcodes and metadata mapping files across the seven sequencing runs were merged together using the split_libraries_fastq.py command. Further quality checks were performed, whereby retained sequences had to have a quality score greater than or equal to 30 (Phred ≥ Q30, making the probability of an error 1 in 1000). Additionally, if there were 10 consecutive bases below Q30 the read was truncated. For a read to be retained, at least 70% of the combined (joined) read also had to be consecutively above Q30.
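The relationship between a Phred score and error probability, and the truncation and retention rules described above, can be illustrated with a small Python sketch. This is a simplified reading of the rules as stated in the text, not QIIME's implementation; the function names and the example quality profiles are hypothetical.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def truncate_at_bad_run(qualities, min_q=30, bad_run_length=10):
    """Number of leading bases kept after truncating at the first run of
    bad_run_length consecutive bases with quality below min_q."""
    run = 0
    for position, q in enumerate(qualities):
        if q < min_q:
            run += 1
            if run >= bad_run_length:
                return position - run + 1  # cut where the bad run began
        else:
            run = 0
    return len(qualities)


def keep_read(qualities, min_fraction=0.70, min_q=30, bad_run_length=10):
    """Retain a read only if enough of its original length survives."""
    if not qualities:
        return False
    kept = truncate_at_bad_run(qualities, min_q, bad_run_length)
    return kept / len(qualities) >= min_fraction


print(phred_to_error_probability(30))          # 1 in 1000
print(keep_read([35] * 30))                    # uniformly high quality
print(keep_read([35] * 20 + [10] * 10))        # truncated to 20 of 30 bases
```

In the last example the read is cut back to 20 bases, which is about 67% of its original length and therefore below the 70% retention threshold.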

As highlighted earlier each sequencing library was spiked with a Phi X genome (Section 2.3.3.1 above) to introduce diversity and act as a technical control in the sequencing run. Burrows-Wheeler Alignment Tool (bwa 222) was used to align all sequences to a reference Phi X genome. A total of 636 Phi X sequences were identified and subsequently removed, leaving 46,337,810 high quality reads.

Operational taxonomic units (OTUs), which represent the group of organisms studied, were identified using the pick_open_reference_otus.py command. This process initially clustered sequences against the Silva database (23 August 2013 release) 223, with all the unknown OTUs clustered separately as new de novo centroids. OTU picking was carried out using the uclust method 224, which clustered sequences into OTUs based on 97% sequence similarity. Next, the most abundant sequence in each OTU was selected as its representative through the pick_rep_set.py command. To create a phylogenetic tree and remove chimeric sequences, representative sequences were aligned to the template Silva alignment using PyNAST 225.

To remove any chimeric sequences from the dataset, ChimeraSlayer 226 was used on the aligned representative sequences. It performs a series of steps in order to flag any potential chimeric sequences.

Highly variable regions were removed from the aligned representative sequences using the filter_alignment.py command, and the filtered alignment was then used to construct the phylogenetic tree using FastTree 227. Taxonomy was assigned to each sequence, again using the uclust method. Finally, an OTU table was constructed, tabulating the number of times each OTU was seen in every sample.
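The final tabulation step can be pictured with a short Python sketch. The input format (one sample/OTU pair per read) and the function name are hypothetical and chosen only for illustration; QIIME writes its OTU table in BIOM format rather than as a Python dictionary.

```python
from collections import Counter


def build_otu_table(read_assignments):
    """Tabulate reads per OTU per sample.

    `read_assignments` is an iterable of (sample_id, otu_id) pairs,
    one pair per read (a hypothetical input format).
    """
    counts = Counter(read_assignments)
    samples = sorted({sample for sample, _ in counts})
    otus = sorted({otu for _, otu in counts})
    # Rows are OTUs, columns are samples; absent combinations count as 0.
    return {
        otu: {sample: counts[(sample, otu)] for sample in samples}
        for otu in otus
    }
```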

The steps detailed above resulted in the following three output files:
• .tre file; Phylogenetic tree description file
• .fasta file; Reference Sequence: this file contained representative nucleotide sequences together with their taxonomic classification
• .biom file; Biological Observation Matrix file, containing the OTU table information

Next, the three files were transferred into R Studio (Version 0.99.903, operating R version 3.2.0, 2015-04-16), where statistical analysis was performed on the data. Table 2.6 gives a summary of the different library packages used for processing.

Table 2.6 R Studio packages used in data analysis.

| Package | Version | Description | Ref. |
|---|---|---|---|
| ape | V4.1 | Used to read tree files | 228 |
| ade4 | V1.7-10 | Analysis of ecological data | 229 |
| BiocGenerics | V0.20.0 | S4 generic functions for Bioconductor | 230 |
| biomformat | V1.6.0 | Processing BIOM files | 231 |
| Biostrings | V2.42.1 | Allows manipulation of biological strings | 232 |
| cluster | V2.0.6 | Cluster analysis | 233 |
| cowplot | V0.7.0 | ‘ggplot2’ extension: plot annotations | 234 |
| DESeq2 | V1.14.1 | Differential gene expression analysis | 235 |
| dplyr | V0.5.0 | Data manipulation | 236 |
| dynamicTreeCut | V1.63-1 | Method of cluster detection (WGCNA) | 237 |
| flashClust | V1.01-2 | Implementation of optimal hierarchical clustering (WGCNA) | 238 |
| ggplot2 | V2.2.1 | Sophisticated data visualizations | 239 |
| ggmap | V2.6.1 | ‘ggplot2’ extension: spatial visualisation | 240 |
| ggtree | V1.10.5 | Visualization and annotation of phylogenetic trees | 241 |
| gridBase | V0.4-7 | Integration of base and grid graphics | 242 |
| gridExtra | V2.2.1 | Miscellaneous functions for “Grid” graphics | 243 |
| gtools | V3.5.0 | R programming tools: permutations | 244 |
| gtable | V0.2.0 | Extension to ‘ggplot2’; arrange ‘Grobs’ in tables | 245 |
| indicspecies | V1.7.6 | Indicator Species analysis | 246 |
| IRanges | V2.8.1 | Manipulating intervals on sequences | 247 |
| jpeg | V0.1-8 | Read and write jpeg files | 248 |
| knitr | V1.15.1 | Dynamic report generation in R | 249,250 |
| lattice | V0.20-34 | High level data visualization system for multivariate data | 251 |
| MASS | V7.3-45 | Supports statistical functions | 38 |
| matrixStats | V0.53.1 | Manipulation of table values | 252 |
| nlme | V3.1-131 | Linear and nonlinear mixed effects model | 253,254 |
| pgirmess | V1.6.5 | Data analysis in ecology | 255 |
| phyloseq | V1.19.1 | Handling and analysing high-throughput microbiome data | 256 |
| phytools | V0.2.2 | Phylogenetic tools: constructing phylogenetic trees | 257 |
| picante | V1.6-2 | NRI/NTI analysis; tool for integrating phylogeny and ecology | 258,259 |
| plyr | V1.8.4 | Data manipulation tool | 260 |
| pscl | V1.4.9 | Calculating the zero-inflation model | 261,262 |
| RColorBrewer | V1.1-2 | Creating custom colour palettes | 263 |
| reshape2 | V1.4.2 | Flexible data manipulation tool | 264 |
| scales | V0.4.1 | Scale function for visualization | 265 |
| stringr | V1.2.0 | Simple, consistent wrappers for common string operations | 266 |
| vegan | V2.4-2 | Community ecology package | 267 |
| VennDiagram | V1.6.17 | Generate high resolution Venn and Euler Plots | 268 |
| WGCNA | V1.51 | Weighted Correlation Network Analysis | 269 |

2.5 Data Analysis

2.5.1 General Data Processing

After transferring the files into R, further filtering and formatting steps were performed on the data. Initially, all bacteria that were “Unclassified” or “Unassigned” at kingdom level were sub-selected and removed from the data. Members of the phylum Cyanobacteria, which contains photosynthetic bacteria commonly found in soil and water and is recognised as a contaminant in human studies, were also removed.

The OTU table was reformatted by renaming bacteria that were classified as unidentified or with uncertain placement at the class, order, family and genus ranks. All of the following labels were replaced with “NA” at all taxonomic ranks:
• Uncultured_organism
• Uncultured
• Incertae_Sedis
• Family_Incertae_Sedis
• Family_III_Incertae_Sedis
• Family_XI_Incertae_Sedis
• Family_XIII_Incertae_Sedis
• Order_II_Incertae_Sedis
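The relabelling can be sketched in a few lines of Python. The analysis itself was performed in R; the dictionary representation of a lineage and the function name below are assumptions made for illustration, while the label set is the one listed above.

```python
UNINFORMATIVE_LABELS = {
    "Uncultured_organism", "Uncultured", "Incertae_Sedis",
    "Family_Incertae_Sedis", "Family_III_Incertae_Sedis",
    "Family_XI_Incertae_Sedis", "Family_XIII_Incertae_Sedis",
    "Order_II_Incertae_Sedis",
}


def clean_lineage(lineage):
    """Replace uninformative labels in a taxonomic lineage with "NA".

    `lineage` maps rank name to label, e.g. {"family": "Uncultured"}.
    """
    return {
        rank: "NA" if label in UNINFORMATIVE_LABELS else label
        for rank, label in lineage.items()
    }
```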

Data underwent a further series of filtering and rarefaction steps before it was used in statistical analysis and details of these are provided in Chapter 3, Section 3.2.3.

2.5.2 Statistical Analyses: Diversity

2.5.2.1 Alpha Diversity

(A) Alpha Diversity Indices

R. H. Whittaker first introduced diversity in 1960 270, as a concept that incorporated mean species diversity within a habitat (alpha diversity). To assess the alpha (α) diversity of a population, a number of complementary diversity indices were used, including Species Richness (S), Pielou’s Evenness (H’/ln [S]) 271, Shannon’s diversity index (H’) 272 and Simpson’s Diversity Index (converted to Inverse Simpson’s, 1/D) 273,274. Individually these indices carry some limitations; therefore all four were used in this thesis for analysing the data generated 275-277. Figure 2.4 portrays species richness and evenness in a bacterial environment.

Figure 2.4 Visual representation of Species Richness and evenness. Community composition is shown with each shape representing a unique organism; the individuality of organisms is depicted by colour and shape. Increased Species Richness in the community is reflected by an increase in the number of different types of organism. A community is even when no one organism displays dominance. Source: Figure was recreated based on Cox et al., 2013 278

Observed Species Richness counts all the different species present; however, it neglects species abundance and is sensitive to rare species. To compensate, the equality of the community was quantified through Pielou’s evenness, which tests the homogeneity of species abundances. Shannon’s diversity index accounts for both species abundance and evenness, with an assumption that all species are represented in the sample and are randomly sampled. Shannon’s diversity is however sensitive to changes in frequency of abundance and rare species. Simpson’s diversity also incorporates species abundance and evenness and measures the probability that two individuals randomly selected from a sample will belong to different species. This index is sensitive to abundant species.
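The four indices can be computed from a vector of OTU counts for one sample, as in the following Python sketch. It assumes the usual textbook definitions (natural logarithm for Shannon’s H’, D as the sum of squared proportions); the function name is hypothetical, and the thesis analysis itself was carried out in R.

```python
import math


def alpha_diversity(counts):
    """Return richness, Shannon, Pielou and inverse Simpson for one sample."""
    present = [c for c in counts if c > 0]
    if not present:
        raise ValueError("sample contains no reads")
    total = sum(present)
    props = [c / total for c in present]

    richness = len(present)                            # S
    shannon = -sum(p * math.log(p) for p in props)     # H'
    # Pielou's evenness is undefined for a single species (ln 1 = 0).
    pielou = shannon / math.log(richness) if richness > 1 else float("nan")
    simpson_d = sum(p * p for p in props)              # D
    return {
        "richness": richness,
        "shannon": shannon,
        "pielou": pielou,
        "inverse_simpson": 1 / simpson_d,
    }
```

A perfectly even community of four OTUs gives an evenness of 1 and an inverse Simpson value equal to its richness, whereas a community dominated by one OTU scores lower on both.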

(B) Data Normality

For comparisons of alpha diversity measures between samples it was first necessary to carry out a Shapiro-Wilk test 279 to determine whether or not the data were normally distributed. A significant result (P < 0.05) provided enough evidence to reject the null hypothesis of normality, meaning that the data were not normally distributed, and non-parametric tests were therefore used. To account for the large sample numbers, this test was interpreted in conjunction with quantile-quantile plots 280 and a histogram of the data, which added visual confirmation of the data distribution.

(C) Analysis of Variance

Having established that the data were not normally distributed ([B] above), the Kruskal-Wallis test 281, a rank-based one-way analysis of variance, was implemented to test for significant differences between groups within a population. To confirm which group was driving the differences observed, an ANOVA linear model was drawn up and Tukey’s Honest Significant Differences (HSD) test 282 was used to compare individual group means. P-values were adjusted for multiple comparisons, controlling the family-wise error rate at the 95% confidence level.
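As an illustration of how the Kruskal-Wallis statistic is built from ranks, the following Python sketch computes the tie-corrected H value for any number of groups. It is a teaching sketch with hypothetical function names, not the R routine used in the thesis, and it omits the final step of converting H to a P-value via the chi-squared distribution.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of observations."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    rank_term, start = 0.0, 0
    for g in groups:
        group_ranks = ranks[start:start + len(g)]
        rank_term += sum(group_ranks) ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Correction for ties: 1 - sum(t^3 - t) / (n^3 - n).
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_sum / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction
```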

2.5.2.2 Dissimilarity Measures

Dissimilarities between different habitats were assessed through beta (β) diversity 270. There are two established types of β diversity: turnover and variation. Turnover measures species replacement between sites, whilst variation (also referred to as nestedness 283) measures the loss of species between sites.

Dissimilarity between populations was calculated using an abundance-based Bray-Curtis dissimilarity matrix 284, which accounts for both the number and abundance of shared species. When the result approaches zero, the samples compared have the same composition; when it approaches one, the two samples share no similarity 285.
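For two samples described by OTU counts in the same order, the Bray-Curtis dissimilarity is the sum of absolute differences divided by the total count across both samples. A minimal Python sketch (function name hypothetical) is:

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must list the same OTUs in the same order")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(sample_a, sample_b)) / total
```

Two identical samples give 0, and two samples with no OTUs in common give 1.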

Adonis 286 is a non-parametric Permutational Multivariate Analysis of Variance (PERMANOVA) carried out using distance matrices, in this case Bray-Curtis. This function (from the vegan package 267) was used to test the relationship between Bray-Curtis distances for species data and selected clinical variables, to see if variance in the data could be explained by these variables.
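The logic of a one-factor PERMANOVA can be sketched in Python as below. This is a simplified illustration under stated assumptions (a single grouping variable, a square distance matrix given as nested lists, unrestricted label permutations) with hypothetical function names; it is not the vegan implementation, which also handles multiple terms and covariates.

```python
import random
from itertools import combinations


def _pseudo_f(dist, labels):
    """Pseudo-F and R2 from sums of squared distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(
            dist[i][j] ** 2 for i, j in combinations(members, 2)
        ) / len(members)
    ss_among = ss_total - ss_within
    f_stat = (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f_stat, ss_among / ss_total


def permanova(dist, labels, n_perm=999, seed=0):
    """Return (pseudo-F, R2, permutation P-value) for one grouping factor."""
    rng = random.Random(seed)
    f_obs, r2 = _pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if _pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)
```

R2 is the share of the total sum of squared distances attributable to the grouping variable, which is how the proportion of variability explained by a variable is read from the output.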

A non-metric multidimensional scaling (NMDS) 287 plot, a robust rank-based ordination method that represents dissimilarity in a low-dimensional space, was used to visualize results from the beta diversity analyses. Dissimilarities were converted into ranks, with samples that were more similar in composition being placed closer together.

2.5.2.3 Net Relatedness Index/Nearest Taxon Index

Phylogenetic diversity was measured using mean pair-wise phylogenetic distance (MPD) 288 between species separated by particular clinical variables of interest. This looked at whether species in a particular group were more closely clustered than expected by chance and the calculation also included relative abundance of species.

The Net Relatedness Index (NRI) is based on the mean pairwise phylogenetic distance between all species in a community and so reflects structure towards the roots of a phylogenetic tree. The Nearest Taxon Index (NTI) is based on the mean distance between each species and its nearest relative in the community and so reflects structure at the tips of a phylogenetic tree (Figure 2.5).

The package ‘Picante’ 258,259 was used to integrate phylogenies and ecology. Each analysis calculated a cophenetic distance matrix, where similar bacteria were clustered together, which was then reflected by the height in hierarchical clustering. Additionally, the mean of nearest taxon distances for each species was weighted by species abundance. In both instances positive values indicate species co-occur with more closely clustered species than expected, and negative values indicate that closely related species do not co-occur, and phylogenetic evenness exists.

Figure 2.5 Schematic representation of the phylogenetic tree. Blue dots refer to Net Relatedness Index (NRI) at the roots of the tree and green dots represent the Nearest Taxon Index (NTI) found at the tips of the phylogenetic tree. Source: Figure was recreated based on prior publications 258,288.

Results were first tested for normality with the Shapiro-Wilk test, followed by comparison of group means of z-values (standardized effect size of mean pairwise distance versus null communities) between clinical variables of interest using the Kruskal-Wallis and Tukey’s HSD tests.
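The z-value referred to above can be illustrated with a short Python sketch that compares the observed mean pairwise distance of a community with that of randomly drawn communities of the same richness. Several simplifications are assumed here and are not taken from the thesis: presence-absence data rather than abundance weighting, a null model that draws taxa at random from the whole pool, a distance matrix supplied as a nested dictionary, and hypothetical function names.

```python
import random
import statistics
from itertools import combinations


def mean_pairwise_distance(dist, taxa):
    """Mean phylogenetic distance over all pairs of taxa in a community."""
    pairs = list(combinations(taxa, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def net_relatedness_index(dist, community, n_null=999, seed=0):
    """NRI = -1 x standardized effect size of MPD versus null communities."""
    rng = random.Random(seed)
    pool = sorted(dist)  # all taxa in the distance matrix
    observed = mean_pairwise_distance(dist, community)
    null = [
        mean_pairwise_distance(dist, rng.sample(pool, len(community)))
        for _ in range(n_null)
    ]
    z = (observed - statistics.mean(null)) / statistics.stdev(null)
    # Sign flipped so that positive values indicate phylogenetic clustering.
    return -z
```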

2.5.3 Statistical Analyses: Community Structure

2.5.3.1 Relative Abundance

Using rarefied sequencing data, OTU abundance was transformed to relative abundance within each sample using the function transform_sample_counts 256 in phyloseq. After this data transformation, OTUs were filtered to include only those with a variance greater than zero.
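The transformation and the variance filter amount to the following, shown here as a plain Python sketch rather than the phyloseq call; the data structures and function names are hypothetical.

```python
import statistics


def to_relative_abundance(sample_counts):
    """Convert {otu: reads} for one sample into proportions summing to 1."""
    total = sum(sample_counts.values())
    return {otu: reads / total for otu, reads in sample_counts.items()}


def drop_zero_variance(table):
    """Keep only OTUs whose abundance varies across samples.

    `table` maps each OTU to its list of relative abundances per sample.
    """
    return {
        otu: values
        for otu, values in table.items()
        if statistics.pvariance(values) > 0
    }
```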

Data were separated into the applicable categories (disease or smoking phenotypes; see Chapter 3) and the average OTU count for each individual was calculated. Relative abundance bar charts were constructed for the top 50 OTUs (most abundant, agglomerated to genus level), which accounted for 97% of the data. Average OTU counts (used to draw the relative abundance charts) were converted to proportions to see how they compared between different phenotypes.

2.5.3.2 Indicator Species Analysis

Indicator Species Analysis 289,290 is a multivariate analysis that looks for significant associations between individual OTUs and particular niches (clinical variables). This test reflects environmental conditions by identifying species preferences; it is therefore an ecological indicator that predicts species distribution within a community and is defined in terms of presence-absence. It combines species relative abundance with relative frequency of occurrence in various groups of sites.

This analysis was carried out using the multipatt (multi-level pattern analysis) function in the indicspecies package 246. Multipatt creates combinations of input clusters and compares these combinations with species in the OTU table. Indicator species of a category were only considered when the P-value was at or below the specified alpha value (P ≤ 0.05, with α = 0.05). Additionally, indicator species were selected on the basis of the two quantities whose product forms the indicator value index, using thresholds of A = 0.6 and B = 0.6. Quantity A is the positive predictive power of a species as an indicator of a particular niche. Quantity B reflects how frequently that species is found across the samples belonging to that niche.
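The two components can be made concrete with a small Python sketch for one OTU and one target group. It assumes the group-equalised form of the indicator value, in which A is the mean abundance in the target group divided by the sum of the group means, B is the fraction of target-group samples in which the OTU is present, and the reported statistic is the square root of A × B. The function name and inputs are hypothetical, and the permutation test that multipatt uses to obtain P-values is omitted.

```python
import math


def indicator_value(abundance, groups, target):
    """Return (A, B, sqrt(A * B)) for one OTU and one target group.

    `abundance` holds the OTU's abundance per sample and `groups` the
    group label of each sample, in the same order.
    """
    group_means = {}
    for g in set(groups):
        values = [a for a, label in zip(abundance, groups) if label == g]
        group_means[g] = sum(values) / len(values)
    if sum(group_means.values()) == 0:
        raise ValueError("OTU is absent from every sample")

    # A: how specific the OTU is to the target group.
    a_stat = group_means[target] / sum(group_means.values())
    # B: how often the OTU is found within the target group.
    target_values = [a for a, label in zip(abundance, groups) if label == target]
    b_stat = sum(1 for v in target_values if v > 0) / len(target_values)
    return a_stat, b_stat, math.sqrt(a_stat * b_stat)
```

An OTU found in every sample of the target group and nowhere else scores A = B = 1; thresholds such as A = 0.6 and B = 0.6 would then be applied to these two values.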

2.5.3.3 Differential Expression Analysis for Sequence Count Data

Differential Expression Analysis for Sequence Count Data (DESeq2 function in R) 291,292 is a univariate analysis that can be carried out to compare OTUs found in a clinical group of interest against OTUs present in a control (e.g. healthy/non-smoking) group. The aim of applying this analysis was to identify OTUs that were differentially abundant when comparing traits of interest. This analysis is based on the negative binomial distribution 293,294, which extends the Poisson distribution to allow for over-dispersed counts.

One of the main limitations however is that this type of analysis is sensitive to OTU abundance, where the most common OTUs are more likely to be picked up as differentially abundant, with rare OTUs overlooked.

DESeq2 analysis was carried out on un-rarefied data, so taking the entire data set into consideration. The inbuilt false positive control rate was kept constant at the P = 0.05 level, allowing a more balanced selection of differentially expressed OTUs. Before carrying out the analysis, data were pre-filtered, removing any taxa with fewer than 3,000 reads. Post analysis, a stringent threshold was implemented which removed low count OTUs.

From the DESeq2 analysis output, several parameters were extracted, including fold change, adjusted P-value (negative log10 transformed) and OTU ID with its associated genus. Abundance and prevalence information for each OTU was also obtained.

Results were plotted in a ‘volcano plot’, which arranges bacteria along dimensions of biological and statistical significance. The horizontal axis indicates the fold change between two groups (on a log2 scale), and the vertical axis represents the P-value for the test of differences between groups (negative log10 scale; smaller P-values appear higher up). The horizontal axis therefore indicates the biological impact of a change, and the vertical axis the statistical evidence for, or reliability of, that change.
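The coordinates of such a plot follow directly from the extracted parameters, as in this Python sketch. The tuple layout of the input and the function name are assumptions; the fold changes and adjusted P-values themselves would come from the DESeq2 output.

```python
import math


def volcano_points(results, alpha=0.05):
    """Turn (otu_id, log2_fold_change, adjusted_p) tuples into plot points.

    Adjusted P-values must be greater than zero for the log transform.
    """
    points = []
    for otu, log2_fc, p_adj in results:
        points.append({
            "otu": otu,
            "x": log2_fc,               # biological effect size
            "y": -math.log10(p_adj),    # statistical evidence
            "significant": p_adj < alpha,
        })
    return points
```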

Chapter 3 16S rRNA Gene Sequencing

3.1 Introduction

As previously highlighted (Chapter 1, Section 1.3), the BHS is a cross-sectional, whole population study that was initiated in 1966 by Dr Kevin Cullen.

The BHS has evolved over the years with self-administered questionnaires collecting information on demographics, lifestyle and environment of participants, medical and physical history of any illness and participants undergoing physical examination. Increasingly, additional biological samples (blood for DNA, blood for serum) have been collected at regular intervals in order to fulfil the evolving study design and its aims as detailed by James et al. 210

Most recently oropharyngeal swabs, which are minimally invasive compared to bronchoscopies, and can be used for bacterial identification certainly in those lacking disease 103 (for detailed discussion see Chapter 1, Section 1.2.3), have been collected for the first time. These samples provide a unique opportunity to investigate the bacterial community of airways in a general random population sample.

The aims of this chapter were to:
1. Use 16S rRNA gene amplicon sequencing to profile the bacterial community of airways from a general random population sample.
2. Compare the airway bacterial community of healthy individuals to those with a disease; asthma or diabetes or GERD.
3. Compare the airway bacterial community of never-smoking individuals with smoking and ex-smoking individuals.

3.2 Methods

3.2.1 Subjects and Oropharyngeal Swabs

One thousand one hundred and sixty-seven individuals of the BHS population were sampled in the present study. All individuals completed a detailed questionnaire (See Appendix 2).

All sampling aimed to swab the back of the throat only. Since tongue depressors had not been used during the procedure, there were instances when there was potential contamination of the swab due to inadvertent contact of the swab with tonsils, palate or nasal areas (Chapter 2, Section 2.3.1.1) and this was recorded by the samplers. As it has previously been shown that distinct niches of airways display different microbial profiles 99,103,112,115,120, the 442 swabs with a record of not being strictly “throat” were excluded.

A small number of samples from non-Caucasian individuals (N = 10) were excluded to avoid potential bias in results due to ethnicity 295,296. All individuals with a diagnosis of cancer (N = 85) were also excluded, as medication and chemotherapy have a defined effect upon the immune system of the body and increase the risk of bacterial infections 297,298. Twenty-six individuals with missing metadata were excluded, resulting in 604 individuals being the focus of the present study.

DNA was extracted from swabs within 6 months of collection using the extraction protocol detailed in Chapter 2, Section 2.3.1.3 with DNA quantified accordingly.

Twenty-six of the swabs had been extracted beforehand, at a time when the QIAamp DNA Mini Kit (Qiagen) 22 was the extraction method of choice. These extracted DNAs were retained in the immediate downstream experiments (16S rRNA gene qPCR and sequencing) to investigate whether different extraction methods could be comparable (Section 3.3.2.1).

16S rRNA gene quantitative PCR (Chapter 2, Section 2.3.1.4) and sequencing (Chapter 2, Section 2.3.3) were performed on the DNA extracted from the swabs of all 604 individuals. At the time of DNA extraction, the potential risk of sample contamination from laboratory reagents including extraction kits 119 was not known or fully understood. Consequently, no kit controls were included during sample processing. Approaches used to determine potential contaminant taxa are therefore highlighted in subsequent sections in this chapter (Chapter 3, Sections 3.2.3.4–3.2.3.5).

Within the 604 individuals, 282 were never-smokers, 249 were ex-smokers and 73 were current-smokers. Eighty-six individuals had asthma, 21 had diabetes, 56 had gastro-esophageal reflux disease and the remaining 441 were healthy. The disease phenotypes were mutually exclusive, whereas the smoking and disease phenotypes were not exclusive of each other (Table 3.1).

Table 3.1 Summary of selected study cohort. Participants are split by disease status (Asthma, Diabetes, GERD and healthy disease-free controls) and then sub-divided by smoking status: never-smoked, ex-smoking and current-smoking.

| | Asthma | Diabetes | GERD | Healthy | Total |
|---|---|---|---|---|---|
| Never-Smoked | 37 | 7 | 28 | 210 | 282 |
| Ex-Smokers | 40 | 8 | 24 | 177 | 249 |
| Current-smoker | 9 | 6 | 4 | 54 | 73 |
| Total | 86 | 21 | 56 | 441 | 604 |

3.2.2 Quantitative PCR

To determine bacterial burden, all 604 DNA samples (in Sample ID order) underwent 16S rRNA gene qPCR (Chapter 2, Section 2.3.1.4). Previous 454 DNA sequencing studies 299 and data generated within the Section (Section of Genomic Medicine, National Heart and Lung Institute) had revealed that mixing low and high DNA quantity samples impacted not only on the quality of sequencing runs but also on sample sequencing success. Consequently, a strategy of ordering samples by bacterial burden was applied when preparing sequencing libraries.

This resulted in samples being organised into 6 batches for sequencing library preparation, with each batch containing 94 samples plus 2 controls (one positive [mock community] and one negative, Chapter 2, Section 2.3.2 and Table 3.2 below). The seventh batch consisted of the 40 samples with the lowest abundance, plus positive and negative controls, as well as 7 samples of high bacterial burden that had failed during the library preparation stage from sequencing batches 2, 3 and 4 and were therefore repeated.
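The ordering strategy can be sketched in Python as follows. The sample identifiers and function name are hypothetical; only the batch size of 94 and the principle of ranking samples by qPCR copy number are taken from the text.

```python
def batch_by_burden(copy_numbers, batch_size=94):
    """Order samples from highest to lowest copy number and split into batches.

    `copy_numbers` maps sample ID to 16S rRNA gene copy number.
    """
    ordered = sorted(copy_numbers, key=copy_numbers.get, reverse=True)
    return [
        ordered[i:i + batch_size]
        for i in range(0, len(ordered), batch_size)
    ]
```

With 604 samples this yields six batches of 94 and a final batch of the 40 samples with the lowest burden, matching the layout described above before repeat samples are added.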

Table 3.2 Bacterial burden ranges for each sequencing library batch. Copy number (CN) range of samples included in each batch and the number of samples per run.

| Sequencing Library Batch | Lower CN | Upper CN | Number of samples/run ǂ |
|---|---|---|---|
| Library 1 | 223,690,416 | 1,746,357,120 | 94 |
| Library 2 | 106,522,712 | 220,728,240 | 90 |
| Library 3 | 51,376,580 | 106,489,688 | 92 |
| Library 4 | 24,486,678 | 50,898,865 | 93 |
| Library 5 | 7,012,747 | 24,428,968 | 94 |
| Library 6 | 803,845 | 6,875,503 | 94 |
| Library 7 | 4,657 | 196,283,712* | 47 |

ǂ Does not include positive and negative controls and can be <94 samples due to sample drop out at library preparation stage.
* Batch 7 contained 40 samples with a low bacterial burden and 7 high bacterial burden samples that dropped out during library preparation for runs 2, 3 and 4, and were therefore repeated.

3.2.3 MiSeq Sequencing

For all seven batches of samples, 16S rRNA gene amplification with specific barcoded primers, targeting the V4 variable region, was performed in quadruplicate. Amplicons generated were pooled, purified and quantified after which libraries were prepared and sequenced on a MiSeq bench top sequencer (Chapter 2, Section 2.3.2). Quality metrics (Chapter 2, Section 2.3.3.1) for all seven sequencing runs are given in Table 3.3.

Post sequencing, raw data (45,377,276 sequences) were transferred into QIIME (Chapter 2, Section 2.4). Sequencing data was processed; forward and reverse reads were joined, quality controls were carried out, data was filtered to remove low quality reads and reads generated from the Phi X spike. Operational taxonomic units (OTUs) were picked from the Silva bacterial database 223,300 and chimeras were removed, after which it was possible to construct the OTU table and phylogenetic tree.

Table 3.3 Sequencing results: quality metrics of the runs. Cluster Density; Cluster passing filter shows percentage of reads that met the overall quality of Illumina chastity filter. Quality score (Q30) is the prediction of probability of an incorrect base call. Number of Reads (millions) produced at the end of each sequencing run.

| Sequencing Run | Cluster Density (k/mm2) | Cluster Passing Filter (%) | ≥ Q30 (%) | Reads (Millions) |
|---|---|---|---|---|
| Run 1 | 531 | 89.7 | 82.2 | 12.75 |
| Run 2 | 800 | 75.8 | 66.6 | 20.61 |
| Run 3 | 891 | 80.7 | 77.5 | 22.51 |
| Run 4 | 671 | 86.1 | 79.0 | 16.34 |
| Run 5 | 646 | 82.0 | 73.2 | 17.34 |
| Run 6 | 435 | 79.4 | 71.1 | 11.83 |
| Run 7 | 547 | 74.9 | 80.1 | 12.50 |

Next, the phylogenetic tree, OTU table, representative sequence files and sample metadata (extracted from the questionnaire), as well as the other appropriate laboratory results (16S rRNA gene qPCR), were transferred into R (Version 3.2.0, 2015-04-16 301) for further data filtering and analysis (Chapter 2, Sections 2.4 and 2.5). A markdown script of the analyses conducted can additionally be explored at: https://goo.gl/Aao9LS.

3.2.3.1 Sequencing Data: Extraction Kits

As highlighted earlier, 26 of the 604 samples were extracted using a different protocol and extraction kit, the QIAamp DNA Mini Kit (Qiagen) (Section 3.2.1 above).

Although the sample numbers were small, as anticipated, generation of a NMDS plot (using Bray-Curtis dissimilarity) (Chapter 2, Section 2.5.2.2, Figure 3.1) demonstrated separation of samples dependent on extraction method.

Figure 3.1 NMDS of samples according to their extraction method. Samples were extracted using either the FastDNA Spin Kit (N = 578) or the QIAamp DNA Mini Kit (N = 26). The 26 samples extracted using the QIAamp extraction kit can be seen clustering to the left of the graph.

A statistically significant difference (P < 0.001) in the number of reads sequenced from samples extracted by the different methods was confirmed using the Kruskal-Wallis test, and the separation of samples by extraction method was also significant in PERMANOVA analysis (P < 0.001).

These findings highlight that variation in DNA extraction methods employed will significantly impact on the ability to compare and contrast as well as combine different data sets generated from independent laboratories unless standardisation of methods is implemented. Standardisation not only of extraction methods but also sequencing methods and primers implemented will also be key.

For further processing and statistical analysis, data generated from the 26 QIAamp-extracted samples were removed, leaving 578 samples and 14 controls.

3.2.3.2 Sequencing Data: Mock Community

Each sequencing run included positive and negative controls. The mock community (Chapter 2, Section 2.3.2.1) consisted of a set of common respiratory bacteria and was used to determine consistency of sequencing runs and assess if multiple runs were comparable to each other.

To confirm reproducibility of sequencing runs, a Bray-Curtis dissimilarity matrix for the mock communities was calculated (Table 3.4). The highest value possible is 1, indicating no similarity, whilst the lowest value of 0 indicates complete similarity. Results revealed that the mock communities were fairly similar across runs, with pairwise dissimilarities ranging from 0.110 to 0.412.

Table 3.4 Comparison of the mock community members. Lower triangular matrix showing Bray-Curtis dissimilarity distances between the mock community samples for the seven sequencing runs.

| | Mock1 | Mock2 | Mock3 | Mock4 | Mock5 | Mock6 | Mock7 |
|---|---|---|---|---|---|---|---|
| Mock1 | | | | | | | |
| Mock2 | 0.168 | | | | | | |
| Mock3 | 0.155 | 0.244 | | | | | |
| Mock4 | 0.237 | 0.301 | 0.162 | | | | |
| Mock5 | 0.251 | 0.350 | 0.166 | 0.110 | | | |
| Mock6 | 0.309 | 0.412 | 0.229 | 0.200 | 0.215 | | |
| Mock7 | 0.122 | 0.216 | 0.184 | 0.187 | 0.196 | 0.307 | |

3.2.3.3 Sequencing Data: Batch Effect

As highlighted earlier (Section 3.2.2), samples had been organised into sequencing library batches based on their bacterial burden (qPCR copy number), with Run 1 containing samples with highest copy number and Run 6 with lowest. It transpired, however, that although this resulted in sequencing runs of high quality (Table 3.3), a profound batch effect appeared to have been created. Figure 3.2 is a Bray-Curtis-based NMDS ordination illustrating the relationship between the seven different sequencing runs: (A) combined plot; (B) individual plots for each sequencing run. As can be seen from these plots, going down the list of sequencing runs the samples progressively became more separated, with samples from Run 6 being distinct from those of other runs.

Figure 3.2 Comparison of samples according to the sequencing run. (A) NMDS plot demonstrating the relationship between samples, with each sample coloured according to sequencing run. (B) Individual plots for each sequencing run.

This visible difference was statistically significant when tested with PERMANOVA, using the Adonis function (calculated using Bray Curtis dissimilarity matrix, Chapter 2, Section 2.5.2.2), with 13% of variability in the data attributed to sequencing run, Df (6) = 14.747, P = 0.001, R2 = 0.134. When comparing all runs to each other there was a significant difference (P = 0.001) (Table 3.5) with the dissimilarity increasing with each subsequent run (R2 values).

Table 3.5 Dissimilarity between seven sequencing runs. Values are the R2 from Adonis analysis (calculated on the basis of Bray-Curtis dissimilarity); all comparisons were significant (P = 0.001).

| | Run1 | Run2 | Run3 | Run4 | Run5 | Run6 | Run7* |
|---|---|---|---|---|---|---|---|
| Run1 | | | | | | | |
| Run2 | 0.028 | | | | | | |
| Run3 | 0.048 | 0.029 | | | | | |
| Run4 | 0.081 | 0.042 | 0.018 | | | | |
| Run5 | 0.102 | 0.058 | 0.037 | 0.014 | | | |
| Run6 | 0.180 | 0.168 | 0.132 | 0.135 | 0.101 | | |
| Run7* | 0.060 | 0.052 | 0.020 | 0.030 | 0.030 | 0.047 | |

*Note Run 7 consists of sequencing of 40 samples with lowest bacterial burden and repeat sequencing of 7 samples of high bacterial burden.

Exploring the data distribution further by run, and considering disease status and smoking status, it became apparent that whilst the number of ex-smokers and never-smokers was comparable across runs there was a bias in relation to current-smokers.

Samples were ordered and batched on the basis of bacterial burden; and the relation of bacterial burden to disease status and smoking status was investigated further using Kruskal Wallis and Tukey’s Honest Significant Differences test (Chapter 2, Section 2.5.2.1C). No difference was seen between different disease groups, however there was a significant difference when comparing smoking versus non-smoking individuals (never-smokers and ex-smokers combined), χ2(2) = 17.73, P = 0.0014, with a lower bacterial burden seen in smokers. Individual group means were significantly different: smoking versus never-smoking (P = 0.008) and smoking versus ex-smokers (P = 0.05) with the mean lower for smokers. There was no significant difference between never-smoking versus ex-smoking groups (Figure 3.3).

[Figure 3.3 plot: “Bacterial Burden by Smoking Status”. Box plot of Copy Number (molecules/ul, Log; axis range 6 to 10) for the Never_Smoking, Ex_Smoking and Smoking groups; asterisks mark significant differences by TukeyHSD.]

Figure 3.3 Bacterial burden when comparing by smoking status. Box Plot comparing the three phenotypes. There is a significant difference between never-smoking and smoking individuals (P = 0.008), ex-smoking and smoking individuals (P = 0.05). For the smoking group, lowest value for bacterial burden was 5.38 (Copy number: 239,756 molecules/μl).

Re-examining the sequencing run data and taking into consideration not only bacterial burden but also smoking status (Figure 3.3), it became apparent that although the number of smoking samples increased down the sequencing run order (Figure 3.4A), there was no run bias in relation to bacterial burden split by smoking status (Figure 3.4B; comparison of Runs 1-6 only, as Run 7 was excluded owing to its inclusion of repeat samples of high bacterial burden).

Therefore, the association between smoking phenotype and sequencing run was predicted to be a proxy for bacterial burden: whilst smoking affects bacterial load, bacterial load does not determine the smoking phenotype. Consequently, it was concluded that the removal of batch effects caused by the sequencing runs would also remove genuine inter-individual variation, which could be of biological importance.

Figure 3.4 Batch effect, bacterial burden and smoking status. (A) Number of samples per sequencing run, coloured by smoking status. (B) Bacterial burden of the samples across the sequencing runs separated and coloured by smoking status.


3.2.3.4 Sequencing Data: Identification of Contaminant OTUs

DNA extraction kit controls were not included at the time of swab extraction, due to lack of knowledge of potential kit contamination (PCR negative controls were however included), so it was important to take steps to remove potential kit contaminant OTUs. To do so a strategy of identifying individual OTUs that increased in their number of reads as the total biomass decreased was adopted, as these OTUs were more likely to be contaminants. Correlation between OTU abundance and bacterial burden (logged qPCR copy number) data was tested using Spearman’s correlation coefficient.

Spearman’s correlation is a non-parametric rank correlation coefficient; the correlation test returns a P-value and a correlation coefficient (rho [ρ], an estimated measure of association). P-values were adjusted using Bonferroni correction (P-values multiplied by the number of comparisons 302) and results were filtered so that only taxa with P < 0.05 were considered.

A total of 17,863 OTUs that were present across 578 samples were tested. There were 3,344 statistically significant OTUs (Bonferroni corrected P-value < 0.05), of which 173 had negative correlation coefficients (individual OTUs whose number of reads increases as the total biomass decreases), indicating a high probability of being contaminants. Examples of the OTUs identified are shown in Figure 3.5. The 173 OTUs were removed from the sequencing data set prior to further analyses. The full list of removed OTUs is given in Appendix 3.1.
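The screening logic described above can be sketched as follows. This is a minimal illustration in Python (standard library only) rather than the R workflow used in the thesis. The function names, the toy read counts and the burden values are all hypothetical, and a permutation P-value is swapped in for the analytic Spearman test purely to keep the sketch self-contained.

```python
import random

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5

def permutation_p(x, y, n_perm=2000, seed=1):
    """Two-sided permutation P-value (assumed stand-in for the analytic test)."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: logged qPCR copy number and reads for two OTUs.
log_burden = [5.4, 6.1, 6.8, 7.2, 7.9, 8.3, 8.8, 9.5]
otu_reads = {
    "OTU_a": [410, 350, 300, 220, 150, 90, 40, 10],  # falls as burden rises
    "OTU_b": [20, 45, 80, 150, 260, 300, 420, 500],  # rises with burden
}

flagged = []
for name, reads in otu_reads.items():
    rho = spearman_rho(reads, log_burden)
    # Bonferroni: multiply each P-value by the number of OTUs tested.
    p_adj = min(1.0, permutation_p(reads, log_burden) * len(otu_reads))
    if p_adj < 0.05 and rho < 0:
        flagged.append(name)  # candidate contaminant

print(flagged)
```

Only OTUs that are both significant after correction and negatively correlated with burden are flagged, which mirrors the two-step selection in the text.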

This is however just one of the possible ways of identifying contaminants and therefore the possibility remains that not all potential contaminant OTUs had been identified and removed. An additional way of tackling potential contaminants is by filtering low abundant OTUs and rarefying the data (Section 3.2.3.5). Nevertheless, in all subsequent analyses care was taken to consider the origins of all significantly identified OTUs, by cross checking with the listing of potential contaminants previously highlighted by Salter et al. 119.


Figure 3.5 OTUs identified as potential contaminants. Association between OTU abundance and qPCR (logged) data was tested using Spearman’s correlation coefficient. Examples of OTUs with negative correlation and P-value < 0.05 are shown. In both cases 95% confidence intervals are illustrated by grey shadow. (A) Comamonadaceae_10374: rho = -0.597, P = 5.33E-57. (B) Methylobacterium_12694: rho = -0.445, P = 1.74E-29.

3.2.3.5 Sequencing Data: Filtering and Normalisation

After removal of potential contaminant OTUs using the correlation approach, data were further filtered to remove any OTUs that were present in only one sample within the dataset. Additionally, OTUs that had fewer than 20 reads were removed, due to previous reports that inclusion of large numbers of low-abundance taxa leads to overestimation of microbial diversity 303,304, as well as driving results towards a particular pattern in the data, particularly when examining alpha diversity 303,305,306.
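The two filters above can be illustrated with a short Python sketch. The function name and the toy counts are hypothetical, and the sketch assumes that the 20-read threshold applies to the total reads of an OTU across the whole dataset, which the text does not state explicitly.

```python
def filter_otus(counts, min_samples=2, min_reads=20):
    """Keep OTUs seen in at least `min_samples` samples with at least
    `min_reads` reads in total.

    counts: {otu_id: {sample_id: reads}}
    Assumption: the read threshold is applied to the dataset-wide total.
    """
    kept = {}
    for otu, per_sample in counts.items():
        n_present = sum(1 for reads in per_sample.values() if reads > 0)
        total = sum(per_sample.values())
        if n_present >= min_samples and total >= min_reads:
            kept[otu] = per_sample
    return kept

# Hypothetical toy table.
counts = {
    "OTU_1": {"s1": 120, "s2": 80, "s3": 0},  # kept
    "OTU_2": {"s1": 0, "s2": 35, "s3": 0},    # present in one sample only
    "OTU_3": {"s1": 5, "s2": 4, "s3": 3},     # 12 reads in total
}
filtered = filter_otus(counts)
print(sorted(filtered))
```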

In addition, during the sequencing process although the aim of normalisation of sequencing libraries (equimolar pooling of samples: Chapter 2, Section 2.3.2.4) is to achieve an equal number of sequence reads per sample, there is invariably variation in the number of reads generated per sample during sequencing. Therefore, to ensure there were no biases in results due to this, a process of random-resampling (rarefaction) was carried out. This enables data to be normalized such that each sample contains the same number of sequences. Consequently, samples that did not contain enough reads to attain the chosen rarefaction cut off were removed.

To establish the level of rarefaction, samples were ordered by the number of sequences and two different levels of rarefaction were tested, in order to choose a level that would retain the majority of samples. The first rarefaction level was at 6,543 reads (Sample 10BHS00082 had 6,544 reads), which removed 49 samples and 225 OTUs. The second level was at 11,099 reads (Sample 10BHS00689 had 11,100 reads), which removed 66 samples and 50 OTUs. Rarefaction level was therefore established at 6,543 reads, as this retained the most samples, 529 samples with 3,993 taxa summing up to 3,461,247 reads (Figure 3.6 shows the 100 samples with the lowest number of sequences).
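Rarefaction to a fixed depth can be sketched as random subsampling of reads without replacement, with samples below the cut-off discarded. The Python below is an illustration only: the function name, the seed, the toy samples and the depth of 50 are hypothetical (the thesis used a depth of 6,543 reads).

```python
import random

def rarefy(sample_counts, depth, rng):
    """Subsample `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads,
    signalling that it should be dropped.
    """
    total = sum(sample_counts.values())
    if total < depth:
        return None
    # Expand counts into one entry per read, then draw without replacement.
    pool = [otu for otu, n in sample_counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = {
    "s1": {"OTU_1": 60, "OTU_2": 30, "OTU_3": 10},
    "s2": {"OTU_1": 25, "OTU_2": 15},  # 40 reads: below the cut-off
    "s3": {"OTU_1": 200, "OTU_3": 1},
}
depth = 50

rarefied = {}
for name, counts in samples.items():
    result = rarefy(counts, depth, rng)
    if result is not None:
        rarefied[name] = result

print(sorted(rarefied))
```

Rare OTUs can disappear from a sample during subsampling, which is one reason why the number of OTUs as well as the number of samples changes with the chosen level.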

[Figure 3.6 plot. Title: Busselton Sequencing Data: Rarefaction Level. Y-axis: Number of Reads (0 to 20,000); x-axis: Samples (0 to 100). The two tested levels, 6,543 and 11,099 reads, are marked.]

Figure 3.6 Establishing the rarefaction level. The bottom 100 samples: range of number of sequencing reads are shown. Two different rarefaction levels were tested.

The cleaned data set (Table 3.6) was then taken through statistical analyses looking first at the population of 529 individuals as a whole and then separating them according to disease status and smoking status (phenotypes). Measures of alpha and beta diversity were explored, as were associations between specific OTUs and traits of interest (Chapter 2, Sections 2.5.2, 2.5.3).

Table 3.6 Sample numbers of the final study cohort. Data after processing and filtered for statistical analysis.

                  Asthma   Diabetes   GERD   Healthy   Total
Never-Smoked          33          6     24       190     253
Ex-Smokers            38          8     19       151     216
Current-Smoker         6          5      3        46      60
Total                 77         19     46       387     529


3.3 Results

3.3.1 Demographics

Post data generation, quality control analysis, removal of contaminant OTUs and rarefaction (Sections 3.2.3.4 to 3.2.3.5 above, summarized in Figure 3.7), samples for 529 individuals remained (labelled Population) from the original 604 individuals whose DNA (extracted from oropharyngeal [throat] swabs) underwent 16S rRNA sequencing.

The demographics for these individuals (split by sex) are detailed in Table 3.7.

Table 3.7 Demographics for the final population of 529 individuals.

                        Male                 Female
Total                   259                  270
Age (years)             55.71 (SD:5.57)      55.78 (SD:5.59)
BMI                     28.48 (SD:4.04)      27.28 (SD:5.07)
Current-Smoker          35                   25
Number of Pack Years    11.83 (SD:18.70)     6.53 (SD:12.79)
Ex-Smoker               108                  108
Years since Quit        17.52 (SD:12.76)     20.21 (SD:11.31)
Never-Smoker            116                  137
Asthma                  36                   41
ppFEV baseline          96.41 (SD:13.98)     94.79 (SD:12.67)
ppFVC baseline          98.92 (SD:11.95)     98.55 (SD:12.04)
Diabetes                10                   8
GERD                    18                   28
Healthy                 194                  193

BMI = Body Mass Index
ppFEV baseline = percent predicted forced expiratory volume baseline
ppFVC baseline = percent predicted forced vital capacity baseline
GERD = gastro-esophageal reflux disease


Figure 3.7 Summary of data generation and processing. Quality control analysis, removal of contaminant OTUs and rarefaction.


3.3.2 Characteristics of Population

3.3.2.1 Common OTUs

The rarefied sequence dataset contained 4,005 OTUs from 14 unique phyla, the most common of which were Firmicutes, Bacteroidetes, Proteobacteria, Fusobacteria and Actinobacteria (together making up 98.4% of all OTUs). The fifty most abundant OTUs (≥ 5,599 sequencing reads) are shown in Figure 3.8, with the remaining 3,955 OTUs grouped together under the label of “3,955 Other OTUs”. The fifty most abundant OTUs accounted for 97% of the data. In addition, prevalence (the proportion of individuals that contained a particular OTU) was also calculated. There were 8 OTUs present in all 529 samples. No known contaminants 119 were noted in the top 50 OTUs.

Firmicutes was the most common phylum, with 24 OTUs in the top 50 and 57.9% of all OTUs found in the complete dataset. In addition, 5 OTUs from this phylum were present in all samples: Streptococcus_4768 (which made up 13.4% of total reads in the dataset), Firmicutes_6640 (12.4%), Veillonella_19412 (11.4%), Streptococcus_20297 (7.6%) and Granulicatella_3979 (1.5%).

During processing of raw sequencing data (Chapter 2, Section 2.4), taxonomy was not assigned to one OTU, Firmicutes_6640. By drawing out a phylogenetic tree (using an Interactive Tree Of Life 307 interface [iTOL, V3.2.4, http://itol.embl.de], web software used for display and annotation of phylogenetic trees) this OTU branched off within a close lineage to several other Streptococcal species, giving a strong indication that this could potentially be an unidentified Streptococcal species (data not shown).

The second most common phylum was Bacteroidetes (14.1%), with 11 OTUs within the top 50 OTUs. One OTU in this phylum, Prevotella_1525, which accounted for 10% of the total reads in the complete data set, was also present in all 529 samples. The last three common phyla were Proteobacteria (making up 12.3% of the whole dataset) with 6 OTUs in the top 50, and Actinobacteria (9.1%) and Fusobacteria (4.9%) with 4 and 5 OTUs respectively in the top 50, each also having one OTU (Actinomyces_13710 and Fusobacterium_7409 respectively) present in all 529 samples.
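The abundance and prevalence summaries used in this section can be sketched in a few lines of Python. The function name and the four-sample toy table are hypothetical; prevalence is computed as defined in the text (the proportion of samples in which an OTU has at least one read).

```python
def summarise(table):
    """Per-OTU total reads, share of all reads (%) and prevalence (%).

    table: {sample_id: {otu_id: reads}}
    """
    n_samples = len(table)
    grand_total = sum(sum(counts.values()) for counts in table.values())
    totals, present = {}, {}
    for counts in table.values():
        for otu, reads in counts.items():
            if reads > 0:
                totals[otu] = totals.get(otu, 0) + reads
                present[otu] = present.get(otu, 0) + 1
    return {
        otu: (totals[otu],
              100 * totals[otu] / grand_total,
              100 * present[otu] / n_samples)
        for otu in totals
    }

# Hypothetical toy table of rarefied counts (100 reads per sample).
table = {
    "s1": {"OTU_A": 50, "OTU_B": 30, "OTU_C": 20},
    "s2": {"OTU_A": 70, "OTU_B": 30},
    "s3": {"OTU_A": 40, "OTU_C": 60},
    "s4": {"OTU_A": 90, "OTU_B": 10},
}
summary = summarise(table)
# Rank OTUs from most to least abundant, as for a "top N" plot.
ranked = sorted(summary, key=lambda otu: summary[otu][0], reverse=True)
print(ranked, summary["OTU_A"])
```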


[Figure 3.8 plot. Title: Busselton Cohort: Top 50 OTUs. X-axis: Abundance (log10 scale, 1e+04 to 1e+05); y-axis: OTU IDs; point legend: Percent_Prevalence (50 to 100).]

Y-axis labels, in plotted order: Streptococcus_4768, Firmicutes_6640, Veillonella_19412, Prevotella_1525, 3955 Other OTUs, Streptococcus_20297, Neisseria_10019, Actinomyces_13710, Fusobacterium_7409, Haemophilus_11091, Atopobium_5828, Veillonella_19390, Gemella_3258, Leptotrichia_8776, Prevotella_2598, Granulicatella_3979, Selenomonas_17559, Porphyromonas_11896, Prevotella_1868, Prevotella_1969, Leptotrichia_8929, Johnsonella_15517, Megasphaera_16215, Leptotrichia_8995, Lachnoanaerobaculum_14874, Prevotella_2754, Leptotrichia_8876, Neisseria_9962, Actinomyces_13062, Veillonella_18212, Prevotella_621, Haemophilus_11056, Pasteurellaceae_11461, Oribacterium_15965, Capnocytophaga_2454, Haemophilus_10797, Stomatobaculum_15766, Capnocytophaga_509, Veillonella_19388, Solobacterium_11985, Rothia_13982, Veillonella_19389, Veillonellaceae_17891, Stomatobaculum_15638, Prevotella_2890, Prevotella_879, Lachnoanaerobaculum_14972, Parvimonas_8503, Streptococcus_7798, Selenomonas_17440, Streptococcus_28.

Figure 3.8 Top 50 (most abundant) bacterial OTUs found in the population. The x-axis shows the abundance of each OTU (number of reads per OTU, log10 transformed). OTU IDs are labelled along the y-axis. Each OTU ID has been colour coded according to phylum (Red: Firmicutes, Green: Bacteroidetes, Cyan: Proteobacteria, Pink: Fusobacteria, Blue: Actinobacteria and Black: ‘3,955 Other OTUs’). Colour and size of each point represent the prevalence of each OTU across samples (number of samples an individual OTU appears in, transformed to a percentage).


3.3.2.2 Whole Population: Alpha Diversity

Alpha Diversity (α) looks at the diversity within a sample (Chapter 2, Section 2.5.2.1A). It was calculated using a number of complementary diversity indices including Observed Richness, Pielou’s evenness, Shannon’s Diversity and Simpson’s Reciprocal Index (Inverse Simpson’s).
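As a hedged illustration of the four indices named above, the following Python sketch computes them from a single vector of OTU counts. The function name and the two toy communities are hypothetical, and the natural logarithm is assumed for Shannon's index (the log base is not stated in this section).

```python
import math

def alpha_diversity(counts):
    """Observed richness, Shannon's diversity, Pielou's evenness and
    the inverse Simpson index for one sample."""
    reads = [c for c in counts if c > 0]
    total = sum(reads)
    props = [c / total for c in reads]
    richness = len(reads)
    shannon = -sum(p * math.log(p) for p in props)  # natural log assumed
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    inv_simpson = 1 / sum(p * p for p in props)
    return {"richness": richness, "shannon": shannon,
            "pielou": pielou, "inv_simpson": inv_simpson}

even = alpha_diversity([25, 25, 25, 25])  # perfectly even community
skewed = alpha_diversity([97, 1, 1, 1])   # dominated by one OTU
print(even)
print(skewed)
```

Both toy communities have the same richness, but the dominated one scores lower on evenness, Shannon's diversity and inverse Simpson, which is why the indices are treated as complementary.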

Different diversity indices were assessed for normality using Shapiro-Wilk’s test (at 95% confidence level) and confirmed with quantile-quantile plots (Chapter 2, Section 2.5.2.1B). All diversity indices deviated significantly from normality: Observed Richness (Mn = 213.4, SD = 63, P = 2.43E-05), Pielou’s evenness (Mn = 0.5, SD = 0.06, P = 3.32E-14), Shannon-Wiener Diversity (Mn = 2.9, SD = 0.4, P = 9.29E-09) and Simpson’s Reciprocal Index (Inverse Simpson’s; Mn = 10.3, SD = 4.1, P = 3.18E-05).

The data were therefore regarded as non-normal, and further comparisons of alpha diversity between groups (Sections 3.3.3.1A and 3.3.4.1A below) were carried out using the Kruskal-Wallis test (rank-based one-way ANOVA) and post hoc Tukey’s Honest Significant Differences test (Chapter 2, Section 2.5.2.1C).

3.3.2.3 Whole Population: Beta Diversity

Beta diversity (β) looks at any measure of variation in species composition between different samples and was calculated based on the Bray Curtis Dissimilarity matrix (Chapter 2, Section 2.5.2.2).

The average distance was calculated between all (N = 529) samples (Mn = 0.51, SD = 0.06), and illustrated as a histogram with a density distribution given in Figure 3.9.
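A minimal Python sketch of the quantity summarised here is given below. It assumes that the "average distance" is, for each sample, the mean Bray-Curtis dissimilarity to every other sample (as the caption of Figure 3.9 suggests). The function names and toy counts are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as
    {otu_id: reads} dictionaries."""
    otus = set(a) | set(b)
    numerator = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    denominator = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return numerator / denominator

def mean_distance_per_sample(table):
    """Mean dissimilarity of each sample to all other samples."""
    names = list(table)
    return {
        s: sum(bray_curtis(table[s], table[t]) for t in names if t != s)
           / (len(names) - 1)
        for s in names
    }

# Hypothetical toy samples, each with 10 reads.
table = {
    "s1": {"OTU_A": 6, "OTU_B": 4},
    "s2": {"OTU_A": 4, "OTU_B": 6},
    "s3": {"OTU_C": 10},  # shares no OTUs with s1 or s2
}
means = mean_distance_per_sample(table)
print(means)
```

A value of 0 means two samples have identical counts and a value of 1 means they share no OTUs, so the cohort mean of 0.51 sits roughly midway between the two extremes.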


[Figure 3.9 plot. Title: Busselton Cohort: Beta Distribution (Bray Curtis Dissimilarity Matrix). Y-axis: Density (0 to 8); x-axis: AverageDistance (0.4 to 0.8).]

Figure 3.9 Histogram of the mean per sample Bray-Curtis dissimilarity. Bray Curtis dissimilarity was calculated on all the samples. A density curve with the mean is overlaid over the histogram.

Next PERMANOVA analysis (using the Adonis function in R) was carried out to identify how much variability within the data could be explained by the clinical traits (see Figure 3.10 for summary flowchart). This was achieved in two steps. Firstly, all clinical variables were tested independently in the Adonis model to see if there were any causing a significant difference. Subsequently significant variables (P < 0.05) were tested together in a multi-variate Adonis model to establish what the collective effect of clinical variables was on variance in the data (significant variables were added together to calculate total degree of variability caused).

Overall, 58 different variables from the sample metadata were selected and tested. Any samples lacking complete metadata were excluded, as the Adonis model does not accept missing (Not Available [NA]) values in the formula (Figure 3.10 Part A). Consequently, thirty-nine samples were removed, leaving 490 samples to be analysed. The list of variables tested can be found in Table 3.8.


Table 3.8 List of the tested clinical variables. These clinical variables were taken directly from the questionnaire.

Disease Medication          Asthma                      Smoking Status
High Blood Pressure         Hay Fever                   Age
Kidney Disease              Bronchitis                  Season
Sinusitis                   Pneumonia                   Sex
Colonic Polyps              Tight Chest                 Nanodrop
Crohn’s Disease             Wheezing                    qPCR Copy Number (CN)
Gallstones                  ppFEV baseline              (Log) qPCR Copy Number
Diabetes                    ppFVC baseline              16S gene: MiSeqRun
Thyroid Disease             Haemoglobin                 Ever-Smoked
Eczema                      White Cell Count            Never-Smoker
Pleurisy                    Monocyte Absolute           Ex-Smoker
Obstructive Sleep Apnoea    High Density Lipoprotein    Current-Smoker
Coeliac Disease             Red Cell Count              Pack Years
Ulcerative Colitis          Corrected WBC Count         House Cat
Chronic Ear Infection       Eosinophil Absolute         House Dog
Osteoporosis                Platelet Count              BMI1 (Value)
Parkinson’s                 Neutrophil Absolute         BMI2 (Category)
Gastric Duodenal Ulcer      Basophils Absolute          Biological Siblings
GERD
Irritable Bowel Syndrome
Diverticular Disease

Individual traits were then tested independently in an Adonis model (Figure 3.10 Part B, formula used: Adonis (beta diversity matrix ~ clinical variable, in sample data)). Fourteen traits came out as significant at P < 0.01 and these are detailed in Table 3.9.


Figure 3.10 Summary flowchart for PERMANOVA analysis. Steps taken to determine the degree of variability caused by clinical variables.


Table 3.9 Significant clinical variables that cause variability at P < 0.01. Adonis test was run on each variable independently, P-value shows significance when tested against beta diversity. R2 gives the amount of variation that can be explained by that particular variable.

Clinical Variable            R2       P-value
16S gene: MiSeqRun           0.127    0.001
(Log) qPCR Copy Number       0.067    0.001
qPCR Copy Number             0.060    0.001
Status                       0.030    0.001
Current-Smoker               0.028    0.001
Nanodrop                     0.028    0.001
Pack Years                   0.017    0.001
White Cell Count (WBC)       0.007    0.002
Corrected WBC Count          0.007    0.002
Never-Smoker                 0.007    0.006
Ever-Smoker                  0.007    0.002
Monocyte Absolute            0.007    0.003
Haemoglobin                  0.006    0.003
ǂppFVC baseline              0.006    0.009

ǂppFVC baseline = percent predicted forced vital capacity baseline.

Significant clinical variables were then selected (Figure 3.10 Part C) but avoiding variables that reflected similar findings e.g. Log qPCR Copy, qPCR CN, Nanodrop and X16SMiSeqRun are all proxies for quantification/bacterial load of samples. Variables selected are highlighted in bold font in Table 3.9.

In order to identify maximum variance in the population, Adonis was run (Figure 3.10 Part D) with all selected variables in one model using the following formula:

Adonis (beta diversity matrix ~ NeverSmoker+ Corrected_WBC_Count +LogqPCRCopy + CurrentSmoker, in sample data)

Depending on the order in which variables were placed into the formula the significance of each variable changed, and this had to be accounted for. Therefore, analysis was repeated with variables being in every possible position in the model. This resulted in 24 different permutations being carried out.
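The count of 24 follows from the 4! possible orderings of four variables. The Python sketch below enumerates them as formula strings; the left-hand name `beta_diversity_matrix` is a hypothetical placeholder, and the strings are only illustrative of what would be passed to the Adonis function in R.

```python
from itertools import permutations

# Variable names as they appear in the model formula above.
variables = ["NeverSmoker", "Corrected_WBC_Count", "LogqPCRCopy", "CurrentSmoker"]

# One formula per ordering of the four terms: 4! = 24 in total.
formulas = [
    "beta_diversity_matrix ~ " + " + ".join(order)
    for order in permutations(variables)
]

print(len(formulas))
print(formulas[0])
```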

Once tested in combination, the selected clinical variables were all significant, and were used in the final model in the same order as seen above (Figure 3.10 Part E).

Due to the large size of this cohort it was difficult to account for all the potential causes of variation, and only 9.7% ([0.007+0.006+0.065+0.019] *100) could be explained by the significant clinical variables, with the majority of variance (90.3%) still remaining undescribed. Results of the final model implemented are given in Table 3.10.

Table 3.10 PERMANOVA analysis: traits contributing to population variance.

Variables               Df     Sums of Squares   F Model   R2      P-value
Never-Smoker            1      0.45              3.64      0.007   0.007
Corrected WBC Count     1      0.41              3.28      0.006   0.003
Log qPCR Copy Number    1      4.37              35.14     0.065   0.001
Current-Smoker          1      1.26              10.12     0.019   0.001
Residuals               485    60.35                       0.901
Total                   489    66.84                       1.000

*Refer to Chapter 2, Section 2.5.2.2 for further details.
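The R2 column in Table 3.10 is each term's sum of squares divided by the total sum of squares, and the explained share quoted in the text is the sum of the four model terms. A small Python check of that arithmetic, using the sums of squares from the table (the variable names in the code are hypothetical), is:

```python
# Sums of squares as reported in Table 3.10.
sums_of_squares = {
    "Never-Smoker": 0.45,
    "Corrected WBC Count": 0.41,
    "Log qPCR Copy Number": 4.37,
    "Current-Smoker": 1.26,
    "Residuals": 60.35,
}

total_ss = sum(sums_of_squares.values())
r2 = {term: ss / total_ss for term, ss in sums_of_squares.items()}

# Variance explained by the clinical variables (everything except residuals).
explained = sum(value for term, value in r2.items() if term != "Residuals")

print(round(total_ss, 2))
print({term: round(value, 3) for term, value in r2.items() if term != "Residuals"})
print(round(100 * explained, 1))
```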


3.3.3 Disease Phenotype

The next step was to establish whether any differences in bacterial composition existed, and what these differences, if any, were, when comparing healthy individuals with those suffering from asthma, diabetes or gastro-esophageal reflux disease (GERD).

Smoking was seen to significantly affect the variance of the population, as has been seen in previous studies 112,113,120,308. Therefore, to make sure the effects of smoking did not interfere with disease phenotypes, all current-smoking individuals (N = 60) were excluded. Consequently, individuals who were non-smoking but were suffering from the selected diseases were compared with healthy non-smoking individuals.

The 469 individuals (having 3,459 OTUs in total) consisted of 71 asthmatic individuals, 14 diabetic individuals, 43 individuals with GERD and 341 healthy non-smoking individuals. It should be noted that one of the asthmatic individuals was also recorded as suffering from COPD; this individual was included in all analyses.

3.3.3.1 Diversity

(A) Disease Phenotype: Alpha Diversity

Alpha diversity was analysed using the four different alpha diversity indices: Observed Richness, Pielou’s evenness, Shannon-Wiener Diversity and Simpson’s Reciprocal Index (Inverse Simpson’s). There was a significant difference within population means in all alpha diversity indices (tested by Kruskal-Wallis [rank-based one-way ANOVA], as the dataset was not normally distributed):

- Observed Richness: χ2(3) = 7.97, P = 0.047
- Pielou’s evenness: χ2(3) = 11.62, P = 0.009
- Shannon’s Diversity: χ2(3) = 13.18, P = 0.004 (Figure 3.11)
- Simpson’s Reciprocal Index: χ2(3) = 11.45, P = 0.009

To establish which individual group means were significantly different, Tukey’s HSD test was carried out. Results revealed a significant difference between healthy versus asthmatic individuals: Pielou’s evenness (P = 0.005), Shannon’s diversity (P = 0.005, Figure 3.11) with asthmatics having a lower level of diversity. No other significant differences were seen for the other diseases compared to healthy individuals or between diseases.


[Figure 3.11 plot. Y-axis: Shannon's Diversity (1.5 to 4.0); x-axis: Asthma, Diabetes, GERD, Healthy. Asterisks (***) mark the significant pairwise difference (TukeyHSD).]

Figure 3.11 Comparison of Shannon’s Diversity by disease phenotype. Shannon’s Diversity showing a significant difference between those with asthma versus healthy individuals, P = 0.005. Boxplot indicates that asthma individuals have a significantly lower bacterial diversity compared to healthy individuals.

(B) Disease Phenotype: Beta Diversity

Differences in beta diversity were explored using Bray Curtis dissimilarity (Chapter 2, Section 2.5.2.2). No significant differences were observed between disease phenotypes through PERMANOVA (Adonis function in R).

(C) Disease Phenotype: Community Phylogenetic Structure

To investigate whether OTUs in a particular group were clustered closer than expected by chance, phylogenetic diversity was measured through two metrics (Chapter 2, Section 2.5.2.3). The Net Relatedness Index (NRI) looks at species found at higher levels (roots) of the phylogenetic tree whilst the Nearest Taxon Index (NTI) focuses on lower levels (tips) of the phylogenetic tree. Analysis was carried out for each individual disease separately and tested for significance (Figure 3.12).
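To make the two indices concrete, the sketch below computes them in Python for a toy community. It assumes the standard definitions, in which NRI and NTI are the negated standardised effect sizes of the mean pairwise distance (MPD) and the mean nearest taxon distance (MNTD) respectively. The null model used here (random draws of the same number of taxa from the pool), the function names, the seed and the six-taxon distance matrix are all hypothetical and are not taken from the thesis methods.

```python
import random
from itertools import combinations
from statistics import mean, stdev

def mpd(taxa, dist):
    """Mean pairwise phylogenetic distance between taxa in a community."""
    return mean([dist[frozenset(pair)] for pair in combinations(taxa, 2)])

def mntd(taxa, dist):
    """Mean distance from each taxon to its nearest relative in the community."""
    return mean([min(dist[frozenset((t, u))] for u in taxa if u != t)
                 for t in taxa])

def standardised_index(metric, community, pool, dist, n_null=999, seed=7):
    """-1 * (observed - mean(null)) / sd(null).

    Positive values indicate clustering; negative values indicate
    over-dispersion relative to the null communities.
    """
    rng = random.Random(seed)
    observed = metric(community, dist)
    null = [metric(rng.sample(pool, len(community)), dist)
            for _ in range(n_null)]
    return -(observed - mean(null)) / stdev(null)

# Hypothetical tree: two clades of three taxa; close within, distant between.
clade = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
dist = {frozenset(pair): 0.2 if clade[pair[0]] == clade[pair[1]] else 1.0
        for pair in combinations(clade, 2)}
pool = list(clade)

community = ["a", "b", "c"]  # all members drawn from a single clade
nri = standardised_index(mpd, community, pool, dist)
nti = standardised_index(mntd, community, pool, dist)
print(nri > 0, nti > 0)
```

Because the toy community sits entirely within one clade, both indices come out positive, which is the "clustered" direction referred to in the text.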

No significant associations were seen for Net Relatedness Index. A significant difference was however noted in the Nearest Taxon Index data (χ2(3) = 23.8, P = 2.75E-05), with GERD individuals having a significantly lower NTI compared to healthy individuals (P = 0.02), meaning the species are more dispersed than expected by chance.

[Figure 3.12 plot. Panel A: y-axis Net Relatedness Index; panel B: y-axis Nearest Taxon Index. X-axis in both panels: Disease (Asthma, Diabetes, GERD, Healthy). Asterisks (***) in panel B mark the significant pairwise difference (TukeyHSD).]

Figure 3.12 Disease status is related to aspects of community relatedness. (A) Box plots illustrating the Net Relatedness Index (NRI). (B) Nearest Taxon Index (NTI). A significant decrease in phylogenetic relatedness by NTI was seen in GERD individuals compared to healthy controls.

3.3.3.2 Community Structure

As alpha diversity had revealed a significant difference between asthmatic and healthy individuals (Section 3.3.3.1), a detailed examination into the composition and structure of the bacterial communities was carried out. Two approaches were used: indicator species analysis and DESeq2 (Differential Expression Analysis for Sequence Count Data, Chapter 2, Section 2.5.3 for further details).


Indicator Species Analysis was conducted using rarefied data to establish whether there were any OTUs that were characteristic of a particular disease group.

Differential Expression Analysis for Sequence Count Data (DESeq2 function in R) analysis was carried out to compare each disease group to the healthy group. This was done to see if there were any OTUs that were differentially abundant between selected groups. As this method standardizes abundance expectations by an estimate of sequencing depth, un-rarefied data was used, which is accounted for through the inbuilt false positive control rate within the function.

(A) Disease Phenotype: Indicator Species Analysis

For this analysis a significance level of α = 0.05 was used, whilst the minimum positive predictive value for an OTU to be indicative of a phenotype was set at 0.6 (‘A’ threshold) and the minimum sensitivity of the OTU being seen across the whole dataset was also set at 0.6 (‘B’ threshold). This way only OTUs that were predictive of a particular disease and were not common across the dataset were identified. A total of 469 individuals with 3,459 OTUs were analysed.

A single significant OTU was found. Veillonella_19966 was identified as an indicator of diabetes (A = 0.75, B = 0.64, P = 0.005) and was not frequently found across the rest of the population. This OTU was rare, being present in 26% of the population but with low abundance (595 reads, equating to 0.02% of total reads).
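The ‘A’ and ‘B’ components can be illustrated with the Python sketch below. It assumes the commonly used group-equalised indicator value definitions, where A is the mean abundance in the target group divided by the sum of the group means, and B is the fraction of target-group samples in which the OTU is present. The function name, sample labels and read counts are hypothetical, and the permutation test that gives the P-value is not shown.

```python
def indicator_components(abundance, groups, target):
    """Return (A, B) for one OTU and one target group.

    abundance: {sample_id: reads of the OTU}
    groups:    {sample_id: group label}
    Assumption: group-equalised definitions of A and B.
    """
    by_group = {}
    for sample, label in groups.items():
        by_group.setdefault(label, []).append(abundance.get(sample, 0))
    group_means = {g: sum(v) / len(v) for g, v in by_group.items()}
    a = group_means[target] / sum(group_means.values())
    in_target = by_group[target]
    b = sum(1 for reads in in_target if reads > 0) / len(in_target)
    return a, b

# Hypothetical toy data: four diabetic and six healthy samples.
groups = {"d1": "Diabetes", "d2": "Diabetes", "d3": "Diabetes", "d4": "Diabetes",
          "h1": "Healthy", "h2": "Healthy", "h3": "Healthy",
          "h4": "Healthy", "h5": "Healthy", "h6": "Healthy"}
abundance = {"d1": 8, "d2": 6, "d3": 10, "d4": 0, "h1": 2, "h3": 4}

a, b = indicator_components(abundance, groups, "Diabetes")
passes = a >= 0.6 and b >= 0.6  # thresholds used in the text
print(round(a, 3), b, passes)
```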

(B) Disease Phenotype: DESeq2

For this analysis, each disease was compared to healthy non-smoking individuals. As the analysis was conducted on un-rarefied data, the analysis involved 488 individuals with 4,216 observed OTUs. Full listings of the significant OTUs identified can be found in Appendix 3.2 with key OTUs being highlighted below.

Asthmatics versus Healthy Individuals

Comparing asthmatic to healthy individuals, there were 26 OTUs (from 14 unique genera) significant at an adjusted P < 0.05 (Figure 3.13, Appendix 3.2.1). Twenty-five of the OTUs were at lower levels in asthmatic individuals and were members of the genera: Bergeyella (1), Capnocytophaga (3), Haemophilus (1), Leptotrichia (1), Megasphaera (1), Mycoplasma (1), Porphyromonas (1), Prevotella (7), Selenomonas (5), Streptococcus (2), Tannerella (1) and Veillonella (1).

Rothia_13982 (Genus: Rothia) was the only OTU highlighted as occurring at a higher level in asthmatics compared to healthy individuals.

Diabetics versus Healthy Individuals

Comparing diabetics to healthy individuals, only 11 OTUs showed significant differences at the adjusted P < 0.05 (Appendix 3.2.2). Nine OTUs existed at a lower level in diabetics compared to healthy individuals and these OTUs were members of the genera: Capnocytophaga (3), Leptotrichia (1), Neisseria (1), Prevotella (1) and Veillonella (1). Two OTUs were of unknown genus: RF9_12187 (Phylum: Tenericutes, Class: Mollicutes, Genus: unknown) and Candidate_division_SR1_12158 (Phylum: Candidate_division_SR1).

There were two members from the Streptococcus genus found to be at significantly higher levels in diabetics when compared to healthy individuals. These OTUs had also been identified (although not at a significant level so data not shown) as indicators of the diabetes group through indicator species analysis.

GERD versus Healthy Individuals

Comparing individuals with GERD to the healthy group, there were 23 OTUs that were significantly lower in individuals with GERD (adjusted P < 0.05, Appendix 3.2.3). These OTUs belonged to the genera: Bergeyella (1), Burkholderia (1; a known contaminant 119, so despite the stringent filtering in Section 3.2.3.4 it was presumed that not all possible contaminants had been removed), Capnocytophaga (3), Haemophilus (1), Leptotrichia (1), Neisseria (1), Oribacterium (1), Prevotella (10), Ruminococcus (1) and Streptococcus (2), with one OTU of unknown genus (RF9_12187).

There were two Streptococcal OTUs found to be higher in GERD individuals than in healthy individuals (P = 0.01).


Figure 3.13 DESeq analysis: asthmatic vs. healthy non-smoking Individuals. Coloured OTUs were identified to have significant changes (adjusted P < 0.005). The sizes of the points indicate the abundance of those taxa in the whole data set. OTUs are coloured by their genus level (see key).


3.3.4 Smoking Phenotype

The last aim addressed with the data set was to establish if any differences in bacterial composition exist when comparing smoking and non-smoking individuals.

As highlighted previously (Section 3.2.1 above) all individuals had completed a detailed questionnaire (Appendix 2) allowing data on the total number of smoking years, number of years since quitting smoking and pack years to be determined for each of the 529 individuals.

A total of 253 individuals had never smoked, 216 were ex-smokers and 60 were current-smokers. Collectively there were 4,005 OTUs, with certain OTUs being present across more than one smoking phenotype (Figure 3.14).

Figure 3.14 Venn diagram of OTU counts for the three smoking phenotypes. The smoking group had 1,187 OTUs across 60 samples; the ex-smoking group had 2,551 OTUs in 216 samples whilst the never-smoking individuals had 2,802 OTUs in 253 samples.


3.3.4.1 Diversity

(A) Smoking Phenotype: Alpha Diversity

As in the earlier analyses, alpha diversity was analysed using the four different alpha diversity indices: Observed Richness, Pielou’s evenness, Shannon-Wiener Diversity and Simpson’s Reciprocal Index (Inverse Simpson’s), with the Kruskal-Wallis test (rank-based one-way ANOVA) applied in light of the dataset not being normally distributed. A significant difference within the population means for all the alpha diversity indices was found:

- Observed Richness: χ2(2) = 14.4, P = 7.33E-04
- Pielou’s evenness: χ2(2) = 25.9, P = 2.43E-06
- Shannon’s Diversity: χ2(2) = 26.0, P = 2.27E-06 (Figure 3.15)
- Simpson’s Reciprocal Index: χ2(2) = 22.8, P = 1.10E-05

[Figure 3.15 plot. Y-axis: Shannon's Diversity (1 to 4); x-axis: ExSmoking, NeverSmoking, Smoking. Asterisks (***) mark significant pairwise differences (Tukey HSD).]

Figure 3.15 Comparison of Shannon’s Diversity by smoking phenotype. Shannon’s Diversity showing significant differences when comparing the smoking and non-smoking individuals, P = 0.005. The boxplot indicates that the smoking individuals have a significantly lower bacterial diversity compared to the non-smoking individuals.


To establish which individual group mean was significantly different, a multiple comparison of phenotypes was carried out using Tukey’s HSD test. This revealed that the never-smoking and the ex-smoking individuals displayed a similar alpha diversity, with no significant differences found in any of the four diversity indices (Table 3.11: Column labelled Ex-Smoking/Never-smoking).

Smoking individuals, however, displayed a significant difference when compared to non-smoking individuals (never-smokers and ex-smokers) for all diversity indices. The significantly lower Observed Richness (P = 1.25E-03) indicates that smokers have fewer species present in comparison, whilst the significantly lower Pielou’s evenness (P = 7.85E-07) reflects a more unequal community structure, dominated by fewer species. Overall diversity of smokers was much lower when compared to non-smoking individuals (Shannon’s diversity P = 1.80E-05, Figure 3.15; Simpson’s Reciprocal Index P = 9.49E-07).

Table 3.11 Alpha diversity across smoking phenotypes (Tukey’s HSD test P-values).

                              Ex-Smoking/      Smoking/         Smoking/
                              Never-smoking    Never-Smoking    Ex-Smoking
Observed Richness             6.69E-01         1.25E-03         9.48E-03
Pielou's Evenness             8.74E-01         7.85E-07         5.62E-06
Shannon's Diversity           9.85E-01         1.80E-05         4.00E-05
Simpson’s Reciprocal Index    9.52E-01         9.49E-07         3.69E-06

(B) Smoking Phenotype: Beta Diversity

Considering beta diversity, the smoking phenotype accounted for 2.9% of variation within the data (R2 = 0.029, P = 0.001) when compared to non-smoking individuals.

(C) Smoking Phenotype: Community Phylogenetic Structure

To investigate whether OTUs in a particular group were either clustered or over-dispersed more than expected by chance, phylogenetic diversity was measured once more using Net Relatedness Index (NRI) and Nearest-taxon index (NTI) (Chapter 2, Section 2.5.2.3, Figure 3.16). Variances in population means were assessed using Kruskal Wallis test and Tukey’s Honest Significant Difference test.


There was a significant association found in NRI data (χ2(2) = 14.88, P = 5.88E-04). When comparing individual groups there was a significant difference seen between smoking versus never-smoking (P = 0.000007) individuals, as well as smoking versus ex-smoking individuals (P = 0.0001, Figure 3.16A). Higher NRI score (z-values) means phylogenetic composition of smokers at the roots of a phylogenetic tree cluster together more closely than expected by chance.

[Figure 3.16 plot. Panel A: y-axis Net Relatedness Index; panel B: y-axis Nearest Taxon Index. X-axis in both panels: NeverSmoking, ExSmoking, Smoking. Asterisks (***) mark significant pairwise differences (TukeyHSD).]

Figure 3.16 Community phylogenetic structure: smoking status. (A) Box plots illustrating Net Relatedness Index (NRI); the significant increase in the z value of the smoking individuals indicate clustering of the clades at the higher levels of the phylogenetic tree than expected. (B) Nearest Taxon Index (NTI) displays higher z values for never-smoking individuals when compared to ex and current-smoking individuals. A significant increase seen in never-smoking individuals indicates clustering of clades at the lower levels of the phylogenetic tree than would be expected.

A significant difference in the NTI data (χ2(2) = 11.65, P = 0.003) was also found. When individual groups were compared, a significant difference (although to a smaller degree than for the NRI results) was seen between never-smoking and ex-smoking individuals (P = 0.04). The phylogenetic composition of never-smokers displayed a higher NTI (z-values), as seen in Figure 3.16B, which reflects clustering at the tips of the phylogenetic tree compared to the ex- and current-smoking individuals.


3.3.4.2 Community Structure Next, in light of the alpha diversity findings for smoking phenotypes (Section 3.3.4.1A), detailed examination into bacterial composition was carried out. Here, three approaches were used: relative abundance, indicator species analysis and DESeq2 (Differential Expression Analysis for Sequence Count Data).

(A) Smoking Phenotype: Relative Abundance A relative abundance bar chart was constructed by separating the data by smoking phenotypes: never-smoking, ex-smoking and current-smoking. In light of the NRI results showing clustering of clades at the roots of the phylogenetic tree of smoking individuals (Section 3.3.4.1C), the fifty most abundant OTUs (representing 97% of data) were agglomerated to phylum and genus levels (Figure 3.17), in order to explore bacterial composition at the roots and tips of the phylogenetic tree.

The most common OTUs fell into 5 phyla (Figure 3.17A) with smokers exhibiting a shift in proportions when compared to non-smoking individuals, in this way confirming a different phylogenetic composition at the roots of the phylogenetic tree. This reflects the higher NRI scores (z-values) that were observed and indicates bacterial clustering closer together than expected by chance.

There were 25 different genera (Figure 3.17B) representing the most common OTUs. Smokers displayed a decrease in most genera and seemed to be dominated mainly by Streptococcus and Veillonella, which both showed a substantial increase in comparison to non-smoking individuals. This mirrors results seen in the alpha diversity (Section 3.3.4.1A) where smokers displayed a lower observed richness and evenness.
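The agglomeration step described above can be sketched as follows: the most abundant OTUs are selected, their reads are summed per taxon within each smoking group, and each group is scaled to proportions. This is a minimal illustration under assumed inputs; the OTU identifiers, sample names, read counts and function names are hypothetical and do not reproduce the thesis data.

```python
from collections import defaultdict


def top_otus(counts, n):
    """Return the n OTU ids with the largest total read count across samples."""
    totals = {otu: sum(per_sample.values()) for otu, per_sample in counts.items()}
    return sorted(totals, key=totals.get, reverse=True)[:n]


def relative_abundance_by_taxon(counts, taxonomy, groups, keep):
    """Sum reads per (group, taxon) over the kept OTUs and scale each group to 1."""
    summed = defaultdict(lambda: defaultdict(float))
    for otu in keep:
        for sample, reads in counts[otu].items():
            summed[groups[sample]][taxonomy[otu]] += reads
    result = {}
    for group, by_taxon in summed.items():
        total = sum(by_taxon.values())
        result[group] = {taxon: reads / total for taxon, reads in by_taxon.items()}
    return result


# Hypothetical toy data: reads per OTU per sample.
counts = {
    "otu_1": {"s1": 60, "s2": 10},
    "otu_2": {"s1": 20, "s2": 30},
    "otu_3": {"s1": 20, "s2": 60},
    "otu_4": {"s1": 1, "s2": 1},
}
taxonomy = {
    "otu_1": "Streptococcus",
    "otu_2": "Streptococcus",
    "otu_3": "Neisseria",
    "otu_4": "Rothia",
}
groups = {"s1": "Smoking", "s2": "NeverSmoking"}

keep = top_otus(counts, 3)
profile = relative_abundance_by_taxon(counts, taxonomy, groups, keep)
print(profile)
```

The same function applies at phylum level if the taxonomy mapping returns phylum names instead of genus names.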

[Figure 3.17 panels: bar charts of relative abundance (y-axis, 0% to 100%) for the NeverSmoking, ExSmoking and Smoking groups. Phylum key: Actinobacteria, Bacteroidetes, Firmicutes, Fusobacteria, Proteobacteria. Genus key: Actinomyces, Atopobium, Bifidobacterium, Capnocytophaga, Fusobacterium, Gemella, Granulicatella, Haemophilus, Johnsonella, Lachnoanaerobaculum, Leptotrichia, Megasphaera, Neisseria, Oribacterium, Parvimonas, Porphyromonas, Prevotella, Rothia, Selenomonas, Solobacterium, Stomatobaculum, Streptococcus, Unidentified_Firmicutes, Veillonella.]

Figure 3.17 Fifty most abundant OTUs present in the data: smoking phenotypes. Top 50 OTUs included account for 97% of overall data. (A) OTUs agglomerated at phylum level: each section colour coded different phyla. (B) OTUs agglomerated and colour coded according to genus level.


(B) Smoking Phenotype: Indicator Species Analysis Using rarefied data, Indicator Species Analysis was carried out to establish whether any OTUs were associated with particular smoking phenotypes. Analysis included a stringent set of parameters, with P = 0.05 and α = 0.05 level as well as A = 0.6, B = 0.6. Keeping both ‘A’ and ‘B’ parameters high was done to make sure that identified OTUs were significantly indicative of particular smoking phenotypes and were scarcely found across the rest of the data.

Pasteurellaceae_11461 was an indicator of never-smoking individuals (A = 0.6, B = 0.8, Stats = 0.71, P = 0.005), although not a very abundant OTU (9,528 reads, equating to 0.3% of total reads), it was present across 82% of the samples. The second indicator OTU, Bifidobacterium_13861, was associated with the smoking individuals (A = 0.8, B = 0.6, Stats = 0.71, P = 0.005). This OTU had low abundance (2,874 reads, making up only 0.08% of total reads) and was present in 24% of samples. This result however does reflect the increased relative abundance of Bifidobacterium genus seen previously in the smoking individuals (Section 3.3.4.2).
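The ‘A’ and ‘B’ components reported above can be illustrated with a minimal sketch. It assumes the group-equalised indicator value formulation, in which A is the mean abundance of the OTU in the target group divided by the sum of the group means, B is the fraction of target-group samples in which the OTU is present, and the reported statistic is the square root of A × B. The permutation test that yields the P-value is omitted, and the sample names and counts are hypothetical.

```python
import math


def indicator_value(abundance, groups, target):
    """Return (A, B, stat) for one OTU and one target group.

    A: mean abundance in the target group over the sum of group means.
    B: fraction of target-group samples in which the OTU is present.
    stat: square root of A * B.
    """
    by_group = {}
    for sample, value in abundance.items():
        by_group.setdefault(groups[sample], []).append(value)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    a = means[target] / sum(means.values())
    b = sum(1 for v in by_group[target] if v > 0) / len(by_group[target])
    return a, b, math.sqrt(a * b)


# Hypothetical rarefied counts for a single OTU in six samples.
abundance = {"s1": 8, "s2": 0, "s3": 4, "s4": 1, "s5": 1, "s6": 2}
groups = {
    "s1": "Smoking", "s2": "Smoking", "s3": "Smoking",
    "s4": "NeverSmoking", "s5": "NeverSmoking", "s6": "NeverSmoking",
}

a, b, stat = indicator_value(abundance, groups, "Smoking")
print(a, b, stat)
```

Requiring both A and B to be at least 0.6, as in the analysis above, keeps only OTUs that are both concentrated in the target group and found in most of its samples.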

(C) Smoking Phenotype: DESeq2 Using DESeq2, three comparisons investigating community structures were carried out: smoking versus never-smoking (control), ex-smoking versus never-smoking (control) and smoking versus ex-smoking (control). Due to the structure of this analysis, un-rarefied data was used. This brought the total number of individuals up to 549 with 4,218 observed OTUs. Minimum reads per sample was 274 and maximum reads per sample was 323,895. Normalisation and correction for false discovery rate were handled within the DESeq function. Full details of all significant OTUs identified are given in Appendix 3.2.
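To make the two steps handled inside the DESeq function more concrete, the sketch below shows a median-of-ratios size factor calculation and a Benjamini-Hochberg adjustment, which are the defaults documented for DESeq2. It is an illustration only and not the package's implementation: the negative binomial model fitting is not shown, and the toy counts, P-values and function names are hypothetical.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors; counts maps OTU id to reads per sample."""
    n_samples = len(next(iter(counts.values())))
    # Only OTUs with non-zero counts in every sample have a usable geometric mean.
    log_geo_means = {
        otu: sum(math.log(c) for c in row) / n_samples
        for otu, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(counts[otu][j]) - lgm for otu, lgm in log_geo_means.items()
        ]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical toy input: the second sample was sequenced twice as deeply.
toy_counts = {"otu_1": [10, 20], "otu_2": [100, 200], "otu_3": [0, 5]}
factors = size_factors(toy_counts)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(factors, adjusted)
```

Dividing each sample's counts by its size factor places samples of very different sequencing depth on a common scale, which is why rarefaction is not required for this analysis.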

Smokers versus never-smokers

Comparing smokers versus never-smokers, there were 151 significant OTUs at P < 0.001 (Appendix 3.2.4); abundance of 52 OTUs was increased and of 99 OTUs decreased in smokers (Figure 3.18). OTUs displaying a higher abundance in smokers belonged to 10 different genera: Actinomyces (1), Bifidobacterium (Bifidobacterium_13861 as identified by the indicator species analysis above [Section 3.3.4.2B]), Cryptobacterium (1), Howardella (1), Lactobacillus (2), Prevotella (2), Rothia (3), Streptococcus (40) and Veillonella (1). The OTUs at a lower abundance in the smokers belonged to 25 different genera. Genera that contained 5 or more different OTUs were: Haemophilus (14), Leptotrichia (8), Neisseria (23), Prevotella (13) and Veillonella (5).

The genera containing fewer than 5 OTUs were: Alysiella (1), Bergeyella (2), Blautia (1), Campylobacter (1), Capnocytophaga (3), Cardiobacterium (1), Catonella (1), Fusobacterium (1), Gemella (2), Johnsonella (1), Lachnoanaerobaculum (3), Lautropia (1), Oribacterium (2), Peptococcus (2), Peptostreptococcus (3), Porphyromonas (4), Stomatobaculum (2), Streptococcus (2) and Tannerella (1). Two OTUs (Candidate_division_SR1 and RF9_12187) could not be assigned to a genus and were therefore classed as Unknown.

Smokers versus ex-smokers

Comparing smoking individuals to ex-smokers at P < 0.001 (Appendix 3.2.5), there were 41 OTUs of higher abundance in smokers, 37 of which were the same OTUs that were in higher abundance in the smokers when compared to never-smokers (Figure 3.19). The four additional OTUs were Bifidobacterium_13401, Bifidobacterium_13785, Dialister_19867 and Streptococcus_6617. Seventy-three OTUs were less abundant in smokers compared to ex-smokers; the genera with the largest OTU membership were: Haemophilus (11 OTUs), Leptotrichia (5), Neisseria (20) and Prevotella (8).

Genera containing fewer than 8 OTUs included: Alysiella (1), Bergeyella (1), Campylobacter (1), Capnocytophaga (3), Catonella (1), Fusobacterium (1), Gemella (1), Johnsonella (1), Lachnoanaerobaculum (1), Mogibacterium (1), Oribacterium (2), Peptococcus (1), Peptostreptococcus (3), Porphyromonas (3), Stomatobaculum (2), Streptococcus (2), Tannerella (1) and Veillonella (2).

[Figure 3.18 plot: "Never Smoking vs Smoking Individuals". x-axis: Log 2 Fold Change; y-axis: −Log10 FDR Corrected P Value. Point size key: Abundance (1e+06 to 5e+06); significance key: p < 0.001, p < 0.01, p < 0.05; colour key: Significant Genera. Individual points are labelled with OTU identifiers.]

Figure 3.18 DESeq analysis: never-smoking versus smoking individuals. Coloured OTUs were identified to have significant changes (adjusted P < 0.005). The sizes of the points indicate the abundance of those taxa in the whole data set. OTUs are coloured by their genus level (see key).

[Figure 3.19 plot: "Ex Smoking vs Smoking Individuals". x-axis: Log 2 Fold Change; y-axis: −Log10 FDR Corrected P Value. Point size key: Abundance (1e+06 to 5e+06); significance key: p < 0.001, p < 0.01, p < 0.05; colour key: Significant Genera. Individual points are labelled with OTU identifiers.]

Figure 3.19 DESeq analysis: ex-smoking versus smoking individuals. Coloured OTUs were identified to have significant changes (adjusted P < 0.005). The sizes of the points indicate the abundance of those taxa in the whole data set. OTUs are coloured by their genus level (see key).

Ex-smokers versus never-smokers

As the prior analyses had shown that never-smoking and ex-smoking individuals were very similar to each other (Section 3.3.4.1), a comparison between the two groups was carried out. Out of 19 OTUs (at an adjusted P < 0.05, Appendix 3.2.6), 13 OTUs were at a higher level of abundance in ex-smokers compared to never-smokers and these were assigned to the genera: Capnocytophaga (1 OTU), Catonella (1), Corynebacterium (1), Cryptobacterium (1), Desulfobulbus (1: potential contaminant), Dialister (1), Filifactor (1), Haemophilus (1), Porphyromonas (1) and Streptococcus (4).

Seven OTUs occurred at a lower abundance in ex-smokers compared to never-smoking individuals, one OTU of Unknown genus (RF9_12187) with the others belonging to: Bergeyella (1), Granulicatella (1), Haemophilus (1), Leptotrichia (2) and Neisseria (1).

3.3.5 Ex-Smoking Individuals There was a highly significant difference noted between smoking versus non-smoking individuals in bacterial burden (Section 3.2.2, 3.2.3.3), alpha and beta diversities (Section 3.3.4.1) as well as OTU level differences (Section 3.3.4.2). Such marked differences were not seen when comparing ex-smoking versus never-smoking individuals.

From the patient metadata, it was possible to calculate the number of years since quitting smoking for the ex-smokers, affording an opportunity to investigate changes in the microbiome after cessation of smoking.

Using the information provided in the questionnaire (Section 3.2.1 and Appendix 2), the number of years since quitting smoking was used to sub-group the ex-smoking individuals (Table 3.12). For one ex-smoker there was no information on when they had quit smoking, and this individual was therefore excluded from further analysis.

Table 3.12 Number of ex-smoking individuals. Cohort sub-grouped based on years since smoking cessation.

Current    0<10     10<20    20<30    30+      Never
Smoker     Years    Years    Years    Years    Smoker
60         33       51       76       55       253


First, alpha diversity was compared between the four different ex-smoking sub-groups as well as the smoking (N = 60) and never-smoking (N = 253) individuals. Significant differences (Kruskal Wallis test) were observed in all alpha diversity metrics:
- Observed Richness: χ2(5) = 15.5, P = 8.5E-03
- Pielou’s Evenness: χ2(5) = 26.6, P = 6.7E-05
- Shannon’s diversity: χ2(5) = 26.4, P = 7.5E-05
- Inverse Simpsons diversity: χ2(5) = 23.23, P = 3.4E-04.

Using Tukey’s Honest Significant Difference test, current-smokers were found to be significantly different from the four ex-smoking subgroups and never-smoking individuals (see Table 3.13, Figure 3.20). No other significant differences were identified.
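For reference, the four alpha diversity indices compared above can be computed from a vector of OTU counts as in the following minimal sketch. It assumes natural logarithms for Shannon’s diversity and the usual definitions of Pielou’s evenness (Shannon divided by the log of richness) and the inverse Simpson index; the toy count vectors and the function name are hypothetical.

```python
import math


def alpha_diversity(counts):
    """Return richness, Shannon, Pielou evenness and inverse Simpson for one sample.

    Assumes at least one non-zero count.
    """
    present = [c for c in counts if c > 0]
    total = sum(present)
    props = [c / total for c in present]
    richness = len(present)
    shannon = -sum(p * math.log(p) for p in props)
    # Pielou's evenness is undefined when only one taxon is present.
    evenness = shannon / math.log(richness) if richness > 1 else float("nan")
    inverse_simpson = 1.0 / sum(p * p for p in props)
    return {
        "richness": richness,
        "shannon": shannon,
        "evenness": evenness,
        "inverse_simpson": inverse_simpson,
    }


# Two hypothetical samples with equal richness but different evenness.
even = alpha_diversity([25, 25, 25, 25])
dominated = alpha_diversity([97, 1, 1, 1])
print(even)
print(dominated)
```

The two toy samples have the same observed richness, yet the sample dominated by a single taxon has lower Shannon, evenness and inverse Simpson values, which is the pattern described for current-smokers.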

Table 3.13 P-values from Tukey’s HSD test of the alpha diversity indices. Comparing the smoking individuals to the number of years after ceasing of smoking and never-smoking individuals. There were no significant differences seen between the different ex-smoking categories or in comparison with the never-smoking individuals (data not shown).

                                   Observed    Pielou’s    Shannon's    Inverse
                                   Richness    Evenness    Diversity    Simpsons
Current-Smoker vs. 0<10 Years      7.29E-01    4.64E-03    1.43E-02     5.67E-02
Current-Smoker vs. 10<20 Years     2.72E-01    3.19E-03    4.93E-03     2.59E-02
Current-Smoker vs. 20<30 Years     1.95E-01    9.70E-04    1.35E-03     2.30E-03
Current-Smoker vs. 30+ Years       2.46E-02    1.06E-03    5.76E-04     2.14E-03
Current-Smoker vs. Never-Smoker    5.86E-03    4.86E-06    4.08E-06     9.14E-05

[Figure 3.20 plot: Shannon's Diversity (y-axis, 1.5 to 4.0) for the Current Smoking, 0−10, 10−20, 20−30, 30+ and NeverSmoking groups; *** marks significance (Tukey HSD).]

Figure 3.20 Shannon’s Diversity: smoking phenotype (ex-smoking sub-grouped). Significant differences were observed when comparing current-smokers to those who quit (Table 3.13 for P-values).

Beta diversity of the population was analysed using the Adonis function (calculated based on Bray Curtis dissimilarity matrix). The results were significant (R2 = 0.04, P = 0.001) when comparing the different ‘cessation bins’ with current and never-smoking individuals. However, the significant difference observed was due to the bacterial composition of smokers being highly different to the non-smokers as the NMDS plot showed little separation between the different ‘cessation bins’ and never-smoking (data not shown).
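The quantities reported by this type of analysis can be illustrated with a minimal sketch: Bray Curtis dissimilarity between samples, the share of variation (R2) attributed to a grouping, computed from sums of squared dissimilarities, and a label-permutation P-value. This is an illustration under assumed inputs rather than the Adonis implementation; the four toy samples, the group labels and the function names are hypothetical. With 999 permutations the smallest attainable P-value is 0.001.

```python
import random
from itertools import combinations


def bray_curtis(u, v):
    """Bray Curtis dissimilarity between two abundance vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))


def permanova_r2(samples, labels):
    """Share of total squared-dissimilarity variation explained by the grouping."""
    n = len(samples)
    d2 = {
        (i, j): bray_curtis(samples[i], samples[j]) ** 2
        for i, j in combinations(range(n), 2)
    }
    ss_total = sum(d2.values()) / n
    ss_within = 0.0
    for g in set(labels):
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(d2[p] for p in combinations(members, 2)) / len(members)
    return 1.0 - ss_within / ss_total


def permutation_p(samples, labels, n_perm=999, seed=1):
    """P-value from shuffling group labels; its minimum is 1 / (n_perm + 1)."""
    rng = random.Random(seed)
    observed = permanova_r2(samples, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if permanova_r2(samples, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two samples per group, three OTUs.
samples = [[10, 0, 0], [9, 1, 0], [0, 10, 0], [1, 9, 0]]
labels = ["Smoking", "Smoking", "NeverSmoking", "NeverSmoking"]

r2 = permanova_r2(samples, labels)
p_value = permutation_p(samples, labels)
print(r2, p_value)
```

In the toy data the two groups share almost no OTUs, so R2 is close to 1; values of a few percent, as reported here, indicate that most of the variation lies between individuals within groups.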

Indicator species analysis showed similar results to the prior findings (Section 3.3.4.2B) with Bifidobacterium_13861 being significant for smoking individuals. There were no OTUs identified as significant in ex-smoking individuals.

Relative abundance of the top 50 most abundant OTUs was plotted to see how different taxa proportions changed over time. Figure 3.21 shows the complete dataset: current-smokers, followed by ex-smokers divided into four ‘cessation bins’ (Table 3.12) and the never-smoking individuals. As seen for the alpha diversity, the most marked difference was between current-smokers and the most recent ex-smokers (0 < 10 years group).

Subsequently, recent ex-smokers (0 < 10 years) were subdivided further (Table 3.14) and compared to current-smokers, with the aim of identifying the time scale over which the bacterial composition comes to most closely resemble the pre-smoking state.

Table 3.14 Ex-smokers: cessation ≤ 10 years further sub divided. The group of 33 ex-smokers who ceased smoking ten years ago or less were further subdivided into 9 groups.

Current    0<1      1<2      2<3      3<4      4<5      5<6      6<7      7<8      8-10
Smoker     Years    Years    Years    Years    Years    Years    Years    Years    Years
60         3        5        6        1        4        3        2        6        3

None of the alpha diversity indices or beta diversity showed any significant difference when comparing the current-smokers to recent ex-smokers (0 < 10 years group). The very low numbers within each of the recent ex-smoking categories are likely to be driving this lack of definitive difference.

[Figure 3.21 plot: bar charts of relative abundance (y-axis, 0% to 100%) for the Current Smoking, 0−10, 10−20, 20−30, 30+ and NeverSmoking groups. Genus key: Actinomyces, Atopobium, Capnocytophaga, Fusobacterium, Gemella, Granulicatella, Haemophilus, Johnsonella, Lachnoanaerobaculum, Leptotrichia, Megasphaera, Neisseria, Oribacterium, Parvimonas, Peptococcus, Porphyromonas, Prevotella, Rothia, Selenomonas, Solobacterium, Stomatobaculum, Streptococcus, Unidentified_Firmicutes, Veillonella.]

Figure 3.21 Fifty most abundant OTUs (ex-smoking sub-grouped). The top most abundant 50 OTUs included account for 97% of the overall data. Bar charts for each group showing relative abundances of OTUs agglomerated at the genus level.


3.3.6 Smoking Pack Years As smoking was found to have a significant effect on the bacterial composition of airways, a closer look at how the number of cigarettes and total time of smoking impacted on the microbial community in current-smoking individuals (N = 60) was conducted. Number of smoking pack years had previously been calculated (included in demographic details Table 3.7) using the parameters: number of cigarettes per day, and number of smoking years. Number of cigarettes was divided by 20 (average number of cigarettes in a pack), and then multiplied by total number of smoking years. For the current-smoking group, the mean number of pack years was 31.6 (SD = 25.6).
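The pack-year calculation described above amounts to a single expression, sketched here for clarity. The function and argument names are illustrative; the pack size of 20 cigarettes is the value stated in the text.

```python
def pack_years(cigarettes_per_day, years_smoked, cigarettes_per_pack=20):
    """Packs smoked per day multiplied by the number of years smoked."""
    return cigarettes_per_day / cigarettes_per_pack * years_smoked


# One pack a day for 10 years gives 10 pack years;
# half a pack a day for 30 years gives 15 pack years.
print(pack_years(20, 10))
print(pack_years(10, 30))
```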

A significant negative correlation was seen between Shannon’s diversity and number of pack years (Spearman’s correlation, rho = -0.34, P = 0.008) (Figure 3.22A). Additionally, significant association between pack years and alpha diversity was seen through a generalised linear regression model (adjusted P = 0.0003), where both age and sex were included as covariates (not significant) (Figure 3.22B).
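Spearman’s correlation is the Pearson correlation of the ranked values, which the following minimal sketch computes with average ranks for ties. Because it depends only on ranks, the coefficient is unchanged by a log10 transformation of pack years, so the transformation used for plotting does not affect rho. The toy vectors and function names are hypothetical, and the P-value calculation is omitted.

```python
import math


def ranks(values):
    """Average ranks (1-based), sharing the mean rank among tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: diversity falling as pack years rise.
toy_pack_years = [2.0, 5.0, 12.0, 30.0, 60.0]
toy_shannon = [3.4, 3.1, 3.2, 2.6, 2.2]

rho_raw = spearman_rho(toy_pack_years, toy_shannon)
rho_log = spearman_rho([math.log10(v) for v in toy_pack_years], toy_shannon)
print(rho_raw, rho_log)
```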

Figure 3.22 Assessing effects of pack years on Shannon’s Diversity.

(A) Scatter plot looking at Spearman’s correlation between number of Pack Years (log10 transformed) against Shannon’s diversity.

(B) Generalized linear regression model comparing Shannon’s Diversity against number of Pack Years (log10 transformed), taking account of both age and sex as covariates. In both cases 95% confidence intervals are illustrated by grey shadow.


Alpha diversity analysis showed that smoking individuals had a lower species diversity, with dominance by fewer bacteria. On examining the bacterial composition of current-smokers, there was a significant increase in the Streptococcus genus, which was also confirmed by the DESeq2 results. Further analysis therefore sub-selected the data to focus only on the Streptococcus genus and investigated the relationship of this genus to smoking pack years.

Significant positive correlation was seen between number of pack years and bacterial abundance (Spearman’s correlation, rho = 0.34, P = 0.008) (Figure 3.23A). Additionally, significant association between pack years and bacterial abundance was seen through a generalised linear regression model (adjusted P = 0.0002, age and sex included as covariates [not significant]) (Figure 3.23B).

Figure 3.23 Assessing effects of Pack Years on abundance of Streptococci Genus.

(A) Scatter plot looking at Spearman’s correlation between number of Pack Years (log10 transformed) against abundance of Streptococcal OTUs (log10 transformed).

(B) Generalized linear regression model comparing abundance of Streptococcal OTUs (log10 transformed) against Pack Years (log10 transformed), taking account of both age and sex as covariates. In both cases 95% confidence intervals are illustrated by grey shadow.


3.4 Discussion

The aims of this chapter were to:
1. Use 16S rRNA gene amplicon sequencing to profile the bacterial community of airways from a population sample.
2. Compare the airway bacterial community of healthy individuals to those with disease: asthma, diabetes or gastro-esophageal reflux disease (GERD).
3. Compare the airway bacterial community of never-smoking individuals with current-smoking and ex-smoking individuals.

The primary finding of this chapter was that the healthy upper respiratory tract contains a rich microbial community that is drastically depleted by smoking.

The present study used samples collected from the BHS and is one of the largest studies to date of the bacterial composition of the upper airways in a general population sample. Until recent years the healthy upper airway was considered to be sterile; however, findings from the present study indicate the presence of a rich microbiome in the healthy upper airways. This microbiome is fairly stable across the population sample; however, factors such as asthma and smoking were shown to have a significant impact on bacterial composition.

A key advantage of culture-independent analysis, such as 16S rRNA gene sequencing, is the opportunity it affords to examine hundreds of individuals (529 participants in this study) much more robustly. To date there have been a number of large-scale population projects that sought to characterise a healthy microbiome in the gut, including Metagenomes of Human Intestinal Tract (MetaHIT), which included 124 (predominantly ‘healthy’) adults 309, and a large Chinese cohort looking at type 2 diabetes in 145 participants 310. Little effort, however, has been made to characterise the microbiome of the lungs, owing to the problems presented by low bacterial biomass in samples collected from healthy individuals. The few studies that have included healthy individuals in their analyses have looked at limited numbers 103,120,121, and the Human Microbiome Project 41, although containing 331 adult throat samples, has focused most of its research on the airway microbiome in diseased states 113,311.


One of the limitations of this present study is that it is cross sectional and consequently gives just a snapshot view of the bacterial composition of the lungs. Studies have shown that the respiratory tract is a dynamic ecosystem of microbiota, which is in constant interaction with the host, and is greatly affected by inflammatory responses occurring during disease 312-315 and treatment 316. Regardless, the present study involving samples from the BHS provides a baseline that can be used in further exploration of the respiratory microbiome in a healthy population and comparing it to the diseased state.

Whole population

It is currently estimated that over 1,000 different bacterial phyla exist worldwide 317. The BHS data generated in this chapter has demonstrated the presence of 14 bacterial phyla in the airways, the most common being Firmicutes, Bacteroidetes, Proteobacteria, Fusobacteria and Actinobacteria (together making up 98.3% of all the OTUs, Section 3.3.2.1). This suggests that there is very strong selection of the bacteria that are present in the airways, and this dominance has been observed in another (although much smaller [N = 20]) study 318.

The most common OTUs (Figure 3.8) were from the genera Streptococcus (Streptococcus_4768 and Streptococcus_20297, making up 13.4% and 7.6% of the total reads respectively), Prevotella (Prevotella_1525, 10% of the total) and Veillonella (Veillonella_19412, 11.4% of the total number of reads). Members of these bacterial genera have previously been shown to be present in the oral cavity (by relatively smaller studies) 99,103,112,121,319. Bacterial similarities between the samples from the BHS and the oral microbiome could be due to the nature of the sampling method: oropharyngeal (throat) swabs. This method of sampling was chosen because such swabs are relatively easy to obtain from a healthy volunteer, and have shown similarities with microbiota from the upper lobes of the lungs (reflecting the lower airways) in healthy individuals, as confirmed by Hilty et al. 103 (refer to Chapter 1, Section 1.2.3). However, because the swabs pass through the oral cavity, there is a risk of contaminating the sample with bacteria from the oral cavity, hence care was taken when selecting samples (Chapter 2, Section 2.3.1.1).

Nevertheless, these OTUs were not only the most abundant across the dataset but were present in all the samples. This means that the sequencing results are not simply being driven by a subset of individuals, but instead that every individual may potentially have a characteristic set of core bacteria present.

This was further confirmed by the fact that the beta-diversity distribution of the data was low, meaning that on the whole individuals within the population are very similar to each other (Figure 3.9). Current-smoking status accounted for 3% of the variation in the population when tested individually in a PERMANOVA model (Table 3.9). Despite this being a very small value, it was the strongest association between a clinical variable and diversity found within the cohort. In a multivariate model, there were three variables that accounted for 10% of the variability in the population (Log qPCRCopy, Pack Years and Status, Table 3.10); however, much of the variance still remains unexplained.

Disease Phenotype

Although participants of the BHS were not recruited based on incidence or severity of disease, being a random population sample, there were a number of individuals who reported suffering from asthma, diabetes or GERD. Those with asthma were classified as mild (based on ppFEV and ppFVC values, Table 3.7). There was a single individual reported to be suffering from COPD (Section 3.3.3).

There was no significant difference found between healthy individuals and those suffering from diabetes or GERD when comparing diversities of populations. This was expected for diabetes, as it is a condition not classically involving the lungs/respiratory system and studies to date have only seen an impact on the intestinal microbiota 182. However this was a little surprising for GERD, which has been shown to have a relationship with asthma severity 190, as well as association with laryngitis, pharyngitis, sinusitis, idiopathic pulmonary fibrosis and dental erosions 191.

When looking more specifically at the bacterial composition, some bacteria were significantly decreased in abundance in the upper airway microbiome in the diabetes group when performing univariate analysis against healthy non-smoking controls. The phyla (DESeq2 analysis, P < 0.05) included Proteobacteria, Bacteroidetes, Fusobacteria and Firmicutes. A reduction of Firmicutes has previously been reported in the gut microbiota of individuals with diabetes 182, so it is feasible that diabetes has a similar effect on bacterial abundance in more than one area of the body.

There is very limited published data about the effects of GERD on the airway microbiome. One study (N = 97) looked at laryngeal microbial communities and concluded that the disease had no significant effect 320. In the present study, although there was no difference in alpha and beta diversities (GERD versus healthy individuals) as mentioned earlier, a difference in phylogenetic diversity at the tips of the phylogenetic tree in individuals suffering from GERD was noted through significant differences in NTI values (Section 3.3.3.1C, Figure 3.12).

The asthmatics exhibited some significant changes when compared to healthy non-smoking individuals, in both their diversity and community structure. Individuals with asthma displayed a significantly lower alpha diversity (Section 3.3.3.1A, both richness and evenness) when compared to healthy controls.

Literature to date has given a very inconsistent picture about the diversity differences between the microbiomes of asthmatics versus healthy individuals. There have been a number of studies that have found asthmatic patients to have a higher bacterial diversity 83,157,158,318,321. However, there are those that align with the results of this chapter, showing that asthmatic individuals display a lower bacterial diversity 155,322,323. Finally, there are those that have found no significant difference in the bacterial diversity between asthma and healthy controls 324,325.

There are a number of possible reasons why different studies report different findings, including method of specimen collection (asthma-specific studies usually analyse either bronchial brushings, bronchoalveolar lavage samples or induced sputum), the severity of the disease of the participants (e.g. recruiting mild asthmatics, treatment resistant, severe asthmatics), low number of participants and laboratory methods of handling samples (processing using a variety of sequencing platforms).

All these inconsistencies between studies indicate how little is still known about the bacterial community in the airways, and in asthmatic individuals specifically, and highlight the increasing need for standardisation of protocols, from inclusion criteria and sample type to the methods applied for DNA extraction and generation of sequencing data.


A finding that is similar across the majority of current studies is the decrease in number of Bacteroidetes and Fusobacteria found in asthmatic individuals when compared to controls 103,155-157,324. This is also seen in the analysis of the BHS sequencing data presented in this chapter, where there was a significant decrease in abundance (through DESeq2 analysis, at P < 0.01, Section 3.3.3.2C) in several members from the phyla Bacteroidetes (mainly Prevotella and Capnocytophaga), Fusobacteria (Leptotrichia) and Firmicutes (Veillonella, which has been noted in previous studies 155,156).

In contrast the published literature presents strong evidence for enrichment of Proteobacteria in asthmatic individuals when compared to healthy controls (key species being Haemophilus, Moraxella and Neisseria spp. 103,157,318,323). This was not seen in the BHS sequencing data, however there was a significant increase in abundance from a member of the phylum Actinobacteria (Rothia_13982). Enrichment in Rothia spp. has been previously described as being associated with patients with eosinophil-low asthma 321.

Smoking Phenotype

Overall, smoking individuals were found to have a significantly different microbial composition of the upper respiratory tract when compared to the non-smoking individuals (Section 3.3.4).

Smoking individuals had a significantly lower bacterial burden than never- and ex-smoking individuals, as tested using 16S rRNA gene qPCR (Figure 3.3).

Smoking individuals displayed a lower bacterial richness (Table 3.11); their bacterial species were less evenly spread (Pielou’s Evenness) and were dominated by fewer individual species (Shannon’s Diversity) when compared to the non-smoking individuals. The non-smoking individuals displayed a higher Shannon’s diversity value (Figure 3.15), meaning that individuals in this group have a higher bacterial diversity, which is more evenly spread than in smoking individuals. This contradicts the original findings of Charlson et al. 112, who suggested that smokers’ upper respiratory tract communities were significantly more diverse than those of non-smokers. However, a number of studies concur with the findings of this chapter that smoking individuals have a lower bacterial diversity compared to ex-smoking and never-smoking individuals 103,124,308,326.
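The three alpha diversity measures discussed above can be illustrated with a minimal sketch. The function name `alpha_diversity` and the two toy samples are hypothetical and are not taken from the BHS data; the natural logarithm is assumed for Shannon's diversity, which may differ from the base used in the thesis analysis.

```python
import math


def alpha_diversity(counts):
    """Return (richness, Shannon's diversity, Pielou's evenness) for one sample."""
    # Only OTUs with at least one read contribute to the indices.
    present = [c for c in counts if c > 0]
    if not present:
        raise ValueError("sample contains no reads")
    total = sum(present)
    richness = len(present)
    # Shannon's diversity: H = -sum(p_i * ln(p_i)) over observed OTUs.
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    # Pielou's evenness: J = H / ln(richness); defined as 0 for a single OTU here.
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Hypothetical read counts: same richness, very different evenness.
even_sample = [25, 25, 25, 25]
dominated_sample = [97, 1, 1, 1]

even_result = alpha_diversity(even_sample)
dominated_result = alpha_diversity(dominated_sample)
print("even sample:     ", even_result)
print("dominated sample:", dominated_result)
```

Both toy samples contain four OTUs, but the sample dominated by a single OTU returns a lower Shannon's diversity and a lower evenness, which is the pattern described for smoking individuals.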

Inter-group diversity showed that never-smoking and ex-smoking individuals were much more similar to each other, and significantly more diverse, than smokers. Beta-diversity was calculated using a Bray-Curtis dissimilarity matrix, which accounts for OTU abundance, and revealed that the smoking individuals not only were different but also contained fewer OTUs. This reflects the results seen in the Venn diagram (Figure 3.14), which showed that smokers have a lower total number of OTUs and fewer unique OTUs.
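As an illustration of the abundance-based measure named above, the following is a minimal sketch of the Bray-Curtis dissimilarity between two samples. The function name and the toy count vectors are hypothetical; the thesis analysis would have used an established package rather than this code.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples with counts over the same OTUs."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must be defined over the same OTUs")
    # Sum of absolute abundance differences, scaled by the total abundance.
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


# Hypothetical OTU counts for three samples.
sample_1 = [10, 0, 5]
sample_2 = [5, 5, 5]
sample_3 = [0, 0, 0, 8]
sample_4 = [3, 4, 1, 0]

d_12 = bray_curtis(sample_1, sample_2)
d_34 = bray_curtis(sample_3, sample_4)
print(d_12, d_34)
```

The value ranges from 0 (identical abundance profiles) to 1 (no shared OTUs), so samples that share no OTUs, such as the last pair, are maximally dissimilar.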

This diversity was explored further at the phylogenetic level to identify differences in bacterial composition depending on smoking phenotype (Figure 3.16). A positive net relatedness index (NRI) was seen across the whole population, meaning that there is distinct phylogenetic clustering of bacteria. A significantly higher NRI was seen in smokers, suggesting that their bacteria are even more closely related, towards the root of the phylogeny, than expected by chance. This is reflected in the relative abundance pie charts (Figure 3.17A), which show that smoking individuals contain a greater relative abundance of the phylum Firmicutes, meaning that their bacteria are phylogenetically more closely related.
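The sign convention of the NRI can be sketched as follows, assuming the common definition: the observed mean pairwise phylogenetic distance (MPD) is compared with a null distribution obtained by drawing the same number of taxa at random from the pool, and the standardised difference is multiplied by -1. The null model, the number of randomisations, the toy distance matrix and all function names are assumptions for illustration, not the settings used in this chapter.

```python
import itertools
import random
import statistics


def mean_pairwise_distance(taxa, dist):
    """Mean phylogenetic distance over all pairs of taxa in a community."""
    pairs = list(itertools.combinations(taxa, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def net_relatedness_index(community, pool, dist, n_random=999, seed=0):
    """NRI = -(observed MPD - mean null MPD) / sd of null MPD."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(community, dist)
    # Null model (assumed): draw the same number of taxa at random from the pool.
    null = [
        mean_pairwise_distance(rng.sample(pool, len(community)), dist)
        for _ in range(n_random)
    ]
    return -(observed - statistics.mean(null)) / statistics.stdev(null)


# Hypothetical pool of six taxa in two clades: {0, 1, 2} and {3, 4, 5}.
# Within-clade distance is 0.2, between-clade distance is 1.0.
pool = list(range(6))
dist = [
    [0.0 if i == j else (0.2 if i // 3 == j // 3 else 1.0) for j in pool]
    for i in pool
]

nri_clustered = net_relatedness_index([0, 1, 2], pool, dist)
nri_mixed = net_relatedness_index([0, 1, 3], pool, dist)
print(nri_clustered, nri_mixed)
```

A community drawn from a single clade gives a positive NRI (phylogenetic clustering), whereas a community spread across clades does not, which matches the interpretation given in the text.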

Smoking individuals were found to have a decrease in the relative abundance of Bacteroidetes, Fusobacteria and Proteobacteria compared to healthy individuals (Figure 3.17B). This was confirmed by the DESeq2 analysis (P < 0.001), which identified a significant decrease in the abundance of some of the major genera, including Neisseria, Haemophilus, Leptotrichia, Prevotella and Veillonella amongst many others (Figure 3.18 and Figure 3.19), all of which have been noted in previous studies that have investigated the impact of smoking on the upper respiratory tract 112,120,327. All of the above genera were also significantly decreased in abundance when smoking individuals were compared separately with never-smokers and with ex-smokers.

The relative abundances of the phyla Actinobacteria and Firmicutes were increased in smoking compared to non-smoking individuals. The most enriched genus was Streptococcus, with 23 different OTUs having a higher relative abundance in smokers (Figure 3.18). Previous studies have shown enrichment of this genus; however, the results of this study highlight a large number of different streptococcal OTUs driving the difference between the participants. Interestingly, the abundance of Streptococcus increased significantly with increasing pack years (Figure 3.23), suggesting that over time the increased intensity of smoking promotes an increase in the abundance of this genus.

When never-smoking and ex-smoking individuals were compared, little difference was seen. The two groups had similar alpha diversity and beta diversity, and very few OTUs were significantly different between them. This could mean that, with time, the bacterial composition of the lungs normalizes after an individual has quit smoking (Figure 3.20). Based on this hypothesis, the length of time since quitting was explored (Table 3.12), to see how long it took for the upper airway microbiome of ex-smoking individuals to resemble that of never-smoking individuals.

Using the provided smoking history, ex-smoking individuals were separated into groups depending on the number of years since quitting, and these groups were in turn compared to current-smokers. Both alpha diversity and the relative abundance of the community structure differed significantly between current-smokers and ex-smokers in the first ten years since quitting (Table 3.14). Unfortunately, when the recent ex-smokers were explored further, numbers within each year category were too low, meaning that the analysis was underpowered to detect any significant differences.

When the current-smokers were examined exclusively, and more specifically the number of pack years against Shannon’s Diversity (a representative metric of alpha diversity), a significant negative correlation was noted (Figure 3.22). This relationship indicated that alpha diversity decreased as the number of pack years increased, even after adjusting for age and sex (neither of which was significant).

In summary, in this study of the upper respiratory tract, substantive in its number of individuals and drawn from a general random population sample, smoking had a profound effect on the overall microbial community, causing a decrease in bacterial richness and overgrowth of the genus Streptococcus. The biggest strength of this study is the large sample size, which has enabled real insight into the bacterial community of the upper respiratory tract in predominantly healthy individuals. The study was limited by its sampling methods: although studies have shown that oropharyngeal (throat) swabs are reflective of the lower respiratory microbiome 103, other sampling techniques might have provided a more in-depth picture of the lungs themselves. Additionally, because this was a cross-sectional study, it is difficult to build a more dynamic picture of the upper airway microbiome and to understand how the bacterial community changes over time. Nevertheless, the large amount of data generated from a healthy population could be of use for comparisons in other studies.


CHAPTER 4: WEIGHTED CORRELATION NETWORK ANALYSIS

Chapter 4 Weighted Correlation Network Analysis

4.1 Introduction

Having established the communities present within the upper respiratory tract (oropharyngeal swabs) of the BHS participants (Chapter 3), the next question to address was how these bacteria interact with one another and, in turn, how these bacterial communities interact with their host: the human. Interactions within microbial communities in a human host, and their implications for health and disease, have been investigated in some detail (see below). In such networks, individual bacteria can be considered as nodes, and pairwise relationships between bacteria can be thought of as edges. Positive relationships may reflect mutualism between bacteria that work together to increase their chances of survival. Negative relationships may reflect antagonistic phenomena, where bacteria have to compete for survival. A broader knowledge of these relationships in a healthy environment, and of the changes that occur during disease or stress, can provide understanding of the relationship between bacterial ecosystems and the host.

A number of studies have explored microbial co-occurrence in a variety of environments 328-331 with the main focus being on intestinal microbial ecosystems 332,333 and effect of gut microbiota on obesity and the metabolic syndrome 334-337. Still, little has been done to characterise inter-species interactions within the airway microbiome, including changes that accompany disease and cigarette smoke exposure.

It is well established that tobacco smoke exposure has severe implications for the immune system, yet there is still very limited knowledge of the full impact that smoke exposure has on the airway microbiome. Studies have shown a decrease in the relative abundance of certain bacteria in the oral microbiome 326,327, colonisation of the nasopharynx with potential pathogens 338 and a decrease in the microbial diversity of the lower airway microbiome 320. These studies have, however, been very small, offering limited insight into the changes that occur with smoke exposure; furthermore, the effects have not been placed within a systems-level context.


Weighted Gene Co-Expression Network Analysis (WGCNA or WCNA) is a systems-level statistical method commonly applied to large, high-dimensional gene expression datasets to define correlation patterns 269,339. Many studies have used this approach to explore gene networks at the level of messenger RNA (mRNA) in various tissues and disease states, from cancer 340-342 to brain function 343-345, and the approach has also been applied successfully to DNA methylation data 346. In recent years WGCNA has been used to explore the intestinal mucosal microbiota 347, identifying replicable modules of co-abundant bacteria.

Building on this development, the publicly available WGCNA workflow (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/) was adapted in this chapter to examine the 16S rRNA gene sequencing results of the BHS participants further and to discover potential interactions within the airway microbial communities.

Briefly, the implemented approach calculates the correlation between operational taxonomic units (OTUs), groups them into modules and prunes connections that do not share neighbours, as these are less likely to represent true connections. In turn, these modules can be correlated with clinical traits, allowing identification of bacterial co-abundance modules that associate with disease. Module membership (a measure closely related to intra-modular connectivity) and bacterial significance (the correlation between an OTU and a trait) can then be used to specify key OTUs within a given module. The pipeline was further extended by the application of phylogenetic methods to allow functional characterization of modules.
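The core of this approach can be sketched in a few lines, assuming the standard unsigned WGCNA definitions: a soft-thresholded adjacency a_ij = |cor_ij|^power, and a topological overlap that rewards pairs of OTUs sharing strong neighbours. The toy correlation matrix, the value `power=6` and the function names are hypothetical; this is not the WGCNA package code, and whether a signed or unsigned network was used in this chapter is not assumed here.

```python
def soft_threshold_adjacency(cor, power):
    """Unsigned adjacency: a_ij = |cor_ij| ** power, with a zero diagonal."""
    n = len(cor)
    return [
        [0.0 if i == j else abs(cor[i][j]) ** power for j in range(n)]
        for i in range(n)
    ]


def topological_overlap(adj):
    """Topological overlap matrix (TOM) for an unsigned weighted network."""
    n = len(adj)
    # Connectivity of each node (the diagonal of adj is zero).
    k = [sum(row) for row in adj]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Strength of the neighbours shared by nodes i and j.
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j
            )
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom


# Hypothetical correlations: OTUs 0-2 are co-abundant, OTU 3 is unrelated.
cor = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.10],
    [0.80, 0.85, 1.00, 0.10],
    [0.10, 0.10, 0.10, 1.00],
]

adj = soft_threshold_adjacency(cor, power=6)
tom = topological_overlap(adj)
for row in tom:
    print([round(v, 3) for v in row])
```

In this toy example the three co-abundant OTUs retain a substantial overlap with one another, while the connections to the unrelated OTU shrink towards zero; hierarchical clustering of 1 - TOM would then place the first three OTUs in one module.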

Figure 4.1 summarizes the WCNA workflow and shows the breakdown of the bacterial co-abundance analysis together with a typical plot that can be used to illustrate each stage of the analysis. A more detailed explanation of each step taken is given in the subsequent text.


[Figure 4.1: flowchart image not reproduced. Workflow stages shown: Define Similarity Matrix; Prune Outliers; Define Adjacency Matrix; Parameters; TOM Networkǂ; Prune Modules; Correlation of Modules to Traits; Define Key Bacterial ‘Hubs’. Example plots shown: sample dendrogram and quantitative trait heatmap (un-rarefied data); scale independence and mean connectivity against soft threshold (power); dendrogram and module colours (Dynamic Tree Cut, merged dynamic); clustering of module eigengenes; whole population module-trait relationships; module membership vs. bacterial significance.]

Figure 4.1 Summary flowchart of WCNA bacterial co-abundance network analysis.

Illustration of the different steps performed. For each key step carried out, an example of the output (a typical plot) is shown. The analysis is described in more detail in the following chapter sections. ǂTOM = topological overlap matrix.


Adapting the published WGCNA methodology (WGCNA package version 1.51, R version 3.3.2 [2016-10-31]) for 16S rRNA gene data came with a number of difficulties (for full workflow refer to: https://goo.gl/Csv6Hg). Bacterial communities exhibit multifarious interactions between individual microbes. Additionally, inferring correlation structure in microbial communities is problematic due to the presence of rare microbes and zero sequencing read counts.

A number of steps were taken to address these issues. These included data transformation, OTU filtering and the use of non-parametric correlation methods (see subsequent sections below).
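Two of these steps, OTU filtering and non-parametric correlation, can be sketched as follows. The prevalence threshold of 0.5, the toy OTU table and the function names are hypothetical, and the filtering criteria actually used in this chapter are not assumed. Spearman's correlation is computed as Pearson's correlation on ranks, with tied values (such as shared zeros) given their average rank.

```python
import math


def filter_otus(table, min_prevalence):
    """Keep OTUs detected in at least `min_prevalence` (fraction) of samples."""
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(c > 0 for c in counts) / len(counts) >= min_prevalence
    }


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation between two count vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical read counts for three OTUs across six samples.
otu_table = {
    "otu_a": [0, 3, 10, 250, 0, 40],
    "otu_b": [1, 5, 20, 900, 0, 60],
    "otu_rare": [0, 0, 0, 0, 7, 0],
}

kept = filter_otus(otu_table, min_prevalence=0.5)
rho = spearman(kept["otu_a"], kept["otu_b"])
print(sorted(kept), round(rho, 3))
```

Because the correlation depends only on ranks, the single very large count in each OTU does not dominate the result, and the rarely observed OTU is removed before any correlation is attempted.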

The aims of this chapter were to:

1. Define bacterial networks in the upper airways through application of WCNA on the complete BHS 16S rRNA sequencing dataset.

2. Identify modules of co-abundant bacteria.

3. Quantify the module relationships with clinical traits and lifestyle habits (smoking).

4. Use module membership to define key ‘hub’ bacteria.

5. Extrapolate metagenomic functions on the complete dataset as well as on any modules showing association with clinical traits or lifestyle habits (smoking).


4.2 Pre-Processing and Details of Statistical Tests Implemented

4.2.1 Data Transformation

Initially, the 100 most abundant OTUs were sub-selected from the un-rarefied Busselton data set and used to assess three different transformation methods. As can be seen from Figure 4.2A, the original data were not normally distributed, and this was confirmed by the Shapiro-Wilk test (Mean [Mn] = 4.2E+05, Standard Deviation [SD] = 1.02E+06, P < 2.2E-16; refer to Chapter 2, Section 2.5.2.1B).

The first transformation, commonly implemented in ecological studies, was log10 (x+1) 348. Because of the high number of zeros in the OTU table, a constant (in this case ‘1’) was added to every observation before the base-10 logarithm of each value was taken (Figure 4.2B). The test for normality indicated that the transformed data were closer to a normal distribution than the untransformed data (Mn = 878.9, SD = 446.9, P = 1.2E-06).
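As a minimal sketch of the log10 (x+1) transformation described above, the following Python snippet applies it to a small, made-up count table. The function name `log10_plus_one` and the toy counts are illustrative assumptions and are not taken from the thesis workflow, which was carried out in R.

```python
import math


def log10_plus_one(otu_table):
    """Apply log10(x + 1) to every read count in a samples-by-OTUs table."""
    return [[math.log10(count + 1) for count in row] for row in otu_table]


# Hypothetical read counts: 2 samples x 4 OTUs, including zeros.
counts = [
    [0, 9, 99, 999],
    [0, 0, 9, 99999],
]

transformed = log10_plus_one(counts)

# Zero counts stay at zero because log10(0 + 1) = 0, so no value is undefined.
for row in transformed:
    print([round(value, 3) for value in row])
```

Adding the constant before taking the logarithm is what keeps absent OTUs (zero reads) in the table rather than producing undefined values.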

The second transformation was the Variance Stabilizing Transformation (VST, Figure 4.2C), which is recommended by the authors of WGCNA for the analysis of RNA-Seq data. This function uses the fitted dispersion-mean relation to transform the count data, which are normalized using size factors, resulting in an approximately homoscedastic matrix of values. The function was modified to include a calculation of geometric means for the size factors in order to deal with the large number of zeros (OTUs that are not present in all samples and therefore have zero sequencing reads in some of them). When normality was checked for the transformed data set, an improvement similar to that observed for the log10 (x+1) transformation was seen (Mn = 3,543.7, SD = 1,226.3, P = 7.8E-05).
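The exact modification made to the size factor calculation is not given in this section. The sketch below shows one commonly used zero-tolerant approach, offered here only as an assumption about what such a modification can look like: the geometric mean of each OTU is computed from its positive counts only, and each sample's size factor is the median ratio of its positive counts to those geometric means. The function names and toy counts are hypothetical, and the variance stabilizing transformation itself is not implemented.

```python
import math
from statistics import median


def gm_mean_positive(values):
    """Geometric mean that ignores zeros in the log-sum but divides by the full length."""
    logs = [math.log(v) for v in values if v > 0]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(values))


def size_factors(otu_table):
    """Median-of-ratios size factor per sample for a samples-by-OTUs count table."""
    n_otus = len(otu_table[0])
    geo_means = [gm_mean_positive([row[j] for row in otu_table]) for j in range(n_otus)]
    factors = []
    for row in otu_table:
        # Only OTUs observed in this sample (and with a usable geometric mean) contribute.
        log_ratios = [
            math.log(count) - math.log(gm)
            for count, gm in zip(row, geo_means)
            if count > 0 and gm > 0
        ]
        if not log_ratios:
            raise ValueError("sample has no positive counts")
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: 3 samples x 4 OTUs with several zeros.
counts = [
    [10, 0, 40, 5],
    [20, 40, 80, 0],
    [5, 10, 0, 0],
]
factors = size_factors(counts)
normalized = [[c / f for c in row] for row, f in zip(counts, factors)]
```

With an ordinary geometric mean, any OTU containing a single zero would have a geometric mean of zero and drop out of the calculation, which in a sparse OTU table can leave very few OTUs to estimate size factors from.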

The last transformation was the regularized-logarithm transformation (rlog, Figure 4.2D), from the DESeq2 package (version 1.14.1) 291,292,349 available in R. The rlog transformation is purported to give a variance stabilizing effect comparable to VST, but with improved robustness to variation in size factors. The rlog transformation showed the best outcome for the data (Mn = 3,258.6, SD = 1,455.0, P = 0.003), although the method is computationally demanding to run.


[Figure 4.2: histogram panels not reproduced. Panel titles: A, No Transformation; B, Log10 Transformation; C, Variance Stabilizing Transformation; D, rLog Transformation. In each panel the x-axis is TotalCounts and the y-axis is Density.]

Figure 4.2 Histograms displaying the different transformation methods. Distribution of the 100 (most abundant) OTUs with no transformation and after application of each transformation method, overlaid with kernel density curves.

(A) Distribution of the data before any transformation. (B) Log10 (x+1) transformation. (C) Variance Stabilizing Transformation. (D) Regularized-logarithm transformation.


In conclusion all three transformations yielded a broadly similar effect on the data, bringing it closer to a normal distribution when compared to the original, untransformed data (Figure 4.2). In light of this and taking into consideration computational efficiency given the large scale of the data set, it was decided to carry out the log10 (x+1) transformation as a pre-processing tool for WCNA.

4.2.2 OTU filtration

Consistent with the recommendations of Weiss et al. (2016) 350, extremely rare OTUs were removed from the dataset, as these are recognized to have a pronounced impact on the performance of correlation analysis techniques. For a detailed description of the filtration performed on the data, please refer to Chapter 3, Section 3.2.3.4.

4.2.3 Rarefaction

Rarefaction may have an impact on calculated correlations and downstream clustering, as highlighted by McMurdie et al. (2014) 351. Given the dependence of WCNA on correlations between OTUs, the following steps were performed in parallel on un-rarefied (4,218 OTUs, 578 samples) and rarefied (4,005 OTUs, 529 samples) data in order to assess which was the most suitable for WCNA (see Chapter 3, Section 3.2.3.5 for details of how the rarefaction level was established).
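For readers unfamiliar with rarefaction, the sketch below illustrates the general idea: reads are subsampled without replacement to a common depth, and samples with fewer reads than that depth are dropped. The depth of 40 reads, the sample names and the counts are made up for illustration; the rarefaction level actually used is described in Chapter 3, and this is not the thesis code.

```python
import random


def rarefy(sample_counts, depth, rng):
    """Subsample reads without replacement to a fixed depth.

    Returns None when the sample has fewer reads than the requested depth.
    """
    if sum(sample_counts) < depth:
        return None
    # Expand the counts into one entry per read, labelled by OTU index.
    pool = [otu for otu, n in enumerate(sample_counts) for _ in range(n)]
    rarefied = [0] * len(sample_counts)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


# Hypothetical read counts per OTU for three samples.
samples = {
    "sample_a": [50, 30, 20, 0],
    "sample_b": [400, 100, 0, 5],
    "sample_c": [3, 2, 0, 0],  # too shallow for the chosen depth
}
depth = 40  # illustrative value only
rng = random.Random(42)  # fixed seed so the sketch is repeatable

rarefied = {name: rarefy(counts, depth, rng) for name, counts in samples.items()}
kept = {name: counts for name, counts in rarefied.items() if counts is not None}
```

Because shallow samples are discarded and rare OTUs can disappear after subsampling, a rarefied table will generally contain fewer samples and fewer OTUs than the un-rarefied one.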

4.2.4 Detection of Outlying Samples

A distance matrix was computed for both the un-rarefied and rarefied BHS data sets in order to examine the compositional distances between samples. These distance matrices were used as the foundation for constructing hierarchical clustering dendrograms (based on Euclidean distance and average linkage clustering), before proceeding with the assembly of a bacterial co-expression network. In this case the agglomerative (“bottom up”) approach to hierarchical cluster analysis was taken. Average agglomerative clustering 352 (the Un-weighted Pair Group Method using arithmetic Averages, UPGMA) uses the average dissimilarity between all objects in a pair of clusters, and in this way identifies individual samples with the highest dissimilarity from the rest of the cohort, which were treated as outliers.


Outliers were manually removed from each dendrogram by applying a cut height of 30 for the un-rarefied data (removing one sample) and a cut height of 12 for the rarefied data (also removing one sample) (Figure 4.3).
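The outlier detection described above can be sketched as follows: samples are merged by average linkage (UPGMA) on Euclidean distances until the next merge would exceed the cut height, and every sample outside the largest remaining cluster is flagged. The function names, the two-dimensional toy profiles and the cut height of 3.0 are assumptions for illustration; the thesis used R and cut heights of 30 and 12 on the real data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage_cut(profiles, cut_height):
    """Merge clusters by UPGMA until the closest pair is further apart than cut_height."""
    dist = [[euclidean(p, q) for q in profiles] for p in profiles]
    clusters = [[i] for i in range(len(profiles))]

    def average_distance(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cut_height:
            break  # every remaining merge lies above the cut line
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


def find_outliers(profiles, cut_height):
    """Return indices of samples that fall outside the largest cluster after the cut."""
    clusters = average_linkage_cut(profiles, cut_height)
    main = max(clusters, key=len)
    return sorted(i for c in clusters if c is not main for i in c)


# Hypothetical (already log-transformed) profiles: four similar samples and one distant one.
profiles = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [9.0, 9.0]]
outliers = find_outliers(profiles, cut_height=3.0)
print(outliers)
```

Stopping the merging at the cut height is equivalent to cutting the finished dendrogram at that height, because merge heights under average linkage never decrease as clustering proceeds.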

4.2.5 Phenotype Selection

A subset of clinical variables was selected out of the available metadata for further analysis, with an emphasis on respiratory functions. This selection was carried out with the aim of minimising multiple testing.

The continuous (quantitative) variables included:

- Age in years (Mn = 55.8, SD = 5.6)
- Cumulative smoke exposure as measured in pack years (Mn = 9.4, SD = 16.6)
- Body Mass Index (BMI, Mn = 27.9, SD = 4.6)
- Red cell count (Mn = 4.6, SD = 0.4)
- Platelet count (Mn = 240.3, SD = 51.7)
- Neutrophil absolute count (Mn = 3.4, SD = 1.2)
- Monocyte absolute count (Mn = 0.5, SD = 0.14)
- Eosinophil absolute count (Mn = 0.2, SD = 0.13)
- Basophil absolute count (Mn = 0.06, SD = 0.04)

The discrete (qualitative) variables included sex, smoking status (never-smoking [274 individuals], ex-smoking [239 individuals] and current-smoking [65 individuals]), and presence or absence of the following: diabetes, GERD, high blood pressure, wheezing, asthma, pleurisy, pneumonia, bronchitis, and sinusitis.

[Figure 4.3: dendrogram images not reproduced. Panel A, "Sample dendrogram (UNRAREFIED data)"; panel B, "Sample dendrogram (RAREFIED data)". Both are labelled dist(log), hclust (*, "average"), with Height on the y-axis and sample identifiers at the leaves. The trait rows beneath each dendrogram are: sex, Diabetes, GERD, NeverSmoker, ExSmoker, CurrentSmoker, Wheezing, Asthma, Pnemonia, Bronchitis, age, PackYears, BMI, red_cell_count, platelet_count, neutrophil_absolute, monocyte_absolute, eosinophil_absolute, basophils_absolute.]

Figure 4.3 Clustering dendrogram of samples based on their Euclidean distance. A. Dendrogram drawn using un-rarefied data, with the cut height at 30, which removed 1 sample that was considered an outlier. B. Dendrogram drawn using rarefied data, with the cut height at 12, which removed 1 sample that was considered an outlier. Cut heights are shown as a solid red line.

4.2.6 Summary of Statistical Tests Implemented in Analysis

Following the publicly available WGCNA tutorials, the relationship between the modules and clinical traits was tested using all the clinical variables simultaneously, irrespective of their format (both discrete [qualitative] and continuous [quantitative]). To confirm the results obtained, the clinical variables were then separated: continuous variables were tested for association with bacterial co-abundance modules using Spearman’s correlation (Chapter 3, Section 3.2.3.4), whilst a Kruskal Wallis test (Chapter 2, Section 2.5.2.1C) was used to analyse discrete variables. As both methods showed agreement, only the results of the combined analysis are presented in this chapter.

The statistical methods applied in this chapter are summarised below.

Spearman’s Correlation

Spearman’s correlation was used as it outperforms several other metrics (for example Pearson’s correlation, which is routine in WGCNA analysis) when applied to high throughput sequencing data 353, is relatively robust to the issue of compositionality 350 and can detect monotonic as well as linear relationships (Chapter 3, Section 3.2.3.4). It calculates Pearson’s correlation between the rank values of two variables. When the observations of the two variables have similar ranks, Spearman’s correlation (rho) approaches one; when the ranks are opposed (high ranks in one variable paired with low ranks in the other), rho approaches minus one.

This correlation method was used to describe association between clinical variables tested and bacterial modules. Additionally, the correlation method was used to look at the association between bacterial abundance and continuous variables (like monocyte absolute count values).
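The definition above (Pearson’s correlation applied to ranks) can be illustrated with a minimal sketch. The function names and the example values are hypothetical and are not taken from the thesis analysis; tied values are assumed to receive the mean of their rank positions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of positions start..end, 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: a monotonic but clearly non-linear relationship.
otu_abundance = [0, 3, 10, 120, 4500]
monocyte_count = [0.9, 0.7, 0.6, 0.4, 0.2]
rho = spearman_rho(otu_abundance, monocyte_count)
```

Because only the ordering matters, the strongly skewed abundance values in this example still give a rho of minus one.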

Kruskal Wallis

The data were confirmed not to be normally distributed using the Shapiro-Wilk test (Chapter 2, Section 2.5.2.1B), therefore non-parametric tests were selected. When comparing the distribution of bacterial and functional abundance between two or more independent groups of a clinical variable, the Kruskal Wallis test was used (Chapter 2, Section 2.5.2.1C).
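As an illustration of the test statistic, the sketch below computes the tie-corrected Kruskal Wallis H from pooled ranks. It is a simplified example under stated assumptions rather than the code used in this analysis: the function name and the example groups are hypothetical, and the P-value (obtained by comparing H with a chi-squared distribution with one fewer degrees of freedom than the number of groups) is not computed.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal Wallis H statistic for a list of groups."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Average 1-based rank of each distinct value in the pooled sample.
    positions = {}
    for pos, v in enumerate(pooled, start=1):
        positions.setdefault(v, []).append(pos)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}

    # H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    # (Undefined when every pooled value is identical.)
    ties = Counter(pooled)
    correction = 1 - sum(t**3 - t for t in ties.values()) / (n_total**3 - n_total)
    return h / correction


# Hypothetical OTU abundances in two groups (e.g. non-smokers and current-smokers).
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6]])
```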

Benjamini and Hochberg (BH)

The Benjamini and Hochberg method (BH) 354 was applied to control the false discovery rate in instances of multiple comparisons.
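The BH adjustment can be sketched in a few lines. This is a minimal example with a hypothetical function name and hypothetical P-values. Each P-value is multiplied by the number of tests divided by its rank, and the adjusted values are then made monotone from the largest P-value downwards.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values from four module-trait tests.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```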

Generalized Linear Regression Model (GLM)

In order to establish the relationship between independent variables (functional abundance) and dependent variables (clinical traits), as well as to postulate any trait projections, the functional relationship was analysed using the Generalized Linear Regression Model (GLM) 355.

This regression model was used because of its flexibility, which allowed both the non-normal distribution of the data and the dichotomous nature of some of the variables tested to be accommodated. For dichotomous variables (Current-Smoker; Yes/No), a binomial distribution was used to predict the probability of a positive (“Yes”) outcome.
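For a dichotomous trait, a binomial GLM with a logit link amounts to logistic regression. The sketch below fits a single-predictor model by Newton-Raphson. It is an illustrative example only: the logit link, the function name and the data are assumptions, not details taken from the thesis analysis.

```python
import math


def fit_binomial_glm(x, y, iterations=25):
    """Fit logit(P(y = 1)) = b0 + b1 * x by Newton-Raphson (equivalent to IRLS)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g) and information matrix (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system h * step = g and update the coefficients.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: functional abundance (x) and Current-Smoker status (1 = Yes).
abundance = [0, 1, 2, 3, 4, 5, 6, 7]
current_smoker = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_binomial_glm(abundance, current_smoker)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in abundance]
```

The fitted values are the predicted probabilities of a “Yes” outcome. Note that the iteration does not converge if the two outcome classes are perfectly separated by the predictor.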

Wilcoxon Signed Rank Test

Abundance between two modules was compared using the Wilcoxon signed rank test 356, a non-parametric alternative to the paired t-test that compares population mean ranks between two related samples [Section 4.7.4].
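The rank sums on which the signed rank test is based can be computed as in the sketch below. The function name and paired values are hypothetical. Zero differences are dropped and tied absolute differences share an average rank, which is one common convention and is assumed here.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus): rank sums of positive and negative paired differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    abs_sorted = sorted(abs(d) for d in diffs)

    # Average 1-based rank of each distinct absolute difference.
    positions = {}
    for pos, v in enumerate(abs_sorted, start=1):
        positions.setdefault(v, []).append(pos)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}

    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return w_plus, w_minus


# Hypothetical per-sample abundances of two modules in the same five samples.
module_a = [10, 12, 9, 15, 8]
module_b = [8, 12, 11, 10, 9]
w_plus, w_minus = wilcoxon_signed_rank(module_a, module_b)
```

The smaller of the two rank sums is the usual test statistic. Its P-value would be obtained from the exact null distribution or a normal approximation, neither of which is shown here.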

4.3 Aim 1: Bacterial Networks in the Airways

Analysis of Network Topology

In order to define the connection weight between OTUs in a network, a ‘soft’ thresholding framework was used. This preserves the continuous nature of the bacterial co-abundance information, with connection weights ranging from 0 (unconnected) to 1 (fully connected).

The correlation matrix used to create the hierarchical clustering (Figure 4.4) was converted into an adjacency matrix with approximate scale free topology using a soft thresholding parameter β. A scale free topology network is characterised by the presence of a few highly connected nodes and is common in nature. An appropriate value of β was determined according to the methods of Zhang and Horvath (2005) 357 using the approximate scale free topology criterion.
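A minimal sketch of the soft thresholding step is given below. It assumes the unsigned adjacency function, in which each connection weight is the absolute correlation raised to the power β; the thesis does not state which network type was used, so this choice, the function name and the example matrix are all assumptions.

```python
def soft_threshold_adjacency(correlation, beta):
    """Unsigned soft-threshold adjacency: a_ij = |cor_ij| ** beta."""
    return [[abs(c) ** beta for c in row] for row in correlation]


# Hypothetical correlation matrix for three OTUs.
correlation = [
    [1.0, 0.8, -0.5],
    [0.8, 1.0, 0.1],
    [-0.5, 0.1, 1.0],
]
adjacency = soft_threshold_adjacency(correlation, beta=3)
```

Raising to a power keeps every weight between 0 and 1 but shrinks weak correlations far more than strong ones (0.8 becomes 0.512, whereas 0.1 becomes 0.001), which is how the continuous co-abundance information is retained without a hard cut-off.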

To assess whether the scale free topology criterion was satisfied, the scale-free topology model fitting index R² was plotted across a range of β values (from 1 to 20). As the R² of the linear regression model approaches 1, a scale free topology can be assumed. The relationship between R² and the adjacency function parameter (β) is characterized by the saturation curve shown in Figure 4.4. An appropriate soft thresholding parameter β is defined as the lowest power at which the scale-free topology model fitting index R² exceeds the cut-off of 0.85 358.

In the un-rarefied data, the threshold for approximate scale free topology (R² ≥ 0.85) was reached with β = 3 (R² = 0.88) (Figure 4.4A). This value was therefore used to construct the weighted network.
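The selection rule can be expressed as a short sketch. Only the value R² = 0.88 at β = 3 is taken from the text; all other fit values, the function name and the dictionary layout are hypothetical and serve only to illustrate the rule.

```python
def choose_soft_threshold(fit_by_power, cutoff=0.85):
    """Return the lowest power whose scale-free fit R^2 reaches the cutoff, else None."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] >= cutoff:
            return power
    return None


# Un-rarefied data: only the entry for power 3 (0.88) comes from the text.
unrarefied_fit = {1: 0.20, 2: 0.65, 3: 0.88, 4: 0.90, 5: 0.91}
# Rarefied data: hypothetical values that never reach the cutoff.
rarefied_fit = {1: 0.10, 2: 0.35, 3: 0.50, 4: 0.60, 5: 0.62}

beta_unrarefied = choose_soft_threshold(unrarefied_fit)
beta_rarefied = choose_soft_threshold(rarefied_fit)
```

A return value of `None` corresponds to the situation in which no tested power satisfies the criterion.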

The scale-free topology fit index failed to reach the threshold of 0.85 in the rarefied data (Figure 4.4B) at any of the tested values of β, suggesting that rarefaction of the data had a substantial impact on the correlation matrix. It is important to note that rarefaction of 16S rRNA gene sequencing data results in a loss of data that is variable in scale but sometimes considerable. This will in turn impact the scale free topology through a loss in sensitivity of the results. Consequently, only the un-rarefied data set was utilised for further analysis and network construction.

[Figure 4.4 plots. Panel A: Scale independence (UNRAREFIED) and Mean connectivity. Panel B: Scale independence (RAREFIED) and Mean connectivity. y-axes: Scale Free Topology Model Fit, signed R^2; Mean Connectivity. x-axes: Soft Threshold (power).]

Figure 4.4 Bacterial OTU network properties for different soft thresholds. The cut-off for approximate scale free topology (R² ≥ 0.85) is shown as a solid red line. A. Plots visualize scale free topology-fitting index of un-rarefied data; the saturation curve indicated that a soft threshold power (β = 3) was appropriate to use, R² = 0.88, slope -γ̂ = -1.65 and mean connectivity = 24.3. B. Scale free topology-fitting index of rarefied data did not reach the threshold.

4.4 Aim 2: Network Construction and Module Detection

As described in Section 4.3 (above), the adjacency matrix was constructed using un-rarefied data and a soft thresholding parameter β of 3. This was transformed into a topological overlap matrix (TOM) to filter out spurious and isolated connections, thereby minimizing noise and providing an indication of interconnectedness.
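The transformation can be sketched using the unsigned topological overlap measure described by Zhang and Horvath, in which the overlap between two nodes combines their direct connection with the connection weight they share through common neighbours. The function name and the example adjacency matrix are hypothetical, and the use of the unsigned form is an assumption.

```python
def topological_overlap(adjacency):
    """Unsigned TOM from a symmetric adjacency matrix with entries in [0, 1]."""
    n = len(adjacency)
    # Connectivity k_i: summed connection weight to all other nodes.
    k = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a_ij = adjacency[i][j]
            # Connection weight shared through every third node u.
            shared = sum(adjacency[i][u] * adjacency[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + a_ij) / (min(k[i], k[j]) + 1 - a_ij)
    return tom


# Hypothetical adjacency: nodes 1 and 2 are not directly connected
# but both are connected to node 0.
adjacency = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]
tom = topological_overlap(adjacency)
```

In this example nodes 1 and 2 receive a non-zero overlap (1/6) purely through their shared neighbour, illustrating how the TOM reflects interconnectedness rather than pairwise correlation alone.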

The TOM provided the input for the hierarchical clustering. Each branch of the hierarchical clustering tree grouped together tightly interconnected OTUs, which in turn define the co-abundance modules. Branch cutting was implemented through the Dynamic Tree Cut method 359, with a minimum module size of 20 OTUs, in order to detect relatively large bacterial co-abundance cabals rather than bacterial interactions comprising just a few members. This resulted in the identification of 20 modules in total. Two OTUs were unassigned and are therefore referred to as the grey module from this point onwards.

For each module, a characteristic profile was calculated and summarized by a module eigengene (defined as the first principal component of a given module). The eigengenes were then clustered in order to quantify the similarity between the observed modules. Modules whose eigengenes had a correlation of at least 0.75 (corresponding to a tree height of 0.25 or below) were merged, in this way combining those that are closely related and resulting in 19 modules. The turquoise module was below the cut threshold and was merged with the brown module (Figure 4.5). The identified modules were then used to investigate correlations with the various clinical variables.
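The relationship between the correlation of 0.75 and the tree height of 0.25 follows from using one minus the eigengene correlation as the dissimilarity. The sketch below applies that rule to pairs of modules. It is a simplification, since the merging described above operates on a clustering tree of the eigengenes rather than on isolated pairs, and every correlation value shown is hypothetical.

```python
def modules_to_merge(eigengene_correlation, cut_height=0.25):
    """Return module pairs whose eigengene dissimilarity (1 - correlation) is below the cut."""
    pairs = []
    names = sorted(eigengene_correlation)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 1 - eigengene_correlation[a][b] < cut_height:
                pairs.append((a, b))
    return pairs


# Hypothetical eigengene correlations between three modules.
eigengene_correlation = {
    "blue": {"blue": 1.0, "brown": 0.30, "turquoise": 0.25},
    "brown": {"blue": 0.30, "brown": 1.0, "turquoise": 0.80},
    "turquoise": {"blue": 0.25, "brown": 0.80, "turquoise": 1.0},
}
merge_pairs = modules_to_merge(eigengene_correlation)
```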

[Figure 4.5 plot: Clustering of module eigengenes (MM20).]

Figure 4.5 Hierarchical clustering tree of OTUs, with dissimilarity based on TOM. A threshold for merging modules is shown as a solid red line (corresponding to a tree height of 0.25). The grey module contains the unassigned OTUs. MM = Module Membership; ME = Module Eigengene.

Table 4.1 summarises the size of each module by number of members and total OTU abundance, the latter also expressed as a percentage of the total number of reads in the whole study (43,775,771 reads). A detailed summary of all modules is given in Appendix 4.1.1 [Top10Silva] and in the appropriate tables below.

Table 4.1 Description of the identified modules. Details are given of the number of bacterial members in each module, with their total bacterial abundance, and the percentage abundance within the whole data.

| Module | Number of Members | Total Abundance | Percentage Abundance | Module Summary |
|---|---|---|---|---|
| Black | 112 | 366,013 | 0.84 | 4.1.1.1 |
| Blue | 652 | 9,370,060 | 21.40 | 4.1.1.2 |
| Brown | 2,189 | 18,519,535 | 42.31 | 4.1.1.3 |
| Cyan | 45 | 91,993 | 0.21 | 4.1.1.4 |
| Green | 234 | 8,803,550 | 20.11 | 4.1.1.5 |
| Green Yellow | 59 | 8,884 | 0.02 | 4.1.1.6 |
| Grey60 | 30 | 5,684 | 0.01 | Table 4.4 |
| Light Cyan | 42 | 32,777 | 0.07 | 4.1.1.8 |
| Light Green | 28 | 2,095 | 0.00 | 4.1.1.9 |
| Light Yellow | 24 | 11,714 | 0.03 | 4.1.1.10 |
| Magenta | 68 | 10,502 | 0.02 | 4.1.1.11 |
| Midnight Blue | 42 | 7,150 | 0.02 | Table 4.3 |
| Pink | 109 | 72,880 | 0.17 | Table 4.5 |
| Purple | 63 | 90,709 | 0.21 | 4.1.1.14 |
| Red | 160 | 2,270,977 | 5.19 | Table 4.6 |
| Salmon | 47 | 173,507 | 0.40 | 4.1.1.16 |
| Tan | 52 | 98,695 | 0.23 | 4.1.1.17 |
| Yellow | 268 | 3,838,964 | 8.77 | Table 4.7 |
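As a worked check of the percentage abundance column, the calculation for three of the modules is sketched below. The read counts and the study total are those given in Table 4.1 and the accompanying text; the variable names are illustrative only.

```python
TOTAL_READS = 43_775_771  # total reads in the whole study, as stated in the text

# Total abundance (reads) of three modules, taken from Table 4.1.
module_reads = {"Black": 366_013, "Blue": 9_370_060, "Brown": 18_519_535}

# Percentage abundance = 100 * module reads / total reads, to two decimal places.
percentage = {name: round(100 * reads / TOTAL_READS, 2)
              for name, reads in module_reads.items()}
```

These reproduce the tabulated values of 0.84, 21.40 and 42.31 for the Black, Blue and Brown modules respectively.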

4.5 Aim 3: Module-Trait Relationships

Correlation between module eigengenes and clinical variables was calculated using Spearman’s correlation (Section 4.2.6 above and Chapter 3, Section 3.2.3.4), adjusting for multiple testing using the Benjamini and Hochberg method (BH) 354. Correlation coefficients and adjusted P-values that showed significance at a 5% threshold are summarised in the table below (Table 4.2).

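Table 4.2 Module-trait associations significant at a 5% threshold. Correlation value and adjusted P-value were calculated using Spearman’s correlation and corrected using the Benjamini and Hochberg false discovery rate (FDR) procedure.

| Module | Statistic | Sex | Current Smoker | Pack Years | Monocyte Absolute Count |
|---|---|---|---|---|---|
| Grey60 | rho* | -0.18 | | | |
| | P-value | 0.003 | | | |
| Pink | rho | | 0.31 | 0.19 | |
| | P-value | | 2E-12 | 5E-04 | |
| Black | rho | | 0.15 | | |
| | P-value | | 0.02 | | |
| Cyan | rho | | 0.14 | | |
| | P-value | | 0.04 | | |
| Salmon | rho | | 0.17 | | |
| | P-value | | 0.004 | | |
| Light Green | rho | | -0.13 | | |
| | P-value | | 0.04 | | |
| Magenta | rho | | -0.17 | | |
| | P-value | | 0.003 | | |
| Midnight blue | rho | | | | -0.18 |
| | P-value | | | | 0.006 |
| Purple | rho | | -0.15 | | |
| | P-value | | 0.02 | | |
| Yellow | rho | | -0.24 | -0.18 | |
| | P-value | | 7E-07 | 0.003 | |
| Green | rho | | 0.14 | | |
| | P-value | | 0.04 | | |
| Red | rho | | -0.32 | -0.21 | |
| | P-value | | 2E-12 | 2E-04 | |

*rho = Spearman’s Correlation Coefficient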

Fifteen significant module-trait associations were identified after adjustment for multiple testing. Sex displayed a significant negative association with the Grey60 module (rho = -0.18, P = 0.003).

White blood cells (leukocytes) comprise a number of different cell types, including neutrophils, eosinophils, basophils, lymphocytes and monocytes. Monocytes are one of the largest types of white blood cells and can differentiate into macrophages and dendritic cells. Of the various cell count traits, only the absolute monocyte count achieved significance, displaying a negative correlation with the Midnight blue module (rho = -0.18, P = 0.006).

Overall, the majority of significant associations were seen for smoking-related traits. Being a current-smoker was significantly associated with ten different modules. The most significant were the Pink module (rho = 0.31, P = 2E-12), the Yellow module (rho = -0.24, P = 7E-07) and the Red module (rho = -0.32, P = 2E-12). These same modules were significantly associated with cumulative tobacco smoke exposure (Pack Years), which showed a similar pattern of associations to the binary variable ‘Current-Smoker’, though of lower magnitude: Pink module (rho = 0.19, P = 5E-04), Yellow module (rho = -0.18, P = 0.003) and Red module (rho = -0.21, P = 2E-04).

It should be noted that clinical traits relating to respiratory conditions and other diseases (such as diabetes and GERD) showed a number of nominally significant associations with different modules; however, these correlations were not strong enough to withstand adjustment for multiple testing.

Individual OTUs were correlated with the module eigengenes, to calculate module membership (MM, also known as eigengene-based connectivity). This enabled identification of intra-modular hubs. Full summary of the modules, their members and membership statistics can be found in Appendix 4.1.1 [Top10Silva].
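Module membership can be illustrated with a small sketch. Here the module eigengene is approximated as the first principal component of the standardised OTU profiles, found by power iteration, and module membership is taken as the Pearson correlation of each OTU with that eigengene. The use of Pearson’s correlation, the function names and the abundance profiles are all assumptions made for illustration and are not taken from the thesis analysis.

```python
from statistics import mean, pstdev


def standardise(profile):
    """Centre a profile on zero and scale it to unit standard deviation."""
    m, s = mean(profile), pstdev(profile)
    return [(v - m) / s for v in profile]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def module_eigengene(otu_profiles, iterations=200):
    """First principal component scores (one per sample) via power iteration."""
    z = [standardise(p) for p in otu_profiles]  # rows are OTUs, columns are samples
    n_samples = len(z[0])
    # Start from the summed profile so the eigengene points the same way as the module.
    scores = [sum(p[s] for p in z) for s in range(n_samples)]
    for _ in range(iterations):
        loadings = [sum(p[s] * scores[s] for s in range(n_samples)) for p in z]
        scores = [sum(l * p[s] for l, p in zip(loadings, z)) for s in range(n_samples)]
        norm = sum(v * v for v in scores) ** 0.5
        scores = [v / norm for v in scores]
    return scores


# Hypothetical log-abundance profiles of three OTUs across six samples.
profiles = {
    "OTU_a": [1, 2, 3, 4, 5, 6],
    "OTU_b": [2, 4, 5, 8, 11, 12],
    "OTU_c": [5, 1, 4, 2, 6, 3],
}
eigengene = module_eigengene(list(profiles.values()))
membership = {otu: pearson(p, eigengene) for otu, p in profiles.items()}
```

In this example OTU_a and OTU_b follow the eigengene closely and would be candidate hubs, whereas OTU_c has a low module membership.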

Ralstonia_10329 was the OTU with the highest module membership in the Light Cyan module. This is a common contaminant of extraction kits 119, and the Light Cyan module showed no significant module-trait association. In addition, the rest of the module was made up of bacteria that could be classified as potential contaminants, including members of the Burkholderia, Ralstonia, Massilia and Undibacterium genera. Although a number of steps were taken to remove potential contaminants (see Chapter 3, Sections 3.2.3.2, 3.2.3.5), using un-rarefied data created a greater probability of identifying some low abundance, potentially contaminant OTUs.

Overall, modules comprised a diverse selection of bacteria; however, a proportion of modules were dominated by just one bacterial family. In Chapter 3, Section 3.3.2.1, un-identified Firmicutes_6640 was proposed to be of streptococcal origin. Here it was found in the Green module. The majority of OTUs in this module belonged to the Streptococcaceae family of bacteria, meaning it could potentially be of similar origin.

Among the smaller modules, the Salmon module had 97% of its reads belonging to the Veillonellaceae family, and 99% of reads in the Tan module were from bacteria of the Prevotellaceae family. The Grey (unassigned) module contained only two OTUs: Corynebacterium_14237 and Corynebacterium_14236. These patterns of module composition may reflect how similar species can work closely together, have strong connectivity or share comparable ecological niches.

Out of all the significant modules highlighted in these analyses, five modules outlined in Table 4.2 showing the most significant associations with smoking, sex and monocyte count were further assessed to identify OTUs with the highest module membership (hubs).

4.6 Aim 4: Identification of Key ‘Hub’ Bacteria

Midnight Blue module members (Silva database)

This module was significantly associated with absolute monocyte value. It was made up of 42 different members, with a total abundance of 7,150 reads. The module members were assessed in closer detail; Table 4.3 gives a summary of the 12 OTUs that were most abundant, had the highest module membership and had the highest prevalence. Spearman’s correlation indicated a negative correlation between abundance of the significant OTUs (Table 4.3) and absolute monocyte values. Overall, 74% of OTUs belong to the Prevotellaceae family, making up 82% of abundance in the module.

To explore this module further, the abundance of all module members (42 OTUs) was correlated against absolute monocyte value for all samples. The data indicated a negative correlation (through Spearman’s correlation, Figure 4.6A) and a significant association between the two variables (through GLM [refer to Section 4.2.6], Figure 4.6B). This negative relationship suggests that absolute monocyte counts are higher when bacteria from the Prevotellaceae family are lower in abundance.

Figure 4.6 Comparison of Midnight Blue module members (predominantly from the Prevotellaceae family) against the absolute monocyte values. Grey shaded area indicates the 95% confidence interval. A. Significant negative correlation as identified through Spearman’s correlation. B. Significant association when tested using generalized linear regression model.

Table 4.3 Summary of the Midnight blue module members (Silva database). OTUs that displayed highest module membership (MM), highest abundance, and highest prevalence in the population. Spearman’s correlation was done to test OTU abundance against absolute monocyte value (Clinical Trait P-value).

| OTU_ID | Genus | Abundance | Prevalence (%) | MM | MM P-value | Clinical Trait P-value |
|---|---|---|---|---|---|---|
| Prevotella_1354 | Prevotella | 1,294 | 26.64 | 0.75 | 8.39E-107 | 1.33E-03 |
| Prevotella_809 | Prevotella | 1,049 | 21.80 | 0.72 | 4.02E-93 | 1.05E-02 |
| Prevotella_2653 | Prevotella | 384 | 16.09 | 0.64 | 2.48E-68 | 1.45E-02 |
| Prevotella_848 | Prevotella | 486 | 15.22 | 0.62 | 9.30E-63 | 3.86E-04 |
| Prevotella_858 | Prevotella | 313 | 14.01 | 0.60 | 4.21E-57 | 1.50E-01 |
| Prevotella_640 | Prevotella | 381 | 13.84 | 0.59 | 3.72E-55 | 6.77E-03 |
| Porphyromonas_11970 | Porphyromonas | 390 | 12.63 | 0.58 | 6.59E-53 | 7.56E-03 |
| Prevotella_829 | Prevotella | 152 | 12.80 | 0.56 | 7.17E-50 | 6.41E-03 |
| Prevotella_1990 | Prevotella | 255 | 11.94 | 0.56 | 2.05E-48 | 6.08E-02 |
| Prevotella_1852 | Prevotella | 145 | 12.11 | 0.53 | 2.47E-43 | 8.54E-02 |
| Prevotella_808 | Prevotella | 167 | 9.69 | 0.52 | 1.37E-40 | 2.64E-02 |
| Capnocytophaga_2419 | Capnocytophaga | 330 | 1.73 | 0.19 | 4.34E-06 | 6.22E-01 |

Grey60 module members (Silva database)

The Grey60 module was significantly associated with sex (gender). The module is made up of 30 different members, with a total abundance of 5,684 reads. Table 4.4 gives a summary of the OTUs that were the most abundant in the module, those with the highest module membership and the highest prevalence.

Overall, 90% of OTUs belong to Veillonellaceae family making up 81% of the abundance in the module, with a further 3% of the OTUs belonging to the Prevotellaceae family with a total abundance of 16%. Kruskal Wallis test was used to assess the associations of the individual OTUs (abundance) with sex (gender).

Unfortunately, bacteria identified in this module were not sequenced deeply enough, and therefore only recognised by their family name, Veillonellaceae. Consequently, it would not be possible to make any assumptions about the identity of these bacteria.

Previous studies have noted that the intestinal (gut) microbiome is significantly influenced by gender, with a higher presence of Veillonella in men than in women 360,361. When looking at the abundance of all OTUs from the Grey60 module, a comparable result can be seen: the abundance of these Veillonellaceae OTUs is significantly lower in females than in males.

Table 4.4 Summary of Grey60 module members (Silva database). OTUs that displayed highest module membership (MM), highest abundance, and highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against sex (gender) (Clinical Trait P-value)

| OTU_ID | Genus | Abundance | Prevalence (%) | MM | MM P-value | Clinical Trait P-value |
|---|---|---|---|---|---|---|
| Veillonellaceae_17903 | NA | 870 | 21.45 | 0.78 | 2.58E-120 | 3.70E-05 |
| Veillonellaceae_17924 | NA | 456 | 14.01 | 0.65 | 1.10E-69 | 1.15E-03 |
| Veillonellaceae_17934 | NA | 266 | 12.80 | 0.63 | 7.12E-65 | 2.58E-03 |
| Veillonellaceae_17899 | NA | 346 | 12.46 | 0.62 | 2.54E-62 | 4.65E-04 |
| Veillonellaceae_17940 | NA | 226 | 13.15 | 0.62 | 1.92E-61 | 6.22E-04 |
| Veillonellaceae_17056 | NA | 236 | 12.63 | 0.61 | 3.96E-61 | 8.61E-04 |
| Veillonellaceae_17901 | NA | 221 | 12.63 | 0.61 | 3.30E-59 | 6.32E-03 |
| Veillonellaceae_17912 | NA | 261 | 11.94 | 0.61 | 4.41E-59 | 2.75E-06 |
| Prevotella_1681 | Prevotella | 936 | 16.09 | 0.60 | 7.30E-59 | 8.79E-05 |
| Veillonellaceae_17938 | NA | 197 | 11.42 | 0.59 | 2.10E-55 | 3.41E-04 |

NA = Not Assigned

Pink module members (Silva database)

The Pink module was significantly associated with smoking, and was made up of 109 different members, with a total abundance of 72,880 reads. Table 4.5 gives a summary of OTUs that were the most abundant in the module, those with the highest module membership and the highest prevalence. Overall, 19% of the OTUs in the module belong to the Bifidobacteriaceae family, with a total abundance of 59%. A further 34% of the OTUs belong to the Lactobacillaceae family, with a total abundance of 25%, and finally 15% of OTUs were from the Streptococcaceae family, with a total abundance of 9%. A Kruskal Wallis test was used to assess the differences for the individual OTUs in relation to current-smoke exposure.

Correlation of the Pink module with current-smoking phenotype indicated a strong positive relationship (Table 4.2), which was confirmed by a significant increase in abundance of the top two Pink module hub OTUs in current-smokers (Figure 4.7).

[Figure 4.7 plots. Panels: PinkModule: Bifidobacterium_13861; PinkModule: Bifidobacteriaceae_13837. y-axis: Abundance (log10 scale). x-axis: Non-Smoker, Current-Smoker.]

Figure 4.7 Pink Module: comparing smoking vs. non-smoking individuals. Bacterial abundance of OTUs with highest module membership when compared between non-smoking individuals (including both never and ex-smokers) and current-smoking individuals.

A Kruskal Wallis test was used to compare the distributions between the two categories and indicated a significant difference. Both OTUs show a strong positive relationship with the current-smoking phenotype.

Table 4.5 Summary of Pink module members (Silva database). OTUs that displayed highest module membership (MM), highest abundance, and highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against current-smoke exposure (Clinical Trait P-value).

| OTU_ID | Genus | Abundance | Prevalence (%) | MM | MM P-value | Clinical Trait P-value |
|---|---|---|---|---|---|---|
| Bifidobacterium_13861 | Bifidobacterium | 36,777 | 45.50 | 0.64 | 8.94E-69 | 2.76E-15 |
| Bifidobacteriaceae_13837 | NA | 2,779 | 34.26 | 0.54 | 1.67E-45 | 1.61E-06 |
| Bifidobacterium_13401 | Bifidobacterium | 528 | 15.05 | 0.54 | 1.75E-44 | 4.50E-11 |
| Lactobacillus_4421 | Lactobacillus | 2,853 | 17.65 | 0.51 | 2.48E-40 | 2.42E-04 |
| Bifidobacterium_13785 | Bifidobacterium | 514 | 13.32 | 0.50 | 3.44E-38 | 7.02E-08 |
| Bifidobacteriaceae_13749 | NA | 1,430 | 17.47 | 0.49 | 2.09E-36 | 6.53E-06 |
| Lactobacillus_4496 | Lactobacillus | 3,359 | 16.61 | 0.49 | 1.41E-35 | 1.64E-09 |
| Lactobacillus_4348 | Lactobacillus | 5,295 | 10.90 | 0.44 | 2.33E-28 | 2.57E-19 |
| Streptococcus_20307 | Streptococcus | 326 | 8.65 | 0.41 | 1.74E-24 | 1.85E-10 |
| Actinomyces_13397 | Actinomyces | 178 | 7.44 | 0.39 | 1.15E-22 | 1.26E-14 |
| Streptococcus_224 | Streptococcus | 875 | 10.38 | 0.39 | 2.58E-22 | 4.11E-09 |
| Streptococcus_226 | Streptococcus | 4,243 | 16.44 | 0.36 | 3.01E-19 | 5.00E-09 |
| Lactobacillus_4457 | Lactobacillus | 1,619 | 6.23 | 0.33 | 1.62E-16 | 8.37E-07 |
| Prevotella_1947 | Prevotella | 2,474 | 6.92 | 0.24 | 3.47E-09 | 3.35E-03 |

Red module and Yellow module members (Silva database)

Both Yellow and Red modules (independently) displayed a strong negative correlation with current-smoking.

1. Red module members

The Red module was made up of 160 different members, with a total abundance of 2,270,977 reads. Table 4.6 gives a summary of OTUs that were the most abundant, those with the highest module membership and the highest prevalence.

Overall, 16% of OTUs belong to the Pasteurellaceae family, making up 7% of abundance in the module. A further 66% of the OTUs belong to the Neisseriaceae family, with a total abundance of 93%. A Kruskal Wallis test was used to assess the differences for OTUs in relation to current-smoke exposure.

2. Yellow module members

The Yellow module was made up of 268 different members, with a total abundance of 3,838,964 reads. Table 4.7 gives a summary of OTUs that were most abundant in the module, as well as those with the highest module membership and the highest prevalence.

Overall, this module contained a large selection of different OTUs, with 9% of OTUs belonging to the Fusobacteriaceae family and making up 50% of abundance in the module. A further 19% of the OTUs belong to the Lachnospiraceae family, with a total abundance of 9%. OTUs that belong to the Leptotrichiaceae family encompassed 20% of OTUs, with 9% of total abundance; the Porphyromonadaceae family made up 6% of OTUs, with 14% of total abundance; and finally 16% of OTUs belong to the Prevotellaceae family, with 9% of total abundance. A Kruskal Wallis test was used to assess the differences in OTU abundance in relation to current-smoke exposure.

Table 4.6 Summary of Red module members (Silva database). OTUs that displayed highest module membership (MM), highest abundance, and highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against current-smoke exposure (Clinical Trait P-value).

| OTU_ID | Genus | Abundance | Prevalence (%) | MM | MM P-value | Clinical Trait P-value |
|---|---|---|---|---|---|---|
| Neisseria_10019 | Neisseria | 1,739,107 | 99.13 | 0.95 | 6.52E-286 | 2.57E-15 |
| Neisseria_10020 | Neisseria | 16,598 | 69.72 | 0.93 | 1.42E-256 | 5.49E-12 |
| Neisseria_10178 | Neisseria | 15,245 | 76.47 | 0.92 | 1.28E-234 | 5.85E-13 |
| Neisseriaceae_10099 | NA | 19,438 | 78.20 | 0.91 | 1.72E-226 | 1.47E-09 |
| Neisseria_14366 | Neisseria | 14,407 | 66.09 | 0.91 | 5.72E-221 | 4.94E-10 |
| Haemophilus_11369 | Haemophilus | 21,654 | 68.86 | 0.91 | 1.13E-220 | 2.20E-09 |
| Neisseria_10062 | Neisseria | 6,911 | 65.40 | 0.91 | 7.51E-217 | 2.81E-12 |
| Neisseria_10095 | Neisseria | 2,624 | 52.25 | 0.86 | 7.08E-168 | 5.23E-09 |
| Neisseria_9813 | Neisseria | 2,695 | 53.81 | 0.85 | 1.00E-160 | 2.24E-09 |
| Neisseria_9962 | Neisseria | 159,167 | 93.25 | 0.83 | 1.98E-147 | 1.01E-12 |
| Neisseria_9888 | Neisseria | 76,545 | 78.03 | 0.80 | 1.20E-130 | 1.60E-10 |
| Neisseria_10209 | Neisseria | 13,657 | 81.83 | 0.74 | 1.64E-99 | 1.80E-12 |
| Neisseriaceae_9657 | NA | 5,033 | 68.34 | 0.67 | 4.80E-76 | 3.59E-03 |
| Pasteurellaceae_11461 | NA | 123,723 | 95.33 | 0.64 | 1.89E-68 | 2.44E-04 |

Table 4.7 Summary of Yellow module members (Silva database). OTUs that displayed the highest module membership (MM), highest abundance, and highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against current-smoke exposure (Clinical Trait P-value).

| OTU_ID | Genus | Abundance | Prevalence (%) | MM | MM P-value | Clinical Trait P-value |
|---|---|---|---|---|---|---|
| Fusobacterium_7409 | Fusobacteriaceae | 1,924,185 | 99.83 | 0.87 | 2.95E-181 | 9.09E-06 |
| Leptotrichia_8995 | Leptotrichia | 325,378 | 95.33 | 0.84 | 4.55E-153 | 3.73E-07 |
| Lachnoanaerobaculum_14972 | Lachnoanaerobaculum | 75,695 | 95.16 | 0.82 | 2.16E-140 | 1.36E-05 |
| Leptotrichia_8999 | Leptotrichia | 12,422 | 59.34 | 0.82 | 5.22E-139 | 8.97E-05 |
| Catonella_9245 | Catonella | 21,114 | 90.48 | 0.78 | 5.16E-120 | 1.09E-07 |
| Prevotella_660 | Prevotella | 4,123 | 51.04 | 0.78 | 4.01E-117 | 2.08E-06 |
| Prevotella_1736 | Prevotella | 28,364 | 78.89 | 0.77 | 7.34E-116 | 1.93E-08 |
| Peptostreptococcus_20146 | Peptostreptococcus | 74,891 | 90.14 | 0.77 | 1.83E-115 | 4.40E-08 |
| Bergeyella_430 | Bergeyella | 32,958 | 88.93 | 0.76 | 1.27E-111 | 1.36E-06 |
| Oribacterium_15965 | Oribacterium | 125,233 | 97.92 | 0.76 | 1.38E-111 | 1.50E-06 |
| Porphyromonas_11896 | Porphyromonas | 492,971 | 97.58 | 0.74 | 5.52E-103 | 1.90E-09 |
| Stomatobaculum_15638 | Lachnospiraceae | 82,220 | 88.58 | 0.74 | 1.21E-102 | 5.84E-11 |
| Prevotella_621 | Prevotella | 144,949 | 92.73 | 0.74 | 6.15E-102 | 5.60E-10 |
| Capnocytophaga_509 | Capnocytophaga | 113,129 | 90.66 | 0.72 | 6.83E-95 | 1.72E-05 |
| Capnocytophaga_2454 | Flavobacteriaceae | 114,044 | 87.54 | 0.66 | 2.66E-72 | 3.02E-04 |

Both Yellow and Red modules displayed a strong negative correlation with current-smoking. This was confirmed when bacterial abundances for the four hub OTUs were plotted and the bacteria were seen to be decreased in current-smokers (Figure 4.8 panels A and B respectively).

These hub OTUs had been seen in previous analyses (Chapter 3, Section 3.3.4.2) where all four were decreased in the upper airways of smokers through the DESeq2 analysis. Indicator species analysis identified Neisseria_10019, Neisseria_10020 and Leptotrichia_8995 to be associated with the never-smoking individuals, and Fusobacterium_7409 to be an indicator species for the ex-smoking individuals.

[Figure 4.8 plots. Panel A: YellowModule: Fusobacterium_7409; YellowModule: Leptotrichia_8995. Panel B: RedModule: Neisseria_10019; RedModule: Neisseria_10020. y-axis: Abundance (log10 scale). x-axis: Non-Smoker, Current-Smoker.]

Figure 4.8 Yellow (A) and Red (B) modules: smoking vs. non-smoking individuals. Bacterial abundance of OTUs with highest module membership when compared between the non- and current-smoking individuals.

Kruskal Wallis test was used to compare the distributions between the two categories and indicated a significant difference. All the hub bacteria were negatively associated with the current-smoking phenotype.

Pack Years Variable (Red, Yellow and Pink modules)

The Red, Yellow and Pink modules (Table 4.2) that were significantly associated with current-smoking were also significantly associated, to a smaller extent, with the number of pack years to which study subjects were exposed. Pack years quantifies the smoking history of current and ex-smoking individuals, and is used to assess exposure over the smoking period and the risk of developing related diseases. Consequently, all never-smoking individuals were excluded from the analysis.

Spearman’s correlation was used to look at the relationship between abundance of module hub bacteria and pack years (Table 4.8). The results indicated that there was a positive correlation between bacteria of the Pink module and pack years. The opposite correlation was seen for bacteria of the Red and Yellow modules which displayed a significant negative correlation with an increase in number of pack years.

Table 4.8 Correlation between bacterial abundance and pack years. Spearman’s Correlation was used to look at the relationship between abundance of the top two hub bacteria of each module (Pink, Red and Yellow modules) and pack years.

| Module | OTU ID | Pack Years: Spearman’s Correlation rho | Pack Years: P-value |
|---|---|---|---|
| Pink | Bifidobacterium_13861 | 0.169 | 3.8E-03 |
| Pink | Bifidobacteriaceae_13837 | 0.153 | 8.8E-03 |
| Red | Neisseria_10019 | -0.260 | 7.3E-06 |
| Red | Neisseria_10020 | -0.190 | 1.1E-03 |
| Yellow | Fusobacterium_7409 | -0.138 | 1.9E-02 |
| Yellow | Leptotrichia_8995 | -0.133 | 2.3E-02 |

4.7 Aim 5: Functional Inference through PICRUSt

Functional capabilities of bacteria identified from the airway microbiome of BHS participants were examined using PICRUSt (phylogenetic investigation of communities by reconstruction of unobserved states) 362.

PICRUSt is a two-part workflow. Initially gene content of each OTU is predicted using precompiled information from the Greengenes database 37. The OTU table is then converted into a metagenome table to allow metagenome inference. This is achieved by aligning the OTU table computed from the 16S rRNA gene study together with the copy number of marker genes and gene contents of each OTU (counts of gene families on a per sample basis).

PICRUSt has very specific requirements when processing OTU tables including use of a closed reference OTU-picking method. Reads are clustered against a reference sequence collection and only those seen in the database are selected for further downstream analysis, with anything not in the database being discarded. Additionally, OTUs have to be picked against the Greengenes database. The current version of the Greengenes database contains only 3,093 species and was last updated in May 2013 363.

Sequences for the Busselton dataset described in preceding sections (and Chapter 3) were processed using an open-reference OTU-picking method, where reads are initially clustered against the more comprehensive and regularly updated Silva database, and all those that were not found in the database were clustered separately as “de-novo” OTUs. The current version of the Silva database contains around 12,117 quality checked bacterial species, making it much larger by comparison, and was last updated in September 2016. The Silva database (Chapter 2, Section 2.4) was used to process the Busselton sequences, as this extensively aligned rRNA sequencing database was more comprehensive than the Greengenes database.

There is currently no direct way of converting between the two processing methods, and although OTU names may have similarities they are not translatable between databases. Consequently, to allow application of the PICRUSt software, the OTU table for this data set had to be re-constructed using the appropriate OTU-picking method, with reverse strand matching (checking the sequences in both directions) at the 97% identity level against the Greengenes database (Version 13.5, May 2013). As network results were also not translatable, WCNA analysis had to be repeated.

A summary of PICRUSt (version 1.1.2) analysis that was carried out on the data is shown in the following flowchart (Figure 4.9).

Figure 4.9 PICRUSt analysis summary workflow.


4.7.1 WCNA Analysis using Greengenes dataset

As before, prior to WCNA analysis a filtering pipeline was applied to the data; any unclassified OTUs were removed (Chapter 3, Section 3.2.3). Data was further filtered to remove any OTUs present in only one sample and OTUs with fewer than 20 reads. This resulted in 1,191 OTUs across 578 samples. Data was not rarefied.
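The two count-based filters can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the thesis: the function name `filter_otus`, the toy table, and the reading of "fewer than 20 reads" as a total across all samples are assumptions made for the example.

```python
def filter_otus(table, min_samples=2, min_reads=20):
    """Keep OTUs seen in at least `min_samples` samples and with at
    least `min_reads` reads in total (assumed interpretation).

    `table` maps an OTU ID to its list of per-sample read counts.
    """
    kept = {}
    for otu, counts in table.items():
        n_present = sum(1 for c in counts if c > 0)
        total_reads = sum(counts)
        if n_present >= min_samples and total_reads >= min_reads:
            kept[otu] = counts
    return kept


# Hypothetical OTU table: three OTUs across four samples.
toy_table = {
    "OTU_a": [120, 0, 35, 8],  # retained
    "OTU_b": [0, 0, 50, 0],    # present in one sample only
    "OTU_c": [3, 4, 0, 2],     # fewer than 20 reads in total
}
filtered = filter_otus(toy_table)
print(sorted(filtered))
```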

WCNA analysis was repeated using the same methods as described previously. Briefly, the dataset was log10(x+1) transformed and then used to draw a hierarchical clustering dendrogram. The cut height threshold was set at 20, removing 25 individuals and leaving 553 individuals for further network construction (see Appendix 4.2 for dendrogram). Data processing was attempted with a higher threshold in order to retain more samples; however, downstream network analysis then saw a significant proportion of OTUs assigned to the Grey module, and overall associations were not convincing. Therefore, the stringent removal of outliers was retained.

An adjacency matrix was constructed using a β soft thresholding parameter of 2. Minimum module membership was preserved at 20, resulting in 8 modules being constructed. There was no Grey (unassigned) module so all OTUs were assigned into a module, and no modules were merged at 0.75 similarity level.
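As an illustration of what soft thresholding does, the sketch below raises the absolute Spearman correlation between two OTU profiles to the power β. It assumes an unsigned network (absolute correlations), which is not stated in the text, and the helper names and toy counts are hypothetical. The log10(x+1) transform is shown for completeness, although rank-based correlation is unaffected by it.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def adjacency(profiles, beta=2):
    """Unsigned soft-thresholded adjacency: |rho| ** beta for each OTU pair."""
    ids = sorted(profiles)
    return {(a, b): abs(spearman(profiles[a], profiles[b])) ** beta
            for a in ids for b in ids if a != b}


# Hypothetical read counts for three OTUs in five samples.
counts = {
    "OTU_A": [0, 9, 99, 999, 9999],
    "OTU_B": [2, 5, 40, 300, 7000],
    "OTU_C": [900, 30, 80, 0, 4],
}
profiles = {otu: [math.log10(c + 1) for c in row] for otu, row in counts.items()}
adj = adjacency(profiles, beta=2)
print(adj[("OTU_A", "OTU_B")], adj[("OTU_A", "OTU_C")])
```

With β = 2, a correlation of magnitude 0.8 becomes an adjacency of 0.64, so weaker correlations are down-weighted more strongly than stronger ones without being cut off at a hard threshold.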

Correlations between clinical variables and identified modules were calculated as previously using Spearman’s correlation and P-values, the latter adjusted for multiple comparisons using the Benjamini and Hochberg method of false discovery rate control. Summary of the significant associations can be found in Figure 4.10.
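The Benjamini and Hochberg adjustment can be written out directly. The following is a minimal sketch in Python with hypothetical P-values, not the code used for the thesis: each P-value is multiplied by the number of tests divided by its rank, and monotonicity is then enforced from the largest P-value downwards.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each adjusted
    # value never exceeds the adjusted value of any larger P-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for four module-trait correlations.
raw = [0.001, 0.02, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```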


[Figure 4.10: heatmap titled "Whole Population: Module-trait relationships (Greengenes database)". Rows are the eight module eigengenes (MEgreen, MEturquoise, MEyellow, MEblack, MEblue, MEbrown, MEpink, MEred) and columns are the clinical traits. Each cell gives the correlation value with the adjusted P-value in brackets, on a colour scale from -1 to 1.]

Figure 4.10 Module-trait associations (Greengenes database). Correlation value (top value) and adjusted P-value (bottom value) were calculated using Spearman’s correlation and corrected with the Benjamini and Hochberg false discovery rate (FDR) procedure. ME = Module Eigengene.


4.7.2 Comparison with Original WCNA Results

Fewer bacteria were identified through the Greengenes (GG) database than the Silva (SV) database (unrarefied data contained 1,191 OTUs versus 4,218 OTUs respectively). This reduction in data set size and composition inevitably had effects on network contents and structure, as well as relationships with clinical variables. This meant that GG results could not be considered identical to initial SV outcomes. Nevertheless, some overlap in bacterial communities at a higher taxonomic rank was seen (refer to Appendix 4.1.2 [Top10Greengenes] for a detailed summary of the bacterial composition of each module).

For example, the Black module seems to be similar in both GG and SV datasets. In both cases this module mainly consisted of members from the families Prevotellaceae, Porphyromonadaceae, and Spirochaetaceae. Additionally, both of the Black modules showed a positive association with current-smoking.

The Brown module, although showing no association with any of the clinical variables, in both datasets contained a similar selection of OTUs from Prevotellaceae, Actinomycetaceae and Veillonellaceae families, indicating that the network construction method is able to consistently identify features regardless of the database used to predict gene content.

Similar to the SV (Silva) analysis, the GG (Greengenes) analysis also identified a “contamination” module (GG_Red). This module was predominantly made up of known contaminants including Burkholderiales, Methylobacterium, Rhizobiales, Oxalobacteraceae, Xanthomonadaceae 119, which is reflective of the SV Light Cyan module seen previously.

Following adjustment for false discovery rate, only relationships with smoking and the smoking-related variable (pack years) attained significance in the GG dataset. Current-smoking displayed a strong association with a number of identified modules, the strongest of which were with the Yellow (rho = -0.23, P = 3.64E-06) and Blue modules (rho = 0.18, P = 7.84E-04).


Smoking related modules (Greengenes database)

(i) Blue module members

The Blue module was significantly associated with smoking, and made up of 232 different members, with a total abundance of 13,540,552 reads. Table 4.9 gives a summary of OTUs that were most abundant in the module, those with highest module membership and highest prevalence.

Overall, 37% of OTUs belong to the Streptococcaceae family, making up 63% of abundance in that module; a further 14% of the OTUs belong to the Veillonellaceae family, with a total abundance of 36%. The Kruskal Wallis test was used to assess differences in OTU abundance by current-smoke exposure.

The Blue module (GG) displayed a strong positive association with current-smoking and was dominated by Streptococcaceae and Veillonellaceae families. Mapping this module back to the SV analysis was more complex, as this positive association and the bacterial content was seen across a number of different SV modules (SV Pink [Table 4.5], SV Green and SV Salmon).

When looking closer at the content of the Blue module (GG), all identified members of Bifidobacteriaceae family were found in it, which is reflective of the significant OTUs found using the Silva database.


Table 4.9 Summary of Blue module members (Greengenes database). OTUs that displayed the highest module membership (MM), the highest abundance, and the highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against current-smoke exposure (Clinical Trait P-value).

OTU ID                 Genus             Abundance   Prevalence (%)   MM     MM P-value   Clinical Trait P-value
Streptococcus_2836     Streptococcus     16,062      93.25            0.94   5.92E-257    9.84E-04
Streptococcus_688      Streptococcus     3,221,499   100.00           0.91   1.35E-208    2.86E-05
Streptococcus_2375     Streptococcus     38,629      97.58            0.89   7.49E-186    5.31E-04
Streptococcus_2820     Streptococcus     5,017,609   100.00           0.89   1.41E-185    1.97E-04
Streptococcus_3007     Streptococcus     6,837       87.72            0.88   9.90E-177    8.37E-03
Veillonella_2485       Veillonella       726,083     100.00           0.87   2.62E-171    8.05E-03
Veillonella_1334       Veillonella       143,87      98.62            0.87   1.29E-170    8.64E-02
Veillonella_1597       Veillonella       3,821,069   100.00           0.86   9.54E-164    3.87E-03
Streptococcus_1426     Streptococcus     2,533       81.14            0.84   1.41E-145    8.88E-05
Veillonella_1281       Veillonella       72,431      98.27            0.82   3.32E-137    8.80E-03
Streptococcus_1019     Streptococcus     8,654       93.77            0.78   1.34E-116    1.30E-02
Streptococcaceae_177   NA                54,539      94.29            0.76   1.29E-104    8.89E-01
Rothia_714             Rothia            73,129      76.64            0.49   6.15E-35     1.48E-05
Bifidobacterium_719    Bifidobacterium   37,881      53.11            0.33   1.74E-15     1.38E-15


(ii) Yellow module members

The Yellow module was significantly associated with smoking, and made up of 155 different members, with a total abundance of 3,853,659 reads. Table 4.10 gives a summary of OTUs that were most abundant in the module, those with the highest module membership and highest prevalence.

Overall, 33% of OTUs belong to the Neisseriaceae family, making up 54% of abundance in the module, with a further 48% of the OTUs belonging to the Pasteurellaceae family with a total abundance of 43%. The Kruskal Wallis test was used to assess differences in OTU abundance by current-smoke exposure.

There was a negative association between members of the GG Yellow module and current-smoking. Bacteria from this module were predominantly identified in the Neisseriaceae and Pasteurellaceae families. A similar negative correlation was seen in the SV Red module (Table 4.6), which was dominated by members from the Neisseria genus.

The direction of associations (negative or positive) of the top 2 OTUs was confirmed by comparing abundance of each of the representative OTUs graphically between non-smoking (including both never- and ex-smoking individuals) and current-smoking individuals (Figure 4.11).

On the whole the Blue module hub OTUs were increased in the smoking individuals (Figure 4.11A), whereas the Yellow module hub OTUs were relatively decreased in the current-smoking individuals (Figure 4.11B).


Table 4.10 Summary of Yellow module members (Greengenes database). OTUs that displayed the highest module membership (MM), highest abundance, and highest prevalence in the population. Kruskal Wallis was done to test OTU abundance against current-smoke exposure (Clinical Trait P-value).

OTU ID                Genus            Abundance   Prevalence (%)   MM       MM P-value   Clinical Trait P-value
Haemophilus_613       Haemophilus      1,152,677   100.00           0.91     5.46E-207    9.20E-07
Neisseriaceae_1893    NA               1,801,646   99.13            0.84     1.01E-146    3.24E-15
Actinobacillus_653    Actinobacillus   4,602       74.91            0.83     2.02E-140    3.85E-06
Kingella_1835         Kingella         14,511      78.89            0.83     7.18E-140    1.01E-12
Neisseria_2140        Neisseria        8,155       64.01            0.81     3.61E-129    1.66E-09
Neisseria_161         Neisseria        4,817       64.36            0.79     1.10E-121    1.07E-10
Neisseria_45          Neisseria        7,919       66.44            0.79     6.78E-118    3.09E-11
Neisseria_2391        Neisseria        1,985       52.42            0.76     8.13E-106    7.59E-09
Neisseria_852         Neisseria        6,335       65.92            0.76     1.27E-104    1.76E-10
Pasteurellaceae_788   NA               119,010     98.10            0.75     3.85E-101    9.83E-05
Neisseria_1285        Neisseria        138,094     89.62            0.66     3.31E-69     1.20E-10
Actinobacillus_1237   Actinobacillus   82,140      89.45            0.63     1.65E-61     2.41E-04
Neisseria_2315        Neisseria        71,368      67.13            0.61     2.33E-58     5.55E-08
Haemophilus_1340      Haemophilus      92,409      88.24            0.61     2.81E-57     3.75E-03
Rothia_920            Rothia           27,780      74.74            0.52     1.27E-39     5.45E-05
Gallibacterium_317    Gallibacterium   161,963     73.36            0.37     5.65E-19     2.02E-01
Porphyromonas_360     Porphyromonas    84,386      16.96            -0.082   5.41E-02     1.35E-03


[Figure 4.11: four plots of abundance (log10 scale) in Non-Smoker versus Current-Smoker groups. Panel A: BlueModule: Streptococcus_2836 and BlueModule: Streptococcus_688. Panel B: YellowModule: Haemophilus_613 and YellowModule: Neisseriaceae_1893.]

Figure 4.11 Blue (A) and Yellow (B) modules: comparison between smoking vs. non-smoking individuals. Bacterial abundance of OTUs with highest module membership when compared between the non- and current-smoking individuals. Kruskal Wallis test was used to compare the distributions between the two categories and indicates a significant difference.


4.7.3 PICRUSt Analysis of Whole Dataset

After picking a new OTU table with Greengenes and repeating the WCNA analysis to confirm the presence of strong bacterial interactions and associations with smoking phenotypes, the new dataset was ready to be taken through the PICRUSt processing pipeline.

Initially, the OTU table was normalized according to the default marker gene copy number. At this point each OTU value per sample was divided by the PICRUSt’s pre-calculated 16S rRNA gene copy number 362,364, meaning that the OTU table reflected relative abundances of organisms and could be used to construct metagenome functional predictions.

Functional predictions of KEGG Orthologs (KOs) were calculated using the normalised OTU table, which was multiplied by a vector of gene counts for each OTU. The Kyoto Encyclopaedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/) is a bioinformatics project that was initiated to collate post-genomic analyses, and has become a resource tool in understanding a variety of different biological systems 365,366. In summary, KEGG links sets of genes in a genome to biological pathways or complexes. Since its launch in 1995 by the Human Genome Program of the Ministry of Education, Science, Sports and Culture in Japan 367, the project has greatly expanded to include many data-oriented entry points.
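The two arithmetic steps described above (division by 16S rRNA gene copy number, then multiplication by per-OTU gene counts) can be illustrated with a small worked example in Python. This is a sketch of the calculation only and does not call PICRUSt; the function names, OTU IDs, copy numbers and the placeholder identifiers `KO_A` and `KO_B` are hypothetical.

```python
def normalise_by_copy_number(otu_counts, copy_number):
    """Divide each OTU's per-sample counts by its 16S rRNA gene copy number."""
    return {otu: [c / copy_number[otu] for c in counts]
            for otu, counts in otu_counts.items()}


def predict_functions(normalised, gene_content):
    """Sum (normalised abundance x gene count) over OTUs for each function."""
    n_samples = len(next(iter(normalised.values())))
    functions = {}
    for otu, abundances in normalised.items():
        for function, n_genes in gene_content[otu].items():
            row = functions.setdefault(function, [0.0] * n_samples)
            for sample_index, abundance in enumerate(abundances):
                row[sample_index] += abundance * n_genes
    return functions


# Hypothetical inputs: two OTUs in two samples.
otu_counts = {"otu1": [10, 4], "otu2": [6, 0]}
copy_number = {"otu1": 2, "otu2": 3}             # 16S copies per genome
gene_content = {"otu1": {"KO_A": 1, "KO_B": 2},  # gene copies per genome
                "otu2": {"KO_A": 3}}

normalised = normalise_by_copy_number(otu_counts, copy_number)
predicted = predict_functions(normalised, gene_content)
print(normalised)
print(predicted)
```

In the first sample, otu1 contributes 5 organisms carrying one copy of KO_A and otu2 contributes 2 organisms carrying three copies, giving a predicted KO_A count of 11.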

The KO predictions were used to categorize each OTU by their metabolic functions into a KEGG Pathway. The KEGG pathway is a compilation of manually drawn pathway maps demonstrating the current understanding of molecular interaction, reaction and relation networks for seven different systems. These make up the first level of KEGG Pathway (KEGG level 1) and include: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases and Drug Development. Each one is made up of a number of different processes (KEGG level 2), which in turn group together particular pathways (KEGG level 3).


An example of a KEGG pathway is photosynthesis, the process by which green plants and some specialised bacteria use light to make organic compounds and energy from carbon dioxide and water:

Metabolism (KEGG1) -> Energy Metabolism (KEGG2) -> Photosynthesis (KEGG3)
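The three-level hierarchy lends itself to a simple lookup, so that counts at KEGG level 3 can be summed up to level 2 or level 1. The sketch below is illustrative only: the `HIERARCHY` table holds a handful of example entries rather than the full KEGG classification, and the counts are hypothetical.

```python
from collections import defaultdict

# Illustrative subset: KEGG level 3 function -> (level 1, level 2).
HIERARCHY = {
    "Photosynthesis": ("Metabolism", "Energy Metabolism"),
    "Oxidative phosphorylation": ("Metabolism", "Energy Metabolism"),
    "Galactose metabolism": ("Metabolism", "Carbohydrate Metabolism"),
    "ABC transporters": ("Environmental Information Processing",
                         "Membrane Transport"),
}


def collapse(level3_counts, level):
    """Sum KEGG level 3 counts up to level 1 or level 2."""
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    totals = defaultdict(float)
    for function, count in level3_counts.items():
        level1, level2 = HIERARCHY[function]
        totals[level1 if level == 1 else level2] += count
    return dict(totals)


# Hypothetical predicted counts for one sample.
sample = {
    "Photosynthesis": 5,
    "Oxidative phosphorylation": 120,
    "Galactose metabolism": 80,
    "ABC transporters": 300,
}
level1_totals = collapse(sample, 1)
level2_totals = collapse(sample, 2)
print(level1_totals)
print(level2_totals)
```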

The resultant pathway predictions were imported back into R for analysis. Data went through a series of basic processing steps, including removal of any functions present in one sample only and those with fewer than 20 counts. Additionally, the rarefaction level was established to be 329,878 reads, which removed two samples and one function; this rarefaction level was used for the whole population analysis only. This resulted in 268 functions at KEGG level 3 across 576 samples.
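Rarefaction to a fixed depth can be sketched as random subsampling without replacement, with any sample below the chosen depth being dropped. The Python below is a minimal illustration on toy data and is not the R routine used in the thesis; the function name, sample names and the depth of 50 are hypothetical, and expanding every count into a pool is only practical for small totals.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample one sample's counts to `depth` without replacement.

    Returns None when the sample has fewer than `depth` counts, in which
    case the sample is dropped from the rarefied dataset.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # One pool entry per individual count (fine for toy data only).
    pool = [name for name, c in counts.items() for _ in range(c)]
    drawn = rng.sample(pool, depth)
    rarefied = {name: 0 for name in counts}
    for name in drawn:
        rarefied[name] += 1
    return rarefied


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = {
    "S1": {"f1": 60, "f2": 30, "f3": 10},  # 100 counts: retained
    "S2": {"f1": 5, "f2": 3, "f3": 0},     # 8 counts: below depth, dropped
}
rarefied = {sid: rarefy(c, 50, rng) for sid, c in samples.items()}
kept = {sid: r for sid, r in rarefied.items() if r is not None}
print(kept)
```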

Data was not normally distributed, as confirmed using the Shapiro-Wilk test (mean = 7.1E+05, SD = 1.31E+06, P = 2.2E-16) (Chapter 2, Section 2.5.2.1B); therefore, non-parametric statistical analyses were carried out.

Beta diversity was calculated using a Bray Curtis dissimilarity matrix, where the mean distance between samples was 0.06 (SD = 0.02), indicating that samples were highly similar in terms of their functional capabilities. In fact, 211 functions were seen across all samples. The fifty most abundant functions can be found in Figure 4.12.
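For reference, the Bray Curtis dissimilarity between two samples is the sum of absolute differences in counts divided by the sum of all counts in both samples, so values near 0 indicate near-identical profiles. The sketch below computes it for three hypothetical functional profiles and averages over all sample pairs; the names and numbers are illustrative only.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts of three functions in three samples.
profiles = {
    "S1": [10, 20, 30],
    "S2": [10, 25, 25],
    "S3": [12, 18, 30],
}
distances = {
    (p, q): bray_curtis(profiles[p], profiles[q])
    for p, q in combinations(sorted(profiles), 2)
}
mean_distance = sum(distances.values()) / len(distances)
print(distances)
print(mean_distance)
```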


[Figure 4.12: dot plot titled "Busselton Cohort: Top 50 KEGG3 Functions". The x-axis shows abundance and the points are coloured by KEGG level 1 pathway (Environmental Information Processing, Genetic Information Processing, Metabolism, Unclassified). Functions shown, in plot order: Transporters; ABC transporters; General function prediction only; DNA repair and recombination proteins; Ribosome; Purine metabolism; Pyrimidine metabolism; Peptidases; Ribosome Biogenesis; Chromosome; Amino acid related enzymes; Amino sugar and nucleotide sugar metabolism; Function unknown; Transcription factors; Secretion system; Aminoacyl-tRNA biosynthesis; DNA replication proteins; Glycolysis / Gluconeogenesis; Pyruvate metabolism; Phosphotransferase system (PTS); Chaperones and folding catalysts; Other ion-coupled transporters; Two-component system; Homologous recombination; Translation proteins; Peptidoglycan biosynthesis; Oxidative phosphorylation; Cysteine and methionine metabolism; Methane metabolism; Others; Carbon fixation pathways in prokaryotes; Starch and sucrose metabolism; Alanine, aspartate and glutamate metabolism; Fructose and mannose metabolism; Arginine and proline metabolism; Mismatch repair; Glycine, serine and threonine metabolism; Replication, recombination and repair proteins; Phenylalanine, tyrosine and tryptophan biosynthesis; Galactose metabolism; Pentose phosphate pathway; Porphyrin and chlorophyll metabolism; Valine, leucine and isoleucine biosynthesis; DNA replication; Protein folding and associated processing; Protein export; Lysine biosynthesis; One carbon pool by folate; Energy metabolism; Bacterial secretion system.]

Figure 4.12 Top 50 (Most Abundant) functions (KEGG 3 level) found in BHS. All the functions were seen in 100% of samples. Each point is coloured according to the KEGG 1 level each function belongs to.


In light of the earlier findings (Section 4.5, 4.7.1, Chapter 3, Section 3.3.4) in relation to current-smoking status, abundances of KEGG level 3 functions were assessed using the Kruskal Wallis test by current-smoking status (comparing current-smokers against current non-smokers), resulting in 106 functions showing significance (P < 0.001, adjusted using Benjamini-Hochberg). To confirm these results and make trend predictions, a generalized linear model (GLM, refer to Section 4.2.6) 355 was constructed. Results indicated 84 functions significantly associated with current-smoking (P < 0.001, adjusted using Benjamini-Hochberg), all of which were found to be in agreement with results from the Kruskal Wallis test (Appendix 4.3).
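For a two-group comparison such as current-smokers versus current non-smokers, the Kruskal Wallis statistic can be computed by hand from the pooled ranks. The sketch below is a minimal Python illustration with hypothetical abundances, not the analysis code used in the thesis. It applies the usual tie correction and, because there are only two groups, takes the P-value from a chi-square distribution with one degree of freedom, which can be written with the complementary error function.

```python
import math


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, P) for a two-group Kruskal Wallis test."""
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b],
                    key=lambda item: item[0])
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # average rank for tied values
        t = end - start + 1
        tie_term += t ** 3 - t
        for k in range(start, end + 1):
            rank_sums[pooled[k][1]] += mean_rank
        start = end + 1
    h = (12 / (n * (n + 1))
         * (rank_sums[0] ** 2 / len(group_a) + rank_sums[1] ** 2 / len(group_b))
         - 3 * (n + 1))
    h /= 1 - tie_term / (n ** 3 - n)        # tie correction
    h = max(h, 0.0)
    # Chi-square survival function with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical abundances of one KEGG level 3 function.
current_smokers = [12, 15, 9, 20]
non_smokers = [30, 42, 25, 38]
h_stat, p_value = kruskal_wallis_two_groups(current_smokers, non_smokers)
print(h_stat, p_value)
```

In practice one P-value of this kind would be obtained per function and the full set then adjusted for multiple testing.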

Summarising these at KEGG level 1, there were three Cellular Processes functions, four functions from Environmental Information Processing, seven functions from Genetic Information Processing pathway and eleven functions from Human Diseases. These included a negative relationship between current-smoking and neurodegenerative diseases, such as Alzheimer’s disease, Huntington’s disease and Parkinson’s disease. On the other hand, current-smoking displayed a positive relationship with metabolic (Diabetes mellitus, Type I and II) and infectious diseases (Staphylococcus aureus infection and Tuberculosis).

There were six functions that were grouped by ‘Organismal systems’ and 13 functions that were Unclassified, which included poorly characterised functions and those that were characterised under Metabolism or Cellular Processes and Signalling at the KEGG level 2. The biggest group seen showing association with current-smoking was Metabolism, with 40 different functions displaying significant association with current-smoking (GLM results).

On the other hand, never-smoking (not including ex-smokers) was positively associated with one function; Other Ion-Coupled Transporters, from Unclassified - Cellular Processes and Signalling pathways (Kruskal Wallis: P-value = 0.0003, GLM: Z = 3.61, P = 0.0003).


4.7.4 PICRUSt Analysis of WCNA Modules

Initially, the eight bacterial network modules identified through GG WCNA analysis were explored for functional capabilities through PICRUSt (refer to Figure 4.9). For each module, an individual OTU table was exported containing all bacteria found within that module. Each of the OTU tables was then taken through the PICRUSt analysis workflow independently, as previously done for the whole data (Section 4.7.3 above), resulting in eight independent functional pathway prediction tables, which were all imported back into R for processing.

Table 4.11 provides a module summary including the number of OTUs contained in each module, which OTU was the representative for that module, number of functions (at KEGG level 3) these modules translated to and what the overall abundance of these functions was. It should also be noted that modules containing the fewest OTUs (Black, Pink and Red modules) also resulted in the lowest abundance of functions.

Table 4.11 Summary of the WCNA modules (Greengenes database). This includes total number of OTUs making up each module and their representative OTUs with highest module membership (MM). OTU tables for each module were exported independently and taken through PICRUSt analysis to identify pathway predictions per module. Resultant pathways were explored, looking at the number of functions specifically found in each module (KEGG level 3) and total abundance these modules carried.

Module      No. of OTUs   OTU with Highest MM     No. of Functions   Total Abundance of Functions
Green       141           Streptococcus_1586      251                3.31E+09
Turquoise   266           Fusobacterium_2562      251                1.78E+09
Yellow      155           Haemophilus_613         255                8.42E+09
Black       56            Mogibacteriaceae_1896   227                1.11E+08
Blue        232           Streptococcus_2836      260                2.72E+10
Brown       232           Prevotella_153          258                6.76E+09
Pink        21            Staphylococcus_1225     250                1.51E+07
Red         88            Rhizobiales_2076        261                5.95E+08


Differences in abundance of functions between all modules were explored at the three KEGG levels, which were all calculated using Kruskal Wallis test (P-value adjusted with Benjamini-Hochberg method).

Out of the seven KEGG level 1 system pathways, five showed a significant difference in abundance between modules (based on un-rarefied data), as shown in Figure 4.13 (P-adjusted value < 0.001).


[Figure 4.13: plot titled "KEGG 1 Pathway: Abundance by Module". The y-axis shows abundance (log10 scale) and the x-axis the KEGG level 1 pathways; five pathways are marked with * (Kruskal Wallis, P<0.001).]

Figure 4.13 Abundance of metabolic pathways when comparing the eight individual bacterial modules identified through network analysis. There were five pathways displaying significant differences at P < 0.001 when calculated using Kruskal Wallis test and adjusted using Benjamini-Hochberg (indicated by *). Abundance was drawn on a log10 scale.


Following up significant pathways at KEGG level 3, there were 142 functions that showed significant differences in abundance between modules: 126 of these functions belonged to the Blue module, 8 to the Yellow module, and 4 each to the Brown and Red modules. These data suggest that modules can be at least partially distinguished functionally.

The Blue and Yellow modules were seen to have the highest positive and highest negative correlations respectively with current-smoking phenotype (Figure 4.10). These modules also displayed the highest total abundance of functions (see Table 4.11) and contained the largest number of functions (at KEGG level 3) with significant differences in abundance, when compared across all modules. Therefore, these two modules were explored in more detail.

Overall, the two modules displayed similar functional signatures, sharing 254 functions (the Yellow module contained 255 functions in total and the Blue module 260) and differing mainly in proportions.

Metabolic Structure: KEGG1 (7) -> KEGG2 (37) -> KEGG3 (255/260)

Abundance between the two modules was compared using the Wilcoxon signed rank test 356 (Section 4.2.6). The P-values were adjusted using the Benjamini and Hochberg method (BH). Results indicated a significant difference in abundance in the Metabolism pathway (KEGG level 1) between Blue (Median [Mdn] = 4.3E+07) and Yellow (Mdn = 1.5E+07) modules; Z = -5.24, P < 0.001.
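The Z statistic reported for a Wilcoxon signed rank test comes from a normal approximation to the sum of signed ranks. The sketch below shows one common form of that calculation on hypothetical paired abundances. It is an illustration only: it omits the continuity correction and the tie correction to the variance, and the sign of Z depends on which module is subtracted from which, so it will not necessarily match the convention of the software used in the thesis.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, Z, two-sided P) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # average rank for ties
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided normal P-value
    return w_plus, z, p


# Hypothetical abundances of one pathway in the same six samples.
blue_module = [10, 12, 9, 15, 20, 18]
yellow_module = [8, 15, 5, 9, 13, 10]
w_plus, z_stat, p_value = wilcoxon_signed_rank(blue_module, yellow_module)
print(w_plus, z_stat, p_value)
```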

Further analysis was focused on the 12 functions making up the Metabolism pathway at KEGG level 2 (Figure 4.14). The Wilcoxon signed rank test was run again comparing abundances of the different functions between the Blue and Yellow modules. The four with significant differences (P < 0.01) are detailed in Table 4.12.


[Figure 4.14: plot titled "KEGG 2 Pathway: Blue and Yellow Modules". The y-axis shows abundance (log10 scale) and the x-axis the twelve KEGG level 2 Metabolism pathways; four pathways are marked with *.]

Figure 4.14 Comparison of the abundance of the twelve functions making up metabolic processes in the Blue and Yellow modules. Significant differences were calculated using Wilcoxon signed rank test and adjusted for multiple hypothesis testing using Benjamini-Hochberg; significant differences at P < 0.01 were labelled (*).


Table 4.12 Significant KEGG 2 level pathways: Blue vs. Yellow modules. Comparison of the abundance from the 12 significant functions that make up ‘metabolic processes’. Analysis was carried out using Wilcoxon signed rank test (P < 0.01) and adjusted using Benjamini-Hochberg.

KEGG 2 Pathway                    Median (BM)   Median (YM)   Z Statistic   P-value
Amino Acid Metabolism             2.13E+08      6.55E+07      -3.08         2.08E-03
Carbohydrate Metabolism           1.70E+08      4.70E+07      -4.44         8.82E-06
Energy Metabolism                 1.59E+08      5.71E+07      -3.27         1.09E-03
Metabolism of Other Amino Acids   4.55E+07      1.18E+07      -2.77         5.64E-03

BM = Blue Module. YM = Yellow Module

The four significant KEGG 2 level pathways were made up of 46 different KEGG level 3 functions. The next task was to see how these functions compared in abundance between the modules across all 578 samples. Overall, functions from the Blue module were more abundant than those found in the Yellow module, therefore when exploring significant KEGG level 3 functions data was rarefied so that the two modules were compared on the same level. Rarefaction level established earlier was too stringent [Section 4.7.3] when used to analyse just two modules and resulted in a large number of samples being removed (21 for Blue and 141 for Yellow). Therefore, a much lower level of 2,000 reads was used. This resulted in all samples being preserved in both modules.

A heatmap of the Blue and Yellow modules was drawn up, comparing the identified significant KEGG level 3 pathways across all 578 individuals in the study (Figure 4.15). The functions were colour coded according to the four identified significant KEGG level 2 pathways, with samples that were missing a particular function (NA) coloured white. There were a number of functions identified in Figure 4.15 that could indicate the presence of contaminant bacteria. Unfortunately, due to the nature of this analytical pipeline it was difficult to identify and exclude these prior to running PICRUSt. Consequently, further analysis focused on the functions seen to be significantly associated with current-smokers (Section 4.7.3), of which there were 16 functions recurring (Table 4.13).


Figure 4.15 Heatmap with 46 KEGG3 functions identified as significantly different between Blue and Yellow modules. Functions are colour coded according to KEGG 2 pathway they belong to (green is Metabolism of Other Amino Acids, red is Energy Metabolism, blue is Carbohydrate Metabolism and black is Amino Acid Metabolism). Both modules were rarefied to 2,000 counts; brighter tiles indicated greater abundance, samples missing particular functions (NA) were coloured white.

Table 4.13 Significant KEGG 3 level functions in relation to current-smoking individuals. KEGG3 functions that show an association with current-smoking and significant difference in abundance when comparing the Blue and Yellow modules. Associations were considered significant when tested with Kruskal Wallis (P < 0.001) or GLM (P < 0.001) (all P-values were adjusted using the Benjamini-Hochberg approach).

KEGG3 Function                                KEGG2 Pathway                     Kruskal Wallis P-value   GLM Z-value   GLM P-value
Starch and sucrose metabolism                 Carbohydrate Metabolism           7.54E-12                 5.98          2.26E-09
Glycolysis / Gluconeogenesis                  Carbohydrate Metabolism           2.15E-09                 5.79          6.84E-09
Galactose metabolism                          Carbohydrate Metabolism           4.82E-10                 5.72          1.05E-08
Ascorbate and aldarate metabolism             Carbohydrate Metabolism           8.56E-10                 5.41          6.24E-08
Amino sugar and nucleotide sugar metabolism   Carbohydrate Metabolism           1.50E-10                 5.22          1.79E-07
D-Alanine metabolism                          Metabolism of Other Amino Acids   2.56E-07                 4.66          3.12E-06
Fructose and mannose metabolism               Carbohydrate Metabolism           8.06E-08                 4.34          1.41E-05
Pentose and glucuronate interconversions      Carbohydrate Metabolism           7.00E-06                 4.13          3.69E-05
Pentose phosphate pathway                     Carbohydrate Metabolism           9.25E-06                 4.05          5.15E-05
Taurine and hypotaurine metabolism            Metabolism of Other Amino Acids   9.00E-04                 3.32          9.01E-04
Phosphonate and phosphinate metabolism        Metabolism of Other Amino Acids   5.74E-04                 3.32          9.05E-04
Carbon fixation pathways in prokaryotes       Energy Metabolism                 3.64E-04                 -3.3          9.80E-04
Lysine degradation                            Amino Acid Metabolism             4.56E-08                 -3.53         4.09E-04
beta-Alanine metabolism                       Metabolism of Other Amino Acids   5.26E-06                 -3.69         2.21E-04
Citrate cycle (TCA cycle)                     Carbohydrate Metabolism           1.12E-06                 -4.41         1.06E-05
Oxidative phosphorylation                     Energy Metabolism                 1.45E-09                 -5.84         5.18E-09


4.8 Discussion

The aims of this chapter were to:

1. Define bacterial networks in the airway through application of WCNA on the complete dataset.
2. Identify modules of co-abundant bacteria.
3. Quantify their relationship with clinical traits or lifestyle habits (smoking).
4. Use module membership to define key ‘hub’ bacteria.
5. Extrapolate metagenomic functions on the complete dataset and any key modules associated with clinical traits or lifestyle habits (smoking).

The principal finding of this chapter was the identification of robust bacterial co-occurrence modules that exhibited a significant relationship with exposure to cigarette smoke. Few studies have examined the co-occurrence of bacteria, and those that have been conducted focused on categorizing individuals into enterotypes according to gut microbiota 368-371. Analysis of the Busselton Health Study cohort is unique not only because it characterizes the respiratory microbiome, but also because the large number of participants gives the study enough power to carry out this kind of network analysis.

WCNA was originally developed for the examination of gene expression data. Consequently, a number of adaptations to the analytical workflow were required to allow analysis of bacterial data. These included data transformation (Figure 4.2), the use of unrarefied data and the use of Spearman’s correlation. The adaptations allowed interactions between individual OTUs to be identified and modules of OTUs to be related to clinical variables.
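As a rough illustration of the Spearman-based adaptation, the sketch below ranks two OTU abundance profiles, correlates the ranks, and raises the absolute correlation to a soft-threshold power to give a network adjacency. The OTU counts, the function names and the power beta = 6 are illustrative assumptions and are not taken from this thesis; soft-thresholding is the usual weighted-network convention rather than a documented setting of this analysis.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def adjacency(x, y, beta=6):
    """Unsigned soft-threshold adjacency |rho|^beta (beta is an assumption)."""
    return abs(spearman(x, y)) ** beta


# Hypothetical read counts for three OTUs across five samples.
otu_a = [0, 5, 12, 40, 3]
otu_b = [1, 9, 20, 35, 2]
otu_c = [50, 20, 9, 1, 30]

print(spearman(otu_a, otu_b), spearman(otu_a, otu_c))
print(adjacency(otu_a, otu_c))
```

Because the correlation is computed on ranks, it is insensitive to the skewed, zero-heavy count distributions typical of OTU tables, which is one common reason for preferring Spearman's over Pearson's correlation in this setting.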

Bacterial networks

Overall, it was noted that individuals within the population were very similar to one another (Figure 4.3). Although some individuals suffered from particular diseases, these did not create any clustering bias across the data and were therefore not significantly distinct from the rest of the population.


When looking at the whole dataset, a number of interactions were seen between modules and clinical variables. These, however, were mostly weak connections and disappeared after correction for false discovery rate. Reflecting back to Chapter 3, Section 3.3.2.3, similar clinical variables were shown to cause significant variation in the population (beta diversity) when tested with the Adonis function followed by a multivariate model. Traits that indicated a significant effect on variance in the population (Chapter 3) were pack years, bacterial burden, season, sex and smoking status, some of which also had a significant association with bacterial networks. Obtaining these parallel outcomes from different analytical techniques lends validity to the results.
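The false discovery rate correction referred to here is the Benjamini-Hochberg approach named in the caption of Table 4.13. A minimal sketch of that adjustment is given below; the function name and the example P-values are hypothetical and are not results from this thesis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for four module-trait associations.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

A weak association with a raw P-value just under a nominal threshold can move above it once the number of tests is taken into account, which is the behaviour described for the weaker module-trait connections.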

It was interesting to note a strong negative correlation between members from the Prevotellaceae family (Midnight Blue module) and absolute monocyte value (Table 4.2). There is limited knowledge on how airway bacterial communities are affected by changes in white cell counts. Previously, Jangi et al. (2016) saw changes to gut microbiota following treatment of multiple sclerosis 372. The study noted low abundance of Prevotella in untreated patients, with a significant increase following treatment.

Smoking habit displayed the strongest effect on bacterial interactions (Table 4.2) showing association with a number of modules. Further analysis was carried out only on those with the strongest associations and those that displayed significance in cumulative smoking as well as presence of a current-smoking lifestyle.

Separately, data was also re-analyzed after separating participants according to their smoking status, thereby taking the strong smoking influence out of the equation. Interactions were seen with the different respiratory conditions tested and with other clinical assessments. Unfortunately, splitting the data led to substantial drops in numbers for each phenotype, meaning that although interactions between bacterial modules and clinical variables were identified, there was not enough power for these to withstand correction for false discovery. They were therefore not considered for further assessment. Nevertheless, the analysis did show that it is possible to dissect respiratory conditions further, and it would be highly interesting to examine which bacterial cabals are associated with respiratory conditions in a larger dataset.


Current-smoking showed a significant positive correlation with the Pink module (Silva database, SV). The hub bacteria found in this module were significantly increased in current-smokers and decreased in never-smoking individuals (Figure 4.7). The Pink module was an average-sized module with 109 OTUs; however, it made up only 0.2% of the total bacterial abundance (Table 4.5). Regardless of module size, the bacterial hubs selected as representative of this module had already shown an association with smoking in the analyses carried out in Chapter 3, Section 3.3.4.2 (B and C). Both Indicator Species Analysis (ISA) and Differential Expression Analysis for Sequence Count Data (DESeq2) identified Bifidobacterium_13861 as significantly associated with current-smoking status, and ISA also identified Bifidobacteriaceae_13837. Other major members of the module belonged to the genera Lactobacillus and Streptococcus. Furthermore, the relative abundance of these bacteria was found to increase significantly with cumulative smoking.

The Yellow and Red modules (SV) showed the strongest negative correlations between their module members and the current-smoking variable. The SV Red module made up 5% of total bacterial abundance and was almost exclusively composed of members of the Haemophilus and Neisseria genera (Table 4.6). The Yellow module (SV) was the fourth largest of the networks, making up almost 9% of the total bacterial abundance. It comprised a phylogenetically mixed group of organisms, including a number of members of the Fusobacterium, Leptotrichia, Prevotella and Stomatobaculum genera, amongst others (Table 4.7). Reflecting back on results from the DESeq2 analysis (Chapter 3, Section 3.3.4.2C), in which 99 OTUs showed a significant decrease in abundance following smoke exposure, 71 of these OTUs were identified across the Yellow and Red modules. This indicates that the depletion is a robust feature of the bacterial community, which can be detected regardless of the method used. In contrast to the Pink module (SV), members of the Red and Yellow modules (SV) decreased in relative abundance as the number of pack years increased. A number of these bacteria have also been noted in previous studies that examined the effects of smoking on the upper respiratory tract 112,120,327.


Overall, it seems that current exposure to smoke has a highly significant effect on the airway microbiome and that, regardless of the damage done, the community regenerates when given the opportunity. The literature has little to add in terms of detailed characteristics of the microbial community of smoke-exposed airways, but the same bacterial candidates have come up across a number of different assessments applied in this thesis.

Metabolic and functional inference

Functional analysis of the data was completed using PICRUSt; however, an initial hurdle in relation to data formatting had to be addressed first. The OTU table had originally been picked using an open reference approach against the Silva (SV) database (Chapter 3), which ruled out simpler conversion methods. Tax4fun 373, an R package designed to predict the functional capabilities of 16S rRNA gene data picked against the Silva database, was attempted. However, because the original data had been picked using an open reference approach, this analysis failed (data not shown). The data therefore had to be reanalyzed using a closed reference approach, with OTUs picked against the Greengenes (GG) database.

Despite this, WCNA of the reprocessed data displayed similar results, thereby validating the earlier finding (WCNA of OTUs picked against the Silva database) that smoking habit has a significant impact on the airway microbiome. Although Greengenes is a smaller reference database than Silva, the repeated analysis reflected the results seen thus far. The Blue module (GG), which was significantly increased in smoking individuals, contained members predominantly from the genera Bifidobacterium, Lactobacillus, Streptococcus and Veillonella (Table 4.9). The Yellow module (GG), on the other hand, predominantly contained members of the Haemophilus and Neisseria genera and was significantly decreased in current-smoking individuals (Table 4.10).

After dealing with the initial difficulty, data was processed through PICRUSt and the functional capabilities of the bacteria were assessed. Functionally, BHS participants were highly similar to one another, as shown by the low mean dissimilarity distance between samples when surveying beta diversity using the Bray Curtis dissimilarity matrix (Chapter 3, Section 3.3.2.3). Additionally, it should be noted that 79% of functions were present across all samples, providing evidence that the BHS general population contained predominantly healthy participants, with no particular clinical variable significantly pulling the populace apart.
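For reference, the Bray Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples, and a low mean over all sample pairs is what indicates a functionally homogeneous population. The sketch below shows the calculation on hypothetical function abundances; the values and function names are illustrative assumptions, not data from this study.

```python
from itertools import combinations


def bray_curtis(sample_a, sample_b):
    """Bray Curtis dissimilarity between two abundance vectors (0 to 1)."""
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    return numerator / denominator if denominator else 0.0


def mean_pairwise_dissimilarity(samples):
    """Mean Bray Curtis dissimilarity over all unordered sample pairs."""
    pairs = list(combinations(samples, 2))
    return sum(bray_curtis(a, b) for a, b in pairs) / len(pairs)


# Hypothetical abundances of three predicted functions in three samples.
samples = [
    [10, 0, 5],
    [5, 5, 5],
    [10, 0, 5],
]
print(bray_curtis(samples[0], samples[1]))
print(mean_pairwise_dissimilarity(samples))
```

A value of 0 means two samples have identical profiles and a value of 1 means they share nothing, so a mean close to 0 corresponds to the high similarity described above.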

Focus was placed on smoking, which had a significant association (P < 0.001) with a large number of predicted functions (mainly those in the Metabolism pathway); this was not observed for never-smoking individuals. This again suggests that smoking habit has an important effect not only on the bacterial composition of the airways, but also on their functional capabilities.

Current-smoking status showed a strong negative association with a number of neurodegenerative diseases, such as Alzheimer’s disease, Huntington’s disease and Parkinson’s disease (Section 4.7.3, Appendix 4.3). Although this relationship has always been debatable, many different epidemiological studies have shown a similar negative association between smoking and the aforementioned conditions 374. In the same way, active smoking has previously shown a positive association with numerous infectious diseases 375. Metagenomic results indicated similar outcomes, with smoking individuals displaying a strong positive association with Staphylococcus aureus infection and tuberculosis.

The data was then explored through the confines of individual modules identified during the WCNA analysis (Greengenes database, GG). Overall, the number of OTUs going into PICRUSt analysis did not have a major effect on the number of functions present for the modules. The difference between modules came from the function abundance, and how these compared across the groups. This could suggest a more collective effect of bacteria, where it is not about the richness of bacterial species going in but more the size of the bacterial networks. When comparing eight different modules, there were a number of different functions that displayed significant differences in abundance.

Overall, the Blue and Yellow modules (GG) not only had the highest total functional abundances but also the highest number of significant functions. This, together with their significant correlation with smoking status (as revealed through WCNA), made them strong candidates for further investigation.


The two modules shared a large number of pathways between them, differing mainly in abundance. The most significant difference was seen in the Metabolism pathway (P < 0.001), specifically those responsible for amino acid metabolism, carbohydrate metabolism, energy metabolism and metabolism of other amino acids. When assessing abundance of these specific pathways across all samples in a heat-map format there were a couple of functions that indicated visually dramatic differences. Overall, it reflects metagenomic differences that are present between members of Streptococcus, Haemophilus and Neisseria genera.

Evidence has suggested that Haemophilus spp. and Neisseria spp. share a recent common ancestor. Furthermore, there have been suggestions of possible horizontal gene exchange between these distantly related Proteobacteria 376. This offers some insight into why these bacteria may work together in a tight co-occurrence network.

Overall, the Yellow module (GG) displayed lower abundance for most of the predicted functions, reflecting differences in both structure and metabolic capability between Haemophilus spp. and Neisseria spp. on the one hand and streptococci on the other. Members of the Yellow module (OTUs) displayed little to no capacity for D-Arginine and D-ornithine metabolism, which is a consequence of Haemophilus spp. lacking five enzymes involved in the initial steps of arginine biosynthesis 377. Lower abundance was also seen in pentose and glucuronate interconversions (metabolic reactions involved in the conversion of monosaccharide pentose and glucuronate, salts or esters of glucuronic acid), as well as in ascorbate (Vitamin C) and aldarate (sugar acids) metabolism. The latter pathway was significantly correlated with current-smoking and had increased abundance in the streptococci-dominant Blue module. One possible reason is the known presence of the ula operon in S. pneumoniae, which enables ascorbic acid to be used as a carbon source for energy production 378.

The Blue module (GG) also displayed a number of functions of lower abundance, in particular inositol phosphate metabolism, which is part of a crucial pathway responsible for cellular functions (including cell growth, apoptosis, cell migration and differentiation).


All functions found in Figure 4.14 showed highly significant differences (P < 0.001) in abundance between the Yellow and Blue modules (GG). It was therefore interesting to see a number of these functions also being significantly associated with the smoking phenotype (Table 4.13). It should be noted that functions that had a positive association with current-smoking were seen to have greater abundance in the Blue module (GG), and the opposite was true with negatively associated functions exhibiting higher abundance in the Yellow module (GG).

Oxidative phosphorylation displayed a strong negative association with current-smoking and its abundance was significantly increased in the Yellow module. Studies have seen H. influenzae evolve a number of mechanisms to protect itself from oxygen-generated stress both from host defenses and co-pathogens 379. In fact, it has been shown that S. pneumoniae are able to inhibit H. influenzae growth (in culture) through production of supernatants, in particular hydrogen peroxide 380. Consequently, it could be hypothesised that whilst streptococci can cause oxidative stress to surrounding bacteria it is not adapted to withstand the environmental stress caused by smoking.

Another pathway exhibiting a strong negative association with current-smoking and an increased abundance in the Yellow module (GG) was the tricarboxylic acid (TCA) cycle (Krebs cycle), a key metabolic pathway in aerobic organisms responsible for energy production. Although there are few known variations of the TCA cycle, some studies have seen evidence of alternatives, such as the replacement of an enzymatic step. Typically, conversion of succinyl-CoA to succinate is achieved using succinyl-CoA synthetase (SCS); however, a closer look at the Neisseriaceae genome has shown repeated transition from SCS to the alternative acetate:succinate CoA-transferase (ASCT), which is thought to be an adaptation to a number of missing SCS subunits 381.

Overall, it seems that bacteria associated with smoking are better adapted to metabolizing carbohydrates, which are important biological molecules in many metabolic pathways. Members of the Streptococcus genus are highly resilient to environmental stress, having adapted to be strongly acidogenic and acid-tolerant, as well as being able to withstand fluctuations in the nutrient pool, giving them an advantage when competing with other bacteria 382.


In order to overcome nutrient starvation, survive the harmful effects of glycolytic intermediates and maintain a suitable NAD+/NADH balance, bacteria like S. mutans have developed an advanced regulatory network. To achieve an optimal flow of carbohydrates, a combination of transcriptional regulation and allosteric modulation of enzyme activities has evolved 383. Furthermore, a recent study by He et al. (2017) saw enhanced sugar metabolism and increased levels of formate when S. mutans co-occurred with Candida albicans 384, a known opportunistic pathogenic fungus commonly found in the upper respiratory tract. The study found up-regulation of a number of different genes required for an assortment of pathways responsible for carbohydrate metabolism. Considering that smoking is associated with the Streptococcus genus and with infectious disease, it is likely that there are other pathogenic organisms that co-occur with, and in this way enhance, this bacterium.

Overall, results from these analyses illustrate the presence of bacterial networks that have a close relationship to the host. These networks are seen to be significantly influenced by current-smoke exposure, where bacterial families like Streptococcaceae, Bifidobacteriaceae and Lactobacillaceae show better adaptation to external pressures and are significantly increased in current-smoking individuals. On the other hand, there is a significant depletion of bacteria from Pasteurellaceae (Haemophilus genus) and Neisseriaceae families following smoke exposure. These results are consistent when repeated using data reconstructed through two different processing methods and two different bacterial databases, further validating these outcomes.

CHAPTER 5: STREPTOCOCCUS spp. ANALYSIS

Chapter 5 Streptococcus spp. Analysis

5.1 Introduction

The term ‘streptococcus’ was first used in 1874 by an Austrian surgeon, Theodor Billroth, whilst working with skin and wound infections 385,386. Over the years, the ways of identifying and validating the lineage of these Gram positive, chain-forming cocci have evolved substantially.

The bacterium has both human and animal origins. It is classified in the phylum Firmicutes, class Bacilli, order Lactobacillales and is a member of the Streptococcaceae family. Formerly, the Streptococcaceae family was subdivided into three genera: Streptococcus, Enterococcus and Lactobacilli, all displaying similar physiological properties, which created difficulties in differentiating between them.

Initial taxonomy of ‘streptococcus’ was based on culture-dependent techniques. Bacteria were classified according to the three described haemolytic patterns that were presented by strains when cultured on blood agar plates 387. The first was ‘Alpha’ (α) haemolysis, encompassing partial breakdown of the red blood cells, giving a green discolouration on the blood agar plate.

‘Beta’ (β) haemolysis causes complete rupture of the red blood cells in the media, creating wide areas clear of blood cells surrounding the bacterial colonies. This group was further subdivided according to serotype classification. Rebecca Lancefield and her team showed that cell surface antigens could be used to distinguish between the different strains, to which they then assigned the letters A through to X 388,389. Group A strains were isolated from human diseases and included S. pyogenes, which causes scarlet fever and pharyngitis 390. Group B strains were isolated from bovine and dairy sources but also included S. agalactiae, which was shown to cause pneumonia, meningitis and pregnancy complications 391. Strains further down the list were isolated from a variety of different animals.

The final haemolytic pattern described was ‘Gamma’ (γ) haemolysis, showing no change in the medium and characteristic of Enterococcus and S. faecalis organisms.


Over the next 50 years, the Lancefield grouping became widely accepted and, alongside biochemical and physiological characteristics, was used for separating the Streptococcus genus into four principal divisions: ‘pyogenes’, ‘viridans’, ‘lactic’ and ‘enterococci’ 392.

Unfortunately, some organisms could possess common group antigens and yet be completely unrelated, meaning new approaches had to be developed to aid further classification. This led studies (around the mid-1980s) to look at cell wall composition and metabolic processes, identification of DNA base composition, DNA–DNA hybridization and DNA transformation. The results brought reclassification of the Enterococci division into a new family, Enterococcaceae, and genus, Enterococcus 393, while the third division, the lactic streptococci, was reclassified to the family Lactobacillaceae and the genus Lactococcus 394.

The next major revision to the Streptococcus genus came with the introduction of DNA-based approaches. These approaches focused on the identification and detailed understanding of bacteria at the strain level. Many pathogenic species, such as N. meningitidis, S. pneumoniae and S. aureus, can act as causative agents of invasive disease; however, they more commonly colonize the host asymptomatically. Consequently, it became important to study the structure and function of serotypes, the relationships between different bacterial isolates and how these related to diseases in local and global epidemics.

5.1.1 Molecular Typing

Over the next three decades many molecular typing techniques were developed, with three of the most promising approaches providing the groundwork for bacterial classification. These included Multi Locus Enzyme Electrophoresis (MLEE), Multi Locus Sequence Typing (MLST) and Multi Locus Sequence Analysis (MLSA), details of which can be found in Table 5.1.


Table 5.1 Summary of molecular typing techniques. A number of key approaches used over the years for bacterial classification.

| | Multi Locus Enzyme Electrophoresis (MLEE) | Multi Locus Sequence Typing (MLST) | Multi Locus Sequence Analysis (MLSA) |
| --- | --- | --- | --- |
| Description | Electrophoretic mobility is related to different alleles at the gene locus for enzymes. A unique profile of electromorphs (mobility variants) is created for each strain of organisms 395-398 | Nucleotide sequencing data of internal fragments from several housekeeping genes. Allele sequences at each locus and sequence type are assigned numbers and used for analysis. Usually applied to strains belonging to well-defined species | Multiple housekeeping gene fragments are sequenced and concatenated. Actual DNA sequences are used for analysis. Used to improve species descriptions, especially when species boundaries are not well known |
| Advantages | Identified clusters of closely related strains (clones or clonal complexes) that are particularly liable to cause disease | Useful resource in science, public health, veterinary communities and food industry 399 | Increased discrimination power during cluster analysis 397. Efficiently resolves phylogenetic relationships at genera and species levels |
| Disadvantages | Limited capability to cross-compare studies across different laboratories. Indistinguishable electrophoretic mobility variants could be encoded by different nucleotide sequences | Discriminatory levels depend on many factors (number of genes used, length of sequenced gene fragments, degree of sample diversity). Some bacterial organisms require additional molecular typing methods to attain a higher resolution 397,400 | No clear guidelines on number of genes required for sequencing, gene size or primer design. Studies thus far look at each taxon individually |
| Noted Organisms | Neisseria meningitidis 401 | Publicly available MLST scheme: http://pubmlst.org contains information on nearly 100 different bacterial species, eukaryote and plasmid databases 400,402,403 | Burkholderia spp. 404; Vibrio Genus 405; Mycobacterium Genus 406 |


MLSA of Streptococcus Genus

Studies have attempted to identify the taxonomic lineage of streptococci using MLST 407-409, but there are still many taxonomic confusions within this group.

Bishop et al. attempted to construct phylogenetic trees from concatenated sequences of seven housekeeping gene fragments. The gene fragments selected, map-pfl-ppaC-pyk-rpoB-sodA-tuf (~3,063 bp concatenated sequence), were obtained from 420 strains of all species within the viridans group of streptococci 410. The study was able to identify independent sequencing clusters, for example separating S. mitis and S. pseudopneumoniae, two very closely related species, as well as mapping out new ‘unknown’ species.

This work resulted in the development of an online electronic taxonomy tool (http://www.eMLSA.net, similar to the MLST scheme), the aim of which was to allow users to identify species simply by entering gene fragments. Unfortunately, the website was not kept up to date and is no longer active, so the software is not available.

An adaptation of this work came from Paul Cardenas, whose doctoral thesis investigated the development of a next generation sequencing technique capable of identifying streptococci at species level from mixed microbial DNA samples 411. Eight different genes were selected: guaA, map, pfl, ppaC, pyk, rpoB, sodA and tuf, with nine sets of gene specific primers being designed.

Following experimental validation and optimisation, only three genes (map, tuf and pfl) were taken through to sequencing on the Roche 454 Junior machine; these had the most sensitive primers, picking up small DNA concentrations during amplification. After a test-sequencing run, the map gene displayed superior results compared to tuf and pfl for strain identification and correct establishment of bacterial abundances.

Overall, this scheme does not discriminate between the different bacterial species as well as MLSA; however, sequencing of the methionine aminopeptidase (map) gene from microbial DNA was successful in 76 oropharyngeal samples, with assignment of different Streptococcus OTUs at the strain level 411. This is the only study to date that has used this approach (next generation sequencing of amplicons of the map gene alone) for differentiating Streptococcus OTUs at the strain level.

5.1.2 Species Classification

Currently around 70 different species are recognised within the genus Streptococcus, from a variety of animal and human sources (Table 5.2) 412.

Table 5.2 Current classification of streptococci specifically found in humans.

| Group | Species |
| --- | --- |
| I, Pyogenic cocci group | S. pyogenes; S. hongkongensis; S. agalactiae; S. iniae; S. dysgalactiae subsp. dysgalactiae; S. pseudoporcinus; S. dysgalactiae subsp. equisimilis; S. urinalis; S. equi subsp. equi |
| II, Mitis-Sanguinis group | S. australis; S. oralis; S. cristatus; S. parasanguinis; S. dentisani; S. peroris; S. gordonii; S. pneumoniae; S. infantis; S. pseudopneumoniae; S. lactarius; S. rubneri; S. massiliensis; S. sanguinis; S. mitis; S. sinensis; S. oligofermentans; S. tigurinus |
| III, Mutans group | S. mutans; S. sobrinus |
| IV, Salivarius group | S. salivarius; S. vestibularis |
| V, Anginosus group | S. constellatus subsp. pharyngis; S. anginosus subsp. anginosus; S. constellatus subsp. viborgensis; S. anginosus subsp. whileyi; S. constellatus subsp. constellatus; S. intermedius |
| VI, Bovis group | S. gallolyticus subsp. gallolyticus; S. gallolyticus subsp. pasteurianus; S. infantarius subsp. infantarius |
| VII, Miscellaneous streptococci | S. acidominimus; S. gallinaceus; S. suis |

Source: An adapted table from Janda et al. 2014 412


A number of species are found as part of the healthy oral and upper respiratory tract microflora: S. constellatus, S. intermedius, S. mitis, S. pyogenes and S. viridans. Although the majority of these species are quiescent, they are occasionally associated with disease; for example, S. constellatus can be involved in pulmonary exacerbations in cystic fibrosis 312, S. intermedius has been isolated in periodontal cases 413, S. mitis can cause infective endocarditis 414 and S. pyogenes can cause sepsis, amongst other symptoms 415.

Since 16S rRNA gene sequencing does not enable members of the Streptococcus genus to be identified at the species level, and in light of the findings of Chapter 3 in relation to overgrowth of the Streptococcus genus in smokers, the aims of this chapter were to:

1. Establish a robust methodology for carrying out streptococci specific quantitative PCR and next generation sequencing of human samples using the Illumina platform.
2. Establish a reference guide for identification of streptococci species.
3. Investigate further the Busselton Health Study findings in relation to smoking and overgrowth of the Streptococcus genus (Chapter 3) by differentiating the Streptococcus OTUs at species level.


5.2 Methods

5.2.1 Experimental Design

5.2.1.1 Sample Selection

Samples were selected for Streptococcus specific analysis based on results from the 16S rRNA gene sequencing. To reduce sequencing costs, samples were inspected and only those that presented a high abundance of streptococci were sub-selected. In total, 483 samples were taken through for further analysis using Streptococcus specific sequencing, of which 234 were from never-smokers, 196 from ex-smokers and 53 from current-smokers.

5.2.1.2 Primer Design

Based on research published by Bishop et al. 410 and the doctoral thesis work of P. Cardenas 411 (refer to Section 5.1.1), the map gene was selected for use in the NGS approach for species identification of the Streptococcus genus (Figure 5.1). Two sets of primers were designed, one for quantification using qPCR and another for Illumina MiSeq sequencing.

(A) Quantitative PCR Primers

Amplification primers for part of the methionine aminopeptidase (map) gene, which allow accurate quantitation of only members of the Streptococcus genus in the extracted DNA sample, were taken from Bishop et al. 410. The primers were 23 and 24 base pairs in length (forward and reverse respectively), generating an amplicon of 348 base pairs (Table 5.3).

Table 5.3 Streptococcus-specific, map gene primer sequences

| Name | Sequence | Source |
| --- | --- | --- |
| map-up | 5' GCWGACTCWTGTTGGGCWTATGC 3' | Eurofins, Germany |
| map-down | 5' TTARTAAGTTCYTTCTTCDCCTTG 3' | Eurofins, Germany |
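Both primers in Table 5.3 contain degenerate positions written in IUPAC nucleotide codes (W, R, Y and D), so each primer is in practice a pool of related oligonucleotides. The sketch below expands the two sequences from the table into their individual variants; the function and variable names are illustrative and not part of the original methods.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """List every unambiguous sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer)
    return ["".join(combo) for combo in product(*choices)]


# Primer sequences as given in Table 5.3.
MAP_UP = "GCWGACTCWTGTTGGGCWTATGC"
MAP_DOWN = "TTARTAAGTTCYTTCTTCDCCTTG"

up_variants = expand_degenerate(MAP_UP)
down_variants = expand_degenerate(MAP_DOWN)
print(len(MAP_UP), len(up_variants))      # primer length, number of variants
print(len(MAP_DOWN), len(down_variants))
```

The forward primer has three W positions (2 x 2 x 2 = 8 variants) and the reverse primer has one R, one Y and one D position (2 x 2 x 3 = 12 variants), which is how a single primer pair can tolerate sequence variation across Streptococcus species.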


[Figure 5.1 image: phylogenetic tree of the map gene. Tip labels: S. pneumoniae, Uncertain, S. pseudopneumoniae, S. mitis, Unknown A, IgA1 protease negative S. oralis, S. oligofermentans, “S. mitis biovar 2”, S. oralis, S. infantis, S. peroris, S. australis, S. parasanguinis, S. cristatus, S. sinensis, S. sanguinis, S. gordonii, S. intermedius, S. constellatus, S. anginosus, S. vestibularis, S. thermophilus, S. salivarius, Unknown B, S. agalactiae, S. pyogenes. Scale bar: 0.05.]

Figure 5.1 Neighbour joining non-rooted phylogenetic tree of map gene. The map gene provides enough information to distinguish between the closely related species: S. pneumoniae, S. pseudopneumoniae and S. mitis. Source: Image taken from the study of Bishop et al. (2009) 410. Copyright: © Bishop et al; licensee BioMed Central Ltd. 2009 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


(B) Quadruplicate PCR Primers

Barcoded primers targeting the map gene were used to generate gene amplicons (see Chapter 2, Section 2.3.2.1). The map-up and map-down primers were designed and ordered through IDT (Integrated DNA Technologies, IA, USA). These adapted primers included 8 unique forward barcodes and 12 unique reverse barcodes, in addition to the pad and link sequences required for sequencing.

Consequently, for the PCRs for the sequencing library the complete amplicon fragment size was 348 base pairs (Table 5.4).

Table 5.4 Details of the barcoded sequencing primers. Adaptors, Index Reads, Pad and Link modification for the map gene for quadruplicate PCR.

| Component | Forward | Reverse |
|---|---|---|
| Adaptor | AATGATACGGCGACCACCGAGATCTACAC | CAAGCAGAAGACGGCATACGAGAT |
| Index Reads | CTCTCTAT, AAGGAGTA, TATCCTCT, CTAAGCCT, GTAAGGAG, CGTCTAAT, ACTGCATA, TCTCTCCG | TCGCCTTA, GTAGAGAG, CTAGTACG, CAGCCTCG, TTCTGCCT, TGCCTCTT, GCTCAGGA, TCCTCTAC, AGGAGTCC, TCATGAGC, CATGCCTA, CCTGAGAT |
| Pad | ACTGTGCTCG | CGCGCCGCGC |
| Link | TTA | GTG |
| Primer | 5' GCWGACTCWTGTTGGGCWTATGC 3' | 5' TTARTAAGTTCYTTCTTCDCCTTG 3' |


(C) Illumina MiSeq Sequencing Primers

The following primers were added directly to the sequencing cartridge and used to prime the sequence reads (Table 5.5). After sequencing of the forward read, the index read primer was used to sequence an index tag, allowing the synthesis of the reverse strand through bridge amplification (Chapter 2, Section 2.3.3).

Table 5.5 Primers for map gene amplicon sequencing.

| Primer | Sequence |
|---|---|
| Multiplexing Read 1 Sequencing Primer | 5' ACTGTGCTCGTTAGCWGACTCWTGTTGGGCWTATGC 3' |
| Multiplexing Read 2 Sequencing Primer | 5' CAAGGHGAAGAARGAACTTAYTAACACGCGCGGCGCG 3' |
| Multiplexing Index Read Sequencing Primer | 5' CGCGCCGCGCGTGTTARTAAGTTCYTTCTTCDCCTTG 3' |
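The relationship between the sequencing primers in Table 5.5 and the pad, link and primer sequences in Table 5.4 can be checked with a short sketch. The Read 1 and Index Read primers are the concatenation pad + link + primer, and the Read 2 primer is the reverse complement of the Index Read primer. The function and variable names below are illustrative only and are not taken from the thesis workflow.

```python
# IUPAC complements, including the degenerate codes (W, R, Y, D, H) present in the map primers.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "W": "W", "S": "S",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# Sequences as listed in Table 5.4.
FWD_PAD, FWD_LINK, FWD_PRIMER = "ACTGTGCTCG", "TTA", "GCWGACTCWTGTTGGGCWTATGC"
REV_PAD, REV_LINK, REV_PRIMER = "CGCGCCGCGC", "GTG", "TTARTAAGTTCYTTCTTCDCCTTG"

# Sequencing primers, named as in Table 5.5.
read1_primer = FWD_PAD + FWD_LINK + FWD_PRIMER
index_primer = REV_PAD + REV_LINK + REV_PRIMER
read2_primer = reverse_complement(index_primer)

print(read1_primer)
print(read2_primer)
print(index_primer)
```

The three printed sequences reproduce the entries of Table 5.5.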

5.2.2 Streptococcus Specific Sequencing

5.2.2.1 Quantitative PCR

To establish and optimise the map quantitative PCR, DNA isolated from S. mitis Strain DSMZ-12643 was used as the test bacterial strain (DNA of the bacterium was obtained from Deutsche Sammlung von Mikroorganismen und Zellkulturen [DSMZ], Braunschweig, Germany; refer to Section 5.2.2.2 for further details on bacterial preparation). This confirmed that the primers amplified a product of the correct size and sequence. Subsequently, a serial dilution of S. mitis was used to generate the standard curve (1 x 10^4 to 1 x 10^8 copies/µl).

Initially, PCR conditions were analysed using a gradient PCR (12 °C running across the plate). Results showed 55 °C as the optimal annealing temperature. Therefore, thermal cycling conditions were set to 90 °C for 3 minutes, followed by 40 cycles of 20 seconds at 95 °C, 30 seconds at 55 °C and 30 seconds at 72 °C.

In addition to the annealing temperature, the optimal primer concentration for amplification needed to be established, and again this was achieved by running a PCR with a gradient of different primer concentrations (0.1 to 0.8 µM). Results indicated that 0.2 µM was optimal due to a low presence of primer dimer, equal serial dilution, adequate melt curves and clear distinction between the lowest sample dilution (10^3 copies per µl) and the non-template control (molecular grade PCR water). The final volume of the qPCR reaction was 15 µl which included: 7.5 µl of SYBR Fast qPCR Kit Master Mix (KAPA BioSystems, Roche, Basel, Switzerland), 0.3 µl of each primer at 10 µM (0.2 µM final concentration), 1.5 µl of molecular grade PCR water and 5 µl of the bacterial DNA, standard or water. Reactions were done in triplicate on the ViiA™ 7 Real-Time PCR System.
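The final primer concentration follows from the dilution relation C1 x V1 = C2 x V2. The sketch below applies it to the stock concentration and volumes stated above; the helper name is hypothetical. A 0.3 µl aliquot of a 10 µM stock in a 15 µl reaction corresponds to 0.2 µM (200 nM).

```python
def final_concentration_uM(stock_uM, added_ul, total_ul):
    """Solve C1 * V1 = C2 * V2 for the final concentration C2 (in µM)."""
    return stock_uM * added_ul / total_ul


# Values from the qPCR reaction set-up: 0.3 µl of 10 µM primer in a 15 µl reaction.
primer_final_uM = final_concentration_uM(stock_uM=10.0, added_ul=0.3, total_ul=15.0)
primer_final_nM = primer_final_uM * 1000.0

print(f"{primer_final_uM:.2f} µM = {primer_final_nM:.0f} nM")
```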

5.2.2.2 Mock Community: Candidate Bacterial Species

Nine different strains of Streptococcus were ordered from DSMZ and bacterial identity was confirmed through Sanger sequencing (Wellcome Trust Sanger Centre). Each organism was processed in a similar way to that described in Chapter 2, Section 2.3.1.4, for the preparation of Vibrio natriegens Strain DSMZ-759.

Following the validation of bacterial identity, stocks were quantified using PicoGreen (Chapter 2, Section 2.3.2.4) and diluted to a working concentration of 2 x 10^7 copies/µl. Bacterial concentration was also confirmed using qPCR (Section 5.2.2.1 above), where the standard curve was generated by a serial dilution of S. mitis (1 x 10^4 to 1 x 10^8 copies/µl). Next, the working concentration stocks were pooled in equimolar amounts (1:1 ratio). The nine strains forming the mock community were all Gram positive, non-motile and non-spore forming. The strains were:

1. Streptococcus agalactiae (DSMZ-2134) 416 Generally present as part of the normal bacterial flora of gastrointestinal (GI) tract and genitourinary tract in around 15-40% of women. There are cases when it can become an infectious pathogen causing septicemia, pneumonia and meningitis in neonates 391 and occasionally immune-compromised adults.

2. Streptococcus constellatus subsp. constellatus (DSMZ-20575) 416 S. constellatus is usually part of the normal flora of the oral cavity, urogenital region, and intestinal tract. It can however also cause purulent infections in other parts of the body. It is clinically associated with abscess formation in the respiratory tract and is involved in pulmonary exacerbations in cystic fibrosis patients 312.

3. Streptococcus infantis (DSMZ-12492) 417 S. infantis is a catalase-negative coccus and a facultative anaerobe that displays α-haemolytic properties on Columbia blood (sheep) agar. It is part of the Mitis-Sanguinis group and has been noted to be present in the human oral cavity.

4. Streptococcus parasanguinis (DSMZ-6778) 418 This is a catalase-negative bacterium, α-haemolytic and is part of the healthy microflora of the oral cavity 419.

5. Streptococcus pneumoniae (DSMZ-20566) 416 The optimal growth temperature for S. pneumoniae is between 30-35 °C, so it can frequently be found as part of the upper respiratory tract (specifically throat and nasal passages) microbiota. It is a commonly occurring bacterium that was originally isolated by Louis Pasteur in 1881. During an overgrowth in population, activation of pathogenic properties is promoted; this in turn creates infectious environments that are highly favourable for natural transformation and antibiotic resistance. S. pneumoniae causes bacteremia, otitis media and meningitis, and is the primary agent in pneumonia infections. Infections are treated with penicillin, which interferes with synthesis of peptidoglycan in the bacterial cell wall.

6. Streptococcus pseudopneumoniae (DSMZ-18670) 420 S. pseudopneumoniae has only recently been described as a separate species from S. pneumoniae. It is believed to cause pneumonia in humans; however, structurally it is very similar to S. pneumoniae and S. mitis, and little is as yet known about its clinical implications.

7. Streptococcus pyogenes (DSMZ-20565) 416 The bacterium can cause strep throat and impetigo, as well as more severe diseases such as scarlet fever, glomerulonephritis and necrotizing fasciitis. It is highly versatile, making it challenging to develop a vaccine against. It is known that S. salivarius can outcompete S. pyogenes by inhibiting growth 421.

8. Streptococcus sanguinis (DSMZ-20567) 416 S. sanguinis is commonly found in dental plaque and colonizing dental cavities. It can travel via the bloodstream to the heart, resulting in bacterial endocarditis – infection of the heart endocardium (lining). S. sanguinis infections are generally treated with penicillin and aminoglycoside antibiotics to stop bacterial growth. Using homologous recombination (sexual process involving DNA transfer between bacterial cells) the bacterium can genetically transform 422.

9. Streptococcus mitis (DSMZ-12643) 423 S. mitis is not usually pathogenic but can cause bacterial endocarditis (inflammation of the inner layer of the heart). S. mitis is part of the normal flora of the upper respiratory tract (mouth, throat and nasopharynx). It is closely related to S. pneumoniae, and homologous recombination between the two species contributes to penicillin resistance 424.

5.2.2.3 Library Preparation

The selected 483 extracted DNAs from the BHS swab samples (Section 5.2.1.1) were randomized before being grouped into sequencing runs. A Kruskal Wallis test (Chapter 2, Section 2.5.2.1C) was carried out to make sure there was no significant association between runs and any clinical variables. This was done to avoid any potential bias, which had previously been identified through the 16S rRNA gene sequencing (Chapter 3, Section 3.2.3.3).

Six sequencing libraries were prepared, most of which included 94 BHS samples and two controls (one positive mock community and one PCR-negative). Sequencing library number 6 was made up of 28 BHS samples, individual species (at a working concentration of 2 x 10^7 copies/µl) of S. mitis, S. pneumoniae and S. pseudopneumoniae, and a 50:50 equimolar mix of S. mitis/S. pseudopneumoniae. Also included were two mock communities (an aliquot of the mock run on the prior 5 runs as well as an aliquot of a newly prepared mock community to confirm stability of the original mock) and one PCR-negative control.

Library preparation followed a similar processing pipeline as for the 16S rRNA gene library preparation (Chapter 2, Section 2.3.2) but with different thermocycling conditions for the quadruplicate PCR reaction. Each 25 µl PCR reaction contained 12.5 µl of Q5® High-Fidelity 2 X Master Mix, 6.5 µl of water, 5 µl of both forward and reverse primers (1.5 µM working stocks) and 1 µl of either the sample DNA, positive control (mock community) or negative control (PCR water). Thermal cycling conditions were 2 minutes at 95 °C (initial denaturation), followed by 35 cycles of 20 seconds at 95 °C (denaturation), 20 seconds at 55 °C (annealing) and 30 seconds at 72 °C (extension). A further final extension of 5 minutes at 72 °C was included. Each plate was carried out in quadruplicate, at the same time.

The rest of the library preparation followed the identical protocol as for the 16S rRNA library preparation (Chapter 2, Section 2.3.2), such that quadruplicate plates were pooled (provided no contamination was present) and amplicons were purified using AMPure XP purification. Each library was then quantified using PicoGreen, pooled in equimolar amounts and taken through another round of AMPure XP purification. Each library was gel extracted, before being quantified by quantitative PCR and BioAnalyser to establish concentration and size.

5.2.2.4 MiSeq Sequencing

Sequencing also followed the same protocol as for the 16S rRNA gene sequencing (Chapter 2, Section 2.3.3). Each library was diluted to 8 pM and a 20% Phi X spike was added before being loaded onto the machine. Primers loaded for the sequencing are given in Section 5.2.1.2C above.


Quality metrics produced by the MiSeq instrument were recorded to confirm run success and are given in Table 5.6 (for further details about each quality metric please refer to Chapter 2, Section 2.3.3.1).


5.2.3 Sequence Processing in QIIME

As for the 16S rRNA gene sequencing (Chapter 2, Section 2.4), processing of the Illumina sequencing reads was done using QIIME (for the complete command workflow refer to: https://goo.gl/vYgohs).

Initially forward and reverse barcodes were combined making them compatible for downstream analysis. Forward and reverse primers were both identified in reverse complement direction and selected and trimmed off. Reads that were less than 150 bp were also removed. This was carried out for all individual sequencing runs.

Illumina forward and reverse reads were then joined together using the join_paired_ends.py command. It utilised the parsed barcodes created in the first step and required a minimum overlap of 70 base pairs; within the overlap, the base call with the highest sequence quality was used. Afterwards, sequence reads, barcodes and metadata mapping files across the six sequencing runs were merged into a single dataset using the split_libraries_fastq.py command.
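The idea behind joining paired reads can be illustrated with a simplified sketch. This is not the algorithm implemented by join_paired_ends.py; it only shows the principle of requiring a minimum overlap and keeping the higher-quality base call within the overlap. The function name, the `max_mismatch_fraction` parameter and the toy sequences are assumptions made for illustration, and the reverse read is assumed to have been reverse complemented already.

```python
def merge_pair(fwd, fwd_q, rev_rc, rev_rc_q, min_overlap=70, max_mismatch_fraction=0.0):
    """Join a forward read with a reverse-complemented reverse read.

    Overlaps are tried from longest to shortest. Inside the accepted overlap the
    base with the higher quality score is kept. Returns None if no overlap of at
    least `min_overlap` bases is acceptable.
    """
    longest = min(len(fwd), len(rev_rc))
    for olap in range(longest, max(min_overlap, 1) - 1, -1):
        a, b = fwd[-olap:], rev_rc[:olap]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches > max_mismatch_fraction * olap:
            continue
        qa, qb = fwd_q[-olap:], rev_rc_q[:olap]
        overlap = "".join(x if q1 >= q2 else y for x, y, q1, q2 in zip(a, b, qa, qb))
        return fwd[:-olap] + overlap + rev_rc[olap:]
    return None


# Toy example with a short minimum overlap (the thesis used 70 bp on real reads).
fwd = "ACGTTGCAAGGCT"
fwd_q = [30] * 12 + [5]          # last forward base is low quality
rev_rc = "AAGGCATTACG"           # disagrees with the forward read at one overlap position
rev_rc_q = [40] * 11

merged = merge_pair(fwd, fwd_q, rev_rc, rev_rc_q, min_overlap=5, max_mismatch_fraction=0.2)
print(merged)
```

In the toy example the six-base overlap contains one disagreement, which is resolved in favour of the reverse read because its quality score is higher.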

Analogous to 16S rRNA gene sequencing, each map gene sequencing library was spiked with Phi X genome in order to introduce diversity and act as a technical control in the sequencing run. There was a total of 2,169 Phi X sequences identified and removed, leaving 35,650,976 high quality reads.

5.2.3.1 Clustering Level

In setting the threshold for similarity with regard to OTU identification, prior work carried out by Cardenas 411 had applied an 80% similarity. However, to ensure the most appropriate clustering level was implemented, sequences for the single strains S. mitis, S. pneumoniae and S. pseudopneumoniae (from the Illumina sequencing runs) as well as the Sanger sequencing results for each individual mock community member were aligned and run through the Geneious Next Generation Sequencing Tool (Version 11.0.2, http://www.geneious.com, Kearse et al., 2012) 425.


A dissimilarity matrix was created (Appendix 5.1) to compare the sequences from the different Streptococcus species. Firstly, this analysis confirmed 100% identity between results from Sanger sequencing and Illumina MiSeq sequencing. Secondly, the analysis indicated a 94.9% similarity between the single-strain sequences of S. mitis and S. pseudopneumoniae. Consequently, a clustering (similarity) level of 95% was used to process the data, so that these closely related species would be separated into different OTUs.
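The reasoning behind the 95% threshold can be expressed as a small sketch: two sequences are grouped into one OTU only if their identity reaches the clustering level, so a pair at 94.9% identity is merged at the 80% level but kept apart at 95%. The function names are illustrative and the identity calculation assumes pre-aligned sequences of equal length.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def same_cluster(identity_percent, threshold_percent=95.0):
    """True if two sequences would be grouped together at the given clustering level."""
    return identity_percent >= threshold_percent


# Identity between S. mitis and S. pseudopneumoniae reported in the text.
mitis_vs_pseudo = 94.9

print(same_cluster(mitis_vs_pseudo, threshold_percent=80.0))  # merged at 80%
print(same_cluster(mitis_vs_pseudo, threshold_percent=95.0))  # separated at 95%
```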

After establishing the clustering level, a custom parameters file was composed to specify the level at which the pick_open_reference_otus.py command should cluster OTUs. Sequences were clustered against the database that was originally used by Bishop et al. 410. All unknown OTUs were clustered separately as new de novo centroids. The OTU picking was carried out using the uclust method, clustering OTUs at 95% sequence similarity.

5.2.3.2 Streptococcal Specific Classifier

Generally, when assigning taxonomies to sequences there are two main approaches used. The first employs one of the comprehensive databases available, such as ARB-Silva 36, Greengenes 37, Ribosomal Database Project (RDP) 39 or GenBank 38. However, these databases can usually only achieve genus level classification. The second approach is taxonomy-independent and classifies the reads into Operational Taxonomic Units (OTUs). One advantage of the RDP Classifier 426, which is a naïve Bayesian classifier, is that it makes assignments based on the composition of sub-sequences and can be retrained for species level classification 427; it is therefore commonly used for taxonomic assignments. Initially, a reference guide was constructed based on the work generated by Bishop et al. 410; however, upon examining the mock communities (data not shown) there was some misnaming of the original sequences. Therefore, a Streptococcal specific classifier could not be constructed for the examination of this data.

Consequently, following ‘blind’ analysis, key OTUs that were seen across a number of different analyses were selected and manually identified. To achieve this, individual reference sequences were pulled out and searched using the Basic Local Alignment Search Tool (BLAST, https://blast.ncbi.nlm.nih.gov) 428. The database searched consists of sequences from GenBank, EMBL, DDBJ, PDB and RefSeq, and the search was optimised for megablast, which looks for highly similar sequences.

Due to the nature of the reference database, a few parameters were considered in order to be sure about bacterial identification. The first was percentage identity, which reflects how well the reference sequence matched that found in the database (ideally as close to 100% as possible). The second was the ‘E’ value, which defines how many times the relevant hit can be expected to be seen by chance when searching the database, in this way reflecting background noise. Values close to zero were sought, as these give a higher significance to the match. Finally, reference sequences of the streptococci species from this study were 412 bp in length (the map amplicon size). One of the biggest limitations of this database is that it is not complete, meaning that it contains many unidentified and poorly annotated sequences. Consequently, the alignment was explored to see how many bases were matched to that found in the database, making sure that there were minimal differences when compared to the original sequence. Once all these parameters were met, a decision on bacterial identity was proposed. When the alignment gave multiple potential species identities, all those with the highest identity scores and length of matched sequence were reported.
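The three criteria (identity, E value and alignment length) can be combined in a small ranking sketch. The hits used below are those reported for OTU19058 in Table 5.7; the `Hit` class, the `rank_hits` function and the E value cut-off of 1e-50 are hypothetical choices for illustration, since identification in this study was carried out manually.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    species: str
    identity: int    # percent identity as reported by BLAST
    e_value: float
    matched: int     # number of matched bases
    aligned: int     # length of the alignment


def rank_hits(hits, query_len, max_e=1e-50):
    """Keep hits below the E value cut-off and order them by identity, then by
    the fraction of the query covered by the alignment."""
    kept = [h for h in hits if h.e_value <= max_e]
    return sorted(kept, key=lambda h: (h.identity, h.aligned / query_len), reverse=True)


# BLAST hits for OTU19058 (Table 5.7); the reference sequence is 412 bases long.
otu19058 = [
    Hit("S. pseudopneumoniae", 95, 0.0, 391, 412),
    Hit("S. mitis strain", 100, 1.00e-180, 348, 348),
    Hit("S. pneumoniae", 94, 6.00e-178, 389, 412),
]

ranked = rank_hits(otu19058, query_len=412)
for hit in ranked:
    print(hit.species, hit.identity, f"{hit.aligned / 412:.2f}")
```

The hit with the highest identity (S. mitis) covers only 348 of the 412 bases, whereas the other two candidates align across the full length, which illustrates why more than one candidate may need to be reported for a single OTU.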

5.2.4 Data Analysis Data analysis was identical (unless otherwise specified) to that carried out for the 16S rRNA gene sequencing data (Chapter 2, Section 2.5). Details of the R analysis workflow are available at: https://goo.gl/AkiAr5.

Potential contaminants were identified through a similar method to that applied in Chapter 3, Section 3.2.3.4. The bacterial burden was correlated against the number of Streptococcal reads; those OTUs with a negative correlation were deemed potential contaminants and removed from the dataset (see Section 5.3.1.3B).


5.3 Results There were six streptococci-specific sequencing runs carried out to process the 483 BHS oropharyngeal DNA samples. Similar to the 16S rRNA gene sequencing, the majority of the sequencing runs contained 94 samples and two controls (one positive mock community, and one negative control).

The final library (sequencing run six) contained 28 samples (15 of which were repeats from previous runs that initially were unsuccessful) and four test controls: individual species (at a working concentration of 2 x 10^7 copies/µl) of S. mitis, S. pneumoniae and S. pseudopneumoniae, and a 50:50 equimolar mix of S. mitis/S. pseudopneumoniae. Also included were two mock communities (an aliquot of the mock run on the prior 5 runs as well as an aliquot of a newly prepared mock community to confirm stability of the original mock) and one PCR-negative control.

5.3.1 Quantitative PCR and Sequencing Data

5.3.1.1 Streptococcal Bacterial Load To determine Streptococcal bacterial burden, all 483 oropharyngeal DNA samples underwent streptococcus specific qPCR (Section 5.2.2.1).

A significant difference was seen when comparing smoking versus non-smoking individuals (never-smokers and ex-smokers combined), χ²(2) = 12.39, P = 0.002 (Kruskal Wallis [Chapter 2, Section 2.5.2.1C]), with a lower Streptococcal bacterial burden seen in the smokers.

Individual group means were significantly different: smoking versus never-smoking (P = 0.018) and smoking versus ex-smokers (P = 0.026) (Tukey’s Honest Significant Differences test [Chapter 2, Section 2.5.2.1C]) with a lower mean in smokers. There was no significant difference between never-smoking versus ex-smoking groups (Figure 5.2).


[Figure 5.2 plot: Streptococcus spp. Bacterial Burden. y-axis: Copy Number (molecules/uL, Log); x-axis: NeverSmoking, ExSmoking, Smoking. Significance brackets (***) from Tukey HSD.]

Figure 5.2 Streptococcus spp. bacterial burden: smoking phenotypes. There was a significant difference between never-smoking and smoking individuals (P = 0.018), and between ex-smoking and smoking individuals (P = 0.026).

5.3.1.2 Sequencing Quality Statistics Following sequencing of each individual library, run quality metrics were evaluated (as for the 16S rRNA gene sequencing runs [Chapter 3, Section 3.2.3]) and confirmed that all runs were successful (Table 5.6).

After sequencing, raw data (62,361,916 sequences) was transferred into QIIME (Section 5.2.3). Sequencing data was processed: forward and reverse reads were joined, quality controls were carried out, data was filtered to remove low quality reads and the Phi X spike-in, and an OTU table was constructed (see Section 5.3.1.3).


Table 5.6 Quality metrics of the six sequencing runs. Cluster Density is the density of clusters on the flow cell; Cluster Passing Filter shows the percentage of clusters that passed the Illumina chastity filter. ≥ Q30 is the percentage of bases with a quality score of 30 or above, where the quality score is a prediction of the probability that an incorrect base was called. Reads (Millions) is the number of reads produced at the end of each sequencing run that passed the internal filter.

| Sequencing Run | Cluster Density (k/mm2) | Cluster Passing Filter (%) | ≥ Q30 (%) | Reads (Millions) |
|---|---|---|---|---|
| Run 1 | 688 | 90.34 | 70.0 | 16.54 |
| Run 2 | 784 | 91.26 | 84.2 | 18.40 |
| Run 3 | 1,006 | 88.76 | 97.1 | 23.42 |
| Run 4 | 1,004 | 82.57 | 71.7 | 21.92 |
| Run 5 | 1,086 | 81.10 | 66.7 | 22.31 |
| Run 6 | 858 | 89.22 | 62.9 | 17.76 |

5.3.1.3 Filtering and Normalization A total of 56,653 reads across 500 samples (483 oropharyngeal DNA samples, 13 technical controls and four Streptococcus species) were imported into R for statistical analysis. As OTUs were not picked, a custom taxonomic table was created focusing on one organism that differed only at species level. Therefore, the higher taxonomic levels were the same for all OTUs being Phylum: Firmicutes, Class: Bacilli, Order: Lactobacillales, Family: Streptococcaceae and Genus: Streptococcus.

During QIIME processing OTU IDs were constructed according to the following pattern “New.Reference” plus OTU number. To simplify the names, the first part of each name (“New.Reference”) was removed, leaving behind just unique OTU numbers.


(A) Mock Communities Next the data was separated by type: controls (372 OTUs across 17 control samples) and oropharyngeal DNA samples (14,987 OTUs across 483 samples) to allow assessment of the mock community (positive control) of each run. For the sixth sequencing run two mock communities had been included (Section 5.2.2.3).

The sequencing reads from the controls (negative and positive) were rarefied to 500 reads, which removed all negative controls, leaving the seven mock communities and four bacterial control species (S. mitis, S. pneumoniae and S. pseudopneumoniae and the 50:50 equimolar mix of S. mitis/ S. pseudopneumoniae).

Initially, the mock communities across all runs were visualised to confirm species separation. Nine different streptococcal species had been included when constructing the mock community (Section 5.2.2.2), so the top 10 OTUs (most abundant) were sub-selected and plotted (Figure 5.3).

Similar abundances of each OTU within each mock community were observed, reflecting consistent sequencing quality across all runs. The three individual Streptococcus species (S. mitis, S. pneumoniae and S. pseudopneumoniae) and the mix of S. mitis/S. pseudopneumoniae were also identifiable, both when sequenced independently and within the mock communities.


Figure 5.3 Mock community structures across the 6 sequencing runs. Additionally, the three individual Streptococcus species (S. mitis, S. pneumoniae and S. pseudopneumoniae and a mix of S. mitis/ S. pseudopneumoniae), which are of close origin were identified as separate organisms at 95% clustering level.

Reference sequences for these top ten OTUs were then examined using the BLAST database (Table 5.7). As can be seen from the table, for two OTUs (OTU11487 and OTU19058) more than one species was identified, and therefore all identified species names are included in the table.

It should be noted that OTU19058 was recognised through the database as S. mitis, S. pneumoniae and S. pseudopneumoniae, which are all closely related species, however this was one of the known species (S. mitis) whose identity was previously confirmed through Sanger sequencing (Section 5.2.2.2).


Nonetheless, all potential candidates were included in the table because they had information on a larger portion of the bacterial sequence available (OTU19058 sequence was made up of 412 bases), as well as having a greater proportion of the sequence aligned, providing more substantial information of the bacterial identity.

Table 5.7 Blast identification of 10 most abundant OTUs of the mock community. Identity (%) reflects percentage of matched bases when comparing reference sequences to BLAST database. ‘E’ value gives information on the probability of having the match occurring by chance, the lower the score the more significant the match. Alignment gives the number of bases that were matched between reference sequences and closest match in BLAST database.

| OTUID | BLAST Identification | Identity (%) | E Value | Alignment |
|---|---|---|---|---|
| OTU11487 | S. constellatus subsp. Pharyngis | 99 | 0 | 410/412 |
| | S. anginosus C238, complete genome | 99 | 0 | 410/412 |
| | S. anginosus subsp. Whileyi | 99 | 0 | 410/412 |
| OTU9000 | S. agalactiae | 99 | 0 | 409/412 |
| OTU19838 | S. pyogenes | 99 | 0 | 411/412 |
| OTU19058 | S. pseudopneumoniae | 95 | 0 | 391/412 |
| | S. mitis strain | 100 | 1.00E-180 | 348/348 |
| | S. pneumoniae | 94 | 6.00E-178 | 389/412 |
| OTU13849 | S. pneumoniae | 99 | 0 | 409/412 |
| OTU16800 | S. sanguinis | 95 | 0 | 391/412 |
| OTU11504 | S. pneumoniae | 99 | 0 | 410/412 |
| OTU13036 | S. pseudopneumoniae | 100 | 0 | 412/412 |
| OTU10358 | S. parasanguinis | 99 | 0 | 409/412 |
| OTU1432 | S. mitis strain | 88 | 2.00E-108 | 304/347 |


(B) Oropharyngeal DNA Samples

Samples were initially filtered to remove any OTUs that were present in only one sample in the dataset; OTUs with fewer than 20 reads were also removed (procedure as per Chapter 3, Section 3.2.3.5). This resulted in the 483 BHS oropharyngeal DNA samples containing 14,987 OTUs.
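The two filtering rules can be sketched as follows. The OTU table layout (a dictionary mapping each OTU to its per-sample read counts), the function name and the toy counts are assumptions made for illustration, and the read threshold is assumed here to apply to the total number of reads across all samples.

```python
def filter_otus(table, min_samples=2, min_reads=20):
    """Drop OTUs seen in fewer than `min_samples` samples or with fewer than
    `min_reads` reads in total. `table` maps OTU id -> list of per-sample counts."""
    kept = {}
    for otu, counts in table.items():
        prevalence = sum(1 for c in counts if c > 0)
        if prevalence >= min_samples and sum(counts) >= min_reads:
            kept[otu] = counts
    return kept


# Toy OTU table with three samples (hypothetical counts).
toy_table = {
    "OTU1": [50, 30, 0],   # kept: two samples, 80 reads
    "OTU2": [100, 0, 0],   # removed: present in one sample only
    "OTU3": [5, 4, 3],     # removed: fewer than 20 reads in total
}

filtered = filter_otus(toy_table)
print(sorted(filtered))
```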

Correlation between OTU abundance and Streptococcus bacterial burden (Streptococcus-specific qPCR copy number) data was tested using Spearman’s correlation coefficient (Chapter 3, Section 3.2.3.4) resulting in P-values and correlation coefficients (rho; estimated measure of association) being generated. P-values were corrected for multiple comparisons (Bonferroni correction 302).

After correlating all 14,987 OTUs, there were 4,231 OTUs that presented a statistically significant result (adjusted P-value < 0.05 429), with 89 of these OTUs displaying a negative correlation (individual OTUs that increased their number of reads as the biomass of total yield decreased). Details of the OTUs identified can be found in Appendix 5.2. These OTUs were then examined using the BLAST database and subsequently removed from the overall dataset. This left a total of 14,898 taxa and 483 samples (unrarefied data used later in Section 5.3.3.2C, DESeq2 Analysis).
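The contaminant screen can be sketched in plain Python: Spearman's rho is the Pearson correlation of the ranks, the Bonferroni correction multiplies each P-value by the number of tests, and OTUs with a significant negative rho are flagged. The function names and toy numbers are illustrative; P-values are assumed to be supplied by a statistics package, and the rho function assumes that neither input is constant.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def bonferroni(p_values):
    """Bonferroni-adjusted P-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def flag_contaminants(rhos, p_values, alpha=0.05):
    """Indices of OTUs with a significant (adjusted) negative correlation."""
    adjusted = bonferroni(p_values)
    return [i for i, (r, p) in enumerate(zip(rhos, adjusted)) if r < 0 and p < alpha]


# Toy example: an OTU whose reads rise as the bacterial burden falls.
burden = [1e5, 1e6, 1e7, 1e8]
otu_reads = [400, 300, 200, 100]
print(spearman_rho(burden, otu_reads))

# Bookkeeping from the text: 89 of the 14,987 OTUs were removed.
remaining_otus = 14987 - 89
print(remaining_otus)
```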

Next, samples were rarefied to avoid bias in results due to variation in sample sequencing depth. To establish the level of rarefaction samples were ordered by number of sequences and plotted. Figure 5.4 shows the 50 samples with the lowest abundance. The distribution of the data indicated that a rarefaction level at 7,700 reads could be applied resulting in 8 samples and 350 OTUs being removed.
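Rarefaction to a fixed depth can be sketched as random subsampling without replacement. Samples with fewer reads than the chosen depth are dropped, and OTUs that no longer have any reads after subsampling disappear from the table, which is why both samples and OTUs are lost. The function name and the toy sample are hypothetical, and the R workflow used in this study may implement the step differently.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample.

    `counts` maps OTU id -> read count. Returns None when the sample has
    fewer than `depth` reads and therefore has to be discarded.
    """
    if sum(counts.values()) < depth:
        return None
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


rng = random.Random(0)  # fixed seed so the sketch is repeatable
deep_sample = {"OTU1": 6000, "OTU2": 3000, "OTU3": 1}
shallow_sample = {"OTU1": 5000, "OTU2": 100}

rarefied_deep = rarefy(deep_sample, depth=7700, rng=rng)
rarefied_shallow = rarefy(shallow_sample, depth=7700, rng=rng)
print(sum(rarefied_deep.values()), rarefied_shallow)
```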


[Figure 5.4 plot: Busselton Sequencing Data: Rarefaction level 7,700 reads. y-axis: Number of Reads (0 to 30000); x-axis: Samples (0 to 50); horizontal line at 7,700 reads.]

Figure 5.4 Number of sequencing reads for the 50 samples with the lowest abundance of Streptococcal reads. The established rarefaction level of 7,700 reads was used (indicated by the red line) and this resulted in the removal of 8 samples and 330 OTUs.

5.3.1.4 Summary of Analysis Workflow Analysis was initially carried out on the OTU table (no species identity for reasons highlighted above). The whole population (475 individuals post rarefaction) was assessed to look at alpha and beta diversities, as well as overall species abundance and prevalence.

Next the cohort was separated according to current-smoke exposure. A number of different techniques were applied, including assessment of diversity and relative abundance, Indicator Species Analysis, Differential Expression Analysis for Sequence Count Data (DESeq, Chapter 2, Section 2.5.3.3) and Weighted Correlation Network Analysis (WCNA) (Chapter 4). OTUs that were identified as significant through these methods were collated, and those seen across multiple tests were picked for manual species identification using the BLAST database and subsequently visualised in a phylogenetic tree. A brief inventory of the species identified is described in the text; for full identification results please refer to Appendix 5.


5.3.2 Whole Population Analysis

5.3.2.1 Whole Population: Alpha Diversity

Alpha Diversity (α) looks at diversity within a community (Chapter 2, Section 2.5.2.1A). As before, it was calculated using a number of complementary diversity indices including Observed Richness, Pielou’s evenness, Shannon’s Diversity and Simpson’s Reciprocal Index (Inverse Simpson’s).
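The four indices can be computed from a vector of OTU counts for one sample, as in the following sketch. It uses the standard definitions with the natural logarithm for Shannon's index, which is an assumption here (the choice of logarithm base is not stated in this section); the function name and toy counts are illustrative.

```python
import math


def alpha_diversity(counts):
    """Observed richness, Shannon diversity, Pielou's evenness and inverse
    Simpson index for one sample (list of OTU read counts)."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    props = [c / total for c in present]
    richness = len(present)
    shannon = -sum(p * math.log(p) for p in props)
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    inverse_simpson = 1.0 / sum(p * p for p in props)
    return {
        "richness": richness,
        "shannon": shannon,
        "pielou": pielou,
        "inverse_simpson": inverse_simpson,
    }


# Toy sample: four equally abundant OTUs and one absent OTU.
even_sample = alpha_diversity([10, 10, 10, 10, 0])
print(even_sample)
```

For a perfectly even community of four OTUs, Shannon diversity equals ln(4), Pielou's evenness equals 1 and the inverse Simpson index equals 4.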

The different diversity indices were assessed for normality using Shapiro-Wilk’s test (at 95% confidence level) and confirmed with quantile-quantile plots (Chapter 2, Section 2.5.2.1B). All diversity indices deviated significantly from a normal distribution:

- Observed Richness (Mn = 481.9, SD = 182.8, P = 1.09E-06)
- Pielou’s evenness (Mn = 0.5, SD = 0.1, P = 6.74E-07)
- Shannon-Wiener Diversity (Mn = 3.0, SD = 0.8, P = 0.03)
- Simpson’s Reciprocal Index (Inverse Simpson’s) (Mn = 7.4, SD = 5.1, P = 2.09E-19).

In light of these results the data was regarded as not normally distributed, and further assessment of the alpha diversity of samples was carried out using non-parametric analysis (Section 5.3.3.1A below and, for tests applied, Chapter 2, Section 2.5.2.1C).

5.3.2.2 Whole Population: Beta Diversity

Beta diversity (β) looks at any measure of variation in species composition between different communities (Chapter 2, Section 2.5.2.2) and here it was calculated based on the Bray Curtis dissimilarity matrix. Bray Curtis dissimilarity is bounded between 0 and 1, where 1 indicates no similarity and 0 indicates complete similarity. The average distance between all samples was Mn = 0.84, SD = 0.06, indicating very little similarity of the bacterial composition between samples.
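For two samples with OTU counts u and v, Bray Curtis dissimilarity is the sum of absolute differences divided by the sum of all counts. A minimal sketch with toy counts (not study data) shows the two bounds described above; the function name is illustrative.

```python
def bray_curtis(u, v):
    """Bray Curtis dissimilarity between two samples given as OTU count lists."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


identical = bray_curtis([5, 3, 2], [5, 3, 2])   # same composition
disjoint = bray_curtis([5, 0], [0, 5])          # no shared OTUs
partial = bray_curtis([6, 4], [4, 6])           # partly shared composition
print(identical, disjoint, partial)
```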

To investigate how much variability in the data could be explained by clinical variables a PERMANOVA (Adonis function in R) multivariate model was used, similar to that applied to the 16S data in Chapter 3, Section 3.2.3 (Figure 3.10 summary flowchart). This was achieved in two steps.


Firstly (Step 1), all clinical variables were tested independently in the Adonis model to see if any caused a significant difference. Next (Step 2), any significant variables (P < 0.01) were tested together in a multi-variate Adonis model, thereby allowing the collective effect of clinical variables on variance in the data to be assessed (the contributions of significant variables were added together to calculate the total degree of variability explained).

As previously (Chapter 3) a total of 58 different variables from sample metadata were selected and tested (Chapter 3, Table 3.8 for a list of tested variables). As before any samples lacking complete metadata were excluded as the Adonis model does not accept missing (Not Available [NA]) values in the formula and therefore 429 samples were included in the analysis.

Individual traits were tested in an Adonis model independently (Figure 3.10 Part B, formula used: Adonis (beta diversity matrix ~ clinical variable, in sample data)). Seven traits came out as significant at P < 0.01 and these are detailed in Table 5.8.

Significant clinical variables were then selected (Figure 3.10 Part C), avoiding variables that reflected similar findings, e.g. Log qPCR Copy and X16SMiSeqRun are both proxies for quantification/Streptococcus bacterial load of samples. Other variables were also eliminated: Current-Smoker (lower R2 compared to pack years [Table 5.8]), Status (umbrella for current-smoking, ex-smoking and never-smoking individuals, so difficult to specify which phenotype is causing the variation in the population) and Corrected White Blood Cell (WBC) Count (collection of a number of WBC and with lower R2). The variables selected are highlighted in bold font in Table 5.8.

In order to identify maximum variance in the population, Adonis was run with all selected variables in one multi-variable model. The order in which variables were placed into the formula had an impact on the results; therefore all possible orderings were tested, yielding 6 different permutations (Figure 3.10 Part D).
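With three selected variables there are 3! = 6 possible orderings of the model formula. The sketch below enumerates them; the variable names are those of the final model, while the formula strings are only an illustrative rendering of the Adonis call and are not copied from the analysis workflow.

```python
from itertools import permutations

# Variables retained for the multi-variable model.
variables = ["Log qPCR Copy Number", "Neutrophil Absolute", "Pack Years"]

# One formula string per ordering of the three terms.
formulas = ["beta_diversity ~ " + " + ".join(order) for order in permutations(variables)]

for formula in formulas:
    print(formula)
```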

194 CHAPTER 5: STREPTOCOCCUS spp. ANALYSIS

Table 5.8 Significant clinical variables at P < 0.01. Adonis test was run on each variable independently, P-value shows the significance when tested against beta diversity. R2 gives the amount of variation that can be explained by that particular variable.

Clinical Variable            R2       P-value
Status                       0.0093   0.001
Log qPCR Copy Number         0.0059   0.002
16S rRNA gene MiSeq Run      0.0197   0.007
Current-Smoker               0.0067   0.001
Pack Years                   0.0075   0.001
Corrected WBC Count          0.0049   0.008
Neutrophil Absolute Value    0.0051   0.007

The final model (formula: Adonis (beta diversity matrix ~ Log qPCR Copy Number + Neutrophil Absolute + Pack Years, in sample data)) explained 1.7% ([0.006+0.005+0.006]*100) of variation in the data (Figure 3.10 Part E and Table 5.9), with the majority of the variance remaining undescribed.

Table 5.9 PERMANOVA analysis: Streptococcal data

Variables               Df    Sums of Squares   F Model   R2       P-Value
Log qPCR Copy Number    1     0.907             2.5465    0.0059   0.002
Neutrophil Absolute     1     0.747             2.0961    0.0049   0.007
Pack Years              2     0.850             2.3856    0.0055   0.005
Residuals               425   151.377                     0.9837
Total                   428   153.880                     1.0000

*Refer to Chapter 2, Section 2.5.2.2 for further details
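The R2 column of Table 5.9 can be recovered from the sums of squares, since each R2 is that term's share of the total sum of squares. The following Python sketch uses only the values reported in the table; the variable names in the code are illustrative.

```python
# Sums of squares as reported in Table 5.9.
sums_of_squares = {
    "Log qPCR Copy Number": 0.907,
    "Neutrophil Absolute": 0.747,
    "Pack Years": 0.850,
}
total_ss = 153.880

# R2 for each term is its share of the total sum of squares.
r2 = {name: ss / total_ss for name, ss in sums_of_squares.items()}

# Variation explained by the three terms together, as a percentage.
explained_percent = 100 * sum(r2.values())

for name, value in r2.items():
    print(f"{name}: R2 = {value:.4f}")
print(f"Explained by the model: {explained_percent:.2f}%")
```

Working from the unrounded ratios gives a total of roughly 1.6%, in line with the 1.7% obtained when each R2 is first rounded to three decimal places.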


5.3.2.3 Common Streptococcal OTUs

Next the most abundant and prevalent OTUs were explored. When assessing abundance, OTUs that contained at least 18,000 reads were selected, resulting in 25 OTUs (Figure 5.5A). When looking at prevalence, OTUs that were present in over 70% of the population (equating to 332 individuals) were considered (Figure 5.5B). Fifteen OTUs were in common and seen across both assessments.
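The two selection rules above can be sketched as follows. This is a minimal Python illustration on a hypothetical toy OTU table (the OTU names and counts are invented for the example); the 18,000-read and 70% thresholds are those stated in the text, and the 475-sample total is the dataset size given in Section 5.3.3.

```python
import math

# Hypothetical toy OTU table: reads per sample for each OTU (not thesis data).
otu_table = {
    "OTU_a": [9000, 7000, 4000, 500],
    "OTU_b": [10, 0, 0, 0],
    "OTU_c": [6000, 6500, 0, 7000],
    "OTU_d": [1, 2, 3, 4],
}

def select_common_otus(table, min_reads, min_prevalence):
    """Return OTUs passing the abundance rule, the prevalence rule, and both."""
    n_samples = len(next(iter(table.values())))
    # Abundance rule: total reads across all samples reach the threshold.
    abundant = {otu for otu, counts in table.items() if sum(counts) >= min_reads}
    # Prevalence rule: fraction of samples containing the OTU exceeds the threshold.
    prevalent = {
        otu for otu, counts in table.items()
        if sum(c > 0 for c in counts) / n_samples > min_prevalence
    }
    return abundant, prevalent, abundant & prevalent

abundant, prevalent, shared = select_common_otus(otu_table, 18000, 0.70)

# 70% of 475 samples corresponds to 332 whole individuals.
threshold_samples = math.floor(0.70 * 475)
```

In the toy table, OTU_d is prevalent but not abundant, which is the kind of OTU that appears in only one of the two panels of Figure 5.5.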

[Figure 5.5, panel A: Most Abundant 25 OTUs (x-axis: Abundance; point colour and size: percent abundance). OTUs on the y-axis: OTU10104, OTU15936, OTU13812, OTU10307, OTU24710, OTU23481, OTU21636, OTU11843, OTU13903, OTU22893, OTU10817, OTU13837, OTU10706, OTU10102, OTU9031, OTU15267, OTU12653, OTU11394, OTU2846, OTU124, OTU10097, OTU16404, OTU14275, OTU2013, OTU10547.]

[Figure 5.5, panel B: Most Prevalent 25 OTUs (x-axis: Number of Samples; point colour and size: percent prevalence). OTUs on the y-axis: OTU10104, OTU15936, OTU24710, OTU10097, OTU11843, OTU10307, OTU10843, OTU10102, OTU1050, OTU23481, OTU12401, OTU21636, OTU13903, OTU13964, OTU10817, OTU13837, OTU13462, OTU2846, OTU12371, OTU9693, OTU15267, OTU15822, OTU12653, OTU14077, OTU7803.]

Figure 5.5 The 25 most abundant (A) and most prevalent OTUs (B). The OTU IDs are labelled along the y-axis (species seen across both coloured in red). The x-axis shows abundance (number of reads per OTU, log10 transformed) for abundance plot and number of samples that contain that OTU in prevalence plot. The colour and size of each point represents percentage from total Abundance (A) and Prevalence (B) of each OTU.


5.3.3 Smoking Phenotype

Next, the dataset (14,548 OTUs across 475 individuals) was separated into the three smoking phenotypes. Current-smokers had a total of 4,621 OTUs across 53 samples, ex-smokers 10,364 OTUs across 191 samples, and those who had never smoked 11,032 OTUs across 231 samples.

Reflecting on previous results, both 16S rRNA gene sequencing data analysis and Weighted Correlation Network Analysis had indicated a close resemblance between never-smoking and ex-smoking individuals. In light of this, the never-smoking and ex-smoking individuals were combined into a current non-smoking group (422 samples) for comparison to the current-smokers (53 samples).

5.3.3.1 Diversity

(A) Alpha Diversity

Alpha diversity was analysed using the four different alpha diversity indices (Chapter 2, Section 2.5.2.1A): Observed Richness, Pielou's Evenness, Shannon-Wiener Diversity and Simpson's Reciprocal Index (Inverse Simpson's). No significant difference was noted between any of the smoking phenotypes.
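For reference, the four indices named above can be computed from a single sample's OTU counts as in the sketch below. This is a minimal Python illustration using the standard textbook definitions, with the natural logarithm assumed for the Shannon-Wiener index; it is not the implementation described in Chapter 2, and the example counts are hypothetical.

```python
import math

def alpha_diversity(counts):
    """Compute four alpha diversity indices from one sample's OTU counts."""
    present = [c for c in counts if c > 0]
    total = sum(present)
    proportions = [c / total for c in present]

    # Observed Richness: number of OTUs detected in the sample.
    richness = len(present)
    # Shannon-Wiener Diversity (natural log assumed).
    shannon = -sum(p * math.log(p) for p in proportions)
    # Pielou's Evenness: Shannon diversity relative to its maximum, ln(richness).
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    # Simpson's Reciprocal Index (Inverse Simpson's).
    inverse_simpson = 1.0 / sum(p * p for p in proportions)

    return {
        "richness": richness,
        "shannon": shannon,
        "evenness": evenness,
        "inverse_simpson": inverse_simpson,
    }

# Hypothetical samples: one perfectly even, one dominated by a single OTU.
even_sample = alpha_diversity([25, 25, 25, 25])
skewed_sample = alpha_diversity([97, 1, 1, 1])
```

Both toy samples have the same richness, but the dominated sample scores lower on evenness, Shannon-Wiener diversity and the inverse Simpson index, which is why several indices are reported side by side.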

(B) Beta Diversity

Differences in beta diversity were explored through PERMANOVA (using the Adonis function in R), calculated on a Bray-Curtis dissimilarity matrix (Chapter 2 Section 5.2.2). A significant difference was identified (Table 5.8; R2 = 0.007, P = 0.001), with some degree of separation of current-smokers from non-smoking individuals, but this explained less than 1% of the variation in the data.
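To make the two quantities in this comparison concrete, the sketch below computes a Bray-Curtis dissimilarity matrix and the R2 of a one-factor PERMANOVA with a permutation P-value. It is a simplified Python illustration written with NumPy, not the Adonis function from R, and the four-sample count matrix and group labels are hypothetical.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples) of a count matrix."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den
    return dist

def permanova_r2(dist, groups):
    """One-factor PERMANOVA R2: between-group share of the total sum of squares."""
    groups = np.asarray(groups)
    n = len(groups)
    sq = dist ** 2
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    return (ss_total - ss_within) / ss_total

def permutation_p_value(dist, groups, n_perm=199, seed=0):
    """Share of label permutations with R2 at least as large as the observed R2."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed = permanova_r2(dist, groups)
    hits = sum(
        permanova_r2(dist, rng.permutation(groups)) >= observed - 1e-12
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 4 samples x 3 OTUs, two per smoking group.
samples = [[10, 0, 5], [9, 1, 4], [0, 10, 5], [1, 9, 6]]
labels = ["smoker", "smoker", "non-smoker", "non-smoker"]

dist = bray_curtis(samples)
r2 = permanova_r2(dist, labels)
p_value = permutation_p_value(dist, labels)
```

With a single grouping factor, ranking permutations by R2 is equivalent to ranking them by the pseudo-F statistic, so R2 is used directly here for brevity.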

5.3.3.2 Community Structure

In order to select the most significant OTUs to be taken forward for manual identification, a number of different approaches were taken. OTUs that recurred across multiple approaches were selected for further assessment.


(A) Relative Abundance Bar Plots

Relative abundance (RA) of the top 25 (most abundant) OTUs was explored between current-smoking and current non-smoking individuals. Each OTU was converted into a proportion to investigate changes between individuals (Figure 5.6). A note was made of any OTUs within the 25 that had been observed in prior analyses (whole population 25 most abundant (MA) and most prevalent (MP) OTUs, Figure 5.5). The OTU labels were colour coded according to the tests in which they had previously been noted: Red = RA + MA + MP; Blue = RA + MA; Green = RA + MP; Black = RA only.

Figure 5.6 Top 25 OTUs present in the data separated by current-smoking and non-smoking phenotype. The bar graph looks at the relative abundance (RA) of the top 25 OTUs. The legend was colour coded depending if these OTUs had been noted in the whole population analysis as the most abundant (MA) and most prevalent (MP). Red = RA + MA + MP; Blue = RA + MA; Green = RA + MP; Black = RA only


(B) Indicator Species Analysis

Using rarefied data, Indicator Species Analysis (ISA) was carried out (Chapter 2, Section 2.5.3.2) to establish whether any particular OTUs were associated with smoking. A note was also made regarding species that had previously been seen through other analyses above; Most Abundant (MA), Most Prevalent (MP), Relative Abundance (RA) and Indicator Species Analysis (ISA).

When identifying species (OTUs), strict selection criteria were employed: a significance threshold of P = 0.05, and parameters A (probability of a species to be an indicator of a particular niche) and B (frequency that species occurs across the whole dataset) both above 0.6. Identified species (OTUs) with summary statistics can be found in Table 5.10, with those seen across multiple tests also highlighted.

Table 5.10 Species identified to be significantly associated with current-smoking and non-smoking individuals. Several OTUs were noted to have been identified from the prior tests; Most Abundant (MA), Most Prevalent (MP), Relative Abundance (RA).

OTU         A      B      Stat   P-value   Noted tests

Associated with Non-Smoking
OTU10817    0.94   0.78   0.85   0.005     MA, MP, RA
OTU15822    0.78   0.76   0.77   0.005     MP

Associated with Current-Smoking
OTU2013     0.86   0.87   0.87   0.005     MA, RA
OTU10307    0.72   0.94   0.82   0.005     MA, MP, RA
OTU13462    0.73   0.92   0.82   0.005     MP
OTU1213     0.85   0.74   0.79   0.005     MA, RA
OTU124      0.74   0.81   0.78   0.005     MA, RA
OTU23280    0.74   0.77   0.75   0.005
OTU10068    0.73   0.72   0.72   0.005
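The Stat column in Table 5.10 is consistent with the square root of the product of A and B, which is how an indicator value is commonly summarised. The Python sketch below illustrates this under assumed definitions in the style of Dufrêne and Legendre (A as the share of an OTU's abundance found in the target group, B as the share of target-group samples containing the OTU); the exact definitions applied are those of Chapter 2, Section 2.5.3.2, and the toy abundances here are hypothetical.

```python
import math

def indicator_value(abundance, groups, target):
    """Return (A, B, stat) for one OTU and one target group.

    A: share of the OTU's total abundance that falls in the target group.
    B: share of target-group samples in which the OTU is present.
    stat: square root of A * B.
    """
    in_target = [a for a, g in zip(abundance, groups) if g == target]
    a_component = sum(in_target) / sum(abundance)
    b_component = sum(a > 0 for a in in_target) / len(in_target)
    return a_component, b_component, math.sqrt(a_component * b_component)

# Hypothetical OTU counts across four smokers and four non-smokers.
abundance = [8, 6, 0, 4, 1, 0, 1, 0]
groups = ["smoker"] * 4 + ["non-smoker"] * 4
a_value, b_value, stat = indicator_value(abundance, groups, "smoker")

# Check against a reported row: OTU10307 has A = 0.72 and B = 0.94.
otu10307_stat = round(math.sqrt(0.72 * 0.94), 2)
```

For the toy OTU, A = 0.90 and B = 0.75, so it would pass the A and B thresholds of 0.6 used above; the OTU10307 check reproduces the tabulated Stat of 0.82.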


(C) Differential Expression Analysis for Sequence Count Data (DESeq2)

Community structure between current-smokers versus current non-smokers (controls) was explored using DESeq2 (Chapter 2 Section 2.5.3.3). Analysis was carried out on un-rarefied data and therefore included all 483 individuals, with 14,898 observed OTUs. This resulted in 430 individuals in the non-smoking group and 53 individuals in the current-smoking group. Normalisation and correction for false discovery rate were performed within the DESeq function.

Comparing current-smokers versus current non-smokers (never and ex-smoking combined) there were 157 significant OTUs at P < 0.001 of which abundance was increased for 96 of the OTUs and decreased for the remaining 61 OTUs in smokers (Figure 5.7, Appendix 5.3).

Overall, there were 33 OTUs that were significant and were seen across two or more tests. Collating all the results together saw 18 OTUs associated with current-smoking habit and 14 OTUs associated with current non-smoking individuals.

Next, the 33 OTUs were analysed using BLAST; the species identities can be found in Appendix 5.3. Reference sequences for all identified OTUs were selected and each one was individually analyzed against the nucleotide database. Picking candidate identities was a manual process, so care was taken that all required parameters were met before making a decision (Section 5.2.3.2).

The 33 OTUs that were examined using BLAST resulted in seven unique species (further details in Appendix 5.3), including: S. oralis (1 candidate), S. salivarius (21), S. parasanguinis (8), S. thermophilus (1), and S. sp. I-G2 (1) and OTU2013 (1), both of which had an identification split between S. mitis and S. pneumoniae. The inability to give a definitive name to the latter two OTUs highlights the need to develop an extensive Streptococcal Specific Classifier.


[Figure 5.7: plot titled "Non-Smoking vs Current Smoking Individuals". x-axis: Log 2 Fold Change, with Non-Smoker and Smoker direction labels; y-axis: -Log10 FDR Corrected P Value, with a threshold marked at p < 0.001. Legend: Significant OTUs (Unique to DESeq, Repeated OTU), Not Significant; point size: Abundance. Individual points are labelled with OTU IDs.]

Figure 5.7 DESeq analysis: current non-smoking vs. current-smoking individuals. Coloured OTUs were identified to have significant changes (adjusted P < 0.005). Significant OTUs seen in prior analyses (MA, MP, RA, ISA) are coloured in red whilst, those only seen through DESeq analysis are in black. Sizes of points indicate abundance of the particular OTUs in the whole data set.


5.3.3.3 Weighted Correlation Network Analysis (WCNA)

The streptococcal OTU data were also put through WCNA to identify networks of streptococcal species that may be associated with current-smoke exposure. As before, a filtering pipeline was applied to the data prior to WCNA: any OTUs present in only one sample were removed, as were OTUs with fewer than 20 reads. This resulted in 14,898 OTUs across 483 samples, with a total of 37,955,013 reads. Data were not rarefied.
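The filtering step described above can be sketched as follows. This is a minimal Python illustration on a hypothetical toy OTU table, not the pipeline itself; the function and parameter names are invented for the example, and the thresholds (present in more than one sample, at least 20 reads) are those stated in the text.

```python
def filter_otus(table, min_samples=2, min_reads=20):
    """Keep OTUs seen in at least `min_samples` samples with at least `min_reads` reads."""
    kept = {}
    for otu, counts in table.items():
        n_present = sum(c > 0 for c in counts)  # samples containing the OTU
        total_reads = sum(counts)               # reads across all samples
        if n_present >= min_samples and total_reads >= min_reads:
            kept[otu] = counts
    return kept

# Hypothetical toy OTU table: reads per sample (not thesis data).
toy_table = {
    "OTU_1": [15, 10, 0],   # two samples, 25 reads: kept
    "OTU_2": [50, 0, 0],    # present in one sample only: removed
    "OTU_3": [5, 5, 5],     # only 15 reads in total: removed
    "OTU_4": [0, 12, 9],    # two samples, 21 reads: kept
}

filtered = filter_otus(toy_table)
total_reads_kept = sum(sum(counts) for counts in filtered.values())
```

Applying both rules together removes singletons and very low-count OTUs before any correlations are calculated.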

WCNA was repeated using the same methods as described previously (Chapter 4, Sections 4.3-4.5). Briefly, the dataset was log10 (x+1) transformed and then used to draw a hierarchical clustering dendrogram; no samples were removed. To reduce the number of comparisons carried out, only smoking-related variables were selected: age, sex, smoking phenotype (never-smoking, ex-smoking, current-smoking), pack years, and monocyte, eosinophil and basophil absolute values.

An adjacency matrix was constructed using a soft thresholding parameter (β) of 4. The minimum module size was kept at 100, resulting in 34 modules being constructed. There were 1,524 OTUs not assigned to any module; these were classed as the Grey module (10% of OTUs), and no closely related modules were merged at the 0.75 correlation level. Module size varied from 141 OTUs (Dark Green module) to 684 OTUs (Turquoise module). Appendix 5.4.1 provides a brief summary of all modules, giving the number of bacterial members per module, their total bacterial abundance and the percent abundance within the whole data.
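The role of the soft thresholding parameter can be illustrated with a short sketch. The Python code below assumes an unsigned network built from Pearson correlations, so that adjacency is the absolute correlation raised to the power β; this is a common choice but is an assumption here, as the settings actually used are those described in Chapter 4. The count matrix is hypothetical.

```python
import numpy as np

def soft_threshold_adjacency(counts, beta=4):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta on log10(x + 1) data.

    counts: samples in rows, OTUs in columns.
    """
    transformed = np.log10(np.asarray(counts, dtype=float) + 1.0)
    correlation = np.corrcoef(transformed, rowvar=False)  # OTU-by-OTU Pearson r
    return np.abs(correlation) ** beta

# Hypothetical toy counts: 5 samples x 3 OTUs.
toy_counts = [
    [9, 99, 0],
    [99, 999, 9],
    [999, 9999, 0],
    [9, 99, 99],
    [99, 999, 9],
]

adjacency = soft_threshold_adjacency(toy_counts, beta=4)
```

Raising correlations to a power above 1 shrinks weak correlations much more than strong ones: a correlation of 0.5 becomes 0.0625 at β = 4, whereas a correlation of 0.9 remains at about 0.66.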

Correlations between clinical variables and identified modules were calculated using Spearman's correlation, and P-values were once more adjusted using the Benjamini and Hochberg method of false discovery rate control. A summary of the significant associations is shown in Figure 5.8.
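The Benjamini and Hochberg adjustment applied to the correlation P-values can be sketched as below. This is a minimal NumPy illustration of the standard step-up procedure rather than the R routine used in the analysis, and the input P-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted P-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P-value downwards, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical raw P-values from four module-trait correlations.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

Because each P-value is scaled by the number of tests divided by its rank, restricting the analysis to smoking-related variables, as done above, reduces the size of the correction.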

Significant correlations were predominantly seen with current-smoke exposure. Modules that were investigated further were those that displayed a significant correlation with both current-smoke exposure and number of pack years. Included modules were (statistics for current-smoke exposure provided): Light Green module (rho = 0.24, P = 3E-06), Tan (rho = -0.33, P = 1E-11), Pale Turquoise (rho = -0.24, P = 3E-06), Turquoise (rho = -0.3, P = 2E-09) and Blue (rho = -0.31, P = 8E-10).

[Figure 5.8: heat map titled "Whole Population: Module-trait relationships (Streptococci data)". Rows: module eigengenes (ME) for each module; columns: the selected clinical traits (sex, age, smoking phenotype, pack years, and monocyte, eosinophil and basophil absolute values). Each cell gives the correlation value with the adjusted P-value in brackets; colour scale from -1 to 1.]

Figure 5.8 Module-trait associations based on the streptococcal data. Correlation value and adjusted P-value were calculated using Spearman’s correlation and corrected with the Benjamini and Hochberg false discovery rate (FDR) procedure. Abbreviations: ME (Module Eigengene).


Blue Module Members

The Blue module displayed a strong negative correlation with current-smoking and number of pack years. It contained a total read abundance of 687,355 and 641 OTUs. Appendix 5.4.2 gives a summary of the 25 most abundant OTUs of the module, those with the highest module membership and highest prevalence. Out of these 13 were identified with a definitive species name including S. oralis (1), S. pneumoniae (2), S. salivarius (6), S. sp. A12 (2) and S. sp. oral taxon 431 (2). The other OTUs did not have such a conclusive outcome from the BLAST database search and displayed a split between S. mitis, S. pneumoniae, S. sp. oral taxon 431, S. oralis, S. pseudopneumoniae, S. sp. A12, S. sp. I-G2, S. sp. I-P16 and S. parasanguinis.

Light Green Module Members

The Light Green module displayed a strong positive correlation with current-smoking and number of pack years. The module was made up of 414 different members (total read abundance 159,614 reads). Appendix 5.4.3 details the 23 most abundant OTUs of the module, those with the highest module membership and highest prevalence; these OTUs were subsequently taken through BLAST analysis (Section 5.2.3.2).

There were 13 OTUs with a definitive species (out of the 23 OTUs inspected); S. mitis (1 OTU), S. oralis (1), S. parasanguinis (1), S. pneumoniae (3), S. salivarius (5) and S. sp. oral taxon 431 (2). The rest of the OTUs displayed a split species identification mainly between S. sp. oral taxon 431, S. mitis and a third identity either S. parasanguinis, S. pneumoniae or S. salivarius.

Pale Turquoise Module Members

The Pale Turquoise module displayed a strong negative correlation with current-smoking and number of pack years; it was a relatively small module containing 206 OTUs (total read abundance of 126,667). Appendix 5.4.4 gives a summary of the 16 most abundant OTUs in the module, those with the highest module membership and highest prevalence. Seven of the 16 OTUs had a definitive species identity: S. parasanguinis (5), S. salivarius (1) and S. oralis (1). The identities of the remaining OTUs were split between S. sp. A12, S. sp. I-G2 and S. sp. I-P16.


Tan Module Members

The Tan module displayed a strong negative correlation with current-smoking and number of pack years. It contained 484 different members with a total read abundance of 372,846. Appendix 5.4.5 gives a summary of the 19 most abundant OTUs of the module, those with the highest module membership and highest prevalence, which were analysed using BLAST. Sixteen of the 19 OTUs had a definitive species identity: S. parasanguinis (7), S. salivarius (4) and S. sp. A12 (5). The remaining three OTUs had an identity split between S. mitis and S. sp. oral taxon 431.

Turquoise Module Members

The Turquoise module displayed a strong negative correlation with current-smoking and number of pack years. It contained 684 different members (total read abundance 674,831 reads [1.8% of the overall total]).

Appendix 5.4.6 gives a summary of the 26 most abundant OTUs of the module, those with the highest module membership and highest prevalence. BLAST search was able to identify a definitive species name for 15 OTUs: S. parasanguinis (2), S. pneumoniae (4), S. salivarius (4), S. sp. A12 (2), S. sp. I-G2 (5) and S. sp. oral taxon 431 (1). Reflecting on the potential identities for the other ten OTUs, it could be proposed that S. sp. oral taxon 431 is closely related to S. mitis (5 OTUs had a split between these two species). Two OTUs were identified as S. sp. oral taxon 431 and S. pneumoniae, one was split between S. sp. I-G2 and S. pneumoniae, and one was split between the three closely related species S. pneumoniae, S. mitis and S. pseudopneumoniae.


5.3.3.4 Phylogenetic Visualisation

After identifying a number of key OTUs through a variety of analyses (Sections 5.3.2.3, 5.3.3.2 and 5.3.3.3) and manually classifying each one using the BLAST database, reference sequences from the 140 manually identified OTUs were aligned and a phylogenetic tree was constructed to show their evolutionary relationships.

All 140 OTUs were exported together with their respective reference sequences. For OTUs that were not definitively assigned to a single species the OTU name was altered to contain all possible species identified. For example, OTU5643 identity was split between S. sp. oral taxon 431, S. mitis, S. pneumoniae, so the OTU was extended to: OTU5643_S.sp.Oral.Taxon431_S.mitis_S.pneumoniae.

Next all the OTUs and their respective reference sequences were loaded into the Molecular Evolutionary Genetics Analysis (MEGA7) software 430, which contains an array of methods for phylogenomics and phylomedicine. MEGA7 software aligned all identified streptococci sequences and this alignment was used to reconstruct a phylogeny that was then used to draw the evolutionary relationship between the identified OTUs. This evolutionary history was inferred using the Neighbor-Joining method 431, which is a “bottom-up” agglomerative clustering method.

The tree was constructed by finding pairs of OTUs that minimize the total branch length at each stage of clustering. A total of 140 nucleotide sequences were analyzed, codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There was a total of 412 positions in the final dataset.
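The pair-selection step of the Neighbor-Joining method can be illustrated on a small example. The sketch below is a minimal NumPy illustration of the standard Q criterion for a single clustering step, not the MEGA7 implementation; the five-taxon distance matrix is a hypothetical textbook-style example rather than thesis data.

```python
import numpy as np

def nj_first_pair(distance):
    """Return the pair (i, j) minimising the Neighbor-Joining Q criterion, and Q.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    """
    d = np.asarray(distance, dtype=float)
    n = d.shape[0]
    row_sums = d.sum(axis=1)
    q = (n - 2) * d - row_sums[:, None] - row_sums[None, :]
    np.fill_diagonal(q, np.inf)  # a taxon cannot be joined with itself
    i, j = np.unravel_index(np.argmin(q), q.shape)
    return (int(i), int(j)), q

# Hypothetical pairwise distances between five sequences.
toy_distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, q_matrix = nj_first_pair(toy_distances)
```

In this example the first two sequences are joined first even though the last two are closer to each other in raw distance, because the Q criterion also accounts for how far each sequence is from all the others.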

This phylogenetic tree was then exported as a Newick file (.nwk) and visualised in R using the package ggtree (Version 1.10.5 241). The tree was converted into a cladogram for a more simplified diagram, and the smoking status (smoker versus non-smoker) associated with each species (OTU) was added (Figure 5.9).


Figure 5.9 Evolutionary relationships of manually identified OTUs. The evolutionary history was inferred using the Neighbor-Joining method. Analysis involved 140 nucleotide sequences. There was a total of 412 positions in the final dataset.

The identified OTUs were seen to be associated with either current-smoking or current non-smoking individuals, so this association was added to each identified OTU. Overall, the OTUs separated into three main clades. The first was dominated by S. salivarius and saw a large proportion of current-smoking individuals associated with it. It should be noted that Streptococcus thermophilus is a subspecies of S. salivarius (Streptococcus salivarius subsp. thermophilus) and was seen to branch off in the middle of the S. salivarius clade (Figure 5.9). The second clade saw a split in identity between S. pneumoniae, S. mitis, S. pseudopneumoniae and S. sp. Oral Taxon 431. The final clade was predominantly S. parasanguinis, with the latter two clades being associated largely with non-smoking individuals.

Reflecting on all of the analyses carried out in this chapter, further OTUs were examined through BLAST and common oral taxa were seen to be present. When inspecting the identified species from the list of potential contaminants (Section 5.3.1.3B, Appendix 5.2), a number of oral taxa come up, including S. pseudopneumoniae, S. mitis, S. pneumoniae, S. oralis and S. salivarius.

Reflecting on the contaminants found during the processing of the 16S rRNA gene sequencing results (Appendix 3.3), a number of streptococci were seen. Unfortunately, 16S rRNA gene sequencing cannot provide enough resolution to identify OTUs at the species level. Following Streptococci-specific sequencing, a number of OTUs were identified as potential contaminants (Appendix 5.2), which could be of human origin. It is not possible to translate directly between the two assays (16S rRNA gene and Streptococci-specific sequencing), so it cannot be suggested that these are the same organisms, but these results do further complicate the process of decontamination.

Finally, all the identified streptococcal OTUs, including those found to be significant in patients, the mock community members and potential contaminants, were collated and used to draw out the phylogenetic relationship (Appendix 5.5). In the phylogenetic tree the different species largely branch out separately into four different clades, one of which is almost completely made up of the potential contaminant OTUs, providing confidence for data decontamination. One thing to note is the presence of 5 OTUs that were associated with human samples (both current and non-smoking individuals); these OTUs were identified as significant through WCNA, which was carried out on unrarefied data. Potential contaminant OTUs were also seen to come out significant in the 16S rRNA gene analysis when unrarefied data were examined. This highlights the importance of rarefying the data.


5.4 Discussion

The aims of this chapter were to:

1. Establish a robust methodology for carrying out streptococci specific quantitative PCR and next generation sequencing of human samples using the Illumina platform.
2. Establish a reference guide for identification of streptococci species.
3. Investigate further the Busselton Health Study findings in relation to smoking and overgrowth of the Streptococcus genus (Chapter 3) by differentiating at species level the Streptococcus OTUs.

The main outcome from the work conducted was the successful establishment of a robust streptococcal-specific quantification and NGS methodology for DNA extracted from human oropharyngeal swabs. The methods presented in this chapter were founded upon the prior work of Bishop et al 410 and P. Cardenas 411, with a focus on targeting the Streptococcus genus by utilizing just the methionine aminopeptidase (map) gene.

A number of streptococcal species quiescently make up a large proportion of the microbial community of healthy airways; however, they can also play a major role in disease. Additionally, the classification of this genus has undergone extensive revision, and yet culture-dependent techniques have remained the primary method of identification, especially in the clinical microbiology laboratories of hospitals. The methodology developed here could potentially play an important role in ensuring more accurate identification of this bacterial genus, considering that 16S rRNA gene sequencing cannot always provide sufficient resolution (depending on the variable region and amplicon studied) to give species identification for many genera of bacteria, including the Streptococcus genus. Consequently, next generation sequencing of other genes is increasingly important if next generation sequencing systems are to complement, and in the long term even replace, current clinical microbiology practice.

The biggest limitation to the work of this chapter was the lack of a robust reference guide for automated identification of streptococci species. A limited classifier could be generated by using the data sets of Bishop et al. 410, although initial examination (data not shown) suggested there was misnaming of the original sequences, hampering the construction of a robust classifier. Future work would be to revisit this, but also to sequence the genomes of many more streptococcal isolates from healthy individuals and clinical isolates, in order to build an extensive and robust classifier of a similar calibre to that of the RDP classifier and others (Silva and Greengenes). Consequently, key streptococcal OTUs had to be manually identified through the BLAST database to allow some level of OTU identification.

Streptococcal-specific sequencing

When analysing the quantification results, it seems that smoking individuals contained a significantly lower burden of streptococci compared to non-smoking individuals. Sequencing quality results (Table 5.6) were of the same standard as those of the 16S rRNA gene sequencing (Chapter 3, Table 3.3), with slightly better overall cluster density. Due to the lack of an accurate Streptococcal Specific Classifier, data were analysed “blindly” and only the significant OTUs that consistently came up across multiple tests were manually identified using BLAST. This method of species identification was able to provide a basic species origin; however, it is not ideal, as many identified species names were not definitive or 100% accurate. Phylogenetic tree visualisation of the identified species was used to see the evolutionary relationships between species and to some extent confirm their bacterial identities.

The mock community members (Figure 5.3) allowed confirmation that the closely related species of S. mitis, S. pneumoniae and S. pseudopneumoniae could be separated at the 95% clustering level. Additionally, the results of this chapter showed complete agreement on species identity when comparing the NGS results to the Sanger sequencing results, providing further validation of the accuracy of the current streptococci specific NGS sequencing.

Potential contaminants were identified using a similar approach to that for the 16S rRNA gene sequencing of Chapter 3 (Chapter 3, Section 3.2.3.4), where Spearman’s correlation was used to detect OTUs with a large number of reads and a low biomass. At a first glance at the species names of the identified contaminants, it can be noted that overall the database was not able to provide concrete species identities, and most seem to be a split between S. mitis, S. pneumoniae and S. pseudopneumoniae. However, after calculating the evolutionary relationship between the species identified as potential contaminants and those seen to be significant in the whole population, it was noted that potential contaminants predominantly branched out separately into a different clade (Appendix 5.5). Nevertheless, when taking these results into account for further studies, the process of decontamination has to be approached with care.

Current-smoking Related Species

Overall, the streptococcal species identified as significant in the population differed from those seen in the mock community or among the potential contaminants, meaning that the airways contain a distinct community of bacteria. Initial data processing predominantly looked at the whole dataset, and at OTUs that showed the highest dominance and prevalence in the population. Interestingly, the WCNA results seem to identify OTUs that showed correlation (both positive and negative) with current-smoke exposure.

When visualising the phylogenetic relationship between identified significant OTUs, three core clades were constructed. The first group was made up of the S. salivarius species and contained 53 OTUs, more than half of which were associated with current-smoke exposure. This relationship has not been previously noted. Mason et al. 432 have shown elevation of Lactobacillus salivarius in the subgingival community of smokers. As L. salivarius is more of a gastrointestinal probiotic bacterium and Mason et al. used the Greengenes database, it may be possible that this taxon has been misclassified.

S. salivarius is a member of the salivarius group and considered an opportunistic pathogen. It has been shown to be an early colonizer of the upper respiratory tract (nasopharynx, oropharynx)433, being transferred from the mother within days after birth 434 and providing long-term immunity to further bacterial exposure (in most circumstances) 435. A number of these strains (K12 and M18) have been employed as probiotics to effectively inhibit strep-throat causing bacterial species 436,437.

The second clade was dominated by the mitis group, consisting mainly of S. mitis and S. pneumoniae, as well as S. pseudopneumoniae and some unidentified species of a close evolutionary relationship. S. mitis is part of the commensal microbiota in the upper respiratory tract, inhabiting the oral cavity soon after birth 433,438. It is believed to form the foundation of oral biofilms and displays a lifelong importance to the ecosystem 439,440. Production of IgA1 protease, which aids species colonization 441, and a number of genome-encoded cell-wall-anchored and choline-binding proteins, which provide adherence sites for secondary colonisers 442-444, could all contribute to S. mitis commensalism. S. pneumoniae is a known pathogenic bacterium that, although normally present asymptomatically in the airway microflora, can under certain conditions and in vulnerable individuals become pathogenic and lead to conditions such as pneumonia or a variety of other pneumococcal infections. Overall this clade showed more association with non-smoking individuals; however, a handful of current-smokers exhibited a connection with S. pneumoniae. Cigarette smoke exposure increases susceptibility to pulmonary and invasive bacterial infections with a dose-response effect 445,446, so this association is not surprising.

The final cluster contained Streptococcus parasanguinis and a few members that were not specifically named but displayed a close relationship to this species, so are potentially of similar identity. S. parasanguinis is also an early coloniser of the oral cavity 419; it is recognised as a beneficial, health-related bacterium and an accepted member of a healthy upper respiratory microflora 447. Consequently, it appears appropriate that this species is associated with non-smoking individuals, who dominated this clade.

In summary, a streptococci-specific quantitative PCR and sequencing method for human samples has been successfully developed, the results of which can be used for a detailed understanding of the streptococcal species present in a human clinical sample. Further work needs to be carried out to establish a robust and reliable reference guide for automated taxonomic classification. The results presented in this chapter indicate that there is a close association between S. salivarius and current-smoke exposure.


Chapter 6: General Discussion and Future Works

Many studies are providing ample evidence that the microbiome comes into play not only during disease, but also in a healthy state. For the airway microbiome, many studies focus on disease, and little time has been dedicated to characterising the bacterial community structure of healthy individuals. Studies of healthy subjects are important to establish what is normal.

The main aim of this thesis was to investigate the airway microbiome in a general population sample containing many healthy individuals. The thesis tested if environmental stress (smoking) causes alteration in community structure. It examined whether common diseases were associated with microbial changes that could contribute to the illnesses. The Hygiene Hypothesis suggests that exposure to a large variety of bacteria in early childhood may be protective and nurturing of a strong immune system 76, and this thesis examined whether rich and diverse communities of organisms in healthy individuals could protect against asthma.

Modern culture-independent techniques have become the gold standard for large-scale microbiome projects, and consequently this study focused on high-throughput sequencing of the 16S rRNA gene in order to define the bacterial identities of upper airway samples (Chapter 3). Samples were collected in Australia and sent to the NHLI (ICL) for DNA extraction, 16S rRNA qPCR, in-house next generation sequencing (Illumina MiSeq) and analysis.

Prior to analysis, it was important to control for any contaminant bacteria that may have been picked up during sample handling and sequencing. Salter et al. (2014) 119 highlighted the pitfalls of low biomass samples, and how easily bacterial proportions can be altered through the introduction of contaminants. One limitation of sample handling in this study was the lack of experimental kit controls, meaning that it was not possible to identify which bacteria were introduced from reagents and extraction kit equipment. Consequently, this thesis used extensive filtering techniques to identify and remove signals from potential contaminant bacteria (Chapter 3, Section 3.2.3.4). This approach could be useful for studies (conducted by others) that were likewise carried out before the Salter et al. publication.

Downstream analysis suggested that contaminant bacteria are more likely to group together within a network, as shown in Chapter 4. Interconnection hubs of contaminant bacteria were identified within the data set, using two different OTU-picking methods and two different bacterial databases (Silva and Greengenes). Network analysis is potentially another way of analytically decontaminating datasets that lack kit controls.

It was important to structure sample handling in a randomised manner to limit the introduction of any technical biases and batch effects. Furthermore, when approaching data analysis, it was important to look at the data from many different angles to generate robust representations of the results. For this reason, a number of different statistical and machine learning techniques (specifically data clustering and network generation) were applied.

It was interesting to note that despite the detection of over 4,000 distinct OTUs, these were contained in only 14 bacterial phyla. Five of these (Firmicutes, Bacteroidetes, Proteobacteria, Fusobacteria and Actinobacteria) collectively made up 98.3% of all OTUs (Chapter 3, Section 3.3.2.1), demonstrating the presence of selective pressures in the airways restricting or encouraging the growth of these bacteria.

Smoking depleted a great number of bacterial species, leaving only more adept bacteria behind, out of which streptococci were the most prevalent. 16S rRNA gene sequencing differentiates poorly between Streptococcus spp., and so a next generation sequencing method focused on the map gene was developed that targeted the genus of Streptococcus. This technique was able to determine species identity with a high degree of accuracy (Chapter 5) and may be of value for the routine diagnosis of Streptococcus spp. in clinical samples, as an addition to culture.

The microbiome data for the BHS population was found to contain a rich community of bacteria within which were found tightly connected networks (Chapter 4, Section 4.6).


The networks are consistent with the hypothesis that bacteria have to work together in order to withstand host defences and selective pressures. Additionally, the networks may represent groups of bacteria fitting into distinct ecological niches provided by the host.

Although the OTU content of oropharyngeal swabs was highly individual, overall a low dissimilarity (when testing beta diversity) was observed between samples, indicating strong selective pressures on the composition of these microbial communities. The similarity between individual samples was even greater when inferring metabolic functions for these bacteria (Chapter 4, Section 4.7.3). Clinical variables were able to explain less than 10% of the variation in diversity, providing further evidence for factors selecting for the airway microbiome in the general population.
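As an illustration of how between-sample (beta) dissimilarity can be quantified, the sketch below computes the Bray-Curtis dissimilarity between OTU count vectors. The choice of Bray-Curtis and the toy counts are assumptions made for this example; this section does not state which metric was used.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length OTU count vectors.

    0 means identical composition; 1 means no shared OTUs.
    """
    if len(a) != len(b):
        raise ValueError("count vectors must have the same length")
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


# Toy OTU counts for three hypothetical swabs (same OTU order in each).
sample_1 = [10, 0, 5, 5]
sample_2 = [8, 2, 5, 5]
sample_3 = [0, 20, 0, 0]

print(bray_curtis(sample_1, sample_2))  # small value: similar communities
print(bray_curtis(sample_1, sample_3))  # 1.0: no OTUs in common
```

A low average pairwise value across a cohort corresponds to the low dissimilarity described above, even when the exact OTU content of each swab differs.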

The BHS is a long-running epidemiological investigation with serial follow-up of its participants. Unfortunately, samples for this thesis were acquired at one time point only, giving only a ‘snapshot’ view of the bacterial composition of the airways. For a future project, it would be interesting to evaluate the healthy airway microbiome across multiple time points, to see for example how bacterial communities change in different seasons and to examine whether bacterial diversity decreases in winter months, causing susceptibility to viral and other infections.

Several health conditions that are common in the Busselton population were explored for potential associations with the respiratory microbiome. Asthma is a common chronic inflammatory disorder of the airways 134 that was associated with significantly lower bacterial diversity within the airways. These results align with previously published studies 155,156,323, and the relative increases in Veillonella and Prevotella spp. in healthy individuals are also consistent with prior studies, including those with bronchoscopic investigation of the lower airways 103,155-157,324. The loss of diversity in asthmatics supports the concept of the Hygiene Hypothesis.

Robust associations with diabetes and gastro-esophageal reflux disease were not detected, perhaps because of the limited numbers of individuals suffering from these conditions and the low statistical power for analysis. The few significant deviations from healthy individuals did not come out consistently across multiple analytical techniques.


One finding suggested that diabetes may lead to a reduction of Firmicutes in the airways, which is comparable to what has previously been reported for the gut microbiome in diabetes 182.

Studies have nevertheless shown that gut bacteria may be translocated into the lungs 195 and the presence of a gut-lung axis 196-198 has been suggested. An effective study of the relationship of diabetes and gastro-esophageal reflux disease and the airways microbiota would require direct recruitment of patients who display these conditions, rather than a population survey.

In this thesis, cigarette smoking was discovered to be a significant environmental factor that caused extensive upper airway microbiome remodelling. Individuals who were recorded as current-smokers displayed much lower bacterial loads, as well as a decrease in species diversity and richness within the population (Chapter 3, Section 3.3.4.1B). This study examined a large number of participants, including healthy non-smoking as well as ex- and current-smoking individuals, and was well powered to test for smoking effects.
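The two within-sample measures mentioned here, richness and diversity, can be made concrete with a short sketch. It uses observed richness (the number of OTUs present) and the Shannon index as an example diversity measure; the choice of index and the toy counts are assumptions for illustration and are not results from the cohort.

```python
import math


def richness(counts):
    """Observed richness: the number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over OTUs that are present."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy read counts per OTU for two hypothetical swabs.
even_community = [25, 25, 25, 25]     # four OTUs, evenly represented
dominated_community = [85, 10, 5, 0]  # one dominant OTU, one OTU lost

for name, counts in [("even", even_community), ("dominated", dominated_community)]:
    print(name, richness(counts), round(shannon(counts), 3))
```

The dominated community scores lower on both measures: it has lost an OTU (lower richness) and most of its reads come from a single OTU (lower Shannon index).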

This study allowed not only the examination of changes that occur in current-smokers but also the investigation of what happens to the microbiome after cessation of smoking. Results suggest that it can take up to two years after quitting smoking for the airway microbiome to regenerate and resemble that of a healthy never-smoking individual (Chapter 3, Section 3.3.5). Unfortunately, the majority of ex-smoking individuals from this cohort had quit a long time before testing (10+ years), with low numbers of those who quit in the immediate past, making it difficult to pinpoint an exact timeframe for when the microbiome begins to replenish after cessation of smoking.

A next step would be to design a longitudinal study following the smoking history of individuals. Ideally, this would include the start of smoking (to see how quickly the bacterial composition is depleted from a healthy state), following through the course of smoking and until after cessation (to determine how quickly these effects are reversed). Whether this is ethically feasible is another matter, however. Working with groups of smokers who have just signed up to a cessation clinic may therefore be a more realistic approach. Nevertheless, having a much clearer understanding of the changes that occur after cessation of smoking could potentially lead to inhalation therapies that increase the diversity within airways and replenish the community faster. Another aspect that would be worth exploring is how smoking intensity changes the microbiome; this includes identifying structural differences in the bacterial composition of individuals who smoke one cigarette per day when compared to those who smoke 60 cigarettes per day.

Smokers had a reduced bacterial richness and were dominated by fewer individual species than non-smoking individuals (Chapter 3, Section 3.3.4). The phylogenetic structure of this subgroup indicated a closer relatedness of the dominant group than expected by chance, with a significant increase in the Firmicutes phylum. Overall, this was caused by a substantial increase in streptococcal species in smokers, alongside depletion of Prevotella, Haemophilus and Neisseria within the bacterial community.

This co-abundance was further confirmed through network analysis (Chapter 4), where tightly interconnected bacterial cabals were seen to reflect earlier findings. An adapted analytical pipeline was able to identify modules dominated by streptococci species that exhibited a strong correlation with current-smokers. These bacterial interconnections were further confirmed when analysis was repeated with a new OTU table (picked using a different reference method and a different bacterial database), providing confidence in the validity of the findings.
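For readers unfamiliar with weighted correlation networks, the sketch below shows the general idea of a soft-thresholded adjacency: pairwise correlations between OTU abundance profiles are raised to a power so that strong co-abundance is retained and weak correlations shrink towards zero. It is a generic illustration, not the adapted pipeline used in Chapter 4; the use of Pearson correlation, the exponent beta = 6 and the toy profiles are all assumptions.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sx * sy)


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency |r|**beta for every pair of OTUs.

    beta is an illustrative assumption, not a value taken from the thesis.
    """
    adjacency = {}
    for a, b in combinations(sorted(profiles), 2):
        adjacency[(a, b)] = abs(pearson(profiles[a], profiles[b])) ** beta
    return adjacency


# Toy abundance profiles across four hypothetical samples.
profiles = {
    "OTU_x": [1, 2, 3, 4],
    "OTU_y": [2, 4, 6, 8],  # perfectly co-abundant with OTU_x
    "OTU_z": [4, 1, 3, 2],  # weakly (negatively) related to OTU_x
}
adjacency = soft_threshold_adjacency(profiles)
for pair, weight in adjacency.items():
    print(pair, round(weight, 6))
```

OTUs joined by high weights would be grouped into the same module, and a module summary can then be tested for correlation with a trait such as current smoking.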

The bacterial abundance of smoking-associated modules (mainly dominated by Streptococcus spp., as well as some Bifidobacterium spp. and Lactobacillus spp.) was also seen to increase with the number of pack years. These species may be foreign, being introduced into the body through smoking, or may be the most resilient to the environmental stress caused by smoking. The direct toxic effects of smoking may be seen on bacteria such as Prevotella, Haemophilus and Neisseria, which showed a significantly negative correlation with smoking and pack years.

Considering that smoking has a strong affiliation with streptococci, and that 16S rRNA gene sequencing did not provide enough detail on species identity, a new sequencing assay was developed that focused specifically on classifying different species of this genus (Chapter 5). This streptococci-specific sequencing was an adaptation of prior work 410,411, involving further optimisation with a focus on the map gene for specific streptococcal bacterial load quantification as well as targeted sequencing of human clinical samples (in this thesis, oropharyngeal samples).

Following successful establishment of a robust streptococci-specific quantitative PCR, it was noted that smoking individuals had a significantly lower Streptococcus spp. burden compared to non-smoking individuals. This reflects the lower bacterial burden seen in smoking individuals through 16S rRNA gene sequencing. Following blind analysis of the data, 33 OTUs were identified as significantly associated with either current-smoking or current non-smoking individuals. Comparison of these sequences with public databases showed that S. salivarius and S. parasanguinis were the most abundant and most prevalent of the species identified, with the former being more common in current-smoking individuals (Chapter 5, Section 5.3.2.3, Appendix 5.3).

In the future, it is desirable to assemble a reliable streptococci-specific reference database that could be used to assign OTUs automatically during processing. A well-established streptococci-specific sequencing protocol complete with a reference database would be highly beneficial for testing hospital patients for the presence of pathogenic species and would complement and aid standard clinical culture.

Apart from bacteria, there are many other micro-organisms residing in the lungs that may affect health and disease. The basis of the bacterial community outlined in this thesis should be followed by assessment of the fungal community through, for example, sequencing of the ITS2 gene region. The increased carbohydrate metabolism suggested for the streptococci-dominant module (Chapter 4, Section 4.7.4) has previously been associated with C. albicans 384, providing a potential example of the fungal community contributing to the overall ecosystem.

In conclusion, a healthy upper airway microbiome contains a rich community of microorganisms that work together in closely inter-connected networks. Asthmatic patients exhibit a decrease in bacterial diversity, confirming results from prior studies and supporting the concept of the Hygiene Hypothesis. Strong evidence that smoking has a significant effect in remodelling the airway microbiome was generated through many different analyses. Environmental stress caused by cigarette smoke depleted the abundance and diversity of bacteria, leaving only the most resilient streptococci behind. When assessing the small number of identified OTUs, S. salivarius was seen to be more commonly present in smoking individuals. Considering that it is an opportunistic pathogen, it could potentially be a key player in augmenting the respiratory conditions that accompany smoking.


Bibliography

1. Altermann W, Kazmierczak J. Archean microfossils: a reappraisal of early life on earth. Research in Microbiology. 2003;154(9):611-617. doi:10.1016/j.resmic.2003.08.006.

2. Whitman WB, Coleman DC et al. Prokaryotes: the unseen majority. Proceedings of the National Academy of Sciences USA. 1998;95(12):6578-6583.

3. Sleator RD. The human superorganism – of microbes and men. Medical Hypotheses. 2010;74(2):214-215. doi:10.1016/j.mehy.2009.08.047.

4. Colgan R. Advice to the young physician: on the art of medicine. Boston, MA: Springer US; 2010:1-145. doi:10.1007/978-1-4419-1034-9.

5. van Leewenhoeck A. "Observations, communicated to the publisher by Mr. Antony Van Leewenhoeck, in a dutch letter of the 9th of Octob. 1676. Here English'd: concerning little animals by him observed in rain-well-sea and snow water; as also in water wherein pepper had lain infused". Philosophical Transactions (1665-1678);1677:1-12.

6. Lane N. The unseen world: reflections on Leeuwenhoek (1677) “Concerning little animals.” Philosophical Transactions of the Royal Society of London B: Biological Sciences. 2015;370(1666):20140344-20140344. doi:10.1098/rstb.2014.0344.

7. Koch R. The etiology of anthrax, based on the life history of Bacillus anthracis. The Germ Theory of Disease. March 1876:1-7.

8. Koch R. Essays of Robert Koch. Praeger Pub Text; 1884.

9. Koch R. The etiology of tuberculosis. The Germ Theory of Disease. March 1882:1-7.

10. Kaufmann SHE, Schaible UE. 100th Anniversary of Robert Koch's Nobel prize for the discovery of the tubercle Bacillus. Trends in Microbiology. 2005;13:469-475. doi:10.1016/j.tim.2005.08.003.

11. Franklin RE, Gosling RG. Molecular configuration in sodium thymonucleate. Nature. 1953;171(4356):740-741. doi:10.1038/171740a0.

12. Watson JD, Crick F. Molecular structure of nucleic acids. Nature. 1953; 171:737-738.

13. Woese CR, Kandler O. Towards a natural system of organisms - Proposal for the domains archaea, bacteria, and eucarya. Proceedings of the National Academy of Sciences USA. 1990;87(12):4576-4579. doi:10.1073/pnas.87.12.4576.

14. Woese CR, Fox GE. Conservation of primary structure in 16S Ribosomal-RNA. Nature. 1975;254(5495):83-86. doi:10.1038/254083a0.

15. Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proceedings of the National Academy of Sciences USA. 1977;74(11):5088-5090.

16. Staley JT, Konopka A. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annual Review of Microbiology. 1985;39(1):321-346. doi:10.1146/annurev.mi.39.100185.001541.


17. Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. Journal of Clinical Microbiology. 2007;45(9):2761-2764. doi:10.1128/JCM.01228-07.

18. Jay ZJ, Inskeep WP. The distribution, diversity, and importance of 16S rRNA gene introns in the order Thermoproteales. Biology Direct. 2015:1-10. doi:10.1186/s13062-015-0065-6.

19. Pereira F, Carneiro J et al. Identification of species by multiplex analysis of variable-length sequences. Nucleic Acids Research. 2010;38(22):e203-e203. doi:10.1093/nar/gkq865.

20. Chakravorty S, Helb D et al. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. Journal of Microbiological Methods. 2007;69(2):330-339. doi:10.1016/j.mimet.2007.02.005.

21. Fellner P, Sanger F. Sequence analysis of specific areas of the 16S and 23S ribosomal RNAs. Nature. 1968;219(5151):236-238.

22. Wu R. Nucleotide sequence analysis of DNA. I. Partial sequence of the cohesive ends of bacteriophage lambda and 186 DNA. Journal of Molecular Biology. 1970;51(3):501-521.

23. Sanger F, Nicklen S et al. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA. 1977;74(12):5463-5467.

24. Sanger F, Barrell BG et al. Nucleotide sequence of bacteriophage PhiX174 DNA. Nature. 1977:1-9.

25. Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Research. 1979;6(7):2601-2610. doi:10.1093/nar/6.7.2601.

26. Baer R, Bankier AT et al. DNA-Sequence and expression of the B95-8 Epstein-Barr Virus Genome. Nature. 1984;310(5974):207-211. doi:10.1038/310207a0.

27. Blattner FR, Plunkett G et al. The complete genome sequence of Escherichia coli K-12. Science. 1997;277(5331):1453-1462. doi:10.1126/science.277.5331.1453.

28. Fleischmann RD, Adams MD et al. Whole-genome random sequencing and assembly of Haemophilus-Influenzae. Science. 1995;269(5223):496-512. doi:10.2307/2887657.

29. Ronaghi M, Karamohamed S et al. Real-time DNA sequencing using detection of pyrophosphate release. Analytical Biochemistry. 1996;242(1):84-89. doi:10.1006/abio.1996.0432.

30. Buermans HPJ, den Dunnen JT. Next generation sequencing technology: Advances and applications. BBA - Molecular Basis of Disease. 2014;1842(10):1932-1941. doi:10.1016/j.bbadis.2014.06.015.

31. ten Bosch JR, Grody WW. Keeping up with the next generation. The Journal of Molecular Diagnostics. 2008;10(6):484-492. doi:10.2353/jmoldx.2008.080027.

32. Zimmerman E. Top of the 50 Smartest Company List: Illumina. 2014. https://www.technologyreview.com/s/524531/why-illumina-is-no-1/.

33. Stackebrandt E, Ebers J. Taxonomic parameters revisited: tarnished gold standards. Microbiology Today. 2006; 33(4):152-155.


34. Caporaso JG, Lauber CL et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences USA. 2011;108 Suppl 1(Supplement 1):4516-4522. doi:10.1073/pnas.1000080107.

35. Claesson MJ, O'Sullivan O et al. Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. Ahmed N, ed. PLoS One. 2009;4(8). doi:10.1371/journal.pone.0006669.

36. Pruesse E, Quast C et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research. 2007;35(21):7188-7196. doi:10.1093/nar/gkm864.

37. DeSantis TZ, Hugenholtz P et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology. 2006;72(7):5069-5072. doi:10.1128/AEM.03006-05.

38. Benson DA, Karsch-Mizrachi I et al. GenBank. Nucleic Acids Research. 2009;37:D26-D31. doi:10.1093/nar/gkn723.

39. Cole JR, Wang Q et al. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research. 2014;42(D1):D633-D642. doi:10.1093/nar/gkt1244.

40. Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Annual Review of Marine Science. 2011;3:347-371.

41. Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486(7402):215-221. doi:10.1038/nature11209.

42. Bahrndorff S, Alemu T. The microbiome of animals: Implications for conservation biology. International Journal of Genomics. 2016;2016:5304028. doi:10.1155/2016/5304028.

43. Schlaeppi K, Bulgarelli D. The Plant Microbiome at Work. Molecular Plant-Microbe Interactions. 2015;28(3):212-217. doi:10.1094/MPMI-10-14-0334-FI.

44. Mohan Kulshreshtha N, Kumar R et al. Exiguobacterium alkaliphilum sp. nov. isolated from alkaline wastewater drained sludge of a beverage factory. International Journal of Systematic and Evolutionary Microbiology. 2013;63(Pt 12):4374-4379. doi:10.1099/ijs.0.039123-0.

45. Blaser MJ, Cardon ZG et al. Toward a predictive understanding of Earth’s microbiomes to address 21st century challenges. mBio. 2016;7(3):e00714-16. doi:10.1128/mBio.00714-16.

46. Singh BK, Trivedi P. Microbiome and the future for food and nutrient security. Microbial Biotechnology. 2017;10(1):50-53. doi:10.1111/1751-7915.12592.

47. Lee S, Geller JT et al. Characterization of wastewater treatment plant microbial communities and the effects of carbon sources on diversity in laboratory models. PLoS One. 2014;9(8):e105689. doi:10.1371/journal.pone.0105689.

48. Bleich A, Fox JG. The mammalian microbiome and its importance in laboratory animal research. Institute for Laboratory Animal Research Journal. 2015;56(2):153-158. doi:10.1093/ilar/ilv031.


49. Willey J, Sherwood L et al. Prescott's Microbiology. New York: McGraw-Hill Education; 2013.

50. Savage DC. Microbial ecology of the gastrointestinal tract. Annual Review of Microbiology. 1977;31:107-133. doi:10.1146/annurev.mi.31.100177.000543.

51. Berg RD. The indigenous gastrointestinal microflora. Trends in Microbiology. 1996;4(11):430-435. doi:10.1016/0966-842X(96)10057-3.

52. Sender R, Fuchs S et al. Are we really vastly outnumbered? Revisiting the ratio of bacterial to host cells in humans. Cell. 2016;164(3):337-340. doi:10.1016/j.cell.2016.01.013.

53. Sender R, Fuchs S et al. Revised estimates for the number of human and bacterial cells in the body. PLoS Biology. 2016;14(8):e1002533. doi:10.1371/journal.pbio.1002533.

54. Mackie RI, Sghir A et al. Developmental microbial ecology of the neonatal gastrointestinal tract. American Journal of Clinical Nutrition. 1999;69(5):1035S–1045S.

55. Jiménez E, Marín ML et al. Is meconium from healthy newborns actually sterile? Research in Microbiology. 2008;159(3):187-193. doi:10.1016/j.resmic.2007.12.007.

56. Stout MJ, Conlon B et al. Identification of intracellular bacteria in the basal plate of the human placenta in term and preterm gestations. American Journal of Obstetrics and Gynecology. 2013;208(3):–7. doi:10.1016/j.ajog.2013.01.018.

57. MacIntyre DA, Chandiramani M et al. The vaginal microbiome during pregnancy and the postpartum period in a European population. Scientific Reports. 2015;5:8988. doi:10.1038/srep08988.

58. Aagaard K, Ma J et al. The placenta harbors a unique microbiome. Science Translational Medicine. 2014;6(237):237ra65. doi:10.1126/scitranslmed.3008599.

59. Romero R, Hassan SS et al. The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome. 2014;2(1). doi:10.1186/2049-2618-2-10.

60. Guttmacher AE, Maddox YT et al. The human placenta project: Placental structure, development, and function in real time. Placenta. 2014;35(5):303-304. doi:10.1016/j.placenta.2014.02.012.

61. Rutayisire E, Huang K et al. The mode of delivery affects the diversity and colonization pattern of the gut microbiota during the first year of infants' life: a systematic review. BMC Gastroenterology. 2016;16(1). doi:10.1186/s12876-016-0498-0.

62. Song SJ, Dominguez-Bello MG et al. How delivery mode and feeding can shape the bacterial community in the infant gut. Canadian Medical Association Journal. 2013;185(5):373-374. doi:10.1503/cmaj.130147.

63. Dominguez-Bello MG, Costello EK et al. Delivery mode shapes the acquisition and structure of the initial microbiota across multiple body habitats in newborns. Proceedings of the National Academy of Sciences USA. 2010;107(26):11971-11975. doi:10.1073/pnas.1002601107.

64. Bäckhed F, Roswall J et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host and Microbe. 2015;17(5):690-703. doi:10.1016/j.chom.2015.04.004.


65. Cabrera-Rubio R, Carmen Collado M et al. The human milk microbiome changes over lactation and is shaped by maternal weight and mode of delivery. American Journal of Clinical Nutrition. 2012;96(3):544-551. doi:10.3945/ajcn.112.037382.

66. Fernández L, Langa S et al. The human milk microbiota: Origin and potential roles in health and disease. Pharmacological Research. 2013;69(1):1-10. doi:10.1016/j.phrs.2012.09.001.

67. Sela DA, Li Y et al. An infant-associated bacterial commensal utilizes breast milk sialyloligosaccharides. Journal of Biological Chemistry. 2011;286(14):–11918. doi:10.1074/jbc.M110.193359.

68. Manthey CF, Autran CA et al. Human milk oligosaccharides protect against enteropathogenic Escherichia coli attachment in vitro and EPEC colonization in suckling mice. Journal of Pediatric Gastroenterology and Nutrition. 2014;58(2):165-168. doi:10.1097/MPG.0000000000000172.

69. Balmer SE, Wharton BA et al. Diet and faecal flora in the newborn: breast milk and infant formula. Archives of Disease in Childhood. 1989;64(12):1672-1677.

70. Bezirtzoglou E, Tsiotsias A et al. Microbiota profile in feces of breast- and formula-fed newborns by using fluorescence in situ hybridization (FISH). Anaerobe. 2011;17(6):478-482. doi:10.1016/j.anaerobe.2011.03.009.

71. Koenig JE, Spor A, Scalfone N et al. Succession of microbial consortia in the developing infant gut microbiome. Proceedings of the National Academy of Sciences USA. 2011;108(SUPPL. 1):4578-4585. doi:10.1073/pnas.1000081107.

72. von Mutius E, Vercelli D. Farm living: effects on childhood asthma and allergy. Nature Reviews Immunology. 2010;10(12):861-868. doi:10.1038/nri2871.

73. Stein MM, Hrusch CL et al. Innate immunity and asthma risk in Amish and Hutterite farm children. New England Journal of Medicine. 2016;375(5):411-421. doi:10.1056/NEJMoa1508749.

74. Valkonen M, Wouters IM et al. Bacterial exposures and associations with atopy and asthma in children. PLoS One. 2015;10(6). doi:10.1371/journal.pone.0131594.

75. Penders J, Thijs C et al. Factors influencing the composition of the intestinal microbiota in early infancy. Pediatrics. 2006;118(2):511-521. doi:10.1542/peds.2005-2824.

76. Strachan DP. Hay fever, hygiene, and household size. British Medical Journal. 1989;299(6710):1259-1260.

77. Renz H, Brandtzaeg P et al. The impact of perinatal immune development on mucosal homeostasis and chronic inflammation. Nature Reviews Immunology. 2011;347:911- 911. doi:10.1038/nri3112.

78. West CE, Jenmalm MC et al. The gut microbiota and its role in the development of allergic disease: a wider perspective. Clinical and Experimental Allergy. 2015;45(1):43-53. doi:10.1111/cea.12332.

79. Schaub B, Liu J et al. Maternal farm exposure modulates neonatal immune mechanisms through regulatory T cells. Journal of Allergy and Clinical Immunology. 2009;123(4):774-782.e775. doi:10.1016/j.jaci.2009.01.056.


80. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207-214. doi:10.1038/nature11234.

81. Eppinga H, Konstantinov SR et al. The microbiome and psoriatic arthritis. Current Rheumatology Reports. 2014;16(3):407. doi:10.1007/s11926-013-0407-2.

82. Eid HM, Wright ML et al. Significance of microbiota in obesity and metabolic diseases and the modulatory potential by medicinal plant and food ingredients. Frontiers in Pharmacology. 2017;8:387. doi:10.3389/fphar.2017.00387.

83. Huang YJ, Nariya S et al. The airway microbiome in patients with severe asthma: Associations with disease features and severity. Journal of Allergy and Clinical Immunology. 2015;136(4):874-884. doi:10.1016/j.jaci.2015.05.044.

84. Molyneaux P, Willis-Owen SAG et al. Host-microbial interactions in idiopathic pulmonary fibrosis. American Journal of Respiratory and Critical Care Medicine. 2017;195(12):1640-1650. doi:10.1164/rccm.201607-1408OC.

85. Karlsson F, Tremaroli V et al. Assessing the human gut microbiota in metabolic diseases. Diabetes. 2013;62(10):3341-3349. doi:10.2337/db13-0844.

86. Gray H. Gray's Anatomy. Bounty Books; 2012.

87. Netter FH. Atlas of Human Anatomy. Novartis Medical Education; 1997.

88. Dodds MWJ, Johnson DA et al. Health benefits of saliva: A review. Journal of Dentistry. 2005;33(3 SPEC. ISS.):223-233. doi:10.1016/j.jdent.2004.10.009.

89. Antunes MB, Cohen NA. Mucociliary clearance - a critical upper airway host defense mechanism and methods of assessment. Current Opinion in Allergy and Clinical Immunology. 2007;7(1):5-10. doi:10.1097/ACI.0b013e3280114eef.

90. Vickery TW, Ramakrishnan VR et al. Bacterial pathogens and the microbiome. Otolaryngologic Clinics of North America. 2017;50(1):29-47. doi:10.1016/j.otc.2016.08.004.

91. Stetzenbach LD, Amman H et al. Microorganisms, mold, and indoor air quality. American Society for Microbiology. December 2004:1-20. https://www.asm.org/ccLibraryFiles/FILENAME/000000001277/Iaq.pdf

92. Burrows SM, Elbert W et al. Bacteria in the global atmosphere – Part 1: Review and synthesis of literature data for different ecosystems. Atmospheric Chemistry and Physics. 2009;9(23):9263-9280. doi:10.5194/acp-9-9263-2009.

93. Yooseph S, Andrews-Pfannkoch C et al. A metagenomic framework for the study of airborne microbial communities. PLoS One. 2013;8(12). doi:10.1371/journal.pone.0081862.

94. Bowers RM, Lauber CL et al. Characterization of airborne microbial communities at a high-elevation site and their potential to act as atmospheric ice nuclei. Applied and Environmental Microbiology. 2009;75(15):5121-5130. doi:10.1128/AEM.00447-09.

95. Be NA, Thissen JB et al. Metagenomic analysis of the airborne environment in urban spaces. Microbial Ecology. 2014;69(2):346-355. doi:10.1007/s00248-014-0517-z.

96. Kuske CR. Current and emerging technologies for the study of bacteria in the outdoor air. Current Opinion in Biotechnology. 2006;17(3):291-296. doi:10.1016/j.copbio.2006.04.001.

97. Lai KM, Emberlin J et al. Outdoor environments and human pathogens in air. Environmental Health. 2009;8(Suppl 1):S15. doi:10.1186/1476-069X-8-S1-S15.

98. Blaser M. Missing Microbes. Oneworld Publications; 2014.

99. Lemon KP, Klepac-Ceraj V et al. Comparative analyses of the bacterial microbiota of the human nostril and oropharynx. mBio. 2010;1(3). doi:10.1128/mBio.00129-10.

100. Frank DN, Feazel LM et al. The human nasal microbiota and Staphylococcus aureus carriage. PLoS One. 2010;5(5). doi:10.1371/journal.pone.0010598.

101. Yan M, Pamp SJ et al. Nasal microenvironments and interspecific interactions influence nasal microbiota complexity and S. aureus carriage. Cell Host and Microbe. 2013;14(6):631-640. doi:10.1016/j.chom.2013.11.005.

102. Wos-Oxley ML, Chaves-Moreno D et al. Exploring the bacterial assemblages along the human nasal passage. Environmental Microbiology. 2016;18(7):2259-2271. doi:10.1111/1462-2920.13378.

103. Hilty M, Burke CM et al. Disordered microbial communities in asthmatic airways. PLoS One. 2010;5(1):e8578. doi:10.1371/journal.pone.0008578.

104. Teo SM, Mok D et al. The infant nasopharyngeal microbiome impacts severity of lower respiratory infection and risk of asthma development. Cell Host and Microbe. 2015;17(5):704-715. doi:10.1016/j.chom.2015.03.008.

105. Paster BJ, Boches SK et al. Bacterial diversity in human subgingival plaque. Journal of Bacteriology. 2001;183(12):3770-3783.

106. Xu X, He J et al. Oral cavity contains distinct niches with dynamic microbial communities. Environmental Microbiology. 2015;17(3):699-710. doi:10.1111/1462-2920.12502.

107. Smith BC, Zolnik CP et al. Distinct ecological niche of anal, oral, and cervical mucosal microbiomes in adolescent women. Yale Journal of Biology and Medicine. 2016;89(3):277-284.

108. Windfuhr JP, Toepfner N et al. Clinical practice guideline: tonsillitis I. Diagnostics and nonsurgical management. European Archives of Oto-Rhino-Laryngology. 2016;273(4):973-987. doi:10.1007/s00405-015-3872-6.

109. Struzycka I. The oral microbiome in dental caries. Polish Journal of Microbiology. 2014;63(2):127-135.

110. Kirst ME, Li EC et al. Dysbiosis and alterations in predicted functions of the subgingival microbiome in chronic periodontitis. Applied and Environmental Microbiology. 2015;81(2):783-793. doi:10.1128/AEM.02712-14.

111. Ozok AR, Persoon IF et al. Ecology of the microbiome of the infected root canal system: a comparison between apical and coronal root segments. International Endodontic Journal. 2012;45(6):530-541. doi:10.1111/j.1365-2591.2011.02006.x.

112. Charlson ES, Chen J et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One. 2010;5(12):e15216. doi:10.1371/journal.pone.0015216.

113. Segal LN, Alekseyenko AV et al. Enrichment of lung microbiome with supraglottic taxa is associated with increased pulmonary inflammation. Microbiome. 2013;1:19. doi:10.1186/2049-2618-1-19.

114. Morris A, Alexander T et al. Airway obstruction is increased in pneumocystis-colonized human immunodeficiency virus-infected outpatients. Journal of Clinical Microbiology. 2009;47(11):3773-3776. doi:10.1128/JCM.01712-09.

115. Stearns JC, Davidson CJ et al. Culture and molecular-based profiles show shifts in bacterial communities of the upper respiratory tract that occur with age. The ISME Journal. 2015;9(5):1246-1259. doi:10.1038/ismej.2014.250.

116. de Steenhuijsen Piters WAA, Huijskens EGW et al. Dysbiosis of upper respiratory tract microbiota in elderly pneumonia patients. The ISME Journal. 2016;10(1):97-108. doi:10.1038/ismej.2015.99.

117. Bassis CM, Erb-Downward JR et al. Analysis of the upper respiratory tract microbiotas as the source of the lung and gastric microbiotas in healthy individuals. mBio. 2015;6(2):e00037. doi:10.1128/mBio.00037-15.

118. Goleva E, Jackson LP et al. The effects of airway microbiome on corticosteroid responsiveness in asthma. American Journal of Respiratory and Critical Care Medicine. 2013;188(10):1193-1201. doi:10.1164/rccm.201304-0775OC.

119. Salter SJ, Cox MJ et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biology. 2014;12(1):87. doi:10.1186/s12915-014-0087-z.

120. Morris A, Beck JM et al. Comparison of the respiratory microbiome in healthy nonsmokers and smokers. American Journal of Respiratory and Critical Care Medicine. 2013;187(10):1067-1075. doi:10.1164/rccm.201210-1913OC.

121. Charlson ES, Bittinger K et al. Topographical continuity of bacterial populations in the healthy human respiratory tract. American Journal of Respiratory and Critical Care Medicine. 2011;184(8):957-963. doi:10.1164/rccm.201104-0655OC.

122. Lozupone CA, Cota-Gomez A et al. Widespread colonization of the lung by Tropheryma whipplei in HIV Infection. American Journal of Respiratory and Critical Care Medicine. 2013;187(10):1110-1117. doi:10.1164/rccm.201211-2145OC.

123. Baughman RP, Thorpe JE et al. Use of the protected specimen brush in patients with endotracheal or tracheostomy tubes. Chest. 1987;91(2):233-236.

124. Erb-Downward JR, Thompson DL et al. Analysis of the lung microbiome in the “healthy” smoker and in COPD. PLoS One. 2011;6(2):e16384. doi:10.1371/journal.pone.0016384.

125. Beck JM, Young VB, Huffnagle GB et al. The microbiome of the lung. Translational Research. 2012;160(4):258-266. doi:10.1016/j.trsl.2012.02.005.

126. Dickson RP, Martinez FJ et al. The role of the microbiome in exacerbations of chronic lung diseases. Lancet. 2014;384(9944):691-702. doi:10.1016/S0140-6736(14)61136-3.

127. Gleeson K, Eggli DF et al. Quantitative aspiration during sleep in normal subjects. Chest. 1997;111(5):1266-1272. doi:10.1378/chest.111.5.1266.

128. Huxley EJ, Viroslav J et al. Pharyngeal aspiration in normal adults and patients with depressed consciousness. American Journal of Medicine. 1978;64(4):564-568. doi:10.1016/0002-9343(78)90574-0.

129. Bals R. Epithelial antimicrobial peptides in host defense against infection. Respiratory Research. 2000;1(1):5-150. doi:10.1186/rr25.

130. Nicod LP. Lung defences: An overview. European Respiratory Review. 2005;14(95):45-50. doi:10.1183/09059180.05.00009501.

131. Zanetti M, Gennaro R et al. Cathelicidins: a novel protein family with a common proregion and a variable C-terminal antimicrobial domain. FEBS Letters. 1995;374(1):1-5.

132. Kościuczuk EM, Lisowski P et al. Cathelicidins: family of antimicrobial peptides. A review. Molecular Biology Reports. 2012;39(12):10957-10970. doi:10.1007/s11033-012-1997-x.

133. Piludu M, Lantini MS et al. Salivary histatins in human deep posterior lingual glands (of von Ebner). Archives of Oral Biology. 2006;51(11):967-973. doi:10.1016/j.archoralbio.2006.05.011.

134. Martinez FD. CD14, endotoxin, and asthma risk: actions and interactions. Proceedings of the American Thoracic Society. 2007;4(3):221-225. doi:10.1513/pats.200702-035AW.

135. Vos T, Allen C et al. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet. 2016;388(10053):1545-1602. doi:10.1016/S0140-6736(16)31678-6.

136. To T, Stanojevic S et al. Global asthma prevalence in adults: findings from the cross-sectional world health survey. BMC Public Health. 2012;12(1):204. doi:10.1186/1471-2458-12-204.

137. Kroegel C. Global Initiative for Asthma (GINA) guidelines: 15 years of application. Expert Review of Clinical Immunology. 2009;5(3):239-249. doi:10.1586/eci.09.1.

138. Yawn BP. Factors accounting for asthma variability: achieving optimal symptom control for individual patients. Primary Care Respiratory Journal. 2008;17(3):138-147. doi:10.3132/pcrj.2008.00004.

139. Moore WC, Pascual RM et al. Update in asthma 2009. American Journal of Respiratory and Critical Care Medicine. 2010;181(11):1181-1187. doi:10.1164/rccm.201003-0321UP.

140. WHO. WHO | Asthma. WHO. April 2017. https://web.archive.org/web/20110629035454/http://www.who.int/mediacentre/factsheets/fs307/en/.

141. Dietert RR. Maternal and childhood asthma: risk factors, interactions, and ramifications. Reproductive Toxicology. 2011;32(2):198-204. doi:10.1016/j.reprotox.2011.04.007.

142. Cookson WO, Moffatt MF et al. Genetic risks and childhood-onset asthma. The Journal of Allergy and Clinical Immunology. 2011;128(2):266-270; quiz 271-272. doi:10.1016/j.jaci.2011.06.026.

143. Tan DJ, Walters EH et al. Age-of-asthma onset as a determinant of different asthma phenotypes in adults: a systematic review and meta-analysis of the literature. Expert Review of Respiratory Medicine. 2015;9(1):109-123. doi:10.1586/17476348.2015.1000311.

144. Moffatt MF, Gut IG et al. A large-scale, consortium-based genomewide association study of asthma. The New England Journal of Medicine. 2010;363(13):1211-1221. doi:10.1056/NEJMoa0906312.

145. Gold BD. Asthma and gastroesophageal reflux disease in children: Exploring the relationship. The Journal of Pediatrics. 2005;146(3):S13-S20. doi:10.1016/j.jpeds.2004.11.036.

146. Kelly FJ, Fussell JC. Air pollution and airway disease. Clinical and Experimental Allergy. 2011;41(8):1059-1071. doi:10.1111/j.1365-2222.2011.03776.x.

147. Polosa R, Knoke JD et al. Cigarette smoking is associated with a greater risk of incident asthma in allergic rhinitis. The Journal of Allergy and Clinical Immunology. 2008;121(6):1428-1434. doi:10.1016/j.jaci.2008.02.041.

148. Agostinis F, Foglia C et al. GINA report, global strategy for asthma management and prevention. Allergy. 2008;63(12):1637-1639. doi:10.1111/j.1398-9995.2008.01742.x.

149. Dykewicz MS. Occupational asthma: Current concepts in pathogenesis, diagnosis, and management. Journal of Allergy and Clinical Immunology. 2009;123(3):519-528. doi:10.1016/j.jaci.2009.01.061.

150. Baur X, Aasen TB et al. The management of work-related asthma guidelines: a broader perspective. European Respiratory Review. 2012;21(124):125-139. doi:10.1183/09059180.00004711.

151. Hershenson MB. Rhinovirus-induced exacerbations of asthma and COPD. Scientifica (Cairo). 2013;2013:405876. doi:10.1155/2013/405876.

152. Cardenas PA, Cooper PJ et al. Upper airways microbiota in antibiotic-naïve wheezing and healthy infants from the tropics of rural Ecuador. PLoS One. 2012;7(10):e46803. doi:10.1371/journal.pone.0046803.

153. van den Bergh MR, Biesbroek G et al. Associations between pathogens in the upper respiratory tract of young children: interplay between viruses and bacteria. PLoS One. 2012;7(10):e47711. doi:10.1371/journal.pone.0047711.

154. De Schutter I, Dreesman A et al. In young children, persistent wheezing is associated with bronchial bacterial infection: a retrospective analysis. BMC Pediatrics. 2012;12(1):83. doi:10.1186/1471-2431-12-83.

155. Park H, Shin JW et al. Microbial communities in the upper respiratory tract of patients with asthma and chronic obstructive pulmonary disease. PLoS One. 2014;9(10). doi:10.1371/journal.pone.0109710.

156. Denner DR, Sangwan N et al. Corticosteroid therapy and airflow obstruction influence the bronchial microbiome, which is distinct from that of bronchoalveolar lavage in asthmatic airways. The Journal of Allergy and Clinical Immunology. 2016;137(5):1398-1405.e3. doi:10.1016/j.jaci.2015.10.017.

157. Huang YJ, Nelson CE et al. Airway microbiota and bronchial hyperresponsiveness in patients with sub-optimally controlled asthma. The Journal of Allergy and Clinical Immunology. 2011;127(2):372–381.e1–3. doi:10.1016/j.jaci.2010.10.048.

158. Goleva E, Harris JK et al. Asthma control and disordered microbial communities in the lower airways of patients with poorly controlled asthma. Journal of Allergy and Clinical Immunology. 2012;129(S):AB128. doi:10.1016/j.jaci.2011.12.425.

159. Asthma UK. Asthma facts and statistics. https://www.asthma.org.uk/about/media/facts-and-statistics/. Accessed October 1, 2017.

160. von Mutius E. Of attraction and rejection: asthma and the microbial world. The New England Journal of Medicine. 2007;357(15):1545-1547. doi:10.1056/NEJMe078119.

161. Kim HY, Shin YH et al. Patterns of sensitisation to common food and inhalant allergens and allergic symptoms in pre-school children. Journal of Paediatrics and Child Health. 2013;49(4):272-277. doi:10.1111/jpc.12150.

162. Doll R, Hill AB. Smoking and carcinoma of the lung; preliminary report. British Medical Journal. 1950;2(4682):739-748.

163. DeMarini DM. Genotoxicity of tobacco smoke and tobacco smoke condensate: a review. Mutation Research. 2004;567(2-3):447-474. doi:10.1016/j.mrrev.2004.02.001.

164. National Health Service (NHS). How smoking affects your body. https://www.nhs.uk/smokefree/why-quit/smoking-health-problems. Published 2017. Accessed October 2, 2017.

165. Laniado-Laborín R. Smoking and chronic obstructive pulmonary disease (COPD). Parallel epidemics of the 21st century. International Journal of Environmental Research and Public Health. 2009.

166. McCarthy M. Smoking remains leading cause of premature death in US. British Medical Journal. 2014;348:g396-g396. doi:10.1136/bmj.g396.

167. Bánóczy J, Squier C. Smoking and disease. European Journal of Dental Education. 2004;8(s4):7-10. doi:10.1111/j.1399-5863.2004.00316.x.

168. Doll R, Peto R et al. Mortality in relation to smoking: 50 years' observations on male British doctors. British Medical Journal. 2004;328(7455):1519. doi:10.1136/bmj.38142.554479.AE.

169. Borgerding M, Klus H. Analysis of complex mixtures - Cigarette smoke. Experimental and Toxicological Pathology. 2005;57(SUPPL. 1):43-73. doi:10.1016/j.etp.2005.05.010.

170. Talhout R, Schulz T et al. Hazardous compounds in tobacco smoke. International Journal of Environmental Research and Public Health. 2011;8(2):613-628. doi:10.3390/ijerph8020613.

171. Molyneaux PL, Mallia P et al. Outgrowth of the bacterial airway microbiome after rhinovirus exacerbation of chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine. 2013;188(10):1224-1231. doi:10.1164/rccm.201302-0341OC.

172. Boussageon R, Bejan-Angoulvant T et al. Effect of intensive glucose lowering treatment on all cause mortality, cardiovascular death, and microvascular events in type 2 diabetes: meta-analysis of randomised controlled trials. British Medical Journal. 2011;343(jul26 1):d4169-d4169. doi:10.1136/bmj.d4169.

173. Mathers CD, Loncar D et al. Projections of global mortality and burden of disease from 2002 to 2030. PLoS Medicine. 2006;3(11). doi:10.1371/journal.pmed.0030442.

174. Wild S, Roglic G et al. Global prevalence of diabetes estimates for the year 2000 and projections for 2030. Diabetes Care. 2004;27(5):1047-1053. doi:10.2337/diacare.27.5.1047.

175. Ogurtsova K, da Rocha Fernandes JD et al. IDF Diabetes Atlas: Global estimates for the prevalence of diabetes for 2015 and 2040. Diabetes Research and Clinical Practice. 2017;128:40-50. doi:10.1016/j.diabres.2017.03.024.

176. DiBaise JK, Zhang H et al. Gut microbiota and its possible relationship with obesity. Mayo Clinic Proceedings. 2008;83(4):460-469. doi:10.4065/83.4.460.

177. Tsai F, Coyle WJ. The microbiome and obesity: is obesity linked to our gut flora? Current Gastroenterology Reports. 2009;11(4):307-313.

178. Turnbaugh PJ, Ley RE et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444(7122):1027-1031. doi:10.1038/nature05414.

179. Ley RE, Turnbaugh PJ et al. Human gut microbes associated with obesity. Nature. 2006;444(7122):1022-1023. doi:10.1038/nature4441021a.

180. Lontchi-Yimagou E, Sobngwi E et al. Diabetes Mellitus and Inflammation. Current Diabetes Reports. 2013;13(3):435-444. doi:10.1007/s11892-013-0375-y.

181. Zhang Y, Zhang H. Microbiota associated with type 2 diabetes and its related complications. Food Science and Human Wellness. 2013;2(3-4):167-172. doi:10.1016/j.fshw.2013.09.002.

182. Larsen N, Vogensen FK et al. Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults. PLoS One. 2010;5(2):e9085. doi:10.1371/journal.pone.0009085.

183. Bakken JS, Borody T et al. Treating Clostridium difficile infection with fecal microbiota transplantation. Clinical Gastroenterology and Hepatology. 2011;9(12):1044-1049. doi:10.1016/j.cgh.2011.08.014.

184. Kootte RS, Levin E et al. Improvement of insulin sensitivity after lean donor feces in metabolic syndrome is driven by baseline intestinal microbiota composition. Cell Metabolism. 2017;26(4):611-619.e616. doi:10.1016/j.cmet.2017.09.008.

185. El-Serag HB, Sweet S et al. Update on the epidemiology of gastro-oesophageal reflux disease: a systematic review. Gut. 2014;63(6):871-880. doi:10.1136/gutjnl-2012-304269.

186. Lagergren J, Bergström R et al. Symptomatic gastroesophageal reflux as a risk factor for esophageal adenocarcinoma. The New England Journal of Medicine. 1999;340(11):825-831. doi:10.1056/NEJM199903183401101.

187. Navarro Silvera SA, Mayne ST et al. Diet and lifestyle factors and risk of subtypes of esophageal and gastric cancers: classification tree analysis. Annals of Epidemiology. 2014;24(1):50-57. doi:10.1016/j.annepidem.2013.10.009.

188. National Institute of Diabetes and Digestive and Kidney Diseases. Acid Reflux (GER & GERD) in Adults | NIDDK. October 2016. https://www.niddk.nih.gov/health-information/digestive-diseases/acid-reflux-ger-gerd-adults/all-content.

189. Kasasbeh A, Kasasbeh E et al. Potential mechanisms connecting asthma, esophageal reflux, and obesity/sleep apnea complex--a hypothetical review. Sleep Medicine Reviews. 2007;11(1):47-58. doi:10.1016/j.smrv.2006.05.001.

190. Gaude GS. Pulmonary manifestations of gastroesophageal reflux disease. Annals of Thoracic Medicine. 2009;4(3):115-123. doi:10.4103/1817-1737.53347.

191. Hunt R, Quigley E et al. Coping with common gastrointestinal symptoms in the community: a global perspective on heartburn, constipation, bloating, and abdominal pain/discomfort May 2013. Journal of Clinical Gastroenterology. 2014;48(7):567-578. doi:10.1097/MCG.0000000000000141.

192. Ayazi S, Hagen JA et al. Obesity and gastroesophageal reflux: quantifying the association between body mass index, esophageal acid exposure, and lower esophageal sphincter status in a large series of patients with reflux symptoms. Journal of Gastrointestinal Surgery. 2009;13(8):1440-1447. doi:10.1007/s11605-009-0930-7.

193. Grusell EN, Dahlen G et al. Bacterial flora of the human oral cavity, and the upper and lower esophagus. Diseases of the Esophagus. 2013;26(1):84-90. doi:10.1111/j.1442-2050.2012.01328.x.

194. Yang L, Lu X et al. Inflammation and intestinal metaplasia of the distal esophagus are associated with alterations in the microbiome. Gastroenterology. 2009;137(2):588-597. doi:10.1053/j.gastro.2009.04.046.

195. Dickson RP, Singer BH et al. Enrichment of the lung microbiome with gut bacteria in sepsis and the acute respiratory distress syndrome. Nature Microbiology. 2016;1(10):16113. doi:10.1038/NMICROBIOL.2016.113.

196. Barfod KK, Roggenbuck M et al. The murine lung microbiome in relation to the intestinal and vaginal bacterial communities. BMC Microbiology. 2013;13(1):303. doi:10.1186/1471-2180-13-303.

197. Olszak T, An D et al. Microbial exposure during early life has persistent effects on natural killer T cell function. Science. 2012;336(6080):489-493. doi:10.1126/science.1219328.

198. Vieira WA, Pretorius E. The impact of asthma on the gastrointestinal tract (GIT). Journal of Asthma and Allergy. 2010;3:123-130. doi:10.2147/JAA.S10592.

199. Lazarevic V, Gaïa N et al. Decontamination of 16S rRNA gene amplicon sequence datasets based on bacterial load assessment by qPCR. BMC Microbiology. 2016;16(1):73. doi:10.1186/s12866-016-0689-4.

200. BPMRF. Busselton Population Medical Research Foundation. bpmri.org.au. http://bpmri.org.au. Accessed July 25, 2016.

201. UWA. Busselton Health Study. http://www.sph.uwa.edu.au/research/busselton-health

202. Peat JK, Woolcock AJ et al. Decline of lung function and development of chronic airflow limitation: a longitudinal study of non-smokers and smokers in Busselton, Western Australia. Thorax. 1990;45(1):32-37. doi:10.1136/thx.45.1.32.

203. Toelle BG, Xuan W et al. Respiratory symptoms and illness in older Australians: The Burden of Obstructive Lung Disease (BOLD) study. Medical Journal of Australia. 2013;198(3):144-148. doi:10.5694/mja11.11640.

204. Musk AW, Knuiman M et al. Patterns of airway disease and the clinical diagnosis of asthma in the Busselton population. European Respiratory Journal. 2011;38(5):1053-1059. doi:10.1183/09031936.00102110.

205. Zhu K, Hunter M et al. Associations between body mass index, lean and fat body mass and bone mineral density in middle-aged Australians: The Busselton Healthy Ageing Study. Bone. 2015;74:146-152. doi:10.1016/j.bone.2015.01.015.

206. O'Leary CM, Knuiman MW et al. Homocysteine and cardiovascular disease: a 17-year follow-up study in Busselton. The European Journal of Cardiovascular Prevention & Rehabilitation. 2004;11(4):350-351. doi:10.1097/01.hjr.0000136457.36990.1a.

207. Hung J, Knuiman MW et al. C-reactive protein and interleukin-18 levels in relation to coronary heart disease: Prospective cohort study from Busselton Western Australia. Heart Lung and Circulation. 2008;17(2):90-95. doi:10.1016/j.hlc.2007.07.002.

208. Adams C, Burke V et al. Cholesterol tracking from childhood to adult mid-life in children from the Busselton study. Acta Paediatrica. 2005;94(3):275-280. doi:10.1080/08035250410024303.

209. Cullen K, Stenhouse NS et al. Multiple-regression analysis of risk-factors for cardiovascular-disease and cancer mortality in Busselton, Western-Australia - 13-year study. Journal of Chronic Diseases. 1983;36(5):371-377. doi:10.1016/0021-9681(83)90169-8.

210. James A, Hunter M et al. Rationale, design and methods for a community-based study of clustering and cumulative effects of chronic disease processes and their effects on ageing: the Busselton healthy ageing study. BMC Public Health. 2013;13(1):936. doi:10.1186/1471-2458-13-936.

211. Klindworth A, Pruesse E et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Research. 2013;41(1):e1-e1. doi:10.1093/nar/gks808.

212. Kozich JJ, Westcott SL. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Applied and Environmental Microbiology. 2013;79(17):5112-5120. doi:10.1128/AEM.01043-13.

213. Illumina MS. 16S Metagenomic Sequencing Library Preparation. https://support.illumina.com/documents/documentation/chemistry_documentation/16s/16s-metagenomic-library-prep-guide-15044223-b.pdf

214. Illumina Adapter Sequences. February 2016:1-38. http://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2013/06/illumina-adapter-sequences_1000000002694-00.pdf

215. Kozich JJ, Schloss PD et al. 16S rRNA Sequencing with the Illumina MiSeq: Library Generation, QC, & Sequencing. March 2013:1-24.

216. Zipper H, Brunner H et al. Investigations on DNA intercalation and surface binding by SYBR Green I, its structure determination and methodological implications. Nucleic Acids Research. 2004;32(12):e103-e103. doi:10.1093/nar/gnh101.

217. Lane DJ, Stackebrandt E et al. 16S/23S rRNA Sequencing. In: Nucleic Acid Techniques in Bacterial Systematics. New York, NY: John Wiley and Sons; 1991:115-175.

218. Caporaso JG, Kuczynski J et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods. 2010;7(5):335-336. doi:10.1038/nmeth.f.303.

219. Krueger F. Trim Galore. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

220. Aronesty E. Command-line tools for processing biological sequencing data. https://expressionanalysis.github.io/ea-utils/

221. Aronesty E. Comparison of sequencing utility programs. The Open Bioinformatics Journal. 2013;7(1):1-8. doi:10.2174/1875036201307010001.

222. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26(5):589-595. doi:10.1093/bioinformatics/btp698.

223. Yilmaz P, Parfrey LW et al. The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks. Nucleic Acids Research. 2013;42(D1):D643-D648. doi:10.1093/nar/gkt1209.

224. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460-2461. doi:10.1093/bioinformatics/btq461.

225. Caporaso JG, Bittinger K et al. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics. 2010;26(2):266-267. doi:10.1093/bioinformatics/btp636.

226. Haas BJ, Gevers D et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research. 2011;21(3):494-504. doi:10.1101/gr.112730.110.

227. Price MN, Dehal PS et al. FastTree 2-Approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3). doi:10.1371/journal.pone.0009490.

228. Paradis E, Claude J et al. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20(2):289-290.

229. Dray S, Dufour AB. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software. 2007. doi:10.18637/jss.v022.i04

230. Huber W, Carey VJ. Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods. 2015;12(2):115-121. doi:10.1038/nmeth.3252.

231. McMurdie PJ, Paulson JN. An interface package for the BIOM file format. April 2018:1-19. https://github.com/joey711/biomformat/, http://biom-format.org/.

232. Pages H, Aboyoun P. Biostrings: String objects representing biological sequences, and matching algorithms. 2016. https://bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf

233. Maechler M, Rousseeuw P. Finding Groups in Data: Cluster Analysis Extended Rousseeuw et al. April 2018:1-82. https://cran.r-project.org/web/packages/cluster/cluster.pdf

234. Wilke CO. Cowplot: Streamlined Plot Theme and Plot Annotations for “Ggplot2” (Version 0.6.1). 2016. https://cran.r-project.org/web/packages/cowplot/cowplot.pdf

235. Love MI, Huber W et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. doi:10.1186/s13059-014-0550-8.

236. Wickham H, Francois R, RStudio. Package “dplyr.” August 2016:1-77. https://github.com/hadley/dplyr.

237. Langfelder P, Zhang B. Package “dynamicTreeCut.” March 2016:1-14. http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/BranchCutting/.

238. Langfelder P, Horvath S. Fast R functions for robust correlations and hierarchical clustering. Journal of Statistical Software. 2012;46(11).

239. Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer-Verlag; 2009. doi:10.1007/978-0-387-98141-3.

240. Kahle D, Wickham H. ggmap: Spatial visualization with ggplot2. February 2016:1-18. http://stat405.had.co.nz/ggmap.pdf

241. Yu G, Smith DK et al. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2016;8(1):28-36. doi:10.1111/2041-210X.12628.

242. Murrell P. Package “gridBase.” February 2015:1-5. https://cran.r-project.org/web/packages/gridBase/gridBase.pdf

243. Auguie B, Antonov A. Package “gridExtra.” August 2016:1-10. doi:10.5281/zenodo.11422. https://cran.r-project.org/web/packages/gridExtra/gridExtra.pdf

244. Warnes GR, Bolker B. Package “gtools.” May 2015:1-41. https://cran.r-project.org/web/packages/gtools/gtools.pdf

245. Wickham H. Package “gtable.” August 2016:1-16. https://cran.r-project.org/web/packages/gtable/gtable.pdf

246. De Caceres M, Jansen F. Package “indicspecies.” 2015. https://cran.r-project.org/web/packages/indicspecies/indicspecies.pdf

247. Lawrence M, Huber W. Software for computing and annotating genomic ranges. PLoS Computational Biology. 2013;9(8):e1003118. doi:10.1371/journal.pcbi.1003118.

248. Urbanek S. Package “jpeg.” February 2015:1-5. http://www.rforge.net/jpeg/.

249. Xie Y. knitr: A general-purpose tool for dynamic report generation in R. January 2015:1-11. https://www.cs.bham.ac.uk/~axj/pub/teaching/2016-7/stats/knitr-manual.pdf

250. Xie Y. Package “knitr.” May 2017:1-67. https://cran.r-project.org/web/packages/knitr/knitr.pdf

251. Sarkar D. Lattice: Multivariate data visualization with R. New York, NY: Springer New York; 2008. doi:10.1007/978-0-387-75969-2.

252. Bengtsson H, Bravo HC, Gentleman R, et al. Package “matrixStats.” 2017:1-46. https://cran.r-project.org/web/packages/matrixStats/matrixStats.pdf

253. Venables WN, Ripley BD. Modern applied statistics with S. Springer Science & Business Media; 2013.

254. Pinheiro J, Bates D. Package “nlme.” 2017:1-336. https://cran.r-project.org/web/packages/nlme/nlme.pdf

255. Giraudoux P. Package “pgirmess.” 2017:1-63. https://cran.r-project.org/web/packages/pgirmess/pgirmess.pdf

256. McMurdie PJ, Holmes S. phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4). doi:10.1371/journal.pone.0061217.

257. Revell LJ. Phytools: Phylogenetic tools for comparative biology (and other things). November 2017:1-191. https://cran.r-project.org/web/packages/phytools/phytools.pdf

258. Kembel SW. An introduction to the picante package. September 2012:1-16. http://picante.r-forge.r-project.org/picante-intro.pdf

259. Kembel SW, Ackerly DD. Package “picante.” March 2014:1-55. https://cran.r-project.org/web/packages/picante/picante.pdf

260. Wickham H. The split-apply-combine strategy for data analysis. Journal of Statistical Software. 2011;40(1). http://www.stat.wvu.edu/~jharner/courses/stat623/docs/plyrJSS.pdf

261. Zeileis A, Kleiber C. Regression models for count data in R. July 2008. doi:10.18637/jss.v027.i08. https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf

262. Jackman S. Package “pscl.” March 2015:1-100. https://cran.r-project.org/web/packages/pscl/pscl.pdf

263. Neuwirth E. Package “RColorBrewer.” December 2014:1-5. https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf

264. Wickham H. Reshaping data with the reshape package. Journal of Statistical Software. 2007. doi:10.18637/jss.v021.i12. http://had.co.nz/reshape/introduction.pdf

265. Wickham H. Package “scales.” August 2016:1-41. https://github.com/hadley/scales.

266. Wickham H, RStudio. Package “stringr.” August 2016:1-29. https://github.com/hadley/stringr.

267. Oksanen J, Blanchet FG, Kindt R. Package “vegan.” August 2016. https://cran.r-project.org/web/packages/vegan/vegan.pdf

268. Chen H, Boutros P. Package “VennDiagram.” April 2016:1-33. https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf

269. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9(1):559. doi:10.1186/1471-2105-9-559.

270. Whittaker RH. Vegetation of the Siskiyou mountains, Oregon and California. Ecological Society of America. 1960:1-61.

271. McCune B, Grace JB, Urban DL. Analysis of ecological communities. 2002. Publisher: MJM Software Design. doi: 10.1016/S0022-0981(03)00091-1. ISSN 0022-0981

272. Shannon CE, Weaver W. The mathematical theory of communication. The Bell System Technical Journal. 1948. http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

273. Simpson EH. Measurement of diversity. Nature. 1949;163(4148):688. doi:10.1038/163688a0.

274. Solow AR. A simple test for change in community structure. Journal of Animal Ecology. 1993;62(1):191-193. doi:10.2307/5493.

275. Jost L. Partitioning diversity into independent alpha and beta components. Ecology. 2007;88(10):2427-2439. doi:10.1890/06-1736.1.

276. Rogers GB, Cuthbertson L et al. Reducing bias in bacterial community analysis of lower respiratory infections. The International Society for Microbial Ecology Journal. 2013;7(4):697-706. doi:10.1038/ismej.2012.145.

277. Morris EK, Caruso T et al. Choosing and using diversity indices: insights for ecological applications from the German Biodiversity Exploratories. Ecology and Evolution. 2014;4(18):3514-3524. doi:10.1002/ece3.1155.

278. Cox MJ, Cookson WO et al. Sequencing the human microbiome in health and disease. Human Molecular Genetics. 2013;22(R1):R88-R94. doi:10.1093/hmg/ddt398.

279. Shapiro SS, Wilk MB. An Analysis of Variance Test for Normality (Complete Samples). Biometrika. 1965;52(3/4):591. doi:10.2307/2333709.

280. Wilk MB, Gnanadesikan R. Probability plotting methods for the analysis of data. Biometrika. 1968;55(1):1-17. doi:10.1093/biomet/55.1.1.

281. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association. 1952;47(260):583-621. doi:10.1080/01621459.1952.10483441.

282. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5(2):99. doi:10.2307/3001913.

283. Baselga A, Orme CDL. betapart: an R package for the study of beta diversity. Methods in Ecology and Evolution. 2012;3(5):808-812. doi:10.1111/j.2041-210X.2012.00224.x.

284. Bray JR, Curtis JT. An ordination of the upland forest communities of Southern Wisconsin. Ecological Monographs. 1957;27(4):325. doi:10.2307/1942268.

285. Jost L, Chao A et al. Chapter 6: Compositional similarity and β (beta) diversity. International Journal of Environmental Research And Public Health. 2011. http://chao.stat.nthu.edu.tw/wordpress/paper/87.pdf

286. Anderson MJ. A new method for non‐parametric multivariate analysis of variance. Austral Ecology. 2001;26(1):32-46. doi:10.1111/j.1442-9993.2001.01070.pp.x.

287. Anderson MJ. Distance-based tests for homogeneity of multivariate dispersions. Biometrics. 2006;62(1):245-253. doi:10.1111/j.1541-0420.2005.00440.x.

288. Webb CO. Exploring the phylogenetic structure of ecological communities: An example for rain forest trees. The American Naturalist. 2000;156(2):145-155. doi:10.1086/303378.

289. Dufrêne M, Legendre P. Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecological Monographs. 1997;67(3):345-366. doi:10.1890/0012-9615(1997)067[0345:SAAIST]2.0.CO;2.

290. De Cáceres M, Legendre P et al. Improving indicator species analysis by combining groups of sites. Oikos. 2010;119(10):1674-1684. doi:10.1111/j.1600-0706.2010.18334.x.

291. Love M, Anders S et al. Beginner's guide to using the DESeq2 package. 2014. http://dowell.colorado.edu/HackCon/files/DESeq2_beginnersguide.pdf

292. Love M, Anders S et al. Package “DESeq2.” 2013. https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf

293. Cameron AC, Trivedi PK. Regression Analysis of Count Data. Cambridge University Press; 1998.

294. Robinson MD, McCarthy DJ et al. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. doi:10.1093/bioinformatics/btp616.

295. Fortenberry JD. The uses of race and ethnicity in human microbiome research. Trends Microbiology. 2013;21(4):165-166. doi:10.1016/j.tim.2013.01.001.

296. Mason MR, Nagaraja HN et al. Deep sequencing identifies ethnicity-specific bacterial signatures in the oral microbiome. PLoS One. 2013;8(10):e77287. doi:10.1371/journal.pone.0077287.

297. Schwabe RF, Jobin C. The microbiome and cancer. Nature Reviews Cancer. 2013;13(11):800-812. doi:10.1038/nrc3610.

298. Vogtmann E, Goedert JJ. Epidemiologic studies of the human microbiome and cancer. British Journal of Cancer. 2016;114(3). doi:10.1038/bjc.2015.465.

299. Amend AS, Seifert KA et al. Quantifying microbial communities with 454 pyrosequencing: does read abundance count? Molecular Ecology. 2010;19(24):5555-5565. doi:10.1111/j.1365-294X.2010.04898.x.

300. Quast C, Pruesse E et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research. 2013;41(Database issue):D590-D596. doi:10.1093/nar/gks1219.

301. R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.

302. Weisstein, Eric W. "Bonferroni Correction." From MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/BonferroniCorrection.html

303. Kunin V, Engelbrektson A et al. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology. 2010;12(1):118-123. doi:10.1111/j.1462-2920.2009.02051.x.

304. Bokulich NA, Subramanian S et al. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods. 2013;10(1):57–U11. doi:10.1038/NMETH.2276.

305. Engelbrektson A, Kunin V et al. Experimental factors affecting PCR-based estimates of microbial species richness and evenness. The International Society for Microbial Ecology Journal. 2010;4(5):642-647. doi:10.1038/ismej.2009.153.

306. Krohn A, Stevens B et al. Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity. PeerJ Preprints. 2016. doi:10.7287/peerj.preprints.2196v2.

307. Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Research. 2016;44(W1):W242-W245. doi:10.1093/nar/gkw290.

308. Lim MY, Yoon HS et al. Analysis of the association between host genetics, smoking, and sputum microbiota in healthy humans. Scientific Reports. 2016;6:23745. doi:10.1038/srep23745.

309. Qin J, Li R et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59-65. doi:10.1038/nature08821.

310. Qin J, Li Y et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490(7418):55-60. doi:10.1038/nature11450.

311. Segal LN, Blaser MJ et al. A brave new world: the lung microbiota in an era of change. Annals of the American Thoracic Society. 2014;11 Suppl 1:S21-S27. doi:10.1513/AnnalsATS.201306-189MG.

312. Sibley CD, Parkins MD et al. A polymicrobial perspective of pulmonary infections exposes an enigmatic pathogen in cystic fibrosis patients. Proceedings of the National Academy of Sciences USA. 2008;105(39):15070-15075. doi:10.1073/pnas.0804326105.

313. Zhao J, Schloss PD et al. Decade-long bacterial community dynamics in cystic fibrosis airways. Proceedings of the National Academy of Sciences USA. 2012;109(15):5809-5814. doi:10.1073/pnas.1120577109.

314. Huang YJ, Sethi S et al. Airway microbiome dynamics in exacerbations of chronic obstructive pulmonary disease. Journal of Clinical Microbiology. 2014;52(8):2813-2823. doi:10.1128/JCM.00035-14.

315. Wang Z, Bafadhel M et al. Lung microbiome dynamics in COPD exacerbations. European Respiratory Journal. 2016;47(4):1082-1092. doi:10.1183/13993003.01406-2015.

316. Cox MJ, Turek EM et al. Longitudinal assessment of sputum microbiome by sequencing of the 16S rRNA gene in non-cystic fibrosis bronchiectasis patients. PLoS One. 2017;12(2):e0170622. doi:10.1371/journal.pone.0170622.

317. Yarza P, Yilmaz P et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology. 2014;12(9):635-645. doi:10.1038/nrmicro3330.

318. Marri PR, Stern DA et al. Asthma-associated differences in microbial composition of induced sputum. The Journal of Allergy and Clinical Immunology. 2013;131(2):346–52.e1–3. doi:10.1016/j.jaci.2012.11.013.

319. Costello EK, Lauber CL et al. Bacterial community variation in human body habitats across space and time. Science. 2009;326(5960):1694-1697. doi:10.1126/science.1177486.

320. Jetté ME, Dill-McFarland KA et al. The human laryngeal microbiome: effects of cigarette smoke and reflux. Scientific Reports. 2016;6:35882. doi:10.1038/srep35882.

321. Sverrild A, Kiilerich P et al. Eosinophilic airway inflammation in asthmatic patients is associated with an altered airway microbiome. The Journal of Allergy and Clinical Immunology. 2016;140(2):407-417.e411. doi:10.1016/j.jaci.2016.10.046.

322. Denner DR, Sangwan N et al. Corticosteroid therapy and airflow obstruction influence the bronchial microbiome, which is distinct from that of bronchoalveolar lavage in asthmatic airways. The Journal of Allergy and Clinical Immunology. 2015;137(5):1398–1405.e3. doi:10.1016/j.jaci.2015.10.017.

323. Simpson JL, Daly J et al. Airway dysbiosis: Haemophilus influenzae and Tropheryma in poorly controlled asthma. European Respiratory Journal. 2015;47(3):ERJ–00405–2015–800. doi:10.1183/13993003.00405-2015.

324. Zhang Q, Cox M et al. Airway microbiota in severe asthma and relationship to asthma severity and phenotypes. PLoS One. 2016;11(4). doi:10.1371/journal.pone.0152724.

325. Durack J, Lynch SV et al. Features of the bronchial bacterial microbiome associated with atopy, asthma, and responsiveness to inhaled corticosteroid treatment. The Journal of Allergy and Clinical Immunology. 2017;140(1):63-75. doi:10.1016/j.jaci.2016.08.055.

326. Yu G, Phillips S et al. The effect of cigarette smoking on the oral and nasal microbiota. Microbiome. 2017;5(1):3. doi:10.1186/s40168-016-0226-6.

327. Wu J, Peters BA et al. Cigarette smoking and the oral microbiome in a large study of American adults. The International Society for Microbial Ecology Journal. 2016;10(10):2435-2446. doi:10.1038/ismej.2016.37.

328. Beman JM, Steele JA et al. Co-occurrence patterns for abundant marine archaeal and bacterial lineages in the deep chlorophyll maximum of coastal California. The International Society for Microbial Ecology Journal. 2011;5(7):1077-1085. doi:10.1038/ismej.2010.204.

329. Lozupone CA, Faust K et al. Identifying genomic and metabolic features that can underlie early successional and opportunistic lifestyles of human gut symbionts. Genome Research. 2012;22(10):1974-1984. doi:10.1101/gr.138198.112.

330. Chaffron S, Rehrauer H et al. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Research. 2010;20(7):947-959. doi:10.1101/gr.104521.109.

331. Faust K, Sathirapongsasuti JF et al. Microbial co-occurrence relationships in the human microbiome. PLoS Computational Biology. 2012;8(7). doi:10.1371/journal.pcbi.1002606.

332. Sansonetti PJ. War and peace at the intestinal epithelial surface: an integrated view of bacterial commensalism versus bacterial pathogenicity. The Journal of Pediatric Gastroenterology and Nutrition. 2008;46 Suppl 1:E6-E7. doi:10.1097/01.mpg.0000313819.96520.27.

333. Parfrey LW, Walters WA et al. Microbial eukaryotes in the human microbiome: ecology, evolution, and future directions. Frontiers in Microbiology. 2011;2:153. doi:10.3389/fmicb.2011.00153.

334. Ley RE, Bäckhed F et al. Obesity alters gut microbial ecology. Proceedings of the National Academy of Sciences USA. 2005;102(31):11070-11075. doi:10.1073/pnas.0504978102.

335. Turnbaugh PJ, Hamady M et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480-U487. doi:10.1038/nature07540.

336. Vrieze A, Van Nood E et al. Transfer of intestinal microbiota from lean donors increases insulin sensitivity in individuals with Metabolic Syndrome. Gastroenterology. 2012;143(4):913–. doi:10.1053/j.gastro.2012.06.031.

337. Ridaura VK, Faith JJ et al. Gut microbiota from twins discordant for obesity modulate metabolism in mice. Science. 2013;341(6150):1079–U49. doi:10.1126/science.1241214.

338. Brook I, Gober AE et al. Recovery of potential pathogens and interfering bacteria in the nasopharynx of smokers and nonsmokers. Chest. 2005;127(6):2072-2075. doi:10.1378/chest.127.6.2072.

339. Horvath S, Langfelder P. Tutorials for the WGCNA package for R: WGCNA background and glossary. 2011. https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/Simulated-00-Background.pdf

340. Bailey P, Chang DK et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature. 2016;531(7592):47-52. doi:10.1038/nature16965.

341. Kristensen VN, Lingjærde OC et al. Principles and methods of integrative genomic analyses in cancer. Nature Reviews Cancer. 2014;14(5):299-313. doi:10.1038/nrc3721.

342. Clarke C, Madden SF et al. Correlating transcriptional networks to breast cancer survival: a large-scale coexpression analysis. Carcinogenesis. 2013;34(10):2300-2308. doi:10.1093/carcin/bgt208.

343. Voineagu I, Wang X et al. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature. 2011;474(7351):380-4. doi:10.1038/nature10110.

344. Hawrylycz MJ, Lein ES et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature. 2012;489(7416):391-399. doi:10.1038/nature11405.

345. Miller JA, Horvath S et al. Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proceedings of the National Academy of Sciences USA. 2010;107(28):12698-12703. doi:10.1073/pnas.0914257107.

346. van Eijk KR, de Jong S et al. Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC Genomics. 2012;13(1). doi:10.1186/1471-2164-13-636.

347. Tong M, Li X et al. A modular organization of the human intestinal mucosal microbiota and its association with inflammatory bowel disease. PLoS One. 2013;8(11):e80702-e80714. doi:10.1371/journal.pone.0080702.

348. Callahan BJ, Sankaran K et al. Bioconductor workflow for microbiome data analysis: from raw reads to community analyses. F1000Res. 2016;5:1492. doi:10.12688/f1000research.8986.1.

349. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11(10). doi:10.1186/gb-2010-11-10-r106.

350. Weiss S, Van Treuren W et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The International Society for Microbial Ecology Journal. 2016;10(7):1669-1681. doi:10.1038/ismej.2015.235.

351. McMurdie PJ, Holmes S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Computational Biology. 2014;10(4):e1003531–12. doi:10.1371/journal.pcbi.1003531.

352. Sokal RR, Michener CD. A Statistical Method for Evaluating Systematic Relationships. Volume 38, Part 2, Issue 22 of University of Kansas science bulletin. 1958. Publisher: University of Kansas

353. Siska C, Kechris K. Differential correlation for sequencing data. BMC Research Notes. 2017;10(1):54. doi:10.1186/s13104-016-2331-9.

354. Benjamini Y, Krieger AM et al. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93(3):491-507. doi:10.1093/biomet/93.3.491.

355. Nelder JA, Baker RJ. Generalized Linear Models. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 1972. doi:10.1002/0471667196.ess0866.pub2.

356. Wilcoxon F. Individual Comparisons by Ranking Methods. Biometrics Bulletin. 1945;1(6):80-83. http://links.jstor.org/sici?sici=0099-4987%28194512%291%3A6%3C80%3AICBRM%3E2.0.CO%3B2-P

357. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. 2005;4(1). doi:10.2202/1544-6115.1128.

358. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. 2005;4(1):Article17. doi:10.2202/1544-6115.1128.

359. Langfelder P, Zhang B et al. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics. 2008;24(5):719-720. doi:10.1093/bioinformatics/btm563.

360. Haro C, Rangel-Zúñiga OA et al. Intestinal microbiota is influenced by gender and body mass index. PLoS One. 2016;11(5):e0154090. doi:10.1371/journal.pone.0154090.

361. Borgo F, Garbossa S et al. Body mass index and sex affect diverse microbial niches within the gut. Frontiers in Microbiology. 2018;9:213. doi:10.3389/fmicb.2018.00213.

362. Langille MGI, Zaneveld J et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology. 2013;31(9):814-821. doi:10.1038/nbt.2676.

363. Balvočiūtė M, Huson DH. SILVA, RDP, Greengenes, NCBI and OTT - how do these taxonomies compare? BMC Genomics. 2017;18(Suppl 2):114. doi:10.1186/s12864-017-3501-4.

364. Kembel SW, Wu M et al. Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Computational Biology. 2012;8(10). doi:10.1371/journal.pcbi.1002743.

365. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 2000;28(1):27-30.

366. Kanehisa M, Goto S et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research. 2014;42(Database issue):D199-D205. doi:10.1093/nar/gkt1076.

367. Kanehisa M. A database for post-genome analysis. Trends Genetics. 1997;13(9):375-376. doi:10.1016/S0168-9525(97)01223-7.

368. Wu GD, Chen J et al. Linking long-term dietary patterns with gut microbial enterotypes. Science. 2011;334(6052):105-108. doi:10.1126/science.1208344.

369. Arumugam M, Raes J et al. Enterotypes of the human gut microbiome. Nature. 2011;473(7346):174-180. doi:10.1038/nature09944.

370. Claesson MJ, Jeffery IB et al. Gut microbiota composition correlates with diet and health in the elderly. Nature. 2012;488(7410):178–. doi:10.1038/nature11319.

371. Yatsunenko T, Rey FE et al. Human gut microbiome viewed across age and geography. Nature. 2012;486(7402):222-227. doi:10.1038/nature11053.

372. Jangi S, Gandhi R et al. Alterations of the human gut microbiome in multiple sclerosis. Nature Communications. 2016;7:12015. doi:10.1038/ncomms12015.

373. Asshauer KP, Wemheuer B et al. Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics. 2015;31(17):2882-2884. doi:10.1093/bioinformatics/btv287.

374. Fratiglioni L, Wang H-X. Smoking and Parkinson’s and Alzheimer’s disease: review of the epidemiological studies. Behavioural Brain Research. 2000;113(1):117-120. doi:10.1016/S0166-4328(00)00206-0.

375. Bagaitkar J, Demuth DR et al. Tobacco use increases susceptibility to bacterial infection. Tobacco Induced Diseases. 2008;4(1):12. doi:10.1186/1617-9625-4-12.

376. Kroll JS, Wilks KE et al. Natural genetic exchange between Haemophilus and Neisseria: intergeneric transfer of chromosomal genes between major human pathogens. Proceedings of the National Academy of Sciences USA. 1998;95(21):12381-12385.

377. Tatusov RL, Mushegian AR et al. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Current Biology. 1996;6(3):279-291. doi:10.1016/S0960-9822(02)00478-5.

378. Afzal M, Shafeeq S et al. UlaR activates expression of the ula operon in Streptococcus pneumoniae in the presence of ascorbic acid. Microbiology. 2015;161(1):41-49. doi:10.1099/mic.0.083899-0.

379. Harrison A, Bakaletz LO et al. Haemophilus influenzae and oxidative stress. Frontiers in Cellular and Infection Microbiology. 2012;2:40. doi:10.3389/fcimb.2012.00040.

380. Pericone CD, Overweg K et al. Inhibitory and bactericidal effects of hydrogen peroxide production by Streptococcus pneumoniae on other inhabitants of the upper respiratory tract. Infection and Immunity. 2000;68(7):3990-3997. doi:10.1128/IAI.68.7.3990-3997.2000.

381. Kwong WK, Zheng H. Convergent evolution of a modified, acetate-driven TCA cycle in bacteria. Nature Microbiology. 2017;2(7):17067. doi:10.1038/nmicrobiol.2017.67.

382. Marsh PD, Moter A et al. Dental plaque biofilms: communities, conflict and control. Periodontology 2000. 2011;55(1):16-35. doi:10.1111/j.1600-0757.2009.00339.x.

383. Lemos JA, Burne RA. A model of efficiency: stress tolerance by Streptococcus mutans. Microbiology. 2008;154(11):3247-3255. doi:10.1099/mic.0.2008/023770-0.

384. He J, Kim D et al. RNA-Seq reveals enhanced sugar metabolism in Streptococcus mutans co-cultured with Candida albicans within mixed-species biofilms. Frontiers in Microbiology. 2017;8(JUN). doi:10.3389/fmicb.2017.01036.

385. Billroth T. Untersuchungen Über Die Vegetationsformen Von Coccobacteria Septica Und Den Antheil, Welchen Sie an Der Entstehung Und Verbreitung Der Accidentellen Wundkrankheiten Haben. Published in 1874, Berlin: G. Reimer. https://lib.ugent.be/catalog/rug01:002036023

386. Ferretti JJ, Stevens D et al. History of streptococcal research. 2016. Ferretti JJ, Stevens DL, Fischetti VA, editors. Oklahoma City (OK): University of Oklahoma Health Sciences Center.

387. Brown JH. The Use of Blood Agar for the Study of Streptococci. Palala Press; 1919.

388. Dochez AR, Avery OT et al. Studies on the biology of Streptococcus: I. Antigenic relationships between strains of Streptococcus haemolyticus. Journal of Experimental Medicine. 1919;30(3):179-213.

389. Lancefield RC. A serological differentiation of human and other groups of hemolytic Streptococci. Journal of Experimental Medicine. 1933;57(4):571-595.

390. Lancefield RC, Dole VP. The properties of T antigens extracted from group A hemolytic Streptococci. Journal of Experimental Medicine. 1946;84(5):449-471.

391. Cho C-Y, Tang Y-H et al. Group B Streptococcal infection in neonates and colonization in pregnant women: An epidemiological retrospective analysis. Journal of Microbiology, Immunology and Infection. 2017. pii: S1684-1182(17)30185-8 doi:10.1016/j.jmii.2017.08.004.

392. Sherman JM. The Streptococci. Bacteriological Reviews. 1937;1(1):3-97.

393. Schleifer KH, Kilpper-Bälz R. Transfer of Streptococcus faecalis and Streptococcus faecium to the genus Enterococcus nom. rev. as Enterococcus faecalis comb. nov. and Enterococcus faecium comb. nov. International Journal of Systematic and Evolutionary Microbiology. 1984;34(1):31-34. doi:10.1099/00207713-34-1-31.

394. Schleifer KH, Kraus J et al. Transfer of Streptococcus-Lactis and related Streptococci to the genus Lactococcus Gen-Nov. Systematic and Applied Microbiology. 1985;6(2):183-195. doi:10.1016/S0723-2020(85)80052-7.

395. Selander RK, Caugant DA et al. Methods of multilocus enzyme electrophoresis for bacterial population-genetics and systematics. Applied Environmental Microbiology. 1986;51(5):873-884.

396. Stanley T, Wilson IG. Multilocus enzyme electrophoresis. Molecular Biotechnology. 2003;24(2):203-220. doi:10.1385/MB:24:2:203.

397. Gevers D, Cohan FM et al. Re-evaluating prokaryotic species. Nature Reviews Microbiology. 2005;3(9):733-739. doi:10.1038/nrmicro1236.

398. Karenlampi R, Rautelin H et al. Evaluation of genetic markers and molecular typing methods for prediction of sources of Campylobacter jejuni and E-coli infections. Applied Environmental Microbiology. 2007;73(5):1683-1685. doi:10.1128/AEM.02338-06.

399. Cooper JE, Feil EJ. Multilocus sequence typing – what is resolved? Trends Microbiology. 2004;12(8):373-377. doi:10.1016/j.tim.2004.06.003.

400. Urwin R, Maiden MCJ. Multi-locus sequence typing: a tool for global epidemiology. Trends Microbiology. 2003;11(10):479-487. doi:10.1016/j.tim.2003.08.006.

401. Maiden M, Bygraves JA et al. Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences USA. 1998;95(6):3140-3145.

402. Enright MC, Spratt BG. Multilocus sequence typing. Trends Microbiology. 1999;7(12):482-487. doi:10.1016/S0966-842X(99)01609-1.

403. Jolley KA, Maiden MCJ. Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010;11(1):595. doi:10.1186/1471-2105-11-595.

404. Godoy D, Randle G et al. Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, Burkholderia pseudomallei and Burkholderia mallei. Journal of Clinical Microbiology. 2003;41(5):2068-2079. doi:10.1128/JCM.41.5.2068-2079.2003.

405. Thompson FL, Gevers D et al. Phylogeny and molecular identification of vibrios on the basis of multilocus sequence analysis. Applied Environmental Microbiology. 2005;71(9):5107-5115. doi:10.1128/AEM.71.9.5107-5115.2005.

406. Devulder G, de Montclos MP et al. A multigene approach to phylogenetic analysis using the genus Mycobacterium as a model. International Journal of Systematic and Evolutionary Microbiology. 2005;55(1):293-302. doi:10.1099/ijs.0.63222-0.

407. Enright MC, Spratt BG. A multilocus sequence typing scheme for Streptococcus pneumoniae: identification of clones associated with serious invasive disease. Microbiology. 1998;144(11):3049-3060. doi:10.1099/00221287-144-11-3049.

408. Johnson CN, Benjamin WH Jr et al. Genetic relatedness of levofloxacin-nonsusceptible Streptococcus pneumoniae isolates from North America. Journal of Clinical Microbiology. 2003;41(6):2458-2464. doi:10.1128/JCM.41.6.2458-2464.2003.

409. Kilian M, Poulsen K et al. Evolution of Streptococcus pneumoniae and its close commensal relatives. PLoS One. 2008;3(7):e2683. doi:10.1371/journal.pone.0002683.

410. Bishop CJ, Aanensen DM et al. Assigning strains to bacterial species via the internet. BMC Biology. 2009;7(1):3. doi:10.1186/1741-7007-7-3.

411. Cardenas PA. Upper airways microbiota profiling in a case/control study between wheezing and healthy children from the tropics of Ecuador. 2014:1-295. http://hdl.handle.net/10044/1/23951.

412. Janda WM. The Genus Streptococcus – Part I: Emerging pathogens in the “Pyogenic Cocci” and the ‘Streptococcus bovis’ groups. Clinical Microbiology Newsletter. 2014;36(20):157-166. doi:10.1016/j.clinmicnews.2014.10.001.

413. Whiley RA, Beighton D et al. Streptococcus intermedius, Streptococcus constellatus, and Streptococcus anginosus (the Streptococcus milleri group): association with different body sites and clinical infections. Journal of Clinical Microbiology. 1992;30(1):243-244.

414. Lamas CC, Eykyn SJ. Blood culture negative endocarditis: analysis of 63 cases presenting over 25 years. Heart. 2003;89(3):258-262.

415. Aziz RK, Kansal R et al. Microevolution of group A streptococci in vivo: capturing regulatory networks engaged in sociomicrobiology, niche adaptation, and hypervirulence. PLoS One. 2010;5(4):e9798. doi:10.1371/journal.pone.0009798.

416. Skerman VBD, McGowan V. Approved Lists of Bacterial Names. Washington (DC): ASM Press; 1980.

417. Kawamura Y, Hou XG et al. Streptococcus peroris sp. nov. and Streptococcus infantis sp. nov., new members of the Streptococcus mitis group, isolated from human clinical specimens. International Journal of Systematic Bacteriology. 1998;48(3):921-927. doi:10.1099/00207713-48-3-921.

418. Whiley RA, Fraser HY et al. Streptococcus-Parasanguis Sp-Nov, an atypical Viridans Streptococcus from human clinical specimens. FEMS Microbiology Letters. 1990;68(1-2):115-122. doi:10.1016/0378-1097(90)90135-D.

419. Peng Z, Fives-Taylor P et al. Identification of critical residues in Gap3 of Streptococcus parasanguinis involved in Fap1 glycosylation, fimbrial formation and in vitro adhesion. BMC Microbiology. 2008;8(1). doi:10.1186/1471-2180-8-52.

420. Arbique JC, Poyart C et al. Accuracy of phenotypic and genotypic testing for identification of Streptococcus pneumoniae and description of Streptococcus pseudopneumoniae sp. nov. Journal of Clinical Microbiology. 2004;42(10):4686-4696. doi:10.1128/JCM.42.10.4686-4696.2004.

421. Upton M, Tagg JR et al. Intra- and interspecies signaling between Streptococcus salivarius and Streptococcus pyogenes mediated by SalA and SalA1 antibiotic peptides. Journal of Bacteriology. 2001;183(13):3931-3938. doi:10.1128/JB.183.13.3931-3938.2001.

422. Rodriguez AM, Callahan JE et al. Physiological and molecular characterization of genetic competence in Streptococcus sanguinis. Molecular Oral Microbiology. 2011;26(2):99-116. doi:10.1111/j.2041-1014.2011.00606.x.

423. Kilian M, Mikkelsen L et al. Taxonomic study of Viridans Streptococci - description of Streptococcus-gordonii sp-nov and emended descriptions of Streptococcus-sanguis (White and Niven 1946), Streptococcus-oralis (Bridge and Sneath 1982), and Streptococcus-mitis (Andrewes and Horder 1906). International Journal of Systematic Bacteriology. 1989;39(4):471-484. doi:10.1099/00207713-39-4-471.

424. Bensing BA, Rubens CE et al. Genetic loci of Streptococcus mitis that mediate binding to human platelets. Infection and Immunity. 2001;69(3):1373-1380. doi:10.1128/IAI.69.3.1373-1380.2001.

425. Kearse M, Moir R et al. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28(12):1647-1649. doi:10.1093/bioinformatics/bts199.

426. Wang Q, Garrity GM et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied Environmental Microbiology. 2007;73(16):5261-5267. doi:10.1128/AEM.00062-07.

427. Gao X, Lin H et al. A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy. BMC Bioinformatics. 2017;18(1):247. doi:10.1186/s12859-017-1670-4.

428. Altschul SF, Gish W et al. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403-410. doi:10.1016/S0022-2836(05)80360-2.

429. Fisher R. Statistical methods and scientific induction. Journal of the Royal Statistical Society Series B (Methodological). 1955;17(1):69-78.

430. Kumar S, Stecher G et al. MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for bigger datasets. Molecular Biology and Evolution. 2016;33(7):1870-1874. doi:10.1093/molbev/msw054.

431. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 1987;4(4):406-425. doi:10.1093/oxfordjournals.molbev.a040454.

432. Mason MR, Preshaw PM et al. The subgingival microbiome of clinically healthy current and never smokers. The ISME Journal. 2015;9(1):268-272. doi:10.1038/ismej.2014.114.

433. Carlsson J, Grahnén H et al. Early establishment of Streptococcus salivarius in the mouths of infants. Journal of Dental Research. 1970;49(2):415-418. doi:10.1177/00220345700490023601.

434. Horz H-P, Meinelt A et al. Distribution and persistence of probiotic Streptococcus salivarius K12 in the human oral cavity as determined by real-time quantitative polymerase chain reaction. Oral Microbiology and Immunology. 2007;22(2):126-130. doi:10.1111/j.1399-302X.2007.00334.x.

435. Cosseau C, Devine DA et al. The commensal Streptococcus salivarius K12 downregulates the innate immune responses of human epithelial cells and promotes host-microbe homeostasis. Infection and Immunity. 2008;76(9):4163-4175. doi:10.1128/IAI.00188-08.

436. Burton JP, Wescombe PA et al. Safety assessment of the oral cavity probiotic Streptococcus salivarius K12. Applied and Environmental Microbiology. 2006;72(4):3050-3053. doi:10.1128/AEM.72.4.3050-3053.2006.

437. Burton JP, Chilcott CN et al. Extended safety data for the oral cavity probiotic Streptococcus salivarius K12. Probiotics and Antimicrobial Proteins. 2010;2(3):135-144. doi:10.1007/s12602-010-9045-4.

438. Pearce C, Bowden GH et al. Identification of pioneer Viridans Streptococci in the oral cavity of human neonates. Journal of Medical Microbiology. 1995;42(1):67-72. doi:10.1099/00222615-42-1-67.

439. Fitzsimmons S, Evans M et al. Clonal diversity of Streptococcus mitis biovar 1 isolates from the oral cavity of human neonates. Clinical and Diagnostic Laboratory Immunology. 1996;3(5):517-522.

440. Hohwy J, Reinholdt J et al. Population dynamics of Streptococcus mitis in its natural habitat. Infection and Immunity. 2001;69(10):6055-6063. doi:10.1128/IAI.69.10.6055-6063.2001.

441. Cole MF, Evans M et al. Pioneer oral streptococci produce immunoglobulin A1 protease. Infection and Immunity. 1994;62(6):2165-2168.

442. Kolenbrander PE, London J. Adhere today, here tomorrow: oral bacterial adherence. Journal of Bacteriology. 1993;175(11):3247.

443. Denapaite D, Brückner R et al. The Genome of Streptococcus mitis B6 - What Is a Commensal? PLoS One. 2010;5(2):e9426. doi:10.1371/journal.pone.0009426.

444. Mitchell J. Streptococcus mitis: walking the line between commensalism and pathogenesis. Molecular Oral Microbiology. 2011;26(2):89-98. doi:10.1111/j.2041-1014.2010.00601.x.

445. Greenberg D, Givon-Lavi N et al. The contribution of smoking and exposure to tobacco smoke to Streptococcus pneumoniae and Haemophilus influenzae carriage in children and their mothers. Clinical Infectious Diseases. 2006;42(7):897-903. doi:10.1086/500935.

446. Phipps JC, Aronoff DM et al. Cigarette smoke exposure impairs pulmonary bacterial clearance and alveolar macrophage complement-mediated phagocytosis of Streptococcus pneumoniae. Infection and Immunity. 2010;78(3):1214-1220. doi:10.1128/IAI.00963-09.

447. Corby PM, Lyons-Weiler J et al. Microbial risk indicators of early childhood caries. Journal of Clinical Microbiology. 2005;43(11):5753-5759. doi:10.1128/JCM.43.11.5753-5759.2005.

Appendices

Chapter 2
Appendix 2 Study Questionnaire

Chapter 3
Appendix 3.1 List of Identified Contaminants
Appendix 3.2 Indicator Species Analysis and DESeq2 Results
    3.2.1 Asthmatics vs. Healthy Individuals
    3.2.2 Diabetics vs. Healthy Individuals
    3.2.3 GERD vs. Healthy Individuals
    3.2.4 Current-Smokers vs. Never-Smokers
    3.2.5 Current-Smokers vs. Ex-Smokers
    3.2.6 Ex-Smokers vs. Never-Smokers

Chapter 4
Appendix 4.1 WCNA Module Summary
    4.1.1 Silva Database
        4.1.1.1 Black Module
        4.1.1.2 Blue Module
        4.1.1.3 Brown Module
        4.1.1.4 Cyan Module
        4.1.1.5 Green Module
        4.1.1.6 Green Yellow Module
        4.1.1.7 Light Cyan Module
        4.1.1.8 Light Green Module
        4.1.1.9 Light Yellow Module
        4.1.1.10 Magenta Module
        4.1.1.11 Purple Module
        4.1.1.12 Salmon Module
        4.1.1.13 Yellow Module
    4.1.2 Greengenes Database
        4.1.2.1 Black Module
        4.1.2.2 Brown Module
        4.1.2.3 Green Module
        4.1.2.4 Pink Module
        4.1.2.5 Red Module
        4.1.2.6 Turquoise Module
Appendix 4.2 Greengenes Cluster Dendrogram
Appendix 4.3 KEGG 3 Functions Relationship with Current Smoking

Chapter 5
Appendix 5.1 Mock Community Species
Appendix 5.2 List of Identified Streptococci Contaminants
Appendix 5.3 Significant Streptococci OTUs
Appendix 5.4 Streptococci WCNA Module Summary
    5.4.1 Summary of all Modules
    5.4.2 Blue Module
    5.4.3 Light Green Module
    5.4.4 Pale Turquoise Module
    5.4.5 Tan Module
    5.4.6 Turquoise Module
Appendix 5.5 Complete OTU Analysis Tree
