University of Groningen

The interplay between genetics, the microbiome, DNA‐methylation & gene‐expression Bonder, Marc Jan

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record

Publication date: 2017


Citation for published version (APA): Bonder, M. J. (2017). The interplay between genetics, the microbiome, DNA‐methylation & gene‐expression. University of Groningen.


The interplay between genetics, the microbiome, DNA-methylation & gene-expression

Marc Jan Bonder


Second edition.

Cover design by Dennis Woering (Woewal Design). Printed by NetzoDruk Groningen

©Marc Jan Bonder & Dennis Woering (Woewal Design). All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means without permission of the author.

ISBN: 978-90-367-9601-9 / 978-90-367-9602-6

The interplay between genetics, the microbiome, DNA-methylation & gene-expression

PhD thesis

to obtain the degree of PhD at the University of Groningen on the authority of the Rector Magnificus Prof. E. Sterken and in accordance with the decision by the College of Deans.

This thesis will be defended in public on Wednesday 22 March 2017 at 12:45 hours

by

Marc Jan Bonder

born on 23 March 1989 in Tolbert, the Netherlands

Supervisors
Prof. C. Wijmenga
Prof. L. Franke

Co-supervisor
Dr. A. Zhernakova

Assessment committee
Prof. H.M. Boezen
Prof. E.P.J.G. Cuppen
Prof. P. van der Harst

Paranymphs
S.C.R. Bonder
P. Deelen

Propositions

1. The integration of multiple biological data layers, for example genetic, transcriptomic, methylome and the microbiome data, leads to a more complete understanding of biological effects. (this thesis)

2. Both genetics and the environment influence the microbiome, DNA-methylation & gene-expression. (this thesis)

3. Commonly used medication can have negative influences on your “health”. (this thesis)

4. Large-scale (perturbation) studies on the gut microbiome are needed to accurately identify relations between the host and the gut microbiome. (this thesis)

5. The gut microbiome is an attractive target for therapies aimed at improving lipid levels in blood. (this thesis)

6. Genetic risk factors affect both gene-expression and DNA-methylation, at a local and a distal level. (this thesis)

7. Genetic effects on expression and DNA-methylation are tissue-specific. (this thesis)

8. DNA-methylation changes reflect the altered abundance of transcription factors. (this thesis)

9. Analysis of individual genome-, methylome-, microbiome- and transcriptome profiles will become something to be performed on a daily basis for health monitoring.

10. In the coming years, integrative analysis of big data on DNA-methylation, the microbiome, gene expression and genetics will substantially change the healthcare system, all the way from your general practitioner to the ICU.

11. Open access science publications and open access to data and methods will greatly speed up the changes in healthcare and yield major benefits to the public and researchers.

12. A scientist works with others to discover the world around them; science is teamwork.

Table of contents

GENERAL INTRODUCTION
THE INFLUENCE OF A SHORT-TERM GLUTEN-FREE DIET ON THE HUMAN GUT MICROBIOME
PROTON PUMP INHIBITORS AFFECT THE GUT MICROBIOME
THE GUT MICROBIOME CONTRIBUTES TO A SUBSTANTIAL PROPORTION OF THE VARIATION IN BLOOD LIPIDS
POPULATION-BASED METAGENOMICS ANALYSIS REVEALS MARKERS FOR GUT MICROBIOME COMPOSITION AND DIVERSITY
THE EFFECT OF HOST GENETICS ON THE GUT MICROBIOME
GENETIC AND EPIGENETIC REGULATION OF GENE EXPRESSION IN FETAL AND ADULT HUMAN LIVERS
IMPROVING PHENOTYPIC PREDICTION BY COMBINING GENETIC AND EPIGENETIC ASSOCIATIONS
DISEASE VARIANTS ALTER TRANSCRIPTION FACTOR LEVELS AND METHYLATION OF THEIR BINDING SITES
DISCUSSION
APPENDICES

General introduction

In recent years we have been highly successful in identifying the genetic basis of disease [1]. DNA genotyping and, in particular, sequencing technologies have advanced tremendously and prices have come down substantially. It is now possible to collect genotyping information on large numbers of patients and controls, which enables researchers to identify genetic variants that are associated with (complex) diseases or traits by conducting so-called genome-wide association studies (GWAS). For instance, for inflammatory bowel disease over a hundred genetic variants have been identified by systematically comparing thousands of patients with thousands of controls [2,3]. However, although GWAS have identified hundreds of associated loci, they do not provide mechanistic information on how these variants ultimately cause disease. This is particularly challenging, since the majority of the identified GWAS variants do not change protein structure, but are non-coding and must thus have regulatory effects.
To gain a better functional understanding of these variants, quantitative trait loci (QTL) mapping studies are now being conducted [4]. The most common form of QTL mapping is expression QTL (eQTL) mapping, which allows us to link a genetic variant to its effects on gene expression. Two types of eQTLs have been defined: cis (local) QTL effects and trans (distal) QTL effects. To date, most large-scale eQTL studies have been performed in blood, since it is easy to collect from patients and controls. However, the cis and trans eQTLs identified can be very tissue- and context-specific, so the effects observed in blood might not be representative of expression in other tissues [5]. In the largest trans-eQTL study to date [6], 233 GWAS-associated variants have been linked to expression variation, giving insights into the mechanism of action of these variants.

Besides expression, much effort has been put into mapping the genetic influence on DNA-methylation. DNA-methylation is a key component of the epigenome. By studying DNA-methylation levels in a genomic region, one can gain insight into the regulatory potential of that region. Using DNA-methylation QTL (meQTL) mapping, we acquire data complementary to eQTLs [7], which also helps to provide more insight into the downstream effects of genetic variation in health and disease.

However, the great majority of complex diseases are not solely caused by genetic factors, but also by environmental factors. Unfortunately, for many diseases, these environmental risk factors still need to be identified. A major challenge is that in many diseases it is not yet clear what these environmental risk factors might be or even how they can be identified. Paradoxically, a promising way to overcome this problem is to take advantage of the massive improvement in genotyping technologies.
For instance, DNA methylation chips provide information on over 485,000 different CpG sites at once, and the variation in measured DNA methylation can be a strong proxy for phenotypic status or environmental exposures: CpG sites have now been found to be near-perfect proxies for age, and many other associations are being determined through epigenome-wide association studies (EWAS) [8]. This suggests that other CpG sites might be representative proxies for further environmental factors, some of which might represent risk factors for certain diseases.

Improvements in DNA sequencing now also make it possible to identify and quantify micro-organisms. There are comparable numbers of microbial and human cells within the human body and on its surface [9], but there are roughly 150 times more microbial genes than human genes [10]. The microbiome is a collection of bacteria, archaea and viruses living together in a community, and they collectively perform important functions for the host. The largest fraction of the human microbiome is located in the digestive tract, where it has important functions in metabolism and has also been shown to interact with the immune system. The gut microbiome is linked to multiple environmental and intrinsic factors, for instance age [11], gender [12] and diet [13]. In inflammatory bowel disease, differences in the composition of the microbiome have been linked to the disease. This suggests that the onset and progression of disease could be altered by changing the microbiome.

The aim of this thesis was to use the new biological data based on sequencing technology to study the role of genetic and environmental factors in disease, and to ascertain how genetic variation and environmental factors are related to variations in phenotype, gene expression, methylation and microbial composition.

In the first part of this thesis, the relationship between several phenotypic factors and the microbiome was studied. Studies on the links between variation in the human microbiome and diet (chapter two), medication use (chapter three) and lipid levels (chapter four) are presented. Chapter five presents an integrative analysis in which 126 exogenous and intrinsic factors influencing the gut microbiome were identified. In chapter six, we report on how the host genome influences the microbiome composition.

In the second part of the thesis, the relationships between DNA-methylation, gene expression, phenotypes and genetic variation were studied. Chapter seven describes the link between genetic variation and DNA-methylation and expression levels in several different tissues. In chapter eight, epigenomic and genomic risk scores were used to explain variation in individual traits like height and BMI. In chapter nine we studied the effects of genetic risk factors on downstream molecular data. To investigate the effects of the risk factors we integrated multiple omics layers, including DNA-methylation, RNA-sequencing, gene expression and genetics.

In the third part of the thesis, I discuss the results and present a combined interpretation of the results from the previous chapters. More specifically, I discuss two approaches that can be used for multi-omics integration studies. These results showcase future research possibilities and highlight ways to gain more insight into health and disease by integrating multiple biological omics data.

Definitions

Epigenome: The set of chemical compounds attached to the DNA.
Microbiome: The set of micro-organisms, as represented by the genetic information, in a particular environment.
Locus: A site on the human genome.
GWAS: Genome-Wide Association Study, the study of relating variation in the human genome to a trait or disease.
EWAS: Epigenome-Wide Association Study, the study of relating variation in the human epigenome to a trait or disease.
QTL: Quantitative Trait Locus, a locus in the human genome having a relation with a quantitative trait, such as expression (eQTL), DNA-methylation (meQTL), or the microbiome (miQTL).
CpG site: A site on the human genome where a cytosine is followed by a guanine; this is a site where DNA-methylation can take place.
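To make the QTL definition concrete, the following minimal Python sketch estimates the effect of one variant on one quantitative trait as the least-squares slope of the trait on allele dosage. It is only an illustration under stated assumptions: the function name `qtl_slope` and the six data points are hypothetical, no covariates, significance test or multiple-testing correction are included, and it is not the mapping pipeline used in this thesis.

```python
from statistics import mean


def qtl_slope(dosages, trait):
    """Least-squares slope of a quantitative trait on allele dosage (0, 1 or 2)."""
    if len(dosages) != len(trait):
        raise ValueError("dosages and trait must have the same length")
    d_mean, t_mean = mean(dosages), mean(trait)
    # Sum of squared deviations of the genotype dosage.
    sxx = sum((d - d_mean) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("genotype does not vary in this sample")
    # Sum of cross-products between dosage and trait.
    sxy = sum((d - d_mean) * (t - t_mean) for d, t in zip(dosages, trait))
    return sxy / sxx


# Hypothetical data: six individuals, dosage of the alternative allele and
# expression of a nearby gene (arbitrary units).
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope = qtl_slope(dosages, expression)
print(f"estimated change in expression per allele: {slope:.2f}")
```

The same calculation applies to an meQTL or miQTL if the expression values are replaced by methylation levels or by the abundance of a microbial taxon.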

References

1. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001-6 (2014).
2. Jostins, L. et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491, 119–124 (2012).
3. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
4. Edwards, S. L., Beesley, J., French, J. D. & Dunning, A. M. Beyond GWASs: illuminating the dark road from association to function. Am. J. Hum. Genet. 93, 779–797 (2013).
5. Fu, J. et al. Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genet. 8, e1002431 (2012).
6. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243 (2013).
7. Banovich, N. E. et al. Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels. PLoS Genet. 10, 1–12 (2014).
8. Rakyan, V. K., Down, T. A., Balding, D. J. & Beck, S. Epigenome-wide association studies for common human diseases. Nat. Rev. Genet. 12, 529–541 (2011).
9. Sender, R. et al. Are we really vastly outnumbered? Revisiting the ratio of bacterial to host cells in humans. Cell 164, 337–340 (2016).
10. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
11. Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
12. Yurkovetskiy, L. et al. Gender bias in autoimmunity is influenced by microbiota. Immunity 39, 400–412 (2013).
13. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–563 (2014).

The influence of a short-term gluten-free diet on the human gut microbiome

Genome Medicine, DOI: 10.1186/s13073-016-0295-y

Marc Jan Bonder1*, Ettje F. Tigchelaar1,2*, Xianghang Cai3*, Gosia Trynka4, Maria C. Cenit1, Barbara Hrdlickova1, Huanzi Zhong3, Tommi Vatanen5,6, Dirk Gevers5, Cisca Wijmenga1,2, Yang Wang3#, Alexandra Zhernakova1,2#

Abstract

Background
A gluten-free diet (GFD) is the most commonly adopted special diet worldwide. It is an effective treatment for coeliac disease, and is also often followed by individuals to alleviate gastrointestinal complaints. It is known that there is an important link between diet and the gut microbiome, but it is largely unknown how a switch to a GFD affects the human gut microbiome.

Methods
We studied changes in the gut microbiomes of 21 healthy volunteers who followed a GFD for four weeks. We collected nine stool samples from each participant: one at baseline, four during the GFD period, and four when they returned to their habitual diet (HD), making a total of 189 samples. We determined microbiome profiles using 16S rRNA sequencing and then processed the samples for taxonomic and imputed functional composition. Additionally, in all 189 samples, six gut health-related biomarkers were measured.

Results
Inter-individual variation in the gut microbiota remained stable during this short-term GFD intervention. A number of taxon-specific differences were seen during the GFD: the most striking shift was seen for the family Veillonellaceae (class Clostridia), which was significantly reduced during the intervention (p = 2.81 × 10^-5). Seven other taxa also showed significant changes; the majority of them are known to play a role in starch metabolism. We saw stronger differences in pathway activities: 21 predicted pathway activity scores were significantly associated with the change in diet. We observed strong relations between the predicted activity of pathways and biomarker measurements.

Conclusions
A gluten-free diet changes the gut microbiome composition and alters the activity of microbial pathways.

1. University of Groningen, University Medical Centre Groningen, Department of Genetics, Groningen, the Netherlands; 2. Top Institute Food and Nutrition, Wageningen, the Netherlands; 3. BGI-Shenzhen, 518083, China; 4. Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK; 5. Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA; 6. Department of Computer Science, Aalto University School of Science, 02150 Espoo, Finland; *Authors contributed equally; #Authors contributed equally; Correspondence to: Alexandra Zhernakova, E-mail: [email protected]

Background

Gluten is a major dietary component of wheat, barley and rye. In genetically susceptible individuals, the consumption of gluten triggers the development of coeliac disease, an autoimmune disorder commonly seen in populations of European ancestry (with a frequency of approximately 1%) [1]. In the absence of any medication, the only treatment is a life-long gluten-free diet (GFD), which is effective and well tolerated by the majority of patients. Non-coeliac gluten sensitivity, another common disorder linked to the consumption of gluten-containing food and resulting in a range of symptoms of intestinal discomfort (such as diarrhea and abdominal pain), has also been shown to improve on a GFD [2,3]. More recently, a GFD has been considered as a way to ameliorate symptoms in patients with irritable bowel syndrome (IBS) [4].

However, beyond these medical indications, more and more individuals are starting on a GFD to improve their health and/or to control weight. The diet's popularity has risen rapidly in the last few years, making it one of the most popular diets worldwide, along with a low-carbohydrate diet and a fat-free diet. The number of people adopting the diet for non-medical reasons now surpasses the number of those who are addressing a permanent gluten-related disorder [3].
Several studies have reported the effect of a GFD on the composition of the gut microbiome in coeliac disease patients [5–7]. In these studies, the microbiome composition in coeliac patients on a GFD was compared with that of untreated patients and healthy individuals. The most consistent observation across these studies is the difference in the abundance and diversity of Lactobacillus and Bifidobacterium between treated and untreated coeliac disease patients. It should be noted that these studies were relatively small (7-30 subjects in each group). Specifically, De Palma et al. [8] assessed the effect of a 1-month GFD on ten healthy individuals, but the study was limited to non-sequence-based methods, including FISH and qPCR. Their study described how Bifidobacterium, Clostridium lituseburense, Faecalibacterium prausnitzii, Lactobacillus and Bifidobacterium longum were decreased during GFD, whereas Escherichia coli, Enterobacteriaceae and Bifidobacterium angulatum were increased. To the best of our knowledge, there has been no comprehensive analysis of the effect of a GFD on the entire gut microbiome composition using a next-generation sequencing approach.

The effect of other diet interventions on the microbiome composition was recently studied using the 16S rRNA sequencing method [9]. In particular, it was shown that a short-term animal-based diet led to an increased abundance of bile-tolerant microorganisms (Alistipes, Bilophila and Bacteroides) and a decreased abundance of Firmicutes that metabolize dietary plant polysaccharides (Roseburia, Eubacterium rectale and Ruminococcus bromii) [9].

In this work we assessed the effect of a GFD on the gut microbiota using next-generation 16S rRNA sequencing. The analysis was performed in 189 samples, representing up to nine time points for 21 individuals. We investigated the diet-related changes at the level of both taxonomic units and predicted bacterial pathways. In addition, we measured a set of selected biomarkers to assess gut health in relation to changes in bacterial composition and their association with a GFD. Our study offers insights into the interaction between the gut microbiota and a GFD.

Results

Food Intake

We first investigated whether a GFD had a significant effect on the daily intake of macronutrients by analyzing the GFD and HD food records from participants (Additional file 2: Table S1). Mean (SD) daily intakes of energy, protein, fat and carbohydrate during GFD and HD are shown in table 1. We observed a slightly higher carbohydrate intake and a slightly lower fat intake on GFD; however, none of the differences in energy or macronutrient intake were significant. We therefore concluded that dietary macronutrient composition was not significantly changed by following a GFD.

Table 1. Mean and standard deviation (SD) of energy, protein, carbohydrate and fat intake during the gluten-free diet (GFD) and habitual diet (HD). g = grams, en% = energy %.

                      GFD (n=12)         HD (n=12)
Nutrient              Mean      SD       Mean      SD       P-value
Energy (kcal)         1709.5    344.0    1811.5    433.9    0.243
Protein (g)           73.1      18.4     78.1      18.2     0.401
Protein (en%)         17.1               17.2
Carbohydrates (g)     211.1     50.3     199.9     63.2     0.275
Carbohydrates (en%)   49.4               44.1
Fat (g)               63.7      18.1     72.5      24.3     0.109
Fat (en%)             33.6               36.0

Microbial Differences Due To Diet

In total we used 155 fecal samples, originating from 21 individuals, for the microbiota analysis, and we observed 114 different taxonomic units. We first checked whether GFD influenced the number and proportion of bacteria in individual subjects by investigating differences in alpha diversity between the GFD and HD time points using several alpha diversity measures (observed species, Shannon, Chao1 and Simpson indexes). We found no differences in alpha diversity in any of these tests. Therefore, we concluded that a change in diet did not influence the bacterial diversity within a sample. Next we tested whether there was any difference in the bacterial diversity related to variation in diet between participants (beta-diversity) by comparing the unweighted UniFrac distance in sample groups.
We observed a strong difference when comparing different time points from a single individual with all other individuals, regardless of diet type (Wilcoxon P-value < 2.2 × 10^-16). When we compared the diet-induced differences within the same individual, we saw a small but significant change (Wilcoxon P-value = 0.024), with time points on the same diet being slightly more alike (Additional file 3: Figure S2). In the PCoA analysis over the unweighted UniFrac distance (figure 1A), we also saw that the main driver of the diversity is the inter-individual difference, with samples from the same subject clustering together, both during and after the dietary intervention. In the first ten principal coordinates, which explain more than half of the total variation, we observed changes between the time points for individual participants, although no single coordinate, or combination of coordinates, captured the difference between the GFD and HD time points. We therefore concluded that a GFD has a significant effect on the diversity between the groups, but that the inter-individual effect on the variation of the microbiome is stronger than the effect of diet.
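As an illustration of the within-sample (alpha) diversity measures named above, the following Python sketch computes observed richness, the Shannon index and the Simpson index from a vector of taxon counts. It is a minimal sketch under stated assumptions: the text does not specify which variants of the indexes were used, so the natural logarithm for Shannon and the 1 - sum(p^2) form for Simpson are assumptions, Chao1 is omitted, and the function name and both example samples are hypothetical.

```python
import math


def alpha_diversity(counts):
    """Return (observed richness, Shannon index, Simpson index) for one sample.

    Assumptions: Shannon uses the natural logarithm and Simpson is
    reported as 1 - sum(p_i^2); other conventions exist.
    """
    present = [c for c in counts if c > 0]
    total = sum(present)
    if total == 0:
        raise ValueError("sample has no reads")
    props = [c / total for c in present]
    observed = len(present)                        # number of taxa seen
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1.0 - sum(p * p for p in props)
    return observed, shannon, simpson


# Hypothetical samples with the same richness but different evenness.
even_sample = [25, 25, 25, 25, 0]
skewed_sample = [97, 1, 1, 1, 0]

for name, sample in (("even", even_sample), ("skewed", skewed_sample)):
    obs, sh, si = alpha_diversity(sample)
    print(f"{name}: observed={obs}, Shannon={sh:.3f}, Simpson={si:.3f}")
```

Both example samples contain four taxa, but the evenly distributed sample scores higher on the Shannon and Simpson indexes, which is why richness and evenness-aware measures are usually reported side by side.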

[Figure 1 image not reproduced. Panel A: principal coordinate 1 (12%) versus principal coordinate 2 (9%), with gluten-free diet time points highlighted. Panel B: principal coordinate 1 (12%) versus time points, with high-richness and low-richness groups.]

Figure 1. PCoA plot showing the differences in the samples. a) Samples plotted on PCoA 1 and 2; the percentage of explained variation is given in the legends. Each color represents an individual; the larger and less opaque spheres are gluten-free diet time points, and the smaller spheres in the same color are habitual diet time points. b) The differences in the first component over the time points. There are two groups based on richness, i.e. high versus low; one individual had samples in both groups. The individual with samples in both richness groups is shown in a bolder color.

We further investigated changes in beta-diversity in relation to the time points (figure 1B). When we plotted PCo1 against the time points, we observed a separation into two groups. Since PCo1 describes the difference in alpha-diversity between samples, we concluded that this separation is based on richness. The richness separates all but one participant into either a clear high-richness or a clear low-richness group (figure 1B). There is a significant difference in richness between the two groups (Wilcoxon P-value = 0.0016), excluding the one participant who seems to be intermediate. However, unlike the study by Le Chatelier et al. [10], we did not see any significant difference in stability, i.e. in variation in richness, between the low- and high-richness groups.

Differentially abundant taxa

When comparing the HD and GFD time points, corrected for age and ethnicity in MaAsLin, we observed eight significant microbial changes (figure 2 and table 2). The strongest association was with the family Veillonellaceae, whose abundance in the gut dropped significantly on a GFD (p = 2.81 × 10^-5, q = 0.003) (figure 2B and Additional file 4: Figure S3). Other taxa that decreased on a GFD were the species Ruminococcus bromii (p = 0.0003, q = 0.01) and Roseburia faecis (p = 0.002, q = 0.03). In contrast, the families Victivallaceae (p = 0.0002, q = 0.01), Clostridiaceae (p = 0.0006, q = 0.015) and Coriobacteriaceae (p = 0.003, q = 0.035), the order ML615J-28 (p = 0.001, q = 0.027) and the genus Slackia (p = 0.002, q = 0.03) increased in abundance on a GFD. Next we tested for trends during the diet change; however, we did not observe a time-dependent change in the microbiome composition. Since we observed two different groups based on richness in the PCoA analysis, we tested whether the high-richness and low-richness groups reacted differently to the change in diet. However, no significant associations were found in this analysis.
Since six out of the 28 participants smoked, we tested for overlap between smoking-associated bacteria and diet-related bacteria. We did not find any overlap; Additional file 5: Table S2 shows the bacteria associated with smoking.
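The associations above are reported with both a P-value and a Q-value, and the abundance axis of figure 2B is labelled as arcsine-transformed. The Python sketch below illustrates both ideas under explicit assumptions: the text does not name the correction procedure, so the Benjamini-Hochberg false discovery rate is assumed here, the arcsine square-root form of the transformation is likewise an assumption, and the function names and example P-values are hypothetical. The Q-values in table 2 cannot be reproduced from this sketch because they depend on the P-values of all tested taxa.

```python
import math


def arcsine_sqrt(relative_abundance):
    """Variance-stabilising transform for a proportion in [0, 1]."""
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative abundance must lie between 0 and 1")
    return math.asin(math.sqrt(relative_abundance))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (Q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that each Q-value
    # is never larger than the Q-value of a less significant test.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical P-values for four tested taxa.
example_p = [0.01, 0.04, 0.03, 0.20]
example_q = benjamini_hochberg(example_p)
for p, q in zip(example_p, example_q):
    print(f"p={p:.2f} -> q={q:.4f}")
```

Under this assumed procedure a Q-value is always at least as large as its P-value, which matches the pattern of the reported values.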

[Figure 2 image not reproduced. Panel A: cladogram marking taxa higher during the gluten-free diet and taxa higher during the habitual diet. Panel B: k_Bacteria|p_Firmicutes|c_Clostridia|o_Clostridiales|f_Veillonellaceae; arcsine-transformed abundance, corrected for gender and ethnicity, for the gluten-free diet (n=74) and the habitual diet (n=81); P value: 2.81e−05, Q value: 0.003.]

Figure 2. A) Cladogram showing the differentially abundant taxa. This plot shows the different levels of taxonomy. Blue indicates bacteria higher in the habitual diet and red indicates those higher in the gluten-free diet. The different circles represent the different taxonomic levels (from inside to outside: kingdom, phylum, class, order, family, genus and species). B) Comparison of the abundance of Veillonellaceae* in the gluten-free diet vs. the habitual diet. In the plot, the aggregate 'overall weeks' including correction is shown. * Veillonellaceae is placed in the order Clostridiales in GreenGenes 13.5. However, according to the NCBI classification, it belongs to order Negativicutes.

Table 2. GFD-induced changes in taxonomic composition. A positive coefficient means more of the microbe was present during the habitual diet, while a negative coefficient means less of the microbe was present during the habitual diet. All associations are within the kingdom Bacteria; for readability the kingdom label is not shown. * Veillonellaceae is placed in the order Clostridiales in GreenGenes 13.5. However, according to the NCBI classification, it belongs to order Negativicutes.

Taxonomic unit                                                                       Coeff.    N.not.0/N   P-value        Q-value
p_Firmicutes|c_Clostridia|o_Clostridiales|f_Veillonellaceae*                          0.0424   155/155     2.81 × 10^-5   0.0030
p_Lentisphaerae|c_Lentisphaeria|o_Victivallales|f_Victivallaceae                     -0.0093   89/155      2.30 × 10^-4   0.0105
p_Firmicutes|c_Clostridia|o_Clostridiales|f_Ruminococcaceae|g_Ruminococcus|s_bromii   0.0151   99/155      2.94 × 10^-4   0.0105
p_Firmicutes|c_Clostridia|o_Clostridiales|f_Clostridiaceae                           -0.0121   150/155     5.69 × 10^-4   0.0152
p_Tenericutes|c_RF3|o_ML615J-28                                                      -0.0095   82/155      1.30 × 10^-3   0.0277
p_Firmicutes|c_Clostridia|o_Clostridiales|f_Lachnospiraceae|g_Roseburia|s_faecis      0.0065   100/155     1.88 × 10^-3   0.0326
p_Actinobacteria|c_Coriobacteriia|o_Coriobacteriales|f_Coriobacteriaceae|g_Slackia   -0.0044   43/155      2.14 × 10^-3   0.0326
p_Actinobacteria|c_Coriobacteriia|o_Coriobacteriales|f_Coriobacteriaceae             -0.0137   155/155     2.67 × 10^-3   0.0357
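The taxonomic units in table 2 are written as pipe-separated lineages in which each element carries a one-letter rank prefix. The following Python sketch shows one way such a string could be split into named ranks. It is an illustration only: the helper names `parse_lineage` and `deepest_rank` are hypothetical, and the prefix-to-rank mapping is taken from the rank order given in the caption of figure 2.

```python
# Rank prefixes as used in the lineage strings (k = kingdom ... s = species).
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_lineage(lineage):
    """Split a lineage such as 'p_Firmicutes|c_Clostridia' into {rank: name}."""
    levels = {}
    for part in lineage.split("|"):
        prefix, _, name = part.strip().partition("_")
        if prefix not in RANKS or not name:
            raise ValueError(f"unrecognised lineage element: {part!r}")
        levels[RANKS[prefix]] = name
    return levels


def deepest_rank(lineage):
    """Return (rank, name) of the last, most specific element of a lineage.

    Assumes the lineage lists ranks from broad to specific, as in table 2.
    """
    levels = parse_lineage(lineage)
    rank = list(levels)[-1]  # dicts keep insertion order
    return rank, levels[rank]


example = "p_Firmicutes|c_Clostridia|o_Clostridiales|f_Ruminococcaceae|g_Ruminococcus|s_bromii"
print(deepest_rank(example))
```

Parsing the lineage this way makes it straightforward to group associations by the level at which they were found, for example family versus species.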

Imputation of Bacterial Function

In addition to the taxonomic associations, we also aimed to study differences in pathway composition in relation to GFD. We applied PICRUSt and HUMAnN for pathway annotation, as described in Methods. In total, 161 pathways and 100 modules were predicted; all of the pathways and modules were found in at least 1% of the samples.

We used MaAsLin to identify differences in the pathway composition and conducted the same tests (GFD versus HD and the time-series test) as for the microbial composition. The data were again corrected for age and ethnicity. We observed that 19 KEGG pathways and two KEGG modules (table 3) differed in abundance between GFD and HD. We did not observe associations related to the transition from GFD to HD (T0–T4). Four out of five top associations, all with a Q-value <0.0003, are related to metabolism changes: tryptophan metabolism, butyrate metabolism (figure 3A), fatty acid metabolism, and seleno-compound metabolism.

Table 3. GFD-induced changes in pathway and module activity. A positive coefficient means more activity of the pathway/module during the habitual diet, while a negative coefficient means less activity of the pathway/module during the habitual diet.

Feature                                               Coeff.    N.not.0/N   P-value        Q-value
KO00380: Tryptophan metabolism                        -0.0011   155/155     2.45 × 10^-5   0.002
KO00650: Butyrate metabolism                          -0.0014   155/155     2.72 × 10^-5   0.002
KO00071: Fatty acid metabolism                        -0.0011   155/155     4.74 × 10^-5   0.002
KO00450: Selenocompound metabolism                     0.0009   155/155     9.23 × 10^-5   0.003
KO00630: Glyoxylate and dicarboxylate metabolism      -0.0010   155/155     2.53 × 10^-4   0.007
KO00520: Amino sugar and nucleotide sugar metabolism   0.0009   155/155     2.83 × 10^-4   0.007
M00064: ADP-L-glycero-D-manno-heptose biosynthesis     0.0066   155/155     4.12 × 10^-4   0.023
KO00643: Styrene degradation                          -0.0013   155/155     4.29 × 10^-4   0.008
M00077: Chondroitin sulphate degradation              -0.0037   76/155      5.81 × 10^-4   0.023
KO00760: Nicotinate and nicotinamide metabolism        0.0008   155/155     6.79 × 10^-4   0.012
KO00620: Pyruvate metabolism                          -0.0012   155/155     0.002          0.023
KO00253: Tetracycline biosynthesis                    -0.0027   155/155     0.002          0.024
KO00471: D-Glutamine and D-glutamate metabolism        0.0012   155/155     0.002          0.024
KO04122: Sulphur relay system                         -0.0020   155/155     0.002          0.024
KO00633: Nitrotoluene degradation                     -0.0022   155/155     0.002          0.024
KO00072: Synthesis and degradation of ketone bodies   -0.0020   155/155     0.003          0.028
KO00310: Lysine degradation                           -0.0007   155/155     0.003          0.031
KO00624: Polycyclic aromatic hydrocarbon degradation   0.0006   155/155     0.005          0.043
KO00561: Glycerolipid metabolism                      -0.0012   155/155     0.005          0.043
KO00680: Methane metabolism                           -0.0006   155/155     0.006          0.047
KO00550: Peptidoglycan biosynthesis                    0.0011   155/155     0.007          0.047

[Figure 3 image not reproduced. Title: Butyrate metabolism and butyrate levels in GFD vs HD. Panel A: KO00650 butyrate metabolism, gluten-free diet (n=74) versus habitual diet (n=81); P value: 2.72e−05, Q value: 0.00188. Panel B: butyrate levels (μmol/g), gluten-free diet (n=74) versus habitual diet (n=81); P value: 0.888, Q value: 1.0.]

Figure 3. Box plot of predicted activity of butyrate metabolism per diet period (a) and the butyrate levels (μmol/g) per diet period (b). There was a significant increase in activity in butyrate metabolism (q = 0.001877), but no change in butyrate level was observed.

Biomarkers In Relation To Diet Changes
Biomarkers related to GFD versus HD
We measured four biomarkers in feces: calprotectin, human-β-defensin-2, chromogranin A, and a set of five short-chain fatty acids (acetate, propionate, butyrate, valerate and caproate). In addition, we measured citrulline levels and a panel of cytokines (IL-1β, IL-6, IL-8, IL-10, IL-12, and TNFα) in blood. The Wilcoxon test was used to compare the average biomarker levels between the GFD and HD periods. We saw no significant change in biomarker levels in relation to GFD (tables 4A and 4B).

Correlations between fecal biomarkers and microbiome
We correlated the fecal biomarker levels to the microbiome composition as well as to the microbiome predicted pathways and modules. After multiple testing correction, we observed many statistically significant correlations between the levels of biomarkers and microbiome/pathway abundances; the absolute correlation (Spearman rho) ranged from 0.14 to 0.6. An expected observation was the correlation of the butyrate pathway activity to the butyrate biomarker, as we had previously observed a significant correlation between the predicted butyrate pathway activity and diet change (table 4). When correlating the actual butyrate measurements with the predicted activity of the butyrate metabolism, we observed a low but significant correlation of -0.269 (p=0.0009, q=0.0012, Additional file 6: figure S4). However, there was no significant difference in butyrate levels in the two diet periods (figure 3B and table 4). Another interesting correlation was found between the predicted pyruvate metabolism pathway and the levels of propionate (μmol/g), since propionate can be oxidized to pyruvate,11 for which we observed a correlation of -0.54 (p=9.44×10⁻¹³, q=1.48×10⁻¹⁰, Additional file 7: figure S5). A complete list of the significant correlations between the fecal biomarkers and the microbiome compositions, the predicted KEGG pathway activity scores, and predicted activity of KEGG modules can be found in Additional file 8: Table S3, Additional file 9: Table S4, and Additional file 10: Table S5.
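The rank correlation used above can be illustrated with a minimal sketch. It computes Spearman's rho as the Pearson correlation of average ranks, so that tied values (for example several measurements at the detection limit) share a rank. The function names and the toy values are hypothetical and are not taken from the study data; the multiple-testing step is not shown.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: a measured biomarker level and a predicted
# pathway activity for seven samples (two biomarker values are tied).
biomarker_level = [6.9, 3.5, 10.6, 4.3, 8.1, 0.0, 0.0]
pathway_activity = [0.082, 0.091, 0.080, 0.089, 0.084, 0.093, 0.095]
rho = spearman_rho(biomarker_level, pathway_activity)
print(f"Spearman rho = {rho:.3f}")
```

In the toy data the biomarker tends to fall as the predicted activity rises, so rho is negative, the same direction as the correlations reported in this section.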

Table 4. Median and 25%/75% quantiles of the measured biomarkers. None of the differences were statistically significant. BDL=below detection limit.

A) Plasma
Biomarker | Habitual diet | Gluten-free diet | Wilcoxon test P-value
Citrulline (μmol/l) | 45.60 (38.15-51.50) | 48.00 (36.35-56.85) | 0.9328
IL 1 Beta (g/l) | 1.60 (0.68-2.10) | 1.23 (0.79-1.68) | 0.8870
IL 6 (g/l) | BDL (BDL-1.60) | BDL (BDL-0.38) | 0.1240
IL 8 (g/l) | 6.04 (2.89-12.61) | 5.41 (3.34-11.19) | 0.9030
IL 10 (g/l) | 0.83 (0.74-1.01) | 0.83 (0.74-0.97) | 0.9322
IL 12P70 (g/l) | 1.53 (0.95-1.78) | 1.53 (0.95-2.11) | 0.2131
TNF Alpha (g/l) | 0.56 (BDL-4.33) | BDL (BDL-5.13) | 0.9761

B) Feces
Biomarker | Habitual diet | Gluten-free diet | Wilcoxon test P-value
Chromogranin A (nmol/g) | 10.85 (7.69-23.09) | 11.44 (7.37-27.18) | 0.8128
Beta Defensin 2 (ng/g) | 24.90 (18.78-35.03) | 26.10 (20.03-46.90) | 0.5256
Calprotectin (g/g) | 21.55 (BDL-42.88) | 13.05 (BDL-31.28) | 0.0528
Acetate (μmol/g) | 24.37 (17.35-34.34) | 23.61 (18.58-35.12) | 0.8651
Propionate (μmol/g) | 7.55 (4.24-10.98) | 6.84 (4.67-9.07) | 0.6986
Butyrate (μmol/g) | 6.86 (3.53-10.63) | 6.48 (4.27-10.40) | 0.8882
Valerate (μmol/g) | 1.09 (0.74-1.76) | 1.24 (0.79-1.70) | 0.6824
Caproate (μmol/g) | 0.28 (0.05-0.85) | 0.21 (0.04-0.66) | 0.2488

Discussion
We investigated the role of a four-week GFD on microbiome composition in healthy individuals and identified moderate but significant changes in their microbiome compositions and even stronger effects on the imputed activity levels of bacterial pathways. On a taxonomic level we identified eight bacteria that change significantly in abundance on GFD: Veillonellaceae, Ruminococcus bromii and Roseburia faecis decreased on GFD, and Victivallaceae, Clostridiaceae, ML615J-28, Slackia and Coriobacteriaceae increased on GFD. The strongest effect was the decrease during GFD of Veillonellaceae, a family of Gram-negative bacteria known for lactate fermentation. This is the first time that the Veillonellaceae family has been associated with a dietary intervention, but it was recently shown to be decreased in autistic patients12. Remarkably, the patients in that study were more often on a GFD (9/10) than the control group (5/10). Our findings suggest that GFD, rather than autism, can be the cause of a lower abundance of Veillonellaceae in these patients, thus highlighting the importance of including dietary information in analyses of microbiota in relation to diseases. Veillonellaceae is considered to be a pro-inflammatory family of bacteria; an increase in Veillonellaceae abundance was consistently reported in IBD, IBS and cirrhosis patients13–15. It is conceivable that a decrease in Veillonellaceae abundance might be one of the mediators of the GFD’s beneficial effect observed in patients with IBS and gluten-related disorders.

Several of the associated bacteria have been previously linked to diet changes and starch metabolism. In particular, Ruminococcus bromii is important for the degradation of resistant starch in the human colon16, and is increased when on a resistant starch diet17. It is also known that degradation of cellulose by Ruminococcus results in the production of SCFA and hydrogen gas18; a decrease in abundance of Ruminococcus and its fermentation products might explain the beneficial effect of a GFD that is experienced by some IBS patients, as previously reported by Aziz et al19. Both Ruminococcus bromii and Roseburia faecis were recently reported to be influenced by switching from a vegetarian to a meat-containing diet9. It is likely that changes in these bacteria observed in relation to GFD are the consequences of the different starch composition of a GFD versus HD. Moreover, stool consistency could influence the results of microbiome composition20; unfortunately data on stool composition was not collected in our study.
The five bacteria for which we found an increased abundance on GFD are less well characterized, although the Slackia genus, its family Coriobacteriaceae, and the family Clostridiaceae have been previously linked to gastrointestinal diseases in humans: inflammatory bowel disease, celiac disease and colorectal cancer21–23. The Victivallaceae family and ML615J-28 order have not been previously associated with diet change or phenotypic change in humans. However, in general, it could be hypothesized that these bacteria benefit from a change in available substrates as a result of the change in diet, which could in turn result in altered metabolite production and related gastrointestinal complaints.
In this study we found a stronger effect of diet on the imputed KEGG pathways than on the taxonomic level. So, although the changes in the overall microbiome were moderate, there were more profound effects on the pathway activities of the microbiome.
The strength of our study lies in our analysis of the microbiome at multiple time points for the same individuals. We identified that the inter-individual variability is the strongest determinant of sample variability, suggesting that in healthy individuals the gut microbiome is stable, even with short-term changes in the habitual diet. We did not observe differences in the downstream effect of GFD in relation to high or low richness, which contradicts previous observations24. The study by David et al9 identified a profound effect of short-term diet change from a vegetarian to an animal-based diet and vice versa. This profound short-term diet effect was not observed in our study when changing from a gluten-containing to a gluten-free diet. Induced by the diet change, David et al9 found significant differences in macronutrient intake between meat-based and plant-based diet, whereas macronutrient intake in this study was not changed during the diets. These results suggest that changing the main energy source (meat vs. plant) has a more profound effect on the microbiome than changing the carbohydrate source (gluten). Although De Palma et al8 did observe a reduction in polysaccharide intake for GFD in healthy individuals, we were unable to reproduce their finding because we could not distinguish between different classes of carbohydrates in our dataset as the food composition data on GFD foods lacked this information. Further, it is possible that changes in nutritional intake other than those driven by gluten exclusion might influence microbiome changes. For our selection of blood and stool biomarkers we observed no significant associations with the diet change. All the selected biomarkers are markers of inflammation or metabolic changes, and remained in the normal range in all our participants, with a high proportion of the values of blood inflammatory markers being below the detection limit. 
Overall, we conclude that a GFD and its downstream effects on the microbiome do not cause major inflammatory or metabolic changes in gut function in healthy subjects. However, the lower abundance of Veillonellaceae, a pro-inflammatory bacterial family linked to Crohn’s disease and other gut disease phenotypes, suggests a reduction in gut inflammatory state. This change in bacterial composition might be linked with a beneficial effect of GFD for patients with gut disorders such as gluten-related disorders and/or IBS.

Conclusions
We have identified eight taxa and 21 bacterial pathways associated with a change from a habitual diet to a GFD in healthy individuals. We conclude that the effect of gluten intake on the microbiota is less pronounced than that seen for a shift from a meat-based diet to a vegetarian diet (or vice versa). However, a GFD clearly influences the abundance of several species, in particular those involved specifically in carbohydrate and starch metabolism. Our study illustrates that variations in diet could confound the results of microbiome analysis in relation to disease phenotypes, so dietary variations should be carefully considered and reported in such studies. The short-term GFD did not influence the levels of inflammatory gut biomarkers in healthy individuals. Further research is needed to assess the impact of a GFD on inflammatory and metabolic changes in gut function in individuals with gastrointestinal conditions such as IBS and gluten-related disorders.

Methods
Study Design
We enrolled 21 participants (9 males and 12 females), without any known food intolerance and without known gastrointestinal disorders, in our GFD study for 13 weeks (figure 4). After baseline measurements (T=0), all the participants started a GFD for four weeks (T=1 - 4), followed by a “wash-out” period of five weeks. Subsequently, data was collected when they returned to their habitual diets (HD, gluten-containing) for a period of four weeks (T=5 - 8) (figure 4). Fecal samples were collected at all time points. Blood was collected at baseline, at T=2 and T=4 on GFD, and at T=6 and T=8 on HD.

Figure 4. Timeline of GFD study, including number of participants and collected samples

The participants were between 16 and 61 years old (mean 36.3 years). Mean BMI was 24.0 and 28.6% (n=6) of participants were smokers. The majority of participants were European (n=19), two participants were South American and one was Asian. Except for one, none of the participants had taken an antibiotic treatment for the year prior to the study start. In both diet periods (GFD, HD), participants kept a detailed three-day food record. All 21 participants completed the GFD period; for 17 participants all data points were available. An overview of the participants’ characteristics can be found in Additional file 1: figure S1. Written consent was obtained from all participants and the study followed the sampling protocol of the LifeLines-DEEP study25, which was approved by the ethics committee of the University Medical Centre Groningen, document no. METC UMCG LLDEEP: M12.113965.

Gluten-Free Diet and Dietary Intake Assessment
Methods to assess GFD adherence and dietary intake have been described previously by Baranska et al.26 In short, before the start of the study, the participants were given information on gluten-containing food products by a dietician and they were instructed how to keep a three-day food record. The food records were checked for completeness and the macronutrient intake was calculated. Days on which a participant had a daily energy intake below 500 kcal or above 5000 kcal were excluded from our analysis (n=2). Of 21 participants, 15 (71%) completed the dietary assessments; three were excluded from food intake analysis because of incomplete food records. We used the paired t-test to compare group means between GFD and HD.

Blood Sample Collection
Participants’ blood samples were collected after an overnight fast by a trained physician assistant. We collected two EDTA tubes of whole blood at baseline (T0) and during the GFD period at time points T2 and T4; during the HD period one EDTA tube was collected at time points T6 and T8. Plasma was extracted from the whole blood within 8 hours of collection and stored at -80°C for later analysis.

Microbiome Analysis
Fecal sample collection
Fecal samples were collected at home and immediately stored at -20°C. At the end of the 13-week study period, all samples were stored at -80°C. Aliquots were made and DNA was isolated with the QIAamp DNA Stool Mini Kit. Isolated DNA was sequenced at the Beijing Genomics Institute (BGI).

Sequencing
We used 454 pyrosequencing to determine the bacterial composition of the fecal samples. Hypervariable region V3 to V4 was selected using forward primer F515 (GTGCCAGCMGCCGCGG) and reverse primer “E. coli 907-924” (CCGTCAATTCMTTTRAGT) to examine the bacterial composition. We used QIIME27, v1.7.0, to process the raw data files from the sequencer.
The raw data files (sff files) were processed with the defaults of QIIME v1.7.0; however, we did not trim the primers. Six out of 161 samples had fewer than 3,000 reads and were excluded from the analysis. The average number of reads was 5,862, with a maximum of 12,000 reads.

OTU picking
The operational taxonomic unit (OTU) formation was performed using the QIIME reference optimal picking, which uses UCLUST28, version 1.2.22q, to perform the clustering. As a reference database we used a primer-specific version of the full GreenGenes 13.5 database29. Using TaxMan30 we created the primer-specific reference database, containing only reference entries that matched our selected primers. During this process we restricted the mismatches of the probes to the references to a maximum of 25%. The 16S regions that were captured by our primers, including the primer sequences, were extracted from the full 16S sequences. For each of the reference clusters, we determined the overlapping part of the taxonomy of each of the reference reads in the clusters and used this overlapping part as the taxonomic label for the cluster. This is similar to the processes described in other studies9,30–33. OTUs had to be supported by at least 100 reads and had to be identified in two samples; less abundant OTUs were excluded from the analysis.
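The two filtering rules above (samples with fewer than 3,000 reads dropped; OTUs kept only if supported by at least 100 reads and seen in two samples) can be sketched as follows. This is an illustration only, not the QIIME procedure used in the study: the function name, the nested-dictionary input, the order of the two steps, and the reading of "two samples" as "at least two retained samples" are all assumptions.

```python
def filter_samples_and_otus(counts, min_sample_reads=3000,
                            min_otu_reads=100, min_otu_samples=2):
    """Filter a {sample_id: {otu_id: read_count}} table.

    Step 1 (assumed to come first): drop samples with too few reads.
    Step 2: keep OTUs with enough total reads that occur in enough
    of the retained samples.
    """
    kept_samples = {
        sample: otus for sample, otus in counts.items()
        if sum(otus.values()) >= min_sample_reads
    }

    total_reads = {}
    n_samples_present = {}
    for otus in kept_samples.values():
        for otu, n in otus.items():
            if n > 0:
                total_reads[otu] = total_reads.get(otu, 0) + n
                n_samples_present[otu] = n_samples_present.get(otu, 0) + 1

    kept_otus = {
        otu for otu in total_reads
        if total_reads[otu] >= min_otu_reads
        and n_samples_present[otu] >= min_otu_samples
    }

    return {
        sample: {otu: n for otu, n in otus.items() if otu in kept_otus}
        for sample, otus in kept_samples.items()
    }


# Hypothetical toy table: S3 has too few reads; otu_b occurs in only one
# retained sample; otu_c has too few reads overall.
toy_counts = {
    "S1": {"otu_a": 2900, "otu_b": 150, "otu_c": 5},
    "S2": {"otu_a": 3100, "otu_b": 0, "otu_c": 0},
    "S3": {"otu_a": 500, "otu_b": 40},
}
filtered = filter_samples_and_otus(toy_counts)
print(filtered)
```

Applying the sample filter before the OTU filter means that reads from discarded samples do not count towards an OTU's support; the opposite order would be an equally plausible reading of the text.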

Estimation of gene abundance and pathway activity
After filtering the OTUs, we used PICRUSt34 to estimate the gene abundance, and the PICRUSt output was then used in HUMAnN35 to calculate the bacterial pathway activity. First, the reference database was clustered based on 97% similarity to the reference sequence to better reflect the normal GreenGenes 97% database required for PICRUSt. Three out of 1,166 OTUs did not contain a representative sequence in the GreenGenes 97% set and were therefore excluded from the analysis. Since merging the reference database at 97% similarity level led to merging of previously different clusters, for the pathway analysis we chose to permute the cluster representative names in the OTU-table 25 times; this was to be sure that our OTU picking strategy would not cause any problems in estimating the genes present in each micro-organism. Next, we ran PICRUSt on the 25 permuted tables and calculated the average gene abundance per sample. The average correlation between the permutations within a sample was higher than 0.97 (Pearson r). Hence, we averaged the PICRUSt output, which was then used to calculate the pathway activity in HUMAnN.
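The consistency check and averaging described above can be illustrated with a small sketch: for one sample, compute the mean pairwise Pearson correlation between the gene-abundance profiles from the permuted runs, then average the profiles gene by gene. The function names and the three toy profiles are hypothetical; the sketch does not call PICRUSt or HUMAnN and only mirrors the arithmetic.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(profiles):
    """Average Pearson r over all pairs of profiles (needs >= 2 profiles)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


def average_profile(profiles):
    """Element-wise mean across profiles (one value per gene)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


# Hypothetical gene abundances for one sample from three permuted runs
# (the study used 25 permutations).
runs = [
    [10.0, 20.0, 30.0, 40.0],
    [11.0, 19.0, 31.0, 39.0],
    [10.0, 21.0, 29.0, 41.0],
]
agreement = mean_pairwise_correlation(runs)
if agreement > 0.97:
    # The runs agree closely, so a single averaged profile is used downstream.
    averaged = average_profile(runs)
    print(f"mean pairwise r = {agreement:.3f}", averaged)
```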

Changes in the gut microbiome or in gene abundance due to diet
To identify differentially abundant taxa, microbial biomarkers, and differences in pathway activity between the GFD and HD periods, we used QIIME and MaAsLin36. QIIME was used for the alpha-diversity analysis, principal coordinate analysis (PCoA) over unifrac distances and visualization. In the MaAsLin analysis we corrected for ethnicity (defined as continent of birth) and gender. MaAsLin was used to search for differentially abundant taxonomic units to discriminate between the GFD and HD time points. Additionally, we tested for changes during the transition from HD to GFD (T0-T4). MaAsLin uses a boosted, additive, general linear model to discriminate between groups of data. In the MaAsLin analysis we did not test individual OTUs, but focused on the most detailed taxonomic label each OTU represented. Using the QIIMETOMAASLIN37 tool, we aggregated the OTUs if the taxonomic label was identical and, if multiple OTUs represented a higher order taxon, we added this higher order taxon to the analysis. In this process, we went from 1,166 OTUs to 114 separate taxonomic units that were included in our analysis. Using the same tool, QIIMETOMAASLIN, we normalized the microbial abundance using the arcsine square root transformation. This transformation brings the distribution of the percentages closer to a normal distribution. In all our analyses we used the Q-value calculated using the R38 Q-value package39 to correct for multiple testing. The Q-value is the minimal false discovery rate at which a test may be called significant. We used a Q-value of 0.05 as a cut-off in our analyses.

Biomarkers
Six biomarkers related to gut health were measured in the ‘Dr. Stein & Colleagues’ medical laboratory (Maastricht, the Netherlands).
These biomarkers included: fecal calprotectin and a set of plasma cytokines as markers for the immune system activation40–42, fecal human-β-defensin-2 as a marker for defense against invading microbes43,44, fecal chromogranin A as a marker for neuro-endocrine system activation45–47, fecal short-chain fatty acids (SCFA) secretion as a marker for colonic metabolism48, and plasma citrulline as a measure for enterocyte mass49,50. The plasma citrulline level and the panel of cytokines (IL-1β, IL-6, IL-8, IL-10, IL-12, and TNFα) were measured by high-performance liquid chromatography (HPLC) and electro-chemiluminescence immunoassay (ECLIA), respectively. In feces, we measured calprotectin and human-β-defensin-2 levels by enzyme-linked immunosorbent assay (ELISA), chromogranin A level by radioimmunoassay (RIA), and the short-chain fatty acids acetate, propionate, butyrate, valerate and caproate by gas chromatography–mass spectrometry (GC-MS). All biomarker analyses were performed non-parametrically, with tie handling, because of the high number of samples with biomarker levels below the detection limit. We used the Wilcoxon test to compare the average biomarker levels between the diet periods and the Spearman correlation to search for relations between the microbiome or gene activity data and the biomarker levels.

Author contributions
GT, AZ and CW designed the study. GT, ET, BH and MC were involved in sample collection and DNA isolation. XC, HZ and YW performed the data generation. MJB, TV, DG, SZ, MC and ET were involved in data processing, analysis and interpretation. MJB, SZ and ET drafted the work. All authors have critically revised this article and approved the final version to be published.

Funding
This study was funded by a European Research Council advanced grant (FP/2007–2013/ERC grant 2012-322698) to CW, a grant from the Top Institute Food and Nutrition Wageningen (GH001) to CW, and a Rosalind Franklin Fellowship from the University of Groningen to AZ. GT is supported by the Wellcome Trust Sanger Institute, Cambridge, UK (WT098051).

Ethics approval and consent to participate
This GFD study followed the sampling protocol of the LifeLines-DEEP study, which was approved by the ethics committee of the University Medical Centre Groningen and conforms to the Declaration of Helsinki, document no. METC UMCG LLDEEP: M12.113965. All participants signed their informed consent prior to study enrolment.

Availability of data and materials
The supporting data is available to researchers in the European Nucleotide Archive, under study accession number PRJEB13219 (http://www.ebi.ac.uk/ena/data/view/PRJEB13219).

References
1. Sollid, L. M. Coeliac disease: dissecting a complex inflammatory disorder. Nat. Rev. Immunol. 2, 647–55 (2002).
2. Sapone, A. et al. Spectrum of gluten-related disorders: consensus on new nomenclature and classification. BMC Medicine 10, 13 (2012).
3. Catassi, C. et al. Non-celiac gluten sensitivity: The new frontier of gluten related disorders. Nutrients 5, 3839–3853 (2013).
4. Vazquez-Roque, M. I. et al. A controlled trial of gluten-free diet in patients with irritable bowel syndrome-diarrhea: effects on bowel frequency and intestinal function. Gastroenterology 144, 903–911.e3 (2013).
5. Collado, M. C., Donat, E., Ribes-Koninckx, C., Calabuig, M. & Sanz, Y. Specific duodenal and faecal bacterial groups associated with paediatric coeliac disease. J. Clin. Pathol. 62, 264–269 (2009).
6. Di Cagno, R. et al. Different fecal microbiotas and volatile organic compounds in treated and untreated children with celiac disease. Appl. Environ. Microbiol. 75, 3963–3971 (2009).
7. Nistal, E. et al. Differences in faecal bacteria populations and faecal bacteria metabolism in healthy adults and celiac disease patients. Biochimie 94, 1724–9 (2012).
8. De Palma, G., Nadal, I., Collado, M. C. & Sanz, Y. Effects of a gluten-free diet on gut microbiota and immune function in healthy adult human subjects. Br. J. Nutr. 102, 1154–1160 (2009).
9. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–63 (2014).
10. Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–6 (2013).

11. Brock, M., Maerker, C., Schütz, A., Völker, U. & Buckel, W. Oxidation of propionate to pyruvate in Escherichia coli: Involvement of methylcitrate dehydratase and aconitase. Eur. J. Biochem. 269, 6184–6194 (2002).
12. Kang, D.-W. et al. Reduced incidence of Prevotella and Other Fermenters in Intestinal Microflora of Autistic Children. PLoS One 8, e68322 (2013).
13. Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).
14. Haberman, Y. et al. Pediatric Crohn disease patients exhibit specific ileal transcriptome and microbiome signature. J. Clin. Invest. 1–17 (2014). doi:10.1172/JCI75436
15. Shukla, R., Ghoshal, U., Dhole, T. N. & Ghoshal, U. C. Fecal Microbiota in Patients with Irritable Bowel Syndrome Compared with Healthy Controls Using Real-Time Polymerase Chain Reaction: An Evidence of Dysbiosis. Dig. Dis. Sci. 60, 2953–62 (2015).
16. Ze, X., Duncan, S. H., Louis, P. & Flint, H. J. Ruminococcus bromii is a keystone species for the degradation of resistant starch in the human colon. ISME J. 6, 1535–1543 (2012).
17. Walker, A. W. et al. Dominant and diet-responsive groups of bacteria within the human colonic microbiota. ISME J. 5, 220–230 (2011).
18. Rajilić-Stojanović, M. Function of the microbiota. Best Pract. Res. Clin. Gastroenterol. 27, 5–16 (2013).
19. Aziz, I. et al. Efficacy of a Gluten-free Diet in Subjects With Irritable Bowel Syndrome-Diarrhea Unaware of Their HLA-DQ2/8 Genotype. Clin. Gastroenterol. Hepatol. (2015).
20. Tigchelaar, E. F. et al. Gut microbiota composition associated with stool consistency. Gut gutjnl-2015-310328 (2015). doi:10.1136/gutjnl-2015-310328
21. Maukonen, J. et al. Altered Fecal Microbiota in Paediatric Inflammatory Bowel Disease. J. Crohns. Colitis 9, 1088–95 (2015).
22. Chen, W., Liu, F., Ling, Z., Tong, X. & Xiang, C. Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS One 7, e39743 (2012).
23. Olivares, M. et al. The HLA-DQ2 genotype selects for early intestinal microbiota composition in infants at high risk of developing coeliac disease. Gut 64, 406–17 (2015).
24. Fang, S. & Evans, R. M. Microbiology: Wealth management in the gut. Nature 500, 538–539 (2013).
25. Tigchelaar, E. F. et al. An introduction to LifeLines DEEP: study design and baseline characteristics. bioRxiv (Cold Spring Harbor Labs Journals, 2014). doi:10.1101/009217
26. Baranska, A. et al. Profile of volatile organic compounds in exhaled breath changes as a result of gluten-free diet. J. Breath Res. 7, 37104 (2013).
27. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335–336 (2010).
28. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
29. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
30. Brandt, B. W., Bonder, M. J., Huse, S. M. & Zaura, E. TaxMan: a server to trim rRNA reference databases and inspect taxonomic coverage. Nucleic Acids Res. 40, W82-7 (2012).
31. Bonder, M. J., Abeln, S., Zaura, E. & Brandt, B. W. Comparing clustering and pre-processing in taxonomy analysis. Bioinformatics 28, 2891–2897 (2012).
32. May, A., Abeln, S., Crielaard, W., Heringa, J. & Brandt, B. W. Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations. Bioinformatics 30, 1530–1538 (2014).
33. Ding, T. & Schloss, P. D. Dynamics and associations of microbial community types across the human body. Nature 509, 357–60 (2014).
34. Langille, M. G. I. et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol. 31, 814–21 (2013).
35. Abubucker, S. et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput. Biol. 8, e1002358 (2012).

36. Tickle T, Waldron L, Yiren Lu, H. C. Multivariate association of microbial communities with rich metadata in high-dimensional studies. (In progress)
37. Tickle, T. QiimeToMaAsLin. (2013).
38. R Development Core Team, R. F. F. S. C. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing 1, 2673 (2008).
39. Dabney, A. et al. Q-value estimation for false discovery rate control. Medicine 344, 539–548 (2004).
40. Tibble, J. A. et al. High prevalence of NSAID enteropathy as shown by a simple faecal test. Gut 45, 362–6 (1999).
41. Joshi, S., Lewis, S. J., Creanor, S. & Ayling, R. M. Age-related faecal calprotectin, lactoferrin and tumour M2-PK concentrations in healthy volunteers. Ann. Clin. Biochem. 47, 259–63 (2010).
42. Maheshwari, A. et al. Effects of interleukin-8 on the developing human intestine. Cytokine 20, 256–67 (2002).
43. Harder, J., Bartels, J., Christophers, E. & Schröder, J. M. A peptide antibiotic from human skin. Nature 387, 861 (1997).
44. Langhorst, J. et al. Elevated human beta-defensin-2 levels indicate an activation of the innate immune system in patients with irritable bowel syndrome. Am. J. Gastroenterol. 104, 404–10 (2009).
45. El-Salhy, M., Lomholt-Beck, B. & Hausken, T. Chromogranin A as a possible tool in the diagnosis of irritable bowel syndrome. Scand. J. Gastroenterol. 45, 1435–9 (2010).
46. Sidhu, R., Drew, K., McAlindon, M. E., Lobo, A. J. & Sanders, D. S. Elevated serum chromogranin A in irritable bowel syndrome (IBS) and inflammatory bowel disease (IBD): a shared model for pathogenesis? Inflamm. Bowel Dis. 16, 361 (2010).
47. Ohman, L., Stridsberg, M., Isaksson, S., Jerlstad, P. & Simrén, M. Altered levels of fecal chromogranins and secretogranins in IBS: relevance for pathophysiology and symptoms? Am. J. Gastroenterol. 107, 440–7 (2012).
48. Hamer, H. M. et al. Review article: the role of butyrate on colonic function. Aliment. Pharmacol. Ther. 27, 104–19 (2008).
49. Windmueller, H. G. & Spaeth, A. E. Source and fate of circulating citrulline. Am. J. Physiol. 241, E473-80 (1981).
50. Crenn, P., Messing, B. & Cynober, L. Citrulline as a biomarker of intestinal failure due to enterocyte mass reduction. Clin. Nutr. 27, 328–39 (2008).

Acknowledgements
We thank all the participants for their collaboration as well as Jackie Senior and Kate Mc Intyre for editing the manuscript. We thank Jackie Dekens, Zlatan Mujagic and Daisy Jonkers for support with the biomarker analyses and Stein’s lab for measuring the biomarkers. We thank Hermie Harmsen for the helpful discussions during the project.

Description of additional data files
The following additional data are available with the online version of this paper.
Additional data:
Data file 1 (Figure S1) Baseline characteristics of the GFD study group.
Data file 2 (Table S1) Results macronutrient intake per participant.
Data file 3 (Figure S2) Unweighted unifrac distances when comparing inter-individual vs intra-individual distances. Group 1 shows the intra-individual differences regardless of diet. Group 2 shows the intra-individual differences within the same diet. Group 3 shows the intra-individual differences between the two diet groups. Group 4 shows the inter-individual differences regardless of diet. Group 5 shows the inter-individual differences within the same diet. Group 6 shows the inter-individual differences between the two diet groups. The main difference is the intra- vs inter-individual difference. Within individuals, samples from the same diet period are also slightly closer to each other. However, we do not see such a phenomenon for group 5 vs group 6.
Data file 4 (Figure S3) Abundance of Veillonellaceae family in the GFD participants. In all but four participants we see a clear trend of higher levels of Veillonellaceae on the habitual diet. The rightmost samples do not show this phenomenon.
Data file 5 (Table S2) Relation of smoking and microbiome composition.
Data file 6 (Figure S4) Measured butyrate levels vs the predicted activity of butyrate metabolism.
Data file 7 (Figure S5) Measured propionate levels vs the predicted activity of pyruvate metabolism.
Data file 8 (Table S3) Correlation of bacteria and levels of fecal biomarkers.
Data file 9 (Table S4) Correlation of predicted HUMAnN pathway activity and levels of fecal biomarkers.
Data file 10 (Table S5) Correlation of predicted HUMAnN module activity and levels of fecal biomarkers.

Proton Pump Inhibitors Affect the Gut Microbiome

Gut, DOI:10.1136

Floris Imhann1*, Marc Jan Bonder2*, Arnau Vich Vila1*, Jingyuan Fu2, Zlatan Mujagic3, Lisa Vork3, Ettje F. Tigchelaar2, Soesma A. Jankipersadsing2, Maria C. Cenit2, Hermie J.M. Harmsen4, Gerard Dijkstra1, Lude Franke2, Ramnik J. Xavier5, Daisy Jonkers4#, Cisca Wijmenga2#, Rinse K. Weersma1#, Alexandra Zhernakova1#

Abstract
Background & Aims
Proton pump inhibitors (PPI) are among the top ten most widely used drugs in the world. PPI use has been associated with an increased risk of enteric infections, most notably Clostridium difficile. The gut microbiome plays an important role in enteric infections, by resisting or promoting colonization by pathogens. In this study, we investigated the influence of PPI use on the gut microbiome.

Methods
The gut microbiome composition of 1815 individuals, spanning three cohorts, was assessed by tag-sequencing of the 16S rRNA gene. The difference in microbiota composition in PPI users vs. non-users was analyzed separately in each cohort, followed by a meta-analysis.

Results
211 of the participants were using PPI at the time of stool sampling. PPI use is associated with a significant decrease in Shannon’s diversity and with changes in 20% of the bacterial taxa (FDR < 0.05). Multiple oral bacteria were overrepresented in the fecal microbiome of PPI users, including the genus Rothia (p=9.8×10⁻³⁸). In PPI users we observed a significant increase in the genera Enterococcus, Streptococcus and Staphylococcus and in the potentially pathogenic species Escherichia coli.

Conclusions
The differences between PPI users and non-users observed in this study are consistently associated with changes towards a less healthy gut microbiome. These differences are in line with known changes that predispose to C. difficile infections and can potentially explain the increased risk of enteric infections in PPI users. On a population level, the effects of PPI are more prominent than the effects of antibiotics or other commonly used drugs.

Summary box

What is already known about this subject:
• PPI use is associated with increased risk of enteric infections, in particular with a 65% increase in incidence of Clostridium difficile infection.
• PPI are among the most commonly used drugs.
• Changes in the gut microbiome can resist or promote colonization by enteric pathogens.

What are the new findings:
• PPI use is associated with decreased bacterial richness and profound changes in the gut microbiome: 20% of the identified bacteria in this study showed significant deviation.
• Oral bacteria and potentially pathogenic bacteria are increased in the gut microbiota of PPI users.
• On the population level we see more microbial alterations in the gut associated with PPI use than with antibiotics or other drug use.

How might it impact on clinical practice in the foreseeable future?
• Given the widespread use of PPI, the morbidity and mortality associated with enteric infections, and the increasing number of studies investigating the microbiome, both healthcare practitioners and researchers should take into consideration the influence of PPI on the gut microbiome.

1. University of Groningen and University Medical Center Groningen, Groningen, the Netherlands, Department of Gastroenterology and Hepatology; 2. University of Groningen and University Medical Center Groningen, Groningen, the Netherlands, Department of Genetics; 3. Maastricht University Medical Center+, Maastricht, the Netherlands, Division Gastroenterology-Hepatology, NUTRIM School for Nutrition, and Translational Research in Metabolism; 4. University of Groningen and University Medical Center Groningen, Groningen, the Netherlands, Department of Medical Microbiology; 5. Broad Institute of Harvard and MIT, Boston, USA; *Shared first authors; #Shared last authors. Correspondence to: Prof. Dr. Rinse K. Weersma, E-mail: [email protected]

Background & Aims

Proton pump inhibitors (PPI) are among the top ten most widely used drugs in the world. In 2013, 7% of the population of the Netherlands used omeprazole. In the same year, esomeprazole was the second largest drug in terms of revenue in the United States.1,2 PPI are used to treat gastro-esophageal reflux disorder (GERD) and to prevent gastric and duodenal ulcers.3,4 Of the general population, 25% report having heartburn at least once a month, explaining the large demand for PPI.4 Nevertheless, PPI are frequently prescribed or taken for long periods without evidence-based indication.5,6

PPI use has been associated with increased risk of enteric infections.5,7–9 A meta-analysis of 23 studies, comprising almost 300,000 patients, showed a 65% increase in the incidence of Clostridium difficile-associated diarrhea among patients who used PPI.9 In healthcare-related settings, PPI use also increases the risk of recurrent C. difficile infections.5 Another meta-analysis of 11,280 patients, from six studies evaluating Salmonella, Campylobacter and other enteric infections, also found an increased risk due to acid suppression, with a greater association with PPI than with H2-receptor antagonists.8 Recently, the Dutch National Institute for Public Health and the Environment (RIVM) noticed a marked increase in the occurrence of campylobacteriosis associated with increased PPI use in the Netherlands.7

The gut microbiome plays an important role in these enteric infections.10–13 Gut microbiota can resist or promote the colonization of the gut by C. difficile and other enteric pathogens through several mechanisms that either directly inhibit bacterial growth or enhance the immune system.10,11 Moreover, replacing the gut microbiota of patients with C. difficile diarrhea with a healthy microbiome through faecal transplantation has been proven to cure C. difficile infection.14 The increased incidence of enteric infections in PPI users and the importance of the gut microbiome composition in the development of these infections led us to investigate the influence of PPI use on the gut microbiome.

Results

PPI use is associated with older age and higher BMI

PPI were used by 211 (11.6%) of the 1815 participants: 8.4% of the general population (Cohort 1), 20.0% of the IBD patients (Cohort 2) and 15.2% of the participants of case-control Cohort 3. Women used PPI more often than men (9.2% versus 7.4%), although this difference was not significant (P = 0.61, Chi-square test). PPI users were generally older than non-users, 51.6 (SD 13.4) versus 44.4 (SD 14.7) years of age (P = 2.50 x 10^-11, WMW test), and had a higher BMI, 26.9 (SD 5.0) versus 24.9 (SD 4.2) (P = 1.89 x 10^-8, WMW test). Antibiotics were concomitantly used by 2% of the 99 PPI users of Cohort 1 and 33% of the 60 PPI users of Cohort 2. There was no overlap between PPI users and antibiotics users in Cohort 3. Based on our data, we included age, gender, BMI and antibiotics as co-factors in the microbiome analyses. Table 1 provides an overview of the characteristics per cohort and the use of PPI.

Table 1. Characteristics of the three independent cohorts in this study. Values are average (SD) unless otherwise stated. BMI = body mass index, IBD = inflammatory bowel disease, IBS = irritable bowel syndrome, PPI = Proton Pump Inhibitor, SD = standard deviation, TNF-α = tumor necrosis factor alpha, UMCG = University Medical Center Groningen, MUMC = Maastricht University Medical Center.

C1 = Cohort 1: LifeLines-DEEP (general population); C2 = Cohort 2: IBD patients (UMCG); C3 = Cohort 3: IBS case-control study (MUMC).

Characteristic                                          C1 PPI users      C1 non-PPI users  C2 PPI users      C2 non-PPI users  C3 PPI users      C3 non-PPI users
                                                        (n=99)            (n=1075)          (n=60)            (n=240)           (n=52)            (n=289)
Age                                                     51.94 (13.59)     44.79 (13.58)     50.87 (14.49)     42.45 (14.57)     51.94 (14.27)     44.57 (18.24)
BMI                                                     27.73 (5.10)      25.05 (4.03)      26.14 (5.53)      25.58 (4.72)      26.24 (4.10)      24.16 (4.11)
Gender (% male)                                         36.36%            42.05%            †                 †                 30.77%            33.56%
Reads per sample                                        48,879 (43,001)   55,884 (40,057)   51,081 (43,990)   52,970 (37,787)   43,807 (28,604)   65,842 (11,9296)
Antibiotics (%)                                         2.02%             1.02%             †                 †                 0.00%             1.73%
IBD (%)                                                 0.00%             0.00%             100.00%           100.00%           0.00%             0.00%
IBS (%)                                                 34.34%            25.77%            †                 †                 90.38%            49.48%
Diarrhea (%), IBS-D and functional diarrhea together    7.07%             4.47%             †                 †                 28.4%             17.3%
Average bowel movements per day                         1.36 (0.53)       1.38 (0.61)       -                 -                 1.60 (0.81)       1.92 (1.11)
Anti-TNF-α (%)                                          -                 -                 †                 †                 -                 -
Mesalazine (%)                                          -                 -                 †                 †                 -                 -
Methotrexate (%)                                        -                 -                 †                 †                 -                 -
Steroids (%)                                            -                 -                 †                 †                 -                 -
Thiopurines (%)                                         -                 -                 †                 †                 -                 -

† Cohort 2 lists nine percentages per group, covering gender, antibiotics, IBD, one of IBS or diarrhea, and the five IBD medications. Apart from IBD (100.00%), the printed layout does not show which value belongs to which row. In printed order, PPI users: 0.00%, 61.67%, 31.67%, 38.33%, 26.67%, 16.67%, 30.00%, 21.67%, 100.00%; non-PPI users: 0.00%, 5.42%, 39.17%, 16.67%, 28.75%, 39.58%, 20.42%, 37.08%, 100.00%.

Composition of the gut microbiota

The predominant phylum in each cohort was Firmicutes, with abundances of 76.7%, 73.8% and 77.4% in Cohorts 1, 2 and 3, respectively. Information on the composition of the gut microbiome for all three cohorts and on all taxonomic levels is provided in Supplementary Figures S1, S2 and Supplementary Table S1. Independent of PPI use, the overall high-level bacterial composition of the gut was homogeneous in all three cohorts (at phylum, class, and order level, Spearman correlations: rho > 0.94; P < 1.6 x 10^-13).

Reduced Diversity of the Gut Microbiome Associated with PPI Use

In each of the three cohorts, PPI users had a lower species richness and a lower Shannon diversity, although the differences were not significant within any single cohort (Cohort 1, P = 0.85; Cohort 2, P = 0.16; Cohort 3, P = 0.53). In the meta-analysis of all 1815 gut microbiome samples, however, we observed a moderate but significant decrease in the gut alpha diversity of PPI users: Shannon index (P = 0.01) and species richness (P = 0.02) (Supplementary Figures S3 and S4).
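The two alpha diversity measures used here can be illustrated with a minimal sketch. The study computed them with QIIME; the code below is only an assumption-laden illustration of the definitions (richness as the number of taxa observed, Shannon index as the entropy of the relative abundances). The function names, the use of the natural logarithm, and the example counts are hypothetical and not taken from the study.

```python
import math


def richness(counts):
    """Number of taxa observed at least once in a sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts.

    The natural logarithm is an assumption; other tools may use log base 2.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical read counts per taxon (illustration only, not study data).
even_sample = [25, 25, 25, 25]    # four equally abundant taxa
skewed_sample = [97, 1, 1, 1]     # same richness, one dominant taxon

even_h = shannon_index(even_sample)
skewed_h = shannon_index(skewed_sample)
```

Both hypothetical samples have the same richness, but the sample dominated by one taxon has the lower Shannon index, which is why the two measures are reported side by side.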
Meta-analysis: Differences in gut microbiome associated with PPI use

The meta-analysis across all three cohorts showed statistically significant alterations in the abundance of 92 of the 460 bacterial taxa (FDR < 0.05). These changes are depicted in a cladogram in Figure 1, in a heatmap in Figure 2, and in Supplementary Figure S5. Details of each taxon, including the individual direction, coefficient, P-value and FDR for each cohort, as well as for the meta-analysis, are provided in Supplementary Tables S2 and S3. Cochran's Q test was used to check for heterogeneity. None of the 92 reported associations was significantly heterogeneous at the Bonferroni-corrected P-value cut-off (P < 5.43 x 10^-4) (Supplementary Table S2).
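The heterogeneity check can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that Cochran's Q is computed from per-cohort coefficients with inverse-variance weights, and it uses the fact that with three cohorts Q has 2 degrees of freedom, for which the chi-square tail probability is exactly exp(-Q/2). The function names and the example coefficients and standard errors are hypothetical.

```python
import math


def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cut-off after Bonferroni correction."""
    return alpha / n_tests


def cochran_q(effects, std_errors):
    """Cochran's Q and the inverse-variance pooled effect for k cohorts."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    return q, pooled


def q_pvalue_three_cohorts(q):
    """P-value of Q for k = 3 cohorts (chi-square, 2 degrees of freedom).

    For 2 degrees of freedom the survival function is exactly exp(-q / 2).
    """
    return math.exp(-q / 2.0)


# 92 significant taxa were tested for heterogeneity: 0.05 / 92 is about 5.43e-4.
cutoff = bonferroni_cutoff(0.05, 92)

# Hypothetical coefficients and standard errors for one taxon in three cohorts.
q_stat, pooled_effect = cochran_q([0.50, 0.42, 0.61], [0.10, 0.20, 0.25])
p_heterogeneity = q_pvalue_three_cohorts(q_stat)
is_heterogeneous = p_heterogeneity < cutoff
```

In this hypothetical example the three coefficients agree well, so the heterogeneity P-value stays far above the cut-off.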

Figure 1. PPI-associated statistically significant differences in the gut microbiome. Meta-analysis of three independent cohorts comprising 1815 fecal samples, showing a cladogram (circular hierarchical tree) of 92 significantly increased or decreased bacterial taxa in the gut microbiome of PPI users compared to non-users (FDR < 0.05). Each dot represents a bacterial taxon. The two innermost dots represent the highest level of taxonomy in our data: the kingdoms Archaea and Bacteria (prokaryotes), followed outwards by the lower levels: phylum, class, order, family, genus and species. Red dots represent significantly increased taxa. Blue dots represent significantly decreased taxa.

Figure 2. Significantly altered families in PPI users, consistent in three cohorts. Meta-analysis of three independent cohorts comprising 1815 fecal samples. The heatmap shows 19 families significantly increased or decreased in association with PPI use in the gut microbiome, for each cohort and for the meta-analysis (meta-analysis FDR < 0.05).

Figure 3. Principal Coordinate Analysis of 1815 gut microbiome samples and 116 oral microbiome samples. The gut microbiome of PPI users is significantly different from that of non-PPI users in the first coordinate (PCoA1: P = 1.39 x 10^-20, Wilcoxon test). For Principal Coordinate 1 there is a significant shift of the gut microbiome of PPI users towards the oral microbiome.

The overall difference of the gut microbiome associated with PPI use was also observed in the PCoA of all the datasets together (Figure 3 and Supplementary Figure S6). The same PCoA with separate colors for each cohort is shown in Supplementary Figure S7. Notably, we observed statistically significant differences between PPI users and non-users in two principal coordinates (PCoA1: P = 1.39 x 10^-20, PCoA3: P = 0.0004, Wilcoxon test).
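A comparison of principal coordinate scores between two groups can be sketched as a rank-sum test. The code below is a minimal sketch under stated assumptions: it uses the Wilcoxon-Mann-Whitney U statistic with a normal approximation, without tie or continuity correction, which may differ from the exact procedure used in the study. The function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided rank-sum test via normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return u1, z, p


# Hypothetical PCoA1 scores (illustration only, not study data).
ppi_scores = [0.21, 0.35, 0.18, 0.40, 0.29]
non_ppi_scores = [0.02, -0.10, 0.05, -0.03, 0.11, 0.00]
u_stat, z_score, p_value = mann_whitney_u(ppi_scores, non_ppi_scores)
```

With samples as small as in this example an exact test would be preferable; the normal approximation is shown because it is the form that scales to cohorts of this study's size.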

Similar changes in three independent cohorts were associated with PPI use

The order Actinomycetales, the families Streptococcaceae and Micrococcaceae, the genus Rothia, and the species Lactobacillus salivarius were increased in participants using PPI in each cohort. None of the individual cohorts contained any significantly decreased taxa (FDR < 0.05). In the general population (Cohort 1), 41 of the 829 bacterial taxa were significantly increased, including the class Gammaproteobacteria, the family Enterococcaceae, and the genera Streptococcus, Veillonella and Enterococcus (FDR < 0.05) (Supplementary Table S4). No effects due to PPI dosage were observed in the associated bacteria. In IBD patients (Cohort 2), PPI use was associated with an increase of 12 of the 667 bacterial taxa, including the family Lactobacillaceae as well as the genera Streptococcus and Lactobacillus (FDR < 0.05) (Supplementary Table S5). In IBS case-control Cohort 3, 18 of the 624 taxa were significantly increased, including the order Lactobacillales (FDR < 0.05) (Supplementary Table S6).

Oral cavity bacteria are more abundant in the gut microbiome of PPI users

We hypothesized that the changes in the gut microbiome associated with PPI use are caused by reduced acidity of the stomach and the subsequent survival of more bacteria that are ingested with food and oral mucus. Indeed, some of the statistically significantly increased bacteria in PPI users (e.g. Rothia dentocariosa, Rothia mucilaginosa, the genera Scardovia and Actinomyces and the family Micrococcaceae) are typically found in the oral microbiome.15 By analyzing 116 oral microbiome samples from participants in Cohort 1, we could compare the overall composition of bacteria in the oral microbiome to the composition of the gut microbiome.
We observed a statistically significant shift in Principal Coordinate 1 in the gut microbiome samples of the PPI users towards the oral samples, compared to non-PPI users (P = 1.39 x 10^-20, Wilcoxon test) (Figure 3). In Supplementary Figure S8, the overrepresentation of oral cavity bacteria in the guts of PPI users is depicted in a cladogram.

PPI use is independent of bowel movement frequency and stool consistency

Some of the significantly increased taxa were more abundant in the small intestine.11 To ensure that the changes observed in microbiota composition were not due to diarrhea and/or more frequent bowel movements, we checked in our general population cohort whether clinical symptoms of diarrhea were more often present in PPI users. Diarrheal complaints (IBS-D and functional diarrhea, P = 0.22, Fisher's exact test), stool consistency as defined by the Bristol Stool Scale (rho = 0.027, P = 0.36, Spearman correlation) and defecation frequency (rho = -0.001, P = 0.98, Spearman correlation) of the participants in Cohort 1 were not related to PPI use.

PPI, antibiotics and other commonly used drugs

In Cohort 1, sixteen taxa were associated with antibiotics and other commonly used drug categories besides PPI (Supplementary Table S7). After correction for PPI use, only six taxa remained associated with certain drugs: statins, fibrates and drugs that change bowel movements. All 92 alterations in bacterial taxa associated with PPI use remained statistically significant when we corrected the microbiome analyses for antibiotics and other commonly used drugs.

Conclusions

We show that PPI use is consistently associated with profound changes in the gut microbiome. In our study these changes were more prominent than changes associated with either antibiotics or other commonly used drugs. While PPI have proven to be useful in the prevention and treatment of ulcers and GERD, they have also been associated with an increased risk of C. difficile, Salmonella spp., Shigella spp., Campylobacter spp., and other enteric infections.4,5,7–9

The increased risk of acquiring one of these enteric infections is likely due to changes in the PPI user's gut microbiome. Gut microbiota can resist or promote colonization by C. difficile and other enteric pathogens through mechanisms that either directly inhibit bacterial growth or enhance the immune system.10–13 In the case of C. difficile, spores might be able to germinate more easily because of metabolites synthesized by certain gut bacteria.12,13

We hypothesized that PPI change the gut microbiome through their direct effect on stomach acid. This acidity forms one of the main defenses against the bacterial influx that accompanies ingesting food and oral mucus. PPI reduce the acidity of the stomach, allowing more bacteria to survive this barrier. We have shown here that species in the oral microbiome are more abundant in the gut microbiome of PPI users. Moreover, a study looking into the effect of PPI on the esophageal and gastric microbiome in oesophagitis and Barrett's oesophagus showed similar bacterial taxa associated with PPI use, including increased levels of Enterobacteriaceae, Micrococcaceae, Actinomycetaceae and Erysipelotrichaceae.16 Gastric bypass surgery compromises the stomach acid barrier and leads to gut microbiome changes similar to the PPI-associated alterations in this study, thereby supporting our hypothesis.17

We looked at the role of the gut microbiome in C. difficile infections, which cause 12.1% of all nosocomial infections and were responsible for half a million infections and associated with 29,000 deaths in the United States in 2011.18,19 Virulent strains of C. difficile can only colonize a susceptible gut, after which toxins are produced and spores are shed. This leads to a wide spectrum of symptoms varying from mild diarrhea to fulminant relapsing diarrhea and pseudomembranous colitis.20

Recent human, animal and in vitro studies show an overlap between the specific alterations in the gut microbiota associated with PPI use found in this study and bacterial changes that lead to increased susceptibility to C. difficile. The reduced alpha diversity in PPI users is associated with increased susceptibility to C. difficile infection.13,21,22 The PPI-associated decreases of the family Ruminococcaceae and the genus Bifidobacterium, as well as the PPI-associated increases of the class Gammaproteobacteria, the families Enterobacteriaceae, Enterococcaceae and Lactobacillaceae, and the genera Enterococcus and Veillonella, have been consistently linked to increased susceptibility to C. difficile infection (Table 2).10,13,21–27

The Ruminococcaceae family is significantly decreased in C. difficile patients and enriched in healthy controls.22,24,26 Moreover, mice that have been treated with a mixture of antibiotics and that do not become clinically ill after a challenge with C. difficile have higher levels of Ruminococcaceae.23 Within the Ruminococcaceae family, the Faecalibacterium genus was significantly increased in patients who recovered from C. difficile illness, whereas it was severely decreased in C. difficile patients with active disease.26 Last, a decreased Ruminococcus torques OTU was significantly associated with C. difficile infection in another study, although that study picked OTUs using a different reference database and tested associations at the OTU level, making direct comparisons with our study difficult.13

Table 2. Taxa and microbiome aspects associated with both PPI use and increased risk of C. difficile infection. For each taxon or microbiome aspect, the table lists the direction that increases C. difficile infection risk and the references on its role in the risk of C. difficile infection.

Alpha diversity
Direction: Reduced
References: Buffie et al. Nature. 2015; Chang et al. J. Infect. Dis. 2008; Antharam et al. J. of Clinical Microbiology. 2013

k__Bacteria p__Firmicutes c__Clostridia o__Clostridiales f__Ruminococcaceae
Direction: Decreased
References: Buffie et al. Nature. 2015 (Extended Figure 3d and 3e); Reeves et al. Gut Microbes. 2011; Antharam et al. J. of Clinical Microbiology. 2013; Rea et al. J. of Clinical Microbiology. 2011; Schubert et al. Mbio. 2014

k__Bacteria p__Actinobacteria c__Actinobacteria o__Bifidobacteriales f__Bifidobacteriaceae g__Bifidobacterium
Direction: Decreased
References: Buffie et al. Nature Reviews Immunology. 2013; Rea et al. J. of Clinical Microbiology. 2011; Baines et al. J. of Antimicrobial Chemotherapy. 2013

k__Bacteria p__Firmicutes c__Bacilli o__Lactobacillales f__Enterococcaceae g__Enterococcus
Direction: Increased
References: Buffie et al. Nature. 2015 (Extended Figure 3d and 3e); Antharam et al. J. of Clinical Microbiology. 2013; Rea et al. J. of Clinical Microbiology. 2011 (Fig. 4); Baines et al. J. of Antimicrobial Chemotherapy. 2013; Schubert et al. Mbio. 2014

k__Bacteria p__Firmicutes c__Bacilli o__Lactobacillales f__Lactobacillaceae, g__Lactobacillus, s__delbrueckii, s__plantarum and s__reuteri
Direction: Increased
References: Buffie et al. Nature Reviews Immunology. 2013; Buffie et al. Nature. 2015; Reeves et al. Gut Microbes. 2011; Antharam et al. J. of Clinical Microbiology. 2013; Rea et al. J. of Clinical Microbiology. 2011

k__Bacteria p__Firmicutes c__Clostridia o__Clostridiales f__Veillonellaceae g__Veillonella
Direction: Increased
References: Antharam et al. J. of Clinical Microbiology. 2013

k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacteriales f__Enterobacteriaceae g__Escherichia s__coli
Direction: Increased
References: Antharam et al. J. of Clinical Microbiology. 2013; Reeves et al. Gut Microbes. 2011; Schubert et al. Mbio. 2014; Peterfreund et al. PLOS ONE. 2012

Several species of the Bifidobacterium genus (Bifidobacterium longum, Bifidobacterium lactis, Bifidobacterium pseudocatenulatum, Bifidobacterium breve, Bifidobacterium pseudolongum, Bifidobacterium adolescentis and Bifidobacterium animalis lactis) have been shown to inhibit or prevent C. difficile infection.10 The administration of antibiotics that enhance the susceptibility to C. difficile in an in vitro model of the gut also significantly reduces the genus Bifidobacterium.25 Moreover, active C. difficile diarrhea is associated with decreased Bifidobacteria in elderly patients.24

The class Gammaproteobacteria and the family Enterobacteriaceae are both significantly increased in PPI users. Gammaproteobacteria are enriched in C. difficile patients compared to healthy controls.22 Within the class Gammaproteobacteria, the family Enterobacteriaceae dominates the murine gut microbiome after administration of clindamycin. Mice that became clinically ill after the administration of an antibiotic cocktail containing clindamycin and a C. difficile challenge had profoundly increased levels of Enterobacteriaceae in their gut microbiome, while mice that did not become clinically ill had a gut microbiome that predominantly consisted of Firmicutes.23 The family Enterobacteriaceae is also increased in hamsters that were treated with clindamycin and subsequently infected with C. difficile.27

The Enterococcus genus, which is also more abundant in PPI users, is significantly enriched in C. difficile-infected patients compared to healthy controls.22,26 An Enterococcus faecalis OTU and an Enterococcus avium OTU are both significantly associated with increased susceptibility to C. difficile infections in mice.13 Moreover, an Enterococcus avium OTU is also significantly associated with C. difficile in humans.13 The administration of the antibiotic ceftriaxone led to an increase in the genus Enterococcus and enhanced the susceptibility to C. difficile in an in vitro model of the gut.25

An increased abundance of the family Lactobacillaceae, as observed in PPI users, has been associated with increased risk of C. difficile infection in several studies. Mice treated with a cocktail of antibiotics (consisting of kanamycin, gentamycin, colistin, metronidazole and vancomycin), cefoperazone or a combination of clindamycin and cefoperazone have higher levels of Lactobacillaceae in their gut.23 Mice treated with cefoperazone and clindamycin that developed C. difficile infection after being challenged with the pathogen also had a higher level of Lactobacillaceae.23 Within the Lactobacillaceae family, the Lactobacillus genus is significantly enriched in C. difficile infection patients compared to healthy controls.22 Lactobacillus spp. in the gut microbiome are also associated with active C. difficile diarrhea in patients.24 In contrast to these studies, the Lactobacillus species Lactobacillus delbrueckii and Lactobacillus plantarum and a Lactobacillus reuteri OTU increased colonization resistance to C. difficile.10,13 However, in concordance with increased risk, a Lactobacillus johnsonii OTU enhanced C. difficile infection.13

Last, the Veillonella genus that is increased in PPI users is significantly enriched in C. difficile patients compared to healthy controls.22

The prevention of healthcare-associated C. difficile infections is a priority in the United States and reduction targets for 2020 have been established.5,28 A recent study looking into the effect of PPI on the risk of developing recurrent C. difficile infections found that of 191 PPI users admitted to a hospital, only 47.1% had an evidence-based indication for PPI use.5 Moreover, PPI use was discontinued in only 0.6% of the cases.5 The U.S. Food and Drug Administration already recommends limiting PPI use to a minimum dose and duration.29 Despite these recommendations, PPI are still often over-prescribed.5,6 The risk of unnecessary antibiotic use is already being addressed.30 However, limiting the unnecessary use of PPI should also be considered in preventing C. difficile and other enteric infections.

The microbiome is being intensively studied in various diseases and conditions including IBD, IBS, obesity, old age, non-alcoholic steatohepatitis (NASH) and non-alcoholic fatty liver disease (NAFLD).31 PPI users are overrepresented in these groups as they are more likely to have gastrointestinal complaints or experience GERD, either due to their health condition or their associated lifestyle. Prominent microbiome studies looking into obesity, IBD and NAFLD include results that researchers have attributed to the condition under study, but that we show are also associated with PPI use.32,33 It could well be that some of the observed effects should rather have been attributed to the use of PPI. Future microbiome studies in humans should therefore always take the effect of PPI on the gut microbiome into account.

This paper reports the largest study to date investigating the influence of proton pump inhibitors on the gut microbiome. The profound alterations seen in the gut microbiome could be linked to the increased risk of C. difficile and other enteric infections. Given the widespread use of PPI, the morbidity and mortality associated with enteric infections, and the increasing number of studies investigating the microbiome, both healthcare practitioners and microbiome researchers should be fully aware of the influence of PPI on the gut microbiome.

Methods

Cohorts

We studied the effect of PPI use on the gut microbial composition in three independent cohorts from the Netherlands. These cohorts together comprise 1815 adult individuals, including both healthy subjects and patients with gastrointestinal diseases. Cohort 1 consists of 1174 individuals who participate in the general population study LifeLines-DEEP in the northern provinces of the Netherlands.34 Cohort 2 consists of 300 Inflammatory Bowel Disease (IBD) patients from the Department of Gastroenterology and Hepatology of the University Medical Center Groningen (UMCG), the Netherlands.
Cohort 3 consists of 189 Irritable Bowel Syndrome (IBS) patients and 152 matched controls from Maastricht University Medical Center (MUMC), the Netherlands. This study was approved by the institutional review boards of the UMCG and the MUMC (MUMC http://www.clinicaltrials.gov, NCT00775060). All participants signed an informed consent form.

Medication use

Current medication use at the time of stool collection of Cohort 1 participants was extracted from a standardized questionnaire.35 Two medical doctors reviewed all the medication for the 1174 participants. PPI use was scored if participants used omeprazole, esomeprazole, pantoprazole, lansoprazole, dexlansoprazole or rabeprazole. To exclude other possible drug effects on the gut microbiome, medication use was scored in eight categories, allowing for later correction of parameters or exclusion of certain participants. These categories were: (1) medication that changes bowel movement or stool frequency, (2) medication that lowers triglyceride levels, (3) medication that lowers cholesterol levels, (4) anti-diabetic medication (both oral and insulin), (5) systemic anti-inflammatory medication (excluding NSAIDs), (6) topical anti-inflammatory medication, (7) systemic antibiotics, including antifungal and antimalarial medication, and (8) antidepressants, including serotonin-specific reuptake inhibitors (SSRIs), serotonin-norepinephrine reuptake inhibitors (SNRIs), mirtazapine, and tricyclic antidepressants (TCAs). The definitions of these categories are described in the Supplementary Appendix. Analysis of drugs used in Cohort 2 was based on the IBD-specific electronic patient record in the UMCG. Current PPI use, as well as current IBD medication (mesalazines, thiopurines, methotrexate, steroids, TNF-alpha inhibitors and other biologicals), was scored at the time of sampling by the gastroenterologist treating the IBD patient. Current PPI consumption in the IBS case-control Cohort 3 was based on self-reported questionnaires.
Pseudonymized data for all three cohorts was provided to the researchers.

Gut complaints and other clinical characteristics

Information on age, gender and BMI was available for all three cohorts. In Cohort 1, gut complaints were investigated using an extensive questionnaire that included defecation frequency and the Bristol Stool Scale. Possible IBS and functional diarrhea or constipation were determined using self-reported ROME III criteria. The IBD patients in Cohort 2 were diagnosed based on accepted radiological, endoscopic, and histopathological evaluation. All the IBD cases included in our study fulfilled the clinical criteria for IBD. IBS in Cohort 3 was diagnosed by a gastroenterologist according to the ROME III criteria.

Stool and oral cavity mucus sample collection

A total of 1815 stool samples and 116 oral cavity mucus samples were collected. Cohorts 1 and 2 used identical protocols to collect the stool samples. Participants of Cohorts 1 and 2 were asked to collect one stool sample at home. Stool samples were frozen within 15 minutes after stool production in the participants' home freezer and remained frozen until DNA isolation. A research nurse visited all participants to collect the stool samples shortly after production, and the samples were transported and stored at –80°C. Participants of Cohort 3 were asked to bring a stool sample to the research facility within 24 hours after stool production. These samples were frozen at –80°C immediately upon arrival. Oral cavity mucus samples were collected from 116 additional healthy volunteers using buccal swabs.

DNA isolation and analysis of the microbiota composition

Microbial DNA from stool samples was isolated with the Qiagen AllPrep DNA/RNA Mini Kit (cat. # 80204). DNA isolation from oral cavity swabs was performed using the UltraClean microbial DNA isolation kit (cat. # 12224) from MoBio Laboratories (Carlsbad, CA, USA).
To determine the bacterial composition of the stool and oral cavity mucus samples, sequencing of the variable region V4 of the 16S rRNA gene was performed using Illumina MiSeq. DNA isolation is described in the Methods section of the Supplementary Appendix.

Taxonomy determination

Bacterial taxonomy was determined by clustering the sequence reads with UCLUST (version 1.2.22q) with a similarity threshold of 97%, using Greengenes (version 13.8) as the taxonomy reference database. Sequencing and the determination of taxonomy are described in the Methods section of the Supplementary Appendix.

Statistical analysis

In each cohort, differentially abundant taxa in the gut microbiome between PPI users and non-PPI users were analyzed using the multivariate statistical framework MaAsLin.33 MaAsLin performs boosted, additive, general linear models between meta-data and microbial abundance data. After running the association studies in the individual cohorts, we performed a meta-analysis of the three cohorts, using the weighted Z-score method. Cochran's Q test was used to check for heterogeneity. The significance cut-off for Cochran's Q test was determined by Bonferroni correction for the 92 significant results: P < 5.43 x 10^-4. Differences in richness (the number of species within a sample), principal coordinate analyses (PCoA), and Shannon diversity analysis were determined using the QIIME microbiome analysis software.36 The Wilcoxon test and Spearman correlations were used to identify differences in Shannon's diversity and relations between the PCoA scores of PPI users and non-PPI users, while the Chi-square test, Fisher's exact test, Spearman correlation and Wilcoxon-Mann-Whitney test (WMW test) were used to determine differences in age, gender, BMI, antibiotics use, and gut complaints between PPI users and non-users. In all the microbiome analyses, multiple test corrections were based on the false discovery rate (FDR).
An FDR-value of 0.05 was used as a significance cut-off.
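The meta-analysis and multiple-testing steps described above can be sketched in a few lines. This is a hedged illustration rather than the pipeline used in the study: it assumes that the weighted Z-score method weights each cohort by the square root of its sample size, and that the FDR is computed with the Benjamini-Hochberg procedure; neither detail is stated in the text. The function names and all example P-values are hypothetical.

```python
import math
from statistics import NormalDist


def weighted_z(p_values, directions, sample_sizes):
    """Combine two-sided per-cohort P-values into one Z-score and P-value.

    directions holds +1 or -1 for the sign of the effect in each cohort.
    Weights are sqrt(sample size), which is an assumption of this sketch.
    """
    norm = NormalDist()
    numerator = 0.0
    for p, d, n in zip(p_values, directions, sample_sizes):
        z = -d * norm.inv_cdf(p / 2.0)  # signed Z-score from a two-sided P
        numerator += math.sqrt(n) * z
    z_meta = numerator / math.sqrt(sum(sample_sizes))
    p_meta = math.erfc(abs(z_meta) / math.sqrt(2.0))
    return z_meta, p_meta


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Cohort sizes from the study; the P-values and directions are hypothetical.
cohort_sizes = [1174, 300, 341]
z_meta, p_meta = weighted_z([0.05, 0.05, 0.05], [1, 1, 1], cohort_sizes)

# Hypothetical meta-analysis P-values for four taxa, filtered at FDR < 0.05.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in fdr]
```

In this hypothetical example, three cohorts that each reach only P = 0.05 in the same direction combine to a much smaller meta-analysis P-value, which mirrors how the diversity differences reached significance only in the combined analysis.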

In addition to the PPI effect, we also tested the influence of other commonly used drugs in Cohort 1. Using MaAsLin with similar settings to those described above, we tested the microbial changes associated with the use of other drugs, with and without correction for PPI, and the changes when including these common drugs as a correcting factor in the PPI versus non-PPI analysis. Significant results were graphically represented in cladograms using GraPhlAn.37 More details on the statistical analysis can be found in the Methods section (Supplementary Appendix).

Correction for factors influencing the gut microbiota

Differentially abundant taxa were corrected for several parameters, which were identified by statistical analysis of cohort phenotypes or univariate MaAsLin runs and subsequently added as co-factors to the additive linear model. Analyses in the general population Cohort 1 were corrected for age, gender, BMI, antibiotics use, sequence read depth, and ROME III diagnosis (IBS-Constipation (IBS-C), IBS-Diarrhea (IBS-D), IBS-Mixed (IBS-M), IBS-Undetermined (IBS-U), functional bloating, functional constipation, functional diarrhea, or none). The analysis of IBD patients in Cohort 2 was corrected for age, gender, BMI, antibiotics use, sequence read depth, diagnosis (Crohn's disease or ulcerative colitis) combined with disease location (colon, ileum or both) and IBD medication (use of mesalazines, steroids, thiopurines, methotrexate or anti-TNF antibodies). The analysis of the IBS case-control Cohort 3 was corrected for age, gender, BMI, sequence read depth, and IBS status according to the ROME III criteria. In the meta-analysis, all microbiome data were corrected for age, gender, BMI, antibiotics use, and sequence read depth.

Author contributions

A.Z., R.K.W., C.W. and D.J. designed the study. F.I., E.F.T., S.A.J., Z.M., L.V., M.C.C., and G.D. acquired the data; F.I., M.J.B. and A.V.V. analysed and interpreted the data; F.I.
drafted the manuscript; M.J.B. and A.V.V. performed the statistical analysis; A.Z., R.K.W., C.W., D.J., G.D., L.F., J.F., H.J.M.H. and R.J.X. critically revised the manuscript; A.Z., R.K.W., C.W., D.J., L.F. and J.F. obtained funding; A.Z., R.K.W., C.W. and D.J. supervised the study. Funding Sequencing of the LifeLines-DEEP and IBS MUMC cohorts was funded by the Top Institute Food and Nutrition grant GH001 to CW. CW is further supported by an ERC advanced grant ERC-671274, SZ holds a Rosalind Franklin fellowship (University of Groningen) and MC holds a postdoctoral fellowship from the Spanish Fundación Alfonso Martín Escudero. RKW, JF and LF are supported by VIDI grants (016.136.308, 864.13.013 and 917.14.374) from the Netherlands Organization for Scientific Research (NWO).

References
1. The Dutch Foundation for Pharmaceutical Statistics (SFK). Data and Facts on 2013. (2014).
2. Drugs.com. Top 100 sales in the United States in 2013. (2013).
3. Olbe, L., Carlsson, E. & Lindberg, P. A proton-pump inhibitor expedition: the case histories of omeprazole and esomeprazole. Nat. Rev. Drug Discov. 2, 132–139 (2003).
4. Moayyedi, P. & Talley, N. J. Gastro-oesophageal reflux disease. Lancet 367, 2086–2100 (2006).
5. McDonald, E. G., Milligan, J., Frenette, C. & Lee, T. C. Continuous Proton Pump Inhibitor Therapy and the Associated Risk of Recurrent Clostridium difficile Infection. JAMA Intern. Med. 175, 784–91 (2015).
6. Kelly, O. B., Dillane, C., Patchett, S. E., Harewood, G. C. & Murray, F. E. The Inappropriate Prescription of Oral Proton Pump Inhibitors in the Hospital Setting: A Prospective Cross-Sectional Study. Dig. Dis. Sci. 60, 2280–2286 (2015).
7. Bouwknegt, M., van Pelt, W., Kubbinga, M. E., Weda, M. & Havelaar, A. H. Potential association between the recent increase in campylobacteriosis incidence in the Netherlands and proton-pump inhibitor use - an ecological study. Eurosurveillance 19, 1–6 (2014).
8. Leonard, J., Marshall, J. K. & Moayyedi, P. Systematic review of the risk of enteric infection in patients taking acid suppression. Am. J. Gastroenterol. 102, 2047–2056 (2007).
9. Janarthanan, S., Ditah, I., Adler, D. G. & Ehrinpreis, M. N. Clostridium difficile-Associated Diarrhea and Proton Pump Inhibitor Therapy: A Meta-Analysis. Am. J. Gastroenterol. 107, 1001–1010 (2012).
10. Buffie, C. G. & Pamer, E. G. Microbiota-mediated colonization resistance against intestinal pathogens. Nat. Rev. Immunol. 13, 790–801 (2013).
11. Kamada, N., Chen, G. Y., Inohara, N. & Núñez, G. Control of pathogens and pathobionts by the gut microbiota. Nat. Immunol. 14, 685–90 (2013).
12. Britton, R. A. & Young, V. B. Role of the intestinal microbiota in resistance to colonization by Clostridium difficile. Gastroenterology 146, 1547–1553 (2014).
13. Buffie, C. G. et al. Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature 517, 205–8 (2015).
14. van Nood, E. et al. Duodenal infusion of donor feces for recurrent Clostridium difficile. N. Engl. J. Med. 368, 407–15 (2013).
15. Segata, N. et al. Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol. 13, R42 (2012).
16. Amir, I., Konikoff, F. M., Oppenheim, M., Gophna, U. & Half, E. E. Gastric microbiota is altered in oesophagitis and Barrett’s oesophagus and further modified by proton pump inhibitors. Environ. Microbiol. 16, 2905–2914 (2014).
17. Zhang, H. et al. Human gut microbiota in obesity and after gastric bypass. Proc. Natl. Acad. Sci. U.S.A. 106, 2365–2370 (2009).
18. Lessa, F. C. et al. Burden of Clostridium difficile Infection in the United States. N. Engl. J. Med. 372, 825–834 (2015).
19. Magill, S. S. et al. Multistate Point-Prevalence Survey of Health Care–Associated Infections. N. Engl. J. Med. 370, 1198–1208 (2014).
20. Rupnik, M., Wilcox, M. H. & Gerding, D. N. Clostridium difficile infection: New developments in epidemiology and pathogenesis. Nat. Rev. Microbiol. 7, 526–536 (2009).
21. Chang, J. Y. et al. Decreased diversity of the fecal Microbiome in recurrent Clostridium difficile-associated diarrhea. J. Infect. Dis. 197, 435–438 (2008).
22. Antharam, V. C. et al. Intestinal dysbiosis and depletion of butyrogenic bacteria in Clostridium difficile infection and nosocomial diarrhea. J. Clin. Microbiol. 51, 2884–2892 (2013).
23. Reeves, A. E. et al. The interplay between microbiome dynamics and pathogen dynamics in a murine model of Clostridium difficile infection. Gut Microbes 2, 145–158 (2011).
24. Rea, M. C. et al. Clostridium difficile carriage in elderly subjects and associated changes in the intestinal microbiota. J. Clin. Microbiol. 50, 867–875 (2012).
25. Crowther, G. S. et al. Evaluation of NVB302 versus vancomycin activity in an in vitro human gut model of Clostridium difficile infection. J. Antimicrob. Chemother. 68, 168–176 (2013).
26. Schubert, A. M. et al. Microbiome Data Distinguish Patients with Clostridium difficile Infection and Non-C. difficile-Associated Diarrhea from Healthy. MBio 5, 1–9 (2014).
27. Peterfreund, G. L. et al. Succession in the Gut Microbiome following Antibiotic and Antibody Therapies for Clostridium difficile. PLoS One 7, (2012).
28. Department of Health and Human Services. Request for Comments on the Proposed 2020 Targets for the National Action Plan To Prevent Health Care-Associated Infections: Road Map To Elimination (Phase I: Acute Care Hospital) Measures. (2014). Available at: https://www.federalregister.gov/articles/2014/02/25/2014-04069/request-for-comments-on-the-proposed-2020-targets-for-the-national-action-plan-to-prevent-health.
29. FDA. Drug Safety and Availability - FDA Drug Safety Communication: Clostridium difficile-associated diarrhea can be associated with stomach acid drugs known as proton pump inhibitors (PPIs). (2012). Available at: http://www.fda.gov/Drugs/DrugSafety/ucm290510.htm.
30. Blaser, M. Stop the killing of beneficial bacteria. Nature 476, 393–394 (2011).
31. Mehal, W. Z. The Gordian Knot of dysbiosis, obesity and NAFLD. Nat. Rev. Gastroenterol. Hepatol. 10, 637–44 (2013).
32. Goodrich, J. K. et al. Human genetics shape the gut microbiome. Cell 159, 789–799 (2014).
33. Morgan, X. C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).
34. Tigchelaar, E. F. et al. An introduction to LifeLines DEEP: study design and baseline characteristics. bioRxiv (Cold Spring Harbor Labs Journals, 2014). doi:10.1101/009217
35. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
36. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
37. Asnicar, F., Weingart, G., Tickle, T. L., Huttenhower, C. & Segata, N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ 3, e1029 (2015).

Figure Legends

Acknowledgements
We thank all the participants of the IBS, IBD and Lifelines-DEEP studies for contributing samples; Astrid Maatman, Tiffany Poon, Wilma Westerhuis, Daan Wiersum, Debbie van Dussen, Martine Hesselink and Jackie Dekens for logistics support; Timothy Tickle, Curtis Huttenhower, Alexandra Sirota, Chengwei Luo, Dirk Gevers and Aleksander Kostic for their help in training the first authors; Hendrik van Dullemen and Rinze ter Steege for including IBD patients; Marten Hofker and Eelke Brandsma for contributing to the scientific discussion and Jackie Senior and Kate Mc Intyre for editing the manuscript.

Description of supplementary data files
The following additional data are available with the online version of this paper.

Supplementary data: Supplementary Appendix. Online Methods
Supplementary Table S1 Taxonomic comparison of cohort 1, 2 and 3
Supplementary Table S2 Outcome meta-analysis: All bacterial taxa
Supplementary Table S3 Outcome meta-analysis: Annotation
Supplementary Table S4 MaAsLin results: Cohort 1 LifeLines-DEEP
Supplementary Table S5 MaAsLin results: Cohort 2 IBD UMCG
Supplementary Table S6 MaAsLin results: Cohort 3 IBS MUMC
Supplementary Table S7 Cohort 1 medication influencing the microbiome
Supplementary Figure S1 Bar charts: Gut microbiome composition phylum level
Supplementary Figure S2 Bar charts: Gut microbiome composition class level
Supplementary Figure S3 Alpha diversity: Shannon index
Supplementary Figure S4 Alpha diversity: Richness
Supplementary Figure S5 Heatmap all significant associated taxa in all cohorts
Supplementary Figure S6 PCoA component 1 and component 3
Supplementary Figure S7 PCoA separate for individual cohorts
Supplementary Figure S8 Cladogram: Oral cavity bacteria marked

The gut microbiome contributes to a substantial proportion of the variation in blood lipids
Circulation Research, DOI: 10.1161/CIRCRESAHA.115.306807

Jingyuan Fu1,2, Marc Jan Bonder2, Maria C. Cenit2, Ettje Tigchelaar2,3, Astrid Maatman2, Jackie A.M. Dekens2,3, Eelke Brandsma1, Joanna Marczynska2,4, Floris Imhann5, Rinse K. Weersma5, Lude Franke2, Tiffany W. Poon6, Ramnik J. Xavier6,7,8, Dirk Gevers6, Marten H. Hofker1,*, Cisca Wijmenga2,*, Alexandra Zhernakova2,3,*

Abstract
Rationale: Evidence suggests the gut microbiome is involved in the development of cardiovascular disease (CVD), with the host-microbe interaction regulating immune and metabolic pathways. However, there was no firm evidence for associations between microbiota and metabolic risk factors for CVD from large-scale studies in humans. In particular, there was no strong evidence for association between CVD and aberrant blood lipid levels.
Objectives: To identify intestinal bacterial taxa whose proportions correlate with body mass index (BMI) and lipid levels, and to determine whether lipid variance can be explained by microbiota relative to age, gender and host genetics.
Methods and Results: We studied 893 subjects from the LifeLines-DEEP population cohort. After correcting for age and gender, we identified 34 bacterial taxa associated with BMI and blood lipids; most are novel associations. Cross-validation analysis revealed that microbiota explain 4.5% of the variance in BMI, 6% in triglycerides, and 4% in high-density lipoproteins (HDL), independent of age, gender and genetic risk factors. A novel risk model including the gut microbiome explained up to 25.9% of HDL variance, significantly outperforming the risk model without microbiome. Strikingly, the microbiome had little effect on low-density lipoproteins or total cholesterol.
Conclusions: Our studies suggest that the gut microbiome may play an important role in the variation in BMI and blood lipid levels, independent of age, gender and host genetics. Our findings support the potential of therapies altering the gut microbiome to control body mass, triglycerides and HDL.

1. University of Groningen, University Medical Center Groningen, Department of Pediatrics, Groningen, the Netherlands; 2. University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands; 3. Top Institute Food and Nutrition, Wageningen, the Netherlands; 4. Department of Immunology, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Krakow, Poland; 5. University of Groningen, University Medical Center Groningen, Department of Gastroenterology and Hepatology, Groningen, the Netherlands; 6. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA; 7. Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA; 8. Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA; *These authors jointly directed this study; Correspondence to: Dr. Jingyuan Fu, E-mail: [email protected]

Novelty and Significance

What Is Known?
• The human gut holds about 100 trillion bacteria, which together can weigh several pounds.
• This ecosystem (the microbiome) is shaped by early life events, the host genome, diet and other lifestyle factors.
• This bacterial community is associated with an individual’s susceptibility to many diseases, including cardiovascular diseases.

What New Information Does This Article Contribute?
• Healthy lipid levels are associated with increased microbial diversity.
• Body mass index (BMI) and blood lipids are associated with 34 different microbial taxonomies.
• A risk model including age, gender, genetic factors and gut microbiome explains a large part of the variation seen in BMI, triglycerides and high-density lipoprotein cholesterol.

The bacterial community in the human gut (known as the microbiome) has been referred to as an extra organ, or the second human genome, because of its important role in an individual’s health. As most of these bacteria cannot be cultured, we knew very little about their diversity and function until the recent development of innovative DNA sequencing technology. In our study we defined the microbial composition found in 893 human subjects by sequencing bacteria-specific 16S rRNA genes; we observed a large inter-individual variation in gut bacteria composition. We show that bacterial diversity is associated with blood lipid levels at the human population level, especially with the levels of triglycerides and high-density lipoprotein cholesterol, and we report significant associations for 34 bacterial taxonomies. Our findings suggest that microbial-intervention therapy has the potential to help control blood lipid levels and prevent disease.

Introduction

In recent years, the gut microbiome has emerged as an important player in human health.1,2 Gut microbiota comprise thousands of microbial species that are involved in host metabolism by regulating energy extraction, activation of the immune system, drug metabolism and other processes.3,4 Association of bacterial composition to many diseases has been observed, including immune, inflammatory and metabolic phenotypes.5–7 Several mechanisms for the downstream effect of microbiota were discovered that also suggest they play a role in cardiovascular disease (CVD). The microbiota play an important role in choline diet-induced trimethylamine N-oxide (TMAO) production, which has been implicated in CVD.8 A further mouse study has demonstrated that atherosclerosis susceptibility can be transmitted via gut microbiota transplantation.9 Further, dysbiosis in the gut has been shown to induce increased permeability of the intestine, leading to increased systemic levels of bacterial products causing low-grade chronic inflammation.10 This inflammation may directly affect atherogenesis and has also been hypothesized to lead to the development of insulin resistance with concomitant effects on plasma lipids.11 Gut microbiota have also been linked with lipid metabolism through their role in bile acid metabolism. They can also influence the efficiency of energy harvest from ingested food12,13 and play a crucial role in the metabolic processes and development of obesity. In line with these observations, altering the gut microbiome in humans and mice has shown improvement in metabolic syndrome.14–16 However, the evidence for a causal relationship between the gut microbiome and the development of CVD has not been firmly established for lack of large-scale human studies. Atherosclerosis, a lipid-driven disease, is the main underlying cause of CVD. However, to date, no studies of sufficient size have been done to assess the association between lipids and microbiota.
In this study, we performed a systematic analysis of host genome, gut microbiome, body mass index (BMI) and blood lipids

in 893 human subjects from the Dutch LifeLines-DEEP cohort.17 We investigated which gut bacteria were associated with BMI and blood lipids, and how much of the variation in blood lipids could be explained by the gut microbiome, relative to age, gender, body mass index and host genetics.

Results

Microbial Diversity in the LifeLines-DEEP Cohort
After quality control, our study included 893 human subjects. The study cohort had a wide range of age, BMI and blood lipid levels (Table 1). We assessed how variable the gut microbial composition was in the cohort in terms of microbial richness and diversity. The microbial richness reflects the number of OTUs per individual. The cohort had on average 238 OTUs per individual, ranging from 44 to 355. When individuals were grouped into different bins based on their richness, we observed that age and the proportion of females were higher in the richer OTU groups (Figure 1). The Spearman correlation showed that richness was significantly higher in women (P=0.0055) and increased with age (P=5.87x10^-12) (Online Table I). Given the abundance of OTUs, we computed the microbial diversity (Shannon’s diversity index) and observed similar significant correlations for age and gender (Online Table I). We then investigated whether bacterial richness and diversity were correlated with BMI and lipid levels. After correcting for age and gender, OTU richness was negatively correlated with BMI (P=3.8x10^-4) and TG (P=1.37x10^-4), but positively correlated with HDL (P=8.3x10^-4). We did not observe significant correlations between microbial richness and LDL or TC levels (Online Table I).
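The two per-individual measures used above, richness and Shannon's diversity index, can be sketched as follows. This is a minimal illustration, assuming one list of raw OTU read counts per individual and the natural logarithm; the log base and any rarefaction step used in the study are not stated here, and the function names and counts are hypothetical.

```python
import math


def richness(otu_counts):
    """Number of OTUs observed (count > 0) in one individual."""
    return sum(1 for count in otu_counts if count > 0)


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over observed OTUs.

    Assumption: natural logarithm; the source does not state the base.
    """
    total = sum(otu_counts)
    if total == 0:
        return 0.0
    h = 0.0
    for count in otu_counts:
        if count > 0:
            p = count / total  # relative abundance of this OTU
            h -= p * math.log(p)
    return h


# Hypothetical read counts for one individual across five OTUs.
sample = [120, 30, 0, 45, 5]
print(richness(sample), round(shannon_index(sample), 3))
```

Richness counts only presence, whereas the Shannon index also rewards evenness: four equally abundant OTUs give ln(4), while a single dominant OTU gives a value near zero.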

Table 1. Summary of physical characteristics of 893 LifeLines-DEEP subjects

                  Mean±s.d.                                   Range
                  Men          Women        Total            Men           Women          Total
                  (n=380)      (n=513)      (n=893)          (n=380)       (n=513)        (n=893)
Age in years      44.7±12.9    44.6±12.9    44.6±12.9        18-78         18-80          18-80
BMI               25.4±3.3     25.1±4.6     25.2±4.1         16.9-39.3     16.9-44.9      16.9-44.9
HDL-C (mmol/L)    1.35±0.32    1.69±0.43    1.54±0.42        0.6-2.5       0.7-3.3        0.6-3.3
LDL-C (mmol/L)    3.36±0.89    3.07±0.93    3.19±0.93        1.0-6.5       0.8-7.5        0.8-7.5
TC (mmol/L)       5.13±1.00    5.04±1.01    5.08±1.01        2.5-8.7       2.4-9.7        2.4-9.7
TG (mmol/L)       1.39±1.23    0.97±0.53    1.15±0.92        0.22-14.05    0.29-4.06      0.22-14.05
Log2-TG           0.20±0.81    -0.21±0.67   -0.04±0.76       -2.2 to 3.8   -1.79 to 2.02  -2.2 to 3.81

Figure 1. The richness of the gut microbiome. A. The microbial richness associated with age and gender. The bar plot shows the distribution of individuals binned to different groups of richness (x-axis: number of OTUs (richness); y-axes: number of individuals and age). The blue and red colors indicate the proportion of males and females in each group and the dark grey line indicates the correlation between the average age and richness, while the light grey shadow indicates the standard deviation of the age per richness bin.

Association of Bacteria to Lipid Metabolites
We next tested for association between the individual bacterial OTUs, BMI and blood lipid levels. After adjusting for age and gender, we identified 148 associated OTUs at FDR=0.05: 66 OTUs were associated with BMI, 114 with TG, and 34 with HDL (Online Tables II-IV). We did not detect any significant association at OTU level for LDL or TC. Of the 148 associated OTUs, 12 were shared by all three traits (BMI, TG and HDL); 29 OTUs were shared by BMI and TG and 4 by BMI and HDL, while 21, 64 and 9 OTUs were specifically associated with BMI, TG and HDL, respectively (Online Figure II). At the taxonomic level, we identified 50 significant associations for 34 unique taxonomies at FDR=0.05: 22 were associated with BMI, 23 with TG, 4 with HDL, and 1 with LDL (Figure 2, Online Table V). Of these, 18 associations (36%) were detected by binary analysis (presence/absence); 4 associations (8%) were detected by the quantitative model; and 28 associations (56%) were detected by the meta-analysis of binary and quantitative analyses (Online Table V). Although most of the associated taxonomies were shared across lipid metabolites and BMI, several microbes were predominantly linked to lipids rather than BMI. For example, the family Clostridiaceae/Lachnospiraceae (N16 in Figure 2) was specifically associated with LDL (P=9.1x10^-5) (Online Table V), and was not detected for BMI or other lipids. Further, the family Pasteurellaceae (N32) (Proteobacteria), genus Coprococcus (N24) (Firmicutes) and genus Collinsella species Stercoris (N2) showed strong association with TG levels (P=6.2x10^-5, P=4.6x10^-5 and P=0.0006, respectively), nominal association with other lipids, and no association with BMI (P>0.1). We confirmed several previously described bacterial associations with obesity. An increased abundance of genus Akkermansia (N34) was associated with a decrease in BMI (P=0.0005).18 We also confirmed the association of both the family Christensenellaceae (phylum Firmicutes) (N18) and the phylum Tenericutes (mainly represented by order RF-39) (N33) with low BMI (P=9.8x10^-7 and P=0.0002, respectively), as reported in the TwinsUK cohort.19 In addition, we identified a novel and strong association of these particular bacteria with lower levels of TG (P=2.1x10^-5 and P=2.7x10^-7, respectively), and higher levels of HDL (P=0.0047 and P=0.0006, respectively). We also observed several new associations with BMI and levels of TG and HDL, such as genus Eggerthella (N3) with increased TG (P=4.1x10^-5) and decreased HDL (P=6.3x10^-5), and family Pasteurellaceae (N32) with decreased TG (P=6.2x10^-5). The genus Butyricimonas (N9) was previously linked to a lean phenotype in mice after fecal transplantation from twins discordant for obesity.15 Our study shows that this genus is strongly associated with decreased TG (P=4.7x10^-6) and nominally associated with BMI and HDL in humans.

Figure 2. The effect of taxonomies on BMI and lipids. The effects of 34 taxonomies associated with BMI, TG and HDL are shown as Z-scores. Red sectors indicate positive associations and blue negative associations. Brighter colors indicate that the association was significant at FDR 0.05 level. Dashed circles indicate the scale of Z values from 1 to 5.
Panels: BMI, TG, HDL. Color key: Z < 0, FDR < 0.05; Z < 0, FDR > 0.05; Z > 0, FDR < 0.05; Z > 0, FDR > 0.05.
Taxonomy key:
1. k_Archaea
2. s_Stercoris
3. g_Eggerthella
4. o_Bacteroidales
5. f_Bacteroidaceae/
6. o__Bacteroidales: f_S24−7/Barnesiellaceae
7. g_Bacteroides
8. f_Odoribacteraceae
9. g_Butyricimonas
10. g_Odoribacter
11. f_Rikenellaceae
12. o__Bacteroidales: f_S24−7
13. p_Cyanobacteria
14. o_Gemellales/Bacillales
15. f_Mogibacteriaceae/Clostridiaceae/Lachnospiraceae
16. f_Clostridiaceae/Lachnospiraceae
17. f_Peptostreptococcaceae/Mogibacteriaceae/Clostridiaceae
18. f_Christensenellaceae
19. f_Clostridiaceae
20. f__Clostridiaceae: g_02d06
21. g_Dehalobacterium
22. f_Lachnospiraceae
23. g_Blautia
24. g_Coprococcus
25. g_Lachnospira
26. f_Erysipelotrichaceae: g_CC_115
27. f_Erysipelotrichaceae: g_Holdemania
28. o_Burkholderiales/Rhodocyclales
29. f_Desulfovibrionaceae
30. g_Bilophila
31. c_Gammaproteobacteria
32. f_Pasteurellaceae
33. p_Tenericutes
34. g_Akkermansia

Variance of Blood Lipid Explained by Microbiota Composition
To anticipate how much BMI and blood lipids can be modulated by the gut microbiome, it is important to estimate what proportion of variation in these metabolic traits can be explained by the microbiome. To do so, we performed a 100x cross-validation analysis by splitting the dataset randomly into an 80% discovery set and a 20% validation set. The OTUs identified at P=1x10^-5 level in the discovery set explained 2.74% of the variation in BMI in the validation set, 3.83% in TG, 2.46% in HDL, 0.01% in LDL, and 0.01% in TC. When the significance threshold was relaxed and the risk model included more (but less-significant) OTUs, the explained variation increased to 4.57% in BMI, 6.0% in TG and 4.0% in HDL, but was only 1.5% in LDL and 0.7% in TC (Figure 3A). To test the robustness of our estimation, we re-rarefied the OTU library 100 times and repeated the whole analysis. This approach yielded similar results, thereby confirming the robustness of our estimation (Online Figure III).

Microbiota Contribute Significantly to Lipid Variation, Independently of Age, Gender and Genetics
Evidence has already shown that the gut microbiome can be shaped by host genetics.19 We further tested whether the variation explained by the gut microbiome was independent of genetic factors by testing the association between the gut microbiome and genetic risk scores.20 To date, 157 genetic loci have been reported to be associated with blood lipid levels21 and 97 loci have been associated with BMI.22 In our cohort, these SNPs collectively explained 2.1% of the variation in BMI (P=1.66x10^-5), 3.4% in TG (P=3.22x10^-8), 7.5% in HDL (P<2.2x10^-16), 4.6% in LDL (P=8.0x10^-11), and 5.6% in TC (P=7.7x10^-13), after correcting for age and gender. However, we did not observe any significant association between the microbiome and the genetic risk at FDR=0.05. Nor did we find a significant association for either single SNPs (Online Table VI) or for the combined lipid and BMI genetic risk scores (Online Table VII). Our results indicated that the proportion of variation in BMI and lipid levels explained by the gut microbiome was different from that explained by genetic variation. Therefore, we further assessed whether

the microbiome could make a significant contribution to the explained variation beyond age, gender and genetic factors. Our analysis showed that age, gender, genetics and the gut microbiome collectively explained 11.3% of the variation in BMI, 17.1% in TG and 25.9% in HDL-cholesterol, with the microbiome making a significant contribution to the explained variation in BMI (P=4.1x10^-3), TG (P=4.5x10^-4) and HDL (P=2.7x10^-3) (Figure 3B). When we included BMI as a risk factor, the total explained variation in lipids increased to 25% in TG, 37.4% in HDL, 22.3% in LDL, and 22.3% in TC (Online Figure IV). The microbiome made a lesser, but still significant, contribution to TG (P=4x10^-3) and HDL (P=0.026). Our study therefore indicates that the gut microbiome can explain a substantial proportion of the variation, independent of age, gender, BMI and genetics.
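The repeated 80%/20% discovery/validation estimate of explained variance described in this section can be sketched as below. This is a simplified illustration under stated assumptions, not the study's implementation: the additive risk score (per-OTU regression slopes from the discovery set, summed in the validation set), the use of every supplied OTU rather than a P-value-based selection, and all function and parameter names are hypothetical.

```python
import random


def _slope(x, y):
    """Least-squares slope of y on x (0.0 if x has no variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx if sxx > 0 else 0.0


def _r_squared(x, y):
    """Squared Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def cross_validated_r2(features, trait, n_rounds=100, train_frac=0.8, seed=0):
    """Mean variance explained in held-out samples over random splits.

    features: one list of OTU abundances per individual.
    trait: one trait value (e.g. BMI) per individual.
    """
    rng = random.Random(seed)
    n, n_feat = len(trait), len(features[0])
    scores = []
    for _ in range(n_rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(train_frac * n)
        train, valid = idx[:cut], idx[cut:]
        y_train = [trait[i] for i in train]
        # Effect size of each OTU, estimated in the discovery set only.
        betas = [_slope([features[i][j] for i in train], y_train)
                 for j in range(n_feat)]
        # Additive microbial risk score for each validation individual.
        risk = [sum(b * features[i][j] for j, b in enumerate(betas))
                for i in valid]
        scores.append(_r_squared(risk, [trait[i] for i in valid]))
    return sum(scores) / len(scores)
```

Estimating effect sizes in the discovery set and scoring only held-out individuals keeps the explained-variance estimate from being inflated by overfitting, which is the point of the split.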

Figure 3. The contribution of the gut microbiome to BMI and lipids. A. The variation explained by gut microbes at different levels of significance (panels: BMI, TG, HDL, LDL, TC; x-axis: different P cutoffs, from 10^-5 to 0.1; y-axis: explained variance). B. The variation explained by different risk models including age, gender, genetic risk and microbial risk (r1=age+gender; r2=age+gender+rg; r3=age+gender+rg+rm). The significance of microbial contribution is indicated as the P value of the ANOVA test that compared the performance of the risk models r2 and r3 (BMI: P=0.0041; TG: P=0.00045; HDL: P=0.0027; LDL and TC: NS).

Discussion

Obesity and aberrant levels of blood lipids are associated with a high risk of CVD. Studying the effect of the gut microbiome on BMI and blood lipid levels yields insight into the role of the microbiome in the development of CVD. Although animal studies have shown that microbiota can influence lipid metabolism,23 no large-scale studies have been performed in humans thus far. Here, we investigated the impact of the gut microbiome on BMI and blood lipid levels in 893 human subjects from the LifeLines-DEEP cohort. The power of our study is reflected by three factors. First, to our knowledge, it is the largest association study linking the gut microbiome to blood lipids in humans to date. Second, our cohort represented a wide range of ages, BMI and blood lipids, as well as microbial composition. We also had detailed medication information per individual and could exclude those taking lipid-lowering or antibiotic medication. Third, we adopted a novel and powerful two-part model to account for both the binary and quantitative features of microbial data.
We established associations for 34 taxonomies with BMI and blood lipid levels, and we estimated that gut microbiota composition can explain up to 6% of the variation in lipid levels, and that this effect is independent of age, gender and host genetics.

Our results for the microbiota associated with BMI are in line with a recent study of 416 twin pairs from the TwinsUK population19; in particular, we confirmed that lower abundances of families Christensenellaceae, Rikenellaceae, class Mollicutes, genus Dehalobacterium and kingdom Archaea were associated with a high BMI. Of 22 independent taxa associated with BMI by our study, 16 were also assessed in the TwinsUK study: 11 (68.8%) showed significant association with BMI (p<0.05) with the same direction of effect as we found (Online Table V). We also identified a correlation of decreased bacterial diversity with increased BMI, which is in line with previous observations.24 However, many of the taxonomies we identified are novel findings. Several of the identified bacteria are known to be involved in the bile acid metabolic pathway. In particular, order (phylum ) and family Clostridiaceae (phylum Firmicutes) are both negatively correlated with BMI and TG, and known to be involved in bile acid metabolism.25 The bile acid activity of commensal bacteria is involved in a complex interplay with host hepatic enzymes, and together they promote digestion and absorption of dietary lipids.26 Interestingly, several small-scale studies reported lowered cholesterol upon using probiotics with bile salt hydrolytic activity.27,28 Our study found support for the role of bacterial bile acids in lipid metabolism. Another pathway enriched in several associated bacteria is short chain fatty acids (SCFA) metabolism.
Both orders Bacteroidales and Clostridiales, identified in our study, are involved in SCFA metabolism.25 SCFA are produced by microbiota from dietary fibers, affect host body energy homeostasis, and are protective against metabolic syndrome, type 2 diabetes, and atherosclerosis.16,29–31 To help establish the gut microbiome as a risk factor for obesity and aberrant levels of blood lipids, we estimated that the microbiome could explain 4.57%, 6.0% and 4.0% of the variation in BMI, TG and HDL, respectively. We did not detect any significant association between the gut microbiome and genetic predisposition to obesity and aberrant levels of blood lipids, suggesting the variation explained by the microbiome is independent of that explained by genetic variants. It should be noted, however, that the genetic risk score was limited to the established 157 lipid-associated SNPs21 and 97 BMI-associated SNPs22, which together only explain a small proportion of the heritability of lipid levels. We might have missed the effect of other, not yet discovered SNPs. Our risk model included age, gender, genetic variation, and gut microbiome and explained 11.3% of the variation in BMI, 17.1% in TG and 25.9% in HDL, significantly outperforming the risk model without the microbiome. Since blood lipids and BMI are highly correlated with each other and many associated bacteria were shared, we investigated whether the observed effect of the gut microbiome on lipids might just be the confounded effect of BMI. We showed that by including BMI in the risk model, the gut microbiome made a smaller, but significant, contribution to the variation in TG and HDL, suggesting that the microbiome affects blood lipids partly independently of BMI. Our results therefore indicate that the gut microbiome is a potentially important player in blood lipid metabolism.
In contrast to genetics, gender and age (all fixed characteristics), an individual’s microbiota composition can be modified by diet, pre- and probiotics, and fecal transplantation. Studies have shown that diet can alter the gut microbiome.32 Our study has not addressed how much of the association we observed between the gut microbiome and blood lipids might be explained by diet. A better understanding of this could provide more insight into the role of diet in microbiome and lipid metabolism. Our study supports the potential of microbiota-modifying interventions to correct lipid imbalance and thereby help prevent CVD. To move from potential to action, the next steps are to validate the associations we report in independent cohorts and to prove in functional studies that there is a causal axis of gut microbiome-lipids-CVD.

It is essential to gain more mechanistic insight into the functioning of the gut microbiome, although research in humans is still in its infancy. The gut microbiomes in our study were profiled by 16S rRNA gene sequencing. This technology can identify microbial taxonomies and composition, but has limitations in identifying genetically-specific species and strains. Furthermore, 16S rRNA sequencing provides little information on bacterial genes and their functions. With the decreasing cost of metagenome sequencing and the development of techniques for culturing and for functional studies of gut bacteria, we expect to learn more about the levels of bacterial genes, metabolic pathways and their functions in the future.

In conclusion, we have observed a strong association between the gut microbial composition and the variation in BMI and blood lipid levels, which is independent of age, gender and host genetics. This observation provides insight into the microbiome’s role in regulating metabolic processes during the development of CVD. We established associations for a total of 34 intestinal bacteria taxonomies with BMI and blood lipids. We observed that the gut microbiome makes a significant contribution, beyond that of clinical risk factors and genetics, to the individual variance seen in BMI and in the blood levels of triglycerides and HDL, but that it has little effect on LDL or total cholesterol levels. Our results highlight the potential of therapies that alter the gut microbiome to control body mass, triglycerides and HDL in CVD prevention. In moving from potential to action, it will be essential to identify the causal axis of microbiome-lipids-CVD and to gain more mechanistic insight into the functions of gut bacteria.

Methods

Population Cohort
The LifeLines-DEEP cohort is a sub-cohort of the LifeLines cohort (167,729 subjects),33 which employs a broad range of investigative procedures in assessing the biomedical, socio-demographic, behavioral, physical and psychological factors that contribute to the health and disease of the general population. A subset of approximately 1,500 participants also took part in LifeLines-DEEP: for these participants, additional biological materials were collected, including genome-wide genotyping and analysis of the gut microbiome composition.
A full description of the LifeLines-DEEP dataset is given in the paper describing the study design.17

Lipid Measurements
We had lipid measurements available for all 1,500 LifeLines-DEEP samples. Total cholesterol (TC) was measured with an enzymatic colorimetric method, high-density lipoprotein (HDL) cholesterol with a colorimetric method, and triglycerides (TG) with a colorimetric UV method (Modular P analyzer, Roche Diagnostics, Burgdorf, Switzerland). The low-density lipoprotein (LDL) cholesterol concentration was calculated using the Friedewald equation. More details were reported previously.34 The triglyceride level was further log2 transformed.

Genotype Information
All LifeLines-DEEP samples were genotyped using the HumanCytoSNP-12 BeadChip and ImmunoChip, a customized Illumina Infinium array. The data were harmonized,35 merged and subsequently imputed using the Genome of the Netherlands (GoNL) dataset.36,37 Further details and information on the quality control are described in Tigchelaar et al.17 We removed ethnic outliers and genetically related participants from our study.

Microbiome Data Generation

Sequencing
Microbiome data were generated for 1,180 LifeLines-DEEP samples. Fecal samples were collected at home within two weeks after collection of blood samples, and stored immediately at -20 °C. After transport on dry ice, all samples were stored at -80 °C. Aliquots were made and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen; cat. #80204). Isolated DNA was sequenced at the Broad Institute, Boston, using Illumina MiSeq paired-end sequencing. Hyper-variable region V4 was selected using forward primer 515F [GTGCCAGCMGCCGCGGTAA] and reverse primer 806R [GGACTACHVGGGTWTCTAAT]. We used custom scripts to remove the primer sequences and align the paired-end reads. Details are given in Gevers et al.38
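The custom primer-removal scripts are not reproduced in this text. As a minimal sketch of what such a step could look like, the block below expands the IUPAC degenerate bases in the two quoted primers (M = A/C, W = A/T, H = A/C/T, V = A/C/G) into a regular expression and strips a primer from the start of a read. The function names and the exact-match policy are assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import Optional

# IUPAC codes needed for the two primers quoted in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "W": "[AT]", "H": "[ACT]", "V": "[ACG]",
}

PRIMER_515F = "GTGCCAGCMGCCGCGGTAA"
PRIMER_806R = "GGACTACHVGGGTWTCTAAT"


def primer_to_regex(primer: str) -> "re.Pattern[str]":
    """Compile a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer))


def strip_primer(read: str, primer: str) -> Optional[str]:
    """Return the read without its leading primer, or None if it does not match.

    Assumption: only exact matches (allowing the degenerate bases) are accepted.
    """
    match = primer_to_regex(primer).match(read)
    return read[match.end():] if match else None
```

A read that does not begin with the primer returns `None`, so the caller can decide whether to discard it or keep it untrimmed.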

OTU picking
Selection of unique bacterial sequences, so-called operational taxonomic unit (OTU) picking, was performed using the QIIME reference optimal picking, which uses UCLUST39 (version 1.2.22q) to perform the clustering. Matching OTUs to bacteria was done using a primer-specific version of the GreenGenes 13.5 reference database.40 Using TaxMan,41 we created the primer-specific reference database containing only reference entries that matched the selected primers. During this process we restricted probe-reference mismatches to a maximum of 25%. The 16S regions that were captured by our primers, including the primer sequences, were extracted from the full 16S sequences. For each of the reference clusters, we determined the overlapping part of the taxonomy of the reference reads in the cluster and used this overlap as the taxonomic label for the cluster. This process is based on, and similar to, work described in Bonder et al.,42 Brandt et al.,41 May et al.43 and Ding et al.44 We used QIIME45 for exploratory analysis and for gathering basic statistics on the microbiome dataset.

Quality Control
Overall, for 1,021 samples, we had lipid measurements, genotype and microbiome information. We excluded 99 samples from participants who were taking antibiotics or other potentially microbiome-modifying drugs, or who were on lipid-lowering medication. The library size of microbial sequencing varied greatly among samples, ranging from 3,969 to 336,900 reads. The sequence depth can significantly bias the measures of microbial composition, and rarefaction is widely used to make the library sizes equal by randomly selecting the same number of reads per sample.46,47 We compared the number of samples at different sequence depths and chose the rarefaction depth so as to keep both the number of reads and the number of samples as high as possible.
We rarefied the library size to a depth of 15,000 reads using the rarefy function in R package vegan (v2.3-0). At this depth, we only excluded 29 subjects. After these exclusion steps, we had 893 samples (380 males and 513 females) for final analysis. Their characteristics are summarized in Table 1. Further, we filtered on the OTU abundance and confined our analysis to 645 OTUs, each of which comprised ≥0.05% of reads and was present in at least 1% of the population. These OTUs accounted for an average of 99% of total reads per sample. The OTUs were assigned to 173 taxonomies that were further truncated to 136 taxonomies after removing identical or highly similar information between different clade levels.

Statistical Analysis

Analysis of microbial diversity
The microbial Shannon diversity index was calculated using the diversity function in R package vegan (version 2.3-0).

Two-part model for association analysis
We observed that the distribution of the abundance of OTUs or taxonomies departed significantly from a normal distribution because many bacteria were not detected in many samples. Only 50 out of 645 OTUs (7.7%) were present in more than 90% of samples, whereas 448 OTUs (69.5%) were detected in less than 50% of samples. At the taxonomic level, 32 out of 136 (9.5%) taxonomies were detected in more than 90% of samples, whereas 60 taxonomies (44.1%) were detected in less than 50% of samples. There are different explanations for the detection rate: (1) the bacteria are really absent in the samples; (2) the abundance levels of the bacteria are too low to be detected at the current sequencing and rarefaction depth; (3) the abundance levels are similar and non-detection is a random effect of the sequencing or rarefaction procedure.
We therefore adopted a novel, two-part model that was developed to account for both binary (detected/undetected) and quantitative features.48 This approach overcomes the problem of a non-normal distribution, which is a feature of the majority of gut bacteria OTUs or taxa.
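The quality-control and diversity steps described above (equalising library sizes, then computing a Shannon index) were run in R with vegan. The following is only an illustrative Python sketch of the two ideas: a random subsample of reads without replacement to a fixed depth, and the Shannon index H = -Σ p ln p. The function names, the fixed seed and the use of the natural logarithm are assumptions; the block is not a port of the vegan functions.

```python
import math
import random
from collections import Counter
from typing import Dict, Optional


def rarefy(counts: Dict[str, int], depth: int, seed: int = 0) -> Optional[Dict[str, int]]:
    """Randomly keep `depth` reads without replacement.

    Returns None when the sample has fewer reads than `depth`, which
    corresponds to excluding that sample from the analysis.
    """
    if sum(counts.values()) < depth:
        return None
    # One pool entry per read, labelled with its OTU.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


def shannon(counts: Dict[str, int]) -> float:
    """Shannon diversity H = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


# Toy sample with hypothetical OTU labels.
sample = {"otu_a": 50, "otu_b": 30, "otu_c": 20}
rarefied = rarefy(sample, depth=40)
```

Expanding every read into a list is wasteful for real library sizes; it is used here only because it makes the sampling scheme explicit.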

The two-part model is illustrated in Online Figure I. The first part describes a binomial analysis that tests for association of detecting a microbe (represented by an OTU or a taxonomy) with a trait. The binary feature (b) of a microbe under study was coded as 0 for undetected or 1 for detected for each sample. The binary model is described as $y = \beta_1 b + e$, where y refers to the trait level (BMI or lipid level) per individual after adjusting for age and gender; b is the binary feature; $\beta_1$ is the estimated effect of the binary feature, and e represents the residuals.

The second part, the quantitative analysis, tests for association between the trait level and the abundance of bacteria, but only for the subjects in whom that microbe is present. The abundance level (q) of a microbe was the log10 transformed read count per individual. The quantitative model is written as $y = \beta_2 q + e$, where q is the abundance of a microbe; $\beta_2$ is the estimated effect of the abundance, and e represents the residuals.

To combine the effects of the binary and quantitative analyses, a meta-P-value was derived using an unweighted Z method. A final association P value per microbe-trait pair was then assigned as the minimum of the P values from the binary analysis, the quantitative analysis and the meta-analysis. The association Z-score was calculated based on the Z-distribution and given the sign of the association direction: negative for a negative association and positive for a positive association.

Because the association P value was set as the minimum of three P values, its distribution could be skewed. We therefore performed 1,000 permutation tests to control the false discovery rate (FDR). For each permutation, we randomized the gut microbial composition across individuals and performed the two-part analysis on the permuted data.
At a given P cut-off, the average number of significant associations detected over the 1,000 permutations (N0) was defined as the number of false positives, and its ratio to the number of associations detected in the real analysis (N1) was the FDR. We controlled the FDR at 0.05. This method accounts for the complicated features of the microbial data and maximizes the power. If the association P value comes from the binary model, the effect is only due to the presence or absence of the microbe, and the abundance of the microbe in the samples does not matter. If the association P value comes from the quantitative model, the abundance level of the microbe associates with the trait, and the absence of the microbe has no influence; one explanation would be that another microbe takes its place and has a similar function. If the association P value comes from the meta-analysis, this indicates that both the presence/absence and the abundance of the microbe can influence the trait.
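The combination step and the permutation-based FDR described above can be sketched as follows. The text does not give the exact formula of the unweighted Z method, so the block assumes the common Stouffer form, Z = (Z1 + Z2)/√2, with each two-sided P value converted to a signed Z-score. The function names are hypothetical, and the binary and quantitative P values are taken as given inputs rather than recomputed from regressions.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence, Tuple

_STD = NormalDist()  # standard normal distribution


def signed_z(p: float, direction: int) -> float:
    """Two-sided P value (0 < p <= 1) and direction (+1 or -1) to a signed Z-score."""
    return direction * _STD.inv_cdf(1.0 - p / 2.0)


def two_part_p(p_bin: float, dir_bin: int, p_quant: float, dir_quant: int) -> Tuple[float, float]:
    """Return (final P, meta P) for one microbe-trait pair.

    Assumption: unweighted Z (Stouffer) combination of the two signed Z-scores.
    The final P is the minimum of the binary, quantitative and meta P values.
    """
    z_meta = (signed_z(p_bin, dir_bin) + signed_z(p_quant, dir_quant)) / sqrt(2.0)
    p_meta = 2.0 * (1.0 - _STD.cdf(abs(z_meta)))
    return min(p_bin, p_quant, p_meta), p_meta


def permutation_fdr(real_p: Sequence[float], permuted_p: List[Sequence[float]], cutoff: float) -> float:
    """FDR at a P cut-off: mean hits per permutation (N0) divided by real hits (N1)."""
    n1 = sum(p <= cutoff for p in real_p)
    if n1 == 0:
        return 0.0
    n0 = sum(sum(p <= cutoff for p in perm) for perm in permuted_p) / len(permuted_p)
    return n0 / n1
```

When the two parts point in opposite directions the signed Z-scores cancel, the meta P value approaches 1, and the final P value falls back to the smaller of the two single-part P values.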

Estimating the Variance Explained by the Gut Microbiome
To estimate the proportion of variation in BMI and lipids that could be explained by the gut microbiome, we performed a 100x cross-validation. Each time we split the data randomly into an 80% discovery set and a 20% validation set. In the discovery set, a total of n significantly associated OTUs were identified at a certain P value, and the effect sizes of the binary and quantitative features of each OTU ($\beta_1$ and $\beta_2$) were estimated. Then the risk of the gut microbiome on BMI or lipids ($r_m$) for each individual in the validation set was calculated using an additive model:

$$r_m = \sum_{j=1}^{n} \left( \beta_{1j} b_j + \beta_{2j} q_j \right)$$
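As a rough illustration of the additive model above, the sketch below computes a microbial risk score for one individual from per-OTU effect sizes and read counts, and a squared Pearson correlation for summarising how well such scores track a trait. The OTU labels and effect sizes are made up, and the choice to let an undetected microbe contribute zero to the quantitative term is an assumption.

```python
import math
from typing import Dict, Sequence, Tuple


def microbial_risk(effects: Dict[str, Tuple[float, float]], counts: Dict[str, int]) -> float:
    """Additive score: sum over OTUs of beta1 * b + beta2 * q.

    b is 1 if the OTU is detected and 0 otherwise; q is the log10 read count.
    Assumption: an undetected OTU contributes nothing to the quantitative term.
    """
    risk = 0.0
    for otu, (beta1, beta2) in effects.items():
        reads = counts.get(otu, 0)
        b = 1 if reads > 0 else 0
        q = math.log10(reads) if reads > 0 else 0.0
        risk += beta1 * b + beta2 * q
    return risk


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical effect sizes (beta1, beta2) from a discovery set.
effects = {"otu_1": (0.5, -0.2), "otu_2": (-1.0, 0.0)}
score = microbial_risk(effects, {"otu_1": 100})
```

In a cross-validation setting the effect sizes would come from the discovery set and the scores would be computed only for individuals in the validation set.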

The variation in BMI and blood lipids explained by the gut microbiome was represented as the squared correlation coefficient ($R^2$) between the traits and $r_m$, after correcting for age and gender. To ensure the robustness of our estimation, we repeated the cross-validation 100 times and calculated the average value of the explained variation. We hypothesized that many microbes may contribute a small effect but may not be confidently detected at an FDR of 0.05. Therefore, we did this analysis at different significance levels, with P cut-offs ranging from 1×10⁻⁵ to 0.1.

Genetic Risk Score Calculation
A total of 157 lipid-associated single nucleotide polymorphisms (SNPs)21 and 97 BMI-associated SNPs22 were extracted from the literature. The risk alleles and their effect sizes were extracted for each SNP and each lipid type. We excluded three SNPs for which genotypes could not be successfully imputed in the LifeLines cohort: rs9411489 at the ABO locus, rs3177928 at the HLA locus, and rs12016871 at the MTIF3 locus. Thus, our final study included genetic information for 96 BMI-associated SNPs and 155 lipid-associated SNPs, including 71 for HDL, 56 for LDL, 40 for TG and 72 for TC. We then computed weighted genetic risk scores ($r_g$) for BMI and lipids, as described previously.34

The association analysis between individual SNPs and the gut microbiome was performed using the analysis pipeline developed in house for quantitative trait loci analysis.49 We further tested whether the explained variation in the gut microbiome was independent of genetic factors by testing the association between the gut microbiome and genetic risk scores.20 The associations between microbes and the genetic risk score of BMI and lipid levels were assessed using our two-part model. The significance was controlled at FDR<0.05 by 1,000 permutation tests.
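The exact procedure for the weighted genetic risk score is given in the cited reference rather than in this text. A minimal sketch, assuming the usual definition (the sum over SNPs of the published effect size multiplied by the individual's risk-allele dosage), could look like this; the SNP identifiers and weights below are placeholders, not values from the study.

```python
from typing import Dict


def weighted_grs(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted genetic risk score for one individual.

    dosages: risk-allele dosage per SNP (0 to 2; fractional when imputed).
    weights: published effect size per SNP for the trait of interest.
    SNPs without a dosage (for example, not imputed) are skipped.
    """
    return sum(weights[snp] * dosages[snp] for snp in weights if snp in dosages)


# Placeholder identifiers and effect sizes, for illustration only.
weights = {"snp_a": 0.1, "snp_b": 0.3, "snp_c": 0.2}
dosages = {"snp_a": 2.0, "snp_b": 1.5}
grs = weighted_grs(dosages, weights)
```

Skipping SNPs without a dosage mirrors the exclusion of SNPs that could not be imputed; a separate score would be computed per trait because the weights are trait-specific.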
The Significance of the Microbial Contribution
To test whether the gut microbiome contributes significantly to variation in BMI and blood lipids, we compared the performance of three different risk models, in particular the risk models with and without microbial risk:

r1 = age + gender ;

r2 = age + gender + rg ; and

r3 = age + gender + rg + rm ,

where rg is the calculated genetic risk and rm is the highest microbial risk we determined. The variation explained by each risk model was calculated in 100x cross-validation, as described above. To evaluate the significance of the microbial contribution, the ANOVA test was used to compare the performance of the risk models r3 and r2: the average of the F values of the ANOVA test from 100x cross-validation was calculated and the P value was determined based on the F-distribution. As BMI and lipids are highly correlated, we also investigated whether the gut microbiome can contribute to lipid levels independent of BMI. To do so, we tested four risk models of lipids including BMI as a risk factor:

r1 = age + gender ;

r2 = age + gender + bmi ;

r3 = age + gender + bmi + rg ; and

r4 = age + gender + bmi + rg + rm .
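The comparison of nested risk models in this section relies on an ANOVA F test. As a sketch under stated assumptions, the block below computes the standard F statistic for adding predictors to a linear model from the residual sums of squares of the reduced and full models, and averages it over cross-validation folds. The function names and the numbers are illustrative; converting the averaged F value into a P value needs an F-distribution routine, which is not in the Python standard library and is therefore left out.

```python
from statistics import mean
from typing import Sequence


def nested_f(rss_reduced: float, rss_full: float, k_reduced: int, k_full: int, n: int) -> float:
    """F statistic for comparing two nested linear models.

    k_reduced and k_full count predictors excluding the intercept;
    n is the number of individuals used to fit both models.
    """
    extra = k_full - k_reduced      # numerator degrees of freedom
    resid_df = n - k_full - 1       # denominator degrees of freedom
    return ((rss_reduced - rss_full) / extra) / (rss_full / resid_df)


def average_f(f_values: Sequence[float]) -> float:
    """Mean F value over repeated cross-validation runs."""
    return mean(f_values)


# Illustrative numbers: adding one predictor (for example rm) to a
# three-predictor model lowers the residual sum of squares from 120 to 100.
f_single = nested_f(rss_reduced=120.0, rss_full=100.0, k_reduced=3, k_full=4, n=105)
```

With these inputs the numerator is 20/1 and the denominator is 100/100, so `f_single` equals 20.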

Sources of funding
This project was funded by grants from the Top Institute Food and Nutrition, Wageningen, to CW, AZ and ET (GH001), the Netherlands Organization for Scientific Research to JF (NWO-VIDI 864.13.013), CardioVasculair Onderzoek Nederland to MH and AZ (CVON 2012-03), and NWO grants to LF (ZonMW-VIDI 917.14.374) and RW (ZonMW-VIDI 016.136.308). AZ holds a Rosalind Franklin Fellowship (University of Groningen) and MCC holds a postdoctoral fellowship from the Fundación Alfonso Martín Escudero. This research received funding from the European Community’s Health Seventh Framework Programme (FP7/2007–2013, grant agreement 259867).

References
1. Clemente, J. C., Ursell, L. K., Parfrey, L. W. & Knight, R. The impact of the gut microbiota on human health: An integrative view. Cell 148, 1258–1270 (2012).
2. Tremaroli, V. & Bäckhed, F. Functional interactions between the gut microbiota and host metabolism. Nature 489, 242–249 (2012).
3. Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
4. Balzola, F., Bernstein, C., Ho, G. T. & Lees, C. A human gut microbial gene catalogue established by metagenomic sequencing: Commentary. Inflamm. Bowel Dis. Monit. 11, 28 (2010).
5. Henao-Mejia, J. et al. Inflammasome-mediated dysbiosis regulates progression of NAFLD and obesity. Nature 482, 179–85 (2012).
6. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).
7. Kau, A. L., Ahern, P. P., Griffin, N. W., Goodman, A. L. & Gordon, J. I. Human nutrition, the gut microbiome, and immune system: envisioning the future. Nature 474, 327–336 (2011).
8. Tang, W. H. W. et al. Intestinal microbial metabolism of phosphatidylcholine and cardiovascular risk. N. Engl. J. Med. 368, 1575–84 (2013).
9. Gregory, J. C. et al. Transmission of atherosclerosis susceptibility with gut microbial transplantation. J. Biol. Chem. 290, 5647–5660 (2015).
10. Frazier, T. H., DiBaise, J. K. & McClain, C. J. Gut Microbiota, Intestinal Permeability, Obesity-Induced Inflammation, and Liver Injury. J. Parenter. Enter. Nutr. 35, 14S–20S (2011).
11. Glass, C. K. & Olefsky, J. M. Inflammation and lipid signaling in the etiology of insulin resistance. Cell Metab. 15, 635–645 (2012).
12. Bäckhed, F. et al. The gut microbiota as an environmental factor that regulates fat storage. Proc. Natl. Acad. Sci. U. S. A. 101, 15718–15723 (2004).
13. Turnbaugh, P. J. et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–1031 (2006).
14. Turnbaugh, P. J. et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–31 (2006).
15. Ridaura, V. K. et al. Gut Microbiota from Twins Discordant for Obesity Modulate Metabolism in Mice. Science 341, 1241214 (2013).
16. Vrieze, A. et al. Transfer of intestinal microbiota from lean donors increases insulin sensitivity in individuals with metabolic syndrome. Gastroenterology 143, 913–916.e7 (2012).
17. Tigchelaar, E. F. et al. An introduction to LifeLines DEEP: study design and baseline characteristics. bioRxiv (Cold Spring Harbor Labs Journals, 2014). doi:10.1101/009217
18. Everard, A. et al. Cross-talk between Akkermansia muciniphila and intestinal epithelium controls diet-induced obesity. Proc. Natl. Acad. Sci. U. S. A. 110, 9066–71 (2013).
19. Goodrich, J. K. et al. Human genetics shape the gut microbiome. Cell 159, 789–799 (2014).
20. Nitsch, D. et al. Limits to causal inference based on mendelian randomization: A comparison with randomized controlled trials. Am. J. Epidemiol. 163, 397–403 (2006).
21. Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–83 (2013).
22. Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
23. Velagapudi, V. R. et al. The gut microbiota modulates host energy and lipid metabolism in mice. J. Lipid Res. 51, 1101–1112 (2010).
24. Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–6 (2013).
25. Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res. 42, 742–753 (2014).
26. Ridlon, J. M., Kang, D.-J. & Hylemon, P. B. Bile salt biotransformations by human intestinal bacteria. J. Lipid Res. 47, 241–59 (2006).
27. Hepner, G., Fried, R., St Jeor, S., Fusetti, L. & Morin, R. Hypocholesterolemic effect of yogurt and milk. Am. J. Clin. Nutr. 32, 19–24 (1979).
28. Hlivak, P. et al. One-year application of probiotic strain Enterococcus faecium M-74 decreases serum cholesterol levels. Bratisl. Lek. Listy 106, 67–72 (2005).
29. Karlsson, F. H. et al. Symptomatic atherosclerosis is associated with an altered gut metagenome. Nat. Commun. 3, 1245 (2012).
30. Wang, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).
31. Kimura, I. et al. The gut microbiota suppresses insulin-mediated fat accumulation via the short-chain fatty acid receptor GPR43. Nat. Commun. 4, 1829 (2013).
32. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–63 (2014).
33. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
34. Li, N. et al. Pleiotropic effects of lipid genes on plasma glucose, HbA1c, and HOMA-IR levels. Diabetes 63, 3149–3158 (2014).
35. Deelen, P. et al. Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration. BMC Res. Notes 7, 901 (2014).
36. The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 1–95 (2014).
37. Deelen, P. et al. Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur. J. Hum. Genet. 22, 1321–1326 (2014).
38. Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).
39. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
40. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
41. Brandt, B. W., Bonder, M. J., Huse, S. M. & Zaura, E. TaxMan: a server to trim rRNA reference databases and inspect taxonomic coverage. Nucleic Acids Res. 40, W82-7 (2012).
42. Bonder, M. J., Abeln, S., Zaura, E. & Brandt, B. W. Comparing clustering and pre-processing in taxonomy analysis. Bioinformatics 28, 2891–2897 (2012).
43. May, A., Abeln, S., Crielaard, W., Heringa, J. & Brandt, B. W. Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations. Bioinformatics 30, 1530–1538 (2014).
44. Ding, T. & Schloss, P. D. Dynamics and associations of microbial community types across the human body. Nature 509, 357–60 (2014).
45. Navas-Molina, J. A. et al. Advancing our understanding of the human microbiome using QIIME. Methods in Enzymology 531, (Elsevier Inc., 2013).
46. Hughes, J. B. & Hellmann, J. J. The application of rarefaction techniques to molecular inventories of microbial diversity. Methods Enzymol. 397, 292–308 (2005).
47. Koren, O. et al. A Guide to Enterotypes across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets. PLoS Comput. Biol. 9, e1002863 (2013).
48. Keurentjes, J. J. B. et al. The genetics of plant metabolism. Nat. Genet. 38, 842–9 (2006).
49. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243 (2013).

Acknowledgments
We thank the LifeLines-DEEP participants and the LifeLines staff in Groningen for their collaboration. We thank Jackie Senior and Kate McIntyre for editing the manuscript and Mathieu Platteel for practical and analytical work.

Description of supplementary data files
The following additional data are available with the online version of this paper.

Online data:
Online Table I: Spearman correlation between traits and OTU richness and diversity
Online Table II: OTUs associated with BMI at FDR < 0.05 level
Online Table III: OTUs associated with TG at FDR < 0.05 level
Online Table IV: OTUs associated with HDL at FDR < 0.05 level
Online Table V: Associated taxonomies at FDR < 0.05 level
Online Table VI: The association between microbes and the SNPs associated with lipids and BMI at P value 1×10⁻⁵
Online Table VII: The associations between microbes and the combined BMI and lipid genetic risk scores at P < 0.05 level
Online Figure I: The workflow of the two-part model
Online Figure II: The number of OTUs associated with TG, HDL and BMI at FDR < 0.05 and their overlaps with each other
Online Figure III: The amount of variance in BMI and lipids explained by the gut microbiome
Online Figure IV: The variation of lipids explained by age, gender, BMI, genetic and microbial risk

Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity

Science, DOI: 10.1126/science.aad3369

Alexandra Zhernakova1,2,*, Alexander Kurilshikov3,4,†, Marc Jan Bonder1,†, Ettje F. Tigchelaar1,2,†, Melanie Schirmer5,6, Tommi Vatanen5, Zlatan Mujagic2,7, Arnau Vich Vila8, Gwen Falony9,10, Sara Vieira-Silva9,10, Jun Wang9,10, Floris Imhann8, Eelke Brandsma11, Soesma A. Jankipersadsing1, Marie Joossens9,10,12, Maria C. Cenit1,13,14, Patrick Deelen1,15, Morris A. Swertz1,15, LifeLines cohort study, Rinse K. Weersma8, Edith J. M. Feskens2,16, Mihai G. Netea17, Dirk Gevers5,18, Daisy Jonkers7, Lude Franke1, Yurii S. Aulchenko4,19,20,21, Curtis Huttenhower5,6, Jeroen Raes9,10,12, Marten H. Hofker11, Ramnik J. Xavier5,22,23,24, Cisca Wijmenga1,‡,*, Jingyuan Fu1,11,‡,*

Abstract
Deep sequencing of the gut microbiomes of 1,135 participants from a Dutch population-based cohort shows relations between the microbiome and 126 exogenous and intrinsic host factors, including 31 intrinsic factors, 12 diseases, 19 drug groups, 4 smoking categories, and 60 dietary factors. These factors collectively explain 18.7% of the variation seen in the inter-individual distance of microbial composition. We could associate 110 factors to 125 species and observed that fecal Chromogranin A (CgA), a protein secreted by enteroendocrine cells, was exclusively associated with 61 microbial species whose abundance collectively accounted for 53% of microbial composition. Low CgA levels were seen in individuals with a more diverse microbiome. These results are an important step towards better understanding of environment-diet-microbe-host interactions.

1. University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands; 2. Top Institute Food and Nutrition, Wageningen, the Netherlands; 3. Institute of Chemical Biology and Fundamental Medicine SB RAS, Novosibirsk, Russia; 4. Novosibirsk State University, Novosibirsk, Russia; 5. The Broad Institute of MIT and Harvard, Cambridge, MA, USA; 6. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; 7. Division of Gastroenterology-Hepatology, Department of Internal Medicine, NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Center, Maastricht, the Netherlands; 8. University of Groningen, University Medical Center Groningen, Department of Gastroenterology and Hepatology, Groningen, the Netherlands; 9. KU Leuven–University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Laboratory of Molecular Bacteriology, Leuven, Belgium; 10. VIB, Center for the Biology of Disease, Leuven, Belgium; 11. University of Groningen, University Medical Center Groningen, Department of Pediatrics, Groningen, the Netherlands; 12. Vrije Universiteit Brussel, Faculty of Sciences and Bioengineering Sciences, Microbiology Unit, Brussels, Belgium; 13. Microbial Ecology, Nutrition & Health Research Group, Institute of Agrochemistry and Food Technology, National Research Council (IATA-CSIC), Valencia, Spain; 14. Department of Pediatrics, Dr. Peset University Hospital, Valencia, Spain; 15. University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, the Netherlands; 16. Division of Human Nutrition, Wageningen University, Wageningen, the Netherlands; 17. Department of Internal Medicine, Radboud University Medical Center, Nijmegen, the Netherlands; 18. Current address: Janssen Human Microbiome Institute, Janssen Research and Development, Cambridge, MA, USA; 19. 
Centre for Global Health Research, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Teviot Place, EH8 9AG, Edinburgh, UK; 20. PolyOmica, Groningen, the Netherlands; 21. Institute of Cytology & Genetics SB RAS, Novosibirsk, Russia; 22. Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, USA; 23. Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, MA, USA; 24. Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, MA, USA; † equal contribution; ‡ shared last; Correspondence to: a.zhernakova@umcg.nl (AZ), [email protected] (CW), [email protected] (JF)

The human gut microbiome plays a major role in the production of vitamins, enzymes, and other compounds that digest and metabolize food and regulate our immune system1. It can be considered as an extra organ, with remarkable dynamics and a major impact on our physiology. The composition of the gut microbiome can be considered as a complex trait, with the quantitative variation in the microbiome affected by a large number of host and environmental factors, each of which may have only a small additive effect, making it difficult to identify the association for each separate item. In this study, we present a systematic metagenomic association analysis on 207 intrinsic and exogenous factors from the LifeLines-DEEP cohort, a Dutch population-based study2,3. Our study reveals covariates in the microbiome and, more importantly, provides a list of factors that correlate with shifts in the microbiome composition and functionality.

This study includes stool samples from 1,179 LifeLines-DEEP participants from the general population of the northern part of the Netherlands2. The cohort comprised predominantly Dutch participants; 93.7% had both parents born in the Netherlands. The gut microbiome was analyzed using paired-end metagenomic shotgun sequencing (MGS) on a HiSeq2000, generating an average of 3.0 Gb of data (about 32.3 million reads) per sample4. After excluding 44 samples with low read counts, 1,135 participants (474 males and 661 females) remained for further analysis. We tested 207 factors with respect to the microbiomes of these participants: 41 intrinsic factors of various physiological and biomedical measures, 39 self-reported diseases, 44 categories of drugs, 5 categories of smoking status and 78 dietary factors (fig. S1 and table S1). These factors cover dietary habits, life-style, medication use, and health parameters.
Most of the factors showed a low or modest inter-correlation (table S2A-C, 2A-D); many are highly variable, including, as expected in the Dutch population, the high consumption of milk products and low use of antibiotics. Antibiotic use in the Netherlands is the lowest in Europe, at a level half that of the UK and one-third that of Belgium. To cover health-domain factors relevant to the host immune system and gut health, we collected cell counts for eight different blood cell types, measured blood cytokine levels, assessed stool frequency and stool type by Bristol Stool Score, and measured fecal levels of several secreted proteins, including calprotectin as a marker for immune system activation, human-β-defensin-2 (HBD-2) as a marker for defense against invading microbes and chromogranin A (CgA) as a marker for neuroendocrine system activation.

After quality control and removal of sequence reads mapping to the human genome, the microbiome sequence reads were mapped to approximately 1 million microbial-taxonomy-specific marker genes using MetaPhlAn 2.05 to predict the abundance of microorganisms (fig. S3A). For each participant, we predicted the abundance levels for 1,649 microbial taxonomic clades ranging from four different domains to 632 species (Fig. 1A). The majority of the reads (97.6%) came from Bacteria, 2.2% from Archaea, 0.2% from Viruses and <0.01% from Eukaryotes. Comparison to previous taxonomic profiles of the same subjects by 16S rRNA gene sequencing (Fig. 1B) showed that MGS predicted more microbial species but fewer families and genera. At the phylum level, the abundances of the dominant bacterial phyla Firmicutes (63.7%) and Bacteroidetes (8.1%) were similar to estimates based on 16S rRNA gene sequencing, but the abundance of Actinobacteria was higher in MGS (22.3%) than in 16S (12.3%) (fig. S4).
The microbiome quality control project has recently suggested that microbial composition estimates may not be comparable between studies if sample preparation and data analysis are not done in the same way6. For instance, compared to the composition reported in other studies of a similar size that used different methods7,8, our study detected a higher abundance of Actinobacteria but a lower abundance of Bacteroidetes. Importantly, all samples in our study were isolated and processed using the same pipeline, ensuring low technical variation and high statistical power to assess the association of multiple factors with the microbiome.

[Figure 1. Panel A (MGS): Domain (4), Phylum (20), Class (31), Order (47), Family (102), Genus (220), Species (632). Panel B (16S rRNA): Domain (2), Phylum (32), Class (74), Order (123), Family (203), Genus (259), Species (57).]

Fig. 1. The taxonomic tree of microbial taxonomies predicted by MGS and 16S rRNA gene sequencing. (A) Taxonomic tree based on MGS data. (B) Taxonomic tree based on 16S rRNA gene sequencing data. Each dot represents a taxonomic entity. From the inner to outer circles, the taxonomic levels range from domains to species. Different colors of dots indicate different taxonomy levels according to the color key shown. Numbers in brackets indicate the total number of unique taxonomies detected at each level.

The high inter-individual variation reflects the community composition (fig. S5) and is clearly driven by the abundance of the dominant phyla (fig. 2A). Our further analysis of microbial composition was confined to the 632 unique species (table S3). For the functional profiling, the 568,874 UniRef gene families were grouped into clusters of orthologous groups (COG) based on the EggNOG database and mapped to MetaCyc pathways (fig. S3A). Although the distributions of diversity, gene richness, and COG richness showed high inter-individual variability (Fig. 2B-D), functional profiles based on 23 non-redundant gene ontology molecular function categories remained stable (fig. S6) within our cohort, similar to previous reports9.

[Figure 2. Panel A: three principal coordinates plots (PCoA 1 versus PCoA 2), colored by the abundance of Firmicutes, Actinobacteria, and Bacteroidetes. Panels B-D: frequency histograms of the Shannon diversity index, the richness of genes, and the richness of COG.]

Fig. 2. The inter-individual variation of microbial composition and function profile. (A) Principal coordinates plots of Bray-Curtis distance of microbial composition. The composition was driven by the most dominant phyla: Firmicutes, Actinobacteria, and Bacteroidetes. Each dot represents one individual. Color indicates the relative abundance of each phylum. (B) Distribution of Shannon’s diversity index. (C) Distribution of the gene richness. (D) Distribution of the clusters of orthologous groups (COG) richness.

We correlated 207 factors to the inter-individual variation in microbial composition, diversity, richness of genes, and COGs (fig. S3B). At a false discovery rate (FDR) <0.1, 126 factors were associated with inter-individual distance of microbial composition (Bray-Curtis distance) (Fig 3, table S4), of which 90% could be replicated in 16S rRNA data from the same subjects (table S5, fig. S7), together explaining 18.7% of the variation in composition distance (fig. S8A). A total of 35 factors were associated with Shannon’s diversity index of microbial composition (together explaining 13.7% of the variation, table S6, fig. S8A), of which 80% were replicated in 16S rRNA data from the same subjects (table S7); 31 factors were associated with gene richness (together explaining 16.7% of the variation, table S8) and 34 factors with COG richness (explaining 18.8% of the variation) (table S9, fig. S8A, for replication rates see table S10). We saw a large overlap between the different diversity and richness analyses, and most of these factors were also associated with composition distance (fig. S8B).

We performed multivariate association analyses between each factor and 170 abundant species (>0.01% of total microbial composition and present in at least 10 individuals) and 215 MetaCyc pathways (fig. S3C). When corrected for age, gender, and sequence depth, we found 485 associations at FDR<0.1 between 110 factors and 125 species (table S11) and 524 associations between 71 factors and 176 MetaCyc pathways (table S12). After correcting for the correlation structure among all 207 factors, the number of associations was reduced to 128 independent associations with species (table S13) and 215 associations with pathways (table S14).

Our data confirmed some previous findings and also yielded novel associations. In our study, age and gender were correlated not only with microbial composition distance and diversity but also with functional richness.
Women showed higher COG richness than men (adjusted P=0.03), and COG richness increased with age (adjusted P=0.002) (fig. S9). Multiple intrinsic parameters, such as blood cell counts and lipid levels, were associated with composition and function levels as well. For example, a higher level of hemoglobin was consistently associated with lower diversity and functional richness (Fig 3, table S6-S9).

The strongest associations we found were for the fecal levels of several secreted proteins, including human-β-defensin-2 (HBD-2), calprotectin10,11, and chromogranin A (CgA), with microbial composition, diversity, and functional richness (Fig. 3, Fig. 4A), as well as with specific species (table S11) and pathways (table S12). Among these proteins, CgA showed the strongest association with composition distance (adonis R2=0.03, adjusted P=0.0006), microbial diversity (Spearman r=-0.22, adjusted P=1.49x10^-12), gene richness (Spearman r=-0.23, adjusted P=9.4x10^-13), and COG richness (Spearman r=-0.285, adjusted P=2.53x10^-20) (tables S4-S9). The association of CgA with composition distance was then validated in an independent cohort of 19 individuals for whom 16S rRNA gene sequencing data was available (P=0.0065) (fig. S10). A lower CgA level was associated with higher diversity, with functional richness, with high levels of high-density lipoprotein (HDL), and with intake of fruits and vegetables. In contrast, elevated fecal CgA was associated with high fecal levels of calprotectin, high blood levels of triglycerides, high stool frequency, soft stool type, and self-reported irritable bowel syndrome (IBS) (Fig. 4B). After correcting for the confounding effect of all other factors, our analysis revealed 61 species exclusively associated with CgA (Fig. 4C-D, table S13), whose abundance levels collectively accounted for 53% of the total abundance of the microbiome on average, and 40 MetaCyc pathways (table S14) that accounted for 34.6% of the pathway profiles.
The strongest association with CgA was observed for the archaeal species Methanobrevibacter smithii (fig. S11A), which plays an important role in the digestion of polysaccharides by consuming the end products of bacterial fermentation and methanogenesis12 (fig. S11B). A negative association with CgA abundance was observed for 24 out of 36 species from the phylum Bacteroidetes (Fig. 4C-D).

[Figure 3. Five panels (Intrinsic factors, Diet, Diseases, Medicine, Smoking). Each panel shows a bar plot of explained variance in BC distance (R2) per factor, above a heatmap with rows for Shannon’s index, gene richness, and COG richness. Color key for correlation: <=-0.3 to >=0.3.]

Fig. 3. Factors associated with inter-individual variation of gut microbiome. A total of 126 factors were associated (FDR<0.1) with inter-individual variation of the gut microbiome. The bar plot indicates the explained variation of each factor in the inter-individual variation of microbial composition (Bray-Curtis distance). The heatmap under the bar plot shows the correlation coefficients of each factor with Shannon’s index of diversity, gene richness, and COG richness, respectively. Color key for correlation is shown.

[Figure 4. Panel A: principal coordinates plot (PCoA 1 versus PCoA 2) colored by abundance of CgA, with arrows for CgA and for the phyla Verrucomicrobia, Archaea, Bacteroidetes, Proteobacteria, and Actinobacteria. Panel B: Spearman correlation of CgA with other factors. Panels C and D: taxonomic trees with branches colored by phylum (Actinobacteria, Proteobacteria, Bacteroidetes, Verrucomicrobia, Firmicutes, Euryarchaeota); Eur. stands for Euryarchaeota and Ver. for Verrucomicrobia.]

Fig. 4. The association of fecal level of Chromogranin A. (A) Principal coordinate plots of Bray-Curtis distance of microbial composition. Each dot represents one individual and its color is based on the abundance level of CgA: warm colors indicate high abundance and cool colors low abundance. The red arrow indicates the association direction of CgA, while the directions of the CgA-associated phyla are shown as black arrows. (B) Correlation between CgA and other factors at FDR<0.1. (C) Taxonomic tree of 170 species, of which 61 species were exclusively associated with CgA level. Each dot represents a taxonomic entity. Red dots indicate positively associated species. Blue dots indicate negatively associated species. (D) Taxonomic tree of the 61 species exclusively associated with CgA level. The branches are colored to show phylum levels as shown in the color key. Species in red show increased abundance associated with higher CgA levels. Species in blue show lower abundance associated with higher CgA levels.

CgA is a member of the granin peptides, which are secreted in nervous, endocrine and immune cells under stress13, and during active periods of gut-related diseases such as IBS and inflammatory bowel disease, although some findings are contradictory14–16. Many different functions have been proposed for CgA and other granin peptides, including roles in neurological pathways, pain regulation, and antimicrobial activity against bacteria, fungi, and yeasts17,18. However, their mechanism of action and physiological importance need further detailed investigation. To test whether genetic variants that influence expression of the CHGA gene (encoding CgA) can affect fecal CgA level and the gut microbiome, we tested the effect of six SNPs known to regulate gene expression of CHGA on fecal CgA and abundances of species (table S15). No significant association was observed, suggesting that genetic variation in CHGA expression does not explain the variation observed in the fecal CgA levels and microbiome composition (table S16-S17). Our observation that CgA strongly correlates with microbiome composition, especially with a large number of species from the Bacteroidetes phylum, and with diversity will hopefully encourage studies to unravel the role of CgA in gut health.

We also observed associations (FDR<0.1) between 63 dietary factors and inter-individual distances in microbiota composition, including energy (kcal), intake of carbohydrates, proteins and fats, and of specific food items such as bread and soft drinks (Fig. 3, table S4). Drinking buttermilk (sour milk with a low fat content) was associated with high diversity, while drinking high-fat (whole) milk (3.5% fat content) was associated with lower diversity (table S6). Two of the species most strongly associated with drinking buttermilk are Leuconostoc mesenteroides (q=9.1x10^-46) and Lactococcus lactis (q=2.5x10^-8), both used as starter cultures for industrial fermentation (table S11).
The abundance of dairy-fermentation-related bacteria increased with increasing dairy consumption, indicating potential for the use of probiotic drinks to augment and alter the gut microbiome composition. Consumption of alcohol-containing products, coffee, tea, and sugar-sweetened drinks was also correlated with microbial composition. Consumption of sugar-sweetened soda had a negative effect on microbial diversity (adjusted P=5x10^-4), whereas consumption of coffee, tea and red wine, which all have a high polyphenol content, was associated with increased diversity19–21. Red wine consumption correlated with the abundance of F. prausnitzii, which has anti-inflammatory properties, correlates negatively with inflammatory bowel disease22, and shows higher abundance in high-richness microbiota23. Apart from the negative association between sugar-sweetened soda and bacterial diversity, other features of a Western-style diet, such as higher intake of total energy, snacking, and high-fat (whole) milk, were also associated with lower microbiota diversity (Fig. 3). A higher amount of carbohydrates in the diet was associated with lower microbiome diversity. Total carbohydrate intake was positively associated with Bifidobacteria, but negatively with Lactobacillus, Streptococcus, and Roseburia species. A low carbohydrate diet consistently showed opposite directions for these species. We did not observe an association of carbohydrate intake with Prevotella species, as has been described previously24.

As expected, the use of antibiotics was significantly associated with microbiome composition, in particular with strong and significant decreases in two species from the genus Bifidobacterium (Actinobacteria phylum) (table S11), in line with previous studies25. Several other drug categories, such as proton pump inhibitors (PPI) (95 users), metformin (15 users), statins (56 users), and laxatives (21 users) also had a strong effect on the gut microbiome.
PPI users were found to have profound changes in 33 bacterial pathways (table S12). The most significant positive correlation of PPIs was observed with the pathway of 2,3-butanediol biosynthesis (q=5.3x10^-14). We also observed overlap between the species and pathways associated with PPI use and those associated with calprotectin levels, particularly for bacteria typical of the oral microbiome (table S2A-C, table S11, fig. S12). This is in line with the correlations of PPI with calprotectin levels reported in the literature26. Even after excluding the 95 PPI users from our analysis, the positive correlation of calprotectin with most oral bacteria remained significant, indicating that this association is not due to the confounding effect of PPI (fig. S12). Furthermore, the levels of calprotectin were positively correlated with age and metabolic phenotypes (body mass index (BMI), diabetes, use of statins and metformin, HbA1c, and systolic blood pressure),

but negatively correlated with the consumption of vegetables, plant proteins, chocolate, and breads. Multivariate analysis correcting for all factors revealed 14 species (table S13) and 114 bacterial metabolic pathways (table S14) exclusively associated with calprotectin, suggesting that calprotectin is robustly associated with the gut microbiome.

Metformin is commonly used to control blood sugar levels for treating type 2 diabetes, but can cause gastrointestinal intolerance27. In 15 metformin users, we observed an increased abundance of Escherichia coli (E. coli) and a positive correlation with specific pathways, including the degradation and utilization of D-glucarate and D-galactarate and pyruvate fermentation pathways. Previous studies in C. elegans indicated the specific drug-bacteria interaction of metformin and E. coli28. Our results are in line with recent observations in humans29 that suggest that metformin can impact the microbiome through short-chain fatty acid (SCFA) production. To confirm this observation, we profiled acetate, propionate and butyrate in 24 type 2 diabetes patients in our cohort: 9 non-metformin users and 15 users4, and found that SCFA levels were consistently higher in metformin users, especially for propionate (Wilcoxon test P=0.035) (fig. S13).

We assessed the effect of current smoking status, smoking history, parental smoking, and maternal smoking during pregnancy on the gut microbiome. These parameters were associated with Bray-Curtis distance, albeit with very modest effect. We did not detect significant associations for individual species or pathways.

In this study we included 39 self-reported diseases, for which participants had reported at least five cases. IBS was reported by 9.9% of participants (n=112, table S1) and was associated with changes in the gut microbiome and a lower microbial diversity (adjusted P=0.05) (table S6).
Species from the Eggerthella and Coprobacillus genera were positively associated with medication and food allergies, respectively. Individuals who had suffered a heart attack (n=10) in the past had a significantly lower abundance of Eubacterium eligens, even after correcting for all other factors (q=4.6x10^-4).

Linking the deep-sequenced MGS data to various intrinsic and exogenous factors from the same individual not only allowed us to detect associations at species level, but also provided new insights into the interaction between the host, microbiota, and environmental factors, including diet. For instance, we have replicated and expanded our association of BMI and blood lipid levels with the gut microbiota based on 16S rRNA gene sequencing data30 by showing associations with four specific species of the family Rikenellaceae. We previously associated this family with BMI and triglycerides in 16S rRNA data. In the current study we observed that higher BMI was associated with lower levels of two species from the family Rikenellaceae, Alistipes finegoldii and Alistipes senegalensis, while blood lipids were associated with two other species, Alistipes shahii and Alistipes putredinis (table S11). Strikingly, these species were also associated with certain dietary factors and drugs. For instance, a high level of Alistipes shahii, which was associated with low TG levels, was linked to higher fruit intake (q=0.00027). Individuals with a higher abundance level of Alistipes shahii had a higher number of different species in the gut (species richness) (Spearman r=0.2, adjusted P=3.96x10^-11), suggesting a beneficial effect on the microbial ecosystem (table S18).
Correlations with the number of different species were also found for other bacteria, including Roseburia hominis, Coprococcus catus, Barnesiella intestinihominis, and unclassified species from the genus Anaerotruncus. These bacteria also correlated with fruit, vegetable, and nut consumption and with intrinsic phenotypes such as HDL, triglycerides and quality of life. Based on these data, it would be interesting to explore the potential to modulate disease-associated species through medication or diet, although we still need to address the causality and underlying mechanism.

Conclusions

Our study revealed significant associations between the gut microbiome and various intrinsic, environmental, dietary and medication parameters, and disease phenotypes, with a high replication rate between MGS and 16S rRNA gene sequencing data from the same subjects. Moreover, our study provides many new intrinsic and exogenous factors that correlate with shifts in the microbiome composition and functionality and that can potentially be manipulated to improve microbiome-related health. We hope our results will inspire further experiments to explore the biological relevance of the associated factors. While most of the factors we assessed exerted a very modest effect, fecal levels of Chromogranin A showed a high potential as a biomarker for gut health.

Materials and Methods

Population cohort and clinical metadata

The LifeLines-DEEP cohort is a sub-cohort of the LifeLines cohort (167,729 participants), which is being studied using a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioral, physical, and psychological factors that contribute to health and disease in the general Dutch population3. A subset of approximately 1,500 participants also took part in the more detailed LifeLines-DEEP study. For these participants, additional biological materials were collected, including samples for the analysis of the gut microbiome composition. The phenotyping and processing of LifeLines-DEEP has been described previously2. The 1,179 LifeLines-DEEP participants collected stool samples at home. These were immediately stored in the freezer, then collected on dry ice within a few days and transferred to a -80 °C facility. All samples were collected over a short period of 3 to 4 months (May-August 2013). Sample collection, DNA preparation, processing, and sequence and data analysis were all performed in a standardized manner using laboratory space and equipment described previously2.
Three fecal biomarkers, six plasma cytokines, and plasma citrulline were also measured, including calprotectin as a marker for immune system activation31,32, human-β-defensin-2 (HBD-2) as a marker for defense against invading microbes33,34, chromogranin A (CgA) as a marker for neuroendocrine system activation14,35,36, and citrulline as a measure of enterocyte mass37,38. The fecal markers were measured simultaneously to avoid additional thaw cycles of the fecal samples (done by Dr. Stein & Colleagues Medical Laboratory, the Netherlands). Commercial enzyme-linked immunosorbent assays (ELISA, Bühlmann Laboratories, Switzerland, and Immunodiagnostik AG, Germany)14,34 were used to measure calprotectin and human β-defensin 2. Chromogranin A was measured using a commercial radio-immunoassay (RIA, Euro-Diagnostica, Sweden). The assessment methods were described in detail previously14,34. Concentrations of plasma citrulline were determined by high pressure liquid chromatography (HPLC) with fluorescence detection, as described previously38. Plasma cytokines (i.e. IL-1β, IL-6, IL-8, IL-10, IL-12p70, and TNF-α) were measured by ProcartaPlex™ multiplex immunoassay (eBioscience, USA) as described previously39,40. All test kits were used according to the manufacturers’ instructions. Stool type and stool frequency were calculated as averages from the 7-day Bristol stool scale.

In summary, we selected 207 factors that were further categorized into 5 groups: 41 intrinsic factors, 39 diseases with at least 5 occurrences, 5 smoking factors, 78 dietary factors, and 44 drugs with at least 5 users (see table S1 and table S19 for drug categories). Most of the factors showed low or modest inter-correlation, except for substantial correlations between food intake frequency and total amount (table S2A-C, fig. 2A-D). The data type can be continuous (e.g. age, BMI, blood lipids), binary (e.g. gender, disease status, or medication), or categorical (e.g. food intake frequency).
For continuous traits, we also tested the normality of the distribution and performed log-transformation if necessary. A brief description of each factor and the summary statistics are provided in table S1.

Metagenomic sequencing

Sequencing and reads filtering. Microbiome data was generated for 1,179 LifeLines-DEEP samples. Fecal samples were collected at home within two weeks of blood sample collection, and stored immediately at -20 °C. After transport on dry ice, fecal samples were stored at -80 °C. Aliquots were made and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen; cat. #80204). The 16S rRNA gene of the isolated DNA was previously sequenced at the Broad Institute, Boston, using Illumina MiSeq paired-end reads; the methods for 16S rRNA gene sequencing and data analysis are described in ref. 30. In the current study, the metagenomic sequencing was performed using the shotgun sequencing method. Reads were first quality-filtered using the in-house pipeline at the Broad Institute (Boston). On average, 3.0 Gb of data (about 32.3 million high-quality reads) were generated per sample. We removed 44 samples whose read counts were lower than 15 million and retained 1,135 samples for further analysis. Next, the sequencing adapters were removed and reads were quality trimmed using Trimmomatic (version 0.32)41. Human contamination was further removed (on average this was less than 1% of the reads) by mapping the reads against the human reference genome (build 37) using bowtie2 (version 2.1.0)42.

Microbial composition and functional profile. The profile of microbial composition was predicted using the tool MetaPhlAn 2.05. This tool uses a set of ~1 million markers (on average 184 marker genes for each bacterial species) from >7,500 species43. The tool reported the abundance levels of 1,649 microbial taxonomies in our data, including 20 phyla, 31 classes, 47 orders, 102 families, 220 genera, and 632 species from four different domains (Fig. 1). We confined our analysis to the species level, which showed high levels of specificity, accuracy, and coverage (99.98% of the composition, on average) (table S3).
Samples were functionally profiled using HUMAnN2 (http://huttenhower.sph.harvard.edu/humann2). HUMAnN2 maps reads to a customized database of pan-genomes consisting of species predicted during the taxonomic profiling of the respective sample. This analysis revealed the abundance level of 568,874 gene families from the UniProt Reference Clusters (UniRef50, http://www.uniprot.org) that were further mapped to 344 pathways from the MetaCyc metabolic pathway database (www.metacyc.org). The 568,874 gene families were further grouped into 21,556 clusters of orthologous groups (COGs) from the EggNOG data (www.egnogdb.embl.de) (www.geneontology.org). The complete analysis scheme is presented in fig. S3. To make high-level functional observations possible, the gene family abundances were further grouped into broader functional categories based on annotations for the UniProt proteins44 and gene ontology (GO)45,46. This resulted in a total of 23 non-redundant GO Molecular Function categories.

Statistical analysis

Measuring the features of microbial composition and function. We calculated four different parameters to assess the different features of microbial composition and function among individuals. Two parameters were assessed at the level of the compositional profile: Bray-Curtis distance and Shannon’s diversity index. Bray-Curtis distance was calculated using the function vegdist in the R package vegan (version 2.3-2). The Shannon diversity index per individual was calculated using the function diversity from the same package. At the level of the functional profile, we assessed the richness of the gene families and orthologous groups, respectively. The total count of the unique UniRef gene families per sample was calculated as the gene richness and the total count of the unique COGs from the EggNOG database per sample was calculated as the COG richness.
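
As an illustration of the measures described above, the following minimal Python sketch computes a Bray-Curtis dissimilarity, a Shannon index (natural logarithm), and a richness count for small abundance vectors. It is not the code used in the study, which relied on vegdist and diversity from the R package vegan; the function names, the handling of empty samples, and the example abundances are assumptions made for this sketch.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("both samples must cover the same features")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # assumption of this sketch: two empty samples count as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def shannon_index(abundances):
    """Shannon index H = -sum(p_i * ln(p_i)) over features with p_i > 0."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    return -sum((x / total) * math.log(x / total) for x in abundances if x > 0)


def richness(abundances):
    """Number of features (species, gene families, or COGs) with non-zero abundance."""
    return sum(1 for x in abundances if x > 0)


# Hypothetical relative abundances (%) of four species in two samples.
sample_a = [60.0, 30.0, 10.0, 0.0]
sample_b = [20.0, 20.0, 20.0, 40.0]

print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
print("Shannon (sample a):", shannon_index(sample_a))
print("Richness (sample a):", richness(sample_a))
```

The same richness function applies to gene families and COGs, since both richness measures are counts of unique features detected per sample.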

Impact of sequencing depth. The sequencing depth of the 1,135 samples ranged between 15.1 and 91.2 million reads. We assessed the impact of sequencing depth on the microbial composition and function profiles. We did not observe any strong impact on microbial composition. The Spearman correlation coefficient between Shannon diversity index and sequencing depth was 0.044 (P=0.14) and the sequencing depth explained only 0.68% of the variation in the microbial composition (P=0.024). However, for the functional profiles, we did observe significant associations between sequencing depth and gene richness (Spearman correlation r=0.37, P=8.0x10^-40) and COG richness (Spearman correlation r=0.27, P=5.0x10^-20), respectively. A recent study suggested that an appropriate analysis model can be sufficient to account for the difference in sequencing depth47. Thus we further corrected the gene richness and COG richness for sequencing depth using linear regression models. The sequence-depth-corrected richness correlated significantly with the rarefied operational-taxonomic-unit-based richness from the 16S rRNA sequencing data30: Spearman r=0.48 (P=2.3x10^-51) for gene richness and Spearman r=0.54 (P=1.1x10^-66) for COG richness, respectively.

Association analysis between 207 factors and Bray-Curtis distance. We assessed how much of the variation in Bray-Curtis distance can be explained by each of the 207 factors using the function adonis from the R package vegan48. The P value was determined by 1,000 permutations and was further adjusted for multiple testing of 207 factors using the Benjamini and Hochberg method. The total variation explained was also calculated per category (intrinsic factors, diseases, smoking, dietary factors, and drugs) and for all factors together. Overall, 126 factors (FDR<0.1) were significantly associated with inter-individual microbial distance, including 31 intrinsic factors, 60 dietary factors, 19 drugs, 12 diseases, and 4 smoking parameters.
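
The Benjamini and Hochberg adjustment used for the multiple-testing correction above can be sketched in a few lines of Python. This is only an illustration of the standard step-up procedure, not the study's own code; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical permutation P values for four factors.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)

# Factors passing the FDR<0.1 threshold used in the text.
significant = [q < 0.1 for q in adj_p]
print(adj_p, significant)
```

With 207 tested factors, m would be 207, and a factor counts as associated when its adjusted value falls below 0.1.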
Altogether, these 126 factors explained 18.7% of the variation in microbial composition.

Association analysis between factors and diversity and richness. Associations were assessed by the Spearman correlation between each factor and each diversity or richness measure. The P values were further adjusted for multiple testing of 207 factors using the Benjamini and Hochberg method.

Association analysis on individual species. To test the association for each species, we first filtered out low-abundance species and confined our analysis to 170 species that accounted for at least 0.01% of microbial composition and were present in more than 10 participants. These 170 species accounted for an average of 99.3% of microbial composition. The percentage of each species was arcsine square-root transformed by taking the arcsine of the square root of its proportional value. The associations of individual species with each factor were assessed using the statistical program Multivariate Association with Linear Models (MaAsLin, https://huttenhower.sph.harvard.edu/maaslin). MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and microbial community abundance or function. The clinical metadata can be of any type, e.g., continuous, Boolean, or discrete/factor data. MaAsLin returns associations of specific microbial community members with metadata. This tool can include a boosting step that selects other factors in the study that are potentially associated with microbial abundance. Then, if these factors are included in a linear model as covariates, the detected associations are independent of other factors. Thus, MaAsLin can handle the complex data types and correlation structures of metadata in LifeLines-DEEP. In our analysis we used the default settings of MaAsLin. We performed association analysis in two different ways using MaAsLin. 
In the first analysis, we only corrected for the impact of age, gender, and sequencing depth. For the association analysis with age, we corrected for the effect of gender and sequencing depth. For the association analysis with gender, we corrected for age and sequencing depth. For the rest of the factors, the effects of age, gender, and sequencing depth were all included. The second analysis was performed to address the dependence of the data and to reveal the most dominant effects. We included a boosting step in MaAsLin that selected all the potential cofactors as covariates in the model. Thus, the reported associations were corrected for the influence of other factors. In each analysis, the false discovery rate was controlled at a Q value of 0.1 using the R package qvalue. This correction was applied to the total number of tests for 170 species and 207 factors.

Association analysis on individual pathways. To test the association for each MetaCyc pathway, we first filtered out eukaryotic pathways and pathways present in fewer than 10 participants. The arcsine square-root transformation was applied to the pathway abundance data. Because the distribution of the pathway data was more skewed, we further performed quartile normalization on the transformed data. A total of 215 pathways were then tested for association with the different factors using MaAsLin. We performed association analysis in two different ways using MaAsLin. In the first analysis we only corrected for the impact of age, gender, and sequencing depth. For associations with age, we corrected for the effect of gender and sequencing depth. For the association analysis with gender, we corrected for age and sequencing depth. For the rest of the factors, the effects of age, gender, and sequencing depth were all included. The second analysis was performed to address the dependence of the data and to reveal the most dominant effects. We included a boosting step in MaAsLin that selected all the potential cofactors as covariates in the model. Thus, the reported associations were corrected for the influence of other factors. In each analysis, the false discovery rate was controlled at a Q value of 0.1 using the R package qvalue. This correction was applied to the total number of tests for 215 pathways and 207 factors.

Correlations between species abundance and the richness of species of the gut microbiome. The species together form a complex ecosystem. 
To assess the effect of each species on the total richness, the Spearman correlation was computed between the abundance of each species and the species richness (i.e., the number of different species) of the microbial composition. The P values were adjusted for multiple testing using the Benjamini and Hochberg method.

Replication of the CgA association

We validated the association of CgA with microbial composition in an independent cohort of 19 subjects (9 males and 10 females) with an average age of 36.3 years and an average body mass index (BMI) of 24.0. We performed 16S rRNA gene sequencing on these samples. The fecal level of CgA was measured in the same array, as described above. The fecal samples were collected and stored in the same way as in the LifeLines-DEEP study2. DNA was isolated with the QIAamp DNA Stool Mini Kit. The hypervariable region V3 to V4 of the 16S rRNA gene was selected using forward primer F515 (GTGCCAGCMGCCGCGG) and reverse primer “E. coli 907-924” (CCGTCAATTCMTTTRAGT) and was sequenced using 454 pyrosequencing at the Beijing Genomics Institute (BGI), China. The average number of reads was 5,862 with a maximum of 12,000 reads. Operational taxonomic units (OTUs) were picked using QIIME (version 1.7.0)49 reference optimal picking, which uses UCLUST (version 1.2.22q)50 to perform the clustering. As a reference database, we used a primer-specific version of the full GreenGenes 13.5 database51. Using TaxMan52, we created the primer-specific reference database containing only reference entries that matched our selected primers. During this process we restricted the mismatches of the probes to the references to a maximum of 25%. The 16S regions that were captured by our primers, including the primer sequences, were extracted from the full 16S gene sequences. For each of the reference clusters, we determined the overlapping part of the taxonomy of each of the reference reads in the cluster and used this overlapping part as the taxonomic label for the cluster. 
OTUs had to be supported by at least 100 reads and had to be detected in at least two samples; less abundant OTUs were excluded from the analysis. In this way, abundance data on 1,155 OTUs were obtained. Bray-Curtis distance was calculated on the transformed abundance level of these OTUs, as described above. The association between CgA level and Bray-Curtis distance was calculated using the adonis method from R package vegan. The P value was determined by 1,000 permutations.
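The OTU support filter and the abundance transformation mentioned above can be sketched as follows. This is an illustration in Python with NumPy, not the pipeline used in the study: the function names and the toy count matrix are hypothetical, "supported by at least 100 reads" is assumed to mean the total across all samples, and relative abundances are assumed to be computed before filtering.

```python
import numpy as np


def filter_otus(counts, min_reads=100, min_samples=2):
    """Boolean mask over OTU columns that pass both support filters.

    counts is a samples x OTUs matrix of read counts. An OTU is kept when its
    total read count is at least min_reads (assumed: summed over samples) and
    it is detected in at least min_samples samples.
    """
    counts = np.asarray(counts)
    enough_reads = counts.sum(axis=0) >= min_reads
    enough_samples = (counts > 0).sum(axis=0) >= min_samples
    return enough_reads & enough_samples


def arcsine_sqrt(proportions):
    """Arcsine square-root transform of proportions in the range [0, 1]."""
    p = np.asarray(proportions, dtype=float)
    return np.arcsin(np.sqrt(p))


# Hypothetical toy data: 3 samples (rows) x 3 OTUs (columns).
counts = np.array([[60, 99, 0],
                   [50, 0, 200],
                   [10, 0, 0]])
keep = filter_otus(counts)
relative = counts / counts.sum(axis=1, keepdims=True)  # per-sample proportions
transformed = arcsine_sqrt(relative[:, keep])
```

In the toy matrix only the first OTU passes: the second has fewer than 100 reads in total and the third is detected in a single sample.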

Association of cis-eQTL SNPs of the CHGA gene with the gut microbiome

For 984 samples, genetic information was available and has been described previously2,30. Given the strong relation between CgA and the microbiome, we further tested whether genetic variants that influence expression of CHGA (the gene encoding CgA) can affect fecal CgA level and the gut microbiome. Seven SNPs cis-affecting the expression of the CHGA gene (cis-eQTL SNPs) were extracted from various studies53,54, of which one SNP, rs9658667, failed to pass our quality control (table S15). Thus, genetic association was tested between six cis-eQTL SNPs and the fecal level of CgA and the transformed abundance of 170 species.

Short chain fatty acids relation to metformin use in type 2 diabetes patients

To measure the effect of metformin usage on the production of short chain fatty acids (SCFAs), we profiled the fecal levels of acetate, propionate, and butyrate in 24 type 2 diabetes (T2D) patients, of whom 15 took metformin and 9 did not, using gas chromatography–mass spectrometry (GC-MS) in the ‘Dr. Stein & Colleagues’ medical laboratory (Maastricht, the Netherlands). The Wilcoxon test was performed to assess the difference in SCFAs between T2D metformin users and non-users.

Funding

This project was funded by grants from the Top Institute Food and Nutrition, Wageningen, to C.W. (TiFN GH001), the Netherlands Organization for Scientific Research to J.F. (NWO-VIDI 864.13.013), L.F. (ZonMW-VIDI 917.14.374), and R.W. (ZonMW-VIDI 016.136.308), and CardioVasculair Onderzoek Nederland to M.H. and A.Z. (CVON 2012-03). A.Z. holds a Rosalind Franklin Fellowship (University of Groningen) and M.C.C. holds a postdoctoral fellowship from the Fundación Alfonso Martín Escudero. This research received funding from the European Research Council under the European Union’s Seventh Framework Program: C.W. is supported by FP7/2007-2013/ERC Advanced Grant Agreement no. 2012-322698. M.G.N. is supported by an ERC Consolidator Grant (#310372). 
L.F. is supported by FP7/2007–2013, grant agreement 259867, and by an ERC Starting Grant, grant agreement 637640 (ImmRisk). J.R. and G.F. are supported by FP7 METACARDIS HEALTH-F4-2012-305312, VIB, FWO, IWT, the Rega institute for Medical Research, and KU Leuven. S.V.S. and M.J. are supported by postdoctoral fellowships from FWO.

Author contributions

A.Z., C.W. and J.F. designed the study. A.Z., E.F.T., L.F., and C.W. initiated the cohort and collected cohort data. A.Z., E.F.T., Z.M., S.A.J., M.C.C., and D.K. generated data. A.Z., A.K., M.J.B., E.F.T., M.S., T.V., A.V.V., G.F., S.V.S., J.W., F.I., P.D., M.A.S., C.H., R.J.X., and J.F. analyzed data. G.F., S.V.S., J.W., E.B., M.J., R.K.W., E.J.M.F., M.G.N., D.G., D.J., L.F., Y.S.A., C.H., J.R., R.J.X., and M.H.H. participated in integral discussions. A.Z., A.K., M.J.B., R.J.X., C.W., and J.F. wrote the manuscript. The authors have no conflicts of interest to report.

Data

The data are currently being uploaded to the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/) (ega-box-423).

Informed consent

The study was approved by the institutional review board of UMCG, ref. M12.113965.

References and Notes

1. Clemente, J. C., Ursell, L. K., Parfrey, L. W. & Knight, R. The impact of the gut microbiota on human health: an integrative view. Cell 148, 1258–1270 (2012).
2. Tigchelaar, E. F. et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5, e006772 (2015).
3. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
4. Information on materials and methods is available at the Science Web site.
5. Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–4 (2012).
6. Sinha, R., Abnet, C. C., White, O., Knight, R. & Huttenhower, C. The microbiome quality control project: baseline study design and future directions. Genome Biol. 16, 276 (2015).
7. Goodrich, J. K. et al. Human Genetics Shape the Gut Microbiome. Cell 159, 789–799 (2014).
8. Falony, G. et al. Population-level analysis of gut microbiome variation. (2016).
9. Huttenhower, C. et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
10. Hildebrand, F. et al. Inflammation-associated enterotypes, host genotype, cage and inter-individual effects drive gut microbiota variation in common laboratory mice. Genome Biol. 14, R4 (2013).
11. Hedin, C. et al. Siblings of patients with Crohn’s disease exhibit a biologically relevant dysbiosis in mucosal microbial metacommunities. Gut (2015). doi:10.1136/gutjnl-2014-308896
12. Miller, T. L., Wolin, M. J., Conway de Macario, E. & Macario, A. J. Isolation of Methanobrevibacter smithii from human feces. Appl. Environ. Microbiol. 43, 227–232 (1982).
13. Lee, T. et al. Evaluation of psychosomatic stress in children by measuring salivary chromogranin A. Acta Paediatr. 95, 935–939 (2006).
14. Ohman, L., Stridsberg, M., Isaksson, S., Jerlstad, P. & Simrén, M. Altered levels of fecal chromogranins and secretogranins in IBS: relevance for pathophysiology and symptoms? Am. J. Gastroenterol. 107, 440–7 (2012).
15. Sciola, V. et al. Plasma chromogranin A in patients with inflammatory bowel disease. Inflamm. Bowel Dis. 15, 867–871 (2009).
16. Wagner, M. et al. Increased fecal levels of chromogranin A, chromogranin B, and secretoneurin in collagenous colitis. Inflammation 36, 855–861 (2013).
17. Aslam, R. et al. Chromogranin A-derived peptides are involved in innate immunity. Curr. Med. Chem. 19, 4115–4123 (2012).
18. Bartolomucci, A. et al. The extended granin family: structure, function, and biomedical implications. Endocr. Rev. 32, 755–797 (2011).
19. Queipo-Ortuño, M. I. et al. Influence of red wine polyphenols and ethanol on the gut microbiota ecology and biochemical biomarkers. Am. J. Clin. Nutr. 95, 1323–1334 (2012).
20. Duda-Chodak, A., Tarko, T., Satora, P. & Sroka, P. Interaction of dietary compounds, especially polyphenols, with the intestinal microbiota: a review. Eur. J. Nutr. 54, 325–341 (2015).
21. Mills, C. E. et al. In vitro colonic metabolism of coffee and chlorogenic acid results in selective changes in human faecal microbiota growth. Br. J. Nutr. 113, 1220–1227 (2015).
22. Sokol, H. et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc. Natl. Acad. Sci. U. S. A. 105, 16731–16736 (2008).

23. Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).
24. Wu, G. D. et al. Linking long-term dietary patterns with gut microbial enterotypes. Science 334, 105–108 (2011).
25. Korpela, K. et al. Intestinal microbiome is related to lifetime antibiotic use in Finnish pre-school children. Nat. Commun. 7, 10410 (2016).
26. Poullis, A., Foster, R., Mendall, M. A., Shreeve, D. & Wiener, K. Proton pump inhibitors are associated with elevation of faecal calprotectin and may affect specificity. Eur. J. Gastroenterol. Hepatol. 15, 573–574 (2003).
27. Burton, J. H. et al. Addition of a Gastrointestinal Microbiome Modulator to Metformin Improves Metformin Tolerance and Fasting Glucose Levels. J. Diabetes Sci. Technol. 9, 808–814 (2015).
28. Cabreiro, F. et al. Metformin retards aging in C. elegans by altering microbial folate and methionine metabolism. Cell 153, 228–239 (2013).
29. Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262–266 (2015).
30. Fu, J. et al. The gut microbiome contributes to a substantial proportion of the variation in blood lipids. Circ. Res. 117, 817–824 (2015).
31. Tibble, J. A. et al. High prevalence of NSAID enteropathy as shown by a simple faecal test. Gut 45, 362–366 (1999).
32. Joshi, S., Lewis, S. J., Creanor, S. & Ayling, R. M. Age-related faecal calprotectin, lactoferrin and tumour M2-PK concentrations in healthy volunteers. Ann. Clin. Biochem. 47, 259–263 (2010).
33. Harder, J., Bartels, J., Christophers, E. & Schröder, J. M. A peptide antibiotic from human skin. Nature 387, 861 (1997).
34. Langhorst, J. et al. Elevated human beta-defensin-2 levels indicate an activation of the innate immune system in patients with irritable bowel syndrome. Am. J. Gastroenterol. 104, 404–410 (2009).
35. El-Salhy, M., Lomholt-Beck, B. & Hausken, T. Chromogranin A as a possible tool in the diagnosis of irritable bowel syndrome. Scand. J. Gastroenterol. 45, 1435–1439 (2010).
36. Sidhu, R., Drew, K., McAlindon, M. E., Lobo, A. J. & Sanders, D. S. Elevated serum chromogranin A in irritable bowel syndrome (IBS) and inflammatory bowel disease (IBD): a shared model for pathogenesis? Inflamm. Bowel Dis. 16, 361 (2010).
37. Windmueller, H. G. & Spaeth, A. E. Source and fate of circulating citrulline. Am. J. Physiol. 241, E473-480 (1981).
38. Crenn, P., Messing, B. & Cynober, L. Citrulline as a biomarker of intestinal failure due to enterocyte mass reduction. Clin. Nutr. 27, 328–339 (2008).
39. Farzi, A. et al. Synergistic effects of NOD1 or NOD2 and TLR4 activation on mouse sickness behavior in relation to immune and brain activity markers. Brain. Behav. Immun. 44, 106–120 (2015).
40. Whelan, R. A. et al. A Transgenic Probiotic Secreting a Parasite Immunomodulator for Site-Directed Treatment of Gut Inflammation. Mol. Ther. 22, 1730–1740 (2014).
41. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
42. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
43. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
44. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204-212 (2014).
45. Dimmer, E. C. et al. The UniProt-GO Annotation database in 2011. Nucleic Acids Res. 40, (2012).
46. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43, D1049-1056 (2014).
47. McMurdie, P. J. & Holmes, S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 10, e1003531 (2014).

48. Oksanen, J. et al. vegan: Community Ecology Package. R package version 2.3-0. (2015).
49. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010).
50. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
51. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006).
52. Brandt, B. W., Bonder, M. J., Huse, S. M. & Zaura, E. TaxMan: A server to trim rRNA reference databases and inspect taxonomic coverage. Nucleic Acids Res. 40, W82-7 (2012).
53. Gupta, R. K., Pant, C. S., Singh, A. K. & Behl, P. Real time ultrasonography in the evaluation of hydrocephalus and associated abnormalities. Indian Pediatr. 23, 249–254 (1986).
54. Neoptolemos, J. P. et al. Study of common bile duct exploration and endoscopic sphincterotomy in a consecutive series of 438 patients. Br. J. Surg. 74, 916–921 (1987).

Acknowledgements

We thank the LifeLines-DEEP participants and the Groningen LifeLines staff for their collaboration. We thank Jackie Dekens, Mathieu Platteel, and Astrid Maatman for management and technical support. We thank Jackie Senior and Kate Mc Intyre for editing the manuscript.

Description of supplementary data files

The following additional data are available with the online version of this paper. Additional data:

Fig. S1 Overview of the LifeLines-DEEP cohort
Fig. S2 Inter-correlation of the factors per category
Fig. S3 Analysis scheme of metagenomic shotgun sequencing data
Fig. S4 Comparison of the phyla between metagenomics sequencing and 16S rRNA seq.
Fig. S5 Variation of microbial composition across individuals
Fig. S6 High function stability across individuals based on GO categories
Fig. S7 Replication of 126 associated factors with Bray-Curtis distance on 16S seq. data
Fig. S8 Number of factors associated with composition distance, diversity, and richness
Fig. S9 Correlation of clusters of orthologous group (COG) richness with gender and age
Fig. S10 Validation of the association of Chromogranin A in an independent cohort of 19 subjects (16S rRNA seq. data)
Fig. S11 The association between CgA and M. smithii and the methanogenesis pathway
Fig. S12 Associated species shared between proton pump inhibitors (PPI) and calprotectin
Fig. S13 Differences in short-chain fatty acids between metformin users and non-users among diabetes patients
Table S1 Description and summary statistics of 207 factors.
Table S2 Pairwise Spearman correlation coefficients of 207 factors.
Table S3 The predicted abundance of 632 species.
Table S4 Association of 207 factors with Bray-Curtis distance.
Table S5 Replication of the association with Bray-Curtis distance on 16S data.
Table S6 Association with Shannon’s diversity index.
Table S7 Replication of the association with Shannon’s diversity index on 16S seq. data.
Table S8 Association with gene richness.
Table S9 Association with COG richness.

The effect of host genetics on the gut microbiome

Nature Genetics, DOI: 10.1038/ng.3663

Marc Jan Bonder1,*, Alexander Kurilshikov1,2,3,*, Ettje F. Tigchelaar1,4, Zlatan Mujagic4,5, Floris Imhann6, Arnau Vich Vila6, Patrick Deelen1,7, Tommi Vatanen8,9, Melanie Schirmer8,10, Sanne P Smeekens11,12, Daria V. Zhernakova1, Soesma A Jankipersadsing1,14, Martin Jaeger11,12, Marije Oosting11,12, Maria C. Cenit1,‡, Ad A. M. Masclee5, Morris A. Swertz1,7, Yang Li1, Vinod Kumar1, Leo Joosten11,12, Hermie Harmsen13, Rinse K Weersma6, Lude Franke1, Marten H. Hofker14, Ramnik J. Xavier8,15,16,17, Daisy Jonkers5, Mihai G. Netea11,12, Cisca Wijmenga1, Jingyuan Fu1,14,#, Alexandra Zhernakova1,4,#

Abstract

The gut microbiome is affected by multiple factors, including genetics. In this study we assessed the influence of host genetics on microbial species, pathways and GO categories, based on metagenomic sequencing in 1,514 subjects. In a genome-wide analysis we identified associations of 9 loci to microbial taxonomies and 33 loci to microbial pathways and GO terms at P < 5×10⁻⁸. Additionally, in a targeted analysis of regions involved in complex diseases, innate and adaptive immunity or food preferences, 32 loci were identified at suggestive P < 5×10⁻⁶. Most of our reported associations are novel, including genome-wide significance for C-type lectin molecules CLEC4F/CD207 on 2p13.3 and CLEC4A/FAM90A1 on 12p13. We also identified association of a functional LCT SNP with Bifidobacterium (P = 3.45×10⁻⁸), and provide evidence of a gene-diet interaction in regulating Bifidobacterium abundance. Our results demonstrate the importance of understanding host-microbe interactions to gain better insight into human health.


1. University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands;
2. Institute of Chemical Biology and Fundamental Medicine SB RAS, Novosibirsk, Russia;
3. Novosibirsk State University, Novosibirsk, Russia;
4. Top Institute Food and Nutrition, Wageningen, the Netherlands;
5. Division of Gastroenterology-Hepatology, Department of Internal Medicine, NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Center, Maastricht, the Netherlands;
6. University of Groningen, University Medical Center Groningen, Department of Gastroenterology and Hepatology, Groningen, the Netherlands;
7. University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, the Netherlands;
8. The Broad Institute of MIT and Harvard, Cambridge, USA;
9. Department of Computer Science, Aalto University School of Science, 02150 Espoo, Finland;
10. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, USA;
11. Department of Internal Medicine, Radboud University Medical Center, Nijmegen, the Netherlands;
12. Radboud Center of Infectious Diseases, Radboud University Medical Center, Nijmegen, the Netherlands;
13. University of Groningen, University Medical Center Groningen, Department of Medical Microbiology, Groningen, the Netherlands;
14. University of Groningen, University Medical Center Groningen, Department of Pediatrics, Groningen, the Netherlands;
15. Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, USA;
16. Gastrointestinal Unit, Massachusetts General Hospital, Boston, USA;
17. Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Boston, USA;
‡ Present address: Department of Pediatrics, Dr. Peset University Hospital, Valencia, Spain;
* These authors contributed equally to this work;
# These authors jointly supervised this work;
Correspondence to: A.Z. 
([email protected]), J.F. ([email protected]) or C.W. ([email protected]).

The gut microbiome is considered our second genome and is linked to many common diseases. Numerous intrinsic and exogenous factors, including diet and medication, affect its composition1–6. Studies in mice7,8 and in human twins6,9 have observed substantial heritability of some bacteria. Such a genetic effect has not, so far, been investigated in humans on a large scale10,11. Several human and mouse studies have revealed interactions between host genetics and the microbiome in relation to disease phenotypes12–14. For example, NOD2 and CARD9 risk alleles associated with inflammatory bowel disease (IBD) only become manifest when triggered by the gut microbiome15–17, indicating that genetic predisposition to diseases can depend on the microbiome.

In this study we aimed to identify genomic loci that influence the gut microbiome in humans. We performed a three-stage association analysis using three Dutch cohorts: LifeLines-DEEP18 (n=984) as discovery, and 500FG19 (n=425) and MIBS-CO20 (n=105) as replication, followed by meta-analysis (described in Methods). Genotyping and metagenome shotgun sequencing were performed for all samples using uniform methods and pipelines. Our full dataset included 1,514 individuals, making it, to our knowledge, the largest metagenomics dataset available to date. The analysis was done in two branches: a genome-wide branch and a focused branch, for which we selected SNPs from four groups: GWAS hits for immune and metabolic traits, innate bacterial sensing genes, the MHC locus, and genes related to food metabolism and preferences.

We first performed the genome-wide analysis of common SNPs (MAF>0.05) on microbial taxonomies and functional units present in at least 25% of individuals (219 taxonomies, 636 MetaCyc pathways and 661 GO terms with at least 2,000 genes (GO2000); Supplementary Table 1). For each trait only the individuals with non-zero values were included in the analysis (Supplementary Methods). 
SNPs associated at P<5×10⁻⁵ in LifeLines-DEEP and replicated at P<0.01 with the same allelic direction were included in the meta-analysis (Fig. 1, Supplementary Table 2). In this analysis, 58 SNPs in nine loci were associated to microbial taxa (Fig. 2, Supplementary Table 3, Supplementary Fig. 1) and a total of 33 loci were associated to functional units, with 21 loci associated to MetaCyc pathways and 12 loci to GO2000 groups (Fig. 2, Supplementary Figs 2 and 3, Supplementary Table 3) at P<5×10⁻⁸ (FDR<12%; Supplementary Table 4A-B).
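The selection rule for carrying SNP-trait pairs forward to the meta-analysis can be written down as a small filter. The sketch below is illustrative only: the function names, the use of a signed effect size (beta) to represent allelic direction, the single combined replication statistic and the toy records are all assumptions, not details taken from the Methods.

```python
def same_direction(beta_a, beta_b):
    """True when two effect estimates for the same allele share a sign."""
    return beta_a * beta_b > 0


def passes_to_meta_analysis(p_disc, beta_disc, p_rep, beta_rep,
                            p_disc_max=5e-5, p_rep_max=0.01):
    """Discovery and replication filter with genome-wide defaults.

    For the targeted branch described in the text, the corresponding
    thresholds would be p_disc_max=5e-4 and p_rep_max=0.05.
    """
    return (p_disc < p_disc_max
            and p_rep < p_rep_max
            and same_direction(beta_disc, beta_rep))


# Hypothetical SNP-trait records: (label, p_disc, beta_disc, p_rep, beta_rep).
records = [
    ("snp_a", 1e-6, 0.30, 0.005, 0.10),   # passes both thresholds, same sign
    ("snp_b", 1e-6, 0.30, 0.005, -0.10),  # opposite allelic direction
    ("snp_c", 1e-4, 0.30, 0.005, 0.10),   # fails the discovery threshold
]
selected = [r[0] for r in records if passes_to_meta_analysis(*r[1:])]
```

Keeping the thresholds as parameters lets the same function describe both the genome-wide and the targeted branch.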

[Figure 1 (not reproduced). Panel A, "Identifying microbial quantitative trait loci": workflow diagram. Input: metagenomics data-based bacteria, MetaCyc pathways and GO categories. Genome-wide branch: discovery in LifeLines-DEEP (984), associations detected at 5×10⁻⁵; replication in 500FG + MIBS-CO (425 + 104) at 0.01 and with the same direction; meta-analysis at P=5×10⁻⁸, giving 9 loci associated to bacteria, 21 loci associated to MetaCyc and 12 loci associated to GO terms. Targeted branch (GWAS, innate, HLA, food): associations detected at 5×10⁻⁴; replication at 0.05 and with the same direction; meta-analysis at P=5×10⁻⁶, giving 14 GWAS loci, 15 innate loci, 2 HLA loci and 1 food locus. Panel B: chromosomes 1 to 22 marked with GO, MetaCyc and taxonomy associations, shown as genome-wide significant or suggestive loci.]

Figure 1. Data analysis workflow and association summary. A) Analysis overview highlighting the steps taken and selections made. B) Overview of the loci in the human genome that influence the gut microbiome.

[Figure 2 (not reproduced). "Microbial quantitative trait loci": Manhattan plot, x-axis chromosome (1 to 22), y-axis −log10(p); peaks labeled 1 to 31 and A to I.]

#   SNP                     Microbial taxa, MetaCyc pathway or GO term
1   rs199545687             Microtubule (GO:0005874)
2   rs12563071              Ubiquinol-8 biosynthesis (UBISYN-PWY)
3   rs958798                Transcription regulation from RNA pol-II (GO:0045944)
4   rs12048670              Polyamine biosynthesis II (POLYAMINSYN3-PWY)
5   rs4553849               Creatinine degradation II (PWY-4722)
6   rs6546647               Embryonic morphogenesis (GO:0048598)
7   rs2084597               Starch degradation III (PWY-6731)
8   rs17770672              Demethylmenaquinol-9 biosynthesis (PWY-5862)
9   rs2166811               Oxidoreductase activity (GO:0016706)
10  rs4973961               Thiamine Biosynthesis (PWY-7282)
11  rs924067                4-chlorobenzoate degradation (PWY-6215)
12  rs1497266               Thiamine diphosphate biosynth. proc. (GO:0009229)
13  rs10935496              Drug transmembrane transporter act. (GO:0015238)
14  rs35598536              Chorismate biosynthesis I (ARO-PWY)
15  rs10012347              Sitosterol degradation (PWY-6948)
16  rs12645801              Glycocholate metabolism (PWY-6518)
17  rs1666789               PROTOCATECHUATE-ORTHO-CLEAVAGE-PWY
18  rs78533343              Aerobic respiration I (PWY-3781)
19  rs2163761&rs56879175    PWY-6948 & GO:0003697
20  rs9475677               Methylaspartate cycle (PWY-6728)
21  rs6933130               L-methionine biosynthesis (PWY-5345)
22  rs7016086               Ubiquinol-7 biosynthesis (PWY-5873)
23  rs1041530               Seleno-compound metabolism (PWY-6395)
24  rs11606643              Innate immune response (GO:0045087)
25  rs9669179               Riboflavin biosynthetic process (GO:0009231)
26  rs7133214               L-methionine salvage cycle I (PWY-7528)
27  rs7992913               L-isoleucine biosynthesis I (PWY-3001)
28  rs74773701&rs4144435    P562-PWY & GO:0042823
29  rs113062739             Demethylmenaquinol-9 biosynthesis (PWY-5862)
30  rs17789629              Sulfuric ester hydrolase activity (GO:0008484)
31  rs2285198               Cell proliferation (GO:0008283)
A   rs12137699              f_Sutterellaceae
B   rs7605872               s_Dialister_invisus
C   rs4548017               c_Methanobacteria
D   rs10813066              g_Blautia
E   rs1889714               s_Dialister_invisus
F   rs16913594              s_Bacteroides_xylanisolvens
G   rs17115310              f_Acidaminococcaceae
H   rs10743315              s_Lachnospiraceae_bacterium_1_1_57FAA
I   rs2834288               f_Oscillospiraceae

Figure 2. Manhattan plot of genome-wide associations to microbes and to functional levels (MetaCyc pathways and GO2000 terms). The top part of the Manhattan plot shows the microbial abundance QTLs, labeled with letters, and the microbial functional QTLs, marked with numbers. Details of the associations are given in the table below; microbial abundance QTLs are in italic.

The strongest taxonomical association was observed for genus Blautia and family Methanobacteriaceae, for which substantial heritability was reported in twin studies (0.34 and 0.22, respectively)6. Blautia is an immunogenic epithelial-barrier-associated microbe21 linked to abnormal Paneth cell counts22, Crohn’s disease and primary sclerosing cholangitis23. The associated block is a large, intergenic region, located upstream of the LINGO2 gene, which has been associated to body mass index (BMI), obesity and motion sickness24–26. Methanobacteriaceae has been linked to BMI and lipid levels6,27, the Methanobacteriaceae- associated SNPs on 6q16.1 map into an extended lncRNA RP11-436D23.1. Two genome- wide significant hits, on 2q37.3 and 10p12.1, were observed for Dialister invisus. Individuals with higher levels of Dialister showed diet-dependent improvement in their inflammatory and cytokine profiles28. Sutterellacea abundance was associated to a SNP in the VANGL1 gene (P=4.5x10-8); high abundance of Sutterella has previously been associated with pouch health29,30. VANGL1 is highly expressed in gut tissues and involved in wound healing of the intestinal mucosa31 and mediates the invasiveness and progression of colon cancer cells32,33. For MetaCyc pathways the strongest association was observed for a pathway involved in the plant-derived steroids degradation (PWY-6948). Plant sterols have been suggested to have a beneficial effect on metabolic syndrome34. The abundance of this pathway was associated to two independent loci: SNPs in the SORCS2 gene (P=3.1x10-9) on 4p16.1, which is also associated to insulin-like growth factors35, and in the SLIT3 gene (P=3.8x10-9) on 5q35, which is also associated to obesity36. Another strong association was for the bile acid metabolism pathway (PWY-6518) (P=9.7x10-9) to SNPs in the ARAP1 gene, which is involved in focal adhesion and also associated to type 2 diabetes37. Among the 12 loci associated with GO

terms, two were located near clusters of C-type lectin domain family 4 genes: the CLEC4F/CLEC4K genes on 2p13.3 and the CLEC4A/FAM90A1 locus on 12p13. These genes encode molecules involved in cell adhesion, cell signaling, inflammation and immune response, and are known to bind the intestinal microbiota and to modulate the production of pro-inflammatory cytokines when activated. Several SNPs at the CLEC4A/FAM90A1 locus correlate with the GO term ‘riboflavin biosynthetic process’. Riboflavin is a redox mediator that can stimulate the growth of certain gut bacteria, including F. prausnitzii38; its metabolism is also increased in ileal Crohn’s disease39.

Overall, we observed that the genetic variants associated with functional profiles showed little overlap with the taxonomy-associated SNPs. Studies have shown that microbial composition can be highly variable, while individuals may still have similar functional profiles40. We investigated whether associated pathways and GO terms can be driven by specific microbial taxonomies by assessing ‘pathway-bacteria’ and ‘GO-bacteria’ correlations (Supplementary Fig. 4; Supplementary Tables 5 and 6). We observed that most associated pathways and GO terms are driven by several microbial taxonomies. This partly explains the low overlap between taxonomy- and function-associated loci and suggests that host genetics can directly shape the functional composition of the microbiome. Another explanation is a difference in analysis power. The average zero-rate per entry is 5% for the functional data, in contrast to 35% for the taxonomic abundance (Supplementary Table 1). The larger number of non-zero samples for microbial function leads to higher discovery power.

We further performed a targeted analysis using less stringent thresholds, P < 5x10^-4 at discovery, P < 0.05 in replication and P < 5x10^-6 in the meta-analysis (Fig. 
1 and Supplementary Table 2), corresponding to an FDR<1% (Supplementary Table 4C-D), on four categories of relevant immune and metabolism genes: immune and metabolic GWAS SNPs; innate molecules involved in microbial sensing; the MHC locus representing the adaptive immune system; and genes involved in food tolerance and preferences (Supplementary Tables 7 and 8).

In the analysis of GWAS hits, 14 loci were associated with the gut microbiome (Supplementary Table 9). The strongest signal (P=1.3x10^-7) was observed for the GO2000 term ‘cell-cell signaling’ associated to rs2155219, located in the C11ORF30-LRRC32 locus, which has been associated to multiple immune-related phenotypes, including IBD and allergy15,41. This GO term is a general functional category, but its abundance was correlated with the abundance of two IBD-associated bacteria42, Coprococcus comes (rS=0.55, P=4.64x10^-86) and Proteobacteria (rS=-0.42, P=7.10x10^-49). Three other IBD genes were also associated to the gut microbiome (Supplementary Table 9): CCL2, which contributes to microbiota-induced inflammation in mice on a high-fat diet43; DAP2, which is a negative regulator of autophagy; and IL23R, which mediates the IL-23-induced Th17 responses that are crucial for mucosal immunity44,45. We also observed an association of a SNP associated to Behcet’s disease (rs1800871), located in the IL10 gene, to the cullin-RING ubiquitin ligase complex (GO:0031461). IL10 is an important immuno-regulator of the gut, and mutations in this gene have been reported in cases of severe IBD46. Our findings, together with the epidemiological evidence for the role of infections in Behcet’s disease47, encourage further investigation into the gut microbiome in these patients. We also identified associations to several metabolic loci, including SNPs located in two lipid transfer protein genes, PLTP and APOE48,49, and in the PPARG gene, a regulator of fatty acid storage and glucose metabolism involved in lipodystrophy and type 2 diabetes50–52. 
Finally, we observed an interesting association between Lactococcus abundance and a SNP associated to body fat distribution (rs2294239) that affects the expression of the nearby ZNRF3 gene. ZNRF3 is a negative regulator of WNT signaling in Paneth cells53,54 and acts as a tumor suppressor in gastric cancer55. Thus, the association of the microbiome to several GWAS SNPs suggests that the link between host genetics and immune-mediated and metabolic phenotypes can be mediated by the gut microbiota.

To investigate the effect of variation in host immune genes on the microbiome, we selected

key molecules of innate microbial sensing based on previous studies56–58 (Supplementary Table 8). Furthermore, we imputed all SNPs, amino acids and alleles of the HLA-A, HLA-B, HLA-C, HLA-DQ, HLA-DP and HLA-DR genes located in the MHC locus. Only two genetic variants in the MHC were associated: an amino acid in the HLA-B gene and SNP rs3873352 located near the MUC22 gene (Supplementary Table 10). Many more associations were observed with innate immunity (Supplementary Table 11). Fifteen innate-immune-sensing loci were associated to microbiome composition or function at P<5x10^-6. In particular, the NOD2 locus (the strong risk locus for Crohn’s disease, previously linked with microbiome composition16) was associated to the MetaCyc pathway abundance of Enterobactin biosynthesis. This pathway showed a strong correlation with the abundance of E. coli (Supplementary Table 5). Enterobactin produced by E. coli inhibits myeloperoxidase (MPO), a bactericidal host enzyme, thereby gaining a survival advantage that allows E. coli to bypass host innate immune responses in inflammatory gut disease59. Moreover, SNPs in the NOD1 gene were associated to several pathways or GO categories, most of which were also driven by Enterobacteriales and, in particular, by E. coli. This bacterium is commonly associated with gut inflammation and Crohn’s disease60,61.

Consistent with the associations to the CLEC clusters detected at genome-wide level, the targeted analysis found associations to two other CLEC loci: the CLEC cluster on 12p13.3 (CLEC6A, CLEC4E and CLEC7A), and CD209/CLEC4G (dectin-1) on 19p13.2 (Supplementary Table 11). C-type lectin receptors, together with Toll-like and NOD-like receptors, play a major role in microbial recognition and in the activation of inflammatory reactions that are essential for controlling infections62. 
Mice lacking dectin-1 have an increased severity of chemically induced colitis, and polymorphisms in the CLEC7A gene in humans are linked to an increased severity of ulcerative colitis63. In general, our analysis indicated a stronger link of microbiota composition and function to innate receptors than to adaptive immunity genes.

As it is known that food intake can be affected by the host genome, we also selected SNPs from seven genes involved in food metabolism and/or consumption preferences (Supplementary Table 8). SNPs located in the LCT locus were associated to five GO terms at P<5x10^-6 (Supplementary Table 12). The LCT locus has been widely studied in relation to adult-type lactose intolerance (hypolactasia) and adults’ inability to digest milk64,65. The functional SNP rs4988235 tags the primary haplotype associated with hypolactasia in European populations, and predisposes to hypolactasia when present as the homozygous G/G genotype66. Given the previous observations linking this functional variant to Bifidobacterium9,10, we used a recessive model to investigate the relation between rs4988235 and the abundance of Bifidobacterium. Indeed, we observed that homozygosity for the G allele (G/G genotype) was related to a high abundance of Bifidobacterium (P=3.45x10^-8) (Fig. 3A). Next, we tested if this haplotype and Bifidobacterium abundance were associated with the consumption of dairy products. Quantitative information on dairy intake was available for the LifeLines-DEEP and MIBS-CO cohorts. We did not observe a significant difference in dairy product consumption between individuals with different rs4988235 genotypes, nor any correlation between dairy product consumption and Bifidobacterium abundance (Supplementary Fig. 5). However, we found that the abundance of Bifidobacterium was dependent on an interaction between genotype and intake of dairy products. 
In individuals with a G/G genotype, a relation between Bifidobacterium abundance and milk product consumption was observed in both the LifeLines-DEEP and the MIBS-CO cohort (Pmeta=0.0144) (Fig. 3B). In the 500FG cohort a similar trend was observed, although it was not significant (Fig. 3B). In summary, we detected a recessive effect of an LCT functional variant on the abundance of Bifidobacterium and found evidence of interaction between host genetics and diet in regulating the microbiome composition.
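The genotype-specific correlations compared here (the pdiff values in Fig. 3B) rest on Fisher’s r-to-z transformation, as described in the Methods. The following is a minimal Python sketch of that comparison for two independent groups. It is not the authors’ code: the study used the R package ‘psych’, and the function name, the input values in the usage note, and the use of the Pearson-style standard error for rank correlations are all assumptions made for illustration.

```python
import math
from statistics import NormalDist
from typing import Tuple


def fisher_r_to_z_test(r1: float, n1: int, r2: float, n2: int) -> Tuple[float, float]:
    """Compare two independent correlation coefficients.

    Returns (z, p), where p is the one-tailed P-value for the
    alternative hypothesis r1 > r2.
    Assumption: the usual 1/(n - 3) variance of Fisher's z is used,
    which is only approximate for Spearman correlations.
    """
    if min(n1, n2) <= 3:
        raise ValueError("each group needs more than 3 samples")
    # Fisher transformation of each correlation (atanh).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Standard error of the difference between the transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # One-tailed P-value from the standard normal distribution.
    p = 1.0 - NormalDist().cdf(z)
    return z, p
```

As a usage example with hypothetical numbers, `fisher_r_to_z_test(0.5, 103, 0.0, 103)` compares a correlation of 0.5 in one genotype group against no correlation in the other, each with 103 individuals; identical correlations in both groups give z = 0 and a one-tailed P-value of 0.5.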

[Figure 3 panels. A: scaled Bifidobacterium abundance by rs4988235 genotype (A/A & A/G versus G/G) in the discovery cohort (p = 1.64e-08), the replication cohorts (p = 0.1953) and the meta-analysis (p = 3.453e-08). B: scaled Bifidobacterium abundance against dairy intake, by genotype, in LifeLines-DEEP (log dairy products, g/day; pdiff = 0.026), MIBS-CO (log dairy products, g/day; pdiff = 0.009) and 500FG (glasses of milk, N/day; pdiff = 0.21).]

Figure 3. The complex interaction between the functional LCT variant, dairy intake and Bifidobacterium. A) A microbial QTL plot of the association of the functional LCT SNP (rs4988235) to the abundance of Bifidobacterium. The three plots in A show the effect of the functional SNP on Bifidobacterium as a combination of violin plots and boxplots. The boxplots show the median and the 25% and 75% quantiles. B) Interaction of LCT genotype and intake of dairy products in Bifidobacterium abundance in the three cohorts. Total dairy consumption was unavailable for the 500FG cohort, but we see the same trend related to data on the number of glasses of milk drunk per day in this cohort.

In conclusion, we performed a GWAS of gut microbiome composition and function, assessed by metagenomic sequencing in a large population cohort of 1,514 individuals. We identified multiple associations of genetic variants to human gut microbiome composition and function, which highlight the role of innate sensor molecules, particularly C-type lectins, in gut homeostasis and provide evidence for an interaction between host genetics, diet and the microbiome. Identifying associations between human genetics and the gut microbiome, and exploring their interactions, can provide insights into the role of the microbiome in complex diseases and drive the development of therapies to modulate the microbiome towards better health.

Methods

Cohort descriptions

This study uses three independent Dutch population cohorts: the LifeLines-DEEP cohort as a discovery cohort, and the 500 Functional Genomics cohort (500FG) and the controls in the Maastricht IBS cohort (MIBS-CO) as replication cohorts. 
The LifeLines-DEEP cohort consists of 1,539 individuals from the three northern provinces of the Netherlands (636 males and 903 females, age range 18-84 years)1. The replication cohorts comprise 534 individuals from the 500FG cohort (from the Human Functional Genomics Study, 237 males and 296 females, age range 18-75 years) and 105 healthy controls from the MIBS cohort (42 males and 63 females, age range 19-71 years)20. We confined our analysis to the set of individuals with both high-quality genotype and high-quality microbiome data: 984 individuals from LifeLines-DEEP, 425 individuals from 500FG and 105 controls from MIBS (MIBS-CO).

For all three cohorts, extensive phenotype information was available, including food intake. For 500FG this was based on questionnaire data for selected food items. LifeLines-DEEP and MIBS used a validated food frequency questionnaire67,68.

Informed consent

The LifeLines-DEEP and MIBS-CO studies were approved by the institutional ethics review boards of the UMCG and MUMC (clinical trials: NCT00775060). The 500FG study was approved by the Ethical Committee of Radboud University Nijmegen (NL42561.091.12, 2012/550). All participants signed an informed consent form. This study was approved by the institutional ethical review board of the UMCG, ref. M12.113965.

Genotyping

For all three cohorts, genome-wide genotyping was performed and the remaining SNPs and HLA alleles were imputed. The genotyping of the LifeLines-DEEP samples is described in Tigchelaar et al.18. In short, individuals were genotyped using both the HumanCytoSNP-12 BeadChip and ImmunoChip, a customized Illumina Infinium array. The data were merged and subsequently imputed using the Genome of The Netherlands (GoNL) dataset69. We removed ethnic outliers and genetically related participants to include a total of 1,268 individuals in the final genetic study. For the 500FG and MIBS cohorts, DNA samples of 516 500FG and 288 MIBS individuals (cases and controls) were genotyped using the commercially available SNP chip, Illumina HumanCoreExome-12 v1.1. The genotype calling was performed using Opticall 0.7.0 (ref. 70) with the default settings. Samples with a call rate ≤0.99 were excluded from the dataset, as were variants with a Hardy-Weinberg equilibrium P-value ≤0.0001, a call rate ≤0.99 or a MAF ≤0.001. Ethnic outliers were identified by multi-dimensional scaling plots of samples merged with 1000 Genomes data, and were then removed from our dataset. This filtering step resulted in two high-quality genetic datasets of 487 500FG individuals and 287 MIBS individuals. 
The final data contained genotype information for 518,980 variants in both datasets for further imputation. The strands of the variants were aligned and identifiers were updated to a combined reference of 1000G phase 3 v5 (ref. 71) and the GoNL dataset69 using Genotype Harmonizer72. The data were phased using SHAPEIT2 v2.r644 (ref. 73) with the combined reference panel. Finally, these data were imputed using IMPUTE2 (ref. 74) with GoNL as the reference panel75. Further, we imputed the HLA region in the three cohorts using the T1DGC reference from SNP2HLA v1.0.2, developed by the Type 1 Diabetes Genetics Consortium (T1DGC)76. We converted the SNP2HLA reference to a build 37 IMPUTE2 reference and separately harmonized and imputed our datasets using IMPUTE2 (ref. 74). In total, this study tested associations for 8.1 million SNPs in the genome-wide analysis and 8,606 variants in the HLA.

Metagenomic sequencing & reads filtering

Metagenomic sequencing was performed for 1,179 LifeLines-DEEP samples, 520 500FG samples and 626 MIBS samples (cases and controls) using the Illumina HiSeq platform. Within two weeks of participants giving a blood sample, they collected fecal samples at home and stored them immediately at -20 °C. After transport to the research lab on dry ice, fecal samples were stored at -80 °C. Aliquots were made and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen; cat. #80204) with the addition of mechanical lysis. Reads were quality filtered and adapters were removed using Trimmomatic (v.0.32)77; an average of 3.0 Gb of data (around 32.3 million high-quality reads) was obtained per sample. Reads belonging to the human genome were removed by mapping the data to the human reference genome (version NCBI37) with Bowtie2 (v.2.1.0). Finally, 99 samples with fewer reads than our threshold of 15 million were removed (44 samples from LifeLines-DEEP, 10 samples from 500FG, and 45 samples from MIBS).

Microbial data processing

The profile of microbial composition was determined using MetaPhlAn 2.2 (ref. 78), which uses a set of ~1 million markers (average 184 marker genes for each microbial clade) from >7,500 species. MetaPhlAn 2.2 reported the abundance level of 1,772 microbial taxonomies in our data, including 21 phyla, 32 classes, 50 orders, 108 families, 235 genera and 678 species from four different domains. We further normalized the taxonomy data using an arcsin square root transformation and corrected the normalized non-zero data for age, sex and read-depth. Functional profiling was performed using HUMAnN2, which maps reads to a customized database of functionally annotated pan-genomes. This analysis revealed the abundance levels of 5,379,353 gene families from the UniProt Reference Clusters that were further mapped to 773 pathways from the MetaCyc metabolic pathway database. We grouped the gene families based on gene ontology (GO)79,80. Using the hierarchy in GO, we isolated a subset of “informative” GO terms, which we defined as those associated with >2,000 proteins for which no descendant term was associated with >2,000 proteins. This yielded a set of 611 non-redundant GO terms for subsequent analysis. Details about the grouping can be found in Vatanen et al.81. For the gene counts for MetaCyc pathways and GO2000 terms, we first quantile-normalized the data and then corrected the normalized non-zero data for age, sex and read-depth.

Relating abundances of microbes, pathways and genes

As we extracted multiple levels of information on microbial abundance and functional entities from the metagenomics data, we also investigated the correlations between microbial taxonomies and pathways/GO terms using Spearman’s rank correlation in the R ‘base’ package. 
The correlation P-values were Bonferroni-corrected for the number of tests performed in each case (number of microbial taxonomies on the specified taxonomic level multiplied by the number of pathways/GO terms). All correlations mentioned were significant after multiple testing corrections.

Quantitative trait locus association analysis

We confined genetic analysis to the taxonomies, MetaCyc pathways, and GO2000 terms that were present in at least 25% of the samples in our datasets. In total, 236 taxonomies, 636 MetaCyc pathways and 611 GO2000 terms were included for analysis in the LifeLines-DEEP discovery cohort. We were able to perform replication analysis for 219 taxonomies, 636 MetaCyc pathways, and 611 GO2000 terms in the replication cohorts. In order to link the microbial composition and function to genetic variation, the normalized abundance values of taxonomies, GO2000 terms and MetaCyc pathways were treated as quantitative traits and we used rank-based Spearman correlation analysis to identify the association to the genotype of each SNP. For each trait only the individuals with non-zero values were included. These analysis steps have been incorporated into our in-house-developed QTL-mapping pipeline82, along with various quality control steps. For this specific application, the microbiome QTL mapping, we have included a filter for zero counts, so that only the non-zero values are incorporated during the analysis. We chose to do so because the zero counts can cause the data distribution to depart from normal, thus inducing a bias when correcting for the effect of age, gender and sequence depth using a linear model. As a result, the number of samples can differ per taxon or bacterial pathway, thereby affecting the analysis power. The common taxa or pathways have more non-zero values, yielding more power to identify QTLs. We performed our QTL-mapping in three steps. The association was first tested in the LifeLines-DEEP cohort. 
It was then replicated by a meta-analysis in the 500FG and MIBS cohorts. Finally, we performed a full meta-analysis over all three datasets. In this way, we performed genetic analysis at both the genome-wide level and the targeted-gene level (Fig. 1, Supplementary Table 2).
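The per-SNP association step described above (zero filtering followed by a rank-based Spearman correlation between genotype and abundance) can be illustrated with a small Python sketch. This is not the authors’ QTL-mapping pipeline: the function names and the toy data are hypothetical, genotypes are assumed to be coded as allele dosages (0, 1, 2), and the covariate correction for age, sex and read depth that precedes this step in the study is omitted.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def nonzero_spearman(dosages: Sequence[float],
                     abundances: Sequence[float]) -> Tuple[float, int]:
    """Spearman correlation between genotype dosage and trait abundance,
    using only the individuals in whom the trait is non-zero.

    Returns (rho, number of individuals used).
    """
    pairs = [(g, a) for g, a in zip(dosages, abundances) if a != 0]
    if len(pairs) < 3:
        raise ValueError("too few non-zero samples for this trait")
    g, a = zip(*pairs)
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(g), average_ranks(a)), len(pairs)
```

For example, with the hypothetical dosages `[0, 1, 2, 0, 1, 2]` and abundances `[0.0, 0.2, 0.5, 0.1, 0.0, 0.4]`, two individuals are dropped because their abundance is zero, and the correlation is computed on the remaining four. Because the filter is applied per trait, the sample size returned differs between traits, which is why common taxa and pathways carry more power.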

Genome-wide association: At the genome-wide level, we performed association analysis for all 8.1 million SNPs. Associations with bacteria, pathways and GO2000 terms were first identified at a P-value of 5x10^-5 in the discovery cohort. We then replicated these associations in the replication cohorts and selected the associations which were replicated at P < 0.01 in the same allelic direction as in the discovery cohort. For these associations, we then performed a meta-analysis using a weighted Z-approach to combine the association signal across the three cohorts. Only the associations at P < 5x10^-8 were reported.

Targeted genes approach: For this analysis, we selected SNPs from four different categories: (1) SNPs associated to GWAS hits for immune and metabolic phenotypes, (2) genes involved in innate microbial sensing, (3) genes involved in adaptive microbial sensing (the MHC locus), and (4) genes involved in food metabolism and food preferences. For the GWAS-associated SNPs, we extracted the immune and metabolic related SNPs from the GWAS catalogue and from recent Immunochip and Metabochip studies listed in Bonder et al. (2015). The selection of GWAS SNPs and their associated disease is presented in Supplementary Table 7. For genes involved in innate microbial sensing we included all 26 pathogen receptor and adapter molecules described in Casals et al.56. We also added MUC2 and NLRP6 (their roles in bacterial sensing were recently described in Caballero et al.58). We also included C-type lectin molecules for their role in pathogen recognition84, and three viral recognition receptors, MDA5 (encoded by IFIH1), RIG1 (encoded by RARRES3) and MAVS85. For adaptive immunity, we included all the genes from the MHC locus. We performed imputation of amino acids and allelic variants of the MHC molecules, as described above. 
For food tolerance and food preferences, we included molecules based on their association in GWAS and other literature studies (Supplementary Table 8). For the candidate genes in innate and adaptive immunity and the food-related genes, we defined a region of 250 Kb around the genes and tested all SNPs in this region. We selected a total of 76,444 SNPs for the targeted-gene approach, in which we adopted the same methods as for the GWAS analysis, but with a less stringent cut-off. The associations with bacteria, pathways and GO2000 terms were first identified at a P-value of 5x10^-4 in the discovery cohort. We then replicated those associations in the replication cohorts and selected the associations that were replicated at P < 0.01 in the same allelic direction as in the discovery cohort. For these associations, we further performed a meta-analysis using a weighted Z-approach to combine the association signals across the three cohorts. Only the associations at P < 5x10^-6 are reported.

Statistical significance

In our analysis we use a conservative, three-stage approach to identify associations between the gut microbiome and host genetics. For the genome-wide analysis we assumed 10^9 unrelated pairwise tests between 1 million unrelated SNPs and approximately 1,000 unrelated microbial traits. In the first stage we selected all associations at P<5x10^-5 in the discovery dataset. Therefore, we expect 50,000 false positives (5x10^-5 x 10^9 tests performed). In the second stage we used a replication threshold of 0.01. After the second stage we expect to have 500 false positive results. In the third stage we perform a meta-analysis using the weighted Z-score approach86 over both the discovery and replication cohorts, where we use a more stringent threshold of P<5x10^-8. Multiple P-value combinations, in the first and second stage, can reach the final desired meta P-value threshold (see Supplementary Table 4A-B). 
For example, if the P-value in the first stage is 5x10^-5, the P-value in the second stage needs to be lower than 3x10^-5 to meet the meta P-value threshold of 5x10^-8. With these P-value thresholds, the expected number of false positive results would be 1.5. The maximum number of false positive results in our three-stage approach is 5, depending on the combination of first- and second-stage P-values (Supplementary Table 4B). Since we find 42 associations, we expect a false discovery rate of 12%.
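The arithmetic behind these expected false-positive counts, and the form of a weighted Z-score meta-analysis, can be sketched in Python as follows. The test count and thresholds are taken from the text above. The meta-analysis function is an illustrative assumption rather than the authors’ implementation: it assumes two-sided P-values, effect directions given as +1 or -1, and weights equal to the square root of each cohort’s sample size, and it is not claimed to reproduce the values in Supplementary Table 4.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple

# Assumed number of unrelated tests: 1 million SNPs x 1,000 traits = 10^9.
N_TESTS = 1_000_000 * 1_000


def expected_false_positives(n_tests: float, *stage_thresholds: float) -> float:
    """Expected number of null tests that pass every stage threshold."""
    expected = float(n_tests)
    for threshold in stage_thresholds:
        expected *= threshold
    return expected


after_discovery = expected_false_positives(N_TESTS, 5e-5)          # about 50,000
after_replication = expected_false_positives(N_TESTS, 5e-5, 0.01)  # about 500
worked_example = expected_false_positives(N_TESTS, 5e-5, 3e-5)     # about 1.5
fdr_upper_bound = 5 / 42   # at most 5 false positives among 42 hits, about 12%


def weighted_z_meta(p_values: Sequence[float],
                    signs: Sequence[int],
                    sample_sizes: Sequence[int]) -> Tuple[float, float]:
    """Combine per-cohort two-sided P-values into one meta Z and P-value.

    Assumption: weights are sqrt(sample size), a common choice for the
    weighted Z-score method.
    """
    norm = NormalDist()
    weights = [math.sqrt(n) for n in sample_sizes]
    # Convert each two-sided P-value to a signed Z-score.
    z_scores = [s * norm.inv_cdf(1.0 - p / 2.0) for p, s in zip(p_values, signs)]
    z_meta = (sum(w * z for w, z in zip(weights, z_scores))
              / math.sqrt(sum(w * w for w in weights)))
    p_meta = 2.0 * (1.0 - norm.cdf(abs(z_meta)))
    return z_meta, p_meta
```

Requiring the same allelic direction in replication corresponds to passing matching entries in `signs`; two cohorts of equal size with opposite directions and equal P-values cancel to a meta Z-score of zero.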

For the targeted analysis we report all associations that passed the meta P-value threshold of P<5x10^-6. Given our setup, explained above, and the application of the chosen thresholds (P < 5x10^-4 in the initial discovery, P < 0.05 in the first replication, and the final P < 5x10^-6), our sample sizes, the 85,405 targeted SNPs, and the roughly 1,000 unrelated microbial traits, the chance of an association being spurious is very small (Supplementary Tables 4C and 4D). Given that we found 32 independent loci at this level of significance, we expect our false discovery rate to be < 1%.

Analysis of the LCT locus

The genetic variant rs4988235 at the LCT locus is known to have a recessive effect on lactose intolerance. The homozygous G/G genotype results in lactase deficiency and individuals with a G/G genotype are susceptible to lactose intolerance, while individuals with the A/A or A/G genotype have an efficient LCT enzyme. Thus, we specifically assessed the recessive effect of the functional SNP rs4988235 in the LCT locus by dividing the individuals into an LCT-deficient group (G/G genotype) and an LCT-efficient group (A/A or A/G genotype). The microbial abundance was compared between the two groups using the Spearman correlation test in the R ‘base’ package. The association was assessed in the discovery and replication cohorts, followed by a meta-analysis using the weighted sum of Z method in the R package ‘metap’ v.0.6-2. Further, we specifically tested the association between bacteria and dairy-product intake for the functional LCT SNP. To do this, we calculated the Spearman correlation between dairy-product consumption and Bifidobacterium abundance in groups of LCT-deficient and LCT-efficient individuals separately. For the LifeLines-DEEP and MIBS cohorts, dairy intake was extracted from the food frequency questionnaire (units: g/day). For the 500FG cohort, milk intake was reported as glasses of milk per day. 
Comparison of correlations between groups of LCT-efficient and LCT-deficient individuals was performed using Fisher’s r-to-z transformation with a one-tailed test in the R package ‘psych’ v.1.5.8. For the LCT associations described above, we selected only the participants for whom we had genetic, microbiome and food frequency questionnaire data available (923 individuals from LifeLines-DEEP, 397 from 500FG, and 101 from MIBS-CO).

Functional annotation of SNPs

For the associated SNPs, we further characterized their functions by looking into their association with complex traits and their effect on gene expression using multiple eQTL datasets. Multi-tissue eQTL results from the GTEx consortium87 were used in the annotation, including sigmoid colon, transverse colon, small intestine and esophagus (gastroesophageal junction, mucosa and muscularis), together with an eQTL dataset derived from a large number of blood samples (Zhernakova et al.88).

URLs

Human Functional Genomics Study: www.humanfunctionalgenomics.org
HUMAnN2: http://huttenhower.sph.harvard.edu/humann2
MetaCyc metabolic pathway database: www.metacyc.org
EGA: https://www.ebi.ac.uk/ega/
Clinical-trials: http://www.clinicaltrials.gov
SRA: http://www.ncbi.nlm.nih.gov/sra

Accession codes

The LifeLines-DEEP and MIBS metagenomics sequencing data are available at the European Genome-phenome Archive (EGA); LifeLines-DEEP: EGAS00001001704, MIBS: EGAS00001001924. The 500FG data are available at the SRA: PRJNA319574

Funding

This project was funded by grants from the Top Institute Food and Nutrition, Wageningen, to C.W. (TiFN GH001), the Netherlands Organization for Scientific Research to J.F. (NWO-VIDI 864.13.013), L.F. (ZonMW-VIDI 917.14.374), and R.W. (ZonMW-VIDI 016.136.308), and CardioVasculair Onderzoek Nederland to M.H., M.G.N., A.Z. and J.F. (CVON 2012-03). A.Z. holds a Rosalind Franklin Fellowship (University of Groningen). This research received funding from the European Research Council under the European Union’s Seventh Framework Program: C.W. is supported by FP7/2007-2013/ERC Advanced Grant (agreement 2012-322698) and a Spinoza Prize from the Netherlands Organization for Scientific Research (NWO). M.G.N. holds an ERC Consolidator Grant (#310372). L.F. is supported by an FP7/2007–2013 grant (agreement 259867) and by an ERC Starting Grant (637640, ImmRisk). Y.L. holds a Netherlands Organization for Scientific Research (NWO) VENI grant (number 863.13.011).

Author contributions

Conceptualization AZ, JF, CW, MJB; Methodology MJB, AK, LF, JF, PD, TV, MS; Software MJB, AK, LF, JF, PD, MAS, DVZ; Formal Analysis MJB, AK, JF, AZ; Investigation AZ, JF, MJB, AK, FI, DVZ, SAJ, AVV, EFT, HH, MCC; Resources CW, AZ, LF, JF, MAS, MGN, RJX, LJ, AAMM; Data Curation MJB, AK, JF, PD, LF, SAJ, YL; Writing – Original Draft AZ, JF, MJB, AK, CW; Writing – Review & Editing MJB, AK, EFT, ZM, FI, AVV, PD, TV, MS, SPS, DVZ, SAJ, MJ, MO, MAS, MCC, YL, VK, HH, RKW, LF, MHH, DJ, MGN, CW, JF, AZ; Visualization AK, MJB, AZ, JF; Supervision AZ, JF, CW, LF, RKW, MHH; Project Administration AZ, JF, CW, LF, MGN, DJ, AAMM, SPS; Funding Acquisition AZ, JF, CW, LF, MGN, DJ, AAMM.

Competing financial interests

The authors declare no competing financial interests.

References

1. Tigchelaar, E. F. et al. Gut microbiota composition associated with stool consistency. Gut 65, gutjnl-2015-310328 (2015).
2. Imhann, F. et al. Proton pump inhibitors affect the gut microbiome. Gut 65, gutjnl-2015-310376 (2015).
3. Zhernakova, A. et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352, 565–569 (2016).
4. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–563 (2014).
5. Scott, K. P., Gratz, S. W., Sheridan, P. O., Flint, H. J. & Duncan, S. H. The influence of diet on the gut microbiota. Pharmacol. Res. 69, 52–60 (2013).
6. Goodrich, J. K. et al. Human genetics shape the gut microbiome. Cell 159, 789–799 (2014).
7. Org, E. et al. Genetic and environmental control of host-gut microbiota interactions. Genome Res. 25, 1558–1569 (2015).
8. Leamy, L. J. et al. Host genetics and diet, but not immunoglobulin A expression, converge to shape compositional features of the gut microbiome in an advanced intercross population of mice. Genome Biol. 15, 552 (2014).
9. Goodrich, J. K. et al. Genetic Determinants of the Gut Microbiome in UK Twins. Cell Host Microbe 19, 731–743 (2016).
10. Blekhman, R. et al. Host genetic variation impacts microbiome composition across human body sites. Genome Biol. 16, 191 (2015).
11. Davenport, E. R. et al. Genome-wide association studies of the human gut microbiota. PLoS One 10, e0140301 (2015).

12. Srinivas, G. et al. Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nat. Commun. 4 (2013).
13. Parks, B. W. et al. Genetic control of obesity and gut microbiota composition in response to high-fat, high-sucrose diet in mice. Cell Metab. 17, 141–152 (2013).
14. McKnite, A. M. et al. Murine gut microbiota is defined by host genetics and modulates variation of metabolic traits. PLoS One 7, e39191 (2012).
15. Jostins, L. et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491, 119–124 (2012).
16. Knights, D. et al. Complex host genetics influence the microbiome in inflammatory bowel disease. Genome Med. 6, 107 (2014).
17. Lamas, B. et al. CARD9 impacts colitis by altering gut microbiota metabolism of tryptophan into aryl hydrocarbon receptor ligands. Nat. Med. advance on, (2016).
18. Tigchelaar, E. F. et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5, e006772 (2015).
19. Netea, M. G. et al. Understanding human immune function using the resources from the Human Functional Genomics Project. Nat. Med. 22, (2016).
20. Mujagic, Z. et al. Small intestinal permeability is increased in diarrhoea predominant IBS, while alterations in gastroduodenal permeability in all IBS subtypes are largely attributable to confounders. Aliment. Pharmacol. Ther. 40, 288–297 (2014).
21. Juste, C. et al. Bacterial protein signals are associated with Crohn’s disease. Gut 63, 1566–1577 (2014).
22. Liu, T.-C. et al. O-011 Paneth Cell Phenotypes Define a Subtype of Pediatric Crohn’s Disease Through Alterations in Host-Microbial Interactions. Inflamm. Bowel Dis. 22 Suppl 1, S4 (2016).
23. Torres, J. et al. The features of mucosa-associated microbiota in primary sclerosing cholangitis. Aliment. Pharmacol. Ther. 43, 790–801 (2016).
24. Hromatka, B. S. et al. Genetic variants associated with motion sickness point to roles for inner ear development, neurological processes and glucose homeostasis. Hum. Mol. Genet. 24, 2700–2708 (2015).
25. Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
26. Williams, A. L. et al. Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. Nature 506, 97–101 (2014).
27. Fu, J. et al. The gut microbiome contributes to a substantial proportion of the variation in blood lipids. Circ. Res. 117, 817–824 (2015).
28. Martínez, I. et al. Gut microbiome composition is linked to whole grain-induced immunological improvements. ISME J. 7, 269–280 (2013).
29. Tyler, A. D. et al. Characterization of the Gut-Associated Microbiome in Inflammatory Pouch Complications Following Ileal Pouch-Anal Anastomosis. PLoS One 8, e66934 (2013).
30. Landy, J. et al. Variable alterations of the microbiota, without metabolic or immunological change, following faecal microbiota transplantation in patients with chronic pouchitis. Sci. Rep. 5, 12955 (2015).
31. Kalabis, J., Rosenberg, I. & Podolsky, D. K. Vangl1 protein acts as a downstream effector of intestinal trefoil factor (ITF)/TFF3 signaling and regulates wound healing of intestinal epithelium. J. Biol. Chem. 281, 6434–6441 (2006).
32. Bae, J. A. et al. An unconventional KITENIN/ErbB4-mediated downstream signal of EGF upregulates c-Jun and the invasiveness of colorectal cancer cells. Clin. Cancer Res. 20, 4115–4128 (2014).
33. Lee, S. et al. Expression of KITENIN in human colorectal cancer and its relation to tumor behavior and progression. Pathol. Int. 61, 210–220 (2011).
34. Klingberg, S. et al. Inverse relation between dietary intake of naturally occurring plant sterols and serum cholesterol in northern Sweden. Am. J. Clin. Nutr. 87, 993–1001 (2008).

91 35. Kaplan, R. C. et al. A genome-wide association study identifies novel loci associated with circulating IGF-I and IGFBP-3. Hum. Mol. Genet. 20, 1241–1251 (2011). 1 36. Liu, Y. J. et al. Genome-wide association scans identified CTNNBL1 as a novel gene for obesity. Hum. Mol. Genet. 17, 1803–1813 (2008). 37. Strawbridge, R. J. et al. Genome-wide association identifies nine common variants associated with fasting proinsulin levels and provides new insights into the 2 pathophysiology of type 2 diabetes. Diabetes 60, 2624–2634 (2011). 38. Khan, M. T., Browne, W. R., van Dijl, J. M. & Harmsen, H. J. M. Employ Riboflavin for Extracellular Electron Transfer? Antioxid. Redox Signal. 17, 1433–1440 (2012). 39. Morgan, X. C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012). 3 40. Lozupone, C. A., Stombaugh, J. I., Gordon, J. I., Jansson, J. K. & Knight, R. Diversity, stability and resilience of the human gut microbiota. Nature 489, 220–230 (2012). 41. Bønnelykke, K. et al. Meta-analysis of genome-wide association studies identifies ten loci influencing allergic sensitization. Nat. Genet. 45, 902–906 (2013). 42. Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host 4 Microbe 15, 382–392 (2014). 43. Caesar, R., Tremaroli, V., Kovatcheva-Datchary, P., Cani, P. D. & B??ckhed, F. Crosstalk between gut microbiota and dietary lipids aggravates WAT inflammation through TLR signaling. Cell Metab. 22, 658–668 (2015). 44. Yahiro, K. et al. DAP1, a negative regulator of autophagy, controls subAB-mediated apoptosis and autophagy. Infect. Immun. 82, 4899–4908 (2014). 45. Koren, I., Reem, E. & Kimchi, A. DAP1, a novel substrate of mTOR, negatively regulates autophagy. Curr. Biol. 20, 1093–1098 (2010). 46. Glocker, E. et al. Inflammatory Bowel Disease and Mutations Affecting the Interleukin-10 Receptor. N. Engl. J. Med. 361, 2033–2045 (2009). 47. Galeone, M., Colucci, R., D’Erme, A. 
M., Moretti, S. & Lotti, T. Potential infectious etiology of Beh??et’s disease. Patholog. Res. Int. 2012, 595380 (2012). 48. Huuskonen, J., Olkkonen, V. M., Jauhiainen, M. & Ehnholm, C. The impact of phospholipid transfer protein (PLTP) on HDL metabolism. Atherosclerosis 155, 269–281 (2001). 49. Getz, G. S. & Reardon, C. A. Apoprotein E as a lipid transport and signaling protein in the blood, liver, and artery wall. J. Lipid Res 50, 156–161 (2009). 50. Hevener, A. L. et al. Muscle-specific Pparg deletion causes insulin resistance. Nat. Med. 9, 1491–1497 (2003). 51. Hegele, R. A., Cao, H., Frankowski, C., Mathews, S. T. & Leff, T. PPARG F388L, a transactivation-deficient mutant, in familial partial lipodystrophy. Diabetes 51, 3586– 3590 (2002). 52. Doney, A. S. F. et al. Association of the Pro12Ala and C1431T variants of PPARG and their haplotypes with susceptibility to Type 2 diabetes. Diabetologia 47, 555–558 (2004). 53. Hao, H.-X. et al. ZNRF3 promotes Wnt receptor turnover in an R-spondin-sensitive manner. Nature 485, 195–200 (2012). 54. Farin, H. F. et al. Visualization of a short-range Wnt gradient in the intestinal stem-cell niche. Nature 530, 340–343 (2016). 55. Zhou, Y. et al. ZNRF3 acts as a tumour suppressor by the Wnt signalling pathway in human gastric adenocarcinoma. J. Mol. Histol. 44, 1–9 (2013). 56. Casals, F. et al. Genetic adaptation of the antibacterial human innate immunity network. BMC Evol Biol 11, 202 (2011). 57. Bunge, J., Willis, A. & Walsh, F. Estimating the Number of Species in Microbial Diversity Studies. Annu. Rev. Stat. Its Appl. 1, 427–445 (2014). 58. Caballero, S. & Pamer, E. G. Microbiota-mediated inflammation and antimicrobial defense in the intestine. Annu. Rev. Immunol. 33, 227–56 (2015). 59. Singh, V. et al. Interplay between enterobactin, myeloperoxidase and lipocalin 2 regulates E. coli survival in the inflamed gut. Nat Commun 6, 7113 (2015).

92 60. Nguyen, H. T. T. et al. Crohn’s disease-associated adherent invasive escherichia coli modulate levels of microRNAs in intestinal epithelial cells to reduce autophagy. Gastroenterology 146, 508–519 (2014). 1 61. Sadaghian Sadabad, M. et al. The ATG16L1-T300A allele impairs clearance of pathosymbionts in the inflamed ileal mucosa of Crohn’s disease patients. Gut 64, gutjnl-2014-307289- (2014). 62. Dambuza, I. M. & Brown, G. D. C-type lectins in immunity: Recent developments. Curr. 2 Opin. Immunol. 32, 21–27 (2015). 63. Iliev, I. D. et al. Interactions between commensal fungi and the C-type lectin receptor Dectin-1 influence colitis. Science (80-. ). 336, 1314–1317 (2012). 64. Enattah, N. S. et al. Evidence of still-ongoing convergence evolution of the lactase persistence T-13910 alleles in humans. Am. J. Hum. Genet. 81, 615–25 (2007). 3 65. Tishkoff, S. a et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39, 31–40 (2007). 66. Troelsen, J. T. Adult-type hypolactasia and regulation of lactase expression. Biochim. Biophys. Acta - Gen. Subj. 1723, 19–32 (2005). 67. Streppel, M. T. et al. Relative validity of the food frequency questionnaire used to assess 4 dietary intake in the Leiden Longevity Study. Nutr. J. 12, 75 (2013). 68. Siebelink, E., Geelen, A. & de Vries, J. H. M. Self-reported energy intake by FFQ compared with actual energy intake to maintain body weight in 516 adults. Br. J. Nutr. 106, 274–281 (2011). 69. Collection, S. & Genome, T. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 1–95 (2014). 70. Shah, T. S. et al. OptiCall: A robust genotype-calling algorithm for rare, low-frequency and common variants. Bioinformatics 28, 1598–1603 (2012). 71. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015). 72. Deelen, P. et al. 
Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration. BMC Res. Notes 7, 901 (2014). 73. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013). 74. Howie, B., Marchini, J. & Stephens, M. Genotype Imputation with Thousands of Genomes. G3 1, 457–470 (2011). 75. Deelen, P. et al. Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur. J. Hum. Genet. 22, 1321–1326 (2014). 76. Jia, X. et al. Imputing Amino Acid Polymorphisms in Human Leukocyte Antigens. PLoS One 8, (2013). 77. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014). 78. Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015). 79. Dimmer, E. C. et al. The UniProt-GO Annotation database in 2011. Nucleic Acids Res. 40, (2012). 80. Blake, J. A. et al. Gene ontology consortium: Going forward. Nucleic Acids Res. 43, D1049–D1056 (2015). 81. Vatanen, T. et al. Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans. Cell 165, 842–853 (2015). 82. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243 (2013). 83. Bonder, M. J., Luijk, R., Zhernakova, D. V & Moed, M. Disease variants alter transcription factor levels and methylation of their binding sites. (2015). doi:10.1101/033084 84. Robinson, M. J., Sancho, D., Slack, E. C., LeibundGut-Landmann, S. & Reis e Sousa, C. Myeloid C-type lectins in innate immunity. Nat. Immunol. 7, 1258–1265 (2006).

93 85. Reikine, S., Nguyen, J. B. & Modis, Y. Pattern recognition and signaling mechanisms of RIG-I and MDA5. Front. Immunol. 5, 342 (2014). 1 86. Whitlock, M. C. Combining probability from independent tests: The weighted Z-method is superior to Fisher’s approach. J. Evol. Biol. 18, 1368–1373 (2005). 87. Ardlie, K. G. et al. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science (80-. ). 348, 648–660 (2015). 2 88. Zhernakova, D. V et al. Hypothesis-free identification of modulators of genetic risk factors. bioRxiv 1–25 (2015). doi:10.1101/033217 Acknowledgements 3 We thank the participants and the staff of LifeLines-DEEP, 500FG and MIBS for their collaboration. We thank Jackie Dekens, Mathieu Platteel, Janneke Pietersma and Astrid Maatman for management and technical support, and Kate Mc Intyre and Jackie Senior for editing the manuscript.

Description of supplementary data files

The following additional data are available with the online version of this paper.

Supplementary data:
Supplementary Figure 1. Genome-wide significant microbial QTL plots on microbial level
Supplementary Figure 2. Genome-wide significant microbial QTL plots on microbial function level (MetaCyc pathway)
Supplementary Figure 3. Genome-wide microbial QTL plots on microbial function level (GO-terms)
Supplementary Figure 4. Correlation of associated GO-terms with taxonomies on species level
Supplementary Figure 5. Relation between Bifidobacterium and milk consumption, and between the LCT SNP (rs4988235) and milk consumption
Supplementary Table 1. Abundance levels of microbes, MetaCyc pathways and GO2000 terms
Supplementary Table 2. Summary of the tested number of associations per analysis branch
Supplementary Table 3. Genome-wide microbial QTL results
Supplementary Table 4. Estimations for the FDRs presented in the paper
Supplementary Table 5. Correlations between microbial abundance and MetaCyc abundance levels
Supplementary Table 6. Correlations between microbial abundance and GO2000 abundance levels
Supplementary Table 7. SNP selection list of GWAS-associated SNPs
Supplementary Table 8. SNP selection list for innate immunity and food preference, including references
Supplementary Table 9. Microbial QTLs on abundance and functional level for the SNPs previously related to GWAS studies
Supplementary Table 10. Microbial QTLs on abundance and functional level for the variants in the HLA
Supplementary Table 11. Microbial QTLs on abundance and functional level for the SNPs related to innate immunity
Supplementary Table 12. Microbial QTLs on abundance and functional level for the SNPs related to food preference

Genetic and epigenetic regulation of gene expression in fetal and adult human livers

BMC Genomics, DOI: 10.1186/1471-2164-15-860

Marc Jan Bonder1*, Silva Kasela2,3*, Mart Kals2,4, Riin Tamm2,3, Kaie Lokk3, Isabel Barragan5, Wim Buurman6, Patrick Deelen1,7, Jan-Willem Greve8, Maxim Ivanov5, Sander S. Rensen7, Jana V. van Vliet-Ostaptchouk9,10, Marcel Wolfs11, Jingyuan Fu1, Marten H. Hofker11, Cisca Wijmenga1, Alexandra Zhernakova1, Magnus Ingelman-Sundberg5, Lude Franke1# and Lili Milani2#

Abstract

Background
The liver plays a central role in the maintenance of homeostasis and health in general. However, there is substantial inter-individual variation in hepatic gene expression, and although numerous genetic factors have been identified, less is known about the epigenetic factors.

Results
By analyzing the methylomes and transcriptomes of 14 fetal and 181 adult livers, we identified 657 differentially methylated genes with adult-specific expression; these genes were enriched for transcription factor binding sites of HNF1A and HNF4A. We also identified 1,000 genes specific to fetal liver, which were enriched for GATA1, STAT5A, STAT5B and YY1 binding sites. We saw strong liver-specific effects of single nucleotide polymorphisms on both methylation levels (28,447 unique CpG sites (meQTL)) and gene expression levels (526 unique genes (eQTL)), at a false discovery rate (FDR) < 0.05. Of the 526 unique eQTL-associated genes, 293 correlated significantly not only with genetic variation but also with methylation levels. The tissue-specificities of these associations were analyzed in muscle, subcutaneous adipose tissue and visceral adipose tissue. We observed that meQTL were more stable between tissues than eQTL, and that the identified associations between CpG methylation and gene expression were highly tissue-specific.

Conclusions
Our analyses generated a comprehensive resource of factors involved in the regulation of hepatic gene expression, and allowed us to estimate the proportion of variation in gene expression that could be attributed to genetic and epigenetic variation, both crucial to understanding differences in drug response and the etiology of liver diseases.

1. University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands
2. Estonian Genome Center, University of Tartu, Tartu, Estonia
3. Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
4. Institute of Mathematical Statistics, University of Tartu, Tartu, Estonia
5. Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
6. Department of Surgery, University Hospital Maastricht and Nutrition and Toxicology Research Institute (NUTRIM), Maastricht University, Maastricht, the Netherlands
7. University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, the Netherlands
8. Department of General Surgery, Atrium Medical Center Parkstad, Heerlen, the Netherlands
9. University of Groningen, University Medical Center Groningen, Department of Endocrinology, Groningen, the Netherlands
10. University of Groningen, University Medical Center Groningen, Department of Epidemiology, Unit of Genetic Epidemiology and Bioinformatics, Groningen, the Netherlands
11. University of Groningen, University Medical Center Groningen, Department of Pathology and Medical Biology, Molecular Genetics section, Groningen, the Netherlands

*,# equal contributions
Correspondence to: Dr Lili Milani, E-mail: [email protected]; Dr Lude Franke, E-mail: [email protected]

Background

The liver plays a central role in the maintenance of homeostasis and health in general. Given the substantial inter-individual variation seen in metabolism, regulation of nutrients, protein synthesis, and detoxification of xenobiotics, it is essential to have a better understanding of the inter-individual variation in gene expression, methylation and genetic effects specific to the liver, and of how these differ between conditions, e.g. developmental stages. These variations can affect the liver’s metabolic properties, leading to high levels of metabolites, in the form of lipids, proteins or xenobiotics, which can result in serious diseases or toxic side-effects. For example, several single nucleotide polymorphisms (SNPs) associated with liver function and related diseases have been identified through genome-wide association (GWA) studies1-6. We and others have studied how these SNPs affect liver gene expression levels by mapping expression quantitative trait loci (eQTL)7-11, and several genetic variants that regulate genes involved in the absorption, distribution, metabolism and excretion of drugs (ADME genes) have also been identified.

Apart from genetic variation, epigenetic mechanisms (DNA methylation and histone modifications) also play an important role in regulating tissue-specific gene expression12-14. In particular, such mechanisms can influence the expression of hepatic ADME genes. For example, the methylation status of a CpG island in exon 2 of CYP1A2 was shown to correlate with inter-individual differences in the expression of this gene in human livers15. Given that CYP1A2 is an important drug-metabolizing enzyme, the factors that influence its epigenetic state may also contribute to the individual drug response. Interestingly, the epigenetic state of ADME genes, at least in rodent livers, can change in response to xenobiotic exposure16,17, thus opening up the possibility of epigenetics-mediated drug-drug interactions. More examples of epigenetic regulation of ADME genes have been reviewed by Kacevska et al.18. However, the majority of such data come from studies of epigenetic alterations observed either in tumors, or in cell lines treated with DNA demethylating agents. So far it is not clear to what extent such cancer-related or experimentally induced epigenetic alterations correspond to the natural epigenetic variability in human livers. Hence, it is essential to include epigenetic variation when studying the regulation of hepatic gene expression, to further explain the causes of differences in drug response and the etiology of diseases associated with liver function.

Here we present a comprehensive survey of the methylome and transcriptome of the human liver (Figure 1A). First, we addressed the regulation of gene expression in the developing human liver by comparing genome-wide expression and methylation levels in 96 adult and 14 fetal livers from the Karolinska Liver Bank. Then we used genetic, epigenetic and gene expression data from the adults, along with an extra cohort of 85 Dutch adult liver samples, to investigate the regulation of gene expression in the human liver. Finally, we explored the tissue specificity of the identified associations between SNPs, methylation and expression in other tissues from the Dutch adult samples.

Results

Developmental regulation of hepatic gene expression

The epigenome of the developing human liver

We first compared the epigenomes of 8- to 21-week-old fetal livers with adult livers. We assessed the methylation levels of 366,074 variable CpG sites and found 28,917 CpG sites (annotated to 12,619 unique genes) that showed a significant difference (absolute mean beta value difference > 0.2, FDR < 0.05) between fetal and adult liver tissue (see Supplementary Online Methods in Additional file 1, and Additional file 2).
Although the number of hypomethylated CpG sites in fetal liver (53.4%) was similar to the number of hypermethylated sites (46.6%) in this cohort, we observed an age-specific association between the genomic location of CpG sites and whether they were hypo- or hypermethylated (chi-squared test p-value < 2.2 x 10^-16). In fetal livers, the majority (86%) of the differentially methylated CpG sites located within CpG islands (CGI) were hypomethylated, whereas this was not the case for CpG sites outside CGIs, where roughly half of the CpG sites were hypomethylated and half hypermethylated in fetal livers (Figure 1B). This is particularly interesting because in both adult and fetal livers, close to 80% of the CpG sites within CGIs are not methylated, with > 95% overlap between the two age groups. Accordingly, the CpG sites within CGIs that were hypomethylated in the fetal livers mostly had intermediate methylation levels in the adult liver samples (Additional file 3).

To explore the functions of the genes that were differentially methylated in fetal liver compared to adult liver, we used the GREAT pathway tool19. The CpG sites that were hypomethylated in the adult livers and hypermethylated in fetal liver were strongly enriched for metabolic pathways, such as the steroid metabolic process, the regulation of lipid metabolic processes, regulation of generation of precursor metabolites and energy, and regulation of glycolysis (all with p-values ≤ 1.15 x 10^-44) (Table 1A). However, the genes that were associated with hypomethylated CpG sites in the fetal samples were strongly enriched for pathways of insulin receptor signaling, regulation of glycogen synthase activity, differentiation processes, and developmental functions (Table 1B).
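The selection rule quoted above (absolute mean beta value difference > 0.2 and FDR < 0.05) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the per-site p-values are taken as given, the Benjamini-Hochberg procedure is assumed as the FDR method (the text does not name one), and the function names and toy values are hypothetical.

```python
from statistics import mean


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used here only as a common FDR procedure; the
    source text does not state which FDR method was applied.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_differential_sites(fetal_betas, adult_betas, pvalues,
                            min_delta=0.2, max_fdr=0.05):
    """Label each CpG site 'hyper' or 'hypo' (in fetal liver), else None.

    fetal_betas / adult_betas: one list of beta values per CpG site.
    pvalues: one raw p-value per CpG site from any two-group test.
    """
    fdr = benjamini_hochberg(pvalues)
    calls = []
    for fetal, adult, q in zip(fetal_betas, adult_betas, fdr):
        delta = mean(fetal) - mean(adult)
        if abs(delta) > min_delta and q < max_fdr:
            calls.append("hyper" if delta > 0 else "hypo")
        else:
            calls.append(None)
    return calls


if __name__ == "__main__":
    # Three hypothetical CpG sites with two samples per group.
    fetal = [[0.1, 0.2], [0.8, 0.9], [0.5, 0.5]]
    adult = [[0.6, 0.7], [0.3, 0.4], [0.45, 0.5]]
    print(call_differential_sites(fetal, adult, [0.001, 0.002, 0.5]))
```

Requiring both an effect-size threshold and an FDR threshold keeps sites whose difference is statistically detectable but biologically small out of the differential set.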

Figure 1. Study design and distribution of CpG sites. (A) Description of the biomaterials and analyses. (B) Distribution of the location of differentially methylated CpG sites between fetal and adult livers. The plot shows the percentage of differentially methylated CpG sites (y-axis) that are hypermethylated (black) or hypomethylated (grey) in CpG islands, shores, shelves and other regions of the genome. (C) Distribution of differentially expressed and methylated genes depending on the relation to CpG islands.

Table 1. Gene Ontology analysis of differentially methylated genes in fetal versus adult livers.
1 Fold enrichment: fold enrichment of number of genomic regions in the test set with the annotation.
2 Observed region hits: actual number of genomic regions in the test set with the annotation.

A. Top 10 biological processes associated with hypomethylated genes in adult livers

Term Name                                                      P-Value     Fold Enrich.1   Obs. Regions2
Steroid metabolic process                                      2.77E-52    2.03            558
Regulation of lipid metabolic process                          4.42E-51    2.06            528
Regulation of generation of precursor metabolites and energy   5.62E-48    3.24            216
Regulation of glycolysis                                       1.15E-44    5.21            116
Sterol metabolic process                                       4.28E-44    2.56            288
Positive regulation of lipid metabolic process                 3.52E-43    2.48            300
Regulation of cellular carbohydrate catabolic process          3.26E-42    4.22            136
Regulation of lipid transport                                  3.88E-42    3.54            168
Cholesterol metabolic process                                  3.71E-40    2.51            271
Regulation of cellular ketone metabolic process                4.25E-39    2.11            381

B. Top 10 biological processes associated with hypomethylated genes in fetal livers

Term Name                                                      P-Value     Fold Enrich.1   Obs. Regions2
Insulin receptor signalling pathway                            1.74E-130   77.00           88
Positive regulation of glycogen (starch) synthase activity     1.69E-105   37.19           90
Anterior/posterior pattern specification                       1.27E-96    2.26            813
Regulation of gene expression by genetic imprinting            2.36E-92    10.52           143
Regulation of glycogen (starch) synthase activity              3.97E-81    19.01           91
Genetic imprinting                                             3.54E-69    6.66            147
Response to estrogen stimulus                                  4.52E-69    2.14            657
Positive regulation of insulin receptor signalling pathway     9.60E-68    11.64           98
Positive regulation of cell cycle                              2.83E-61    2.51            421
Luteinizing hormone secretion                                  4.18E-61    11.24           90

The transcriptome of the developing liver

Comparison of gene expression levels between the fetal and adult liver samples yielded 3,284 differentially expressed probes (absolute log2-fold change > 1.0, FDR < 0.05, Additional file 4). Pathway analysis, using Gene Network20, confirmed that 1,396 genes with higher expression in the adult livers were strongly enriched for metabolic functions like monocarboxylic acid, steroid and bile acid metabolic processes, as well as the response to xenobiotic process (Table 2A). In contrast, 1,277 genes that were highly expressed in fetal tissue were associated with regulating organelle organization, chromosome organization, and tetrapyrrole (e.g. hemoglobin) biosynthetic processes (Table 2B). These observations are in line with fetal development, which is characterized by tissue differentiation and growth, and with the fact that the liver is predominantly a hematopoietic organ during this period21.

Orchestration of epigenetics and transcriptomics in regulating liver development

We found 1,655 genes that showed both differential expression and differential methylation in adult vs. fetal livers (Additional file 5). More specifically, 657 genes were linked to probes with higher expression levels in adults, and 1,000 genes were linked to probes that were more highly expressed in fetal livers (with an overlap of two genes). As expected, these genes are even more significantly enriched for developmental stage-specific functions, such as drug response for the adult cohort (p-value 4.0 x 10^-131) and liver development for the fetal cohort (p-value 6.0 x 10^-90). For the majority of the genes with more than one detection probe, the probes showed very similar expression differences between fetal and adult livers. However, in two genes (TGM2 and INS-IGF2), one of the probes was more highly expressed in fetal livers, while the other probe reflected higher expression in adult livers.

The location of the differentially methylated CpG sites differed significantly in relation to CGIs, depending on the expression and methylation differences between fetal and adult livers (chi-squared test p-value < 2.2 x 10^-16, Figure 1C): for genes with a lower expression in fetal livers, the hypomethylated CpG sites more often map within CpG islands, shores and shelves, while the hypermethylated CpG sites map further away from CGI regions.

Regions within 2 kb of the transcription start site (TSS) of the 1,655 genes are enriched for binding sequences of transcription factors essential for the development or function of the liver, specifically HNF4A (adjusted p-value = 2 x 10^-73) and HNF1A (adj. p-value = 6 x 10^-38); hematopoietic transcription factors GATA1 (adj. p-value = 8 x 10^-36), STAT5A (adj. p-value = 2 x 10^-43), and STAT5B (adj. p-value = 1 x 10^-49); and YY1 (adj. p-value = 2 x 10^-36), which plays a fundamental role in embryogenesis and differentiation. We therefore investigated the expression levels of these transcription factors and observed that transcripts for the HNF1A and HNF4A genes were more highly expressed in adult livers, and GATA1, STAT5A, STAT5B and YY1 were all more highly expressed in fetal livers (Figure 2, Additional file 6).

Table 2. Gene Ontology analysis of differentially expressed genes in fetal versus adult liver.
A. Top 10 biological processes associated with hyperexpressed genes in adult livers

Term                                        P-value     Nr of genes
Monocarboxylic acid metabolic process       2.80E-205   347
Lipid localization                          8.26E-202   201
Lipid transport                             1.99E-197   180
Steroid metabolic process                   2.91E-196   257
Bile acid metabolic process                 1.13E-192   40
Response to xenobiotic stimulus             9.88E-184   114
Cellular response to xenobiotic stimulus    9.88E-184   114
Xenobiotic metabolic process                6.72E-183   113
Bile acid biosynthetic process              6.23E-178   23
Response to glucocorticoid stimulus         2.61E-173   131


Figure 2. Expression levels of transcription factors in fetal and adult livers. Box plots of the log2 transformed expression levels (y-axis) are shown for the adult and fetal liver samples (x-axis). The transcripts for HNF1A and HNF4A were expressed at significantly higher levels in the adult livers, while YY1, GATA1, STAT5A and STAT5B were expressed at higher levels in the fetal livers.

Table 2. Gene Ontology analysis of differentially expressed genes in fetal versus adult liver. B. Top 10 biological processes associated with hyperexpressed genes in fetal liver

Term                                                      P-value     Nr of genes
Negative regulation of organelle organization             3.93E-162   138
Regulation of organelle organization                      1.73E-132   370
Negative regulation of cellular component organization    3.40E-127   265
Regulation of chromosome organization                     1.61E-114   72
Porphyrin-containing compound biosynthetic process        1.84E-112   34
Tetrapyrrole biosynthetic process                         1.84E-112   34
Negative regulation of chromosome organization            6.05E-108   29
Chromatin assembly or disassembly                         8.76E-106   128
Pigment biosynthetic process                              1.73E-103   53
G1 phase                                                  6.25E-102   36

Table 3 lists the 20 genes with the largest differences in expression and methylation, clearly illustrating the fetal-specific expression of genes involved in differentiation and hematopoiesis (e.g. DLK1, HBZ, HBM, AHSP, EPB42 and NFE2), and the adult-specific expression of genes involved in drug metabolism, catabolism and other biosynthesis processes. CYP2E1 and CYP2C8 are the cytochrome P450 (CYP) genes that show the largest differences in expression levels between fetal and adult liver, with a log2 fold change of approximately 7 in favour of adult liver (i.e. more than 100-fold higher expression).

Table 3. Top 20 genes with largest difference in expression and differential methylation between fetal and adult livers.

Expression columns give the median expression; beta columns give the mean beta value; adjusted p-values are FDR.

Gene      Expr. Adult   Expr. Fetal   logFC   Adj. p-value   Beta Adult   Beta Fetal   Beta difference   Adj. p-value
DLK1      3.27          12.64         9.15    3.55E-46       0.42         0.63         0.22              3.19E-36
HBZ       3.23          12.35         9.07    1.82E-45       0.8          0.55         -0.25             4.02E-18
HBM       3.01          12.52         9.03    4.27E-42       0.42         0.21         -0.21             2.55E-11
AHSP      4.37          13.26         8.46    2.66E-42       0.74         0.41         -0.33             1.43E-34
EPB42     3.13          11.25         8.19    2.26E-48       0.89         0.47         -0.42             3.84E-64
CYP2E1    13.42         4.1           -7.64   8.63E-36       0.51         0.88         0.36              1.10E-41
HBE1      2.87          10.66         7.63    4.10E-48       0.76         0.49         -0.27             1.41E-34
CRP       13.33         4.33          -7.27   7.71E-34       0.53         0.89         0.36              2.99E-42
C9        11.91         3.59          -7.18   3.98E-39       0.48         0.84         0.36              8.73E-39
APCS      12.69         4.39          -7      5.95E-40       0.45         0.88         0.43              7.26E-48
SLC4A1    4.04          11.3          6.96    1.46E-61       0.75         0.4          -0.35             2.14E-42
NNMT      11.06         3.44          -6.88   4.10E-40       0.36         0.84         0.48              7.10E-45
CYP2C8    12.9          4.83          -6.85   3.42E-31       0.59         0.88         0.29              2.57E-36
AQP9      11.51         3.47          -6.81   1.26E-33       0.39         0.84         0.45              8.23E-45
NFE2      4.1           11.03         6.8     2.29E-47       0.82         0.45         -0.36             3.02E-41
ADH1C     11.93         3.92          -6.69   1.96E-24       0.42         0.81         0.39              8.06E-38
MYL4      3.88          10.77         6.65    6.78E-60       0.88         0.4          -0.48             1.90E-62
C3P1      11.36         3.81          -6.63   2.34E-37       0.73         0.24         -0.49             1.56E-47
RHAG      3.35          10.22         6.56    1.50E-46       0.81         0.49         -0.31             1.64E-47
HSD17B6   12.12         4.61          -6.31   4.88E-29       0.89         0.66         -0.22             1.48E-41

Genetic and epigenetic effects on inter-individual variability in gene expression

Correlation in DNA methylation and gene expression

We next assessed whether DNA methylation is correlated with gene expression levels in the adult samples. We combined data from the Karolinska Liver Bank and Dutch liver samples (total number of samples with expression and methylation data = 158) and compared expression probes with CpG sites that map within 250 kb of these probes. We did not include the fetal samples due to the large developmental differences reported above, and we estimated that the fetal samples would not add any considerable statistical power for the analyses.
We identified 3,238 significant methylation-expression associations (eQTMs, Additional file 7), comprising 1,988 unique expression probes (in 1,798 genes) and 2,980 CpG sites (reflecting 2,057 unique genes), with a permutation p-value < 0.05. As expected, there are more eQTMs with a negative correlation between expression levels and CpG methylation levels (58.4%), irrespective of the CpG site location in relation to CpG islands. Furthermore, for CpG sites with strong correlation between expression and methylation levels, and/or within 50 kb of the expression probes, we observed an overrepresentation of negative correlations (chi-squared test p-value < 2.2 x 10^-16, Figure 3).
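The eQTM test described above pairs each expression probe with CpG sites within 250 kb and evaluates their correlation. The sketch below is a simplified illustration under stated assumptions: it uses a Spearman correlation (the measure named in the Figure 3 caption) with a plain label-shuffling permutation p-value for a single probe-CpG pair, whereas the study's actual permutation scheme is described in its Supplementary Online Methods and may differ. All function names and toy values are hypothetical.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_pvalue(expression, methylation, n_perm=1000, seed=0):
    """Two-sided permutation p-value for one probe-CpG pair (assumed scheme)."""
    rng = random.Random(seed)
    observed = abs(spearman(expression, methylation))
    shuffled = list(methylation)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(expression, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def in_cis_window(probe_pos, cpg_pos, window=250_000):
    """True if the CpG site lies within `window` bp of the probe."""
    return abs(probe_pos - cpg_pos) <= window


if __name__ == "__main__":
    expression = [1, 2, 3, 4, 5, 6, 7, 8]
    methylation = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    print(spearman(expression, methylation))
    print(permutation_pvalue(expression, methylation))
```

A rank-based correlation is used because beta values are bounded between 0 and 1 and often skewed, so a linear correlation on raw values can be dominated by a few samples.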


Figure 3. Distribution of the direction of the expression and methylation correlation coefficient. (A) Proportion of eQTM effects (y-axis) grouped by the absolute Spearman correlation coefficient. (B) Proportion of eQTM effects (y-axis) grouped by the distance between expression probe and CpG site in kilobase pairs (kb). In both panels, grey and black represent negative and positive correlations between expression probe and CpG site, respectively.

Regulation of gene expression by genetic polymorphisms

We next explored the effects of genetic variation on liver gene expression levels. Expression quantitative trait locus (eQTL) mapping in the adult livers (meta-analysis of the two cohorts, combined number of samples with expression and genotype data = 171) yielded a total of 47,168 significant SNP-probe pair correlations (FDR < 0.05), representing 751 unique genes (Additional file 8). The eQTL probes are significantly enriched for liver-specific genes (area under the curve (AUC) 0.67, p-value 4 x 10^-57, as reported by Gene Network) and are strongly enriched for genes encoding drug-metabolizing enzymes (p-value 2.0 x 10^-19). We compared our results with reported liver cis-eQTLs7-10,22 and observed that we could replicate 667 reported eQTL genes; however, we also identified 84 new eQTL genes (Additional file 8).

Influence of genetic variation on DNA methylation

We investigated the effects of SNPs on CpG methylation (meQTL) in adult liver samples (meta-analysis, combined number of samples with methylation and genotype data = 161). In total we found significant cis-meQTL for 28,447 unique methylation probes (FDR < 0.05, mapping to 12,054 unique genes), reflecting 1,477,126 different SNP-CpG site combinations. In contrast to the eQTL, we did not observe any enrichment of liver functions for these 12,054 meQTL-associated genes.
Looking further into the SNPs affecting DNA methylation and gene expression, we identified 215 unique genes and 10,432 unique SNPs associated with both an eQTL and meQTL, resulting in a total of 30,644 overlapping QTL effects. Interestingly, for most of the 215 genes (69.3%) influenced by both an eQTL and meQTL we observed an opposite effect direction, i.e. the same genotype was associated with higher methylation levels and lower expression levels, or vice versa (Additional file 9). This effect is strongest in the CpG islands and CpG island shores, where it occurs in more than 75% of the cases (Additional file 10).
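The comparison of effect directions described above can be sketched as follows. This is an illustrative example only: it assumes that each gene is summarised by one signed eQTL effect and one signed meQTL effect reported for the same SNP allele, and the gene labels and z-scores are hypothetical.

```python
def opposite_direction_fraction(pairs):
    """Fraction of (eQTL effect, meQTL effect) pairs with opposite signs.

    Both effects in a pair must refer to the same assessed allele;
    pairs with a zero effect carry no direction and are skipped.
    """
    informative = [(e, m) for e, m in pairs if e != 0 and m != 0]
    opposite = sum(1 for e, m in informative if (e > 0) != (m > 0))
    return opposite / len(informative)


# Hypothetical z-scores: (effect on expression, effect on methylation).
effects = {
    "geneA": (4.2, -5.1),
    "geneB": (-3.8, 6.0),
    "geneC": (5.5, 4.9),
    "geneD": (-4.4, 3.9),
}

fraction = opposite_direction_fraction(effects.values())
```

If the two QTL analyses report effects for different alleles of a SNP, one of the signs has to be flipped before this comparison is meaningful.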

Contribution of genetic variants and DNA methylation to variation in hepatic gene expression

Once we had identified eQTL and eQTMs, we ascertained to what extent SNPs and DNA methylation could jointly explain the variation in liver gene expression levels. We selected 293 expression probes (reflecting 274 unique genes) that had both a significant cis-eQTL and significant eQTM effect. We then tested four different linear models (see Supplementary Online Methods in the Additional file 1) to assess the proportion of variation in gene expression that could be explained. For 83% of these 293 expression probes, most of the expression variation was explained by a SNP (Additional file 11), whereas for the remaining 17% the expression variation was most strongly explained by a specific CpG site. For the latter cases, we observed that these expression-associated CpG sites were likely to have a meQTL effect (chi-squared p-value = 0.035). As expected, when we combined the SNP genotype and CpG site methylation levels, we could explain more of the expression variation than by using either SNP or methylation levels alone. Given the correlations between genotypes and methylation levels, we also estimated the unique contributions of the two to gene expression levels (Additional file 12). Overall, SNP genotypes uniquely explain a greater proportion of the variation in gene expression (median 0.1, standard deviation 0.122) than methylation levels (median 0.029, standard deviation 0.049). The SNPs and CpG sites with particularly high correlations with the expression levels were generally closer to the transcription start site of the corresponding genes (Additional file 13). The contributions of SNPs and DNA methylation levels to the proportion of variation explained in gene expression levels are illustrated in Additional file 14 and Table 4 by 16 unique ADME genes that had both significant eQTL and eQTMs. The ADME gene list was extracted from www.pharmaADME.org.
For the GSTT1, GSTM1, UGT1A1, GSTO2 and PON1 genes, DNA methylation explains a larger proportion of the variation in gene expression levels compared to SNP genotypes. Overall, we found that adding more CpGs to the model, which were all associated with the selected expression probes of the same gene, did not significantly increase the power to explain more of the variation in gene expression (p-value < 2.2 x 10^-16). In addition to the ADME genes, we also investigated the role of SNPs and CpG site methylation in the regulation of genes associated with diseases and liver function by querying all SNPs from the GWAS catalog (http://www.genome.gov/gwastudies/) in our list of identified eQTLs. We identified cis-acting SNPs and DNA methylation differences that were associated with the expression of 47 genes previously identified in different GWA studies with complex traits, including enzyme and metabolite levels as well as cardiovascular and inflammatory bowel diseases (Additional file 15).

Tissue-specificity of eQTL, meQTL and eQTMs

Since we had also generated methylation and expression data for three other tissues (muscle, subcutaneous (SAT) and visceral adipose tissue (VAT)) from the same individuals in the Dutch cohort, we could assess the tissue-specificity of the detected eQTL, meQTL and eQTM effects. We had previously compared liver eQTL with other tissues for only a limited number of samples11, so we repeated this analysis with the new adult liver samples from the Karolinska Liver Bank. For liver eQTL, approximately 40-50% of the effects found in one tissue could also be significantly detected in another tissue (Figure 4A). We identified only a few opposite allelic effects (< 1%) between the tissues (Additional file 16A and 16B), suggesting that if a SNP affects expression in multiple tissues, the allelic direction is mostly identical.
The eQTL effects (n = 32,863) that were only present in liver and not in the other three tissues were related to genes strongly specific to liver function (p-value 5 x 10^-53) and metabolic and catabolic processes (p-values < 5 x 10^-20).

Table 4. Proportion of explained variation by SNPs and CpG sites associated with the expression of ADME genes. * F-test p-value < 0.05; ** F-test p-value < 0.005. F-test null hypothesis: the model for gene expression with the SNP and CpG site as explanatory variables and the model for gene expression with the SNP and all CpG sites1 fit equally well, with the differences being due to random chance.
1 SNP and all CpG sites: the CpG sites that have eQTM effects with the expression probe.

                                          % of variation in expression explained by
Gene/Locus  Chr  SNP          CpG site    SNP   CpG   SNP & CpG  SNP and all CpG sites1
GSTT1       22   rs9612520    cg05380919  50%   75%   78%        84%**
CYP3A5      7    CS015290     cg03133378  55%   7%    57%        57%
GSTM1       1    rs75953876   cg18938907  11%   55%   56%        61%
GPX7        1    rs11810754   cg11953272  48%   16%   49%        52%
UGT1A1      2    rs7592624    cg11811840  22%   41%   45%        47%
SLC22A18    11   rs413781     cg24724917  30%   15%   44%        49%*
FMO4        1    rs2223477    cg14981176  39%   16%   39%        39%
GSTM3       1    rs115636764  cg23645476  21%   20%   35%        46%**
SLC19A1     21   rs7867       cg27210852  22%   10%   30%        30%
GSTO2       10   rs11595547   cg23659134  20%   24%   28%        28%
PON1        7    rs854533     cg07404485  13%   23%   27%        30%
DHRS2       14   rs57350570   cg07125017  23%   4%    26%        26%
GSTA4       6    rs538920     cg22486834  14%   14%   20%        21%
CEBPA       19   rs80241821   cg19035908  17%   6%    20%        20%
MGST3       1    rs10737515   cg16553119  12%   12%   13%        13%
DHRS7       14   rs376391     cg18906360  12%   9%    13%        13%
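The first three percentage columns of Table 4 can be read as the R^2 of nested linear models. The sketch below illustrates that idea under explicit assumptions: it assumes the models are ordinary least-squares fits of expression on the SNP, on the CpG site, and on both (the exact model definitions are in Additional file 1), it uses simulated data, and it takes the "unique" contribution of a predictor to be the drop in R^2 when that predictor is removed from the joint model. None of the numbers correspond to the study's results.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r2_single(y, x):
    """R^2 of the least-squares model y ~ x (one predictor plus intercept)."""
    return pearson(x, y) ** 2


def r2_pair(y, x1, x2):
    """R^2 of y ~ x1 + x2, computed from the three pairwise correlations."""
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)


# Simulated (hypothetical) data: the SNP affects both methylation and expression.
rng = random.Random(1)
n_samples = 150
genotype = [rng.choice([0, 1, 2]) for _ in range(n_samples)]
methylation = [min(1.0, max(0.0, 0.5 - 0.1 * g + rng.gauss(0, 0.1))) for g in genotype]
expression = [0.6 * g - 3.0 * m + rng.gauss(0, 0.5) for g, m in zip(genotype, methylation)]

r2_snp = r2_single(expression, genotype)
r2_cpg = r2_single(expression, methylation)
r2_both = r2_pair(expression, genotype, methylation)

# Unique contribution: what is lost when the predictor is dropped from the joint model.
unique_snp = r2_both - r2_cpg
unique_cpg = r2_both - r2_snp
```

Because genotype and methylation are correlated, the two single-predictor R^2 values overlap, so their sum generally exceeds the joint R^2; the unique contributions remove that shared part.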

Contrary to the strong tissue-specificity of eQTL, meQTL were much more stable across the different tissues. On average, 70% of the meQTL are shared between at least two tissues, with over 98% of their effects having the same allelic direction (Figure 4B, Additional file 16C & 16D). As we had observed for eQTL, there were also a few significant meQTL that showed an opposite allelic direction between liver and the other three tissues (Additional file 17). The CpG sites of the meQTL with opposite effects were more often located outside the gene bodies (p-value 1.53 x 10^-11), but when they were in gene bodies, they were in exons rather than introns (p-value 1.7 x 10^-90). As expected, we observed very strong tissue-specificity for the identified eQTMs. Only up to 4% of the eQTMs found in one tissue were also detectable with an identical effect direction in another tissue (Figure 4C, Additional file 16E).


Figure 4. Venn diagram of the overlap of QTLs in four tested tissues. The number of overlapping (A) eQTL, (B) meQTL and (C) eQTMs is shown for adult human liver, VAT, SAT and muscle samples.

Genes with adult-specific functions are enriched for eQTL

We hypothesized that the expression of genes with important functions in the adult liver should be under strict genetic and epigenetic control. We thus focused on the set of probes with significantly higher and lower expression levels in adult liver compared to fetal liver (from the previous section “The transcriptome of the developing liver”), and formed a matched set of probes that were not differentially expressed between the two groups but displayed similar median expression levels and standard deviations in the adult liver samples. We observed that the expression of these adult liver-specific probes is much more likely to be affected by SNPs than that of the matched set of probes (1.43 times more than expected, chi-squared test p-value = 8.8 x 10^-7). Furthermore, we observed that these probes were 1.24-fold enriched for liver-specific eQTL probes compared to a matched set of probes with eQTL in multiple tissues (chi-squared test p-value = 8.847 x 10^-6). Conversely, probes with lower expression levels in adult liver compared to fetal liver did not differ from the matched set of probes in terms of having eQTL and liver-specific eQTL effects. Furthermore, we did not observe any enrichment of meQTL in the adult liver-specific methylation probes.
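An enrichment comparison of this kind can be sketched as a chi-squared test on a 2 x 2 table of probe counts. The example below is illustrative only: the counts are hypothetical (chosen so that the fold enrichment is 1.43), the function name is not from the study, and the test is computed without a continuity correction, which may differ from the settings used in the original analysis.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: (probes with an eQTL, probes without an eQTL).
adult_specific = (143, 857)
matched_set = (100, 900)

chi2, p_value = chi2_2x2(*adult_specific, *matched_set)

# Fold enrichment: proportion with an eQTL in each group, then their ratio.
fold = (adult_specific[0] / sum(adult_specific)) / (matched_set[0] / sum(matched_set))
```

Matching the comparison set on median expression and standard deviation, as described above, matters because both properties influence the power to detect an eQTL independently of any biological enrichment.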

Discussion

Previous studies on the regulation of gene expression in human liver have only accounted for the effect of genetic variation in adult samples7-11. In this study, we investigated the developmental regulation of gene expression in human livers by comparing the expression and methylation levels of genes in adult and fetal livers. In addition, we used both genetic variants and DNA methylation differences in order to explain the variability in transcript levels observed in adult livers. Comparison of the fetal and adult liver methylomes and transcriptomes revealed that hypomethylated CpG sites and up-regulated genes were closely related to the tissue-specific functions: fetal livers were enriched for developmental and hematopoietic functions, while catabolic and metabolic processes were more prominent in adult livers. This has been described in the transcriptome of fetal livers at different stages of development in mice23-25.

As the differences in methylation between fetal and adult livers were very large, when attempting to characterize the effects of variable methylation on gene expression levels in adults, we performed the eQTM analysis using a panel of only adult liver samples. Similarly to Gutierrez-Arcelus et al.26, we observed both positive and negative correlations between DNA methylation and gene expression across the samples, with similar distributions across different genomic regions. Bell et al. have also observed a modest but significant excess of negative correlations between DNA methylation and variation in gene expression levels across individuals27. It has been reported that the role of DNA methylation appears to depend on the genomic context28: for example, CpG sites located near the genes and/or with a stronger correlation between the methylation and expression were more likely to display a negative correlation.
Interestingly, CpG sites downstream of the expression probes displayed fewer negative correlations than those upstream of the probes, indicating that methylation in gene bodies is associated with active gene expression, as known from the early days of DNA methylation research29,30. This paradox, in which methylation in the promoter is negatively correlated with expression whereas methylation in the gene body is positively correlated with expression30, can be explained by the fact that, in mammals, DNA methylation silences the initiation of transcription, but not transcription elongation28.

Our eQTL mapping in adult livers revealed 751 unique genes, which were strongly liver-specific and enriched for drug metabolizing functions. Of these, 84 genes were new associations, while others have already been reported7-10,22. The new associations are probably due to the larger number of samples and imputation of SNPs not present on previously used genotyping arrays, using data from the 1000 Genomes project. While we observed liver-specific associations with eQTL, the meQTL were not enriched for liver-specific functions. Furthermore, when we analyzed the SNPs that had significant effects on both methylation and expression, in most of the genes the same SNP allele had an opposite effect on gene expression compared to the methylation level, and this effect was most evident in CpG islands and shores (Additional file 10). These results show that, although there are many associations between SNPs and methylation levels, the relationships between them are not clear and do not reflect tissue-specific functions.

Inter-individual variability in ADME gene expression has been shown to affect drug efficacy, toxicity, and susceptibility to environmental toxins31.
When we focused on the expression of ADME genes, we observed very strong cis-acting SNP and DNA methylation effects for 16 genes (Table 4), including members of the glutathione S-transferases (GSTs) family of phase II ADME isozymes: GSTA4, GSTM1, GSTM3, GSTO2 and GSTT1; solute carrier transporters SLC19A1 and SLC22A18, responsible for the transmembrane transfer of multiple drugs and endogenous compounds; and FMO4, GPX7, PON1 and UGT1A1. GSTT1 is involved in the conjugation of a variety of compounds32-35, while GSTM1 functions in the detoxification of exogenous/endogenous toxins. The effects of epigenetic modifications on the expression of these genes have been reported in blood and brain tissues36,37. In our study, we observed that both SNPs and DNA methylation contribute to the variability of the expression of these genes.

For example, the SNP rs2739330, downstream of the GSTT1 gene and upstream of the DDT gene, has been reported to be associated with gamma-glutamyl transferase levels in plasma38. This SNP, together with methylation levels of a nearby CpG site cg05380919, explains 78% of the variability in the expression of GSTT1, possibly with a stronger contribution from the methylation levels of the CpG site. Similarly, for GSTM1 the strongest SNP only explains 11% of the variation in its expression, while methylation levels of the CpG site cg18938907 have a much stronger association with the expression of the gene, and may be responsible for up to 55% of the variation (Additional file 11). The CpG site falls within a CpG island that spans the promoter and a portion of the gene’s first intron. Interestingly, they are located near the transcription factor binding site of TBP, which has been shown to bind to the promoter of GSTM1 in HepG2 cells, according to ENCODE ChIP-Seq data.

A substantial portion of the overall phenotypic variance in hepatic enzyme PON1 activity between individuals remains unexplained. Besides a variety of non-genetic factors, numerous transcription factors39 and miRNA regulation40, various functional PON1 polymorphisms have been shown to influence serum PON1 levels and activity39,41. The SNP rs705379 has been shown to be associated with approximately 50% mean reductions in serum PON1 protein levels as well as transcript levels41,42. In our study, it was interesting to see that this SNP was associated with increased methylation of nine CpG sites in its vicinity and with a lower expression of PON1.

Glutathione peroxidases (GPX) constitute a major antioxidative damage enzyme family43 and are thus important in cancer therapy44. Not only genetic but also epigenetic mechanisms of gene regulation have been proposed for GPX7, while recently a CpG island was identified as a key player in regulating GPX7 expression45.
In total, we identified 87 SNPs in the GPX7 gene affecting the methylation of nine CpG sites (meQTL), with six of the sites being directly implicated in quantitative differences in the gene expression. In addition, we replicated the two eQTL reported in human liver samples for GPX710, and for the first time identified their association with differences in methylation of CpG sites, which are further correlated with changes in GPX7 expression levels. One of the expression-associated SNPs discovered in this study, rs11810754, appears to explain most of the variation in the expression levels of the gene (48%), while the CpG site with the strongest correlation with expression (cg11953272) added hardly any extra information to the variability of the expression of the gene (only 1%, Table 4).

Three other tissues (muscle, SAT, VAT) were used to assess the tissue-specificity of both eQTL and meQTL effects. Our eQTL results showed that over half of the associations of SNPs with expression in one tissue could not be detected in another tissue (with identical eQTL allelic effect directions). This is similar to other studies7,9,11. In contrast, SNP-methylation correlations were much less tissue-specific than SNP-expression correlations: approximately 70% of the meQTL were also identified in any of the other tissues. To the best of our knowledge, this has not been described before, but could be due to sequence-dependent DNA methylation or the fact that genetic variation in a similarly methylated region can affect the entire region (given that we have excluded the direct effects of SNPs on methylation probes). On the other hand, we observed that DNA methylation associated with expression levels (i.e. eQTMs) is highly tissue-specific, in accordance with the fact that DNA methylation plays an important role in regulating tissue-specific gene expression.
Thus, conclusions drawn from eQTL or eQTM data in one tissue cannot be extrapolated to other tissues, whereas the effect of SNPs on methylation is more likely to be detectable in an alternative tissue, for example DNA in blood, which is more readily accessible.

The greatest limitation of our study was the use of microarrays instead of massively parallel sequencing. Despite stringent filtering and remapping of expression and methylation probe sequences, we cannot rule out all technical artefacts inherent to microarray studies. Another drawback of microarrays is their lower coverage of the genome, with the expression arrays only covering a few exons per gene, and the methylation array containing approximately 1% of the CpG sites in the human genome. Future studies using RNA and genome sequencing should be able to generate a more complete picture of the factors involved in the regulation of gene expression in human liver or other tissues.

Conclusions

By performing a genome-wide survey of genomic and epigenomic variation and their associations with gene expression in fetal and adult human liver, we have generated a comprehensive resource for the analysis of factors involved in the regulation of hepatic gene expression. The investigation of fetal livers allowed us to explore the developmental changes in the hepatic methylome and transcriptome. Although the role of DNA methylation in different regions of the genome is still unclear, our results elucidate the coordinated effects of SNPs and methylation, as well as the tissue specificity of their effects on gene expression. This strengthens the hypothesis that knowledge of inter-individual variability, driven by genetic polymorphisms and DNA methylation marks and their interaction, is crucial for understanding the causes of differences in drug response and the etiology of diseases associated with liver function.

Materials and Methods

The materials and methods of this study are described in detail in the Supplementary Online Methods in the Additional file 1. Briefly, our study was performed on two different cohorts, 14 fetal and 96 adult liver samples from the Karolinska Liver Bank cohort46,47, and 85 adult samples from the Dutch tissue cohort MORE (BBMRI obesity cohort)11,48. For both datasets, not all samples have complete expression, methylation and genotype data. We therefore report the number of samples per specific analysis. DNA from the samples was genotyped using HumanOmni BeadChips (Illumina), according to the manufacturer’s instructions.
We imputed both datasets using the GIANT release from the 1000 Genomes project, resulting in 5,763,069 unique SNPs, which were used in all our downstream analyses. Gene expression data was generated using HumanHT-12 BeadChips (Illumina), according to the standard protocol. Bisulfite-converted DNA samples were hybridized to Infinium HumanMethylation450 BeadChips (Illumina), following the Illumina Infinium HD Methylation protocol.

Availability of supporting data

The data sets supporting the results of this article are available in the GEO repository, Dutch BBMRI MORE expression data: GSE22070; methylation data: GSE61454; Karolinska expression and methylation data: GSE61279.

Competing interests

The authors declare that they have no competing interests.

Author contributions

MJ.B., S.K., L.M. and L.F. designed the study; M.I., M.IS., M.W., W.B., JW.G., S.R., J.VO. and MH.H. collected and extracted the samples; MJ.B., S.K. and M.K. designed and performed the computational analyses; P.D. imputed the Dutch genotype data; MJ.B., S.K., L.M. and L.F. interpreted the data and wrote the manuscript; C.W., J.F., A.Z., R.T. and I.B. participated in the interpretation of the data and provided input to the manuscript; K.L. participated in the data analysis. All authors provided input to the manuscript and read and approved the final manuscript.

Financial Support

This work is supported by grants from the Estonian Science Foundation (ETF9293), the European Union through the European Social Fund (MJD71), the European Regional Development Fund in the framework of the Centre of Excellence in Genomics (EXCEGEN), the University of Tartu (SP1GVARENG), the Estonian Research Council (IUT20-60), the Swedish Research Council, the Seurat 1 NOTOX project, the Marie Curie CIG 322283, IOP Genomics grant IGE05012A, the Netherlands Organization for Scientific Research (NWO), an NWO VENI grant 916.10.135, an NWO VENI grant 863.09.007, a Horizon Breakthrough grant from the Netherlands Genomics Initiative (grant 92519031), Systems Biology Centre for Metabolism and Ageing (SBC-EMA) and BBMRI (RP2/RP3/Complementatieproject BBMRI-NL-CP2013-71). The research leading to these results has received funding from the European Community's Health Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 259867.

References

1. Sabatti C, Service SK, Hartikainen AL, Pouta A, Ripatti S, Brodsky J, Jones CG, Zaitlen NA, Varilo T, Kaakinen M, et al: Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet 2009, 41:35-46.
2. Qi L, Cornelis MC, Kraft P, Stanya KJ, Linda Kao WH, Pankow JS, Dupuis J, Florez JC, Fox CS, Paré G, et al: Genetic variants at 2q24 are associated with susceptibility to type 2 diabetes. Hum Mol Genet 2010, 19:2706-2715.
3. Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, Zeggini E, Huth C, Aulchenko YS, Thorleifsson G, et al: Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet 2010, 42:579-589.
4. Suhre K, Shin SY, Petersen AK, Mohney RP, Meredith D, Wägele B, Altmaier E, Deloukas P, Erdmann J, Grundberg E, et al: Human metabolic individuality in biomedical and pharmaceutical research. Nature 2011, 477:54-60.
5. Adams LA, White SW, Marsh JA, Lye SJ, Connor KL, Maganga R, Ayonrinde OT, Olynyk JK, Mori TA, Beilin LJ, et al: Association between liver-specific gene polymorphisms and their expression levels with nonalcoholic fatty liver disease. Hepatology 2013, 57:590-600.
6. Ellinghaus D, Folseraas T, Holm K, Ellinghaus E, Melum E, Balschun T, Laerdahl JK, Shiryaev A, Gotthardt DN, Weismüller TJ, et al: Genome-wide association analysis in primary sclerosing cholangitis and ulcerative colitis identifies risk loci at GPR35 and TCF4. Hepatology 2013, 58:1074-1083.
7. Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, Kasarskis A, Zhang B, Wang S, Suver C, et al: Mapping the genetic architecture of gene expression in human liver. PLoS Biol 2008, 6:e107.
8. Greenawalt DM, Dobrin R, Chudin E, Hatoum IJ, Suver C, Beaulaurier J, Zhang B, Castro V, Zhu J, Sieberts SK, et al: A survey of the genetics of stomach, liver, and adipose gene expression from a morbidly obese cohort. Genome Res 2011, 21:1008-1016.
9. Innocenti F, Cooper GM, Stanaway IB, Gamazon ER, Smith JD, Mirkov S, Ramirez J, Liu W, Lin YS, Moloney C, et al: Identification, replication, and functional fine-mapping of expression quantitative trait loci in primary human liver tissue. PLoS Genet 2011, 7:e1002078.
10. Schröder A, Klein K, Winter S, Schwab M, Bonin M, Zell A, Zanger UM: Genomics of ADME gene expression: mapping expression quantitative trait loci relevant for absorption, distribution, metabolism and excretion of drugs in human liver. Pharmacogenomics J 2013, 13:12-20.
11. Fu J, Wolfs MG, Deelen P, Westra HJ, Fehrmann RS, Te Meerman GJ, Buurman WA, Rensen SS, Groen HJ, Weersma RK, et al: Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genet 2012, 8:e1002431.
12. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009, 462:315-322.
13. Ghosh S, Yates AJ, Frühwald MC, Miecznikowski JC, Plass C, Smiraglia D: Tissue specific DNA methylation of CpG islands in normal human adult somatic tissues distinguishes neural from non-neural tissues. Epigenetics 2010, 5:527-538.
14. Varley KE, Gertz J, Bowling KM, Parker SL, Reddy TE, Pauli-Behn F, Cross MK, Williams BA, Stamatoyannopoulos JA, Crawford GE, et al: Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res 2013, 23:555-567.
15. Ghotbi R, Gomez A, Milani L, Tybring G, Syvänen AC, Bertilsson L, Ingelman-Sundberg M, Aklillu E: Allele-specific expression and gene methylation in the control of CYP1A2 mRNA level in human livers. Pharmacogenomics J 2009, 9:208-217.
16. Thomson JP, Hunter JM, Lempiäinen H, Müller A, Terranova R, Moggs JG, Meehan RR: Dynamic changes in 5-hydroxymethylation signatures underpin early and late events in drug exposed liver. Nucleic Acids Res 2013, 41:5639-5654.
17. Chen WD, Fu X, Dong B, Wang YD, Shiah S, Moore DD, Huang W: Neonatal activation of the nuclear receptor CAR results in epigenetic memory and permanent change of drug metabolism in mouse liver. Hepatology 2012, 56:1499-1509.
18. Kacevska M, Ivanov M, Ingelman-Sundberg M: Epigenetic-dependent regulation of drug transport and metabolism: an update. Pharmacogenomics 2012, 13:1373-1385.
19. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G: GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010, 28:495-501.
20. Cvejic A, Haer-Wigman L, Stephens JC, Kostadima M, Smethurst PA, Frontini M, van den Akker E, Bertone P, Bielczyk-Maczyńska E, Farrow S, et al: SMIM1 underlies the Vel blood group and influences red blood cell traits. Nat Genet 2013, 45:542-545.
21. Moscovitz JE, Aleksunes LM: Establishment of metabolism and transport pathways in the rodent and human fetal liver. Int J Mol Sci 2013, 14:23801-23827.
22. Yang X, Zhang B, Molony C, Chudin E, Hao K, Zhu J, Gaedigk A, Suver C, Zhong H, Leeder JS, et al: Systematic genetic and genomic analysis of cytochrome P450 enzyme activities in human liver. Genome Res 2010, 20:1020-1036.
23. Lee JS, Ward WO, Knapp G, Ren H, Vallanat B, Abbott B, Ho K, Karp SJ, Corton JC: Transcriptional ontogeny of the developing liver. BMC Genomics 2012, 13:33.
24. Jochheim-Richter A, Rüdrich U, Koczan D, Hillemann T, Tewes S, Petry M, Kispert A, Sharma AD, Attaran F, Manns MP, Ott M: Gene expression analysis identifies novel genes participating in early murine liver development and adult liver regeneration. Differentiation 2006, 74:167-173.
25. Li T, Huang J, Jiang Y, Zeng Y, He F, Zhang MQ, Han Z, Zhang X: Multi-stage analysis of gene expression and transcription regulation in C57/B6 mouse liver development. Genomics 2009, 93:235-242.
26. Gutierrez-Arcelus M, Lappalainen T, Montgomery SB, Buil A, Ongen H, Yurovsky A, Bryois J, Giger T, Romano L, Planchon A, et al: Passive and active DNA methylation and the interplay with genetic variation in gene regulation. Elife 2013, 2:e00523.
27. Bell JT, Spector TD: DNA methylation studies using twins: what are they telling us? Genome Biol 2012, 13:172.
28. Jones PA: Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012, 13:484-492.
29. Wolf SF, Jolly DJ, Lunnen KD, Friedmann T, Migeon BR: Methylation of the hypoxanthine phosphoribosyltransferase locus on the human X chromosome: implications for X-chromosome inactivation. Proc Natl Acad Sci U S A 1984, 81:2806-2810.
30. Jones PA: The DNA methylation paradox. Trends Genet 1999, 15:34-37.
31. Ingelman-Sundberg M, Sim SC, Gomez A, Rodriguez-Antona C: Influence of cytochrome P450 polymorphisms on drug therapies: pharmacogenetic, pharmacoepigenetic and clinical aspects. Pharmacol Ther 2007, 116:496-526.
32. Marinković N, Pasalić D, Potocki S: Polymorphisms of genes involved in polycyclic aromatic hydrocarbons' biotransformation and atherosclerosis. Biochem Med (Zagreb) 2013, 23:255-265.
33. Tulsyan S, Chaturvedi P, Agarwal G, Lal P, Agrawal S, Mittal RD, Mittal B: Pharmacogenetic influence of GST polymorphisms on anthracycline-based chemotherapy responses and toxicity in breast cancer patients: a multi-analytical approach. Mol Diagn Ther 2013, 17:371-379.
34. Ramos DL, Gaspar JF, Pingarilho M, Gil OM, Fernandes AS, Rueff J, Oliveira NG: Genotoxic effects of doxorubicin in cultured human lymphocytes with different glutathione S-transferase genotypes. Mutat Res 2011, 724:28-34.
35. Zhong S, Huang M, Yang X, Liang L, Wang Y, Romkes M, Duan W, Chan E, Zhou SF: Relationship of glutathione S-transferase genotypes with side-effects of pulsed cyclophosphamide therapy in patients with systemic lupus erythematosus. Br J Clin Pharmacol 2006, 62:457-472.
36. Liu Y, Ding J, Reynolds LM, Lohman K, Register TC, De La Fuente A, Howard TD, Hawkins GA, Cui W, Morris J, et al: Methylomics of gene expression in human monocytes. Hum Mol Genet 2013, 22:5065-5074.
37. Sintupisut N, Liu PL, Yeang CH: An integrative characterization of recurrent molecular aberrations in glioblastoma genomes. Nucleic Acids Res 2013, 41:8803-8821.
38. Chambers JC, Zhang W, Sehmi J, Li X, Wass MN, Van der Harst P, Holm H, Sanna S, Kavousi M, Baumeister SE, et al: Genome-wide association study identifies loci influencing concentrations of liver enzymes in plasma. Nat Genet 2011, 43:1131-1138.
39. Fuhrman B: Regulation of hepatic paraoxonase-1 expression. J Lipids 2012, 2012:684010.
40. Liu ME, Liao YC, Lin RT, Wang YS, Hsi E, Lin HF, Chen KC, Juo SH: A functional polymorphism of PON1 interferes with microRNA binding to increase the risk of ischemic stroke and carotid atherosclerosis. Atherosclerosis 2013, 228:161-167.
41. Deakin S, Leviev I, Brulhart-Meynet MC, James RW: Paraoxonase-1 promoter haplotypes and serum paraoxonase: a predominant role for polymorphic position - 107, implicating the Sp1 transcription factor. Biochem J 2003, 372:643-649.
42. Brophy VH, Hastings MD, Clendenning JB, Richter RJ, Jarvik GP, Furlong CE: Polymorphisms in the human paraoxonase (PON1) promoter. Pharmacogenetics 2001, 11:77-84.
43. Miyamoto Y, Koh YH, Park YS, Fujiwara N, Sakiyama H, Misonou Y, Ookawara T, Suzuki K, Honke K, Taniguchi N: Oxidative stress caused by inactivation of glutathione peroxidase and adaptive responses. Biol Chem 2003, 384:567-574.
44. Yang P, Ebbert JO, Sun Z, Weinshilboum RM: Role of the glutathione metabolic pathway in lung cancer treatment and prognosis: a review. J Clin Oncol 2006, 24:1761-1769.
45. Peng D, Hu T, Soutto M, Belkhiri A, Zaika A, El-Rifai W: Glutathione peroxidase 7 has potential tumour suppressor functions that are silenced by location-specific methylation in oesophageal adenocarcinoma. Gut 2014, 63:540-551.
46. Kacevska M, Ivanov M, Wyss A, Kasela S, Milani L, Rane A, Ingelman-Sundberg M: DNA methylation dynamics in the hepatic CYP3A4 gene promoter. Biochimie 2012, 94:2338-2344.
47. Ivanov M, Kals M, Kacevska M, Barragan I, Kasuga K, Rane A, Metspalu A, Milani L, Ingelman-Sundberg M: Ontogeny, distribution and potential roles of 5-hydroxymethylcytosine in human liver function. Genome Biol 2013, 14:R83.
48. Wolfs MG, Rensen SS, Bruin-Van Dijk EJ, Verdam FJ, Greve JW, Sanjabi B, Bruinenberg M, Wijmenga C, van Haeften TW, Buurman WA, et al: Co-expressed immune and metabolic genes in visceral and subcutaneous adipose tissue from severely obese individuals are associated with plasma HDL and glucose levels: a microarray study. BMC Med Genomics 2010, 3:34.

Acknowledgments
We would like to thank The Target project (http://www.rug.nl/target) for providing the compute infrastructure and the BigGrid/eBioGrid project (http://www.ebiogrid.nl) for sponsoring the imputation pipeline implementation, the GCC for imputation, Soesma Medema for lab work (UMCG), and Jackie Senior (UMCG) for editing. We would also like to acknowledge the staff at the Core Facility of the Estonian Genome Center, University of Tartu, for performing the microarray experiments of the Karolinska samples.

Additional files
The following additional data files are available with the online version of this paper.
Additional file 1: Supplementary materials & methods.
Additional file 2: Differentially methylated genes between fetal and adult livers.
Additional file 3: Average DNA methylation levels of all CpG sites on the 450K beadchip and of differentially methylated CpG sites between adult and fetal livers.
Additional file 4: Differentially expressed genes between fetal and adult liver.
Additional file 5: Differentially expressed and differentially methylated genes between fetal and adult livers.
Additional file 6: Comparison of the expression levels of transcription factors in fetal and adult livers.
Additional file 7: eQTMs identified in liver at FDR 0.05.
Additional file 8: eQTL identified in liver at FDR 0.05.
Additional file 9: Top 15,000 meQTL identified in liver.
Additional file 10: Distribution of opposite and identical effects of a SNP on gene expression and gene methylation.
Additional file 11: The contributions of SNPs and DNA methylation levels to the proportion of variation explained in gene expression levels.
Additional file 12: Unique proportion of gene expression variation explained by a SNP or a CpG site.
Additional file 13: Relation between the distance from TSS and the explained variation in gene expression by a CpG site and a SNP.
Additional file 14: The contributions of SNPs and DNA methylation levels to the proportion of variation explained in gene expression levels of 16 ADME genes.
Additional file 15: The contributions of SNPs and DNA methylation levels to the proportion of variation explained in gene expression levels of genes previously identified in GWA studies.
Additional file 16: Overlapping eQTL and meQTL with the same or opposite allelic direction and eQTMs with consistent direction identified in multiple tissues.
Additional file 17: meQTL with an opposite allelic direction between liver and the other three tissues.

Improving Phenotypic Prediction by Combining Genetic and Epigenetic Associations
American Journal of Human Genetics, DOI: 10.1016/j.ajhg.2015.05.014

Sonia Shah,1,2,14 Marc J. Bonder,3,14 Riccardo E. Marioni,1,4,5 Zhihong Zhu,1 Allan F. McRae,1,2 Alexandra Zhernakova,3 Sarah E. Harris,4,5 Dave Liewald,4 Anjali K. Henders,6 Michael M. Mendelson,7,8,9 Chunyu Liu,10 Roby Joehanes,11 Liming Liang,12 BIOS Consortium, Daniel Levy,9 Nicholas G. Martin,6 John M. Starr,4,13 Cisca Wijmenga,3 Naomi R. Wray,1 Jian Yang,1 Grant W. Montgomery,6,14 Lude Franke,3,14 Ian J. Deary,4,13,14 and Peter M. Visscher1,2,4,14,*

Abstract
We tested whether DNA-methylation profiles account for inter-individual variation in body mass index (BMI) and height and whether they predict these phenotypes over and above genetic factors. Genetic predictors were derived from published summary results from the largest genome-wide association studies on BMI (n ~ 350,000) and height (n ~ 250,000) to date. We derived methylation predictors by estimating probe-trait effects in discovery samples and tested them in external samples. Methylation profiles associated with BMI in older individuals from the Lothian Birth Cohorts (LBCs, n = 1,366) explained 4.9% of the variation in BMI in Dutch adults from the LifeLines DEEP study (n = 750) but did not account for any BMI variation in adolescents from the Brisbane Systems Genetics Study (BSGS, n = 403). Methylation profiles based on the Dutch sample explained 4.9% and 3.6% of the variation in BMI in the LBCs and BSGS, respectively. Methylation profiles predicted BMI independently of genetic profiles in an additive manner: 7%, 8%, and 14% of variance of BMI in the LBCs were explained by the methylation predictor, the genetic predictor, and a model containing both, respectively. The corresponding percentages for LifeLines DEEP were 5%, 9%, and 13%, respectively, suggesting that the methylation profiles represent environmental effects. The differential effects of the BMI methylation profiles by age support previous observations of age modulation of genetic contributions. In contrast, methylation profiles accounted for almost no variation in height, consistent with a mainly genetic contribution to inter-individual variation. The BMI results suggest that combining genetic and epigenetic information might have greater utility for complex-trait prediction.

1. Queensland Brain Institute, University of Queensland, Brisbane 4072, Australia; 2. University of Queensland Diamantina Institute, Translational Research Institute, University of Queensland, Brisbane, QLD 4072, Australia; 3. Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen 9713 AV, the Netherlands; 4. Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh EH8 9JZ, UK; 5. Medical Genetics Section, Centre for Genomic and Experimental Medicine, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK; 6. Queensland Institute of Medical Research Berghofer Medical Research Institute, Brisbane, QLD 4029, Australia; 7. Framingham Heart Study and Boston University School of Medicine, Boston, MA 01702, USA; 8. Department of Cardiology, Boston Children’s Hospital, Boston, MA 02115, USA; 9. Population Studies Branch, National Heart, Lung, and Blood Institute, NIH, Bethesda, MD 20892-7936, USA; 10. Department of Biostatistics, Boston University, Boston, MA 02118, USA; 11. Hebrew Senior Life, Harvard Medical School, Boston, MA 02131, USA; 12. Departments of Epidemiology and Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; 13. Department of Psychology, University of Edinburgh, Edinburgh EH8 9JZ, UK; 14. These authors contributed equally to this work; *Correspondence to: [email protected]

Introduction
Obesity is a major risk factor for a number of chronic diseases, including diabetes, cardiovascular diseases, and cancer.1–4 Once considered a health burden only in high-income countries, it is a growing epidemic that is dramatically on the rise in low- and middle-income countries, particularly in urban settings. Knowledge of the genetic and environmental contributors to obesity is necessary for developing effective strategies to reduce its global burden.

Body mass index (BMI) is a commonly used measure for quantifying obesity. Although many genetic determinants of BMI have been identified by large genome-wide association studies (GWASs),5 only about 10% of the inter-individual variation in BMI has been explained by genetic factors. With recent advances in high-throughput genomic technologies, researchers are now turning to epigenetics as a way of understanding the interplay between genetics and environment and their contribution to complex traits and diseases.

Epigenetics refers to the regulatory processes that control gene expression without altering the DNA sequence. The most studied epigenetic process is DNA methylation, the reversible addition of a methyl group primarily to a cytosine residue at a CpG dinucleotide. Because epigenetic variation reflects both genetic and environmental exposures, there is potential to identify novel disease-associated genes and pathways that might not be discovered through genetic studies alone. Methylome-wide association studies (MWASs), using methylation arrays such as the Illumina Infinium HumanMethylation450 array, have already begun to identify genomic CpG sites whose methylation levels are associated with BMI.6 DNA-methylation levels at specific CpG sites have already been shown to be accurate predictors of age and smoking status,7–9 and such phenotypic prediction could extend to complex traits and disease and potentially improve prediction over genetic information.
The ability of DNA-methylation profiles to predict complex traits cross-sectionally, independently of genotypic information, has not yet been explored.

Here, we investigate whether DNA-methylation profiles associate with BMI and height independently of genotypic information. BMI and height represent two complex traits with different relative contributions of genetics and environment to inter-individual variance.10–13 Heritability estimates for BMI are high but vary (0.3–0.8) among twin and family studies. The genetic contribution appears to vary with age, such that it has a greater influence during childhood than during adult life.10 In contrast, height is known to have a mostly genetic contribution; heritability estimates from both twin and family studies are consistently around 0.8 in nutritionally replete societies.11–13 These findings suggest that epigenetic contributions might be greater for BMI than for height. Therefore, the present study aimed to test the relative contributions of DNA-methylation status and genetic variation to inter-individual variation in BMI and height. We hypothesized a priori that after genetic determinants of phenotype are accounted for, DNA methylation will provide a far more substantial contribution to inter-individual variation in BMI than to variation in height.

To this end, we first performed an MWAS for BMI and height in two independent datasets: the discovery sample comprised 1,366 older individuals from two Scottish birth cohorts (the Lothian Birth Cohorts [LBCs] of 1921 [n = 446; mean age 79.1 ± 0.6 years] and 1936 [n = 920; mean age 69.5 ± 0.8 years]), and the validation sample was the LifeLines DEEP cohort of Dutch adult individuals (n = 750; mean age 45.5 ± 13.3 years).
For each trait, we generated methylation-profile scores (a weighted sum of the methylation levels at associated CpG sites) in the validation cohort on the basis of the observed CpG associations in the discovery cohort, and we estimated the proportion of height and BMI variance accounted for by these DNA-methylation profiles. We also determined whether the methylation-profile scores were associated with the two traits independently of genetic-profile scores (weighted sum of associated effect alleles of associated SNPs) on the basis of results from the most recent BMI and height meta-GWASs carried out by the Genetic Investigation of Anthropometric Traits (GIANT) consortium.14
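The profile score described above is a weighted sum, so it can be sketched in a few lines. The following is a minimal illustration only, not the authors' pipeline: the two probe IDs are the BMI-associated probes named in this chapter, but the weights, the methylation values, and the names `profile_score`, `person_A`, and `person_B` are hypothetical.

```python
def profile_score(levels, weights):
    """Weighted sum of methylation levels over the probes listed in `weights`."""
    missing = [probe for probe in weights if probe not in levels]
    if missing:
        raise KeyError(f"probes missing from sample: {missing}")
    return sum(weights[probe] * levels[probe] for probe in weights)


# Hypothetical effect sizes, as if estimated in a discovery cohort.
weights = {"cg06500161": 2.0, "cg11024682": 4.0}

# Hypothetical methylation beta values for two validation-cohort individuals.
samples = {
    "person_A": {"cg06500161": 0.50, "cg11024682": 0.25},
    "person_B": {"cg06500161": 0.75, "cg11024682": 0.50},
}

scores = {sid: profile_score(levels, weights) for sid, levels in samples.items()}
print(scores)
```

A genetic-profile score has the same form, with effect-allele counts in place of methylation levels and GWAS effect sizes as weights.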

In adults, the BMI cutoffs that define obesity are not linked to age and do not differ for men and women, whereas in children BMI varies with age and sex.15 Therefore, methylation changes associated with BMI in adults might not necessarily reflect those observed in children or adolescents. To investigate this further, we tested whether BMI-associated methylation changes observed in the adults of the LBCs and LifeLines DEEP cohort were predictive of BMI in adolescents from the Brisbane Systems Genetics Study (BSGS; n = 403; mean age 14.0 ± 2.4 years).

Results

Cohort Characteristics
After sample QC, 1,366 samples from the LBCs (n = 446 from LBC1921 and n = 920 from LBC1936), 752 samples from the LifeLines DEEP cohort, and 403 samples from the BSGS cohort (after we removed one individual from each MZ twin pair) had methylation, phenotype, and genotype data. Cohort characteristics of these samples are provided in Table 1. The LifeLines DEEP participants had a much wider age range (18–81 years) and were on average much younger (mean 45.5 ± 13.3 years) than LBC participants (69.5 ± 0.8 years in LBC1936 and 79.1 ± 0.6 years in LBC1921). The mean age in the BSGS cohort was 14 ± 2.4 years. BMI and height distributions for each cohort are shown in Figures S1 and S2. The BMI and height phenotypes were adjusted for age and sex in each cohort.

Table 1. Cohort characteristics of the LBC and LifeLines DEEP participants at time of DNA-methylation assays

Cohort         LBC1936       LBC1921       LifeLines DEEP   BSGS
n              920           446           752              403
Age (years)    69.5 ± 0.8    79.1 ± 0.6    45.5 ± 13.3      14.0 ± 2.4
Female         49.5%         60.5%         57.8%            48.1%
BMI (kg/m²)    27.8 ± 4.4    26.2 ± 4.0    25.4 ± 4.2       20.4 ± 3.7
Height (cm)    166.4 ± 8.9   163.1 ± 9.3   175.2 ± 8.9      159.3 ± 11.6

Methylome-wide Association Analysis
To create a multi-probe methylation predictor, we first conducted a methylome-wide association analysis. A total of 431,951 and 407,935 CpG probes remained in the LBC and LifeLines DEEP datasets, respectively, after QC and probe filtering. Probes with an association p value < 1.16 × 10⁻⁷ in the LBC dataset and a p value < 1.22 × 10⁻⁷ in the LifeLines DEEP dataset were considered to be significantly associated after Bonferroni correction for the number of probes tested. After removal of correlated probes, nine CpG probes in the LBC dataset and five probes in the LifeLines DEEP dataset were associated with BMI and were used for generating methylation-profile scores (Table S1). Two probes (cg06500161 and cg11024682) were significantly associated with BMI in both cohorts: cg06500161 is found in an intronic region of ABCG1 (ATP-binding cassette, sub-family G, member 1 [MIM: 603076]), and cg11024682 is intronic to one isoform of SREBF1 (sterol regulatory element binding transcription factor 1 [MIM: 184756]). Both genes are known to be involved in lipid metabolism, but neither has been identified by GWASs to harbor genetic variants that are associated with BMI.

For height, no CpG probes passed the p value threshold in the LBCs, whereas only a single probe passed the threshold in the LifeLines DEEP cohort. Therefore, to generate a height-profile score, we used a less stringent association p value of <0.001 for probe selection.

In total, 507 and 949 CpG probes were selected in the LBCs and LifeLines DEEP cohort, respectively. Quantile-quantile plots for each MWAS are shown in Figure S3. We observed inflation in the lambda values: for BMI, lambdas were 1.53 and 1.17 in the LBCs and LifeLines DEEP cohort, respectively, whereas for height, lambdas were 1.12 and 1.36, respectively. Lambdas close to 1 (SD = 0.1) were observed with permutation analysis (performed in both the LBCs and LifeLines DEEP cohort), which indicates that the inflation was due to real signal and not an artifact of our assumption of the null distribution of the test statistic.
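The Bonferroni thresholds quoted for the BMI MWAS can be checked by dividing a family-wise error rate by the number of probes tested. The sketch below assumes a family-wise alpha of 0.05, which the text does not state explicitly; the function name `bonferroni_threshold` is illustrative.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p value cutoff that keeps the family-wise error rate at `alpha`."""
    return alpha / n_tests


# Probe counts after QC and filtering, as reported in the text.
probe_counts = {"LBC": 431_951, "LifeLines DEEP": 407_935}

# alpha = 0.05 is an assumption, not a value given in the text.
thresholds = {cohort: bonferroni_threshold(n) for cohort, n in probe_counts.items()}

for cohort, threshold in thresholds.items():
    print(f"{cohort}: p < {threshold:.3e}")
```

Under this assumption both results agree with the reported cutoffs to within rounding of the last digit.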

Proportion of BMI and Height Variance Explained by Profile Scores in the LBCs and LifeLines DEEP Cohort
Consistent with expectation, all methylation- and genetic-profile scores were correlated with their respective traits in the anticipated direction (Table S2). The methylation-profile scores explained 6.9% and 4.9% (p value < 1 × 10⁻¹⁵ and p value = 7 × 10⁻¹⁰, respectively) of the variation in BMI in the LBCs and LifeLines DEEP cohort, respectively, whereas the genetic-profile scores explained 8.0% and 9.4% (p value < 1 × 10⁻¹⁵), respectively (Figure 1). When both the methylation- and genetic-profile scores were included in an additive model for BMI, each remained independently associated with BMI. The proportion of variance explained by the additive model was 14.0% and 13.6% in the LBCs and LifeLines DEEP cohort, respectively, suggesting a mainly additive effect of the two scores on BMI (Figure 1).
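One rough way to read "mainly additive" is to compare the variance explained by the joint model with the sum of the two single-score values: if the two scores were uncorrelated, the joint R² would be approximately that sum. The sketch below applies this comparison to the percentages reported above; the function name `additivity_gap` is illustrative, and the calculation is a back-of-the-envelope check rather than a formal test.

```python
# Percentages of BMI variance explained, as reported in the text.
reported = {
    "LBC": {"methylation": 6.9, "genetic": 8.0, "additive": 14.0},
    "LifeLines DEEP": {"methylation": 4.9, "genetic": 9.4, "additive": 13.6},
}


def additivity_gap(r2):
    """Sum of the single-score percentages minus the joint-model percentage."""
    return round(r2["methylation"] + r2["genetic"] - r2["additive"], 1)


gaps = {cohort: additivity_gap(r2) for cohort, r2 in reported.items()}

for cohort, gap in gaps.items():
    print(f"{cohort}: shortfall of joint model vs. sum = {gap} percentage points")
```

A shortfall of less than one percentage point in both cohorts is in line with the two scores carrying largely non-overlapping information.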

Figure 1. BMI and Height Prediction. The plots depict how much of the variance in the sex- and age-adjusted BMI and height phenotypes (adjusted R²) was explained by the methylation-profile score, the genetic-profile score, an additive model including both scores (methylation + genetic), and an interaction model (methylation × genetic). The methylation score in the LBCs is based on selected probes and effect sizes from the LifeLines DEEP MWAS, and vice versa. The genetic-profile scores are based on results from the GIANT meta-GWAS.

The BMI methylation-profile scores, based on 78 probes selected from an MWAS in the larger Framingham Heart Study (M.M.M., unpublished data) but weighted with effect sizes estimated in the LBCs, explained 7.3% of the variation in BMI in the LifeLines DEEP cohort, whereas a profile score based on the effects estimated in the LifeLines DEEP cohort explained 11% of the variation in the LBCs. As before, the methylation-profile scores showed an additive effect with the genetic-profile scores (Figure S4). Compared to the methylation-profile scores derived from the MWAS in the LBCs or LifeLines DEEP cohort, the larger R² values for the profile scores based on probes identified in the Framingham cohort suggest that the larger sample size in the latter study provided more power to identify additional CpG probes and hence allowed us to explain a higher proportion of variance in BMI.

The height methylation-profile scores were associated with height and explained 0.31% and 0.76% (p value = 0.02 and 0.01) of the variation in the LBCs and LifeLines DEEP cohort, respectively. The height genetic-profile scores explained 18.5% and 19.8% (p value < 1 × 10⁻¹⁵) of the inter-individual variation in the height phenotype in the LBCs and LifeLines DEEP cohort, respectively (Figure 1). The additive model including both methylation- and genetic-profile scores explained 18.5% and 20.1% of the variation in the height phenotype in the LBCs and LifeLines DEEP cohort, respectively. However, the methylation-profile score showed no independent association in the LBCs (p = 0.16) and remained only marginally associated (p = 0.035) with the height phenotype independently of the genetic-profile score in the LifeLines DEEP cohort.

For BMI, the interaction model explained a slightly larger proportion of variance than did the additive model in the LBCs (15% versus 14%; ANOVA p value = 5 × 10⁻⁶) but not in the LifeLines DEEP cohort (Table S3). There was no significant interaction between the genetic- and methylation-profile scores for height in either cohort.

Proportion of BMI Variance Explained in BSGS Adolescent Individuals
The methylation-profile scores derived from the MWAS analysis in the LBC individuals did not explain any variation (adjusted R² = -0.001) in the sex- and age-adjusted BMI phenotype from the BSGS cohort, whereas that derived from the mostly middle-aged individuals of the LifeLines DEEP study explained 3.6% (p value = 8 × 10⁻⁵; Figure 2). Methylation scores based on the CpG probes identified in the larger Framingham MWAS but weighted with effect sizes from the older LBC individuals explained 3.0% of the variation in BMI in adolescent individuals. Based on the same CpG probes but effect sizes derived from the younger, albeit smaller, LifeLines DEEP cohort, the methylation-profile scores explained almost twice (5.4%) the variation in BMI in adolescent individuals (Figure 2).

Given that the proportion of variance explained in a prediction setting is a function of sample sizes of the discovery cohorts, the R² values from different-sized cohorts are not directly comparable. We therefore compared the ratio of the methylation score R² to the genetic score R² to look at the relative contribution of the methylation- and genetic-profile scores to variance in BMI in both BSGS adolescents and older cohorts. As shown in Table S4, in all cases, the methylation-profile scores had a lower contribution to BMI variance in the BSGS cohort than in the other cohorts. The methylation predictor derived from older individuals (probes and weights for the methylation-profile score derived from the LBC MWAS) performed the worst. A BMI methylation score based on a fixed-effect meta-analysis of the LBC and LifeLines DEEP MWAS results, whereby a Bonferroni correction for 374,629 common probes in the two cohorts (p value < 1.33 × 10⁻⁷) was used for selecting probes, performed better than the methylation score based on the LBC MWAS.
However, despite the larger sample size, it performed worse than the predictor based on the LifeLines DEEP MWAS: its adjusted R² was 0.028 (p value = 4.0 × 10⁻⁴).
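A slightly negative adjusted R², as reported for the LBC-derived score in the BSGS cohort, is expected when a predictor explains essentially nothing, because the adjustment penalizes the raw R² for the number of predictors. The sketch below uses the standard adjusted R² formula for a linear model with an intercept; that the authors used exactly this definition is an assumption, and the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R² for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# With n = 403 (the BSGS sample size) and a single profile score as predictor,
# a raw R² of exactly zero already yields a small negative adjusted value.
null_fit = adjusted_r2(0.0, n_samples=403, n_predictors=1)
print(f"adjusted R² for a score with no explanatory power: {null_fit:.4f}")
```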


Figure 2. BMI Prediction in BSGS Adolescents. The plots show how much of the variance in the sex- and age-adjusted BMI phenotype (adjusted R²) was explained by the methylation-profile score, the genetic-profile score, an additive model including both scores (methylation + genetic), and an interaction model (methylation × genetic). The GWAS scores are based on results from the GIANT meta-GWAS. Methylation scores are based on probe selection and weights derived from the LBCs MWAS or the LifeLines DEEP MWAS (upper panel) or probe selection from the Framingham discovery with weights derived from the LBCs or LifeLines DEEP studies (lower panel).

Correcting for Cell Count
In the LBCs, all cell counts except neutrophils were associated with sex- and age-adjusted BMI (p < 0.05), but only monocytes were associated with sex- and age-adjusted height. In contrast, in the LifeLines DEEP cohort, all cell counts were significantly associated with BMI, but not with height. Adjusting for cell count reduced some of the inflation observed in the uncorrected analysis: for BMI, lambdas were 1.28 and in the LBCs and LifeLines DEEP cohort, respectively, whereas for height, lambdas were 1.00 and 1.15, respectively. The proportion of variance explained by the methylation scores after cell-count correction is shown in Figure S5. The cell-count-corrected methylation scores based on the MWAS discovery in the LBCs and LifeLines DEEP cohort remained significantly associated with BMI and showed an additive effect, although the proportion of variance explained was substantially less in the LBCs (3.2%). For height, the methylation-profile score was still marginally associated with the sex- and age-adjusted height phenotype in the LifeLines DEEP cohort (adjusted R² = 0.0041; p value = 0.045), but not in the LBCs.

Discussion
We investigated two traits that we postulated a priori to have varying contributions of genetic and environmental factors to inter-individual variability: we hypothesized that height would have a mostly genetic component, whereas BMI would have a larger environmental contribution that increases with age.10 We found that the methylation-profile scores contributed almost nothing to the inter-individual variance in height but showed a strong association with BMI. The BMI methylation-profile score improved prediction of BMI over and above the genetic-profile score. The two profile scores acted mostly in an additive manner, suggesting that methylation-profile scores capture information that is largely independent of the genetic determinants of BMI. Our results suggest that even if there are genetic variants whose effects on BMI are mediated by methylation, their contribution is small. Therefore, methylation profiles might have important utility in improving phenotype prediction over and above genetic data alone.

Furthermore, BMI methylation profiles in older people (the LBC individuals) did not predict well in adolescents (BSGS cohort). A methylation predictor based on CpG probes identified in a larger, independent study (Framingham) explained almost double the proportion of variance in BMI in BSGS adolescent individuals when the effect sizes used for generating the methylation-profile score were derived from the younger LifeLines DEEP cohort than when they were derived from the older LBC individuals. A methylation predictor based on the meta-analysis of the LBCs and LifeLines DEEP cohort, despite the larger sample size, performed worse than a predictor based on the LifeLines DEEP cohort alone. The relative contribution of the methylation and genetic predictors for BMI in adolescent individuals was also found to be much lower.
Combined, the results suggest that these differences might be due to the direct effect of more prolonged exposure to environmental factors in older individuals, or the fact that older individuals are "exposed" to the phenotype for longer, and therefore might show larger effects on methylation due to reverse causation (Figure S6).

The effect sizes for individual CpGs are much larger than effect sizes for individual SNPs, and this is reflected in the fact that the proportion of variance explained by the CpGs identified in relatively small sample sizes (<1,500 individuals) is comparable to that explained by SNPs identified in very large samples used in genetic discovery (over 250,000 individuals). This suggests that bigger studies might be able to identify epigenetic variation that accounts for a larger proportion of the inter-individual variance of a complex trait.

A permutation analysis gives an indication of the highly correlated structure of the methylation probes in the genome. If lambda is the mean χ² statistic across all ~400,000 probes, then its sampling variance is 2/M, where M is the effective number of independent probes, i.e., the number of independent probes that give the same sampling variance as the observed variance. The SD of the genome-wide lambda from permutations therefore implies a surprisingly small effective number of independent methylation probes of only 2/0.1² = 200, consistent with a complex correlation structure. Such a complex correlation structure or small effective number of probes does not imply the absence of meaningful and genome-wide biological inference, as shown, for example, for gene expression, which is also characterized by a complex correlation structure.16

A limitation of our study was the relatively small sample size of the LBCs and LifeLines DEEP cohort. We showed that a methylation-profile score based on a more extensive CpG probe list identified from the larger Framingham study performed the best.
This suggests that the smaller sample size of the LBCs and LifeLines DEEP cohort lacked the power for statistical identification of additional CpGs. A sensitivity analysis using different p value thresholds to select CpG probes in the LBCs and LifeLines DEEP cohort showed that the ability of the methylation score to predict BMI decreased as the p value threshold was relaxed (Table S5). Forming large consortia to enable meta-analyses of multiple studies will overcome power issues and identify more robust associations, as well as estimate effect sizes more accurately. However, sample characteristics of the cohorts would need careful consideration for methylation analyses.
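The effective-number argument made earlier in this Discussion rests on one fact: a χ² statistic with one degree of freedom has variance 2, so the mean of M independent such statistics has variance 2/M. The sketch below simply inverts that relation; it assumes one-degree-of-freedom test statistics, and the function names `genomic_lambda` and `effective_number_of_probes` are illustrative.

```python
import statistics


def genomic_lambda(chi2_stats):
    """Lambda taken as the mean chi-square statistic, as defined in the text."""
    return statistics.fmean(chi2_stats)


def effective_number_of_probes(lambda_sd):
    """Solve Var(lambda) = 2 / M for M, given the SD of lambda across permutations."""
    return 2.0 / lambda_sd ** 2


# Toy chi-square statistics, only to show the call; not data from the study.
toy_lambda = genomic_lambda([0.5, 1.5, 1.0])

# SD of lambda across permutations as reported in the text (0.1).
m_effective = effective_number_of_probes(0.1)
print(f"toy lambda = {toy_lambda}, effective number of probes = {m_effective:.0f}")
```

With roughly 400,000 probes on the array, an effective number of about 200 means that the test statistics are far from independent, which is the point the permutation analysis makes.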

As more BMI-associated CpG sites are identified, the interaction between methylation and genetic profiles might become stronger, because it would be reasonable to expect that methylation at some of these CpGs might lie in the causal pathway, downstream of SNP effects. Further analysis to identify SNPs associated with both BMI and methylation levels at BMI-associated CpG sites would be needed to dissect the observed interaction and determine causality. Current work using a Mendelian randomization approach to identify a causal SNP (rs4925108) that is associated with methylation at a CpG site in SREBF1 suggests that both the SNP and the methylation levels at the CpG appear to be associated with BMI (M.M.M., unpublished data).

Another limitation of our study was the use of methylation profiles observed in blood. It is well known that tissue-specific DNA-methylation profiles exist; therefore, methylation profiles in blood might not be entirely representative of other tissues. If the primary interest for identifying epigenetic profiles is to determine causality, the tissue under investigation might be of great importance, and a more relevant tissue, such as adipose tissue, might be more suitable for a trait such as BMI or obesity. This might not apply for prediction, and comparing blood-derived methylation predictors with those derived from other tissues would be a logical next step if data were available.
The SNP arrays used in the BMI and height GWASs provide comprehensive coverage of the genome: 93% of common SNPs (both coding and non-coding) in the CEU population (Utah residents with ancestry from northern and western Europe from the CEPH collection) are tagged at an r² threshold of 0.8.17 In comparison, although the Infinium HumanMethylation450 array comprehensively evaluates promoter regions and CpG islands, as well as other potentially relevant intergenic regions, such as regulatory regions,18 it only interrogates a small subset of the ~28 million CpG sites in the human genome. Therefore, other CpG sites might be missed, potentially giving an incomplete and biased view of the relative contribution of genetic and epigenetic factors to phenotypic variation. Despite this, the array has already proven to be a useful high-throughput technology for unraveling interesting biology: a number of studies have successfully identified CpGs in or near likely candidate genes associated with various phenotypes.

A drawback of using epigenetic disease markers, like any other molecular biomarker, is that they are vulnerable to confounding and reverse causation. This also applies to cell counts as a biomarker. The observed attenuation of the BMI variance explained by the methylation predictor when the MWAS was adjusted for cell counts suggests that both methylation and cell counts are involved in either the cause or the consequence of BMI differences between individuals. Distinguishing methylation changes that lie in the causal pathway from those that are a consequence of disease is an important task for understanding disease etiology and identifying new drug targets. Combining genetic and epigenetic data in a typical Mendelian randomization analysis might identify causal methylation changes due to genetic variation.
In the context of BMI, methylation changes due to obesity would still be of interest for understanding the etiology of downstream disease outcomes, such as cardiovascular disease or type 2 diabetes. However, neither causality nor functional knowledge is necessary for prediction, and these were therefore not the focus of this study.

In summary, we have shown that inter-individual differences in environment or lifestyle are partly reflected in DNA-methylation data, and therefore DNA-methylation profiles have the potential to significantly improve complex-trait prediction over and above that of genetic predictors. Outside of disease association, accurate prediction of complex traits by using genetic and epigenetic predictors might be useful in forensic investigations where a biological sample is available but where there is no profile from the person whose sample is investigated.

Material and Methods

Cohorts

LBCs
The LBCs comprise individuals born in 1921 (LBC1921) and 1936 (LBC1936), and most of these individuals were participants in the Scottish Mental Surveys (SMSs) of 1932 and 1947, respectively, when nearly all 11-year-old children in Scotland completed an IQ-type test in school. The LBC studies provide follow up of surviving SMS participants who are living in the Lothian region (Edinburgh city and outskirts) of Scotland.19–21 The LBC studies focus on the determinants of people's cognitive aging differences and collect detailed information on cognitive, biomedical, lifestyle, socio-demographic, behavioral, physical, and psychological factors. An overview of the data collected in the LBCs can be found in the cohorts' profile article.21 The current study draws upon the baseline examinations (including blood-sample collection and phenotypic measurements) of 550 LBC1921 participants recruited in 1999–2001 (average age of 79 years) and 1,091 LBC1936 participants recruited in 2004–2007 (average age of 70 years).

LifeLines DEEP
This is a sub-cohort (n = 752, recruited in 2013) of the LifeLines study,22 the latter of which is a multi-disciplinary prospective population-based cohort study examining the health and health-related behaviors of 167,729 persons living in the north of the Netherlands in a unique three-generation design. It employs a broad range of investigative procedures in assessing the biomedical, socio-demographic, behavioral, physical, and psychological factors contributing to the health and disease of the general population and has a special focus on multi-morbidity and complex genetics. A full description of the LifeLines DEEP study can be found in the paper describing the cohort and data.22

BSGS
The BSGS is a study on adolescent twins comprising a total of 962 individuals from 314 families of European descent,23 and a subset of these individuals have DNA-methylation data (614 individuals from 177 families).
Families consist of adolescent monozygotic (MZ) and dizygotic (DZ) twins, their siblings, and their parents. The BSGS comprises a sub-sample from a larger and continuing study on families with adolescent twins. Recruitment commenced in 1992. A full description of the BSGS cohort has been previously provided.23,24

Ethics
Ethics permission for LBC1921 was obtained from the Lothian Research Ethics Committee (wave 1: LREC/1998/4/183). Ethics permission for LBC1936 was obtained from the Multi-Centre Research Ethics Committee for Scotland (wave 1: MREC/01/0/56) and the Lothian Research Ethics Committee (wave 1: LREC/2003/2/29). The BSGS was approved by the Queensland Institute for Medical Research Human Research Ethics Committee. The LifeLines DEEP study was approved by the ethical committee of the University Medical Centre Groningen (document no. METC UMCG LLDEEP: M12.113965). For all studies, written consent was obtained from all participants.

Phenotypic Measurements

LBCs
Weight and height were measured in the LBCs by a trained nurse according to a standardized protocol. Participants were asked to remove their shoes before a seca stadiometer was used to assess height in centimeters. Weight (after participants removed shoes and outer clothing) was measured in kilograms by electronic seca scales, which provided digital readouts.

LifeLines DEEP
Height was measured without shoes by the seca 222 stadiometer. Weight was measured without shoes and heavy clothing by the seca 761 scale. All measurements were performed by a trained research nurse.

BSGS
Height and weight were both measured clinically with a stadiometer and accurate scales, respectively. Anthropometric measurements were only available for the offspring. Complete blood cell counts (lymphocytes, monocytes, neutrophils, eosinophils, and basophils) were measured in the LBCs and LifeLines cohort.

DNA Methylation
Whole-blood samples were collected at the same time as phenotypic measurements in all studies. Extracted DNA was profiled with the Infinium HumanMethylation450 BeadChip,25 and data were available on 752 LifeLines DEEP participants, 1,518 LBC participants (514 from LBC1921 and 1,004 from LBC1936), and 614 BSGS participants. For each of the LBCs and the LifeLines DEEP cohort, samples were randomized on 96-well plates, and methylation arrays were run in a single experiment to minimize batch effects. Low-quality probes and samples were excluded from further analysis as described below.

LBC DNA-Methylation Quality Control
Details of DNA extraction and methylation profiling are described elsewhere.26 Background correction of the raw intensity data and generation of the methylation beta values were done with the R minfi package. Quality-control (QC) steps included the removal of probes with a low (<95%) detection rate at p < 0.01. Array control probes were inspected manually, and low-quality samples (e.g., samples with inadequate hybridization, bisulfite conversion, nucleotide extension, or staining signal) were removed. Samples with a low call rate according to the Illumina-based threshold (samples with <450,000 probes detected at p < 0.01) were removed. LBC samples had been genotyped with the Illumina 610-Quadv1 genotyping platform. Genotype information from the 65 SNP control CpG probes on the methylation chip was cross-validated against that from the genotyping chip with the R wateRmelon package. Where there was low correspondence, samples were excluded (n = 9).
We also excluded eight participants whose reported sex did not match their predicted sex according to methylation levels for probes on the X and Y chromosomes.

LifeLines DEEP DNA-Methylation QC
Details of DNA extraction and methylation profiling are described elsewhere.21 Probe QC, background correction, color correction, and normalization were performed with a custom pipeline based on the pipeline by Tost and Touleimat.27 All methylation probes were re-mapped to the human genome (hg37, UCSC Genome Browser),28 and both poorly mapping probes and probes with a SNP in the single-base extension site (according to GoNL29) were removed in the same step. Data were normalized with DASEN.30

BSGS DNA-Methylation QC
Details of DNA extraction, methylation profiling, and methylation QC are provided elsewhere.24

In all cohorts, non-autosomal probes and probes with underlying SNPs at the target CpG site (according to Illumina annotation) were excluded from further analysis. Methylation levels are presented as beta values, which range between 0 and 1, where a value of 0 indicates that all copies of the CpG site in the sample were completely unmethylated (no methylated molecules were measured), and a value of 1 indicates that every copy of the site was methylated. Beta values were then processed as follows in all cohorts. The beta values were logit transformed:

log(beta / (1 - beta)). For removal of variation due to batch effects and covariates, the logit-transformed beta values were regressed onto the technical variables (plate, array, and array position) and covariates (sex and age for the main analysis; in addition, cell count was adjusted in a sensitivity analysis in the LBCs and LifeLines DEEP cohort). Residuals from this linear regression were inverse-normal transformed and used in all subsequent analyses.
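The three processing steps described above (logit transform, residualization, inverse-normal transform) can be sketched in plain Python. This is a minimal illustration, not the pipeline used in the study: for brevity it regresses on a single covariate rather than the full set of technical variables and covariates, and the clamping constant `eps`, the rank offset `c`, and all function names are assumptions introduced here.

```python
import math
from statistics import NormalDist, mean


def logit(beta, eps=1e-6):
    """Logit-transform a beta value. Clamping with eps (an assumption,
    not stated in the text) avoids infinities at exactly 0 or 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log(b / (1.0 - b))


def residuals_simple(y, x):
    """Residuals from an ordinary least-squares fit of y on one covariate x.
    The study regressed on several variables; one is used here for brevity."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]


def inverse_normal(values, c=0.5):
    """Rank-based inverse-normal transform. The offset c = 0.5 is an assumed
    convention; ties are broken by input order in this sketch."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - c) / (n - 2 * c + 1))
    return out


# Toy data for one CpG probe in five individuals (illustrative values only).
betas = [0.10, 0.25, 0.50, 0.75, 0.90]
age = [70, 72, 69, 71, 73]
m_values = [logit(b) for b in betas]
adjusted = inverse_normal(residuals_simple(m_values, age))
```

In practice the residualization would use a multiple regression with categorical terms for plate and array; the single-covariate fit above only shows where that step sits between the two transforms.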

Genotyping
Genotype data were available for all samples with DNA-methylation data in the three cohorts. The LBC and BSGS samples were genotyped with the Illumina Human610-Quad v1.0 genotyping platform, and data were available on all participants with DNA-methylation data. After QC, genotyped data were imputed with 1000 Genomes Phase 1 version 331 and IMPUTE2.32,33 The LifeLines DEEP samples were genotyped with the HumanCytoSNP-12 BeadChip and the ImmunoChip,34 a customized Illumina Infinium array. The data were merged and subsequently imputed with GoNL29,35 and IMPUTE2.32,33 Details of QC in each cohort are described below.

LBC Genotyping QC
DNA samples from each individual were genotyped with the Illumina Human610-Quad BeadChip. Individuals were excluded on the basis of unresolved gender discrepancy, relatedness, call rate (≤0.95), and evidence of non-European descent. SNPs were included in the analyses if they met the following conditions: call rate ≥ 0.98, minor allele frequency ≥ 0.01, and Hardy-Weinberg equilibrium test with p ≥ 0.001.

LifeLines DEEP Genotyping QC
Details of DNA extraction, genotyping, and QC are provided elsewhere.22

BSGS Genotyping QC
DNA samples from each individual were genotyped by the Scientific Services Division at deCODE Genetics (Iceland) with the Illumina Human610-Quad BeadChip. Genotypes were called with the Illumina BeadStudio software. A detailed description of genotyping QC can be found elsewhere.23,36

Methylome-wide Association Analysis in the LBCs and LifeLines DEEP Cohort
The BMI and height phenotypes were adjusted for sex and age and standardized for the generation of Z scores. Linear regression analysis was used to test the association between each CpG probe (independent variable) and the BMI or height Z score phenotype (dependent variable).
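The per-probe association test described above can be illustrated with a single-predictor least-squares fit. This is a minimal sketch, not the software used in the study: the function name is hypothetical, and the p value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as these cohorts.

```python
import math
from statistics import NormalDist, mean


def mwas_single_probe(trait_z, methylation):
    """Regress the trait Z score (dependent variable) on one CpG probe
    (independent variable). Returns (effect size, standard error, p value).
    The two-sided p value uses a normal approximation (an assumption)."""
    n = len(trait_z)
    mx, my = mean(methylation), mean(trait_z)
    sxx = sum((x - mx) ** 2 for x in methylation)
    sxy = sum((x - mx) * (y - my) for x, y in zip(methylation, trait_z))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(methylation, trait_z))
    se = math.sqrt(rss / (n - 2) / sxx)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(slope / se)))
    return slope, se, p


# Toy example: four individuals, one probe (illustrative values only).
effect, se, p = mwas_single_probe(
    trait_z=[0.1, 0.9, 2.1, 2.9],
    methylation=[0.0, 1.0, 2.0, 3.0],
)
```

Running this function once per probe and collecting the effect sizes and p values gives the inputs needed for probe selection and score weighting.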
Methylation-Profile Scores for BMI and Height
In the LBCs and LifeLines DEEP cohort, we first selected CpG probes on the basis of a Bonferroni-corrected association p value threshold (p < 0.05/[number of probes]). To remove redundant CpG probes from the methylation-profile score, if multiple probes passed the p value threshold and had a pairwise correlation greater than 0.1 within a 500-bp window, we selected only the most significant probe for the score. The choice of correlation threshold and window size was based on previous studies that investigated pairwise probe correlation as a function of the distance between probes.37,38 BMI and height methylation-profile scores were calculated as the weighted sum of the selected CpG methylation levels (the weights for each CpG probe were the effect sizes from the MWAS). We used selected probes and effect sizes from the LBC MWAS to generate a methylation-profile score in the LifeLines DEEP cohort, and vice versa.
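The selection, pruning, and scoring steps in this section can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the record layout, the greedy most-significant-first pruning, the use of the absolute correlation, and treating the window as a distance of at most 500 bp are all assumptions, and the probe names are placeholders.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def select_probes(results, methylation, n_tests, max_r=0.1, window=500):
    """Keep Bonferroni-significant probes, dropping any probe that is
    correlated (|r| > max_r) with a more significant kept probe lying
    within `window` bp on the same chromosome (greedy; an assumption)."""
    threshold = 0.05 / n_tests
    hits = sorted((r for r in results if r["p"] < threshold),
                  key=lambda r: r["p"])
    kept = []
    for r in hits:
        redundant = any(
            k["chrom"] == r["chrom"]
            and abs(k["pos"] - r["pos"]) <= window
            and abs(pearson(methylation[k["probe"]],
                            methylation[r["probe"]])) > max_r
            for k in kept
        )
        if not redundant:
            kept.append(r)
    return kept


def profile_score(kept, methylation, n_individuals):
    """Weighted sum of methylation levels; weights are MWAS effect sizes."""
    return [sum(r["beta"] * methylation[r["probe"]][i] for r in kept)
            for i in range(n_individuals)]


# Hypothetical discovery results and target-cohort methylation values.
results = [
    {"probe": "cgA", "chrom": "1", "pos": 1000, "beta": 0.5, "p": 1e-9},
    {"probe": "cgB", "chrom": "1", "pos": 1200, "beta": 0.4, "p": 1e-8},
    {"probe": "cgC", "chrom": "2", "pos": 5000, "beta": 0.3, "p": 1e-8},
    {"probe": "cgD", "chrom": "3", "pos": 7000, "beta": 0.2, "p": 0.5},
]
methylation = {
    "cgA": [0.1, 0.2, 0.3, 0.4],
    "cgB": [0.2, 0.4, 0.6, 0.8],
    "cgC": [0.4, 0.1, 0.3, 0.2],
    "cgD": [0.3, 0.3, 0.1, 0.2],
}
kept = select_probes(results, methylation, n_tests=len(results))
scores = profile_score(kept, methylation, n_individuals=4)
```

Note that the discovery results and the methylation values being scored come from different cohorts in the study design, so effect sizes are never applied to the sample in which they were estimated.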

For BMI, as a secondary replication cohort, we generated an additional methylation-profile score, whereby we selected probes on the basis of results from a larger, independent MWAS on BMI in the Framingham Heart Study (n = 2,377; mean age 67 ± 9 years; age range = 40–93 years; M.M.M., unpublished data). In this analysis, 78 CpG probes had an association p value < 1.22 × 10⁻⁷ (Bonferroni correction for 409,403 probes) and were selected for generating a BMI methylation-profile score. To generate a Framingham-based methylation score in the LBCs, we derived effect sizes for these 78 probes from the LifeLines DEEP MWAS, whereas we derived effect sizes from the LBC MWAS to generate the score in the LifeLines DEEP cohort.

Table 2. Summary of MWAS and GWAS Prediction Analyses

                 Cohort
Trait    Probe Selection    Effect-Size Estimation    Prediction        Location of Results

MWAS Prediction
BMI      LifeLines DEEP     LifeLines DEEP            LBC               Figure 1
BMI      LBC                LBC                       LifeLines DEEP    Figure 1
BMI      Framingham         LifeLines DEEP            LBC               Figure S4
BMI      Framingham         LBC                       LifeLines DEEP    Figure S4
BMI      LifeLines DEEP     LifeLines DEEP            BSGS              Figure 2
BMI      LBC                LBC                       BSGS              Figure 2
BMI      Framingham         LifeLines DEEP            BSGS              Figure 2
BMI      Framingham         LBC                       BSGS              Figure 2
Height   LifeLines DEEP     LifeLines DEEP            LBC               Figure 1
Height   LBC                LBC                       LifeLines DEEP    Figure 1

GWAS Prediction
BMI      GIANT 2015         GIANT 2015                LBC               Figure 1
BMI      GIANT 2015         GIANT 2015                LifeLines DEEP    Figure 1
BMI      GIANT 2015         GIANT 2015                BSGS              Figure 2
Height   GIANT 2014         GIANT 2014                LBC               Figure 1
Height   GIANT 2014         GIANT 2014                LifeLines DEEP    Figure 1

Genetic-Profile Score for BMI and Height
We used SNP genotype data to calculate genetic-profile scores for BMI and height. SNPs and weights (effect sizes) used for generating the genetic-profile scores (the weighted sum of the effect allele count) were based on the GIANT meta-GWAS for BMI in ~350,000 individuals14 and for height in ~250,000 individuals.39 It is important to note that none of the LBC, LifeLines DEEP, or BSGS participants were part of the GIANT meta-GWAS, so discovery bias was not an issue. Prior testing in an independent cohort indicated that using all HapMap3 SNPs provided the best predictor for BMI,14 whereas SNPs that had a p value < 5 × 10⁻⁵ and that were selected with the GCTA-COJO (conditional and joint genome-wide analysis) function in the GCTA software37 provided the best predictor for height.14,39
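As a small illustration of the weighted sum of effect-allele counts described above, the sketch below scores one individual. The function name, the dictionary layout, the SNP identifiers, and the choice to skip SNPs that are missing for an individual are all assumptions made for this example.

```python
def genetic_profile_score(dosages, weights):
    """Weighted sum of effect-allele counts for one individual.

    dosages: SNP id -> effect-allele count (0 to 2; may be fractional
             after imputation).
    weights: SNP id -> effect size from the discovery GWAS.
    SNPs absent from `dosages` are skipped (an assumed handling)."""
    return sum(w * dosages[snp] for snp, w in weights.items()
               if snp in dosages)


# Hypothetical weights and genotypes (placeholder SNP names).
weights = {"rsA": 0.10, "rsB": -0.05, "rsC": 0.02}
dosages = {"rsA": 2, "rsB": 1}
score = genetic_profile_score(dosages, weights)
```

The effect allele in the genotype data must match the allele to which the discovery effect size refers; allele alignment is outside the scope of this sketch.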

Proportion of Phenotypic Variance Explained in the LBCs and LifeLines DEEP Cohort
Using linear regression, in each cohort we estimated how much variance in the sex- and age-adjusted BMI and height phenotypes (adjusted R²) was explained by the methylation- and genetic-profile scores, both individually and combined. We also looked for any evidence of interaction between the methylation- and genetic-profile scores. For each trait in each cohort, we ran the following four regression models and extracted the proportion of variance explained from each:

Model 1: trait ~ MWAS score
Model 2: trait ~ GWAS score
Model 3: trait ~ MWAS score + GWAS score
Model 4: trait ~ MWAS score + GWAS score + (MWAS score × GWAS score)

We used an ANOVA to test whether the interaction model (model 4) explained significantly more of the variation in the phenotype than the additive model (model 3). A summary of the cross-cohort GWAS and MWAS predictions is presented in Table 2.

Proportion of Variance Explained in BMI of Adolescent Individuals
We generated five methylation scores in the BSGS individuals and report how much of the variation in sex- and age-adjusted BMI was explained by each of the scores: (1) probe selection and weights derived from the LBCs; (2) probe selection and weights derived from the LifeLines DEEP cohort; (3) probe selection from the Framingham discovery and weights derived from the LBCs; (4) probe selection from the Framingham discovery and weights derived from the LifeLines DEEP cohort; and (5) probe selection and weights derived from a fixed-effect meta-analysis of the LBCs and LifeLines DEEP cohort.
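The four regression models used in the LBCs and LifeLines DEEP cohort, and the comparison of model 4 against model 3, can be sketched with a small least-squares routine. This is a minimal illustration under stated assumptions: the function names are hypothetical, the solver is a plain Gauss-Jordan elimination on the normal equations, and only the F statistic is returned because converting it to a p value needs an F distribution, which the Python standard library does not provide.

```python
def ols_rss(columns, y):
    """Fit y on an intercept plus the given predictor columns by solving
    the normal equations; return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(coef, X[r]))) ** 2
               for r in range(n))


def adjusted_r2(columns, y):
    """Adjusted R-squared for a model with len(columns) predictors."""
    n, p = len(y), len(columns)
    my = sum(y) / n
    tss = sum((v - my) ** 2 for v in y)
    return 1.0 - (ols_rss(columns, y) / (n - p - 1)) / (tss / (n - 1))


def interaction_f(mwas, gwas, y):
    """F statistic comparing model 4 (with interaction) to model 3."""
    inter = [m * g for m, g in zip(mwas, gwas)]
    rss3 = ols_rss([mwas, gwas], y)
    rss4 = ols_rss([mwas, gwas, inter], y)
    return (rss3 - rss4) / (rss4 / (len(y) - 4))


# Toy scores and phenotype for eight individuals (illustrative only).
mwas = [-1.2, 0.4, 0.9, -0.3, 1.5, -0.8, 0.1, -0.6]
gwas = [0.5, -1.0, 0.3, 1.2, -0.4, -0.9, 0.8, -0.5]
trait = [-0.47, -0.26, 0.68, 0.33, 0.68, -0.55, 0.25, -0.66]
r2_model1 = adjusted_r2([mwas], trait)
r2_model2 = adjusted_r2([gwas], trait)
r2_model3 = adjusted_r2([mwas, gwas], trait)
f_model4_vs_3 = interaction_f(mwas, gwas, trait)
```

Because model 3 is nested in model 4, the residual sum of squares can only decrease when the interaction term is added; the F statistic measures whether that decrease is larger than expected by chance.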
To estimate the proportion of variance accounted for by the profile scores in the BSGS cohort, we corrected the sex- and age-adjusted BMI Z scores (scores standardized within the cohort) for family structure by using a linear mixed model (LMM) analysis in GCTA,41 in which we used SNP genotypes to estimate pedigree relatedness (pairwise relatedness < 0.05 was set to 0 according to the method in Zaitlen et al.42). Residuals from this LMM analysis were used as the sex-, age-, and family-structure-corrected BMI phenotype. The proportion of variance explained in the latter phenotype by each of the abovementioned methylation-profile scores was estimated by linear regression.

Correcting for Cell Count
Chronic inflammation is known to be associated with obesity, and white blood cell counts have been shown to increase with increasing BMI.43 Although the aim of this study was to investigate how much variation in BMI and height is captured by genetic and methylation differences irrespective of causality, we did perform the above analyses on cell-count-corrected methylation data as a sensitivity analysis.

Accession Numbers
The European Genome-phenome Archive (EGA) accession number for the Lothian Birth Cohort methylation data reported in this paper is EGA: EGAS00001000910. The Database of Genotypes and Phenotypes (dbGaP) accession number for the Framingham Heart Study methylation data reported in this paper is dbGaP: phs000724.v2.p9. The NCBI Gene Expression Omnibus (GEO) accession number for the Brisbane Systems Genetics Study methylation data reported in this paper is GEO: GSE56105.

Consortia
The members of the BIOS Consortium are Bastiaan T. Heijmans, Peter A.C. ’t Hoen, Joyce van Meurs, Aaron Isaacs, Rick Jansen, Lude Franke, Dorret I. Boomsma, René Pool, Jenny van Dongen, Jouke J. Hottenga, Marleen M.J. van Greevenbroek, Coen D.A. Stehouwer, Carla J.H. van der Kallen, Casper G. Schalkwijk, Cisca Wijmenga, Alexandra Zhernakova, Ettje F. Tigchelaar, P. Eline Slagboom, Marian Beekman, Joris Deelen, Diana van Heemst, Jan H. Veldink, Leonard H. van den Berg, Cornelia M. van Duijn, Bert A. Hofman, André G. Uitterlinden, P. Mila Jhamai, Michael Verbiest, H. Eka D. Suchiman, Marijn Verkerk, Ruud van der Breggen, Jeroen van Rooij, Nico Lakenberg, Hailiang Mei, Maarten van Iterson, Michiel van Galen, Jan Bot, Peter van ’t Hof, Patrick Deelen, Irene Nooren, Matthijs Moed, Martijn Vermaat, Daria V. Zhernakova, René Luijk, Marc Jan Bonder, Freerk van Dijk, Wibowo Arindrarto, Szymon M. Kielbasa, Morris A. Swertz, and Erik W. van Zwet.

Web Resources
The URLs for data presented herein are as follows:
Database of Genotypes and Phenotypes (dbGaP), http://www.ncbi.nlm.nih.gov/gap
European Genome-phenome Archive (EGA), https://www.ebi.ac.uk/ega/home
Gene Expression Omnibus (GEO), http://www.ncbi.nlm.nih.gov/geo/

References
1. Abdullah, A., Peeters, A., de Courten, M., and Stoelwinder, J. (2010). The magnitude of association between overweight and obesity and the risk of diabetes: a meta-analysis of prospective cohort studies. Diabetes Res. Clin. Pract. 89, 309–319.
2. Møller, H., Mellemgaard, A., Lindvig, K., and Olsen, J.H. (1994). Obesity and cancer risk: a Danish record-linkage study. Eur. J. Cancer 30A, 344–350.
3. Poirier, P., Giles, T.D., Bray, G.A., Hong, Y., Stern, J.S., Pi-Sunyer, F.X., and Eckel, R.H.; American Heart Association; Obesity Committee of the Council on Nutrition, Physical Activity, and Metabolism (2006). Obesity and cardiovascular disease: pathophysiology, evaluation, and effect of weight loss: an update of the 1997 American Heart Association Scientific Statement on Obesity and Heart Disease from the Obesity Committee of the Council on Nutrition, Physical Activity, and Metabolism. Circulation 113, 898–918.
4. Lavie, C.J., Milani, R.V., and Ventura, H.O. (2009). Obesity and cardiovascular disease: risk factor, paradox, and impact of weight loss. J. Am. Coll. Cardiol. 53, 1925–1932.
5. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thorleifsson, G., Jackson, A.U., Lango Allen, H., Lindgren, C.M., Luan, J., Mägi, R., et al.; MAGIC; Procardis Consortium (2010). Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat. Genet. 42, 937–948.
6. Dick, K.J., Nelson, C.P., Tsaprouni, L., Sandling, J.K., Aïssi, D., Wahl, S., Meduri, E., Morange, P.E., Gagnon, F., Grallert, H., et al. (2014). DNA methylation and body-mass index: a genome-wide analysis. Lancet 383, 1990–1998.
7. Hannum, G., Guinney, J., Zhao, L., Zhang, L., Hughes, G., Sadda, S., Klotzle, B., Bibikova, M., Fan, J.B., Gao, Y., et al. (2013). Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell 49, 359–367.
8. Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biol. 14, R115.
9. Shenker, N.S., Polidoro, S., van Veldhoven, K., Sacerdote, C., Ricceri, F., Birrell, M.A., Belvisi, M.G., Brown, R., Vineis, P., and Flanagan, J.M. (2013). Epigenome-wide association study in the European Prospective Investigation into Cancer and Nutrition (EPIC-Turin) identifies novel genetic loci associated with smoking. Hum. Mol. Genet. 22, 843–851.

10. Elks, C.E., den Hoed, M., Zhao, J.H., Sharp, S.J., Wareham, N.J., Loos, R.J., and Ong, K.K. (2012). Variability in the heritability of body mass index: a systematic review and meta-regression. Front. Endocrinol. (Lausanne) 3, 29.
11. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I., Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris, J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Res. 6, 399–408.
12. Macgregor, S., Cornes, B.K., Martin, N.G., and Visscher, P.M. (2006). Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Hum. Genet. 120, 571–580.
13. Hemani, G., Yang, J., Vinkhuyzen, A., Powell, J.E., Willemsen, G., Hottenga, J.J., Abdellaoui, A., Mangino, M., Valdes, A.M., Medland, S.E., et al. (2013). Inference of the genetic architecture underlying BMI and height with the use of 20,240 sibling pairs. Am. J. Hum. Genet. 93, 865–875.
14. Locke, A.E., Kahali, B., Berndt, S.I., Justice, A.E., Pers, T.H., Day, F.R., Powell, C., Vedantam, S., Buchkovich, M.L., Yang, J., et al.; LifeLines Cohort Study; ADIPOGen Consortium; AGEN-BMI Working Group; CARDIOGRAMplusC4D Consortium; CKDGen Consortium; GLGC; ICBP; MAGIC Investigators; MuTHER Consortium; MIGen Consortium; PAGE Consortium; ReproGen Consortium; GENIE Consortium; International Endogene Consortium (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206.
15. Must, A., and Anderson, S.E. (2006). Body mass index in children and adolescents: considerations for population-based applications. Int. J. Obes. 30, 590–594.
16. Fehrmann, R.S., Karjalainen, J.M., Krajewska, M., Westra, H.J., Maloney, D., Simeonov, A., Pers, T.H., Hirschhorn, J.N., Jansen, R.C., Schultes, E.A., et al. (2015). Gene expression analysis identifies global gene dosage sensitivity in cancer. Nat. Genet. 47, 115–125.
17. Delano, D., Eberle, M., Galver, L., and Rosenow, C. (2010). Array Differences in Genomic Coverage and Data Quality Impact GWAS Success. Illumina. http://www.illumina.com/documents/products/whitepapers/whitepaper_gwas_array.pdf
18. Slieker, R.C., Bos, S.D., Goeman, J.J., Bovée, J.V., Talens, R.P., van der Breggen, R., Suchiman, H.E., Lameijer, E.W., Putter, H., van den Akker, E.B., et al. (2013). Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450k array. Epigenetics Chromatin 6, 26.
19. Deary, I.J., Whiteman, M.C., Starr, J.M., Whalley, L.J., and Fox, H.C. (2004). The impact of childhood intelligence on later life: following up the Scottish mental surveys of 1932 and 1947. J. Pers. Soc. Psychol. 86, 130–147.
20. Deary, I.J., Gow, A.J., Taylor, M.D., Corley, J., Brett, C., Wilson, V., Campbell, H., Whalley, L.J., Visscher, P.M., Porteous, D.J., and Starr, J.M. (2007). The Lothian Birth Cohort 1936: a study to examine influences on cognitive ageing from age 11 to age 70 and beyond. BMC Geriatr. 7, 28.
21. Deary, I.J., Gow, A.J., Pattie, A., and Starr, J.M. (2012). Cohort profile: the Lothian Birth Cohorts of 1921 and 1936. Int. J. Epidemiol. 41, 1576–1584.
22. Tigchelaar, E.F., Zhernakova, A., Dekens, J.A.M., Hermes, G., Baranska, A., Mujagic, Z., Swertz, M.A., Muñoz, A.M., Deelen, P., Cénit, M.C., et al. (2014). An introduction to LifeLines DEEP: study design and baseline characteristics. bioRxiv. http://biorxiv.org/content/early/2014/09/16/009217
23. Powell, J.E., Henders, A.K., McRae, A.F., Caracella, A., Smith, S., Wright, M.J., Whitfield, J.B., Dermitzakis, E.T., Martin, N.G., Visscher, P.M., and Montgomery, G.W. (2012). The Brisbane Systems Genetics Study: genetical genomics meets complex trait genetics. PLoS ONE 7, e35430.
24. McRae, A.F., Powell, J.E., Henders, A.K., Bowdler, L., Hemani, G., Shah, S., Painter, J.N., Martin, N.G., Visscher, P.M., and Montgomery, G.W. (2014). Contribution of genetic variation to transgenerational inheritance of DNA methylation. Genome Biol. 15, R73.
25. Bibikova, M., Barnes, B., Tsan, C., Ho, V., Klotzle, B., Le, J.M., Delano, D., Zhang, L., Schroth, G.P., Gunderson, K.L., et al. (2011). High density DNA methylation array with single CpG site resolution. Genomics 98, 288–295.

26. Shah, S., McRae, A.F., Marioni, R.E., Harris, S.E., Gibson, J., Henders, A.K., Redmond, P., Cox, S.R., Pattie, A., Corley, J., et al. (2014). Genetic and environmental exposures constrain epigenetic drift over the human life course. Genome Res. 24, 1725–1733.
27. Touleimat, N., and Tost, J. (2012). Complete pipeline for Infinium(®) Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 4, 325–341.
28. Bonder, M.J., Kasela, S., Kals, M., Tamm, R., Lokk, K., Barragan, I., Buurman, W.A., Deelen, P., Greve, J.W., Ivanov, M., et al. (2014). Genetic and epigenetic regulation of gene expression in fetal and adult human livers. BMC Genomics 15, 860.
29. Genome of the Netherlands Consortium (2014). Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 818–825.
30. Pidsley, R., Y Wong, C.C., Volta, M., Lunnon, K., Mill, J., and Schalkwyk, L.C. (2013). A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics 14, 293.
31. 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073.
32. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G.R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959.
33. Howie, B., Marchini, J., and Stephens, M. (2011). Genotype imputation with thousands of genomes. G3 (Bethesda) 1, 457–470.
34. Parkes, M., Cortes, A., van Heel, D.A., and Brown, M.A. (2013). Genetic insights into common pathways and complex relationships among immune-mediated diseases. Nat. Rev. Genet. 14, 661–673.
35. Deelen, P., Menelaou, A., van Leeuwen, E.M., Kanterakis, A., van Dijk, F., Medina-Gomez, C., Francioli, L.C., Hottenga, J.J., Karssen, L.C., Estrada, K., et al.; Genome of Netherlands Consortium (2014). Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur. J. Hum. Genet. 22, 1321–1326.
36. Medland, S.E., Nyholt, D.R., Painter, J.N., McEvoy, B.P., McRae, A.F., Zhu, G., Gordon, S.D., Ferreira, M.A., Wright, M.J., Henders, A.K., et al. (2009). Common variants in the trichohyalin gene are associated with straight hair in Europeans. Am. J. Hum. Genet. 85, 750–755.
37. Ong, M.L., and Holbrook, J.D. (2014). Novel region discovery method for Infinium 450K DNA methylation data reveals changes associated with aging in muscle and neuronal pathways. Aging Cell 13, 142–155.
38. Huynh, J.L., Garg, P., Thin, T.H., Yoo, S., Dutta, R., Trapp, B.D., Haroutunian, V., Zhu, J., Donovan, M.J., Sharp, A.J., and Casaccia, P. (2014). Epigenome-wide differences in pathology-free regions of multiple sclerosis-affected brains. Nat. Neurosci. 17, 121–130.
39. Wood, A.R., Esko, T., Yang, J., Vedantam, S., Pers, T.H., Gustafsson, S., Chu, A.Y., Estrada, K., Luan, J., Kutalik, Z., et al.; Electronic Medical Records and Genomics (eMERGE) Consortium; MIGen Consortium; PAGE Consortium; LifeLines Cohort Study (2014). Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46, 1173–1186.
40. Yang, J., Lee, S.H., Goddard, M.E., and Visscher, P.M. (2013). Genome-wide complex trait analysis (GCTA): methods, data analyses, and interpretations. Methods Mol. Biol. 1019, 215–236.
41. Yang, J., Lee, S.H., Goddard, M.E., and Visscher, P.M. (2011). GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82.
42. Zaitlen, N., Lindström, S., Pasaniuc, B., Cornelis, M., Genovese, G., Pollack, S., Barton, A., Bickeböller, H., Bowden, D.W., Eyre, S., et al. (2012). Informed conditioning on clinical covariates increases power in case-control association studies. PLoS Genet. 8, e1003032.

43. Dixon, J.B., and O’Brien, P.E. (2006). Obesity and the white blood cell count: changes with sustained weight loss. Obes. Surg. 16, 251–257.

Acknowledgments
We thank the cohort participants and team members who contributed to these studies. Phenotype collection in the Lothian Birth Cohort 1921 was supported by the UK’s Biotechnology and Biological Sciences Research Council (BBSRC), The Royal Society and The Chief Scientist Office of the Scottish Government. Phenotype collection in the Lothian Birth Cohort 1936 was supported by Age UK (The Disconnected Mind project). Methylation typing was supported by Centre for Cognitive Ageing and Cognitive Epidemiology (Pilot Fund award), Age UK, The Wellcome Trust Institutional Strategic Support Fund, The University of Edinburgh, and The University of Queensland. REM, SEH, DL, JMS, IJD and PMV are members of the University of Edinburgh Centre for Cognitive Ageing and Cognitive Epidemiology (CCACE). CCACE is supported by funding from the BBSRC, the Medical Research Council (MRC), and the University of Edinburgh as part of the cross-council Lifelong Health and Wellbeing initiative (MR/K026992/1). Research reported in this publication was supported by National Health and Medical Research Council (NHMRC) project grants 613608, APP496667, APP1010374 and APP1046880, by NHMRC Fellowships to GWM, PMV, and NRW, and by an Australia Research Council (ARC) Future Fellowship to NRW (FT0991360). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NHMRC or ARC. LF was financially supported by grants from the Netherlands Organization for Scientific Research (NWO-VENI grant 916-10135 and NWO VIDI grant 917-14374) and a Horizon Breakthrough grant from the Netherlands Genomics Initiative (grant 92519031). The research leading to these results has received funding from the European Community’s Health Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 259867.
The Framingham Heart Study is funded by National Institutes of Health contract N01-HC-25195. The laboratory work for this investigation was funded by the Division of Intramural Research, National Heart, Lung, and Blood Institute, National Institutes of Health, and by a Director’s Challenge Award, National Institutes of Health (DL, PI). The analytical component of this project was funded by the Division of Intramural Research, National Heart, Lung, and Blood Institute, and the Center for Information Technology, National Institutes of Health, Bethesda, MD. This study utilized the computational resources of the Biowulf system at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). The Lifelines-Deep work was supported by the European Research Council Advanced Grant (ERC-671274 to CW), the Dutch Digestive Diseases Foundation (MLDS WO11-30 to CW), the European Union’s Seventh Framework Programme (EU FP7) TANDEM project (HEALTH-F3-2012-305279 to CW), the Netherlands Organization for Scientific Research (NWO-VENI grant 916-10135 to LF and NWO VIDI grant 917-14374 to LF) and by Top Institute Food and Nutrition Wageningen (GH001 to CW). Generation of the methylation data (as part of the Biobank-based Integrative Omics Study (BIOS)) is financially supported by the Biobanking and Biomolecular Research Infrastructure of The Netherlands (BBMRI-NL), funded by the Netherlands Organisation for Scientific Research (NWO). AZ holds a Rosalind Franklin fellowship (University of Groningen).

Additional files
The following supplemental data files are available with the online version of this paper.
Figure S1 BMI distribution.
Figure S2 Height distribution.
Figure S3 MWAS QQ plots.
Figure S4 Framingham-based BMI methylation scores.
Figure S5 Proportion of variance explained in sex- and age-adjusted BMI and height phenotype after correction for cell count.
Figure S6 Causality and reverse causation.
Table S1 CpG probes significantly associated with BMI in LBC and LifeLines DEEP.
Table S2 Correlation between sex- and age-adjusted BMI and height with methylation and genetic predictors.
Table S3 Additive vs interaction model.
Table S4 Relative contribution of methylation and genetic scores to variance in BMI.
Table S5 Effect of p-value threshold on prediction ability of methylation score.

Disease variants alter transcription factor levels and methylation of their binding sites

Nature Genetics, doi:10.1038/ng.3721

Marc Jan Bonder1,*, René Luijk2,*, Daria V. Zhernakova1, Matthijs Moed2, Patrick Deelen1,3, Martijn Vermaat4, Maarten van Iterson2, Freerk van Dijk1,3, Michiel van Galen3, Jan Bot5, Roderick C. Slieker2, P. Mila Jhamai6, Michael Verbiest3, H. Eka D. Suchiman2, Marijn Verkerk6, Ruud van der Breggen2, Jeroen van Rooij6, Nico Lakenberg2, Wibowo Arindrarto8, Szymon M. Kielbasa7, Iris Jonkers2, Peter van ’t Hof7, Irene Nooren5, Marian Beekman2, Joris Deelen2, Diana van Heemst9, Alexandra Zhernakova1, Ettje F. Tigchelaar1, Morris A. Swertz1,3, Albert Hofman10, André G. Uitterlinden6, René Pool11, Jenny van Dongen11, Jouke J. Hottenga11, Coen D.A. Stehouwer12,13, Carla J.H. van der Kallen12,13, Casper G. Schalkwijk12,13, Leonard H. van den Berg14, Erik W. van Zwet8, Hailiang Mei7, Yang Li1, Mathieu Lemire15, Thomas J. Hudson15,16,17, BIOS Consortium18, P. Eline Slagboom2, Cisca Wijmenga1, Jan H. Veldink14, Marleen M.J. van Greevenbroek12,13, Cornelia M. van Duijn19, Dorret I. Boomsma11, Aaron Isaacs19, Rick Jansen20, Joyce B.J. van Meurs6, Peter A.C. ’t Hoen4,#, Lude Franke1,#, Bastiaan T. Heijmans2,#

Abstract
Most disease-associated genetic variants are non-coding, making it challenging to design experiments to understand their functional consequences1,2. Identification of expression quantitative trait loci (eQTLs) has been a powerful approach to infer downstream effects of disease variants but the large majority remains unexplained3,4. The analysis of DNA methylation, a key component of the epigenome5,6, offers highly complementary data on the regulatory potential of genomic regions7,8. Here, we show that disease variants have widespread effects on DNA methylation in trans that likely reflect differential occupancy of trans-binding sites by cis-regulated transcription factors.
Using multiple omics data on 3,841 Dutch individuals, we identified 1,907 established trait-associated SNPs that affect methylation levels of 10,141 different CpG sites in trans (FDR<0.05). These included SNPs that affect both the expression of a nearby transcription factor (like NFKB1, CTCF and NKX2-3) and methylation of its respective binding site across the genome. Trans-meQTLs effectively expose downstream effects of disease-associated variants.

1. Department of Genetics, University of Groningen, University Medical Centre Groningen, Groningen, The Netherlands; 2. Molecular Epidemiology Section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands; 3. Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, the Netherlands; 4. Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; 5. SURFsara, Amsterdam, the Netherlands; 6. Department of Internal Medicine, ErasmusMC, Rotterdam, The Netherlands; 7. Sequence Analysis Support Core, Leiden University Medical Center, Leiden, The Netherlands; 8. Medical Statistics Section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands; 9. Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden, The Netherlands; 10. Department of Epidemiology, ErasmusMC, Rotterdam, The Netherlands; 11. Department of Biological Psychology, VU University Amsterdam, Neuroscience Campus Amsterdam, Amsterdam, The Netherlands; 12. Department of Internal Medicine, Maastricht University Medical Center, Maastricht, The Netherlands; 13. School for Cardiovascular Diseases (CARIM), Maastricht University Medical Center, Maastricht, The Netherlands; 14. Department of Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands; 15. Ontario Institute for Cancer Research, Toronto, Ontario, Canada M5G 0A3; 16. Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada M5S 1A1; 17. Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A1; 18. A full list of members and affiliations appears in the Supplementary Note; 19. Genetic Epidemiology Unit, Department of Epidemiology, ErasmusMC, Rotterdam, The Netherlands; 20. 
Department of Psychiatry, VU University Medical Center, Neuroscience Campus Amsterdam, Amsterdam, The Netherlands; * These authors contributed equally to this work; # These authors jointly directed this work; Correspondence to: Lude Franke ([email protected]) and Bastiaan T. Heijmans ([email protected])

To systematically study the role of DNA methylation in explaining downstream effects of genetic variation, we analysed genome-wide genotype and DNA methylation in whole blood from 3,841 samples from five Dutch biobanks9–13 (Figure 1, Supplementary Table 1, Supplemental Text). We found cis-meQTL effects for 34.4% of all 405,709 tested CpGs (n=139,566 at a CpG-level FDR of 5%, P≤1.38x10^-4), typically with a short physical distance between the SNP and CpG (median distance 10 kb, Supplementary Fig. 1). By regressing out the primary meQTL effect for each of these CpGs and repeating the cis-meQTL mapping, we observed up to 16 independent cis-meQTLs for these CpGs (Supplementary Table 2), totalling 272,037 independent cis-meQTL effects. Few factors determine whether a CpG site shows a cis-meQTL effect except the variance in methylation level of the CpG site involved (Supplementary Fig. 2, Supplementary Fig. 3a). The proportion of methylation variance explained by SNPs, however, is typically small (Supplementary Fig. 3b). When accounting for this strong effect of CpG variation, we find only modest enrichments and depletions for cis-meQTL CpG sites for CpG island and genic annotation (Supplementary Fig. 3e) or when using annotations of biological function based on chromatin segmentations of 27 blood cell types (Figure 2a).

[Figure 1 panels. a, meQTL, eQTL and eQTM plots for cg23533927 (7:150,548,934-150,548,935), TMEM176B (7:150,488,373-150,498,448) and rs7806458 (7:150,476,888). b, "Many CpG-sites are under genetic control": no meQTL (64.7%), cis-meQTLs (32.8%), cis- & trans-meQTL (1.6%), trans-meQTL (0.9%); eQTMs within category (3.2%). c, "Trans-meQTLs are robust across cell types": significant trans-meQTL effects in peripheral blood (Z-score) against trans-meQTLs in lymphocytes (Z-score); 94.9% of trans-meQTLs show consistent allelic direction.]

Figure 1. Overview of a genomic region around TMEM176B and characteristics of CpGs associated to meQTLs and eQTMs. a, In the illustration, the relations between a SNP, DNA methylation at nearby CpGs, and the associations with the gene itself are shown. Boxes show the median and the inter-quartile range (IQR). Whiskers show the outer quartile plus 1.5 times the IQR. The top left plot shows the observed methylation Quantitative Trait Locus (meQTL) between cg23533927 and rs7806458. The top right plot shows the observed expression Quantitative Trait Locus (eQTL) between TMEM176B and rs7806458. The observed methylation-expression association (eQTM) between TMEM176B and cg23533927 is shown below the gene. The left part shows the data before correction for the cis-eQTL and cis-meQTL; the eQTM effect after correction for cis-eQTLs and cis-meQTLs is shown on the right. b, Two overlaid pie charts. The inner chart indicates the proportion of tested CpGs harboring meQTLs. Over 35% of all tested CpGs show evidence for harboring a meQTL, either in cis or in trans. The outer chart indicates what CpGs are associated with gene expression in cis (in total 3.2%). c, Replication of peripheral blood trans-meQTLs in lymphocytes.
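The conditional analysis described in the text (regress out the primary meQTL effect for a CpG, then repeat the mapping) can be sketched as follows. This is a minimal illustration assuming additive genotype dosages, ordinary least squares and a fixed threshold on the absolute correlation; the function names and the threshold are hypothetical simplifications and not the pipeline used in the study, which works with an FDR-based cut-off.

```python
from statistics import mean


def ols_residuals(y, x):
    """Residuals of y after regressing it on one predictor x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return [yi - my for yi in y]
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - beta * (xi - mx) for xi, yi in zip(x, y)]


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def independent_meqtls(methylation, genotypes, r_threshold=0.5, max_rounds=16):
    """Pick the strongest SNP, regress it out, and re-map on the residuals.

    r_threshold is a stand-in (assumption) for a proper significance cut-off.
    """
    residual = list(methylation)
    selected = []
    for _ in range(max_rounds):
        best_snp, best_r = None, 0.0
        for snp, dosage in genotypes.items():
            if snp in selected:
                continue
            r = pearson_r(dosage, residual)
            if abs(r) > abs(best_r):
                best_snp, best_r = snp, r
        if best_snp is None or abs(best_r) < r_threshold:
            break
        selected.append(best_snp)
        residual = ols_residuals(residual, genotypes[best_snp])
    return selected


# Toy data: two uncorrelated SNPs that both contribute to one CpG.
toy_genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "snpB": [0, 0, 0, 1, 1, 1, 2, 2, 2],
}
toy_methylation = [2 * a + b for a, b in zip(*toy_genotypes.values())]
print(independent_meqtls(toy_methylation, toy_genotypes))
```

In the toy data the weaker second effect only passes the threshold once the primary effect has been removed, which is the point of repeating the mapping on residuals.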

[Figure 2 panels. a-c, cis-meQTL, eQTM and trans-meQTL enrichment per predicted chromatin state. d, decision tree with splits on % H3K27ac, % H3K27me3, absolute distance to TSS, H3K36me3/H3K27ac, H3K4me3/H3K27ac, H3K4me1/H3K27me3 and H3K4me1/H3K9me3, showing positive and negative direction. e, true positive rate against false positive rate, AUC = 0.826.]

Figure 2. Characterization of identified cis- and trans-meQTL and eQTM-effects. a-c, Over- or underrepresentation of CpGs for predicted chromatin states for cis-meQTLs, trans-meQTLs and eQTMs. Grey bars reflect uncorrected enrichments, colored bars reflect enrichments after correction for factors influencing the likelihood of harboring a meQTL or eQTM, including methylation variability. Bar graphs show odds ratios and error bars (95% confidence interval). CGI: CpG island; TssA: Active TSS; TssAFlnk: Flanking active TSS; TxFlnk: Transcribed at gene 5’ and 3’; Tx: Strong transcription; TxWk: Weak transcription; EnhG: Genic enhancer; Enh: Enhancer; ZNF/Rpts: ZNF genes and repeats; Het: Heterochromatin; TssBiv: Bivalent/Poised TSS; BivFlnk: Flanking bivalent TSS/Enhancer; EnhBiv: Bivalent enhancer. d, Decision tree for predicting the effect direction of eQTMs. Each subplot shows the distributions for positive (blue) and negative (red) associations for that subset of the data. Dashed vertical lines indicate the optimal split used by the algorithm. The boxes in the leaves indicate the number of positive and negative effects in each of the leaves. e, Receiver operator characteristic curve showing the performance of the decision tree.

We contrasted these modest functional enrichments to CpGs whose methylation levels correlate with gene expression in cis (i.e. mapping expression quantitative trait methylations (eQTMs)), by generating RNA-seq data for 2,101 out of 3,841 individuals in our study. Using a conservative approach that maximally accounts for potential biases (see Methods), we identified 12,809 unique CpGs that correlated to 3,842 unique genes in cis (CpG-level FDR < 0.05). eQTMs were enriched for mapping in active regions, e.g. in and around active transcription start sites (TSSs) (3-fold enrichment, P=1.8x10^-91) and enhancers (2-fold enrichment, P=1.1x10^-139, Figure 2b). The majority of eQTMs showed the canonical negative correlation with transcriptional activity (69.2%), but a substantial minority of correlations was positive (30.8%), in line with recent evidence that DNA methylation does not always negatively correlate with gene expression14. As expected, negatively correlated eQTMs were enriched in active regions like active TSSs (3.7-fold enrichment, P=9.5x10^-202). Positive correlations primarily occurred in repressed regions (e.g. Polycomb repressed, 3.4-fold enrichment, P=5.8x10^-103) (Supplementary Fig. 4). The sharp contrast between positively and negatively associated eQTMs enabled us to predict the direction of the correlation. A decision tree trained on the strongest eQTMs (those with an FDR < 9.7x10^-6, n=5,137), using data on histone marks and distance relative to the gene, could predict the direction with an area under the curve of 0.83 (95% confidence interval, 0.78-0.87) (Figure 2d, e).

We next ascertained whether trans-meQTLs are biologically informative, since previous trans-eQTL mapping studies demonstrated that identifying trans-expression effects provides a powerful tool to uncover and understand downstream biological effects of disease-SNPs3,15,16. We focussed on 6,111 SNPs that were previously associated with complex traits and diseases (‘trait-associated SNPs’, see Methods and Supplementary Table 3). We observed that one-third of these trait-associated SNPs (1,907 SNPs, 31.2%) affect methylation in trans at 10,141 CpG sites, totalling 27,816 SNP-CpG combinations (FDR<0.05, P<2.6x10^-7, Figure 3a). This represents a 5-fold increase in the number of CpG sites affected as compared with a previous trans-meQTL mapping study17. We evaluated whether the GWAS SNPs themselves were likely underlying the trans-effects or whether the associations could be attributed to another SNP in moderate LD. Of the 1,907 GWAS SNPs with trans-effects, 1,538 (80.6%) were in strong LD with the top SNP (R2 > 0.8), indicating that the GWAS SNPs indeed are the driving force behind many of the trans-meQTLs. Of note, due to the sparse coverage of the Illumina 450k array, the true number of CpGs in the genome that are altered by these trait-associated SNPs will be substantially higher.

To validate our trans-meQTLs, we performed a replication analysis in a set of 1,748 lymphocyte samples17. Of the 18,764 overlapping trans-meQTLs, 94.9% had a consistent allelic direction (Figure 1c; Supplementary Table 4). This indicates that the identified trans-meQTLs are robust and not caused by differences in cell-type composition. Further analysis of SNPs known to influence blood cell composition18,19 showed no or only few trans-effects, and alternative adjustments of the methylation data corroborated the stability of trans-effects, both indicating a limited influence of cell type composition (Supplementary Results, Supplementary Tables 5–7).

After the identification of the trans-meQTLs, we assessed if the trans-SNPs also affected expression of the genes associated with the trans-CpGs. By overlaying the trans-meQTLs and cis-eQTMs, we could link 436 SNPs to 850 genes, totalling 2,889 SNP-gene pairs. We found significant associations (trans-eQTLs) (FDR < 0.05) for 8.4% of these effects, and 91% of these effects showed the expected direction of the effect, given the directions of the trans-meQTLs and cis-eQTMs (Supplementary Table 8).
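The "expected direction" of a trans-eQTL follows from chaining two signs: the effect of the SNP on the CpG (trans-meQTL) and the correlation between that CpG and the gene (cis-eQTM). The sketch below illustrates that sign logic only; the record layout and function names are hypothetical, and the numbers are invented for illustration.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def expected_trans_eqtl_sign(meqtl_z, eqtm_r):
    """Direction of the SNP effect on expression implied by SNP->CpG and CpG->gene."""
    return sign(meqtl_z) * sign(eqtm_r)


def fraction_concordant(records):
    """records: iterable of (meqtl_z, eqtm_r, observed_eqtl_z) tuples."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        1
        for meqtl_z, eqtm_r, eqtl_z in records
        if sign(eqtl_z) == expected_trans_eqtl_sign(meqtl_z, eqtm_r)
    )
    return hits / len(records)


# Invented example: two concordant SNP-gene pairs and one discordant pair.
example = [(5.0, -0.3, -4.2), (-6.1, -0.4, 3.9), (4.8, 0.2, -3.5)]
print(fraction_concordant(example))
```

For example, an allele that raises methylation at a CpG that is negatively correlated with a gene is expected to lower expression of that gene.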

[Figure 3 panels. a, 4,204 GWAS SNPs without trans-meQTLs and 1,907 GWAS SNPs with trans-meQTLs; enrichment of GWAS SNPs per trait category (immune, cancer, cardiovascular, metabolic, neurological, hematological, anthropometric, various); interchromosomal Hi-C contacts, frequency against number of Hi-C contacts for permutations and observed data. b, chromosomal location of CpG-site against chromosomal location of SNP, with dot size reflecting the Z-score and separate markers for trans-meQTLs with and without Hi-C contact.]

Figure 3. trans-meQTLs are related to Hi-C interchromosomal contacts and enrichment of GWAS category of trans-meQTL SNPs. a, Distribution of tested trait-associated SNPs influencing DNA methylation in trans. Over 1,900 SNPs (31.2%) of all tested SNPs have downstream effects on DNA methylation. For the associated GWAS SNPs we show the overrepresentation of SNPs with trans-meQTLs in different GWAS trait categories, where the y-axis shows the odds ratio (bottom left). Hi-C contacts are overrepresented among trans-meQTLs. Grey bars show the number of Hi-C contacts using permuted data, while the red bar reflects the actually observed number in our data (bottom right). b, Dot-plot depicting the trans-meQTLs. The effect strength is reflected by the size of the dot. Red dots indicate an overlap with a Hi-C contact. Several SNPs with widespread trans-meQTLs show inter-chromosomal contacts genome-wide, further implicating an important role for those SNPs in the development of the associated trait.

In contrast to cis-meQTL CpGs, trans-meQTL CpGs show substantial functional enrichments: they are enriched around TSSs and depleted in heterochromatin (Figure 2c) and are strongly enriched for being an eQTM (1,913 CpGs (18.9%), 5.2-fold, P=2.3x10^-101). Among the 1,907 trait-associated SNPs that make up the trans-meQTLs there was an overrepresentation of GWAS-identified SNPs associated with immune- and cancer-related traits (Figure 3a). The large majority of trans-meQTLs were inter-chromosomal (93%, 9,429 CpG-SNP pairs) and included 12 trans-meQTL SNPs (yielding 3,616 unique CpG-SNP pairs) that each showed downstream trans-meQTL effects across all of the 22 autosomal chromosomes (i.e. trans-bands, Figure 3b). We subsequently studied the nature of these trans-meQTLs.
Using high-resolution Hi-C data20, we identified 720 SNP-CpG pairs (including 402 CpG sites and 172 SNPs) among the trans-meQTLs that overlapped with an inter-chromosomal contact, which is 2.9-fold more frequent than expected by chance (P=3.7x10^-126, Figure 3). The enrichment for Hi-C inter-chromosomal contacts remained after removing SNPs that were responsible for trans-bands (P=1.7x10^-61). Hence, inter-chromosomal contacts may produce associations between SNPs and CpGs in trans. In order to characterize the 720 SNP-CpG pairs overlapping with inter-chromosomal contacts, we performed motif enrichments using three motif enrichment analyses (Homer, PWMEnrich, DEEPbind)21–23. These analyses revealed that the 402 CpG sites involved frequently overlapped with CTCF, RAD21 and SMC3 binding sites (P=2.3x10^-5, P=3.5x10^-5 and P=5.1x10^-5, respectively), factors known to regulate chromatin architecture24,25. An analysis of ChIP-Seq data on CTCF binding confirmed this finding (1.8-fold enrichment, P=5.2x10^-7).
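The comparison of observed Hi-C overlaps with permuted data (Figure 3a) can be illustrated with a small permutation sketch. It assumes that SNPs and CpGs have already been assigned to genomic bins and that the null is generated by shuffling the CpG bins across pairs; both the binning and this particular null model are assumptions made for illustration, since the text does not specify the permutation scheme.

```python
import random


def count_contact_overlaps(pairs, contacts):
    """Number of (snp_bin, cpg_bin) pairs that coincide with a Hi-C contact."""
    return sum(1 for pair in pairs if pair in contacts)


def hic_enrichment(pairs, contacts, n_permutations=200, seed=1):
    """Return (observed overlaps, fold enrichment, empirical P)."""
    rng = random.Random(seed)
    observed = count_contact_overlaps(pairs, contacts)
    snp_bins = [snp_bin for snp_bin, _ in pairs]
    cpg_bins = [cpg_bin for _, cpg_bin in pairs]
    permuted_counts = []
    for _ in range(n_permutations):
        shuffled = cpg_bins[:]
        rng.shuffle(shuffled)  # break the SNP-CpG pairing, keep both marginals
        permuted_counts.append(
            count_contact_overlaps(zip(snp_bins, shuffled), contacts)
        )
    mean_permuted = sum(permuted_counts) / n_permutations
    fold = observed / mean_permuted if mean_permuted > 0 else float("inf")
    # Add-one correction so the empirical P is never exactly zero.
    exceed = sum(1 for count in permuted_counts if count >= observed)
    empirical_p = (1 + exceed) / (n_permutations + 1)
    return observed, fold, empirical_p


# Hypothetical bins: 20 trans-meQTL pairs, 10 of which match a contact.
toy_pairs = [(i, 100 + i) for i in range(20)]
toy_contacts = {(i, 100 + i) for i in range(10)}
print(hic_enrichment(toy_pairs, toy_contacts))
```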

We next tested whether the trans-meQTLs reflected the effect of differential transcription factor (TF) binding of TFs that map close to the SNPs. The rationale for this hypothesis is that binding of TFs has been linked to changes in local DNA methylation, primarily loss-of-methylation upon TF binding and gain-of-methylation after loss of TF occupancy7,8. This model suggests that trans-meQTLs may be attributed to SNPs affecting the expression of a TF in cis and that the SNP allele preferentially has a unidirectional effect on DNA methylation. In line with this prediction, we observed that if a SNP is associated with multiple CpG sites in trans (at least 10, n=305), the direction of the association of the SNP was consistently skewed towards either increased or decreased DNA methylation. On average 76% of the CpGs per trans-meQTL SNP displayed the same direction of effect (expected 50%, P=10^-111; Figure 4a). A significant skew in direction of the allelic effect was present for 59.7% of the 305 individual SNPs with at least 10 trans-meQTL effects and increased to 95.2% for the 104 SNPs with at least 50 trans-meQTL effects (binomial test P<0.05), suggesting that differential TF binding may explain a substantial fraction of trans-meQTLs.

In order to explore this mechanism further, we combined ChIP-seq data on TF binding at CpGs and cis-expression effects of SNPs to directly examine the involvement of TFs in mediating trans-meQTLs. Among trait-associated SNPs influencing at least 10 CpGs in trans (n=305), we identified 13 trans-meQTL SNPs with strong support for a role of TFs (Figure 4a). The most striking example was a locus on chromosome 4 (Figure 4b), where two SNPs (rs3774937 and rs3774959, in strong LD) were associated with ulcerative colitis (UC)26. Top SNP rs3774937 was associated with differential DNA methylation at 413 CpG sites across the genome, 92% of which showed the same direction of the effect, i.e. lower methylation associated with the minor allele (binomial P=2.72x10^-69). Of those 380 CpG sites with lower methylation, 147 (38.7%) overlap with a nuclear factor kappaB (NFKB) transcription factor binding site (3.8-fold enrichment, P=5.3x10^-32), as derived from ENCODE NFKB ChIP-seq data in blood cell types (Figure 4c). Three motif enrichment analyses (Homer, PWMEnrich, DEEPbind)21–23 corroborated the enrichment of NFKB binding motifs for the 413 CpG sites (Figure 4c). Notably, SNP rs3774937 is located in the first intron of NFKB1 and we found that the minor allele was associated with higher NFKB1 expression (Figure 4b). Of the 413 trans-CpGs, 64 were eQTMs and revealed a coherent gene network (Figure 4d) that was enriched for immunological processes related to NFKB1 function27 (Figure 4e). Taken together, these results support the idea that the minor allele of rs3774937, which is associated with increased UC risk, decreases DNA methylation in trans by increasing NFKB1 expression in cis.

The same analysis approach indicated that the 779 trans-methylation effects of rs8060686 (associated with various phenotypes including metabolic syndrome28 and coronary heart disease29) were mediated by altered CTCF binding, with CTCF mapping 315 kb from the trans-meQTL SNP. We observed a strong CTCF ChIP-seq enrichment with 603/779 trans-CpGs overlapping with CTCF binding (P=1.6x10^-232) and enrichment for CTCF motifs (Figure 5). Of these trans-CpGs, only 13 were observed previously in lymphocytes17. Hence, the minor allele of rs8060686 increased DNA methylation in trans, which could be attributed to a lower CTCF gene expression in cis. We found another example of this phenomenon: 228 trans-meQTL effects of 4 SNPs on chromosome 10, mapping near NKX2-3 and implicated in inflammatory bowel disease26, were strongly enriched for NKX2 transcription factor motifs and associated with NKX2-3 expression.
Again, a negative correlation was observed: the minor allele of rs11190140 decreased DNA methylation in trans at NKX2-3 binding sites and increased NKX2-3 gene expression in cis (Supplementary Fig. 6).
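The per-SNP test for a skew in effect direction is a binomial (sign) test against an expected proportion of 50%. A minimal exact version is sketched below using only the standard library. The function name is hypothetical, and the P-value it returns for 380 of 413 need not equal the value reported above, since the exact counts and test settings used in the study are not spelled out here.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial P for k of n effects in one direction (p = 0.5).

    Under p = 0.5 the distribution is symmetric, so the two-sided P is twice
    the upper tail starting at the more extreme of k and n - k, capped at 1.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1))
    # Integer arithmetic keeps the tail exact; the division yields a float.
    return min(1.0, 2 * tail / 2 ** n)


# 380 of the 413 trans-CpGs of rs3774937 show lower methylation.
print(binomial_two_sided_p(380, 413))
# A balanced SNP (5 of 10) gives no evidence of skew.
print(binomial_two_sided_p(5, 10))
```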

[Figure 4 panels. a, number of CpG sites with decreased methylation for the minor allele (x-axis) against number of CpG sites with increased methylation for the minor allele (y-axis); dots are labelled by transcription factor (BPTF, CTCF, NFKB, NKX2-3, ZBTB38, unknown) or cell counts. b, NFKB1 (4:103,422,486-103,538,459) and rs3774937 (4:103,434,253), risk factor associated to ulcerative colitis, with cis-eQTL plot. c, trans-meQTL effect direction and overlap with NFKB binding for the 413 CpGs affected in trans. d, gene network with trans-meQTL and trans-eQTL plots. e, pathways: regulation of inflammatory response (GO:0050727), positive regulation of inflammatory response (GO:0050729), abnormal physiological response to xenobiotic (MP:0008872), interleukin-10 production (GO:0032613), inflammatory response (GO:0006954).]

Figure 4. An imbalance in effect direction of trans-meQTLs implies involvement of transcription factors. a, Each dot represents a SNP with at least 10 trans-meQTL effects. The x-axis shows the number of trans-effects where the minor allele decreases methylation, whereas the y-axis shows an increase in methylation. SNPs with a multitude of effects of which many have the same allelic direction often exhibit evidence for a cis-eQTL on a transcription factor (colored dots), and an overrepresentation of trans-CpGs overlapping binding sites for that transcription factor. b, Depiction of the NFKB1 gene and rs3774937, associated with ulcerative colitis and an increased expression of NFKB1 for the risk and minor allele C. Boxes show the median and IQR. Whiskers show the outer quartile plus 1.5 times the IQR. c, In addition to influencing NFKB1 expression, rs3774937 also relates to DNA methylation at 413 CpGs in trans, decreasing methylation levels at 93% of affected CpG sites (dark grey). Many of the CpG sites (37.3%) overlap with NFKB binding sites (3.8-fold enrichment, P-value=5.3x10^-32) (outer chart). d, Gene network of the eQTM genes that are associated with 72 of the 413 CpGs (17.4%); genes showing a trans-eQTL are marked in red and NFKB1 is depicted in blue. Illustrations of the observed trans-meQTL (left plot) and trans-eQTL effects (right plot) of rs3774937. e, Top pathways as identified by DEPICT for which the genes in d were overrepresented. Many of the identified pathways were inflammation-related, in line with the inflammatory nature of ulcerative colitis.

[Figure 5 panels. a, CTCF (16:67,596,310-67,673,086) and rs8060686 (16:67,911,517), risk factor associated to metabolic syndrome, with cis-eQTL plot. b, trans-meQTL effect direction and overlap with CTCF ChIP-seq binding sites for the 779 CpGs affected in trans (20.3x enrichment, P-value = 1.6 x 10^-232). c, trans-meQTL and trans-eQTL plots. d, gene network.]

Figure 5. Trans-meQTL CpGs related to rs8060686 show overlap with CTCF binding sites. a, Depiction of the CTCF gene and rs8060686, associated with metabolic syndrome. The plot shows the expression of CTCF by rs8060686 genotype (risk allele C). b, In addition to influencing CTCF expression, rs8060686 also influences DNA methylation at 779 CpGs in trans, increasing methylation levels at 87.7% of affected CpG sites (dark grey). In addition, many of the CpG sites (77.4%) overlap with CTCF binding sites (20.3-fold enrichment, P-value = 1.6 x 10^-232), shown in the outer chart. c, Illustrations of meQTL (left plot) and eQTL effects (right plot) of rs8060686 in trans. Only SNP-gene combinations were tested where the gene was associated with one of the 779 CpGs with a trans-meQTL. d, Gene network of the genes associated with 60 of the 779 CpGs (7.7%) with a trans-meQTL.

A height locus30 harbouring 4 SNPs that are associated with 267 trans-CpGs implicated a role for ZBTB38 in mediating trans-meQTL effects (Supplementary Fig. 7). In contrast to the aforementioned TFs that are all transcriptional activators, ZBTB38 is a transcriptional repressor31,32 and its expression was positively correlated with methylation in trans, which is in line with our observation that eQTMs in repressed regions are enriched for positive correlations. Finally, the trans-methylation effects of rs7216064 (64 trans-CpGs), associated with lung carcinoma33, preferentially occurred at regions binding CTCF, while the SNP was located in the BPTF gene, which is known to occupy CTCF binding sites34 (Supplementary Fig. 8).

The possibility to link trans-meQTL effects to an association of TF expression in cis and concomitant differential methylation in trans at the respective binding site is limited to TFs for which ChIP-seq data or motif information is available. In order to make inferences on TFs for which such data is not yet available, we ascertained whether trans-meQTL SNPs were more often associated with TF gene expression in cis as compared with SNPs without a trans-meQTL effect. We observed that 13.1% of the GWAS SNPs that produced trans-meQTLs also affect TF gene expression in cis, whereas only 4.5% of the GWAS SNPs without a trans-meQTL affect TF gene expression in cis (Fisher’s exact P=6.6x10^-13).
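The comparison above is a 2x2 enrichment test: SNPs with or without a trans-meQTL, cross-tabulated against whether they affect TF expression in cis. A one-sided Fisher's exact test for such a table can be sketched with the hypergeometric distribution as below. The function names are hypothetical and the table is a toy example, because the underlying SNP counts are not given in the text.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact P for enrichment in the table [[a, b], [c, d]].

    a: trans-meQTL SNPs with a cis effect on a TF, b: trans-meQTL SNPs without,
    c: other SNPs with a cis effect on a TF,       d: other SNPs without.
    Returns P(X >= a) with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denominator = log_comb(total, col1)
    upper = min(row1, col1)
    return sum(
        exp(log_comb(row1, k) + log_comb(total - row1, col1 - k) - log_denominator)
        for k in range(a, upper + 1)
    )


# Toy table, not the study's counts.
print(fisher_enrichment_p(3, 1, 1, 3))
```

Working in log space avoids overflow when the table contains thousands of SNPs, as it would for the 6,111 trait-associated SNPs tested here.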

Here we report that one third of known disease- and trait-associated SNPs have downstream methylation effects in trans and often are associated with multiple regions across the genome. Our data suggest that the biological mechanism underlying trans-meQTLs commonly involves a local effect on the expression of a nearby TF that influences DNA methylation at the distal binding sites of that particular TF. The direction of downstream methylation effects is remarkably consistent for each SNP and indicates that decreased DNA methylation is a signature of increased binding of transcriptional activators. As such, our study reveals previously unrecognized functional consequences of disease variants in non-coding regions. These can be looked up online (see URLs), and will provide leads for experimental follow-up.

Methods

Cohort descriptions
The five cohorts used in our study are described briefly below. The number of samples per cohort and references to full cohort descriptions can be found in Supplementary Table 1.

CODAM
The Cohort on Diabetes and Atherosclerosis Maastricht10 (CODAM) consists of a selection of 547 subjects from a larger population-based cohort.35 Inclusion of subjects into CODAM was based on a moderately increased risk to develop cardiometabolic diseases, such as type 2 diabetes and/or cardiovascular disease. Subjects were included if they were of Caucasian descent and over 40 years of age and additionally met at least one of the following criteria: increased BMI (>25), a positive family history of type 2 diabetes, a history of gestational diabetes and/or glycosuria, or use of anti-hypertensive medication.

LifeLines-DEEP
The LifeLines-DEEP (LLD) cohort9 is a sub-cohort of the LifeLines cohort.36 LifeLines is a multi-disciplinary prospective population-based cohort study examining the health and health-related behaviours of 167,729 individuals living in the northern parts of The Netherlands using a unique three-generation design. It employs a broad range of investigative procedures assessing the biomedical, socio-demographic, behavioural, physical and psychological factors contributing to health and disease in the general population. A subset of 1,500 LifeLines participants also take part in LLD9. For these participants, additional molecular data is generated, allowing for a more thorough investigation of the association between genetic and phenotypic variation.

LLS
The aim of the Leiden Longevity Study11 (LLS) is to identify genetic factors influencing longevity and examine their interaction with the environment in order to develop interventions to increase health at older ages. To this end, long-lived siblings of European descent were recruited together with their offspring and their offspring’s partners, on the condition that at least two long-lived siblings were alive at the time of ascertainment. The age criterion was 89 or older for men and 91 or older for women. These criteria led to the ascertainment of 944 long-lived siblings from 421 families, together with 1,671 of their offspring and 744 partners.

NTR
The Netherlands Twin Register12,37,38 (NTR) was established in 1987 to study the extent to which genetic and environmental influences cause phenotypic differences between individuals. To this end, data from twins and their families (nearly 200,000 participants) from all over the Netherlands are collected, with a focus on health, lifestyle, personality, brain development, cognition, mental health, and aging.

RS
The Rotterdam Study13 is a single-centre, prospective population-based cohort study conducted in Rotterdam, the Netherlands13. Subjects were included in different phases, with a total of 14,926 men and women aged 45 and over included as of late 2008. The main objective of the Rotterdam Study is to investigate the prevalence and incidence of, and risk factors for, chronic diseases in order to contribute to better prevention and treatment of such diseases in the elderly.

Genotype data

Data generation
Genotype data was generated for each cohort individually. Details on the methods used can be found in the individual papers (CODAM: van Dam et al.35; LLD: Tigchelaar et al.9; LLS: Deelen et al.39; NTR: Willemsen et al.12; RS: Hofman et al.13).

Imputation and QC
For each cohort separately, the genotype data were harmonized towards the Genome of the Netherlands40 (GoNL) using Genotype Harmonizer41 and subsequently imputed per cohort using Impute242 with the GoNL reference panel43 (v5). Quality control was also performed per cohort. We removed SNPs based on imputation info-score (<0.5), HWE (P < 10⁻⁴), call rate (<95%) and minor allele frequency (<0.05), resulting in 5,206,562 SNPs that passed quality control in each of the datasets.

Methylation data

Data generation
For the generation of genome-wide DNA methylation data, 500 ng of genomic DNA was bisulfite modified using the EZ DNA Methylation kit (Zymo Research, Irvine, California, USA) and hybridized on Illumina 450k arrays according to the manufacturer’s protocols. The original IDAT files were generated by the Illumina iScan BeadChip scanner. We collected methylation data for a total of 3,841 samples. Data was generated by the Human Genotyping facility (HugeF) of ErasmusMC, the Netherlands (see URLs).

Probe remapping and selection
We remapped the 450K probes to the human genome reference (HG19) to correct for inaccurate mappings of probes and to identify probes that mapped to multiple locations on the genome. Details on this procedure can be found in Bonder et al. (2014)44. Next, we removed probes with a known SNP (GoNL, MAF > 0.01) at the single base extension (SBE) site or CpG site. Lastly, we removed all probes on the sex chromosomes, leaving 405,709 high quality methylation probes for the analyses.

Normalization and QC
Methylation data was processed using a custom pipeline based on the pipeline developed by Touleimat & Tost45. First, we used methylumi46 to extract the data from the raw IDAT files. Next, we removed incorrectly mapped probes and checked for outlying samples using the first two principal components (PCs) obtained using principal component analysis (PCA). None of the samples failed our quality control checks, indicating high quality data. Following quality control, we performed background correction and probe type normalization as implemented in DASEN47. Normalization was performed per cohort, followed by quantile normalization on the combined data to normalize the differences per cohort. We used MixupMapper48 to identify sample mix-ups between genotype and DNA methylation data, detecting and correcting 193 mix-ups. Lastly, in order to correct for known and unknown confounding sources of variation in the methylation data and to increase statistical power, we removed the first principal components that were not affected by genetic information (22 PCs) from the methylation data, using methodology we have successfully used in trans-eQTL3,49 and meQTL analyses44.
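The quantile normalization step applied to the combined data can be illustrated with a small sketch. It is a generic textbook version written for illustration, not the DASEN implementation or the pipeline used in this study. The function name `quantile_normalize`, the choice to treat every sample as one vector, and the tie handling (ties are broken by input order) are all assumptions.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equally long sample vectors.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties are broken by input order, which is a
    simplification made for this sketch.
    """
    n_samples = len(samples)
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Reference distribution: mean of the k-th smallest value over all samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / n_samples for k in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_idx in enumerate(order):
            out[probe_idx] = reference[rank]
        normalized.append(out)
    return normalized


# Made-up beta values for three samples measured on four probes.
betas = [
    [0.10, 0.80, 0.45, 0.30],
    [0.20, 0.90, 0.55, 0.25],
    [0.05, 0.70, 0.50, 0.40],
]
for row in quantile_normalize(betas):
    print([round(v, 3) for v in row])
```

After this step every sample has the same set of values and only their assignment to probes differs, which is what removes distributional differences between cohorts.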

RNA sequencing
Total RNA from whole blood was deprived of globin using Ambion’s GLOBINclear kit and subsequently processed for sequencing using Illumina’s TruSeq version 2 library preparation kit. Paired-end sequencing of 2x50bp was performed using Illumina’s HiSeq 2000, pooling 10 samples per lane. Finally, read sets per sample were generated using CASAVA, retaining only reads passing Illumina’s Chastity Filter for further processing. Data was generated by the Human Genotyping facility (HugeF) of ErasmusMC, the Netherlands (see URLs). Initial QC was performed using FastQC v0.10.1 (see URLs), removal of adaptors was performed using cutadapt50 (v1.1), and Sickle v1.2 (see URLs) was used to trim low quality ends of the reads (min length 25, min quality 20). The sequencing reads were mapped to the human genome (HG19) using STAR51 v2.3.125. Gene expression quantification was performed by HTSeq-count. The gene definitions used for quantification were based on Ensembl version 71, with the extension that regions with overlapping exons were treated as separate genes and reads mapping within these overlapping parts did not count towards expression of the normal genes. Expression data on the gene level were first normalized using Trimmed Mean of M-values52. Then expression values were log2 transformed, and gene and sample means were centred to zero. To correct for batch effects, PCA was run on the sample correlation matrix and the first 25 PCs were removed using methodology that we have used before3,49; details are provided in Zhernakova et al.53

Cis-meQTL mapping
In order to determine the effect of nearby genetic variation on methylation levels (cis-meQTLs, here defined as the relationship between a CpG and a SNP no further than 250kb apart), we performed cis-meQTL mapping using 3,841 samples for which both genotype data and methylation data were available. To this end, we calculated the Spearman rank correlation per cohort, followed by meta-analysis using a weighted Z-method described previously3. To detect all possible independent SNPs regulating methylation at a single CpG-site, we regressed out all primary cis-meQTL effects and then performed cis-meQTL mapping for the same CpG-site to find secondary cis-meQTLs. We repeated this in a stepwise fashion until no more independent cis-meQTLs were found. To filter out potential false positive cis-meQTLs caused by SNPs affecting the binding of a probe on the array, we filtered the cis-meQTL effects by removing any CpG-SNP pair for which the SNP was located in the probe. In addition, all other CpG-SNP pairs for which the SNP was outside the probe, but in LD (R² > 0.2 or D’ > 0.2) with a SNP inside the probe, were also removed. We tested for LD between SNPs in the probe and in the surrounding cis area in the individual genotype datasets, as well as in GoNL v5, in order to be as strict as possible in marking a QTL as a true positive. To correct for multiple testing, we empirically controlled the false discovery rate (FDR) at 5%. For this, we compared the distribution of observed P-values to the distribution obtained from performing the analysis on permuted data. Permutation was done by shuffling the sample identifiers of one data set, breaking the link between, e.g., the genotype data and the methylation or expression data. We repeated this procedure 10 times to obtain a stable distribution of P-values under the null hypothesis. The FDR was determined by selecting only the strongest effect per CpG in both the real analysis and the permutations (i.e. probe-level FDR < 5%), as described previously3.

Cis-eQTL mapping
For a set of 2,116 BIOS samples we had also generated RNA-seq data. We used this data to identify cis-eQTLs. Cis-eQTL mapping was performed using the same method as cis-meQTL mapping. Details on these eQTLs will be described in a separate paper53.
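The per-cohort Spearman correlation followed by a weighted Z meta-analysis, as used for the QTL mapping above, can be sketched as follows. This is a minimal illustration, not the software used in this study. The function names are hypothetical, the weights are assumed to be the square root of each cohort's sample size, and the conversion from correlation to Z-score uses the Fisher transformation as a stand-in, because the text does not spell out the conversion used in the original method.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def correlation_to_z(r, n):
    """Approximate Z-score for a correlation (Fisher transformation, assumed)."""
    return math.atanh(r) * math.sqrt(n - 3)


def weighted_z(z_scores, sample_sizes):
    """Combine per-cohort Z-scores using weights sqrt(n_i)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))


# Two made-up cohorts: genotype dosages and methylation values at one CpG.
cohorts = [
    ([0, 1, 2, 1, 0, 2, 1, 0], [0.31, 0.42, 0.55, 0.40, 0.35, 0.50, 0.47, 0.30]),
    ([2, 0, 1, 1, 2, 0], [0.52, 0.33, 0.41, 0.36, 0.58, 0.44]),
]
z_per_cohort = [correlation_to_z(spearman(g, m), len(g)) for g, m in cohorts]
print(weighted_z(z_per_cohort, [len(g) for g, _ in cohorts]))
```

Weighting by the square root of the sample size lets larger cohorts contribute more to the combined statistic while keeping its variance at one under the null.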

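The permutation-based FDR control described for the cis-QTL mapping can be illustrated with a short sketch. It assumes that the inputs are already reduced to the strongest P-value per CpG, both for the real analysis and for each permutation round. The function names and the toy numbers are hypothetical, and the original software may differ in detail.

```python
from bisect import bisect_right


def empirical_fdr(observed_top_p, permuted_top_p_rounds):
    """Return (p, estimated FDR) for every observed top P-value.

    The FDR at threshold p is estimated as the average number of permuted
    top P-values at or below p, divided by the number of observed top
    P-values at or below p.
    """
    observed = sorted(observed_top_p)
    null_rounds = [sorted(r) for r in permuted_top_p_rounds]
    results = []
    for p in observed:
        n_observed = bisect_right(observed, p)  # observed effects with P <= p
        n_null = sum(bisect_right(r, p) for r in null_rounds) / len(null_rounds)
        results.append((p, min(1.0, n_null / n_observed)))
    return results


def p_value_threshold(observed_top_p, permuted_top_p_rounds, fdr=0.05):
    """Largest observed P-value whose estimated FDR is at most `fdr`."""
    passing = [
        p
        for p, q in empirical_fdr(observed_top_p, permuted_top_p_rounds)
        if q <= fdr
    ]
    return max(passing) if passing else None


# Made-up top P-values per CpG: one real analysis and two permutation rounds.
real = [1e-10, 1e-8, 1e-6, 0.01, 0.2, 0.5]
permuted = [
    [0.02, 0.3, 0.4, 0.6, 0.7, 0.9],
    [0.005, 0.1, 0.3, 0.5, 0.8, 0.95],
]
print(p_value_threshold(real, permuted))
```

Using only the strongest effect per CpG in both the real and the permuted data keeps the two distributions comparable, so the resulting threshold applies at the probe level.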
Expression quantitative trait methylation (eQTM) analysis
To identify associations between methylation levels and expression levels of nearby genes (cis-eQTMs), we first corrected our expression and methylation data for batch effects and covariates by regressing out the PCs and regressing out the identified cis-meQTLs and cis-eQTLs, to ensure that identified associations between CpG sites and gene expression levels were not due to shared genetic effects. We mapped eQTMs in a window of 250Kb around the TSS of a transcript. Further statistical analysis was identical to the cis-meQTL mapping. For this analysis we were able to use a total of 2,101 samples for which genetic, methylation and gene expression data were all available. To correct for multiple testing we controlled the FDR at 5%; the FDR was determined by selecting only the strongest effect per CpG in both the real analysis and the permutations, as described previously3.

Trans-meQTL mapping
To identify the effects of distal genetic variation on methylation (trans-meQTLs) we used the same 3,841 samples that we had used for cis-meQTL mapping. To focus our analysis and limit the multiple testing burden, we restricted our analysis to SNPs that have previously been found to be significantly associated with traits and diseases. We extracted these SNPs from the NHGRI genome-wide association study (GWAS) catalogue, from recent GWAS not yet in the NHGRI GWAS catalogue, and from studies on the Immunochip and Metabochip platforms that are not included in the NHGRI GWAS catalogue (Supplemental file 1). We compiled this list of SNPs in December 2014. Per SNP we only investigated CpG sites that mapped at least 5 Mb from the SNP or on other chromosomes. Before mapping trans-meQTLs, we regressed out the identified cis-meQTLs to increase the statistical power of trans-meQTL detection (as done previously for trans-eQTLs3) and to avoid designating an association as trans that may be due to long-range LD (e.g. within the HLA region). To ascertain the stability of the trans-meQTLs we also performed the trans-mapping using uncorrected data and using methylation data corrected for cell-type proportions. In addition, we performed meQTL mapping on SNPs known to influence the cell type proportions in blood18,19. To filter out potential false positive trans-meQTLs due to cross-hybridization of the probe, we remapped the methylation probes with very relaxed settings, identical to Westra et al.3, with the difference that we only accepted mappings if the last bases of the probe including the SBE site were accurately mapped to the alternative location. If the probe mapped within our minimal trans-window, 5 Mb from the SNP, we removed the effect as being a false positive trans-meQTL. We controlled the false discovery rate at 5%, identical to the aforementioned cis-meQTL analysis.

Trans-eQTL mapping
To check if the trans-meQTL effects also showed in gene expression levels, we annotated the CpGs with a trans-meQTL to genes using our eQTMs. Using the 2,101 samples for which both genotype and gene expression data were available, we performed trans-eQTL mapping, associating the SNPs known to be associated with DNA methylation in trans with their corresponding eQTM genes.

Annotations and enrichment tests
Annotation of the CpGs was performed using Ensembl54 (v70), UCSC Genome Browser55 and data from the Epigenomics Roadmap Project56. We used the Epigenomics Roadmap annotation for the SBE site of the methylation site using 27 blood cell types. We used both the histone mark information and the chromatin marks in blood-related cell types only, as generated by the Epigenomics Roadmap Project. Summarizing the information over the 27 blood cell types was done by counting the presence of histone marks in all the cell types and scaling the abundance, i.e. if the mark is present in all cell types the score is 1, and if it is present in none of the blood cell types the score is 0.
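The scaled abundance score described above amounts to the fraction of cell types in which a mark is present at the site. A minimal sketch, with a hypothetical function name and made-up presence calls:

```python
def mark_abundance_score(present_in_cell_type):
    """Fraction of cell types in which a mark overlaps the site (0 to 1)."""
    if not present_in_cell_type:
        raise ValueError("at least one cell type is required")
    return sum(bool(p) for p in present_in_cell_type) / len(present_in_cell_type)


# Made-up presence calls for one histone mark at one CpG in 27 cell types.
calls = [True] * 9 + [False] * 18
print(mark_abundance_score(calls))
```

A score of 1 means the mark is present in all 27 blood cell types and a score of 0 means it is present in none of them.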

To calculate enrichment of meQTLs or eQTMs for any particular genomic context, we used logistic regression because this allowed us to account for covariates such as CpG methylation variation. For cis-meQTLs, we used the variability of DNA methylation, the number of SNPs tested, and the distance to the nearest SNP per CpG as covariates. For all other analyses we used only the variability in DNA methylation as a covariate.

We used transcription factor ChIP-seq data from the ENCODE project for blood-related cell lines (narrow peak data). We overlapped CpG locations with ChIP-seq signals and performed a Fisher's exact test to determine whether the trans-meQTL probes associated with a SNP were overlapping a ChIP-seq region more often than other trans-meQTL probes.

Enrichment of known sequence motifs among trans-CpGs was assessed using the PWMEnrich22 package in R, Homer57 and DEEPbind23. For PWMEnrich, 100-base-pair sequences around the interrogated CpG site were used, and as a background set we used the top CpGs from the 50 permutations used to determine the FDR threshold of the trans-meQTLs. For Homer the default settings for motif enrichment identification were used, and the same CpGs derived from the permutations were used as a background. For DEEPbind we used both the permutation background as described for Homer and the permutation background as described for PWMEnrich.

Using data published by Rao et al.20 we were able to intersect the trans-meQTLs with information about the 3D structure of the human genome, using combined Hi-C data for both inter- and intra-chromosomal contacts at 1Kb resolution and the quality threshold of E30 in the GM12878 lymphoblastoid cell line. Both the trans-meQTL SNP and the trans-meQTL probes were put in the relevant 1Kb block, and for these blocks we looked up the chromosomal contact value in the measurements by Rao et al. Surrounding the trans-meQTL SNPs, we used an LD window that spans maximally 250Kb from the trans-meQTL SNP and has a minimal R² of 0.8. If a Hi-C contact between the SNP block and the CpG-site was indicated, we flagged the region as positive for Hi-C contacts. As a background, we used the combinations found in our 50 permuted trans-meQTL analyses, taking for each permutation the top trans-meQTLs that were similar in size to the real analysis.

eQTM direction prediction
We predicted the direction of the eQTM effects using both a decision tree and a naive Bayes model (as implemented by RapidMiner58 v6.3). We built the models on the strongest eQTMs (FDR < 9.73 × 10⁻⁶). For the decision tree we used a standard cross-validation set-up using 20 folds. For the naive Bayes model we used a double-loop cross-validation: performance was evaluated in the outer loop using 20-fold cross-validation, while feature selection (using both backward elimination and forward selection) took place in the inner loop using 10-fold cross-validation. Details about the double-loop cross-validation can be found in De Ronde et al.59. During the training of the model, we balanced the two classes, making sure we had an equal number of positively correlating and negatively correlating CpG-gene combinations, by randomly sampling a subset of the overrepresented negatively correlating CpG-gene combination group. We chose to do so to avoid a model that labels all eQTMs as negative, since this is the class that contains the majority of the eQTMs. In the models we used CpG-centric annotations: overlap with Epigenomics Roadmap chromatin states, histone marks and relations between the histone marks, GC content surrounding the CpG-site, and relative locations from the CpG-site to the transcript.
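The class balancing used when training the direction models can be sketched as random undersampling of the larger class. This is an illustration only: the models in this study were built in RapidMiner, and the function name, the data layout and the fixed seed below are assumptions.

```python
import random


def balance_by_undersampling(examples, seed=0):
    """Randomly subsample the larger class so both classes are equally large.

    `examples` is a list of (features, direction) pairs, where direction is
    +1 for a positive and -1 for a negative CpG-gene correlation.
    """
    rng = random.Random(seed)
    positive = [e for e in examples if e[1] > 0]
    negative = [e for e in examples if e[1] < 0]
    n = min(len(positive), len(negative))
    balanced = rng.sample(positive, n) + rng.sample(negative, n)
    rng.shuffle(balanced)  # avoid an ordering by class in the training set
    return balanced


# Made-up eQTMs: 3 positively and 10 negatively correlating CpG-gene pairs.
eqtms = [({"id": i}, +1) for i in range(3)] + [({"id": i}, -1) for i in range(3, 13)]
training_set = balance_by_undersampling(eqtms)
print(len(training_set))
```

With equal class sizes, a classifier can no longer reach a high accuracy simply by predicting the majority (negative) direction for every eQTM.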
DEPICT
To investigate whether there was biological coherence in the trans-meQTLs identified for the NFKB1 locus, we performed gene-set enrichment analysis for the genes near the trans-CpG sites of the UC genetic risk factor (which maps in the NFKB1 locus). To do so, we adapted DEPICT27, a pathway enrichment analysis method that we originally developed for GWAS. Instead of defining loci with genes by using the top associated SNPs (as is done when analysing GWAS data), we used the eQTM information to empirically link trans-CpGs to genes (that map close to the CpGs). Within the DEPICT gene set enrichment, significance is determined by using a background set of genes. As a background in the adapted DEPICT enrichment analyses we matched our background to the results from the actual trans-meQTL and eQTM analyses: the matching was performed by generating a set of background CpGs (and corresponding correlating eQTM genes), by selecting an equal number of CpGs for which we had found trans-meQTL effects with SNPs that map outside the NFKB1 locus. By doing so we ensure that the characteristics of these background CpGs are the same as the real NFKB1 trans-meQTL CpGs, both in terms of CpG variance and the requirement that they also show a significant correlation with expression levels of genes close to the CpG (i.e. a cis-eQTM), ensuring that the corresponding input genes for DEPICT have the same expression variation distribution in the actual NFKB1 analysis and in the background. Subsequent pathway enrichment analysis was conducted as described before27, and significance was determined by controlling the false discovery rate at 5%.

URLs
All results can be queried using our dedicated QTL browser: www.genenetwork.nl/biosqtlbrowser. Data was generated by the Human Genotyping facility (HugeF) of ErasmusMC, the Netherlands, see: www.glimDNA.org.
Cohort webpages: LifeLines: http://lifelines.nl/lifelines-research/general; Leiden Longevity Study: http://www.healthy-ageing.nl and http://www.leidenlangleven.nl; Netherlands Twin Registry: http://www.tweelingenregister.org; Rotterdam studies: http://www.erasmus-epidemiology.nl/rotterdamstudy; the Genetic Research in Isolated Populations program: http://www.epib.nl/research/geneticepi/research.html#gip; Codam study: http://www.carimmaastricht.nl/; PAN study: http://www.alsonderzoek.nl/.
Software: FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/; Sickle: https://github.com/najoshi/sickle.

Accession codes
All results can be queried using our dedicated QTL browser, see URLs. Raw data was submitted to the European Genome-phenome Archive (EGA), under accession EGAS00001001077.

Author contributions
BTH, PACtH, JBJvM, AI, RJ and LF formed the management team of the BIOS consortium. DIB, RP, JVD, JJH, MMJVG, CDAS, CJHvdK, CGS, CW, LF, AZ, EFG, PES, MB, JD, DvH, JHV, LHvdB, CMvD, BAH, AI, AGU managed and organized the biobanks. JBJvM, PMJ, MV, HEDS, MV, RvdB, JvR and NL generated RNA-seq and Illumina 450k data. HM, MvI, MvG, JB, DVZ, RJ, PvtH, PD, IN, PACtH, BTH and MM were responsible for data management and the computational infrastructure. MJB, RL, MV, DVZ, RS, IJ, MvI, PD, FvD, MvG, WA, SMK, MAS, EWvZ, RJ, PACtH, LF and BTH performed the data analysis. MJB, RL, LF and BTH drafted the manuscript. D.V.Z., M.M., P.D. and M.V. contributed equally. A.I., R.J. and J.B.J.M. contributed equally.

Competing financial interests
The authors declare no competing financial interests.

References
1. Manolio, T. A. Genomewide Association Studies and Assessment of the Risk of Disease. New Engl. J. Med. 362, 166–176 (2010).
2. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7–24 (2012).
3. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat. Genet. 45, 1238–1243 (2013).
4. Wright, F. A. et al. Heritability and genomics of gene expression in peripheral blood. Nat. Genet. 46, 430–437 (2014).
5. Bernstein, B. E., Meissner, A. & Lander, E. S. The Mammalian Epigenome. Cell 128, 669–681 (2007).
6. Mill, J. & Heijmans, B. T. From promises to practical strategies in epigenetic epidemiology. Nat. Rev. Genet. 14, 585–594 (2013).
7. Gutierrez-Arcelus, M. et al. Passive and active DNA methylation and the interplay with genetic variation in gene regulation. Elife 2013, e00523 (2013).
8. Tsankov, A. M. et al. Transcription factor binding dynamics during human ES cell differentiation. Nature 518, 344–349 (2015).
9. Tigchelaar, E. F. et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5, e006772 (2015).
10. van Greevenbroek, M. M. J. et al. The cross-sectional association between insulin resistance and circulating complement C3 is partly explained by plasma alanine aminotransferase, independent of central obesity and general inflammation (the CODAM study). Eur. J. Clin. Invest. 41, 372–379 (2011).
11. Schoenmaker, M. et al. Evidence of genetic enrichment for exceptional survival using a family approach: the Leiden Longevity Study. Eur. J. Hum. Genet. 14, 79–84 (2006).
12. Willemsen, G. et al. The Adult Netherlands Twin Register: twenty-five years of survey and biological data collection. Twin Res. Hum. Genet. 16, 271–281 (2013).
13. Hofman, A. et al. The Rotterdam Study: 2014 objectives and design update. Eur. J. Epidemiol. 28, 889–926 (2013).
14. Hu, S. et al. DNA methylation presents distinct binding sites for human transcription factors. Elife 2013, 1–16 (2013).
15. Yao, C. et al. Integromic analysis of genetic variation and gene expression identifies networks for cardiovascular disease phenotypes. Circulation 131, 536–549 (2015).
16. Huan, T. et al. A meta-analysis of gene expression signatures of blood pressure and hypertension. PLoS Genet. 11, e1005035 (2015).
17. Lemire, M. et al. Long-range epigenetic regulation is conferred by genetic variation located at thousands of independent loci. Nat. Commun. 6, 6326 (2015).
18. Orrù, V. et al. Genetic variants regulating immune cell levels in health and disease. Cell 155, 242–256 (2013).
19. Roederer, M. et al. The genetic architecture of the human immune system: A bioresource for autoimmunity and disease pathogenesis. Cell 161, 387–403 (2015).
20. Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
21. Heinz, S. et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol. Cell 38, 576–589 (2010).
22. Stojnic, R. & Diez, D. PWMEnrich: PWM enrichment analysis. R package version 4.6.0. (2015).
23. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
24. Zuin, J. et al. Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc. Natl Acad. Sci. USA 111, 996–1001 (2014).
25. Splinter, E. et al. CTCF mediates long-range chromatin looping and local histone modification in the β-globin locus. Genes Dev. 20, 2349–2354 (2006).
26. Jostins, L. et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491, 119–124 (2012).
27. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).
28. Kristiansson, K. et al. Genome-wide screen for metabolic syndrome susceptibility loci reveals strong lipid gene contribution but no evidence for common genetic basis for clustering of metabolic syndrome traits. Circ. Cardiovasc. Genet. 5, 242–249 (2012).
29. Lettre, G. et al. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe project. PLoS Genet. 7, (2011).
30. Soranzo, N. et al. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS Genet. 5, (2009).
31. Filion, G. J. P. et al. A Family of Human Zinc Finger Proteins That Bind Methylated DNA and Repress Transcription. Mol. Cell. Biol. 26, 169 (2006).
32. Sasai, N. & Defossez, P. A. Many paths to one goal? The proteins that recognize methylated DNA in eukaryotes. Int. J. Dev. Biol. 53, 323–334 (2009).
33. Shiraishi, K. et al. A genome-wide association study identifies two new susceptibility loci for lung adenocarcinoma in the Japanese population. Nat. Genet. 44, 900–903 (2012).
34. Qiu, Z. et al. Functional Interactions between NURF and Ctcf Regulate Gene Expression. Mol. Cell. Biol. 35, 224–237 (2015).
35. Van Dam, R. M., Boer, J. M. A., Feskens, E. J. M. & Seidell, J. C. Parental history of diabetes modifies the association between abdominal adiposity and hyperglycemia. Diabetes Care 24, 1454–1459 (2001).
36. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
37. Boomsma, D. I. et al. Netherlands Twin Register: a focus on longitudinal research. Twin Res. 5, 401–406 (2002).
38. Boomsma, D. I. et al. Genome-wide association of major depression: description of samples for the GAIN Major Depressive Disorder Study: NTR and NESDA biobank projects. Eur. J. Hum. Genet. 16, 335–342 (2008).
39. Deelen, J. et al. Genome-wide association meta-analysis of human longevity identifies a novel locus conferring survival beyond 90 years of age. Hum. Mol. Genet. 23, 4420–4432 (2014).
40. The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 46, 818–825 (2014).
41. Deelen, P. et al. Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration. BMC Res. Notes 7, 901 (2014).
42. Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, (2009).
43. Deelen, P. et al. Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur. J. Hum. Genet. 22, 1321–1326 (2014).
44. Bonder, M. J. et al. Genetic and epigenetic regulation of gene expression in fetal and adult human livers. BMC Genomics 15, 860 (2014).
45. Touleimat, N. & Tost, J. Complete pipeline for Infinium® Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 4, 325–341 (2012).
46. Davis, S., Du, P., Bilke, S., Triche, T. & Bootwalla, M. methylumi: Handle Illumina methylation data. R package version 2.2.0. (2012).
47. Pidsley, R. et al. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics 14, 293 (2013).
48. Westra, H. J. et al. MixupMapper: Correcting sample mix-ups in genome-wide datasets increases power to detect small genetic effects. Bioinformatics 27, 2104–2111 (2011).
49. Fehrmann, R. S. N. et al. Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. PLoS Genet. 7, e1002197 (2011).
50. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
51. Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
52. Robinson, M. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11, R25 (2010).
53. Zhernakova, D. V. et al. Hypothesis-free identification of modulators of genetic risk factors. bioRxiv 1–25 (2015). doi:10.1101/033217
54. Flicek, P. et al. Ensembl 2013. Nucleic Acids Res. 41, 48–55 (2013).
55. Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).
56. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
57. Heinz, S. et al. Effect of natural genetic variation on enhancer selection and function. Nature 503, 487–492 (2013).
58. Hofmann, M. & Klinkenberg, R. RapidMiner: Data Mining Use Cases and Business Analytics Applications. (Chapman & Hall/CRC, 2013).
59. De Ronde, J. J., Bonder, M. J., Lips, E. H., Rodenhuis, S. & Wessels, L. F. A. Breast cancer subtype specific classifiers of response to neoadjuvant chemotherapy do not outperform classifiers trained on all subtypes. PLoS One 9, e88551 (2014).

Acknowledgements
This work was performed within the framework of the Biobank-Based Integrative Omics Studies (BIOS) Consortium funded by BBMRI-NL, a research infrastructure financed by the Dutch government (NWO 184.021.007). Samples were contributed by LifeLines, the Leiden Longevity Study, the Netherlands Twin Registry (NTR), the Rotterdam studies, the Genetic Research in Isolated Populations program, the Codam study and the PAN study. We thank the participants of all aforementioned biobanks and acknowledge the contributions of the investigators to this study (Supplemental Acknowledgements). This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. L.F. is supported by a grant from the Dutch Research Council (ZonMW-VIDI 917.14.374), by FP7/2007–2013, grant agreement 259867, and by an ERC Starting Grant, grant agreement 637640 (ImmRisk).

Additional files
The following supplemental data files are available with the online version of this paper.
Supplementary Figure 1. Density of distances between CpG-site and strongest associated meQTL SNP
Supplementary Figure 2. Relation between methylation variation and meQTL associated CpGs
Supplementary Figure 3. Characterization of cis-meQTLs
Supplementary Figure 4. Characterization of cis-eQTMs in relation to the direction of the eQTM effect
Supplementary Figure 5. Trans-meQTLs identified for a risk factor for inflammatory bowel disease, rs11190140, and the overlap with NKX2-3
Supplementary Figure 6. Trans-meQTLs identified for a risk factor for height, rs6763931, and the overlap with ZBTB38
Supplementary Figure 7. Trans-meQTLs identified for a risk factor related to lung carcinoma, rs7216064, and overlap with BPTF
Supplementary Table 1. Descriptions and number of samples per cohort
Supplementary Table 2. Number of independent cis-meQTLs per QTL mapping round
Supplementary Table 3. GWAS SNPs tested for trans-meQTLs
Supplementary Table 4. Replication of lymphocytes trans-meQTLs in blood and vice-versa
Supplementary Table 5. Results of trans-meQTL in non-corrected data
Supplementary Table 6. Results of trans-meQTL in blood-cell composition corrected data
Supplementary Table 7. Results of trans-meQTL mapping on blood cell composition related SNPs
Supplementary Table 8. Trans-meQTL effects replicated in expression
Supplementary Note 1. Supplementary results & Acknowledgements

151 Discussion 10 1. Understanding health and disease by studying genetics and the environment 1 In recent years there has been a focus on explaining phenotypes and diseases using genetic approaches. So far, hundreds of genome-wide association studies (GWAS) in many different disease phenotypes and traits have been performed and led to the identification of thousands of genetic variants that influence the predisposition to common diseases1. For instance, large case-control GWAS have been performed on inflammatory bowel disease2 (IBD), type 2 2 diabetes3 (T2D), and on common traits such as height4 and body mass index5 (BMI). Using variants identified by GWAS we can explain a proportion of the variation in traits between individuals. Expanding the GWAS studies with more samples leads to more loci being identified and more accurate effect-size estimates. For example, in 2008, three independent GWAS studies on height were published at the same time6–8 and reported between 12 and 3 27 genome-wide significantly associated loci (P < 5 x 10-8) in studies of between 15,000 and 30,000 samples. These height-associated loci explained up to 3.7% of individual variation in height. In the following years, the sample sizes increased almost 10 times, to 253,288 individuals, and the number of genome-wide significantly associated height loci (or SNPs) rose sharply to 6974. The explained variance in height has now risen to 16% when considering 4 these 697 SNPs, or even to 29% if we also include SNPs that are likely relevant but that did not reach genome-wide significance4. It is to be expected that even larger GWAS studies on height and other phenotypes will yield many more variants and more accurate effect-sizes for the variants identified, such that the proportion of explained variance will rise even further. However, recent studies have also shown there is a limit to the extent that common genetic variants can be used for predicting disease or other phenotypes. 
For instance, it has been estimated that approximately 30 – 40% of the variation in height is unlikely to be explained by genetic variants9. This is because, for nearly all complex diseases and traits (such as height), a substantial part of the phenotypic variation is determined by environmental factors and potentially by the interplay between genetic and environmental variation10 (such as exposure to smoking or medication, degree of physical activity, and dietary habits). Given the recent success in identifying genetic risk factors for disease, the clear importance of environmental risk factors, and, most importantly, the increasing availability of large cohort studies, it is becoming possible to identify which ‘omics’ layers are influenced by genetic and environmental factors, and how these interact. These large cohort studies have not only looked at genetic variation in the participants, but also generated data on gene expression, methylation, the microbiome, metabolites, and many other phenotypes for them. In this thesis we have identified many of these relations. I will discuss below how these studies have aided the interpretation of genetic risk factors and the usefulness of multi-omics in predicting disease and other phenotypes. 2. Integrating multi-omics data to gain more insight into the functional consequences of genetic variants Despite the tremendous success in identifying genetic variants associated to traits and diseases, there are several challenges in interpreting the associated signals. It is important to realize that approximately 88% of variants associated to common diseases are located in non-coding regions11, which hampers the interpretation of these variants because it is often very difficult to infer the likely causal gene for a genomic locus. 
One way to overcome this difficulty, and to gain more knowledge on the likely causal genes and pathways, is to ascertain whether these genetic variants also affect, for instance, gene expression or methylation levels [11]. Given the availability of cohort studies with one or multiple omics datasets, it is now possible to ascertain such quantitative traits in a large number of samples and at high resolution. In this thesis we have conducted three quantitative trait studies: we studied cis (i.e. local) expression quantitative trait loci (cis-eQTLs) and methylation quantitative trait loci (cis-meQTLs) in four different tissues (chapter seven); we studied cis and trans (distal) effects on DNA-methylation in blood (chapter nine); and we studied how genetic variants affect the human gut microbiome (chapter six).

We observed that many genetic risk factors affect gene expression and methylation in cis, but we also identified trans-eQTLs and trans-meQTLs (chapter nine), indicating that it is also possible to identify the downstream pathways disrupted by these disease-associated variants. This was particularly clear for trans-meQTLs, where 31% of the tested genetic risk factors affected methylation at CpG sites in trans. However, since the biology of methylation is less well understood than that of gene expression, we overlapped the trans-meQTLs with identified eQTLs to aid in their interpretation. We found over 240 trans-eQTLs overlapping with trans-meQTL signals, i.e. roughly 9% of the total trans-meQTLs are also identified in expression, based on the links found between SNPs and methylation, and between methylation and gene expression. Over 90% of the identified trans-eQTLs had a concordant allelic direction based on the relations seen between methylation and gene expression.
Furthermore, we observed several instances where trans-meQTLs were likely caused by differences in transcription factor abundance, which were due to a cis-eQTL effect on the expression of the transcription factor gene (chapter nine, figure 4). Consequently, the binding sites for this transcription factor are more or less occupied, and this leads to either methylation or demethylation through an unknown molecular mechanism. For example, we observed that a genetic variant, SNP rs3774959, that affects gene expression levels of the nearby transcription factor NFKB1 [12] also influenced methylation at 413 downstream sites; nearly all these sites overlap with binding sites of the NFKB transcription factor. This example also shows the value of multi-omics studies, because the identified eQTLs were instrumental in helping us understand the nature of a massive number of trans-meQTLs from this SNP. Yet, since most of the trans-meQTLs that we detected remain unexplained, it is possible that some of the biology underlying these epigenetic signals will become clear from incorporating additional omics data, for instance, on protein or metabolite levels or on the microbiome.

3. Integration of multi-omics data to explain variation in phenotypes

Apart from using multi-omics datasets to better understand the downstream molecular consequences of genetic risk factors, these data can also be used to study non-genetic effects on diseases and traits in detail. Examples of such analyses are presented in chapter four, where we aimed to explain variation in lipid levels and BMI by using the gut microbiome as a predictor, and in chapter eight, where we aimed to explain variation in height and BMI by using methylation as a predictor. Both these studies were performed using data from the LifeLines-DEEP [13] project.
In this cohort study we collected extensive phenotype information on 1,500 participants, as well as data on their genotypes, microbiome composition, gene expression, DNA-methylation, and metabolite and cytokine levels. Since the data were all generated at the same time-point, this study is uniquely suited to conducting multi-omics integration studies, which link more than two data layers to each other.

In chapter eight we describe a study on three groups of samples: the LifeLines-DEEP [13] population cohort, the Lothian Birth Cohorts [14], and the Brisbane Systems Genetics Study [15]. All three datasets provided information on genetics and DNA-methylation. We first built a predictor for the height of an individual, based solely either on genetics or on DNA-methylation levels. By training and testing the models across datasets, we found that a DNA-methylation-based predictor could explain 0.76% of the variation in height and that genetics could explain 19.8%. Since the genetic predictor explains roughly 26 times more variation in height than the methylation predictor, there was little to be gained by making a combined predictor for height. This was expected, since height is known to be mostly heritable, with its heritability estimated at 60–80% by twin and family-based studies [9,16,17].

However, when applying the same method to BMI, we found that DNA-methylation levels could explain up to 7.3% of the variation in BMI, whereas genetics could explain up to 9.4%. A combined predictor could explain 13.6% of the variation in BMI. This shows clearly that prediction performance for BMI increases substantially when using multi-omics datasets.

Similarly, in chapter four we describe a study using combined genetic and microbiome-based predictors to explain variation in lipid levels and BMI. In this study we used only data from the LifeLines-DEEP cohort and, in order to get accurate predictions, we used cross-validation to estimate the explained variation. We found that for three of the five traits tested (i.e. high-density cholesterol, triglycerides, and BMI), the combined genetic and microbiome predictor performed significantly better than a predictor using solely genetic data. With the microbiome composition data we were able to explain up to 4.5% of the variation in BMI, while with genetics we could explain 2.1%. (The difference in variance explained by genetic data between chapter eight and chapter four is due to a different significance threshold for selecting SNPs in the genetic predictor for BMI.) In total, when we also took age and gender into account, we were able to explain 11.3% of the variation in BMI with genetics and microbiome composition.

When we combined the predictors for BMI described in chapter four and chapter eight into one predictor incorporating three levels of data (genetic variation, methylation and microbiome data; see figure 1), we observed that the combined predictor could explain even more of the variation in BMI than the predictors that used only two omics levels. The explained variation shown in figure 1 was re-estimated based on the original data reported for the three cohorts and our estimates were based on a subset of samples for which all three data layers were available.
This means the explained variation differs from that reported in chapter four and chapter eight. In our combined analysis, age and gender explain 4.2%, the microbiome composition explains 3.0%, DNA-methylation levels explain 6.9%, and genetic variation explains 8.6% of the variation in BMI. The predictor that uses age, gender, genetic variation, DNA-methylation and microbiome composition can explain 20.3% of the variation in BMI. It is evident that the predictors are not independent, as the summed explained variation (22.8%) is higher than the variation explained by the combined predictor (20.3%). This is because the different omics datasets do not provide fully independent information (for example, genetic variation can influence DNA-methylation, as described in chapter nine, and can also affect microbiome composition, as described in chapter six). However, the difference between the five-level predictor and the best four-level predictor remains statistically significant (ANOVA P = 0.0003), indicating that a multi-omics predictor can be helpful in predicting complex disease and trait phenotypes.
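
The comparison between the summed and the combined explained variation can be retraced with a few lines of arithmetic. The sketch below is only an illustration that uses the rounded percentages quoted above; the variable names are hypothetical and it is not the analysis code used for figure 1. Note that the rounded per-layer values sum to 22.7%, which agrees with the reported 22.8% up to rounding.

```python
# Explained variation in BMI (percent), taken from the rounded values
# quoted in the text for the samples that have all data layers.
single_layer = {
    "age_and_gender": 4.2,
    "microbiome": 3.0,
    "dna_methylation": 6.9,
    "genetics": 8.6,
}
combined = 20.3  # predictor that uses all layers together

# Sum of what each layer explains when it is used on its own.
summed = round(sum(single_layer.values()), 1)

# The part counted more than once, i.e. information shared between layers.
shared = round(summed - combined, 1)

print(f"summed single-layer explained variation: {summed}%")
print(f"combined predictor:                      {combined}%")
print(f"shared between layers (approx.):         {shared}%")
```

The positive difference is what one would expect if the layers carry partly overlapping information, for example because genetic variation also influences methylation and microbiome composition.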

In the BMI example above, we see that the use of non-genetic information can improve phenotype prediction substantially. Although the prediction of BMI is not directly relevant clinically (because BMI can usually be measured), the same method can also be applied to predict disease phenotypes. Given that most large-scale GWAS analyses have so far found only a limited number of risk factors, which usually explain only a limited proportion of phenotype variation, the development of models that incorporate gene expression, methylation, metabolite or microbiome data, or a subset of these, is promising and can be informative in predicting disease and trait phenotypes. For instance, it would be clinically relevant to predict risk for cardiovascular diseases (CVD) or T2D; although enormous progress has been made in prevention and treatment since the 1960s, CVD remains the leading cause of death worldwide. It has been estimated that nearly 40% of all deaths in the US will be due to CVD by 2030 [18]. Since CVD and T2D are chronic diseases and expensive to treat, early diagnosis would permit better monitoring of disease progression and might allow less expensive treatments if the diagnosis is made while symptoms are still mild.

[Figure 1: bar chart titled "Explained variation in body mass index". Legend: Genetic, Methylation, Microbiome, Age & Gender, Unexplained. Both y-axes: % explained variation.]

Figure 1. Multi-omics data can explain some of the variation in body mass index. This figure shows the variation explained by the different biological data layers and phenotypes. In the left part of the figure, the variation explained by the separate predictors on each layer is shown. In the right part of the figure the total variation explained by the combined model is shown. Here we started with age and gender, then included genetics, methylation and finally the microbiome predictor, so that the explained variation is attributed to the features based on the individual explained prediction.

Another use of integrative multi-omics predictors is to help identify environmental factors, such as smoking, that lead to disease in large cohorts. Smoking has an influence on both the microbiome (see chapter five) and methylation levels [19,20]. Using the methylation and microbiome data, it might be possible to accurately predict smoking status without needing to ask participants in biobanks about their smoking behavior. By using such integrated approaches, we may also be able to identify mismatches between the different data layers, which could then help correct potential sample mix-up problems, or identify which answers in the lifestyle questionnaires have not been filled out accurately. Furthermore, if we can infer clinically relevant diseases or phenotypes accurately, it might be possible to save biobanking costs, since phenotypes are typically expensive to measure.

These multi-omics datasets also hold great promise for personalized approaches: to better tailor nutrition [21], treatment or medication to individual patients. Currently a lot of research is being conducted into such precision medicine [22]. For instance, genetic data can be used to partly determine the appropriate type of medication or dosage for treating patients.
Such patient-specific prescriptions may also benefit from the use of other omics data, leading to a choice of medication and dosage that is better tailored to individual patients, with possible cost reductions and fewer side-effects. Likewise, we can use the omics data to offer more personalized nutritional advice, as reported by Zeevi et al. [23].

However, before we can start using these predictors in a diagnostic or clinical setting, it is crucial to thoroughly test how they work [24]. Firstly, these models have to reach a sufficiently high prediction accuracy before they can be adopted for clinical use. With the exception of the genetic data, where it is clear that genetic variation is causal to the phenotype, the other omics data may not be so easy to use. For instance, expression data in blood is not identical at two different time-points, even if the blood samples have been drawn from an individual on the same day [25]. So relations identified at the first time-point might not be well reflected at the second time-point. Although we observed that these omics levels were informative for BMI and lipid levels, an important step is to perform longitudinal analyses and, ideally, also to analyze omics data from multiple time-points. By using more time-points we can learn about the stability of the various omics layers and phenotypes over time. And by using a predictor built on the expression and methylation levels at time-point X to predict the level of a trait at time-point Y, we can learn more about the usability and stability of the non-genetic predictors in general. This research will eventually lead to better predictions of the phenotypes of interest and will also provide insight into whether each layer of omics data is informative for predicting disease development at a later stage.

For this purpose, it would be interesting to determine the power of the current predictors in the follow-up data from the LifeLines and LifeLines-DEEP cohorts. At this second time-point, five years after the initial sample collection, we could assess the stability of the non-genetic predictors, which would tell us more about the relevance of the information captured in the microbiome composition, expression data, and methylation data.

Furthermore, it is important to gain a better understanding of how the different biological data layers interact with each other and how they relate to confounding factors. In chapters two, three and five, we have studied potential confounding factors for studies on the microbiome. We identified relations between the microbiome composition in the LifeLines-DEEP cohort and different dietary factors, commonly used drugs, intrinsic factors, diseases, and smoking. More specifically, we modeled all of these factors simultaneously in chapter five, while taking into account relations between the individual factors, rather than studying all factors separately. Studies linking the microbiome or other biological omics data layers to phenotypes, or preferably to multiple phenotypes at the same time, are proving valuable in identifying unwanted confounders, which can vary per trait. We have provided insight into some of these confounders, which we hope will enable the creation of better models that yield more reproducible results.

While the generation of such multi-omics datasets is rather costly, this should be weighed against their potential to aid the earlier diagnosis and treatment of patients. In the United States alone, healthcare costs related to T2D were estimated to be US$176 billion in 2012 [26].
If we can achieve earlier diagnoses using good predictors, this may limit future treatment costs. Given that the risk of developing either T2D or CVD can be limited by lifestyle changes, earlier diagnosis could have a major impact on quality of life and reduce treatment costs. The LifeLines-DEEP dataset does not include a sufficient number of participants suffering from CVD or T2D for us to be able to build and test specific predictors. However, since we can better predict lipid levels and BMI with the multi-omics predictors, I believe this approach can also be applied to case-control analyses in both CVD and T2D cohorts.

4. Future perspectives

Several avenues are important for further research using multi-omics integration approaches in the interpretation of genetic risk variants and the prediction of diseases and traits. Firstly, both GWAS interpretation and predictor development will benefit from having larger datasets available. The larger datasets will aid the identification of smaller effects and offer better effect-size estimates. Secondly, by using higher-resolution [27] multi-omics datasets [28], such as single-cell data, and by integrating data on multiple tissues, we can improve the interpretation of GWAS data and explain more variation in traits. Thirdly, by using more omics datasets simultaneously, and fourthly, by studying the most relevant tissues [29], we can improve both the interpretation of the GWAS results and the accuracy and power of multi-omics predictors. In the next two sections I will describe the prospects for two multi-omics integration strategies.

4.1 Multi-omics in interpretation of genetic variants

Specifically for the interpretation of GWAS results using multi-omics, there is much to gain by studying different omics layers simultaneously. In chapter nine, we describe an overlap we identified between cis-eQTLs on transcription factors and trans-methylation QTLs on downstream binding sites for these transcription factors.
The next step in these analyses would be scaling to a genome-wide trans-mapping to identify more relations between genetic variation and DNA-methylation changes. This would help identify genetic regulation of genome-wide binding proteins or DNA-binding transcripts. By direct integration of genetic, expression and methylation data, we may be able to identify more relations between genetic variation, (distal) methylation changes and expression. This could enable us to explain more relations between methylation and transcription, and to learn more about the relations where there is no clear transcription factor binding motif or footprint, which proved crucial in interpreting the trans-methylation QTLs (chapter nine). By directly integrating the data levels, it might even be possible to do a "DNA-binder wide" analysis, mapping all the DNA-binding proteins and transcripts on the whole genome in one analysis. This could make such an approach an alternative to chromatin immunoprecipitation sequencing (ChIP-seq) analysis, which is performed per protein. The multi-omics-based alternative would have an advantage over ChIP-seq methods, because it would work not only for proteins that bind to DNA but also for RNAs that bind to the DNA.

However, it is important to note that we used the Illumina 450K array, which assays only part of the DNA-methylome [30]. Switching to genome-wide bisulfite sequencing, or a genome-wide sequencing approach that can identify methylation without bisulfite treatment, would provide a full picture, but this is likely to remain expensive for the next few years [30]. In addition to the drawback of genomic resolution, using combined expression and methylation QTL mapping to identify downstream binding of proteins or RNAs on the DNA may not have the same precision as ChIP-seq. Firstly, the precision is affected because there is an imperfect relation between expression levels and protein levels, i.e. the functional level of transcription factors. Secondly, protein levels might not be directly representative of the binding of the same protein to the DNA. To get to a better "DNA-binder wide" analysis, it might be worthwhile to integrate protein levels of transcription factors in the analysis, thereby giving a better understanding of how expression levels of transcription factors relate to protein levels.
This would help close the gap between expression and protein levels [31].

By moving to more precise datasets, like single-cell data [27], where information on gene expression and/or DNA-methylation is generated at a single-cell level, we can learn much more about the relation between different omics levels, because many of these relations are highly cell-type- and context-specific. In this thesis we only investigated information at a bulk level, thus missing much context-specific information, i.e. differences from cell to cell, or even information on differences per cell type. For our research we mainly used data derived from whole blood, which is a mixture of cell types, and since there are differences between the cells and the cell types that make up blood, it is possible that we missed genetic effects that are only present in the rarer cell types.

This also has implications when trying to interpret GWAS signals. By using the appropriate (i.e. affected) cell type or cells for a disease or trait, more specific information on the effect of genetic variation will be revealed. We know that genetic regulation can be tissue-specific (chapter seven) and that using the correct tissue type is relevant, for example, when integrating GWAS data and multi-omics data derived from colonic biopsies with the gut microbiome composition. If we could perform such studies we could learn more about the relation between the host and microbiome, whereas using expression data derived from blood might not be truly reflective of the gut situation. Especially for gut diseases, like IBD and celiac disease, this could yield new insights. We envision that, using these data, we would also learn more about the microbiome QTLs identified in our microbiome quantitative trait study (chapter six). Currently, there are no datasets available that combine intestinal DNA-methylation levels, gene-expression levels, and the gut microbiome.
The generation of such datasets is challenging, since it is not possible to collect the stool microbiome at the same time-point as a biopsy; this is important to relate human gene expression, microbial gene activity, and methylation levels to each other properly.

4.2 Multi-omics prediction models

Many biobanks are now enriching their datasets with additional molecular levels. One issue is the actual usefulness of generating another molecular level on top of the multi-omics datasets that may already be available. To shed light on this, we have expanded our BMI predictor with the expression data available for the LifeLines-DEEP cohort. By using the same strategy as for microbiome composition (see methods, chapter four), I built a predictor based on expression data. Then I combined this predictor into our multi-omics BMI predictor, which now integrates phenotypes, genetic variation, DNA-methylation, gene expression, and microbiome data to predict BMI. I observed that expression alone can explain 9.0% of the variation in BMI, and after including gene-expression data in the combined predictor, we could explain 26.5% of the total variation in BMI, which represents an increase of 6.2 percentage points over the predictor shown in figure 1. This clearly shows that every additional data layer added to our model makes it possible to predict trait levels more accurately.

I therefore expect that prediction performance will improve further by adding additional datasets, such as more detailed information on phenotype, on physical activity (measured by a wearable device, for example), on food intake or on quality of life. Another avenue would be to generate molecular data on cell types and tissues that are more relevant for the disease or phenotype under investigation. For instance, for inflammatory bowel disease and celiac disease, the most relevant tissue to study would be the intestine, and ideally this should involve single-cell technologies that can provide information on every individual cell that is present in the intestine.

Another strategy would be to develop better predictive models for phenotypes by integrating the omics layers in a different way. In the analyses described in this thesis, we extracted relevant features from each of the omics layers independently: we identified methylation CpG sites that were informative, genes whose expression levels were informative, and individual bacteria that were informative for the trait of interest. We then built models per data layer, which we subsequently integrated at model level to develop a final model based on all the individual predictors.
While this approach works well, as can be seen in the research chapters, it ignores more complex, but potentially informative, features that could be used for phenotype prediction. For instance, we might well find that a certain combination of two variables, or a certain ratio of two variables that have been assayed using different techniques, could be even more informative for a phenotype, but such features will have been missed in our current approaches. Methods that can resolve this issue and that identify such relations, while avoiding overfitting the data, could therefore prove highly valuable to better model and predict complex phenotypes in the future.

Yet another strategy to improve these phenotype predictions would be to use larger sample sizes. In the risk prediction studies (chapters four and eight), we used information from large-scale GWAS meta-analyses and the reported effect sizes of individual SNPs to build our genetic risk predictors. It is expected that the same strategy can be applied to other molecular levels: information obtained from large-scale epigenome-wide association studies (EWAS) or transcriptome-wide association studies (TWAS) could be used to select informative CpG sites and individual genes, along with reported effect sizes, to improve phenotype predictions in smaller multi-omics datasets, such as LifeLines-DEEP. However, with ever-increasing model complexity, cross-validation, or preferably replication, of the models built and the effects identified will be absolutely crucial. This is especially true if we try to generate more sophisticated integrative models, using several multi-omics datasets in one predictor.

Thus, multi-omics prediction of disease phenotypes is likely to be an interesting research avenue for the next few years. First of all, we have shown that these multi-omics predictors work substantially better in predicting BMI than methods that use more limited levels.
An important question is whether such predictors can also help to predict the onset of complex traits like obesity, or of diseases. If this turns out to be the case, it is likely that some of the features that have been selected for the predictor indeed reflect genes that play a causal role in obesity, and which might therefore serve as potential drug targets. Given that we identified specific genes whose expression levels are associated with BMI, and that we could employ Mendelian randomization approaches to infer a causal role [32], we might well be able to correct the expression of these genes by medication, or we might be able to alter the gut microbiome in such a way that we promote or decrease the levels of BMI-relevant microbes.
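
The cross-validation that this section singles out as crucial, and that chapter four used to estimate explained variation, can be illustrated with a minimal sketch. The code below is not the pipeline used in the thesis: it assumes a single hypothetical predictor score per sample and a plain least-squares fit, and the function names are chosen for illustration only.

```python
import random


def fit_simple_linear(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_r2(x, y, k=5, seed=0):
    """Explained variation estimated from out-of-fold predictions only."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint test sets

    predictions = [0.0] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_simple_linear(
            [x[i] for i in train], [y[i] for i in train]
        )
        # Each sample is predicted by a model that never saw it.
        for i in fold:
            predictions[i] = intercept + slope * x[i]

    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy usage: a hypothetical omics score and a trait that depends on it exactly.
score = [float(i) for i in range(20)]
trait = [2.0 * s + 1.0 for s in score]
print(cross_validated_r2(score, trait, k=5))
```

Because every prediction comes from a model fitted without the sample being predicted, the resulting estimate is not inflated by overfitting in the way an in-sample fit can be. Replication in an independent cohort, which the text prefers, remains the stronger check.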

Conclusions

From the work presented in this thesis, we have gained better insight into the downstream effects of genetic risk factors for disease, through integration of different biological omics datasets. We further studied and identified factors which influence or shape the gut microbiome. Finally, we have shown that multi-omics datasets can be used to predict disease phenotypes and that each of these separate molecular levels can be informative. We are now beginning to use a wide-scale multi-omics approach for predicting traits and we expect that larger and more deeply characterized biobank cohorts will make it possible to build prediction algorithms to help diagnose patients earlier and to improve their treatment.

References

1. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–6 (2014).
2. Jostins, L. et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491, 119–124 (2012).
3. Reynisdottir, I. et al. Localization of a susceptibility gene for type 2 diabetes to chromosome 5q34-q35.2. Am. J. Hum. Genet. 73, 323–35 (2003).
4. Wood, A. R. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46, 1173–1186 (2014).
5. Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
6. Weedon, M. N. et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40, 575–583 (2008).
7. Lettre, G. et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat. Genet. 40, 584–591 (2008).
8. Gudbjartsson, D. F. et al. Many sequence variants affecting diversity of adult human height. Nat. Genet. 40, 609–615 (2008).
9. Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat. Genet. 47, 1114–1120 (2015).
10. Purcell, S. Variance Components Models for Gene–Environment Interaction in Twin Analysis. Twin Res. 5, 554–571 (2002).
11. Edwards, S. L., Beesley, J., French, J. D. & Dunning, A. M. Beyond GWASs: illuminating the dark road from association to function. Am. J. Hum. Genet. 93, 779–797 (2013).
12. Zhernakova, D. et al. Hypothesis-free identification of modulators of genetic risk factors. bioRxiv 33217 (2015). doi:10.1101/033217
13. Scholtens, S. et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int. J. Epidemiol. 44, 1172–1180 (2015).
14. Deary, I. J., Gow, A. J., Pattie, A. & Starr, J. M. Cohort profile: the Lothian Birth Cohorts of 1921 and 1936. Int. J. Epidemiol. 41, 1576–1584 (2012).
15. McRae, A. F. et al. Contribution of genetic variation to transgenerational inheritance of DNA methylation. Genome Biol. 15, R73 (2014).
16. Silventoinen, K. et al. Heritability of Adult Body Height: A Comparative Study of Twin Cohorts in Eight Countries. Twin Res. 6, 399–408 (2003).
17. Hemani, G. et al. Inference of the genetic architecture underlying BMI and height with the use of 20,240 sibling pairs. Am. J. Hum. Genet. 93, 865–875 (2013).
18. Heidenreich, P. A. et al. Forecasting the future of cardiovascular disease in the United States: a policy statement from the American Heart Association. Circulation 123, 933–44 (2011).
19. Dogan, M. V. et al. The effect of smoking on DNA methylation of peripheral blood mononuclear cells from African American women. BMC Genomics 15, 151 (2014).

20. Monick, M. M. et al. Coordinated changes in AHRR methylation in lymphoblasts and pulmonary macrophages from smokers. Am. J. Med. Genet. B. Neuropsychiatr. Genet. 159B, 141–151 (2012).
21. Zeevi, D. et al. Personalized Nutrition by Prediction of Glycemic Responses. Cell 163, 1079–1094 (2015).
22. Offit, K. Personalized medicine: new genomics, old lessons. Hum. Genet. 130, 3–14 (2011).
23. Zeevi, D. et al. Personalized Nutrition by Prediction of Glycemic Responses. Cell 163, 1079–1094 (2015).
24. Harper, R. & Reeves, B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 318, 1322–3 (1999).
25. Whitney, A. R. et al. Individuality and variation in gene expression patterns in human blood. Proc. Natl. Acad. Sci. U. S. A. 100, 1896–901 (2003).
26. American Diabetes Association. Economic costs of diabetes in the U.S. in 2012. Diabetes Care 36, 1033–46 (2013).
27. Eberwine, J., Sul, J.-Y., Bartfai, T. & Kim, J. The promise of single-cell sequencing. Nat. Methods 11, 25–27 (2013).
28. Vazquez, A. I. et al. Increased Proportion of Variance Explained and Prediction Accuracy of Survival of Breast Cancer Patients with Use of Whole-Genome Multiomic Profiles. Genetics 203 (2016).
29. Nica, A. C. & Dermitzakis, E. T. Using gene expression to investigate the genetic basis of complex disorders. Hum. Mol. Genet. 17, R129–34 (2008).
30. Teh, A. L. et al. Comparison of Methyl-capture Sequencing vs. Infinium 450K methylation array for methylome analysis in clinical samples. Epigenetics 11, 36–48 (2016).
31. Edfors, F. et al. Gene-specific correlation of RNA and protein levels in human cells and tissues. Mol. Syst. Biol. 1–10 (2016). doi:10.15252/msb.20167144
32. Smith, G. D. Mendelian randomization for strengthening causal inference in observational studies: application to gene x environment interactions. Perspect. Psychol. Sci. 5, 527–545 (2010).

Appendices

Summary

There are many factors involved in the development of human diseases and traits. In recent years the field of human genetics has been very successful in linking genetic variation to diseases and traits. By conducting large-scale studies comparing the genetic make-up of affected versus non-affected participants, we have identified thousands of variants in the human genome that are found more or less often in cases than in controls. These genome-wide association studies (GWAS) have been instrumental in the identification of genes linked to a multitude of diseases and traits. Variants in the functional parts of a gene can be relatively straightforward to interpret. However, not all the variants linked to disease can be directly interpreted. By using intermediate molecular data layers, such as gene expression, DNA-methylation or protein levels, we can gain more insight into the genetic variants identified by GWAS.

However, we only get a limited picture of disease by focusing on genetic variation. Another important factor related to disease is the environment, but it is much harder to quantify environmental factors than to determine the genetic differences between two individuals. Using the intermediate molecular data, or biological omics, we can gain insights into the environment of individuals. The environment surrounding individuals can, for instance, influence the composition of their microbiome, but also their gene expression, DNA-methylation and protein levels. By studying the differences in these biological omics in relation to phenotypes and disease, we can learn more about the environmental factors that lead to disease. However, as with GWAS, we do not always know what the differences in these biological data layers mean.

In this thesis we have focused on two biological omics: the gut microbiome composition and the DNA-methylome. The gut microbiome is the collection of micro-organisms that live together in the human gut; DNA-methylation is the occurrence of a methyl group bound to the DNA, which mainly occurs at cytosine-guanine (CpG) pairs. In the first part of the thesis we have focused on inter-individual differences influencing, or influenced by, differences in the microbiome composition, while in the second part we have focused on changes in DNA-methylation associated with tissue differences and on the influence of genetic variation on DNA-methylation.

The microbiome

There are many factors which influence the gut microbiome. In chapter two we describe a study into the specific influence of a gluten-free diet (GFD) on the human microbiome. A gluten-free diet is the most commonly adopted special diet worldwide. It is the only effective “treatment” for coeliac disease, but is also often adopted by individuals with other gastrointestinal complaints. We found that inter-individual variation in the gut microbiota remained stable during the GFD intervention, although on an individual level the most striking shift in composition was seen for the bacterial family Veillonellaceae. Veillonellaceae is considered to be a pro-inflammatory family of bacteria. It is therefore conceivable that a decrease in Veillonellaceae abundance might be one of the mediators of the GFD’s beneficial effect observed in patients with IBS and gluten-related disorders.

Apart from diet, we also specifically tested for the influence of commonly used drugs on the microbiome; this study is described in chapter three. We found that proton pump inhibitors (PPI), which are among the top ten most widely used drugs in the world, have an even more prominent effect on a population level than antibiotics. PPI use was associated with a significant decrease in microbial diversity and with changes in 20% of the bacterial taxa. Multiple oral bacteria were over-represented in the fecal microbiome of PPI users. The differences between PPI users and non-users seen in this study were consistently associated with changes towards a less healthy gut microbiome.

Since many studies have suggested there is a relation between the gut microbiome and the development of cardiovascular disease, we studied the relation between lipid levels, body mass index (BMI) and the microbiome. In chapter four we report a study that identified 34 bacterial taxa associated with BMI and blood lipids. Furthermore, we built a model to explain differences in lipid and BMI levels between individuals based on both their genetics and their microbiome composition. This showed that the microbiome composition could explain up to 6% of the variance in lipid levels (triglycerides), on top of the variation explained by age, gender and host genetics. Our findings support the potential of therapies that alter the gut microbiome to control body mass, triglycerides and HDL.

Most microbiome studies to date have focused on the abundances of microbial species. However, there are complex interactions between microbes living in the gut. Since looking at species only gives a limited resolution, we used a strategy that looks not only at the abundance of the different microbes, but also at the number of reads per gene present in the microbiome. In chapter five we report on a study into the factors influencing both the microbes and microbial functions. We identified 126 exogenous and intrinsic host factors, including 31 intrinsic factors, 12 diseases, 19 drug groups, 4 smoking categories, and 60 dietary factors, that all influence the microbiome in some way. These factors collectively explain 18.7% of the variance seen in the inter-individual microbial composition.

After identifying the main factors influencing the microbiome, we went on to search specifically for the relation between host genetics and the gut microbiome, as described in chapter six. We assessed the influence of host genetics on microbial species and function, the latter by looking at gene pathways and GO categories, in 1,514 subjects. We identified 41 genomic loci that were significantly linked to a difference in the microbiome composition. In addition, we investigated genomic regions involved in diseases, immunity or food preferences, and found 32 loci that were associated with microbiome composition.

DNA-methylation

In the second part of this thesis we report on the relation between genomic variation and the DNA-methylome, and on factors related to the DNA-methylome. In chapter seven we describe the differences in DNA-methylation and gene expression between fetal and adult liver, and compare the local genetic effects on DNA-methylation and gene expression in adult liver with those in two types of fat and in muscle. When comparing adult versus fetal liver we identified 1,657 differentially methylated genes; these genes were enriched for transcription factor binding sites of HNF1A, HNF4A, GATA1, STAT5A, STAT5B, and YY1. We also identified 2,673 differentially expressed genes; these genes were enriched for metabolic and developmental pathways. When comparing the genetic control across tissues we observed strong liver-specific effects from single nucleotide polymorphisms (SNPs) on both methylation levels (28,447 unique CpG sites) and gene expression levels (526 unique genes).

In chapter eight we report on tests to discover whether DNA-methylation profiles account for the inter-individual variation in BMI and height. We derived methylation predictors for both BMI and height by estimating probe-trait effects in discovery samples and tested them in external samples. Methylation profiles associated with BMI based on the LifeLines-DEEP cohort explained 4.9% and 3.6% of the variation in BMI seen in two other datasets (the Lothian Birth Cohorts and the Brisbane Systems Genetics Study, respectively). Methylation profiles predicted BMI independently of genetic profiles in an additive manner: 5%, 9%, and 13% of the variance of BMI in LifeLines-DEEP subjects were explained, respectively, by the methylation predictor, the genetic predictor, and a model containing both. In contrast, methylation profiles accounted for almost no variation in height. The BMI results suggest that combining genetic and epigenetic information might have greater utility for predicting complex traits.
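The additive comparison of a methylation predictor, a genetic predictor and a combined model can be illustrated with a minimal sketch. This is not the analysis pipeline used in chapter eight: the data are simulated, the weights are assumed to have been estimated in discovery samples already, and the names `profile_score` and `variance_explained` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def profile_score(values, weights):
    """Weighted sum of per-probe (or per-SNP) values; weights are assumed
    to come from effect estimates in separate discovery samples."""
    return values @ weights

def variance_explained(predictors, trait):
    """R^2 of an ordinary least-squares fit of the trait on the predictors."""
    design = np.column_stack([np.ones(len(trait)), predictors])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    residuals = trait - design @ coef
    return 1.0 - residuals.var() / trait.var()

# Simulated toy data (an assumption, not thesis data).
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 50
methylation = rng.normal(size=(n_samples, n_features))
genotypes = rng.integers(0, 3, size=(n_samples, n_features)).astype(float)
meth_weights = rng.normal(scale=0.05, size=n_features)
snp_weights = rng.normal(scale=0.05, size=n_features)

meth_score = profile_score(methylation, meth_weights)
gen_score = profile_score(genotypes, snp_weights)
bmi = 25.0 + meth_score + gen_score + rng.normal(size=n_samples)

r2_meth = variance_explained(meth_score, bmi)
r2_gen = variance_explained(gen_score, bmi)
r2_both = variance_explained(np.column_stack([meth_score, gen_score]), bmi)
print(f"methylation: {r2_meth:.3f}, genetic: {r2_gen:.3f}, both: {r2_both:.3f}")
```

When the two scores are close to uncorrelated, as in this simulation, the variance explained by the combined model is approximately the sum of the two separate values, which is the sense of "additive" used above.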

In chapter nine we describe a study on disease-associated genetic variants and their relation to DNA-methylation. We show that disease variants have widespread effects on distal DNA-methylation sites, and that these likely reflect differential occupancy of trans-binding sites (i.e. sites located far from the genetic risk factor of interest) by cis-regulated transcription factors (located near the genetic risk factor of interest). We found that 1,907 established trait-associated SNPs affect distal methylation levels of 10,141 different CpG sites (false discovery rate < 0.05). These included SNPs that affect both the expression of a nearby transcription factor (such as NFKB1, CTCF and NKX2-3) and the methylation of its binding sites across the genome.

Eight main points from this thesis

1. The gut microbiome is influenced by many exogenous and intrinsic factors.
2. A gluten-free diet has limited but significant effects on microbiome composition.
3. Genetic differences between individuals influence their microbiome composition.
4. DNA-methylation and gene expression in liver differ between adult liver and fetal liver.
5. The genetic control on DNA-methylation varies per tissue type.
6. Genetic variants found to be associated by GWAS have effects on distal methylation sites.
7. Distal effects on methylation work fully or partly by influencing local changes in the gene expression of transcription factors.
8. Using other biological omics datasets, in addition to genomic data, helps in predicting complex phenotypes such as BMI.

Samenvatting (Dutch summary)

There are many factors involved in the development of diseases and phenotypes. In recent years there have been many successes in linking genetic variation to diseases and outward characteristics. By carrying out large-scale studies that compare the genetic variation between affected and unaffected participants, thousands of variants in the human genome have been identified that occur less or more often in cases than in controls. These genome-wide association studies (GWAS) have played a major role in identifying the genes related to diseases and patient characteristics. Variants in the functional parts of a gene can be relatively simple to interpret. However, not all variants linked to disease can be interpreted directly. Through molecular data layers, such as gene expression, DNA-methylation or protein levels, we can gain more insight into the genetic variants found by GWAS.

However, if we focus only on genetic variation in relation to diseases, we get only a limited picture of those diseases. Another important factor in the development of disease is a person's living environment. But environmental factors are much harder to quantify and to compare between two individuals than genetic differences. Using the intermediate molecular data, or biological omics (omics), we can gain insight into the environment of individuals. An individual's environment can, for example, influence the composition of his or her microbiome, but gene expression, DNA-methylation and protein levels can also differ as a result. By studying the differences in these omics in relation to phenotypes and diseases, we can identify the environmental factors that lead to disease. However, as with GWAS, we do not always know what the differences in these biological data layers mean.

In this thesis we focused on two biological omics: the composition of the gut microbiome and the DNA-methylome. The gut microbiome is the collection of micro-organisms that live together in the human gut; DNA-methylation is the occurrence of a methyl group bound to the DNA, which occurs mainly at cytosine-guanine (CpG) pairs. In the first part of this thesis we focused on inter-individual differences in the composition of the microbiome, while in the second part we focused on changes in DNA-methylation associated with tissue differences and on the influence of genetic variation on DNA-methylation.

The microbiome

Many factors influence the gut microbiome. Chapter two describes a study into the specific influence of a gluten-free diet (GFD) on the human microbiome. A gluten-free diet is the most commonly used special diet worldwide. It is the only effective “treatment” for coeliac disease, but it is also often used by people with other intestinal complaints. In the study we found that inter-individual variation in the gut flora remained stable during the GFD intervention; the most striking shift seen within individuals was the changed abundance of the bacterial family Veillonellaceae. The Veillonellaceae family is considered to be a pro-inflammatory family of bacteria. It is therefore conceivable that the reduced presence of this bacterial family is one of the reasons that people who follow a GFD experience it as beneficial.

Besides diet, we also specifically examined the influence of commonly used medication on the microbiome; this is described in chapter three. We found that proton pump inhibitors (PPI), which are among the ten most widely used drugs in the world, have an even more prominent effect at the population level than antibiotics. PPI use was associated with a significant decrease in microbial diversity and with changes in 20% of the bacterial taxa. Multiple oral bacteria are over-represented in the fecal microbiome of PPI users. The differences between PPI users and non-users found in this study are consistently associated with changes towards a less healthy gut microbiome.

Since many studies have suggested a link between the microbiome and the development of cardiovascular disease, we examined the relation between lipid levels, body mass index (BMI) and the microbiome. In chapter four we describe a study in which we found that 34 bacterial taxa are associated with BMI and blood lipids. In addition, we built a model to predict differences in lipid and BMI levels based on both genetic information and microbiome composition. This showed that the composition of the microbiome can explain 6% of the variance in lipid levels (triglycerides), on top of the variation explained by age, sex and human genetics. Our findings support the possibility of influencing BMI and lipids through interventions at the microbiome level.

Most microbiome studies to date have focused on quantifying microbial species, but there are complex interactions between the different microbes in the gut. Because looking at species gives only a limited picture, we used a strategy that looks not only at the abundances of the bacteria, but also at the genes in the bacteria and the abundance of each gene in the microbiome. In chapter five we describe a study of the factors that influence both the microbes and microbial functions. We identified 126 exogenous and intrinsic factors, including 31 intrinsic factors, 12 diseases, 19 medication groups, 4 smoking categories and 60 dietary factors, all of which influence the microbiome. Together these factors explain 18.7% of the variance in inter-individual microbial composition.

After identifying the main factors that influence the microbiome, we searched specifically for the relation between human genetics and the gut microbiome, as described in chapter six. We examined the influence of human genetic variation on microbial species and function, the latter by looking at gene pathways and GO categories, in 1,514 individuals. We identified 41 genomic loci that are significantly related to a difference in microbiome composition. In addition, we examined genomic regions involved in diseases, immunity or food preferences, and found 32 loci that are suggestively associated with the composition of the microbiome.

DNA-methylation

In the second part of this thesis we describe the relation between genomic variation and the DNA-methylome, and factors related to the DNA-methylome. In chapter seven we describe the differences in DNA-methylation and gene expression between fetal and adult livers, and compare the local genetic influence on DNA-methylation and gene expression in adult liver with that in two types of fat and in muscle. We identified 1,657 differentially methylated genes; these genes are enriched for transcription factor binding sites of HNF1A, HNF4A, GATA1, STAT5A, STAT5B and YY1. We also found 2,673 differentially expressed genes; these are enriched for genes involved in metabolism and development. We saw strongly liver-specific effects of genetic variation at both levels: methylation (28,447 unique sites) and gene expression (526 unique genes).

In chapter eight we describe a study in which we related DNA-methylation profiles to inter-individual variation in BMI and height. We used DNA-methylation differences as a predictor of both BMI and height by estimating relations in LifeLines-DEEP individuals and validating these relations in external individuals. DNA-methylation profiles associated with BMI based on the LifeLines-DEEP cohort explain 4.9% and 3.6% of the variation in BMI in the two other datasets, the Lothian Birth Cohorts and the Brisbane Systems Genetics Study. We found that methylation profiles explain BMI independently of genetic profiles in an additive manner: 5%, 9% and 13% of the variance of BMI in LifeLines-DEEP participants can be explained by, respectively, the methylation predictor, the genetic predictor, and a model that uses both methylation and genetic information. In contrast, methylation profiles explain almost no variation in height, which is to be expected since height is a highly heritable trait. The BMI results suggest that combining genetic and epigenetic information can be of great value in predicting complex human phenotypes.

In chapter nine we describe a study of the relation between DNA-methylation and disease-associated genetic variants. We show that disease-associated variants have widespread consequences for DNA-methylation elsewhere in the genome. This is probably caused by differential occupancy of trans-binding sites (i.e. sites located far from the genetic risk factor) by cis-regulated transcription factors (located near the genetic risk factor). For 1,907 disease-associated genetic variants we found that they affect a total of 10,141 different CpG sites. Some of the genetic risk factors affect both the expression of a nearby transcription factor (such as NFKB1, CTCF and NKX2-3) and the methylation of its respective downstream binding sites in the genome.

Eight main points from this thesis

1. The gut microbiome is influenced by many exogenous and intrinsic factors.
2. A gluten-free diet has a limited but significant effect on the composition of the microbiome.
3. Genetic variation influences an individual's microbiome composition.
4. DNA-methylation and gene expression in the liver differ between adult liver and fetal liver.
5. The genetic influence on DNA-methylation differs per tissue.
6. Genetic variants found by GWAS affect distal methylation sites.
7. Distal effects on methylation are partly the result of local differences in the gene expression of transcription factors.
8. Using other biological omics, in addition to genomic data, helps in predicting complex phenotypes such as BMI.

List of abbreviations

16S rRNA    16S ribosomal RNA gene
ADME        absorption, distribution, metabolism and excretion
AUC         area under the curve
BGI         Beijing Genomics Institute
BMI         body mass index
CGI         CpG islands
CVD         cardiovascular disease
ECLIA       electro-chemiluminescence immunoassay
EDTA        ethylenediaminetetraacetic acid
ELISA       enzyme-linked immunosorbent assay
eQTL        expression quantitative trait loci
eQTM        expression quantitative trait methylation
FDR         false discovery rate
FISH        fluorescence in situ hybridization
GC-MS       gas chromatography-mass spectrometry
GFD         gluten-free diet
GWA         genome-wide association
HD          habitual diet
HDL         high-density lipoprotein
HPLC        high performance liquid chromatography
IBD         inflammatory bowel diseases
IBS         irritable bowel syndrome
KEGG        Kyoto Encyclopedia of Genes and Genomes
LDL         low-density lipoprotein
meQTL       methylation quantitative trait loci
NSAID       nonsteroidal anti-inflammatory drug
OTU         operational taxonomic unit
PCoA        principal coordinate analysis
PPI         proton pump inhibitor
qPCR        quantitative real-time polymerase chain reaction
RIA         radioimmunoassay
SAT         subcutaneous adipose tissue
SCFA        short chain fatty acids
SD          standard deviation
SNP         single nucleotide polymorphism
SNRIs       serotonin-norepinephrine reuptake inhibitors
SSRI        serotonin-specific reuptake inhibitors
TC          total cholesterol
TCAs        tricyclic antidepressants
TG          triglycerides
TMAO        trimethylamine N-oxide
TSS         transcription start site
VAT         visceral adipose tissue

Acknowledgments

Dear family, friends and colleagues,

I want to thank all of you for your big or small contribution to my thesis! I really enjoyed the years that I spent in Groningen working and learning, and enjoying my time with all of you, and I am proud of the work presented here. Some of you deserve a personal thank you, but these are by no means the only people who have contributed to this work. Big thanks to everybody I do not mention specifically.

Lude, Sasha and Cisca: Thank you so much for giving me the space to be able to do this work! I really enjoyed working in the lab, before and during my PhD! Thanks for pushing me, all the opportunities you have given me, and all the support and guidance in the various projects! Thanks for creating such a great atmosphere and such a nice group. Without the three of you, this work would not have been possible! I wish you all the best for your future projects.

Dear Patrick: I do not know where to start or how to thank you. Without you, my PhD, Master’s and Bachelor’s periods would have been so different. Thanks for all the great things and everything you have taught me!

Dear Arnau and Floris: The other two microbiome guys, thanks for all the teamwork. I really enjoyed working together in our exploration of the microbiome and, in particular, on the PPI paper. Also, I want to say sorry for not using the correct cover, but please go ahead and use the Telegraaf picture on your thesis.

I would also like to thank all the other members of the Poepgroep, in particular Jing, Alex, Wouter, Rinse, Jackie D and Marten. Thanks for all your help and I enjoyed working with you!

The P?-Medicine platform team (Freerk, Urmo, Niek, Sipko, Annique, Marion, Joeri, Lucas and Adriaan), I no longer know what the amount of P’s is or what they were all supposed to represent, but I would like to thank all of you for your help and the nice atmosphere in the Franke-Swertz group.
Harm-Jan, Juha, Dasha, Isis, Javier, thanks for the warm welcome to the group, the good times and everything you taught me.

Pieter and Morris, thanks for the great help and managing the cluster, lots of this work would not be possible without this!

A big thanks to all the co-authors of the many projects I have had the opportunity to play a part in. Thanks to Yang, Cheng, Vinod, and Sebo and all other people in the group, I enjoyed working with you in the Wijmenga group!

Kate & Jackie S: Thank you so much for your unbelievable help in editing my sometimes horrible texts. Without you, the messages in this thesis would have been much less clear and the text full of “kromme zinnen”.

Mentje, Helene, Bote and Joke: Thanks for taking care of much of the administrative things, which have come along during my period at the genetics department.

Dennis: Thanks for your help with my thesis lay-out and making the awesome cover.

Dear Steven, Papa and Mama: Thanks for all the interest you have shown in my research, even though most of the time it was hard for you to follow what I was doing. And thanks for all the support you have given me.

Dear Suus: Thanks for all the patience you have had, every time when I wanted to finish one more thing and I lost track of the time again. Without you, this thesis would not have been possible! I cannot thank you enough!

Curriculum vitae

Marc Jan Bonder was born on 28 March 1989 in Tolbert, the Netherlands. He finished his Bachelor's degree in bioinformatics at the Hanze University of Applied Sciences in 2010 and received his Master's degree in bioinformatics at the Vrije Universiteit Amsterdam in 2012. After this he moved back to Groningen to do a PhD under the joint supervision of Dr. Alexandra Zhernakova and Prof. Dr. Lude Franke at the Department of Genetics, University of Groningen and University Medical Centre Groningen. He is currently starting a postdoc at the EMBL European Bioinformatics Institute in the lab of Dr. Oliver Stegle.

List of publications

The influence of proton pump inhibitors and other commonly used medication on the gut microbiota. Imhann et al. Gut Microbes, 00-00
The emerging landscape of dynamic DNA methylation in early childhood. CJ Xu et al. BMC Genomics 18 (1), 25
Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. S Wahl et al. Nature
Disease variants alter transcription factor levels and methylation of their binding sites. MJ Bonder & R Luijk et al. Nature Genetics (2016)
The effect of host genetics on the gut microbiome. MJ Bonder & A Kurilshikov et al. Nature Genetics 48 (11), 1407-1412
Identification of context-dependent expression quantitative trait loci in whole blood. DV Zhernakova et al. Nature Genetics
Linking the human gut microbiome to inflammatory cytokine production capacity. M Schirmer et al. Cell 167 (4), 1125-1136.e8
Genome-wide analysis identifies 12 loci influencing human reproductive behavior. N Barban et al. Nature Genetics
Evidence for mitochondrial genetic control of autosomal gene expression. I Kassam et al. Human Molecular Genetics, ddw347
Age-related accrual of methylomic variability is linked to fundamental ageing mechanisms. RC Slieker et al. Genome Biology 17 (1), 191
A GWAS meta-analysis suggests roles for xenobiotic metabolism and ion channel activity in the biology of stool frequency. SA Jankipersadsing et al. Gut, gutjnl-2016-312398, 2016
Blood lipids influence DNA methylation in circulating cells. KF Dekkers et al. Genome Biology 17 (1), 13, 2016
Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. A Zhernakova et al. Science 352 (6285), 565-569, 2016
Population-level analysis of gut microbiome variation. G Falony et al. Science 352 (6285), 560-564, 2016
Tobacco smoking is associated with DNA methylation of diabetes susceptibility genes. S Ligthart et al. Diabetologia 59 (5), 998-1006, 2016

Proton pump inhibitors affect the gut microbiome. F Imhann & MJ Bonder & AV Vila et al. Gut 65 (5), 740-748, 2016
The influence of a short-term gluten-free diet on the human gut microbiome. MJ Bonder & EF Tigchelaar et al. Genome Medicine 8 (1), 1, 2016
Improving Phenotypic Prediction by Combining Genetic and Epigenetic Associations. S Shah & MJ Bonder et al. The American Journal of Human Genetics 97 (1), 75-85, 2015
Gut microbiota composition associated with stool consistency. EF Tigchelaar et al. Gut, gutjnl-2015-310328, 2015
Trans-ancestry genome-wide association study identifies 12 genetic loci influencing blood pressure and implicates a role for DNA methylation. N Kato et al. Nature Genetics 47 (11), 1282-1293, 2015
Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration. P Deelen & MJ Bonder et al. BMC Research Notes 7 (1), 901, 2015
Calling genotypes from public RNA-sequencing data enables identification of genetic variants that affect gene-expression levels. P Deelen & DV Zhernakova et al. Genome Medicine 7 (30), 13, 2015
The Gut Microbiome Contributes to a Substantial Proportion of the Variation in Blood Lipids. J Fu et al. Circulation Research 117 (9), 817-824, 2015
Genetic and epigenetic regulation of gene expression in fetal and adult human livers. MJ Bonder & S Kasela et al. BMC Genomics 15 (1), 1, 2014
Breast Cancer Subtype Specific Classifiers of Response to Neoadjuvant Chemotherapy Do Not Outperform Classifiers Trained on All Subtypes. JJ de Ronde & MJ Bonder et al. PLoS ONE 9 (2), e88551, 2014
Comparing clustering and pre-processing in taxonomy analysis. MJ Bonder et al. Bioinformatics 28 (22), 2891-2897, 2012
The Relation between Oral Candida Load and Bacterial Microbiome Profiles in Dutch Older Adults. EA Kraneveld et al. PLoS ONE 7 (8), e42770, 2012
TaxMan: a server to trim rRNA reference databases and inspect taxonomic coverage. BW Brandt et al. Nucleic Acids Research, 2012
Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. RS Fehrmann et al. PLoS Genet 7 (8), e1002197, 2011
